Part III: Programming in R

30 Lists and Attributes

The next important data type in our journey through R is the list. Lists are quite similar to vectors—they are ordered collections of data, indexable by index number, logical vector, and name (if the list is named). Lists, however, can hold multiple different types of data (including other lists). Suppose we had three different vectors representing some information about the plant Arabidopsis thaliana.

III.5_1_r_86_athal_data

We can then use the list() function to gather these vectors together into a single unit with class "list".

III.5_2_r_87_athal_list
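As a minimal sketch of the two steps just described, the code below builds three vectors and gathers them with list(). The variable names and values are illustrative assumptions, not necessarily those of the original listing.

```r
# Illustrative data about A. thaliana (names and values are assumptions)
species <- c("Arabidopsis", "thaliana")
ecotypes <- c("C24", "Col0", "WS2")
num_chromosomes <- 5

# Gather the three vectors into a single list
athal <- list(species, ecotypes, num_chromosomes)

print(class(athal))    # prints [1] "list"
print(length(athal))   # prints [1] 3
```

Note that the list holds two character vectors and one numeric vector side by side, which a single vector could not do without converting everything to one type.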

Graphically, we might represent this list like so:

Here, the [1] syntax is indicating that the elements of the list are vectors (as in when vectors are printed). Like vectors, lists can be indexed by index vector and logical vector.

III.5_4_r_88_athal_sublist
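A sketch of both indexing styles follows, assuming the same illustrative athal list as before. The second result is stored under a different (hypothetical) name, sublist_lgl, only so the two can be compared.

```r
# Illustrative list (values are assumptions)
athal <- list(c("Arabidopsis", "thaliana"), c("C24", "Col0", "WS2"), 5)

# Select the first and third elements with an index vector
sublist <- athal[c(1, 3)]

# Select the same elements with a logical vector
sublist_lgl <- athal[c(TRUE, FALSE, TRUE)]

print(class(sublist))                    # prints [1] "list"
print(identical(sublist, sublist_lgl))   # prints [1] TRUE
```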

Both of the above assign to the variable sublist a list looking like:

This seems straightforward enough: subsetting a list with an indexing vector returns a smaller list with the requested elements. But this rule can be deceiving if we forget that a vector is the most basic element of data. Because 2 is the length-one vector c(2), athal[2] returns not the second element of the athal list, but rather a length-one list with a single element (the vector of ecotypes).

III.5_6_r_89_athal_eco_list

A graphical representation of this list:

We will thus need a different syntax if we wish to extract an individual element from a list. This alternate syntax is athal[[2]].

III.5_8_r_90_athal_eco_list

If we wanted to extract the second ecotype directly, we would need to use the relatively clunky second_ecotype <- athal[[2]][2], which accesses the second element of the vector (accessed by [2]) inside of the second element of the list (accessed by [[2]]).
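The difference between single and double brackets can be checked directly. This is a small sketch on an assumed athal list; eco_list and eco_vec are hypothetical variable names.

```r
# Illustrative list (values are assumptions)
athal <- list(c("Arabidopsis", "thaliana"), c("C24", "Col0", "WS2"), 5)

eco_list <- athal[2]     # single brackets: a length-one list
eco_vec <- athal[[2]]    # double brackets: the vector itself

print(class(eco_list))   # prints [1] "list"
print(class(eco_vec))    # prints [1] "character"

# Combine both forms to reach inside the vector
second_ecotype <- athal[[2]][2]
print(second_ecotype)    # prints [1] "Col0"
```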

III.5_9_r_91_athal_list_print

When we print a list, this structure and the double-bracket syntax are reflected in the output.

III.5_10_r_92_athal_list_print_out

§ Named Lists, Lists within Lists

Like vectors, lists can be named—associated with a character vector of equal length—using the names() function. We can use an index vector of names to extract a sublist, and we can use [[]] syntax to extract individual elements by name.

III.5_11_r_93_athal_list_name_extract1

We can even extract elements from a list if the name of the element we want is stored in another variable, using the [[]] syntax.

III.5_12_r_94_athal_list_name_extract2
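The following sketch names an assumed athal list and extracts by name in the three ways described. The name "species" is an assumption; "ecotypes" and "# Chromosomes" follow the names used in the text.

```r
# Illustrative list (values are assumptions)
athal <- list(c("Arabidopsis", "thaliana"), c("C24", "Col0", "WS2"), 5)

# Attach one name per element
names(athal) <- c("species", "ecotypes", "# Chromosomes")

# A character index vector returns a sublist
sublist <- athal[c("species", "# Chromosomes")]

# Double brackets return the element itself
ecotypes <- athal[["ecotypes"]]

# The name may also come from another variable
extract_name <- "ecotypes"
ecotypes_again <- athal[[extract_name]]

print(identical(ecotypes, ecotypes_again))   # prints [1] TRUE
```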

As fun as this double-bracket syntax is, because extracting elements from a list by name is a common operation, there is a shortcut using $ syntax.

III.5_13_r_95_athal_list_name_extract3

In fact, if the name doesn’t contain any special characters (spaces, etc.), then the quotation marks can be left off.

III.5_14_r_96_athal_list_name_extract4

This shortcut is widely used and convenient, but, because the quotes are implied, we can’t use $ syntax to extract an element by name if that name is stored in an intermediary variable. For example, if extract_name <- "ecotypes", then athal$extract_name will expand to athal[["extract_name"]], and we won’t get the ecotypes vector. This common error reflects a misunderstanding of the syntactic sugar employed by R. Similarly, the unquoted $ syntax won’t work for names like "# Chromosomes" because that name contains a space and a special character; such a name must be quoted, as in athal$"# Chromosomes" (for this reason, names of list elements are often simplified).
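The pitfall is easy to reproduce. The sketch below assumes a named athal list with illustrative values; wrong and right are hypothetical variable names.

```r
# Illustrative named list (values are assumptions)
athal <- list(c("Arabidopsis", "thaliana"), c("C24", "Col0", "WS2"), 5)
names(athal) <- c("species", "ecotypes", "# Chromosomes")

# Quoted and unquoted $ forms are equivalent for simple names
ecotypes_quoted <- athal$"ecotypes"
ecotypes_bare <- athal$ecotypes

# $ does not evaluate a variable: this looks for an element
# literally named "extract_name", which does not exist
extract_name <- "ecotypes"
wrong <- athal$extract_name
right <- athal[[extract_name]]

print(is.null(wrong))   # prints [1] TRUE
print(right)            # prints [1] "C24"  "Col0" "WS2"

# Names with spaces or special characters need quotes after $
chroms <- athal$"# Chromosomes"
```

When the name lives in a variable, double brackets are the form to reach for.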

Frequently, $ syntax is combined with vector syntax if the element of the list being referred to is a vector. For example, we can directly extract the third ecotype, or set the third ecotype.

III.5_15_r_97_athal_list_name_dollar_vector
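A brief sketch of reading and then replacing the third ecotype, on an assumed list. The replacement value "NewEcotype" is a placeholder, not a real ecotype name.

```r
# Illustrative named list (values are assumptions)
athal <- list(species = c("Arabidopsis", "thaliana"),
              ecotypes = c("C24", "Col0", "WS2"))

# Extract the third ecotype
third_ecotype <- athal$ecotypes[3]
print(third_ecotype)       # prints [1] "WS2"

# Set the third ecotype in place (placeholder value)
athal$ecotypes[3] <- "NewEcotype"
print(athal$ecotypes[3])   # prints [1] "NewEcotype"
```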

Continuing with this example, let’s suppose we have another list that describes information about each chromosome. We can start with an empty list, and assign elements to it by name.

III.5_16_r_98_chrs_list

This list of two elements relates to A. thaliana, so it makes sense to include it somehow in the athal list. Fortunately, lists can contain other lists, so we’ll assign this chrs list as an element of the athal list.

III.5_17_r_99_athal_structure_full
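One way to sketch these two steps is shown below. The element names Lengths, GeneCounts, and ChrInfo follow the text; the numeric values are rough illustrative figures and should not be read as authoritative genome data.

```r
# Illustrative named list (values are assumptions)
athal <- list(species = c("Arabidopsis", "thaliana"),
              ecotypes = c("C24", "Col0", "WS2"))

# Start with an empty list and assign elements by name
chrs <- list()
chrs$Lengths <- c(30, 20, 23, 19, 27)                 # illustrative, in Mbp
chrs$GeneCounts <- c(7078, 4200, 5400, 4100, 6300)    # illustrative counts

# Store the whole chrs list as a single element of athal
athal$ChrInfo <- chrs

print(names(athal))           # prints [1] "species"  "ecotypes" "ChrInfo"
print(class(athal$ChrInfo))   # prints [1] "list"
```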

Lists are an excellent container for general collections of heterogeneous data in a single organized “object.” (These differ from Python objects in that they don’t have methods stored in them as well, but we’ll see how R works with methods in later chapters.) If we ran print(athal) at this point, all this information would be printed, but unfortunately in a fairly unfriendly manner:

III.5_18_r_100_athal_structure_full_print

This output does illustrate something of interest, however. We can chain the $ syntax to access elements of lists and contained lists by name. For example, lengths <- athal$ChrInfo$Lengths extracts the vector of lengths contained in the internal ChrInfo list, and we can even modify elements of these vectors with syntax like athal$ChrInfo$GeneCounts[1] <- 7079 (perhaps a new gene was recently discovered on the first chromosome). Expanding the syntax a bit to use double-brackets rather than $ notation, these are equivalent to lengths <- athal[["ChrInfo"]][["Lengths"]] and athal[["ChrInfo"]][["GeneCounts"]][1] <- 7079.
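The equivalence of the two notations can be verified with a short sketch. The nested list is built directly here, and its numeric values are illustrative assumptions; lengths_brackets is a hypothetical name used only for comparison.

```r
# Illustrative nested list (values are assumptions)
athal <- list(
  ecotypes = c("C24", "Col0", "WS2"),
  ChrInfo = list(Lengths = c(30, 20, 23, 19, 27),
                 GeneCounts = c(7078, 4200, 5400, 4100, 6300))
)

# Chained $ and chained [[ ]] reach the same vector
lengths <- athal$ChrInfo$Lengths
lengths_brackets <- athal[["ChrInfo"]][["Lengths"]]
print(identical(lengths, lengths_brackets))   # prints [1] TRUE

# Either form can also appear on the left side of an assignment
athal$ChrInfo$GeneCounts[1] <- 7079
print(athal[["ChrInfo"]][["GeneCounts"]][1])  # prints [1] 7079
```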

§ Attributes, Removing Elements, List Structure

Lists are an excellent way to organize heterogeneous data, especially when data are stored in a Name → Value association,[1] making it easy to access data by character name. But what if we want to look up some information associated with a piece of data but not represented in the data itself? This would be a type of “metadata,” and R allows us to associate metadata to any piece of data using what are called attributes. Suppose we have a simple vector of normally distributed data:

III.5_19_r_101_attr_norm_vec

Later, we might want to know what type of data this is: is it normally distributed, or something else? We can solve this problem by assigning the term "normal" as an attribute of the data. The attribute also needs a name, which we’ll call "disttype". Attributes are assigned in a fashion similar to names.

III.5_20_r_102_attr_norm_vec_attr

When printed, the output shows the attributes that have been assigned as well.

III.5_21_r_103_attr_norm_vec_attr_out

We can separately extract a given attribute from a data item, using syntax like sample_dist <- attr(sample, "disttype"). Attributes are used widely in R, though they are rarely modified in day-to-day usage of the language.[2]
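Setting and reading an attribute might look like the sketch below. The sample size and the mean and sd arguments to rnorm() are assumptions chosen for illustration.

```r
# A vector of normally distributed data (parameters are assumptions)
sample <- rnorm(10, mean = 20, sd = 10)

# Attach the metadata: attribute name "disttype", value "normal"
attr(sample, "disttype") <- "normal"

# Read the attribute back
sample_dist <- attr(sample, "disttype")
print(sample_dist)           # prints [1] "normal"

# The data themselves are unchanged and still behave as a numeric vector
print(is.numeric(sample))    # prints [1] TRUE
```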

To expand our A. thaliana example, let’s assign a “kingdom” attribute to the species vector.

III.5_22_r_104_athal_attribute

At this point, we’ve built a fairly sophisticated structure: a list containing vectors (one of which has an attribute) and another list, itself containing vectors, with the various list elements being named. If we were to run print(athal), we’d see rather messy output. Fortunately, R includes an alternative to print() called str(), which nicely prints the structure of a list (or other data object). Here’s the result of calling str(athal) at this point.

III.5_23_r_105_athal_str

Removing an element or attribute from a list is as simple as assigning it the special value NULL.

III.5_24_r_106_athal_delete

The printed structure reveals that this information has been removed.

III.5_25_r_107_athal_delete_out
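The sketch below walks through adding the attribute, inspecting the structure, and then removing items with NULL. Which element the original listing removes is not shown here, so removing ChrInfo$Lengths is an assumption, as are the data values and the attribute value "Plantae".

```r
# Illustrative nested list (values are assumptions)
athal <- list(
  species = c("Arabidopsis", "thaliana"),
  ChrInfo = list(Lengths = c(30, 20, 23, 19, 27),
                 GeneCounts = c(7078, 4200, 5400, 4100, 6300))
)

# Attach an attribute to the species vector inside the list
attr(athal$species, "kingdom") <- "Plantae"
str(athal)

# Remove a list element by assigning NULL
athal$ChrInfo$Lengths <- NULL

# Remove an attribute the same way
attr(athal$species, "kingdom") <- NULL
str(athal)
```

Comparing the two str() printouts shows both the Lengths vector and the kingdom attribute disappearing.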

What is the point of all this detailed list making and attribute assigning? It turns out to be quite important, because many R functions return exactly these sorts of complex attribute-laden lists. Consider the t.test() function, which compares the means of two vectors for statistical equality:

III.5_26_r_108_ttest_print

When printed, the result is a nicely formatted, human-readable result.

III.5_27_r_109_ttest_print_out

If we run str(tresult), however, we find the true nature of tresult: it’s a list!

III.5_28_r_110_ttest_str

Given knowledge of this structure, we can easily extract specific elements, such as the p value with pval <- tresult$p.value or pval <- tresult[["p.value"]].
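A self-contained sketch of this extraction is given below. The two samples (samp1 and samp2, hypothetical names) and their parameters are assumptions; set.seed() is used only to make the run repeatable.

```r
set.seed(42)   # repeatable random samples

# Two illustrative samples (names and parameters are assumptions)
samp1 <- rnorm(100, mean = 10, sd = 5)
samp2 <- rnorm(100, mean = 8, sd = 5)

tresult <- t.test(samp1, samp2)

# The result is a list underneath its printing behavior
print(is.list(tresult))   # prints [1] TRUE
print(names(tresult))     # names of the elements available for extraction

# Both forms extract the same p value
pval <- tresult$p.value
pval_brackets <- tresult[["p.value"]]
print(identical(pval, pval_brackets))   # prints [1] TRUE
```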

One final note about lists: vectors (and other types) can be converted into a list with the as.list() function. This will come in handy later, because lists are one of the most general data types in R, and we can use them for intermediary data representations.

III.5_29_r_110_2_as_list
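As a small sketch, converting a named numeric vector with as.list() produces a list with one length-one element per original value. The scores vector and its names are illustrative assumptions.

```r
# Illustrative named vector (values are assumptions)
scores <- c(56.3, 91.7, 87.4)
names(scores) <- c("Student A", "Student B", "Student C")

scores_list <- as.list(scores)

print(is.list(scores_list))   # prints [1] TRUE
print(length(scores_list))    # prints [1] 3

# Each element is now a separate length-one vector, reachable by name
student_b <- scores_list[["Student B"]]
```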

§ Exercises

  1. The following code first generates a random sample called a, and then a sample called response, wherein each element of response is an element of a times 1.5 plus some random noise:

    III.5_30_r_110_3_linear_depend

    Next, we can easily create a linear model predicting values of response from a:

    III.5_31_r_110_4_lm_model

    When printed, the output nicely describes the parameters of the model.

    III.5_32_r_110_5_lm_model_out

    We can also easily test the significance of the parameters with the anova() function (to run an analysis of variance test on the model).

    III.5_33_r_110_6_anova

    The output again shows nicely formatted text:

    III.5_34_r_110_7_anova_out

    From the model, extract the coefficient of a into a variable called a_coeff (which would contain just the number 1.533367 for this random sample).

    Next, from vartest extract the p value associated with the a coefficient into a vector called a_pval (for this random sample, the p value is 2.2e-16).

  2. Write a function called simple_lm_pval() that automates the process above; it should take two parameters (two potentially linearly dependent numeric vectors) and return the p value associated with the first (nonintercept) coefficient.
  3. Create a list containing three random samples from different distributions (e.g., from rnorm(), runif(), and rexp()), and add an attribute for "disttype" to each. Use print() and str() on the list to examine the attributes you added.
  4. Some names can be used with $ notation without quotation marks; if l <- list(values = c(20, 30)), then print(l$values) will print the internal vector. On the other hand, if l <- list("val-entries" = c(20, 30)), then quotation marks are required, as in print(l$"val-entries"). By experimentation, determine at least five different characters that require the use of quotation marks when using $ notation.
  5. Experiment with the is.list() and as.list() functions, trying each of them on both vectors and lists.

  1. R lists are often used like dictionaries in Python and hash tables in other languages, because of this easy and effective Name → Value lookup operation. It should be noted that (at least as of R 3.3), name lookups in lists are not as efficient as name lookups in Python dictionaries or other true hash tables. For an efficient and more idiomatic hash table/dictionary operation, there is also the package hash available for install with install.packages("hash").
  2. For example, the names of a vector are stored as an attribute called “names”—the lines names(scores) <- c("Student A", "Student B", "Student C") and attr(scores, "names") <- c("Student A", "Student B", "Student C") are (almost) equivalent. Still, it is recommended to use specialized functions like names() rather than set them with attr() because the names() function includes additional checks on the sanity of the names vector.