Part III: Programming in R

31 Data Frames

In chapter 28, “Vectors,” we briefly introduced data frames as storing tables of data. Now that we have a clear understanding of both vectors and lists, we can easily describe data frames. (If you hastily skipped chapters 28 and 30 to learn about data frames, now’s the time to return to them!) Data frames are essentially named lists, where the elements are vectors representing columns. But data frames provide a few more features than simple lists of vectors. They ensure that the component column vectors are always the same length, and they allow us to work with the data by row as well as by column. Data frames are some of the most useful and ubiquitous data types in R.

While we’ve already covered using the read.table() function to produce a data frame based on the contents of a text file, it’s also possible to create a data frame from a set of vectors.


When printed, the contents of the column vectors are displayed neatly, with the column names along the top and row names along the left-hand side.

As with read.table(), the data.frame() function takes an optional stringsAsFactors argument, which specifies whether character vectors (like ids) should be converted to factor types (we’ll cover these in detail later). For now, we’ll disable this conversion.

Running str(gene_info) reveals the data frame’s list-like nature:


Like elements of lists, the columns of data frames don’t have to have names, but not having them is uncommon. Most data frames get column names when they are created (either by read.table() or data.frame()), and if unset, they usually default to V1, V2, and so on. The column names can be accessed and set with the names() function, or with the more appropriate colnames() function.


To highlight the list-like nature of data frames, we can work with data frames by column much like lists by element. The three lines in the following example all result in sub_info being a two-column data frame.


An expression like gene_info[2] thus would not return a numeric vector of lengths, but rather a single-column data frame containing the numeric vector. We can use [[]] syntax and $ syntax to refer to the vectors contained within data frames as well (the latter is much more common).


We can even delete columns of a data frame by setting the element to NULL, as in gene_info$lengths <- NULL.

The real charm of data frames is that we can extract and otherwise work with them by row. Just as data frames have column names, they also have row names: a character vector of the same length as each column. Unfortunately, by default, the row names are "1", "2", "3", and so on, but when the data frame is printed, the quotation marks are left off (see the result of print(gene_info) above). The row names are accessible through the rownames() function.

Data frames are indexable using an extended [] syntax: [<row_selector>, <column_selector>], where <row_selector> and <column_selector> are vectors. Just as with vectors and lists, these indexing/selection vectors may be integers (to select by index), characters (to select by name), or logicals (to select logically). Also as with vectors, when indexing by index position or name, the requested order is respected.


Here’s the resulting output, illustrating that "3" and "1" were the row names, which now occur at the first and second row, respectively. [1]


If you find this confusing, consider what would happen if we first assigned the row names of the original data frame to something more reasonable, before the extraction.


Now, when printed, the character nature of the row names is revealed.


Finally, if one of <row_selector> or <column_selector> are not specified, then all rows or columns are included. As an example, gene_info[c(3,1), ] returns a data frame with the third and first rows and all three columns, while gene_info[, c("lengths", "ids")] returns one with only the "lengths" and "ids" columns, but all rows.

§ Data Frame Operations

Because data frames have much in common with lists and rows—and columns can be indexed by index number, name, or logical vector—there are many powerful ways we can manipulate them. Suppose we wanted to extract only those rows where the lengths column is less than 200, or the gcs column is less than 0.3.


This syntax is concise but sophisticated. While gene_info$lengths refers to the numeric vector named "lengths" in the data frame, the < logical operator is vectorized, with the single element 200 being recycled as needed. The same process happens for gene_info$gcs < 0.3, and the logical-or operator | is vectorized, producing a logical vector later used for selecting the rows of interest. An even shorter version of these two lines would be selected <- gene_info[gene_info$lengths < 200 | gene_info$gcs < 0.3, ]. The printed output:


If we wished to extract the gcs vector from this result, we could use something like selected_gcs <- selected$gcs. Sometimes more compact syntax is used, where the $ and column name are appended directly to the [] syntax.


Alternatively, and perhaps more clearly, we can first use $ notation to extract the column of interest, and then use [] logical indexing on the resulting vector.


Because subsetting data frame rows by logical condition is so common, there is a specialized function for this task: subset(). The first parameter is the data frame from which to select, and later parameters are logical expressions based on column names within that data frame (quotation marks are left off). For example, selected <- subset(gene_info, lengths < 200 | gcs < 0.3). If more than one logical expression is given, they are combined with & (and). Thus subset(gene_info, lengths < 200, gcs < 0.3) is equivalent to gene_info[gene_info$lengths < 200 & gene_info$gcs < 0.3 ,].

While the subset() function is convenient for simple extractions, knowing the ins and outs of [] selection for data frames as it relates to lists and vectors is a powerful tool. Consider the order() function, which, given a vector, returns an index vector that can be used for sorting. Just as with using order() to sort a vector, we can use order() to sort a data frame based on a particular column.


The result is a data frame ordered by the lengths column:


Because data frames force all column vectors to be the same length, we can create new columns by assigning to them by name, and relying on vector recycling to fill out the column as necessary. Let’s create a new column called gc_categories, which is initially filled with NA values, and then use selective replacement to assign values "low" or "high" depending on the contents of the gcs column.


While there are more automated approaches for categorizing numerical data, the above example illustrates the flexibility and power of the data frame and vector syntax covered so far.


One final note: while the head() function returns the first few elements of a vector or list, when applied to a data frame, it returns a similar data frame with only the first few rows.

§ Matrices and Arrays

Depending on the type of analysis, you might find yourself working with matrices in R, which are essentially two-dimensional vectors. Like vectors, all elements of a matrix must be the same type, and attempts to mix types will result in autoconversion. Like data frames, they are two dimensional, and so can be indexed with [<row_selector>, <column_selector>] syntax. They also have rownames() and colnames().


There are a number of interesting functions for working with matrices, including det() (for computing the determinant of a matrix) and t() (for transposing a matrix).

Arrays generalize the concept of multidimensional data; where matrices have only two dimensions, arrays may have two, three, or more. The line a <- array(c(1,1,1,1,2,2,2,2,3,3,3,3), dim = c(2,2,3)), for example, creates a three-dimensional array, composed of three two-by-two matrices. The upper left element can be extracted as a[1,1,1].

§ Exercises

  1. Read the states.txt file into a data frame, as discussed in Chapter 28, Vectors. Extract a new data frame called states_name_pop containing only the columns for name and population.
  2. Using subset(), extract a data frame called states_gradincome_high containing all columns, and all rows where the income is greater than the median of income or where hs_grad is greater than the median of hs_grad. (There are 35 such states listed in the table.) Next, do the same extraction with [<row_selector>, <column_selector>] syntax.
  3. Create a table called states_by_region_income, where the rows are ordered first by region and second by income. (The order() function can take multiple parameters as in order(vec1, vec2); they are considered in turn in determining the ordering.)
  4. Use order(), colnames(), and [<row_selector>,<column_selector>] syntax to create a data frame states_cols_ordered, where the columns are alphabetically ordered (i.e., hs_grad first, then income, then murder, then name, and so on).
  5. The nrow() function returns the number of rows in a data frame, while rev() reverses a vector and seq(a,b) returns a vector of integers from a to b (inclusive). Use these to produce states_by_income_rev, which is the states data frame but with rows appearing in reverse order of income (highest incomes on top).
  6. Try converting the states data frame to a matrix by running it through the as.matrix() function. What happens to the data, and why do you think that is?

  1. Unfortunately, when printed, quotation marks are left off of row names, commonly leading programmers to think that the row indexes are what is being displayed. Similarly, quotation marks are left off of columns that are character vectors (like the ids column in this example), which can cause confusion if the column is a character type with elements like "1", "2", and so on, potentially leading the programmer to think the column is numeric. In a final twist of confusion, using [<row_selector>, <column_selector>] syntax breaks a common pattern: [] indexing of a vector always returns a vector, [] indexing of a list always returns a list, and [<row_selector>, <column_selector>] indexing of a data frame always returns a data frame, unless it would have only one column, in which case the column/vector is returned. The remedy for this is [<row_selector>, <column_selector>, drop = FALSE], which instructs R not to “drop” a dimension.