Part III: Programming in R

33 Split, Apply, Combine

Although we touched briefly on writing functions, the discussion has so far been largely about the various types of data used by R and the syntax and functions for working with them. These include vectors of various types, lists, data frames, factors, indexing, replacement, and vector recycling.

At some point won’t we discuss analyzing a meaningful data set? Yes, and that point is now. We’ll continue to work with the gene expression data from chapter 32, “Character and Categorical Data,” after it’s been processed to include the appropriate individual columns using the functions in the stringr library. Here’s the code from the last chapter which pre-processed the input data:

III.8_1_r_152_processed_expressions

As we continue, we’ll likely want to rerun our analysis script in different ways, without having to rerun all of the code above, as it takes a few minutes to run this preprocessing script. One good strategy would be to write this “cleaned” data frame out to a text file using write.table(), and read it back in to an analysis-only script with read.table(). Just for variety, though, let’s use the save() function, which stores any R object (such as a list, vector, or data frame) to a file in a compressed binary format.

III.8_2_r_154_processed_expressions_save

The traditional file extension for such an R-object file is .Rdata. Such data files are often shared among R programmers.

The first few lines of the saved data frame are all readings for the same gene. There are ~360,000 lines in the full data frame, with readings for over 11,000 genes. (The script above, which processes the expr_long_coded.txt file, can be found in the file expr_preprocess.R.) The figure below shows the results of print(head(expr_long_split)):

III.8_3_r_153_processed_expressions_head

Let’s start a new analysis script that loads this data frame before we begin our analysis of it. For this script we may also need the stringr library, and we’re going to include the dplyr library as well (which we assume has been separately installed with install.packages("dplyr") in the interactive console).

III.8_4_r_155_processed_expressions_load

The load() function creates a variable with the same name that was used in save(), and for convenience we’ve created a copy of the data frame with a shorter variable name, expr. While these sorts of R data files are convenient for R users, the benefit of a write.table() approach would be that other users and languages would have more direct access to the data (e.g., in Python or even Microsoft Excel).

This data set represents a multifactor analysis of gene expression[1] in two genotypes of a plant, where a control treatment (water) and a chemical treatment (pesticide) have been applied to each. Further, three tissue types were tested (A, B, and C, for leaf, stem, and root, respectively) and two or three replicates were tested for each combination for statistical power. These data represent microarray readings and have already been normalized and so are ready for analysis.[2] (These data are also anonymized for this publication at the request of the researchers.)

For now, we’ll focus on the genotype and treatment variables. Ignoring other variables, this gives us around 24 expression readings for each gene.

III.8_5_experiment_design_expressions

This sort of design is quite common in gene expression studies. It is unsurprising that gene expression will differ between two genotypes, and that gene expression will differ between two treatments. But genotypes and treatments frequently react in interesting ways. In this case, perhaps the L4 genotype is known to grow larger than the C6 genotype, and the chemical treatment is known to be harmful to the plants. Interestingly, it seems that the chemical treatment is less harmful to the L4 genotype.

The natural question is: which genes might be contributing to the increased durability of L4 under chemical exposure? We can’t simply look for the differences in mean expression between genotypes (with, say, t.test()), as that would capture many genes involved in plant size. Nor can we look for those with mean expression differences that are large between treatments, because we know many genes will react to a chemical treatment no matter what. Even both signals together (left side of the image below) are not that interesting, as they represent genes that are affected by the chemical and are involved with plant size, but may have nothing at all to do with unique features of L4 in the chemical treatment. What we are really interested in are genes with a differential response to the chemical between two the two genotypes (right side of the image below).

Does that mean that we can look for genes where the mean difference is large in one genotype, but not the other? Or within one treatment, but not that other? No, because many of these interesting scenarios would be missed:

For each gene’s expression measurements, we need a classic ANOVA (analysis of variance) model. After accounting for the variance explained by the genotype difference (to which we can assign a p value indicating whether there is a difference) as well as the variance explained by the treatment (which will also have a p value), is the leftover variance—for the interaction/differential response—significant?

§ A Single Gene Test, Formulas

To get started, we’ll run an ANOVA for just a single gene.[3] First, we’ll need to isolate a sub-data frame consisting of readings for only a single ID. We can use the unique() function to get a vector of unique IDs, and then the %in% operator to select rows matching only the first of those.

III.8_9_r_156_extract_one_gene

This data frame contains rows for a single gene and columns for expression, genotype, treatment, and others (we could verify this with a simple print() if needed). We wish to run an ANOVA for these variables, based on a linear model predicting expression values from genotype, treatment, and the interaction of these terms.

The next step is to build a simple linear model using the lm() function, which takes as its only required parameter a “formula” describing how the predictor values should be related to the predicted values. Note the format of the formula: response_vector ~ predictor_vector1 + predictor_vector2 + predictor_vector1 : predictor_vector2, indicating that we wish to determine how to predict values in the response vector using elements from the predictor vectors.

III.8_10_r_157_lm_one_a

If the vectors or factors of interest exist within a data frame (as is the case here), then the lm() function can work with column names only and a data = argument.

III.8_11_r_158_lm_one_b

We can then print the linear model returned with print(lm1), as well as its structure with str(lm1), to see that it contains quite a bit of information as a named list:

III.8_12_r_159_lm_one_print_str

Before we continue, what is this expression ~ genotype + treatment + genotype : treatment “formula”? As usual, we can investigate a bit more by assigning it to a variable and using class() and str().

III.8_13_r_160_formula
III.8_14_r_160_2_formula_out

The output indicates that the class is of type "formula" and that it has an ".Environment" attribute. It’s a bit opaque, but a formula type in R is a container for a character vector of variable names and a syntax for relating them to each other. The environment attribute specifies where those variables should be searched for by default, though functions are free to ignore that (as in the case of data = for lm(), which specifies that the variables should be considered column names in the given data frame). Formulas don’t even need to specify variables that actually exist. Consider the following formula and the all.vars() function, which inspects a formula and returns a character vector of the unique variable names appearing in the formula.

III.8_15_r_161_formula_b

Anyway, let’s return to running the statistical test for the single gene we’ve isolated. The lm1 result contains the model predicting expressions from the other terms (where this “model” is a list with a particular set of elements). To get the associated p values, we’ll need to run it through R’s anova() function.

III.8_16_r_162_gene_one_anova

Printing the result shows that the p values (labeled "Pr(>F)") for the genotype, treatment, and interaction terms are 0.97, 0.41, and 0.56, respectively. If we want to extract the p values individually, we’ll need to first inspect its structure with str(), revealing that the result is both a list and a data frame—unsurprising because a data frame is a type of list. The three p values are stored in the "Pr(>F)" name of the list.

III.8_17_r_163_printout

We can thus extract the p values vector as pvals1 <- anova1$"Pr(>F)"; notice that we must use the quotations to select from this list by name because of the special characters ((, >, and )) in the name. For the sake of argument, let’s store these three values in a data frame with a single row, and column names "genotype", "treatment", and "genotype:treatment" to indicate what the values represent.

III.8_18_r_164_pvals_df

Output:

III.8_19_r_165_pvals_df_print

This isn’t bad, but ideally we wouldn’t be “hard coding” the names of these p values into the column names. The information is, after all, represented in the printed output of the print(anova1) call above. Because anova1 is a data frame as well as a list, these three names are actually the row names of the data frame. (Sometimes useful data hide in strange places!)

So, given any result of a call to anova(), no matter the formula used in the linear model, we should be able to computationally produce a single-row p-values data frame. But we’ve got to be clever. We can’t just run pvals_df1 <- data.frame(pvals1), as this will result in a single-column data frame with three rows, rather than a single-row data frame with three columns. Rather, we’ll first convert the p-values vector into a list with as.list(), the elements of which will become the columns of the data frame because data frames are a type of list. From there, we can assign the rownames() of anova1 to the colnames() of the pvals_df1 data frame.[4]

III.8_21_r_166_pvals_df_2

The programmatically generated output is as follows:

III.8_22_r_167_pvals_df_2_print

This time, there’s an extra column for the residuals, but that’s of little significance.

This whole process—taking a sub–data frame representing expression values for a single gene and producing a single-row data frame of p values—is a good candidate for encapsulating with a function. After all, there are well-defined inputs (the sub–data frame of data for a single gene) and well-defined outputs (the p-values data frame), and we’re going to want to run it several thousand times, once for each gene ID.

III.8_23_r_168_pvals_df_function

To get the result for our expr1 sub–data frame, we’d simply need to call something like pvals_df1 <- sub_df_to_pvals_df(exp1). The next big question is: how are we going to run this function not for a single gene, but for all 11,000 in the data set? Before we can answer that question, we must learn more about the amazing world of functions in R.

§ Exercises

  1. Generate a data frame with a large amount of data (say, one with two numeric columns, each holding 1,000,000 entries). Write the data frame to a text file with write.table() and save the data frame to an R data file with save(). Check the size of the resulting files. Is one smaller than the other? By how much?
  2. The example we’ve covered—generating a small data frame of results (p values) from a frame of data (expression values) by using a function—is fairly complex owing to the nature of the experimental design and the test we’ve chosen. Still, the basic pattern is quite common and very useful.

    A default installation of R includes a number of example data sets. Consider the CO2 data frame, which describes CO2 uptake rates of plants in different treatments ("chilled" and "nonchilled"; see help(CO2) for more detailed information). Here’s the output of print(head(CO2)):
    III.8_24_r_168_2_co2_dataframeThe conc column lists different ambient CO2 concentrations under which the experiment was performed. Ultimately, we will want to test, for each concentration level, whether the uptake rate is different in chilled versus nonchilled conditions. We’ll use a simple t test for this.

    Start by writing a function that takes a data frame with these five columns as a parameter and returns a single-row, single-column data frame containing a p value (reported by t.test()) for chilled uptake rates versus nonchilled uptake rates. (Your function will need to extract two vectors of update rates to provide to t.test(), one for chilled values and one for nonchilled.)

    Next, extract a sub–data frame containing rows where conc values are 1000, and run your function on it. Do the same for a sub–data frame containing conc value of 675.

§ Attack of the Functions!

While R has a dedicated function for computing the interquartile range (the range of data values between the 25th percentile and the 75th), let’s write our own so that we can explore functions more generally. We’ll call it ipercentile_range, to note that it computes an interpercentile range (where the percentiles just happen to be quartiles, for now).

III.8_25_r_169_ipercentile_range

Next, let’s make a list containing some samples of random data.

III.8_26_r_170_three_samples_list

To compute the three interquartile ranges, we could call the function three times, once per element of the list, and store the results in another list.

III.8_27_r_171_three_samples_list_call

Not too bad, but we can do better. Notice the declaration of the function—we’ve used the assignment operator <- to assign to ipercentile_range, just as we do with other variables. In fact, ipercentile_range is a variable! And just as with other data, we can check its class and print it.

III.8_28_r_172_print_function_contents

When it is printed, we see the source code for the function:

III.8_29_r_173_print_function_contents_print

In R, functions are a type of data just like any other. Thus R is an example of a “functional” programming language.[5] One of the important implications is that functions can be passed as parameters to other functions; after all, they are just a type of data. This is a pretty tricky idea. Let’s explore it a bit with the lapply() function. This function takes two parameters: first, a list, and second, a function to apply to each element of the list. The return value is a list representing the collated outputs of each function application. So, instead of using the four lines above to apply ipercentile_range() to each element of samples, we can say:

III.8_30_r_174_lapply_ipercentile_range

The resulting list will have the same three values as above, and the names of the elements in the output list will be inherited from the input list (so sample_irqs$s1 will hold the first range or 5.44, for this random sample anyway).

This is an example of the powerful “split-apply-combine” strategy for data analysis. In general, this strategy involves splitting a data set into pieces, applying an operation or function to each, and then combining the results into a single output set in some way.

When the input and output are both lists, this operation is also known as a “map.” In fact, this paradigm underlies some parallel processing supercomputer frameworks like Google’s MapReduce and the open-source version of Hadoop (although these programs aren’t written in R).

Because lists are such a flexible data type, lapply() is quite useful. Still, in some cases—like this one—we’d prefer the output to be a vector instead of a list. This is common enough to warrant a conversion function called unlist() that extracts all of the elements and subelements of a list to produce a vector.

III.8_32_r_175_unlist

Be warned: because lists are more flexible than vectors, if the list given to unlist() is not simple, the elements may be present in the vector in an odd order and they will be coerced into a common data type.

The lapply() function can also take as input a vector instead of a list, and it is but one of the many “apply” functions in R. Other examples include apply(), which applies a function to each row or column of a matrix,[6] and sapply() (for “simple apply”), which detects whether the return type should be a vector or matrix depending on the function applied. It’s easy to get confused with this diversity, so to start, it’s best to focus on lapply() given its relatively straightforward (but useful) nature.

§ Collecting Parameters with …

The help page for lapply() is interesting. It specifies that the function takes three parameters: X (the list to apply over), FUN (the function to apply), and ..., which the help page details as “optional arguments to FUN.”

III.8_33_r_176_dotdotdot

The ... parameter is common in R, if a bit mysterious at times. This “parameter” collects any number of named parameters supplied to the function in a single unit. This collection can then be used by the function; often they are passed on to functions called by the function.

To illustrate this parameter, let’s modify our ipercentile_range() function to take two optional parameters, a lower and upper percentile from which the range should be computed. While the interquartile range is defined as the range in the data from the 25th to the 75th percentile, our function will be able to compute the range from any lower or upper percentile.

III.8_34_r_177_ipercentile_range_expanded

Now we can call our function as ipercentile_range(sample$s1) to get the default interquartile range of the first sample, or as ipercentile_range(samples$s2, lower = 0.05, upper = 0.95), to get the middle 90th percentile range. Similarly, we have the ability to supply the optional parameters through the ... of lapply(), which will in turn supply them to every call of the ipercentile_range function during the apply step.

III.8_35_r_178_lapply_ipercentile_dotdotdot

This syntax might seem a bit awkward, but something like it is necessary if we hope to pass functions that take multiple parameters as parameters to other functions. This also should help illuminate one of the commonly misunderstood features of R.

§ Functions Are Everywhere

Although mostly hidden from view, most (if not all) operations in R are actually function calls. Sometimes function names might contain nonalphanumeric characters. In these cases, we can enclose the function name in backticks, as in iqr1 <- `ipercentile_range`(samples$s1). (In chapter 27, “Variables and Data,” we learned that any variable name can be enclosed in backticks this way, though it’s rarely recommended.)

What about something like the lowly addition operator, +? It works much like a function: given two inputs (left-hand-side and right-hand-side vectors), it returns an output (the element-by-element sum).

III.8_37_r_179_plus_func_1

In fact, + really is a function that takes two parameters, but to call it that way we need to use backticks.

III.8_38_r_180_plus_func_2

This might be considered somewhat amazing—the “human-readable” version is converted into a corresponding function call behind the scenes by the interpreter. The latter, functional representation is easier for the interpreter to make use of than the human-friendly representation. This is an example of syntactic sugar: adjustments to a language that make it “sweeter” for human programmers. R contains quite a lot of syntactic sugar. Even the assignment operator <- is a function call! The line a <- c(1, 2, 3) is actually syntactic sugar for `<-`("a", c(1, 2, 3)).

Because functions are just a type of data and function names are variables, we can easily reassign new functions to old names. Perhaps a colleague has left her R session open, so we decide to redefine + to mean “to the power of”:

III.8_39_r_181_plus_redefine

Ok, maybe that’s a bit too evil. Another feature of R is that function names that begin and end with % and take two parameters signals that the function can be used as an infix function via syntactic sugar.

III.8_40_r_182_equalish_infix

§ Sugary Data Frame Split-Apply-Combine

Although the basic units of data in R are vectors, and lists are used to store heterogeneous data, data frames are important as a storage mechanism for tabular data. Because lapply() can apply a function to each element of a list, it can also apply a function to each column of a data frame. This fact can be used, for example, to extract all numeric columns from a data frame with an application of lapply() and is.numeric() (the latter returns TRUE when its input is a numeric type; we’ll leave the details as an exercise).

For most tabular data, though, we aren’t interested in independently applying a function to each column. Rather, we more often wish to apply a function to different sets of rows grouped by one or more of the columns. (This is exactly what we want to do with the sub_df_to_pvals() function we wrote for our gene expression analysis.) The dplyr package (install.packages("dplyr")) provides this ability and is both powerful and easy to use, once we get used to its specialized syntactic sugar.

Initially, we’ll cover two functions in the dplyr package, group_by() and do(): group_by() adds metadata (as attributes) to a data frame indicating which categorical columns define groups of rows and returns such a “grouped” data frame, while do() applies a given function to each group of a grouped data frame, and returns a data frame of results with grouping information included.

To illustrate, we’ll first create a simple data frame on which to work, representing samples of fish from one of two different lakes: Green Lake or Detroit Lake.

III.8_41_r_183_fish_df

The nicely formatted printout:

III.8_42_r_184_fish_df_print

First, we’ll group the data frame by the species column using group_by(). This function takes a data frame (or column-named matrix or similar) and a listing of column names (without quotes) that define the groups as additional parameters. As with using the data = parameter for lm(), the function looks within the data frame for the columns specified.

III.8_43_r_185_fish_df_group

Is this data frame different from the original? Yes, but mostly in the attributes/metadata. When printed, we get some information on the “source” (where the data are stored—dplyr can access data from remote databases) and the groups:

III.8_44_r_186_fish_df_group_print

Handily, unlike regular data frames, “grouped” data frames print only the first few rows and columns, even if they are many thousands of rows long. Sometimes this is an undesirable feature—fortunately, running data.frame() on a grouped data frame returns an ungrouped version. The class() of a grouped data frame returns "data.frame" as well as "tbl", "tbl_df", and "grouped_df". Because a grouped data frame is also a regular data frame (and is also a list!), we can still do all of the fancy indexing operations on them covered in previous chapters.

The do() function applies a function to each group of rows of a grouped data frame using a split-apply-combine strategy. The function applied must take as its first parameter a data frame, and it must return a data frame. Let’s write a function that, given a data frame or sub-data frame (group) of the fish data, returns a single-row data frame with columns for mean_weight and sd_weight. Both the mean() and sd() functions can take an na.rm argument to strip out any potential NA values in their inputs, so perhaps our function should take a similar optional parameter.

III.8_45_r_187_mean_sd_weight

This function takes as input a data frame with a column for weight and returns a data frame with summary statistics. We can run it on the original data frame:

III.8_46_r_187_2_mean_sd_weight_run
III.8_47_r_187_3_mean_sd_weight_run_out

Alternatively, we can call do() on the grouped data frame, telling it to run mean_sd_weight() on each group (sub-data frame). The syntax for do() differs slightly from that of lapply() in that we specify a . for the positional argument representing the sub-data frame.

III.8_48_r_187_mean_sd_weight_do

The result of the do() is another grouped data frame, made up of the rows returned by the applied function. Notice that the grouping columns have been added, even though we didn’t specify them in the ret_df inside the function.

III.8_49_r_188_mean_sd_weight_do_out

When developing functions that work with do(), you might run into an error like Error: Results are not data frames at positions: 1, 2. This error message indicates that the function is not returning a data frame type, which is required for do(). To diagnose problems like this, you can add print() statements inside of the function to inspect the contents of variables as the function is applied.

III.8_50_dplyr_do

We can also group data frames by multiple columns, resulting in a single group per combination of column entries.[7]

III.8_51_r_189_two_col_group_do
III.8_52_r_190_two_col_group_do_out

The NA values for some standard deviations are the result of calling sd() on a vector of length one (because there was only one trout measurement per lake).

Although the applied function must take a data frame and return a data frame, there are no restrictions on the nature of the returned data frame. Here our function returns a single-row data frame, but it could return multiple rows that would be stitched together in the combine step. As an example, here’s a function that, given a data frame, computes the mean() of the weight column, and subtracts that mean from all of the entries, returning the modified data frame (so-called “mean normalization” of the data).

III.8_53_r_191_mean_normalize

And then we can easily mean-normalize the data on a per group basis!

III.8_54_r_192_mean_normalize_do
III.8_55_r_193_mean_normalize_do

In the above output, -1.15 and 1.15 are the deviations from the mean of the trout group, and the others are deviations from the mean for the bass group.

§ More Sugar, Optional Parameters, Summarize

Something to note about the call to do() is that it differs syntactically from the call to lapply(). In do(), we specify not only the function to apply to each group, but also how that function will be called, using . to denote the input group data frame. This is somewhat clearer when we want to specify optional arguments to the applied function. In this case, we may want to specify that NA values should be removed by setting remove_nas = TRUE in each call to mean_sd_weight().

III.8_56_r_194_do_optional_param

Speaking of syntactic sugar, the magrittr package (which is installed and loaded along with dplyr, though written by a different author) provides an interesting infix function, %>%. Consider the common pattern above; after the creation of a data frame (fish), we run it through a function to create an intermediary result (fish_by_species from group_by()), run that through another function to get another result (stats_by_species from mean_sd_weight()), and so on. While this process is clear, we could avoid the creation of multiple lines of code if we were willing to nest some of the function calls.

III.8_57_r_195_do_nested

This line of code certainly suffers from some readability problems. The infix %>% function supplies its left-hand side to its right-hand side (located by . in the right-hand side), returning the result of the call to the right-hand function. Thus we can rewrite the above as follows.

Notice the similarity between this and method chaining in Python and stdin/stdout pipes on the command line. (It’s also possible to omit the . if it would be the first parameter, as in fish %>% group_by(species) %>% do(mean_sd_weight(.)).)

We demonstrated that functions applied by do() can return multiple-row data frames (in the mean-normalization example), but our mean_sd_weight() function only returns a single-row data frame with columns made of simple summary statistics. For this sort of simplified need, the dplyr package provides a specialized function called summarize(), taking a grouped data frame and other parameters indicating how such group-wise summary statistics should be computed. This example produces the same output as above, without the need for the intermediary mean_sd_weight() function.

III.8_59_r_196_summarize

The dplyr package provides quite a few other features and functions to explore, but even the simple combination of group_by() and do() represent a powerful paradigm. Because R is such a flexible language, the sort of advanced syntactic sugar used by dplyr is likely to become more common. Although some effort is required to understand these domain-specific languages (DSLs) and how they relate to “normal” R, the time spent is often worth it.

§ Exercises

  1. The primary purpose of lapply() is to run a function on each element of a list and collate the results into a list. With some creativity, it can also be used to run an identical function call a large number of times. In this exercise we’ll look at how p values are distributed under the “null” model—when two random samples from the same distribution are compared.

    Start by writing a function null_pval() that generates two random samples with rnorm(100, mean = 0, sd = 1), compares them with t.test(), and returns the p value. The function should take a single parameter, say, x, but not actually make any use of it.

    Next, generate a list of 10,000 numbers with 10k_nums_list <- as.list(seq(1,10000)), and call 10k_pvals_list <- lapply(10k_nums, null_pval). Turn this into a vector with 10k_pvals_vec <- unlist(10k_pvals_list) and inspect the distribution with hist(10k_pvals_vec).

    What does this test reveal? What is the code you wrote doing, and why does it work? Why does the function need to take an x parameter that isn’t even used? What happens if you change one of the random samples to use rnorm(100, mean = 0.1, sd = 1)?

  2. If df is a data frame, using [] indexing treats it like a list of elements (columns). Write a function called numeric_cols() that takes a data frame as a parameter, and returns a version of the data frame with only the numeric columns kept. You may want to make use of lapply(), unlist(), as.numeric(), and the fact that lists (and data frames when treated like lists) can be indexed by logical vector.

    As an example, if df1 <- data.frame(id = c("PRQ", "XL2", "BB4"), val = c(23, 45.6, 62)), then print(numeric_cols(df1)) should print a data frame with only the val column. If df2 <- data.frame(srn = c(461, 514), name = c("Mel", "Ben"), age = c(27, 24)), then print(numeric_cols(df2)) should print a data frame with only srn and age columns.

  3. Write a function called subset_rows() that takes two parameters: first, a data frame df, and second, an integer n. The function should return a data frame consisting of n random rows from df (you may find the sample() function of use).

    The iris data frame is a commonly-referenced built-in R data set, describing measurements of petals for a variety of iris species. Here’s the output of print(head(iris)):
    III.8_60_r_196_2_iris_summaryUse group_by() to group the iris data frame by the Species column, and then do() along with your subset_rows() function to generate a data frame consisting of 10 random rows per species.

  4. Interestingly, there’s nothing stopping a function called by do() (or lapply()) from itself calling do() (and/or lapply()). Write a set of functions that compares every species in the iris data frame to every other species, reporting the mean difference in Petal.Width. (In this printout the difference in means compares SpeciesA to SpeciesB.)
    III.8_61_r_196_3_double_do To accomplish this, you will want to write a function compare_petal_widths() that takes two sub-data frames, one containing data for species A (sub_dfa) and the other for species B (sub_dfb), and returns a data frame containing the name of the A species, the name of the B species, and the difference of mean Petal.Width. You can test your function by manually extracting data frames representing setosa and versicolor.

    Next, write a function one_vs_all_by_species() that again takes two parameters; the first will be a sub-data frame representing a single species (sub_df), but the second will be a data frame with data for all species (all_df). This function should group all_df by species, and use do() to call compare_petal_widths() on each group sending along sub_df as a secondary parameter. It should return the result.

    Finally, a function all_vs_all_by_species() can take a single data frame df, group it by species, and call one_vs_all_by_species() on each group, sending along df as a secondary parameter, returning the result. In the end, all that is needed is to call this function on iris.

§ Gene Expression, Finished

With the discussion of split-apply-combine and dplyr under our belts, let’s return to the task of creating and analyzing a linear model for each ID in the gene expression data set. As a reminder, we had left off having read in the “cleaned” data set, extracting a sub-data frame representing a single ID, and writing a function that takes such a sub-data frame and returns a single-row data frame of p values. (It should now be clear why we went through the trouble of ensuring our function takes a data frame as a parameter and returns one as well.)

III.8_62_r_197_gene_exp_review

Now, we can use group_by() on the expr data frame to group by the id column, and do() to apply the sub_df_to_pvals_df() function to each group. Rather than work on the entire data set, though, let’s create a expr10 to hold a data frame representing measurements for 10 IDs; if we are satisfied with the results, we can always instead analyze the full expr table (though the full data set takes only a couple of minutes to analyze).

III.8_63_r_198_gene_exp_group_by_do

The result is a nicely organized table of p values for each gene in the data set:

III.8_64_r_199_gene_exp_group_by_do_out

There is one more important issue to consider for an analysis like this: multiple test correction. Suppose for a moment that none of the ~11,000 genes are differentially expressed in any way. Because p values are distributed evenly between zero and one under the null hypothesis (no difference), for the genotype column alone we could expect ~11,000 * 0.05 = 550 apparently (but erroneously) significant differences. Altering the scenario, suppose there were about 50 genes that truly were differentially expressed in the experiment: these would likely have small p values, but it is difficult to separate them from the other 550 apparently significant values.

There are a variety of solutions to this problem. First, we could consider a much smaller threshold for significance. What threshold should that be? Well, if we wanted the number of apparently—but not really—significant p values to be small (on average), say, 0.05, we could solve for 11,000 * α = 0.05, suggesting a cutoff of α = 0.000004545. This is known as Bonferroni correction. The trouble is, this is a conservative correction, and it’s likely that even our 50 significant genes won’t pass that small threshold.

The reason Bonferroni correction is so conservative is that we’ve specified that we’ll only accept the average number of false positives to be 0.05, which is quite small. We might alternatively say we would accept 10 false positives (and solve for 11,000 * α = 10); if the number of kept results is 100, that’s only a 10% false discovery rate (FDR), and so we can expect 90% of the kept results to be real (but we won’t know which are which!). On the other hand, if the number of kept results ends up being 15, that’s a 75% false discovery rate, indicating that we can expect most of the kept results to be false positives.

Instead of specifying the number of false positives we’re willing to accept, we can instead specify the rate and deal with whatever number of kept results come out. There are several of these “FDR-controlling” methods available, and some make more assumptions about the data than others. For example, the methods of Benjamini and Hochberg (described in 1995 and known by the acronym “BH”) and Benjamini, Hochberg, and Yekutieli (from 2001, known as “BY”) differ in the amount of independence the many tests are assumed to have. The BY method assumes less independence between tests but generally results in more results for a given FDR rate. (Consult your local statistician or statistics textbook for details.)

The p.adjust() function allows us to run either the Bonferroni, BH, or BY correction methods. It takes two arguments: first, a vector of p values to adjust, and, second, a method = parameter that can be set to "bonferroni", "BH", or "BY". (There are other possibilities, but these three are the most commonly used.) Returned is a vector of the same length of adjusted FDR values—selecting all entries with values less than Q is equivalent to setting an FDR rate cutoff of Q.

In the final analysis of all genes, we will produce a BY-adjusted column for the interaction p values, and then select from the data set only those rows where the FDR rate is less than 0.05. (The full analysis script is located in the file expr_analyze.R.)

III.8_65_r_200_full_p_adjust

Unfortunately (or fortunately, depending on how much data one hopes to further analyze), only one gene is identified as having a significant interaction after BY correction in this analysis. (If 100 were returned, we would expect 5 of them to be false positives due to the 0.05 FDR cutoff used.)

III.8_66_r_201_full_p_adjust_out

For grouped data frames, print() will omit printing some columns if they don’t fit in the display window. To see all the columns (including our new interaction_BY column), we can convert the grouped table back into a regular data frame and print the head() of it with print(head(data.frame(pvals_interaction_sig))). For the curious, the interaction_BY value of this ID is 0.048.

§ Exercises

  1. We spent a fair amount of work ensuring that the data frame generated and returned by sub_df_to_pvals_df() was programmatically generated. Take advantage of this fact by making the formula a secondary parameter of the function, that is, as function(sub_df, form), so that within the function the linear model can be produced as lm1 <- lm(form, data = sub_df).

    Next, run the analysis with group_by() and do(), but specify some other formula in the do() call (perhaps something like expression ~ genotype + treatment + tissue).

  2. In a previous exercise in this chapter, we wrote a function that took a portion of the CO2 data frame and returned a data frame with a p value for a comparison of CO2 uptake values between chilled and nonchilled plants. Now, use group_by() and do() to run this test for each conc group, producing a data frame of p values. Use p.adjust() with Bonferroni correction to add an additional column of multiply-corrected values.
  3. While functions called by do() must take a data frame as their first parameter and must return a data frame, they are free to perform any other actions as well, like generating a plot or writing a file. Use group_by(), do(), and write.table() to write the contents of the CO2 data frame into seven text files, one for each conc level. Name them conc_95.txt, conc_175.txt, and so on. You may need paste() or str_c() (from the stringr library) to generate the file names.

  1. Here we’ll be using the colloquial term “gene” to refer to a genetic locus producing messenger RNA and its expression to be represented by the amount of mRNA (of the various isoforms) present in the cell at the sampling time.
  2. Microarrays are an older technology for measuring gene expression that have been largely supplanted by direct RNA-seq methods. For both technologies, the overall levels of expression vary between samples (e.g., owing to measurement efficiency) and must be modified so that values from different samples are comparable; microarray data are often \log_2 transformed as well in preparation for statistical analysis. Both steps have been performed for these data; we’ve avoided discussion of these preparatory steps as they are specific to the technology used and an area of active research for RNA-seq. Generally, RNA-seq analyses involve work at both the command line for data preprocessing and using functions from specialized R packages.
  3. There are many interesting ways we could analyze these data. For the sake of focusing on R rather than statistical methods, we’ll stick to a relatively simple per gene 2x2 ANOVA model.
  4. For almost all data type conversions, R has a default interpretation (e.g., vectors are used as columns of data frames). As helpful as this is, it sometimes means nonstandard conversions require more creativity. Having a good understanding of the basic data types is thus useful in day-to-day analysis.
  5. Not all programming languages are “functional” in this way, and some are only partially functional. Python is an example of a mostly functional language, in that functions defined with the def keyword are also data just like any other. This book does not highlight the functional features of Python, however, preferring instead to focus on Python’s object-oriented nature. Here, R’s functional nature is the focus, and objects in R will be discussed only briefly later.
  6. Because the apply() function operates on matrices, it will silently convert any data frame given to it into a matrix. Because matrices can hold only a single type (whereas data frames can hold different types in different columns), using apply() on a data frame will first force an autoconversion so that all columns are the same type. As such, it is almost never correct to use apply() on a data frame with columns of different types.
  7. Because group_by() and similar functions take “bare” or “unquoted” column names, they can be difficult to call given a character vector. For functions in the dplyr and tidyr packages (covered later), versions with _ as part of the function name support this. For example, fish_by_species_lake <- group_by(fish, species, lake) can be replaced with fish_by_species_lake <- group_by_(fish, c("species", "lake")).