Part III: Programming in R

28 Vectors

Vectors (similar to single-type arrays in other languages) are ordered collections of simple types, usually numerics, integers, characters, or logicals. We can create vectors using the c() function (for concatenate), which takes as parameters the elements to put into the vector:

III.3_1_r_18_numeric_vector

The c() function can take other vectors as parameters, too—it will “deconstruct” all subvectors and return one large vector, rather than a vector of vectors.

III.3_2_r_19_c_deconstruct

We can extract individual elements from a vector using [] syntax; though note that, unlike many other languages, the first element is at index 1.

III.3_3_r_20_bracket_extract

The length() function returns the number of elements of a vector (or similar types, like lists, which we’ll cover later) as an integer:

III.3_4_r_21_vec_length

We can use this to extract the last element of a vector, for example.

§ No “Naked Data”: Vectors Have (a) Class

So far in our discussion of R’s data types, we’ve been making a simplification, or at least we’ve been leaving something out. Even individual values like the numeric 4.6 are actually vectors of length one. Which is to say, gc_content <- 0.34 is equivalent to gc_content <- c(0.34), and in both cases, length(gc_content) will return 1, which itself is a vector of length one. This applies to numerics, integers, logicals, and character types. Thus, at least compared to other languages, R has no “naked data”; the vector is the most basic unit of data that R has. This is slightly more confusing for character types than others, as each individual element is a string of characters of any length (including potentially the “empty” string "").

III.3_6_r_22_2_char_vec

This explains quite a lot about R, including some curiosities such as why print(gc_content) prints [1] 0.34. This output is indicating that gc_content is a vector, the first element of which is 0.34. Consider the seq() function, which returns a vector of numerics; it takes three parameters:[1] (1) the number at which to start, (2) the number at which to end, and (3) the step size.

III.3_7_r_23_seq_range

When we print the result, we’ll get output like the following, where the list of numbers is formatted such that it spans the width of the output window.

III.3_8_r_24_seq_range_out

The numbers in brackets indicate that the first element of the printed vector is 1.0, the sixteenth element is 8.5, and the thirty-first element is 16.0.

By the way, to produce a sequence of integers (rather than numerics), the step-size argument can be left off, as in seq(1,20). This is equivalent to a commonly seen shorthand, 1:20.

If all of our integers, logicals, and so on are actually vectors, and we can tell their type by running the class() function on them, then vectors must be the things that we are examining the class of. So, what if we attempt to mix types within a vector, for example, by including an integer with some logicals?

III.3_9_r_25_mix_class1

Running print(class(mix)) will result in "integer". In fact, if we attempt to print out mix with print(mix), we’d find that the logicals have been converted into integers!

III.3_10_r_26_mix_class1_out

R has chosen to convert TRUE into 1 and FALSE into 0; these are standard binary values for true and false, whereas there is no standard logical value for a given integer. Similarly, if a numeric is added, everything is converted to numeric.

III.3_11_r_27_mix_class2

And if a character string is added, everything is converted into a character string (with 3.5 becoming "3.5", TRUE becoming "TRUE", and so on).

III.3_12_r_28_mix_class3

In summary, vectors are the most basic unit of data in R, and they cannot mix types—R will autoconvert any mixed types in a single vector to a “lowest common denominator,” in the order of logical (most specific), integer, numeric, character (most general). This can sometimes result in difficult-to-find bugs, particularly when reading data from a file. If a file has a column of what appears to be numbers, but a single element cannot be interpreted as a number, the entire vector may be converted to a character type with no warning as the file is read in. We’ll discuss reading data in from text files after examining vectors and their properties.

§ Subsetting Vectors, Selective Replacement

Consider the fact that we can use [] syntax to extract single elements from vectors:

III.3_13_r_29_extract_vec1

Based on the above, we know that the 20 extracted is a vector of length one. The 2 used in the brackets is also a vector of length one; thus the line above is equivalent to second_el <- nums[c(2)]. Does this mean that we can use longer vectors for extracting elements? Yes!

III.3_14_r_30_extract_vec2

In fact, the extracted elements were even placed in the resulting two-element vector in the order in which they were extracted (the third element followed by the second element). We can use a similar syntax to selectively replace elements by specific indices in vectors.

III.3_15_r_31_selective_replace_index

Selective replacement is the process of replacing selected elements of a vector (or similar structure) by specifying which elements to replace with [] indexing syntax combined with assignment <-.[2]

R vectors (and many other data container types) can be named, that is, associated with a character vector of the same length. We can set and subsequently get this names vector using the names() function, but the syntax is a little odd.

Named vectors, when printed, display their names as well. The result from above:

III.3_17_r_33_named_vec1_out

Named vectors may not seem that helpful now, but the concept will be quite useful later. Named vectors give us another way to subset and selectively replace in vectors: by name.

III.3_18_r_34_named_vec_set

Although R doesn’t enforce it, the names should be unique to avoid confusion when selecting or selectively replacing this way. Having updated Student A’s and Student B’s score, the change is reflected in the output:

III.3_19_r_35_named_vec_set_out

There’s one final and extremely powerful way of subsetting and selectively replacing in a vector: by logical vector. By indexing with a vector of logicals of the same length as the vector to be indexed, we can extract only those elements where the logical vector has a TRUE value.

III.3_20_r_36_logical_selection

While indexing by index number and by name allows us to extract elements in any given order, indexing by logical doesn’t afford us this possibility.

We can perform selective replacement this way as well; let’s suppose Students A and C retake their quizzes and moderately improve their scores.

III.3_21_r_37_logical_replacement

And the printed output:

III.3_22_r_38_logical_replacement_out

In this case, the length of the replacement vector (c(159, 169)) is equal to the number of TRUE values in the indexing vector (c(TRUE, FALSE, TRUE)); we’ll explore whether this is a requirement below.

In summary, we have three important ways of indexing into/selecting from/selectively replacing in vectors:

  1. by index number vector,
  2. by character vector (if the vector is named), and
  3. by logical vector.

§ Vectorized Operations, NA Values

If vectors are the most basic unit of data in R, all of the functions and operators we’ve been working with—as.numeric(), *, and even comparisons like >—implicitly work over entire vectors.

III.3_23_r_39_vectorized_conversion

In this example, each element of the character vector has been converted, so that class(numerics) would return "numeric". The final character string, "9b3x", cannot be reasonably converted to a numeric type, and so it has been replaced by NA. When this happens, the interpreter produces a warning message: NAs introduced by coercion.

NA is a special value in R that indicates either missing data or a failed computation of some type (as in attempting to convert "9b3x" to a numeric). Most operations involving NA values return NA values; for example, NA + 3 returns NA, and many functions that operate on entire vectors return an NA if any element is NA. A canonical example is the mean() function.

III.3_24_r_40_mean_na

Such functions often include an optional parameter that we can give, na.rm = TRUE, specifying that NA values should be removed before the function is run.

III.3_25_r_41_mean_na_rm

While this is convenient, there is a way for us to remove NA values from any vector (see below).

Other special values in R include NaN, for “Not a Number,” returned by calculations such as the square root of -1, sqrt(-1), and Inf for “Infinity,” returned by calculations such as 1/0. (Inf/Inf, by the way, returns NaN.)

Returning to the concept of vectorized operations, simple arithmetic operations such as +, *, /, -, ^ (exponent), and %% (modulus) are vectorized as well, meaning that an expression like 3 * 7 is equivalent to c(3) * c(7). When the vectors are longer than a single element, the operation is done on an element-by-element basis.

III.3_26_r_42_vec_mult
III.3_27_vectorized_multiplication

If we consider the * operator, it takes two inputs (numeric or integer) and returns an output (numeric or integer) for each pair from the vectors. This is quite similar to the comparison >, which takes two inputs (numeric or integer or character) and returns a logical.

III.3_28_r_43_vec_compare

§ Vector Recycling

What happens if we try to multiply two vectors that aren’t the same length? It turns out that the shorter of the two will be reused as needed, in a process known as vector recycling, or the reuse of the shorter vector in a vectorized operation.

III.3_29_r_44_vector_recycling

This works well when working with vectors of length one against longer vectors, because the length-one vector will be recycled as needed.

III.3_30_r_45_vector_recycling_single

If the length of the longer vector is not a multiple of the length of the shorter, however, the last recycle will go only partway through.

III.3_31_r_46_vector_recycling_noneven

When this happens, the interpreter prints a warning: longer object length is not a multiple of shorter object length. There are few situations where this type of partial recycling is not an accident, and it should be avoided.

Vector recycling also applies to selective replacement; for example, we can selectively replace four elements of a vector with elements from a two-element vector:

III.3_32_r_46_vector_recycling_replacement

More often we’ll selectively replace elements of a vector with a length-one vector.

III.3_33_r_47_vector_recycling_replacement_single

These concepts, when combined with vector indexing of various kinds, are quite powerful. Consider that an expression like values > 35 is itself vectorized, with the shorter vector (holding just 35) being recycled such that what is returned is a logical vector with TRUE values where the elements of values are greater than 35. We could use this vector as an indexing vector for selective replacement if we wish.

III.3_34_r_48_vector_recycling_replacement_logical

More succinctly, rather than create a temporary variable for select_vec, we can place the expression values > 35 directly within the brackets.

III.3_35_r_49_vector_recycling_replacement_logical2

Similarly, we could use the result of something like mean(values) to replace all elements of a vector greater than the mean with 0 easily, no matter the order of the elements!

III.3_36_r_50_vector_recycling_replacement_logical3

More often, we’ll want to extract such values using logical selection.

III.3_37_r_51_vector_recycling_selection

These sorts of vectorized selections, especially when combined with logical vectors, are a powerful and important part of R, so study them until you are confident with the technique.

§ Exercises

  1. Suppose we have r as a range of numbers from 1 to 30 in steps of 0.3; r<- seq(1, 30, 0.3). Using just the as.integer() function, logical indexing, and comparisons like >, generate a sequence r_decimals that contains all values of r that are not round integers. (That is, it should contain all values of r except 1.0, 2.0, 3.0, and so on. There should be 297 of them.)
  2. We briefly mentioned the %%, or “modulus,” operator, which returns the remainder of a number after integer division (e.g., 4 %% 3 == 1 and 4 %% 4 == 0; it is also vectorized). Given any vector r, for example r <- seq(1, 30, 0.3), produce a vector r_every_other that contains every other element of r. You will likely want to use %%, the == equality comparison, and you might also want to use seq() to generate a vector of indices of the same length as r.

    Do the same again, but modify the code to extract every third element of r into a vector called r_every_third.

  3. From chapter 27, “Variables and Data,” we know that comparisons like ==, !=, >= are available as well. Further, we know that ! negates the values of a logical vector, while & combines two logical vectors with “and,” and | combines two logical vectors with “or.” Use these, along with the %% operator discussed above, to produce a vector div_3_4 of all integers between 1 and 1,000 (inclusive) that are evenly divisible by 3 and evenly divisible by 4. (There are 83 of them.) Create another, not_div_5_6, of numbers that are not evenly divisible by 5 or 6. (There are 667 of them. For example, 1,000 should not be included because it is divisible by 5, and 18 should not be included because it is divisible by 6, but 34 should be because it is divisible by neither.)

§ Common Vector Functions

As vectors (specifically numeric vectors) are so ubiquitous, R has dozens (hundreds, actually) of functions that do useful things with them. While we can’t cover all of them, we can quickly cover a few that will be important in future chapters.

First, we’ve already seen the seq() and length() functions; the former generates a numeric vector comprising a sequence of numbers, and the latter returns the length of a vector as a single-element integer vector.

III.3_38_r_52_seq_length

Presented without an example, mean(), sd(), and median() return the mean, standard deviation, and median of a numeric vector, respectively. (Provided that none of the input elements are NA, though all three accept the na.rm = TRUE parameter.) Generalizing median(), the quantile() function returns the Yth percentile of a function, or multiple percentiles if the second argument has more than one element.

III.3_39_r_53_quantile

The output is a named numeric vector:

III.3_40_r_54_quantile_out

The unique() function removes duplicates in a vector, leaving the remaining elements in order of their first occurrence, and the rev() function reverses a vector.

III.3_41_r_55_unique_rev

There is the sort() function, which sorts a vector (in natural order for numerics and integers, and lexicographic (dictionary) order for character vectors). Perhaps more interesting is the order() function, which returns an integer vector of indices describing where the original elements of the vector would need to be placed to produce a sorted order.

III.3_42_r_56_order

In this example, the order vector, 2 5 3 4 1, indicates that the second element of rev_uniq would come first, followed by the fifth, and so on. Thus we could produce a sorted version of rev_uniq with rev_uniq[order_rev_uniq] (by virtue of vectors’ index-based selection), or more succinctly with rev_uniq[order(rev_uniq)].

III.3_43_order_example

Importantly, this allows us to rearrange multiple vectors with a common order determined by a single one. For example, given two vectors, id and score, which are related element-wise, we might decide to rearrange both sets in alphabetical order for id.

III.3_44_r_57_order_multisort

The sample() function returns a random sampling from a vector of a given size, either with replacement or without as specified with the replace = parameter (FALSE is the default if unspecified).

III.3_45_r_58_sample

The rep() function repeats a vector to produce a longer vector. We can repeat in an element-by-element fashion, or over the whole vector, depending on whether the each = parameter is used or not.

III.3_46_r_59_rep

Last (but not least) for this discussion is the is.na() function: given a vector with elements that are possibly NA values, it returns a logical vector whole elements are TRUE in indices where the original was NA, allowing us to easily indicate which elements of vectors are NA and remove them.

III.3_47_r_60_is_na

Notice the use of the exclamation point in the above to negate the logical vector returned by is.na().

§ Generating Random Data

R excels at working with probability distributions, including generating random samples from them. Many distributions are supported, including the Normal (Gaussian), Log-Normal, Exponential, Gamma, Student’s t, and so on. Here we’ll just look at generating samples from a few for use in future examples.

First, the rnorm() function generates a numeric vector of a given length sampled from the Normal distribution with specified mean (with mean =) and standard deviation (with sd =).

III.3_48_r_61_rnorm

Similarly, the runif() function samples from a uniform distribution limited by a minimum and maximum value.

III.3_49_r_62_runif

The rexp() generates data from an Exponential distribution with a given “rate” parameter, controlling the rate of decay of the density function (the mean of large samples will approach 1.0/rate).

III.3_50_r_63_rexp
III.3_51_distribution_shapes

R includes a large number of statistical tests, though we won’t be covering much in the way of statistics other than a few driving examples. The t.test() function runs a two-sided student’s t-test comparing the means of two vectors. What is returned is a more complex data type with class "htest".

III.3_52_r_63_2_ttest

When printed, this complex data type formats itself into nice, human-readable output:

III.3_53_r_63_3_ttest_out

§ Reading and Writing Tabular Data, Wrapping Long Lines

Before we go much further, we’re going to want to be able to import data into our R programs from external files (which we’ll assume to be rows and columns of data in text files). We’ll do this with read.table(), and the result will be a type of data known as a “data frame” (or data.frame in code). We’ll cover the nuances of data frames later, but note for now that they can be thought of as a collection of vectors (of equal length), one for each column in the table.

As an example, let’s suppose we have a tab-separated text file in our present working directory called states.txt.[3] Each row represents one of the US states along with information on population, per capita income, illiteracy rate, murder rate (per 100,000), percentage of high school graduates, and region (all measured in the 1970s). The first row contains a “header” line with column names.

III.3_54_r_63_4_states_top

Later in the file, someone has decided to annotate Michigan’s line, indicating it as the “mitten” state:

III.3_55_r_63_5_states_mitten

Like most functions, read.table() takes many potential parameters (23, in fact), but most of them have reasonable defaults. Still, there are five or so that we will commonly need to set. Because of the need to set so many parameters, using read.table() often results in a long line of code. Fortunately, the R interpreter allows us to break long lines over multiple lines, so long as each line ends on a character that doesn’t complete the expression (so the interpreter knows it needs to keep reading following lines before executing them). Common character choices are the comma and plus sign. When we do wrap a long line in this way, it’s customary to indent the following lines to indicate their continuance in a visual way.

III.3_56_r_64_read_table_states

When reading states.txt, the file = parameter specifies the file name to be read, while header = TRUE indicates to the interpreter that the first line in the file gives the column names (without it, the column names will be "V1", "V2", "V3" and so on). The sep = "\t" parameter indicates that tab characters are used to separate the columns in the file (the default is any whitespace), and comment.char = "#" indicates that # characters and anything after them should be ignored while reading the file (which is appropriate, as evident by the # mitten annotation in the file). The stringsAsFactors = FALSE parameter is more cryptic: it tells the interpreter to leave the character-vector columns (like region in this example) as character vectors, rather than convert them to the more sophisticated factor data type (to be covered in later chapters).

At this point, the states variable contains the data frame holding the columns (vectors) of data. We can print it with print(states), but the result is quite a lot of output:

III.3_57_r_65_states_print

It might make better sense to extract just the first 10 rows of data and print them, which we can do with the head() function (head() can also extract just the first few elements of a long vector).

III.3_58_r_66_head_states

The functions nrow() and ncol() return the number of rows and columns of a data frame, respectively (which is preferred over length(), which returns the number of columns); the dim() function returns a two-element vector with number of rows (at index 1) and number of columns (at index 2).

As mentioned previously, individual columns of a data frame are (almost always) vectors. To access one of these individual vectors, we can use a special $ syntax, with the column name following the $.

III.3_59_r_67_states_col_extract

So long as the column name is sufficiently simple (in particular, so long as it doesn’t have any spaces), then the quote marks around the column name can be (and often are) omitted.

III.3_60_r_68_states_col_extract_noquotes

Although this syntax can be used to extract a column from a data frame as a vector, note that it refers to the vector within the data frame as well. In a sense, states$income is the vector stored in the states data frame. Thus we can use techniques like selective replacement to work with them just like any other vectors. Here, we’ll replace all instances of “North Central” in the states$region vector with just the term “Central,” effectively renaming the region.[4]

III.3_61_r_69_states_col_rename_north_central

Writing a data frame to a tab-separated file is accomplished with the write.table() function.[5] As with read.table(), write.table() can take quite a few parameters, most of which have reasonable defaults. But there are six or so we’ll want to set more often than others. Let’s write the modified states data frame to a file called states_modified.txt as a tab-separated file.

III.3_62_r_69_2_states_write

The first two parameters here are the data frame to write and the file name to write to. The quote = FALSE parameter specifies that quotation marks shouldn’t be written around character types in the output (so the name column will have entries like Alabama and Alaska rather than "Alabama" and "Alaska"). The sep = "\t" indicates that tabs should separate the columns, while row.names = FALSE indicates that row names should not be written (because they don’t contain any meaningful information for this data frame), and col.names = TRUE indicates that we do want the column names output to the first line of the file as a “header” line.

§ R and the Unix/Linux Command Line

In chapter 26, “An Introduction,” we mentioned that R scripts can be run from the command line by using the #!/usr/bin/env Rscript executable environment. (Older versions of R required the user to run a command like R CMD BATCH scriptname.R, but today using Rscript is preferred.) We devoted more discussion to interfacing Python with the command line environment than we will R, partially because R isn’t as frequently used that way, but also because it’s quite easy.

When using read.table(), for example, data can be read from standard input by using the file name "stdin". Anything that is printed from an R script goes to standard output by default. Because R does a fair amount of formatting when printing, however, it is often more convenient to print data frames using write.table() specifying file = "".

Finally, to get command line parameters into an R script as a character vector, the line args <- commandArgs(trailingOnly = TRUE) will do the trick. Here’s a simple script that will read a table on standard input, write it to standard output, and also read and print out any command line arguments:

III.3_63_r_69_3_stdin_stdout

Try making this script executable on the command line, and running it on p450s_blastp_yeast_top1.txt with something like cat p450s_blastp_yeast_top1.txt | ./stdin_stdout_ex.R arg1 'arg 2'.

§ Exercises

  1. Suppose we have any odd-length numeric vector (e.g., sample<- c(3.2, 5.1, 2.5, 1.6, 7.9) or sample <- runif(25, min = 0, max = 1)). Write some lines of code that result in printing the median of the vector, without using the median() or quantile() functions. You might find the length() and as.integer() functions to be helpful.
  2. If sample is a sample from an exponential distribution, for example, sample <- rexp(1000, rate = 1.5), then the median of the sample is generally smaller than the mean. Generate a vector, between_median_mean, that contains all values of sample that are larger than (or equal to) the median of the sample, and less than (or equal to) the mean of the sample.
  3. Read in the states.txt file into a data frame as described. Extract a numeric vector called murder_lowincome containing murder rates for just those states with per capita incomes less than the median per capita income (you can use the median() function this time). Similarly, extract a vector called murder_highincome containing murder rates for just those states with greater than (or equal to) the median per capita income. Run a two-sample t.test() to determine whether the mean murder rates are different between these two groups.
  4. Let states be the state information data frame described above. Describe what the various operations below do in terms of indexing, selective replacement, vector recycling, and the types of data involved (e.g., numeric vectors and logical vectors). To get you started, the first line adds a new column to the states data frame called "newpop" that contains the same information as the "population" column. III.3_64_r_70_describe_selective_replacement_hw
  5. Determine the number of unique regions that are listed in the states data frame. Determine the number of unique regions represented by states with greater than the median income.
  6. What does the sum() function report for a numeric vector c(2, 3, 0, 1, 0, 2)? How about for c(1, 0, 0, 1, 1, 0)? And, finally, how about for the logical vector c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)? How could the sum() function thus be useful in a logical context?

  1. Most R functions take a large number of parameters, but many of them are optional. In the next chapter, we’ll see what such optional parameters look like, and how to get an extensive list of all the parameters that built-in R functions can take.
  2. The term “selective replacement” is not widely used outside of this book. In some situations, the term “conditional replacement” is used, but we wanted to define some concrete terminology to capture the entirety of the idea.
  3. When running on the command line, the present working directory is inherited from the shell. In RStudio, the present working directory is set to the “project” directory if the file is part of a project folder. In either case, it is possible to change the working directory from within R using the setwd() directory, as in setwd("/home/username/rproject") in Unix/Linux and setwd("C:/Documents and Settings/username/My Documents/rproject") in Windows. It is also possible to specify file names by absolute path, as in /home/username/rproject/states.txt, no matter the present working directory.
  4. If you have any familiarity with R, you might have run across the attach() function, which takes a data frame and results in the creation of a separate vector for each column. Generally, “disassembling” a data frame this way is a bad idea—after all, the columns of a data frame are usually associated with each other for a reason! Further, this function results in the creation of many variables with names based on the column names of the data frame. Because these names aren’t clearly delimited in the code, it’s easy to create hard-to-find bugs and mix up columns from multiple data frames this way.
  5. There are also more specialized functions for both reading and writing tabular data, such as read.csv() and write.csv(). We’ve focused on read.table() and write.table() because they are flexible enough to read and write tables in a variety of formats, including comma separated, tab separated, and so on.