Part II: Programming in Python

20 Dictionaries

Dictionaries (often called “hash tables” in other languages) are an efficient and incredibly useful way to associate data with other data. Consider a list, which associates each data element in the list with an integer starting at zero:

II.8_1_dict_list_indices_values

A dictionary works in much the same way, except instead of indices, dictionaries use “keys,” which may be integers or strings.[1] We’ll usually draw dictionaries longwise, and include the keys within the illustrating brackets, because they are as much a part of the dictionary structure as the values:

II.8_2_dict_keys_values

Let’s create a dictionary holding these keys and values with some code, calling it ids_to_gcs. Note that this name encodes both what the keys and values represent, which can be handy when keeping track of the contents of these structures. To create an empty dictionary, we call the dict() function with no parameters, which returns the empty dictionary. Then, we can assign and retrieve values much like we do with lists, except (1) we’ll be using strings as keys instead of integer indices, and (2) we can assign values to keys even if the key is not already present.

II.8_3_py_74_dict_ex

We can then access individual values just like with a list:

However, we cannot access a value for a key that doesn’t exist. This will result in a KeyError, and the program will halt.

II.8_5_py_76_dict_ex_error

Dictionaries go one way only: given the key, we can look up the value, but given a value, we can’t easily find the corresponding key. Also, the name “dictionary” is a bit misleading, because although real dictionaries are sorted in alphabetical order, Python dictionaries have no intrinsic order. Unlike lists, which are ordered and have a first and last element, Python makes no guarantees about how the key/value pairs are stored internally. Also, each unique key can only be present once in the dictionary and associated with one value, though that value might be something complex like a list, or even another dictionary. Perhaps a better analogy would be a set of labeled shelves, where each label can only be used once.

There are a variety of functions and methods that we can use to operate on dictionaries in Python. For example, the len() function will return the number of key/value pairs in a dictionary. For the dictionary above, len(ids_to_gcs) will return 3. If we want, we can get a list of all the keys in a dictionary using its .keys() method, though this list may be in a random order because dictionaries are unordered. We could always sort the list, and loop over that:

II.8_7_py_77_dict_keys_loop

Similarly, we can get a list of all the values using .values(), again in no particular order. So, ids_to_gcs.values() will return a list of three floats.[2]

If we try to get a value for a key that isn’t present in the dictionary, we’ll get a KeyError. So, we will usually want to test whether a key is present before attempting to read its value. We can do this with the dictionary’s .has_key() method, which returns True if the key is present in the dictionary.[3]

II.8_8_py_78_dict_has_key

§ Counting Gene Ontology Terms

To illustrate the usage of a dictionary in practice, consider the file PZ.annot.txt, the result of annotating a set of assembled transcripts with gene ontology (GO) terms and numbers. Each tab-separated line gives a gene ID, an annotation with a GO number, and a corresponding human-readable term associated with that number.

II.8_9_py_78_2_pz_annot_txt_sample

In this file, each gene may be associated with multiple GO numbers, and each GO number may be associated with multiple genes. Further, each GO term may be associated with multiple different GO numbers. How many times is each ID found in this file? Ideally, we’d like to produce tab-separated output that looks like so:

II.8_10_py_78_3_pz_annot_txt_sample_out

Our strategy: To practice some of the command line interaction concepts, we’ll have this program read the file on standard input and write its output to standard output (as discussed in chapter 19, “Command Line Interfacing”). We’ll need to keep a dictionary, where the keys are the gene IDs and the values are the counts. A for-loop will do to read in each line, stripping off the ending newline and splitting the result into a list on the tab character, \t. If the ID is in the dictionary, we’ll add one to the value. Because the dictionary will start empty, we will frequently run into IDs that aren’t already present in the dictionary; in these cases we can set the value to 1. Once we have processed the entire input, we can loop over the dictionary printing each count and ID.

In the code below (go_id_count.py), when the dictionary has the seqid key, we’re both reading from the dictionary (on the right-hand side) and writing to the value (on the left-hand side).

II.8_11_py_79_go_id_count_ex

But when the key is not present, we’re simply writing to the value. In the loop that prints the dictionary contents, we are not checking for the presence of each id before reading it to print, because the list of ids_list is guaranteed to contain exactly those keys that are in the dictionary, as it is the result of ids_to_counts.keys().

II.8_12_py_80_go_id_count_ex_call1

What’s the advantage of organizing our Python program to read rows and columns on standard input and write rows and columns to standard output? Well, if we know the built-in command line tools well enough, we can utilize them along with our program for other analyses. For example, we can first filter the data with grep to select those lines that match the term transcriptase:

II.8_13_py_81_go_id_grep_transcriptase

The result is only lines containing the word “transcriptase”:

II.8_14_py_81_2_go_id_grep_transcriptase_out

If we then feed those results through our program (cat PZ.annot.txt | grep 'transcriptase' | ./go_id_count.py), we see only counts for IDs among those lines.

II.8_15_py_81_3_go_id_grep_transcriptase_out_through_python

Finally, we could pipe the results through wc to count these lines and determine how many IDs were annotated at least once with that term (21). If we wanted to instead see which eight genes had the most annotations matching “transcriptase,” we could do that, too, by sorting on the counts and using head to print the top eight lines (here we’re breaking up the long command with backslashes, which allow us to continue typing on the next line in the terminal).[4]

II.8_16_py_82_go_id_count_ex_call3

It appears gene PZ32722_B has been annotated as a transcriptase seven times. This example illustrates that, as we work and build tools, if we consider how they might interact with other tools (even other pieces of code, like functions), we can increase our efficiency remarkably.

§ Extracting All Lines Matching a Set of IDs

Another useful property of dictionaries is that the .has_key() method is very efficient. Suppose we had an unordered list of strings, and we wanted to determine whether a particular string occurred in the list. This can be done, but it would require looking at each element (in a for-loop, perhaps) to see if it equaled the one we are searching for. If we instead stored the strings as keys in a dictionary (storing "present", or the number 1, or anything else in the value), we could use the .has_key() method, which takes a single time step (effectively, on average) no matter how many keys are in the dictionary.[5]

Returning to the GO/ID list from the last example, suppose that we had the following problem: we wish to first identify all those genes (rows in the table) that were labeled with GO:0001539 (which we can do easily with grep on the command line), and then we wish to extract all rows from the table matching those IDs to get an idea of what other annotations those genes might have.

In essence, we want to print all entries of a file:

II.8_17_py_82_2_go_terms

Where the first column matches any ID in the first column of another input:

II.8_18_py_82_3_go_id_grep_go

As it turns out, the above problem is common in data analysis (subsetting lines on the basis of an input “query” set), so we’ll be careful to design a program that is not specific to this data set, except that the IDs in question are found in the first column.[6]

We’ll write a program called match_1st_cols.py that takes two inputs: on standard input, it will read a number of lines that have the query IDs we wish to extract, and it will also take a parameter that specifies the file from which matching lines should be printed. For this instance, we would like to be able to execute our program as follows:

In terms of code, the program can first read the input from standard input and create a dictionary that has keys corresponding to each ID that we wish to extract (the values can be anything). Next, the program will loop over the lines of the input file (specified in sys.argv[1]), and for each ID check it with .has_key() against the dictionary created previously; if it’s found, the line is printed.

II.8_20_py_84_go_extract_code

Making the program (match_1st_cols.py) executable and running it reveals all annotations for those IDs that are annotated with GO:0001539.

II.8_21_py_85_go_extract_code_run

As before, we can use this strategy to easily extract all the lines matching a variety of criteria, just by modifying one or both inputs. Given any list of gene IDs of interest from a collaborator, for example, we could use that on the standard input and extract the corresponding lines from the GO file.

§ Exercises

  1. Dictionaries are often used for simple lookups. For example, a dictionary might have keys for all three base-pair DNA sequences ("TGG", "GCC", "TAG", and so on) whose values correspond to amino acid codes (correspondingly, "W", "A", "*" for “stop,” and so on). The full table can be found on the web by searching for “amino acid codon table.”

    Write a function called codon_to_aa() that takes in a single three-base-pair string and returns a one-character string with the corresponding amino acid code. You may need to define all 64 possibilities, so be careful not to make any typos! If the input is not a valid three-base-pair DNA string, the function should return "X" to signify “unknown.” Test your function with a few calls like print(codon_to_aa("TGG")), print(codon_to_aa("TAA")), and print(codon_to_aa("BOB")).

  2. Combine the result of the codon_to_aa() function above with the get_windows() function from the exercises in chapter 18, “Python Functions,” to produce a dna_to_aa() function. Given a string like "AAACTGTCTCTA", the function should return its translation as "KLSL".
  3. Use the get_windows() function to write a count_kmers() function; it should take two parameters (a DNA sequence and an integer) and return a dictionary of k-mers to count for those k-mers. For example, count_kmers("AAACTGTCTCTA", 3) should return a dictionary with keys "AAA", "AAC", "ACT", "CTG", "TGT", "GTC", "TCT", "CTC", "CTA" and corresponding values 1, 1, 1, 1, 1, 1, 2, 1, 1. (K-mer counting is an important step in many bioinformatics algorithms, including genome assembly.)
  4. Create a function union_dictionaries() that takes two dictionaries as parameters returns their “union” as a dictionary—when a key is found in both, the larger value should be used in the output. If dictionary dict_a maps "A", "B", "C" to 3, 2, 6, and dict_b maps "B", "C", "D" to 7, 4, 1, for example, the output should map "A", "B", "C", "D" to 3, 7, 6, 1.

  1. Dictionary keys may be any immutable type, which includes integers and strings, but they also include a number of other more exotic types, like tuples (immutable lists). Although floats are also immutable, because keys are looked up on the basis of equality and rounding errors can compound, it is generally not recommended to use them as dictionary keys.
  2. In Python 3.0 and later, what is returned by .keys() and .values() is not technically a list but a “view,” which operates similarly to a list without using any additional memory. The code shown still works, but ids.sort() (the sort-in-place version) would not, as views do not have a .sort() method.
  3. There is also a more Python-specific method of querying for a key in a dictionary: "TL34_X" in ids_to_gcs is equivalent to ids_to_gcs.has_keys("TL34_X"). Interestingly, the in keyword also works for lists: if ids = ["CYP6B", "CATB", "AGP4"], then "TL34_X" in ids will return False. However, in this case every entry in ids must be compared to "TL34_X" to make this determination, and so in is much slower for lists than for dictionaries. Using the in keyword for dictionaries is generally preferred by the Python community, and in fact .has_key() has been removed in Python 3.0 and later. For the purposes of this book, we'll use .has_key() when working with dictionaries so it is clear exactly what is happening.
  4. For simple problems like this, if we know the command line tools well enough, we don’t even need to use Python. This version of the problem can be solved with a pipeline like cat PZ.annot.txt | grep 'transcriptase' | awk '{print $1}' | sort | uniq -c | sort -k1,1nr | head.
  5. In computer science terms, we say that searching an unordered list runs in time O(n), or “order\textcolor{white}{|}n,” where\textcolor{white}{|}n is taken to be the size of the list. The .has_key() method runs in O(1), which is to say that the time taken is independent of the size of the dictionary. If we need to do such a search many times, these differences can add up significantly. More information on run-time considerations is covered in chapter 25, “Algorithms and Data Structures.”
  6. The grep utility can perform a similar operation; grep -f query_patterns.txt subject_file.txt will print all lines in subject_file.txt that are matched by any of the patterns in query_patterns.txt. But this requires that all patterns are compared to all lines (even if the patterns are simple), and so our custom Python solution is much faster when the number of queries is large (because the .has_key() method is so fast).