Part II: Programming in Python

15 Collections and Looping: Lists and for

A list, as its name implies, is a list of data (integers, floats, strings, Booleans, or even other lists or more complicated data types). Python lists are similar to arrays or vectors in other languages. Like letters in strings, elements of a list are indexed starting at 0 using [] syntax. We can also use brackets to create a list with a few elements of different types, though in practice we won’t do this often.
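
As a minimal sketch of the bracket syntax described above (the variable names and values here are illustrative assumptions, not taken from a particular data set):

```python
# A list created with brackets, holding elements of several types.
# The values are illustrative only.
a_list = [1, 2.4, "CYP6B", True]

first_el = a_list[0]   # indices start at 0, so this is the integer 1
third_el = a_list[2]   # the string "CYP6B"

print(first_el)   # prints 1
print(third_el)   # prints CYP6B
```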


Just like with strings, we can use [] notation to get a sublist “slice,” and we can use the len() function to get the length of a list.
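
A short sketch of slicing and len() on a list, again with an illustrative list of our own choosing:

```python
a_list = [1, 2.4, "CYP6B", True]   # illustrative values

# A slice from index 1 up to (but not including) index 3
sublist = a_list[1:3]
list_len = len(a_list)

print(sublist)    # prints [2.4, 'CYP6B']
print(list_len)   # prints 4
```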


Unlike strings, though, lists are mutable, meaning we can modify them after they’ve been created, for example, by replacing an element with another element. As mentioned above, lists can even contain other lists!
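
The following sketch illustrates both points, replacing an element in place and nesting one list inside another (names and values are illustrative assumptions):

```python
a_list = [1, 2.4, "CYP6B", True]   # illustrative values

# Replace the element at index 1; the list itself is modified
a_list[1] = "replaced"
print(a_list)          # prints [1, 'replaced', 'CYP6B', True]

# A list whose elements are themselves lists
nested = [a_list, [10, 20]]
inner_el = nested[1][0]   # first element of the second inner list
print(inner_el)        # prints 10
```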


We will typically want our code to create an empty list, and then add data elements to it one element at a time. An empty list is returned by calling the list() function with no parameters. Given a variable which references a list object, we can append an element to the end using the .append() method, giving the method the element we want to append as a parameter.
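
A minimal sketch of this pattern, assuming we want to collect a couple of single-letter strings (the appended values are illustrative):

```python
new_list = list()      # an empty list

new_list.append("A")   # new_list now refers to ['A']
new_list.append("G")   # new_list now refers to ['A', 'G']

print(new_list)        # prints ['A', 'G']
```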


This syntax might seem a bit odd compared to what we’ve seen so far. Here new_list.append("G") is telling the list object the new_list variable refers to to run its .append() method, taking as a parameter the string "G". We’ll explore the concepts of objects and methods more formally in later chapters. For now, consider the list not just a collection of data, but a “smart” object with which we can interact using . methods.

Note that the .append() method asks the list to modify itself (which it can do, because lists are mutable), but this operation doesn’t return anything of use.[1]

This type of command opens up the possibility for some insidious bugs; for example, a line like new_list = new_list.append("C") looks innocent enough and causes no immediate error, but it is probably not what the programmer intended. The reason is that the new_list.append("C") call successfully asks the list to modify itself, but then the None value is returned, which would be assigned to the new_list variable with the assignment. At the end of the line, new_list will refer to None, and the list itself will no longer be accessible. (In fact, it will be garbage collected in due time.) In short, use some_list.append(el), not some_list = some_list.append(el).
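
To see why, we can capture what .append() returns in a separate variable (here called result, a name chosen only for this illustration) rather than overwriting the list variable:

```python
new_list = ["A", "G"]   # illustrative starting list

# The list is modified in place; the call itself returns None
result = new_list.append("C")

print(new_list)   # prints ['A', 'G', 'C']
print(result)     # prints None
```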

We often want to sort lists, which we can do in two ways. First, we could use the sorted() function, which takes a list as a parameter and returns a new copy of the list in sorted order, leaving the original alone. Alternatively, we could call a list’s .sort() method to ask a list to sort itself in place.
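
Both approaches can be sketched as follows; the list contents and variable names are illustrative assumptions:

```python
a_list = ["C", "A", "T", "G"]   # illustrative values

# sorted() returns a new, sorted list and leaves a_list unchanged
sorted_copy = sorted(a_list)
print(sorted_copy)   # prints ['A', 'C', 'G', 'T']
print(a_list)        # prints ['C', 'A', 'T', 'G']

# .sort() asks the list to sort itself in place
a_list.sort()
print(a_list)        # prints ['A', 'C', 'G', 'T']
```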


As with the .append() method above, the .sort() method returns None, so the following would almost surely result in a bug: a_list = a_list.sort().

At this point, one would be forgiven for thinking that . methods always return None and so assignment based on the results isn’t useful. But before we move on from lists, let’s introduce a simple way to split a string up into a list of substrings, using the .split() method on a string data type. For example, let’s split up a string wherever the subsequence "TA" occurs.
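
A sketch of such a split, using a sequence chosen here purely for illustration:

```python
seq = "CGCGTAGTACAGA"        # illustrative sequence

# Split wherever "TA" occurs; the "TA" separators themselves are dropped
subs_list = seq.split("TA")

print(subs_list)             # prints ['CGCG', 'G', 'CAGA']
```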


If the sequence was instead "CGCGTATACAGA", the resulting list would have contained ["CGCG", "", "CAGA"] (that is, one of the elements would be a zero-length empty string). This example illustrates that strings, like lists, are also “smart” objects with which we can interact using . methods. (In fact, so are integers, floats, and all other Python types that we’ll cover.)

§ Tuples (Immutable Lists)

As noted above, lists are mutable, meaning they can be altered after their creation. In some special cases, it is helpful to create an immutable version of a list, called a “tuple” in Python. Like lists, tuples can be created in two ways: with the tuple() function (which returns an empty tuple) or directly.
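
Both ways of creating a tuple can be sketched as follows (the element values are illustrative assumptions):

```python
# An empty tuple from the tuple() function
empty_tuple = tuple()

# A tuple created directly, by listing elements inside parentheses
a_tuple = (1, 2.4, "CYP6B", True)

print(empty_tuple)   # prints ()
print(a_tuple)       # prints (1, 2.4, 'CYP6B', True)
```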


Tuples work much like lists—we can call len() on them and extract elements or slices with [] syntax. We can’t change, remove, or insert elements.[2]
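
The sketch below illustrates these points with an illustrative tuple. It uses a try/except block, which we have not covered, only so that the attempted modification does not halt the program; the changed variable is a name invented for this example.

```python
a_tuple = ("CYP6B", "AGP4", "CATB")   # illustrative values

print(len(a_tuple))    # prints 3
print(a_tuple[0])      # prints CYP6B
print(a_tuple[0:2])    # prints ('CYP6B', 'AGP4')

# Attempting to replace an element raises a TypeError,
# because tuples are immutable.
try:
    a_tuple[0] = "XYZ"
    changed = True
except TypeError:
    changed = False

print(changed)         # prints False
```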

§ Looping with for

A for-loop in Python executes a block of code once for each element of an iterable data type, that is, a type whose elements can be accessed one at a time, in order. Both strings and lists are iterable types in Python, though for now we’ll explore only iterating over lists with for-loops.

A block is a set of lines of code that are grouped as a unit; in many cases they are executed as a unit as well, perhaps more than one time. Blocks in Python are indicated by being indented an additional level (usually with four spaces—remember to be consistent with this indentation practice).

When using a for-loop to iterate over a list, we need to specify a variable name that will reference each element of the list in turn.
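
A minimal sketch of such a loop, assuming a gene_ids list holding three illustrative IDs:

```python
gene_ids = ["CYP6B", "AGP4", "CATB"]   # illustrative gene IDs

for gene_id in gene_ids:
    print("gene_id is " + gene_id)     # the indented block, run once per element

print("Done.")

# Printed, in order:
# gene_id is CYP6B
# gene_id is AGP4
# gene_id is CATB
# Done.
```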


In the above, one line is indented an additional level just below the line defining the for-loop. In the for-loop, the gene_id variable is set to reference each element of the gene_ids list in turn. Here’s the output of the loop:


Using for-loops in Python often confuses beginners, because a variable (e.g., gene_id) is being assigned without using the standard = assignment operator. If it helps, you can think of the first loop through the block as executing gene_id = gene_ids[0], the next time around as executing gene_id = gene_ids[1], and so on, until all elements of gene_ids have been used.

Blocks may contain multiple lines (including blank lines) so that multiple lines of code can work together. Here’s a modified loop that keeps a counter variable, incrementing it by one each time.
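
Such a loop might be sketched like this, assuming the same three illustrative gene IDs:

```python
gene_ids = ["CYP6B", "AGP4", "CATB"]   # illustrative gene IDs
counter = 0

for gene_id in gene_ids:
    print("gene_id is " + gene_id)

    # Still inside the block: the blank line above does not end it
    counter = counter + 1

print("Done.")
print(counter)
```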


The output of this loop would be the same as the output above, with an additional line printing 3 (the contents of counter after the loop ends).

Some common errors when using block structures in Python include the following, many of which will result in an IndentationError.

  1. Not using the same number of spaces for each indentation level, or mixing tab indentation with multiple-space indentation. (Most Python programmers prefer using four spaces per level.)
  2. Forgetting the colon : that ends the line before the block.
  3. Using something like a for-loop line that requires a block, but not indenting the next line.
  4. Needlessly indenting (creating a block) without a corresponding for-loop definition line.

We often want to loop over a range of integers. Conveniently, the range() function returns a list of numbers.[3] It commonly takes two parameters: (1) the starting integer (inclusive) and (2) the ending integer (exclusive). Thus we could program our for-loop slightly differently by generating a list of integers to use as indices, and iterating over that:
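
Two versions of such a loop are sketched below, assuming a list named ids that holds three illustrative gene IDs. The first hard-codes the ending integer; the second computes it from the length of the list.

```python
ids = ["CYP6B", "AGP4", "CATB"]   # illustrative gene IDs

# First version: the range 0, 1, 2 is written out explicitly
for index in range(0, 3):
    print("gene_id is " + ids[index])

# Second version: the ending integer comes from len(ids),
# so the loop adapts if the list changes size
for index in range(0, len(ids)):
    print("gene_id is " + ids[index])
```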


The output of one of the loops above:


The second example above illustrates the rationale behind the inclusive/exclusive nature of the range() function: because indices start at zero and go to one less than the length of the list, we can use range(0, len(ids)) (as opposed to needing to modify the ending index) to properly iterate over the indices of ids without first knowing the length of the list. Seasoned programmers generally find this intuitive, but those who are not used to counting from zero may need some practice. You should study these examples of looping carefully, and try them out. These concepts are often more difficult for beginners, but they are important to learn.

Loops and the blocks they control can be nested, to powerful effect:
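
One possible nested loop is sketched here. The specific ranges and arithmetic are assumptions made for illustration; what matters is the structure, with one block indented inside another.

```python
total = 0

for i in range(0, 4):
    print("i is " + str(i))
    total = total + i
    for j in range(0, 4):
        print("j is " + str(j))
        total = total + i * j   # uses total, i, and j

print(total)
```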


In the above, the outer for-loop controls a block of five lines; contained within is the inner for-loop controlling a block of only two lines. The outer block is principally concerned with the variable i, while the inner block is principally concerned with the variable j. We see that both blocks also make use of variables defined outside them; the inner block makes use of total, i, and j, while lines specific to the outer block make use of total and i (but not j). This is a common pattern we’ll be seeing more often. Can you determine the value of total at the end without running the code?

§ List Comprehensions

Python and a few other languages include specialized shorthand syntax for creating lists from other lists known as list comprehensions. Effectively, this shorthand combines a for-loop syntax and list-creation syntax into a single line.

Here’s a quick example: starting with a list of numbers [1, 2, 3, 4, 5, 6], we generate a list of squares ([1, 4, 9, 16, 25, 36]):
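
A minimal sketch of this comprehension:

```python
nums = [1, 2, 3, 4, 5, 6]

# For each num in nums, compute num ** 2 and collect the results in a new list
squares = [num ** 2 for num in nums]

print(squares)   # prints [1, 4, 9, 16, 25, 36]
```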


Here we’re using a naming convention of num in nums, but like a for-loop, the looping variable can be named almost anything; for example, squares = [x ** 2 for x in nums] would accomplish the same task.

List comprehensions can be quite flexible and used in creative ways. Given a list of sequences, we can easily generate a list of lengths.
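
For instance, assuming a list named seqs that holds a few illustrative sequences:

```python
seqs = ["TAC", "TC", "CGAGG", "TAG"]   # illustrative sequences

# One length per sequence, in the same order
lens = [len(seq) for seq in seqs]

print(lens)   # prints [3, 2, 5, 3]
```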


These structures support “conditional inclusion” as well, though we haven’t yet covered operators like ==:
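
As a sketch, the comprehension below keeps the lengths of only those sequences whose first base is "T" (the seqs list and variable names are illustrative assumptions):

```python
seqs = ["TAC", "TC", "CGAGG", "TAG"]   # illustrative sequences

# The trailing "if" filters which elements are included
t_lens = [len(seq) for seq in seqs if seq[0] == "T"]

print(t_lens)   # prints [3, 2, 3]
```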


The next example generates a list of 1s for each element where the first base is "T", and then uses the sum() function to sum up the list, resulting in a count of sequences beginning with "T".
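
A sketch of this counting idiom, assuming an illustrative seqs list:

```python
seqs = ["TAC", "TC", "CGAGG", "TAG"]   # illustrative sequences

# One 1 for each sequence starting with "T", then add them up
count_t = sum([1 for seq in seqs if seq[0] == "T"])

print(count_t)   # prints 3
```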


Although many Python programmers often use list comprehensions, we won’t use them much in this book. Partially, this is because they are a feature that many programming languages don’t have, but also because they can become difficult to read and understand owing to their compactness. As an example, what do you think the following comprehension does? [x for x in range(2, n) if x not in [j for i in range(2, sqrtn) for j in range(i*2, n, i)]] (Suppose n = 100 and sqrtn = 10. This example also makes use of the fact that range() can take a step argument, as in range(start, stop, step).)

§ Exercises

  1. What is the value of total at the end of each loop set below? First, see if you can compute the answers by hand, and then write and execute the code with some added print() statements to check your answers.
  2. Suppose we say the first for-loop block above has “depth” 1 and “width” 4, and the second has depth 2 and width 4, and the third has depth 3 and width 4. Can you define an equation that indicates what the total would be for a nested for-loop block with depth d and width w? How does this equation relate to the number of times the interpreter has to execute the line total = total + 1?
  3. Determine an equation that relates the final value of total below to the value of x.
  4. Given a declaration of a sequence string, like seq = "ATGATAGAGGGATACGGGATAG", and a subsequence of interest, like subseq = "GATA", write some code that prints all of the locations of that substring in the sequence, one per line, using only the Python concepts we’ve covered so far (such as len(), for-loops, and .split()). For the above example, the output should be 3, 11, and 18.

    Your code should still work if the substring occurs at the start or end of the sequence, or if the subsequence occurs back to back (e.g., in "GATACCGATAGATA", "GATA" occurs at positions 1, 7, and 11). As a hint, you may assume the subsequence is not self-overlapping (e.g., you needn’t worry about locating "GAGA" in "GAGAGAGAGA", which would occur at positions 1, 3, 5, and 7).

  5. Suppose we have a matrix represented as a list of columns: cols = [[10, 20, 30, 40], [5, 6, 7, 8], [0.9, 0.10, 0.11, 0.12]]. Because each column is an internal list, this arrangement is said to be in “column-major order.” Write some code that produces the same data in “row-major order”; for example, rows should contain [[10, 5, 0.9], [20, 6, 0.10], [30, 7, 0.11], [40, 8, 0.12]]. You can assume that all columns have the same number of elements and that the matrix is at least 2 by 2.

    This problem is a bit tricky, but it will help you organize your thoughts around loops and lists. You might start by first determining the number of rows in the data, and then building the “structure” of rows as a list of empty lists.

  [1] The .append() method returns a special data type known as None, which allows for a variable to exist but not reference any data. (Technically, None is a type of data, albeit a very simple one.) None can be used as a type of placeholder, and so in some situations isn’t entirely useless.
  [2] Tuples are a cause of one of the more confusing parts of Python, because they are created by enclosing a list of elements inside of parentheses, but function calls also take parameters listed inside of parentheses, and mathematical expressions are grouped by parentheses, too! Consider the expression (4 + 3) * 2. Is (4 + 3) an integer, or a single-element tuple? Actually, it’s an integer. By default, Python looks for a comma to determine whether a tuple should be created, so (4 + 3) is an integer, while (4 + 3, 8) is a two-element tuple and (4 + 3,) is a single-element tuple. Use parentheses deliberately in Python: either to group mathematical expressions, create tuples, or call functions (where the function name and opening parenthesis are neighboring, as in print(a) rather than print (a)). Needlessly adding parentheses (and thereby accidentally creating tuples) has been the cause of some difficult-to-find bugs.
  [3] An alternative to range() in Python 2.7 is xrange(), which produces an iterable type that works much like a list of numbers but is more memory efficient. In more recent versions of Python (3.0 and above) the range() function works like xrange() and xrange() has been removed. Programmers using Python 2.7 may wish to use xrange() for efficiency, but we’ll stick with range() so that our code works with the widest variety of Python versions, even if it sacrifices efficiency in some cases. There is one important difference between range() and xrange() in Python 2.7: range() returns a list, while xrange() returns an iterable type that lacks some features of true lists. For example, nums = range(0, 4) followed by nums[3] = 1000 would result in nums referencing [0, 1, 2, 1000], while nums = xrange(0, 4) followed by nums[3] = 1000 would produce an error.