Part II: Programming in Python

17 Conditional Control Flow

The phrase “control flow” refers to the fact that constructs like for-loops change the flow of program execution away from the simple top-to-bottom order. There are several other types of control flow we will cover, two of which are “conditional” in nature.

§ Using If-Statements

If-statements allow us to conditionally execute a block of code, depending on a variable referencing a Boolean True or False, or more commonly a condition that returns a Boolean True or False. The syntax is fairly simple, described here with an example.

II.5_1_py_43_if_example

All the lines from the starting if to the last line in an elif: or else: block are part of the same logical construct. Such a construct must have exactly one if conditional block, may have any number of elif blocks (they are optional), and may have at most one catchall else block at the end (also optional). Each conditional is evaluated in order: the block for the first one that evaluates to True will run, and the rest will be skipped. If an else block is present, it will run as a “last resort” if none of the earlier if or elif blocks did.
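The structure can be sketched as follows. This is a minimal illustration, not the chapter's original example; the variable names and length thresholds are assumptions chosen for the sketch.

```python
# Hypothetical example: label a sequence by its length.
seq = "ACTAGGAC"
seq_len = len(seq)

if seq_len < 5:
    label = "short"     # runs only if the first condition is True
elif seq_len < 10:
    label = "medium"    # tested only if the condition above was False
else:
    label = "long"      # catchall: runs if nothing above matched

print("Sequence is", label)
```

Only one of the three blocks runs on any given pass through the construct.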

Just like with for-loops, if-statements can be nested inside of other blocks, and other blocks can occur inside if-statement blocks. Also just like for-loops, Python uses indentation (standard practice is four spaces per indentation level) to indicate block structure, so you will get an error if you needlessly indent (without a corresponding control flow line like for, if, elif, or else) or forget to indent when an indentation is expected.[1]

II.5_2_py_44_if_example2

The above code would print Number short: 2 number long: 2.
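For readers without the figure, here is one sketch of a nested for-loop and if-statement that produces this output. The list contents and the length cutoff of 8 are assumptions made for illustration, not necessarily the original code.

```python
# Assumed input: four sequences, two shorter than 8 bases and two not.
seqs = ["ACGTAGAC", "CAGTAGAGC", "GACGA", "CGATAGG"]

count_short = 0
count_long = 0

for seq in seqs:
    num_bases = len(seq)
    if num_bases < 8:                  # if-block nested inside the for-block
        count_short = count_short + 1
    else:
        count_long = count_long + 1

print("Number short:", count_short, "number long:", count_long)
```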

§ Using While-Loops

While-loops are less often used (depending on the nature of the programming being done), but they can be invaluable in certain situations and are a basic part of most programming languages. A while-loop executes a block of code so long as a condition remains True. Note that if the condition never becomes False, the block will execute over and over in an “infinite loop.” If the condition is False to begin with, however, the block is skipped entirely.

II.5_3_py_44_2_while_pre_example

The above will print Counter is now: 0, followed by Counter is now: 1, Counter is now: 2, Counter is now: 3, and finally Done. Counter ends with: 4. As with using a for-loop over a range of integers, we can also use a while-loop to access specific indices within a string or list.
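A minimal sketch of a counting while-loop that produces the output just described (the variable name counter is taken from the output; the rest is an assumption):

```python
counter = 0
while counter < 4:                        # condition checked before every pass
    print("Counter is now: " + str(counter))
    counter = counter + 1                 # without this line the loop never ends

print("Done. Counter ends with: " + str(counter))
```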

II.5_4_py_45_while_example1

The above code will print base is: A, then base is: C, and so on, ending with base is: T before finally printing Done. While-loops can thus be used as a type of fine-grained for-loop, to iterate over elements of a string (or list), in turn using simple integer indexes and [] syntax. While the above example adds 1 to base_index on each iteration, it could just as easily add some other number. Adding 3 would cause it to print every third base, for example.
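The index-based pattern can be sketched as a small function so that the step size is easy to vary. The function name, the step parameter, and the example sequence are assumptions for illustration.

```python
def bases_at_step(seq, step):
    """Visit seq by index with a while-loop, moving `step` positions each time."""
    visited = []
    base_index = 0
    while base_index < len(seq):
        base = seq[base_index]
        print("base is: " + base)
        visited.append(base)
        base_index = base_index + step
    print("Done")
    return visited

every_base = bases_at_step("ACAGT", 1)    # visits A, C, A, G, T
every_third = bases_at_step("ACAGT", 3)   # visits indices 0 and 3 only
```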

§ Boolean Operators and Connectives

We’ve already seen one type of Boolean comparison, <, which returns whether the value of its left-hand side is less than the value of its right-hand side. There are a number of others:

Operator   Meaning                         Example (with a = 7, b = 3)
<          less than?                      a < b     # False
>          greater than?                   a > b     # True
<=         less than or equal to?          a <= b    # False
>=         greater than or equal to?       a >= b    # True
!=         not equal to?                   a != b    # True
==         equal to?                       a == b    # False

These comparisons work for floats, integers, and even strings and lists. Strings and lists are compared in lexicographic order: an ordering wherein item A is less than item B if the first element of A is less than the first element of B; in the case of a tie, the second element is considered, and so on. If in this process we run out of elements in one of the items, the shorter one is smaller. When applied to strings of the same case, lexicographic order corresponds to the familiar alphabetical order.

Let’s print the sorted version of a Python list of strings, which does its sorting using the comparisons above. Note that numeric digits are considered to be “less than” alphabetic characters, and uppercase letters come before lowercase letters.

II.5_5_py_46_sort_lexicographic
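A small sketch of lexicographic ordering in action. The list contents are made up for illustration and are not the original example.

```python
# Digits sort before uppercase letters, which sort before lowercase letters.
seqs = ["GAT", "gat", "9AT", "GA", "AAT"]
seqs_sorted = sorted(seqs)
print(seqs_sorted)            # ['9AT', 'AAT', 'GA', 'GAT', 'gat']

# The same rule applies to lists: ties are broken by later elements,
# and when one item runs out of elements, the shorter one is smaller.
shorter_is_smaller = [1, 2] < [1, 2, 3]     # True
second_element_decides = [1, 3] < [1, 2, 3] # False, because 3 > 2
```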

Boolean connectives let us combine conditionals that return True or False into more complex statements that also return Boolean types.

Connective   Meaning                       Example (with a = 7, b = 3)
and          True if both are True         a < 8 and b == 3    # True
or           True if one or both are True  a < 8 or b == 9     # True
not          True if following is False    not a < 3           # True

These can be grouped with parentheses, and usually should be to avoid confusion, especially when more than one test follows a not.[2]

Finally, note that generally each side of an and or or should result in only True or False. The expression a == 3 or a == 7 has the correct form, whereas a == 3 or 7 does not. (In fact, 7 in the latter context will be treated as true, and so a == 3 or 7 will always be considered true in an if-statement or while-loop, regardless of the value of a.)
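The difference between the two forms, and the effect of grouping, can be checked directly. The value a = 5 is an arbitrary choice for this sketch.

```python
a = 5

correct_form = (a == 3 or a == 7)   # both sides are Boolean tests: False
flawed_form = (a == 3 or 7)         # evaluates to 7, which counts as true

if flawed_form:
    print("The flawed test passes even though a is", a)

# Without parentheses, `and` is applied before `or`.
ungrouped = True or False and False      # same as True or (False and False)
grouped = (True or False) and False
print(ungrouped, grouped)                # True False
```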

§ Logical Dangers

Notice the similarity between = and ==, and yet they have dramatically different meanings: the former is the variable assignment operator, while the latter is an equality test. Accidentally using one where the other is meant is an easy way to produce erroneous code. For example, a line consisting of count == 1 won’t initialize count to 1; rather, it will return whether it already is 1 (or result in an error if count doesn’t exist as a variable at that point). The reverse mistake is harder to make, as Python does not allow variable assignment with = in if-statement and while-loop conditions.

II.5_7_py_47_if_safety

In the above, the intent is to determine whether the length of seq is a multiple of 3 (as determined by the result of len(seq)%3 using the modulus operator), but the if-statement in this case should actually be if remainder == 0:. In many languages, the above would be a difficult-to-find bug: remainder would be assigned 0, and the test would then be decided by that assigned value rather than by the length of the sequence (in C-like languages 0 counts as false, so the block would never run). In Python, the result is an error: SyntaxError: invalid syntax (recent versions also suggest that == may have been intended).
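One way to see this behavior without crashing a script is to hand the faulty line to Python's built-in compile() function and catch the error. This is only a demonstration device assumed for this sketch; the names and the example sequence are illustrative.

```python
# The faulty if-statement, stored as text so that this script itself still runs.
faulty_source = "if remainder = 0:\n    print('multiple of 3')\n"

try:
    compile(faulty_source, "<example>", "exec")
    outcome = "compiled"
except SyntaxError:
    outcome = "SyntaxError"
print("Faulty version:", outcome)

# The corrected test uses == to compare rather than = to assign.
seq = "ACTAGA"
remainder = len(seq) % 3
if remainder == 0:
    print("Length of seq is a multiple of 3")
```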

Still, a certain class of dangerous comparison is common to nearly every language, Python included: the comparison of two float types for equality or inequality.

Although integers can be represented exactly in binary arithmetic (e.g., 751 in binary is represented exactly as 1011101111), floating-point numbers can only be represented approximately. This shouldn’t be an entirely unfamiliar concept; for example, we might decide to keep only four decimal places when doing calculations on pencil and paper, working with 1/3 as 0.3333. The trouble is that these small errors can compound in difficult-to-predict ways. If we decide to compute (1/3)*(1/3)/(1/3) as 0.3333*0.3333/0.3333, working left to right we’d start with 0.3333*0.3333 = 0.11108889, cut off after four decimal places as 0.1110. This is then divided by 0.3333 and cut off again to produce an answer of 0.3330. So, even though we know that (1/3)*(1/3)/(1/3) == 1/3, our calculation process would call them unequal because it ultimately tests 0.3330 against 0.3333!
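The pencil-and-paper calculation can be replayed with Python's decimal module. This sketch assumes that "four decimal places" means dropping the extra digits (truncation), since that is what yields the intermediate values 0.1110 and 0.3330; the helper name is made up for illustration.

```python
from decimal import Decimal, ROUND_DOWN

FOUR_PLACES = Decimal("0.0001")

def keep_four_places(value):
    """Drop everything past the fourth decimal place (truncate)."""
    return value.quantize(FOUR_PLACES, rounding=ROUND_DOWN)

third = keep_four_places(Decimal(1) / Decimal(3))   # 0.3333
product = keep_four_places(third * third)           # 0.1110
result = keep_four_places(product / third)          # 0.3330

print(third, product, result)
print(result == third)                              # False
```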

Modern computers have many more digits of precision (about 15 decimal digits at a minimum, in most cases), but the problem remains the same. Worse, numbers that don’t need rounding in our base-10 arithmetic system do require rounding in the computer’s base-2 system. Consider 0.2, which in binary is 0.001100110011..., with the pattern 0011 repeating forever. Indeed, 0.2 * 0.2 / 0.2 == 0.2 results in False!

While comparing floats with <, >, <=, and >= is usually safe (within extremely small margins of error), comparison of floats with == and != usually indicates a misunderstanding of how floating-point numbers work.[3] In practice, we’d determine if two floating-point values are sufficiently similar, within some defined margin of error.

II.5_8_py_48_epsilon_compare
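A margin-of-error comparison might look like the following sketch. The variable names and the particular epsilon value are assumptions rather than the original example; math.isclose() is a standard-library alternative that uses a relative tolerance by default.

```python
import math

a = 0.2 * 0.2 / 0.2
b = 0.2
epsilon = 0.0000001     # assumed margin of error; choose one suited to your data

print(a == b)           # exact comparison: unreliable for floats

if abs(a - b) < epsilon:
    close_enough = True
    print("a and b are equal, within the margin of error")
else:
    close_enough = False

# Standard-library equivalent with a relative tolerance (default 1e-09).
also_close = math.isclose(a, b)
```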

§ Counting Stop Codons

As an example of using conditional control flow, we’ll consider the file seq.txt, which contains a single DNA string on the first line. We wish to count the number of potential stop codons "TAG", "TAA", or "TGA" that occur in the sequence (on the forward strand only, for this example).

Our strategy will be as follows: First, we’ll need to open the file and read the sequence from the first line. We’ll need to keep a counter of the number of stop codons that we see; this counter will start at zero and we’ll add one to it for each "TAG", "TAA", or "TGA" subsequence we see. To find these three possibilities, we can use a for-loop and string slicing to inspect every 3bp subsequence of the sequence; the 3bp sequence at index 0 of seq occurs at seq[0:3], the one at position 1 occurs at seq[1:4], and so on.

We must be careful not to attempt to read a subsequence that doesn’t occur in the sequence. If seq = "AGAGAT", there are only four possible 3bp sequences, and attempting to select the one starting at index 4, seq[4:7], would not produce an error: slicing past the end of a string silently returns the shorter string "AT", which is not a valid codon and could quietly skew our results. To make matters worse, string indexing starts at 0, and there are also the peculiarities of the inclusive/exclusive nature of [] slicing and the range() function!
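A quick check of what slicing does near the end of this example string (a sketch; the variable names are illustrative):

```python
seq = "AGAGAT"

# The four complete 3bp windows start at indices 0 through 3.
windows = []
for index in range(0, len(seq) - 3 + 1):
    windows.append(seq[index:index + 3])
print(windows)          # ['AGA', 'GAG', 'AGA', 'GAT']

# Slicing past the end raises no error; it just returns what is there.
past_end = seq[4:7]
print(past_end)         # AT
```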

To help out, let’s draw a picture of an example sequence, with various indices and 3bp subsequences we’d like to look at annotated.

II.5_9_seq_stop_windows

Given a starting index index, the 3bp subsequence is defined as seq[index:index + 3]. For the sequence above, len(seq) would return 15. The first start index we are interested in is 0, while the last start index we want to include is 12, or len(seq) - 3. If we were to use the range() function to produce the start indices we are interested in, we would use range(0, len(seq) - 3 + 1), where the + 1 accounts for the fact that range() includes the first index, but is exclusive in the last index.[4]
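The index arithmetic can be confirmed on any 15-base string. The sequence below is a made-up stand-in, not the one from the figure.

```python
seq = "ATAGCTAGCTGAAAT"      # hypothetical 15-base sequence
print(len(seq))              # 15

start_indices = list(range(0, len(seq) - 3 + 1))
print(start_indices[0], start_indices[-1])   # 0 12

first_codon = seq[0:3]
last_codon = seq[12:15]      # the slice end, 15, is exclusive
print(first_codon, last_codon)
```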

We should also remember to run .strip() on the read sequence, as we don’t want the inclusion of any \n newline characters messing up the correct computation of the sequence length!

Notice in the code below (which can be found in the file stop_count_seq.py) the commented-out line #print(codon).

II.5_10_py_49_stop_count_ex

While coding, we used this line to print each codon to be sure that 3bp subsequences were reliably being considered, especially the first and last in seq.txt (ATA and AAT). This is an important part of the debugging process because it is easy to make small “off-by-one” errors with this type of code. When satisfied with the solution, we simply commented out the print statement.
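The strategy described in this section could be sketched as below. This is not necessarily the contents of stop_count_seq.py: the function name is an assumption, and a literal string with a trailing newline stands in for the line that would be read from the file, so that the sketch runs on its own.

```python
def count_stop_codons(seq):
    """Count TAG, TAA, and TGA over every 3bp window of seq (forward strand)."""
    stop_counter = 0
    for index in range(0, len(seq) - 3 + 1):
        codon = seq[index:index + 3]
        #print(codon)
        if codon == "TAG" or codon == "TAA" or codon == "TGA":
            stop_counter = stop_counter + 1
    return stop_counter

# Stand-in for the first line of the input file, including its newline.
line = "ATAGCTAGCTGAAAT\n"
seq = line.strip()           # remove the newline before measuring the length

num_stops = count_stop_codons(seq)
print("Number of stop codons:", num_stops)
```

For this made-up sequence the stops are found at indices 1 (TAG), 5 (TAG), and 9 (TGA).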

For windowing tasks like this, it can occasionally be easier to access the indices with a while-loop.

II.5_11_py_49_2_stop_count_ex_while

If we wished to access nonoverlapping codons, we could use index = index + 3 rather than index = index + 1 without any other changes to the code. Similarly, if we wished to inspect 5bp windows, we could replace instances of 3 with 5 (or use a windowsize variable).
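A while-loop version with the step and window size pulled out as parameters might look like this sketch. The function and parameter names are assumptions, and the example sequence is made up.

```python
def count_stops_while(seq, step=1, windowsize=3):
    """Count stop codons, moving `step` positions between windows."""
    stop_counter = 0
    index = 0
    while index <= len(seq) - windowsize:     # last valid start index
        codon = seq[index:index + windowsize]
        if codon == "TAG" or codon == "TAA" or codon == "TGA":
            stop_counter = stop_counter + 1
        index = index + step
    return stop_counter

seq = "ATAGCTAGCTGAAAT"
overlapping = count_stops_while(seq, step=1)      # every window
nonoverlapping = count_stops_while(seq, step=3)   # codons at 0, 3, 6, 9, 12
print(overlapping, nonoverlapping)
```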

§ Exercises

  1. The molecular weight of a single-stranded DNA string (in g/mol) is (count of "A")*313.21 + (count of "T")*304.2 + (count of "C")*289.18 + (count of "G")*329.21 – 61.96 (to account for removal of one phosphate and the addition of a hydroxyl on the single strand).

    Write code that prints the total molecular weight for the sequence in the file seq.txt. The result should be 21483.8. Call your program mol_weight_seq.py.

  2. The file seqs.txt contains a number of sequences, one sequence per line. Write a new Python program that prints the molecular weight of each on a new line. For example:
    ii-5_11b_py_out_ex

    You may wish to use substantial parts of the answer for question 1 inside of a loop of some kind. Call your program mol_weight_seqs.py.

  3. The file ids_seqs.txt contains the same sequences as seqs.txt; however, this file also contains sequence IDs, with one ID per line followed by a tab character (\t) followed by the sequence. Modify your program from question 2 to print the same output, in a similar format: one ID per line, followed by a tab character, followed by the molecular weight. The output format should thus look like so (but the numbers will be different, to avoid giving away the answer):
    II.5_12_py_49_2_mol_weight_out_ex

    Call your program mol_weight_ids_seqs.py.

    Because the tab characters cause the output to align differently depending on the length of the ID string, you may wish to run the output through the command line tool column with a -t option, which automatically formats tab-separated input.
    II.5_13_py_50_mol_weight3_output

  4. Create a modified version of the program in question 3 of chapter 15, “Collections and Looping, Part 1: Lists and for,” so that it also identifies the locations of subsequences that are self-overlapping. For example, "GAGA" occurs in "GAGAGAGAGATATGAGA" at positions 1, 3, 5, 7, and 14.

  [1] Python is one of the only languages that require blocks to be delineated by indentation. Most other languages use pairs of curly brackets to delineate blocks. Many programmers feel this removes too much creativity in formatting of Python code, while others like that it enforces a common method for visually distinguishing blocks.
  [2] In the absence of parentheses for grouping, and takes precedence over or, much like multiplication takes precedence over addition in numeric algebra. Boolean logic, by the way, is named after the nineteenth-century mathematician and philosopher George Boole, who first formally described the “logical algebra” of combining truth values with connectives like “and,” “or,” and “not.”
  [3] You can see this for yourself: print(0.2*0.2/0.2 == 0.2) prints False! Some mathematically oriented languages are able to work entirely symbolically with such equations, bypassing the need to work with numbers at all. This requires a sophisticated parsing engine but enables such languages to evaluate even generic expressions like x*x/x == x as true.
  [4] Yes, this sort of detailed logical thinking can be tedious, but it becomes easier with practice. Drawing pictures and considering small examples is also invaluable when working on programming problems, so keep a pencil and piece of paper handy.