Part I: Introduction to Unix/Linux

9 Sorting, First and Last Lines

Continuing with the fasta_stats examples from chapter 8, “The Standard Streams,” the seventh column of the output contains the length of the longest perfect repeat, in characters.


Which sequence contains the longest perfect repeat? To answer this question, we could consider sorting the lines according to this seventh column. (First, we’ll have to remove the header lines themselves, which we can accomplish by filtering out lines matching # using the -v flag of grep, or by grepping for unit:, as in chapter 8.) Enter sort, which sorts lines from a text file (or from standard input) by specified columns: sort <file> or ... | sort.

By default, sort sorts by all columns, by comparing the entire lines in “dictionary,” or lexicographic, order. To sort by specific columns, we need to use a rather sophisticated syntax. We’ll illustrate with a figure.


The sort utility takes many potential parameters, though the most important are the -k parameters that specify the columns by which to sort and how that sorting should be done, and occasionally the -u flag. The -k (key) parameters are considered in order; the above specifies that the sorting should be done on columns 2 through 4 (conglomerated into a single “column”), considering them in dictionary order, and sorting them in reverse. In the case of ties, only the first column is considered in normal dictionary order, and in the case of further ties, the fifth column is considered in numeric order.[1] (The difference between n and g ordering is that g can handle entries in scientific notation like 1e-6, but generally n is preferred because it is faster and not subject to small rounding errors.)

The optional -u flag (which may be specified before or after the keys, or even mixed in) specifies that after all the keys are considered, if there are still any ties between rows, then only the first row should be output. It outputs only “unique” lines according to the overall sorting order.

By default, sort uses whitespace as the column separator, though it can be changed (run man sort for more information). To view information about the longest perfect repeat, we will use sort -k7,7nr, indicating that we wish sort on the seventh column only, in reverse numeric order.


The first few lines of output indicate that the longest perfect repeat is 94 bases long and occurs in sequence PZ805359 (this sequence’s GC content is 0, because it’s composed entirely of a long AT repeat).


The results also indicate that there are a number of ties; several sequences contain perfect repeats of length 18 base pairs. If we only wanted one sequence reported per different repeat length, we could try sort -k7,7nr -u. We could alternatively modify our sort to secondarily sort by GC content (second column), sort -k7,7nr -k2,2g.

A useful trick is to perform two sorts: one that initially sorts the data on whatever criteria are wanted, and a second that gets only the first line of a group based on secondary criteria. We may wish report only the highest GC content sequence per different repeat length, for example. We can initially use a sort -k7,7nr -k2,2gr to sort by repeat length and break ties by GC content, leaving the highest-GC-content sequences on top. From there, we can use a sort -k7,7nr -u to re-sort the data (even though they are already sorted!) by the seventh column, keeping only the top line per repeat length.




There is one small concern, however: how can we be sure that our careful ordering by GC content wasn’t undone in the second sort? After all, the second sort would technically be free to reorder ties according to the seventh column, resulting in incorrect output. There is an additional flag for sort, the -s flag, indicating that stable sorting should be used. Stable sorting means that, in the case of ties, elements are left in their original order. So, to be safe, we could use a secondary sort of sort -k7,7nr -u -s, though a careful reading of the documentation for sort indicates that on most systems the -u flag implies the -s flag.

§ First and Last Lines

The head and tail utilities, like the others covered previously, write their output to standard output, and so they can be used within pipelines. They are often employed to inspect the beginning or end of a file (to check results or formatting). They also commonly extract test data sets. For example, head -n 40000 input.fastq > test.fastq would extract the first 10,000 sequence records from input.fastq and produce test.fastq (because every four lines of a FASTQ sequence file represents information for a single sequence).


The above shows the first 12 lines of a FASTQ file generated on an Illumina HiSeq 2000. The first line in each set of four represents an identifier for the sequence read, the second line contains the sequence itself, the third line is often unused (containing only a +, though it may be followed by the identifier and other optional data), and the fourth line contains the “quality” of each base in the sequence encoded as a character. (The encoding has varied in the past, but in recent years, sequencing companies have standardized the encoding used by Sanger sequencing machines.)

With a bit of modified syntax, tail can also be used to extract all lines of a file starting at a given line. As an example, tail -n +5 input.fastq > test.fastq would result in test.fastq having all but the first sequence record of input.fastq (i.e., starting at the fifth line). This feature is especially useful for stripping off header lines of output or files before further processing, as in ./fasta_stats pz_cDNAs.fasta | tail -n +9, rather than using grep -v '#' above.

§ Exercises

  1. Running fasta_stats on pz_cDNAs.fasta, the seventh column represents the length of the longest perfect repeat found in each sequence. Use only grep, sort, wc, head, and tail to determine the median value in this column. (You may need to run multiple commands or pipelines.)
  2. Running fasta_stats on pz_cDNAs.fasta, how many different perfect repeat units (column six) are found?
  3. The file pz_blastx_yeast_top10.txt is the result of running blastx -query ../fasta_stats/pz_cDNAs.fasta -db orf_trans -evalue 1e-6 -max_target_seqs 10 -max_hsps 1 -outfmt 7 -out pz_blastx_yeast_top1.txt. Aside from the “comment” lines that start with #, the first column is the query ID, the second the target (yeast) ID, the third the percentage identity of the HSP, the eleventh the E value, and the twelfth the “bitscore.” Which query ID had the largest bitscore? How many different query sequences (entries in the first column) had one or more HSPs against the database?
  4. Extract from pz_blastx_yeast_top10.txt a file called pz_blastx_yeast_top1.txt containing only the smallest E-valued HSP line per query ID. (You may remove comment lines starting with # altogether.)

  1. A word of caution: if a column contains an entry that cannot be interpreted as an integer or general number, it will be treated as 0 in the sorting order.