Parsing assignment
Due Friday, 2014 Oct 17

UCSC BME 205 Fall 2014
Bioinformatics: models and algorithms

(Last Update: 07:39 PDT 22 October 2014 )

Purpose

The purpose of this assignment is for you to build re-usable parsers for reading some of the most common sequence formats, so that you will have them available for future assignments.

Two concepts are important here: creating reusable modules and what the popular sequence formats are.

The program for this assignment should be built starting from the template created in the first assignment. Note that it need not be the template that you created for the first assignment, but if you start with someone else's template, they need to be credited in writing—their name should still be in the source code!

The formats: FASTA and FASTQ

The two most popular formats for sequences are FASTA (in many variants) and FASTQ (with two variants). You will need to do input and output of FASTA and FASTQ files for this assignment. As always with bioinformatics programs, you should be tolerant of minor variations in the format on input, but very strict on output.

There are four pieces of data associated with a sequence in a file:

  1. The name of the sequence, which usually is a string that contains neither spaces nor commas, but may have other special characters in it (NCBI loves using the "|" character in sequence names—a very bad choice if the sequence names ever get used in command lines or as file names). For the purposes of this assignment, the name should be considered atomic, and not broken into smaller pieces.
  2. Extra information about the sequence. This is a string which may have any information in it, but is usually prohibited from having newlines. Again, some programs structure this extra information (in various incompatible ways), but for this assignment, it is an uninterpreted string.
  3. The sequence itself, consisting of either nucleotides or amino acids. Various wild cards are allowed, the special symbol "*" for stop codons and the special symbol "-" for gaps (which are interpreted differently in different programs). Programs generally have an accepted alphabet they can handle and various ways of treating unexpected characters in the input.
  4. Sequence of quality information (Q-values), generally integers that are rounded approximations to
    –10 log10 P(character is wrong) .
    For your module to be useful, you must use an internal format that represents the Q-values as integers, not as some string encoding the values. String representations should only be present in the input or output files, not the internal representation.

FASTA format

FASTA format has ID lines starting with ">" in the first column and sequence lines. The ID extends from immediately after the ">" on the id line to the first white space or comma. The extra information is everything after the id and separator on the id line. Sequence lines contain the sequence, and are not allowed to contain ">" characters, but may have white space that is ignored.

Rather than giving you a precise definition of the FASTA format, I point you to the OK description on Wikipedia's FASTA format page. Go read it now. On that page, they say

Some databases and bioinformatics applications do not recognize these comments and follow the NCBI FASTA specification.

You should follow the NCBI FASTA specification for output of FASTA, since most programs don't understand the ';' comments. The principle of being forgiving on input but strict on output suggests that any comments that start with ';' should be discarded, not copied to the output.

But one of NCBI's suggestions is both good and bad advice. They recommend

It is recommended that all lines of text be shorter than 80 characters in length.
Limiting lines in a file to a maximum length helps avoid problems with buffer overflow in programs where a fixed-length array is allocated to hold an input line. Unfortunately, it is not possible to keep all lines to below 80 characters. Since the extra information on the ID line may be arbitrarily long, and only one ID line is permitted per sequence, it is not possible to keep the ID line to 80 characters.

Furthermore, some programs don't understand new lines or even spaces in sequences, expecting there to be exactly one line for a sequence, with every character being a position in the sequence. This single-line format is not standard FASTA, but it might be handy to provide it as an output option, so that you can use your program to clean up FASTA files for programs that have such bad input parsers. There should probably be a command-line option in any program that outputs FASTA to choose between line-length limitations and sequence-on-a-single-line.

Some FASTA-derived formats, such as UCSC's A2M format extend the alphabet (adding "." in A2M), and some programs do not accept parts of the standard format (such as "*"). Your first FASTA parser should accept any FASTA file that follows the NCBI conventions, and should properly ignore white space within sequences. It should also ignore any extraneous characters (like the "." of A2M format). You can add options, if you want, to modify what alphabet is considered acceptable.

Note that the FASTA format provides name, extra data, and sequence information, but does not provide quality information. When quality information is needed with FASTA files, two files are used. One is the straight FASTA file with names, extra data, and sequences, and the other is a file with ">" lines having the same ids in the same order, but having white-space-separated integers instead of bases or amino acids. Note that the number of integers is expected to be identical to the number of bases or amino acids for the corresponding sequence in the sequence file. Having different lengths for a sequence and the corresponding quality sequence indicates that the file pair has been corrupted (or mis-parsed).

FASTQ format

The FASTQ format is used only with nucleotide alphabets, unlike FASTA, which is used with both nucleotide and amino-acid alphabets. Note that the nucleotide alphabet may include bases other that A, C, G, and T (like "H", which means "anything except G").

The Wikipedia article on FASTQ format is a pretty good description of the format and the variant meanings people have given it. It also has some methods for guessing what encoding was used in a fastq file.

Some important points to note:

End trimming

One thing that quality values are often used for is to "trim" reads from a sequencer to remove lower-quality data that could contaminate an assembly. Better assemblers use all the data, but weight it according to quality, but when you have a lot of sequence data it is sometimes simpler just to throw out some of it. There are several variants on throwing out bad data, depending on the types of errors the sequencer technology makes. The simplest is just to discard any read that has low quality bases in it, but this is rarely a useful option. More often, with sequencing-by-synthesis technologies the quality of a read starts out high and gradually gets worse as the read progresses and the synthesis produces more and more out-of-sync molecules to contaminate the signal. With these technologies, it can be beneficial to remove the last few bases of the read—starting with the first low-quality base. This is known as end trimming. There are variants of end trimming that will remove initial bad bases as well as final bad bases (returning the longest substring in which all bases have sufficient quality), but for this assignment we'll only look at removing final bases.

The task

Create three generators in a module that you can include from other Python programs: