Parsing assignment: UCSC BME 205 Fall 2012

Parsing assignment
Due Friday, 2012 Oct 12

UCSC BME 205 Fall 2012
Bioinformatics: models and algorithms

(Last Update: 23:33 PDT 4 October 2012 )

Purpose

The purpose of this assignment is for you to build re-usable parsers for reading some of the most common sequence formats, so that you will have them available for future assignments.

Two concepts are important here: creating reusable modules and what the popular sequence formats are.

The program for this assignment should be built starting from the template created in the first assignment. Note that it need not be the template that you created for the first assignment, but if you start with someone else's template, they need to be credited in writing—their name should still be in the source code!

The formats: FASTA and FASTQ

The two most popular formats for sequences are FASTA (in many variants) and FASTQ (with two variants). You will need to do input and output of FASTA and FASTQ files for this assignment. As always with bioinformatics programs, you should be tolerant of minor variations in the format on input, but very strict on output.

There are four pieces of data associated with a sequence in a file:

The name of the sequence, which usually is a string that contains neither spaces nor commas, but may have other special characters in it (NCBI loves using the "|" character in sequence names). For the purposes of this assignment, the name should be considered atomic, and not broken into smaller pieces.
Extra information about the sequence. This is a string which may have any information in it, but is usually prohibited from having newlines. Again, some programs structure this extra information (in various incompatible ways), but for this assignment, it is an uninterpreted string.
The sequence itself, consisting of either nucleotides or amino acids. Various wild cards are allowed, the special symbol "*" for stop codons and the special symbol "-" for gaps (which are interpreted differently in different programs).
Sequence of quality information, generally integers that are rounded approximations to
–10 log₁₀ P(character is wrong)

FASTA format

FASTA format has ID lines starting with ">" in the first column and sequence lines. The ID extends from immediately after the ">" on the id line to the first white space or comma. The extra information is everything after the id and separator on the id line. Sequence lines contain the sequence, and are not allowed to contain ">" characters, but may have white space that is ignored.

Rather than giving you a precise definition of the FASTA format, I point you to the OK description on Wikipedia's FASTA format page. Go read it now. On that page, they say

Some databases and bioinformatics applications do not recognize these comments and follow the NCBI FASTA specification.

You should follow the NCBI FASTA specification for output of FASTA, since most programs don't understand the ';' comments. But one of NCBI's suggestions is both good and bad advice. They recommend

It is recommended that all lines of text be shorter than 80 characters in length.

Limiting lines in a file to a maximum length helps avoid problems with buffer overflow in programs where a fixed-length array is allocated to hold an input line. Unfortunately, it is not possible to keep all lines to below 80 characters. Since the extra information on the ID line may be arbitrarily long, and only one ID line is permitted per sequence, it is not possible to keep the ID line to 80 characters.

Some FASTA-derived formats, such as UCSC's A2M format extend the alphabet (adding "." in A2M), and some programs do not accept parts of the standard format (such as "*"). Your first FASTA parser should accept any FASTA file that follows the NCBI conventions, and should properly ignore white space within sequences. It should also ignore any extraneous characters (like the "." of A2M format). You can add options, if you want, to modify what alphabet is considered acceptable.

I have occasionally encountered a badly written program that can't handle having white space or newlines breaking up a sequence, in which case you need to put out the entire sequence on one line. There should probably be a command-line option in any program that outputs FASTA to choose between line-length limitations and sequence-on-a-single-line.

Note that the FASTA format provides name, extra data, and sequence information, but does not provide quality information. When quality information is needed with FASTA files, two files are used. One is the straight FASTA file with names, extra data, and sequences, and the other is a file with ">" lines having the same ids in the same order, but having white-space-separated integers instead of bases or amino acids. Note that the number of integers is expected to be identical to the number of bases or amino acids for the corresponding sequence in the sequence file.

FASTQ format

The Wikipedia article on FASTQ format is a pretty good description of the format.

Some important points to note:

On input, you need to be able to handle quality sequences that are on multiple lines, even ones that have "@" or "+" characters at the beginnings of the line. The key to getting this right is that a quality sequence must be exactly the same length as the preceding nucleotide sequence, and the "+" and "@" characters can't appear in nucleotide sequences.
On output, FASTQ sequences should not have line breaks, so that other programs with less careful parsers won't break. I've seen some break anyway, since the first quality character can be a "+" or "@", and some parsers are so fragile that they can't even handle that case.
There are two different ways to convert Phred quality scores to FASTQ quality letters: Phred+33 and Phred+64. Defaults for input and output should be Phred+33 format, but Phred+64 should be available as an option for either. Both formats convert a small integer that represents the quality of a base to a single ASCII character. They can be thought of in Python as chr(qual+33) and chr(qual+64).

The usual interpretation of the Phred quality number is as –log₁₀(Prob(error)), which really only makes sense for quality values bigger than 1, since if we know nothing about a base, we have about 3/4 chance of being wrong about it. Despite this, most programs expect 0 as the quality for unknown bases (though Illumina files from versions 1.5 through 1.7 used 2, encoded as 'B' in Phred+64, for unknown bases).

Solexa files (Illumina version 1.0) encoded –log₁₀(Prob(error)/(1–Prob(error))), which actually makes more sense than the standard interpretation, but standards prevail over sense, so Illumina version 1.3 onward uses –log₁₀(Prob(error)), as do other sequencing platforms.

Although you can use a dict look-up table to translate FASTQ quality values to integers, it is probably better to use the ord() function and to use chr() to covert quality values to characters.
It is possible to guess from a FastQ file which of several encoding schemes is used, by looking at the distribution of characters, but that sort of cleverness is not expected in this assignment—you can assume that the user has given you the correct choice.

The task

Create three generators in a module that you can include from other Python programs:

read_fasta, for yielding sequences from a fasta file passed in as a file-like object.
read_fastq, for yielding sequences with Phred-score quality values from a fastq file passed in as a file-like object.
read_fasta_with_quality, for yielding sequences with Phred-score quality values from a pair of files (one fasta, one similarly formatted but with white-space separated numbers instead of amino acid or nucleic acid sequences. It would probably be best if this parser used the read_fasta parser directly, so that bug fixes there would automatically propagate.

Your generators should accept arguments that affect how they parse (for example, which characters are kept in sequences, whether other characters are ignored or treated as errors, whether case is preserved, and so forth). You will probably want to extend and modify these options for different assignments over the quarter, so it is more important now just to get the parsers working than to dream up lots of complications.

You may choose any convenient internal format for yielding the sequences, but you should be sure to include the sequence id, any comments that are on the id line, and the sequence itself. Three obvious choices are tuples: (id, comment, sequence) or (id, comment, sequence, quality sequence), classes with members id, comment, sequence, quality, or dicts: {'id': id, 'comment':comment, 'sequence':sequence, 'quality':quality}.

Using classes is probably the most Pythonic choice, but tuples are easier to implement and may be easier to unpack in the for-loops that the generators will be called from.

Write a main program "fastq2fasta" that can accept a FASTQ file and convert it either to FASTA or to a pair of files FASTA and quality.

Write another main program "fasta2fastq" that can accept a FASTA plus quality file and convert the pair to FASTQ.

I will test your programs with the following commands:

fastq2fasta < foo.fastq  > foo.fasta
fastq2fasta --33 --in foo.fastq --out foo.fasta --qual foo.qual
fastq2fasta --64 --in foo.fastq --out foo.fasta --qual foo.qual

fasta2fastq --64 --in foo.fasta --qual foo.qual --out foo.fastq
fasta2fastq --33 --in foo.fasta --qual foo.qual --out foo.fastq

You may find it useful to do other variants (merging datafiles, automatically generating extensions, detecting errors in the input, ...), but the core of this assignment is producing reusable sequence parsers and sequence output in a couple of the most common formats. Get the core functionality working completely before you try adding bells and whistles. Core functionality and readable, simple code is more important than error checking or reporting. Note that empty files and empty sequences are legal (even common) use cases, and should not cause crashes.

Because the specification here is deliberately vague about the precise output format to use, I will be checking output by using your program to convert from one format to another, then my program to covert back. I will expect the results of that conversion to be identical to doing both conversions with my programs. If your programs produce legal files with the correct data in them, I should be able to read them and get identical results, even if the formatting choices are different from the choices I make.

There are a few fastq test files for your use in http://users.soe.ucsc.edu/~dbernick/bme205/ Note: I believe that 7_1.fastq is deliberately bad. I've also provided tiny.fasta, tiny.qual, and tiny.fastq files for quicker testing (tiny.fastq is a Phred+33 file and has the same data as tiny.fasta and tiny.qual, but may stress some fastq parsers).

Note: the following text is borrowed from David Bernick's assignment in Fall 2011:

Think carefully about boundary conditions: an empty file, an id line with a zero-length sequence, a mis-ordered FASTQ file (ID lines of sequence and quality lines are both specified but not equal). For this exercise we will not allow quality lines without an ID specified.

The program should also report to stderr (not stdout—we don't want to contaminate the output files) any anomalies that might be found, and the assumptions that are made about the input file. There are no specific requirements for what you must report, but some things to consider include the following: Is it a PHRED+33 or PHRED+64 file? How many records were found? How many bad records were found (missing IDs, duplicate IDs, wrong number of quality scores, sequence without quality or quality section without sequence, ...)?

Any error messages to stderr should be fairly terse. You probably want to report only once errors that are likely to be in many sequences (like quality score out of range for PHRED+33 or illegal character in sequence). I would like to see short informative messages like

ERROR: '@' not in first column for 'HWI-ST611_0189:1:1101:1237:2140#0/1'
ERROR: not FASTA format, sequence line before first id line.
WARNING: input has no sequences.
WARNING: input has 3 empty sequences.

Parsing assignment Due Friday, 2012 Oct 12