Parsing assignment: UCSC BME 205 Fall 2015

Parsing assignment
Due Friday, 2015 Oct 9

UCSC BME 205 Fall 2015
Bioinformatics: models and algorithms

(Last Update: 10:27 PDT 12 October 2015 )

Purpose

The purpose of this assignment is for you to build re-usable parsers for reading some of the most common sequence formats, so that you will have them available for future assignments.

Two concepts are important here: creating reusable modules and what the popular sequence formats are.

The program for this assignment should be built starting from the template created in the first assignment. Note that it need not be the template that you created for the first assignment, but if you start with someone else's template, they need to be credited in writing—their name should still be in the source code!

In Python 3, a distinction is made between strings (which contain Unicode characters) and byte strings (which contain single-byte characters). The formats discussed here (FASTA and FASTQ) are byte-encoded, not Unicode, so should be read as bytes, not strings. If you want strings, the conversion of a byte-string to a Unicode string is

  b'hello'.decode('utf-8')

The formats: FASTA and FASTQ

The two most popular formats for sequences are FASTA (in many variants) and FASTQ (with two variants). You will need to do input and output of FASTA and FASTQ files for this assignment. As always with bioinformatics programs, you should be tolerant of minor variations in the format on input, but be very strict on output.

There are four pieces of data associated with a sequence in a file:

The name of the sequence, which usually is a string that contains neither spaces nor commas, but may have other special characters in it (NCBI loves using the "|" character in sequence names—a very bad choice if the sequence names ever get used in command lines or as file names). For the purposes of this assignment, the name should be considered atomic, and not broken into smaller pieces.
Extra information about the sequence. This is a string which may have any information in it, but is usually prohibited from having newlines. Again, some programs structure this extra information (in various incompatible ways), but for this assignment, it is an uninterpreted string.
The sequence itself, consisting of either nucleotides or amino acids. Various wild cards are allowed, the special symbol "*" for stop codons and the special symbol "-" for gaps (which are interpreted differently in different programs). Programs generally have an accepted alphabet they can handle and various ways of treating unexpected characters in the input.
Sequence of quality information (Q-values), generally integers that are rounded approximations to
–10 log₁₀ P(character is wrong) .
For your module to be useful, you must use an internal format that represents the Q-values as integers, not as some string encoding the values. String representations should only be present in the input or output files, not the internal representation.

FASTA format

FASTA format has ID lines starting with ">" in the first column and sequence lines. The ID extends from immediately after the ">" on the id line to the first white space or comma. Note that this definition of the ID means that an id cannot contain a space or a comma, and that putting a space after the ">" results in the empty string for the ID. The extra information is everything after the id and separator on the id line. Sequence lines contain the sequence, and are not allowed to contain ">" characters, but may have white space that is ignored. There may be many lines of sequence after an ID line—the newlines, like other white space, are ignored.

Rather than giving you a precise definition of the FASTA format, I point you to the OK description on Wikipedia's FASTA format page. Go read it now. On that page, they say

Some databases and bioinformatics applications do not recognize these comments and follow the NCBI FASTA specification.

You should follow the NCBI FASTA specification for output of FASTA, since most programs don't understand the ';' comments. The principle of being forgiving on input but strict on output suggests that any comments that start with ';' should be discarded, not copied to the output.

But one of NCBI's suggestions is both good and bad advice. They recommend

It is recommended that all lines of text be shorter than 80 characters in length.

Limiting lines in a file to a maximum length helps avoid problems with buffer overflow in programs where a fixed-length array is allocated to hold an input line. Unfortunately, it is not possible to keep all lines to below 80 characters. Since the extra information on the ID line may be arbitrarily long, and only one ID line is permitted per sequence, it is often not possible to keep the ID line to 80 characters.

Furthermore, some programs don't understand new lines or even spaces in sequences, expecting there to be exactly one line for a sequence, with every character being a position in the sequence. This single-line format is not standard FASTA, but it might be handy to provide it as an output option, so that you can use your program to clean up FASTA files for programs that have such bad input parsers. There should probably be a command-line option in any program that outputs FASTA to choose between line-length limitations and sequence-on-a-single-line.

Some FASTA-derived formats, such as UCSC's A2M format extend the alphabet (adding "." in A2M), and some programs do not accept parts of the standard format (such as "*"). Your first FASTA parser should accept any FASTA file that follows the NCBI conventions, and should properly ignore white space within sequences. It should also ignore any extraneous characters (like the "." of A2M format). You can add options, if you want, to modify what alphabet is considered acceptable.

Note that the FASTA format provides name, extra data, and sequence information, but does not provide quality information. When quality information is needed with FASTA files, two files are used. One is the straight FASTA file with names, extra data, and sequences, and the other is a file with ">" lines having the same ids in the same order, but having white-space-separated integers instead of bases or amino acids. Note that the number of integers is expected to be identical to the number of bases or amino acids for the corresponding sequence in the sequence file. Having different lengths for a sequence and the corresponding quality sequence indicates that the file pair has been corrupted (or mis-parsed).

FASTQ format

The FASTQ format is used only with nucleotide alphabets, unlike FASTA, which is used with both nucleotide and amino-acid alphabets. Note that the nucleotide alphabet may include bases other that A, C, G, and T (like "H", which means "anything except G").

The Wikipedia article on FASTQ format is a pretty good description of the format and the variant meanings people have given it. It also has some methods for guessing what encoding was used in a fastq file.

Some important points to note:

The FASTQ format is poorly designed, since the punctuation characters "+" and "@" are also legal quality values.
The FASTQ format consists of records, each of which has the data for one sequence and the corresponding quality information. If the line that introduces the quality data has an identifier, it should match the identifier on the id line for the nucleotide sequence. Your program does not need to check this, but you could add it as an extra feature, once everything else is working correctly.
In past years, I required that students be able to handle fastq files in which nucleotide or quality sequences were on multiple lines, which makes parsing a bit more complicated, since a quality line might be broken at a place where the next quality character is an "@" or "+". Such fastq files are rare now, so we'll simplify the specifications: you only need to parse fastq files where each sequence is represented by exactly 4 lines: an id line, a sequence line, a "+" separator line, and a quality line. Note that the "+" line contains an optional id. If it has an id, the id should match that of the "@" line.
On output, FASTQ sequences should not have line breaks in either the nucleotide sequence or the quality sequence, so that other programs can read them. I've seen some break anyway, since the first quality character can be a "+" or "@", and some parsers are so fragile that they can't handle that case even when the nucleotide and quality sequences are single lines.
There are two different ways to convert Phred quality scores to FASTQ quality letters: Phred+33 and Phred+64. Defaults for input and output should be Phred+33 format, but Phred+64 should be available as an option for either. Both formats convert a small integer that represents the quality of a base to a single byte They can be thought of in Python as chr(qual+33) and chr(qual+64). The conversion in the other direction is ord(c)-33 or ord(c)-64.

The usual interpretation of the Phred quality number is as –log₁₀(Prob(error)), which really only makes sense for quality values bigger than 1, since if we know nothing about a base, we have about 3/4 chance of being wrong about it. Despite this, most programs expect 0 as the quality for unknown bases (though Illumina files from versions 1.5 through 1.7 used 2, encoded as 'B' in Phred+64, for unknown bases).

Solexa files (Illumina version 1.0) encoded –log₁₀(Prob(error)/(1–Prob(error))), which actually makes more sense than the standard interpretation, but standards prevail over sense, so Illumina version 1.3 onward uses –log₁₀(Prob(error)), as do other sequencing platforms.

Although you could use a dict look-up table to translate FASTQ quality values to integers, it is better to use the ord() function to convert to integers and to use chr() to covert quality values to characters.
It is possible to guess from a FastQ file which of several encoding schemes is used, by looking at the distribution of characters, but that sort of cleverness is not expected in this assignment—you can assume that the user has given you the correct choice.

End trimming

One thing that quality values are often used for is to "trim" reads from a sequencer to remove lower-quality data that could contaminate an assembly. Better assemblers use all the data, but weight it according to quality, but when you have a lot of sequence data it is sometimes simpler just to throw out some of it than to try to figure out how much weight to give to junk. There are several variants on throwing out bad data, depending on the types of errors the sequencer technology makes. The simplest is just to discard any read that has low quality bases in it, but this is rarely a useful option. More often, with sequencing-by-synthesis technologies the quality of a read starts out high and gradually gets worse as the read progresses and the synthesis produces more and more out-of-sync molecules to contaminate the signal. With these technologies, it can be beneficial to remove the last few bases of the read—starting with the first low-quality base. This is known as end trimming. There are variants of end trimming that will remove initial bad bases as well as final bad bases (returning the longest substring in which all bases have sufficient quality), but for this assignment we'll only look at removing final bases.

Factoring your code

It seems that many introductory programming courses are only teaching low-level coding, and that many students come into this course with no practice in thinking about the structure of their programs. I don't want to provide scaffolding for the students on this assignment—the goal here is for students to learn to build their own program structures.

There are some basic principles of engineering design that are important here: breaking a problem into subproblems and having simple, well-defined interfaces between the subproblems.

There are conceptually three parts to the assignment that might serve as suitable subproblems: parsing input, doing end trimming, and producing output. Try to keep them as separate from each other as possible: the code that does trimming should not need to know what the input or output formats are; the input parser should not need to know what is going to be done with the sequences; and the output routines should not need to know where the sequences came from.

Make sure that your internal representation that is yielded by the parsers, manipulated by the endtrimmer, and converted to output by the output routines is exactly the same no matter which formats are used for input and output. Define (and document!) that internal representation first—so that it can be used as part of the interface between the subproblems.

The task

Create three generators in a module that you can include from other Python programs:

read_fasta, for yielding sequences from a fasta file passed in as a file-like object.
read_fastq, for yielding sequences with Phred-score quality values from a fastq file passed in as a file-like object.
read_fasta_with_quality, for yielding sequences with Phred-score quality values from a pair of files (one fasta, one similarly formatted but with white-space-separated numbers instead of amino acid or nucleic acid sequences. It would probably be best if this parser used the read_fasta parser directly, so that bug fixes there would automatically propagate.

The module should be a separate file whose name ends with ".py", so that it can be included in your main Python program. You'll be reusing this module a lot this quarter, so try to make it both simple and robust.

Your generators should accept arguments that affect how they parse (for example, which characters are kept in sequences, whether other characters are ignored or treated as errors, whether case is preserved, and so forth). You will probably want to extend and modify these options for different assignments over the quarter, so it is more important now just to get the parsers working than to dream up lots of complications. For this assignment, accept both upper- and lower-case alphabets (any ASCII letter) and "*". Also preserve the case—later assignments may require case conversion or even removing lower-case letters.

You may choose any convenient internal format for yielding the sequences, but you should be sure to include the sequence id, any comments that are on the id line, and the sequence itself. Three obvious choices are tuples: (id, comment, sequence) or (id, comment, sequence, quality sequence), classes with members id, comment, sequence, quality, or dicts: {'id': id, 'comment':comment, 'sequence':sequence, 'quality':quality}. In all of these the quality sequence should be a list or tuple of ints, with exactly as many ints as there are characters in the sequence.

Using classes is probably the most Pythonic choice, but tuples are easier to implement and may be easier to unpack in the for-loops that the generators will be called from. I generally use tuples for these parsers, because of the easy of unpacking them:

for id,seq,comment,qual in read_fasta_with_quality(sys.stdin):
    # do something with one sequence

Write a program "endtrim" that reads a fastq file or fasta/quality file pair and echoes it to output in either format, trimming each read to end just before the first low-quality base. Note that if the first base is bad, then the trimmed read consists of the empty sequence. The input and output formats and the threshold for quality are all determined by command-line arguments. For example,

endtrim --min_q 30 --in_33 foo.fastq --out_fasta foo.fa --out_qual foo.qual

--min_qual [integer]: The lowest quality value that can appear in the output. If a base of lower quality appears in an input sequence, the sequence is truncated to remove all bases at that position or later.
--in_33 [filename]: Input file in fastq format, using Phred+33 coding
--in_64 [filename]: Input file in fastq format, using Phred+64 coding
--out_33 [filename]: Output file in fastq format, using Phred+33 coding
--out_64 [filename]: Output file in fastq format, using Phred+64 coding
--in_fasta [filename]: Input file in fasta format. If used, there must also be a --in_qual argument.
--in_qual [filename]: Input file in fasta quality format. If used, there must also be a --in_fasta argument.
--out_fasta [filename]: Output file in fasta format.
--out_qual [filename]: Output file in fasta quality format. If used, there must also be a --out_fasta argument. It is possible to have a --out_fasta without --out_qual, in which case the quality information is quietly discarded. This is a fairly common use case, when converting data for programs that can't use quality information.

Note that the "33" and "64" modifiers always refer to how quality information is encoded in the fastq files. The .qual files paired with .fasta files are always encoded as decimal numbers, not with the single-character encodings of the fastq format.

The version of the program I wrote treats the input formats as mutually exclusive, but allows multiple output formats. It also defaults to using in_33 from sys.stdin, if no inputs are specified and out_33 to sys.stdout, if no outputs are specified. I won't be checking these things (all commands will give explicit input and output files), but more advanced students may want to think about how to do these things cleanly and concisely.

You may find it useful to do other variants (merging datafiles, automatically generating extensions, detecting errors in the input, ...), but the core of this assignment is producing reusable sequence parsers and sequence output in a couple of the most common formats. Get the core functionality working completely before you try adding bells and whistles. Core functionality and readable, simple code is more important than error checking or reporting. Note that empty files and empty sequences are legal (even common) use cases, and should not cause crashes.

One edge condition to think about: what should be done if the first base of a sequence is low quality, so that the trimmed read is the empty sequence? Should the read be kept (as an empty sequence) or discarded? I'm going to argue for keeping the read, as reads are often part of read pairs, and it is easier to stay in sync if both paired reads are present, even if one is empty.

Because the specification here is deliberately vague about the precise output format to use, I will be checking output by using your program to do end trimming and convert from one format to another, then my program to covert back. I will expect the results of that conversion to be identical to doing both conversions with my programs. If your programs produce legal files with the correct data in them, I should be able to read them and get identical results, even if the formatting choices are different from the choices I make.

There are a few fasta and fastq test files for your use in hw2-files, also accessible on the SoE computers as ~karplus/.html/bme205/f15/hw2-files/ Among the test cases are tiny.fa, tiny.qual, tiny.fq33, and tiny.f64, which should all contain the same data in different formats. The zero-length file, null, should produce a zero-length file as output no matter what input format and output format are requested. The empty sequences (for fasta and qual in empty.fa, for fastq formats in empty.fastq) should interconvert also. The sequences in f.fa and f.qual test multi-line handling and case sensitivity. The larger file hot.fq33 can be used for timing tests—it is a real sequence file from a PacBio sequencing run. I will use these files as test cases, and probably some others. You will probably want to generate your own small test cases for your code as well.

Submit your code as file "endtrim" in a directory "BME205/hw2/" in your home directory. If you include test files in your directory, I may include them in the test suite I evaluate everyone's code with. Make sure that your home directory, BME205, and hw2 are all have permission codes that allow others to execute them (readability is also nice but not required). The UNIX permission code "755" is a good one for these directories. The files in the directory should also be readable, and "endtrim" itself should be executable.

Note: the following text is borrowed from David Bernick's assignment in Fall 2011, with some modifications:

Think carefully about boundary conditions: for example, an empty file, an empty string as the id, or an id line with a zero-length sequence (all of which are legal inputs). Think also about illegal inputs, like a mis-ordered FASTQ file (ID lines of sequence and quality lines are both specified but not equal), or a FASTA file that contains extraneous characters (like numbers in nucleotide sequence). Your programs must be able to handle any legal input, but its behavior may be undefined on illegal inputs.

The program should also report to stderr (not stdout—we don't want to contaminate the output files) any anomalies that are found, and the assumptions that are made about the input file. There are no specific requirements for what you must report, but some things to consider include the following: Is it a PHRED+33 or PHRED+64 file? How many records were found? How many bad records were found (missing IDs, duplicate IDs, wrong number of quality scores, sequence without quality or quality section without sequence, ...)?

Start with no reporting of extra information or errors, and gradually add information and error checks as you have time or see the need. It is possible to get tangled up in doing too much error checking initially, when your focus should be on clean, readable code.

Any error messages to stderr should be fairly terse. You probably want to report errors only once that are likely to be in many sequences (like quality score out of range for PHRED+33 or illegal character in sequence). I would like to see short informative messages like

ERROR: '@' not in first column for 'HWI-ST611_0189:1:1101:1237:2140#0/1'
ERROR: not FASTA format, sequence line before first id line.
WARNING: input has no sequences.
WARNING: input has 3 empty sequences.

Note: it would probably better to give the names of the three empty sequences. Line numbers can also be useful in error messages, though they require a bit more forethought in designing your parser—look up the Python "enumerate" function.

I generally reserve "ERROR" for serious problems that should shut down a pipeline and prevent further processing until the problem has been fixed manually. I use "WARNING" for things that may be legal, but are rare and ought to be checked, as they may result from a file having been contaminated or truncated.

Things to change next year

Be more consistent in use of "could", "should", or "must" to distinguish required features from optional, but desirable ones. Perhaps pull out a list of minimal requirements, and a separate list of desirable enhancements.

Provide some test sequences that catch the off-by-one error in trimming (people doing >min_qual rather than ≥min_qual). Also test for fasta with blank first line, "@" or ">" within id.

Parsing assignment Due Friday, 2015 Oct 9

UCSC BME 205 Fall 2015 Bioinformatics: models and algorithms