UCSC BME 205 Fall 2015

Bioinformatics: Models and Algorithms
Homework 4

(Last Update: 09:33 PDT 15 October 2015 )

Simple Markov chains
Due Friday 23 Oct 2015 (beginning of class).

This assignment is intended to deepen your understanding of first-order Markov chains and of coding cost. I will provide you with a data set consisting of many protein sequences in a FASTA file (see FASTA/FASTQ assignment). You will build stochastic models of these sequences as zero-order and first-order Markov chains, and measure the information gain of the first-order model over the zero-order model.

I have provided two large FASTA files: mark_f14_1.seqs and mark_f14_2.seqs for training and testing the models. Warning: these FASTA files contain amino acids in both upper and lower case, but you should ignore the case when reading the files, so that 'A' and 'a' both stand for alanine. You will do 2-fold cross-validation (training on each of the sets and testing on the other). These can be accessed on School of Engineering machines directly from the directory /soe/karplus/.html/bme205/f14/markov_files/ so that you don't need to make your own copies. (Note: the "f14" is correct---I'm reusing the files from last year.)

The combined set was derived from Dunbrack's culledpdb list of X-ray structures with resolution 1.6 Angstroms or better and R-factor <= 0.25, reduced so that no two sequences had more than 90% identity. I used only those sequences that were in our t06 template library, which primarily means that short sequences were removed. I randomly split the list of chain ids into two partitions. For full details on how the set was created, see the Makefile.

Because the sequence files are fairly large (around 680,000 characters) and many of you will have written rather slow code, I've also provided a markov_files/tiny.seqs set for quick testing.

We will be modeling the sequences including a special "stop" character at the end of the sequences that is not present in the data files. This means that the Markov chains will have probabilities that add to one over all strings that end with the stop character, rather than over strings of a fixed length. For simplicity, we'll be using different non-sequence characters to indicate the start of a sequence (^) and end of the sequence($). Note that the training and test data files do not have these special characters—the alphabet is amino acids (with a possibility of special wild cards 'X', 'B', and 'Z').

It might be a good idea in your code to have separate parameters for start and stop characters, and to allow start=None to mean that k-mers that start before the first character of the sequence are not counted (similarly for stop=None.

The simplest model we discussed in class is the i.i.d. (independent, identically distributed) model, or zero-order Markov model. It assumes that each character of the string comes from the same background distribution P_0, and the probability of a string is just the product of the probabilities of the letters of the string (including the final "$"). To estimate the background distribution, we need counts of all the letters in the training set, plus the number of strings in the training set (represented as the number of times "$" occurs).

The next simplest model was a Markov chain, in which the probability of each character depends only on the preceding character. To estimate the conditional probabilities, we'll need the counts for all pairs of letters in the training set, including initial pairs (like "^M") and ending pairs (like "K$"). Note that "^$" is a possible pair (what one would get for the empty string).

In general, for estimating a k-th order Markov chain, we need counts of all (k+1)-mers in the training data.

First program: count-kmers

Write a program called "count-kmers" that reads a FASTA file from stdin and outputs a table of counts to stdout. The output should have two tokens on each line: the k-mer followed by the count of that k-mer. Initial and final k-mers should be symmetrically included, that is, if the sequence is ABDEF and you are counting 3-mers, the counts should be

^^A 1
^AB 1
EF$ 1
F$$ 1
Note: case should be ignored in the input sequence, so that "abDE", "ABDE", and "abde" are all counted as the same 4-mer.

Hint: there are two different approaches to counting k-mers. In one, you process the file a line at a time (to distinguish id lines), then process the sequence lines a character at a time, using something like kmer = kmer[0:-1]+letter. In the other approach, you can define a generator function in Python to read a FASTA file one sequence at a time, then extract the kmers from the sequence with something like

    for (fasta_id,comment,seq) in read_fasta(genome):
	for start in range(len(seq)-k):
            counts[seq[start:start+k]] += 1
Both approaches are useful. The first one does not require large amounts of RAM when you get very long sequences (like whole plant chromosomes), and the second one can be used for other FASTA reading, where the sequences need to be processed individually in more complex ways. Given that you have already written a FASTA parser, I expect that most students will choose to use it, probably with some modifications to add optional arguments about whether case should be ignored and what alphabet is to be used.

The "count-kmers" program must accept two options:

--order (or -o)
a required argument that gives the order of the model (hence --order=0 should count 1-mers)
--alphabet (or -a)
which specifies which letters of the sequence are to be used (for example, to count only standard amino acids, use --alphabet ACDEFGHIKLMNPQRSTVWY). All characters in sequences that are not in the alphabet should be ignored. Note: you may want to use the "set" type to represent the alphabet internally. The alphabet should default to ASCII letters if not specified. Note: you want case conversion to be done before determining whether a letter is in the alphabet, so that users don't have to provide both lower-case and upper-case copies of the alphabet.

The command-line option parsing should be done with the "argparse" module. The actual counting of k-mers should be done by a function you define in a separate module (Markov.py or markov.py). The counts are most easily kept and returned in a "collections.Counter" object, which provides a very convenient way to do all sorts of counting.

If the k-mer counting function is passed a file-like object (like sys.stdin) as an argument, then you can easily reuse it on other files. Furthermore, if you use for line in input_stream: to read the input, then you can do testing by passing in a list of lines:

>>> dummy_in= '>seq1\nABCDEF\n>seq2\nAC\nde\nFG\n>seq3\n'
>>> count=get_counts(dummy_in.splitlines(True), 2, alphabet="ABCDEFGHIJK", start='^', stop='$')

Here is an example of what you should get when your run your program:

count-kmers -o0 < /soe/karplus/.html/bme205/f14/markov_files/mark_f14_1.seqs 

$ 2804
A 53783
B 1
C 8336
D 37458
E 40373
F 24864
G 49926
H 17643
I 33796
K 35127
L 54962
M 14260
N 27386
P 30034
Q 23554
R 30858
S 39767
T 35235
V 44037
W 9412
X 56
Y 22591

With a restricted alphabet (getting rid of the B and X wildcards),
count-kmers -o0 --alphabet=ACDEFGHIKLMNPQRSTVWY < /soe/karplus/.html/bme205/f14/markov_files/mark_f14_1.seqs 
should produce
$ 2804
A 53783
C 8336
D 37458
E 40373
F 24864
G 49926
H 17643
I 33796
K 35127
L 54962
M 14260
N 27386
P 30034
Q 23554
R 30858
S 39767
T 35235
V 44037
W 9412
Y 22591
Similarly, when you run
count-kmers -o1 --alphabet=ACDEFGHIKLMNPQRSTVWY < /soe/karplus/.html/bme205/f14/markov_files/tiny.seqs 
you should get
A$ 1
AC 4
AG 1
AM 7
CA 3
CD 4
DE 4
EF 4
FA 3
FG 1
GG 9
GH 1
GM 1
HI 1
IK 1
K$ 1
M$ 1
MA 6
MC 3
MM 2
MW 1
WM 1
WW 3
^A 1
^M 2

For symmetry, report as many stop characters as you report start characters (this makes it easier to make "reverse Markov" models that model the sequences in reverse order)---in both cases, up to the order of the model characters.

The order-0 models are a special case, as the normal number of start (and stop) characters included is the order of the model, but we want to have counts of the stop character even for the order-0 model. For order-0 only, report the number of stop characters as the number of sequences, but don't report the number of start characters.

count-kmers -o2 --alphabet=ACDEFGHIKLMNPQRSTVWY < /soe/karplus/.html/bme205/f14/markov_files/tiny.seqs 
you should get
A$$ 1
AM$ 1
IK$ 1
K$$ 1
M$$ 1
MA$ 1
^AC 1
^MA 1
^MC 1
^^A 1
^^M 2

Note: the output should be sorted in alphabetical order, with a single space separator, so that I can check I/O behavior using the "diff" program.

Optional bonus points:

Second program: coding-cost

Write a program that reads in a table of k-mer counts (in the same format as you output them) from stdin and FASTA-formatted sequences from a file, and computes the total encoding cost in bits for the sequences. The encoding cost of a character x using model M is -log_2 P_M(x), and the encoding cost of a sequence is the sum of the costs for the characters (note: this is equivalent to taking -log_2 of the product of the probabilities).

The calling sequence for the program should be

coding-cost tiny.seqs < mark_f14_1.count1 > tiny.cost1
The use of stdin for the table allows using count-kmers and coding-cost in a pipe:
count-kmers -order=1 --alphabet=ACDEFGHIKLMNPQRSTVWY < mark_f14_1.seqs | coding-cost tiny.seqs > tiny.cost1

Like count-kmers, coding-cost should accept an --alphabet argument to restrict the character set. The default should be the 20 standard amino acids: ACDEFGHIKLMNPQRSTVWY. Note that the file name for coding-cost is a positional argument without any "--" name in front of it, and that there is no "--order" argument. The order can be determined from the table read from stdin.

Report not only the total encoding cost for the file, but the average cost per sequence and per character. So that everyone gets the same numbers, for the per-character costs the numerator is the total cost, which includes encoding exactly one stop character at the end of each sequence (for any order, including order 0), and the denominator is the number of characters not counting the stop character. I'm not defending this as necessarily the best numerator and denominator to use for average cost/character, but it is the choice I used in generating the "right" answers.

Note that to do the computation, you need to convert the counts into probabilities. The simplest approach is to normalize the counts to sum to 1 over each "condition". Note that kmer[0:-1] is the "condition" for a given k-mer. You might think that because the FASTA training file will be large, you will not need to worry about pseudocounts, but even by order 2, there are 3-mers that have zero counts, so the maximum-likelihood estimate of their probability is 0 and the encoding cost is infinite (resulting in a "ValueError: math domain error" from python if you actually try to compute it).

The next simplest approach is to add a pseudocount to every count (including the zero counts for a context, so if you have seen "WWA", "WWC", ... but not "WWH" or "WWI", you have the context "WW" and will need to add a pseudocount for every "WWx" where x ranges over your alphabet. I recommend adding a function to your markov module that produces a conditional probability dict from a Counter (without changing the counter), so that you can easily modify the way you handle the probability estimation. The default pseudocount should be 1.0 for each cell. Note that pseudocounts get added to every kmer (whether they were zero count or not).

The simple approach of making sure that pseudocounts have been used for letters in every observed context will work fine up to about order 2, but for higher order Markov chains, you might run into k-mers for which even the k-1-mer that is the context has never been seen in the training set. (For example, mark_f14_1.seqs has a sequence starting with CYC, producing 4-mer ^CYC , but the context ^CY never occurs in mark_f14_2.seqs.)

You can fix this problem by making sure that the log-prob table has keys for all k-mers over the alphabet.

One tricky point in the log-prob table: We know that once we get to the end of a sequence and have seen the final "$", then all subsequent letters must be "$", so we do not want P("A$A") to be non-zero, even if we have seen counts for "A$$".

Hint: for this assignment, since we are not looking at the encoding cost individually for each sequence, but only the total encoding cost, you can summarize the FASTA file with the same sort of dictionary of counts that was used for count-kmers. You can determine the size of k-mer you need to count by looking at the size of the keys you read in for the model.

Run coding_cost on both sets of sequences using both order 0 and order 1 models. Get the coding costs both for the training set and for the test set. For comparison, my code with an autodetected alphabet and pseudocounts of 1.0 got

TrainTestorder 0 bits/charorder1 bits/char

The table with the constrained alphabet ACDEFGHIKLMNPQRSTVWY is

TrainTestorder 0 bits/charorder1 bits/char

Note that a simple uniform model over 20 characters would take 4.32193 bits/char, and over a 21-letter alphabet (including stop) would take 4.39232 bits/char.

Optional bonus points:

To turn in

I want copies of your programs on paper, but I don't want all the output from every run of the programs! Instead, write a short summary describing the results of the tests, giving the average encoding costs for each set of sequences for each model. Try to write your summary as if you were describing the method and the results in a journal paper that has a page limit: you want to describe the data sets and methods completely but concisely. If you do it well, the presentation should take less than a page.

Turn in your summary page with the program stapled after it. (Avoid paper clips—they have a tendency to either come off or to pick up other people's papers.)

I also want a directory name for a directory on the School of Engineering machines that I can read that contains all the programs and writeup for this assignment. Make sure that the directory and all files in it are publicly readable, and that all directories on the path to it are publicly executable, or I won't be able to access the files.


Things learned for next year

SoE home
sketch of Kevin Karplus by Abe
Kevin Karplus's home page
Biomolecular Engineering Department
BME 205 home page UCSC Bioinformatics research

Questions about page content should be directed to Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
318 Physical Sciences Building