This assignment is intended to deepen your understanding of first-order Markov chains and of coding cost. I will provide you with a data set consisting of many protein sequences in a FASTA file (see FASTA/FASTQ assignment). You will build stochastic models of these sequences as zero-order and first-order Markov chains, and measure the information gain of the first-order model over the zero-order model.
I have provided two large FASTA files: mark_f14_1.seqs and mark_f14_2.seqs for training and testing the models. Warning: these FASTA files contain amino acids in both upper and lower case, but you should ignore the case when reading the files, so that 'A' and 'a' both stand for alanine. You will do 2-fold cross-validation (training on each of the sets and testing on the other). These can be accessed on School of Engineering machines directly from the directory /soe/karplus/.html/bme205/f14/markov_files/ so that you don't need to make your own copies. (Note: the "f14" is correct---I'm reusing the files from last year.)
The combined set was derived from Dunbrack's culledpdb list of X-ray structures with resolution 1.6 Angstroms or better and R-factor <= 0.25, reduced so that no two sequences had more than 90% identity. I used only those sequences that were in our t06 template library, which primarily means that short sequences were removed. I randomly split the list of chain ids into two partitions. For full details on how the set was created, see the Makefile.
Because the sequence files are fairly large (around 680,000 characters) and many of you will have written rather slow code, I've also provided a markov_files/tiny.seqs set for quick testing.
We will be modeling the sequences including a special "stop" character at the end of the sequences that is not present in the data files. This means that the Markov chains will have probabilities that add to one over all strings that end with the stop character, rather than over strings of a fixed length. For simplicity, we'll be using different non-sequence characters to indicate the start of a sequence (^) and end of the sequence($). Note that the training and test data files do not have these special characters—the alphabet is amino acids (with a possibility of special wild cards 'X', 'B', and 'Z').
It might be a good idea in
your code to have separate parameters for start and stop characters,
and to allow
start=None to mean that k-mers that start
before the first character of the sequence are not counted (similarly
The simplest model we discussed in class is the i.i.d. (independent, identically distributed) model, or zero-order Markov model. It assumes that each character of the string comes from the same background distribution P_0, and the probability of a string is just the product of the probabilities of the letters of the string (including the final "$"). To estimate the background distribution, we need counts of all the letters in the training set, plus the number of strings in the training set (represented as the number of times "$" occurs).
The next simplest model was a Markov chain, in which the probability of each character depends only on the preceding character. To estimate the conditional probabilities, we'll need the counts for all pairs of letters in the training set, including initial pairs (like "^M") and ending pairs (like "K$"). Note that "^$" is a possible pair (what one would get for the empty string).
In general, for estimating a k-th order Markov chain, we need counts of all (k+1)-mers in the training data.
Write a program called "count-kmers" that reads a FASTA file from stdin and outputs a table of counts to stdout. The output should have two tokens on each line: the k-mer followed by the count of that k-mer. Initial and final k-mers should be symmetrically included, that is, if the sequence is ABDEF and you are counting 3-mers, the counts should be
^^A 1 ^AB 1 ABD 1 BDE 1 DEF 1 EF$ 1 F$$ 1Note: case should be ignored in the input sequence, so that "abDE", "ABDE", and "abde" are all counted as the same 4-mer.
Hint: there are two different approaches to counting k-mers. In
one, you process the file a line at a time (to distinguish id
lines), then process the sequence lines a character at a time, using
kmer = kmer[0:-1]+letter.
In the other approach, you can define a generator function in Python
to read a FASTA file one sequence at a time, then extract the kmers
from the sequence with something like
for (fasta_id,comment,seq) in read_fasta(genome): for start in range(len(seq)-k): counts[seq[start:start+k]] += 1Both approaches are useful. The first one does not require large amounts of RAM when you get very long sequences (like whole plant chromosomes), and the second one can be used for other FASTA reading, where the sequences need to be processed individually in more complex ways. Given that you have already written a FASTA parser, I expect that most students will choose to use it, probably with some modifications to add optional arguments about whether case should be ignored and what alphabet is to be used.
The "count-kmers" program must accept two options:
--order=0should count 1-mers)
--alphabet ACDEFGHIKLMNPQRSTVWY). All characters in sequences that are not in the alphabet should be ignored. Note: you may want to use the "set" type to represent the alphabet internally. The alphabet should default to ASCII letters if not specified. Note: you want case conversion to be done before determining whether a letter is in the alphabet, so that users don't have to provide both lower-case and upper-case copies of the alphabet.
The command-line option parsing should be done with the "argparse" module. The actual counting of k-mers should be done by a function you define in a separate module (Markov.py or markov.py). The counts are most easily kept and returned in a "collections.Counter" object, which provides a very convenient way to do all sorts of counting.
If the k-mer counting function is passed a file-like object (like sys.stdin) as an argument, then you can
easily reuse it on other files. Furthermore, if you use
for line in input_stream: to read the input, then you can
do testing by passing in a list of lines:
>>> dummy_in= '>seq1\nABCDEF\n>seq2\nAC\nde\nFG\n>seq3\n' >>> count=get_counts(dummy_in.splitlines(True), 2, alphabet="ABCDEFGHIJK", start='^', stop='$')
Here is an example of what you should get when your run your program:
count-kmers -o0 < /soe/karplus/.html/bme205/f14/markov_files/mark_f14_1.seqs $ 2804 A 53783 B 1 C 8336 D 37458 E 40373 F 24864 G 49926 H 17643 I 33796 K 35127 L 54962 M 14260 N 27386 P 30034 Q 23554 R 30858 S 39767 T 35235 V 44037 W 9412 X 56 Y 22591With a restricted alphabet (getting rid of the B and X wildcards),
count-kmers -o0 --alphabet=ACDEFGHIKLMNPQRSTVWY < /soe/karplus/.html/bme205/f14/markov_files/mark_f14_1.seqsshould produce
$ 2804 A 53783 C 8336 D 37458 E 40373 F 24864 G 49926 H 17643 I 33796 K 35127 L 54962 M 14260 N 27386 P 30034 Q 23554 R 30858 S 39767 T 35235 V 44037 W 9412 Y 22591Similarly, when you run
count-kmers -o1 --alphabet=ACDEFGHIKLMNPQRSTVWY < /soe/karplus/.html/bme205/f14/markov_files/tiny.seqsyou should get
A$ 1 AC 4 AG 1 AM 7 CA 3 CD 4 DE 4 EF 4 FA 3 FG 1 GG 9 GH 1 GM 1 HI 1 IK 1 K$ 1 M$ 1 MA 6 MC 3 MM 2 MW 1 WM 1 WW 3 ^A 1 ^M 2
For symmetry, report as many stop characters as you report start characters (this makes it easier to make "reverse Markov" models that model the sequences in reverse order)---in both cases, up to the order of the model characters.
The order-0 models are a special case, as the normal number of start (and stop) characters included is the order of the model, but we want to have counts of the stop character even for the order-0 model. For order-0 only, report the number of stop characters as the number of sequences, but don't report the number of start characters.
count-kmers -o2 --alphabet=ACDEFGHIKLMNPQRSTVWY < /soe/karplus/.html/bme205/f14/markov_files/tiny.seqsyou should get
A$$ 1 ACD 4 AGH 1 AM$ 1 AMA 4 AMC 1 AMM 1 CAM 3 CDE 4 DEF 4 EFA 3 EFG 1 FAC 2 FAG 1 FGG 1 GGG 8 GGM 1 GHI 1 GMA 1 HIK 1 IK$ 1 K$$ 1 M$$ 1 MA$ 1 MAC 1 MAM 4 MCA 3 MMM 1 MMW 1 MWW 1 WMC 1 WWM 1 WWW 2 ^AC 1 ^MA 1 ^MC 1 ^^A 1 ^^M 2
Note: the output should be sorted in alphabetical order, with a single space separator, so that I can check I/O behavior using the "diff" program.
Optional bonus points:
--stopto change the start and stop characters.
Write a program that reads in a table of k-mer counts (in the same format as you output them) from stdin and FASTA-formatted sequences from a file, and computes the total encoding cost in bits for the sequences. The encoding cost of a character x using model M is -log_2 P_M(x), and the encoding cost of a sequence is the sum of the costs for the characters (note: this is equivalent to taking -log_2 of the product of the probabilities).
The calling sequence for the program should be
coding-cost tiny.seqs < mark_f14_1.count1 > tiny.cost1The use of stdin for the table allows using count-kmers and coding-cost in a pipe:
count-kmers -order=1 --alphabet=ACDEFGHIKLMNPQRSTVWY < mark_f14_1.seqs | coding-cost tiny.seqs > tiny.cost1
Like count-kmers, coding-cost should accept an --alphabet argument to restrict the character set. The default should be the 20 standard amino acids: ACDEFGHIKLMNPQRSTVWY. Note that the file name for coding-cost is a positional argument without any "--" name in front of it, and that there is no "--order" argument. The order can be determined from the table read from stdin.
Report not only the total encoding cost for the file, but the average cost per sequence and per character. So that everyone gets the same numbers, for the per-character costs the numerator is the total cost, which includes encoding exactly one stop character at the end of each sequence (for any order, including order 0), and the denominator is the number of characters not counting the stop character. I'm not defending this as necessarily the best numerator and denominator to use for average cost/character, but it is the choice I used in generating the "right" answers.
Note that to do the computation, you need to convert the counts
into probabilities. The simplest approach is to normalize the counts
to sum to 1 over each "condition". Note that
is the "condition" for a given k-mer. You might think that because
the FASTA training file will be large, you will not need to worry
about pseudocounts, but even by order 2, there are 3-mers that have
zero counts, so the maximum-likelihood estimate of their probability
is 0 and the encoding cost is infinite (resulting in a "ValueError:
math domain error" from python if you actually try to compute it).
The next simplest approach is to add a pseudocount to every count (including the zero counts for a context, so if you have seen "WWA", "WWC", ... but not "WWH" or "WWI", you have the context "WW" and will need to add a pseudocount for every "WWx" where x ranges over your alphabet. I recommend adding a function to your markov module that produces a conditional probability dict from a Counter (without changing the counter), so that you can easily modify the way you handle the probability estimation. The default pseudocount should be 1.0 for each cell. Note that pseudocounts get added to every kmer (whether they were zero count or not).
The simple approach of making sure that pseudocounts have been used for letters in every observed context will work fine up to about order 2, but for higher order Markov chains, you might run into k-mers for which even the k-1-mer that is the context has never been seen in the training set. (For example, mark_f14_1.seqs has a sequence starting with CYC, producing 4-mer ^CYC , but the context ^CY never occurs in mark_f14_2.seqs.)
You can fix this problem by making sure that the log-prob table has keys for all k-mers over the alphabet.
One tricky point in the log-prob table: We know that once we get to the end of a sequence and have seen the final "$", then all subsequent letters must be "$", so we do not want P("A$A") to be non-zero, even if we have seen counts for "A$$".
Hint: for this assignment, since we are not looking at the encoding cost individually for each sequence, but only the total encoding cost, you can summarize the FASTA file with the same sort of dictionary of counts that was used for count-kmers. You can determine the size of k-mer you need to count by looking at the size of the keys you read in for the model.
Run coding_cost on both sets of sequences using both order 0 and order 1 models. Get the coding costs both for the training set and for the test set. For comparison, my code with an autodetected alphabet and pseudocounts of 1.0 got
|Train||Test||order 0 bits/char||order1 bits/char|
The table with the constrained alphabet ACDEFGHIKLMNPQRSTVWY is
|Train||Test||order 0 bits/char||order1 bits/char|
Note that a simple uniform model over 20 characters would take 4.32193 bits/char, and over a 21-letter alphabet (including stop) would take 4.39232 bits/char.
Optional bonus points:
for pseudocount in [1e-10, 0.01, 0.1, 0.2, 0.5, 1, 2, 5, 10, 100]:or as sophisticated as you want to make it.
There are several ways to handle wildcards and non-standard amino acid letters:
I want copies of your programs on paper, but I don't want all the output from every run of the programs! Instead, write a short summary describing the results of the tests, giving the average encoding costs for each set of sequences for each model. Try to write your summary as if you were describing the method and the results in a journal paper that has a page limit: you want to describe the data sets and methods completely but concisely. If you do it well, the presentation should take less than a page.
Turn in your summary page with the program stapled after it. (Avoid paper clips—they have a tendency to either come off or to pick up other people's papers.)
I also want a directory name for a directory on the School of Engineering machines that I can read that contains all the programs and writeup for this assignment. Make sure that the directory and all files in it are publicly readable, and that all directories on the path to it are publicly executable, or I won't be able to access the files.
yieldstatement in Python to make a generator.
|BME 205 home page||UCSC Bioinformatics research|
Questions about page content should be directed to
University of California, Santa Cruz
Santa Cruz, CA 95064
318 Physical Sciences Building