Note: this assignment was originally created by David Bernick, then edited by Kevin Karplus. Both are about equally responsible for the final shape of the assignment.
Feature prediction within genomic data is a powerful method for us to learn novel relationships armed with little more than a biological model and a way to express that model in terms of a score.
Proteins that bind specific regions of DNA, either as a regulation mechanism (transcription factors), or to recognize foreign DNA (phage DNA) need to make use of short sequences. These sequences need to be short in order to minimize the complexity of the binding protein, and they need to be specific in order to produce a specific regulatory effect. Consider, for example, what would happen if a regulatory protein that targets destruction of viral DNA happened upon that same sequence within the host genome. The benefits to the host of reserving these “special words” can now be seen. Since the machinery that makes use of these “special words” is also quite complex, we would tend to see them preserved over evolutionary time.
One class of special words appear to be DNA palindromes, which are reverse-complement palindromes (that is, the word is the same on both strands of the DNA—reversing the word and complementing it gives you back the word). For example, “TATA” is one of the more important palindromic sequences, used to signal start of transcription (http://www.pdb.org/pdb/101/motm.do?momID=67. It can be written as: T A A' T', where A' means the reverse complement of A (T), and T' means the reverse complement of T (A). Palindromes are common recognition sites, because many of the DNA-binding proteins bind as dimers or with two copies of the same protein domain, with the two chains having the DNA binding sites facing each other and recognizing corresponding sections on the forward and reverse strands, as can be seen in this cartoon picture of the PDB file 1LMB:
Rocha et al. (http://genome.cshlp.org/content/11/6/946.full doi:10.1101/gr.153101), noticed that palindromes of size 4 and 6 seem to be under-represented in some prokaryotic genomes. They suggest that this may be due to particular structures (restriction or modification systems) that exist in some prokaryotes and some phage (lambda for example). This may be a mechanism for establishing exclusive use or restricting use of these words within a genome.
So, armed with the possibility that there exist special words, and that these words are sometimes palindromes, can we now see if we can find them in our genome of choice?
For us to predict when a word is missing or under-represented, we need to have a model of the frequency with which we would expect a particular word—this will be our null model. We want to find the palindromes that occur significantly less frequently than would be expected.
There are many null models that are possible, and for this assignment we will make use of a Markov model that models extending the central palindrome in both directions. Note: nothing in this null model is restricted to palindromes—we could use this model to look for any under-represented n-mer. The symmetry of extending in both directions from the central (n–2)-mer is particularly elegant for palindromes, but a very similar method can be made by using an order (n–2) Markov model as the null model. Somewhat surprisingly, the calculation for the expected count comes out the same in either model
To compute the probability that an arbitrary n-mer is the word (W)
that we are interested in, we will use 3 components:
the probability that the core (n-2) bases match, and the probabilities
of the first and last bases given that the core matches.
P(W) = P(W1 | W2...Wn-1) *
P(Wn | W2...Wn-1) *
P(W2...Wn-1) .
If we check for the palindrome N times (roughly the length of the genome),
we would expect to see N*P(W) occurrences of W. We can use counts and
maximum-likelihood estimates of the probabilities in our null model to
get an estimation formula:
E(C(W)) = C(W1...Wn-1X)
* C(XW2...Wn)
/ C(XW2...Wn-1X) ,
where E(C(W)) is the expected value for the count of the number of
times W occurs in the genome, and C(Wi...Wj)
is the actual count of the number of times the word
Wi...Wj occurs. The 'X' character on the subwords
is intended to represent "any character". We are counting the number
of words of length n that agree on the first n-1, the second n-1, or
the middle n-2 positions.
In practice, since our genome sequences are very long, we can ignore
edge effects and just use
E(C(W)) = C(W1...Wn-1)
* C(W2...Wn)
/ C(W2...Wn-1) ,
If n=2, we are not counting empty strings
on the bottom, but the number of 2-letter windows in our training data.
(Actually, I counted the number of letters, which is very slightly larger.)
Because our model for the count is the sum of N almost independent
observations, each with probability P(W), it can be well modeled as a
binomial distribution, with variance
Var(C(W)) = N* P(W) * (1-P(W))
= E(C(W)) * (1 - E(C(W))/N) ,
and the standard deviation is
sigma(W) = sqrt(E(C(W)) * (1 - E(C(W))/N)) .
Note: because E(C(W)) is much smaller than N, the variance is
approximately the same as the mean.
We can then compute a Z-score for each of our words:
Z(W) = (C(W) – E(C(W))) / sigma(W) .
Using a Z-score is quite reasonable here, because we are expecting a
distribution very close to a normal distribution.
(Also, a grad student did an implementation with exact binomial
distributions and got almost the same results as with the normal
approximation.)
How big a palindrome can we use this method on? As the palindromes get bigger, the expected counts get smaller, and it is harder to tell whether the number is really significantly smaller than expected. Say we want to have 0.01 probability of a false positive. If we check 4^4 (256) palindromes (the number of palindromes of length exactly 8), we would need a p-value < 3.9E-05 to get an E-value < 0.01. According to a normal probability table (http://www.math.unb.ca/~knight/utility/NormTble.htm), we would need a Z-score < -4 for this level of significance. For a zero count to be at least 4 standard deviations below the expected value, we would need for the expected value to be at least 16. For a megabase genome, the expected number of counts for each of the 4^8 (64k) different 8-mers should be around 16 on one strand, and expected counts for 9mers around 4, so 8-long palindromes are about as big as we can go. For longer palindromes, even a count of zero would not mean significant under-representation.
Note that if we wanted to check all 64k 8-mers for under-representation and not just the 256 palindromes, we'd need a probability of 1.5E-07 of a false positive, or a Z-value around -5.1. This requires even larger expected counts for a 0 to be significant under-representation.
Write a short piece of Python code that generates all DNA palindromes for a given range of lengths. I found it useful to use itertools.product to yield all k-mers over a given alphabet for a fixed value of k. I also found the following function useful in Python2.7:
# define reverse complement complement_table = string.maketrans("acgturykmswbdhvnxACGTURYKMSWBDHVNX", "tgcaayrmkswvhdbnnTGCAAYRMKSWVHDBNN") def reverse_comp(dna): """return a string with the reverse-complement of the DNA string dna. """ return dna[::-1].translate(complement_table)Python3 does not support the translation-table functions (because Unicode translation tables would be huge), so you'll probably have to write your own code for complementing DNA.
The definition of DNA palindrome above does not allow odd-length
palindromes, since the middle letter is never the complement of itself.
In practice there are often unimportant bases in the middle of a
matching sequence, so let's add "odd palindromes" that have an extra, unmatched
letter in the middle: dna+'A'+reverse_comp(dna)
,
dna+'G'+reverse_comp(dna)
, ...
Note that odd palindromes come in pairs—so ACCGT and ACGGT are the same "palindrome" on opposite strands. This means that under-representation of the pair should be treated as a single hypothesis, not as two distinct hypotheses, but we will treat them as separate hypotheses for ease in programming.
We only count k-mers on one strand---if we counted on both strands, the even-length palindromes would always have even counts (since any occurrence on one strand is also an occurrence on the other strand), but the odd-length palindromes may have either an odd or an even count. Since the binomial distribution does not have missing odd values, we do not want to count on both strands, but only on one strand.
The number of hypotheses is twice the number of palindromes checked, as the hypotheses "α is under-represented" and "α is over-represented" are separate hypotheses. As a sanity check, the E-value for count=expected_count (that is, Z-score=0) should be the the number of palindromes, since every palindrome is at least that extreme (one way or the other). Note that the number of hypotheses has nothing to do with the genome length (unlike in a search problem, where the question being asked is "is there an instance at this location?").
Write a program, named score_palindromes, that reports significantly under- or over-represented palindromes. The program should have at least the following options:
Your program would then be executed as follows (for example):
score_palindromes -e 0.01 --max_k=8 P.abyssi.fa.gz P.furiosus.fa.gz P.horikoshii.fa.gz
Note: you can reopen a gzipped file in Python fairly
easily using the gzip
module:
for i,f in enumerate(options.files): if f.name.endswith(".gz"): options.files[i] = gzip.GzipFile(fileobj=f, mode="r")
I will provide a few genomes for various prokaryotes as FASTA
files (actually, as gzipped fasta files).
A.fulgidus.fa.gz
A.pernix.fa.gz
H.influenzae.fa.gz
H.pylori.fa.gz
M.jannaschii.fa.gz
M.thermoautotrophicus.fa.gz
N.meningitidus.fa.gz
P.abyssi.fa.gz
P.furiosus.fa.gz
P.horikoshii.fa.gz
S.acidocaldarius.fa.gz
S.solfataricus.fa.gz
S.tokadaii.fa.gz
V.cholerae.I.fa.gz
V.cholerae.II.fa.gz
V.fischeri.I.fa.gz
V.fischeri.II.fa.gz
V.parahaemolyticus.I.fa.gz
V.parahaemolyticus.II.fa.gz
V.vulnificus.I.fa.gz
V.vulnificus.II.fa.gz
These genomes are also available directly from /cse/classes/bme205/Fall05/ on the School of Engineering computers, and it is fairly simple to find other full prokaryotic genomes on the web. These files happen to have one sequence per file, but your program should be able to handle multiple sequences in one file, so that it can be applied to multi-chromosome genomes or genomes that are not fully assembled. You could also apply the program to a set of genomes from the same genus, as the extra data would allow you to search for slightly longer palindromes. Note: the files take up about 45 Mbytes unzipped, so leave them in the compressed format. It is good practice to learn how to read gzipped files without having to uncompress them first, so that is a requirement for this assignment.
Hint: the program could use the k-mer counting module that you developed for the Markov chain assignment, modified to count all k-mers up to the maximum size specified. You can use a single Counter or dict to store all the different counts, since the k-mer keys are distinct.
After you have counted all the k-mers, you can iterate over all the palindromes that you generated for Part 1 and compare the observed counts with computed counts.
It is easy to compute the z-scores, since we have a model for both the mean and standard deviation that we expect. To turn the z-scores into p-values, we need the cumulative normal distribution function. This is not built into Python, but we have access to the math library, which includes the erf() and erfc() functions:
from math import erf, erfc
Use the erfc function to compute the desired p-value. For example, if
you have a threshold t, then the probability of getting a z-value
larger than t is
P(z > t) = erfc(t/sqrt(2))/2
and the probability of getting a z-value less than t is
P(z < t) = erfc(-t/sqrt(2))/2
After we've computed the z-value for a particular palindrome, we can
compute its p-value (the probability of seeing a z-value more extreme
than it).
When I run the command
score_palindromes --max_e=0.01 --min_k=6 --max_k=6 ~karplus/.html/bme205/f05/H.influenzae.fa.gzI get
# Reading from /soe/karplus/.html/bme205/f05/H.influenzae.fa.gz # After reading, user time is 3.73 seconds # There are 64 words being considered ( 128 hypotheses), # and about 1830025 positions where words could be. # Reporting palindromes from length 6 to 6 , # which are under- or over-represented with E_value < 0.01 #k-mer observed expected Z_score E_value GTTAAC 19 336.7069 -17.3157 2.291e-65 GAATTC 302 668.6205 -14.1810 7.690e-44 TAATTA 916 1410.2788 -13.1670 8.697e-38 GATATC 176 448.5854 -12.8716 4.162e-36 GTGCAC 141 380.4550 -12.2777 7.637e-33 AAGCTT 359 651.0699 -11.4485 1.530e-28 ATGCAT 255 477.5719 -10.1861 1.464e-22 GGCGCC 27 139.7556 -9.5383 9.294e-20 GTCGAC 4 93.4067 -9.2511 1.422e-18 CATATG 174 337.4198 -8.8973 3.661e-17 CTTAAG 122 258.8434 -8.5062 1.150e-15 TGCGCA 350 546.2680 -8.3987 2.890e-15 AAATTT 2904 3380.4274 -8.2019 1.515e-14 CAATTG 674 920.2017 -8.1182 3.029e-14 AATATT 2035 2409.5255 -7.6349 1.447e-12 TTCGAA 176 302.3879 -7.2687 2.322e-11 GGTACC 72 161.5759 -7.0473 1.168e-10 CGATCG 142 252.8431 -6.9713 2.010e-10 TTTAAA 2284 2585.1269 -5.9267 1.978e-07 ATATAT 425 565.7035 -5.9167 2.103e-07 AGCGCT 47 105.5573 -5.6997 7.683e-07 GCTAGC 61 120.6735 -5.4324 3.559e-06 GTATAC 57 114.7418 -5.3907 4.492e-06 GAGCTC 64 121.8943 -5.2439 1.006e-05 GGATCC 96 161.4720 -5.1526 1.644e-05 GGGCCC 25 66.6053 -5.0980 2.197e-05 CAGCTG 310 411.1362 -4.9884 3.896e-05 CGTACG 117 178.6201 -4.6108 2.567e-04 CACGTG 236 318.0180 -4.5996 2.709e-04 GACGTC 35 73.2500 -4.4693 5.023e-04 ATCGAT 387 480.5220 -4.2669 1.268e-03 TTATAA 888 1014.6950 -3.9784 4.440e-03 # Total user time is 3.73 seconds
BUG: I've not defined a unique output format, so comparison of results will be difficult to do with standard programs like "diff". It will help me if the output is done as 5 columns separated by white space, in the order k-mer, observed count, expected count, Z-score, and E-value, and if any other information is done as a comment line that starts with a "#". I can then write a special-purpose comparator program that compares two such files. Sorting the output by the Z-score will also make comparison easier.
Sorting by Z-score puts the most significant under-representation at the beginning of the list, and the most significant over-representation at the end of the list, which is handy for finding the biggest outliers. If you sort your output by length of word and alphabetically within that, you may see other helpful patterns, such as palindromic sequences that have a don't-care region in the center. Note that I've suppressed the output of the words that were not significantly under- or over-represented. I expect you to do that also.
Run your program as in the example above (H. influenzae for palindromes of length exactly 6). Also run your program for the three Pyrococcus genomes taken together, for length 4 to 8 palindromes:
score_palindromes --min_k=4 --max_k=8 --max_e=0.01 P.abyssi.fa.gz P.furiosus.fa.gz P.horikoshii.fa.gzNote that this run is looking at combined counts for the three genomes, not looking at each genome separately.
I'm curious also how long your program takes to run. My unoptimized program took about 24 seconds to run on the server "protein" for these three genomes on k from 4 to 8.
Write up a short discussion that includes your data, along with an interpretation of their meaning. Is there anything special about these sequences? Try Google, or the following links:
--all
) for looking at all n-mers,
and not just palindromic ones—make sure you change the number of
hypotheses when computing the E-values!
I ran into some serious artifacts with an approach that merged counts for wildcards before computed expected values. For example, CATSATG appears hugely under-represented in H. influenzae, but CATCATG and CATGATG are not. This artifact results because CATG is very rare, so CATGATG and CATCATG are both quite rare, as are CATGAT- and -ATCATG (because of the rarity of CATG), but -ATSATG and CATSAT- are not, because GATG and CATC are not rare. To avoid this artifact, it is better to compute the expected counts for each sequence of DNA bases separately, then add the expected counts to merge wildcards, rather than computing expected counts from merged k-mer counts.