Note: this assignment was originally created by David Bernick, then edited by Kevin Karplus. Both are about equally responsible for the final shape of the assignment.
Feature prediction within genomic data is a powerful method for us to learn novel relationships armed with little more than a biological model and a way to express that model in terms of a score.
Proteins that bind specific regions of DNA, either as a regulation mechanism (transcription factors), or to recognize foreign DNA (phage DNA) need to make use of short sequences. These sequences need to be short in order to minimize the complexity of the binding protein, and they need to be specific in order to produce a specific regulatory effect. Consider, for example, what would happen if a regulatory protein that targets destruction of viral DNA happened upon that same sequence within the host genome. The benefits to the host of reserving these “special words” can now be seen. Since the machinery that makes use of these “special words” is also quite complex, we would tend to see them preserved over evolutionary time.
One class of special words appear to be DNA palindromes, which are reversecomplement palindromes (that is, the word is the same on both strands of the DNA—reversing the word and complementing it gives you back the word). For example, “TATA” is one of the more important palindromic sequences, used to signal start of transcription (http://www.pdb.org/pdb/101/motm.do?momID=67. It can be written as: T A A' T', where A' means the reverse complement of A (T), and T' means the reverse complement of T (A). Palindromes are common recognition sites, because many of the DNAbinding proteins bind as dimers or with two copies of the same protein domain, with the two chains having the DNA binding sites facing each other and recognizing corresponding sections on the forward and reverse strands, as can be seen in this cartoon picture of the PDB file 1LMB:
Rocha et al. (http://genome.cshlp.org/content/11/6/946.full doi:10.1101/gr.153101), noticed that palindromes of size 4 and 6 seem to be underrepresented in some prokaryotic genomes. They suggest that this may be due to particular structures (restriction or modification systems) that exist in some prokaryotes and some phage (lambda for example). This may be a mechanism for establishing exclusive use or restricting use of these words within a genome.
So, armed with the possibility that there exist special words, and that these words are sometimes palindromes, can we now see if we can find them in our genome of choice?
For us to predict when a word is missing or underrepresented, we need to have a model of the frequency with which we would expect a particular word—this will be our null model. We want to find the palindromes that occur significantly less frequently than would be expected.
There are many null models that are possible, and for this assignment we will make use of a Markov model that models extending the central palindrome in both directions. Note: nothing in this null model is restricted to palindromes—we could use this model to look for any underrepresented nmer. The symmetry of extending in both directions from the central (n–2)mer is particularly elegant for palindromes, but a very similar method can be made by using an order (n–2) Markov model as the null model. Somewhat surprisingly, the calculation for the expected count comes out the same in either model
To compute the probability that an arbitrary nmer is the word (W)
that we are interested in, we will use 3 components:
the probability that the core (n2) bases match, and the probabilities
of the first and last bases given that the core matches.
P(W) = P(W_{1}  W_{2}...W_{n1}) *
P(W_{n}  W_{2}...W_{n1}) *
P(W_{2}...W_{n1}) .
If we check for the palindrome N times (roughly the length of the genome),
we would expect to see N*P(W) occurrences of W. We can use counts and
maximumlikelihood estimates of the probabilities in our null model to
get an estimation formula:
E(C(W)) = C(W_{1}...W_{n1}X)
* C(XW_{2}...W_{n})
/ C(XW_{2}...W_{n1}X) ,
where E(C(W)) is the expected value for the count of the number of
times W occurs in the genome, and C(W_{i}...W_{j})
is the actual count of the number of times the word
W_{i}...W_{j} occurs. The 'X' character on the subwords
is intended to represent "any character". We are counting the number
of words of length n that agree on the first n1, the second n1, or
the middle n2 positions.
In practice, since our genome sequences are very long, we can ignore
edge effects and just use
E(C(W)) = C(W_{1}...W_{n1})
* C(W_{2}...W_{n})
/ C(W_{2}...W_{n1}) ,
If n=2, we are not counting empty strings
on the bottom, but the number of 2letter windows in our training data.
(Actually, I counted the number of letters, which is slightly larger.)
Because our model for the count is the sum of N almost independent
observations, each with probability P(W), it can be well modeled as a
binomial distribution, with variance
Var(C(W)) = N* P(W) * (1P(W))
= E(C(W)) * (1  E(C(W))/N) ,
and the standard deviation is
sigma(W) = sqrt(E(C(W)) * (1  E(C(W))/N)) .
Note: because E(C(W)) is much smaller than N, the variance is
approximately the same as the mean.
We can then compute a Zscore for each of our words:
Z(W) = (C(W) – E(C(W))) / sigma(W) .
Using a Zscore is quite reasonable here, because we are expecting a
distribution very close to a normal distribution.
How big a palindrome can we use this method on? As the palindromes get bigger, the expected counts get smaller, and it is harder to tell whether the number is really significantly smaller than expected. Say we want to have 0.01 probability of a false positive. If we check 4^4 (256) palindromes (the number of palindromes of length exactly 8), we would need a pvalue < 3.9E05 to get an Evalue < 0.01. According to a normal probability table (http://www.math.unb.ca/~knight/utility/NormTble.htm), we would need a Zscore < 4 for this level of significance. For a zero count to be at least 4 standard deviations below the expected value, we would need for the expected value to be at least 16. For a megabase genome, the expected number of counts for the 4^8 (64k) different 8mers should be around 32 on the two strands combined, and expected counts for 9mers aroung 8, so 8long palindromes are about as big as we can go. For longer palindromes, even a count of zero would not mean significant underrepresentation.
Note that if we wanted to check all 64k 8mers for underrepresentation and not just the 256 palindromes, we'd need a probability of 1.5E07 of a false positive, or a Zvalue around 5.1. This requires even larger expected counts for a 0 to be significant underrepresentation.
Write a short piece of Python code that generates all DNA palindromes for a given range of lengths. I found it useful to first write a (recursive) generator function yielding all kmers over a given alphabet for a fixed value of k. I also found the following function useful:
# define reverse complement complement_table = string.maketrans("ACGT", "TGCA") def reverse_comp(dna): """return a string with the reversecomplement of the DNA string "dna". Assumes that all bases in DNA are canonical (ACGT). To generalize to wildcard bases, complement_table would need to be redefined. """ return dna[::1].translate(complement_table)
The definition of DNA palindrome above does not allow oddlength
palindromes, since the middle letter is never the complement of itself.
In practice there are often unimportant bases in the middle of a
matching sequence, so let's add "odd palindromes" that have an extra
letter in the middle: dna+'A'+reverse_comp(dna)
,
dna+'G'+reverse_comp(dna)
, ...
Note that odd palindromes come in pairs—so ACCGT and ACGGT are the same "palindrome" on opposite strands. This means that underrepresenation of the pair should be treated as a single hypothesis, not as two distinct hypotheses. One way to handle this is to allow only "G" and "T" as the middle base of the odd palindrome, not "A" or "C" (making sure you count the palindromes on both strands).
When counting the palindromes on both strands, the evenlength palindromes will always have even counts (since any occurence on one strand is also an occurence on the other strand), but the oddlength palindromes may have either an odd or an even count.
The number of hypotheses is twice the number of palindromes checked, as the hypotheses "α is underrepresented" and "α is overrepresented" are separate hypotheses. As a sanity check, the Evalue for count=expected_count (Zscore=0) should be the number of palindromes, since every palindrome is at least that extreme.
Write a program, named score_palindromes, that reports significantly under or overrepresented palindromes. The program should have at least the following options:
Your program would then be executed as follows (for example):
score_palindromes e 0.01 max_k=8 P.abyssi.fa.gz P.furiosus.fa.gz P.horikoshii.fa.gz
Note: you can open a gzipped file in Python fairly
easily using the gzip
module.
file = gzip.GzipFile(filename,'r') if filename.endswith(".gz") else open(filename,'r')
I will provide a few genomes for various prokaryotes as FASTA
files (actually, as gzipped fasta files).
A.fulgidus.fa.gz
A.pernix.fa.gz
H.influenzae.fa.gz
H.pylori.fa.gz
M.jannaschii.fa.gz
M.thermoautotrophicus.fa.gz
N.meningitidus.fa.gz
P.abyssi.fa.gz
P.furiosus.fa.gz
P.horikoshii.fa.gz
S.acidocaldarius.fa.gz
S.solfataricus.fa.gz
S.tokadaii.fa.gz
V.cholerae.I.fa.gz
V.cholerae.II.fa.gz
V.fischeri.I.fa.gz
V.fischeri.II.fa.gz
V.parahaemolyticus.I.fa.gz
V.parahaemolyticus.II.fa.gz
V.vulnificus.I.fa.gz
V.vulnificus.II.fa.gz
These genomes are also available directly from /cse/classes/bme205/Fall05/ on the School of Engineering computers, and it is fairly simple to find other full prokaryotic genomes on the web. These files happen to have one sequence per file, but your program should be able to handle multiple sequences in one file, so that it can be applied to multichromosome genomes or genomes that are not fully assembled. You could also apply the program to a set of genomes from the same genus, as the extra data would allow you to search for slightly longer palindromes. Note: the files take up about 45 Mbytes unzipped, so leave them in the compressed format. It is good practice to learn how to read gzipped files without having to uncompress them first, so that is a requirement for this assignment.
Hint: the program could use the kmer counting module that you developed for the Markov chain assignment, modified to count all kmers up to the maximum size specified. You can use a single dictionary to store all the different counts, since the kmer keys are distinct. Note that your previous kmer counting only counted one strand of the DNA. There is no need to run though the entire reversecomplement strand, as you can get the counts for the opposite strand fairly simply:
rev_counts = dict( (reverse_comp(dna),cnt) for dna,cnt in counts.items()) for dna, cnt in rev_counts.items(): counts[dna] += cntI did this in two passes, to avoid possible timing errors where counts would be changed before the reverse complement was looked up.
After you have counted all the kmers, you can iterate over all the palindromes that you generated for Part 1 and compare the observed counts with computed counts.
It is easy to compute the zscores, since we have a model for both the mean and standard deviation that we expect. To turn the zscores into pvalues, we need the cumulative normal distribution function. This is not built into Python, but we have access to the math library, which includes the erf() and erfc() functions:
from math import erf, erfc
Use the erfc function to compute the desired pvalue. For example, if
you have a threshold t, then the probability of getting a zvalue
larger than t is
P(z > t) = erfc(t/sqrt(2))/2
and the probability of getting a zvalue less than t is
P(z < t) = erfc(t/sqrt(2))/2
After we've computed the zvalue for a particular palindrome, we can
compute its pvalue (the probability of seeing a zvalue more extreme
than it).
When I run the command
score_palindromes max_e=0.01 min_k=6 max_k=6 ~karplus/.html/bme205/f05/H.influenzae.fa.gzI get
# Reading from /cse/faculty/karplus/.html/bme205/f05/H.influenzae.fa.gz # After reading, user time is 5.28 sec # There are 64 words being considered (128 hypotheses), # and about 3660050 positions where words could be. # Reporting palindromes from length 6 to 6 , # which are under or overrepresented with E_value < 0.01 #kmer observed expected Z_score E_value GTTAAC 38 673.6165 24.4923 1.143e130 GAATTC 604 1337.2567 20.0552 1.162e87 TAATTA 1832 2820.5996 18.6216 1.374e75 GATATC 352 897.6054 18.2133 2.590e72 GTGCAC 282 760.9260 17.3637 9.944e66 AAGCTT 718 1302.2547 16.1932 3.607e57 ATGCAT 510 955.6213 14.4172 2.583e45 GGCGCC 54 279.7820 13.4988 1.017e39 GTCGAC 8 186.8813 13.0856 2.548e37 CATATG 348 674.8471 12.5829 1.677e34 CTTAAG 244 517.7412 12.0314 1.556e31 TGCGCA 700 1092.7578 11.8830 9.279e31 AAATTT 5808 6760.8635 11.5993 2.660e29 CAATTG 1348 1840.4037 11.4808 1.054e28 AATATT 4070 4819.5399 10.8038 2.113e25 TTCGAA 352 604.7960 10.2802 5.539e23 GGTACC 144 323.1519 9.9664 1.369e21 CGATCG 284 505.9324 9.8674 3.687e21 TTTAAA 4568 5171.3611 8.3962 2.952e15 ATATAT 850 1131.4110 8.3675 3.765e15 AGCGCT 94 211.2264 8.0661 4.645e14 GCTAGC 122 241.7564 7.7024 8.549e13 GTATAC 114 229.5894 7.6288 1.517e12 GAGCTC 128 243.7889 7.4161 7.721e12 GGATCC 192 323.8696 7.3279 1.496e11 GGGCCC 50 133.2151 7.2100 3.582e11 CAGCTG 620 822.2796 7.0549 1.105e10 CACGTG 472 636.7205 6.5285 4.253e09 CGTACG 234 357.2604 6.5216 4.453e09 GACGTC 70 146.6275 6.3283 1.587e08 ATCGAT 774 961.0846 6.0355 1.014e07 TTATAA 1776 2029.5272 5.6292 1.159e06 CTATAG 204 285.7456 4.8361 8.476e05 TGATCA 978 1138.0063 4.7439 1.342e04 TTGCAA 2518 2740.4423 4.2508 1.363e03 TGTACA 554 651.7729 3.8301 8.198e03 # Total user time is 5.3 seconds
BUG: I've not defined a unique output format, so comparison of results will be difficult to do with standard programs like "diff". It will help me if the output is done as 5 columns separated by white space, in the order kmer, observed count, expected count, Zscore, and Evalue, and if any other information is done as a comment line that starts with a "#". I can then write a specialpurpose comparator program that compares two such files. Sorting the output by the Zscore will also make comparison easier.
Sorting by Zscore puts the most significant underrepresentation at the beginning of the list, and the most signficant overrepresentation at the end of the list, which is handy for finding the biggest outliers. If you sort your output by length of word and alphabetically within that, you may see other helpful patterns, such as palindromic sequences that have a don'tcare region in the center. Note that I've supressed the output of the words that were not significantly under or overrepresented. I expect you to do that also.
Run your program as in the example above (H. influenzae for palindromes of length exactly 6). Also run your program for the three Pyrococcus genomes taken together, for length 4 to 8 palindromes:
score_palindromes min_k=4 max_k=8 max_e=0.01 P.abyssi.fa.gz P.furiosus.fa.gz P.horikoshii.fa.gzNote that this run is looking at combined counts for the three genomes, not looking at each genome separately.
I'm curious also how long your program takes to run. My unoptimized program took about 37 seconds on an old 32bit linux box for these three genomes on k from 4 to 8.
Write up a short discussion that includes your data, along with an interpretation of their meaning. Is there anything special about these sequences? Try Google, or the following links: http://www.neb.com/nebecomm/tech_reference/restriction_enzymes/cross_index_recognition_sequences.asp, http://www.neb.com/toolsandresources/interactivetools/enzymefinder, and http://rebase.neb.com/rebase/rebadvsearch.html The first link has a table of restriction enzymes, the second has a search site for finding known restriction enzymes by organism or sequence (among other methods). (Use 'N' for wildcard, not 'X' on this search page.) Do your finds match up with known restriction enzymes?
all
) for looking at all nmers,
and not just palindromic ones—make sure you change the number of
hypotheses when computing the Evalues!
Something I observed, but have not fixed yet in my code, is that for evenlength palindromes, the number of positions should be the singlestranded genome length, and the number of occurences of each palindrome should be the number of occurences in the singlestranded genome (so both observed and expected counts are doubled in the current computation). For oddlength palindromes, the count and reverse count added together are being correctly computed, as the number of occurences is indeed the number on both strands, so they should not be halved, the way the even palindrome counts should be. Counting nonpalindromic words (if you implement that option) should be done on the full, doublestranded genome.
Because I have not changed my code to count the evenlength palindromes on just the singlestranded genome, you probably should have your program default to the same behavior as mine, counting words on the doublestranded genome.
Some of the stuff I wrote above is wrong.
There are (at least) two ways to compute expected counts for odd palindromes. One can merge counts with a wildcard before computing expected counts, or one can compute the expected counts separately for each possible word, and add the expected counts. For some sequences, it makes a huge difference. For example, for CATSATG in H. influenzae, CATCATG is slightly underrepresented, CATGATG is slightly overrepresented, but CATSATG is hugely underrepresented. This appears to stem from CATG being selected against, not anything specific to CATCATG as a whole, so the analysis by separate words is much more convincing than the analysis by words with wildcards. This needs to be written up in the assignment next year.
For counting all words, we should probably merge counts for x and x', either in the initial counting or as a postprocessing step on the counts. Watch out for x=x' though, as it should not be double counted.
I should probably have give students code for reopening files as gzip input:
for i,f in enumerate(options.files): if f.name.endswith(".gz"): options.files[i] = gzip.GzipFile(fileobj=f, mode="r")
Add links
https://www.neb.com/toolsandresources/selectioncharts/crossindexofrecognitionsequences
https://www.neb.com/toolsandresources/interactivetools/enzymefinder?searchType=1&recognitionSite=&ctl01=GO&matchType=1
https://en.wikipedia.org/wiki/List_of_restriction_enzyme_cutting_sites


 BME 205 home page  UCSC Bioinformatics research 
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
18314594250
318 Physical Sciences Building