There is a nice tutorial on dynamic programming at http://www.sbc.su.se/~per/molbioinfo2001/seqali-dyn.html with pointers to an example at http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html. It uses the more expensive alignment algorithm that allows arbitrary gap costs for different length gaps, though only an affine gap cost is used. If you want a textbook presentation, one of the best I know is in Chapter 2 of Durbin, Eddy, Krogh, and Mitchison's book Biological Sequence Analysis. The book An Introduction to Bioinformatics Algorithms by Neil C. Jones and Pavel A. Pevzner is more tutorial, but doesn't cover affine gap costs as well. Chapter 6 is the relevant chapter there.
The version of dynamic programming with affine gap costs that we presented in class used only nearest neighbors, but needed 3 matrices:
Mi, j = max { | Ici – 1, j – 1 + subst(Ai, Bj) |
Mi – 1, j – 1 + subst(Ai, Bj) | |
Iri – 1, j – 1 + subst(Ai, Bj) |
Ici, j = max { | Ici , j – 1 – extend |
Mi , j – 1 – open | |
Iri , j – 1 – double_gap |
Iri, j = max { | Mi–1 , j – open |
Iri–1 , j – extend | |
Ici–1 , j – double_gap |
In Python, one can handle this just as well with one array, containing triples of values—for example, (M,Ir,Ic). Some students find handling one subscripting and a triple of information easier.
Be careful of the boundary conditions. For global alignment, you should have "start"-"start" aligned with 0 cost in the MATCH state and –infinity in Ir and Ic, so that a gap opening penalty is incurred if you start with an insertion or deletion. Before start, all rows and columns are –infinity, and the rest of the start row and column can be inferred from the recurrence relation. Make sure that your initial conditions are consistent with the recurrence relations—people sometimes get the indices of their matrices reversed, or confuse the Ir and Ic matrices.
For local alignment, the recurrence relations and the boundary conditions are different:
Mi, j = subst(Ai, Bj) + max { | 0 |
Ici – 1, j – 1 | |
Mi – 1, j – 1 | |
Iri – 1, j – 1 |
Ici, j = max { | Ici , j – 1 – extend |
Mi , j – 1 – open | |
Iri , j – 1 – double_gap |
Iri, j = max { | Mi–1 , j – open |
Iri–1 , j – extend | |
Ici–1 , j – double_gap |
Since local alignment is allowed to start at any match state (with the 0 choice for previous alignment cost), all rows and columns before the first letters of the sequences can be minus infinity.
You'll need a substitution matrix for this assignment. I've provided BLOSUM62 in the format provided by matblas and used by BLAST.
Compute the score for the following global alignment using the BLOSUM62 scoring matrix with a gap-opening cost of 12 and a gap-extension cost of 1 and a double-open cost of 2 (all are in 1/2 bit units to be compatible with the BLOSUM62 matrix). Show what you added up to get the score, since the final number alone could easily be incorrect due to a transcription or addition error.
--CTNIDDQSA--DK-NR- VLDAM-EENIGEGDDIKMVNote: this is neither the global nor the local optimal alignment.
Create a module (align.py) that contains two classes:
local_aligner
and global_aligner
.
Both classes should have the same methods:
aligner= local_aligner(subst,open=,extend=,double=)
aligner= global_aligner(subst,open=,extend=,double=)
__init__
method) records the
substitution matrix and gap costs for the
alignments to follow. I found it useful to store the substitution
matrix as a dict, mapping strings of 2 letters to
the substitution matrix value for that pair.
The __init__
method should assign reasonable default
values for the three gap costs.
score_a2m(self,s1,s2)
score=align(row_seq,col_seq)
Creates and fills in alignment matrices and stores row_seq and col_seq.
I found it convenient (though slow) to use dict objects for the matrices, mapping (i,j) 2-tuples to the value of the matrix at that coordinate pair. Note that this allows negative indices. You can make things faster by using numpy arrays with normal 0-based indexing of arrays, but you have to be careful about the bounds.
seq=traceback_col_seq()
You may implement the recursive definition of the computation
of the alignment matrix in any way that works, as long as
subalignments are not recomputed.
Although dynamic
programming is the standard implementation (being fast and
fairly simple), some people find it easier to do
memoized
computation. For example, you might have
three methods m(i,j)
,
ir(i,j)
, and ic(i,j)
that look up the
value in the matrices if it is already there, and otherwise do
recursion (or base cases) necessary to compute it.
There are many ways to implement the recursive definition,
and memoization is not the fastest of them.
test_align
--subst_matrix BLOSUM62
--align local
and --align local
--open 10
--extend 2
--double_gap 3
The program should read a fasta file from stdin with multiple sequences and output an alignment as an A2M file to stdout with the same sequences in the same order, and with the same names and comments on the id lines.
The fasta files may have alignment characters in them, like spaces, periods, and hyphens. These are not part of the sequences and should be removed or ignored. The alphabet for the fasta input is taken to be the characters used as indices in the substitution matrix. Whether letters in the fasta file are upper- or lower-case should also be ignored. This means that you should be able to read an A2M file and recover the unaligned sequences in a uniform case. As a sanity check, if you feed the output of your program back into the program with the same alignment options, the output should be identical to the input.
The first sequence of the fasta file is taken to be the "master" sequence and is echoed to stdout. All subsequent sequences are aligned to the first sequence and output to stdout. The score for each alignment should be output to stderr.
The gap penalties should (by default) be open=12, extend=1, double_gap=3. (The open and extend penalties are reasonable, but I'm not so sure about the double_gap penalty.)
The complete output of the program should be a legal A2M file.
>test1 CCCCCCCCCCCCCCCADEFGCCCCCCCCCCCCCCCC >test2 eeeeeeeeee---------------ADEFG----------------eeeeeeeeeeeeeeeeeor
>test1 CCCCCCCCCCCCCCCADEFGCCCCCCCCCCCCCCCC >test2 ---------------eeeeeeeeeeADEFGeeeeeeeeeeeeeeeee----------------See http://compbio.soe.ucsc.edu/a2m-desc.html for more details of a2m format. You can check that the file is formatted ok, by running it through the SAM program
checkseq
, which
unfortunately does not use stdin. To check the file x.a2m, run
checkseq foo -d x.a2mYou can also use the SAM program prettyalign to convert the a2m into a more visually appealing format. The SAM programs are installed in /projects/compbio/bin/i686 and /projects/compbio/bin/x86_64
Note: Python is not the language of choice for computation-intensive applications like sequence alignment. It may be better than Perl, but it is still not very efficient. My memoized implementation on 1001 alignments to a 302-long sequence (input from 1bqcA.fa.gz) which requires about 98.2 million cell updates took 1380 seconds on my laptop, or about 71,000 cell-updates-per-second (CUPS). A reasonable software implementation in an efficient language should do about 30 million CUPS on a somewhat newer machine, and a highly optimized implementation taking advantage of all the hardware on fast modern desktops may manage 46 billion CUPS (http://www.clcbio.com/wp-content/uploads/2012/09/Cell_A413.pdf). Python is required for this assignment only for pedagogic purposes, and my implementation was chosen for simplicity, not speed. Even in Python, one could go much faster---the memoized one is ludicrously slow.
Name your program "test_align", so that I can test everyone's program
with a script that does test_align --align=local < test.fasta
and test_align --align=global < test.fasta
.
Turn in your program after testing it on the following fasta files:
Don't waste paper turning in printouts of the alignments, as I'll be recomputing them with your program anyway. I'll be grading based on whether your alignment scores the same as an optimal alignment that my program finds, so it is essential that your output be a legal A2M file that correctly represents the alignment, but it need not be the same optimal alignment that I find.
To help you with your debugging, I provide possible output for local alignment of align-3, with g=12, e=1, d=2:
>fat ISAYGARFIELDISAFATCAT >cat ----GARFIELDwasthelASTFATCAT >cat2 ----GARFIELDISAFATCAT >cat3 ----GARFIELDISTheFATCAT >cat4 ----GARFIELDISAFAsTCAT >cat4 ----GARFIELDISTHefaSTCAT >cat5 ----GARFIELDIS-FAT--- >say -SAYFAt---------------with scores 58, 85, 68, 73, 57, 51, and 16. Possible global alignments are
>fat ISAYGARFIELDISAFATCAT >cat ----GARFIELDwasthelASTFATCAT >cat2 ----GARFIELDISAFATCAT >cat3 ----GARFIELDISTheFATCAT >cat4 ----GARFIELDISAFAsTCAT >cat4 ----GARFIELDISTHefaSTCAT >cat5 ----GARFIELDIS----FAT >say -SAY--------------FATwith scores 43, 70, 53, 58, 42, 25, and -15.
|
|
| BME 205 home page | UCSC Bioinformatics research |
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building