There is a nice tutorial on dynamic programming at http://www.sbc.su.se/~per/molbioinfo2001/seqali-dyn.html with pointers to an example at http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html. It uses the more expensive alignment algorithm that allows arbitrary gap costs for different length gaps, though only an affine gap cost is used. If you want a textbook presentation, one of the best I know is in Chapter 2 of Durbin, Eddy, Krogh, and Mitchison's book Biological Sequence Analysis. The book An Introduction to Bioinformatics Algorithms by Neil C. Jones and Pavel A. Pevzner is more tutorial, but doesn't cover affine gap costs as well. Chapter 6 is the relevant chapter there.
The version of dynamic programming with affine gap costs that we presented in class used only nearest neighbors, but needed 3 matrices:
Mi, j = max { | Iyi – 1, j – 1 + subst(Ai, Bj) |
Mi – 1, j – 1 + subst(Ai, Bj) | |
Ixi – 1, j – 1 + subst(Ai, Bj) |
Iyi, j = max { | Iyi , j – 1 – extend |
Mi , j – 1 – open | |
Ixi , j – 1 – double_gap |
Ixi, j = max { | Mi–1 , j – open |
Ixi–1 , j – extend | |
Iyi–1 , j – double_gap |
In Python, one can handle this just as well with one array, containing triples of values—for example, (M,Ix,Iy). Some students find handling one subscripting and a triple of information easier.
Be careful of the boundary conditions. For global alignment, you should have "start"-"start" aligned with 0 cost in the MATCH state and –infinity in Ix and Iy, so that a gap opening penalty is incurred if you start with an insertion or deletion. Before start, all rows and columns are –infinity, and the rest of the start row and column can be inferred from the recurrence relation. Make sure that your initial conditions are consistent with the recurrence relations—people sometimes get the indices of their matrices reversed, or confuse the Ix and Iy matrices.
For local alignment, the recurrence relations and the boundary conditions are different:
Mi, j = subst(Ai, Bj) + max { | 0 |
Iyi – 1, j – 1 | |
Mi – 1, j – 1 | |
Ixi – 1, j – 1 |
Iyi, j = max { | Iyi , j – 1 – extend |
Mi , j – 1 – open | |
Ixi , j – 1 – double_gap |
Ixi, j = max { | Mi–1 , j – open |
Ixi–1 , j – extend | |
Iyi–1 , j – double_gap |
Since local alignment is allowed to start at any match state (with the 0 choice for previous alignment cost), all rows and columns before the first letters of the sequences can be minus infinity.
You'll need a substitution matrix for this assignment. I've provided BLOSUM62 in the format provided by matblas and used by BLAST.
Compute the score for the following global alignment using the BLOSUM62 scoring matrix with a gap-opening cost of 12 and a gap-extension cost of 1 and a double-open cost of 2 (all are in 1/2 bit units to be compatible with the BLOSUM62 matrix). Show what you added up to get the score, since the final number alone could easily be incorrect due to a transcription or addition error.
---CTNIDDSADKN-R-- VLDAME-EIGEGDDIKMVNote: this is neither the global nor the local optimal alignment.
Create a module (align.py) that contains two classes:
local_aligner
and global_aligner
.
Both classes should have the same methods:
aligner= local_aligner(subst,gaps)
aligner= global_aligner(subst,gaps)
score_a2m(self,s1,s2)
score=align(xseq,yseq)
seq=traceback_yseq()
I found it convenient to have three other methods m(i,j)
,
ix(i,j)
, and iy(i,j)
that look up the
value in the matrices if it is already there, and otherwise do
recursion (or base cases) necessary to compute it. This sort of
computation is referred to
as memoized
computation and is a generalization of
dynamic
programming.
There are many other ways to implement the dynamic programming,
and memoization is not the fastest of them.
test_align
--subst_matrix BLOSUM62
--align local
--open 10
--extend 2
--double_gap 3
The program should read a fasta file from stdin. The fasta files may have alignment characters in them, like spaces, periods, and hyphens. These are not part of the sequences and should be removed or ignored. The alphabet for the fasta input is taken to be the characters used as indices in the substitution matrix.
The first sequence of the fasta file is taken to be the "master" sequence and is echoed to stdout. All subsequent sequences are aligned to the first sequence and output to stdout. The score for each alignment should be output to stderr.
The gap penalties should (by default) be open=12, extend=1, double_gap=2. (The open and extend penalties are reasonable, but I'm not so sure about the double_gap penalty.)
The complete output of the program should be a legal A2M file.
>test1 CCCCCCCCCCCCCCCADEFGCCCCCCCCCCCCCCCC >test2 eeeeeeeeee---------------ADEFG----------------eeeeeeeeeeeeeeeeeor
>test1 CCCCCCCCCCCCCCCADEFGCCCCCCCCCCCCCCCC >test2 ---------------eeeeeeeeeeADEFGeeeeeeeeeeeeeeeee----------------See http://compbio.soe.ucsc.edu/a2m-desc.html for more details of a2m format. You can check that the file is formatted ok, by running it through the SAM program
checkseq
, which
unfortunately does not use stdin. To check the file x.a2m, run
checkseq foo -d x.a2mYou can also use the SAM program prettyalign to convert the a2m into a more visually appealing format. The SAM programs are installed in /projects/compbio/bin/i686 and /projects/compbio/bin/x86_64
Note: Python is not the language of choice for computation-intensive applications like sequence alignment. It may be better than Perl, but it is still not very efficient. My implementation on 1001 alignments to a 302-long sequence (input from 1bqcA.fa.gz) which requires about 98.2 million cell updates took 1380 seconds on my laptop, or about 71,000 cell-updates-per-second (CUPS). A reasonable software implementation in an efficient language should do about 30 million CUPS on a somewhat newer machine, and a highly optimized implementation taking advantage of all the hardware on fast modern desktops may manage 46 billion CUPS (http://www.clcbio.com/wp-content/uploads/2012/09/Cell_A413.pdf). Python is required for this assignment only for pedagogic purposes, and my implementation was chosen for simplicity, not speed. Even in Python, one could go much faster.
Name your program "test_align", so that we can test everyone's program
with a script that does test_align --align=local < test.fasta
and test_align --align=global < test.fasta
.
Turn in both your program
and the output for running the program on the
sequence pairs in the the following fasta files:
To help you with your debugging, I provide possible output for local alignment of align-3, with g=12, e=1, d=2:
>fat ISAYGARFIELDISAFATCAT >cat ----GARFIELDwasthelASTFATCAT >cat2 ----GARFIELDISAFATCAT >cat3 ----GARFIELDISTheFATCAT >cat4 ----GARFIELDISAFAsTCAT >cat4 ----GARFIELDISTHefaSTCAT >cat5 ----GARFIELDIS-FAT--- >say -SAYFAt---------------with scores 58, 85, 68, 73, 57, 51, and 16. Possible global alignments are
>fat ISAYGARFIELDISAFATCAT >cat ----GARFIELDwasthelASTFATCAT >cat2 ----GARFIELDISAFATCAT >cat3 ----GARFIELDISTheFATCAT >cat4 ----GARFIELDISAFAsTCAT >cat4 ----GARFIELDISTHefaSTCAT >cat5 ----GARFIELDIS----FAT >say -SAY--------------FATwith scores 43, 70, 53, 58, 42, 25, and -15.
|
|
| BME 205 home page | old BME 205 discussion forum | Fall 2012 BME 205 discussion forum | UCSC Bioinformatics research |
Questions about page content should be directed to
Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
USA
karplus@soe.ucsc.edu
1-831-459-4250
318 Physical Sciences Building