This track shows a measure of evolutionary conservation based on a phylogenetic hidden Markov model (phylo-HMM). A multiz alignment between this genome and related species' genomes was used to generate the annotation. The phylogenetic tree for these organisms is shown below, with branch lengths drawn to scale and given in substitutions per site. The tree is based on fitting a time-reversible (REV) model for nucleotide substitutions to the whole genome multiple alignment of these species.
In "full" visibility mode, this track displays pairwise alignments of each species aligned to the current genome. The pairwise alignments are displayed in the standard UCSC browser "dense" mode using a greyscale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display; however, this does not remove them from the conservation score display.
When zoomed in to the base display level, the track shows the base composition of each alignment. The numbers and symbols on each species line indicate the lengths of gaps in the genome sequence at those alignment positions. If the gap size is greater than 9, the "+" symbol is displayed instead of a number. To view detailed information about the alignments at a specific position, zoom the display in to 30,000 or fewer bases, then click on the alignment.
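The gap-labelling rule described above can be sketched as a small function. This is only an illustration of the rule as stated, not the browser's own code: the function name `gap_symbol` is hypothetical, and returning an empty string when there is no gap is an assumption.

```python
def gap_symbol(gap_length):
    """Return the label shown for a gap of the given length (illustrative only).

    Assumption: positions with no gap get no label at all.
    """
    if gap_length <= 0:
        return ""                 # assumed: nothing is drawn when there is no gap
    if gap_length > 9:
        return "+"                # gaps longer than 9 bases are shown as "+"
    return str(gap_length)        # gaps of 1 to 9 bases are shown as a single digit


# Example: label a few gap lengths
labels = [gap_symbol(n) for n in (0, 1, 9, 10, 250)]
```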
This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the "Graph configuration help" link for an explanation of the configuration options.
Best-in-genome blastz pairwise alignments of this genome to each of the related species were multiply aligned using the program Multiz. The resulting multiple alignments were then assigned conservation scores by a phylo-HMM.
A phylo-HMM is a probabilistic model that describes both the process of DNA substitution at each site in a genome, and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2003, Siepel and Haussler 2004). A phylo-HMM can be thought of as a machine that generates a multiple alignment, in the same way that an ordinary hidden Markov model (HMM) generates an individual sequence. While the states of an ordinary HMM are associated with simple multinomial probability distributions, the states of a phylo-HMM are associated with more complex distributions defined by probabilistic phylogenetic models. These distributions can capture differences in the rates and patterns of nucleotide substitution observed in different types of genomic regions (e.g., coding or noncoding regions, conserved or nonconserved regions).
To compute a conservation score, we use a k-state phylo-HMM whose k associated phylogenetic models differ only in overall evolutionary rate (Felsenstein and Churchill 1996, Yang 1995). In the image at right, there are three states (k = 3), S1, S2, and S3, but in practice we use k = 10. A phylogenetic model is estimated globally, using the discrete gamma model for rate variation (Yang 1994), and a scaled version of the estimated model is then associated with each state in the phylo-HMM. Each state i has a separate "rate constant", r_i, by which all branch lengths in the globally estimated model are multiplied. The transition probabilities between states allow for autocorrelation of substitution rates, i.e., for adjacent sites to tend to exhibit similar overall substitution rates. A single parameter, lambda, describes the degree of autocorrelation and defines all transition probabilities. Here, we have estimated the rate constants from the data, similarly to Yang (1995) (see Siepel and Haussler 2003), but have treated lambda as a tuning parameter. For the conservation score, we use the posterior probability that each site was "generated" by the state having the smallest rate constant. Because of the way the rate categories are defined, the plotted values can be thought of as approximately representing the posterior probability that each site is among the 10% most conserved sites in the data set (allowing for autocorrelation of substitution rates).
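As a rough illustration of how such a score could be computed, the sketch below runs the standard forward-backward algorithm over k rate states and reports the posterior probability of the slowest state at each site. Several things here are assumptions rather than details taken from this track's implementation: the transition probabilities are taken to be lambda * [i = j] + (1 - lambda) / k (stay in the same rate category with probability lambda, otherwise redraw uniformly), the initial distribution is uniform, state 0 is the state with the smallest rate constant, and the per-site emission likelihoods `emissions[t][i]` are supplied by the caller rather than computed from a phylogenetic model. The function names are hypothetical.

```python
def transition_matrix(k, lam):
    """Assumed autocorrelation model: a[i][j] = lam * [i == j] + (1 - lam) / k."""
    return [[lam * (i == j) + (1.0 - lam) / k for j in range(k)] for i in range(k)]


def posterior_slowest_state(emissions, lam):
    """Posterior probability of state 0 (smallest rate constant) at each site.

    emissions[t][i] is the likelihood of alignment column t under state i.
    """
    n, k = len(emissions), len(emissions[0])
    a = transition_matrix(k, lam)

    # Forward pass, rescaled at every site to avoid numerical underflow.
    fwd = []
    cur = [emissions[0][i] / k for i in range(k)]    # assumed uniform start
    total = sum(cur)
    fwd.append([c / total for c in cur])
    for t in range(1, n):
        prev = fwd[-1]
        cur = [emissions[t][j] * sum(prev[i] * a[i][j] for i in range(k))
               for j in range(k)]
        total = sum(cur)
        fwd.append([c / total for c in cur])

    # Backward pass, rescaled in the same way.
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        cur = [sum(a[i][j] * emissions[t + 1][j] * bwd[t + 1][j] for j in range(k))
               for i in range(k)]
        total = sum(cur)
        bwd[t] = [c / total for c in cur]

    # Per-site posterior: normalise forward * backward, keep state 0.
    scores = []
    for t in range(n):
        post = [fwd[t][i] * bwd[t][i] for i in range(k)]
        scores.append(post[0] / sum(post))
    return scores


# Toy example with k = 3: an uninformative column between two columns that
# favour the slowest state; autocorrelation pulls the middle score upward.
toy = [[0.9, 0.05, 0.05], [1.0, 1.0, 1.0], [0.9, 0.05, 0.05]]
toy_scores = posterior_slowest_state(toy, lam=0.9)
```

With a large lambda, a column that carries no information of its own inherits most of its score from its neighbours, which is the intended effect of modelling autocorrelation.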
In this case, the general reversible (REV) substitution model was used in parameter estimation, and lambda was set to 0.9. Alignment gaps were treated as missing data, which sometimes has the effect of producing undesirably high posterior probabilities in gappy regions of the alignment. We are looking at several possible ways of improving the handling of alignment gaps.
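To make "treated as missing data" concrete, here is a minimal sketch of a per-column likelihood in which gap characters contribute a factor of 1, so they neither support nor contradict any rate category. It deliberately simplifies the real setup and should be read as an assumption-laden toy: it uses the Jukes-Cantor (JC69) model in place of REV, a star-shaped tree with one branch per species in place of the fitted phylogeny, and a hypothetical function name `column_likelihood`. The `rate` argument plays the role of a state's rate constant, which multiplies every branch length.

```python
import math


def jc69_prob(same, t):
    """JC69 probability of observing the same (or one specific different) base
    after a branch of length t substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def column_likelihood(column, branch_lengths, rate):
    """Likelihood of one alignment column on a star tree (toy model).

    column: one character per species; anything other than A, C, G, T
            (for example '-') is treated as missing data.
    rate:   rate constant that multiplies every branch length.
    """
    total = 0.0
    for root in "ACGT":                      # sum over the unknown ancestral base
        p = 0.25                             # JC69 equilibrium frequency
        for base, t in zip(column, branch_lengths):
            if base in "ACGT":
                p *= jc69_prob(base == root, rate * t)
            # missing data: multiply by 1, i.e. leave p unchanged
        total += p
    return total


branches = [0.1, 0.2, 0.3]
slow = column_likelihood("AAA", branches, rate=0.1)   # conserved column, slow state
fast = column_likelihood("AAA", branches, rate=2.0)   # same column, fast state
all_gaps = column_likelihood("---", branches, rate=0.1)
```

A column made up entirely of gaps has the same likelihood under every rate state, so its posterior is determined by neighbouring columns. This is one way a gappy region flanked by conserved sequence can end up with a high score.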
This track was created at UCSC using the following programs:
Felsenstein J and Churchill GA (1996). A hidden Markov model approach to variation among sites in rate of evolution. Mol Biol Evol 13:93-104.
Siepel A and Haussler D (2003). Combining phylogenetic and hidden Markov models in biosequence analysis. In Proceedings of the Seventh Annual International Conference on Computational Molecular Biology (RECOMB 2003), pp. 277-286.
Siepel A and Haussler D (2004). Phylogenetic hidden Markov models. In R. Nielsen, ed., Statistical Methods in Molecular Evolution, Springer (in press).
Yang Z (1994). Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol 39:306-314.
Yang Z (1995). A space-time process model for the evolution of DNA sequences. Genetics 139:993-1005.
Kent WJ, Baertsch R, Hinrichs A, Miller W, and Haussler D (2003). Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20):11484-11489.
Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, and Miller W (2004). Aligning Multiple Genomic Sequences with the Threaded Blockset Aligner. Genome Res 14(4):708-15.
Chiaromonte F, Yap VB, and Miller W (2002). Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002:115-26.
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, and Miller W (2003). Human-Mouse Alignments with BLASTZ. Genome Res 13(1):103-7.