This track shows a measure of evolutionary conservation in human, rat, and mouse that is based on a phylogenetic hidden Markov model (phylo-HMM). The alignments used were the multiz alignments of the mouse Feb. 2003 (mm3), rat Jun. 2003 (rn3), and human Jul. 2003 (hg16) assemblies.
A phylo-HMM is a generative probabilistic model that describes both the process of DNA substitution along the branches of a phylogeny at each site in an alignment, and changes in the mode of this process from one site to the next (Felsenstein and Churchill [1996], Yang [1995], Siepel and Haussler [2003], Siepel and Haussler [submitted]). Each of the states of a phylo-HMM may be associated with a different phylogenetic model, describing both the overall rates of substitution along the branches of the phylogeny, and the pattern of substitution (relative rates for all pairs of nucleotides).
In this case, the pattern of substitution is assumed to be the same for all states, but the rates are allowed to differ. The model is similar to the one proposed by Felsenstein and Churchill (1996). There are k states corresponding to k "rate categories," each of which is defined by a scaling constant that is applied uniformly to the branches of the tree. The transition probabilities between states allow for autocorrelation of substitution rates, i.e., for adjacent sites to tend to exhibit similar overall substitution rates. A single parameter lambda describes the degree of autocorrelation and defines all transition probabilities. Here, we have estimated the rate constants from the data, similarly to Yang (1995) (see Siepel and Haussler [2003]), but have treated lambda as a tuning parameter rather than estimating it.
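To make the role of lambda concrete, the sketch below builds a k-by-k transition matrix from a single autocorrelation parameter. The specific form is an assumption, not something stated here: with probability lambda the next site keeps the current rate category, and otherwise it draws a category uniformly at random (appropriate if all categories are equally probable). The function name `rate_transition_matrix` is hypothetical.

```python
def rate_transition_matrix(k, lam):
    """Return a k x k row-stochastic matrix with autocorrelation lam.

    Assumed form (not taken from the track description): with
    probability lam the next site stays in the current rate category;
    with probability 1 - lam it draws one of the k categories uniformly.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    off_diagonal = (1.0 - lam) / k
    return [
        [lam + off_diagonal if i == j else off_diagonal for j in range(k)]
        for i in range(k)
    ]


# Values matching the k and lambda quoted for this track.
matrix = rate_transition_matrix(10, 0.9)
```

Under this assumed form, lambda = 0 makes sites independent, and lambda close to 1 makes long runs of sites share one rate category.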
A phylogenetic model was fitted by maximum likelihood to the human/mouse/rat alignments using the REV substitution model and the discrete gamma model for rate variation (Yang [1994]; k=10 rate categories were used). All sites in the alignments were included in the analysis and alignment gaps were treated as missing data. Branch lengths were estimated as follows, in units of expected substitutions per site:
(human:0.193,(mouse:0.076,rat:0.083):0.193)
(As shown, the tree is arbitrarily rooted at the midpoint of the branch between human and the mouse/rat ancestor.) The shape parameter alpha for the discrete gamma model was estimated to have value 4.4.
Next, a 10-state phylo-HMM was derived from the estimated phylogenetic model (with lambda = 0.9), and used to compute the posterior probability that each site was "generated" by the state having the smallest rate constant. It is these posterior probabilities that are plotted for the track. Because of the way the rate categories are defined, the plotted values can be thought of as representing the posterior probability that each site is among the 10% most conserved sites in the data set, allowing for autocorrelation of substitution rates.
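The per-site posterior probabilities described above are the kind of quantity that the standard forward-backward algorithm for hidden Markov models produces. The following is a minimal sketch of that algorithm, not the code used to build the track. It assumes the likelihood of each alignment column under each state is already available as `emissions` (in a phylo-HMM these would come from the phylogenetic model scaled by each rate constant). The function name and the two-state toy numbers are hypothetical.

```python
def posterior_state_probs(emissions, transition, initial):
    """Forward-backward with per-site rescaling.

    emissions[t][s]  : likelihood of alignment column t under state s
    transition[r][s] : probability of moving from state r to state s
    initial[s]       : probability of starting in state s
    Returns post, where post[t][s] = P(state s at site t | all columns).
    """
    n, k = len(emissions), len(initial)

    # Forward pass; each row is rescaled to sum to 1 to avoid underflow.
    fwd = []
    for t in range(n):
        if t == 0:
            row = [initial[s] * emissions[0][s] for s in range(k)]
        else:
            prev = fwd[-1]
            row = [
                emissions[t][s] * sum(prev[r] * transition[r][s] for r in range(k))
                for s in range(k)
            ]
        scale = sum(row)
        fwd.append([x / scale for x in row])

    # Backward pass, rescaled the same way.
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        row = [
            sum(transition[s][r] * emissions[t + 1][r] * bwd[t + 1][r] for r in range(k))
            for s in range(k)
        ]
        scale = sum(row)
        bwd[t] = [x / scale for x in row]

    # The posterior is the normalized product of forward and backward terms.
    post = []
    for t in range(n):
        row = [fwd[t][s] * bwd[t][s] for s in range(k)]
        total = sum(row)
        post.append([x / total for x in row])
    return post


# Toy example: state 0 plays the role of the slowest rate category.
toy_transition = [[0.9, 0.1], [0.1, 0.9]]
toy_initial = [0.5, 0.5]
toy_emissions = [[0.8, 0.2], [0.8, 0.2], [0.3, 0.7]]
toy_post = posterior_state_probs(toy_emissions, toy_transition, toy_initial)
conserved_track = [row[0] for row in toy_post]  # one plotted value per site
```

In this sketch, the value at each site depends on its neighbors through the transition matrix, which is how autocorrelation of rates enters the plotted probabilities.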
Currently, gaps are treated as missing data by the phylo-HMM as well as in the estimation of the phylogenetic model, which sometimes leads to undesirably high posterior probabilities in gappy regions of the alignment. We are looking at several possible ways of improving the handling of alignment gaps.
This track was created by Adam Siepel, with suggestions from David Haussler and others. The display technology ("wiggle track") was developed by Hiram Clawson.
J. Felsenstein and G. A. Churchill. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104, 1996.
A. Siepel and D. Haussler. Combining phylogenetic and hidden Markov models in biosequence analysis. In Proceedings of the Seventh Annual International Conference on Computational Molecular Biology (RECOMB 2003), pages 277-286, 2003.
A. Siepel and D. Haussler. Phylogenetic hidden Markov models. Submitted book chapter.
Z. Yang. A space-time process model for the evolution of DNA sequences. Genetics 139:993-1005, 1995.
Z. Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314, 1994.