| Bioinformatics: |
Hidden Markov Models in Computational Biology: Applications to Protein Modeling |
| Abstract |
Hidden Markov Models (HMMs) are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the SWISS-PROT 22 database for other sequences that are members of the given protein family, or contain the given domain. The HMM produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate three-dimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the HMM is able to distinguish members of these families from non-members with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appears to have a slight advantage over PROFILESEARCH in terms of lower rates of false negatives and false positives, even though the HMM is trained using only unaligned sequences, whereas PROFILESEARCH requires aligned training sequences. Our results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the $\alpha$-1 subunit of L-type calcium channels which play an important role in excitation-contraction coupling. 
This region has been suggested to contain the functional domains that are typical or essential for all L-type calcium channels regardless of whether they couple to ryanodine receptors, conduct ions or both.
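The discrimination test described in this abstract, which asks how closely a database sequence fits a trained HMM, can be illustrated with a minimal sketch. It assumes a generic HMM with explicit start, transition and emission tables rather than the match/insert/delete architecture typically used for protein families, and it scores with a log-odds ratio against a background distribution, which is a common choice but not necessarily the scoring used in the paper. The names `forward_log_likelihood`, `log_odds` and `background` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, start, trans, emit):
    """Return log P(seq | HMM) using the forward algorithm in log space.

    start[s], trans[s][t] and emit[s][symbol] are plain probabilities.
    """
    if not seq:
        return 0.0
    states = list(start)
    # Initialise with the first symbol.
    f = {s: _log(start[s]) + _log(emit[s].get(seq[0], 0.0)) for s in states}
    # Recurse over the remaining symbols.
    for sym in seq[1:]:
        f = {
            t: _log(emit[t].get(sym, 0.0))
            + log_sum_exp([f[s] + _log(trans[s][t]) for s in states])
            for t in states
        }
    return log_sum_exp(list(f.values()))


def log_odds(seq, start, trans, emit, background):
    """Score a sequence as log P(seq | HMM) - log P(seq | background)."""
    null = sum(_log(background[c]) for c in seq)
    return forward_log_likelihood(seq, start, trans, emit) - null
```

Under these assumptions, a positive score means the sequence fits the model better than the background does. A database search would rank sequences by this score and apply a threshold to it.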
| Bibtex |
| Download |
Postscript: hmm.part1.ps and hmm.part2.ps
Stochastic Context-Free Grammars for tRNA Modeling |
| Abstract |
Stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning and modeling families of tRNA sequences. SCFGs capture the sequences' common primary and secondary structure and generalize the hidden Markov models (HMMs) used in related work on protein and DNA. Results show that after having been trained on as few as 20 tRNA sequences from only two tRNA subfamilies (mitochondrial and cytoplasmic), the model can discern general tRNA from similar-length RNA sequences of other kinds, can find secondary structure of new tRNA sequences, and can produce multiple alignments of large sets of tRNA sequences. Our results suggest potential improvements in the alignments of the D- and T-domains in some mitochondrial tRNAs that cannot be fit into the canonical secondary structure.
| Bibtex |
| Download |
RNA Pseudoknot Modeling Using Intersections of Stochastic Context Free Grammars with Applications to Database Search |
| Abstract |
A model based on intersections of stochastic context free grammars is presented to allow for the modeling of RNA pseudoknot structures. The model runs relatively fast, having the same order of running time as stochastic context free grammar parsers. The model is shown to be able to perform database searches and find RNA sequences that resemble biotin-binding RNA pseudoknots. The problem domain of RNA biotin binders is significant because it supports the RNA world model of early life on Earth.
| Bibtex |
| Download |
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology. |
| Abstract |
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
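The way a Dirichlet mixture is combined with observed counts can be sketched as a posterior-mean estimate: each component is weighted by how well it explains the observed column, and the components' pseudocount-smoothed estimates are then averaged. The sketch below is a minimal illustration of that standard calculation, not the paper's implementation. The function names and the `(weight, alpha)` mixture format are assumptions, and the usage example uses a toy three-letter alphabet rather than the 20 amino acids.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_mean(counts, mixture):
    """Estimate letter probabilities from counts under a Dirichlet mixture.

    counts  : observed count n_i for each letter
    mixture : list of (q_j, alpha_j) pairs, where q_j is the mixture weight
              and alpha_j is the list of Dirichlet parameters of component j
    """
    # Log posterior weight of each component, up to a constant shared by
    # all components: log q_j + log B(n + alpha_j) - log B(alpha_j).
    log_w = []
    for q, alpha in mixture:
        post = [n + a for n, a in zip(counts, alpha)]
        log_w.append(math.log(q) + log_beta(post) - log_beta(alpha))

    # Normalise in a numerically stable way.
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    z = sum(w)
    resp = [x / z for x in w]

    # Average the per-component estimates (n_i + alpha_ji) / (|n| + |alpha_j|).
    total = sum(counts)
    probs = [0.0] * len(counts)
    for r, (_, alpha) in zip(resp, mixture):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += r * (n + a) / denom
    return probs


# Toy usage: two components over a three-letter alphabet.
toy_mixture = [(0.5, [5.0, 1.0, 1.0]), (0.5, [1.0, 1.0, 5.0])]
estimate = posterior_mean([3, 0, 0], toy_mixture)
```

With few observations the estimate stays close to the prior implied by the mixture, and with many observations it approaches the observed frequencies. That behaviour is what lets a model generalize to remotely related sequences.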
| Bibtex |
| Download |
Postscript: dirichlet.ps
Postscript tech report: dirichletTech.ps
RNA Modeling Using Stochastic Context-Free Grammars |
| Abstract |
Recent developments in high-throughput biological technologies have created a wealth of biological sequence data. The immense size of these biological datasets has prompted the use of computational methods for their analysis. This work presents the theory and application of stochastic context-free grammars (SCFGs) to biological sequence analysis and specifically to the problem of RNA secondary structure modeling. SCFGs are a method of characterizing biological sequences that takes into account the statistical identity of different sequence positions, including pairwise interactions between positions. It is their ability to model pairwise interacting positions that makes SCFGs a natural mathematical model of RNA secondary structure. SCFGs can automatically generate structural multiple alignments of RNA families that take into account basepairing interactions. SCFGs are presented as an extension of another probabilistic model used in biological sequence analysis, hidden Markov models. I present several SCFG algorithm developments, including an SCFG constraint system that gives significant performance enhancements in both time and space and allows large SCFGs to be applied to large sequence analysis problems. I give a method using intersected SCFGs to model non-context-free structures. I also introduce a new method of sequence classification using a support vector machine framework and feature vectors generated from an SCFG.
I apply the SCFG method to an in vitro selected RNA pseudoknot that binds biotin. Even though SCFGs cannot model the RNA pseudoknot structure directly, I show that an approximation using two SCFGs can effectively perform database searches and find RNA pseudoknot structures. I then apply SCFGs to modeling small subunit ribosomal RNA, a large molecule that is important to the construction of phylogenetic trees of life. I compare the SCFG method to several other methods in constructing multiple alignments of this molecule and show that the SCFG outperforms the other methods, attaining a multiple alignment whose quality is close to hand-edited alignments. I apply SCFGs with support vector machines to a phylogenetic classification problem and show that they outperform a standard method. I describe the SCFG RNA modeling software, RNACAD, that was used in this work.
| Bibtex |
| Download |
Knowledge-based Analysis of Microarray Gene Expression Data Using Support Vector Machines |
| Abstract |
We introduce a new method of functionally classifying genes using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines. SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods such as hierarchical clustering methods and self organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.
| Bibtex |
| Download |
Online: http://www.pnas.org/cgi/content/abstract/97/1/262
Postscript tech report: genex.ps
Small Subunit Ribosomal RNA Modeling Using Stochastic Context-Free Grammars |
| Abstract |
We introduce a model based on stochastic context-free grammars (SCFGs) that can construct small subunit ribosomal RNA (SSU rRNA) multiple alignments. The method takes into account both primary sequence and secondary structure basepairing interactions. We show that this method produces multiple alignments of quality close to hand-edited ones and outperforms several other methods. We also introduce a method of SCFG constraints that dramatically reduces the computer resources needed to use SCFGs effectively on large problems such as SSU rRNA. Without such constraints, the resource requirements are infeasible for most computers. This work has applications to fields such as phylogenetic tree construction. Keywords: Ribosomal RNA, Multiple Alignment, Stochastic Context-Free Grammar, HMM, Constraints
| Bibtex |
| Download |