
Computational Detection of RNA Genes

Most biologists and genome researchers concentrate solely on protein-coding genes and are therefore unaware of the special issues involved in detecting RNA genes. The variety of RNA genes known today is fairly small relative to protein-coding genes, although the number of members within a single RNA gene family can be substantial. For example, the yeast S. cerevisiae contains 274 transfer RNAs [Lowe & Eddy, 1997] and, to date, 65 known small nucleolar RNAs (snoRNAs) [Samarsky & Fournier, 1999]. Taken together, these 339 RNA genes equal more than 5% of the estimated 6000 protein-coding genes in the yeast genome [Goffeau et al., 1996]. Computational methods are therefore needed to identify these and other RNA genes, which are otherwise hidden between, and sometimes within, protein-coding regions (e.g., within introns).

RNA gene prediction presents a particularly challenging problem. In contrast to protein-coding genes, there are no generalized computational methods for identifying new classes of RNA genes. Even for well-known RNAs with homologs present in the sequence databases, detection by similarity search often fails because these methods detect only primary sequence conservation. Homologous RNA genes predominantly preserve secondary structure, which allows base-paired nucleotides to change as long as a compensatory change in the partner maintains pairing (e.g., a C-G pair can change to a G-C, A-T, or T-A pair). As a result, homologs can share a structure while retaining little sequence identity, which often precludes detection of other family members within the same genome or within other species' genomes.
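
The effect of compensatory changes can be illustrated with a minimal Python sketch. The two hairpin sequences below are invented for illustration (they are not real RNA genes), and the helper names `percent_identity` and `stem_is_paired` are hypothetical. Both hairpins form the same 5-bp stem around an identical loop, yet every stem position differs between them, so a comparison based on primary sequence alone sees little similarity.

```python
# Watson-Crick pairs in DNA notation (gene sequences are written with T, not U).
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def stem_is_paired(seq: str, stem_len: int) -> bool:
    """True if the first stem_len bases pair with the last stem_len bases,
    working inward from the two ends of the sequence."""
    return all((seq[i], seq[-1 - i]) in WATSON_CRICK for i in range(stem_len))


# Hypothetical hairpins: 5-bp stem, 4-base loop, 5-bp stem.
hairpin_a = "GCGAC" + "TTCG" + "GTCGC"
hairpin_b = "CATGA" + "TTCG" + "TCATG"

print(stem_is_paired(hairpin_a, 5), stem_is_paired(hairpin_b, 5))
print(f"{percent_identity(hairpin_a, hairpin_b):.1f}% identical")
```

Only the four loop positions match between the two sequences; the shared structure is invisible to a position-by-position comparison.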

Two brief examples illustrate this point. First, transfer RNAs all share the same basic "clover-leaf" secondary structure and biological function. The Haemophilus influenzae genome has 58 annotated transfer RNAs [Fleischmann et al., 1995]. A WU-BLAST search [Gish, 1998] of the H. influenzae tRNA-Ser-3 gene against its own genome identifies only 2 other tRNAs with significant P-values (<0.05). Second, the ribonuclease P (RNaseP) RNA, involved in the 5' end maturation of tRNA precursors, is a phylogenetically ubiquitous RNA with homologs from more than 250 species spanning all three domains of life [Brown, 1999]. The telomerase RNA, involved in maintaining eukaryotic chromosomal telomeres, has been identified in ciliates, yeast, and mammals. Neither RNaseP RNA nor telomerase RNA homologs have been identified by current computational methods in the completed C. elegans genome [C. elegans Sequencing Consortium, 1998], even though C. elegans is expected to require both.

Currently, the most effective methods for identifying RNA genes use primary and secondary structure information specific to each RNA gene family [Gautheret et al., 1990; Fichant & Burks, 1991; Sakakibara et al., 1994a; Eddy & Durbin, 1994]. The most accurate of these employ probabilistic RNA structural profiles, or "covariance models". Covariance models are able to capture both primary consensus and secondary structure information through the use of stochastic context-free grammars (SCFGs) [Grate, 1995; Sakakibara et al., 1994a; Eddy & Durbin, 1994]. Much like sequence profiles [Gribskov et al., 1990; Krogh et al., 1994], covariance models are constructed from multiple sequence alignments of family members. These SCFG-based methods have practical limitations due to the complexity of their exhaustive calculations, limiting the length of the target RNAs or the size of genome sequences that can be searched in a reasonable amount of time [Sakakibara et al., 1994a; Eddy & Durbin, 1994]. The success of tRNAscan-SE (Chapter 2) is due in large part to harnessing the power of covariance models while reducing their genome search space (and thus search time) by about ten-fold.
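
To make concrete what an alignment carries beyond a per-column sequence profile, the sketch below measures covariation between two alignment columns using mutual information, one standard way to quantify such a signal. This is not an SCFG or a covariance model implementation; the toy alignment and the function name `mutual_information` are assumptions made purely for illustration.

```python
import math
from collections import Counter


def mutual_information(alignment: list[str], i: int, j: int) -> float:
    """Mutual information, in bits, between columns i and j of an alignment.

    Assumes an ungapped alignment whose rows all have the same length.
    """
    n = len(alignment)
    col_i = [row[i] for row in alignment]
    col_j = [row[j] for row in alignment]
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (x, y), count in freq_ij.items():
        p_xy = count / n
        p_x = freq_i[x] / n
        p_y = freq_j[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Toy alignment: columns 0 and 4 always form a Watson-Crick pair, but the
# identity of the pair differs in every row; columns 1 to 3 are invariant.
toy_alignment = ["GAAAC", "CAAAG", "AAAAT", "TAAAA"]

print(mutual_information(toy_alignment, 0, 4))  # covarying columns
print(mutual_information(toy_alignment, 1, 2))  # invariant columns
```

In the toy alignment, columns 0 and 4 show no consensus base individually, so a sequence profile gains nothing from them, while their joint distribution is fully informative. That pairwise signal is the kind of information a structural profile is designed to exploit.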

Aside from computational complexity issues, covariance models are not well suited to represent a certain class of RNA genes known as "antisense RNAs". This type of RNA interacts with other RNA molecules via short stretches of complementary bases. One example is the small nucleolar RNA (snoRNA) gene family. SnoRNAs direct highly specific nucleotide modifications via their antisense regions, which pair with a target ribosomal RNA sequence (reviewed below). An alignment of snoRNAs for SCFG-based profile training does not capture the information contained within the rRNA-complementary region, because these sequences differ for each snoRNA and therefore appear non-conserved. In fact, the ability of these regions to base-pair with other RNAs is their most important, information-rich quality. For these reasons, SCFGs fail to detect snoRNAs and likely other antisense RNA gene families.
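
The information in an antisense region lies in its complementarity to a second sequence, which can be tested directly. The following sketch finds the longest perfectly complementary stretch between a candidate guide sequence and a target by reverse-complementing the guide and searching for the longest common substring. The sequences and the function name `longest_antisense_match` are hypothetical, and the sketch considers only Watson-Crick pairs in DNA notation (no G-U wobble pairs or mismatches), which a real snoRNA search would likely need to handle.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA-notation sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_antisense_match(guide: str, target: str) -> tuple[int, int]:
    """Return (length, target_start) of the longest stretch of the target
    that is perfectly complementary to some stretch of the guide."""
    probe = reverse_complement(guide)
    best_len, best_start = 0, -1
    # Longest-common-substring dynamic programming, one row at a time.
    prev = [0] * (len(target) + 1)
    for i in range(1, len(probe) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if probe[i - 1] == target[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_start = curr[j], j - curr[j]
        prev = curr
    return best_len, best_start


# Invented sequences: the guide carries a 10-nt element complementary to
# positions 5-14 of the target, flanked by unrelated bases.
target = "GGATCCGTACGTTAGC"
guide = "AACTAACGTACGTTTT"

length, start = longest_antisense_match(guide, target)
print(length, start, target[start:start + length])
```

Because the complementary element is defined relative to the target rather than to other family members, a search of this kind needs both sequences as input, which is exactly what an alignment-derived profile lacks.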

SCFG-based profile search methods are also championed because they are general. Instead of requiring a completely new search program for each new type of RNA, profile SCFGs only require an alignment from which to create a new RNA gene search model. This generality can also be seen as a limitation. RNA genes differ from proteins in that the function of an RNA gene can often be predicted fairly simply by examining specific sequence characteristics. Prediction of tRNA identity and function is a good example. Covariance models are excellent at detecting tRNAs, but because they are general, they only produce gene boundaries and have no concept of anticodon sequence or gene function. Specialized programs like tRNAscan-SE can predict function automatically based on recognition of the anticodon sequence.
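
As a rough illustration of how an anticodon can be turned into a functional prediction, the sketch below maps an anticodon to the amino acid encoded by its exactly complementary codon under the standard genetic code. It is not tRNAscan-SE's implementation; the function name `isotype_from_anticodon` is hypothetical, and the sketch deliberately ignores wobble pairing, modified bases, and special cases such as initiator tRNAs, so it should be read as a simplification.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def isotype_from_anticodon(anticodon: str) -> str:
    """One-letter amino acid for the codon exactly complementary to the
    anticodon ('*' indicates a stop codon). Accepts RNA or DNA notation."""
    dna = anticodon.upper().replace("U", "T")
    if len(dna) != 3 or any(base not in BASES for base in dna):
        raise ValueError(f"not a valid anticodon: {anticodon!r}")
    codon = dna.translate(COMPLEMENT)[::-1]
    return CODON_TABLE[codon]


for anticodon in ("GAA", "TGA", "CAU"):
    print(anticodon, isotype_from_anticodon(anticodon))
```

A general-purpose structural profile reports where a tRNA lies; a family-specific step of this kind, applied once the anticodon position is known, is what lets a specialized program annotate which amino acid the tRNA is expected to carry.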


Todd M. Lowe
2000-03-31