Transfer RNA (tRNA) genes are the single largest gene family. A typical eukaryotic genome contains hundreds of tRNA genes; the human genome contains an estimated 1,300 [Hatlen & Attardi, 1971]. Now that complete genomes are being sequenced, an accurate means of tRNA gene identification is needed. The tRNA repertoire of an organism affects the codon bias seen in highly expressed protein-coding genes. In extreme cases, selective pressure for extremely high or low genomic GC content may have caused loss of a tRNA, producing an unassigned codon [Oba et al., 1991, Kano et al., 1993]. Suppressor tRNAs are important genetic loci in many model organisms. In addition to authentic tRNA genes, tRNA-derived short interspersed nuclear elements (SINEs) have been identified in rodents and other mammals as likely mobile genetic elements [Daniels & Deininger, 1985, Deininger, 1989]. Detecting these elements and discriminating them from true tRNAs is a desirable feature of tRNA identification methods.
It is commonly believed that the best RNA gene detection methods are custom-written programs that search for one type of RNA gene exclusively [Dandekar & Hentze, 1995]. Numerous tRNA search programs key on primary sequence patterns and/or secondary structure specific to tRNAs [Staden, 1980, Paolella & Russo, 1985, Shortridge et al., 1986, Marvel, 1986, Wozniak & Makalowski, 1990, Fichant & Burks, 1991, Pavesi et al., 1994, El-Mabrouk & Lisacek, 1996]. Why bother with specialized tRNA-detection software instead of using a fast, commonly available similarity search program such as BLAST [Altschul et al., 1990] or FASTA [Pearson & Lipman, 1988]? Because many functional RNA genes tend to conserve a common base-paired secondary structure better than a consensus primary sequence, the accuracy of RNA similarity searching is much improved by including secondary structure elements. Several generalized RNA gene search tools look for specific combinations of primary and secondary structure motifs specified by the user [Saurin & Marliere, 1987, Staden, 1988, Gautheret et al., 1990, Sibbald et al., 1992, Laferriere et al., 1994, Billoud et al., 1996, Eddy & Durbin, 1994], although tRNA "descriptors" in these pattern-matching languages have typically under-performed custom-written programs.
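The point that base pairing can be conserved while primary sequence is not can be illustrated with a minimal sketch. The two 7-bp stems below are illustrative sequences chosen for this example, and the helper names `stem_pairs` and `identity` are assumptions introduced here; they are not part of any program cited above.

```python
# Watson-Crick pairs plus the G-U wobble pair, which is commonly allowed in RNA stems.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_pairs(five_prime: str, three_prime: str) -> int:
    """Count positions where the 5' strand pairs with the antiparallel 3' strand."""
    return sum((a, b) in PAIRS for a, b in zip(five_prime, reversed(three_prime)))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Illustrative stems (assumptions for this sketch): each tuple holds the
# 5' strand and the 3' strand, both written 5' to 3'.
stem_a = ("GCGGAUU", "AAUUCGC")
stem_b = ("CGCCUAA", "UUAGGCG")

for name, stem in (("stem_a", stem_a), ("stem_b", stem_b)):
    print(name, "paired positions:", stem_pairs(*stem))

# Both stems are fully paired, yet their 5' strands share no identical positions,
# so a purely sequence-based comparison would see no similarity between them.
print("5' strand identity:", identity(stem_a[0], stem_b[0]))
```

A sequence-only similarity search scores these two stems as unrelated, whereas a method that scores pairing potential treats them as equivalent structural elements.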
tRNAscan 1.3 [Fichant & Burks, 1991] is perhaps the most widely used tRNA detection program. It identifies approximately 97.5% of true tRNA genes and gives 0.37 false positives per million base pairs (Mbp) [Fichant & Burks, 1991]. The algorithm uses a hierarchical, rule-based system in which each potential tRNA must exceed empirically determined similarity thresholds for two intragenic promoters and must be able to form the base pairings present in tRNA stem-loop structures. The false positive rate of tRNAscan has been acceptable for small genomes, but for larger eukaryotic genomes it becomes a significant problem. At 0.37 false positives/Mbp over 3,000 Mbp, it would produce around 1,100 false positive tRNAs for the human genome; given that there are about 1,300 true tRNAs in the genome, almost half of the tRNAs predicted by tRNAscan would be false positives.
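The estimate above can be reproduced with a few lines of arithmetic. This is a worked check using only the figures quoted in the text; applying the 97.5% sensitivity to the 1,300 estimated true genes is an assumption made here to obtain the number of true positives.

```python
GENOME_MBP = 3000      # human genome size used in the text, in Mbp
FP_PER_MBP = 0.37      # tRNAscan 1.3 false positives per Mbp
TRUE_TRNAS = 1300      # estimated number of true human tRNA genes
SENSITIVITY = 0.975    # fraction of true tRNA genes detected by tRNAscan 1.3

# Expected false positives over the whole genome (about 1,100).
false_pos = FP_PER_MBP * GENOME_MBP

# Expected true positives, assuming the quoted sensitivity applies genome-wide.
true_pos = SENSITIVITY * TRUE_TRNAS

# Fraction of all predictions that would be false positives (a little under one half).
fp_fraction = false_pos / (false_pos + true_pos)

print(f"expected false positives: {false_pos:.0f}")
print(f"expected true positives:  {true_pos:.0f}")
print(f"false positive fraction:  {fp_fraction:.2f}")
```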
Pavesi and colleagues have developed a different tRNA detection algorithm [Pavesi et al., 1994] that searches exclusively for linear sequence signals in the form of eukaryotic RNA polymerase III promoters and terminators. The sensitivity and selectivity of this algorithm are roughly comparable to those of tRNAscan 1.3 in detection of eukaryotic tRNAs. Notably, the Pavesi algorithm identifies tRNAs not detected by tRNAscan 1.3, and vice versa [Pavesi et al., 1994]. The combined sensitivity of these two programs exceeds 99%; however, the combined false positive rate is about five times that of tRNAscan alone.
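The trade-off in combining two detectors can be sketched with sets. Taking the union of two hit lists recovers genes that either program misses, but it also pools their false positives. The gene labels and hit lists below are toy data invented for this illustration; they are not output from either program.

```python
# Toy data (hypothetical): four true genes and two detectors with different misses.
true_genes = {"t1", "t2", "t3", "t4"}
hits_a = {"t1", "t2", "t3", "x1"}   # detector A misses t4 and reports one false hit
hits_b = {"t1", "t2", "t4", "x2"}   # detector B misses t3 and reports one false hit


def sensitivity(hits: set) -> float:
    """Fraction of true genes that appear in the hit list."""
    return len(hits & true_genes) / len(true_genes)


def false_positives(hits: set) -> int:
    """Number of reported hits that are not true genes."""
    return len(hits - true_genes)


# Accept any candidate reported by either detector.
combined = hits_a | hits_b

for name, hits in (("A", hits_a), ("B", hits_b), ("A or B", combined)):
    print(name, "sensitivity:", sensitivity(hits), "false positives:", false_positives(hits))
```

In this toy case the union reaches full sensitivity while doubling the false positive count, which is why a combined first pass benefits from a more selective second stage.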
Eddy & Durbin [Eddy & Durbin, 1994] have developed a general RNA structure similarity search method employing probabilistic RNA structural profiles, or "covariance models". Covariance models capture both primary consensus and secondary structure information through the use of stochastic context-free grammars (SCFGs) [Eddy & Durbin, 1994, Grate, 1995, Sakakibara et al., 1994b]. Much like sequence profiles [Gribskov, 1994, Krogh et al., 1994], covariance models are constructed from multiple sequence alignments. Sequences are searched against a given covariance model using a three-dimensional dynamic programming algorithm, similar to a Smith-Waterman alignment but including base-pairing terms as well. RNA covariance models have the advantages of high sensitivity, high specificity, and general applicability to any RNA sequence family of interest, obviating the need for custom-written software for each RNA family. However, covariance model dynamic programming algorithms are almost prohibitively CPU-intensive. A tRNA covariance model identifies >99.98% of true tRNAs, with a false positive rate of <0.2/Mbp [Eddy & Durbin, 1994], but searching the human genome with a tRNA covariance model would take about nine and a half CPU-years (based on benchmarks on an SGI Indigo2 R4400/200 CPU, 140 SPECint92).
We describe here a program, tRNAscan-SE, that combines three tRNA search methods to attain the specificity of covariance model analysis with the speed and sensitivities of optimized versions of tRNAscan 1.3 and the Pavesi search algorithm. tRNAscan-SE detects 99-100% of true tRNAs, giving fewer than one false positive per fifteen billion nucleotides of random sequence, at approximately 1,000 to 3,000 times the speed of searching with tRNA covariance models. Additional extensions to tRNAscan-SE allow detection and accurate secondary structure prediction of unusual tRNA species including both prokaryotic and eukaryotic selenocysteine tRNA genes, as well as tRNA-derived repetitive elements and pseudogenes.
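To put these figures in perspective, the short calculation below scales the numbers quoted in this introduction to a human-sized genome. It is a back-of-the-envelope sketch: it assumes the 1,000 to 3,000-fold speedup applies to the nine and a half CPU-year estimate on the same hardware, and that the false positive bound measured on random sequence carries over to 3,000 Mbp of genomic sequence.

```python
CM_CPU_YEARS = 9.5          # covariance model search of the human genome, from the text
SPEEDUPS = (1000, 3000)     # quoted speedup range of tRNAscan-SE over covariance models
DAYS_PER_YEAR = 365.25

GENOME_GBP = 3.0            # human genome size used in the text (3,000 Mbp)
FP_PER_GBP = 1 / 15         # upper bound: one false positive per 15 Gbp of random sequence

# Estimated search time in CPU-days at each end of the speedup range.
cpu_days = [CM_CPU_YEARS * DAYS_PER_YEAR / s for s in SPEEDUPS]

# Upper bound on the expected number of false positives for one genome.
expected_fp = FP_PER_GBP * GENOME_GBP

for speedup, days in zip(SPEEDUPS, cpu_days):
    print(f"{speedup}x speedup: about {days:.1f} CPU-days")
print(f"expected false positives per genome: fewer than {expected_fp:.1f}")
```

Under these assumptions the search drops from years to a few CPU-days, and the expected number of false positives per genome falls well below one.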