Genome Analysis

Table 2.3: tRNAs identified in genomic databases by various search methods.

The ``Literature'' column gives the published number of tRNAs found within each genome. ``Tot'' columns give the total number of tRNAs found by each program. Numbers in parentheses in the (%) columns give the percentage of tRNAs detected relative to the literature count (H. influenzae, M. jannaschii, P. anserina); where the published tRNA annotation is incomplete or uncertain (M. genitalium, S. pombe, S. cerevisiae, C. elegans), percentages are instead relative to the total tRNAs found by tRNA covariance model analysis and supported by manual inspection. ``fp'' = false positives determined by covariance model analysis and manual inspection (these do not include pseudogenes that have strong similarity to known tRNAs). ``ps'' = tRNA identifications which appear to be pseudogenes containing 5' truncations of 3-16 bp, large insertions or deletions elsewhere, or other characteristics of tRNA-derived repetitive elements. ``ip'' = tRNAs automatically identified by tRNAscan-SE as likely pseudogenes, with qualities similar to the manually detected pseudogenes described above.

Sequence source                Size     Literature  tRNAscan 1.3  EufindtRNA   tRNA CM      tRNAscan-SE  fp / ps / ip
                               (Kbp)    tRNAs       Tot (%)       Tot (%)      Tot (%)      Tot (%)
M. genitalium                  580      33          36 (100)      19 (52.8)    36 (100)     36 (100)     +1 fp
H. influenzae                  1,830    56          55 (98.2)     42 (73.7)    58 (103.6)   58 (103.6)   +2 fp
M. jannaschii                  1,730    37          36 (97.3)     20 (54.0)    37 (100)     37 (100)     +1 fp
S. pombe (through 9/96)        4,176    -           45 (93.7)     46 (95.8)    48           48 (100)     +4 fp +1 fp
S. cerevisiae                  12,057   273         270 (98.5)    274 (100)    274          274 (100)    +4 fp +10 fp; +1 ps +1 ps +1 ps
C. elegans (through 11/13/96)  58,402   -           389 (96.5)    400 (99.2)   403          403 (100)    16 fp +29 fp +355 fp +11 ip; +19 ps +23 ps +8 ps
P. anserina mito.              100      27          18 (66.7)     11 (40.7)    27 (100)     22 (81.5)

Another measure of sensitivity was derived from searching complete or partial genomic sequence data from eubacterial, archaebacterial, yeast, and C. elegans sequencing projects (Table 2.3). For M. genitalium, 33 tRNAs were noted in the published [Fraser et al., 1995] and on-line gene identifications (http://www.tigr.org/tdb/mdb/mgdb/mgdb.html), whereas 36 tRNAs were detected by three tRNA detection methods (tRNAscan 1.3, tRNAscan-SE, covariance model analysis). The three tRNAs not appearing in the literature are for Arg (anticodon: CCT, bounds: 306615-306686, upper strand), Leu (anticodon: CAA, bounds: 448783-448861, upper strand), and Leu (anticodon: GAG, bounds: 446265-446181, reverse strand).

For the completed H. influenzae genome, 56 tRNAs are noted in the literature and on-line gene identifications [Fleischmann et al., 1995]. tRNAscan-SE and covariance model analysis both identify the tRNAs noted in the literature, plus two potentially novel tRNAs: SelCys (anticodon: TCA, bounds: 753881-753791) and Leu (anticodon: GAG, bounds: 1577041-1576960). The first is a selenocysteine tRNA; the second appears to be either a pseudogene or a true tRNA containing a short intron. [Note: Since publication of these results [Lowe & Eddy, 1997], TIGR has adopted our program for tRNA analysis, and updated their annotation.] The selenocysteine tRNA identification is not unexpected: BLAST searches identify two enzymes in the selenocysteine insertion pathway, as well as a formate dehydrogenase containing a 'UGA' selenocysteine-insertion codon. The evidence for the other potentially novel tRNA is less certain. The short 12 bp ``intron'' would presumably require protein-splicing to generate a functional tRNA, a feature that would be novel among eubacterial tRNAs. However, the covariance model score of 36.88 bits for the tRNA is well above the minimum cutoff of 20 bits, indicating that the sequence is likely to have evolutionary homology with tRNA. It is therefore possible that it is a pseudogene. tRNAscan 1.3 identifies 55 of the 56 tRNAs noted in the literature (Gly-B, by TIGR nomenclature, is not detected), and does not detect either of the novel tRNAs detected by tRNAscan-SE and covariance model analysis.
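The coordinate pairs quoted above can be read mechanically. The following is a minimal Python sketch, assuming that bounds are inclusive and that a start coordinate larger than the end coordinate denotes the reverse strand (as the M. genitalium Leu (GAG) entry suggests; the strands of the two H. influenzae entries are not stated in the text and are inferred here). The `Locus` type and `parse_bounds` helper are hypothetical names introduced for illustration and are not part of any of the programs discussed.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    start: int    # first coordinate as written in the text
    end: int      # second coordinate as written in the text
    strand: str   # "upper" or "reverse" (assumed from coordinate order)
    length: int   # inclusive length in bp


def parse_bounds(bounds: str) -> Locus:
    """Parse a 'start-end' string such as '446265-446181'."""
    start_text, end_text = bounds.split("-")
    start, end = int(start_text), int(end_text)
    # Assumption: start > end means the gene lies on the reverse strand.
    strand = "reverse" if start > end else "upper"
    return Locus(start, end, strand, abs(end - start) + 1)


# Bounds quoted in the text for the tRNAs absent from the literature.
novel_trnas = {
    "M. genitalium Arg (CCT)": "306615-306686",
    "M. genitalium Leu (CAA)": "448783-448861",
    "M. genitalium Leu (GAG)": "446265-446181",
    "H. influenzae SelCys (TCA)": "753881-753791",
    "H. influenzae Leu (GAG)": "1577041-1576960",
}

for name, bounds in novel_trnas.items():
    locus = parse_bounds(bounds)
    print(f"{name}: {locus.length} bp, {locus.strand} strand")
```

Under these assumptions the three M. genitalium loci span 72, 79, and 85 bp, and the two H. influenzae loci span 91 and 82 bp.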

The genomic sequence of the archaebacterium M. jannaschii was also analyzed. Both tRNAscan-SE and covariance model analysis identified all 37 tRNAs given in the literature [Bult et al., 1996]. tRNAscan 1.3 identified 36 of the 37 tRNAs, missing the single selenocysteine tRNA in the set.

We also scanned the recently completed genomic sequence of the budding yeast S. cerevisiae (12 Mbp). The covariance model search took 14 days to complete and reported 275 tRNAs. Based either on inspection for ability to form correct tRNA secondary structure, or on exact identity with previously characterized yeast tRNAs, we believe 274 of the predicted tRNAs are true tRNAs and one is a pseudogene with a 7 bp 5' truncation. One of these 274 tRNAs was missing from the yeast genome project web site annotation (http://www.mips.biochem.mpg.de/), but this is probably an oversight, since a tRNA of identical sequence is correctly annotated elsewhere in the genome (tRNA_i_S (GCT)LR2). tRNAscan-SE took 19 minutes and detected the same 275 tRNAs found by covariance model analysis. EufindtRNA found the same 275 tRNAs in just over one minute. tRNAscan 1.3 took about 10 hours to complete, and missed 4 (2 pairs identical in sequence) of the 274 true tRNAs found by the other three methods.

The 4 Mbp of available genomic sequence from Schizosaccharomyces pombe (fission yeast) was also analyzed. tRNAscan-SE and covariance model analysis both predict 48 tRNAs. tRNAscan 1.3 identifies 45 of the 48 predicted by covariance model analysis (2 of the 3 missed were identical in sequence), whereas EufindtRNA identifies 46 of the 48 total tRNAs.
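To put the S. cerevisiae run times quoted above on a common scale, the following Python sketch converts them to minutes and derives rough throughput and speed ratios. It uses only the figures given in the text and Table 2.3; ``about 10 hours'' and ``just over one minute'' are treated as exactly 600 minutes and 1 minute, which is an assumption, so the results are approximate. The function names are hypothetical.

```python
# S. cerevisiae genome size from Table 2.3 (12,057 Kbp).
GENOME_BP = 12_057_000

# Run times from the text, in minutes. The tRNAscan 1.3 and EufindtRNA
# values are approximations ("about 10 hours", "just over one minute").
run_minutes = {
    "covariance model": 14 * 24 * 60,  # 14 days
    "tRNAscan 1.3": 10 * 60,           # about 10 hours
    "tRNAscan-SE": 19,
    "EufindtRNA": 1,                   # just over one minute
}


def bp_per_second(minutes: float, genome_bp: int = GENOME_BP) -> float:
    """Average search throughput in base pairs per second."""
    return genome_bp / (minutes * 60)


def speedup(slow: str, fast: str) -> float:
    """How many times faster `fast` ran than `slow`."""
    return run_minutes[slow] / run_minutes[fast]


for program, minutes in run_minutes.items():
    print(f"{program}: ~{bp_per_second(minutes):,.0f} bp/s")

print(f"tRNAscan-SE vs. covariance model: ~{speedup('covariance model', 'tRNAscan-SE'):,.0f}x")
print(f"tRNAscan-SE vs. tRNAscan 1.3: ~{speedup('tRNAscan 1.3', 'tRNAscan-SE'):.1f}x")
```

On these figures tRNAscan-SE ran roughly a thousand times faster than the full covariance model search while reporting the same 275 tRNAs.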

Finally, we scanned the largest set of genomic sequence currently available, 58.4 Mbp from the C. elegans genome project. Since only a handful of the tRNAs detected have been previously published in the literature, we again relied on covariance model detection of tRNAs as our best measure for ``true'' tRNAs. Conflicts in tRNA predictions between tRNAscan 1.3, tRNAscan-SE and covariance model analysis were all examined manually for highly conserved primary sequence motifs and proper secondary structure. As most tRNA species are multicopy in eukaryotes, BLAST similarity searches were used to help discern ``false positives'' from pseudogenes. We define false positives as predicted tRNAs which do not appear to be evolutionarily derived from true tRNAs; they are recognized by their failure to form recognizable tRNA secondary structure and by the lack of related tRNAs elsewhere in the genome. Pseudogenes, on the other hand, usually have at least partial tRNA secondary structure, plus clear deletions or insertions relative to at least one related, intact tRNA elsewhere in the genome. tRNA-derived mobile elements also have recognizable primary sequence similarity to tRNAs, although most have poor tRNA secondary structure similarity. Of the 403 complete tRNAs detected by covariance model analysis, tRNAscan-SE detected all 403 (100%), whereas tRNAscan 1.3 detected 389 (96.5%) and EufindtRNA found 400 (99.2%).
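The criteria above were applied by manual inspection, not by a program. Purely as an illustration, they could be sketched as a decision rule like the one below. The `Evidence` fields, the `classify` function, and the ``unresolved'' fallback are all hypothetical names and simplifications introduced here; each boolean stands for the outcome of a manual check or BLAST search described in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Call(Enum):
    TRUE_TRNA = "true tRNA"
    PSEUDOGENE = "pseudogene"
    FALSE_POSITIVE = "false positive"
    UNRESOLVED = "unresolved"  # hypothetical fallback: needs further inspection


@dataclass
class Evidence:
    """Hypothetical summary of the manual checks for one predicted tRNA."""
    intact_structure: bool         # forms proper tRNA secondary structure
    recognizable_structure: bool   # at least partial tRNA secondary structure
    related_trna_elsewhere: bool   # BLAST finds a related tRNA in the genome
    indels_vs_intact: bool         # clear insertions/deletions vs. an intact relative


def classify(e: Evidence) -> Call:
    # False positive: no recognizable structure and no related tRNA elsewhere.
    if not e.recognizable_structure and not e.related_trna_elsewhere:
        return Call.FALSE_POSITIVE
    # Pseudogene: related to an intact tRNA but carrying clear indels.
    if e.related_trna_elsewhere and e.indels_vs_intact:
        return Call.PSEUDOGENE
    # True tRNA: intact structure with no sign of disruption.
    if e.intact_structure:
        return Call.TRUE_TRNA
    return Call.UNRESOLVED


print(classify(Evidence(False, False, False, False)).value)
print(classify(Evidence(False, True, True, True)).value)
print(classify(Evidence(True, True, True, False)).value)
```

The fallback case reflects that some predictions, such as tRNA-derived mobile elements with sequence similarity but poor structure, do not fall cleanly under a simple rule and were resolved case by case.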

Taken together, across the M. genitalium, H. influenzae, M. jannaschii, S. cerevisiae, S. pombe, and C. elegans genomes, tRNAscan-SE found 100% of the 856 tRNAs detected by covariance model analysis. tRNAscan 1.3 detected 831, missing 25 tRNAs identified by covariance models, a 97.1% detection rate. EufindtRNA detects 93.5% of the 856-tRNA set, but if only eukaryotic genomes are considered, the program finds 720 of 725 (99.3%).
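The aggregate figures above can be recomputed from the ``Tot'' columns of Table 2.3. The Python sketch below does so for the six genomes named in the summary (P. anserina is left out, as in the text). The `GenomeCounts` type, the `TABLE` dictionary, and the `detection` function are hypothetical names introduced for illustration; the counts themselves are copied from the table.

```python
from typing import NamedTuple, Tuple


class GenomeCounts(NamedTuple):
    eukaryote: bool
    cm: int            # tRNAs found by covariance model analysis
    trnascan_13: int   # tRNAscan 1.3
    eufindtrna: int    # EufindtRNA
    trnascan_se: int   # tRNAscan-SE


# "Tot" values from Table 2.3 (true tRNAs only; fp/ps/ip excluded).
TABLE = {
    "M. genitalium": GenomeCounts(False, 36, 36, 19, 36),
    "H. influenzae": GenomeCounts(False, 58, 55, 42, 58),
    "M. jannaschii": GenomeCounts(False, 37, 36, 20, 37),
    "S. pombe": GenomeCounts(True, 48, 45, 46, 48),
    "S. cerevisiae": GenomeCounts(True, 274, 270, 274, 274),
    "C. elegans": GenomeCounts(True, 403, 389, 400, 403),
}


def detection(program: str, eukaryotes_only: bool = False) -> Tuple[int, int, float]:
    """Return (found, covariance-model total, percent detected) for a program."""
    rows = [r for r in TABLE.values() if r.eukaryote or not eukaryotes_only]
    found = sum(getattr(r, program) for r in rows)
    total = sum(r.cm for r in rows)
    return found, total, 100.0 * found / total


for program in ("trnascan_se", "trnascan_13", "eufindtrna"):
    found, total, pct = detection(program)
    print(f"{program}: {found} of {total} ({pct:.2f}%)")

found, total, pct = detection("eufindtrna", eukaryotes_only=True)
print(f"eufindtrna, eukaryotes only: {found} of {total} ({pct:.2f}%)")
```

With these inputs the tRNAscan 1.3 total is 831 of 856 and the eukaryote-only EufindtRNA total is 720 of 725, matching the text. The overall EufindtRNA total comes to 801 of 856, or 93.57%, which agrees with the quoted 93.5% if the figure was truncated and not rounded.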