Databases Tested


tRNA detection rates were assessed primarily by searching two annotated databases: the 1995 release of the Sprinzl tRNA database (retrieved from ftp://ftp.ebi.ac.uk/pub/databases/trna [Steinberg et al., 1993]), and a tRNA sequence subset of Genbank (retrieved from the National Center for Biotechnology Information on 9/24/96). Genomic DNA was also searched from Haemophilus influenzae (v. 1.0, from The Institute for Genomic Research (TIGR)) [Fleischmann et al., 1995], Mycoplasma genitalium [Fraser et al., 1995], Methanococcus jannaschii [Bult et al., 1996], Saccharomyces cerevisiae (rel. 4/24/96), Schizosaccharomyces pombe (completed cosmids retrieved from http://www.sanger.ac.uk/~yeastpub/svw/pombe.html on 9/30/96), Caenorhabditis elegans (completed cosmids retrieved 11/13/96 from ftp://ftp.sanger.ac.uk/pub/C.elegans/sequences), and human (completed cosmids retrieved 8/28/96 from ftp://ftp.sanger.ac.uk/pub/human).

The Sprinzl tRNA database is the most comprehensive tRNA database, containing 2700 entries from a wide variety of organisms [Steinberg et al., 1993]. It provides a set of trusted "true positives" for evaluating the sensitivity of a detection method. Since tRNAscan-SE was optimized for analyzing bacterial, archaeal, and eukaryotic genomic DNA, the 1144 tRNAs from species in these groups were chosen for analysis; mitochondrial, chloroplast, and viral tRNA sequences were excluded. From this set, the tRNAs that had been used to train the TRNA2.cm covariance model (553 tRNAs in the 1993 release of the database) were removed to increase the independence between training and testing sequence data. Entries were restored to their correct primary sequence by combining the Sprinzl structural alignment with the atypical insertions that are annotated in a separate file. Introns, which are absent from both the Sprinzl sequences and their annotation, were not restored. Two prokaryotic sequences (DI1950, DR1420) were also removed because they would contain introns over 200 bp long had introns been included; none of the current tRNA search programs attempts to detect tRNA genes containing long group I or group II introns.
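The selection steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used: the record layout, the origin labels, the function name build_test_set, and every entry ID other than DI1950 and DR1420 are hypothetical, since the Sprinzl file format is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrnaEntry:
    """Hypothetical, simplified view of one database entry."""
    entry_id: str
    origin: str      # assumed labels, e.g. "bacterial", "mitochondrial"
    sequence: str


# Groups the search program was optimized for; all other origins are dropped.
KEPT_ORIGINS = frozenset({"bacterial", "archaeal", "eukaryotic"})


def build_test_set(entries, training_ids, excluded_ids=frozenset()):
    """Keep entries from the targeted groups that are neither in the
    training set nor explicitly excluded (e.g. long-intron genes)."""
    return [
        e for e in entries
        if e.origin in KEPT_ORIGINS
        and e.entry_id not in training_ids
        and e.entry_id not in excluded_ids
    ]


if __name__ == "__main__":
    # Toy data; only DI1950 and DR1420 are IDs named in the text.
    entries = [
        TrnaEntry("X0001", "bacterial", "GCGGAUUUAGCUC"),
        TrnaEntry("X0002", "mitochondrial", "GUUUAUGUAGCUU"),
        TrnaEntry("X0003", "eukaryotic", "GGGCGUGUGGCGU"),
        TrnaEntry("DI1950", "bacterial", "GCCCGGAUAGCUC"),
    ]
    test_set = build_test_set(
        entries,
        training_ids={"X0003"},
        excluded_ids={"DI1950", "DR1420"},
    )
    print([e.entry_id for e in test_set])
```

Passing the training-set IDs separately from the manual exclusions keeps the two reasons for removal distinct, which makes it easier to report how many entries each filter dropped.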

A broad sample of non-viral, non-organellar Genbank sequences indicating at least one tRNA in their feature tables was also analyzed. C. elegans and S. cerevisiae sequences were excluded since these genomic sequences were tested separately. The sequences were retrieved using the IRX query system at the National Center for Biotechnology Information (NCBI). Incomplete or synthetic tRNA sequences were removed, yielding a total of 1051 sequences in the set. Genbank sequence annotation was not relied upon as a measure of the true number of tRNAs in the set, since annotation quality is highly variable. Instead, tRNA detection by covariance model analysis was used to estimate the total number of tRNAs. Sequences in which covariance model analysis detected no tRNAs were examined manually to determine why the annotated tRNAs had been missed, and six sequences believed to be tRNAs were added to the covariance model-detected set. This method gave us a reasonable lower bound on the number of true positives in the Genbank subset.
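The bookkeeping behind this lower-bound estimate can be illustrated with a short Python sketch. It assumes a hypothetical mapping from sequence identifier to the number of tRNAs reported by covariance model analysis; the function names, identifiers, and counts below are invented for illustration and do not reproduce the actual data.

```python
def sequences_needing_review(all_ids, cm_hits):
    """Return the sequences in which covariance model analysis found
    no tRNA; these are the candidates for manual examination."""
    return sorted(s for s in all_ids if cm_hits.get(s, 0) == 0)


def lower_bound_true_positives(cm_hits, manually_confirmed):
    """Estimate a lower bound on the true tRNA count: everything the
    covariance model detected plus tRNAs accepted after manual review."""
    return sum(cm_hits.values()) + sum(manually_confirmed.values())


if __name__ == "__main__":
    # Toy identifiers and counts, not taken from the Genbank subset.
    all_ids = ["seqA", "seqB", "seqC", "seqD"]
    cm_hits = {"seqA": 2, "seqB": 1}          # tRNAs detected per sequence
    to_review = sequences_needing_review(all_ids, cm_hits)
    print(to_review)

    # Suppose manual inspection accepts one tRNA in seqC and none in seqD.
    manually_confirmed = {"seqC": 1}
    print(lower_bound_true_positives(cm_hits, manually_confirmed))
```

The result is a lower bound because a tRNA that is missed by the covariance model and also rejected or overlooked during manual review is never counted.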


Todd M. Lowe
2000-03-31