Selectivity

While the ``sensitivity'' of an algorithm is measured by the proportion of true positives identified in reference sequences, a method's ``selectivity'' is measured by its ability to avoid misidentifying unrelated sequences as true tRNAs. Increased sensitivity is usually gained at the expense of an increased false positive rate. A rate of one false positive per five to ten million bases of sequence has, in the past, been acceptable since the total amount of uncharacterized or non-protein coding sequence in the databases has been relatively small. However, with the advent of whole-genome sequencing projects on the megabase scale, this false positive rate is of much greater concern.

Assessing the ability of an algorithm to discriminate between true and false positives using biological sequence data can be difficult. At false positive rates of less than one per million bases, there is not enough well annotated sequence in the public databases to give a reliable indication of an algorithm's true performance. Even for the data that is available, it is uncertain whether or not an accurate prediction has been made in the absence of biochemical experimental evidence. An alternative strategy is to generate random nucleotide sequence which is known to have no biologically-derived genes. An unlimited amount of random sequence can be generated based on a general or species-specific genomic nucleotide frequency. Each identification of a tRNA gene in this random sequence can then be confidently counted as a false positive. False positives due to biologically-derived repetitive elements or pseudogenes are not taken into account in these synthetic test sequences, and must be addressed separately.
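The random-sequence strategy described above can be sketched in a few lines. This is only an illustration, assuming independent draws per base and a single GC-content parameter; the function name `random_genome`, the `seed` argument, and the GC value of 0.36 are hypothetical placeholders, not the procedure or values given in Methods.

```python
import random


def random_genome(length, gc_content, seed=None):
    """Return a random DNA string whose expected GC fraction is gc_content.

    Assumption: each base is drawn independently, with G and C sharing the
    GC fraction equally and A and T sharing the remainder equally.
    """
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    rng = random.Random(seed)
    at_weight = (1.0 - gc_content) / 2.0
    gc_weight = gc_content / 2.0
    # Weights are listed in the same order as the alphabet "ACGT".
    bases = rng.choices("ACGT", weights=[at_weight, gc_weight, gc_weight, at_weight], k=length)
    return "".join(bases)


# Illustrative use: 10,000 bases at a placeholder GC content of 0.36.
sequence = random_genome(10_000, 0.36, seed=1)
observed_gc = (sequence.count("G") + sequence.count("C")) / len(sequence)
```

Because such a sequence contains no biologically derived genes by construction, any tRNA reported in it would count as a false positive. A base-independent model of this kind does not reproduce repeats or pseudogenes, which is why those cases have to be tested separately.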

Table 2.4: False positive rates for actual & simulated genomes.

``Actual FP'' rows contain false positives detected in actual genomic sequence. ``Simulated FP'' rows contain the false positives found in whole-genome scale random sequence simulations (10 trials for C. elegans, 5 for human). For tRNA covariance model searches (tRNA CM), only one random C. elegans and no human genome simulations were performed due to extreme CPU demands (ND=not done).

                                  Size     tRNAscan 1.3     EufindtRNA       tRNA CM          tRNAscan-SE
                                  (Mbp)    FP     FP/Mbp    FP     FP/Mbp    FP     FP/Mbp    FP     FP/Mbp

S. cerevisiae
  Actual FP (completed genome)    12.0     4      0.33      10     0.83      0      < 0.08    0      < 0.08

C. elegans
  Actual FP (portion completed)   58.4     29     0.50      355    6.08      0      < 0.03    0      < 0.03
  Simulated FP (total genome)     100      42.5   0.42      26     0.26      0      < 0.01    0      < 0.001

Human
  Actual FP (portion completed)   5.32     3      0.56      5      0.94      0      < 0.19    0      < 0.19
  Simulated FP (total genome)     3000     1118   0.37      684    0.23      ND     -         0      < 0.00007

We generated two types of random sequence sets to simulate the size and GC content of the C. elegans and human genomes (100 million and 3 billion bases of random sequence, respectively, as described in Methods). The number of false positives found with each algorithm appears in Table 2.4, along with false positive rates from actual genomic sequence (discussed below). Analysis of the simulated genomes gave consistent false positive rates across trials: approximately 0.40 false positives per million bases for tRNAscan 1.3, a little more than half that for EufindtRNA, and zero for both tRNAscan-SE and covariance model analysis. In ten independent C. elegans genome simulations, an average of 42.5 tRNAs were identified by tRNAscan 1.4. The sequences of these false positive tRNAs were saved and analyzed with the original tRNAscan 1.3 program to confirm that the false positives were due to the tRNAscan 1.3 algorithm, not the modifications introduced in tRNAscan 1.4. EufindtRNA produced an average of 26 false positives per simulated C. elegans genome. Both tRNAscan-SE and the tRNA covariance model searches found zero false positives in every trial (only one genome simulation was searched with the tRNA covariance model due to the extreme CPU demands). As seen in Table 2.5, minor differences in analysis time among the various methods for microbial genomes become substantial when analyzing larger eukaryotic genomes. Analysis of the single C. elegans genome simulation with covariance models required almost four CPU-months.
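The rates quoted above follow from simple arithmetic, sketched here. The helper names are hypothetical, and reading the "<" entries in Table 2.4 as "fewer than one false positive in the total sequence searched" is an assumption; it reproduces the simulated-genome bounds but is not stated in the text.

```python
def fp_per_mbp(false_positives, size_mbp):
    """False positives per million bases searched."""
    return false_positives / size_mbp


def zero_fp_upper_bound(size_mbp, trials=1):
    """Bound reported when no false positives are observed.

    Assumption: the bound is one false positive divided by the total
    amount of sequence searched across all trials.
    """
    return 1.0 / (size_mbp * trials)


# tRNAscan 1.3 on 58.4 Mbp of C. elegans sequence: 29 false positives.
actual_rate = fp_per_mbp(29, 58.4)

# tRNAscan 1.4 average over the simulated C. elegans genomes: 42.5 per 100 Mbp.
simulated_rate = fp_per_mbp(42.5, 100)

# tRNAscan-SE found nothing in ten simulated 100 Mbp genomes (1,000 Mbp total).
se_bound = zero_fp_upper_bound(100, trials=10)
```

Under that reading, ten 100 Mbp trials with no hits bound the rate below 0.001 per Mbp, and five 3,000 Mbp trials bound it below roughly 0.00007 per Mbp.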

Table 2.5: Analysis time in hours required for various complete genomes and tRNA search algorithms.

Actual genome scan times are given for tRNAscan-SE and EufindtRNA (genome simulation times used for human). Estimated scan times are given for tRNAscan 1.3 (400 bp/s) and tRNA covariance model analysis (tRNA CM; 20 bp/s).

Complete            Size      tRNAscan 1.3    EufindtRNA     tRNA CM        tRNAscan-SE
Genome              (Mbp)     (CPU hours)     (CPU hours)    (CPU hours)    (CPU hours)

P. anserina mito    0.1       0.14            < 0.001        2.8            0.019
H. influenzae       1.8       2.54            < 0.001        51             0.069
S. cerevisiae       12        16.7            0.02           333            0.33
C. elegans          100       139             0.15           2,780          1.8
Human               3,000     >4170           7.1            83,300         36.6

For the five human genome simulations, tRNAscan 1.4 produced an average of 1118 false positives per genome (had tRNAscan 1.3 been used, it would have taken almost half a CPU-year per trial). EufindtRNA searched the simulated genomes in just over seven hours per trial, giving an average of 684 falsely predicted tRNAs for each. Had we searched the entire 3 billion nucleotide human genome simulation with tRNA covariance model analysis, it would have taken over nine CPU-years for each trial (Table 2.5). Based on the histogram of covariance model scores against 500 million bases of simulated human sequence data (not shown), we estimate that the tRNA covariance model search of the simulated human genome would have produced zero false positives. tRNAscan-SE required an average of a day and a half to scan each of the three billion nucleotide test sets, and produced no false positives in any of the five trials (the same sequences were used as in the trials described above for tRNAscan 1.4 and EufindtRNA).
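The estimated scan times in Table 2.5 can be approximated from the quoted search speeds. The sketch below is a reconstruction, not the authors' calculation: the function name and the `strands=2` default are assumptions. The factor of two (both strands searched) is inferred only because it makes the arithmetic agree with the tabulated values; the text does not state it.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365


def scan_hours(size_mbp, bases_per_second, strands=2):
    """Estimated CPU hours to scan a genome of size_mbp million bases.

    Assumption: the quoted rate applies per strand and both strands are
    searched, hence the default strands=2.
    """
    total_bases = strands * size_mbp * 1_000_000
    return total_bases / bases_per_second / SECONDS_PER_HOUR


# C. elegans (100 Mbp) at the quoted tRNAscan 1.3 speed of 400 bp/s.
elegans_tscan = scan_hours(100, 400)

# Human (3,000 Mbp) at the quoted covariance model speed of 20 bp/s.
human_cm_hours = scan_hours(3000, 20)
human_cm_years = human_cm_hours / HOURS_PER_YEAR
```

With these assumptions the C. elegans estimate comes to about 139 CPU hours for tRNAscan 1.3, and the human covariance model estimate to roughly 83,000 CPU hours, or a little over nine CPU-years.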

A concern not addressed by the random sequence genome simulations is the ``false positive'' rate caused by certain classes of SINEs that are suspected to be derived from tRNA genes [Daniels & Deininger, 1985; Deininger, 1989]. These elements have similarity to known tRNA genes and contain well conserved RNA polymerase III internal A and B box promoters. To assess tRNAscan-SE's ability to identify and exclude these types of pseudo-tRNAs, the repeat element database Repbase maintained by Jerzy Jurka (ftp://ncbi.nlm.nih.gov/repository/repbase) was scanned. Of the reference sequences searched, tRNAscan-SE did not produce any false positive tRNA identifications. Covariance model analysis, however, did misidentify 12 of 775 rodent B2 SINE sequences and two ALU-like sequences (bovine ALU-like repetitive element & rat ALU type III-like repetitive element), all with scores between 20 and 28 bits. Rat identifier (ID or R.dre.1) sequences, also known to have high similarity to alanine, proline, and other tRNAs, were searched within Genbank and dbEST (database of expressed sequence tags, [Boguski et al., 1993]). tRNAscan-SE misidentified a total of four rat ID element sequences, one from Genbank (RATRSIDH) and three from dbEST (R46943, R46943, R82886). Covariance model analysis, because of its extreme sensitivity, is also unable to distinguish between these SINEs and true tRNAs, giving them scores between 24.5 and 33.1 bits. tRNAscan 1.3 requires strong adherence to secondary structure rules, and thus does not call any of these pseudogenes as tRNAs. The rest of Repbase, including consensus and database collections of ALU, L1, THE, MIR, MIR2, THR, and B1 repetitive elements, was also searched with tRNAscan-SE, giving no other false positives.

The limited selectivity of tRNAscan has already affected genome sequence annotation detrimentally. In 58.4 Mbp of C. elegans genomic sequence, tRNAscan 1.3 produced 29 tRNAs that were judged to be false positives (0.50 fp/Mbp) based on searching with the tRNA covariance model, visual inspection of secondary structure, and lack of primary sequence similarity to any other tRNAs within the genome. Since both the Washington University Genome Sequencing Center (St. Louis) and the Sanger Center (Cambridge, UK) used tRNAscan 1.3 in semi-automated sequence annotation until very recently, 16 of these 29 false positives are annotated as tRNAs in finished, submitted Genbank entries. This false positive rate is very close to that seen in the random C. elegans genome simulation (0.42 fp/Mbp), giving additional confidence in the estimates based on simulated sequence data.

tRNAscan-SE produced no obvious false positives in the C. elegans genomic sequence, but did identify 8 tRNAs that were judged to be possible pseudogenes by manual inspection (Table 2.3). Eleven other tRNAs were automatically identified as pseudogenes because their primary or secondary structure scores fell below the minimum values described in Methods. All 19 pseudogenes had strong similarity to other tRNAs within the genome, and contained unusual features such as 3-16 bp truncations of the 5' end of the gene, or other large insertions or deletions within the sequence. One could consider detection of these possible pseudogenes a desirable consequence of tRNAscan-SE's sensitivity. Further studies of these unusual tRNAs may help elucidate aspects of genome dynamics, genetic element mobility, and evolution.