
Genome Annotation and Gene Prediction

Not surprisingly, one of the most urgent tasks of "computational geneticists" today is to identify new genes that have not been studied experimentally and to infer their function. Computational functional inference usually involves recognizing sequence similarity between an anonymous query and a characterized matching sequence. For protein-encoding genes, a handful of generalized computational tools such as BLAST [Altschul et al., 1990; Gish, 1998], FASTA [Pearson & Lipman, 1988], and HMMER [Eddy, 1996] are quite adept at recognizing distant evolutionary relationships based on primary sequence conservation. Comparisons of this type yield new information only if a previously studied homolog is present in the database. Dedicated gene-finding programs such as Glimmer [Salzberg et al., 1998], GeneMark [Hayes & Borodovsky, 1998], GENSCAN [Burge & Karlin, 1997] and others attempt to identify genes based on sequence features shared by all protein-coding genes, such as start and stop codons and the periodicity and non-uniform frequency of codons. These gene predictions give potential gene boundaries but reveal nothing of function.
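To make the simplest of these shared features concrete, the following is a minimal sketch of a forward-strand open reading frame (ORF) scan that uses only start and stop codons. It is not how Glimmer, GeneMark, or GENSCAN work: those programs rely on statistical models of codon usage and periodicity. The function name `find_orfs`, the `min_codons` threshold, and the use of the standard genetic code's stop codons are assumptions made for illustration.

```python
# Naive ORF scan: illustrative only, not the method of any cited gene finder.
# Assumes the standard genetic code and searches the forward strand only.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon, and frame is 0, 1, or 2. min_codons counts the codons
    from the ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i
            elif codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence: CC + ATG AAA TTT TAG + GG
    print(find_orfs("CCATGAAATTTTAGGG"))
```

A scan like this reports candidate boundaries only. As with the dedicated gene finders, it says nothing about what a predicted gene does.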

Once sequence annotators have performed their analyses on a new stretch of DNA, the inferred information is generally deposited in public or specialized databases for use by experimental biologists. The bulk of sequence in the public databases (GenBank [Benson et al., 1999], EMBL [Rice et al., 1993], DDBJ [Tateno & Gojobori, 1997]) comes from the major genome centers, which annotate millions of nucleotides of sequence each month. Because of the volume of sequence processed, it is necessary to use computational tools that require limited human supervision. Although some have argued that all annotation should be conducted "on-the-fly" [Wheelan & Boguski, 1998], final inspection by annotation specialists is critical for resolving conflicting information from different sources, including similarity to expressed sequence tags (ESTs) and homologous genes, or gene boundary predictions from various gene finders. The goal, of course, is to present as many accurate predictions of true DNA/gene function as possible (sensitivity), while limiting the number of false predictions (selectivity).
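The two goals can be put in numbers. The sketch below assumes one common pair of definitions: sensitivity as the fraction of true features that are found, and selectivity as the fraction of predictions that are correct. The text does not state exact formulas, so these definitions, the function names, and the counts in the example are illustrative assumptions.

```python
# Assumed definitions (not given explicitly in the text):
#   sensitivity = TP / (TP + FN)   fraction of true features that were found
#   selectivity = TP / (TP + FP)   fraction of predictions that are correct

def sensitivity(true_pos, false_neg):
    """Fraction of real features recovered by the predictor."""
    return true_pos / (true_pos + false_neg)


def selectivity(true_pos, false_pos):
    """Fraction of reported predictions that are real features."""
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical counts: 90 of 100 real genes found, plus 30 false predictions.
    print(f"sensitivity = {sensitivity(90, 10):.2f}")
    print(f"selectivity = {selectivity(90, 30):.2f}")
```

Under these definitions, a predictor that reports one false positive for every correct prediction has a selectivity of 0.5, regardless of how many true features it finds.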

My first project in the lab involved improving the selectivity of transfer RNA (tRNA) gene detection for large-scale, automated genome analysis at the Genome Sequencing Center here at Washington University. The best existing program, tRNAscan 1.3 [Fichant & Burks, 1991], was expected to produce about one false positive for each correctly identified tRNA in the human genome. The new program I developed, tRNAscan-SE, significantly reduces false positives while increasing search sensitivity by combining the strengths of multiple tRNA search methods. This work was published [Lowe & Eddy, 1997] and is detailed in Chapter 2.


Todd M. Lowe
2000-03-31