One of the goals of genome sequencing is to identify all the genes in an organism. Computational methods for identifying protein-coding genes are reasonably well developed, especially for compact genomes with few or no introns. Protein-coding genes have open reading frames, codon bias, and other telltale statistical signals that can be recognized. On the basis of such algorithms and other genetic characterization, the yeast genome is said to contain about 6000 genes and to have a coding density of about 75% [Goffeau et al., 1996].
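To make the open reading frame signal concrete, the following is a minimal sketch of a forward-strand ORF scan. It is not the method used to annotate the yeast genome; the function name `find_orfs`, the 100-codon default threshold, and the restriction to one strand are illustrative assumptions. A scan of this kind reports nothing for a sequence that lacks a long ATG-to-stop frame, which is the situation for most functional RNA genes.

```python
# Hypothetical illustration: a naive ORF scan on the forward strand only.
# The min_codons threshold is an assumed, adjustable cutoff.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) tuples for ATG..stop ORFs.

    start is the 0-based index of the ATG, end is one past the stop
    codon, and the length check counts codons from ATG up to (but not
    including) the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume the search after this stop codon
    return orfs


# A toy coding-like sequence is reported; a sequence without any
# ATG-to-stop frame is not.
coding_like = "ATG" + "GCT" * 5 + "TAA"
rna_like = "GCGCGCTTTT" * 3
print(find_orfs(coding_like, min_codons=3))
print(find_orfs(rna_like, min_codons=3))
```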
These genefinding algorithms do not attempt to search for noncoding functional RNA genes. Examples of noncoding functional RNAs have been known for decades, but their diversity and numbers have seemed small. New discoveries of enigmatic noncoding RNA genes, such as the mammalian tumor suppressor H19 [Brannan et al., 1990] or the mammalian X-dosage compensation gene Xist [Brockdorff et al., 1992, Brown et al., 1992], are interesting but perhaps exceptional. However, it seems possible that a large number of noncoding RNAs remain to be discovered, because experimental screens as well as computational screens tend to be biased against RNAs. Many functional RNAs are not polyadenylated, so they are not well represented in oligo-dT primed cDNA libraries or in EST sequencing projects. Many RNA genes are small and occur in redundant copies, and RNA genes are of course not affected by stop codon or frameshift mutations, so they are probably somewhat refractory to genetic screens. To date, most functional RNAs have probably been identified by biochemical means.
Here, we have extended the known gene family of methylation guide C/D box snoRNAs to 41 loci in yeast. Pseudouridylation guide snoRNAs are probably encoded by another large dispersed gene family [Ni et al., 1997, Gannot et al., 1997]. Yeast genome sequence analysts probably would not have guessed that careful computational analyses had missed two large gene families and almost 100 new genes. By themselves, the snoRNAs do not substantially alter the estimate of 6000 genes in yeast or the 75% coding fraction. However, given that one or two large gene families of functional RNAs escaped detection, how many others are there? How much "extragenic" DNA actually encodes functional RNAs? How many of the systematic gene knockouts being generated in yeast will also knock out an unsuspected RNA gene (especially an intronic one), and thus superpose two genetic phenotypes on the resulting disruption? Using probabilistic models, we are beginning to gather the tools necessary to screen genome sequences computationally and answer some of these questions.