The genome of the yeast Saccharomyces cerevisiae has been completely sequenced, and is thought to contain about 6000 protein-coding genes [Goffeau et al., 1996]. However, this is not the total number of genes in yeast. Some of the largest eukaryotic gene families produce functional RNAs rather than protein products. Yeast contains approximately 140 tandemly repeated copies of ribosomal RNA genes [Goffeau et al., 1996] and 274 dispersed transfer RNA genes [Lowe & Eddy, 1997]. The number of different identified functional RNAs is growing. In particular, a series of recent papers on small nucleolar RNAs (snoRNAs) has suggested the presence of large snoRNA gene families in eukaryotic genomes [Smith & Steitz, 1997, Tollervey & Kiss, 1997, Bachellerie & Cavaille, 1997].
snoRNAs appear to be involved at various stages of eukaryotic ribosome biogenesis, a complex process taking place in the nucleolus [Hadjiolov, 1985]. Ribosomal RNA (rRNA) undergoes cleavages and modifications before assembly with ribosomal proteins into the mature ribosome [Woolford, 1991]. Co-localized ribonucleoprotein particle complexes (RNPs) have been found to be essential for rRNA modifications [Tollervey et al., 1991, Mattaj et al., 1993]. The three most common rRNA modifications are ribose methylation, pseudouridylation, and base methylation [Maden, 1990]. The RNA components of these nucleolar RNPs, the small nucleolar RNAs, make up a diverse family of molecules that appear to fall into two major classes based on conserved sequence features: box H/ACA snoRNAs and box C/D snoRNAs [Balakin et al., 1996]. Some H/ACA snoRNAs are required for specific pseudouridylations [Gannot et al., 1997, Ni et al., 1997]. C/D box snoRNAs appear to have multiple roles in the nucleolus, one of which, rRNA ribose methylation, is the focus of this study.
Most C/D box snoRNAs contain one or more long (10-21 bp) stretches of exact complementarity to ribosomal RNA [Bachellerie et al., 1995, Maxwell & Fournier, 1995]. Many of these complementary regions within rRNA contain 2'-O-methyl modifications, which initially suggested that these snoRNAs might be involved in specifying the location of these modifications. Genetic disruption of U24 snoRNA in S. cerevisiae causes loss of the predicted target methyl groups [Kiss-Laszlo et al., 1996]. The same study showed that alteration of the rRNA complementary region was sufficient to cause addition of a predictable ectopic methyl at a new position on rRNA. snoRNA depletion experiments in Xenopus oocytes have shown that methylation guide snoRNAs are necessary for specific methylation in vertebrates as well [Tycowski et al., 1996, Dunbar & Baserga, 1998].
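To make the notion of a long stretch of exact complementarity concrete, the following minimal sketch reports maximal antiparallel Watson-Crick duplexes between a candidate snoRNA and an rRNA sequence. It is an illustration only, not the search procedure used in this study: the function names, the toy sequences, and the default 10-nt threshold are assumptions chosen for the example, and G-U wobble pairs are deliberately ignored.

```python
# Illustrative sketch only: function names and toy sequences are hypothetical.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_complementary_stretches(sno, rrna, min_len=10):
    """List maximal exact duplexes as (sno_start, rrna_start, length).

    A stretch of sno pairs antiparallel with a stretch of rrna exactly when
    it is identical to a substring of the reverse complement of rrna, so the
    problem reduces to finding common substrings (0-based coordinates).
    """
    target = reverse_complement(rrna)
    n, m = len(sno), len(target)
    # dp[i][t] = length of the common substring ending at sno[i-1], target[t-1]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for t in range(m):
            if sno[i] == target[t]:
                dp[i + 1][t + 1] = dp[i][t] + 1
    hits = []
    for i in range(1, n + 1):
        for t in range(1, m + 1):
            k = dp[i][t]
            extended = i < n and t < m and dp[i + 1][t + 1] > 0
            if k >= min_len and not extended:
                # Convert the end position in target back to an rRNA start.
                hits.append((i - k, m - t, k))
    return hits


if __name__ == "__main__":
    # Toy sequences (not real rRNA or snoRNA): a 12-nt guide between two flanks.
    rrna = "GGAUCCGAUUACGGCAUAGCUUAA"
    sno = "UGAUGA" + reverse_complement(rrna[5:17]) + "CUGA"
    print(find_complementary_stretches(sno, rrna))
```

The quadratic table is adequate for a single snoRNA against a short target; scanning a whole genome against full-length rRNA would call for an indexed or seeded search instead.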
The function of these ribose methylations remains unknown. The modifications are well conserved throughout eukaryotes, with more than 75% of 2'-O-methyl modified nucleotides in yeast aligning with homologous modified nucleotides in human ribosomal RNA [Maden, 1990]. The modifications are located non-randomly in the most phylogenetically conserved regions of rRNA [Raue et al., 1988]. Although their phylogenetic conservation suggests selective pressure, removal of two ribose methyls via genetic deletion of U24 snoRNA had no obvious effect on normal cell growth in yeast [Kiss-Laszlo et al., 1996].
The total number of rRNA ribose methyls in Saccharomyces carlsbergensis, a close relative of S. cerevisiae, has been estimated at 55 [Klootwijk & Planta, 1973]. Forty-two of these methyls have been mapped to specific nucleotide positions in the rRNA [Veldman et al., 1981, Raue et al., 1988, Maden, 1990]. In S. cerevisiae, 11 previously isolated C/D box snoRNAs have been predicted to be responsible for methylations at 12 sites [Kiss-Laszlo et al., 1996, Smith & Steitz, 1997], fewer than one fourth of the total ribose methylations. Experimental evidence supporting these predictions is available only for U24 [Kiss-Laszlo et al., 1996]. If the hypothesis is correct that snoRNAs guide most or all ribose methylation in eukaryotes, most members of this gene family remain unidentified in S. cerevisiae.
Because the S. cerevisiae genome is completely sequenced [Goffeau et al., 1996], it is reasonable to consider identifying methylation guide snoRNAs computationally. However, sequence similarity of snoRNAs across phyla and within the gene family is generally weak, so methods such as BLAST [Altschul et al., 1990] and FASTA [Pearson & Lipman, 1988] fail to identify new genes by similarity to known snoRNAs. Attempts have been made to identify snoRNAs by pattern searches based on the rRNA complementary guide sequence and other conserved features, but feature consensus is poor. If searches are limited to snoRNAs that occur within introns and that target known methylation sites (so the complementary region in rRNA is known), this strategy has been somewhat effective [Nicoloso et al., 1994, Nicoloso et al., 1996] since the false positive rate is minimized. However, in S. cerevisiae, most snoRNAs do not occur in introns, and a quarter of the rRNA methylations have not been precisely mapped.
Formal probabilistic models, based in part on methods used in speech recognition and computational linguistics, have been introduced for searching for complicated consensus features in biological sequences (reviewed in [Durbin et al., 1998]). Hidden Markov models (reviewed in [Eddy, 1996]) are probably the best known of these approaches. Another class of models, stochastic context-free grammars (SCFGs), has been used to construct probabilistic profiles of RNA genes that allow sensitive searching for RNA secondary structure [Eddy & Durbin, 1994, Sakakibara et al., 1994a]. Using these probabilistic modeling techniques, we can produce an integrated model of snoRNAs that takes into account the rRNA complementary region, the consensus C, D and D' boxes, terminal stem base pairings, as well as the relative position of these features within the snoRNAs. Once defined, the snoRNA gene model can be trained on previously identified, "trusted" members of the gene family, and updated as new snoRNAs are found and verified.
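As a much-reduced illustration of training a probabilistic model on trusted examples and then scoring candidates, the sketch below estimates a position-specific profile for a single ungapped motif and scores sequence windows by log-odds against a uniform background. It is not the integrated snoRNA model described above, which combines several features and their relative positions. The training set here is a toy set of box C-like heptamers following the RUGAUGA consensus, and the function names, pseudocount, and background frequency are assumptions made for the example.

```python
# Illustrative sketch only: a single-motif profile, not the full snoRNA model.
import math
from collections import Counter

ALPHABET = "ACGU"


def train_profile(examples, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned, equal-length motifs."""
    length = len(examples[0])
    total = len(examples) + pseudocount * len(ALPHABET)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in examples)
        # The pseudocount keeps unseen bases from receiving zero probability.
        profile.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return profile


def log_odds(profile, window, background=0.25):
    """Score a window in bits: log2 P(window | motif) / P(window | background)."""
    return sum(math.log2(profile[i][b] / background) for i, b in enumerate(window))


def best_hit(profile, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(profile)
    return max(
        (log_odds(profile, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


if __name__ == "__main__":
    # Toy "trusted" motifs (hypothetical training set, RUGAUGA-like).
    trusted = ["AUGAUGA", "GUGAUGA", "AUGAUGA", "GUGAUGA"]
    profile = train_profile(trusted)
    score, start = best_hit(profile, "CCCAUGAUGACCC")
    print(f"best window starts at {start} with score {score:.2f} bits")
```

Retraining is simply a matter of calling the hypothetical `train_profile` again with newly verified sequences appended to the trusted set, which mirrors the update step described in the text.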
The combination of probabilistic modeling approaches and the availability of the complete genome for S. cerevisiae has made it feasible for us to execute a "computational genetic screen" for the missing members of the methylation guide snoRNA family. In this study, we have identified 22 new guide snoRNAs, and experimentally verified guide function for all but one, which appears to be genetically redundant. Combined with verification of new methyl target sites for other known snoRNAs, we can now assign a guide snoRNA to all but 4 of the 55 ribose methyl sites in S. cerevisiae rRNA.