The snoRNA search algorithm is diagrammed in Figure 4.1. The program searches the query sequence for snoRNA features in the following order. First, a box D sequence matching the pattern ``(A/C)UGA'' is identified. Second, the highest scoring box C sequence (a seven bp pattern scored by a log odds weight matrix) is located 35-200 bp upstream of box D. Third, the intervening sequence is checked for a region of rRNA complementarity of nine bp or longer, allowing a maximum of three mismatches and any number of G-U pairings. Fourth, if the rRNA match is not immediately adjacent to box D, the highest scoring box D' sequence (a four bp pattern scored by a log odds weight matrix) is identified just 3' to the region of rRNA complementarity. Finally, the rRNA methylation site guided by the candidate snoRNA is predicted by counting five bp upstream of box D or D'.
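The first three steps of this search can be illustrated with a minimal Python sketch. It is not the program described here: the function names are hypothetical, the sequence is assumed to be written in the RNA alphabet, and the 35-200 bp distance is assumed to be measured between the start positions of box C and box D.

```python
import re

# Watson-Crick pairs plus G-U wobble pairs, as (guide base, target base).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def find_box_d(seq):
    """Return the start offset of every (A/C)UGA match, overlapping ones included."""
    return [m.start() for m in re.finditer(r"(?=[AC]UGA)", seq)]

def box_c_window(d_start, lo=35, hi=200):
    """Half-open range of box C start offsets lying lo..hi bp upstream of box D.

    Assumption: distance is measured from box C start to box D start.
    """
    return max(0, d_start - hi), max(0, d_start - lo + 1)

def count_mismatches(guide, target):
    """Count non-pairing positions with guide and target aligned antiparallel.

    Both strings are written 5'->3'; G-U pairs are not counted as mismatches.
    """
    return sum((g, t) not in PAIRS for g, t in zip(guide, reversed(target)))

def is_complementary(guide, target, min_len=9, max_mismatch=3):
    """Apply the length and mismatch limits stated in the text."""
    return (len(guide) == len(target) >= min_len
            and count_mismatches(guide, target) <= max_mismatch)
```

A lookahead is used in `find_box_d` so that overlapping candidates such as ``CUGAUGA'' yield two box D starts rather than one. Scoring of box C and box D' by weight matrix is left out of this sketch.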
Each candidate snoRNA alignment was then scored against our probabilistic model (Table 4.1). SnoRNAs were ranked by a final log odds score [Barrett et al., 1997] that incorporated information from each of the snoRNA features. The initial model was trained on 35 human C/D box snoRNAs proposed to function as methylation guides [Kiss-Laszlo et al., 1996]. Nine previously isolated yeast snoRNAs matched this snoRNA gene model with significant scores (25.91 to 43.55 bits). In a search of randomly generated sequence equivalent in size to four complete yeast genomes, the maximum score for a false positive (29.65 bits) exceeded the score of only one of the nine known snoRNAs. Thus we believed the model was sufficiently well trained to search for unidentified snoRNAs in the yeast genome.
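To make the scoring concrete, the following Python sketch shows how a log odds weight matrix assigns a score in bits to a short motif. The probabilities are invented for illustration and are not the values in Table 4.1; the function name and the uniform background are likewise assumptions.

```python
import math

def log_odds_score(motif, position_probs, background):
    """Sum log2(model probability / background probability) over positions, in bits."""
    return sum(math.log2(probs[base] / background[base])
               for base, probs in zip(motif, position_probs))

# Hypothetical two-position model and uniform background (illustrative only).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
position = {"A": 0.5, "C": 0.25, "G": 0.125, "U": 0.125}
model = [position, position]

favoured = log_odds_score("AA", model, background)     # 2.0 bits
disfavoured = log_odds_score("GG", model, background)  # -2.0 bits
```

A positive score means the motif is more probable under the model than under the background. If the features are treated as independent, which is an assumption of this sketch, their individual scores can simply be added to give a combined score for a candidate.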