We implemented a greedy search algorithm to identify 2'-O-methylation guide snoRNAs in genomic sequence. The program sequentially identifies six components characteristic of these genes (see Figure 4.1): box D, box C, a region of sequence complementary to ribosomal RNA, box D' if the rRNA complementary region is not directly adjacent to box D, the predicted methylation site within the rRNA based on the complementary region, and the terminal stem base pairings, if present. The program also records the relative distances between identified features within the snoRNA, information we found critical to reducing the false-positive rate.
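The sequential, commit-as-you-go character of the search can be illustrated with a minimal Python sketch. It is not the program described here: the motif strings `BOX_C` and `BOX_D`, the function `find_candidates`, the exact-match criteria, and the distance limits `min_span` and `max_span` are all assumptions made for illustration. The sketch covers only box D, box C, and a complementary region directly adjacent to box D; it omits box D', the terminal stem, and G-U pairing. The methylation site is placed opposite the fifth nucleotide upstream of box D, a commonly cited rule that is assumed here rather than taken from the text above.

```python
import re

# Illustrative motifs in the DNA alphabet; assumed, not taken from the text.
BOX_C = "TGATGA"
BOX_D = "CTGA"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_candidates(genome, rrna, guide_len=10, min_span=60, max_span=120):
    """Greedy sketch: locate box D, then box C, then the rRNA match.

    Each feature is accepted at its first match and never revisited.
    All thresholds are hypothetical; guide_len is assumed to be >= 5.
    """
    candidates = []
    # The lookahead lets overlapping box D occurrences be reported.
    for match in re.finditer("(?=" + BOX_D + ")", genome):
        d_pos = match.start()
        # Box C must start between max_span and min_span nt upstream of box D.
        lo = max(0, d_pos - max_span)
        hi = d_pos - min_span + len(BOX_C)
        c_pos = genome.find(BOX_C, lo, hi) if hi > lo else -1
        if c_pos == -1:
            continue
        # Guide region: the guide_len nucleotides directly upstream of box D.
        guide_start = d_pos - guide_len
        if guide_start < c_pos + len(BOX_C):
            continue
        guide = genome[guide_start:d_pos]
        # Require a perfect Watson-Crick match somewhere in the rRNA.
        r_pos = rrna.find(reverse_complement(guide))
        if r_pos == -1:
            continue
        candidates.append({
            "box_c": c_pos,
            "box_d": d_pos,
            "c_to_d": d_pos - c_pos,  # relative distance between features
            "guide": guide,
            "rrna_start": r_pos,
            # rRNA base paired to the 5th nucleotide upstream of box D
            # (assumed rule); 0-based index into rrna.
            "methylation_site": r_pos + 4,
        })
    return candidates
```

Because each step commits to the first acceptable match, the sketch runs quickly but can miss a better-scoring arrangement of features, which is the trade-off a greedy strategy accepts.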
Each candidate snoRNA alignment is scored against a probabilistic model (Figure 4.2) trained on experimentally verified yeast or human snoRNAs. Candidate snoRNAs are ranked by a final log-odds score [Barrett et al., 1997] that incorporates information from each of the snoRNA features. Although a dynamic programming algorithm incorporating the probabilistic model could have been used in the initial search phase, we opted for a greedy search followed by probabilistic scoring in the interest of speed. A final report is generated for each snoRNA, listing the component features and their scores along with the target rRNA methylation site.
Initial profiles of snoRNA features were provided by Kiss-Laszlo et al. (1996) in the form of a consensus structure for methylation guide snoRNAs, based on 21 novel and 14 previously identified human snoRNAs. Nine previously isolated yeast snoRNAs were shown to conform to this snoRNA gene model; we therefore believed we had sufficient training data to search for unidentified snoRNAs in the yeast genome. As new snoRNAs were identified and verified, they were added to the model training set.
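The idea of ranking candidates by a summed log-odds score can be sketched as follows. This is a deliberately simplified stand-in, not the model of Figure 4.2: it assumes each feature is a fixed-length motif scored with a position-specific frequency table against a uniform background, and it ignores the distance information between features. The names `train_profile`, `log_odds`, and `score_candidate`, the pseudocount, and the toy training sequences are all hypothetical.

```python
import math
from collections import Counter


def train_profile(examples, pseudocount=1.0, alphabet="ACGT"):
    """Estimate per-position base probabilities from aligned examples."""
    total = len(examples) + pseudocount * len(alphabet)
    profile = []
    for i in range(len(examples[0])):
        counts = Counter(seq[i] for seq in examples)
        profile.append(
            {base: (counts[base] + pseudocount) / total for base in alphabet}
        )
    return profile


def log_odds(seq, profile, background=0.25):
    """Sum of log2(P(base | profile) / P(base | background)) over positions."""
    return sum(
        math.log2(column[base] / background)
        for base, column in zip(seq, profile)
    )


def score_candidate(features, profiles):
    """Add the log-odds contributions of every modeled feature."""
    return sum(log_odds(features[name], profiles[name]) for name in profiles)


# Toy training data, invented for illustration only.
profiles = {
    "box_d": train_profile(["CTGA", "CTGA", "CTGA", "CTGG"]),
    "box_c": train_profile(["TGATGA", "TGATGA", "TGATGT"]),
}
candidate = {"box_d": "CTGA", "box_c": "TGATGA"}
print(score_candidate(candidate, profiles))
```

A positive total means the candidate's features look more like the training examples than like background sequence. Adding a newly verified snoRNA to the training examples and recomputing the profiles mirrors the incremental retraining described above.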