Description

This track contains genome-wide operon predictions. They extend TIGR's operon predictions by adding TATA-box and intergenic distance information. TIGR's predictions are based on gene order conservation across multiple species: if the order of a given set of genes on the same strand is conserved across numerous species, those genes are more likely to belong to the same operon. TIGR's predictions are intended to keep the number of false positives low, with less regard for false negatives, so they place genes together in an operon only when there is overwhelming evidence of conservation. The purpose of this track is to take a more genome-wide approach and report the most likely operon structure, rather than limiting the predictions to operons that are nearly certain.

The algorithm considers one group of contiguous genes on the same strand at a time. If the gene furthest upstream has a TATA box (as determined by a position weight matrix) between 15 and 50 bases upstream of the transcription start site, this gene is considered to be the leading gene of the operon. The operon is then extended one gene at a time, moving downstream from the leading gene. If the next gene has a TATA box and is less than 100 bases from the preceding gene, the current operon is ended and a new one begins with that gene. A new operon is also begun if the next gene does not have a TATA box and is too far away (greater than 500 bases). The generous length of 500 bases was chosen to allow for the possibility of an uncharacterized gene lying in such a region. If two adjacent genes are predicted by TIGR to be in the same operon, the above rules are overridden: genes predicted by TIGR to be in the same operon are always predicted by this method to be in the same operon. In this way, the algorithm takes TIGR's predictions as a seed and attempts to extend and improve upon them.
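The extension rules above can be sketched in Python as follows. This is only an illustrative sketch, not the code used to build the track. The Gene class, the predict_operons function, and the seed_pairs argument are hypothetical names introduced here. The sketch assumes that the TATA-box call for each gene has already been made (for example by a position weight matrix scan of the 15 to 50 bases upstream), that coordinates are oriented so that upstream genes come first, and that the first gene of a group always opens an operon, a case the description leaves unspecified. The distance thresholds are applied exactly as worded in the text.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int       # coordinate of the gene start, in the direction of transcription
    end: int         # coordinate of the gene end
    has_tata: bool   # assumed precomputed, e.g. by a PWM scan 15-50 bases upstream


def predict_operons(genes, seed_pairs=frozenset(), tata_gap=100, max_gap=500):
    """Split one run of contiguous same-strand genes into operons.

    genes      -- list of Gene, ordered from upstream to downstream
    seed_pairs -- set of (upstream_name, downstream_name) pairs that the seed
                  predictions (TIGR's, in the text) place in the same operon
    """
    operons = []
    current = []
    for gene in genes:
        if not current:
            # Assumption: the first gene always opens an operon.
            current = [gene]
            continue

        prev = current[-1]
        gap = gene.start - prev.end  # intergenic distance in bases

        if (prev.name, gene.name) in seed_pairs:
            # Seed predictions override the TATA-box and distance rules.
            split = False
        elif gene.has_tata and gap < tata_gap:
            # Rule as worded in the text: TATA box and less than 100 bases away.
            split = True
        elif not gene.has_tata and gap > max_gap:
            # No TATA box, but too far from the preceding gene.
            split = True
        else:
            split = False

        if split:
            operons.append(current)
            current = [gene]
        else:
            current.append(gene)

    if current:
        operons.append(current)
    return operons


example_genes = [
    Gene("A", 0, 300, True),
    Gene("B", 350, 700, True),      # TATA box, 50 bases from A: new operon
    Gene("C", 720, 1000, False),    # no TATA box, 20 bases from B: joins B
    Gene("D", 1600, 1900, False),   # no TATA box, 600 bases from C: new operon
    Gene("E", 1950, 2200, True),    # would split, but (D, E) is a seed pair
]
example_operons = predict_operons(example_genes, seed_pairs={("D", "E")})
print([[g.name for g in operon] for operon in example_operons])
```

With these made-up genes the sketch groups them as A, then B and C, then D and E, which shows the seed pair overriding the TATA-box rule for gene E.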

Questions and comments, including requests for a more detailed explanation of the algorithm, can be directed to:

Matt Weirauch
weirauch@soe.ucsc.edu