This track contains genome-wide operon predictions. These predictions are
extensions of TIGR's operon predictions, with the addition of TATA-box and
intergenic distance information. TIGR's operon predictions are based on
gene order conservation across multiple species. The predictions are
roughly based on the idea that if the order of a given set of genes on the
same strand is conserved across numerous species, then these genes are
more likely to belong to the same operon. TIGR's predictions are intended
to have a low number of false positives, while giving less regard to false
negatives. As such, their predictions only put genes together in an
operon if there is overwhelming conservation evidence that they belong
together. The purpose of this track is to take a more genome-wide
approach and give the most likely operon structure, rather than limiting
the predictions to operons that are supported with near certainty.

The algorithm considers one group of contiguous genes on the same strand
at a time. If the gene furthest upstream has a TATA box (as determined by
a position weight matrix) located between 15 and 50 bases upstream of the
transcription start site, this gene is considered to be the leading gene
of the operon. The operon is then extended one
gene at a time, moving downstream from the leading gene. If the next gene
has a TATA box and is less than 100 bases from the preceding gene, the
current operon is ended and a new one begins with that gene. A new operon
is also begun if the next gene is too far away (greater than 500 bases)
and does not have a TATA box. The generous length of 500 bases was chosen
to help allow for the possibility of an uncharacterized gene lying in such
a region. If two adjacent genes are predicted by TIGR to be in the same
operon, then the above algorithm is overridden: genes predicted by TIGR to
be in the same operon are always predicted by this method to be in the
same operon. In this way, the algorithm takes TIGR's predictions as a
seed and attempts to extend and improve upon them.
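
The splitting rules above can be illustrated with a short Python sketch.
This is not the code used to build the track; it is a minimal reading of
the description, and the names Gene, has_tata, dist_to_prev, tigr_operon
and split_into_operons are hypothetical. It assumes the TATA-box scan
has already been run and is given as a boolean per gene, that the first
gene of a group always opens an operon, and that any case not listed as
a break extends the current operon. The 100-base and 500-base thresholds
are applied exactly as stated in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gene:
    name: str
    has_tata: bool                     # TATA box found upstream (assumed precomputed)
    dist_to_prev: int                  # intergenic distance to the preceding gene, in bases
    tigr_operon: Optional[str] = None  # hypothetical TIGR operon ID, if any


def split_into_operons(genes: List[Gene],
                       tata_dist: int = 100,
                       far_dist: int = 500) -> List[List[str]]:
    """Split one same-strand group of genes (ordered upstream to
    downstream) into operons, following the rules described above."""
    operons: List[List[str]] = []
    for i, gene in enumerate(genes):
        if i == 0:
            # Assumption: the furthest-upstream gene always opens an operon.
            operons.append([gene.name])
            continue
        prev = genes[i - 1]
        # TIGR override: adjacent genes in the same TIGR operon stay together.
        same_tigr = (gene.tigr_operon is not None
                     and gene.tigr_operon == prev.tigr_operon)
        # Break rule 1: TATA box and less than tata_dist bases away.
        # Break rule 2: no TATA box and more than far_dist bases away.
        breaks = ((gene.has_tata and gene.dist_to_prev < tata_dist)
                  or (not gene.has_tata and gene.dist_to_prev > far_dist))
        if breaks and not same_tigr:
            operons.append([gene.name])
        else:
            operons[-1].append(gene.name)
    return operons


example = [
    Gene("a", True, 0),
    Gene("b", False, 40),
    Gene("c", True, 60),
    Gene("d", False, 700, "T1"),
    Gene("e", True, 50, "T1"),
]
print(split_into_operons(example))
```

In this example, gene e would normally start a new operon (it has a TATA
box and lies 50 bases from gene d), but the shared TIGR operon ID keeps
d and e together.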

Questions and comments, as well as requests for a more detailed
explanation of the algorithm, can be directed to:
Matt Weirauch
weirauch@soe.ucsc.edu