Description

This track contains the location and score of transcription factor binding sites conserved in the human/mouse/rat alignment. A binding site is considered to be conserved across the alignment if its score meets the threshold score for that binding site in all 3 species. The score and threshold are computed with the Transfac Matrix Database created by Biobase. The data are purely computational, and as such not all binding sites listed here are biologically functional binding sites.

In the graphical display, each box represents one conserved transcription factor binding site. The darker the box, the better the match to the binding site matrix. Clicking on a box brings up detailed information on the binding site: its Transfac ID, its location in the human genome (chromosome, start, end, and strand), its length in bases, and its score.

Methods

A binding site is considered to be conserved across the alignment if its score meets the threshold score for that binding site in all 3 species. The following is a brief discussion of the scoring and threshold system used for these data.

The Transfac Matrix Database contains position-weight matrices for 336 transcription factor binding sites, as characterized through experimental results in the scientific literature. A typical (in this case fictitious) matrix will look something like:

        A      C      G      T
01     15     15     15     15      N
02     20     10     15     15      N
03      0      0     60      0      G
04     60      0      0      0      A
05      0      0      0     60      T
The above matrix summarizes the results of 60 experiments (the sum of each row). In the experiments, the first position of the binding site was A 15 times, C 15 times, G 15 times, and T 15 times (and so on for each position). The consensus sequence of the above binding site as characterized by the matrix is NNGAT. The consensus is written in the IUPAC 15-letter nucleotide code, in which N stands for any base.

The score of a segment of DNA is computed in relation to a matrix as follows:

score = SUM over each position in the matrix of
matrix[position][nucleotide_in_segment_at_this_position].
For example, the sequence "CCGAT" would have a score of: 15 + 10 + 60 + 60 + 60 = 205 for the above matrix. A score in relation to a matrix of length n can be computed for every DNA segment of length n.
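The scoring rule above can be sketched in Python as follows. This is a minimal illustration rather than the tfloc implementation: the names MATRIX, score_segment, and scan are hypothetical, the matrix is the fictitious example from the text, and the sketch assumes segments contain only the bases A, C, G, and T.

```python
# Fictitious example matrix from the text: one row of counts per position.
MATRIX = [
    {"A": 15, "C": 15, "G": 15, "T": 15},
    {"A": 20, "C": 10, "G": 15, "T": 15},
    {"A": 0, "C": 0, "G": 60, "T": 0},
    {"A": 60, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 60},
]


def score_segment(matrix, segment):
    """Sum the matrix entry for the observed nucleotide at each position."""
    if len(segment) != len(matrix):
        raise ValueError("segment length must equal matrix length")
    # Assumption: only A, C, G, T occur; any other symbol raises KeyError.
    return sum(row[base] for row, base in zip(matrix, segment.upper()))


def scan(matrix, sequence):
    """Score every window of length n (the matrix length) in a sequence."""
    n = len(matrix)
    return [
        (start, score_segment(matrix, sequence[start:start + n]))
        for start in range(len(sequence) - n + 1)
    ]


print(score_segment(MATRIX, "CCGAT"))  # 205, as in the worked example
print(scan(MATRIX, "ACCGATG"))
```

Scanning a longer sequence simply applies the same sum to each window of length n in turn.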

The threshold for a binding site is computed from its Transfac Matrix Database entry as follows:

          St = Smin + ((Smax - Smin) * C)
                                                                               
where     St is the target threshold score
          Smin is the minimum possible score
          Smax is the maximum possible score
          C is the cutoff value used by the scoring function
For example, the above matrix has a minimum score of 15 + 10 + 0 + 0 + 0 = 25 and a maximum score of 15 + 20 + 60 + 60 + 60 = 215. Using a cutoff value of 0.85 (the value used for this track), the threshold for the above matrix is:
25 + ((215 - 25) * 0.85) = 186.5
As such, the sequence "CCGAT" from above would be recorded as a hit with a cutoff value of 0.85, since its score (205) exceeds the threshold for this particular binding site (186.5).
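The threshold calculation can be sketched in Python as follows. The names score_bounds and threshold are hypothetical and this is not taken from tfloc; the matrix is the fictitious example from the text, and the default cutoff of 0.85 is the value stated for this track.

```python
# Fictitious example matrix from the text: one row of counts per position.
MATRIX = [
    {"A": 15, "C": 15, "G": 15, "T": 15},
    {"A": 20, "C": 10, "G": 15, "T": 15},
    {"A": 0, "C": 0, "G": 60, "T": 0},
    {"A": 60, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 60},
]


def score_bounds(matrix):
    """Return (Smin, Smax): the sums of the row minima and the row maxima."""
    smin = sum(min(row.values()) for row in matrix)
    smax = sum(max(row.values()) for row in matrix)
    return smin, smax


def threshold(matrix, cutoff=0.85):
    """St = Smin + ((Smax - Smin) * C)."""
    smin, smax = score_bounds(matrix)
    return smin + (smax - smin) * cutoff


print(score_bounds(MATRIX))  # (25, 215)
print(threshold(MATRIX))     # 186.5 for the example matrix
```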

The final score reported is the highest cutoff value at which the position would still have been recorded as a hit, multiplied by 1000. The final score of the above example is therefore:

((Score - Smin) / (Smax - Smin)) * 1000 = ((205 - 25) / (215 - 25)) * 1000 = 0.947 * 1000 = 947.
Therefore, the final score for the sequence "CCGAT" would be 947. Although the scores of all three species must meet the threshold, the only final score reported for this track is that of the human sequence.

These data were obtained by running the program tfloc (Transcription Factor binding site LOCater) on multiz alignments of the Feb. 2003 mouse draft assembly (mm3) and the Jan. 2003 rat assembly (rn2) to the Apr. 2003 human genome assembly (hg15). The tfloc program was run on the subset of the Transfac Matrix Database containing human-related binding sites (164 total).

Credits

These data were generated using the Transfac Matrix Database created by Biobase.

The tfloc program was developed at The Pennsylvania State University by Matt Weirauch.

This track was created by Matt Weirauch and Brian Raney at The University of California at Santa Cruz.