Triangulating on Regulatory Regions



Jim Kent, Charles Sugnet, Mark Deikhans, Krishna Roskin, Ryan Weber, and David Haussler

University of California Santa Cruz

Scott Schwartz, Laura Elnitski, Ross Hardison, and Webb Miller

Pennsylvania State University



Unraveling the intricacies of gene regulation is a fascinating  but often very difficult endeavor. Here we describe an approach which combines cDNA, genomic DNA, and microarray data to locate and characterize regulatory regions in the human genome. We identify a large set of transcription start sites by aligning G-cap selected ESTs and other cDNA data to the human genome using BLAT. We find regions conserved between the human and mouse genomes with BLASTZ and explore the pattern of conservation at various distances from transcription start. We cluster Affymetrix microarray data to find coregulated clusters of  genes. We describe the Improbizer program, which is available at (, and use Improbizer to find motifs in the regions upstream of coregulated genes, and examine the pattern of conservations at these motifs.







We’d like to thank Al Zahler, Terry Furey, and the Mouse Homology group for their feedback.  Many thanks to Affymetrix, GNF, the Human Genome Sequencing Consortium, the Mouse Genome Sequencing Consortium, the Mammalian Gene Collection, Riken, the RefSeq project, and Genbank contributors worldwide for providing the underlying data. Thanks to Michael Eisen for the expression clustering software.  Thanks to the National Human Genome Research Institute, the Howard Hughes Medical Institute, the Packard Foundation and the University of California for financial support.
Finding Transcription Start Sites


Figure 1. An RNA cluster visualized in the UCSC Genome Browser.



In general it is difficult experimentally to clone full-length mRNAs.  Many mRNAs in Genbank are truncated at the 5’ end. On chromosome 22 Sanger has made extensive efforts to find 5’ ends.  Though the Sanger 22 annotations are not available genome-wide, fortunately numerous large scale EST and mRNA sequencing projects particularly in the last year have been pushing towards the 5’ end.  G-cap selected libraries, which are used in the Riken mRNA and some of the Mammalian Gene Collection datasets are particularly useful.


To generate a large collection of transcription start sites we used BLAT to align all of the human mRNA in Genbank and dbEST against the genome.  To avoid genomic contamination in the mRNA databases we only considered mRNAs and ESTs that showed signs of splicing when aligned against the genome.  We clustered together mRNA and ESTs that shared exonic sequence.  This formed a total of 31888 clusters. 


Clustering Affy/GNF Expression Data


Figure 2. Expression data for three RNA clusters.  The bottom track shows alignments of the ESTs used to form the Affymetrix probes colored according to expression levels where red shows higher expression than the reference sample and green lower expression.  Only 15 of 85 sample types are shown here.  The rest are visible online at



5041 of the RNA cluster have expression data from the Affymetrix/GNF set of experiments with normal tissues.  Expression vectors were constructed based on the ratio of expression in the tissues vs. the expression in a pooled reference sample. In some cases multiple Affymetrics data points are associated with the same RNA cluster. We average the data to form a single expression vector for each cluster in these cases.  We used the K-means clustering method in Michael Eisen’s cluster program with K set to 200.  The result was 192 groups of expression vectors.  We extracted 1000 bases of DNA upstream of each RNA cluster in an expression group, and fed these group by group to the Improbizer motif finding program.

Finding Motifs with Improbizer


Improbizer searches for motifs in DNA or RNA sequences that occur with improbable frequency (to be just chance) using a variation of the expectation maximization (EM) algorithm.  Consider the input sequence:




Each K-mer of the foreground data set is used to create a probability profile.  The profile created by the fourth 8-mer in the top sequence (GCGATGGG) would be:

A .10 .10 .10 .70 .10 .10 .10 .10

C .10 .70 .10 .10 .10 .10 .10 .10

G .70 .10 .70 .10 .10 .70 .70 .70

T .10 .10 .10 .10 .70 .10 .10 .10

The best place in each sequence where the profile aligns is computed by comparing the probability of the bases at each position matching the profile vs. the probability of the bases matching a 2nd order Markov model constructed on the background sequence.

ACAGCGATGGGACAT .7 * .7 * .7 * .7 * .7 * .7 * .7 * .7 / background

TGCGTTGGCTCAGGT .7 * .7 * .7 * .1 * .7 * .7 * .7 * .1 / background

ATTAGGGAAGCGTTT .7 * .1 * .7 * .7 * .1 * .7 * .1 * .7 / background

The best hits are lined up against each other




A second generation of profile is created from the columns

A .01 .01 .01 .66 .33 .01 .01 .01

C .01 .66 .01 .01 .01 .01 .33 .33

G .97 .33 .97 .01 .01 .97 .66 .66

T .01 .01 .01 .33 .66 .01 .01 .01

The process is repeated until the profile converges.  The process gets a bit more complex than this to allow some sequences to not have the motif, for multiple copies of the motif to happen, for the motif to happen on either strand, and so forth.  Improbizer dynamically expands or shrinks the motif during convergence based on the properties of bases next to the best motif alignments.  Separating the foreground sequence from the background sequence is crucial as it allows the use of higher order Markov background models.  Without these the motif finders tend to converge on poly-A a lot!

Motif Results


The Improbizer runs were run using the settings: initial motif size of 6, expecting up to two occurences of a motif per sequence, searching both the forward and reverse strand, ignoring location, and using 1000 base regions for all RNA clusters with expression data to form a 2nd order Markov background model.


Improbizer will find motifs in any sequence set that it is given.  Not all of the motifs are significant.  Improbizer associates a score with each motif.  However many factors including the number and size of the input sequences have an effect on the score.  Therefore to select significant motifs we generated random sequence according to the background model of the same number and size as the input sequence and did control runs of Improbizer against these random sequences.  Only when a motif scored higher than any of 100 control runs did we consider it significant.


In our first run we found 320 significant motifs in our 192 expression groups.  Unfortunately it turned out that most of these motifs were to repeating elements.  We ran RepeatMasker on the upstream regions and reran the experiment.  This time we found 19 significant motifs.


Figure 3 – Motif track in browser.  Gold area shows upstream 1000 bases.  Gray area shows motif hit

Figure 4 – Motif details page in browser.
 Regions Conserved in Human and Mouse


Figure 5. Conservation levels near three RNA clusters.  The bottom track shows conservation levels in overlapping 100 base windows using  an evolutionary model that accounts for the effect of neighboring bases on the mutation rate.


A nucleotide level alignment of the human genome vs. the mouse genome was done with Penn State’s blastz program on the Parasol cluster at UCSC.  We compared the level of conservation of the motifs found with this method with the level of conservation of potential regulatory regions found by a number of other methods.  Initial results are promising.





















% Ali









% ID









% Both











% Ali









% ID









% Both









Table 1. Conservation and regulatory regions.  The rows show percentage of aligning bases and the percentage base identity within the aligning regions of mouse/human blastz alignments. The % Both row is the product of the % Ali and % ID.  The top panel reflects the conservation in areas that are not covered by RepeatMasker, while the bottom panel is unmasked. The Known reg column is from a set of 100 regulatory regions from the biological literature. The upstream regions are relative to the 5’ ends of RNA clusters.  Expr Tri shows regions found by the process describe here. Tri Motif shows the Expr Tri motifs applied to all Upstr 1000 elements, not  just those in the expresssion group where the motif was found.  Tranfac motif shows known human motifs applied to all Upstr 1000 elements.

Future Directions

We’d like to refine virtually every step of this process.  In the RNA clustering phase we’d like to filter out RAGE ESTs, which are created from randomly inserting powerful promoters in the genome.  These can cause transcription to begin upstream of where it would begin in nature, thus invalidating the 5’ ends of our clusters.  It is becoming increasingly apparent that the use of alternative promoters is very common in nature.  Our method currently will always select the most 5’ promoter.  It may be more appropriate to use instead the most highly expressed promoter.  While masking is necessary to prevent the program from forming motifs out of interspersed repeats, RepeatMasker also often masks GC rich sequence as degenerate, which may not be appropriate for this application. We’d also like to experiment with other expression clustering parameters and techniques.  The Improbizer runs were constructed to search for tissue-specific motifs that can occur in either strand.  It would be good to also search for general motifs, for motifs that only occur on one strand, motifs where the location relative to transcription start is relevant. Once we have optimized the procedure we’ll be looking eagerly for a wet lab collaborator to help us test the motifs in various ways.