Triangulating on Regulatory Regions

Jim
Kent, Charles Sugnet, Mark Deikhans, Krishna Roskin, Ryan Weber, and David
Haussler
Scott
Schwartz, Laura Elnitski, Ross Hardison, and Webb Miller
Unraveling
the intricacies of gene regulation is a fascinating but often very difficult endeavor. Here we describe an approach
which combines cDNA, genomic DNA, and microarray data to locate and
characterize regulatory regions in the human genome. We identify a large set of
transcription start sites by aligning G-cap selected ESTs and other cDNA data
to the human genome using BLAT. We find regions conserved between the human and
mouse genomes with BLASTZ and explore the pattern of conservation at various
distances from transcription start. We cluster Affymetrix microarray data to
find coregulated clusters of genes. We
describe the Improbizer program, which is available at (http://www.soe.ucsc.edu/~kent/improbizer/),
and use Improbizer to find motifs in the regions upstream of coregulated genes,
and examine the pattern of conservations at these motifs.
Acknowledgements
We’d
like to thank Al Zahler, Terry Furey, and the Mouse Homology group for their
feedback. Many thanks to Affymetrix,
GNF, the Human Genome Sequencing Consortium, the Mouse Genome Sequencing
Consortium, the Mammalian Gene Collection, Riken, the RefSeq project, and
Genbank contributors worldwide for providing the underlying data. Thanks to
Michael Eisen for the expression clustering software. Thanks to the National Human Genome Research Institute, the
Howard Hughes Medical Institute, the Packard Foundation and the University of California
for financial support.
Finding Transcription Start Sites

Figure 1. An RNA cluster visualized in the UCSC Genome Browser.
In general it is difficult experimentally to clone full-length mRNAs. Many mRNAs in Genbank are truncated at the 5’ end. On chromosome 22 Sanger has made extensive efforts to find 5’ ends. Though the Sanger 22 annotations are not available genome-wide, fortunately numerous large scale EST and mRNA sequencing projects particularly in the last year have been pushing towards the 5’ end. G-cap selected libraries, which are used in the Riken mRNA and some of the Mammalian Gene Collection datasets are particularly useful.
To
generate a large collection of transcription start sites we used BLAT to align
all of the human mRNA in Genbank and dbEST against the genome. To avoid genomic contamination in the mRNA
databases we only considered mRNAs and ESTs that showed signs of splicing when
aligned against the genome. We
clustered together mRNA and ESTs that shared exonic sequence. This formed a total of 31888 clusters.
Clustering
Affy/GNF Expression Data

Figure 2. Expression data for three RNA clusters. The bottom track shows alignments of the ESTs used to form the Affymetrix probes colored according to expression levels where red shows higher expression than the reference sample and green lower expression. Only 15 of 85 sample types are shown here. The rest are visible online at genome.ucsc.edu.
5041
of the RNA cluster have expression data from the Affymetrix/GNF set of experiments
with normal tissues. Expression vectors
were constructed based on the ratio of expression in the tissues vs. the
expression in a pooled reference sample. In some cases multiple Affymetrics
data points are associated with the same RNA cluster. We average the data to
form a single expression vector for each cluster in these cases. We used the K-means clustering method in
Michael Eisen’s cluster program with K set to 200. The result was 192 groups of expression vectors. We extracted 1000 bases of DNA upstream of
each RNA cluster in an expression group, and fed these group by group to the
Improbizer motif finding program.
Finding
Motifs with Improbizer
Improbizer searches for motifs in DNA or RNA sequences that occur with improbable frequency (to be just chance) using a variation of the expectation maximization (EM) algorithm. Consider the input sequence:
ACAGCGATGGGACAT
TGCGTTGGCTCAGGT
ATTAGGGAAGCGTTT
Each K-mer of the foreground data set is used to create a probability profile. The profile created by the fourth 8-mer in the top sequence (GCGATGGG) would be:
A .10 .10 .10 .70 .10 .10 .10 .10
C .10 .70 .10 .10 .10 .10 .10 .10
G .70 .10 .70 .10 .10 .70 .70 .70
T .10 .10 .10 .10 .70 .10 .10 .10
The best place in each sequence where the profile aligns is computed by comparing the probability of the bases at each position matching the profile vs. the probability of the bases matching a 2nd order Markov model constructed on the background sequence.
ACAGCGATGGGACAT .7 * .7 * .7 * .7 * .7 * .7 * .7 * .7 / background
TGCGTTGGCTCAGGT .7 * .7 * .7 * .1 * .7 * .7 * .7 * .1 / background
ATTAGGGAAGCGTTT .7 * .1 * .7 * .7 * .1 * .7 * .1 * .7 / background
The best hits are lined up against each other
GCGATGGG
GCGTTGGC
GGGAAGCG
A second generation of profile is created from the
columns
A .01 .01 .01 .66 .33 .01 .01 .01
C .01 .66 .01 .01 .01 .01 .33 .33
G .97 .33 .97 .01 .01 .97 .66 .66
T .01 .01 .01 .33 .66 .01 .01 .01
The process is repeated until the profile converges. The process gets a bit more complex than this to allow some sequences to not have the motif, for multiple copies of the motif to happen, for the motif to happen on either strand, and so forth. Improbizer dynamically expands or shrinks the motif during convergence based on the properties of bases next to the best motif alignments. Separating the foreground sequence from the background sequence is crucial as it allows the use of higher order Markov background models. Without these the motif finders tend to converge on poly-A a lot!
Motif
Results
The Improbizer runs were run using the settings: initial motif size of 6, expecting up to two occurences of a motif per sequence, searching both the forward and reverse strand, ignoring location, and using 1000 base regions for all RNA clusters with expression data to form a 2nd order Markov background model.
Improbizer will find motifs in any sequence set that it is given. Not all of the motifs are significant. Improbizer associates a score with each motif. However many factors including the number and size of the input sequences have an effect on the score. Therefore to select significant motifs we generated random sequence according to the background model of the same number and size as the input sequence and did control runs of Improbizer against these random sequences. Only when a motif scored higher than any of 100 control runs did we consider it significant.
In our first run we found 320 significant motifs in our 192 expression groups. Unfortunately it turned out that most of these motifs were to repeating elements. We ran RepeatMasker on the upstream regions and reran the experiment. This time we found 19 significant motifs.

Figure 3 – Motif track in browser. Gold area shows upstream 1000 bases. Gray area shows motif hit
Figure
4 – Motif details page in browser.
Regions Conserved in Human and Mouse

Figure 5. Conservation levels near three RNA clusters. The bottom track shows conservation levels in overlapping 100 base windows using an evolutionary model that accounts for the effect of neighboring bases on the mutation rate.
A
nucleotide level alignment of the human genome vs. the mouse genome was done
with Penn State’s blastz program on the Parasol cluster at UCSC. We compared the level of conservation of the
motifs found with this method with the level of conservation of potential
regulatory regions found by a number of other methods. Initial results are promising.
|
|
Known reg |
Upstr 500 |
Upstr 1000 |
Expre Tri |
Tri Motif |
Tranfa Motif |
Cpg Island |
Whole Genome |
Masked |
|
|||||||
|
%
Ali |
83.26 |
61.24 |
56.67 |
62.92 |
56.54 |
57.29 |
65.54 |
35.95 |
|
%
ID |
76.09 |
71.83 |
71.43 |
74.22 |
71.58 |
72.86 |
74.66 |
70.51 |
|
%
Both |
63.35 |
43.99 |
40.48 |
46.70 |
40.47 |
41.74 |
48.93 |
25.35 |
Unmasked |
|
|||||||
|
%
Ali |
78.05 |
52.50 |
45.75 |
62.84 |
49.58 |
46.18 |
49.71 |
27.69 |
|
%
ID |
75.66 |
73.58 |
71.19 |
73.65 |
71.29 |
73.26 |
74.71 |
69.88 |
|
%
Both |
59.05 |
38.63 |
32.57 |
46.28 |
35.66 |
33.83 |
37.14 |
19.35 |
Table 1. Conservation and regulatory regions. The rows show percentage of aligning bases and the percentage base identity within the aligning regions of mouse/human blastz alignments. The % Both row is the product of the % Ali and % ID. The top panel reflects the conservation in areas that are not covered by RepeatMasker, while the bottom panel is unmasked. The Known reg column is from a set of 100 regulatory regions from the biological literature. The upstream regions are relative to the 5’ ends of RNA clusters. Expr Tri shows regions found by the process describe here. Tri Motif shows the Expr Tri motifs applied to all Upstr 1000 elements, not just those in the expresssion group where the motif was found. Tranfac motif shows known human motifs applied to all Upstr 1000 elements.
Future
Directions
We’d
like to refine virtually every step of this process. In the RNA clustering phase we’d like to filter out RAGE ESTs,
which are created from randomly inserting powerful promoters in the
genome. These can cause transcription
to begin upstream of where it would begin in nature, thus invalidating the 5’
ends of our clusters. It is becoming
increasingly apparent that the use of alternative promoters is very common in
nature. Our method currently will
always select the most 5’ promoter. It
may be more appropriate to use instead the most highly expressed promoter. While masking is necessary to prevent the
program from forming motifs out of interspersed repeats, RepeatMasker also
often masks GC rich sequence as degenerate, which may not be appropriate for
this application. We’d also like to experiment with other expression clustering
parameters and techniques. The
Improbizer runs were constructed to search for tissue-specific motifs that can
occur in either strand. It would be
good to also search for general motifs, for motifs that only occur on one
strand, motifs where the location relative to transcription start is relevant.
Once we have optimized the procedure we’ll be looking eagerly for a wet lab
collaborator to help us test the motifs in various ways.