Alastair Fyfe

CMPS244 Project – Spring ‘01

Web Copy : http://www.cse.ucsc.edu/~afyfe/244p2.htm

Track on the Class Mirror Browser : c2h2zf

 

C2H2 Zinc Fingers in the Human Genome: Motif Occurrences and Binding Site Prediction

 

    1. Introduction
    2. Cysteine-2 Histidine-2 Zinc Fingers
    3. The Search for a Recognition Code
    4. Finding Instances of the C2H2ZF Motif in the Human Genome
    5. Quantifying Binding Preferences
    6. Predicting Binding Sites
    7. Discussion
    8. Further Work
    9. Acknowledgements
    10. References

 

 

Introduction

Protein binding to nucleic acids is an essential step in many biological processes including gene expression, DNA duplication and DNA packing. Proteins have evolved a wide array of mechanisms for binding to DNA, each fit for a specific task. Mechanisms can be analyzed and classified by various criteria, of which two of the most important are affinity and specificity. For example, the processive DNA polymerases responsible for the bulk of DNA duplication bind single-stranded DNA with great affinity but little specificity. Other proteins, such as the large T antigen protein responsible for recognizing the origin of replication in duplication of SV40 DNA, bind relatively weakly but with great specificity. A recent survey of DNA binding proteins [2] organized the 240 DNA binding proteins whose solved structures are available in the Protein Data Bank into 8 overall classes, which are in turn divided into 54 structural families.

In this report I will concentrate on the cysteine-2 histidine-2 zinc finger (C2H2ZF) DNA-binding motif. This motif occurs very commonly in eukaryotic transcription factors and is among the best characterized of the DNA binding motifs. Its modular structure and relatively simple mode of binding have motivated the search for a "recognition code" whose discovery would enable the engineering of proteins capable of binding to any desired DNA sequence. Development of this capability is an essential prerequisite for therapeutic technologies, such as gene therapy, that depend on being able to target specific elements within the human genome.

The three main goals of the project were to:

    1. discover instances of C2H2ZF motifs in the current draft of the human genome and analyze their characteristics,
    2. review current literature on the status of a C2H2ZF "recognition code" and try to adapt applicable results into a tool capable of predicting likely binding sites,
    3. use this tool to scan the genome and characterize the properties of the reported putative binding sites

The next two sections review the C2H2ZF motif: the first focuses on its structure and properties and the second on efforts to uncover a DNA recognition code. The fourth section explains how occurrences of the C2H2ZF motif were identified in the human genome and summarizes some of their properties. The next two sections focus on the prediction of binding sites. The first of the two describes the methods used to adapt existing studies into a tool for predicting C2H2ZF binding sites and the second describes the predictions obtained for a particular protein, ZNF151. The final two sections discuss some of the strengths and limitations of the approach used and suggest some interesting areas for additional work.

 

Cysteine-2 Histidine-2 Zinc Fingers

The C2H2ZF motif was first characterized in 1985 by Aaron Klug and colleagues at the MRC Laboratory of Molecular Biology in Cambridge [1]. The motif was found in TFIIIA, the transcription factor for the 5S ribosomal RNA gene from Xenopus laevis, a protein that has the distinction of being the first known eukaryotic transcription factor. Inspection of the protein revealed the presence of 9 repeats of a 30 residue pattern and an abundance of zinc. Klug correctly predicted a structure in which the two Cys and two His residues conserved in each repeat tetrahedrally coordinate a zinc ion.

This was the first member of what has become a class of zinc-coordinated binding proteins that includes the hormone-nuclear receptor and GAL4-type families [2]. Hence the C2H2ZF motif is often referred to as the "classic" zinc finger.

The motif consists of about 30 amino acids. Its secondary structure elements are an α-helix and a two-stranded anti-parallel β-sheet. The sheet and the helix are held in a fixed conformation by the coordinated zinc ion and by 3 conserved hydrophobic residues [3].

The two model systems that have been studied most extensively are zif268, a three-finger transcription factor from mice involved in early stages of development, and TFIIIA, a nine-finger transcription factor essential for transcription of 5S ribosomal RNA in Xenopus. The first crystal structure of a C2H2 zinc finger was published by Pavletich and Pabo in 1991 [6], and a refined structure at 1.6 Å was published in 1996 [10] with PDB entry 1AAY. An image drawn from the 1AAY coordinates and available in the Protein-Nucleic Acid Complex Database, http://www.rtc.riken.go.jp/jouhou/3dinsight/complexdb.html, is shown below.

From the image it is apparent that residues in the α-helix are in close contact with Watson-Crick base pairs in the DNA major groove. The conventional numbering for the α-helix residues labels the residue immediately before the helix as –1.

C2H2ZF motifs possess a number of interesting characteristics:

  1. They are widespread: over a thousand instances of the motif have been identified, often in transcription factor proteins. They have been found in all eukaryotes studied to date and apparently do not occur in prokaryotes.
  2. They rely on a relatively simple DNA-binding structure which reads a triplet of bases from the DNA major groove. Fingers can be combined, via “linkers”, to form polydactyl motifs capable of recognizing longer sequences.
  3. They have been the subject of extensive study and are among the best understood of the DNA-binding proteins.
  4. Synthetically engineered C2H2ZF proteins have shown great success in binding targeted DNA sites with considerable specificity and affinity.
  5. C2H2ZF proteins have also been studied in RNA and protein-protein binding.

 

The Search for a Recognition Code

The study of how proteins recognize specific DNA sequences has long fascinated molecular biologists. This section reviews progress in answering this question in the context of the C2H2ZF motif.

A 1976 paper by Seeman, Rosenberg and Rich [11] set out some important principles. The authors asked what pattern of hydrogen bonding would allow an amino acid inserted in the major groove of B-DNA to distinguish among the 4 possible Watson-Crick pairs. From inspection of the stereochemistry they concluded that, on the basis of hydrogen-bond based discrimination alone, (a) a single hydrogen bond was inadequate for discrimination and (b) a few two-hydrogen-bond interactions were likely. These include a two-bond system between either asparagine or glutamine and the adenine side of a U(T)-A pair and another two-bond system between arginine and the guanine in a G-C pair. Subsequent analysis has borne out these predictions. This paper led to an early appreciation of the fact that only a few of the interactions necessary for protein-DNA recognition are readily identifiable.

In 1991, Pavletich and Pabo [6] solved the structure of the three-finger zif268 protein complexed with DNA. Their crystallographic analysis provided experimental support for the predictions made by Seeman et al. The three triplets in the 5’ GCG TGG GCG 3’ sequence used in the formation of the protein-DNA complex were rich in guanines, and five of the six observed protein-DNA hydrogen bonds occurred between arginine and guanine.

In a study published in 1992, Jacobs [8] analyzed 1340 C2H2ZF motifs from 221 proteins, primarily transcription factors, obtained by scanning protein databases and the literature for the appropriate pattern. Jacobs analyzed the pattern of variation and reasoned that sites that tended to vary significantly within a multi-finger protein but be relatively conserved across similar proteins were probably involved in DNA binding, i.e. in sequence recognition. He concluded that positions –1, 3 and 6 were most likely to be involved in DNA recognition. These results agreed with analysis of the crystal structure and suggested that a protein-DNA recognition code could be found by focusing on three specific positions within the C2H2ZF motif.

A number of different studies in the early ‘90s used combinatorial chemistry techniques to try to elucidate this code. The most comprehensive was reported in two papers by Yen Choo and Aaron Klug [4][5]. These studies built variants of zif268 by randomly altering the amino acids at helix positions –1 to 8 (excluding the conserved Leu and His) of the middle finger. The ability of variants to bind to each of the 64 possible DNA triplets was then tested by means of phage-display based selection. Their results supported the identification of residues –1, 3 and 6 as the key determinants of binding specificity, though position 2 was also found to play an important, if less direct, role. The data published in these papers form the basis of the binding-site predictions computed in this project, as explained in the section “Quantifying Binding Preferences”.

More recent studies have made it apparent that no simple C2H2ZF recognition code is likely to be found. The solution of the structure of the GLI protein and detailed comparison of the geometry of different C2H2ZF docking arrangements in known structures [12] indicate that a number of distinct arrangements of the motif exist, each with a distinct recognition code. Even for the canonical zif268 docking arrangement, factors such as linker length and both inter-finger and intra-finger correlation have been found to play a significant role. Nevertheless, while the overall recognition code, even for zif268, may not be simple, there is good reason to expect that accumulation of sufficient experimental data will yield a database from which high-quality predictions can be computed.

Finding Instances of the C2H2ZF Motif in the Human Genome

This section describes how occurrences of C2H2ZF motifs in the human genome were identified and summarizes some properties of those occurrences. This data is already available elsewhere: several of the existing annotations of human genome data include information about the protein family membership of gene products. For example, querying the UCSC genome browser for "zinc finger" returns information about 460 mRNA associated gene locations. However, for this project I was interested in investigating the mechanics of locating a particular protein motif and thus did not rely on the existing classifications.

Overall, the search strategy consisted of three steps. First, the protein sequences that correspond to the translation products of known genes were extracted. This collection of protein sequences was then searched with the HMMER [14] search tools using the PFAM [13] HMM for the C2H2ZF motif. This search yielded a set of protein subsequences that were likely instances of the motif. Finally, the highest scoring entries in this set were post-processed to aggregate adjoining fingers into polydactyl motifs.

HMMER is a collection of hidden Markov model (HMM) tools developed and distributed by Sean Eddy at Washington University in St. Louis. The suite includes tools for building profile HMMs from a multiple alignment, estimating the distribution function of scores calculated by an HMM and searching a database of sequences using an existing HMM. I used version 2.1.1 of the package.

The PFAM database[13] maintains a collection of hand-curated multiple alignments associated with known protein folds. For each such alignment, the database also provides a profile HMM whose states and emission probabilities are derived from the alignment. The HMM used in this project is named "zf-C2H2" with accession number PF00096. The associated alignment includes 10312 proteins, a number that attests to the widespread distribution and well-characterized nature of this domain.

The protein sequences used for the project were obtained from the October 7, 2000 version of the genome database using the "genieKnownPep" annotation. These are 8243 protein sequences of known genes. Application of the zf-C2H2 HMM yielded 1825 possible motif instances in 404 genes. Instances whose score was low enough that the expected number of chance occurrences with that score (the E-value) was 1 or greater were discarded. This cutoff reduced the number of motifs to 1524.
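The cutoff described above amounts to keeping only hits whose expected number of chance occurrences is below 1. A minimal Python sketch follows, assuming the search hits have already been parsed into records; the record layout, the names `Hit` and `filter_hits`, and the sample values are hypothetical and are not taken from HMMER output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One candidate motif instance (hypothetical record layout)."""
    gene: str
    start: int      # first residue of the match, 1-based
    end: int        # last residue of the match, inclusive
    score: float    # score reported by the search
    evalue: float   # expected number of chance hits with this score or better


def filter_hits(hits, max_evalue=1.0):
    """Keep hits whose expected number of chance occurrences is below max_evalue."""
    return [h for h in hits if h.evalue < max_evalue]


# Made-up example records, for illustration only.
hits = [
    Hit("geneA", 10, 32, 35.2, 1e-8),
    Hit("geneA", 38, 60, 4.1, 2.5),
    Hit("geneB", 101, 123, 12.0, 0.3),
]
kept = filter_hits(hits)
print(len(kept), "of", len(hits), "hits retained")
```

Filtering on the expected count rather than on the raw score keeps the threshold meaningful if the size of the searched database changes.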

The next step was to assemble adjacent fingers into separate polydactyl motifs. The distribution of linker lengths is shown below; linkers of 21 residues or more have been aggregated.

Linker length     Occurrences
0                 209
1                 1
2                 1
3                 9
4                 7
5                 23
6                 1042
7                 21
8                 17
9                 16
10                6
11                4
12                2
13                1
16                3
17                2
18                4
19                3
20                1
21 and higher     150
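The assembly of adjacent fingers into polydactyl motifs can be sketched in Python as follows. This is only an illustration under stated assumptions: fingers are given as (start, end) residue coordinates within one protein, the linker length is taken as the number of residues strictly between two matches, and the `max_linker` threshold is a hypothetical parameter, since the report does not state the value used.

```python
def linker_lengths(fingers):
    """Residues between consecutive fingers; fingers are (start, end), 1-based inclusive."""
    ordered = sorted(fingers)
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]


def group_fingers(fingers, max_linker=8):
    """Split a protein's fingers into runs whose linkers are at most max_linker long."""
    ordered = sorted(fingers)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for cur, nxt in zip(ordered, ordered[1:]):
        if nxt[0] - cur[1] - 1 <= max_linker:
            groups[-1].append(nxt)   # short linker: same polydactyl motif
        else:
            groups.append([nxt])     # long gap: start a new motif
    return groups


# Made-up coordinates: three fingers joined by 6-residue linkers, then a distant fourth.
fingers = [(10, 32), (39, 61), (68, 90), (150, 172)]
print(linker_lengths(fingers))       # [6, 6, 59]
print(len(group_fingers(fingers)))   # 2
```

Whether a gap is counted as end-to-start or as the residues strictly in between shifts every length by one, which is one possible source of the kind of single-residue discrepancy discussed below.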

 

The common occurrence of a linker of size six is consistent with the canonical five-residue linker in zif268, though it is not yet clear whether the single residue discrepancy is real or simply due to the way adjacent alignments were calculated for this project. A sequence logo, drawn by the WebLogo server (http://www.bio.cam.ac.uk/seqlogo/), for the 1042 linkers of size 6 is shown below. Again, the composition of the linker is in very good agreement with the canonical TGEKP linker sequence that occurs in more than half [3] of the 5-residue linkers in the Transcription Factors Database.

 

For purposes of prediction, one of the most interesting characteristics of C2H2ZF motifs is the distribution of residues at the key –1, 3 and 6 positions of the α-helix.

 

Helix Position:    -1      3       6
Ala                18      102     89
Cys                41      17      6
Asp                81      84      28
Glu                44      84      95
Phe                15      25      6
Gly                24      55      21
His                108     228     21
Ile                6       17      94
Lys                69      53      171
Leu                41      38      72
Met                18      17      26
Asn                68      224     73
Gln                324     95      211
Arg                294     39      290
Ser                120     245     83
Thr                105     90      89
Val                31      42      119
Trp                52      1       1
X(err)             3       1       0
Tyr                62      60      28
Pro                0       7       1
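A tally of this kind can be sketched in a few lines of Python. The sketch assumes each motif's helix is available as a string that begins at helix position –1, so that (there being no position 0) string indices 0, 3 and 6 correspond to positions –1, 3 and 6; the function name is hypothetical, and the three sample strings are variant helices taken from the section "Quantifying Binding Preferences", not genome hits.

```python
from collections import Counter

# String index of each key helix position when the string starts at position -1.
KEY_INDEX = {-1: 0, 3: 3, 6: 6}


def tally_key_residues(helices):
    """Count the residues seen at helix positions -1, 3 and 6."""
    counts = {pos: Counter() for pos in KEY_INDEX}
    for helix in helices:
        for pos, idx in KEY_INDEX.items():
            counts[pos][helix[idx]] += 1
    return counts


helices = ["RSDHLTTHIR", "RVDALEAHRR", "DRASLASHMR"]
counts = tally_key_residues(helices)
print(counts[-1])   # Counter({'R': 2, 'D': 1})
```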

 

The large number of Asn, Glu, Gln, Arg and His residues is consistent with the expected occurrence of residues capable of forming two hydrogen bonds with Watson-Crick pairs in the major groove. The common occurrence of Ser is more difficult to account for.

 

Quantifying Binding Preferences

The "binding site signatures" reported by Choo and Klug [5] provide the most comprehensive published data to date on the specificity of DNA triplet binding obtained for a zif268-like C2H2ZF motif from a background of randomly selected amino acids. This section briefly reviews the methods they used and then describes how their published data were adapted for use in this project.

The phage display protocol used by Choo and Klug [4] began with the construction of a library of about 2.6E6 variants of wild-type zif268, which was cloned and ligated into the gene encoding the coat protein of fd phage. These combinatorial variations of zif268 were designed by randomly varying the nucleotides responsible for encoding positions –1 to 8 of the middle finger of zif268. Following transcription and translation, each resulting fd phage displayed one of the synthetic C2H2ZF fingers on its protein coat. The second step of the protocol involved affinity purification. A library of oligodeoxynucleotides was prepared which included the appropriate sequences for binding by fingers 1 and 3 of the modified zif268 and all 64 possible combinations for the binding site of the middle finger (finger 2). The oligodeoxynucleotides were attached to magnetic beads by biotinylation and were put in contact with the variant fd phage. After further purification and sequencing, the results were summarized in a table listing 16 of the 64 possible DNA triplets and a mere 33 of the 2.6E6 possible C2H2ZF motifs.

To further strengthen their conclusions, the authors extended the analysis in an accompanying paper [5]. In this paper they created a 33x12 array whose rows consisted of the C2H2ZF variants described above and whose columns were the elements of a library of DNA triplets with one position fixed and two randomized. Thus, for example, the triplets in the first column all had G in the first position and random nucleotides in the last two. The authors measured the strength of binding association at each of the 396 entries in the array and reported the results in a figure shown, in part, below.

 

 

By analyzing the binding preference of each entry in the array, Choo and Klug obtained the "binding site signatures" displayed in the far right column. For example, the 2nd row from the top indicates that the C2H2ZF motif with arginine at position –1 and alanine at positions 3 and 6 of the α-helix displays a fairly specific preference for the triplet GTG. The authors went on to summarize the binding site signatures in a 4x3 table that set out the major trends apparent in the 33 binding site signatures (not shown).

 

While very informative, the results presented in these two papers could not be used directly to provide automated, approximate prediction of binding sites. To adapt them to this purpose, the image shown above was scanned and, for each array entry, a 12x12 pixel sample of the gray-scale values displayed was selected. The locations of the sample points varied somewhat, as shown below; however, each sample remained within the corresponding array entry.

 

 

An average value was obtained from the 144 pixels sampled and used as a measurement of the strength of binding. The values of the four columns that correspond to a particular base position in the DNA triplet were converted to probabilities, which are shown below (multiplied by 1000).

 

Variant       5' base                Middle base            3' base
              G    A    T    C       G    A    T    C       G    A    T    C
RSDHLTTHIR    331  68   519  79      800  113  86   0       750  94   77   77
RVDALEAHRR    465  209  204  119     150  124  601  124     572  151  144  131
DRASLASHMR    461  50   461  25      68   106  694  129     59   752  153  34
NRDTLTRHSK    738  103  79   79      90   71   603  233     77   124  461  336
QKGHLTEHRK    289  152  280  277     411  194  207  186     160  289  276  273
QSVHLQSHSR    246  120  497  136     550  137  161  149     221  364  217  196
RLDGLRTHLK    385  101  406  105     159  283  280  277     560  158  140  140
TPGNLTRHGR    552  164  135  147     213  396  198  190     167  161  520  149
NGGNLGRHMK    571  142  118  167     161  507  169  161     176  183  476  163
RADALMVHKR    393  96   393  116     124  237  513  124     554  176  121  146
NQSNLERHHR    423  190  190  194     207  411  186  194     198  215  387  198
DRSNLERHTR    618  118  112  150     144  566  157  132     150  162  137  550
RSDTLKKHGK    607  126  120  145     121  121  146  611     611  146  121  121
QQSNLVRHQR    590  131  125  152     129  557  163  149     109  240  429  219
NGANLERHRR    573  146  111  167     189  409  209  189     191  222  414  171
RGDALTSHER    413  78   413  95      200  179  432  187     657  140  107  93
RGDHLKDHIK    344  151  344  158     632  101  132  132     638  151  98   111
RGPDLARHGR    591  134  128  146     145  94   139  620     595  159  128  116
REDVLIRHGK    619  122  116  141     100  94   256  547     593  168  118  118
RSDLLQRHHK    611  121  114  152     80   68   425  425     634  147  108  108
RQDTLVGHER    352  89   375  182     145  142  355  355     603  152  121  121
RAADLNRHVR    649  121  85   142     109  95   136  657     622  139  125  111
SQGNLQRHGR    588  139  107  164     123  450  212  212     373  100  357  169
TGGSLARHER    407  203  176  212     187  196  415  200     213  205  388  192
DHANLARHTR    713  147  38   100     136  623  150  89      83   83   151  681
LQSNLVRHQR    415  187  191  205     213  432  130  223     169  173  335  320
RKDVLVSHVR    353  160  381  104     79   75   422  422     785  115  57   41
RRDVLMNHIR    462  206  206  125     98   77   412  412     592  172  117  117
QGGNLVRHLR    388  184  184  242     148  507  164  179     144  433  289  132
SRDVLRRHNR    598  138  111  151     372  77   163  385     110  229  440  220
EKATLARHMK    761  102  11   125     72   144  185  597     156  165  521  156
QAQTLQRHLK    496  138  175  189     198  162  162  477     148  170  540  140
IASNLLRHQR    609  113  113  162     174  412  224  187     86   228  469  216
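The conversion from measured binding strengths to the per-position probabilities tabulated above can be sketched as a simple normalization over the four bases. The Python fragment below is an illustration only: the strengths are made-up numbers, the G, A, T, C column order is an assumption inferred from the tables in this report, and the uniform fallback for an all-zero group is an added assumption rather than something the report specifies.

```python
BASES = "GATC"  # column order assumed from the tables in this report


def to_probabilities(strengths):
    """Normalize four non-negative binding strengths so that they sum to 1."""
    total = sum(strengths)
    if total == 0:
        return [0.25] * 4   # assumed fallback: no signal, use a uniform distribution
    return [s / total for s in strengths]


# Hypothetical mean strengths (one per 144-pixel sample) for the G, A, T, C
# columns of a single triplet position.
strengths = [120.0, 30.0, 30.0, 20.0]
probs = to_probabilities(strengths)
print({base: round(1000 * p) for base, p in zip(BASES, probs)})
# {'G': 600, 'A': 150, 'T': 150, 'C': 100}
```

How a scanned gray level maps to a binding strength (for example, whether darker pixels are inverted before averaging) is left open here, because the report does not describe that step.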

 

The last part of this adaptation involved combining the probabilities to obtain numerical estimates of the likelihood of the individual binding site signatures. The results of this step are shown below. These may be interpreted as the estimated probability of observing a particular amino acid given a particular base and position. For example,

Pr(residue at helix position 6 = Arg | 5' base = G) = 0.136

The probabilities are preceded by the number of instances of that residue observed at that helix position.  Thus, this table presents a quantified expression of the authors’ summary of binding site signatures.

 

 

 

 

Each row gives a residue, the number of variants (n) with that residue at the helix position, and the probability of the residue given each base.

5' base (helix position 6)
Residue   n     G       A       T       C
A         1     0.11    0.154   0.07    0.082
R         20    0.136   0.105   0.041   0.109
N         1     0.109   0.151   0.07    0.086
D         1     0.082   0.111   0.117   0.108
E         1     0.069   0.112   0.095   0.189
G         1     0.084   0.065   0.127   0.124
K         1     0.144   0.093   0.041   0.099
S         4     0.087   0.075   0.149   0.061
T         2     0.085   0.063   0.157   0.063
V         1     0.093   0.07    0.134   0.079

Mid base (helix position 3)
Residue   n     G       A       T       C
A         3     0.093   0.112   0.181   0.051
N         11    0.098   0.299   0.063   0.061
D         2     0.075   0.06    0.048   0.225
G         1     0.093   0.176   0.098   0.098
H         4     0.35    0.085   0.052   0.041
L         1     0.047   0.042   0.15    0.15
S         2     0.075   0.094   0.195   0.058
T         5     0.073   0.08    0.102   0.16
V         4     0.095   0.051   0.11    0.155

3' base (helix position –1)
Residue   n     G       A       T       C
R         14    0.332   0.079   0.035   0.056
N         4     0.085   0.099   0.135   0.109
D         3     0.052   0.177   0.046   0.211
Q         5     0.083   0.159   0.109   0.096
E         1     0.083   0.088   0.162   0.078
I         1     0.046   0.121   0.145   0.108
L         1     0.09    0.092   0.104   0.16
S         2     0.128   0.088   0.124   0.097
T         2     0.101   0.097   0.141   0.085
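The tabulated values appear consistent with rescaling, for each base, the column of mean base probabilities so that it sums to 1 over the listed residues. The Python sketch below checks this for the 5' base G, using the Pr(base | residue) values given later in this section. It is a reconstruction under that assumption, not a statement of the procedure actually used; note that it weights every listed residue equally regardless of the number of variants behind it.

```python
# Mean Pr(5' base = G | residue at helix position 6), copied from this report.
pr_g_given_residue = {
    "A": 0.465, "R": 0.576, "N": 0.462, "D": 0.345, "E": 0.290,
    "G": 0.353, "K": 0.607, "S": 0.369, "T": 0.358, "V": 0.394,
}


def normalize(column):
    """Rescale a column of non-negative values so that it sums to 1."""
    total = sum(column.values())
    return {residue: value / total for residue, value in column.items()}


pr_residue_given_g = normalize(pr_g_given_residue)
# Compare with the tabulated Pr(position 6 = Arg | 5' base = G) = 0.136;
# small differences are expected because the inputs are already rounded.
print(round(pr_residue_given_g["R"], 4))
```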

 

 

 

 

 

 

 

 

 

For purposes of prediction, however, we are more interested in the probability of a base at a given triplet position given a specific amino acid at the corresponding helix position. These probabilities are summarized in the following table.

 

 

 

5' base (helix position 6)
Residue   G       A       T       C
A         0.465   0.21    0.205   0.12
R         0.576   0.143   0.121   0.16
N         0.462   0.206   0.206   0.126
D         0.345   0.152   0.345   0.159
E         0.29    0.153   0.28    0.277
G         0.353   0.089   0.376   0.182
K         0.607   0.127   0.12    0.146
S         0.369   0.103   0.439   0.09
T         0.358   0.085   0.463   0.093
V         0.394   0.096   0.394   0.116

Mid base (helix position 3)
Residue   G       A       T       C
A         0.158   0.18    0.516   0.145
N         0.168   0.479   0.179   0.174
D         0.128   0.095   0.138   0.639
G         0.159   0.283   0.28    0.277
H         0.599   0.137   0.147   0.117
L         0.081   0.068   0.426   0.426
S         0.128   0.151   0.555   0.165
T         0.126   0.128   0.291   0.455
V         0.163   0.082   0.314   0.442

3' base (helix position –1)
Residue   G       A       T       C
R         0.627   0.148   0.114   0.111
N         0.161   0.186   0.435   0.218
D         0.098   0.332   0.148   0.422
Q         0.157   0.3     0.351   0.193
E         0.157   0.165   0.521   0.157
I         0.086   0.228   0.469   0.216
L         0.17    0.174   0.336   0.321
S         0.241   0.165   0.399   0.194
T         0.191   0.183   0.455   0.171
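The entries of this table appear to be averages of the per-variant probabilities over all variants that share a residue at the relevant helix position. A minimal Python sketch of that averaging, using three variants from the table of per-variant probabilities and only their middle-base values, is given below. The function name is hypothetical, and the pairing of helix position 3 with the middle base follows the layout of the tables above.

```python
from collections import defaultdict

# (variant helix, middle-base probabilities x 1000 in G, A, T, C order),
# copied from the per-variant table in this section.
variants = [
    ("RGPDLARHGR", (145, 94, 139, 620)),
    ("RAADLNRHVR", (109, 95, 136, 657)),
    ("RSDLLQRHHK", (80, 68, 425, 425)),
]


def mean_base_probs(variants, helix_index):
    """Average per-variant base probabilities over variants sharing a residue."""
    grouped = defaultdict(list)
    for helix, probs in variants:
        grouped[helix[helix_index]].append(probs)
    return {
        residue: tuple(sum(col) / (1000 * len(rows)) for col in zip(*rows))
        for residue, rows in grouped.items()
    }


# Helix position 3 is string index 3 when the string starts at position -1.
means = mean_base_probs(variants, helix_index=3)
print([round(p, 4) for p in means["D"]])   # compare with the Mid row for D
```

Because each residue's row is a plain mean, residues seen in a single variant (such as Leu at position 3) simply inherit that variant's probabilities.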

 

In summary, the quantification of binding site preferences involved two steps. First, the underlying affinity data were converted from gray-scale values to a numerical index. There is no doubt a significant loss of precision in this step, but it is reasonable to assume the error introduced is unbiased. In the second step, marginal probabilities were used to supplant the authors' assessment of frequent occurrences among binding site signatures. This requires an assumption of independence among the three helix positions which is certainly false: "Structural studies, mutagenesis, statistical analysis of sequences, and design studies all show that the amino acids at positions –1, 2, 3 and 6 do not play fully independent roles in DNA recognition." [3]. However, there is good overall agreement between the table of conditional probabilities above and the authors' published summary of binding site signatures. Thus correlation among helix positions does not appear to play a dominant role in determining binding specificity.

 

Predicting Binding Sites

The last table displayed above estimates the probability of observing a base in a particular position of the DNA triplet given the occurrence of a particular amino acid in a given position of the C2H2ZF helix. From this table it is possible to construct a model which computes the likelihood of binding for a given DNA sequence and C2H2ZF occurrence. To apply this approach I selected the ZNF151 protein from the approximately 400 available motifs. ZNF151 was recently identified as a chromosome 1 gene for a C2H2ZF transcription factor [15], though little is known about its role or binding sites. It was selected for this project because:

·        its twelve fingers should provide enough discrimination to select specific binding sites;

·        all twelve fingers are joined by canonical linkers;

·        two of the twelve fingers rely on α-helices that occur among the 33 combinations in the Choo and Klug [5] binding study. For these two fingers it was therefore possible to obtain binding probabilities directly from the data, that is, probabilities that take into account possible correlation among the three helix positions. For the remaining ten fingers, probabilities are based on the assumption of independence among the three helix positions.
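A scoring model of the kind described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used for the project: the three dictionaries hold only a small excerpt of the Pr(base | residue) table, residues missing from the table fall back to a uniform distribution, fingers are paired with triplets antiparallel to the DNA (the N-terminal finger with the 3'-most triplet, as in the zif268 complex), only one strand is scanned, and all function names are hypothetical.

```python
import math

BASES = "GATC"  # column order of the probability table
# Excerpt of the Pr(base | residue) table; each row is in G, A, T, C order.
P_5PRIME = {"T": (0.358, 0.085, 0.463, 0.093),   # keyed by residue at helix position 6
            "R": (0.576, 0.143, 0.121, 0.160)}
P_MIDDLE = {"H": (0.599, 0.137, 0.147, 0.117),   # keyed by residue at helix position 3
            "D": (0.128, 0.095, 0.138, 0.639)}
P_3PRIME = {"R": (0.627, 0.148, 0.114, 0.111)}   # keyed by residue at helix position -1
UNIFORM = (0.25, 0.25, 0.25, 0.25)  # assumed fallback for residues absent from the table


def triplet_log_prob(finger, triplet):
    """Log-probability of a 5'->3' triplet for a finger given as (res_-1, res_3, res_6)."""
    res_m1, res_3, res_6 = finger
    rows = (P_5PRIME.get(res_6, UNIFORM),
            P_MIDDLE.get(res_3, UNIFORM),
            P_3PRIME.get(res_m1, UNIFORM))
    return sum(math.log(row[BASES.index(base)]) for row, base in zip(rows, triplet))


def site_log_prob(fingers, site):
    """Sum triplet scores; the N-terminal finger is paired with the 3'-most triplet."""
    triplets = [site[i:i + 3] for i in range(0, len(site), 3)]
    return sum(triplet_log_prob(f, t) for f, t in zip(fingers, reversed(triplets)))


def scan(fingers, sequence):
    """Yield (offset, log-probability) for every window of 3 * len(fingers) bases."""
    width = 3 * len(fingers)
    for offset in range(len(sequence) - width + 1):
        yield offset, site_log_prob(fingers, sequence[offset:offset + width])


# One finger with Arg at -1, His at 3 and Thr at 6 (the wild-type zif268 middle finger).
fingers = [("R", "H", "T")]
best_offset, best_score = max(scan(fingers, "ACTGGCA"), key=lambda hit: hit[1])
print(best_offset, round(math.exp(best_score), 3))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when twelve fingers (36 bases) are scored at once; a table entry of exactly zero would need special handling before taking the logarithm.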

 

The distribution of scores calculated for the 282,193,629 possible binding sites in chromosome 1 is shown below. The approximately 5500 top scores calculated across all chromosomes have been tabulated as track “C2H2ZF” in the class version of the genome browser. The relatively sharp drop in the upper tail of the distribution indicates that the tabulated probabilities provide significant discriminative resolution.

 

Discussion

This project has investigated the feasibility of predicting the binding sites of a particular family of DNA-binding proteins, C2H2 zinc fingers. The ability to accurately predict such binding sites would be useful for both conceptual and applied studies. An important requirement for gene therapy applications such as regulation of gene expression is the ability to explicitly target a unique site among the genome’s 3 billion base pairs and to bind to that site with high affinity. At a conceptual level, this capability would enable correlation of transcription factors with their associated binding sites and thus, perhaps, with the genes being regulated. Because of their modular nature and relatively simple binding mechanism, C2H2ZF motifs seem a good model system for investigation of this problem.

The approach investigated essentially extrapolates binding data obtained from a combinatorial phage-display study into a set of probabilities that can be used for scoring the likelihood of a binding site given a particular C2H2ZF occurrence. The approach has been demonstrated by predicting the binding sites of a particular protein, ZNF151. This approach is consistent with the recommendations of various researchers that a database of binding sites be gathered to help map the intricacies of C2H2ZF binding.

The following limitations apply to the work done for this project:

·        There was no attempt to incorporate the effect of the residue at helix position 2. This has been shown to play a significant role in binding by association with a base on the complementary strand in the triplet associated with the adjoining finger.

·        As noted earlier, the assumption of independence among helix positions –1, 3 and 6 is invalid, though it is not yet clear to what extent this correlation affects overall binding;

·        Evidence from the GLI structure and from recent studies of linker composition [16] indicates that not all fingers necessarily play a role in DNA binding. It is unlikely that all twelve fingers in ZNF151 are involved in DNA binding.

Notwithstanding some early optimism, the current consensus among researchers is that it is unlikely that a simple “recognition code” can be elucidated even for as comparatively simple a system as zif268-like C2H2ZF motifs. There clearly are areas of the overall (amino-acid) x (DNA) space where sharp correlations can be observed and others where binding associations are more diffuse. The possibilities of this combinatorial space are immense: the number of fingers in a polydactyl motif x 20**9 x 4**3. Data gathering and statistical analysis can help separate these regions and indicate where additional crystallographic/NMR structural data or binding-affinity data can contribute the most additional predictive power.
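To give a sense of scale, the expression above can be evaluated directly. The short Python sketch below simply multiplies out the terms as written in the text; the function name and parameter names are hypothetical.

```python
def search_space(n_fingers, variable_positions=9, n_residues=20, triplet_len=3):
    """Size of the (amino-acid) x (DNA) space, following the expression in the text."""
    return n_fingers * n_residues ** variable_positions * 4 ** triplet_len


print(f"{search_space(1):,}")   # 32,768,000,000,000
```

Even for a single finger this is roughly 3.3E13 combinations, against the 33 variants by 12 triplet libraries (396 measurements) available from the binding study used here.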

Further Work

A number of additional areas of investigation became apparent in the course of this project.

·        The major omission in the present analysis is the absence of any validation. The usefulness of this approach cannot be assessed without estimating whether the observed binding sites are actually among those predicted.

·        Binding site predictions should be repeated with a C2H2ZF protein more similar to zif268. This would address the problem of distinguishing binding from non-binding fingers.

·        The differences in length and composition of linkers in the human genome and in similar organisms should be investigated.

·        The geometric analysis of C2H2ZF structures by Pabo and Nekludova [12] indicates that the docking geometry of C2H2ZF fingers can be classified into distinct groups, each of which will exhibit a distinct “recognition code”. Learning how to predict docking geometry is thus likely to be an important part of overall prediction.

 

Acknowledgements

I gratefully acknowledge the general help, relevant references and useful advice I received from Professors William Scott and Grant Harzog and from Dr. Yael Mandel-Gutfreund. I am particularly grateful to Bill and Yael for taking the time to try to obtain additional data.

 

References

[1] Miller, J, McLachlan, AD, Klug, A. 1985. "Repetitive zinc-binding domains in the protein transcription factor IIIA from Xenopus oocytes". EMBO J. 4:1609-14.

[2] Luscombe, N, Austin, S, Berman, H, Thornton, J. 2000. "An overview of the structures of protein-DNA complexes". Genome Biology 1(1):1-37.

[3] Wolfe, S, Nekludova, L, Pabo, C. 2000. "DNA recognition by Cys2His2 zinc finger proteins". Annual Review of Biophysics and Biomolecular Structure 29:183-212.

[4] Choo, Y, Klug, A. 1994. "Toward a code for the interaction of zinc fingers with DNA: Selection of randomized fingers displayed on phage". Proc. Natl. Acad. Sci. USA 91:11163-11167.

[5] Choo, Y, Klug, A. 1994. "Selection of DNA binding sites for zinc fingers using rationally randomized DNA reveals coded interactions". Proc. Natl. Acad. Sci. USA 91:11168-11172.

[6] Pavletich, NP, Pabo, CO. 1991. "Zinc finger-DNA recognition: Crystal structure of a Zif268-DNA complex at 2.1 Å". Science 252:809-17.

[8] Jacobs, GH. 1992. "Determination of the base recognition positions of zinc fingers from sequence analysis". EMBO J. 11:4507-4517.

[10] Elrod-Erickson, M, Rould, MA, Nekludova, L, Pabo, CO. 1996. "Zif268 protein-DNA complex refined at 1.6 Å: a model system for understanding zinc finger-DNA interactions". Structure 4:1171-80.

[11] Seeman, ND, Rosenberg, JM, Rich, A. 1976. "Sequence-specific recognition of double helical nucleic acids by proteins". Proc. Natl. Acad. Sci. USA 73:804-808.

[12] Pabo, CO, Nekludova, L. 2000. "Geometric analysis and comparison of protein-DNA interfaces: Why is there no simple code for recognition?". J. Mol. Biol. 301:597-624.

[13] Bateman, A, Birney, E, Durbin, R, Eddy, SR, Howe, KL, Sonnhammer, ELL. 2000. "The Pfam protein families database". Nucleic Acids Research 28:263-266 (annual NAR database issue).

[14] Eddy, SR. 1997. "Hidden Markov Models and large scale genome analysis". Transactions of the American Crystallographic Association.

[15] Tommerup, N, Vissing, H. 1995. "Isolation and fine mapping of 16 novel human zinc finger-encoding cDNAs identify putative candidate genes for developmental and malignant disorders". Genomics 27:259-264.

[16] Moore, M, Klug, A, Choo, Y. 2001. "Improved DNA binding specificity from polyzinc finger peptides by using strings of two-finger units". Proc. Natl. Acad. Sci. USA 98:1437-1441.