To: casp5team
In-reply-to: <200212191957.UAA06418@tau.EMBL-Heidelberg.DE> (message from Rob Russell on Thu, 19 Dec 2002 20:57:40 +0100 (MET))
Subject: Re: We need your help assessing CASP5
References: <200212191957.UAA06418@tau.EMBL-Heidelberg.DE>
--text follows this line--
Here are the answers to the new-fold questionnaire.

Note: if anyone on the team wants to correct or clarify any of these answers, send mail both to me and to Rob Russell <russell@embl.de>.

> QUESTIONNAIRE
>
> The targets we are considering:
>
> New Fold Targets (NF):
> T0129 T0149_2 T0161 T0162_3 T0181
>
> New Fold / Fold Recognition Borderline Targets (NF/FR):
> T0146_1 T0146_2 T0146_3 T0172_2 T0173
> T0186_3 T0187_1 T0170
>
> Please give a concise summary of the method you tended to use for NF or
> NF/FR targets.
>
> 1. Would you classify your method as exclusively fold recognition (i.e.
>    always using an alignment to a template that covers most of the
>    target sequence)?

My methods were not EXCLUSIVELY fold recognition, though I always did a fold-recognition step as part of the process.

> 2. Did you alter your prediction method based on some prior classification
>    of targets (considering only those above)?
>    If 'YES', could you let us know how you classified the targets.

No, I ran all targets through the same process (which hurt my performance considerably on the CM targets, where stopping after the fold-recognition step would have been much better, as evidenced by the better performance of the SAM-T02 server compared with SAM-T02-human).

> 3. Does your method use secondary structure prediction?
>    If 'YES', what method did you use? (e.g. PSIpred, PHD, etc.)

Yes, we used 4 different neural nets to predict local structure, all using a thinned SAM-T2K multiple alignment as input. The 4 networks were trained for the following predictions: STRIDE; DSSP; STR, our extended DSSP alphabet that subdivides the beta strands; and ALPHA11, based on torsion angles between adjacent CA atoms.
See http://www.soe.ucsc.edu/research/compbio/SAM_T02/sam-t02-faq.html#secondary-meaning for an explanation of the alphabets. The method could be tersely described as "SAM-T02", though the ALPHA11 prediction is not currently done by the SAM-T02 web server.

> 4. Does your method have any manual intervention (e.g. adjusting
>    alignments, inspecting models, etc.)?
>    If 'YES', please describe how much manual intervention was used.

We had extensive manual intervention on the harder problems, having to assemble the beta sheets by hand in many cases. We tended to go through a cycle of looking at the low-cost models provided by undertaker, modifying the cost function (often adding constraints by hand, or modifying previously added constraints), reoptimizing, and looking at the new models. I chose the "best" few models to submit personally, though I often considered recommendations from other members of the team for their favorite models.

> 5. Does your method use homologous sequence information for the target
>    sequence?
>    If 'YES', please describe how.

Most definitely. We start by building a multiple alignment of probable homologs using the SAM-T2K iterative search, then use that multiple alignment for secondary structure prediction and for building an HMM. We use the HMM (and multi-track HMMs built from it plus the output from the secondary structure predictors) to do fold recognition and to generate fragments for assembly by undertaker.

> 6. Did you split target sequences into domains before predicting
>    structure?
>    If 'YES', what method did you use to define domains?

We generally started from a whole-chain prediction, but in some cases we tried breaking the target up into smaller pieces and repeating the fold recognition on the pieces. These were not always domain-based---in some cases we used arbitrary overlapping pieces.

> 7. Is your method a fragment-based approach to predicting structure?
>    If 'YES', what fragment library did you use?

YES.
Undertaker uses 3 sources of fragments:

1) a "generic" fragment library consisting of all 1-, 2-, 3-, and 4-residue fragments in a set of about 500 PDB files. These are indexed by the sequence of residues in the fragments. The generic fragments are included to allow essentially arbitrary conformations to be in the search space, but are not relied on to produce "good" conformations.

2) a "specific" fragment library generated by fragfinder, which used a 2-track HMM (amino acids plus predicted STR) to look for fragments of length about 9 in our template library of about 7000 PDB files. We kept the top-scoring 6 fragments centered on each residue.

3) a long-fragment library (mixed with the specific library internally) consisting of gap-free segments from the alignments to templates suggested by the fold-recognition process. We generally had about 20 alignments to each of about a dozen templates included for each target.

In addition to the fragments from the alignments, the whole alignments could be inserted into the conformation during the fragment-assembly process, so that full fold recognition is not really distinguishable from new-fold assembly by the program.

> 8. Did you use any publicly-available server to assist in your
>    predictions?
>    If 'YES', which servers did you use? What were they used for? (i.e.
>    screening targets, finding templates/fragments, etc.)

We looked at the CAFASP results manually to see whether our template selections were consistent with what other people were getting. In some cases we included the Robetta models as possible conformations to modify in our optimization process.

> 9. Does your method include lattice-based representations of proteins?

Nope, I have no faith in lattice-based representations.

> 10. Does your method include threading-type potentials (e.g. pair potentials, etc.)?
>     If 'YES', please describe the potentials used.

We have a cost function, but the only pairwise terms are for cysteine residues.
Most of the terms in the cost function are local environment properties, especially our notions about hydrophobic burial. We did not even have a hydrogen-bonding term in the cost function, which meant we needed to add hydrogen bonds as manual constraints to keep the sheets from being blown apart in optimization.

> 11. Does your method include steps for relaxation, optimization, or
>     minimization?
>     If 'YES', please describe.

I'm not sure what the difference is between "relaxation", "optimization", and "minimization". We were certainly doing a stochastic search of conformation space using a cost function. We did NOT follow this optimization with another smaller-motion optimization using a different cost function (which is what I THINK you mean by "relaxation").

> Please add any other information about your method that you feel is important.

We did not cluster the conformations we generated, but looked only at the low-cost ones. Since we often had several cost functions for a given target (using different weights for the components or different manual constraints), we looked at the low-cost conformations under several different cost functions and chose which to submit manually.

Info about individual targets:

T0129
The first model was picked as the best-scoring with no manual constraints, though helix formation was favored by the cost function. Our best model (model 3) was one where we liked the packing, but did not like the way that helix 5 was messed up. We did not include the Robetta models in the optimization that led to models 2 and 3, though they had been considered in generating models 1 and 4. We did not divide this protein into domains, though we had thought that the first 3 helices were one subdomain. It turns out that the 4th helix should be included with them and not with the last 3 as we had thought.
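As an aside on the stochastic search over fragment replacements mentioned under question 11: the general shape of such a search can be sketched in a few lines. This is a toy illustration only, not undertaker's actual code; the conformation representation, cost function, and all names here are hypothetical stand-ins.

```python
import math
import random

# Toy fragment-replacement search: a "conformation" is just a list of
# per-residue states, a fragment is a short run of states to splice in,
# and cost() is an arbitrary callable standing in for burial/clash/
# constraint terms.

def anneal(conf, fragments, cost, steps=10000, t0=1.0, seed=0):
    """Stochastic search: repeatedly splice a random fragment in at a
    random position, accepting the move by a Metropolis criterion under
    a linearly cooling temperature."""
    rng = random.Random(seed)
    cur, cur_cost = list(conf), cost(conf)
    best, best_cost = list(cur), cur_cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-6  # linear cooling schedule
        frag = rng.choice(fragments)
        pos = rng.randrange(len(cur) - len(frag) + 1)
        trial = cur[:pos] + list(frag) + cur[pos + len(frag):]
        trial_cost = cost(trial)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if (trial_cost <= cur_cost
                or rng.random() < math.exp((cur_cost - trial_cost) / temp)):
            cur, cur_cost = trial, trial_cost
            if cur_cost < best_cost:
                best, best_cost = list(cur), cur_cost
    return best, best_cost
```

The real search differs in obvious ways (fragment insertion moves backbone coordinates rather than overwriting a list, whole template alignments can be spliced in, and several cost functions with different weights and manual constraints were run per target), but the accept/reject loop over fragment moves is the same idea.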
We did use manual constraints in all the optimizations to try to get the helices to bundle together, though model 1 was reoptimized without the manual constraints.

T0149_2 (203-318)
We recognized that fold recognition had found only the N-terminal domain and tried predicting 191-318 as a separate domain. We also tried doing fold recognition for 1-200, 151-200, and 181-318. We got weak fold-recognition matches to SCOP domain c.37.1 (the same fold as the N-terminus) for the C-terminus, and conjectured a tandem repeat, but could not get this to work in undertaker. We tried adding constraints to hand-assemble the beta strands of the second domain. We tried assembling the two domains separately, then joining them together.

T0161
We found no homologs in the SAM-T2K iterative search, so the HMM and secondary-structure prediction were expected to be bad. We tried packing predicted strands by hand, using the Robetta models for hints. We also added straightness constraints for some predicted strands. Undertaker kept wanting to change this into a 4-helix bundle. I haven't seen the correct solution yet, so I don't know whether any of our models were in the right general direction.

T0162_3 (114-281)
We had no strong fold-recognition signals here, not even for the easier domains. We used the Robetta models as well as our own as starting points for optimization. We guessed at sheet topologies and added constraints. We considered trying to form a disulphide bridge C141-C157. I still haven't seen the correct structure, so I don't know whether any of our explorations were on the right track. We made no attempt to break this into domains for prediction.

T0181
This is one of the ones I was asked to talk about (though I didn't know that until I was actually at CASP5, and I still haven't seen the correct structure). We had some weak fold-recognition hits, though we believed them to be too weak to be much use for anything but super-secondary structure.
We did try to use the hits to get guesses about how the predicted anti-parallel strands were linked. Our model 1 (try22-opt) was the best-scoring with no manual constraints and with three different sets of manual constraints. Our model 2 (try18-opt) was best-scoring with two other sets of manual constraints.

T0146
We tried doing some subdomains: 1-80, 1-180, 120-220, 180-325. We predicted a domain boundary around 130-140 (which now seems a bit too high). We liked the looks of the robetta2 and robetta3 models, which had one big sheet, but we were not able to get a big sheet to form. By our 18th try we had the beginnings of a beta sandwich. Our model 1 was heavily based on Robetta's model 2. We had substantially better decoys in our set for each of the 4 domains, though none were really good.

CA RMSD                      whole       _1       _2       _3       _4
best for each              19.0241  14.4597  10.0453  14.4931  10.5645
submitted models:
1 try3-robetta2.1.60       20.1361  17.4053  12.0392  17.3951  12.6009
2 try18.3.80               21.3237  20.5213  12.8935  15.3085  14.3686
3 try20-opt-scwrl          21.6017  20.8086  12.8589  18.3672  14.3224
4 try11-al10.1.40          21.7395  17.4999  13.0246  15.7971  13.3209
5 try21.0.80               21.6111  20.4989  13.4218  15.0706  14.1969

For domain 1 (1-24,114-196), model 4 is best, model 1 second best. The real beta sheet is antiparallel 345216, with 1 coming from a distant part of the sequence. Strands 2, 3, and 6 were not predicted by the STR neural net. Model 4 has no beta sheet here. Model 1 has one hairpin (for strands 4-5).

For domain 2 (25-113), model 4 is best, model 1 second best. In real_2 the sheet is antiparallel in order 51432, and the predicted helix between 1 and 2 is only half there, to let 1 and 2 run in opposite directions. In model 1, the sheet is order 1^2v3^4v5v (so 234 are OK, but 1 needs to be moved between 4 and 5). Also, 5 is almost at right angles to the sheet, not really parallel or antiparallel. In model 4, only 234 are in the anti-parallel sheet.
Strand 1 is separated by about the right distance but is oriented in the wrong direction, and 5 is well away from the sheet.

For domain 3 (244-299), model 5 is best, model 1 second best. There is not much similarity between the correct structure and model 1, and secondary structure prediction is poor. Model 5 is not really much better, still having wrong secondary structure.

For domain 4 (197-243, not new fold?), model 4 is best, model 1 second best. Secondary structure prediction is OK, but even model 4 is not a good fit.

T0172_2 (116-216)
We did try a subdomain 106-222 with constraints on the two ends, to try to model the inserted domain.

Domain 1: We submitted one straight fold-recognition prediction (as model 3), which did the best of our models on the first domain. I was predicting that the topology of domain 1 would be 3^2^1^4^5^6^7 instead of 3^2^1^4^5^7v6^ as in 1dusA. Unfortunately, the 1dusA template was correct in the topology. I did predict that C46-C49 would be a disulphide bond (though for a while I entertained the idea that they were part of a metal-binding site).

Domain 2: My secondary structure prediction for domain 2 was pretty good, but I was having a lot of difficulty getting the helices to pack up near the other domain. I don't think we really managed to do anything with our submitted models on this one. We did have a better decoy in our set, though it did not score well.

CA RMSD                     whole       _2
best: try4.13.25          11.3483  12.1486
model 1: try26-opt        20.8816  17.4086
model 2: try29-opt        20.9515  17.4927
model 3: T0172-1dusA.pdb  incomplete
model 4: try20-opt        20.9603  19.9299

T0173
We predicted a parallel topology and fairly arbitrarily tried to force the protein to be a TIM barrel. We tried to cluster residues we believed to be part of an active site: H13, D15, D16, E44, R68, E71, D95, H144, D146, H147, H244, Q247, E271. We did not have 8 predicted strands, only 7, and fishing around for another one resulted in rather disordered structures.
We had terrible secondary structure prediction for the C-terminus, missing all the strands. I have not seen the correct structure yet, so I don't know if any of our decoys that were not submitted were any good.

T0186_3 (257-292)
We broke this target into two domains (1-43,331-364 and 44-330). We did not see 257-292 as a separate domain. The undertaker optimization of the domains messed up the good alignments of the fold recognizer (our server did much better on this target), so that good performance on the long loop (or domain 3) was impossible---the endpoints were not in the right places.

T0187_1 (4-22,250-417)
We tried arbitrary subdomains 1-150, 91-240, 201-350, 301-417, but none of them seem to have been able to fold, so we used them just as sources for more fragments, doing fragment packing on the whole structure. We tried guessing strand topologies and adding constraints to try to get beta sheets to form, but were not particularly successful.

T0170
Secondary structure prediction was pretty good, but we botched the packing of the helices. The best-fitting decoy in our whole set was the Robetta model 3, though our cost function liked our models much better than it liked the Robetta ones. We predicted an N-terminal helix (that didn't exist) packing where the C-terminal helix should have been.
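For reference, the CA RMSD numbers in the tables above are ordinary superposition RMSDs over alpha-carbon coordinates. A minimal sketch of that computation (Kabsch superposition via SVD; this is a generic illustration, not the actual evaluation scripts used):

```python
import numpy as np

def ca_rmsd(P, Q):
    """CA RMSD between two (N, 3) coordinate arrays after optimal
    rigid-body superposition (Kabsch algorithm)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Center both coordinate sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the 3x3 covariance matrix.
    V, S, Wt = np.linalg.svd(Pc.T @ Qc)
    # Correct for a possible reflection, keeping a proper rotation.
    d = np.sign(np.linalg.det(V @ Wt))
    R = V @ np.diag([1.0, 1.0, d]) @ Wt
    # RMSD between the superposed coordinate sets.
    diff = Pc @ R - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

Per-domain values like the "_1" through "_4" columns come from the same computation restricted to the residue ranges of each domain, with the superposition redone on just those residues.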