Thu Dec 16 16:34:03 PST 2004 Kevin Karplus See assessment.html for comparison of our results to those of other groups. ------------------------------------------------------------ 6 June 2004 Kevin Karplus This directory is for the SAM-T04 predictions for CASP6. Wed Sep 15 10:26:17 PDT 2004 Kevin Karplus The SAM-T04 team was lead by Kevin Karplus and had the following members: Kevin Karplus Professor of Biomolecular Engineering Sol Katzmann 0th year grad student in bioinformatics George Shackelford 1st year grad student in bioinformatics Martina Koeva 1st year grad student in bioinformatics Jenny Draper 3rd year grad student in bioinformatics Bret Barnes senior in bioinformatics Marcia Soriano senior in bioinformatics Richard Hughey Professor of Computer Engineering Note: all final submissions of 3d predictions were the responisbility of Kevin Karplus. All residue-residue contact predictions were the responsibility of George Shackelford. SAM software maintenance and development were the responsibility of Richard Hughey. Everyone (except Prof. Hughey) participated in looking at and trying to improve the 3D models. -------------------------------------------------------------------------- CONTENTS OF THIS FILE: Subject: Introduction and general procedures Subject: generating Hbond constraints Subject: constraints for beta strands Subject: new SheetConstraint undertaker command Subject: SheetConstraint checked out and checked in Subject: new PrintConformSheets undertaker command Subject: rasmol hint -- define selected sets, save in a script Subject: conformations and alignments process, redundancy Subject: T0196 polishing Subject: t04 multiple alignments vs. t04 method Subject: Robetta models for CASP-6 available to public Subject: Robetta models Subject: fetch_robetta make target implemented Subject: examining clashes Subject: USEFUL URLS Subject: optimizing multimeric models Subject: homomultimers and gromacs Subject: Rosetta's energy function. Subject: tips for rasmol scripts Subject: Re: tips for rasmol scripts Subject: Making undertaker runs shorter -- how many decoys? Subject: speeding up undertaker -- SCWRL Subject: new PrintConformHelices undertaker command Subject: info from the literature about targets Subject: SwissProt most useful links Subject: using structure-structure aligners Subject: creating new domain directories: A -- acrobat distiller Subject: creating new domain directories: B -- passphraseless access Subject: creating new domain directories: C -- create subdirectory Subject: nested loops of conformation generation genetic algorithm Subject: rosetta repack of dimer models Subject: hbond costs rescaled Subject: new undertaker with "bonus" Subject: new undertaker installed Subject: t2k or t04 fragments Subject: Current machine list Subject: adding VAST alignments Subject: making a dimer model from a try monomer model Subject: starting up new targets Subject: merging domains in DeepView -- simple monomers Subject: merging domains in DeepView -- dimers/trimers Subject: prettyscore Subject: putting domains back together Subject: printing sheet constraints from any model Subject: Getting SheetConstraints from an existing PDB file Subject: How to close stubborn breaks Subject: excluding templates Subject: Using gromacs Subject: Evaluating our predictions Subject: GDT evaluation of casp6 targets Subject: GDT curves Subject: Contact cost function (clens) -------------------------------------------------------------------------- 6 June 2004 Kevin Karplus Subject: Introduction and general procedures >> Sol: For an overview of the casp6 process, see the file >> "generic.method" in this directory. To make a prediction for a new target: 1) set up a directory using scripts/new-target t0196 (or whatever the target number is---capitalization shouldn't matter) 2) connect to new directory and run make cd T1096 make -k >& make.log & Note: because some of the script calls for running predict-2nd and undertaker, these cannot be run on the Condor cluster until it is upgraded to the new compiler. 3) After try1 has run, the Template.atoms file should be gzipped, and subsequent try*.under files should comment out the PrintTemplateAtoms line. Note: generally Kevin Karplus will do these initial runs. Also, Kevin should be the only one to change Make.main, except in an emergency. After the initial run is done subsequent work will consist mainly of creating new try*.under and try*.costfcn files, in attempts to improve the 3D models. All notes on the protein should be collected in the README file for the particular target. If multiple people are working on the same protein, we'll have to be careful not to step on each other's edits to the README file. If this starts to be a problem, we'll have separate README.karplus, README.ggshack, ... files for a target. I'd rather keep the notes in one file per target if we can. The initial automatic run creates a summary.html file, which can be viewed with any web browser. I do NOT what these pages to appear on the web until after the deadline for the target has passed---please don't link to them or to any directories in the casp6 heirarchy from your .html directory! I have gathered "ToDo" lists in various places: ToDo in this directory /projects/compbio/experiments/undertaker/ToDo http://www.soe.ucsc.edu/~karplus/protein-projects.html inside the source code for various programs (look for TODO or BUG) on scraps of paper and e-mail to myself. I'll try to get the "to-do" items that are in inaccessible places transcribed into one of the main lists. When you are looking for a project, big or small, these lists are a good place to start. If you have items that you want to add or to have me check off the lists, let me know. The main work this summer (other than program development) will be to try to improve the initial models that are produced by the automatic programs. This will consist of hours of looking at models (mainly with rasmol), trying to figure out what is wrong with them and how they can be improved, doing literature searches to find out more about the protein or family, seeing what the CAFASP servers and metaservers have done with the target, and modifying the models. I don't expect to do much direct modification of the models---mostly we'll be tweaking the cost function for undertaker by changing the weights of the different components of the cost and by editing the constraints in the try*.costfcn file. Each run of undertaker will have its own files: inputs: tryX.under, tryX.costfcn outputs: tryX.log, decoys/TARGET.tryX-opt2.pdb The first run (try1) is supposed to be created and run automatically as part of the initial make. Subsequent runs will be done by copying the try1 files to try2, try3, ..., and editing them. PLEASE DO NOT EDIT a try*.under or try*.under or .costfcn file once the run for it has been done. Note: we could combine the tryX.under and tryX.costfcn file into a single file, but I've found it handy to have the constraints and cost function separate from the optimization and I/O, so that they can be used in score-all.under also. >> Sol: recommended method for tryNN in target directory casp6/Txxxx: >> /projects/compbio/programs/undertaker/undertaker < tryNN.under >& tryNN.log Some of the earlier runs of earlier targets (only T0196-T0198) use a different split of the input script, with just the constraints in a separate file. I've found I have to play with the weights on different components of the cost function enough that it is better to have the weights and constraints together. We can evaluate the models we have created so far by running make decoys/score-all.rdb which creates a number of files in the decoys directory. The score-all.under file needs to be edited when the cost function or constraints change. We can superimpose several models for easier comparison with the make -k best-models.pdb.gz The models are specified in the superimpose-best.under script. I would like the final submission to be from the best-models.pdb.gz file. For a try2 run, I generally make try2.constraints from the TARGET.t02.dssp-ehl2.constraints file, plus any constraints from the .mi constraint files that look reasonable from the try1 model. When using the .mi constraints (or any very speculative constraints), it is probably a good idea to use the bonus_constraints cost function, rather than the constraints cost function. -------------------------------------------------------------------------- Date: Thu, 17 Jun 2004 16:58:10 -0700 From: Kevin Karplus Subject: generating Hbond constraints Martina asked where the Hbond constraints in the .under files come from. I think that everyone needs to hear my answer. Essentially ALL the Hbond constraints are hand generated. If I see some beta strands in a model (say foo.try33-opt2.pdb) that I like, I can look in the decoys/try33-opt2.beta-hbonds file to extract those hbonds. There is also a try33-opt2.all-hbonds file generated, but I don't generally use the Hbonds from there (except for an i.O i+3.N Hbond in a beta hairpin). Sometimes I edit the list of Hbonds, if I think that undertaker is introducing an unneeded bulge, or missing a bulge that is needed. Often I use the Hbond lists to try to orient and align strands. Looking at the predicted burial patterns with "script burial" in rasmol is often useful for helping to decide the orientation of the strand, though "select polar" and "color yellow" can also show a different view of the hydrophobicity patterns. The Hbond lists, unless they are in all the templates in comparative models, should be regarded as conjectures. I sometimes use Hbond lists to hold models together while trying to reduce clashes---this turned out to be necessary for T0207, which tended to pop open a bit to relieve clashes. The Hbond list (taken from the highly similar templates) helped keep it compact. ------------------------------------------------------------ Date: Sat, 19 Jun 2004 20:54:06 -0700 From: Kevin Karplus Subject: constraints for beta strands When we have a conjectured pairing of beta strands, there are two ways we can constrain the backbones: 1) using Hbond commands to pair the appropriate N and O atoms. This works pretty well, but requires deciding not only which residues line up, but also which way the hbonds stick out of the backbone. 2) using Constraint commands on the CB atoms of the strands. According to the histograms in pce/undertaker/output/ dunbrack-1332-beta-CB-antiparallel-bonded-dist.hist dunbrack-1332-beta-CB-antiparallel-unbonded-dist.hist dunbrack-1332-beta-CB-parallel-dist.hist The mean distance is CB-CB CA-CA antiparallel bonded 5.02 5.25 antiparallel unbonded 5.58 4.43 antiparallel combined 5.17 5.03 parallel 5.14 4.84 Constraining the CBs with Constraint XX.CB YY.CB 4.8 5.15 5.5 1.0 should work reasonably well with both parallel and anti-parallel strands. Once the Hbonding pattern is determined, it is better to use the Hbond constraints instead of the CB ones (at least refine the optimal CB distances as suggested above). -------------------------------------------------------------------------------- Date: Tue, 22 Jun 2004 06:46:00 -0700 From: Kevin Karplus Subject: new SheetConstraint undertaker command I got tired of constructing sets of sheet constraints by hand, so I've added a new command to undertaker for specifying a set of constraints: SheetConstraint F11 H15 G33 G37 1.0 specifies a parallel pairing 11 with 33, 12 with 34, 13 with 35, ... and unknown hbonding SheetConstraint F11 H15 G33 G37 hbond H15 1.0 specifies parallel pairing with all odd numbered residues on first strand having hbonds. (Any residue on either strand can be given to specify the phase of the hbonding.) To get antiparallel strands, just reverse the order of one strand: SheetConstraint H15 F11 G33 G37 hbond H15 1.0 pairs H15 with G33, 14 with 34, ... hbonding the odd-numbered residues of the first strand. One known bug: hbond specification is not quite right for separation=3 O-N on antiparallel strands, since that is not a standard antiparallel Hbond. The usual "Hbond" command gets it right, but I did not include the lookup for hbond distance in the SheetConstraint command. Warning: I've not tested the SheetConstraint command yet. Use it with extreme caution--check the constraints (using PrintConstraints) to make sure that they are being generated correctly. ------------------ Date: Tue, 22 Jun 2004 12:01:44 -0700 From: Kevin Karplus Subject: SheetConstraint checked out and checked in I've also checked out SheetConstraint a bit, and it seems to be working. I'll check it into the cvs version. -------------------------------------------------------------------------------- Date: Wed, 23 Jun 2004 14:25:12 -0700 From: Kevin Karplus Subject: new PrintConformSheets undertaker command I've just added a new command: PrintConformSheets filename that prints out the hbond-ladders in a conformation as SheetConstraint commands. This can be easier to read than a long list of hbonds, and can be used to incorporate structures that you like into cost functions to freeze them in. I'm going to start using it instead of PrintConformHbonds to report on the beta structure. -------------------------------------------------------------------------------- Date: Fri, 18 Jun 2004 22:45:28 -0700 From: Kevin Karplus Subject: rasmol hint -- define selected sets, save in a script You can create rasmol scripts that define a set of atoms, then use that set name in future select commands. Someday I'll get around to re-writing the programs that create the rasmol scripts to do this more cleanly. For example, one could write something like select 55 or 57 or 76-77 or 79 or 103 or 106-107 or 109 select selected or 117 or 119 or 121 or 166 define bindb selected then later could do select bindb and backbone or select bindb and */2 and (nitrogen or oxygen) This feature of rasmol is very handy when you've identified some active site residues or conserved residues, or some other set you want to highlight. -------------------------------------------------------------------------------- Date: Tue, 22 Jun 2004 15:01:27 -0700 (PDT) From: Kevin Karplus Subject: conformations and alignments process, redundancy Sol asked a couple of good questions: > At some point in the process are we supposed to switch from using TryAllAlign to reading > in the conformations from previous tries? Can you explain that process a little more? When you use ReadConformPDB to read in conformations, rather than TryAllAlign, you tend to do minor optimizations to the best-scoring input conformation. That is, you generally stay in the local minimum you've found. So, if you are just trying to choose betwen existing models and polish them a bit, use ReadConformPDB and turn up the initial pseudocounts for the Crossover method. If you are trying to find new conformations, start from TryAllAlign. If you are looking for REALLY new conformations, just start from the initial random conformation (and run for much longer, since undertaker rarely forms significant chunks of structure). > In T0204 I 'include read-alignment.under' for all the high-scoring templates. Some of > these are highly redundant. Are there pros and cons to this redundancy? Any performance > implications? It takes some extra time to read in all the alignments, but only occasionally does this cause much problem (when SCWRL gets wedged). Having extra alignments that are identical raises the probability of that alignment or its fragments getting used. Having extra alignments that are slightly different allows more thorough search of conformation space in the neighborhood of a particular conformation. I plan to INCREASE the number of initial alignments, hoping that we'll increase the probability of having a good alignment in the initial set and of choosing a good alignment. > Speaking of redundancy, why do we still have so much redundancy in our template set? > > See T0204/clustalw.1hxqA.1hxpA.1guqA.aln > T0204/clustalw.1kpf.1rzyA.4rhn.1fit.2fhi.1emsACterm.aln We have a lot of redundancy in the t2k template set for historical reasons---we only remove templates when the underlying PDB file becomes obsolete. The t04 template set has 2 sources of redundancy: 1) important test sets were included, even if newer templates made them "unnecessary" 2) we deliberately use up to 90% identical sequences if the resolution is high enough. I pick up Dunbrack's reduced redundancy sets weekly (pcem/select-sets/dunbrack/), and use higher % thresholds for finer-resolution structures. Incidentally, I don't encourage using clustalw. If you want a global multiple alignment other than those provided by the SAM tools, try "muscle" instead of clustalw. -------------------------------------------------------------------------------- Date: Fri, 18 Jun 2004 22:14:35 -0700 From: Kevin Karplus Subject: T0196 polishing ... You can also make a different attempt at sidechain packing by making decoys/T0196.try13-opt2.repack-nonPC.pdb, which will use a version of Rosetta to repack the sidechains. Sometimes Rosetta does a better job, sometimes not (Rosetta and undertaker have somewhat different views on what constitutes a clash---I spent quite a bit of time this spring comparing the two clash checkers and finding bugs in both). You may need to turn up the weights of soft_clashes and break to start polishing the model. ... ------------------ Wed Jun 23 16:55:58 PDT 2004 Kevin Karplus It might be worth doing a polishing try without constraints (copy unconstrained.costfcn to a tryXX.costfn file) starting from all decoys. ... There are sitll some bad clashes in try14-opt2, so it might be worth doing Rosetta sidechain repacking for the top few models before doing a polishing run. -------------------------------------------------------------------------------- Date: Tue, 22 Jun 2004 12:50:01 -0700 From: Kevin Karplus Subject: t04 multiple alignments vs. t04 method There are two things that I refer to as "t04": 1) the target04 multiple alignment script, which was intended to replace the target2k multiple alignment script. In CASP6, it seems that sometimes t2k and sometimes t04 comes up with a better multiple alignment, so for now, we'll use both. The iterated search and multiple alignment script does not use structure (predicted or actual) anywhere. >> Sol: Thus the t04 multiple alignments are based on 1-trk HMMs (AA only) >> and do not require the t04 neural net predictions of other alphabets 2) the whole method: multiple alignment, neural nets, profile-sequence and profile-profile alignment, undertaker, ... -------------------------------------------------------------------------------- Date: Mon, 14 Jun 2004 19:20:57 -0700 From: Dylan Chivian Subject: Robetta models for CASP-6 available to public Hi folks, Until there's a central source for server-CASP models to assist human groups' prediction efforts, we want you all to be aware that the Robetta models are available. Hopefully, the CASP organizers will make the server models accessible from one central location before too long, so we don't have to be dependent on arne's submissions to the BioInfo meta server, which doesn't capture all the servers. http://robetta.bakerlab.org/queue.jsp?UserName=casp6 One thing to note is that you can get more detailed information about the predictions, such as the alignment for homology modeled domains, or MAMMOTH hits for de novo domains, by clicking on the domain number in the Ginzu table. We apologize, but please don't resubmit regions of the CASP targets to Robetta, as the server is overloaded, and you probably wouldn't get your models quickly enough to be useful. ------------------ Date: Tue, 15 Jun 2004 06:14:56 -0700 From: Kevin Karplus Subject: Robetta models As the message I just forwarded says, the Robetta models are publically available. In the past I have sometimes found it useful to include Robetta models (with ReadConformPDB) as starting points in the undertaker optimizations. We should probably pick up the Robetta models for each target and store them in decoys/robetta1.pdb, decoys/robetta2.pdb, ... ------------------ Date: Tue, 22 Jun 2004 12:01:44 -0700 From: Kevin Karplus Subject: fetch_robetta make target implemented I implemented the "fetch_robetta" target to get the publically available Robetta targets, and have fetched all the currently availabl models. You can find the models in casp6/T*/decoys as robetta-model1.pdb.gz through robetta-model10.pdb.gz. (Note: there are only 5 models when Robetta did the modeling by homology rather than as a new fold.) -------------------------------------------------------------------------------- Date: Thu, 24 Jun 2004 08:19:24 -0700 From: Kevin Karplus Subject: examining clashes It is sometimes useful to know where clashes are happening, and not just how many or how bad they are. There are two ways to get a this information: 1) from undertaker via the ClashConform command (which SHOULD have been called PrintConformClashes for consistency with other commands), except that it does all conformations by default, rather than just the current conformation. I really need to clean up the "Print" commands and make a single "Print" command with lots of options. ClashConform is called automatically by the score-all.under script, creating an all.clashes file in decoys/ I just fixed the command so that it sorts the clashes worst first. It is a good idea to look at the worst clashes to see if they are caused by providing inconsistent constraints, or if adding a constraint can help pry things apart. 2) from Rosetta If you make a .repack_nonPC.pdb or .score.pdb file, then there is a lot of extra information in the file about how well Rosetta likes the residues. The "Erep" column reports the "repulsive" term of the Van der Waals energy, so it roughly the equivalent of our clash cost (though scaled differently). Rosetta and undertaker have somewhat different ideas about clashes, but usually if one hates a clash, the other will too. (Please report to me any cases where Rosetta sees a bad clash and undertaker doesn't---it probably represents a bug in the clash definition that I need to fix.) -------------------------------------------------------------------------------- Subject: USEFUL URLS CAFASP ok? http://cafasp4.bioinformatics.buffalo.edu/validmodels/status.php CASP ok? http://predictioncenter.llnl.gov/casp6/models/casp6-models.html Bioinfo.pl predictions? http://bioinfo.pl/Meta/status.pl Robetta ready? http://robetta.bakerlab.org/queue.jsp?UserName=casp6&rpp=100 Info from the literature? http://cassandra.bio.uniroma1.it/Casp6/Targetinfo.html beta release of SCOP: http://www.mrc-cpe.cam.ac.uk/SCOP_Parseable_Files_for_CASP6.html (downloaded and in pc/data/scop) Status check for servers: http://predictioncenter2.llnl.gov/Serv/ -------------------------------------------------------------------------------- Date: Mon, 28 Jun 2004 14:13:29 -0700 From: Kevin Karplus Subject: optimizing multimeric models Right now, there is no way for undertaker to handle multimers when optimizing, so we can't do the opbious thing of adding sheet constraints between multimers. Someday ... That means that we have to fake anything that requires multimeric contacts. One (ugly) way to do this is to add scaffolding constraints, specifying distances from several residues. It is very difficult to get these right, and probably not worth the trouble. We're probably better off taking a small hit on the positions of the multimeric contacts. Where we have severe conflicts, we may be able to add a bit of scaffolding to good effect (say to tuck a terminal helix away out of the oligimer interface), but positioning things accurately enough to get the right multimeric contacts will probably have to wait until undertaker can explicitly handle multimers. ------------------------------------------------------------ Sun Jan 23 04:08:34 PST 2005 Kevin Karplus Subject: homomultimers and gromacs Use the unpack-multimer script to convert a single-chain numbering of a homomultimer into its separate chains. The simplest way is to define MONOMER_LENGTH in the Makefile, and use 'make foo.unpack.pdb.gz' to split up foo.pdb.gz into monomers of the given length. If you want to use gromacs on a multimer, be sure that the input has the monomers in separate chains, or gromacs will try to merge them. Defining MONOMER_LENGTH and using the target Txxxx.multgro5 instead of Txxx.gro5 will do the unpacking before and after gromacs. -------------------------------------------------------------------------------- Date: Mon, 28 Jun 2004 14:13:29 -0700 From: Kevin Karplus Subject: Rosetta's energy function. Sol asked about Rosetta's energy function. Rosetta's enery function in the .repack-nonPC.rdb files has several components: Eatr Erep Esol Eh2o_sol Eaa Edun Eintra Ehbnd Epair Eref Eh2o Eh2o_hb Eres Eatr is the attractive part of the Van der Waals contacts. Erep is the repulsive part of the Van der Waals contacts (i.e. clashes) Esol is a solvation term Eh2o_sol I've no idea. Probably some other solvation term. Eaa I've no idea. Edun is a term for the popularity of the rotamer the residue is in. Eintra is a term for intra-residue clashes, which are included in the Erep term. Epair some sort of pairwise residue-residue energy term Eref I've no idea Eh2o,Eh2o_hb I've no idea, but they seem to be always 0 in the repackings. Eres is the total of the rest. >> Sol: That is to say, Eres is the sum of all the named terms above, >> and thus represents the overall evaluation. ------------------------------------------------------------ From karplus@soe.ucsc.edu Sun Jul 11 12:07:05 2004 Date: Sun, 11 Jul 2004 12:07:03 -0700 From: Kevin Karplus Subject: tips for rasmol scripts George asked for tips on using the rasmol scripts. Currently, there are rasmol scripts automatically generated for secondary structure and burial predictions, and for conservation. Soft-links are provided to avoid having to type long file names, but the long file names are available if you want to look at the predictions that don't currently have softlinks. For example, script str2 will show the molecule in cartoon format with the predictions for the str2 alphabet (yellowish for strands, reddish for helices, blue for turns), with darker yellow where the strands are predicted to be parallel. The other secondary structure alphabets (stride, dssp, ehl2, alpha) are similarly colored, except for bys, where I picked a different color scheme to match the colors on the sequence logo---I should probably change the color schemes so that all the secondary-structure predictors use essentially the same colors, both for rasmol and for sequence logos. Note that the sequence logo conveys more information than the coloring script, since it expresses the confidence in the prediction, as well as the most probable letter. The burial scripts "burial" and "near" color residues so that purple is most exposed, then blue, green, ... down through the rainbow to red. The colors aren't exactly spectral (brown instead of yellow, for example), but the sequence is usually fairly clear. The colors may need tweaking for the "near" scripts, as there are more levels of burial there and I am still using the first guess at colors for those. What I usually do on looking at a new model is to run "script str2" or "script ehl2" to see if the strands and helices are roughly right. If I want to check burial, I then do "script near", often followed by switching to spacefill to see that the buried residues really do get buried. If I want to check hbonds, I may do "script str2", "restrict backbone", and switch to sticks display in the menu. The "conserved_t04" and "conserved_t2k" scripts highlight the conserved residues. It is usually best to have a cartoon display before doing them. They also define sets (with the same names as the scripts), so that you can do "select conserved_t2k or conserved_t04" after running the scripts. From karplus@soe.ucsc.edu Sun Jul 11 14:35:20 2004 Subject: Re: tips for rasmol scripts In response to a question from George: I don't have any scripts for displaying pairwise constraints in rasmol. What I do is manually type select 85 or 93 color green (or any color not currently in use) If that isn't clear enough, I also change to sticks display for the selected pair. When the screen gets too cluttered with pairs and I can't think of more colors to use, I do "script str2" to clear the existing pairs and start over. I can generally do 5 or 6 pairs before things get too confusing, but if the same residue is involved in many pairs, then the clustering is hard to view---I'll sometimes select all the residues of the cluster and highlight them with the same color. If I want to look a particular sheet constraint, I select the two ranges and color with a currently unused color select 85-95,103-113 color blue If I want to see the strands highlighted and still have the current coloring (say from "script burial" or "script near", then I'll often do select (85-95,103-113) and backbone then switch to sticks display to see the backbone. If I want to see which way a strand is oriented, I'll sometimes do select (85-95,103-113) and *.CB then switch to ball-and-stick display. Selecting particular residues in cartoon mode is often difficult, so I'll often switch to backbone mode to get easier selection. ------------------------------------------------------------ From karplus@soe.ucsc.edu Mon Jul 12 20:36:58 2004 Subject: Making undertaker runs shorter -- how many decoys? Jenny asked > What do you suggest changing to make shorter undertaker runs? The time for a run is roughly proportional to the number of decoys made and scored, which is super_iter*num_gen*gen_size + super_num_gen*gen_size (unless super_iter=1, then it is just num_gen*gen_size as no second pass is done) If you are doing polishing on an almost completed model, I think it is best to do a few long runs (super_iter =1 or super_iter=2), but if you are trying to get some big change to happen, it is probably better to do many short runs (super_iter>5). The gen_size should be bigger than the pool_size. ------------------------------------------------------------ From karplus@soe.ucsc.edu Mon Jul 12 21:09:37 2004 Subject: speeding up undertaker -- SCWRL One big time sink is calling SCWRL for each alignment in all-align.a2m. This makes for better specific fragments, but on T0208 may take an hour to complete (even with the time-outs for SCWRL). You can either remove the SCWRL keyword from the ReadFragmentAlignment command, or you can reduce the number of alignments you use by using InfilePrefix 4xis/ include read-alignments-scwrl.under for just the templates you want to include (going further, you can use the read-alignments-noscwrl.under script instead). Be sure to run "make read_alignments" first, to make sure that the scripts are up to date. ------------------------------------------------------------ Date: Thu, 8 Jul 2004 08:05:20 -0700 From: Kevin Karplus Subject: new PrintConformHelices undertaker command I added another new command to undertaker today: PrintConformHelices It is like PrintConformSheets, except that it looks for alpha helices, which it defines a sequence of 2 or more O(i)-N(i+4) hydrogen bonds. ------------------------------------------------------------- Date: Fri, 9 Jul 2004 09:55:32 -0700 From: Kevin Karplus Subject: info from the literature about targets Don't forget to check http://cassandra.bio.uniroma1.it/Casp6/Targetinfo.html to see if anyone has reported information from the literature about a target. -------------------------------------------------------------- Date: Fri, 9 Jul 2004 10:47:11 -0700 From: Kevin Karplus Subject: SwissProt most useful links Marcia asked which links from a SwissProt entry are most useful. That all depends on what you want. I've found the "features" section to be useful for telling me about known active sites, binding sites, cleavage points, and so forth. The GO (gene ontology) classification gives you function in a standardized format. Protonet and Pfam may tell you about conserved domains (some of whose functions are known). Basically, it is a matter of exploring and looking for tidbits of useful info. -------------------------------------------------------------- Date: Fri, 9 Jul 2004 20:55:34 -0700 From: Kevin Karplus Subject: using structure-structure aligners Martina tried using DALI to get a structure-structure alignment for T0212. Not a bad idea, but she then had trouble translating the DALI alignment into something undertaker can use. I've had pretty good luck taking VAST alignments in the mFASTA format and converting them to a2m. The pairwise ones just need have the names fixed and the "-" converted to ".". To get the alignment into undertaker, I put it into the appropriate subdirectory, then run make read_alignments all-align.* to update the all-align.a2m.gz file and the read-pdb scripts. Then the undertaker run can have either ReadFragmentAlignment all.a2m or InFilePrefix 1jh5A/ include read-pdb-scwrl.under to include the alignment in the set it tries. -------------------------------------------------------------------------- 6 June 2004 Kevin Karplus Subject: creating new domain directories: A -- acrobat distiller Before you run make in your new target directory, make sure you have a file called "AcrobatDistiller" in your home directory which contains the text: AcrobatDistiller.compatibilityWarningIssued: YES AcrobatDistiller.compatibilityLevel: 3.000000 ... you can copy this file from ~karplus/AcrobatDistiller If you don't have this, your make run will freeze, as AcrobatDistiller (used to make the .pdf -logo files) will pop up a "pick a pdf version" menu -- which you can't choose from! .................................................................... 30 July 2004 Sol Katzman Subject: creating new domain directories: B -- passphraseless access You will need passphraseless access to the machine that runs acrobat distiller. From your home directory: ssh-keygen -t dsa (just press Enter as a response to all questions) cd .ssh cp id_dsa.pub authorized_keys2 .................................................................... 30 July 2004 Sol Katzman Subject: creating new domain directories: C -- create subdirectory this example is to create subdomain for residues 243-330 of T0219 # make the domain subdirectory and its decoys subdirectory cd ~/casp6/T0219 mkdir 243-330 mkdir 243-330/decoys # copy necessary starter-directory files from T0219 parent directory # (see the latest list of files in ~/casp6/starter-directory for any others) cp Makefile 243-330 cp paths.txt 243-330 cp show-align.under 243-330 cp superimpose-best.under 243-330 cp try1.costfcn 243-330 cp try1.under 243-330 # copy the original sequence and blank pdb from T0219 parent directory cp T0219.a2m 243-330 cp T0219.blank.pdb.gz 243-330 # enter the domain subdirectory cd 243-330 # edit the new Makefile, sequence and blank pdb edit Makefile, adding a line like this: START_COL := 243 edit T0219.a2m, removing all residues EXCEPT 243-330 gunzip T0219.blank.pdb.gz edit T0219.blank.pdb, removing all residues EXCEPT 243-330 gzip T0219.blank.pdb IMPORTANT, unless gzipped, rasmol scripts will be bad # make try1 for the domain IN THE SUBDIRECTORY 243-330 # (from the directions for starting a new target) (make -k >& make.log ; gzip -9f *.log *.atoms) & # Now proceed to use the domain as per a new target. # In particular going from try1.under to try2.under, comment-out # the PrintTemplateAtoms command. -------------------------------------------------------------------------- Date: 15 Jul 2004 From: Sol Katzman Subject: nested loops of conformation generation genetic algorithm In first step, super iterations are performed in the triply nested loop: for si in 1:super_iter { start from initial pool of conformations for ng in 1:num_gen { for gs in 1:gen_size { generate a conformation } score and select new pool (some from prev pool, some newly generated) } save best[si] } In second step (only if super_iter > 1) use the super-iter best from the first step in the doubly nested loop: start from initial pool of conformations PLUS the best from each iteration from previous step for sng in 1:super_num_gen { for gs in 1:gen_size { generate a conformation } score and select new pool (some from prev pool, some newly generated) } select best If "keep_all" is set, then then best from each iteration is not only saved for the final "super iteration" but is also pushed on the conformation stack and will be used in subsequent "CostConform" and "OptConform use_all" commands. -------------------------------------------------------------------------- Date: 20 July 2004 From: Sol Katzman Subject: rosetta repack of dimer models First you must have a dimerized version of the prediction, for example: ~/casp6/T0229/dimer-1ml8A-try6.pdb Since score-all looks at ALL the pdb files in the decoys directory, we do not want to have the dimer models in the decoys directory. The following command makes the dimer repack version, which must then be hand edited and moved to the decoys directory. cd ~/casp6/T0229 make dimer-1ml8A-try6.repack-nonPC.pdb cp dimer-1ml8A-try6.repack-nonPC.pdb decoys/T0229.monomer-from-dimer-try6.repack.pdb hand-edit decoys/T0229.monomer-from-dimer-try6.repack.pdb, as follows: All the ATOM records have chain A, but there are twice as many residues as you want. Just delete the second half of the ATOM records. Now you can run score-all as usual. -------------------------------------------------------------------------- Date: Wed, 21 Jul 2004 08:18:23 -0700 From: Kevin Karplus Subject: hbond costs rescaled I just rescaled all hbond costs, in preparation for merging them with constraints in a more uniform framework. To get the same effect as before, rescale the hbond weights up by a factor of 13. Since the weights were not chosen in any sort of optimal way, this scaling can be approximate. I'll be using hbond_geom 1 \ hbond_geom_backbone 1 \ hbond_geom_beta 5 \ hbond_geom_beta_pair 10 for try1 in future. -------------------------------------------------------------------------- Date: Wed, 21 Jul 2004 18:03:11 -0700 From: Kevin Karplus Subject: new undertaker with "bonus" I have created a new version of undertaker called ~karplus/dna/undertaker/undertaker that eliminated bonus_constraints as a cost function, and instead allows the "bonus" keyword to be added to any Hbond, SSBond, Constraint, HelixConstraint, StrandConstraint, or SheetConstraint command. The "help" commands should explain the new syntax. Also, the known_ssbond cost function is rather redundant now, since SSbonds, like Hbonds, are handled internally as constraints---just giving the SSbond command (with a weight) adds it to the constraints. I've just got it to compile and link, and have not done ANY testing yet, but I need to go home for supper. If anyone wants to try it out, feel free---just send me bug reports! (The SSBond costs may be rather bogus---I haven't checked them for compatibility with the other constraint costs.) Kevin Date: Thu, 22 Jul 2004 07:54:25 -0700 From: "Kevin Karplus" View Contact Details Subject: new undertaker installed The new version of undertaker seems to be working well enough that I installed it in ~/pcpr/undertaker/undertaker If you need to roll back to the old version, I tagged it "pre-bonus" in the cvs system. Biggest changes: bonus_constraints as a function is gone---instead you can add the "bonus" keyword to the end of any constraint commands (including Hbond, SSbond, SheetConstraint, ...) I'm not sure that SSbond bonus does anything useful right now, because I think that the ssbond costs are not scaled appropriately for that to work. I'll try to look into that before I go to ISMB, but SSbond without the bonus option should still be ok. ------------------------------------------------------------------------ Date: Tue, 20 Jul 2004 15:51:25 -0700 From: Kevin Karplus Subject: t2k or t04 fragments I was just rethinking my advice on the fragment libraries. More fragments is probably better, so we should include BOTH: ReadFragmentAlignment TXXXX.t2k.many.frag ReadFragmentAlignment TXXXX.t04.many.frag Further thought: If we switch to t04.many.frag or add it, we should save the Template.atoms file again, since there are probably templates used for the fragments that weren't used before. >> Sol: In that case you should remove the old Template.atoms.gz, >> and gzip the new Template.atoms ------------------------------------------------------------------------ From: George Shackelford Subject: Current machine list Date: Thu, 22 Jul 2004 13:59:09 -0700 Below is a list of machines as I have compiled them. I spoke to Todd, and he said we could use his machines as long as they aren't heavy on bandwitdth and we run them 'nice' (like 5). - George Don't mess with Firas or Adam! post-doc -- whinny 4770 gill -- ribbit 5177 post-doc -- gobble 3380 terry -- caw 5177 katie Chuck's world -- crow 4770 corner -- quack 3380 chuck! -- meow 3380 bret 1st year's row -- bark 3580 martina -- woof 3580? dan -- squawk 4770 firas! -- peep 5910 george -- tweet 1600 bernard Jenny's world -- croak 5910 krishna -- baa 3380 ? -- calico xxxx Windows Robert's world -- coo 5910 robert -- cluck 4770 ryan Sol's row -- squeal 4770 yontau -- bleat xxxx adam! -- chirp 4770 sol Todd's world Todd says "ok, just Make the processes nice." He let us know if there's any problem. -- abyss 4780(2) -- vashon 4780(2) -- hori 4780(2) -- halo 4780(2) -- meth? not running? -------------------------------------------------- Same as above, sorted by bogomips Bogomips are retrieved by 'cat /proc/cpuinfo' host bogomips whose --- ------- ---- cheep 6370 karplus coo 5910 robert Robert's world croak 5910 krishna Jenny's world peep 5910 george 1st year's row caw 5177 katie post-doc ribbit 5177 post-doc halo 4780(2) Todd's world hori 4780(2) Todd's world vashon 4780(2) Todd's world abyss 4780(2) Todd's world chirp 4770 sol Sol's row squeal 4770 Yontau Sol's row cluck 4770 ryan Robert's world squawk 4770 firas! 1st year's row crow 4770 corner Chuck's world whinny 4770 gill post-doc woof 3580? dan 1st year's row bark 3580 martina 1st year's row baa 3380 ? Jenny's world meow 3380 bret Chuck's world quack 3380 chuck! Chuck's world gobble 3380 terry post-doc tweet 1600 bernard 1st year's row bleat xxxx adam! Sol's row calico xxxx Windows Jenny's world meth? not running? Todd's world ------------------------------------------------------------ Date: Thu, 22 Jul 2004 16:40:07 -0700 From: Kevin Karplus Subject: adding VAST alignments Each of the top few vast hits should be fetched. P-value is probably more important than length. I notice that the try11 vast run for T0198 is basically returning a gapless alignment to 1oshA, which is one of our templates already---the top hit from the t04 alignments. I cut-and-pasted the mFASTA version of the VAST alignment into 1oshA/T0198-1oshA-vast-try11.a2m the replaced the >id lines with names we can use (T0198 and 1oshA). It is a good idea to look at pcd/pdb/dunbrack-pdbaa to find out which name we are using for a sequence that occurs repeatedly in the PDB database---it is often NOT the one that VAST returns. I also replace all "-" characters with "." in the vast alignment, to make it a legal a2m file. After you have added the vast alignment to a subdirectory, you should make sure that the chain id appears in the MANUAL_TOP_HITS macro in the Makefile. If it is a new template, you can run "make extra_alignments" to make sure that all the alignment methods are run to create alignments for that template. Then run "make all-align.a2m.gz" to re-create the all-align.a2m.gz file that many undertaker scripts use to get long fragments and alignments. ------------------------------------------------------------ Date: Mon, 26 Jul 2004 16:40:07 -0700 From: Sol Katzman Subject: making a dimer model from a try monomer model To make the superposed dimer model for a try, assuming that you have a dimer template to use Example: try: T0228.try15-opt2.pdb template: 1qpoA 1) pick an alignment of the target to the template, and then duplicate the chain of the template to make a dimer "alignment". A favorite is to use the alignment from the 3trk model: aa.str2.CB_burial_14_7 with weights 0.4 on the last two tracks. cd T0228/1qpoA/ cp T0228-1qpoA-t04-global-str2+CB_burial_14_7-0.4+0.4-adpstyle5.a2m \ 1qpo.2mer-a2m edit T0228/1qpo/1qpo.2mer-a2m, making a copy of 1qpoA, calling the copy 1qpoB 2) create the undertaker script to: read in try15-opt2.pdb, AddId the two template chains, read the alignment that you created in step (1), Print out the dimer. See the example T0228/make-dimer-try15.under 3) run undertaker with the script from step (2) /projects/compbio/programs/undertaker/undertaker < make-dimer-try15.under ------------------------------------------------------------ Date: Mon, 26 Jul 2004 21:27:22 -0700 From: Kevin Karplus Subject: starting up new targets I won't be able to start the next three and possibly next six targets. To start a target: 1) cd ~karplus/casp6 2) scripts/new-target T0250 3) cd T0250 4) (make -k >& make.log ; gzip -9f *.log *.atoms) & 5) edit casp6/status to include new target in the right place 6) edit T0250/README to add DUE date 7) if your umask is not set right, or just to be sure, do "fixmode T0250" and "chmod -R g+w T0250" ------------------------------------------------------------ Date: Mon, 2 Aug 2004 From: Sol Katzman Subject: merging domains in DeepView -- simple monomers Example: T0219.try15.pdb full model residues 1-330 T0219.243-330.pdb model of domain residues 243-330 Want to use the domain model to replace residues 243-330 in the full model. Start DeepView and load the two models (if you get a dialog box during open, make sure you load only 1 model from each file) File..OpenPDBFile.. T0219.try15.pdb (will become layer 1) File..OpenPDBFile.. T0219.243-330.pdb (will become layer 2) In Wind:LayersInfo, select Layer 1 (try15) In Wind:ControlPanel, select residues 1-242 by click,shift-click first and last desired residue names (in the "Group" column) In Wind:LayersInfo, select Layer 2 (243-330) In Wind:Main Select..All verify that in Wind:LayersInfo, the correct number of residues is shown selected for each of the two layers (the "Sel" column) In Wind:Main Edit..CreateMergedLayerFromSelection In Wind:LayersInfo, select Layer 3 (_merge_) In Wind:Main Edit..RenameCurrentLayer.. should be "_merge_", change it to T0219.dv.merge To be safe, now you can delete the non-merged layers: In Wind:LayersInfo, select try15, press DELETE key (not Backspace) In Wind:LayersInfo, select 243-330, press DELETE key (not Backspace) Now you can manipulate the merged layer, typically by selecting the domain residues (click,shift-click residue names in Wind:ControlPanel), and toggle movement control (in Wind:Main, under movement icons) from "Move All" to "Move Selection". When you are finished moving, you can save the merged layer: File..Save..Layer.. T0219.dv.merge.pdb The atoms and residues will be numbered properly. (If running DeepView on a PC, fix the newlines to unix format). If you want to be nice to rasmol users, then edit the file, delete the records near the end that start with SPDBV, and add a record near the top (after the REMARKs) like this: EXPDTA dv.merge try15 1-242 + domain try1 243-330 --------------------------------------------------------------------- Date: Sat, 14 Aug 2004 From: Sol Katzman Subject: merging domains in DeepView -- dimers/trimers If you have dimerized or trimerized your subdomains and want to merge them into a chimeric dimer, you follow the same procedure as for monomers to create a _merge_ layer. This means that before merging, you have to select the proper residues from ALL chains in each layer. In the DeepView control panel, this can be hard to do with the shift,control,command keys. An easier alternative is to use the "v" column in the control panel, since each click in that column just toggles on/off the residue in question (shift-click toggles the whole group). Then you can use the main menu to "select visible". Okay, so you have merged selected residues from all chains in your subdomains. Manipulate the result until it looks like you want it. The next problem is to save it in the right order. If you simply Save..Layer, you will get the order from the merge, which when I did it for a trimer, gave me domain 1, chains A,B,C followed by domain 2, chains A,B,C. Even though all the ATOM records have their correct chain, when I opened this in rasmol, it thought there were 6 (not 3) chains, because it saw 5 changes in chain id. A workaround for this is to save each chain as a separate file, then read them back in and merge again. To save chain A, in the control panel, click on any "A" in the leftmost column, then Save..SelectedResidues. Similarly for chain B and chain C. Now read back in the files that you just saved and each will go into a separate layer. For each of those layers, Select..All. For all the other layers, Select..None. Now Edit..CreateMergedLayerFromSelection. Switch to the new _merge_ layer in the LayerInfo window, then Edit..RenameCurrentLayer, and (finally) Save..Layer. In this output, all the ATOM records for chain A come first, with consecutive numbering of the atoms, followed by all the ATOM records for chain B, etc. If you did this on a PC, do not forget to convert to the unix newline format. In emacs, visit the file, then: C-x RET f unix RET And add an EXPDTA record with the name of your multimer chimera. Quick summary of the above: 1) Select desired residues in ALL chains of each subdomain. (trick is to use controlpanel "v" column and Select..visible) 2) Merge selected (Edit..CreateMergedLayerFromSelection.) 3) Manipulate your multimer chimera. 4) Select chain A only. Save..Selected name: T0205.A.d1.try5.d2.try8.dv1.pdb 5) Select chain B only. Save..Selected name: T0205.B.d1.try5.d2.try8.dv1.pdb 6) repeat step 4 for chain C... 7) Select..None for all layers (or delete them all) 8) Open..PDB T0205.A.d1.try5.d2.try8.dv1.pdb 9) Select..All 10)Open..PDB T0205.B.d1.try5.d2.try8.dv1.pdb 11)Select..All 12) repeat steps 9,10 for chain C... 13) Merge the chains into a new layer (Edit..CreateMergedLayerFromSelection.) 14) go to the new merged layer (LayerInfo window) 15) Save..Layer name: T0205.ABC.d1.try5.d2.try8.dv1.pdb 16) convert to unix newlines 17) add EXPDTA record --------------------------------------------------------------------- Date: 11 Aug 2004 From: Sol Katzman Subject: prettyscore I wrote a little script to reformat the score-all.rdb file so that: a) you can fit all the output on a wide screen without wraparound b) you can read the cost factors directly above the cost columns c) you can see the cost weights directly above the cost columns This makes it easier to compare results and see how to tweak the cost function weights to get different results. A truncated example is below. For the full monty, see one of these: T0243/score-all.unconstrained.pretty T0251/score-all.try8.pretty As you can see, I cut down the precision to the bare minimum. The script is run as follows: ~/casp6/scripts/prettyscore < score-all.try8.rdb > score-all.try8.pretty There are a couple of options to override some of the script shortcuts. To see what they are do this: ~/casp6/scripts/prettyscore -help ---------------------------------------------------------------------------------- wet6 near way_ dry5 brea pred know hbon hbon hbon hbon cost .5 _bac back k _alp n_ss d_ge d_ge d_ge d_ge kbon ha04 bond om om_b om_b om_b e ackb eta eta_ one pair WEIGHTS--> 10.0 3.0 3.0 10.0 50.0 7.0 1.0 1.0 1.0 5.0 10.0 *try10-opt2.pdb.gz 0.6 1.6 1.7 3.0 0.1 -0.1 -8.3 -0.5 -0.3 -0.1 -0.1 187.60 *try10-opt1.pdb.gz 0.6 1.6 1.7 3.0 0.2 -0.2 -8.3 -0.4 -0.3 -0.1 -0.1 190.55 *try10-opt2.repack- 0.6 1.6 1.7 3.2 0.1 -0.1 -8.3 -0.4 -0.3 -0.1 -0.1 192.38 *try8-opt2.pdb.gz 0.6 1.6 1.7 3.0 0.2 -0.2 -8.3 -0.4 -0.3 -0.1 -0.1 193.70 *try6-opt2.pdb.gz 0.6 1.6 1.7 3.0 0.2 -0.1 -8.3 -0.4 -0.3 -0.1 -0.0 194.62 *try7-opt2.pdb.gz 0.6 1.6 1.7 3.1 0.0 0.3 -8.3 -0.5 -0.3 -0.0 -0.0 194.88 --------------------------------------------------------------------------------- Date: Sat, 14 Aug 2004 16:14:04 -0700 From: Kevin Karplus Subject: putting domains back together I've never used DeepView to put domains back together. What I have done is superimpose the domains either on each other (if they overlap) or on a model that has the domains in roughly the right places, using a script very similar to the superimpose-best.under script. In fact, that is what I'd do first---copy superimpose-best.under and change the ReadConformPDB commands to read in the domains you are interested in. Change the PrintAllConformPDB command to print to a different pdb file (not in decoys!). You can specify atoms in the overlap regions for the printAllConformPDB command, to make them be the focus of initial superpositions. You can add some full-length models to try to get overall positioning of the domains, but this is likely to hurt more than it helps if the full-lenght model is bad. >> Sol: As of 8/25/04, if you are merging more than two domains, >> the first ReadConformPDB must be for a domain that has >> residues in common with all others. Kevin may fix undertaker >> in the future to relax this requirement. >> Also, you can now specify residues (not just atoms) of >> the overlap region in the PrintAllConformPDB command. If the resulting models seem to fit against each other reasonably, then cut and paste the ATOM records to make a single chimeric model. This model can then be reoptimized in a polishing run that starts just from that model. If superposition with undertaker doesn't get you a reasonable positioning, then you have a few choices: 1) try any non-overlapping positioning of the domains and try using undertker to pack them, relying on OptSubtree mainly to reduce the gaps 2) do the positioning manually in deepview. (Jenny and Sol both have experience with this.) ------------------------------------------------------------------ Date: Sat, 7 Aug 2004 17:33:08 -0700 From: Kevin Karplus Subject: printing sheet constraints from any model To create sheet constraints from a model, you read the model into undertaker with a ReadConformPDB command, then do a PrintConformSheets command. I generally have a number of these in the superimpose-best.under script, so that I can get sheet constraints from the initial alignments. Sometimes I add the robetta models when I want constraints from them. ------------------------------------------------------------ Date: Mon, 16 Aug 2004 20:13:35 -0700 From: Kevin Karplus Subject: Getting SheetConstraints from an existing PDB file You can get SheetConstraints for an existing protein, but then you have to translate all the residue names and numbers. Do ReadconformPDB /projects/compbio/data/pdb/1ihr.pdb.gz chain A PrintConformSheets 1ihrA.sheets You'll need some of the initial stuff (as in the casp6/starter_directory/superimpose-best.under script) to have appropriate libraries and such before you can do the ReadConformPDB. You shouldn't have ReadTargetSeq or ReadTargetPDB command before the ReadConformPDB (unless it happens to be for the identical sequence). Kevin ============================================================ Date: Wed, 18 Aug 2004 12:46:11 -0700 From: Kevin Karplus Subject: How to close stubborn breaks In some of the comparative models I've been working on (like T0279) I had some stubborn breaks that just would not close. I finally found a trick that worked great. I added distance constraints like constraint N213.C T214.N 1.3 1.329 1.4 1.0 constraint N213.CA T214.CA 3.6 3.803 4.0 1.0 for the worst breaks. The first constraint is the bond distance for the peptide bond, the second is the normal CA-CA distance (both are based on the second residue NOT being a proline, the numbers are slightly different before a proline). I got the numbers from the prototypical backbone unit that is printed out at the beginning of every .log file from undertaker when the training set is read in. The constraints have very little slop in them, so provide very strong "force" when violated. Also, listing them as constraints provides information to some of the conformation-change operators, which can assign more importance to getting these gaps to close. Cutting and pasting the following lines may make editing easier: constraint X.C Y.N 1.3 1.329 1.4 1.0 constraint X.CA Y.CA 3.6 3.803 4.0 1.0 Before a proline, the distances should be constraint X.C P.N 1.3 1.335 1.4 1.0 constraint X.CA P.CA 3.6 3.819 4.0 1.0 Sat Aug 21 19:57:41 PDT 2004 Kevin Karplus The gap-closing constraints above are very strong, but won't necessarily fix the gaps the way you want. In many cases they'll simple move the gap 3 or 4 residues over, so that it falls outside the added constraints, without actually closing it. It may also move the gap in a different direction from what you would choose, if you were trying to move the gap into a loop region. ------------------------------------------------------------ Date: Sat, 21 Aug 2004 19:55:56 -0700 From: Kevin Karplus Subject: Re: Excluding a template George asked if there was to exclude a specific template. Sure, you exclude templates by not including them. That is, instead of using all-align.a2m, use the alternative syntax in the .under files of using "include read-alignment-scwrl.under" lines. (See any of the .under files, it is included in the try1.under files in a commented-out form.) You may need to run "make read_alignments" to make sure you have all the read-alignment-scwrl.under files you need. ------------------------------------------------------------ Date: Thu, 02 Sep 2004 11:05:12 From: George Shackelford Using gromacs [ Sat Jan 22 21:50:28 PST 2005 Kevin Karplus THESE INSTRUCTIONS NOW OBSOLETE. To get a gromacs optimization of decoys/foo.try99-opt1.pdb.gz just say 'make decoys/foo.try99-opt1.gromacs0.pdb.gz' or 'make decoys/foo.try99-opt1.gromacs4.pdb.gz' to optimize with forcefield 0 or 4. ] GROMACS is a collection of programs for doing simulated molecular dynamics. It is possible for it to handle simulations of proteins in water by simulating the water molecules; it is that fast. I have written a simple script, 'dogromacs.sh,' for running a energy minimization. It will undo clashes and 'fix' short backbone breaks while trying to resolve sidechains. It has been used in place of SCWRL during the polishing stage with mixed results. It scores very badly but can be used as a template to input to SCWRL or Rosetta for re-packing. To use it, you'll need to have gromacs installed. If requested, I will install a version in the /projects/compbio/bin. Then copy the 'gromacs' directory in the 'casp6' directory to your target directory (i.e., T0200/gromacs). Switch to your gromacs directory and enter: dogromacs.sh TXXXX.tryNN substituting your target and try. This assumes that there is a decoys/TXXXX.tryNN-opt2.pdb.gz file available. # While running it will ask for two different inputs - # The first specifies the force field to use. # Answer the one that says: # OPLS-AA/L all-atom force field (2001 aminoacid dihedrals) # or whatever is closest to that answer (mine is '3'). # The second wants to know about the format of the output # enter '2' which should be "blah blah (Protein-H) blah" # This means do the protein but leave out the hydrogens. When it is finished running there should be a decoys/TXXXX.tryNN-opt2.gromacs.pdb.gz file. Take a look and make sure it looks ok. I hope to do a script that will do a simulated annealing - heating the protein up, 'cooking' it at 300 K. for several picoseconds, and minimizing again. This may produce some decent results. 13 Sept 2004 Kevin Karplus You have to edit the GROMACS output to change the illegal 'O1 ' and 'O2 ' atoms at the end to 'O ' and 'OXT' respectively. Sun Jan 16 10:56:02 PST 2005 Kevin Karplus I've written casp6/scripts/fix-gromacs that does the necessary atom renaming for gromacs. In addition to the O1 and O2 bug above, there is also a misnaming of CD1 in ILE residues as CD. Tue Jan 18 09:29:51 PST 2005 Kevin Karplus CAVEAT: gromacs renumbers the chain, so has to be used with caution when the residues are not numbered sequentially from 1. Sigh, gromacs reads in a PDB file, why can't it remember the PDB numbers? Sat Jan 22 21:52:23 PST 2005 Kevin Karplus I wrote a run-gromacs script and used it in Make.main to create the gromacs0 and gromacs4 files. You no longer need to create a gromacs directory---the script creates a temporary directory and cleans up afterwards. ============================================================ Sun Sep 19 09:52:00 PDT 2004 Kevin Karplus Subject: Evaluating our predictions I've added some new make commands to Make.main so that crude evaluation of our predictions will be fairly straightforward. We'll probably need to add more for detailed evaluation. To use the new make commands, you need to add two lines to the Makefile, illustrated here by the values for T0198: REAL_PDB:=1sumB FINAL_COSTFCN:=try23 Then "make decoys/evaluate.rdb decoys/evaluate.pretty" This will produce a file very similar to the score-all.try23.rdb file, but with extra columns for rmsd and rmsd_ca. This evaluation is only on whole-chain rmsd, so is not as good as the "find the best submodel" evaluations used by the CASP assessors. I'm thinking about adding some more "cheating costfcn" functions to undertaker for doing such evaluations, but have not written them yet. We may also want to do a superposition of our top model(s) with the real structure, which can easily be done by a modification of superimpose-best.under. Note: don't change superimpose-best.under itself, as it was used to generate the submissions, and I don't want to create bad dependencies between files! ------------------------------------------------------------ Date: Wed, 22 Sep 2004 05:00:52 -0700 From: Kevin Karplus Subject: GDT evaluation of casp6 targets I implemented in undertaker a version of the "GDT" cost function that is used by the CASP assessors. It took less than 200 lines of code, including comments, blank lines, .h files, and interface to the CostFcn mechanism. It is a little slow, since it does hundreds of transformations and distance computations per conformation that it evaluates. The idea of GDT is to try many different superpositions of the prediction on the actual structure. For each superposition, measure the distances for each pair of CA atoms, and sort these distances. The superposition has k CA atoms superimposed <= the kth distance. Take the minimum over all transformations tried---this gives the tightest superposition of k CA atoms found for each k. The GDT score is the average percentage of the CA atoms that can be superimposed to <=1, <=2, <=4, or <=8 residues. A pretty good fold-recognition result will have around 60%, a crummy new fold around 20%, and an easy comparative model around 95%. The GDTcost costfcn reported by undertaker is just the negative of the GDT score. Sometime soon I'll also provide output for making GDT plots, which are useful for comparing models and seeing how much was really right. The outputs of the evaluations for all the targets that are already in PDB are being computed now. The results can be found in casp6/T*/decoys/evaluate.rdb (Note: there is an evaluate.pretty, but it is less useful, as the rmsd and rmsd_CA fields get multiplied by 0 weights, so disappear.) -------------------------------------------------- Date: Thu, 23 Sep 2004 11:26:43 -0700 Subject: GDT curves I now have a feature in undertaker to generate GDT plots (well, data files for gnuplot to produce plots). I implemented a version of the GDT scoring function in undertaker yesterday, and noticed that there were differences in score bewtween chains that I knew were identical up to a rigid transformation and rounding error. I was worried that I wasn't sampling enough superpositions (though I was sampling so many that computing one GDT curve was taking 3 or 4 seconds for a 235-residue protein, and the time grows quadratically with the length of the protein). I decided, though, that the problem was not in generating the GDT curve, but in creating the summary statistic that looks at the average of the number of points that can be superimposed to <=1, <=2, <=4, and <=8 angstroms. The problem is that tiny differences can move a point to one side or the other of the threshold, making fairly large differences in the count. We can recast the GDT score function as a normalized sum of a goodness function sum_{1<=i<=num_residues} good(dist(i)) / (num_residues*good(0)) where dist(i) is the minimum over all superpositions tried of the ith largest distance between corresponding CA atoms, and good(a) = a<=1? 4: (a<=2? 3: (a<=4?2: (a<=8? 1:0))) We can define a smooth curve to replace good(a): sgood(a) = a>maxa? 0: (amaxa? 0: (a To: karplus@soe.ucsc.edu, sol@soe.ucsc.edu, ggshack@soe.ucsc.edu, learithe@soe.ucsc.edu, martina@soe.ucsc.edu, bbarnes@ucsc.edu, marcias@ucsc.edu, rph@soe.ucsc.edu Subject: GDT curves fixed I found a small bug in the GDT curves produced by undertaker yesterday, and fixed it this morning. You can see a good example with cd casp6/T0229/decoys gnuplot load "../../evaluate.gnuplot" This plot file is rather generic---it does not provide a specific title that mentions the target, and the key may need to be moved around to avoid conflict with the cruves for some targets. ------------------------------------------------------------ Tue Jul 5 21:41:03 PDT 2005 Kevin Karplus I fixed up Tim Dreszer's implementation of "clens" the contact cost function for undertaker. It correlates (somewhat non-linearly) with gdt and smooth gdt. Here are a few of the outliers to analyze to see which is better: clens surprisingly low name length missing_atoms cost clens c3lens rmsd rmsd_ca GDT smooth_GDT real_cost T0246.try2-opt1.pdb.gz 354 0 0 0.1992 0.1609 2.6058 2.2450 -73.0932 -68.4466 -68.4466 T0246.try7-opt2.repack-nonPC 354 0 0 0.2998 0.2651 5.0012 4.6052 -66.1723 -61.7544 -61.7544 T0268.try3-opt1.pdb.gz 285 0 0 0.3173 0.2844 3.6833 2.9610 -71.7082 -66.9435 -66.9435 T0264.try7-opt2.pdb.gz 294 0 0 0.6542 0.6403 7.3723 6.5638 -42.3875 -40.1914 -40.1914 T0249.try1-opt1-scwrl.pdb.gz 209 0 0 0.5845 0.5688 15.8796 15.4277 -31.6358 -30.2987 -30.2987 (T0249)robetta-model2.pdb.gz 209 0 0 0.7156 0.7056 15.2886 14.4983 -23.4568 -22.3325 -22.3325 T0199.try7-opt1-scwrl.pdb.gz 338 0 0 0.7971 0.7878 27.6570 27.0903 -20.2786 -19.2020 -19.2020 T0199.try9-opt2.pdb.gz 338 0 0 0.8052 0.7963 27.6092 27.0085 -20.4334 -19.2182 -19.2182 clens surprisingly high name length missing_atoms cost clens c3lens rmsd rmsd_ca GDT smooth_GDT real_cost (T0229_1)robetta-model5.pdb.gz 138 0 0 0.9116 0.9139 5.9076 3.8014 -60.4167 -54.7948 -54.7948 (T0222_2)robetta-model3.pdb.gz 373 0 0 0.7210 0.7136 4.7750 3.4980 -66.7969 -62.4880 -62.4880 (T0229_1)align2 138 85.0000 0.0000 0.5528 0.5297 3.4227 1.9082 -77.0833 -72.5546 -72.5546 T0229.try2-opt1.pdb.gz 138 0 0 0.4871 0.4739 1.8103 0.9245 -94.7917 -89.0830 -89.0830 T0229.try6-opt2.repack-nonPC 138 0 0 0.3456 0.3091 1.6930 0.8645 -94.7917 -90.5837 -90.5837 The unusually high clens values are mostly from domain T0229_1, which is only 24 residues, with 3 short beta strands. The small size makes GDT values a little generous and the small number of contacts makes the clens values a bit strict. There is also a high-clens model from T0222_2, which is a 64-residue domain. The basic shape of T0222_2 is correct in robetta-model3, but one loop os folded wrong and makes a number of incorrect contacts.