Mon Mar 31 17:51:41 PDT 2008 Kevin Karplus

This will be the directory for hand predictions and for model quality
assessment results.

CONTENTS:
    Group names and identifiers
    Directory structure
    When the moai cluster gets wedged, what can we do?
    Handling a homodimer (or homotrimer)
    Building QA targets
    Using undertaker as a meta-server
    Analyzing the CASP8 results

Group names and identifiers

HUMAN groups:

    SAM-T08-human   4008-1775-0004
        is the human prediction group for TS predictions only.
        Who should sign up: anyone who actually works on hand
        predictions this spring or summer.

    SAM-T08-MQAO    3724-8702-5528
        MQA server using alignment only
        Additional members: Martin Paluszewski

    SAM-T08-MQAU    7072-3475-1278
        MQA server using alignment and other costfcn
        Additional members: Martin Paluszewski and John Archie

    SAM-T08-MQAC    not registered yet
        MQA server using alignment/undertaker/consensus
        Additional members: John Archie

    The MQA groups should download a tarball and run scripts to
    evaluate the models for *all* targets, not just the
    human-prediction subset. It is possible to create an MQA server,
    but I don't think that the effort is worth it, since we need to
    run most of a prediction before we can do the MQA computation.

SERVER groups:

    SAM-T02-server  7768-4665-5533
        No additional members.

    SAM-T06-server  6290-5691-1801
        Additional members: George Shackelford

    SAM-T08-server  7957-1341-9349
        Additional members: George Shackelford and Grant Thiltgen

    SAM-T08-2stage  9060-0201-6142
        Additional members: George Shackelford

    SAM-T08-MQAC    2165-1648-9790
        Consensus MQA.
        Additional members: John Archie

Directory structure

Fri Apr 4 15:08:43 PDT 2008 Kevin Karplus

The targets will each have their own directory, starting with T0387.
The directory starter-directory/ has the seeds for producing new
predictions.

Within T0xxx, there will always be a subdirectory "decoys".
Underneath decoys there will be subdirectories for server predictions:

    SAM_T06     all files created for SAM-T06-server
    SAM_T08     all files created for SAM-T08-server
    server      the contents of the tarball from CASP of all servers

For those predictions identified as ones for hand prediction, the
T0xxx directory will also have all the work we do on the prediction,
which will include "metaserver" models built starting from the server
models, as well as our own models.

When the moai cluster gets wedged, what can we do?

Mon Apr 14 18:04:09 PDT 2008 Kevin Karplus

To look for phantom batches that are running but have lost their
controlling directories (probably because the web server thought that
the job was finished), run
    pcem/scripts/find-phantom-batches
on moai.

To find out what farmer jobs are running on what machines, run
    do-all-condor-cluster 'ps -fwu farmer |grep query'
on moai. If some of the jobs there need killing, the process ids are
the two numbers after "farmer".

Handling a homodimer (or homotrimer)

Mon May 5 14:33:46 PDT 2008 Kevin Karplus

Some targets (like T0387) are homomultimers. Optimizing them as such
may help get the monomers right. If the instructions below get
confusing, try looking at T0387/dimer as an example.

This method assumes that you have a pretty good monomer that you want
to dimerize based on a template with an existing dimer and then
optimize. It is not intended for creating dimers from scratch.

1) run "make make_dimer"
   which creates
        a subdirectory "dimer/",
        a target a2m file (double length),
        a Makefile (with MONOMER_LENGTH set),
        a costfcn-init.under file (with KnownBreak added, and
            constraint sets coming from ../ files)
        all the local-structure .rdb files (doubled, with renumbering)

   Mon Jun 9 15:37:59 PDT 2008 Kevin Karplus
   I fixed the bug in generating dimer/costfcn-init.under, so that it
   should now have a KnownBreak command at the end with the first
   residue of the second copy, including the one-letter amino-acid code.
   I also added a make_trimer target which does essentially the same
   thing as make_dimer, but with a tripled copy.

2) In the dimer directory, do "make make-dimer.under" to get an
   initial dimerization script, which you will have to edit.
   You will need to replace the 1xxxA words with the monomers of your
   template, and the YYYY words with the name of the model you want to
   dimerize. This script needs to have a properly dimerized template
   to copy the positioning from and a monomer to dimerize.

3) Create an alignment file that has the target and copies of the
   best alignment. For example, for T0284, we have
   T0284/1mumA/1mumA.dimer-a2m modified from
   T0284-1mumA-t04-local-str2+CB_burial_14_7-1.0+0.4+0.4-adpstyle5.a2m :

>T0284 PA4872, Pseudomonas aeruginosa PAO1, 287 res
MHRASHHELRAMFRALLDSSRCYHTASVFDPMSARIAADLGFECGILGGS
VASLQVLAAPDFALITLSEFVEQATRIGRVARLPVIADADHGYGNALNVM
RTVVELERAGIAALTIEDTLLPAQFGRKSTDLICVEEGVGKIRAALEARV
DPALTIIARTNAELIDVDAVIQRTLAYQEAGADGICLVGVRDFAHLEAIA
EHLHIPLMLVTYGNPQLRDDARLARLGVRVVVNGHAAYFAAIKATYDCLR
EERGAVASDLTASELSKKYTFPEEYQAWARDYMEVKE
>1mumA
sl------HSPGKAFRAALTKENPLQIVGTINANHALLAQRAGYQAIYLS
GGGVAAGSLGLPDLGISTLDDVLTDIRRITDVCSLPLLVDADIGFGsSAF
NVARTVKSMIKAGAAGLHIEDQVGAKRCGHrPNKAIVSKEEMVDRIRAAV
DAKTDPDFVIMARTDALAvEGLDAAIERAQAYVEAGAEMLFPEAITELAM
YRQFADAVQVPIlaNITEFGATPLFTTDELRSAHVAMALYPLSAFRAMNR
AAEHVYNVLRQegtqksVIDTMQTRNELYESINYYQYEEKLDNL------
farsqvk
>1mumB
sl------HSPGKAFRAALTKENPLQIVGTINANHALLAQRAGYQAIYLS
GGGVAAGSLGLPDLGISTLDDVLTDIRRITDVCSLPLLVDADIGFGsSAF
NVARTVKSMIKAGAAGLHIEDQVGAKRCGHrPNKAIVSKEEMVDRIRAAV
DAKTDPDFVIMARTDALAvEGLDAAIERAQAYVEAGAEMLFPEAITELAM
YRQFADAVQVPIlaNITEFGATPLFTTDELRSAHVAMALYPLSAFRAMNR
AAEHVYNVLRQegtqksVIDTMQTRNELYESINYYQYEEKLDNL------
farsqvk

4) In the dimer directory, make try1.costfcn, or copy a costfcn from
   the parent directory and edit it.

   If you want any constraints on the optimization, it is necessary
   to make multiple copies in the cost function, renumbering the
   constraints in the later chains (a real pain).
   Alternatively, you can compute the constraints only on the first
   monomer. If the monomers are identical, this should not cause any
   problems.

Once you have an acceptable dimer, you want to optimize it, keeping it
dimerized in roughly the same orientations.

If you read in a dimer with ReadConformPDB, be sure to mark it as a
dimer by following the read command with
    Multimer 2
as a separate command to label the dimer as a cyclic dimer.
Note: if the multimer is *not* cyclic then *don't* label it, as
undertaker will try to symmetrize it.

You can do the optimization as usual, but use "multimer 2" in the
OptConform arguments. Any alignments (for fragments and the like) can
be gotten from the original monomeric runs. You probably want to
reduce the duration of the run (by reducing num_gen, gen_size,
super_iter, and/or super_num_gen), because multimeric runs take longer
than monomeric ones.

You can also read the Template.atoms file from the monomeric
directory, avoiding duplicating that file.

You might want to turn off TweakMultimer at first if you are trying to
pack a tight interface, as it will tend to move monomers apart to
reduce clashes. But if you have a loose interface, you definitely
want TweakMultimer on to try to tighten up the interface.

It may be necessary to add some inter-chain constraints to hold the
dimer together. Even without TweakMultimer on, undertaker may find a
way to alleviate clashes by moving parts of the dimer away from each
other as it did in try1 (of T0284/dimer).

Note: you don't always want "multimer 2" for a dimer or "multimer 4"
for a tetramer. What the command (or option to OptConform) does is
force the creation of a cyclic multimer. That is, the transform that
takes A to B will take B back to A for a dimer, or
    T(A->B) = T(B->C) = T(C->D) = T(D->A)
for a tetramer. Not all multimers are cyclic!

You can still optimize non-cyclic multimers in undertaker, but you
must *not* use the multimer command or option to OptConform.
This will cause each chain to be separately optimized but the
"OptSubtree" method will tend to rearrange the transformation between
chains.

You can optimize a mixture of cyclic and non-cyclic dimers in
OptConform if they are initially labeled with Multimer commands and
OptConform has no "multimer" keyword (or, equivalently, "multimer 0").
If OptConform has "multimer 2" set, then all multimers will be set to
be cyclic dimers.

Note: you can do optimization of a tetramer with symmetry S_{2,2} by
telling OptConform to use "multimer 2". You don't get the full
symmetry, but you will get some symmetry: chain A and chain B will be
independently optimized, but chain C and chain D will be copies of
chains A and B and T(AB->CD) = T(CD->AB).

NOTE: gromacs doesn't like big chain breaks, and it will not see the
multimer merged into a single chain as two chains. To get gromacs to
optimize a multimer, you need to unpack the multimer into separate
chains:
    cd casp7/T0332/dimer
    make decoys/T0332.try2-opt2.unpack.pdb.gz decoys/T0332.try2-opt2.unpack.gromacs0.pdb.gz

You can get this to happen for you automatically if you use
    cd casp7/T0332/dimer
    (make T0332.mult2 >& do2.log; gzip -9f do2.log)&
instead of the monomer version
    (make T0332.do2 >& do2.log; gzip -9f do2.log)&

Sat Jul 1 13:35:27 PDT 2006 Kevin Karplus

I made a small change to undertaker, adding
    force_alignment
    fragment_only
options to ReadFragmentAlignment, so that I could force undertaker to
treat the short fragments as being a complete alignment or not being
treated as an alignment at all (just fragments). If neither option is
provided, then it is added to the alignment library only if it is
multiple fragments or a sufficiently long single fragment (something
like half the total protein length).

For multimers, you can include force_alignment in the
ReadFragmentAlignment command that specifies the multimer, to avoid
losing an alignment that has only a short piece aligned to show what
corresponds.
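
The default rule described above for ReadFragmentAlignment can be
sketched in Python. This is only an illustration of the rule as stated
in these notes, not undertaker's code: the function name, its
arguments, the exact 0.5 cutoff (the notes say only "something like
half"), and the treatment of the two options as mutually exclusive are
all assumptions.

```python
def treat_as_alignment(fragment_lengths, protein_length,
                       force_alignment=False, fragment_only=False,
                       min_fraction=0.5):
    """Decide whether a set of fragments goes into the alignment library.

    fragment_lengths: lengths (in residues) of the aligned fragments.
    protein_length:   total length of the target (doubled for a dimer).
    min_fraction:     assumed cutoff; the notes say "something like half".
    """
    if force_alignment and fragment_only:
        # Assumption: the two options contradict each other.
        raise ValueError("force_alignment and fragment_only both set")
    if force_alignment:
        return True          # always treat as a complete alignment
    if fragment_only:
        return False         # never treat as an alignment, just fragments
    if len(fragment_lengths) > 1:
        return True          # multiple fragments count as an alignment
    # A single fragment counts only if it is sufficiently long.
    return (len(fragment_lengths) == 1
            and fragment_lengths[0] >= min_fraction * protein_length)
```

For example, a single 10-residue fragment on a 287-residue target is
kept as a fragment only under this rule, which is why a short
"what corresponds" alignment for a multimer needs force_alignment.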
Sun Jun 1 16:45:25 PDT 2008 Kevin Karplus

The script in T0413/dimer/make-dimer-chimera-try12.under shows how to
use an existing dimeric model to build a dimeric model from a
different monomer. A self-alignment file is needed, which can either
be the whole monomer (as in this example), or an alignment of just
residues in the dimeric interface.

Tue May 20 09:14:00 PDT 2008 John Archie

Building QA targets can be done with
    % make qa_all
which creates, in addition to some intermediate evaluation files,
three files
    SAM-T08-MQAO.qa1 - the QA file using only the alignment-based
                       constraints
    SAM-T08-MQAU.qa1 - the QA file using all undertaker cost functions
                       (including the alignment based constraints)
    SAM-T08-MQAC.qa1 - the QA file for all undertaker cost functions
                       and a consensus term

SAM-T08-MQAC.qa1 is likely the most reliable quality assessment
method; however, given the number of consensus-based methods in CASP8,
SAM-T08-MQAU.qa1 has a chance of being the best of our methods.

By default, "make qa_all" will try to pull files (alignments, neural
net predictions, etc) from decoys/SAM_T08/; to change this, use the
macro QA_PREDICTDIR. This option may be especially useful if the files
in the human-prediction directory might be more accurate:
    % make QA_PREDICTDIR=`pwd` qa_all
Note that QA_PREDICTDIR may be used in contexts where the current
working directory is not the directory in which make was invoked, so
please do not use relative path names.

The QA files may be submitted with
    % make mail_qa_all
At the moment, only John A and Kevin have permission to submit.

Using undertaker as a meta-server

Wed May 21 10:44:47 PDT 2008 Kevin Karplus

We can use the server models as starting points for further
optimization. First, make the MQA assessments as above (with "make
qa_all"). Then create scripts for reading in the top 10 models
according to each assessment method with "make under_qa_all".
This creates
    SAM-T08-MQAC.read_under
    SAM-T08-MQAU.read_under
    SAM-T08-MQAO.read_under

I don't plan to use the MQAO selections, but the MQAU and MQAC ones
might be interesting, so I plan to do an optimization from each of
those sets separately.

Mon Jun 9 21:23:04 PDT 2008 Kevin Karplus

The under_qa_all make target also makes
    metaserve-MQAC1.under
    metaserve-MQAU1.under
optimization scripts for the two sets. The script is currently set up
to optimize the try1 costfcn, but for many targets we have already
found a better costfcn, so the scripts should probably be edited to
change the costfcn.

To run the scripts, you can do 'make run_metaservers', but it might be
better to do
    (make meta_MQAU1 >& MQAU1.log; gzip -9f MQAU1.log)&
    (make meta_MQAC1 >& MQAC1.log; gzip -9f MQAC1.log)&
each on a separate machine.

If the workstations are busy, they can be sent to the cluster with
    para-trickle-make -quick '(make meta_MQAC1 >& MQAC1.log; gzip -9f MQAC1.log)'
    para-trickle-make -quick '(make meta_MQAU1 >& MQAU1.log; gzip -9f MQAU1.log)'

Rescoring all models with one of the cost functions can be done with
"make decoys/score-all.try2.pretty" (replacing try2 with the name of
the desired costfcn file).

Analyzing the CASP8 results

Fri Nov 7 10:52:29 PST 2008 Kevin Karplus

model1-evaluate.rdb are the full-length model evaluations, sorted by
GDT on the whole chain (not domains). Note that model1 is either the
SAM-T08-human model (for human-predicted targets) or SAM-T08-server
(for server-only targets). It was created by using
    grep ^model1.ts */decoys/evaluate.rdb | sort -g +16
then pasting in the rdb header from an evaluate.rdb file, editing the
file to replace "/decoys/evaluate.rdb:" with a tab and editing the
header to have a "target" column first.

real_cost and GDT can be seen to be highly correlated across different
targets, but with a few outliers.

The easiest target (for us) was T0458, with GDT score of 96.2%.
The hardest target was T0430, with GDT score of 5.67%.
(Oops, T0430 has the wrong REAL_PDB id. I'll redo the evaluation with
the right id.)

Fri Nov 7 12:05:33 PST 2008 Kevin Karplus

I'm redoing evaluations for
    T0492   typo in name
    T0390   wrong chain of PDB model
    T0420   typo in Makefile
    T0430   wrong PDB file (reason unknown)

Fri Nov 7 13:13:09 PST 2008 Kevin Karplus

The hardest models for us were actually T0514 (GDT=18.1%) and T0466
(GDT=20%, but real_cost=320.6).

There were good models for T0514 (GDT 50.5% for pro-sp3-TASSER_TS2),
and we would have gotten
    SAM-T08-MQAC    39.5%   Zhang-Server_TS3
    SAM-T08-MQAU    50.5%   pro-sp3-TASSER_TS2
    SAM-T08-MQAO    17.7%   SAM-T08-server_TS2

T0466 was really harder, with the best server model being nFOLD3_TS5
(GDT 40.1%, real_cost 213.8). The best I submitted was MQAU1-opt3 (as
model 5) with a GDT of 30% and real_cost of 270.2.

Mon Nov 10 12:59:03 PST 2008 Kevin Karplus

I looked at the refinement models a bit today. It does not look like
I made any improvement over the initial models, nor were my model1
choices better than ones I favored less. I don't think I should waste
my time on refinement, as I obviously can't do it.

I also collected the best-evalues from the human predictions into an
RDB file "best-evalues". We knew that T0466 would be tough
(best-evalue 33.456), but there were some that looked tougher that
turned out to be not quite as hard (T0465 GDT=30.6% and T0496
GDT=23.4%).

Mon Nov 10 13:19:37 PST 2008 Kevin Karplus

Although a large E-value tends to indicate a poor model, this is not
invariable (T0471 has GDT 65% but E-value 1.12). And a really small
E-value does not guarantee a good model: T0487 has E-value 6e-78 but
only 33% GDT. The problem there is multiple domains, which are
individually predicted ok, but which aren't assembled perfectly.

Wed Nov 12 10:13:51 PST 2008 Kevin Karplus

I joined the model1-evaluate+evalue table with the SAM-T08-server
table, so that I could see whether I improved on the server.
It looks like there were 7 targets where I did substantially worse
than the server on GDT, and 17 where I did substantially better, so
the human input was worthwhile. The difference is even more striking
in real_cost.

But how much of that was really due to human input, and how much due
to metaservers? I probably need to extract the MQAC-recommended
server model and perhaps the Zhang-Server_TS1 model to see whether the
hand effort was worth anything.

Wed Nov 12 12:19:13 PST 2008 Kevin Karplus

    row e_value ne "" < model1-SAM-T08-Zhang.rdb | histogram-from-rdb -compute 'GDT - Zhang_GDT' -bin 1
indicates that my models averaged 1.29 worse GDT than Zhang's, but
    row e_value ne "" < model1-SAM-T08-Zhang.rdb | histogram-from-rdb -compute 'real_cost - Zhang_real_cost' -bin 1
indicates that my models averaged 1.66 better on real_cost.

Overall, this is pretty much a wash: there were a few models that I
did much better on and a few that the Zhang-Server did much better on,
but most were pretty close.

The models I did much better than the Zhang-Server_TS1 on GDT (10%
more on GDT) were
    T0394 (SAM-T08-server, since server-only)
    T0472

The models I did much worse (10% lower) than the Zhang-Server_TS1 on
GDT were
    T0462 T0468
though the SAM-T08-server did worse on
    T0398 T0400 T0404 T0408 T0432 T0486 T0512

For real_cost, the models I did much worse (70 points) than the Zhang
server were
    T0468
though the SAM-T08-server did worse on
    T0398 T0400 T0404 T0408 T0432 T0486 T0509 T0512
The ones I did much better on were
    T0472 T0492 T0495
though SAM-T08-server did better on
    T0394 T0504

Overall, it is looking like I made some improvements on the
Zhang-server, but not by a lot. I'll have to see how each of the MQA
methods did compared to my hand predictions, to see if I added
anything with all my hard work.

So far, only T0495 looks like I made any real improvement over the
server models.

For T0468, I got the topology of the sheet wrong, and the MQA methods
would have done much better.
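
For anyone without the row and histogram-from-rdb tools, the average
paired difference quoted in the Nov 12 entry (GDT - Zhang_GDT over the
rows that have an e_value) can be approximated with a short Python
sketch. It assumes the usual RDB layout (tab-separated, optional "#"
comment lines, a column-name line, then a column-format line); the
function names and the sample table are invented for illustration and
are not CASP8 results.

```python
def read_rdb(text):
    """Parse RDB text into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines()
             if ln and not ln.startswith("#")]
    names = lines[0].split("\t")
    # lines[1] is the RDB column-format line; data starts at lines[2].
    return [dict(zip(names, ln.split("\t"))) for ln in lines[2:]]


def mean_difference(rows, col_a, col_b, require="e_value"):
    """Mean of col_a - col_b over rows whose `require` column is non-empty."""
    diffs = [float(r[col_a]) - float(r[col_b])
             for r in rows if r.get(require, "") != ""]
    if not diffs:
        raise ValueError("no rows with a non-empty %s column" % require)
    return sum(diffs) / len(diffs)


# Invented toy table (NOT real results): the X2 row has no e_value,
# so it is skipped, like a server-only target.
sample = ("target\tGDT\tZhang_GDT\te_value\n"
          "10S\t5N\t5N\t8N\n"
          "X1\t50.0\t52.0\t1e-5\n"
          "X2\t40.0\t39.0\t\n"
          "X3\t30.0\t33.0\t2.0\n")

print(mean_difference(read_rdb(sample), "GDT", "Zhang_GDT"))
```

The same call with "real_cost" and "Zhang_real_cost" gives the
real_cost comparison; remember that lower real_cost is better, so the
sign reads the opposite way from GDT.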
Thu Nov 20 22:04:57 PST 2008 John Archie

I'm copying all of the old MQA RDB files out of the way to make way
for the evaluation versions, which include columns for the real cost
against the experimental structure:

    foreach t (T0???)
        mv $t/decoys/servers.evaluate.everything.rdb \
            $t/decoys/servers.evaluate.everything.rdb.casp
        mv $t/decoys/similarity.servers.evaluate.everything.rdb \
            $t/decoys/similarity.servers.evaluate.everything.rdb.casp
    end

Mon Dec 22 18:35:18 PST 2008 Kevin Karplus

Eyeballing the SAM-T08-server curves on
    http://predictioncenter.org/casp8/results.cgi
which ones are particularly good or bad?

    T0389_1 bad
    T0393_2 bad
    T0398_1 bad
    T0398_2 bad
    T0407_1 bad (SAM-T08-human still moderately bad, SAM-T06-server beats both)
    T0407_2 bad (SAM-T08-human still bad)
    T0419_1 bad
    T0419_2 bad
    T0462_2 bad
    T0476_1 bad
    T0482_1 bad
    T0487_4 bad (SAM-T08-human quite good)
    T0498_1 bad (SAM-T08-human still bad)
    T0501_2 bad
    T0512_1 bad
    T0513_2 bad
    T0514_1 bad

    T0393   moderately bad
    T0404_1 moderately bad
    T0432_1 moderately bad
    T0434_1 moderately bad
    T0448_1 moderately bad
    T0464_1 moderately bad (SAM-T08-human moderately good)
    T0468_1 moderately bad (SAM-T08-human bad)
    T0471_1 moderately bad (SAM-T08-human moderately good)
    T0477_1 moderately bad
    T0487_1 moderately bad
    T0504_3 moderately bad

    T0397_2 moderately good
    T0473_1 moderately good
    T0504_2 moderately good

    T0394_1 good
    T0395_1 good
    T0435_1 good
    T0443_2 good
    T0443_3 good
    T0446_2 good
    T0461_1 good
    T0472   good (but individual domains bad)
    T0478_1 good
    T0480_1 good
    T0487_2 good
    T0489_1 good
    T0495_1 good (SAM-T08-human good for most residues)
    T0496_2 good
    T0501_1 good
    T0504_1 good
    T0510   good (but individual domains only moderately good)