[ Sat Sep 8 06:15:31 PDT 2007 Kevin Karplus
Note: many of the instructions in John's protocol have been replaced
by Makefile targets.  See the Makefile for details. ]

Cost Function Evaluation Overview
John Archie (2007-08-20)

Most of the evaluation process can be done with the cfneval.pl script
provided here; the script is documented:

    % cfneval.pl --man

Evaluating cost functions using the code here is a multi-step process.

First, create the evaluation score files in all of the CASP7 target
directories.  For my anglevector.costfcn file this was done by

    % set casp7=/projects/compbio/experiments/protein-predict/casp7/
    % set anglevectorcfn=/cse/grads/jarchie/projects/anglevector/anglevector.costfcn
    % set targetfile=$casp7/target_list.txt
    % umask 002
    % foreach target (`cat $targetfile`)
    foreach> sed -e "s/TXXXX/$target/g" < $anglevectorcfn > $casp7/$target/anglevector.costfcn
    foreach> end

Next, do all the scoring of decoys using the CASP7 stuff.  One way is

    % cfneval.pl -us "decoys/predictions.evaluate.anglevector.rdb" \
    ? | para-trickle-make -command ' ' -max_jobs 5

[ Thu Aug 23 12:02:21 PDT 2007 Kevin Karplus
Alternatively, you could use

    para-trickle-make -manyids -se2log -no2letter -modelsdir $casp7 \
        -makefile ./Makefile \
        -target decoys/predictions.evaluate.anglevector.rdb < $targetfile
]

Summary statistics can be generated by cfneval.pl:

    % cfneval.pl -s decoys/predictions.evaluate.anglevector.rdb -f0 > example.rdb

Finally, plot the graphs and analyze the data in R, gnuplot, or some
other program:

    % R --no-save < cfneval_example.R > cfneval_example.log

(Check the R log for summary statistics and the plots/ directory for
plots.)

Tue Aug 21 13:27:31 PDT 2007 Kevin Karplus
Copied to /projects/compbio/experiments/protein-predict/CostFcnEval

Tue Aug 21 13:39:25 PDT 2007 Kevin Karplus
Created builtins.costfcn to evaluate all the cost functions that are
not specific to a particular target.
Tue Aug 21 20:36:27 PDT 2007 John Archie
Fussed a bit with the method used in cfneval.pl to compute Kendall's
tau, to increase speed.  My very rough guess is that it will now take
about 5 hours to complete the hierarchical cost function tree that I
need to build in the Fall.

Fri Aug 24 12:53:34 PDT 2007 Kevin Karplus
One can get a quick summary of the results in the rdb file using

    summ -m < builtins.rdb | sort -nr +7 > builtins.avg

For the builtin cost fcns, the highest average tau is for
near_backbone, followed by other burial functions.
Note: I had to modify summ slightly, as it had used %d instead of %g
to print the values.

Fri Aug 24 13:19:00 PDT 2007 Kevin Karplus
I have put targets in the Makefile for evaluating the costfcn,
building an rdb file of the results by target, and giving the average
for each costfcn.

Fri Aug 24 14:48:55 PDT 2007 Kevin Karplus
There is now a %.summarize target, so that

    make -k builtins.summarize

will make

    builtins-gdt-btr.avg          builtins-gdt-btr.rdb
    builtins-gdt-tau.avg          builtins-gdt-tau.rdb
    builtins-real_cost-btr.avg    builtins-real_cost-btr.rdb
    builtins-real_cost-tau.avg    builtins-real_cost-tau.rdb

Using the real_cost metric and tau, the best cost function components are

    Min, Avg, Max, Total for hbond_geom_backbone: -0.136, 0.333244, 0.581, 28.659
    Min, Avg, Max, Total for near_backbone: -0.029, 0.32643, 0.566, 28.073
    Min, Avg, Max, Total for dry12: -0.053, 0.305628, 0.642, 26.284
    Min, Avg, Max, Total for dry8: -0.023, 0.303709, 0.569, 26.119

Using gdt and tau, the best cost function components are

    Min, Avg, Max, Total for near_backbone: -0.054, 0.302116, 0.553, 25.982
    Min, Avg, Max, Total for dry12: -0.064, 0.290791, 0.646, 25.008
    Min, Avg, Max, Total for dry8: -0.016, 0.286314, 0.543, 24.623
    Min, Avg, Max, Total for way_back: -0.087, 0.28086, 0.562, 24.154
    Min, Avg, Max, Total for dry6.5: -0.037, 0.261, 0.546, 22.446
    Min, Avg, Max, Total for hbond_geom_backbone: -0.102, 0.250128, 0.5, 21.511

It is interesting that hbond_geom_backbone moves up so
much in the real_cost measure---probably because of the hbond scoring
functions included in real_cost.

Sun Aug 26 20:15:13 PDT 2007 Kevin Karplus
WARNING: there seems to be an occasional problem with T0305 on the
moai cluster:

    # ReadConformPDB reading from PDB file predictions/T0305TS601_3 looking for model 1
    # Found a chain break before 294
    # copying to AlignedFragments data structure
    # naming current conformation T0305TS601_3
    # request to SCWRL produces command: ulimit -t 268 ; scwrl3 -i /var/tmp/to_scwrl_1995065502.pdb -s /var/tmp/to_scwrl_1995065502.seq -o /var/tmp/from_scwrl_1995065502.pdb > /var/tmp/scwrl_1995065502.log
    # Trying to read SCWRLed conformation from /var/tmp/from_scwrl_1995065502.pdb
    undertaker: ScwrlCommands.cc:224: Conformation* SCWRL(Conformation*, std::ostream&): Assertion `ch->atom(a).no_wc_match(new_ch->atom(atom_in_new_ch))' failed.

Running exactly the same program on cheep does not cause any
problems.

When comparing tau or btr numbers, check to make sure that the same
number of targets is included in both runs (not a problem if the
computations are from the same run).
Sun Aug 26 22:25:26 PDT 2007 Kevin Karplus
The best cost functions for choosing high GDT are all neural-net
predictions:

predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_04_simple: 0.262, 0.520244, 0.733, 44.741
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_06_simple: 0.261, 0.515058, 0.732, 44.295
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_2k_simple: 0.29, 0.51464, 0.735, 44.259
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_04: 0.272, 0.483512, 0.715, 41.582
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_06: 0.227, 0.477128, 0.708, 41.033
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_2k: 0.237, 0.473186, 0.706, 40.694
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_04_simple: -0.03, 0.447767, 0.716, 38.508
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_alpha06: 0.098, 0.444233, 0.695, 38.204
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_06_simple: -0.03, 0.443849, 0.725, 38.171
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_alpha04: 0.099, 0.44264, 0.709, 38.067
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_alpha2k: 0.117, 0.436105, 0.65, 37.505
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_06: -0.014, 0.424, 0.721, 36.464
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_04: -0.01, 0.421593, 0.736, 36.257
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_2k_simple: 0.071, 0.418953, 0.724, 36.03
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_2k: -0.171, 0.412314, 0.697, 35.459
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_pb_mean: 0.093, 0.409151, 0.634, 35.187
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_pb_t04: 0.088, 0.409116, 0.633, 35.184
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_pb_t06: 0.096, 0.408558, 0.636, 35.136
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_pb_t2k: 0.091, 0.406221, 0.632, 34.935
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_CB8-sep9_06_simple: 0.057, 0.397884, 0.697, 34.218
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_CB8-sep9_06: 0.056, 0.387663, 0.648, 33.339
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_bys_t06: 0.12, 0.375698, 0.632, 32.31
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_bys_t04: 0.12, 0.375523, 0.629, 32.295
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_bys_mean: 0.121, 0.373802, 0.632, 32.147
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_bys_t2k: 0.121, 0.373233, 0.64, 32.098
predburial-gdt-tau.avg:Min, Avg, Max, Total for near_backbone: -0.054, 0.302116, 0.553, 25.982
builtins-gdt-tau.avg:Min, Avg, Max, Total for near_backbone: -0.054, 0.302116, 0.553, 25.982
builtins-gdt-tau.avg:Min, Avg, Max, Total for dry12: -0.064, 0.290791, 0.646, 25.008
builtins-gdt-tau.avg:Min, Avg, Max, Total for dry8: -0.016, 0.286314, 0.543, 24.623
builtins-gdt-tau.avg:Min, Avg, Max, Total for way_back: -0.087, 0.28086, 0.562, 24.154
builtins-gdt-tau.avg:Min, Avg, Max, Total for dry6.5: -0.037, 0.261, 0.546, 22.446
builtins-gdt-tau.avg:Min, Avg, Max, Total for hbond_geom_backbone: -0.102, 0.250128, 0.5, 21.511

For real_cost, the best are again predictions:

predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_04_simple: 0.295, 0.552698, 0.752, 47.532
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_06_simple: 0.308, 0.549256, 0.745, 47.236
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_2k_simple: 0.316, 0.545872, 0.75, 46.945
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_04: 0.22, 0.5175, 0.727, 44.505
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_06: 0.259, 0.512442, 0.724, 44.07
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_2k: 0.203, 0.506477, 0.721, 43.557
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_alpha06: 0.151, 0.499477, 0.69, 42.955
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_alpha04: 0.15, 0.498279, 0.698, 42.852
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_alpha2k: 0.147, 0.486709, 0.671, 41.857
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_pb_t04: 0.192, 0.475895, 0.67, 40.927
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_pb_mean: 0.191, 0.475756, 0.672, 40.915
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_pb_t06: 0.191, 0.475721, 0.674, 40.912
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_pb_t2k: 0.189, 0.471488, 0.67, 40.548
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_04_simple: -0.013, 0.460442, 0.726, 39.598
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_06_simple: -0.013, 0.457674, 0.729, 39.36
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_bys_t06: 0.096, 0.453384, 0.69, 38.991
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_bys_t04: 0.097, 0.45307, 0.689, 38.964
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_bys_mean: 0.094, 0.450965, 0.691, 38.783
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_bys_t2k: 0.091, 0.449674, 0.7, 38.672
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_06: -0.012, 0.440907, 0.7, 37.918
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_04: -0.006, 0.437709, 0.73, 37.643
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_2k_simple: 0.044, 0.428244, 0.727, 36.829
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_2k: -0.166, 0.425407, 0.69, 36.585
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_CB8-sep9_06_simple: 0.089, 0.412093, 0.663, 35.44
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_CB8-sep9_06: 0.073, 0.402047, 0.664, 34.576
builtins-real_cost-tau.avg:Min, Avg, Max, Total for hbond_geom_backbone: -0.136, 0.333244, 0.581, 28.659
predburial-real_cost-tau.avg:Min, Avg, Max, Total for near_backbone: -0.029, 0.326488, 0.566, 28.078
builtins-real_cost-tau.avg:Min, Avg, Max, Total for near_backbone: -0.029, 0.32643, 0.566, 28.073
builtins-real_cost-tau.avg:Min, Avg, Max, Total for dry12: -0.053, 0.305628, 0.642, 26.284
builtins-real_cost-tau.avg:Min, Avg, Max, Total for dry8: -0.023, 0.303709, 0.569, 26.119
builtins-real_cost-tau.avg:Min, Avg, Max, Total for way_back: -0.101, 0.296767, 0.607, 25.522
builtins-real_cost-tau.avg:Min, Avg, Max, Total for alpha: -0.023, 0.288349, 0.583, 24.798
builtins-real_cost-tau.avg:Min, Avg, Max, Total for alpha_prev: 0.005, 0.28786, 0.586, 24.756

From martin.madera@gmail.com Tue Sep 4 20:24:55 2007
Date: Tue, 4 Sep 2007 20:24:48 -0700
From: "Martin Madera"
To: "Kevin Karplus"
Subject: Re: new local properties wanted
Cc: rph@soe.ucsc.edu, gerloff@soe.ucsc.edu, ggshack@soe.ucsc.edu,
    josue@soe.ucsc.edu, bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu,
    jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk
A few random ideas in this general area:

1) A cost function to penalize foaminess of final models, along the
lines of SASApack (or use SASApack directly?).

2) Re-evaluation of all existing local structure alphabets, ignoring
fold recognition and looking at how useful they are for scoring 3D
models.  I suspect that the problems with fold recognition that we're
seeing for many alphabets are caused by bad null models / bad HMM
calibration / similarities between unrelated folds (e.g. Rossmanns
vs. TIM barrels), and we've been restricting ourselves too much by
focusing on alphabets that work for fold recognition.  (It would also
be interesting to compare these results with alignment accuracy
benchmarks, which I'll do.)

3) I think we're doing fine for secondary structure elements and
burial, but I'd like to see more on hairpins / short turns etc. --
basically model evaluation using I-sites / Bystroff, and Osep and
Nsep.  (This falls under 2, I guess, but I thought I'd emphasize it.)

4) A random idea that has just occurred to me: ProteinShop has a few
parameters for beta sheets, IIRC something like twist and curl.
Could we try to predict these?  This would be one way of separating
Rossmanns from TIM barrels.  (How does this relate to NOtor?!)
5) Which reminds me, we desperately need an alphabet that can tell
Rossmanns from TIM barrels.  (Burial and secondary structure are
really bad for this, and it's screwing up fold recognition.)  (4) is
one possibility.  Another possibility is to try and predict whether
an alpha helix that follows a beta strand lies above or below the
beta sheet.  This may not be possible, but I think we should try,
because it's a very important problem.

6) Generalizations of (5).  For a beta strand that follows a beta
strand, Osep/Nsep gives a lot of information, but it doesn't say
whether the next strand is on the left or on the right.  Helix-helix
turns are more complex, but maybe we could categorize them and see
what we can say about the relative position of the two helices.

Martin

On 9/4/07, Kevin Karplus wrote:
>
> In our first substantive tests of the undertaker cost functions, we
> have found that predicted properties (secondary structure, burial,
> contacts from alignments, ...) are much better at selecting good
> models from the CASP7 pool than the built-in cost functions.
>
> This suggests to me that we want to have more such properties to
> predict and use in the cost function.
>
> Grant and I are working on a couple of definitions of a backbone
> alphabet (str4) that can be scored by undertaker (unlike str2, which
> relies on DSSP).
>
> Martin Paluszewski is working on getting contact predictions from
> alignments.
>
> George is working on getting contact predictions from neural nets.
>
> John will be working on making combinations of cost functions to get
> stronger combined cost functions.  John will also be working on
> better ways to evaluate the cost functions.  He has come up with two
> tools so far: Kendall's tau (a correlation measure of monotonicity)
> and btr (better than real).  The tau measure seems quite useful, but
> the btr measure is less informative.
>
> What other directions could we be exploring on this front?
>
> 1) Evaluating models from alignment, and not just models from CASP7
> submissions.
>
> 2) New local properties.  Anyone have any ideas that seem worth
> trying?
>
> Kevin Karplus

From martin.madera@gmail.com Tue Sep 4 20:27:46 2007
Date: Tue, 4 Sep 2007 20:27:43 -0700
From: "Martin Madera"
To: "Kevin Karplus"
Subject: Re: new local properties wanted
Cc: rph@soe.ucsc.edu, gerloff@soe.ucsc.edu, ggshack@soe.ucsc.edu,
    josue@soe.ucsc.edu, bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu,
    jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk

> John will be working on making combinations of cost functions to get
> stronger combined cost functions.  John will also be working on
> better ways to evaluate the cost functions.  He has come up with two
> tools so far: Kendall's tau (a correlation measure of monotonicity)
> and btr (better than real).  The tau measure seems quite useful, but
> the btr measure is less informative.

Ah, I remember thinking that improving tau/btr was an interesting
problem, but I've completely forgotten what tau and btr were trying
to measure (hopeless!).  Could someone remind me?

M.

From martin.madera@gmail.com Tue Sep 4 21:14:31 2007
Date: Tue, 4 Sep 2007 21:14:26 -0700
From: "Martin Madera"
To: "Kevin Karplus"
Subject: Re: new local properties wanted
Cc: rph@soe.ucsc.edu, gerloff@soe.ucsc.edu, ggshack@soe.ucsc.edu,
    josue@soe.ucsc.edu, bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu,
    jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk

> 5) Which reminds me, we desperately need an alphabet that can tell
> Rossmanns from TIM barrels.  (Burial and secondary structure are
> really bad for this, and it's screwing up fold recognition.)  (4) is
> one possibility.  Another possibility is to try and predict whether
> an alpha helix that follows a beta strand lies above or below the
> beta sheet.  This may not be possible, but I think we should try,
> because it's a very important problem.
>
> 6) Generalizations of (5).  For a beta strand that follows a beta
> strand, Osep/Nsep gives a lot of information, but it doesn't say
> whether the next strand is on the left or on the right.  Helix-helix
> turns are more complex, but maybe we could categorize them and see
> what we can say about the relative position of the two helices.

We should probably have a look at the TOPS algorithm from Janet
Thornton's group, which they use to automatically generate topology
cartoons:

    David R. Westhead, Timothy W.F. Slidel, Tomas P.J. Flores, and
    Janet M. Thornton.  Protein structural topology: Automated
    analysis and diagrammatic representation.  Protein Science
    (1999), 8:897-904.
    http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=45405

M.

From jarchie@empress.cse.ucsc.edu Tue Sep 4 21:51:02 2007
Date: Tue, 4 Sep 2007 21:50:46 -0700
From: John Archie
To: Martin Madera
Cc: Kevin Karplus, rph@soe.ucsc.edu, gerloff@soe.ucsc.edu,
    ggshack@soe.ucsc.edu, josue@soe.ucsc.edu, bort@soe.ucsc.edu,
    thiltgen@soe.ucsc.edu, jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk
Subject: Re: new local properties wanted

> Ah, I remember thinking that improving tau/btr was an interesting
> problem, but I've completely forgotten what tau and btr were trying
> to measure (hopeless!).  Could someone remind me?
>
> M.

Kendall's tau is a standard measure of rank correlation which
captures a monotonic relationship between two variables; it's similar
to Spearman's rho but is more intuitive.
For cost functions, Kendall's tau can be computed as follows: Count
all possible pairs of structures.  Count the number of pairs where
the structure with the lower cost is the better structure.  Use these
counts to estimate the probability that, given a random pair,
choosing the structure with the lower cost selects the better
structure.  Tau is this probability rescaled to the range [-1,1]: a
probability of 0 maps to -1, a probability of 0.5 to 0, and a
probability of 1 to 1.

Furthermore, tau has an empirically shown (but not proven)
relationship with mutual information,

    mutual information = -log(1 - tau^2) / 2

which is useful if one wants to weight cost functions in proportion
to their mutual information with GDT or another quality measure.
(Harry Joe, "Relative Entropy Measures of Multivariate Dependence",
Journal of the American Statistical Association, Vol. 84, No. 405)

It is possible to weight both Spearman's rho and Kendall's tau such
that structures with lower cost are given greater influence.  Doing
so yields measures that, when applied to random data, have a normal
distribution centered at 0 with a range of [-1,1], as desired.

Btr is simply the proportion of decoys scoring better than the
experimental structure.  The problem with this measure is that it is
not continuous, and values of 0 and 1 occur frequently.  A more
significant problem is that cost functions do not handle missing data
consistently, and the experimental structure usually has missing
atoms.  (With tau, one has the luxury of filtering out incomplete
structures.)  If one can overcome the missing-atoms problem, similar
measures can easily be created without the problems of btr.  As
things now stand, btr is not useful for comparing different cost
functions.
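[Editor's note: the pair-counting recipe above can be sketched in a few
lines of Python.  This is an illustration only, not cfneval.pl's actual
implementation (which was reworked for speed); the function names are
mine, and tied pairs are simply skipped.]

```python
from itertools import combinations
from math import log

def kendall_tau(costs, quality):
    """Pair-counting Kendall's tau: estimate the probability that the
    lower-cost model of a random pair is also the better model, then
    rescale that probability from [0,1] to [-1,1]."""
    concordant = total = 0
    for (c1, q1), (c2, q2) in combinations(zip(costs, quality), 2):
        if c1 == c2 or q1 == q2:
            continue                    # skip tied pairs
        total += 1
        if (c1 < c2) == (q1 > q2):      # lower cost goes with higher quality
            concordant += 1
    p = concordant / total              # P(lower cost picks better model)
    return 2.0 * p - 1.0                # 0 -> -1, 0.5 -> 0, 1 -> 1

def tau_to_mutual_information(tau):
    """The empirical tau/mutual-information relationship cited above."""
    return -log(1.0 - tau * tau) / 2.0
```

A cost function that perfectly ranks the models (lowest cost on the
highest-GDT model) gets tau = 1; a perfectly wrong one gets tau = -1.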
John

From jarchie@empress.cse.ucsc.edu Tue Sep 4 23:18:46 2007
Date: Tue, 4 Sep 2007 23:18:33 -0700
From: John Archie
To: Martin Madera
Cc: Kevin Karplus, rph@soe.ucsc.edu, gerloff@soe.ucsc.edu,
    ggshack@soe.ucsc.edu, josue@soe.ucsc.edu, bort@soe.ucsc.edu,
    thiltgen@soe.ucsc.edu, jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk
Subject: Re: new local properties wanted

> Tau seems like a good idea.  But I remember that in your talk you
> mentioned that there are some problems with it.  What are they?

There was one main problem.  For easy targets there were a lot of
decoys that were very close to being correct.  The set of good decoys
had both a low cost and high GDT, but the two were relatively
uncorrelated with each other.  Still, there were a few bad decoys
which had a high cost.  And so the cost function may have been able
to tell between good and poor structures, but tau was very small.

I think this could be solved in a few ways:

(1) Say that this isn't really a problem.  For easy targets we want
    to be able to tell the difference between very good predictions,
    since most predictions are very good.  (Should this be called the
    Microsoft solution?)

(2) Say that the problem is with the decoy set not being
    representative, and thin the decoy set somehow--perhaps using
    structures greater than some RMSD from all other targets in the
    set (?).
(3) Use a measure like Pearson's correlation, which assumes a
    bivariate normal distribution--something that might not be true
    for cost functions and model quality measures.  Still, Pearson's
    is very sensitive to outliers, so it would give an "intuitive"
    result in this case.  Nonetheless, I think Pearson's would cause
    more problems elsewhere...

Naively, I would think that (1) would be better for quality
assessment and (2) would be better for structure prediction--but I'm
not sure.

Goodnight,
John

On Tue, Sep 04, 2007 at 10:15:33PM -0700, Martin Madera wrote:
> Ah, now it's coming back!
>
> For each decoy you have a cost computed using your cost function,
> and something like GDT, which gives you a scatter plot.  And you
> want a single number that will characterize this scatter plot.
>
> Tau seems like a good idea.  But I remember that in your talk you
> mentioned that there are some problems with it.  What are they?
>
> M.

From martin.madera@gmail.com Wed Sep 5 01:04:17 2007
Date: Wed, 5 Sep 2007 01:04:14 -0700
From: "Martin Madera"
To: "John Archie"
Cc: "Kevin Karplus"
Subject: Re: new local properties wanted

> There was one main problem.  For easy targets there were a lot of
> decoys that were very close to being correct.  The set of good
> decoys had both a low cost and high GDT, but were relatively
> uncorrelated with each other.  Still, there were a few bad decoys
> which had a high cost.  And so the cost function may have been able
> to tell between good and poor structures, but tau was very small.

Ah, yes.  OK.  Now I'm fully with you.

> (1) Say that this isn't really a problem.  For easy targets we want
>     to be able to tell the difference between very good predictions
>     since most predictions are very good.  (Should this be called
>     the Microsoft solution?)

No.  There are different types of cost functions.  Some are
fine-grained and focus on the details of the structure (e.g. SASApack
for foaminess and penalties for clashes), but once you're more than a
certain distance away from native they don't tell you anything.
Other functions (e.g.
secondary structure and burial) are much coarser, and can tell you
whether the overall structure is sensible, but they don't know about
the high-resolution details. It's silly to expect that secondary
structure predictions should tell you anything about high-resolution
homology modelling!

> (2) Say that the problem is with the decoy set not being
>     representative, and thin the decoy set somehow--perhaps using
>     structures greater than some RMSD from all other targets in the
>     set (?).

Adding weights (say between 0 and 1) may be better than thinning.
Tau's easy to generalize: just replace pair counts by the sum of pair
weights.

Hmmm. For a coarse-grained cost function you want to downweight pairs
where both decoys are close to native, because you don't expect the
cost function to be able to tell the difference. (For a fine-grained
function, on the other hand, you want to downweight pairs far from
native.) You also want to give a lower weight to a 5A-5A pair than a
4A-6A pair. But I'm not sure how to determine the weights... and once
you've determined the weights, how to compare the scores for two cost
functions with very different weights. E.g. if you have two burial
cost functions, one of which works well in the 3-6A range and the
other works OK but not great in the 5-10A range, then you definitely
want to use the first one for 3-6A, maybe both in the 6-8A range, and
only the second one in the 8-12A range (because it's better than
nothing).

Martin

> (3) Use a measure like Pearson's correlation which assumes a
>     bivariate normal distribution--something that might not be true
>     for cost functions and model quality measures. Still, Pearson's
>     is very sensitive to outliers, so it would give an "intuitive"
>     result in this case. Nonetheless, I think Pearson's would cause
>     more problems elsewhere...
>
> Naively, I would think that (1) would be better for quality assessment
> and (2) would be better for structure prediction--but I'm not sure.
>
> Goodnight,
> John
>
> On Tue, Sep 04, 2007 at 10:15:33PM -0700, Martin Madera wrote:
> > Ah, now it's coming back!
> >
> > For each decoy you have a cost computed using your cost function, and
> > something like GDT, which gives you a scatter plot. And you want a
> > single number that will characterize this scatter plot.
> >
> > Tau seems like a good idea. But I remember that in your talk you
> > mentioned that there are some problems with it. What are they?
> >
> > M.

From karplus@soe.ucsc.edu Wed Sep 5 03:53:50 2007
Date: Wed, 5 Sep 2007 03:53:33 -0700
From: Kevin Karplus
To: jarchie@soe.ucsc.edu
CC: martin.madera@gmail.com, rph@soe.ucsc.edu, gerloff@soe.ucsc.edu,
    ggshack@soe.ucsc.edu, josue@soe.ucsc.edu, bort@soe.ucsc.edu,
    thiltgen@soe.ucsc.edu, jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk, karplus@soe.ucsc.edu
In-reply-to: <20070905061833.GA18237@localhost> (message from John Archie
    on Tue, 4 Sep 2007 23:18:33 -0700)
Subject: Re: new local properties wanted

I think that weighting the low-cost points higher would improve tau as
a measure. Rejecting the few really bad solutions is not very
difficult---the hard part is distinguishing among the fairly good
solutions. So the fact that tau is low when all a cost function does
is distinguish the total crap from the adequate models is actually one
of its good features.

I think that the rejection (or downweighting) of data points should be
done based on the cost function, and not the actual quality of the
models, as we certainly want to know if a cost function is liking the
really bad models, but we don't really care much if a few "good"
models are rejected by the cost function.
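Martin's pair-weight generalization of tau, combined with Kevin's point that downweighting should be driven by the cost function rather than by model quality, could be sketched as below. This is only an illustration, not what cfneval.pl actually computes: the exponential rank weight exp(-2k/n) is the shape John and Kevin later judged reasonable by eye, and `quality` stands for any goodness measure such as GDT (so a good cost function makes cost and quality vary in opposite directions).

```python
import math

def weighted_kendall_tau(cost, quality, decay=2.0):
    """Kendall's tau generalized to weighted pairs (illustrative sketch).
    Each decoy gets a weight from its rank by cost (low cost => high
    weight), so pairs among the low-cost decoys dominate the score and
    pairs among the high-cost junk are downweighted.  The weight of a
    pair is the product of the two decoy weights."""
    n = len(cost)
    # rank decoys by cost: rank 0 = lowest cost
    order = sorted(range(n), key=lambda i: cost[i])
    w = [0.0] * n
    for rank, i in enumerate(order):
        w[i] = math.exp(-decay * rank / n)   # exp(-2k/n) for decay=2
    concordant = discordant = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dc = cost[i] - cost[j]
            dq = quality[i] - quality[j]
            if dc == 0 or dq == 0:
                continue  # ties contribute nothing in this sketch
            pw = w[i] * w[j]
            # cost should *fall* as quality rises, so a pair is
            # concordant when the differences have opposite signs
            if dc * dq < 0:
                concordant += pw
            else:
                discordant += pw
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```

With uniform weights (decay=0) this reduces to ordinary Kendall's tau on untied pairs; raising `decay` concentrates the score on the decoys the cost function likes best.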
From karplus@soe.ucsc.edu Wed Sep 5 04:14:50 2007
Date: Wed, 5 Sep 2007 04:14:49 -0700
From: Kevin Karplus
To: martin.madera@gmail.com
CC: jarchie@soe.ucsc.edu, karplus@soe.ucsc.edu
In-reply-to: <6de0ae080709050104y63d36cf0lc8b83a4546f2f2fc@mail.gmail.com>
    (martin.madera@gmail.com)
Subject: Re: new local properties wanted

Martin, you said
> It's silly to expect that secondary structure predictions should tell
> you anything about high-resolution homology modelling!

That seems intuitively correct, so I wanted to check some of the
"HA-TBM" targets for some of our best cost functions:

CostFcn  pred_nb11_04_simple  pred_alpha06  pred_pb_mean
avg tau  0.552698             0.549256      0.475756      # average over all targets
T0288    0.500                0.510         0.385
T0290    0.330                0.501         0.559
T0291    0.504                0.690         0.644
T0292    0.505                0.597         0.541
T0295    0.626                0.568         0.672
T0302    0.536                0.366         0.418
T0305    0.529                0.442         0.500
T0308    0.495                0.447         0.419
T0311    0.567                0.324         0.365
T0313    0.326                0.264         0.242
T0315    0.472                0.542         0.484
T0317    0.371                0.468         0.451
T0324    0.624                0.480         0.527
T0326    0.682                0.550         0.654
T0328    0.741                0.528         0.664
T0332    0.510                0.345         0.287
T0334    ?                    ?             ?
T0340    0.518                0.425         0.332
T0345    0.610                0.585         0.535
T0346    0.316                0.515         0.545
T0359    0.357                0.490         0.320
T0366    ?                    ?             ?
T0367    0.744                0.606         0.423

Even on the HA-TBM targets, the Kendall's tau for these predicted
burial and predicted secondary structure cost functions is respectably
high. So I think your intuition here is wrong.
Kevin

From karplus@soe.ucsc.edu Wed Sep 5 05:26:42 2007
Date: Wed, 5 Sep 2007 05:26:38 -0700
From: Kevin Karplus
To: martin.madera@gmail.com
CC: rph@soe.ucsc.edu, gerloff@soe.ucsc.edu, ggshack@soe.ucsc.edu,
    josue@soe.ucsc.edu, bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu,
    jarchie@soe.ucsc.edu, paluszewski@gmail.com, T.Juettemann@sms.ed.ac.uk,
    J.L.Sharman@sms.ed.ac.uk, karplus@soe.ucsc.edu
In-reply-to: <6de0ae080709042024k60058cd9qbf88bc5b15f5a645@mail.gmail.com>
    (martin.madera@gmail.com)
Subject: Re: new local properties wanted

Following up on Martin's ideas:

> 1) A cost function to penalize foaminess of final models, along the
> lines of SASApack (or use SASApack directly?).

This might be worth looking into, as we certainly use foaminess as one
of our visual checks. Christian Barrett had some measures in his
thesis that try to capture this (and which did well in decoy tests)
and that are cheaper to compute than SASApack, being based on atom
counting rather than area or volume computation. I don't remember him
publishing this outside his thesis. I forget the details, but his
thesis is in the UCSC library.

> 2) Re-evaluation of all existing local structure alphabets, ignoring
> fold recognition and looking at how useful they are for scoring 3D
> models. I suspect that the problems with fold recognition that we're
> seeing for many alphabets are caused by bad null models / bad HMM
> calibration / similarities between unrelated folds (e.g. Rossmanns vs.
> TIM barrels), and we've been restricting ourselves too much by
> focusing on alphabets that work for fold recognition. (It would also
> be interesting to compare these results with alignment accuracy
> benchmarks, which I'll do.)

I don't know that we want to implement neural nets and cost functions
for *all* the alphabets we've looked at in the past.
Some of them are quite similar to ones we are already using, so
unlikely to be much of an improvement, and others were really terrible
(like chi1, not predictable with neural nets). I'm willing to consider
any local structural alphabet that is easy to implement in undertaker,
as well as other predictable properties.

> 3) I think we're doing fine for secondary structure elements and
> burial, but I'd like to see more on hairpins / short turns etc. --
> basically model evaluation using I-sites / Bystroff, and Osep and
> Nsep. (This falls under 2, I guess, but I thought I'd emphasize it.)

Grant is working on implementing cost functions in undertaker for the
Hbond alphabets---these have not been evaluated yet as cost functions,
just as fold-recognition tools.

We have not tried the I-sites classification of residues, in part
because it was not a very complete classification scheme, in part
because it was based on a combination of sequence and structure, and
in part because there were a lot of different states (HMMSTR reduced
the I-sites library to only 247 states). Our neural net methods may
have trouble with such large alphabets, and the states are not really
structural features, but motifs that are recognized. We have used
Bystroff's single-letter phi-psi classification.

We have had some success with de Brevern's protein blocks alphabet as
a cost function, though we were unable to use it for fold recognition,
because it was not compatible with reverse-sequence nulls. We could
investigate other local-fragment structure alphabets or even create
our own, but I'm not convinced we could do much better than the de
Brevern set. Perhaps a slightly larger alphabet of somewhat shorter
fragments would allow finer coverage.

> 4) A random idea that has just occurred to me: ProteinShop has a few
> parameters for beta sheets, IIRC something like twist and curl. Could
> we try to predict these? This would be one way of separating Rossmanns
> from TIM barrels.
(How does this relate to NOtor?!)

We have not looked at twist and curl---those could be interesting to
predict. The NOtor angles for parallel sheets are fairly tightly
clustered. We only separated the antiparallel Hbonds into two classes,
since they had a clearly bimodal distribution.

> 5) Which reminds me, we desperately need an alphabet that can tell
> Rossmanns from TIM barrels. (Burial and secondary structure are really
> bad for this, and it's screwing up fold recognition.) (4) is one
> possibility. Another possibility is to try and predict whether an
> alpha helix that follows a beta strand lies above or below the beta
> sheet. This may not be possible, but I think we should try, because
> it's a very important problem.

"Above" and "below" the sheet is unfortunately rather vague and may be
hard to capture in a local structure alphabet. I guess that what we
are looking for is an adjacent strand-helix pair, then labeling the
strand residues according to whether they are on the same side of the
sheet as the helix or the opposite side. Generalizing further, for
helix-strand-helix, we could label each residue with one of 4 labels:
same-same, same-opposite, opposite-same, opposite-opposite. Would we
want to do this only to parallel strands, to mixed strands, or to all
strands?

I think that labeling (parallel) strand residues according to which
sides the preceding and following helices are on would be quite
useful, if it turns out to be predictable. I'm not sure how much it
will help with the TIM/Rossmann distinction, as big chunks of both
folds are strand-helix-strand-helix-strand, with all the helices on
the same side. The difference is that the Rossmann fold has two
3-strand chunks, 321456, while the TIM barrel has 12345678. The
difference in connectivity is primarily in the flipping of the 123
sheet, moving the helices to the other side.

> 6) Generalizations of (5).
> For a beta strand that follows a beta
> strand, Osep/Nsep gives a lot of information, but it doesn't say
> whether the next strand is on the left or on the right. Helix-helix
> turns are more complex, but maybe we could categorize them and see
> what we can say about the relative position of the two helices.

The Nsep, Osep alphabets really only cover beta hairpins, not
antiparallel sheets in general. There is no notion of "left" or
"right" when looking at a single hairpin. For anything other than a
simple meander, the interesting strand-strand pairings will be in the
"other antiparallel" category, not the -10 to +10 range of the
separation alphabets. There may be some sheet-topological notions that
we can capture in a local structure alphabet, but Osep and Nsep don't
really do much beyond predicting hairpins and standard secondary
structure. (I'm not knocking the separation alphabets---I think that
improving hairpin prediction is useful.)

I have some vague ideas about labeling strand residues by the
separation from their bonding partner, not with the fine grain of the
current sep alphabets but with a coarser binning that could be used in
parallel sheets to distinguish roughly between strand-helix-strand
neighboring connections and more distant strand pairings. The mean
separation for parallel residues is around 59, but it peaks at 24 with
a median of 36.
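The coarser-binning idea amounts to choosing cut points that make the bins roughly equiprobable over observed separations. A minimal sketch (assuming we have a flat sample of H-bond partner separations in hand; the real cut points would be read off the Dunbrack-set cumulative histogram rather than computed this way):

```python
def equiprobable_bin_edges(values, nbins):
    """Return interior cut points that split `values` into `nbins`
    bins of roughly equal occupancy (empirical quantiles)."""
    v = sorted(values)
    n = len(v)
    # take the k/nbins quantiles as cut points, k = 1..nbins-1
    return [v[(k * n) // nbins] for k in range(1, nbins)]

def bin_label(s, edges):
    """Map a separation s to a coarse bin index, given sorted cut
    points: bin 0 is s < edges[0], the last bin is s >= edges[-1]."""
    for i, e in enumerate(edges):
        if s < e:
            return i
    return len(edges)
```

The resulting labels would play the same role as the current sep-alphabet letters, just with far fewer, more predictable states.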
(~/pce/undertaker/output/dunbrack-1332-beta-parallel-sep.cum-hist)

We could try binning the separation for the H-bonded partner into
roughly equiprobable bins: s<-50  -50<=s<-30  -30<=s<0  0

To: karplus@soe.ucsc.edu, rph@soe.ucsc.edu, gerloff@soe.ucsc.edu,
    martin.madera@gmail.com, ggshack@soe.ucsc.edu, josue@soe.ucsc.edu,
    bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu, jarchie@soe.ucsc.edu,
    paluszewski@gmail.com, T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk
Subject: cost function evaluations

I was looking at the cost function evaluations today in
pcep/CostFcnEval/

The average Kendall's tau value for correlation with real-cost varies
from 0.006 for contact_order to 0.5527 for predicted near-backbone-11
burial (from t04 alignments). I could not find the rdb files for
Martin Paluszewski's contact predictions from alignment---they do not
seem to be in the CostFcnEval directory.

I made a real-cost-tau-merged.rdb file and tried looking to see if
there were easy and hard targets (that is, whether the tau values
correlated between different cost functions). I have not automated
this yet, just eye-balled some scatter diagrams. For different
predictions of the same thing (like pred_nb11_04_simple and
pred_nb11_2k_simple) the correlation is very high. For different
predictions (like ehl2+sheets, contact449a_45, and
pred_nb11_04_simple) the correlation seems to be very low. Predictions
of related properties (like pred_alpha06 and pred_pb_t04, or
pred_nb11_04_simple and pred_cb14_04_simple) have intermediate
correlations. This means that the different cost functions are working
well on different targets, implying to me that a combined cost
function should be able to do much better.

Currently all the neural-net predictions are beating all the builtin
cost functions, though George's contact449a_45 is barely squeezing out
hbond_geom_backbone (0.349 vs 0.333 average tau).

Things to do on this project:

1) Precompute all the scwrled casp7 predictions and save them.
   This would cut the time for evaluating a cost function in half.

2) Precompute all the real_cost functions for all the casp7
   predictions and save them in an rdb file. Use jointbl to merge cost
   function rdb tables with these real_cost rdb tables, rather than
   recomputing them each time. This would probably provide another
   factor of 2 or 3 reduction in the time to evaluate a cost function,
   and doesn't need much scripting. In fact, the existing rdb files
   for builtins could be used as the source for the real_costs, so
   only the jointbl would need to be done.

3) Replace whole-chain evaluation with domain-based evaluation. The
   scripts for running domain-based evaluations exist in the casp7
   Make.main, but not all the targets have the true-domain pdb files
   properly created yet. The scripts and Makefile in CostFcnEval would
   also need some minor mods to handle domains.

   Note that (1) and (2) are independent speedups and can be
   implemented in any order, but (3) would require redoing the
   real_cost computations.

4) Replace the current Kendall's tau computation with a weighted
   computation that assigns more importance to low-cost points. I
   believe that John and I eye-balled some plots and decided that
   weighting rank k by exp(-2 k/n) looked like it did about the right
   thing for summarizing the scatter diagrams in a single number.

5) Start combining cost functions to see how linear combinations
   fare. John has started implementing a tree-based approach, where we
   use hierarchical clustering of the cost functions (based on their
   correlations to each other), then go up the tree optimizing the
   relative weight of the two subtrees. This will not result in a
   global optimum, but it should give a good starting point for more
   sophisticated optimization methods. (Methods like multiple linear
   regression will fail because of the high correlation between some
   of the cost functions.)
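The tree-based combiner of item (5) might look roughly like the following greedy sketch. This is my reconstruction from the description above, not John's actual implementation: repeatedly merge the two most-correlated cost vectors (the hierarchical-clustering step), choosing the relative weight of the two subtrees by grid search against some evaluation score (e.g. the weighted tau against real_cost).

```python
def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0

def combine_tree(costs, score):
    """Greedy bottom-up combination of cost functions.

    `costs` maps cost-function name -> list of per-decoy costs;
    `score` evaluates a combined cost vector (higher is better).
    Returns the name describing the combination and its cost vector."""
    costs = dict(costs)
    while len(costs) > 1:
        names = list(costs)
        # merge the most correlated pair first (clustering step)
        a, b = max(
            ((p, q) for i, p in enumerate(names) for q in names[i + 1:]),
            key=lambda ab: pearson(costs[ab[0]], costs[ab[1]]))
        va, vb = costs.pop(a), costs.pop(b)
        # grid-search the relative weight of the two subtrees
        best = max((w / 10.0 for w in range(11)),
                   key=lambda w: score([w * x + (1 - w) * y
                                        for x, y in zip(va, vb)]))
        merged = [best * x + (1 - best) * y for x, y in zip(va, vb)]
        costs["(%s*%.1f+%s*%.1f)" % (a, best, b, 1 - best)] = merged
    (name, vec), = costs.items()
    return name, vec
```

As the note says, this is not globally optimal; it just gives a sane starting point, and it sidesteps the collinearity that breaks multiple linear regression by only ever fitting one weight at a time.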
   We may want to eliminate some nearly identical cost functions (like
   taking only one of the pred_nb11 cost functions) to reduce the
   number of parameters to tweak. After building the tree and getting
   weights for the cost functions, we may want to eliminate cost
   functions that end up with very low weight, since they may just be
   fitting noise.

6) Make it easy to add new cost functions to the mix later on. I am
   hopeful that predicting str4 will be useful, and that predicting
   Hbonds (by extraction from alignments and by conversion from n_sep,
   o_sep, n_notor2, and o_notor2 alphabets) will be useful.

7) For neural-net predicted cost functions, test a new version that
   makes the cost be -log P(observed|prediction)/P(observed|background),
   rather than just -log P(observed|prediction), to compensate for
   different bin sizes. This should not affect the burial alphabets
   much (as they were constructed to have near-uniform backgrounds),
   but should help the secondary-structure predictions.

8) Test and compare two MQA methods:
   1) using our optimized cost function.
   2) a meta-server that uses the cost function to weight the
      different server models, then creates a new cost function based
      on extracting info from the server models. (This could be C-beta
      constraints, like Martin P. is using, helix and sheet
      constraints, Hbond constraints, or even rmsd between models.)

--------------------------------------------------------------------------------
Sat Sep 29 13:42:33 PDT 2007 Kevin Karplus

Martin Paluszewski provided constraints-all and constraints-optimized
files for constraints extracted from alignments. constraints-all
includes all the C-beta distance constraints that he has extracted
from the alignments. constraints-optimized includes only selected
C-beta constraints, attempting to maximize the sum of the weights of
the contacts and the probability of seeing that many sep>=9 contacts
for that residue (using the CB8-sep9 prediction).
The constraints-optimized cost function is nearly as good as the
pred-nb11 cost functions (assuming the average doesn't change much
when T0305 is included). If gdt-tau is used instead of real_cost-tau,
then the constraints-optimized constraints actually do slightly better
than the pred_nb11 cost functions.

Martin P. says "Also I should mention that they do not include T0305
because it crashes on the cluster. I haven't looked at the reason why,
but it seems to be a scwrl problem." I'm a bit surprised at this, as
no one else has had trouble with T0305.

--------------------------------------------------------------------------------
Sat Nov 17 13:44:15 PST 2007 John Archie

I am creating a new file called everything.costfcn to contain one of
each cost function. This evaluation will be used for testing my cost
function optimizer.

I noticed that RealCost cost functions are defined in a lot of the
*.costfcn files in this directory. It is worth noting that these are
simply ignored with the current CASP7 Makefile. Instead, for
performance reasons, only the non-cheating cost functions are
evaluated, and these data are merged with precomputed cheating cost
function data.

Finally, I replaced all instances of "cfneval.pl" in the Makefile with
"./cfneval.pl". Not everyone has "." in their path.

--------------------------------------------------------------------------------
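The log-ratio cost proposed in item (7) of Kevin's to-do list above is a one-liner; the sketch below just makes the bin-size compensation concrete (the probability values in the test are hypothetical, not from any actual predictor):

```python
import math

def log_ratio_cost(p_obs_given_pred, p_obs_given_background):
    """Item 7's proposed cost: -log [P(obs|pred) / P(obs|background)]
    rather than -log P(obs|pred).  When the observed bin is no more
    likely under the prediction than under the background, the cost is
    zero, so large background bins are no longer penalized merely for
    being large."""
    return -math.log(p_obs_given_pred / p_obs_given_background)
```

For burial alphabets with near-uniform backgrounds the two costs differ only by a constant per position, which is why the note expects the change to matter mainly for secondary-structure predictions.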