[ Sat Sep 8 06:15:31 PDT 2007 Kevin Karplus
Note: many of the instructions in John's protocol have been replaced
by Makefile targets.  See the Makefile for details. ]

Cost Function Evaluation Overview
John Archie (2007-08-20)

Most of the evaluation process can be done with the cfneval.pl script
provided here; the script is documented:

    % cfneval.pl --man

Evaluating cost functions using the code here is a multi-step process.

First, create the evaluation score files in all of the CASP7 target
directories.  For my anglevector.costfcn file this was done by

    % set casp7=/projects/compbio/experiments/protein-predict/casp7/
    % set anglevectorcfn=/cse/grads/jarchie/projects/anglevector/anglevector.costfcn
    % set targetfile=$casp7/target_list.txt
    % umask 002
    % foreach target (`cat $targetfile`)
    foreach> sed -e "s/TXXXX/$target/g" < $anglevectorcfn > $casp7/$target/anglevector.costfcn
    foreach> end

Next, do all the scoring of decoys using the CASP7 stuff.  One way is

    % cfneval.pl -us "decoys/predictions.evaluate.anglevector.rdb" \
    ? | para-trickle-make -command ' ' -max_jobs 5

[ Thu Aug 23 12:02:21 PDT 2007 Kevin Karplus
Alternatively, you could use

    para-trickle-make -manyids -se2log -no2letter -modelsdir $casp7 \
        -makefile ./Makefile \
        -target decoys/predictions.evaluate.anglevector.rdb < $targetfile
]

Summary statistics can be generated by cfneval.pl:

    % cfneval.pl -s decoys/predictions.evaluate.anglevector.rdb -f0 > example.rdb

Finally, plot the graphs and analyze the data in R, gnuplot, or some
other program:

    % R --no-save < cfneval_example.R > cfneval_example.log

(Check the R log for summary statistics and the plots/ directory for
plots.)

Tue Aug 21 13:27:31 PDT 2007 Kevin Karplus
Copied to /projects/compbio/experiments/protein-predict/CostFcnEval

Tue Aug 21 13:39:25 PDT 2007 Kevin Karplus
Created builtins.costfcn to evaluate all the cost functions that are
not specific to a particular target.
Tue Aug 21 20:36:27 PDT 2007 John Archie
Fussed a bit with the method used in cfneval.pl to compute Kendall's
tau, to increase speed.  My very rough guess is that it will now take
about 5 hours to complete the hierarchical cost function tree that I
need to build in the Fall.

Fri Aug 24 12:53:34 PDT 2007 Kevin Karplus
One can get a quick summary of the results in the rdb file using

    summ -m < builtins.rdb | sort -nr +7 > builtins.avg

For the builtin cost fcns, the highest average tau is for
near_backbone, followed by other burial functions.
Note: I had to modify summ slightly, as it had used %d instead of %g
to print the values.

Fri Aug 24 13:19:00 PDT 2007 Kevin Karplus
I have put targets in the Makefile for evaluating the costfcn,
building an rdb file of the results by target, and giving the average
for each costfcn.

Fri Aug 24 14:48:55 PDT 2007 Kevin Karplus
There is now a %.summarize target, so that

    make -k builtins.summarize

will make

    builtins-gdt-btr.avg          builtins-gdt-btr.rdb
    builtins-gdt-tau.avg          builtins-gdt-tau.rdb
    builtins-real_cost-btr.avg    builtins-real_cost-btr.rdb
    builtins-real_cost-tau.avg    builtins-real_cost-tau.rdb

Using the real_cost metric and tau, the best cost function components are

    Min, Avg, Max, Total for hbond_geom_backbone: -0.136, 0.333244, 0.581, 28.659
    Min, Avg, Max, Total for near_backbone: -0.029, 0.32643, 0.566, 28.073
    Min, Avg, Max, Total for dry12: -0.053, 0.305628, 0.642, 26.284
    Min, Avg, Max, Total for dry8: -0.023, 0.303709, 0.569, 26.119

Using gdt and tau, the best cost function components are

    Min, Avg, Max, Total for near_backbone: -0.054, 0.302116, 0.553, 25.982
    Min, Avg, Max, Total for dry12: -0.064, 0.290791, 0.646, 25.008
    Min, Avg, Max, Total for dry8: -0.016, 0.286314, 0.543, 24.623
    Min, Avg, Max, Total for way_back: -0.087, 0.28086, 0.562, 24.154
    Min, Avg, Max, Total for dry6.5: -0.037, 0.261, 0.546, 22.446
    Min, Avg, Max, Total for hbond_geom_backbone: -0.102, 0.250128, 0.5, 21.511

It is interesting that hbond_geom_backbone moves up so
much in the real_cost measure---probably because of the hbond scoring
functions included in real_cost.

Sun Aug 26 20:15:13 PDT 2007 Kevin Karplus
WARNING: there seems to be an occasional problem with T0305 on the
moai cluster:

    # ReadConformPDB reading from PDB file predictions/T0305TS601_3 looking for model 1
    # Found a chain break before 294
    # copying to AlignedFragments data structure
    # naming current conformation T0305TS601_3
    # request to SCWRL produces command: ulimit -t 268 ; scwrl3 -i /var/tmp/to_scwrl_1995065502.pdb -s /var/tmp/to_scwrl_1995065502.seq -o /var/tmp/from_scwrl_1995065502.pdb > /var/tmp/scwrl_1995065502.log
    # Trying to read SCWRLed conformation from /var/tmp/from_scwrl_1995065502.pdb
    undertaker: ScwrlCommands.cc:224: Conformation* SCWRL(Conformation*, std::ostream&): Assertion `ch->atom(a).no_wc_match(new_ch->atom(atom_in_new_ch))' failed.

Running exactly the same program on cheep does not cause any
problems.

When comparing tau or btr numbers, check to make sure that the same
number of targets is included in both runs (not a problem if the
computations are from the same run).
Sun Aug 26 22:25:26 PDT 2007 Kevin Karplus
The best cost functions for choosing high GDT are all neural-net
predictions:

predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_04_simple: 0.262, 0.520244, 0.733, 44.741
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_06_simple: 0.261, 0.515058, 0.732, 44.295
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_2k_simple: 0.29, 0.51464, 0.735, 44.259
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_04: 0.272, 0.483512, 0.715, 41.582
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_06: 0.227, 0.477128, 0.708, 41.033
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_nb11_2k: 0.237, 0.473186, 0.706, 40.694
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_04_simple: -0.03, 0.447767, 0.716, 38.508
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_alpha06: 0.098, 0.444233, 0.695, 38.204
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_06_simple: -0.03, 0.443849, 0.725, 38.171
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_alpha04: 0.099, 0.44264, 0.709, 38.067
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_alpha2k: 0.117, 0.436105, 0.65, 37.505
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_06: -0.014, 0.424, 0.721, 36.464
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_04: -0.01, 0.421593, 0.736, 36.257
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_2k_simple: 0.071, 0.418953, 0.724, 36.03
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_cb14_2k: -0.171, 0.412314, 0.697, 35.459
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_pb_mean: 0.093, 0.409151, 0.634, 35.187
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_pb_t04: 0.088, 0.409116, 0.633, 35.184
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_pb_t06: 0.096, 0.408558, 0.636, 35.136
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_pb_t2k: 0.091, 0.406221, 0.632, 34.935
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_CB8-sep9_06_simple: 0.057, 0.397884, 0.697, 34.218
predburial-gdt-tau.avg:Min, Avg, Max, Total for pred_CB8-sep9_06: 0.056, 0.387663, 0.648, 33.339
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_bys_t06: 0.12, 0.375698, 0.632, 32.31
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_bys_t04: 0.12, 0.375523, 0.629, 32.295
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_bys_mean: 0.121, 0.373802, 0.632, 32.147
anglevector-gdt-tau.avg:Min, Avg, Max, Total for pred_bys_t2k: 0.121, 0.373233, 0.64, 32.098
predburial-gdt-tau.avg:Min, Avg, Max, Total for near_backbone: -0.054, 0.302116, 0.553, 25.982
builtins-gdt-tau.avg:Min, Avg, Max, Total for near_backbone: -0.054, 0.302116, 0.553, 25.982
builtins-gdt-tau.avg:Min, Avg, Max, Total for dry12: -0.064, 0.290791, 0.646, 25.008
builtins-gdt-tau.avg:Min, Avg, Max, Total for dry8: -0.016, 0.286314, 0.543, 24.623
builtins-gdt-tau.avg:Min, Avg, Max, Total for way_back: -0.087, 0.28086, 0.562, 24.154
builtins-gdt-tau.avg:Min, Avg, Max, Total for dry6.5: -0.037, 0.261, 0.546, 22.446
builtins-gdt-tau.avg:Min, Avg, Max, Total for hbond_geom_backbone: -0.102, 0.250128, 0.5, 21.511

For real_cost, the best are again predictions:

predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_04_simple: 0.295, 0.552698, 0.752, 47.532
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_06_simple: 0.308, 0.549256, 0.745, 47.236
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_2k_simple: 0.316, 0.545872, 0.75, 46.945
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_04: 0.22, 0.5175, 0.727, 44.505
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_06: 0.259, 0.512442, 0.724, 44.07
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_nb11_2k: 0.203, 0.506477, 0.721, 43.557
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_alpha06: 0.151, 0.499477, 0.69, 42.955
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_alpha04: 0.15, 0.498279, 0.698, 42.852
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_alpha2k: 0.147, 0.486709, 0.671, 41.857
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_pb_t04: 0.192, 0.475895, 0.67, 40.927
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_pb_mean: 0.191, 0.475756, 0.672, 40.915
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_pb_t06: 0.191, 0.475721, 0.674, 40.912
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_pb_t2k: 0.189, 0.471488, 0.67, 40.548
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_04_simple: -0.013, 0.460442, 0.726, 39.598
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_06_simple: -0.013, 0.457674, 0.729, 39.36
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_bys_t06: 0.096, 0.453384, 0.69, 38.991
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_bys_t04: 0.097, 0.45307, 0.689, 38.964
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_bys_mean: 0.094, 0.450965, 0.691, 38.783
anglevector-real_cost-tau.avg:Min, Avg, Max, Total for pred_bys_t2k: 0.091, 0.449674, 0.7, 38.672
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_06: -0.012, 0.440907, 0.7, 37.918
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_04: -0.006, 0.437709, 0.73, 37.643
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_2k_simple: 0.044, 0.428244, 0.727, 36.829
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_cb14_2k: -0.166, 0.425407, 0.69, 36.585
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_CB8-sep9_06_simple: 0.089, 0.412093, 0.663, 35.44
predburial-real_cost-tau.avg:Min, Avg, Max, Total for pred_CB8-sep9_06: 0.073, 0.402047, 0.664, 34.576
builtins-real_cost-tau.avg:Min, Avg, Max, Total for hbond_geom_backbone: -0.136, 0.333244, 0.581, 28.659
predburial-real_cost-tau.avg:Min, Avg, Max, Total for near_backbone: -0.029, 0.326488, 0.566, 28.078
builtins-real_cost-tau.avg:Min, Avg, Max, Total for near_backbone: -0.029, 0.32643, 0.566, 28.073
builtins-real_cost-tau.avg:Min, Avg, Max, Total for dry12: -0.053, 0.305628, 0.642, 26.284
builtins-real_cost-tau.avg:Min, Avg, Max, Total for dry8: -0.023, 0.303709, 0.569, 26.119
builtins-real_cost-tau.avg:Min, Avg, Max, Total for way_back: -0.101, 0.296767, 0.607, 25.522
builtins-real_cost-tau.avg:Min, Avg, Max, Total for alpha: -0.023, 0.288349, 0.583, 24.798
builtins-real_cost-tau.avg:Min, Avg, Max, Total for alpha_prev: 0.005, 0.28786, 0.586, 24.756

From martin.madera@gmail.com Tue Sep 4 20:24:55 2007
Date: Tue, 4 Sep 2007 20:24:48 -0700
From: "Martin Madera"
To: "Kevin Karplus"
Subject: Re: new local properties wanted
Cc: rph@soe.ucsc.edu, gerloff@soe.ucsc.edu, ggshack@soe.ucsc.edu,
    josue@soe.ucsc.edu, bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu,
    jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk
A few random ideas in this general area:

1) A cost function to penalize foaminess of final models, along the
lines of SASApack (or use SASApack directly?).

2) Re-evaluation of all existing local structure alphabets, ignoring
fold recognition and looking at how useful they are for scoring 3D
models.  I suspect that the problems with fold recognition that we're
seeing for many alphabets are caused by bad null models / bad HMM
calibration / similarities between unrelated folds (e.g. Rossmanns
vs. TIM barrels), and we've been restricting ourselves too much by
focusing on alphabets that work for fold recognition.  (It would also
be interesting to compare these results with alignment accuracy
benchmarks, which I'll do.)

3) I think we're doing fine for secondary structure elements and
burial, but I'd like to see more on hairpins / short turns etc. --
basically model evaluation using I-sites / Bystroff, and Osep and
Nsep.  (This falls under 2, I guess, but I thought I'd emphasize it.)

4) A random idea that has just occurred to me: ProteinShop has a few
parameters for beta sheets, IIRC something like twist and curl.
Could we try to predict these?  This would be one way of separating
Rossmanns from TIM barrels.  (How does this relate to NOtor?!)
5) Which reminds me, we desperately need an alphabet that can tell
Rossmanns from TIM barrels.  (Burial and secondary structure are
really bad for this, and it's screwing up fold recognition.)  (4) is
one possibility.  Another possibility is to try and predict whether
an alpha helix that follows a beta strand lies above or below the
beta sheet.  This may not be possible, but I think we should try,
because it's a very important problem.

6) Generalizations of (5).  For a beta strand that follows a beta
strand, Osep/Nsep gives a lot of information, but it doesn't say
whether the next strand is on the left or on the right.  Helix-helix
turns are more complex, but maybe we could categorize them and see
what we can say about the relative position of the two helices.

Martin

On 9/4/07, Kevin Karplus wrote:
>
> In our first substantive tests of the undertaker cost functions, we
> have found that predicted properties (secondary structure, burial,
> contacts from alignments, ...) are much better at selecting good
> models from the CASP7 pool than the built-in cost functions.
>
> This suggests to me that we want to have more such properties to
> predict and use in the cost function.
>
> Grant and I are working on a couple of definitions of a backbone
> alphabet (str4) that can be scored by undertaker (unlike str2, which
> relies on DSSP).
>
> Martin Paluszewski is working on getting contact predictions from
> alignments.
>
> George is working on getting contact predictions from neural nets.
>
> John will be working on making combinations of cost functions to get
> stronger combined cost functions.  John will also be working on
> better ways to evaluate the cost functions.  He has come up with two
> tools so far: Kendall's tau (a correlation measure of monotonicity)
> and btr (better than real).  The tau measure seems quite useful, but
> the btr measure is less informative.
>
> What other directions could we be exploring on this front?
>
> 1) Evaluating models from alignment, and not just models from CASP7
> submissions.
>
> 2) New local properties.  Anyone have any ideas that seem worth
> trying?
>
> Kevin Karplus

From martin.madera@gmail.com Tue Sep 4 20:27:46 2007
Date: Tue, 4 Sep 2007 20:27:43 -0700
From: "Martin Madera"
To: "Kevin Karplus"
Subject: Re: new local properties wanted
Cc: rph@soe.ucsc.edu, gerloff@soe.ucsc.edu, ggshack@soe.ucsc.edu,
    josue@soe.ucsc.edu, bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu,
    jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk

> John will be working on making combinations of cost functions to get
> stronger combined cost functions.  John will also be working on
> better ways to evaluate the cost functions.  He has come up with two
> tools so far: Kendall's tau (a correlation measure of monotonicity)
> and btr (better than real).  The tau measure seems quite useful, but
> the btr measure is less informative.

Ah, I remember thinking that improving tau/btr was an interesting
problem, but I've completely forgotten what tau and btr were trying
to measure (hopeless!).  Could someone remind me?

M.

From martin.madera@gmail.com Tue Sep 4 21:14:31 2007
Date: Tue, 4 Sep 2007 21:14:26 -0700
From: "Martin Madera"
To: "Kevin Karplus"
Subject: Re: new local properties wanted
Cc: rph@soe.ucsc.edu, gerloff@soe.ucsc.edu, ggshack@soe.ucsc.edu,
    josue@soe.ucsc.edu, bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu,
    jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk

> 5) Which reminds me, we desperately need an alphabet that can tell
> Rossmanns from TIM barrels.  (Burial and secondary structure are
> really bad for this, and it's screwing up fold recognition.)  (4) is
> one possibility.  Another possibility is to try and predict whether
> an alpha helix that follows a beta strand lies above or below the
> beta sheet.  This may not be possible, but I think we should try,
> because it's a very important problem.
>
> 6) Generalizations of (5).  For a beta strand that follows a beta
> strand, Osep/Nsep gives a lot of information, but it doesn't say
> whether the next strand is on the left or on the right.  Helix-helix
> turns are more complex, but maybe we could categorize them and see
> what we can say about the relative position of the two helices.

We should probably have a look at the TOPS algorithm from Janet
Thornton's group, which they use to automatically generate topology
cartoons:

    David R. Westhead, Timothy W.F. Slidel, Tomas P.J. Flores, and
    Janet M. Thornton.  Protein structural topology: Automated
    analysis and diagrammatic representation.  Protein Science
    (1999), 8:897-904.
    http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=45405

M.

From jarchie@empress.cse.ucsc.edu Tue Sep 4 21:51:02 2007
Date: Tue, 4 Sep 2007 21:50:46 -0700
From: John Archie
To: Martin Madera
Cc: Kevin Karplus, rph@soe.ucsc.edu, gerloff@soe.ucsc.edu,
    ggshack@soe.ucsc.edu, josue@soe.ucsc.edu, bort@soe.ucsc.edu,
    thiltgen@soe.ucsc.edu, jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk
Subject: Re: new local properties wanted

> Ah, I remember thinking that improving tau/btr was an interesting
> problem, but I've completely forgotten what tau and btr were trying
> to measure (hopeless!).  Could someone remind me?
>
> M.

Kendall's tau is a standard measure of rank correlation which
captures a monotonic relationship between two variables; it's similar
to Spearman's rho but is more intuitive.
For cost functions, Kendall's tau can be computed as follows: Count
all possible pairs of structures.  Count the number of pairs where
the structure with the lower cost is the better structure.  Use these
counts to estimate the probability that, given a random pair,
choosing the structure with the lower cost selects the better
structure.  Tau is this probability rescaled to the range [-1,1]: a
probability of 0 maps to -1, a probability of 0.5 to 0, and a
probability of 1 to 1.

Furthermore, tau has an empirically shown (but not proven)
relationship with mutual information,

    mutual information = -log(1 - tau^2) / 2

which is useful if one wants to weight cost functions in proportion
to their mutual information with GDT or another quality measure.
(Harry Joe, "Relative Entropy Measures of Multivariate Dependence",
Journal of the American Statistical Association, Vol. 84, No. 405)

It is possible to weight both Spearman's rho and Kendall's tau such
that structures with lower cost are given greater influence.  Doing
so yields measures that, when applied to random data, have a normal
distribution centered at 0 with a range of [-1,1], as desired.

Btr is simply the proportion of decoys scoring better than the
experimental structure.  The problem with this measure is that it is
not continuous, and values of 0 and 1 occur frequently.  A more
significant problem is that cost functions do not handle missing data
consistently, and the experimental structure usually has missing
atoms.  (With tau, one has the luxury of filtering out incomplete
structures.)  If one can overcome the missing-atoms problem, similar
measures can easily be created without the problems of btr.  As
things now stand, btr is not useful for comparing different cost
functions.
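[Editor's note: the pair-counting recipe above can be sketched in a few
lines of Python.  This is an illustration only, not cfneval.pl's actual
implementation (which was reworked for speed); the function names are
mine, and tied pairs are simply skipped.]

```python
from itertools import combinations
from math import log

def kendall_tau(costs, quality):
    """Pair-counting Kendall's tau: estimate the probability that the
    lower-cost model of a random pair is also the better model, then
    rescale that probability from [0,1] to [-1,1]."""
    concordant = total = 0
    for (c1, q1), (c2, q2) in combinations(zip(costs, quality), 2):
        if c1 == c2 or q1 == q2:
            continue                    # skip tied pairs
        total += 1
        if (c1 < c2) == (q1 > q2):      # lower cost goes with higher quality
            concordant += 1
    p = concordant / total              # P(lower cost picks better model)
    return 2.0 * p - 1.0                # 0 -> -1, 0.5 -> 0, 1 -> 1

def tau_to_mutual_information(tau):
    """The empirical tau/mutual-information relationship cited above."""
    return -log(1.0 - tau * tau) / 2.0
```

A cost function that perfectly ranks the models (lowest cost on the
highest-GDT model) gets tau = 1; a perfectly wrong one gets tau = -1.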
John

From jarchie@empress.cse.ucsc.edu Tue Sep 4 23:18:46 2007
Date: Tue, 4 Sep 2007 23:18:33 -0700
From: John Archie
To: Martin Madera
Cc: Kevin Karplus, rph@soe.ucsc.edu, gerloff@soe.ucsc.edu,
    ggshack@soe.ucsc.edu, josue@soe.ucsc.edu, bort@soe.ucsc.edu,
    thiltgen@soe.ucsc.edu, jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk
Subject: Re: new local properties wanted

> Tau seems like a good idea.  But I remember that in your talk you
> mentioned that there are some problems with it.  What are they?

There was one main problem.  For easy targets there were a lot of
decoys that were very close to being correct.  The set of good decoys
had both a low cost and high GDT, but the two were relatively
uncorrelated with each other.  Still, there were a few bad decoys
which had a high cost.  And so the cost function may have been able
to tell between good and poor structures, but tau was very small.

I think this could be solved in a few ways:

(1) Say that this isn't really a problem.  For easy targets we want
    to be able to tell the difference between very good predictions,
    since most predictions are very good.  (Should this be called the
    Microsoft solution?)

(2) Say that the problem is with the decoy set not being
    representative, and thin the decoy set somehow--perhaps using
    structures greater than some RMSD from all other targets in the
    set (?).
(3) Use a measure like Pearson's correlation, which assumes a
    bivariate normal distribution--something that might not be true
    for cost functions and model quality measures.  Still, Pearson's
    is very sensitive to outliers, so it would give an "intuitive"
    result in this case.  Nonetheless, I think Pearson's would cause
    more problems elsewhere...

Naively, I would think that (1) would be better for quality
assessment and (2) would be better for structure prediction--but I'm
not sure.

Goodnight,
John

On Tue, Sep 04, 2007 at 10:15:33PM -0700, Martin Madera wrote:
> Ah, now it's coming back!
>
> For each decoy you have a cost computed using your cost function,
> and something like GDT, which gives you a scatter plot.  And you
> want a single number that will characterize this scatter plot.
>
> Tau seems like a good idea.  But I remember that in your talk you
> mentioned that there are some problems with it.  What are they?
>
> M.

From martin.madera@gmail.com Wed Sep 5 01:04:17 2007
Date: Wed, 5 Sep 2007 01:04:14 -0700
From: "Martin Madera"
To: "John Archie"
Cc: "Kevin Karplus"
Subject: Re: new local properties wanted

> There was one main problem.  For easy targets there were a lot of
> decoys that were very close to being correct.  The set of good
> decoys had both a low cost and high GDT, but were relatively
> uncorrelated with each other.  Still, there were a few bad decoys
> which had a high cost.  And so the cost function may have been able
> to tell between good and poor structures, but tau was very small.

Ah, yes.  OK.  Now I'm fully with you.

> (1) Say that this isn't really a problem.  For easy targets we want
>     to be able to tell the difference between very good predictions
>     since most predictions are very good.  (Should this be called
>     the Microsoft solution?)

No.  There are different types of cost functions.  Some are
fine-grained and focus on the details of the structure (e.g. SASApack
for foaminess and penalties for clashes), but once you're more than a
certain distance away from native they don't tell you anything.
Other functions (e.g.
secondary structure and burial) are much coarser, and can tell you
whether the overall structure is sensible, but they don't know about
the high-resolution details. It's silly to expect that secondary
structure predictions should tell you anything about high-resolution
homology modelling!

> (2) Say that the problem is with the decoy set not being
>     representative, and thin the decoy set somehow--perhaps using
>     structures greater than some RMSD from all other targets in the
>     set (?).

Adding weights (say between 0 and 1) may be better than thinning.
Tau's easy to generalize: just replace pair counts by the sum of pair
weights.

Hmmm. For a coarse-grained cost function you want to downweight pairs
where both decoys are close to native, because you don't expect the
cost function to be able to tell the difference. (For a fine-grained
function, on the other hand, you want to downweight pairs far from
native.) You also want to give a lower weight to a 5A-5A pair than a
4A-6A pair. But I'm not sure how to determine the weights... and once
you've determined the weights, how to compare the scores for two cost
functions with very different weights. E.g. if you have two burial
cost functions, one of which works well in the 3-6A range and the
other works OK but not great in the 5-10A range, then you definitely
want to use the first one for 3-6A, maybe both in the 6-8A range, and
only the second one in the 8-12A range (because it's better than
nothing).

Martin

> (3) Use a measure like Pearson's correlation which assumes a
>     bivariate normal distribution--something that might not be true
>     for cost functions and model quality measures. Still, Pearson's
>     is very sensitive to outliers, so it would give an "intuitive"
>     result in this case. Nonetheless, I think Pearson's would cause
>     more problems elsewhere...
>
> Naively, I would think that (1) would be better for quality assessment
> and (2) would be better for structure prediction--but I'm not sure.
>
> Goodnight,
> John
>
> On Tue, Sep 04, 2007 at 10:15:33PM -0700, Martin Madera wrote:
> > Ah, now it's coming back!
> >
> > For each decoy you have a cost computed using your cost function, and
> > something like GDT, which gives you a scatter plot. And you want a
> > single number that will characterize this scatter plot.
> >
> > Tau seems like a good idea. But I remember that in your talk you
> > mentioned that there are some problems with it. What are they?
> >
> > M.

From karplus@soe.ucsc.edu Wed Sep 5 03:53:50 2007
Date: Wed, 5 Sep 2007 03:53:33 -0700
From: Kevin Karplus
To: jarchie@soe.ucsc.edu
CC: martin.madera@gmail.com, rph@soe.ucsc.edu, gerloff@soe.ucsc.edu,
    ggshack@soe.ucsc.edu, josue@soe.ucsc.edu, bort@soe.ucsc.edu,
    thiltgen@soe.ucsc.edu, jarchie@soe.ucsc.edu, paluszewski@gmail.com,
    T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk, karplus@soe.ucsc.edu
In-reply-to: <20070905061833.GA18237@localhost> (message from John Archie
    on Tue, 4 Sep 2007 23:18:33 -0700)
Subject: Re: new local properties wanted

I think that weighting the low-cost points higher would improve tau as
a measure. Rejecting the few really bad solutions is not very
difficult---the hard part is distinguishing among the fairly good
solutions. So the fact that tau is low when all a cost function does
is distinguish the total crap from the adequate models is actually one
of its good features.

I think that the rejection (or downweighting) of data points should be
done based on the cost function, and not the actual quality of the
models, as we certainly want to know if a cost function is liking the
really bad models, but we don't really care much if a few "good"
models are rejected by the cost function.
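Martin's pair-weight generalization of tau, combined with Kevin's point that downweighting should be driven by the cost function rather than by model quality, could be sketched as below. This is only an illustration, not what cfneval.pl actually computes: the exponential rank weight exp(-2k/n) is the shape John and Kevin later judged reasonable by eye, and `quality` stands for any goodness measure such as GDT (so a good cost function makes cost and quality vary in opposite directions).

```python
import math

def weighted_kendall_tau(cost, quality, decay=2.0):
    """Kendall's tau generalized to weighted pairs (illustrative sketch).
    Each decoy gets a weight from its rank by cost (low cost => high
    weight), so pairs among the low-cost decoys dominate the score and
    pairs among the high-cost junk are downweighted.  The weight of a
    pair is the product of the two decoy weights."""
    n = len(cost)
    # rank decoys by cost: rank 0 = lowest cost
    order = sorted(range(n), key=lambda i: cost[i])
    w = [0.0] * n
    for rank, i in enumerate(order):
        w[i] = math.exp(-decay * rank / n)   # exp(-2k/n) for decay=2
    concordant = discordant = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dc = cost[i] - cost[j]
            dq = quality[i] - quality[j]
            if dc == 0 or dq == 0:
                continue  # ties contribute nothing in this sketch
            pw = w[i] * w[j]
            # cost should *fall* as quality rises, so a pair is
            # concordant when the differences have opposite signs
            if dc * dq < 0:
                concordant += pw
            else:
                discordant += pw
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```

With uniform weights (decay=0) this reduces to ordinary Kendall's tau on untied pairs; raising `decay` concentrates the score on the decoys the cost function likes best.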
From karplus@soe.ucsc.edu Wed Sep 5 04:14:50 2007
Date: Wed, 5 Sep 2007 04:14:49 -0700
From: Kevin Karplus
To: martin.madera@gmail.com
CC: jarchie@soe.ucsc.edu, karplus@soe.ucsc.edu
In-reply-to: <6de0ae080709050104y63d36cf0lc8b83a4546f2f2fc@mail.gmail.com>
    (martin.madera@gmail.com)
Subject: Re: new local properties wanted

Martin, you said
> It's silly to expect that secondary structure predictions should tell
> you anything about high-resolution homology modelling!

That seems intuitively correct, so I wanted to check some of the
"HA-TBM" targets for some of our best cost functions:

CostFcn  pred_nb11_04_simple  pred_alpha06  pred_pb_mean
avg tau  0.552698             0.549256      0.475756      # average over all targets
T0288    0.500                0.510         0.385
T0290    0.330                0.501         0.559
T0291    0.504                0.690         0.644
T0292    0.505                0.597         0.541
T0295    0.626                0.568         0.672
T0302    0.536                0.366         0.418
T0305    0.529                0.442         0.500
T0308    0.495                0.447         0.419
T0311    0.567                0.324         0.365
T0313    0.326                0.264         0.242
T0315    0.472                0.542         0.484
T0317    0.371                0.468         0.451
T0324    0.624                0.480         0.527
T0326    0.682                0.550         0.654
T0328    0.741                0.528         0.664
T0332    0.510                0.345         0.287
T0334    ?                    ?             ?
T0340    0.518                0.425         0.332
T0345    0.610                0.585         0.535
T0346    0.316                0.515         0.545
T0359    0.357                0.490         0.320
T0366    ?                    ?             ?
T0367    0.744                0.606         0.423

Even on the HA-TBM targets, the Kendall's tau for these predicted
burial and predicted secondary structure cost functions is respectably
high. So I think your intuition here is wrong.
Kevin

From karplus@soe.ucsc.edu Wed Sep 5 05:26:42 2007
Date: Wed, 5 Sep 2007 05:26:38 -0700
From: Kevin Karplus
To: martin.madera@gmail.com
CC: rph@soe.ucsc.edu, gerloff@soe.ucsc.edu, ggshack@soe.ucsc.edu,
    josue@soe.ucsc.edu, bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu,
    jarchie@soe.ucsc.edu, paluszewski@gmail.com, T.Juettemann@sms.ed.ac.uk,
    J.L.Sharman@sms.ed.ac.uk, karplus@soe.ucsc.edu
In-reply-to: <6de0ae080709042024k60058cd9qbf88bc5b15f5a645@mail.gmail.com>
    (martin.madera@gmail.com)
Subject: Re: new local properties wanted

Following up on Martin's ideas:

> 1) A cost function to penalize foaminess of final models, along the
> lines of SASApack (or use SASApack directly?).

This might be worth looking into, as we certainly use foaminess as one
of our visual checks. Christian Barrett had some measures in his
thesis that try to capture this (and which did well in decoy tests)
and that are cheaper to compute than SASApack, being based on atom
counting rather than area or volume computation. I don't remember him
publishing this outside his thesis. I forget the details, but his
thesis is in the UCSC library.

> 2) Re-evaluation of all existing local structure alphabets, ignoring
> fold recognition and looking at how useful they are for scoring 3D
> models. I suspect that the problems with fold recognition that we're
> seeing for many alphabets are caused by bad null models / bad HMM
> calibration / similarities between unrelated folds (e.g. Rossmanns vs.
> TIM barrels), and we've been restricting ourselves too much by
> focusing on alphabets that work for fold recognition. (It would also
> be interesting to compare these results with alignment accuracy
> benchmarks, which I'll do.)

I don't know that we want to implement neural nets and cost functions
for *all* the alphabets we've looked at in the past.
Some of them are quite similar to ones we are already using, so
unlikely to be much of an improvement, and others were really terrible
(like chi1, not predictable with neural nets). I'm willing to consider
any local structural alphabet that is easy to implement in undertaker,
as well as other predictable properties.

> 3) I think we're doing fine for secondary structure elements and
> burial, but I'd like to see more on hairpins / short turns etc. --
> basically model evaluation using I-sites / Bystroff, and Osep and
> Nsep. (This falls under 2, I guess, but I thought I'd emphasize it.)

Grant is working on implementing cost functions in undertaker for the
Hbond alphabets---these have not been evaluated yet as cost functions,
just as fold-recognition tools.

We have not tried the I-sites classification of residues, in part
because it was not a very complete classification scheme, in part
because it was based on a combination of sequence and structure, and
in part because there were a lot of different states (HMMSTR reduced
the I-sites library to only 247 states). Our neural net methods may
have trouble with such large alphabets, and the states are not really
structural features, but motifs that are recognized. We have used
Bystroff's single-letter phi-psi classification.

We have had some success with de Brevern's protein blocks alphabet as
a cost function, though we were unable to use it for fold recognition,
because it was not compatible with reverse-sequence nulls. We could
investigate other local-fragment structure alphabets or even create
our own, but I'm not convinced we could do much better than the de
Brevern set. Perhaps a slightly larger alphabet of somewhat shorter
fragments would allow finer coverage.

> 4) A random idea that has just occurred to me: ProteinShop has a few
> parameters for beta sheets, IIRC something like twist and curl. Could
> we try to predict these? This would be one way of separating Rossmanns
> from TIM barrels.
(How does this relate to NOtor?!)

We have not looked at twist and curl---those could be interesting to
predict. The NOtor angles for parallel sheets are fairly tightly
clustered. We only separated the antiparallel Hbonds into two classes,
since they had a clearly bimodal distribution.

> 5) Which reminds me, we desperately need an alphabet that can tell
> Rossmanns from TIM barrels. (Burial and secondary structure are really
> bad for this, and it's screwing up fold recognition.) (4) is one
> possibility. Another possibility is to try and predict whether an
> alpha helix that follows a beta strand lies above or below the beta
> sheet. This may not be possible, but I think we should try, because
> it's a very important problem.

"Above" and "below" the sheet is unfortunately rather vague and may be
hard to capture in a local structure alphabet. I guess that what we
are looking for is an adjacent strand-helix pair, then labeling the
strand residues according to whether they are on the same side of the
sheet as the helix or the opposite side. Generalizing further, for
helix-strand-helix, we could label each residue with one of 4 labels:
same-same, same-opposite, opposite-same, opposite-opposite. Would we
want to do this only to parallel strands, to mixed strands, or to all
strands?

I think that labeling (parallel) strand residues according to which
sides the preceding and following helices are on would be quite
useful, if it turns out to be predictable. I'm not sure how much it
will help with the TIM/Rossmann distinction, as big chunks of both
folds are strand-helix-strand-helix-strand, with all the helices on
the same side. The difference is that the Rossmann fold has two
3-strand chunks, 321456, while the TIM barrel has 12345678. The
difference in connectivity is primarily in the flipping of the 123
sheet, moving the helices to the other side.

> 6) Generalizations of (5).
> For a beta strand that follows a beta
> strand, Osep/Nsep gives a lot of information, but it doesn't say
> whether the next strand is on the left or on the right. Helix-helix
> turns are more complex, but maybe we could categorize them and see
> what we can say about the relative position of the two helices.

The Nsep, Osep alphabets really only cover beta hairpins, not
antiparallel sheets in general. There is no notion of "left" or
"right" when looking at a single hairpin. For anything other than a
simple meander, the interesting strand-strand pairings will be in the
"other antiparallel" category, not the -10 to +10 range of the
separation alphabets. There may be some sheet-topological notions that
we can capture in a local structure alphabet, but Osep and Nsep don't
really do much beyond predicting hairpins and standard secondary
structure. (I'm not knocking the separation alphabets---I think that
improving hairpin prediction is useful.)

I have some vague ideas about labeling strand residues by the
separation from their bonding partner, not with the fine grain of the
current sep alphabets but with a coarser binning that could be used in
parallel sheets to distinguish roughly between strand-helix-strand
neighboring connections and more distant strand pairings. The mean
separation for parallel residues is around 59, but it peaks at 24 with
a median of 36.
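The coarser-binning idea amounts to choosing cut points that make the bins roughly equiprobable over observed separations. A minimal sketch (assuming we have a flat sample of H-bond partner separations in hand; the real cut points would be read off the Dunbrack-set cumulative histogram rather than computed this way):

```python
def equiprobable_bin_edges(values, nbins):
    """Return interior cut points that split `values` into `nbins`
    bins of roughly equal occupancy (empirical quantiles)."""
    v = sorted(values)
    n = len(v)
    # take the k/nbins quantiles as cut points, k = 1..nbins-1
    return [v[(k * n) // nbins] for k in range(1, nbins)]

def bin_label(s, edges):
    """Map a separation s to a coarse bin index, given sorted cut
    points: bin 0 is s < edges[0], the last bin is s >= edges[-1]."""
    for i, e in enumerate(edges):
        if s < e:
            return i
    return len(edges)
```

The resulting labels would play the same role as the current sep-alphabet letters, just with far fewer, more predictable states.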
(~/pce/undertaker/output/dunbrack-1332-beta-parallel-sep.cum-hist)

We could try binning the separation for the H-bonded partner into
roughly equiprobable bins: s<-50  -50<=s<-30  -30<=s<0  0

To: karplus@soe.ucsc.edu, rph@soe.ucsc.edu, gerloff@soe.ucsc.edu,
    martin.madera@gmail.com, ggshack@soe.ucsc.edu, josue@soe.ucsc.edu,
    bort@soe.ucsc.edu, thiltgen@soe.ucsc.edu, jarchie@soe.ucsc.edu,
    paluszewski@gmail.com, T.Juettemann@sms.ed.ac.uk, J.L.Sharman@sms.ed.ac.uk
Subject: cost function evaluations

I was looking at the cost function evaluations today in
pcep/CostFcnEval/

The average Kendall's tau value for correlation with real-cost varies
from 0.006 for contact_order to 0.5527 for predicted near-backbone-11
burial (from t04 alignments). I could not find the rdb files for
Martin Paluszewski's contact predictions from alignment---they do not
seem to be in the CostFcnEval directory.

I made a real-cost-tau-merged.rdb file and tried looking to see if
there were easy and hard targets (that is, whether the tau values
correlated between different cost functions). I have not automated
this yet, just eye-balled some scatter diagrams. For different
predictions of the same thing (like pred_nb11_04_simple and
pred_nb11_2k_simple) the correlation is very high. For different
predictions (like ehl2+sheets, contact449a_45, and
pred_nb11_04_simple) the correlation seems to be very low. Predictions
of related properties (like pred_alpha06 and pred_pb_t04, or
pred_nb11_04_simple and pred_cb14_04_simple) have intermediate
correlations. This means that the different cost functions are working
well on different targets, implying to me that a combined cost
function should be able to do much better.

Currently all the neural-net predictions are beating all the builtin
cost functions, though George's contact449a_45 is barely squeezing out
hbond_geom_backbone (0.349 vs 0.333 average tau).

Things to do on this project:

1) Precompute all the scwrled casp7 predictions and save them.
   This would cut the time for evaluating a cost function in half.

2) Precompute all the real_cost functions for all the casp7
   predictions and save them in an rdb file. Use jointbl to merge cost
   function rdb tables with these real_cost rdb tables, rather than
   recomputing them each time. This would probably provide another
   factor of 2 or 3 reduction in the time to evaluate a cost function,
   and doesn't need much scripting. In fact, the existing rdb files
   for builtins could be used as the source for the real_costs, so
   only the jointbl would need to be done.

3) Replace whole-chain evaluation with domain-based evaluation. The
   scripts for running domain-based evaluations exist in the casp7
   Make.main, but not all the targets have the true-domain pdb files
   properly created yet. The scripts and Makefile in CostFcnEval would
   also need some minor mods to handle domains.

   Note that (1) and (2) are independent speedups and can be
   implemented in any order, but (3) would require redoing the
   real_cost computations.

4) Replace the current Kendall's tau computation with a weighted
   computation that assigns more importance to low-cost points. I
   believe that John and I eye-balled some plots and decided that
   weighting rank k by exp(-2 k/n) looked like it did about the right
   thing for summarizing the scatter diagrams in a single number.

5) Start combining cost functions to see how linear combinations
   fare. John has started implementing a tree-based approach, where we
   use hierarchical clustering of the cost functions (based on their
   correlations to each other), then go up the tree optimizing the
   relative weight of the two subtrees. This will not result in a
   global optimum, but it should give a good starting point for more
   sophisticated optimization methods. (Methods like multiple linear
   regression will fail because of the high correlation between some
   of the cost functions.)
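The tree-based combiner of item (5) might look roughly like the following greedy sketch. This is my reconstruction from the description above, not John's actual implementation: repeatedly merge the two most-correlated cost vectors (the hierarchical-clustering step), choosing the relative weight of the two subtrees by grid search against some evaluation score (e.g. the weighted tau against real_cost).

```python
def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0

def combine_tree(costs, score):
    """Greedy bottom-up combination of cost functions.

    `costs` maps cost-function name -> list of per-decoy costs;
    `score` evaluates a combined cost vector (higher is better).
    Returns the name describing the combination and its cost vector."""
    costs = dict(costs)
    while len(costs) > 1:
        names = list(costs)
        # merge the most correlated pair first (clustering step)
        a, b = max(
            ((p, q) for i, p in enumerate(names) for q in names[i + 1:]),
            key=lambda ab: pearson(costs[ab[0]], costs[ab[1]]))
        va, vb = costs.pop(a), costs.pop(b)
        # grid-search the relative weight of the two subtrees
        best = max((w / 10.0 for w in range(11)),
                   key=lambda w: score([w * x + (1 - w) * y
                                        for x, y in zip(va, vb)]))
        merged = [best * x + (1 - best) * y for x, y in zip(va, vb)]
        costs["(%s*%.1f+%s*%.1f)" % (a, best, b, 1 - best)] = merged
    (name, vec), = costs.items()
    return name, vec
```

As the note says, this is not globally optimal; it just gives a sane starting point, and it sidesteps the collinearity that breaks multiple linear regression by only ever fitting one weight at a time.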
   We may want to eliminate some nearly identical cost functions (like
   taking only one of the pred_nb11 cost functions) to reduce the
   number of parameters to tweak. After building the tree and getting
   weights for the cost functions, we may want to eliminate cost
   functions that end up with very low weight, since they may just be
   fitting noise.

6) Make it easy to add new cost functions to the mix later on. I am
   hopeful that predicting str4 will be useful, and that predicting
   Hbonds (by extraction from alignments and by conversion from n_sep,
   o_sep, n_notor2, and o_notor2 alphabets) will be useful.

7) For neural-net predicted cost functions, test a new version that
   makes the cost be -log P(observed|prediction)/P(observed|background),
   rather than just -log P(observed|prediction), to compensate for
   different bin sizes. This should not affect the burial alphabets
   much (as they were constructed to have near-uniform backgrounds),
   but should help the secondary-structure predictions.

8) Test and compare two MQA methods:
   1) using our optimized cost function.
   2) a meta-server that uses the cost function to weight the
      different server models, then creates a new cost function based
      on extracting info from the server models. (This could be C-beta
      constraints, like Martin P. is using, helix and sheet
      constraints, Hbond constraints, or even rmsd between models.)

--------------------------------------------------------------------------------
Sat Sep 29 13:42:33 PDT 2007 Kevin Karplus

Martin Paluszewski provided constraints-all and constraints-optimized
files for constraints extracted from alignments. constraints-all
includes all the C-beta distance constraints that he has extracted
from the alignments. constraints-optimized includes only selected
C-beta constraints, attempting to maximize the sum of the weights of
the contacts and the probability of seeing that many sep>=9 contacts
for that residue (using the CB8-sep9 prediction).
The constraints-optimized cost function is nearly as good as the
pred-nb11 cost functions (assuming the average doesn't change much
when T0305 is included). If gdt-tau is used instead of real_cost-tau,
then the constraints-optimized constraints actually do slightly better
than the pred_nb11 cost functions.

Martin P. says "Also I should mention that they do not include T0305
because it crashes on the cluster. I haven't looked at the reason why,
but it seems to be a scwrl problem." I'm a bit surprised at this, as
no one else has had trouble with T0305.

--------------------------------------------------------------------------------
Sat Nov 17 13:44:15 PST 2007 John Archie

I am creating a new file called everything.costfcn to contain one of
each cost function. This evaluation will be used for testing my cost
function optimizer.

I noticed that RealCost cost functions are defined in a lot of the
*.costfcn files in this directory. It is worth noting that these are
simply ignored with the current CASP7 Makefile. Instead, for
performance reasons, only the non-cheating cost functions are
evaluated, and these data are merged with precomputed cheating cost
function data.

Finally, I replaced all instances of "cfneval.pl" in the Makefile with
"./cfneval.pl". Not everyone has "." in their path.

--------------------------------------------------------------------------------
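The log-ratio cost proposed in item (7) of Kevin's to-do list above is a one-liner; the sketch below just makes the bin-size compensation concrete (the probability values in the test are hypothetical, not from any actual predictor):

```python
import math

def log_ratio_cost(p_obs_given_pred, p_obs_given_background):
    """Item 7's proposed cost: -log [P(obs|pred) / P(obs|background)]
    rather than -log P(obs|pred).  When the observed bin is no more
    likely under the prediction than under the background, the cost is
    zero, so large background bins are no longer penalized merely for
    being large."""
    return -math.log(p_obs_given_pred / p_obs_given_background)
```

For burial alphabets with near-uniform backgrounds the two costs differ only by a constant per position, which is why the note expects the change to matter mainly for secondary-structure predictions.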