Assessment of SAM-T04 predictions for CASP6

(Last Update: 03:49 PDT 29 July 2005 )


CASP (the critical assessment of structure prediction) is a biennial worldwide assessment of the state of protein structure prediction. Our group at UCSC has always done well in it (starting with our first attempt in CASP2). The CAFASP experiments are parallel experiments using the same target proteins, but only allowing fully automatic predictions by servers.
The official site for the CASP6 experiment. For assessment results, see In those talks we were referred to as either SAM-T04-hand or group 166.
Unofficial rankings of servers and some human predictors using TM score for CASP6. In these rankings, the SAM-T04-hand group ranked (as of 16 Dec 2004)
Targets/domains   Rank (using 1st model)   Rank (best of 5 models)
87 All            8                        6
25 CM easy        21                       4
18 CM hard        9                        6
19 FR/H           10                       12
15 FR/A           7                        7
10 NF             2                        6

Our old SAM-T02 server did not do as well:
Targets/domains   Rank (using 1st model)   Rank (best of 5 models)
87 All            69                       56
25 CM easy        59                       62
18 CM hard        48                       43
19 FR/H           64                       59
15 FR/A           113                      96
10 NF             122                      93

The ancient (and now obsolete) SAM-T99 server did even worse:
Targets/domains   Rank (using 1st model)   Rank (best of 5 models)
87 All            102                      104
25 CM easy        88                       80
18 CM hard        72                       74
19 FR/H           120                      121
15 FR/A           155                      153
10 NF             144                      139

CAFASP4 was not as important in this round, as servers were evaluated as part of the main CASP evaluation, without special consideration. Only a few servers chose to participate in CAFASP and not CASP (the servers of the CAFASP organizers). In the CAFASP4 evaluations (as of 22 Nov 2004), the old SAM-T02 server ranked 48/85 on the easier HM targets and 46/83 on the harder FR targets.

Personal assessment by Kevin Karplus:

Two groups were clearly superior to us (Ginalski's and David Baker's) and there were 3 or 4 other groups whose results were comparable to ours (Kolinski+Bujnicki, Skolnick-Zhang, GeneSilico). The exact list of the "comparable" groups depends on whether you put more weight on homologous modeling or non-homologous modeling.

The SAM-T02 server is getting dated: it was only in the middle of the pack for servers. The SAM-T99 server is positively ancient, and its poor performance was to be expected.

I've looked briefly at the CASP-supplied GDT plots for all the targets and tried to assess how well we did relative to other groups for each target. The notes on this are in comparison-with-others.

I have also done a "smooth GDT" evaluation of all my models, including unsubmitted ones. The summary for all models is in gdt.summary and the list of GDT score, smooth GDT score, and RMSD for each model is in Txxx/decoys/evaluate.rdb (or evaluate_1.rdb, evaluate_2.rdb, for single-domain evaluations). Smooth GDT curves for a particular target can be created with gnuplot, using the script in the subdirectory of the target.

The evaluate.rdb files have been updated to include log of RMSD scores and "clens"---a new contact-based evaluation function. The GDT and smooth_GDT scores are essentially interchangeable, with
smooth_GDT =approx 0.9423 * GDT,
but when the models are really bad, smooth_GDT/GDT is slightly larger---scattered around 1.

One can predict the log RMSD and log RMSD_CA scores from the GDT and clens scores:
log_RMSD =approx 2.8864 * clens + 0.116975
log_RMSD_CA =approx 3.27581 * clens - 0.280389
log_RMSD =approx -0.0298756 * GDT + 3.50611
log_RMSD_CA =approx -0.0354623 * GDT + 3.66133
log_RMSD =approx -0.0317885 * smooth_GDT + 3.50091
log_RMSD_CA =approx -0.0379614 * smooth_GDT + 3.66936

Looking at RMSD values of the fit when the clens or GDT values are perfect gives us an idea of the lower limit of resolution for the evaluation method.
For clens=0, RMSD=approx 1.124 and RMSD_CA=approx 0.7555
For GDT=100, RMSD=approx 1.6796 and RMSD_CA=approx 1.1220
For smooth_GDT=100, RMSD=approx 1.3800 and RMSD_CA=approx 0.8809
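
The limiting values above can be reproduced by evaluating each linear fit at the perfect score and exponentiating. The following is a minimal Python sketch, assuming the logs are natural logarithms (an assumption, though it does reproduce the quoted numbers). The names FITS, PERFECT, predicted_rmsd, and resolution_limits are made up for this illustration and are not part of the SAM tools.

```python
import math

# (slope, intercept) pairs copied from the linear fits quoted above.
# Keys are (predicted quantity, predictor score).
FITS = {
    ("log_RMSD", "clens"): (2.8864, 0.116975),
    ("log_RMSD_CA", "clens"): (3.27581, -0.280389),
    ("log_RMSD", "GDT"): (-0.0298756, 3.50611),
    ("log_RMSD_CA", "GDT"): (-0.0354623, 3.66133),
    ("log_RMSD", "smooth_GDT"): (-0.0317885, 3.50091),
    ("log_RMSD_CA", "smooth_GDT"): (-0.0379614, 3.66936),
}

# Score value that corresponds to a perfect model for each predictor.
PERFECT = {"clens": 0.0, "GDT": 100.0, "smooth_GDT": 100.0}


def predicted_rmsd(target, predictor, score):
    """Predict RMSD (not log RMSD) from a score; natural log is assumed."""
    slope, intercept = FITS[(target, predictor)]
    return math.exp(slope * score + intercept)


def resolution_limits():
    """RMSD predicted by each fit at the perfect score of its predictor."""
    return {
        (target, predictor): predicted_rmsd(target, predictor, PERFECT[predictor])
        for (target, predictor) in FITS
    }


if __name__ == "__main__":
    for (target, predictor), rmsd in resolution_limits().items():
        # target[4:] strips the leading "log_" from the quantity name.
        print(f"perfect {predictor}: {target[4:]} =approx {rmsd:.4f}")
```

Because the fits are linear in log RMSD, the limit for clens is just the exponential of the intercept, while for GDT and smooth_GDT the slope times 100 is added first.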

The only apparent advantage that smooth_GDT has over GDT is that it allows detection of smaller differences in very good predictions, but clens is even more sensitive to such small errors. Unfortunately, some of the outliers for clens vs. GDT are not very promising for clens: on very short sequences that lack a core (such as the 24 residues of T0229_1) the clens evaluation seems overly pessimistic. I still need to look at other outliers to see if clens or GDT is a better measure of quality for them. I also need to run evaluations on other people's predictions, since there may be more extreme outliers there (clens may be more sensitive to overcompaction than GDT, for example).

The linear fits for log_RMSD_CA and log_RMSD are closest for GDT, very slightly worse for smooth_GDT, and quite a bit worse for clens. The only advantage clens seems to have so far is that it is fast and deterministic, not requiring sampling superpositions---it is computed by comparing distance maps.
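
To illustrate why a distance-map comparison needs no superposition: the map of pairwise distances is unchanged by any rotation or translation of the model. The sketch below only demonstrates that invariance on made-up coordinates. It is not the clens formula, which this page does not give, and the helper names are hypothetical.

```python
import math


def distance_map(coords):
    """All pairwise Euclidean distances between 3D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def rigid_move(coords, angle, shift):
    """Rotate about the z axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


def max_map_difference(map_a, map_b):
    """Largest absolute entry-wise difference between two distance maps."""
    return max(
        abs(a - b)
        for row_a, row_b in zip(map_a, map_b)
        for a, b in zip(row_a, row_b)
    )


# Toy coordinates (made up for illustration, not from any CASP target).
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0), (9.0, 4.1, 1.2)]
moved = rigid_move(model, angle=1.1, shift=(10.0, -4.0, 2.5))

# The two maps agree up to floating-point rounding, so a score built on
# distance maps does not depend on how the model is placed in space.
difference = max_map_difference(distance_map(model), distance_map(moved))
```

A superposition-based score such as GDT, by contrast, has to search over placements of the model on the target, which is where the sampling comes in.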

If we do a non-linear fit of GDT from clens, we get a pretty good fit with
GDT=approx -123.664*clens^3+125.779*clens^2-100.631*clens+100.888
(restricting the fit to models longer than 40 residues). Even with this non-linear scaling of clens, clens is not quite as good a predictor of RMSD as GDT is.
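
The cubic fit above can be evaluated directly. This is a minimal Python sketch using the quoted coefficients; the function name gdt_from_clens_cubic is made up for illustration, and the assumption that clens runs roughly from 0 (perfect) to 1 is inferred from the fits on this page rather than stated by it.

```python
def gdt_from_clens_cubic(clens):
    """Approximate GDT from a clens score using the cubic fit quoted above.

    The fit was restricted to models longer than 40 residues, so it should
    not be trusted for shorter ones.
    """
    # Horner form of -123.664*x^3 + 125.779*x^2 - 100.631*x + 100.888
    return ((-123.664 * clens + 125.779) * clens - 100.631) * clens + 100.888


if __name__ == "__main__":
    for clens in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"clens = {clens:.2f}  GDT =approx {gdt_from_clens_cubic(clens):.2f}")
```

With these coefficients the curve decreases steadily as clens grows, so a lower clens always maps to a higher predicted GDT.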

The non-linear fit can be reduced to a one-parameter fit:
GDT =approx 100/b *(1-x)*(x^2+b)
with x = clens and b=approx 0.757291. The same curve for smooth_GDT can be fit with b=approx 0.884582.

Even better fits for smooth_GDT are with smooth_GDT =approx 100/c *(1-x)*(x^3+c)
for c=approx 0.6485, but this form is not as good a fit for GDT as the x^2 form.
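
Both one-parameter forms are pinned to 100 at x=0 and to 0 at x=1 by construction, whatever the value of the parameter. Here is a minimal Python sketch of the two forms, assuming x is the clens score; the function names are made up for illustration.

```python
def one_param_quadratic(x, b):
    """100/b * (1 - x) * (x^2 + b), the form fitted to GDT and smooth_GDT."""
    return 100.0 / b * (1.0 - x) * (x * x + b)


def one_param_cubic(x, c):
    """100/c * (1 - x) * (x^3 + c), the form that fits smooth_GDT better."""
    return 100.0 / c * (1.0 - x) * (x ** 3 + c)


# Parameter values quoted above.
B_GDT = 0.757291
B_SMOOTH_GDT = 0.884582
C_SMOOTH_GDT = 0.6485


def gdt_from_clens(x):
    """GDT predicted from clens with the quadratic form."""
    return one_param_quadratic(x, B_GDT)


def smooth_gdt_from_clens(x):
    """smooth_GDT predicted from clens with the cubic form."""
    return one_param_cubic(x, C_SMOOTH_GDT)


if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"x = {x:.2f}  GDT =approx {gdt_from_clens(x):.2f}  "
              f"smooth_GDT =approx {smooth_gdt_from_clens(x):.2f}")
```

Multiplying out the quadratic form gives a cubic in x with coefficients -100/b, 100/b, -100, and 100, which is why a single parameter can stand in for the four coefficients of the unconstrained cubic fit.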


Questions about page content should be directed to Kevin Karplus
Biomolecular Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
318 Physical Sciences Building
