Mon Mar 31 17:51:41 PDT 2008 Kevin Karplus

This will be the directory for hand predictions and for model quality
assessment results.

CONTENTS:

Group names and identifiers
Directory structure
When the moai cluster gets wedged, what can we do?
Handling a homodimer (or homotrimer)
Building QA targets
Using undertaker as a meta-server
Analyzing the CASP8 results


Group names and identifiers
HUMAN groups:

SAM-T08-human 	4008-1775-0004		is the human prediction group
					for TS predictions only.
	Who should sign up: anyone who actually works on hand
	predictions this spring or summer.
SAM-T08-MQAO	3724-8702-5528  	MQA server using alignment only
	Additional members: Martin Paluszewski
SAM-T08-MQAU	7072-3475-1278		MQA server using alignment and other costfcn
	Additional members: Martin Paluszewski and John Archie
SAM-T08-MQAC	not registered yet	MQA server using alignment/undertaker/consensus
	Additional members: John Archie

The MQA groups should download a tarball and run scripts to evaluate
the models for *all* targets, not just the human-prediction subset.
It is possible to create an MQA server, but I don't think that the
effort is worth it, since we need to run most of a prediction before
we can do the MQA computation.

SERVER groups:

SAM-T02-server	7768-4665-5533
	No additional members.
SAM-T06-server	6290-5691-1801
	Additional members: George Shackelford
SAM-T08-server 	7957-1341-9349
	Additional members: George Shackelford and Grant Thiltgen
SAM-T08-2stage	9060-0201-6142
	Additional members: George Shackelford
SAM-T08-MQAC  	2165-1648-9790	
	Consensus MQA.
	Additional members: John Archie 
	

Directory structure

Fri Apr  4 15:08:43 PDT 2008 Kevin Karplus

The targets will each have their own directory, starting with T0387.
The directory starter-directory/ has the seeds for producing new
predictions. 

Within T0xxx, there will always be a subdirectory "decoys".
Underneath decoys there will be subdirectories for server predictions:
	SAM_T06		all files created for SAM-T06-server
	SAM_T08		all files created for SAM-T08-server
	server		the contents of the tarball from CASP of all servers

For those predictions identified as ones for hand prediction, the
T0xxx directory will also have all the work we do on the prediction,
which will include "metaserver" models built starting from the server
models, as well as our own models.	
	

When the moai cluster gets wedged, what can we do?

Mon Apr 14 18:04:09 PDT 2008 Kevin Karplus

To look for phantom batches that are running but have lost their
controlling directories (probably because the web server thought that
the job was finished), run
	pcem/scripts/find-phantom-batches
on moai.


To find out what farmer jobs are running on what machines, run
	do-all-condor-cluster 'ps -fwu farmer |grep  query'
on moai.
If some of the jobs there need killing, the process ids are the two
numbers after "farmer".
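
As a rough illustration of reading that output: in the standard "ps -f"
layout the columns are UID, PID, PPID, C, STIME, TTY, TIME, CMD, so the
two numbers after "farmer" are the process id and the parent process id.
A minimal Python sketch follows; the sample lines and command names are
invented for illustration and are not real output from the cluster.

```python
def farmer_pids(ps_lines, user="farmer"):
    """Return (pid, ppid) pairs from `ps -f` style lines owned by `user`."""
    pairs = []
    for line in ps_lines:
        fields = line.split()
        # ps -f columns: UID PID PPID C STIME TTY TIME CMD
        if (len(fields) >= 3 and fields[0] == user
                and fields[1].isdigit() and fields[2].isdigit()):
            pairs.append((int(fields[1]), int(fields[2])))
    return pairs


# Invented sample output, for illustration only.
sample = [
    "UID        PID  PPID  C STIME TTY          TIME CMD",
    "farmer   12345 12300  0 10:02 ?        00:00:01 sh query-wrapper",
    "farmer   12346 12345  3 10:02 ?        00:12:40 perl query-job",
]
print(farmer_pids(sample))
```

Killing the parent first is usually the safer order, since a wrapper
may otherwise respawn or orphan its child.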


Handling a homodimer (or homotrimer)

Mon May  5 14:33:46 PDT 2008 Kevin Karplus

Some targets (like T0387) are homomultimers.  Optimizing them as such
may help get the monomers right.  If the instructions below get
confusing, try looking at T0387/dimer as an example.

This method assumes that you have a pretty good monomer that you want
to dimerize based on a template with an existing dimer and then
optimize.  It is not intended for creating dimers from scratch.


1) run "make make_dimer"
   which creates 
   	a subdirectory "dimer/",
   	a target a2m file (double length), 
	a Makefile (with MONOMER_LENGTH set), 
	a costfcn-init.under file (with KnownBreak added, and
		   constraint sets coming from ../ files)
	all the local-structure .rdb files (doubled, with renumbering)

    Mon Jun  9 15:37:59 PDT 2008 Kevin Karplus
    I fixed the bug in generating dimer/costfcn-init.under, so that it
    should now have a KnownBreak command at the end with the first
    residue of the second copy, including the one-letter amino-acid
    code.
    
    I also added a make_trimer target which does essentially the same
    thing as make_dimer, but with a tripled copy.

2) In the dimer directory, do "make make-dimer.under" to get an
    initial dimerization script, which you will have to edit.
    You will need to replace the 1xxxA
    words with the monomers of your template, and the YYYY words with
    the name of the model you want to dimerize.
    
    This script needs to have a properly dimerized template to copy
    the positioning from and a monomer to dimerize.

3) Create an alignment file that has the target and copies of the
    best alignment.  For example, for T0284, we have
    T0284/1mumA/1mumA.dimer-a2m modified from
    T0284-1mumA-t04-local-str2+CB_burial_14_7-1.0+0.4+0.4-adpstyle5.a2m :

>T0284 PA4872, Pseudomonas aeruginosa PAO1, 287 res
MHRASHHELRAMFRALLDSSRCYHTASVFDPMSARIAADLGFECGILGGS
VASLQVLAAPDFALITLSEFVEQATRIGRVARLPVIADADHGYGNALNVM
RTVVELERAGIAALTIEDTLLPAQFGRKSTDLICVEEGVGKIRAALEARV
DPALTIIARTNAELIDVDAVIQRTLAYQEAGADGICLVGVRDFAHLEAIA
EHLHIPLMLVTYGNPQLRDDARLARLGVRVVVNGHAAYFAAIKATYDCLR
EERGAVASDLTASELSKKYTFPEEYQAWARDYMEVKE
>1mumA
sl------HSPGKAFRAALTKENPLQIVGTINANHALLAQRAGYQAIYLS
GGGVAAGSLGLPDLGISTLDDVLTDIRRITDVCSLPLLVDADIGFGsSAF
NVARTVKSMIKAGAAGLHIEDQVGAKRCGHrPNKAIVSKEEMVDRIRAAV
DAKTDPDFVIMARTDALAvEGLDAAIERAQAYVEAGAEMLFPEAITELAM
YRQFADAVQVPIlaNITEFGATPLFTTDELRSAHVAMALYPLSAFRAMNR
AAEHVYNVLRQegtqksVIDTMQTRNELYESINYYQYEEKLDNL------
farsqvk
>1mumB
sl------HSPGKAFRAALTKENPLQIVGTINANHALLAQRAGYQAIYLS
GGGVAAGSLGLPDLGISTLDDVLTDIRRITDVCSLPLLVDADIGFGsSAF
NVARTVKSMIKAGAAGLHIEDQVGAKRCGHrPNKAIVSKEEMVDRIRAAV
DAKTDPDFVIMARTDALAvEGLDAAIERAQAYVEAGAEMLFPEAITELAM
YRQFADAVQVPIlaNITEFGATPLFTTDELRSAHVAMALYPLSAFRAMNR
AAEHVYNVLRQegtqksVIDTMQTRNELYESINYYQYEEKLDNL------
farsqvk

4) In the dimer directory, make try1.costfcn, or copy a costfcn from
   the parent directory and edit it.

   If you want any constraints on the optimization, it is necessary to
   make multiple copies in the cost function, renumbering the
   constraints in the later chains (a real pain).  Alternatively, you
   can compute the constraints only on the first monomer.  If the
   monomers are identical, this should not cause any problems.
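
The doubling and renumbering described in the steps above can be
sketched in Python.  This is only an illustration of the bookkeeping:
the (res_i, res_j, distance) tuples are a hypothetical stand-in and
not undertaker's constraint syntax, and the constraint values are
made up.  The sequence fragment is the start of T0284.

```python
def double_sequence(monomer):
    """Return the dimer sequence plus the 1-based index and one-letter
    code of the first residue of the second copy (the chain break)."""
    dimer = monomer + monomer
    break_index = len(monomer) + 1
    return dimer, break_index, dimer[break_index - 1]


def renumber_for_chain(constraints, monomer_length, chain_index):
    """Shift residue numbers so the constraints refer to chain
    `chain_index` (0 for the first copy, 1 for the second, ...)."""
    offset = monomer_length * chain_index
    return [(i + offset, j + offset, d) for (i, j, d) in constraints]


monomer = "MHRASHHELR"                     # first 10 residues of T0284
dimer, break_index, break_residue = double_sequence(monomer)
constraints = [(2, 9, 6.5), (4, 7, 5.0)]   # made-up intra-chain constraints
all_constraints = constraints + renumber_for_chain(constraints, len(monomer), 1)
print(break_index, break_residue)
print(all_constraints)
```

For a trimer the same helper would be called again with chain_index 2.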
   
Once you have an acceptable dimer, you want to optimize it, keeping it
dimerized in roughly the same orientations.

If you read in a dimer with ReadConformPDB, be sure to mark it as a
dimer by following the read command with
	Multimer 2
as a separate command to label the dimer as a cyclic dimer.
Note: if the multimer is *not* cyclic then *don't* label it, as
undertaker will try to symmetrize it.

You can do the optimization as usual, but use "multimer 2" in the
OptConform arguments.  Any alignments (for fragments and the like) can
be gotten from the original monomeric runs.   You probably want to
reduce the duration of the run (by reducing num_gen, gen_size,
super_iter, and/or super_num_gen), because multimeric runs take longer
than monomeric ones.  You can also read the Template.atoms file from
the monomeric directory, avoiding duplicating that file.

You might want to turn off TweakMultimer at first if you are trying to pack a
tight interface, as it will tend to move monomers apart to reduce clashes.
But if you have a loose interface, you definitely want TweakMultimer
on to try to tighten up the interface.

It may be necessary to add some inter-chain constraints to hold the
dimer together.  Even without TweakMultimer on, undertaker may find a
way to alleviate clashes by moving parts of the dimer away from each
other as it did in try1 (of T0284/dimer).

Note: you don't always want "multimer 2" for a dimer or "multimer 4"
for a tetramer.  What the command (or option to OptConform) does is to
force the creation of a cyclic multimer.  That is the transform that
takes A to B will take B back to A for a dimer, or T(A->B) = T(B->C) =
T(C->D) = T(D->A) for a tetramer.  Not all multimers are cyclic!
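
The cyclic condition can be checked numerically.  The sketch below
assumes, purely for illustration, that the symmetry axis is the z axis
through the origin, so the chain-to-chain transform is a rotation by
360/n degrees; how undertaker represents the transform internally is
not described in these notes.

```python
import math


def rotation_z(degrees):
    """3x3 rotation matrix about the z axis."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply(matrix, point):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(matrix[r][k] * point[k] for k in range(3))
                 for r in range(3))


def returns_to_start(n_chains, point, tol=1e-9):
    """Apply the chain-to-chain transform n_chains times; for a cyclic
    multimer the point must come back to where it started."""
    t = rotation_z(360.0 / n_chains)
    p = point
    for _ in range(n_chains):
        p = apply(t, p)
    return all(abs(a - b) < tol for a, b in zip(p, point))


atom_in_chain_a = (3.0, 1.5, -2.0)   # arbitrary coordinates
print(returns_to_start(2, atom_in_chain_a))   # dimer: T applied twice
print(returns_to_start(4, atom_in_chain_a))   # tetramer: T applied four times
```

A transform with a translation component along the axis (a screw)
would fail this check, which is one way a multimer can be non-cyclic.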

You can still optimize non-cyclic multimers in undertaker, but you
must *not* use the multimer command or option to OptConform.  This
will cause each chain to be separately optimized but the "OptSubtree"
method will tend to rearrange the transformation between chains.

You can optimize a mixture of cyclic and non-cyclic dimers in
OptConform if they are initially labeled with Multimer commands and
OptConform has no "multimer" keyword (or, equivalently, "multimer 0").
If OptConform has "multimer 2" set, then all multimers will be set to
be cyclic dimers.

Note: you can do optimization of a tetramer with symmetry S_{2,2}
by telling OptConform to use "multimer 2".  You don't get the full
symmetry, but you will get some symmetry: chain A and chain B will be
independently optimized, but chain C and chain D will be copies of
chains A and B and T(AB->CD)= T(CD->AB).


NOTE: gromacs doesn't like big chain breaks, and it will not see the
multimer merged into a single chain as two chains.  To get gromacs to
optimize a multimer, you need to unpack the multimer into separate chains:
	cd casp7/T0332/dimer
	make decoys/T0332.try2-opt2.unpack.pdb.gz decoys/T0332.try2-opt2.unpack.gromacs0.pdb.gz

You can get this to happen for you automatically if you use
	cd casp7/T0332/dimer
	(make  T0332.mult2 >& do2.log; gzip -9f do2.log)&
instead of the monomer version
	(make  T0332.do2 >& do2.log; gzip -9f do2.log)&

Sat Jul  1 13:35:27 PDT 2006 Kevin Karplus

I made a small change to undertaker, adding
        force_alignment
        fragment_only

options to ReadFragmentAlignment, so that I could force undertaker to
treat the short fragments as being a complete alignment or not being
treated as an alignment at all (just fragments).  If neither option is
provided, then it is added to the alignment library only if it is
multiple fragments or a sufficiently long single fragment (something
like half the total protein length).
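
The decision rule just described can be written out as a small Python
sketch.  This is not undertaker's code: the 0.5 threshold stands in
for "something like half", and treating the two options as mutually
exclusive is an assumption of the sketch.

```python
def treat_as_alignment(fragment_lengths, protein_length,
                       force_alignment=False, fragment_only=False,
                       long_fraction=0.5):
    """Decide whether a fragment alignment goes into the alignment library.

    fragment_lengths: lengths of the aligned fragments in the file.
    long_fraction: assumed cutoff for a "sufficiently long" single fragment.
    """
    if force_alignment and fragment_only:
        # Assumption of this sketch: the two options conflict.
        raise ValueError("force_alignment and fragment_only both given")
    if force_alignment:
        return True
    if fragment_only:
        return False
    if len(fragment_lengths) > 1:
        return True
    return (len(fragment_lengths) == 1
            and fragment_lengths[0] >= long_fraction * protein_length)


# A single short fragment on a 287-residue target (the length of T0284).
print(treat_as_alignment([12], 287))
print(treat_as_alignment([12], 287, force_alignment=True))
```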

For multimers, you can include force_alignment in the
ReadFragmentAlignment command that specifies the multimer, to avoid
losing an alignment that has only a short piece aligned to show what
corresponds.

Sun Jun  1 16:45:25 PDT 2008 Kevin Karplus

The script in 
	T0413/dimer/make-dimer-chimera-try12.under 
shows how to use an existing dimeric model to build a dimeric model from
a different monomer.  A self-alignment file is needed, which can
either be the whole monomer (as in this example), or an alignment of
just residues in the dimeric interface.


Building QA targets

Tue May 20 09:14:00 PDT 2008  John Archie

Building QA targets can be done with
  % make qa_all

which creates, in addition to some intermediate evaluation files, three files

SAM-T08-MQAO.qa1 - the QA file using only the alignment-based constraints
SAM-T08-MQAU.qa1 - the QA file using all undertaker cost functions
		   (including the alignment based constraints)
SAM-T08-MQAC.qa1 - the QA file for all undertaker cost functions and
                   a consensus term

SAM-T08-MQAC.qa1 is likely the most reliable quality assessment method;
however, given the number of consensus-based methods in CASP8,
SAM-T08-MQAU.qa1 has a chance of being the best of our methods.

By default, "make qa_all" will try to pull files (alignments, neural net
predictions, etc.) from decoys/SAM_T08/; to change this, use the macro
QA_PREDICTDIR.  This option may be especially useful if the files in
the human-prediction directory might be more accurate:
  % make QA_PREDICTDIR=`pwd` qa_all

Note that QA_PREDICTDIR may be used in contexts where  the current working
directory is not the directory in which make was invoked--so please do not use
relative path names.

The QA files may be submitted with
  % make mail_qa_all
At the moment, only John A and Kevin have permission to submit.


Using undertaker as a meta-server

Wed May 21 10:44:47 PDT 2008 Kevin Karplus

We can use the server models as starting points for further optimization.
First, make the MQA assessments as above (with "make qa_all").

Then create scripts for reading in the top 10 models according to each
assessment method (with "make under_qa_all").  This creates
	SAM-T08-MQAC.read_under
	SAM-T08-MQAU.read_under
	SAM-T08-MQAO.read_under

I don't plan to use the MQAO selections, but the MQAU and MQAC ones
might be interesting, so I plan to do an optimization from each of
those sets separately.

Mon Jun  9 21:23:04 PDT 2008 Kevin Karplus

The under_qa_all make target also makes
	 metaserve-MQAC1.under metaserve-MQAU1.under
optimization scripts for the two sets.  The scripts are currently
set up to optimize the try1 costfcn, but for many targets we have
already found a better costfcn, so the scripts should probably be
edited to change the costfcn.

To run the scripts, you can do 'make run_metaservers', but it might be
better to do
	(make meta_MQAU1 >& MQAU1.log; gzip -9f MQAU1.log)&
	(make meta_MQAC1 >& MQAC1.log; gzip -9f MQAC1.log)&
each on a separate machine. If the workstations are busy, they can be
sent to the cluster with
	para-trickle-make -quick '(make meta_MQAC1 >& MQAC1.log; gzip -9f MQAC1.log)'
	para-trickle-make -quick '(make meta_MQAU1 >& MQAU1.log; gzip -9f MQAU1.log)'

Rescoring all models with one of the cost functions can be done with
"make decoys/score-all.try2.pretty" (replacing try2 with the name of
the desired costfcn file).


Analyzing the CASP8 results

Fri Nov  7 10:52:29 PST 2008 Kevin Karplus

model1-evaluate.rdb contains the full-length model evaluations, sorted by GDT on the whole
chain (not domains).  Note that model1 is either the SAM-T08-human
model (for human-predicted targets) or SAM-T08-server (for server-only targets).

It was created by using
	grep ^model1.ts */decoys/evaluate.rdb | sort -g +16
then pasting in the rdb header from an evaluate.rdb file,
editing the file to replace "/decoys/evaluate.rdb:" with a tab and
editing  the header to have a "target" column first.
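
The same table could be assembled in one step.  The Python sketch
below works on in-memory text rather than the real files; the column
names "name" and "GDT" are assumptions about the evaluate.rdb layout,
and the second RDB header line (the column definitions) is left out
for brevity.  The two GDT values are the ones quoted in these notes.

```python
def build_model1_table(files, header, sort_column):
    """Collect the model1.ts rows from several evaluate.rdb texts.

    files: dict mapping 'T0xxx/decoys/evaluate.rdb' to the file's text.
    header: list of column names (tab-separated in the RDB file).
    Returns tab-separated lines with a leading 'target' column,
    sorted numerically (ascending) by sort_column.
    """
    rows = []
    for path, text in files.items():
        target = path.split("/")[0]
        for line in text.splitlines():
            if line.startswith("model1.ts"):
                rows.append([target] + line.split("\t"))
    col = 1 + header.index(sort_column)   # +1 for the added target column
    rows.sort(key=lambda r: float(r[col]))
    return ["\t".join(["target"] + header)] + ["\t".join(r) for r in rows]


files = {
    "T0458/decoys/evaluate.rdb": "name\tGDT\nmodel1.ts\t96.2\nmodel2.ts\t90.0\n",
    "T0514/decoys/evaluate.rdb": "name\tGDT\nmodel1.ts\t18.1\n",
}
table = build_model1_table(files, ["name", "GDT"], "GDT")
print("\n".join(table))
```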

real_cost and GDT can be seen to be highly correlated across different
targets, but with a few outliers.


The easiest target (for us) was T0458, with GDT score of 96.2%
The hardest target was T0430, with GDT score of 5.67%.
(Oops, T0430 has the wrong REAL_PDB id.  I'll redo the evaluation with
the right id.)

Fri Nov  7 12:05:33 PST 2008 Kevin Karplus

I'm redoing evaluations for
T0492	typo in name
T0390	wrong chain of PDB model
T0420	typo in Makefile
T0430	wrong PDB file (reason unknown)

Fri Nov  7 13:13:09 PST 2008 Kevin Karplus

The hardest models for us were actually T0514 (GDT=18.1%) and T0466
(GDT=20%, but real_cost=320.6).

There were good models for T0514 (GDT 50.5% for pro-sp3-TASSER_TS2),
and we would have gotten
	SAM-T08-MQAC	39.5%	Zhang-Server_TS3
	SAM-T08-MQAU	50.5%	pro-sp3-TASSER_TS2
	SAM-T08-MQAO	17.7%	SAM-T08-server_TS2

T0466 was really harder, with the best server model being nFOLD3_TS5 (GDT
40.1%, real_cost 213.8).  The best I submitted was MQAU1-opt3 (as
model 5) with a GDT of 30% and real_cost of 270.2.

Mon Nov 10 12:59:03 PST 2008 Kevin Karplus

I looked at the refinement models a bit today.  It does not look like
I made any improvement over the initial models, nor were my model1
choices better than ones I favored less.  I don't think I should waste
my time on refinement, as I obviously can't do it.


I also collected the best-evalues from the human predictions into an
RDB file "best-evalues".  We knew that T0466 would be tough
(best-evalue 33.456), but there were some that looked tougher that
turned out to be not quite as hard (T0465 GDT=30.6% and T0496
GDT=23.4%).

Mon Nov 10 13:19:37 PST 2008 Kevin Karplus

Although a large E-value tends to indicate a poor model, this is not
invariable (T0471 has GDT 65% but E-value 1.12).  And a really small
e-value does not guarantee a good model: T0487 has E-value  6e-78 but
only 33% GDT. The problem there is multiple domains, which are
individually predicted ok, but which aren't assembled perfectly.

Wed Nov 12 10:13:51 PST 2008 Kevin Karplus

I joined the model1-evaluate+evalue table with the SAM-T08-server
table, so that I could see whether I improved on the server.  It looks
like there were 7 targets where I did substantially worse than the
server on GDT, and 17 where I did substantially better, so the human
input was worthwhile.  The difference is even more striking in real_cost.

But how much of that was really due to human input, and how much due
to metaservers?  I probably need to extract the MQAC-recommended
server model and perhaps the Zhang-server_TS1 model to see whether the
hand effort was worth anything.

Wed Nov 12 12:19:13 PST 2008 Kevin Karplus

row e_value ne "" < model1-SAM-T08-Zhang.rdb | histogram-from-rdb -compute 'GDT - Zhang_GDT' -bin 1 
indicates that my models averaged 1.29 worse GDT than Zhang's, but

row e_value ne "" < model1-SAM-T08-Zhang.rdb | histogram-from-rdb -compute 'real_cost - Zhang_real_cost' -bin 1 

indicates that my models averaged 1.66 better on real_cost.  Overall,
this is pretty much a wash---there were a few models that I did much
better on and a few that the Zhang-Server did much better on, but most
were pretty close.
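
For readers without the RDB tools, the computation behind those two
commands can be sketched in Python.  The column names e_value, GDT,
and Zhang_GDT come from the commands above; the rows are made up, and
the binning convention (floor to the lower bin edge) is an assumption,
not necessarily what histogram-from-rdb does.

```python
import math
from collections import Counter


def difference_histogram(rows, col_a, col_b, bin_width=1.0):
    """Mean and histogram of col_a - col_b over rows with an e_value.

    rows: list of dicts of strings, one per target.
    Returns (mean difference, Counter mapping bin lower edge to count).
    """
    # Same filter as: row e_value ne ""
    diffs = [float(r[col_a]) - float(r[col_b])
             for r in rows if r["e_value"] != ""]
    bins = Counter(math.floor(d / bin_width) * bin_width for d in diffs)
    return sum(diffs) / len(diffs), bins


# Made-up rows, for illustration only.
rows = [
    {"target": "A", "e_value": "1e-5", "GDT": "60.0", "Zhang_GDT": "62.5"},
    {"target": "B", "e_value": "0.3",  "GDT": "45.0", "Zhang_GDT": "44.5"},
    {"target": "C", "e_value": "",     "GDT": "70.0", "Zhang_GDT": "50.0"},
    {"target": "D", "e_value": "2.0",  "GDT": "30.0", "Zhang_GDT": "31.0"},
]
mean, bins = difference_histogram(rows, "GDT", "Zhang_GDT")
print(mean)
print(sorted(bins.items()))
```

The same function applies to real_cost by passing "real_cost" and
"Zhang_real_cost" as the two columns.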

The models I did much better than the Zhang-Server_TS1 on GDT (10%
more on GDT) were 
T0394	(SAM-T08-server, since server-only)
T0472.  

The models I did much worse (10% lower) than the Zhang-Server_TS1 on GDT were
T0462
T0468
though the SAM-T08-server did worse on
T0398
T0400
T0404
T0408
T0432
T0486
T0512

For real_cost, the models I did much worse (70 points) than the Zhang server were
T0468
though the SAM-T08-server did worse on 
T0398
T0400
T0404
T0408
T0432
T0486
T0509
T0512

The ones I did much better on were
T0472
T0492
T0495
though SAM-T08-server did better on
T0394
T0504

Overall, it is looking like I made some improvements on the
Zhang-server, but not by a lot.  I'll have to see how each of the MQA
methods did compared to my hand predictions, to see if I added
anything with all my hard work.  So far, only T0495 looks like I made
any real improvement over the server models.

For T0468, I got the topology of the sheet wrong, and the MQA methods
would have done much better.


Thu Nov 20 22:04:57 PST 2008 John Archie

I'm copying all of the old MQA RDB files out of the way to make way
for the evaluation versions, which include columns for the real cost
against the experimental structure:

foreach t (T0???)
    mv $t/decoys/servers.evaluate.everything.rdb \
       $t/decoys/servers.evaluate.everything.rdb.casp
    mv $t/decoys/similarity.servers.evaluate.everything.rdb \
       $t/decoys/similarity.servers.evaluate.everything.rdb.casp
end


Mon Dec 22 18:35:18 PST 2008 Kevin Karplus

Eyeballing the SAM-T08-server curves on http://predictioncenter.org/casp8/results.cgi
which ones are particularly good or bad?

T0389_1	bad
T0393_2	bad
T0398_1	bad
T0398_2	bad
T0407_1	bad	(SAM-T08-human still moderately bad, SAM-T06-server beats both)
T0407_2	bad	(SAM-T08-human still bad)
T0419_1	bad
T0419_2	bad
T0462_2	bad
T0476_1	bad
T0482_1	bad
T0487_4	bad	(SAM-T08-human quite good)
T0498_1	bad	(SAM-T08-human still bad)
T0501_2	bad
T0512_1	bad
T0513_2	bad
T0514_1	bad

T0393	moderately bad
T0404_1 moderately bad
T0432_1	moderately bad
T0434_1	moderately bad
T0448_1	moderately bad
T0464_1	moderately bad	(SAM-T08-human moderately good)
T0468_1	moderately bad	(SAM-T08-human bad)
T0471_1	moderately bad	(SAM-T08-human moderately good)
T0477_1	moderately bad
T0487_1	moderately bad
T0504_3	moderately bad

T0397_2 moderately good
T0473_1	moderately good
T0504_2	moderately good

T0394_1	good
T0395_1 good
T0435_1	good
T0443_2	good
T0443_3	good
T0446_2	good
T0461_1	good
T0472	good (but individual domains bad)
T0478_1	good
T0480_1	good
T0487_2	good
T0489_1	good
T0495_1	good	(SAM-T08-human good for most residues)
T0496_2	good
T0501_1	good
T0504_1	good
T0510	good (but individual domains only moderately good)