From SMian@lbl.gov  Fri May 26 09:37:49 2000
Return-Path: <SMian@lbl.gov>
Sender: saira@lbl.gov
Date: Fri, 26 May 2000 09:37:47 -0700
From: Saira Mian <SMian@lbl.gov>
X-Accept-Language: en
To: karplus@cse.ucsc.edu
Subject: Prior weight/RSCB functional category/FIMs
Content-Type: text/plain; charset=us-ascii

Dear Kevin,

  Whilst working on t87, I was thinking about techniques that could
minimise the amount of human input necessary and could be automated. 

  The first idea is to generalise a model by progressively increasing
the contribution of the priors (sequence weighting is still used). This
should have a greater impact on families with large numbers of sequences
(such as t87) where the data dominate the parameter estimates at the
present time. This approach doesn't require model training/building -
only estimating models from alignments and then rescoring. I haven't had
a chance to sit down and work this out more formally, but given a large
family, selecting 5-10 "diverse" sequences generates a more effective
model than using all the sequences (even without sequence weighting).
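
  A minimal sketch of the prior-weighting idea for a single match
column, assuming one Dirichlet component with pseudocounts `alpha`
(the priors in SAM are Dirichlet mixtures, so this is only a
simplification) and a hypothetical `prior_weight` multiplier that is
raised step by step before rescoring:

```python
def posterior_mean(counts, alpha, prior_weight=1.0):
    """Posterior-mean residue probabilities for one match column.

    counts: weighted residue counts observed in the column
    alpha: Dirichlet pseudocounts (single component; an assumption)
    prior_weight: hypothetical multiplier applied to the pseudocounts
    """
    scaled = [prior_weight * a for a in alpha]
    total = sum(counts) + sum(scaled)
    return [(c + s) / total for c, s in zip(counts, scaled)]


# Toy 4-letter alphabet: 98 of 100 weighted counts agree at this column.
counts = [98.0, 2.0, 0.0, 0.0]
alpha = [0.25, 0.25, 0.25, 0.25]

# Raising the prior weight flattens the estimate, i.e. generalises the model.
for w in (1.0, 10.0, 100.0):
    probs = posterior_mean(counts, alpha, w)
    print(w, [round(p, 3) for p in probs])
```

With many sequences the counts swamp the pseudocounts at weight 1, which
is why the effect should be largest for families like t87.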

  The second thought has to do with selecting possible hits based on
assigned function. RCSB has a record that indicates
biochemical/biological function e.g. 2FOK is "Classification: Nucleic
Acid Recognition". If the target can be assigned to one of the
categories that RCSB has defined, then if it fails the first round of
fold recognition, then the next best guess would be the best hit to
structures in the same category.
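
  A rough sketch of that fallback step, assuming each hit carries a
classification string and that a lower score is better (as with an
E-value). The template names, scores and the helper
`best_hit_in_category` are placeholders for illustration, not real
results:

```python
def best_hit_in_category(hits, target_category):
    """Return the best-scoring hit whose classification matches the target's.

    hits: iterable of (template_id, score, classification) tuples; a lower
    score is assumed to be better. Returns None if no hit matches.
    """
    wanted = target_category.strip().lower()
    same = [h for h in hits if h[2].strip().lower() == wanted]
    return min(same, key=lambda h: h[1]) if same else None


# Placeholder hit list (not real fold-recognition output).
hits = [
    ("templateA", 0.8, "Hydrolase"),
    ("templateB", 2.5, "Nucleic Acid Recognition"),
    ("templateC", 1.9, "nucleic acid recognition"),
]

# templateA scores best overall, but it is in a different category.
print(best_hit_in_category(hits, "Nucleic Acid Recognition"))
```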

  The third idea is to construct a motif-based HMM for a large sequence
target family (Meta-MEME idea). One of the issues this raises is whether
it is better to connect motifs with internal FIMs or insert states. The
scoring for each type of model would be different but it might be worth
trying both approaches. For example, there is one potential structural
homologue of T87 that has a >200 residue insertion.

	-saira

-- 
I. Saira Mian
Life Sciences Division (Mail Stop 74-197)  E-mail: SMian@lbl.gov
Lawrence Berkeley National Laboratory      Tel:    (510) 486-6216
1 Cyclotron Road                           Fax:    (510) 486-6949
Berkeley, California 94720


From karplus@cse.ucsc.edu  Fri May 26 09:54:08 2000
Return-Path: <karplus@cse.ucsc.edu>
Date: Fri, 26 May 2000 09:54:03 -0700
From: Kevin Karplus <karplus@cse.ucsc.edu>
To: SMian@lbl.gov
CC: karplus@cse.ucsc.edu
In-reply-to: <392EA85B.22DC233E@lbl.gov> (message from Saira Mian on Fri, 26
	May 2000 09:37:47 -0700)
Subject: Re: Prior weight/RSCB functional category/FIMs


The SAM-T** series do more or less the opposite of what you suggest.
Right now the prior used on the first iteration is fairly stiff, and
the priors get weaker for later iterations.  The idea is that gaps
should be moderately cheap when we have little data, but as we get
more sequences, we can make gaps more expensive---except where they
have already been seen.

The model-building method we used to create the HMM used for searching
PDB is in the script w0.5.  It thins the alignment to include only
sequences of <80% residue identity to existing sequences in the
alignment before building the model.  One could thin more aggressively
if the set was very diverse, but 80% seems to be a good compromise
over the range of different diversities that we see.  For T0087, the
make.log file reports that 6 of the 40 sequences were dropped because
of having > 80% identity.
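
A minimal sketch of this kind of thinning, assuming a greedy pass in
input order and identity measured over columns where both rows have a
residue.  This is an illustration only, not the uniqueseq
implementation; `fraction_identical` and `thin_alignment` are
hypothetical names:

```python
def fraction_identical(a, b, gap="-"):
    """Fraction of identical residues over columns where both rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def thin_alignment(rows, max_identity=0.8):
    """Greedily keep rows that are below max_identity to every row kept so far."""
    kept = []
    for row in rows:
        if all(fraction_identical(row, k) < max_identity for k in kept):
            kept.append(row)
    return kept


# Toy aligned rows; the second is 90% identical to the first and is dropped.
rows = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDQFGWIRL", "MCD-FGWYKL"]
kept = thin_alignment(rows)
print(len(rows) - len(kept), "of", len(rows), "rows dropped")
```

The result of a greedy pass depends on the order of the rows, so the
sequence that should always survive (the target) belongs first.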

The "uniqueseq" program can be used to thin alignments, and the
"build-weighted-model" script called by w0.5 has parameters for the
things we most often want to change when building HMMs from alignments
(including max fraction identical).  The parameters are documented in
the source code: /projects/compbio/bin/scripts/build-weighted-model

I agree that functional information can be useful in choosing among
weak hits.  Unfortunately, I lack the chemical expertise necessary to
recognize when two differently described functions are likely to be
similar or use similarly structured proteins.  The classification
schemes for different databases seem to me to be quite different, so
it is very difficult for me to see whether or not there is a match.
Writing an automatic script will be difficult, when even manual
matching is too hard for me.  (I could write simple keyword matching,
but I doubt that it would do much good.)  For some of the targets this
time it seems that there is no known function, which makes it even
harder. 

You have had very good success with building motif-based HMMs.  I have
not been particularly successful at creating them automatically, which
is why I've stuck with the full-sequence model.  If you think a
motif-based HMM might work better for one of the targets (like t87),
feel free to try it.  I suspect that FIMs are more useful when
insertions can be quite long, and that regular insert nodes are better
when the insertions are short.


From SMian@lbl.gov  Fri May 26 14:30:39 2000
Return-Path: <SMian@lbl.gov>
Sender: saira@lbl.gov
Date: Fri, 26 May 2000 14:30:37 -0700
From: Saira Mian <SMian@lbl.gov>
X-Accept-Language: en
To: Kevin Karplus <karplus@cse.ucsc.edu>
Subject: Re: Prior weight/RSCB functional category/FIMs
Content-Type: text/plain; charset=us-ascii

Dear Kevin,
 
Kevin Karplus wrote:
> 
> The SAM-T** series do more or less the opposite of what you suggest.
> Right now the prior used on the first iteration is fairly stiff, and
> the priors get weaker for later iterations.  The idea is that gaps
> should be moderately cheap when we have little data, but as we get
> more sequences, we can make gaps more expensive---except where they
> have already been seen.

  I agree - this is a good way to find "easy matches" that are
significant. My comments were aimed at what to do if SAM-T00 doesn't find
anything, i.e. the final SAM-T00 HMM can be used to generate an alignment
and this alignment used to reestimate a model (with modelfromalign) in
which the prior has a higher weight.

> The model-building method we used to create the HMM used for searching
> PDB is in the script w0.5.  It thins the alignment to include only
> sequences of <80% residue identity to existing sequences in the
> alignment before building the model.  One could thin more aggressively
> if the set was very diverse, but 80% seems to be a good compromise
> over the range of different diversities that we see.  For T0087, the
> make.log file reports that 6 of the 40 sequences were dropped because
> of having > 80% identity.

  I think it's not just a matter of dropping sequences above a given
global sequence identity. The thinning should focus on keeping
those that differ in the most highly conserved positions (peaked match
state probs) i.e. it's the distribution and not the total number of
variations that is more effective at generalising a model. Thus, given
two sequences with equal similarity but one which replaces a highly
conserved residue, then this is the one that should be kept (in general,
this is equivalent to keeping trypanosomal or other unusual species).
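
  One possible way to score this, as a sketch only: measure how
conserved each column is and credit a sequence for every conserved
column where it departs from the consensus. The conservation measure
(one minus normalised entropy) and the names `column_conservation` and
`novelty` are assumptions made for illustration:

```python
import math
from collections import Counter


def column_conservation(rows, gap="-"):
    """Per-column (consensus residue, conservation in [0, 1]).

    Conservation is 1 - entropy / log(20), a hypothetical choice of measure.
    """
    profile = []
    for col in zip(*rows):
        counts = Counter(c for c in col if c != gap)
        n = sum(counts.values())
        if n == 0:
            profile.append((gap, 0.0))
            continue
        entropy = -sum((k / n) * math.log(k / n) for k in counts.values())
        consensus = counts.most_common(1)[0][0]
        profile.append((consensus, 1.0 - entropy / math.log(20)))
    return profile


def novelty(row, profile, gap="-"):
    """Sum of conservation over columns where the row departs from consensus."""
    return sum(cons for c, (res, cons) in zip(row, profile)
               if c != gap and c != res)


# Toy alignment: column 0 is strongly conserved, column 4 is variable.
rows = ["GACDE", "GACDF", "GACDH", "GACDK", "GACDE", "WACDE"]
profile = column_conservation(rows)

# Both rows have one substitution, but only "WACDE" hits a conserved column.
print(novelty("WACDE", profile), novelty("GACDF", profile))
```

  Given two candidates of equal global identity, the one with the
higher score would be the one to keep.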

> The "uniqueseq" program can be used to thin alignments, and the
> "build-weighted-model" script called by w0.5 has parameters for the
> things we most often want to change when building HMMs from alignments
> (including max fraction identical).  The parameters are documented in
> the source code: /projects/compbio/bin/scripts/build-weighted-model

  See comments above.

> I agree that functional information can be useful in choosing among
> weak hits.  Unfortunately, I lack the chemical expertise necessary to
> recognize when two differently described functions are likely to be
> similar or use similarly structured proteins.  The classification
> schemes for different databases seem to me to be quite different, so
> it is very difficult for me to see whether or not there is a match.
> Writing an automatic script will be difficult, when even manual
> matching is too hard for me.  (I could write simple keyword matching,
> but I doubt that it would do much good.)  For some of the targets this
> time it seems that there is no known function, which makes it even
> harder.

  As a first automation step, just a match to the RCSB constrained
vocabulary would be reasonable (of course, this is not so important for
CASP but might be more handy for large-scale analysis). 
 
> You have had very good success with building motif-based HMMs.  I have
> not been particularly successful at creating them automatically, which
> is why I've stuck with the full-sequence model.  If you think a
> motif-based HMM might work better for one of the targets (like t87),
> feel free to try it.  I suspect that FIMs are more useful when
> insertions can be quite long, and that regular insert nodes are better
> when the insertions are short.

  Since the sequences from a SAM-T00 run that didn't provide a hit are
available, they can be used as input for something like MEME to generate
a shorter model. This motif-based approach will help most when there is
a basic core that is elaborated upon using insertions of vastly
different lengths. One of the fundamental theoretical problems with the
current HMM topology is that it can't handle insertions of greatly
varying lengths. 
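
  One way to see the length problem, assuming the standard topology
with a single self-looping insert state per node (the symbols are
introduced here for illustration: a_MI, a_II and a_IM are the
match-to-insert, insert-to-insert and insert-to-match transition
probabilities, and l is the insertion length):

```latex
P(\ell) = a_{MI}\, a_{II}^{\ell-1}\, a_{IM}, \qquad \ell \ge 1
\qquad\Longrightarrow\qquad
-\log P(\ell) = -\log\!\left(a_{MI}\, a_{IM}\right) - (\ell - 1)\log a_{II}
```

  The cost grows linearly with l, so a single a_II fitted to short
insertions makes a >200 residue insertion very expensive, while one
fitted to long insertions barely penalises any length. As I understand
it, a FIM is meant to sidestep this by making the per-residue cost of
the insertion negligible.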

	-saira
-- 
I. Saira Mian
Life Sciences Division (Mail Stop 74-197)  E-mail: SMian@lbl.gov
Lawrence Berkeley National Laboratory      Tel:    (510) 486-6216
1 Cyclotron Road                           Fax:    (510) 486-6949
Berkeley, California 94720