SAM-T99 Automatic server

OVERVIEW

The SAM_T99 results, available from the CAFASP2 meta-server
(http://cafasp.bioinfo.pl/), were obtained from the SAM-T99 web server
(http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html).

Fold recognition by the SAM-T99 server used the SAM-T99 method, run
with SAM version 3.1 [1].  SAM-T99 is a refinement of SAM-T98 [3],
which this group developed for CASP3 [7] and tested for superfamily
recognition [4].  Both methods attempt to find and multiply align a
set of homologs to a given sequence, then create an HMM from that
multiple alignment.

The SAM-T99 method generates a multiple alignment which is used for
two further tasks: fold recognition and secondary-structure prediction.


SAM-T99 ALIGNMENT GENERATION

The initial step uses BLASTP to search NCBI's non-redundant protein
database (NR) with two different thresholds, producing a set of very
close homologs and a set of possible homologs.

The method then uses multiple iterations of a selection, training, and
alignment procedure.  Each iteration involves an initial alignment, a
set of search sequences, a threshold value, and a transition
regularizer.

The first iteration uses the target sequence as the initial alignment
and the close homologs found by BLASTP as the search set.  The
threshold is set very strictly, so that only good
matches to the sequence are considered.  This iteration uses a
transition regularizer that was intended to match the gap costs used
by BLASTP.

On subsequent iterations the input alignment is the output from the
previous iteration, the search set is the larger set of possible
homologs found by BLASTP, and the thresholds are gradually loosened.
The second through second-from-last iterations use a ``long-match''
transition regularizer, and the final iteration uses a transition
regularizer trained on FSSP alignments.
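
The iteration schedule described above can be sketched as follows.
This is only an illustration of how the search set and regularizer
change from one iteration to the next.  The class name, the string
labels, the number of iterations, and the threshold values are all
assumptions made for the example; none of them come from SAM itself.

```python
from dataclasses import dataclass


@dataclass
class Iteration:
    search_set: str    # which BLASTP hit set is searched (hypothetical label)
    threshold: float   # acceptance threshold (illustrative value only)
    regularizer: str   # transition regularizer (hypothetical label)


def build_schedule(thresholds):
    """Return one Iteration per threshold, strictest threshold first."""
    if len(thresholds) < 2:
        raise ValueError("need at least a first and a final iteration")
    last = len(thresholds) - 1
    schedule = []
    for i, t in enumerate(thresholds):
        if i == 0:
            # First iteration: close homologs, BLASTP-like gap costs.
            schedule.append(Iteration("close_homologs", t, "blastp_like"))
        elif i == last:
            # Final iteration: regularizer trained on FSSP alignments.
            schedule.append(Iteration("possible_homologs", t, "fssp_trained"))
        else:
            # Middle iterations: "long-match" regularizer.
            schedule.append(Iteration("possible_homologs", t, "long_match"))
    return schedule


# Thresholds below are placeholders that merely loosen step by step.
for step in build_schedule([1e-5, 1e-4, 1e-3, 1e-2]):
    print(step)
```

In a real run, each iteration would also take the previous iteration's
alignment as its input; that data flow is omitted here.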

FOLD RECOGNITION

Given the SAM-T99 multiple alignment, a set of sequence weights is
determined.  Next, modelfromalign is used to build the model from the
alignment and the sequence weights.  Finally, hmmscore performs a
local, all-paths scoring of PDB sequences, using a reversed-sequence
normalization feature.  

A library of about 3000 template HMMs was created from PDB sequences
using SAM-T99, and the target sequence was scored with each of these
models.  When both target-model score and template-model score
are available for a template, they are averaged, and the e-value is
recalculated for the averaged score.  Because of naming differences
between the non-redundant list of PDB sequences and the HMM model
library, some bidirectional hits were not properly averaged.
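
The averaging step can be illustrated with a minimal sketch.  It
assumes each scoring direction is available as a mapping from template
name to score; the function names and the name-normalization rule are
hypothetical and only show why mismatched names prevent averaging.
The e-value recalculation is not shown, since its formula is not given
here.

```python
def normalize_id(name):
    # Hypothetical rule: lower-case and drop underscores, so that
    # "1ABC_A" and "1abcA" refer to the same template.
    return name.lower().replace("_", "")


def combine_scores(target_model, template_model):
    """Average the scores of templates that were scored in both directions.

    Each argument maps a template name to a score.  A template scored in
    only one direction keeps that single score.
    """
    a = {normalize_id(k): v for k, v in target_model.items()}
    b = {normalize_id(k): v for k, v in template_model.items()}
    combined = {}
    for template in set(a) | set(b):
        scores = [d[template] for d in (a, b) if template in d]
        combined[template] = sum(scores) / len(scores)
    return combined


# Made-up scores: one template scored both ways, one scored one way only.
print(combine_scores({"1abcA": -20.0, "2xyz": -5.0}, {"1ABC_A": -30.0}))
```

Without the normalization step, "1abcA" and "1ABC_A" would be treated
as different templates and each would keep its unaveraged score.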

The sequence-weighting method [3] combines the Henikoffs' scheme [5],
Dirichlet mixtures [6], and an entropy method to set the final weights.


FOLD RECOGNITION SUBMISSION CRITERIA FOR CAFASP2

The submitted alignments are a subset of the SAM-T99 results available
from the CAFASP2 meta-server.  The meta-server results come from a query
with a loose E-value threshold of 50.0, so not all of the database hits
shown there were submitted as official SAM-T99 predictions for
evaluation.  By the CAFASP2 rules, up to 5 of the top database hits could
be submitted for each target. In choosing how many of the top hits to submit,
we used the following criteria:

    1) the top hit is always submitted
    2) any of the top 5 hits with an e-value of at most 1 is submitted
    3) any of the top 5 hits with an e-value of at most twice the e-value
       of the top hit is submitted.

With these criteria, we submitted the top hit only for 35 targets, 2 models
for 6 targets, and 3 models for 2 targets.
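
The three criteria can be written as a short selection function.  This
is a sketch, not the script actually used for the submissions; the
function name and the input format (a list of e-values sorted best
first) are assumptions made for the example.

```python
def select_submissions(evalues, max_hits=5):
    """Return the ranks (0-based) of the hits to submit.

    evalues: e-values of the database hits, sorted best (smallest) first.
    """
    top = evalues[:max_hits]
    if not top:
        return []
    best = top[0]
    chosen = [0]  # criterion 1: the top hit is always submitted
    for rank in range(1, len(top)):
        e = top[rank]
        # criterion 2: e-value of at most 1
        # criterion 3: e-value of at most twice that of the top hit
        if e <= 1.0 or e <= 2.0 * best:
            chosen.append(rank)
    return chosen


# Made-up e-values for illustration.
print(select_submissions([0.001, 0.5, 3.0, 40.0]))  # [0, 1]
print(select_submissions([10.0, 15.0, 25.0]))       # [0, 1]
```

In the second call the top hit is weak (e-value 10.0), so the second
hit qualifies only through criterion 3 (15.0 is at most 20.0).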

SECONDARY STRUCTURE PREDICTION

Three separate secondary structure predictors were used, and their
outputs averaged.  The most accurate of the predictors in tests is a
4-layer (3 hidden-layer) neural network.  The exact neural network
used may have changed over the course of the summer---check the
individual predictions for details---but the changes were relatively
minor.  This method is expected to have approximately 77-78% accuracy.

The other two predictors used hidden Markov models.

One uses a non-profile HMM that models all proteins.
It emits both amino acids and secondary-structure codes, and the
probability of each secondary-structure code at any position in the
sequence is taken as the sum over all alignments of
P(code | state) * P(state | position in sequence).
This method is expected to have approximately 72-74% accuracy.
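
A minimal sketch of this calculation follows.  It assumes that the
posterior state probabilities P(state | position in sequence) have
already been computed (for example, by a forward-backward pass, which
accounts for all alignments) and that the codes are the 3-letter EHL
alphabet.  The function name, the data layout, and the two-state toy
model are hypothetical.

```python
CODES = "EHL"


def code_probabilities(emit, posterior):
    """Combine state posteriors with per-state code emission probabilities.

    emit[state][code]    = P(code | state)
    posterior[pos][state] = P(state | position in sequence)
    Returns one dict per position mapping each code to its probability.
    """
    result = []
    for state_probs in posterior:
        vec = {c: 0.0 for c in CODES}
        for state, p_state in state_probs.items():
            for c in CODES:
                vec[c] += emit[state][c] * p_state
        result.append(vec)
    return result


# Toy model with two states and a single sequence position.
emit = {
    "s1": {"E": 0.8, "H": 0.1, "L": 0.1},
    "s2": {"E": 0.0, "H": 0.5, "L": 0.5},
}
posterior = [{"s1": 0.5, "s2": 0.5}]
print(code_probabilities(emit, posterior))
```

If the posteriors at a position sum to 1 and each state's emission
probabilities sum to 1, the resulting code probabilities also sum to 1.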

The other predictor uses the target profile HMM to align a large set of
template sequences with known secondary structure, and assigns
probabilities to the structure codes for each match position according
to the probability of aligning template sequences with that code in
that position.  This method is expected to have about 72% accuracy.

Since all three methods produce a probability vector for the 3-letter
EHL alphabet at each position, they were combined by averaging the
three vectors.  In some preliminary tests, this gave only very small
improvements in accuracy over the neural network alone.
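
The combination step amounts to an unweighted mean of the
per-position probability vectors.  The sketch below assumes each
predictor's output is a list with one (P(E), P(H), P(L)) tuple per
residue; the function names and the numbers are illustrative only.

```python
def average_predictions(*predictions):
    """Average per-residue (P(E), P(H), P(L)) vectors from several predictors."""
    n = len(predictions)
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same residues")
    return [
        tuple(sum(p[i][k] for p in predictions) / n for k in range(3))
        for i in range(length)
    ]


def most_likely(vector, codes="EHL"):
    """Return the code with the highest probability in one vector."""
    return codes[max(range(len(codes)), key=lambda k: vector[k])]


# Made-up outputs of three predictors for a single residue.
neural_net = [(0.7, 0.2, 0.1)]
hmm_all_proteins = [(0.4, 0.4, 0.2)]
hmm_target_profile = [(0.4, 0.3, 0.3)]

combined = average_predictions(neural_net, hmm_all_proteins, hmm_target_profile)
print(combined, most_likely(combined[0]))
```

Because every input vector sums to 1, the averaged vector does as
well, so it can still be read as a probability distribution over E,
H, and L.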

REFERENCES

[1] R. Hughey and A. Krogh, CABIOS 12(2):95-107, 1996.
    http://www.cse.ucsc.edu/research/compbio/sam.html
[2] K. Karplus, K. Sjolander, C. Barrett, M. Cline, D. Haussler,
    R. Hughey, L. Holm, and C. Sander, Proteins: Structure, Function,
    and Genetics, Suppl. 1:134-9, 1997.
[3] K. Karplus, C. Barrett, and R. Hughey, Hidden Markov Models for
    detecting Remote Protein Homologies, Bioinformatics
    14(10):846-856, 1998.
[4] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler,
    T. Hubbard, and C. Chothia,
    http://cyrah.med.harvard.edu/~jong/assess_final.html, 1998.
[5] S. Henikoff and J. C. Henikoff, JMB 243:574-578, Nov 1994.
[6] K. Sjolander, K. Karplus, M. P. Brown, R. Hughey, A. Krogh,
    I. S. Mian, and D. Haussler, CABIOS 12(4):327-345, 1996.
[7] K. Karplus, C. Barrett, M. Cline, M. Diekhans, L. Grate, and
    R. Hughey, Predicting protein structure using only sequence
    information, Proteins, Suppl. 3:121-5, 1999.