\documentclass[10pt,twocolumn]{article}
\usepackage{epsfig,html}
\nonstopmode
\makeatletter
% Set the tolerances to allow slightly worse line breaks than the default.
% Set the tolerances even looser for two column printing.
% This reduces the number of complaints about overfull hboxes, by
% allowing some rather loose and rather tight lines.
\pretolerance=400
\tolerance=2500
\hbadness=2000 % complain about some underfull boxes that are accepted
\newcommand{\subsubsubsection}{\paragraph}
\renewcommand{\topfraction}{0.999}
\renewcommand{\dbltopfraction}{0.999}
\renewcommand{\floatpagefraction}{0.85}
\renewcommand{\dblfloatpagefraction}{0.85}
\renewcommand{\textfraction}{0.1} % squeeze lines onto page even if 90% figures!
% Note \floatpagefraction+\textpagefraction should be < 1
% to prevent figures from being indefinitely delayed
% The difference should be enough to allow \@maxsep extra space.
% With \@maxsep=4.6ex, this could be as much as .038 of height.
\makeatother
\textwidth=7in
\oddsidemargin=0in
\evensidemargin=0in
\newcommand{\secref}[1]{Section~\ref{sec:#1}}
\newcommand{\tabref}[1]{Table~\ref{tab:#1}}
\newcommand{\figref}[1]{Figure~\ref{fig:#1}}
\newcommand{\eqref}[1]{Equation~\ref{#1}}
\newcommand{\prob}{\mbox{Prob}}
\newcommand{\casp}{\mbox{\sc casp}}
\newcommand{\cafasp}{\mbox{\sc cafasp}}
\newcommand{\hmm}{\mbox{\sc hmm}}
\newcommand{\hmms}{\mbox{{\sc hmm}s}}
\newcommand{\Hmm}{\mbox{\sc Hmm}}
\newcommand{\pdb}{\mbox{\sc pdb}}
\newcommand{\Eval}{\mbox{\sc E-value}}
\newcommand{\undertaker}{\mbox{\sc undertaker}}
\newcommand{\Undertaker}{\mbox{\sc Undertaker}}
\newcommand{\fragfinder}{\mbox{\sc fragfinder}}
% comment: give str a nice format
\newcommand{\str}{\mbox{\sc str}}
\newcommand{\stride}{\mbox{\sc stride}}
\newcommand{\dssp}{\mbox{\sc dssp}}
\newcommand{\CA}{\ifmmode C_\alpha\else \(C_\alpha\)\fi}
\newcommand{\CB}{\ifmmode C_\beta\else \(C_\beta\)\fi}
\newcommand{\Angstrom}{{\AA}ngstrom}
\newcommand{\Angstroms}{{\AA}ngstroms}
\newcommand{\TODO}[1]{\begingroup \it \marginpar{***} #1 \endgroup}
\let\psframe=\centerline
\def\epsfw#1#2{\psframe{\psfig{figure=#1.eps,width=#2}}}
\def\rotepsfw#1#2{\psframe{\psfig{figure=#1.eps,angle=-90,width=#2}}}
\def\psfw#1#2{\psframe{\psfig{figure=#1.ps,width=#2}}}
% This produces the comments in the margin.
% \newcommand{\comment}[1]{\marginpar{\raggedright #1}}
\begin{document}
\title{SAM-T04: what's new in protein-structure prediction for CASP6}
\date{\today\\
\small This is a preprint of an article accepted for publication in the {\casp6} special issue of \\
{\it Proteins: Structure, Function, and Bioinformatics}. Copyright 2005.}
\author{Kevin Karplus\thanks{email: karplus@soe.ucsc.edu.  Mailing address: Biomolecular Engineering Department, University of California, Santa Cruz, CA 95064 USA.  Phone: 1-831-459-4250, Fax: 1-831-459-4829.  Mail to other authors may be similarly addressed.}, Sol Katzman, George Shackleford, Martina Koeva, Jenny Draper, Bret Barnes, Marcia Soriano, Richard Hughey}
\maketitle
\normalsize

\begin{abstract}
\end{abstract}

\section{Introduction \label{sec:introduction}}

In previous {\casp} experiments, our team has concentrated on fold recognition using hidden Markov models (\hmms), with fairly good results~\cite{UCSC-CASP2-proteins,UCSC-CASP3-proteins,SAMT2K-CASP4-proteins}. We have also had some success using standard neural-net methods to predict secondary structure~\cite{SAMT98}, as measured by the EVA project~\cite{EVA-bioinf01}. In 2000, we started incorporating secondary-structure prediction into our fold-recognition method for {\casp4}~\cite{SAMT2K-CASP4-proteins}.

We entered two automatic servers in {\casp6}, both of which are somewhat old: the SAM-T99 and SAM-T02 servers. These servers are essentially the same as the ones used in {\casp5}~\cite{SAMT02-CASP5}, though the template library has grown over the past two years. Neither server had particularly impressive performance in {\casp6}.
Results for an automatic method were submitted for evaluation as part of our {\casp6} submissions, but the method has not yet been implemented as a web service, so it could not participate in the evaluation of automatic servers.

For both the automatic and the human-assisted entries to {\casp6}, we relied heavily on our fragment-packing program, {\undertaker}, which has undergone substantial development since {\casp5}. The same method was used for all targets, independent of the degree of similarity to any templates that we found, but we focused more of our attention on new-fold and difficult fold-recognition targets, since these were the targets where we felt we could make the biggest gains by human intervention.

One new method for our group in {\casp6} was residue-residue contact prediction. We did not register enough predictions per target to be evaluated on most of them, and did only adequately on the few for which there were enough predictions for the assessment method. We will not discuss contact prediction in this paper, but are preparing a separate paper explaining our method and analyzing the results.

According to the {\casp6} assessors, our group had good results in the non-template category, so improvements in the fragment-packing program, {\undertaker}, will be the main focus of this paper.

\section{Methods \label{sec:methods}}

Although it has become popular to apply different techniques to targets with easily found templates and targets without templates, we applied the same protocol to all targets. This protocol consisted of fold recognition and fragment generation using {\hmms}, followed by conformation generation and scoring with a stochastic search program. Human intervention consisted mainly of adding hand-picked constraints to the cost function of the stochastic search. There was little human intervention on targets with easily found templates, as we spent most of our time working on the hardest targets.
For each target, we submitted one or more of the following: the fold-recognition results (doing sidechain replacement on a template backbone with no refinement), a fully automatic prediction of the complete chain, and a result with some human intervention. In \secref{results}, we will examine how much was gained by the automatic prediction over simple sidechain replacement, and by human intervention over the fully automatic procedure.

The SAM-T04 pipeline is very similar to the previous generation, SAM-T02, used in {\casp5}~\cite{SAMT02-CASP5}. Its main steps are:
\begin{itemize}
\item finding similar sequences with iterative search (using SAM-T2K and SAM-T04);
\item predicting local structure properties with neural nets;
\item finding possible fold-recognition templates using 2-track and 3-track {\hmm}s;
\item making alignments to the templates;
\item building a specific fragment library for the target (with {\fragfinder}); and
\item packing fragments and fold-recognition alignments to make a 3D structure (with {\undertaker}).
\end{itemize}

\subsection{Iterative search}

The main differences in fold recognition and alignments were that we used a new iterative search method (SAM-T04) in addition to the SAM-T2K method that we introduced in 2000, and that we used more multitrack {\hmm}s. The new iterative search of the nonredundant protein database NR~\cite{nr} differs in several minor ways from the SAM-T2K search. The most notable differences are in the prefiltering and in the regularizers used for transition probabilities.
\begin{itemize}
\item One of the biggest constraints on the SAM-T2K search was that all sequences in the final multiple alignment had to be found in the initial prefiltering of the database, which was done by setting a large E-value on a BLAST search~\cite{BLAST}. In SAM-T04, prefiltering of the database is done using one iteration of psi-BLAST~\cite{psiblast97,improved-psiblast-2001} at each iteration of the search.
This change allows the search to be much more sensitive, without requiring extremely loose thresholds on the prefilter. The prefilter is set to limit the number of psi-BLAST hits to 3000---this cutoff is clearly visible in \figref{t2k-vs-t04}. Occasionally one gets more than 3000 sequences in the multiple alignment, because the target sequence aligns multiple times with repeated domains in proteins (for example, 1ugnA, one of the alleles of Lir1, has 8791 sequences in the multiple alignment because many of the sequences have multiple copies of the domain).

Although SAM-T04 is generally more sensitive than SAM-T2K, sometimes SAM-T04 gets fewer sequences. One extreme case is 1wjpA, which has 9144 sequences in the SAM-T2K alignment, but only 22 in the SAM-T04 alignment. Except where the reduction in size is due to the cap on the prefilter, the reduction is generally due to tighter thresholds on the psi-BLAST filter than on the older BLAST filter. It has not yet been determined whether the reduction has more effect on false positives or true positives.

\item The transition regularizers for SAM-T99 and SAM-T2K were set to avoid ``choppy'' alignments that had frequent insertions and deletions, sweeping the gaps together in highly variable regions. These multiple alignments are easier for humans to read, and are generally preferred by biologists, but information is lost about residues that really do correspond. In SAM-T04, a regularizer is used that keeps the costs of gaps fairly low even in the later iterations of the iterative search. The resulting multiple alignments look worse, but seem to work better for predicting local structure and contacts. (As always during CASP season, we had to press the method into service before we had time for extensive testing.)
\end{itemize}

\begin{figure}
\epsfw{figures/t2k-vs-t04}{0.9\columnwidth}
\caption[t2k-vs-t04]{\label{fig:t2k-vs-t04}
This figure plots the number of sequences in the SAM-T04 multiple alignments versus the number in the SAM-T2K multiple alignments, for alignments that were computed using the same version of the non-redundant protein database. Note that the SAM-T04 method generally finds more sequences to be similar, but is usually capped at 3000 sequences by the settings of the psi-BLAST prefilter. The SAM-T04 alignments always contain at least two sequences, because the seed sequence is included, as is the identical sequence found in the non-redundant protein database.
}
\end{figure}

\subsection{Local structure prediction}

We continue to use neural networks to predict various local structure properties~\cite{karchin03-backbone-geometries,karchin-burial-alphabets}. We are now predicting five backbone properties ({\dssp}, {\stride}, {\str2}, \(\alpha\) pseudotorsion angle, and Bystroff's partition of the Ramachandran plot) and two burial properties ({\CB} coordination within a 14~{\Angstrom}-radius sphere and a new count we call near-backbone). We also combine the various predictions to get an averaged prediction for a traditional three-state (strand, helix, other) prediction. \TODO{Describe str2? Describe near-backbone?}

Our neural nets now have 42 inputs for each position: a one-hot encoding of the amino acid in the target sequence (20), a probability for each amino acid from a multiple alignment (20), and probabilities of insertion and deletion (2). The one-hot encoding of the target sequence is new and permits slightly more precise predictions when the target sequence differs from the dominant amino acid in the multiple alignment.
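The 42-input encoding just described can be sketched as follows; the alphabet ordering and the exact normalization of the profile are illustrative assumptions, not details taken from the SAM implementation.

```python
# Sketch of the 42 inputs per position described above: one-hot encoding
# of the target residue (20), amino-acid profile from the multiple
# alignment (20), and insertion/deletion probabilities (2).
# The ordering and normalization are illustrative assumptions.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_position(target_aa, column_profile, p_insert, p_delete):
    """Build the 42-number input vector for one target position.

    column_profile maps amino-acid letters to their frequency in the
    corresponding column of the multiple alignment.
    """
    one_hot = [1.0 if aa == target_aa else 0.0 for aa in AMINO_ACIDS]
    profile = [column_profile.get(aa, 0.0) for aa in AMINO_ACIDS]
    return one_hot + profile + [p_insert, p_delete]
```

The one-hot block is what lets the network distinguish the target's own residue from the column consensus when the two disagree.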
We have not yet done extensive testing of the new neural nets to quantify any improvement, but the combination of using the SAM-T04 multiple alignments, the extra inputs to the neural nets, and retrained networks appears to have given slight improvements in prediction of local structure.

\subsection{Fragment generation}

One of the most powerful operators in {\undertaker} is fragment replacement, in which portions of the conformation are replaced by a contiguous piece of protein structure. This fragment replacement is similar to that used in Rosetta~\cite{mini-thread-CASP3}, but includes not just the backbone torsion angles but full 3D information for all backbone and sidechain atoms (except hydrogens) in the fragment.

The conformation generator in {\undertaker} uses three sources of backbone fragments for building the models:
\begin{itemize}
\item short, generic fragments. A library of about 1300 protein structures with good resolution is read in, and every fragment of length $\leq 4$ is indexed. These {\it generic fragments\/} are used as possible replacements for exactly matching portions of the target chain.
\item large fragments and alignments. For each alignment to a template found by the fold-recognition process, sidechain replacement is done and the resulting incomplete conformation stored. The sidechain replacement can be done either quickly by {\undertaker} without optimization or by Dunbrack's SCWRL 3.0~\cite{SCWRL,SCWRL-3.0}. The conformation generator can use the contiguous pieces of this conformation as fragments or can do replacement of the entire conformation as a unit.
\item medium-length fragments. Fragments of nine residues are found using the {\fragfinder} program of the SAM tool suite. For {\casp6}, we used three-track {\hmm}s with amino-acid, str2, and {\CB} burial alphabets for finding medium-length fragments.
The fragments are reported as short alignments to sequences in the template library, and are used by {\undertaker} in exactly the same way as longer fragments.
\end{itemize}

The main changes in fragment generation since SAM-T02 are that we now use three-track {\hmm}s for finding the medium-length fragments, and that {\undertaker} filters the fragments as it reads in the alignments, unaligning residues that would be in improbable parts of the Ramachandran plot for that residue type. This filtering breaks some of the fragments into smaller ones, but reduces the number of residues predicted to be in the wrong conformation. We are hoping to improve {\fragfinder} so that filtering its output is not necessary.

\subsection{Conformation generation}

The conformation generation in {\undertaker} is an adaptive genetic algorithm that currently has 35 conformation-change operators. Three of them perform the fragment replacement described above: InsertFragment for generic fragments, InsertSpecificFragments for fragments from fold-recognition and {\fragfinder} alignments, and InsertAlignment for replacing multiple fragments simultaneously. Also related is TwoFragment, which picks two fragments at random and replaces both. There is a standard crossover operation (CrossOver) for combining portions of different conformations, and a specialized one that does a fragment replacement at the crossover point (CrossAndInsert). Some operators do fragment replacement to try to improve specific parts of the cost function: ReduceClash, ReduceConstraint, ReduceBreak. Another group of operators is associated with trying to close gaps in the backbone: ReduceBreak, MoveGap, CloseGap, HealGap, and HealPeptide (HealPeptide added after {\casp6}). Several operators move sidechains without affecting the backbone: Onerotamer, ClashingRotamer, and ClusteredRotamer.
Some operators do small rigid-body movements of disconnected portions of the chain: JiggleSegment, JiggleSubtree, OptSegment, OptSubtree, OptAllSegments, and TweakMultimer (TweakMultimer added after {\casp6}). Some operators make small changes to torsion angles: TweakPhiSegment, TweakPhiSubtree, TweakPsiSegment, TweakPsiSubtree, TweakPsiPhiSegment, TweakPsiPhiSubtree, and TweakPeptide (TweakPeptide added after {\casp6}). There are also a few rather specialized operators: InsertSSBond, ImproveSSBond, ShiftSegment, and ShiftSubtree.

The genetic algorithm keeps track of which operators have made improvements in the conformations and how big these improvements were, favoring the use of operators that make large or frequent improvements.

To generate the starting conformations for the genetic algorithm, we build a random conformation, then repeatedly try doing all possible alignment replacements from our alignment library. For targets for which good templates and alignments are available, this generally gets the core of the conformation correct, and the genetic algorithm is mainly working on closing the loops and repacking sidechains, even though no part of the conformation is frozen.

\subsection{Cost function}

The generate-and-test method used by {\undertaker} relies on a cost function to guide the genetic algorithm toward protein-like conformations. The cost function in {\undertaker} is not an energy function, as it includes many non-physical terms. The cost function itself is a linear combination of any number of terms, selected at run time. There are currently 38 built-in cost-function components, plus several parameterizable ones that can be read in from files. Not all the possible components were used in {\casp6}, and both the set used and the weighting coefficients were modified by hand on each target.
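A run-time-configurable linear combination of terms, as described above, can be sketched in a few lines; the term names and the conformation representation here are illustrative only, and the real {\undertaker} components are far more elaborate.

```python
# Minimal sketch of a cost function built at run time as a weighted sum
# of independent terms.  Term names and the conformation representation
# are illustrative assumptions, not the actual undertaker components.

def make_cost_function(weighted_terms):
    """weighted_terms: list of (weight, term_fn) pairs chosen at run time."""
    def cost(conformation):
        return sum(w * term(conformation) for w, term in weighted_terms)
    return cost
```

Hand-tuning a target then amounts to editing the list of (weight, term) pairs: dropping a term, re-weighting it, or appending a target-specific constraint term.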
The fully automatic predictions used 14 terms:
\begin{itemize}
\item 7 burial terms (wet6.5, near\_backbone, way\_back, dry5, dry6.5, dry8, and dry12), each of which counted residues (for near\_backbone and way\_back) or atoms (for the others) within specific spheres near each residue. The cost function used the negative log-probability of the observed burial, based on residue-specific histograms trained on a set of about 1300 good structures. The near\_backbone and way\_back burial functions are new---the others were already used in {\casp5}.
\item 4 hydrogen-bond terms. {\Undertaker} has a fairly sophisticated cost function for evaluating hydrogen bonds without explicit hydrogens. The cost function takes into account both distance and geometry, and uses different parameters for different types of hydrogen bonds. The different hydrogen-bond terms use the same underlying cost function, but assign different weights to different classes of H-bonds. The four terms were hbond\_geom (for all hydrogen bonds), hbond\_backbone (giving extra weight to backbone-backbone H-bonds), hbond\_geom\_beta (giving still more weight to backbone H-bonds that are not part of a helix), and hbond\_geom\_beta\_pair (giving even more weight to H-bonds that form part of a ladder between beta strands).
\item 2 clash terms. Although {\undertaker} does not have a Lennard-Jones-style energy function for Van der Waals interactions, it does have a soft\_clashes function that provides increasing penalties for worse conflicts between atoms. The definition of what constitutes a clash can be read from a file, and the particular clash table used for {\casp6} grouped the atoms into 49 types and had tables for the minimum acceptable distance between pairs of atom types for the same residue, for residues adjacent on the backbone, and for residues with separation of two or more.
In addition to the soft\_clashes term, there was a backbone\_clashes term, which simply counted the number of pairs of backbone atoms that were closer than the minimum acceptable distance in the clash table.
\item break cost ...
\item constraints ...
\item predicted \(\alpha\) torsion angle (predicted from both SAM-T2K and SAM-T04 multiple alignments) ...
\item hydrophobic radius of gyration ...
\item sidechain quality ...
\item bond angle at \CA{} ...
\item Ramachandran plot (bys) residue propensity ...
\end{itemize}

For the hand predictions, we sometimes added terms for disulfide bonds.

\subsection{Human intervention}

\section{Results and Discussion \label{sec:results}}

\subsection{Smooth GDT measure}

\subsection{Automatic vs. alignment}

\subsection{Human intervention vs. automatic}

\section{Conclusion \label{sec:conclusion}}

\subsection*{Acknowledgments}

This work was supported primarily by NIH grant 1 R01 GM068570-01. Marcia Soriano was supported by a Summer Undergraduate Research Fellowship in Information Technology (SURF-IT), funded by the National Science Foundation Research Experience for Undergraduates program. We are grateful to David Haussler and Anders Krogh for starting the hidden Markov model and Dirichlet mixture work at UCSC, as these approaches were instrumental to our success. We are also grateful to Rachel Karchin, Christian Barrett, Spencer Tu, Sugato Basu, Mark Diekhans, and Jonathan Casper, who made other contributions to the techniques and software. We began work on the first few targets while Kevin Karplus was on sabbatical in David Baker's lab, and conversations with members of that lab were fruitful in guiding our initial work on these targets.

\bibliographystyle{unsrt}
{\small
\bibliography{/projects/compbio/papers/tex/all}
}
\end{document}