\documentclass[10pt,twocolumn]{article}
\usepackage{epsfig,html}
\nonstopmode
\makeatletter
% Set the tolerances to allow slightly worse line breaks than the default.
% Set the tolerances even looser for two column printing.
% This reduces the number of complaints about overfull hboxes, by
% allowing some rather loose and rather tight lines.
\pretolerance=400
\tolerance=2500
\hbadness=2000 % complain about some underfull boxes that are accepted
\newcommand{\subsubsubsection}{\paragraph}
\renewcommand{\topfraction}{0.999}
\renewcommand{\dbltopfraction}{0.999}
\renewcommand{\floatpagefraction}{0.85}
\renewcommand{\dblfloatpagefraction}{0.85}
\renewcommand{\textfraction}{0.1} % squeeze lines onto page even if 90% figures!
% Note \floatpagefraction+\textpagefraction should be < 1
% to prevent figures from being indefinitely delayed
% The difference should be enough to allow \@maxsep extra space.
% With \@maxsep=4.6ex, this could be as much as .038 of height.
\makeatother
\textwidth=7in
\oddsidemargin=0in
\evensidemargin=0in
\newcommand{\secref}[1]{Section~\ref{sec:#1}}
\newcommand{\tabref}[1]{Table~\ref{tab:#1}}
\newcommand{\figref}[1]{Figure~\ref{fig:#1}}
\newcommand{\eqref}[1]{Equation~\ref{#1}}
\newcommand{\prob}{\mbox{Prob}}
\newcommand{\casp}{\mbox{\sc casp}}
\newcommand{\cafasp}{\mbox{\sc cafasp}}
\newcommand{\hmm}{\mbox{\sc hmm}}
\newcommand{\hmms}{\mbox{{\sc hmm}s}}
\newcommand{\Hmm}{\mbox{\sc Hmm}}
\newcommand{\pdb}{\mbox{\sc pdb}}
\newcommand{\Eval}{\mbox{\sc E-value}}
\newcommand{\undertaker}{\mbox{\sc undertaker}}
\newcommand{\Undertaker}{\mbox{\sc Undertaker}}
\newcommand{\fragfinder}{\mbox{\sc fragfinder}}
% comment: give str a nice format
\newcommand{\str}{\mbox{\sc str}}
\newcommand{\stride}{\mbox{\sc stride}}
\newcommand{\dssp}{\mbox{\sc dssp}}
\newcommand{\CA}{\ifmmode C_\alpha\else \(C_\alpha\)\fi}
\newcommand{\CB}{\ifmmode C_\beta\else \(C_\beta\)\fi}
\newcommand{\Angstrom}{{\AA}ngstrom}
\newcommand{\Angstroms}{{\AA}ngstroms}
\newcommand{\TODO}[1]{\begingroup \it \marginpar{***} #1 \endgroup}
\let\psframe=\centerline
\def\epsfw#1#2{\psframe{\psfig{figure=#1.eps,width=#2}}}
\def\rotepsfw#1#2{\psframe{\psfig{figure=#1.eps,angle=-90,width=#2}}}
\def\psfw#1#2{\psframe{\psfig{figure=#1.ps,width=#2}}}
% This produces the comments in the margin.
% \newcommand{\comment}[1]{\marginpar{\raggedright #1}}
\begin{document}
\title{SAM-T04: what's new in protein-structure prediction for CASP6}
\date{\today\\
\small This is a preprint of an article accepted for publication in the {\casp6} special issue of \\
{\it Proteins: Structure, Function, and Bioinformatics}. Copyright 2005.}
\author{Kevin Karplus\thanks{email: karplus@soe.ucsc.edu.  Mailing address: Biomolecular Engineering Department, University of California, Santa Cruz, CA 95064 USA.  Phone: 1-831-459-4250, Fax: 1-831-459-4829.  Mail to other authors may be similarly addressed.}, Sol Katzman, George Shackleford, Martina Koeva, Jenny Draper, Bret Barnes, Marcia Soriano, Richard Hughey}
\maketitle
\normalsize

\begin{abstract}
\end{abstract}

\section{Introduction \label{sec:introduction}}

In previous {\casp} experiments, our team has concentrated on fold recognition using hidden Markov models (\hmms), with fairly good results~\cite{UCSC-CASP2-proteins,UCSC-CASP3-proteins,SAMT2K-CASP4-proteins}. We have also had some success using standard neural-net methods to predict secondary structure~\cite{SAMT98}, as measured by the EVA project~\cite{EVA-bioinf01}. In 2000, we started incorporating secondary-structure prediction into our fold-recognition method for {\casp4}~\cite{SAMT2K-CASP4-proteins}.

We entered two automatic servers in {\casp6}, both of which are somewhat old: the SAM-T99 and SAM-T02 servers. These servers are essentially the same as the ones used in {\casp5}~\cite{SAMT02-CASP5}, though the template library has grown over the past two years. Neither server had particularly impressive performance in {\casp6}.
Results for an automatic method were submitted for evaluation as part of our {\casp6} submissions, but the method has not yet been implemented as a web service, so it could not participate in the evaluation of automatic servers.

For both the automatic and the human-assisted entries to {\casp6}, we relied heavily on our fragment-packing program, {\undertaker}, which has undergone substantial development since {\casp5}. The same method was used for all targets, independent of the degree of similarity to any templates that we found, but we focused more of our attention on new-fold and difficult fold-recognition targets, since these were the targets where we felt we could make the biggest gains by human intervention.

One new method for our group in {\casp6} was residue-residue contact prediction. We did not register enough predictions per target to be evaluated on most of them, and did only adequately on the few for which there were enough predictions for the assessment method. We will not discuss contact prediction in this paper, but are preparing a separate paper explaining our method and analyzing the results.

According to the {\casp6} assessors, our group had good results in the non-template category, so improvements in the fragment-packing program, {\undertaker}, will be the main focus of this paper.

\section{Methods \label{sec:methods}}

Although it has become popular to apply different techniques to targets with easily found templates and targets without templates, we applied the same protocol to all targets. This protocol consisted of fold recognition and fragment generation using {\hmms}, followed by conformation generation and scoring with a stochastic search program. Human intervention consisted mainly of adding hand-picked constraints to the cost function of the stochastic search. There was little human intervention on targets with easily found templates, as we spent most of our time working on the hardest targets.
For each target, we submitted one or more of the following: the fold-recognition results (doing sidechain replacement on a template backbone with no refinement), a fully automatic prediction of the complete chain, and a result with some human intervention. In \secref{results}, we will examine how much was gained by the automatic prediction over simple sidechain replacement, and by human intervention over the fully automatic procedure.

The SAM-T04 pipeline is very similar to the previous generation, SAM-T02, used in {\casp5}~\cite{SAMT02-CASP5}. Its main steps are:
\begin{itemize}
\item finding similar sequences with iterative search (using SAM-T2K and SAM-T04);
\item predicting local structure properties with neural nets;
\item finding possible fold-recognition templates using 2-track and 3-track {\hmm}s;
\item making alignments to the templates;
\item building a specific fragment library for the target (with {\fragfinder}); and
\item packing fragments and fold-recognition alignments to make a 3D structure (with {\undertaker}).
\end{itemize}

\subsection{Iterative search}

The main differences in fold recognition and alignments were that we used a new iterative search method (SAM-T04) in addition to the SAM-T2K method that we introduced in 2000, and that we used more multitrack {\hmm}s. The new iterative search of the nonredundant protein database NR~\cite{nr} differs in several minor ways from the SAM-T2K search. The most notable differences are in the prefiltering and in the regularizers used for transition probabilities.
\begin{itemize}
\item One of the biggest constraints on the SAM-T2K search was that all sequences in the final multiple alignment had to be found in the initial prefiltering of the database, which was done by setting a large E-value on a BLAST search~\cite{BLAST}. In SAM-T04, prefiltering of the database is done using one iteration of psi-BLAST~\cite{psiblast97,improved-psiblast-2001} at each iteration of the search.
This change allows the search to be much more sensitive, without requiring extremely loose thresholds on the prefilter. The prefilter is set to limit the number of psi-BLAST hits to 3000---this cutoff is clearly visible in \figref{t2k-vs-t04}. Occasionally one gets more than 3000 sequences in the multiple alignment, because the target sequence aligns multiple times with repeated domains in proteins (for example, 1ugnA, one of the alleles of Lir1, has 8791 sequences in the multiple alignment because many of the sequences have multiple copies of the domain).

Although SAM-T04 is generally more sensitive than SAM-T2K, sometimes SAM-T04 gets fewer sequences. One extreme case is 1wjpA, which has 9144 sequences in the SAM-T2K alignment, but only 22 in the SAM-T04 alignment. Except where the reduction in size is due to the cap on the prefilter, the reduction is generally due to tighter thresholds on the psi-BLAST filter than on the older BLAST filter. It has not yet been determined whether the reduction has more effect on false positives or true positives.

\item The transition regularizers for SAM-T99 and SAM-T2K were set to avoid ``choppy'' alignments that had frequent insertions and deletions, sweeping the gaps together in highly variable regions. These multiple alignments are easier for humans to read, and are generally preferred by biologists, but information is lost about residues that really do correspond. In SAM-T04, a regularizer is used that keeps the costs of gaps fairly low even in the later iterations of the iterative search. The resulting multiple alignments look worse, but seem to work better for predicting local structure and contacts. (As always during CASP season, we had to press the method into service before we had time for extensive testing.)
\end{itemize}

\begin{figure}
\epsfw{figures/t2k-vs-t04}{0.9\columnwidth}
\caption[t2k-vs-t04]{\label{fig:t2k-vs-t04}
This figure plots the number of sequences in the SAM-T04 multiple alignments versus the number in the SAM-T2K multiple alignments, for alignments that were computed using the same version of the non-redundant protein database. Note that the SAM-T04 method generally finds more sequences to be similar, but is usually capped at 3000 sequences by the settings of the psi-BLAST prefilter. The SAM-T04 alignments always contain at least two sequences, because the seed sequence is included, as is the identical sequence found in the non-redundant protein database.
}
\end{figure}

\subsection{Local structure prediction}

We continue to use neural networks to predict various local structure properties~\cite{karchin03-backbone-geometries,karchin-burial-alphabets}. We are now predicting five backbone properties ({\dssp}, {\stride}, {\str2}, \(\alpha\) pseudotorsion angle, and Bystroff's partition of the Ramachandran plot) and two burial properties ({\CB} coordination within a 14~{\Angstrom}-radius sphere and a new count we call near-backbone). We also combine the various predictions to get an averaged prediction for a traditional three-state (strand, helix, other) prediction. \TODO{Describe str2? Describe near-backbone?}

Our neural nets now have 42 inputs for each position: a one-hot encoding of the amino acid in the target sequence (20), a probability for each amino acid from a multiple alignment (20), and probabilities of insertion and deletion (2). The one-hot encoding of the target sequence is new and permits slightly more precise predictions when the target sequence differs from the dominant amino acid in the multiple alignment.
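The 42-input encoding just described can be sketched as follows; the alphabet ordering and the exact normalization of the profile are illustrative assumptions, not details taken from the SAM implementation.

```python
# Sketch of the 42 inputs per position described above: one-hot encoding
# of the target residue (20), amino-acid profile from the multiple
# alignment (20), and insertion/deletion probabilities (2).
# The ordering and normalization are illustrative assumptions.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_position(target_aa, column_profile, p_insert, p_delete):
    """Build the 42-number input vector for one target position.

    column_profile maps amino-acid letters to their frequency in the
    corresponding column of the multiple alignment.
    """
    one_hot = [1.0 if aa == target_aa else 0.0 for aa in AMINO_ACIDS]
    profile = [column_profile.get(aa, 0.0) for aa in AMINO_ACIDS]
    return one_hot + profile + [p_insert, p_delete]
```

The one-hot block is what lets the network distinguish the target's own residue from the column consensus when the two disagree.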
We have not yet done extensive testing of the new neural nets to quantify any improvement, but the combination of using the SAM-T04 multiple alignments, the extra inputs to the neural nets, and retrained networks appears to have given slight improvements in prediction of local structure.

\subsection{Fragment generation}

One of the most powerful operators in {\undertaker} is fragment replacement, in which portions of the conformation are replaced by a contiguous piece of protein structure. This fragment replacement is similar to that used in Rosetta~\cite{mini-thread-CASP3}, but includes not just the backbone torsion angles but full 3D information for all backbone and sidechain atoms (except hydrogens) in the fragment.

The conformation generator in {\undertaker} uses three sources of backbone fragments for building the models:
\begin{itemize}
\item short, generic fragments. A library of about 1300 protein structures with good resolution is read in, and every fragment of length $\leq 4$ is indexed. These {\it generic fragments\/} are used as possible replacements for exactly matching portions of the target chain.
\item large fragments and alignments. For each alignment to a template found by the fold-recognition process, sidechain replacement is done and the resulting incomplete conformation stored. The sidechain replacement can be done either quickly by {\undertaker} without optimization or by Dunbrack's SCWRL 3.0~\cite{SCWRL,SCWRL-3.0}. The conformation generator can use the contiguous pieces of this conformation as fragments or can do replacement of the entire conformation as a unit.
\item medium-length fragments. Fragments of nine residues are found using the {\fragfinder} program of the SAM tool suite. For {\casp6}, we used three-track {\hmm}s with amino-acid, str2, and {\CB} burial alphabets for finding medium-length fragments.
The fragments are reported as short alignments to sequences in the template library, and are used by {\undertaker} in exactly the same way as longer fragments.
\end{itemize}

The main changes in fragment generation since SAM-T02 are that we now use three-track {\hmm}s for finding the medium-length fragments, and that {\undertaker} filters the fragments as it reads in the alignments, unaligning residues that would be in improbable parts of the Ramachandran plot for that residue type. This filtering breaks some of the fragments into smaller ones, but reduces the number of residues predicted to be in the wrong conformation. We are hoping to improve {\fragfinder} so that filtering its output is not necessary.

\subsection{Conformation generation}

The conformation generation in {\undertaker} is an adaptive genetic algorithm that currently has 35 conformation-change operators. Three of them perform the fragment replacement described above: InsertFragment for generic fragments, InsertSpecificFragments for fragments from fold-recognition and {\fragfinder} alignments, and InsertAlignment for replacing multiple fragments simultaneously. Also related is TwoFragment, which picks two fragments at random and replaces both. There is a standard crossover operation (CrossOver) for combining portions of different conformations, and a specialized one that does a fragment replacement at the crossover point (CrossAndInsert). Some operators do fragment replacement to try to improve specific parts of the cost function: ReduceClash, ReduceConstraint, ReduceBreak. Another group of operators is associated with trying to close gaps in the backbone: ReduceBreak, MoveGap, CloseGap, HealGap, and HealPeptide (HealPeptide added after {\casp6}). Several operators move sidechains without affecting the backbone: Onerotamer, ClashingRotamer, and ClusteredRotamer.
Some operators do small rigid-body movements of disconnected portions of the chain: JiggleSegment, JiggleSubtree, OptSegment, OptSubtree, OptAllSegments, and TweakMultimer (TweakMultimer added after {\casp6}). Some operators make small changes to torsion angles: TweakPhiSegment, TweakPhiSubtree, TweakPsiSegment, TweakPsiSubtree, TweakPsiPhiSegment, TweakPsiPhiSubtree, and TweakPeptide (TweakPeptide added after {\casp6}). There are also a few rather specialized operators: InsertSSBond, ImproveSSBond, ShiftSegment, and ShiftSubtree.

The genetic algorithm keeps track of which operators have made improvements in the conformations and how big these improvements were, favoring the use of operators that make large or frequent improvements.

To generate the starting conformations for the genetic algorithm, we build a random conformation, then repeatedly try doing all possible alignment replacements from our alignment library. For targets for which good templates and alignments are available, this generally gets the core of the conformation correct, and the genetic algorithm is mainly working on closing the loops and repacking sidechains, even though no part of the conformation is frozen.

\subsection{Cost function}

The generate-and-test method used by {\undertaker} relies on a cost function to guide the genetic algorithm toward protein-like conformations. The cost function in {\undertaker} is not an energy function, as it includes many non-physical terms. The cost function itself is a linear combination of any number of terms, selected at run time. There are currently 38 built-in cost-function components, plus several parameterizable ones that can be read in from files. Not all the possible components were used in {\casp6}, and both the set used and the weighting coefficients were modified by hand on each target.
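A run-time-configurable linear combination of terms, as described above, can be sketched in a few lines; the term names and the conformation representation here are illustrative only, and the real {\undertaker} components are far more elaborate.

```python
# Minimal sketch of a cost function built at run time as a weighted sum
# of independent terms.  Term names and the conformation representation
# are illustrative assumptions, not the actual undertaker components.

def make_cost_function(weighted_terms):
    """weighted_terms: list of (weight, term_fn) pairs chosen at run time."""
    def cost(conformation):
        return sum(w * term(conformation) for w, term in weighted_terms)
    return cost
```

Hand-tuning a target then amounts to editing the list of (weight, term) pairs: dropping a term, re-weighting it, or appending a target-specific constraint term.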
The fully automatic predictions used 14 terms:
\begin{itemize}
\item 7 burial terms (wet6.5, near\_backbone, way\_back, dry5, dry6.5, dry8, and dry12), each of which counted residues (for near\_backbone and way\_back) or atoms (for the others) within specific spheres near each residue. The cost function used the negative log-probability of the observed burial, based on residue-specific histograms trained on a set of about 1300 good structures. The near\_backbone and way\_back burial functions are new---the others were already used in {\casp5}.
\item 4 hydrogen-bond terms. {\Undertaker} has a fairly sophisticated cost function for evaluating hydrogen bonds without explicit hydrogens. The cost function takes into account both distance and geometry, and uses different parameters for different types of hydrogen bonds. The different hydrogen-bond terms use the same underlying cost function, but assign different weights to different classes of H-bonds. The four terms were hbond\_geom (for all hydrogen bonds), hbond\_backbone (giving extra weight to backbone-backbone H-bonds), hbond\_geom\_beta (giving still more weight to backbone H-bonds that are not part of a helix), and hbond\_geom\_beta\_pair (giving even more weight to H-bonds that form part of a ladder between beta strands).
\item 2 clash terms. Although {\undertaker} does not have a Lennard-Jones-style energy function for Van der Waals interactions, it does have a soft\_clashes function that provides increasing penalties for worse conflicts between atoms. The definition of what constitutes a clash can be read from a file, and the particular clash table used for {\casp6} grouped the atoms into 49 types and had tables for the minimum acceptable distance between pairs of atom types for the same residue, for residues adjacent on the backbone, and for residues with separation of two or more.
In addition to the soft\_clashes term, there was a backbone\_clashes term, which simply counted the number of pairs of backbone atoms that were closer than the minimum acceptable distance in the clash table.
\item break cost ...
\item constraints ...
\item predicted \(\alpha\) torsion angle (predicted from both SAM-T2K and SAM-T04 multiple alignments) ...
\item hydrophobic radius of gyration ...
\item sidechain quality ...
\item bond angle at \CA{} ...
\item Ramachandran plot (bys) residue propensity ...
\end{itemize}

For the hand predictions, we sometimes added terms for disulfide bonds.

\subsection{Human intervention}

\section{Results and Discussion \label{sec:results}}

\subsection{Smooth GDT measure}

\subsection{Automatic vs. alignment}

\subsection{Human intervention vs. automatic}

\section{Conclusion \label{sec:conclusion}}

\subsection*{Acknowledgments}

This work was supported primarily by NIH grant 1 R01 GM068570-01. Marcia Soriano was supported by a Summer Undergraduate Research Fellowship in Information Technology (SURF-IT), funded by the National Science Foundation Research Experience for Undergraduates program. We are grateful to David Haussler and Anders Krogh for starting the hidden Markov model and Dirichlet mixture work at UCSC, as these approaches were instrumental to our success. We are also grateful to Rachel Karchin, Christian Barrett, Spencer Tu, Sugato Basu, Mark Diekhans, and Jonathan Casper, who made other contributions to the techniques and software. We began work on the first few targets while Kevin Karplus was on sabbatical in David Baker's lab, and conversations with members of that lab were fruitful in guiding our initial work on these targets.

\bibliographystyle{unsrt}
{\small
\bibliography{/projects/compbio/papers/tex/all}
}
\end{document}