SAM-T02: Protein Structure Prediction with Neural Nets, Hidden Markov Models, and Fragment Packing

Kevin Karplus, Rachel Karchin, Richard Hughey, Jenny Draper, Yael Mandel-Gutfreund, Jonathan Casper, and Mark Diekhans
Center for Biomolecular Science and Engineering
University of California, Santa Cruz
karplus@soe.ucsc.edu

The SAM-T02 human predictions start with the same method as the SAM-T02 server:

Use the SAM-T2K method for finding homologs of the target and aligning them.

Make local structure predictions using neural nets and the multiple alignment. We currently have 5 local-structure alphabets:

    DSSP
    STRIDE
    STR        an extended version of DSSP that splits the beta strands into multiple classes (parallel/antiparallel/mixed, edge/center)
    ALPHA      a discretization of the alpha torsion angle defined by CA(i-1), CA(i), CA(i+1), CA(i+2)
    DSSP_EHL2  CASP's collapse of the DSSP alphabet

DSSP_EHL2 is not predicted directly by a neural net, but is computed as a weighted average of the outputs of the other 4 networks (each probability vector output is multiplied by a conditional probability matrix with entries P(E|letter), P(H|letter), P(L|letter)). The weights for the averaging are the mutual information between each local structure alphabet and the DSSP_EHL2 alphabet in a large training set.

We make four 2-track HMMs (1.0 amino acid + 0.3 local structure) and use them to score a template library of about 6200 templates. We also used a single-track HMM to score not just the template library, but a non-redundant copy of the entire PDB. [Difference from server: the web server did not include the ALPHA alphabet in either the DSSP_EHL2 computation or the 2-track HMMs.]

One-track HMMs built from the template library multiple alignments were used to score the target sequence. The logs of all the e-values were combined in a weighted average (with rather arbitrary weights, since we did not have time to optimize them), and the templates were ranked accordingly.
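The DSSP_EHL2 combination described above (map each network's output to E/H/L through a conditional probability matrix, then take a mutual-information-weighted average) can be sketched roughly as follows. This is only an illustration for a single residue: the alphabet names "toy_a" and "toy_b", their letters, and every number are made up, and the function names are hypothetical rather than part of SAM.

```python
def to_ehl(letter_probs, cond):
    """Map a probability vector over one alphabet's letters to (E, H, L).

    cond[letter] is the triple (P(E|letter), P(H|letter), P(L|letter)).
    """
    ehl = [0.0, 0.0, 0.0]
    for letter, p in letter_probs.items():
        for k in range(3):
            ehl[k] += p * cond[letter][k]
    return ehl


def combine_ehl(per_alphabet_probs, cond_matrices, weights):
    """Weighted average of the per-alphabet EHL vectors for one residue."""
    total = sum(weights[name] for name in per_alphabet_probs)
    combined = [0.0, 0.0, 0.0]
    for name, letter_probs in per_alphabet_probs.items():
        ehl = to_ehl(letter_probs, cond_matrices[name])
        for k in range(3):
            combined[k] += weights[name] * ehl[k] / total
    return combined


# Made-up conditional probability matrices for two hypothetical alphabets.
cond_matrices = {
    "toy_a": {"x": (0.8, 0.1, 0.1), "y": (0.1, 0.2, 0.7)},
    "toy_b": {"p": (0.0, 0.9, 0.1), "q": (0.5, 0.0, 0.5)},
}
# Made-up network outputs for one residue.
per_alphabet_probs = {
    "toy_a": {"x": 0.5, "y": 0.5},
    "toy_b": {"p": 0.25, "q": 0.75},
}
# Stand-ins for the mutual-information weights (not real values).
weights = {"toy_a": 1.0, "toy_b": 3.0}

ehl2 = combine_ehl(per_alphabet_probs, cond_matrices, weights)
print(dict(zip("EHL", ehl2)))
```

Because each row of the conditional matrices sums to 1 and the weights are normalized, the combined vector is again a probability distribution over E, H, and L.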
Alignments of the target to the top templates were made using several different alignment methods (all using the SAM hmmscore program).

After the large set of alignments was made, the "human" method and the server diverge significantly. The server just picks the best-scoring templates (after removing redundancy) and reports the local posterior-decoding alignments made with the 2-track AA+STR target HMM.

The hand method used SAM's "fragfinder" program and the 2-track AA+STR HMM to find short fragments (9 residues long) for each position in the sequence (6 fragments were kept for each position). Then the "undertaker" program (so named because it optimizes burial) was used to try to combine the alignments and the fragments into a consistent 3D model. No single alignment or parent template was used, though in many cases one had much more influence than the others. The alignment scores were not passed to undertaker, but were used only to pick the set of alignments and fragments that undertaker would see.

A genetic algorithm with about 16 different operators was used to optimize a score function. The score function was hand-tweaked for each target (mainly by adding constraints to keep beta sheets together, but also by adjusting which terms were included in the score function and what weights were used). Undertaker was undergoing extensive modification during CASP season, so it may have had quite different features available for different targets.

Bower and Dunbrack's SCWRL was run on some of the intermediate conformations generated by undertaker, but the final conformation was chosen entirely by the undertaker score function. Optimization was generally done in many passes, with hand inspection of the best conformation after each pass, followed (often) by tweaking the score function to move the conformation in a direction we desired.
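As a rough illustration of the kind of search described above (a genetic algorithm that applies one of several operators per step and ranks conformations by a weighted sum of score terms), here is a minimal toy sketch. It is not undertaker's algorithm: the conformation is just a list of numbers, the two operators and the "burial" and "sheet_constraint" terms are invented stand-ins, and all weights are arbitrary.

```python
import random


def weighted_score(conformation, terms, weights):
    """Lower is better: a weighted sum of the enabled score terms."""
    return sum(weights[name] * term(conformation) for name, term in terms.items())


def perturb(a, b, rng):
    """Toy operator: copy parent a and nudge one coordinate."""
    child = list(a)
    i = rng.randrange(len(child))
    child[i] += rng.uniform(-0.5, 0.5)
    return child


def crossover(a, b, rng):
    """Toy operator: prefix of parent a joined to suffix of parent b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def optimize(pool, operators, score, generations=200, rng=None):
    """Elitist search: pool size stays fixed and the best member is never lost."""
    rng = rng or random.Random(0)
    pool = sorted(pool, key=score)
    size = len(pool)
    for _ in range(generations):
        op = rng.choice(operators)
        parent_a, parent_b = rng.sample(pool, 2)
        child = op(parent_a, parent_b, rng)
        pool = sorted(pool + [child], key=score)[:size]
    return pool[0]


# Invented score terms; hand-tuning would mean editing `terms` and `weights`.
terms = {
    "burial": lambda c: sum(x * x for x in c),
    "sheet_constraint": lambda c: (c[0] - c[-1]) ** 2,
}
weights = {"burial": 1.0, "sheet_constraint": 2.0}


def score(c):
    return weighted_score(c, terms, weights)


seed_rng = random.Random(1)
initial_pool = [[seed_rng.uniform(-3, 3) for _ in range(6)] for _ in range(8)]
best = optimize(initial_pool, [perturb, crossover], score)
print(score(best))
```

Changing the entries of `terms` or `weights` between passes mimics, in a very simplified way, the hand-tweaking of the score function between optimization passes.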
In a few cases, when we started getting a decent structure that did not correspond well to our input alignments, we submitted the structure to VAST to get structure-structure alignments, to try to find some other possible templates to use as a base.

In some cases, when several conformations had good parts, different conformations were manually cut-and-pasted, with undertaker run to try to smooth out the transitions.

Because undertaker does not (yet) handle multimers, we often added "scaffolding" constraints by hand to try to retain structure in dimerization interfaces. This is a crude hack that we hope to get rid of when we have multimers implemented.

Because undertaker does not (yet) have a hydrogen-bond scoring function, we often had to add constraints to hold beta sheets together. In some cases where the register was not obvious, we had to guess or try several different registers.

In some cases, when we got desperate for initial starting points, we threw the Robetta ab-initio models into the undertaker pool, and optimized from them as well as the ones undertaker started with.

For multiple-domain models, we generally broke the sequence into chunks (often somewhat arbitrary overlapping chunks), and did the full SAM-T02 method for each subchain. The alignments found were all tossed into the undertaker conformation search. In some cases, we performed undertaker runs for the subchains, and cut-and-pasted the pieces into one PDB file (with bad breaks) and let undertaker try to assemble the pieces.
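The chunking of multiple-domain targets mentioned above can be pictured with a small sketch. The abstract says the chunks were often somewhat arbitrary, so the fixed window length and overlap used here are purely an assumption for illustration, and `overlapping_chunks` is a hypothetical helper, not a SAM or undertaker program.

```python
def overlapping_chunks(sequence, chunk_len, overlap):
    """Return (start, end, subsequence) triples; end is exclusive."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be non-negative and smaller than chunk_len")
    step = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_len, len(sequence))
        chunks.append((start, end, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return chunks


# Made-up 25-residue sequence; the window sizes are arbitrary.
target = "ACDEFGHIKLMNPQRSTVWYACDEF"
chunks = overlapping_chunks(target, chunk_len=10, overlap=4)
for start, end, sub in chunks:
    print(start, end, sub)
```

Each subsequence would then be treated as its own target, with the resulting alignments pooled for the full-chain conformation search.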