The SAM-T08 hand predictions use methods similar to SAM-T06 in CASP7.

We start with a fully automated method (implemented as the SAM-T08-server):

Use the SAM-T2K, SAM-T04, and SAM-T06 methods for finding homologs of the target and aligning them.

Make local structure predictions using neural nets and the multiple alignments. These neural nets have been newly trained for CASP8 with an improved training protocol. The neural nets for the 3 different multiple sequence alignments are independently trained, so combining them should offer improved performance.

We currently use 15 local-structure alphabets:

    STR2
        an extended version of DSSP that splits the beta strands into multiple classes (parallel/antiparallel/mixed, edge/center)
    STR4
        an attempt at an alphabet like STR2, but not requiring DSSP. This alphabet may be trying to make some irrelevant distinctions as well.
    ALPHA
        a discretization of the alpha torsion angle: CA(i-1), CA(i), CA(i+1), CA(i+2)
    BYS
        a discretization of Ramachandran plots, due to Bystroff
    PB
        de Brevern's protein blocks
    N_NOTOR, N_NOTOR2, O_NOTOR, O_NOTOR2
        alphabets based on the torsion angle of backbone hydrogen bonds
    N_SEP, O_SEP
        alphabets based on the separation of donor and acceptor for backbone hydrogen bonds
    CB_burial_14_7
        a 7-state discretization of the number of C_beta atoms in a 14 Angstrom radius sphere around the C_beta
    near-backbone-11
        an 11-state discretization of the number of residues (represented by near-backbone points) in a 9.65 Angstrom radius sphere around the sidechain proxy spot for the residue
    DSSP_EHL2
        CASP's collapse of the DSSP alphabet. DSSP_EHL2 is not predicted directly by a neural net, but is computed as a weighted average of the other backbone alphabet predictions.

We make 2-track HMMs with each alphabet, with the amino-acid track having a weight of 1 and the local-structure track having a weight of 0.1 (for backbone alphabets) or 0.3 (for burial alphabets).
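To make the track weighting concrete, here is a minimal sketch of how a 2-track score could combine the two tracks for a single position. It assumes that a track weight acts as a multiplier on that track's log-probability; that is an illustrative assumption, not a description of the SAM code. The function name and the example probabilities are hypothetical. Only the weights 1, 0.1, and 0.3 come from the text.

```python
import math

# Track weights quoted in the text: amino-acid track 1,
# backbone alphabets 0.1, burial alphabets 0.3.
AA_WEIGHT = 1.0
STRUCT_WEIGHTS = {"backbone": 0.1, "burial": 0.3}


def two_track_log_score(p_aa, p_struct, kind):
    """Weighted log emission score for one position (hypothetical helper).

    p_aa     -- probability the state assigns to the observed amino acid
    p_struct -- probability the state assigns to the observed
                local-structure letter
    kind     -- "backbone" or "burial", selecting the structure-track weight

    Assumption: a track weight multiplies that track's log-probability.
    """
    return (AA_WEIGHT * math.log(p_aa)
            + STRUCT_WEIGHTS[kind] * math.log(p_struct))


if __name__ == "__main__":
    # Made-up probabilities, only to show the effect of the two weights.
    print(two_track_log_score(0.2, 0.5, "backbone"))
    print(two_track_log_score(0.2, 0.5, "burial"))
```

With these weights the amino-acid track dominates, and a burial alphabet influences the score three times as strongly as a backbone alphabet for the same predicted probability.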
We use these HMMs to score a template library of about 14000 (t06), 16000 (t04), or 18000 (t2k) templates. The template libraries are expanded weekly, but old template HMMs are not rebuilt. The target HMMs are used to score consensus sequences for the templates, to get a cheap approximation of profile-profile scoring, which does not yet work in the SAM package.

We also used single-track HMMs to score not just the template library, but a non-redundant copy of the entire PDB. This scoring is done with real sequences, not consensus sequences.

All the target HMMs use a new calibration method that provides more accurate E-values than before, and can be used even with local-structure alphabets that used to give us trouble (such as protein blocks).

One-track HMMs built from the template library multiple alignments were used to score the target sequence. Later this summer, we hope to be able to use multi-track template HMMs, but we have not had time to calibrate such models while keeping the code compatible with the old libraries, so the template libraries currently use old calibrations, with somewhat optimistic E-values.

All the logs of E-values were combined in a weighted average (with rather arbitrary weights, since we still have not taken the time to optimize them), and the best templates ranked.

Alignments of the target to the top templates were made using several different alignment settings on the SAM alignment software.

Generate fragments (short 9-residue alignments for each position) using SAM's "fragfinder" program and the 3-track HMM that tested best for alignment.

Residue-residue contact predictions are made using mutual information, pairwise contact potentials, joint entropy, and other signals combined by a neural net. Two different neural net methods were used, and the results submitted separately.

CB-CB constraints were extracted from the alignments, and a combinatorial optimization was done to choose a most-believable subset.
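The template-ranking step, in which the logs of the E-values are combined in a weighted average, can be sketched as follows. This is a minimal illustration, assuming every template has an E-value from every scoring method. The function name, the template identifiers, the E-values, and the weights are all hypothetical, since the text says only that the real weights were rather arbitrary.

```python
import math


def rank_templates(evalues, weights):
    """Rank templates by the weighted average of their log E-values.

    evalues -- {template_id: {method: E-value}}, all E-values > 0
    weights -- {method: weight}
    Returns template ids sorted best first (lowest combined score).
    Assumes every template has an E-value for every weighted method.
    """
    total_weight = sum(weights.values())
    combined = {}
    for template, per_method in evalues.items():
        weighted_logs = sum(weights[m] * math.log(per_method[m])
                            for m in weights)
        combined[template] = weighted_logs / total_weight
    return sorted(combined, key=combined.get)


if __name__ == "__main__":
    # Hypothetical E-values from three target HMMs for two templates.
    example = {
        "templateX": {"t06": 1e-20, "t04": 1e-18, "t2k": 1e-15},
        "templateY": {"t06": 1e-3, "t04": 1e-5, "t2k": 1e-2},
    }
    example_weights = {"t06": 1.0, "t04": 1.0, "t2k": 0.5}
    print(rank_templates(example, example_weights))
```

Averaging the logs rather than the raw E-values keeps one very small E-value from being swamped by a larger one, and changing the weights can change which template ranks first.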
Then the "undertaker" program (named because it originally optimized burial) is used to try to combine the alignments and the fragments into a consistent 3D model. No single alignment or parent template was used as a frozen core, though in many cases one had much more influence than the others. The alignment scores were not used by undertaker, but were used only to pick the set of alignments and fragments that undertaker would see.

The cost functions used by undertaker rely heavily on the alignment constraints, on helix and strand constraints generated from the secondary-structure predictions, and on the neural-net predictions of local properties that undertaker can measure. The residue-residue contact predictions are also given to undertaker, but have less weight. A number of built-in cost terms (breaks, clashes, burial, ...) are also included in the cost function.

The automatic script runs the undertaker-optimized model through gromacs (to fix small clashes and breaks) and repacks the sidechains using Rosetta, but these post-undertaker optimizations are not included in the server predictions. They can be used in subsequent re-optimization.

After the automatic prediction is done, we examine it by hand and try to fix any flaws that we see. This generally involves rerunning undertaker with new cost functions, increasing the weights for features we want to see and decreasing the weights where we think the optimization has gone overboard. Sometimes we will add new templates or remove ones that we think are misleading the optimization process. We often do "polishing" runs, where all the current models are read in and optimization with undertaker's genetic algorithm is done with high crossover.
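The hand reweighting described above can be pictured as changing the coefficients of a weighted sum of cost terms. The sketch below is a toy, assuming the total cost is a simple weighted sum. Undertaker's actual cost functions, weights, and data structures are not given in this text, so every name and number here is hypothetical.

```python
def total_cost(model, cost_terms, weights):
    """Weighted sum of cost terms for one model (toy illustration).

    model      -- any object the cost terms understand (here a dict)
    cost_terms -- {name: function(model) -> float}
    weights    -- {name: weight}
    """
    return sum(weights[name] * term(model)
               for name, term in cost_terms.items())


def best_model(models, cost_terms, weights):
    """Return the name of the model with the lowest total cost."""
    return min(models,
               key=lambda name: total_cost(models[name], cost_terms, weights))


if __name__ == "__main__":
    # Hypothetical models summarized by counts of chain breaks and clashes.
    models = {
        "modelA": {"breaks": 0, "clashes": 3},
        "modelB": {"breaks": 1, "clashes": 0},
    }
    cost_terms = {
        "breaks": lambda m: m["breaks"],
        "clashes": lambda m: m["clashes"],
    }
    # Penalizing breaks heavily favors modelA.
    print(best_model(models, cost_terms, {"breaks": 10.0, "clashes": 1.0}))
    # Raising the clash weight instead favors modelB.
    print(best_model(models, cost_terms, {"breaks": 1.0, "clashes": 5.0}))
```

The point of the toy is only that the same set of candidate models can yield a different optimum once the weights are changed, which is what the hand-tuning step exploits.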
Some improvements in undertaker include better communication with SCWRL for initial model building from alignments (now using the standard protocol that identical residues have fixed rotamers, rather than being reoptimized by SCWRL), more cost functions based on the neural net predictions, multiple constraint sets (for easier weighting of the importance of different constraints), and some new conformation-change operators (Backrub and BigBackrub).

We also created model-quality-assessment methods for CASP8, which we are applying to the server predictions. We do two optimizations from the top 10 models with two of the MQA methods, and consider these models as possible alternatives to our natively-generated models.

Although T0472 has fairly easy homology to 3bidA, it was not a trivial target to model. The 3bid structure has strand-swapped dimers, and T0472 has two copies of the 3bid monomer. It is not long enough, however, to contain a full strand-swapped dimer from 3bid, and the ends of the monomers there are not close enough to make a good single chain.

My main line of reasoning was that the C-terminal strand would still exist and swap to the N-terminal domain, but that the last strand of the first domain would simply be missing, allowing a fairly easy connection between domains. I achieved this structure mainly by adding constraints to the undertaker cost function, after superimposing models of the two domains on 3bidA and 3bidB, to get the basic shape to measure constraints from.

Model 1 T0472.try10-opt3.pdb # < try9-opt3 < MQAX8-opt3 < SAM-T08-server_TS1
    This is the model I like best, with essentially no breaks or clashes, but still a compact model with the final strand where I wanted it.

Model 2 T0472.try6-opt3.pdb # < MQAX1-opt3 < BAKER-ROBETTA_TS4
    This metaserver model represents a very different approach for handling the final strand: attaching it to the C-terminal domain. It is more compatible with the secondary-structure prediction, which predicts a helix for E53-A57, but I prefer a structure that has greater similarity between the tandem repeats.

Model 3 T0472.MQAX7-opt3.pdb # < RAPTOR_TS1
    I was not certain of the phase of the last strand. This is an earlier attempt that undertaker did not manage to close. I think the final strand is off by 2.

Model 4 T0472.try5-opt3.pdb # < try4-opt3 < chimera-N1-C1-3bid
    This is a still earlier attempt to get the last strand in place, built on top of a model obtained by superimposing separately optimized domains on the 3bidAB pair. I think that I was trying to place the final strand off by 4 in phase.

Model 5 T0472.MQAU1-opt3.gromacs0.pdb # < SAM-T08-server_TS1
    This is an early metaserver prediction, before I had decided to force the C-terminal strand to swap to the first domain. It is based on the same server prediction as model 1, but does not have the benefit of the constraints that held the configuration to look like 3bid. It is representative of the models generated by automatic methods, where human intuition was not applied.