SAM-T99 Automatic server

OVERVIEW

The SAM-T99 results, available from the CAFASP2 meta-server (http://cafasp.bioinfo.pl/), were obtained from the SAM-T99 web server (http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html). Fold recognition was performed with the SAM-T99 method, run with SAM version 3.1 [1]. SAM-T99 is a refinement of SAM-T98 [3], which this group developed for CASP3 [7] and tested for superfamily recognition [4]. Both methods attempt to find and multiply align a set of homologs to a given sequence, then create an HMM from that multiple alignment. The SAM-T99 method generates a multiple alignment which is used for two further tasks: fold recognition and secondary-structure prediction.

SAM-T99 ALIGNMENT GENERATION

The initial step uses BLASTP to search NCBI's non-redundant protein database, NR, with two different thresholds to produce a set of very close homologs and a set of possible homologs. The method then uses multiple iterations of a selection, training, and alignment procedure. Each iteration involves an initial alignment, a set of search sequences, a threshold value, and a transition regularizer.

The first iteration uses the target sequence as the initial alignment and the close homologs found by BLASTP as the search set. The threshold is set very strictly, so that only good matches to the sequence are considered. This iteration uses a transition regularizer that was intended to match the gap costs used by BLASTP. On subsequent iterations the input alignment is the output from the previous iteration, the search set is the larger set of possible homologs found by BLASTP, and the thresholds are gradually loosened. The second through second-from-last iterations use a "long-match" transition regularizer, and the final iteration uses a transition regularizer trained on FSSP alignments.

FOLD RECOGNITION

Given the SAM-T99 multiple alignment, a set of sequence weights is determined.
Next, modelfromalign is used to build the model from the alignment and the sequence weights. Finally, hmmscore performs a local, all-paths scoring of PDB sequences, using a reversed-sequence normalization feature. A library of about 3000 template HMMs was created from PDB sequences using SAM-T99, and the target sequence was scored with each of these models. When both the target-model score and the template-model score are available for a template, they are averaged, and the E-value is recalculated for the averaged score. Because of naming differences between the non-redundant list of PDB sequences and the HMM model library, some bidirectional hits were not properly averaged. The sequence-weighting method [3] combines the Henikoffs' scheme [5], Dirichlet mixtures [6], and an entropy method to set the final weights.

FOLD RECOGNITION SUBMISSION CRITERIA FOR CAFASP2

The submitted alignments are a subset of the SAM-T99 results available from the CAFASP2 meta-server. The CAFASP2 meta-server results are for a query with a loose threshold E-value of 50.0, so not all of the database hits shown on the meta-server were submitted as official SAM-T99 predictions for evaluation. By the CAFASP2 rules, up to 5 of the top database hits could be submitted for each target. In choosing how many of the top hits to submit, we used the following criteria:

1) the top hit is always submitted
2) any of the top 5 hits with an E-value of at most 1 is submitted
3) any of the top 5 hits with an E-value of at most twice the E-value of the top hit is submitted

With these criteria, we submitted only the top hit for 35 targets, 2 models for 6 targets, and 3 models for 2 targets.

SECONDARY STRUCTURE PREDICTION

Three separate secondary structure predictors were used, and their outputs were averaged. The most accurate of the predictors in tests is a 4-layer (3 hidden-layer) neural network.
The exact neural network used may have changed over the course of the summer (check the individual predictions for details), but the changes were relatively minor. This method is expected to have approximately 77-78% accuracy.

The other two predictors used hidden Markov models. One is a non-profile HMM that models all proteins. It emits both amino acids and secondary-structure codes, and the probability of each secondary-structure code at a given position in the sequence is taken as the sum over all alignments of P(code | state) * P(state | position in sequence). This method is expected to have approximately 72-74% accuracy.

The other predictor uses the target profile HMM to align a large set of template sequences with known secondary structure, and assigns probabilities to the structure codes for each match position according to the probability of aligning template sequences with that code in that position. This method is expected to have about 72% accuracy.

Since all three methods produce a probability vector over the 3-letter EHL alphabet at each position, they were combined by averaging the three vectors. In some preliminary tests, this gave only very small improvements in accuracy over the neural network alone.

References

[1] R. Hughey and A. Krogh, CABIOS 12(2):95-107, 1996. http://www.cse.ucsc.edu/research/compbio/sam.html.
[2] K. Karplus, K. Sjolander, C. Barrett, M. Cline, D. Haussler, R. Hughey, L. Holm, and C. Sander, Proteins: Structure, Function, and Genetics, Suppl. 1:134-9, 1997.
[3] K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14(10):846-856, 1998.
[4] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia, http://cyrah.med.harvard.edu/~jong/assess_final.html, 1998.
[5] S. Henikoff and J. C. Henikoff, JMB 243:574-578, Nov 1994.
[6] K. Sjolander, K. Karplus, M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, and D. Haussler, CABIOS 12(4):327-345, 1996.
[7] K. Karplus, C. Barrett, M. Cline, M. Diekhans, L. Grate, and R. Hughey. Predicting protein structure using only sequence information. Proteins, Suppl. 3:121-5, 1999.
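
ILLUSTRATIVE SKETCH

The two simplest procedures described above, the three CAFASP2 submission criteria and the averaging of the three secondary-structure probability vectors, can be sketched in a few lines of Python. This is a minimal illustration and not the code used by the server. The function names, the (template id, E-value) tuple representation, and the assumption that the "top" hits are those with the lowest E-values are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation of a database hit: (template id, E-value).
Hit = Tuple[str, float]


def select_submissions(hits: Sequence[Hit], max_models: int = 5) -> List[Hit]:
    """Apply the three submission criteria to a list of hits.

    Assumption: hits are ranked by ascending E-value, and only the
    top `max_models` hits are eligible (5 under the CAFASP2 rules).
    """
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: h[1])[:max_models]
    top_evalue = ranked[0][1]
    selected = [ranked[0]]  # criterion 1: the top hit is always submitted
    for hit in ranked[1:]:
        evalue = hit[1]
        # criterion 2: E-value of at most 1
        # criterion 3: E-value of at most twice that of the top hit
        if evalue <= 1.0 or evalue <= 2.0 * top_evalue:
            selected.append(hit)
    return selected


def average_predictions(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average several probability vectors over the E, H, L alphabet
    for a single sequence position (one vector per predictor)."""
    n = len(vectors)
    width = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(width)]
```

For instance, if the top hit has an E-value of 10 and the second hit an E-value of 15, both are selected under criterion 3, while a third hit with an E-value of 25 is not. Sorting inside `select_submissions` is a convenience so that callers need not pre-rank the hits.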