Extracting and formatting by script

Steps for collecting and formatting a SAM_T99 server prediction:

1) Get the server results for the target sequence as a text file from
   the CAFASP2 meta-server at

       http://cafasp.bioinfo.pl/target/

   Using the results stored on the CAFASP meta-server helps ensure that
   the results we submit are the same as the results on the meta-server.
   Save the file in the directory from.cafasp. The naming convention
   used so far is

       from.cafasp/cafasp.samt99.t????.txt

   where ???? is the target number.

2) Make a directory to hold the formatted results. Run the script
   prep-to-submit.perl to extract the database hits and secondary
   structure prediction from the meta-server text file. The script
   should be run as

       prep-to-submit.perl \
           -runname \
           -fromserver \
           -method \
           -registration \
           -scopdata \
           -checktarget \
           -targetoffset \                 #default value of 0
           -filtertheomodels [on|off] \    #default value of on
           -filterscop [on|off] \          #default value of on
           -maxmodels max_models           #default value of 5

   The -fromserver argument should be the raw server output file from
   the meta-server obtained in step 1.

   The -method argument should be a file documenting the methods used
   to generate the prediction. The standard file is methods.fold.

   The -registration argument should be a file containing the
   registration code. The casp4 registration code for SAM_T99 is
   2000-8506-6614, and is in the file registration.fold in this
   directory.

   The -scopdata argument should be a file containing the scop
   classifications of all pdb structures. The file we will be using is

       /projects/compbiodata/scop/dir.dom.scop.txt_1.50

   The -checktarget argument should be a file containing a copy of the
   target sequence that is known to be good. This file is used to check
   that the target sequence has not been corrupted in any of the
   pairwise alignments. For casp4, these known good files should be in

       /projects/compbio/experiments/casp4/t????/T????.seq

   where ???? is the target number.
   The -targetoffset argument adjusts the residue indexes when the
   target indexes should start at something other than 1. For example,
   T0100 has its first 24 residues truncated, so this argument should
   be 24 for T0100.

   The -filtertheomodels argument enables or disables filtering out
   hits that are theoretical models. Similarly, -filterscop enables or
   disables filtering out redundant scop superfamilies. By default,
   both are set to "on".

   The -maxmodels argument sets a limit on how many models are
   submitted. The default is 5.

   An example of the script's use is:

       ../prep-to-submit.perl \
           -runname t0090 \
           -fromserver ../from.cafasp/cafasp.samt99.t0090.txt \
           -method ../methods.fold \
           -registration ../registration.fold \
           -scopdata /projects/compbiodata/scop/dir.dom.scop.txt_1.50 \
           -checktarget /projects/compbio/experiments/casp4/t90/T0090.seq

   The output on stderr and stdout is the table of database hits
   extracted from the meta-server raw output. Listed with the pdb id of
   each hit are its e-value, its scop classification, and whether the
   hit is to be submitted or rejected, along with the reason why.

   Among the outputs of the script will be the file .submit, which
   contains the file to email to the Prediction Center.

3) Check the summary and any warning messages output by the script.
   If you use the theoretical model filter, you may get warning
   messages for hits whose pdb files have no EXPDTA record. These may
   need to be checked by hand to see whether they contain real
   experimental models. The summary lists all the hits and whether
   they should be submitted or rejected, along with the reason why.
   Check that the files ..a2m.pa and ..al2fasta.a2m have essentially
   the same alignment.

4) Mail the .submit.al files and the .submit.ss file to

       submit@predictioncenter.llnl.gov

Manual Formatting

If you need to run the procedure by hand because of something the
prep-to-submit.perl script cannot handle, follow the CASP3 README for
submission procedures.
The procedure has changed slightly since CASP3: some of the formatting
programs take additional command line arguments.

The argument -checktarget should be given to the fasta2al and al2fasta
programs if target sequence checking is needed. Its value should be
the filename of the target sequence in fasta format. For CASP3, the
programs looked for this file automatically in the pce/casp3
subdirectories; it is now given on the command line.

The format-prediction script now accepts three additional command line
arguments:

    -registration
    -model            optional, default 1
    -targetoffset     optional, default 0
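
Returning to the scripted procedure: the file naming conventions in steps 1 and 2 are easy to mistype, so it can help to generate the command line from the target number. The following is only a sketch, not part of the existing tools. The function name build_prep_command is hypothetical, it prints the command rather than running it, and it assumes you are working in a results directory one level below this one (hence the ../ paths, as in the t0090 example). The -checktarget file is passed in explicitly rather than derived.

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of the real tool set):
# print, but do not execute, the prep-to-submit.perl command for one target.
#
#   $1  four-digit target number, e.g. 0090
#   $2  known-good target sequence file for -checktarget
#   $3  optional -targetoffset value (the script's own default is 0)
build_prep_command() {
    target=$1
    checktarget=$2
    offset=${3:-0}

    # Reject anything that is not exactly four digits.
    case $target in
        [0-9][0-9][0-9][0-9]) ;;
        *) echo "target must be four digits, e.g. 0090" >&2
           return 1 ;;
    esac

    # Assumption: invoked from a results directory one level below the
    # directory holding the script, methods.fold and registration.fold.
    echo "../prep-to-submit.perl" \
        "-runname t${target}" \
        "-fromserver ../from.cafasp/cafasp.samt99.t${target}.txt" \
        "-method ../methods.fold" \
        "-registration ../registration.fold" \
        "-scopdata /projects/compbiodata/scop/dir.dom.scop.txt_1.50" \
        "-checktarget ${checktarget}" \
        "-targetoffset ${offset}"
}
```

For example, build_prep_command 0090 /projects/compbio/experiments/casp4/t90/T0090.seq prints the example command from step 2 on a single line, with -targetoffset 0 made explicit. For T0100, pass 24 as the third argument. Review the printed command before running it.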