RNACAD allows you to take a training structural multiple alignment with a consensus secondary structure and create a stochastic context-free grammar (SCFG) that can classify new sequences as being similiar to the sequences in the multiple alignment, create multiple alignments of new sequences, and predict secondary structure of new sequences.
SCFGs model both the primary and secondary structure information contained in the training structural multiple alignment.
RNACAD modeling starts with the following input files:
| >trna_1
GCCAAGGTGGCAGAGTTCGGCCTAACGCGGCGGCCTGCAGAGCCGCTC----ATCGCCGGTTCAAATCCGGCCCTTGGCT--- >trna_1 GCCAAGGTGGCAGAGTTCGGCCTAACGCGGCGGCCTGCAGAGCCGCTC----ATCGCCGGTTCAAATCCGGCCCTTGGCT--- >trna_2 GGGCGTGTGGCGTAGTC-GGT--AGCGCGCTCCCTTAGCATGGGAGAG----GTCTCCGGTTCGATTCCGGACTCGTCCA--- >trna_3 -GCCCCATCGTCTAGA--GGCCTAGGACACCTCCCTTTCACGGAGGCG----A-CGGGGATTCGAATTCCCCTGGGGGTA--- >trna_4 GGCGGCATAGCCAAGC--GGT--AAGGCCGTGGATTGCAAATCCTCTA----TTCCCCAGTTCAAATCTGGGTGCCGCCT--- >trna_5 GTCTGATTAGCGCAACT-GGC--AGAGCAACTGACTCTTAATCAGTGG----GTTGTGGGTTCGATTCCCACATCAGGCACCA >trna_6 GGGCGAATAGTGTCAGC-GGG--AGCACACCAGACTTGCAATCTGGTA----G-GGAGGGTTCGAGTCCCTCTTTGTCCACCA >trna_7 GGGGCTATAGTTTAACT-GGT--AAAACGGCGATTTTGCATATCGTTA----T-TTCAGGATCGAGTCCTGATAACTCCA--- >trna_8 AGCTTTGTAGTTTATGTG-----AAAATGCTTGTTTGTGATATGAGTGAAAT--------------------TGGAGCTT--- |
| >trna_1 1/75
>trna_1 2/74 >trna_1 3/73 >trna_1 4/72 >trna_1 5/71 >trna_1 6/70 >trna_1 7/69 >trna_1 10/28 >trna_1 11/27 >trna_1 12/26 >trna_1 13/25 >trna_1 30/46 >trna_1 31/45 >trna_1 32/44 >trna_1 33/43 >trna_1 34/42 >trna_1 52/68 >trna_1 53/67 >trna_1 54/66 >trna_1 55/65 >trna_1 56/64 |
| >trna_1
GCCAAGGTGGCAGAGTTCGGCCTAACGCGGCGGCCTGCAGAGCCGCTCATCGCCGGTTCAAATCCGGCCCTTGGCT |
With these inputs that describe the multiple alignment and consensus secondary structure we can do SCFG modeling.
runtRNA.exe
The software turns the multiple alignment and secondary structure into a SCFG that can search databases and create multiple alignments of sequences that are close to the training set in both primary and secondary structure information. The script assumes that you have RNACAD executable files in your path.
RNACAD is compute intensive. The program will take about 19 Megabytes of RAM and about four minutes for each full parse (about 8 minutes total for both iterations) on an Alpha EV6 (21264) processor at 500 MHz.
This script will output some text when it is running. You can largely ignore this. Any messages like "Did not understand # S:branch" are normal. Also without certain HMM packages, the script might try to remove files that were not created.
After this is complete, you can use the grammar for several purposes.
tdparse.exe result.tdgrammar.2 new.seq /dev/null > new.parse
RNACAD is compute intensive and this will take a couple of minutes to finish. Now pull out the scores from the resulting parses:
grep ">" new.parse
which gives:
|
>DradioLys 76 -56.092 71.0763 >Junk 76 -132.433 -39.0603 |
The last number on the line is the score of the sequence. A high number indicates a good fit to the model. The sequence DradioLys is a tRNA and the sequence Junk is not.
(Technical Note: Using the grammar result.tdgrammar.2 implies that you want a global alignment of the sequence. This means that the entire sequences is treated as a single tRNA. This is not what you what when you are examining arbitrary sequences that might contain a tRNA sequence in the middle with junk sequence on the beginning and end. To handle this case you want to use the grammar result.tdgrammar.FIM.2. This grammar contains "Free Insertion Modules" on the beginning and end of the model that can account for possible junk sequence.)
align.exe new.parse
This gives:
|
>struct -(((([[[[[[[[[(((--(((([--------]]]]]]]]]]]]]]]]))))-(((((-------)))))-----(((((-------))))))))]]]]]]]))))-.... >DradioLys -GGGT.........CGTTAGCTC.AATTGGCA................GAGCAGCTGACTCTTAATCAGCGGGTTGTAGGTTCGATTCCTACACG.......ACCCAcca. >Junk -AGCTtttcattctGACTG----cAACGGGCAatatgtctctgtgtgg----A-----TTAAAAA-----AAG-----------AGT-----GTCtgatagcAGCTTctga |
DradioLys aligns well to the model but Junk fits very badly deleting and inserting many positions.
To align these two sequences with the training set just concatenate the parse files together and align that file. Because the same grammar generated the parses, a consistent multiple alignment is created.
cat result.parse.2 new.parse > newall.parse
align.exe newall.parse
parsetobp.exe new.parse
this gives a listing of sequence name, nonterminal name, the 5' and 3' side of the basepair with the indices repeated, and the identify of the basepair. The first line here represents that for the sequence DradioLys the nonterminal helix_3_hm_1 generated a GC basepair at positions 1 and 72.
|
>DradioLys helix_3_hm_1 1 1 72 72 helix /G/C/ >DradioLys helix_3_hm_2 2 2 71 71 helix /G/C/ >DradioLys helix_3_hm_3 3 3 70 70 helix /G/C/ >DradioLys helix_3_hm_4 4 4 69 69 helix /T/A/ >DradioLys helix_3_hm_5 5 5 68 68 helix /C/G/ >DradioLys helix_3_hm_6 6 6 67 67 helix /G/C/ >DradioLys helix_3_hm_7 7 7 66 66 helix /T/A/ >DradioLys helix_0_hm_1 10 10 25 25 helix /G/C/ >DradioLys helix_0_hm_2 11 11 24 24 helix /C/G/ >DradioLys helix_0_hm_3 12 12 23 23 helix /T/A/ >DradioLys helix_0_hm_4 13 13 22 22 helix /C/G/ >DradioLys helix_1_hm_1 27 27 43 43 helix /G/C/ >DradioLys helix_1_hm_2 28 28 42 42 helix /C/G/ >DradioLys helix_1_hm_3 29 29 41 41 helix /T/A/ >DradioLys helix_1_hm_4 30 30 40 40 helix /G/C/ >DradioLys helix_1_hm_5 31 31 39 39 helix /A/T/ >DradioLys helix_2_hm_1 49 49 65 65 helix /G/C/ >DradioLys helix_2_hm_2 50 50 64 64 helix /T/A/ >DradioLys helix_2_hm_3 51 51 63 63 helix /A/T/ >DradioLys helix_2_hm_4 52 52 62 62 helix /G/C/ >DradioLys helix_2_hm_5 53 53 61 61 helix /G/C/ >Junk helix_3_hm_1 1 1 71 71 helix /A/T/ >Junk helix_3_hm_2 2 2 70 70 helix /G/C/ >Junk helix_3_hm_3 3 3 69 69 helix /C/G/ >Junk helix_3_hm_4 4 4 68 68 helix /T/A/ >Junk helix_3_hm_5 14 14 60 60 helix /G/C/ >Junk helix_3_hm_6 15 15 59 59 helix /A/T/ >Junk helix_3_hm_7 16 16 58 58 helix /C/G/ |
| >struct
-AAAAAAA--AAAA----.---..-BBBB-AAAAA-------BBBBB-----AAAAA-------BBBBBBBBBBBbB-... >trna_1 -GCCAAGGTGGCAGAGTTcGGCctAACGCGGCGGCCTGCAGAGCCGCTCATCGCCGGTTCAAATCCGGCCCTTGG.CT... >trna_2 -GGGCGTGTGGCGTAGTC.GGT..AGCGCGCTCCCTTAGCATGGGAGAGGTCTCCGGTTCGATTCCGGACTCGTC.CA... >trna_3 --GCCCCATCGTCTAGA-.GGCctAGGACACCTCCCTTTCACGGAGGCGA-CGGGGATTCGAATTCCCCTGGGGGt-A... >trna_4 -GGCGGCATAGCCAAGC-.GGT..AAGGCCGTGGATTGCAAATCCTCTATTCCCCAGTTCAAATCTGGGTGCCGC.CT... >trna_5 -GTCTGATTAGCGCAACT.GGC..AGAGCAACTGACTCTTAATCAGTGGGTTGTGGGTTCGATTCCCACATCAGG.CAcca >trna_6 -GGGCGAATAGTGTCAGC.GGG..AGCACACCAGACTTGCAATCTGGTAG-GGAGGGTTCGAGTCCCTCTTTGTC.CAcca >trna_7 -GGGGCTATAGTTTAACT.GGT..AAAACGGCGATTTTGCATATCGTTAT-TTCAGGATCGAGTCCTGATAACTC.CA... >trna_8 -AGCTTTGTAGTTTATGT.---g.AAAATGCTTGTTTGTGATATGAGTGA-----------AAT-----TGGAGC.TT... |
| >trna_1 76 -54.4812 73.4002
>trna_2 73 -42.666 84.4459 >trna_3 72 -54.4642 65.4248 >trna_4 72 -43.6463 81.0317 >trna_5 76 -53.2099 75.2343 >trna_6 75 -50.6107 76.9842 >trna_7 72 -44.5877 79.6736 >trna_8 55 -52.2669 34.5948 |
Below we show the distribution on the first basepair that is extracted from the tdgrammar file. Notice that GC has 0.77 probability and AT has 0.14 with the other basepairs having much lower probability. The grammar contains a lot of information describing all the distributions of loop and helix modeling.
| AA 0.00231532 AG 0.00255984 AC 0.001488870 AT
0.141798000
GA 0.00290563 GG 0.00163450 GC 0.767714000 GT 0.010205500 CA 0.00155117 CG 0.02913770 CC 0.000537313 CT 0.000942179 TA 0.02225780 TG 0.01061660 TC 0.001345760 TT 0.002990660 |
S:> loop_0 helix_3:( loop_1 helix_0:( loop_2 helix_0:) loop_3 helix_1:( loop_4 helix_1:) loop_5 helix_2:( loop_6 helix_2:) helix_3:) loop_7 loop_0:= LOOP 1 helix_3:= BULGEHELIX 7 loop_1:= LOOP 2 helix_0:= BULGEHELIX 4 loop_2:= LOOP 8 loop_3:= LOOP 1 helix_1:= BULGEHELIX 5 loop_4:= LOOP 7 loop_5:= LOOP 5 helix_2:= BULGEHELIX 5 loop_6:= LOOP 7 loop_7:= LOOP 1 |
This is a description of secondary structure with symbolic names. The special modifiers ":(" and ":)" represent the 5' and 3' sides of a helix. The modifier ":>" defines high level structure. The modifier ":=" defines low level structures such as loops and helices. The numbers in low level structures give the average length of the structure. Here helix_3 represents the acceptor stem and helix_1 is the anti-codon stem. The anti-codon is contained in loop_4. You can therefore determine the anti-codon of a trna by parsing it and searching for what bases the nonterminals starting with loop_4 derive.
File |
Meaning |
| correct.ma | Correct tRNA multiple alignment with basepairing indicated. |
| new.parse | Parse of the new sequences created in the tutorial. |
| new.seq | Sequences to be used in the tutorial. |
| newall.parse | Parses of training sequences and the new sequences created in tutorial. |
| result.align.0
result.align.1 result.align.2 |
Alignment of the training set for the three iterations of training. |
| result.constraints.0
result.constraints.1 result.constraints.2 |
Constraints on the parses for three iterations. If you do not have the HMM package the .1 and .2 files will have zero length. |
| result.hmm.0.mod
result.hmm.1.mod result.hmm.2.mod |
HMMs for the training set for three iterations. Will not be present without the HMM package. |
| result.meta | A high-level description of the tRNA multiple alignment created by the training script. |
| result.parse.0
result.parse.1 result.parse.2 |
Parses of the training set for three iterations. |
| result.seq | The training set sequences pulled from the training multiple alignment. |
| result.tdgrammar | A grammar with garbage probabilities. |
| result.tdgrammar.0
result.tdgrammar.1 result.tdgrammar.2 |
Estimated grammars for the three iterations. |
| result.tdgrammar.FIM.2 | The final grammar with FIMs that allow junk sequence on the begining and the end. |
| result.tdgrammar.null | A grammar with prior probabilities (estimated with NO sequences, just the prior) |
| runtRNA.exe | The script that trains the grammar |
| trna.ma | Input multple alignment of tRNA. |
| trna_1.bp | Input list of basepairs for the typical sequences that is used to infer basepairs for the multiple alignment. |
| trna_1.constraints | A list of constraints on the typical sequence that forces the parse to correctly account for basepairs. |
| trna_1.parse | A parse of the typical sequence that correctly accounts for basepairs. |
| trna_1.seq | Input primary sequence of the typical sequence. |
| tutorial_tRNA.html | This tutorial file. |