RNACAD t-RNA tutorial

This tutorial models t-RNA using a stochastic context-free grammar starting with a structural multiple alignment.

RNACAD allows you to take a training structural multiple alignment with a consensus secondary structure and create a stochastic context-free grammar (SCFG) that can classify new sequences as being similiar to the sequences in the multiple alignment, create multiple alignments of new sequences, and predict secondary structure of new sequences.

SCFGs model both the primary and secondary structure information contained in the training structural multiple alignment.






Inputs:


RNACAD modeling  starts with the following input files:
 

The system starts with a multiple alignment of eight t-RNA with a typical sequence heading the multiple alignment.  The typical sequence is only used to infer secondary structure for the alignment and is discarded. In this case, the sequence  trna_1 is the typcial sequence and is listed first (this is discarded) and is followed by a copy of itself  (which is not discarded). The file trna.ma contains this alignment:
>trna_1
GCCAAGGTGGCAGAGTTCGGCCTAACGCGGCGGCCTGCAGAGCCGCTC----ATCGCCGGTTCAAATCCGGCCCTTGGCT---
>trna_1
GCCAAGGTGGCAGAGTTCGGCCTAACGCGGCGGCCTGCAGAGCCGCTC----ATCGCCGGTTCAAATCCGGCCCTTGGCT---
>trna_2
GGGCGTGTGGCGTAGTC-GGT--AGCGCGCTCCCTTAGCATGGGAGAG----GTCTCCGGTTCGATTCCGGACTCGTCCA---
>trna_3
-GCCCCATCGTCTAGA--GGCCTAGGACACCTCCCTTTCACGGAGGCG----A-CGGGGATTCGAATTCCCCTGGGGGTA---
>trna_4
GGCGGCATAGCCAAGC--GGT--AAGGCCGTGGATTGCAAATCCTCTA----TTCCCCAGTTCAAATCTGGGTGCCGCCT---
>trna_5
GTCTGATTAGCGCAACT-GGC--AGAGCAACTGACTCTTAATCAGTGG----GTTGTGGGTTCGATTCCCACATCAGGCACCA
>trna_6
GGGCGAATAGTGTCAGC-GGG--AGCACACCAGACTTGCAATCTGGTA----G-GGAGGGTTCGAGTCCCTCTTTGTCCACCA
>trna_7
GGGGCTATAGTTTAACT-GGT--AAAACGGCGATTTTGCATATCGTTA----T-TTCAGGATCGAGTCCTGATAACTCCA---
>trna_8
AGCTTTGTAGTTTATGTG-----AAAATGCTTGTTTGTGATATGAGTGAAAT--------------------TGGAGCTT---
The system is given the secondary structure of a typical sequence. In this case the basepairing information for the sequence  trna_1 is given in the file trna_1.bp:
>trna_1 1/75
>trna_1 2/74
>trna_1 3/73
>trna_1 4/72
>trna_1 5/71
>trna_1 6/70
>trna_1 7/69
>trna_1 10/28
>trna_1 11/27
>trna_1 12/26
>trna_1 13/25
>trna_1 30/46
>trna_1 31/45
>trna_1 32/44
>trna_1 33/43
>trna_1 34/42
>trna_1 52/68
>trna_1 53/67
>trna_1 54/66
>trna_1 55/65
>trna_1 56/64
The format of this file contains lines that start with the sequence name and then i/j indicating that bases i and j are basepaired for that sequence numbering from one. RNACAD takes this structure as the consensus secondary structure and applies it to the multiple alignment to do modeling. Note that pseudoknots (interactions that are not well-nested ) are not allowed in SCFG modeling. The program expects the primary sequence of the typical sequence. In this case it is given in trna_1.seq:
>trna_1
GCCAAGGTGGCAGAGTTCGGCCTAACGCGGCGGCCTGCAGAGCCGCTCATCGCCGGTTCAAATCCGGCCCTTGGCT

With these inputs that describe the multiple alignment and consensus secondary structure we can do SCFG modeling.







 

Modeling

With these three files in place, RNACAD modeling can begin. Issue the command:

runtRNA.exe

The software turns the multiple alignment and secondary structure into a SCFG  that can search databases and create multiple alignments of sequences that are close to the training set in both primary and secondary structure information. The script assumes that you have RNACAD executable files in your path.

RNACAD is compute intensive. The program will take about 19 Megabytes of RAM and  about four minutes for each full parse (about 8 minutes total for both iterations) on an Alpha EV6 (21264) processor at 500 MHz.

This script will output some text when it is running. You can largely ignore this. Any messages like "Did not understand # S:branch" are normal. Also without certain HMM packages, the script might try to remove files that were not created.

After this is complete, you can use the grammar for several purposes.







 

Classify new sequences as tRNA or non-tRNA

With the estimated SCFG in result.tdgrammar.2, let's parse the new sequences in the file new.seq with no constraints (/dev/null):

tdparse.exe result.tdgrammar.2 new.seq /dev/null > new.parse

RNACAD is compute intensive and this will take a couple of minutes to finish. Now pull out the scores from the resulting parses:

grep ">" new.parse

which gives:

>DradioLys 76 -56.092 71.0763
>Junk 76 -132.433 -39.0603

The last number on the line is the score of the sequence. A high number indicates a good fit to the model. The sequence DradioLys is a tRNA and the sequence Junk is not.

(Technical Note:  Using the grammar result.tdgrammar.2 implies that you want a global alignment of the sequence. This means that the entire sequences is treated as a single tRNA. This is not what you what when you are examining arbitrary sequences that might contain a tRNA sequence in the middle with junk sequence on the beginning and end. To handle this case you want to use the grammar result.tdgrammar.FIM.2. This grammar contains "Free Insertion Modules" on the beginning and end of the model that can account for possible junk sequence.)



 

Create multiple alignments of sequences

Now that you have parsed the new sequences to get their probabilities, you can create multiple alignments of them using:

align.exe new.parse

This gives:
 

>struct
-(((([[[[[[[[[(((--(((([--------]]]]]]]]]]]]]]]]))))-(((((-------)))))-----(((((-------))))))))]]]]]]]))))-....
>DradioLys
-GGGT.........CGTTAGCTC.AATTGGCA................GAGCAGCTGACTCTTAATCAGCGGGTTGTAGGTTCGATTCCTACACG.......ACCCAcca.
>Junk
-AGCTtttcattctGACTG----cAACGGGCAatatgtctctgtgtgg----A-----TTAAAAA-----AAG-----------AGT-----GTCtgatagcAGCTTctga

DradioLys aligns well to the model but Junk fits very badly deleting  and inserting many positions.

To align these two sequences with the training set just concatenate the parse files together and align that file. Because the same grammar generated the parses, a consistent multiple alignment is created.

cat result.parse.2 new.parse > newall.parse

align.exe newall.parse


Predict Basepairing

To get a list of basepairs for the new sequences use:

parsetobp.exe new.parse

this gives a listing of sequence name, nonterminal name, the 5' and 3' side of the basepair with the indices repeated, and the identify of the basepair. The first line here represents that for the sequence DradioLys the nonterminal helix_3_hm_1 generated a GC basepair at positions 1 and 72.

>DradioLys helix_3_hm_1 1 1 72 72 helix /G/C/
>DradioLys helix_3_hm_2 2 2 71 71 helix /G/C/
>DradioLys helix_3_hm_3 3 3 70 70 helix /G/C/
>DradioLys helix_3_hm_4 4 4 69 69 helix /T/A/
>DradioLys helix_3_hm_5 5 5 68 68 helix /C/G/
>DradioLys helix_3_hm_6 6 6 67 67 helix /G/C/
>DradioLys helix_3_hm_7 7 7 66 66 helix /T/A/
>DradioLys helix_0_hm_1 10 10 25 25 helix /G/C/
>DradioLys helix_0_hm_2 11 11 24 24 helix /C/G/
>DradioLys helix_0_hm_3 12 12 23 23 helix /T/A/
>DradioLys helix_0_hm_4 13 13 22 22 helix /C/G/
>DradioLys helix_1_hm_1 27 27 43 43 helix /G/C/
>DradioLys helix_1_hm_2 28 28 42 42 helix /C/G/
>DradioLys helix_1_hm_3 29 29 41 41 helix /T/A/
>DradioLys helix_1_hm_4 30 30 40 40 helix /G/C/
>DradioLys helix_1_hm_5 31 31 39 39 helix /A/T/
>DradioLys helix_2_hm_1 49 49 65 65 helix /G/C/
>DradioLys helix_2_hm_2 50 50 64 64 helix /T/A/
>DradioLys helix_2_hm_3 51 51 63 63 helix /A/T/
>DradioLys helix_2_hm_4 52 52 62 62 helix /G/C/
>DradioLys helix_2_hm_5 53 53 61 61 helix /G/C/
>Junk helix_3_hm_1 1 1 71 71 helix /A/T/
>Junk helix_3_hm_2 2 2 70 70 helix /G/C/
>Junk helix_3_hm_3 3 3 69 69 helix /C/G/
>Junk helix_3_hm_4 4 4 68 68 helix /T/A/
>Junk helix_3_hm_5 14 14 60 60 helix /G/C/
>Junk helix_3_hm_6 15 15 59 59 helix /A/T/
>Junk helix_3_hm_7 16 16 58 58 helix /C/G/






Training Outputs

SCFG modeling outputs several files. The primary ones are listed here. This is how the SCFG aligns the training sequences in the file result.align.2. We can see that it agrees with the training multiple alignment so we can conclude that the SCFG has learned the information in the training set. Note that in the >struct line that "A" represents a 5' side of a basepair and "B" represents the 3' side.
>struct
-AAAAAAA--AAAA----.---..-BBBB-AAAAA-------BBBBB-----AAAAA-------BBBBBBBBBBBbB-...
>trna_1
-GCCAAGGTGGCAGAGTTcGGCctAACGCGGCGGCCTGCAGAGCCGCTCATCGCCGGTTCAAATCCGGCCCTTGG.CT...
>trna_2
-GGGCGTGTGGCGTAGTC.GGT..AGCGCGCTCCCTTAGCATGGGAGAGGTCTCCGGTTCGATTCCGGACTCGTC.CA...
>trna_3
--GCCCCATCGTCTAGA-.GGCctAGGACACCTCCCTTTCACGGAGGCGA-CGGGGATTCGAATTCCCCTGGGGGt-A...
>trna_4
-GGCGGCATAGCCAAGC-.GGT..AAGGCCGTGGATTGCAAATCCTCTATTCCCCAGTTCAAATCTGGGTGCCGC.CT...
>trna_5
-GTCTGATTAGCGCAACT.GGC..AGAGCAACTGACTCTTAATCAGTGGGTTGTGGGTTCGATTCCCACATCAGG.CAcca
>trna_6
-GGGCGAATAGTGTCAGC.GGG..AGCACACCAGACTTGCAATCTGGTAG-GGAGGGTTCGAGTCCCTCTTTGTC.CAcca
>trna_7
-GGGGCTATAGTTTAACT.GGT..AAAACGGCGATTTTGCATATCGTTAT-TTCAGGATCGAGTCCTGATAACTC.CA...
>trna_8
-AGCTTTGTAGTTTATGT.---g.AAAATGCTTGTTTGTGATATGAGTGA-----------AAT-----TGGAGC.TT...
The parse files fully describe how the grammar most likely derived the sequences including probabilities. These probabilites are pulled from the file result.parse.2 using the command grep ">" result.parse.2 . The last number is the savings in bits over the null model, higher is better fit. As you can see trna_8 fits the model but less well than the others.
 
>trna_1 76 -54.4812 73.4002
>trna_2 73 -42.666 84.4459
>trna_3 72 -54.4642 65.4248
>trna_4 72 -43.6463 81.0317
>trna_5 76 -53.2099 75.2343
>trna_6 75 -50.6107 76.9842
>trna_7 72 -44.5877 79.6736
>trna_8 55 -52.2669 34.5948
The SCFG contains the information of the multiple alignment in probabilistic form in the file result.tdgrammar.2.

Below we show the distribution on the first basepair that is extracted from the tdgrammar file. Notice that GC has 0.77 probability and AT has 0.14 with the other basepairs having much lower probability.  The grammar contains a lot of information describing all the distributions of loop and helix modeling.
AA 0.00231532 AG 0.00255984 AC 0.001488870 AT 0.141798000
GA 0.00290563 GG 0.00163450 GC 0.767714000 GT 0.010205500
CA 0.00155117 CG 0.02913770 CC 0.000537313 CT 0.000942179
TA 0.02225780 TG 0.01061660 TC 0.001345760 TT 0.002990660

The program automatically infers a high level grammar based on the structural alignment that is uses in the construction of the grammar:
S:> loop_0
helix_3:( loop_1
helix_0:( loop_2
helix_0:) loop_3
helix_1:( loop_4
helix_1:) loop_5
helix_2:( loop_6
helix_2:) helix_3:) loop_7
loop_0:= LOOP 1
helix_3:= BULGEHELIX 7
loop_1:= LOOP 2
helix_0:= BULGEHELIX 4
loop_2:= LOOP 8
loop_3:= LOOP 1
helix_1:= BULGEHELIX 5
loop_4:= LOOP 7
loop_5:= LOOP 5
helix_2:= BULGEHELIX 5
loop_6:= LOOP 7
loop_7:= LOOP 1

This is a description of secondary structure with symbolic names. The special modifiers ":(" and ":)" represent the 5' and 3' sides of a helix. The modifier ":>" defines high level structure. The modifier ":=" defines low level structures such as loops and helices. The numbers in low level structures give the average length of the structure.  Here helix_3 represents the acceptor stem and helix_1 is the anti-codon stem. The anti-codon is contained in loop_4. You can therefore determine the anti-codon of a trna by parsing it and searching for what bases the nonterminals starting with loop_4 derive.



File Contents:

Here is a list of all files in the directory:
 

File

Meaning

correct.ma  Correct tRNA multiple alignment with basepairing indicated.
new.parse  Parse of the new sequences created in the tutorial.
new.seq  Sequences to be used in the tutorial.
newall.parse  Parses of training sequences and the new sequences created in tutorial.
result.align.0
result.align.1
result.align.2
 Alignment of the training set for the three iterations of training.
result.constraints.0
result.constraints.1
result.constraints.2
 Constraints on the parses for three iterations. If you do not have the HMM package the .1 and .2 files will have zero length.
result.hmm.0.mod
result.hmm.1.mod
result.hmm.2.mod
 HMMs for the training set for three iterations. Will not be present without the HMM package.
result.meta  A high-level description of the tRNA multiple alignment created by the training script.
result.parse.0
result.parse.1
result.parse.2
 Parses of the training set for three iterations.
result.seq  The training set sequences pulled from the training multiple alignment.
result.tdgrammar  A grammar with garbage probabilities.
result.tdgrammar.0
result.tdgrammar.1
result.tdgrammar.2
 Estimated grammars for the three iterations.
result.tdgrammar.FIM.2  The final grammar with FIMs that allow junk sequence on the begining and the end.
result.tdgrammar.null  A grammar with prior probabilities (estimated with NO sequences, just the prior)
runtRNA.exe  The script that trains the grammar
trna.ma  Input multple alignment of tRNA.
trna_1.bp  Input list of basepairs for the typical sequences that is used to infer basepairs for the multiple alignment.
trna_1.constraints  A list of constraints on the typical sequence that forces the parse to correctly account for basepairs. 
trna_1.parse  A parse of the typical sequence that correctly accounts for basepairs.
trna_1.seq  Input primary sequence of the typical sequence.
tutorial_tRNA.html  This tutorial file.