==== RNACAD README ==== # ################################ # RNACAD - RNA Modeling Using Stochastic Context-Free Grammars # Copyright (C) 2000 # Author Michael P S Brown # All Rights Reserved # This source code is distributed under the terms of the GNU General # Public License. See the files COPYING and LICENSE for details. # ################################ Michael P.S. Brown http://www.cse.ucsc.edu/~mpbrown mpbrown@cse.ucsc.edu brown314@earthlink.net mpsb@hnc.com ==== RNACAD ==== RNACAD is a stochastic context-free grammar(SCFGs) RNA modeling package that accounts for both primary and secondary structure information. SCFGs are roboust and allow arbitrary deviations from a pattern. One of its primary features is the use of a constraint system that increases its running performance considerably allowing large problems to be solved efficiently. If you want to search for known RNA structure and produce multiple alignments of it with the consideration that some hits might only be distantly related, then this package might be the answer. Starting the tutorial is probably the quickest way to see what the system is capable of. While the software does have some restrictions, it has been applied to ribosomal RNA modeling and is capable of generating high quality rRNA multiple alignments. It has also been used to model tRNA, search genomes for rho-independent terminator signals, and approximately search for a simple pseudoknotted RNA structure. Other related work includes Sean Eddy's COVE and RNABOB (http://www.genetics.wustl.edu/eddy/). Michael Zuker's MFOLD (http://www.ibc.wustl.edu/~zuker/). The relase of RNACAD is influenced by Sean Eddy's HMMer (http://hmmer.wustl.edu/) (a great job! Check it out if you haven't already.) ==== Getting ==== Right now the most stable web presence for this software can be found off of the UCSC computational biology homepage: http://www.cse.ucsc.edu/research/compbio A slightly more direct route may be: http://www.cse.ucsc.edu/research/compbio/ssurrna.html ==== Installing ==== See the file INSTALL.txt. Should install and run for most UNIXs. Requires some common tools like perl, gawk, sort, rm, ... See the COPYING.txt and LICENSE.txt files. ==== Starting ==== Point your browser to the tutorial.html file in the tutorial subdirectory and start with the tRNA tutorial. Documentation is in the doc subdirectory. ==== Roadmap ==== ./bin contains the calling programs after the package has been installed. ./doc contains man pages for programs in html and a large postscript or pdf document describing the system ./src source files for the system ./tutorial short tutorials to see what the system can do. After learning Java by simply following the tutorials, I strongly recomend them to quickly learn the system. ==== What it can do ==== At a high level: ---- It can do database searches for RNA secondary structure. It can produce structural multiple alignments of a set of structurally related sequences. It can take a set of sequences and a secondary structure and produce a model that reflects the secondary structure and the statistics of the sequences. It can take a secondary structure descriptor (much like RNAMOT or RNABOB) and produce a model that reflects the secondary structure with the statistics of "average" RNA structure. More specifically: ---- It can take a grammar and a string and return the most likely parse with its corresponding probability. It can take a grammar and a set of strings and estimate a new grammar based on the strings and a rich prior that accounts for different structures such as basepairs, singletons, helix bulges, loop inserts, and deletes. It can take a set of parses and return a multiple alignment. It can take a high level description of secondary structure and produce a SCFG that accurately models it. It can take a multiple alignment and a secondary structure for one of the strings and return an estimated grammar that reflects the given secondary structure. ==== What it doesn't do ==== RNACAD does not estimate the most likely secondary structure for a set of raw sequences. This turns out to be a difficult problem. The system must be presented with a likely secondary struture. With this it can tune a model around this secondary structure as best it can and answer whether a new string has this secondary structure. Thus given a set of t-RNA sequences and a secondary structure describing t-RNA's cloverleaf structure, a SCFG can be estimated that can very accurately discriminate between t-RNA and non t-RNA and produce structural multiple alignments of t-RNA. However it cannot take a set of raw t-RNA sequences and deduce that they form cloverleaf structures. Solving this problem usually involves computing mutual information between all pairs of columns of a multiple alignment and using this information to identify pairs of positions with high mutual information or evidence of pairing interactions. See papers by Durbin, Eddy or Grate for more information. The software computes the Viterbi most-likely parse. It does not compute the full inside probability because it is primarily constraint based it would return only the probability that is consistent with the constraints and not the true probability. It also does not compute outside because, again, this would not be the true probability. The software could be extended to compute these full probabilities. Also, the software is primarily coded to handle a DNA alphabet of size four. It would have to be extended to handle arbitrary alphabet sizes. ==== The grammar ==== The primary use of this system is through an easy-to-use high-level speicification of RNA secondary structure that is translated to an actual SCFG. A SCFG is a set of singleton, pair, chain, and branch productions. The software can be extended to include new types of productions, anything that returns the probability of a subword $i$ to $j$. The primary restriction on the grammar is that it does not contain any null cycles. Every cycle in the grammar must derive at least one terminal. Without this constraint, the parsing algorithm will not terminate. To fully apply the constraint system, the grammar must be of a restrictive form that is a generalization of HMM left-to-right profile models. This restrictive form allows constraint information to be propagated between nonterminals thereby increasing performance. Several tools are used to easily allow specification of this type of grammar. Following the example of the tRNA tutorial will automatically give you this type of grammar. ----- Michael P.S. Brown