EvoFold

Introduction

EvoFold is a comparative method for identifying functional RNA structures in multiple-sequence alignments. It is based on a probabilistic model construction called a phylo-SCFG (a phylogenetic stochastic context-free grammar) and makes its predictions by exploiting the characteristic differences between the substitution processes in stem-pairing and unpaired regions. Each prediction consists of a specific secondary structure and a folding potential score.
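
The kind of signal described above can be illustrated with a small sketch. This is not EvoFold's phylo-SCFG, which is a full probabilistic model; it is only a toy counting check, and the alignment, the function name pairing_support, and the choice to allow G-U wobble pairs are all assumptions made for illustration. Two alignment columns that form a stem pair tend to stay complementary in every species even when the bases themselves change, whereas unpaired columns are free to vary independently.

```python
# Toy illustration only: a counting heuristic, not EvoFold's phylo-SCFG.
# Watson-Crick pairs plus G-U wobble pairs (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pairing_support(alignment, i, j):
    """Return (n_pairable, n_distinct_pairs) for alignment columns i and j.

    n_pairable counts the sequences whose bases at i and j can pair.
    n_distinct_pairs counts how many different pair types were seen;
    a value above 1 means the pair changed while staying complementary.
    """
    observed = [(seq[i], seq[j]) for seq in alignment]
    pairable = [pair for pair in observed if pair in PAIRS]
    return len(pairable), len(set(pairable))


# Hypothetical 4-sequence alignment of a short hairpin (0-based columns).
alignment = [
    "GCAUUCGUGC",
    "GCGUUCGCGC",
    "GUAUUCGUAC",
    "GCAUUAGUGC",
]

for i, j in [(0, 9), (1, 8), (2, 7), (4, 5)]:
    n_pairable, n_distinct = pairing_support(alignment, i, j)
    print(f"columns {i},{j}: {n_pairable}/{len(alignment)} pairable, "
          f"{n_distinct} distinct pair type(s)")
```

In this toy alignment, columns 2 and 7 pair in all four sequences with two different pair types (A-U and G-C), which is the compensatory pattern expected in a stem, while columns 4 and 5 pair in only one sequence, as expected for unpaired loop positions.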

We have created an 8-way, human-referenced genomic alignment of vertebrates (human, chimpanzee, mouse, rat, dog, chicken, pufferfish, and zebrafish), identified the conserved regions, and applied EvoFold to them. This resulted in a set of 48,479 deeply conserved structural predictions. The false positive rate among these varies with score, size, and other attributes of the predictions, and we tentatively estimate that the set contains 18,500 true functional structures.
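
As a rough reading of the two figures above, the following sketch computes the aggregate share of predictions that the tentative estimate implies are false positives. The variable names are hypothetical, and the result is only an overall average; as noted above, the actual rate varies with score, size, and other attributes.

```python
# Figures taken from the text above; the aggregate rate is derived arithmetic,
# not a number reported by the authors.
total_predictions = 48_479
estimated_true = 18_500  # tentative estimate from the text

estimated_false = total_predictions - estimated_true
false_fraction = estimated_false / total_predictions

print(f"{estimated_false} of {total_predictions} predictions "
      f"({false_fraction:.1%}) would be false positives on average")
```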

Prediction sets

The predictions have been classified according to their size, genomic location, and shape, and subsequently ranked by score. Based on the characteristics of known microRNAs, we have also defined a set of 187 microRNA candidates. Finally, the folds have been grouped into paralogous families based on primary sequence homology. Annotated versions of these prediction sets, with links to the UCSC Genome Browser, can be found by following the links below. All genomic coordinates are relative to the Human May 2004 (hg17) assembly.

Binary and source code

A statically compiled Linux binary (i386, 5.5 MB), a control file (save the file and view it in a text editor), and some documentation are available for the EvoFold program. Source code is available upon request (jsp@soe.ucsc.edu).

Predictions in other species

EvoFold predictions have also been made in Drosophila.

Reference

Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E., Rogers, J., Kent, J., Miller, W., and Haussler, D. Identification and Classification of Conserved RNA Secondary Structures in the Human Genome. PLoS Comput Biol 2(4), e33 (2006).