Algorithms Used In the Intronerator

How the cDNA Alignments Were Generated

The cDNA alignments are the foundation of this database. They were generated in a two pass manner. In the first pass an algorithm similar to BLAST collected "hits" where 16 nucleotide fragments of the cDNA aligned perfectly with the genome. Areas of the genome with promising hits were passed to the fuzzyFinder program to finish the alignment. The genomic data is from the Sanger November 1998 release. The cDNA is all the C. elegans mRNA sequence in GenBank as of Oct. 1, 1999.

The first pass algorithm starts out much like BLAST. First it scans the cDNA sequence and creates a list of all unique 16-base subregions ("tiles") which contain no N's and are not capable of matching degenerate repeating DNA of period six or less. (This last criteria prevents tttttttttttttttt, actactactacta, and aaccggaaccggaacc from being used as tiles.) The tiles are loaded into a hash table, using the least significant bits of the tile as a hash function, and the genome is scanned for matches ("hits") to the tiles. Since the comparison advances over the genome in non-overlapping steps of 16 bases the algorithm can overlook perfect matches up to 30 bases long, and find them as short as 16 bases. For the relatively clean cDNA sequences we wished to include in our database, requiring a 30 base pair match somewhere in the sequence seemed reasonable. Next the algorithm groups hits that are adjacent, or nearly so in both cDNA and genome into "crude exons". Next it groups crude exons that are adjacent or nearly so in cDNA, and within 25,000 bases in the genome into "crude genes". The crude genes are scored by summing the squares of the number of tiles in each crude exon, and to this adding the number of exons. (This scoring serves to weigh adjacent hits more heavily than separated hits.) The crude genes scoring less than 10% of the best scoring crude gene (typically the regions with only one hit) are thrown out. The remainder are used to define a region of the genome for the fuzzyFinder algorithm to work on.

In many ways the fuzzyFinder algorithm parallels the first pass algorithm, but it takes a somewhat finer grained approach in searching for hits, and ultimately must extend the hits through sequencing noise. Upon starting up fuzzyFinder counts up how frequently each base is used in the genomic sub-sequence, and uses this to assign a probability to each base. Armed with this information, the program creates tiles by scanning the cDNA. A tile starts out one base long, and is elongated until the probability product of all of the tile's bases is less than 1/(stringency*numBasesInGenomicSubsequence) where stringency is an adjustable parameter that typically is set to 100 or 1000. In a typical run this results in tiles eight to twelve bases long with the shorter ones being more G/C rich (since the C. elegans genome is nearly 70% A/T). Once the tiles are assembled they are used to scan the genomic subsequence for hits. The hits are then grouped as before into crude exons, and crude genes, scored, and the best scoring crude gene chosen for further extension.

The extension first merges adjacent hits and expands their ends as far as the cDNA and genomic DNA match perfectly. Then it expands the ends further, allowing N's in the cDNA to match any single base. At this point the alignment looks something like:

cDNA  cacatagtxxxxxxxxxxxxxxxxxxxxxgattatcgnt xxxxxxxx anacctgan
      ||||||||                     |||||||| |          | ||||||
geno ycacatagtyyyyyyyyyyyyyyyyyyyyygattatcgctyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyagacctgacyyy

where the x's and y's represent unaligned areas of the cDNA and genome respectively. The program then recurses, making up tiles and trying to match in the unaligned areas. Since the genomic area to search has been reduced, the tile size is smaller in the second run. This effect is enhanced by reducing the stringency by a factor of two for each level of recursion. The recursion runs until either no tiles are found or until the gap between aligned blocks in the genome or cDNA becomes less than 6. At this point the alignment looks something like:

cDNA   cacatagtxxxxxxaccgxxxataaxxxxxgattatcgnt  xxxtatac                   anacctgan
       ||||||||      ||||   ||||     |||||||| |     |||||                   | |||||| 
geno  ccacatagtyyyyyyaccgyyyataayyyyygattatcgctyyyyytacacyyyyyyyyyyyyyyyyyyyagacctgacttt

Next the ends of alignments are expanded allowing gaps and inserts of one, and then again allowing gaps and inserts of up to 12. At this point the alignment looks something like:

cDNA   cacatagt.aaagcaaccgctcataaatatcgattatcgnt..tcttatac                   anacctgan
       |||||||| ||||| |||| | |||| ||| |||||||| |  || |||||                   | |||||| 
geno  ccacatagttaaagc.accgataataattatggattatcgcttgtcatacacacgtttttctctccttttaagacctgacttt

At this point some of the gaps in the alignment resemble introns, such as the one shown in blue above. Where possible the intron is allowed to slide so as to maximize the how closely it matches to the consensus gt ... ag. This results in the final alignment depicted below:

cDNA   cacatagt.aaagcaaccgctcataantatngattatcgnt..tcntatacan                   acctgan
       |||||||| ||||| |||| | |||| ||| |||||||| |  || ||||||                    |||||| 
geno  ccacatagttaaagc.accgataataattatggattatcgcttgtcatacacacgtttttctctccttttaagacctgacttt

Since the alignment algorithms are somewhat complex we performed a number of cross checks. 2730 out of the 73910 cDNAs (this was before Kohara's latest round of ESTs) in our database wouldn't align well (defined as less than 50% of the cDNA bases part of aligning blocks). These were run against a database of all C. elegans cosmids in GenBank using NCBI BLASTN. 91 out of the 2730 did show greater than 50% alignment via BLASTN. 37 of the 91 were due to the sequences lacking any 30 base regions of perfect alignment. 15 of the 91 were due to the cosmids ZC47, W07G1, W04A8 and Y106G6G being in GenBank but not the Sanger Nov. 1998 release. 39 of the 91 were due to the cosmids Y53C10A, R06B10, Y81G3A, Y53C10A, R06B10, T24E12, and Y49A3A being different in the GenBank and Sanger Nov. 1998 data. We also ran spot checks on successful alignments to see if the highest scoring BLASTN alignment was on the same cosmid as our alignments. Since BLASTN is not designed for introns, we used Gap2 to check the alignment within a cosmid. The BLASTN/Gap2 alignments were in most cases identical with our alignments. In some cases Gap2 alignments extended further into noisy areas at the ends of cDNA. On the whole our alignment algorithms seem to do a good job of rapidly aligning cDNA sequences, but are completely unsuitable for homology searches.

How the Intron Database Was Generated

The intron database was generated from our cDNA alignments. Each of the approximately 100,000 cDNA alignments was examined for relatively large gaps (at least twelve nucleotides) in the genomic side of the alignment with no corresponding gap in the cDNA side of the alignment. In addition at least five bases on either side of the genomic gap had to match perfectly, and at least seven more bases on either side had to match almost perfectly. If these conditions were all met the genomic gap was considered an intron and put in the database, which currently holds 28476 distinct introns. The average intron is supported by 2.5 cDNA sequences, though half of the introns are only supported by a single cDNA.

How the Alt-Splicing Table was Generated

The alternative splicing table was generated from the intron list and the cDNA alignments. The intron list was sorted, and distinct introns (defined as differing by at least 2 bases from each other) that overlapped were put on a intron/intron overlap list. A list of exons was generated by merging together blocks separated by no more than one or two bases of noise in the cDNA alignments. The exon list was sorted and checked for overlaps against the intron list. To reduce the amount of noise the overlaps between intron and exons were required to be at least 3 bases long in general, and a bit longer than that if there appeared to be noise in the sequence near the overlap. The intron/intron overlap and the intron/exon overlap lists were merged, and sorted by the name of the ORF or nameless cluster closest to the overlap. This resulted in a list of 1150 possible alternatively spliced genes. This list was inspected by hand and items that were clearly the result of noise or overlapping transcriptions on opposite strands were removed. The 939 remaining items are in the Alt-Splicing Table. It is likely that some noise remains, but the web-based cDNA alignment viewer hyper-linked to the table make it an easy matter to judge how strong the evidence of alternernative splicing for a particular gene is.

Jim Kent, August 1999