Use this page to extract sequence data from the Intronerator database relative to such markers as translation start or splice boundaries. To quickly get the sequence of a named gene, cosmid, chromosome range or cDNA try the simple GetGene utility. For more detailed instructions on this page follow the help link.
Source: Recursive | Genome | Chromosomes | Cosmids | Genes
Sequence Names:
Restrict:
Region: exons | introns | intergenic | skipped exon | skipped intron | alt 5' intron | alt 3' intron
Start Offset, End Offset
Relative To: Region | Start | End
C. briggsae Homology: Any | High; Invert; Anywhere | Mostly | Throughout | Piecemeal
cDNA Hits: Full Length | EST | Either; Invert; Anywhere | Mostly | Throughout | Piecemeal; Stage: Either | Embryo | Mixed | Embryo Only | Mixed Only
Gene Predictions: CDS | Coding; Invert; Anywhere | Mostly | Throughout | Piecemeal; from AceDb | Genie
Output Format: Fasta | Recursive | Name Only | Hyperlinked; Lines Every; Spaces Every; Capitalize: Coding | All | None; Base Numbers
This program allows you to extract sequence data from the Intronerator database in a very precise and flexible manner. Some things you can do with this program: get every gene for which the entire mRNA has been sequenced; extract 1000 base pairs before the start codon on all unc genes; get all of the sequence that is highly conserved between C. elegans and C. briggsae; and get DNA for all of the exons that are sometimes spliced in and sometimes spliced out. The price you pay for this flexibility is a complex set of controls, which is explained in some detail in this help.
The overall flow of the controls goes from top to bottom and from left to right. The controls are broken into three major sections - those that specify which sequences to start with (the source); those that restrict which parts of the sequences are used (the restrict controls); and those that define the output format. It is possible to use the output from one round of the program as input for another round using the recursive options. The controls in general have intelligent defaults. If all you want to do is collect the sequence for all unc genes, just type "unc-*" into the large "sequence names" text box, and press submit.
In the source section you can specify whether you want to extract data from the entire genome, from a set of chromosomes, from a set of cosmids, or from a set of genes by selecting the appropriate radio button. (The "recursive" radio button will be explained in the last section of this help.) Unless you're working on the entire genome (which is often quite slow) you'll need to put some names in the "sequence names" text box. These names can be entered in either upper or lower case, and can include the wildcard characters '?', which matches any single letter, and '*', which matches anything. The allowed chromosome names are i, ii, iii, iv, v, x, and m (for mitochondrial). The cosmid names are standard C. elegans cosmids and YACs, such as ZC101. The gene names can either be in geneticist format such as unc-47, or in AceDB ORF format such as T20G5.6.
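The wildcard rules above can be tried out locally before submitting a pattern. The following is only a sketch using Python's standard fnmatch module; the helper name matches_name is hypothetical and this is not the Intronerator's own matching code. Note that fnmatch lets '?' match any single character (not just a letter) and also gives '[' a special meaning, which this page does not mention.

```python
from fnmatch import fnmatchcase

def matches_name(pattern: str, name: str) -> bool:
    """Case-insensitive match: '?' is one character, '*' is any run of characters.

    Hypothetical helper for illustration; not the Intronerator's own code.
    """
    return fnmatchcase(name.lower(), pattern.lower())

names = ["unc-47", "unc-4", "UNC-119", "T20G5.6", "ZC101"]
print([n for n in names if matches_name("unc-*", n)])  # ['unc-47', 'unc-4', 'UNC-119']
print([n for n in names if matches_name("UNC-?", n)])  # ['unc-4']
```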
In the restrict section you can restrict the parts of the sources that you are working on and also specify regions relative to another region. There are a large number of controls in this region, but they are broken up logically into lines, each of which is fairly simple. The order in which the restrictions are applied is the same order in which the controls appear.
Region - by default the region is blank, which means that any region of the source is passed through. You can restrict this to only exons, only introns, and only intergenic regions via a drop-down list. This will break up your source into the corresponding pieces. If you're interested in alternative splicing you can also select "skipped exon," which restricts the region to only exons which are present in some isoforms but not others. The "skipped intron," "alt 5' intron" and "alt 3' intron" are also of interest to alt-splicing folks. These alt-splicing options are relatively slow - taking about ten minutes if applied to the whole genome.
Start and End Offset - These fields adjust the start and end boundaries of the regions. They work in conjunction with the "relative to" control below. Leaving these fields blank is the same thing as setting them to zero.
Relative To - This controls how the start and end offset are applied. In the default setting - relative to region - the start offset adjusts the start of the region and the end offset adjusts the end of the region. Relative to start adjusts the region so that the new start and end are both relative to the old start position. Relative to end adjusts the region so that the new start and end are both relative to the old end position. Some examples may be in order. To collect intron sequence data but exclude the start and the end of the intron, set the region control to "introns", set the start offset to "10", set the end offset to "-10", and leave the Relative To control at "region." To collect candidate sequences to search for transcription initiation factor binding sites, enter a list of genes in the sequence names box (perhaps as "???-*"), leave the Region control blank, set the start offset to -1000 and the end offset to 0, and set Relative To to "start."
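The offset arithmetic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the program's own code: the function name adjust_region is hypothetical, coordinates are plain integers on the forward strand, and reverse-strand handling is ignored.

```python
def adjust_region(start, end, start_offset=0, end_offset=0, relative_to="region"):
    """Return the (new_start, new_end) pair for one region.

    Hypothetical helper; assumes forward-strand integer coordinates.
    Blank offsets on the form correspond to the default of zero here.
    """
    if relative_to == "region":
        # Each offset moves its own boundary.
        return start + start_offset, end + end_offset
    if relative_to == "start":
        # Both new boundaries are measured from the old start.
        return start + start_offset, start + end_offset
    if relative_to == "end":
        # Both new boundaries are measured from the old end.
        return end + start_offset, end + end_offset
    raise ValueError(f"unknown relative_to setting: {relative_to!r}")

# Intron example: trim 10 bases from each side of an intron at 5000..5200.
print(adjust_region(5000, 5200, 10, -10, "region"))   # (5010, 5190)
# Upstream example: 1000 bases before the start of a gene at 20000..23000.
print(adjust_region(20000, 23000, -1000, 0, "start"))  # (19000, 20000)
```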
C. briggsae Homology - This set of controls lets you restrict your sources to only those with C. briggsae homologs. When the first drop-down box is blank, homology is ignored. You can restrict your source set to only those with any C. briggsae homology or those with high C. briggsae homology using this drop-down. The second drop-down can invert the effect of the first - selecting only regions without C. briggsae homology. The next drop-down controls where the homology needs to be. If one patch of homology anywhere in the source sequence is good enough to keep it, leave the default value of "Anywhere" in this box. If you require that at least 50% or that 100% of the source sequence be homologous, select the "Mostly" or "Throughout" item, respectively. If you'd like to chop up your source sequence and only keep the homologous parts, select the "Piecemeal" option.
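One way to picture the Anywhere / Mostly / Throughout / Piecemeal choice is as a test on how much of a region is covered by hits. The sketch below is an assumption about the logic, not the program's actual implementation: the names merge_hits and apply_where are hypothetical, regions and hits are half-open (start, end) integer pairs, and the Invert drop-down is left out.

```python
def merge_hits(region, hits):
    """Clip hits to the region and merge overlapping ones (half-open intervals)."""
    start, end = region
    clipped = sorted((max(s, start), min(e, end))
                     for s, e in hits if s < end and e > start)
    merged = []
    for s, e in clipped:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def apply_where(region, hits, where="anywhere"):
    """Return the list of regions kept under one 'where' setting (hypothetical)."""
    start, end = region
    pieces = merge_hits(region, hits)
    covered = sum(e - s for s, e in pieces)
    length = end - start
    if where == "piecemeal":
        return pieces                      # keep only the matching parts
    if where == "anywhere":
        keep = covered > 0                 # one patch is enough
    elif where == "mostly":
        keep = covered * 2 >= length       # at least 50% covered
    elif where == "throughout":
        keep = covered == length           # 100% covered
    else:
        raise ValueError(f"unknown setting: {where!r}")
    return [region] if keep else []

region, hits = (100, 200), [(90, 130), (120, 160)]
for where in ("anywhere", "mostly", "throughout", "piecemeal"):
    print(where, apply_where(region, hits, where))
```

With these inputs 60 of the 100 bases are covered, so the region is kept under "anywhere" and "mostly", dropped under "throughout", and reduced to (100, 160) under "piecemeal".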
cDNA Hits - Here you can restrict your sources to only those that have (or don't have) a particular type of cDNA match in the database. The first drop down lets you choose between paying attention only to ESTs, paying attention to only full-length (or more precisely non-EST) cDNA, or paying attention to either type of cDNA. Leaving the first drop-down blank ignores cDNA entirely. The second drop-down inverts the effect of the first. As with the C. briggsae homology the third drop down lets you decide if a match anywhere in the source is enough to keep the source, whether instead the source must match mostly or throughout to be kept, or whether the source is broken up piecemeal, and only the matching parts kept.
The final drop-down in this line lets you control whether embryonic, mixed stage, or either type of cDNA will be considered. (In a perfect world it would be adult rather than mixed stage, but in fact over 99% of the current cDNA data is either embryonic or mixed.) The "embryo" option will select regions that have embryonic cDNA hits. The "embryo only" option will select regions that have embryonic cDNA hits but no mixed stage cDNA hits. The "mixed" and "mixed only" work similarly.
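The stage options amount to simple set logic on the stages of the cDNA hits in a region. The sketch below is illustrative only: the function name passes_stage and the stage labels are hypothetical, and the reading of "either" as "an embryonic or a mixed-stage hit is present" is an assumption, since this page does not say how the small amount of other-stage cDNA is treated.

```python
def passes_stage(hit_stages, stage="either"):
    """Decide whether a region passes the stage filter.

    hit_stages: set of stage labels ("embryo", "mixed") seen among the
    region's cDNA hits. Hypothetical helper; "either" semantics are assumed.
    """
    has_embryo = "embryo" in hit_stages
    has_mixed = "mixed" in hit_stages
    if stage == "either":
        return has_embryo or has_mixed
    if stage == "embryo":
        return has_embryo
    if stage == "embryo only":
        return has_embryo and not has_mixed
    if stage == "mixed":
        return has_mixed
    if stage == "mixed only":
        return has_mixed and not has_embryo
    raise ValueError(f"unknown stage setting: {stage!r}")

print(passes_stage({"embryo", "mixed"}, "embryo"))       # True
print(passes_stage({"embryo", "mixed"}, "embryo only"))  # False
```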
Gene Predictions - Here you can restrict your sources relative to AceDB or Genie predictions about where genes are and what's coding. Do bear in mind that these are only predictions unless there is cDNA data to back them up. (The AceDB predictions were made only considering cDNA that was available in late 1998.) As with the other lines, a blank in the first drop-down means that the source is not restricted. Selecting something in this drop-down lets you restrict the output to CDS (coding exons and introns), not CDS, coding (exons only) or non-coding. The next two drop-downs work as with C. briggsae homology and cDNA hits. The last control selects whether AceDB or Genie gene predictions are used. (Note this control also affects how things are broken up into introns and exons in the "Region" control. The alt-splicing regions, however, bypass the gene predictions and look directly at the cDNA.)
The output format controls are flexible but relatively straightforward. The first drop-down controls whether you want Fasta output including the sequence, a simple list of sequence names, a list of sequence names hyperlinked to the Intronerator tracks display, or a recursive format which I'll explain in the final section. The "Lines Every" text box controls how many bases are displayed in each line. You can set this to zero if you desire the whole sequence to be in one long line. The "Spaces Every" control puts spaces between groups of nucleotides. Popular values for this are 10 and 3. If left blank, no spaces are inserted. The "Capitalize" drop-down controls whether the nucleotides will be upper or lower case, or (by default) with the coding regions in upper case and the non-coding regions in lower case. If "Base Numbers" is checked, a number displaying the number of nucleotides written so far is appended to the end of each line.
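The effect of "Lines Every", "Spaces Every" and "Base Numbers" can be approximated as follows. This is a sketch, not the program's output routine: the function name format_sequence and the default of 50 bases per line are assumptions, and the exact spacing before the base number in real output may differ.

```python
def format_sequence(seq, lines_every=50, spaces_every=0, base_numbers=False):
    """Lay out a nucleotide string in lines and groups (hypothetical helper).

    lines_every=0 puts the whole sequence on one line; spaces_every=0
    inserts no spaces; base_numbers appends the running base count.
    """
    if lines_every <= 0:
        lines_every = max(len(seq), 1)
    lines = []
    for i in range(0, len(seq), lines_every):
        raw = seq[i:i + lines_every]
        text = raw
        if spaces_every > 0:
            text = " ".join(raw[j:j + spaces_every]
                            for j in range(0, len(raw), spaces_every))
        if base_numbers:
            text += f" {i + len(raw)}"  # bases written so far
        lines.append(text)
    return "\n".join(lines)

print(format_sequence("acgtACGTacgtACGTacgt", lines_every=10,
                      spaces_every=5, base_numbers=True))
# acgtA CGTac 10
# gtACG Tacgt 20
```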
Though the combination of source and restriction controls discussed so far enables you to control the part of the genome you wish to extract quite precisely, on occasion it is not enough. The restrictions are always applied in the same order as the controls. In some cases this may not be the order you need. To get around this you can apply recursion to use the output of one run of the program as input to another run. To do this, select "Recursive" for the output format. Copy the output (which is the same as the Fasta format output, except lacking the nucleotide sequence) to the clipboard. Go back to this page, select "Recursive" for source, and paste your data back into the "Sequence Names" box. Reset your restrictions to what you want for the second pass. If this is to be the final pass, adjust the output format back to Fasta, and hit "Submit."
-Jim Kent Nov. 1999 Return to Intronerator.