GenBank/RefSeq Genome Setup

This page describes the process of setting up the GenBank/RefSeq update process for a new genome or new assembly. It is assumed that the automated download and build are already in place in $gbRoot.

Initial Alignment

Building the initial alignments is a long process, involving gigabytes of disk space and over a day of cluster time. Do not be surprised if it does not go smoothly. Manual intervention may be required to correct problems.
  1. Make sure ssh is configured to not require a passphrase.
  2. In this document, $db refers to the database being aligned. Substitute the actual database name (e.g. hg15).
  3. If this is the first time this organism has been aligned:
  4. Edit $gbRoot/etc/genbank.conf to configure this database. Must set: You may want to set some database load options to override the defaults.
  5. ssh eieio
    It is important to run on eieio to avoid creating a heavy NFS load.
  6. cd /cluster/store5/genbank
    This directory is $gbRoot.
  7. nice bin/gbAlignStep -verbose=1 -initial $db&

    This will run the entire alignment process. The -initial option sets defaults for several parameters for an initial alignment and prevents this alignment from blocking the automatic daily alignments. In particular:

    If you want to use the bluearc cluster file system, add

    before the $db.
    Warning: gbAlignStep and other GenBank programs do not currently accept options after the positional arguments.

    If you use the iservers, $gbRoot/etc/genbank.conf should already have the list of working servers.

    All output is saved in the log file.

    If this is the first time a database is built for an organism, it's a good idea to start out by aligning and loading just the mRNAs, as this will go much faster. Two options control what is aligned:

    Note that since the alignment process only aligns what needs to be aligned, no option is required when doing the ESTs after an initial mRNA alignment.

    If anything fails, a subset of the tasks done by the gbAlignStep script can be rerun after correcting the problem. This is done using the -continue=subtask option, where subtask is either

    If the parasol alignment run fails, it can be continued using parasol directly, followed by a gbAlignStep run with -continue=finish. If parasol loses track of the jobs, the parasol recover command can be used to generate a new jobs file containing the jobs that have not completed.
  8. When run with the -initial option, the working directory is not deleted. This is to assist in debugging any problems. After doing the initial load of the database and checking the results, work/initial.$db should be removed. Note that this hierarchy is very large, so it is best to remove it on the NFS server for the file system.
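Two points in the steps above are easy to get wrong: ssh must work without prompting, and options must come before the database name. The following is a minimal sketch, not part of the GenBank tools, that illustrates both. The helper names check_ssh and build_cmd are hypothetical, and hg15 is only the example database name used earlier. The script prints a command line rather than running it.

```shell
#!/bin/sh
# Sketch only: check_ssh and build_cmd are hypothetical helpers and are
# not part of the GenBank tools.

# Succeed only if ssh to the given host works without a passphrase or
# password prompt. BatchMode=yes makes ssh fail instead of prompting.
# usage: check_ssh host
check_ssh() {
    ssh -o BatchMode=yes "$1" true
}

# Print a command line for a program in $gbRoot/bin with every option
# placed before the database name, because options after the positional
# arguments are not accepted.
# usage: build_cmd program db [option...]
build_cmd() {
    prog=$1
    db=$2
    shift 2
    cmd="nice bin/$prog"
    for opt in "$@"; do
        case $opt in
            -*) cmd="$cmd $opt" ;;
            *)  echo "build_cmd: '$opt' does not look like an option" >&2
                return 1 ;;
        esac
    done
    echo "$cmd $db"
}

# Dry run: print the initial alignment command for the example database.
build_cmd gbAlignStep hg15 -verbose=1 -initial
```

The printed line can be inspected and then run from $gbRoot. Any additional option is passed as an extra argument to build_cmd, which keeps it ahead of the database name.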

Initial Database Load

  1. nice bin/gbDbLoadStep -verbose=1 -drop -initialLoad $db
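Once the load has been checked, the work/initial.$db directory left by the alignment step can be removed. Because a mistyped or empty database name could make a recursive delete hit the wrong path, a guarded removal may be worth the few extra lines. This is a sketch under that assumption; remove_initial_work is a hypothetical helper, not a GenBank tool.

```shell
#!/bin/sh
# Sketch only: remove_initial_work is a hypothetical helper, not a
# GenBank tool.

# Remove work/initial.$db under the given root directory. Refuses to act
# if either argument is empty or the directory does not exist.
# usage: remove_initial_work gbRoot db
remove_initial_work() {
    root=$1
    db=$2
    if [ -z "$root" ] || [ -z "$db" ]; then
        echo "usage: remove_initial_work gbRoot db" >&2
        return 1
    fi
    dir="$root/work/initial.$db"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    rm -rf "$dir"
}
```

For example, remove_initial_work /cluster/store5/genbank hg15 would remove only the hg15 initial work tree. As noted in the alignment steps, this is best run on the NFS server for the file system.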