GenBank/RefSeq Genome Setup

This page describes the process of setting up the GenBank/RefSeq update process for a new genome or new assembly. It is assumed that the automated download and build are already in place in /cluster/data/genbank/.

Initial Alignment

Building the initial alignments is a long process, involving gigabytes of disk space and over a day of cluster time. Do not be surprised if it does not go smoothly. Manual intervention may be required to correct problems.
  1. Make sure ssh is configured to not require a passphrase.
  2. In this document, $db refers to the database being aligned. Substitute the actual database name (e.g. hg15).
  3. If this is the first time this organism has been aligned, some source files need to be edited. The GenBank update code is under kent/src/hg/makeDb/genbank/.
  4. Edit /cluster/data/etc/genbank.conf to configure this database. Must set: You may also want to set some database load options to override the defaults.
  5. ssh eieio
    It is important to run on eieio to avoid creating a heavy NFS load.
  6. cd /cluster/data/genbank
    This directory is $gbRoot.
  7. nice bin/gbAlignStep -verbose=1 -initial $db&

    This will run the entire alignment process. The -initial option sets defaults for several parameters for an initial alignment and prevents this alignment from blocking the automatic daily alignments. In particular:

    If you want to use the bluearc cluster file system, add

    before the $db.
    Warning: gbAlignStep and the other GenBank programs do not currently accept options after the positional arguments.

    If you use the iservers, /cluster/data/genbank/etc/genbank.conf should already have the list of working servers.

    All output is saved in the log file.

    If this is the first time a database is built for an organism, it's a good idea to start out by aligning and loading just the mRNAs, as this will go much faster. Two options control what is aligned:

    Note that since the alignment process only aligns what needs to be aligned, no option is required when doing the ESTs after an initial mRNA alignment.

    If anything fails, a subset of the tasks done by the gbAlignStep script can be rerun after correcting the problem. This is done using the -continue=subtask option, where subtask is either

    If the parasol alignment run fails, it can be continued using parasol directly, followed by a gbAlignStep run with -continue=finish. If parasol loses track of the jobs, one can use the parasol recover command to generate a new jobs file containing the jobs that have not completed.
  8. When run with -initial, the working directory is not deleted. This is to assist in debugging any problems. After doing the initial load of the database and checking the results, work/initial.$db should be removed. Note that this hierarchy is very large, so it is best to remove it on the NFS server for the file system.
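
The ssh requirement in step 1 and the option-ordering warning in step 7 can be sketched as two small shell helpers. This is a minimal sketch, not part of the GenBank scripts: check_ssh and gb_align_cmd are hypothetical names, gb_align_cmd only prints the command line rather than running it, and combining -initial with -continue=finish in the second call is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if ssh can log in to host $1 without
# prompting. BatchMode=yes makes ssh fail instead of asking for a
# passphrase or password.
check_ssh() {
    ssh -o BatchMode=yes "$1" true
}

# Hypothetical helper: print (do not run) a gbAlignStep command line for
# database $1. Any further arguments are extra options; they are placed
# before the database name, because options after the positional
# arguments are not accepted.
gb_align_cmd() {
    db=$1
    case $db in
        ''|-*) echo "usage: gb_align_cmd db [options]" >&2; return 1 ;;
    esac
    shift
    opts=$*
    if [ -n "$opts" ]; then
        echo "nice bin/gbAlignStep -verbose=1 -initial $opts $db"
    else
        echo "nice bin/gbAlignStep -verbose=1 -initial $db"
    fi
}

gb_align_cmd hg15
# ASSUMPTION: -continue=finish is combined with -initial here.
gb_align_cmd hg15 -continue=finish
```

Running check_ssh eieio before step 5 gives a quick yes/no answer on whether the login would stall waiting for a passphrase.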

Initial Database Load

  1. nice bin/gbDbLoadStep -verbose=1 -drop -initialLoad $db
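
The alignment and load steps can be rehearsed together before committing cluster time. The following is a minimal sketch: run_step, initial_build and the DRY_RUN variable are hypothetical names, and running the two steps back to back in the foreground (rather than backgrounding the alignment as shown earlier) is a choice made for the example. With DRY_RUN=1, the default here, the commands are only printed.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints commands instead of
# running them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: echo the command in dry-run mode, otherwise run it.
run_step() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Align, then load only if the alignment step succeeded.
initial_build() {
    db=$1
    run_step nice bin/gbAlignStep -verbose=1 -initial "$db" &&
    run_step nice bin/gbDbLoadStep -verbose=1 -drop -initialLoad "$db"
}

initial_build hg15
```

Because the two steps are joined with &&, a failed alignment stops the sequence before anything is dropped from the database.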

Realigning Tracks

It may be necessary to realign and reload tracks to change alignment parameters or other attributes. This is fairly straightforward when a genome database is initially being built. It's more complex if one has to sync up multiple systems.
  1. If automated alignment or update has been enabled for the database, disable it by editing the scripts in $gbRoot/etc/.
  2. Make sure an automated alignment isn't currently running.
  3. To trigger a realignment, one needs to remove the related files for some partition of the data for all updates. These live under either the genbank or refseq alignment directories, for example: To realign native RefSeq mRNAs for hg16, one would remove: To realign xeno GenBank ESTs for hg16, one would remove:
  4. Do an initial alignment as described above, restricting with -srcDb and -type.
  5. Reload the database with the partition of data that was realigned. The -srcDb and -type options restrict the subset. The organism category (native or xeno) isn't specified. Reloading of ESTs isn't supported; use -drop and -initialLoad instead.
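
Steps 4 and 5 can be sketched as two helpers that print the restricted alignment command and the matching load options. This is a rough sketch under stated assumptions: gb_realign_cmd and gb_reload_opts are hypothetical names, the "=" form for -srcDb and -type is inferred from -verbose=1, and the values refseq, genbank, mrna and est are assumed spellings. The helper for step 5 prints only the subset options, since the full reload command line is not shown above and may need further options.

```shell
#!/bin/sh
# Print (do not run) the realignment command for one partition of the data.
# ASSUMPTION: -srcDb and -type take "=value" arguments, like -verbose=1,
# and the value spellings passed in are accepted by the tool.
gb_realign_cmd() {
    db=$1 srcDb=$2 dataType=$3
    echo "nice bin/gbAlignStep -verbose=1 -initial -srcDb=$srcDb -type=$dataType $db"
}

# Print the gbDbLoadStep options selecting the same partition for the reload.
# Reloading ESTs is not supported, so they fall back to -drop -initialLoad.
gb_reload_opts() {
    srcDb=$1 dataType=$2
    case $dataType in
        est) echo "-drop -initialLoad" ;;
        *)   echo "-srcDb=$srcDb -type=$dataType" ;;
    esac
}

gb_realign_cmd hg16 refseq mrna
gb_reload_opts refseq mrna
gb_reload_opts genbank est
```

As with the alignment step, these options belong before the database name on the gbDbLoadStep command line.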