/cluster/data/genbank/
.
ssh
is configured to
not require a passphrase.
$db
refers to the database being aligned.
Substitute the actual database name (e.g. hg15
).
kent/src/hg/makeDb/genbank/
hg
of hg13
) and the species names
used in GenBank need to be defined. This is done by editing
genbank/src/lib/gbGenome.c
and rebuilding the programs. It may be
necessary to define multiple species name mappings. The list of
species used by GenBank can be obtained by running a command of the following form on the latest GenBank release:
    cd /cluster/data/genbank/data
    awk 'BEGIN{FS="\t"}{print $4}' processed/genbank.135.0/full/*.gbidx | sort -u >species.txt
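To see what the species-listing pipeline produces without touching a real release, it can be tried on a small stand-in file. This is only a sketch: the input below is a hypothetical tab-separated file whose fourth column holds the organism name, as the awk command assumes. The other columns are placeholders and do not reflect the real gbidx layout.

```shell
# Hypothetical stand-in for a *.gbidx file: tab-separated, organism in column 4.
# The other columns are placeholders, not the real gbidx layout.
tmp=$(mktemp)
printf 'acc1\t1\tx\tHomo sapiens\n'  >"$tmp"
printf 'acc2\t1\tx\tMus musculus\n' >>"$tmp"
printf 'acc3\t2\tx\tHomo sapiens\n' >>"$tmp"

# Same pipeline as the command above: print column 4, then sort and de-duplicate.
species=$(awk 'BEGIN{FS="\t"}{print $4}' "$tmp" | sort -u)
echo "$species"
rm -f "$tmp"
```

Each distinct organism name appears once in the result, which is the list to compare against the mappings in gbGenome.c.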
genbank/src/align/gbBlat
to add the ooc
file.
If there is no ooc
, this script still needs to be
modified to indicate this fact.
kent/src/hg/makeDb/genbank/
)
make
to test if the source builds
make install-server
to update /cluster/data/genbank/
.
/cluster/data/etc/genbank.conf
to
configure this database. The following must be set:
$db.genome
$db.lift
ssh eieio
eieio
to avoid creating
a heavy NFS load.
cd /cluster/data/genbank
$gbRoot
.
nice bin/gbAlignStep -verbose=1 -initial $db&
This will run the entire alignment process.
The -initial
option sets defaults for several parameters for
an initial alignment and prevents this alignment from blocking
the automatic daily alignments. In particular:
work/initial.$db/align
.
var/initial/2003.05.23-21:51:12.$db.align.log
.
If you want to use the bluearc cluster file system, add
-iserver=no -clusterRootDir=/cluster/bluearc/genbank
$db
.
Warning: gbAlignStep
and the other GenBank programs do not currently accept options after
the positional arguments.
If you use the iservers, /cluster/data/genbank/etc/genbank.conf
should already have the list of working servers.
All output is saved in the log file.
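Because the log name begins with a timestamp, the newest log for a database can be picked out by sorting the file names. The sketch below recreates the naming scheme in a scratch directory rather than under the real $gbRoot. The database name hg16 and the second timestamp are made-up examples.

```shell
# Sketch only: recreate the log naming scheme in a scratch directory.
db=hg16                      # example database name; substitute your own
gbRoot=$(mktemp -d)          # stand-in for /cluster/data/genbank
mkdir -p "$gbRoot/var/initial"
touch "$gbRoot/var/initial/2003.05.23-21:51:12.$db.align.log"
touch "$gbRoot/var/initial/2003.05.24-08:02:40.$db.align.log"

# The timestamps sort lexicographically, so the last name is the newest log.
latest=$(ls "$gbRoot"/var/initial/*."$db".align.log | sort | tail -n 1)
echo "$latest"
```

On the real system, `tail -f "$latest"` would then follow the alignment as it runs.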
If this is the first time a database is built for an organism, it's a good idea to start out by aligning and loading just the mRNAs, as this will go much faster. Two options control what is aligned:
-srcDb=name
- Restrict the source
database to either genbank
or refseq
.
-type=name
- Restrict the type of sequence
processed to either mrna
or est
.
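As an illustration, an initial run restricted to GenBank mRNAs could be assembled as follows. This is a dry-run sketch that only prints the command. It assumes that -srcDb and -type can be combined with -initial, and hg16 is just an example database name. All options are placed before the database name, since options after the positional arguments are not accepted.

```shell
# Dry run: build and print the command instead of executing it.
db=hg16                                   # example database name
opts="-verbose=1 -initial -srcDb=genbank -type=mrna"

# Options must come before the positional database argument.
cmd="nice bin/gbAlignStep $opts $db"
echo "$cmd"
```

Dropping -srcDb or -type from opts widens the run again to all source databases or all sequence types.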
If anything fails, a subset of the tasks done by the
gbAlignStep
script can be rerun after correcting
the problem. This is done using the
-continue=subtask
option, where
subtask
is one of:
copy
- continue with copying to the iserver;
this skips extracting the sequences to align.
run
- Continue with the parasol blat run.
finish
- finish the alignments, doing
lifting and filtering.
gbAlignStep
with
-continue=finish
. If parasol loses track of
the jobs, one can use the parasol recover
command to generate a new jobs file with the jobs
that have not completed.
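The restart can be sketched as a small dry-run helper that checks the subtask name before printing the command. The function continue_cmd is hypothetical and not part of the GenBank scripts. Passing -initial together with -continue is an assumption for restarting an initial alignment.

```shell
# Dry-run sketch: build a gbAlignStep restart command for a given subtask.
# continue_cmd is a hypothetical helper, not part of the GenBank scripts.
continue_cmd() {
    subtask=$1
    db=$2
    case "$subtask" in
        copy|run|finish) ;;   # the three subtasks described above
        *) echo "unknown subtask: $subtask" >&2; return 1 ;;
    esac
    # Options go before the positional database argument.
    echo "nice bin/gbAlignStep -verbose=1 -initial -continue=$subtask $db"
}

continue_cmd finish hg16
```

Rejecting unknown subtask names up front avoids starting a long run with a mistyped option.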
-initial
, the working directory
is not deleted. This is to assist in debugging any problems.
After doing the initial load of the database and checking
the results, work/initial.$db
should be
removed. Note that this hierarchy is very large, so it is best to
remove it on the NFS server for the file system.
nice bin/gbDbLoadStep -verbose=1 -drop -initialLoad $db
-drop
option drops any existing GenBank or RefSeq
tables before loading.
-initialLoad
option
when loading the ESTs.
$gbRoot/etc/
.
data/aligned/genbank.139.0/hg16/
data/aligned/refseq.139.0/hg16/
data/aligned/refseq.139.0/hg16/*/mrna.native.*
data/aligned/refseq.139.0/hg16/*/est.*.xeno.*
-srcDb
and -type
.
-srcDb
and -type
options restrict
the subset. The organism category (native or xeno) isn't
specified. Reloading of ESTs isn't supported; use -drop
and -initialLoad
instead.
nice bin/gbDbLoadStep -verbose=1 -reload -srcDb=genbank -type=mrna $db