Parasol provides a convenient way for multiple users to run large batches of jobs on computer clusters of up to thousands of CPUs. Parasol was developed initially by Jim Kent, and extended by other members of the Genome Bioinformatics Group at the University of California Santa Cruz. Parasol is currently a fairly minimal system, but what it does it does well. It can start up 500 jobs per second. It restarts jobs in response to the inevitable systems failures that occur on large clusters. If some of your jobs die because of your program bugs, Parasol can help manage restarting the crashed jobs after you fix your program as well. The parasol source is at http://www.soe.ucsc.edu/~kent/src/parasol.tgz.
To start things rolling you need to make a directory to put the batch in, and create a job list in this directory. A basic job list is just a series of command lines, one for each job. A fancier job list can contain checks on the input and output files for each job. While generally jobs in a batch are somehow related, they need not be. Here’s a sample job list that compiles some code in parallel. :
cc –c lions.c
cc –c tigers.c
cc –c bears.c
cc –c turkeys.c
cc –c bats.c
To run this on parasol you’d log onto the machine running the parasol server and
mkdir compileTheAnimals
cd compileTheAnimals
para make ../job.lst
assuming you’d already created job.lst . The para program will hang out periodically printing a little information until all the jobs are done or one of them fails repeatedly.
If your job has problems you can retrieve them with
para problems
This will among other things copy over the standard error output from the cluster nodes. You may need to put the parasol host in your .rhosts file for this to work.
Parasol is really meant for large batches of jobs. One big set of jobs we do at the Genome group is comparing the mouse vs. the human genomes. Since humans and mice have a common ancestor, there are stretches of DNA that are similar (homologous) between the two species. However since this common ancestor was almost 100 million years ago most of the DNA has changed quite a bit. It’s not possible to find homologous regions with a simple string search. Instead sophisticated “alignment” techniques must be used. Aligning whole genomes against each other is one of the most compute intensive areas in bioinformatics. Fortunately it is a problem that can be easily distributed across many machines. We can break the human genome into approximately 200 pieces, and the mouse genome into approximately 1000 pieces. We then align each piece of the human against each piece of the mouse. This lets us split divide the big job into 200,000 little jobs. Each of these little jobs might take about 10 minutes on one computer. On 1000 computers the whole set of jobs should take less than two days.
Since parasol wants a line of input for each job, clearly we need an automatic way of generating the job list. This is where the program “gensub2” comes in. Gensub2 takes as input two lists of files and a template file. The template file contains three parts – a preamble (everything before the #LOOP line) which is literally copied to the output, a middle section which is repeated in the output with the filenames from the lists substituted in, and a postscript (everything after the #ENDLOOP line) that is copied literally to the output. We’ve used gensub2 with a number of schedulers – Condor and Codine as well as Parasol – and it has proven a simple but effective tool. Here is a gensub2 template file for creating mouse/human alignments with the BLAT program:
#LOOP
/cluster/bin/blat $(path1) $(path2) out/$(root1)_$(root2)
#ENDLOOP
Note depending on how parasol was set up on your system you may need to include the full path name for executables. To create a job list do the following steps:
mkdir mouseVsHuman
cd mouseVsHuman
mkdir out
ls –1 /data/human/dna/pieces/*.seq > human.lst
ls –1 /data/mouse/dna/pieces/*.seq > mouse.lst
gensub2 human.lst mouse.lst template jobList
You could at this point run “para make” on jobList. In the end if nothing went wrong you’d have a directory ‘out’ full of files that look something like mouse001_human001, mouse001_human002, …. However this is a big enough job that it’s very likely something will go wrong. It would be good to do a little sanity check on the output.
It wouldn’t be a bad idea to write a program to double-check the output yourself. However Parasol can do some basic checks automatically. If we change our template file to read:
#LOOP
/cluster/bin/blat $(path1) $(path2) {check out line+
out/(root1)_$(root2)}
#ENDLOOP
then Parasol will check that each file has at least one line, and that the last line is complete. This won’t catch every problem, but it will catch the vast majority of them. Parasol will also of course detect program crashes. It’s possible to put checks on input files as well.
Since this is a job liable to take more than a day, rather than running ‘para make’, I like to take a more interactive approach as follows:
para create jobList # Create job tracking database
para try # Run ten jobs
para check # Check up on jobs
para push # Push up to 100,000 jobs
para check # Check up on jobs
para push # Retry any crashed jobs so far
para time # Collect timing information.
It’s always good to do a ‘try’ before a ‘push’, letting the first ten jobs run for five minutes or so and then checking up on them. Note that “para push” will only resubmit jobs that have already crashed when para push is run. If you’re interactively monitoring your jobs periodically pushing and checking (and timing) is useful. Once you decide things are going well you can issue a “para shove” command, which will will check on your jobs every few minutes, restarting jobs if necessary, until the jobs are all done or one of the jobs has failed (crashed repeatedly). If you want to go back to interactive monitoring of your jobs after a “para shove”, just hit <control C> to stop the shoving.
Typically there are three types of problems you encounter with a big batch of jobs. The most common problem is a new bug in the program you’re running, which causes all or most of your jobs to crash. Doing a “para try” will usually find these problems without wasting a lot of cluster time. Another common problem is when one of the machines in the cluster is acting flaky for some reason – often because it is having an i/o problem of some sort. When a “para check” shows that some jobs are crashing this is the first thing to check. Do a “para problems” to get a report on the jobs that are having troubles, and see if they are all on the same host. If so tell the system administrators about the problem. They can use the “parasol remove machine” command to remove the machine from parasol service until the problem can be fixed. The third common problem is due to rarely manifesting bugs in your code. The symptom of this is the same job crashing repeatedly on different machines. There’s no cure for this except doing some debugging – usually starting with isolating the minimum input needed to cause the problem.
Parosol sets up a few environment variables prior to running jobs. The PARASOL variable is set to the version of Parasol that is currently running. Programs can use this to determine whether they are running under Parasol. The JOB_ID variable is set to a unique number for each job that is run. How the PATH variable is set up depends on how Parasol was installed. At UCSC we have it set up to run things out of the bin directory of each user as well as /bin, /usr/bin, and /cluster/bin. In general Parasol’s PATH will be different from the path in your shell, so when in doubt include the appropriate directory names in front of your executables. You can also depend on the USER, HOME, and HOST variables being set to your user name, your home directory and the name of the machine the program is running on. Beyond this what is in the environment varies from installation to installation.
To install parasol you’ll need to create a list of the machines in your cluster, and then run the ‘paraNodeStart’ which will launch a ‘paraNode’ server on each machine. Next pick a machine, preferably one without much else going on in it to launch a ‘paraHub’ server. Parasol users will then log onto this machine to launch jobs either using the ‘para’ or the ‘parasol’ commands. You can bring down the paraNode servers with the ‘paraNodeStop’ command. You can bring down the paraHub with ‘paraHubStop’.
Here’s a list of the user programs in the Parasol package:
· para – a user-level command which manages a single batch of jobs
· gensub2 – a user-level command for generating large batches of jobs, each on a different set of files.
Here’s a list of the administrative programs:
· parasol – administrative command for looking at all active jobs in the system and adding or removing nodes on the fly. Users may find this command handy too, though several aspects of it can only be accessed by root.
· paraNode – runs jobs on a compute node
· paraHub – job scheduler
· paraNodeStart – start up paraNode daemons
· paraNodeStop – bring down paraNode daemons
· paraNodeStatus – query status of paraNode daemons
· paraHubStop – bring down paraHub daemon
The first thing an administrator needs to do is create a list of machines to use as compute nodes. This list is a tab-separated file with one line for each machine and the following fields:
· <host name> - Network name of host.
· <number of cpus> - Number of CPUs to use in the machine
· <meg of memory> - Megabytes of memory in machine
· <local temp dir> - A directory in the machine for temp files. Parasol puts the standard error output here. Ideally this directory should periodically have files not accessed for a week removed.
· <local data dir> - A directory where local data resides. Ignored for now.
· <local data gig> - Size of local data directory. Ignored for now.
· <network switch> - Name of network switch this is on. Ignored for now.
Here is a small example:
testBad.node.ucsc 2
1024 /tmp /scratch 36000 r1
kkr1u01.kilokluster.ucsc.edu 2 1024
/tmp /scratch 36000 r1
kkr1u02.kilokluster.ucsc.edu 2 1024
/tmp /scratch 36000 r1
To start up the Parasol system do the following:
ssh parasolhost
mkdir parasolRootDir
cd parasolRootDir
cp machineList .
su root
paraNodeStart machineList hub=parasolHost
userPath=bin sysPath=/share/bin:/usr/local/bin
log=/tmp/log
paraHub machineList –log=log &
You may well want to customize the userPath and sysPath according to the conventions used at your own installation. The hub log will create a log that averages about 500 bytes per job, which can be useful for debugging. Omit the –log parameter for no log (not even syslog). For further security use the –subnet parameter to paraHub which will restrict incoming messages to the hub to a particular subnet. A common example of this would be
paraHub machineList –log=log –subnet=124.168 &
The paraHub daemon will detect machines that are down and work around them. Every
ten minutes it will see if a machine that is down has come back up. It’s possible to add new machines without bringing down the daemon using
parasol add machine tempDir
The Parasol system consists primarily of a scheduling server and two clients: parasol and para. The parasol scheduling system consists of a hub daemon, a heartbeat daemon, and number of spoke daemons running on the server system, and a node daemon running on each compute node. The parasol and para clients both communicate only with the hub daemon. The parasol client is quite lightwieght doing little more than forwarding messages from the command line to the parasol hub and printing the replies. The para client is more substantial. It creates a database around a batch of jobs submitted by the user and tracks the progress of these jobs through the scheduler. Para does not expect the scheduler or the compute nodes to be completely reliable, and will resubmit jobs as necessary.

Figure 1. Parasol processes (circles) and message flow (arrows). All processes reside on the scheduling machine except for the node processes. A spoke process can send messages to any node.
The hub daemon is the heart of the parasol scheduling system. The hub daemon spawns the heartbeat daemon and the spoke deamons on startup. The hub daemon then goes into a loop processing messages it recieves on the hub TCP/IP socket. The hub daemon does not do anything time consuming in this loop. The main thing the hub daemon does is put jobs on the job list, move machines from the busy list to the free list, and call the 'runner' routine.
The runner routine looks to see if there is a free machine, a free spoke, and a job to run. If so it will send a message to the spoke telling it to run the job on the machine, and then move the job from the 'pending' to the 'running' list, the spoke from the freeSpoke to the busySpoke list, and the machine from the freeMachine to the busyMachine list. This
indirection of starting jobs via a separate spoke process avoids the hub daemon itself having to wait to find out if a machine is down.
When a spoke is done assigning a job, the spoke sends a 'recycleSpoke' message to the hub, which puts the spoke back on the freeSpoke list. Likewise when a job is done the machine running the jobs sends a 'job done' message to the hub, which puts the machine back on the free list, writes the job exit code to a results file, and removes the job
from the system.
Sometimes a spoke will find that a machine is down. In this case it sends a 'node down' message to the hub as well as the 'spoke free' message. The hub will then move the machine to the deadMachines list, and put the job back on the top of the pending list.
The heartbeat daemon simply sits in a loop sending heartbeat messages to the hub every so often (every 30 seconds currently), and sleeping the rest of the time. When the hub gets a heartbeat message it does a few things:
· It calls runner to try and start some more jobs. (Runner is also called at the end of processing a recycleSpoke, jobDone, addJob or addMachine message. Typically runner won't find anything new to run in the heartbeat, but this is put here mostly just in case of unforseen issues.)
· It calls a routine to see if machines on the dead list have come back to life.
· It calls a routine to see if jobs the system thinks have been running for a long time are still running on the machine they have been assigned to. If the machine has gone down it is moved to the dead list and the job is reassigned. If the machine is up it lets the hub know if it has already finished the job. This currently does happen, albeit in perhaps 1 in 10,000 jobs due to failure for the jobDone message to get through. If the job is still running the hub gets notified, and check back on the job five or ten minutes later.
This whole system depends on the hub daemon being able to finish processing messages fast enough to keep the connection queue on the hub socket from overflowing. Each job involves 3 messages to the hub socket:
addJob - from a client to add the job to the system
recycleSpoke - from the spoke after it's dispatched the job
jobDone - from the compute node when the job is finished
On some of the earlier Linux kernals we had trouble with the connection queue overflowing when dispatching lots of short jobs. This seemed to be from the jobDone messages coming in faster than Linux could make connections rather than the
hub daemon being slow. On the kilokluster with a more modern kernal this has been less of a problem – only manifesting on 1,000 CPUs when running 50,000 jobs that do nothing. (The shorter the job the more stressful it is on the scheduler). Should overflow occur the heartbeat processing will gradually rescue the system in any case, but the throughput will be greatly reduced.
When the hub daemon first comes up it queries each node daemon for running and recently finished jobs. If the hub is brought down for maintenance or crashes due to a program error, CPU time already invested in running jobs on nodes is usually recovered. Currently jobs that have not yet been sent to nodes will be lost, though if they are submitted via the ‘para’ client below, the para client will resubmit these jobs after noting ‘tracking errors.’ This feature of reconnecting to running jobs is new as of paraHub version 2.
The node daemon is relatively simple. It forks off a process to run a job in response to a “run” message over it’s TCP/IP socket from a spoke. It sends a “jobDone” message to the hub with the return status code and user and system run times when a job finishes. It keeps a count of CPUs free, and refuses to start a job unless a CPU is free. It also will kill a job in response to a “kill” message from the hub.
The node daemon also participates in the heartbeat processing. It responds to a “resurrect” message with an “alive” message in the process that leads to a machine that was down being restored to use. It responds to “check jobId” messages with a response saying whether or not the job is running, finished or has never been seen.
The node daemon needs to be run as root, so that it can setuid to run jobs as any particular user.
The para client manages batchs of jobs through the scheduler. It is designed to catch jobs which may have run into problems of any sort, and give the user a chance to rerun them after the problem is fixed. The major input to para is a job list. Each job can have checks associated with it before and after the job itself is run. Initially para reads the job list and transforms it into a job database. The central routine of para, paraCycle, reads the job database, queries the hub to see what jobs are running and waiting, looks at the results file to see what jobs are finished, performs output checks on the finished jobs, sends unsubmitted jobs or jobs that need to be rerun to the hub, updates the database in memory, and writes it back out. The database is in a comma-delimited text format with one job per line. The job database keeps track of the timing and status of each job submission. The code to read and write this database was generated with AutoSql. para will avoid loading the hub with more than 100,000 jobs at a time, and will only submit failed jobs three times before giving up on them.
Note that all commands will produce a usage summary if run with no arguments. In general this summary will be more up to date than the descriptions here.
Manage a batch of jobs in
parallel on a compute cluster.
Normal usage is to do a 'para
create' followed by 'para push' until job is done. Use 'para check' to check status
usage:
para command [command-specific arguments]
The commands are:
para create jobList
This makes the job-tracking database from a text file with the
command line for each job on a separate line. See below for
further description of the jobList. The ‘in’ checks in the
jobList will be performed at this time.
para push
This pushes forward the batch of jobs, submitting jobs to
parasol.
It will limit parasol queue size to something not too big and
retry failed jobs
options:
-retries=N Number of
retries per job - default 4.
-maxQueue=N Number of
jobs to allow on parasol queue
-
default 200000
-minPush=N Minimum
number of jobs to queue - default 1.
Overrides maxQueue
-maxPush=N Maximum number of jobs to queue - default 100000
-warnTime=N Number of minutes job runs before hang warning
- default 4320 (3 days)
-killTime=N Number of minutes job runs before push kills it
- default 20160 (2 weeks)
para try
This is like para push, but only submits up to 10 jobs
para shove
Push jobs in this database until all are done or one fails after
N
retries
para make jobList
Create database and run all jobs in it if possible. If one job
crashes repeatedly this will fail. Suitable for inclusion in
makefiles. Same as a
'create' followed by a 'shove'.
para check
This checks on the progress of the jobs.
para stop
This stops all the jobs in the batch
para chill
Tells system to not launch more jobs in this batch, but
does not stop jobs that are already running.
para finished
List jobs that have finished
para hung
List hung jobs in the batch
para crashed
List jobs that crashed or failed output checks the last time
they
were run.
para failed
List jobs that crashed after repeated restarts.
para problems
List jobs that had problems (even if successfully rerun).
Includes host info
para running
Print info on currently running jobs
para time
List timing information
Job lists are files with one command per line. The following two lines would be a simple job list:
echo hello boss
ls -l
Job list can have build in checks on either the input or the output.
The overall format for a check is:
{check in|out exists|exists+|line|line+ fileName}
this will be replaced by simply fileName when the job is run.
Checks "in" are used to make sure that input files are correct.
Checks "out" are used to make sure that output files are correct.
There are four types of checks:
exists - file must exist
exists+ - file must exist and be non-empty
line - file must end with a complete line
line+ - file must end with a complete line and be non-empty
Here's an example of a blat job spec:
blat {check in line+ human.fa} {check in line+ mouse1.fa} {check out line+ hm1.psl}
blat {check in line+ human.fa} {check in line+ mouse2.fa} {check out line+ hm2.psl}
Currently it’s not possible to redirect input or output in a job list.
Generate job submission file from template and two file lists.
usage:
gensub2 <file list 1> <file list 2> <template
file> <output file>
This will substitute each file
in the file lists for $(path1)
and $(path2)in the template
between #LOOP and #ENDLOOP, and write
the results to the output. Other substitution variables are:
$(path1) - full path
name of first file
$(path1) - full path
name of second file
$(dir1) - first directory. Includes trailing slash
if any.
$(dir2) - second
directory
$(lastDir1) - The last directory in the first path. Includes
trailing slash
if any.
$(lastDir2) - The last directory in the second path. Includes
trailing slash if any.
$(root1) - first
file name without directory or extension
$(root2) - second
file name without directory or extension
$(ext1) - first
file extension
$(ext2) - second
file extension
$(file1) - name
without dir of first file
$(file2) - name
without dir of second file
$(num1) - index of
first file in list
$(num2) - index of
second file in list
The <file list 2>
parameter can be 'single' if there is only one file
list. By default the order is diagonal, meaning if the first list is
ABC and the secon list is abc the combined order is Aa Ba Ab Ca Bb Ac Cb Bc Cc.
This tends to put the largest jobs first if the file lists are both
sorted by size. The following options can change this:
-group1 - write elements in order Aa Ab Ac Ba Bb Bc Ca Cb Cc
-group2 - write elements in order Aa Ba Ca Ab Bb Cb Ac Bc Cc
Parasol is the name given to the
overall system for managing jobs on
a computer cluster and to this
specific program. This program is
intended primarily for system
administrators, and some options of this command can only be used if logged in
as root. Regular users can do a
‘parasol status,’ or any of the ‘parasol list’ commands though, and may find
these useful to find out how busy the computer cluster is.
parasol add machine machineName
tempDir - add new machine to pool
parasol remove machine machineName - remove machine from pool
parasol add spoke - add a new spoke daemon
parasol [options] add job command-line - add job to list
options: -out=out -in=in -dir=dir -results=file -verbose
parasol remove job id - remove job of given ID
parasol ping [count] - ping hub server to make sure it's alive.
parasol remove jobs userName [jobPattern] - remove jobs
submitted
by user that match jobPattern. The jobPattern may include
the wildcards ? and *, though these will need to be
preceded
by escapes (‘\’) if run from most shells.
parasol list machines - list machines in pool
parasol list jobs - list jobs one per line
parasol list users – list users one per line
parasol list batches – list batches one per line
parasol status - print status of machines, jobs, and spokes.
paraHub
- parasol hub server.
usage:
paraHub machineList
Where machine list is a file
with machine names in the
first column, and number of CPUs
in the second column.
This file may include additional
columns as well.
options:
spokes=N Number of processes that feed jobs to nodes - default
30
jobCheckPeriod=N
Minutes between checking on job - default 10
machineCheckPeriod=N Seconds between checking on machine
-
default 20
subnet=XXX.YYY.ZZZ Only accept connections from subnet
(example subnet=192.168)
nextJobId=N Starting job ID number, by default 100,000 past
last job number used in previous run.
log=logFile Write a log to logFile. Use 'stdout' here for
console
logFlush Flush log with every write. This will slow down the
system somewhat, but can be useful in debugging.
noResume Don’t
attempt to reconnect with jobs still running
or recently finished on nodes. Used for debugging.
paraHubStop - Shut down paraHub
daemon
usage:
paraHubStop now
paraNode
- parasol node server.
usage:
paraNode start
options:
log=file - file may be 'stdout' to go to console
umask=022 – file creation mask, defaults to 002
so that files are world readable and group writable.
userPath=bin:bin/i386 User dirs to add to path
sysPath=/sbin:/local/bin System dirs to add to path
hub=host – restict access to connections from hub
cpu=N - Number of CPUs to use.
Default 1
paraNodeStart - Start up parasol
node daemons on a list of machines
usage:
paraNodeStart machineList
where machineList is a file
containing a list of hosts
Machine list contains the
following columns:
<name> <number of cpus>
It may have other columns as
well
options:
exe=/path/to/paranode
umask=022 – file creation mask, defaults to 002
userPath=bin:bin/i386 User dirs to add to path
sysPath=/sbin:/local/bin System dirs to add to path
log=/path/to/log/file – this is best on a local disk of node
hub=machineHostingParaHub - nodes will ignore
messages from elsewhere
rsh=/path/to/rsh/like/command
paraNodeStop - Shut down parasol
node daemons on a list of machines
usage:
paraNodeStop machineList
paraNodeStatus - Check status of
paraNode on a list of machines
usage:
paraStat machineList
options:
-long List details of
recent and current jobs.