runManySections.py - Easy Interface to CMSLPC Condor CAF and CERN's LSF Batch System

Introduction

runManySections.py is designed to make it easy to run many different sections (or jobs) at once on the CMSLPC CAF or CERN's batch system. It complements CRAB, as runManySections.py is meant to be used with non-cmsRun executables. The general idea is that you pass in a list of commands you would like run and you get the output of these commands back.

It is currently configured to run on the Condor system at the CMSLPC CAF and on CERN's LSF batch system. It is very easy to configure for other Condor systems and fairly easy to modify for most other batch systems. Please email me at cplager+cmshelp@fnal.gov if you are interested.

Important: At the LPC at FNAL, the CAF is a very powerful and useful tool. It is also, unfortunately, relatively easy to do "bad" things that will affect all of the computers at the LPC. To avoid this, be careful about how your jobs read from and write to shared disk space.

At CERN, these are not a concern, and it is allowed to read from and write to the /afs disk space.

Quick Overview

Setup

If you are running on cmslpc machines, I recommend using the scripts in my area (~cplager/bin/runManySections.py). If you are running elsewhere or prefer your own copy of the scripts, you can grab the two needed files with the commands below:

wget http://home.fnal.gov/~cplager/log/RunMany/runManySections.py
wget http://home.fnal.gov/~cplager/log/RunMany/runMany.bash
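
After downloading, you will likely need to make the script executable (a standard shell step, not specific to these scripts):

chmod +x runManySections.py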

Basic Idea

The idea is to have a file that lists all jobs that you want run. To submit the jobs:

FNAL

~cplager/bin/runManySections.py --submitCondor myJobs.cmd

CERN

~cplager/bin/runManySections.py --submitLsf --lsfOptions "-q 8nh" myJobs.cmd
where the myJobs.cmd file has a header that sets up everything and a body that lists commands (and "-q 8nh" tells LSF to use the 8 natural hour queue).
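
Once submitted, you can monitor the jobs with the standard batch-system tools (these are generic Condor and LSF commands, not part of runManySections.py):

condor_q $USER     # at FNAL (Condor)
bjobs              # at CERN (LSF)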

(I highly recommend testing jobs locally before submitting them to the batch system; see Debugging Jobs Locally below.)

Below I show how to write one of these files "by hand." If you already have a list of jobs that you want run, you should consider Using runManySections.py to Create Command File below.

# -*- sh -*- # for font lock mode

######################
## Setup Everything ##
######################

# How the environment will be set up
- env = cd /uscms/home/cplager/work/cmssw/CMSSW_3_5_7; . /uscmst1/prod/sw/cms/bashrc prod; eval `scramv1 runtime -sh`; cd -

##############
## Commands ##
##############

# logFileName       Command
out1.log            myFirstCommand   name1.output
out2.log            mySecondCommand  name2.output
In the file above, the - env line in the header specifies how the environment is set up, and each line in the commands section gives a log file name followed by the command to run.

All of the output files (e.g., out1.log, out2.log, name1.output, and name2.output) will be returned to the directory where you were when you called runManySections.py.

Important: At FNAL, it is OK to read a few small files from your home area on the batch system, but please keep any other reading from or writing to shared disk areas to a minimum.

At CERN, while the same things are still recommended, they are not required.

Putting Unique Job ID in Output Filenames

(Also using environment variables in job command file)

Although not required, I highly recommend setting up your jobs so that the output file names (i.e., the output of your job as well as the log files) contain information about which CAF job and section they are run in. The reasons are:

  1. This ensures that you don't accidentally overwrite files from previous jobs.
  2. It makes it easier to follow what has been happening.

Doing this is quite simple, as the environment variable $JID contains the necessary information. In the following example, myFirstCommand takes the output filename as its only argument. So instead of the line from above:

# logFileName       Command
out1.log            myFirstCommand   name1.output
We should write:
# logFileName       Command
out1_$(JID).log     myFirstCommand   name1_$(JID).output
This will cause the log file name to be something like out1_6123_1.log and the output file name to be name1_6123_1.output (where 6123 is the job ID and 1 is the section number).

Note that you can access any environment variable this way: $(AnyEnvironmentVariable).
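
For example, to include your user name in the log file name as well (purely illustrative; USER is a standard shell environment variable):

# logFileName             Command
out1_$(USER)_$(JID).log   myFirstCommand   name1_$(JID).output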

Including a Tarball

If you would like to include a tarball that will be automatically untarred when running your jobs, you can add a line to your command file:

- tarfile = myTarBall.tgz

The tarball will be untarred by default into a directory called tardir/. The tarball will be untarred before the environment is set up, so this can be a nice way of setting up your environment. If you include a script setupMyEnvironment.bash, then your commands file could contain just the line:

- env = . tardir/setupMyEnvironment.bash

where . tardir/setupMyEnvironment.bash is the bash equivalent of tcsh's source tardir/setupMyEnvironment.tcsh.
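
Such a setup script might simply contain the same commands as the inline - env line shown earlier (a sketch only; substitute your own release area):

# setupMyEnvironment.bash - hypothetical example mirroring the "- env" line above
cd /uscms/home/cplager/work/cmssw/CMSSW_3_5_7
. /uscmst1/prod/sw/cms/bashrc prod
eval `scramv1 runtime -sh`
cd -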

Note: Any files that are in the working directory on the Condor system are copied back to your home area. Any files that are in a subdirectory (e.g., tardir) are not. This is why files are not untarred into the main directory.

This assumes you have a gzipped tarball. E.g.,

tar czvf myTarBall.tgz firstFile.config secondFile.config thirdfile fourthfile
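
You can list what went into the tarball with standard tar options:

tar tzvf myTarBall.tgz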

Using runManySections.py to Create Command File

You can use the script itself to help generate the command file. Start with just a simple list of commands you wish to run:

cplager@cmslpc16> cat commands.listOfJobs
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 1)
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 2)
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 3)
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 4)
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 5)
Now we can use this file and have the script set up most or all of the other details:
~cplager/bin/development/runManySections.py --createCommandFile --cmssw --addLog --setTarball=tarball.tgz \
  commands.listOfJobs commands.cmd

Except for --createCommandFile, all of the other options above are optional.
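
The generated commands.cmd should look much like the hand-written example above: a header with - env and - tarfile lines, followed by one line per command. A rough sketch (the log file names that --addLog produces are an assumption here):

# -*- sh -*- # for font lock mode
- env = ...CMSSW environment setup...
- tarfile = tarball.tgz

log_$(JID).log   root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 1)
log_$(JID).log   root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 2)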

Running a Compiled Root Script

Here is how to do this when you have a file macro.cc that contains a function void macro(TString outputname, int mode). Below is a silly example:

silly.C

#include <iostream>
#include "TString.h"

void silly (TString name, int mode)
{
   std::cout << "Hi " << name << ", " << mode << std::endl;
}
Now compile this script inside of ROOT:
cplager@cmslpc16> root -l
root [0] .L silly.C+
Info in <TUnixSystem::ACLiC>: creating shared library /uscms_data/d2/cplager/shabnam/./silly_C.so
root [1] 
Now we need a macro that will load silly and run it:

runSilly.C

void runSilly (TString name, int mode)
{
   gSystem->Load("tardir/silly_C.so");
   silly(name, mode);
}

Create a tarball containing the shared object library and the load script:

tar czvf tarball.tgz silly_C.so runSilly.C
The command used to run this looks like:
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 1)
See Using runManySections.py to Create Command File above for a detailed example of submitting this to the queue.

Debugging Jobs Locally

Important: I recommend logging out and back in to the LPC and NOT setting up any CMSSW or any other environment.

Before submitting jobs to the queue, it is a good idea to run (at least a short) one locally. To do this, we use the --testSection and --runTest flags.
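
For example, a local test of the first section might look something like the following (the exact argument syntax, and whether the two flags are used together, are assumptions; check the script's built-in help):

~cplager/bin/runManySections.py --testSection=1 --runTest myJobs.cmd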

Make sure your job runs correctly here before submitting it to the batch system.

CERN (LSF) Versus FNAL (Condor) Differences

This system has been designed with Condor in mind and then adapted to LSF. Here are some of the differences of which you should be aware.


Last modified: Wed Dec 15 14:10:05 CST 2010 by cplager+cmshelp@fnal.gov