runManySections.py - Easy Interface to CMSLPC Condor CAF and CERN's LSF Batch System

Introduction

runManySections.py is designed to make it easy to run many different sections (or jobs) at once on the CMSLPC CAF or CERN's batch system. It complements CRAB, as runManySections.py is meant to be used with non-cmsRun executables. The general idea is that you pass in a list of commands you would like run and you get the output of these commands back.

It is currently configured to run on the Condor system at the CMSLPC CAF and on CERN's LSF batch system. It is very easy to configure for other Condor systems and fairly easy to modify for most other batch systems. Please email me at cplager+cmshelp@fnal.gov if you are interested.

Important: At the LPC at FNAL, the CAF is a very powerful and useful tool. It is also, unfortunately, relatively easy to do "bad" things that will affect all of the computers at the LPC. To avoid this, be careful about how your jobs read from and write to shared disk space.

At CERN, these are not a concern, and it is allowed to read from and write to the /afs disk space.

Quick Overview

Setup

If you are running on cmslpc machines, I recommend using the scripts in my area (~cplager/bin/runManySections.py). If you are running elsewhere or prefer your own copy of the scripts, you can grab the two needed files with the commands below:

wget http://home.fnal.gov/~cplager/log/RunMany/runManySections.py
wget http://home.fnal.gov/~cplager/log/RunMany/runMany.bash
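
After downloading, you will likely need to make the script executable (a standard shell step, not specific to these scripts):

chmod +x runManySections.py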

Basic Idea

The idea is to have a file that lists all jobs that you want run. To submit the jobs:

FNAL

~cplager/bin/runManySections.py --submitCondor myJobs.cmd

CERN

~cplager/bin/runManySections.py --submitLsf --lsfOptions "-q 8nh" myJobs.cmd
where the myJobs.cmd file has a header that sets up everything and a body that lists commands (and "-q 8nh" tells LSF to use the 8 natural hour queue).
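
Once submitted, you can monitor the jobs with the standard batch-system tools (these are generic Condor and LSF commands, not part of runManySections.py):

condor_q $USER     # at FNAL (Condor)
bjobs              # at CERN (LSF)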

(I highly recommend testing jobs locally before submitting them to the batch system; see Debugging Jobs Locally below.)

Below I show how to write one of these files "by hand." If you already have a list of jobs that you want run, you should consider Using runManySections.py to Create Command File below.

# -*- sh -*- # for font lock mode

######################
## Setup Everything ##
######################

# How the environment will be set up
- env = cd /uscms/home/cplager/work/cmssw/CMSSW_3_5_7; . /uscmst1/prod/sw/cms/bashrc prod; eval `scramv1 runtime -sh`; cd -

##############
## Commands ##
##############

# logFileName       Command
out1.log            myFirstCommand   name1.output
out2.log            mySecondCommand  name2.output
In the file above, the - env line in the header specifies how the environment is set up, and each line in the commands section gives a log file name followed by the command to run.

All of the output files (e.g., out1.log, out2.log, name1.output, and name2.output) will be returned to the directory where you were when you called runManySections.py.

Important: At FNAL, it is OK to read a few small files from your home area on the batch system, but please keep any other reading from or writing to shared disk areas to a minimum.

At CERN, while the same things are still recommended, they are not required.

Putting Unique Job ID in Output Filenames

(Also using environment variables in job command file)

Although not required, I highly recommend setting up your jobs so that the output file names (i.e., the output of your job as well as the log files) contain information about which CAF job and section they are run in. The reasons are:

  1. This ensures that you don't accidentally overwrite files from previous jobs.
  2. It makes it easier to follow what has been happening.

Doing this is quite simple, as the environment variable $JID contains the necessary information. In the following example, myFirstCommand takes the output filename as its only argument. So instead of the line from above:

# logFileName       Command
out1.log            myFirstCommand   name1.output
We should write:
# logFileName       Command
out1_$(JID).log     myFirstCommand   name1_$(JID).output
This will cause the log file name to be something like out1_6123_1.log and the output file name to be name1_6123_1.output (where 6123 is the job ID and 1 is the section number).

Note that you can access any environment variable this way: $(AnyEnvironmentVariable).
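
For example, to include your user name in the log file name as well (purely illustrative; USER is a standard shell environment variable):

# logFileName             Command
out1_$(USER)_$(JID).log   myFirstCommand   name1_$(JID).output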

Including a Tarball

If you would like to include a tarball that will be automatically untarred when running your jobs, you can add a line to your command file:

- tarfile = myTarBall.tgz

The tarball will be untarred by default into a directory called tardir/. The tarball will be untarred before the environment is set up, so this can be a nice way of setting up your environment. If you include a script setupMyEnvironment.bash, then your commands file could contain just the line:

- env = . tardir/setupMyEnvironment.bash

where . tardir/setupMyEnvironment.bash is the bash equivalent of tcsh's source tardir/setupMyEnvironment.tcsh.
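
Such a setup script might simply contain the same commands as the inline - env line shown earlier (a sketch only; substitute your own release area):

# setupMyEnvironment.bash - hypothetical example mirroring the "- env" line above
cd /uscms/home/cplager/work/cmssw/CMSSW_3_5_7
. /uscmst1/prod/sw/cms/bashrc prod
eval `scramv1 runtime -sh`
cd -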

Note: Any files that are in the working directory on the Condor system are copied back to your home area. Any files that are in a subdirectory (e.g., tardir) are not. This is why files are not untarred into the main directory.

This assumes you have a gzipped tarball. E.g.,

tar czvf myTarBall.tgz firstFile.config secondFile.config thirdfile fourthfile
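
You can list what went into the tarball with standard tar options:

tar tzvf myTarBall.tgz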

Using runManySections.py to Create Command File

You can use the script itself to help generate the command file. Start with just a simple list of commands you wish to run:

cplager@cmslpc16> cat commands.listOfJobs
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 1)
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 2)
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 3)
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 4)
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 5)
Now we can use this file and have the script set up most or all of the other details:
~cplager/bin/development/runManySections.py --createCommandFile --cmssw --addLog --setTarball=tarball.tgz \
  commands.listOfJobs commands.cmd

Except for --createCommandFile, all of the other options above are optional.
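
The generated commands.cmd should look much like the hand-written example above: a header with - env and - tarfile lines, followed by one line per command. A rough sketch (the log file names that --addLog produces are an assumption here):

# -*- sh -*- # for font lock mode
- env = ...CMSSW environment setup...
- tarfile = tarball.tgz

log_$(JID).log   root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 1)
log_$(JID).log   root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 2)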

Running a Compiled Root Script

Here is how to do this when you have a file macro.cc that contains a function void macro(TString outputname, int mode). Below is a silly example:

silly.C

#include <iostream>
#include "TString.h"

void silly (TString name, int mode)
{
   std::cout << "Hi " << name << ", " << mode << std::endl;
}
Now compile this script inside of ROOT:
cplager@cmslpc16> root -l
root [0] .L silly.C+
Info in <TUnixSystem::ACLiC>: creating shared library /uscms_data/d2/cplager/shabnam/./silly_C.so
root [1] 
Now we need a macro that will load silly and run it:

runSilly.C

void runSilly (TString name, int mode)
{
   gSystem->Load("tardir/silly_C.so");
   silly(name, mode);
}

Create a tarball containing the shared object library and the load script:

tar czvf tarball.tgz silly_C.so runSilly.C
The command used to run this looks like:
root -l -b -q -n   tardir/runSilly.C("output_$(JID).root", 1)
See Using runManySections.py to Create Command File above for a detailed example of submitting this to the queue.

Debugging Jobs Locally

Important: I recommend logging out and back in to the LPC and NOT setting up any CMSSW or any other environment.

Before submitting jobs to the queue, it is a good idea to run (at least a short) one locally. To do this, we use the --testSection and --runTest flags.
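
For example, a local test of the first section might look something like the following (the exact argument syntax, and whether the two flags are used together, are assumptions; check the script's built-in help):

~cplager/bin/runManySections.py --testSection=1 --runTest myJobs.cmd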

Make sure your job runs correctly here before submitting it to the batch system.

CERN (LSF) Versus FNAL (Condor) Differences

This system has been designed with Condor in mind and then adapted to LSF. Here are some of the differences of which you should be aware.


Last modified: Wed Dec 15 14:10:05 CST 2010 by cplager+cmshelp@fnal.gov