
Data and Computing Facility Operations

Facility Operations: Batch System

Batch Systems:

The batch system available to users of the UAF is condor, which lets you submit jobs to the lpc batch farm or the production farm. This page describes how to use this batch system.

  1. How do I use Condor to submit to the lpc batch farm?
  2. How do I use CondorG to submit to the production batch farm?
  3. How do I use condor scratch space to speed up my jobs?

For any information not covered below, visit the condor user's manual.

1. How do I use Condor to submit to the lpc batch farm?

The first step in using the condor system is writing the condor submit description file. This file tells the system what you want it to do and how. Below is an example that runs a system program which sleeps for one minute, then quits. To try it out, copy everything from the cat command through +EOF and paste it into your terminal; this creates the file named "sleep_condor". Each line of the file is explained at the end of this section.

cat > sleep_condor << +EOF

universe = vanilla
Executable = /bin/sleep
Requirements = Memory >= 199 && OpSys == "LINUX" && (Arch != "DUMMY")
Should_Transfer_Files = NO
Output = /uscms_data/d1/${LOGNAME}/sleep_\$(Cluster)_\$(Process).stdout
Error = /uscms_data/d1/${LOGNAME}/sleep_\$(Cluster)_\$(Process).stderr
Log = /uscms_data/d1/${LOGNAME}/sleep_\$(Cluster)_\$(Process).log
notify_user = ${LOGNAME}@FNAL.GOV
Arguments = 60
Queue 5

+EOF
  • To submit only to 64-bit nodes, change the Arch attribute in the Requirements to (Arch == "X86_64").
  • To submit to either 32-bit or 64-bit nodes, keep the Arch attribute in the Requirements as (Arch != "DUMMY"), as in the example.
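As a minimal sketch of the 64-bit change described above, the Arch clause can be swapped with sed. The file names sleep_condor_any and sleep_condor_64 are hypothetical stand-ins so that your real sleep_condor file is not touched; the pattern tolerates extra spaces inside the parentheses.

```shell
# Stand-in copy of the example file, trimmed to the lines that matter here.
cat > sleep_condor_any << 'EOF'
universe = vanilla
Executable = /bin/sleep
Requirements = Memory >= 199 && OpSys == "LINUX" && (Arch != "DUMMY")
Arguments = 60
Queue 5
EOF

# Swap the "any architecture" clause for the 64-bit-only clause.
sed 's/(Arch *!= *"DUMMY" *)/(Arch == "X86_64")/' sleep_condor_any > sleep_condor_64

# Show the edited Requirements line.
grep '^Requirements' sleep_condor_64
```

The rest of the file is left unchanged, so the result can be submitted the same way as the original.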

After you've created the file, you can submit it to the condor system using the command condor_submit followed by the name of your submit description file, in this example's case "sleep_condor":

condor_submit sleep_condor

Your output should look something like this:

[langley@cmswn094 ~]$ condor_submit sleep_condor
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 154.
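If you script your submissions, it can help to capture the cluster number from this output for later use with condor_q or condor_rm. The sketch below parses a saved copy of the sample output shown above rather than calling condor_submit, so it runs anywhere; the variable names are illustrative, and it assumes the "submitted to cluster N." wording shown in this example.

```shell
# Sample text copied from the condor_submit output above.
submit_output='Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 154.'

# Keep only the number that follows "submitted to cluster".
cluster=$(printf '%s\n' "$submit_output" |
  sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p')

echo "cluster=$cluster"
```

In a real script, the sample text would be replaced by the captured output of condor_submit sleep_condor.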

You can see the status of all jobs submitted from the node you are logged on to by using the following command:

condor_q

Your queue should show the processes you just submitted. They may be idle for up to a minute or so, or longer if the system is very busy:

[langley@cmswn094 ~]$ condor_q

-- Submitter: cmswn094.fnal.gov : <131.225.207.241:37285> : cmswn094.fnal.gov
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 154.0   langley  7/27 10:33   0+00:00:00 I  0   0.0  sleep 60
 154.1   langley  7/27 10:33   0+00:00:00 I  0   0.0  sleep 60
 154.2   langley  7/27 10:33   0+00:00:00 I  0   0.0  sleep 60
 154.3   langley  7/27 10:33   0+00:00:00 I  0   0.0  sleep 60
 154.4   langley  7/27 10:33   0+00:00:00 I  0   0.0  sleep 60
5 jobs; 5 idle, 0 running, 0 held
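For a quick tally of job states, the ST column of this listing can be counted with awk. This is a sketch that works on a saved copy of the sample output above; it assumes the column layout shown here, where the state is the sixth whitespace-separated field of each job line.

```shell
# Sample text copied from the condor_q output above.
condor_q_output='-- Submitter: cmswn094.fnal.gov : <131.225.207.241:37285> : cmswn094.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
154.0 langley 7/27 10:33 0+00:00:00 I 0 0.0 sleep 60
154.1 langley 7/27 10:33 0+00:00:00 I 0 0.0 sleep 60
154.2 langley 7/27 10:33 0+00:00:00 I 0 0.0 sleep 60
154.3 langley 7/27 10:33 0+00:00:00 I 0 0.0 sleep 60
154.4 langley 7/27 10:33 0+00:00:00 I 0 0.0 sleep 60
5 jobs; 5 idle, 0 running, 0 held'

# Job lines start with an ID of the form cluster.process; count them by state.
summary=$(printf '%s\n' "$condor_q_output" |
  awk '$1 ~ /^[0-9]+\.[0-9]+$/ { count[$6]++ }
       END { for (state in count) print state, count[state] }')

echo "$summary"
```

For the sample above this reports five jobs in state I (idle), matching the summary line printed by condor_q.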

In condor, each computer keeps its own queue of submitted jobs with its own job numbers, so the job (cluster) number alone is sometimes not enough to uniquely identify a job. If you are still logged into the computer you submitted your job from, the job number by itself will work. If you are on a different computer, you must also specify which computer you submitted the job from.

You can list all the jobs and their status for a specific user from any machine using this command, replacing username with that user's login name:

condor_q -submitter username

If you want to view the entire queue for a machine that you are not logged onto, you can use the following command. It gives you the same information as condor_q (albeit in a different format) without needing to be logged into that particular machine. Say you submitted the job from cmswn094.fnal.gov:

condor_status -submitter cmswn094

This gives all the jobs from all users on the machine in question:

[langley@cmswn094 ~]$ condor_status -submitter cmswn094
Name               Machine      Running  IdleJobs  HeldJobs
langley@fnal.gov   cmswn094.f         5         0         0

                   RunningJobs  IdleJobs  HeldJobs
langley@fnal.gov             5         0         0
Total                        5         0         0

You can view information about all requests and their submitters across the whole system with this command:

condor_status -submitters

To cancel a job, type condor_rm followed by the job number, which in this example is 154:

condor_rm 154

Again, if you are now logged into a different node, you must supply the name of the computer from which you submitted the job. For example, if you submitted the job from cmswn094.fnal.gov:

condor_rm -name cmswn094 154

If you don't remember which machine you submitted the job from, use the condor_q -submitter command from above with your username. It will tell you which machine you used for your requests.
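The two forms of condor_rm above can be wrapped in a small helper so that scripts pick the right one. This is only a sketch: rm_command is a hypothetical function name, and it prints the command instead of running it so you can check it first.

```shell
# Hypothetical helper: print the condor_rm command for a cluster.
# Pass the submit host as a second argument when you are logged into
# a different node than the one the job was submitted from.
rm_command() {
  cluster=$1
  submit_host=${2:-}
  if [ -n "$submit_host" ]; then
    echo "condor_rm -name $submit_host $cluster"
  else
    echo "condor_rm $cluster"
  fi
}

rm_command 154            # same node as the submission
rm_command 154 cmswn094   # different node; name the submit host
```

Once the printed command looks right, run it directly at the prompt.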


The lines of the submit description file are explained below.

universe = vanilla
The universe variable defines an execution environment for your job. In this example we use the vanilla universe, which has the fewest built-in services but also the fewest restrictions. For a complete list of universes and what they do, see the condor user's manual under 2. Users' Manual > 2.4 Road-map for Running Jobs > 2.4.1 Choosing a Condor Universe.

Executable = /bin/sleep
This is the program you want to run. If the program is in the same directory as your batch file, just the name will work, example: yourscript.csh. If it is in a different directory than your batch file then you must give the pathname, example: myscripts/yourscript.csh runs the script yourscript.csh located in the directory myscripts.

Requirements = Memory >= 199 && OpSys == "LINUX" && (Arch != "DUMMY")
Any requirements for the machine chosen to run your program should be here. If you exclude this line, condor will look for a computer with the same architecture and operating system as yours. The requirements here specify a 32-bit or 64-bit machine running a Linux operating system with at least 199 megabytes of memory. For a complete and long list of possible requirement settings, see the condor user's manual.

Should_Transfer_Files = NO
If set to YES, this option transfers input files from your computer to the machine running the job and sends the output files back. If it is set to NO, you must provide input through some other means and retrieve the output yourself.

Output = /uscms_data/d1/${LOGNAME}/sleep_$(Cluster)_$(Process).stdout
This directs the standard output of the program to a file, in other words, everything that would normally be displayed on the screen, so that you can read it after it is finished running. Where you see $(Cluster) condor will substitute the job number, and $(Process) will become the process number, in this case, 0-4.

Error = /uscms_data/d1/${LOGNAME}/sleep_$(Cluster)_$(Process).stderr
This is the same as the Output line, except that it applies to standard error. It is extremely useful for debugging or figuring out what is going wrong (almost always something). Where you see $(Cluster) condor will substitute the job number, and $(Process) will become the process number, in this case, 0-4.

Log = /uscms_data/d1/${LOGNAME}/sleep_$(Cluster)_$(Process).log
The log file contains information about the job in the condor system: the IP address of the computer that is processing the job, the time it starts and finishes, how many attempts were made to start the job, and other such data. Using a log file is recommended. Where you see $(Cluster) condor will substitute the job number, and $(Process) will become the process number, in this case, 0-4.
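To make the $(Cluster) and $(Process) substitution in the Output, Error, and Log lines concrete, the sketch below prints the stdout file names that the example would produce. It assumes cluster number 154 from the sample submission and the five processes created by Queue 5; condor itself does this expansion, the loop only imitates it.

```shell
cluster=154      # cluster number from the example submission
queue_count=5    # matches "Queue 5" in the example

# Build one file name per process, numbered from 0.
names=""
process=0
while [ "$process" -lt "$queue_count" ]; do
  names="${names}sleep_${cluster}_${process}.stdout
"
  process=$((process + 1))
done

printf '%s' "$names"
```

The same pattern applies to the .stderr and .log names, so each of the five processes writes to its own set of three files.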

notify_user = you@FNAL.GOV
Specifies the address the system will automatically email when the job finishes (your email). In the example, the shell should have filled in your email address here. You will receive a separate email for every process in your job that completes.

Arguments = 60
Here you put any command line arguments for your program. If you have none, exclude this line. In this example the program needs one argument for the number of seconds to wait, and this argument tells it to wait for one minute.

+LENGTH="SHORT"
Include this line to have your job forfeited if it runs longer than one hour of real time; without this line, your job is forfeited if it runs longer than one day of real time. It is recommended that you include this line for small jobs. A portion of the production farm is reserved for jobs labeled short; this, combined with the one-hour limit on everyone in front of you, makes the chances of your job running soon much higher than without it. With or without this line, your job is dropped after the time limit only when it would block another user from the queue, so if no one else is trying to use the system, your job will keep running past the time limit.
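The example file earlier on this page does not contain the +LENGTH line, so here is a sketch of adding it. It assumes the line should sit above the Queue statement, since submit commands need to appear before the Queue line they apply to; the file names sleep_condor_long and sleep_condor_short are hypothetical stand-ins.

```shell
# Stand-in copy of the example file, trimmed for brevity.
cat > sleep_condor_long << 'EOF'
universe = vanilla
Executable = /bin/sleep
Arguments = 60
Queue 5
EOF

# Print the +LENGTH line just before the Queue statement, then every original line.
awk '/^[Qq]ueue/ { print "+LENGTH=\"SHORT\"" } { print }' \
  sleep_condor_long > sleep_condor_short

cat sleep_condor_short
```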

Queue 5
This is how many times you want to run the program, without this line it runs only once. The processes will be numbered starting at zero, so in this example they will be: 0, 1, 2, 3, and 4.

2. How do I use CondorG to submit to the production batch farm?

To access the production batch farm you must first have your grid credentials. If you do not have your grid credentials, please follow the procedures outlined on the SRM page.

Once you have your grid credentials, you need to get your grid proxy by running

voms-proxy-init -voms cms

Next you need to modify your job description file as follows:

  • change universe = vanilla to universe = globus
  • add globusscheduler = cmsosgce.fnal.gov/jobmanager

Finally, submit your condor job as normal with condor_submit.
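The two changes to the job description file listed above can be applied with a short awk command. This is a sketch on a trimmed stand-in file; sleep_condor_vanilla and sleep_condor_grid are hypothetical names, and the scheduler address is the one given on this page.

```shell
# Stand-in copy of the vanilla-universe example, trimmed for brevity.
cat > sleep_condor_vanilla << 'EOF'
universe = vanilla
Executable = /bin/sleep
Arguments = 60
Queue 5
EOF

# Replace the universe line and add the globusscheduler line right after it.
awk '/^universe[ ]*=/ {
       print "universe = globus"
       print "globusscheduler = cmsosgce.fnal.gov/jobmanager"
       next
     }
     { print }' sleep_condor_vanilla > sleep_condor_grid

cat sleep_condor_grid
```

The resulting file is then submitted with condor_submit, after the grid proxy has been created.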

3. How do I use condor scratch space to speed up my jobs?

Condor jobs are executed on their own 10 GB partition. You can make use of this space for your job output.

Use the variable
$_CONDOR_SCRATCH_DIR
for your output paths in your job description file to take advantage of this space.

To make reasonable use of it, you also need to add or change the following lines in your job description file:
Should_Transfer_Files = YES
When_To_Transfer_Output = ON_EXIT
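As an illustration of writing job output into the scratch space, here is a sketch of a job wrapper script. The names scratch_job.sh and job_output.txt are hypothetical, and the sketch assumes the _CONDOR_SCRATCH_DIR variable is set in the job's environment when condor runs it; the local test run below sets it by hand to the current directory.

```shell
# Write a small job script that puts its output in the scratch area.
cat > scratch_job.sh << 'EOF'
#!/bin/sh
# Use the condor scratch directory when it is set; otherwise fall back
# to the current directory (for example, in a local test run).
outdir="${_CONDOR_SCRATCH_DIR:-.}"
echo "job finished" > "$outdir/job_output.txt"
EOF
chmod +x scratch_job.sh

# Local test: point the variable at the current directory and run the script.
_CONDOR_SCRATCH_DIR="$PWD" sh scratch_job.sh
cat job_output.txt
```

On the batch farm, the script would be named on the Executable line of the job description file, and the file transfer settings above bring the output back when the job exits.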

Last modified: Wednesday, 16-Apr-2008 14:48:37 CDT