Batch Systems:
The batch system available to users of the UAF is Condor, which lets you submit jobs to the lpc batch farm or the production farm. This page describes how to use it.
1. How do I use Condor to submit to the lpc batch farm?
2. How do I use CondorG to submit to the production batch farm?
3. How do I use condor scratch space to speed up my jobs?
For any information not covered below, see the condor user's manual.
1. How do I use Condor to submit to the lpc batch farm?
The first step in using the condor system is writing the condor submit description file. This file tells the system what you want it to do and how. Below is an example that runs a system program which sleeps for one minute and then quits. If you want to try this out, copy the command and the file contents below and paste them into your terminal to create the file named "sleep_condor". Each line of the file is explained further down this page.
cat > sleep_condor << +EOF
universe = vanilla
Executable = /bin/sleep
Requirements = Memory >= 199 && OpSys == "LINUX" && (Arch != "DUMMY")
Should_Transfer_Files = NO
Output = /uscms_data/d1/${LOGNAME}/sleep_\$(Cluster)_\$(Process).stdout
Error = /uscms_data/d1/${LOGNAME}/sleep_\$(Cluster)_\$(Process).stderr
Log = /uscms_data/d1/${LOGNAME}/sleep_\$(Cluster)_\$(Process).log
notify_user = ${LOGNAME}@FNAL.GOV
Arguments = 60
Queue 5
+EOF
To submit only to 64-bit nodes, change the Arch attribute in the Requirements to (Arch == "X86_64").
To submit to any node (32-bit or 64-bit), set the Arch attribute in the Requirements to (Arch != "DUMMY"), as in the example above.
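As an illustration of how the pieces of the submit description fit together, here is a minimal Python sketch that produces the same file text without going through the shell. The function names requirements and sleep_submit_description are hypothetical helpers, not part of Condor. Because the text is not passed through a shell here-document, the user name is filled in directly and $(Cluster) and $(Process) need no backslashes.

```python
def requirements(min_memory_mb=199, only_64bit=False):
    """Build the Requirements expression used in the sleep example."""
    # (Arch == "X86_64") restricts the job to 64-bit nodes;
    # (Arch != "DUMMY") matches any architecture.
    arch = '(Arch == "X86_64")' if only_64bit else '(Arch != "DUMMY")'
    return f'Memory >= {min_memory_mb} && OpSys == "LINUX" && {arch}'


def sleep_submit_description(user, only_64bit=False, copies=5):
    """Return the text of the sleep_condor file for the given user name."""
    # $(Cluster) and $(Process) are left literal so Condor can substitute them.
    base = f"/uscms_data/d1/{user}/sleep_$(Cluster)_$(Process)"
    lines = [
        "universe = vanilla",
        "Executable = /bin/sleep",
        f"Requirements = {requirements(only_64bit=only_64bit)}",
        "Should_Transfer_Files = NO",
        f"Output = {base}.stdout",
        f"Error = {base}.stderr",
        f"Log = {base}.log",
        f"notify_user = {user}@FNAL.GOV",
        "Arguments = 60",
        f"Queue {copies}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a 64-bit-only variant for the example user "langley".
    print(sleep_submit_description("langley", only_64bit=True), end="")
```

Writing the returned text to a file named sleep_condor gives a file equivalent to the one created by the cat command above.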
After you've created the file, you can submit it to the condor system using the command condor_submit followed by the name of your submit description file, in this example "sleep_condor":
condor_submit sleep_condor
Your output should look something like this:
[langley@cmswn094 ~]$ condor_submit sleep_condor
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 154.
You can see the status of all jobs submitted from the node you are logged on to by using the following command:
condor_q
Your queue should show the processes you just submitted. They may be idle for up to a minute or so, or longer if the system is very busy:
[langley@cmswn094 ~]$ condor_q
-- Submitter: cmswn094.fnal.gov : <131.225.207.241:37285> : cmswn094.fnal.gov
 ID      OWNER     SUBMITTED     RUN_TIME ST PRI SIZE CMD
 154.0   langley   7/27 10:33  0+00:00:00 I  0   0.0  sleep 60
 154.1   langley   7/27 10:33  0+00:00:00 I  0   0.0  sleep 60
 154.2   langley   7/27 10:33  0+00:00:00 I  0   0.0  sleep 60
 154.3   langley   7/27 10:33  0+00:00:00 I  0   0.0  sleep 60
 154.4   langley   7/27 10:33  0+00:00:00 I  0   0.0  sleep 60
5 jobs; 5 idle, 0 running, 0 held
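If you want to check the queue from a script, the summary line at the end of the condor_q output is the easiest part to read. The following is a minimal Python sketch that assumes the summary has exactly the form shown in the example output above; parse_summary is a hypothetical helper name, and other Condor versions may print the line differently.

```python
import re

# Matches a summary such as "5 jobs; 5 idle, 0 running, 0 held".
SUMMARY = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_summary(text):
    """Return the job counts from condor_q output, or None if no summary is found."""
    match = SUMMARY.search(text)
    if match is None:
        return None
    total, idle, running, held = (int(value) for value in match.groups())
    return {"jobs": total, "idle": idle, "running": running, "held": held}


if __name__ == "__main__":
    print(parse_summary("5 jobs; 5 idle, 0 running, 0 held"))
```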
In condor, each computer keeps its own queue of requests, with its own list of jobs and job numbers, so the job (cluster) number alone is sometimes not sufficient to uniquely identify a job. If you are still logged into the computer you submitted your job from, then just using the job number will work, but if you are on a different computer you must also specify which computer you submitted the job from.
You can get a list of all the jobs and their status for a specific user from any machine using this command:
condor_q -submitter <username>
If you want to view the entire queue for a machine that you are not logged onto, you can use the following command. It gives you the same information as condor_q (albeit in a different format) without needing to be logged into that particular machine. Say you submitted the job from cmswn094.fnal.gov:
condor_status -submitter cmswn094
This gives all the jobs from all users on the machine in question:
[langley@cmswn094 ~]$ condor_status -submitter cmswn094
Name                 Machine      Running IdleJobs HeldJobs
langley@fnal.gov     cmswn094.f         5        0        0

                     RunningJobs  IdleJobs  HeldJobs
langley@fnal.gov               5         0         0
Total                          5         0         0
You can view information about all requests and their submitters across the whole system with this command:
condor_status -submitters
To cancel a job, type condor_rm followed by the job number, in this example 154:
condor_rm 154
Again, if you are now logged into a different node, you must supply the name of the computer from which you submitted the job. For example, if you are now logged onto cmswn088.fnal.gov but submitted the job from cmswn094:
condor_rm -name cmswn094 154
If you don't remember what machine you submitted the job from, use the condor_q -submitter command from above. It will tell you what machine you used for your requests.

universe = vanilla
The universe variable defines an execution environment for your job. In this example we use the vanilla universe, which has the fewest built-in services but also the fewest restrictions. For a complete list of universes and what they do, see the condor user's manual under 2. Users' Manual > 2.4 Road-map for Running Jobs > 2.4.1 Choosing a Condor Universe.
Executable = /bin/sleep
This is the program you want to run. If the program is in the same directory as your batch file, just the name will work, for example: yourscript.csh. If it is in a different directory than your batch file, then you must give the pathname, for example: myscripts/yourscript.csh runs the script yourscript.csh located in the directory myscripts.
Requirements = Memory >= 199 && OpSys == "LINUX" && (Arch != "DUMMY")
Any requirements for the machine chosen to run your program should go here. If you exclude this line, condor will look for a computer with the same architecture as yours. The requirements here specify an Intel or 64-bit machine running a Linux operating system with at least 199 megabytes of memory. For a complete and long list of possible requirement settings, see the condor user's manual.
Should_Transfer_Files = NO
If you set this option to YES, condor will take input files from your computer and send output files back. If this option is not activated, you must provide input through some other means and retrieve the output yourself.
Output = /uscms_data/d1/${LOGNAME}/sleep_$(Cluster)_$(Process).stdout
This directs the standard output of the program (everything that would normally be displayed on the screen) to a file, so that you can read it after the job has finished running. Where you see $(Cluster), condor will substitute the job number, and $(Process) will become the process number, in this case 0 through 4.
Error = /uscms_data/d1/${LOGNAME}/sleep_$(Cluster)_$(Process).stderr
This is the same as the Output line, except that it applies to standard error. It is extremely useful for debugging or figuring out what is going wrong (almost always something). Where you see $(Cluster), condor will substitute the job number, and $(Process) will become the process number, in this case 0 through 4.
Log = /uscms_data/d1/${LOGNAME}/sleep_$(Cluster)_$(Process).log
The log file contains information about the job in the condor system: the IP address of the computer that is processing the job, the time it starts and finishes, how many attempts were made to start the job, and other such data. Using a log file is recommended. Where you see $(Cluster), condor will substitute the job number, and $(Process) will become the process number, in this case 0 through 4.
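To make the $(Cluster) and $(Process) substitution described for the Output, Error and Log lines concrete, here is a toy Python sketch that mimics it with plain string replacement. It is only an illustration: expand is a hypothetical helper, condor performs the real substitution itself, and the cluster number 154 and user name langley are taken from the example session earlier on this page.

```python
def expand(template, cluster, process):
    """Mimic condor's substitution of $(Cluster) and $(Process) in a file name."""
    return (template
            .replace("$(Cluster)", str(cluster))
            .replace("$(Process)", str(process)))


template = "/uscms_data/d1/langley/sleep_$(Cluster)_$(Process).stdout"

# "Queue 5" creates processes 0 through 4 within one cluster.
names = [expand(template, 154, process) for process in range(5)]

if __name__ == "__main__":
    for name in names:
        print(name)
```

Each of the five processes gets its own file, so their outputs do not overwrite one another.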
notify_user = you@FNAL.GOV
Specifies whom the system will automatically email when the job finishes (your email address). In the example, the shell should have filled in your email address here. You will receive a separate email for every process in your job that completes.
Arguments = 60
Here you put any command line arguments for your program. If you have none, exclude this line. In this example the program needs one argument, the number of seconds to wait, so this argument tells the program to wait for one minute.
+LENGTH="SHORT"
Include this line to have your job forfeited if it runs longer than one hour of real time. Without this line, your job is forfeited if it runs longer than one day of real time. It is recommended that you include this line for small jobs. A portion of the production farm is reserved for jobs labeled short. This, combined with the one-hour limit on every job in front of you, makes the chances of your job running soon much higher than without it. With or without this line, your job is only dropped after the time limit when another user would be blocked from the queue by you, so if no one else is trying to use the system, your job will keep running past the time limit.
Queue 5
This is how many times you want to run the program. Without a number, it runs only once. The processes are numbered starting at zero, so in this example they will be 0, 1, 2, 3, and 4.
2. How do I use CondorG to submit to the production batch farm?
To access the production batch farm you must first have your grid credentials. If you do not have your grid credentials, please follow the procedures outlined on the SRM page.
Once you have your grid credentials, you need to get your grid proxy by running
voms-proxy-init -voms cms
Next you need to modify your job description file as follows:
- change universe = vanilla to universe = globus
- add globusscheduler = cmsosgce.fnal.gov/jobmanager
Finally, submit your condor job as normal with condor_submit.
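The two changes to the job description file can also be applied mechanically. The following is a minimal Python sketch, assuming the description is available as a text string. to_condorg is a hypothetical helper name, and the scheduler value is the one given in the list above.

```python
import re


def to_condorg(description, scheduler="cmsosgce.fnal.gov/jobmanager"):
    """Return a copy of a submit description adapted for the production farm."""
    grid_lines = ["universe = globus", f"globusscheduler = {scheduler}"]
    output = []
    replaced = False
    for line in description.splitlines():
        # Replace the existing universe setting, whatever its value was.
        if re.match(r"\s*universe\s*=", line, flags=re.IGNORECASE):
            if not replaced:
                output.extend(grid_lines)
                replaced = True
            continue
        output.append(line)
    if not replaced:
        # No universe line was present, so put the settings at the top,
        # ahead of the Queue statement.
        output = grid_lines + output
    return "\n".join(output) + "\n"


if __name__ == "__main__":
    vanilla = "universe = vanilla\nExecutable = /bin/sleep\nArguments = 60\nQueue 5\n"
    print(to_condorg(vanilla), end="")
```

The remaining lines of the description are passed through unchanged. Whether they all apply on the production farm is not checked here.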
3. How do I use condor scratch space to speed up my jobs?
Condor jobs are executed on their own 10 GB partition. You can make use of this space for your job output.
Use the variable $_CONDOR_SCRATCH_DIR for your output paths in your job description file to take advantage of this space.
To make reasonable use of it, you also need to add or change the following lines in your job description file:
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
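Inside the job itself, the scratch location is available through the environment variable _CONDOR_SCRATCH_DIR. Here is a minimal Python sketch of a job that writes its output there. scratch_path and write_result are hypothetical helper names, and the fallback to the system temporary directory is an assumption added so the script can also be tried outside condor.

```python
import os
import tempfile


def scratch_path(filename, env=None):
    """Return a path for filename inside the condor scratch directory."""
    env = os.environ if env is None else env
    # Assumption: fall back to the system temp directory when not run by condor.
    base = env.get("_CONDOR_SCRATCH_DIR") or tempfile.gettempdir()
    return os.path.join(base, filename)


def write_result(text, filename="result.txt", env=None):
    """Write job output into the scratch space and return the file's path."""
    path = scratch_path(filename, env)
    with open(path, "w") as handle:
        handle.write(text)
    return path


if __name__ == "__main__":
    print(write_result("job finished\n"))
```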

Patrick Gartung gartung@fnal.gov