To configure the Glidein Factory you need to create a directory with a set of files. This is done by the command line tools described below. A set of configuration examples is provided, too.
The configuration file is an XML document.
It is composed of two parts:
You definitely need to set the following arguments:
<glidein
glidein_name="your name">
The name of the configuration. It will be used to advertise the entry
points, will be defined as Condor glidein attribute GLIDEIN_Name, and is used also to create the
directory names.
Choose a short name that describes the set of Grid resources it
represents and append a version number (like "fnalcms_1"). Starting with v2.0 of glideinWMS,
you can use the factory reconfig tool
to make changes to the factory configuration. You will only need a new
configuration for the factories during a major upgrade. For more details,
refer to the Glidein Factory management
section.
<glidein><condor_tarballs>
<condor_tarball os="os" arch="arch" base_dir="directory"/>
Where to find the Condor binaries.
You can list as many as you need, but at least one is required.
It is recommended to have one default entry with os="default" arch="default".
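As an illustration, a minimal sketch of a tarball section with a default entry plus one additional set of binaries could look like the following. The base_dir paths and the "rhel5"/"x86_64" labels are placeholders chosen for this example, not values required by glideinWMS.

```xml
<glidein glidein_name="fnalcms_1">
  <condor_tarballs>
    <!-- Default binaries, used by entries that do not select anything else -->
    <condor_tarball os="default" arch="default" base_dir="/opt/condor/default"/>
    <!-- Hypothetical second set of binaries; the os/arch labels are illustrative -->
    <condor_tarball os="rhel5" arch="x86_64" base_dir="/opt/condor/rhel5_x86_64"/>
  </condor_tarballs>
</glidein>
```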
You most probably want to set the following arguments, too:
<glidein><submit
base_dir="directory"/>
Where to create the glidein submit directory. The default is the user
home directory.
<glidein><stage
base_dir="web dir"
web_base_dir="URL"/>
These two define where the Web server directories are located.
The defaults are reasonable, but you may have different needs.
<glidein><monitor
base_dir="web dir"
javascriptRRD_dir="web dir"
flot_dir="web dir"
jquery_dir="web dir" >
The base_dir defines where the monitoring web area is.
The other entries point to where javascriptRRD, Flot and JQuery libraries
are.
<glidein
factory_name="your name">
Changing this value from the name of the machine allows you to move the
factory without disrupting the system.
<glidein><security allow_proxy="frontend[,factory]" pub_key="RSA"/>
Enable the proxy passing between frontend and factory. Define if the
frontend needs to pass a proxy ("frontend"), if it must use the
factory one ("factory"), or if both methods are supported
("frontend,factory").
Some other arguments you might want to set are:
<glidein schedd_name="schedd name[,schedd
name]*">
If you want to use multiple Condor schedds or you don't like the default
name, you definitely need to set this. If you specify more than a single
schedd, the various entries will be equally spread among all the listed
schedds.
Possible values include (but are not limited to) "myschedd@mymachine.mydomain" and "myschedd_g1@mymachine.mydomain,myschedd_g2@mymachine.mydomain,myschedd_g3@mymachine.mydomain".
<glidein loop_delay="seconds"
advertise_delay="nr" >
Defines how active the glidein factory should be.
The glidein factory works in polling mode.
loop_delay defines how much time
should pass between each polling loop, with the collector being updated
every advertise_delay loops.
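To make the relationship between the two values concrete, here is a small sketch. The numbers are illustrative only and are not the defaults: with a 60 second polling loop and an advertise_delay of 5, the collector would be updated roughly every 60 x 5 = 300 seconds.

```xml
<!-- Illustrative values: poll every 60 seconds, advertise every 5th loop -->
<!-- Resulting collector update period: about 60 * 5 = 300 seconds -->
<glidein glidein_name="fnalcms_1" loop_delay="60" advertise_delay="5">
</glidein>
```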
<glidein restart_attempts="nr"
restart_interval="seconds" >
Defines how many times (restart_attempts) the factory will try to restart an entry within restart_interval seconds if the entry crashes.
<glidein><attrs><attr
name="attr name" value="value" const="True" parameter="True" publish="True"
glidein_publish="True" comment="comment"/>
Attributes you want to publish that affect all the factory entries.
To set attributes specific to an entry point, set them in the /glidein/entries/entry/attrs section.
The table below describes the <attr ... /> tag.
Attribute Name | Attribute Description |
name | Name of the attribute |
value | Value of the attribute |
const | Set to True if the attribute is a constant that the VO Frontend cannot change. If set to True, the attribute will be available in the constants file created in the staging area. Used only if parameter is True. |
parameter | Set True if the attribute should be passed as a parameter. Always set this to True unless you know what you are doing. |
publish | If set to True, the attribute will be published in Factory's classad |
glidein_publish | If set to True, the attribute will be available in the condor_startd's classad. Used only if parameter is True. |
job_publish | If set to True, the attribute will be available in the user job's environment. Used only if parameter is True. |
comment | You can specify description of the attribute here. |
type | Type of the attribute. Supported types are 'int', 'string' and 'expr'. Type 'expr' is equivalent to a Condor constant/expression in condor_vars.lst |
<attrs>
  <attr name="VOpilot" value="CMS" publish="True" parameter="True" const="True" glidein_publish="True" comment="Just a test attribute"/>
  <attr name="CondorVersion" value="v6.9.1" publish="True" parameter="True" const="True" glidein_publish="True"/>
</attrs>
<glidein><attrs><attr
name="attr name" value="value" const="False" parameter="True" job_publish="True"
comment="comment"/>
Attributes you want to push to the user jobs.
A list of all the attributes can be found on the dedicated configuration variables
page.
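A hedged sketch of an attribute pushed to the user job environment is shown below. The attribute name VO_SOFTWARE_DIR and its value are hypothetical, and the type attribute is included only because the table above lists it as supported.

```xml
<glidein>
  <attrs>
    <!-- Hypothetical attribute: made visible in the user job's environment -->
    <attr name="VO_SOFTWARE_DIR" value="/opt/vo/sw" type="string"
          const="False" parameter="True" job_publish="True"
          comment="Illustrative software location for user jobs"/>
  </attrs>
</glidein>
```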
The other arguments are for advanced admins only, and are explained in a dedicated section.
Each entry point will have its own root tag:
<glidein><entries><entry
name="entry name">
Specify an easy to remember name.
For each entry point, you definitely need to set the following
arguments:
<glidein><entries><entry
name="entry name"
gatekeeper="gatekeeper">
The identifier of your Grid resource (like "cmsitbsrv01.fnal.gov/jobmanager-condor").
<glidein><entries><entry
name="entry name"
rsl="rsl"> may also be needed
(like '(condorsubmit=(universe
vanilla)(requirements \"(ISMINOSAFS=?=True)\"))').
Please check the Grid site documentation and/or ask the Grid site
administrator.
The current implementation has been tested with Globus v2 Gatekeepers
only, but if you want to test it with different Condor Grid types, please
use <glidein><entries><entry
name="entry name"
gridtype="grid
type"> ("gt2" is the default).
You most probably want to set the following arguments, too:
<glidein><entries><entry
name="entry name"><attrs><attr
name="GLIDEIN_Site" value="value" const="True" parameter="True" publish="True"/>
This defines the glidein attribute GLIDEIN_Site, both for use of the Frontend and
for the use of the job negotiation.
Logically defining a site is useful, so that you can change entry points
while the user jobs still know where they are running. If not
specified, it defaults to the entry point name in the startd ClassAd.
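The following sketch shows two entry points that share one logical site name, so that user jobs see the same GLIDEIN_Site regardless of which gatekeeper was used. The entry names, the site name "FNAL" and the second gatekeeper host are assumptions made for this example.

```xml
<glidein>
  <entries>
    <!-- Both entries advertise the same logical site -->
    <entry name="fnal_ce1" gatekeeper="cmsitbsrv01.fnal.gov/jobmanager-condor">
      <attrs>
        <attr name="GLIDEIN_Site" value="FNAL" const="True" parameter="True" publish="True"/>
      </attrs>
    </entry>
    <!-- Hypothetical second gatekeeper for the same site -->
    <entry name="fnal_ce2" gatekeeper="ce2.example.org/jobmanager-condor">
      <attrs>
        <attr name="GLIDEIN_Site" value="FNAL" const="True" parameter="True" publish="True"/>
      </attrs>
    </entry>
  </entries>
</glidein>
```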
<glidein><entries><entry
name="entry name"
work_dir="WN
dir">
This argument defines where the glidein should run once it is on the worker
node.
Most OSG sites are known to crash if you use your starting directory to
run. For those sites, it is good practice to specify "Condor" if they are running Condor as the
underlying batch system, and "OSG" otherwise.
On EGEE sites, "." is usually fine.
<glidein><entries><entry
name="entry name"
proxy_url="Proxy
URL">
If you have a Web cache you can use, you set it here (like "cmsitbsquid002.fnal.gov:3128").
On OSG resources, you can set it to "OSG",
and the default OSG squid will be used.
If you cannot use any Web cache server, you can skip this argument (the
default is not to use caching).
If defined, the user jobs will be able to use it as "GLIDEIN_Proxy_URL" environment variable.
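Putting the working directory and Web cache settings together, an entry for an OSG site that runs Condor as its batch system might be sketched as follows. The entry name and gatekeeper host are placeholders.

```xml
<glidein>
  <entries>
    <!-- work_dir="Condor": site runs Condor as the underlying batch system -->
    <!-- proxy_url="OSG": use the default OSG squid as the Web cache -->
    <entry name="osg_condor_site"
           gatekeeper="ce.example.org/jobmanager-condor"
           work_dir="Condor"
           proxy_url="OSG">
    </entry>
  </entries>
</glidein>
```

For a site with its own cache, proxy_url would instead hold a host and port such as "cmsitbsquid002.fnal.gov:3128".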
Some other arguments you might want to set are:
<glidein><entries>
<entry name="entry name"><attrs>
<attr name="CONDOR_OS" value="os" type="string" const="True" parameter="True" publish="False"/>
<attr name="CONDOR_ARCH" value="arch" type="string" const="True" parameter="True" publish="False"/>
Select a non-default condor binary.
The entry will default to CONDOR_OS="default" CONDOR_ARCH="default", if not otherwise defined.
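As a sketch of how the selection fits together, the entry below picks a non-default set of binaries by matching the os and arch labels of a condor_tarball element. The "rhel5"/"x86_64" labels, the paths, the entry name and the gatekeeper are assumptions for illustration.

```xml
<glidein>
  <condor_tarballs>
    <condor_tarball os="default" arch="default" base_dir="/opt/condor/default"/>
    <!-- Hypothetical non-default binaries -->
    <condor_tarball os="rhel5" arch="x86_64" base_dir="/opt/condor/rhel5_x86_64"/>
  </condor_tarballs>
  <entries>
    <entry name="site_with_rhel5" gatekeeper="ce.example.org/jobmanager-condor">
      <attrs>
        <!-- These values are meant to match the os/arch of the tarball above -->
        <attr name="CONDOR_OS" value="rhel5" type="string" const="True" parameter="True" publish="False"/>
        <attr name="CONDOR_ARCH" value="x86_64" type="string" const="True" parameter="True" publish="False"/>
      </attrs>
    </entry>
  </entries>
</glidein>
```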
<glidein><entries><entry
name="entry name"><attrs><attr
name="attr name" value="value" const="True" parameter="True" publish="True"
glidein_publish="True" comment="comment"/>
Attributes you want to publish.
These are used by the VO
frontend matchmaking and job matchmaking.
Example attributes are:
<glidein>
  <entries>
    <entry name="myentry">
      <attrs>
        <attr name="HasMySoftware" value="True" publish="True" parameter="True" const="True" glidein_publish="True" comment="My users cannot live without"/>
        <attr name="OS" value="Linux" publish="True" parameter="True" const="True" glidein_publish="True"/>
      </attrs>
    </entry>
  </entries>
</glidein>
<glidein><entries><entry
name="entry name"
schedd_name="schedd
name">
If you have an entry that needs a dedicated schedd, you can set it here
(to something like "myveryspecialschedd@mymachine.mydomain")
<glidein><entries><entry
name="entry name"
enabled="True/False">
You can define an entry point even if you do not plan to use it right
away.
The entry point directory will be created independently of the enabled flag, but will only be used by the
glidein factory if it is set to True. (Defaults to True).
<glidein><entries><entry
name="entry name"
verbosity="std/fast/nodebug">
Specify the verbosity level and termination time in case of validation
errors:
std (default) – reasonable verbosity (including the condor log files) and 20min sleep in case of error (to reduce the damage resulting from broken nodes)
fast – same verbosity as std, but will only wait 2 mins before terminating in case of error (good for debugging)
nodebug – very low verbosity, if you want to save on disk space
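For example, while commissioning a new entry point you might combine the enabled flag with the fast verbosity level, roughly as sketched here. The entry name and gatekeeper are placeholders.

```xml
<glidein>
  <entries>
    <!-- New entry under test: short 2 minute wait on validation errors -->
    <entry name="new_site_test"
           gatekeeper="ce.example.org/jobmanager-condor"
           enabled="True"
           verbosity="fast">
    </entry>
  </entries>
</glidein>
```

Once the entry is known to work, verbosity could be switched back to "std" so that broken nodes are held for the longer 20 minute sleep.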
The other arguments are for advanced admins only, and are explained below.
While the above is enough for setting up a personal glidein pool on the local area network, you will need to do more fine tuning when deploying a larger one. In this section, the various advanced aspects of glidein pools will be presented.
As you may have noticed, all of the glideins are submitted with the same service proxy. While this has the advantage of simplifying the architecture and improving both efficiency and VO control, it does have a few problems:
All glidein scripts, Condor daemons,
and user jobs run under the same Unix UID, so users can interfere
with the glidein tasks, possibly hacking the system.
In addition, when several glideins start on the same node (on multi
processor/core machines), one user job can interfere with another user
job.
The real user is never authenticated against the Grid site authorization infrastructure. As a result, the sites can neither enforce their policies nor analyze the usage of their resources; they see only glideins. This makes them very unfriendly toward the glidein based WMS.
To solve this problem, some Grid sites are deploying gLExec on the worker nodes. gLExec is a service that, taken
the user proxy, and
the desired command
More details about scripts in general can be found in the "custom code" section.
to the condor_config.
As of version 7.1.3 of Condor, a new, better glexec operation mode is supported; in the old operation mode, condor_startd invoked condor_starter through glexec. The result was that condor_starter was running under the same UID as the user job, leaving it vulnerable to attack from a malicious user. The new operating mode solves this by having condor_starter run the user jobs via glexec; this adds a little more overhead to handle the user jobs, but makes the system much more secure.
Note that you still need to set GLEXEC_BIN, too.
Warning: Use it only if you use Condor 7.1.3 or later, as it will not
work on any older Condor version!
Condor daemons need two-way communication in order to work properly. This clashes with the network policies of most Grid sites, which have worker nodes in private networks or implement a restrictive firewall.
Condor provides two mechanisms to address this:
where:
NONE
Do not use GCB (a good way to selectively disable it)
RANDOM
Randomly distributes between the listed GCBs
ROUNDROBIN
(or RR)
Round robin between them, based on the job submission number.
SEQUENTIAL
(or SEQ)
Keep the order. Essentially always tries the first one first (the others
will be used only if the first one fails)
GCBLOAD
Order by GCB load. All GCBs must support the freesockets query and you must upload the gcb_broker_query binary, too. See below.
Please be aware that the above will configure the glideins only; you still
need to properly configure the Collector and the submit machines.
and you are done. Just make sure you follow the suggested scalability guidelines described in the Condor manual.
As mentioned in the startup page,
the glidein pool must be properly configured to protect it from hackers and
malicious users. The same page also describes what needs to be done on the
collector machine.
The glidein itself can also be configured. The default configuration works
fine for most users, but you may need to change it.
The values are set using the <attr
/> option, and the default values are:
SEC_DEFAULT_ENCRYPTION=OPTIONAL
SEC_DEFAULT_INTEGRITY=REQUIRED
DELEGATE_JOB_GSI_CREDENTIALS=False
As of Condor version 7.1.3, Condor also supports a more efficient authentication mechanism between the condor_schedd/condor_shadow and condor_startd/condor_starter. This method uses the match ClaimId as a shared password for authentication between these daemons. Since using a shared secret is much cheaper than using GSI authentication, this should be used whenever it is feasible.
To enable this option, you need to set an attribute using the <attr /> option:
USE_MATCH_AUTH=True
Please be aware that this will configure the glideins only; you still need to
properly configure the Collector machine. See Condor
documentation for more details.
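A minimal sketch of enabling match authentication through the factory configuration is given below. Only the attribute name and value come from the text above; the const, parameter, publish and type settings are assumptions based on the attribute table and should be checked against your installation.

```xml
<glidein>
  <attrs>
    <!-- Flags other than name/value are illustrative assumptions -->
    <attr name="USE_MATCH_AUTH" value="True" type="string"
          const="True" parameter="True" publish="False"
          comment="Use the match ClaimId as a shared secret"/>
  </attrs>
</glidein>
```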
By default, Condor uses only one Collector for the glidein (user) pool. However, if the
load becomes too high, you can configure multiple collectors in a chain.
You will need a master and a set of slave collectors. The slave collectors
forward the startd ads to the master collector.
The negotiator and the schedds will talk to the master collector, while the
startds will talk to one of the slave ones.
To set up a slave collector in the glidein
(user) pool, one way is to set the following environment variables before
starting up the condor_master:
COLH=`condor_config_val COLLECTOR_HOST`
LD=`condor_config_val LOCAL_DIR`
export _CONDOR_COLLECTOR_HOST=$COLH:
export _CONDOR_MASTER_NAME=collector_
export _CONDOR_DAEMON_LIST="MASTER, COLLECTOR"
export _CONDOR_LOCAL_DIR=$LD/$_CONDOR_MASTER_NAME
export _CONDOR_LOCK=$_CONDOR_LOCAL_DIR/lock
# Forward all the traffic to the main collector
export _CONDOR_CONDOR_VIEW_HOST=$COLH:9618
As with any Condor pool, you may need to set the startd start
and rank
conditions.
For a glidein, you can set this with the <attr /> options:
GLIDEIN_Start=expression
GLIDEIN_Rank=expression
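As a hedged illustration, the two expressions could be set as factory attributes like this. RequestsBigMemory is a hypothetical job attribute invented for the example, the flag settings are assumptions, and the type attribute is deliberately left out because this document does not say which type these two attributes expect.

```xml
<glidein>
  <attrs>
    <!-- Only start jobs that do not carry the (hypothetical) big memory flag -->
    <attr name="GLIDEIN_Start" value="(TARGET.RequestsBigMemory =!= True)"
          const="True" parameter="True" publish="False"/>
    <!-- Prefer jobs with a higher job priority -->
    <attr name="GLIDEIN_Rank" value="TARGET.JobPrio"
          const="True" parameter="True" publish="False"/>
  </attrs>
</glidein>
```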
The whole concept of gliding into Grid resources is based on the idea that
you are getting those resources on a temporary basis. This implies
that you need to leave the slot as soon as possible, else your jobs will
simply be killed by the annoyed Grid administrators.
On the other hand, submitting new glideins is not cost free, so you want to
keep the resource for at least some period of time.
The glideins have two mechanisms to regulate this:
After a specified amount of time, the
glidein will enter the RETIRING
state. This means, it will wait for the current job to finish (or
kill it if it does not end within a configurable timeout) and exit
immediately afterwards. This obviously implies that no new jobs will
start after it entered that state.
The two timeouts can be set with the <attr /> options:
GLIDEIN_Retire_Time=nr_of_seconds
GLIDEIN_Job_Max_Time=nr_of_seconds
The two default to 2 and 100 hours.
If a glidein is not claimed within a
configurable timeout, the glidein will exit.
The timeout can be set with:
GLIDEIN_Max_Idle=nr_of_seconds
The default is 20 minutes.
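The three timeouts could be set together as sketched below. The values are illustrative rather than the defaults, the flag settings are assumptions, and all values are given in seconds: 4 hours is 4 x 3600 = 14400, 24 hours is 24 x 3600 = 86400, and 10 minutes is 10 x 60 = 600.

```xml
<glidein>
  <attrs>
    <!-- Enter the RETIRING state after 4 hours (14400 seconds) -->
    <attr name="GLIDEIN_Retire_Time" value="14400" type="int"
          const="True" parameter="True" publish="False"/>
    <!-- Give a running job at most 24 hours (86400 seconds) to finish -->
    <attr name="GLIDEIN_Job_Max_Time" value="86400" type="int"
          const="True" parameter="True" publish="False"/>
    <!-- Exit if not claimed within 10 minutes (600 seconds) -->
    <attr name="GLIDEIN_Max_Idle" value="600" type="int"
          const="True" parameter="True" publish="False"/>
  </attrs>
</glidein>
```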
Since v1_4_1, the pseudo-interactive monitoring uses a dedicated startd in the glideins for monitoring purposes. This allows for monitoring even when the job starter enters the “Retiring” activity.
The side effect is that the cross-VM statistics are no longer available, and the slot names are also different.
To enable the old mode, use:
While the provided code should cover most of the general purpose use cases, some administrators may have additional needs. For these cases, the glidein creation command adds the following options:
<glidein>
[entries><entry>]
<files>
<file absfname="script
name" executable="True" comment="comment"/>
Path to the custom script.
The script will be copied in the Web-accessible area, and when a glidein
starts, the glidein startup script will pull it and execute it. If any
parameters are needed, they can be specified using <attr
/>, or stored in a file (see below).
For more detailed information, see the page dedicated to writing custom scripts.
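A sketch of a custom script registered for all entries, together with one parameter passed through an attribute, might look like the following. The script path and the MY_CHECK_MIN_DISK_GB attribute are hypothetical.

```xml
<glidein>
  <attrs>
    <!-- Hypothetical parameter that the custom script is expected to read -->
    <attr name="MY_CHECK_MIN_DISK_GB" value="10" type="int"
          const="True" parameter="True" publish="False"/>
  </attrs>
  <files>
    <!-- Copied to the Web-accessible area and executed by the glidein startup script -->
    <file absfname="/home/factory/scripts/check_disk.sh" executable="True"
          comment="Validate free disk space on the worker node"/>
  </files>
</glidein>
```

To restrict the script to a single entry point, the same files section would be placed inside that entry element instead of directly under glidein.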
<glidein>
[entries><entry>]
<files>
<file absfname="script
name" wrapper="True"
comment="comment"/>
Path to the wrapper custom script.
The script will be copied in the Web-accessible area, and will be sourced
just before a user job starts; i.e. it will become part of the user
job wrapper.
<glidein>
[entries><entry>]
<files>
<file absfname="local file
name" relfname="target file
name" const="Bool" executable="False" comment="comment"/>
Path to the config file.
The file will be copied in the Web-accessible area, and pulled by the
glidein startup script when a glidein starts. It can be then used by any
script (see above).
Please be cautious in using the const flag; if set to
False, the content of the file will not be verified by the glidein
startup script and could be tampered with in transit by a malicious user. So
never put sensitive data (like the switch to disable security checks) in
a changeable file.
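For instance, a configuration file that must not change in transit could be declared as sketched here. The local path and the target file name are placeholders.

```xml
<glidein>
  <files>
    <!-- const="True": the glidein startup script verifies the file content -->
    <file absfname="/home/factory/conf/site_settings.cfg"
          relfname="site_settings.cfg"
          const="True" executable="False"
          comment="Settings read by the custom scripts"/>
  </files>
</glidein>
```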
<glidein>
[entries><entry>]
<files>
<file absfname="local file
name" untar="True" comment="comment">
<untar_options cond_attr="conf_sw"
dir="dir
name" absdir_outattr="attr
name">
Sometimes it is useful to transfer a whole set of files, or even
directories, and that is much easier to accomplish by means of a
tar-ball. A subsystem is the glidein way to describe a compressed tarball
that is delivered to the worker nodes, untarred in a separate directory
and advertised to the other scripts.
where:
absfname
Path to the custom tarball. (like "/tmp/mytar_v12.5.tgz")
conf_sw
Name of a configuration switch. (like "ENABLE_KRB5")
The tarball will be unpacked only if that parameter is set to 1.
Use the <attr /> switch to define the default value.
A special name TRUE can be used to always untar it.
dir
Name of the subdirectory to untar it in. (like "krb5")
absdir_outattr
Name of a variable. (like "KRB5_SUBSYS_DIR")
The variable will be set to the absolute path of the directory where
the tarball was unpacked, if and only if the unpacking actually
happened. It will not be define
Please note that files and subsystems will be downloaded before the scripts, and that the user provided scripts will be executed in the specified order, and before the Condor daemons are started up.
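Using the example values mentioned in the option descriptions above, a subsystem definition could be sketched as follows. The flag settings on the ENABLE_KRB5 attribute are assumptions.

```xml
<glidein>
  <attrs>
    <!-- Default value of the switch: 1 means the tarball is unpacked -->
    <attr name="ENABLE_KRB5" value="1" type="int"
          const="True" parameter="True" publish="False"/>
  </attrs>
  <files>
    <file absfname="/tmp/mytar_v12.5.tgz" untar="True" comment="Kerberos subsystem">
      <!-- Unpack into krb5/ and export the absolute path as KRB5_SUBSYS_DIR -->
      <untar_options cond_attr="ENABLE_KRB5" dir="krb5" absdir_outattr="KRB5_SUBSYS_DIR"/>
    </file>
  </files>
</glidein>
```

A custom script could then test whether KRB5_SUBSYS_DIR is defined to find out if the tarball was actually unpacked on that worker node.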
Repository | CVSROOT: cvsuser@cdcvs.fnal.gov:/cvs/cd, Package(s): glideinWMS/creation |
Author(s) | Since Aug. 14th - Igor Sfiligoi (Fermilab Computing Division) |