ARC CE and Condor using Puppet
We have started testing Condor and ARC CE with the intention of moving away from Torque. Almost one third of the cluster has been moved to Condor, and we are quite satisfied with Condor as a batch system. The Condor setup was fairly easy, but configuring ARC CE was a bit challenging. I believe the new version of ARC CE has fixed most of the issues I faced. Andrew Lahiff was of great help in troubleshooting our problems. Our setup consists of:
1. CE: configured as ARC CE and Condor submit host; runs the Condor SCHEDD process.
2. Central manager: Condor server; runs the Condor COLLECTOR and NEGOTIATOR processes.
3. WNs: run the Condor STARTD process; the emi-wn and glexec metapackages are also installed.
The CE, the central manager and the Condor part of the WNs were configured entirely with Puppet. I still have to run yaim on the WNs to configure emi-wn and glexec. I used Puppet modules from https://github.com/HEP-puppet, which were initially written by Luke Kreczko from Bristol. We use Hiera to pass parameters, but most of the Puppet modules work without Hiera as well. I do not intend to go into the details of Condor or ARC CE here, but rather into the use of Puppet modules to install and configure them.
Condor :
It was a pleasing experience to configure Condor with Puppet. Git clone https://github.com/HEP-Puppet/htcondor.git into the module directory on the Puppet server, then add
include htcondor
on the CE, central manager and WNs. Hiera then tells the module which service has to be configured on a particular machine.
# Condor
htcondor::managers:
  - 't2condor01.physics.ox.ac.uk'
htcondor::computing_elements:
  - 't2arc01.physics.ox.ac.uk'
htcondor::worker_nodes:
  - 't2wn*.physics.ox.ac.uk'
htcondor::uid_domain: 'physics.ox.ac.uk'
htcondor::collector_name: 'SOUTHGRID_OX'
htcondor::pool_password: 'puppet:///site_files/grid/condor_pool_password'
This configures a basic Condor cluster. There are no user accounts at this stage, so a test user account can be created on all three machines to run basic Condor test jobs. See the HTCondor manual for details.
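For sites that do not use Hiera, the same Condor settings could probably be passed as a resource-style class declaration instead. The following is only a sketch: the parameter names are inferred from the Hiera keys above (htcondor::<name>), so they should be checked against the module's init.pp before use.

```puppet
# Sketch only: the Hiera data above expressed as a class declaration.
# Parameter names are assumed to match the Hiera keys (htcondor::<name>).
# Use this INSTEAD of 'include htcondor' on the node, not in addition to it,
# otherwise Puppet reports a duplicate class declaration.
class { 'htcondor':
  managers           => ['t2condor01.physics.ox.ac.uk'],
  computing_elements => ['t2arc01.physics.ox.ac.uk'],
  worker_nodes       => ['t2wn*.physics.ox.ac.uk'],
  uid_domain         => 'physics.ox.ac.uk',
  collector_name     => 'SOUTHGRID_OX',
  pool_password      => 'puppet:///site_files/grid/condor_pool_password',
}
```

The Hiera route keeps site data out of the manifests, which is why it is used here; the declaration form is mainly useful for a quick test on a single node.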
Setting up user accounts :
I used this module to create user accounts only on the central manager and the CE. Since I have to run yaim on the WNs anyway to set up emi-wn and glexec, the user accounts on the WNs are created through yaim. This Puppet module can parse a gLite-style users.conf to create user accounts, or a range of IDs can be passed to the module.
Setting up voms server :
The voms module is used to set up the VOMS client on the central manager and the CE. One way to use this module is to pass the name of each VO separately, as described in the README file of the module:
class { 'voms::atlas': }
I have used a small wrapper class so that all VOs can be passed as an array:
include setup_grid_accounts
Then pass the names of the VOs through Hiera:
setup_grid_accounts::vo_list:
  - 'alice'
  - 'atlas'
  - 'cdf'
  - 'cms'
  - 'dteam'
  - 'dzero'
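The wrapper class itself is not shown in this post. A minimal sketch of what such a wrapper might look like is given below. It assumes that the puppetlabs-stdlib module is installed (for the prefix function) and that the voms module provides a class named voms::<vo> for every VO in the list; both are assumptions, not something taken from our actual manifests.

```puppet
# Hypothetical wrapper class, for illustration only.
# With automatic Hiera lookup, the key setup_grid_accounts::vo_list
# is bound to the $vo_list parameter.
class setup_grid_accounts (
  $vo_list = [],
) {
  # Turn ['alice', 'atlas'] into ['voms::alice', 'voms::atlas'].
  # prefix() comes from puppetlabs-stdlib (assumed to be available).
  $voms_classes = prefix($vo_list, 'voms::')

  # include accepts an array of class names.
  include $voms_classes
}
```

Adding or removing a VO then only needs a change in the Hiera list, provided a matching voms::<vo> class exists in the module.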
ARC CE :
Add
include arc_ce
on the CE and then pass the configuration parameters from Hiera. The module has a very long list of configurable parameters, and most of the default values work fine. Since most values are passed through Hiera, the ARC Hiera file is quite long; here are a few examples:
arc_ce::infosys_registration:
  clustertouk1:
    targethostname: 'index1.gridpp.rl.ac.uk'
    targetport: '2135'
    targetsuffix: 'Mds-Vo-Name=UK,o=grid'
    regperiod: '120'
arc_ce::queues:
  gridAMD:
    default_memory: '2048'
    cluster_cpudistribution:
      - '1cpu:4'
    cluster_description:
      OSFamily: 'linux'
      OSName: 'ScientificSL'
      OSVersion: '6.5'
      OSVersionName: 'Carbon'
      CPUVendor: 'GenuineIntel'
      CPUClockSpeed: '2334'
      CPUModel: 'xeon'
      NodeMemory: '2048'
      totalcpus: '168'
This almost sets up a Condor cluster with ARC CE. There are a few bits in the ARC and Puppet modules that exist only as workarounds for things that have already been fixed upstream; they need some testing and clean-up.
The WNs need a small runtime environment setting specific to ARC. When a job arrives at a WN, it looks in the /etc/arc/runtime/ directory for ENV settings.
Our runtime tree looks like this:
├── APPS
│   └── HEP
│       └── ATLAS-SITE-LCG
└── ENV
    ├── GLITE
    └── PROXY
These can be just empty files. SAM-Nagios doesn't submit jobs if the ARC CE is not publishing the GLITE env.
I may have missed a few things, so please feel free to point them out.
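Since the rest of the setup is done with Puppet, the runtime tree above could also be created with plain file resources. This is only a sketch: the class name arc_runtime_env is made up for illustration, and it assumes that no other module on the WNs already manages /etc/arc.

```puppet
# Hypothetical class, for illustration only: creates the empty ARC
# runtime environment files shown in the tree above.
class arc_runtime_env {
  $rte_root = '/etc/arc/runtime'

  # Parent directories. Puppet does not create missing parents by itself,
  # so each level is listed. Drop '/etc/arc' here if another module
  # already declares it, to avoid a duplicate resource.
  file { [
    '/etc/arc',
    $rte_root,
    "${rte_root}/APPS",
    "${rte_root}/APPS/HEP",
    "${rte_root}/ENV",
  ]:
    ensure => directory,
  }

  # The runtime environments themselves; empty files are enough.
  # 'ensure => file' without content leaves an existing file untouched.
  file { [
    "${rte_root}/APPS/HEP/ATLAS-SITE-LCG",
    "${rte_root}/ENV/GLITE",
    "${rte_root}/ENV/PROXY",
  ]:
    ensure => file,
  }
}
```

Puppet adds the dependency between a file and its parent directory automatically, so no explicit ordering is needed in this sketch.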