Friday, April 24, 2015

Simple CVMFS puppet Module

Oxford was one of the first sites to test CVMFS and also one of the first to use the CERN CVMFS Puppet module. Initially the installation of CVMFS was not well documented, so the CERN module was very helpful in installing and configuring CVMFS.
Installation became easier and clearer with the newer versions of CVMFS. One of my ops actions was to set up the GridPP multi-VO CVMFS repository with the CERN CVMFS module. We realised that it would be easier to write a trimmed-down version of the CERN module than to use it directly. The result is the cvmfs_simple module, which is available on GitHub.

'include cvmfs_simple' will set up the LHC repos and the GridPP repo.

The only mandatory parameter is

cvmfs_simple::config::cvmfs_http_proxy : 'squid-server'

It is also possible to add a local CVMFS repository. Extra repos can be configured by passing values from Hiera:

cvmfs_simple::extra::repo: ['gridpp', 'oxford']

Oxford uses a local CVMFS repo to distribute software to local users. oxford.pp can be used as a template for setting up a new local CVMFS repo.
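For illustration, a local repo class modelled on oxford.pp might look roughly like this. Everything below (the class name, the cvmfs_simple::repo defined type, the repository name, URL and key path) is an assumed sketch, not the actual module code:

```puppet
# Hypothetical sketch only: names, type and paths are illustrative,
# not taken from the real cvmfs_simple module.
class cvmfs_simple::mysite {
  cvmfs_simple::repo { 'myrepo.example.org':
    server_url => 'http://cvmfs-stratum1.example.org/cvmfs/myrepo.example.org',
    public_key => '/etc/cvmfs/keys/myrepo.example.org.pub',
  }
}
```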

cvmfs_simple doesn't support all use cases, and it expects that everyone is using Hiera ;). Please feel free to adapt it for your use case.

Monday, October 13, 2014

Nagios Monitoring for Non-LHC VOs

A brief description of the monitoring framework before coming to the actual topic of non-LHC VO monitoring.
Service Availability Monitoring (SAM) is a framework for monitoring grid sites remotely. It consists of many components performing various functions. It can be broadly divided into:
'What to Monitor', or topology aggregation: collection of service endpoints and metadata from different sources such as GOCDB, BDII and VOMS. A custom topological source (a VO feed) can also be used.
Profile management: mapping of services to the tests to be performed. This is provided by the POEM (Profile Management) database, which offers a web-based interface for grouping various metrics into profiles.
Monitoring: Nagios is used as the monitoring engine. It is automatically configured from the information provided by the topology aggregator and POEM.
The SAM software was developed under the EGEE project at CERN and is now maintained by EGI.
It is mandatory for grid sites to pass the ops VO functional tests to be part of WLCG. Every NGI maintains a regional SAM Nagios, and results from the regional SAM Nagios also go to the central MyEGI, which is used for reliability/availability calculations.
The UK regional Nagios is maintained at Oxford,
with a backup instance at Lancaster.

There was no centralised monitoring of non-LHC VOs for a long time, which contributed to a bad user experience, as it was difficult to tell whether a site was broken or the problem was at the user's end. It was decided to host a multi-VO Nagios at Oxford, as we had experience with the WLCG Nagios.
It is currently monitoring five VOs.

Sites can look at only the tests associated with their own site.
VO managers may be interested in seeing only the tests associated with a particular VO.

We are using the VO-feed mechanism to aggregate site metadata and endpoint information. Every VO has a VO feed available on a web server. Currently we are maintaining this VO feed.

The VO feed provides the list of services to be monitored. I generate this VO feed with a script.

Jobs are submitted using a proxy generated from a robot certificate assigned to Kashif Mohammad. These jobs are like normal grid user jobs and test things like the GCC version and the CA version. Jobs are submitted every eight hours, and this interval is configurable. We are monitoring CREAM CE, ARC CE and SE only. Services like BDII, WMS etc. are already monitored by the regional Nagios, so there was no need for duplication.

For more information, these links can be consulted

Tuesday, May 13, 2014

Configuring ARC CE and Condor with puppet


We have started testing Condor and ARC CE with the intention of moving away from Torque. Almost one third of the cluster has been moved to Condor, and we are quite satisfied with it as a batch system. The Condor setup was fairly easy, but configuring the ARC CE was a bit challenging; I believe newer versions of ARC CE have fixed most of the issues I faced. Andrew Lahiff was a great help in troubleshooting our problems. Our setup consists of:
1. CE: configured as the ARC CE and Condor submit host; runs the Condor SCHEDD process.
2. Central manager: the Condor server; runs the Condor COLLECTOR and NEGOTIATOR processes.
3. WNs: run the Condor STARTD process; the emi-wn and glexec metapackages are also installed.
The CE, central manager and the Condor part of the WNs were completely configured with Puppet. I had to run YAIM on the WNs to configure emi-wn and glexec.
I used Puppet modules initially written by Luke Kreczko from Bristol. We are using Hiera to pass parameters, but most of the modules work without Hiera as well. I am not intending to go into the details of Condor or ARC CE, but rather the use of Puppet modules to install and configure them.

Condor :
It was a pleasant experience to configure Condor with Puppet.
Git clone the module into the module directory on the Puppet server, then
     include htcondor
on the CE, central manager and WNs; Hiera then determines which services are configured on each machine.
# Condor
- ''
- ''
- 't2wn*'

htcondor::uid_domain: ''
htcondor::collector_name: 'SOUTHGRID_OX'
htcondor::pool_password: 'puppet:///site_files/grid/condor_pool_password'

This configures a basic Condor cluster. There are no user accounts at this stage, so a test user account can be created on all three machines and basic Condor jobs can be tested. The HTCondor manual is here.
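For that test, a minimal vanilla-universe submit file is enough; run it with condor_submit as the test user and check the outcome with condor_q or condor_history:

```
# test.sub: minimal sanity-check job for the new pool
universe   = vanilla
executable = /bin/hostname
output     = test.$(Cluster).out
error      = test.$(Cluster).err
log        = test.$(Cluster).log
queue
```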

Setting up user accounts :
I used this module to create user accounts only on the central manager and the CE. Since I had to run YAIM on the WNs to set up emi-wn and glexec, the user accounts on the WNs were created through YAIM.
This Puppet module can parse a gLite-style users.conf to create user accounts, or a range of IDs can be passed to the module.
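The Hiera data for that would look something like the following; the key names here are purely illustrative, so check the module's README for the real parameter names:

```yaml
# Illustrative keys only: the real names depend on the accounts module in use.
accounts::users_conf_path: '/etc/puppet/files/users.conf'   # glite-style users.conf
# ...or, alternatively, an id range:
accounts::uid_range_start: 50000
accounts::uid_range_end: 50199
```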

Setting up voms clients :
This module is used to set up the VOMS client configuration on the central manager and the CE. One way to use it is to pass the name of each VO separately, as described in the README file of the module:
     class { 'voms::atlas': }
I have used a small wrapper class to pass all the VOs as an array:
     include setup_grid_accounts
Then pass the names of the VOs through Hiera:
setup_grid_accounts::vo_list:
    - 'alice'
    - 'atlas'
    - 'cdf'
    - 'cms'
    - 'dteam'
    - 'dzero'
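The wrapper itself just iterates over that array and declares the corresponding voms class for each VO. Sketched here in current Puppet syntax; the original wrapper may have used a different idiom:

```puppet
# Sketch of the wrapper class: declares voms::<vo> for every VO in the Hiera list.
# Assumes the voms module provides one class per VO (voms::atlas, voms::cms, ...).
class setup_grid_accounts (
  Array[String] $vo_list = [],
) {
  $vo_list.each |String $vo| {
    class { "voms::${vo}": }
  }
}
```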

ARC CE :
include arc_ce on the CE and then pass configuration parameters from Hiera. The module has a very long list of configurable parameters, and most of the default values work OK. Since most of the values are passed through Hiera, the ARC Hiera file is quite long; here are a few examples:
    targethostname: ''
    targetport: '2135'
    targetsuffix: 'Mds-Vo-Name=UK,o=grid'
    regperiod: '120'

    default_memory: '2048'
      - '1cpu:4'
    OSFamily: 'linux'
    OSName: 'ScientificSL'
    OSVersion: '6.5'
    OSVersionName: 'Carbon'
    CPUVendor: 'GenuineIntel'
    CPUClockSpeed: '2334'
    CPUModel: 'xeon'
    NodeMemory: '2048'
    totalcpus: '168'

This almost completes the setup of a Condor cluster with an ARC CE. There are a few bits in the ARC and Puppet modules that exist as workarounds for issues already fixed upstream; these need some testing and clean-up.

WNs need a small runtime-environment setting specific to ARC. When a job arrives at a WN, ARC looks in the /etc/arc/runtime/ directory for ENV settings.
Our runtime tree looks like this:
├── APPS
│   └── HEP
│       └── ATLAS-SITE-LCG
└── ENV
    ├── GLITE
    └── PROXY
These can just be empty files. SAM Nagios doesn't submit jobs if the ARC CE is not publishing the GLITE runtime environment.
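Since everything else is puppetised, the tree of empty files above can be managed with plain file resources, along these lines:

```puppet
# Create the ARC runtime-environment tree shown above:
# directories first, then the (empty) runtime files.
file { ['/etc/arc/runtime',
        '/etc/arc/runtime/APPS',
        '/etc/arc/runtime/APPS/HEP',
        '/etc/arc/runtime/ENV']:
  ensure => directory,
}

file { ['/etc/arc/runtime/APPS/HEP/ATLAS-SITE-LCG',
        '/etc/arc/runtime/ENV/GLITE',
        '/etc/arc/runtime/ENV/PROXY']:
  ensure => file,
}
```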

I may have missed a few things, so please feel free to point them out.



Wednesday, May 07, 2014

Configuring CVMFS for smaller VOs

We have just configured CVMFS for t2k, hone, mice and ilc, after sitting on the request for a long time. The main reason for the delay was the assumption that we would need to change the CVMFS Puppet module to accommodate non-LHC VOs. It turned out to be quite straightforward, with little effort required.
We are using the CERN CVMFS module; there was an update a month ago, so it is worth keeping it up to date.

We use Hiera to pass parameters to the module; our Hiera bit for CVMFS:
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';;'

One important bit is the name of the CVMFS repository, e.g. instead of

Another slight hitch is the distribution of public keys for the various CVMFS repositories. Installing CVMFS also fetches the cvmfs-keys-*.noarch RPM, which puts all the keys for the CERN-based repositories into /etc/cvmfs/keys/.

I had to copy the public keys for and to /etc/cvmfs/keys. They can be fetched from the repository
wget -O
or copied from

We distributed the keys through Puppet, but outside the CVMFS module.
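One way to do that is a file resource per key, with the key file held on the Puppet server. The repository name below is a placeholder, not a real repository:

```puppet
# Distribute a repository public key outside the CVMFS module.
# 'myrepo.example.org' is an illustrative placeholder.
file { '/etc/cvmfs/keys/myrepo.example.org.pub':
  ensure => file,
  owner  => 'root',
  group  => 'root',
  mode   => '0644',
  source => 'puppet:///site_files/cvmfs/myrepo.example.org.pub',
}
```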
It would be great if someone could convince CERN to include the public keys of other repositories in the cvmfs-keys-* RPM. I am sure there are not going to be many CVMFS stratum 0s.

The last part of the configuration is to change SW_DIR in site-info.def or in the vo.d directory.
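For a VO served from CVMFS, the entry follows the usual YAIM convention (VO name upper-cased, dots replaced by underscores); the repository path below is an example, so substitute your own:

```
# In site-info.def, or in the VO's file under vo.d/ (example value):
VO_T2K_ORG_SW_DIR=/cvmfs/t2k.gridpp.ac.uk
```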

The WNs need YAIM re-run to configure SW_DIR in /etc/profile.d/. You can also edit the file manually and distribute it through your favourite configuration management system.

Thursday, January 23, 2014

A dramatic effect on Atlas jobs when xrootd dies

This week, for the first time at our site, the xrootd server process on our DPM SE died.

The ganglia plot shows a dramatic fall-off in load as all the jobs started to fail to access the data. The number of jobs running in the batch system remained high, so pbswebmon did not alert us, although Kashif had noticed on Tuesday evening that the jobs were very inefficient, which in hindsight was the giveaway that something was amiss. We received a ticket from Atlas, Ewan restarted the daemon, and everything recovered.

Friday, March 30, 2012

Should we Hyperthread?

Following the recent discussion on hyperthreading on the TB-Support mailing list, and having several sets of nodes with hyperthreading and three or four gigabytes of memory per core, we decided to run some tests to see whether opening some job slots on the virtual cores would increase our throughput.

The first test was to benchmark the nodes with hyperthreading enabled, out to the full number of cores. We have three sets of nodes with hyperthreading capability, with E5520, X5650 and E5645 CPUs. For each type we ran one to n concurrent instances of the HEPSPEC benchmarking tests, where n is the number of real plus virtual cores.

Figure (1): Total Node HEPSPEC by number of concurrent tests

These are shown in Figure (1), which clearly shows a nearly linear rise as the tests run on the real CPUs, then a flattening off as more of the virtual cores are used, becoming almost completely flat, or even dropping again, once all the virtual cores are in use. However, it does show a clear increase in the total HEPSPEC rating of the node when using half of the virtual cores. That should mean a real gain in output from enabling jobs on these virtual cores, as long as real work scales like the HEPSPEC tests and we don't run into network, memory or disk I/O bottlenecks.

Armed with this information we decided to check the real world performance of the nodes with E5520 and X5650 CPUs with jobs running on half the virtual CPUs.

To do this we took half of each set of nodes, increased the number of job slots by 50% (8 to 12 for the E5520s and 12 to 18 for the X5650s; the E5645 nodes are still in testing after delivery and not yet ready for production), reduced the pbs_mom cpumult and wallmult parameters to reflect the lower per-core HEPSPEC rating once we start using the virtual cores, and returned them to running production jobs.

They have now been running real jobs for seven days and we have enough statistics to start comparing the nodes running jobs on virtual cores with those not doing so.

Figure (2): Average Job Efficiency for nodes using and not using virtual cores
Figure (2) shows the average job efficiency (total CPU time divided by total wall time) for jobs run on nodes with and without the virtual cores in use. There is no sign of a drop in efficiency when running on the virtual cores so it would appear that at 12 or 18 jobs per node we are not yet hitting network, memory or disk I/O bottlenecks.

Figure (3): Average Job Efficiency for different VOs and Roles on the different classes of nodes
Figure (3) shows the average job efficiency for different VOs and roles on the different classes of nodes, and shows no sign of a systematic drop in efficiency when running jobs on the virtual cores (only the prdlhcb group shows signs of such an effect, and the statistics are somewhat lower for that group).
Figure (4): HEPSPEC06 Scaled CPU Hours per Core per Day
Figure (5): HEPSPEC06 Scaled CPU Hours per Node per Day
Figures (4) and (5) show HEPSPEC06-scaled CPU hours accumulated per day per core or per node (the total unscaled CPU time of all jobs on that class of node, multiplied by the HS06 rating, divided by the number of days the test ran and the number of cores or nodes in that class). As hoped, although the individual cores accumulate HEPSPEC06-scaled CPU hours faster when not running jobs on virtual cores, that is more than offset by the increase in the number of slots per node.

Figure (6): Average number of Jobs per Core per Day
Figure (7): Average number of Jobs per Node per Day
Finally, Figures (6) and (7) show jobs per day per node or per core (the total number of jobs run on each class of node divided by the length of the test and the number of nodes or cores in that class). These show a similar effect: the individual cores manage to do more work when no jobs are running on the virtual cores, but the increase in the number of slots more than makes up for it.

In conclusion it appears that running jobs on half of the virtual cores for nodes that are hyperthreading capable gives a 30-40% increase in the total "installed capacity" provided by those nodes without any apparent decrease in the efficiency of jobs running on those nodes.

We will continue the test for another week but unless the numbers change drastically we will be changing our policy and running jobs on half of the virtual cores on hyperthreading capable nodes.

Chris and Rob.

Friday, February 10, 2012

HEPSPEC06 on AMD Interlagos 6276

I have been running HEPSPEC06 on a recent Dell 815 with the new 16-core AMD Interlagos processors.

The only valid HEPSPEC06 result (for current GridPP use) is the SL5 (64-bit OS) with 32-bit compiler result, but for interest we also ran with 64-bit compiler switches.

Then we installed SL6 and re-ran both 32 and 64 bit compiler options.

The results are on the GridPP wiki, but the most notable thing is the performance boost you can get going from 32-bit on SL5 to 64-bit on SL6.

The boost is nearly 25%, which could mean a lot to the experiments and to site productivity if they can be persuaded to migrate sooner rather than later.