Monday, October 13, 2014

Nagios Monitoring for Non-LHC VOs

First, a brief description of the monitoring framework before coming to the actual topic of non-LHC VO monitoring.
Service Availability Monitoring (SAM) is a framework for monitoring grid sites remotely. It consists of several components that perform different functions. It can be broadly divided into:
‘What to Monitor’ or Topology Aggregation: Collection of service endpoints and metadata from different sources such as GOCDB, BDII and VOMS. A custom topological source (VO feeds) can also be used.
Profile Management: Mapping of services to the tests to be performed. This service is provided by the POEM (Profile Management) database, which offers a web-based interface for grouping various metrics into profiles.
Monitoring: Nagios is used as the monitoring engine. It is configured automatically based on the information provided by the Topology Aggregator and POEM.
The SAM software was developed under the EGEE project at CERN and is now maintained by EGI.
It is mandatory for grid sites to pass the ops VO functional tests to be part of WLCG. Every NGI maintains a Regional SAM Nagios, and the results from the regional SAM Nagios also go to the central MyEGI, which is used for the Reliability/Availability calculation.
The UK Regional Nagios is maintained at Oxford, with a backup instance at Lancaster.

For a long time there was no centralised monitoring of non-LHC VOs, and this contributed to a bad user experience because it was difficult to tell whether a site was broken or the problem was at the user's end. It was decided to host a multi-VO Nagios at Oxford, as we already had experience with the WLCG Nagios.
It currently monitors five VOs.

Sites can look at the tests associated with their own site only.
VO managers may be interested in seeing only the tests associated with a particular VO.

We are using the VO-feed mechanism to aggregate site metadata and endpoint information. Every VO has a VO feed available on a web server. Currently we are maintaining this VO feed.

The VO feed provides the list of services to be monitored. I generate this VO feed with a script.
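As a rough illustration of what such a generator script might look like, here is a minimal Python sketch that builds a VO-feed-style XML document from a mapping of sites to services. It is not the script used at Oxford. The element and attribute names (root, vo, atp_site, service, hostname, flavour), the flavour strings, the VO name and the hostnames are all assumptions made for illustration, so the schema expected by the topology aggregator should be checked before relying on it.

```python
import xml.etree.ElementTree as ET


def build_vo_feed(vo, sites):
    """Return a VO-feed-style XML string.

    `sites` maps a site name to a list of (hostname, flavour) tuples.
    All element and attribute names here are illustrative assumptions.
    """
    root = ET.Element("root")
    ET.SubElement(root, "title").text = f"{vo} topology feed"
    ET.SubElement(root, "vo").text = vo
    # Sort site names so the generated feed is stable between runs.
    for site_name in sorted(sites):
        site = ET.SubElement(root, "atp_site", {"name": site_name})
        for hostname, flavour in sites[site_name]:
            ET.SubElement(site, "service",
                          {"hostname": hostname, "flavour": flavour})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical site and hosts, not real endpoints.
    example_sites = {
        "EXAMPLE-SITE": [
            ("ce01.example.org", "ARC-CE"),
            ("se01.example.org", "SRMv2"),
        ],
    }
    print(build_vo_feed("examplevo", example_sites))
```

The output of such a script would then be published on a web server, one feed per VO, for the Nagios configuration to pick up.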

Jobs are submitted using a proxy generated from a Robot Certificate assigned to Kashif Mohammad. These jobs are like normal grid user jobs and test things like the GCC version and the CA version. Jobs are submitted every eight hours, and this interval is configurable. We are monitoring CREAM CE, ARC CE and SE only. Services like BDII and WMS are already monitored by the Regional Nagios, so there was no need to duplicate them.

For more information, these links can be consulted

Tuesday, May 13, 2014

Configuring ARC CE and Condor with puppet

ARC CE and condor using puppet

We have started testing Condor and ARC CE with the intention of moving away from Torque. Almost one third of the cluster has been moved to Condor and we are quite satisfied with Condor as a batch system. The Condor setup was fairly easy, but configuring the ARC CE was a bit challenging. I believe that the new version of ARC CE has fixed most of the issues I faced. Andrew Lahiff was of great help in troubleshooting our problems. Our setup consists of:
1. CE: configured as ARC CE and Condor submit host; runs the Condor SCHEDD process
2. Central manager: Condor server; runs the Condor COLLECTOR and NEGOTIATOR processes
3. WNs: run the Condor STARTD process; the emi-wn and glexec metapackages are also installed
The CE, the Central Manager and the Condor part of the WNs were configured completely with Puppet. I have to run yaim on the WNs to configure emi-wn and glexec.
I used Puppet modules which were initially written by Luke Kreczko from Bristol. We are using Hiera to pass parameters, but most of the Puppet modules work without Hiera as well. I do not intend to go into the details of Condor or ARC CE, but rather into the use of Puppet modules to install and configure them.

Condor :
It was a pleasing experience to configure condor with puppet.
     Git clone to module directory on puppet server
     include htcondor
on the CE, Central Manager and WNs; Hiera then determines which service has to be configured on a particular machine.
# Condor
- ''
- ''
- 't2wn*'

htcondor::uid_domain: ''
htcondor::collector_name: 'SOUTHGRID_OX'
htcondor::pool_password: 'puppet:///site_files/grid/condor_pool_password'

This configures a basic Condor cluster. There are no user accounts at this stage, so a test user account can be created on all three machines and basic Condor jobs can be tested. The HTCondor manual is here

Setting up user accounts :
I used this module to create user accounts on the central manager and the CE only. Since I have to run yaim on the WNs to set up emi-wn and glexec anyway, I created the user accounts on the WNs through yaim.
This Puppet module can parse a glite-style users.conf to create user accounts, or a range of IDs can be passed to the module.

Setting up voms server :
It is used to set up the VOMS client on the central manager and the CE. One way to use this module is to pass the name of each VO separately, as described in the readme file of the module.
     class { 'voms::atlas': }
I have used a small wrapper class so that all VOs can be passed as an array
     include setup_grid_accounts
Then pass the names of the VOs through Hiera:
     setup_grid_accounts::vo_list:
    - 'alice'
    - 'atlas'
    - 'cdf'
    - 'cms'
    - 'dteam'
    - 'dzero'

Include arc_ce on the CE and then pass configuration parameters from Hiera. It has a very long list of configurable parameters, and most of the default values work fine. Since most of the values are passed through Hiera, the ARC Hiera file is quite long, so I am giving only a few examples
    targethostname: ''
    targetport: '2135'
    targetsuffix: 'Mds-Vo-Name=UK,o=grid'
    regperiod: '120'

       default_memory: '2048'
         - '1cpu:4'
          OSFamily: 'linux'
          OSName: 'ScientificSL'
          OSVersion: '6.5'
          OSVersionName: 'Carbon'
          CPUVendor: 'GenuineIntel'
          CPUClockSpeed: '2334'
          CPUModel: 'xeon'
          NodeMemory: '2048'
          totalcpus: '168'

This almost completes the setup of a Condor cluster with an ARC CE. There are a few bits in the ARC and Puppet modules which are there as workarounds for things that have already been fixed upstream. They need some testing and clean-up.

The WNs need some small runtime environment settings specific to ARC. When a job arrives at a WN, it looks in the /etc/arc/runtime/ directory for ENV settings.
Our runtime tree looks like this.
├── APPS
│   └── HEP
│       └── ATLAS-SITE-LCG
└── ENV
    ├── GLITE
    └── PROXY
These can be just empty files. SAM Nagios doesn't submit jobs if the ARC CE is not publishing the GLITE env.
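For anyone who wants to reproduce the runtime tree, here is a minimal Python sketch that creates the same three empty files under a base directory. The helper name create_runtime_tree is hypothetical, and the example writes into a temporary directory instead of /etc/arc/runtime/ so that it is safe to try; on a real WN the files would more likely be laid down by your configuration management system.

```python
import tempfile
from pathlib import Path

# Leaf entries of the runtime tree shown above; each one is an empty file.
RUNTIME_ENTRIES = [
    "APPS/HEP/ATLAS-SITE-LCG",
    "ENV/GLITE",
    "ENV/PROXY",
]


def create_runtime_tree(base):
    """Create the empty runtime environment files under `base`."""
    base = Path(base)
    created = []
    for entry in RUNTIME_ENTRIES:
        path = base / entry
        # Create intermediate directories (APPS/HEP, ENV) as needed.
        path.parent.mkdir(parents=True, exist_ok=True)
        # touch() creates an empty file and leaves an existing one alone.
        path.touch(exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # A temporary directory stands in for /etc/arc/runtime/ here.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_runtime_tree(tmp):
            print(path.relative_to(tmp))
```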

I may have missed a few things, so please feel free to point them out.



Wednesday, May 07, 2014

Configuring CVMFS for smaller VOs

We have just configured cvmfs for t2k, hone, mice and ilc after sitting on the request for a long time. The main reason for the delay was the assumption that we would need to change the cvmfs Puppet module to accommodate non-LHC VOs. It turned out to be quite straightforward and took little effort.
We are using the CERN cvmfs module; there was an update a month ago, so it is better to keep it up to date.

We use Hiera to pass parameters to the module. Our Hiera bit for cvmfs:
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';;'

One important bit is the name of cvmfs repository e.g instead of

The other slight hitch is the distribution of public keys for the various cvmfs repositories. Installing cvmfs also fetches the cvmfs-keys-*.noarch rpm, which puts all the keys for the CERN-based repositories into /etc/cvmfs/keys/.

I have to copy the public key for and to /etc/cvmfs/keys. It can be fetched from repository
wget -O
or copied from

We distributed the keys through Puppet, but outside the cvmfs module.
It would be great if someone could convince CERN to include the public keys of other repositories in the cvmfs-keys-* rpm. I am sure that there are not going to be many cvmfs stratum 0s.

The last part of the configuration is to change SW_DIR in site-info.def or the vo.d directory

The WNs require a re-run of yaim to configure SW_DIR in /etc/profile.d/. You can also edit the file manually and distribute it through your favourite configuration management system.

Thursday, January 23, 2014

A dramatic effect on Atlas jobs when xrootd dies

This week for the first time at our site the xrootd server process on our DPM SE died.

The ganglia plot shows a dramatic falloff in load, as all the jobs started to fail to access the data.
The number of jobs running in the batch system remained high, so pbswebmon did not alert us, although Kashif had noticed on Tuesday evening that the jobs were very inefficient. In hindsight, that was the giveaway that something was amiss. We received a ticket from Atlas, Ewan restarted the daemon, and everything recovered.
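Since the running job count alone did not reveal the problem, a check on job efficiency would have been a more useful signal. Here is a minimal Python sketch of that idea, not something pbswebmon provides. The job tuple layout (job id, CPU seconds, wall-clock seconds) and the 0.5 threshold are assumptions for illustration, and the numbers would have to come from your own batch system accounting.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, cores=1):
    """Return CPU time as a fraction of the wall-clock time available."""
    if wall_seconds <= 0:
        return 0.0
    return cpu_seconds / (wall_seconds * cores)


def inefficient_jobs(jobs, threshold=0.5):
    """Return ids of jobs whose CPU efficiency is below `threshold`.

    `jobs` is an iterable of (job_id, cpu_seconds, wall_seconds) tuples.
    """
    return [
        job_id
        for job_id, cpu_seconds, wall_seconds in jobs
        if cpu_efficiency(cpu_seconds, wall_seconds) < threshold
    ]


if __name__ == "__main__":
    # Made-up numbers: one healthy job and one job stalled waiting for data.
    sample = [("job-1", 900, 1000), ("job-2", 50, 1000)]
    print(inefficient_jobs(sample))
```

An alert when a large fraction of running jobs falls below the threshold would catch the case where jobs keep running but cannot read their data.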