Before coming to the actual topic, monitoring of non-LHC VOs, here is
a brief description of the monitoring framework.
Service Availability Monitoring (SAM) is a framework for
monitoring grid sites remotely. It consists of several components that perform
different functions and can be broadly divided into:
‘What to Monitor’ or Topology Aggregation: Collection of service endpoints and metadata
from different sources such as GOCDB, BDII and VOMS. Custom topology sources
(VO feeds) can also be used.
Profile Management: Mapping
of services to the tests to be performed. This service is provided by the POEM (Profile Management)
database, which offers a web-based
interface for grouping metrics into profiles.
Monitoring: Nagios is used as the monitoring engine. It is
automatically configured based on the information provided by the Topology Aggregator
and POEM.
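How the three components fit together can be sketched in a few lines of Python. This is an illustration only, not SAM code: the host names, metric names and the `build_checks` helper are hypothetical, and the real framework writes Nagios configuration rather than Python tuples.

```python
# Illustrative sketch only: hypothetical data, not the real SAM/POEM schema.

# Topology: which service endpoints exist (hostname, service flavour).
topology = [
    ("ce01.example.ac.uk", "CREAM-CE"),
    ("arc01.example.ac.uk", "ARC-CE"),
    ("se01.example.ac.uk", "SE"),
]

# Profile: which metrics (tests) apply to each service flavour.
profile = {
    "CREAM-CE": ["example.JobSubmit", "example.CACertVersion"],
    "ARC-CE": ["example.JobSubmit"],
    "SE": ["example.PutGetDelete"],
}


def build_checks(topology, profile):
    """Return one (hostname, metric) pair per check the engine should run."""
    checks = []
    for hostname, flavour in topology:
        # A flavour with no entry in the profile yields no checks.
        for metric in profile.get(flavour, []):
            checks.append((hostname, metric))
    return checks


checks = build_checks(topology, profile)
for hostname, metric in checks:
    print(f"{hostname}: {metric}")
```

The point of the split is that the topology and the profile can change independently: adding a site touches only the topology, while adding a test touches only the profile.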
The SAM software was developed under the EGEE project at CERN and
is now maintained by EGI.
Grid sites must pass the ops VO functional
tests to be part of WLCG. Every NGI maintains a Regional SAM Nagios, and the results from
each Regional SAM Nagios also go to the central MyEGI, where they are used for Reliability/Availability
calculations.
The UK Regional Nagios is maintained at Oxford,
with a backup instance at Lancaster.
VO-Nagios
For a long time there was no centralized monitoring of non-LHC VOs,
which contributed to a bad user experience because it was difficult to tell
whether a site was broken or the problem was at the user's end. It was decided to host a multi-VO Nagios at
Oxford, as we already had experience with the WLCG Nagios.
It currently monitors five VOs:
gridpp
t2k
snoplus.snolab.ca
pheno
vo.southgrid.ac.uk
Sites can look at the tests associated with their own site only.
VO managers may be interested in seeing only the tests associated with a
particular VO.
We use the VO-feed mechanism to aggregate site metadata and
endpoint information. Every VO has a VO feed available on a web server, and we currently maintain these feeds ourselves.
A VO feed provides the list of services to be monitored. I
generate each VO feed with a script.
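As a rough idea of what such a script can look like, here is a minimal Python sketch that writes a list of sites and services as XML. It is not the script actually in use: the site names, host names, the `build_vo_feed` helper and the simplified element layout (`root`, `vo`, `site`, `service`) are assumptions made for illustration and do not follow the official VO-feed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: site name -> list of (hostname, service flavour).
SITES = {
    "EXAMPLE-SITE-A": [
        ("ce01.example.ac.uk", "CREAM-CE"),
        ("se01.example.ac.uk", "SE"),
    ],
    "EXAMPLE-SITE-B": [
        ("arc01.example.org", "ARC-CE"),
    ],
}


def build_vo_feed(vo, sites):
    """Build a simplified, illustrative VO-feed document as an XML string."""
    root = ET.Element("root")
    ET.SubElement(root, "vo").text = vo
    # Sort the sites so that the generated feed is stable between runs.
    for site_name in sorted(sites):
        site = ET.SubElement(root, "site", name=site_name)
        for hostname, flavour in sites[site_name]:
            ET.SubElement(site, "service", hostname=hostname, flavour=flavour)
    return ET.tostring(root, encoding="unicode")


feed_xml = build_vo_feed("gridpp", SITES)
print(feed_xml)
```

In practice the output would be written to a file served by the web server, one feed per VO.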
Jobs are submitted using a proxy generated from a Robot
Certificate assigned to Kashif Mohammad. These jobs are like normal grid user
jobs and test things such as the GCC version and the CA version. Jobs are submitted every
eight hours; this interval is configurable.
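To give a flavour of what such a test does, the sketch below turns the output of `gcc --version` into a Nagios-style result. The return codes 0 to 3 are the standard Nagios plugin convention; everything else, including the `check_gcc_version` helper, the minimum version and the sample output, is hypothetical and is not taken from the probes actually deployed.

```python
import re

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_gcc_version(version_output, minimum=(4, 4)):
    """Map the text printed by `gcc --version` to a (status, message) pair.

    `minimum` is a hypothetical threshold chosen for illustration.
    """
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        return UNKNOWN, "UNKNOWN: could not parse GCC version"
    found = (int(match.group(1)), int(match.group(2)))
    text = f"{found[0]}.{found[1]}"
    if found < minimum:
        wanted = f"{minimum[0]}.{minimum[1]}"
        return CRITICAL, f"CRITICAL: GCC {text} is older than {wanted}"
    return OK, f"OK: GCC {text}"


# Sample text standing in for what a worker node might report.
sample = "gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)"
status, message = check_gcc_version(sample)
print(status, message)
```

A real probe would run the command on the worker node and exit with the status code, so that Nagios can display the message and colour the result accordingly.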
We monitor only CREAM-CE, ARC-CE and SE services. Services such as BDII and WMS
are already monitored by the Regional Nagios, so there is no need to
duplicate those tests.
For more information, see the following link:
https://tomtools.cern.ch/confluence/display/SAMDOC/SAM+Public+Site.html