Tuesday, January 22, 2008

Oxford Update

Plans to move the Oxford GridPP cluster up to Begbroke are being formulated.
The first part of the plan is to ensure that only these nodes are using the subnet in question. We did some tidying up over the last week or so, before having the subnet rerouted to both Physics and Begbroke. This change was made this morning at 8:50, and mostly went smoothly.
Our UI needs to be moved back onto the Physics subnet so that NFS mounting of home directories works.
A new rack, PDU and network switch have been ordered to allow us to move a few test nodes up to Begbroke in advance of the main move.
We aim to complete the move in late January or early February.

The disk on our installation server that holds Ganglia data and central syslog data failed today. We will restore from backups.
t2wn05 has a failed hard disk, and the node may have been acting as a black hole over the weekend.
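
A node in this state tends to show up as a run of jobs that fail within seconds of starting. The sketch below is only an illustration of that idea, not the check we actually run: the record format (node, runtime in seconds, exit code) and the thresholds are assumptions, and real batch accounting logs would need parsing into this shape first.

```python
from collections import defaultdict


def find_black_holes(jobs, max_runtime=60, min_jobs=20, fail_ratio=0.9):
    """Return a sorted list of nodes where most jobs fail almost immediately.

    jobs: iterable of (node, runtime_seconds, exit_code) tuples.
    This record format is hypothetical; the thresholds are illustrative.
    """
    total = defaultdict(int)      # jobs seen per node
    fast_fail = defaultdict(int)  # jobs that failed within max_runtime seconds
    for node, runtime, exit_code in jobs:
        total[node] += 1
        if exit_code != 0 and runtime <= max_runtime:
            fast_fail[node] += 1
    # Only flag nodes with enough jobs to make the ratio meaningful.
    return sorted(
        node
        for node, count in total.items()
        if count >= min_jobs and fast_fail[node] / count >= fail_ratio
    )
```

Requiring a minimum job count avoids flagging a node on the strength of a handful of unlucky jobs.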

Working with the ZEUS and LHCb VOs to improve usage of our cluster uncovered some configuration problems.
  1. Not all the nodes had the latest DESY VOMS server certs applied (which stopped ZEUS working).
  2. sgm roles were not mapped correctly for LHCb.
Finally, the APEL problems seem to be behind us.
  1. The configuration appears to have changed during the last yaim run before Christmas, which stopped any records from being published.
  2. Installing the latest development APEL RPMs fixed the problem of the newer spec value for our new CE not being seen.

Friday, January 04, 2008

Scheduled Power outage at Birmingham causes problems

The scheduled power outage at Birmingham on Saturday 8th December caused 19 BaBar SL4 systems to fail, and four bad disks appeared on the SL3 cluster. The age of this equipment is a cause for concern.

Some concern has been expressed at small sites such as Bristol that the number of ATLAS test jobs submitted by Steve Lloyd's tests can overwhelm them.