Thursday, October 09, 2008

SouthGrid update

The Birmingham site has suffered some reliability problems caused by site networking issues.
Physics are working with central IS to resolve them.

Bristol has been having problems with its SE (Storage Element). Despite work over the weekend to fsck all the partitions by hand, the disk array is still causing problems.

Oxford has recently been having a strange Maui problem: only about 60-70% of the available cores get allocated jobs, although running the waiting jobs manually with 'qrun' works fine.
More recently, the Maui process itself started crashing. Investigations are ongoing, although things seem a bit better for now.

RalPPD have put the latest purchase of WNs (worker nodes) into production, adding another 160 job slots worth 270 kSI2k and bringing us up to 1025 kSI2k in total. Further disk servers have arrived but will take a month to commission.
