Thursday, January 23, 2014

A dramatic effect on Atlas jobs when xrootd dies

This week, for the first time at our site, the xrootd server process on our DPM SE died.

The ganglia plot shows a dramatic falloff in load as all the jobs started failing to access their data.
The number of jobs running in the batch systems remained high, so pbswebmon did not alert us, although Kashif had noticed on Tuesday evening that the jobs were very inefficient. In hindsight, that was the giveaway that something was amiss. We received a ticket from Atlas, Ewan restarted the daemon, and everything recovered.
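Since the job count alone hid the problem, one possible extra check is on aggregate CPU efficiency (CPU time divided by wall time) of the running jobs. The following is only a rough sketch, not something we run: the function names, the 0.5 threshold and the minimum job count are all made up for illustration, and the (cpu_seconds, wall_seconds) pairs would have to be pulled from the batch system by some other means.

```python
def cpu_efficiency(cpu_seconds, wall_seconds):
    """Return CPU time / wall time, or 0.0 when no wall time has elapsed."""
    if wall_seconds <= 0:
        return 0.0
    return cpu_seconds / wall_seconds


def should_alert(jobs, min_efficiency=0.5, min_jobs=10):
    """Decide whether the running jobs look suspiciously idle.

    jobs is a list of (cpu_seconds, wall_seconds) pairs, one per running job.
    Both thresholds are hypothetical defaults and would need tuning per site.
    """
    # Too few jobs gives a noisy average, so stay quiet.
    if len(jobs) < min_jobs:
        return False
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_wall = sum(wall for _, wall in jobs)
    # Alert when the aggregate efficiency drops below the threshold.
    return cpu_efficiency(total_cpu, total_wall) < min_efficiency
```

With a check like this, a farm full of jobs stuck waiting on storage would trip the alert even though the number of running jobs still looks healthy.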

