Thursday, August 28, 2008

Upgrade at RALPP

We had a downtime this morning to:
  1. Upgrade the kernel on the Grid Service and dCache pool nodes
  2. Install the latest gLite updates on the service nodes
  3. Upgrade dCache to the most recent patch version
Upgrading gLite and the kernel on the service nodes seems to have gone smoothly (still waiting for the SAM Admin jobs I submitted to get beyond "Waiting").
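For the service nodes the update itself was just the usual yum run (a sketch, assuming the gLite and SL errata repositories are already configured in yum; your procedure may differ):

yum clean all
yum update          # pulls in the new kernel plus the latest gLite updates
shutdown -r now     # reboot onto the new kernel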

However, I had a bit more fun upgrading the kernel on the dCache pool nodes. This is supposed to be much easier now that the Areca drivers are in the stock SL4 kernel and the xfs kernel modules are in the SL Contrib yum repository, so we no longer have to build our own rpms as we have in the past, and indeed both of those parts worked fine. But five of the nodes with 3ware cards did not come back after I (remotely) shutdown -r now'd them. Of course these are the nodes in the Atlas Center, so I had to walk across to find out what the problem was. They all seemed to have hung at the end of the shutdown, at "Unmounting the filesystems", and all came back cleanly after I hit the reset buttons.
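For reference, the per-node steps were roughly as follows (a sketch only; the exact package names, in particular the kernel-module-xfs package matching the new kernel, are from memory rather than a transcript):

yum install kernel kernel-module-xfs    # new SL4 kernel plus matching XFS module from SL Contrib
shutdown -r now                         # reboot remotely and hope the node comes back on its own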

The second problem (which had me worried for a time) was with one of the Areca nodes. I was checking them to see that the XFS kernel modules had installed correctly and that the RAID partition was mounted; on this node it wasn't, although the kernel modules had installed fine. Looking a bit harder I found that the whole device seemed to be missing. Connecting to the RAID card web interface I found that instead of two RAID sets (system and data) it had the two system disks in a RAID0 pair and 22 free disks (cue heart palpitations). Looking (in a rather panicked fashion) through the admin interface options I found "Rescue RAID set" and gave it a go. After a reboot I connected to the web interface again and could see both RAID sets. Phew! It was too early to start the celebrations though, because when I logged in the partition wasn't mounted, and when I tried to mount it by hand it complained that the Logical Volume wasn't there. Uh oh, cue much googling and reading of man pages.
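The quick checks that showed the data device had vanished were along these lines (a sketch; the device name is just illustrative):

cat /proc/partitions    # the big data device (e.g. /dev/sdb) was simply not listed
dmesg | grep -i sd      # no sign of the data RAID set being detected at boot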

pvscan sees the physical volume, vgscan sees the volume group, and lvscan sees the logical volume, but it's "NOT Available". I tried vgscan --mknodes, but that didn't work. I finally got it working with:

vgchange --available y --verbose raid

Then I could mount the partition and all the data appeared to be there.
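For anyone hitting the same thing, the whole recovery sequence was roughly this (a sketch; the volume group really is called raid, but the logical volume name and mount point below are illustrative):

pvscan                                   # physical volume is seen
vgscan                                   # volume group "raid" is seen
lvscan                                   # logical volume listed, but "NOT Available"
vgchange --available y --verbose raid    # activate the logical volumes in the volume group
mount /dev/raid/data /mnt/data           # now the mount works (LV name and mount point illustrative)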

After all that the upgrade of dCache was very simple. Just a case of adding the latest rpm to my yum repository, running yum update dcache-server and then /opt/d-cache/install/install.sh. The latter complained about some deprecated config file options I'll have to look at, but dCache came up smoothly.
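On each dCache node that boiled down to (a sketch; how the new rpm gets into the local yum repository depends on your setup):

yum update dcache-server           # pick up the new dcache-server rpm from the local repo
/opt/d-cache/install/install.sh    # reconfigure; this is where the deprecated-option warnings appeared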

I'd qsig -s STOP'd all the jobs whilst doing the upgrade, obviously, and here's an interesting plot of the network traffic into the Worker Nodes over the last day.
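Pausing and resuming the jobs was just a loop over qsig (a sketch, assuming a Torque/PBS batch system; the qstat parsing is illustrative, not the exact command I ran):

for job in $(qstat | awk '/ R /{print $1}'); do qsig -s STOP $job; done    # pause running jobs before the downtime
for job in $(qstat | awk '/ R /{print $1}'); do qsig -s CONT $job; done    # let them carry on afterwards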

As you can see, once I restarted the jobs they more or less picked up without missing a beat. And yes, they really are reading data at 600 MB/s, and the dCache is quite happily serving it to them at that rate.

