Thursday, August 28, 2008

Upgrade at RALPP

We had a downtime this morning to:
  1. Upgrade the kernel on the Grid Service and dCache pool nodes
  2. Install the latest gLite updates on the service nodes
  3. Upgrade dCache to the most recent patch version
Upgrading gLite and the kernel on the service nodes seems to have gone smoothly (I'm still waiting for the SAM Admin jobs I submitted to get beyond "Waiting").

However, I had a bit more fun upgrading the kernel on the dCache pool nodes. This is supposed to be much easier now that the Areca drivers are in the stock SL4 kernel and the xfs kernel modules are in the SL Contrib yum repository, so we don't have to build our own rpms as we have in the past, and indeed both of these parts worked fine. But five of the nodes with 3ware cards did not reappear after I (remotely) shutdown -r now'd them. Of course, these are the nodes in the Atlas Center, so I had to walk across to try to find out what the problem was. They all seemed to have hung at the end of the shutdown at "Unmounting the filesystems". All came back cleanly after I hit the reset buttons.

The second problem (which had me worried for a time) was with one of the Areca nodes. I was checking them to see whether the XFS kernel modules had installed correctly and the RAID partition was mounted; on this node it wasn't, although the kernel modules had installed correctly. Looking a bit harder, I found that the whole device seemed to be missing. Connecting to the RAID card web interface, I found that instead of two RAID sets (system and data) it had the two system disks in a RAID 0 pair and 22 free disks (cue heart palpitations). Looking (in a rather panicked fashion) through the admin interface options, I found "Rescue RAID set" and gave it a go. After a reboot I connected to the web interface again and could see both RAID sets. Phew! It was too early to start the celebrations though, because when I logged in the partition wasn't mounted, and when I tried to mount it by hand it complained that the Logical Volume wasn't there. Uh oh, cue much googling and reading of man pages.

pvscan sees the physical volume, vgscan sees the volume group, and lvscan sees the logical volume, but it's "NOT Available". I tried vgscan --mknodes; that didn't work. I finally got it working with:

vgchange --available y --verbose raid

Then I could mount the partition and all the data appeared to be there.
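
For the record, the whole recovery sequence can be written up as a dry run. This is a sketch, not exactly what I typed: the helper name is mine, the volume group is called raid as in the vgchange above, and the real commands need root. Drop the echoes to run it for real.

```shell
#!/bin/sh
# Dry-run sketch of the LVM recovery sequence (prints, doesn't execute).
lvm_recovery_plan() {
  echo "pvscan"                                # 1. physical volume visible?
  echo "vgscan --mknodes"                      # 2. recreate /dev nodes (didn't help here)
  echo "lvscan"                                # 3. LV present but "NOT Available"
  echo "vgchange --available y --verbose raid" # 4. this is what reactivated the LV
}
lvm_recovery_plan
```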

After all that, the upgrade of dCache was very simple: just a case of adding the latest rpm to my yum repository, running yum update dcache-server, then /opt/d-cache/install/ The latter complained about some deprecated config file options that I'll have to look at, but dCache came up.

I'd qsig -s STOP'd all the jobs whilst doing the upgrade, and here's an interesting plot of the network traffic into the Worker Nodes over the last day.

As you can see once I restarted the jobs they more or less picked up without missing a beat. And yes, they are reading data at 600 MB/sec and the dCache is quite happily serving to them at that rate.

Wednesday, August 27, 2008

More spacetokens at Oxford

Expanding the space tokens at Oxford showed that the dpm-updatespace command only accepts integer values, so for 4.5 TB use 4500G:

/opt/lcg/bin/dpm-updatespace --token_desc ATLASMCDISK --gspace 4500G --lifetime Inf
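
The conversion is just integer arithmetic; a tiny helper (the function name is mine) avoids mistyping the value, working in tenths of a TB since shell arithmetic is integer-only:

```shell
#!/bin/sh
# Convert a size given in tenths of a TB to the integer-GB string
# that dpm-updatespace's --gspace option will accept.
tenths_tb_to_g() {
  echo "$(( $1 * 100 ))G"
}
tenths_tb_to_g 45   # 4.5 TB -> 4500G
```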

I used Graeme's script to set up the ATLASGROUPDISK permissions after the reservespace command:

/opt/lcg/bin/dpm-reservespace --gspace 2T --lifetime Inf --group atlas/Role=production --token_desc ATLASGROUPDISK

Graeme's script:

[root@t2se01 ~]# more

DOMAIN=$(hostname -d)

dpns-mkdir /dpm/$DOMAIN/home/atlas/atlasgroupdisk/
dpns-chgrp atlas/Role=production /dpm/$DOMAIN/home/atlas/atlasgroupdisk/
dpns-setacl -m d:g:atlas/Role=production:7,d:m:7 /dpm/$DOMAIN/home/atlas/atlasgroupdisk/

for physgrp in exotics higgs susy beauty sm; do
dpns-entergrpmap --group atlas/phys-$physgrp/Role=production
dpns-mkdir /dpm/$DOMAIN/home/atlas/atlasgroupdisk/phys-$physgrp
dpns-chgrp atlas/phys-$physgrp/Role=production /dpm/$DOMAIN/home/atlas/atlasgroupdisk/phys-$physgrp
dpns-setacl -m d:g:atlas/phys-$physgrp/Role=production:7,d:m:7 /dpm/$DOMAIN/home/atlas/atlasgroupdisk/phys-$physgrp
done

ATLASDATADISK space was increased to 15 TB:
dpm-updatespace --token_desc ATLASDATADISK --gspace 15T --lifetime Inf

ATLASLOCALGROUPDISK was created and set up:
/opt/lcg/bin/dpm-reservespace --gspace 1T --lifetime Inf --group atlas --token_desc ATLASLOCALGROUPDISK

dpns-mkdir /dpm/

dpns-chgrp atlas/uk /dpm/
dpns-setacl -m d:g:atlas/uk:7,m:7 /dpm/
dpns-setacl -m g:atlas/uk:7,m:7 /dpm/

Wednesday, August 20, 2008

Brief Bristol Update

Brief Bristol update: new hardware to replace the HPC CE received &
being built. New hardware for the StoRM SE & gridftp nodes received;
Dr Wakelin is building them.
Our 50TB of new storage should be ready in September.

New hardware to replace MON received, being built. Will replace small
cluster WN this fall (possibly increase number) & possibly also
its CE & DPM SE.

Both clusters mostly stable, except for occasional gpfs timeouts on
HPC & recent intermittent problems with SCSI resets on DPM SE.

Delays due to Yves, Jon & Winnie being very busy with other very high-priority work.

Monday, August 18, 2008

Setting up the Atlas Space Tokens on dCache

Well, the request from Atlas to have space tokens set up is quite complicated, but here's my first attempt at setting them up for dCache.

They want to have different permissions on different space tokens. I think the only way to do that is to create different LinkGroups to associate with the space tokens. Here is the section from my LinkGroupAuthorization.conf file for Atlas now:
LinkGroup atlas-link-group

LinkGroup atlas-group-link-group

LinkGroup atlas-user-link-group

LinkGroup atlas-localgroup-link-group
However, it appears a Link can only be associated with one LinkGroup, so we also have to create a Link for each of these. Luckily, it appears that a PoolGroup can be associated with multiple Links, so we don't have to split up the Atlas space (phew).

So I created a bunch of Links and LinkGroups in the PoolManager like this:
psu create link atlas-localgroup-link world-net atlas
psu set link atlas-localgroup-link -readpref=20 -writepref=20 -cachepref=20 -p2ppref=-1
psu add link atlas-localgroup-link atlas-pgroup
psu add link atlas-localgroup-link atlas
psu create linkGroup atlas-localgroup-link-group
psu set linkGroup custodialAllowed atlas-localgroup-link-group false
psu set linkGroup replicaAllowed atlas-localgroup-link-group true
psu set linkGroup nearlineAllowed atlas-localgroup-link-group false
psu set linkGroup outputAllowed atlas-localgroup-link-group false
psu set linkGroup onlineAllowed atlas-localgroup-link-group true
psu addto linkGroup atlas-localgroup-link-group atlas-localgroup-link
Obviously this was repeated for each of the other extra LinkGroups.
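
That repetition is easy to script. Here's a sketch (the function name is mine) that prints the same PoolManager commands for each extra link group, ready to paste into the admin interface; the preferences are copied from the localgroup example above, so adjust them if a group ever needs different settings (e.g. custodial storage):

```shell
#!/bin/sh
# Print the PoolManager setup commands for one atlas link group.
# $1 is the short name, e.g. "user" -> atlas-user-link(-group).
atlas_linkgroup_cmds() {
  link="atlas-$1-link"
  lg="atlas-$1-link-group"
  echo "psu create link $link world-net atlas"
  echo "psu set link $link -readpref=20 -writepref=20 -cachepref=20 -p2ppref=-1"
  echo "psu add link $link atlas-pgroup"
  echo "psu add link $link atlas"
  echo "psu create linkGroup $lg"
  echo "psu set linkGroup custodialAllowed $lg false"
  echo "psu set linkGroup replicaAllowed $lg true"
  echo "psu set linkGroup nearlineAllowed $lg false"
  echo "psu set linkGroup outputAllowed $lg false"
  echo "psu set linkGroup onlineAllowed $lg true"
  echo "psu addto linkGroup $lg $link"
}
# The remaining two extra groups (the plain atlas-link-group follows
# the same pattern without the infix):
for g in group user; do atlas_linkgroup_cmds "$g"; done
```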

Then it's just a case of creating the space tokens in the SrmSpaceManager:
reserve -vog=/atlas -vor=NULL -acclat=ONLINE -retpol=REPLICA -desc=ATLASUSERDISK -lg=atlas-user-link-group 2500000000000 "-1"
reserve -vog=/atlas/uk -vor=NULL -acclat=ONLINE -retpol=REPLICA -desc=ATLASLOCALGROUPDISK -lg=atlas-localgroup-link-group 9000000000000 "-1"
reserve -vog=/atlas -vor=production -acclat=ONLINE -retpol=REPLICA -desc=ATLASGROUPDISK -lg=atlas-group-link-group 3000000000000 "-1"
I'm not sure the last one will work as expected, as I don't know how the -vog=/atlas will map onto the multiple VOMS groups in the LinkGroupAuthorization.conf file. But I've no idea how to specify multiple VOMS groups there.
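
For reference, the big trailing numbers in the reserve commands are sizes in bytes (decimal units, 1 GB = 10^9 bytes); a quick sanity check of the figures, using a helper of my own naming:

```shell
#!/bin/sh
# Convert integer gigabytes to the byte count SrmSpaceManager expects.
gb_to_bytes() {
  echo "$(( $1 * 1000000000 ))"
}
gb_to_bytes 2500   # ATLASUSERDISK:       2.5 TB
gb_to_bytes 9000   # ATLASLOCALGROUPDISK: 9 TB
gb_to_bytes 3000   # ATLASGROUPDISK:      3 TB
```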

OK, that should get us the Space tokens, but Atlas are also requesting specific permissions on directories and that's completely orthogonal to the space tokens. All I've got to play with there are the normal UNIX users and groups.

So I started off by creating six extra groups and making each one the primary group of a single pool account (which is also in the main atlas group). I also added the atlasprd account to the physics group groups, since they want that to have write access to the group areas. Here's the relevant bit from /etc/group; you can work out the changes to /etc/passwd yourselves.
Now I've got the users and groups set up, I can create the directories:
mkdir /pnfs/
chown atlas007:atl-uk /pnfs/
chmod 755 /pnfs/
[root@heplnx204 etc]# ls -l /pnfs/
total 3
drwxrwxr-x 1 atlas005 atl-b 512 Aug 18 13:21 phys-beauty
drwxrwxr-x 1 atlas002 atl-exo 512 Aug 18 13:21 phys-exotics
drwxrwxr-x 1 atlas003 atl-higg 512 Aug 18 13:21 phys-higgs
drwxrwxr-x 1 atlas006 atl-sm 512 Aug 18 13:21 phys-sm
drwxrwxr-x 1 atlas004 atl-susy 512 Aug 18 13:21 phys-susy
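
Those per-group directories could be created in one go with something like the following dry-run sketch. The function name and the /pnfs prefix are placeholders of mine (the real atlas path is site-specific), and the owner/group pairs and 775 mode match the listing above; drop the echoes to apply for real.

```shell
#!/bin/sh
# Print the mkdir/chown/chmod commands for each physics group directory.
# $1 is the atlas base directory in pnfs (placeholder below).
pnfs_group_dirs() {
  base=$1
  for spec in "exotics atlas002 atl-exo" "higgs atlas003 atl-higg" \
              "susy atlas004 atl-susy" "beauty atlas005 atl-b" \
              "sm atlas006 atl-sm"; do
    set -- $spec   # $1=group suffix, $2=owner account, $3=unix group
    echo "mkdir $base/phys-$1"
    echo "chown $2:$3 $base/phys-$1"
    echo "chmod 775 $base/phys-$1"   # drwxrwxr-x, as in the listing
  done
}
pnfs_group_dirs /pnfs/example.ac.uk/data/atlas   # hypothetical prefix
```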
But now I have to make sure dCache maps the right VOMS credentials to the correct account.
First off, in /etc/grid-security/storage-authzdb:
authorize atlas001 read-write 37101 24259 / / /
authorize atlas002 read-write 37102 24358 / / /
authorize atlas003 read-write 37103 24359 / / /
authorize atlas004 read-write 37104 24360 / / /
authorize atlas005 read-write 37105 24361 / / /
authorize atlas006 read-write 37106 24362 / / /
authorize atlas007 read-write 37107 24365 / / /
authorize atlasprd read-write 51000 24259 / / /
and in /etc/grid-security/grid-vorolemap:
# Added role /alice/Role=production
"*" "/alice/Role=production" aliceprd

# Added role /atlas
"*" "/atlas" atlas001
"*" "/atlas/phys-exotics" atlas002
"*" "/atlas/phys-higgs" atlas003
"*" "/atlas/phys-susy" atlas004
"*" "/atlas/phys-beauty" atlas005
"*" "/atlas/phys-sm" atlas006
"*" "/atlas/uk" atlas007

# Added role /atlas/Role=lcgadmin
"*" "/atlas/Role=lcgadmin" atlas001

This has not been fully tested yet; in particular, it's not clear that the ATLASGROUPDISK space token will behave the way I expect.

Oh, and doing this has once again made me realise that I don't really understand what Units and Links are and do in dCache, so I'm offering a beer to anyone who can explain this to me.

Update on 28/08/08

It looks like this doesn't work fully: dCache doesn't support secondary groups, so the atlasprd user, which is in group atlas, cannot write to the /pnfs/* areas even though it has secondary membership of the groups which do have write access. I'm now waiting for feedback from Atlas on how they want the permissions configured in view of this.