Tuesday, March 27, 2007

Oxford tries out MonAMI

During the GridPP collaboration meeting I was persuaded to give MonAMI a go.
Installing the rpm from the sourceforge web site was easy enough.
http://monami.sourceforge.net/
Also see the link from the gridpp wiki http://www.gridpp.ac.uk/wiki/MonAMI

As I already use ganglia, the idea was that I'd run some checks on disk space and DPM and send the output to ganglia. The first thing we noticed was that some of the features need at least v3 of ganglia. I was still running v2.5, so a quick upgrade of the gmond rpms and a new gmond.conf were required.
You also need MySQL installed (for the DPM plugin - more on that later).
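For anyone making the same ganglia upgrade, the relevant part of a v3 gmond.conf looks roughly like this (a sketch using the same multicast address and port as the MonAMI [ganglia] section below; substitute your own site's values):

# send and receive metrics on the site multicast channel
udp_send_channel {
  mcast_join = 239.2.11.95
  port = 8656
}
udp_recv_channel {
  mcast_join = 239.2.11.95
  port = 8656
  bind = 239.2.11.95
}
# gmond answers XML queries over TCP on this port
tcp_accept_channel {
  port = 8649
}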

The main configuration file is /etc/monami.conf, but this can read further files in /etc/monami.d, so we set about making a basic file to monitor the root file system.

[filesystem]
name=root-fs
location=/

[sample]
interval=1m
read = root-fs.blocks.free
write = ganglia

[snapshot]
name=simple-snapshot
filename=/tmp/monami-simple-snapshot

[ganglia]
multicast_ip_address = 239.2.11.95
multicast_port = 8656
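
With that in place the daemon can be started and checked. This is only a sketch - I'm assuming the rpm installs an init script called monami, and the filename and port are the ones configured above:

/etc/init.d/monami start
# only populated if the sample's write line also lists simple-snapshot
cat /tmp/monami-simple-snapshot
# gmond dumps its XML over TCP; the root-fs metrics should show up here
telnet localhost 8649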


more coming soon....

Thursday, March 08, 2007

CAMONT jobs successfully running at Oxford

The CAMONT VO has now been working correctly at Oxford since Friday 2nd March.
Karl Harrison has been submitting jobs from Cambridge.

In another Cambridge collaboration, LHCb software has been installed on a Windows Server 2003 test node at Oxford by Ying Ying Li from Cambridge. They are testing the use of Windows for LHCb analysis code and, having tested it at Cambridge, want to prove it can work at other sites. Ideally they would like some more test nodes and 0.5 TB of disk space, which may be harder to find.

Cambridge ran the Atlas DPM ACL fix on Monday 5th when I (PDG) visited Santanu. Now all SouthGrid sites have run the required fix.

I took the opportunity to measure the power consumption of the new Dell 1950s (Intel 5150 CPUs). Idle power consumption is about 200W, rising to 285W under load (4 CPU-intensive jobs).

Thursday, March 01, 2007

Oxford and the ATLAS DPM ACL fix.

I tried to run the ATLAS patch program yesterday to fix the ACLs on the DPM server at Oxford.
The update is provided as a binary from ATLAS that has to be run as root on the SE. This was potentially dangerous, and many sites had delayed running it, objecting to the fact that we don't really know what it is doing. Anyway, the pragmatic approach seemed to be that most other sites had run it by now, so I would too.
The configuration file has to be edited to match the local site's configuration.
I performed a normal file backup using the HFS backup software, Tivoli Storage Manager:
dsmc incr
Then dumped the MySQL database:
mysqldump --user=root --password=****** --opt --all-databases | gzip -c > mysql-dump-280207.sql.gz
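If a restore were ever needed, the dump could be fed back in the obvious way (untested here, just noting it for the record):

gunzip -c mysql-dump-280207.sql.gz | mysql --user=root --password=******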
As our main DPM server was currently set read-only (to cope with the DPM bug of not sharing across pools properly), we set it back to read/write for the update:
dpm-modifyfs --server t2se01.physics.ox.ac.uk --fs /storage --st 0
Then I ran the update program (referred to as a script in some docs):
./UpdateACLForMySQL
Unfortunately I had used the wrong password in the config file, so it failed. This is where a strange feature of the update program was discovered: after it runs it removes several entries from the config file, namely the password and the gid entry. After several attempts with the correct config file the update appears to have been successful.
Running

dpns-getacl /dpm/physics.ox.ac.uk/home/atlas/dq2

shows the ACLs:
# file: /dpm/physics.ox.ac.uk/home/atlas/dq2
# owner: atlas002
# group: atlas
user::rwx
group::rwx #effective:rwx
group:atlas/Role=production:rwx #effective:rwx
mask::rwx
other::r-x
default:user::rwx
default:group::rwx
default:group:atlas/Role=production:rwx
default:mask::rwx
default:other::r-x
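
For comparison, adding a similar entry to a single directory by hand would look something like this dpns-setacl call (a sketch only, with numeric permissions where 7 = rwx; not necessarily what the ATLAS binary actually does internally):

dpns-setacl -m g:atlas/Role=production:7,m:7,d:g:atlas/Role=production:7,d:m:7 /dpm/physics.ox.ac.uk/home/atlas/dq2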


I reset the main DPM server back to read-only:
dpm-modifyfs --server t2se01.physics.ox.ac.uk --fs /storage --st RDONLY
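
A quick dpm-qryconf afterwards should confirm the filesystem's status is back to RDONLY:

dpm-qryconf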

The process was not simple or clear, and I hope not to have to do more of these for other VOs...



Birmingham suffering from multiple hardware failures

The BaBar cluster at Birmingham, which is made up of older kit salvaged from QMUL and Bristol plus the original Birmingham cluster, is suffering from hardware problems.
Seven worker node disks have died, some systems have kernel panics, and the globus MDS service is playing up. Yves is working hard to fix things, but maybe we are just getting to the end of the useful life of much of this kit?