Tuesday, February 27, 2007

glite UI update fix

The last two gLite updates had a missing dependency on the UI: in particular, the glite-ui-config rpm requires python-fpconst.
You can get this rpm from CERN; see:

http://linuxsoft.cern.ch/repository//python-fpconst.html

Use
wget http://linuxsoft.cern.ch/cern/SLC30X/i386/SL/RPMS/python-fpconst-0.6.0-3.noarch.rpm

to add this to your local repository.
Then yum -y update will work once again.
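
For reference, the whole sequence looks roughly like this (a sketch: the repository path and the use of yum-arch to rebuild the headers are assumptions about the local setup):

# Download the rpm and drop it into the locally served repository
wget http://linuxsoft.cern.ch/cern/SLC30X/i386/SL/RPMS/python-fpconst-0.6.0-3.noarch.rpm
cp python-fpconst-0.6.0-3.noarch.rpm /var/www/html/lcg-extras/
# Regenerate the headers for SL3-era yum
yum-arch /var/www/html/lcg-extras
# The update should now resolve SOAPpy's dependency
yum -y update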

Monday, February 26, 2007

Another worker node hard drive failure at Oxford

t2wn37's hard drive failed over the weekend. Dell will replace it.
This follows on from t2wn04 last week, and the PSU in t2lfc01 a few weeks before that. t2lfc01 was one of the GridPP-supplied nodes from Streamline; replacement of the PSU took several weeks.

Tuesday, February 20, 2007

Fusion jobs successfully running at Oxford

FUSION jobs have now run successfully at Oxford.

Birmingham Network reconfiguration

Yves reports:
Our site was down from 10am yesterday morning until 10am today due to a network problem which IS have linked to a faulty link with a campus switch.
....
IS have temporarily disabled the physics link to the library switch, one of our two links to the network, and this has fixed the connectivity problem from the outside world to our grid box.

They will re-instate the link (for resilience) when they've got to the bottom of the problem (faulty fibre, or whatever).

So it will be interesting to see the gridmon results in the current configuration while waiting for IS to understand the problem.

This may be the cause of the 33% UDP packet loss we have been seeing to/from Birmingham.

FUSION VO Problems at Oxford

At Oxford we had reports of problems from the FUSION VO:
"We have checked that FUSION jobs fail at your site with the error "37 the provided RSL 'queue' parameter is invalid". This is because "fusion" is missing at the end of the file /opt/globus/share/globus_gram_job_manager/lcgpbs.rvf in your CE ("fusion" should be included in the list of Values of the attribute "queue"). We also noticed that the FUSION VOMS server certificate ([1]) is not installed at /etc/grid-security/vomsdir/ in your CE."

I downloaded the cert from:
http://swevo.ific.uv.es/vo/files/swevo.ific.uv.es-oct2006.pem

and ran
/opt/glite/yaim/scripts/run_function /root/yaim-conf/site-info.def config_globus
which made the four VOs I recently added appear in the files lcgpbs.rvf and pbs.rvf in
/opt/globus/share/globus_gram_job_manager/.
I can only assume that there were errors when we first ran yaim, as the four new VOs had not appeared then.
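
For anyone hitting the same thing, the fix amounts to something like the following (a sketch; the grep pattern is illustrative of the RVF file's Attribute/Values layout described in the error report):

# Install the FUSION VOMS server certificate on the CE
wget http://swevo.ific.uv.es/vo/files/swevo.ific.uv.es-oct2006.pem
cp swevo.ific.uv.es-oct2006.pem /etc/grid-security/vomsdir/
# Re-run the yaim function that rewrites the RVF files
/opt/glite/yaim/scripts/run_function /root/yaim-conf/site-info.def config_globus
# Check that "fusion" now appears among the Values of the queue attribute
grep -A 2 '^Attribute: queue' /opt/globus/share/globus_gram_job_manager/lcgpbs.rvf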

Monday, February 19, 2007

Problems with DNS style VO names

We have now discovered that adding the camont VO is not straightforward, due to its new DNS-style VO name.
The current yaim cannot handle the long format for VO names. The new yaim 3.1, which is not yet released, should help, but it has not yet been tested.
Yves has had a look at it and it is very different from the current version.
"Hello all,

I got hold of the new version of yaim and there are some non-trivial differences from the production version. I think it would be ill-advised to try the new version in production. I think we could enable this new VO style by configuring GIP by hand and then performing the correct queue-to-group mapping for pbs/condor. But instead of all sites doing this (plus potential RB complications?), couldn't we revert to the current VO style (if running jobs is urgent)? I do not understand why the new VO style should be implemented on production sites when it is still awaiting certification and has not even been tried on the pre-production service.

Thanks,

Yves
"

Thursday, February 15, 2007

New VOs added at Oxford

Support for MINOS, FUSION, GEANT4 and CAMONT was added yesterday at Oxford.

The new CA rpms were also installed so now we should be green again.

Monday, February 12, 2007

Latest glite update problem on UI

Got the below error on my UI when I tried to update to the latest rpms.
This has already been reported as a GGUS ticket:
https://gus.fzk.de/ws/ticket_info.php?ticket=18358

gronbech@ppslgen:/var/local> ssh root@t2ui02 'yum -y update;pakiti'
Gathering header information file(s) from server(s)
Server: Oxford LCG Extras
Server: gLite packages
Server: gLite updated packages
Server: gLite updated packages
Server: LCG CA packages
Server: SL 3 errata
Server: SL 3 main
Finding updated packages
Downloading needed headers
Resolving dependencies
.....Unable to satisfy dependencies
Package SOAPpy needs python-fpconst >= 0.6.0, this is not available.

Steve Lloyd's ATLAS test jobs

Work was carried out at Oxford to find out why the ATLAS test jobs were not working.
It turned out there were some old references to pool accounts of the format atlas0100 and upwards, which should have been atlas100 upwards. Once all references to these were removed, the jobs started working. The problem affected both the CE and the DPM server.
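
Tracking down such stale references comes down to grepping the account maps on both boxes; a sketch (the file locations are assumptions, adjust for your CE/DPM layout):

# Old-format pool accounts had a leading zero: atlas0100 and upwards
grep -r 'atlas0[0-9]\{3\}' /etc/grid-security /opt/edg/etc 2>/dev/null
# Check whether the old-format accounts themselves still exist
getent passwd | grep '^atlas0'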

PDG requested that 12.0.5 be installed at Oxford via the web page:
https://atlas-install.roma1.infn.it/atlas_install/protected/rai.php
but wonders if he should have been using
https://atlas-install.roma1.infn.it/atlas_install/

The installation was complete by Friday 9th Feb.
Results for Oxford were all fine until the problems over the weekend.
http://hepwww.ph.qmul.ac.uk/~lloyd/atlas/atest.php

The problems at Bristol are caused by the worker nodes having very small
home disk partitions. The ATLAS software cannot be loaded as there is insufficient space to expand the tar file.
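
A quick way to confirm this on a worker node (a sketch; the mount point and tarball name are assumptions):

# Free space on the home partition
df -h /home
# Unpacked size of the release tarball, summed from the listing
tar tzvf atlas-12.0.5.tar.gz | awk '{s += $3} END {print s/1024/1024 " MB"}'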

Oxford instabilities over the weekend

The Oxford site had trouble over the weekend due to the system disk on the CE filling up.
This was mainly due to a large number of old log files; these have been migrated to part of the software directory for now.
The DPM server was also in a bad state and its services had to be restarted.
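
The kind of cleanup involved looks like this (a sketch; the log pattern and destination directory are assumptions about the Oxford layout):

# Move gatekeeper logs older than 30 days off the system disk
mkdir -p /opt/exp_soft/old-ce-logs
find /var/log -name 'globus-gatekeeper.log.*' -mtime +30 \
     -exec mv {} /opt/exp_soft/old-ce-logs/ \;
# Confirm the system disk has breathing room again
df -h /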

Meanwhile PDG is in the process of adding support for some new VOs, namely MINOS, FUSION, GEANT4 and CAMONT, while also being on TPM duty this week.

RALPPD to get another upgrade

Chris Brew announced on Friday 9th Feb:
RALPPD have been awarded another chunk of money, to be spent by March 31st 2007.
This will allow them to purchase one rack of CPUs and one rack of disks.
The CPUs will be equivalent to 275 KSI2K, bringing the total to about 600 KSI2K, and the new disks will add 78TB, bringing the total to 158TB.
This includes the 50TB currently on loan to the T1, which will be returned shortly.
The hardware will be identical to the recent T1 purchase.

Cambridge New Systems Arrive

Santanu announced on 19.1.07:

Just to let you know that all the new machines have arrived; we're just waiting for the rack to be delivered and for the Dell engineer (that's actually part of the contract) to come and switch it on.

When done, it's gonna give LCG/gLite another 128 CPUs, and if our experiment with CamGrid and Condor succeeds, it will top up another ~500 CPUs. Now we can mount /experiment-software and the LCG middleware area onto any CamGrid machine without needing root permissions, and the WNs' outbound connection is also sorted out. Now we need to think about the stupid "WN pool account" issue.

Intel calls it "Woodcrest". All the nodes are dual-core, dual-CPU, so 4 CPUs under the same roof.
Dell Model : PE1950
Processor : Xeon 5150 (2.66GHz, 4MB cache, 1333MHz FSB)
Memory : 8 x 1GB dual-rank DIMMs