SouthGrid: 2006

Thursday, November 30, 2006

School of System Engineering
The University of Reading

They are now setting up their cluster. Emails between Jeremy Coles, Pete Gronbech and Yves have been going back and forth, giving advice on setup.

Oxford: the CE has been unstable over the last few days. First the BDII crashed and was restarted.
Second, the globus_mds service crashed and was restarted. In the end I decided to reboot the node today.
New jobs seem to be arriving now and we are passing SAM tests again.

Tuesday, November 21, 2006

13.11.06
RALPPD: upgraded to Torque 2 and dCache 1.7.

Lawrie at Birmingham complained about the amount of spam sent to "lcg" style mailing lists. At least half my spam comes from the following lists:

proj-atlas-geant4-emb@cern.ch
project-lcg-vo-sites@cern.ch
project-lcg-vo-atlas-sites@cern.ch
project-lcg-security-csirts@cern.ch

and maybe one or two more. One answer would be to change these mailing list names and then not advertise them on a web page, or to make them closed lists.

I have created a new page as a header for the shared private technical documents.
See the link at the bottom of
https://www.gridpp.ac.uk/southgrid/TechnicalBoard.html

It now includes phone numbers (off the GOCDB) for each site.

Some questions arise:
Should we have a separate security email list for RAL site security, covering major problems and not just LCG VO type problems?

Should Cambridge have an lcg-security email list that includes the site one? My worry is that LCG security challenges may not be seen by Santanu if he is not directly on the cert@cam list.

Thursday, October 26, 2006

A Torque security flaw was made public last week. Most SouthGrid sites shut down their queues on Friday night,
then patched over the next 3 days.
Cambridge was not affected as they use Condor.

The Sysman meeting at Cambridge was followed by the SouthGrid Technical meeting. Shared support was discussed and generally agreed, but it needs to be formalised and put in place at Birmingham. RALPPD were not there to discuss it.
Pete had another meeting with the Oxford Campus Network guys, who have been doing some tests and managed to get better throughput by tweaking the kernel. It is now thought that the change that happened on August 15th may have increased the latency between sites, and that this may be the cause of the reduced bandwidth.
Further tests will be done.
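
To illustrate why extra latency alone could reduce bandwidth: a single TCP stream cannot move more than one window of data per round trip, so its throughput is bounded by window size divided by round-trip time. The sketch below is only a back-of-the-envelope calculation. The function names are made up for illustration, and the window and latency figures are assumed example values, not measurements from the Oxford tests.

```python
import math


def max_tcp_throughput_mbit(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-stream TCP throughput in Mbit/s.

    At most one window of data can be in flight per round trip, so the
    bound is window / RTT.
    """
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6


def window_needed_bytes(target_mbit: float, rtt_ms: float) -> int:
    """TCP window (bandwidth-delay product) needed to sustain target_mbit."""
    return math.ceil(target_mbit * 1e6 / 8 * rtt_ms / 1000.0)


# Assumed example values: a 64 KiB window at two different round-trip times.
for rtt in (2.0, 10.0):
    bound = max_tcp_throughput_mbit(64 * 1024, rtt)
    print(f"rtt={rtt} ms: at most {bound:.1f} Mbit/s per stream")

# Window needed to fill an assumed 1000 Mbit/s path at 10 ms round trip.
print(window_needed_bytes(1000, 10.0), "bytes")
```

With a fixed window, the bound falls in direct proportion to any increase in round-trip time, which is why raising the kernel's TCP buffer limits can recover throughput.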

Friday, October 13, 2006

SouthGrid has been busy taking part in the Service Challenge Throughput tests.
Oxford carried out tests on September 26th and 27th. The first, Oxford to RALPP, went OK although slowly; the second, RALPP to Oxford, was extremely disappointing, with practically zero bandwidth.

Further iperf tests have been carried out between servers both within Oxford and SouthGrid. At the moment the main cause of the problem seems to be the installation of a new Campus Firewall on August 15th. See: the Gridmon web site.

Work with David Wallom of Oxford Grid has continued in order to enable the NGS VO on Oxford's cluster.

Another worker node PSU failed two weeks ago and was replaced under warranty by Dell.

Wednesday, September 06, 2006

Security Update

Oxford UI updated; other nodes proceeding.
Bristol, Cambridge and Birmingham have confirmed they too have updated.

Tuesday, September 05, 2006

Oxford updated

Over the weekend I updated the rpms on Oxford's grid nodes. On Monday I re-ran yaim to make the changes take effect. Testing was hampered by the SFT page being very slow and the submission page not working well. On Tuesday I discovered Oxford had been failing the SFTs because PBS was not working properly: there was no longer a nodes file. This was traced to a typo in the site-info.def file. Yaim was re-run, which seemed to cure things.
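
A quick check after re-running yaim could catch a missing nodes file before the SFTs do. The following is a minimal sketch, assuming a Torque-style nodes file with one node per line (name first, optional attributes such as np=2 after it). The function name is hypothetical, the path is left as a parameter because it varies between installations, and the node names in the demo are invented.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def read_nodes_file(path: str) -> List[str]:
    """Return the node names listed in a Torque-style nodes file.

    Raises FileNotFoundError if the file is missing and ValueError if it
    contains no node entries, the two cases that would leave PBS with
    nothing to schedule on.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"nodes file missing: {path}")
    names = []
    for line in p.read_text().splitlines():
        line = line.strip()
        # Skip blank lines and comments; the node name is the first field.
        if line and not line.startswith("#"):
            names.append(line.split()[0])
    if not names:
        raise ValueError(f"nodes file has no entries: {path}")
    return names


# Demo against a temporary file with invented node names.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "nodes")
    Path(demo).write_text("wn01.example.org np=2\nwn02.example.org np=2\n")
    print(read_nodes_file(demo))
```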

Yves has had trouble getting >250Mb/s from Birmingham. He thinks this is due to DPM problems rather than raw bandwidth issues, as iperf tests give much better results.

We are now testing Oxford instead of Birmingham.

Friday, September 01, 2006

Friday pm

Bristol - Birmingham tests due to start today.
Winnie had some problems diagnosing the error messages from the transfer tests. Yves thinks the documentation is OK if you are an expert, but it could be improved.

Yves has been benchmarking new systems, Intel Woodcrest vs AMD Opteron and Ethernet vs Infiniband, to provide data for Birmingham's future eScience cluster purchase. Some results will be available later.

update

Four sites out of 5 are now running gLite 3.0.2.
Just Oxford to go, which is being upgraded today.
The rm failures at Oxford were cured by re-running yaim on the SEs. Maybe gridftp had gone mad?

EFDA-JET is now fully operational and running SFTs successfully.

Yves is continuing to test DPM-DPM throughput and has been tuning the kernel and TCP/IP parameters to optimise performance.

Yves also carried out CASTOR DPM tests last Sunday; see http://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_SRM_tests_T1toT2

Friday, August 25, 2006

Condor Problems at Cambridge

The gLite CE requires a version of Condor which is a development fork, and not the production release.
Santanu expects very few production sites ever to consider using a development release, and yet LCG has a dependency on it.
The second number in the release version is even for production and odd for development.
The numbers in question are: LCG requires 6.7.10-1, but Cambridge says 6.6.x-x is more likely at a production site, or maybe the next release, which will be 6.8.x-x.
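
The even/odd rule can be written down in a few lines. This is just an illustrative sketch of the rule as stated above, with a hypothetical function name, applied to the three version strings mentioned in this post.

```python
def condor_series(version: str) -> str:
    """Classify a Condor version string as 'production' or 'development'.

    Uses the rule stated above: an even second number marks a production
    series, an odd second number a development series.
    """
    # Drop any package release suffix, such as the "-1" in "6.7.10-1".
    base = version.split("-", 1)[0]
    parts = base.split(".")
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError(f"cannot read a series number from {version!r}")
    return "production" if int(parts[1]) % 2 == 0 else "development"


for v in ("6.7.10-1", "6.6.x-x", "6.8.x-x"):
    print(v, "->", condor_series(v))
```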

First Post

Visited Yves at Birmingham on Monday 21st. Discussed the throughput tests he has been carrying out between Bristol, RAL and Birmingham. Carried out some tests between Oxford and Birmingham.

Whole-building power testing at Oxford on Wednesday 23rd. Set the queues to disabled on Monday to force them to drain. I had previously marked all the nodes offline, which meant Oxford failed some SFTs; just disabling the VO queues is a better way to do it. All systems came back OK on Thursday morning.

Yves has been helping Culham get the EFDA-JET site up and running via email. They are very nearly there.
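
The two approaches to draining can be contrasted with a small sketch that builds the relevant Torque commands without running them. It assumes the standard `qmgr` and `pbsnodes` command-line tools; the helper names are hypothetical and the queue and node names are invented examples, not Oxford's actual configuration.

```python
from typing import List


def queue_enable_cmd(queue: str, enabled: bool) -> List[str]:
    """Build the qmgr command that enables or disables one queue.

    A disabled queue accepts no new jobs, while jobs already queued or
    running are left to finish, so the queue drains on its own.
    """
    return ["qmgr", "-c", f"set queue {queue} enabled = {enabled}"]


def node_offline_cmd(node: str) -> List[str]:
    """Build the pbsnodes command that marks one worker node offline."""
    return ["pbsnodes", "-o", node]


# Invented queue names for illustration: drain the VO queues before the
# power test, then re-enable them afterwards.
vo_queues = ["atlas", "lhcb"]
for q in vo_queues:
    print(" ".join(queue_enable_cmd(q, False)))
for q in vo_queues:
    print(" ".join(queue_enable_cmd(q, True)))
```

Returning the commands as argument lists keeps the sketch side-effect free; they could be passed to `subprocess.run` on the Torque server once the queue names have been checked.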
