Tuesday, September 06, 2011

Installing and Deploying a Cluster Publisher

As part of the battle to replace our LCG-CEs with CreamCEs, I realised that the reason one of our new CreamCEs was not getting many jobs was that it was not publishing a cluster/subcluster into the BDII (despite having a /var/lib/bdii/gip/static-file-Cluster.ldif file), and so presumably wasn't matching any resources.
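For the record, the quick way to spot this sort of thing is to query the CE's resource BDII directly. A rough sketch - heplnx206 is the CreamCE in question and 2170 the standard resource BDII port:

ldapsearch -x -LLL -h heplnx206.pp.rl.ac.uk -p 2170 -b mds-vo-name=resource,o=grid '(objectClass=GlueCluster)' GlueClusterUniqueID

If that returns nothing, the CE isn't advertising a cluster and so has nothing for jobs to match against.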

Since I eventually wanted to move to a stand-alone Cluster Publisher anyway, I thought it would be easiest to push ahead and install that rather than install one on the CreamCE and remove it later.

So with a shiny new VM and certificate in hand, I plunged onwards.

First step was to define the cluster variables in site-info.def (or in this case a specific node file):

cat /opt/glite/yaim/etc/nodes/heplnv146.pp.rl.ac.uk
CE_HOST_heplnx206_pp_rl_ac_uk_CE_TYPE=cream
CE_HOST_heplnx206_pp_rl_ac_uk_CE_InfoJobManager=pbs
CE_HOST_heplnx206_pp_rl_ac_uk_QUEUES="grid"
CE_HOST_heplnx207_pp_rl_ac_uk_CE_TYPE=cream
CE_HOST_heplnx207_pp_rl_ac_uk_CE_InfoJobManager=pbs
CE_HOST_heplnx207_pp_rl_ac_uk_QUEUES="grid"
CLUSTER_HOST=heplnv146.pp.rl.ac.uk
CLUSTERS=GRID
CLUSTER_GRID_CLUSTER_UniqueID=grid.pp.rl.ac.uk
CLUSTER_GRID_CLUSTER_Name=grid.pp.rl.ac.uk
CLUSTER_GRID_SITE_UniqueID=UKI-SOUTHGRID-RALPP
CLUSTER_GRID_CE_HOSTS="heplnx206.pp.rl.ac.uk heplnx207.pp.rl.ac.uk"
CLUSTER_GRID_SUBCLUSTERS="GRID"
SUBCLUSTER_GRID_SUBCLUSTER_UniqueID=grid.pp.rl.ac.uk
SUBCLUSTER_GRID_HOST_ApplicationSoftwareRunTimeEnvironment="
LCG-2
LCG-2_1_0
LCG-2_1_1
LCG-2_2_0
LCG-2_3_0
LCG-2_3_1
LCG-2_4_0
LCG-2_5_0
LCG-2_6_0
LCG-2_7_0
GLITE-3_0_0
RALPP
SOUTHGRID
GRIDPP
R-GMA
"
SUBCLUSTER_GRID_HOST_ArchitectureSMPSize=4
SUBCLUSTER_GRID_HOST_ArchitecturePlatformType=x86_64
SUBCLUSTER_GRID_HOST_BenchmarkSF00=0
SUBCLUSTER_GRID_HOST_BenchmarkSI00=2390
SUBCLUSTER_GRID_HOST_MainMemoryRAMSize=2000
SUBCLUSTER_GRID_HOST_MainMemoryVirtualSize=2000
SUBCLUSTER_GRID_HOST_NetworkAdapterInboundIP=FALSE
SUBCLUSTER_GRID_HOST_NetworkAdapterOutboundIP=TRUE
SUBCLUSTER_GRID_HOST_OperatingSystemName=ScientificSL
SUBCLUSTER_GRID_HOST_OperatingSystemRelease=5.4
SUBCLUSTER_GRID_HOST_OperatingSystemVersion=Boron
SUBCLUSTER_GRID_HOST_ProcessorClockSpeed=2300
SUBCLUSTER_GRID_HOST_ProcessorModel=Xeon
SUBCLUSTER_GRID_HOST_ProcessorOtherDescription='Cores=3.7656,Benchmark=9.56-HEP-SPEC06'
SUBCLUSTER_GRID_HOST_ProcessorVendor=Intel
SUBCLUSTER_GRID_SUBCLUSTER_Name=grid.pp.rl.ac.uk
SUBCLUSTER_GRID_SUBCLUSTER_PhysicalCPUs=546
SUBCLUSTER_GRID_SUBCLUSTER_LogicalCPUs=2056
SUBCLUSTER_GRID_SUBCLUSTER_WNTmpDir=/scratch
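For reference, YAIM turns those variables into the GlueCluster/GlueSubCluster entries in static-file-Cluster.ldif; from memory the subcluster entry comes out roughly like this (trimmed to the interesting attributes):

dn: GlueSubClusterUniqueID=grid.pp.rl.ac.uk,GlueClusterUniqueID=grid.pp.rl.ac.uk,mds-vo-name=resource,o=grid
objectClass: GlueSubCluster
GlueSubClusterUniqueID: grid.pp.rl.ac.uk
GlueSubClusterName: grid.pp.rl.ac.uk
GlueSubClusterPhysicalCPUs: 546
GlueSubClusterLogicalCPUs: 2056
GlueHostOperatingSystemName: ScientificSL
GlueHostOperatingSystemRelease: 5.4
GlueHostBenchmarkSI00: 2390
GlueHostProcessorOtherDescription: Cores=3.7656,Benchmark=9.56-HEP-SPEC06
GlueChunkKey: GlueClusterUniqueID=grid.pp.rl.ac.uk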

Then it was a simple case of installing the rpms and running YAIM:

yum install emi-cluster
/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-CLUSTER

At that point we seemed to have a working system: the BDII was running and queryable, I could connect to the gridftp server, and it had set up experiment and cluster directories in /opt/edg/var/info/ and /opt/glite/var/info/.
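For the record, the checks were along these lines (the ldapsearch base is the standard resource BDII one):

ldapsearch -x -h heplnv146.pp.rl.ac.uk -p 2170 -b mds-vo-name=resource,o=grid
ls /opt/edg/var/info/ /opt/glite/var/info/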

Fine. The next step was to rsync the contents of those directories from the torque server that currently exports them to the CEs - well, actually to /export/gridtags and /export/glitetags, with the previous locations symlinked to those. cfengine had already set the node up as an NFS server for me, so exporting the new areas and updating the CEs to mount them from there was a matter of moments.
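In outline (the torque server hostname here is a stand-in, and the export options are illustrative):

rsync -a torque-server:/opt/edg/var/info/ /export/gridtags/
rsync -a torque-server:/opt/glite/var/info/ /export/glitetags/
mv /opt/edg/var/info /opt/edg/var/info.orig
ln -s /export/gridtags /opt/edg/var/info
mv /opt/glite/var/info /opt/glite/var/info.orig
ln -s /export/glitetags /opt/glite/var/info

and in /etc/exports on the cluster publisher:

/export/gridtags  *.pp.rl.ac.uk(rw,sync,no_root_squash)
/export/glitetags *.pp.rl.ac.uk(rw,sync,no_root_squash)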

A quick check of the resource BDII looked fine, so it was a simple matter to add the new source into the site BDII and tweak the static-file-CE.ldif file on the CreamCE to assign it to the new cluster.
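In YAIM terms that's just another region for the site BDII plus re-pointing the CE's foreign key, roughly like this (the CLUSTER region name is arbitrary and the variable spelling is from memory):

BDII_REGIONS="CE SE CLUSTER"
BDII_CLUSTER_URL="ldap://heplnv146.pp.rl.ac.uk:2170/mds-vo-name=resource,o=grid"

and in static-file-CE.ldif on the CreamCE:

GlueForeignKey: GlueClusterUniqueID=grid.pp.rl.ac.uk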

One thing remained: when testing the gridftp server with uberftp* I'd noticed that I was not mapped to my usual pool account - not surprising, as I had not mounted the site gridmapdir, so it was using its local one. However, reasoning that the gridftp server was the same rpm as the one on the CreamCE, which uses Argus for authentication and mapping, I had a poke around on the CreamCE and in YAIM, installed the argus-gsi-pep-callout rpm, and copied over /etc/grid-security/gsi-authz.conf and /etc/grid-security/gsi-pep-callout.conf from the CreamCE.
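Both files are short; from memory ours ended up looking roughly like this (the Argus hostname and resource id below are placeholders rather than our real values):

# /etc/grid-security/gsi-authz.conf
globus_mapping /usr/lib64/libgsi_pep_callout.so argus_pep_callout

# /etc/grid-security/gsi-pep-callout.conf
pep_ssl_server_capath /etc/grid-security/certificates/
pep_ssl_client_cert /etc/grid-security/hostcert.pem
pep_ssl_client_key /etc/grid-security/hostkey.pem
pep_url https://argus.example.ac.uk:8154/authz
xacml_resourceid http://example.ac.uk/authz/resource/gridftp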

Another quick test with uberftp and yes, I am mapped to my normal pool account, so it appears I have a Cluster Publisher with Argus integration working. That means the only things at the site not using Argus are the gLite CreamCE, which will soon be replaced by another EMI one, and dCache, which will get Argus-based banning when I update to the next Golden Release.

*uberftp heplnv146.pp.rl.ac.uk "ls /etc"