Friday, December 19, 2008

Automount problems on torque server

We've been having a few problems with our torque server randomly failing to automount disks.

Most of the time the mounts succeeded but occasionally they would fail with just:

Dec 19 08:05:06 heplnx201 kernel: RPC: error 5 connecting to server nfsserver
Dec 19 08:05:06 heplnx201 automount[23438]: >> mount: nfsserver:/opt/ppd/mount: can't read superblock
Dec 19 08:05:06 heplnx201 automount[23438]: mount(nfs): nfs: mount failure nfsserver:/opt/ppd/mount on /net/mount
Dec 19 08:05:06 heplnx201 automount[23438]: failed to mount /net/mount
Dec 19 08:05:07 heplnx201 kernel: RPC: Can't bind to reserved port (98).
Dec 19 08:05:07 heplnx201 kernel: RPC: can't bind to reserved port.

With the wonders of Google I was able to find out that error 98 is "address already in use", and that what is going on is that the client is unable to find a free port in its port range from which to initiate the connection to the server.

The culprit seems to be torque which, when I checked with a netstat -a, was using every single port from 600 to 1023, quite neatly overlaying the nfs client port range of 600-1023.

Here Google failed me: I was unable to find any way to limit the port range used by torque.

So for now I've taken the quick option of extending the nfs client port range down to port 300 with:

echo 300 > /proc/sys/sunrpc/min_resvport
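To make that survive a reboot, the same setting could presumably go in /etc/sysctl.conf, assuming the sunrpc sysctl is wired up the usual way on this kernel (untested here, so treat it as a sketch):

# let the NFS client pick source ports from 300 upwards, clear of torque
sunrpc.min_resvport = 300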

I think I'd like to move the nfs client port range out of the privileged port range altogether. This should be possible: the RFC says the client SHOULD use a port below 1024 but MAY use a higher one. I'd like to test it a bit before I configure a major server like that, though.

static-file-Cluster.ldif edit required post yaim at Oxford

Every time we run yaim at Oxford we have to fix the number of CPUs in our cluster by hand.
On t2ce02:
diff static-file-Cluster.ldif-fixed /opt/glite/etc/gip/ldif/static-file-Cluster.ldif
64c64
< GlueSubClusterPhysicalCPUs: 384
---
> GlueSubClusterPhysicalCPUs: 2
[root@t2ce02 ~]# cp static-file-Cluster.ldif-fixed /opt/glite/etc/gip/ldif/static-file-Cluster.ldif


On t2ce04:
The physical CPU count needs to be 74. After the change, the ldap query shows:
ldapsearch -x -H ldap://t2bdii01.physics.ox.ac.uk:2170 -b Mds-vo-name=UKI-SOUTHGRID-OX-HEP,o=grid|grep -i physicalcpu
GlueSubClusterPhysicalCPUs: 74
GlueSubClusterPhysicalCPUs: 384
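Until we track down which yaim/GIP variable should be setting this, one option would be to re-apply the hand-fixed file automatically whenever yaim is run, along the lines of the sketch below (the wrapper itself, the node type and the location of the fixed copy are just illustrative):

#!/bin/bash
# run yaim as usual (node types as normally configured on this CE),
# then put the hand-fixed GIP ldif back in place
/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n lcg-CE
cp /root/static-file-Cluster.ldif-fixed /opt/glite/etc/gip/ldif/static-file-Cluster.ldif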



Tuesday, December 16, 2008

EFDA-JET Service nodes upgraded to glite 3.1

We upgraded our service nodes to Scientific Linux 4.7 and glite-3.1. The worker nodes had been upgraded earlier. The problems/issues we had during the upgrade are listed below:

Storage Element

While installing the SE glite middleware (glite-SE_dpm_mysql), there was
a missing dependency issue for the perl-SOAP-Lite package.

Error: Missing Dependency: perl-SOAP-Lite >= 0.67 is needed by package
gridview-wsclient-common

Doing a

# yum install perl-SOAP-Lite

only installs perl-SOAP-Lite-0.65, which is lower than the version needed.

The perl-SOAP-Lite rpm therefore had to be downloaded from a different repository. We initially downloaded perl-SOAP-Lite-0.67.el4, but this failed to install as it needed MQSeries and other packages to be installed. We finally downloaded perl-SOAP-Lite-0.67-1.1.fc1.rf.noarch.rpm and it installed without any problems.
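The working package was then installed directly with rpm, roughly:

# rpm -Uvh perl-SOAP-Lite-0.67-1.1.fc1.rf.noarch.rpm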

When the node was configured by yaim, the following error was obtained

sed: can't read /opt/bdii/etc/schemas: No such file or directory

The file /opt/bdii/etc/schemas was missing. The fix is to copy the schemas.example file to schemas:

# cp -i /opt/bdii/doc/schemas.example /opt/bdii/etc/schemas

The first SAM test failed: lcg-lr was missing and we needed to install lcg_util. This installed a newer version of lcg_util than was on the other nodes, so lcg_util was then updated on all the nodes.
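In practice that was just something like:

# yum install lcg_util

followed by a yum update of lcg_util on the nodes that already had it.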

Compute Element (& site BDII)

We run the compute element service and the site BDII service on the same node.

While installing the glite-BDII packages, we obtained the following dependency errors.

Error: Missing Dependency: glite-info-provider-ldap = 1.1.0-1 is needed by package glite-BDII
Error: Missing Dependency: glue-schema = 1.3.0-3 is needed by package glite-BDII
Error: Missing Dependency: bdii = 3.9.1-5 is needed by package glite-BDII

Using yum to install the missing packages installs them at a higher version and still causes the installation of the glite-BDII packages to fail, as it needs these packages at exactly the versions listed above. These packages were instead installed by hand. A GGUS ticket (Ticket-ID: 42456) suggested that this problem is fixed in the latest release (update 34).
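Installing "by hand" here means fetching the rpms for those exact versions from the repository and installing them directly rather than letting yum resolve to newer ones; roughly the following, with the exact filenames depending on the repository:

# rpm -Uvh glite-info-provider-ldap-1.1.0-1.noarch.rpm \
           glue-schema-1.3.0-3.noarch.rpm \
           bdii-3.9.1-5.noarch.rpm        # exact filenames/architectures will differ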

As with the SE install above, the schemas file was missing; the same fix was applied here.

When running yaim, we had the following errors,

grep: a: No such file or directory
grep: VO: No such file or directory
grep: or: No such file or directory
grep: a: No such file or directory
grep: VOMS: No such file or directory
grep: FQAN: No such file or directory
grep: as: No such file or directory
grep: an: No such file or directory
grep: argument: No such file or directory
qmgr: Syntax error - cannot locate attribute
set queue lhcb acl_groups += /opt/glite/yaim/bin/yaim: supply a VO or a VOMS FQAN as an argument

To fix it we edited the file /opt/glite/yaim/functions/utils/users_getvogroup and commented out

#echo "$0: supply a VO or a VOMS FQAN as an argument"

On the Gstat web monitoring page, it was being reported that the SE service was missing ('SE missing in Gstat service'). To fix this problem, we edited the file /opt/bdii/etc/bdii-update.conf and added the following line for our SE:

SE ldap://grid001.jet.efda.org:2170/mds-vo-name=resource,o=grid

Mon Box

When running yaim, we had the following errors


Problem starting rgma-servicetool

Starting rgma-servicetool: [FAILED]
For more details check /var/log/glite/rgma-servicetool.log
Stopping rgma-gin: [ OK ]
Starting rgma-gin: [FAILED]

This was fixed by pointing JAVA_LOCATION at a different Java for the MON box, adding the following to site-info.def:

HOSTNAME=`hostname`
if [ "$HOSTNAME" == "$MON_HOST" ] ; then
JAVA_LOCATION="/usr/lib/jvm/jre-1.5.0-sun"
else
JAVA_LOCATION="/usr/java/j2sdk1.4.2_12"
fi

We had the same 'schemas' file missing problem here as well.

Networking

EFDA-JET has a slightly unusual set-up, as we are restricted to a small number of external IP addresses. All nodes are on the same LAN with private IP addresses, whilst the service nodes also have external addresses. In the hosts files on the service nodes, all service nodes are referenced by their external addresses, whilst on the worker nodes the service nodes are referenced by their private addresses.

This worked well for glite 3.0, but not for glite 3.1, where we saw clients on the worker nodes trying to contact the service nodes via their external addresses. It looks like glite 3.1 services are passing IP addresses for clients to call back on at a later time. The complete solution was to run iptables on the worker nodes and NAT-translate outgoing connections to the external addresses of the service nodes into their corresponding internal addresses. This was done by adding the following to /etc/rc.local on the worker nodes:

/sbin/service iptables start
/sbin/iptables -A OUTPUT -t nat -d <CE-ext-addr> -j DNAT \
--to-destination <CE-int-addr>
/sbin/iptables -A OUTPUT -t nat -d <SE-ext-addr> -j DNAT \
--to-destination <SE-int-addr>
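The resulting rules can be sanity-checked afterwards with something along the lines of:

/sbin/iptables -t nat -L OUTPUT -n

which should list one DNAT entry per service node.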

Thursday, December 04, 2008

dCache Update

We updated dCache this morning to 1.9.0. Now that sounds like a major jump, but reading the release notes it is only a minor step up from the 1.8.0-15pX series of releases.

The upgrade itself was trivial, just installing the new dcache-server rpm and running install.sh across all the nodes.
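On each node that boiled down to roughly the following (the package filename and the install.sh path are assumed from the standard dCache layout, so treat this as a sketch):

# rpm -Uvh dcache-server-1.9.0-1.noarch.rpm      # exact filename will differ
# /opt/d-cache/install/install.sh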

We also took the opportunity to update the version of PostgreSQL on the head node from 8.3.1 to 8.3.5 using rpms from pgsqlrpms.org. I'm hoping that I will now be able to use their prebuilt Slony-I rpm to set up master-slave mirroring of the databases from the dCache head node to a live mirror node.

Finally we updated the SL version of all the dCache nodes to SL4.6 from a mix of SL4.4, SL4.5 and SL4.6. We're now using the SL-Contrib xfs kernel modules on all nodes and the Araca drivers complied into the 2.6.9-78 series of kernels on all nodes with Areca raid cards rather than our own builds and have had no issues.