

Following on from my previous post about the Lustre / vanilla CentOS issue I was having, specifically with the kernel modules, I have hit another snag.
The specific issue is that Lustre’s o2ib LND module (ko2iblnd), which LNET loads for InfiniBand, disagrees with the symbol versions exported by the OFED kernel modules. Something along the lines of this shows up in dmesg:

ko2iblnd: disagrees about version of symbol…..
LustreError: 3640:0:(api-ni.c:1055:lnet_startup_lndnis()) Can’t load LND o2ib, module ko2iblnd, rc=256
LustreError: 3640:0:(events.c:728:ptlrpc_init_portals()) network initialisation failed
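If you want to spot this kind of failure across several nodes, a small script can pull the mismatch lines out of saved dmesg output. This is only a minimal sketch: the function name is my own, and the symbol name used in any sample input would be a placeholder, since the real symbol is cut off in the log above.

```python
import re

# The kernel module loader logs "<module>: disagrees about version of symbol <symbol>"
# when a module's recorded symbol version does not match the exporter's.
_MISMATCH = re.compile(
    r"(?P<module>\S+): disagrees about version of symbol (?P<symbol>\S+)"
)


def find_symbol_mismatches(dmesg_text):
    """Return a sorted list of unique (module, symbol) pairs reported as mismatched."""
    pairs = set()
    for line in dmesg_text.splitlines():
        match = _MISMATCH.search(line)
        if match:
            pairs.add((match.group("module"), match.group("symbol")))
    return sorted(pairs)
```

Feed it the text of `dmesg` captured from each node; an empty list means no version mismatch was logged.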

Lustre 2.1.1 seems to play nice with the Red Hat / CentOS distribution OFED drivers, but doesn’t like the vendor QLogic OFED, as I discovered.
There is a post on the Whamcloud mailing list which discusses this specific issue (I am running the 1.5.3.2 OFED, however) and then goes into why the patch is bad:

https://groups.google.com/a/whamcloud.com/group/wc-discuss/browse_thread/thread/9ab7a02404e4d9b2#

Looking at Lustre 2.2, the change log states that it supports the OFED, so it looks like that may be the solution.
Every Lustre expert I have spoken to has either 1.8 in production and is testing 2.1, or has moved over to 2.1.
I’m not a fan of the cutting edge when it comes to servers, but heck it wouldn’t hurt to try :)
If only to provide feedback to Whamcloud.


I’m currently testing out Lustre 2.1 on the HPC system I spoke about here on Google+

Initially I stuck with the vanilla CentOS kernel on the nodes and installed the QLogic OFED bundle to get the InfiniBand and MPI packages.
This worked well. Later, though, I decided to install the Lustre kernel and packages from Whamcloud to test Lustre over InfiniBand.
I simply dropped the kernel, kernel-devel and kernel-headers packages onto the nodes using rpm and rebooted.
Everything seemed okay; however, I noticed pretty poor performance from Lustre and later found that RDMA wasn’t working properly… yikes!

The way I determined RDMA was broken was through compiling and running the HPL 2.0 benchmark as used by Top500.org.
On starting mpirun, I would receive a cute error about not being able to initialize RDMA through the mlx4 driver (the Mellanox IB driver).
This caused me to scratch my head.

I laid the blame squarely on the bait and switch I performed with the kernels.
After clearing out the QLogic OFED packages and re-installing them so that the kernel modules were compiled against the Lustre kernel, everything worked as expected and RDMA seems to function once again.
<UPDATE>
However, it appears the Lustre networking kernel modules in 2.1.1 don’t like the OFED! See this post about it:
http://www.nodeofcrash.com/?p=513

The moral of the story is CentOS 6 Vanilla Kernel <> CentOS 6 Lustre Kernel, so don’t assume the kernel modules will port over.
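One quick sanity check after swapping kernels is to compare a module’s vermagic string (as printed by `modinfo -F vermagic <module>`) with the running kernel release from `uname -r`. The sketch below assumes the kernel release is the first field of the vermagic string; the helper names are my own. It is only a first check: a module can match the kernel release and still disagree about individual symbol versions.

```python
import subprocess


def vermagic_release(vermagic):
    """Return the kernel release, i.e. the first whitespace-separated vermagic field."""
    fields = vermagic.split()
    return fields[0] if fields else ""


def built_for_kernel(vermagic, running_release):
    """True if the module's vermagic names the given kernel release."""
    return vermagic_release(vermagic) == running_release


def module_vermagic(module_name):
    """Ask modinfo for a module's vermagic string (needs modinfo on the PATH)."""
    result = subprocess.run(
        ["modinfo", "-F", "vermagic", module_name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

On a node you would call something like `built_for_kernel(module_vermagic("mlx4_ib"), os.uname().release)`; the module name there is just an example of one you might check.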

 


Thought I’d share this speed test from work as well (University of Adelaide).
The limiting factor is the speed of the link between the Thebarton Campus where I am based and the main campus on North Terrace.


Thought I should share this :)

Speedtest


Recently I have been working on setting up an NFS server and NFS clients using NFS version 4.
I’ve come across a number of caveats and tweaks required to get NFS v4 working properly.

With NFS v3 you simply set up your server, create an export and mount it on the client. You also need the same UIDs on both client and server to be able to read and write files.

NFS v4 adds extra security which is good. I highly recommend setting this up properly rather than disabling NFS v4 security or worse, forcing clients to mount NFS using the version 3 protocol.

NFS version 4 uses the idmap daemon. Without digging too deep into the technical side (which I don’t completely understand), it is in charge of checking the NFS domain and mapping user and group identities between client and server. By default the idmap daemon uses the DNS domain of your client and server, so the DNS domain needs to be identical on all clients and servers. Otherwise your permissions get forced to nobody.nobody.

However for my environment, the Server and Client DNS domains weren’t the same.

You can set this manually in your /etc/idmapd.conf with the line
Domain = local.domain.edu

Make sure this is the same on your server and client machines. Don’t forget to restart rpcidmapd after changing this.
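To double-check that the setting really is identical everywhere, you could copy /etc/idmapd.conf from each machine and compare the Domain values. Here is a rough sketch using Python’s configparser; it assumes the Domain line sits under the usual [General] section, and the function names are my own.

```python
import configparser


def read_idmap_domain(conf_text):
    """Return the Domain value from the [General] section, or None if it is unset."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    # Option names are matched case-insensitively by configparser.
    return parser.get("General", "Domain", fallback=None)


def domains_match(server_conf, client_conf):
    """True only if both files set a Domain and the values agree (ignoring case)."""
    server = read_idmap_domain(server_conf)
    client = read_idmap_domain(client_conf)
    if server is None or client is None:
        return False
    return server.lower() == client.lower()
```

A False result when one side has no Domain line is deliberate here, because that machine would fall back to its DNS domain, which is exactly what caused trouble for me.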

In my environment I use LDAP to synchronize user accounts, so I use PAM LDAP authentication and NSS. As a result, idmapd uses the user accounts in LDAP via nsswitch to check permissions.
Basically, as long as the getent passwd command returns your LDAP accounts, idmapd will pick them up.

I found that I also needed to restart nfslock after making the change to idmapd.

To summarize:
1.) Make sure the DNS domain on clients and server match (can force this in /etc/idmapd.conf)
2.) Make sure your account UID matches on both the NFS server and client.
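For the second point, one way to check is to save the output of getent passwd on the server and on a client and compare the UIDs for accounts that exist on both. A minimal sketch, assuming the standard colon-separated passwd format (name:password:UID:GID:...); the function names and any user names you feed it are just for illustration.

```python
def parse_passwd(text):
    """Map user name to UID from passwd-format lines (name:password:uid:gid:...)."""
    uids = {}
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) >= 3 and fields[2].isdigit():
            uids[fields[0]] = int(fields[2])
    return uids


def uid_mismatches(server_text, client_text):
    """Return {name: (server_uid, client_uid)} for accounts whose UIDs differ."""
    server = parse_passwd(server_text)
    client = parse_passwd(client_text)
    return {
        name: (server[name], client[name])
        for name in server.keys() & client.keys()
        if server[name] != client[name]
    }
```

An empty dictionary means every account present on both machines has the same UID; accounts that exist on only one side are ignored.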