Bits & Bytes online Edition




GPFS on the Linux Clusters

Thomas Soddemann

For a distributed Linux compute cluster, the need for a high-performance parallel file system is obvious. AFS is in use at RZG as the standard global file system. In addition, the Blade Center and the rack-optimized Linux cluster at RZG have been equipped with IBM's General Parallel File System (GPFS), which is already familiar from our IBM 'Regatta' system (/ptmp and /u). In a test configuration we have set up several GPFS file systems.

In contrast to the 'Regatta' system, where two dedicated I/O nodes act as servers and all compute nodes participate as clients in the same file system, we employ a different strategy on the Linux clusters. Here, in principle, every compute node is both client and server for a GPFS file system. In one Blade Center cluster, for example, each compute node contributes a 40 GByte partition of its internal hard drive to the GPFS mounted on /gpfs4. Without mirroring of data and metadata this would create a single point of failure: rebooting a single machine would require a shutdown of the whole file system. With mirroring, we can reboot up to n/2-1 nodes of an n-node cluster without losing a single bit or disturbing file system usage (apart from performance). Problems occur if nodes of different failure groups become unavailable. GPFS has the advantage over other parallel file systems such as PVFS that it can recover by itself to a certain degree. But if mount points become unavailable, e.g. due to hanging jobs, even GPFS is unable to remount, and most often only a reboot helps. Fortunately, such situations are generally rare.
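As a small worked example of the quoted limit (the cluster size of 64 nodes is purely hypothetical): with data and metadata mirrored across two failure groups,

    \[
        n = 64 \quad\Longrightarrow\quad \frac{n}{2} - 1 = 31 ,
    \]

so up to 31 nodes could be rebooted, e.g. for maintenance, without interrupting file system usage, as long as the unavailable nodes all belong to the same failure group.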

In our test configuration we achieve sustained transfer rates of more than 100 MBytes/s for single-host I/O, which is close to the limit of the network's bandwidth. Read/write speeds can be even higher if the data are processed in parallel; this requires the use of special POSIX-compliant I/O libraries. MPI-IO, which already works with GPFS under AIX and will be able to exploit GPFS' parallel I/O here in the near future, is another solution. More about this will be available from July on under www.rzg.mpg.de/docs/linux.
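To illustrate the MPI-IO interface mentioned above, the following is a minimal sketch in C in which each MPI process writes its own, non-overlapping block of a shared file. The file name /gpfs4/demo.dat, the block size and the error handling are illustrative only, and whether the MPI installation actually exploits GPFS' parallel I/O depends on the platform, as noted above.

    /* Minimal MPI-IO sketch: every process writes its own block of a
     * shared file. The path /gpfs4/demo.dat and the block size are only
     * examples, not part of the actual RZG configuration. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1024                    /* integers written per process */

    int main(int argc, char *argv[])
    {
        int rank, i, buf[N];
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < N; i++)       /* fill the buffer with rank-specific data */
            buf[i] = rank;

        /* All processes open the same file collectively. */
        if (MPI_File_open(MPI_COMM_WORLD, "/gpfs4/demo.dat",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY,
                          MPI_INFO_NULL, &fh) != MPI_SUCCESS) {
            fprintf(stderr, "rank %d: cannot open /gpfs4/demo.dat\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* Each process writes to its own, non-overlapping region. */
        offset = (MPI_Offset)rank * N * sizeof(int);
        MPI_File_write_at(fh, offset, buf, N, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }

Such a program can be compiled with mpicc and started with mpirun on any MPI implementation that provides the MPI-2 I/O routines.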

Naturally, GPFS needs bandwidth, and at the moment it has to compete for it with communication-intensive jobs. We are aware of this bottleneck and will provide a dedicated network for GPFS on the Blade Centers.