Lustre 1.6.5.1 and Sun X4500s

Just finished installing two Sun X4500s, AKA Thumpers. Each system has 48 1 TB drives (yes, 1000 GB, not 1024 GB) and 16 GB of RAM. The intention is to run Lustre on them and move data over from our old 2.7 TB Lustre 1.6.4 setup.

Today we finished the install. Final usable disk space is 49 TB, with 8 Gbit/s of bandwidth provided by two sets of four 1 Gbit links bonded with 802.3ad. Usable space is much lower than raw capacity once the drives are broken into 14 RAID groups with spares and external journals. Oh, and don’t forget the two boot drives. We decided to go with external journals to help with the random nature of our workload. Our old disk system was an NFS mount provided by some proprietary data movers. It suffered not from a lack of raw performance but from the load of 2000+ CPUs asking for I/O in different patterns to many different files at the same time. There just wasn’t a way to provide a single namespace with a simple NFS server.

We hope that having Lustre write metadata to six 15,000 RPM disks in RAID 0+1, on a server with 16 GB of RAM for cache, will help metadata performance. By default, random I/O to data will be spread across 14 different arrays, 7 per X4500. This should keep heads moving in parallel for multiple requests. The best part is that we can just add more X4500s later as I/O and space needs grow. Lastly, both the metadata and object data arrays have external journals, which keeps journal I/O for the underlying ldiskfs filesystem separate from data I/O. This should also help with the many different requests hitting the servers.
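To picture how one large file can keep all 14 arrays busy, here is a minimal Python sketch of round-robin striping. It assumes a file explicitly striped over all 14 arrays with a 1 MiB stripe size, starting at the first array; with a smaller stripe count, whole files are distributed among the arrays instead. The function names are made up for illustration and are not part of any Lustre tool.

```python
# Sketch of round-robin (RAID-0 style) striping of one file across arrays (OSTs).
# Assumptions, not taken from the post: 1 MiB stripe size, the file is striped
# over all 14 arrays, and indices are relative to the file's first array.

STRIPE_SIZE = 1 << 20   # assumed stripe size: 1 MiB
STRIPE_COUNT = 14       # 7 arrays per X4500, two X4500s


def ost_for_offset(offset, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Return the 0-based array index that holds the byte at `offset`."""
    return (offset // stripe_size) % stripe_count


def osts_touched(offset, length, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Return the sorted array indices touched by a contiguous request (length > 0)."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    if last - first + 1 >= stripe_count:
        # The request spans at least one full round, so every array is hit.
        return list(range(stripe_count))
    return sorted({s % stripe_count for s in range(first, last + 1)})


if __name__ == "__main__":
    print(ost_for_offset(0))                               # first stripe
    print(ost_for_offset(14 * STRIPE_SIZE))                # wraps back around
    print(osts_touched(3 * STRIPE_SIZE, 4 * STRIPE_SIZE))  # a 4 MiB request
```

Under these assumptions, any request of 14 MiB or more touches every array, while small random requests from many clients land on different arrays and can be served in parallel.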

Performance

The only real performance test we did was bandwidth from a single host to a single array, which was limited by the host’s 1 Gbit/s link. We ran no metadata tests; if you know of any, please email or comment.

MPI-IO Performance

This was a big reason for putting Lustre on our system. NFS and ROMIO don’t play well together. Lustre was built with MPI-IO in mind and thus works out of the box. Writing to a single file using MPI_File_write() from ten 4-CPU Opteron 2218 nodes with 1 Gbit/s Ethernet reached a write speed of 650 MB/s. At that point the CPUs on the X4500s were saturated. I think higher speeds could be reached using TOE or 10 Gbit cards (where TOE is implied). Even better speeds might be reached using InfiniBand, as its RDMA abilities may free up CPU resources. Note that Lustre does support InfiniBand and can serve InfiniBand and TCP networks at the same time.
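For a rough sense of how close that figure is to the wire, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1 MB = 10^6 bytes), ignores TCP and Lustre protocol overhead, and takes the 650 MB/s figure at face value; the variable names are only for illustration.

```python
# Rough network ceilings for the MPI-IO test; protocol overhead is ignored.
# Assumption: decimal units throughout (1 Gbit = 10**9 bits, 1 MB = 10**6 bytes).

BITS_PER_GBIT = 1_000_000_000
BYTES_PER_MB = 1_000_000


def gbit_to_mb_per_s(gbits):
    """Convert a link speed in Gbit/s to MB/s."""
    return gbits * BITS_PER_GBIT / 8 / BYTES_PER_MB


client_ceiling = gbit_to_mb_per_s(10 * 1)  # ten client nodes, 1 Gbit/s each
server_ceiling = gbit_to_mb_per_s(2 * 4)   # two sets of four bonded 1 Gbit links
observed = 650.0                           # MB/s, as reported in the text

fraction_of_server = observed / server_ceiling

if __name__ == "__main__":
    print(f"client side ceiling: {client_ceiling:.0f} MB/s")
    print(f"server side ceiling: {server_ceiling:.0f} MB/s")
    print(f"observed / server ceiling: {fraction_of_server:.0%}")
```

By this estimate the server links still had headroom, which is consistent with the server CPUs, not the network, being the limit.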

The example code for MPI_File_write() was taken from: Beige.ucs.indiana.edu.

module swap pgi/7.2
module swap openmpi/1.2.6-pgi
mpicc mkrandfiles.c

lfs setstripe data14 0 -1 -1
mpirun -np 40 ./a.out -f data14 -l 300

longest_io_time       = 82.115636 seconds
total_number_of_bytes = 50331648000
transfer rate         = 584.541535 MB/s
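
As a sanity check on the numbers above, this short Python sketch recomputes the transfer rate from the reported byte count and time. It suggests the program reports binary megabytes (MiB, 2^20 bytes); that reading is an inference from the arithmetic, not something stated by the example code.

```python
# Recompute the reported transfer rate from the run's own output.
MIB = 1 << 20                  # 1,048,576 bytes

total_bytes = 50331648000      # total_number_of_bytes from the run
longest_io_time = 82.115636    # seconds, from the run
processes = 40                 # mpirun -np 40

bytes_per_process = total_bytes // processes

rate_mib = total_bytes / longest_io_time / MIB        # binary megabytes per second
rate_mb = total_bytes / longest_io_time / 1_000_000   # decimal megabytes per second

if __name__ == "__main__":
    print(f"per process: {bytes_per_process} bytes ({bytes_per_process // MIB} MiB)")
    print(f"rate: {rate_mib:.6f} MiB/s, or {rate_mb:.1f} MB/s in decimal units")
```

Each of the 40 processes wrote 1200 MiB. That works out to 4 MiB per unit of -l 300, if -l counts blocks per process (an assumption about the example program).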

The hope is that MPI-IO capability at this scale will allow new forms of research to be done on our clusters. In the Winter term I expect to teach researchers a course on using HDF5’s parallel I/O capabilities.

Questions: brockp@mlds-networks.com
