Lustre 1.6.5.1 and Sun X4500’s

We just finished installing two Sun X4500s, AKA Thumpers. Each system has 48 1 TB drives (yes, 1000 GB, not 1024 GB) and 16 GB of RAM. The intention is to run Lustre on them and move data from our old 2.7 TB Lustre 1.6.4 setup.

Today we finished the install. Final usable disk space is 49 TB, with 8 Gbit/s of bandwidth provided by two sets of four 1 Gbit links bonded with 802.3ad. We have so much less usable disk than raw disk because the drives are broken into 14 RAID groups with spares and external journals. Oh, and don’t forget the two boot drives. We decided to go with external journals to help with the random nature of our workload. Our old disk system was an NFS mount provided by some proprietary data movers. It suffered not from poor raw performance but from the load of 2000+ CPUs asking for IO in different patterns to many different files at the same time. There just wasn’t a way to provide a single name space with a simple NFS server.

We hope that having Lustre write metadata to six 15,000 RPM disks in RAID 0+1, on a server with 16 GB of RAM for cache, will help metadata performance. Random IO to data will by default be spread across 14 different arrays, 7 per X4500. This should keep heads moving in parallel for multiple requests. The best part is we can just add more X4500s later as IO and space needs grow. Last, both the metadata and object data arrays have external journals, keeping journal IO independent of the underlying ldiskfs filesystem. This should also help with the many different requests hitting the servers.

Performance

The only real performance test we did was bandwidth from a single host to a single array, which was limited by the 1 Gbit/s link speed of the host. We ran no metadata tests; if you know of any good ones, please email or comment.

MPI-IO Performance

This was a big reason for putting Lustre on our system. NFS and ROMIO don’t play well together. Lustre was built with MPI-IO in mind and thus works out of the box. Writing to a single file using MPI_File_write() from 10 four-CPU Opteron 2218 nodes with 1 Gbit/s Ethernet reached 650 MB/s write speed. At that point the CPUs on the X4500s were fully loaded. I think higher speeds could be reached using TOE or 10 Gbit cards (where TOE is implied). Even better speeds might be reached using Infiniband, as its RDMA abilities may free CPU resources. Note that Lustre does support Infiniband and can support both Infiniband and TCP networks at the same time.

The example code for MPI_File_write() was taken from: Beige.ucs.indiana.edu.

module swap pgi/7.2
module swap openmpi/1.2.6-pgi
mpicc mkrandfiles.c

lfs setstripe data14 0 -1 -1
mpirun -np 40 ./a.out -f data14 -l 300

longest_io_time       = 82.115636 seconds
total_number_of_bytes = 50331648000
transfer rate         = 584.541535 MB/s

Our hope is that MPI-IO at this scale will allow new forms of research to be done on our clusters. I expect to teach a course in the Winter term for researchers on using HDF5’s parallel IO abilities.

Questions: brockp@mlds-networks.com

DGEMM() The Function every HPC user should know

It’s true: DGEMM() should be used by everyone! Note, though, that if your problem does not require doubles, you should use SGEMM()!

This post is a reply to an email I received from a student about my post on CUBlas running on Nvidia graphics cards.

The Question


How can I run the Dgemm on GPUs, how to support the dgemm() on GPUs which only support the single precision. Thanks, if you have done that, could you please share the code with me? Thanks

The answer is you can’t. The current crop of cards, including those branded for HPC-only use like the Tesla 8 cards, can only do single precision. The reason for this is simple: graphics did not require it, and the first generation of HPC-targeted cards were really re-branded workstation cards. There is also an added real cost, since DOUBLE requires roughly twice the number of transistors. Many modern CPUs will run a code in SINGLE twice as fast because the vector unit takes a DOUBLE register and breaks it in half so it can work on twice as many numbers at a time. In other words, SINGLE is a real saving in both performance and silicon.

Now there was some talk in PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures) about swapping in and out of SINGLE and DOUBLE. Look on page 5 for an example; in this case it was used on the IBM Cell BE. The Cell reaches over 100 GFlop in single, but instead of dropping performance by half for double it dropped to 14 GFlop! Thus the reasoning behind PLASMA’s swapping in and out of SINGLE. NOTE: This was resolved in the PowerXCell 8i CPU used on Roadrunner. I still think this is a great idea.

So the full answer on running DGEMM() (Double-precision General Matrix Multiply) on current graphics cards is: you can’t. But don’t give up hope. Keep working with CUBlas, and only use DOUBLE if you need to. Even then, keep CUBlas and similar projects in mind: the Nvidia Tesla 10 series introduces DOUBLE support, which should make all this pain go away.

The short-term solution for high performance DGEMM() is Goto BLAS. There is a threaded version which can take advantage of things like multicore and SMP. I have gotten my best HPL numbers using Goto and it is a great tool. If you’re not using one of the common platforms Goto was built for, the vendor-provided libraries (ESSL, ACML, MKL, etc.) work great and many are threaded. Core 2 and Barcelona with modern BLAS libraries are twice as fast per clock as previous generations.

Sorry there is no good solution, but I hope I gave you enough tools to get by until Tesla 10 hardware is out. If you have questions, email me; I am also available for consulting if you want someone to get their hands dirty.
Comments welcome.

Brock E. Palen

CUDA Blas

I am a huge fan of BLAS and I wish more people used it. I have not worked out any numbers, but I think U of M (my day job) might waste enough money per year to fund a position to teach faculty and researchers to use BLAS, CUDA and NAG/IMSL. For example, it is easy to show that on the same hardware DGEMM() can run 10x faster than some DO loops while consuming the same capital resources (computer, facility space) and consumables (power, cooling). This problem will only get worse as more and more research computing happens at the University.

So once you have users using BLAS and friends what is the next big leap in performance one can extract from hardware? Nvidia has a great option called CUBlas. It is part of their CUDA kit for doing general computing work on Nvidia graphics cards.

I was able to port a simple matrix multiply code to CUBlas in an hour, so most codes that depend on matrix multiply should find they can use CUBlas quickly and easily.

Calculations on graphics cards have to be done from the video buffer memory on the card. In my case this was 256 MB, so I could not do very large problems. This should not be much of an issue, as it is easy to copy data from host memory (RAM) to card memory (video buffer). Also, for larger problems I think some of the out-of-core methods used by codes like Pardiso and DGEMM() could be ported to GPUs, using system RAM where we used disk in the past and treating the card memory as the RAM of old.

OK, technical details. The basic form was: allocate memory on the card, copy from host memory to card memory, call the function, and copy the results back:

//pointers to memory on the card
float* d_A = 0;
float* d_B = 0;
float* d_C = 0;

//start CUBlas; real code should check CUstatus after every call
CUstatus = cublasInit();

//allocate DIM*DIM floats on the card for each matrix
CUstatus = cublasAlloc(DIM*DIM, sizeof(d_A[0]), (void**)&d_A);
CUstatus = cublasAlloc(DIM*DIM, sizeof(d_B[0]), (void**)&d_B);
CUstatus = cublasAlloc(DIM*DIM, sizeof(d_C[0]), (void**)&d_C);

//copy the host arrays a, b and c to the card
CUstatus = cublasSetVector(DIM*DIM, sizeof(a[0]), a, 1, d_A, 1);
CUstatus = cublasSetVector(DIM*DIM, sizeof(b[0]), b, 1, d_B, 1);
CUstatus = cublasSetVector(DIM*DIM, sizeof(c[0]), c, 1, d_C, 1);

//C = alpha*A*B + beta*C on the card, no transposes
cublasSgemm('n','n', M, N, K, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);

//cublasSgemm returns nothing, so ask for its error status
CUstatus = cublasGetError();

//copy back
CUstatus = cublasGetVector(DIM*DIM, sizeof(c[0]), d_C, 1, c, 1);

//free memory on the card
CUstatus = cublasFree(d_A);
CUstatus = cublasFree(d_B);
CUstatus = cublasFree(d_C);

//shutdown cublas
cublasShutdown();

Quite simple. It’s all a C/FORTRAN library, so you need nothing special other than an Nvidia card and the CUBlas library. If you want the full source, email me at: brockp@mlds-networks.com

GPGPU’s

One of the hot things in HPC right now is GPGPUs. The general idea is to use graphics cards, with their very high memory bandwidth and massively parallel ALUs, to do general computation. Think math on graphics chips. This is a great idea because graphics companies have the scale of the consumer market to keep prices down and innovation up. Jeez, I love capitalism. The performance of these cards is much, much higher than a general purpose Intel or AMD CPU, and even higher than can be had from many purpose-built accelerators like those from ClearSpeed, and at a lower cost.

Now GPGPU is not without its problems. Most of the problems, though, are slated to be resolved by Nvidia and ATI/AMD. These problems are:

  • Lack of DOUBLE support
  • Lack of simple programming interfaces
  • Lack of standard interface from both major vendors
  • Lack of large memory space
  • Lack of Scheduling

Most serious HPC applications require DOUBLE support. Current cards only support SINGLE, but the Nvidia Tesla 10 cards support DOUBLE and ATI sees the need too, so this will be solved soon. Yes, all my MD (protein folding) folks will tell me they don’t need DOUBLE, but they will get in fist fights over it with my FEA (meshing) friends. I of course need to support both on the clusters. In any case, this is being solved.

Programming APIs are in the works. Nvidia has the wonderful CUDA, which I will write about later. The worst part of CUDA is all the hard work it involves, which the average PhD candidate will not understand. Currently they can’t program FORTRAN or C efficiently, and CUDA asks for so much more.

CUDA does have CUBlas, and I can’t stress enough how much I love it. CUBlas allowed me, in an hour, to convert the heavy-work portion of a code I had to use the graphics card, with no special NVCC compiler. I was quite happy. This is what I think should be done. I have always said most people should just turn their code into an LU problem and then use BLAS to factor it. Well, all of BLAS and LAPACK (PLASMA) should be implemented in CUDA as a library that FORTRAN and C programmers link against. Easy, right? Well, simpler than writing CUDA, still harder than raw C or FORTRAN, but the benefits are huge. I really hope to see Nvidia and ATI implement PLASMA.

I will cover the other points when I write about using CUBlas. I will focus now on why I think PLASMA should be used by Nvidia over regular BLAS.

PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures) was made mostly for CPUs like the Cell BE from IBM. The first rendition of the Cell could do DOUBLE but performed much better in SINGLE. PLASMA wanted to ease this pain by taking LAPACK and running the parts of each operation that don’t require DOUBLE in SINGLE, extracting the full performance of the Cell CPU. Now that the Cell has full DOUBLE support it acts more like regular CPUs, in that it is only half as fast at DOUBLE as at SINGLE.

SINGLE is still twice as fast, though! Why don’t we do this in all math libraries? As long as changing between types (DOUBLE->SINGLE, SINGLE->DOUBLE) is cheap, we should do this. I think this matters more and more on larger systems, because going from 1 Gflop to 2 Gflop might not matter as much, but going from 1 Tflop to 2 Tflop is a big deal. Because GPUs are so much faster, and the market that pushes their development (the consumer graphics market) does not require DOUBLE, the more the HPC world can leverage SINGLE the better, with the speed bonus to boot.

Please post any comments and questions, or email me at brockp@mlds-networks.com

64 bit Scientific Computing - Things to look out for

Below is an email I sent to a user who had a question about what could limit the amount of memory in use by his application. Turns out it’s not as easy as “just use 64-bit”.

Things to watch out for:

  • Admin Placed Limit on the Stack for Fortran77
  • Limits on size of arrays on x86_64
  • Assumed size of a pointer

The email follows, I will also attach example code to demonstrate.

Yes,

First fortran77 (g77) does not do dynamic allocation, so all variables
are allocated on the stack not the heap (look online for what these are if you care).

First on nyx by default the stack can’t
be larger than 10 MB. This is a limit we impose to keep stack frames
from going crazy. You can see what the stack limit is by running:

ulimit -s

If you run

ulimit -s unlimited

It removes that stack limit. This is sometimes needed if you get
’segfault’ messages.

The next limit you will hit will be because of your cpu architecture.
Your regular windows machine is 32 bit. 32bit machines can not
access more than 4gb of memory, and this is split between the OS
(windows linux) and the application leaving between 3 and 2 gb
available for the application.

Then there are 64 bit systems (Nyx is 64 bit). 64 bit cpus can
address up to 16.8 million terabytes (yes, far more than a petabyte of
memory). How this is split between the OS and the application is
moot right now. You should be aware that the first gen AMD64 cpus (like
on nyx) had an artificial limit of 1 terabyte of memory. Though the
largest memory system we have has only 64GB or ~0.064 TB

Now among 64 bit cpus there is what I call ‘native’ 64 bit and
‘extended’ 64 bit. True 64 bit like IA64 and Power have no limits on
anything (other than maybe the admin imposed stack limit above).

The more common ‘extended’ ones are any of the x86_64, em64t, amd64.
These work the same as the native ones only if you add a compile/link option:

-mcmodel=medium

The default is

-mcmodel=small

Under the small memory model static arrays can only be
2gb in size. Under medium this is not a problem but there is a
performance hit for accessing memory. If you use dynamic memory
allocation like in fortran 90 or c/c++ there is no need for the
medium memory model.

The error:

relocation truncated to fit: R_X86_64_PC32 .bss
 

Means you need to use

-mcmodel=medium

So the moral of the story is don’t use fortran 77 if you can avoid it
and use dynamic allocation. It will spare you most of your pain.
64bit right now provides such a massive amount of addressable memory
that the next memory limit won’t come for another decade. Note the
amount of ram 64 bit systems support is more than the amount of hard
disk in the world.

You of course can’t use ram you don’t have installed. And you can’t
assume the size of pointers. Pointers on nyx are 64 bit

(INTEGER*8)

on 32bit systems 32 bit

(INTEGER*4)

I hope you’re not using pointers in fortran 77 though.

 

C/C++

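If your stack limit (ulimit -s) is smaller than the array (about 1.4 GB here) this will fail with a segfault:

#define N (715827882/2/2)  /* 178956970 doubles, about 1.4 GB */
int main(){
double bigarray[N];  /* automatic array, so it lives on the stack */
bigarray[1000]=5;
return 0;
}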

This will work though with dynamic allocation even with the stack size limit in place:

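#include <stdio.h>
#include <stdlib.h>
#define N (715827882/2/2)
int main(){
/* heap allocation is not subject to the stack limit */
double *bigarray=malloc(N*sizeof(double));
if(bigarray==NULL){ perror("malloc"); return 1; }
bigarray[1000]=5;
free(bigarray);
return 0;
}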

cblas with ACML c++ support etc.

One large problem with the BLAS right now is that these libraries are all written for Fortran 77. Many commercial BLAS/LAPACK implementations (NAG, IMSL, MKL, etc.) seek to correct this, but their C interfaces are not portable across all systems. The only way to be sure you can use them everywhere is to write your code in Fortran. That is quite annoying, as more and more users are writing their code in C and C++.

Now you can call Fortran from C. It’s not that hard; Fortran passes everything by reference:

DGEMV(TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY)

Becomes

dgemv_(char *trans, int *m, int *n, double *alpha, double *a, int *lda, double *x, int *incx, double *beta, double *y, int *incy, int trans_len);

It’s not the cleanest thing in the world, and everything must be passed by reference.

Well, there was an update to BLAS known as the CBLAS. In the CBLAS paper the same routine is defined as cblas_dgemv(). The CBLAS also has an option so you can tell the BLAS whether your matrix is Row Major or Column Major (if you’re writing C or Fortran you had better know what this is! If not, email me: brockp@mlds-networks.com). This is very helpful because C stores matrices in Row Major order while Fortran is Column Major, so calling Fortran from C can make memory layout quite the pain.

Now yes, some (most) BLAS functions that work with matrices have a transpose option which can correct for this. I have not yet tested whether it hurts performance due to cache locality. My guess is it will depend on your BLAS library.

So now we have a nice definition of how the BLAS should be called from C. While MKL and ATLAS have implemented some if not all of it, ACML and others have not, or have done their own form of BLAS from C. This is a major pain. If you want code to work on multiple platforms with different CPU vendors, and thus different BLAS libraries, anything other than Fortran could be a pain.

I recently figured out a way to make the CBLAS work with ACML, which does not implement the ‘correct’ CBLAS.

Steps:

  1. Download CBLAS
  2. Modify Makefile.in to use ACML (Mine attached)
  3. Build libcblas.a following the CBLAS instructions

When you are all done, the cblas library will take care of translating all calls, even Row Major ones, to the ACML Fortran BLAS. This method should work for any BLAS that does not have CBLAS hooks. The one that comes to mind is GOTO BLAS, one of the best BLAS libraries out there.

In my case the call

cblas_dgemm(CblasRowMajor,CblasNoTrans, CblasNoTrans, M, N, K, alpha, a, lda, b, ldb, beta, c, ldc);

built with

pgcc -fastsse -mp driver.c -I /opt/cblas/include/ -L /opt/cblas/lib -L /opt/acml/pgi64_mp/lib -lcblas -lacml_mp -lpgftnrtl

works, and links with the threaded BLAS so that I can use OpenMP for parallelism. My tests show this to be at least as fast as the native Fortran 77 interface, so I am happy.

Though it would still be better if all BLAS library makers would settle on the CBLAS paper above from Netlib.

CBLAS from C++

The cblas.h header does not work when called from C++. The reason for this is C++ name mangling. It’s an easy fix, and I hope the BLAS working group adds it.

 Open up cblas.h and add:

#ifdef __cplusplus
extern "C" {
#endif /* __cplusplus */

Right after the last include before anything else is defined.

Then at the very end of the header add:

#ifdef __cplusplus
}
#endif /* __cplusplus*/

Right before the last #endif

Amber 8 on em64t/amd64 full 64bit and OpenMPI 1.2.6

Amber 8 is one of the molecular dynamics packages a number of our users at the Center for Advanced Computing use. Our main cluster, Nyx, is a 64-bit AMD system. No system at the CAC is 32-bit anymore, but Amber’s ‘configure’ script assumes that if you’re building with the Intel ifort compiler you are either 32-bit i686/i386 or IA64. We are neither. We are x86_64, also known as em64t and amd64; the name depends on whether you’re talking to Intel or AMD, but they are identical.

Amber’s configure script is a disaster. It does not allow for using the mpif90/mpicc wrappers that come with an MPI library, and it assumes you’re going to use one of a small subset of MPI libraries. Our MPI library of choice is OpenMPI, which was not around when Amber 8 was published. So I made a patch to support both x86_64 and OpenMPI in Amber’s configure script.

Apply the patch:

patch < ompi.patch

This adds the option:

-openmpi

To the configure script. To build Amber now is:

export AMBERHOME=~/Amber8
export OMPI_HOME=~/openmpi-1.2.6
./configure -openmpi ifort

This will build Amber. Note that the patched configure script does NOT support 32-bit targets when using the MKL. The script was patched such that if you set MKL_HOME (which I highly recommend), it will use the em64t versions of MKL. The configure script will no longer find the 32-bit versions, which is why I did not send this patch to the Amber devs. It may also be fixed in Amber 9 and Amber 10, but we have not installed them yet.

Finally, to build Amber 8 with OpenMPI 1.2.6 on x86_64 with the Intel ifort compiler and MKL, do:

export AMBERHOME=~/Amber8
export OMPI_HOME=~/openmpi-1.2.6
export MKL_HOME=~/mkl80
cd $AMBERHOME/src/
wget http://www.mlds-networks.com/~brockp/papers/code/ompi.patch
patch < ompi.patch
./configure -openmpi ifort
make parallel

HPC The future and Vector Processing

The three of us at the CAC were all invited to a talk at the University Hospital on HPC (High Performance Computing) in medicine. A few cool things came to mind.

  • HPC is being used in real medical applications
  • Vector processing is back (think Cell)
  • HPC is going to be underutilized

The first point is obvious, and it’s cool. Until now HPC has really only been used for weather prediction, and anything else was all research. Now there is a need to build simple clusters, and gateways to them, so a doctor can pull data right from a DNA sequencer or MRI machine.

(I will get this graphic later).

Vector processing is back. As much as IBM wants to call the eight units hanging off the Cell CPU SPEs, to me the Cell is just a vector CPU in the eyes of an application. There are a few problems with this. First, vector processing is not taught in classes at all if you’re a Computer Science major. And when I mention vector, SIMD, SSE or 3DNow! to a grad student who is trying to graduate in 3 years and is just learning to code, they have no clue.

Already in the graphic above you can see that if you use the vector SSE unit on the CPU in your laptop, the time to solution is cut by more than half. This problem is only going to get worse with things like Cell and GPUs. All these systems are massive SIMD engines that the people writing code at the college level know nothing about. Classes do not exist to teach such things, and the tools are not there to abstract them away.

My prediction: 80% of the code run on HPC systems at universities will not use the SPEs, GPUs and vector units, wasting huge amounts of resources.

Of the code that’s written by grad students, I expect 95% to not use them.

Now I don’t say we should not go this route; I think we have to. Look at the performance of the Cell: 200 Gflop when working on floats. An Intel or AMD CPU can’t do half that, even with SSE3, which doubled the performance of AMD and Intel CPUs when (can you guess it?) only using the SSE3 VECTOR units! Scalar performance is still a drag and will not improve (much).

We need to build tools to make these units available, but many are already around. IMSL, NAG and similar tools already provide high-level solvers that use vector units, and I expect them to build versions for the Cell. There are also the lower-level BLAS and LAPACK, in the form of MKL, ACML, ATLAS and GOTO. All are free to researchers and provide huge performance gains, mostly through memory blocking and vector unit use. Many are also already parallel, with no need to know OpenMP or MPI. Talk to your local HPC admin, or email me at brockp@mlds-networks.com.

So I think we already have the tools and they are not being used enough, so what’s the problem?

Education.

Courses, either formal classes or seminars, need to be available and pushed by faculty, so that students writing new code find out that these tools are available. Just teaching students to use:

pgcc -fastsse -O3 -ipo -Minfo
gcc -O3

Would make huge performance leaps on systems on campuses.  

Sigh, I started teaching such classes but I can only do so much, You can find what I have done so far at:

www.umich.edu/~brockp

Lustre Dstat plugin

At the Center for Advanced Computing, my day job, I tested out Lustre, a cluster filesystem now owned by Sun. While I have been very impressed with Lustre, we have had a huge amount of trouble with servers and clients evicting each other.

This did not keep me from making a plugin for my favorite system tool, Dstat.

This plugin works only on clients and has not been tested with multiple Lustre mounts. Patches please!

Download the plugin here: Download

Place in dstat-0.6.6/plugins/

dstat -M lustre

Visit Visualization - Quite Powerful

Visit is a large-scale scientific visualization package from LLNL. I have tried out Visit on our cluster, Nyx, using both my Mac as a client and our 3D workstations (Linux). I was quite impressed with how much it can do out of the box. So far I have been able to make the following formats work as both static images and video:

  • VASP OUTCAR
  • Protein Databank (ent)
  • CGNS
  • Simple HDF5
  • Fluent
  • Flash

I plan to try a few others, TecPlot, XYZ and NetCDF to name a few.

I must stress that Visit will not run in parallel using the binaries provided by LLNL on the Visit website. You will need to compile it yourself, but to help with this the Visit devs have made a build_visit script you can find on their website. It was very easy to build on our RHEL4 machine with OpenMPI-1.2.3 and gcc-3.4. A few of the libraries it tries to build will fail. Mili is not available yet, so deselect it from the available plugins. Also, h5Part would not build for me, though HDF5 does just fine. Just disable these.

Note: DO NOT run Visit over a slow line. In this case your high-speed Comcast is not fast enough; the machine running the client should be on the same LAN as the cluster or have at least 100 Mbps available to the client at any time. Second, never X-forward the client; it will look bad and work awfully. Last, make sure your Linux machine supports hardware GL (we use Nvidia SLI for our 30" displays). If you use the Windows or Mac client this should already be done for you, but for larger models a cheap graphics card will hurt you.

The largest render I have done used 16 CPUs and chewed through a 4.5 GB HDF5 file from Lustre to my display at the 3D lab in around 45 seconds. This file was way too simple, so YMMV.

If you are a U of M student/staff/faculty who wants to know more about tools like Visit please contact me. I would love to help you out.

Example Images: 

[Images: visit-vasp0000.png, visit-pdb0002.png, visit-pdb0001.png, visit-pdb0000.png, visit-cgns0001.png, CGNS File, visit-hdf50000.png]

Example Movie:

 amr.png
