SC 08′, Austin Texas

Right now I am in my hotel for the last night before I go home from Austin Texas from this years Super Computing

This year was not that bad. I was able to touch base with PGI about there 8.0 release with support for GPGPUS from Fortran and CCFF (Common Compiler Feedback Format). The GPGPU support looks simple, but it always does. I am sure it is not as easy as it looks, but it is going to be much simpler to work with GPU hardware from Fortran, and if the regular -Minfo -Mneginfo flags work could be very useful. The CCFF stuff looks really good. Lots of data about what the compiler did and what the code did when it ran. Turns out this information in held in XML and they have published this (they claim, I can’t find it) thus third party tools should support it. I for one harassed Allinea about adding support for this profile information into OPT and maybe DDT.

DFT’s on GPU’s. Two papers and one application. Both papers reached the same performance, in both cases much better than the CU-FFT support shipped with CUDA. I hope Nvidia (and ATI) look at these papers and add their support. In many cases, optimizing for bandwidth on the card provided the best performance. This says to me that GPU’s are already in the case of CPU’s which is memory is the bottle neck and there is only marginal improvements that need to be made in the compute engine.

In one of the papers the author looked at total cost for a application they were running. This is great because in my own testing I found copying data to and from real memory to device memory was slow and just awful. Thus moving small parts of your app to GPU’s does not help in most cases. Ether your entire app needs to live on the card (well compute should, pre and post processing not so much) or don’t even bother. In his case GPU’s were good only if using his wonderful optimized FFT. If using the stock CU-FFT, forget it, no point. Copy to and from the card crushed performance. Still promising. I am happy though, that someone pointed out the real downfall of GPU hardware.

The dev’s of NAMD have added CUDA support for GPU’s to their code. This is beta and not GA yet, expect to see it soon. He pointed out something that I didn’t think of. That was, the best performance to and from RAM and device memory is using pinned memory. This is memory that can not be paged out. This allows RDMA style applications to make sure that where the data is placed in memory is where it should be and that the address has not moved (yes the OS moves pages). Well RDMA fabrics like IB also want pinned memory. Thus MPI stacks want to use pinned memory also, but also de-pinns memory that is nolonger needed. What happens when the card and MPI both work with these buffers? It gets un-pinned, data goes into memory in wrong place.

Now you can code around this, but regular research grad students should not have to worry about this. I talked to Jeff Squires of OpenMPI and he was aware of this and Nvidia came to him already that day before I did and have asked about this.

Here is the last bit of insight from the NAMD dev. Why can we not do RDMA between GPU memory in systems with multiple cards? Why can we not do RDMA from GPU memory across and RDMA fabric (think IB, iWARP) to another GPU or RAM in a remote node? This is a great idea! Talking to Jeff should be doable and should help out performance very much on these types of MPI+GPU applications we _will_ see in the future.

Parmetis on OSX-10.3.x

Building ParMetis

This will work on 10.4.x and 10.5.x for PPC and Intel I expect. Though I have not tried it. This was done using OpenMPI

tar -xzvf ParMetis-3.1.tar.gz
cd ParMetis-3.1

Edit Makefile.in make sure CC and LD are set to mpicc. Set set INCDIR and LIBDIR to nothing.

make

This make will fail with ‘cant find malloc.h’, malloc.h is in /usr/include/sys/. Re-open Makefile.in and set INCDIR = -I/usr/include/sys/

make

For those with impatience

tar -xzvf ParMetis-3.1.tar.gz
cd ParMetis-3.1
CC = mpicc
LD = mpicc
INCDIR =
LIBDIR =
make
INCDIR = -I /usr/include/sys
make

dcopy() vs memcopy() vs C code?? ACML slacks?

It is not uncommon to have to copy around arrays and data in HPC applications. There is three ways you can do this:

  • string.h memcpy()
  • BLAS1’s DCOPY()
  • C code and let the compiler optimize it

Of these three I expected DCOPY() to be the fastest, then memcpy() and last the C code. Oh was I wrong. I used the STREAM benchmark on an AMD Opteron 2218 using PGI 7.2 compilers:

  • memcpy() 3056.8 MB/s
  • DCOPY() 5727.4 MB/s
  • C code 5737.7 MB/s

So memcpy() is much slower than I thought, It about the same speed as if I optimize the crap out of the C code using the GNU C compiler, but we are not using the GNU compiler. DCOPY() reaches that speed using the PGI or GNU compiler which is good to know for portable so I still recommend using DCOPY() over memcpy().

The result from the PGI compiler resulted in the great performance. Turns out the compiler is to smart for us:

pgcc stream.c -fastsse -Minline -Minfo
main:
   164, Generated vector sse code for inner loop
   181, Generated vector sse code for inner loop
        Generated 1 prefetch instructions for this loop
   208, Memory copy idiom, loop replaced by call to __c_mcopy8
   218, Generated vector sse code for inner loop

Notice how on line 208 the memory copy was replaced by a call to __c_mcopy8. Looks like PGI is smart and has their own high speed calls to do similar operations built into their compiler. Nice work guys.

Where ACML starts to slack is in use of the AMD multiple memory controllers. Their DCOPY() is not threaded and thus the OpenMP version does not any faster than the single threaded version. While the compiler is even faster on more memory controllers!

  • DCOPY() 2 threads: 4525.8 MB/s
  • C code 2 threads: 10934.6 MB/s

As you can see ACML has room for improvment making use of the multiple memory contolers available in the Opteron platform. Good compilers can do operations as simple as copying arrays of doubles faster using

 #pragma omp for 

. Again test test test.

BLAS Performance AMD Opteron

I have been putting together a course on using the BLAS on our CAC machines. General theory has been that blas 2 is faster than 1 and blas 3 is faster than 2. In my measurements though this was not the case. I was very surprised by this.

As can be seen DDOT() is faster in Mflop/s than DGEMV() not sure why this is. The case where I am comparing ACML to C code the compiler does make faster code for DGEMV() as expected and is still much slower than the slowest ACML/BLAS call. In any case I am still confused why this happened.

In any case users doing any type of math should be aware of the Pentium Pro line. This is the peak performance of the 1995 CPU and how the C routine reaches only that performance. As an example of what many HPC admins already know, use the BLAS or your performance will be awful.

memcpy vs xCOPY()

C has a function memcpy that will copy from one memory location to another. As a way of quickly copying arrays from one memory location to another is memcpy on par with the hardware specific BLAS1 call xCOPY()? When it comes to Fortran I do not know of a memcpy so xCOPY() may be their only option. Last xCOPY() better not be any worse than memcpy because they do the same thing.
If anyone has data on this comparison let me know.

On a side note if anyone wants the code and latex to go with it use:

git clone http://www.umich.edu/~brockp/git-repo/cac-docs.git
git checkout --track -b fastmath origin/fastmath

Lustre 1.6.5.1 and Sun X4500’s

Just finished installing two Sun X4500 AKA Thumppers. Our systems have 48 1 Tbit drives (yes 1000Gig not 1024Gig) and 16 GB of ram. The intention was to run Lustre on them and move data from our old 2.7TByte lustre 1.6.4 setup.

Today we finished the install final usable disk space is 49TBytes and 8Gbit of bandwidth provided by two sets of 4 1 Gig links bonded with 802.3ab. We have so much less real disk after the drives are broken into 14 raid groups with spares and external journals. Oh and don’t forget the two boot drives. We decided to go with external journals to help with the random nature of our work load. Our old disk system was an NFS mount provided by some proprietary data movers. Suffered not from raw performance but from the load of 2000+ cpus asking for IO of different patters to many different files at the same time. There just wasn’t a way to provide a single name space with a simple NFS server.

We hope now with lustre writing meta data to 6 15,000RPM disks in raid 0+1 on a server with 16GB of ram for cache should help meta data performance. Random IO performance to data by default will be spread across 14 differnt arrays, 7 per X4500. This should keep keep heads moving in parallel for multiple requests. The best part is we can just add more X4500’s latter as IO and space needs come. Last both meta data and object data arrays have external journals keeping journal IO to the underlying ldiskfs filesystem independent. This should also help the many differnt requests being hit on the servers.

Performance

The only real performance test we did was bandwidth from a single host to a single array was limited to the 1Gbit/s speed of the host. There were no metadata tests, if you know of any please email or comment.

MPI-IO Performance

This was a big reason for putting lustre on our system. NFS and Romio don’t play well. Lustre was built with MPI-IO in mind and thus works out of the box. Writing to a single file using MPI_File_write() on 10 4 cpu Opteron 2218 nodes with 1Gbit/s ethernet reached 650 MB/s write speed. At this point the CPU’s on the X4500 were filled. I think higher speeds could be reached using TOE or 10Gbit cards (where TOE is implied). Even better speeds might be reached using Infiniband as its RDMA abilities may free CPU resources. Note Lustre does support Infiniband and can support both Infiniband and TCP networks at the same time.

The example code for MPI_File_write() was taken from: Beige.ucs.indiana.edu.

module swap pgi/7.2
module swap openmpi/1.2.6-pgi
mpicc mkrandfiles.c

lfs setstripe data14 0 -1 -1
mpirun -np 40 ./a.out -f data14 -l 300

longest_io_time       = 82.115636 seconds
total_number_of_bytes = 50331648000
transfer rate         = 584.541535 MB/s

Future hopes is that MPI-IO ability of this scale will allow new forms of research to be done on our clusters. I expect in the Winter term to teach a course on using HDF5 parallel IO abilities to researchers.

Questions: brockp@mlds-networks.com

DGEMM() The Function every HPC user should know

Its true DGEMM() should be used by everyone! What should be noted though is if your problem does not require doubles then use SGEMM()!!!

This post is a reply from an email I received from a student. On my post about CUBlas running on Nvidia graphics cards.

The Quesion


How can I run the Dgemm on GPUs, how to support the dgemm() on GPUs which only support the single precision. Thanks, if you have done that, could you please share the code with me? Thanks

The answer is you can’t. The current crop of cards including those branded for HPC only use like the Tesla 8 cards can only do single. Reason for this is simple. Graphics did not require it and the first generation HPC targeted cards were really re-branded workstation cards. There is also an added real cost, DOUBLE requires twice the number of transistors. Many modern CPU’s when running a code in SINGLE will run it twice as fast because it takes a DOUBLE register breaks it in half so it can work on twice as many numbers at a time. IE real cost savings in performance and silicon.

Now there was some talk in PLASMA (Parallel Linear Algebra for Multicore) about swapping in a and out of SINGLE and DOUBLE. Look on page 5 for an example, in this case it was used on the IBM Cell BE. The Cell reaches over 100 GFlop in single but instead of dropping performance by half for double it dropped to 14 GFlop! Thus the reasoning behind PLASMA’s swapping in and out of SINGLE. NOTE: This was resolved in the Power XCell 8i CPU used on Road Runner. I still think this is a great idea.

So the full answer to run DGEMM() (Double Generic Matrix Multiply) on current graphics cards is you can’t. But don’t give up hope, keep working with CUBLas only use DOUBLE if you need to. Even then keep CUBlas and similar projects in mind the Nvidia Tesla 10 series introduced DOUBLE support which should make all this pain go away.

Short term solutions to high performance DGEMM() is Goto Blas. There is a threaded version which can take advantage of things like Multicore and SMP. I have gotten my best HPL numbers using Goto and it is a great tool. If your not using a common platform Goto was built for, the vendor provided libraries (ESSL ACML MKL etc.) work great and many are threaded. Core 2 and Barcelona with modern BLAS libs are twice as fast per clock than previous generations.

sorry there is no good solution, but I hope I gave you enough tools to get buy till Tesla 10 hardware is out. If you have questions email me, also for getting my hands dirty I am available for consulting.
Comments welcome.

Brock E. Palen

CUDA Blas

I am a huge fan of BLAS and I wish more people used it. I have not figured any numbers out but I think U of M (My day job) might waste enough money per year to fund a position to teach faculty and researchers to use BLAS, CUDA and NAG/IMSL. For example it is easy to show how on the same hardware using DGEMM() vs some DO loops can go 10′x faster while consing the same capital resources (computer, facility space) and consumables (Power, Cooling). This problem will only get worse as more and more research computing happens at the University.

So once you have users using BLAS and friends what is the next big leap in performance one can extract from hardware? Nvidia has a great option called CUBlas. It is part of their CUDA kit for doing general computing work on Nvidia graphics cards.

I was able to port a simple matrix multiply code to CUBlas in an hour, so most codes that depend on this should find they can use CUBlas quickly and easly.

Calculations on graphics cards have to be done from the video buffer memory on the card. In my case this was 256MB so I could not do very large problems. This should not be much of an issue as it is easy to copy data from the host memory (ram) to the card memory (video buffer). Also for larger problem I think porting some of the out-of-core ways of running problems like Pardiso and DGEMM() could be implemented on GPU’s using the system RAM where we used disk in the past and treading the card memory as the ram of old.

Ok technical details, basic form was allocate memory on the card, copy from host memory to card memory, call the function and copy results back:

//pointers to memory on the card
float* d_A = 0;
float* d_B = 0;
float* d_C = 0;

CUstatus = cublasInit();

CUstatus = cublasAlloc(DIM*DIM, sizeof(d_A[0]), (void**)&d_A);
CUstatus = cublasAlloc(DIM*DIM, sizeof(d_B[0]), (void**)&d_B);
CUstatus = cublasAlloc(DIM*DIM, sizeof(d_C[0]), (void**)&d_C);

CUstatus = cublasSetVector(DIM*DIM, sizeof(a[0]), a, 1, d_A, 1);
CUstatus = cublasSetVector(DIM*DIM, sizeof(b[0]), b, 1, d_B, 1);
CUstatus = cublasSetVector(DIM*DIM, sizeof(c[0]), c, 1, d_C, 1);

cublasSgemm('n','n', M, N, K, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);

CUstatus = cublasGetError();

//copy back
CUstatus = cublasGetVector(DIM*DIM, sizeof(c[0]), d_C, 1, c, 1);

//free memory on the card
CUstatus = cublasFree(d_A);
CUstatus = cublasFree(d_B);
CUstatus = cublasFree(d_C);

//shutdown cublas
cublasShutdown();

Quite simple, its all a C/FORTRAN library so nothing special other than an Nvidia card and the CUBlas library. If you want the full source email me at: brockp@mlds-networks.com

GPGPU’s

One of the hot things in HPC right now is GPGPU’s. The general idea is to use graphics cards with their very high memory bandwidth and massive parallel ALU to do general computation. Think math on graphics chips. This is a great idea because graphics companies have the scale of the consumer market to keep prices down and innovation up. Jezz I love capitalism. The performance of these cards is much much higher than a general purpose Intel or AMD cpu, and are even higher than could be had out of many purpose build accelerators like those from Clear Speed and available at a lower cost.

Now GPGPU is not without its problems. Most the problems though are slated to be resolved by Nvidia and ATI/AMD. These problems are:

  • Lack of DOUBLE support
  • Lack of simple programming interfaces
  • Lack of standard interface from both major vendors
  • Lack of large memory space
  • Lack of Scheduling

Most serious HPC application requires DOUBLE support. Current cards only support SINGLE but the Nvidia testla 10 cards now support double and ATI sees this also. So this will be solved soon. Yes all my MD (protein folding) folks will tell me we don’t need DOUBLE but they will get in fists fights over it with my FEA (Meshing) friends. I of course need to support both on the clusters. In any point this is solved.

Programming API’s are in the works. Nvidia has the wonderful CUDA which I will write about latter. The worst part of cuda is all the hard work it involves that the average PHD candidate will not understand. Currently they can’t program FORTRAN or C effecntly CUDA asks for so much more.

CUDA does have CUBlas which I can’t stress how much I love it. CUBlas allowed me in an hour convert the heavy work portion of a code I had to use the graphics card quickly and no special NVCC compiler. I was quite happy. This is what I think should be done. I have always said most people should just make their code into and LU problem and then use BLAS to Factor it. Well all of BLAS and LAPACK (PLASMA) should be implimented in CUDA and be a library that fortran and C programmers link against. Easy right? Well simpler than writing CUDA still harder than raw C or FORTRAN but the benefits are huge. I really hope to see Nvidia and ATI implement PLASMA.

I will cover the other points when I write about using CUBlas. I will focus now on why I think PLASMA should be used by Nvidea over regular BLAS.

PLASMA (Parallel Linear Algebra for Multi-core) was made mostly for cpus like the CELL BE from IBM. The first rendition of the CELL while it could do DOUBLE performed much better in SINGLE. PLASMA wanted to ease this pain by taking LAPACK and the parts of the operation that don’t require DOUBLE just be ran in single. Extracting the full performance of the CELL cpu. Now that the Cell BE has full DOUBLE support it acts more like regular cpus in that the cpu is only half as fast at DOUBLE vs SINGLE.

This is still twice as fast though! Why don’t we do this in all math libraries? As long as changing between types (DOUBLE-> SINGLE, SINGLE->DOUBLE) is cheap we should do this. I think this matter more and more on larger systems because going from 1Gflop to 2 Gflop might not matter as much but going from 1Tflop to 2 Tflop is a big deal. Because GPU’s are so much faster and the market that pushes their development (the consumer graphics market) does not require DOUBLE the more that the HPC world can leverage SINGLE the better and the speed bonus to boot.

Please post any comments and questions, or email me at brockp@mlds-networks.com

64 bit Scientific Computing - Things to look out for

Below is an email I sent to a user who had a question about what could limit the amount of memory in use by his application. Turns out its not as easy as use 64bit.

Things to watch out for:

  • Admin Placed Limit on the Stack for Fortran77
  • Limits on size of arrays on x86_86
  • Assumed size of a pointer

The email follows, I will also attach example code to demonstrate.

Yes,

First fortran77 (g77) does not do dynamic allocation, so all variable

are allocated on the stack not the heap (look online for what these are if you care).

First on nyx by default the stack can’t
be larger than 10 MB. This is a limit we impose to keep stack frames
from going crazy. You can see what the stack limit is by running:

ulimit -s

If you run

ulimit -s unlimited

It removes that stack limit. This is needed if you get ’segfault’
messages some times.

The next limit you will hit will be because of your cpu architecture.
Your regular windows machine is 32 bit. 32bit machines can not
access more than 4gb of memory, and this is split between the OS
(windows linux) and the application leaving between 3 and 2 gb
available for the application.

Then there are 64 bit systems (Nyx is 64 bit). 64 bit cpus can
address up to 16.8 Million Terra Bytes. (yes more than a petaByte of
memory). How this is split between the OS and the application is
Moot right now. Should be aware that the first gen AMD64 cpus (like
on nyx) had an artificial limit of 1TerraByte of memory. Though the
largest memory system we have has only 64GB or ~0.064 TB

Now among 64 bit cpus there is what i call ‘native’ 64 bit and
‘extended’ 64 bit. True 64 bit like IA64 and Power have no limits on
anything (other than maybe the admin imposed stack limit above).

The more common ‘extended’ are any of the x86_64, em64t, amd64.
Works the same as the native, only if you add in a compile/link option:

-mcmodel=medium

The default is

-mcmodel=small

.

Under the small memory model some arrays static arrays can only be
2gb in size. Under medium this is not a problem but there is a
performance hit for accessing memory. If you dynamic memory
allocation like in fortran 90 or c/c++ there is no need for the
medium memory model.

The error:

relocation truncated to fit: R_X86_64_PC32 .bss
 

Means you need to use

-mcmodel=medium

So the moral of the story is don’t use fortran 77 is you can avoid it
and use dynamic allocation. It will free up most of your pain.
64bit right now provides just a massive amount of memory addressable
that the next memory limit wont come for another decade. Note the
amount of ram 64 bit systems support is more than the amount of hard
disk in the world.

You of-course can’t use ram you don’t have installed. And you can’t
assume the size of pointers. Pointers on nyx are 64 bit

(INTEGER*8)

on 32bit systems 32 bit

(INTEGER*4)

I hope your not using pointers in fortran 77 though.

 

C/C++

If your stack limit (ulmit -s) is less than 4GB this will fail with a segfault:

#define N 715827882/2/2
int main(){
double bigarray[N];
bigarray[1000]=5;
return 0;
}

This will work though with dynamic allocation even with the stack size limit in place:

#include <stdlib.h>
#define N 715827882/2/2
int main(){
double *bigarray=(double *)malloc(N*sizeof(double));
bigarray[1000]=5;
return 0;
}

cblas with ACML c++ support etc.

One currently large problem with the BLAS is that these libraries are all written for Fortran 77.  While many commercial BLAS/LAPACK implementations seek to correct this.  (NAG, IMSL, MKL, etc).   It is not portable across all systems.  The only way to use them for sure is to write the code only in Fortran.  it is quite annoying as more and more users are writing their code in C and C++.

Now you can call Fortran from C.  Its not that hard, fortran passes everyhting by reference:

DGEMV(TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY, TRANS)

Becomes

dgemv_(char *trans, int *m, int *n, double *alpha, double *a, int *lda, double *x, int *incx, double *beta, double *y, int *incy, int trans_len);

It’s is not the cleanest thing in the world and everything much be by reference.

Well there was an update to BLAS known as the CBLAS. The paper on the CBLAS is defined  as cblas_dgemv();  The cblas shall also have options so you can tell the BLAS if your matrix is Row Major or Colum major (if your writting C or Fortran you better know what this is!  If not email me: brockp@mlds-networks.com).  Very helpful because C puts matrixes in Row Major while Fortran is collum major so calling fortran from C can make memory location quite the pain.

Now yes some (most) BLAS fuctions that work with matrixes have a transpose option which can correct from this true.  I have not done tests yet if it hurts performace or not due to cache locality   My guess is it will depend on your BLAS library.

Now so we have a nice deffinition of how the BLAS should be called from C.  While MKL and ATLAS have implimented some if not all of it, ACML and other have not or have done their own for of BLAS from C.  This is a major pain.  To have code work on multiple platforms with differnt cpu vendors and thus differnt BLAS libraries. Anything other than fortran could be a pain.

I recently figured out a way to make the CBLAS work with ACML which does not impliement the ‘correct’ CBLAS.

Steps:

  1. Download CBLAS
  2. Modify Makefile.in to use ACML (Mine attached)
  3. Build libcblas.a following the CBLAS instructions

When you are all done, the cblas library will take care of moving all calls even Row Major to the ACML fortran BLAS.  This method should work for any BLAS that does not have CBLAS hooks. The one that comes to mind is GOTO BLAS one of the best BLAS’s out there.

In my case the call

cblas_dgemm(CblasRowMajor,CblasNoTrans, CblasNoTrans, M, N, K, alpha, a, lda, b, ldb, beta, c, ldc);
pgcc -fastsse -mp driver.c -I /opt/cblas/include/ -L /opt/cblas/lib -L /opt/acml/pgi64_mp/lib -lcblas -lacml_mp -lpgftnrtl

Will work, and link with the threaded BLAS so that I can use OpenMP for parallel.  My tests show this to be atleast as fast as the native fortran77 so I am happy.

 Though It would still be better if all BLAS lib makers would settle on the CBLAs paper above from Netlib.

CBLAS from C++

The cblas.h does not work when called from C++ reason for this is C++ name mangling. Its an easy fix and hope the BLAS working group adds this fix.

 Open up cblas.h and add:

#ifdef __cplusplus
extern "C" {
#endif /* __cpluplus */

Right after the last include before anything else is defined.

Then at the very end of the header add:

#ifdef __cplusplus
}
#endif /* __cplusplus*/

Right before the last #endif

plants