GPGPUs
One of the hot things in HPC right now is GPGPUs. The general idea is to use graphics cards, with their very high memory bandwidth and massively parallel ALUs, to do general computation. Think math on graphics chips. This is a great idea because graphics companies have the scale of the consumer market to keep prices down and innovation up. Jeez, I love capitalism. The performance of these cards is much, much higher than that of a general-purpose Intel or AMD CPU, and even higher than what many purpose-built accelerators, like those from ClearSpeed, can deliver, at a lower cost.
Now, GPGPU is not without its problems. Most of them, though, are slated to be resolved by Nvidia and ATI/AMD. These problems are:
- Lack of DOUBLE support
- Lack of simple programming interfaces
- Lack of a standard interface across both major vendors
- Lack of large memory space
- Lack of scheduling
Most serious HPC applications require DOUBLE support. Current cards only support SINGLE, but the Nvidia Tesla 10-series cards now support DOUBLE and ATI sees the need too, so this will be solved soon. Yes, all my MD (protein folding) folks will tell me we don’t need DOUBLE, but they will get in fist fights over it with my FEA (meshing) friends. I of course need to support both on the clusters. In any case, this one is solved.
Programming APIs are in the works. Nvidia has the wonderful CUDA, which I will write about later. The worst part of CUDA is all the hard work it involves, which the average PhD candidate will not understand. Currently they can’t program FORTRAN or C efficiently, and CUDA asks for so much more.
CUDA does have CUBLAS, and I can’t stress enough how much I love it. In about an hour, CUBLAS let me convert the heavy-work portion of a code I had to run on the graphics card, with no special NVCC compiler needed. I was quite happy. This is what I think should be done. I have always said most people should just turn their code into an LU problem and then use BLAS to factor it. Well, all of BLAS and LAPACK (PLASMA) should be implemented in CUDA and shipped as a library that FORTRAN and C programmers link against. Easy, right? Well, simpler than writing CUDA, still harder than raw C or FORTRAN, but the benefits are huge. I really hope to see Nvidia and ATI implement PLASMA.
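To make the "link against a library" idea concrete, here is a rough sketch in plain C. It assumes the heavy work in your code is a matrix multiply and hides it behind a BLAS-shaped function. `my_sgemm` is a hypothetical stand-in I made up for illustration; it follows the column-major convention of BLAS SGEMM (C = alpha*A*B + beta*C, no transposes) but it is not the CUBLAS API itself.

```c
/* Hypothetical stand-in for a BLAS-style SGEMM (not a real library call).
   Computes C = alpha * A * B + beta * C with column-major storage and
   no transposes.
   A is m x k with leading dimension lda,
   B is k x n with leading dimension ldb,
   C is m x n with leading dimension ldc. */
void my_sgemm(int m, int n, int k, float alpha,
              const float *a, int lda,
              const float *b, int ldb,
              float beta, float *c, int ldc)
{
    for (int j = 0; j < n; j++) {          /* column of C */
        for (int i = 0; i < m; i++) {      /* row of C */
            float sum = 0.0f;
            for (int p = 0; p < k; p++)    /* dot product of row i of A, column j of B */
                sum += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * sum + beta * c[i + j * ldc];
        }
    }
}
```

If the application only ever calls something shaped like `my_sgemm`, moving the work to the card means swapping one function body (or one link line) for the vendor library's equivalent, not rewriting the whole code.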
I will cover the other points when I write about using CUBLAS. For now I will focus on why I think Nvidia should use PLASMA over regular BLAS.
PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures) was made mostly for CPUs like the Cell BE from IBM. The first rendition of the Cell could do DOUBLE but performed much better in SINGLE. PLASMA wanted to ease this pain by taking LAPACK and running the parts of the operation that don’t require DOUBLE in SINGLE, extracting the full performance of the Cell CPU. Now that the Cell BE has full DOUBLE support, it acts more like a regular CPU, in that it is only half as fast at DOUBLE as at SINGLE.
SINGLE is still twice as fast, though! Why don’t we do this in all math libraries? As long as converting between types (DOUBLE->SINGLE, SINGLE->DOUBLE) is cheap, we should. I think this matters more and more on larger systems: going from 1 Gflop to 2 Gflop might not matter much, but going from 1 Tflop to 2 Tflop is a big deal. GPUs are so much faster, and the market that pushes their development (the consumer graphics market) does not require DOUBLE, so the more the HPC world can leverage SINGLE the better, with the speed bonus to boot.
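Here is a minimal sketch of that trick in plain C, assuming a tiny dense system and a hand-rolled LU standing in for the LAPACK routines. It is my own illustration of mixed-precision iterative refinement, not PLASMA's actual code, and the names `lu_factor_single`, `lu_solve_single`, and `mixed_precision_solve` are hypothetical. The expensive factorization runs in SINGLE; only the cheap residual and the running answer are kept in DOUBLE.

```c
#define N 3  /* tiny fixed size, for illustration only */

static float absf(float v) { return v < 0.0f ? -v : v; }
static double absd(double v) { return v < 0.0 ? -v : v; }

/* LU-factor a in place, in SINGLE, with partial pivoting (full row swaps).
   Returns 0 on success, -1 if a zero pivot is hit. */
static int lu_factor_single(float a[N][N], int piv[N])
{
    for (int k = 0; k < N; k++) {
        int p = k;
        for (int i = k + 1; i < N; i++)
            if (absf(a[i][k]) > absf(a[p][k])) p = i;
        if (a[p][k] == 0.0f) return -1;
        piv[k] = p;
        if (p != k)
            for (int j = 0; j < N; j++) {
                float t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
            }
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];                     /* multiplier, stored in L */
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];
        }
    }
    return 0;
}

/* Solve (LU) x = b in SINGLE using the factors above; b is overwritten by x. */
static void lu_solve_single(float lu[N][N], const int piv[N], float b[N])
{
    for (int k = 0; k < N; k++) {                   /* replay the row swaps */
        float t = b[k]; b[k] = b[piv[k]]; b[piv[k]] = t;
    }
    for (int i = 1; i < N; i++)                     /* forward: unit lower L */
        for (int j = 0; j < i; j++)
            b[i] -= lu[i][j] * b[j];
    for (int i = N - 1; i >= 0; i--) {              /* backward: upper U */
        for (int j = i + 1; j < N; j++)
            b[i] -= lu[i][j] * b[j];
        b[i] /= lu[i][i];
    }
}

/* Solve a x = b: factor once in SINGLE, then refine the answer in DOUBLE.
   Returns the number of refinement steps used, or -1 if the factorization
   fails or the residual does not drop below tol within max_iter steps. */
int mixed_precision_solve(double a[N][N], const double b[N], double x[N],
                          int max_iter, double tol)
{
    float lu[N][N], w[N];
    int piv[N];

    for (int i = 0; i < N; i++)                     /* DOUBLE -> SINGLE copy */
        for (int j = 0; j < N; j++)
            lu[i][j] = (float)a[i][j];
    if (lu_factor_single(lu, piv) != 0) return -1;

    for (int i = 0; i < N; i++) w[i] = (float)b[i];
    lu_solve_single(lu, piv, w);                    /* first guess, SINGLE */
    for (int i = 0; i < N; i++) x[i] = w[i];

    for (int it = 0; it < max_iter; it++) {
        double r[N], rnorm = 0.0;
        for (int i = 0; i < N; i++) {               /* residual in DOUBLE */
            r[i] = b[i];
            for (int j = 0; j < N; j++) r[i] -= a[i][j] * x[j];
            if (absd(r[i]) > rnorm) rnorm = absd(r[i]);
        }
        if (rnorm <= tol) return it;
        for (int i = 0; i < N; i++) w[i] = (float)r[i];
        lu_solve_single(lu, piv, w);                /* correction, SINGLE */
        for (int i = 0; i < N; i++) x[i] += w[i];   /* update, DOUBLE */
    }
    return -1;
}
```

The O(n^3) factorization happens once, in SINGLE. Each refinement step costs only O(n^2), so for a reasonably well-conditioned matrix you can get a DOUBLE-quality answer while spending most of the flops at SINGLE speed. For badly conditioned problems the refinement may not converge, which is why the sketch gives up after `max_iter` steps.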
Please post any comments and questions, or email me at brockp@mlds-networks.com

