SC '08, Austin, Texas
Right now I am in my hotel for the last night before I go home from Austin, Texas, and this year's Supercomputing conference.
This year was not that bad. I was able to touch base with PGI about their 8.0 release, which adds support for GPGPUs from Fortran and for CCFF (Common Compiler Feedback Format). The GPGPU support looks simple, but it always does. I am sure it is not as easy as it looks, but it is going to be much simpler to work with GPU hardware from Fortran, and if the regular -Minfo and -Mneginfo flags work with it, they could be very useful. The CCFF stuff looks really good: lots of data about what the compiler did and what the code did when it ran. It turns out this information is held in XML, and they have published the format (they claim; I can't find it), so third-party tools should be able to support it. I for one harassed Allinea about adding support for this profile information to OPT and maybe DDT.
DFTs on GPUs. Two papers and one application. Both papers reached the same performance, and in both cases it was much better than the CUFFT library shipped with CUDA. I hope Nvidia (and ATI) look at these papers and incorporate the work. In many cases, optimizing for memory bandwidth on the card gave the best performance. This says to me that GPUs are already in the same position as CPUs: memory is the bottleneck, and only marginal improvements are left to be made in the compute engine.
In one of the papers the author looked at the total cost for an application they were running. This is great, because in my own testing I found that copying data between host memory and device memory was slow and just awful. Thus moving small parts of your app to GPUs does not help in most cases. Either your entire app needs to live on the card (well, the compute should; pre- and post-processing not so much) or don't even bother. In his case GPUs were worthwhile only with his wonderfully optimized FFT. With the stock CUFFT, forget it, no point: copying to and from the card crushed performance. Still promising. I am happy, though, that someone pointed out the real downfall of GPU hardware.
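To make the "whole app on the card or don't bother" argument concrete, here is a rough back-of-the-envelope sketch. It is not from either paper: the function name `gpu_total_seconds` and every number below are made up for illustration, and the model ignores latency and any overlap of copies with compute.

```python
def gpu_total_seconds(kernel_seconds, steps, bytes_per_copy,
                      link_bytes_per_second, copy_every_step):
    """Estimate wall time for `steps` GPU kernels plus host<->device copies.

    bytes_per_copy is the round-trip traffic (in and out) for one copy.
    If copy_every_step is False, the data is copied once and then stays
    on the card for all steps.
    """
    copy_seconds = bytes_per_copy / link_bytes_per_second
    copies = steps if copy_every_step else 1
    return steps * kernel_seconds + copies * copy_seconds


# Hypothetical numbers: the CPU needs 1.0 s per step, the GPU kernel 0.25 s,
# and each round trip moves 4 GB over a 2 GB/s link.
steps = 100
cpu_seconds = steps * 1.0

shuttling = gpu_total_seconds(0.25, steps, 4e9, 2e9, copy_every_step=True)
resident = gpu_total_seconds(0.25, steps, 4e9, 2e9, copy_every_step=False)

print(cpu_seconds, shuttling, resident)  # 100.0 225.0 27.0
```

With these invented numbers a kernel that is four times faster than the CPU still loses badly when the data is shuttled every step, and only wins once the data stays on the card.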
The devs of NAMD have added CUDA support for GPUs to their code. This is beta and not GA yet, but expect to see it soon. One of them pointed out something that I hadn't thought of: the best transfer performance between RAM and device memory comes from using pinned memory. This is memory that cannot be paged out. It lets RDMA-style applications be sure that the data is where it should be in memory and that the address has not moved (yes, the OS moves pages). Well, RDMA fabrics like IB also want pinned memory. Thus MPI stacks want to use pinned memory too, but they also unpin memory that is no longer needed. What happens when the card and MPI both work with the same buffers? The memory gets unpinned underneath one of them, and data goes into memory in the wrong place.
Now you can code around this, but regular research grad students should not have to worry about it. I talked to Jeff Squyres of Open MPI, and he was aware of the problem; Nvidia had already come to him earlier that day, before I did, to ask about it.
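One way to picture "coding around this" is a reference count on the pin: the buffer stays pinned until everyone who pinned it has let go. What follows is only a toy model of that bookkeeping. The `PinRegistry` class and the user names are hypothetical, it does not pin any real memory, and it is not how CUDA or Open MPI actually manage their registrations.

```python
class PinRegistry:
    """Toy model: a buffer counts as pinned while any user still holds it."""

    def __init__(self):
        self._holders = {}  # buffer id -> set of users that pinned it

    def pin(self, buffer_id, user):
        self._holders.setdefault(buffer_id, set()).add(user)

    def unpin(self, buffer_id, user):
        holders = self._holders.get(buffer_id, set())
        holders.discard(user)
        if not holders:
            # Last holder is gone, so the buffer may really be unpinned.
            self._holders.pop(buffer_id, None)

    def is_pinned(self, buffer_id):
        return buffer_id in self._holders


registry = PinRegistry()
registry.pin("buf", "cuda")
registry.pin("buf", "mpi")

registry.unpin("buf", "mpi")            # MPI is done with the buffer...
still_pinned = registry.is_pinned("buf")  # ...but the GPU side still holds it

registry.unpin("buf", "cuda")
finally_released = not registry.is_pinned("buf")
```

The failure described above is what you get without this kind of shared bookkeeping: each side tracks its own pins, so one side's unpin silently pulls the memory out from under the other.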
Here is the last bit of insight from the NAMD dev. Why can we not do RDMA between GPU memories in systems with multiple cards? Why can we not do RDMA from GPU memory across an RDMA fabric (think IB, iWARP) to another GPU or to RAM in a remote node? This is a great idea! From talking to Jeff, it should be doable, and it should help performance very much on the kinds of MPI+GPU applications we _will_ see in the future.

