DGEMM(): The Function Every HPC User Should Know
It's true: DGEMM() should be used by everyone! Note, though, that if your problem does not require doubles, you should use SGEMM() instead!
This post is a reply to an email I received from a student about my post on CUBLAS running on Nvidia graphics cards.
The Question
How can I run the Dgemm on GPUs, how to support the dgemm() on GPUs which only support the single precision. Thanks, if you have done that, could you please share the code with me? Thanks
The answer is you can't. The current crop of cards, including those branded for HPC-only use like the Tesla 8 series, can only do SINGLE. The reason is simple: graphics did not require DOUBLE, and the first-generation HPC-targeted cards were really re-branded workstation cards. There is also a real added cost: DOUBLE requires roughly twice the number of transistors. Many modern CPUs will run a SINGLE code up to twice as fast, because a vector register that holds a given number of DOUBLE values can hold twice as many SINGLE values, so each instruction works on twice as many numbers at a time. In other words, SINGLE gives real savings in both performance and silicon.
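To make the SINGLE versus DOUBLE trade-off concrete, here is a small Python sketch. It assumes a 128-bit vector register (the SSE width). The constant and the helper names `lanes` and `to_single` are mine, purely for illustration. It shows how many values of each type fit in one register, and what a number loses when it is rounded to SINGLE.

```python
import struct

# Assumption for illustration: a 128-bit SIMD register, as in SSE.
REGISTER_BITS = 128

SINGLE_BYTES = struct.calcsize("=f")  # IEEE 754 single: 4 bytes
DOUBLE_BYTES = struct.calcsize("=d")  # IEEE 754 double: 8 bytes


def lanes(value_bytes, register_bits=REGISTER_BITS):
    """How many values of the given size fit in one vector register."""
    return register_bits // (8 * value_bytes)


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("=f", struct.pack("=f", x))[0]


if __name__ == "__main__":
    print("singles per register:", lanes(SINGLE_BYTES))
    print("doubles per register:", lanes(DOUBLE_BYTES))
    # Rounding error introduced by storing 0.1 in SINGLE instead of DOUBLE.
    print("error storing 0.1 in single:", abs(to_single(0.1) - 0.1))
```

If your problem can live with the roughly seven significant digits that SINGLE keeps, the extra lanes are free performance. If it cannot, you need DOUBLE.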
Now, there was some talk in PLASMA (Parallel Linear Algebra Software for Multicore Architectures) about swapping in and out of SINGLE and DOUBLE. Look on page 5 for an example; in this case it was used on the IBM Cell BE. The Cell reaches over 100 GFLOPS in SINGLE, but instead of dropping to half that for DOUBLE it dropped to 14 GFLOPS! Thus the reasoning behind PLASMA's swapping in and out of SINGLE: do the bulk of the work in fast SINGLE and use DOUBLE only where the accuracy is needed. NOTE: The slow DOUBLE performance was resolved in the PowerXCell 8i CPU used in Roadrunner. I still think swapping precisions is a great idea.
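As I understand it, the swapping idea is what is usually called mixed-precision iterative refinement: solve the system in SINGLE, compute the residual in DOUBLE, solve for a correction in SINGLE again, and repeat. The sketch below is my own toy version, not PLASMA's code. SINGLE arithmetic is emulated by rounding every operation through `struct`, and the names `f32`, `solve_single`, `residual` and `refine` are hypothetical. A real implementation would factor the matrix once and reuse the factors instead of re-running the elimination on every pass.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("=f", struct.pack("=f", x))[0]


def solve_single(A, b):
    """Gaussian elimination with partial pivoting, each operation rounded to SINGLE."""
    n = len(b)
    # Augmented matrix [A | b], stored in emulated SINGLE.
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    # Back substitution, still in emulated SINGLE.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, x, b):
    """r = b - A*x, computed in DOUBLE (ordinary Python floats)."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iterations=5):
    """Mixed-precision iterative refinement: SINGLE solves, DOUBLE residuals."""
    x = solve_single(A, b)                           # cheap, low-accuracy first guess
    for _ in range(iterations):
        r = residual(A, x, b)                        # the accurate part, in DOUBLE
        d = solve_single(A, r)                       # correction, again in SINGLE
        x = [xi + di for xi, di in zip(x, d)]        # update accumulated in DOUBLE
    return x
```

For a reasonably well-conditioned matrix, a few passes are typically enough to bring the answer from SINGLE accuracy up to roughly DOUBLE accuracy, while the expensive solves all run at SINGLE speed.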
So the full answer to running DGEMM() (Double-precision General Matrix Multiply) on current graphics cards is that you can't. But don't give up hope: keep working with CUBLAS, and only use DOUBLE if you need to. Even then, keep CUBLAS and similar projects in mind, because the Nvidia Tesla 10 series introduces DOUBLE support, which should make all this pain go away.
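For anyone who has not met it yet, DGEMM() computes C = alpha*A*B + beta*C on double-precision matrices. The plain-Python loops below are only a reference sketch of that operation. The function name `dgemm_reference` is hypothetical, and I have left out the transpose flags and leading-dimension arguments that the real BLAS routine takes. A tuned BLAS gets its speed from blocking, vectorization and threading, none of which is shown here.

```python
def dgemm_reference(alpha, A, B, beta, C):
    """Update C in place with alpha*A*B + beta*C and return it.

    A is m x k, B is k x n, C is m x n, each given as a list of rows.
    """
    m, k, n = len(A), len(B), len(B[0])
    shapes_ok = (
        all(len(row) == k for row in A)
        and all(len(row) == n for row in B)
        and len(C) == m
        and all(len(row) == n for row in C)
    )
    if not shapes_ok:
        raise ValueError("shape mismatch: need A (m x k), B (k x n), C (m x n)")
    for i in range(m):
        for j in range(n):
            # Dot product of row i of A with column j of B.
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

SGEMM() is the same operation on SINGLE data, which is why switching between the two is mostly a matter of changing the data type.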
The short-term solution for high-performance DGEMM() is GotoBLAS. There is a threaded version that can take advantage of multicore and SMP systems. I have gotten my best HPL numbers using Goto, and it is a great tool. If you're not using one of the common platforms Goto was built for, the vendor-provided libraries (ESSL, ACML, MKL, etc.) work great, and many are threaded. Core 2 and Barcelona with modern BLAS libraries are twice as fast per clock as previous generations.
Sorry there is no good solution, but I hope I gave you enough tools to get by until Tesla 10 hardware is out. If you have questions, email me. I am also available for consulting if you need someone to get their hands dirty.
Comments welcome.
Brock E. Palen

