CUDA Blas
I am a huge fan of BLAS and I wish more people used it. I have not worked out any numbers, but I think U of M (my day job) might waste enough money per year to fund a position teaching faculty and researchers to use BLAS, CUDA and NAG/IMSL. For example, it is easy to show that on the same hardware, calling DGEMM() instead of writing some DO loops can run 10x faster while consuming the same capital resources (computer, facility space) and consumables (power, cooling). This problem will only get worse as more and more research computing happens at the University.
So once you have users using BLAS and friends, what is the next big leap in performance one can extract from hardware? Nvidia has a great option called CUBlas. It is part of their CUDA toolkit for doing general-purpose computing on Nvidia graphics cards.
I was able to port a simple matrix multiply code to CUBlas in an hour, so most codes that depend on matrix multiplication should find they can use CUBlas quickly and easily.
Calculations on graphics cards have to be done from the video buffer memory on the card. In my case this was 256MB, so I could not do very large problems. This should not be much of an issue, as it is easy to copy data from the host memory (RAM) to the card memory (video buffer). Also, for larger problems I think some of the out-of-core techniques used by codes like Pardiso and DGEMM() could be ported to GPUs, using the system RAM where we used disk in the past and treating the card memory as the RAM of old.
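The out-of-core idea above can be sketched in plain C. This is only an illustration under stated assumptions, not code from the original port: the names blocked_sgemm, tile_gemm_acc and max_tile_edge are hypothetical, matrices are square and column-major as in BLAS, and the inner tile multiply runs on the host as a stand-in for what would be a copy to the card, a cublasSgemm call with beta = 1, and a copy back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the on-card step: C_tile += A_tile * B_tile.
   Column-major storage with leading dimensions, as in BLAS. */
static void tile_gemm_acc(int m, int n, int k,
                          const float *a, int lda,
                          const float *b, int ldb,
                          float *c, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p) {
            float bpj = b[p + (size_t)j * ldb];
            for (int i = 0; i < m; ++i)
                c[i + (size_t)j * ldc] += a[i + (size_t)p * lda] * bpj;
        }
}

static int min_int(int x, int y) { return x < y ? x : y; }

/* C = A * B for n x n column-major matrices held in host RAM, processed in
   tile x tile blocks so that only three tiles are needed at any one time. */
void blocked_sgemm(int n, int tile, const float *a, const float *b, float *c)
{
    assert(n > 0 && tile > 0);
    for (size_t i = 0; i < (size_t)n * n; ++i)
        c[i] = 0.0f;

    for (int j0 = 0; j0 < n; j0 += tile) {
        int nb = min_int(tile, n - j0);
        for (int i0 = 0; i0 < n; i0 += tile) {
            int mb = min_int(tile, n - i0);
            for (int p0 = 0; p0 < n; p0 += tile) {
                int kb = min_int(tile, n - p0);
                /* On a GPU, this is where the three tiles would be moved
                   to and from the card. */
                tile_gemm_acc(mb, nb, kb,
                              a + i0 + (size_t)p0 * n, n,
                              b + p0 + (size_t)j0 * n, n,
                              c + i0 + (size_t)j0 * n, n);
            }
        }
    }
}

/* Largest tile edge such that three square float tiles fit in card_bytes.
   Assumes the whole amount is usable, which a card driving a display
   will not quite manage. */
int max_tile_edge(size_t card_bytes)
{
    size_t per_tile = card_bytes / (3 * sizeof(float));
    int t = 0;
    while ((size_t)(t + 1) * (size_t)(t + 1) <= per_tile)
        ++t;
    return t;
}
```

With 4-byte floats, max_tile_edge(256u * 1024 * 1024) gives 4729, so a 256MB card could hold three tiles of at most roughly 4700 x 4700 each. A real version would also want to keep the C tile on the card across the innermost loop to cut down on transfers.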
OK, technical details. The basic form was: allocate memory on the card, copy from host memory to card memory, call the function, and copy the results back:
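#include <cublas.h>
//status of the last call; compare against CUBLAS_STATUS_SUCCESS after each call
cublasStatus CUstatus;
//pointers to memory on the card
float* d_A = 0;
float* d_B = 0;
float* d_C = 0;
//start up cublas
CUstatus = cublasInit();
//allocate DIM*DIM floats on the card for each matrix
CUstatus = cublasAlloc(DIM*DIM, sizeof(d_A[0]), (void**)&d_A);
CUstatus = cublasAlloc(DIM*DIM, sizeof(d_B[0]), (void**)&d_B);
CUstatus = cublasAlloc(DIM*DIM, sizeof(d_C[0]), (void**)&d_C);
//copy the host arrays a, b and c to the card
CUstatus = cublasSetVector(DIM*DIM, sizeof(a[0]), a, 1, d_A, 1);
CUstatus = cublasSetVector(DIM*DIM, sizeof(b[0]), b, 1, d_B, 1);
CUstatus = cublasSetVector(DIM*DIM, sizeof(c[0]), c, 1, d_C, 1);
//C = alpha*A*B + beta*C with no transposes, column-major storage
//for square DIM x DIM matrices M, N, K, lda, ldb and ldc are all DIM
cublasSgemm('n','n', M, N, K, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);
//cublasSgemm returns nothing, so fetch its status separately
CUstatus = cublasGetError();
//copy back
CUstatus = cublasGetVector(DIM*DIM, sizeof(c[0]), d_C, 1, c, 1);
//free memory on the card
CUstatus = cublasFree(d_A);
CUstatus = cublasFree(d_B);
CUstatus = cublasFree(d_C);
//shutdown cublas
cublasShutdown();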
Quite simple. It's all a C/FORTRAN library, so you need nothing special other than an Nvidia card and the CUBlas library. If you want the full source, email me at: brockp@mlds-networks.com
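One way to gain some confidence in a port like this is to compare the matrix copied back from the card against plain loops on the host. The sketch below is an assumption about how that check might look, not part of the original source: sgemm_ref and max_abs_diff are hypothetical names, and the reference covers only the no-transpose, column-major case used in the cublasSgemm call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host reference: C = alpha*A*B + beta*C, no transposes,
   column-major storage. A is m x k, B is k x n, C is m x n. */
void sgemm_ref(int m, int n, int k, float alpha,
               const float *a, int lda,
               const float *b, int ldb,
               float beta, float *c, int ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
            float *cij = &c[i + (size_t)j * ldc];
            /* As in BLAS, C is not read when beta is zero. */
            *cij = (beta == 0.0f) ? alpha * sum : alpha * sum + beta * (*cij);
        }
}

/* Largest absolute element-wise difference between two arrays. */
float max_abs_diff(const float *x, const float *y, size_t count)
{
    float worst = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

In use, one would keep a copy of the original c, run sgemm_ref on it with the same alpha and beta, and then call max_abs_diff on that copy and the array filled by cublasGetVector. The two results should agree to within single-precision rounding rather than exactly, since the card may sum the products in a different order.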

