BLAS Performance on AMD Opteron
I have been putting together a course on using the BLAS on our CAC machines. The general theory is that BLAS level 2 is faster than level 1, and level 3 is faster than level 2. In my measurements, though, this was not the case, which surprised me.
As can be seen, DDOT() is faster in Mflop/s than DGEMV(), and I am not sure why. In the case where I compare ACML to plain C code, the compiler-generated code is faster for DGEMV(), as expected, but it is still much slower than the slowest ACML/BLAS call. I am still confused about why this happened.
In any case, users doing any type of math should take note of the Pentium Pro line. It marks the peak performance of that 1995 CPU, and the C routine reaches only that level of performance. This is an example of what many HPC admins already know: use the BLAS or your performance will be awful.
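For readers who want to reproduce this kind of comparison, here is a minimal sketch of the plain C side and the Mflop/s arithmetic. It assumes the usual flop counts of 2n for DDOT and 2mn for DGEMV. The names ref_ddot, ref_dgemv, and mflops are hypothetical helpers for illustration, not the ACML API and not the benchmark code used for the measurements above. The matrix is stored row-major here, unlike Fortran BLAS.

```c
#include <stddef.h>

/* Hypothetical reference dot product: n multiplies + n adds = 2*n flops. */
double ref_ddot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Hypothetical reference y = A*x for a row-major m-by-n matrix:
   one multiply and one add per matrix element = 2*m*n flops. */
void ref_dgemv(size_t m, size_t n, const double *a,
               const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Convert a flop count and an elapsed time in seconds into Mflop/s. */
double mflops(double flops, double seconds)
{
    return flops / seconds / 1.0e6;
}
```

To use it, time many repetitions of each routine and pass the total flop count (for example 2.0 * n * repetitions for ref_ddot) together with the elapsed seconds to mflops(). Note that both kernels do only a constant number of flops per double loaded from memory, so neither gets the data reuse that a level 3 routine can. Whether that accounts for the numbers measured here is only a guess.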
memcpy vs xCOPY()
C has a function, memcpy, that copies from one memory location to another. As a way of quickly copying arrays, is memcpy on par with the hardware-specific BLAS level 1 call xCOPY()? When it comes to Fortran, I do not know of a memcpy equivalent, so xCOPY() may be the only option there. Lastly, xCOPY() had better not be any worse than memcpy, because they do the same thing.
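To make the comparison concrete, here is a minimal sketch of the two copies side by side. ref_dcopy is a hypothetical stand-in with DCOPY-like semantics, restricted to positive strides. It is not the ACML routine, and memcpy_copy is just a thin illustrative wrapper. The two do the same thing only when both strides are 1, since memcpy has no notion of a stride.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for DCOPY with positive strides only:
   copies n doubles from x (stride incx) into y (stride incy). */
void ref_dcopy(size_t n, const double *x, size_t incx,
               double *y, size_t incy)
{
    for (size_t i = 0; i < n; i++)
        y[i * incy] = x[i * incx];
}

/* Unit-stride copy of n doubles via memcpy.
   memcpy takes a size in bytes, and x and y must not overlap. */
void memcpy_copy(size_t n, const double *x, double *y)
{
    memcpy(y, x, n * sizeof *x);
}
```

A fair timing comparison would call both with incx = incy = 1 on arrays too large to fit in cache, and swap in the real xCOPY() from the BLAS library in place of ref_dcopy.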
If anyone has data on this comparison let me know.
On a side note, if anyone wants the code and the LaTeX that goes with it, use:
git clone http://www.umich.edu/~brockp/git-repo/cac-docs.git
cd cac-docs
git checkout --track -b fastmath origin/fastmath

