Posting Audio on YouTube with a Background Image

I was going to post my podcast on YouTube; maybe someone random would run across it, so why not.

This process had one need: I had to make a video out of audio. To make it less boring I used my RCE logo as a static background image. For this I went to my usual tools, mencoder and ffmpeg. It turns out mencoder did what I needed, though not easily.

Mencoder has an easy way to make a video out of a sequence of images. If you use just one image, you get a 1 second video with 1 frame (fps set to 1). To get around this, you could just set the frames per second to 1/runtime of your mp3. But while mencoder will encode that, things like QuickTime crash on the result (including Finder, which I am proud of breaking).

The solution (given to me by roo on the IRC channel) was to make that 1 second video, concatenate 5 copies of it to make a 5 second video, concatenate 5 copies of that to make a 25 second video, and so on. Silly, I know, but it worked. I made a small (dumb) script that automates this. It uses madtime to find the time needed, some math, mktemp, and you’re on your way.
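To see how many passes that trick takes, here is a minimal sketch of the math only. It is not the original buildyoutube.sh: the function name rounds_needed and the one-hour example are made up for illustration, and it assumes the audio length in whole seconds is already known.

```shell
#!/bin/sh
# Sketch only: this is not the original buildyoutube.sh and it never calls mencoder.
# rounds_needed TARGET_SECONDS COPIES
# Prints how many "concatenate COPIES copies" passes turn a 1 second clip
# into one at least TARGET_SECONDS long, followed by the resulting length.
rounds_needed() {
    target=$1
    copies=$2
    # With fewer than 2 copies per pass the clip would never grow.
    [ "$copies" -ge 2 ] || { echo "copies must be at least 2" >&2; return 1; }
    length=1   # the single-frame clip is 1 second long
    rounds=0
    while [ "$length" -lt "$target" ]; do
        length=$((length * copies))
        rounds=$((rounds + 1))
    done
    echo "$rounds $length"
}

# Hypothetical one-hour episode (3600 s), 5 copies per pass.
rounds_needed 3600 5
```

For a one-hour show this prints 6 15625: six passes, giving a 15625 second video, which overshoots and would still have to be cut down to the length of the audio.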

Download the script from the usual location: buildyoutube.sh. And see the results on my ….., well, you would, except that after I went through all this work and went to upload to YouTube, I remembered that they don’t allow public videos more than 10 minutes long. Damn.

Anyone know how to get more than 10 minutes allowed on YouTube?

Podcast Hits Goal

The last show, The Torque Resource Manager, hit over 100 downloads from iTunes in its first week. The total download count for it stands at 140 as of May 7th.

I have to thank Jeff for all the help, and for being the better voice on the air. The next show should be good: we had the most technical problems we have ever had, but the content is good. I never knew HDF5 was used in the Lord of the Rings movie.

Shows are very heavy on the technical, but if you enjoy that kind of stuff check it out: www.rce-cast.com

Brock Palen

New Computing Podcast! RCE-Cast.com

My new podcast is going well. Turnout is much better than I expected, and I only expect it to get better as word spreads through more mailing lists. The website is www.rce-cast.com and you can subscribe via iTunes.

I am warning you: this podcast is very technical. It is meant for researchers and HPC administrators. Don’t be surprised if words are said that you don’t understand. If you listen you will hear me be corrected by my guests a lot. That is because I don’t know what I am talking about either! That would be why I am doing this podcast.

All for now: Brock Palen

HPC Podcast — Need Creative Name

I am going to register the domain for the HPC podcast soon; everything is coming together. I need a name that is easy to remember. An acronym would be nice, since it is simpler for people to say “Are you listening to this week’s ‘blah’?”

Please put your ideas in the comments, or email/message me. To give you an example of how bad I am at this, the only name I can come up with is “HPC Week”. Thanks.

Multi-Core Is Bad for Science

That is right: multi-core CPUs, while great for desktops and laptops that run many background processes, are bad for scientific simulations. I will present data for this using NAMD 2.6.

In the old days CPU builders focused on increasing single-core speed. Most systems had only one core, and only high-end servers had two cores, which lived in two sockets. Many research applications (like NAMD, or the better-known Gromacs, which is used in Folding@Home) were already built to run on multiple CPUs using MPI. MPI allowed researchers to tie together the power of many smaller systems to reach supercomputer performance, better than that available on purpose-built supercomputers. Now even such purpose-built machines use commodity processors and MPI to reach new performance numbers.

To MPI a multi-core CPU looks like N CPUs, where N is the number of cores. Thus now with modern quad-core CPUs users can run ‘mpirun -np 8 namd2’ (run NAMD on 8 CPUs/cores). So what is bad about having many cores? CPU builders are rapidly increasing the total performance available in the same 1U (1.75 inch) systems by adding cores to each socket. While it is great that a box with 8 cores in total has a lot of performance, in the past that extra performance came from single-core improvements. Thus serial codes (those that cannot use multiple cores) benefited, and MPI codes benefited too.

With multi-core, CPU builders have been lowering the performance of individual cores. That is, serial applications, or applications with serial portions, will now run slower. Look at the plot of NAMD running on a cluster of AMD CPUs.

The CPU types are dual-core Opteron 2218s at 2.6 GHz and quad-core Opteron 2356 “Barcelonas” at 2.4 GHz. The Barcelona is AMD’s current (as of 12/2008) CPU, and the data shows that on 1 core the 2218 is faster. So if NAMD were a serial code, the older 2218 would be the better choice. In the case of NAMD, which scales fine to 8 cores, we see that having quad core (if that is the only way to get more performance in a box) is OK. Remember that 4 cores total of 2218s cost about the same as 8 cores of 2356s, because of the dual vs. quad issue. This does not include the cost of power, rack, network, etc. Thus for the same cost the 2356 is better: on 8 cores NAMD reaches 0.45 days/nanosecond, while the 4 cores of the 2218 reach only 0.70 days/nanosecond. So on cost/performance at these small core counts, the 2356 is a great deal for small labs running parallel codes up to a few tens of cores.
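To put a number on that comparison, here is a small sketch of the arithmetic using only the two figures quoted above. The helper name speedup is hypothetical, and awk is assumed to be available for the floating-point division.

```shell
#!/bin/sh
# speedup OLD NEW
# Both arguments are run times in days per nanosecond of simulation
# (lower is better). Prints how many times faster NEW is than OLD.
speedup() {
    LC_ALL=C awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", old / new }'
}

# 4 cores of Opteron 2218 (0.70 days/ns) vs 8 cores of Opteron 2356 (0.45 days/ns)
speedup 0.70 0.45
```

This prints 1.56, so for roughly the same money the quad-core box finishes the same simulation about one and a half times faster.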

That is fine for many labs, and they will benefit greatly. The problem, and the reason multi-core is bad for science, is at the margin. Somewhere some researcher is trying to run NAMD not at 32 CPUs as in the plot, but at 2048 or 4096. NAMD will have a hard time reaching this limit. Many codes will see no speed improvement from 32 cores and up. Scaling often comes down to network performance and memory bandwidth. Some applications cannot be made parallel at all! Thus if the user above reached a given performance on 2218s at 2048 cores, he will need many more 2356 (the newer and better CPU) cores to reach the performance he had before. That may not be possible.
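One rough way to see why it may not be possible is Amdahl's law, which is an assumption brought in here for illustration and not something measured in the plot. The sketch assumes run time on N cores is proportional to (f + (1 - f)/N) / s, where f is the serial fraction of the code and s is the speed of a new core relative to an old one. The function name cores_needed and the values of f and s are made up, not taken from the NAMD data.

```shell
#!/bin/sh
# cores_needed SERIAL_FRACTION OLD_CORES CORE_SPEED
# Under the Amdahl-style model described above, prints the smallest number of
# new cores (each CORE_SPEED times as fast as an old core) that matches the
# run time reached on OLD_CORES old cores, or "impossible" if none can.
cores_needed() {
    LC_ALL=C awk -v f="$1" -v n="$2" -v s="$3" 'BEGIN {
        # Run time on the old cores, with one old core defined as speed 1.
        t_old = f + (1 - f) / n
        # Time left for the parallel part once the slower serial part is paid.
        budget = s * t_old - f
        if (budget <= 0) { print "impossible"; exit }
        x = (1 - f) / budget
        c = int(x)
        if (c < x) c++          # round up to a whole core
        print c
    }'
}

# Hypothetical: new cores run at 90% of the old per-core speed.
cores_needed 0     2048 0.9
cores_needed 0.001 2048 0.9
cores_needed 0.01  2048 0.9
```

With these made-up numbers, a perfectly parallel code needs 2276 cores instead of 2048, a code that is 0.1% serial needs 2947, and a code that is 1% serial can never get back to its old run time no matter how many of the slower cores are added.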

As CPU builders add more and more cores and individual core speed drops, many researchers will find themselves needing new ways to make applications that worked great at smaller core counts scale further, just to maintain performance on new hardware.

There are limitations to scaling. For example, the simulation above was ~29,000 atoms. Atoms in the simulation are spread across cores, thus my upper limit is 29,000 cores. Not good. We need individual core speed to increase and to be easily accessible to the programmer. More on that later.

HPC Podcast and Happy Thanksgiving

First I must say Happy Thanksgiving to everyone. This is one of my favorite holidays. Why? I love food, especially good traditional home-cooked food, and there is no better way to get it than Thanksgiving. It means even more now that I have been living on my own for the last 6 years.

Podcast

I have been kicking around the idea of starting a podcast on High Performance Computing (HPC). This is a topic I am intimately familiar with, but I have a lot to learn. My goal would be for sysadmins like myself to learn from other sites how they run their systems, their funding models, and the management software they use. This would be balanced with user applications. These talks would be for the graduate student, research scientist, and faculty member who would like to hear about tools for developing HPC and Terascale applications, as well as tools for those who wish to do some form of science.

In short I am looking for a co-host and for software projects to talk to. For a co-host I am looking for someone who is familiar with HPC, but less from the admin perspective (like myself) and more from the science side. Remember, these applications will range from large FEA and CFD applications to the biological sciences. Curiosity, a good speaking voice, and a large science background would be helpful for asking good questions, so that listeners who are in those fields will get the answers they are looking for.

The show would be modeled after FLOSS Weekly. My own company will have a site up in the future. Shows will be between 30 minutes and 1 hour long, and will start at once every two weeks. I don’t know when I will get this started; it mostly hinges on finding a co-host.

And trust me, I speak much better than I write. Co-hosts do not need to be close to Ann Arbor; all interviews will be done over Skype. People interested in being co-hosts should email me at: brockp@mlds-networks.com and people who have projects to nominate to be featured can either comment or email. Feel free to mention your own project.

SC ’08, Austin, Texas

Right now I am in my hotel for the last night before I go home from Austin, Texas, and this year’s Supercomputing conference.

This year was not that bad. I was able to touch base with PGI about their 8.0 release, which has support for GPGPUs from Fortran and for CCFF (Common Compiler Feedback Format). The GPGPU support looks simple, but it always does. I am sure it is not as easy as it looks, but it is going to be much simpler to work with GPU hardware from Fortran, and if the regular -Minfo and -Mneginfo flags work with it, it could be very useful. The CCFF stuff looks really good: lots of data about what the compiler did and what the code did when it ran. Turns out this information is held in XML, and they have published the format (they claim; I can’t find it), so third-party tools should be able to support it. I for one harassed Allinea about adding support for this profile information to OPT and maybe DDT.

DFTs on GPUs: two papers and one application. Both papers reached the same performance, in both cases much better than the CU-FFT support shipped with CUDA. I hope Nvidia (and ATI) look at these papers and adopt their approach. In many cases, optimizing for bandwidth on the card provided the best performance. This says to me that GPUs are already in the same situation as CPUs: memory is the bottleneck, and only marginal improvements need to be made in the compute engine.

In one of the papers the author looked at the total cost for an application they were running. This is great, because in my own testing I found that copying data between real memory and device memory was slow and just awful. Thus moving small parts of your app to GPUs does not help in most cases. Either your entire app needs to live on the card (well, the compute should; pre- and post-processing not so much) or don’t even bother. In his case GPUs were good only when using his wonderfully optimized FFT. With the stock CU-FFT, forget it, no point: the copy to and from the card crushed performance. Still promising. I am happy, though, that someone pointed out the real downfall of GPU hardware.

The devs of NAMD have added CUDA support for GPUs to their code. This is beta and not GA yet; expect to see it soon. One dev pointed out something that I hadn’t thought of: the best performance between RAM and device memory comes from using pinned memory. This is memory that cannot be paged out. It lets RDMA-style applications be sure that the data is where it should be in memory and that the address has not moved (yes, the OS moves pages). Well, RDMA fabrics like IB also want pinned memory. Thus MPI stacks want to use pinned memory too, but they also unpin memory that is no longer needed. What happens when the card and MPI both work with these buffers? The memory gets unpinned, and data goes into the wrong place in memory.

Now you can code around this, but regular research grad students should not have to worry about it. I talked to Jeff Squyres of Open MPI and he was aware of this; Nvidia had already come to him that day, before I did, to ask about it.

Here is the last bit of insight from the NAMD dev. Why can we not do RDMA between GPU memory in systems with multiple cards? Why can we not do RDMA from GPU memory across an RDMA fabric (think IB, iWARP) to another GPU or RAM in a remote node? This is a great idea! According to Jeff it should be doable, and it should help performance very much on the types of MPI+GPU applications we _will_ see in the future.

ParMetis on OS X 10.3.x

Building ParMetis

I expect this will also work on 10.4.x and 10.5.x, on both PPC and Intel, though I have not tried it. This was done using OpenMPI.

tar -xzvf ParMetis-3.1.tar.gz
cd ParMetis-3.1

Edit Makefile.in and make sure CC and LD are set to mpicc. Set INCDIR and LIBDIR to nothing.

make

This make will fail with ‘cant find malloc.h’; malloc.h is in /usr/include/sys/. Re-open Makefile.in, set INCDIR = -I/usr/include/sys/ and run make again:

make

For the impatient

tar -xzvf ParMetis-3.1.tar.gz
cd ParMetis-3.1
# edit Makefile.in and set:
#   CC = mpicc
#   LD = mpicc
#   INCDIR =
#   LIBDIR =
make    # fails: cannot find malloc.h
# edit Makefile.in again and set:
#   INCDIR = -I/usr/include/sys/
make
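
The Makefile.in edits above can also be scripted. This is only a sketch: it assumes Makefile.in contains lines of the form NAME = value, starting in the first column, for CC, LD, INCDIR and LIBDIR, which I have not checked against every ParMetis release. The helper name set_make_var is made up, and the sketch sets the final INCDIR right away instead of letting the first make fail.

```shell
#!/bin/sh
# set_make_var FILE NAME VALUE
# Rewrites the line "NAME = ..." in FILE so that it reads "NAME = VALUE".
# VALUE must not contain the characters | & or \ (they are special to sed).
set_make_var() {
    file=$1
    name=$2
    value=$3
    tmp=$(mktemp "${TMPDIR:-/tmp}/makevar.XXXXXX") || return 1
    # sed -i is not portable between GNU and BSD (OS X) sed, so go through
    # a temporary file instead.
    sed "s|^${name}[[:space:]]*=.*|${name} = ${value}|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Run from inside the unpacked ParMetis-3.1 directory.
if [ -f Makefile.in ]; then
    set_make_var Makefile.in CC mpicc
    set_make_var Makefile.in LD mpicc
    set_make_var Makefile.in INCDIR "-I/usr/include/sys/"
    set_make_var Makefile.in LIBDIR ""
    make
fi
```

The pattern is anchored on the full variable name followed by =, so a line such as a hypothetical CCFLAGS = ... would be left alone when CC is rewritten.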

DCOPY() vs memcpy() vs C code?? ACML slacks?

It is not uncommon to have to copy around arrays and data in HPC applications. There are three ways you can do this:

  • string.h memcpy()
  • BLAS1’s DCOPY()
  • C code and let the compiler optimize it

Of these three I expected DCOPY() to be the fastest, then memcpy(), and last the C code. Oh, was I wrong. I ran the STREAM benchmark on an AMD Opteron 2218 using the PGI 7.2 compilers:

  • memcpy() 3056.8 MB/s
  • DCOPY() 5727.4 MB/s
  • C code 5737.7 MB/s

So memcpy() is much slower than I thought. It is about the same speed as the C code when I optimize the crap out of it with the GNU C compiler, but we are not using the GNU compiler. DCOPY() reaches its full speed with either the PGI or the GNU compiler, which is good to know for portability, so I still recommend using DCOPY() over memcpy().

The PGI compiler is what produced the great performance of the C code. Turns out the compiler is too smart for us:

pgcc stream.c -fastsse -Minline -Minfo
main:
   164, Generated vector sse code for inner loop
   181, Generated vector sse code for inner loop
        Generated 1 prefetch instructions for this loop
   208, Memory copy idiom, loop replaced by call to __c_mcopy8
   218, Generated vector sse code for inner loop

Notice how on line 208 the memory copy was replaced by a call to __c_mcopy8. Looks like PGI is smart and has their own high speed calls to do similar operations built into their compiler. Nice work guys.

Where ACML starts to slack is in its use of AMD’s multiple memory controllers. Its DCOPY() is not threaded, and thus the OpenMP version does not run any faster than the single-threaded version, while the compiler-generated code gets even faster with more memory controllers!

  • DCOPY() 2 threads: 4525.8 MB/s
  • C code 2 threads: 10934.6 MB/s

As you can see, ACML has room for improvement in making use of the multiple memory controllers available on the Opteron platform. Good compilers can do operations as simple as copying arrays of doubles faster using:

 #pragma omp for

Again: test, test, test.

BLAS Performance on AMD Opteron

I have been putting together a course on using the BLAS on our CAC machines. The general theory has been that BLAS 2 is faster than BLAS 1 and BLAS 3 is faster than BLAS 2. In my measurements, though, this was not the case. I was very surprised by this.

As can be seen, DDOT() is faster in Mflop/s than DGEMV(); I am not sure why this is. In the case where I compare ACML to C code, the compiler does make faster code for DGEMV(), as expected, but it is still much slower than the slowest ACML/BLAS call. In any case I am still confused about why this happened.

Users doing any type of math should be aware of the Pentium Pro line. This is the peak performance of that 1995 CPU, and the C routine reaches only that level of performance. It is an example of what many HPC admins already know: use the BLAS or your performance will be awful.

memcpy vs xCOPY()

C has a function memcpy that will copy from one memory location to another. As a way of quickly copying arrays from one memory location to another, is memcpy on par with the hardware-specific BLAS1 call xCOPY()? When it comes to Fortran I do not know of a memcpy, so xCOPY() may be the only option there. Lastly, xCOPY() had better not be any worse than memcpy, because they do the same thing.
If anyone has data on this comparison let me know.

On a side note, if anyone wants the code and the LaTeX to go with it, use:

git clone http://www.umich.edu/~brockp/git-repo/cac-docs.git
cd cac-docs
git checkout --track -b fastmath origin/fastmath
