SC ’08, Austin, Texas

Right now I am in my hotel for the last night before I go home from Austin, Texas, and this year’s Supercomputing conference.

This year was not that bad. I was able to touch base with PGI about their 8.0 release, which adds support for GPGPUs from Fortran and for CCFF (Common Compiler Feedback Format). The GPGPU support looks simple, but it always does. I am sure it is not as easy as it looks, but it should make working with GPU hardware from Fortran much simpler, and if the regular -Minfo and -Mneginfo flags work with it, it could be very useful. The CCFF stuff looks really good: lots of data about what the compiler did and what the code did when it ran. It turns out this information is held in XML, and they say they have published the format (I can’t find it), so third-party tools should be able to support it. I, for one, harassed Allinea about adding support for this profile information to OPT and maybe DDT.

DFTs on GPUs: two papers and one application. Both papers reached the same performance, in both cases much better than the CUFFT library shipped with CUDA. I hope Nvidia (and ATI) look at these papers and adopt their techniques. In many cases, optimizing for memory bandwidth on the card gave the best performance. This says to me that GPUs are already in the same position as CPUs: memory is the bottleneck, and only marginal improvements remain to be made in the compute engine.

In one of the papers the author looked at the total cost for an application he was running. This is great, because in my own testing I found that copying data between host memory and device memory was painfully slow. Thus moving small parts of your app to GPUs does not help in most cases. Either your entire app needs to live on the card (well, the compute should; pre- and post-processing not so much) or don’t even bother. In his case GPUs were worthwhile only when using his own optimized FFT. With the stock CUFFT, forget it: copying to and from the card crushed performance. Still promising. I am happy, though, that someone pointed out the real downfall of GPU hardware.

The devs of NAMD have added CUDA support for GPUs to their code. This is beta and not GA yet; expect to see it soon. One of them pointed out something I hadn’t thought of: the best performance between RAM and device memory comes from using pinned memory. This is memory that cannot be paged out. Pinning lets RDMA-style applications be sure that the data is where they expect it in memory and that its address has not moved (yes, the OS moves pages). Well, RDMA fabrics like IB also want pinned memory. Thus MPI stacks want to use pinned memory too, but they also un-pin memory that is no longer needed. What happens when the card and MPI both work with the same buffers? The buffer gets un-pinned, and data goes into the wrong place in memory.

Now you can code around this, but regular research grad students should not have to worry about it. I talked to Jeff Squyres of OpenMPI and he was already aware of the problem; Nvidia had come to him earlier that day, before I did, and asked about it.

Here is the last bit of insight from the NAMD dev. Why can we not do RDMA between GPU memories in systems with multiple cards? Why can we not do RDMA from GPU memory across an RDMA fabric (think IB, iWARP) to another GPU or to RAM in a remote node? This is a great idea! From talking to Jeff, it should be doable, and it should help performance very much on the types of MPI+GPU applications we _will_ see in the future.
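The pinning idea can be tried at the OS level without any GPU or IB hardware. What follows is a minimal sketch in C, assuming a POSIX system; it uses mlock() as a stand-in and is not how CUDA or an MPI stack actually registers memory. The helper names alloc_pinned and free_pinned are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate `bytes` of page-aligned memory and pin it with mlock() so the
 * OS cannot page it out. Returns the buffer, or NULL if allocation or
 * pinning failed (for example when RLIMIT_MEMLOCK is too small).
 * alloc_pinned is a hypothetical helper name, not a library call. */
void *alloc_pinned(size_t bytes)
{
    void *buf = NULL;
    long page = sysconf(_SC_PAGESIZE);

    assert(bytes > 0);
    if (page <= 0)
        return NULL;
    if (posix_memalign(&buf, (size_t)page, bytes) != 0)
        return NULL;
    if (mlock(buf, bytes) != 0) {   /* pin: pages stay resident */
        free(buf);
        return NULL;
    }
    memset(buf, 0, bytes);
    return buf;
}

/* Un-pin and release a buffer obtained from alloc_pinned(). After
 * munlock() the OS is free to page the memory out again, which is
 * exactly the state a second user of the buffer must not be left in. */
void free_pinned(void *buf, size_t bytes)
{
    if (buf != NULL) {
        munlock(buf, bytes);
        free(buf);
    }
}
```

The conflict described above is what you would get if two libraries each called the equivalent of free_pinned() on their own schedule for the same buffer: the first one to un-pin pulls the guarantee out from under the other.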

ParMetis on OSX-10.3.x

Building ParMetis

I expect this will also work on 10.4.x and 10.5.x for PPC and Intel, though I have not tried it. This was done using OpenMPI.

tar -xzvf ParMetis-3.1.tar.gz
cd ParMetis-3.1

Edit Makefile.in and make sure CC and LD are set to mpicc. Set INCDIR and LIBDIR to nothing.

make

This make will fail with ‘can’t find malloc.h’; malloc.h is in /usr/include/sys/. Re-open Makefile.in and set INCDIR = -I/usr/include/sys/

make

For the impatient:

tar -xzvf ParMetis-3.1.tar.gz
cd ParMetis-3.1
# in Makefile.in set:
#   CC = mpicc
#   LD = mpicc
#   INCDIR =
#   LIBDIR =
make
# make fails on malloc.h; in Makefile.in change:
#   INCDIR = -I/usr/include/sys
make

Lustre 1.6.5.1 and Sun X4500’s

Just finished installing two Sun X4500s, AKA Thumpers. Each system has 48 1 TByte drives (yes, 1000 Gig, not 1024 Gig) and 16 GB of RAM. The intention was to run Lustre on them and move data over from our old 2.7 TByte Lustre 1.6.4 setup.

Today we finished the install. Final usable disk space is 49 TBytes, with 8 Gbit/s of bandwidth provided by two sets of four 1 Gbit links bonded with 802.3ad. We have so much less real disk because the drives are broken into 14 RAID groups with spares and external journals. Oh, and don’t forget the two boot drives. We decided to go with external journals to help with the random nature of our workload. Our old disk system was an NFS mount provided by some proprietary data movers. It suffered not from raw performance but from the load of 2000+ CPUs asking for IO of different patterns to many different files at the same time. There just wasn’t a way to provide a single namespace with a simple NFS server.

We hope that Lustre writing metadata to six 15,000 RPM disks in RAID 0+1, on a server with 16 GB of RAM for cache, will help metadata performance. Random IO to data will by default be spread across 14 different arrays, 7 per X4500. This should keep heads moving in parallel for multiple requests. The best part is we can just add more X4500s later as IO and space needs grow. Last, both the metadata and object data arrays have external journals, keeping journal IO independent of the underlying ldiskfs filesystem. This should also help with the many different requests hitting the servers.

Performance

The only real performance test we did was bandwidth from a single host to a single array, which was limited to the 1 Gbit/s speed of the host. There were no metadata tests; if you know of any, please email or comment.

MPI-IO Performance

This was a big reason for putting Lustre on our system. NFS and ROMIO don’t play well together. Lustre was built with MPI-IO in mind and thus works out of the box. Writing to a single file using MPI_File_write() from ten 4-CPU Opteron 2218 nodes with 1 Gbit/s Ethernet reached 650 MB/s write speed. At this point the CPUs on the X4500s were saturated. I think higher speeds could be reached using TOE or 10 Gbit cards (where TOE is implied). Even better speeds might be reached using Infiniband, as its RDMA abilities may free CPU resources. Note that Lustre does support Infiniband and can support both Infiniband and TCP networks at the same time.

The example code for MPI_File_write() was taken from: Beige.ucs.indiana.edu.

module swap pgi/7.2
module swap openmpi/1.2.6-pgi
mpicc mkrandfiles.c

lfs setstripe data14 0 -1 -1
mpirun -np 40 ./a.out -f data14 -l 300

longest_io_time       = 82.115636 seconds
total_number_of_bytes = 50331648000
transfer rate         = 584.541535 MB/s
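The numbers in that output can be cross-checked by hand. Here is a small C sketch of the arithmetic, assuming the benchmark's "MB" means 2^20 bytes and that the total was split evenly across the 40 ranks; both are inferences from the figures, not something stated by the benchmark code, and the function names are mine.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a byte count and elapsed time into MB/s, taking 1 MB to be
 * 2^20 bytes (an assumption about the unit the benchmark prints). */
double transfer_rate_mb(unsigned long long bytes, double seconds)
{
    assert(seconds > 0.0);
    return ((double)bytes / seconds) / (1024.0 * 1024.0);
}

/* Bytes written per MPI rank, assuming the total is split evenly. */
unsigned long long bytes_per_rank(unsigned long long total, int ranks)
{
    assert(ranks > 0);
    return total / (unsigned long long)ranks;
}

/* Print the derived figures for one run. */
void report_run(unsigned long long total, double seconds, int ranks)
{
    printf("per-rank bytes = %llu\n", bytes_per_rank(total, ranks));
    printf("transfer rate  = %f MB/s\n", transfer_rate_mb(total, seconds));
}
```

Calling report_run(50331648000ULL, 82.115636, 40) gives 1258291200 bytes per rank, which is 300 blocks of 4 MiB and so lines up with the -l 300 argument, and a rate that agrees with the 584.5 MB/s reported.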

The hope is that MPI-IO at this scale will allow new forms of research to be done on our clusters. In the Winter term I expect to teach a course for researchers on using HDF5’s parallel IO abilities.

Questions: brockp@mlds-networks.com

64 bit Scientific Computing - Things to look out for

Below is an email I sent to a user who had a question about what could limit the amount of memory in use by his application. Turns out it’s not as easy as ‘use 64bit’.

Things to watch out for:

  • Admin Placed Limit on the Stack for Fortran77
  • Limits on size of arrays on x86_64
  • Assumed size of a pointer

The email follows, I will also attach example code to demonstrate.

Yes,

First, fortran77 (g77) does not do dynamic allocation, so all variables
are allocated on the stack, not the heap (look online for what these are if you care).

On nyx by default the stack can’t
be larger than 10 MB. This is a limit we impose to keep stack frames
from going crazy. You can see what the stack limit is by running:

ulimit -s

If you run

ulimit -s unlimited

It removes that stack limit. This is sometimes needed if you get
‘segfault’ messages.

The next limit you will hit will be because of your cpu architecture.
Your regular windows machine is 32 bit. 32 bit machines can not
access more than 4 GB of memory, and this is split between the OS
(windows, linux) and the application, leaving between 2 and 3 GB
available for the application.

Then there are 64 bit systems (Nyx is 64 bit). 64 bit cpus can
address up to 16.8 million terabytes (yes, far more than a petabyte of
memory). How this is split between the OS and the application is
moot right now. You should be aware that the first gen AMD64 cpus (like
on nyx) had an artificial limit of 1 terabyte of memory, though the
largest memory system we have has only 64 GB, or ~0.064 TB.

Now among 64 bit cpus there is what I call ‘native’ 64 bit and
‘extended’ 64 bit. True 64 bit cpus like IA64 and Power have no limits on
anything (other than maybe the admin imposed stack limit above).

The more common ‘extended’ ones are any of x86_64, em64t, amd64.
They work the same as the native ones, but only if you add a compile/link option:

-mcmodel=medium

The default is

-mcmodel=small

Under the small memory model static arrays can only be up to
2 GB in size. Under medium this is not a problem, but there is a
performance hit for accessing memory. If you use dynamic memory
allocation, like in fortran 90 or c/c++, there is no need for the
medium memory model.

The error:

relocation truncated to fit: R_X86_64_PC32 .bss
 

Means you need to use

-mcmodel=medium

So the moral of the story is: don’t use fortran 77 if you can avoid it,
and use dynamic allocation. It will relieve most of your pain.
64 bit provides such a massive amount of addressable memory
that the next memory limit won’t come for another decade. Note the
amount of ram 64 bit systems support is more than the amount of hard
disk in the world.

You of course can’t use ram you don’t have installed. And you can’t
assume the size of pointers. Pointers on nyx are 64 bit

(INTEGER*8)

on 32 bit systems they are 32 bit

(INTEGER*4)

I hope you’re not using pointers in fortran 77 though.
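The address-space figures quoted in the email are plain powers of two and can be checked in a few lines of C. This is a sketch only, taking 1 TB as 2^40 bytes; the function names are made up for illustration, and the 40-bit case is included because 2^40 bytes is the 1 TB figure mentioned for first-generation AMD64.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Size of a 2^bits-byte address space, expressed in terabytes
 * (1 TB taken as 2^40 bytes). */
double addressable_tb(unsigned bits)
{
    assert(bits <= 64);
    if (bits >= 40)
        return (double)(1ULL << (bits - 40));
    return 1.0 / (double)(1ULL << (40 - bits));
}

/* Width of a data pointer on the machine this is compiled for.
 * Never hard-code this; ask the compiler. */
unsigned pointer_bits(void)
{
    return (unsigned)(sizeof(void *) * CHAR_BIT);
}

/* Print the limits discussed in the email. */
void print_address_limits(void)
{
    printf("pointer width here: %u bits\n", pointer_bits());
    printf("32-bit space: %g TB\n", addressable_tb(32));
    printf("40-bit space: %g TB\n", addressable_tb(40));
    printf("64-bit space: %g TB\n", addressable_tb(64));
}
```

A 64-bit space works out to 16,777,216 TB, which is the "16.8 million" above, and a 32-bit space to 1/256 TB, which is the 4 GB ceiling.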

 

C/C++

If your stack limit (ulimit -s) is smaller than the array (about 1.4 GB here), this will fail with a segfault:

#define N (715827882/2/2)   /* about 179 million doubles, ~1.4 GB */
int main(){
double bigarray[N];         /* automatic array: lives on the stack */
bigarray[1000]=5;
return 0;
}

This will work though with dynamic allocation even with the stack size limit in place:

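#include <stdio.h>
#include <stdlib.h>
#define N (715827882/2/2)   /* about 179 million doubles, ~1.4 GB */
int main(){
double *bigarray=(double *)malloc(N*sizeof(double));   /* heap, not stack */
if(bigarray==NULL){ perror("malloc"); return 1; }      /* allocation can fail */
bigarray[1000]=5;
free(bigarray);
return 0;
}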
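The limit that ulimit -s reports can also be read from inside a program, which lets code decide up front whether a large automatic array is a bad idea. A minimal sketch, assuming a POSIX system; fits_on_stack and its headroom parameter are hypothetical names of mine, and note that getrlimit() reports bytes while ulimit -s prints kilobytes.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <sys/resource.h>

/* Return 1 if an automatic array of `bytes` bytes would fit under the
 * current soft stack limit, 0 if not. `headroom` is a safety margin for
 * whatever else is already on the stack. With an unlimited stack this
 * simply answers 1, although real memory still has to exist. */
int fits_on_stack(unsigned long long bytes, unsigned long long headroom)
{
    struct rlimit rl;
    int rc = getrlimit(RLIMIT_STACK, &rl);

    assert(rc == 0);
    (void)rc;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;                          /* ulimit -s unlimited */
    if (bytes > ULLONG_MAX - headroom)
        return 0;                          /* sum would overflow */
    return bytes + headroom <= (unsigned long long)rl.rlim_cur;
}
```

For the array in the examples above one would ask fits_on_stack(N * sizeof(double), 1 << 20) and fall back to malloc() when the answer is 0.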

cblas with ACML c++ support etc.

One currently large problem with the BLAS is that these libraries are all written for Fortran 77. Many commercial BLAS/LAPACK implementations (NAG, IMSL, MKL, etc.) seek to correct this, but their fixes are not portable across all systems. The only sure way to use the BLAS everywhere is to write your code in Fortran, which is quite annoying as more and more users are writing their code in C and C++.

Now you can call Fortran from C. It’s not that hard; Fortran passes everything by reference:

DGEMV(TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY)

Becomes

void dgemv_(char *trans, int *m, int *n, double *alpha, double *a, int *lda, double *x, int *incx, double *beta, double *y, int *incy, int trans_len);

It is not the cleanest thing in the world, and everything must be passed by reference.

Well, there was an update to the BLAS known as the CBLAS. In the paper on the CBLAS the routine above is defined as cblas_dgemv(). The CBLAS also has options so you can tell the BLAS whether your matrix is row major or column major (if you’re writing C or Fortran you had better know what this is! If not, email me: brockp@mlds-networks.com). This is very helpful because C stores matrices in row-major order while Fortran is column major, so calling Fortran from C can make memory layout quite the pain.

Now yes, it is true that some (most) BLAS functions that work with matrices have a transpose option which can correct for this. I have not yet tested whether it hurts performance due to cache locality. My guess is it will depend on your BLAS library.
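For anyone unsure what row major and column major mean in practice, here is a small C sketch of the two index calculations and a copy from one layout to the other. The function names are mine and are not part of the BLAS or the CBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) in a rows x cols matrix stored row major (C). */
size_t idx_row_major(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return i * cols + j;
}

/* Offset of the same element when stored column major (Fortran). */
size_t idx_col_major(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return j * rows + i;
}

/* Copy a row-major matrix into column-major order, the layout a Fortran
 * BLAS expects when no transpose flag is used. */
void row_to_col_major(const double *src, double *dst,
                      size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[idx_col_major(i, j, rows, cols)] =
                src[idx_row_major(i, j, rows, cols)];
}
```

Note that element (i, j) of a row-major rows x cols matrix sits at the same offset as element (j, i) of a column-major cols x rows matrix. That is why a transpose flag can stand in for an explicit copy: the Fortran routine is simply told it is looking at the transpose.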

So now we have a nice definition of how the BLAS should be called from C. While MKL and ATLAS have implemented some if not all of it, ACML and others have not, or have done their own form of BLAS from C. This is a major pain: for code that has to work on multiple platforms with different CPU vendors, and thus different BLAS libraries, anything other than Fortran could be a pain.

I recently figured out a way to make the CBLAS work with ACML, which does not implement the ‘correct’ CBLAS.

Steps:

  1. Download CBLAS
  2. Modify Makefile.in to use ACML (Mine attached)
  3. Build libcblas.a following the CBLAS instructions

When you are all done, the cblas library will take care of passing all calls, even row-major ones, to the ACML Fortran BLAS. This method should work for any BLAS that does not have CBLAS hooks. The one that comes to mind is GOTO BLAS, one of the best BLAS libraries out there.

In my case the call

cblas_dgemm(CblasRowMajor,CblasNoTrans, CblasNoTrans, M, N, K, alpha, a, lda, b, ldb, beta, c, ldc);

built with

pgcc -fastsse -mp driver.c -I /opt/cblas/include/ -L /opt/cblas/lib -L /opt/acml/pgi64_mp/lib -lcblas -lacml_mp -lpgftnrtl

will work, and links with the threaded BLAS so that I can use OpenMP for parallelism. My tests show this to be at least as fast as the native fortran77, so I am happy.

Though it would still be better if all BLAS library makers would settle on the CBLAS paper above from Netlib.

CBLAS from C++

The cblas.h header does not work when called from C++; the reason is C++ name mangling. It’s an easy fix, and I hope the BLAS working group adds it.

 Open up cblas.h and add:

#ifdef __cplusplus
extern "C" {
#endif /* __cplusplus */

Right after the last include before anything else is defined.

Then at the very end of the header add:

#ifdef __cplusplus
}
#endif /* __cplusplus*/

Right before the last #endif

Amber 8 on em64t/amd64 full 64bit and OpenMPI 1.2.6

Amber 8 is one of the molecular dynamics packages a number of our users at the Center for Advanced Computing use. Our main cluster, Nyx, is a 64bit AMD system. No system at the CAC is 32bit anymore, but Amber’s ‘configure’ script assumes that if you’re building with the Intel Ifort compiler you are either 32bit i686/i386 or IA64. We are neither: we are x86_64, also known as em64t and amd64. Which name you hear depends on whether you’re talking to Intel or AMD; they are identical though.

So Amber’s configure script is a disaster. It does not allow for using the mpif90/mpicc wrappers that come with an MPI library, and it also assumes you’re going to use one of a small subset of MPI libraries. Our MPI library of choice is OpenMPI, which was not around when Amber8 was published. So I made a patch to support both x86_64 and OpenMPI in Amber’s configure script.

Apply the patch:

patch < ompi.patch

This adds the option:

-openmpi

to the configure script. Building Amber is now:

export AMBERHOME=~/Amber8
export OMPI_HOME=~/openmpi-1.2.6
./configure -openmpi ifort

This will build Amber. Note that the patched configure script does NOT support 32bit targets when using the MKL. The script was patched such that if you set MKL_HOME (which I highly recommend), it will use the em64t versions of MKL. The configure script will no longer find the 32bit versions, which is why I did not send this patch to the Amber devs. Also, this may be fixed in Amber 9 and Amber 10, but we have not installed them yet.

Finally, to build Amber 8 with OpenMPI-1.2.6 on x86_64 with the Intel Ifort compiler and MKL, do:

export AMBERHOME=~/Amber8
export OMPI_HOME=~/openmpi-1.2.6
export MKL_HOME=~/mkl80
cd $AMBERHOME/src/
wget http://www.mlds-networks.com/~brockp/papers/code/ompi.patch
patch < ompi.patch
./configure -openmpi ifort
make parallel

HPC The future and Vector Processing

The three of us at the CAC were all invited to a talk at the University Hospital on HPC (High Performance Computing) in medicine. A few cool things came to mind.

  • HPC is being used in real medical applications
  • Vector processing is back (think Cell)
  • HPC is going to be underutilized

The first point is obvious. And it’s cool: until now HPC has really only been used for weather prediction, and anything else was all research. Now there is a need to build simple clusters, and gateways to them, so a doctor can pull data right from a DNA sequencer or MRI machine.

(I will get this graphic later).

Vector processing is back. As much as IBM wants to call the eight units hanging off the Cell CPU SPEs, to me the Cell is just a vector CPU in the eyes of an application. There are a few problems with this. First, vector processing is not taught in classes at all if you’re a Computer Science major. And when I mention vector, SIMD, SSE or 3DNow! to a grad student who is trying to graduate in 3 years and is just learning to code, they have no clue.

Already in the graphic above you can see that if you use the vector SSE unit on the CPU in your laptop, the time to solution is cut by more than half. This problem is only going to get worse with things like Cell and GPUs. All these systems are massive SIMD engines that the people writing code at the college level know nothing about. Classes do not exist to teach such things, and the tools are not there to abstract it out.

My prediction: 80% of the code run on HPC systems at universities will not use the SPEs, GPUs and vector units, resulting in huge amounts of wasted resources.

Of the code run that’s written by grad students, I expect 95% to not use them.

Now I don’t say we should not go this route; I think we have to. Look at the performance of the Cell: 200 Gflops when working on floats. An Intel or AMD CPU can’t do half that even with SSE3, which doubled the performance of AMD and Intel CPUs when (can you guess it?) only using the SSE3 VECTOR units! Scalar performance is still a drag and will not improve (much).

We need to build tools to make these units available, but many are already around. IMSL, NAG and similar tools already provide high-level methods for solvers that use vector units, and I expect them to build ones for the Cell. There are also the lower-level BLAS and LAPACK, in the form of MKL, ACML, ATLAS and GOTO. All are free to researchers and provide huge performance gains, mostly through memory blocking and vector unit use. Many are also already parallel, with no need to know OpenMP or MPI. Talk to your local HPC admin, or email me at brockp@mlds-networks.com.

So I think we already have the tools and they are not being used enough, so what’s the problem?

Education.

Courses, either formal classes or seminars, need to be available and pushed by faculty, so that students writing new code find out that these tools are available. Just teaching students to use:

pgcc -fastsse -O3 -ipo -Minfo
gcc -O3

would make for huge performance leaps on systems on campuses.
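To make that concrete, here is the kind of loop those flags are meant for: a minimal C sketch of a single-precision a*x+y update, written so a compiler has an easy time turning it into SSE code. Whether it is actually vectorized depends on the compiler and flags; with PGI, -Minfo is how you would find out.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * `restrict` promises the compiler that x and y do not overlap, which
 * removes the aliasing doubt that often blocks auto-vectorization.
 * Unit stride and no branches in the body keep the loop SIMD friendly. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Nothing in the source mentions SSE; the point is that plain, simple loops plus the right flags let the compiler do the vector work, and the tuned BLAS libraries mentioned above already ship a routine for exactly this operation.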

Sigh. I started teaching such classes, but I can only do so much. You can find what I have done so far at:

www.umich.edu/~brockp

Lustre Dstat plugin

At my day job at the Center for Advanced Computing I tested out Lustre, a cluster filesystem now owned by Sun. While I have been very impressed with Lustre, we have had a huge amount of trouble with servers and clients evicting each other.

This did not keep me from making a plugin for my favorite system tool, Dstat.

This plugin works only on clients and has not been tested with multiple Lustre mounts. Patches please!

Download the plugin here: Download

Place in dstat-0.6.6/plugins/

dstat -M lustre

Visit Visualization - Quite Powerful

Visit is a large-scale scientific visualization package from LLNL. I have tried out Visit on our cluster NYX using both my Mac as a client and our 3D workstations (Linux). I was quite impressed with how much it can do out of the box. So far I have been able to make the following formats work as both static images and video:

  • VASP OUTCAR
  • Protein Databank (ent)
  • CGNS
  • Simple HDF5
  • Fluent
  • Flash

I plan to try a few others, TecPlot, XYZ and NetCDF to name a few.

I must stress that Visit will not run in parallel using the binaries provided by LLNL on the Visit website. You will need to compile it yourself; to help with this the Visit devs have made a build_visit script you can find on their website. It was very easy to build on our RHEL4 machine with OpenMPI-1.2.3 and gcc-3.4. A few libraries it tries to build will fail. Mili is not available yet, so deselect it from the available plugins. Also, h5Part would not build for me, though HDF5 does just fine. Just disable these.

Note: DO NOT run Visit over a slow line. In this case your high-speed Comcast is not fast enough; the machine running the client should be on the same LAN as the cluster, or have at least 100 Mbps available to the client at any time. Second, never X-forward the client; it will look bad and work awfully. Last, make sure your Linux machine supports hardware GL (we use Nvidia SLI for our 30" displays). If you use the Windows or Mac client this should already be done for you, but for larger models a cheap graphics card will hurt you.

The largest render I have done used 16 CPUs and chewed through a 4.5 GB HDF5 file from Lustre to my display at the 3D lab in around 45 seconds. This file was way too simple, so YMMV.

If you are a U of M student/staff/faculty who wants to know more about tools like Visit please contact me. I would love to help you out.

Example Images: 

visit-vasp0000.png, visit-pdb0002.png, visit-pdb0001.png, visit-pdb0000.png, visit-cgns0001.png (CGNS File), visit-hdf50000.png

Example Movie:

 amr.png
