Monday, July 7, 2008

It's all in how you measure

We're engineers and scientists; we measure things. So here's a fun measurement experiment. Get a PS3, install Fedora Core 7 on it, then install the IBM Cell SDK. The IBM page has some useful links for this stuff. I'm using FC7 on my machine, as it seems to have the widest development support, and I'm able to get packages without doing a ton of recompiles. It's actually a pretty nice development environment, although I have to admit that I'm biased towards Unix as a platform.

After you get all this installed, you can build the SDK, and there are some fun programs in the "demos" subdirectory. Look in /opt/cell/sdk/src/demos/matrix_mul.... Everyone knows that matrix multiplication is something that works well in parallel. You can run the demo program; it generates a random matrix, and then does multiplication (breaking down parts of the matrix, and farming out chunks to different SPUs).

The matrix_mul demo takes a few command line arguments. The -s # specifies the number of SPEs to use (default is 1). The -m # specifies how big of matrix we want (multiples of 64 only, please). The -p option specifies that we want the program to report performance statistics. We're scientists and engineers. Of course we want performance statistics.

  • ./matrix_mul -s 1 -m 1024 -p
    Just use one SPE, and multiply a big 1024-square matrix with a copy of itself.
    Initializing Arrays ... done
    Running test ... done
    Performance Statistics:
    number of SPEs = 1
    execution time = 0.09 seconds
    computation rate = 23.85 GFlops/sec
    data transfer rate = 1.54 GBytes/sec

    23.85 GFlops. Wow, pretty darned good. I wonder what happens if we pull in two SPEs?

  • ./matrix_mul -s 2 -m 1024 -p
    Initializing Arrays ... done
    Running test ... done
    Performance Statistics:
    number of SPEs = 2
    execution time = 0.05 seconds
    computation rate = 42.93 GFlops/sec
    data transfer rate = 2.77 GBytes/sec

    Roughly half the execution time, double the flops. Linear speedup! Wonderful! How about four SPEs?

  • ./matrix_mul -s 4 -m 1024 -p
    Initializing Arrays ... done
    Running test ... done
    Performance Statistics:
    number of SPEs = 4
    execution time = 0.03 seconds
    computation rate = 71.55 GFlops/sec
    data transfer rate = 4.61 GBytes/sec

    Holy smokes, still looking good. I've done a few runs, and some have reported more than 100 GFlops. We're obviously running into some challenges with timer resolution, the thing is so darned fast.

Now lets take a closer look. Open up matrix_mul.c in your favorite editor. Scan down past all the include files, argument handling, and all that jazz. Here's an interesting batch of the code.

... lots of code ...
/* Start time
gtime = times(&tbuf);

/* Send each of the SPUs to get them started.
for (i=0; i<spus; i++) send_mail(i, 1);

/* Wait for the SPUs to complete
for (i=0; i<spus; i++) {
/* Join thread */
if ((rc = pthread_join (threads[i].pthread, NULL)) != 0) {
printf("INTERNAL ERROR: failed to join pthread %d. Error = %s\n", i, str

/* Stop time
elapsed_time = (double)(times(&tbuf) - gtime) / (double)sysconf(_SC_CLK_TCK)

... lots of code ...

What's being measured is the time to get the matrix chunks to the SPEs, they compute, and then the threads terminate. We're not measuring the time to create the threads. Or to destroy them. Or to load the program from disk to memory. Or to load the code onto the SPEs. Or to get the result back. Or to print out any informational message.

Suppose we use our trusty Unix friend time to tell us how fast things are?

  • time ./matrix_mul -s 1 -m 1024 -p
    real 0m0.322s
    user 0m0.188s
    sys 0m0.042s

  • time ./matrix_mul -s 2 -m 1024 -p
    real 0m0.279s
    user 0m0.188s
    sys 0m0.044s

  • time ./matrix_mul -s 4 -m 1024 -p
    real 0m0.260s
    user 0m0.189s
    sys 0m0.048s

Two SPEs are about 13% faster than just one, if you count everything (which, let's face it, is the right way to measure). Four SPEs are only about 7% faster than two. I think there's a Law that describes this phenomena; give me a minute, I'm sure I'll think of what it was.

Bigger matrices, and lots more of them would certainly help. This would be the Gustafson perspective on things, and in many cases, that's a valid way to look at it. But if you don't actually need to multiply bigger matrices, or more of them, to get the job done, we're stuck. Even worse, if you only need smaller matrices, using extra SPUs is actually a slow-down.

I'm not anywhere close to being the first person to notice this sort of problem. I highly recommend reading David H. Bailey's Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, from Supercomputing Review, August 1991, pages 54-55. What we have above would probably fall under #2.

As I said at the top, we're scientists and engineers. We measure things. We need to be careful when we do this. There are managers, CEOs, and business-school folk nearby; they sometimes make decisions based on what we tell them. They may have a very difficult time understanding the fine distinction between "we can use ten processors to make the program ten times faster" and "at best we can make it twice as fast, and we get almost no effect after the fourth processor." The experiment above could be described as either a 300% speedup (counting only the multiply) or a 24% speedup (measuring start to finish); which one wins a best paper award, and which one is an accurate description? The end of clock rate scaling has given us a very difficult challenge; now is not the time to be sloppy with our measuring sticks.


Daniel said...

I am not sure it is fair to include in your performance timings the time taken to create the matrix.

Also you are timing the time taken to start the process. Real applications will call matrix multiply thousands of times, such that the cost of process startup is amortised away.

Patrick H. Madden said...

I won't argue too strongly that there's any single "fair" measurement, although the "12 Ways" paper makes it pretty clear that measuring only the inner loop should not be considered fair.

What I'm trying to stress is that there's other things going on than the parallel-friendly part of the code, and ignoring it completely gives a very skewed perception. We can speed up the parallel parts of the code by a huge amount -- and then it's the little stuff that can't be sped up that kills us.