We're engineers and scientists; we measure things. So here's a fun measurement experiment. Get a PS3, install Fedora Core 7 on it, then install the IBM Cell SDK.
The IBM page has some useful links for this stuff. I'm using FC7 on my machine, as it seems to have the widest development support, and I'm able to get packages without doing a ton of recompiles. It's actually a pretty nice development environment, although I have to admit that I'm biased towards Unix as a platform.
After you get all this installed, you can build the SDK, and there are some fun programs in the "demos" subdirectory. Look in
/opt/cell/sdk/src/demos/matrix_mul.... Everyone knows that matrix multiplication is something that works well in parallel. You can run the demo program; it generates a random matrix, and then does multiplication (breaking down parts of the matrix, and farming out chunks to different SPUs).
The
matrix_mul demo takes a few command line arguments. The
-s # specifies the number of SPEs to use (default is 1). The
-m # specifies how big of matrix we want (multiples of 64 only, please). The
-p option specifies that we want the program to report performance statistics. We're scientists and engineers. Of course we want performance statistics.
- ./matrix_mul -s 1 -m 1024 -p
Just use one SPE, and multiply a big 1024-square matrix with a copy of itself.
Initializing Arrays ... done
Running test ... done
Performance Statistics:
number of SPEs = 1
execution time = 0.09 seconds
computation rate = 23.85 GFlops/sec
data transfer rate = 1.54 GBytes/sec
23.85 GFlops. Wow, pretty darned good. I wonder what happens if we pull in two SPEs?
- ./matrix_mul -s 2 -m 1024 -p
Initializing Arrays ... done
Running test ... done
Performance Statistics:
number of SPEs = 2
execution time = 0.05 seconds
computation rate = 42.93 GFlops/sec
data transfer rate = 2.77 GBytes/sec
Roughly half the execution time, double the flops. Linear speedup! Wonderful! How about four SPEs?
- ./matrix_mul -s 4 -m 1024 -p
Initializing Arrays ... done
Running test ... done
Performance Statistics:
number of SPEs = 4
execution time = 0.03 seconds
computation rate = 71.55 GFlops/sec
data transfer rate = 4.61 GBytes/sec
Holy smokes, still looking good. I've done a few runs, and some have reported more than 100 GFlops. We're obviously running into some challenges with timer resolution, the thing is so darned fast.
Now lets take a closer look. Open up matrix_mul.c in your favorite editor. Scan down past all the include files, argument handling, and all that jazz. Here's an interesting batch of the code.
... lots of code ...
/* Start time
*/
gtime = times(&tbuf);
/* Send each of the SPUs to get them started.
*/
for (i=0; i<spus; i++) send_mail(i, 1);
/* Wait for the SPUs to complete
*/
for (i=0; i<spus; i++) {
/* Join thread */
if ((rc = pthread_join (threads[i].pthread, NULL)) != 0) {
printf("INTERNAL ERROR: failed to join pthread %d. Error = %s\n", i, str
error(errno));
exit(1);
}
}
/* Stop time
*/
elapsed_time = (double)(times(&tbuf) - gtime) / (double)sysconf(_SC_CLK_TCK)
;
printf("done\n");
... lots of code ...
What's being measured is the time to get the matrix chunks to the SPEs, they compute, and then the threads terminate. We're not measuring the time to create the threads. Or to destroy them. Or to load the program from disk to memory. Or to load the code onto the SPEs. Or to get the result back. Or to print out any informational message.
Suppose we use our trusty Unix friend time to tell us how fast things are?
- time ./matrix_mul -s 1 -m 1024 -p
...
real 0m0.322s
user 0m0.188s
sys 0m0.042s
- time ./matrix_mul -s 2 -m 1024 -p
...
real 0m0.279s
user 0m0.188s
sys 0m0.044s
- time ./matrix_mul -s 4 -m 1024 -p
...
real 0m0.260s
user 0m0.189s
sys 0m0.048s
Two SPEs are about 13% faster than just one, if you count everything (which, let's face it, is the right way to measure). Four SPEs are only about 7% faster than two. I think there's a Law that describes this phenomena; give me a minute, I'm sure I'll think of what it was.
Bigger matrices, and lots more of them would certainly help. This would be the Gustafson perspective on things, and in many cases, that's a valid way to look at it. But if you don't actually need to multiply bigger matrices, or more of them, to get the job done, we're stuck. Even worse, if you only need smaller matrices, using extra SPUs is actually a slow-down.
I'm not anywhere close to being the first person to notice this sort of problem. I highly recommend reading David H. Bailey's
Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, from Supercomputing Review, August 1991, pages 54-55. What we have above would probably fall under #2.
As I said at the top, we're scientists and engineers. We measure things. We need to be careful when we do this. There are managers, CEOs, and business-school folk nearby; they sometimes make decisions based on what we tell them. They may have a very difficult time understanding the fine distinction between "we can use ten processors to make the program ten times faster" and "at best we can make it twice as fast, and we get almost no effect after the fourth processor." The experiment above could be described as either a 300% speedup (counting only the multiply) or a 24% speedup (measuring start to finish); which one wins a best paper award, and which one is an accurate description? The end of clock rate scaling has given us a very difficult challenge; now is not the time to be sloppy with our measuring sticks.