Sunday, July 27, 2008

Take $1000 out of my pocket for Thinking Parallel

Over the past few years, I've sat through a few dozen keynote talks on multi-core. In almost all cases, there's some comment along the lines of computer scientists don't know how to think parallel, and if they'd just smarten up, parallel computing will take off. This usually comes from people who have a management background, and sometimes from electrical engineers.

Like it or not, there are some things that can't be made faster with parallel machines (as far as I, or a number of people I respect, know). So here's an opportunity to take $1000 out of my pocket, just for showing me how to think parallel. A very common task in software applications is finding the shortest path between a pair of vertices in a graph; the best known algorithm for this is Dijkstra's, which has a Big-O complexity of O(E + V log V). Dijkstra uses a heap, which is a serial bottleneck; there's no way that's obvious to me to use multiple processors.

The Bellman-Ford algorithm also computes shortest path, and is very very easy to make massively parallel. So why not use the parallel version, you ask? Bellman-Ford is O(EV), which has higher computational complexity. For even modest sized graphs, Dijkstra will be faster in terms of wall-clock time, require less total power, and does all of this with just a single processor. There are very few types of graphs where Bellman-Ford might be better; you'd have to work hard to find one where a serial Dijkstra would not blow the doors off a parallel Bellman-Ford. If your objective is to get the answer as fast as possible with the minimum power, serial Dijkstra is the obvious choice.

How to take $1000 out of my pocket.

So here's the offer. Show me how to think parallel for this problem. Demonstrate a factor of two speedup for shortest path, and use as many parallel processors as you want. The run-time comparison should be between comparable processors (for example, one core of a multi-core AMD/Intel/Sun processor versus all available cores, not a 1987 IBM PC vs a new Xeon). The speedup should come on a real graph, and be compared to a solid version of the serial Dijkstra code. In particular, let's use the 9th DIMACS Implementation Challenge for Shortest Path. We can use the full USA distance map (24 million nodes, 58 million arcs), and find the vertex with the greatest shortest path distance from vertex 0. No cheating by precomputing the result, though - this has to work for other graphs. You have to actually use the other processors, not just tune the serial code. I'll throw in another $1000 if you can show that your cleverness scales linearly up to 16 processors.

Why am I focused on this?

Shortest path is used all over the place; if you need it, it's part of the serial fraction that shows up in Amdahl's Law. There seems to be nothing that can be done about it; if you need to know the shortest way to get from point A to point B, you're stuck. Getting performance scaling to thousands of cores isn't just about having a clever architecture, or a fancy language, or thinking parallel. For massive scaling, you need to eliminate almost all serial bottlenecks, and Dijkstra is one of the ones sitting in the way. With a non-zero serial fraction, the law kicks in, and it's not some sort of failing on the part of the programmer, the hardware, the language.

To pull in the bigger picture; shortest path is not the only problem where it's hard to get speedup. There are plenty of others, and if they're essential parts of an application, there's just no way to scale to large numbers of processors. All the wishing in the world won't change this. This is important to understand: the sooner we stop looking for some sort of magic bullet architecture/language/paradigm, the sooner we can start looking for real solutions.

Saturday, July 12, 2008

The Hurdle and the Hole

Just a quick random thought.... There's a lot of attention to new parallel programming languages, APIs, messaging packages, interconnect architectures, and so on. All sorts of ways to make it easier for a programmer to express parallel code, and to have it map onto the hardware efficiently.

Being able to program in parallel is a hurdle. It's difficult to get over. I don't want to downplay the challenge of doing it, or the complexity of the work involved in developing a new programming language, etc. If we are to get any benefit at all from multiple cores (apart from running many independent programs simultaneously), we need to get over the hurdle.

But if we get over the hurdle -- what's on the other side? Unless the run time of the program is dominated by processing that can be done in a massively parallel manner, the serial component starts to dominate quickly, and then we get little gain from exponentially increasing numbers of cores. This is the hole we can fall into immediately.

We need to look a few steps ahead. If we clear the hurdle, then what?

Monday, July 7, 2008

It's all in how you measure

We're engineers and scientists; we measure things. So here's a fun measurement experiment. Get a PS3, install Fedora Core 7 on it, then install the IBM Cell SDK. The IBM page has some useful links for this stuff. I'm using FC7 on my machine, as it seems to have the widest development support, and I'm able to get packages without doing a ton of recompiles. It's actually a pretty nice development environment, although I have to admit that I'm biased towards Unix as a platform.

After you get all this installed, you can build the SDK, and there are some fun programs in the "demos" subdirectory. Look in /opt/cell/sdk/src/demos/matrix_mul.... Everyone knows that matrix multiplication is something that works well in parallel. You can run the demo program; it generates a random matrix, and then does multiplication (breaking down parts of the matrix, and farming out chunks to different SPUs).

The matrix_mul demo takes a few command line arguments. The -s # specifies the number of SPEs to use (default is 1). The -m # specifies how big of matrix we want (multiples of 64 only, please). The -p option specifies that we want the program to report performance statistics. We're scientists and engineers. Of course we want performance statistics.

  • ./matrix_mul -s 1 -m 1024 -p
    Just use one SPE, and multiply a big 1024-square matrix with a copy of itself.
    Initializing Arrays ... done
    Running test ... done
    Performance Statistics:
    number of SPEs = 1
    execution time = 0.09 seconds
    computation rate = 23.85 GFlops/sec
    data transfer rate = 1.54 GBytes/sec

    23.85 GFlops. Wow, pretty darned good. I wonder what happens if we pull in two SPEs?

  • ./matrix_mul -s 2 -m 1024 -p
    Initializing Arrays ... done
    Running test ... done
    Performance Statistics:
    number of SPEs = 2
    execution time = 0.05 seconds
    computation rate = 42.93 GFlops/sec
    data transfer rate = 2.77 GBytes/sec

    Roughly half the execution time, double the flops. Linear speedup! Wonderful! How about four SPEs?

  • ./matrix_mul -s 4 -m 1024 -p
    Initializing Arrays ... done
    Running test ... done
    Performance Statistics:
    number of SPEs = 4
    execution time = 0.03 seconds
    computation rate = 71.55 GFlops/sec
    data transfer rate = 4.61 GBytes/sec

    Holy smokes, still looking good. I've done a few runs, and some have reported more than 100 GFlops. We're obviously running into some challenges with timer resolution, the thing is so darned fast.



Now lets take a closer look. Open up matrix_mul.c in your favorite editor. Scan down past all the include files, argument handling, and all that jazz. Here's an interesting batch of the code.

... lots of code ...
/* Start time
*/
gtime = times(&tbuf);

/* Send each of the SPUs to get them started.
*/
for (i=0; i<spus; i++) send_mail(i, 1);

/* Wait for the SPUs to complete
*/
for (i=0; i<spus; i++) {
/* Join thread */
if ((rc = pthread_join (threads[i].pthread, NULL)) != 0) {
printf("INTERNAL ERROR: failed to join pthread %d. Error = %s\n", i, str
error(errno));
exit(1);
}
}

/* Stop time
*/
elapsed_time = (double)(times(&tbuf) - gtime) / (double)sysconf(_SC_CLK_TCK)
;

printf("done\n");
... lots of code ...


What's being measured is the time to get the matrix chunks to the SPEs, they compute, and then the threads terminate. We're not measuring the time to create the threads. Or to destroy them. Or to load the program from disk to memory. Or to load the code onto the SPEs. Or to get the result back. Or to print out any informational message.

Suppose we use our trusty Unix friend time to tell us how fast things are?


  • time ./matrix_mul -s 1 -m 1024 -p
    ...
    real 0m0.322s
    user 0m0.188s
    sys 0m0.042s


  • time ./matrix_mul -s 2 -m 1024 -p
    ...
    real 0m0.279s
    user 0m0.188s
    sys 0m0.044s


  • time ./matrix_mul -s 4 -m 1024 -p
    ...
    real 0m0.260s
    user 0m0.189s
    sys 0m0.048s




Two SPEs are about 13% faster than just one, if you count everything (which, let's face it, is the right way to measure). Four SPEs are only about 7% faster than two. I think there's a Law that describes this phenomena; give me a minute, I'm sure I'll think of what it was.

Bigger matrices, and lots more of them would certainly help. This would be the Gustafson perspective on things, and in many cases, that's a valid way to look at it. But if you don't actually need to multiply bigger matrices, or more of them, to get the job done, we're stuck. Even worse, if you only need smaller matrices, using extra SPUs is actually a slow-down.

I'm not anywhere close to being the first person to notice this sort of problem. I highly recommend reading David H. Bailey's Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, from Supercomputing Review, August 1991, pages 54-55. What we have above would probably fall under #2.

As I said at the top, we're scientists and engineers. We measure things. We need to be careful when we do this. There are managers, CEOs, and business-school folk nearby; they sometimes make decisions based on what we tell them. They may have a very difficult time understanding the fine distinction between "we can use ten processors to make the program ten times faster" and "at best we can make it twice as fast, and we get almost no effect after the fourth processor." The experiment above could be described as either a 300% speedup (counting only the multiply) or a 24% speedup (measuring start to finish); which one wins a best paper award, and which one is an accurate description? The end of clock rate scaling has given us a very difficult challenge; now is not the time to be sloppy with our measuring sticks.

Wednesday, July 2, 2008

Intel's Unwelcome Advice

Another interesting tidbit from Anwar Ghuloum on the Intel software blog.

Anwar suggests that we should prepare for thousands of cores. Fortunately, I think there are a lot of Intel folks trying to figure out something other than this. I hope they get it figured out soon.

Tuesday, July 1, 2008

What kind of programs work well with.....

Over on the Intel software blog is a post by Jay Hoeflinger on "What kind of programs work well with Cluster OpenMP?" The observation is that if you have a program that needs memory accesses (particularly writes) that are all over the place, getting a page from a remote compute node will be REALLY SLOW.

From the post:
Yet, we have seen many programs get really good performance with Cluster OpenMP. So, how can you know whether your code is one of the high-performance ones or the REALLY SLOW ones?


One could ask the same question about any kind of software, with respect to multiple processors and any architecture. How can we know if our code will run well on multi-core? If there are memory accesses to the same locations, there's always going to be serialization -- locks, cache coherency issues, and so on. This is not something that is going to go away with a new compiler, a new programming paradigm, some sort of clever network interconnect. It's a fundamental aspect of the problem being solved. For a given program, is it going to scale well to multiple cores, or will it be one of the REALLY SLOW ones?

Hoeflinger notes that applications like ray tracing work well on distributed machines. Ray tracing is the poster child for the parallel community; one might note that parallel POVRAY benchmarks have been around since the
dawn of time. If an application has the obvious parallelism of ray tracing, and there's value in speeding it up, chances are it's already parallel.

What kind of program is a simple question, and a lot of the reason I'm ranting here. If we're not going to increase serial clock rates (and that seems to be impractical), we need to know what sort of programs work well with multiple cores. Ray tracing? check. High volume web servers? check. Something that will draw in lots of customers and serve a broad audience? hmmmm.

My research group is working on this. We've got some ideas, some things we're trying. And we're doing our best to not fall into the same rabbit holes that so many others seem to enjoy