Amdahl's Law

Thursday, October 30, 2008

Hot New Technology That Will Change Everything?

Everyone likes lists of stuff. PCWorld has a list of 15 Hot New Technologies, one of them being massively multicore CPUs. Missing from their write-up is some idea of what all these processors will do. If we're talking two, maybe four, possibly even eight, then yeah, there's probably something for them to do at least part of the time. But 32? 64? More??

And new technology? I guess the 16-processor Atari PC from the 80's doesn't count. Nor do the dozens of other massively parallel machines that have gone down the tubes. I guess a technology is new if it's failed for decades.

Instead of having more cores, why not small scale lithographs of Andy Kaufman on the chips? They'll be at least as useful, more resistant to device failure, consume less power, and would be a great way of saying to consumers that "the joke is on you."

Sunday, October 19, 2008

Everything We Know Is Wrong

An interesting post over at the Economist, noting that published results can be wrong -- and that the more selective a venue is, and the "hotter" the area, the more likely it is that something is amiss. This makes a lot of sense; in complex research, things go wrong; there's no way we can eliminate all errors. If the errors skew things in a "positive" direction, reviewers are impressed, and off it goes.

To my knowledge, I've only had bad numbers in one paper -- we were measuring balance of horizontal and vertical wiring from different placement methods, and while the general trend was right, the relative percentages were off (lots of details on how this went wrong; we detected it after the conference camera-ready went in, so we announced the error at the actual talk, and included a note about the error in the journal version). Safe to assume that I've probably screwed up elsewhere.

Fortunately, I got tenure a few years ago, and can now take my sweet time with papers. I've got a few in the pipe, and I'm going to do my best to get everything right.

Saturday, September 13, 2008

The Fifth Generation

Back when I was an undergrad, there was a major effort going on in Japan known as the Fifth Generation. The idea was that by using massive parallelism, one could simulate thought, and build systems that were intelligent (for some definition of intelligent; there's certainly some debate on what that would mean). Neural networks, genetic algorithms, all that jazz. This wasn't a small effort; it dominated computer science research in Japan, and there were quite a few people in the US who also felt that this was the right direction.

I've got a book from that era, The Fifth Generation, by Edward Feigenbaum (a senior Stanford professor), and Pamela McCorduck. Reading through it is fairly surreal; there's a level of certainty that I find astonishing. I don't get a feeling of this is an interesting research direction, and we should follow it. Rather, it's more of this will certainly work, and we must direct our efforts toward it and away from other ideas.

The Japanese research community lost many valuable years chasing AI systems. Likely many promising researchers were shut down if they did not subscribe to the party line. Research dollars were wasted barking up the wrong trees. The psychological term for this is Groupthink. Scientific discourse is halted, dissenting views are quashed. Everyone falls into lock step, but not necessarily in the right direction.

So a question for the random person who found their way here... Suppose the US government offered funding for perpetual motion research -- they want a system with flywheels, and it has to generate electricity. Would you take the funding? Or would you ask hard questions of the funding director? And suppose they're offering lots of money? Would you take it then? And if you saw many of your colleagues line up, and say how they had always had misgivings about thermodynamics, and are enthusiastic about the prospects of perpetual motion. Previous failures, they assure you, were only due to a lack of effort and funding -- and now, we're serious about it, and it has to work or we're in trouble. If you felt that the research community was headed in the wrong direction, would you speak up, or go along? The only way to break out of Groupthink is for enough members of the group to try to shake things up.

Tuesday, August 26, 2008

Some numbers to ponder

I was picking on Michael Garland's DAC paper a bit in the last post, with respect to using GPUs for shortest path (sorry Michael!). We've bounced a few emails back and forth, and have sorted things out -- his DAC paper didn't have the space to go into much detail, and I think it may give some people the wrong impression (it certainly did so for me!).

First off -- for doing lots of matrix multiplication, the GPU is definitely a good way to go; no gripes from me on that. Where things get a bit tricky is with the shortest path computation. On a "straightforward" shortest path run, Michael's serial Dijkstra implementation beats the daylights out his serial Bellman-Ford approach (a factor of 30 or so), for modest size graphs. This is as one might expect, and as the graphs get larger, one would anticipate even greater advantage for the Dijkstra algorithm. For the meshes that Michael was looking at, the massive parallel version of Bellman-Ford on the GPU is comparable to the serial Dijkstra (but as we get to bigger and bigger graphs, computational complexity tells us that the serial Dijkstra will win).

The application he's really interested in, however, is not strictly shortest path -- but rather the partitioning of a mesh into independent clusters of triangles, so that these can be rendered in parallel. In a real world application, one might break apart a 3D model into chunks (during preprocessing), and then render them independently. In addition to scientific computing co-processors, nVidia apparently also makes graphics cards! The approach to partition a mesh into clusters uses shortest path.

For partitioning, a set of vertices are chosen as "seeds," and shortest path search waves propagate from each. When the waves meet, the search ends there, with the waves progressing to other parts of the mesh. Think of throwing a bunch of stones into a pond at the same time -- the clusters are determined by which side of a "ripple" a particular vertex falls on. For those of you familiar with computational geometry, this might be thought of as a graph-based Voronoi diagram.

The partitioning boils down to a large number of independent shortest path problems, which can be run in parallel. No surprises there -- when you have a lot of work that can be done in parallel, a parallel machine does well. The GPU implementation (with Bellman Ford) is about 5x faster than a serial implementation (with Dijkstra) -- but depending on the size of the graph, the number of processors available, the speed of the serial processor, and so on, things might change. If there are full-blown cores (that can handle priority queues), a parallel multi-core approach would be even faster.

So -- if you have a large number of shortest path problems to solve, you can do them in parallel, and possibly come out ahead of a serial processor. If you have only one shortest path to do, the graph is small, and you have a ton of parallel horsepower, the parallel approach might still be faster. But for large graphs, and only one shortest path, pick the serial Dijkstra. At this point, I think my money is still safe.

Sunday, July 27, 2008

Take $1000 out of my pocket for Thinking Parallel

Over the past few years, I've sat through a few dozen keynote talks on multi-core. In almost all cases, there's some comment along the lines of computer scientists don't know how to think parallel, and if they'd just smarten up, parallel computing will take off. This usually comes from people who have a management background, and sometimes from electrical engineers.

Like it or not, there are some things that can't be made faster with parallel machines (as far as I, or a number of people I respect, know). So here's an opportunity to take $1000 out of my pocket, just for showing me how to think parallel. A very common task in software applications is finding the shortest path between a pair of vertices in a graph; the best known algorithm for this is Dijkstra's, which has a Big-O complexity of O(E + V log V). Dijkstra uses a heap, which is a serial bottleneck; there's no way that's obvious to me to use multiple processors.

The Bellman-Ford algorithm also computes shortest path, and is very very easy to make massively parallel. So why not use the parallel version, you ask? Bellman-Ford is O(EV), which has higher computational complexity. For even modest sized graphs, Dijkstra will be faster in terms of wall-clock time, require less total power, and does all of this with just a single processor. There are very few types of graphs where Bellman-Ford might be better; you'd have to work hard to find one where a serial Dijkstra would not blow the doors off a parallel Bellman-Ford. If your objective is to get the answer as fast as possible with the minimum power, serial Dijkstra is the obvious choice.

How to take $1000 out of my pocket.

So here's the offer. Show me how to think parallel for this problem. Demonstrate a factor of two speedup for shortest path, and use as many parallel processors as you want. The run-time comparison should be between comparable processors (for example, one core of a multi-core AMD/Intel/Sun processor versus all available cores, not a 1987 IBM PC vs a new Xeon). The speedup should come on a real graph, and be compared to a solid version of the serial Dijkstra code. In particular, let's use the 9th DIMACS Implementation Challenge for Shortest Path. We can use the full USA distance map (24 million nodes, 58 million arcs), and find the vertex with the greatest shortest path distance from vertex 0. No cheating by precomputing the result, though - this has to work for other graphs. You have to actually use the other processors, not just tune the serial code. I'll throw in another $1000 if you can show that your cleverness scales linearly up to 16 processors.

Why am I focused on this?

Shortest path is used all over the place; if you need it, it's part of the serial fraction that shows up in Amdahl's Law. There seems to be nothing that can be done about it; if you need to know the shortest way to get from point A to point B, you're stuck. Getting performance scaling to thousands of cores isn't just about having a clever architecture, or a fancy language, or thinking parallel. For massive scaling, you need to eliminate almost all serial bottlenecks, and Dijkstra is one of the ones sitting in the way. With a non-zero serial fraction, the law kicks in, and it's not some sort of failing on the part of the programmer, the hardware, the language.

To pull in the bigger picture; shortest path is not the only problem where it's hard to get speedup. There are plenty of others, and if they're essential parts of an application, there's just no way to scale to large numbers of processors. All the wishing in the world won't change this. This is important to understand: the sooner we stop looking for some sort of magic bullet architecture/language/paradigm, the sooner we can start looking for real solutions.

Saturday, July 12, 2008

The Hurdle and the Hole

Just a quick random thought.... There's a lot of attention to new parallel programming languages, APIs, messaging packages, interconnect architectures, and so on. All sorts of ways to make it easier for a programmer to express parallel code, and to have it map onto the hardware efficiently.

Being able to program in parallel is a hurdle. It's difficult to get over. I don't want to downplay the challenge of doing it, or the complexity of the work involved in developing a new programming language, etc. If we are to get any benefit at all from multiple cores (apart from running many independent programs simultaneously), we need to get over the hurdle.

But if we get over the hurdle -- what's on the other side? Unless the run time of the program is dominated by processing that can be done in a massively parallel manner, the serial component starts to dominate quickly, and then we get little gain from exponentially increasing numbers of cores. This is the hole we can fall into immediately.

We need to look a few steps ahead. If we clear the hurdle, then what?

Monday, July 7, 2008

It's all in how you measure

We're engineers and scientists; we measure things. So here's a fun measurement experiment. Get a PS3, install Fedora Core 7 on it, then install the IBM Cell SDK. The IBM page has some useful links for this stuff. I'm using FC7 on my machine, as it seems to have the widest development support, and I'm able to get packages without doing a ton of recompiles. It's actually a pretty nice development environment, although I have to admit that I'm biased towards Unix as a platform.

After you get all this installed, you can build the SDK, and there are some fun programs in the "demos" subdirectory. Look in /opt/cell/sdk/src/demos/matrix_mul.... Everyone knows that matrix multiplication is something that works well in parallel. You can run the demo program; it generates a random matrix, and then does multiplication (breaking down parts of the matrix, and farming out chunks to different SPUs).

The matrix_mul demo takes a few command line arguments. The -s # specifies the number of SPEs to use (default is 1). The -m # specifies how big of matrix we want (multiples of 64 only, please). The -p option specifies that we want the program to report performance statistics. We're scientists and engineers. Of course we want performance statistics.

./matrix_mul -s 1 -m 1024 -p
Just use one SPE, and multiply a big 1024-square matrix with a copy of itself.
Initializing Arrays ... done
Running test ... done
Performance Statistics:
number of SPEs = 1
execution time = 0.09 seconds
computation rate = 23.85 GFlops/sec
data transfer rate = 1.54 GBytes/sec

23.85 GFlops. Wow, pretty darned good. I wonder what happens if we pull in two SPEs?

./matrix_mul -s 2 -m 1024 -p
Initializing Arrays ... done
Running test ... done
Performance Statistics:
number of SPEs = 2
execution time = 0.05 seconds
computation rate = 42.93 GFlops/sec
data transfer rate = 2.77 GBytes/sec

Roughly half the execution time, double the flops. Linear speedup! Wonderful! How about four SPEs?

./matrix_mul -s 4 -m 1024 -p
Initializing Arrays ... done
Running test ... done
Performance Statistics:
number of SPEs = 4
execution time = 0.03 seconds
computation rate = 71.55 GFlops/sec
data transfer rate = 4.61 GBytes/sec

Holy smokes, still looking good. I've done a few runs, and some have reported more than 100 GFlops. We're obviously running into some challenges with timer resolution, the thing is so darned fast.

Now lets take a closer look. Open up matrix_mul.c in your favorite editor. Scan down past all the include files, argument handling, and all that jazz. Here's an interesting batch of the code.

... lots of code ...
/* Start time
*/
gtime = times(&tbuf);

/* Send each of the SPUs to get them started.
*/
for (i=0; i<spus; i++) send_mail(i, 1);

/* Wait for the SPUs to complete
*/
for (i=0; i<spus; i++) {
/* Join thread */
if ((rc = pthread_join (threads[i].pthread, NULL)) != 0) {
printf("INTERNAL ERROR: failed to join pthread %d. Error = %s\n", i, str
error(errno));
exit(1);
}
}

/* Stop time
*/
elapsed_time = (double)(times(&tbuf) - gtime) / (double)sysconf(_SC_CLK_TCK)
;

printf("done\n");
... lots of code ...

What's being measured is the time to get the matrix chunks to the SPEs, they compute, and then the threads terminate. We're not measuring the time to create the threads. Or to destroy them. Or to load the program from disk to memory. Or to load the code onto the SPEs. Or to get the result back. Or to print out any informational message.

Suppose we use our trusty Unix friend time to tell us how fast things are?

time ./matrix_mul -s 1 -m 1024 -p
...
real 0m0.322s
user 0m0.188s
sys 0m0.042s

time ./matrix_mul -s 2 -m 1024 -p
...
real 0m0.279s
user 0m0.188s
sys 0m0.044s

time ./matrix_mul -s 4 -m 1024 -p
...
real 0m0.260s
user 0m0.189s
sys 0m0.048s

Two SPEs are about 13% faster than just one, if you count everything (which, let's face it, is the right way to measure). Four SPEs are only about 7% faster than two. I think there's a Law that describes this phenomena; give me a minute, I'm sure I'll think of what it was.

Bigger matrices, and lots more of them would certainly help. This would be the Gustafson perspective on things, and in many cases, that's a valid way to look at it. But if you don't actually need to multiply bigger matrices, or more of them, to get the job done, we're stuck. Even worse, if you only need smaller matrices, using extra SPUs is actually a slow-down.

I'm not anywhere close to being the first person to notice this sort of problem. I highly recommend reading David H. Bailey's Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, from Supercomputing Review, August 1991, pages 54-55. What we have above would probably fall under #2.

As I said at the top, we're scientists and engineers. We measure things. We need to be careful when we do this. There are managers, CEOs, and business-school folk nearby; they sometimes make decisions based on what we tell them. They may have a very difficult time understanding the fine distinction between "we can use ten processors to make the program ten times faster" and "at best we can make it twice as fast, and we get almost no effect after the fourth processor." The experiment above could be described as either a 300% speedup (counting only the multiply) or a 24% speedup (measuring start to finish); which one wins a best paper award, and which one is an accurate description? The end of clock rate scaling has given us a very difficult challenge; now is not the time to be sloppy with our measuring sticks.