Thursday, October 30, 2008

Hot New Technology That Will Change Everything?

Everyone likes lists of stuff.  PCWorld has a list of 15 Hot New Technologies, one of them being massively multicore CPUs.  Missing from their write-up is some idea of what all these processors will do.  If we're talking two, maybe four, possibly even eight, then yeah, there's probably something for them to do at least part of the time.  But 32?  64?  More??

And new technology?  I guess the 16-processor Atari PC from the '80s doesn't count.  Nor do the dozens of other massively parallel machines that have gone down the tubes.  I guess a technology is new if it's failed for decades.

Instead of adding more cores, why not etch small-scale lithographs of Andy Kaufman onto the chips?  They'd be at least as useful, more resistant to device failure, and lower power -- and they'd be a great way of telling consumers that "the joke is on you."


Sunday, October 19, 2008

Everything We Know Is Wrong

An interesting post over at the Economist, noting that published results can be wrong -- and that the more selective a venue is, and the "hotter" the area, the more likely it is that something is amiss. This makes a lot of sense; in complex research, things go wrong; there's no way we can eliminate all errors. If the errors skew things in a "positive" direction, reviewers are impressed, and off it goes.

To my knowledge, I've only had bad numbers in one paper -- we were measuring balance of horizontal and vertical wiring from different placement methods, and while the general trend was right, the relative percentages were off (lots of details on how this went wrong; we detected it after the conference camera-ready went in, so we announced the error at the actual talk, and included a note about the error in the journal version). Safe to assume that I've probably screwed up elsewhere.

Fortunately, I got tenure a few years ago, and can now take my sweet time with papers. I've got a few in the pipe, and I'm going to do my best to get everything right.

Saturday, September 13, 2008

The Fifth Generation

Back when I was an undergrad, there was a major effort going on in Japan known as the Fifth Generation. The idea was that by using massive parallelism, one could simulate thought, and build systems that were intelligent (for some definition of intelligent; there's certainly some debate on what that would mean). Neural networks, genetic algorithms, all that jazz. This wasn't a small effort; it dominated computer science research in Japan, and there were quite a few people in the US who also felt that this was the right direction.

I've got a book from that era, The Fifth Generation, by Edward Feigenbaum (a senior Stanford professor) and Pamela McCorduck. Reading through it is fairly surreal; there's a level of certainty that I find astonishing. I don't get a feeling of "this is an interesting research direction, and we should follow it." Rather, it's more of "this will certainly work, and we must direct our efforts toward it and away from other ideas."

The Japanese research community lost many valuable years chasing AI systems. Many promising researchers were likely shut down if they did not subscribe to the party line. Research dollars were wasted barking up the wrong trees. The psychological term for this is Groupthink: scientific discourse is halted, dissenting views are quashed. Everyone falls into lockstep, but not necessarily in the right direction.

So a question for the random person who found their way here... Suppose the US government offered funding for perpetual motion research -- they want a system with flywheels, and it has to generate electricity. Would you take the funding, or would you ask hard questions of the funding director? Suppose they're offering lots of money -- would you take it then? And what if you saw many of your colleagues lining up, saying they had always had misgivings about thermodynamics and are enthusiastic about the prospects of perpetual motion? Previous failures, they assure you, were only due to a lack of effort and funding -- and now we're serious about it, and it has to work or we're in trouble. If you felt that the research community was headed in the wrong direction, would you speak up, or go along? The only way to break out of Groupthink is for enough members of the group to try to shake things up.

Tuesday, August 26, 2008

Some numbers to ponder

I was picking on Michael Garland's DAC paper a bit in the last post, with respect to using GPUs for shortest path (sorry Michael!). We've bounced a few emails back and forth, and have sorted things out -- his DAC paper didn't have the space to go into much detail, and I think it may give some people the wrong impression (it certainly did so for me!).

First off -- for doing lots of matrix multiplication, the GPU is definitely a good way to go; no gripes from me on that. Where things get a bit tricky is with the shortest path computation. On a "straightforward" shortest path run, Michael's serial Dijkstra implementation beats the daylights out of his serial Bellman-Ford approach (a factor of 30 or so) for modest-size graphs. This is as one might expect, and as the graphs get larger, one would anticipate an even greater advantage for the Dijkstra algorithm. For the meshes that Michael was looking at, the massively parallel version of Bellman-Ford on the GPU is comparable to the serial Dijkstra (but as we get to bigger and bigger graphs, computational complexity tells us that the serial Dijkstra will win).

The application he's really interested in, however, is not strictly shortest path -- but rather the partitioning of a mesh into independent clusters of triangles, so that these can be rendered in parallel. In a real-world application, one might break apart a 3D model into chunks (during preprocessing), and then render them independently. In addition to scientific computing co-processors, nVidia apparently also makes graphics cards! The approach used to partition a mesh into clusters relies on shortest path.

For partitioning, a set of vertices are chosen as "seeds," and shortest path search waves propagate from each. When the waves meet, the search ends there, with the waves progressing to other parts of the mesh. Think of throwing a bunch of stones into a pond at the same time -- the clusters are determined by which side of a "ripple" a particular vertex falls on. For those of you familiar with computational geometry, this might be thought of as a graph-based Voronoi diagram.
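Here's a rough serial sketch of that idea, just to make it concrete (this is my own toy illustration with a hand-made six-vertex graph, not Michael's code): two seed vertices, Bellman-Ford-style relaxation passes, and each vertex keeps whichever seed reaches it with the smaller distance. The sweep over the edges inside each pass is the part a GPU would do in parallel; the owner array at the end is the clustering.

#include <stdio.h>
#include <float.h>

#define NV 6   /* vertices in the toy graph */
#define NE 9   /* undirected edges, stored once */

/* A toy edge list; a real mesh would come from the model's triangles. */
static int    eu[NE] = {0,0,1,1,2,2,3,3,4};
static int    ev[NE] = {1,2,2,3,3,4,4,5,5};
static double ew[NE] = {1,4,2,7,3,1,2,5,2};

int main(void)
{
    double dist[NV];
    int    owner[NV];            /* which seed "claims" each vertex */
    int    seeds[2] = {0, 5};    /* two seed vertices */
    int    i, s, changed;

    for (i = 0; i < NV; i++) { dist[i] = DBL_MAX; owner[i] = -1; }
    for (s = 0; s < 2; s++)   { dist[seeds[s]] = 0.0; owner[seeds[s]] = s; }

    /* Bellman-Ford-style relaxation: each pass sweeps all the edges, and
       the relaxations within a pass are independent -- that's the part a
       GPU can run in parallel.  The waves from both seeds advance at once;
       a vertex keeps whichever seed reaches it with the smaller distance. */
    do {
        changed = 0;
        for (i = 0; i < NE; i++) {
            int a = eu[i], b = ev[i];
            if (dist[a] != DBL_MAX && dist[a] + ew[i] < dist[b]) {
                dist[b] = dist[a] + ew[i]; owner[b] = owner[a]; changed = 1;
            }
            if (dist[b] != DBL_MAX && dist[b] + ew[i] < dist[a]) {
                dist[a] = dist[b] + ew[i]; owner[a] = owner[b]; changed = 1;
            }
        }
    } while (changed);

    for (i = 0; i < NV; i++)
        printf("vertex %d: cluster %d (distance %.1f)\n", i, owner[i], dist[i]);
    return 0;
}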

The partitioning boils down to a large number of independent shortest path problems, which can be run in parallel. No surprises there -- when you have a lot of work that can be done in parallel, a parallel machine does well. The GPU implementation (with Bellman-Ford) is about 5x faster than a serial implementation (with Dijkstra) -- but depending on the size of the graph, the number of processors available, the speed of the serial processor, and so on, things might change. If the parallel hardware had full-blown cores (ones that can handle priority queues), running the independent Dijkstra searches across them would likely be faster still.

So -- if you have a large number of shortest path problems to solve, you can do them in parallel, and possibly come out ahead of a serial processor. If you have only one shortest path to do, the graph is small, and you have a ton of parallel horsepower, the parallel approach might still be faster. But for large graphs, and only one shortest path, pick the serial Dijkstra. At this point, I think my money is still safe.

Sunday, July 27, 2008

Take $1000 out of my pocket for Thinking Parallel

Over the past few years, I've sat through a few dozen keynote talks on multi-core. In almost all cases, there's some comment along the lines of computer scientists don't know how to think parallel, and if they'd just smarten up, parallel computing will take off. This usually comes from people who have a management background, and sometimes from electrical engineers.

Like it or not, there are some things that can't be made faster with parallel machines (as far as I, or a number of people I respect, know). So here's an opportunity to take $1000 out of my pocket, just for showing me how to think parallel. A very common task in software applications is finding the shortest path between a pair of vertices in a graph; the best-known algorithm for this is Dijkstra's, which runs in O(E + V log V) time (with a Fibonacci heap). Dijkstra relies on a heap, which is a serial bottleneck; there's no obvious way (at least to me) to use multiple processors.

The Bellman-Ford algorithm also computes shortest path, and is very, very easy to make massively parallel. So why not use the parallel version, you ask? Bellman-Ford is O(EV), which has higher computational complexity. For even modest-sized graphs, Dijkstra will be faster in terms of wall-clock time, require less total power, and do all of this with just a single processor. There are very few types of graphs where Bellman-Ford might be better; you'd have to work hard to find one where a serial Dijkstra would not blow the doors off a parallel Bellman-Ford. If your objective is to get the answer as fast as possible with the minimum power, serial Dijkstra is the obvious choice.
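To make the contrast concrete, here's a bare-bones serial Dijkstra on a toy graph (my own sketch; real code would use adjacency lists rather than scanning the whole edge list, and a fancier priority queue). Look at the shape of the main loop: every iteration has to pop the single globally-minimum vertex before anything else can be settled, so the pops form a chain of dependent steps. Bellman-Ford trades that ordered loop for repeated sweeps over all the edges -- each sweep parallelizes trivially, but you may need up to V-1 of them, which is where the O(EV) comes from.

#include <stdio.h>

#define NV 5            /* vertices in the toy graph */
#define NE 7            /* directed edges            */
#define INF 1000000000

/* Toy directed graph as flat arrays: edge i goes eu[i] -> ev[i], weight ew[i]. */
static int eu[NE] = {0,0,1,1,2,3,3};
static int ev[NE] = {1,2,2,3,3,4,1};
static int ew[NE] = {2,5,1,7,2,3,1};

/* A tiny binary min-heap of (distance, vertex) pairs, with lazy deletion. */
static int hd[NV*NE], hv[NV*NE], hn = 0;

static void push(int d, int v) {
    int i = hn++;
    hd[i] = d; hv[i] = v;
    while (i > 0 && hd[(i-1)/2] > hd[i]) {
        int p = (i-1)/2, td = hd[i], tv = hv[i];
        hd[i] = hd[p]; hv[i] = hv[p]; hd[p] = td; hv[p] = tv; i = p;
    }
}

static void pop(int *d, int *v) {
    *d = hd[0]; *v = hv[0];
    hn--; hd[0] = hd[hn]; hv[0] = hv[hn];
    for (int i = 0; ; ) {
        int l = 2*i+1, r = 2*i+2, m = i;
        if (l < hn && hd[l] < hd[m]) m = l;
        if (r < hn && hd[r] < hd[m]) m = r;
        if (m == i) break;
        int td = hd[i], tv = hv[i];
        hd[i] = hd[m]; hv[i] = hv[m]; hd[m] = td; hv[m] = tv; i = m;
    }
}

int main(void)
{
    int dist[NV], done[NV];
    for (int i = 0; i < NV; i++) { dist[i] = INF; done[i] = 0; }

    dist[0] = 0;
    push(0, 0);

    /* The serial bottleneck: every iteration must pop the vertex with the
       globally smallest tentative distance before any of its neighbors can
       be settled.  The pops are a chain of dependent steps -- there's no
       obvious way to hand them to other cores. */
    while (hn > 0) {
        int d, u;
        pop(&d, &u);
        if (done[u]) continue;          /* stale heap entry */
        done[u] = 1;
        for (int i = 0; i < NE; i++)    /* real code would use adjacency lists */
            if (eu[i] == u && d + ew[i] < dist[ev[i]]) {
                dist[ev[i]] = d + ew[i];
                push(dist[ev[i]], ev[i]);
            }
    }

    for (int i = 0; i < NV; i++)
        printf("shortest distance 0 -> %d = %d\n", i, dist[i]);
    return 0;
}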

How to take $1000 out of my pocket.

So here's the offer. Show me how to think parallel for this problem. Demonstrate a factor of two speedup for shortest path, and use as many parallel processors as you want. The run-time comparison should be between comparable processors (for example, one core of a multi-core AMD/Intel/Sun processor versus all available cores, not a 1987 IBM PC vs a new Xeon). The speedup should come on a real graph, and be compared to a solid version of the serial Dijkstra code. In particular, let's use the 9th DIMACS Implementation Challenge for Shortest Path. We can use the full USA distance map (24 million nodes, 58 million arcs), and find the vertex with the greatest shortest path distance from vertex 0. No cheating by precomputing the result, though - this has to work for other graphs. You have to actually use the other processors, not just tune the serial code. I'll throw in another $1000 if you can show that your cleverness scales linearly up to 16 processors.

Why am I focused on this?

Shortest path is used all over the place; if you need it, it's part of the serial fraction that shows up in Amdahl's Law. There seems to be nothing that can be done about it; if you need to know the shortest way to get from point A to point B, you're stuck. Getting performance scaling to thousands of cores isn't just about having a clever architecture, or a fancy language, or thinking parallel. For massive scaling, you need to eliminate almost all serial bottlenecks, and Dijkstra is one of the ones sitting in the way. With a non-zero serial fraction, the law kicks in -- and that's not some sort of failing on the part of the programmer, the hardware, or the language.

To pull in the bigger picture: shortest path is not the only problem where it's hard to get speedup. There are plenty of others, and if they're essential parts of an application, there's just no way to scale to large numbers of processors. All the wishing in the world won't change this. This is important to understand: the sooner we stop looking for some sort of magic bullet architecture/language/paradigm, the sooner we can start looking for real solutions.

Saturday, July 12, 2008

The Hurdle and the Hole

Just a quick random thought.... There's a lot of attention to new parallel programming languages, APIs, messaging packages, interconnect architectures, and so on. All sorts of ways to make it easier for a programmer to express parallel code, and to have it map onto the hardware efficiently.

Being able to program in parallel is a hurdle. It's difficult to get over. I don't want to downplay the challenge of doing it, or the complexity of the work involved in developing a new programming language, etc. If we are to get any benefit at all from multiple cores (apart from running many independent programs simultaneously), we need to get over the hurdle.

But if we get over the hurdle -- what's on the other side? Unless the run time of the program is dominated by processing that can be done in a massively parallel manner, the serial component starts to dominate quickly, and then we get little gain from exponentially increasing numbers of cores. This is the hole we can fall into immediately.

We need to look a few steps ahead. If we clear the hurdle, then what?

Monday, July 7, 2008

It's all in how you measure

We're engineers and scientists; we measure things. So here's a fun measurement experiment. Get a PS3, install Fedora Core 7 on it, then install the IBM Cell SDK. The IBM page has some useful links for this stuff. I'm using FC7 on my machine, as it seems to have the widest development support, and I'm able to get packages without doing a ton of recompiles. It's actually a pretty nice development environment, although I have to admit that I'm biased towards Unix as a platform.

After you get all this installed, you can build the SDK, and there are some fun programs in the "demos" subdirectory. Look in /opt/cell/sdk/src/demos/matrix_mul.... Everyone knows that matrix multiplication is something that works well in parallel. You can run the demo program; it generates a random matrix, and then does multiplication (breaking down parts of the matrix, and farming out chunks to different SPUs).

The matrix_mul demo takes a few command line arguments. The -s # option specifies the number of SPEs to use (the default is 1). The -m # option specifies how big a matrix we want (multiples of 64 only, please). The -p option specifies that we want the program to report performance statistics. We're scientists and engineers. Of course we want performance statistics.

  • ./matrix_mul -s 1 -m 1024 -p
    Just use one SPE, and multiply a big 1024-square matrix with a copy of itself.
    Initializing Arrays ... done
    Running test ... done
    Performance Statistics:
    number of SPEs = 1
    execution time = 0.09 seconds
    computation rate = 23.85 GFlops/sec
    data transfer rate = 1.54 GBytes/sec

    23.85 GFlops. Wow, pretty darned good. I wonder what happens if we pull in two SPEs?

  • ./matrix_mul -s 2 -m 1024 -p
    Initializing Arrays ... done
    Running test ... done
    Performance Statistics:
    number of SPEs = 2
    execution time = 0.05 seconds
    computation rate = 42.93 GFlops/sec
    data transfer rate = 2.77 GBytes/sec

    Roughly half the execution time, double the flops. Linear speedup! Wonderful! How about four SPEs?

  • ./matrix_mul -s 4 -m 1024 -p
    Initializing Arrays ... done
    Running test ... done
    Performance Statistics:
    number of SPEs = 4
    execution time = 0.03 seconds
    computation rate = 71.55 GFlops/sec
    data transfer rate = 4.61 GBytes/sec

    Holy smokes, still looking good. I've done a few runs, and some have reported more than 100 GFlops. We're obviously running into some challenges with timer resolution; the thing is so darned fast.



Now let's take a closer look. Open up matrix_mul.c in your favorite editor. Scan down past all the include files, argument handling, and all that jazz. Here's an interesting chunk of the code.

... lots of code ...

  /* Start time
   */
  gtime = times(&tbuf);

  /* Send each of the SPUs to get them started.
   */
  for (i=0; i<spus; i++) send_mail(i, 1);

  /* Wait for the SPUs to complete
   */
  for (i=0; i<spus; i++) {
    /* Join thread */
    if ((rc = pthread_join (threads[i].pthread, NULL)) != 0) {
      printf("INTERNAL ERROR: failed to join pthread %d. Error = %s\n", i, strerror(errno));
      exit(1);
    }
  }

  /* Stop time
   */
  elapsed_time = (double)(times(&tbuf) - gtime) / (double)sysconf(_SC_CLK_TCK);

  printf("done\n");

... lots of code ...


What's being measured is the time to get the matrix chunks to the SPEs, have them compute, and have the threads terminate. We're not measuring the time to create the threads. Or to destroy them. Or to load the program from disk to memory. Or to load the code onto the SPEs. Or to get the result back. Or to print out any informational message.
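To see why that matters, here's the pattern in miniature -- a generic pthreads toy of my own (nothing Cell-specific, and certainly not the SDK's code). One stopwatch goes around just the parallel section; another goes around the whole run, including the serial setup and the serial combine at the end.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

#define NTHREADS 4
#define N (1 << 23)            /* 8M doubles */

static double *data;
static double partial[NTHREADS];

static double seconds(struct timeval a, struct timeval b) {
    return (b.tv_sec - a.tv_sec) + 1e-6 * (b.tv_usec - a.tv_usec);
}

static void *worker(void *arg) {
    long id = (long)arg;
    double sum = 0.0;
    for (long i = id; i < N; i += NTHREADS)   /* each thread sums a slice */
        sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    struct timeval t0, t1, t2, t3;
    double total = 0.0;

    gettimeofday(&t0, NULL);                 /* outer stopwatch: everything */

    /* Serial setup -- the analogue of "Initializing Arrays ... done". */
    data = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++)
        data[i] = 1.0 / (double)(i + 1);

    gettimeofday(&t1, NULL);                 /* inner stopwatch: parallel part only */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    gettimeofday(&t2, NULL);

    for (int i = 0; i < NTHREADS; i++)       /* serial cleanup: combine and report */
        total += partial[i];
    gettimeofday(&t3, NULL);

    printf("sum = %f\n", total);
    printf("parallel section: %.4f s\n", seconds(t1, t2));
    printf("whole run:        %.4f s\n", seconds(t0, t3));
    free(data);
    return 0;
}

The demo's performance statistics are the inner stopwatch.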

Suppose we use our trusty Unix friend time to tell us how fast things are?


  • time ./matrix_mul -s 1 -m 1024 -p
    ...
    real 0m0.322s
    user 0m0.188s
    sys 0m0.042s


  • time ./matrix_mul -s 2 -m 1024 -p
    ...
    real 0m0.279s
    user 0m0.188s
    sys 0m0.044s


  • time ./matrix_mul -s 4 -m 1024 -p
    ...
    real 0m0.260s
    user 0m0.189s
    sys 0m0.048s




Two SPEs are about 13% faster than just one, if you count everything (which, let's face it, is the right way to measure). Four SPEs are only about 7% faster than two. I think there's a Law that describes this phenomenon; give me a minute, I'm sure I'll think of what it was.

Bigger matrices, and lots more of them, would certainly help. This would be the Gustafson perspective on things, and in many cases, that's a valid way to look at it. But if you don't actually need to multiply bigger matrices, or more of them, to get the job done, we're stuck. Even worse, if you only need smaller matrices, using extra SPEs is actually a slowdown.

I'm not anywhere close to being the first person to notice this sort of problem. I highly recommend reading David H. Bailey's Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, from Supercomputing Review, August 1991, pages 54-55. What we have above would probably fall under #2.

As I said at the top, we're scientists and engineers. We measure things. We need to be careful when we do this. There are managers, CEOs, and business-school folk nearby; they sometimes make decisions based on what we tell them. They may have a very difficult time understanding the fine distinction between "we can use ten processors to make the program ten times faster" and "at best we can make it twice as fast, and we get almost no effect after the fourth processor." The experiment above could be described as either a 300% speedup (counting only the multiply) or a 24% speedup (measuring start to finish); which one wins a best paper award, and which one is an accurate description? The end of clock rate scaling has given us a very difficult challenge; now is not the time to be sloppy with our measuring sticks.

Wednesday, July 2, 2008

Intel's Unwelcome Advice

Another interesting tidbit from Anwar Ghuloum on the Intel software blog.

Anwar suggests that we should prepare for thousands of cores. Fortunately, I think there are a lot of Intel folks trying to figure out something other than this. I hope they get it figured out soon.

Tuesday, July 1, 2008

What kind of programs work well with.....

Over on the Intel software blog is a post by Jay Hoeflinger on "What kind of programs work well with Cluster OpenMP?" The observation is that if you have a program that needs memory accesses (particularly writes) that are all over the place, getting a page from a remote compute node will be REALLY SLOW.

From the post:
Yet, we have seen many programs get really good performance with Cluster OpenMP. So, how can you know whether your code is one of the high-performance ones or the REALLY SLOW ones?


One could ask the same question about any kind of software, with respect to multiple processors and any architecture. How can we know if our code will run well on multi-core? If there are memory accesses to the same locations, there's always going to be serialization -- locks, cache coherency issues, and so on. This is not something that is going to go away with a new compiler, a new programming paradigm, some sort of clever network interconnect. It's a fundamental aspect of the problem being solved. For a given program, is it going to scale well to multiple cores, or will it be one of the REALLY SLOW ones?
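Here's about the smallest version of the problem I can write down (generic pthreads, nothing specific to Cluster OpenMP): every thread updates the same location, so the updates line up single file behind a lock, no matter how many cores are available.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define NUPDATES 1000000

static long shared_total = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NUPDATES; i++) {
        /* Every thread writes the same location, so the updates are forced
           into single file behind the lock (and the cache line ping-pongs
           between cores even without the lock). */
        pthread_mutex_lock(&lock);
        shared_total++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld (expected %d)\n", shared_total, NTHREADS * NUPDATES);
    return 0;
}

Run it with one thread and then with four; on most machines the four-thread version is no faster, and is often slower, even though the "work" looks perfectly divisible.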

Hoeflinger notes that applications like ray tracing work well on distributed machines. Ray tracing is the poster child for the parallel community; one might note that parallel POVRAY benchmarks have been around since the dawn of time. If an application has the obvious parallelism of ray tracing, and there's value in speeding it up, chances are it's already parallel.

"What kind of programs work well?" is a simple question, and it's a big part of the reason I'm ranting here. If we're not going to increase serial clock rates (and that seems to be impractical), we need to know what sort of programs work well with multiple cores. Ray tracing? check. High volume web servers? check. Something that will draw in lots of customers and serve a broad audience? hmmmm.

My research group is working on this. We've got some ideas, some things we're trying. And we're doing our best not to fall into the same rabbit holes that so many others seem to enjoy.

Monday, June 30, 2008

He's serious, right?

There's an article in Chip Design magazine by Peter Claydon, titled Implementing Multi-Core: The Devil is in the Detail. It makes the standard argument that lots of slower processors can get more computing done, and with less power, than one big high-speed CPU.

Of course, this is true if the computing task is embarrassingly parallel. Graphics, packet switching in a router, high volume web services, and so on -- sure, no problem. But if the task is not like this, the serial bottleneck will kill the overall performance. Has the history of Thinking Machines been wiped from everyone's memory?

We've known the problems for four decades. Are people still really falling for this stuff?

Tuesday, June 24, 2008

The Hippocratic Oath...

A nice post over on Living the Scientific Life.

The University of Toronto Institute of Medical Science formulated this:

I promise never to allow financial gain, competitiveness or ambition cloud my judgment in the conduct of ethical research and scholarship. I will pursue knowledge and create knowledge for the greater good, but never to the detriment of colleagues, supervisors, research subjects or the international community of scholars of which I am now a member.


Good stuff; us folks in science and engineering need to keep the big picture in mind. There's tons of pressure put on researchers and faculty to generate funding, push out papers, get awards, and so on. Probably more people than one would wish take the "easy route." It's a great privilege to work in science; the tenured gig at a decent university is particularly great. And as the wise man said, "with great power comes great responsibility."

Sunday, June 22, 2008

Parallel EDA tools from DAC

DAC had a few sessions this year on multi-core processors, and how they were going to be used in electronic design automation. History buffs will note that there's actually some prior work in the area. For example....



This is the tip of the iceberg. These papers are easy to find. Here's a link to the ACM portal for papers that have the word parallel in the title. Lots out there. Strangely, these don't get cited in the current work.

If we want to make EDA tools parallel, and there are stacks of papers from the past 20 years on precisely that topic... why not just recompile what we have? Why not cite the prior work? Check out the current crop of papers -- and in particular the references. Not enough people doing their homework, if you ask me....

Tuesday, June 17, 2008

The original paper

Not many people have read the original paper (it's pretty hard to find), so here's a pointer to it. It's retyped by Guihai Chen (thanks!) into an easy-to-read PDF.

Note that this was actually a transcript from a talk, rather than a formal paper.  The context of the talk was a debate between the "serial computing" advocates and the "parallel computing" advocates.  There's a chart that illustrates the "expected zone of operation" for most computers -- essentially how much work was expected to be serial in nature, and how much parallel.  Some of the analysis work was done by Professor Knight (can't find his full name) at Stanford, so a fair amount of effort went in before the intuition set in.

For most work loads, Amdahl expected that parallel machines would give diminishing returns -- and that seems to be the case.  No rule against having things that are massively parallel; just not the common case.

Also note that the "Law" was formulated by someone other than Amdahl.  In his ICCAD talk, he mentioned that he had given the above presentation, and because of the contentious nature of the whole event, he put it out of his mind.  A few years later, he started hearing about the law from others -- and was somewhat surprised that people had taken the ideas in the talk and run with them.


Saturday, June 14, 2008

From the deck of the Titanic

Over the past few years, there's been a shift from serial performance to massive parallelism; pretty much anyone with a pulse has noticed this.  The alarm bells went off for me when the EE Times ran a story on Intel canceling a couple of designs and going multi-core.  I'm anti-parallel, or to be a bit more precise, I'm pro-serial.  There's no way to swap increased numbers of processors for increased clock rate.  Intel getting out of the clock game said to me that Moore's Law was over, and that pretty soon, technology scaling would halt.

What's wrong with multi-core, you say?  Well....  parallel computing companies have failed to go anywhere for the past four decades.  Every few years, someone comes along, proclaiming a golden age of parallel computing; everyone gets all excited, and then these prophets quietly slip away in the night, generally with a sack of cash and a few prestigious awards.

The main challenge is Amdahl's Law -- hence the name of the blog.  It's a bit of a simplification, but surprisingly useful.  Assume you look at some source code, and can determine that P percent of the machine instructions you need to execute can be done in parallel.  If you have n processors, then the run time for the parallel section can be sped up by a factor of n (this is really optimistic, but we'll go with it anyway).  The serial part, (1-P), though, chunks along at the same pace.

Gene Amdahl presented this in a classic talk from 1967.  If P=90%, you have 10% serial -- and the speedup over a serial processor is at most 10X, no matter how many processors are available.  Getting to 90% is very difficult; Amdahl's back-of-the-envelope guesstimate was that for most applications, you could expect somewhere from 60 to 70% parallel code.  There would be applications higher (some very close to 100%), and some lower, but his ballpark seems to be about right.
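In equation form, the law is just speedup(n) = 1 / ((1-P) + P/n), where P is the parallel fraction and n is the number of processors.  A quick table (my own throwaway snippet) makes the ceiling obvious:

#include <stdio.h>

/* Quick Amdahl's Law table: speedup(n) = 1 / ((1-P) + P/n) */
int main(void)
{
    double fractions[2] = {0.70, 0.90};   /* parallel fraction P */
    int    cores[5]     = {1, 4, 16, 256, 65536};

    for (int f = 0; f < 2; f++) {
        double P = fractions[f];
        printf("P = %.0f%% parallel (limit: %.1fx)\n", 100 * P, 1.0 / (1.0 - P));
        for (int c = 0; c < 5; c++) {
            int n = cores[c];
            printf("  n = %6d  speedup = %6.2fx\n", n, 1.0 / ((1.0 - P) + P / n));
        }
    }
    return 0;
}

With 90% parallel code and 65,536 processors you still top out just shy of 10X; with Amdahl's 60-70% ballpark, the ceiling sits somewhere between 2.5X and 3.3X.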

With all the excitement over multi-core, I feel like I'm sitting on the deck of the Titanic, watching the captain put the engines on full, accelerating towards the iceberg.  We've crashed here dozens of times before, and we're doing exactly the same thing again.  

Read up on Amdahl's Law if you have not done so already.  You can see a video of Gene Amdahl talking about the law, as well as a panel discussion -- from November 2007.  These next few years are going to be a fun ride.