Tuesday, August 26, 2008

Some numbers to ponder

I was picking on Michael Garland's DAC paper a bit in the last post, with respect to using GPUs for shortest path (sorry Michael!). We've bounced a few emails back and forth, and have sorted things out -- his DAC paper didn't have the space to go into much detail, and I think it may give some people the wrong impression (it certainly did so for me!).

First off -- for doing lots of matrix multiplication, the GPU is definitely a good way to go; no gripes from me on that. Where things get a bit tricky is with the shortest path computation. On a "straightforward" shortest path run, Michael's serial Dijkstra implementation beats the daylights out his serial Bellman-Ford approach (a factor of 30 or so), for modest size graphs. This is as one might expect, and as the graphs get larger, one would anticipate even greater advantage for the Dijkstra algorithm. For the meshes that Michael was looking at, the massive parallel version of Bellman-Ford on the GPU is comparable to the serial Dijkstra (but as we get to bigger and bigger graphs, computational complexity tells us that the serial Dijkstra will win).

The application he's really interested in, however, is not strictly shortest path -- but rather the partitioning of a mesh into independent clusters of triangles, so that these can be rendered in parallel. In a real world application, one might break apart a 3D model into chunks (during preprocessing), and then render them independently. In addition to scientific computing co-processors, nVidia apparently also makes graphics cards! The approach to partition a mesh into clusters uses shortest path.

For partitioning, a set of vertices are chosen as "seeds," and shortest path search waves propagate from each. When the waves meet, the search ends there, with the waves progressing to other parts of the mesh. Think of throwing a bunch of stones into a pond at the same time -- the clusters are determined by which side of a "ripple" a particular vertex falls on. For those of you familiar with computational geometry, this might be thought of as a graph-based Voronoi diagram.

The partitioning boils down to a large number of independent shortest path problems, which can be run in parallel. No surprises there -- when you have a lot of work that can be done in parallel, a parallel machine does well. The GPU implementation (with Bellman Ford) is about 5x faster than a serial implementation (with Dijkstra) -- but depending on the size of the graph, the number of processors available, the speed of the serial processor, and so on, things might change. If there are full-blown cores (that can handle priority queues), a parallel multi-core approach would be even faster.

So -- if you have a large number of shortest path problems to solve, you can do them in parallel, and possibly come out ahead of a serial processor. If you have only one shortest path to do, the graph is small, and you have a ton of parallel horsepower, the parallel approach might still be faster. But for large graphs, and only one shortest path, pick the serial Dijkstra. At this point, I think my money is still safe.