I was picking on Michael Garland's DAC paper a bit in the last post, with respect to using GPUs for shortest path (sorry Michael!). We've bounced a few emails back and forth, and have sorted things out  his DAC paper didn't have the space to go into much detail, and I think it may give some people the wrong impression (it certainly did so for me!).
First off  for doing lots of matrix multiplication, the GPU is definitely a good way to go; no gripes from me on that. Where things get a bit tricky is with the shortest path computation. On a "straightforward" shortest path run, Michael's serial Dijkstra implementation beats the daylights out his serial BellmanFord approach (a factor of 30 or so), for modest size graphs. This is as one might expect, and as the graphs get larger, one would anticipate even greater advantage for the Dijkstra algorithm. For the meshes that Michael was looking at, the massive parallel version of BellmanFord on the GPU is comparable to the serial Dijkstra (but as we get to bigger and bigger graphs, computational complexity tells us that the serial Dijkstra will win).
The application he's really interested in, however, is not strictly shortest path  but rather the partitioning of a mesh into independent clusters of triangles, so that these can be rendered in parallel. In a real world application, one might break apart a 3D model into chunks (during preprocessing), and then render them independently. In addition to scientific computing coprocessors, nVidia apparently also makes graphics cards! The approach to partition a mesh into clusters uses shortest path.
For partitioning, a set of vertices are chosen as "seeds," and shortest path search waves propagate from each. When the waves meet, the search ends there, with the waves progressing to other parts of the mesh. Think of throwing a bunch of stones into a pond at the same time  the clusters are determined by which side of a "ripple" a particular vertex falls on. For those of you familiar with computational geometry, this might be thought of as a graphbased Voronoi diagram.
The partitioning boils down to a large number of independent shortest path problems, which can be run in parallel. No surprises there  when you have a lot of work that can be done in parallel, a parallel machine does well. The GPU implementation (with Bellman Ford) is about 5x faster than a serial implementation (with Dijkstra)  but depending on the size of the graph, the number of processors available, the speed of the serial processor, and so on, things might change. If there are fullblown cores (that can handle priority queues), a parallel multicore approach would be even faster.
So  if you have a large number of shortest path problems to solve, you can do them in parallel, and possibly come out ahead of a serial processor. If you have only one shortest path to do, the graph is small, and you have a ton of parallel horsepower, the parallel approach might still be faster. But for large graphs, and only one shortest path, pick the serial Dijkstra. At this point, I think my money is still safe.
DAC 2012: Mystical Confluence: ESL Hockey Stick and The Cup!

Another note from DAC 2012: In Gary Smith’s Sunday night preDAC talk, he
mentioned that in 2011, ESL tools took off – the famous Hockey Stick. See
his s...
5 years ago