Parallelization of Reordering Algorithms for Bandwidth and Wavefront Reduction

Konstantinos I. Karantasis∗, Andrew Lenharth†, Donald Nguyen‡, María J. Garzarán∗, Keshav Pingali†,‡

∗Department of Computer Science, University of Illinois at Urbana-Champaign
†Institute for Computational Engineering and Sciences and ‡Department of Computer Science, University of Texas at Austin
Abstract—Many sparse matrix computations can be speeded up if the matrix is first reordered. Reordering was originally developed for direct methods but it has recently become popular for improving the cache locality of parallel iterative solvers since reordering the matrix to reduce bandwidth and wavefront can improve the locality of reference of sparse matrix-vector multiplication (SpMV), the key kernel in iterative solvers. In this paper, we present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan. On 16 cores of the Stampede supercomputer, our parallel RCM is 5.56 times faster on the average than a state-of-the-art sequential implementation of RCM in the HSL

More recently, reordering has become popular even in the context of iterative sparse solvers where problems like minimizing fill do not arise. The key computation in an iterative sparse solver is sparse matrix-vector multiplication (SpMV) (say y = Ax). If the matrix is stored in compressed row storage (CRS) and the SpMV computation is performed by rows, the accesses to y and A enjoy excellent locality, but the accesses to x may not. One way to improve the locality of accesses to the elements of x is to reorder the sparse matrix A using a bandwidth-reducing ordering (RCM is popular).
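As a rough illustration of the access pattern described above, the following Python sketch performs a row-wise SpMV on a CRS matrix, measures the matrix bandwidth, and applies a reordering. All names here (`build_crs`, `spmv_crs`, `bandwidth`, `rcm_order`, `permute_crs`) and the 6-vertex example are assumptions made for illustration and do not come from the paper. In particular, `rcm_order` is a plain sequential textbook RCM that assumes a structurally symmetric matrix; it is not the parallel algorithm presented in this paper.

```python
from collections import deque

def build_crs(n, edges):
    """Build a symmetric CRS matrix (2.0 on the diagonal, -1.0 per edge)."""
    rows = [{i: 2.0} for i in range(n)]
    for a, b in edges:
        rows[a][b] = rows[b][a] = -1.0
    row_ptr, col_idx, vals = [0], [], []
    for r in rows:
        for j in sorted(r):
            col_idx.append(j)
            vals.append(r[j])
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals

def spmv_crs(row_ptr, col_idx, vals, x):
    """Compute y = A x by rows for A in compressed row storage."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        # vals and col_idx are streamed contiguously; x is read at
        # whatever column indices the row holds, which may be scattered.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)  # y is written once per row, in order
    return y

def bandwidth(row_ptr, col_idx):
    """Largest |i - j| over all stored entries (i, j)."""
    n = len(row_ptr) - 1
    return max((abs(i - col_idx[k]) for i in range(n)
                for k in range(row_ptr[i], row_ptr[i + 1])), default=0)

def rcm_order(row_ptr, col_idx):
    """Sequential textbook RCM; returns order with order[new] = old."""
    n = len(row_ptr) - 1
    nbrs = [[j for j in col_idx[row_ptr[i]:row_ptr[i + 1]] if j != i]
            for i in range(n)]
    deg = [len(a) for a in nbrs]
    visited = [False] * n
    order = []
    # Start each component from a lowest-degree unvisited vertex.
    for start in sorted(range(n), key=lambda v: deg[v]):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Breadth-first search, neighbors by increasing degree.
            for w in sorted(nbrs[v], key=lambda u: deg[u]):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    order.reverse()  # the "reverse" in Reverse Cuthill-McKee
    return order

def permute_crs(row_ptr, col_idx, vals, order):
    """Apply the symmetric permutation given by order[new] = old."""
    new_of = [0] * len(order)
    for new, old in enumerate(order):
        new_of[old] = new
    p_ptr, p_col, p_val = [0], [], []
    for old in order:
        entries = sorted((new_of[col_idx[k]], vals[k])
                         for k in range(row_ptr[old], row_ptr[old + 1]))
        for j, v in entries:
            p_col.append(j)
            p_val.append(v)
        p_ptr.append(len(p_col))
    return p_ptr, p_col, p_val

# Hypothetical example: a 6-vertex path whose vertices are labeled out of order.
A = build_crs(6, [(0, 3), (3, 1), (1, 5), (5, 2), (2, 4)])
order = rcm_order(A[0], A[1])
B = permute_crs(*A, order)
print(bandwidth(A[0], A[1]), bandwidth(B[0], B[1]))  # prints: 4 1
```

In this small example the reordering brings every nonzero next to the diagonal, so each row of the permuted matrix reads only neighboring elements of x. The permuted product agrees with the original one once x and y are permuted consistently.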