A Flexible Class of Parallel Matrix Multiplication Algorithms

John Gunnels   Calvin Lin   Greg Morrow   Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
Austin, TX 78712

Abstract

This paper explains why parallel implementation of matrix multiplication—a seemingly simple algorithm that can be expressed as one statement and three nested loops—is complex: Practical algorithms that use matrix multiplication tend to use matrices of disparate shapes, and the shape of the matrices can significantly impact the performance of matrix multiplication. We provide a class of algorithms that covers the spectrum of shapes encountered and demonstrate that good performance can be attained if the right algorithm is chosen. These observations set the stage for hybrid algorithms which choose between the algorithms based on the shapes of the matrices involved. While the paper resolves a number of issues, it concludes with discussion of a number of directions yet to be pursued.

* This research was supported primarily by the PRISM project (ARPA grant P-95006) and the Environmental Molecular Sciences construction project at Pacific Northwest National Laboratory. Additional support came from the NASA HPCC Program's Earth and Space Sciences Project (NRA Grants NAG5-2497 and NAG5-2511), and the Intel Research Council. Equipment was provided by the National Partnership for Advanced Computational Infrastructure, the University of Texas Computation Center's High Performance Computing Facility, and the Texas Institute for Computational and Applied Mathematics.

1 Introduction

Over the last three decades, a number of different approaches have been proposed for implementing matrix-matrix multiplication on distributed memory architectures. These include Cannon's algorithm [4], broadcast-multiply-roll [12, 11], and generalizations of broadcast-multiply-roll [8, 13, 2]. The approach now considered the most practical, known as broadcast-broadcast, was first proposed by Agarwal et al. [1], who showed that a sequence of parallel rank-k updates is a highly effective way to parallelize $C = AB$. This same observation was independently made by van de Geijn and Watts [17], who introduced the Scalable Universal Matrix Multiplication Algorithm (SUMMA). In addition to computing $C = AB$, SUMMA implements $C = AB^T$ and $C = A^T B$ as a sequence of matrix-panel-of-vectors multiplies, and $C = A^T B^T$ as a sequence of rank-k updates. Later work [5] showed how these techniques can be extended to a large class of commonly used matrix-matrix operations that are part of the Level-3 Basic Linear Algebra Subprograms (BLAS) [9].

These previous efforts have focused primarily on the special case where the input matrices are approximately square. Recent work by Li, et al. [15] creates a "poly-algorithm" that chooses between algorithms (Cannon's, broadcast-multiply-roll, and broadcast-broadcast). Empirical data show some advantage for different shapes of meshes (e.g., square vs. rectangular). However, since the three algorithms are not inherently suited to specific shapes of matrices, limited benefit is observed.

We previously observed [16] the need for algorithms to be sensitive to the shape of the matrices and hinted that a class of algorithms naturally supported by the Parallel Linear Algebra Package (PLAPACK) provides a good basis for such shape-adaptive algorithms. Thus, a number of hybrid algorithms are now part of PLAPACK. This observation was also made by the ScaLAPACK [6] project, and indeed ScaLAPACK includes a number of matrix-matrix operations that choose algorithms based on the shape of the matrix. We contrast our approach with ScaLAPACK's later.

This paper describes and analyzes a class of parallel matrix multiplication algorithms that naturally lends itself to hybridization. The analysis combines theoretical results with empirical results to provide a complete picture of the benefits of the different algorithms and how they can be combined into hybrid algorithms. A number of simplifications are made to focus on the key issues. Having done so, a number of extensions are identified for further study.

2 Data Distribution

For all algorithms, we will assume that the processors are logically viewed as an $r \times c$ mesh of computational nodes which are indexed $P_{i,j}$, $0 \leq i < r$ and $0 \leq j < c$, so that the total number of nodes is $p = r c$. Physically, these nodes are connected through some communication network, which could be a hypercube (Intel iPSC/860), a higher dimensional mesh (Intel Paragon, Cray T3D/E) or a multistage network (IBM SP2).

Physically Based Matrix Distribution (PBMD). We have previously observed [10] that data distributions should focus on the decomposition of the physical problem to be solved, rather than on the decomposition of matrices. Typically, it is the elements of vectors that are associated with data of physical significance, and so it is their distribution to nodes that is significant. A matrix, which is a discretized operator, merely represents the relation between two vectors, which are discretized spaces: $y = Ax$. Since it is more natural to start by distributing the problem to nodes, we first distribute $x$ and $y$ to nodes. The matrix $A$ is then distributed so as to be consistent with the distribution of $x$ and $y$.

To describe the distribution of the vectors, assume that $x$ and $y$ are of length $n$ and $m$, respectively. We will distribute these vectors using a distribution block size of $b_{distr}$. For simplicity assume $n = N b_{distr}$ and $m = M b_{distr}$. Partition $x$ and $y$ so that
\[
x = \left( \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_N \end{array} \right)
\quad \mbox{and} \quad
y = \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_M \end{array} \right)
\]
where $N$ and $M$ are typically much larger than $p$, and each $x_i$ and $y_i$ is of length $b_{distr}$. Partitioning $A$ conformally yields the blocked matrix
\[
A = \left( \begin{array}{cccc}
A_{1,1} & A_{1,2} & \cdots & A_{1,N} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,N} \\
\vdots  & \vdots  &        & \vdots  \\
A_{M,1} & A_{M,2} & \cdots & A_{M,N}
\end{array} \right)  \qquad (1)
\]
where each subblock $A_{i,j}$ is of size $b_{distr} \times b_{distr}$. Blocks of $x$ and $y$ are assigned to a 2D mesh of nodes in column-major order. The matrix distribution is induced by assigning blocks of columns $A_{\star j}$ to the same column of nodes as subvector $x_j$, and blocks of rows $A_{i \star}$ to the same row of nodes as subvector $y_i$ [16]. This distribution wraps blocks of rows and columns onto rows and columns of processors, respectively. The wrapping is necessary to improve load balance.
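The induced mapping just described is easy to state in code. The following is a minimal sketch, assuming 0-based block indices and the column-major wrapping of vector blocks onto the $r \times c$ mesh described above; the routine names are ours and are not PLAPACK's interface.

\begin{verbatim}
#include <stdio.h>

typedef struct { int row, col; } node_t;

/* Node that owns vector block i of x or y: blocks are wrapped onto the
   r x c mesh in column-major order (an assumption consistent with the
   description above; indices here are 0-based). */
node_t vector_block_owner(int i, int r, int c) {
    node_t n;
    n.row = i % r;        /* wrap down a column of nodes first        */
    n.col = (i / r) % c;  /* then move to the next column of the mesh */
    return n;
}

/* Node that owns matrix block A(i,j): same row of nodes as y_i,
   same column of nodes as x_j. */
node_t matrix_block_owner(int i, int j, int r, int c) {
    node_t n;
    n.row = vector_block_owner(i, r, c).row;
    n.col = vector_block_owner(j, r, c).col;
    return n;
}

int main(void) {
    int r = 2, c = 3;     /* a hypothetical 2 x 3 mesh of nodes */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++) {
            node_t n = matrix_block_owner(i, j, r, c);
            printf("A(%d,%d) -> P(%d,%d)\n", i, j, n.row, n.col);
        }
    return 0;
}
\end{verbatim}

Printing the assignment for a small mesh shows how blocks of rows and columns wrap over the rows and columns of nodes.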

Data movement. In the PBMD representation, vectors are fully distributed (not duplicated), while a single row or column of a matrix will reside in a single row or column of processors. One benefit of the PBMD representation is the clean manner in which vectors, matrix rows, and matrix columns interact. Converting a column of a matrix to a vector requires a scatter communication within rows. Similarly, a panel of columns can be converted into a multivector (a group of vectors) by simultaneously scattering the columns within rows. For details, see [16].

3 A Class of Algorithms

The target operation will be
\[
C = \alpha A B + \beta C
\]
where for simplicity we will often treat only the case where $\alpha = 1$ and $\beta = 0$. There are three dimensions involved: $m$, $n$, and $k$, where $C$, $A$, and $B$ are $m \times n$, $m \times k$, and $k \times n$, respectively.
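Sequentially, this target operation is the "one statement and three nested loops" mentioned in the abstract. A minimal reference sketch (column-major storage; the routine name and argument list are ours, not those of any particular library):

\begin{verbatim}
/* C := alpha*A*B + beta*C, with A m-by-k, B k-by-n, C m-by-n,
   all stored in column-major order with leading dimensions lda, ldb, ldc. */
void gemm_ref(int m, int n, int k, double alpha,
              const double *A, int lda, const double *B, int ldb,
              double beta, double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double cij = beta * C[i + j * ldc];
            for (int p = 0; p < k; p++)
                cij += alpha * A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = cij;
        }
}
\end{verbatim}

The remainder of this section is about organizing this computation so that, for a distributed matrix, each node performs most of its work inside a local call of exactly this form.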

It is always interesting to investigate extremal cases, which happen when one or more of the dimensions equal unity:

One dimension equals unity:

- $k = 1$: For $m$, $n$ large, $A$ and $B$ become column and row vectors, respectively, and the operation becomes a rank-1 update.

- $n = 1$: For $m$, $k$ large, $C$ and $B$ are column vectors, and the operation becomes a matrix-vector multiply.

- $m = 1$: For $n$, $k$ large, $C$ and $A$ are row vectors, and the operation becomes a row vector-matrix multiply.

Two dimensions equal unity:

- $n = k = 1$: For $m$ large, $C$ and $A$ become column vectors, and $B$ a scalar. The operation becomes a scaled vector addition (axpy).

- $m = k = 1$: For $n$ large, $C$ and $B$ become row vectors, and $A$ a scalar. Again, the operation becomes a scaled vector addition.

- $m = n = 1$: For $k$ large, $A$ and $B$ become row and column vectors, respectively, and $C$ a scalar. The operation becomes an inner-product (dot).

Implementing the matrix-matrix multiplication for these degenerate cases is relatively straightforward: The rank-1 update is implemented by duplicating vectors and updating local parts of $C$ on each node. The matrix-vector multiplications are implemented by duplicating the vector to be multiplied, performing local matrix-vector multiplications on each node, and reducing the result to $C$. In the cases where two dimensions are small, the column or row vectors are redistributed like vectors (as described in the previous section) and local operations are performed on these vectors, after which the results are redistributed as needed.

In the next section, we will discuss how the case where none of the dimensions are unity can be implemented as a sequence of the above operations, by partitioning the operands appropriately. We present this along with one minor extension: rather than dealing with one row or column vector at a time, a number of these vectors are treated as a block (panel or multivector). This reduces the number of messages communicated and improves the performance on each node by allowing matrix-matrix multiplication kernels to be called locally.

3.1 Notation

For simplicity, we will assume that the dimensions $m$, $n$, and $k$ are integer multiples of the algorithmic block size $b_{alg}$. We will use the following partitionings of $A$, $B$, and $C$:
\[
X = \left( X_1 \mid X_2 \mid \cdots \mid X_{n_X / b_{alg}} \right)
  = \left( \begin{array}{c} \hat X_1 \\ \hat X_2 \\ \vdots \\ \hat X_{m_X / b_{alg}} \end{array} \right)
\]
where $X \in \{ A, B, C \}$, and $m_X$ and $n_X$ are the row and column dimension of the indicated matrix. Here $X_j$ and $\hat X_i$ indicate panels of columns and rows, respectively: $X_j$ is $m_X \times b_{alg}$ and $\hat X_i$ is $b_{alg} \times n_X$.

In these partitionings, $b_{alg}$ is chosen to maximize the performance of the local matrix-matrix multiplication operation. In the discussion below, we will use $\hat M = m / b_{alg}$, $\hat N = n / b_{alg}$, and $\hat K = k / b_{alg}$.

3.2 Two dimensions are "large"

First we present algorithms that use kernels that correspond to the extremal cases discussed above where one dimension equals unity.

Panel-panel update based algorithm: $m$ and $n$ large. Notice that if $m$ and $n$ are both large compared to $k$, then $C$ contains the most elements, and an algorithm that moves data in $A$ and $B$ rather than $C$ may be a reasonable choice. Observe that
\[
C = A B = \left( A_1 \mid A_2 \mid \cdots \mid A_{\hat K} \right)
\left( \begin{array}{c} \hat B_1 \\ \hat B_2 \\ \vdots \\ \hat B_{\hat K} \end{array} \right)
= A_1 \hat B_1 + A_2 \hat B_2 + \cdots + A_{\hat K} \hat B_{\hat K} .
\]
Thus, if we know how to perform one update $C \leftarrow C + A_i \hat B_i$ in parallel, the complete matrix-matrix multiplication can be implemented as $\hat K$ calls to this "panel-panel multiplication" kernel. In [17, 16] it is shown that one panel-panel update can be implemented by the following sequence of operations: Duplicate (broadcast) $A_i$ within rows; Duplicate (broadcast) $\hat B_i$ within columns; On each node perform an update to the local portion of $C$. Computing $C = AB$ is then implemented by repeated calls to this kernel, once for each $A_i$, $\hat B_i$ pair.
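In terms of plain message passing, the loop just described takes the following shape. This is a hedged sketch only: it uses raw MPI rather than PLAPACK's calls, and the communicators, ownership functions, panel-loading callbacks, and local multiply are assumptions introduced for the illustration.

\begin{verbatim}
#include <mpi.h>

/* One call computes Cloc += sum_i A_i * B^_i for this node's block of C.
   row_comm: the c nodes in this node's row of the mesh (rank = mesh column)
   col_comm: the r nodes in this node's column of the mesh (rank = mesh row)
   owner_col(i)/owner_row(i): which mesh column/row holds A_i / B^_i
   load_Apanel/load_Bpanel: the owner copies its panel into the buffer
   local_gemm(m,n,k,A,B,C): any local multiply performing C += A*B        */
void panel_panel_mm(int Khat, int b, int mloc, int nloc,
                    double *Abuf, double *Bbuf, double *Cloc,
                    MPI_Comm row_comm, MPI_Comm col_comm,
                    int (*owner_col)(int), int (*owner_row)(int),
                    void (*load_Apanel)(int, double *),
                    void (*load_Bpanel)(int, double *),
                    void (*local_gemm)(int, int, int, const double *,
                                       const double *, double *))
{
    int my_col, my_row;
    MPI_Comm_rank(row_comm, &my_col);
    MPI_Comm_rank(col_comm, &my_row);

    for (int i = 0; i < Khat; i++) {
        if (my_col == owner_col(i)) load_Apanel(i, Abuf);
        if (my_row == owner_row(i)) load_Bpanel(i, Bbuf);

        /* Duplicate (broadcast) A_i within rows of the mesh ...         */
        MPI_Bcast(Abuf, mloc * b, MPI_DOUBLE, owner_col(i), row_comm);
        /* ... and B^_i within columns of the mesh.                      */
        MPI_Bcast(Bbuf, b * nloc, MPI_DOUBLE, owner_row(i), col_comm);

        /* Rank-b update of the local portion of C.                      */
        local_gemm(mloc, nloc, b, Abuf, Bbuf, Cloc);
    }
}
\end{verbatim}

Note that each iteration communicates only a column panel of $A$ and a row panel of $B$; $C$, which holds the most elements in this regime, is never moved.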

Matrix-panel multiply based algorithm: $m$ and $k$ large. If $m$ and $k$ are both large compared to $n$, then $A$ contains the most elements, and an algorithm that moves data in $C$ and $B$ rather than $A$ may be a reasonable choice. Observe that
\[
C = \left( C_1 \mid C_2 \mid \cdots \mid C_{\hat N} \right)
  = A \left( B_1 \mid B_2 \mid \cdots \mid B_{\hat N} \right)
\]
so that $C_j = A B_j$. Thus, if we know how to perform $C_j = A B_j$ in parallel, the complete matrix-matrix multiplication can be implemented as $\hat N$ calls to this "matrix-panel multiplication" kernel. In [16] it is shown that one matrix-panel update can be implemented by the following sequence of operations: Scatter $B_j$ within rows, followed by a collect (all-gather) within columns to duplicate it; Perform a matrix-multivector multiplication locally on each node, yielding contributions to $C_j$; Reduce (sum) the result to $C_j$. By calling this kernel for each $C_j = A B_j$ pair, $C$ is computed.

Panel-matrix multiply based algorithm: $n$ and $k$ large. Finally, if $n$ and $k$ are both large compared to $m$, then $B$ contains the most elements, and an algorithm that moves data in $C$ and $A$ rather than $B$ may be a reasonable choice. The algorithm for this case is similar to the matrix-panel based algorithm, except that it depends on a panel-matrix kernel instead.
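Ignoring communication, the three variants of this subsection differ only in which dimension is blocked; sequentially, each is just a different loop around the reference multiply. A sketch (reusing gemm_ref from the earlier listing; column-major storage, with b dividing m, n, and k, and C assumed already scaled by beta, e.g. zeroed when beta = 0):

\begin{verbatim}
/* From the earlier sketch. */
void gemm_ref(int m, int n, int k, double alpha,
              const double *A, int lda, const double *B, int ldb,
              double beta, double *C, int ldc);

/* Panel-panel: C += A_1*B^_1 + ... + A_Khat*B^_Khat (block the k dimension). */
void by_panel_panel(int m, int n, int k, int b,
                    const double *A, const double *B, double *C)
{
    for (int p = 0; p < k; p += b)
        gemm_ref(m, n, b, 1.0, &A[(long)p * m], m, &B[p], k, 1.0, C, m);
}

/* Matrix-panel: C_j = A * B_j for each column panel (block the n dimension). */
void by_matrix_panel(int m, int n, int k, int b,
                     const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j += b)
        gemm_ref(m, b, k, 1.0, A, m, &B[(long)j * k], k, 1.0, &C[(long)j * m], m);
}

/* Panel-matrix: C^_i = A^_i * B for each row panel (block the m dimension). */
void by_panel_matrix(int m, int n, int k, int b,
                     const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i += b)
        gemm_ref(b, n, k, 1.0, &A[i], m, B, k, 1.0, &C[i], m);
}
\end{verbatim}

In the parallel setting, the choice among these loop orders determines which operand is communicated, which is exactly the distinction drawn above.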

3.3 Two dimensions are "small"

Next we present algorithms that use kernels that correspond to the extremal cases discussed above where two dimensions equal unity. Notice that implementation of these kernels is unique to PLAPACK, since other libraries like ScaLAPACK do not provide a facility for distributing vectors other than as columns or rows of matrices.

Column-axpy based algorithm: $m$ large. Notice that if $n = k = 1$, the matrix-matrix multiplication becomes a scaled vector addition: $y \leftarrow \alpha x + y$, sometimes known as an "axpy" operation. If $n$ and $k$ are both small (e.g., equal to one), then $C$ and $A$ will exist within only one or a few columns of nodes, while $B$ is a small matrix that exists within one node, or at most a few nodes. To get good parallelism, it becomes beneficial to redistribute the columns of $C$ and $A$ like vectors.

Observe that
\[
\left( C_1 \mid \cdots \mid C_{\hat N} \right)
= \left( A_1 \mid \cdots \mid A_{\hat K} \right)
\left( \begin{array}{ccc}
B_{1,1} & \cdots & B_{1,\hat N} \\
\vdots  &        & \vdots       \\
B_{\hat K,1} & \cdots & B_{\hat K,\hat N}
\end{array} \right)
\]
so that $C_j = A_1 B_{1,j} + \cdots + A_{\hat K} B_{\hat K,j}$. Thus, if we know how to perform one update $C_j \leftarrow C_j + A_q B_{q,j}$ in parallel, the complete computation of $C_j$ can be implemented as $\hat K$ calls to this "axpy" kernel, and the complete matrix-matrix multiply is accomplished by computation of all $C_j$'s. The following operations implement the computation of $C_j$:

- Redistribute (scatter) the columns of $C_j$ like vectors
- for $q = 1, \ldots, \hat K$
  – Redistribute (scatter) the columns of $A_q$ like vectors; Duplicate (broadcast) $B_{q,j}$ to all nodes; Update the local part of $C_j \leftarrow C_j + A_q B_{q,j}$.
- Redistribute (gather) the columns of $C_j$ to where they started

Repeated calls to this kernel compute the final $C$.
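The control structure of this kernel follows the list above directly. In the sketch below, every routine name (scatter_C_columns_as_vectors and the rest) is a hypothetical placeholder standing in for the PLAPACK redistribution facilities the text refers to; only the loop structure is meant to carry information.

\begin{verbatim}
/* Compute one C_j by Khat axpy-style updates.  All callbacks are
   hypothetical placeholders; q and j are 1-based as in the text. */
void compute_Cj_by_axpy(int j, int Khat,
                        double *Cj_vecs,  /* columns of C_j, viewed as vectors  */
                        double *Aq_vecs,  /* workspace for the columns of A_q   */
                        double *Bqj,      /* duplicated block B_{q,j}           */
                        void (*scatter_C_columns_as_vectors)(int, double *),
                        void (*scatter_A_columns_as_vectors)(int, double *),
                        void (*broadcast_B_block)(int, int, double *),
                        void (*local_update)(const double *, const double *,
                                             double *),
                        void (*gather_C_columns_back)(int, const double *))
{
    /* Redistribute (scatter) the columns of C_j like vectors. */
    scatter_C_columns_as_vectors(j, Cj_vecs);

    for (int q = 1; q <= Khat; q++) {
        /* Redistribute (scatter) the columns of A_q like vectors. */
        scatter_A_columns_as_vectors(q, Aq_vecs);
        /* Duplicate (broadcast) B_{q,j} to all nodes. */
        broadcast_B_block(q, j, Bqj);
        /* Update the local part of C_j <- C_j + A_q * B_{q,j}. */
        local_update(Aq_vecs, Bqj, Cj_vecs);
    }

    /* Redistribute (gather) the columns of C_j to where they started. */
    gather_C_columns_back(j, Cj_vecs);
}
\end{verbatim}

The scatter and gather of $C_j$ happen once per $j$, while the loop body runs $\hat K$ times, which is the amortization argument made next.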

It may appear that the cost of redistributing $C_j$ and $A_q$ is prohibitive, making this algorithm inefficient. Notice, however, that the redistribution of $C_j$ is potentially amortized over many updates. Also, the ratio of the computation with $A_q$ to the cost of redistributing this data is typically $b_{alg} : 1$. Thus, when $k$ is small but comparable to $b_{alg}$, which itself is reasonably large, this approach is a viable alternative.

Row-axpy based algorithm: $n$ large. Notice that if $m = k = 1$, the matrix-matrix multiplication again becomes a scaled vector addition, except that now rows of $C$ and $B$ are redistributed like vectors. The algorithm is similar to the last case discussed.

Dot based algorithm: $k$ large. Notice that if $m = n = 1$, the matrix-matrix multiplication becomes an inner-product between one row of $A$ and one column of $B$. For $m$ and $n$ small, we observe that
\[
\left( \begin{array}{ccc}
C_{1,1} & \cdots & C_{1,\hat N} \\
\vdots  &        & \vdots       \\
C_{\hat M,1} & \cdots & C_{\hat M,\hat N}
\end{array} \right)
= \left( \begin{array}{c} \hat A_1 \\ \vdots \\ \hat A_{\hat M} \end{array} \right)
\left( B_1 \mid \cdots \mid B_{\hat N} \right)
\]
so that $C_{i,j} = \hat A_i B_j$. Computation of each $C_{i,j}$ can be accomplished by redistributing the columns of $B_j$ and the rows of $\hat A_i$ like vectors, performing local operations that are like local inner-products, which are then reduced to $C_{i,j}$.

3.4 Cost analysis

Notice that the above algorithms are implemented as an iteration over a sequence of communications and computations. By modeling the cost of each of the components, one can derive models for the cost of the different algorithms. Thus, we obtain a performance model based on the estimated cost of local computation as well as models of the communication required. Parameters for the model include the cost of local computation, message latency and bandwidth, and the cost of packing and unpacking data before and after communications. Due to space limitations, we do not develop this cost analysis in this paper. Rather, we present graphs for the estimated cost of the different algorithms in the next section.
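As a rough illustration of the kind of model meant here (and only that: the tree-broadcast term, the parameters, and the routine below are our assumptions, not the authors' model), the estimated time for one iteration of the panel-panel algorithm might be written as follows, with alpha the message latency, beta the per-double transfer time, and gamma the time per floating-point operation.

\begin{verbatim}
#include <math.h>

/* Estimated time of one panel-panel iteration on an r x c mesh:
   broadcast an (m/r x b) piece of A_i within a row of nodes, a
   (b x n/c) piece of B^_i within a column, then do the local
   rank-b update. */
double panel_panel_iteration_time(double m, double n, double b,
                                  double r, double c,
                                  double alpha, double beta, double gamma)
{
    double mloc = m / r, nloc = n / c;
    double bcast_A = ceil(log2(c)) * (alpha + mloc * b * beta);
    double bcast_B = ceil(log2(r)) * (alpha + b * nloc * beta);
    double local   = 2.0 * mloc * nloc * b * gamma;
    return bcast_A + bcast_B + local;
}
\end{verbatim}

Summing per-iteration estimates of this kind over the iterations of each algorithm, with the appropriate redistribution terms, yields estimated-cost curves of the kind plotted in the next section.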

4 Results

This section provides performance results for the described algorithms on the Cray T3E and IBM SP-2. Our implementations are coded using PLAPACK, which naturally supports these kinds of algorithms. In addition, we report results from our performance model, using parameters measured for the different architectures.

Ultimately, it is the performance of the local 64-bit matrix-matrix multiplication implementation that determines the asymptotic performance rate that can be achieved. Since the algorithms make repeated calls to parallel implementations of kernels that assume that one or more of the dimensions are small, the operands to the matrix-matrix multiplication performed on each individual processor also have different shapes that depend on the algorithm chosen. Even on an individual processor the shape of the operands can greatly affect the performance of the local computation, as is demonstrated in Table 1. In that table, the small dimensions were set equal to the blocking size used for the parallel algorithm, specifically 128 for the Cray T3E. It is this performance that becomes an asymptote for the per-node performance of the parallel implementation. It should be noted that since these timings were first performed, Cray has updated its BLAS library, and performance for the sequential matrix-matrix multiply kernels is much less shape dependent. Thus, the presented data is of qualitative interest, but performance is now considerably better than presented in the performance graphs.

       m      n      k   Rate (MFLOPS/sec)
    4096   4096    128   439
    4096    128   4096   172
     128   4096   4096   237
    4096    128    128   230
     128   4096    128   184
     128    128   4096   184

Table 1: Performance of the local 64-bit matrix-matrix multiplication kernel on one processor of the Cray T3E.

It is interesting to note that the local matrix-matrix multiply performs best for panel-panel update operations. To those of us experienced with high-performance architectures this is not a surprise: This is the case that is heavily used by the LINPACK benchmark, and thus the first to be fully optimized. There is really no technical reason why, for a block size of 128, the other shapes do not perform equally well. Indeed, as an architecture matures, those cases also get optimized. For example, on the Cray T3D the performance of the local matrix-matrix BLAS kernel is not nearly as dependent on the shape.

In Fig. 1, we report performance on the Cray T3E of the different algorithms. By fixing one or more dimensions to be large, we demonstrate that, indeed, the appropriateness of the presented algorithms is dependent upon the shape of the operands. Of particular interest are the asymptotes reached by the different algorithms and how quickly the asymptotes are approached.

Let us start by concentrating on the top two graphs of Fig. 1. Notice that when $m$ and $n$ are large, it is the panel-panel based algorithm that attains high performance quickly. This is not surprising, since the implementation makes repeated calls to an algorithm that is specifically designed for small $k$. The matrix-panel based algorithms suffer from the fact that for small $k$ they incur severe load imbalance. The dot and axpy based algorithms incur considerable overhead, which is particularly obvious when $k$ is small, and which also hampers the asymptotic behavior of these algorithms.

In the middle two graphs of Fig. 1 we show that when $m$ and $k$ are large, a matrix-panel algorithm is most appropriate, at least when $n$ is small. However, since the local matrix-matrix multiply does not perform as well in the matrix-panel case, eventually the panel-panel based algorithm outperforms the others.

In the bottom two graphs of Fig. 1, we show that when only one dimension is large it becomes advantageous to consider the dot or axpy based algorithms.

[Figure 1 appears here: six graphs of MFLOPS/node versus matrix dimension, comparing the panpan, matpan, panmat, dot, axpy row, and axpy col algorithms; the left column shows model predictions and the right column shows measured Cray T3E data.]

Figure 1: Performance of the various algorithms on a 16 processor configuration of the Cray T3E. Within each row of graphs, we compare predicted performance (left) to observed performance (right). In the top row of graphs, $m$ and $n$ are fixed to be large, and $k$ is varied. In the middle row, $m$ and $k$ are fixed to be large, and $n$ is varied. Finally, in the bottom row, $n$ is fixed to be large, and $m$ and $k$ are simultaneously varied.

5 Conclusions and Future Work

We have presented a flexible class of parallel matrix multiplication algorithms. Theoretical and experimental results show that by choosing the appropriate algorithm from this class, high performance can be attained across a range of matrix shapes.

The analysis provides a means for choosing between the algorithms based on the shape of the operands. Furthermore, the developed techniques can easily be used to parallelize other cases of matrix-matrix multiplication ($C = AB^T$, $C = A^T B$, and $C = A^T B^T$) and other matrix-matrix operations, such as those included in the Level-3 BLAS [9]. We intend to pursue this line of research in the near future.
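A hybrid code can therefore dispatch on the operand shapes along the lines sketched below; the thresholds and names here are illustrative only and are not PLAPACK's actual selection logic.

\begin{verbatim}
/* Illustrative shape-based selection among the algorithm classes of
   Section 3.  "small" is a caller-supplied cutoff on the order of the
   algorithmic block size. */
typedef enum { PANEL_PANEL, MATRIX_PANEL, PANEL_MATRIX,
               COLUMN_AXPY, ROW_AXPY, DOT_BASED } mm_algorithm;

mm_algorithm choose_algorithm(int m, int n, int k, int small)
{
    int m_small = (m <= small), n_small = (n <= small), k_small = (k <= small);

    if (!m_small && !n_small) return PANEL_PANEL;   /* m, n large: move A and B */
    if (!m_small && !k_small) return MATRIX_PANEL;  /* m, k large: move C and B */
    if (!n_small && !k_small) return PANEL_MATRIX;  /* n, k large: move C and A */
    if (!m_small)             return COLUMN_AXPY;   /* only m large             */
    if (!n_small)             return ROW_AXPY;      /* only n large             */
    return DOT_BASED;                               /* only k large, or all small */
}
\end{verbatim}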


Acknowledgments. We are indebted to the rest of the PLAPACK team (Philip Alpatov, Greg Baker, Carter Edwards, James Overfelt, and Yuan-Jye Jason Wu) for providing the infrastructure that made this research possible.

References

[1] Agarwal, R. C., F. Gustavson, and M. Zubair, “A high-performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication,” IBM J. of Research and Development, 38(6), 1994.

[2] Agarwal, R. C., S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, “A 3-Dimensional Approach to Parallel Matrix Multiplication,” IBM J. of Research and Development, 39(5), pp. 1–8, 1995.

[3] Barnett, M., S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and J. Watts, “Interprocessor Collective Communication Library (InterCom),” Scalable High Performance Computing Conf., 1994.

[4] Cannon, L. E., A Cellular Computer to Implement the Kalman Filter Algorithm, Ph.D. Thesis (1969), Montana State University.

[5] Chtchelkanova, A., J. Gunnels, G. Morrow, J. Overfelt, and R. van de Geijn, “Parallel Implementation of BLAS: General Techniques for Level 3 BLAS,” TR-95-40, Dept. of Computer Sciences, UT-Austin, Oct. 1995.

[6] Choi, J., J. J. Dongarra, R. Pozo, and D. W. Walker, “ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers,” Fourth Symposium on the Frontiers of Massively Parallel Computation, IEEE Comput. Soc. Press, 1992, pp. 120–127.

[7] Choi, J., J. J. Dongarra, and D. W. Walker, “Level 3 BLAS for distributed memory concurrent computers,” CNRS-NSF Workshop on Environments and Tools for Parallel Scientific Computing, Sept. 7–8, 1992, Elsevier Science Publishers, 1992.

[8] Choi, J., J. J. Dongarra, and D. W. Walker, “PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers,” Concurrency: Practice and Experience, 6(7), 543–570, 1994.

[9] Dongarra, J. J., J. Du Croz, S. Hammarling, and I. Duff, “A Set of Level 3 Basic Linear Algebra Subprograms,” TOMS, 16 (1) pp. 1–16, 1990.

[10] Edwards, C., P. Geng, A. Patra, and R. van de Geijn, Parallel matrix distributions: have we been doing it all wrong?, Tech. Report TR-95-40, Dept of Computer Sciences, UT-Austin, 1995.

[11] Fox, G., M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, Vol. 1, Prentice Hall, Englewood Cliffs, N.J., 1988.

[12] Fox, G., S. Otto, and A. Hey, “Matrix algorithms on a hyper- cube I: matrix multiplication,” Parallel Computing 3 (1987), pp 17-31.

[13] Huss-Lederman, S., E. Jacobson, A. Tsao, G. Zhang, “Ma- trix Multiplication on the Intel Touchstone DELTA,” Concur- rency: Practice and Experience, 6 (7) Oct. 1994, pp. 571-594.

[14] Lin, C., and L. Snyder, “A Matrix Product Algorithm and its Comparative Performance on Hypercubes,” in Scalable High Performance Computing Conference, IEEE Press, 1992, pp. 190–3.

[15] Li, J., A. Skjellum, and R. D. Falgout, “A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies,” Concurrency: Practice and Experience, 9(5), 345–389, May 1997.

[16] van de Geijn, R., Using PLAPACK: Parallel Linear Algebra Package, The MIT Press, 1997.

[17] van de Geijn, R. and J. Watts, “SUMMA: Scalable Univer- sal Matrix Multiplication Algorithm,” Concurrency: Practice and Experience, 9(4) 255–274, April 1997.