
A New Parallel Matrix Multiplication Algorithm
on Distributed-Memory Concurrent Computers

Jaeyoung Choi
School of Computing
Soongsil University
1-1, Sangdo-Dong, Dongjak-Ku
Seoul 156-743, KOREA

Abstract

We present a new fast and scalable matrix multiplication algorithm, called DIMMA (Distribution-Independent Matrix Multiplication Algorithm), for block cyclic data distribution on distributed-memory concurrent computers. The algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap computation and communication effectively, and it exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine in each processor even when the block size is very small as well as very large. The algorithm is implemented and compared with SUMMA on the Intel Paragon.

1. Introduction

A number of algorithms are currently available for multiplying two matrices A and B to yield the product matrix C = A · B on distributed-memory concurrent computers [12, 16]. Two classic algorithms are Cannon's algorithm [4] and Fox's algorithm [11]. They are based on a P × P square processor grid with a block data distribution in which each processor holds a large consecutive block of data.

Two efforts to implement Fox's algorithm on general 2-D grids have been made: Choi, Dongarra and Walker developed `PUMMA' [7] for block cyclic data decompositions, and Huss-Lederman, Jacobson, Tsao and Zhang developed `BiMMeR' [15] for the virtual 2-D torus wrap data layout. The differences in these data layouts result in different algorithms. These two algorithms have been compared on the Intel Touchstone Delta [14].

Recent efforts to implement numerical algorithms for dense matrices on distributed-memory concurrent computers are based on a block cyclic data distribution [6], in which an M × N matrix A consists of m_b × n_b blocks of data, and the blocks are distributed by wrapping around both row and column directions on an arbitrary P × Q processor grid. The distribution can reproduce most data distributions used in linear algebra computations. For details, see Section 2.2. We limit the distribution of the data matrices to the block cyclic data distribution.

PUMMA requires a minimum number of communications and computations. It consists of only Q - 1 shifts for A, LCM(P, Q) broadcasts for B, and LCM(P, Q) local multiplications, where LCM(P, Q) is the least common multiple of P and Q. It multiplies the largest possible matrices of A and B at each computation step, so that performance of the routine depends very weakly on the block size of the matrix. However, PUMMA makes it difficult to overlap computation with communication since it always deals with the largest possible matrices for both computation and communication, and it requires large memory space to store them temporarily, which makes it impractical in real applications.

Agrawal, Gustavson and Zubair [1] proposed another matrix multiplication algorithm that efficiently overlaps computation with communication on the Intel iPSC/860 and Delta systems. Van de Geijn and Watts [18] independently developed the same algorithm on the Intel Paragon and called it SUMMA. Also independently, PBLAS [5], which is a major building block of ScaLAPACK [3], uses the same scheme in implementing the matrix multiplication routine, PDGEMM.

In this paper, we present a new fast and scalable matrix multiplication algorithm, called DIMMA (Distribution-Independent Matrix Multiplication Algorithm), for block cyclic data distribution on distributed-memory concurrent computers. The algorithm incorporates SUMMA with two new ideas. It uses a `modified pipelined communication scheme', which makes the algorithm the most efficient by overlapping computation and communication effectively. It also exploits `the LCM concept', which maintains the maximum performance of the sequential BLAS routine, DGEMM, in each processor, even when the block size is very small as well as very large. The details of the LCM concept are explained in Section 2.2.

DIMMA and SUMMA are implemented and compared on the Intel Paragon computer. The parallel matrix multiplication requires O(N^3) flops and O(N^2) communications, i.e., it is computation intensive. For a large matrix, the performance difference between SUMMA and DIMMA may be marginal and negligible. But for a small matrix of N = 1000 on a 16 × 16 processor grid, the performance difference is approximately 10%.

2. Design Principles

2.1. Level 3 BLAS

Current advanced architecture computers possess hierarchical memories in which access to data in the upper levels of the memory hierarchy (registers, cache, and/or local memory) is faster than access to data in the lower levels (shared or off-processor memory). One technique to exploit the power of such machines more efficiently is to develop algorithms that maximize reuse of data held in the upper levels. This can be done by partitioning the matrix or matrices into blocks and by performing the computation with matrix-matrix operations on the blocks. The Level 3 BLAS [9] perform a number of commonly used matrix-matrix operations, and are available in optimized form on most computing platforms ranging from workstations up to supercomputers.

The Level 3 BLAS have been successfully used as the building blocks of a number of applications, including LAPACK [2], a software library that uses block-partitioned algorithms for performing dense linear algebra computations on vector and shared memory computers. On shared memory machines, block-partitioned algorithms reduce the number of times that data must be fetched from shared memory, while on distributed-memory machines, they reduce the number of messages required to get the data from other processors. Thus, there has been much interest in developing versions of the Level 3 BLAS for distributed-memory concurrent computers [5, 8, 10].

The most important routine in the Level 3 BLAS is DGEMM, which performs matrix-matrix multiplication. The general purpose routine performs the following operation:

    C ← α · op(A) · op(B) + β · C,

where op(X) = X, X^T, or X^H, and "·" denotes matrix-matrix multiplication. A, B and C are matrices, and α and β are scalars. This paper focuses on the design and implementation of the non-transposed matrix multiplication routine, C ← α · A · B + β · C, but the idea can be easily extended to the transposed multiplication routines, C ← α · A^T · B + β · C and C ← α · A · B^T + β · C.
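To make the operation concrete, here is a minimal NumPy sketch of what DGEMM computes; it is a reference evaluation only, and the function name, the trans_a/trans_b flags, and the random test data are ours, not part of the BLAS interface.

    import numpy as np

    def gemm(alpha, A, B, beta, C, trans_a="N", trans_b="N"):
        # Reference (non-BLAS) evaluation of C <- alpha*op(A)*op(B) + beta*C.
        op_a = A.T if trans_a == "T" else A.conj().T if trans_a == "H" else A
        op_b = B.T if trans_b == "T" else B.conj().T if trans_b == "H" else B
        return alpha * (op_a @ op_b) + beta * C

    # Example: the non-transposed case C <- alpha*A*B + beta*C.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))
    B = rng.standard_normal((3, 5))
    C = rng.standard_normal((4, 5))
    C = gemm(2.0, A, B, 0.5, C)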

2.2. Block Cyclic Data Distribution

For performing the matrix multiplication C = A · B, we assume that A, B and C are M × K, K × N, and M × N, respectively. The distributed routine also requires a condition on the block size to ensure compatibility. That is, if the block size of A is m_b × k_b, then that of B and C must be k_b × n_b and m_b × n_b, respectively. So the numbers of blocks of the matrices A, B, and C are M_g × K_g, K_g × N_g, and M_g × N_g, respectively, where M_g = ⌈M/m_b⌉, N_g = ⌈N/n_b⌉, and K_g = ⌈K/k_b⌉.

Figure 1: Block cyclic data distribution, shown from (a) the matrix point-of-view and (b) the processor point-of-view. A matrix with 12 × 12 blocks is distributed over a 2 × 3 processor grid. (a) The shaded and unshaded areas represent different grids. (b) It is easier to see the distribution from the processor point-of-view to implement algorithms. Each processor has 6 × 4 blocks.

The way in which a matrix is distributed over the processors has a major impact on the load balance and communication characteristics of the concurrent algorithm, and hence largely determines its performance and scalability. The block cyclic distribution provides a simple, general-purpose way of distributing a block-partitioned matrix on distributed-memory concurrent computers.

Figure 1(a) shows an example of the block cyclic data distribution, where a matrix with 12 × 12 blocks is distributed over a 2 × 3 grid. The numbered squares represent blocks of elements, and the number indicates the location in the processor grid; all blocks labeled with the same number are stored in the same processor. The slanted numbers, on the left and on the top of the matrix, represent indices of a row of blocks and of a column of blocks, respectively. Figure 1(b) reflects the distribution from a processor point-of-view, where each processor has 6 × 4 blocks.
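To make the mapping concrete, the short Python sketch below (ours, not part of any library) computes which processor owns block (i, j) and where that block lands locally under the block cyclic distribution on a P × Q grid; the function names are illustrative.

    def owner(i, j, P, Q):
        # Block (i, j) is stored on processor (i mod P, j mod Q).
        return (i % P, j % Q)

    def local_index(i, j, P, Q):
        # Within its processor, the block is the (i div P, j div Q)-th local block.
        return (i // P, j // Q)

    # The 12 x 12 blocks of Figure 1 on a 2 x 3 grid: processor (0, 0) holds 6 x 4 blocks.
    P, Q = 2, 3
    blocks = [(i, j) for i in range(12) for j in range(12) if owner(i, j, P, Q) == (0, 0)]
    assert len(blocks) == 6 * 4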

Denoting the least common multiple of P and Q by LCM, we refer to a square of LCM × LCM blocks as an LCM block. Thus, the matrix in Figure 1 may be viewed as a 2 × 2 array of LCM blocks. Blocks belong to the same processor if their relative locations are the same in each LCM block. A parallel algorithm in which the order of execution can be intermixed, such as matrix multiplication and matrix transposition, may be developed for the first LCM block. Then it can be directly applied to the other LCM blocks, which have the same structure and the same data distribution as the first LCM block; that is, when an operation is executed on the first LCM block, the same operation can be done simultaneously on the other LCM blocks. The LCM concept has been applied to design software libraries for dense linear algebra computations with algorithmic blocking [17, 19].
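The defining property of an LCM block can be checked directly: shifting a block's coordinates by LCM in both directions never changes its owner. A minimal check (ours), assuming Python 3.9+ for math.lcm:

    from math import lcm

    P, Q = 2, 3
    LCM = lcm(P, Q)                      # 6 for the grid of Figure 1
    owner = lambda i, j: (i % P, j % Q)  # block cyclic ownership

    # Blocks with the same relative position inside each LCM x LCM block
    # belong to the same processor.
    for i in range(12):
        for j in range(12):
            assert owner(i, j) == owner(i + LCM, j + LCM)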

Figure 2: A snapshot of SUMMA. The darkest blocks are broadcast first, and the lightest blocks are broadcast later.

3. Algorithms

3.1. SUMMA

SUMMA is basically a sequence of rank-k_b updates. In SUMMA, A and B are divided into several columns and rows of blocks, respectively, whose block sizes are k_b. Processors multiply the first column of blocks of A with the first row of blocks of B. Then processors multiply the next column of blocks of A and the next row of blocks of B successively.
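Viewed serially, this is simply a sweep of rank-k_b updates over the K dimension. The NumPy sketch below (ours; it illustrates only the arithmetic and performs no communication) accumulates C from successive column panels of A and row panels of B.

    import numpy as np

    def rank_kb_sweep(A, B, kb):
        # C is accumulated as a sequence of rank-kb updates:
        # C += A(:, panel) * B(panel, :) for successive panels of width kb.
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N))
        for start in range(0, K, kb):
            stop = min(start + kb, K)
            C += A[:, start:stop] @ B[start:stop, :]
        return C

    rng = np.random.default_rng(1)
    A = rng.standard_normal((60, 40))
    B = rng.standard_normal((40, 50))
    assert np.allclose(rank_kb_sweep(A, B, kb=8), A @ B)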

As the snapshot of Figure 2 shows, the first column of processors, P_0 and P_3, begins broadcasting the first column of blocks of A, A(:, 0), along each row of processors (here we use MATLAB notation to simply represent a portion of a matrix). At the same time, the first row of processors, P_0, P_1, and P_2, broadcasts the first row of blocks of B, B(0, :), along each column of processors. After the local multiplication, the second column of processors, P_1 and P_4, broadcasts A(:, 1) rowwise, and the second row of processors, P_3, P_4, and P_5, broadcasts B(1, :) columnwise. This procedure continues until the last column of blocks of A and the last row of blocks of B have been processed.

Agrawal, Gustavson and Zubair [1], and van de Geijn and Watts [18], obtained high efficiency on the Intel Delta and Paragon, respectively, by exploiting a pipelined communication scheme, where broadcasting is implemented as passing a column or row of blocks around the logical ring that forms the processor row or column.
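As a sketch of what passing a panel around the logical ring means, the illustrative function below (ours) lists the point-to-point hops that realize a broadcast from a given source on a ring of p processors; p - 1 forwards complete the broadcast.

    def ring_broadcast_hops(source, p):
        # Returns the (sender, receiver) pairs, in order, for a pipelined
        # broadcast that passes the data around the logical ring.
        hops = []
        sender = source
        for _ in range(p - 1):
            receiver = (sender + 1) % p
            hops.append((sender, receiver))
            sender = receiver
        return hops

    # Broadcast from processor 1 on a ring of 4 processors:
    # [(1, 2), (2, 3), (3, 0)]
    print(ring_broadcast_hops(1, 4))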

3.2. DIMMA

We show a simple simulation in Figure 3. It is assumed that there are 4 processors, each has 2 sets of data to broadcast, and they use blocking send and non-blocking receive. In the figure, the time to send a data set is assumed to be 0.2 seconds, and the time for a local computation is 0.6 seconds. Then the pipelined broadcasting scheme takes 8.2 seconds, as in Figure 3(a).

Figure 3: Communication characteristics of (a) SUMMA and (b) DIMMA. The timelines show, for each processor, the time to send a message and the time for a local computation. It is assumed that blocking send and non-blocking receive are used.

A careful investigation of the pipelined communication shows there is an extra waiting time between two communication procedures. If the first processor broadcasts everything it contains to the other processors before the next processor starts to broadcast its data, it is possible to eliminate the unnecessary waiting time. The modified communication scheme in Figure 3(b) takes 7.4 seconds. That is, the new communication scheme saves 4 communication times (8.2 - 7.4 = 0.8 = 4 × 0.2). Figures 4 and 5 show a Paragraph visualization [13] of SUMMA and DIMMA on the Intel Paragon computer, respectively. Paragraph is a parallel programming tool that graphically displays the execution of a distributed-memory program. These figures include spacetime diagrams, which show the communication pattern between the processes, and utilization Gantt charts, which show when each process is busy or idle. The dark gray color signifies idle time for a given process, and the light gray color signals busy time. DIMMA is more efficient in communication than SUMMA, as shown in these figures. The detailed analysis of the algorithms is presented in Section 4.

Figure 4: Paragraph visualization of SUMMA.

Figure 5: Paragraph visualization of DIMMA.

With this modified communication scheme, DIMMA is implemented as follows. After the first procedure, that is, broadcasting and multiplying A(:, 0) and B(0, :), the first column of processors, P_0 and P_3, broadcasts A(:, 6) along each row of processors, and the first row of processors, P_0, P_1, and P_2, sends B(6, :) along each column of processors, as shown in Figure 6. The value 6 appears since the LCM of P = 2 and Q = 3 is 6.

For the third and fourth procedures, the first column of processors, P_0 and P_3, broadcasts rowwise A(:, 3) and A(:, 9), and the second row of processors, P_3, P_4, and P_5, broadcasts columnwise B(3, :) and B(9, :), respectively. After the first column of processors, P_0 and P_3, has broadcast all of its columns of blocks of A along each row of processors, the second column of processors, P_1 and P_4, broadcasts its columns of A.

The basic computation of SUMMA and DIMMA in each processor is a sequence of rank-k_b updates of the matrix. The value of k_b should be at least 20. Let k_opt be the optimal block size for the computation; then k_opt = 20 optimizes the performance of the sequential BLAS routine, DGEMM, on the Intel Paragon, which corresponds to about 44 Mflops on a single node. The vectors of blocks to be multiplied should be conglomerated to form larger matrices to optimize performance if k_b is small.

DIMMA is modified with the LCM concept. The basic idea of the LCM concept is to handle simultaneously several thin columns of blocks of A, and the same number of thin rows of blocks of B, so that each processor multiplies several thin matrices of A and B simultaneously in order to obtain the maximum performance of the machine. Instead of broadcasting a single column of A and a single row of B, a column of processors broadcasts several (M_X = ⌈k_opt/k_b⌉) columns of blocks of A along each row of processors, whose distance is LCM blocks in the column direction. At the same time, a row of processors broadcasts the same number of blocks of B along each column of processors, whose distance is LCM blocks in the row direction, as shown in Figure 7. Then each processor executes its own multiplication. The multiplication operation is changed from `a sequence (= K_g) of rank-k_b updates' to `a sequence (= ⌈K_g/M_X⌉) of rank-(k_b × M_X) updates' to maximize the performance.
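The LCM modification is purely a regrouping of the arithmetic: multiplying M_X thin panels at once is numerically identical to M_X separate rank-k_b updates, but it calls DGEMM with a wider inner dimension. The NumPy sketch below (ours; serial, with the gathered panels simply concatenated) checks that equivalence on random data.

    import numpy as np

    rng = np.random.default_rng(2)
    M, K, N, kb = 30, 40, 20, 5
    A = rng.standard_normal((M, K))
    B = rng.standard_normal((K, N))

    # Two thin rank-kb updates, done one panel at a time ...
    cols0 = np.arange(0, kb)        # first thin panel of the group
    cols1 = np.arange(20, 20 + kb)  # another panel further along A (LCM blocks away in DIMMA)
    C_thin = A[:, cols0] @ B[cols0, :] + A[:, cols1] @ B[cols1, :]

    # ... equal one rank-(kb*M_X) update on the gathered panels T_A and T_B.
    cols = np.concatenate([cols0, cols1])
    T_A, T_B = A[:, cols], B[cols, :]
    assert np.allclose(C_thin, T_A @ T_B)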

Figure 6: Snapshot of a simple version of DIMMA. The darkest blocks are broadcast first.

Figure 7: A snapshot of DIMMA.

For example, if P = 2, Q = 3, k_b = 10 and k_opt = 20, the processors deal with 2 columns of blocks of A and 2 rows of blocks of B at a time (M_X = ⌈k_opt/k_b⌉ = 2). The first column of processors, P_0 and P_3, copies the two columns A(:, [0, 6]), that is, A(:, 0) and A(:, 6), to T_A, and broadcasts them along each row of processors. The first row of processors, P_0, P_1 and P_2, copies the two rows B([0, 6], :), that is, B(0, :) and B(6, :), to T_B and broadcasts them along each column of processors. Then all processors multiply T_A with T_B to produce C. Next, the second column of processors, P_1 and P_4, copies the next two columns A(:, [1, 7]) to T_A and broadcasts them again rowwise, and the second row of processors, P_3, P_4 and P_5, copies the next two rows B([1, 7], :) to T_B and broadcasts them columnwise. The product of T_A and T_B is added to C in each processor.

The value of M_X can be determined by the block size, available memory space, and machine characteristics such as processor performance and communication speed. If it is assumed that k_opt = 20, the value of M_X should be 4 if the block size is 5, and the value of M_X should be 2 if the block size is 10.

If k_b is much larger than the optimal value (for example, k_b = 100), it may be difficult to obtain good performance since it is difficult to overlap the communication with the computation. In addition, the multiplication routine requires a large amount of memory to send and receive A and B. It is possible to divide k_b into smaller pieces. For example, if k_b = 100, processors divide a column of blocks of A into five thin columns of blocks, and divide a row of blocks of B into five thin rows of blocks. Then they multiply each thin column of blocks of A with the corresponding thin row of blocks of B successively. The two cases, in which k_b is smaller and larger than k_opt, are combined, and the pseudocode of DIMMA is shown in Figure 8.

4. Analysis of Multiplication Algorithms

We analyze the elapsed time of SUMMA and DIMMA based on Figure 3. It is assumed that k_b = k_opt throughout the computation. Then, for the multiplication C ← C + A · B, where C is M × N, A is M × K, and B is K × N, there are K_g = ⌈K/k_b⌉ columns of blocks of A and K_g rows of blocks of B.

    C(:, :) = 0
    M_X = ⌈k_opt / k_b⌉
    DO L1 = 0, Q - 1
       DO L2 = 0, LCM/Q - 1
          L_X = LCM × M_X
          DO L3 = 0, ⌈K_g / L_X⌉ - 1
             DO L4 = 0, ⌈k_b / k_opt⌉ - 1
                L_m = L1 + L2 × Q + L3 × L_X + [L4] : LCM : (L3 + 1) × L_X - 1
                [Copy A(:, L_m) to T_A and broadcast it along each row of processors]
                [Copy B(L_m, :) to T_B and broadcast it along each column of processors]
                C(:, :) = C(:, :) + T_A × T_B
             END DO
          END DO
       END DO
    END DO

Figure 8: The pseudocode of DIMMA. The DO loop over L3 is used if k_b is smaller than k_opt, where the routine handles M_X columns of blocks of A and M_X rows of blocks of B, whose block distance is LCM, simultaneously; L_m is used to select them correctly. The innermost DO loop over L4 is used if k_b is larger than k_opt, and the bracket in [L4] represents the L4-th thin vector.
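The loop structure of Figure 8 can be transcribed into Python to see which block columns of A (and block rows of B) are gathered at each step. The sketch below is ours: it only enumerates the index groups, omits the broadcasts and the local multiplication, clips the index range at K_g (a detail the pseudocode leaves implicit), and assumes Python 3.9+ for math.lcm.

    from math import ceil, lcm

    def dimma_groups(P, Q, K_g, k_b, k_opt):
        # Enumerate the groups of block-column indices selected by the
        # Figure 8 loop nest (L1, L2, L3, L4).
        LCM = lcm(P, Q)
        M_X = ceil(k_opt / k_b)
        L_X = LCM * M_X
        groups = []
        for L1 in range(Q):
            for L2 in range(LCM // Q):
                for L3 in range(ceil(K_g / L_X)):
                    for L4 in range(ceil(k_b / k_opt)):
                        start = L1 + L2 * Q + L3 * L_X
                        L_m = [l for l in range(start, (L3 + 1) * L_X, LCM) if l < K_g]
                        # [L4] selects the L4-th thin slice of each block when k_b > k_opt.
                        groups.append((L_m, L4))
        return groups

    # The parameters of the Figure 7 example: P = 2, Q = 3, K_g = 12, k_b = 10, k_opt = 20.
    for cols, thin in dimma_groups(2, 3, 12, 10, 20):
        print(cols)     # [0, 6], [3, 9], [1, 7], [4, 10], [2, 8], [5, 11]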

At first, it is assumed that there are P linearly connected processors, in which a column of blocks of A (= T_A) is broadcast along the P processors at each step and a row of blocks of B (= T_B) always stays in each processor. It is also assumed that the time for sending a column T_A to the next processor is t_c, and the time for multiplying T_A with T_B and adding the product to C is t_p. Actually, t_c = α + M k_b β and t_p = 2 M (N/P) k_b γ, where α is a communication start-up time, β is a data transfer time, and γ is a time for a multiplication or addition.

For SUMMA, the time difference between two successive pipelined broadcasts of T_A is 2 t_c + t_p. The total elapsed time of SUMMA with K_g columns of blocks on a 1-dimensional processor grid, t_summa^1D, is

    t_summa^1D = K_g (2 t_c + t_p) - t_c + (P - 2) t_c = K_g (2 t_c + t_p) + (P - 3) t_c.

For DIMMA, the time difference between two pipelined broadcasts is t_c + t_p if the T_A's are broadcast from the same processor. However, the time difference is 2 t_c + t_p if they come from different processors. The total elapsed time of DIMMA, t_dimma^1D, is

    t_dimma^1D = K_g (t_c + t_p) + (P - 1) t_c + (P - 2) t_c = K_g (t_c + t_p) + (2P - 3) t_c.
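Plugging the parameters of the Figure 3 simulation into these two expressions recovers the elapsed times quoted there: K_g = 8 broadcasts in total (4 processors with 2 data sets each), t_c = 0.2 seconds and t_p = 0.6 seconds.

    P, K_g = 4, 8          # 4 processors, each with 2 data sets to broadcast
    t_c, t_p = 0.2, 0.6    # send time and local-computation time from Figure 3

    t_summa_1d = K_g * (2 * t_c + t_p) + (P - 3) * t_c
    t_dimma_1d = K_g * (t_c + t_p) + (2 * P - 3) * t_c

    print(round(t_summa_1d, 1))    # 8.2 seconds, as in Figure 3(a)
    print(round(t_dimma_1d, 1))    # 7.4 seconds, as in Figure 3(b)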

On a 2-dimensional P × Q processor grid, the communication time of SUMMA is doubled in order to broadcast T_B as well as T_A. Assume again that the times for sending a column T_A and a row T_B to the next processor are t_ca and t_cb, respectively, and the time for multiplying T_A with T_B and adding the product to C is t_p. Actually, t_ca = α + (M/P) k_b β, t_cb = α + (N/Q) k_b β, and t_p = 2 (M/P) (N/Q) k_b γ. So,

    t_summa^2D = K_g (2 t_ca + 2 t_cb + t_p) + (Q - 3) t_ca + (P - 3) t_cb.    (1)

For DIMMA, each column of processors broadcasts T_A until everything it holds is sent. Meanwhile, rows of processors broadcast T_B if they have the T_B corresponding to the T_A. For a column of processors which currently broadcasts A, P/GCD rows of processors, whose distance is GCD, have rows of blocks of B to broadcast along with the T_A, where GCD is the greatest common divisor of P and Q. The extra idle wait, caused by broadcasting two T_B's when they are in different processors, is GCD × t_cb. Then the total extra waiting time to broadcast the T_B's is Q × (P/GCD) × GCD × t_cb = P Q t_cb.

However, if GCD = P, only one row of processors has T_B to broadcast corresponding to the column of processors, and the total extra waiting time is P × t_cb. So,

    t_dimma^2D = K_g (t_ca + t_cb + t_p) + (2Q - 3) t_ca + (P + Q - 3) t_cb     if GCD = P,
               = K_g (t_ca + t_cb + t_p) + (2Q - 3) t_ca + (PQ + P - 3) t_cb    otherwise.    (2)

The time difference between SUMMA and DIMMA is

    t_summa^2D - t_dimma^2D = (K_g - Q) t_ca + (K_g - P) t_cb      if GCD = P,
                            = (K_g - Q) t_ca + (K_g - PQ) t_cb     otherwise.    (3)

5. Implementation and Results

We implemented three algorithms, called SUMMA0, SUMMA and DIMMA, and compared their performance on the 512-node Intel Paragon at the Oak Ridge National Laboratory, Oak Ridge, U.S.A., and the 256-node Intel Paragon at the Samsung Advanced Institute of Technology, Suwon, Korea. SUMMA0 is the original version of SUMMA, which has the pipelined broadcasting scheme and the fixed block size, k_b. The local matrix multiplication in SUMMA0 is the rank-k_b update. SUMMA is a revised version of SUMMA0 with the LCM block concept for the optimized performance of DGEMM, so that the local matrix multiplication is a rank-k_approx update, where k_approx is computed in the implementation as follows:

    k_approx = ⌊k_opt / k_b⌋ × k_b       if k_opt ≥ k_b,
             = ⌊k_b / ⌈k_b / k_opt⌉⌋     otherwise.
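A small Python transcription of this rule (ours) makes the two branches explicit; with k_opt = 20 it gives k_approx = 20 for k_b = 1, 5, 20 or 100, and k_approx = 16 for k_b = 50.

    from math import ceil, floor

    def k_approx(k_b, k_opt=20):
        # Rank of the local update used with the LCM block concept.
        if k_opt >= k_b:
            return floor(k_opt / k_b) * k_b    # group several thin panels of width k_b
        return floor(k_b / ceil(k_b / k_opt))  # split a wide panel into near-equal pieces

    print([k_approx(k_b) for k_b in (1, 5, 20, 50, 100)])   # [20, 20, 20, 16, 20]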

    P × Q     Matrix Size     Block Size    SUMMA0    SUMMA    DIMMA
    ----------------------------------------------------------------
    8 × 8     2000 × 2000     1 × 1          1.135     2.678    2.735
                              5 × 5          2.488     2.730    2.735
                              20 × 20        2.505     2.504    2.553
                              50 × 50        2.633     2.698    2.733
                              100 × 100      1.444     1.945    1.948
    8 × 8     4000 × 4000     1 × 1          1.296     2.801    2.842
                              5 × 5          2.614     2.801    2.842
                              20 × 20        2.801     2.801    2.842
                              50 × 50        2.674     2.822    2.844
                              100 × 100      2.556     2.833    2.842
    12 × 8    4000 × 4000     1 × 1          1.842     3.660    3.731
                              5 × 5          3.280     3.836    3.917
                              20 × 20        3.928     3.931    4.006
                              50 × 50        3.536     3.887    3.897
                              100 × 100      2.833     3.430    3.435

Table 1: Dependence of performance on block size (Unit: Gflops)

First of all, we changed the block size, k_b, and observed how the block size affects the performance of the algorithms. Table 1 shows the performance for A = B = C = 2000 × 2000 and 4000 × 4000 on 8 × 8 and 12 × 8 processor grids with block sizes k_b = 1, 5, 20, 50, and 100.

At first, SUMMA0 and SUMMA are compared. In the extreme case of k_b = 1, SUMMA with the modified blocking scheme performed at least 100% better than SUMMA0. When k_b = 5, SUMMA shows 7 - 10% enhanced performance. If the block size is much larger than the optimal block size, that is, k_b = 50 or 100, SUMMA0 becomes inefficient again and it has difficulty in overlapping the communications with the computations. SUMMA outperformed SUMMA0 by about 5 - 10% when A = B = C = 4000 × 4000 and k_b = 50 or 100 on the 8 × 8 and 12 × 8 processor grids.

Note that on the 8 × 8 processor grid with 2000 × 2000 matrices, the performance for k_b = 20 or 100 is much slower than that of the other cases. When k_b = 100, the processors in the top half have 300 rows of the matrices, while those in the bottom half have just 200 rows. This leads to load imbalance among processors, and the processors in the top half require 50% more local computation.

Now SUMMA and DIMMA are compared. Figures 9 and 10 show the performance of SUMMA and DIMMA on 16 × 16 and 16 × 12 processor grids, respectively, with the fixed block size k_b = k_opt = 20. DIMMA always performs better than SUMMA on the 16 × 16 processor grid. These matrix multiplication algorithms require O(N^3) flops and O(N^2) communications; that is, the algorithms are computation intensive. For a small matrix of N = 1000, the performance difference between the two algorithms is about 10%. But for a large matrix, these algorithms require much more computation, so that the performance difference caused by the different communication schemes becomes negligible. For N = 8000, the performance difference is only about 2 - 3%.

Figure 9: Performance of SUMMA and DIMMA on a 16 × 16 processor grid, k_b = k_opt = 20 (Gflops versus matrix size M).

Figure 10: Performance of SUMMA and DIMMA on a 16 × 12 processor grid (Gflops versus matrix size M).

Figure 11: Predicted performance of SUMMA and DIMMA on a 16 × 16 processor grid (Gflops versus matrix size M).

Figure 12: Predicted performance of SUMMA and DIMMA on a 16 × 12 processor grid (Gflops versus matrix size M).

On the 16 × 12 processor grid, SUMMA performs slightly better than DIMMA for small matrices, such as N = 1000 and 2000. If P = 16 and Q = 12, then GCD = 4 ≠ P. For the problem with M = N = K = 2000 and k_opt = k_b = 20, K_g = K/k_b = 100. From Eq. (3),

    t_summa^2D - t_dimma^2D = (100 - 12) t_ca + (100 - 16 × 12) t_cb = 88 t_ca - 92 t_cb.

From this result, it is expected that SUMMA is faster than DIMMA for this problem if t_ca = t_cb.
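The same arithmetic can be spelled out in a few lines (ours); the snippet evaluates the GCD ≠ P branch of Eq. (3) for this configuration and confirms that the difference is negative, i.e. SUMMA finishes earlier, when t_ca = t_cb.

    from math import gcd

    P, Q, K_g = 16, 12, 100
    assert gcd(P, Q) == 4 and gcd(P, Q) != P        # the GCD != P branch of Eq. (3)

    def summa_minus_dimma(t_ca, t_cb):
        # Eq. (3), GCD != P case: (K_g - Q) t_ca + (K_g - P*Q) t_cb
        return (K_g - Q) * t_ca + (K_g - P * Q) * t_cb

    print(summa_minus_dimma(1.0, 1.0))   # 88 - 92 = -4: SUMMA is faster when t_ca = t_cb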

We predicted the performance on the Intel Paragon using Eqs. (1) and (2). Figures 11 and 12 show the predicted performance of SUMMA and DIMMA corresponding to Figures 9 and 10, respectively. We used α = 94.75 μsec, β = 0.02218 μsec (45 Mbytes/sec), and γ = 22.88 nsec (43.7 Mflops per node) for the predicted performance. These values are observed in practice.
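The bandwidth and per-node flop rate quoted in parentheses follow directly from β and γ; the two-line check below (ours) recovers them, assuming β is a per-byte transfer time, as the 45 Mbytes/sec figure suggests.

    beta = 0.02218e-6    # data transfer time per byte, in seconds
    gamma = 22.88e-9     # time per floating-point operation, in seconds

    print(round(1.0 / beta / 1e6, 1))    # about 45.1 Mbytes/sec
    print(round(1.0 / gamma / 1e6, 1))   # about 43.7 Mflops per node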

In Eq. (2), the idle wait of (2Q - 3) t_ca + (PQ + P - 3) t_cb when GCD ≠ P can be reduced by a slight modification of the communication scheme. For example, when P = 8 and Q = 4 (that is, GCD = Q), if a row of processors sends all of its rows of blocks of B instead of a column of processors sending all of its columns of blocks of A as in Figure 8, the waiting time is reduced to (P + Q - 3) t_ca + (2P - 3) t_cb.

The following example has another communication characteristic. After the first column and the first row of processors send their own A and the corresponding B, respectively, then, for the next step, the second column and the second row of processors send their A and B, respectively. The communication resembles that of SUMMA, but the processors send all corresponding blocks of A and B. The waiting time is (LCM + Q - 3) t_ca + (LCM + P - 3) t_cb. This modification is faster if 2 × GCD < MIN(P, Q).

The performance per node of SUMMA and DIMMA is shown in Figures 13 and 14, respectively, when memory usage per node is held constant. Both algorithms show good performance and scalability, but DIMMA is always better. If each processor has a local problem size of more than 200 × 200, DIMMA always reaches 40 Mflops per processor, whereas SUMMA obtains about 38 Mflops per processor.

Currently, the modified blocking scheme in DIMMA uses the rank-k_approx update. However, it is possible to modify DIMMA with the exact rank-k_opt update by dividing the virtually connected LCM blocks in each processor. The modification complicates the algorithm implementation, and since the performance of DGEMM is not sensitive to the value of k_opt if it is larger than 20, there would be no improvement in performance.

6. Conclusions

We have presented a new parallel matrix multiplication algorithm, called DIMMA, for block cyclic data distribution on distributed-memory concurrent computers. DIMMA is the most efficient and scalable matrix multiplication algorithm. DIMMA uses the modified pipelined broadcasting scheme to overlap computation and communication effectively, and exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine regardless of the block size.

Figure 13: Performance per node of SUMMA where memory use per node is held constant (Mflops versus the square root of the number of nodes). The five curves represent 100 × 100, 200 × 200, 300 × 300, 400 × 400, and 500 × 500 local matrices per node, from the bottom.

Figure 14: Performance per node of DIMMA where memory use per node is held constant (Mflops versus the square root of the number of nodes).

DIMMA always shows the same high performance even when the block size k_b is very small as well as very large, provided the matrices are evenly distributed among processors.

Acknowledgement

The author thanks Dr. Jack Dongarra of the University of Tennessee at Knoxville and Oak Ridge National Laboratory for providing computing facilities to perform this work, and he thanks the anonymous reviewers for their helpful comments, which improved the quality of the paper. This research was performed in part using the Intel Paragon computers at the Oak Ridge National Laboratory, Oak Ridge, U.S.A. and at the Samsung Advanced Institute of Technology, Suwon, Korea. It was supported by the Korean Ministry of Information and Communication under contract 96087-IT1-I2.

7. References

[1] R. C. Agarwal, F. G. Gustavson, and M. Zubair. A High-Performance Matrix-Multiplication Algorithm on a Distributed-Memory Parallel Computer Using Overlapped Communication. IBM Journal of Research and Development, 38(6):673-681, 1994.

[2] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A Portable Linear Algebra Library for High-Performance Computers. In Proceedings of Supercomputing '90, pages 1-10. IEEE Press, 1990.

[3] L. Blackford, J. Choi, A. Cleary, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance. In Proceedings of Supercomputing '96, 1996. http://www.supercomp.org/sc96/proceedings/.

[4] L. E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Thesis, Montana State University, 1969.

[5] J. Choi, J. J. Dongarra, S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley. A Proposal for a Set of Parallel Basic Linear Algebra Subprograms. LAPACK Working Note 100, Technical Report CS-95-292, University of Tennessee, 1995.

[6] J. Choi, J. J. Dongarra, R. Pozo, D. C. Sorensen, and D. W. Walker. CRPC Research into Linear Algebra Software for High Performance Computers. International Journal of Supercomputing Applications, 8(2):99-118, Summer 1994.

[7] J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers. Concurrency: Practice and Experience, 6:543-570, 1994.

[8] J. Choi, J. J. Dongarra, and D. W. Walker. PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subprograms. Concurrency: Practice and Experience, 8:517-535, 1996.

[9] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A Set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, 1990.

[10] R. D. Falgout, A. Skjellum, S. G. Smith, and C. H. Still. The Multicomputer Toolbox Approach to Concurrent BLAS and LACS. In Proceedings of the 1992 Scalable High Performance Computing Conference, pages 121-128. IEEE Press, 1992.

[11] G. C. Fox, S. W. Otto, and A. J. G. Hey. Matrix Algorithms on a Hypercube I: Matrix Multiplication. Parallel Computing, 4:17-31, 1987.

[12] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, second edition, 1989.

[13] M. T. Heath and J. A. Etheridge. Visualizing the Performance of Parallel Programs. IEEE Software, 8(5):29-39, September 1991.

[14] S. Huss-Lederman, E. M. Jacobson, and A. Tsao. Comparison of Scalable Parallel Multiplication Libraries. In The Scalable Parallel Libraries Conference, Starkville, MS, pages 142-149. IEEE Computer Society Press, October 6-8, 1993.

[15] S. Huss-Lederman, E. M. Jacobson, A. Tsao, and G. Zhang. Matrix Multiplication on the Intel Touchstone Delta. Concurrency: Practice and Experience, 6:571-594, 1994.

[16] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA, 1994.

[17] A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. Ph.D. Thesis, University of Tennessee, Knoxville, 1996.

[18] R. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. LAPACK Working Note 99, Technical Report CS-95-286, University of Tennessee, 1995.

[19] R. A. van de Geijn. Using PLAPACK. The MIT Press, Cambridge, 1997.