A New Parallel Matrix Multiplication Algorithm
on Distributed-Memory Concurrent Computers
Jaeyoung Choi
School of Computing
Soongsil University
1-1, Sangdo-Dong, Dongjak-Ku
Seoul 156-743, KOREA
Abstract
We present a new fast and scalable matrix multiplication algorithm, called DIMMA (Distribution-Independent Matrix Multiplication Algorithm), for block cyclic data distribution on distributed-memory concurrent computers. The algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap computation and communication effectively, and it exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine in each processor even when the block size is very small as well as very large. The algorithm is implemented and compared with SUMMA on the Intel Paragon computer.
1. Introduction

A number of algorithms are currently available for multiplying two matrices A and B to yield the product matrix C = A · B on distributed-memory concurrent computers [12, 16]. Two classic algorithms are Cannon's algorithm [4] and Fox's algorithm [11]. They are based on a P × P square processor grid with a block data distribution, in which each processor holds a large consecutive block of data.

Two efforts to implement Fox's algorithm on general 2-D grids have been made: Choi, Dongarra and Walker developed 'PUMMA' [7] for block cyclic data decompositions, and Huss-Lederman, Jacobson, Tsao and Zhang developed 'BiMMeR' [15] for the virtual 2-D torus wrap data layout. The differences in these data layouts result in different algorithms. These two algorithms have been compared on the Intel Touchstone Delta [14].
Recent efforts to implement numerical algorithms for dense matrices on distributed-memory concurrent computers are based on a block cyclic data distribution [6], in which an M × N matrix A consists of m_b × n_b blocks of data, and the blocks are distributed by wrapping around both row and column directions on an arbitrary P × Q processor grid. The distribution can reproduce most data distributions used in linear algebra computations. For details, see Section 2.2. We limit the distribution of data matrices to the block cyclic data distribution.
PUMMA requires a minimum number of communications and computations. It consists of only Q − 1 shifts for A, LCM(P, Q) broadcasts for B, and LCM(P, Q) local multiplications, where LCM(P, Q) is the least common multiple of P and Q. It multiplies the largest possible matrices of A and B at each computation step, so that the performance of the routine depends only weakly on the block size of the matrix. However, PUMMA makes it difficult to overlap computation with communication, since it always deals with the largest possible matrices for both computation and communication, and it requires large memory space to store them temporarily, which makes it impractical in real applications.
Agrawal, Gustavson and Zubair [1] proposed another matrix multiplication algorithm that efficiently overlaps computation with communication on the Intel iPSC/860 and Delta systems. Van de Geijn and Watts [18] independently developed the same algorithm on the Intel Paragon and called it SUMMA. Also independently, PBLAS [5], which is a major building block of ScaLAPACK [3], uses the same scheme in implementing the matrix multiplication routine, PDGEMM.
In this paper, we present a new fast and scalable matrix multiplication algorithm, called DIMMA (Distribution-Independent Matrix Multiplication Algorithm), for block cyclic data distribution on distributed-memory concurrent computers. The algorithm incorporates SUMMA with two new ideas. It uses a 'modified pipelined communication scheme', which makes the algorithm more efficient by overlapping computation and communication effectively. It also exploits the 'LCM concept', which maintains the maximum performance of the sequential BLAS routine, DGEMM, in each processor, even when the block size is very small as well as very large. The details of the LCM concept are explained in Section 2.2.
DIMMA and SUMMA are implemented and compared on the Intel Paragon computer. Parallel matrix multiplication requires O(N^3) flops and O(N^2) communications; i.e., it is computation intensive. For a large matrix, the performance difference between SUMMA and DIMMA may be marginal and negligible. But for a small matrix of N = 1000 on a 16 × 16 processor grid, the performance difference is approximately 10%.
2. Design Principles

2.1. Level 3 BLAS

Current advanced architecture computers possess hierarchical memories in which access to data in the upper levels of the memory hierarchy (registers, cache, and/or local memory) is faster than to data in lower levels (shared or off-processor memory). One technique to exploit the power of such machines more efficiently is to develop algorithms that maximize reuse of data held in the upper levels. This can be done by partitioning the matrix or matrices into blocks and by performing the computation with matrix-matrix operations on the blocks. The Level 3 BLAS [9] perform a number of commonly used matrix-matrix operations, and are available in optimized form on most computing platforms ranging from workstations up to supercomputers.
The Level 3 BLAS have been successfully used as the building blocks of a number of applications, including LAPACK [2], a software library that uses block-partitioned algorithms for performing dense linear algebra computations on vector and shared memory computers. On shared memory machines, block-partitioned algorithms reduce the number of times that data must be fetched from shared memory, while on distributed-memory machines, they reduce the number of messages required to get the data from other processors. Thus, there has been much interest in developing versions of the Level 3 BLAS for distributed-memory concurrent computers [5, 8, 10].
The most important routine in the Level 3 BLAS is DGEMM, for performing matrix-matrix multiplication. The general purpose routine performs the following operation:

    C ← α · op(A) · op(B) + β · C,

where op(X) = X, X^T, or X^H, and '·' denotes matrix-matrix multiplication. A, B and C are matrices, and α and β are scalars. This paper focuses on the design and implementation of the non-transposed matrix multiplication routine, C ← α · A · B + β · C, but the idea can be easily extended to the transposed multiplication routines, C ← α · A · B^T + β · C and C ← α · A^T · B + β · C.
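As a concrete, deliberately unoptimized illustration, the operation above can be sketched in Python for the non-transposed case op(X) = X; the function name and the list-of-lists matrix representation are ours, not part of the BLAS interface:

```python
def gemm(alpha, A, B, beta, C):
    """Compute C <- alpha*A*B + beta*C in place, for dense matrices
    stored as lists of rows (the op(X) = X case of DGEMM)."""
    M, K, N = len(A), len(B), len(B[0])
    for i in range(M):
        for j in range(N):
            acc = sum(A[i][k] * B[k][j] for k in range(K))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

An optimized DGEMM reorders and blocks these loops for the memory hierarchy, which is precisely why the algorithms below feed it the largest blocks they can.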
2.2. Block Cyclic Data Distribution

For performing the matrix multiplication C = A · B, we assume that A, B and C are M × K, K × N, and M × N, respectively. The distributed routine also requires a condition on the block size to ensure compatibility. That is, if the block size of A is m_b × k_b, then that of B and C must be k_b × n_b and m_b × n_b, respectively. So the numbers of blocks of matrices A, B, and C are M_g × K_g, K_g × N_g, and M_g × N_g, respectively, where M_g = ⌈M/m_b⌉, N_g = ⌈N/n_b⌉, and K_g = ⌈K/k_b⌉.

Figure 1: Block cyclic data distribution. A matrix with 12 × 12 blocks is distributed over a 2 × 3 processor grid. (a) matrix point-of-view: the shaded and unshaded areas represent different grids. (b) processor point-of-view: it is easier to see the distribution from this view to implement algorithms. Each processor has 6 × 4 blocks.
The way in which a matrix is distributed over the processors has a major impact on the load balance and communication characteristics of the concurrent algorithm, and hence largely determines its performance and scalability. The block cyclic distribution provides a simple, general-purpose way of distributing a block-partitioned matrix on distributed-memory concurrent computers.

Figure 1(a) shows an example of the block cyclic data distribution, where a matrix with 12 × 12 blocks is distributed over a 2 × 3 grid. The numbered squares represent blocks of elements, and the number indicates the location in the processor grid: all blocks labeled with the same number are stored in the same processor. The slanted numbers, on the left and on the top of the matrix, represent indices of a row of blocks and of a column of blocks, respectively. Figure 1(b) reflects the distribution from a processor point-of-view, where each processor has 6 × 4 blocks.
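The wrapping rule can be verified with a few lines of Python; `owner` is a hypothetical helper name, and the parameters mirror Figure 1 (12 × 12 blocks on a 2 × 3 grid):

```python
def owner(I, J, P, Q):
    """Processor grid coordinates owning global block (I, J) under the
    block cyclic distribution: blocks wrap around in both directions."""
    return (I % P, J % Q)

# Figure 1's setting: a matrix of 12 x 12 blocks on a 2 x 3 processor grid.
P, Q, n_blocks = 2, 3, 12
counts = {}
for I in range(n_blocks):
    for J in range(n_blocks):
        p = owner(I, J, P, Q)
        counts[p] = counts.get(p, 0) + 1

# Each of the 6 processors ends up holding 6 x 4 = 24 blocks,
# the processor point-of-view of Figure 1(b).
```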
Denoting the least common multiple of P and Q by LCM, we refer to a square of LCM × LCM blocks as an LCM block. Thus, the matrix in Figure 1 may be viewed as a 2 × 2 array of LCM blocks. Blocks belong to the same processor if their relative locations are the same in each LCM block. A parallel algorithm in which the order of execution can be intermixed, such as matrix multiplication and matrix transposition, may be developed for the first LCM block. Then it can be applied directly to the other LCM blocks, which have the same structure and the same data distribution as the first LCM block; that is, when an operation is executed on the first LCM block, the same operation can be done simultaneously on the other LCM blocks. The LCM concept has been applied to design software libraries for dense linear algebra computations with algorithmic blocking [17, 19].

Figure 2: A snapshot of SUMMA. The darkest blocks are broadcast first, and the lightest blocks are broadcast later.
3. Algorithms

3.1. SUMMA

SUMMA is basically a sequence of rank-k_b updates. In SUMMA, A and B are divided into several columns and rows of blocks, respectively, whose block sizes are k_b. Processors multiply the first column of blocks of A with the first row of blocks of B. Then processors multiply the next column of blocks of A with the next row of blocks of B, successively.

As the snapshot of Figure 2 shows, the first column of processors, P0 and P3, begins broadcasting the first column of blocks of A, A(:, 0), along each row of processors (here we use MATLAB notation to represent a portion of a matrix). At the same time, the first row of processors, P0, P1, and P2, broadcasts the first row of blocks of B, B(0, :), along each column of processors. After the local multiplication, the second column of processors, P1 and P4, broadcasts A(:, 1) rowwise, and the second row of processors, P3, P4, and P5, broadcasts B(1, :) columnwise. This procedure continues until the last column of blocks of A and the last row of blocks of B.
Agrawal, Gustavson and Zubair [1], and van de Geijn and Watts [18], obtained high efficiency on the Intel Delta and Paragon, respectively, by exploiting a pipelined communication scheme, in which broadcasting is implemented as passing a column or row of blocks around the logical ring that forms the processor row or column.
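The computation just described is, per step, a rank-k_b update with one column of blocks of A and one row of blocks of B. A serial Python sketch of that skeleton (communication omitted; the function name is ours):

```python
def summa_serial(A, B, kb):
    """C = A*B accumulated as successive rank-kb updates, the computational
    skeleton of SUMMA (the broadcasts are omitted in this serial sketch)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k0 in range(0, K, kb):            # one step per column/row of blocks
        hi = min(k0 + kb, K)
        for i in range(M):
            for j in range(N):
                # rank-kb update with A(:, k0:hi) and B(k0:hi, :)
                C[i][j] += sum(A[i][k] * B[k][j] for k in range(k0, hi))
    return C
```

In the parallel algorithm, each step's column and row of blocks are broadcast so every processor can apply the same update to its local portion of C.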
3.2. DIMMA
We show a simple simulation in Figure 3. It is assumed that there are 4 processors, each of which has 2 sets of data to broadcast, and that they use blocking send and non-blocking receive. In the figure, the time to send a data set is assumed to be 0.2 seconds, and the time for a local computation is 0.6 seconds. Then the pipelined broadcasting scheme takes 8.2 seconds, as in Figure 3(a).

Figure 3: Communication characteristics of (a) SUMMA and (b) DIMMA. It is assumed that blocking send and non-blocking receive are used.
A careful investigation of the pipelined communication shows there is an extra waiting time between two communication procedures. If the first processor broadcasts everything it contains to the other processors before the next processor starts to broadcast its data, it is possible to eliminate the unnecessary waiting time. The modified communication scheme in Figure 3(b) takes 7.4 seconds. That is, the new communication scheme saves 4 communication times (8.2 − 7.4 = 0.8 = 4 × 0.2). Figures 4 and 5 show ParaGraph visualizations [13] of SUMMA and DIMMA on the Intel Paragon computer, respectively. ParaGraph is a parallel programming tool that graphically displays the execution of a distributed-memory program. These figures include spacetime diagrams, which show the communication pattern between the processes, and utilization Gantt charts, which show when each process is busy or idle. The dark gray color signifies idle time for a given process, and the light gray color signals busy time. DIMMA is more efficient in communication than SUMMA, as shown in these figures. A detailed analysis of the algorithms is given in Section 4.
Figure 4: ParaGraph visualization of SUMMA.

Figure 5: ParaGraph visualization of DIMMA.

With this modified communication scheme, DIMMA is implemented as follows. After the first procedure, that is, broadcasting and multiplying A(:, 0) and B(0, :), the first column of processors, P0 and P3, broadcasts A(:, 6) along each row of processors, and the first row of processors, P0, P1, and P2, sends B(6, :) along each column of processors, as shown in Figure 6. The value 6 appears since the LCM of P = 2 and Q = 3 is 6.

For the third and fourth procedures, the first column of processors, P0 and P3, broadcasts rowwise A(:, 3) and A(:, 9), and the second row of processors, P3, P4, and P5, broadcasts columnwise B(3, :) and B(9, :), respectively. After the first column of processors, P0 and P3, broadcasts all of its columns of blocks of A along each row of processors, the second column of processors, P1 and P4, broadcasts its columns of A.
The basic computation of SUMMA and DIMMA in each processor is a sequence of rank-k_b updates of the matrix. The value of k_b should be at least 20 (let k_opt be the optimal block size for the computation; then k_opt = 20) to optimize the performance of the sequential BLAS routine, DGEMM, on the Intel Paragon, which corresponds to about 44 Mflops on a single node. The vectors of blocks to be multiplied should be conglomerated to form larger matrices to optimize performance if k_b is small.

DIMMA is modified with the LCM concept. The basic idea of the LCM concept is to handle simultaneously several thin columns of blocks of A, and the same number of thin rows of blocks of B, so that each processor multiplies several thin matrices of A and B simultaneously in order to obtain the maximum performance of the machine. Instead of broadcasting a single column of A and a single row of B, a column of processors broadcasts M_X = ⌈k_opt/k_b⌉ columns of blocks of A along each row of processors, whose distance is LCM blocks in the column direction. At the same time, a row of processors broadcasts the same number of rows of blocks of B along each column of processors, whose distance is LCM blocks in the row direction, as shown in Figure 7. Then each processor executes its own multiplication. The multiplication operation is changed from 'a sequence (= K_g) of rank-k_b updates' to 'a sequence (= ⌈K_g/M_X⌉) of rank-(k_b · M_X) updates' to maximize the performance.
Figure 6: Snapshot of a simple version of DIMMA. The darkest blocks are broadcast first.
For example, if P = 2, Q = 3, k_b = 10 and k_opt = 20, the processors deal with 2 columns of blocks of A and 2 rows of blocks of B at a time (M_X = ⌈k_opt/k_b⌉ = 2). The first column of processors, P0 and P3, copies the two columns A(:, [0, 6]) (that is, A(:, 0) and A(:, 6)) to T_A and broadcasts them along each row of processors. The first row of processors, P0, P1 and P2, copies the two rows B([0, 6], :) (that is, B(0, :) and B(6, :)) to T_B and broadcasts them along each column of processors. Then all processors multiply T_A with T_B to produce C. Next, the second column of processors, P1 and P4, copies the next two columns A(:, [1, 7]) to T_A and broadcasts them again rowwise, and the second row of processors, P3, P4 and P5, copies the next two rows B([1, 7], :) to T_B and broadcasts them columnwise. The product of T_A and T_B is added to C in each processor.

Figure 7: A snapshot of DIMMA.

The value of M_X can be determined by the block size, available memory space, and machine characteristics such as processor performance and communication speed. If it is assumed that k_opt = 20, the value of M_X should be 4 if the block size is 5, and it should be 2 if the block size is 10.
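The grouping rule can be sketched in Python: for each column of processors, the columns of blocks that sit LCM blocks apart are gathered M_X at a time. The function name and interface are ours, a simplification of the pseudo code of Figure 8:

```python
from math import ceil, gcd

def column_groups(P, Q, Kg, kb, kopt=20):
    """For each column of the processor grid, the groups of MX thin columns
    of blocks of A (distance LCM blocks) that DIMMA broadcasts together."""
    lcm = P * Q // gcd(P, Q)
    MX = max(1, ceil(kopt / kb))           # thin columns grouped per broadcast
    per_pcol = {}
    for pcol in range(Q):                  # one column of the processor grid
        groups = []
        for base in range(pcol, lcm, Q):   # its columns within one LCM block
            chain = list(range(base, Kg, lcm))   # same spot in every LCM block
            for i in range(0, len(chain), MX):
                groups.append(chain[i:i + MX])
        per_pcol[pcol] = groups
    return per_pcol

# P = 2, Q = 3, Kg = 12, kb = 10, kopt = 20 -> MX = 2: processor column 0
# groups its columns of blocks as [0, 6] and [3, 9], matching Figure 7.
```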
If k_b is much larger than the optimal value (for example, k_b = 100), it may be difficult to obtain good performance, since it is difficult to overlap the communication with the computation. In addition, the multiplication routine requires a large amount of memory to send and receive A and B. It is possible to divide k_b into smaller pieces. For example, if k_b = 100, processors divide a column of blocks of A into five thin columns of blocks, and divide a row of blocks of B into five thin rows of blocks. Then they multiply each thin column of blocks of A with the corresponding thin row of blocks of B, successively. The two cases, in which k_b is smaller and larger than k_opt, are combined, and the pseudo code of DIMMA is shown in Figure 8.
4. Analysis of Multiplication Algorithms

We analyze the elapsed time of SUMMA and DIMMA based on Figure 3. It is assumed that k_b = k_opt throughout the computation. Then, for the multiplication C ⇐ C + A · B, where C is M × N, A is M × K, and B is K × N, there are K_g = ⌈K/k_b⌉ columns of blocks of A and K_g rows of blocks of B.

    C(:, :) = 0
    M_X = ⌈k_opt / k_b⌉
    DO L1 = 0, Q − 1
       DO L2 = 0, LCM/Q − 1
          L_X = LCM · M_X
          DO L3 = 0, ⌈K_g / L_X⌉ − 1
             DO L4 = 0, ⌈k_b / k_opt⌉ − 1
                L_m = (L1 + L2 · Q + L3 · L_X + [L4]) : LCM : ((L3 + 1) · L_X − 1)
                [Copy A(:, L_m) to T_A and broadcast it along each row of processors]
                [Copy B(L_m, :) to T_B and broadcast it along each column of processors]
                C(:, :) = C(:, :) + T_A · T_B
             END DO
          END DO
       END DO
    END DO

Figure 8: The pseudo code of DIMMA. The DO loop over L3 is used if k_b is smaller than k_opt, where the routine handles M_X columns of blocks of A and M_X rows of blocks of B, whose block distance is LCM, simultaneously; L_m is used to select them correctly. The innermost DO loop over L4 is used if k_b is larger than k_opt, and the bracket in [L4] represents the L4-th thin vector.

At first, it is assumed that there are P linearly connected processors, in which a column of blocks of A (= T_A) is broadcast along the P processors at each step and a row of blocks of B (= T_B) always stays in each processor. It is also assumed that the time for sending a column T_A to the next processor is t_c, and the time for multiplying T_A with T_B and adding the product to C is t_p. Actually, t_c = α + M k_b β and t_p = 2 M (N/P) k_b γ, where α is a communication start-up time, β is a data transfer time, and γ is a time for a multiplication or addition.

For SUMMA, the time difference between two successive pipelined broadcasts of T_A is 2 t_c + t_p. The total elapsed time of SUMMA with K_g columns of blocks on a 1-dimensional processor grid, t_summa^1D, is

    t_summa^1D = K_g (2 t_c + t_p) − t_c + (P − 2) t_c = K_g (2 t_c + t_p) + (P − 3) t_c.

For DIMMA, the time difference between the two pipelined broadcasts is t_c + t_p if the T_A's are broadcast from the same processor. However, the time difference is 2 t_c + t_p if they are in different processors. The total elapsed time of DIMMA, t_dimma^1D, is

    t_dimma^1D = K_g (t_c + t_p) + (P − 1) t_c + (P − 2) t_c = K_g (t_c + t_p) + (2P − 3) t_c.
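Plugging the Figure 3 parameters (P = 4 processors, K_g = 8 broadcasts, t_c = 0.2 s, t_p = 0.6 s) into these two expressions reproduces the simulated times; the function names are ours:

```python
def t_summa_1d(Kg, P, tc, tp):
    """Total elapsed time of SUMMA on a 1-D grid of P processors."""
    return Kg * (2 * tc + tp) + (P - 3) * tc

def t_dimma_1d(Kg, P, tc, tp):
    """Total elapsed time of DIMMA on a 1-D grid of P processors."""
    return Kg * (tc + tp) + (2 * P - 3) * tc

print(t_summa_1d(8, 4, 0.2, 0.6))   # 8.2 seconds, as in Figure 3(a)
print(t_dimma_1d(8, 4, 0.2, 0.6))   # 7.4 seconds, as in Figure 3(b)
```

The gap, K_g · t_c − P · t_c = (8 − 4) × 0.2 = 0.8 s, is exactly the 4 saved communication times noted earlier.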
On a 2-dimensional P × Q processor grid, the communication time of SUMMA is doubled in order to broadcast T_B as well as T_A. Assume again that the times for sending a column T_A and a row T_B to the next processor are t_ca and t_cb, respectively, and that the time for multiplying T_A with T_B and adding the product to C is t_p. Actually, t_ca = α + (M/P) k_b β, t_cb = α + (N/Q) k_b β, and t_p = 2 (M/P) (N/Q) k_b γ. So,

    t_summa^2D = K_g (2 t_ca + 2 t_cb + t_p) + (Q − 3) t_ca + (P − 3) t_cb.    (1)
For DIMMA, each column of processors broadcasts T_A until everything is sent. Meanwhile, rows of processors broadcast T_B if they have the T_B corresponding to the T_A. For a column of processors which currently broadcasts A, P/GCD rows of processors, whose distance is GCD, have rows of blocks of B to broadcast along with the T_A, where GCD is the greatest common divisor of P and Q. The extra idle wait, caused by broadcasting two T_B's when they are in different processors, is GCD · t_cb. Then the total extra waiting time to broadcast the T_B's is Q · (P/GCD) · GCD · t_cb = PQ · t_cb.

However, if GCD = P, only one row of processors has a T_B to broadcast corresponding to the column of processors, and the total extra waiting time is P · t_cb. So,

    t_dimma^2D = K_g (t_ca + t_cb + t_p) + (2Q − 3) t_ca + (P + Q − 3) t_cb     if GCD = P
               = K_g (t_ca + t_cb + t_p) + (2Q − 3) t_ca + (PQ + P − 3) t_cb    otherwise.    (2)
The time difference between SUMMA and DIMMA is

    t_summa^2D − t_dimma^2D = (K_g − Q) t_ca + (K_g − P) t_cb     if GCD = P,
                            = (K_g − Q) t_ca + (K_g − PQ) t_cb    otherwise.    (3)
5. Implementation and Results

We implemented three algorithms, called SUMMA0, SUMMA and DIMMA, and compared their performance on the 512-node Intel Paragon at the Oak Ridge National Laboratory, Oak Ridge, U.S.A., and the 256-node Intel Paragon at the Samsung Advanced Institute of Technology, Suwon, Korea. SUMMA0 is the original version of SUMMA, which has the pipelined broadcasting scheme and the fixed block size, k_b. The local matrix multiplication in SUMMA0 is a rank-k_b update. SUMMA is a revised version of SUMMA0 with the LCM block concept for the optimized performance of DGEMM, so that the local matrix multiplication is a rank-k_approx update, where k_approx is computed in the implementation as follows:

    k_approx = ⌊k_opt / k_b⌋ · k_b        if k_opt ≥ k_b,
             = ⌊k_b / ⌈k_b / k_opt⌉⌋     otherwise.
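A direct transcription of this rule (the function name is ours; k_opt = 20 as stated above):

```python
from math import ceil

def k_approx(kb, kopt=20):
    """Width of the local rank-k update chosen by the blocking scheme."""
    if kopt >= kb:
        return (kopt // kb) * kb       # floor(kopt/kb)*kb: group thin blocks
    return kb // ceil(kb / kopt)       # floor(kb/ceil(kb/kopt)): split a fat block
```

For the block sizes used in Table 1 (k_b = 1, 5, 20, 50, 100) this yields updates of width 20, 20, 20, 16 and 20, so the local DGEMM always runs at or near its optimal width.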
First of all, we changed the block size, k_b, and observed how the block size affects the performance of the algorithms. Table 1 shows the performance for A = B = C = 2000 × 2000 and 4000 × 4000 on 8 × 8 and 12 × 8 processor grids with block sizes k_b = 1, 5, 20, 50, and 100.

    P × Q    Matrix Size    Block Size    SUMMA0   SUMMA   DIMMA
    ------------------------------------------------------------
    8 × 8    2000 × 2000      1 × 1       1.135    2.678   2.735
                              5 × 5       2.488    2.730   2.735
                             20 × 20      2.505    2.504   2.553
                             50 × 50      2.633    2.698   2.733
                            100 × 100     1.444    1.945   1.948
    8 × 8    4000 × 4000      1 × 1       1.296    2.801   2.842
                              5 × 5       2.614    2.801   2.842
                             20 × 20      2.801    2.801   2.842
                             50 × 50      2.674    2.822   2.844
                            100 × 100     2.556    2.833   2.842
    12 × 8   4000 × 4000      1 × 1       1.842    3.660   3.731
                              5 × 5       3.280    3.836   3.917
                             20 × 20      3.928    3.931   4.006
                             50 × 50      3.536    3.887   3.897
                            100 × 100     2.833    3.430   3.435

Table 1: Dependence of performance on block size (Unit: Gflops).
At first, SUMMA0 and SUMMA are compared. With the extreme case of k_b = 1, SUMMA with the modified blocking scheme performed at least 100% better than SUMMA0. When k_b = 5, SUMMA shows 7 to 10% enhanced performance. If the block size is much larger than the optimal block size, that is, k_b = 50 or 100, SUMMA0 becomes inefficient again, since it has difficulty in overlapping the communications with the computations. SUMMA outperformed SUMMA0 by about 5 to 10% when A = B = C = 4000 × 4000 and k_b = 50 or 100 on 8 × 8 and 12 × 8 processor grids.

Note that on an 8 × 8 processor grid with 2000 × 2000 matrices, the performance for k_b = 20 or 100 is much lower than that of the other cases. When k_b = 100, the processors in the top half have 300 rows of the matrices, while those in the bottom half have just 200 rows. This leads to load imbalance among processors: the processors in the top half require 50% more local computation.
Now SUMMA and DIMMA are compared. Figures 9 and 10 show the performance of SUMMA and DIMMA on 16 × 16 and 16 × 12 processor grids, respectively, with the fixed block size k_b = k_opt = 20. DIMMA always performs better than SUMMA on the 16 × 16 processor grid. These matrix multiplication algorithms require O(N^3) flops and O(N^2) communications; that is, the algorithms are computation intensive. For a small matrix of N = 1000, the performance difference between the two algorithms is about 10%. But for a large matrix, these algorithms require much more computation, so that the performance difference caused by the different communication schemes becomes negligible. For N = 8000, the performance difference is only about 2 to 3%.

Figure 9: Performance of SUMMA and DIMMA on a 16 × 16 processor grid (k_b = k_opt = 20).

Figure 10: Performance of SUMMA and DIMMA on a 16 × 12 processor grid.

Figure 11: Predicted performance of SUMMA and DIMMA on a 16 × 16 processor grid.

Figure 12: Predicted performance of SUMMA and DIMMA on a 16 × 12 processor grid.

On the 16 × 12 processor grid, SUMMA performs slightly better than DIMMA for small matrices, such as N = 1000 and 2000. If P = 16 and Q = 12, then GCD = 4 ≠ P. For the problem of M = N = K = 2000 and k_opt = k_b = 20, K_g = K/k_b = 100. From Eq. (3),

    t_summa^2D − t_dimma^2D = (100 − 12) t_ca + (100 − 16 · 12) t_cb = 88 t_ca − 92 t_cb.

From this result, it is expected that SUMMA is faster than DIMMA for this problem if t_ca = t_cb.
We predicted the performance on the Intel Paragon using Eqs. (1) and (2). Figures 11 and 12 show the predicted performance of SUMMA and DIMMA corresponding to Figures 9 and 10, respectively. We used α = 94.75 μsec, β = 0.02218 μsec (45 Mbytes/sec), and γ = 22.88 nsec (43.7 Mflops per node) for the predicted performance. These values are observed in practice.

In Eq. (2), the idle wait, (2Q − 3) t_ca + (PQ + P − 3) t_cb when GCD ≠ P, can be reduced by a slight modification of the communication scheme. For example, when P = 8 and Q = 4 (that is, GCD = Q), if a column of processors sends all of its columns of blocks of B instead of a row of processors sending all of its rows of blocks of A as in Figure 8, the waiting time is reduced to (P + Q − 3) t_ca + (2P − 3) t_cb.
The following example has another communication characteristic. After the first column and the first row of processors send their own A and the corresponding B, respectively, then, for the next step, the second column and the second row of processors send their A and B, respectively. The communication resembles that of SUMMA, but the processors send all corresponding blocks of A and B. The waiting time is (LCM + Q − 3) t_ca + (LCM + P − 3) t_cb. This modification is faster if 2 · GCD < MIN(P, Q).
The performance per node of SUMMA and DIMMA is shown in Figures 13 and 14, respectively, when memory usage per node is held constant. Both algorithms show good performance and scalability, but DIMMA is always better. If each processor has a local problem size of more than 200 × 200, DIMMA always reaches 40 Mflops per processor, but SUMMA obtained about 38 Mflops per processor.

Figure 13: Performance per node of SUMMA, where memory use per node is held constant. The five curves represent 100 × 100, 200 × 200, 300 × 300, 400 × 400, and 500 × 500 local matrices per node, from the bottom.

Figure 14: Performance per node of DIMMA.

Currently, the modified blocking scheme in DIMMA uses the rank-k_approx update. However, it is possible to modify DIMMA with the exact rank-k_opt update by dividing the virtually connected LCM blocks in each processor. This modification complicates the algorithm implementation, and since the performance of DGEMM is not sensitive to the value of k_opt as long as it is larger than 20, there would be no improvement in performance.

6. Conclusions

We have presented a new parallel matrix multiplication algorithm, called DIMMA, for block cyclic data distribution on distributed-memory concurrent computers. DIMMA is the most efficient and scalable matrix multiplication algorithm. DIMMA uses the modified pipelined broadcasting scheme to overlap computation and communication effectively, and exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine regardless of the block size. DIMMA always shows the same high performance even when the block size k_b is very small as well as very large, provided the matrices are evenly distributed among processors.
Acknowledgement

The author thanks Dr. Jack Dongarra of the University of Tennessee at Knoxville and the Oak Ridge National Laboratory for providing computing facilities to perform this work, and the anonymous reviewers for their helpful comments, which improved the quality of the paper. This research was performed in part using the Intel Paragon computers both at the Oak Ridge National Laboratory, Oak Ridge, U.S.A., and at the Samsung Advanced Institute of Technology, Suwon, Korea. It was supported by the Korean Ministry of Information and Communication under contract 96087-IT1-I2.
7. References
[1] R. C. Agarwal, F. G. Gustavson, and M. Zubair. A High-Performance Matrix-Multiplication Algorithm on a Distributed-Memory Parallel Computer Using Overlapped Communication. IBM Journal of Research and Development, 38(6):673-681, 1994.

[2] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A Portable Linear Algebra Library for High-Performance Computers. In Proceedings of Supercomputing '90, pages 1-10. IEEE Press, 1990.

[3] L. Blackford, J. Choi, A. Cleary, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance. In Proceedings of Supercomputing '96, 1996. http://www.supercomp.org/sc96/proceedings/.

[4] L. E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Thesis, Montana State University, 1969.

[5] J. Choi, J. J. Dongarra, S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley. A Proposal for a Set of Parallel Basic Linear Algebra Subprograms. LAPACK Working Note 100, Technical Report CS-95-292, University of Tennessee, 1995.

[6] J. Choi, J. J. Dongarra, R. Pozo, D. C. Sorensen, and D. W. Walker. CRPC Research into Linear Algebra Software for High Performance Computers. International Journal of Supercomputing Applications, 8(2):99-118, Summer 1994.

[7] J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers. Concurrency: Practice and Experience, 6:543-570, 1994.

[8] J. Choi, J. J. Dongarra, and D. W. Walker. PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subprograms. Concurrency: Practice and Experience, 8:517-535, 1996.

[9] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A Set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, 1990.

[10] R. D. Falgout, A. Skjellum, S. G. Smith, and C. H. Still. The Multicomputer Toolbox Approach to Concurrent BLAS and LACS. In Proceedings of the 1992 Scalable High Performance Computing Conference, pages 121-128. IEEE Press, 1992.

[11] G. C. Fox, S. W. Otto, and A. J. G. Hey. Matrix Algorithms on a Hypercube I: Matrix Multiplication. Parallel Computing, 4:17-31, 1987.

[12] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, second edition, 1989.

[13] M. T. Heath and J. A. Etheridge. Visualizing the Performance of Parallel Programs. IEEE Software, 8(5):29-39, September 1991.

[14] S. Huss-Lederman, E. M. Jacobson, and A. Tsao. Comparison of Scalable Parallel Multiplication Libraries. In The Scalable Parallel Libraries Conference, Starkville, MS, pages 142-149. IEEE Computer Society Press, October 6-8, 1993.

[15] S. Huss-Lederman, E. M. Jacobson, A. Tsao, and G. Zhang. Matrix Multiplication on the Intel Touchstone Delta. Concurrency: Practice and Experience, 6:571-594, 1994.

[16] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA, 1994.

[17] A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. Ph.D. Thesis, University of Tennessee, Knoxville, 1996.

[18] R. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. LAPACK Working Note 99, Technical Report CS-95-286, University of Tennessee, 1995.

[19] R. A. van de Geijn. Using PLAPACK. The MIT Press, Cambridge, 1997.