
A New Parallel Matrix Multiplication Algorithm
on Distributed-Memory Concurrent Computers

Jaeyoung Choi
School of Computing
Soongsil University
1-1, Sangdo-Dong, Dongjak-Ku
Seoul 156-743, KOREA

Abstract

We present a new fast and scalable matrix multiplication algorithm, called DIMMA (Distribution-Independent Matrix Multiplication Algorithm), for block cyclic data distribution on distributed-memory concurrent computers. The algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap computation and communication effectively, and it exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine in each processor even when the block size is very small as well as very large. The algorithm is implemented and compared with SUMMA on the Intel Paragon.

1. Introduction

A number of algorithms are currently available for multiplying two matrices A and B to yield the product matrix C = A · B on distributed-memory concurrent computers [12, 16]. Two classic algorithms are Cannon's algorithm [4] and Fox's algorithm [11]. They are based on a P × P square processor grid with a block data distribution in which each processor holds a large consecutive block of data.

Two efforts to implement Fox's algorithm on general 2-D grids have been made: Choi, Dongarra and Walker developed `PUMMA' [7] for block cyclic data decompositions, and Huss-Lederman, Jacobson, Tsao and Zhang developed `BiMMeR' [15] for the virtual 2-D torus wrap data layout. The differences in these data layouts result in different algorithms. These two algorithms have been compared on the Intel Touchstone Delta [14].

Recent efforts to implement numerical algorithms for dense matrices on distributed-memory concurrent computers are based on a block cyclic data distribution [6], in which an M × N matrix A consists of m_b × n_b blocks of data, and the blocks are distributed by wrapping around both row and column directions on an arbitrary P × Q processor grid. The distribution can reproduce most data distributions used in linear algebra computations. For details, see Section 2.2. We limit the distribution of the data matrices to the block cyclic data distribution.

PUMMA requires a minimum number of communications and computations. It consists of only Q - 1 shifts for A, LCM(P, Q) broadcasts for B, and LCM(P, Q) local multiplications, where LCM(P, Q) is the least common multiple of P and Q. It multiplies the largest possible matrices of A and B at each computation step, so that performance of the routine depends very weakly on the block size of the matrix. However, PUMMA makes it difficult to overlap computation with communication since it always deals with the largest possible matrices for both computation and communication, and it requires large memory space to store them temporarily, which makes it impractical in real applications.

Agrawal, Gustavson and Zubair [1] proposed another matrix multiplication algorithm that efficiently overlaps computation with communication on the Intel iPSC/860 and Delta systems. Van de Geijn and Watts [18] independently developed the same algorithm on the Intel Paragon and called it SUMMA. Also independently, PBLAS [5], which is a major building block of ScaLAPACK [3], uses the same scheme in implementing the matrix multiplication routine, PDGEMM.

In this paper, we present a new fast and scalable matrix multiplication algorithm, called DIMMA (Distribution-Independent Matrix Multiplication Algorithm), for block cyclic data distribution on distributed-memory concurrent computers. The algorithm incorporates SUMMA with two new ideas. It uses a `modified pipelined communication scheme', which makes the algorithm the most efficient by overlapping computation and communication effectively. It also exploits `the LCM concept', which maintains the maximum performance of the sequential BLAS routine, DGEMM, in each processor, even when the block size is very small as well as very large. The details of the LCM concept are explained in Section 2.2.

DIMMA and SUMMA are implemented and compared on the Intel Paragon computer. The parallel matrix multiplication requires O(N^3) flops and O(N^2) communications, i.e., it is computation intensive. For a large matrix, the performance difference between SUMMA and DIMMA may be marginal and negligible. But for a small matrix of N = 1000 on a 16 × 16 processor grid, the performance difference is approximately 10%.

2. Design Principles

2.1. Level 3 BLAS

Current advanced architecture computers possess hierarchical memories in which access to data in the upper levels of the memory hierarchy (registers, cache, and/or local memory) is faster than access to data in the lower levels (shared or off-processor memory). One technique to exploit the power of such machines more efficiently is to develop algorithms that maximize reuse of data held in the upper levels. This can be done by partitioning the matrix or matrices into blocks and by performing the computation with matrix-matrix operations on the blocks. The Level 3 BLAS [9] perform a number of commonly used matrix-matrix operations, and are available in optimized form on most computing platforms ranging from workstations up to supercomputers.

The Level 3 BLAS have been successfully used as the building blocks of a number of applications, including LAPACK [2], a software library that uses block-partitioned algorithms for performing dense linear algebra computations on vector and shared memory computers. On shared memory machines, block-partitioned algorithms reduce the number of times that data must be fetched from shared memory, while on distributed-memory machines, they reduce the number of messages required to get the data from other processors. Thus, there has been much interest in developing versions of the Level 3 BLAS for distributed-memory concurrent computers [5, 8, 10].

The most important routine in the Level 3 BLAS is DGEMM, which performs matrix-matrix multiplication. The general purpose routine performs the following operation:

    C ← α · op(A) · op(B) + β · C,

where op(X) = X, X^T, or X^H, and "·" denotes matrix-matrix multiplication. A, B and C are matrices, and α and β are scalars. This paper focuses on the design and implementation of the non-transposed matrix multiplication routine, C ← α · A · B + β · C, but the idea can be easily extended to the transposed multiplication routines, C ← α · A^T · B + β · C and C ← α · A · B^T + β · C.
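To make the operation concrete, here is a minimal NumPy sketch of what DGEMM computes; it is a reference evaluation only, and the function name, the trans_a/trans_b flags, and the random test data are ours, not part of the BLAS interface.

    import numpy as np

    def gemm(alpha, A, B, beta, C, trans_a="N", trans_b="N"):
        # Reference (non-BLAS) evaluation of C <- alpha*op(A)*op(B) + beta*C.
        op_a = A.T if trans_a == "T" else A.conj().T if trans_a == "H" else A
        op_b = B.T if trans_b == "T" else B.conj().T if trans_b == "H" else B
        return alpha * (op_a @ op_b) + beta * C

    # Example: the non-transposed case C <- alpha*A*B + beta*C.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))
    B = rng.standard_normal((3, 5))
    C = rng.standard_normal((4, 5))
    C = gemm(2.0, A, B, 0.5, C)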

2.2. Block Cyclic Data Distribution

For performing the matrix multiplication C = A · B, we assume that A, B and C are M × K, K × N, and M × N, respectively. The distributed routine also requires a condition on the block size to ensure compatibility. That is, if the block size of A is m_b × k_b, then that of B and C must be k_b × n_b and m_b × n_b, respectively. So the numbers of blocks of the matrices A, B, and C are M_g × K_g, K_g × N_g, and M_g × N_g, respectively, where M_g = ⌈M/m_b⌉, N_g = ⌈N/n_b⌉, and K_g = ⌈K/k_b⌉.

Figure 1: Block cyclic data distribution, shown from (a) the matrix point-of-view and (b) the processor point-of-view. A matrix with 12 × 12 blocks is distributed over a 2 × 3 processor grid. (a) The shaded and unshaded areas represent different grids. (b) It is easier to see the distribution from the processor point-of-view to implement algorithms. Each processor has 6 × 4 blocks.

The way in which a matrix is distributed over the processors has a major impact on the load balance and communication characteristics of the concurrent algorithm, and hence largely determines its performance and scalability. The block cyclic distribution provides a simple, general-purpose way of distributing a block-partitioned matrix on distributed-memory concurrent computers.

Figure 1(a) shows an example of the block cyclic data distribution, where a matrix with 12 × 12 blocks is distributed over a 2 × 3 grid. The numbered squares represent blocks of elements, and the number indicates the location in the processor grid; all blocks labeled with the same number are stored in the same processor. The slanted numbers, on the left and on the top of the matrix, represent indices of a row of blocks and of a column of blocks, respectively. Figure 1(b) reflects the distribution from a processor point-of-view, where each processor has 6 × 4 blocks.
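To make the mapping concrete, the short Python sketch below (ours, not part of any library) computes which processor owns block (i, j) and where that block lands locally under the block cyclic distribution on a P × Q grid; the function names are illustrative.

    def owner(i, j, P, Q):
        # Block (i, j) is stored on processor (i mod P, j mod Q).
        return (i % P, j % Q)

    def local_index(i, j, P, Q):
        # Within its processor, the block is the (i div P, j div Q)-th local block.
        return (i // P, j // Q)

    # The 12 x 12 blocks of Figure 1 on a 2 x 3 grid: processor (0, 0) holds 6 x 4 blocks.
    P, Q = 2, 3
    blocks = [(i, j) for i in range(12) for j in range(12) if owner(i, j, P, Q) == (0, 0)]
    assert len(blocks) == 6 * 4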

Denoting the least common multiple of P and Q by LCM, we refer to a square of LCM × LCM blocks as an LCM block. Thus, the matrix in Figure 1 may be viewed as a 2 × 2 array of LCM blocks. Blocks belong to the same processor if their relative locations are the same in each LCM block. A parallel algorithm in which the order of execution can be intermixed, such as matrix multiplication and matrix transposition, may be developed for the first LCM block. Then it can be directly applied to the other LCM blocks, which have the same structure and the same data distribution as the first LCM block; that is, when an operation is executed on the first LCM block, the same operation can be done simultaneously on the other LCM blocks. The LCM concept has been applied to design software libraries for dense linear algebra computations with algorithmic blocking [17, 19].
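The defining property of an LCM block can be checked directly: shifting a block's coordinates by LCM in both directions never changes its owner. A minimal check (ours), assuming Python 3.9+ for math.lcm:

    from math import lcm

    P, Q = 2, 3
    LCM = lcm(P, Q)                      # 6 for the grid of Figure 1
    owner = lambda i, j: (i % P, j % Q)  # block cyclic ownership

    # Blocks with the same relative position inside each LCM x LCM block
    # belong to the same processor.
    for i in range(12):
        for j in range(12):
            assert owner(i, j) == owner(i + LCM, j + LCM)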

Figure 2: A snapshot of SUMMA. The darkest blocks are broadcast first, and the lightest blocks are broadcast later.

3. Algorithms

3.1. SUMMA

SUMMA is basically a sequence of rank-k_b updates. In SUMMA, A and B are divided into several columns and rows of blocks, respectively, whose block sizes are k_b. Processors multiply the first column of blocks of A with the first row of blocks of B. Then processors multiply the next column of blocks of A and the next row of blocks of B successively.
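Viewed serially, this is simply a sweep of rank-k_b updates over the K dimension. The NumPy sketch below (ours; it illustrates only the arithmetic and performs no communication) accumulates C from successive column panels of A and row panels of B.

    import numpy as np

    def rank_kb_sweep(A, B, kb):
        # C is accumulated as a sequence of rank-kb updates:
        # C += A(:, panel) * B(panel, :) for successive panels of width kb.
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N))
        for start in range(0, K, kb):
            stop = min(start + kb, K)
            C += A[:, start:stop] @ B[start:stop, :]
        return C

    rng = np.random.default_rng(1)
    A = rng.standard_normal((60, 40))
    B = rng.standard_normal((40, 50))
    assert np.allclose(rank_kb_sweep(A, B, kb=8), A @ B)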

As the snapshot of Figure 2 shows, the first column of processors, P_0 and P_3, begins broadcasting the first column of blocks of A, A(:, 0), along each row of processors (here we use MATLAB notation to simply represent a portion of a matrix). At the same time, the first row of processors, P_0, P_1, and P_2, broadcasts the first row of blocks of B, B(0, :), along each column of processors. After the local multiplication, the second column of processors, P_1 and P_4, broadcasts A(:, 1) rowwise, and the second row of processors, P_3, P_4, and P_5, broadcasts B(1, :) columnwise. This procedure continues until the last column of blocks of A and the last row of blocks of B have been processed.

Agrawal, Gustavson and Zubair [1], and van de Geijn and Watts [18], obtained high efficiency on the Intel Delta and Paragon, respectively, by exploiting a pipelined communication scheme, where broadcasting is implemented as passing a column or row of blocks around the logical ring that forms the processor row or column.
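As a sketch of what passing a panel around the logical ring means, the illustrative function below (ours) lists the point-to-point hops that realize a broadcast from a given source on a ring of p processors; p - 1 forwards complete the broadcast.

    def ring_broadcast_hops(source, p):
        # Returns the (sender, receiver) pairs, in order, for a pipelined
        # broadcast that passes the data around the logical ring.
        hops = []
        sender = source
        for _ in range(p - 1):
            receiver = (sender + 1) % p
            hops.append((sender, receiver))
            sender = receiver
        return hops

    # Broadcast from processor 1 on a ring of 4 processors:
    # [(1, 2), (2, 3), (3, 0)]
    print(ring_broadcast_hops(1, 4))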

3.2. DIMMA

We show a simple simulation in Figure 3. It is assumed that there are 4 processors, each has 2 sets of data to broadcast, and they use blocking send and non-blocking receive. In the figure, the time to send a data set is assumed to be 0.2 seconds, and the time for a local computation is 0.6 seconds. Then the pipelined broadcasting scheme takes 8.2 seconds, as in Figure 3(a).

Figure 3: Communication characteristics of (a) SUMMA and (b) DIMMA. The timelines show, for each processor, the time to send a message and the time for a local computation. It is assumed that blocking send and non-blocking receive are used.

A careful investigation of the pipelined communication shows there is an extra waiting time between two communication procedures. If the first processor broadcasts everything it contains to the other processors before the next processor starts to broadcast its data, it is possible to eliminate the unnecessary waiting time. The modified communication scheme in Figure 3(b) takes 7.4 seconds. That is, the new communication scheme saves 4 communication times (8.2 - 7.4 = 0.8 = 4 × 0.2). Figures 4 and 5 show a Paragraph visualization [13] of SUMMA and DIMMA on the Intel Paragon computer, respectively. Paragraph is a parallel programming tool that graphically displays the execution of a distributed-memory program. These figures include spacetime diagrams, which show the communication pattern between the processes, and utilization Gantt charts, which show when each process is busy or idle. The dark gray color signifies idle time for a given process, and the light gray color signals busy time. DIMMA is more efficient in communication than SUMMA, as shown in these figures. The detailed analysis of the algorithms is presented in Section 4.

Figure 4: Paragraph visualization of SUMMA.

Figure 5: Paragraph visualization of DIMMA.

With this modified communication scheme, DIMMA is implemented as follows. After the first procedure, that is, broadcasting and multiplying A(:, 0) and B(0, :), the first column of processors, P_0 and P_3, broadcasts A(:, 6) along each row of processors, and the first row of processors, P_0, P_1, and P_2, sends B(6, :) along each column of processors, as shown in Figure 6. The value 6 appears since the LCM of P = 2 and Q = 3 is 6.

For the third and fourth procedures, the first column of processors, P_0 and P_3, broadcasts rowwise A(:, 3) and A(:, 9), and the second row of processors, P_3, P_4, and P_5, broadcasts columnwise B(3, :) and B(9, :), respectively. After the first column of processors, P_0 and P_3, has broadcast all of its columns of blocks of A along each row of processors, the second column of processors, P_1 and P_4, broadcasts its columns of A.

The basic computation of SUMMA and DIMMA in each processor is a sequence of rank-k_b updates of the matrix. The value of k_b should be at least 20. Let k_opt be the optimal block size for the computation; then k_opt = 20 optimizes the performance of the sequential BLAS routine, DGEMM, on the Intel Paragon, which corresponds to about 44 Mflops on a single node. The vectors of blocks to be multiplied should be conglomerated to form larger matrices to optimize performance if k_b is small.

DIMMA is modified with the LCM concept. The basic idea of the LCM concept is to handle simultaneously several thin columns of blocks of A, and the same number of thin rows of blocks of B, so that each processor multiplies several thin matrices of A and B simultaneously in order to obtain the maximum performance of the machine. Instead of broadcasting a single column of A and a single row of B, a column of processors broadcasts several (M_X = ⌈k_opt/k_b⌉) columns of blocks of A along each row of processors, whose distance is LCM blocks in the column direction. At the same time, a row of processors broadcasts the same number of blocks of B along each column of processors, whose distance is LCM blocks in the row direction, as shown in Figure 7. Then each processor executes its own multiplication. The multiplication operation is changed from `a sequence (= K_g) of rank-k_b updates' to `a sequence (= ⌈K_g/M_X⌉) of rank-(k_b × M_X) updates' to maximize the performance.
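The LCM modification is purely a regrouping of the arithmetic: multiplying M_X thin panels at once is numerically identical to M_X separate rank-k_b updates, but it calls DGEMM with a wider inner dimension. The NumPy sketch below (ours; serial, with the gathered panels simply concatenated) checks that equivalence on random data.

    import numpy as np

    rng = np.random.default_rng(2)
    M, K, N, kb = 30, 40, 20, 5
    A = rng.standard_normal((M, K))
    B = rng.standard_normal((K, N))

    # Two thin rank-kb updates, done one panel at a time ...
    cols0 = np.arange(0, kb)        # first thin panel of the group
    cols1 = np.arange(20, 20 + kb)  # another panel further along A (LCM blocks away in DIMMA)
    C_thin = A[:, cols0] @ B[cols0, :] + A[:, cols1] @ B[cols1, :]

    # ... equal one rank-(kb*M_X) update on the gathered panels T_A and T_B.
    cols = np.concatenate([cols0, cols1])
    T_A, T_B = A[:, cols], B[cols, :]
    assert np.allclose(C_thin, T_A @ T_B)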

Figure 6: Snapshot of a simple version of DIMMA. The darkest blocks are broadcast first.

Figure 7: A snapshot of DIMMA.

For example, if P = 2, Q = 3, k_b = 10 and k_opt = 20, the processors deal with 2 columns of blocks of A and 2 rows of blocks of B at a time (M_X = ⌈k_opt/k_b⌉ = 2). The first column of processors, P_0 and P_3, copies the two columns A(:, [0, 6]), that is, A(:, 0) and A(:, 6), to T_A, and broadcasts them along each row of processors. The first row of processors, P_0, P_1 and P_2, copies the two rows B([0, 6], :), that is, B(0, :) and B(6, :), to T_B and broadcasts them along each column of processors. Then all processors multiply T_A with T_B to produce C. Next, the second column of processors, P_1 and P_4, copies the next two columns A(:, [1, 7]) to T_A and broadcasts them again rowwise, and the second row of processors, P_3, P_4 and P_5, copies the next two rows B([1, 7], :) to T_B and broadcasts them columnwise. The product of T_A and T_B is added to C in each processor.

The value of M_X can be determined by the block size, available memory space, and machine characteristics such as processor performance and communication speed. If it is assumed that k_opt = 20, the value of M_X should be 4 if the block size is 5, and the value of M_X should be 2 if the block size is 10.

If k_b is much larger than the optimal value (for example, k_b = 100), it may be difficult to obtain good performance since it is difficult to overlap the communication with the computation. In addition, the multiplication routine requires a large amount of memory to send and receive A and B. It is possible to divide k_b into smaller pieces. For example, if k_b = 100, processors divide a column of blocks of A into five thin columns of blocks, and divide a row of blocks of B into five thin rows of blocks. Then they multiply each thin column of blocks of A with the corresponding thin row of blocks of B successively. The two cases, in which k_b is smaller and larger than k_opt, are combined, and the pseudocode of DIMMA is shown in Figure 8.

4. Analysis of Multiplication Algorithms

We analyze the elapsed time of SUMMA and DIMMA based on Figure 3. It is assumed that k_b = k_opt throughout the computation. Then, for the multiplication C ← C + A · B, where C is M × N, A is M × K, and B is K × N, there are K_g = ⌈K/k_b⌉ columns of blocks of A and K_g rows of blocks of B.

    C(:, :) = 0
    M_X = ⌈k_opt / k_b⌉
    DO L1 = 0, Q - 1
       DO L2 = 0, LCM/Q - 1
          L_X = LCM × M_X
          DO L3 = 0, ⌈K_g / L_X⌉ - 1
             DO L4 = 0, ⌈k_b / k_opt⌉ - 1
                L_m = L1 + L2 × Q + L3 × L_X + [L4] : LCM : (L3 + 1) × L_X - 1
                [Copy A(:, L_m) to T_A and broadcast it along each row of processors]
                [Copy B(L_m, :) to T_B and broadcast it along each column of processors]
                C(:, :) = C(:, :) + T_A × T_B
             END DO
          END DO
       END DO
    END DO

Figure 8: The pseudocode of DIMMA. The DO loop over L3 is used if k_b is smaller than k_opt, where the routine handles M_X columns of blocks of A and M_X rows of blocks of B, whose block distance is LCM, simultaneously; L_m is used to select them correctly. The innermost DO loop over L4 is used if k_b is larger than k_opt, and the bracket in [L4] represents the L4-th thin vector.
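The loop structure of Figure 8 can be transcribed into Python to see which block columns of A (and block rows of B) are gathered at each step. The sketch below is ours: it only enumerates the index groups, omits the broadcasts and the local multiplication, clips the index range at K_g (a detail the pseudocode leaves implicit), and assumes Python 3.9+ for math.lcm.

    from math import ceil, lcm

    def dimma_groups(P, Q, K_g, k_b, k_opt):
        # Enumerate the groups of block-column indices selected by the
        # Figure 8 loop nest (L1, L2, L3, L4).
        LCM = lcm(P, Q)
        M_X = ceil(k_opt / k_b)
        L_X = LCM * M_X
        groups = []
        for L1 in range(Q):
            for L2 in range(LCM // Q):
                for L3 in range(ceil(K_g / L_X)):
                    for L4 in range(ceil(k_b / k_opt)):
                        start = L1 + L2 * Q + L3 * L_X
                        L_m = [l for l in range(start, (L3 + 1) * L_X, LCM) if l < K_g]
                        # [L4] selects the L4-th thin slice of each block when k_b > k_opt.
                        groups.append((L_m, L4))
        return groups

    # The parameters of the Figure 7 example: P = 2, Q = 3, K_g = 12, k_b = 10, k_opt = 20.
    for cols, thin in dimma_groups(2, 3, 12, 10, 20):
        print(cols)     # [0, 6], [3, 9], [1, 7], [4, 10], [2, 8], [5, 11]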

At first, it is assumed that there are P linearly connected processors, in which a column of blocks of A (= T_A) is broadcast along the P processors at each step and a row of blocks of B (= T_B) always stays in each processor. It is also assumed that the time for sending a column T_A to the next processor is t_c, and the time for multiplying T_A with T_B and adding the product to C is t_p. Actually, t_c = α + M k_b β and t_p = 2 M (N/P) k_b γ, where α is a communication start-up time, β is a data transfer time, and γ is a time for a multiplication or addition.

For SUMMA, the time difference between two successive pipelined broadcasts of T_A is 2 t_c + t_p. The total elapsed time of SUMMA with K_g columns of blocks on a 1-dimensional processor grid, t_summa^1D, is

    t_summa^1D = K_g (2 t_c + t_p) - t_c + (P - 2) t_c = K_g (2 t_c + t_p) + (P - 3) t_c.

For DIMMA, the time difference between two pipelined broadcasts is t_c + t_p if the T_A's are broadcast from the same processor. However, the time difference is 2 t_c + t_p if they come from different processors. The total elapsed time of DIMMA, t_dimma^1D, is

    t_dimma^1D = K_g (t_c + t_p) + (P - 1) t_c + (P - 2) t_c = K_g (t_c + t_p) + (2P - 3) t_c.
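Plugging the parameters of the Figure 3 simulation into these two expressions recovers the elapsed times quoted there: K_g = 8 broadcasts in total (4 processors with 2 data sets each), t_c = 0.2 seconds and t_p = 0.6 seconds.

    P, K_g = 4, 8          # 4 processors, each with 2 data sets to broadcast
    t_c, t_p = 0.2, 0.6    # send time and local-computation time from Figure 3

    t_summa_1d = K_g * (2 * t_c + t_p) + (P - 3) * t_c
    t_dimma_1d = K_g * (t_c + t_p) + (2 * P - 3) * t_c

    print(round(t_summa_1d, 1))    # 8.2 seconds, as in Figure 3(a)
    print(round(t_dimma_1d, 1))    # 7.4 seconds, as in Figure 3(b)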

On a 2-dimensional P × Q processor grid, the communication time of SUMMA is doubled in order to broadcast T_B as well as T_A. Assume again that the times for sending a column T_A and a row T_B to the next processor are t_ca and t_cb, respectively, and the time for multiplying T_A with T_B and adding the product to C is t_p. Actually, t_ca = α + (M/P) k_b β, t_cb = α + (N/Q) k_b β, and t_p = 2 (M/P) (N/Q) k_b γ. So,

    t_summa^2D = K_g (2 t_ca + 2 t_cb + t_p) + (Q - 3) t_ca + (P - 3) t_cb.    (1)

For DIMMA, each column of processors broadcasts T_A until everything it holds is sent. Meanwhile, rows of processors broadcast T_B if they have the T_B corresponding to the T_A. For a column of processors which currently broadcasts A, P/GCD rows of processors, whose distance is GCD, have rows of blocks of B to broadcast along with the T_A, where GCD is the greatest common divisor of P and Q. The extra idle wait, caused by broadcasting two T_B's when they are in different processors, is GCD × t_cb. Then the total extra waiting time to broadcast the T_B's is Q × (P/GCD) × GCD × t_cb = P Q t_cb.

However, if GCD = P, only one row of processors has T_B to broadcast corresponding to the column of processors, and the total extra waiting time is P × t_cb. So,

    t_dimma^2D = K_g (t_ca + t_cb + t_p) + (2Q - 3) t_ca + (P + Q - 3) t_cb     if GCD = P,
               = K_g (t_ca + t_cb + t_p) + (2Q - 3) t_ca + (PQ + P - 3) t_cb    otherwise.    (2)

The time difference between SUMMA and DIMMA is

    t_summa^2D - t_dimma^2D = (K_g - Q) t_ca + (K_g - P) t_cb      if GCD = P,
                            = (K_g - Q) t_ca + (K_g - PQ) t_cb     otherwise.    (3)

5. Implementation and Results

We implemented three algorithms, called SUMMA0, SUMMA and DIMMA, and compared their performance on the 512-node Intel Paragon at the Oak Ridge National Laboratory, Oak Ridge, U.S.A., and the 256-node Intel Paragon at the Samsung Advanced Institute of Technology, Suwon, Korea. SUMMA0 is the original version of SUMMA, which has the pipelined broadcasting scheme and the fixed block size, k_b. The local matrix multiplication in SUMMA0 is the rank-k_b update. SUMMA is a revised version of SUMMA0 with the LCM block concept for the optimized performance of DGEMM, so that the local matrix multiplication is a rank-k_approx update, where k_approx is computed in the implementation as follows:

    k_approx = ⌊k_opt / k_b⌋ × k_b       if k_opt ≥ k_b,
             = ⌊k_b / ⌈k_b / k_opt⌉⌋     otherwise.
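A small Python transcription of this rule (ours) makes the two branches explicit; with k_opt = 20 it gives k_approx = 20 for k_b = 1, 5, 20 or 100, and k_approx = 16 for k_b = 50.

    from math import ceil, floor

    def k_approx(k_b, k_opt=20):
        # Rank of the local update used with the LCM block concept.
        if k_opt >= k_b:
            return floor(k_opt / k_b) * k_b    # group several thin panels of width k_b
        return floor(k_b / ceil(k_b / k_opt))  # split a wide panel into near-equal pieces

    print([k_approx(k_b) for k_b in (1, 5, 20, 50, 100)])   # [20, 20, 20, 16, 20]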

    P × Q     Matrix Size     Block Size    SUMMA0    SUMMA    DIMMA
    ----------------------------------------------------------------
    8 × 8     2000 × 2000     1 × 1          1.135     2.678    2.735
                              5 × 5          2.488     2.730    2.735
                              20 × 20        2.505     2.504    2.553
                              50 × 50        2.633     2.698    2.733
                              100 × 100      1.444     1.945    1.948
    8 × 8     4000 × 4000     1 × 1          1.296     2.801    2.842
                              5 × 5          2.614     2.801    2.842
                              20 × 20        2.801     2.801    2.842
                              50 × 50        2.674     2.822    2.844
                              100 × 100      2.556     2.833    2.842
    12 × 8    4000 × 4000     1 × 1          1.842     3.660    3.731
                              5 × 5          3.280     3.836    3.917
                              20 × 20        3.928     3.931    4.006
                              50 × 50        3.536     3.887    3.897
                              100 × 100      2.833     3.430    3.435

Table 1: Dependence of performance on block size (Unit: Gflops)

First of all, we changed the block size, k_b, and observed how the block size affects the performance of the algorithms. Table 1 shows the performance for A = B = C = 2000 × 2000 and 4000 × 4000 on 8 × 8 and 12 × 8 processor grids with block sizes k_b = 1, 5, 20, 50, and 100.

At first, SUMMA0 and SUMMA are compared. In the extreme case of k_b = 1, SUMMA with the modified blocking scheme performed at least 100% better than SUMMA0. When k_b = 5, SUMMA shows 7 - 10% enhanced performance. If the block size is much larger than the optimal block size, that is, k_b = 50 or 100, SUMMA0 becomes inefficient again and it has difficulty in overlapping the communications with the computations. SUMMA outperformed SUMMA0 by about 5 - 10% when A = B = C = 4000 × 4000 and k_b = 50 or 100 on the 8 × 8 and 12 × 8 processor grids.

Note that on the 8 × 8 processor grid with 2000 × 2000 matrices, the performance for k_b = 20 or 100 is much slower than that of the other cases. When k_b = 100, the processors in the top half have 300 rows of the matrices, while those in the bottom half have just 200 rows. This leads to load imbalance among processors, and the processors in the top half require 50% more local computation.

Now SUMMA and DIMMA are compared. Figures 9 and 10 show the performance of SUMMA and DIMMA on 16 × 16 and 16 × 12 processor grids, respectively, with the fixed block size k_b = k_opt = 20. DIMMA always performs better than SUMMA on the 16 × 16 processor grid. These matrix multiplication algorithms require O(N^3) flops and O(N^2) communications; that is, the algorithms are computation intensive. For a small matrix of N = 1000, the performance difference between the two algorithms is about 10%. But for a large matrix, these algorithms require much more computation, so that the performance difference caused by the different communication schemes becomes negligible. For N = 8000, the performance difference is only about 2 - 3%.

Figure 9: Performance of SUMMA and DIMMA on a 16 × 16 processor grid, k_b = k_opt = 20 (Gflops versus matrix size M).

Figure 10: Performance of SUMMA and DIMMA on a 16 × 12 processor grid (Gflops versus matrix size M).

Figure 11: Predicted performance of SUMMA and DIMMA on a 16 × 16 processor grid (Gflops versus matrix size M).

Figure 12: Predicted performance of SUMMA and DIMMA on a 16 × 12 processor grid (Gflops versus matrix size M).

On the 16 × 12 processor grid, SUMMA performs slightly better than DIMMA for small matrices, such as N = 1000 and 2000. If P = 16 and Q = 12, then GCD = 4 ≠ P. For the problem with M = N = K = 2000 and k_opt = k_b = 20, K_g = K/k_b = 100. From Eq. (3),

    t_summa^2D - t_dimma^2D = (100 - 12) t_ca + (100 - 16 × 12) t_cb = 88 t_ca - 92 t_cb.

From this result, it is expected that SUMMA is faster than DIMMA for this problem if t_ca = t_cb.
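The same arithmetic can be spelled out in a few lines (ours); the snippet evaluates the GCD ≠ P branch of Eq. (3) for this configuration and confirms that the difference is negative, i.e. SUMMA finishes earlier, when t_ca = t_cb.

    from math import gcd

    P, Q, K_g = 16, 12, 100
    assert gcd(P, Q) == 4 and gcd(P, Q) != P        # the GCD != P branch of Eq. (3)

    def summa_minus_dimma(t_ca, t_cb):
        # Eq. (3), GCD != P case: (K_g - Q) t_ca + (K_g - P*Q) t_cb
        return (K_g - Q) * t_ca + (K_g - P * Q) * t_cb

    print(summa_minus_dimma(1.0, 1.0))   # 88 - 92 = -4: SUMMA is faster when t_ca = t_cb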

We predicted the performance on the Intel Paragon using Eqs. (1) and (2). Figures 11 and 12 show the predicted performance of SUMMA and DIMMA corresponding to Figures 9 and 10, respectively. We used α = 94.75 μsec, β = 0.02218 μsec (45 Mbytes/sec), and γ = 22.88 nsec (43.7 Mflops per node) for the predicted performance. These values are observed in practice.
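The bandwidth and per-node flop rate quoted in parentheses follow directly from β and γ; the two-line check below (ours) recovers them, assuming β is a per-byte transfer time, as the 45 Mbytes/sec figure suggests.

    beta = 0.02218e-6    # data transfer time per byte, in seconds
    gamma = 22.88e-9     # time per floating-point operation, in seconds

    print(round(1.0 / beta / 1e6, 1))    # about 45.1 Mbytes/sec
    print(round(1.0 / gamma / 1e6, 1))   # about 43.7 Mflops per node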

In Eq. (2), the idle wait of (2Q - 3) t_ca + (PQ + P - 3) t_cb when GCD ≠ P can be reduced by a slight modification of the communication scheme. For example, when P = 8 and Q = 4 (that is, GCD = Q), if a row of processors sends all of its rows of blocks of B instead of a column of processors sending all of its columns of blocks of A as in Figure 8, the waiting time is reduced to (P + Q - 3) t_ca + (2P - 3) t_cb.

The following example has another communication characteristic. After the first column and the first row of processors send their own A and the corresponding B, respectively, then, for the next step, the second column and the second row of processors send their A and B, respectively. The communication resembles that of SUMMA, but the processors send all corresponding blocks of A and B. The waiting time is (LCM + Q - 3) t_ca + (LCM + P - 3) t_cb. This modification is faster if 2 × GCD < MIN(P, Q).

The performance per node of SUMMA and DIMMA is shown in Figures 13 and 14, respectively, when memory usage per node is held constant. Both algorithms show good performance and scalability, but DIMMA is always better. If each processor has a local problem size of more than 200 × 200, DIMMA always reaches 40 Mflops per processor, whereas SUMMA obtains about 38 Mflops per processor.

Currently, the modified blocking scheme in DIMMA uses the rank-k_approx update. However, it is possible to modify DIMMA with the exact rank-k_opt update by dividing the virtually connected LCM blocks in each processor. The modification complicates the algorithm implementation, and since the performance of DGEMM is not sensitive to the value of k_opt if it is larger than 20, there would be no improvement in performance.

6. Conclusions

We have presented a new parallel matrix multiplication algorithm, called DIMMA, for block cyclic data distribution on distributed-memory concurrent computers. DIMMA is the most efficient and scalable matrix multiplication algorithm. DIMMA uses the modified pipelined broadcasting scheme to overlap computation and communication effectively, and exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine regardless of the block size.

Figure 13: Performance per node of SUMMA where memory use per node is held constant (Mflops versus the square root of the number of nodes). The five curves represent 100 × 100, 200 × 200, 300 × 300, 400 × 400, and 500 × 500 local matrices per node, from the bottom.

Figure 14: Performance per node of DIMMA where memory use per node is held constant (Mflops versus the square root of the number of nodes).

DIMMA always shows the same high performance even when the block size k_b is very small as well as very large, provided the matrices are evenly distributed among processors.

Acknowledgement

The author thanks Dr. Jack Dongarra of the University of Tennessee at Knoxville and Oak Ridge National Laboratory for providing computing facilities to perform this work, and he thanks the anonymous reviewers for their helpful comments, which improved the quality of the paper. This research was performed in part using the Intel Paragon computers at the Oak Ridge National Laboratory, Oak Ridge, U.S.A. and at the Samsung Advanced Institute of Technology, Suwon, Korea. It was supported by the Korean Ministry of Information and Communication under contract 96087-IT1-I2.

7. References

[1] R. C. Agarwal, F. G. Gustavson, and M. Zubair. A High-Performance Matrix-Multiplication Algorithm on a Distributed-Memory Parallel Computer Using Overlapped Communication. IBM Journal of Research and Development, 38(6):673-681, 1994.

[2] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK: A Portable Linear Algebra Library for High-Performance Computers. In Proceedings of Supercomputing '90, pages 1-10. IEEE Press, 1990.

[3] L. Blackford, J. Choi, A. Cleary, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance. In Proceedings of Supercomputing '96, 1996. http://www.supercomp.org/sc96/proceedings/.

[4] L. E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Thesis, Montana State University, 1969.

[5] J. Choi, J. J. Dongarra, S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley. A Proposal for a Set of Parallel Basic Linear Algebra Subprograms. LAPACK Working Note 100, Technical Report CS-95-292, University of Tennessee, 1995.

[6] J. Choi, J. J. Dongarra, R. Pozo, D. C. Sorensen, and D. W. Walker. CRPC Research into Linear Algebra Software for High Performance Computers. International Journal of Supercomputing Applications, 8(2):99-118, Summer 1994.

[7] J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers. Concurrency: Practice and Experience, 6:543-570, 1994.

[8] J. Choi, J. J. Dongarra, and D. W. Walker. PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subprograms. Concurrency: Practice and Experience, 8:517-535, 1996.

[9] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A Set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, 1990.

[10] R. D. Falgout, A. Skjellum, S. G. Smith, and C. H. Still. The Multicomputer Toolbox Approach to Concurrent BLAS and LACS. In Proceedings of the 1992 Scalable High Performance Computing Conference, pages 121-128. IEEE Press, 1992.

[11] G. C. Fox, S. W. Otto, and A. J. G. Hey. Matrix Algorithms on a Hypercube I: Matrix Multiplication. Parallel Computing, 4:17-31, 1987.

[12] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, second edition, 1989.

[13] M. T. Heath and J. A. Etheridge. Visualizing the Performance of Parallel Programs. IEEE Software, 8(5):29-39, September 1991.

[14] S. Huss-Lederman, E. M. Jacobson, and A. Tsao. Comparison of Scalable Parallel Multiplication Libraries. In The Scalable Parallel Libraries Conference, Starkville, MS, pages 142-149. IEEE Computer Society Press, October 6-8, 1993.

[15] S. Huss-Lederman, E. M. Jacobson, A. Tsao, and G. Zhang. Matrix Multiplication on the Intel Touchstone Delta. Concurrency: Practice and Experience, 6:571-594, 1994.

[16] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA, 1994.

[17] A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. Ph.D. Thesis, University of Tennessee, Knoxville, 1996.

[18] R. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. LAPACK Working Note 99, Technical Report CS-95-286, University of Tennessee, 1995.

[19] R. A. van de Geijn. Using PLAPACK. The MIT Press, Cambridge, 1997.