A Flexible Class of Parallel Matrix Multiplication Algorithms

John Gunnels   Calvin Lin   Greg Morrow   Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin
Austin, TX 78712

Abstract

This paper explains why parallel implementation of matrix multiplication—a seemingly simple algorithm that can be expressed as one statement and three nested loops—is complex: Practical algorithms that use matrix multiplication tend to use matrices of disparate shapes, and the shape of the matrices can significantly impact the performance of matrix multiplication. We provide a class of algorithms that covers the spectrum of shapes encountered and demonstrate that good performance can be attained if the right algorithm is chosen. These observations set the stage for hybrid algorithms which choose between the algorithms based on the shapes of the matrices involved. While the paper resolves a number of issues, it concludes with discussion of a number of directions yet to be pursued.

* This research was supported primarily by the PRISM project (ARPA grant P-95006) and the Environmental Molecular Sciences construction project at Pacific Northwest National Laboratory. Additional support came from the NASA HPCC Program's Earth and Space Sciences Project (NRA Grants NAG5-2497 and NAG5-2511), and the Intel Research Council. Equipment was provided by the National Partnership for Advanced Computational Infrastructure, the University of Texas Computation Center's High Performance Computing Facility, and the Texas Institute for Computational and Applied Mathematics.

1 Introduction

Over the last three decades, a number of different approaches have been proposed for implementing matrix-matrix multiplication on distributed memory architectures. These include Cannon's algorithm [4], broadcast-multiply-roll [12, 11], and generalizations of broadcast-multiply-roll [8, 13, 2]. The approach now considered the most practical, known as broadcast-broadcast, was first proposed by Agarwal et al. [1], who showed that a sequence of parallel rank-k updates is a highly effective way to parallelize $C = AB$. This same observation was independently made by van de Geijn and Watts [17], who introduced the Scalable Universal Matrix Multiplication Algorithm (SUMMA). In addition to computing $C = AB$, SUMMA implements $C = AB^T$ and $C = A^T B$ as a sequence of matrix-panel-of-vectors multiplies, and $C = A^T B^T$ as a sequence of rank-k updates. Later work [5] showed how these techniques can be extended to a large class of commonly used matrix-matrix operations that are part of the Level-3 Basic Linear Algebra Subprograms (BLAS) [9].

These previous efforts have focused primarily on the special case where the input matrices are approximately square. Recent work by Li, et al. [15] creates a "poly-algorithm" that chooses between algorithms (Cannon's, broadcast-multiply-roll, and broadcast-broadcast). Empirical data show some advantage for different shapes of meshes (e.g., square vs. rectangular). However, since the three algorithms are not inherently suited to specific shapes of matrices, limited benefit is observed.

We previously observed [16] the need for algorithms to be sensitive to the shape of the matrices and hinted that a class of algorithms naturally supported by the Parallel Linear Algebra Package (PLAPACK) provides a good basis for such shape-adaptive algorithms. Thus, a number of hybrid algorithms are now part of PLAPACK. This observation was also made by the ScaLAPACK [6] project, and indeed ScaLAPACK includes a number of matrix-matrix operations that choose algorithms based on the shape of the matrix. We contrast our approach with ScaLAPACK's later.

This paper describes and analyzes a class of parallel matrix multiplication algorithms that naturally lends itself to hybridization. The analysis combines theoretical results with empirical results to provide a complete picture of the benefits of the different algorithms and how they can be combined into hybrid algorithms. A number of simplifications are made to focus on the key issues. Having done so, a number of extensions are identified for further study.

2 Data Distribution

For all algorithms, we will assume that the processors are logically viewed as an $r \times c$ mesh of computational nodes which are indexed $P_{i,j}$, $0 \leq i < r$ and $0 \leq j < c$, so that the total number of nodes is $p = r c$. Physically, these nodes are connected through some communication network, which could be a hypercube (Intel iPSC/860), a higher dimensional mesh (Intel Paragon, Cray T3D/E) or a multistage network (IBM SP2).

Physically Based Matrix Distribution (PBMD). We have previously observed [10] that data distributions should focus on the decomposition of the physical problem to be solved, rather than on the decomposition of matrices. Typically, it is the elements of vectors that are associated with data of physical significance, and so it is their distribution to nodes that is significant. A matrix, which is a discretized operator, merely represents the relation between two vectors, which are discretized spaces: $y = Ax$. Since it is more natural to start by distributing the problem to nodes, we first distribute $x$ and $y$ to nodes. The matrix $A$ is then distributed so as to be consistent with the distribution of $x$ and $y$.

To describe the distribution of the vectors, assume that $x$ and $y$ are of length $n$ and $m$, respectively. We will distribute these vectors using a distribution block size of $b_{distr}$. For simplicity assume $n = N b_{distr}$ and $m = M b_{distr}$. Partition $x$ and $y$ so that
\[
x = \left( \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_N \end{array} \right)
\quad \mbox{and} \quad
y = \left( \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_M \end{array} \right)
\]
where $N$ and $M$ are typically much larger than $p$, and each $x_i$ and $y_i$ is of length $b_{distr}$. Partitioning $A$ conformally yields the blocked matrix
\[
A = \left( \begin{array}{cccc}
A_{1,1} & A_{1,2} & \cdots & A_{1,N} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,N} \\
\vdots  & \vdots  &        & \vdots  \\
A_{M,1} & A_{M,2} & \cdots & A_{M,N}
\end{array} \right)  \qquad (1)
\]
where each subblock $A_{i,j}$ is of size $b_{distr} \times b_{distr}$. Blocks of $x$ and $y$ are assigned to a 2D mesh of nodes in column-major order. The matrix distribution is induced by assigning blocks of columns $A_{\star j}$ to the same column of nodes as subvector $x_j$, and blocks of rows $A_{i \star}$ to the same row of nodes as subvector $y_i$ [16]. This distribution wraps blocks of rows and columns onto rows and columns of processors, respectively. The wrapping is necessary to improve load balance.
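The induced mapping just described is easy to state in code. The following is a minimal sketch, assuming 0-based block indices and the column-major wrapping of vector blocks onto the $r \times c$ mesh described above; the routine names are ours and are not PLAPACK's interface.

\begin{verbatim}
#include <stdio.h>

typedef struct { int row, col; } node_t;

/* Node that owns vector block i of x or y: blocks are wrapped onto the
   r x c mesh in column-major order (an assumption consistent with the
   description above; indices here are 0-based). */
node_t vector_block_owner(int i, int r, int c) {
    node_t n;
    n.row = i % r;        /* wrap down a column of nodes first        */
    n.col = (i / r) % c;  /* then move to the next column of the mesh */
    return n;
}

/* Node that owns matrix block A(i,j): same row of nodes as y_i,
   same column of nodes as x_j. */
node_t matrix_block_owner(int i, int j, int r, int c) {
    node_t n;
    n.row = vector_block_owner(i, r, c).row;
    n.col = vector_block_owner(j, r, c).col;
    return n;
}

int main(void) {
    int r = 2, c = 3;     /* a hypothetical 2 x 3 mesh of nodes */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++) {
            node_t n = matrix_block_owner(i, j, r, c);
            printf("A(%d,%d) -> P(%d,%d)\n", i, j, n.row, n.col);
        }
    return 0;
}
\end{verbatim}

Printing the assignment for a small mesh shows how blocks of rows and columns wrap over the rows and columns of nodes.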

Data movement. In the PBMD representation, vectors are fully distributed (not duplicated), while a single row or column of a matrix will reside in a single row or column of processors. One benefit of the PBMD representation is the clean manner in which vectors, matrix rows, and matrix columns interact. Converting a column of a matrix to a vector requires a scatter communication within rows. Similarly, a panel of columns can be converted into a multivector (a group of vectors) by simultaneously scattering the columns within rows. For details, see [16].

3 A Class of Algorithms

The target operation will be
\[
C = \alpha A B + \beta C
\]
where for simplicity we will often treat only the case where $\alpha = 1$ and $\beta = 0$. There are three dimensions involved: $m$, $n$, and $k$, where $C$, $A$, and $B$ are $m \times n$, $m \times k$, and $k \times n$, respectively.
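Sequentially, this target operation is the "one statement and three nested loops" mentioned in the abstract. A minimal reference sketch (column-major storage; the routine name and argument list are ours, not those of any particular library):

\begin{verbatim}
/* C := alpha*A*B + beta*C, with A m-by-k, B k-by-n, C m-by-n,
   all stored in column-major order with leading dimensions lda, ldb, ldc. */
void gemm_ref(int m, int n, int k, double alpha,
              const double *A, int lda, const double *B, int ldb,
              double beta, double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double cij = beta * C[i + j * ldc];
            for (int p = 0; p < k; p++)
                cij += alpha * A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = cij;
        }
}
\end{verbatim}

The remainder of this section is about organizing this computation so that, for a distributed matrix, each node performs most of its work inside a local call of exactly this form.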

It is always interesting to investigate extremal cases, which happen when one or more of the dimensions equal unity:

One dimension equals unity:

- $k = 1$: For $m$, $n$ large, $A$ and $B$ become column and row vectors, respectively, and the operation becomes a rank-1 update.

- $n = 1$: For $m$, $k$ large, $C$ and $B$ are column vectors, and the operation becomes a matrix-vector multiply.

- $m = 1$: For $n$, $k$ large, $C$ and $A$ are row vectors, and the operation becomes a row vector-matrix multiply.

Two dimensions equal unity:

- $n = k = 1$: For $m$ large, $C$ and $A$ become column vectors, and $B$ a scalar. The operation becomes a scaled vector addition (axpy).

- $m = k = 1$: For $n$ large, $C$ and $B$ become row vectors, and $A$ a scalar. Again, the operation becomes a scaled vector addition.

- $m = n = 1$: For $k$ large, $A$ and $B$ become row and column vectors, respectively, and $C$ a scalar. The operation becomes an inner-product (dot).

Implementing the matrix-matrix multiplication for these degenerate cases is relatively straightforward: The rank-1 update is implemented by duplicating vectors and updating local parts of $C$ on each node. The matrix-vector multiplications are implemented by duplicating the vector to be multiplied, performing local matrix-vector multiplications on each node, and reducing the result to $C$. In the cases where two dimensions are small, the column or row vectors are redistributed like vectors (as described in the previous section) and local operations are performed on these vectors, after which the results are redistributed as needed.

In the next section, we will discuss how the case where none of the dimensions are unity can be implemented as a sequence of the above operations, by partitioning the operands appropriately. We present this along with one minor extension: rather than dealing with one row or column vector at a time, a number of these vectors are treated as a block (panel or multivector). This reduces the number of messages communicated and improves the performance on each node by allowing matrix-matrix multiplication kernels to be called locally.

3.1 Notation

For simplicity, we will assume that the dimensions $m$, $n$, and $k$ are integer multiples of the algorithmic block size $b_{alg}$. We will use the following partitionings of $A$, $B$, and $C$:
\[
X = \left( X_1 \mid X_2 \mid \cdots \mid X_{n_X / b_{alg}} \right)
  = \left( \begin{array}{c} \hat X_1 \\ \hat X_2 \\ \vdots \\ \hat X_{m_X / b_{alg}} \end{array} \right)
\]
where $X \in \{ A, B, C \}$, and $m_X$ and $n_X$ are the row and column dimension of the indicated matrix. Here $X_j$ and $\hat X_i$ indicate panels of columns and rows, respectively: $X_j$ is $m_X \times b_{alg}$ and $\hat X_i$ is $b_{alg} \times n_X$.

In these partitionings, $b_{alg}$ is chosen to maximize the performance of the local matrix-matrix multiplication operation. In the discussion below, we will use $\hat M = m / b_{alg}$, $\hat N = n / b_{alg}$, and $\hat K = k / b_{alg}$.

3.2 Two dimensions are "large"

First we present algorithms that use kernels that correspond to the extremal cases discussed above where one dimension equals unity.

Panel-panel update based algorithm: $m$ and $n$ large. Notice that if $m$ and $n$ are both large compared to $k$, then $C$ contains the most elements, and an algorithm that moves data in $A$ and $B$ rather than $C$ may be a reasonable choice. Observe that
\[
C = A B = \left( A_1 \mid A_2 \mid \cdots \mid A_{\hat K} \right)
\left( \begin{array}{c} \hat B_1 \\ \hat B_2 \\ \vdots \\ \hat B_{\hat K} \end{array} \right)
= A_1 \hat B_1 + A_2 \hat B_2 + \cdots + A_{\hat K} \hat B_{\hat K} .
\]
Thus, if we know how to perform one update $C \leftarrow C + A_i \hat B_i$ in parallel, the complete matrix-matrix multiplication can be implemented as $\hat K$ calls to this "panel-panel multiplication" kernel. In [17, 16] it is shown that one panel-panel update can be implemented by the following sequence of operations: Duplicate (broadcast) $A_i$ within rows; Duplicate (broadcast) $\hat B_i$ within columns; On each node perform an update to the local portion of $C$. Computing $C = AB$ is then implemented by repeated calls to this kernel, once for each $A_i$, $\hat B_i$ pair.
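In terms of plain message passing, the loop just described takes the following shape. This is a hedged sketch only: it uses raw MPI rather than PLAPACK's calls, and the communicators, ownership functions, panel-loading callbacks, and local multiply are assumptions introduced for the illustration.

\begin{verbatim}
#include <mpi.h>

/* One call computes Cloc += sum_i A_i * B^_i for this node's block of C.
   row_comm: the c nodes in this node's row of the mesh (rank = mesh column)
   col_comm: the r nodes in this node's column of the mesh (rank = mesh row)
   owner_col(i)/owner_row(i): which mesh column/row holds A_i / B^_i
   load_Apanel/load_Bpanel: the owner copies its panel into the buffer
   local_gemm(m,n,k,A,B,C): any local multiply performing C += A*B        */
void panel_panel_mm(int Khat, int b, int mloc, int nloc,
                    double *Abuf, double *Bbuf, double *Cloc,
                    MPI_Comm row_comm, MPI_Comm col_comm,
                    int (*owner_col)(int), int (*owner_row)(int),
                    void (*load_Apanel)(int, double *),
                    void (*load_Bpanel)(int, double *),
                    void (*local_gemm)(int, int, int, const double *,
                                       const double *, double *))
{
    int my_col, my_row;
    MPI_Comm_rank(row_comm, &my_col);
    MPI_Comm_rank(col_comm, &my_row);

    for (int i = 0; i < Khat; i++) {
        if (my_col == owner_col(i)) load_Apanel(i, Abuf);
        if (my_row == owner_row(i)) load_Bpanel(i, Bbuf);

        /* Duplicate (broadcast) A_i within rows of the mesh ...         */
        MPI_Bcast(Abuf, mloc * b, MPI_DOUBLE, owner_col(i), row_comm);
        /* ... and B^_i within columns of the mesh.                      */
        MPI_Bcast(Bbuf, b * nloc, MPI_DOUBLE, owner_row(i), col_comm);

        /* Rank-b update of the local portion of C.                      */
        local_gemm(mloc, nloc, b, Abuf, Bbuf, Cloc);
    }
}
\end{verbatim}

Note that each iteration communicates only a column panel of $A$ and a row panel of $B$; $C$, which holds the most elements in this regime, is never moved.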

Matrix-panel multiply based algorithm: $m$ and $k$ large. If $m$ and $k$ are both large compared to $n$, then $A$ contains the most elements, and an algorithm that moves data in $C$ and $B$ rather than $A$ may be a reasonable choice. Observe that
\[
C = \left( C_1 \mid C_2 \mid \cdots \mid C_{\hat N} \right)
  = A \left( B_1 \mid B_2 \mid \cdots \mid B_{\hat N} \right)
\]
so that $C_j = A B_j$. Thus, if we know how to perform $C_j = A B_j$ in parallel, the complete matrix-matrix multiplication can be implemented as $\hat N$ calls to this "matrix-panel multiplication" kernel. In [16] it is shown that one matrix-panel update can be implemented by the following sequence of operations: Scatter $B_j$ within rows, followed by a collect (all-gather) within columns to duplicate it; Perform a matrix-multivector multiplication locally on each node, yielding contributions to $C_j$; Reduce (sum) the result to $C_j$. By calling this kernel for each $C_j = A B_j$ pair, $C$ is computed.

Panel-matrix multiply based algorithm: $n$ and $k$ large. Finally, if $n$ and $k$ are both large compared to $m$, then $B$ contains the most elements, and an algorithm that moves data in $C$ and $A$ rather than $B$ may be a reasonable choice. The algorithm for this case is similar to the matrix-panel based algorithm, except that it depends on a panel-matrix kernel instead.
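Ignoring communication, the three variants of this subsection differ only in which dimension is blocked; sequentially, each is just a different loop around the reference multiply. A sketch (reusing gemm_ref from the earlier listing; column-major storage, with b dividing m, n, and k, and C assumed already scaled by beta, e.g. zeroed when beta = 0):

\begin{verbatim}
/* From the earlier sketch. */
void gemm_ref(int m, int n, int k, double alpha,
              const double *A, int lda, const double *B, int ldb,
              double beta, double *C, int ldc);

/* Panel-panel: C += A_1*B^_1 + ... + A_Khat*B^_Khat (block the k dimension). */
void by_panel_panel(int m, int n, int k, int b,
                    const double *A, const double *B, double *C)
{
    for (int p = 0; p < k; p += b)
        gemm_ref(m, n, b, 1.0, &A[(long)p * m], m, &B[p], k, 1.0, C, m);
}

/* Matrix-panel: C_j = A * B_j for each column panel (block the n dimension). */
void by_matrix_panel(int m, int n, int k, int b,
                     const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j += b)
        gemm_ref(m, b, k, 1.0, A, m, &B[(long)j * k], k, 1.0, &C[(long)j * m], m);
}

/* Panel-matrix: C^_i = A^_i * B for each row panel (block the m dimension). */
void by_panel_matrix(int m, int n, int k, int b,
                     const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i += b)
        gemm_ref(b, n, k, 1.0, &A[i], m, B, k, 1.0, &C[i], m);
}
\end{verbatim}

In the parallel setting, the choice among these loop orders determines which operand is communicated, which is exactly the distinction drawn above.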

3.3 Two dimensions are "small"

Next we present algorithms that use kernels that correspond to the extremal cases discussed above where two dimensions equal unity. Notice that implementation of these kernels is unique to PLAPACK, since other libraries like ScaLAPACK do not provide a facility for distributing vectors other than as columns or rows of matrices.

Column-axpy based algorithm: $m$ large. Notice that if $n = k = 1$, the matrix-matrix multiplication becomes a scaled vector addition: $y \leftarrow \alpha x + y$, sometimes known as an "axpy" operation. If $n$ and $k$ are both small (e.g., equal to one), then $C$ and $A$ will exist within only one or a few columns of nodes, while $B$ is a small matrix that exists within one node, or at most a few nodes. To get good parallelism, it becomes beneficial to redistribute the columns of $C$ and $A$ like vectors.

Observe that
\[
\left( C_1 \mid \cdots \mid C_{\hat N} \right)
= \left( A_1 \mid \cdots \mid A_{\hat K} \right)
\left( \begin{array}{ccc}
B_{1,1} & \cdots & B_{1,\hat N} \\
\vdots  &        & \vdots       \\
B_{\hat K,1} & \cdots & B_{\hat K,\hat N}
\end{array} \right)
\]
so that $C_j = A_1 B_{1,j} + \cdots + A_{\hat K} B_{\hat K,j}$. Thus, if we know how to perform one update $C_j \leftarrow C_j + A_q B_{q,j}$ in parallel, the complete computation of $C_j$ can be implemented as $\hat K$ calls to this "axpy" kernel, and the complete matrix-matrix multiply is accomplished by computation of all $C_j$'s. The following operations implement the computation of $C_j$:

- Redistribute (scatter) the columns of $C_j$ like vectors
- for $q = 1, \ldots, \hat K$
  – Redistribute (scatter) the columns of $A_q$ like vectors; Duplicate (broadcast) $B_{q,j}$ to all nodes; Update the local part of $C_j \leftarrow C_j + A_q B_{q,j}$.
- Redistribute (gather) the columns of $C_j$ to where they started

Repeated calls to this kernel compute the final $C$.
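The control structure of this kernel follows the list above directly. In the sketch below, every routine name (scatter_C_columns_as_vectors and the rest) is a hypothetical placeholder standing in for the PLAPACK redistribution facilities the text refers to; only the loop structure is meant to carry information.

\begin{verbatim}
/* Compute one C_j by Khat axpy-style updates.  All callbacks are
   hypothetical placeholders; q and j are 1-based as in the text. */
void compute_Cj_by_axpy(int j, int Khat,
                        double *Cj_vecs,  /* columns of C_j, viewed as vectors  */
                        double *Aq_vecs,  /* workspace for the columns of A_q   */
                        double *Bqj,      /* duplicated block B_{q,j}           */
                        void (*scatter_C_columns_as_vectors)(int, double *),
                        void (*scatter_A_columns_as_vectors)(int, double *),
                        void (*broadcast_B_block)(int, int, double *),
                        void (*local_update)(const double *, const double *,
                                             double *),
                        void (*gather_C_columns_back)(int, const double *))
{
    /* Redistribute (scatter) the columns of C_j like vectors. */
    scatter_C_columns_as_vectors(j, Cj_vecs);

    for (int q = 1; q <= Khat; q++) {
        /* Redistribute (scatter) the columns of A_q like vectors. */
        scatter_A_columns_as_vectors(q, Aq_vecs);
        /* Duplicate (broadcast) B_{q,j} to all nodes. */
        broadcast_B_block(q, j, Bqj);
        /* Update the local part of C_j <- C_j + A_q * B_{q,j}. */
        local_update(Aq_vecs, Bqj, Cj_vecs);
    }

    /* Redistribute (gather) the columns of C_j to where they started. */
    gather_C_columns_back(j, Cj_vecs);
}
\end{verbatim}

The scatter and gather of $C_j$ happen once per $j$, while the loop body runs $\hat K$ times, which is the amortization argument made next.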

It may appear that the cost of redistributing $C_j$ and $A_q$ is prohibitive, making this algorithm inefficient. Notice, however, that the redistribution of $C_j$ is potentially amortized over many updates. Also, the ratio of the computation with $A_q$ to the cost of redistributing this data is typically $b_{alg} : 1$. Thus, when $k$ is small but comparable to $b_{alg}$, which itself is reasonably large, this approach is a viable alternative.

Row-axpy based algorithm: $n$ large. Notice that if $m = k = 1$, the matrix-matrix multiplication again becomes a scaled vector addition, except that now rows of $C$ and $B$ are redistributed like vectors. The algorithm is similar to the last case discussed.

Dot based algorithm: $k$ large. Notice that if $m = n = 1$, the matrix-matrix multiplication becomes an inner-product between one row of $A$ and one column of $B$. For $m$ and $n$ small, we observe that
\[
\left( \begin{array}{ccc}
C_{1,1} & \cdots & C_{1,\hat N} \\
\vdots  &        & \vdots       \\
C_{\hat M,1} & \cdots & C_{\hat M,\hat N}
\end{array} \right)
= \left( \begin{array}{c} \hat A_1 \\ \vdots \\ \hat A_{\hat M} \end{array} \right)
\left( B_1 \mid \cdots \mid B_{\hat N} \right)
\]
so that $C_{i,j} = \hat A_i B_j$. Computation of each $C_{i,j}$ can be accomplished by redistributing the columns of $B_j$ and the rows of $\hat A_i$ like vectors, performing local operations that are like local inner-products, which are then reduced to $C_{i,j}$.

3.4 Cost analysis

Notice that the above algorithms are implemented as an iteration over a sequence of communications and computations. By modeling the cost of each of the components, one can derive models for the cost of the different algorithms. Thus, we obtain a performance model based on the estimated cost of local computation as well as models of the communication required. Parameters for the model include the cost of local computation, message latency and bandwidth, and the cost of packing and unpacking data before and after communications. Due to space limitations, we do not develop this cost analysis in this paper. Rather, we present graphs for the estimated cost of the different algorithms in the next section.
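As a rough illustration of the kind of model meant here (and only that: the tree-broadcast term, the parameters, and the routine below are our assumptions, not the authors' model), the estimated time for one iteration of the panel-panel algorithm might be written as follows, with alpha the message latency, beta the per-double transfer time, and gamma the time per floating-point operation.

\begin{verbatim}
#include <math.h>

/* Estimated time of one panel-panel iteration on an r x c mesh:
   broadcast an (m/r x b) piece of A_i within a row of nodes, a
   (b x n/c) piece of B^_i within a column, then do the local
   rank-b update. */
double panel_panel_iteration_time(double m, double n, double b,
                                  double r, double c,
                                  double alpha, double beta, double gamma)
{
    double mloc = m / r, nloc = n / c;
    double bcast_A = ceil(log2(c)) * (alpha + mloc * b * beta);
    double bcast_B = ceil(log2(r)) * (alpha + b * nloc * beta);
    double local   = 2.0 * mloc * nloc * b * gamma;
    return bcast_A + bcast_B + local;
}
\end{verbatim}

Summing per-iteration estimates of this kind over the iterations of each algorithm, with the appropriate redistribution terms, yields estimated-cost curves of the kind plotted in the next section.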

4 Results

This section provides performance results for the described algorithms on the Cray T3E and IBM SP-2. Our implementations are coded using PLAPACK, which naturally supports these kinds of algorithms. In addition, we report results from our performance model, using parameters measured for the different architectures.

Ultimately, it is the performance of the local 64-bit matrix-matrix multiplication implementation that determines the asymptotic performance rate that can be achieved. Since the algorithms make repeated calls to parallel implementations of kernels that assume that one or more of the dimensions are small, the operands to the matrix-matrix multiplication performed on each individual processor also have different shapes that depend on the algorithm chosen. Even on an individual processor the shape of the operands can greatly affect the performance of the local computation, as is demonstrated in Table 1. In that table, the small dimensions were set equal to the blocking size used for the parallel algorithm, specifically 128 for the Cray T3E. It is this performance that becomes an asymptote for the per-node performance of the parallel implementation. It should be noted that since these timings were first performed, Cray has updated its BLAS library, and performance for the sequential matrix-matrix multiply kernels is much less shape dependent. Thus, the presented data is of qualitative interest, but performance is now considerably better than presented in the performance graphs.

       m      n      k   Rate (MFLOPS/sec)
    4096   4096    128   439
    4096    128   4096   172
     128   4096   4096   237
    4096    128    128   230
     128   4096    128   184
     128    128   4096   184

Table 1: Performance of the local 64-bit matrix-matrix multiplication kernel on one processor of the Cray T3E.

It is interesting to note that the local matrix-matrix multiply performs best for panel-panel update operations. To those of us experienced with high-performance architectures this is not a surprise: This is the case that is heavily used by the LINPACK benchmark, and thus the first to be fully optimized. There is really no technical reason why, for a block size of 128, the other shapes do not perform equally well. Indeed, as an architecture matures, those cases also get optimized. For example, on the Cray T3D the performance of the local matrix-matrix BLAS kernel is not nearly as dependent on the shape.

In Fig. 1, we report performance on the Cray T3E of the different algorithms. By fixing one or more dimensions to be large, we demonstrate that, indeed, the appropriateness of the presented algorithms is dependent upon the shape of the operands. Of particular interest are the asymptotes reached by the different algorithms and how quickly the asymptotes are approached.

Let us start by concentrating on the top two graphs of Fig. 1. Notice that when $m$ and $n$ are large, it is the panel-panel based algorithm that attains high performance quickly. This is not surprising, since the implementation makes repeated calls to an algorithm that is specifically designed for small $k$. The matrix-panel based algorithms suffer from the fact that for small $k$ they incur severe load imbalance. The dot and axpy based algorithms incur considerable overhead, which is particularly obvious when $k$ is small, and which also hampers the asymptotic behavior of these algorithms.

In the middle two graphs of Fig. 1 we show that when $m$ and $k$ are large, a matrix-panel algorithm is most appropriate, at least when $n$ is small. However, since the local matrix-matrix multiply does not perform as well in the matrix-panel case, eventually the panel-panel based algorithm outperforms the others.

In the bottom two graphs of Fig. 1, we show that when only one dimension is large it becomes advantageous to consider the dot or axpy based algorithms.

[Figure 1 appears here: six graphs of MFLOPS/node versus matrix dimension, comparing the panpan, matpan, panmat, dot, axpy row, and axpy col algorithms; the left column shows model predictions and the right column shows measured Cray T3E data.]

Figure 1: Performance of the various algorithms on a 16 processor configuration of the Cray T3E. Within each row of graphs, we compare predicted performance (left) to observed performance (right). In the top row of graphs, $m$ and $n$ are fixed to be large, and $k$ is varied. In the middle row, $m$ and $k$ are fixed to be large, and $n$ is varied. Finally, in the bottom row, $n$ is fixed to be large, and $m$ and $k$ are simultaneously varied.

5 Conclusions and Future Work

We have presented a flexible class of parallel matrix multiplication algorithms. Theoretical and experimental results show that by choosing the appropriate algorithm from this class, high performance can be attained across a range of matrix shapes.

The analysis provides a means for choosing between the algorithms based on the shape of the operands. Furthermore, the developed techniques can easily be used to parallelize other cases of matrix-matrix multiplication ($C = AB^T$, $C = A^T B$, and $C = A^T B^T$) and other matrix-matrix operations, such as those included in the Level-3 BLAS [9]. We intend to pursue this line of research in the near future.
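A hybrid code can therefore dispatch on the operand shapes along the lines sketched below; the thresholds and names here are illustrative only and are not PLAPACK's actual selection logic.

\begin{verbatim}
/* Illustrative shape-based selection among the algorithm classes of
   Section 3.  "small" is a caller-supplied cutoff on the order of the
   algorithmic block size. */
typedef enum { PANEL_PANEL, MATRIX_PANEL, PANEL_MATRIX,
               COLUMN_AXPY, ROW_AXPY, DOT_BASED } mm_algorithm;

mm_algorithm choose_algorithm(int m, int n, int k, int small)
{
    int m_small = (m <= small), n_small = (n <= small), k_small = (k <= small);

    if (!m_small && !n_small) return PANEL_PANEL;   /* m, n large: move A and B */
    if (!m_small && !k_small) return MATRIX_PANEL;  /* m, k large: move C and B */
    if (!n_small && !k_small) return PANEL_MATRIX;  /* n, k large: move C and A */
    if (!m_small)             return COLUMN_AXPY;   /* only m large             */
    if (!n_small)             return ROW_AXPY;      /* only n large             */
    return DOT_BASED;                               /* only k large, or all small */
}
\end{verbatim}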


Acknowledgments. We are indebted to the rest of the PLAPACK team (Philip Alpatov, Greg Baker, Carter Edwards, James Overfelt, and Yuan-Jye Jason Wu) for providing the infrastructure that made this research possible.

References

[1] Agarwal, R. C., F. Gustavson, and M. Zubair, “A high-performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication,” IBM J. of Research and Development, 38(6), 1994.

[2] Agarwal, R. C., S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, “A 3-Dimensional Approach to Parallel Matrix Multiplication,” IBM J. of Research and Development, 39(5), pp. 1–8, 1995.

[3] Barnett, M., S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and J. Watts, “Interprocessor Collective Communication Library (InterCom),” Scalable High Performance Computing Conf., 1994.

[4] Cannon, L. E., A Cellular Computer to Implement the Kalman Filter Algorithm, Ph.D. Thesis (1969), Montana State University.

[5] Chtchelkanova, A., J. Gunnels, G. Morrow, J. Overfelt, and R. van de Geijn, “Parallel Implementation of BLAS: General Techniques for Level 3 BLAS,” TR-95-40, Dept. of Computer Sciences, UT-Austin, Oct. 1995.

[6] Choi, J., J. J. Dongarra, R. Pozo, and D. W. Walker, “ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers,” Fourth Symposium on the Frontiers of Massively Parallel Computation, IEEE Comput. Soc. Press, 1992, pp. 120–127.

[7] Choi, J., J. J. Dongarra, and D. W. Walker, “Level 3 BLAS for distributed memory concurrent computers,” CNRS-NSF Workshop on Environments and Tools for Parallel Scientific Computing, Sept. 7–8, 1992, Elsevier Science Publishers, 1992.

[8] Choi, J., J. J. Dongarra, and D. W. Walker, “PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers,” Concurrency: Practice and Experience, 6(7), 543–570, 1994.

[9] Dongarra, J. J., J. Du Croz, S. Hammarling, and I. Duff, “A Set of Level 3 Basic Linear Algebra Subprograms,” TOMS, 16 (1) pp. 1–16, 1990.

[10] Edwards, C., P. Geng, A. Patra, and R. van de Geijn, Parallel matrix distributions: have we been doing it all wrong?, Tech. Report TR-95-40, Dept of Computer Sciences, UT-Austin, 1995.

[11] Fox, G., M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, Vol. 1, Prentice Hall, Englewood Cliffs, N.J., 1988.

[12] Fox, G., S. Otto, and A. Hey, “Matrix algorithms on a hyper- cube I: matrix multiplication,” Parallel Computing 3 (1987), pp 17-31.

[13] Huss-Lederman, S., E. Jacobson, A. Tsao, G. Zhang, “Ma- trix Multiplication on the Intel Touchstone DELTA,” Concur- rency: Practice and Experience, 6 (7) Oct. 1994, pp. 571-594.

[14] Lin, C., and L. Snyder, “A Matrix Product Algorithm and its Comparative Performance on Hypercubes,” in Scalable High Performance Computing Conference, IEEE Press, 1992, pp. 190–3.

[15] Li, J., A. Skjellum, and R. D. Falgout, “A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies,” Concurrency: Practice and Experience, 9(5), 345–389, May 1997.

[16] van de Geijn, R., Using PLAPACK: Parallel Linear Algebra Package, The MIT Press, 1997.

[17] van de Geijn, R. and J. Watts, “SUMMA: Scalable Univer- sal Matrix Multiplication Algorithm,” Concurrency: Practice and Experience, 9(4) 255–274, April 1997.