Implementation of Strassen's Algorithm for Matrix Multiplication
Steven Huss-Lederman (2), Elaine M. Jacobson (3), Jeremy R. Johnson (4), Anna Tsao (5), Thomas Turnbull (6)

August 1, 1996

(1) This work was partially supported by the Applied and Computational Mathematics Program, Defense Advanced Research Projects Agency, under Contract P-95006.
(2) Computer Sciences Department, University of Wisconsin-Madison, 1210 W. Dayton St., Madison, WI 53706, Phone: (608) 262-0664, FAX: (608) 262-9777, email: [email protected]
(3) Center for Computing Sciences, 17100 Science Dr., Bowie, MD 20715, Phone: (301) 805-7435, FAX: (301) 805-7602, email: [email protected]
(4) Department of Mathematics and Computer Science, Drexel University, Philadelphia, PA 19104, Phone: (215) 895-2893, FAX: (610) 647-8633, email: [email protected]
(5) Center for Computing Sciences, 17100 Science Dr., Bowie, MD 20715, Phone: (301) 805-7432, FAX: (301) 805-7602, email: [email protected]
(6) Center for Computing Sciences, 17100 Science Dr., Bowie, MD 20715, Phone: (301) 805-7358, FAX: (301) 805-7602, email: [email protected]

0-89791-854-1/1996/$5.00 © 1996 IEEE

Abstract

In this paper we report on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm. Our implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine. Efficient performance will be obtained for all matrix sizes and shapes, and the additional memory needed for temporary variables has been minimized. Replacing DGEMM with our routine should provide a significant performance gain for large matrices while providing the same performance for small matrices. We measure performance of our code on the IBM RS/6000, CRAY YMP C90, and CRAY T3D single processor, and offer comparisons to other codes. Our performance data reconfirms that Strassen's algorithm is practical for realistic size matrices.
The usefulness of our implementation is demonstrated by replacing DGEMM with our routine in a large application code.

Keywords: matrix multiplication, Strassen's algorithm, Winograd variant, Level 3 BLAS

1 Introduction

The multiplication of two matrices is one of the most basic operations of linear algebra and scientific computing and has provided an important focus in the search for methods to speed up scientific computation. Its central role is evidenced by its inclusion as a key primitive operation in portable libraries, such as the Level 3 BLAS [7], where it can then be used as a building block in the implementation of many other routines, as done in LAPACK [1]. Thus, any speedup in matrix multiplication can improve the performance of a wide variety of numerical algorithms.

Much of the effort invested in speeding up practical matrix multiplication implementations has concentrated on the well-known standard algorithm, with improvements seen when the required inner products are computed in various ways that are better suited to a given machine architecture. Much less effort has been given towards the investigation of alternative algorithms whose asymptotic complexity is less than the Θ(m^3) operations required by the conventional algorithm to multiply m × m matrices. One such algorithm is Strassen's algorithm, introduced in 1969 [19], which has complexity Θ(m^lg(7)), where lg(7) ≈ 2.807 and lg(x) denotes the base 2 logarithm of x. Strassen's algorithm has long suffered from the erroneous assumptions that it is not efficient for matrix sizes that are seen in practice and that it is unstable. Both of these assumptions have been questioned in recent work.
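The operation counts behind these complexity claims can be made concrete with a small model. The sketch below (our own illustration, not the authors' code) compares the 2m^3 − m^2 arithmetic operations of the standard algorithm with the cost of fully recursive Strassen, T(m) = 7·T(m/2) + 18·(m/2)^2 with T(1) = 1, for m a power of two:

```python
import math

def standard_ops(m):
    # m^3 scalar multiplications plus m^3 - m^2 scalar additions
    return 2 * m**3 - m**2

def strassen_ops(m):
    # Fully recursive Strassen (1969 formulation): 7 half-size
    # products plus 18 additions of (m/2) x (m/2) blocks.
    if m == 1:
        return 1  # a single scalar multiplication
    half = m // 2
    return 7 * strassen_ops(half) + 18 * half * half

print(math.log2(7))  # the exponent in Strassen's complexity, ~2.807
for m in (64, 256, 1024, 4096):
    print(m, strassen_ops(m) / standard_ops(m))
```

Running this shows that recursing all the way down to 1 × 1 blocks is actually more expensive than the standard algorithm for moderate m, which is one way to see why the early-stopping strategies discussed next are essential in practice.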
By stopping the Strassen recursions early and performing the bottom-level multiplications using the traditional algorithm, competitive performance is seen for matrix sizes in the hundreds in Bailey's FORTRAN implementation on the CRAY 2 [2], Douglas et al.'s [8] C implementation of the Winograd variant of Strassen's algorithm on various machines, and IBM's ESSL library routine [16]. In addition, the stability analyses of Brent [4] and then Higham [11, 12] show that Strassen's algorithm is stable enough to be studied further and considered seriously in the development of high-performance codes for matrix multiplication.

A useful implementation of Strassen's algorithm must first efficiently handle matrices of arbitrary size. It is well known that Strassen's algorithm can be applied in a straightforward fashion to square matrices whose order is a power of two, but issues arise for matrices that are non-square or those having odd dimensions. Second, establishing an appropriate cutoff criterion for stopping the recursions early is crucial to obtaining competitive performance on matrices of practical size. Finally, excessive amounts of memory should not be required to store temporary results. Earlier work addressing these issues can be found in [2, 3, 4, 5, 8, 9, 10, 11, 17, 19].

In this paper we report on our development of a general, efficient, and portable implementation of Strassen's algorithm that is usable in any program in place of calls to DGEMM, the Level 3 BLAS matrix multiplication routine. Careful consideration has been given to all of the issues mentioned above. Our analysis provides an improved cutoff condition for rectangular matrices, a demonstration of the viability of dynamic peeling, a simple technique for dealing with odd matrix dimensions that had previously been dismissed [8], and a reduction in the amount of memory required for temporary variables.
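The idea behind peeling is to handle an odd dimension by splitting off the last row or column so that the remaining product has even dimensions, then patching the result with the cheap peeled terms. The following is a minimal sketch of that idea in pure Python (our own illustration, not the paper's code, which operates on DGEMM-style arrays); a plain triple-loop product stands in for the fast even-dimension kernel:

```python
def mat_mult(A, B):
    """Standard triple-loop product; stands in for the fast
    (Strassen) kernel and computes the peeled fix-up terms."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def peel_mult(A, B):
    """Multiply A (m x k) by B (k x n), peeling any odd dimension."""
    m, k, n = len(A), len(B), len(B[0])
    if m % 2 == 1:  # peel the last row of A
        core = peel_mult(A[:-1], B)
        return core + [mat_mult([A[-1]], B)[0]]
    if n % 2 == 1:  # peel the last column of B
        core = peel_mult(A, [row[:-1] for row in B])
        col = mat_mult(A, [[row[-1]] for row in B])
        return [core[i] + [col[i][0]] for i in range(m)]
    if k % 2 == 1:  # peel last column of A / last row of B: rank-1 update
        core = peel_mult([row[:-1] for row in A], B[:-1])
        return [[core[i][j] + A[i][-1] * B[-1][j] for j in range(n)]
                for i in range(m)]
    # All dimensions even: a real implementation recurses with
    # Strassen/Winograd here instead of the standard product.
    return mat_mult(A, B)
```

This sketch peels once at the top of each call; the dynamic peeling studied in the paper applies the same splitting at every level of the Strassen recursion, as dimensions become odd, rather than padding the inputs up front.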
We measure performance of our code on the IBM RS/6000, CRAY YMP C90, and CRAY T3D single processor and examine the results in several ways. Comparisons with machine-specific implementations of DGEMM reconfirm that Strassen's algorithm can provide an improvement over the standard algorithm for matrices of practical size. Timings of our code using several different cutoff criteria are compared, demonstrating the benefits of our new technique. Comparisons to the Strassen routine in the ESSL RS/6000 and the CRAY C90 libraries and the implementation of Douglas et al. show that competitive performance can be obtained in a portable code that uses the previously untried dynamic peeling method for odd-sized matrices. This is especially significant since for certain cases our memory requirements have been reduced by 40 to more than 70 percent over these other codes.

The remainder of this paper is organized as follows. Section 2 reviews Strassen's algorithm. In Section 3 we describe our implementation and address implementation issues related to cutoff, odd dimensions, and memory usage. Performance of our implementation is examined in Section 4, where we also report on using our Strassen code for the matrix multiplications in an eigensolver application. We offer a summary and conclusions in Section 5.

2 Strassen's Algorithm

Here we review Strassen's algorithm and some of its key algorithmic issues within the framework of an operation count model. The interested reader is referred to [14] for more details on this and other models, some of which also take into account memory access patterns, possible data reuse, and differences in speed between different arithmetic operations. In this paper the simpler operation count model will meet our needs for discussion of the various issues that had an effect on our code design.
The standard algorithm for multiplying two m × m matrices requires m^3 scalar multiplications and m^3 − m^2 scalar additions, for a total arithmetic operation count of 2m^3 − m^2. In Strassen's now famous 1969 paper [19], he introduced an algorithm, stated there for square matrices, which is based on a clever way of multiplying 2 × 2 matrices using 7 multiplications and 18 additions/subtractions. His construction does not depend on the commutativity of the component multiplications and hence can be applied to block matrices and then used recursively. If one level of Strassen's algorithm is applied to 2 × 2 matrices whose elements are m/2 × m/2 blocks and the standard algorithm is used for the seven block matrix multiplications, the total operation count is 7(2(m/2)^3 − (m/2)^2) + 18(m/2)^2 = (7/4)m^3 + (11/4)m^2. The ratio of this operation count to that required by the standard algorithm alone is seen to be

    (7m^3 + 11m^2) / (8m^3 − 4m^2),                                  (1)

which approaches 7/8 as m gets large, implying that for sufficiently large matrices one level of Strassen's construction produces a 12.5% improvement over regular matrix multiplication. Applying Strassen's construction recursively leads to the complexity result stated in the introduction [6]. We remark that the asymptotic complexity does not depend on the number of additions/subtractions; however, reducing the number of additions/subtractions can have practical significance.

Winograd's variant of Strassen's algorithm (credited to M. Paterson) uses 7 multiplications and 15 additions/subtractions [10]. The algorithm partitions input matrices A and B into 2 × 2 blocks and computes C = AB as
    ( C11  C12 )   ( A11  A12 ) ( B11  B12 )
    ( C21  C22 ) = ( A21  A22 ) ( B21  B22 ).

Stages (1) and (2) of the algorithm compute

    S1 = A21 + A22,    T1 = B12 - B11,
    S2 = S1 - A11,     T2 = B22 - T1,
    S3 = A11 - A21,    T3 = B22 - B12,
    S4 = A12 - S2,     T4 = B21 - T2,

and stages (3) and (4) compute, respectively, the seven products and seven sums

    P1 = A11 B11,    P5 = S3 T3,     U1 = P1 + P2,    U5 = U3 + P3,
    P2 = A12 B21,    P6 = S4 B22,    U2 = P1 + P4,    U6 = U2 + P3,
    P3 = S1 T1,      P7 = A22 T4,    U3 = U2 + P5,    U7 = U6 + P6.
    P4 = S2 T2,                      U4 = U3 + P7,

It is easy to verify that C11 = U1, C12 = U7, C21 = U4, and C22 = U5.
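The four stages translate directly into code. The following sketch (our own, not the authors' implementation) carries out one level of the Winograd variant for scalar 2 × 2 matrices; since none of the formulas rely on commutativity, the identical sequence applies when the entries are matrix blocks:

```python
def winograd_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    """One level of the Winograd variant of Strassen's algorithm:
    7 multiplications and 15 additions/subtractions.
    Returns (c11, c12, c21, c22) of C = A B."""
    # Stages (1) and (2): 8 additions/subtractions
    s1 = a21 + a22;  t1 = b12 - b11
    s2 = s1 - a11;   t2 = b22 - t1
    s3 = a11 - a21;  t3 = b22 - b12
    s4 = a12 - s2;   t4 = b21 - t2
    # Stage (3): the seven products
    p1 = a11 * b11;  p2 = a12 * b21;  p3 = s1 * t1;  p4 = s2 * t2
    p5 = s3 * t3;    p6 = s4 * b22;   p7 = a22 * t4
    # Stage (4): the seven sums
    u1 = p1 + p2;  u2 = p1 + p4;  u3 = u2 + p5;  u4 = u3 + p7
    u5 = u3 + p3;  u6 = u2 + p3;  u7 = u6 + p6
    return u1, u7, u4, u5  # c11, c12, c21, c22
```

Counting the operations above confirms the claim in the text: 8 additions/subtractions in stages (1) and (2) plus 7 in stage (4) gives 15 in total, alongside the 7 products.

```python
# Example: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
print(winograd_2x2(1, 2, 3, 4, 5, 6, 7, 8))  # (19, 22, 43, 50)
```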