Matrix Multiplication, a Little Faster

Elaye Karstadt, The Hebrew University of Jerusalem, [email protected]
Oded Schwartz, The Hebrew University of Jerusalem, [email protected]

ABSTRACT

Strassen's algorithm (1969) was the first sub-cubic matrix multiplication algorithm. Winograd (1971) improved its complexity by a constant factor. Many asymptotic improvements followed. Unfortunately, most of them have done so at the cost of very large, often gigantic, hidden constants. Consequently, Strassen-Winograd's O(n^{log_2 7}) algorithm often outperforms other matrix multiplication algorithms for all feasible matrix dimensions. The leading coefficient of Strassen-Winograd's algorithm was believed to be optimal for matrix multiplication algorithms with a 2 × 2 base case, due to a lower bound of Probert (1976).

Surprisingly, we obtain a faster matrix multiplication algorithm, with the same base case size and asymptotic complexity as Strassen-Winograd's algorithm, but with the coefficient reduced from 6 to 5. To this end, we extend Bodrato's (2010) method for matrix squaring, and transform matrices to an alternative basis. We prove a generalization of Probert's lower bound that holds under change of basis, showing that for matrix multiplication algorithms with a 2 × 2 base case, the leading coefficient of our algorithm cannot be further reduced, hence it is optimal. We apply our technique to other Strassen-like algorithms, improving their arithmetic and communication costs by significant constant factors.

CCS CONCEPTS

• Mathematics of computing → Computations on matrices; • Computing methodologies → Linear algebra algorithms;

KEYWORDS

Fast Matrix Multiplication, Bilinear Algorithms

1 INTRODUCTION

Strassen's algorithm [37] was the first sub-cubic matrix multiplication algorithm, with complexity O(n^{log_2 7}). Winograd [40] reduced the leading coefficient from 7 to 6 by decreasing the number of additions and subtractions¹ from 18 to 15. In practice, Strassen-Winograd's algorithm often performs better than some asymptotically faster algorithms [3] due to these smaller hidden constants.

The leading coefficient of Strassen-Winograd's algorithm was believed to be optimal, due to a lower bound on the number of additions for matrix multiplication algorithms with a 2 × 2 base case, obtained by Probert [31].

We obtain a method for improving the practical performance of Strassen and Strassen-like fast matrix multiplication algorithms by improving the hidden constants inside the O-notation. To this end, we extend Bodrato's (2010) method for matrix squaring, and transform matrices to an alternative basis.

¹ From here on, when referring to addition and subtraction counts, we say additions.

1.1 Strassen-like Algorithms

Strassen-like algorithms are a class of divide-and-conquer algorithms which utilize a base ⟨n0, m0, k0; t⟩-algorithm: multiplying an n0 × m0 matrix by an m0 × k0 matrix using t scalar multiplications, where n0, m0, k0 and t are positive integers. When multiplying an n × m matrix by an m × k matrix, the algorithm splits them into blocks (of size n/n0 × m/m0 and m/m0 × k/k0, respectively), and works block-wise, according to the base algorithm. Additions and multiplications by scalar in the base algorithm are interpreted as block-wise additions. Multiplications in the base algorithm are interpreted as block-wise multiplications via recursion. We refer to a Strassen-like algorithm by its base case. Hence, an ⟨n, m, k; t⟩-algorithm may refer to either the algorithm's base case or the corresponding block recursive algorithm, as obvious from context.
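This recursive pattern is easiest to see in code. The following is a minimal numpy sketch (ours, for illustration only; not part of the paper) of the ⟨2, 2, 2; 7⟩ case, using Strassen's original base case [37]:

```python
import numpy as np

def strassen(A, B, cutoff=1):
    """Block-recursive <2,2,2;7>-algorithm (Strassen's base case [37]).
    A, B are n x n with n a power of two; the base case's additions become
    block additions and its 7 products are computed via recursion."""
    n = A.shape[0]
    if n <= cutoff:                      # base case: classical multiplication
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    P1 = strassen(A11 + A22, B11 + B22, cutoff)
    P2 = strassen(A21 + A22, B11, cutoff)
    P3 = strassen(A11, B12 - B22, cutoff)
    P4 = strassen(A22, B21 - B11, cutoff)
    P5 = strassen(A11 + A12, B22, cutoff)
    P6 = strassen(A21 - A11, B11 + B12, cutoff)
    P7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[P1 + P4 - P5 + P7, P3 + P5],
                     [P2 + P4, P1 - P2 + P3 + P6]])
```

The 18 block additions above are exactly the addition count listed for Strassen's algorithm in Table 1; Winograd's variant [40] reorganizes them into 15.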
1.2 Known Strassen-like algorithms

Since Strassen's original discovery, many fast matrix multiplication algorithms followed and improved the asymptotic complexity [6, 12, 13, 26, 29, 32, 33, 36, 38, 39]. Some of these improvements have come at the cost of very large, often gigantic, hidden constants. Le Gall [26] estimated that even if matrix multiplication could be done in O(n²) arithmetic operations, it is unlikely to be applicable, as the base case sizes would have to be astronomical.

Recursive fast matrix multiplication algorithms with reasonable base case size, for both square and rectangular matrices, have been discovered [3, 20, 24, 25, 29, 30, 35]. Thus, they have manageable hidden constants, and some of them are asymptotically faster than Strassen's algorithm. While many fast matrix multiplication algorithms fail to compete with Strassen's in practice due to their hidden constants, some have achieved competitive performance (e.g., Kaporin's [21] implementation of Laderman et al.'s algorithm [24]).

Recently, Smirnov presented several fast matrix multiplication algorithms derived by computer aided optimization tools [35], including a ⟨6, 3, 3; 40⟩-algorithm with asymptotic complexity of O(n^{3 log_54 40}), faster than Strassen's algorithm. Ballard and Benson [3] later presented several additional fast Strassen-like algorithms, found using computer aided optimization tools as well. They implemented several Strassen-like algorithms, including Smirnov's ⟨6, 3, 3; 40⟩-algorithm, on a shared-memory architecture, in order to demonstrate that Strassen and Strassen-like algorithms can outperform classical matrix multiplication in practice (such as Intel's MKL), on modestly sized problems (at least up to n = 13000). Their experiments also showed Strassen's algorithm outperforming Smirnov's algorithm in some of the cases.

Theorem 1.1 (Probert's lower bound). [31] 15 additions are necessary for any ⟨2, 2, 2; 7⟩-algorithm.

Our result seemingly contradicts Probert's lower bound. However, his bound implicitly assumes that the input and output are represented in the standard basis, thus there is no contradiction. We extend Probert's lower bound to account for alternative bases (see Section 4):

Theorem 1.2 (Basis invariant lower bound). 12 additions are necessary for any matrix multiplication algorithm that uses a recursive-bilinear algorithm with a 2 × 2 base case with 7 multiplications, regardless of basis.

Our alternative basis variant of Strassen's algorithm performs 12 additions in the base case, matching the lower bound in Theorem 1.2.
Hence, it is optimal.

1.3 Previous work

Bodrato [7] introduced the intermediate representation method, for repeated squaring and for chain matrix multiplication computations. This enables decreasing the number of additions between consecutive multiplications. Thus, he obtained an algorithm with a 2 × 2 base case, which uses 7 multiplications, and has a leading coefficient of 5 for chain multiplication and for repeated squaring, for every multiplication outside the first one. Bodrato also presented an invertible linear function which recursively transforms a 2^k × 2^k matrix to and from the intermediate representation. While this is not the first time that linear transformations are applied to matrix multiplication, the main focus of previous research on the subject was on improving asymptotic performance rather than reducing the number of additions [10, 17].

Very recently, Cenk and Hasan [8] showed a clever way to apply Strassen-Winograd's algorithm directly to n × n matrices by forsaking the uniform divide-and-conquer pattern of Strassen-like algorithms. Instead, their algorithm splits Strassen-Winograd's algorithm into two linear divide-and-conquer algorithms which recursively perform all pre-computations, followed by vector multiplication of their results, and finally performs linear post-computations to calculate the output. Their method enables reuse of sums, resulting in a matrix multiplication algorithm with arithmetic complexity of $5n^{\log_2 7} + 0.5\,n^{\log_2 6} + 2n^{\log_2 5} - 6.5n^2$. However, this comes at the cost of increased communication costs and memory footprint.

1.4 Our contribution

We present the Alternative Basis Matrix Multiplication method, and show how to apply it to existing Strassen-like algorithms (see Section 3). While a basis transformation is, in general, as expensive as a matrix multiplication, some can be performed very fast (e.g., the Hadamard transform, in O(n² log n), using FFT [11]). Fortunately, this is also the case for our basis transformations (see Section 3.1). Thus, it is a worthwhile trade-off: the leading coefficient is reduced in exchange for an asymptotically insignificant overhead (see Section 3.2). We provide an analysis of how these constants are affected, and of the impact on both arithmetic and IO-complexity.

We discuss the problem of finding alternative bases to improve Strassen-like algorithms (see Section 5), and present several improved variants of existing algorithms. Most notable of these are the alternative basis variant of Strassen's ⟨2, 2, 2; 7⟩-algorithm, which reduces the number of additions from 15 to 12 (see Section 3.3), and the variant of Smirnov's ⟨6, 3, 3; 40⟩-algorithm, with leading coefficient reduced by about 83.2%.²

² Files containing the encoding/decoding matrices and the corresponding basis transformations can be found at https://github.com/elayeek/matmultfaster.

2 PRELIMINARIES

2.1 The communication bottleneck

Fast matrix multiplication algorithms have lower IO-complexity than the classical algorithm. That is, they communicate asymptotically less data within the memory hierarchy and between processors. The IO-complexity is measured as a function of the number of processors P, the local memory size M, and the matrix dimension n. Namely, the communication costs of a parallel ⟨n0, n0, n0; t⟩-algorithm are

$$\Theta\left(\left(\frac{n}{\sqrt{M}}\right)^{\log_{n_0} t} \cdot \frac{M}{P}\right)$$

[1, 2, 4, 34]. Thus, parallel versions of Strassen's algorithm which minimize communication cost outperform the well-tuned classical matrix multiplication in practice, both on shared-memory [3, 14, 23] and distributed-memory architectures [1, 16, 27].

Our ⟨2, 2, 2; 7⟩-algorithm not only reduces the arithmetic complexity by 16.66%, but also the IO-complexity by 20%, compared to Strassen-Winograd's algorithm. Hence, the performance gain should be in the range of 16-20% on a shared-memory machine.

2.2 Encoding and Decoding matrices

Fact 2.1. Let R be a ring, and let f : Rⁿ × Rᵐ → Rᵏ be a bilinear function which performs t multiplications. There exist U ∈ R^{t×n}, V ∈ R^{t×m}, W ∈ R^{t×k} such that

$$\forall x \in R^n,\ y \in R^m:\quad f(x, y) = W^T\left((U \cdot x) \odot (V \cdot y)\right)$$

where ⊙ is the element-wise vector product (Hadamard product).

Definition 2.2 (Encoding/Decoding matrices). We refer to the ⟨U, V, W⟩ of a recursive-bilinear algorithm as its encoding/decoding matrices (where U, V are the encoding matrices and W is the decoding matrix).

Definition 2.3. Let R be a ring, and let A ∈ R^{n×m} be a matrix. We denote the vectorization of A by $\vec{A}$. We use the notation U_{ℓ,(i,j)} when referring to the element in the ℓ-th row of U, in the column corresponding to the index (i, j) in the vectorization of A.³ For ease of notation, we sometimes write $\vec{A}_{i,j}$ rather than $\vec{A}_{(i,j)}$.

³ All basis transformations and encoding/decoding matrices assume row-ordered vectorization of matrices.
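As a concrete instance of Fact 2.1 (ours, for illustration), classical 2 × 2 multiplication is an ⟨2, 2, 2; 8⟩-algorithm. The sketch below builds its encoding/decoding matrices under the row-ordered vectorization of footnote 3 and checks the identity:

```python
import numpy as np
from itertools import product

def apply_bilinear(U, V, W, x, y):
    # f(x, y) = W^T((U x) ⊙ (V y)), as in Fact 2.1
    return W.T @ ((U @ x) * (V @ y))

# Classical 2x2 multiplication written as an <2,2,2;8>-algorithm:
# product r = (i,k,j) computes A[i,k] * B[k,j] and is added into C[i,j].
idx = lambda i, j: 2 * i + j          # row-ordered vectorization index
U, V, W = (np.zeros((8, 4)) for _ in range(3))
for r, (i, k, j) in enumerate(product(range(2), repeat=3)):
    U[r, idx(i, k)] = V[r, idx(k, j)] = W[r, idx(i, j)] = 1

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
C = apply_bilinear(U, V, W, A.reshape(4), B.reshape(4)).reshape(2, 2)
assert np.allclose(C, A @ B)
```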

Table 1: ⟨2, 2, 2; 7⟩-algorithms

Algorithm | Additions | Arithmetic computations | IO-complexity
Strassen [37] | 18 | 7n^{log_2 7} − 6n² | 6(√3·n/√M)^{log_2 7}·M − 18n² + 3M
Strassen-Winograd [40] | 15 | 6n^{log_2 7} − 5n² | 5(√3·n/√M)^{log_2 7}·M − 15n² + 3M
Ours | 12 | 5n^{log_2 7} − 4n² + 3n² log_2 n | 4(√3·n/√M)^{log_2 7}·M − 12n² + 3n² log_2(√2·n/√M) + 5M
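A quick sanity check of the table's arithmetic column (ours, not part of the paper): the counts follow from the recurrence F(n) = 7F(n/2) + q(n/2)² with F(1) = 1 (cf. Claim 3.11 below), where q is the addition count of the base case:

```python
def arithmetic(n, q):
    """Total flops of a <2,2,2;7>-style recursion with q block additions
    per step: F(n) = 7 F(n/2) + q (n/2)^2, F(1) = 1."""
    return 1 if n == 1 else 7 * arithmetic(n // 2, q) + q * (n // 2) ** 2

for k in range(1, 10):
    n = 2 ** k                                            # n^{log2 7} = 7^k
    assert arithmetic(n, 18) == 7 * 7 ** k - 6 * n ** 2   # Strassen
    assert arithmetic(n, 15) == 6 * 7 ** k - 5 * n ** 2   # Strassen-Winograd
    assert arithmetic(n, 12) == 5 * 7 ** k - 4 * n ** 2   # ours, bilinear part only;
    # Table 1 adds 3n^2 log2(n) for the three basis transformations
```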

Table 2: Alternative Basis Algorithms

Algorithm | Linear operations | Improved linear operations | Leading coefficient | Improved leading coefficient | Computations saved
⟨2, 2, 2; 7⟩ [40] | 15 | 12 | 6 | 5 | 16.6%
⟨3, 2, 3; 15⟩ [3] | 64 | 52 | 15.06 | 7.94 | 47.3%
⟨2, 3, 4; 20⟩ [3] | 78 | 58 | 9.96 | 7.46 | 25.6%
⟨3, 3, 3; 23⟩ [3] | 87 | 75 | 8.91 | 6.57 | 26.3%
⟨6, 3, 3; 40⟩ [35] | 1246 | 202 | 55.63 | 9.39 | 83.2%

Fact 2.4 (Triple product condition). [22] Let R be a ring, and let U ∈ R^{t×nm}, V ∈ R^{t×mk}, W ∈ R^{t×nk}. ⟨U, V, W⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm if and only if

$$\forall i_1, i_2 \in [n],\ k_1, k_2 \in [m],\ j_1, j_2 \in [k]:\qquad \sum_{r=1}^{t} U_{r,(i_1,k_1)}\, V_{r,(k_2,j_1)}\, W_{r,(i_2,j_2)} = \delta_{i_1,i_2}\,\delta_{k_1,k_2}\,\delta_{j_1,j_2}$$

where δ_{i,j} = 1 if i = j and 0 otherwise.

Notation 2.5. Denote the number of non-zero entries in a matrix by nnz(A), and the number of rows/columns by rows(A), cols(A).

Remark 2.6. The number of linear operations used by a bilinear algorithm is determined by its encoding/decoding matrices. The number of additions performed by each of the encodings is

Additions_U = nnz(U) − rows(U),  Additions_V = nnz(V) − rows(V),

and the number of additions performed by the decoding is

Additions_W = nnz(W) − cols(W).

The number of scalar multiplications performed by each of the encodings/decodings is equal to the number of matrix entries which are not 1, −1, or 0.

3 ALTERNATIVE BASIS MATRIX MULTIPLICATION

Fast matrix multiplication algorithms are bilinear computations. The number of operations performed in the linear phases of such algorithms (the application of their encoding/decoding matrices ⟨U, V, W⟩, in the case of matrix multiplication; see Definition 2.2) depends on the basis of representation. In this section, we detail how alternative basis algorithms work, and address the effects of using alternative bases on arithmetic complexity and IO-complexity.

Definition 3.1. Let R be a ring and let ϕ, ψ, υ be automorphisms of R^{n·m}, R^{m·k}, R^{n·k} (respectively). We denote a Strassen-like algorithm which takes ϕ(A), ψ(B) as inputs and outputs υ(A · B) using t multiplications by ⟨n, m, k; t⟩_{ϕ,ψ,υ}-algorithm. If n = m = k and ϕ = ψ = υ, we use the notation ⟨n, n, n; t⟩_ϕ-algorithm. This notation extends the ⟨n, m, k; t⟩-algorithm notation, as the latter applies when the three basis transformations are the identity map.

Given a recursive-bilinear ⟨n, m, k; t⟩_{ϕ,ψ,υ}-algorithm ALG, alternative basis matrix multiplication works as follows:

Algorithm 1 Alternative Basis Matrix Multiplication Algorithm
Input: A ∈ R^{n×m}, B ∈ R^{m×k}
Output: the n × k matrix C = A · B
1: function ABS(A, B)
2:   Ã = ϕ(A)          ▷ R^{n×m} basis transformation
3:   B̃ = ψ(B)          ▷ R^{m×k} basis transformation
4:   C̃ = ALG(Ã, B̃)     ▷ ⟨n, m, k; t⟩_{ϕ,ψ,υ}-algorithm
5:   C = υ^{−1}(C̃)      ▷ R^{n×k} basis transformation
6:   return C

Lemma 3.2. Let R be a ring, let ⟨U, V, W⟩ be the encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm, and let ϕ, ψ, υ be automorphisms of R^{n·m}, R^{m·k}, R^{n·k} (respectively). Then ⟨Uϕ^{−1}, Vψ^{−1}, Wυ^T⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩_{ϕ,ψ,υ}-algorithm.

Proof. ⟨U, V, W⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm. Hence, for any A ∈ R^{n×m}, B ∈ R^{m×k},

$$W^T\left((U \vec{A}) \odot (V \vec{B})\right) = \overrightarrow{A \cdot B}$$

Hence,

$$\upsilon\left(\overrightarrow{A \cdot B}\right) = \upsilon\left(W^T\left((U \vec{A}) \odot (V \vec{B})\right)\right) = \left(W \upsilon^T\right)^T \left(\left(U \phi^{-1} \cdot \phi(\vec{A})\right) \odot \left(V \psi^{-1} \cdot \psi(\vec{B})\right)\right) \qquad\Box$$

Corollary 3.3. Let R be a ring, and let ϕ, ψ, υ be automorphisms of R^{n·m}, R^{m·k}, R^{n·k} (respectively). ⟨U, V, W⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩_{ϕ,ψ,υ}-algorithm if and only if ⟨Uϕ, Vψ, Wυ^{−T}⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm.
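Fact 2.4 is mechanical to check for small base cases. A short sketch (ours; the index maps assume the row-ordered vectorization of footnote 3, and exact integer matrices):

```python
import numpy as np
from itertools import product

def is_mm_algorithm(U, V, W, n, m, k):
    """Check the triple product condition (Fact 2.4) for <U,V,W>."""
    a = lambda i, x: i * m + x        # position of A[i,x] in vec(A)
    b = lambda x, j: x * k + j        # position of B[x,j] in vec(B)
    c = lambda i, j: i * k + j        # position of C[i,j] in vec(C)
    for i1, i2, k1, k2, j1, j2 in product(range(n), range(n), range(m),
                                          range(m), range(k), range(k)):
        lhs = np.sum(U[:, a(i1, k1)] * V[:, b(k2, j1)] * W[:, c(i2, j2)])
        if lhs != (i1 == i2) * (k1 == k2) * (j1 == j2):
            return False
    return True
```

Applied, for example, to the ⟨2, 2, 2; 8⟩ triple from the sketch in Section 2.2, or to the products ⟨U_opt·ψ_opt, V_opt·ψ_opt, W_opt·ψ_opt^{−T}⟩ appearing in the proof of Claim 3.16 below, it confirms the condition holds.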

3.1 Fast basis transformation

Definition 3.4. Let R be a ring and let ψ1 : R^{n0×m0} → R^{n0×m0} be a linear map. We recursively define a linear map ψ_{k+1} : R^{n×m} → R^{n×m} (where n = n0^{ℓ1}, m = m0^{ℓ2} for some ℓ1, ℓ2 ≤ k + 1) by

$$\left(\psi_{k+1}(A)\right)_{i,j} = \psi_k\left(\left(\psi_1(A)\right)_{i,j}\right)$$

where the A_{i,j} are n/n0 × m/m0 sub-matrices. Note that ψ_{k+1} is a linear map. For convenience, we omit the subscript of ψ when obvious from context.

Claim 3.5. Let R be a ring, let ψ1 : R^{n0×m0} → R^{n0×m0} be a linear map, and let A ∈ R^{n×m} (where n = n0^{k+1}, m = m0^{k+1}). Define Ã by Ã_{i,j} = ψ_k(A_{i,j}). Then ψ1(Ã) = ψ_{k+1}(A).

Proof. ψ1 is a linear map. Hence, for any i ∈ [n0], j ∈ [m0], (ψ1(A))_{i,j} is a linear sum of elements of A. Therefore, there exist scalars x^{(i,j)}_{r,ℓ} (r ∈ [n0], ℓ ∈ [m0]) such that

$$\left(\psi_{k+1}(A)\right)_{i,j} = \psi_k\left(\left(\psi_1(A)\right)_{i,j}\right) = \psi_k\left(\sum_{r,\ell} x^{(i,j)}_{r,\ell}\, A_{r,\ell}\right)$$

By linearity of ψ_k,

$$= \sum_{r,\ell} x^{(i,j)}_{r,\ell}\, \psi_k\left(A_{r,\ell}\right) = \left(\psi_1(\tilde{A})\right)_{i,j} \qquad\Box$$

Claim 3.6. Let R be a ring, let ψ1 : R^{n0×m0} → R^{n0×m0} be an invertible linear map, and let ψ_{k+1} be as defined above. Then ψ_{k+1} is invertible, and its inverse is given by (ψ^{−1}_{k+1}(A))_{i,j} = ψ^{−1}_k((ψ1^{−1}(A))_{i,j}).

Proof. Define Ã by Ã_{i,j} = ψ^{−1}_k(A_{i,j}), and define ψ^{−1}_{k+1} by (ψ^{−1}_{k+1}(A))_{i,j} = ψ^{−1}_k((ψ1^{−1}(A))_{i,j}). Then:

$$\left(\psi^{-1}_{k+1}\left(\psi_{k+1}(A)\right)\right)_{i,j} = \psi^{-1}_k\left(\left(\psi_1^{-1}\left(\psi_{k+1}(A)\right)\right)_{i,j}\right)$$

By Claim 3.5,

$$= \psi^{-1}_k\left(\left(\psi_1^{-1}\left(\psi_1(\tilde{A})\right)\right)_{i,j}\right) = \psi^{-1}_k\left(\tilde{A}_{i,j}\right)$$

and by the definition of Ã,

$$= \psi^{-1}_k\left(\psi_k\left(A_{i,j}\right)\right) = A_{i,j} \qquad\Box$$

We next analyze the arithmetic complexity and IO-complexity of fast basis transformations. For convenience and readability, we present here the square case only. The analysis for rectangular matrices is similar.

Claim 3.7. Let R be a ring, let ψ1 : R^{n0×n0} → R^{n0×n0} be a linear map, and let A ∈ R^{n×n} where n = n0^k. The arithmetic complexity of computing ψ(A) is

$$F_\psi(n) = \frac{q}{n_0^2}\, n^2 \log_{n_0} n$$

where q is the number of linear operations performed by ψ1.

Proof. Let F_ψ(n) be the number of additions required by ψ. Each step of the recursion consists of computing n0² sub-problems and performs q additions of sub-matrices. Therefore, F_ψ(n) = n0² F_ψ(n/n0) + q(n/n0)² and F_ψ(1) = 0. Thus,

$$F_\psi(n) = n_0^2 F_\psi\left(\frac{n}{n_0}\right) + q\left(\frac{n}{n_0}\right)^2 = \sum_{k=0}^{\log_{n_0}(n)-1} n_0^{2k}\, q\left(\frac{n}{n_0^{k+1}}\right)^2 = \frac{q}{n_0^2}\, n^2 \sum_{k=0}^{\log_{n_0}(n)-1} 1 = \frac{q}{n_0^2}\, n^2 \log_{n_0}(n) \qquad\Box$$

Claim 3.8. Let R be a ring, let ψ1 : R^{n0×n0} → R^{n0×n0} be a linear map, and let A ∈ R^{n×n} where n = n0^k. The IO-complexity of computing ψ(A) is

$$IO_\psi(n, M) \le \frac{3q}{n_0^2}\, n^2 \log_{n_0}\left(\sqrt{2}\,\frac{n}{\sqrt{M}}\right) + 2M$$

where q is the number of linear operations performed by ψ1.

Proof. Each step of the recursion consists of computing n0² sub-problems and performs q linear operations. The base case occurs when the problem fits entirely in the fast memory (or local memory, in the parallel setting), namely when 2n² ≤ M. Each addition requires at most 3 data transfers (one for each input and one for writing the output). Hence, a basis transformation which performs q linear operations at each recursive step has the recurrence:

$$IO_\psi(n, M) \le \begin{cases} n_0^2\, IO_\psi\left(\frac{n}{n_0}, M\right) + 3q\left(\frac{n}{n_0}\right)^2 & 2n^2 > M \\ 2M & \text{otherwise} \end{cases}$$

Therefore,

$$IO_\psi(n, M) \le \sum_{k=0}^{\log_{n_0}\left(\sqrt{2}\frac{n}{\sqrt{M}}\right)-1} n_0^{2k} \cdot 3q\left(\frac{n}{n_0^{k+1}}\right)^2 + 2M = \frac{3q}{n_0^2}\, n^2 \log_{n_0}\left(\sqrt{2}\,\frac{n}{\sqrt{M}}\right) + 2M \qquad\Box$$
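The recursion of Definition 3.4 (in the equivalent block-wise form of Claim 3.5) is only a few lines of code. A sketch for a 2 × 2 base case, where ψ1 is given as a 4 × 4 matrix M acting on the row-ordered vectorization of the block grid (ours, for illustration; the example M is the ψ_opt of Section 3.3):

```python
import numpy as np

def psi(A, M):
    """Definition 3.4 for a 2x2 base case: M is the 4x4 matrix of psi_1 acting
    on the row-ordered vectorization (A11, A12, A21, A22) of the block grid.
    Equivalently (Claim 3.5): transform the blocks recursively, then mix them."""
    n = A.shape[0]
    if n == 1:                          # psi_0 is the identity
        return A.copy()
    h = n // 2
    blocks = [A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]]
    sub = [psi(b, M) for b in blocks]   # recurse into the blocks
    out = [sum(M[r, c] * sub[c] for c in range(4)) for r in range(4)]
    return np.block([[out[0], out[1]], [out[2], out[3]]])

# Claim 3.6: the recursion built from psi_1^{-1} inverts the one built from psi_1.
M = np.array([[1., 0, 0, 0], [0, 1, -1, 1], [0, 0, -1, 1], [0, 1, 0, 1]])
A = np.random.rand(8, 8)
assert np.allclose(psi(psi(A, M), np.linalg.inv(M)), A)
```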
` r, (i, j) r r r ∈[t] .* /+ ˜ And, as noted in the base case: Proof. Denote C = ALG (ϕ`+1 (A) ,ψ`+1 (B)) and the encod- , - ing/decoding matrices of ALG by hU , V , W i. We prove by induction X on ` that C˜ = υ` (A · B). For r ∈ [t], denote (υ1 (A · B))i, j = Wr, (i, j) (Sr · Tr ) r ∈[t] (i, j) .* /+ X Hence, , - Sr = Ur, (i, j) (ϕ1 (A))i, j C˜i,j = υ` (υ1 (A · B))i, j i ∈[n0], j ∈[m0] X Œerefore, by De€nition 3.4, C˜ = υ`+1 (A · B)  Tr = Vr, (i, j) (ψ1 (B))i, j i ∈[m ], j ∈[k ] 0 0 Notation 3.10. When discussing an hn0,n0,n0; tiϕ,ψ,υ -algorithm, we denote ω = log t. Œe base case, ` = 1, holds by Lemma 3.2 since ALG is an 0 n0 h i n0,m0,k0; t ϕ1,ψ1,υ1 -algorithm. Note that this means that for any Claim 3.11. Let ALG be an hn0,n0,n0; tiϕ,ψ,υ -algorithm which i ∈ [n0] , j ∈ [k0] performs q linear operations at its base case. Še arithmetic complexity of ALG is  T  (υ1 (AB)) = W ((U · ϕ1 (A)) (V · ψ1 (B))) i, j i, j q q X F (n) = 1 + nω0 − n2 = W (S · T ) ALG − 2 − 2 r, (i, j) r r t n0 t n0 r ∈[t] * + * + Proof. Each step of, the recursion- consists, of computing- t sub- Next, we assume the claim holds for ` ∈ N and show for ` + 1. problems and performs q linear operations (additions/multiplication    2 Given input A˜ = ϕ`+1 (A) , B˜ = ψ`+1 (B), ALG performs t multi- n n by scalar) of sub-matrices. Œerefore FALG (n) = tFψ n +q 0 plications P ,..., P . For each multiplication P , its le‰ hand side 0 n 1 t r and FALG (1) = 1. Œus, multiplicand is of the form log n−1 n0 2 X X k n log n ˜ F (n) = t · q · + t n0 · F (1) Lr = Ur, (i, j)Ai, j ALG k+1 ALG k=0 n0 i ∈[n0], j ∈[m0] * + log n−1 n0 k q 2 X , t - ω By De€nition 3.4, (ϕ`+1 (A))i, j = ϕ` (ϕ1 (A))i, j . Hence, = n · + n 0 n2 n2 0 k=0 0 X   * + = U ϕ (ϕ (A))  log n r, (i, j) ` 1 i, j t n0 2 , −-1 ω 2 i ∈[n0], j ∈[m0] q n n 0 − n = n2 0 + nω0q + nω0 n2 t − 1 t − n2 ϕ 0 * n2 + 0 From linearity of ` . 0 / * + . / , -  X , - = ϕ` Ur, (i, j) (ϕ1 (A))i, j Claim 3.12. Let ALG be an hn0,n0,n0; tiϕ,ψ,υ -algorithm which i ∈[n0],j ∈[m0] .* /+ performs q linear operations at its base case. Še IO-complexity of = ϕ` (Sr ) , - ALG is !ω ! And similarly, the right hand multiplication R is of the form q √ n 0 r IO (n, M) ≤ M 3 · √ − 3n2 + 3M ALG − 2 t n0 M Rr = ψ` (Tr ) * + Proof. Each step, of the recursion- consists of computing t sub- ∈ ` × ` ` × ` problems and performs q linear operations. Œe base case occurs Note that for any r [t], Sr , Tr are n0 m0 and m0 k0 matrices, respectively. Hence, by the induction hypothesis, when the problem €ts entirely in the fast memory (or local memory in parallel seŠing), namely 3n2 ≤ M. Each addition requires at Pr = ALG (ϕ` (Sr ) ,ψ` (Tr )) = υ` (Sr · Tr ) most 3 data transfers (one of each input and one for writing the output). Hence, a recursive bi-linear algorithm which performs q 3.3 Optimal h2, 2, 2; 7i-algorithm linear operations at each recursive steps has the recurrence: 4 4 We now present a basis transformation ψopt : R → R and an h2, 2, 2; 7i -algorithm which performs only 12 linear operations.  n   n 2 2 ψ t · IOψ n , M + 3q · n 3n > M IOALG (n, M) ≤ 0 0 Notation 3.15. Let, ψ refer to the following transformation: 3M otherwise opt   1 0 0 0 1 0 0 0 Œerefore   0 1 −1 1 −1 0 1 −1 0 ψopt = ψopt = log √n −1 *0 0 −1 1+ *0 −1 0 1+ n0 M . / . / 3 2 .0 1 0 1/ .0 −1 1 1/ X k n . / . / IOALG (n, M) ≤ t · 3q + 3M . / . / nk+1 For convenience, when applying ψ to matrices, we omit the k=0 0 , - , - * + vectorization and refer to it as ψ : R2×2 → R2×2: log √n −1 n0 M ! ! 
X 3 , k - A A A A − A + A 3q 2 t ψ (A) = ψ 1,1 1,2 = 1,1 1,2 2,1 2,2 = n + 3M opt 1 − n2 n2 A2,1 A2,2 A21 A2,2 A1,2 + A2,2 0 k=0 0 * + −1 log √n Where Ai, j can be ring elements or sub-matrices. ψ is de€ned   n0 M opt t , 3 -− −1 2 1 analogously. Bothψopt andψ extend recursively as in De€nition 3q 2 n0 opt = n + 3M 3.4. n2 t − 1 0 * n2 + . 0 / . / D E .√ ω0 / U , V , W = M 3 · √n − 3n2 opt opt opt , M - = q + 3M 0 0 0 1 0 0 0 1 0 0 0 1 t − n2 * 0 + 0 0 1 0 0 0 1 0 0 0 1 0 . / . / * 0 1 0 0 0 1 0 0 0 1 0 0 +  * + * + * + , - . 1 0 0 0/ , . 1 0 0 0/ , .1 0 0 0/ . − / . − / . / . 0 1 1 0/ . 0 1 0 1/ .1 1 0 0/ Corollary 3.13. If ABS (Algorithm 1) performs q linear opera- .−1 1 0 0/ . 0 1 −1 0/ .0 −1 0 −1/ . / . / . / tions at the base case, then its arithmetic complexity is . 0 −1 0 1/ .−1 1 0 0/ .0 1 1 0/ . / . / . / q q   F (n) = 1 + nω0 − n2 + O n2 logn Figure, 1: hU , V , W i- are, the encoding/decoding- , matrices- of ABS − 2 − 2 t n0 t n0 our h2, 2, 2; 7i -algorithm which performs 12 linear opera- * + * + ψopt tions Proof. Œe number, of ƒops- performed, - by the algorithm is the sum of: (1) the number of ƒops performed by the basis transforma- D E tions (denoted ϕ, ψ, υ) and (2) the number of ƒops performed by Claim 3.16. Uopt , Vopt , Wopt are encoding/decoding matrices the recursive bilinear algorithms ALG. h i of an 2, 2, 2; 7 ψopt -algorithm.

Claim 3.17. Let R be a ring, and let A ∈ R^{n×n} where n = 2^k. The arithmetic complexity of computing ψ_opt(A) is

$$F_{\psi_{opt}}(n) = n^2 \log_2 n$$

The same holds for computing ψ_opt^{-1}(A).

Proof. Both ψ_opt and ψ_opt^{-1} perform q = 4 linear operations at each recursive step, with base case size n0 = 2. The claim follows immediately from Claim 3.7. □

Claim 3.18. Let R be a ring, and let A ∈ R^{n×n} where n = 2^k. The IO-complexity of computing ψ_opt(A) is

$$IO_{\psi_{opt}}(n, M) \le 2n^2 \log_2\left(\sqrt{2}\,\frac{n}{\sqrt{M}}\right) + 2M$$

Proof. Both ψ_opt and ψ_opt^{-1} perform q = 4 linear operations at each recursive step. The claim follows immediately from Claim 3.8 with base case n0 = 2. □

Corollary 3.19. Our ⟨2, 2, 2; 7⟩_{ψopt}-algorithm's arithmetic complexity is

$$F_{opt}(n) = 5n^{\log_2 7} - 4n^2$$

Proof. Our ⟨2, 2, 2; 7⟩_{ψopt}-algorithm has a 2 × 2 base case and performs 7 multiplications. Applying Remark 2.6 to its encoding/decoding matrices ⟨U_opt, V_opt, W_opt⟩, we see that it performs 12 linear operations. The result follows immediately from Claim 3.11. □

Corollary 3.20. Our ⟨2, 2, 2; 7⟩_{ψopt}-algorithm's IO-complexity is

$$IO_{opt}(n, M) \le 4M\left(\sqrt{3}\,\frac{n}{\sqrt{M}}\right)^{\log_2 7} - 12n^2 + 3M$$

Proof. Our algorithm has a 2 × 2 base case and performs 7 multiplications. Applying Remark 2.6 to its encoding/decoding matrices (as shown in Figure 1), we see that it performs 12 linear operations. The result follows immediately from Claim 3.12. □
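Returning to Claim 3.17, the addition count of the recursive ψ_opt can be confirmed directly (ours, illustrative); its rows (0,1,-1,1), (0,0,-1,1), (0,1,0,1) cost 2 + 1 + 1 = 4 block additions per step, i.e., q = nnz − rows = 8 − 4 = 4:

```python
def psi_opt_additions(n):
    """Additions of the recursive psi_opt: 4 sub-transforms plus
    4 block additions on (n/2) x (n/2) blocks per step."""
    if n == 1:
        return 0
    h = n // 2
    return 4 * psi_opt_additions(h) + 4 * h * h

for k in range(1, 12):
    n = 2 ** k
    assert psi_opt_additions(n) == n * n * k    # Claim 3.17: n^2 log2(n)
```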
Corollary 3.21. The arithmetic complexity of ABS (Algorithm 1) with our ⟨2, 2, 2; 7⟩_{ψopt}-algorithm is

$$F_{ABS}(n) = 5n^{\log_2 7} - 4n^2 + 3n^2 \log_2 n$$

Proof. The proof is similar to that of Corollary 3.13. □

Corollary 3.22. The IO-complexity of ABS (Algorithm 1) with our ⟨2, 2, 2; 7⟩_{ψopt}-algorithm is

$$IO_{ABS}(n, M) \le 4M\left(\sqrt{3}\,\frac{n}{\sqrt{M}}\right)^{\log_2 7} - 12n^2 + 3n^2 \log_2\left(\sqrt{2}\,\frac{n}{\sqrt{M}}\right) + 5M$$

Proof. The proof is similar to that of Corollary 3.14. □

Theorem 3.23. Our ⟨2, 2, 2; 7⟩_{ψopt}-algorithm's sequential and parallel IO-complexity is bounded below by

$$\Omega\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \frac{M}{P}\right)$$

(where P is the number of processors, 1 in the sequential case, and ω0 = log_2 7).

Proof. We refer to the undirected bipartite graph defined by the decoding matrix of a Strassen-like algorithm as its decoding graph (i.e., the edge (i, j) exists if W_{i,j} ≠ 0). In [2], Ballard et al. proved that for any square recursive-bilinear Strassen-like algorithm with an n0 × n0 base case which performs t multiplications, if the decoding graph is connected, then these bounds apply with ω0 = log_{n0} t. The decoding graph of our algorithm is connected. Hence, the claim holds. □

4 BASIS-INVARIANT LOWER BOUND ON ADDITIONS FOR 2 × 2 MATRIX MULTIPLICATION

In this section we prove Theorem 1.2, which says that 12 additions are necessary to compute 2 × 2 matrix multiplication recursively with a base case of 2 × 2 and 7 multiplications, irrespective of basis. Theorem 1.2 complements Probert's lower bound, which says that for the standard basis, 15 additions are required.

Definition 4.1. Denote by P_{I×J} the permutation matrix which swaps row-order for column-order in the vectorization of an I × J matrix.

Lemma 4.2. [19] Let ⟨U, V, W⟩ be the encoding/decoding matrices of an ⟨m, k, n; t⟩-algorithm. Then ⟨W P_{n×m}, U, V P_{n×k}⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm.

We use the following results, shown by Hopcroft and Kerr [20]:

Lemma 4.3. [20] If an algorithm for 2 × 2 matrix multiplication has k left (right) hand side multiplicands from the set S = {A_{1,1}, (A_{1,2} + A_{2,1}), (A_{1,1} + A_{1,2} + A_{2,1})}, where additions are done modulo 2, then it requires at least 6 + k multiplications.

Corollary 4.4. [20] Lemma 4.3 also applies for the following definitions of S:
(1) {(A_{1,1} + A_{2,1}), (A_{1,2} + A_{2,1} + A_{2,2}), (A_{1,1} + A_{1,2} + A_{2,2})}
(2) {(A_{1,1} + A_{1,2}), (A_{1,2} + A_{2,1} + A_{2,2}), (A_{1,1} + A_{2,1} + A_{2,2})}
(3) {(A_{1,1} + A_{1,2} + A_{2,1} + A_{2,2}), (A_{1,2} + A_{2,1}), (A_{1,1} + A_{2,2})}
(4) {A_{2,1}, (A_{1,1} + A_{2,2}), (A_{1,1} + A_{2,1} + A_{2,2})}
(5) {(A_{2,1} + A_{2,2}), (A_{1,1} + A_{1,2} + A_{2,2}), (A_{1,1} + A_{1,2} + A_{2,1})}
(6) {A_{1,2}, (A_{1,1} + A_{2,2}), (A_{1,1} + A_{1,2} + A_{2,2})}
(7) {(A_{1,2} + A_{2,2}), (A_{1,1} + A_{2,1} + A_{2,2}), (A_{1,1} + A_{1,2} + A_{2,1})}
(8) {A_{2,2}, (A_{1,2} + A_{2,1}), (A_{1,2} + A_{2,1} + A_{2,2})}

Corollary 4.5. Any 2 × 2 matrix multiplication algorithm in which a left hand (or right hand) multiplicand appears at least twice (modulo 2) requires 8 or more multiplications.

Proof. Immediate from Lemma 4.3 and Corollary 4.4, since their sets cover all possible linear sums of matrix elements, modulo 2. □

Fact 4.6. A simple counting argument shows that any 7 × 4 binary matrix with fewer than 10 non-zero entries has a duplicate row (modulo 2) or an all-zero row.

Lemma 4.7. Irrespective of basis transformations ϕ, ψ, υ, the encoding matrices U, V of a ⟨2, 2, 2; 7⟩_{ϕ,ψ,υ}-algorithm contain no duplicate rows.

Proof. Let ⟨U, V, W⟩ be encoding/decoding matrices of a ⟨2, 2, 2; 7⟩_{ϕ,ψ,υ}-algorithm. By Corollary 3.3, ⟨Uϕ, Vψ, Wυ^{-T}⟩ are encoding/decoding matrices of a ⟨2, 2, 2; 7⟩-algorithm. Assume, w.l.o.g., that U contains a duplicate row, which means that Uϕ contains a duplicate row as well. In that case, ⟨Uϕ, Vψ, Wυ^{-T}⟩ is a ⟨2, 2, 2; 7⟩-algorithm in which one of the encoding matrices contains a duplicate row, in contradiction to Corollary 4.5. □

Lemma 4.8. Irrespective of basis transformations ϕ, ψ, υ, the encoding matrices U, V of a ⟨2, 2, 2; 7⟩_{ϕ,ψ,υ}-algorithm have at least 10 non-zero entries each.

Proof. No row in U, V can be zeroed out, since otherwise the algorithm would require fewer than 7 multiplications. The result then follows from Fact 4.6 and Lemma 4.7. □
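The counting argument behind Fact 4.6 can be confirmed exhaustively in a few lines (ours, illustrative): the rows of a 7 × 4 binary matrix with no all-zero row and no duplicate rows are 7 distinct non-zero 4-bit patterns, and the 7 lightest such patterns already carry total weight 10:

```python
from itertools import combinations

popcount = lambda v: bin(v).count("1")
# choose 7 distinct non-zero 4-bit rows and minimize the total number of ones
min_weight = min(sum(map(popcount, rows))
                 for rows in combinations(range(1, 16), 7))
assert min_weight == 10        # four weight-1 rows + three weight-2 rows
```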
By Lemma 4.2, because Uϕ,Vψ,Wυ−T vendor-tuned library. Œis gives beŠer performance in practice due D −T E to two main reasons: (1) the asymptotic improvement of Strassen- de€ne a h2, 2, 2; 7i-algorithm, so does Wυ P2×2,Uϕ,VψP2×2 . like algorithms makes them faster than classical algorithms by mar- Hence Wυ−T P is an encoding matrix of a h2, 2, 2; 7i-algorithm 2×2 gins which increase with matrix size, and (2) vendor-tuned libraries which has duplicate or all zero rows, contradicting Corollary 4.5. have extensive built-in optimization, which makes them perform  beŠer than existing implementations of fast matrix multiplication Proof (of Theorem 1.2). Lemma 4.8 and 4.9 then show that algorithms on small matrices. each encoding/decoding matrix must contain at least 10 non-zero We next present theoretical analysis for €nding the optimal entries. By Remark 2.6, the number of additions used by each of the number of recursive steps without tuning. encoding matrix is nnz (U ) − rows (U ) (and analogously for V ) and Claim 6.1. Let ALG be an hn,n,n; ti-algorithm with q linear the number of additions used by the decoding is nnz (W ) −cols (W ). operations at the base case. Œe arithmetic complexity of running Hence, 12 additions are necessary for an h2, 2, 2; 7iϕ,ψ,υ -algorithm ALG for ` steps, then switching to classical matrix multiplication is: irrespective of basis transformations ϕ, ψ, υ.  ` q t F (n, `) = − 1 · n2+ 5 OPTIMAL ALTERNATIVE BASES ALG − 2 2 t n0 n0 ** + + To apply our alternative basis method to other Strassen-like matrix . 3 /2 multiplication algorithms, we €nd bases which reduce the number n n + t ` 2 ,, −- - of linear operations performed by the algorithm. As we mentioned ` ` n0 n0 in Fact 2.6, the non-zero entries of the encoding/decoding matri- .* * + * + /+ ces determine the number of linear operations performed by an Proof. Each step of the recursion, , - consists, - of- computing t sub- algorithm. Hence, we want our encoding/decoding matrices to be problems and performsq linear operations. Œerefore, FALG (n, `) = as sparse as possible, and ideally to have only entries of the form  n   n 2 3 2 t · FALG , ` − 1 + q 0 and FALG (n, 0) = 2n − n . Œus −1, 0 and 1. From Lemma 3.2 and Corollary 3.3 we see that any n0 n hn,m,k; ti-algorithm and dimension compatible basis transforma- `−1 2 X n n tions ϕ, ψ, υ can be composed into an hn,m,k; ti -algorithm. F (n, `) = tk · q · + t ` · F ϕ,ψ,υ ALG k+1 ALG ` Œerefore, the problem of €nding a basis in which a Strassen-like k=0 n0 n0 * + * + algorithm performs the least amount of linear operations is closely  ` t , − 1- 3, - 2 tied to the Matrix Sparsi€cation problem: q n2 n n = n2 0 + t ` 2 − n2 t − 1 ` ` Problem 5.1. Matrix Sparsi€cation Problem (MS): Let U be an 0 * n2 + n0 n0 . 0 / * + m × n matrix of full rank, €nd an A such that . / . * + * + / . ` / , - ,3 - 2 q t , n -n A = argmin (nnz (UA)) = , − 1- · n2 + t ` 2 − A∈GLn − 2 2 ` ` t n0 n0 n0 n0 .** + /+ .* * + * + /+ Œat is, €nding basis transformations for a Strassen-like algo-  rithm consists of three independent MS problems. Unfortunately, ,, - - , , - , - - MS is not only NP-Hard [28] to solve, but also NP-Hard to approxi- When running an alternative basis algorithm for a limited num- .5−o(1) mate to within a factor of 2log n [15] (Over Q, assuming NP ber of recursive steps, the basis transformation needs to be com- does not admit quasi-polynomial time deterministic algorithms). puted only for the same number of recursive steps. 
If the basis Œere seem to be very few heuristics for matrix sparsi€cation (e.g., transformation is computed for more steps than the alternative [9]), or algorithms under very limiting assumptions (e.g., [18]). basis multiplication, the classical algorithm will compute incorrect Nevertheless, for existing Strassen-like algorithms with small base results as it does not account for the input being represented in an cases, the use of search heuristics to €nd bases which signi€cantly alternative basis. Œis introduces a small saving in the runtime of sparsify the encoding/decoding matrices of several Strassen-like basis transformation. n ×n n ×n Claim 6.2. Let R be a ring and let ψ1 : R 0 0 → R 0 0 be an ∈ n×n k ≤ invertible linear map, let A R where n = n0 , and let ` k. Še arithmetic complexity of computing ψ` (A) is q F (n, `) = n2 · ` ψ 2 n0

Claim 6.2. Let R be a ring, let ψ1 : R^{n0×n0} → R^{n0×n0} be an invertible linear map, let A ∈ R^{n×n} where n = n0^k, and let ℓ ≤ k. The arithmetic complexity of computing ψ_ℓ(A) is

$$F_\psi(n, \ell) = \frac{q}{n_0^2}\, n^2 \cdot \ell$$

Proof. Let F_ψ(n, ℓ) be the number of additions required by ψ_ℓ. Each recursive step consists of computing n0² sub-problems and performing q linear operations. Therefore, F_ψ(n, ℓ) = n0² · F_ψ(n/n0, ℓ − 1) + q(n/n0)² and F_ψ(n, 0) = 0. Thus,

$$F_\psi(n, \ell) = \sum_{k=0}^{\ell-1} n_0^{2k}\, q\left(\frac{n}{n_0^{k+1}}\right)^2 = \frac{q}{n_0^2}\, n^2 \cdot \ell \qquad\Box$$

6.2 Performance experiments

We next present performance results for our ⟨2, 2, 2; 7⟩_{ψopt}-algorithm. All experiments were conducted on a single compute node of HLRS's Hazel Hen, with two 12-core (24 thread) Intel Xeon E5-2680 v3 CPUs and 128GB of memory.

We used a straightforward implementation of both our algorithm and Strassen-Winograd's algorithm [40], using OpenMP. Each algorithm runs for a pre-selected number of recursive steps before switching to Intel's MKL DGEMM routine. Each DGEMM call uses all threads, and matrix additions are always fully parallelized. All results are the median over 6 experiments.

In Figure 2 we see that our algorithm outperforms Strassen-Winograd's, with the margin of improvement increasing with each recursive step and nearing the theoretical improvement.

[Figure 2: Comparing the performance of our ⟨2, 2, 2; 7⟩_{ψopt}-algorithm to Strassen-Winograd's on square matrices of fixed dimension N = 32768. The graph shows our algorithm's runtime, normalized by Strassen-Winograd's algorithm's runtime, as a function of the number of recursive steps taken before switching to Intel's MKL DGEMM. The top horizontal line (at 1) represents Strassen-Winograd's performance, and the bottom horizontal line (at 0.83) represents the theoretical ratio when taking the maximal number of recursive steps.]

7 DISCUSSION

Our method obtains novel variants of existing Strassen-like algorithms, reducing the number of linear operations required. Our algorithm also outperforms Strassen-Winograd's algorithm for any matrix dimension n ≥ 32. Furthermore, we obtained an alternative basis version of Smirnov's ⟨6, 3, 3; 40⟩-algorithm, reducing the number of additions by 83.8%. While the problem of finding bases which optimally sparsify an algorithm's encoding/decoding matrices is NP-Hard (see Section 5), it is still solvable for many fast matrix multiplication algorithms with small base cases. Hence, finding basis transformations could be done in practice using search heuristics, leading to further improvements.

We leave large scale implementations for future research, but note that both kernels of our alternative basis algorithms (basis transformations and recursive-bilinear algorithms) are known to be highly parallelizable recursive divide-and-conquer algorithms, and admit various communication minimizing parallelization techniques (e.g., [1, 5]).

ACKNOWLEDGMENTS

Research is supported by grants 1878/14 and 1901/14 from the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities) and grant 3-10891 from the Ministry of Science and Technology, Israel. Research is also supported by the Einstein Foundation and the Minerva Foundation. This work was supported by the PetaCloud industry-academia consortium. This research was supported by a grant from the United States-Israel Binational Science Foundation (BSF), Jerusalem, Israel. This work was supported by the HUJI Cyber Security Research Center in conjunction with the Israel National Cyber Bureau in the Prime Minister's Office. We acknowledge PRACE for awarding us access to Hazel Hen at GCS@HLRS, Germany.

REFERENCES

[1] Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the 24th Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 193–204.
[2] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2012. Graph expansion and communication costs of fast matrix multiplication. Journal of the ACM (JACM) 59, 6 (2012), 32.
[3] Austin R. Benson and Grey Ballard. 2015.
A framework for practical parallel fast matrix multiplication. ACM SIGPLAN Notices 50, 8 (2015), 42–53.
[4] Gianfranco Bilardi and Lorenzo De Stefani. 2016. The I/O complexity of Strassen's matrix multiplication with recomputation. arXiv preprint arXiv:1605.02224 (2016).
[5] Gianfranco Bilardi, Michele Scquizzato, and Francesco Silvestri. 2012. A lower bound technique for communication on BSP with application to the FFT. In European Conference on Parallel Processing. Springer, 676–687.
[6] Dario Bini, Milvio Capovani, Francesco Romani, and Grazia Lotti. 1979. O(n^2.7799) complexity for n × n approximate matrix multiplication. Information Processing Letters 8, 5 (1979), 234–235.
[7] Marco Bodrato. 2010. A Strassen-like matrix multiplication suited for squaring and higher power computation. In Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation. ACM, 273–280.
[8] Murat Cenk and M. Anwar Hasan. 2017. On the arithmetic complexity of Strassen-like matrix multiplications. Journal of Symbolic Computation 80 (2017), 484–501.
[9] S. Frank Chang and S. Thomas McCormick. 1992. A hierarchical algorithm for making sparse matrices sparser. Mathematical Programming 56, 1 (1992), 1–30.
[10] Henry Cohn and Christopher Umans. 2003. A group-theoretic approach to fast matrix multiplication. In Foundations of Computer Science, 2003. Proceedings. 44th Annual IEEE Symposium on. IEEE, 438–449.
[11] James W. Cooley and John W. Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19, 90 (1965), 297–301.
[12] Don Coppersmith and Shmuel Winograd. 1982. On the asymptotic complexity of matrix multiplication. SIAM J. Comput. 11, 3 (1982), 472–492.
[13] Don Coppersmith and Shmuel Winograd. 1990. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation 9, 3 (1990), 251–280.
[14] Paolo D'Alberto, Marco Bodrato, and Alexandru Nicolau. 2011. Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 2.
[15] Lee-Ad Gottlieb and Tyler Neylon. 2010. Matrix sparsification and the sparse null space problem. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer, 205–218.
[16] Brian Grayson and Robert Van De Geijn. 1996. A high performance parallel Strassen implementation. Parallel Processing Letters 6, 01 (1996), 3–12.
[17] Vince Grolmusz. 2008. Modular representations of polynomials: Hyperdense coding and fast matrix multiplication. IEEE Transactions on Information Theory 54, 8 (2008), 3687–3692.
[18] Alan J. Hoffman and S. T. McCormick. 1984. A fast algorithm that makes matrices optimally sparse. Progress in Combinatorial Optimization (1984), 185–196.
[19] John Hopcroft and Jean Musinski. 1973. Duality applied to the complexity of matrix multiplications and other bilinear forms. In Proceedings of the Fifth Annual ACM Symposium on Theory of Computing. ACM, 73–87.
[20] John E. Hopcroft and Leslie R. Kerr. 1971. On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20, 1 (1971), 30–36.
[21] Igor Kaporin. 2004. The aggregation and cancellation techniques as a practical tool for faster matrix multiplication. Theoretical Computer Science 315, 2-3 (2004), 469–510.
[22] Donald E. Knuth. 1981. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA.
[23] Bharat Kumar, C.-H. Huang, P. Sadayappan, and Rodney W. Johnson. 1995. A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction. Scientific Programming 4, 4 (1995), 275–289.
[24] Julian Laderman, Victor Pan, and Xuan-He Sha. 1992. On practical algorithms for accelerated matrix multiplication. Linear Algebra and Its Applications 162 (1992), 557–588.
[25] Julian D. Laderman. 1976. A noncommutative algorithm for multiplying 3 × 3 matrices using 23 multiplications. In Am. Math. Soc, Vol. 82. 126–128.
[26] François Le Gall. 2014. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation. ACM, 296–303.
[27] Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz. 2012. Communication-avoiding parallel Strassen: Implementation and performance. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 101.
[28] S. Thomas McCormick. 1983. A Combinatorial Approach to Some Sparse Matrix Problems. Technical Report. DTIC Document.
[29] V. Ya. Pan. 1978. Strassen's algorithm is not optimal: trilinear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations. In Foundations of Computer Science, 1978., 19th Annual Symposium on. IEEE, 166–176.
[30] V. Ya. Pan. 1982. Trilinear aggregating with implicit canceling for a new acceleration of matrix multiplication. Computers & Mathematics with Applications 8, 1 (1982), 23–34.
[31] Robert L. Probert. 1976. On the additive complexity of matrix multiplication. SIAM J. Comput. 5, 2 (1976), 187–203.
[32] Francesco Romani. 1982. Some properties of disjoint sums of tensors related to matrix multiplication. SIAM J. Comput. 11, 2 (1982), 263–267.
[33] Arnold Schönhage. 1981. Partial and total matrix multiplication. SIAM J. Comput. 10, 3 (1981), 434–455.
[34] Jacob Scott, Olga Holtz, and Oded Schwartz. 2015. Matrix multiplication I/O-complexity by path routing. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 35–45.
[35] A. V. Smirnov. 2013. The bilinear complexity and practical algorithms for matrix multiplication. Computational Mathematics and Mathematical Physics 53, 12 (2013), 1781–1795.
[36] Andrew James Stothers. 2010. On the complexity of matrix multiplication. (2010).
[37] Volker Strassen. 1969. Gaussian elimination is not optimal. Numerische Mathematik 13, 4 (1969), 354–356.
[38] Volker Strassen. 1986. The asymptotic spectrum of tensors and the exponent of matrix multiplication. In Foundations of Computer Science, 1986., 27th Annual Symposium on. IEEE, 49–54.
[39] Virginia Vassilevska Williams. 2012. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing. ACM, 887–898.
[40] Shmuel Winograd. 1971. On multiplication of 2 × 2 matrices. Linear Algebra and Its Applications 4, 4 (1971), 381–388.