Matrix Multiplication, a Little Faster
Elaye Karstadt, The Hebrew University of Jerusalem, [email protected]
Oded Schwartz, The Hebrew University of Jerusalem, [email protected]
ABSTRACT
Strassen's algorithm (1969) was the first sub-cubic matrix multiplication algorithm. Winograd (1971) improved its complexity by a constant factor. Many asymptotic improvements followed. Unfortunately, most of them have done so at the cost of very large, often gigantic, hidden constants. Consequently, Strassen-Winograd's O(n^{log_2 7}) algorithm often outperforms other matrix multiplication algorithms for all feasible matrix dimensions. The leading coefficient of Strassen-Winograd's algorithm was believed to be optimal for matrix multiplication algorithms with 2 × 2 base case, due to a lower bound of Probert (1976).

Surprisingly, we obtain a faster matrix multiplication algorithm, with the same base case size and asymptotic complexity as Strassen-Winograd's algorithm, but with the coefficient reduced from 6 to 5. To this end, we extend Bodrato's (2010) method for matrix squaring, and transform matrices to an alternative basis. We prove a generalization of Probert's lower bound that holds under change of basis, showing that for matrix multiplication algorithms with a 2 × 2 base case, the leading coefficient of our algorithm cannot be further reduced, and is hence optimal. We apply our technique to other Strassen-like algorithms, improving their arithmetic and communication costs by significant constant factors.

CCS CONCEPTS
• Mathematics of computing → Computations on matrices; • Computing methodologies → Linear algebra algorithms;

KEYWORDS
Fast Matrix Multiplication, Bilinear Algorithms

1 INTRODUCTION
Strassen's algorithm [37] was the first sub-cubic matrix multiplication algorithm, with complexity O(n^{log_2 7}). Winograd [40] reduced the leading coefficient from 7 to 6 by decreasing the number of additions and subtractions from 18 to 15. In practice, Strassen-Winograd's algorithm often performs better than some asymptotically faster algorithms [3] due to these smaller hidden constants.

The leading coefficient of Strassen-Winograd's algorithm was believed to be optimal, due to a lower bound on the number of additions^1 for matrix multiplication algorithms with 2 × 2 base case, obtained by Probert [31].

We obtain a method for improving the practical performance of Strassen and Strassen-like fast matrix multiplication algorithms by improving the hidden constants inside the O-notation. To this end, we extend Bodrato's (2010) method for matrix squaring, and transform matrices to an alternative basis.

^1 From here on, when referring to addition and subtraction count, we say additions.

1.1 Strassen-like Algorithms
Strassen-like algorithms are a class of divide-and-conquer algorithms which utilize a base ⟨n_0, m_0, k_0; t⟩-algorithm: multiplying an n_0 × m_0 matrix by an m_0 × k_0 matrix using t scalar multiplications, where n_0, m_0, k_0 and t are positive integers. When multiplying an n × m matrix by an m × k matrix, the algorithm splits them into blocks (each of size n/n_0 × m/m_0 and m/m_0 × k/k_0, respectively), and works block-wise, according to the base algorithm. Additions and multiplications by scalar in the base algorithm are interpreted as block-wise additions. Multiplications in the base algorithm are interpreted as block-wise multiplications, via recursion. We refer to a Strassen-like algorithm by its base case. Hence, an ⟨n, m, k; t⟩-algorithm may refer to either the algorithm's base case or the corresponding block recursive algorithm, as obvious from context. (A sketch of this recursive structure is given below.)
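To make the block-recursive structure concrete, here is a minimal numpy sketch of a Strassen-like algorithm, using Strassen's classical ⟨2, 2, 2; 7⟩ base case [37] for concreteness. It assumes square matrices whose dimension is a power of two, and is written for clarity, not performance.

```python
import numpy as np

def strassen(A, B):
    """Block-recursive <2,2,2;7>-algorithm (Strassen's base case [37])."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 7 multiplications of the base case become block multiplications
    # via recursion; its additions become block-wise additions.
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(strassen(A, B), A @ B)
```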
1.2 Known Strassen-like algorithms
Since Strassen's original discovery, many fast matrix multiplication algorithms followed and improved the asymptotic complexity [6, 12, 13, 26, 29, 32, 33, 36, 38, 39]. Some of these improvements have come at the cost of very large, often gigantic, hidden constants. Le Gall [26] estimated that even if matrix multiplication could be done in O(n^2) arithmetic operations, it is unlikely to be applicable, as the base case sizes would have to be astronomical.

Recursive fast matrix multiplication algorithms with reasonable base case size, for both square and rectangular matrices, have been discovered [3, 20, 24, 25, 29, 30, 35]. Thus, they have manageable hidden constants, and some of them are asymptotically faster than Strassen's algorithm. Still, many fast matrix multiplication algorithms fail to compete with Strassen's in practice due to their hidden constants, although some have achieved competitive performance (e.g., Kaporin's [21] implementation of Laderman et al.'s algorithm [24]).

Recently, Smirnov presented several fast matrix multiplication algorithms derived by computer aided optimization tools [35], including a ⟨6, 3, 3; 40⟩-algorithm with asymptotic complexity of O(n^{3·log_{54} 40}), faster than Strassen's algorithm. Ballard and Benson [3] later presented several additional fast Strassen-like algorithms, found using computer aided optimization tools as well. They implemented several Strassen-like algorithms, including Smirnov's ⟨6, 3, 3; 40⟩-algorithm, on a shared-memory architecture, in order to demonstrate that Strassen and Strassen-like algorithms can outperform classical matrix multiplication in practice (such as Intel's MKL), on modestly sized problems (at least up to n = 13000). Their experiments also showed Strassen's algorithm outperforming Smirnov's algorithm in some of the cases.
1.3 Previous work
Bodrato [7] introduced the intermediate representation method, for repeated squaring and for chain matrix multiplication computations. This enables decreasing the number of additions between consecutive multiplications. Thus, he obtained an algorithm with a 2 × 2 base case, which uses 7 multiplications, and has a leading coefficient of 5 for chain multiplication and for repeated squaring, for every multiplication outside the first one. Bodrato also presented an invertible linear function which recursively transforms a 2^k × 2^k matrix to and from the intermediate representation. While this is not the first time that linear transformations are applied to matrix multiplication, the main focus of previous research on the subject was on improving asymptotic performance rather than on reducing the number of additions [10, 17].

Very recently, Cenk and Hasan [8] showed a clever way to apply Strassen-Winograd's algorithm directly to n × n matrices by forsaking the uniform divide-and-conquer pattern of Strassen-like algorithms. Instead, their algorithm splits Strassen-Winograd's algorithm into two linear divide-and-conquer algorithms which recursively perform all pre-computations, followed by vector multiplication of their results, and finally performs linear post-computations to calculate the output. Their method enables reuse of sums, resulting in a matrix multiplication algorithm with arithmetic complexity of 5n^{log_2 7} + 0.5n^{log_2 6} + 2n^{log_2 5} − 6.5n^2. However, this comes at the cost of increased communication costs and memory footprint.
1.4 Our contribution
We present the Alternative Basis Matrix Multiplication method, and show how to apply it to existing Strassen-like algorithms (see Section 3). While basis transformation is, in general, as expensive as matrix multiplication, some basis transformations can be performed very fast (e.g., the Hadamard transform, in O(n^2 log n), using FFT [11]). Fortunately, this is also the case for our basis transformations (see Section 3.1). Thus, it is a worthwhile trade-off: reducing the leading coefficient in exchange for an asymptotically insignificant overhead (see Section 3.2). We provide an analysis of how these constants are affected, and of the impact on both arithmetic and IO-complexity.

We discuss the problem of finding alternative bases to improve Strassen-like algorithms (see Section 5), and present several improved variants of existing algorithms, most notable of which are the alternative basis variant of Strassen's ⟨2, 2, 2; 7⟩-algorithm, which reduces the number of additions from 15 to 12 (see Section 3.3), and the variant of Smirnov's ⟨6, 3, 3; 40⟩-algorithm, with leading coefficient reduced by about 83.2%.^2

^2 Files containing the encoding/decoding matrices and the corresponding basis transformations can be found at https://github.com/elayeek/matmultfaster.

The leading coefficient of Strassen-Winograd's algorithm was believed to be optimal, due to the following lower bound:

Theorem 1.1 (Probert's lower bound). [31] 15 additions are necessary for any ⟨2, 2, 2; 7⟩-algorithm.

Our result seemingly contradicts Probert's lower bound. However, his bound implicitly assumes that the input and output are represented in the standard basis, thus there is no contradiction. We extend Probert's lower bound to account for alternative bases (see Section 4):

Theorem 1.2 (Basis invariant lower bound). 12 additions are necessary for any matrix multiplication algorithm that uses a recursive-bilinear algorithm with a 2 × 2 base case with 7 multiplications, regardless of basis.

Our alternative basis variant of Strassen's algorithm performs 12 additions in the base case, matching the lower bound in Theorem 1.2. Hence, it is optimal.

2 PRELIMINARIES
2.1 The communication bottleneck
Fast matrix multiplication algorithms have lower IO-complexity than the classical algorithm. That is, they communicate asymptotically less data within the memory hierarchy and between processors. The IO-complexity is measured as a function of the number of processors P, the local memory size M, and the matrix dimension n. Namely, the communication costs of a parallel ⟨n_0, n_0, n_0; t⟩-algorithm are Θ((n/√M)^{log_{n_0} t} · M/P) [1, 2, 4, 34]. Thus, parallel versions of Strassen's algorithm which minimize communication cost outperform the well-tuned classical matrix multiplication in practice, both on shared-memory [3, 14, 23] and distributed-memory architectures [1, 16, 27].

Our ⟨2, 2, 2; 7⟩-algorithm not only reduces the arithmetic complexity by 16.66%, but also the IO-complexity by 20%, compared to Strassen-Winograd's algorithm. Hence, the performance gain should be in the range of 16-20% on a shared-memory machine.

2.2 Encoding and Decoding matrices
Fact 2.1. Let R be a ring, and let f: R^n × R^m → R^k be a bilinear function which performs t multiplications. There exist U ∈ R^{t×n}, V ∈ R^{t×m}, W ∈ R^{t×k} such that

  ∀x ∈ R^n, y ∈ R^m:  f(x, y) = W^T((U·x) ⊙ (V·y))

where ⊙ is the element-wise vector product (Hadamard product).

Definition 2.2 (Encoding/Decoding matrices). We refer to the ⟨U, V, W⟩ of a recursive-bilinear algorithm as its encoding/decoding matrices (where U, V are the encoding matrices and W is the decoding matrix).

Definition 2.3. Let R be a ring, and let A ∈ R^{n×m} be a matrix. We denote the vectorization of A by Ā.^3 We use the notation U_{ℓ,(i,j)} when referring to the element in the ℓ'th row of U, in the column corresponding to the index (i, j) in the vectorization of A. For ease of notation, we sometimes write A_{i,j} rather than Ā_{(i,j)}.

^3 All basis transformations and encoding/decoding matrices assume row-ordered vectorization of matrices.
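As an illustration of Fact 2.1, the following numpy sketch evaluates a bilinear function from its encoding/decoding matrices. The example data are Strassen's classical 2 × 2 encodings in row-ordered vectorization, transcribed here for illustration (they are not the matrices introduced later in this paper).

```python
import numpy as np

def bilinear(U, V, W, x, y):
    """f(x, y) = W^T((U x) o (V y)), as in Fact 2.1."""
    return W.T @ ((U @ x) * (V @ y))

# Strassen's <2,2,2;7> encodings; rows = products, columns = vectorization
# order (1,1), (1,2), (2,1), (2,2).
U = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
              [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]], dtype=float)
V = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
              [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
W = np.array([[1, 0, 0, 1], [0, 0, 1, -1], [0, 1, 0, 1], [1, 0, 1, 0],
              [-1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]], dtype=float)

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
C = bilinear(U, V, W, A.reshape(4), B.reshape(4)).reshape(2, 2)
assert np.allclose(C, A @ B)  # vec(AB) = W^T((U vec A) o (V vec B))
```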
Table 1: ⟨2, 2, 2; 7⟩-algorithms
Algorithm              | Additions | Arithmetic Computations             | I/O-Complexity
Strassen [37]          | 18        | 7n^{log_2 7} − 6n^2                 | 6·(√3·n/√M)^{log_2 7}·M − 18n^2 + 3M
Strassen-Winograd [40] | 15        | 6n^{log_2 7} − 5n^2                 | 5·(√3·n/√M)^{log_2 7}·M − 15n^2 + 3M
Ours                   | 12        | 5n^{log_2 7} − 4n^2 + 3n^2·log_2 n  | 4·(√3·n/√M)^{log_2 7}·M − 12n^2 + 3n^2·log_2(√2·n/√M) + 5M
Table 2: Alternative Basis Algorithms
Algorithm          | Linear Operations | Improved Linear Operations | Arithmetic Leading Coefficient | Improved Leading Coefficient | Computations Saved
⟨2, 2, 2; 7⟩ [40]  | 15                | 12                         | 6                              | 5                            | 16.6%
⟨3, 2, 3; 15⟩ [3]  | 64                | 52                         | 15.06                          | 7.94                         | 47.3%
⟨2, 3, 4; 20⟩ [3]  | 78                | 58                         | 9.96                           | 7.46                         | 25.6%
⟨3, 3, 3; 23⟩ [3]  | 87                | 75                         | 8.91                           | 6.57                         | 26.3%
⟨6, 3, 3; 40⟩ [35] | 1246              | 202                        | 55.63                          | 9.39                         | 83.2%
Fact 2.4 (Triple product condition). [22] Let R be a ring, and let U ∈ R^{t×n·m}, V ∈ R^{t×m·k}, W ∈ R^{t×n·k}. ⟨U, V, W⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm if and only if:

  ∀i_1, i_2 ∈ [n], k_1, k_2 ∈ [m], j_1, j_2 ∈ [k]:
  Σ_{r=1}^{t} U_{r,(i_1,k_1)} · V_{r,(k_2,j_1)} · W_{r,(i_2,j_2)} = δ_{i_1,i_2} · δ_{k_1,k_2} · δ_{j_1,j_2}

where δ_{i,j} = 1 if i = j, and 0 otherwise.

Notation 2.5. Denote the number of non-zero entries in a matrix by nnz(A), and the number of rows/columns by rows(A), cols(A).

Remark 2.6. The number of linear operations used by a bilinear algorithm is determined by its encoding/decoding matrices. The number of additions performed by each of the encodings is:

  Additions_U = nnz(U) − rows(U)
  Additions_V = nnz(V) − rows(V)

The number of additions performed by the decoding is:

  Additions_W = nnz(W) − cols(W)

The number of scalar multiplications performed by each of the encodings/decoding is equal to the total number of matrix entries which are not 1, −1, or 0.

3 ALTERNATIVE BASIS MATRIX MULTIPLICATION
Fast matrix multiplication algorithms are bilinear computations. The number of operations performed in the linear phases of such algorithms (the application of their encoding/decoding matrices ⟨U, V, W⟩, in the case of matrix multiplication; see Definition 2.2) depends on the basis of representation. In this section, we detail how alternative basis algorithms work, and address the effects of using alternative bases on arithmetic complexity and IO-complexity.

Definition 3.1. Let R be a ring, and let φ, ψ, υ be automorphisms of R^{n·m}, R^{m·k}, R^{n·k} (respectively). We denote a Strassen-like algorithm which takes φ(A), ψ(B) as inputs and outputs υ(A·B) using t multiplications by ⟨n, m, k; t⟩_{φ,ψ,υ}-algorithm. If n = m = k and φ = ψ = υ, we use the notation ⟨n, n, n; t⟩_φ-algorithm. This notation extends the ⟨n, m, k; t⟩-algorithm notation, as the latter applies when the three basis transformations are the identity map.

Given a recursive-bilinear ⟨n, m, k; t⟩_{φ,ψ,υ}-algorithm, ALG, alternative basis matrix multiplication works as follows:

Algorithm 1 Alternative Basis Matrix Multiplication Algorithm
Input: A ∈ R^{n×m}, B ∈ R^{m×k}
Output: n × k matrix C = A · B
1: function ABS(A, B)
2:   Ã = φ(A)            ▷ R^{n×m} basis transformation
3:   B̃ = ψ(B)            ▷ R^{m×k} basis transformation
4:   C̃ = ALG(Ã, B̃)       ▷ ⟨n, m, k; t⟩_{φ,ψ,υ}-algorithm
5:   C = υ^{−1}(C̃)       ▷ R^{n×k} basis transformation
6:   return C

Lemma 3.2. Let R be a ring, let ⟨U, V, W⟩ be the encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm, and let φ, ψ, υ be automorphisms of R^{n·m}, R^{m·k}, R^{n·k} (respectively). Then ⟨Uφ^{−1}, Vψ^{−1}, Wυ^T⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩_{φ,ψ,υ}-algorithm.

Proof. ⟨U, V, W⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm. Hence, for any A ∈ R^{n×m}, B ∈ R^{m×k}:

  W^T((U·Ā) ⊙ (V·B̄)) = (A·B)‾

Hence,

  υ((A·B)‾) = υ(W^T((U·Ā) ⊙ (V·B̄))) = (Wυ^T)^T((Uφ^{−1}·φ(Ā)) ⊙ (Vψ^{−1}·ψ(B̄)))  □

Corollary 3.3. Let R be a ring, and let φ, ψ, υ be automorphisms of R^{n·m}, R^{m·k}, R^{n·k} (respectively). ⟨U, V, W⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩_{φ,ψ,υ}-algorithm if and only if ⟨Uφ, Vψ, Wυ^{−T}⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm.
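The triple product condition can be checked mechanically. Below is a brute-force numpy sketch of such a check (our helper, written for this presentation). By Corollary 3.3, it can also validate an ⟨n, m, k; t⟩_{φ,ψ,υ}-algorithm, by applying it to ⟨Uφ, Vψ, Wυ^{−T}⟩; for instance, it returns True on the Strassen matrices in the earlier sketch with n = m = k = 2.

```python
import itertools
import numpy as np

def is_mm_algorithm(U, V, W, n, m, k):
    """Brute-force test of the triple product condition (Fact 2.4).
    Row-ordered vectorization: index (i, j) of an a x b matrix -> i*b + j."""
    t = U.shape[0]
    for i1, i2, k1, k2, j1, j2 in itertools.product(
            range(n), range(n), range(m), range(m), range(k), range(k)):
        s = sum(U[r, i1 * m + k1] * V[r, k2 * k + j1] * W[r, i2 * k + j2]
                for r in range(t))
        delta = float(i1 == i2 and k1 == k2 and j1 == j2)
        if not np.isclose(s, delta):
            return False
    return True
```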
3.1 Fast basis transformation
Definition 3.4. Let R be a ring, and let ψ_1: R^{n_0×m_0} → R^{n_0×m_0} be a linear map. We recursively define a linear map ψ_{k+1}: R^{n×m} → R^{n×m} (where n = n_0^{ℓ_1}, m = m_0^{ℓ_2} for some ℓ_1, ℓ_2 ≤ k + 1) by (ψ_{k+1}(A))_{i,j} = ψ_k((ψ_1(A))_{i,j}), where the A_{i,j} are n/n_0 × m/m_0 sub-matrices. Note that ψ_{k+1} is a linear map. For convenience, we omit the subscript of ψ when obvious from context.

Claim 3.5. Let R be a ring, let ψ_1: R^{n_0×m_0} → R^{n_0×m_0} be a linear map, and let A ∈ R^{n×m} (where n = n_0^{k+1}, m = m_0^{k+1}). Define Ã by Ã_{i,j} = ψ_k(A_{i,j}). Then ψ_1(Ã) = ψ_{k+1}(A).

Proof. ψ_1 is a linear map. Hence, for any i ∈ [n_0], j ∈ [m_0], (ψ_1(A))_{i,j} is a linear sum of blocks of A. Therefore, there exist scalars x^{(i,j)}_{r,ℓ} (r ∈ [n_0], ℓ ∈ [m_0]) such that

  (ψ_{k+1}(A))_{i,j} = ψ_k((ψ_1(A))_{i,j}) = ψ_k(Σ_{r,ℓ} x^{(i,j)}_{r,ℓ} · A_{r,ℓ})

By linearity of ψ_k,

  = Σ_{r,ℓ} x^{(i,j)}_{r,ℓ} · ψ_k(A_{r,ℓ}) = (ψ_1(Ã))_{i,j}  □

Claim 3.6. Let R be a ring, let ψ_1: R^{n_0×m_0} → R^{n_0×m_0} be an invertible linear map, and let ψ_{k+1} be as defined above. Then ψ_{k+1} is invertible, and its inverse is given by (ψ^{−1}_{k+1}(A))_{i,j} = ψ^{−1}_k((ψ^{−1}_1(A))_{i,j}).

Proof. Define Ã by Ã_{i,j} = ψ^{−1}_k(A_{i,j}), and define ψ^{−1}_{k+1} by (ψ^{−1}_{k+1}(A))_{i,j} = ψ^{−1}_k((ψ^{−1}_1(A))_{i,j}). Then:

  (ψ^{−1}_{k+1}(ψ_{k+1}(A)))_{i,j} = ψ^{−1}_k((ψ^{−1}_1(ψ_{k+1}(A)))_{i,j})

By Claim 3.5,

  = ψ^{−1}_k((ψ^{−1}_1(ψ_1(Ã)))_{i,j}) = ψ^{−1}_k(Ã_{i,j})

and, by the definition of Ã,

  = ψ^{−1}_k(ψ_k(A_{i,j})) = A_{i,j}  □

We next analyze the arithmetic complexity and IO-complexity of fast basis transformations. For convenience and readability, we present here the square case only. The analysis for rectangular matrices is similar.

Claim 3.7. Let R be a ring, let ψ_1: R^{n_0×n_0} → R^{n_0×n_0} be a linear map, and let A ∈ R^{n×n} where n = n_0^k. The arithmetic complexity of computing ψ(A) is

  F_ψ(n) = (q/n_0^2) · n^2 · log_{n_0} n

where q is the number of linear operations performed by ψ_1.

Proof. Let F_ψ(n) be the number of additions required by ψ. Each step of the recursion consists of computing n_0^2 sub-problems and performs q additions of sub-matrices. Therefore, F_ψ(n) = n_0^2 · F_ψ(n/n_0) + q·(n/n_0)^2 and F_ψ(1) = 0. Thus,

  F_ψ(n) = Σ_{k=0}^{log_{n_0}(n)−1} n_0^{2k} · q · (n/n_0^{k+1})^2
         = n^2 · (q/n_0^2) · Σ_{k=0}^{log_{n_0}(n)−1} 1
         = (q/n_0^2) · n^2 · log_{n_0} n  □

Claim 3.8. Let R be a ring, let ψ_1: R^{n_0×n_0} → R^{n_0×n_0} be a linear map, and let A ∈ R^{n×n} where n = n_0^k. The IO-complexity of computing ψ(A) is

  IO_ψ(n, M) ≤ (3q/n_0^2) · n^2 · log_{n_0}(√2·n/√M) + 2M

where q is the number of linear operations performed by ψ_1.

Proof. Each step of the recursion consists of computing n_0^2 sub-problems and performs q linear operations. The base case occurs when the problem fits entirely in the fast memory (or in the local memory, in the parallel setting), namely when 2n^2 ≤ M. Each addition requires at most 3 data transfers (one for each input and one for writing the output). Hence, a basis transformation which performs q linear operations at each recursive step has the recurrence:

  IO_ψ(n, M) ≤ { n_0^2 · IO_ψ(n/n_0, M) + 3q·(n/n_0)^2   if 2n^2 > M
              { 2M                                        otherwise

Therefore,

  IO_ψ(n, M) ≤ Σ_{k=0}^{log_{n_0}(√2·n/√M)−1} n_0^{2k} · 3q · (n/n_0^{k+1})^2 + 2M
            = (3q/n_0^2) · n^2 · log_{n_0}(√2·n/√M) + 2M  □
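The recursion of Definition 3.4 translates directly into code. The following numpy sketch (our helper, under the paper's assumptions: n a power of n_0, row-ordered vectorization, float entries) applies a base map ψ_1, given as an (n_0·n_0) × (n_0·n_0) matrix, recursively:

```python
import numpy as np

def transform(A, psi1, n0):
    """Recursively apply the linear map of Definition 3.4 to an n x n matrix
    A (n a power of n0, float dtype). psi1 is an (n0*n0) x (n0*n0) matrix
    acting on the row-ordered vectorization of the n0 x n0 block grid."""
    n = A.shape[0]
    if n == 1:
        return A
    b = n // n0
    blocks = [A[(p // n0) * b:(p // n0 + 1) * b, (p % n0) * b:(p % n0 + 1) * b]
              for p in range(n0 * n0)]
    out = np.empty_like(A)
    for p in range(n0 * n0):
        # One step of psi1 on sub-matrices, then recursion into the result
        # (by Claim 3.5, the order of the two steps is interchangeable).
        mixed = sum(psi1[p, q] * blocks[q] for q in range(n0 * n0))
        i, j = divmod(p, n0)
        out[i * b:(i + 1) * b, j * b:(j + 1) * b] = transform(mixed, psi1, n0)
    return out
```

By Claim 3.6, the inverse transform is obtained by the same routine with the inverse of psi1.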
3.2 Computing matrix multiplication in alternative basis
Claim 3.9. Let φ_1, ψ_1, υ_1 be automorphisms of R^{n_0·m_0}, R^{m_0·k_0}, R^{n_0·k_0} (respectively), and let ALG be an ⟨n_0, m_0, k_0; t⟩_{φ_1,ψ_1,υ_1}-algorithm. For any A ∈ R^{n×m}, B ∈ R^{m×k}:

  ALG(φ_ℓ(A), ψ_ℓ(B)) = υ_ℓ(A·B)

where n = n_0^ℓ, m = m_0^ℓ, k = k_0^ℓ.

Proof. Denote C̃ = ALG(φ_{ℓ+1}(A), ψ_{ℓ+1}(B)), and denote the encoding/decoding matrices of ALG by ⟨U, V, W⟩. We prove by induction on ℓ that C̃ = υ_ℓ(A·B). For r ∈ [t], denote

  S_r = Σ_{i∈[n_0], j∈[m_0]} U_{r,(i,j)} · (φ_1(A))_{i,j},    T_r = Σ_{i∈[m_0], j∈[k_0]} V_{r,(i,j)} · (ψ_1(B))_{i,j}

The base case, ℓ = 1, holds by Lemma 3.2, since ALG is an ⟨n_0, m_0, k_0; t⟩_{φ_1,ψ_1,υ_1}-algorithm. Note that this means that for any i ∈ [n_0], j ∈ [k_0],

  (υ_1(A·B))_{i,j} = (W^T((U·φ_1(Ā)) ⊙ (V·ψ_1(B̄))))_{i,j} = Σ_{r∈[t]} W_{r,(i,j)} · (S_r · T_r)

Next, we assume the claim holds for ℓ ∈ N, and show it for ℓ + 1. Given inputs Ã = φ_{ℓ+1}(A), B̃ = ψ_{ℓ+1}(B), ALG performs t multiplications P_1, ..., P_t. For each multiplication P_r, its left hand side multiplicand is of the form

  L_r = Σ_{i∈[n_0], j∈[m_0]} U_{r,(i,j)} · Ã_{i,j}

By Definition 3.4, (φ_{ℓ+1}(A))_{i,j} = φ_ℓ((φ_1(A))_{i,j}). Hence, from linearity of φ_ℓ,

  L_r = Σ_{i,j} U_{r,(i,j)} · φ_ℓ((φ_1(A))_{i,j}) = φ_ℓ(Σ_{i,j} U_{r,(i,j)} · (φ_1(A))_{i,j}) = φ_ℓ(S_r)

Similarly, the right hand side multiplicand R_r is of the form R_r = ψ_ℓ(T_r). Note that for any r ∈ [t], S_r and T_r are n_0^ℓ × m_0^ℓ and m_0^ℓ × k_0^ℓ matrices, respectively. Hence, by the induction hypothesis,

  P_r = ALG(φ_ℓ(S_r), ψ_ℓ(T_r)) = υ_ℓ(S_r · T_r)

Each entry in the output C̃ is of the form:

  C̃_{i,j} = Σ_{r∈[t]} W_{r,(i,j)} · P_r = Σ_{r∈[t]} W_{r,(i,j)} · υ_ℓ(S_r · T_r) = υ_ℓ(Σ_{r∈[t]} W_{r,(i,j)} · (S_r · T_r))

by linearity of υ_ℓ. And, as noted in the base case,

  C̃_{i,j} = υ_ℓ((υ_1(A·B))_{i,j})

Therefore, by Definition 3.4, C̃ = υ_{ℓ+1}(A·B).  □

Notation 3.10. When discussing an ⟨n_0, n_0, n_0; t⟩_{φ,ψ,υ}-algorithm, we denote ω_0 = log_{n_0} t.

Claim 3.11. Let ALG be an ⟨n_0, n_0, n_0; t⟩_{φ,ψ,υ}-algorithm which performs q linear operations at its base case. The arithmetic complexity of ALG is

  F_ALG(n) = (1 + q/(t − n_0^2)) · n^{ω_0} − (q/(t − n_0^2)) · n^2

Proof. Each step of the recursion consists of computing t sub-problems and performs q linear operations (additions/multiplications by scalar) on sub-matrices. Therefore, F_ALG(n) = t·F_ALG(n/n_0) + q·(n/n_0)^2 and F_ALG(1) = 1. Thus,

  F_ALG(n) = Σ_{k=0}^{log_{n_0}(n)−1} t^k · q · (n/n_0^{k+1})^2 + t^{log_{n_0} n} · F_ALG(1)
           = (q/n_0^2) · n^2 · Σ_{k=0}^{log_{n_0}(n)−1} (t/n_0^2)^k + n^{ω_0}
           = (q/n_0^2) · n^2 · ((t/n_0^2)^{log_{n_0} n} − 1)/(t/n_0^2 − 1) + n^{ω_0}
           = (q/(t − n_0^2)) · (n^{ω_0} − n^2) + n^{ω_0}  □

Claim 3.12. Let ALG be an ⟨n_0, n_0, n_0; t⟩_{φ,ψ,υ}-algorithm which performs q linear operations at its base case. The IO-complexity of ALG is

  IO_ALG(n, M) ≤ (q/(t − n_0^2)) · (M·(√3·n/√M)^{ω_0} − 3n^2) + 3M

Proof. Each step of the recursion consists of computing t sub-problems and performs q linear operations. The base case occurs when the problem fits entirely in the fast memory (or in the local memory, in the parallel setting), namely when 3n^2 ≤ M. Each addition requires at most 3 data transfers (one for each input and one for writing the output). Hence, a recursive-bilinear algorithm which performs q linear operations at each recursive step has the recurrence:

  IO_ALG(n, M) ≤ { t · IO_ALG(n/n_0, M) + 3q·(n/n_0)^2   if 3n^2 > M
                { 3M                                      otherwise

Therefore,

  IO_ALG(n, M) ≤ Σ_{k=0}^{log_{n_0}(√3·n/√M)−1} t^k · 3q · (n/n_0^{k+1})^2 + 3M
             = (3q/n_0^2) · n^2 · ((t/n_0^2)^{log_{n_0}(√3·n/√M)} − 1)/(t/n_0^2 − 1) + 3M
             = (q/(t − n_0^2)) · (M·(√3·n/√M)^{ω_0} − 3n^2) + 3M  □

Corollary 3.13. If ABS (Algorithm 1) performs q linear operations at the base case, then its arithmetic complexity is

  F_ABS(n) = (1 + q/(t − n_0^2)) · n^{ω_0} − (q/(t − n_0^2)) · n^2 + O(n^2 log n)

Proof. The number of flops performed by the algorithm is the sum of: (1) the number of flops performed by the basis transformations (denoted φ, ψ, υ), and (2) the number of flops performed by the recursive-bilinear algorithm ALG:

  F_ABS(n) = F_ALG(n) + F_φ(n) + F_ψ(n) + F_υ(n)

The result follows immediately from Claim 3.7 and Claim 3.11.  □

Corollary 3.14. If ABS (Algorithm 1) performs q linear operations at the base case, then its IO-complexity is

  IO_ABS(n, M) ≤ (q/(t − n_0^2)) · (M·(√3·n/√M)^{ω_0} − 3n^2) + 3M + O(n^2 log(n/√M))

Proof. The IO-complexity is the sum of: (1) the IO-complexity of the recursive-bilinear algorithm ALG, and (2) the IO-complexity of the basis transformations (denoted φ, ψ, υ):

  IO_ABS(n, M) = IO_ALG(n, M) + IO_φ(n, M) + IO_ψ(n, M) + IO_υ(n, M)

The result follows immediately from Claim 3.8 and Claim 3.12.  □
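Instantiating Claim 3.11 recovers the leading coefficients of Table 1. For a 2 × 2 base case (n_0 = 2, t = 7), with q = 15 (Strassen-Winograd) and q = 12 (ours):

```latex
F_{SW}(n)  = \Bigl(1+\tfrac{15}{7-4}\Bigr)\,n^{\log_2 7} - \tfrac{15}{7-4}\,n^2
           = 6\,n^{\log_2 7} - 5\,n^2,
\qquad
F_{opt}(n) = \Bigl(1+\tfrac{12}{7-4}\Bigr)\,n^{\log_2 7} - \tfrac{12}{7-4}\,n^2
           = 5\,n^{\log_2 7} - 4\,n^2.
```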
3.3 Optimal ⟨2, 2, 2; 7⟩-algorithm
We now present a basis transformation ψ_opt: R^4 → R^4 and an ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm which performs only 12 linear operations.

Notation 3.15. Let ψ_opt refer to the following transformation:

  ψ_opt = ⎡ 1  0  0  0 ⎤    ψ_opt^{−1} = ⎡ 1  0  0  0 ⎤
          ⎢ 0  1 −1  1 ⎥                 ⎢ 0  1 −1  0 ⎥
          ⎢ 0  0 −1  1 ⎥                 ⎢ 0 −1  0  1 ⎥
          ⎣ 0  1  0  1 ⎦                 ⎣ 0 −1  1  1 ⎦

For convenience, when applying ψ_opt to matrices, we omit the vectorization and refer to it as ψ_opt: R^{2×2} → R^{2×2}:

  ψ_opt(A) = ψ_opt ⎛ A_{1,1}  A_{1,2} ⎞ = ⎛ A_{1,1}            A_{1,2} − A_{2,1} + A_{2,2} ⎞
                   ⎝ A_{2,1}  A_{2,2} ⎠   ⎝ A_{2,2} − A_{2,1}  A_{1,2} + A_{2,2}           ⎠

where the A_{i,j} can be ring elements or sub-matrices. ψ_opt^{−1} is defined analogously. Both ψ_opt and ψ_opt^{−1} extend recursively, as in Definition 3.4.

  U_opt = ⎡ 0  0  0  1 ⎤   V_opt = ⎡ 0  0  0  1 ⎤   W_opt = ⎡ 0  0  0  1 ⎤
          ⎢ 0  0  1  0 ⎥           ⎢ 0  0  1  0 ⎥           ⎢ 0  0  1  0 ⎥
          ⎢ 0  1  0  0 ⎥           ⎢ 0  1  0  0 ⎥           ⎢ 0  1  0  0 ⎥
          ⎢ 1  0  0  0 ⎥           ⎢ 1  0  0  0 ⎥           ⎢ 1  0  0  0 ⎥
          ⎢ 0  1 −1  0 ⎥           ⎢ 0 −1  0  1 ⎥           ⎢ 1  1  0  0 ⎥
          ⎢−1  1  0  0 ⎥           ⎢ 0  1 −1  0 ⎥           ⎢ 0 −1  0 −1 ⎥
          ⎣ 0 −1  0  1 ⎦           ⎣−1  1  0  0 ⎦           ⎣ 0  1  1  0 ⎦

Figure 1: ⟨U_opt, V_opt, W_opt⟩ are the encoding/decoding matrices of our ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm, which performs 12 linear operations.
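In code, ψ_opt of Notation 3.15 plugs directly into the generic recursive transform sketched in Section 3.1 (the `transform` helper is ours, from that sketch, not from the paper):

```python
import numpy as np

# Notation 3.15, acting on vec(A) = (A11, A12, A21, A22), row-ordered.
PSI_OPT = np.array([[1, 0,  0, 0],
                    [0, 1, -1, 1],
                    [0, 0, -1, 1],
                    [0, 1,  0, 1]], dtype=float)
PSI_OPT_INV = np.linalg.inv(PSI_OPT)  # equals psi_opt^{-1} in Notation 3.15

# The transform and its inverse extend recursively (Definition 3.4, Claim 3.6).
A = np.random.rand(16, 16)
assert np.allclose(transform(transform(A, PSI_OPT, 2), PSI_OPT_INV, 2), A)
```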
Claim 3.16. ⟨U_opt, V_opt, W_opt⟩ are encoding/decoding matrices of an ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm.

Proof. Observe that ⟨U_opt·ψ_opt, V_opt·ψ_opt, W_opt·ψ_opt^{−T}⟩ =

  ⎡ 0  1  0  1 ⎤   ⎡ 0  1  0  1 ⎤   ⎡ 0  0  1  1 ⎤
  ⎢ 0  0 −1  1 ⎥   ⎢ 0  0 −1  1 ⎥   ⎢ 0 −1  0  1 ⎥
  ⎢ 0  1 −1  1 ⎥   ⎢ 0  1 −1  1 ⎥   ⎢ 0  1 −1 −1 ⎥
  ⎢ 1  0  0  0 ⎥ , ⎢ 1  0  0  0 ⎥ , ⎢ 1  0  0  0 ⎥
  ⎢ 0  1  0  0 ⎥   ⎢ 0  0  1  0 ⎥   ⎢ 1  1 −1 −1 ⎥
  ⎢−1  1 −1  1 ⎥   ⎢ 0  1  0  0 ⎥   ⎢ 0 −1  0  0 ⎥
  ⎣ 0  0  1  0 ⎦   ⎣−1  1 −1  1 ⎦   ⎣ 0  0 −1  0 ⎦

It is easy to verify that ⟨U_opt·ψ_opt, V_opt·ψ_opt, W_opt·ψ_opt^{−T}⟩ satisfy the triple product condition in Fact 2.4. Hence, they are encoding/decoding matrices of a ⟨2, 2, 2; 7⟩-algorithm. By Corollary 3.3, the claim follows.  □

Claim 3.17. Let R be a ring, and let A ∈ R^{n×n} where n = 2^k. The arithmetic complexity of computing ψ_opt(A) is

  F_{ψ_opt}(n) = n^2 · log_2 n

The same holds for computing ψ_opt^{−1}(A).

Proof. Both ψ_opt and ψ_opt^{−1} perform q = 4 linear operations at each recursive step, with a base case of size n_0 = 2. The claim follows immediately from Claim 3.7.  □

Claim 3.18. Let R be a ring, and let A ∈ R^{n×n} where n = 2^k. The IO-complexity of computing ψ_opt(A) is

  IO_{ψ_opt}(n, M) ≤ 2n^2 · log_2(√2·n/√M) + 2M

Proof. Both ψ_opt and ψ_opt^{−1} perform q = 4 linear operations at each recursive step. The claim follows from Claim 3.8 with base case n_0 = 2.  □

Corollary 3.19. Our ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm's arithmetic complexity is

  F_opt(n) = 5n^{log_2 7} − 4n^2

Proof. Our ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm has a 2 × 2 base case and performs 7 multiplications. Applying Remark 2.6 to its encoding/decoding matrices ⟨U_opt, V_opt, W_opt⟩, we see that it performs 12 linear operations. The result follows immediately from Claim 3.11.  □

Corollary 3.20. Our ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm's IO-complexity is

  IO_opt(n, M) ≤ 4·(√3·n/√M)^{log_2 7}·M − 12n^2 + 3M

Proof. Our algorithm has a 2 × 2 base case and performs 7 multiplications. By applying Remark 2.6 to its encoding/decoding matrices (as shown in Figure 1), we see that it performs 12 linear operations. The result follows immediately from Claim 3.12.  □

Corollary 3.21. The arithmetic complexity of ABS (Algorithm 1) with our ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm is

  F_ABS(n) = 5n^{log_2 7} − 4n^2 + 3n^2·log_2 n

Proof. The proof is similar to that of Corollary 3.13.  □

Corollary 3.22. The IO-complexity of ABS (Algorithm 1) with our ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm is

  IO_ABS(n, M) ≤ 4·(√3·n/√M)^{log_2 7}·M − 12n^2 + 3n^2·log_2(√2·n/√M) + 5M

Proof. The proof is similar to that of Corollary 3.14.  □

Theorem 3.23. Our ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm's sequential and parallel IO-complexity is bounded below by Ω((n/√M)^{ω_0} · M/P) (where P is the number of processors, 1 in the sequential case, and ω_0 = log_2 7).

Proof. We refer to the undirected bipartite graph defined by the decoding matrix of a Strassen-like algorithm as its decoding graph (i.e., the edge (i, j) exists if W_{i,j} ≠ 0). In [2], Ballard et al. proved that for any square recursive-bilinear Strassen-like algorithm with an n_0 × n_0 base case which performs t multiplications, if the decoding graph is connected, then these bounds apply with ω_0 = log_{n_0} t. The decoding graph of our algorithm is connected. Hence, the claim holds.  □
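Putting the pieces together, the following numpy sketch runs Algorithm 1 with ψ_opt and the matrices of Figure 1, reusing `transform` and `PSI_OPT` from the sketches above (the helper names are ours, not the paper's), and checks the result against the classical product:

```python
import numpy as np

U_OPT = np.array([[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0],
                  [0, 1, -1, 0], [-1, 1, 0, 0], [0, -1, 0, 1]], dtype=float)
V_OPT = np.array([[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0],
                  [0, -1, 0, 1], [0, 1, -1, 0], [-1, 1, 0, 0]], dtype=float)
W_OPT = np.array([[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0],
                  [1, 1, 0, 0], [0, -1, 0, -1], [0, 1, 1, 0]], dtype=float)

def bilinear_rec(A, B, U=U_OPT, V=V_OPT, W=W_OPT):
    """Recursive-bilinear <2,2,2;7> algorithm from encoding/decoding
    matrices. Block (i, j) of a 2 x 2 split maps to vector position 2*i + j."""
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    Ab = [A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]]
    Bb = [B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]]
    P = [bilinear_rec(sum(U[r, p] * Ab[p] for p in range(4)),
                      sum(V[r, p] * Bb[p] for p in range(4)), U, V, W)
         for r in range(7)]
    Cb = [sum(W[r, p] * P[r] for r in range(7)) for p in range(4)]
    return np.block([[Cb[0], Cb[1]], [Cb[2], Cb[3]]])

# Algorithm 1 (ABS): transform, multiply in the alternative basis, invert.
n = 8
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = transform(bilinear_rec(transform(A, PSI_OPT, 2), transform(B, PSI_OPT, 2)),
              np.linalg.inv(PSI_OPT), 2)
assert np.allclose(C, A @ B)
```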
4 BASIS-INVARIANT LOWER BOUND ON ADDITIONS FOR 2 × 2 MATRIX MULTIPLICATION
In this section we prove Theorem 1.2, which says that 12 additions are necessary to compute 2 × 2 matrix multiplication recursively with a base case of 2 × 2 and 7 multiplications, irrespective of basis. Theorem 1.2 complements Probert's lower bound, which says that for the standard basis, 15 additions are required.

Definition 4.1. Denote the permutation matrix which swaps row-order for column-order of the vectorization of an I × J matrix by P_{I×J}.

Lemma 4.2. [19] Let ⟨U, V, W⟩ be the encoding/decoding matrices of an ⟨m, k, n; t⟩-algorithm. Then ⟨W·P_{n×m}, U, V·P_{n×k}⟩ are encoding/decoding matrices of an ⟨n, m, k; t⟩-algorithm.

We use the following results, shown by Hopcroft and Kerr [20]:

Lemma 4.3. [20] If an algorithm for 2 × 2 matrix multiplication has k left (right) hand side multiplicands from the set S = {A_{1,1}, A_{1,2} + A_{2,1}, A_{1,1} + A_{1,2} + A_{2,1}}, where additions are done modulo 2, then it requires at least 6 + k multiplications.

Corollary 4.4. [20] Lemma 4.3 also applies for the following definitions of S:
(1) {A_{1,1} + A_{2,1}, A_{1,2} + A_{2,1} + A_{2,2}, A_{1,1} + A_{1,2} + A_{2,2}}
(2) {A_{1,1} + A_{1,2}, A_{1,2} + A_{2,1} + A_{2,2}, A_{1,1} + A_{2,1} + A_{2,2}}
(3) {A_{1,1} + A_{1,2} + A_{2,1} + A_{2,2}, A_{1,2} + A_{2,1}, A_{1,1} + A_{2,2}}
(4) {A_{2,1}, A_{1,1} + A_{2,2}, A_{1,1} + A_{2,1} + A_{2,2}}
(5) {A_{2,1} + A_{2,2}, A_{1,1} + A_{1,2} + A_{2,2}, A_{1,1} + A_{1,2} + A_{2,1}}
(6) {A_{1,2}, A_{1,1} + A_{2,2}, A_{1,1} + A_{1,2} + A_{2,2}}
(7) {A_{1,2} + A_{2,2}, A_{1,1} + A_{2,1} + A_{2,2}, A_{1,1} + A_{1,2} + A_{2,1}}
(8) {A_{2,2}, A_{1,2} + A_{2,1}, A_{1,2} + A_{2,1} + A_{2,2}}

Corollary 4.5. Any 2 × 2 matrix multiplication algorithm where a left hand (or right hand) multiplicand appears at least twice (modulo 2) requires 8 or more multiplications.

Proof. Immediate from Lemma 4.3 and Corollary 4.4, since together they cover all possible linear sums of matrix elements, modulo 2.  □

Fact 4.6. A simple counting argument shows that any 7 × 4 binary matrix with less than 10 non-zero entries has a duplicate row (modulo 2) or an all-zero row. (Indeed, there are at most 4 distinct non-zero rows of weight 1; if all 7 rows are distinct and non-zero, the remaining rows each have weight at least 2, for a total of at least 4 + 3·2 = 10 non-zero entries.)
Lemma 4.7. Irrespective of basis transformations φ, ψ, υ, the encoding matrices U, V of an ⟨2, 2, 2; 7⟩_{φ,ψ,υ}-algorithm contain no duplicate rows.

Proof. Let ⟨U, V, W⟩ be encoding/decoding matrices of an ⟨2, 2, 2; 7⟩_{φ,ψ,υ}-algorithm. By Corollary 3.3, ⟨Uφ, Vψ, Wυ^{−T}⟩ are encoding/decoding matrices of a ⟨2, 2, 2; 7⟩-algorithm. Assume, w.l.o.g., that U contains a duplicate row, which means that Uφ contains a duplicate row as well. In that case, ⟨Uφ, Vψ, Wυ^{−T}⟩ is a ⟨2, 2, 2; 7⟩-algorithm in which one of the encoding matrices contains a duplicate row, in contradiction to Corollary 4.5.  □

Lemma 4.8. Irrespective of basis transformations φ, ψ, υ, the encoding matrices U, V of a ⟨2, 2, 2; 7⟩_{φ,ψ,υ}-algorithm have at least 10 non-zero entries.

Proof. No row in U, V can be zeroed out, since otherwise the algorithm would require less than 7 multiplications. The result then follows from Fact 4.6 and Lemma 4.7.  □

Lemma 4.9. Irrespective of basis transformations φ, ψ, υ, the decoding matrix W of an ⟨2, 2, 2; 7⟩_{φ,ψ,υ}-algorithm has at least 10 non-zero entries.

Proof. Let ⟨U, V, W⟩ be encoding/decoding matrices of an ⟨2, 2, 2; 7⟩_{φ,ψ,υ}-algorithm, and suppose by contradiction that W has less than 10 non-zero entries. W is a 7 × 4 matrix. By Fact 4.6, it either has a duplicate row (modulo 2) or an all-zero row. Corollary 3.3 states that ⟨Uφ, Vψ, Wυ^{−T}⟩ are encoding/decoding matrices of a ⟨2, 2, 2; 7⟩-algorithm. By Lemma 4.2, because ⟨Uφ, Vψ, Wυ^{−T}⟩ define a ⟨2, 2, 2; 7⟩-algorithm, so does ⟨Wυ^{−T}·P_{2×2}, Uφ, Vψ·P_{2×2}⟩. Hence Wυ^{−T}·P_{2×2} is an encoding matrix of a ⟨2, 2, 2; 7⟩-algorithm which has duplicate or all-zero rows, contradicting Corollary 4.5.  □

Proof (of Theorem 1.2). Lemmas 4.8 and 4.9 show that each encoding/decoding matrix must contain at least 10 non-zero entries. By Remark 2.6, the number of additions used by each of the encoding matrices is nnz(U) − rows(U) (and analogously for V), and the number of additions used by the decoding is nnz(W) − cols(W). Hence, 12 additions are necessary for an ⟨2, 2, 2; 7⟩_{φ,ψ,υ}-algorithm, irrespective of the basis transformations φ, ψ, υ.  □
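Spelling out the count in the proof (a worked instance of Remark 2.6, with U, V, W of size 7 × 4):

```latex
\mathrm{Additions}
  \;\ge\; \underbrace{(10-7)}_{\mathrm{nnz}(U)-\mathrm{rows}(U)}
  \;+\;   \underbrace{(10-7)}_{\mathrm{nnz}(V)-\mathrm{rows}(V)}
  \;+\;   \underbrace{(10-4)}_{\mathrm{nnz}(W)-\mathrm{cols}(W)}
  \;=\; 12
```

The matrices of Figure 1 attain this bound with equality.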
5 OPTIMAL ALTERNATIVE BASES
To apply our alternative basis method to other Strassen-like matrix multiplication algorithms, we find bases which reduce the number of linear operations performed by the algorithm. As mentioned in Remark 2.6, the non-zero entries of the encoding/decoding matrices determine the number of linear operations performed by an algorithm. Hence, we want our encoding/decoding matrices to be as sparse as possible, and ideally to have only entries of the form −1, 0, and 1. From Lemma 3.2 and Corollary 3.3, we see that any ⟨n, m, k; t⟩-algorithm and dimension-compatible basis transformations φ, ψ, υ can be composed into an ⟨n, m, k; t⟩_{φ,ψ,υ}-algorithm. Therefore, the problem of finding a basis in which a Strassen-like algorithm performs the least amount of linear operations is closely tied to the Matrix Sparsification problem:

Problem 5.1 (Matrix Sparsification Problem (MS)). Let U be an m × n matrix of full rank. Find an invertible matrix A such that

  A = argmin_{A ∈ GL_n} nnz(U·A)

That is, finding basis transformations for a Strassen-like algorithm consists of three independent MS problems. Unfortunately, MS is not only NP-Hard to solve [28], but also NP-Hard to approximate to within a factor of 2^{log^{.5−o(1)} n} [15] (over Q, assuming NP does not admit quasi-polynomial time deterministic algorithms). There seem to be very few heuristics for matrix sparsification (e.g., [9]), or algorithms under very limiting assumptions (e.g., [18]). Nevertheless, for existing Strassen-like algorithms with small base cases, the use of search heuristics to find bases which significantly sparsify the encoding/decoding matrices proved useful. Our resulting alternative basis Strassen-like algorithms are summarized in Table 2. Note, particularly, our alternative basis version of Smirnov's ⟨6, 3, 3; 40⟩-algorithm, which is asymptotically faster than Strassen's, and for which we have reduced the number of linear operations in the bilinear-recursive algorithm from 1246 to 202, thus reducing the leading coefficient by 83.2%.

6 IMPLEMENTATION AND PRACTICAL CONSIDERATIONS
6.1 Recursion cutoff point
Implementations of fast matrix multiplication algorithms often take several recursive steps and then call the classical algorithm from a vendor-tuned library. This gives better performance in practice, due to two main reasons: (1) the asymptotic improvement of Strassen-like algorithms makes them faster than classical algorithms by margins which increase with matrix size, and (2) vendor-tuned libraries have extensive built-in optimizations, which make them perform better than existing implementations of fast matrix multiplication algorithms on small matrices.

We next present a theoretical analysis for finding the optimal number of recursive steps, without tuning.

Claim 6.1. Let ALG be an ⟨n_0, n_0, n_0; t⟩-algorithm with q linear operations at the base case. The arithmetic complexity of running ALG for ℓ steps, then switching to classical matrix multiplication, is:

  F_ALG(n, ℓ) = (q/(t − n_0^2)) · ((t/n_0^2)^ℓ − 1) · n^2 + t^ℓ · (2·(n/n_0^ℓ)^3 − (n/n_0^ℓ)^2)

Proof. Each step of the recursion consists of computing t sub-problems and performs q linear operations. Therefore, F_ALG(n, ℓ) = t·F_ALG(n/n_0, ℓ−1) + q·(n/n_0)^2 and F_ALG(n, 0) = 2n^3 − n^2. Thus,

  F_ALG(n, ℓ) = Σ_{k=0}^{ℓ−1} t^k · q · (n/n_0^{k+1})^2 + t^ℓ · F_ALG(n/n_0^ℓ, 0)
             = (q/n_0^2) · n^2 · ((t/n_0^2)^ℓ − 1)/(t/n_0^2 − 1) + t^ℓ · (2·(n/n_0^ℓ)^3 − (n/n_0^ℓ)^2)
             = (q/(t − n_0^2)) · ((t/n_0^2)^ℓ − 1) · n^2 + t^ℓ · (2·(n/n_0^ℓ)^3 − (n/n_0^ℓ)^2)  □
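The cost model of Claim 6.1 can be evaluated directly. The sketch below (our helper, instantiated with t = 7, n_0 = 2 and q = 12 for the ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm, as an example) selects the flop-minimizing cutoff:

```python
def flops_with_cutoff(n, ell, t=7, q=12, n0=2):
    """F_ALG(n, ell) of Claim 6.1: run the fast algorithm for ell recursive
    steps, then switch to classical multiplication (2m^3 - m^2 flops)."""
    m = n / n0**ell  # dimension handed to the classical algorithm
    return (q / (t - n0**2) * ((t / n0**2)**ell - 1) * n**2
            + t**ell * (2 * m**3 - m**2))

# Example: the flop-minimizing number of recursive steps for n = 2^13.
n = 2**13
best_ell = min(range(14), key=lambda ell: flops_with_cutoff(n, ell))
```

The model counts flops only; in practice, the cutoff also depends on the relative speed of the vendor-tuned kernel, so this estimate is a starting point rather than a replacement for measurement.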
When running an alternative basis algorithm for a limited number of recursive steps, the basis transformation needs to be computed only for the same number of recursive steps. If the basis transformation is computed for more steps than the alternative basis multiplication, the classical algorithm will compute incorrect results, as it does not account for the input being represented in an alternative basis. This introduces a small saving in the runtime of the basis transformation.

Claim 6.2. Let R be a ring, let ψ_1: R^{n_0×n_0} → R^{n_0×n_0} be an invertible linear map, let A ∈ R^{n×n} where n = n_0^k, and let ℓ ≤ k. The arithmetic complexity of computing ψ_ℓ(A) is

  F_ψ(n, ℓ) = (q/n_0^2) · n^2 · ℓ
Proof. Let F_{ψ_ℓ}(n) be the number of additions required by ψ_ℓ. Each recursive step consists of computing n_0^2 sub-problems and performing q linear operations. Therefore, F_ψ(n, ℓ) = n_0^2 · F_ψ(n/n_0, ℓ−1) + q·(n/n_0)^2 and F_ψ(n, 0) = 0. Thus,

  F_ψ(n, ℓ) = Σ_{k=0}^{ℓ−1} n_0^{2k} · q · (n/n_0^{k+1})^2 = n^2 · (q/n_0^2) · ℓ  □

6.2 Performance experiments
We next present performance results for our ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm. All experiments were conducted on a single compute node of HLRS's Hazel Hen, with two 12-core (24 thread) Intel Xeon E5-2680 v3 CPUs and 128GB of memory.

We used a straightforward implementation of both our algorithm and Strassen-Winograd's algorithm [40], using OpenMP. Each algorithm runs for a pre-selected number of recursive steps before switching to Intel's MKL DGEMM routine. Each DGEMM call uses all threads, and matrix additions are always fully parallelized. All results are the median over 6 experiments.

In Figure 2 we see that our algorithm outperforms Strassen-Winograd's, with the margin of improvement increasing with each recursive step and nearing the theoretical improvement.

Figure 2: Comparing the performance of our ⟨2, 2, 2; 7⟩_{ψ_opt}-algorithm to Strassen-Winograd's on square matrices of fixed dimension N = 32768. The graph shows our algorithm's runtime, normalized by Strassen-Winograd's algorithm's runtime, as a function of the number of recursive steps taken before switching to Intel's MKL DGEMM. The top horizontal line (at 1) represents Strassen-Winograd's performance, and the bottom horizontal line (at 0.83) represents the theoretical ratio when taking the maximal number of recursive steps.

7 DISCUSSION
Our method obtained novel variants of existing Strassen-like algorithms, reducing the number of linear operations required. Our algorithm also outperforms Strassen-Winograd's algorithm for any matrix dimension n ≥ 32. Furthermore, we obtained an alternative basis algorithm for Smirnov's ⟨6, 3, 3; 40⟩-algorithm, reducing its number of additions by 83.8%. While the problem of finding bases which optimally sparsify an algorithm's encoding/decoding matrices is NP-Hard (see Section 5), it is still solvable for many fast matrix multiplication algorithms with small base cases. Hence, finding basis transformations can be done in practice using search heuristics, leading to further improvements.

We leave large scale implementations for future research, but note that both kernels of our alternative basis algorithms (the basis transformation and the recursive-bilinear algorithm) are known to be highly parallelizable recursive divide-and-conquer algorithms, and admit various communication-minimizing parallelization techniques (e.g., [1, 5]).

ACKNOWLEDGMENTS
Research is supported by grants 1878/14 and 1901/14 from the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities) and grant 3-10891 from the Ministry of Science and Technology, Israel. Research is also supported by the Einstein Foundation and the Minerva Foundation. This work was supported by the PetaCloud industry-academia consortium. This research was supported by a grant from the United States-Israel Bi-national Science Foundation (BSF), Jerusalem, Israel. This work was supported by the HUJI Cyber Security Research Center in conjunction with the Israel National Cyber Bureau in the Prime Minister's Office. We acknowledge PRACE for awarding us access to Hazel Hen at GCS@HLRS, Germany.

REFERENCES
[1] Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures. ACM, 193–204.
[2] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. 2012. Graph expansion and communication costs of fast matrix multiplication. Journal of the ACM (JACM) 59, 6 (2012), 32.
[3] Austin R Benson and Grey Ballard. 2015. A framework for practical parallel fast matrix multiplication. ACM SIGPLAN Notices 50, 8 (2015), 42–53.
[4] Gianfranco Bilardi and Lorenzo De Stefani. 2016. The I/O complexity of Strassen's matrix multiplication with recomputation. arXiv preprint arXiv:1605.02224 (2016).
[5] Gianfranco Bilardi, Michele Scquizzato, and Francesco Silvestri. 2012. A Lower Bound Technique for Communication on BSP with Application to the FFT. In European Conference on Parallel Processing. Springer, 676–687.
[6] Dario Bini, Milvio Capovani, Francesco Romani, and Grazia Lotti. 1979. O(n^2.7799) complexity for n × n approximate matrix multiplication. Information Processing Letters 8, 5 (1979), 234–235.
[7] Marco Bodrato. 2010. A Strassen-like matrix multiplication suited for squaring and higher power computation. In Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation. ACM, 273–280.
[8] Murat Cenk and M Anwar Hasan. 2017. On the arithmetic complexity of Strassen-like matrix multiplications. Journal of Symbolic Computation 80 (2017), 484–501.
[9] S Frank Chang and S Thomas McCormick. 1992. A hierarchical algorithm for making sparse matrices sparser. Mathematical Programming 56, 1 (1992), 1–30.
[10] Henry Cohn and Christopher Umans. 2003. A group-theoretic approach to fast matrix multiplication. In Foundations of Computer Science, 2003. Proceedings. 44th Annual IEEE Symposium on. IEEE, 438–449.
[11] James W Cooley and John W Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19, 90 (1965), 297–301.
[12] Don Coppersmith and Shmuel Winograd. 1982. On the asymptotic complexity of matrix multiplication. SIAM J. Comput. 11, 3 (1982), 472–492.
[13] Don Coppersmith and Shmuel Winograd. 1990. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation 9, 3 (1990), 251–280.
[14] Paolo D'Alberto, Marco Bodrato, and Alexandru Nicolau. 2011. Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 2.
[15] Lee-Ad Gottlieb and Tyler Neylon. 2010. Matrix sparsification and the sparse null space problem. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer, 205–218.
[16] Brian Grayson and Robert Van De Geijn. 1996. A high performance parallel Strassen implementation. Parallel Processing Letters 6, 01 (1996), 3–12.
[17] Vince Grolmusz. 2008. Modular representations of polynomials: Hyperdense coding and fast matrix multiplication. IEEE Transactions on Information Theory 54, 8 (2008), 3687–3692.
[18] Alan J Hoffman and S Thomas McCormick. 1984. A fast algorithm that makes matrices optimally sparse. Progress in Combinatorial Optimization (1984), 185–196.
[19] John Hopcroft and Jean Musinski. 1973. Duality applied to the complexity of matrix multiplications and other bilinear forms. In Proceedings of the fifth annual ACM symposium on Theory of computing. ACM, 73–87.
[20] John E Hopcroft and Leslie R Kerr. 1971. On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20, 1 (1971), 30–36.
[21] Igor Kaporin. 2004. The aggregation and cancellation techniques as a practical tool for faster matrix multiplication. Theoretical Computer Science 315, 2-3 (2004), 469–510.
[22] Donald E Knuth. 1981. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA.
[23] Bharat Kumar, C-H Huang, P Sadayappan, and Rodney W Johnson. 1995. A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction. Scientific Programming 4, 4 (1995), 275–289.
[24] Julian Laderman, Victor Pan, and Xuan-He Sha. 1992. On practical algorithms for accelerated matrix multiplication. Linear Algebra and Its Applications 162 (1992), 557–588.
[25] Julian D Laderman. 1976. A noncommutative algorithm for multiplying 3 × 3 matrices using 23 multiplications. In Am. Math. Soc, Vol. 82. 126–128.
[26] François Le Gall. 2014. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation. ACM, 296–303.
[27] Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz. 2012. Communication-avoiding parallel Strassen: Implementation and performance. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 101.
[28] S Thomas McCormick. 1983. A Combinatorial Approach to Some Sparse Matrix Problems. Technical Report. DTIC Document.
[29] V Ya Pan. 1978. Strassen's algorithm is not optimal: trilinear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations. In Foundations of Computer Science, 1978., 19th Annual Symposium on. IEEE, 166–176.
[30] V Ya Pan. 1982. Trilinear aggregating with implicit canceling for a new acceleration of matrix multiplication. Computers & Mathematics with Applications 8, 1 (1982), 23–34.
[31] Robert L Probert. 1976. On the additive complexity of matrix multiplication. SIAM J. Comput. 5, 2 (1976), 187–203.
[32] Francesco Romani. 1982. Some properties of disjoint sums of tensors related to matrix multiplication. SIAM J. Comput. 11, 2 (1982), 263–267.
[33] Arnold Schönhage. 1981. Partial and total matrix multiplication. SIAM J. Comput. 10, 3 (1981), 434–455.
[34] Jacob Scott, Olga Holtz, and Oded Schwartz. 2015. Matrix multiplication I/O-complexity by path routing. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 35–45.
[35] A V Smirnov. 2013. The bilinear complexity and practical algorithms for matrix multiplication. Computational Mathematics and Mathematical Physics 53, 12 (2013), 1781–1795.
[36] Andrew James Stothers. 2010. On the complexity of matrix multiplication. (2010).
[37] Volker Strassen. 1969. Gaussian elimination is not optimal. Numerische Mathematik 13, 4 (1969), 354–356.
[38] Volker Strassen. 1986. The asymptotic spectrum of tensors and the exponent of matrix multiplication. In Foundations of Computer Science, 1986., 27th Annual Symposium on. IEEE, 49–54.
[39] Virginia Vassilevska Williams. 2012. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the forty-fourth annual ACM Symposium on Theory of Computing. ACM, 887–898.
[40] Shmuel Winograd. 1971. On multiplication of 2 × 2 matrices. Linear Algebra and its Applications 4, 4 (1971), 381–388.