Sparsifying the Operators of Fast Matrix Multiplication Algorithms

Gal Beniamini, The Hebrew University of Jerusalem, [email protected]
Nathan Cheng, University of California at Berkeley, [email protected]
Olga Holtz, University of California at Berkeley, [email protected]
Elaye Karstadt, The Hebrew University of Jerusalem, [email protected]
Oded Schwartz, The Hebrew University of Jerusalem, [email protected]

ABSTRACT
Fast matrix multiplication algorithms may be useful, provided that their running time is good in practice. Particularly, the leading coefficient of their arithmetic complexity needs to be small. Many sub-cubic algorithms have large leading coefficients, rendering them impractical. Karstadt and Schwartz (SPAA'17, JACM'20) demonstrated how to reduce these coefficients by sparsifying an algorithm's bilinear operator. Unfortunately, the problem of finding optimal sparsifications is NP-Hard.

We obtain three new methods to this end, and apply them to existing fast matrix multiplication algorithms, thus improving their leading coefficients. These methods have an exponential worst case running time, but run fast in practice and improve the performance of many fast matrix multiplication algorithms. Two of the methods are guaranteed to produce leading coefficients that, under some assumptions, are optimal.

1 INTRODUCTION
Matrix multiplication is a fundamental computation kernel, used in many fields ranging from imaging to signal processing and artificial neural networks. The need to improve performance has attracted much attention from the science and engineering communities. Strassen's discovery of the first sub-cubic algorithm [38] sparked intensive research into the complexity of matrix multiplication algorithms (cf. [1-3, 9-13, 16, 18, 20-23, 25-27, 30, 31, 33-37, 39, 42, 43]).

The research efforts can be divided into two main branches. The first revolves around the search for asymptotic upper bounds on the arithmetic complexity of matrix multiplication (cf. [9-11, 16, 27, 37, 39, 42]). This approach focuses on asymptotics, typically disregarding the hidden constants of the algorithms and other aspects of practical importance. Many of these algorithms remain highly theoretical due to their large hidden constants, and furthermore, they apply only to matrices of very high dimensions.

In contrast, the second branch focuses on obtaining matrix multiplication algorithms that are both asymptotically fast and practical. This requires the algorithms to have reasonable hidden constants, so that they are applicable even to small instances (cf. [1-3, 18, 20-23, 25, 26, 30, 31, 33-36, 43]).

1.1 Previous work
Reducing the leading coefficients. Winograd [43] reduced the leading coefficient of Strassen's algorithm's arithmetic complexity from 7 to 6 by decreasing the number of additions and subtractions in the 2 × 2 base case from 18 to 15 (see Section 2.1 for the connection between the number of additions and the leading coefficient). Later, Bodrato [4] introduced the intermediate representation method, which successfully reduces the leading coefficient to 5 for repeated squaring and chain matrix multiplication. Cenk and Hasan [7] presented a non-uniform implementation of Strassen-Winograd's algorithm [43], which also reduces the leading coefficient from 6 to 5, but incurs additional penalties such as a larger memory footprint and higher communication costs. Independently, Karstadt and Schwartz [22, 23] used a technique similar to Bodrato's, and obtained a matrix multiplication algorithm with a 2 × 2 base case, using 7 multiplications, and a leading coefficient of 5. Their method also applies to other base cases, improving the leading coefficients of multiple algorithms. Beniamini and Schwartz [1] introduced the decomposed recursive bilinear framework, which generalizes [22, 23]. Their technique allows a further reduction of the leading coefficient, yielding several fast matrix multiplication algorithms with a leading coefficient of 2, matching that of the classical algorithm.

Lower bounds on leading coefficients. Probert [32] proved that 15 additions are necessary for any recursive-bilinear matrix multiplication algorithm with a 2 × 2 base case using 7 multiplications over F2, which corresponds to a leading coefficient of 6. This was later matched by Bshouty [6], who used a different technique to obtain the same lower bound over an arbitrary field. Both cases have been interpreted as a proof of optimality for the leading coefficient of Winograd's algorithm [43].

Karstadt and Schwartz's ⟨2, 2, 2; 7⟩-algorithm [22, 23] (see Section 2.1 for the definition of this notation) requires 12 additions (thus having a leading coefficient of 5) and seemingly contradicts these lower bounds. Indeed, they showed that these lower bounds [6, 32] do not hold under alternative basis multiplication. In addition, they extended the lower bounds to apply to algorithms that utilize basis transformations, and showed that 12 additions are necessary for any recursive-bilinear matrix multiplication algorithm with a 2 × 2 base case using 7 multiplications, regardless of basis, thus proving a lower bound of 5 on the leading coefficient of such algorithms.
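To make the connection between addition counts and leading coefficients concrete (it is formalized in Section 2.1, Corollary 2.6): for a ⟨2, 2, 2; 7⟩ base case whose encoding/decoding matrices contain only 0, ±1 entries, the leading coefficient equals 1 + q/(7 − 4), where q is the total number of additions and subtractions in the base case. A minimal Python check of the three counts mentioned above:

# Leading coefficient of a <2,2,2;7>-algorithm as a function of the number q of
# additions/subtractions in its base case (all matrix entries are 0 or +/-1),
# specializing the complexity formula of Section 2.1 with t0 = 7 and n0*m0 = 4.
for name, q in [("Strassen [38]", 18), ("Winograd [43]", 15), ("Karstadt-Schwartz [22, 23]", 12)]:
    print(f"{name}: leading coefficient = {1 + q / (7 - 4):.0f}")   # 7, 6, 5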

Table 1: Examples of improved leading coefficients

Algorithm | Leading Monomial | Arithmetic Operations (Original / [22, 23] / Here) | Leading Coefficient (Original / [22, 23] / Here) | Improvement ([22, 23] / Here)
⟨2, 2, 2; 7⟩ [38] | n^(log_2 7) ≈ n^2.80735 | 18 / 12 / 12 | 7 / 5 / 5 | 28.57% / 28.57%
⟨3, 2, 3; 15⟩ [2] | n^(log_18 15^3) ≈ n^2.81076 | 64 / 52 / 39 | 9.61 / 7.94 / 6.17 | 17.37% / 35.84%
⟨4, 2, 3; 20⟩ [35] | n^(log_24 20^3) ≈ n^2.82789 | 78 / 58 / 51 | 8.9 / 7.46 / 5.88 | 16.17% / 33.96%
⟨3, 3, 3; 23⟩ [2] | n^(log_3 23) ≈ n^2.85404 | 87 / 75 / 66 | 7.21 / 6.57 / 5.71 | 8.87% / 20.79%
⟨6, 3, 3; 40⟩ [35] | n^(log_54 40^3) ≈ n^2.77429 | 1246 / 202 / 190 | 55.63 / 9.36 / 8.9 | 83.17% / 84.01%

The leading monomial of rectangular ⟨n, m, k; t⟩-algorithms refers to their composition [19] into square ⟨nmk, nmk, nmk; t^3⟩-algorithms. The improvement column is the ratio between the new and the original leading coefficients of the arithmetic complexity. See Table 2 for a full list of results.
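The leading monomials and the "Here" improvement column of Table 1 follow directly from the base-case parameters and the reported coefficients; the following minimal Python sketch reproduces them (up to the rounding of the coefficients printed in the table):

from math import log

# Each entry is (n0, m0, k0, t0, original leading coefficient, new leading coefficient),
# taken from Table 1.
rows = {
    "<2,2,2;7>":  (2, 2, 2, 7,  7.0,   5.0),
    "<3,2,3;15>": (3, 2, 3, 15, 9.61,  6.17),
    "<4,2,3;20>": (4, 2, 3, 20, 8.9,   5.88),
    "<3,3,3;23>": (3, 3, 3, 23, 7.21,  5.71),
    "<6,3,3;40>": (6, 3, 3, 40, 55.63, 8.9),
}
for name, (n0, m0, k0, t0, old, new) in rows.items():
    exponent = 3 * log(t0) / log(n0 * m0 * k0)   # log_{n0*m0*k0}(t0^3)
    improvement = 1 - new / old                  # relative reduction of the coefficient
    print(f"{name}: n^{exponent:.5f}, improvement {improvement:.2%}")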

Beniamini and Schwartz [1] extended the lower bound to the generalized setting, in which the input and output can be transformed to a basis of larger dimension. They also found that the leading coefficient of any such algorithm with a 2 × 2 base case using 7 multiplications is at least 5.

Obtaining alternative basis algorithms. Recursive-bilinear algorithms can be described by a triplet of matrices, dubbed the encoding and decoding matrices (see Section 2.1). The alternative basis technique [22, 23] utilizes a decomposition of each of these matrices into a pair of matrices – a basis transformation, and a sparse encoding or decoding matrix. Once a decomposition is found, applying the algorithm is straightforward (see Section 2.1).

The leading coefficient of the arithmetic complexity is determined by the number of non-zero (and non-singleton) entries in each of the encoding/decoding matrices, while the basis transformations only affect the low order terms of the arithmetic complexity (see Section 2.1). Thus, reducing the leading coefficient of fast matrix multiplication algorithms translates to the matrix sparsification (MS) problem.

Matrix sparsification. Unfortunately, matrix sparsification is NP-Hard to solve [28] and NP-Hard to approximate to within a factor of 2^(log^(.5−o(1)) n) [15] (over Q, assuming NP does not admit quasi-polynomial time deterministic algorithms). Despite the problem being NP-hard, search heuristics can be leveraged to obtain bases which significantly sparsify the encoding/decoding matrices of fast matrix multiplication algorithms with small base cases.

There are a few heuristics that can solve the problem under severe assumptions, such as the full rank of any square submatrix, and requiring that the rank of each submatrix be equal to the size of the largest matching in the induced bipartite graph (cf. [8, 17, 28, 29]). These assumptions rarely hold in practice, and specifically, do not apply to any matrix we know.

Gottlieb and Neylon's algorithm [15] sparsifies an n × m matrix with no assumptions about the input. It does so by using calls to an oracle for the sparsest independent vector problem.
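To make the encoding/decoding description above concrete, the sketch below evaluates a ⟨2, 2, 2; 7⟩ base case as vec(AB) = W^T((U vec(A)) ⊙ (V vec(B))), the form recalled in Fact 2.1 (Section 2.1). The specific U, V, W are a standard presentation of Strassen's algorithm [38] under row-major vectorization; this convention is our illustration choice and is not taken from this paper's appendix.

import numpy as np

# Strassen's <2,2,2;7> base case through its encoding/decoding matrices:
# vec(C) = W^T ((U vec(A)) * (V vec(B))), with * the element-wise (Hadamard) product.
# Row-major vectorization: vec(A) = (a11, a12, a21, a22).
U = np.array([[ 1, 0, 0, 1],    # a11 + a22
              [ 0, 0, 1, 1],    # a21 + a22
              [ 1, 0, 0, 0],    # a11
              [ 0, 0, 0, 1],    # a22
              [ 1, 1, 0, 0],    # a11 + a12
              [-1, 0, 1, 0],    # a21 - a11
              [ 0, 1, 0, -1]])  # a12 - a22
V = np.array([[ 1, 0, 0, 1],    # b11 + b22
              [ 1, 0, 0, 0],    # b11
              [ 0, 1, 0, -1],   # b12 - b22
              [-1, 0, 1, 0],    # b21 - b11
              [ 0, 0, 0, 1],    # b22
              [ 1, 1, 0, 0],    # b11 + b12
              [ 0, 0, 1, 1]])   # b21 + b22
W = np.array([[ 1, 0, 0, 1],    # m1 contributes to c11 and c22
              [ 0, 0, 1, -1],
              [ 0, 1, 0, 1],
              [ 1, 0, 1, 0],
              [-1, 1, 0, 0],
              [ 0, 0, 0, 1],
              [ 1, 0, 0, 0]])

rng = np.random.default_rng(0)
A, B = rng.integers(-5, 5, (2, 2)), rng.integers(-5, 5, (2, 2))
m = (U @ A.reshape(4)) * (V @ B.reshape(4))   # the 7 scalar multiplications
C = (W.T @ m).reshape(2, 2)
assert np.array_equal(C, A @ B)

In the recursive algorithm the same U, V, W act block-wise: the entries of vec(A) and vec(B) become submatrices, the Hadamard product becomes t recursive multiplications, and the additions encoded by the nonzero entries are block additions.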
1.2 Our contribution.
We obtain three new methods for matrix sparsification, based on Gottlieb and Neylon's [15] matrix sparsification algorithm. We apply these methods to multiple matrix multiplication algorithms and obtain novel alternative-basis algorithms, often resulting in arithmetic complexity with leading coefficients superior to those known previously (see Table 1, Table 2, and Appendix A).

The first two methods were obtained by the introduction of new solutions to the Sparsest Independent Vector problem, which were then used as oracles for Gottlieb and Neylon's algorithm. As matrix sparsification is known to be NP-Hard, it is no surprise that these methods exhibit exponential worst case complexity. Nevertheless, they perform well in practice on the encoding/decoding matrices of fast matrix multiplication algorithms.

Our third method for matrix sparsification simultaneously minimizes the number of non-singular values in the matrix. This method does not guarantee an optimal solution for matrix sparsification. Nonetheless, it obtains solutions with the same (and, in some cases, better) leading coefficients than the former two methods when applied to many of the fast matrix multiplication algorithms in our corpus, and runs significantly faster than the first two when implemented using Z3 [14]. For completeness, we also present the sparsification heuristic used in [22, 23].

1.3 Paper Organization.
In Section 2, we recall preliminaries regarding fast matrix multiplication and recursive-bilinear algorithms, followed by a summary of the Alternative Basis technique [22, 23]. We then present Matrix Sparsification (MS, Problem 2.13), alongside Gottlieb and Neylon's [15] algorithm for solving MS by relying on an oracle for Sparsest Independent Vector (SIV, Problem 2.15). In Section 3 we present our two algorithms (Algorithms 3 and 4) for implementing SIV. In Section 4, we introduce Algorithm 5 - the sparsification heuristic of [22, 23], and a new efficient heuristic for sparsifying matrices while simultaneously minimizing non-singular values (Algorithm 6). In Section 5 we present the resulting fast matrix multiplication algorithms. Section 6 contains a discussion and plans for future work.

1.2 Our contribution. 2 PRELIMINARIES We obtain three new methods for matrix sparsi€cation, based on 2.1 Encoding and Decoding matrices. GoŠlieb and Neylon’s [15] matrix sparsi€cation algorithm. We Fast matrix multiplication algorithms are recursive divide-and- apply these methods to multiple matrix multiplication algorithms conquer algorithms, which utilize a small base case. We use the 2 notation hn0,m0,k0; t0i-algorithm to refer to an algorithm multi- Lemma 2.8. [19] Let hU , V , W i be the encoding/decoding matri- plying n0 × m0 by m0 × k0 matrices in its base case, using t0 scalar ces of an hm,k,n; ti-algorithm. Šen hWPn×m, U , VPn×k i are the multiplications, where n0,m0,k0 and t0 are €xed positive integers. encoding/decoding matrices of an hn,m,k; ti-algorithm. When multiplying n × m by m × k matrix multiplication, the algorithm splits each matrix into blocks (each of size n × m Remark 2.9. In addition to Lemma 2.8, Hopcro‰ and Musin- n0 m0 ski [19] proved that any hn,m,k; ti-algorithm de€nes algorithms and m × k , respectively), and works block-wise, according to m0 k0 for all permutations of n, m, and k. Note, however, that while the the base algorithm. Additions and subtractions in the base-case number of non-zero and non-singular entries does not change, it fol- algorithm become block-wise additions and subtractions. Similarly, lows from Remark 2.4 and Corollary 2.6 that the leading coecient multiplication by a scalar become multiplication of a varies according to the dimensions of the decoding matrix. by a scalar. Matrix multiplications in the algorithm are performed via recursion. 2.2 Alternative Basis Matrix Multiplication. Œroughout this paper, we refer to an algorithm by its base case. h i De€nition 2.10. [22, 23] Let R be a ring and let ϕ, ψ, υ be au- Hence, an n,m,k; t -algorithm may refer to either the algorithm’s n·m m·k n·k base case or the corresponding block recursive algorithm, as obvi- tomorphisms of R , R , R (respectively). We denote a ous from the context. recursive bilinear matrix multiplication algorithm which takes ϕ (A) , ψ (B) as inputs and outputs υ (A · B) using t multiplications n × m → k Fact 2.1. [22, 23] Let R be a ring, and let f : R R R by hn,m,k; tiϕ,ψ,υ . If n = m = k and ϕ = ψ = υ, we can use the no- be a bilinear function that performs t multiplications. Œere exist tation hn,n,n; tiϕ -algorithm. Œis notation extends the hn,m,k; ti- × × × U ∈ Rt n, V ∈ Rt m, W ∈ Rt k such that algorithm notation, as the laŠer applies when the three basis trans- x ∈ Rn, y ∈ Rm, f (x,y) = W T ((U · x) (V · y)) formations are the identity map. ∀ where is the element-wise product (Hadamard product). Given a recursive bilinear, hn,m,k; tiϕ,ψ,υ -algorithm ALG, an alternative basis matrix multiplication operates as follows: De€nition 2.2. [22, 23] (Encoding/Decoding matrices). We refer to the matrix triplet hU , V , W i of a recursive-bilinear algorithm (see Fact 2.1) as its encoding/decoding matrices (U , V are the encoding Algorithm 1 Alternative Basis Matrix Multiplication Algorithm matrices and W is the decoding matrix). Input: A ∈ Rn×m, Bm×k Notation 2.3. [1] Denote the number of nonzero entries in a Output: n × k matrix C = A · B matrix by nnz (A), and the number of non-singleton (i.e., not ±1) 1: function Mult(A, B) entries in a matrix by nns (A). Let the number of rows/columns be n×m 2: A˜ = ϕ(A) . R basis transformation nrows (A) and ncols (A), respectively. × 3: B˜ = ψ (B) . Rm k basis transformation Remark 2.4. 
[1] Œe number of linear operations used by a bi- 4: C˜ = ALG(A˜, B˜) . hn,m,k; tiϕ,ψ,υ -algorithm − n×k linear algorithm is determined by its encoding/decoding matrices. 5: C = υ 1(C˜) . R basis transformation Œe number of arithmetic operations performed by each of the 6: return C encodings is: OpsU = nnz (U ) + nns (U ) − nrows (U ) OpsV = nnz (V ) + nns (V ) − nrows (V ) Lemma 2.11. [22, 23] Let R be a ring, and let ϕ, ψ, υ be automor- n·m m·k n·k h i Œe number of operations performed by the decoding is: phisms of R , R , R (respectively). Šen U , V , W are en- coding/decoding matrices of an hn,m,k; tiϕ,ψ,υ -algorithm if and only OpsW = nnz (W ) + nns (W ) − ncols (W ) if Uϕ, Vψ, Wυ−T are encoding/decoding matrices of an hn,m,k; ti- Remark 2.5. We assume that none of the rows of the U , V , andW algorithm matrices is zero. Œis is because any zero row in U , V is equivalent Alternative basis multiplication is fast since the basis transfor- to an identically 0 multiplicand, and any zero row inW is equivalent mations are fast and incur an asymptotically negligible overhead: to a multiplication that is never used in the output. Hence, such rows can be omiŠed, resulting in asymptotically faster algorithms. Claim 2.12. [22, 23] Let R be a ring, let ψ : Rn0×m0 → Rn0×m0 Corollary 2.6. [1] Let ALG be an hn0,m0,k0; t0i-algorithm that be a linear map, and let A ∈ Rn×m where n = nk , m = mk . Še OpsU, OpsV, OpsW 0 0 performs linear operations at the base case and let complexity of ψ (A) is n = nl , m = ml , k = kl (l ∈ N). Še arithmetic complexity of ALG 0 0 0 q is: F (n,m) = nm · log (nm)   n m n0m0 OpsU OpsV OpsW l 0 0 F (n, m, k) = 1 + + + t0 t0 − n0m0 t0 − m0k0 t0 − n0k0 where q is the number of linear operations performed.  OpsU · nm OpsV · mk OpsW · nk  − + + t0 − n0m0 t0 − m0k0 t0 − n0k0 2.3 Matrix Sparsi€cation. De€nition 2.7. Let PI ×J denote the permutation matrix that ex- Finding a basis that minimizes the number of additions and sub- changes row-order for column-order of the vectorization of an I × J tractions performed by a fast matrix multiplication algorithm is matrix. equivalent, by Remark 2.4, to the Matrix Sparsi€cation problem: 3 Problem 2.13. Matrix Sparsi€cation Problem (MS): Let U be an Notation 3.3. Given an Ω-valid set S, we will refer to a vector n T n ×m matrix. Še objective is to €nd an A such that λ ∈ F with supp (λ) 1 Ω s.t. λ U:, S = 0 as an Ω-validator of S. A = argmin (nnz (AU )) Next, we provide a de€nition for vectors which are candidates A∈GL n for a solution of SIV: Remark 2.14. It is traditional to think of the matrices U , V , and W as “tall and skinny”, i.e., with n ≥ m. However, in the area of De€nition 3.4. A vector v in the row space of U is called Ω- matrix sparsi€cation, it is traditional to deal with matrices satisfying independent if v is not in the row space of UΩ, :. n ≤ m and transformations applied from the le‡. However, since Note that any solution to SIV (Problem 2.15) is, by de€nition, an T T T T nnz(AU ) = nnz(U A ), we can simply apply MS to U and use A optimally sparse Ω-independent vector. as our basis transformation. From now on, we will therefore switch to the convention n ≤ m used in matrix sparsi€cation. Remark 3.5. Note that given a set S ⊂ [m], it is possible to verify whether S is Ω-validand €nd an appropriate Ω-validator for it in To solve MS, we make use of GoŠlieb and Neylon’s algorithm [15], cubic time (e.g., via ). 
which solves the matrix sparsi€cation problem for n×m matrices, by repeatedly invoking an oracle for the Sparsest Independent Vector 3.1 Sparse Independent Vectors and maximal problem (Problem 2.15). Ω-valid sets. Problem 2.15. Sparsest Independent Vector Problem (SIV): Let Œe crux of our algorithms lies in the idea of €nding an Ω-valid set n×m U ∈ R (n ≤ m) and let Ω = {ω1,..., ωk } ⊂ [m]. Find a of maximal cardinality and using it to compute a solution for SIV, vector v ∈ Rn s.t. v is in the row space of U , v is not in the span of  which can then be used by Algorithm 2. Œe connection between Uω1 ,..., Uωk , and v has a minimal number of nonzero entries. Ω-valid sets and Ω-independent vectors is given by the following Given a subroutine SIV (U , Ω) which returns a pair (v, i), where lemmas: v is the sparse vector as required by SIV, and i ∈ [n] \ Ω is an Lemma 3.6. Let v ∈ Fn be an Ω-independent vector. Šen the set integer such that the i’th row of U can be replaced by v without  S = j : vj = 0 is an Ω-valid set of size zeros (v). changing the span of U . Œen Algorithm 2 returns an exact solution for MS [15]. Proof. By De€nition 3.4, there exists a vector λ ∈ Fn s.t. v = Ín T i=1 λiUi (i.e., v = λ U ) and λi0 , 0 for some i0 < Ω (hence Algorithm 2 MS via SIV [15] supp (λ) 1 Ω). Œus, λ is an Ω-validator of S, and therefore, S is Ω-valid.  1: procedure MS(U) 2: Ω ← ∅ Lemma 3.7. Let S ⊂ [m] be an Ω-valid set and let λ ∈ Fn an 3: for j = 1,...,n Ω-validator of S. Šen v = λT U is an Ω-independent vector with at 4: (vj ,i) ← SIV (U , Ω) least |S| zero entries. 5: Replace i’th row of U with vj n T 6: Ω ← Ω ∪ {i} Proof. Since S is valid, there exists λ ∈ F s.t. λ U:,S = 0 and return U supp (λ) 1 Ω. Denote v = λT U . By de€nition, v has at least |S| zero  T  entries since i ∈ S vi = λ U:, S = 0. Next we show that v is Ω- ∀ i T Ín 3 OPTIMAL SPARSIFICATION METHODS independent. Note that, v = λ U = i=1 λiUi, : is in the row space of U since it is a linear combination of the rows of U . Furthermore, In this section, we reframe SIV as a problem of €nding a maximal since supp (λ) 1 Ω, there exists i < Ω s.t. λ , 0. Œerefore, subset of columns of the input matrix U according to constraints 0 i0 v is not in the row span of U since we assume (Remark 2.14) given by Ω (see De€nition 3.2). We refer to such sets as Ω-valid sets Ω, : that all rows of U are linearly independent. Hence, v = λT U is an and show that Ω-valid sets are tied to sparse independent vectors Ω-independent vector with at least |S| zero entries.  (Section 3.1) and that any algorithm which €nds an Ω-valid set of maximal cardinality can be used as an oracle in Algorithm 2. Corollary 3.8. Let M ⊂ [m] be a maximal Ω-valid set (i.e., Finally, we show how to €nd maximal Ω-valid sets (Section 3.2), M is not a subset of any other Ω-valid set), and let v ∈ Fn be an and obtain two algorithms that solve SIV. Ω-independent vector s.t. i ∈ M vi = 0. Šen j < M, vj , 0. Recall that we use the convention that U ∈ Fn×m where n ≤ m ∀ ∀ (see Remark 2.14). Œroughout this section, we also assume that U Proof. Denote the set of indices of zero entries of v by M 0 =  0 is of full rank n and Ω ( [n]. j : vj = 0 . Since v is Ω-independent, Lemma 3.6 yields that M is valid. Hence, by maximality of M, M = M 0 and |M| = zeros (v). Notation 3.1. For a set S and an integer k, let Ck (S) denote the Œerefore, i ∈ [m] vi = 0 if, and only if, i ∈ M.  set of all subsets of S with k elements. ∀ ⊂ De€nition 3.2. 
S ⊂ [m] is Ω-valid if there exists i < Ω such that Corollary 3.9. Let M [m] be a maximal Ω-valid set and let   λ ∈ Fn be an Ω-validator of M. Šen v = λT U is an Ω-independent U is in the span of rows U . i, S [n]\{i }, S vector with exactly |M| zero entries. Formally, a set S ⊂ [m] is Ω-valid if exists λ ∈ Fn with supp (λ) 1 T Ω s.t. λ U:, S = 0 (where supp (λ) = {i : λi , 0}). Proof. Follows directly from Lemma 3.7 and Corollary 3.8  4 Œe €nal two claims will show how Ω-validity can serve as (since λT U (:, S) = 0), and any linear combinations of columns of S. an oracle for Algorithm 2. Recall that Algorithm 2 uses an or- Œis leads to the following extension of sets: acle which returns a pair (v,i), where v is an optimally sparse ⊂ [ ] ( ) Ω-independent vector, and replacing the i’th row of U with v does De€nition 3.12. Let S m . We de€ne the extension of S, E S , ⊂ [ ]   not change the row span of U . Œe next claim shows that a maxi- to be the largest set E m s.t. span col U:, S = span col U:, E . mally sparse Ω-independent vector is equivalent to an Ω-valid set Lemma 3.13. Let S ⊂ [m]. Šen S is Ω-valid if, and only if, E (S) of maximal cardinality. is Ω-valid. Claim 3.10. An Ω-independent vector v ∈ Fm is optimally sparse if, and only if, M = {i : v = 0} is an Ω-valid set of maximal cardi- Proof. Assume E (S) is Ω-valid. By de€nition of Ω-validity, i n T nality. exists a vector λ ∈ F s.t. supp (λ) 1 Ω and λ U:,E(S) = 0. Since T S ⊂ E (S), λ U:,S = 0, therefore, S is valid. ∈ Fm n Proof. First, assume that v is a maximally sparse Ω- Let S ⊂ [m] be an Ω-valid set, and let λ ∈ F with supp (λ) 1 Ω independent vector (i.e., for any Ω-independent vectoru, zeros (u) ≤ T s.t. λ U:, S = 0. Since col (U (:, E (S))) = col (U (:, S)), all columns zeros (v)). From Lemma 3.6, we know that M is Ω-valid. Lemma 3.7 indexed by E (S) are linear combinations of the columns indexed by shows that if there exists an Ω-valid set S s.t. |M| < |S|, then S. Since λ is orthogonal to all columns of U indexed by S, it is also there also exists an Ω-independent vector u ∈ Fm s.t. zeros (u) ≥ T orthogonal to all their linear combinations. Œerefore, λ U ( ) = | | ( ) :, E S S > zeros v . Œis contradicts v being a maximally sparse Ω- 0. Hence E (S) is valid.  independent vector. Next we show that the search for a maximal Ω-valid set can be Now, assume that M is an Ω-valid set of maximal cardinality (i.e., reduced to the search over maximal extensions of sets of size n − 1. for any Ω-valid set S, |S| ≤ |M|) and let λM be an Ω-validator of T  ≤ − M. By Corollary 3.9, vM = λ U is an Ω-independent vector with Remark 3.14. Note that rank U:, S n 1 for any Ω-valid set M  T exactly |M| zero entries. Assume by contradiction that exists u ∈ S. Œis is due to the fact that if rank U:, S = n then λ U:, S = 0 Fm with z > |M| zero entries, then by Lemma 3.6, there is an Ω- implies that λ = 0 since the rows of U are linearly independent. valid set S s.t. |M| < |S|, in contradiction to M being an Ω-valid set Lemma 3.15. Let S be an Ω-valid set and let λ ∈ Fn be an Ω- of maximal cardinality. Œerefore, v = λ U is a maximally sparse M M validator of S. Šen Ω-independent vector.  n  T  o Ω S E (S) ⊂ i : λ U = 0 Œe following claim shows that given an -valid set, , and i its corresponding Ω-independent vector v (as in Lemma 3.7), the n   o support of the Ω-validator of S can be used to €nd an index i s.t. Proof. Let D = i : λT U = 0 . 
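Remark 3.5 and the lemmas above suggest a direct procedure for testing Ω-validity: a set S is Ω-valid exactly when the left null space of U restricted to the columns of S contains a vector whose support is not contained in Ω (a set of row indices), and any such vector is an Ω-validator. The following minimal sketch uses exact rational arithmetic via SymPy; the function name and the toy matrix are ours, not the paper's.

from sympy import Matrix

def omega_validator(U, S, Omega):
    """Return an Omega-validator of the column set S (Definition 3.2), or None if S is not Omega-valid.

    U is an n x m sympy Matrix with linearly independent rows; S is a set of column
    indices and Omega a set of row indices. A validator is a vector lam with
    lam^T U[:, S] = 0 whose support is not contained in Omega.
    """
    cols = U.extract(list(range(U.rows)), sorted(S))
    for lam in cols.T.nullspace():        # basis of {lam : lam^T U[:, S] = 0}
        # If every basis vector were supported inside Omega, every combination
        # would be too; hence it suffices to inspect the basis vectors.
        if any(lam[i] != 0 for i in range(U.rows) if i not in Omega):
            return lam
    return None

# Toy example (not one of the paper's encoding matrices):
U = Matrix([[1, 0, 1, 1],
            [0, 1, 1, 0],
            [1, 1, 0, 2]])
lam = omega_validator(U, S={2}, Omega={0})
if lam is not None:
    print(lam.T, (lam.T * U))             # the product is an Omega-independent vector (Lemma 3.7)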
By De€nition 3.12, columns the i’th row of U can be replaced with v without changing the row i indexed by E (S) are linear combinations of the columns indexed span of U . by S and λ is orthogonal to all columns of U:, S (and their linear T Claim 3.11. Let S be an Ω-valid set, let λ be an Ω-validator of S, combinations). Hence, λ U:, E(S) = 0 and E (S) ⊂ D.  and let v = λT U . Šen for any i ∈ supp (λ) \ Ω, replacing row i of U  with v does not the change row span of U . Šat is: Lemma 3.16. Let S be an Ω-valid set s.t. rank U:, S = n − 1, and     let D be an Ω-valid set s.t. S ⊂ D. Šen D ⊂ E (S). span (rows (U )) = span rows U[n]\{i },: ∪ {v}   ∈ ( ) \ Proof. Since S ⊂ D, n − 1 = rank U:, S ≤ rank U:, D . How- Proof. Fix i0 supp λ Ω. Since v is a linear combination      ever, from Remark 3.14, we know that rank U:, D ≤ n − 1, there- of rows of U and λ , 0, u ∈ span rows U ∪ {v} , for i0 [n]\{i0 },: fore, span col U  = span col U . Hence, by de€nition, n :, S :, D any u ∈ span (rows (U )). Now, let α ∈ F be the vector αj = −λj D ⊂ E (S).     (for j , i ) and α = 0. Œen w = αT U ∈ span rows U , 0 i0 [n]\{i0 },:      Corollary 3.17. Let S be an Ω-valid set s.t. rank U:, S = n − 1, w +v = λ U ∈ span rows U ∪ {v} n therefore i0 i0,: [n]\{i0 },: . Hence, and let λ ∈ F be an Ω-validator of S. Šen span (rows (U )) = n   o     E (S) = i : λT U = 0 span rows U[n]\{i },: ∪ {v} .  i

Œerefore, any algorithm which €nds an Ω-valid set of maximal Proof. Œis is a direct result of Lemma 3.15 and Lemma 3.16.  cardinality is an oracle for Algorithm 2. Note that Corollary 3.17 gives us the tools to quickly compute  3.2 Computing maximal Ω-valid sets. the extension of any Ω-valid set S such that rank U:, S = n − 1. Given a maximal Ω-valid set, we now have the tools to compute Next we prove that any maximal Ω-valid set is an extension of an optimally sparse Ω-independent vectors. As the next stage, we show Ω-valid set of n − 1 linearly independent columns of U : how to compute a maximal Ω-valid set M using a small subset of Claim 3.18. Let S ⊂ [m] be a maximal Ω-valid set, then columns S ⊂ M. Œe key intuition here is that if λ ∈ Fn is an  Ω-validator of S, then λ is orthogonal to all columns indexed by S rank U:, S = n − 1 5 Proof. Let S ⊂ [m] be a maximal Ω-valid set, and let i0 < Ω such Algorithm 3 Sparsest Independent Vector (1)    ∈ that Ui0, S span rows U[n]\i0, S (such i0 exists by de€nition of 1: procedure SIV (U, Ω)  2: sparsity ← 0 an Ω-valid set). Suppose, by contradiction, that rank U:, S = n − r for some r > 1. 3: sparsest ← null  4: i ← null Note that since Ui0, S is in the row span U[n]\i0, S , rank U:, S =   5: for C ∈ Cn−1({1,...,m}) rank U = n − r. Œerefore, exists S ⊂ S s.t. |S| = n − r [n]\i0, S 0 6: if rank U  < n − 1 or C is not Ω-valid  :,C and rank U:, S = n − r. 7: continue 0   8: ← Let Q ⊂ [m]\S s.t. |Q| = r −1, rank U[n]\i , Q = r −1, and each λ Ω-validator of C 0 ← T column indexed by Q is not in the column span of U . Such 9: v λ U [n]\i0, S ← { } Q exists because the matrix U has full rank n − 1 (since U 10: E i : vi = 0 [n]\{i0 }, : | | is of full row rank n). 11: if E > sparsity 12: sparsity ← |E| Since the matrix U[n]\{i }, S ∪Q is a square n − 1 ×n − 1 matrix of 0 0   13: sparsest ← v U row U full rank, i0,S0∪Q is in the span of [n]\{i0 }, S0∪Q . Œerefore, 14: i ← any element of supp (λ) \ Ω S0 ∪ Q is an Ω-valid set. return (v, i) By Lemma 3.13, the extension of S0 ∪Q is also valid. Furthermore, S ∪Q ⊂ E (S0 ∪ Q) because we have chosen S0 s.t. it spans the same column space as S. However, by construction of Q, we know that Theorem 3.22. Algorithm 3 produces an optimal solution to SIV, and is an oracle for Algorithm 2. S ∩ Q = ∅, meaning that |E (S0 ∪ Q)| ≥ |S ∪ Q| > |S|. Œis in contradiction to maximality of S.  Proof. By Lemma 3.21, Algorithm 3 iterates over all maximal Ω- valid sets. Lines 14-17 check whether a given Ω-valid set has greater Corollary 3.19. Let S ⊂ [m] be a maximal Ω-valid set and let cardinality than any previously found maximal Ω-valid set and if it C ⊂ S s.t. rank U  = n − 1. Šen S = E (C). :, C does, the algorithm choose this set as a working solution. Hence, at the end of the algorithm, the chosen vectorv correlates to a maximal Proof. C ⊂ S, Œerefore, span col U  ⊂ span col U . :, C :, S cardinality Ω-valid set. By Claim 3.10,v is an optimal solution to SIV Because S is maximal, Claim 3.18 shows that rank U  = n−1. We :, S (a maximally sparse Ω-independent vector) if, and only if, the set have, by rank equality, that span col U  = span col U . :, C :, S E = {i : v = 0} is an Ω-valid set of maximal cardinality. Œerefore, By de€nition, E (C) is the maximal set E s.t. span col U  ⊂ i :, C the vector chosen at the end of the algorithm is a maximally sparse span col U , therefore, S ⊂ E (C). However, by maximality of :, E Ω-independent vector. Finally, by Claim 3.11, the pair (v,i) serves S, we have S = E (C).  
as the oracle for SIV required by Algorithm 2.  Corollary 3.20. Let S ⊂ [m] be a maximal Ω-valid set, then 3.4 Implementation of our €rst optimal exist C ∈ Cn−1 ([m]) s.t. S = E (C). algorithm. Proof. Œis is a direct result of Corollary 3.19  In order for Algorithm 3 to perform well, we have added a blacklist to the algorithm’s operation. Since the maximal Ω-valid sets are 3.3 First algorithm for SIV. generated by computing the extension (De€nition 3.12) of n − 1 Our €rst algorithm performs an exhaustive search over all maximal independent columns, once a given Ω-valid set is found, we wish Ω-valid sets in order €nd one with maximal cardinality. Œis is a to blacklist all of its subsets of size n − 1 since we need not revisit result of the observation given by Claim 3.10, which states that any that extension. However, in addition to memory costs, looking solution to SIV is tied to an Ω-valid set of maximal cardinality (and up an element in the blacklist incurs a signi€cant overhead as the vice versa). Œe search is done using by combining Corollary 3.20, blacklist grows. To address this problem, rather than storing all which states that any maximal Ω-valid set is the extension of an subsets Cn−1 (S) of a given set S, we store S itself in the blacklist, Ω-valid set of n −1 independent columns, and Corollary 3.17, which in which case C is not blacklisted if B ∈ blacklist C 1 B. Despite ∀ provides a method to compute said extension. this measure, in some cases the blacklist still grew too large, so we and imposed a limit on the maximum size of the blacklist, storing Lemma 3.21. Algorithm 3 iterates over all maximal Ω-valid sets. only the M largest sets found so far.
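For concreteness, the following is a compact, unoptimized Python sketch of the pipeline described above: Gottlieb and Neylon's loop (Algorithm 2) driven by an exhaustive SIV oracle in the spirit of Algorithm 3, over exact rationals via SymPy. The blacklist refinement of Section 3.4 is omitted, and the function names and the toy input are ours; it assumes U has full row rank.

from itertools import combinations
from sympy import Matrix

def _validator(U, cols, omega):
    """An omega-validator of the column set `cols` (Definition 3.2), or None."""
    sub = U.extract(list(range(U.rows)), list(cols))
    for lam in sub.T.nullspace():
        if any(lam[i] != 0 for i in range(U.rows) if i not in omega):
            return lam
    return None

def siv(U, omega):
    """Algorithm 3 (sketch): an optimally sparse omega-independent vector v and a
    row index i outside omega whose row may be replaced by v (Claim 3.11)."""
    n, m = U.shape
    best = None   # (number of zeros, v, i)
    for C in combinations(range(m), n - 1):
        if U.extract(list(range(n)), list(C)).rank() < n - 1:
            continue
        lam = _validator(U, C, omega)
        if lam is None:                       # C is not omega-valid
            continue
        v = lam.T * U                         # omega-independent vector (Lemma 3.7)
        nzeros = sum(1 for j in range(m) if v[j] == 0)
        if best is None or nzeros > best[0]:
            i = next(r for r in range(n) if r not in omega and lam[r] != 0)
            best = (nzeros, v, i)
    return best[1], best[2]

def matrix_sparsification(U):
    """Algorithm 2 (sketch): replace the rows of U one by one with sparsest independent vectors."""
    U = U.copy()
    omega = set()
    for _ in range(U.rows):
        v, i = siv(U, omega)
        U[i, :] = v
        omega.add(i)
    return U

# Toy example (3 x 4, full row rank); the returned matrix spans the same row space.
U = Matrix([[1, 1, 0, 1],
            [0, 1, 1, 1],
            [1, 0, 1, 1]])
print(matrix_sparsification(U))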

Proof. By Corollary 3.20, for any maximal Ω-valid set E, there  3.5 Second algorithm for SIV. exist an Ω-valid set C ∈ C − ([m]) s.t. rank U = n − 1 and n 1 :, C While our €rst algorithm performs well in many cases, we have E is the extension of C. Œerefore, the algorithm iterates over all  found that it performs poorly when the largest Ω-valid set is very Ω-valid sets C ∈ Cn−1 ([m]) s.t. rank U:, C = n − 1. Furthermore,  large. In such cases the algorithm quickly €nds the correct solution, by Corollary 3.17, if rank U:, C = n − 1 and λ is an Ω-validator of n   o but then continues its exhaustive search for a very long time. Our C then E (C) = i : λT U = 0 . Œe algorithm performs this i second algorithm is slightly simpler and avoids this ineciency computation at lines 11-13. Hence, the algorithm iterates over all by using a top-down approach, searching for Ω-valid sets in de- Ω-valid sets.  scending order of cardinality to €nd an Ω-valid set of maximal 6 cardinality. Just like our €rst algorithm, it relies on the observation Algorithm 5 Row basis sparsi€cation [22, 23] of Claim 3.10, which ties any solution of SIV (maximally sparse, 1: procedure KS-Sparsification(U) Ω-independent vector) to an Ω-valid set of maximal cardinality. 2: sparsity ← nnz (U ) 3: basis ← In Algorithm 4 Sparsest Independent Vector (2) 4: for C ∈ Cm (n) 5: if U:, C is of full rank 1: procedure SIV (U, Ω) − 6: sparsi€er ← U 1 2: for z = m − 1,...,n − 1 :, C 7: if nnz (sparsi f ier · U ) < sparsity 3: for C ∈ Cz ([m])  8: sparsity ← nnz (sparsi f ier · U ) 4: if rank U:,C = n − 1 and C is Ω-valid 9: basis ← sparsi€er 5: λ ← Ω-validator of C 6: v ← λT U 10: return basis 7: i ← any element of supp (λ) \ Ω 8: return (v, i) 4.2 Greedy sparsi€cation. A second heuristic for matrix sparsi€cation, inspired by GoŠlieb To prove the correctness of our Algorithm 4, we use the following and Neylon’s algorithm (Algorithm 2), employs an even simpler lemma, which provides bounds on the size of a maximal Ω-valid set. greedy approach. Recall that for a given n × m matrix U (n ≤ m), we seek an n × n Lemma 3.23. Let S ⊂ [m] be a maximal Ω-valid set, then n − 1 ≤ matrix A which minimizes nnz (AU ) + nns (AU ). For this purpose, |S| ≤ m − 1. rather than searching for the entire invertible matrix A achieving this objective, we could instead search for each row of A individually. Proof. First, we show that |S| < m. Assume, by contradiction, Concretely, we iteratively compose the matrix A row-wise; where n T that |S| = m and let λ ∈ F be an Ω-validator of S. Œen λ U = 0, at each step i, we obtain the sparsest row vector vi such that vi is Í which means that i ∈[n] λiUi,: = 0, in contradiction to U having independent of {v1,...,vi−1} and minimizes nnz (vU ) + nns (vU ). full row rank n. Hence, |S| ≤ m − 1. Œis yields the following algorithm: Next, by Claim 3.18, since S is a maximal Ω-valid set, its rank isn − 1, therefore, n − 1 ≤ |S|. Hence n − 1 ≤ |S| ≤ m − 1.  Algorithm 6 Greedy Sparsi€cation 1: procedure Greedy − Sparsi f ication(U ) Theorem 3.24. Algorithm 3 produces an optimal solution to SIV, 2: A ← ∅ and is an oracle for Algorithm 2. 3: for i = 1,..., n T  T  m 4: v ← argmin nnz v U + nns v U Proof. Claim 3.10 states that v ∈ F is a solution to SIV (an op- v ∈Fm { } r k({v1,...,vi−1,v })=i timally sparse, Ω-independent vector) if and only if S = i : vi = 0 T 5: A ← v is Ω-valid. Œe algorithm iterates all subsets of [m] in descending i,: order of cardinality. 
Œerefore, the €rst Ω-valid set found is an 6: return A Ω-valid set of maximal cardinality. Furthermore, Lemma 3.23 states that any maximal Ω-valid set is of size n − 1 ≤ z ≤ m − 1, hence, the In order to implement the subroutine for €nding each row vec- algorithm iterates all candidates S ⊂ [m] that could be Ω-valid sets tor vi , we encoded the objective as a MaxSAT instance and used of maximal cardinality. Œerefore, Algorithm 4 returns a sparsest Z3 [14], an SMT Œeorem Prover, to €nd the optimal solution. Our Ω-independent vector. Finally, by Claim 3.11, the pair (v,i) serves MaxSAT instance employs two types of “so‰” constraints: one as the oracle for SIV required by Algorithm 2.  which penalizes non-zero entries, and another which penalizes non-singleton entries. Œerefore, optimal solutions will minimize 4 ADDITIONAL SPARSIFICATION METHODS the sum of non-zero and non-singleton entries, thereby minimizing 4.1 Sparsi€cation via subset of rows. the associated arithmetic complexity (Remark 2.4). Œis algorithm, while not proven to be optimal, has the advan- Œe alternative bases presented in Karstadt and Schwartz’s [22, 23] tage of considering both non-zeros and non-singletons, and can paper were found using a straightforward heuristic of iterating over therefore produce decompositions resulting in a lower arithmetic all sets of n linearly independent rows of an n×m matrix of full rank complexity than the optimal algorithms (Algorithms 3, 4). For a (where n ≥ m). Œis heuristic was based on the observation that summary of these results, see Table 2. using the columns of the original matrix for sparsi€cation ensures that the sparsi€ed matrix contains n rows, each with only a single 5 APPLICATION AND RESULTING non-zero entry. m ALGORITHMS While this method is inecient, requiring ( n ) passes, it €nds sparsi€cations which signi€cantly improve the leading coecients Table 2 contains a list of alternative basis algorithms found using our new methods. All of the algorithms used were taken from of multiple algorithms. Œe re€nement of this method led to the 3 development of Algorithm 3. It is therefore presented here for the repository of Ballard and Benson [2] . Œe alternative basis completeness. 3Œe algorithms can be found at github.com/arbenson/fast-matmul 7 Table 2: Alternative Basis Algorithms

Algorithm | Leading Monomial | Arithmetic Operations (Original / Here) | Leading Coefficient (Original / Here) | Improvement
⟨2, 2, 2; 7⟩ [38] | n^(log_2 7) ≈ n^2.80735 | 18 / 12 | 7 / 5 | 28.57%
⟨3, 2, 2; 11⟩ [2] | n^(log_12 11^3) ≈ n^2.89495 | 22 / 18 | 5.06 / 4.26 | 15.82%
⟨2, 3, 2; 11⟩ [40] | n^(log_12 11^3) ≈ n^2.89495 | 22 / 18 | 4.71 / 3.91 | 16.97%
⟨4, 2, 2; 14⟩ [2] | n^(log_16 14^3) ≈ n^2.85551 | 48 / 28 | 8.33 / 5.27 | 36.8%
⟨3, 2, 3; 15⟩ [18] | n^(log_18 15^3) ≈ n^2.81076 | 55 / 39 | 8.28 / 6.17 | 25.5%
⟨3, 2, 3; 15⟩ [2] | n^(log_18 15^3) ≈ n^2.81076 | 64 / 39 | 9.61 / 6.17 | 35.84%
⟨5, 2, 2; 18⟩ [2] | n^(log_20 18^3) ≈ n^2.89449 | 53 / 32 | 6.98 / 4.46 | 36.06%
⟨4, 2, 3; 20⟩ [35] | n^(log_24 20^3) ≈ n^2.82789 | 78 / 51 | 8.9 / 5.88 | 33.96%
⟨4, 2, 3; 20⟩ [2] | n^(log_24 20^3) ≈ n^2.82789 | 82 / 51 | 9.19 / 5.88 | 36.01%
⟨4, 2, 3; 20⟩ [2] | n^(log_24 20^3) ≈ n^2.82789 | 86 / 54 | 9.38 / 6.12 | 34.77%
⟨4, 2, 3; 20⟩ [2] | n^(log_24 20^3) ≈ n^2.82789 | 104 / 56 | 11.38 / 6.38 | 43.9%
⟨2, 3, 4; 20⟩ [2] | n^(log_24 20^3) ≈ n^2.82789 | 96 / 58 | 9.96 / 6.12 | 38.59%
⟨3, 3, 3; 23⟩ [2] | n^(log_3 23) ≈ n^2.85404 | 87 / 66 | 7.21 / 5.71 | 20.79%
⟨3, 3, 3; 23⟩ [2] | n^(log_3 23) ≈ n^2.85404 | 88 / 65 | 7.29 / 5.64 | 22.55%
⟨3, 3, 3; 23⟩ [2] | n^(log_3 23) ≈ n^2.85404 | 89 / 65 | 7.36 / 5.64 | 23.3%
⟨3, 3, 3; 23⟩ [2] | n^(log_3 23) ≈ n^2.85404 | 97 / 61 | 7.93 / 5.36 | 32.43%
⟨3, 3, 3; 23⟩ [2] | n^(log_3 23) ≈ n^2.85404 | 166 / 73 | 12.86 / 6.21 | 51.67%
⟨3, 3, 3; 23⟩ [26] | n^(log_3 23) ≈ n^2.85404 | 98 / 74 | 8 / 6.29 | 21.43%
⟨3, 3, 3; 23⟩ [35] | n^(log_3 23) ≈ n^2.85404 | 84 / 68 | 7 / 5.86 | 16.33%
⟨4, 4, 2; 26⟩ [2] | n^(log_32 26^3) ≈ n^2.82026 | 235 / 105 (?) | 18.1 / 7.81 | 56.84%
⟨4, 3, 3; 29⟩ [2] | n^(log_36 29^3) ≈ n^2.81898 | 164 / 102 | 10.27 / 6.73 | 34.49%
⟨3, 4, 3; 29⟩ [36] | n^(log_36 29^3) ≈ n^2.81898 | 137 / 109 | 8.54 / 6.96 | 18.46%
⟨3, 4, 3; 29⟩ [2] | n^(log_36 29^3) ≈ n^2.81898 | 167 / 105 | 10.27 / 6.73 | 34.49%
⟨3, 5, 3; 36⟩ [36] | n^(log_45 36^3) ≈ n^2.82414 | 199 / 139 | 9.62 / 6.87 | 28.6%
⟨6, 3, 3; 40⟩ [35] | n^(log_54 40^3) ≈ n^2.77429 | 1246 / 190 (?) | 55.63 / 8.9 | 84.01%
⟨3, 3, 6; 40⟩ [41] | n^(log_54 40^3) ≈ n^2.77429 | 1822 / 190 (?) | 79.28 / 8.9 | 88.78%

(?) Denotes algorithms with non-singular values, where the result of Algorithm 6 was better than those of the exhaustive algorithms.

algorithms obtained represent a significant improvement over the original versions, with the reduction in the leading coefficient ranging between 15% and 88%. Almost all of the results were found using our exhaustive methods (Algorithms 3 and 4). In certain cases (marked (?)), where the U, V, W matrices contain non-singular values, our search heuristic's (Algorithm 6) result exceeded those of our exhaustive algorithms. For example, bases obtained for the ⟨4, 4, 2; 26⟩-algorithm by Algorithms 3 and 4 reduced the number of arithmetic operations from 235 to 110, while Algorithm 6 reduced the number of arithmetic operations even further, to 105.

Owing to the exponential worst-case complexity of the exhaustive methods, the decomposition of some of the larger instances required the use of the Mira supercomputer. However, after some tuning of Algorithms 3 and 4 (see Section 3.4) and the implementation of Algorithm 6 using Z3, all decompositions completed on a PC within a reasonable time. Specifically, all runs of Algorithms 3 and 4 completed within 40 minutes, while Algorithm 6 took less than one minute, on a PC. It should be remembered that Algorithms 3 and 4 guarantee optimal sparsification, while Algorithm 6 has no such guarantee. However, in all cases, Algorithm 6 ran much faster and produced an equally good decomposition, with better results when there were non-singular values.
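The per-row subroutine of Algorithm 6 (Section 4.2) was implemented as a MaxSAT instance and solved with Z3. The sketch below only conveys the flavour of such an encoding and relies on simplifying assumptions of ours: the candidate row v ranges over small integers, U is an integer matrix, and only the zero vector is excluded, so the linear-independence requirement of Algorithm 6 must still be enforced by the caller (e.g., by rejecting dependent optima and re-solving).

from z3 import Optimize, Int, Or, Sum, sat

def sparsest_row(U, bound=2):
    """MaxSAT-style search for a row vector v minimizing nnz(vU) + nns(vU).

    U is a list of n integer rows. Restricting the entries of v to [-bound, bound]
    is a simplification of ours, not part of the paper's encoding.
    """
    n, m = len(U), len(U[0])
    v = [Int(f"v{k}") for k in range(n)]
    opt = Optimize()
    for vk in v:
        opt.add(vk >= -bound, vk <= bound)
    opt.add(Or([vk != 0 for vk in v]))                 # exclude the zero vector
    w = [Sum([v[k] * U[k][j] for k in range(n)]) for j in range(m)]   # w = v U
    for wj in w:
        opt.add_soft(wj == 0)                          # penalize non-zero entries
        opt.add_soft(Or(wj == 0, wj == 1, wj == -1))   # penalize non-singleton entries
    if opt.check() == sat:
        model = opt.model()
        return [model.eval(vk, model_completion=True).as_long() for vk in v]
    return None

# Toy example (not one of the paper's encoding matrices):
U = [[1, 1, 0, 1],
     [0, 1, 1, 1],
     [1, 1, 1, 2]]
print(sparsest_row(U))

Each entry of vU that is nonzero violates the first soft constraint, and each entry that is neither 0 nor ±1 additionally violates the second, so an optimal model minimizes exactly nnz(vU) + nns(vU), mirroring the two types of soft constraints described in Section 4.2.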

Comparison of di‚erent search methods. Œe exhaustive algo- 6 DISCUSSION AND FUTURE WORK rithms (Algorithms 3, 4) solve the SIV problem. Œeir proof of We have improved the leading coecient of several fast matrix correctness, coupled with that of GoŠlieb and Neylon’s algorithm, multiplication algorithms by introducing new methods to solve to guarantee that they obtain decompositions minimizing the number of non-zero entries. As MS and SIV are both NP-Hard problems, these algorithms exhibit an exponential worst-case complexity. For 4Matebook X (i7-7500U CPU and 8GB RAM) 8 sparsify the encoding/decoding matrices of fast matrix multipli- [6] Nader H Bshouty. 1995. On the additive complexity of 2×2 matrix multiplication. cation algorithms. Œe number of arithmetic operations depends Information processing leˆers 56, 6 (1995), 329–335. [7] Murat Cenk and M Anwar Hasan. 2017. On the arithmetic complexity of Strassen- on both both non-zero and non-singular entries. Œis means that like matrix multiplications. Journal of Symbolic Computation 80 (2017), 484–501. in order to minimize the arithmetic complexity, the sum of both [8] S Frank Chang and S Œomas McCormick. 1992. A hierarchical algorithm for and making sparse matrices sparser. Mathematical Programming 56, 1 (1992), 1–30. non-zero non-singular entries should be minimized, otherwise [9] Henry Cohn and Christopher Umans. 2003. A group-theoretic approach to fast an optimal sparsi€cation may result in a 2-approximation of the matrix multiplication. In Foundations of Computer Science, 2003. Proceedings. 44th minimal number of arithmetic operations when matrix entries are Annual IEEE Symposium on. IEEE, 438–449. [10] Don Coppersmith and Shmuel Winograd. 1982. On the asymptotic complexity not limited to 0, ±1. Further work is required in order to €nd a of matrix multiplication. SIAM J. Comput. 11, 3 (1982), 472–492. provably optimal algorithm which minimizes both non-zero and [11] Don Coppersmith and Shmuel Winograd. 1990. Matrix multiplication via arith- non-singleton values. metic progressions. Journal of symbolic computation 9, 3 (1990), 251–280. [12] Hans F de Groote. 1978. On varieties of optimal algorithms for the computation We aŠempted sparsi€cation of additional algorithms for larger of bilinear mappings I. the isotropy group of a bilinear mapping. Šeoretical dimensions (e.g., Pan’s h44, 44, 44; 36133i-algorithm [31], which is Computer Science 7, 1 (1978), 1–24. [13] Hans F de Groote. 1978. On varieties of optimal algorithms for the computa- asymptotically faster than those presented here). However, the tion of bilinear mappings II. Optimal algorithms for 2×2-matrix multiplication. size of the base case of these algorithms led to prohibitively long Šeoretical Computer Science 7, 2 (1978), 127–148. runtimes. [14] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An ecient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis Œe methods presented in this paper apply to €nding square of Systems. Springer, 337–340. invertible matrices solving the MS problem. Other classes of sparse [15] Lee-Ad GoŠlieb and Tyler Neylon. 2010. Matrix sparsi€cation and the sparse decompositions exist which do not fall within this category. For ex- null space problem. In Approximation, Randomization, and Combinatorial Opti- mization. Algorithms and Techniques. Springer, 205–218. ample, Beniamini and Schwartz’s [1] decomposed recursive-bilinear [16] Vince Grolmusz. 2008. 
Modular representations of polynomials: Hyperdense framework relies upon decompositions in which the sparsifying coding and fast matrix multiplication. IEEE Transactions on Information Šeory 54, 8 (2008), 3687–3692. matrix may be rectangular, rather than square. Some of the leading [17] Alan J Ho‚man and ST McCormick. 1984. A fast algorithm that makes matrices coecients in [1] are beŠer than those presented here. For example, optimally sparse. Progress in Combinatorial Optimization (1984), 185–196. they obtained a leading coe‚cient of 2 for a h3, 3, 3; 23i-algorithm [18] John E Hopcro‰ and Leslie R Kerr. 1971. On minimizing the number of multi- plications necessary for matrix multiplication. SIAM J. Appl. Math. 20, 1 (1971), of [2] a h4, 3, 3; 29i-algorithm of [36], compared to our values 5.36 30–36. and 6.96 respectively. However, the arithmetic overhead of basis [19] John E Hopcro‰ and Jean Musinski. 1973. Duality applied to the complexity of transformation in Karstadt and Schwartz [22, 23] (and therefore matrix multiplications and other bilinear forms. In Proceedings of the €‡h annual 2  ACM symposium on Šeory of computing. ACM, 73–87. here as well) is O n logn , whereas in [1] it may be larger. Note [20] Rodney W Johnson and Aileen M McLoughlin. 1986. Noncommutative Bilinear also that the decomposition heuristic of [1] does not always guaran- Algorithms for 3×3 Matrix Multiplication. SIAM J. Comput. 15, 2 (1986), 595–603. [21] Igor Kaporin. 1999. A practical algorithm for faster matrix multiplication. Nu- tee optimality. Further work is required to €nd new decomposition merical with applications 6, 8 (1999), 687–700. methods for such seŠings. [22] Elaye Karstadt and Oded Schwartz. 2017. Matrix multiplication, a liŠle faster. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 101–110. 7 ACKNOWLEDGEMENTS [23] Elaye Karstadt and Oded Schwartz. 2020. Matrix multiplication, a liŠle faster. We thank Austin R. Benson for providing details regarding the Journal of the ACM (JACM) 67, 1 (2020), 1–31. [24] Donald E Knuth. 1981. Œe Art of Computer Programming, Volume 2: Seminu- h2, 3, 2; 11i-algorithm. Œis research used resources of the Argonne merical Algorithms, Addison-Wesley. Reading, MA (1981). Leadership Computing Facility, which is a DOE Oce of Science [25] Julian Laderman, Victor Y Pan, and Xuan-He Sha. 1992. On practical algorithms User Facility supported under Contract DE-AC02-06CH11357. Œis for accelerated matrix multiplication. Linear Algebra and Its Applications 162 (1992), 557–588. work was supported by the PetaCloud industry-academia consor- [26] Julian D Laderman. 1976. A noncommutative algorithm for multiplying 3×3 tium. Œis research was supported by a grant from the United matrices using 23 multiplications. In Am. Math. Soc, Vol. 82. 126–128. [27] Franc¸ois Le Gall. 2014. Powers of tensors and fast matrix multiplication. In States-Israel Bi-national Science Foundation, Jerusalem, Israel. Œis Proceedings of the 39th international symposium on symbolic and algebraic com- work was supported by the HUJI Cyber Security Research Center putation. ACM, 296–303. in conjunction with the Israel National Cyber Bureau in the Prime [28] S Œomas McCormick. 1983. A Combinatorial Approach to Some Problems. Technical Report. Stanford university CA systems optimization lab. Minister’s Oce. Œis project has received funding from the Euro- [29] S Œomas McCormick. 1990. 
Making sparse matrices sparser: Computational pean Research Council (ERC) under the European Union’s Horizon results. Mathematical Programming 49, 1-3 (1990), 91–111. 2020 research and innovation programme (grant agreement No [30] Victor Y Pan. 1978. Strassen’s algorithm is not optimal trilinear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix 818252). operations. In Foundations of Computer Science, 1978., 19th Annual Symposium on. IEEE, 166–176. REFERENCES [31] Victor Y Pan. 1982. Trilinear aggregating with implicit canceling for a new acceleration of matrix multiplication. Computers & Mathematics with Applications [1] Gal Beniamini and Oded Schwartz. 2019. Faster Matrix Multiplication via Sparse 8, 1 (1982), 23–34. Decomposition. In Proceedings of the 31st ACM Symposium on Parallelism in [32] Robert L Probert. 1976. On the additive complexity of matrix multiplication. Algorithms and Architectures. ACM, 11–22. SIAM J. Comput. 5, 2 (1976), 187–203. [2] Austin R Benson and Grey Ballard. 2015. A framework for practical parallel fast [33] Francesco Romani. 1982. Some properties of disjoint sums of tensors related to matrix multiplication. ACM SIGPLAN Notices 50, 8 (2015), 42–53. matrix multiplication. SIAM J. Comput. 11, 2 (1982), 263–267. 2.7799 [3] Dario Bini, Milvio Capovani, Francesco Romani, and Grazia LoŠi. 1979. O(n ) [34] Arnold Schonhage.¨ 1981. Partial and total matrix multiplication. SIAM J. Comput. complexity for n×n approximate matrix multiplication. Information processing 10, 3 (1981), 434–455. leˆers 8, 5 (1979), 234–235. [35] Alexey V Smirnov. 2013. Œe bilinear complexity and practical algorithms for [4] Marco Bodrato. 2010. A Strassen-like matrix multiplication suited for squaring matrix multiplication. Computational Mathematics and Mathematical Physics 53, and higher power computation. In Proceedings of the 2010 International Sympo- 12 (2013), 1781–1795. sium on Symbolic and Algebraic Computation. ACM, 273–280. [36] Alexey V Smirnov. 2017. Several bilinear algorithms for matrix multiplication. [5] Richard P Brent. 1970. Algorithms for matrix multiplication. Technical Report. Technical Report. Stanford university CA department of computer science.

9 [37] Andrew James Stothers. 2010. On the complexity of matrix multiplication. Šesis [41] Petr Tichavsky,` Anh-Huy Phan, and Andrzej Cichocki. 2017. Numerical CP (2010). decomposition of some dicult tensors. J. Comput. Appl. Math. 317 (2017), [38] . 1969. Gaussian elimination is not optimal. Numerische mathe- 362–370. matik 13, 4 (1969), 354–356. [42] Virginia V Williams. 2012. Multiplying matrices faster than Coppersmith- [39] Volker Strassen. 1986. Œe asymptotic spectrum of tensors and the exponent Winograd. In Proceedings of the forty-fourth annual ACM symposium on Šeory of matrix multiplication. In Foundations of Computer Science, 1986., 27th Annual of computing. ACM, 887–898. Symposium on. IEEE, 49–54. [43] Shmuel Winograd. 1971. On multiplication of 2×2 matrices. Linear algebra and [40] Petr Tichavsky` and Teodor Kova´c.ˇ 2015. Private communication with Ballard its applications 4, 4 (1971), 381–388. and Benson, see [2] for benchmarking. (2015).

10 A SAMPLES OF ALTERNATIVE BASIS Table 5: h3, 3, 3; 23i-algorithm [35] ALGORITHMS Uϕ Vψ Wυ 0 0 0 0 0 0 0 0 1 −1 0 0 0 0 0 0 −1 0 0 −1 0 0 −1 0 0 1 0 In this section we present the encoding/decoding matrices of the 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 −1 0 0 0 0 0 1 0 0 0 −1 0 0 1 0 0 −1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 −1 0 0 alternative basis algorithms listed in Table 2. To verify the correct- 1 0 0 0 0 0 −1 0 0 0 0 −1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 −1 0 0 ness of these algorithms, recall Corollary 2.11 and use the following 0 0 0 −1 0 1 0 0 0 0 0 0 0 0 −1 −1 0 0 0 0 −1 0 0 0 0 1 1 −1 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 −1 0 0 0 −1 0 −1 0 −1 0 1 0 0 0 0 0 0 0 1 0 0 −1 0 0 0 0 0 0 0 fact: 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 −1 0 0 −1 1 0 0 0 0 0 0 0 −1 0 0 0 0 1 0 0 0 0 0 −1 0 1 0 1 0 0 0 1 0 0 0 0 0 Fact A.1. R 0 0 −1 0 0 0 0 0 −1 0 0 0 0 0 −1 0 0 0 0 1 0 −1 0 −1 0 0 0 (Triple product condition). [5, 24] Let be a ring, and 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 t×n·m t×m·k t×n·k 0 0 0 −1 0 0 0 0 1 −1 0 0 1 0 −1 0 0 0 0 0 0 0 1 1 0 0 0 let U ∈ R , V ∈ R , W ∈ R . Œen hU , V , W i are 0 0 −1 0 −1 0 0 0 0 0 0 −1 0 0 0 0 1 −1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 −1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 −1 0 0 0 0 encoding/decoding matrices of an hn,m,k; ti-algorithm if and only 0 0 0 0 0 1 0 −1 0 0 0 0 1 −1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 −1 0 0 0 1 0 0 0 0 0 0 0 −1 1 1 0 0 −1 0 0 0 0 0 0 0 1 if: 0 0 0 1 0 −1 0 1 1 1 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 −1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 −1 , ∈ [ ] , , ∈ [ ] , , ∈ [ ] 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 −1 i1 k1 n j1 i2 m j2 k2 k −1 0 0 0 0 0 0 −1 0 0 −1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ∀ 0 −1 0 0 0 0 0 0 1 0 0 0 0 0 −1 0 0 0 0 0 0 0 −1 −1 0 0 0 t ϕ ψ υ−T Õ 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 −1 0 0 0 0 0 0 0 1 0 0 0 U ,( , )V ,( , )W ,( , ) = δ , δi , j δ , 0 1 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 0 0 −1 0 1 r i1 i2 r j1 j2 r k1 k2 i1 k1 2 1 j2 k2 0 0 0 0 0 0 0 1 0 0 −1 −1 0 0 0 0 0 0 0 0 0 0 −1 1 0 0 0 r =1 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 −1 0 0 0 0 0 −1 −1 0 0 0 0 0 1 1 0 −1 0 1 0 0 0 0 0 0 0 −1 0 0 −1 0 1 1 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 where δi, j = 1 if i = j and 0 otherwise. 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 −1 0 0 1 −1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 A.1 A sample of Algorithms Table 6: h6, 3, 3; 40i-algorithm [35] Table 3: h3, 2, 3; 15i-algorithm [2]

Uϕ Vψ Wυ 0 0 0 0 0 0 −100100000000 0 0 0 0 0 1 −1 0 −1 000000100000000000 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 000000000100000000 0 0 0 0 0 0 −101000000000 0 0 0 1 −1 −1 0 −1 0 000000000000100000 0 0 0 0 0 0 0 0 0 0 −1 0 0 1 0 1 0 0 0 0 0 0 0 −1 0 −1 0 000000001000001000 U V W 000000000000000100 0 0 0 −1 0 1 0 0 0 000000000000000010 ϕ ψ υ 00000000000 −1 0 0 0 0 0 0 0 0 0 0 −1 0 −1 −1 0 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 −1 000000000000100000 0 0 0 1 −1 0 −1 0 0 000000000000000 −1 0 0 0 −1 0 1 0 1 1 0 0 0 0 −1 0 0 1 0 0 0 0 0 0 000000000000 −1 0 1 0 0 0 0 0 0 0 0 0 0 −1 −1 000000000001 −1 0 0 0 0 0 000000000000010000 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 1 −1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 000000000010000000 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 −1 0 −1 0 0 0000000100000000 −1 0 0 0 0 0 1 0 0 0 −1 000000000001000010 0 0 1 1 0 0 0 0 −1 0 1 0 −1 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 −10000000001 0 0 0 0 1 1 0 0 −1 0000000000000 −1 0 0 0 0 000000000000000001 0 0 0 0 0 0 −1 0 0 00000000000000000 −1 0 0 0 0 −1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 −1 0000000000000000 −1 0 0 0 0 0 0 1 −1 0 0 000000000000001000 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 −1 −1 −1 000000010000000000 − − − 00000000000 −1 0 0 −1 0 0 0 0 0 0 1 0 0 0 0 −1 0 0 0 0 0 0 0 −1 −1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 01000000100000 −1 0 0 0 0 −1 0 −1 0 0 0 1 1 0 0 −1 0 0 0 0 −1 −1 0 0 0 1 0 0 0 0 0 0 0 −1000001000000100 0 0 −1 −1 0 1 0 1 0 001000001000000000 1 0 0 0 −1 0 0 0 −1 0 0 0 0 0 0 0 0 −1 0 0 0 0 1 0 0 0 0 1 0 0 −1 0 1 −1 0 0 0 0 0 0 −1 0 0 −1 0 −1 0 0 0 0 0 −100000000000 −1 0 0 00010000000000001 −1 0 0 −1 0 0 1 −1 0 −1 010000100000000000 0 −1 0 0 1 0 0 −1 0 −1 0 1 0 0 0 1 0 0 0 0 0 10000000000 −1 0 1 0 0 0 0 0 1 0 0 0 0 0 −1 0 0 0 0 0 −1 0 0 0 −1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 −1 −10100000010 0 −1 0 0 −1 −1 0 0 1 000001000000 −1 0 0 0 −1 0 0 −1 0 0 0 0 1 0 0 −1 0 0 0 0 0 0 0 −1 −1 0 −1 010000100000 −1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 −1 0 0 0 0 0 0 0 0 −1 0 1 −1 0 0 0 0 0 010000000001000000 −1 0 0 −1 0 0 1 0 1 0 −10000010000000000 1 0 0 1 0 0 0 0 1 0 −1 1 0 0 1 0 0 0 0 −1 0 0 0 −1000000110000000 0 0 −1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 −1 0 0 0 0 0 1 0 0 100000000000000000 0 0 1 0 0 0 0 −1 0 1 0 0 0 0 0 0 0 0 1 −1 0 0 0 0 0 0 −1 − − − − − 0 0 −1000100000000000 −1 0 0 −1 0 1 0 0 0 0 0 0 0 0 −1100000000010 1 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 −1 0 0 1 0 0 0 0 −1 0 0 0 0 0 −1 0 0 0 −1 0 0 0 1 000001000001 −1 1 0 0 0 0 00001000000 −1 0 0 0 0 0 0 −1 0 0 0 0 0 1 0 0 0 0 0 0 −1001000000000 −1 0 0 −1 0 0 0 1 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 −1000000010000 1 0 0 0 0 0 0 0 0 10000000000000001 −1 0 0 0 0 0 1 0 1 0 0 −1 0 0 0 0 0 0 0 1 0 0 0 0 −1 0 0 0 01000010000001 −1 0 0 0 1 0 0 0 0 0 0 −1 0 0 0 1 0 0 0 0 −1 0 0 1 −1 0 0 −10000000100 −1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 −1 0 0 0 0 −1 0 0 0 0 0 0 0 −10000000000001 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 −1 0 0 0 0 0 0 −1 0 0 −1 −1 0 −1 1 0 0 −1 −1 0 0 −1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 −1 0 0 0 0 0 0 0 0 −1 0 1 0 0 0 0 1 0 0 0 0 0 0 −10010000001000 0 0 0 1 0 0 0 0 −1 1 0 0 0 0 0 0 0 0 0 0 −1 0 −1 0 0 0 0 −100000000001000000 0 0 1 0 −1 0 0 0 −1 −1 0 0 0 0 0 1 0 0 0 0 −1 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 −1 0 1 0 0 0 −1 0 0 0 0 0 −1 0 0 −1000000001000000 0 0 0 0 0 −100000000000 −1 0 −1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 −1 0 0 −1 0 0 0 0 − − − 0000010100000 −1 0 −1 −1 0 0 −1 0 0 0 −1 0 0 0 0 0 −10000000000010 −1 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 100000000010001 −1 0 0 0 0 1 1 0 0 0 0 0 0 −1 0 0 0 
0 0 0 −1 0 0 0 0 0 0 1 0 0 −T 100000000000 −1 0 1 −1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 −1 0 −1 0 0 ϕ ψ υ ϕ ψ υ−T 1 1 1 1 1 1 1 1 − 8 8 0 −1 0 −1 −1 0 1 1 0 0 8 − 8 0 −1 0 0 0 0 0 0 1 −1 0 −1 1 0 0 1 8 0 0 8 0 0 0 − 8 0 0 −2 1 0 8 0 − − − − 1 1 − 1 − − 1 1 1 − 1 − 1 − 1 0 0 0 0 1 −1 0 0 1 0 0 1 0 0 0 0 1 0 0 −1 0 0 0 0.25 0 0 2 1 1 1 1 1 1 8 8 8 0 0 0 0 0 0 1 1 0 1 1 0 1 1 1 8 8 8 0 0 0 0 0 0 0 0 0 8 8 8 1 1 1 1 1 1 0 0 − 8 0 −1 0 0 −1 0 1 0 0 0 0 − 8 −1 0 2 1 −1 0 0 0 0 −1 1 0 0 0 1 8 0.25 0 8 0 0 0 8 0 0 0 1 0 8 0 1 1 1 1 1 1 1 1 − − − 0 0 − 8 2 1 0 0 −1 0 1 0 0 0 0 8 −1 0 0 −1 0 −1 0 1 1 1 −1 0 0 0 0 8 − 8 − 8 0 0 0 − 8 8 8 −1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 −1 1 1 −1 1 8 − 8 8 0 0 0 −1 0 1 0 −1 1 1 −1 0 1 1 0 0 8 8 0 8 8 0 8 0 −1 1 2 0 − 8 0 1 1 1 1 1 1 1 0 0 0 −1 1 −1 0 0 0 0 0 0 − 8 8 − 8 1 −1 1 0 0 0 0 2 0 0 −2 0 0 0 −1 − 8 0 0 − 8 0 0 0 8 0 2 0 1 0 − 8 0 0 0 0 0 0 −1 −1 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 1 −1 −1 1 −1 −1 1 −1 −1 0 0 0 −1 1 1 1 0 −1 0 1 −1 1 −1 0 0 0 −1 − 8 0 0 0 8 8 0 8 0 1 −1 0 8 0 8 1 1 1 1 1 1 0 0 0 1 −1 1 1 −1 1 1 −1 1 0 0 0 −1 1 −1 1 0 −1 0 −1 1 −1 1 0 1 1 0 8 0 0 0 8 8 8 0 − 8 0 0 1 0 − 8 0 −1 0 1 0 0 1 0 0 1 −1 0 1 0 0 −1 0 0 −1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 − 8 1 0 −1 0 −1 0 1 0 0 − 8 − 8 0 0 1 1 −2 0 0 0 0 0 0 −2 0 −1 −1 −1 − 8 − 8 − 8 − 8 − 8 − 8 0 0 0 −1 −1 −1 0 0 0 1 1 1 1 1 1 1 1 1 − 8 − 8 0 0 −1 0 1 0 −1 1 0 0 0 0 − 8 0 1 1 0 0 −1 0 8 − 8 − 8 0 0 − 8 0 8 −1 1 0 0 − 8 0 0 0 −1 0 1 −1 −1 −1 0 0 −1 0 0 0 −1 0 0 0 −1 0 −1 1 1 1 1 1 1 1 1 1 0 0 − 8 −1 0 1 0 −1 0 1 0 0 − 8 8 0 0 −1 1 1 −1 −1 0 0 0 0 0 0 − 8 8 8 −1 1 1 8 − 8 − 8 1 1 1 1 1 1 1 1 1 0 0 − 8 −1 0 −1 0 −1 0 1 0 0 8 − 8 0 0 −1 −1 −1 −1 1 8 8 − 8 8 8 − 8 0 0 0 −1 −1 1 0 0 0 − − − − − − 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 8 0 −1 0 −1 0 −1 0 1 1 − 8 8 0 −1 0 0 −1 −1 0 0 8 − 8 8 0 0 0 8 0 0 0 1 − 8 0 − 8 1 1 1 1 1 1 1 1 1 0 0 − 8 0 1 0 1 0 −1 0 −1 1 − 8 8 0 1 0 0 0 0 1 0 − 8 − 8 8 0 0 8 0 8 −1 1 0 0 8 0 − − 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 8 − 8 8 1 −1 1 −1 1 −1 0 0 0 − 8 8 − 8 0 0 0 1 1 0 0 8 8 8 0 0 0 8 0 0 0 1 − 8 0 8 1 1 1 1 1 1 1 1 1 − 8 8 0 0 −1 0 −1 0 1 1 0 0 0 0 − 8 0 −1 1 1 −1 0 0 8 8 − 8 0 0 0 − 8 0 0 0 −1 − 8 0 − 8 1 1 1 1 1 1 1 1 1 0 0 −1 0 0 0 0 −1 0 0 0 8 0 −1 0 1 0 1 0 −1 1 − 8 − 8 0 −1 0 0 0 0 −1 0 − 8 8 − 8 0 0 8 0 8 1 1 0 0 − 8 0 − 1 − 1 − − 1 − 1 − 1 1 − 1 − − 1 1 0 0 −1 0 0 −1 1 0 0 8 8 0 1 0 1 0 1 0 0 1 1 0 0 8 1 0 0 0 0 1 8 0 0 0 8 8 0 8 0 1 1 0 8 0 8 Table 4: h4, 2, 3; 20i-algorithm [35]

Uϕ Vψ Wυ 0 0 −1 0 0 1 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 −1 0 0 0 −1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 −1 0 0 0 0 1 0 0 0 0 0 0 0 −1 1 0 0 0 1 −1 0 000000000100 0 0 0 0 1 0 0 0 0 1 0 0 0 1 00001000100 −1 0 0 0 0 0 −1 0 0 0 0 0 0 0 1 000000001000 0 −1 0 0 0 0 0 −1 0 −1 0 0 0 0 −100000000000 0 0 0 −1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 −1 0 −1 0 1 −1 0 0 0 0 −1 0 0 0 0 0 1 −1 0 0 0 0 −1 0 −1 0 0 0 0 1 −1 0 1 010000000000 0 −1 0 0 0 0 −1 0 −1 0 0 0 −1 0 0 0 0 0 0 0 1 0 0 −1 0 1 −1 0 0 0 0 −1 0 0 0 0 −1 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 −1 1 00000000000 −1 0 0 0 0 0 1 0 −1 0 0 0 1 0 0 000000000010 −1 0 0 0 0 0 0 −1 −1 0 0 −1 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 0 0 1 −1 0 0 0 0 −1 1 0 0 0 0 0 0 0 0 0 1 −1 0 0 0 0 0 0 −1 0 0 −1 0 0 0 −1 0 0 0 001000000000 1 0 0 0 0 0 1 0 0 0 −1 1 0 −1 −100000000100 0 0 1 0 0 0 0 0 0 −1 0 0 0 0 010000001000 0 0 0 0 −1 0 1 0 0 0 1 0 −1 1 000000100000 0 −1 0 0 0 0 0 0 0 −1 0 0 −1 0 00100000000 −1 ϕ ψ υ−T −1 0 0 0 0 0 0 −1 −1 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 −1 0 −1 0 0 0 0 0 −1 −1 0 0 0 0 100000000000 1 0 0 0 1 0 0 0 0 0 0 0 0 −1 001000000000 −1 −1 0 0 0 0 0 0 1 0 0 −1 0 −1 0 0 0 0 0 0 0 0 0 −1 1 0 1 0 0 0 0 0 0 0 1 1 −1 0 0 0 1 −10000000000 1 0 0 0 0 −1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 −1 −1 1 0 0 −1 0 0 0 0 0 0 −1 −1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 −1 0 0 0 0 0 0 0 −1 1 1 1 −1 −1 −100000010000 0 0 0 −1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 −1 −1 0 0 1 1 −1 −1 0 0 0 0 0 0
