Coded Sparse Matrix Multiplication

Sinong Wang 1 Jiashang Liu 1 Ness Shroff 2

1 Department of ECE, The Ohio State University, Columbus, USA. 2 Departments of ECE and CSE, The Ohio State University, Columbus, USA. Correspondence to: Sinong Wang.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

In a large-scale and distributed matrix multiplication problem C = A^T B, where C ∈ R^{r×t}, coded computation plays an important role in effectively dealing with "stragglers" (distributed computations that may get delayed due to a few slow or faulty processors). However, existing coded schemes can destroy the significant sparsity that exists in large-scale machine learning problems, and can result in much higher computation overhead, i.e., O(rt) decoding time. In this paper, we develop a new coded computation strategy, which we call the sparse code; it achieves a near optimal recovery threshold, low computation overhead, and linear decoding time O(nnz(C)). We implement our scheme and demonstrate the advantage of the approach over both the uncoded scheme and the current fastest coded strategies.

1. Introduction

In this paper, we consider a distributed matrix multiplication problem, where we aim to compute C = A^T B from input matrices A ∈ R^{s×r} and B ∈ R^{s×t} for some integers r, s, t. This problem is a key building block in machine learning and signal processing, and is used in a large variety of application areas including classification, regression, clustering and feature selection. Many such applications have large-scale datasets and massive computation tasks, which forces practitioners to adopt distributed computing frameworks such as Hadoop (Dean & Ghemawat, 2008) and Spark (Zaharia et al., 2010) to increase the learning speed.

Classical approaches to distributed matrix multiplication rely on dividing the input matrices equally among all available worker nodes. Each worker computes a partial result, and the master node has to collect the results from all workers to output the matrix C. As a result, a major performance bottleneck is the latency incurred in waiting for a few slow or faulty processors, called "stragglers", to finish their tasks (Dean & Barroso, 2013). To alleviate this problem, current frameworks such as Hadoop deploy various straggler detection techniques and usually replicate the straggling task on another available node.

Recently, forward error correction and other coding techniques have provided a more effective way to deal with stragglers in distributed tasks (Dutta et al., 2016; Lee et al., 2017a; Tandon et al., 2017; Yu et al., 2017; Li et al., 2018; Wang et al., 2018). The idea is to create coding redundancy in the local computations so that the matrix C is recoverable from the results of a subset of finished workers, thereby tolerating some straggling workers. For example, consider a distributed system with 3 worker nodes; the coding scheme first splits the matrix A into two submatrices, i.e., A = [A_1, A_2]. Then each worker computes A_1^T B, A_2^T B and (A_1 + A_2)^T B, respectively. The master node can compute A^T B as soon as any 2 of the 3 workers finish, and can therefore tolerate one straggler.
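To make this 3-worker example concrete, the following sketch (our own illustration, assuming numpy and small dense matrices; none of the variable names come from the paper) rebuilds A^T B from any two of the three coded results.

    import numpy as np

    s, r, t = 6, 4, 5
    A, B = np.random.randn(s, r), np.random.randn(s, t)

    # Split A column-wise into two submatrices: A = [A1, A2].
    A1, A2 = A[:, :r // 2], A[:, r // 2:]

    # The three workers compute A1^T B, A2^T B and (A1 + A2)^T B.
    results = {1: A1.T @ B, 2: A2.T @ B, 3: (A1 + A2).T @ B}

    # Any two finished workers suffice; e.g. if worker 2 straggles:
    C_top = results[1]                 # A1^T B
    C_bot = results[3] - results[1]    # (A1 + A2)^T B - A1^T B = A2^T B
    C = np.vstack([C_top, C_bot])

    assert np.allclose(C, A.T @ B)

Here worker 2 is treated as the straggler; the analogous two-line combination works for any other single straggler.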
In a general setting with N workers, each input matrix A, B is divided into m, n submatrices, respectively. The recovery threshold is defined as the minimum number of workers that the master needs to wait for in order to compute C. The above MDS coded scheme is shown to achieve a recovery threshold Θ(N). An improved scheme proposed in (Lee et al., 2017c), referred to as the product code, can offer a recovery threshold of Θ(mn) with high probability. More recently, the work (Yu et al., 2017) designs a type of polynomial code. It achieves a recovery threshold of mn, which exactly matches the information theoretic lower bound. However, many problems in machine learning exhibit both extremely large-scale data and a sparse structure, i.e., the number of nonzero elements satisfies nnz(A) ≪ rs, nnz(B) ≪ st and nnz(C) ≪ rt. The key question that arises in this scenario is: is coding really an efficient way to mitigate stragglers in the distributed matrix multiplication problem?

1.1. Motivation: Coding Straggler

To answer the aforementioned question, we first briefly introduce the current coded matrix multiplication schemes. In these schemes, each local worker calculates a coded version of a submatrix multiplication.

For example, the kth worker of the polynomial code (Yu et al., 2017) essentially calculates

Ã_k^T B̃_k = ( sum_{i=1}^{m} A_i x_k^i )^T ( sum_{j=1}^{n} B_j x_k^{jm} ),   (1)

where A_i, B_j are the corresponding submatrices of the input matrices A and B, respectively, and x_k is a given integer. One can observe that, if A and B are sparse, then due to the matrix additions the density of the coded matrices Ã_k and B̃_k will increase by up to m and n times, respectively. Therefore, the time to compute the matrix product Ã_k^T B̃_k will be roughly O(mn) times that of the simple uncoded product A_i^T B_j.

Figure 1. Measured local computation and communication time: (a) computation and communication time of the uncoded scheme and the polynomial code; (b) ratio of round trip time between the polynomial code and the uncoded scheme versus matrix density, for mn = 9, 16, 25.

In Figure 1(a), we experiment with a large and sparse matrix multiplication of two random Bernoulli square matrices with dimension roughly equal to 1.5×10^5 and number of nonzero elements equal to 6×10^5. We show measurements of the local computation and communication time required for N = 16 workers to run the polynomial code and the uncoded scheme. Our finding is that the final job completion time of the polynomial code is significantly larger than that of the uncoded scheme. The main reason is that the increased density of the input matrices leads to increased computation time, which further incurs even larger data transmission and higher I/O contention. In Figure 1(b), we generate two 10^5-dimensional random square Bernoulli matrices with different densities p. We plot the ratio of the average local computation time between the polynomial code and the uncoded scheme versus the matrix density p. It can be observed that the coded scheme requires 5 or more times the computation time of the uncoded scheme, and this ratio is particularly large, i.e., O(mn), when the matrix is sparse.
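The density blow-up behind this measurement is easy to reproduce on a single machine; the sketch below (our own illustration, assuming scipy and numpy; the sizes and density are arbitrary placeholders) compares one uncoded block product with a polynomial-code-style product of summed blocks.

    import numpy as np
    from scipy import sparse

    m = n = 4
    s, r, t = 4000, 4000, 4000
    density = 1e-3

    # m column-blocks of A and n column-blocks of B, each sparse.
    A_blocks = [sparse.random(s, r // m, density=density, format='csr') for _ in range(m)]
    B_blocks = [sparse.random(s, t // n, density=density, format='csr') for _ in range(n)]

    # Uncoded task: one block product A_i^T B_j.
    uncoded = A_blocks[0].T @ B_blocks[0]

    # Polynomial-code-style task: sum the blocks first (weights set to 1 here),
    # which makes the coded factors up to m (resp. n) times denser.
    A_coded = sum(A_blocks[1:], A_blocks[0])
    B_coded = sum(B_blocks[1:], B_blocks[0])
    coded = A_coded.T @ B_coded

    print("nnz(A_i) =", A_blocks[0].nnz, " nnz(coded A) =", A_coded.nnz)
    print("nnz per uncoded task:", uncoded.nnz, " per coded task:", coded.nnz)
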
In certain suboptimal coded computation schemes such as the MDS code, the product code (Lee et al., 2017a;c) and short dot (Dutta et al., 2016), the generator matrices are generally dense, which also implies a heavy workload for each local worker. Moreover, the decoding algorithm of the polynomial code and of MDS-type codes is based on fast polynomial interpolation over a finite field. Although this leads to nearly linear decoding time, i.e., O(rt ln^2(mn ln(mn))), it still incurs an extremely high cost when dealing with current large-scale and sparse data. Therefore, inspired by this phenomenon, we are interested in the following key problem:

Can we find a coded matrix multiplication scheme that has a small recovery threshold, low computation overhead, and decoding complexity that depends only on nnz(C)?

1.2. Main Contribution

In this paper, we answer this question positively by designing a novel coded computation strategy, which we call the sparse code. It achieves a near optimal recovery threshold Θ(mn) by exploiting the coding advantage in local computation. Moreover, such a coding scheme can exploit the sparsity of both the input and output matrices, which leads to a low computation load, i.e., O(ln(mn)) times that of the uncoded scheme, and nearly linear decoding time O(nnz(C) ln(mn)).

The basic idea of the sparse code is as follows: each worker chooses a random number of input submatrices based on a given degree distribution P, and then computes a weighted linear combination sum_{ij} w_{ij} A_i^T B_j, where the weights w_{ij} are randomly drawn from a finite set S. When the master node has received enough finished tasks that the coefficient matrix formed by the weights w_{ij} is full rank, it runs a hybrid decoding algorithm between peeling decoding and Gaussian elimination to recover the resultant matrix C.

We prove the optimality of the sparse code by carefully designing the degree distribution P and the algebraic structure of the set S. The recovery threshold of the sparse code is mainly determined by how many tasks are required for the coefficient matrix to be full rank and for the hybrid decoding algorithm to recover all the results. We design a type of Wave Soliton distribution (the definition is given in Section 4), and show that, under such a distribution, when Θ(mn) tasks are finished, the hybrid decoding algorithm successfully decodes all the results with decoding time O(nnz(C) ln(mn)).

Moreover, we reduce the full rank analysis of the coefficient matrix to the determinant analysis of a random matrix in R^{mn×mn}. The state of the art in this field is limited to the Bernoulli case (Tao & Vu, 2007; Bourgain et al., 2010), in which each element is an identically and independently distributed random variable. However, in our proposed sparse code, the matrix is generated from a degree distribution, which leads to dependencies among the elements in the same row. To overcome this difficulty, we take a different technical path: we first utilize the Schwartz-Zeppel Lemma (Schwartz, 1980) to reduce the determinant analysis problem to the analysis of the probability that a random bipartite graph contains a perfect matching. Then we combine combinatorial graph theory with the probabilistic method to show that, when mn tasks are collected, the coefficient matrix is full rank with high probability.

We further utilize the above analysis to formulate an optimization problem that determines the optimal degree distribution P when mn is small. We finally implement and benchmark the sparse code at the Ohio Supercomputer Center (Center, 1987), and empirically demonstrate its performance gain compared with the existing strategies.

2. Preliminary

We are interested in a matrix multiplication problem with two input matrices A ∈ R^{s×r}, B ∈ R^{s×t} for some integers r, s, t. Each input matrix A and B is evenly divided along the column side into m and n submatrices, respectively:

A = [A_1, A_2, ..., A_m] and B = [B_1, B_2, ..., B_n].   (2)

Then computing the matrix C is equivalent to computing mn blocks C_ij = A_i^T B_j. Let the set W = {C_ij = A_i^T B_j | 1 ≤ i ≤ m, 1 ≤ j ≤ n} denote these components.
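For concreteness, the following sketch (our own illustration, assuming numpy; the dimensions are placeholders) forms the mn blocks C_ij = A_i^T B_j of (2) and checks that they tile A^T B.

    import numpy as np

    m, n = 3, 2
    s, r, t = 12, 9, 8
    A, B = np.random.randn(s, r), np.random.randn(s, t)

    A_blocks = np.hsplit(A, m)   # A = [A_1, ..., A_m], each s x (r/m)
    B_blocks = np.hsplit(B, n)   # B = [B_1, ..., B_n], each s x (t/n)

    # W = {C_ij = A_i^T B_j}: mn blocks of size (r/m) x (t/n).
    W = {(i, j): A_blocks[i].T @ B_blocks[j] for i in range(m) for j in range(n)}

    # The blocks tile the full product C = A^T B.
    C = np.block([[W[(i, j)] for j in range(n)] for i in range(m)])
    assert np.allclose(C, A.T @ B)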

Given this notation, the coded distributed matrix multiplication problem can be described as follows: define N coded computation functions, denoted by

f = (f_1, f_2, ..., f_N).

Each local function f_i is used by worker i to compute a submatrix C̃_i = f_i(W) ∈ R^{(r/m)×(t/n)} and return it to the master node. The master node waits only for the results of a subset of workers, {C̃_i | i ∈ I ⊆ {1, ..., N}}, to recover the final output C using certain decoding functions. For any integer k, the recovery threshold k(f) of a coded computation strategy f is defined as the minimum integer k such that the master node can recover matrix C from the results of any k workers. The framework is illustrated in Figure 2.

Figure 2. Framework of coded distributed matrix multiplication: the master node assigns coded computation tasks over the submatrices A_1, ..., A_m and B_1, ..., B_n to workers 1, ..., N, and decodes C from the returned results C̃_i.

2.1. Main Results

The main result of this paper is the design of a new coded computation scheme, which we call the sparse code, with the following performance.

Theorem 1. The sparse code achieves a recovery threshold Θ(mn) with high probability, while allowing nearly linear decoding time O(nnz(C) ln(mn)) at the master node.

Table 1. Comparison of Existing Coding Schemes
scheme | recovery threshold | computation overhead (1) | decoding time
MDS | Θ(N) | Θ(mn) | Õ(rt)
sparse MDS | Θ*(mn) | Θ(ln(mn)) | Õ(mn · nnz(C))
product code | Θ*(mn) | Θ(mn) | Õ(rt)
LDPC code | Θ*(mn) | Θ(ln(mn)) | Õ(rt)
polynomial | mn | mn | Õ(rt)
our scheme | Θ*(mn) | Θ(ln(mn)) | Õ(nnz(C))
(1) Computation overhead is the time of local computation over the uncoded scheme.
(2) Õ(·) omits logarithmic terms and Θ*(·) refers to a high probability result.

As shown in Table 1, compared to the state of the art, the sparse code provides order-wise improvement in terms of the recovery threshold, computation overhead and decoding complexity. Specifically, the decoding time of the MDS code (Lee et al., 2017a), product code (Lee et al., 2017c), LDPC code and polynomial code (Yu et al., 2017) is O(rt ln^2(mn ln(mn))), which depends on the dimension of the output matrix. Instead, the proposed sparse code exhibits a decoding complexity that is nearly linear in the number of nonzero elements of the output matrix C, which can be far smaller than the product of the dimensions. Although the decoding complexity of the sparse MDS code is linear in nnz(C), it also depends on mn. To the best of our knowledge, this is the first coded distributed matrix multiplication scheme with complexity independent of the dimension.

Regarding the recovery threshold, the existing work (Yu et al., 2017) has applied a cut-set type argument to show that the minimum recovery threshold of any scheme is

K* = min_f k(f) = mn.   (3)

The proposed sparse code matches this lower bound with a constant gap and with high probability.

3. Sparse Codes

In this section, we first demonstrate the main idea of the sparse code through a motivating example. We then formally describe the construction of the general sparse code and its decoding algorithm.

3.1. Motivating Example

Consider a distributed matrix multiplication task C = A^T B using N = 6 workers. Let m = 2 and n = 2 and let each input matrix A and B be evenly divided as

A = [A_1, A_2] and B = [B_1, B_2].

Then computing the matrix C is equivalent to computing the following 4 blocks:

C = A^T B = [ A_1^T B_1   A_1^T B_2
              A_2^T B_1   A_2^T B_2 ].


We design a coded computation strategy via the following procedure: each worker i locally computes a weighted sum of the four components of matrix C,

C̃_i = w_1^i A_1^T B_1 + w_2^i A_1^T B_2 + w_3^i A_2^T B_1 + w_4^i A_2^T B_2.

Each weight w_j^i is an independently and identically distributed Bernoulli random variable with parameter p. For example, let p = 1/3; then, on average, 2/3 of these weights are equal to 0. We randomly generate the following N = 6 local computation tasks:

C̃_1 = A_1^T B_1 + A_1^T B_2,   C̃_2 = A_1^T B_2 + A_2^T B_1,
C̃_3 = A_1^T B_1,               C̃_4 = A_1^T B_2 + A_2^T B_2,
C̃_5 = A_2^T B_1 + A_2^T B_2,   C̃_6 = A_1^T B_1 + A_2^T B_1.

Suppose that both the 2nd and 6th workers are stragglers and the master node has collected the results from nodes {1, 3, 4, 5}. According to the designed computation strategy, we have the following group of linear systems:

[ C̃_1 ]   [ 1 1 0 0 ]   [ A_1^T B_1 ]
[ C̃_3 ] = [ 1 0 0 0 ] · [ A_1^T B_2 ]
[ C̃_4 ]   [ 0 1 0 1 ]   [ A_2^T B_1 ]
[ C̃_5 ]   [ 0 0 1 1 ]   [ A_2^T B_2 ]

One can easily check that the above coefficient matrix is full rank. Therefore, one straightforward way to recover C is to solve rt/4 linear systems, which proves decodability. However, this decoding algorithm is expensive: its complexity is O(rt) in this case.

Interestingly, we can instead use a type of peeling algorithm to recover the matrix C with only three sparse matrix additions: first, we can directly recover the block A_1^T B_1 from worker 3. Then we can use the result of worker 1 to recover the block A_1^T B_2 = C̃_1 − A_1^T B_1. Further, we can use the result of worker 4 to recover the block A_2^T B_2 = C̃_4 − A_1^T B_2 and the result of worker 5 to obtain the block A_2^T B_1 = C̃_5 − A_2^T B_2. The above peeling decoding algorithm can be viewed as an edge-removal process in a bipartite graph. We construct a bipartite graph with one partition being the original blocks W and the other partition being the finished coded computation tasks {C̃_i}; two nodes are connected if the computation task contains that block. As shown in Figure 3(a), in each iteration we find a ripple (a degree-one node) in the right partition that can be used to recover one node of the left partition. We then remove the adjacent edges of that left node, which might produce new ripples in the right partition, and iterate this process until all blocks are decoded.

Figure 3. Example of the hybrid peeling and Gaussian decoding process of the sparse code (panels (a) and (b)).

Based on the above graphical illustration, the key to successful decoding is the existence of a ripple during the edge-removal process. Clearly, this is not always guaranteed, both because of the randomness in our coding scheme and because of the uncertainty in the cloud. For example, if both the 3rd and 4th workers are stragglers and the master node has collected the results from nodes {1, 2, 5, 6}, then even though the coefficient matrix is full rank, there exists no ripple in the graph. To avoid this problem, we can randomly pick one block and recover it through a linear combination of the collected results, and then use this block to continue the decoding process. This particular linear combination can be determined by solving a linear system. Suppose that we choose to recover A_1^T B_2; then we can recover it via the following linear combination:

A_1^T B_2 = (1/2) C̃_1 + (1/2) C̃_2 − (1/2) C̃_6.

As illustrated in Figure 3(b), we can recover the rest of the blocks using the same peeling decoding process. The above decoding algorithm only involves simple matrix additions, and the total decoding time is O(nnz(C)).
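The following sketch (our own illustration, assuming numpy) replays this hybrid decoder on the 2 x 2 example: peeling alone suffices for the worker set {1, 3, 4, 5}, while the set {1, 2, 5, 6} triggers one rooting step before peeling resumes.

    import numpy as np

    s, r, t = 8, 6, 4
    A, B = np.random.randn(s, r), np.random.randn(s, t)
    A1, A2 = np.hsplit(A, 2)
    B1, B2 = np.hsplit(B, 2)
    blocks = [A1.T @ B1, A1.T @ B2, A2.T @ B1, A2.T @ B2]   # order (1,1),(1,2),(2,1),(2,2)

    # Coefficient rows of the six coded tasks in the example.
    M_full = np.array([[1, 1, 0, 0],   # C~1 = A1^T B1 + A1^T B2
                       [0, 1, 1, 0],   # C~2 = A1^T B2 + A2^T B1
                       [1, 0, 0, 0],   # C~3 = A1^T B1
                       [0, 1, 0, 1],   # C~4 = A1^T B2 + A2^T B2
                       [0, 0, 1, 1],   # C~5 = A2^T B1 + A2^T B2
                       [1, 0, 1, 0]], dtype=float)  # C~6 = A1^T B1 + A2^T B1

    def decode(workers):
        M = M_full[workers].copy()
        Y = [sum(M_full[w, j] * blocks[j] for j in range(4)) for w in workers]
        recovered = {}
        while len(recovered) < 4:
            ripple = next((k for k in range(len(Y)) if np.count_nonzero(M[k]) == 1), None)
            if ripple is None:                       # rooting step (Gaussian elimination)
                j = next(j for j in range(4) if j not in recovered)
                u = np.linalg.lstsq(M.T, np.eye(4)[j], rcond=None)[0]
                recovered[j] = sum(u[k] * Y[k] for k in range(len(Y)))
            else:                                    # peeling step
                j = int(np.flatnonzero(M[ripple])[0])
                recovered[j] = Y[ripple] / M[ripple, j]
            for k in range(len(Y)):                  # subtract the new block everywhere
                if M[k, j] != 0:
                    Y[k] = Y[k] - M[k, j] * recovered[j]
                    M[k, j] = 0
        return recovered

    for workers in ([0, 2, 3, 4], [0, 1, 4, 5]):     # nodes {1,3,4,5} and {1,2,5,6}
        rec = decode(workers)
        assert all(np.allclose(rec[j], blocks[j]) for j in range(4))
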
3.2. General Sparse Code

Now we present the construction and decoding of the sparse code in a general setting. We first evenly divide the input matrices A and B along the column side into m and n submatrices, as defined in (2). Then we define a set S that contains m^2 n^2 distinct nonzero elements. The simplest example of S is [m^2 n^2] ≜ {1, 2, ..., m^2 n^2}. Under this setting, we define the following class of coded computation strategies.

Definition 1. (Sparse Code) Given the parameter P ∈ R^{mn} and the set S, we define the (P, S)-sparse code as: for each worker k ∈ [N], compute

C̃_k = f_k(W) = sum_{i=1}^{m} sum_{j=1}^{n} w_{ij}^k A_i^T B_j.   (4)

Here the parameter P = [p_1, p_2, ..., p_mn] is the degree distribution, where p_l is the probability that there are l nonzero weights w_{ij}^k in worker k. The value of each nonzero weight w_{ij}^k is picked from the set S independently and uniformly at random.
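A minimal sketch of the encoder in Definition 1 (our own illustration, assuming numpy; the uniform degree distribution P used here is only a placeholder, not the optimized distribution of Section 4) is:

    import numpy as np

    def encode_worker(A_blocks, B_blocks, P, S, rng):
        """One worker's task C~_k = sum_ij w_ij^k A_i^T B_j, as in eq. (4)."""
        m, n = len(A_blocks), len(B_blocks)
        # Draw the number of nonzero weights l ~ P, then choose which of the
        # mn blocks participate and their weights w_ij, uniformly from S.
        l = rng.choice(np.arange(1, m * n + 1), p=P)
        chosen = rng.choice(m * n, size=l, replace=False)
        row = np.zeros(m * n)                      # the worker's row of M
        row[chosen] = rng.choice(S, size=l)
        C_k = sum(row[i * n + j] * (A_blocks[i].T @ B_blocks[j])
                  for i in range(m) for j in range(n) if row[i * n + j] != 0)
        return row, C_k

    rng = np.random.default_rng(0)
    m = n = 2
    A_blocks = [rng.standard_normal((8, 3)) for _ in range(m)]
    B_blocks = [rng.standard_normal((8, 4)) for _ in range(n)]
    S = np.arange(1, m * m * n * n + 1)            # S = [m^2 n^2]
    P = np.full(m * n, 1.0 / (m * n))              # placeholder degree distribution
    row, C_k = encode_worker(A_blocks, B_blocks, P, S, rng)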

Without loss of generality, suppose that the master node collects results from the first K workers with K ≤ N. Given the above coding scheme, we have

[ C̃_1 ]   [ w_11^1  w_12^1  ...  w_mn^1 ]   [ A_1^T B_1 ]
[ C̃_2 ] = [ w_11^2  w_12^2  ...  w_mn^2 ] · [ A_1^T B_2 ]
[  ...  ]   [  ...     ...    ...   ...    ]   [    ...     ]
[ C̃_K ]   [ w_11^K  w_12^K  ...  w_mn^K ]   [ A_m^T B_n ]

We use M ∈ R^{K×mn} to represent the above coefficient matrix. To guarantee decodability, the master node should collect results from enough workers that the coefficient matrix M has full column rank. The master node then goes through a peeling decoding process: it first finds a ripple worker to recover one block. Then, for each collected result, it subtracts this block if the corresponding computation task contains it. If there exists no ripple in the peeling decoding process, we go to a rooting step: randomly pick a particular block A_i^T B_j. The following lemma shows that we can recover this block via a linear combination of the results {C̃_k}_{k=1}^K.

Lemma 1. (rooting step) If rank(M) = mn, then for any k_0 ∈ {1, 2, ..., mn}, we can recover the particular block A_i^T B_j with column index k_0 in matrix M via the following linear combination:

A_i^T B_j = sum_{k=1}^{K} u_k C̃_k.   (5)

The vector u = [u_1, ..., u_K] can be determined by solving M^T u = e_{k_0}, where e_{k_0} ∈ R^{mn} is the unit vector with a single 1 at index k_0.

The basic intuition is to find a linear combination of the rows of matrix M that eliminates all blocks except the particular block A_i^T B_j. The whole procedure is listed in Algorithm 1.

Algorithm 1 Sparse code (master node's protocol)
  repeat
    The master node assigns the coded computation tasks according to Definition 1.
  until the master node collects results with rank(M) = mn and K is larger than a given threshold.
  repeat
    Find a row M_{k_0} in matrix M with ||M_{k_0}||_0 = 1.
    if such a row does not exist then
      Randomly pick k_0 ∈ {1, ..., mn} and recover the corresponding block A_i^T B_j by (5).
    else
      Recover the block A_i^T B_j from C̃_{k_0}.
    end if
    Let k_0 be the column index of the recovered block A_i^T B_j in matrix M.
    for each computation result C̃_k do
      if M_{k k_0} is nonzero then
        C̃_k = C̃_k − M_{k k_0} A_i^T B_j and set M_{k k_0} = 0.
      end if
    end for
  until every block of matrix C is recovered.
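The rooting step in Algorithm 1 reduces to one linear solve for the combination vector u of (5); a sketch (our own illustration, assuming numpy and the notation M, C̃_k from above) is:

    import numpy as np

    def rooting_step(M, C_tilde, k0):
        """Recover the block with column index k0 via (5): find u with M^T u = e_{k0},
        then return sum_k u_k C~_k.  Assumes rank(M) = mn."""
        K, mn = M.shape
        e = np.zeros(mn)
        e[k0] = 1.0
        # Least squares returns an exact solution of M^T u = e_{k0} when rank(M) = mn.
        u, *_ = np.linalg.lstsq(M.T, e, rcond=None)
        return sum(u[k] * C_tilde[k] for k in range(K))

In practice the master could factor M once and reuse that factorization across all rooting steps.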

Here we analyze the complexity of Algorithm 1. During each iteration, the complexity of the operation C̃_k = C̃_k − M_{k k_0} A_i^T B_j is O(nnz(A_i^T B_j)). Suppose that the average number of nonzero elements in each row of the coefficient matrix M is α. Then each block A_i^T B_j is used O(αK/mn) times on average. Further, suppose that c blocks require the rooting step (5) to recover; the complexity of each such step is O(sum_k nnz(C̃_k)). On average, each coded block C̃_k is equal to the sum of O(α) original blocks. Therefore, the complexity of Algorithm 1 is

O( (αK/mn) sum_{i,j} nnz(A_i^T B_j) ) + O( c sum_k nnz(C̃_k) ) = O( (c+1) αK/mn · nnz(C) ).   (6)

We can observe that the decoding time is linear in the density of matrix M, the recovery threshold K and the number of rooting steps (5). In the next section, we show that, under a good choice of degree distribution P and set S, we can achieve the result in Theorem 1.

4. Theoretical Analysis

As discussed in the preceding section, to reduce the decoding complexity, it is good to make the coefficient matrix M as sparse as possible. However, a lower density requires the master node to collect results from a larger number of workers before the matrix M becomes full rank. For example, in the extreme case where we randomly assign one nonzero element to each row of M, the analysis of the classical balls-and-bins process implies that K = O(mn ln(mn)) results are needed before M is full rank, which is far from the optimal recovery threshold. On the other hand, the polynomial code (Yu et al., 2017) achieves the optimal recovery threshold. Nonetheless, it exhibits the densest matrix M, i.e., K × mn nonzero elements, which significantly increases the local computation, communication and final decoding time.

In this section, we design the sparse code between these two extremes. This code has a near optimal recovery threshold K = Θ(mn) and a constant number of rooting steps (5) with high probability, together with an extremely sparse matrix M with α = Θ(ln(mn)) nonzero elements in each row. The main idea is to choose the following degree distribution.

Definition 2. (Wave Soliton distribution) The Wave Soliton distribution P_w = [p_1, p_2, ..., p_mn] is defined as

p_k = { τ/mn,         k = 1;
        τ/70,         k = 2;
        τ/(k(k−1)),   3 ≤ k ≤ mn.   (7)

The parameter τ = 35/18 is the normalizing factor.
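The Wave Soliton distribution is straightforward to construct; the sketch below (our own illustration, assuming numpy) builds P_w from (7) and verifies that τ = 35/18 normalizes it.

    import numpy as np

    def wave_soliton(mn, tau=35.0 / 18.0):
        """Wave Soliton distribution P_w = [p_1, ..., p_mn] of eq. (7)."""
        p = np.empty(mn)
        p[0] = tau / mn                            # p_1 = tau / mn
        p[1] = tau / 70.0                          # p_2 = tau / 70
        k = np.arange(3, mn + 1, dtype=float)
        p[2:] = tau / (k * (k - 1.0))              # p_k = tau / (k (k - 1))
        return p

    P_w = wave_soliton(36)
    print(P_w.sum())                               # ~= 1, since sum_{k>=3} 1/(k(k-1)) = 1/2 - 1/mn
    rng = np.random.default_rng(0)
    degrees = rng.choice(np.arange(1, 37), size=10, p=P_w / P_w.sum())   # sample worker degrees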

The above degree distribution is modified from the Soliton distribution (Luby, 2002). In particular, we cap the original Soliton distribution at the maximum degree mn, and move a constant amount of weight from degree 2 to the larger degrees.

It can be observed that the recovery threshold K of the proposed sparse code depends on two factors: (i) the coefficient matrix M being full rank; (ii) the successful decoding of the peeling algorithm with a constant number of rooting steps.

4.1. Full Rank Probability

Our first main result is to show that when K = mn, the coefficient matrix M is full rank with high probability. Suppose that K = mn; we can regard the formation of the matrix M via the following random graph model.

Definition 3. (Random balanced bipartite graph) Let G(V_1, V_2, P) be a random balanced bipartite graph, in which |V_1| = |V_2| = mn. Each node v ∈ V_2 independently and randomly connects to l nodes in partition V_1 with probability p_l.
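This claim can be probed empirically straight from Definition 3; the sketch below (our own illustration, assuming numpy, with the degree distribution (7) and S = [m^2 n^2]) draws K = mn rows and reports how often M is full rank.

    import numpy as np

    def full_rank_frequency(m, n, trials=200, seed=0):
        mn = m * n
        rng = np.random.default_rng(seed)
        # Wave Soliton distribution of eq. (7), normalized against float error.
        tau = 35.0 / 18.0
        p = np.empty(mn); p[0] = tau / mn; p[1] = tau / 70.0
        k = np.arange(3, mn + 1, dtype=float); p[2:] = tau / (k * (k - 1.0))
        p /= p.sum()
        S = np.arange(1, mn * mn + 1)              # S = [m^2 n^2]
        hits = 0
        for _ in range(trials):
            M = np.zeros((mn, mn))                 # K = mn collected rows
            for row in M:
                l = rng.choice(np.arange(1, mn + 1), p=p)
                cols = rng.choice(mn, size=l, replace=False)
                row[cols] = rng.choice(S, size=l)
            hits += np.linalg.matrix_rank(M) == mn
        return hits / trials

    print(full_rank_frequency(4, 4))               # Theorem 2 suggests this approaches 1 as mn grows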

Define the Edmonds matrix M(x) ∈ R^{mn×mn} of the graph G(V_1, V_2, P) with [M(x)]_{ij} = x_{ij} if vertices v_i ∈ V_1 and v_j ∈ V_2 are connected, and [M(x)]_{ij} = 0 otherwise. The coefficient matrix M can be obtained by assigning each x_{ij} a value from S independently and uniformly at random. Then the probability that matrix M is full rank is equal to the probability that the determinant of the Edmonds matrix M(x) is nonzero at the assigned values x_{ij}. The following technical lemma (Schwartz, 1980) provides a simple lower bound on the probability of this event.

Lemma 2. (Schwartz-Zeppel Lemma) Let f(x_1, ..., x_N) be a nonzero polynomial with degree d. Let S be a finite set in R with |S| = d^2. If we assign each variable a value from S independently and uniformly at random, then

P(f(x_1, x_2, ..., x_N) ≠ 0) ≥ 1 − d^{−1}.   (8)

A classic result in graph theory is that a balanced bipartite graph contains a perfect matching if and only if the determinant of the Edmonds matrix, i.e., |M(x)|, is a nonzero polynomial. Combining this result with the Schwartz-Zeppel Lemma, we can finally reduce the analysis of the full rank probability of the coefficient matrix M to the probability that the random graph G(V_1, V_2, P_w) contains a perfect matching:

P(|M| ≠ 0) = P(|M| ≠ 0 | |M(x)| ≢ 0) · P(|M(x)| ≢ 0),

where the first factor is at least 1 − 1/mn by the Schwartz-Zeppel Lemma and the second factor is the probability of a perfect matching. Formally, we have the following result about the existence of a perfect matching in the above defined random bipartite graph.

Theorem 2. (Existence of perfect matching) If the graph G(V_1, V_2, P) is generated under the Wave Soliton distribution (7), then there exists a constant c > 0 such that

P(G contains a perfect matching) > 1 − c(mn)^{−0.94}.

Proof. Here we sketch the proof; details can be seen in the supplementary material. The basic technique is to utilize Hall's theorem to show that this probability is lower bounded by the probability that G does not contain a set S ⊂ V_1 or S ⊂ V_2 such that |S| > |N(S)|, where N(S) is the neighborhood of S. We show that, if S ⊂ V_1, the probability that such a set exists is upper bounded by

sum_{s=Θ(1)} 1/(mn)^{0.94s} + sum_{s=Ω(1)}^{s=o(mn)} (s/mn)^{0.94s} + sum_{s=Θ(mn)} c_1^{mn}.

If S ⊂ V_2, the probability that such a set exists is upper bounded by

sum_{s=Θ(1)} 1/mn + sum_{s=o(mn)} s c_2^s / mn + sum_{s=Θ(mn)} c_3^{mn},

where the constants c_1, c_2, c_3 are strictly less than 1. Combining these results together gives us Theorem 2.

The existence of perfect matchings in random bipartite graphs has been studied since (Erdos & Renyi, 1964). However, the existing analysis is limited to independent generation models. For example, the Erdos-Renyi model assumes that each edge exists independently with probability p. The κ-out model (Walkup, 1980) assumes that each vertex v ∈ V_1 independently and randomly chooses κ neighbors in V_2. The minimum degree model (Frieze & Pittel, 2004) assumes that each vertex has a minimum degree and that edges are uniformly distributed among all allowable classes. There exists no work analyzing such a probability for a random bipartite graph generated by a given degree distribution. In this case, one technical difficulty is that the nodes in the partition V_1 are dependent. All analysis has to be carried out from the nodes of the right partition, which exhibits an intrinsically complicated statistical model.

4.2. Optimality of Recovery Threshold

We now focus on quantitatively analyzing the impact of the recovery threshold K on the peeling decoding process and the number of rooting steps (5). Intuitively, a larger K implies a larger number of ripples, which leads to a higher probability of successful peeling decoding and therefore fewer rooting steps. The key question is: how large must K be such that all mn blocks are recovered with only a constant number of rooting steps? To answer this question, we first define the degree generating function of P_w as

Ω_w(x) = (τ/mn) x + (τ/70) x^2 + τ sum_{k=3}^{mn} x^k / (k(k−1)).   (9)

The following technical lemma is useful in our analysis.

Lemma 3. If the degree distribution Ω_w(x) and recovery threshold K satisfy

[1 − Ω'_w(1 − x)/mn]^{K−1} ≤ x, for x ∈ [b/mn, 1],   (10)

then the peeling decoding process in Algorithm 1 can recover mn − b blocks with probability at least 1 − e^{−cmn}, where b, c are constants.
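Condition (10) can be checked numerically for a candidate K; the sketch below (our own illustration, assuming numpy, with b and the constant in K chosen arbitrarily — the lemma and Theorem 3 only guarantee that some K = Θ(mn) works) evaluates the left-hand side on a grid.

    import numpy as np

    def omega_w_prime(x, mn, tau=35.0 / 18.0):
        """Derivative of the generating function Omega_w in (9)."""
        k = np.arange(3, mn + 1)
        # Omega_w'(x) = tau/mn + (2 tau/70) x + tau * sum_{k>=3} x^{k-1}/(k-1)
        return tau / mn + (2.0 * tau / 70.0) * x + tau * np.sum(x ** (k - 1) / (k - 1))

    def check_condition_10(mn, K, b=2):
        xs = np.linspace(b / mn, 1.0, 500)
        lhs = np.array([(1.0 - omega_w_prime(1.0 - x, mn) / mn) ** (K - 1) for x in xs])
        return bool(np.all(lhs <= xs))

    mn = 100
    print(check_condition_10(mn, K=2 * mn))        # does this particular (mn, K) pair satisfy (10)?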

Lemma 3 is obtained by applying a martingale argument to the peeling decoding process (Luby et al., 2001). This result provides a quantitative recovery condition on the degree generating function. It remains to show that the proposed Wave Soliton distribution (9) satisfies the above inequality for a specific choice of K.

Theorem 3. (Recovery threshold) Given the sparse code with parameters (P_w, [m^2 n^2]), if K = Θ(mn), then there exists a constant c such that, with probability at least 1 − e^{−cmn}, Algorithm 1 is sufficient to recover all mn blocks, with Θ(1) blocks recovered by the rooting step (5).

Combining the results of Theorem 2 and Theorem 3, we conclude that the recovery threshold of the sparse code (P_w, [m^2 n^2]) is Θ(mn) with high probability. Moreover, since the average degree of the Wave Soliton distribution is O(ln(mn)), combining these results with (6), the complexity of Algorithm 1 is therefore O(nnz(C) ln(mn)).

Remark 1. Although the recovery threshold of the proposed scheme exhibits a constant gap to the information theoretic lower bound, the practical performance is very close to that bound. The gap mainly comes from the pessimistic estimation in Theorem 3. As illustrated in Figure 4, we generate the sparse code under the Robust Soliton distribution and plot the average recovery threshold versus the number of blocks mn. It can be observed that the overhead of the proposed sparse code is less than 15%.

Figure 4. Recovery threshold versus the number of blocks mn, comparing the sparse code, the LT code, the lower bound, and the average degree.

Remark 2. Existing codes such as the Tornado code (Luby, 1998) and the LT code (Luby, 2002) also utilize the peeling decoding algorithm and can provide a recovery threshold of Θ(mn). However, they exhibit a large constant, especially when mn is less than 10^3. Figure 4 compares the practical recovery threshold among these codes. We can see that our proposed sparse code results in a much lower recovery threshold. Moreover, the intrinsic cascading structure of these codes would also destroy the sparsity of the input matrices.

4.3. Optimal Design of Sparse Code

The proposed Wave Soliton distribution (7) is asymptotically optimal; however, it is far from optimal in practice when m, n are small. The analysis of the full rank probability and of the decoding process relies on an asymptotic argument to enable upper bounds on the error probability, and such bounds are far from tight when m, n are small. In this subsection, we focus on determining the optimal degree distribution based on our analysis in Sections 4.1 and 4.2. Formally, we can formulate the following optimization problem:

min sum_{k=1}^{mn} k p_k   (11)
s.t. P(M is full rank) > p_c,
     [1 − Ω'_w(x)/mn]^{mn+c} ≤ 1 − x − c_0 sqrt((1−x)/mn), x ∈ [0, 1 − b/mn],
     [p_k] ∈ Δ_mn.

The objective is to minimize the average degree, namely, to minimize the computation and communication overhead at each worker. The first constraint requires that the probability of full rank be at least p_c. Since it is difficult to obtain the exact form of this probability, we can use the analysis in Section 4.1 to replace this condition by requiring that the probability that the balanced bipartite graph G(V_1, V_2, P) contains a perfect matching is larger than a given threshold, which can be calculated exactly. The second inequality is the decodability condition that, when K = mn + c + 1 results are received, mn − b blocks are recovered through the peeling decoding process and b blocks are recovered by the rooting step (5). This condition is modified from (10) by adding an additional term, which is useful in increasing the expected ripple size (Shokrollahi, 2006). By discretizing the interval [0, 1 − b/mn] and requiring the above inequality to hold at the discretization points, we obtain a set of linear inequality constraints. Details regarding the exact form of the above optimization model and its solutions are provided in the supplementary material.

5. Experimental Results

In this section, we present experimental results obtained at the Ohio Supercomputer Center (Center, 1987). We compare our proposed coding scheme against the following schemes: (i) uncoded scheme: the input matrices are divided uniformly across all workers and the master waits for all workers to send their results; (ii) sparse MDS code (Lee et al., 2017b): the generator matrix is a sparse random Bernoulli matrix with average computation overhead Θ(ln(mn)), recovery threshold Θ(mn) and decoding complexity Õ(mn · nnz(C)); (iii) product code (Lee et al., 2017c): a two-layer MDS code that achieves a probabilistic recovery threshold of Θ(mn) and decoding complexity Õ(rt); we use the above sparse MDS code to ensemble the product code to reduce the computation overhead; (iv) polynomial code (Yu et al., 2017): coded matrix multiplication scheme with optimum recovery threshold; (v) LT code (Luby, 2002): rateless code widely used in broadcast communication, with low decoding complexity due to the peeling decoding algorithm. To simulate straggler effects in a large-scale system, we randomly pick s workers that run a background thread which increases the computation time.

We implement all methods in Python using MPI4py. To simplify the simulation, we fix the number of workers N and randomly generate a coefficient matrix M ∈ R^{N×mn} under the given degree distribution offline such that it can resist one straggler. Then, each worker loads a certain number of partitions of the input matrices according to the coefficient matrix M. In the computation stage, each worker computes the product of its assigned submatrices and returns the result using Isend(). The master node actively listens for responses from each worker via Irecv(), and uses Waitany() to keep polling for the earliest finished tasks. Upon receiving enough results, the master stops listening and starts decoding.
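A stripped-down version of this master/worker pattern (our own sketch, assuming mpi4py and numpy; the tags, sizes and placeholder worker data are ours, not the authors' released code) looks as follows.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, N = comm.Get_rank(), comm.Get_size() - 1   # rank 0 is the master
    mn, blk = 9, (500, 500)                          # placeholder sizes

    if rank == 0:
        # Master: post non-blocking receives, poll for the earliest finished tasks.
        bufs = [np.empty(blk) for _ in range(N)]
        reqs = [comm.Irecv(bufs[i], source=i + 1, tag=11) for i in range(N)]
        finished = []
        while len(finished) < mn:                    # wait for the recovery threshold
            idx = MPI.Request.Waitany(reqs)
            finished.append(idx)
        # ... run the hybrid peeling / rooting decoder on the collected results ...
    else:
        # Worker: multiply the assigned (coded) submatrices and send the result back.
        A_k = np.random.randn(blk[0], 700)           # placeholder for the loaded partitions
        B_k = np.random.randn(700, blk[1])
        C_k = A_k @ B_k
        comm.Isend(C_k, dest=0, tag=11).Wait()

Run with, e.g., mpiexec -n 17 python sketch.py, so that N = 16 worker ranks serve one master rank.
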
We first generate two random Bernoulli sparse matrices with r = s = t = 150000 and 600000 nonzero elements. Figure 5 reports the job completion time under m = n = 3, m = n = 4 and number of stragglers s = 2, 3, based on 20 experimental runs. It can be observed that our proposed sparse code requires the minimum time, and outperforms the LT code (taking 20-30% of its time), the sparse MDS code and product code (30-50% of their time) and the polynomial code (15-20% of its time). The uncoded scheme is faster than the polynomial code. The main reason is that, due to the increased number of nonzero elements of the coded matrices, the per-worker computation time for these codes is increased. Moreover, the data transmission time is also greatly increased, which leads to additional I/O contention at the master node.

Figure 5. Job completion time for two 1.5E5 × 1.5E5 matrices with 6E5 nonzero elements, for mn = 9 and mn = 16 and s = 2, 3 stragglers.

We further compare our proposed sparse code with the existing schemes in terms of the time required to communicate inputs to each worker, compute the matrix multiplication in parallel, fetch the required outputs, and decode. Results can be seen in the supplementary material.

We finally compare our scheme with these schemes for other types of matrices and for larger matrices. The data statistics can be seen in the supplementary material. The first three data sets, square, tall and fat, are randomly generated square, tall and fat matrices. We also consider 8 sparse matrices from real data sets (Davis & Hu, 2011). We evenly divide each input matrix into m = n = 4 submatrices, and the number of stragglers is equal to 2. We match the column dimension of A and the row dimension of B using the smaller one. The timing results in Table 2 are averaged over 20 experimental runs. Among all experiments, we observe that our proposed sparse code speeds up the uncoded scheme by 1-3× and outperforms the existing codes, with the effects being more pronounced for the real data sets. The job completion time of the LT code, sparse MDS code and product code is smaller than that of the uncoded scheme on the square, tall and fat matrices, and larger than that of the uncoded scheme on the real data sets.

Table 2. Timing Results for Different Sparse Matrix Multiplications (in sec)
Data | uncoded | LT code | sparse MDS code | product code | polynomial code | sparse code
square | 6.81 | 3.91 | 6.42 | 6.11 | 18.44 | 2.17
tall | 7.03 | 2.69 | 6.25 | 5.50 | 18.51 | 2.04
fat | 6.49 | 1.36 | 3.89 | 3.08 | 9.02 | 1.22
amazon-08 / web-google | 15.35 | 17.59 | 46.26 | 38.61 | 161.6 | 11.01
cont1 / cont11 | 7.05 | 5.54 | 9.32 | 14.66 | 61.47 | 3.23
cit-patents / patents | 22.10 | 29.15 | 69.86 | 56.59 | 1592 | 21.07
hugetrace-00 / -01 | 18.06 | 21.76 | 51.15 | 37.36 | 951.3 | 14.16

6. Conclusion

In this paper, we proposed a new coded matrix multiplication scheme, which achieves a near optimal recovery threshold, low computation overhead, and decoding time linear in the number of nonzero elements. Both theoretical and simulation results exhibit order-wise improvement of the proposed sparse code compared with the existing schemes. In the future, we will extend this idea to higher-dimensional linear operations such as tensor operations.

References

Bourgain, Jean, Vu, Van H, and Wood, Philip Matchett. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2):559–603, 2010.

Center, Ohio Supercomputer. Ohio supercomputer center. http://osc.edu/ark:/19495/f5s1ph73, 1987.

Davis, Timothy A and Hu, Yifan. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.

Dean, Jeffrey and Barroso, Luiz André. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.

Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

Dutta, Sanghamitra, Cadambe, Viveck, and Grover, Pulkit. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems, pp. 2100–2108, 2016.

Erdos, Paul and Renyi, Alfred. On random matrices. Magyar Tud. Akad. Mat. Kutató Int. Közl, 8(455-461):1964, 1964.

Frieze, Alan and Pittel, Boris. Perfect matchings in random graphs with prescribed minimal degree. In Mathematics and Computer Science III, pp. 95–132. Springer, 2004.

Lee, Kangwook, Lam, Maximilian, Pedarsani, Ramtin, Papailiopoulos, Dimitris, and Ramchandran, Kannan. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 2017a.

Lee, Kangwook, Pedarsani, Ramtin, Papailiopoulos, Dimitris, and Ramchandran, Kannan. Coded computation for multicore setups. In Information Theory (ISIT), 2017 IEEE International Symposium on, pp. 2413–2417. IEEE, 2017b.

Lee, Kangwook, Suh, Changho, and Ramchandran, Kannan. High-dimensional coded matrix multiplication. In Information Theory (ISIT), 2017 IEEE International Symposium on, pp. 2418–2422. IEEE, 2017c.

Li, Songze, Maddah-Ali, Mohammad Ali, Yu, Qian, and Avestimehr, A Salman. A fundamental tradeoff between computation and communication in distributed computing. IEEE Transactions on Information Theory, 64(1):109–128, 2018.

Luby, Michael. Tornado codes: Practical erasure codes based on random irregular graphs. In International Workshop on Randomization and Approximation Techniques in Computer Science, pp. 171–171. Springer, 1998.

Luby, Michael. LT codes. In Foundations of Computer Science, 2002. Proceedings. The 43rd Annual IEEE Symposium on, pp. 271–280. IEEE, 2002.

Luby, Michael G, Mitzenmacher, Michael, Shokrollahi, Mohammad Amin, and Spielman, Daniel A. Efficient erasure correcting codes. IEEE Transactions on Information Theory, 47(2):569–584, 2001.

Schwartz, Jacob T. Fast probabilistic algorithms for verification of polynomial identities. Journal of the ACM (JACM), 27(4):701–717, 1980.

Shokrollahi, Amin. Raptor codes. IEEE Transactions on Information Theory, 52(6):2551–2567, 2006.

Tandon, Rashish, Lei, Qi, Dimakis, Alexandros G, and Karampatziakis, Nikos. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning, pp. 3368–3376, 2017.

Tao, Terence and Vu, Van. On the singularity probability of random Bernoulli matrices. Journal of the American Mathematical Society, 20(3):603–628, 2007.

Walkup, David W. Matchings in random regular bipartite digraphs. Discrete Mathematics, 31(1):59–64, 1980.

Wang, Sinong, Liu, Jiashang, Shroff, Ness, and Yang, Pengyu. Fundamental limits of coded linear transform. arXiv preprint arXiv:1804.09791, 2018.

Yu, Qian, Maddah-Ali, Mohammad, and Avestimehr, Salman. Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems, pp. 4406–4416, 2017.

Zaharia, Matei, Chowdhury, Mosharaf, Franklin, Michael J, Shenker, Scott, and Stoica, Ion. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.