Computation Efficient Coded Linear Transform

Sinong Wang1, Jiashang Liu1, Ness Shroff1,2, Pengyu Yang3
1 Department of Electrical and Computer Engineering, The Ohio State University
2 Department of Computer Science and Engineering, The Ohio State University
3 Department of Mathematics, The Ohio State University

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

Abstract

In large-scale distributed linear transform problems, coded computation plays an important role in reducing the delay caused by slow machines. However, existing coded schemes can destroy the significant sparsity that exists in large-scale machine learning problems, and in turn increase the computational delay. In this paper, we propose a coded computation strategy, referred to as the diagonal code, that achieves both the optimum recovery threshold and the optimum computation load. Furthermore, by leveraging ideas from random proposal graphs, we design a random code that achieves a constant computation load, which significantly outperforms the best known existing result. We apply our schemes to the distributed gradient descent problem and demonstrate the advantage of the approach over the current fastest coded schemes.

1 Introduction

In this paper, we consider a distributed linear transform problem, where we aim to compute y = Ax from an input matrix A ∈ R^{r×t} and a vector x ∈ R^t. This problem is the key building block in machine learning and signal processing, and has been used in a large variety of application areas. Optimization-based training algorithms, such as gradient descent in regression and classification problems and backpropagation algorithms in deep neural networks, require the computation of large linear transforms of high-dimensional data. It is also the critical step in dimensionality reduction techniques such as principal component analysis and linear discriminant analysis. Many such applications have large-scale datasets and massive computational tasks that force practitioners to adopt distributed computing frameworks such as Hadoop [Dean and Ghemawat, 2008] and Spark [Zaharia et al., 2010] to increase the learning speed.

Classical approaches to distributed linear transforms rely on dividing the input matrix A equally among all available worker nodes, and the master node has to collect the results from all workers to output y. As a result, a major performance bottleneck is the latency incurred in waiting for a few slow or faulty processors, called "stragglers", to finish their tasks [Dean et al., 2012, Dean and Barroso, 2013]. Recently, forward error correction and other coding techniques have been shown to be effective in dealing with stragglers in distributed computation tasks [Dutta et al., 2016, Lee et al., 2017a,b, Tandon et al., 2017, Yu et al., 2017, Karakus et al., 2017, Yang et al., 2017, Wang et al.]. By exploiting coding redundancy, the vector y is recoverable even if not all workers have finished their computations, thus reducing the delay caused by straggler nodes. For example, consider a distributed system with 3 worker nodes; the coding scheme first splits the matrix A into two submatrices, i.e., A = [A1; A2]. Then each worker computes A1x, A2x and (A1 + A2)x. The master node can compute Ax as soon as any 2 out of the 3 workers finish, and can therefore overcome one straggler (slow worker).

In a general setting with m workers, the input matrix A is evenly divided into n (n ≤ m) submatrices A_i ∈ R^{r/n×t} along the row side. Each worker computes a partially coded linear transform and returns the result to the master node. Given a computation strategy, the recovery threshold is defined as the minimum number of workers that the master needs to wait for in order to compute Ax. The above MDS-coded scheme is shown to achieve a recovery threshold of Θ(m) [Lee et al., 2017a]. An improved scheme proposed in [Dutta et al., 2016], referred to as the short-dot code, offers a larger recovery threshold Θ(m(1 + ε)) but with a constant reduction of the computation load, obtained by imposing some sparsity on the encoded submatrices. More recently, the work in [Yu et al., 2017] designs a type of polynomial code, which achieves the information-theoretically optimal recovery threshold n. However, many problems in machine learning exhibit both extremely large-scale and sparse data, i.e., nnz(A) ≪ rt. The key question that arises in this scenario is: is coding really an efficient way to mitigate stragglers in the distributed sparse linear transform problem?

1.1 Motivation: Coded Computation Overhead

To answer the aforementioned question, we use the PageRank problem [Page et al., 1999] and the existing polynomial code [Yu et al., 2017] as an example. This problem aims to measure the importance score of the nodes of a graph, and is typically solved by the following power iteration,

x_t = c r + (1 − c) A x_{t−1}, (1)

where c = 0.15 is a constant and A is the graph adjacency matrix. In practical applications such as web search, the graph adjacency matrix A is extremely large and sparse. The naive way to solve (1) is to partition A into n equal blocks {A_i}_{i=1}^n and store them in the memory of several workers. In each iteration, the master node broadcasts x_t to all workers, each worker computes a partial linear transform A_i x_t and sends it back to the master node. Then the master node collects all the partial results and updates the vector.

The polynomial code [Yu et al., 2017] works as follows: in each iteration, the kth worker essentially computes

ỹ_k = Ã_k x ≜ (Σ_{i=1}^n A_i x_k^i) x, (2)

where x_k is a given integer. One can observe that, due to the sparsity of the matrix A and the matrix additions, the density of the coded submatrix Ã_k increases by up to a factor of n, and the time of the partial linear transform Ã_k x is roughly O(n) times that of the simple uncoded one.
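For concreteness, the following small sketch (our illustration, not part of the paper's implementation; it assumes NumPy and SciPy are available) encodes a sparse matrix with Vandermonde-style coefficients as in (2) and compares the density of one coded submatrix with that of one uncoded partition; for a sparse A the coded block comes out roughly n times denser, which is exactly the overhead discussed above.

```python
import numpy as np
import scipy.sparse as sp

n, r, t, density = 10, 10000, 10000, 1e-3        # n partitions of a sparse r x t matrix
A = sp.random(r, t, density=density, format="csr")
blocks = [A[i * (r // n):(i + 1) * (r // n)] for i in range(n)]

# Uncoded worker: stores one partition A_i.
uncoded_nnz = blocks[0].nnz

# Polynomial-code worker k: stores A~_k = sum_i A_i * x_k^i (cf. Eq. (2)), x_k an integer.
x_k = 2
coded = sum((x_k ** i) * blocks[i] for i in range(n))
print("uncoded nnz:", uncoded_nnz, " coded nnz:", coded.nnz)
# For a sparse A the coded block is roughly n times denser, so the local product
# A~_k x costs roughly n times more than the uncoded product A_i x.
```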

In Figure 1(a), we show measurements of the local computation and communication time required for m = 20 workers to run the polynomial code and the naive uncoded distributed scheme for the above problem, with dimension roughly equal to 10^6 and number of nonzero elements roughly equal to 10^7. Our observation is that the final job completion time of the polynomial code is significantly larger than that of the uncoded scheme. The main reason is that the increased density of the input matrix leads to increased computation time. In Figure 1(b), we generate a random square Bernoulli matrix of dimension 10^5 with different densities p, and plot the ratio of the average local computation time between the polynomial code and the uncoded scheme versus the matrix density p. It can be observed that this ratio is generally large, on the order of O(n), when the matrix is sparse. Inspired by this phenomenon, we propose a new metric, named the computation load, defined as the total number of submatrices the local workers access (the formal definition is given in Section 2). For example, the polynomial code achieves the optimal recovery threshold but a large computation load of mn. Certain suboptimal codes such as the short-dot code achieve a slightly lower computation load, but it is still on the order of O(mn). Specifically, we are interested in the following key problem: can we find a coded linear transform scheme that achieves the optimal recovery threshold and a low computation load?

Figure 1: Measured local computation time per worker.

1.2 Main Contribution

In this paper, we provide a fundamental limit on the minimum computation load: given n partitions and s stragglers, the minimum computation load of any coded computation scheme is n(s + 1). We then design a novel scheme, which we call the s-diagonal code, that achieves both the optimum recovery threshold and the minimum computation load. We also exploit the diagonal structure to design a hybrid decoding algorithm between peeling decoding and Gaussian elimination that achieves a decoding time nearly linear in the output dimension, i.e., O(r).

We further show that, under random code designs, the computation load can be reduced even further, to a constant per worker, with high probability. Specifically, we define a performance metric, the probabilistic recovery threshold, that allows the coded computation scheme to provide decodability with high probability. Based on this metric, there exist several schemes, i.e., the sparse MDS code [Lee et al., 2017b] and the sparse code [Wang et al.], that achieve the optimal probabilistic recovery threshold with a small computation load of Θ(n log(n)). In this work, we construct a new (d1, d2)-cross code with the optimal probabilistic recovery threshold but a constant per-worker computation load (total load Θ(n)). A comparison of existing results and ours is listed in Table 1.

Table 1: Comparison of Existing Schemes in Coded Computation.

Scheme             | MDS code | Short-dot code | Polynomial code | s-MDS code | Sparse code | s-diagonal code | (d1, d2)-cross code
Recovery threshold | Θ(m)     | Θ(m(1 + ε))    | n               | n*         | Θ(n)*       | n               | n*
Computation load/m | O(n)     | O(n(1 − ε))    | n               | Θ(log(n))* | Θ(log(n))*  | O(s)            | Θ(1)*

* The result holds with high probability, i.e., 1 − e^{−cn}.

The theoretical analysis that shows the optimality of the recovery threshold is based on a determinant analysis of a random coding matrix in R^{m×n}. The state of the art in this field is limited to the Bernoulli case [Tao and Vu, 2007, Bourgain et al., 2010], in which each element is an identically and independently distributed random variable. However, in our proposed (d1, d2)-cross code, the underlying coding matrix is generated based on a hypergeometric degree distribution, which leads to dependencies among the elements in the same row. To overcome this difficulty, we propose a new technical path: we first utilize the Schwartz-Zippel Lemma [Schwartz, 1980] to reduce the determinant analysis problem to the analysis of the probability that a random bipartite graph contains a perfect matching. Then we combine random proposal graph theory and the probabilistic method to show that, when n tasks are collected, the coefficient matrix is full rank with high probability.

Further, we apply our proposed random codes to the gradient coding problem. This problem has wide applicability in mitigating stragglers in distributed machine learning, and was first investigated in [Tandon et al., 2017], which designs a cyclic code that achieves the optimum recovery threshold n and a computation load of s + 1 per worker. The work [Maity et al., 2018] proposes an LDPC code to further reduce the average computation load to Θ(log(n)). Another line of work [Karakus et al., 2017, Charles et al., 2017, Wang et al., 2019] tries to reduce the computation load via approximate gradient computation. In this paper, we show that our constructed (d1, d2)-cross code not only exactly recovers the gradient (a sum of functions) but also provides a constant computation load Θ(1) per worker.

Finally, we implement the constructed codes and demonstrate their improvement compared with existing strategies.

Figure 2: Framework of coded distributed linear transform.

2 Problem Formulation

We are interested in distributedly computing a linear transform with matrix A ∈ R^{r×t} and input vector x ∈ R^t for some integers r, t. The matrix A is evenly divided along the row side into n submatrices,

A = [A_1^T, A_2^T, A_3^T, ..., A_n^T]^T. (3)

Suppose that we have a master node and m worker nodes. Worker i first stores a 1/n fraction of the matrix A, defined as Ã_i ∈ R^{r/n×t}. Then it can compute a partial linear transform ỹ_i = Ã_i x and return it to the master node. The master node waits only for the results of a subset of workers {Ã_i x | i ∈ I ⊆ [m]} to recover the final output y using certain decoding algorithms. The main framework is illustrated in Figure 2. Given the above system model, we can formulate the coded distributed linear transform problem based on the following definitions.

Definition 1. (Coded computation strategy) A coded computation strategy is an m × n coding matrix M = [m_{ij}]_{i∈[m], j∈[n]} that is used to compute each Ã_i, i.e.,

Ã_i = Σ_{j=1}^n m_{ij} A_j, ∀ i ∈ [m]. (4)

Then each worker i computes ỹ_i = Ã_i x.

This is a general definition of a large class of coded computation schemes. For example, in the polynomial code [Yu et al., 2017], the coding matrix M is a Vandermonde matrix. In the MDS-type codes [Dutta et al., 2016, Lee et al., 2017a,b], M is a specific form of the corresponding generator matrices.

Definition 2. (Recovery threshold) A coded computation strategy M is k-recoverable if for any subset I ⊆ [m] with |I| = k, the master node can recover Ax from {ỹ_i | i ∈ I}. The recovery threshold κ(M) is defined as the minimum integer k such that strategy M is k-recoverable.

Regarding the recovery threshold, the existing work [Yu et al., 2017] has applied a cut-set type argument to show that the minimum recovery threshold of any scheme is

κ* = min_{M ∈ R^{m×n}} κ(M) ≥ n. (5)

Definition 3. (Computation load) The computation load of strategy M is defined as l(M) = ||M||_0, the number of nonzero elements of the coding matrix.

The goal of this paper is to design a coded computation strategy M that achieves the optimal recovery threshold n with low computation load l(M). Due to the space limit, all technical proofs in the sequel are provided in the Appendix.
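As a concrete reading of Definitions 1-3, the sketch below (our illustration, assuming NumPy; the toy coding matrix reproduces the 3-worker example from the Introduction) encodes the partitions with a coding matrix M as in (4), lets each worker return ỹ_i = Ã_i x, and counts the computation load l(M) as the number of nonzeros of M.

```python
import numpy as np

def encode(blocks, M):
    """A~_i = sum_j M[i, j] * A_j  (Eq. (4)); one coded block per worker."""
    return [sum(M[i, j] * blocks[j] for j in range(M.shape[1])
                if M[i, j] != 0) for i in range(M.shape[0])]

def computation_load(M):
    """l(M) = number of nonzero entries of the coding matrix (Definition 3)."""
    return int(np.count_nonzero(M))

# Toy instance: n = 2 partitions, m = 3 workers, the MDS-style example from the Introduction.
rng = np.random.default_rng(0)
A1, A2, x = rng.random((4, 6)), rng.random((4, 6)), rng.random(6)
M = np.array([[1, 0], [0, 1], [1, 1]])          # workers compute A1x, A2x, (A1+A2)x
y_tilde = [At @ x for At in encode([A1, A2], M)]
print("computation load l(M) =", computation_load(M))   # 4 = n(s+1) with s = 1
```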

3 Fundamental Limits and Diagonal Code

In this section, we describe the optimum computation load and the s-diagonal code that exactly matches this lower bound. We then provide a fast decoding algorithm with nearly linear decoding time.

3.1 Fundamental Limits on Computation Load

We first establish the lower bound on the computation load, i.e., the density of the coding matrix M.

Theorem 1. (Optimum computation load) For any coded computation scheme M ∈ R^{m×n} using m workers that can each store a 1/n fraction of A, to resist s stragglers, we have

l(M) ≥ n(s + 1). (6)

The polynomial code [Yu et al., 2017] achieves the optimum recovery threshold of n. However, its coding matrix (a Vandermonde matrix) is fully dense, i.e., l(M) = nm, far beyond the above bound. The short-dot code [Dutta et al., 2016] can reduce the computation load but sacrifices the recovery threshold. Therefore, a natural question is: can we design a code that achieves this lower bound? We answer this question in the sequel of this paper.

3.2 Diagonal Code

We now present the s-diagonal code, which achieves both the optimum recovery threshold and the optimum computation load for any given parameter values of n, m and s.

Definition 4. (s-diagonal code) Given parameters m, n and s, the s-diagonal code is defined as

Ã_i = Σ_{j=max{1, i−s}}^{min{i, n}} m_{ij} A_j, ∀ i ∈ [m], (7)

where each coefficient m_{ij} is chosen from a finite set S independently and uniformly at random.

The reason we name this code the s-diagonal code is that the nonzero positions of the coding matrix M exhibit the following band-diagonal structure:

M^T = [ ∗ ⋯ ∗ 0 0 ⋯ 0 ]
      [ 0 ∗ ⋯ ∗ 0 ⋯ 0 ]
      [ ⋮     ⋱  ⋱   ⋮ ]
      [ 0 ⋯ 0 0 ∗ ⋯ ∗ ],

where ∗ indicates the nonzero entries of M and each row of M^T (i.e., each column of M) contains s + 1 consecutive nonzero entries.
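A minimal sketch of the construction in Definition 4 (our illustration, assuming NumPy; the finite set S below is a placeholder): row i of M is supported on columns max{1, i − s} through min{i, n}, with values drawn uniformly from S. Here we take m = n + s workers, as in the example that follows.

```python
import numpy as np

def s_diagonal_matrix(n, s, rng, S=range(1, 64)):
    """Coding matrix of the s-diagonal code with m = n + s workers (Eq. (7))."""
    m = n + s
    M = np.zeros((m, n), dtype=int)
    for i in range(1, m + 1):                      # worker index i in [m]
        lo, hi = max(1, i - s), min(i, n)          # columns max{1, i-s}..min{i, n}
        M[i - 1, lo - 1:hi] = rng.choice(list(S), size=hi - lo + 1)
    return M

print(s_diagonal_matrix(n=4, s=1, rng=np.random.default_rng(1)))
```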

Before we analyze the recovery threshold of the diagonal code, the following example provides an instantiation for n = 4, s = 1 and m = 5.

Example 1: (1-diagonal code) Consider a distributed linear transform task Ax using m = 5 workers. We evenly divide the matrix A along the row side into 4 submatrices: A = [A_1^T, A_2^T, A_3^T, A_4^T]^T. Given this notation, we need to compute the following 4 uncoded components {A_1x, A_2x, A_3x, A_4x}. Based on the definition of the 1-diagonal code, each worker stores the following submatrices: Ã_1 = A_1, Ã_2 = A_1 + A_2, Ã_3 = A_2 + A_3, Ã_4 = A_3 + A_4, Ã_5 = A_4. Suppose that the first worker is a straggler and the master node receives the results from workers {2, 3, 4, 5}. According to the above coded computation strategy, we have

[ỹ_2]   [1 1 0 0] [A_1x]
[ỹ_3] = [0 1 1 0] [A_2x]    (8)
[ỹ_4]   [0 0 1 1] [A_3x]
[ỹ_5]   [0 0 0 1] [A_4x]

The coefficient matrix is an upper triangular matrix, which is invertible since the elements on the main diagonal are nonzero. We can therefore recover the uncoded components {A_i x} by direct inversion of the above coefficient matrix. The decodability for the other 4 possible scenarios can be proved similarly. Therefore, this code achieves the optimum recovery threshold of 4.
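The decodability claim in Example 1 can be checked directly; a small sketch of that check (ours, assuming NumPy), verifying that every 4-of-5 subset of workers yields an invertible coefficient matrix:

```python
import numpy as np
from itertools import combinations

# Coding matrix of the 1-diagonal code for n = 4, m = 5 (all nonzero coefficients set to 1).
M = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]])

# Every 4-of-5 subset of workers must give a full-rank 4 x 4 submatrix M^U.
subsets = list(combinations(range(5), 4))
for U in subsets:
    assert np.linalg.matrix_rank(M[list(U)]) == 4
print("all", len(subsets), "submatrices are full rank")
```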

Obviously, the computation load of the s-diagonal code is optimal; it can be obtained by simply counting the number of nonzero elements of the coding matrix M. The following result gives the recovery threshold of the s-diagonal code.

Theorem 2. Let the finite set S satisfy |S| ≥ 2n² C_m^n. Then there exists an s-diagonal code that achieves the recovery threshold n, and it can be constructed, on average, in 2 trials.

The theoretical framework for analyzing the recovery threshold is provided in Section 4.

3.3 Fast Decoding Algorithm

The original decoding algorithm of the s-diagonal code is based on inverting the coding matrix, and the mapping from [ỹ_{i_1}, ỹ_{i_2}, ..., ỹ_{i_n}] to the vector [y_1, y_2, ..., y_n] incurs a complexity of O(nr). We next show that this can be further reduced by a hybrid decoding algorithm that combines peeling decoding and Gaussian elimination. The key idea is to use peeling decoding to reduce the number of blocks recovered by the above mapping, and Gaussian elimination to guarantee the existence of a ripple node.

Suppose that the master node receives results from the workers indexed by U = {i_1, ..., i_n} ⊆ [n + s] with 1 ≤ i_1 ≤ ⋯ ≤ i_n ≤ n + s. Let M^U be the n × n submatrix consisting of the rows of M indexed by U. Let the index k ∈ [n] satisfy i_k ≤ n < i_{k+1}. Then recover the blocks indexed by [n] \ {i_1, ..., i_k} by the following rooting step, and add the recovered results to the ripple set R = {A_i x}_{i ∈ [n] \ {i_1, ..., i_k}}.

Lemma 1. (Rooting step) If rank(M^U) = n, then for any k_0 ∈ {1, 2, ..., n}, we can recover the block A_{k_0}x corresponding to column k_0 of M^U via the following linear combination:

A_{k_0}x = Σ_{k=1}^n u_k ỹ_{i_k}. (9)

The vector u = [u_1, ..., u_n] can be determined by solving u^T M^U = e_{k_0}, where e_{k_0} ∈ R^{1×n} is the unit vector whose unique 1 is located at index k_0.

The master node then runs a peeling decoding process: it first picks a ripple A_i x from the set R. For each collected result ỹ_j, it subtracts this block whenever the computation task Ã_j x contains it (m_{ji} ≠ 0). If the set R is empty, the master node finds a new ripple, i.e., a result among the remaining workers that now reduces to a single uncoded block, adds it to the set R, and continues the above process. The main procedure is listed in Algorithm 1.

Algorithm 1 Fast decoding algorithm for the s-diagonal code
  Receive n results with coefficient matrix M^U; find the index k ∈ [n] with i_k ≤ n < i_{k+1}.
  Recover the blocks indexed by [n] \ {i_1, ..., i_k} by (9).
  Construct the ripple set R = {A_i x}_{i ∈ [n] \ {i_1, ..., i_k}}.
  repeat
    if R is not empty then
      Choose a result A_i x from R.
      for each computation result ỹ_j do
        if m_{ji} is nonzero then
          ỹ_j = ỹ_j − m_{ji} A_i x and set m_{ji} = 0.
        end if
      end for
    else
      Find a row M^U_{i'} of M^U with ||M^U_{i'}||_0 = 1 and set R = R ∪ {ỹ_{i'}}.
    end if
  until every block of the vector y is recovered.

Example 2: (Fast decoding algorithm) Consider the same setup as Example 1, and suppose that we receive the results from the workers indexed by U = {2, 3, 4, 5}. The naive inverse decoding algorithm leads to a decoding complexity of 4r (the complexity of the inverse mapping of (8)). In the fast decoding algorithm, we first recover the block A_1x ({1, 2, 3, 4} \ {2, 3, 4}) by the rooting step, A_1x = ỹ_2 − ỹ_3 + ỹ_4 − ỹ_5. Then we start the peeling decoding process: (i) recover the block A_2x as ỹ_2 − A_1x and add the result to the set R; (ii) recover the block A_3x as ỹ_3 − A_2x and add the result to the set R; (iii) recover the block A_4x as ỹ_4 − A_3x. The whole procedure of the new algorithm has complexity 7r/4, which is smaller than 4r.

The decoding complexity of Algorithm 1 is given by the following theorem.

Theorem 3. (Nearly linear time decoding of s-diagonal codes) Algorithm 1 uses at most s rooting steps (9), and the total decoding time is O(rs).
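The sketch below (our illustration, assuming NumPy; it presumes the received submatrix M^U is decodable, as the s-diagonal code guarantees) mirrors Algorithm 1: the rooting step of Lemma 1 recovers the blocks displaced by stragglers, and the remaining blocks are recovered by peeling.

```python
import numpy as np

def hybrid_decode(M_U, workers, results, n):
    """Decode [A_1 x, ..., A_n x] from n received results, following Algorithm 1.

    M_U: n x n coefficient submatrix (row k belongs to worker workers[k], 0-indexed);
    results: the n returned vectors y~_{i_k}; assumes M_U is invertible."""
    M = M_U.astype(float).copy()
    y = [np.asarray(r, dtype=float).copy() for r in results]
    blocks, ripple = [None] * n, []
    # Rooting step (Lemma 1): recover the blocks whose index is not among the
    # received worker indices smaller than n, by solving u^T M^U = e_{k0}.
    for k0 in sorted(set(range(n)) - {w for w in workers if w < n}):
        u = np.linalg.solve(M_U.T.astype(float), np.eye(n)[k0])
        blocks[k0] = sum(u[k] * results[k] for k in range(n))
        ripple.append(k0)
    # Peeling: subtract known blocks from every result containing them; when the
    # ripple set is empty, a row with a single remaining nonzero yields a new block.
    while any(b is None for b in blocks):
        if not ripple:
            j = next(j for j in range(n) if np.count_nonzero(M[j]) == 1)
            k0 = int(np.flatnonzero(M[j])[0])
            blocks[k0] = y[j] / M[j, k0]
            ripple.append(k0)
        i = ripple.pop()
        for j in range(n):
            if M[j, i] != 0:
                y[j] = y[j] - M[j, i] * blocks[i]
                M[j, i] = 0.0
    return blocks
```

In the spirit of Theorem 3, the linear solve is used only for the blocks displaced by stragglers; everything else is recovered by cheap vector subtractions.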
4 Graph-based Analysis of Recovery Threshold

In this section, we present our main technical tool for analyzing the recovery threshold of the proposed code, which is also the basis of the random code constructions in the next section.

For each subset U ⊆ [m] with |U| = n, let M^U be the n × n submatrix consisting of the rows of M indexed by U. To prove that our s-diagonal code achieves the recovery threshold n, we need to show that all the n × n submatrices M^U are full rank. The basic idea is to reduce the full rank analysis to the analysis of the existence of a perfect matching in a corresponding bipartite graph. We first define the following bipartite graph model between the set of m workers indexed by [m] and the set of n data partitions indexed by [n].

Definition 5. Let G^D(V_1, V_2) be a bipartite graph with V_1 = [m] and V_2 = [n]. Each node i ∈ V_1 is connected to node j ∈ V_2 if m_{ij} ≠ 0.

Definition 6. Define the Edmonds matrix M(x) ∈ R(x)^{m×n} of the graph G^D(V_1, V_2) by [M(x)]_{ij} = x_{ij} if nodes i ∈ V_1 and j ∈ V_2 are connected, and [M(x)]_{ij} = 0 otherwise.

Based on this notation, our main result is given by the following theorem.

Theorem 4. For a coded computation scheme M ∈ R^{m×n}, if every subgraph G^D(U, V_2) of G^D(V_1, V_2) with U ⊆ [m] and |U| = n contains a perfect matching, then the recovery threshold κ(M) = n with probability at least 1 − n² C_m^n / |S|.

Proof. Based on the above definitions, the coding matrix M of the s-diagonal code can be obtained by assigning each indeterminate x_{ij} of the Edmonds matrix M(x) a value from the set S independently and uniformly at random. Given a subgraph G^D(U, V_2) of G^D(V_1, V_2), let M^U(x) be the corresponding Edmonds matrix. Then the probability that the matrix M^U is full rank is equal to the probability that the determinant of the Edmonds matrix M^U(x) is nonzero at the given value x. The following technical lemma from [Schwartz, 1980] provides a simple lower bound on this event.

Lemma 2. (Schwartz-Zippel Lemma) Let f(x_1, ..., x_{n²}) be a nonzero polynomial of degree n². Let S be a finite set in R. If we assign each variable a value from S independently and uniformly at random, then

P(f(x_1, x_2, ..., x_{n²}) ≠ 0) ≥ 1 − n²/|S|. (10)

A classical result in graph theory is that a bipartite graph G^D(U, V_2) contains a perfect matching if and only if the determinant of its Edmonds matrix, i.e., |M^U(x)|, is a nonzero polynomial. Combining this result with the Schwartz-Zippel Lemma, we can reduce the analysis of the full rank probability of the submatrix M^U to the probability that the subgraph G^D(U, V_2) contains a perfect matching:

P(|M^U| ≠ 0) = P(|M^U| ≠ 0 | |M^U(x)| ≢ 0) · P(|M^U(x)| ≢ 0) + P(|M^U| ≠ 0 | |M^U(x)| ≡ 0) · P(|M^U(x)| ≡ 0), (11)

where, by the Schwartz-Zippel Lemma, the first conditional probability is at least 1 − 1/(2 C_m^n); the factor P(|M^U(x)| ≢ 0) is the probability that the subgraph G^D(U, V_2) contains a perfect matching; and the last term equals 0. Therefore, utilizing the union bound, we conclude that the probability that there exists a submatrix M^U that is not full rank is upper bounded by n² C_m^n / |S|.
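The event bounded in Theorem 4 can also be explored numerically. The sketch below (ours, assuming NumPy) assigns i.i.d. uniform values from a finite set S to a given support pattern and checks whether every n × n row-submatrix is full rank, mirroring the random assignment and the 2-trials-on-average construction behind Theorem 2.

```python
import numpy as np
from itertools import combinations

def all_submatrices_full_rank(M, n):
    """Check the event in Theorem 4: every n x n row-submatrix M^U is invertible."""
    return all(np.linalg.matrix_rank(M[list(U)]) == n
               for U in combinations(range(M.shape[0]), n))

def random_assignment(support, S, rng):
    """Assign each nonzero position of the support an i.i.d. uniform value from S."""
    M = np.zeros(support.shape)
    idx = np.nonzero(support)
    M[idx] = rng.choice(S, size=len(idx[0]))
    return M

# 1-diagonal support for n = 4, m = 5; a handful of random trials (2 suffice on
# average when |S| >= 2 n^2 C_m^n, cf. Theorem 2).
support = np.array([[1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]])
rng = np.random.default_rng(0)
S = np.arange(1, 1000)
hits = sum(all_submatrices_full_rank(random_assignment(support, S, rng), 4) for _ in range(10))
print(hits, "of 10 random assignments made every 4 x 4 submatrix full rank")
```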
The next technical lemma shows that for all subsets U ⊆ [m] with |U| = n, the subgraph G^D(U, V_2) indeed contains a perfect matching for the diagonal code. Therefore, letting |S| ≥ 2n² C_m^n, we conclude that, with probability at least 1/2, all the n × n submatrices of M are full rank. Since we can generate the coding matrix M offline, with a few rounds of trials (2 on average) we can find a coding matrix with all n × n submatrices full rank. We therefore arrive at Theorem 2.

Lemma 3. Let M be constructed as in Definition 4 and the bipartite graph G^D(V_1, V_2) be constructed as in Definition 5. For each U ⊆ [m] with |U| = n, the subgraph G^D(U, V_2) contains a perfect matching.

One special case of the s-diagonal code is that, when we are required to resist only one straggler, all the nonzero elements of the matrix M can be set equal to 1.

Corollary 1. Given the parameters n and m = n + 1, define the 1-diagonal code:

Ã_i = A_1 if i = 1;  Ã_i = A_{i−1} + A_i if 2 ≤ i ≤ n;  Ã_i = A_n if i = n + 1.

It achieves the optimum computation load 2n and the optimum recovery threshold n.

5 Random Code: "Break" the Limits

In this section, we utilize our theoretical framework in Theorem 4 to construct random codes that achieve the optimal recovery threshold with high probability but with a constant computation load.

5.1 Probabilistic Recovery Threshold

In practice, stragglers occur at random among the workers, and any specific straggling configuration happens with very low probability. We first demonstrate the main idea through the following motivating examples.

Example 3: Consider a distributed linear transform task with n = 20 data partitions, s = 5 stragglers and m = 25 workers. A recovery threshold of 20 implies that all C_{25}^{20} = 53130 square 20 × 20 submatrices of the coding matrix are full rank. Suppose that each worker is a straggler independently with probability 10%. Then the probability that workers {1, 2, 3, 4, 5} are all stragglers is 10^{−5}. Now, if there exists a scheme that can guarantee that the master decodes the results in all configurations except the straggling configuration {1, 2, 3, 4, 5}, we can argue that this scheme achieves a recovery threshold of 20 with probability 1 − 10^{−5}.

Example 4: Consider the same scenario as Example 1 (n = 4, s = 1, m = 5). We change the coded computation strategy of the second worker from Ã_2 = A_1 + A_2 to Ã_2 = A_2. Based on a similar analysis, we can show that the new strategy can recover the final result from any 4 workers except in the scenario where the first worker is a straggler. Therefore, the new strategy achieves recovery threshold 4 with probability 0.75, and reduces the computation load by 1.

Based on the above two examples, we observe that the computation load can be reduced when we allow the coded computation strategy to fail in some specific scenarios. Formally, this motivates the following metric.

Definition 7. (Probabilistic recovery threshold) A coded computation strategy M is probabilistic k-recoverable if for each subset I ⊆ [m] with |I| = k, the master node can recover Ax from {Ã_i x | i ∈ I} with high probability, i.e., 1 − O(2^{−n}). The probabilistic recovery threshold κ(M) is defined as the minimum integer k such that strategy M is probabilistic k-recoverable.

The new definition provides a probabilistic relaxation such that a small, vanishing (as n → ∞) fraction of all straggling configurations is allowed to be unrecoverable. In the sequel, we show that, under such a relaxation, one can construct a coded computation scheme that achieves a probabilistic recovery threshold of n and a constant (with respect to the parameter s) computation load.

5.2 Construction of the Random Code

Based on our previous analysis of the recovery threshold of the s-diagonal code, we showed that, for any subset U ⊆ [m] with |U| = n, the probability that M^U is full rank is lower bounded by the probability (multiplied by 1 − o(1)) that the corresponding subgraph G(U, V_2) contains a perfect matching. This technical path motivates us to utilize random proposal graphs to construct the coded computation scheme. The first construction is the following p-Bernoulli code, which is based on the Erdős–Rényi (ER) random bipartite graph model [Erdos and Renyi, 1964].

Definition 8. (p-Bernoulli code) Given parameters m, n, construct the coding matrix M as follows:

m_{ij} = t_{ij} with probability p, and m_{ij} = 0 with probability 1 − p, (12)

where t_{ij} is picked independently and uniformly from the finite set S.
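A minimal sketch of Definition 8 (ours, assuming NumPy; S is a placeholder finite set), with the choice p = 2 log(n)/n used in Theorem 5:

```python
import numpy as np

def p_bernoulli_code(m, n, S, rng, p=None):
    """p-Bernoulli coding matrix (Eq. (12)); default p = 2 log(n) / n as in Theorem 5."""
    p = 2 * np.log(n) / n if p is None else p
    mask = rng.random((m, n)) < p
    values = rng.choice(S, size=(m, n))
    return np.where(mask, values, 0)

M = p_bernoulli_code(m=25, n=20, S=np.arange(1, 100), rng=np.random.default_rng(0))
print("average nonzeros per worker:", M.astype(bool).sum() / 25)   # concentrates near 2 log(n)
```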

Theorem 5. For any parameters m, n and s, if p = 2 log(n)/n, the p-Bernoulli code achieves the probabilistic recovery threshold n.

This result implies that each worker of the p-Bernoulli code requires accessing 2 log(n) submatrices on average, which is independent of the number of stragglers. Note that existing work in distributed functional computation [Lee et al., 2017b] proposes a random sparse code that also utilizes Bernoulli random variables to construct the coding matrix. There are two key differences: (i) the elements of our matrix can be integer valued, while the random sparse code adopts a real-valued matrix; (ii) the density of the p-Bernoulli code is 2 log(n), while the density of the random sparse code is an unknown constant. The second random code is the following (d1, d2)-cross code.

Definition 9. ((d1, d2)-cross code) Given parameters m, n, construct the coding matrix M as follows: (1) each row (column) independently and uniformly chooses d1 (d2) nonzero positions; (2) for those nonzero positions, assign values independently and uniformly from the finite set S.

The computation load of the (d1, d2)-cross code is upper bounded by d1·m + d2·n. A numerical analysis of the proposed random codes can be found in the Appendix. The next theorem shows that a constant choice of d1 and d2 can guarantee the probabilistic recovery threshold n.

Theorem 6. For any parameters m, n, if s = poly(log(n)), the (2, 3)-cross code achieves the probabilistic recovery threshold n. If s = Θ(n^α), α < 1, the (2, 2/(1 − α))-cross code achieves the probabilistic recovery threshold n.

The proof of this theorem is based on analyzing the existence of a perfect matching in a random bipartite graph constructed as follows: (i) each node in the left partition randomly and uniformly connects to d1 nodes in the opposite class; (ii) each node in the right partition randomly and uniformly connects to l nodes in the opposite class, where l is chosen under a specific degree distribution. This random graph model can be regarded as a generalization of Walkup's 2-out bipartite graph model [Walkup, 1980]. The main technical difficulty derives from the intrinsically complicated statistical model of the node degrees of the right partition.
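A sketch of the construction in Definition 9 (ours, assuming NumPy): every row picks d1 positions and every column picks d2 positions uniformly at random, and the union of the chosen positions receives i.i.d. uniform values from S, so the number of nonzeros is at most d1·m + d2·n.

```python
import numpy as np

def cross_code(m, n, d1, d2, S, rng):
    """(d1, d2)-cross coding matrix (Definition 9)."""
    support = np.zeros((m, n), dtype=bool)
    for i in range(m):                                   # each row picks d1 positions
        support[i, rng.choice(n, size=d1, replace=False)] = True
    for j in range(n):                                   # each column picks d2 positions
        support[rng.choice(m, size=d2, replace=False), j] = True
    M = np.zeros((m, n))
    M[support] = rng.choice(S, size=int(support.sum()))
    return M

M = cross_code(m=25, n=20, d1=2, d2=3, S=np.arange(1, 100), rng=np.random.default_rng(0))
print("computation load l(M) =", int((M != 0).sum()), "<= d1*m + d2*n =", 2 * 25 + 3 * 20)
```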
6 Applications to Gradient Coding

In this section, we discuss the application of the proposed coded schemes to the distributed gradient descent problem. We are given data sets {(x_1, y_1), ..., (x_n, y_n)}, where each block (x_i, y_i) ∈ R^{k×p} × R^k consists of k data samples. Several machine learning tasks aim to solve the following problem

w* = arg min_{w ∈ R^r} Σ_{i=1}^n l(w; x_i, y_i) + λ R(w), (13)

where l(·) is a task-specific loss function and R(·) is a regularization function. Typically, this problem is solved by the following gradient-based approach,

w_{t+1} = h_R( w_t − η_t Σ_{i=1}^n ∇l(w_t; x_i, y_i) ), (14)

where h_R(·) is the proximal mapping, which depends on the regularization function. Several methods such as gradient descent, accelerated gradient, Frank-Wolfe, proximal methods, LBFGS, and bundle methods fit into this framework. The gradient coding framework is: (i) initially, the data blocks are assigned to the workers based on the given coefficient matrix; (ii) in each iteration t, the master node broadcasts the vector w_t to each worker, and each worker i calculates a single linear combination of gradient vectors

g̃_i = Σ_{j=1}^n m_{ij} ∇l(w_t; x_j, y_j), ∀ i ∈ [m]; (15)

(iii) the master node collects a subset of results from the m workers, decodes the full gradient, and updates the weight vector based on the proximal gradient descent step (14). The algorithm terminates when it has converged, i.e., when the gradient vanishes. One instantiation of this framework is the linear regression problem provided in Subsection H.2. We can see that, in this problem, the defined computation load l(M) is exactly the number of partial gradient evaluations performed locally. Using the proposed (d1, d2)-cross code, we obtain a gradient coding scheme in which each worker only needs to store a constant number of data blocks, and the recovery threshold is n with high probability. Straightforward inverse decoding would lead to a complexity of O(n²r) (decoding n partial gradients from the results of n workers). However, the following technical lemma shows that we can actually decode in O(nr) time.

Lemma 4. (Gradient decoding) Assume the received coding matrix is M^U. If rank(M^U) = n, then we can recover the full gradient Σ_{i=1}^n ∇l(w_t; x_i, y_i) via the following linear combination:

Σ_{i=1}^n ∇l(w_t; x_i, y_i) = Σ_{i=1}^n u_i g̃_i. (16)

The vector u = [u_1, ..., u_n] can be determined by solving u^T M^U = 1_n, where 1_n ∈ R^{1×n} is the all-one vector.

The basic intuition is to find a linear combination of the row vectors of the matrix M^U that spans the all-one vector 1_n.
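As a concrete reading of (15)-(16), the sketch below (our illustration, assuming NumPy; the partial gradients and the received coding submatrix are placeholders) solves u^T M^U = 1_n once and recovers the full gradient as a fixed linear combination of the workers' coded partial gradients.

```python
import numpy as np

def decode_full_gradient(M_U, coded_gradients):
    """Lemma 4: solve u^T M^U = 1_n and return sum_k u_k g~_k = sum_i grad_i."""
    u = np.linalg.solve(M_U.T, np.ones(M_U.shape[0]))
    return sum(u_k * g for u_k, g in zip(u, coded_gradients))

# Toy check with n = 3 partial gradients and the 1-diagonal code restricted to workers {2, 3, 4}.
rng = np.random.default_rng(0)
grads = [rng.random(5) for _ in range(3)]                    # grad l(w_t; x_i, y_i) per block
M_U = np.array([[1., 1., 0.], [0., 1., 1.], [0., 0., 1.]])   # rows of M for the received workers
coded = [sum(M_U[k, j] * grads[j] for j in range(3)) for k in range(3)]   # Eq. (15)
assert np.allclose(decode_full_gradient(M_U, coded), sum(grads))
```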

Figure 3: Comparison of total time including transmission, computation and decoding time for n = 12, 20 and s = 2, 4.

Figure 4: Magnitude of gradient versus time for number of data partitions n = 12, 20 and number of stragglers s = 2, 4.

7 Experimental Results

In this section, we present experimental results obtained on the Ohio Supercomputer Center [Center, 1987]. We compare our proposed coding schemes, the s-diagonal code and the (d1, d2)-cross code, against the following existing schemes in both the single matrix-vector multiplication and the gradient coding problems: (i) uncoded scheme: the input matrix is divided uniformly across all workers without replication and the master waits for all workers to send their results; (ii) sparse MDS code [Lee et al., 2017b]: the generator matrix is a sparse random Bernoulli matrix with average computation overhead Θ(log(n)); (iii) polynomial code [Yu et al., 2017]: a coded scheme with optimum recovery threshold and nearly linear decoding time; (iv) short-dot code [Dutta et al., 2016]: dummy vectors are appended to the data matrix A before applying the MDS code, which provides some sparsity of the encoded data matrix at the cost of an increased recovery threshold; (v) LT code [Luby, 2002]: a rateless code widely used in broadcast communication; it achieves an average computation load of Θ(log(n)) and a nearly linear decoding time using a peeling decoder. To simulate straggler effects in a large-scale system, we randomly pick s workers that run a background thread. More details can be found in the Appendix.

We first use a matrix with r = t = 1048576 and nnz(A) = 89239674 from the data sets of [Davis and Hu, 2011], and evenly divide this matrix into n = 12 and 20 partitions. In Figure 3, we report the job completion time under s = 2 and s = 4, based on 20 experimental runs. It can be observed that the (2, 2)-cross code outperforms the uncoded scheme (finishing in 50% of its time), the LT code (70%), the sparse MDS code (60%), the polynomial code (20%) and our s-diagonal code. Moreover, we compare our proposed s-diagonal code with the (2, 2)-cross code versus the number of stragglers s. As shown in Figure 3, when the number of stragglers s increases, the job completion time of the s-diagonal code increases while that of the (2, 2)-cross code remains unchanged. Another interesting observation is that the irregularity of the work load can decrease the I/O contention. For example, when s = 2, the computation load of the 2-diagonal code is similar to that of the (2, 2)-cross code, and equal to 36 in the case of n = 12. However, the (2, 2)-cross code costs less time due to the random worker load.

We finally compare our proposed codes with the existing schemes on a gradient coding problem; details can be found in the Appendix. We use data from the LIBSVM dataset repository with r = 19264097 samples and t = 1163024 features, and evenly divide the data matrix A into n = 12 submatrices. In Figure 4, we plot the magnitude of the scaled gradient ||ηA^T(Ax − b)|| versus the running time of the above seven schemes under n = 12 and s = 2, 4. Across all experiments, we can see that the (2, 2)-cross code converges at least 30% faster than the sparse MDS code, 2 times faster than both the uncoded scheme and the LT code, and at least 4 times faster than the short-dot and polynomial codes. The (2, 2)-cross code performs similarly to the s-diagonal code when s = 2 and converges 30% faster than the s-diagonal code when the number of stragglers increases to s = 4.

8 Conclusion

In this paper, we characterized the fundamental limit of the minimum computation load of coded computation schemes, and proposed a new code, which we call the s-diagonal code, that exactly achieves this lower bound together with the optimal recovery threshold. Furthermore, we exploited our proposed theoretical framework to design two random codes that achieve the optimal recovery threshold with high probability but with constant computation load. We implemented our proposed schemes for the distributed gradient descent problem and demonstrated their advantages over the existing fastest schemes.

References

Jean Bourgain, Van H Vu, and Philip Matchett Wood. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2):559–603, 2010.

Ohio Supercomputer Center. Ohio supercomputer center. http://osc.edu/ark:/19495/f5s1ph73, 1987.

Zachary Charles, Dimitris Papailiopoulos, and Jordan Ellenberg. Approximate gradient coding via sparse random graphs. arXiv preprint arXiv:1711.06771, 2017.

Timothy A Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.

Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.

Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems, pages 2100–2108, 2016.

Paul Erdos and Alfred Renyi. On random matrices. Magyar Tud. Akad. Mat. Kutató Int. Közl, 8:455–461, 1964.

Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin. Straggler mitigation in distributed optimization through data encoding. In Advances in Neural Information Processing Systems, pages 5440–5448, 2017.

Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 2017a.

Kangwook Lee, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. Coded computation for multicore setups. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 2413–2417. IEEE, 2017b.

Michael Luby. LT codes. In Foundations of Computer Science, 2002. Proceedings. The 43rd Annual IEEE Symposium on, pages 271–280. IEEE, 2002.

Raj Kumar Maity, Ankit Singh Rawat, and Arya Mazumdar. Robust gradient descent via moment encoding with LDPC codes. In SysML, 2018.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

Jacob T Schwartz. Fast probabilistic algorithms for verification of polynomial identities. Journal of the ACM (JACM), 27(4):701–717, 1980.

Rashish Tandon, Qi Lei, Alexandros G Dimakis, and Nikos Karampatziakis. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning, pages 3368–3376, 2017.

Terence Tao and Van Vu. On the singularity probability of random Bernoulli matrices. Journal of the American Mathematical Society, 20(3):603–628, 2007.

David W Walkup. Matchings in random regular bipartite digraphs. Discrete Mathematics, 31(1):59–64, 1980.

Sinong Wang, Jiashang Liu, and Ness Shroff. Coded sparse matrix multiplication. In International Conference on Machine Learning, pages 5152–5160, 2018.

Sinong Wang, Jiashang Liu, and Ness Shroff. Fundamental limits of approximate gradient coding. arXiv preprint arXiv:1901.08166, 2019.

Yaoqing Yang, Pulkit Grover, and Soummya Kar. Coded distributed computing for inverse problems. In Advances in Neural Information Processing Systems, pages 709–719, 2017.

Qian Yu, Mohammad Maddah-Ali, and Salman Avestimehr. Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems, pages 4406–4416, 2017.

Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.