Coded Sparse Matrix Multiplication
Sinong Wang 1, Jiashang Liu 1, Ness Shroff 2

1 Department of ECE, The Ohio State University, Columbus, USA. 2 Departments of ECE and CSE, The Ohio State University, Columbus, USA. Correspondence to: Sinong Wang <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

In a large-scale and distributed matrix multiplication problem $C = A^\top B$, where $C \in \mathbb{R}^{r \times t}$, coded computation plays an important role in effectively dealing with "stragglers" (distributed computations that may get delayed due to a few slow or faulty processors). However, existing coded schemes could destroy the significant sparsity that exists in large-scale machine learning problems, and could result in much higher computation overhead, i.e., $O(rt)$ decoding time. In this paper, we develop a new coded computation strategy, which we call the sparse code; it achieves a near optimal recovery threshold, low computation overhead, and linear decoding time $O(\mathrm{nnz}(C))$. We implement our scheme and demonstrate the advantage of the approach over both uncoded and the current fastest coded strategies.

1. Introduction

In this paper, we consider a distributed matrix multiplication problem, where we aim to compute $C = A^\top B$ from input matrices $A \in \mathbb{R}^{s \times r}$ and $B \in \mathbb{R}^{s \times t}$ for some integers $r, s, t$. This problem is a key building block in machine learning and signal processing, and arises in a large variety of application areas including classification, regression, clustering and feature selection. Many such applications have large-scale datasets and massive computation tasks, which forces practitioners to adopt distributed computing frameworks such as Hadoop (Dean & Ghemawat, 2008) and Spark (Zaharia et al., 2010) to increase the learning speed.

Classical approaches to distributed matrix multiplication rely on dividing the input matrices equally among all available worker nodes. Each worker computes a partial result, and the master node has to collect the results from all workers to output matrix $C$. As a result, a major performance bottleneck is the latency in waiting for a few slow or faulty processors, called "stragglers", to finish their tasks (Dean & Barroso, 2013). To alleviate this problem, current frameworks such as Hadoop deploy various straggler detection techniques and usually replicate the straggling task on another available node.

Recently, forward error correction and other coding techniques have provided a more effective way to deal with stragglers in distributed tasks (Dutta et al., 2016; Lee et al., 2017a; Tandon et al., 2017; Yu et al., 2017; Li et al., 2018; Wang et al., 2018). The idea is to create and exploit coding redundancy in the local computation so that the matrix $C$ is recoverable from the results of a subset of finished workers, thereby tolerating some straggling workers. For example, consider a distributed system with 3 worker nodes; the coding scheme first splits the matrix $A$ into two submatrices, i.e., $A = [A_1, A_2]$. Then the three workers compute $A_1^\top B$, $A_2^\top B$ and $(A_1 + A_2)^\top B$, respectively. The master node can compute $A^\top B$ as soon as any 2 out of the 3 workers finish, and can therefore overcome one straggler.
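To make the recovery step concrete, the following is a minimal NumPy sketch of this 3-worker example. The dimensions, variable names, and the `recover` helper are ours for illustration, not the authors' implementation: $A$ is split column-wise, and the master rebuilds $C = A^\top B$ from whichever two results arrive first.

```python
import numpy as np

# Toy dimensions for illustration only.
s, r, t = 8, 6, 4
A, B = np.random.randn(s, r), np.random.randn(s, t)
A1, A2 = A[:, : r // 2], A[:, r // 2 :]            # A = [A1, A2]

# The three workers' tasks.
results = {1: A1.T @ B, 2: A2.T @ B, 3: (A1 + A2).T @ B}

def recover(done):
    """Rebuild C = A'B from any two finished workers ({worker_id: result})."""
    if 1 in done and 2 in done:
        c1, c2 = done[1], done[2]
    elif 1 in done:                                  # worker 2 straggles
        c1, c2 = done[1], done[3] - done[1]
    else:                                            # worker 1 straggles
        c1, c2 = done[3] - done[2], done[2]
    return np.vstack([c1, c2])

# Example: worker 2 is the straggler.
C = recover({1: results[1], 3: results[3]})
assert np.allclose(C, A.T @ B)
```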
In a general setting with $N$ workers, each input matrix $A$, $B$ is divided into $m$, $n$ submatrices, respectively. The recovery threshold is defined as the minimum number of workers that the master needs to wait for in order to compute $C$. The above MDS-coded scheme is shown to achieve a recovery threshold of $\Theta(N)$. An improved scheme proposed in (Lee et al., 2017c), referred to as the product code, can offer a recovery threshold of $\Theta(mn)$ with high probability. More recently, the work (Yu et al., 2017) designs a type of polynomial code. It achieves the recovery threshold of $mn$, which exactly matches the information-theoretic lower bound.

However, many problems in machine learning exhibit both extremely large-scale target data and a sparse structure, i.e., $\mathrm{nnz}(A) \ll rs$, $\mathrm{nnz}(B) \ll st$ and $\mathrm{nnz}(C) \ll rt$. The key question that arises in this scenario is: is coding really an efficient way to mitigate stragglers in the distributed sparse matrix multiplication problem?

1.1. Motivation: Coding Straggler

To answer the aforementioned question, we first briefly introduce the current coded matrix multiplication schemes. In these schemes, each local worker calculates a coded version of a submatrix multiplication. For example, the $k$th worker of the polynomial code (Yu et al., 2017) essentially calculates

$$\underbrace{\Big[\sum_{i=1}^{m} A_i x_k^{i}\Big]^{\top}}_{\tilde{A}_k^{\top}} \underbrace{\Big[\sum_{j=1}^{n} B_j x_k^{jm}\Big]}_{\tilde{B}_k}, \qquad (1)$$

where $A_i$, $B_j$ are the corresponding submatrices of the input matrices $A$ and $B$, respectively, and $x_k$ is a given integer. One can observe that, if $A$ and $B$ are sparse, then due to the matrix additions, the density of the coded matrices $\tilde{A}_k$ and $\tilde{B}_k$ will increase by at most $m$ and $n$ times, respectively. Therefore, the time of the matrix multiplication $\tilde{A}_k^\top \tilde{B}_k$ will increase to roughly $O(mn)$ times that of the simple uncoded one $A_i^\top B_j$.
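This densification effect of eq. (1) is easy to reproduce. Below is a rough SciPy sketch, with arbitrary toy dimensions of our own choosing rather than the paper's experiment: each coded block is a sum of $m$ (resp. $n$) sparse blocks, so its density grows roughly $m$-fold (resp. $n$-fold) when the blocks' supports barely overlap.

```python
import numpy as np
from scipy import sparse

m, n, x_k = 4, 4, 2.0                 # x_k is the worker's evaluation point
rng = np.random.default_rng(0)
As = [sparse.random(1000, 250, density=1e-3, format="csr", random_state=rng)
      for _ in range(m)]
Bs = [sparse.random(1000, 250, density=1e-3, format="csr", random_state=rng)
      for _ in range(n)]

# Coded blocks of eq. (1): A~_k = sum_i A_i x_k^i, B~_k = sum_j B_j x_k^{jm}.
A_coded = sum(As[i] * x_k ** (i + 1) for i in range(m))
B_coded = sum(Bs[j] * x_k ** ((j + 1) * m) for j in range(n))

print(A_coded.nnz / As[0].nnz)        # close to m when supports barely overlap
print(B_coded.nnz / Bs[0].nnz)        # close to n
# The coded task A_coded.T @ B_coded is therefore roughly O(mn) times more
# expensive than a single uncoded block product As[0].T @ Bs[0].
```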
Figure 1. Measured local computation and communication time. (a) Histogram (frequency) of computation and communication time in seconds for the uncoded scheme and the polynomial code; (b) ratio of round-trip time between the polynomial code and the uncoded scheme versus matrix density, for $mn = 9, 16, 25$.

In Figure 1(a), we experiment with large and sparse matrix multiplication of two random Bernoulli square matrices with dimension roughly equal to $1.5 \times 10^5$ and the number of nonzero elements equal to $6 \times 10^5$. We show measurements of the local computation and communication time required for $N = 16$ workers to operate the polynomial code and the uncoded scheme. Our finding is that the final job completion time of the polynomial code is significantly increased compared to that of the uncoded scheme. The main reason is that the increased density of the input matrices leads to increased computation time, which further incurs even larger data transmission and higher I/O contention. In Figure 1(b), we generate two $10^5$-dimensional random square Bernoulli matrices with different densities $p$. We plot the ratio of the average local computation time between the polynomial code and the uncoded scheme versus the matrix density $p$. It can be observed that the coded scheme requires 5 times or more computation time than the uncoded scheme, and this ratio is particularly large, i.e., $O(mn)$, when the matrix is sparse.

In certain suboptimal coded computation schemes such as the MDS code, the product code (Lee et al., 2017a;c) and short dot (Dutta et al., 2016), the generator matrices are generally dense, which also implies a heavy workload for each local worker. Moreover, the decoding algorithms of the polynomial code and MDS-type codes are based on the fast polynomial interpolation algorithm over a finite field. These observations motivate the following question: can we find a coded matrix multiplication scheme that has a small recovery threshold, low computation overhead, and decoding complexity dependent only on $\mathrm{nnz}(C)$?

1.2. Main Contribution

In this paper, we answer this question positively by designing a novel coded computation strategy, which we call the sparse code. It achieves the near optimal recovery threshold $\Theta(mn)$ by exploiting the coding advantage in local computation. Moreover, such a coding scheme can exploit the sparsity of both the input and output matrices, which leads to a low computation load, i.e., $O(\ln(mn))$ times that of the uncoded scheme, and nearly linear decoding time $O(\mathrm{nnz}(C)\ln(mn))$.

The basic idea of the sparse code is the following: each worker chooses a random number of input submatrices based on a given degree distribution $P$, and then computes a weighted linear combination $\sum_{ij} w_{ij} A_i^\top B_j$, where the weights $w_{ij}$ are randomly drawn from a finite set $S$. When the master node has received a batch of finished tasks such that the coefficient matrix formed by the weights $w_{ij}$ is full rank, it runs a hybrid decoding algorithm between peeling decoding and Gaussian elimination to recover the resultant matrix $C$ (a schematic sketch of the encoding step is given at the end of this section).

We prove the optimality of the sparse code by carefully designing the degree distribution $P$ and the algebraic structure of the set $S$. The recovery threshold of the sparse code is mainly determined by how many tasks are required such that the coefficient matrix is full rank and the hybrid decoding algorithm recovers all the results. We design a type of Wave Soliton distribution (the definition is given in Section 4), and show that, under such a distribution, when $\Theta(mn)$ tasks are finished, the hybrid decoding algorithm will successfully decode all the results with decoding time $O(\mathrm{nnz}(C)\ln(mn))$.

Moreover, we reduce the full-rank analysis of the coefficient matrix to the determinant analysis of a random matrix in $\mathbb{R}^{mn \times mn}$. The state of the art in this field is limited to the Bernoulli case (Tao & Vu, 2007; Bourgain et al., 2010), in which each element is an independent and identically distributed random variable. However, in our proposed sparse code, the matrix is generated from a degree distribution, which leads to dependencies among the elements in the same row. To overcome this difficulty, we take a different technical path: we first utilize the Schwartz-Zippel Lemma (Schwartz, 1980) to reduce the determinant analysis problem to the analysis of the probability that a random bipartite graph contains a perfect matching. Then we combine combinatorial graph theory with the probabilistic method to show that when $mn$ tasks are collected, the coefficient matrix is full rank with high probability.
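To fix ideas, here is the schematic Python sketch of the encoding step promised above. The degree distribution `P` and the weight set are toy placeholders; the paper's actual choices are the Wave Soliton distribution of Section 4 and an algebraically structured set $S$, so this illustrates only the mechanism, not the construction itself.

```python
import numpy as np

def encode_task(As, Bs, P, weights, rng):
    """One worker's task: a weighted sum of d randomly chosen block products.

    Returns the worker's row of the coefficient matrix and the computed task.
    """
    m, n = len(As), len(Bs)
    d = rng.choice(len(P), p=P) + 1                   # degree d drawn from P
    picks = rng.choice(m * n, size=d, replace=False)  # d distinct pairs (i, j)
    coeff = np.zeros(m * n)
    task = 0.0
    for p in picks:
        i, j = divmod(p, n)
        w = rng.choice(weights)                       # weight drawn from set S
        coeff[p] = w
        task = task + w * (As[i].T @ Bs[j])
    return coeff, task

rng = np.random.default_rng(1)
m, n = 3, 3
As = [rng.standard_normal((50, 20)) for _ in range(m)]
Bs = [rng.standard_normal((50, 10)) for _ in range(n)]
P = np.array([0.5, 0.3, 0.2])                         # toy degree distribution
coeff, task = encode_task(As, Bs, weights=np.arange(1, 8), P=P, rng=rng)
# The master stacks these coeff rows as tasks finish; once the stacked matrix
# is full rank, peeling decoding plus Gaussian elimination recovers every
# block A_i^T B_j of C.
```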