CODED COMPUTATION FOR SPEEDING UP DISTRIBUTED MACHINE LEARNING

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of

Philosophy in the Graduate School of the Ohio State University

By

Sinong Wang

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2019

Dissertation Committee:

Ness B. Shroff, Advisor

Atilla Eryilmaz

Abhishek Gupta

Andrea Serrani

Copyrighted by

Sinong Wang

2019

ABSTRACT

Large-scale machine learning has shown great promise for solving many practical applications. Such applications require massive training datasets and model parameters, and force practitioners to adopt distributed computing frameworks such as Hadoop and Spark to increase the learning speed. However, the speedup gain is far from ideal due to the latency incurred in waiting for a few slow or faulty processors, called

"stragglers," to complete their tasks. To alleviate this problem, current frameworks such as Hadoop deploy various straggler detection techniques and usually replicate the straggling task on another available node, which creates a large computation overhead.

In this dissertation, we focus on a new and more effective technique, called coded computation, to deal with stragglers in distributed computation problems. It creates and exploits coding redundancy in the local computations so that the final output is recoverable from the results of partially finished workers, and it can therefore alleviate the impact of straggling workers. However, we observe that current coded computation techniques are not suitable for large-scale machine learning applications.

The reason is that the input training data are both extremely large in scale and sparse in structure, and the existing coded computation schemes destroy this sparsity and create large computation redundancy. Thus, while these schemes reduce delays due to the stragglers in the system, they create additional delays because they end up increasing the computational load on each machine. This fact motivates

us to focus on designing more efficient coded computation schemes for machine learning applications.

We begin by investigating the linear transformation problem. We analyze the minimum computation load (the number of redundant computations at each worker) of any coded computation scheme for this problem, and construct a code, which we name the "diagonal code," that achieves this minimum computation load. An important feature of this part is that we construct a new theoretical framework that relates the construction of a coded computation scheme to the design of a random bipartite graph that contains a perfect matching. Based on this framework, we further construct several random codes that provide an even lower computation load with high probability.

We next consider a more complex problem that is also useful in a number of machine learning applications: matrix multiplication. We show that the coded computation schemes previously constructed for the linear transformation problem lead to a large decoding overhead for this problem. To handle this issue, we design a new sparse code that is generated by a specifically designed degree distribution, which we call the

"wave soliton distribution." We further design a hybrid decoding algorithm that combines peeling decoding with a Gaussian elimination process and provides fast decoding for this problem.

Finally, we shift our focus to the distributed optimization problem, in the form of the gradient coding problem. We observe that the existing gradient coding schemes, which are designed for the worst-case scenario, yield a large computation redundancy. To overcome this challenge, we propose the idea of approximate gradient coding, which aims to approximately compute the sum of functions. We analyze the minimum computation load of the approximate gradient coding problem and further construct two approximate gradient coding schemes, the fractional repetition code and the batch raptor code, that asymptotically achieve this minimum computation load. We apply our proposed

schemes to a classical gradient descent algorithm for solving the logistic regression problem.

These works illustrate the power of designing efficient codes that are tailored to large-scale machine learning problems. In future research, we will focus on more complex machine learning problems, such as the distributed training of deep neural networks, and on system-level optimization of coded computation schemes.

To my parents, Xinhua Wang and Mingzhen Du

my wife, Xiaochi Li

ACKNOWLEDGMENTS

First and foremost, I would like to sincerely thank my Ph.D. advisor, Prof. Ness

B. Shroff, for all the guidance and support he gave me during my Ph.D. pursuit. As a great advisor, he not only pinpointed the correct direction and provided thought-provoking feedback when I was lost among the numerous challenges of research, but also showed me how to think about problems critically from the viewpoint of a researcher and find solutions effectively. In the past several years, he has given me immense help and encouragement, without which I would not be able to stand at this point.

I would like to thank Prof. Atilla Eryilmaz, Prof. Abhishek Gupta and Prof.

Andrea Serrani for serving on my candidacy and dissertation committees. Their valuable suggestions and insightful comments have helped me significantly improve this dissertation. I am grateful to all my friends and colleagues in the IPS lab and the ECE department. Thanks to Jeri for nicely organizing all the activities and administrative matters.

VITA

Oct, 1992 ...... Born in Shaanxi, China

2010-2014 ...... B.S., Telecommunication Engineering, Xidian University.

2015-Present ...... Electrical and Computer Engineering, The Ohio State University

PUBLICATIONS

S. Wang, Jiashang Liu, N. Shroff and P. Yang,“Computation Efficient Coded Linear Transform”. AISTATS 2019.

S. Wang, J. Liu, and N. Shroff, “Coded Sparse Matrix Multiplication”. ICML 2018.

S. Wang and N. Shroff,“A New Alternating Direction Method for Linear Programming”. NIPS 2017

F. Liu, S. Wang, S. Buccapatnam, and N. Shroff, “UCBoost: A Boosting Approach to Tame Complexity and Optimality for Stochastic Bandits”. IJCAI 2018.

S. Wang, F. Liu and N. Shroff, “Non-additive Security Game”. AAAI, 2017

S. Wang and N. Shroff,“Towards Fast-Convergence, Low-Delay and Low-Complexity Network Optimization”. ACM SIGMETRICS 2018.

S. Wang and N. Shroff, “A Fresh Look at An Old Problem: Network Utility Maximization Convergence, Delay, and Complexity”, invited paper, Allerton 2017.

S. Wang and N. Shroff, “Security Game with Non-additive Utilities and Multiple Attacker Resources”, Kenneth C. Sevcik Outstanding Paper Award, ACM SIGMETRICS, 2017.

FIELDS OF STUDY

Major Field: Electrical and Computer Engineering

Specialization: Machine Learning, Optimization, Distributed System, Game Theory

TABLE OF CONTENTS

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... vi

Vita...... vii

List of Tables ...... xii

List of Figures ...... xiii

CHAPTER PAGE

1 Introduction ...... 1

1.1 Curse of the Straggler ...... 2
1.2 New Frontier: Coded Computation ...... 3
1.2.1 A Toy Example ...... 3
1.2.2 Historical Development of Coded Computation ...... 5
1.3 Challenges in Coded Computation ...... 6
1.4 Contribution and Thesis Organization ...... 9

2 Coded Distributed Linear Transform ...... 12

2.1 Introduction ...... 12
2.2 Problem Formulation ...... 15
2.3 Fundamental Limits and Optimal Code ...... 17
2.3.1 Fundamental Limits on Computation Load ...... 17
2.3.2 Diagonal Code ...... 18
2.3.3 Fast Decoding Algorithm ...... 20
2.4 Graph-based Analysis of Recovery Threshold ...... 22
2.5 Random Code: “Break” the Limits ...... 24
2.5.1 Probabilistic Recovery Threshold ...... 24
2.5.2 Construction of the Random Code ...... 26
2.5.3 Numerical Results of Random Code ...... 27

2.6 Experimental Results ...... 28

3 Coded Distributed Matrix Multiplication ...... 33

3.1 Introduction ...... 33
3.2 Preliminary ...... 35
3.3 Sparse Codes ...... 37
3.3.1 Motivating Example ...... 38
3.3.2 General Sparse Code ...... 41
3.4 Theoretical Analysis ...... 43
3.5 Experimental Results ...... 49

4 Coded Distributed Optimization ...... 53

4.1 Introduction ...... 53
4.2 Preliminaries ...... 57
4.2.1 Problem Formulation ...... 57
4.2.2 Main Results ...... 60
4.3 0-Approximate Gradient Code ...... 61
4.3.1 Minimum Computation Load ...... 62
4.3.2 d-Fractional Repetition Code ...... 63
4.4 ε-Approximate Gradient Code ...... 65
4.4.1 Fundamental Three-fold Tradeoff ...... 66
4.4.2 Random Code Design ...... 66
4.5 Simulation Results ...... 70
4.5.1 Experiment Setup ...... 70
4.5.2 Generalization Error ...... 72
4.5.3 Impact of Straggler Tolerance ...... 72

5 Conclusion ...... 74

Bibliography ...... 79

Appendix A: Proofs for Chapter 2 ...... 84

A.1 Proof of Theorem 2.3.1 ...... 84
A.2 Proof of Theorem 2.3.3 ...... 85
A.3 Proof of Lemma 2.4.2 ...... 85
A.4 Proof of Corollary 2.4.1.1 ...... 86
A.5 Proof of Theorem 2.5.1 ...... 87
A.6 Proof of Theorem 2.5.2 ...... 90

Appendix B: Proofs for Chapter 3 ...... 99

B.1 Proof of Theorem 3.4.1 ...... 99
B.2 Proof of Lemma 3.4.1 ...... 109

B.3 Proof of Theorem 3.4.2 ...... 111

Appendix C: Proofs for Chapter 4 ...... 112

C.1 Proof of Lemma 4.3.1 ...... 112
C.2 Proof of Lemma 4.3.2 ...... 113
C.3 Proof of Theorem 4.3.1 ...... 114
C.4 Proof of Theorem 4.3.2 ...... 118
C.5 Proof of Theorem 4.4.1 ...... 120
C.6 Proof of Theorem 4.4.2 ...... 123

LIST OF TABLES

TABLE PAGE

2.1 Comparison of Existing Schemes in Coded Computation...... 14

3.1 Comparison of Existing Coding Schemes ...... 35

3.2 Timing Results for Different Sparse Matrix Multiplications (in sec) . 52

4.1 Comparison of Existing Gradient Coding Schemes ...... 55

LIST OF FIGURES

FIGURE PAGE

1.1 Example of replication and coded computation scheme in distributed linear transform ...... 4

1.2 Measured local computation time per worker...... 8

2.1 Framework of coded distributed linear transform...... 16

2.2 Statistical convergence speed of full rank probability and average computation load of random code under number of stragglers s = 2 ...... 29

2.3 Statistical convergence speed of full rank probability and average computation load of (d1, d2)-cross code under number of stragglers s = 3, 4...... 30

2.4 Comparison of total time including transmission, computation and decoding time for n = 12, 20 and s = 2, 4...... 31

2.5 Magnitude of gradient versus time for number of data partitions n = 12, 20 and number of stragglers s = 2,4...... 32

3.1 Framework of coded distributed matrix multiplication...... 36

3.2 Example of the hybrid peeling and Gaussian decoding process of sparse code...... 39

3.3 Recovery threshold versus the number of blocks mn...... 48

3.4 Job completion time of two 1.5E5 × 1.5E5 matrices with 6E5 nonzero elements ...... 50

3.5 Simulation results for two 1.5E5 × 1.5E5 matrices with 6E5 nonzero elements. T1 and T2 are the transmission times from the master to the workers, and from the workers to the master, respectively ...... 51

4.1 Gradient coding framework...... 58

4.2 Information-theoretical lower bound of existing worst-case gradient coding [1] and proposed ε-approximate gradient coding when n = 1000 ...... 65

4.3 Example of the batch raptor code and peeling decoding algorithm. . 68

4.4 The generalization AUC versus running time of applying distributed gradient descent in a logistic regression model. The two proposed schemes FRC and BRC are compared against three existing schemes. The learning rate α is fixed for all the experiments ...... 71

4.5 The final job completion time of achieving a generalization AUC = 0.8 in a logistic regression model. The two proposed schemes FRC and BRC are compared against three existing schemes. The learning rate α is fixed for all the experiments...... 73

B.1 Example of structures S ∈ V1, N(S) ∈ V2 and S ∈ V2, N(S) ∈ V1 satisfying conditions 1, 2 and 3. One can easily check that there exists no perfect matching in these two examples ...... 101

CHAPTER 1

INTRODUCTION

Large-scale machine learning has shown great promise for solving many practical applications, ranging from image classification [2], speech recognition [3], and text processing [4] to the recent technology of self-driving cars [5]. It has been observed that increasing the scale, with respect to the number of training examples, the number of model parameters, or both, can drastically improve classification accuracy [6]. For example, Google built a 9-layer locally connected sparse autoencoder to detect human faces [7]. With more than 1 billion connections (parameters) and 10 million

200 × 200 images, they obtained a leap of 70% relative improvement over the previous state-of-the-art. Such applications require massive training datasets and model parameters, and force practitioners to adopt distributed computing frameworks such as Hadoop [8] and Spark [9] to increase the learning speed. In the previous example, they trained such networks using parallel asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days.

Over the past several years, distributing the data and model across multiple machines has enabled the machine learning community to build extremely large-scale models and achieve unprecedented human-level prediction accuracy. However, the performance of these systems is bottlenecked by the slow machine problem. This issue is due to the fact that modern distributed computation systems adopt a split-apply-combine paradigm: a computing job is first split into tasks that

are served independently in parallel. Their results are then combined before the computation can proceed further. Therefore, the execution time of a job is determined by the slowest of the tasks, the "straggler". For example, it was observed in [10] that a straggler may run 8 times slower than the average worker on

Amazon EC2. In an experiment reported by Google [6], 100+ machines were used to train a large-scale deep neural network, and they only obtained at most a 10X speedup. They observed that the typical cause of less-than-ideal speedups is the variance in processing times across the different machines, leading to many machines waiting for the single slowest machine to finish a given phase of computation.

1.1 Curse of the Straggler

As mentioned earlier, the effectiveness of distributed computation is bottlenecked by the straggler, the slowest parallel task [11–13]. While a programmer can try her best to evenly split the computation, the execution environment is usually out of her control, so that even perfectly evenly split tasks can have different execution times. First, a node can experience transient or permanent faults, leading to the corresponding task being stuck and becoming the straggler. Faults can range from hardware failures to soft errors to thermal throttling due to overheating. Even if a node faults independently with a small probability, the probability of a straggler increases proportionally with the number of nodes, effectively limiting the performance benefit of scaling out. Second, nodes may be heterogeneous, leading to different execution times even for the same tasks. A large data center may have servers of different technology generations; modern multi-core processors can have heterogeneous cores to achieve energy proportionality, such as the ARM big.LITTLE architecture. Finally, the split tasks may have to share the nodes with other workloads, which introduces variations in task execution time and leads to stragglers.

Existing work has approached the problem of stragglers using two strategies: speculation and replication. Both are limited in their effectiveness and incur significant runtime overhead. Speculative execution [8, 14–16] monitors the progress of the split tasks and replicates the slower tasks before they become stragglers. However, speculative execution techniques have a fundamental limitation when dealing with extremely large-scale distributed systems: they must collect statistically significant samples of task performance to predict which tasks are likely to become stragglers, which not only incurs runtime overhead but also takes time. Speculative execution does not eliminate stragglers; it only makes them finish faster. For example, [17] shows that, even with speculative execution, stragglers still run 8 times slower than a task's median time. While speculation reacts to potential stragglers, replication copes with them proactively by replicating each task onto multiple workers and waiting for the fastest one. This strategy is shown to provide a 30% speed-up compared to speculation.

However, it trades off straggler avoidance against a significant constant overhead: to obtain a 30% speed-up, it must double or even triple the computation [17].

1.2 New Frontier: Coded Computation

Recently, forward error correction and other coding techniques have been shown to be effective in dealing with stragglers in distributed computation tasks [1, 18–24]. By exploiting coding redundancy, the final result is recoverable even if not all workers have finished their computations, thus reducing the delay caused by straggler nodes.

1.2.1 A Toy Example

We start our description of coded linear transform by considering a linear transform

Ax. Suppose that we have a distributed system with n worker nodes.

Figure 1.1: Example of replication and coded computation scheme in distributed linear transform.

The standard

row splitting technique used in parallel linear transforms splits the matrix A into n submatrices along the row side, i.e., A = [A_1; A_2; ...; A_n], then assigns each matrix block to a worker to launch a partial linear transform A_i x. The replication strategy employs n additional workers to guarantee that each partial linear transform A_i x is assigned to two workers. By employing n extra workers, this effectively copes with a single straggler no matter where it appears. Using a coded linear transform, to achieve the same straggler resilience, we can instead employ only one more worker to compute (A_1 + A_2 + ... + A_n)x besides A_1x, A_2x, ..., A_nx. As a result, we can compute Ax as soon as any n out of the n + 1 workers finish. For example, if the second worker is a straggler, we can use the results from worker 1, worker 3, ..., worker n + 1 to recover A_2x by the following simple decoding,

$$A_2x = \underbrace{(A_1 + A_2 + \cdots + A_n)x}_{\text{result of }(n+1)\text{th worker}} - \underbrace{A_1x}_{\text{1st worker}} - \underbrace{A_3x}_{\text{3rd worker}} - \cdots - \underbrace{A_nx}_{n\text{th worker}}.$$

Coded computation can be considered as a more efficient and flexible generalization of the aforementioned replication strategy. In this example, to achieve the same capability of coping with a single straggler, coded computation reduces the redundancy, i.e., the number of extra workers, by almost 50% compared to the replication strategy. This scheme can be generalized to cope with s stragglers using m − n extra workers (m ≥ n), even when the workers have uneven probabilities of straggling.
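To make the toy example concrete, the following sketch (illustrative Python with NumPy, not code from the dissertation; the block sizes and the straggler index are arbitrary assumptions) simulates the (n + 1)-worker sum code and the subtraction-based decoding described above.

```python
import numpy as np

n, r, t = 4, 8, 3                      # 4 row blocks of an 8 x 3 matrix
A = np.random.randn(r, t)
x = np.random.randn(t)
blocks = np.split(A, n, axis=0)        # A_1, ..., A_n

# Workers 1..n compute A_i x; worker n+1 computes (A_1 + ... + A_n) x.
results = [Ai @ x for Ai in blocks] + [sum(blocks) @ x]

# Suppose worker 2 (index 1) straggles: recover A_2 x from the other results.
straggler = 1
recovered = results[n] - sum(results[i] for i in range(n) if i != straggler)
assert np.allclose(recovered, blocks[straggler] @ x)

# Any n of the n + 1 results therefore suffice to assemble Ax.
y = np.concatenate(results[:straggler] + [recovered] + results[straggler + 1:n])
assert np.allclose(y, A @ x)
```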

1.2.2 Historical Development of Coded Computation

The work of Lee et al. [19] initiated the study of using coding techniques such as the

MDS code for mitigating stragglers in the distributed linear transformation problem and the regression problem. Subsequently, one line of studies centered on designing coding schemes for distributed linear transformation problems. Dutta et al. [18] constructed a deterministic coding scheme for the product of a matrix and a long vector. Lee et al. [25] designed a type of efficient 2-dimensional MDS code for the high-dimensional matrix multiplication problem. Yu et al. [21] proposed the optimal coding scheme, named the polynomial code, for the matrix multiplication problem.

Wang et al. [26, 27] further initiated the study of the computation load in the distributed transformation problem and designed several efficient coding schemes with low density.

The second line of works focuses on constructing coding schemes for distributed algorithms in machine learning applications. The work of [22] first addressed straggler mitigation in the linear regression problem using data encoding. The initial study in [1] presented an optimal trade-off between the computation load and straggler tolerance for any loss function. Two subsequent works [28, 29] considered approximate gradient evaluation and proposed the BGC scheme with less computation load compared to the scheme in [1]. Maity et al. [30] applied existing LDPC codes

to a linear regression model with sparse recovery. Ye et al. [31] further introduced the communication complexity of such problems and constructed an efficient code for reducing both the straggler effect and the communication overhead.

There exist some works that try to apply the central idea of coded computation to other domains. The work by [32] designs a coding scheme for the MapReduce system. It exploits the repetitive mapping of data blocks at different servers to create coded multicasting opportunities in the shuffling phase, cutting down the total communication load. They further apply such a scheme to the sorting problem and overcome the shuffling bottleneck of TeraSort [33]. The work [23] designs a coding scheme for the linear inverse problem. The work by [34] applies the MatDot code in the distributed nearest neighbor search algorithm. There also exist several works [20, 35–37] attempting to optimize the system performance of coded computation. For example, the work by [5] considers the design of a coded computation scheme for a heterogeneous cluster in which the computation power of each machine is different. The work [36] systematically analyzes how to allocate resources for a coded computation system.

1.3 Challenges in Coded Computation

As we discussed in the previous section, coded computation shares a similar problem structure with traditional coding scheme design in communication systems: it creates parity blocks to provide resilience to potential failures (stragglers).

However, there exist several new challenges posed by the architecture of distributed computation and its applications.

First, traditional coding scheme design focuses on performance metrics such as encoding and decoding complexity when the number of input symbols n is large, e.g., n > 10000. In coded computation, the number of input symbols is actually the number of data partitions, which is usually smaller than 100. The existing codes that provide asymptotically optimal performance may not work well for the coded computation problem. For example, the Luby transform (LT) code [38] can provide less than 3% parities (redundant workers) when the message length is larger than 10^5. However, a simple simulation shows that when n = 100, the LT code requires roughly 50 parities, which amounts to almost 50% redundancy in the number of workers for coded computation. Namely, the coded computation problem requires that we focus on designing coding schemes in the finite, or short, message-length regime.

The second challenge derives from the data sparsity in modern machine learning problems. To illustrate the main idea, we use the PageRank problem [39] and the existing polynomial code [21] as an example. This problem aims to measure the importance score of the nodes of a graph, and is typically solved by the following power iteration,

$$x_t = c \cdot r + (1 - c)Ax_{t-1}, \qquad (1.3.1)$$

where c = 0.15 is a constant and A ∈ R^{r×t} is the graph adjacency matrix. The standard distributed method to solve (1.3.1) is to partition A into n equal blocks {A_i}_{i=1}^{n} along the row side and store them in the memory of several workers. In each iteration, the master node broadcasts x_t to all workers, each worker computes a partial linear transform A_i x_t and sends it back to the master node, and the master node collects all the partial results and updates the vector. As we discussed, this approach may suffer from the straggler issue. In coded computation, the existing polynomial code [21] works as follows: in each iteration, the kth worker essentially stores a coded submatrix

$$\tilde{A}_k = \sum_{i=1}^{n} A_i \alpha_k^i, \qquad (1.3.2)$$

and computes a coded partial linear transform ỹ_k = Ã_k x. Here α_k is a given integer.

Then the master node can recover Ax_t by decoding the results from a subset of the workers and update x_t correspondingly. In practical applications such as web search, the graph adjacency matrix A is extremely large and sparse, i.e., nnz(A) ≪ rt. One can observe that, due to the sparsity of matrix A and the matrix additions, the density of the coded submatrices Ã_k will increase by up to a factor of n, and the time of the coded linear transform Ã_k x will increase to roughly O(n) times that of the simple uncoded one.
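The density blow-up described above is easy to reproduce. The following sketch (hypothetical Python using SciPy sparse matrices; the dimensions and density are arbitrary assumptions, not the dissertation's experiment) sums n sparse row blocks the way a fully dense code such as the polynomial code does, and reports how many nonzeros a single coded block carries compared with an uncoded one.

```python
import numpy as np
from scipy import sparse

# Summing n sparse row blocks multiplies the nonzeros of each coded block
# by roughly n, which is what slows down the per-worker computation.
n, rows, cols, density = 10, 2000, 20000, 1e-3
blocks = [sparse.random(rows, cols, density=density, format="csr") for _ in range(n)]

coded = sum(blocks)                      # one densely coded submatrix
uncoded_nnz = blocks[0].nnz
print("nnz of one uncoded block :", uncoded_nnz)
print("nnz of one coded block   :", coded.nnz)   # roughly n times larger
print("blow-up factor           :", coded.nnz / uncoded_nnz)
```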

Figure 1.2: Measured local computation time per worker. (a) Frequency versus computation and communication time (s) for the uncoded scheme and the polynomial code; (b) ratio of computation time versus density p for n = 10, 20, 30.

In Figure 1.2(a), we show measurements of the local computation and communication time required for m = 20 workers to run the polynomial code and the naive uncoded distributed scheme for the above problem, with dimension roughly equal to 10^6 and number of nonzero elements equal to 10^7. Our observation is that the final job completion time of the polynomial code is significantly larger than that of the uncoded scheme. The main reason is that the increased density of the input matrix leads to increased computation time. In Figure 1.2(b), we generate a 10^5 × 10^5 random square Bernoulli matrix with different densities p. We plot the ratio of the average local computation time between the polynomial code and the uncoded scheme versus the matrix density p. It can be observed that this ratio is generally large, on the order of O(n), when the matrix is sparse.

Based on this simple example, we can see that, in the coded linear transformation problem, the current coded computation schemes may destroy the data sparsity during data encoding and create a large redundancy during the computation phase. In other application domains such as distributed optimization, the coding scheme does not encode the training data and can instead be regarded as a type of data allocation strategy; the number of data partitions allocated to each worker directly determines the redundancy of the coded computation strategy. Therefore, how to optimally utilize the redundancy, i.e., how to create the encoded data from fewer original data partitions, becomes a critical question in coded computation.

1.4 Contribution and Thesis Organization

This dissertation addresses the above challenges for three important problems in modern large-scale machine learning: distributed linear transforms, distributed matrix multiplication, and distributed optimization (data parallelism). New coding schemes, algorithms, and a theoretical analysis framework are proposed to overcome the observed challenges.

In Chapter 2, we focus on the distributed coded linear transform problem.

We formulate the coded linear transform problem and systematically propose two performance metrics: the computation load and the recovery threshold. These metrics characterize the number of redundant computations and the resilience to stragglers of a given coded computation scheme. We derive information-theoretic lower bounds on the computation load and the recovery threshold. We further construct a code, named the diagonal code, that exactly matches these lower bounds. One important theoretical contribution is a new framework that relates

the design of a coded computation scheme to the construction of a bipartite graph that contains a perfect matching. Based on this framework, we design several coded schemes that achieve a significantly lower, i.e., constant, computation load from sparse random bipartite graphs.

In Chapter 3, we further consider a more complex problem, distributed coded matrix multiplication. Compared to the linear transform problem, the output of this problem is a matrix instead of a single vector. The simple inverse decoding used for the coded linear transform would lead to a large decoding complexity for this problem. To overcome this challenge, we propose a new coded matrix multiplication scheme that has a low computation load, yet enough structure that the decoding process can exploit it to provide fast decoding. The underlying analysis is based on our proposed theoretical framework and the construction of a random bipartite graph via a modified Soliton distribution.

In Chapter 4, we consider the coded distributed optimization problem. The distributed optimization problem focuses on computing a sum of functions across multiple workers. In many distributed machine learning training algorithms, this sum of functions can be regarded as a sum of partial gradients: the data are allocated to the workers, each worker computes a partial gradient from its partial data set, and the master receives the partial gradients and updates the classifier by summing them. We discuss the drawbacks of the existing schemes and systematically formulate an approximate gradient coding framework. We further provide the fundamental limit on the minimum computation load for the approximate gradient coding problem and construct a random code that asymptotically matches this lower bound. We implement all our proposed coded computation schemes for the above three problems on the Ohio Supercomputer Center [40] via MPI4py and real-world data sets, and show significant speedups compared to the state of the art.
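As a point of reference for Chapter 4, the following minimal sketch (hypothetical Python; the logistic-loss model, partition sizes, and learning rate are illustrative assumptions) shows the uncoded data-parallel pattern described above: each worker returns a partial gradient on its partition and the master sums them before taking a gradient step.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the logistic loss on one data partition (labels in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

# Uncoded data parallelism: each worker holds one partition and returns a
# partial gradient; the master aggregates them and updates the classifier.
rng = np.random.default_rng(0)
X = rng.standard_normal((1200, 20))
y = (rng.random(1200) < 0.5).astype(float)
partitions = np.array_split(np.arange(1200), 6)        # 6 workers

w, lr = np.zeros(20), 0.1
for _ in range(50):
    partial = [logistic_grad(w, X[idx], y[idx]) for idx in partitions]  # workers
    w -= lr * sum(partial) / len(partial)               # master aggregates
```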

Finally, concluding remarks and possible future research directions are presented in Chapter 5.

CHAPTER 2

CODED DISTRIBUTED LINEAR TRANSFORM

In this chapter, we first consider a distributed linear transformation problem, where

we aim to compute y = Ax for an input matrix A ∈ R^{r×t} and a vector x ∈ R^t. This problem is a key building block in machine learning and signal processing, and has been used in a large variety of application areas. Optimization-based training algorithms, such as gradient descent in regression and classification problems and backpropagation in deep neural networks, require the computation of large linear transforms of high-dimensional data. It is also the critical step in dimensionality reduction techniques such as principal component analysis and linear discriminant analysis.

2.1 Introduction

In the classical approach of distributed linear transforms with m workers, the input

matrix A is evenly divided into n (n ≤ m) submatrices A_i ∈ R^{r/n × t} along the row side. Each worker computes a partially coded linear transform and returns the result to the master node. Given a computation strategy, the recovery threshold is defined as the minimum number of workers that the master needs to wait for in order to compute Ax.

The existing MDS coded scheme is shown to achieve a recovery threshold Θ(m) [19].

An improved scheme proposed in [18], referred to as the short dot code, can offer a

larger recovery threshold Θ(m(1 + ε)) but with a constant reduction of the computation load by imposing some sparsity on the encoded submatrices. More recently, the work in [21] designed a type of polynomial code, which achieves the information-theoretically optimal recovery threshold n.

However, as we observed in Section 1.3, many problems in machine learning exhibit both extremely large-scale and sparse targeting data, i.e., nnz(A) ≪ rt. The

traditional coding schemes may destroy the data sparsity and create a "code straggler" (a large computation overhead). Inspired by this phenomenon, we propose a new metric, named the computation load, which is defined as the total number of submatrices the local workers access (a formal definition is given in Section 2.2). For example, the polynomial code achieves the optimal recovery threshold but a large computation load of mn. Certain suboptimal codes, such as the short-dot code, achieve a slightly lower computation load, but it is still on the order of O(mn). Specifically, we are interested in the following key question: can we find a coded linear transformation scheme that achieves the optimal recovery threshold and a low computation load?

In this chapter, we provide a fundamental limit on the minimum computation load: given n data partitions and s stragglers, the minimum computation load of any coded computation scheme is n(s + 1). We then design a novel scheme, which we call the s-diagonal code, that achieves both the optimum recovery threshold and the minimum computation load. We also exploit the diagonal structure to design a hybrid decoding algorithm, between peeling decoding and Gaussian elimination, that achieves a decoding time nearly linear in the output dimension, O(r).

We further show that, under random code designs, the computation load can be reduced even further, to a constant, with high probability. Specifically, we define a performance metric, the probabilistic recovery threshold, that allows the coded computation scheme to provide decodability with high probability. Based on this metric, there exist several schemes, i.e., the sparse MDS code [20] and the sparse code [24], that achieve the optimal probabilistic recovery threshold with a small computation load of Θ(n log(n)). In this work, we construct a new (d1, d2)-cross code with the optimal probabilistic recovery threshold but a constant computation load Θ(n). The comparison of the existing schemes and our results is listed in Table 2.1.

Table 2.1: Comparison of Existing Schemes in Coded Computation.

Scheme               | MDS code | Short-dot code | Poly code | s-MDS code | Sparse code | diag code | cross code
Recovery threshold   | Θ(m)     | Θ(m(1 + ε))    | n         | n*         | Θ(n)*       | n         | n*
Computation load / m | O(n)     | O(n(1 − ε))    | n         | Θ(log(n))* | Θ(log(n))*  | O(s)      | Θ(1)*

* The result holds with high probability, i.e., 1 − e^{−cn}.

The theoretical analysis showing the optimality of the recovery threshold is based on a determinant analysis of a random matrix in R^{m×n}. The state of the art in this field is limited to the Bernoulli case [41, 42], in which each element is an independently and identically distributed random variable. However, in our proposed (d1, d2)-cross code, the underlying coding matrix is generated based on a hypergeometric degree distribution, which leads to dependencies among the elements in the same row. To overcome this difficulty, we propose a new technical framework: we first utilize the Schwartz-Zippel Lemma [43] to reduce the determinant analysis problem to the analysis of the probability that a random bipartite graph contains a perfect matching. We then combine the random proposal graph and the probabilistic method to show that when n tasks are collected, the coefficient matrix is full rank with high probability.

Further, we apply our proposed random codes to the gradient coding problem.

This problem has wide applicability in mitigating the stragglers of distributed machine learning, and was first investigated in [1], which designs a cyclic code that achieves the optimum recovery threshold n with a computation load of s + 1 per worker. The work [44]

proposes an LDPC code to further reduce the average computation load to Θ(log(n)). Another line of works [22, 28, 45] tries to reduce the computation load by approximate gradient computation. In this chapter, we show that our constructed (d1, d2)-cross code can not only exactly recover the gradient (the sum of functions) but also provide a constant computation load Θ(1). Finally, we implement the constructed codes and demonstrate their improvement compared with existing strategies.

2.2 Problem Formulation

We are interested in distributedly computing a linear transform with a matrix A ∈ R^{r×t} and an input vector x ∈ R^t for some integers r, t. The matrix A is evenly divided along the row side into n submatrices,

$$A = [A_1^T, A_2^T, A_3^T, \ldots, A_n^T]^T. \qquad (2.2.1)$$

Suppose that we have a master node and m worker nodes. Worker i first stores a 1/n fraction of matrix A, defined as Ã_i ∈ R^{r/n × t}. It can then compute a partial linear transform ỹ_i = Ã_i x and return it to the master node. The master node waits only for the results of a subset of workers, {Ã_i x | i ∈ I ⊆ [m]}, to recover the final output y using certain decoding algorithms. The main framework is illustrated in Figure 2.1. Given the above system model, we can formulate the coded distributed linear transform problem based on the following definitions.

Definition 2.2.1. (Coded computation strategy) A coded computation strategy is an m × n coding matrix M = [m_{ij}]_{i∈[m], j∈[n]} that is used to compute each Ã_i,

$$\tilde{A}_i = \sum_{j=1}^{n} m_{ij} A_j, \quad \forall i \in [m]. \qquad (2.2.2)$$

Figure 2.1: Framework of coded distributed linear transform.

Then each worker i computes ỹ_i = Ã_i x.
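The following sketch (hypothetical Python, not the dissertation's implementation) encodes the data blocks according to equation (2.2.2) for an arbitrary coding matrix M; the small banded toy matrix used here is only an illustration.

```python
import numpy as np

def encode_blocks(M, A_blocks):
    """Coded submatrices of Definition 2.2.1: A~_i = sum_j M[i, j] * A_j."""
    return [sum(M[i, j] * A_blocks[j] for j in range(len(A_blocks)))
            for i in range(M.shape[0])]

# Toy instance: n = 4 data partitions, m = 5 workers.
n, m, r, t = 4, 5, 8, 3
A = np.random.randn(r, t)
x = np.random.randn(t)
A_blocks = np.split(A, n, axis=0)

M = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)

coded = encode_blocks(M, A_blocks)
partial_results = [Ai_tilde @ x for Ai_tilde in coded]   # worker computations
```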

This is a general definition covering a large class of coded computation schemes. For example, in the polynomial code [21], the coding matrix M is the Vandermonde matrix. In the MDS-type codes [18–20], M is a specific form of the corresponding generator matrices.

Definition 2.2.2. (Recovery threshold) A coded computation strategy M is k-recoverable if for any subset I ⊆ [m] with |I| = k, the master node can recover Ax from {ỹ_i | i ∈ I}. The recovery threshold κ(M) is defined as the minimum integer k such that strategy M is k-recoverable.

Regarding the recovery threshold, the existing work [21] has applied a cut-set type argument to show that the minimum recovery threshold of any scheme is

$$\kappa^* = \min_{M \in \mathbb{R}^{m \times n}} \kappa(M) \geq n. \qquad (2.2.3)$$

Definition 2.2.3. (Computation load) The computation load of strategy M is defined as l(M) = ‖M‖_0, the number of nonzero elements of the coding matrix.
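Both metrics can be checked by brute force for small parameters. The sketch below (hypothetical Python; the example matrix is the 1-diagonal-style code used above with n = 4) computes l(M) and verifies k-recoverability for k = n, which for linear decoding reduces to requiring every n × n row submatrix to be full rank.

```python
import itertools
import numpy as np

def computation_load(M):
    """l(M): number of nonzero entries of the coding matrix (Definition 2.2.3)."""
    return np.count_nonzero(M)

def is_n_recoverable(M):
    """Brute-force check of Definition 2.2.2 for k = n: every n x n submatrix
    built from n rows of M must be full rank."""
    m, n = M.shape
    return all(np.linalg.matrix_rank(M[list(rows), :]) == n
               for rows in itertools.combinations(range(m), n))

M = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)
print("computation load l(M) =", computation_load(M))   # 8 = n(s + 1) with s = 1
print("recovery threshold 4?  ", is_n_recoverable(M))    # True for this matrix
```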

2.3 Fundamental Limits and Optimal Code

In this section, we describe the optimum computation load and the s-diagonal code that exactly matches this lower bound. We then provide a fast decoding algorithm with nearly linear decoding time.

2.3.1 Fundamental Limits on Computation Load

We first establish the lower bound on the computation load, i.e., the density of coding matrix M.

Theorem 2.3.1. (Optimum computation load) For any coded computation scheme M ∈ R^{m×n} using m workers that can each store a 1/n fraction of A, to resist s stragglers, we have

$$l(M) \geq n(s + 1). \qquad (2.3.1)$$

The polynomial code [21] achieves the optimum recovery threshold of n. However, its coding matrix (a Vandermonde matrix) is fully dense, i.e., l(M) = nm, far

beyond the above bound. The short-dot code [18] can reduce the computation load but sacrifices the recovery threshold. Therefore, a natural question arises: can we design a code that achieves this lower bound? We will answer this question in the sequel of this chapter.

2.3.2 Diagonal Code

We now present the s-diagonal code that achieves both the optimum recovery threshold and optimum computation load for any given parameter values of n, m and s.

Definition 2.3.1. (s-diagonal code) Given parameters m, n and s, the s-diagonal code is defined as

$$\tilde{A}_i = \sum_{j=\max\{1,\, i-s\}}^{\min\{i,\, n\}} m_{ij} A_j, \quad \forall i \in [m], \qquad (2.3.2)$$

where each coefficient m_{ij} is chosen from a finite set S independently and uniformly at random.
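A direct construction of the coding matrix in Definition 2.3.1 might look as follows (hypothetical Python; here S is simply a set of small positive integers rather than a finite field, and the parameter values are arbitrary).

```python
import numpy as np

def s_diagonal_matrix(n, m, s, coeff_set, rng):
    """Coding matrix of the s-diagonal code (Definition 2.3.1): row i has
    random nonzero coefficients in columns max(1, i-s) .. min(i, n)."""
    M = np.zeros((m, n))
    for i in range(1, m + 1):                      # 1-based worker index
        lo, hi = max(1, i - s), min(i, n)
        M[i - 1, lo - 1:hi] = rng.choice(coeff_set, size=hi - lo + 1)
    return M

rng = np.random.default_rng(0)
coeff_set = np.arange(1, 97)                       # a finite coefficient set S
M = s_diagonal_matrix(n=6, m=8, s=2, coeff_set=coeff_set, rng=rng)
print(M)                                           # banded, (s+1)-wide rows
```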

The reason we name this code as the s-diagonal code is that the nonzero positions of the coding matrix M exhibit the following block diagonal structure.

$$M^T = \begin{bmatrix}
\overbrace{\ast \; \cdots \; \ast}^{s+1} & 0 & \cdots & 0 & 0 \\
0 & \ast \; \cdots \; \ast & 0 & \cdots & 0 \\
\vdots & & \ddots & & \vdots \\
0 & \cdots & 0 & \ast \; \cdots & \ast
\end{bmatrix},$$

where ∗ indicates the nonzero entries of M. Before we analyze the recovery threshold of the diagonal code, the following example provides an instantiation for n = 4, s = 1 and m = 5.

Example 1: (1-Diagonal code) Consider a distributed linear transform task Ax using m = 5 workers. We evenly divide the matrix A along the row side into 4

submatrices: A = [A_1^T, A_2^T, A_3^T, A_4^T]^T. Given this notation, we need to compute the 4 uncoded components {A_1x, A_2x, A_3x, A_4x}. Based on the definition of the 1-diagonal code, each worker stores the following submatrices: Ã_1 = A_1, Ã_2 = A_1 + A_2, Ã_3 = A_2 + A_3, Ã_4 = A_3 + A_4, Ã_5 = A_4. Suppose that the first worker is a straggler and the master node receives results from workers {2, 3, 4, 5}. According to the above coded computation strategy, we have

$$\begin{bmatrix} \tilde{y}_2 \\ \tilde{y}_3 \\ \tilde{y}_4 \\ \tilde{y}_5 \end{bmatrix}
= \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} A_1x \\ A_2x \\ A_3x \\ A_4x \end{bmatrix} \qquad (2.3.3)$$

The coefficient matrix is an upper triangular matrix, which is invertible since the elements on the main diagonal are nonzero. We can then recover the uncoded components

{A_i x} by direct inversion of the above coefficient matrix. The decodability in the other 4 possible scenarios can be proved similarly. Therefore, this code achieves the optimum recovery threshold of 4.
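Example 1 is easy to verify numerically; the sketch below (illustrative Python with random data, not the dissertation's code) builds the four coded results for workers {2, 3, 4, 5} and inverts the upper-triangular system (2.3.3).

```python
import numpy as np

# Numerical check of Example 1: worker 1 straggles, the master solves the
# upper-triangular system in (2.3.3) to recover A_1 x, ..., A_4 x.
r, t = 8, 3
A = np.random.randn(r, t)
x = np.random.randn(t)
A1, A2, A3, A4 = np.split(A, 4, axis=0)

y2, y3, y4, y5 = (A1 + A2) @ x, (A2 + A3) @ x, (A3 + A4) @ x, A4 @ x
C = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)

recovered = np.linalg.solve(C, np.stack([y2, y3, y4, y5]))   # rows are A_i x
assert np.allclose(recovered.ravel(), A @ x)
```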

Obviously, the computation load of the s-diagonal code is optimal and can be easily obtained by counting the number of nonzero elements of coding matrix M. The following result gives us the recovery threshold of the s-diagonal code.

Theorem 2.3.2. Let the finite set S satisfy |S| ≥ 2n²C_m^n. Then there exists an s-diagonal code that achieves the recovery threshold n, and it can be constructed, on average, in 2 trials.

The theoretical framework on analyzing the recovery threshold is provided in

Section 2.4.

2.3.3 Fast Decoding Algorithm

The original decoding algorithm of s-diagonal code is based on inverting the coding

matrix, and the mapping from [ỹ_{i_1}, ỹ_{i_2}, ..., ỹ_{i_n}] to the vector [y_1, y_2, ..., y_n] incurs a complexity of O(nr). We next show that this can be further reduced by a hybrid decoding algorithm between peeling decoding and Gaussian elimination. The key idea is to utilize the peeling decoding algorithm to reduce the number of blocks recovered by the above mapping, and Gaussian elimination to guarantee the existence of a ripple node.

Suppose that the master node receives results from workers indexed by U = {i_1, ..., i_n} ⊆ [n + s] with 1 ≤ i_1 ≤ ⋯ ≤ i_n ≤ n + s. Let M^U be the n × n submatrix consisting of the rows of M indexed by U. Let the index k ∈ [n] satisfy i_k ≤ n < i_{k+1}. Then recover the blocks indexed by [n]\{i_1, ..., i_k} from the following rooting step, and add the recovered results to the ripple set R = {A_i x}_{i ∈ [n]\{i_1, ..., i_k}}.

Lemma 2.3.1. (Rooting step) If rank(M^U) = n, then for any k_0 ∈ {1, 2, ..., n}, we can recover the particular block A_i x with column index k_0 in matrix M^U via the following linear combination:

$$A_i x = \sum_{k=1}^{n} u_k \tilde{y}_k. \qquad (2.3.4)$$

The vector u = [u_1, ..., u_n]^T can be determined by solving u^T M^U = e_{k_0}, where e_{k_0} ∈ R^{1×n} is a unit vector with a single 1 located at index k_0.
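A minimal sketch of the rooting step, assuming M^U is a full-rank NumPy array and `results` holds the n received vectors ỹ_k (the function and variable names are mine, not the dissertation's):

```python
import numpy as np

def rooting_step(M_U, results, k0):
    """Rooting step of Lemma 2.3.1: recover the k0-th uncoded block as the
    combination sum_k u_k y~_k, where u solves u^T M_U = e_{k0}."""
    n = M_U.shape[0]
    e = np.zeros(n)
    e[k0] = 1.0
    u = np.linalg.solve(M_U.T, e)              # u^T M_U = e_{k0}
    return sum(u[k] * results[k] for k in range(n))

# For the Example 1 submatrix of workers {2, 3, 4, 5}, the solved u is
# [1, -1, 1, -1], i.e., A_1 x = y~_2 - y~_3 + y~_4 - y~_5.
M_U = np.array([[1, 1, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 1, 1],
                [0, 0, 0, 1]], dtype=float)
print(np.linalg.solve(M_U.T, np.array([1., 0., 0., 0.])))
```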

Then the master node goes to a peeling decoding process: the master node first

finds a ripple in the set R, i.e., some A_i x. For each collected result ỹ_j, it subtracts this block if the computation task Ã_j x contains this block (M_{ji} ≠ 0). If the set R is empty, the master node finds a new ripple that computes an uncoded task among the remaining workers, adds it to the set R, and continues the above process.

Algorithm 1: Fast decoding algorithm for s-diagonal code

Receive n results with coefficient matrix M^U.
Find the index k ∈ [n] with i_k ≤ n < i_{k+1}.
Recover the blocks indexed by [n]\{i_1, ..., i_k} by the rooting step (2.3.4).
Construct the ripple set R = {A_i x}_{i ∈ [n]\{i_1, ..., i_k}}.
repeat
    if R is not empty then
        Choose a result A_i x from R.
        for each computation result ỹ_j do
            if M_{ji} is nonzero then
                ỹ_j = ỹ_j − m_{ji} A_i x and set m_{ji} = 0.
            end if
        end for
    else
        Find a row M^U_{i'} in matrix M^U with ‖M^U_{i'}‖_0 = 1.
        R = R ∪ {ỹ_{i'}}.
    end if
until every block of vector y is recovered.
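For readers who prefer working code over pseudocode, the following is a compact Python rendering of the same hybrid idea (a sketch under the assumption that M^U is full rank; it is not optimized and is not the dissertation's implementation). It peels degree-one rows whenever a ripple is available and falls back on the rooting step otherwise.

```python
import numpy as np

def hybrid_decode(M_U, results):
    """Hybrid peeling/rooting decoder sketch for a full-rank n x n system."""
    M = M_U.astype(float).copy()
    y = [r.astype(float).copy() for r in results]
    n = M.shape[1]
    blocks = [None] * n
    while any(b is None for b in blocks):
        deg1 = [i for i in range(n) if np.count_nonzero(M[i]) == 1]
        if deg1:                                   # peel a ripple
            i = deg1[0]
            j = int(np.flatnonzero(M[i])[0])
            blocks[j] = y[i] / M[i, j]
        else:                                      # rooting-step fallback
            j = next(k for k in range(n) if blocks[k] is None)
            e = np.zeros(n)
            e[j] = 1.0
            u = np.linalg.solve(M_U.T.astype(float), e)
            blocks[j] = sum(u[k] * results[k] for k in range(n))
        for i in range(n):                         # subtract the known block
            if M[i, j] != 0:
                y[i] = y[i] - M[i, j] * blocks[j]
                M[i, j] = 0.0
    return np.concatenate(blocks)
```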

Example 2: (Fast decoding algorithm) Consider a similar setup with Example

1. Suppose that we receive the results from the workers indexed by U = {2, 3, 4, 5}. The naive inverse decoding algorithm leads to a decoding complexity of 4r (the complexity of the inverse mapping of (2.3.3)). In the fast decoding algorithm, we first recover the block

A_1x (index {1, 2, 3, 4}\{2, 3, 4}) by the rooting step A_1x = ỹ_2 − ỹ_3 + ỹ_4 − ỹ_5. Then we start the peeling decoding process: (i) recover the block A_2x by ỹ_2 − A_1x and add the result to the set R; (ii) recover the block A_3x by ỹ_3 − A_2x and add the result to the set R; (iii) recover the block A_4x by ỹ_4 − A_3x. Overall, the new algorithm has complexity 7r/4, which is smaller than 4r.

The main procedure is listed in Algorithm 1, and the decoding complexity is given by the following theorem.

Theorem 2.3.3. (Nearly linear time decoding of s-diagonal codes) Algorithm 1 uses at most s rooting steps (2.3.4), and the total decoding time is

O(rs).

2.4 Graph-based Analysis of Recovery Threshold

In this section, we present our main technical tool for analyzing the recovery threshold of the proposed code, which is also the basis of our random code construction in the next section.

For each subset U ⊆ [m] with |U| = n, let M^U be the n × n submatrix consisting of the rows of M indexed by U. To prove that our s-diagonal code achieves the recovery threshold of n, we need to show that all the n × n submatrices M^U are full rank. The basic idea is to reduce the full rank analysis to the analysis of the existence of a perfect matching in a corresponding bipartite graph. We first define the following bipartite graph model between the set of m workers indexed by [m] and the set of n data partitions indexed by [n].

Definition 2.4.1. Let G^D(V_1, V_2) be a bipartite graph with V_1 = [m] and V_2 = [n]. Each node i ∈ V_1 is connected to node j ∈ V_2 if M_{ij} ≠ 0.
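The connection between the support of M and this bipartite graph can be explored directly. The sketch below (hypothetical Python relying on scipy.sparse.csgraph.maximum_bipartite_matching) tests whether the graph induced by a square submatrix M^U contains a perfect matching, which, by Theorem 2.4.1 below, is what allows a random assignment of the nonzero entries to be full rank with high probability.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def has_perfect_matching(M_U):
    """Check whether the bipartite graph induced by the support of the square
    submatrix M_U (rows = workers in U, columns = data partitions) contains
    a perfect matching."""
    support = csr_matrix((M_U != 0).astype(int))
    match = maximum_bipartite_matching(support, perm_type='column')
    return bool(np.all(match >= 0))

M_U = np.array([[1, 1, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 1, 1],
                [0, 0, 0, 1]])
print(has_perfect_matching(M_U))          # True: this submatrix can be full rank
```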

Definition 2.4.2. Define the Edmonds matrix M(x) ∈ R(x)^{m×n} of graph G^D(V_1, V_2) with [M(x)]_{ij} = x_{ij} if nodes i ∈ V_1 and j ∈ V_2 are connected, and [M(x)]_{ij} = 0 otherwise.

Based on this notation, our main result is given by the following theorem.

Theorem 2.4.1. For a coded computation scheme M ∈ R^{m×n}, if every subgraph G^D(U, V_2) of G^D(V_1, V_2) with U ⊆ [m] and |U| = n contains a perfect matching, then the recovery threshold κ(M) = n with probability at least 1 − n²C_m^n / |S|.

D from set S independently and uniformly at random. Given a subgraph G (U, V2) of

D U G (V1,V2), let M (x) be the corresponding Edmonds matrix. Then the probability that matrix MU is full rank is equal to the probability that the determinant of the 22 Edmonds matrix MU (x) is nonzero at the given value x. The following technical lemma from [43] provides a simple lower bound of such an event.

Lemma 2.4.1. (Schwartz-Zeppel Lemma) Let f(x1, . . . , xn2 ) be a nonzero polynomial

2 with degree n . Let S be a finite set in R. If we assign each variable a value from S independently and uniformly at random, then

2 P(f(x1, x2, . . . , xN ) = 0) 1 n / S . (2.4.1) 6 ≥ − | |

D A classical result in graph theory is that a bipartite graph G (U, V2) contains a perfect matching if and only if the determinant of the Edmonds matrix, i.e., MU (x) , | | is a nonzero polynomial. Combining this result with Schwartz-Zeppel Lemma, we can reduce the analysis of the full rank probability of the submatrix MU to the probability

D that the subgraph G (U, V2) contains a perfect matching.

U U U U P( M = 0) = P( M = 0 M (x) 0) P( M (x) 0) + | | 6 | | 6 | | 6≡ · | | 6≡ m contains perfect matching S-Z Lemma: ≥1−1/2Cn

U U U P| ( M = 0{zM (x) 0)} P(|M (x{z) 0)} (2.4.2) | | 6 | | ≡ · | | ≡ 0

| {z } Therefore, utilizing the union bound, we conclude that the probability that there m exists a submatrix MU is not full rank is upper bound by n2 / S . n | |   The next technical lemma shows that for all subsets U [m] with U = n, the ⊆ | | D subgraph G (U, V2) exactly contains a perfect matching for diagonal code. Therefore, m let S 2n2 , we conclude that, with probability at least 1/2, all the n n | | ≥ n ×   submatrices of M are full rank. Since we can generate the coding matrix M offline, with a few rounds of trials (2 on average), we can find a coding matrix with all n n × submatrices being full rank. Therefore, we arrive the Theorem 2.3.2.

23 Lemma 2.4.2. Let M be constructed as Definition 2.3.1 and the bipartite graph

D G (V1,V2) be constructed as Definition 2.4.1. For each U [m] with U = n, the ⊆ | | D subgraph G (U, V2) contains a perfect matching.

One special case of the s-diagonal code is that, when we are required to resist only one straggler, all the nonzero elements of matrix M can be equal to 1.

Corollary 2.4.1.1. Given the parameters n and m = n + 1, define the 1-diagonal code: A , i = 1; A , i = n + 1 ˜ 1 n Ai =  . Ai−1 + Ai, 2 i n  ≤ ≤ It achieves the optimum computation load 2n and optimum recovery threshold n.

2.5 Random Code: “Break” the Limits

In this section, we utilize our proposed theoretical framework in Theorem 2.4.1 to construct a random code that achieve the optimal recovery threshold with high probability but with constant computation load.

2.5.1 Probabilistic Recovery Threshold

In practice, the stragglers randomly occur in each worker, and any specific straggling configuration happens with very low probability. We first demonstrate the main idea through the following motivating examples.

Example 3: Consider a distributed linear transform task with n = 20 data partitions, s = 5 stragglers and m = 25 workers. The recovery threshold of 20 implies that all C20 = 53130 square 20 20 submatrices of coding matrix are full rank. 25 × Suppose that a worker being a straggler is identically and independently Bernoulli random variable with probability 10%. Then, the probability that workers 1, 2, 3, 4, 5 { } 24 are all stragglers is 10−5. Now, if there exists a scheme that can guarantee that the master can decode the results in all configurations except the straggling configuration

1, 2, 3, 4, 5 , we can argue that this scheme achieves a recovery threshold 20 with { } probability 1 10−5. − Example 4: Consider the same scenario of Example 1 (n = 4, s = 1, m = 5). ˜ We change the coded computation strategy of the second worker from A2 = A1 + A2 ˜ to A2 = A2. Based on the similar analysis, we can show that the new strategy can recover the final result from 4 workers except the scenario that the first worker is a straggler. Therefore, the new strategy achieves recovery threshold 4 with probability

0.75, and reduces the computation load by 1.

Based on the above two examples, we observe that the computation load can be reduced when we allow the coded computation strategy to fail in some specific scenarios. Formally, it motivates us to define the following metric.

Definition 2.5.1. (Probabilistic recovery threshold) A coded computation strategy M is probabilistic k-recoverable if for each subset I [m] with I = κ(M), the master ⊆ | | ˜ −n node can recover Ax from Aix i I with high probability, i.e., 1 O(2 ). The { | ∈ } − probabilistic recovery threshold κ(M) is defined as the minimum integer k such that strategy M is probabilistic k-recoverable.

The new definition provides a probabilistic relaxation such that a small vanishing percentage (as n ) of all straggiling configurations, are allowed to be unrecoverable. → ∞ In the sequel, we show that, under such a relaxation, one can construct a coded computation scheme that achieves a probabilistic recovery threshold n and a constant

(regarding parameter s) computation load.

25 2.5.2 Construction of the Random Code

Based on our previous analysis of the recovery threshold of the s diagonal code, we − show that, for any subset U [m] with U = n, the probability that MU is full rank ⊆ | | is lower bounded by the probability (multiplied by 1 o(1)) that the corresponding − subgraph G(U, V2) contains a perfect matching. This technical path motivates us to utilize the random proposal graph to construct the coded computation scheme. The

first one is the following p-Bernoulli code, which is constructed from the ER random bipartite graph model [46].

Definition 2.5.2. (p-Bernoulli code) Given parameters m, n, construct the coding matrix M as follows:

tij, with probability p mij =  . (2.5.1) 0, with probability 1 p  − where tij is picked independently and uniformly from the finite set S.

Theorem 2.5.1. For any parameters m, n and s, if p = 2 log(n)/n, the p-Bernoulli code achieves the probabilistic recovery threshold n.

This result implies that each worker of the p-Bernoulli code requires accessing

2 log(n) submatrices on average, which is independent of number of the stragglers.

Note that the existing work in distributed functional computation [20] proposes a random sparse code that also utilizes Bernoulli random variables to construct the coding matrix. There exist two key differences: (i) the elements of our matrix can be integer valued, while the random sparse code adopts real-valued matrix; (ii) the density of the p-Bernoulli code is 2 log(n), while the density of the random sparse code is an unknown constant. The second random code is the following (d1, d2)-cross code.

Definition 2.5.3. ((d1, d2)-cross code) Given parameters m, n, construct the coding matrix M as follows: (1) Each row (column) independently and uniformly chooses d1 26 (d2) nonzero positions; (2) For those nonzero positions, assign the value independently and uniformly from the finite set S.

The computation load of (d1, d2)-cross code is upper bounded by d1m + d2n. The numerical analysis of proposed random codes can be seen in Appendix. The next theorem shows that a constant choice of d1 and d2 can guarantee the probabilistic recovery threshold n.

Theorem 2.5.2. For any parameters m, n, if s = poly(log(n)), the (2, 3)-cross code achieves the probabilistic recovery threshold n. If s = Θ(nα), α < 1, the (2, 2/(1 α))- − cross code achieves the probabilistic recovery threshold n.

The proof of this theorem is based on analyzing the existence of perfect matching in a random bipartite graph constructed as follows: (i) each node in the left partition randomly and uniformly connects to d1 nodes in the opposite class; (ii) each node in the right partition randomly and uniformly connects to l nodes in the opposite class, where l is chosen under a specific degree distribution. This random graph model can be regarded as a generalization of Walkup’s 2-out bipartite graph model [47]. The main technical difficulty in this case derives from the intrinsic complicated statistical model of the node degree of the right partition.

2.5.3 Numerical Results of Random Code

We examine the performance of the proposed p-Bernoulli code and (d1, d2)-cross code in terms of the convergence speed of full rank probability and computation load. In Fig. 2.2 and Fig. 2.2 , we plot the percentage of the full rank n n square × submatrix and the average computation load l(M)/m of each scheme, based on 1000 experimental runs. Each column of the (2, 2.5)-code independently and randomly chooses 2 or 3 nonzero elements with equal probability. It can be observed that the

27 full rank probability of both p-Bernoulli code and (d1, d2)-cross code converges to 1 for relatively small values of n. The (d1, d2)-cross code exhibits even faster convergence and much less computation load compared to the p-Bernoulli code. For example, when n = 20, s = 4, the (2, 2)-cross code achieves the full rank probability of 0.86 and average computation load 3.4. This provides evidence that (d1, d2)-cross code is useful in practice. Moreover, in practice, one can use random codes by running multiple rounds of trails to find a “best” coding matrix with even higher full rank probability and lower computation load.

2.6 Experimental Results

In this section, we present the experimental results on the Ohio Supercomputer Center [40]. We compare our proposed coding schemes, including the s-diagonal code and the (d1, d2)-cross codes, against the following existing schemes on both the single matrix-vector multiplication and the gradient coding problem: (i) uncoded scheme: the input matrix is divided uniformly across all workers without replication and the master waits for all workers to send their results; (ii) sparse MDS code [20]: the generator matrix is a sparse random Bernoulli matrix with average computation overhead Θ(log(n)); (iii) polynomial code [21]: a coded matrix multiplication scheme with optimum recovery threshold and nearly linear decoding time; (iv) short dot code [18]: appends dummy vectors to the data matrix A before applying the MDS code, which provides some sparsity of the encoded data matrix at the cost of an increased recovery threshold; (v) LT code [38]: a rateless code widely used in broadcast communication. It achieves an average computation load of Θ(log(n)) and a nearly linear decoding time using a peeling decoder. To simulate straggler effects in a large-scale system, we randomly pick s workers that run a background thread. More details can be seen in the

Appendix.

Figure 2.2: Statistical convergence speed of full rank probability and average computation load of the random codes under number of stragglers s = 2. The panels plot the full rank probability and the average computation load versus the number of data partitions n, for the p-Bernoulli code with p = 1.6 log(n)/n, 1.8 log(n)/n, 2 log(n)/n and for the (2,2)-, (2,2.5)- and (2,3)-cross codes.

We first use a matrix with r = t = 1048576 and nnz(A) = 89239674 from the data sets in [48], and evenly divide this matrix into n = 12 and 20 partitions. In Figure 2.4 (a)(b), we report the job completion time under s = 2 and s = 4, based on 20 experimental runs. It can be observed that the (2, 2)-cross code outperforms the uncoded scheme (taking roughly 50% of its time), the LT code (70%), the sparse MDS code (60%), the polynomial code (20%) and our s-diagonal code. Moreover, we compare our proposed s-diagonal code with the (2, 2)-cross code versus the number of stragglers s. As shown in the lower panels of Figure 2.4, when the number of stragglers s increases, the job completion time of the s-diagonal code increases while that of the (2, 2)-cross code remains roughly unchanged. Another interesting observation is that the irregularity of the worker load can decrease the I/O contention. For example, when s = 2, the computation load of the 2-diagonal code is similar to that of the (2, 2)-cross code, which is equal to 36 in the case of n = 12. However, the (2, 2)-cross code costs less time due to the randomness of the worker load.

Figure 2.3: Statistical convergence speed of full rank probability and average computation load of the (d1, d2)-cross code under number of stragglers s = 3, 4, for the (2,2)-, (2,2.5)- and (2,3)-cross codes.

We finally compare our proposed codes with existing schemes in a gradient coding

Figure 2.4: Comparison of total time including transmission, computation and decoding time for n = 12, 20 and s = 2, 4. The upper panels compare all schemes under s = 2, 4; the lower panels compare the s-diagonal code and the (2, 2)-cross code for s = 1, ..., 5.

problem. Details can be seen in the Appendix. We use data from the LIBSVM dataset repository with r = 19264097 samples and t = 1163024 features. We evenly divide the data matrix A into n = 12 submatrices. In Fig. 2.5, we plot the magnitude of the scaled gradient ‖ηA^⊤(Ax − b)‖ versus the running time of the above seven different schemes under n = 12 and s = 2, 4. Among all experiments, we can see that the

(2, 2)-cross code converges at least 30% faster than the sparse MDS code, 2 times faster than both the uncoded scheme and the LT code, and at least 4 times faster than the short dot and polynomial codes. The (2, 2)-cross code performs similarly to the s-diagonal code when s = 2 and converges 30% faster than the s-diagonal code when the number of stragglers increases to s = 4.

Figure 2.5: Magnitude of gradient versus time for number of data partitions n = 12, 20 and number of stragglers s = 2, 4.

CHAPTER 3

CODED DISTRIBUTED MATRIX MULTIPLICATION

In this chapter, we further consider a distributed matrix multiplication problem, where

we aim to compute C = A^⊤B from input matrices A ∈ R^{s×r} and B ∈ R^{s×t} for some integers r, s, t. This problem is the key building block in machine learning and signal processing problems, and has been used in a large variety of application areas including classification, regression, clustering and feature selection problems.

3.1 Introduction

In a general setting with N workers, each input matrix A, B is divided into m, n

submatrices, respectively. Each worker computes a partial result A_i^⊤B_j, and the master node has to collect the results from all workers to output the matrix C. The setup of coded matrix multiplication is quite similar to the previously discussed coded linear transform problem. We can regard the coded matrix multiplication problem as a coded linear transformation with mn input blocks. The only difference is that we need to recover a matrix instead of a single vector. Such a property poses a new challenge in this problem. For example, if we adopt the proposed p-Bernoulli code in Section 2.5.2, we can achieve a low computation load O(ln(mn)) and the optimal recovery threshold Θ(mn) with high probability. However, the decoding complexity of the simple inverse decoding in this case is O(mn ln(mn) nnz(C)), which creates a large decoding redundancy, i.e., roughly 100 times the output size nnz(C) when mn = 30.

In this chapter, we overcome this challenge by designing a novel coded computation strategy, which we call the sparse code. It achieves a near-optimal recovery threshold Θ(mn) by exploiting the coding advantage in local computation. Moreover, such a coding scheme can exploit the sparsity of both the input and output matrices, which leads to a low computation load, i.e., O(ln(mn)) times that of the uncoded scheme, and a nearly linear decoding time O(nnz(C) ln(mn)).

The basic idea of the sparse code is the following: each worker chooses a random number of input submatrices based on a given degree distribution P, and then computes a weighted linear combination Σ_{i,j} w_{ij} A_i^⊤B_j, where the weights w_{ij} are randomly drawn from a finite set S. When the master node receives a set of finished tasks such that the coefficient matrix formed by the weights w_{ij} is full rank, it starts a hybrid decoding algorithm between peeling decoding and Gaussian elimination to recover the resultant matrix C.

We prove the optimality of the sparse code by carefully designing the degree distribution P and the algebraic structure of set S. The recovery threshold of the sparse code is mainly determined by how many tasks are required such that the coefficient matrix is full rank and the hybrid decoding algorithm recovers all the results. We design a type of Wave Soliton distribution (definition is given in

Section 3.4), and show that, under such a distribution, when Θ(mn) tasks are finished, the hybrid decoding algorithm will successfully decode all the results with decoding time O(nnz(C) ln(mn)).

Moreover, based on our proposed theoretical framework in Section 2.4, we reduce the full rank analysis of the coefficient matrix to the analysis of the probability that a random bipartite graph contains a perfect matching. Then we combine the

combinatoric graph theory and the probabilistic method to show that when mn tasks are collected, the coefficient matrix is full rank with high probability. We further utilize the above analysis to formulate an optimization problem to determine the optimal degree distribution P when mn is small. We finally implement and benchmark the sparse code at the Ohio Supercomputer Center [40], and empirically demonstrate its performance gain compared with the existing strategies.

Table 3.1: Comparison of Existing Coding Schemes

Scheme        | Recovery threshold | Computation overhead^1 | Decoding time
MDS           | Θ(N)               | Θ(mn)                  | Õ(rt)^2
sparse MDS    | Θ*(mn)^2           | Θ(ln(mn))              | Õ(mn · nnz(C))
product code  | Θ*(mn)             | Θ(mn)                  | Õ(rt)
LDPC code     | Θ*(mn)             | Θ(ln(mn))              | Õ(rt)
polynomial    | mn                 | mn                     | Õ(rt)
our scheme    | Θ*(mn)             | Θ(ln(mn))              | Õ(nnz(C))

^1 Computation overhead is the time of local computation relative to the uncoded scheme.
^2 Õ(·) omits logarithmic terms and Θ*(·) refers to a result that holds with high probability.

3.2 Preliminary

We are interested in a matrix multiplication problem with two input matrices A ∈ R^{s×r} and B ∈ R^{s×t} for some integers r, s, t. Each input matrix A and B is evenly divided along the column side into m and n submatrices, respectively:

A = [A1, A2, ..., Am] and B = [B1, B2, ..., Bn]. (3.2.1)

Then computing the matrix C is equivalent to computing the mn blocks C_{ij} = A_i^⊤B_j. Let the set W = { C_{ij} = A_i^⊤B_j | 1 ≤ i ≤ m, 1 ≤ j ≤ n } denote these components. Given this notation, the coded distributed matrix multiplication problem can be described as follows: define N coded computation functions, denoted by

f = (f1, f2, . . . , fN ).

Each local function f_i is used by worker i to compute a submatrix C̃_i = f_i(W) ∈ R^{(r/m)×(t/n)} and return it to the master node. The master node waits only for the results of a subset of workers {C̃_i | i ∈ I ⊆ {1, ..., N}} to recover the final output C using certain decoding functions. For any integer k, the recovery threshold k(f) of a coded computation strategy f is defined as the minimum integer k such that the master node can recover the matrix C from the results of any k workers. The framework is illustrated in Figure 3.1.

Figure 3.1: Framework of coded distributed matrix multiplication. The master node assigns the data partitions A_1, ..., A_m and B_1, ..., B_n to the N workers, collects the coded results C̃_i, and decodes the output C.

The main result of this chapter is the design of a new coded computation scheme, which we call the sparse code, that has the following performance.

Theorem 3.2.1. The sparse code achieves a recovery threshold Θ(mn) with high probability, while allowing nearly linear decoding time O(nnz(C) ln(mn)) at the master node.

As shown in Table 3.1, compared to the state of the art, the sparse code provides order-wise improvements in terms of the recovery threshold, computation overhead and decoding complexity. Specifically, the decoding time of the MDS code [19], product code [25], LDPC code and polynomial code [21] is O(rt ln^2(mn ln(mn))), which depends on the dimension of the output matrix. Instead, the proposed sparse code exhibits a decoding complexity that is nearly linear in the number of nonzero elements of the output matrix C, which is typically far smaller than the product of its dimensions. Although the decoding complexity of the sparse MDS code is linear in nnz(C), it also depends on mn. To the best of our knowledge, this is the first coded distributed matrix multiplication scheme with decoding complexity independent of the output dimension.

Regarding the recovery threshold, the existing work [21] has applied a cut-set type argument to show that the minimum recovery threshold of any scheme is

K* = min_f k(f) = mn. (3.2.2)

The proposed sparse code matches this lower bound within a constant gap, with high probability.

3.3 Sparse Codes

In this section, we first demonstrate the main idea of the sparse code through a motivating example. We then formally describe the construction of the general sparse code and its decoding algorithm.

37 3.3.1 Motivating Example

Consider a distributed matrix multiplication task C = A^⊤B using N = 6 workers. Let m = 2 and n = 2, and let each input matrix A and B be evenly divided as

A = [A1, A2] and B = [B1, B2].

Then computing the matrix C is equivalent to computing the following 4 blocks:

C = A^⊤B = [ A_1^⊤B_1   A_1^⊤B_2
             A_2^⊤B_1   A_2^⊤B_2 ].

We design a coded computation strategy via the following procedure: each worker i locally computes a weighted sum of the four components in matrix C,

C̃_i = w_1^i A_1^⊤B_1 + w_2^i A_1^⊤B_2 + w_3^i A_2^⊤B_1 + w_4^i A_2^⊤B_2.

Each weight w_j^i is an independently and identically distributed Bernoulli random variable with parameter p. For example, let p = 1/3; then, on average, 2/3 of these weights are equal to 0. We randomly generate the following N = 6 local computation tasks:

C̃_1 = A_1^⊤B_1 + A_1^⊤B_2,   C̃_2 = A_1^⊤B_2 + A_2^⊤B_1,
C̃_3 = A_1^⊤B_1,              C̃_4 = A_1^⊤B_2 + A_2^⊤B_2,
C̃_5 = A_2^⊤B_1 + A_2^⊤B_2,   C̃_6 = A_1^⊤B_1 + A_2^⊤B_1.

Suppose that both the 2nd and 6th workers are stragglers and the master node has collected the results from nodes {1, 3, 4, 5}. According to the designed computation strategy, we have the following group of linear systems:

[ C̃_1 ]   [ 1 1 0 0 ]   [ A_1^⊤B_1 ]
[ C̃_3 ] = [ 1 0 0 0 ] · [ A_1^⊤B_2 ]
[ C̃_4 ]   [ 0 1 0 1 ]   [ A_2^⊤B_1 ]
[ C̃_5 ]   [ 0 0 1 1 ]   [ A_2^⊤B_2 ]

One can easily check that the above coefficient matrix is full rank. Therefore, one straightforward way to recover C is to solve rt/4 linear systems, which proves decodability. However, this decoding algorithm is expensive, i.e., its complexity is O(rt) in this case.

Figure 3.2: Example of the hybrid peeling and Gaussian decoding process of the sparse code.

Interestingly, we can use a type of peeling algorithm to recover the matrix C with only three sparse matrix additions: first, we can straightforwardly recover the

block A_1^⊤B_1 from worker 3. Then we can use the result of worker 1 to recover the block A_1^⊤B_2 = C̃_1 − A_1^⊤B_1. Further, we can use the result of worker 4 to recover the block A_2^⊤B_2 = C̃_4 − A_1^⊤B_2 and use the result of worker 5 to obtain the block A_2^⊤B_1 = C̃_5 − A_2^⊤B_2. Actually, the above peeling decoding algorithm can be viewed as an edge-removal process in a bipartite graph. We construct a bipartite graph with one partition being the original blocks W and the other partition being the finished coded computation tasks {C̃_i}. Two nodes are connected if the corresponding computation task contains that block. As shown in Figure 3.2(a), in each iteration, we find a ripple (degree-one node) on the right that can be used to recover one node on the left. We remove the adjacent edges of that left node, which might produce some new ripples on the right. Then we iterate this process until we decode all blocks.

Based on the above graphical illustration, the key point of successful decoding is the existence of a ripple during the edge removal process. Clearly, this is not always guaranteed, given the randomness of our coding scheme and the uncertainty in the cloud.

For example, if both the 3rd and 4th workers are stragglers and the master node has collected the results from nodes {1, 2, 5, 6}, then even though the coefficient matrix is full rank, there exists no ripple in the graph. To avoid this problem, we can randomly pick one block and recover it through a linear combination of the collected results, then use this block to continue the decoding process. This particular linear combination can be determined by solving a linear system. Suppose that we choose to recover

A_1^⊤B_2; then we can recover it via the following linear combination:

A_1^⊤B_2 = (1/2) C̃_1 + (1/2) C̃_2 − (1/2) C̃_6.

As illustrated in Figure 3.2(b), we can recover the rest of the blocks using the same peeling decoding process. The above decoding algorithm only involves simple matrix additions and the total decoding time is O(nnz(C)).

3.3.2 General Sparse Code

Now we present the construction and decoding of the sparse code in a general setting.

We first evenly divide the input matrices A and B along the column side into m and n submatrices, as defined in (3.2.1). Then we define a set S that contains m^2n^2 distinct nonzero elements. The simplest example of S is [m^2n^2] ≜ {1, 2, ..., m^2n^2}. Under this setting, we define the following class of coded computation strategies.

Definition 3.3.1. (Sparse Code) Given the parameter P ∈ R^{mn} and set S, we define the (P, S)-sparse code as: for each worker k ∈ [N], compute

C̃_k = f_k(W) = Σ_{i=1}^{m} Σ_{j=1}^{n} w_{ij}^k A_i^⊤B_j. (3.3.1)

Here the parameter P = [p_1, p_2, ..., p_mn] is the degree distribution, where p_l is the probability that there are exactly l nonzero weights w_{ij}^k in worker k. The value of each nonzero weight w_{ij}^k is picked from the set S independently and uniformly at random.
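As a concrete illustration of Definition 3.3.1, the following sketch generates the coefficient matrix of a (P, S)-sparse code, i.e., the weights w_{ij}^k (flattened over the mn blocks) that each worker would apply. The toy degree distribution and block counts below are placeholders, not values prescribed by the text.

```python
import numpy as np

def sparse_code_matrix(N, m, n, P, rng=None):
    """Row k holds the weights w_{ij}^k used in C~_k = sum_{ij} w_{ij}^k A_i^T B_j.
    The degree of each row is drawn from the distribution P over {1, ..., mn};
    nonzero values are drawn uniformly from S = {1, ..., m^2 n^2}."""
    rng = np.random.default_rng(rng)
    mn = m * n
    S = np.arange(1, m * m * n * n + 1)
    M = np.zeros((N, mn))
    degrees = rng.choice(np.arange(1, mn + 1), size=N, p=P)
    for k, d in enumerate(degrees):
        blocks = rng.choice(mn, size=d, replace=False)   # which blocks A_i^T B_j worker k touches
        M[k, blocks] = rng.choice(S, size=d)
    return M

# Toy example with m = n = 2 (4 blocks) and a uniform degree distribution.
P = np.full(4, 0.25)
print(sparse_code_matrix(N=6, m=2, n=2, P=P))
```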

Without loss of generality, suppose that the master node collects results from the

first K workers with K ≤ N. Given the above coding scheme, we have

[ C̃_1 ]   [ w_11^1  w_12^1  ···  w_mn^1 ]   [ A_1^⊤B_1 ]
[ C̃_2 ] = [ w_11^2  w_12^2  ···  w_mn^2 ] · [ A_1^⊤B_2 ]
[  ⋮  ]   [   ⋮       ⋮      ⋱     ⋮    ]   [    ⋮     ]
[ C̃_K ]   [ w_11^K  w_12^K  ···  w_mn^K ]   [ A_m^⊤B_n ]

We use M ∈ R^{K×mn} to represent the above coefficient matrix. To guarantee decodability, the master node should collect results from a sufficient number of workers such that the coefficient matrix M has full column rank. Then the master node goes through a

peeling decoding process: it first finds a ripple worker to recover one block. Then, for each collected result, it subtracts this block if the corresponding computation task contains it. If there exists no ripple in the peeling decoding process, we go to the rooting step: randomly pick a particular block A_i^⊤B_j. The following lemma shows that we can recover this block via a linear combination of the results {C̃_k}_{k=1}^{K}.

Lemma 3.3.1. (Rooting step) If rank(M) = mn, then for any k_0 ∈ {1, 2, ..., mn}, we can recover the particular block A_i^⊤B_j whose column index in matrix M is k_0 via the following linear combination:

A_i^⊤B_j = Σ_{k=1}^{K} u_k C̃_k. (3.3.2)

The vector u = [u_1, ..., u_K]^T can be determined by solving M^T u = e_{k_0}, where e_{k_0} ∈ R^{mn} is a unit vector with a unique 1 located at index k_0.

The basic intuition is to find a linear combination of row vectors of matrix M such

that eliminates all other blocks except the particular block A_i^⊤B_j. The whole procedure is listed in Algorithm 2.

Here we conduct some analysis of the complexity of Algorithm 2. During each

iteration, the complexity of the operation C̃_k = C̃_k − M_{kk_0} A_i^⊤B_j is O(nnz(A_i^⊤B_j)). Suppose that the average number of nonzero elements in each row of the coefficient matrix M is α. Then each block A_i^⊤B_j will be used O(αK/mn) times on average. Further, suppose that there are c blocks requiring the rooting step (3.3.2) to recover; the complexity of each such step is O(Σ_k nnz(C̃_k)). On average, each coded block C̃_k is equal to the sum of O(α) original blocks. Therefore, the complexity of Algorithm 2 is

O( (αK/mn) Σ_{i,j} nnz(A_i^⊤B_j) ) + O( c Σ_k nnz(C̃_k) ) = O( (c + 1)(αK/mn) · nnz(C) ). (3.3.3)

Algorithm 2 Sparse code (master node's protocol)
  repeat
    The master node assigns the coded computation tasks according to Definition 3.3.1.
  until the master node collects results with rank(M) = mn and K is larger than a given threshold.
  repeat
    Find a row M_{k_0} in matrix M with ‖M_{k_0}‖_0 = 1.
    if such a row does not exist then
      Randomly pick k_0 ∈ {1, ..., mn} and recover the corresponding block A_i^⊤B_j by (3.3.2).
    else
      Recover the block A_i^⊤B_j from C̃_{k_0}.
    end if
    Suppose that the column index of the recovered block A_i^⊤B_j in matrix M is k_0.
    for each computation result C̃_k do
      if M_{kk_0} is nonzero then
        C̃_k = C̃_k − M_{kk_0} A_i^⊤B_j and set M_{kk_0} = 0.
      end if
    end for
  until every block of matrix C is recovered.
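The following Python sketch mirrors the structure of Algorithm 2 on numerical blocks. It is a simplified, dense-matrix illustration (assuming the coefficient matrix M already has full column rank and the coded results are given as NumPy arrays), not the distributed implementation used in the experiments.

```python
import numpy as np

def hybrid_decode(M, C_tilde):
    """Peeling decoder with a rooting fallback, in the spirit of Algorithm 2.
    M: (K x mn) coefficient matrix with full column rank.
    C_tilde: list of K coded blocks (NumPy arrays of equal shape).
    Returns the list of mn recovered blocks A_i^T B_j."""
    M = M.astype(float).copy()
    C = [c.astype(float).copy() for c in C_tilde]
    K, mn = M.shape
    blocks = [None] * mn
    while any(b is None for b in blocks):
        rows = np.flatnonzero(np.count_nonzero(M, axis=1) == 1)   # ripples
        if rows.size > 0:
            k = rows[0]
            k0 = int(np.flatnonzero(M[k])[0])
            block = C[k] / M[k, k0]
        else:
            # Rooting step (3.3.2): pick an unresolved column k0 and solve M^T u = e_{k0}.
            k0 = next(i for i, b in enumerate(blocks) if b is None)
            e = np.zeros(mn); e[k0] = 1.0
            u, *_ = np.linalg.lstsq(M.T, e, rcond=None)
            block = sum(u[k] * C[k] for k in range(K))
        blocks[k0] = block
        # Subtract the recovered block from every coded result that contains it.
        for k in np.flatnonzero(M[:, k0]):
            C[k] = C[k] - M[k, k0] * block
            M[k, k0] = 0.0
    return blocks
```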

We can observe that the decoding time is linear in the density of matrix M, the recovery threshold K and the number of rooting steps (3.3.2). In the next section, we will show that, under a good choice of degree distribution P and set S, we can achieve the result in Theorem 3.2.1.

3.4 Theoretical Analysis

As discussed in the preceding section, to reduce the decoding complexity, it is desirable to make the coefficient matrix M as sparse as possible. However, lower density requires that the master node collect results from a larger number of workers to ensure that the matrix M is full rank. For example, consider the extreme case in which we randomly assign one nonzero element in each row of M. The analysis of the classical balls-and-bins process implies that, when K = O(mn ln(mn)), the matrix M is full rank, which is far from the optimal recovery threshold. On the other hand, the polynomial code [21] achieves the optimal recovery threshold. Nonetheless, it exhibits the densest matrix M, i.e., K × mn nonzero elements, which significantly increases the local computation, communication and final decoding time.

In this section, we design a sparse code between these two extremes. With high probability, this code has a near-optimal recovery threshold K = Θ(mn), a constant number of rooting steps (3.3.2), and an extremely sparse matrix M with

α = Θ(ln(mn)) nonzero elements in each row. The main idea is to choose the following degree distribution.

Definition 3.4.1. (Wave Soliton distribution) The Wave Soliton distribution Pw =

[p1, p2, . . . , pmn] is defined as follows.

p_k = { τ/mn,          k = 1;
        τ/70,          k = 2;
        τ/(k(k−1)),    3 ≤ k ≤ mn.    (3.4.1)

The parameter τ = 35/18 is the normalizing factor.
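A minimal sketch of the Wave Soliton distribution in (3.4.1); it simply builds the probability vector for a given mn, checks the normalization implied by τ = 35/18, and reports the average degree.

```python
import numpy as np

def wave_soliton(mn, tau=35.0 / 18.0):
    """Wave Soliton distribution P_w = [p_1, ..., p_mn] from (3.4.1)."""
    p = np.zeros(mn)
    p[0] = tau / mn                     # p_1
    p[1] = tau / 70.0                   # p_2
    k = np.arange(3, mn + 1)
    p[2:] = tau / (k * (k - 1))         # p_k for 3 <= k <= mn
    return p

P_w = wave_soliton(mn=100)
print(P_w.sum())                          # equals 1 up to floating-point error
print((np.arange(1, 101) * P_w).sum())    # average degree, of order ln(mn)
```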

The above degree distribution is modified from the Soliton distribution [38]. In particular, we cap the original Soliton distribution at the maximum degree mn, and shift a constant amount of weight from degree 2 to the larger degrees. It can be observed that the recovery threshold K of the proposed sparse code depends on two factors: (i) the full rank of the coefficient matrix M; (ii) the successful decoding of the peeling algorithm with a constant number of rooting steps. Based on the proposed framework in Section 2.4, we can reduce the analysis of the full rank probability to the probability that the underlying bipartite graph contains a perfect matching. Formally, we have the following result.

Theorem 3.4.1. (Existence of perfect matching) Let G(V_1, V_2, P_w) be a random balanced bipartite graph, in which |V_1| = |V_2| = mn. Each node v ∈ V_2 independently and randomly connects to l nodes in partition V_1, where l is drawn from the Wave Soliton distribution (3.4.1). Then there exists a constant c > 0 such that

P(G contains a perfect matching) > 1 − c(mn)^{−0.94}.

Proof. Here we sketch the proof. Details can be seen in the supplementary material.

The basic technique is to utilize Hall's theorem to show that this probability is lower bounded by the probability that G does not contain a set S ⊂ V_1 or S ⊂ V_2 such that |S| > |N(S)|, where N(S) is the neighborhood of S. We show that, if S ⊂ V_1, the probability that such an S exists is upper bounded by

Σ_{s=Θ(1)} 1/(mn)^{0.94s} + Σ_{s=Ω(1)}^{s=o(mn)} (s/mn)^{0.94s} + Σ_{s=Θ(mn)} c_1^{mn}.

If S ⊂ V_2, the probability that such an S exists is upper bounded by

Σ_{s=Θ(1)} 1/mn + Σ_{s=o(mn)} (c_2 s/mn)^{s} + Σ_{s=Θ(mn)} c_3^{mn},

where the constants c_1, c_2, c_3 are strictly less than 1. Combining these results together gives Theorem 3.4.1.

Analyzing the existence of a perfect matching in a random bipartite graph has been studied since [46]. However, the existing analysis is limited to independent generation models. For example, the Erdos-Renyi model assumes each edge exists independently with probability p. The κ-out model [47] assumes that each vertex v ∈ V_1 independently and randomly chooses κ neighbors in V_2. The minimum degree model [49] assumes each vertex has a minimum degree and edges are uniformly distributed among all allowable classes. There exists no prior work analyzing such a probability in a random bipartite graph generated by a given degree distribution. In this case, one technical difficulty is that the degrees of the nodes in partition V_1 are dependent. All of the analysis must be carried out from the nodes of the right partition, which exhibit a complicated statistical model.

We now focus on quantitatively analyzing the impact of the recovery threshold K on the peeling decoding process and the number of rooting steps (3.3.2). Intuitively, a larger K implies a larger number of ripples, which leads to a higher probability of successful peeling decoding and therefore fewer rooting steps. The key question is: how large must K be such that all mn blocks are recovered with only a constant number of rooting steps? To answer this question, we first define the distribution generation function of P_w as

Ω_w(x) = (τ/mn) x + (τ/70) x^2 + τ Σ_{k=3}^{mn} x^k / (k(k−1)). (3.4.2)

The following technical lemma is useful in our analysis.

Lemma 3.4.1. If the degree distribution Ωw(x) and recovery threshold K satisfy

[1 − Ω'_w(1 − x)/mn]^{K−1} ≤ x, for x ∈ [b/mn, 1], (3.4.3)

then the peeling decoding process in Algorithm 2 can recover mn − b blocks with probability at least 1 − e^{−cmn}, where b, c are constants.

Lemma 3.4.1 is obtained by tailoring a martingale argument for the peeling decoding process [50]. This result provides a quantitative recovery condition on the degree generation function. It remains to be shown that the proposed Wave Soliton distribution (3.4.2) satisfies the above inequality with a specific choice of K.
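The recovery condition (3.4.3) can also be checked numerically for a candidate K. The sketch below evaluates Ω'_w on a grid and tests the inequality; the particular values of mn, K and the constant b are illustrative assumptions.

```python
import numpy as np

def omega_w_prime(x, mn, tau=35.0 / 18.0):
    """Derivative of the generator function (3.4.2) of the Wave Soliton distribution."""
    k = np.arange(3, mn + 1)
    return tau / mn + 2 * tau / 70.0 * x + tau * np.sum(x ** (k - 1) / (k - 1))

def satisfies_condition(mn, K, b=2, grid=1000):
    """Check condition (3.4.3): [1 - Omega'_w(1 - x)/mn]^(K-1) <= x on [b/mn, 1]."""
    xs = np.linspace(b / mn, 1.0, grid)
    lhs = np.array([(1.0 - omega_w_prime(1.0 - x, mn) / mn) ** (K - 1) for x in xs])
    return bool(np.all(lhs <= xs))

# K = Theta(mn); the exact constant used here is purely illustrative.
print(satisfies_condition(mn=100, K=130))
```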

Theorem 3.4.2. (Recovery threshold) Given the sparse code with parameters (P_w, [m^2n^2]), if K = Θ(mn), then there exists a constant c such that, with probability at least 1 − e^{−cmn}, Algorithm 2 is sufficient to recover all mn blocks, with Θ(1) blocks recovered via the rooting step (3.3.2).

Combining the results of Theorem 3.4.1 and Theorem 3.4.2, we conclude that the re-

covery threshold of the sparse code (P_w, [m^2n^2]) is Θ(mn) with high probability. Moreover, since the average degree of the Wave Soliton distribution is O(ln(mn)), combining these results with (3.3.3), the complexity of Algorithm 2 is O(nnz(C) ln(mn)).

Remark 3.4.2.1. Although the recovery threshold of the proposed scheme exhibits a constant gap to the information theoretical lower bound, the practical performance is very close to such a bound. This mainly comes from the pessimistic estimation in Theorem 3.4.2. As illustrated in Figure 3.3, we generate the sparse code under

Robust Soliton distribution, and plot the average recovery threshold versus the number of blocks mn. It can be observed that the overhead of the proposed sparse code is less than

15%.

Remark 3.4.2.2. Existing codes such as Tornado code [51] and LT code [38] also utilize the peeling decoding algorithm and can provide a recovery threshold Θ(mn).

However, they exhibit a large constant, especially when mn is less than 10^3. Figure 3.3 compares the practical recovery threshold among these codes. We can see that our proposed sparse code results in a much lower recovery threshold. Moreover, the intrinsic cascading structure of these codes will also destroy the sparsity of the input matrices.

The proposed Wave Soliton distribution (3.4.1) is asymptotically optimal; however, it can be far from optimal in practice when m and n are small. The analysis of the full rank probability and the decoding process relies on asymptotic arguments to obtain upper bounds on the error probability. Such bounds are far from tight when m and n are small.

In this subsection, we focus on determining the optimal degree distribution based on

our analysis in Section 3.4. Formally, we can formulate the following optimization problem.

Figure 3.3: Recovery threshold versus the number of blocks mn, comparing the proposed sparse code, the LT code, the lower bound, and the average degree.

min_{[p_k]} Σ_{k=1}^{mn} k p_k    (3.4.4)

s.t.  P(M is full rank) > p_c,

      [1 − Ω'_w(x)/mn]^{mn+c} ≤ 1 − x − c_0 √((1 − x)/mn),  x ∈ [0, 1 − b/mn],

      [p_k] ∈ Δ_mn.

The objective is to minimize the average degree, namely, to minimize the computation and communication overhead at each worker. The first constraint requires that the probability that M is full rank be at least p_c. Since it is difficult to obtain the exact form of this probability, we can use the analysis in Section 3.4 to replace this condition by requiring that the probability that the balanced bipartite graph G(V_1, V_2, P) contains a perfect matching be larger than a given threshold, which can be calculated exactly.

The second inequality represents the decodability condition that when K = mn + c + 1 results are received, mn − b blocks are recovered through the peeling decoding process and b blocks are recovered from the rooting step (3.3.2). This condition is modified from (3.4.3) by adding an additional term, which is useful in increasing the expected ripple size [52]. By discretizing the interval [0, 1 − b/mn] and requiring the above inequality to hold at the discretization points, we obtain a set of linear inequality constraints. Details regarding the exact form of the above optimization model and its solutions are provided in the supplementary material.
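A rough sketch of how the discretized decodability constraints could be assembled into a linear program with scipy.optimize.linprog. It keeps only the linear pieces of (3.4.4) (the average-degree objective, the discretized second constraint after taking the (mn + c)-th root, and the simplex constraint) and omits the full-rank/perfect-matching constraint, which the text handles separately. The constants c, c_0, b and this particular linearization are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_degree_lp(mn, c_exp=2, c0=1.0, b=2, grid=50):
    """Sketch of (3.4.4): minimize sum_k k*p_k subject to a discretized,
    linearized decodability constraint and p in the simplex."""
    ks = np.arange(1, mn + 1)
    xs = np.linspace(0.0, 1.0 - b / mn, grid)[1:]          # discretization points
    A_ub, b_ub = [], []
    for x in xs:
        rhs = 1.0 - x - c0 * np.sqrt((1.0 - x) / mn)
        if rhs <= 0:
            continue
        # Omega'(x) = sum_k k p_k x^(k-1) >= mn * (1 - rhs^(1/(mn+c)))
        A_ub.append(-(ks * x ** (ks - 1)))
        b_ub.append(-mn * (1.0 - rhs ** (1.0 / (mn + c_exp))))
    res = linprog(c=ks,                                    # objective: average degree
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.ones((1, mn)), b_eq=[1.0],
                  bounds=[(0, 1)] * mn)
    return res.x if res.success else None

p = optimal_degree_lp(mn=16)
print(None if p is None else (np.arange(1, 17) * p).sum())  # optimized average degree
```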

3.5 Experimental Results

In this section, we present experimental results at Ohio Supercomputer Center [40].

We compare our proposed coding scheme against the following schemes: (i) uncoded scheme: the input matrices are divided uniformly across all workers and the master waits for all workers to send their results; (ii) sparse MDS code [20]: the generator matrix is a sparse random Bernoulli matrix with average computation overhead

Θ(ln(mn)), a recovery threshold of Θ(mn) and decoding complexity Õ(mn · nnz(C)); (iii) product code [25]: a two-layer MDS code that achieves a probabilistic recovery threshold of Θ(mn) and decoding complexity Õ(rt). We use the above sparse MDS code to construct the product code in order to reduce the computation overhead. (iv) polynomial code [21]: a coded matrix multiplication scheme with optimum recovery threshold; (v) LT code [38]: a rateless code widely used in broadcast communication. It has low decoding complexity due to the peeling decoding algorithm. To simulate straggler effects in a large-scale system, we randomly pick s workers that run a background thread which increases their computation time.

We implement all methods in Python using MPI4py. To simplify the simulation, we fix the number of workers N and randomly generate a coefficient matrix M ∈ R^{N×mn} under the given degree distribution offline, such that it can resist one straggler. Then, each worker loads a certain number of partitions of the input matrices according to

the coefficient matrix M. In the computation stage, each worker computes the product of its assigned submatrices and returns the results using Isend(). Then the master node actively listens to the responses from each worker via Irecv(), and uses Waitany() to keep polling for the earliest finished tasks. Upon receiving enough results, the master stops listening and starts decoding the results.

Figure 3.4: Job completion time for two 1.5E5 × 1.5E5 matrices with 6E5 nonzero elements (mn = 9 and mn = 16, s = 2 and s = 3).
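The communication pattern just described (non-blocking Isend/Irecv with Waitany polling) looks roughly as follows in MPI4py; this is a stripped-down sketch with made-up tags, payload shapes and straggler tolerance, not the benchmarking code itself.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, N = comm.Get_rank(), comm.Get_size() - 1   # workers are ranks 1..N
BLOCK = (512, 512)                               # illustrative block shape

if rank == 0:
    # Master: post one non-blocking receive per worker, poll for the earliest results.
    bufs = [np.empty(BLOCK) for _ in range(N)]
    reqs = [comm.Irecv(bufs[i], source=i + 1, tag=77) for i in range(N)]
    K = N - 1                                    # e.g., tolerate one straggler
    received = []
    for _ in range(K):
        idx = MPI.Request.Waitany(reqs)          # index of the earliest finished worker
        received.append((idx, bufs[idx]))
    # ... decode the output from `received` here ...
else:
    # Worker: compute its assigned coded product and return it without blocking.
    A_part = np.random.rand(*BLOCK)              # placeholder for the assigned submatrices
    B_part = np.random.rand(*BLOCK)
    result = A_part @ B_part
    comm.Isend(result, dest=0, tag=77).Wait()
```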

We first generate two random Bernoulli sparse matrices with r = s = t = 150000 and 600000 nonzero elements. Figure 3.4 reports the job completion time under m = n = 3 and m = n = 4 and number of stragglers s = 2, 3, based on 20 experimental runs. It can be observed that our proposed sparse code requires the minimum time, and outperforms the LT code (taking 20-30% of its time), the sparse MDS code and product code (30-50%) and the polynomial code (15-20%).

The uncoded scheme is faster than the polynomial code. The main reason is that, due to the increased number of nonzero elements of coded matrices, the per-worker computation time for these codes is increased. Moreover, the data transmission time is also greatly increased, which leads to additional I/O contention at the master node.

Figure 3.5: Simulation results for two 1.5E5 × 1.5E5 matrices with 6E5 nonzero elements, broken down into the transmission times T1 (master to worker) and T2 (worker to master), computation time, and decoding time, for mn = 9, 16 and s = 2, 3.

We further compare our proposed sparse code with the existing schemes from the point of view of the time required to communicate inputs to each worker, compute the matrix multiplication in parallel, fetch the required outputs, and decode. Results can be seen in Figure 3.5.

We finally compare our scheme with these schemes for other types of matrices and for larger matrices. The data statistics can be seen in the supplementary material. The

Table 3.2: Timing Results for Different Sparse Matrix Multiplications (in sec)

Data                       | uncoded | LT code | sparse MDS | product | polynomial | sparse
                           |         |         | code       | code    | code       | code
square                     | 6.81    | 3.91    | 6.42       | 6.11    | 18.44      | 2.17
tall                       | 7.03    | 2.69    | 6.25       | 5.50    | 18.51      | 2.04
fat                        | 6.49    | 1.36    | 3.89       | 3.08    | 9.02       | 1.22
amazon-08/web-google       | 15.35   | 17.59   | 46.26      | 38.61   | 161.6      | 11.01
cont1/cont11               | 7.05    | 5.54    | 9.32       | 14.66   | 61.47      | 3.23
cit-patents/patents        | 22.10   | 29.15   | 69.86      | 56.59   | 1592       | 21.07
hugetrace-00/hugetrace-00  | 18.06   | 21.76   | 51.15      | 37.36   | 951.3      | 14.16

first three data sets, square, tall and fat, are randomly generated square, tall and fat matrices. We also consider 8 sparse matrices from real data sets [48]. We evenly divide each input matrix into m = n = 4 submatrices, and the number of stragglers is equal to 2. We match the column dimension of A and the row dimension of B using the smaller one. The timing results are averaged over 20 experimental runs.

Among all experiments, we can observe in Table 3.2 that our proposed sparse code achieves a 1–3× speedup over the uncoded scheme and outperforms the existing codes, with the effects being more pronounced for the real data sets. The job completion time of the LT code, sparse MDS code and product code is smaller than that of the uncoded scheme on the square, tall and fat matrices, and larger than that of the uncoded scheme on the real data sets.

CHAPTER 4

CODED DISTRIBUTED OPTIMIZATION

In this chapter, we further consider the distributed optimization problem: the master node aims to compute the sum of n functions,

f(x) = Σ_{i=1}^{n} f_i(x)    (4.0.1)

in a distributed way, where f_i : R^d → R^w is assigned to the ith local worker. This problem is the key component in most distributed machine learning training algorithms. For example, in the data parallelism of distributed neural network training [6], the data is partitioned into n parts. Each local worker trains a neural network on its local data set and returns the trained model f_i(x) to the parameter server, and the parameter server updates the model by averaging the received results. In general distributed gradient descent, each local worker computes a partial gradient f_i(x) over the ith local data set and the classifier x, and returns the result to the master. The master node updates the classifier by summing all partial gradients.

4.1 Introduction

The coded computation schemes for this problem are known as gradient coding techniques, and have been proposed as an effective way to deal with stragglers in distributed learning applications [1]. The system being considered has n workers, in which the training data is partitioned into n parts. Each worker stores multiple parts of the data, computes a partial gradient over each of its assigned partitions, and returns a linear combination of these partial gradients to the master node. By creating and exploiting coding redundancy in local computation, the master node can reconstruct the full gradient even if only part of the results are collected, and can therefore alleviate the impact of straggling workers.

The key performance metric used in gradient coding schemes is the computation load d(s), which refers to the number of data partitions that are sent to each node, and characterizes the amount of redundant computation needed to resist s stragglers. Given the number of workers n and the number of stragglers s, the work of [1] establishes a fundamental bound d(s) ≥ s + 1, and constructs a random code that exactly matches this lower bound. Two subsequent works [18, 29] provide deterministic constructions of the gradient coding scheme. These results imply that, to resist one or two stragglers, the best gradient coding scheme will double or even triple the computation load of each worker, which leads to a large transmission and processing overhead for data-intensive applications.

In practical distributed learning applications, we only need to approximately reconstruct the gradients. For example, the gradient descent algorithm is internally robust to noise in the gradient evaluation, and the algorithm still converges when the error of each step is bounded [53]. In other scenarios, adding noise to the gradient evaluation may even improve the generalization performance of the trained model [54]. These facts motivate the idea of approximate gradient coding. More specifically, suppose that s of the n workers are stragglers; approximate gradient coding allows the master node to reconstruct the full gradient with a multiplicative error ε from the n − s received results. The computation load in this case is a function d(s, ε) of both the number of stragglers s and the error ε. By introducing

Table 4.1: Comparison of Existing Gradient Coding Schemes

Scheme                    | Computation load           | Error of gradient
cyclic MDS [1]            | s + 1                      | 0
expander graph code [29]  | O(ns/(n − s))              | ε
BGC [28]                  | O(log(n))                  | O(n/((n − s) log(n)))
FRC^1                     | O(log(n)/log(n/s))         | 0
BRC^1                     | O(log(1/ε)/log(n/s))       | ε

^1 Result holds with high probability, i.e., 1 − o(1).

What is the minimum computation load for the approximate gradient coding problem? Can we find an optimal scheme that achieves this lower bound?

There have been two computing schemes proposed earlier for this problem. The

first one, introduced in [29], utilizes the expander graph, particularly Ramanujan graphs to provide an approximate construction that achieves a computation load

O(ns/(n s)) given error . However, expander graphs, especially Ramanujan graphs, − are expensive to compute in practice, especially for large number of workers. Hence, an alternative computing scheme was recently proposed in [28], referred to as Bernoulli

Gradient Code (BGC). This coding scheme incurs a computation load of O(log(n)) and an error of O(n/(n s) log(n)) with high probability. − In this chapter, we show that, the optimum computation load can be far less than what the above two schemes achieve. More specifically, we first show that, if we need to exactly ( = 0) recover the full gradients with high probability, the minimum

computation load satisfies

d(s, 0) ≥ O( log(n) / log(n/s) ).    (4.1.1)

We also design a coding scheme, referred to as the d-fractional repetition code (FRC), that achieves the optimum computation load. This result implies that, if we allow the decoding process to fail with a vanishing probability, the computation load of each worker can be significantly reduced from s + 1 to O(log(n)/log(n/s)). For example, when n = 100 and s = 10, each worker in the original gradient coding strategy requires storing 11× data partitions, while the approximate scheme requires storing only 2× data partitions.

Furthermore, we identify the following three-fold fundamental tradeoff among the computation load d(s, ε), the recovery error ε and the number of stragglers s in order to approximately recover the full gradients with high probability. The tradeoff reads

d(s, ε) ≥ O( log(1/ε) / log(n/s) ).

This result provides a quantitative characterization of how the gradient noise plays a logarithmic reduction role, i.e., it reduces the desired computation load from O(log(n)) to O(log(n)) − O(log(nε)) = O(log(1/ε)). For example, when the error of the gradient is O(1/log(n)), the existing BGC scheme in [28] provides a computation load of O(log(n)); instead, the information-theoretic lower bound is O(log(log(n))). We further give an explicit code construction, referred to as the batch raptor code (BRC), based on a random edge removal process, that achieves this fundamental tradeoff. The comparison of our proposed schemes and the existing gradient coding schemes is listed in Table 4.1.

We finally implement and benchmark the proposed gradient coding schemes at

the Ohio Supercomputer Center [40] and empirically demonstrate their performance gain

56 compared with existing strategies. Due to the space limit, all the proofs and extensions are given in the appendix.

4.2 Preliminaries

4.2.1 Problem Formulation

The data set is denoted by D = {(x_i, y_i)}_{i=1}^{N} with input features x_i ∈ R^p and labels y_i ∈ R. Most machine learning tasks aim to solve the following optimization problem:

β* = arg min_{β∈R^p} Σ_{i=1}^{N} L(x_i, y_i; β) + λR(β),    (4.2.1)

where L(·) is a task-specific loss function and R(·) is a regularization function. This problem is usually solved by gradient-based approaches. More specifically, the parameters β are updated according to the iteration β^{(t+1)} = h_R(β^{(t)}, g^{(t)}), where h_R(·) is the proximal mapping of the gradient-based iteration, and g^{(t)} is the gradient of the loss function at the current parameter β^{(t)}, defined as

g^{(t)} = Σ_{i=1}^{N} ∇L(x_i, y_i; β^{(t)}).    (4.2.2)

In practice, the number of data samples N is quite large, i.e., N ≥ 10^9, so the evaluation of the gradient g^{(t)} becomes a bottleneck of the above optimization process and should be distributed over multiple workers. Suppose that there are n workers W_1, W_2, ..., W_n, and the original dataset is partitioned into n subsets of equal size {D_1, D_2, ..., D_n}. In traditional distributed gradient descent, each worker i stores the dataset D_i. During iteration t, the master node first broadcasts the current classifier β^{(t)} to each worker. Then each worker i computes a partial gradient g_i^{(t)} over data block D_i, and returns it to the master node. The master node collects all the partial gradients to obtain the gradient g^{(t)} = Σ_{i=1}^{n} g_i^{(t)} and updates the classifier accordingly. In the gradient coding framework, as illustrated in Figure 4.1, each worker i stores multiple data blocks and computes a linear combination of partial gradients; the master node then receives a subset of the results and decodes the full gradient g^{(t)}.

Figure 4.1: Gradient coding framework. Each worker W_i is assigned data blocks and returns a coded partial gradient g̃_i^{(t)}; the master node decodes the full gradient and updates the model.

More formally, the gradient coding framework can be represented by a coding matrix A ∈ R^{n×n}, where the ith worker computes g̃_i = Σ_{j=1}^{n} A_{ij} g_j.^1 Let g ∈ R^{n×p} (g̃ ∈ R^{n×p}) be the matrix with each row being the (coded) partial gradient:

g̃ = [g̃_1; g̃_2; ...; g̃_n] and g = [g_1; g_2; ...; g_n].

Therefore, we can write g̃ = Ag. Suppose that there exist s stragglers and the master node receives n − s results indexed by the set S. Then the received partial gradients can be represented by g̃_S = A_S g, where A_S ∈ R^{(n−s)×n} is the row submatrix of A containing the rows indexed by S. During the decoding process, the master node solves the following problem,

u* = arg min_{u∈R^{n−s}} ‖A_S^T u − 1_n‖^2,    (4.2.3)

and recovers the full gradient by u* g̃_S, where 1_n ∈ R^n denotes the all-ones vector.

^1 Here we omit the iteration count t.

Definition 4.2.1. (Recovery error) Given a submatrix A_S ∈ R^{(n−s)×n}, the corresponding recovery error is defined as

err(A_S) = min_{u∈R^{n−s}} ‖A_S^T u − 1_n‖^2.    (4.2.4)

Instead of directly measuring the error of the recovered gradient, i.e., min_u ‖uA_S g − 1_n g‖, this metric quantifies how close 1_n is to being in the span of the columns of A_S. It is also worth noting that the overall recovery error is small relative to the magnitude of the gradient, since the minimum decoding error satisfies min_u ‖uA_S g − 1_n g‖ ≤ ‖g‖ · min_u ‖uA_S − 1_n‖.
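The decoding rule (4.2.3)–(4.2.4) amounts to a single least-squares solve. The sketch below computes u*, the recovery error err(A_S), and the approximately recovered sum of gradients, using random placeholder data and a made-up coding matrix purely for illustration.

```python
import numpy as np

def decode_gradient(A_S, g_tilde_S):
    """Solve (4.2.3): u* = argmin ||A_S^T u - 1_n||^2; return (err(A_S), u* g~_S)."""
    n = A_S.shape[1]
    u, *_ = np.linalg.lstsq(A_S.T, np.ones(n), rcond=None)
    err = np.linalg.norm(A_S.T @ u - np.ones(n)) ** 2     # recovery error (4.2.4)
    return err, u @ g_tilde_S                             # approximate full gradient

# Toy example: n = 6 workers, p = 4 parameters, s = 2 stragglers.
rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)              # a random coding matrix
g = rng.standard_normal((6, 4))                           # partial gradients g_i
survivors = rng.choice(6, size=4, replace=False)
err, g_hat = decode_gradient(A[survivors], A[survivors] @ g)
print(err, np.linalg.norm(g_hat - g.sum(axis=0)))
```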

Definition 4.2.2. (Computation load) The computation load of a gradient coding scheme A is defined as κ(A) = max_{1≤i≤n} ‖A_i‖_0, where ‖A_i‖_0 is the number of nonzero coefficients of the ith row A_i.

The existing work [1] shows that the minimum computation load is at least s + 1 when we require decoding the full gradient exactly, i.e., err(A_S) = 0 for all S ⊆ [n] with |S| = n − s. Approximate gradient coding relaxes this worst-case scenario to a more realistic setting, the "average and approximate" scenario. Formally, we have the following systematic definition of approximate gradient codes.

Definition 4.2.3. (ε-approximate gradient code) Given s stragglers among n workers, the set of ε-approximate gradient codes is defined as

G_ε = { A ∈ R^{n×n} | P[err(A_S) > εn] = o(1) },    (4.2.5)

where A_S ∈ R^{(n−s)×n} is a randomly chosen row submatrix of A.

The above definition of gradient codes is general and includes most existing works on approximate gradient coding. For example, let δ = s/n; the existing scheme based on Ramanujan graphs is an ε-approximate gradient code that achieves a computation load of O(s/(1 − δ)), and the existing BGC [28] can be regarded as an O(1/((1 − δ) log(n)))-approximate gradient code that achieves a computation load of O(log(n)/(1 − δ)).

4.2.2 Main Results

Our first main result provides the minimum computation load and corresponding optimal code when we want to exactly decode the full gradient with high probability.

Theorem 4.2.1. Suppose that out of n workers, s = δn are stragglers. The minimum computation load of any gradient code in G_0 satisfies

κ_0^*(A) ≜ min_{A∈G_0} κ(A) ≥ O( max{ 1, log(n)/log(1/δ) } ).    (4.2.6)

Moreover, we construct a gradient code, which we call the d-fractional repetition code, A^{FRC} ∈ G_0, such that

lim_{n→∞} κ(A^{FRC}) / κ_0^*(A) = 1.    (4.2.7)

The following main result provides a more general outer bound when we allow the recovered gradient to contain some error.

Theorem 4.2.2. Suppose that out of n workers, s = δn are stragglers. If 0 < ε < O(1/log^2(n)), the minimum computation load of any gradient code in G_ε satisfies

κ_ε^*(A) ≜ min_{A∈G_ε} κ(A) ≥ O( max{ 1, log(1/ε)/log(1/δ) } ).

Moreover, we construct a gradient code, named the batch raptor code, A^{BRC} ∈ G_{cε}, such that

lim_{n→∞} κ(A^{BRC}) / κ_ε^*(A) = 1.    (4.2.8)

Theorem 4.2.2 provides a fundamental tradeoff among the gradient noise, the straggler tolerance and the computation load, and shows that the gradient noise ε provides a logarithmic reduction, i.e., of order log(nε), in the computation load.

Notation: Suppose that A_S ∈ R^{(n−s)×n} is a row submatrix of A containing (n − s) randomly and uniformly chosen rows. A_i (or A_{S,i}) denotes the ith column of matrix A (or A_S), and a_i (or a_{S,i}) denotes the ith row of matrix A (or A_S). supp(x) is defined as the support set of vector x, and ‖x‖_0 represents the number of nonzero elements in vector x.

4.3 0-Approximate Gradient Code

In this section, we consider a simplified scenario in which the error of the gradient evaluation is zero. It can be regarded as a probabilistic relaxation of the worst-case scenario in [1]. We first characterize the fundamental limits of any gradient code in the set

G_0 = { A ∈ R^{n×n} | P[err(A_S) > 0] = o(1) }.

Then we design a gradient code to achieve the lower bound.

61 4.3.1 Minimum Computation Load

The minimum computation load can be determined by exhaustively searching over all

possible coding matrices A ∈ G_0. However, there exist Ω(2^{n^2}) possible candidates in G_0 and such a procedure is practically intractable. To overcome this challenge, we construct a new theoretical path: (i) we first analyze the structure of the optimal gradient codes, and establish a lower bound on the minimum failure probability P(min_{u∈R^{n−s}} ‖A_S^T u − 1_n‖^2 > 0) given the computation load d; (ii) we derive an exact estimate of this lower bound, which is a monotonically non-increasing function of d; and (iii) we show that this lower bound is non-vanishing when the computation load d is less than a specific quantity, which yields the desired lower bound.

The following lemma shows that the minimum probability of decoding failure is lower bounded by the minimum probability that there exists an all-zero column of matrix AS in a specific set of matrices.

Lemma 4.3.1. Suppose that the computation load κ(A) = d and define the set of

matrices A_n^d = { A ∈ R^{n×n} | κ(A) = d }; then we have

min_{A∈A_n^d} P(err(A_S) > 0) ≥ min_{A∈U_n^d} P( ∪_{i=1}^{n} { ‖A_{S,i}‖_0 = 0 } ),    (4.3.1)

where the set of matrices U_n^d ≜ { A ∈ R^{n×n} | ‖a_i‖_0 = ‖A_i‖_0 = d, ∀ i ∈ [n] }.

Based on the inclusion-exclusion principle, we observe that the above lower bound

depends on the set system {supp(A_i)}_{i=1}^{n} formed by the matrix A. Therefore, one can directly transform the above minimization problem into an integer program. However, due to the non-convexity of the objective function, it is difficult to obtain a closed-form expression. To reduce the complexity of our analysis, we have the following lemma characterizing a common structure among all matrices in the set U_n^d.

Lemma 4.3.2. For any matrix A ∈ U_n^d, there exists a set I_d ⊆ [n] such that |I_d| ≥ ⌊n/d^2⌋ and

supp(A_i) ∩ supp(A_j) = ∅, ∀ i ≠ j, i, j ∈ I_d.    (4.3.2)

Based on the results of Lemma 4.3.1 and Lemma 4.3.2, we can get an estimation of the lower bound (4.3.1), and obtain the following theorem.

Theorem 4.3.1. Suppose that out of n workers, s = δn are stragglers. If s = Ω(1), the minimum computation load satisfies

d^*(s, 0) ≜ min_{A∈G_0} κ(A) ≥ log( n log^2(1/δ) / log^2(n) ) / log(1/δ);    (4.3.3)

otherwise, d^*(s, 0) = 1.

Based on Theorem 4.3.1, we can observe the power of probabilistic relaxation in reducing the computation load. For example, if the number of stragglers s is proportional to the number of workers n, the minimum computation load is d^*(s, 0) = O(log(n)), while the worst-case lower bound is Θ(n); if s = Θ(n^λ), where 0 < λ < 1 is a constant, the minimum computation load d^*(s, 0) = 1/(1 − λ) is a constant, while the worst-case one is Θ(n^λ). Figure 4.2 provides a quantitative comparison of the proposed ε-approximate gradient coding and the existing schemes.

4.3.2 d-Fractional Repetition Code

In this subsection, we provide a construction of coding matrix A that asymptotically achieves the minimum computation load. The main idea is based on a generalization of the existing fractional repetition code [1].

Definition 4.3.1. (d-Fractional Repetition Code) Divide n workers into d groups of size n/d. In each group, divide all data equally and disjointly, and assign d partitions

to each worker. All the groups are replicas of each other. The coding matrix A^{FRC} is defined as

A^{FRC} = [ A_b ; A_b ; ··· ; A_b ],   A_b =
[ 1_{1×d}  0_{1×d}  ···  0_{1×d}
  0_{1×d}  1_{1×d}  ···  0_{1×d}
    ⋮        ⋮      ⋱      ⋮
  0_{1×d}  0_{1×d}  ···  1_{1×d} ] ∈ R^{(n/d)×n}.

Note that we do not need the assumption that n is a multiple of d. In that case, we can construct the FRC as follows: let the size of each group be ⌊n/d⌋, then randomly choose mod(n, d) groups and increase the size of each by one. The decoding algorithm for the FRC is straightforward: instead of solving the problem

(4.2.3), the master node sums the partial gradients of any n/d workers that contain disjoint data partitions. The following technical lemma proposed in [55] is useful in our theoretical analysis.

Lemma 4.3.3. (Approximate inclusion-exclusion principle) Let n, k be integers with k ≥ Ω(√n), and let E_1, E_2, ..., E_n be a collection of events; then we have

P( ∪_{i∈[n]} E_i ) = (1 + e^{−2k/√n}) · Σ_{I⊆[n], |I|≤k} (−1)^{|I|} P( ∩_{i∈I} E_i ).

The above lemma shows that one can approximately estimate the probability of the event E_1 ∩ ··· ∩ E_n given the probabilities of the events ∩_{i∈I} E_i, |I| < Ω(k^{0.5}).

d = max{ 1, log(n log(1/δ)) / log(1/δ) },    (4.3.4)

then we have P(err(A_S^{FRC}) > 0) = o(1).

Combining the results of Theorem 4.3.1 and Theorem 4.3.2, we can obtain the main argument of Theorem 4.2.1. In a practical implementation of the FRC, once the decoding process fails in the kth iteration, a straightforward method is to restart the kth iteration. Since decoding failures rarely happen across the iterations, such overhead is amortized. As can be seen in the experimental section, during 100 iterations, only one or two iterations suffer a decoding failure.

Figure 4.2: Information-theoretic lower bound of the existing worst-case gradient coding [1] and the proposed ε-approximate gradient coding when n = 1000 (minimum computation load versus the number of stragglers, for exact, 0-, 0.01-, and 0.1-approximate gradient coding).
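To make Definition 4.3.1 and the simple FRC decoding rule concrete, here is a minimal sketch that builds A^FRC and recovers the full gradient by summing the results of one complete group of non-straggling workers; the handling of groups when d does not divide n is omitted, and the data are random placeholders.

```python
import numpy as np

def frc_matrix(n, d):
    """d-fractional repetition code: d groups of n/d workers; within a group each
    worker covers d disjoint partitions, and the groups are replicas of each other."""
    assert n % d == 0
    A_b = np.kron(np.eye(n // d), np.ones((1, d)))     # (n/d) x n block of A^FRC
    return np.tile(A_b, (d, 1))                        # stack d replicas

def frc_decode(A, g, survivors):
    """Sum the partial gradients of any n/d surviving workers with disjoint partitions."""
    n = A.shape[1]
    d = int(A[0].sum())
    group = n // d
    for start in range(0, n, group):                   # try each replica group
        members = [w for w in range(start, start + group) if w in survivors]
        if len(members) == group:
            return (A[members] @ g).sum(axis=0)        # equals the full gradient
    return None                                        # decoding failure: restart the iteration

A = frc_matrix(n=12, d=3)
g = np.random.default_rng(1).standard_normal((12, 5))  # partial gradients
print(frc_decode(A, g, survivors=set(range(4, 12))))   # workers 0-3 straggle
```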

4.4 ε-Approximate Gradient Code

In this section, we consider a more general scenario in which the error of the gradient evaluation ε is larger than zero. We first provide a fundamental three-fold tradeoff among the computation load, the error of the gradient and the number of stragglers for any code in the set

G_ε = { A ∈ R^{n×n} | P[err(A_S) > εn] = o(1) }.    (4.4.1)

Then we construct a random code that achieves this lower bound.

65 4.4.1 Fundamental Three-fold Tradeoff

Based on the theoretical path proposed in Section 4.3.1, we can lower bound the probability that the decoding error is larger than εn by the probability that there exist more than εn all-zero columns in matrix A_S. However, such a lower bound does not admit a closed-form expression, since the probability event is complicated and contains exponentially many partitions. To overcome this challenge, we decompose the above probability event into n dependent events and analyze their second-order moments. Then, we use the Bienayme-Chebyshev inequality to further lower bound the above probability. The following theorem provides the lower bound on the computation load among the feasible gradient codes G_ε.

Theorem 4.4.1. Suppose that out of n workers, s = δn are stragglers, and ε < O(1/log^2(n)); then the minimum computation load satisfies

κ_ε^*(A) ≜ min_{A∈G_ε} κ(A) ≥ log( n log^2(1/δ) / ((2εn + 4) log^2(n)) ) / log(1/δ).

Note that the above result also holds for ε = 0, in which case it is slightly lower than the bound in Theorem 4.3.1. Based on the result of Theorem 4.4.1, we can see that the gradient noise ε provides a logarithmic reduction of the computation load. For example, when n = 1000 and s = 100, 0-approximate gradient coding requires each worker to store 3× data partitions, while even 0.01-approximate gradient coding only requires 2× data partitions. A detailed comparison can be seen in Figure 4.2.

4.4.2 Random Code Design

Now we present the construction of our random code, which we name the batch raptor code

(BRC), which achieves the above lower bound with high probability. The construction

of the BRC consists of two layers. In the first layer, the original data sets {D_i}_{i=1}^{n} are

n/b Definition 4.4.1. ((b, P )-batch rapter code) Given the degree distribution P R ∈ n/b and batches Bi , we define the (b, P )-batch rapter code as: each worker k [n], { }i=1 ∈ stores the data Bi i∈I and computes { }

ib b g˜k = gi = gj, (4.4.2) i∈I i∈I X X j=1+(Xi−1)b where I is a randomly and uniformly subset of [n/b] with I = d, and d is generated | | according to distribution P . The coding matrix ABRC is therefore given by

random d nonzero blocks

11×b 01×b 11×b 01×b ···  z }| {  01×b 01×b 11×b 01×b ABRC ··· . ,  . . . . .   ......   . . . .      01×b 11×b 01×b 11×b  ···    Note that when n is not a multiple d, we can tackle it using a method similar to the one used in FRC. The decoding algorithm for the BRC goes through a peeling decoding

b process: it first finds a ripple worker (with only one batch) to recover one batch gi and add it to the sum of gradient g. Then for each collected results, it subtracts this batch if the computed gradients contains this batch. The whole procedure is listed in

Algorithm 3.
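As a concrete illustration of this construction, the following is a minimal Python sketch of the BRC encoding layer of Definition 4.4.1. It is only a sketch: the function names are ours, the degree distribution is passed in as a generic pmf, and it does not reproduce the exact implementation used later in the experiments.

import numpy as np

def brc_encode(n, b, degree_pmf, rng=None):
    """Sketch of the (b, P)-batch raptor encoding of Definition 4.4.1.

    n          : number of workers (= number of data partitions)
    b          : batch size, so there are n // b batches
    degree_pmf : probabilities [p_1, ..., p_D] for the degree d (must sum to 1)
    Returns a 0/1 matrix of shape (n, n // b); row k marks the batches whose
    partial gradients worker k sums up, as in equation (4.4.2).
    """
    rng = np.random.default_rng() if rng is None else rng
    num_batches = n // b
    degrees = np.arange(1, len(degree_pmf) + 1)
    A = np.zeros((n, num_batches), dtype=int)
    for k in range(n):
        # Sample the degree d from P, then pick d batches uniformly at random.
        d = int(rng.choice(degrees, p=degree_pmf))
        d = min(d, num_batches)
        chosen = rng.choice(num_batches, size=d, replace=False)
        A[k, chosen] = 1
    return A

# Worker k then computes g_tilde_k as the sum of g^b_i over the batches i with A[k, i] = 1.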

Example 1. (Batch raptor code, n = 6, s = 2) Consider a distributed gradient descent problem with 6 workers. In the batch raptor code, the data is first partitioned


Figure 4.3: Example of the batch raptor code and peeling decoding algorithm.

into 4 batches with B_1 = {D_1}, B_2 = {D_2}, B_3 = {D_3, D_4}, B_4 = {D_5, D_6}. After the random construction, the 6 workers are assigned the tasks: g̃_1 = g_1 + g_2, g̃_2 = g_1,

g̃_3 = g_2 + (g_5 + g_6), g̃_4 = (g_3 + g_4) + (g_5 + g_6), g̃_5 = g_5 + g_6, g̃_6 = g_2 + (g_5 + g_6). Suppose that both the 5th and 6th workers are stragglers and the master node collects partial

results from workers {1, 2, 3, 4}. Then we can use the peeling decoding algorithm: first, find a ripple node g̃_2. Then we can use g̃_2 to recover g_2 by g̃_1 − g̃_2. Further, we can use g_2 to obtain a new ripple g_5 + g_6 by g̃_3 − g_2, and use the ripple g_5 + g_6 to recover g_3 + g_4 by g̃_4 − (g_5 + g_6). In another case, change the coding scheme of the 3rd, 4th, and 6th workers to g̃_3 = g_2, g̃_4 = g_3 + g_4, and g̃_6 = g_1 + g_2. Suppose that both the 4th and 6th workers are stragglers. We can use a similar decoding algorithm to recover

g_1 + g_2 + g_5 + g_6 without g_3 and g_4; however, the computation load is decreased from 4 to 2. Actually, the above peeling decoding algorithm can be viewed as an edge-removal

process in a bipartite graph. We construct a bipartite graph with one partition being

the original batch gradients \{g_i^b\} and the other partition being the coded gradients \{g̃_i\}.

Algorithm 3 Batch raptor code (master node's protocol)
repeat
    The master node assigns the data sets according to Definition 4.4.1.
until the master node collects results from the first n − s finished workers.
repeat
    Find a row M_i of the received coding matrix M with ‖M_i‖_0 = 1.
    Let k_0 be the column index of the nonzero element of M_i, and set g = g + g̃_{k_0}.
    for each computation result g̃_k do
        if M_{k k_0} is nonzero then
            g̃_k = g̃_k − M_{k k_0} g̃_{k_0} and set M_{k k_0} = 0.
        end if
    end for
until n(1 − ε) partial gradients are recovered.

Two nodes are connected if the corresponding computation task contains that block. As shown in Figure 4.3, in each iteration we find a ripple (a degree-one node) on the right and remove the adjacent edges of the corresponding left node, which might produce some new ripples on the right. We then iterate this process until all gradients are decoded.
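For concreteness, the peeling decoder of Algorithm 3 can be sketched in Python as exactly this edge-removal process on the received rows. This is a simplified illustration of ours (the bookkeeping for the final sum of gradients is omitted), not the code used in the experiments.

import numpy as np

def brc_peel_decode(M, g_tilde):
    """Sketch of peeling decoding on the received coding matrix.

    M       : (r x B) matrix; row i lists the batches contained in result i
    g_tilde : list of r coded partial gradients (numpy arrays)
    Returns a dict {batch index: recovered batch gradient}.
    """
    M = M.astype(float).copy()
    g_tilde = [g.copy() for g in g_tilde]
    recovered = {}
    while True:
        # A ripple is a row with exactly one unresolved nonzero entry.
        ripples = np.flatnonzero((M != 0).sum(axis=1) == 1)
        if len(ripples) == 0:
            break
        i = int(ripples[0])
        j = int(np.flatnonzero(M[i])[0])          # the single remaining batch
        recovered[j] = g_tilde[i] / M[i, j]
        # Subtract the recovered batch from every other result containing it.
        for k in np.flatnonzero(M[:, j]):
            if k != i:
                g_tilde[k] = g_tilde[k] - M[k, j] * recovered[j]
            M[k, j] = 0.0
    return recovered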

Based on the above graphical illustration, the key to successful decoding for the BRC is the existence of a ripple throughout the edge-removal process, which mainly depends on the degree distribution P and the batch size b. The following theorem shows that, under a specific choice of P and b, the decoding process succeeds with high probability.

Theorem 4.4.2. Define the degree distribution P_w by

p_k = \begin{cases} \dfrac{u}{u+1}, & k = 1, \\[4pt] \dfrac{1}{k(k-1)(u+1)}, & 2 \le k \le D, \\[4pt] \dfrac{1}{D(u+1)}, & k = D + 1, \end{cases}   (4.4.3)

where D = \lfloor 1/\epsilon \rfloor, u = 2\epsilon(1 - 2\epsilon)/(1 - 4\epsilon)^2, and b = \lceil 1/\log(1/\delta) \rceil + 1. Then the (b, P_w)-batch raptor code with decoding Algorithm 3 satisfies

\mathbb{P}\big( \mathrm{err}(A_S^{\mathrm{BRC}}) > c\epsilon \big) < e^{-c_0 n},

and achieves an average computation load of

O\!\left( \frac{\log(1/\epsilon)}{\log(1/\delta)} \right).   (4.4.4)

The above result is based on applying a martingale argument to the peeling decoding process [50]. In practical implementations, the degree distribution can be further optimized for the given n, s, and error ε [52].

4.5 Simulation Results

In this section, we present experimental results obtained at a supercomputing center. We compare our proposed schemes, the d-fractional repetition code (FRC) and the batch raptor code (BRC), against existing gradient coding schemes: (i) the forget-s scheme (stochastic gradient descent), in which the master node only waits for the results of the non-straggling workers; (ii) the cyclic MDS code [1], a gradient coding scheme that guarantees decodability for any s stragglers; and (iii) the Bernoulli gradient code (BGC) [28], an approximate gradient coding scheme that only requires O(log(n)) data copies in each worker. To simulate straggler effects in a large-scale system, we randomly pick s workers that each run a background thread.

4.5.1 Experiment Setup

We implement all methods in Python using MPI4py. Each worker stores the data according to the coding matrix A. During each iteration of the distributed gradient descent, the master node broadcasts the current classifier β^(t) using Isend(); each worker then computes its coded partial gradient g̃_i^(t) and returns the result using Isend(). The master node actively listens to the responses from the workers via Irecv(), and uses Waitany() to keep polling for the earliest finished tasks. Upon receiving enough results, the master stops listening, decodes the full gradient g^(t), and updates the classifier to β^(t+1).

Figure 4.4: The generalization AUC versus running time of applying distributed gradient descent in a logistic regression model, for n = 30 (s = 3, 6) and n = 60 (s = 6, 12). The two proposed schemes FRC and BRC are compared against three existing schemes. The learning rate α is fixed for all the experiments.
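The master-side communication pattern described above can be sketched with MPI4py roughly as follows. This is a minimal sketch rather than the actual experiment code: model_dim, num_iterations, alpha, enough_results() and decode_full_gradient() are assumed to be defined elsewhere, the worker-side loop is omitted, and leftover requests from stragglers would need to be drained or cancelled in a real run.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n_workers = size - 1                          # rank 0 acts as the master

if rank == 0:
    beta = np.zeros(model_dim)                # model_dim: assumed defined
    for t in range(num_iterations):           # num_iterations: assumed defined
        # Broadcast the current classifier to all workers.
        sends = [comm.Isend(beta, dest=w, tag=t) for w in range(1, size)]
        MPI.Request.Waitall(sends)

        # Post non-blocking receives and poll for the earliest finishers.
        bufs = [np.empty(model_dim) for _ in range(n_workers)]
        reqs = [comm.Irecv(bufs[i], source=i + 1, tag=t) for i in range(n_workers)]
        received = []
        while not enough_results(received):               # assumed helper
            idx = MPI.Request.Waitany(reqs)
            received.append((idx, bufs[idx]))

        # Decode the full gradient from the partial results and update.
        g = decode_full_gradient(received)                # assumed helper
        beta = beta + alpha * g                           # alpha: assumed step size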

In our experiment, we ran various schemes to train logistic regression models, a

well-understood convex optimization problem that is widely used in practice. We choose the training data from the LIBSVM dataset repository, using N = 19264097 samples with a model dimension of p = 1163024. We evenly divide the data into n partitions \{D_k\}_{k=1}^{n}. The key step of the gradient descent algorithm is

\beta^{(t+1)} = \beta^{(t)} + \alpha \sum_{k=1}^{n} \underbrace{\sum_{i \in D_k} \eta\,\big(y_i - h_{\beta^{(t)}}(x_i)\big)\, x_i}_{g_k^{(t)}},

where h_{\beta^{(t)}}(\cdot) is the logistic function and α is the predetermined step size.
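The per-partition term in the update above can be computed with a few lines of NumPy; the sketch below is illustrative only (the weight eta simply mirrors the η in the displayed update and can be taken as 1):

import numpy as np

def logistic(z):
    """Logistic function h_beta(x) evaluated at z = x^T beta."""
    return 1.0 / (1.0 + np.exp(-z))

def partial_gradient(beta, X_k, y_k, eta=1.0):
    """Partial gradient g_k for one data partition D_k:
    sum over i in D_k of eta * (y_i - h_beta(x_i)) * x_i.

    beta : current classifier, shape (p,)
    X_k  : samples in partition D_k, shape (|D_k|, p)
    y_k  : labels in {0, 1}, shape (|D_k|,)
    """
    residual = y_k - logistic(X_k @ beta)     # y_i - h_beta(x_i)
    return eta * (X_k.T @ residual)

# An uncoded gradient step with step size alpha would then be
#   beta = beta + alpha * sum(partial_gradient(beta, X_k, y_k) over all partitions k)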

4.5.2 Generalization Error

We first compare the generalization AUC of the above five schemes when the number of workers is n = 30 or 60 and 10% or 20% of the workers are stragglers. In Figure 4.4, we plot the generalization AUC versus the running time of all the schemes under different n and s. We can observe that our proposed schemes (FRC and BRC) achieve significantly better generalization performance compared to the existing ones. The forget-s scheme (stochastic gradient descent) converges slowly, since it does not utilize the full gradient and only admits a small step size α compared to the other schemes. In particular, when the number of workers increases, our proposed schemes provide an even larger speedup over the state of the art.

4.5.3 Impact of Straggler Tolerance

We further investigate the impact of straggler tolerance s. We fix the number of workers n = 30 or n = 60 and increase the fraction of stragglers from 10% to 30%. In

Figure 4.5, we plot the job completion time to achieve a fixed generalization AUC

= 0.8. The first observation is that our proposed schemes reduce the completion time by 50% compared to the existing ones. The cyclic MDS code and the forget-s (stochastic gradient descent) schemes are sensitive to the number of stragglers. The main reasons are: (i) the computation load of the cyclic MDS code is linear in s; and (ii) the available step size of the forget-s scheme is reduced when the number of received partial gradients decreases. The job completion times of the proposed FRC and BRC are not sensitive to the straggler tolerance s, especially when the number of workers n is large. For example, the job completion time of the BRC only increases by 10% when the fraction of stragglers increases from 10% to 30%. Besides, we observe that, when the straggler tolerance is small, i.e., s/n < 0.1, the job completion time of the FRC is slightly lower than that of the BRC, because the computation loads of the FRC and BRC are similar in this case and the FRC utilizes the information of the full gradient.

Figure 4.5: The final job completion time of achieving a generalization AUC = 0.8 in a logistic regression model (left: n = 30 with s = 3, 6, 9; right: n = 60 with s = 6, 12, 18). The two proposed schemes FRC and BRC are compared against three existing schemes. The learning rate α is fixed for all the experiments.

CHAPTER 5

CONCLUSION

In this dissertation, we considered a new frontier, named coded computation, for mitigating the slow-machine problem prevalent in distributed and large-scale machine learning. We observed that, although traditional coded computation schemes perform well on traditional distributed computation problems, they have several limitations in handling large-scale machine learning problems. We focused on three kinds of elementary computation components in modern machine learning: the linear transform, matrix multiplication, and the distributed optimization problem. We designed several new coded computation schemes that were specifically tailored to these problems, and provided a deeper understanding of how these new approaches perform on them. We also constructed a new theoretical framework that relates the construction of a coded computation scheme to the construction of a random bipartite graph that contains a perfect matching. Through comprehensive simulations, we showed that our constructed codes provide significant speedups compared to state-of-the-art techniques.

First, we illustrated the main challenges facing existing coded computation when applied to machine learning problems: (i) the existing coded computation schemes do not focus on designing short-length codes; and (ii) they also create a large computation overhead in machine learning applications. To show this effect, we ran a large-scale PageRank application and observed that the existing optimal coded computation scheme can destroy the sparsity of the input data set, which further increases the density of the training data and thus increases the computation time. This observation motivated us to design a completely new coded computation scheme that is suitable in the context of large-scale machine learning applications.

Second, we started our investigation in the linear transformation problem. We

first proposed a new performance metric of computation load (number of redundant computations in each worker). We analyzed the minimum computation load of any coded computation scheme for the linear transformation problem, and constructed a code we called the diagonal code that exactly matched the lower bound. An important feature in this part is that we constructed a new theoretical framework to relate the construction of the coded computation scheme to the design of the random bipartite graph that contains a perfect matching. Based on this framework, we further constructed several random codes that provide even lower computation load

(compared to the deterministic lower bound) with high probability. We implemented our coded computation scheme at the Ohio Supercomputer Center and compared it with the state of the art using real-world data sets. We observed a significant speed-up provided by our schemes.

We further considered a more complex problem: matrix multiplication. We showed that the previously constructed coded computation scheme designed for the linear transformation problem leads to a large decoding overhead for the matrix multiplication problem. To handle this issue, we focused on designing a code that has a sparse generator matrix but preserves enough structure that the decoding process can exploit the sparsity of the data. Motivated by the peeling decoding process of the LT code, we designed a new sparse code generated by a specifically designed degree distribution, a variant of the soliton distribution that we named the wave soliton distribution. We then designed a type of hybrid decoding algorithm between peeling decoding and the Gaussian elimination process, which eliminates the drawbacks of the LT code in the short-length regime and provides a fast decoding process for this problem. We compared our proposed code with existing coded computation schemes on real-world large-scale matrix multiplication and observed that our proposed sparse code provides at least a 50% speedup compared to the state of the art.

Finally, we shifted our focus to the distributed optimization problem, or gradient coding problem, whose major objective is to compute a sum of functions in a distributed manner. We observed that the existing gradient coding schemes designed for the worst-case scenario yield a large computation redundancy. To overcome this challenge, we proposed the idea of approximate gradient coding, which aims to compute the sum of functions approximately. We analyzed the minimum computation load for the approximate gradient coding problem and further constructed two approximate gradient coding schemes, the fractional repetition code and the batch raptor code, that asymptotically match this lower bound. We applied our proposed schemes to a classical gradient descent algorithm for solving the logistic regression problem. Based on our implementation at the Ohio Supercomputer Center, we observe that our schemes provide at least 2× faster convergence compared to existing schemes.

The investigations in this dissertation can be viewed as initial steps toward further research directions. The current results mostly focus on elementary building blocks, such as the linear transformation, of many machine learning problems. In future research, we will consider the following directions.

(i) System-level optimization of the coded computation scheme. The existing designs of coded computation schemes are based on certain abstractions of the underlying distributed system. Future research will focus on the system-level design of the current coded computation schemes. For example, the machines in a data center have different computation power and thus different computation times, a phenomenon that further worsens the slow-machine problem. One interesting question is how to design coded computation schemes for such heterogeneous clusters.

(ii) Design of the coded computation scheme according to the job completion time distribution. The slow-machine problem can be regarded as a tail problem of a distribution. For example, suppose that we have n workers, and each worker's local computation time is i.i.d. exponentially distributed with parameter μ. From basic order statistics, we know that the expected completion time of the slowest worker is about log(n) times the average. One interesting question is how to exploit the completion-time distribution in designing the coded computation scheme. In future research, besides the distribution of the computation time, we can also consider the distribution of the communication time between the local workers and the master node in our design.
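For reference, the standard order-statistics calculation behind the log(n) claim is the following, writing T_1, ..., T_n for the i.i.d. Exp(μ) computation times and using the memoryless decomposition of the maximum into independent exponential spacings:

\mathbb{E}\Big[\max_{1 \le i \le n} T_i\Big] = \frac{1}{n\mu} + \frac{1}{(n-1)\mu} + \cdots + \frac{1}{\mu} = \frac{H_n}{\mu} \approx \frac{\ln n}{\mu}, \qquad \text{while } \mathbb{E}[T_i] = \frac{1}{\mu},

so the expected finishing time of the slowest of the n workers is roughly ln(n) times the average per-worker time.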

(iii) Efficient gradient coding scheme design for adversarial learning. In an adversarial learning environment, some adversarial agents might send back an incorrect result that deviates from the true result at the master node. For example, in distributed gradient descent, one extremely negative partial gradient can reverse the direction of the full gradient. The existing work in [56] has designed a coding scheme that can detect and resolve such adversarial results; however, it still incurs a large computation and storage overhead for each local machine. For future research, we can exploit our idea of approximate gradient coding to design more efficient codes and reduce the computation load.

(iv) Theoretical investigation of the nonlinear data encoding problem. Current data encoding schemes are designed for linear operations. For nonlinear operations, such as in the gradient coding problem, the coded computation actually plays the role of assigning data partitions rather than encoding the data. The existing work [22] has designed data encoding algorithms for the least squares problem: it utilizes the restricted isometry property to design several data encoding algorithms and shows a speedup compared to the state of the art. For future research, we can focus on designing data encoding schemes for nonlinear problems such as the logistic regression problem. One key difference between this problem and the existing least squares problem is that the gradient of logistic regression cannot be decomposed into a series of linear operations, due to the presence of the logistic function; it is therefore more challenging to design the data encoding scheme. Compared to the standard gradient coding approach, each local worker would only store one (coded) data partition and evaluate one partial gradient during each iteration.

BIBLIOGRAPHY

[1] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in International Conference on Machine Learning, pp. 3368–3376, 2017.

[2] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsuper- vised feature learning,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223, 2011.

[3] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on audio, speech, and language processing, vol. 20, no. 1, pp. 30–42, 2012.

[4] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,” Journal of machine learning research, vol. 3, no. Feb, pp. 1137– 1155, 2003.

[5] S. Schulter, P. Vernaza, W. Choi, and M. Chandraker, “Deep network flow for multi-object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6951–6960, 2017.

[6] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Ng, “Large scale distributed deep networks,” in Advances in neural information processing systems, pp. 1223–1231, 2012.

[7] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, “Building high-level features using large scale unsupervised learning,” arXiv preprint arXiv:1112.6209, 2011.

[8] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[9] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets.,” HotCloud, vol. 10, no. 10-10, p. 95, 2010.

[10] N. J. Yadwadkar, B. Hariharan, J. E. Gonzalez, and R. Katz, “Multi-task learning for straggler avoiding predictive job scheduling,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 3692–3728, 2016. 79 [11] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013. [12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault- tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 2–2, USENIX Association, 2012. [13] J. Teevan, K. Collins-Thompson, R. W. White, S. T. Dumais, and Y. Kim, “Slow search: Information retrieval without time constraints,” in Proceedings of the Symposium on Human-Computer Interaction and Information Retrieval, p. 1, ACM, 2013. [14] G. Ananthanarayanan, S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris, “Reining in the outliers in map-reduce clusters using mantri.,” in Osdi, vol. 10, p. 24, 2010. [15] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving mapreduce performance in heterogeneous environments.,” in Osdi, vol. 8, p. 7, 2008. [16] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, “Dremel: interactive analysis of web-scale datasets,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 330–339, 2010. [17] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica, “Effective straggler mitigation: Attack of the clones.,” in NSDI, vol. 13, pp. 185–198, 2013. [18] S. Dutta, V. Cadambe, and P. Grover, “Short-dot: Computing large linear transforms distributedly using coded short dot products,” in Advances In Neural Information Processing Systems, pp. 2100–2108, 2016. [19] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speed- ing up distributed machine learning using codes,” IEEE Transactions on Infor- mation Theory, 2017. [20] K. Lee, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Coded computa- tion for multicore setups,” in 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2413–2417, IEEE, 2017. [21] Q. Yu, M. Maddah-Ali, and S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in Advances in Neural Information Processing Systems, pp. 4406–4416, 2017. [22] C. Karakus, Y. Sun, S. Diggavi, and W. Yin, “Straggler mitigation in distributed optimization through data encoding,” in Advances in Neural Information Pro- cessing Systems, pp. 5440–5448, 2017. 80 [23] Y. Yang, P. Grover, and S. Kar, “Coded distributed computing for inverse problems,” in Advances in Neural Information Processing Systems, pp. 709–719, 2017.

[24] S. Wang, J. Liu, and N. Shroff, “Coded sparse matrix multiplication,” ICML, 2018.

[25] K. Lee, C. Suh, and K. Ramchandran, “High-dimensional coded matrix multipli- cation,” in Information Theory (ISIT), 2017 IEEE International Symposium on, pp. 2418–2422, IEEE, 2017.

[26] J. L. Sinong Wang and N. Shroff, “Coded sparse matrix multiplications,” in ICML, 2018.

[27] N. S. Sinong Wang, Jiashang Liu and P. Yang, “Computation efficient coded linear transform,” in AISTATS, 2019.

[28] Z. Charles, D. Papailiopoulos, and J. Ellenberg, “Approximate gradient coding via sparse random graphs,” arXiv preprint arXiv:1711.06771, 2017.

[29] N. Raviv, I. Tamo, R. Tandon, and A. G. Dimakis, “Gradient coding from cyclic mds codes and expander graphs,” 2018.

[30] R. K. Maity, A. S. Rawat, and A. Mazumdar, “Robust gradient descent via moment encoding with ldpc codes,” SysML, 2018.

[31] M. Ye and E. Abbe, “Communication-computation efficient gradient coding,” in International Conference on Machine Learning, 2018.

[32] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded mapreduce,” in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 964–971, IEEE, 2015.

[33] S. Li, S. Supittayapornpong, M. A. Maddah-Ali, and S. Avestimehr, “Coded tera- sort,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 389–398, IEEE, 2017.

[34] U. Sheth, S. Dutta, M. Chaudhari, H. Jeong, Y. Yang, J. Kohonen, T. Roos, and P. Grover, “An application of storage-optimal matdot codes for coded matrix multiplication: Fast k-nearest neighbors estimation,” in 2018 IEEE International Conference on Big Data (Big Data), pp. 1113–1120, IEEE, 2018.

[35] N. Ferdinand and S. C. Draper, “Hierarchical coded computation,” in 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1620–1624, IEEE, 2018.

81 [36] Q. Yu, S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “How to optimally allocate resources for coded distributed computing?,” in 2017 IEEE International Conference on Communications (ICC), pp. 1–7, IEEE, 2017. [37] Q. Yu, N. Raviv, J. So, and A. S. Avestimehr, “Lagrange coded computing: Opti- mal design for resiliency, security and privacy,” arXiv preprint arXiv:1806.00939, 2018. [38] M. Luby, “Lt codes,” in Foundations of Computer Science, 2002. Proceedings. The 43rd Annual IEEE Symposium on, pp. 271–280, IEEE, 2002. [39] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.,” tech. rep., Stanford InfoLab, 1999. [40] O. S. Center, “Ohio supercomputer center.” http://osc.edu/ark:/19495/ f5s1ph73, 1987. [41] T. Tao and V. Vu, “On the singularity probability of random bernoulli matrices,” Journal of the American Mathematical Society, vol. 20, no. 3, pp. 603–628, 2007. [42] J. Bourgain, V. H. Vu, and P. M. Wood, “On the singularity probability of discrete random matrices,” Journal of Functional Analysis, vol. 258, no. 2, pp. 559–603, 2010. [43] J. T. Schwartz, “Fast probabilistic algorithms for verification of polynomial identities,” Journal of the ACM (JACM), vol. 27, no. 4, pp. 701–717, 1980. [44] R. K. Maity, A. S. Rawat, and A. Mazumdar, “Robust gradient descent via moment encoding with ldpc codes,” in SysML, 2018. [45] S. Wang, J. Liu, and N. Shroff, “Fundamental limits of approximate gradient coding,” arXiv preprint arXiv:1901.08166, 2019. [46] P. Erdos and A. Renyi, “On random matrices,” Magyar Tud. Akad. Mat. Kutat´o Int. K¨ozl, vol. 8, no. 455-461, p. 1964, 1964. [47] D. W. Walkup, “Matchings in random regular bipartite digraphs,” Discrete Mathematics, vol. 31, no. 1, pp. 59–64, 1980. [48] T. A. Davis and Y. Hu, “The university of florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 1, 2011. [49] A. Frieze and B. Pittel, “Perfect matchings in random graphs with prescribed minimal degree,” in Mathematics and Computer Science III, pp. 95–132, Springer, 2004. [50] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, and D. A. Spielman, “Efficient erasure correcting codes,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 569–584, 2001. 82 [51] M. Luby, “Tornado codes: Practical erasure codes based on random irregu- lar graphs,” in International Workshop on Randomization and Approximation Techniques in Computer Science, pp. 171–171, Springer, 1998.

[52] A. Shokrollahi, “Raptor codes,” IEEE transactions on information theory, vol. 52, no. 6, pp. 2551–2567, 2006.

[53] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186, Springer, 2010.

[54] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens, “Adding gradient noise improves learning for very deep networks,” arXiv preprint arXiv:1511.06807, 2015.

[55] N. Linial and N. Nisan, “Approximate inclusion-exclusion,” Combinatorica, vol. 10, no. 4, pp. 349–365, 1990.

[56] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, “Draco: byzantine-resilient distributed training via redundant gradients,” arXiv preprint arXiv:1803.09877, 2018.

APPENDIX A: PROOFS FOR CHAPTER 2

A.1 Proof of Theorem 2.3.1

Given parameters m and n, consider any coded computation scheme that can resist s stragglers. We first define the following bipartite graph model between the m workers, indexed by [m], and the n data partitions, indexed by [n]: we connect node i ∈ [m] and node j ∈ [n] if m_{ij} ≠ 0 (worker i has access to data block A_j). The degree of worker node i ∈ [m] is ‖M_i‖_0. We next show by contradiction that the degree of node j ∈ [n] must be at least s + 1. Suppose that it is less than s + 1 and all its neighbors are stragglers. In this case, there exists no non-straggling worker with access to A_j (or, equivalently, the corresponding submatrix of the coding matrix is rank deficient). Hence, this contradicts the assumption that the scheme can resist s stragglers.

Based on the above argument, and on the fact that the sum of the degrees of one partition equals the sum of the degrees of the other partition in the bipartite graph, we have that the computation load satisfies

l(M) = \sum_{i=1}^{n} \| M_i \|_0 \;\ge\; n(s + 1).   (A.1.1)

Therefore, the theorem follows.

A.2 Proof of Theorem 2.3.3

Based on our construction, the cardinality of the set

\big| [n] \setminus \{ i_1, \ldots, i_k \} \big| = n - k.   (A.2.1)

Since n < i_{k+1} < \cdots < i_n \le n + s, we have n - k \le s. We next show that, after we recover the blocks indexed by [n] \setminus \{ i_1, \ldots, i_k \}, the remaining blocks can be recovered by peeling decoding without further rooting steps. Combining these results, the total number of rooting steps is at most s.

Since we utilize the rooting step to recover the blocks indexed by [n] \setminus \{ i_1, \ldots, i_k \}, we obtain that the i-th column of the matrix M satisfies M_i = 0 for i \in \{ 1, \ldots, i_1 - 1 \}. Based on our construction of the s-diagonal code, we have m_{i_1 i_1} \ne 0, which implies that the i_1-th block is a ripple. Then we can use the result \tilde{y}_{i_1} to recover the block y_{i_1} and peel the i_1-th column, which implies that M_{i_1} = 0. Using a similar process, we can find a ripple \tilde{y}_{i_2} and peel the i_2-th column. Continuing this process, we can peel the i_k-th column.

We now analyze the complexity of the above procedure. During each iteration, the complexity of the operation \tilde{y}_j = \tilde{y}_j - m_{ji} A_i x is O(r/n), and there are in total n(s+1) such operations, so the complexity of the peeling decoding is O(r(s+1)). The complexity of the s rooting steps (3.3.2) is O(rs). Therefore, the total complexity is O(rs) and the theorem follows.

A.3 Proof of Lemma 2.4.2

A direct application of Hall's theorem is the following: given a bipartite graph G^D(V_1, V_2), for each U \subseteq [m] with |U| = n, the subgraph G^D(U, V_2) contains a perfect matching if and only if there exists no subset S \subseteq V_1 with |N(S)| < |S|, where the neighboring set is defined as N(S) = \{ y \mid x, y \text{ are connected for some } x \in S \}. This result is equivalent to the following condition: for each subset I \subseteq [m],

\Big| \bigcup_{i \in I} \mathrm{supp}(M_i) \Big| \;\ge\; |I|,   (A.3.1)

where the support set is defined as \mathrm{supp}(M_i) = \{ j \mid m_{ij} \ne 0, j \in [n] \} and M_i is the i-th row of the coding matrix M. Suppose that the set I = \{ i_1, i_2, \ldots, i_k \} with i_1 < i_2 < \ldots < i_k and \mathrm{supp}(M_{i_{k_1}}) \cap \mathrm{supp}(M_{i_{k_2}}) \ne \emptyset; otherwise, we can divide the set into two parts I_l = \{ i_1, \ldots, i_{k_1} \} and I_r = \{ i_{k_2}, \ldots, i_k \} and prove a similar result for each of the two sets. Based on our construction of the diagonal code, we have

\Big| \bigcup_{i \in I} \mathrm{supp}(M_i) \Big| = \min\{ i_k, n \} - \max\{ 1, i_1 - s \} \overset{(a)}{\ge} k,   (A.3.2)

where step (a) is based on the fact that i_k - i_1 \ge k. Therefore, the lemma follows.

A.4 Proof of Corollary 2.4.1.1

To prove that the 1-diagonal code achieves the recovery threshold n, we need to show that, for each subset U \subseteq [n + 1] with |U| = n, the submatrix M^U is full rank. Let U = [n + 1] \setminus \{ k \}; the submatrix M^U satisfies

M^U = \begin{bmatrix} E & 0 \\ 0 & F \end{bmatrix},   (A.4.1)

where E is a (k-1)-dimensional square submatrix consisting of the first (k-1) rows and columns, and F is an (n-k+1)-dimensional square submatrix consisting of the last (n-k+1) rows and columns. The matrix E is lower triangular due to the fact that, for i < j,

E_{ij} = m_{ij} = 0.   (A.4.2)

The matrix F is upper triangular due to the fact that, for i > j,

The above, (a) utilizes the fact that (i + k) (j + k 1) 2 when i > j. Based on − − ≥ the above analysis, we have

det(MU ) = det(E) det(F) = 1, (A.4.4) · which implies that matrix MU is full rank. Therefore, the corollary follows.

A.5 Proof of Theorem 2.5.1

Based on our analysis in Section ??, the full rank probability of an n n submatrix × MU can be lower bounded by a constant times the probability of the existence of a perfect matching in a bipartite graph.

U P( M = 0) = | | 6 U U U P( M = 0 M (x) 0) P( M (x) 0) + | | 6 | | 6≡ · | | 6≡ n contains perfect matching S-Z Lemma: ≥1−1/2Cm

U U U P| ( M = 0{zM (x) 0)} P(|M (x{z) 0)} (A.5.1) | | 6 | | ≡ · | | ≡ 0

| {z } Therefore, to prove that the p-Bernoulli code achieves the probabilistic recovery threshold of n, we need to show that each subgraph contains a perfect matching

87 with high probability. Without loss of generality, we can define the following random bipartite graph model.

b Definition A.5.1. (p-Bernoulli random graph) Graph G (U, V2, p) initially contains isolated nodes with U = V2 = n. Then each node v1 U and node v2 V2 is | | | | ∈ ∈ connected with probability p independently.

Clearly, the above model describes the support structure of each submatrix MU of p-Bernoulli code. The rest is to show that, with specific choice of p, the subgraph

b G (U, V2, p) contains a perfect matching with high probability. The technical idea is to use Hall’s theorem. Assume that the bipartite graph

b G (U, V2, p) does not have a perfect matching. Then by Hall’s condition, there exists a violating set S U or S V2 such that N(S) < S . Formally, by choosing such ⊆ ⊆ | | | | an S having smallest cardinality, one immediate consequence is the following technical statement.

b Lemma A.5.1. If the bipartite graph G (U, V2, p) does not contain a perfect matching, then there exists a set S U or S V2 with the following properties. ⊆ ⊆

1. S = N(S) + 1. | | | |

2. For each node t N(S), there exists at least two adjacent nodes in S. ∈

3. S n/2. | | ≤

Case 1: We consider S U and S = 1. In this case, we have N(S) = 0 and ⊆ | | | | need to estimate the probability that there exists one isolated node in partition U.

Let random variable Xi be the indicator function of the event that node vi is isolated. Then we have the probability that

n P(Xi = 1) = (1 p) , − 88 Let X be the total number of isolated nodes in partition U. Then we have

n (a) 1 [X] = E X = n (1 p)n . (A.5.2) E i n " i=1 # − ≤ X The above, step (a) utilizes the assumption that p = 2 log(n)/n and the inequality that (1 + x/n)n ex. ≤ Case 2: We consider S U and 2 S n/2. Let E be the event that such an ⊆ ≤ | | ≤ S exists, we have

n/2 k−1 n n k k(n−k+1) 2(k−1) P(E) (1 p) p ≤ k k 1 2 − k=2   −   Xn/2 (a) 1 n n2n < 6 · (n k)(n k + 1) · k2k(n k)2(n−k) k=2 X − − − k(k 1) k−1 − 2   (1 p)k(n−k+1)p2(k−1) − n/2 k−1 (b) e2n 2 log2(n) < 6k2(n k)(n k + 1) · n k=2 X − −   n/2 k−1 (c) 2 2 log2(n) < 3(n 1) · n k=2 X −   log2(n) < . (A.5.3) 3n

The above, step (a) is based on the inequality

n n 60 n n √2πn n! √2πn , n 5. (A.5.4) e ≤ ≤ 59 e ∀ ≥    

The step (b) utilizes the fact that p = 2 log(n)/n, k n/2 and the inequality ≤ (1 + x/n)n ex; step (c) is based on the fact that k(n k + 1) 2(n 1) and ≤ − ≥ − k(n k) 2(n 2), n/(n 2) < 5/3 for k 2 and n 5. Utilizing the union bound − ≥ − − ≥ ≥ 89 to sum the results in case 1 and case 2, we can obtain that the probability that graph

G(U, V2, p) contains a perfect matching is at least

log2(n) 1 . (A.5.5) − 3n

Therefore, incorporating this result into estimating (A.5.1), the theorem follows.

A.6 Proof of Theorem 2.5.2

To prove the (d1, d2)-cross code achieves the probabilistic recovery threshold of n, we need to show that each subgraph of the following random bipartite graph contains a perfect matching with high probability.

c Definition A.6.1. ((d1, d2)-regular random graph) Graph G (V1,V2, d1, d2) initially contains the isolated nodes with V1 = m and V2 = n. Each node v1 V1 (v2 V2) | | | | ∈ ∈ randomly and uniformly connects to d1 (d2) nodes in V1 (V2).

The corresponding subgraph is defined as follows.

c ¯ Definition A.6.2. For each U V1 with U = n, the subgraph G (U, V2, d1, d) is ⊆ | | obtained by deleting the nodes in V1 U and corresponding arcs. \

Clearly, the above definitions of (d1, d2)-regular graph and corresponding subgraph describe the support structure of the coding matrix M and submatrix MU of the

(d1, d2)-cross code. Moreover, we have the following result regarding the structure of c ¯ each subgraph G (U, V2, d1, d) c ¯ Claim. For each U V1 with U = n, the subgraph G (U, V2, d1, d) can be ⊆ | | c ¯ constructed from the following procedure” (i) Initially, graph G (U, V2, d1, d) contain the isolated nodes with U = V2 = n; (ii) Each node v1 U randomly and uniformly | | | | ∈

90 connects to d1 nodes in V2; (iii) Each node v2 V2 randomly and uniformly connects ∈ to l nodes in V1, where l is chosen according to the distribution:

n m n m P(l) = − , 0 l d2. (A.6.1) l d2 l d2 ≤ ≤   −  

c ¯ Then, the rest is to show that the subgraph G (U, V2, d1, d) contains a perfect matching with high probability.

¯ Definition A.6.3. (Forbidden k-pair) For a bipartite graph G(U, V2, d1, d), a pair

(A, B) is called a k-blocking pair if A U with A = k, B V2 with B = n k + 1, ⊆ | | ⊆ | | − and there exists no arc between the nodes of sets A and B. A blocking k-pair (A, B) is called a forbidden pair if at least one of the following holds:

1. 2 k < (n + 1)/2, and for any v1 A and v2 V2 B, (A v1 ,B v2 ) is ≤ ∈ ∈ \ \{ } ∪ { } not a (k 1)-blocking pair. −

2. (n + 1)/2 k n 1, and for any v1 U A and v2 B, (A v1 ,B v2 ) ≤ ≤ − ∈ \ ∈ ∪ { } \{ } is not a (k + 1)-blocking pair.

The following technical lemma modified from [47] is useful in our proof.

c ¯ Lemma A.6.1. If the graph G (U, V2, d1, d) does not contain a perfect matching, then there exists a forbidden k-pair for some k.

Proof. One direct application of the Konig’s theorem to bipartite graph shows that c ¯ G (U, V2, d1, d) contains a perfect matching if and only if it does not contain any blocking k-pair. It is rest to show that the existence of a k-blocking pair implies that there exists a forbidden l pair for some l. Suppose that there exists a k-blocking − pair (A, B) with k < (n + 1)/2, and it is not a forbidden k-pair. Otherwise, we already find a forbidden pair. Then, it implies that there exists v1 A and v2 V2 B ∈ ∈ \ 91 such that (A v1 ,B v2 ) is a (k 1)-blocking pair. Similarly, we can continue \{ } ∪ { } − above argument on blocking pair (A v1 ,B v2 ) until we find a forbidden pair. \{ } ∪ { } Otherwise, we will find a 1 blocking pair (A0,B0), which is a contradiction to our − assumption that each node v1 U connects d1 nodes in V2. The proof for k (n+1)/2 ∈ ≥ is same.

¯ Let E be the event that graph G(U, V2, d1, d) contains perfect matching. Based on the the results of Lemma A.6.1, we have

n−1 1 P(E) = P k-forbidden pair exists − k=2 ! n−1 [ P (k-forbidden pair exists) ≤ k=2 X n−1 n n P ((A, B) is k-forbidden pair) ≤ k n k + 1 · k=2 X   −  n−1 n n = α(k)β(k). k n k + 1 k=2 X   −  The above, A and B are defined as node sets such that A U with A = k and ⊆ | | B V2 with B = n k + 1. The α(k) and β(k) are defined as follows. ⊆ | | −

α(k) = P (A, B) is k-forbidden pair (A, B) is k-blocking pair , (A.6.2)  β(k) = P ((A, B) is k-blocking pair) . (A.6.3)

From the Definition A.6.3, it can be obtained the following estimation of probability

β(k).

k 1 n k β(k) = − d d ·  1  1

92 n−k+1 d2 n k m n m − − . (A.6.4) l d2 l d2 " l=0 # X   −   The first factor gives the probability that there exists no arc from nodes of A to nodes of B. The second factor gives the probability that there exists no arc from nodes of B to nodes of A. The summation operation in the second factor comes from conditioning such probability on the distribution (A.6.1). Based on the Chu-Vandermonde identity, one can simplify β(k) as

k 1 n k m k m n−k+1 β(k) = − − . (A.6.5) d d · d d  1  1  2  2

Utilizing the inequality

n n 1 n n √2πn n! e 12n √2πn , (A.6.6) e ≤ ≤ e     we have

n n n2n k n k + 1 ≤k2k(n k)2(n−k) ·   −  − ne1/6n , (A.6.7) 2π(n k)(n k + 1) − −

k 1 n (k 1)(n d1) − c1 − − d1 d1 ≤ (k d1 1)n ·    s − − k−d1−1 n−d1 d1 k 1 n d1 k 1 − − − k d1 1 n n  − −      1 −d d (a) 2 1 1 k 1 n d1 k 1 c1 − − − . (A.6.8) ≤ k d1 1 n n r − −    

93 1 −d (a) 2 2 m k m m k m d2 − c2 − − d2 d2 ≤ m d2 k m    r − −   m k d2 − . (A.6.9) m  

n x In the above, step (a) is based on fact that (1 + x/n) e and parameters c1 and c2 ≤ are defined as

1/12(k−1) 1/12(n−d1) 1/12(m−k) 1/12(m−d2) c1 = e e , c2 = e e . (A.6.10)

Combining the equations (A.6.7)-(A.6.10), we can obtain that

n n γ(k) = β(k) k n k + 1   − 2k  1 n(m k)d2

2 d +13d1/12+d2+1/3 The constant c3 is given by c3 = e 1 /2π. The third term satisfies, 2 k n 1, ∀ ≤ ≤ −

n2(m k)d2 n2(m n + 1)d2 − max 1, − , (A.6.12) (n k)2md2 ≤ md2 −   which is based on the fact that, if d2 > 2, the function

n2(m k)d2 f(k) = − (n k)2md2 − is monotonically decreasing when k (d2n 2m)/(d2 2) and increasing when ≤ − − k (d2n 2m)/(d2 2). If d2 = 2, it is monotonically increasing for k 0. ≥ − − ≥ We then estimate the conditional probability α(k). Given a blocking pair A U ⊆ 94 with A = k and B V2 with B = n k + 1, and a node vi A, let Ei be the set | | ⊆ | | − ∈ 0 of nodes in V2 B on which d1 arcs from node vi terminate. Let E be the set of nodes \ v in V2 B such that at least 2 arcs leaving from v to nodes in A. Then we have the \ following technical lemma.

Lemma A.6.2. Given a blocking pair A U with A = k and B V2 with ⊆ | | ⊆ B = n k + 1, if (A, B) is k-forbidden pair, then | | −

k ∗ 0 E = Ei E = V2 B. i=1 ! ∪ \ [

∗ Proof. Suppose that there exists node v V2 (E B), then there exists no arc from ∈ \ ∪ A to v and there exists at most 1 arc from v to A. If such an arc exists, let v0 be the corresponding terminating node in A. Then we have (A v0 ,B v ) is a blocking \{ } ∪ { } pair, which is contradictory to the definition of forbidden pair. If such an arc does not exist, let v0 be the an arbitrary node in A. Then we have (A v0 ,B v ) is a \{ } ∪ { } blocking pair, which is also contradictory to the definition of forbidden pair.

The lemma A.6.2 implies that we can upper bound the conditional probability by

k 0 k k−1 α(k) P Ei E = V2 B = (1 P1 P2) , ≤ " i=1 ! ∪ \ # − [ where P1 and P2 is defined as: for any node v V2 B, ∈ \

k 2 k 1 k d1 1 P1 = P(v / Ei) = − − = − − , ∈ d1 d1 k 1    − 0 P2 = P(v / E ) ∈ d2 d2 k n k n = 1 P(l2) − − · l1 l2 l1 l2 l =2 l =l X1 X2 1   −   (a) m k m k m = − + k − d2 d2 1 d2    −    95 m−k m−k−d2 k (b) m k m d2 m d2 > e1/6 − − − m m d2 k m    − −    (c) 1/6−d2 > c4e . (A.6.13)

The above, step (a) utilizes Chu-Vandermonde identity twice; step (b) is adopts the inequality (A.6.6); step (c) is based on the fact that if n is sufficiently large,

n −x (1 x/n) c5e , where c5 is a constant. Combining the above estimation of P1 − ≥ and P2, we have the following upper bound of α(k).

k k−1 k d1 1 α(k) < 1 c6 − − . (A.6.14) − k 1 "  −  #

c ¯ We finally estimate the probability that the graph G (U, V2, d1, d) contains a perfect matching under the following two cases.

Case 1: The number of stragglers s = poly(log(n)). Let d1 = 2, d2 = 3. Based on the estimation (A.6.12), we have that, for n sufficiently large,

n2(m k)3 n2(s + 1)3 − max 1, 1. (A.6.15) (n k)2m3 ≤ (n + s)3 ≤ −  

Combining the above results with the estimation of β(k), we have

c e−2 γ(k) 3 , 2 k n 1. (A.6.16) ≤ n ≤ ≤ −

Then we can obtain that

¯ P(G(U, V2, d1, d) contains perfect matching) n n n 1 α(k)β(k) ≥ − k n k + 1 k=1 X   − 

96 n−1 c e−2 1 3 α(k) ≥ − n k=2 X k−1 −2 n−1 k (a) c3e k 3 >1 1 c6 − − n − k 1 k=2 " # X  −  (b) c >1 7 . (A.6.17) − n

The above, step (a) utilizes the estimation of α(k) in (A.6.14); step (b) is based on

2 k−1 estimating the tail of the summation as geometric series (1 c6/e ) . − α Case 2: The number of stragglers s = Θ(n ), α < 1. Let d1 = 2, d2 = 2/(1 α). − For 2 k n 2, we have ≤ ≤ −

n2(m k)3 n2(nα + 2)2/(1−α) − max 1, 1, (A.6.18) (n k)2m3 ≤ 4(n + nα)2/(1−α) ≤ −   for n sufficiently large. Combining the above results with the estimation of β(k), we have c e−2 γ(k) 3 , 2 k n 2. (A.6.19) ≤ n ≤ ≤ −

For k = n 1, we have −

n2(m k)3 n + n1−α 2/(1−α) − max 1, . (A.6.20) (n k)2m3 ≤ n + nα − (   )

c e−2 If α 1/2, we can directly obtain that γ(k) 3 . If α < 1/2, we have that, for n ≥ ≤ n sufficiently large,

c e−2 n + n1−α 4/(1−α) c e−2 γ(k) 3 8 . (A.6.21) ≤ n n + nα ≤ n  

97 Similarly, we can obtain that

¯ P(G(U, V2, d1, d) contains perfect matching) n n n 1 α(k)β(k) ≥ − k n k + 1 k=1 X   −  n−1 −2 e max c3, c8 1 { }α(k) ≥ − n k=2 X (b) c >1 9 . (A.6.22) − n

Therefore, in both cases, incorporating the above results into estimating (A.5.1), the theorem follows.

APPENDIX B: PROOFS FOR CHAPTER 3

B.1 Proof of Theorem 3.4.1

Before presenting the main proof idea, we first analyze the moments of our proposed Wave Soliton distribution. For simplicity, we use d to denote mn in the sequel.

Lemma B.1.1. Let a random variable X follow the Wave Soliton distribution P_w = [p_1, p_2, \ldots, p_d], given by

p_k = \begin{cases} \dfrac{\tau}{d}, & k = 1, \\[4pt] \dfrac{\tau}{70}, & k = 2, \\[4pt] \dfrac{\tau}{k(k-1)}, & 3 \le k \le d. \end{cases}   (B.1.1)

Then the moments of X of all orders are given by

\mathbb{E}[X^s] = \begin{cases} \Theta\big( \tau \ln(d) \big), & s = 1, \\[4pt] \Theta\big( \tfrac{\tau}{s}\, d^{\,s-1} \big), & s \ge 2. \end{cases}   (B.1.2)

Proof. Based on the definition of the moments of a discrete random variable, we have

\mathbb{E}[X] = \sum_{k=1}^{d} k\, p_k = \frac{\tau}{d} + \frac{\tau}{35} + \sum_{k=3}^{d} \frac{\tau}{k-1} = \Theta\big( \tau \ln(d) \big),   (B.1.3)

where in the last step we use the fact that 1 + 1/2 + \cdots + 1/d = \Theta(\ln(d)). Similarly,

\mathbb{E}[X^s] = \sum_{k=1}^{d} k^s p_k = \frac{\tau}{d} + \frac{\tau 2^s}{70} + \sum_{k=3}^{d} \frac{\tau k^{s-1}}{k-1} \overset{(a)}{=} \Theta\Big( \frac{\tau}{s}\, d^{\,s-1} \Big),   (B.1.4)

where step (a) uses Faulhaber's formula \sum_{k=1}^{d} k^s = \Theta\big( d^{s+1}/(s+1) \big).
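As a quick numerical sanity check of the first moment in Lemma B.1.1 (this snippet is ours and is not part of the proof; it assumes τ is the constant that normalizes the distribution, which gives τ = 70/36 ≈ 1.944, consistent with the bound τ > 1.94 used below):

import numpy as np

def wave_soliton_pmf(d):
    """Wave Soliton distribution of (B.1.1); tau is fixed by normalization."""
    k = np.arange(1, d + 1, dtype=float)
    w = np.empty(d)
    w[0] = 1.0 / d                            # k = 1 (before scaling by tau)
    w[1] = 1.0 / 70.0                         # k = 2
    w[2:] = 1.0 / (k[2:] * (k[2:] - 1.0))     # 3 <= k <= d
    tau = 1.0 / w.sum()
    return tau * w, tau

# E[X] = Theta(tau * ln d): the ratio below should stay bounded as d grows.
for d in (10**2, 10**4, 10**6):
    p, tau = wave_soliton_pmf(d)
    mean = float((np.arange(1, d + 1) * p).sum())
    print(d, round(tau, 4), round(mean / (tau * np.log(d)), 4))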

Lemma B.1.2. If the bipartite graph G(V1,V2,Pw) does not contain a perfect matching and V1 = V1 = d, then there exists a set S V1 or S V2 with the following | | | | ⊆ ⊆ properties.

1. S = N(S) + 1. | | | |

2. For each vertex t N(S), there exists at least two adjacent vertices in S. ∈

3. S d/2. | | ≤

Figure B.1 illustrates two simple examples of structure S satisfying above three conditions.

Case 1: We consider S \subseteq V_1. Define E(V_1) to be the event that there exists a set S \subseteq V_1 satisfying the above three conditions.

Figure B.1: Examples of structures S \subseteq V_1, N(S) \subseteq V_2 and S \subseteq V_2, N(S) \subseteq V_1 satisfying conditions 1, 2, and 3. One can easily check that there exists no perfect matching in these two examples.

Case 1.1: We consider S V1 and S = 1. ⊆ | | In this case, we have N(S) = 0 and need to estimate the probability that there | | exists one isolated vertex in partition V1. Let random variable Xi be the indicator function of the event that vertex vi is isolated. Then we have the probability that

α d P(Xi = 1) = 1 , − d   where α is the average degree of a node in the partition V2 and α = Θ (τ ln (d)) from

Lemma B.1.1. Let X be the total number of isolated vertices in partition V1. Then we have

d d α d τ ln(d) 1 (a) [X] = E X = d 1 = d 1 = Θ = o(1). E i d d dτ−1 " i=1 # − − X       (B.1.5)

The above, step (a) is based on the fact that τ > 1.94.

Before we presenting the results in the case 2 S d/2, we first define the ≤ | | ≤ following three events.

101 Definition B.1.1. Given a set S V1 and S = s, for each vertex v V2, define an ⊆ | | ∈ s s event S0 is that v has zero adjacent vertex in S, an event S1 is that v has one adjacent

s vertex in S and an event S≥2 is that v has at least two adjacent vertices in S.

Then we can upper bound the probability of event E by

P(E(V1)) = P(there exists S V1, N(S) V2 such that conditions 1 3 satisfied) ∈ ∈ − (a) P(there exists an isoloted vertex in V1)+ (B.1.6) ≤ d/2 d d s s−1 s d−s+1 P(S≥2) P(S0) s s 1 · · s=2   −  X d/2 d d =o(1) + (Ss )s−1 (Ss)d−s+1. (B.1.7) s s 1 P ≥2 P 0 s=2 · · X   −  The above, step (a) is based on the union bound. Formally, given S = s and fixed | | s vertex v V2, we can calculate P(S0) via the law of total probability. ∈

d s s P(S0) = P(S0 deg(v) = k) P(deg(v) = k) | · k=1 X d−s d s d −1 = pk − k · k k=1 X     d−s k k k k = pk 1 1 1 1 . (B.1.8) − d − d 1 − d 2 ··· − d s + 1 k=1 X    −   −   − 

s Similarly, the probability P(S1) is given by the following formula.

d s s P(S1) = P(S1 deg(v) = k) P(deg(v) = k) | · k=1 X d−s+1 d s s d −1 = pk − k 1 · 1 · k k=1 X  −      d−s+1 k k k sk = pk 1 1 1 . (B.1.9) − d − d 1 ··· − d s + 2 · d s + 1 k=1 X    −   −  − 102 s Then, the probability P(S≥2) is given by the following formula.

s s s P(S≥2) = 1 P(S0) P(S1). (B.1.10) − −

The rest is to utilize the formula (B.1.8), (B.1.9) and (B.1.10) to estimate the order of (B.1.6) under several scenarios.

Case 1.2: We consider S V1 and S = Θ(1). ⊆ | | Based on the result of (B.1.8), we have

d s d s i s k s i k P(S0) pk 1 = pk ( 1) ≤ − d i − di k=1 k=1 i=0 X   X X   d s i d (a) sk s ( 1) i = pk 1 + − k pk − d i di k=1 i=2 k=1 X   X   X (b) sτ ln(d) 1 = 1 Θ + Θ − d d     sτ ln(d) = 1 Θ . (B.1.11) − d  

The above, step (a) is based on exchanging the order of summation; step (b) utilizes the result of Lemma B.1.1 and the fact that s is the constant. Similarly, we have

d−s s s s k sτ ln(d) P(S≥2) 1 P(S0) 1 pk 1 = Θ . (B.1.12) ≤ − ≤ − − d s d k=1 X  −   

s s Combining the upper bound (B.1.6) and estimation of P(S0) and P(S≥2), we have

d d s s−1 s n−s+1 P(S≥2) P(S0) s s 1 · · s=Θ(1)X   −  (a) e2s−1d2s−1 sτ ln(d) s−1 sτ ln(d) d−s+1 Θ 1 Θ ≤ s2s−1 · d · − d s=Θ(1)X      lns−1(d) = Θ d(τ−1)s s=Θ(1)X   103 = o(1). (B.1.13)

d The above, step (a) utilizes the inequality (ed/s)s. s ≤   Case 1.3: We consider S V1, S = Ω(1) and S = o(d). ⊆ | | | | Based on the result of (B.1.8), we have

d s s k P(S0) pk 1 ≤ − d k=1   X s (a) ks k = pk 1 + pk 1 − d − d ksX=o(d)   ks=Θ(Xd) or Ω(d)   d/s ks pk 1 + pk ≤ − d k=1 X   ks=Θ(Xd) or Ω(d) (b) τ(s 1) τs ln(d/s) τs 1 − Θ + Θ ≤ − d − d d     (c) τs ln(d/cs) = 1 Θ . (B.1.14) − d  

The above, step (a) is based on summation over different orders of k. In particular, when ks = o(d) and s = Ω(1), we have (1 k/d)s = Θ(e−ks/d) = 1 Θ(ks/d). Step − − (b) utilizes the partial sum formula 1 + 1/2 + + s/d = Θ(ln(d/s)). The parameter ··· c of step (c) is a constant. Similarly, we have

s s P(S≥2) 1 P(S0) ≤ − (a) ks − ks 1 pk 1 + pke d−s ≤ −  − d s  ksX=o(d)  −  ks=Θ(dX),k≤d/s−1 d/s −1  (b) ks 1 pk 1 ≤ − − d s k=1  −  X 0 (c) τs ln(d/c s) = Θ . (B.1.15) d  

104 The above, step (a) is based on summation over different orders of k, and abandon the terms when k d/s. In particular, when ks = Θ(d) and s = Ω(1), we have ≥ (1 k/(d s))s = Θ(e−ks/(d−s)). The step (b) utilizes the inequality e−x 1 x, x 0. − − ≥ − ∀ ≥ The parameter c0 of step (c) is a constant. Combining the upper bound (B.1.6) and

s s estimation of P(S0) and P(S≥2), we have

d d s s−1 s d−s+1 P(S≥2) P(S0) s s 1 · · s=Ω(1)X,s=o(d)   −  e2s−1d2s−1 sτ ln(d/c0s) s−1 sτ ln(d/cs) d−s+1 Θ 1 Θ ≤ s2s−1 · d · − d s=Ω(1)X,s=o(d)      s (τ−1)s = Θ cτsτ s−1e2s−1 lns−1(d/c0s) d s=Ω(1)X,s=o(d)    s ln1.06(d/s) (τ−1)s = Θ d s=Ω(1)X,s=o(d)   = o(1). (B.1.16)

Case 1.4: We consider S V1 and S = Θ(d) = cd . ⊆ | | Based on the result of (B.1.8) and Stirling’s approximation, we have

d−s 1 1 d−k+ 2 d−s−k+ 2 s k k k P(S0) = pk 1 1 + (1 c) − d d s k − k=1 X    − −  (a) k k = pk(1 c) + o (1 c) − − kX=o(d) k=Θ(Xd),k≤d−s  k 2 (1 c) = p1(1 c) + p2(1 c) + τ − + o(1) − − k(k 1) k≥3X,k=o(d) − (b) 2 1 2 = p1(1 c) + p2(1 c) + τ (1 c ) + c ln(c) − − 2 −   , f0(c). (B.1.17)

The above, step (a) is based on summation over different orders of k. The step (b) is

105 based on the following partial sum formula.

q (1 c)k 1 − = [2c(1 c)q(1 c)qΦ(1 c, 1, q + 1) 2(1 c)q+1 + q(1 c2)+ k(k 1) 2q − − − − − − k=3 X − 2cq ln(c)]. (B.1.18) where the function Φ(1 c, 1, q + 1) is the Lerch Transcendent, defined as −

∞ ck Φ(1 c, 1, q + 1) = . (B.1.19) − k + q + 1 k=0 X Let q = Ω(1), we arrive at the step (b). Similarly, utilizing the result of (B.1.9), we have

s c k P(S1) = pkk(1 c) 1 c − − kX=o(d)

= p1c + 2p2c(1 c) + τc(c 1 ln(c)) − − −

, f1(c). (B.1.20)

Therefore, utilizing the upper bound (B.1.6), we arrive at

d d s s−1 s n−s+1 P(S≥2) P(S0) s s 1 · · s=Θ(Xd),s≤d/2   −  2c 2(1−c) d 1 1 c 1−c [1 f0(c) f1(c)] [f0(c)] ≤ " c 1 c − − # s=cd,sX≤d/2    −  = (1 Θ(1))d − s=cd,sX≤d/2 = o(1). (B.1.21)

106 Therefore, combining the results in the above four cases, we conclude that P(E(V1)) = o(1).

Case 2: We consider that S V2. We relax the condition 2 in Lemma B.1.2 to ⊆ the following condition.

20. For each vertex t S, there exists at least one adjacent vertex in N(S). ∈ Define an event E(V2) is that there exists a set S V2 satisfying condition 1, 2, ⊆ 3, and an event E0 is that there exists a set S satisfying above condition 1, 20 and 3.

0 0 One can easily show that the event E(V2) implies the event E and P(E(V2)) P(E ). ≤ Then we aim to show that the probability of event E0 is o(1).

Definition B.1.2. Given a set S V2 and S = s, for each vertex v V2, define an ⊆ | | ∈ s event N≥1 is that v has at least one adjacent vertex in N(S) and v does not connect to any vertices in V1/N(S).

Then we can upper bound the probability of event E0 by

0 0 P(E ) = P(there exists S V2, N(S) V1 such that condition 1, 2 ,3 are satisfied) ∈ ∈ d/2 (a) d d e2s−1d2s−1 (N s )s (N s )s. (B.1.22) s s 1 P ≥1 s2s−1 P ≥1 ≤ s=2 · ≤ · X   − 

The above, step (a) is based on the fact that any vertices in set V2 has degree at least one according to the definition of the Wave Soliton distribution. Given S = s and | | s fixed vertex v S, we can calculate P(N≥1) via the law of total probability. ∈

d s s P(N≥1) = P(N≥1 deg(v) = k) P(deg(v) = k) | · k=1 X s−1 s 1 d −1 = pk − k · k k=1 X     s−1 s 1 s 2 s 3 s k = pk − − − − d · d 1 · d 2 ···· d k + 1 k=1 X − − − 107 s−1 sk pk . (B.1.23) ≤ dk k=1 X

Case 2.1: We consider S V2 and S = Θ(1). ⊆ | | Based on the result of (B.1.23), we have

2 s−1 k 2 3 s τs s s τs s τs 1 1 P(N≥1) + p2 + pk + p2 + ≤ d2 d2 dk ≤ d2 d2 d3 2 − s 1 k=3 X  −  s2 1 τ τs 1 1 < + + (B.1.24) d2 36 s d 2 − s 1   − 

Then we have

2s−1 2s−1 e d s s 1 P(N≥1) = Θ . (B.1.25) s2s−1 · d s=Θ(1)X  

Case 2.2: We consider S V2, S = Ω(1) and S = o(d). ⊆ | | | | Similarly, using the result in Case 2.1 and upper bound (B.1.22), we arrive

d d s s P(N≥1) s s 1 · s=Xo(d)   −  se2s−1 1 τ τs 1 1 s + + ≤ d 36 s d 2 − s 1 s=Xo(d)   −  (a) = o(1). (B.1.26)

The above, step (a) is based on the fact that 1/36 + o(1) < e−2.

Case 2.3: We consider S V2 and S = Θ(d) = cd. Based on the result of ⊆ | | (B.1.23), we have

s−1 (a) 2 s k 2 c P(N≥1) pkc p1c + p2c + τ c + (1 c) ln(1 c) f2(c). (B.1.27) ≤ ≤ − 2 − − , k=1 X  

108 The above, step (a) utilizes the partial sum formula (B.1.18). Using the upper bound

(B.1.22), we arrive

2c 2(1−c) d d d s s 1 1 c P(N≥2) = [f2(c)] s s 1 · " c 1 c # s=Θ(Xd)   −     −  = (1 Θ(1))d − s=Θ(Xd) = o(1). (B.1.28)

0 Combining the results in the above three cases, we have P(E ) = o(1). Therefore, the theorem follows.

B.2 Proof of Lemma 3.4.1

Consider a random bipartite graph generated by the degree distribution P_w of the nodes in the left partition V_2. Define a left edge degree distribution \lambda(x) = \sum_k \lambda_k x^{k-1} and a right edge degree distribution \rho(x) = \sum_k \rho_k x^{k-1}, where \lambda_k (\rho_k) is the fraction of edges adjacent to a node of degree k in the left partition V_1 (right partition V_2). The existing analysis in [50] provides a quantitative condition for the recovery threshold in terms of \lambda(x) and \rho(x).

Lemma B.2.1. Let a random bipartite graph be chosen at random with left edge degree distribution \lambda(x) and right edge degree distribution \rho(x). If

\lambda\big( 1 - \rho(1 - x) \big) < x, \qquad \forall\, x \in [\delta, 1],   (B.2.1)

then the probability that the peeling decoding process cannot recover \delta d or more of the original blocks is upper bounded by e^{-cd} for some constant c.

We first derive the edge degree distributions \lambda(x) = \sum_k \lambda_k x^{k-1} and \rho(x) = \sum_k \rho_k x^{k-1} from the degree distribution \Omega_w(x). Suppose that the recovery threshold is K. The total number of edges is K\Omega_w'(1), and the total number of edges adjacent to a right node of degree k is K k p_k. Then we have \rho_k = k p_k / \Omega_w'(1) and

\rho(x) = \Omega_w'(x) / \Omega_w'(1).   (B.2.2)

Fix a node v_i \in V_1. The probability that node v_i is a neighbor of node v_j \in V_2 is given by

\sum_{k=1}^{d} p_k \binom{d-1}{k-1} \Big/ \binom{d}{k} = \frac{1}{d} \sum_{k=1}^{d} k p_k = \frac{\Omega_w'(1)}{d}.

Since |V_2| = K, the probability that node v_i is the neighbor of exactly l nodes in V_2 is \binom{K}{l} \big( \Omega_w'(1)/d \big)^{l} \big( 1 - \Omega_w'(1)/d \big)^{K-l}, and the corresponding probability generating function is

\sum_{l=1}^{K} \binom{K}{l} \Big( \frac{\Omega_w'(1)}{d} \Big)^{l} \Big( 1 - \frac{\Omega_w'(1)}{d} \Big)^{K-l} x^{l} = \Big( 1 - \frac{\Omega_w'(1)(1 - x)}{d} \Big)^{K}.

Then we can obtain \lambda(x) as

\lambda(x) = \Big( 1 - \frac{\Omega_w'(1)(1 - x)}{d} \Big)^{K-1}.   (B.2.3)

Further, we have

\lambda\big( 1 - \rho(1 - x) \big) = \Big( 1 - \frac{\Omega_w'(1 - x)}{d} \Big)^{K-1}.   (B.2.4)

Combining these results with Lemma B.2.1 and letting \delta = b/mn, the lemma follows.
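Condition (B.2.1), together with the closed form (B.2.4), can also be checked numerically for a given degree distribution; the helper below is a small sketch of ours (not part of the proof) that evaluates λ(1 − ρ(1 − x)) on a grid and reports whether it stays below x.

import numpy as np

def check_peeling_condition(p, d, K, delta, num_points=200):
    """Check (B.2.1) via (B.2.4): (1 - Omega'(1 - x)/d)^(K-1) < x on [delta, 1).

    p     : degree pmf [p_1, ..., p_d] of the code
    d     : number of input blocks
    K     : recovery threshold (number of received coded blocks)
    delta : fraction of blocks allowed to remain unrecovered
    """
    p = np.asarray(p, dtype=float)
    k = np.arange(1, len(p) + 1, dtype=float)

    def omega_prime(x):
        # Omega'(x) = sum_k k * p_k * x^(k-1)
        return float(np.sum(k * p * np.power(x, k - 1.0)))

    xs = np.linspace(delta, 1.0, num_points, endpoint=False)
    lhs = np.array([(1.0 - omega_prime(1.0 - x) / d) ** (K - 1) for x in xs])
    return bool(np.all(lhs < xs)), float(np.max(lhs - xs))

For example, one can apply it to the Wave Soliton pmf sketched after Lemma B.1.1 for various values of K to see when the condition holds.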

110 B.3 Proof of Theorem 3.4.2

Suppose that K = cd + 1, one basic fact is that

0 K−1 Ω (1 x) 0 λ(1 ρ(1 x)) = 1 w − e−cΩw(1−x) (B.3.1) − − − d ≤  

0 Based on the results of Lemma 3, the rest is to show that e−cΩw(x) 1 x for ≤ − x [0, 1 b/d]. Based on the definition of our Wave Soliton distribution, we have ∈ −

τ τx d−1 xk Ω0 (x) = + + τ w d 35 k k=2 X ∞ τ 34τx xk (a) τ 34τx = τ ln(1 x) τ τ ln(1 x) τx10. d − 35 − − − k ≥ d − 35 − − − k=d X (B.3.2)

The above step (a) uses the fact that $x^{10} \ge \sum_{k=d}^{\infty} x^{k}/k$ for $x \in [0, 1-b/d]$. It remains to show that there exists a constant $c$ such that

\[
-c\left(\frac{\tau}{d} - \frac{34\tau x}{35} - \tau\ln(1-x) - \tau x^{10}\right) \le \ln(1-x), \quad \text{for } x \in [0, 1-b/d], \tag{B.3.3}
\]
which can be verified easily.
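As a concrete illustration, (B.3.3) can be checked on a grid by computing the smallest constant $c$ that works for given parameters. The values of $\tau$, $d$, and $b$ below are assumptions chosen only for demonstration; the constants used in the actual construction may differ.

```python
import numpy as np

def min_constant_for_B33(tau=1.0, d=100, b=2, grid=5000):
    # Finds (numerically) the smallest c such that, on x in [0, 1 - b/d],
    #   -c * (tau/d - 34*tau*x/35 - tau*ln(1-x) - tau*x**10) <= ln(1-x),
    # i.e. c >= -ln(1-x) / B(x), where B(x) is the bracketed lower bound on Omega_w'(x).
    x = np.linspace(0.0, 1.0 - b / d, grid)
    B = tau / d - 34.0 * tau * x / 35.0 - tau * np.log1p(-x) - tau * x**10
    assert np.all(B > 0), "the lower bound on Omega_w'(x) should stay positive"
    return float(np.max(-np.log1p(-x) / B))

if __name__ == "__main__":
    # Any constant c above the printed value satisfies (B.3.3) for these parameters.
    print(min_constant_for_B33())
```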

APPENDIX C: PROOFS FOR CHAPTER 4

C.1 Proof of Lemma 4.3.1

Proof. Since the event that there exists $i \in [n]$ such that $\|A_{S,i}\|_0 = 0$ implies the event that $\mathrm{err}(A_S) = \|A_S^{T}u - 1_n\|^{2} \ge 1$, we can obtain

\[
P(\mathrm{err}(A_S) > 0) \ge P\left(\bigcup_{i=1}^{n}\big\{\|A_{S,i}\|_0 = 0\big\}\right).
\]

Suppose that $A^* = \arg\min_{A\in\mathcal{A}_n^d} P(\mathrm{err}(A_S) > 0)$ and that $A_S^*$ is the row submatrix of $A^*$ containing $(n-s)$ randomly and uniformly chosen rows. Then we have

\[
\min_{A\in\mathcal{A}_n^d} P(\mathrm{err}(A_S) > 0) \ge P\left(\bigcup_{i=1}^{n}\big\{\|A_{S,i}^*\|_0 = 0\big\}\right)
\ge \min_{A\in\mathcal{A}_n^d} P\left(\bigcup_{i=1}^{n}\big\{\|A_{S,i}\|_0 = 0\big\}\right).
\]
We next show that

\[
\min_{A\in\mathcal{A}_n^d} P\left(\bigcup_{i=1}^{n}\big\{\|A_{S,i}\|_0 = 0\big\}\right) = \min_{A\in\mathcal{U}_n^d} P\left(\bigcup_{i=1}^{n}\big\{\|A_{S,i}\|_0 = 0\big\}\right). \tag{C.1.1}
\]
We will prove that the above probability is monotonically decreasing in the support size of each row and column of the matrix $A$. Assume that there exists $k \in [n]$ such that $\|a_k\|_0 < \kappa(A)$. We change one zero position of row $a_k$, i.e., $a_{kj}$, to a nonzero constant, and define the new matrix as $A'$. For simplicity, define the event $E_i$ as $\{\|A_{S,i}\|_0 = 0\}$ and $E_i'$ as $\{\|A_{S,i}'\|_0 = 0\}$. Then we can write

\[
P\left(\bigcup_{i=1}^{n} E_i\right) = P\left(\bigcup_{i\ne j} E_i\right) + P\left(E_j \setminus \bigcup_{i\ne j} E_i\right)
\overset{(a)}{\ge} P\left(\bigcup_{i\ne j} E_i'\right) + P\left(E_j' \setminus \bigcup_{i\ne j} E_i'\right)
= P\left(\bigcup_{i=1}^{n} E_i'\right).
\]

In the above, step (a) is based on the fact that $E_i = E_i'$ for $i \ne j$ and $E_j' \subset E_j$. Similarly, we can prove the monotonicity with respect to the support size of each column. Therefore, based on this monotonicity and Definition 4.2.2 of the computation load, the lemma follows.

C.2 Proof of Lemma 4.3.2

Proof. Given any matrix $A \in \mathcal{U}_n^d$, we construct the set $\mathcal{I}_d$ as follows. First, choose a column $A_{i_1}$ and construct the set $I_1$ as follows,

\[
I_1 = \big\{j \in [n] \mid \mathrm{supp}(A_{i_1}) \cap \mathrm{supp}(A_j) \ne \emptyset,\ j \ne i_1\big\}. \tag{C.2.1}
\]

Since $A \in \mathcal{U}_n^d$, suppose that $\mathrm{supp}(A_{i_1}) = \{k_1, k_2, \ldots, k_d\}$. We can obtain

\[
|I_1| = \left|\bigcup_{l=1}^{d}\big\{j \in [n] \mid A_{jk_l} \ne 0,\ j \ne i_1\big\}\right|
\overset{(a)}{\le} \sum_{l=1}^{d}\left|\big\{j \in [n] \mid A_{jk_l} \ne 0,\ j \ne i_1\big\}\right|
\overset{(b)}{\le} d^{2}. \tag{C.2.2}
\]

In the above, step (a) utilizes the union bound and step (b) is based on the definition of the set $\mathcal{U}_n^d$. Furthermore, we choose a column $A_{i_2}$ such that $i_2 \in [n]\setminus I_1$. Based on the definition of the index set $I_1$, we have $\mathrm{supp}(A_{i_2}) \cap \mathrm{supp}(A_{i_1}) = \emptyset$. Similarly, we can construct the index set $I_2 = \{j \in [n] \mid \mathrm{supp}(A_{i_2}) \cap \mathrm{supp}(A_j) \ne \emptyset,\ j \ne i_2\}$ with $|I_2| \le d^{2}$. Continuing this process $k$ times, we can construct a set $\mathcal{I}_d = \{i_1, i_2, \ldots, i_k\}$ such that $\mathrm{supp}(A_i) \cap \mathrm{supp}(A_j) = \emptyset$ for any $i, j \in \mathcal{I}_d$, together with the corresponding sets $I_1, I_2, \ldots, I_k$. Since each $|I_k| \le d^{2}$, we have $|\mathcal{I}_d| \ge \lfloor n/d^{2}\rfloor$.
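The construction above is essentially a greedy procedure over the columns of $A$. The sketch below is a minimal, illustrative implementation of that procedure in Python; the randomly generated matrix only enforces column sparsity (not the full definition of $\mathcal{U}_n^d$), so the printed count is merely compared against $\lfloor n/d^2\rfloor$ rather than guaranteed to exceed it.

```python
import numpy as np

def greedy_disjoint_columns(A):
    # Greedily pick column indices whose supports are pairwise disjoint,
    # mirroring the construction of the set I_d in the proof of Lemma 4.3.2.
    n = A.shape[1]
    remaining = set(range(n))
    chosen = []
    while remaining:
        i = min(remaining)                      # pick any remaining column
        chosen.append(i)
        support_i = set(np.flatnonzero(A[:, i]))
        # Discard i and every column whose support intersects supp(A_i)
        # (this is the set I_1, I_2, ... removed at each step of the proof).
        remaining = {j for j in remaining
                     if j != i and not support_i & set(np.flatnonzero(A[:, j]))}
    return chosen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 200, 3
    A = np.zeros((n, n))
    for j in range(n):                          # each column has exactly d nonzeros
        A[rng.choice(n, size=d, replace=False), j] = 1.0
    print(len(greedy_disjoint_columns(A)), n // d**2)
```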

C.3 Proof of Theorem 4.3.1

Proof. Suppose that
\[
A^* = \arg\min_{A\in\mathcal{U}_n^d} P\left(\bigcup_{i=1}^{n}\big\{\|A_{S,i}\|_0 = 0\big\}\right). \tag{C.3.1}
\]
Based on the results of Lemma 4.3.2, we can construct a set $\mathcal{I}_d^*$ such that $|\mathcal{I}_d^*| \ge \lfloor n/d^{2}\rfloor$ and

\[
\mathrm{supp}(A_i^*) \cap \mathrm{supp}(A_j^*) = \emptyset, \quad \forall i \ne j,\ i, j \in \mathcal{I}_d^*. \tag{C.3.2}
\]

Combining the results of Lemma 4.3.1, we have

\[
\begin{aligned}
\min_{A\in\mathcal{A}_n^d} P(\mathrm{err}(A_S) > 0)
&\ge \min_{A\in\mathcal{A}_n^d} P\left(\bigcup_{i=1}^{n}\big\{\|A_{S,i}\|_0 = 0\big\}\right)
= \min_{A\in\mathcal{U}_n^d} P\left(\bigcup_{i=1}^{n}\big\{\|A_{S,i}\|_0 = 0\big\}\right)\\
&= P\left(\bigcup_{i=1}^{n}\big\{\|A_{S,i}^*\|_0 = 0\big\}\right)
\ge P\left(\bigcup_{i\in\mathcal{I}_d^*}\big\{\|A_{S,i}^*\|_0 = 0\big\}\right).
\end{aligned}\tag{C.3.3}
\]

The last step above is based on the fact that $\mathcal{I}_d^* \subseteq [n]$. Suppose that $|\mathcal{I}_d^*| = t$. Based on the inclusion-exclusion principle, we can write

\[
\begin{aligned}
P&\left(\bigcup_{i\in\mathcal{I}_d^*}\big\{\|A_{S,i}^*\|_0 = 0\big\}\right)
= \sum_{I\subseteq\mathcal{I}_d^*}(-1)^{|I|+1}P\left(\bigcap_{i\in I}\big\{\|A_{S,i}^*\|_0 = 0\big\}\right)\\
&\overset{(a)}{=} \sum_{I\subseteq\mathcal{I}_d^*,\,|I|\le\lfloor s/d\rfloor}(-1)^{|I|+1}\binom{n-\sum_{i\in I}|\mathrm{supp}(A_i^*)|}{s-\sum_{i\in I}|\mathrm{supp}(A_i^*)|}\binom{n}{s}^{-1}
= \sum_{k=1}^{\min\{t,\lfloor s/d\rfloor\}}(-1)^{k+1}\binom{t}{k}\binom{n-kd}{s-kd}\binom{n}{s}^{-1}\\
&\overset{(b)}{\ge} \sum_{\substack{k\le\min\{t,\lfloor s/d\rfloor\}\\ k\ \mathrm{odd}}}\binom{t}{k}\left(\frac{n-kd}{n}\right)^{n-kd+0.5}\left(\frac{s}{s-kd}\right)^{s-kd+0.5}\left(\frac{s}{n}\right)^{kd}\\
&\quad-\sum_{\substack{k\le\min\{t,\lfloor s/d\rfloor\}\\ k\ \mathrm{even}}}\binom{t}{k}\frac{144s(n-kd)}{(12s-1)(12(n-kd)-1)}\left(\frac{n-kd}{n}\right)^{n-kd+0.5}\left(\frac{s}{s-kd}\right)^{s-kd+0.5}\left(\frac{s}{n}\right)^{kd}.
\end{aligned}\tag{C.3.4}
\]

In the above, step (a) is based on the property of the set $\mathcal{I}_d^*$. Step (b) utilizes the following Stirling's inequalities

\[
\sqrt{2\pi n}\left(\frac{n}{e}\right)^{n} \le n! \le \sqrt{2\pi n}\left(\frac{n}{e}\right)^{n}\frac{12n}{12n-1}. \tag{C.3.5}
\]
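As a quick sanity check, (C.3.5) can be verified numerically for moderate $n$; the sketch below (in Python) does exactly this, with the range limited only to avoid floating-point overflow.

```python
import math

def stirling_bounds_hold(n_max=150):
    # Checks sqrt(2*pi*n)*(n/e)**n <= n! <= sqrt(2*pi*n)*(n/e)**n * 12n/(12n-1), cf. (C.3.5).
    for n in range(1, n_max + 1):
        base = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
        if not (base <= math.factorial(n) <= base * 12 * n / (12 * n - 1)):
            return False
    return True

if __name__ == "__main__":
    print(stirling_bounds_hold())
```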

Case 1: The number of stragglers s = δn and δ = Θ(1).

Since the event $\big\{\|A_{S,i_0}^*\|_0 = 0\big\}$ for any fixed $i_0 \in \mathcal{I}_d^*$ is contained in the event $\bigcup_{i\in\mathcal{I}_d^*}\big\{\|A_{S,i}^*\|_0 = 0\big\}$, we have

\[
P\left(\bigcup_{i\in\mathcal{I}_d^*}\big\{\|A_{S,i}^*\|_0 = 0\big\}\right) \ge P\big(\|A_{S,i_0}^*\|_0 = 0\big) = \binom{n-d}{s-d}\binom{n}{s}^{-1}
\overset{(a)}{=}(1+o(1))\left(1-\frac{d}{n}\right)^{n-d+0.5}\left(1+\frac{d}{s-d}\right)^{s-d+0.5}\delta^{d}
=(1+o(1))\,\delta^{d}. \tag{C.3.6}
\]

In the above, step (a) is based on Stirling's approximation. This result implies $d > \Omega(1)$ (otherwise, the failure probability is nonvanishing). Then we have $t = \lfloor n/d^{2}\rfloor < \lfloor s/d\rfloor$ and $kd = o(n)$ for any $1 \le k \le t$, and we obtain the following approximations,

\[
\left(\frac{n-kd}{n}\right)^{n-kd+0.5}\left(\frac{s}{s-kd}\right)^{s-kd+0.5} = 1-\alpha(n,k) \quad\text{and}\quad \lim_{n\to\infty}\alpha(n,k) = 0, \quad \forall\, 1\le k\le t,
\]
\[
\frac{144s(n-kd)}{(12s-1)(12(n-kd)-1)} = 1+\beta(n,k) \quad\text{and}\quad \lim_{n\to\infty}\beta(n,k) = 0, \quad \forall\, 1\le k\le t.
\]
Utilizing the above approximations and choosing $d$ such that $\delta^{d}t \to 1/e$, we have

\[
\begin{aligned}
P&\left(\bigcup_{i\in\mathcal{I}_d^*}\big\{\|A_{S,i}^*\|_0 = 0\big\}\right)\\
&\ge \sum_{\substack{k\le t\\ k\ \mathrm{odd}}}\binom{t}{k}\big(1-\alpha(n,k)\big)\left(\frac{s}{n}\right)^{kd}
-\sum_{\substack{k\le t\\ k\ \mathrm{even}}}\binom{t}{k}\big(1-\alpha(n,k)\big)\big(1+\beta(n,k)\big)\left(\frac{s}{n}\right)^{kd}\\
&= 1-(1-\delta^{d})^{t}-\sum_{\substack{k\le t\\ k\ \mathrm{odd}}}\binom{t}{k}\alpha(n,k)\delta^{kd}
+\sum_{\substack{k\le t\\ k\ \mathrm{even}}}\binom{t}{k}\big(\alpha(n,k)-\beta(n,k)+\alpha(n,k)\beta(n,k)\big)\delta^{kd}\\
&\overset{(a)}{=} 1-(1-\delta^{d})^{t}+o(1)\\
&\overset{(b)}{=} 1-e^{-e^{-1}}+o(1) > 0.307.
\end{aligned}\tag{C.3.7}
\]

In the above, step (a) utilizes the fact that $\binom{t}{k} \le (et/k)^{k}$, so that the quantity $\binom{t}{k}\delta^{kd} \le (et\delta^{d}/k)^{k} = 1/k^{k}$ and

\[
\sum_{k=1}^{t}\binom{t}{k}(-1)^{k+1}o(1)\,\delta^{kd} \le \sum_{k=1}^{t}\frac{o(1)}{k^{k}} = o(1). \tag{C.3.8}
\]

Step (b) is based on the choice of $d$ such that $\delta^{d}t \to 1/e$. It is obvious that the probability $P\big(\bigcup_{i\in\mathcal{I}_d^*}\{\|A_{S,i}^*\|_0 = 0\}\big)$ is monotonically non-increasing in the computation load $d$. Therefore, the minimum computation load $d^*$ should satisfy $d^* > d_0$, where

$\delta^{d_0}\lfloor n/d_0^{2}\rfloor \to 1/e$. It is easy to see that

\[
d_0 = \frac{\log\big(ne\log^{2}(1/\delta)/\log^{2}(n)\big)}{\log(1/\delta)}. \tag{C.3.9}
\]
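For a sense of scale, the threshold in (C.3.9) can be evaluated directly; the sketch below does so for a few illustrative $(n, \delta)$ pairs (the values are examples only).

```python
import math

def d0_threshold(n, delta):
    # Evaluates d_0 = log(n * e * log^2(1/delta) / log^2(n)) / log(1/delta), cf. (C.3.9).
    log_inv_delta = math.log(1.0 / delta)
    return math.log(n * math.e * log_inv_delta**2 / math.log(n) ** 2) / log_inv_delta

if __name__ == "__main__":
    for n, delta in [(10_000, 0.05), (100_000, 0.05), (100_000, 0.2)]:
        print(n, delta, round(d0_threshold(n, delta), 3))
```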

Case 2: The number of stragglers s = δn, δ = o(1) and δ = Ω(1/n).

In this case, we can choose $d = d_0$. The condition $\delta = \Omega(1/n)$ implies that $s = \Omega(1)$.

Then, for $k \in [\min\{t, \lfloor s/d_0\rfloor\}]$ and $k = o(s/d_0)$, we have the following similar estimate.

\[
\left(\frac{n-kd_0}{n}\right)^{n-kd_0+0.5}\left(\frac{s}{s-kd_0}\right)^{s-kd_0+0.5} = 1-\alpha(n,k) \quad\text{and}\quad \lim_{n\to\infty}\alpha(n,k) = 0. \tag{C.3.10}
\]

For $k \in [\min\{t, \lfloor s/d_0\rfloor\}]$ and $k = \Theta(s/d_0) = cs/d_0$, we have

\[
\binom{t}{k}\left(\frac{n-kd_0}{n}\right)^{n-kd_0+0.5}\left(\frac{s}{s-kd_0}\right)^{s-kd_0+0.5}\delta^{kd_0}
= \binom{t}{k}\left(\frac{n-kd_0}{n}\right)^{n-kd_0+0.5}\left(\frac{1}{1-c}\right)^{(1-c)s+0.5}\delta^{cs}
\le \left(\frac{1}{1-c}\right)^{0.5}\left[\frac{e^{1/d_0}}{c}\left(\frac{1}{1-c}\right)^{1/c-1}\delta\right]^{kd_0}. \tag{C.3.11}
\]

For all $k \in [\min\{t, \lfloor s/d_0\rfloor\}]$, we have

\[
\frac{144s(n-kd_0)}{(12s-1)(12(n-kd_0)-1)} = 1+\beta(n,k) \quad\text{and}\quad \lim_{n\to\infty}\beta(n,k) = 0, \quad \forall\, 1\le k\le t. \tag{C.3.12}
\]

Therefore, we can obtain the following estimate.

\[
\sum_{k=\Theta(s/d_0)}\binom{t}{k}\left(\frac{n-kd_0}{n}\right)^{n-kd_0+0.5}\left(\frac{s}{s-kd_0}\right)^{s-kd_0+0.5}\delta^{kd_0} = o(1). \tag{C.3.13}
\]

Utilizing the above approximations, we have

\[
\begin{aligned}
P&\left(\bigcup_{i\in\mathcal{I}_{d_0}^*}\big\{\|A_{S,i}^*\|_0 = 0\big\}\right)\\
&\ge \sum_{\substack{k=o(s/d_0)\\ k\ \mathrm{odd}}}\binom{t}{k}\big(1-\alpha(n,k)\big)\left(\frac{s}{n}\right)^{kd_0}
-\sum_{\substack{k=o(s/d_0)\\ k\ \mathrm{even}}}\binom{t}{k}\big(1-\alpha(n,k)\big)\big(1+\beta(n,k)\big)\left(\frac{s}{n}\right)^{kd_0} + o(1)\\
&= 1-(1-\delta^{d_0})^{t}-\sum_{\substack{k=o(s/d_0)\\ k\ \mathrm{odd}}}\binom{t}{k}\alpha(n,k)\delta^{kd_0}
+\sum_{\substack{k=o(s/d_0)\\ k\ \mathrm{even}}}\binom{t}{k}\big(\alpha(n,k)-\beta(n,k)+\alpha(n,k)\beta(n,k)\big)\delta^{kd_0} + o(1)\\
&= 1-(1-\delta^{d_0})^{t}+o(1)\\
&= 1-e^{-e^{-1}}+o(1) > 0.307.
\end{aligned}\tag{C.3.14}
\]

Therefore, the minimum computation load $d^*$ should satisfy $d^* > d_0$. In the case $s = \Theta(1)$, the lower bound of $1$ is trivial (otherwise, some gradients are lost). Therefore, the theorem follows.

C.4 Proof of Theorem 4.3.2

Proof. Based on the structure of the coding matrix $A^{\mathrm{FRC}}$ and the decoding algorithm, we can define the following events

\[
E_i \triangleq \bigcap_{j=0}^{d-1}\left\{\text{the }\Big(\tfrac{jn}{d}+i\Big)\text{-th worker is a straggler}\right\}, \quad 1\le i\le n/d, \tag{C.4.1}
\]

and we have

\[
P\big(\mathrm{err}(A_S^{\mathrm{FRC}}) > 0\big) = P\left(\bigcup_{i=1}^{n/d} E_i\right).
\]
Utilizing the approximate inclusion-exclusion principle and choosing $k = n^{0.6}$ and $d = 1 + \log(n)/\log(1/\delta)$, we have

\[
\begin{aligned}
P\big(\mathrm{err}(A_S^{\mathrm{FRC}}) > 0\big)
&= (1+e^{-2n^{0.1}})\sum_{I\subseteq[n/d],\,|I|\le k}(-1)^{|I|+1}P\left(\bigcap_{i\in I}E_i\right)\\
&\overset{(a)}{=}(1+o(1))\sum_{i=1}^{k}(-1)^{i+1}\binom{n/d}{i}P\left(\bigcap_{j=1}^{i}E_j\right)\\
&\overset{(b)}{=}(1+o(1))\sum_{i=1}^{k}(-1)^{i+1}\binom{n/d}{i}\binom{n-id}{s-id}\binom{n}{s}^{-1}\\
&\overset{(c)}{\le}(1+o(1))\sum_{\substack{i\le k\\ i\ \mathrm{odd}}}\binom{n/d}{i}\frac{144s(n-id)}{(12s-1)(12(n-id)-1)}\left(\frac{n-id}{n}\right)^{n-id+0.5}\left(\frac{s}{s-id}\right)^{s-id+0.5}\left(\frac{s}{n}\right)^{id}\\
&\qquad-(1+o(1))\sum_{\substack{i\le k\\ i\ \mathrm{even}}}\binom{n/d}{i}\left(\frac{n-id}{n}\right)^{n-id+0.5}\left(\frac{s}{s-id}\right)^{s-id+0.5}\left(\frac{s}{n}\right)^{id}\\
&\overset{(d)}{=}(1+o(1))\sum_{\substack{i\le k\\ i\ \mathrm{odd}}}\binom{n/d}{i}\big(1+\beta_0(n,i)\big)\big(1-\alpha_0(n,i)\big)\left(\frac{s}{n}\right)^{id}
-\sum_{\substack{i\le k\\ i\ \mathrm{even}}}\binom{n/d}{i}\big(1-\alpha_0(n,i)\big)\left(\frac{s}{n}\right)^{id}\\
&\overset{(e)}{=}1-(1-\delta^{d})^{n/d}-\sum_{i=k+1}^{n/d}\binom{n/d}{i}(-1)^{i+1}\delta^{id}+o(1)\\
&\overset{(f)}{=}1-(1-\delta^{d})^{n/d}+o(1)\\
&\overset{(g)}{=}o(1).
\end{aligned}\tag{C.4.3}
\]

In the above, step (a) is based on the symmetry of the events $E_i$. Step (b) utilizes the definition of the event $E_i$ and the structure of the coding matrix $A^{\mathrm{FRC}}$. Step (c) utilizes Stirling's inequality (C.3.5). In step (d), since $k = n^{0.6}$, we have $id = o(n)$ for

$1 \le i \le k$ and

\[
\left(\frac{n-id}{n}\right)^{n-id+0.5}\left(\frac{s}{s-id}\right)^{s-id+0.5} = 1-\alpha_0(n,i) \quad\text{and}\quad \lim_{n\to\infty}\alpha_0(n,i) = 0,
\]
\[
\frac{144s(n-id)}{(12s-1)(12(n-id)-1)} = 1+\beta_0(n,i) \quad\text{and}\quad \lim_{n\to\infty}\beta_0(n,i) = 0.
\]
Step (e) is based on an argument similar to that in the proof of (C.3.8). Step (f) utilizes the fact that, when $i \ge n^{0.6}$ and $d = \log(n\log(1/\delta))/\log(1/\delta)$,

\[
\binom{n/d}{i}\delta^{id} \le \left(\frac{en}{di}\right)^{i}\delta^{id} \le \left(\frac{en\,\delta^{d}}{d\,n^{0.6}}\right)^{n^{0.6}}.
\]

The last step (g) is based on the choice of $d$ such that $(1-\delta^{d})^{n/d} = e^{-1/\log(n\log(1/\delta))} = 1-o(1)$. Therefore, the theorem follows.
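The statement of Theorem 4.3.2 can also be sanity-checked by simulation: under the fractional repetition structure used in (C.4.1), decoding fails only when every worker in some group $\{i, i+n/d, \ldots, i+(d-1)n/d\}$ straggles. The Monte Carlo sketch below is a minimal illustration under that assumption; the parameters (and the rounding of $d$ to an integer) are choices made only for the example.

```python
import numpy as np

def frc_failure_prob(n, s, d, trials=20000, seed=0):
    # Monte Carlo estimate of P(err > 0) for a fractional-repetition-style code:
    # workers are split into n/d groups {i, i + n/d, ..., i + (d-1)n/d}, and decoding
    # fails only when every worker in some group straggles (event E_i in (C.4.1)).
    rng = np.random.default_rng(seed)
    groups = np.arange(n).reshape(d, n // d)       # column i holds group i
    failures = 0
    for _ in range(trials):
        straggler = np.zeros(n, dtype=bool)
        straggler[rng.choice(n, size=s, replace=False)] = True
        failures += bool(np.any(straggler[groups].all(axis=0)))
    return failures / trials

if __name__ == "__main__":
    n, delta = 400, 0.1                            # n chosen so that d divides n
    s = int(delta * n)
    d = 1 + int(np.ceil(np.log(n) / np.log(1 / delta)))   # load from the proof, rounded up
    print(d, frc_failure_prob(n, s, d), 1 - (1 - delta**d) ** (n / d))
```

The last printed value is the leading-order expression $1-(1-\delta^d)^{n/d}$ from (C.4.3); the Monte Carlo estimate should be of the same order for these parameters.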

C.5 Proof of Theorem 4.4.1

Proof. Define the indicator function

\[
X_i = \begin{cases} 1, & \|A_{S,i}\|_0 = 0,\\ 0, & \|A_{S,i}\|_0 > 0. \end{cases} \tag{C.5.1}
\]

n P[err(AS) > c] P Xi > c . (C.5.2) ≥ " i=1 # X Based on the similar proof of Lemma 4.3.1, we have

\[
\min_{A\in\mathcal{A}_n^d} P\big[\mathrm{err}(A_S) > c\big] \ge \min_{A\in\mathcal{U}_n^d} P\left[\sum_{i=1}^{n} X_i > c\right]. \tag{C.5.3}
\]

Suppose that
\[
A^* = \arg\min_{A\in\mathcal{U}_n^d} P\left[\sum_{i=1}^{n} X_i > c\right]. \tag{C.5.4}
\]
Based on the results in Lemma 4.3.2, we can construct a set $\mathcal{I}_d^*$ such that $|\mathcal{I}_d^*| \ge \lfloor n/d^{2}\rfloor$ and

\[
\mathrm{supp}(A_i^*) \cap \mathrm{supp}(A_j^*) = \emptyset, \quad \forall i \ne j,\ i, j \in \mathcal{I}_d^*. \tag{C.5.5}
\]

Therefore, we have

\[
\min_{A\in\mathcal{A}_n^d} P\big[\mathrm{err}(A_S) > c\big] \ge P\big[Y(\mathcal{I}_d^*) > c\big], \tag{C.5.6}
\]
where the random variable $Y(\mathcal{I}_d^*) = \sum_{i\in\mathcal{I}_d^*} X_i$.

Suppose that $d = o(s)$. Each indicator $X_i$ is a Bernoulli random variable with

\[
P(X_i = 1) = \binom{n-d}{s-d}\binom{n}{s}^{-1}
\overset{(a)}{=}(1+o(1))\left(1-\frac{d}{n}\right)^{n-d+0.5}\left(1+\frac{d}{s-d}\right)^{s-d+0.5}\delta^{d}
\overset{(b)}{=}(1+o(1))\,\delta^{d}. \tag{C.5.7}
\]

In the above, step (a) utilizes Stirling's approximation; step (b) is based on the fact that

$s = \Omega(1)$ and $d = o(s)$. First, the expectation of $Y(\mathcal{I}_d^*)$ is given by

\[
E\big[Y(\mathcal{I}_d^*)\big] = (1+o(1))\,t\,\delta^{d}. \tag{C.5.8}
\]

Furthermore, considering the fact that, for any $i, j \in \mathcal{I}_d^*$ with $i \ne j$, the random variable $X_iX_j$ is also a Bernoulli random variable with

\[
P(X_iX_j = 1) = \binom{n-2d}{s-2d}\binom{n}{s}^{-1}
= (1+o(1))\left(1-\frac{2d}{n}\right)^{n-2d+0.5}\left(1+\frac{2d}{s-2d}\right)^{s-2d+0.5}\delta^{2d}
= (1+o(1))\,\delta^{2d}, \tag{C.5.9}
\]

the variance of $Y(\mathcal{I}_d^*)$ is given by

\[
\begin{aligned}
\mathrm{Var}\big[Y(\mathcal{I}_d^*)\big] &= E\big[Y(\mathcal{I}_d^*)^{2}\big] - E^{2}\big[Y(\mathcal{I}_d^*)\big]\\
&= E\left[\sum_{i\in\mathcal{I}_d^*} X_i^{2} + 2\sum_{i,j\in\mathcal{I}_d^*,\,i\ne j} X_iX_j\right] - E^{2}\big[Y(\mathcal{I}_d^*)\big]\\
&= (1+o(1))\big(t\delta^{d} + t(t-1)\delta^{2d} - t^{2}\delta^{2d}\big)\\
&= (1+o(1))\,t\,\delta^{d}(1-\delta^{d}).
\end{aligned}\tag{C.5.10}
\]
Therefore, utilizing the Chebyshev inequality, we have the following upper bound.

\[
\begin{aligned}
P\Big(Y(\mathcal{I}_d^*) \le E\big[Y(\mathcal{I}_d^*)\big] - 2E^{0.5}\big[Y(\mathcal{I}_d^*)\big]\Big)
&\le P\Big(\big|Y(\mathcal{I}_d^*) - E[Y(\mathcal{I}_d^*)]\big| \ge 2E^{0.5}\big[Y(\mathcal{I}_d^*)\big]\Big)\\
&\le \frac{\mathrm{Var}\big[Y(\mathcal{I}_d^*)\big]}{4E\big[Y(\mathcal{I}_d^*)\big]}
= (1+o(1))\,\frac{1-\delta^{d}}{4}.
\end{aligned}\tag{C.5.11}
\]

Assume that $c \le E\big[Y(\mathcal{I}_d^*)\big] - 2E^{0.5}\big[Y(\mathcal{I}_d^*)\big]$; then we have

\[
P\big(Y(\mathcal{I}_d^*) \le c\big) \le P\Big(Y(\mathcal{I}_d^*) \le E\big[Y(\mathcal{I}_d^*)\big] - 2E^{0.5}\big[Y(\mathcal{I}_d^*)\big]\Big) \le 1/4. \tag{C.5.12}
\]

This result implies that
\[
P\big[\mathrm{err}(A_S) > c\big] > \frac{3}{4}, \tag{C.5.13}
\]

which is a contradiction. Therefore, the parameter $c$ should satisfy $c > E\big[Y(\mathcal{I}_d^*)\big] - 2E^{0.5}\big[Y(\mathcal{I}_d^*)\big]$, which implies that

\[
E\big[Y(\mathcal{I}_d^*)\big] < 2c + 4. \tag{C.5.14}
\]

Since $E\big[Y(\mathcal{I}_d^*)\big] = (1+o(1))t\delta^{d}$ and $\lfloor n/d^{2}\rfloor\delta^{d}$ is monotonically non-increasing in $d$, the minimum computation load should satisfy

\[
d \ge \frac{\log\Big(n\log^{2}(1/\delta)\big/\big((2c+4)\log^{2}(n/(2c+4))\big)\Big)}{\log(1/\delta)}. \tag{C.5.15}
\]

Therefore, the theorem follows.
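The moment calculations in (C.5.8)–(C.5.10) are easy to cross-check by simulation: with $t$ columns whose supports are disjoint and of size $d$, each $X_i$ is Bernoulli with success probability $\binom{n-d}{s-d}/\binom{n}{s} \approx \delta^d$. The sketch below is a minimal illustration with parameters chosen only as an example; the simulated variance is compared against $t\delta^d(1-\delta^d)$, which it matches only up to the small negative correlation induced by sampling stragglers without replacement.

```python
import numpy as np
from math import comb

def simulate_Y_moments(n=500, d=3, t=50, delta=0.2, trials=20000, seed=1):
    # Y counts how many of t disjoint, size-d column supports are fully contained
    # in the straggler set of size s = delta * n (cf. the definition of Y(I_d^*)).
    rng = np.random.default_rng(seed)
    s = int(delta * n)
    supports = np.arange(t * d).reshape(t, d)      # t disjoint supports, d rows each
    ys = np.empty(trials)
    for m in range(trials):
        straggler = np.zeros(n, dtype=bool)
        straggler[rng.choice(n, size=s, replace=False)] = True
        ys[m] = straggler[supports].all(axis=1).sum()
    p = comb(n - d, s - d) / comb(n, s)            # exact P(X_i = 1), cf. (C.5.7)
    print("E[Y]:   simulated", ys.mean(), "vs t*p =", t * p)
    print("Var[Y]: simulated", ys.var(), "vs t*p*(1-p) =", t * p * (1 - p))

if __name__ == "__main__":
    simulate_Y_moments()
```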

C.6 Proof of Theorem 4.4.2

We use the analysis of the decoding process described in [50]. Based on the choice of $b = \lceil 1/\log(1/\delta)\rceil + 1$ and $\delta = s/n$, we can obtain that

\[
\frac{n(1-2\epsilon)}{b(1-4\epsilon)} < n - s. \tag{C.6.1}
\]
Based on the results in [50], to successfully recover $\frac{n(1-\epsilon)}{b}$ blocks from $\frac{n(1-2\epsilon)}{b(1-4\epsilon)}$ received results with probability $1-e^{-cn}$, we need to show that the following inequality holds.

\[
e^{-\frac{1-2\epsilon}{1-4\epsilon}\Omega'(x)} < 1 - x, \quad \forall x \in [0, 1-\epsilon], \tag{C.6.2}
\]

where $\Omega'(x)$ is the derivative of the generating function of the degree distribution $P_w$. Note that
\[
\Omega'(x) = \frac{1}{u+1}\left(u - \ln(1-x) + x^{D} - \sum_{d=D+1}^{\infty}\frac{x^{d}}{d}\right). \tag{C.6.3}
\]

Utilizing the fact that $x^{D} > \sum_{d=D+1}^{\infty} x^{d}/d$, the theorem follows.
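A numerical check of (C.6.2) with the distribution in (C.6.3) is again straightforward; in the sketch below the values of $\epsilon$, $u$, and $D$ are assumptions (standard Raptor-code-style choices) used only for illustration, and the truncation of the infinite series is an approximation.

```python
import numpy as np

def check_C62(eps=0.2, grid=4000):
    # Numerically checks e^{-((1-2*eps)/(1-4*eps)) * Omega'(x)} < 1 - x on [0, 1-eps]
    # for the degree distribution of (C.6.3); u and D are illustrative Raptor-style choices.
    u = eps / 2 + (eps / 2) ** 2
    D = int(np.ceil(4 * (1 + eps) / eps))
    x = np.linspace(0.0, 1.0 - eps, grid)
    tail = sum(x**j / j for j in range(D + 1, D + 200))   # truncated tail of the series
    omega_prime = (u - np.log1p(-x) + x**D - tail) / (u + 1)
    lhs = np.exp(-(1 - 2 * eps) / (1 - 4 * eps) * omega_prime)
    return bool(np.all(lhs < 1.0 - x))

if __name__ == "__main__":
    print(check_C62())
```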
