Coded Computation for Speeding up Distributed Machine Learning
CODED COMPUTATION FOR SPEEDING UP DISTRIBUTED MACHINE LEARNING

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Sinong Wang
Graduate Program in Electrical and Computer Engineering

The Ohio State University
2019

Dissertation Committee:
Ness B. Shroff, Advisor
Atilla Eryilmaz
Abhishek Gupta
Andrea Serrani

© Copyrighted by Sinong Wang 2019

ABSTRACT

Large-scale machine learning has shown great promise in many practical applications. Such applications require massive training datasets and model parameters, and force practitioners to adopt distributed computing frameworks such as Hadoop and Spark to increase the learning speed. However, the speedup gain is far from ideal due to the latency incurred in waiting for a few slow or faulty processors, called "stragglers," to complete their tasks. To alleviate this problem, current frameworks such as Hadoop deploy various straggler detection techniques and usually replicate a straggling task on another available node, which creates a large computation overhead.

In this dissertation, we focus on a new and more effective technique, called coded computation, to deal with stragglers in distributed computation problems. Coded computation creates and exploits coding redundancy in the local computations so that the final output can be recovered from the results of only a subset of the workers, thereby alleviating the impact of stragglers. However, we observe that current coded computation techniques are not suitable for large-scale machine learning applications. The reason is that the input training data are both extremely large in scale and sparse in structure, and the existing coded computation schemes destroy this sparsity and create large computation redundancy. Thus, while these schemes reduce delays due to stragglers in the system, they create additional delays because they end up increasing the computational load on each machine. This fact motivates us to focus on designing more efficient coded computation schemes for machine learning applications.

We begin by investigating the linear transformation problem. We analyze the minimum computation load (the number of redundant computations in each worker) of any coded computation scheme for this problem, and construct a code we name the "diagonal code" that achieves this minimum computation load. An important feature of this part is a new theoretical framework that relates the construction of a coded computation scheme to the design of a random bipartite graph containing a perfect matching. Based on this framework, we further construct several random codes that provide an even lower computation load with high probability.

We next consider a more complex problem that is also useful in a number of machine learning applications: matrix multiplication. We show that the coded computation schemes constructed for the linear transformation problem lead to a large decoding overhead for the matrix multiplication problem. To handle this issue, we design a new sparse code generated by a specifically designed degree distribution, which we call the "wave soliton distribution." We further design a hybrid decoding algorithm that combines peeling decoding with Gaussian elimination and provides fast decoding for this problem.
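To make the coded computation idea described above concrete, the following minimal sketch shows how a single redundant coded task allows a distributed linear transform y = Ax to be recovered from the results of any two out of three workers, so that one straggler cannot delay the final output. The two-block split and the simple sum parity used here are illustrative assumptions for exposition only; they are not the diagonal, sparse, or gradient codes constructed in this dissertation.

import numpy as np

# Toy coded linear transform: compute y = A @ x on 3 workers so that the
# results of any 2 of them suffice, tolerating one straggler.

def encode(A):
    """Split A into two row blocks and append one redundant coded block."""
    A1, A2 = np.vsplit(A, 2)
    return [A1, A2, A1 + A2]                 # tasks for workers 0, 1, 2

def decode(finished):
    """Recover [A1 @ x; A2 @ x] from the results of any two workers."""
    if 0 in finished and 1 in finished:
        y1, y2 = finished[0], finished[1]
    elif 0 in finished:                      # workers 0 and 2 finished
        y1 = finished[0]
        y2 = finished[2] - finished[0]       # (A1 + A2)x - A1 x = A2 x
    else:                                    # workers 1 and 2 finished
        y2 = finished[1]
        y1 = finished[2] - finished[1]       # (A1 + A2)x - A2 x = A1 x
    return np.concatenate([y1, y2])

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)
tasks = encode(A)
# Suppose worker 1 straggles and never returns its result.
results = {i: Ai @ x for i, Ai in enumerate(tasks) if i != 1}
assert np.allclose(decode(results), A @ x)

The codes developed in the following chapters pursue the same goal while controlling the per-worker computation load and the decoding cost, particularly when the input matrices are sparse.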
Finally, we shift our focus to distributed optimization and the gradient coding problem. We observe that existing gradient coding schemes, which are designed for the worst-case scenario, yield a large computation redundancy. To overcome this challenge, we propose the idea of approximate gradient coding, which aims to approximately compute the sum of functions. We analyze the minimum computation load of the approximate gradient coding problem and construct two approximate gradient coding schemes, the fractional repetition code and the batch raptor code, that asymptotically achieve this minimum computation load. We apply the proposed schemes to a classical gradient descent algorithm for solving the logistic regression problem.

These works illustrate the power of designing efficient codes tailored to large-scale machine learning problems. In future research, we will focus on more complex machine learning problems, such as the distributed training of deep neural networks, and on the system-level optimization of coded computation schemes.

To my parents, Xinhua Wang and Mingzhen Du,
and my wife, Xiaochi Li

ACKNOWLEDGMENTS

First and foremost, I would like to sincerely thank my Ph.D. advisor, Prof. Ness B. Shroff, for all the guidance and support he gave me during my Ph.D. pursuit. As a great advisor, he not only pinpointed the right directions and provided thought-provoking feedback when I was lost among the numerous challenges of research, but also showed me how to think about problems critically from a researcher's perspective and find solutions effectively. In the past several years, he gave me immense help and encouragement, without which I would not have been able to reach this point. I would like to thank Prof. Atilla Eryilmaz, Prof. Abhishek Gupta, and Prof. Andrea Serrani for serving on my candidacy and dissertation committees. Their valuable suggestions and insightful comments have helped me significantly in improving this dissertation. I am grateful to all my friends and colleagues in the IPS lab and the ECE department. Thanks to Jeri for nicely organizing all the activities and administrative matters.

VITA

Oct. 1992 .......................... Born in Shaanxi, China
2010-2014 .......................... B.S., Telecommunication Engineering, Xidian University
2015-Present ....................... Electrical and Computer Engineering, The Ohio State University

PUBLICATIONS

S. Wang, J. Liu, N. Shroff, and P. Yang, "Computation Efficient Coded Linear Transform," AISTATS 2019.

S. Wang, J. Liu, and N. Shroff, "Coded Sparse Matrix Multiplication," ICML 2018.

S. Wang and N. Shroff, "A New Alternating Direction Method for Linear Programming," NIPS 2017.

F. Liu, S. Wang, S. Buccapatnam, and N. Shroff, "UCBoost: A Boosting Approach to Tame Complexity and Optimality for Stochastic Bandits," IJCAI 2018.

S. Wang, F. Liu, and N. Shroff, "Non-additive Security Game," AAAI 2017.

S. Wang and N. Shroff, "Towards Fast-Convergence, Low-Delay and Low-Complexity Network Optimization," ACM SIGMETRICS 2018.

S. Wang and N. Shroff, "A Fresh Look at An Old Problem: Network Utility Maximization Convergence, Delay, and Complexity," invited paper, Allerton 2017.

S. Wang and N. Shroff, "Security Game with Non-additive Utilities and Multiple Attacker Resources," Kenneth C. Sevcik Outstanding Paper Award, ACM SIGMETRICS 2017.

FIELDS OF STUDY

Major Field: Electrical and Computer Engineering
Specialization: Machine Learning, Optimization, Distributed Systems, Game Theory

TABLE OF CONTENTS

Abstract . . . ii
Dedication . . . iv
Acknowledgments . . . vi
Vita . . . vii
List of Tables . . . xii
List of Figures . . . xiii
CHAPTER                                                          PAGE

1 Introduction . . . 1
  1.1 Curse of the Straggler . . . 2
  1.2 New Frontier: Coded Computation . . . 3
    1.2.1 A Toy Example . . . 3
    1.2.2 Historical Development of Coded Computation . . . 5
  1.3 Challenges in Coded Computation . . . 6
  1.4 Contribution and Thesis Organization . . . 9
2 Coded Distributed Linear Transform . . . 12
  2.1 Introduction . . . 12
  2.2 Problem Formulation . . . 15
  2.3 Fundamental Limits and Optimal Code . . . 17
    2.3.1 Fundamental Limits on Computation Load . . . 17
    2.3.2 Diagonal Code . . . 18
    2.3.3 Fast Decoding Algorithm . . . 20
  2.4 Graph-based Analysis of Recovery Threshold . . . 22
  2.5 Random Code: "Break" the Limits . . . 24
    2.5.1 Probabilistic Recovery Threshold . . . 24
    2.5.2 Construction of the Random Code . . . 26
    2.5.3 Numerical Results of Random Code . . . 27
  2.6 Experimental Results . . . 28
3 Coded Distributed Matrix Multiplication . . . 33
  3.1 Introduction . . . 33
  3.2 Preliminary . . . 35
  3.3 Sparse Codes . . . 37
    3.3.1 Motivating Example . . . 38
    3.3.2 General Sparse Code . . . 41
  3.4 Theoretical Analysis . . . 43
  3.5 Experimental Results . . . 49
4 Coded Distributed Optimization . . . 53
  4.1 Introduction . . . 53
  4.2 Preliminaries . . . 57
    4.2.1 Problem Formulation . . . 57
    4.2.2 Main Results . . . 60
  4.3 0-Approximate Gradient Code . . . 61
    4.3.1 Minimum Computation Load . . . 62
    4.3.2 d-Fractional Repetition Code . . . 63
  4.4 ε-Approximate Gradient Code . . . 65
    4.4.1 Fundamental Three-fold Tradeoff . . . 66
    4.4.2 Random Code Design . . . 66
  4.5 Simulation Results . . . 70
    4.5.1 Experiment Setup . . . 70
    4.5.2 Generalization Error . . . 72
    4.5.3 Impact of Straggler Tolerance . . . 72
5 Conclusion . . . 74

Bibliography . . . 79

Appendix A: Proofs for Chapter 2 . . . 84
  A.1 Proof of Theorem 2.3.1 . . . 84
  A.2 Proof of Theorem 2.3.3 . . . 85
  A.3 Proof of Lemma 2.4.2 . . . 85
  A.4 Proof of Corollary 2.4.1.1 . . . 86
  A.5 Proof of Theorem 2.5.1 . . . 87
  A.6 Proof of Theorem 2.5.2 . . . 90
Appendix B: Proofs for Chapter 3 . . . 99
  B.1 Proof of Theorem 3.4.1 . . . 99
  B.2 Proof of Lemma 3.4.1 . . . 109
  B.3 Proof of Theorem 3.4.2 . . . 111
Appendix C: Proofs for Chapter 4 . . . 112
  C.1 Proof of Lemma 4.3.1 . . . 112
  C.2 Proof of Lemma 4.3.2 . . . 113
  C.3 Proof of Theorem 4.3.1 . . . 114
  C.4 Proof of Theorem 4.3.2 . . . 118
  C.5 Proof of Theorem 4.4.1 . . . 120
  C.6 Proof of Theorem 4.4.2 . . . 123

LIST OF TABLES

TABLE                                                            PAGE

2.1 Comparison of Existing Schemes in Coded Computation . . . 14
3.1 Comparison of Existing Coding Schemes . . . 35
3.2 Timing Results for Different Sparse Matrix Multiplications (in sec) . . . 52
4.1 Comparison of Existing Gradient Coding Schemes . . . 55

LIST OF FIGURES

FIGURE                                                           PAGE

1.1 Example of replication and coded computation schemes in distributed linear transform . . . 4
1.2 Measured local computation time per worker . . . 8
2.1 Framework of coded distributed linear transform . . . 16
2.2 Statistical convergence speed of full rank probability and average computation load of random code under number of stragglers s = 2