Machine Learning Parallelism Could Be Adaptive, Composable and Automated
Hao Zhang
CMU-RI-TR-20-54
October 8, 2020

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Greg R. Ganger
Jinyang Li (NYU)
Deva Ramanan
Christopher Ré (Stanford)
Eric P. Xing, Chair

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2020 Hao Zhang.

The research in this thesis was jointly sponsored by: the National Science Foundation awards IIS-1447676 and CCF-1629559, the Defense Advanced Research Projects Agency award FA872105C0003, and Petuum Inc. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government, or any other entity.

Keywords: Scalable Machine Learning, Parallelization, Machine Learning Parallelism, Distributed Machine Learning, Machine Learning System, Compiler, Automatic Parallelization, SysML, AutoML, Parameter Server, Composability

To my fiancée, Luna.

Abstract

In recent years, the pace of innovation in the field of machine learning (ML) has accelerated, and researchers in SysML have created algorithms and systems that parallelize ML training over multiple devices or computational nodes. As ML models become more structurally complex, many systems have struggled to provide all-round performance on a variety of models. In particular, scaling up ML is usually underestimated in terms of the knowledge and time required to map an appropriate distribution strategy to the model. Applying parallel training systems to complex models adds nontrivial development overhead on top of model prototyping, and often results in lower-than-expected performance. This thesis identifies and addresses research challenges in both the usability and the performance of parallel ML techniques and system implementations.

The first part of this thesis presents a simple design principle, adaptive parallelism, which applies suitable parallelization techniques to model building blocks (e.g., layers) according to their specific ML properties. Following this principle, we derive a series of optimizations and implementations that target different aspects of ML parallelization. We examine them and show that they boost the efficiency or scalability of ML training on clusters by 2-10x in their applicable scenarios.

Generalizing this methodology, the second part of this thesis formulates ML parallelization as an end-to-end optimization problem and seeks to solve it automatically, for two broad paradigms of ML parallelization tasks: single-node dynamic batching and distributed ML parallelisms. We present principled representations to express the two classes of ML parallelisms, along with composable system architectures, Cavs and AutoDist, respectively. They enable rapid composition of parallelization strategies for unseen models, improve parallelization performance, and simplify parallel ML programming.

On top of them, the third part of this thesis presents an automatic parallelization framework, AutoSync, to automatically optimize synchronization strategies in data-parallel distributed training. AutoSync achieves high performance "out of the box": it navigates the space spanned by the proposed representation and automatically identifies synchronization strategies that deliver 1.2-1.6x speedups over existing hand-optimized systems, lowering the technical barrier of distributed ML and helping make it accessible to a larger community of users. Collectively, the techniques and systems developed in this thesis lead to a proof of concept and a prototype implementation of an end-to-end compiler system for large-scale ML training in distributed environments.
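To make the adaptive-parallelism idea above concrete, the following is a minimal sketch of per-layer strategy selection: each layer is assigned a synchronization method based on a rough estimate of its communication traffic. The Layer class, the choose_comm helper, and the cost expressions are illustrative assumptions for this summary only, not the APIs or cost models of the systems developed in this thesis.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    fan_in: int   # rows (M) of the layer's weight matrix
    fan_out: int  # columns (N) of the layer's weight matrix


def choose_comm(layer: Layer, batch_size: int, num_workers: int) -> str:
    """Pick the cheaper of two ways to synchronize one layer's gradients.

    "parameter server": each worker pushes the full M x N gradient and pulls
    the updated M x N parameters.
    "sufficient factors": each worker exchanges rank-K factors, i.e.
    (M + N) * K values (K = batch size), with every peer.
    The cost expressions below are simplified placeholders for illustration.
    """
    ps_cost = 2 * layer.fan_in * layer.fan_out
    sf_cost = 2 * (layer.fan_in + layer.fan_out) * batch_size * (num_workers - 1)
    return "sufficient factors" if sf_cost < ps_cost else "parameter server"


if __name__ == "__main__":
    layers = [Layer("conv1", 3 * 3 * 64, 64), Layer("fc6", 25088, 4096)]
    for layer in layers:
        print(layer.name, "->", choose_comm(layer, batch_size=32, num_workers=8))
```

Under these illustrative assumptions, a small convolutional layer keeps parameter-server-style updates, while a large fully connected layer switches to exchanging sufficient factors, which is the kind of per-building-block decision the adaptive-parallelism principle refers to.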
Acknowledgments

First and foremost, I thank my advisor, Eric Xing, for giving me a huge amount of freedom and trust during my graduate studies. In my early years, Eric was instrumental in teaching me to identify the right problems to work on. Eric also offered me the chance to work at Petuum during my PhD, where I received tremendous support to address new challenges that I would otherwise never have been able to see in a blue-sky research environment. This academic entrepreneurship experience has helped me identify my lifelong goal.

I am extremely grateful to my thesis committee members, Greg Ganger, Deva Ramanan, Jinyang Li, and Christopher Ré, for their feedback in making this thesis happen. As a member of the BigLearning group, Greg has always been a good mentor and collaborator: during the BigLearning meetings, Greg's sharp questions always helped me refine the characterizations of the problems, solutions, and systems. During my one-day visit to NYU, Jinyang inspired me to maintain a balanced ML-and-systems perspective. While Deva is not a SysML insider, his scientific insight and intuition have been extremely important in shaping this thesis. Chris's path of transforming ML research into useful technologies and value has been an example to me, and I hope to collaborate with him in the future.

Research is a team effort, and I feel extremely grateful to have been part of amazing teams at CMU and Petuum. I appreciate the helpful discussions and the camaraderie with my friends and collaborators at CMU: Henggang Cui, Wei Dai, Zhijie Deng, Qirong Ho, Zhiting Hu, Gunhee Kim, Jin Kyu Kim, Christy Li, Aurick Qiao, Jinliang Wei, Shizhen Xu, Zhicheng Yan, Zeyu Zheng, Xun Zheng, Yuntian Deng, Qi Guo, Willie Neiswanger, Hector Li, Xiaodan Liang, and Bin Zhao. I was extremely lucky to lead a small team at Petuum but do BIG things during my PhD. I thank my colleagues Peng Wu, Hong Wu, Trevin Gandhi, Zeya Wang, Tairui Wang, Xin Gao, Sean Chen, Luke Lu, Jayesh Gada, Henry Guo, Liz Yi, and Cathy Serventi for their help in various cases, ranging from debugging a line of code to standing at the ICML EXPO.

During my graduate studies, I also had the opportunity to interact with many other faculty members at CMU: Phillip Gibbons, Graham Neubig, Garth Gibson, Kris Kitani, Ruslan Salakhutdinov, Min Xu, Yaser Sheikh, Abhinav Gupta, David Wettergreen, and Martial Hebert. Their nice comments and feedback have always been helpful in shaping me as an independent researcher. I want to give a special acknowledgement to Graham Neubig, Kris Kitani, and Min Xu for helping me with my speaking-skills requirement.

Finally, I would like to thank my family for always supporting and believing in me. Especially, I am deeply indebted to my dear girlfriend, Luna Yang, for her unconditional love, understanding, and encouragement during my graduate years.
I can clearly remember every moment of her company in the past five years, through all my ups and downs: when we were skiing in the Rockies, diving in cenotes, and watching sea otters. This thesis would not be possible without your continual support. Thank you.

Contents

1 Introduction
  1.1 Example: Distributed Pretraining of Language Models on Web-scale Text
  1.2 Thesis Outline, Contributions, and Key Results
2 Background
  2.1 Machine Learning: A Computational Perspective
    2.1.1 "The Master Equation"
    2.1.2 Notable ML Models and Applications
    2.1.3 Parallel Machine Learning
    2.1.4 Trends in Machine Learning
  2.2 Hardware Environment for Parallel Machine Learning
    2.2.1 Trends in ML Hardware Infrastructure
  2.3 Machine Learning Frameworks
    2.3.1 Earlier ML Systems
    2.3.2 Modern DL Frameworks
    2.3.3 Trends in ML Frameworks
  2.4 Distributed Systems for Large-scale ML: A Review
    2.4.1 Strategies for ML Parallelization
    2.4.2 Distributed ML Programming APIs
    2.4.3 Trends in Distributed ML Systems

I Aspects of ML Parallelization

3 Scheduling and Communication
  3.1 Background
    3.1.1 Distributed Training of Neural Networks
    3.1.2 Communication Architectures
    3.1.3 Communication Challenges on GPU Clusters: An Example
  3.2 Scheduling
    3.2.1 The Sequential Structure of DL Programs
    3.2.2 Wait-free Backpropagation
  3.3 Adaptive Communication
  3.4 Poseidon: An Efficient Communication Architecture for Distributed DL on GPU Clusters
    3.4.1 System Architecture
    3.4.2 Integrate Poseidon with DL Frameworks
  3.5 Evaluation
    3.5.1 Experiment Setup
    3.5.2 Scalability
    3.5.3 Bandwidth Experiments
    3.5.4 Comparisons to Other Methods
    3.5.5 Application: Scaling Up Image Classification on ImageNet 22K
  3.6 Additional Related Work
4 Memory Management
  4.1 Introduction
    4.1.1 Deep Learning on GPUs: a Memory Management Perspective
    4.1.2 Memory Challenges for Distributed DL on GPU Clusters
  4.2 Memory Swapping