MEng Individual Project
Imperial College London
Department of Computing

Kungfu: A Novel Distributed Training System for TensorFlow using Flexible Synchronisation

Author: Andrei-Octavian Brabete
Supervisor: Dr. Peter Pietzuch
Second Marker: Dr. Daphné Tuncer

June 17, 2019

Abstract

Modern times are distinguished as the Renaissance of Deep Learning, with remarkable advances across a wide range of domains and abundant research constantly pushing its boundaries. As models and datasets increase in size, practitioners move to distributed clusters, which poses new challenges for training. Distributed training systems usually scale out training with the Parallel Stochastic Gradient Descent (Parallel SGD) algorithm, which creates heavy network traffic and an intensive need to synchronise the entire system. Their fundamental limitation is that they inevitably inflate the batch size: to reduce communication traffic and the cost of system-wide synchronisation, the user is forced to train with large batches. Existing systems tend to adopt specialised high-end hardware and networks such as InfiniBand, which are very expensive, do not scale, and eventually fail under the heavy traffic of clusters reaching hundreds of nodes. More fundamentally, such systems still suffer from large-batch training, so the user may be unable to make training converge when scaling out.

In this project, I aim to design a Deep Learning system that enables flexible synchronisation to address both the communication and the large-batch training problems. The system brings three new designs: an abstraction that allows the user to declare flexible types of synchronisation, far more sophisticated than Parallel SGD; a high-performance communication system through which workers exchange gradients and models to collaborate during training; and monitoring of network and training statistics, which supports dynamic adaptation of synchronisation strategies. I develop two advanced synchronisation algorithms based on this new system. One exchanges partial gradients when the network experiences bottlenecks, while the other enables each worker to synchronise selectively with other peers, reducing the global synchronisation overhead. Both algorithms can further benefit from monitoring and adaptation to improve convergence and performance. I run large-scale test-bed experiments on a commodity cloud to evaluate the effectiveness and performance of the proposed systems and algorithms. Experimental results show that the new system can converge to state-of-the-art accuracies while improving training performance by up to 22.5 times.
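To make the synchronisation abstraction described above concrete, the following is a minimal, self-contained sketch of an optimiser-wrapper style of interface (TensorFlow 1.x API). All names here are illustrative assumptions for exposition, not the thesis's actual Kungfu interface.

    import tensorflow as tf

    class FlexibleSyncOptimizer:
        """Wraps a TensorFlow optimiser and applies a user-chosen
        synchronisation function to every gradient before the update.
        Illustrative sketch only."""

        def __init__(self, optimizer, synchronise):
            self._opt = optimizer
            self._sync = synchronise  # e.g. all-reduce, partial exchange, ...

        def minimize(self, loss):
            # Intercept the locally computed gradients, synchronise them
            # across workers, then apply the synchronised values.
            grads_and_vars = self._opt.compute_gradients(loss)
            synced = [(self._sync(g), v)
                      for g, v in grads_and_vars if g is not None]
            return self._opt.apply_gradients(synced)

    # A no-op synchroniser reproduces single-worker SGD; a real strategy
    # would exchange gradients (or models) with peers at this point.
    opt = FlexibleSyncOptimizer(
        tf.train.GradientDescentOptimizer(learning_rate=0.1),
        synchronise=lambda grad: grad)

Because the synchronisation step is a parameter rather than a fixed all-reduce, strategies such as partial gradient exchange or peer model averaging can be swapped in without changing the training script.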
Acknowledgements

My work was possible, firstly, because of my parents, who gave me full support in every decision I made along my way. They helped me build a strong foundation of human values, which will always be reflected in my pursuits.

Secondly, I would like to thank my supervisor, Dr. Peter Pietzuch, for the valuable meetings which inspired me to develop a clearer vision of the system, to grow as an engineer and to improve my critical thinking. I would also like to show my appreciation for the high-end infrastructure which I was able to use for my experiments.

Special thanks go to my second marker, Dr. Daphné Tuncer, for the valuable feedback at the beginning of my work and for the interesting discussion on networking.

Special thanks go to Luo Mai and Guo Li for the collaboration on the larger Kungfu project and for the effective supervision during my project. Our discussions were unique learning experiences which helped me improve in all aspects of my practice and shaped my vision for developing Deep Learning systems.

Special thanks go to Alexandros Koliousis for the valuable discussions about training algorithms, models and synchronisation, and for inspiring me to become better at running Deep Learning experiments.

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem statement
  1.3 Existing work
  1.4 My approaches and evaluation result
  1.5 Contributions

2 Background and Motivation
  2.1 Deep learning
    2.1.1 Deep neural networks
    2.1.2 Learning through gradient descent
  2.2 Deep learning systems
  2.3 Distributed deep learning
  2.4 Synchronisation challenges
    2.4.1 Communication infrastructure of deep learning systems
    2.4.2 Scaling challenges
    2.4.3 Case study: synchronisation performance in cloud networks
  2.5 Limitations of existing synchronisation systems
    2.5.1 Parameter servers
    2.5.2 High-performance all-reduce-based systems
  2.6 The need for supporting flexible synchronisation
    2.6.1 Synchronising partial gradients
    2.6.2 Synchronising models among selective peers
  2.7 Summary

3 Kungfu: Enabling Flexible Synchronisation in TensorFlow
  3.1 System requirements
  3.2 Synchronisation abstraction
  3.3 Synchronisation operators
  3.4 Worker implementation
    3.4.1 Overview
    3.4.2 Partial Gradient Exchange
    3.4.3 Peer Model Averaging
    3.4.4 Worker Architecture
  3.5 Evaluation
    3.5.1 Cluster Hardware
    3.5.2 Models and Datasets
    3.5.3 Overview
    3.5.4 Peer Model Averaging
    3.5.5 Partial Gradient Exchange
    3.5.6 Convergence under different conditions
  3.6 Summary

4 Supporting Users to Choose the Right Synchronisation
  4.1 Design principles
  4.2 Network statistics monitoring
  4.3 Training statistics monitoring
  4.4 Adaptation of synchronisation parameters
  4.5 Use case: adaptive synchronisation based on training statistics
    4.5.1 Key idea
    4.5.2 Dynamic policy design
    4.5.3 Experimental result
  4.6 Use case: optimising peer selection based on network monitoring
    4.6.1 Key idea
    4.6.2 Measuring network variability in commodity cloud
    4.6.3 Optimising the selection of peers online
    4.6.4 Ongoing work
  4.7 Adaptation beyond synchronisation
  4.8 Summary

5 Conclusion and Future Work

A Appendix
  A.1 Horovod All-reduce Protocol Analysis
  A.2 Collective All-reduce Communication Theory
  A.3 Gradient noise estimation proposed by OpenAI

List of Figures

2.1 (a) Loss landscape of neural networks [1], (b) empirical loss in deep neural networks [2]
2.2 TensorFlow architecture overview [3]
2.3 From dataflow specification to computation graph
2.4 Four types of networks present in the cloud: (a) virtual network, (b) Docker network, (c) InfiniBand network, (d) NVLink GPU device communication
2.5 Over-subscription problem [4]
2.6 (a) View of system barrier, (b) straggler issue
2.7 Scaling Horovod: trade-off between number of nodes and batch size, compared with ideal throughput
2.8 Horovod timeline showing frequent all-reduce barriers
2.9 Parameter server communication overview [5]
2.10 Typical all-reduce system architecture where a master coordinates parallel ring all-reduces
2.11 Partial exchange key intuition
3.1 Worker architecture: PGX is the Partial Gradient Exchange strategy, SPMA is the Synchronous Peer Model Averaging strategy and APMA is the Asynchronous Peer Model Averaging strategy
3.2 Partial gradient exchange dataflow
3.3 Round-robin partition synchronisation
3.4 System-wide communication reduction through partial exchange
3.5 Synchronous peer-to-peer model averaging dataflow
3.6 Asynchronous peer-to-peer model averaging dataflow
3.7 Unidirectional peer model averaging
3.8 Shift to a decentralised architecture
3.9 (a) Async Peer Model Averaging vs TensorFlow Replicated over epochs, (b) Sync Peer Model Averaging vs