Towards Effective and Efficient Learning at Scale
Towards Effective and Efficient Learning at Scale

Adams Wei Yu
August 2019
CMU-ML-19-111

Machine Learning Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Jaime Carbonell, Co-Chair
Alexander Smola, Co-Chair (Amazon)
Ruslan Salakhutdinov
Quoc Le (Google)
Christopher Manning (Stanford)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2019 Adams Wei Yu

This research was supported by a grant from the Boeing Company, a fellowship from the Nvidia Corporation, a fellowship from Snap Inc., a fellowship from the Thomas and Stacey Siebel Foundation, and a Carnegie Mellon University presidential fellowship.

Keywords: Large Scale Optimization, Deep Neural Networks, Machine Reading Comprehension, Natural Language Processing

To my beloved wife, Yanyu Long, and son, Ziheng Yu.

Abstract

How to enable efficient and effective machine learning at scale has been a long-standing problem in modern artificial intelligence, and it motivates this thesis research. In particular, we aim to solve the following problems:

1. How to efficiently train a machine learning model?
2. How to speed up inference after the model is trained?
3. How to make the model generalize better?

We approach these problems from two perspectives: models and algorithms. On one hand, we design novel models that are intrinsically fast to train and/or test. On the other, we develop new algorithms with rapid convergence guarantees. Not surprisingly, the above three problems are not mutually exclusive, so solving one of them may also benefit the others. For example, 1) a new model that enables parallel computation accelerates both training and inference; 2) a fast algorithm saves time for hyper-parameter tuning and/or makes training with more data affordable, which in turn boosts generalization performance.

This thesis consists of two parts. The first part presents new machine learning models, with a focus on sequential data such as natural language processing and question answering. First, we propose a model, LSTM-Jump, that can skip unimportant information in text, mimicking the skimming behavior of human reading. Trained with an efficient reinforcement learning algorithm, this model can be several times faster than a vanilla LSTM at inference time. We then introduce a text encoding model that entirely discards recurrent networks and thus fully supports parallel training and inference. Based on this technique, we propose a new question-answering model, QANet. Combined with a data augmentation approach via back-translation, this model held the No. 1 place on the competitive Stanford Question Answering Dataset (SQuAD) from March to August 2018, while being several times faster than the prevalent models. It was also the deepest neural network model for NLP when it was invented.

The second part proposes large-scale learning algorithms with provable convergence guarantees. To enable fast training of neural networks, we propose a general gradient normalization algorithm for efficient deep network training. This method not only alleviates the gradient vanishing problem, but also regularizes the model to achieve better generalization. When the amount of training data becomes huge, we need to appeal to distributed computation. We are particularly interested in the ubiquitous problem of empirical risk minimization, and we propose a few algorithms to attack the challenges posed by the distributed environment.
We first show that a randomized primal-dual coordinate method, DSPDC, can achieve a linear convergence rate in the single-machine setting. Then, by further leveraging the structural information of the problem, we propose an asynchronous algorithm, DSCOVR, which enjoys a linear convergence rate in the distributed parameter-server environment as well, by executing periodic variance reduction.

Acknowledgments

It has been a great journey for me in the past few years as a PhD student at CMU, filled with growing, learning and developing. I could not have asked for more.

The first person I am greatly indebted to along this journey is my advisor Jaime Carbonell, the nicest person I have ever met in my life. Jaime always gives me a lot of freedom in choosing research topics and publishing in different areas, which has helped me become a well-rounded young researcher. As a guru in machine learning and natural language processing, he knows almost everything in these two areas and has provided plenty of insightful advice on each topic I have worked on. Without his kind support and guidance, I could not have achieved what I have now, not even close. Jaime is also a perfect role model of diligence. He is the only professor I know who has worked at CMU for 40 years and never taken a sabbatical break. Working with such an intelligent and hard-working person leaves me little excuse to slow down my pace in research.

I would also like to thank Alex Smola, my other advisor. Alex has tons of research ideas every day, far more than I can digest, yet most of them would lead to great papers. Although it was a pity that Alex left CMU midway through my PhD, his parting wisdom proved to be very important for me: "Walk out of your comfort zone." I followed his suggestion and started to work on deep learning in 2016, which totally changed my research path.

I should thank Quoc Le as well, who recruited me to Google Brain as an intern even though he knew I was a complete layman in deep learning. Quoc opened the door to neural networks for me, taught me how to think outside the box, and shared with me countless crazy ideas. I am really lucky to have him as a mentor and close friend in my career.

Thanks also go to the other thesis committee members, Ruslan Salakhutdinov and Chris Manning, for their time and effort in giving me invaluable feedback. As a world-class expert, Russ always impresses me with his deep understanding of every aspect of neural networks. I really enjoyed the discussions with him, the output of which has become part of this thesis. I still remember the questions raised by Chris when I gave a talk at the Stanford NLP group, which helped shape my oral presentation. The meeting in person after the talk was also enlightening, and I was grateful that Chris shared his viewpoint on the future of NLP in the deep learning era.

I am fortunate enough to have worked with so many talented people in optimization, which forms a major part of this thesis. First, I would like to thank Fatma Kılınç-Karzan, who mentored me on my first paper at CMU, which turned out to be an INFORMS data mining best student paper candidate. I also need to thank Suvrit Sra, who taught a fantastic course, Advanced Optimization and Randomized Methods, whose course project turned out to be my second paper at CMU. The discussion and collaboration continued after he moved to MIT, which was enjoyable. I also enjoyed the collaboration with Qihang Lin, one of the most solid optimization researchers I have ever met.
I really appreciate Lin Xiao hosting me at Microsoft Research Redmond for a fruitful summer internship. He set a good example for me, teaching me the importance of patience in doing solid research in optimization. I was impressed by the breadth of Yaoliang Yu's horizons, and discussions with him were always enlightening.

I would not have been able to grow into who I am today without my fantastic collaborators, whom I would like to thank here as well: Maria Florina Balcan, Jaime Carbonell, Kai Chen, Weizhu Chen, David Dohan, Simon S. Du, Lei Huang, Fatma Kılınç-Karzan, Quoc Le, Hongrae Lee, Bo Li, Mu Li, Qihang Lin, Thang Luong, Wanli Ma, Mohammad Norouzi, Ruslan Salakhutdinov, Aarti Singh, Alex Smola, Suvrit Sra, Yining Wang, Lin Xiao, Tianbao Yang, Yaoliang Yu and Rui Zhao.

I am also grateful to Diane Stidle, Mallory Deptola, Alison Chiocchi and all the other MLD staff members for keeping our big ML family well organized. I want to say sorry to Diane for always bringing her trouble, yet she has been consistently patient and helpful. I apologize if I have forgotten anyone who should be acknowledged.

Last but not least, I need to thank my parents for their long-lasting support and help throughout my life. I am particularly thankful to my wife Yanyu Long. Yanyu, to keep our family ship afloat, you gave up your job in China, came with me to CMU, started from scratch to become a surgeon in the US, and gave birth to our lovely boy in the meantime. You have faced much higher pressure and conquered far more challenges than I have, both physically and psychologically, with your bravery, courage and determination. You are simply my hero, and this thesis is dedicated to you!

Contents

1 Introduction
  1.1 Background and Motivation
  1.2 Part One: Efficient and Effective Deep Learning Models for Sequential Data Processing
    1.2.1 Fast Recurrent Model to Capture Important Information
    1.2.2 Recurrence Free Model for Parallel Training and Inference
  1.3 Part Two: Efficient Algorithms for General Model Training
    1.3.1 Layer-wise Normalized Gradient Method for Training Deep Neural Networks
    1.3.2 Distributed Optimization on Parameter Servers
  1.4 Excluded Research

I Efficient and Effective Models for Sequential Data Processing

2 Learning to Skip Unimportant Information in Sequential Data
  2.1 Introduction
  2.2 Related Work
  2.3 Methodology
    2.3.1 Model Overview
    2.3.2 Training with REINFORCE
    2.3.3 Inference
  2.4 Experimental Results
    2.4.1 Number Prediction with a Synthetic Dataset