End-To-End Large Scale Machine Learning with Keystoneml by Evan

End-to-End Large Scale Machine Learning with KeystoneML by Evan Randall Sparks A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Michael J. Franklin, Co-chair Professor Benjamin Recht, Co-chair Professor Ion Stoica Professor Joshua S. Bloom Fall 2016 End-to-End Large Scale Machine Learning with KeystoneML Copyright 2016 by Evan Randall Sparks 1 Abstract End-to-End Large Scale Machine Learning with KeystoneML by Evan Randall Sparks Doctor of Philosophy in Computer Science University of California, Berkeley Professor Michael J. Franklin, Co-chair Professor Benjamin Recht, Co-chair The rise of data center computing and Internet-connected devices has led to an unparal- leled explosion in the volumes of data collected across a multitude of industries and academic disciplines. This data serves as fuel for statistical machine learning techniques that in turn enable some of today's most advanced applications including those powered by image classification, speech recognition, and natural language understanding, which we broadly term machine learning applications. Unfortunately, until recently the tools and techniques used to leverage recent advances in machine learning at the scales demanded by modern datasets, and thus develop these applications, have been available only to experts in fields such as distributed computing, statistics, and optimization. I describe my efforts to render these tools accessible to a broader audience of application developers, and further demonstrate that by taking a holistic approach and capturing end-to-end high level specifications of machine learning applications the systems I present here can make novel, high impact optimizations to decrease resource consumption while simultaneously increasing throughput. These improvements are designed to decrease ML application development time, increase quality, and increase machine learning application developer productivity. I demonstrate the viability of these optimizations via experiments on a number of real-world applications in domains such as collaborative filtering, computer vision, and natural language processing. Many of the ideas presented in this thesis have already had practical impact as embodied in the open source software packages KeystoneML and Apache Spark MLlib. i For my family. ii Contents Contents ii List of Figures vi List of Tables ix 1 Introduction 1 1.1 Datacenter Explosion . 2 1.2 The Promise of Machine Learning . 3 1.3 Machine Learning Pipelines . 3 1.4 The Challenges of Large Scale ML . 4 1.5 Decisions, Decisions . 6 1.6 Thesis Overview and Contributions . 7 1.7 Organization . 8 2 Background 10 2.1 Machine Learning Application Development . 10 2.1.1 Data Collection . 11 2.1.2 Data Preprocessing and Featurization . 12 2.1.3 Machine Learning Algorithms . 13 2.1.4 Deployment . 15 2.1.5 Overfitting and Hyperparameter Tuning . 15 2.1.6 Scaling Up Machine Learning . 17 2.2 Related Work . 19 2.2.1 Learning Systems . 19 2.2.2 Large Scale Learning . 19 2.2.3 Hyperparameter Tuning . 20 2.2.4 Declarative Programming and Database Optimization . 22 2.2.5 Non-Goals: Generative Models, Tiny Data, and Deep Learning . 22 2.2.6 Other Related Work . 23 3 MLI: An API for Distributed Machine Learning 25 iii 3.1 Introduction . 25 3.2 The MLI Interface . 27 3.2.1 MLTable . 28 3.2.2 LocalMatrix . 28 3.2.3 Optimization, Models, and Algorithms . 29 3.3 Examples . 30 3.3.1 Binary Classification: Logistic Regression . 33 3.3.2 Collaborative Filtering: Alternating Least Squares . 35 3.3.3 Configuration Considerations . 39 3.4 Related work . 40 3.5 Conclusion . 41 4 TuPAQ: Automating Model Search for Large Scale Machine Learning 42 4.1 Introduction . 42 4.2 Model Search and TUPAQ............................ 44 4.2.1 Defining Model Search . 44 4.2.2 Problem Setting . 45 4.2.3 Connections to Query Optimization . 46 4.2.4 Baseline Model Search . 47 4.2.5 TuPAQ Model Search . 47 4.3 TUPAQ Design Choices . 51 4.3.1 Cost-based Execution Strategy Selection . 51 4.3.2 Advanced Hyperparameter Tuning . 53 4.3.3 Bandit Resource Allocation . 54 4.3.4 Batching . 55 4.4 Design Space Evaluation . 58 4.4.1 Execution Strategy Selection . 58 4.4.2 Hyperparameter Tuning . 60 4.4.3 Bandit Resource Allocation . 62 4.4.4 Batching . 63 4.5 Putting It All Together . 64 4.5.1 Platform Configuration . 65 4.5.2 Experimental Setup and Datasets . 65 4.5.3 Optimization Effects . 66 4.5.4 Large Scale Speech and Vision . 67 4.6 Related Work . 68 4.7 Future Work and Conclusions . 69 5 KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics 71 5.1 Introduction . 71 5.2 Pipeline Construction and Core APIs . 74 5.2.1 Logical ML Operators . 74 iv 5.2.2 Pipeline Construction . 75 5.2.3 Pipeline Execution . 76 5.3 Operator-Level Optimization . 76 5.4 Whole-Pipeline Optimization . 81 5.4.1 Execution Subsampling . 81 5.4.2 Common Sub-expression Elimination . 81 5.4.3 Automatic Materialization . 81 5.5 Evaluation . 84 5.5.1 End-to-End ML Applications . 85 5.5.2 KeystoneML vs. Other Systems . 86 5.5.3 Optimization Levels . 89 5.5.4 Automatic Materialization Strategies . 90 5.5.5 Scalability . 91 5.6 Related Work . 92 5.7 Future Work and Conclusion . 93 6 Piperplanned: Resource Aware Pipeline Hyperparameter Tuning 95 6.1 Introduction . 95 6.2 Hyperparameter Optimization . 98 6.2.1 Hyperparameter Search . 98 6.2.2 Pipelines . 99 6.2.3 Pipeline Tuning . 99 6.3 Implementation . 100 6.3.1 Search Space Description . 100 6.3.2 Operator Graphs . 102 6.3.3 Execution Planner . 102 6.4 Optimal Reuse in Pipeline Execution . 103 6.4.1 Maximum Achievable Reuse . 103 6.4.2 Memory Constraints . 104 6.4.3 Optimal Dataflow Reuse with Constrained Memory . 106 6.5 Evaluation . 108 6.5.1 Experimental Setup . 108 6.5.2 Limitations on Solving the ILP . 108 6.5.3 Evaluating Cache Strategies . 109 6.5.4 Results on Real Hyperparameter Configurations . 111 6.6 Related Work . 113 6.7 Future Work and Conclusions . 114 7 Future Work and Conclusions 115 7.1 Contributions . 115 7.2 Limitations . 116 7.3 Future Challenges . 117 v 7.4 Final Remarks . 118 Bibliography 119 vi List of Figures 1.1 Cost of storage over time [109]. 2 1.2 A \simple" image classification pipeline. 4 1.3 A \simple" image classification pipeline annotated by ease of scalability. 5 1.4 A \simple" image classification pipeline annotated with hyperparameters. 7 2.1 A high level view of the ML application development cycle. 11 3.1 Landscape of existing development platforms for ML. 26 3.2 MLTable API.

Load more