KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics

Evan R. Sparks∗, Shivaram Venkataraman∗, Tomer Kaftan∗‡, Michael J. Franklin∗†, Benjamin Recht∗ {sparks,shivaram,tomerk11,franklin,brecht}@cs.berkeley.edu ∗AMPLab, Department of Computer Science, University of California, Berkeley †Department of Computer Science, University of Chicago ‡Department of Computer Science, University of Washington

Abstract—Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and optimizes the end-to-end large-scale machine learning applications for high-throughput training in a distributed environment with a high-level API. This approach offers increased ease of use and higher performance over existing systems for large scale learning. We demonstrate the effectiveness of KeystoneML in achieving high quality statistical accuracy and scalable training using real world datasets in several domains.

[Figure 1, panels 1 and 2: "1. Pipeline Specification" (val pipe = Preprocess andThen Featurize andThen (Est, data, labels)) and "2. Logical Operator DAG"; the remaining panels and the caption appear with the Introduction below.]

I. INTRODUCTION

Today's advanced analytics applications increasingly use machine learning (ML) as a core technique in areas ranging from business intelligence to recommendation to natural language processing [1] and speech recognition [2]. Practitioners build complex, multi-stage pipelines involving feature extraction, dimensionality reduction, data transformations, and training supervised learning models to achieve high accuracy [3]. However, current systems provide little support for automating the construction and optimization of these pipelines.

To assemble such pipelines, developers typically piece together domain specific libraries [4], [5] for feature extraction and general purpose numerical optimization packages [6], [7] for supervised learning. This is often a cumbersome and error-prone process [8]. Further, these pipelines need to be completely re-engineered when the training data or features grow by an order of magnitude, often the difference between an application that provides good statistical accuracy and one that does not [9]. As no broader system has purview of the end-to-end application, only narrow optimizations can be applied.

These challenges motivate the need for a system that:
• Allows users to specify end-to-end ML applications in a single system using high level logical operators.
• Scales out dynamically as data volumes and problem complexity change.
• Automatically optimizes these applications given a library of ML operators and the user's compute resources.

While existing efforts in the data management community [10], [11], [7] and in the broader machine learning systems community [6], [12], [13] have built systems to address some of these problems, each of them misses the mark on at least one of the points above.

[Figure 1, panels 3 and 4: "3. Optimized Physical DAG" and "4. Distributed Training".] Fig. 1: KeystoneML takes a high-level ML application specification, optimizes and trains it in a distributed environment. The trained pipeline is used to make predictions on new data.

We present KeystoneML, a framework for ML pipelines designed to satisfy the above requirements. Fundamental to the design of KeystoneML is the observation that model training is only one component of an ML application. While a significant body of recent work has focused on high performance algorithms [14], [15], and scalable implementations [16], [7] for model training, they do not capture the featurization process or the logical intent of the workflow. KeystoneML provides a high-level, type-safe API (Figure 1) built around logical operators to capture end-to-end applications.

To optimize ML pipelines, database query optimization provides a natural motivation for the core design of such a system [17]. However, compared to relational database query optimization, ML applications present an additional set of concerns. First, ML operators are often iterative and may require multiple passes over their inputs, presenting opportunities for data reuse. Second, many ML operators provide only approximate answers to their inputs [15]. Third, numerical data properties such as sparsity and dimensionality are a necessary source of information when selecting optimal execution plans, and conventional optimizers do not consider them. Finally, the system should be aware of the computation-vs-communication tradeoffs inherent in distributed processing [11], [6] and choose appropriate distributed execution strategies.
To address these challenges we develop techniques to do both per-operator optimization and end-to-end pipeline optimization for ML pipelines. We use a cost-based optimizer that accounts for both computation and communication costs, and our cost model can easily accommodate new operators and hardware configurations. To determine which intermediate states are materialized in memory during iterative execution, we formulate an optimization problem and present a greedy algorithm that works efficiently and accurately in practice.

We measure the importance of cost-based optimization and its associated overheads using real-world workloads from computer vision, speech and natural language processing. We find that end-to-end optimization can improve performance by 7× and that physical operator optimizations combined with end-to-end optimizations can improve performance by up to 15× versus unoptimized execution. We show that in our experiments, poor physical operator selection can result in up to a 260× slowdown. Using an image classification pipeline on over 1M images [3], we show that KeystoneML provides linear performance scalability across various cluster sizes, and statistical performance comparable to recent results [18], [3]. Additionally, KeystoneML can match the performance of a specialized phoneme classification system on a BlueGene supercomputer while using 8× fewer resources.

In summary, we make the following contributions:
• We present KeystoneML, a system for describing ML applications using high level logical operators. KeystoneML enables end-to-end optimization of ML applications at both the operator and pipeline level.
• We demonstrate the importance of physical operator selection in the context of input characteristics of three commonly used logical ML operators, and propose a cost model for making this selection.
• We present and evaluate an initial set of whole-pipeline optimizations, including a novel algorithm that automatically identifies a subset of intermediate data to materialize to speed up pipeline execution.
• We evaluate these optimizations in the context of real-world pipelines in a diverse set of domains: phoneme classification, image classification, and textual sentiment analysis, and demonstrate near-linear scalability over 100s of machines with strong statistical performance.
• We compare KeystoneML with several recent systems for large-scale learning and demonstrate superior runtime from our optimization techniques and scale-out strategy.

KeystoneML is open source software¹ and is being used in scientific applications in solar physics [19] and genomics [20].

¹http://www.keystone-ml.org/

II. PIPELINE CONSTRUCTION AND CORE APIS

In this section we introduce the KeystoneML API that can be used to express end-to-end ML pipelines. Each pipeline is composed of a number of operators that are chained together. For example, Figure 2 shows the KeystoneML source code for a complete text classification pipeline. We next describe the building blocks of our API.

val textClassifier = Trim andThen
  LowerCase andThen
  Tokenizer andThen
  NGramsFeaturizer(1 to 2) andThen
  TermFrequency(x => 1) andThen
  (CommonSparseFeatures(1e5), data) andThen
  (LinearSolver(), data, labels)
val predictions = textClassifier(testData)

Fig. 2: A text classification pipeline is specified using a small set of logical operators.

A. Logical ML Operators

Conventional analytics queries are typically composed using a small number of well studied relational database operators. This well-defined environment enables important optimizations. However, ML applications lack such an abstraction and practitioners typically piece together imperative libraries. Recent efforts have proposed using linear algebra operators such as matrix multiplication [11], convex optimization routines [21] or multi-dimensional arrays as logical building blocks [22].

In contrast, with KeystoneML we propose a design where high-level ML operations (such as PCA, LinearSolver) are used as building blocks. Our approach has two major benefits: First, it simplifies building applications. Even complex pipelines can be built using just a handful of operators. Second, this higher level abstraction allows us to perform a wider range of optimizations. Our key insight here is that there are usually multiple well studied algorithms for a given ML operator, but that their performance and statistical characteristics vary based on the inputs and system configuration. We next describe the API for operators in KeystoneML.

trait Transformer[A, B] extends Pipeline[A, B] {
  def apply(in: Dataset[A]): Dataset[B] = in.map(apply)
  def apply(in: A): B
}
trait Estimator[A, B] {
  def fit(data: Dataset[A]): Transformer[A, B]
}
trait Optimizable[T, A, B] {
  val options: List[(CostModel, T[A,B])]
  def optimize(sample: Dataset[A], d: ResourceDesc): T[A,B]
}
class CostProfile(flops: Long, bytes: Long, network: Long)
trait CostModel {
  def cost(sample: Dataset[A], workers: Int): CostProfile
}
trait Iterative {
  def weight: Int
}

Fig. 3: The KeystoneML API consists of two extendable operator types and interfaces for optimization.

Pipelines are composed of operators. Transformers and Estimators are two abstract types of operators in KeystoneML. An operator is a function which operates on zero or more inputs to produce some output. A logical operator satisfies some logical contract. For example, it takes an image and converts it to grayscale. Every logical operator must have at least one physical operator associated with it which implements its logic. Logical operators with multiple physical implementations are candidates for optimization. They are marked Optimizable and have a set of CostModels associated with them. Operators that are iterative with respect to

trait Pipeline[A, B] {
  def andThen[C](next: Pipeline[B, C]): Pipeline[A, C]
  def andThen[C](est: Estimator[B, C], data: Dataset[A]): Pipeline[A, C]
  // Combine the outputs of branches into a sequence
  def gather[A, B](branches: Seq[Pipeline[A, B]]): Pipeline[A, Seq[B]]
}

Fig. 4: Transformers and Estimators are chained using a syntax designed to allow developers to incrementally build pipelines.

[Figure 5 operator DAG; nodes include Grayscaler, SIFT Extractor, Column Sampler, PCA, Reduce Dimensions, GMM, Fisher Vector, Normalize, Training Labels, Linear Solver, Linear Map, and Top 5 Classifier.] Fig. 5: A pipeline DAG for image classification. Estimators are shaded.

C. Pipeline Execution

KeystoneML is designed to run with large, distributed datasets on commodity clusters. Our high level API and optimizers can be executed using any distributed data-flow engine, and we chose Apache Spark as the first execution backend.

The execution flow of KeystoneML is shown in Figure 1. First, developers specify pipelines using the KeystoneML APIs described above. As calls to these APIs are made, KeystoneML incrementally builds an operator DAG for the pipeline.
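To make these two operator types concrete, the following is a minimal, self-contained sketch in the spirit of Figures 3 and 4. It is not the KeystoneML library API: Dataset is modeled as a local Seq, the Pipeline supertype and optimization interfaces are omitted, and the Tokenizer and MostFrequentToken operators are invented for illustration. It shows a deterministic Transformer chained, via andThen, with the Transformer produced by fitting an Estimator.

object PipelineSketch {
  type Dataset[A] = Seq[A]

  trait Transformer[A, B] { self =>
    def apply(in: A): B
    def apply(in: Dataset[A]): Dataset[B] = in.map(apply)
    // Chaining in the style of Fig. 4: fuse two transformers into one.
    def andThen[C](next: Transformer[B, C]): Transformer[A, C] = new Transformer[A, C] {
      def apply(in: A): C = next(self(in))
    }
  }

  trait Estimator[A, B] {
    def fit(data: Dataset[A]): Transformer[A, B]
  }

  // A Transformer: a deterministic, side-effect free function on individual items.
  object Tokenizer extends Transformer[String, Seq[String]] {
    def apply(in: String): Seq[String] = in.toLowerCase.split("\\s+").toSeq
  }

  // An Estimator: fits on a collection of items and returns a Transformer (the "model").
  object MostFrequentToken extends Estimator[Seq[String], Boolean] {
    def fit(data: Dataset[Seq[String]]): Transformer[Seq[String], Boolean] = {
      val top = data.flatten.groupBy(identity).maxBy(_._2.size)._1
      new Transformer[Seq[String], Boolean] {
        def apply(in: Seq[String]): Boolean = in.contains(top)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val corpus: Dataset[String] = Seq("the cat sat", "the dog ran", "the cat slept")
    val model = MostFrequentToken.fit(Tokenizer(corpus))  // fit on featurized training data
    val pipeline = Tokenizer andThen model                 // featurizer chained with the model
    println(pipeline("the fish swam"))                     // true: "the" is the most frequent token
  }
}

In the actual system the dataset would be a distributed collection and fitting is deferred until the whole pipeline DAG has been built and optimized, as described in Section II-C.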
An their inputs are marked Iterative. It is in general difficult example operator DAG for image classification is shown in to identify a priori the exact number of iterations an iterative Figure 5. Once a pipeline is applied to some data, this DAG is ML algorithm will take to converge to a solution. However, then optimized using a set of optimizations described below– in our system each iterative algorithm is parameterized with a we call this stage optimization time. Once the application has maximum number of epochs over the dataset and we use this been optimized, the DAG is traversed depth-first and operators as an estimate of the number of times an operator will reuse are executed one at a time, with nodes up until pipeline break- its input. For the whole-pipeline optimization discussed in ers (i.e. Estimators) packed into the same job–this stage is Section IV-C the fact that an input is reused is more important runtime. This lazy optimization procedure gives the optimizer than how many times it is reused. full information about the application in question. We now A Transformer is an operator that can be applied to individ- consider the optimizations made by KeystoneML. ual data items (or to a collection of items) and produces a new III.OPERATOR-LEVEL OPTIMIZATION data item (or a collection of data items)–it is a deterministic In this section we describe the operator-level optimization unary function without side-effects. Examples of Transformers procedure used in KeystoneML. Similar to database query in KeystoneML include basic data transformations, feature optimizers, the goal of the operator-level optimizer is to choose extractors and model application. The deterministic and side- the best physical implementation for every machine learning effect free properties affords the ability to reorder and optimize operator in the pipeline. This is challenging to do because the execution of the functions without changing the result. operators in KeystoneML are distributed i.e. they involve An Estimator is applied to a distributed collection of data computation and communication across the cluster. Operator items and produces a Transformer–it is a function generating performance may also depend on statistical properties like function. ML algorithms provided by the KeystoneML Stan- sparsity of input data and level of accuracy desired. Finally, as dard Library are Estimators, while featurizers are Transform- discussed in Section II, KeystoneML consists of a set of high- ers. For example, LinearSolver is an Estimator that takes a level operators. The advantage of having high-level operators data set and labels, finds the linear model which minimizes the is that we can perform more wide-ranging optimizations. But square loss between the training data and labels, and produces this makes designing an optimizer more challenging because a Transformer that can apply this model to new data. unlike relational operators or linear algebra [11], the set of operators in KeystoneML is not closed. We next discuss how B. Pipeline Construction we address these challenges. Transformers and Estimators are chained together into a Approach: The approach we take in KeystoneML is to de- Pipeline using a consistent set of rules. The chaining meth- velop a cost-based optimizer that splits the cost model into ods are summarized in Figure 4. In addition to linear chaining two parts: an operator-specific part and a cluster-specific part. 
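To illustrate how such a split cost model can drive the choice among physical implementations, here is a small, self-contained sketch. All names (DataStats, ClusterDesc, ExactSolver, LbfgsSolver) and the constant weights are hypothetical rather than KeystoneML internals; the cost shapes loosely follow the per-node compute and per-link network terms of Table I, and the combination rule mirrors the cost formula given below.

object CostModelSketch {
  // Statistics of the (sampled) input dataset A_s.
  case class DataStats(n: Long, d: Long, k: Long, sparsity: Double)
  // Cluster resource descriptor R: worker count plus relative weights for local
  // execution vs. coordination, derived from microbenchmarks.
  case class ClusterDesc(workers: Int, rExec: Double, rCoord: Double)

  // An operator-specific cost model: execution and coordination costs on the critical path.
  trait CostModel {
    def execCost(s: DataStats, w: Int): Double   // e.g. FLOPs on the busiest node
    def coordCost(s: DataStats, w: Int): Double  // e.g. bytes over the most loaded link
  }

  // Two candidate physical implementations of a linear solver.
  object ExactSolver extends CostModel {
    def execCost(s: DataStats, w: Int): Double = s.n.toDouble * s.d * (s.d + s.k) / w
    def coordCost(s: DataStats, w: Int): Double = s.d.toDouble * (s.d + s.k)
  }
  object LbfgsSolver extends CostModel {
    val iters = 20  // assumed maximum number of passes over the data
    def execCost(s: DataStats, w: Int): Double = iters * s.n * (s.d * s.sparsity) * s.k / w
    def coordCost(s: DataStats, w: Int): Double = iters.toDouble * s.d * s.k
  }

  // c(f, A_s, R) = R_exec * c_exec(f, A_s, R_w) + R_coord * c_coord(f, A_s, R_w)
  def cost(m: CostModel, s: DataStats, r: ClusterDesc): Double =
    r.rExec * m.execCost(s, r.workers) + r.rCoord * m.coordCost(s, r.workers)

  def main(args: Array[String]): Unit = {
    val stats = DataStats(n = 1000000L, d = 10000L, k = 2L, sparsity = 0.001)
    val cluster = ClusterDesc(workers = 16, rExec = 1.0, rCoord = 50.0)
    val options = Map("exact" -> ExactSolver, "lbfgs" -> LbfgsSolver)
    val best = options.minBy { case (_, m) => cost(m, stats, cluster) }
    println(s"chosen solver: ${best._1}")  // sparse input favors the iterative solver
  }
}

With sparse, high-dimensional input the iterative solver's estimated cost is orders of magnitude lower, which matches the empirical behavior reported for the Amazon pipeline later in this section.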
of nodes using andThen, KeystoneML's API allows for pipeline branching. When a developer calls andThen a new Pipeline object is returned. By calling andThen multiple times on the same pipeline, users can create multiple pipelines that branch out. Developers concatenate the output of multiple pipelines using gather. Redundancy is eliminated via common sub-expression optimization (Section IV). We find these APIs are sufficient for a number of ML applications (Section V), but expect to extend them over time.

The operator-specific part models the computation and communication time given statistics of the input data and number of workers, and the cluster-specific part is used to weigh their relative importance. More formally, the cost estimate for each physical operator, f, can be expressed as:

c(f, A_s, R) = R_exec · c_exec(f, A_s, R_w) + R_coord · c_coord(f, A_s, R_w)

Where f is the operator in question, A_s contains statistics of a dataset to be used as its input, and R, the cluster resource descriptor, represents the cluster computing, memory, and networking resources available. The cluster resource descriptor is collected via configuration data and microbenchmarks. Statistics captured include per-node CPU throughput (in GFLOP/s), disk and memory bandwidth (GB/s), and network speed (GB/s), as well as information about the number of nodes available. A_s is determined through a process we will discuss in Section IV. R_w is the number of cluster nodes available. The functions c_exec and c_coord are developer-defined operator-specific functions (defined as part of the operator CostModel) that describe execution and coordination costs in terms of the longest critical path in the execution graph of the individual operators [23], e.g. the most FLOPS used by a node in the cluster or the amount of data transferred over the most loaded link. Such functions are also used in the analysis of parallel algorithms [24] and are well known for common linear algebra based operators. R_exec and R_coord are determined by the optimizer from the cluster resource descriptor (R) and capture the relative speed of local and network resources on the cluster.

TABLE I: Resource requirements for linear solvers. w is the number of workers in the cluster, i the number of passes over the dataset. For the sparse solvers s is the average number of non-zero features per example, and b is the block size for the block solver. Compute and Memory requirements are per-node, while network requirements are in terms of the data sent over the most loaded link.
Algorithm   | Compute          | Network      | Memory
Local QR    | O(nd(d + k))     | O(n(d + k))  | O(d(n + k))
Dist. QR    | O(nd(d + k)/w)   | O(d(d + k))  | O(nd/w + d^2)
L-BFGS      | O(insk/w)        | O(idk)       | O(ns/w + dk)
Block Solve | O(ind(b + k)/w)  | O(id(b + k)) | O(nb/w + dk)

[Figure 6 plot: training time (seconds, log scale) vs. number of features (1024 to 16384) on the Amazon and TIMIT datasets; the caption appears below.]

Splitting the cost model in this fashion allows the optimizer to easily adapt to new hardware (e.g., GPUs or

Infiniband networks) because operator developers only need Solver ● Exact Block Solver LBFGS to implement a CostModel and the system accounts for hardware properties. These cost models are approximate and Fig. 6: A poor choice of solver can mean orders of magnitude the cost c need not be equal to the actual running time of the difference in runtime. Runtime for exact solve grows quadrat- operator. As in conventional query optimizers, the goal of the ically in the number of features and times out with 4096 cost model is to avoid bad decisions, which a roughly accurate features for Amazon and 16384 features for TIMIT running model will do. When two operators have nearly equivalent on 16 c3.4xlarge nodes. cost, either should be an acceptable choice. We next illustrate the cost functions for three central operators in KeystoneML On an Amazon Reviews dataset (see Table IV) with a text and show how input properties affect performance. classification pipeline, as we increase the number of features Linear Solvers are supervised Estimators that learn a linear from 1k to 16k we see in Figure 6 that L-BFGS performs 5- X ∈ d×k A ∈ n×d map R between an input dataset R to a 20× faster than the exact solver and 26-260× faster than the B ∈ n×k X labels dataset R by finding the which minimizes block-wise solver. Additionally the exact solver crashes for ||AX − B|| the value F . In a multi-class classification setting, greater than 4k features as the memory requirements are too n d is the number of examples or data points, the number high. The reason for this speedup is that the features generated k of features and the number of classes. In the KeystoneML in text classification problems are sparse and the L-BFGS Standard Library we have several implementations of linear solver exploits the sparse inputs to calculate gradients cheaply. solvers, distributed and local, including The optimal solver choice does not always stay the same • Exact solvers [25] that compute closed form solutions as we increase the problem size or as sparsity changes. For to the least squares loss and return an X to high precision. the TIMIT dataset, which has dense features, we see that the • Block solvers that partition the features into a set of exact solver is 3-9× faster than L-BFGS for smaller number of blocks and use second-order Jacobi or Gauss-Seidel [26] features. However when the number of features goes beyond updates to converge to the right solution. 8k we see that the exact solver becomes slower than the block- wise solver which is also 2-3× faster than L-BFGS. • Gradient based methods like SGD [15] or L-BFGS [27] which perform iterative updates using the gradient and Principal Component Analysis (PCA) is an Estimator used converge to a globally optimal solution. for tasks ranging from dimensionality reduction to whitening to visualization. PCA takes an input dataset A in Rn×d, and a Table I summarizes the cost model for each method. Constants value k and produces a Transformer which can apply a matrix are omitted for readability but are necessary in practice. P in Rd×k, where P consists of the first k eigenvectors of To illustrate these cost tradeoffs empirically, we vary the the covariance matrix of A. The P matrix can be found using number of features generated by the featurization stage of Singular Value Decomposition (SVD) or via an approximate two different pipelines and measure the training time and the algorithm, Truncated SVD [28]. Both methods can be dis- training loss. 
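The claim that sparsity makes gradient computation cheap can be seen in a toy, single-machine sketch. This is illustrative code rather than the KeystoneML solver: it uses a single output column and plain gradient descent as a stand-in for L-BFGS, but each pass over the data costs time proportional to the number of non-zero features rather than n × d.

object SparseGradientSketch {
  // A row is a sparse feature vector: (featureIndex, value) pairs.
  type SparseRow = Seq[(Int, Double)]

  def dot(row: SparseRow, x: Array[Double]): Double =
    row.map { case (i, v) => v * x(i) }.sum

  // One gradient evaluation of ||Ax - b||^2 over all rows: cost is proportional to the
  // number of non-zero entries, not to n * d.
  def gradient(rows: Seq[SparseRow], b: Array[Double], x: Array[Double]): Array[Double] = {
    val g = new Array[Double](x.length)
    rows.zipWithIndex.foreach { case (row, r) =>
      val residual = dot(row, x) - b(r)
      row.foreach { case (i, v) => g(i) += 2.0 * residual * v }
    }
    g
  }

  def main(args: Array[String]): Unit = {
    // Three examples over a 5-dimensional sparse feature space.
    val rows: Seq[SparseRow] = Seq(
      Seq(0 -> 1.0, 3 -> 2.0),
      Seq(1 -> 1.0),
      Seq(0 -> 0.5, 4 -> 1.0))
    val labels = Array(1.0, 0.0, 1.0)
    var x = new Array[Double](5)
    // Plain gradient descent iterations; an L-BFGS solver would reuse the same sparse
    // gradients together with a limited history of curvature information.
    for (_ <- 1 to 100) {
      val g = gradient(rows, labels, x)
      x = x.zip(g).map { case (xi, gi) => xi - 0.1 * gi }
    }
    println(x.map(v => f"$v%.3f").mkString(" "))
  }
}

An L-BFGS implementation adds bookkeeping around this loop, but its per-pass cost is dominated by exactly this kind of sparse gradient evaluation, which is why it wins on the sparse Amazon features above.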
We compare the methods on a 16 node cluster. tributed, and their resource requirements are shown in Table II. 10000 Algorithm Compute Network Memory SVD O(nd2) O(nd) O(nd + d2) 2 TSVD O(ndk + ink ) O(nd) O(nd + dk) ● ●●● 2 ● ● nd 2 nd 2 1000 ● ● Dist. SVD O( ) O(d ) O( + d ) ● ● ● w w ● ndk+ink2 nd ● Dist. TSVD O( ) O(i(nk + dk)) O( + dk) Time (ms) ● w w ● TABLE II: Resource requirements for PCA Algorithms. For 100 these algorithms, k is the number of principal components 2 4 6 10 20 30 requested. All other quantities are defined as in Table I. Convolution Size (k)

Strategy ● Separable BLAS FFT d = 256 d = 4096 k = 1 16 64 k = 16 64 1024 Fig. 7: Time to perform 50 convolutions on a 256x256 3- channel image. As k increases, the optimal method changes. n = 104 SVD 0.1 0.1 0.1 26 26 26 matrix-vector algorithm. The FFT algorithm does not depend TSVD 0.2 0.3 0.4 3 6 34 on k and thus performs the same regardless of k. Dist. SVD 1.7 1.7 1.7 106 106 106 Cost Model Evaluation: To evaluate how well our cost-model Dist. TSVD 4.9 3.8 5.3 6 22 104 works, we compared the physical operator chosen by our n = 106 optimizer against the best choice from empirically measured SVD 11 11 11 x x x values for linear solvers (Figure 6) and PCA (Table III). We TSVD 14 30 65 x x x found that our optimizer made the right choice 90% of the Dist. SVD 2 2 2 260 260 260 Dist. TSVD 16 59 262 75 1,326 8,310 time for linear solvers and 84% of the time for PCA. In both cases we found that the wrong choices were made when TABLE III: Comparison of runtimes (in seconds) for approx- the running time of two operators were close to each other imate and exact PCA operators across different dataset sizes. and thus the approximate cost model did not severely impact A dataset has n examples and d features. k is an algorithm overall performance. For example, for the linear solver with input. An x indicates that the operation did not complete. 4096 dense features, the optimizer chooses the BlockSolver but empirically the Exact solver is about 30% faster. As seen from the three examples above, the choice of To better illustrate how the choice of a PCA implementation optimal physical execution depends on hardware properties affects the run time, we construct a micro-benchmark that and on properties of the input data. Thus, choices made in varies problem size along n, d, and k, and execute both local support of operator-level optimization depend on upstream and distributed implementations of the approximate and exact processing and cheaply estimating data properties at various algorithm on a 16-node cluster. In Table III, we can see that as points in the pipeline is an important problem. We next discuss data volumes increase in n and d it makes sense to run PCA how operator chaining semantics can help in achieving this. in a distributed fashion, while for relatively small values of k, it can make sense to use the approximate method. IV. WHOLE-PIPELINE OPTIMIZATION Convolution is a critical building block of Signal, Speech, and Image Processing pipelines. In image processing, the A. Execution Subsampling Transformer takes in an Image of size n × n × d and applies Operator optimization in KeystoneML requires the collec- a bank of b filters (each of size k × k, where k < n) to the tion of statistics about input data at each pipeline stage. For Image and returns the b resulting convolved images of size example, a text featurization operator might map a string into m × m, where m = n − k + 1. There are three main ways to a 10, 000-dimensional sparse feature vector. Without statistics implement convolutions: via a matrix-vector product scheme about the input (e.g. vector sparsity) after featurization, a when convolutions are separable, using BLAS matrix-matrix downstream operator will be unable to make its optimization multiplication [29], or via a Fast Fourier Transform (FFT) [30]. decision. As such, dataset statistics (As) are determined by The matrix-vector product scheme takes O(dbk(n − k + first estimating the size of the initial input dataset (in records), 1)2 + bk3) time. 
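To see how the asymptotic cost expressions for the three convolution strategies trade off, the following sketch simply evaluates them for the Figure 7 setting (n = 256, d = 3 channels, b = 50 filters) across filter sizes k. It performs no convolutions and drops all constant factors, so only relative growth is meaningful; the matrix-vector expression applies only when filters happen to be separable, and the measured crossover in Figure 7 also reflects constants omitted here.

object ConvCostSketch {
  // Asymptotic cost expressions from the text (constant factors dropped).
  def matrixVector(n: Int, d: Int, b: Int, k: Int): Double =
    d.toDouble * b * k * math.pow(n - k + 1, 2) + b * math.pow(k, 3)
  def blasMatrixMatrix(n: Int, d: Int, b: Int, k: Int): Double =
    d.toDouble * b * k * k * math.pow(n - k + 1, 2)
  def fft(n: Int, d: Int, b: Int, k: Int): Double =
    6.0 * d * b * n * n * (math.log(n) / math.log(2)) + 4.0 * d * b * n * n

  def main(args: Array[String]): Unit = {
    val (n, d, b) = (256, 3, 50)
    for (k <- Seq(2, 4, 6, 10, 20, 30)) {
      // The BLAS term grows with k^2 while the FFT term is independent of k,
      // which is the crossover visible in Figure 7.
      println(f"k=$k%2d  mv=${matrixVector(n, d, b, k)}%.2e  " +
              f"blas=${blasMatrixMatrix(n, d, b, k)}%.2e  fft=${fft(n, d, b, k)}%.2e")
    }
  }
}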
The matrix-matrix multiplication scheme has and optimizing the first operator in the pipeline with statistics a cost of O(dbk2(n − k + 1)2). The FFT based scheme takes derived from a sample of the input data. The optimized O(6dbn2 log n + 4dbn2), and does not depend on k. operator is then executed on the sample, and subsequent To illustrate the tradeoffs between these methods, in Fig- operators are optimized. This procedure continues until all ure 7, we vary the size of the convolution filter, k, and use nodes have been optimized. Along the way, we form a pipeline representative input images and batch sizes. For small values profile, which includes not just the information needed to form of k, we see that BLAS is fastest operator. However, as k As at each step, but also information about operator execution grows, the algorithm’s dependence on k2 makes this approach time and memory consumption of each operator’s execution inappropriate. If the filters are separable, it is faster to use the on the sample. We use the pipeline profile to inform the Automatic Materialization optimization described below. We 1 Algorithm GreedyOptimizer also evaluate the overheads from profiling in Section V-C. input : G, t, size, memSize output: cache B. Common Sub-expression Elimination 2 cache ← ∅; 3 memLeft ← memSize; One of the whole-pipeline rewrites done by KeystoneML is 4 next ← pickNext (G, cache, size, memLeft, t); a form of common sub-expression elimination. It is common 5 while nextNode 6= ∅ do for training data or the output of featurization stages to be used 6 cache ← cache ∪ next; 7 ← in several pipeline stages. KeystoneML identifies and merges memLeft memLeft - size(next); 8 next ← pickNext (G, cache, size, memLeft, t); such common sub-expressions to enable computation reuse. 9 end 10 return cache; C. Automatic Materialization 1 Procedure pickNext() Cache management and automatic selection of materialized input : G, cache, size, memLeft, t views are important optimizations used by database manage- output: next 2 minTime ← ∞; ment systems [31] and they have been studied in the context 3 next ← ∅; of analytical query systems [32], and feature selection [33]. 4 for v ∈ nodes(G) do For ML workloads, materialization of intermediate data is 5 runtime ← estRuntime (G, cache ∪ v, t); very important for performance because the iterative nature of 6 if runtime < minTime & size(v) < memLeft then these workloads means that recomputation costs are multiplied 7 next ← v; 8 minTime ← runtime; across iterations. By capturing the iterative nature of the 9 end pipelines in the DAG, our optimizer is capable of identifying 10 end opportunities for reuse, eliminating redundant computation. 11 return next; We next describe a formulation for the materialization problem Algorithm 1: The caching algorithm in KeystoneML builds in iterative pipelines and propose an algorithm to automatically a cache set by finding the node that will maximize time saved select a good set of intermediate objects to materialize in order subject to memory constraints. estRuntime is a procedure to speed up ML pipeline execution. that computes T (v) for a given DAG, cache set, and node. Given the depth-first execution model and the deterministic and side-effect free nature of KeystoneML operators, a natural strategy is materialization of operator outputs that are visited where κv ∈ {0, 1} is a binary indicator variable signifying multiple times during the execution. 
This optimization works whether a node is cached or not, and χ(v) represents the direct well in the absence of memory constraints. predecessors of v in the DAG. In many applications we have built with KeystoneML, Where C(v) is defined as follows: intermediate output can grow to multiple terabytes in size, even  P w C(p)κp , |π(v)| > 0 for modestly sized inputs. On current hardware, this output is  p C(v) = p∈π(v) too big to fit in memory, even with hundreds of GB of memory 1, otherwise per machine. Commonly used caching policies such as LRU can result in suboptimal run times because the decision to where π(v) represents the direct successors of v in the DAG. cache a large object (e.g. intermediate features) may evict a Because of the DAG structure of the pipeline graph, we are smaller object that is needed later in the pipeline and may be guaranteed to not have any cycles in this graph, thus both T (v) expensive to recompute (e.g. image features). and C(v) are well-defined. Our goal is an algorithm that automatically selects the items We can state the problem of minimizing pipeline execution to materialize in the presence of memory constraints, given time formally as an optimization problem as follows: that we know how often the objects will be accessed, that we min T (sink(G)) can estimate their size, and that we can estimate the runtime κ associated with materializing them. X We formulate the problem as follows: Given a memory s.t. size(v)κv ≤ memSize budget, we want to find the set of outputs to include in the v∈V cache set that minimizes total execution time. Where sink(G) is the pipeline terminus, size(v) the size Let v be our node of interest in a pipeline G, t(v) is the of v’s output, and memSize the memory constraint. time taken to do the computation that is local to node v per This problem can also be thought of as problem of finding iteration, C(v) is the number of times a node will by called by an optimal cache schedule. It is tempting to reach for classical its direct successors during execution, and wv is the number of results [34], [35] in the optimal paging literature to identify an times a node iterates over its inputs. T (n), the total execution optimal or near-optimal schedule for this problem. However, time of the pipeline up to and including node v is: neither of these results matches our problem setting fully. P In particular, Belady’s algorithm is only optimal when each wv(t(v) + T (c)) c∈χ(v) item has a fixed cost to bring into cache (as is common T (v) = C(v)κv in reads from a two-level memory hierarchy), while in our problem these costs are variable and depend heavily on the A. End-to-End ML Applications computation time to materialize them. e.g. recomputing may be two orders of magnitude faster than reading from disk but To demonstrate the flexibility and generality of the Key- an order of magnitude slower than reading from memory, and stoneML API, we implemented end-to-end machine learning each operator has a different computational profile. Second, pipelines in several domains including text classification, im- algorithms for the weighted paging problem don’t take into age classification and speech recognition. We next describe account weights that are dependent on the current cache state. these pipelines and compare statistical accuracy and perfor- However, it is possible to rewrite the optimization problem mance results obtained using KeystoneML to previously pub- above as a mixed-integer linear program (ILP), but in our lished results. 
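To make the greedy materialization procedure concrete, here is a self-contained sketch with the same shape as Algorithm 1: repeatedly cache the node that most reduces an estimated runtime, as long as it fits in the remaining memory budget. The runtime estimate is deliberately simplified (recomputation count times per-iteration cost) rather than the exact T(v)/C(v) formulation, and all node names and numbers are invented for illustration.

object GreedyCachingSketch {
  // t = per-iteration local compute time, size = output size, weight = passes over inputs.
  case class Node(id: String, t: Double, size: Double, weight: Int, preds: Seq[String])

  // Predecessors-first traversal order from the sink.
  def topoOrder(dag: Map[String, Node], sink: String): Seq[String] = {
    val visited = scala.collection.mutable.LinkedHashSet[String]()
    def visit(id: String): Unit = if (!visited.contains(id)) {
      dag(id).preds.foreach(visit)
      visited += id
    }
    visit(sink)
    visited.toSeq
  }

  // Simplified runtime model: a cached node is computed once, an uncached node once per
  // request, and each computation of a node requests every input `weight` times.
  def runtime(dag: Map[String, Node], sink: String, cache: Set[String]): Double = {
    val requests = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
    requests(sink) = 1.0
    val order = topoOrder(dag, sink)
    for (id <- order.reverse) {          // successors are settled before their predecessors
      val node = dag(id)
      val computes = if (cache(id)) math.min(1.0, requests(id)) else requests(id)
      node.preds.foreach(p => requests(p) += node.weight * computes)
    }
    order.map { id =>
      val c = if (cache(id)) math.min(1.0, requests(id)) else requests(id)
      c * dag(id).t
    }.sum
  }

  // Greedy loop with the structure of Algorithm 1 (pickNext inlined).
  def chooseCache(dag: Map[String, Node], sink: String, memBudget: Double): Set[String] = {
    var cache = Set[String]()
    var memLeft = memBudget
    var done = false
    while (!done) {
      val current = runtime(dag, sink, cache)
      val scored = dag.values
        .filter(n => !cache(n.id) && n.size <= memLeft)
        .map(n => (n, runtime(dag, sink, cache + n.id)))
        .filter(_._2 < current)
      if (scored.isEmpty) done = true
      else {
        val best = scored.minBy(_._2)._1
        cache += best.id
        memLeft -= best.size
      }
    }
    cache
  }

  def main(args: Array[String]): Unit = {
    // Toy pipeline: featurization is expensive and its output is reused by an iterative
    // solver that makes 20 passes, so it is the most valuable node to materialize.
    val dag = Map(
      "data"     -> Node("data", t = 5.0, size = 10.0, weight = 1, preds = Seq()),
      "features" -> Node("features", t = 50.0, size = 40.0, weight = 1, preds = Seq("data")),
      "solver"   -> Node("solver", t = 2.0, size = 1.0, weight = 20, preds = Seq("features")))
    println(chooseCache(dag, sink = "solver", memBudget = 45.0))  // Set(features)
  }
}

Under a tighter budget the same loop degrades to caching only the small, frequently reused outputs, which is the behavior compared against rule-based and LRU policies in Section V-D.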
We took every effort to recreate these pipelines experiments the cost of solving these problems was prohibitive as they were described by their authors, and made sure that for practical use at optimization time. Instead, we implement our pipelines achieved comparable or better statistical results the greedy Algorithm 1. Given an unoptimized pipeline DAG, than those reported by each benchmark’s respective authors. the algorithm chooses to cache the node which will lead to the The operators used to implement these applications are largest savings in terms of execution time but whose output outlined in Table V, and the datasets used to train them are fits in available memory. This process proceeds iteratively described in Table IV. In each case, the datasets significantly until either no benefit to additional caching is possible or all increase in size as part of the featurization process, so at model available memory has been used. fitting time the size is substantially larger than the raw data, as shown in the last two columns of the table. The Solve V. EVALUATION Size is the size of the dataset that is input to a Linear Solver. To evaluate the effectiveness of KeystoneML, we explore This may be too large for available cluster memory, as is the its ability to efficiently support large scale ML applications case for TIMIT. Accuracy results on each dataset achieved in three domains. We also compare KeystoneML with other with KeystoneML as well as those achieved with the original systems for large scale ML and show how our high-level oper- authors code or (where code was unavailable) as reported in ators and optimizations can improve performance. Following their respective works, are reported in Table VI. that we break down the end-to-end benefits of the previously Text Analytics: KeystoneML makes it simple for devel- discussed optimizations. Finally, we assess the system’s ability opers to scale their text pipelines to large datasets. Combined to scale and show that KeystoneML scales well by enabling with libraries like CoreNLP [37], KeystoneML allows for scal- the development of scalable, composable components. able implementations of many text classification pipelines such Implementation: We implement KeystoneML on top of as the one shown in Figure 2. We evaluated a text classification Apache Spark, a cluster computing engine that has been shown pipeline based on [1] on the Amazon Reviews dataset of 65m to have good scalability and performance for many iterative product reviews [38] with 100k sparse features. We find that ML algorithms [7]. In KeystoneML we added an additional KeystoneML matches the statistical performance of a Vowpal cache-management layer that is aware of the multiple Spark Wabbit [6] pipeline when run on identical resources with the jobs that comprise a pipeline, and implemented ML operators same solver, finishing in 440s. in the KeystoneML Standard Library that are absent from Kernel SVM for Speech Recognition: Kernel SVMs can be Spark MLlib. Porting KeystoneML to work with other dis- used in many classification scenarios as they can approximate tributed computing systems would require that the presence any function. Often their performance has been shown to be of a distributed collections API that supports the MapReduce much better than simpler generalized linear models [39]. Ker- paradigm as well as predictable communication via broadcast, nel evaluations can be efficiently approximated using random all-to-one reduce operations, and aggregation trees. 
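As an illustration of the communication primitives listed above, the following standalone Spark snippet (not taken from the KeystoneML code base) computes a least-squares gradient by broadcasting the current model and combining per-partition contributions with treeAggregate, the aggregation-tree pattern that distributed solvers use to avoid funneling all partial results through one node.

import org.apache.spark.{SparkConf, SparkContext}

object TreeAggregateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TreeAggregateSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Toy dense dataset: (features, label) pairs with d = 3.
    val data = sc.parallelize(Seq(
      (Array(1.0, 0.0, 2.0), 1.0),
      (Array(0.0, 1.0, 1.0), 0.0),
      (Array(2.0, 1.0, 0.0), 1.0)), numSlices = 2)

    // The current model is shipped to workers once per job via a broadcast variable.
    val model = sc.broadcast(Array(0.1, -0.2, 0.05))

    // Sum of per-example least-squares gradients, combined through an aggregation tree
    // (all-to-one reduce with logarithmic depth) instead of a flat reduce at the driver.
    val grad = data.treeAggregate(new Array[Double](3))(
      seqOp = (g, point) => {
        val (x, y) = point
        val residual = x.zip(model.value).map { case (a, w) => a * w }.sum - y
        for (i <- g.indices) g(i) += 2.0 * residual * x(i)
        g
      },
      combOp = (g1, g2) => g1.zip(g2).map { case (a, b) => a + b },
      depth = 2)

    println(grad.mkString(", "))
    sc.stop()
  }
}

KeystoneML's standard library solvers are built from primitives of this kind, which is why porting the system to another engine chiefly requires that the engine expose them.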
These feature transformations [40], [41] and pipelines are a natural primitives also exist in systems like Apache Tez, DryadLINQ, way to specify such transformations. Statistical operators like FlumeJava, Apache Flink and we plan to study the complexity FFTs and cosine transformations and APIs to merge features of porting KeystoneML to these runtimes in the future. help us succinctly describe the pipeline in KeystoneML. We Experiments are run on Amazon EC2 r3.4xlarge in- evaluated a kernel SVM solver on the TIMIT dataset with stances. Each machine has 8 physical cores, 122 GB of 528k features. Using KeystoneML this pipeline runs in 138 memory, and a 320 GB SSD, and was running Apache Spark minutes on 64 machines. By contrast, a 256 node IBM Blue 1.3.1, Scala 2.10, and HDFS from the CDH4 distribution of Gene machine with 16 cores per machine takes around 120 Hadoop. We have also run KeystoneML on Apache Spark minutes [41]. In this case, while KeystoneML may be 11% 1 1.5, 1.6 and not encountered any performance regressions. slower, it is using only 8 the number of cores to solve this We use OpenBLAS for numerical operations. We compare computationally demanding problem. KeystoneML with the distributed version of Vowpal Wabbit Image Classification: Image classification systems are (VW) [6], [36] (v8.0) and SystemML [11] (v0.9) running useful in many settings. As images carry local information on the same Spark version. If not otherwise specified, we (i.e. information specific to where in the image a feature run on a 16-node cluster. Comparisons among KeystoneML, appears), locality sensitive techniques, e.g. convolutions or SystemML, VW, and TensorFlow were performed on identical spatially-pooled fisher vectors [3], can be used to generate EC2 instance types. training features. KeystoneML makes it easy to use modular, Dataset Train Size Num Train Test Size Num Test Classes Type Solve Features Solve Size (GB) (GB) (GB) Amazon 13.97 65000000 3.88 18091702 2 text 100000 (0.1% sparse) 89.1 TIMIT 7.5 2251569 0.39 115934 147 440-dim vector 528000 (dense) 8857 ImageNet 74 1281167 3.3 50000 1000 10k pixels image 262144 (dense) 2502 VOC 0.428 5000 0.420 5000 20 260k pixels image 40960 (dense) 1.52 CIFAR-10 0.500 500000 0.001 10000 10 1024 pixels image 135168 (dense) 62.9 Youtube8m 22.07 5786881 6.3 1652167 4800 1024-dim vector 1024 (dense) 44.15

TABLE IV: Dataset Characteristics. While raw input sizes may be modest, intermediate state may grow by orders of magnitude before being input to a solver.

Task Type Operators Used Dataset KeystoneML Reported Amazon Text LowerCase, Tokenize Accuracy Time (m) Accuracy Time (m) Reviews NGrams, TermFrequency Amazon [1] 91.6% 3.3 - - Classification LogisticRegression TIMIT [2] 66.06% 138 66.33% 120 2 TIMIT Speech RandomFeatures, Pipeline.gather ImageNet [3] 67.43% 270 66.58% 5760 Kernel SVM LinearSolver VOC 2007 [18] 57.2% 7 59.2% 87 CIFAR-10 [45] 84.0% 28.7 84.0% 50.0 ImageNet Image GrayScale, SIFT, LCS, PCA, GMM Classification FisherVector, LinearSolver TABLE VI: Time to Accuracy with KeystoneML obtained on VOC Image GrayScale, SIFT, PCA, GMM Classification FisherVector, LinearSolver ML pipelines described in the relevant publication.

CIFAR-10 Image Windower, PatchExtractor Amazon Binary TIMIT Classification ZCAWhitener, Convolver, LinearSolver SymmetricRectifier, Pooler 1000 300 TABLE V: Operators used in constructing pipelines for 100 30 datasets in Table IV. 10 Time (s) efficient implementations of image processing operators like SIFT [42] and Fisher Vectors [3], [18]. Many of the same 1024 2048 4096 8192 16384 1024 2048 4096 8192 16384 operators we consider here are necessary components of Features “deep-learning” pipelines [43] which typically train neural networks via stochastic gradient descent and back-propagation. System KeystoneML Vowpal Wabbit SystemML Using the VOC dataset, we implement the pipeline de- Fig. 8: KeystoneML’s optimizing linear solver outperforms scribed in [18]. This pipeline executes end-to-end on 32 both a specialized and optimizing ML system for two problems nodes using KeystoneML in just 7 minutes. Using the authors across feature sizes. Times are on log scale. original source code the same workload takes 1 hour and 27 minutes to execute on a single 16-core machine with 256 GB system, which optimizes the implementation of linear algebra of RAM–KeystoneML achieves a 12.4× speedup with 16× operators used in specific algorithms (e.g., Conjugate Gradient the cores. We evaluated a Fisher Vector based pipeline on Method), but does not choose among logically equivalent ImageNet with 256k features. The KeystoneML pipeline runs algorithms. We compare solver performance across different in 4.5 hours on 100 machines. The original pipeline takes four feature sizes for two binary classification problems: Amazon days [44] to run using a highly specialized codebase on a 16- and a binary version of TIMIT. The systems were run with core machine, a 21× speedup on 50× the cores. identical inputs and objective functions, and we report end-to- In summary, using KeystoneML we achieve one to two end solve time. For this comparison, we solve binary problems orders of magnitude improvement in end-to-end throughput because SystemML does not include a multiclass linear solver. versus a single node, and equivalent or better performance The results are shown in Figure 8. The optimized solver in over cluster systems running similar workloads. These im- KeystoneML outperforms both Vowpal Wabbit and SystemML provements mean much quicker ML application development because it selects an appropriate algorithm to solve the logical which leads to higher developer productivity. Next we compare problem, as opposed to relying on a one-size fits all operator. KeystoneML to other large scale learning systems. At 1024 features for the Binary TIMIT problem, KeystoneML chooses to run an exact solve, while from 2048 to 32768 B. KeystoneML vs. Other Systems features it chooses a Dense L-BFGS implementation. At 65536 We compare runtimes for the KeystoneML solver with both features (not pictured), KeystoneML finishes in 17 minutes, a specialized system, Vowpal Wabbit [6], built to estimate while SystemML takes 1 hour and 40 minutes to converge to linear models, and SystemML [11], a general purpose ML worse training loss over 10 iterations, a speedup of 5.5×. The reasons for these performance differences are twofold: 2We report accuracy on 64k features for ImageNet, while time is reported on 256k features due to lack of consistent reporting by the original authors. first, since KeystoneML raises the level of abstraction to the The workloads are otherwise similar. 
logical level, the system can automatically select, for example, Machines 1 2 4 8 16 32 None Amazon TensorFlow (strong) 184 90 57 67 122 292 Pipe Only TensorFlow (weak) 184 135 135 114 xxx xxx KeystoneML KeystoneML 235 125 69 43 32 29 None Timit TABLE VII: Time, in minutes, to 84% accuracy on the CIFAR- Pipe Only KeystoneML 10 dataset with KeystoneML and TensorFlow configured for None Optimization Level both strong and weak scaling. In large weak scaling regimes VOC TensorFlow failed to converge to a good model. Pipe Only KeystoneML

0 2000 4000 6000 a sparse solver for sparse data or an exact algorithm when Duration (s) the number of features is low, or a block solver when the features are high. In the middle, particularly for KeystoneML Stage Optimize Featurize Solve Eval vs. SystemML on the Binary TIMIT dataset, the algorithms are similar in terms of complexity and access patterns. In this case Fig. 9: Impact of optimization levels on three applications, KeystoneML is faster because feature extraction is pipelined broken down by stage. with the solver, while SystemML requires a conversion process for data to be fed into a format suitable for the solver. If we parameters after a small number of examples are seen. In this only consider the solve step of the pipeline, KeystoneML is case, the coordination requirements surpass the savings from roughly 1.5× faster than SystemML for this problem. parallelism at a small number of nodes. While TensorFlow has better scalability on some model architectures [48], it TensorFlow is an open source ML system developed by is not scalable for other architectures. By contrast, by using Google [13]. Developed concurrently to KeystoneML, Tensor- a communication-avoiding solver we are able to scale out Flow also represents pipelines as graph of dataflow operators. KeystoneML’s performance on this task significantly further. However, the design goals of the two systems are fundamen- Finally, a recent benchmark dataset from YouTube [49] tally different. KeystoneML is designed to support horizontally describes learning pipelines involving featurization with a scalable workloads to offer good scale out performance for neural network [48] followed by a logistic regression model conventional machine learning applications consisting of fea- or SVM. The authors claim that “models train to convergence turization and model estimation, while TensorFlow is designed in less than a day on a single machine using the publicly- to support neural network models trained via mini-batch SGD available TensorFlow framework.” We performed a best-effort with back-propagation. We compare against TensorFlow v0.8 replication of this pipeline using KeystoneML. We are un- and adapt a multi-GPU example [45] to a distributed setting able to replicate the author’s claimed accuracy–our pipeline in a procedure similar to [46]. achieves 21% mAP while they report 28% mAP. KeystoneML To illustrate the differences, we compare the systems’ time trains a linear classifier on this dataset in 3 minutes, and a to achieve a set accuracy on the CIFAR-10 dataset. While converged logistic regression model with worse accuracy in 90 the learning tasks are identical (i.e., make good predictions minutes (31 batch gradient evaluations) on a 32-node cluster. on a test dataset, given a training dataset), the workloads are The ability to choose an appropriate solver and readily scale not identical. Specifically, TensorFlow implements a model out are the key enablers of KeystoneML’s performance. similar to the one presented in [43], while in KeystoneML we We now study the impact of KeystoneML’s optimizations. implement a version of the model similar to [47]. TensorFlow was run with default parameters and we experimented with C. Optimization Levels strong scaling (fixed 128 image batch size) and weak scaling The end-to-end results reported earlier in this section are (batch size of 128 × Machines). 
achieved by taking advantage of the complete set of optimiza- For this workload, TensorFlow achieves its best performance tions available in KeystoneML. To understand how important on 4-node cluster with 32 total CPU cores, running in 57 the per-operator and whole-pipeline optimizations described minutes. Meanwhile, KeystoneML surpasses its performance in Sections III and IV are we compare three different levels at 8 nodes and continues to improve in total runtime out to of optimization: a default unoptimized configuration (None), 32 nodes, achieving a minimum runtime of 29 minutes, or a a configuration where only whole-pipeline optimizations are 1.97× speedup. These results are summarized in Table VII. used (Pipe Only) and a configuration with operator-level We ran TensorFlow on CPUs for the sake of comparability. and whole-pipeline optimizations (KeystoneML). Prior benchmarks [45] have shown that the speed of a single Results comparing these levels, with a breakdown of stage- multi-core CPU is comparable to a single GPU; thus the same level timings on the VOC, Amazon and TIMIT pipelines pipeline finishes in 50 minutes on a 4 GPU machine. are shown in Figure 9. For the Amazon pipeline the whole- TensorFlow’s lack of scalability on this task is fundamental pipeline optimizations improve performance by 7×, but the to the chosen model and the algorithm being used to fit it. Min- operator optimizations do not help further, because the Ama- imizing a non-convex via minibatch Stochastic zon pipeline uses CoreNLP featurizers which do not have Gradient Descent (SGD) requires coordination of the model statistical optimizations associated with them, and the default We compare this strategy with two alternatives–the first is a simple rule-based approach which only caches the results of Estimators. This is a sensible rule to follow, as the result of an Estimator (a Transformer or model) is computationally expen- sive to acquire and typically holds a small memory footprint. However, this is not sufficient for most practical pipelines because if a pipeline contains more than one Estimator, often the input to the first Estimator will be used downstream, thus presenting an opportunity for reuse. The second approach is the standard Least Recently Used (LRU) policy. From Figure 10 we notice several important trends. First, the KeystoneML strategy is nearly always better than either of the other strategies. In the unconstrained case, the algorithm is going to remember all reused items as late in their journey through the pipeline as possible. In the constrained case, it will do as least as well as remembering the (small) estimators which are by definition reused later in the pipeline. Addition- ally, the strategy degrades effectively, mixing between the best performance of the limited-memory rule-based strategy and the Fig. 10: The KeystoneML caching strategy outperforms a rule- LRU based “cache everything” strategy which works well in based and LRU caching strategy at many levels of memory unconstrained settings. As we increased the memory available constraints and responds well to memory pressure. to caching per-node, the LRU strategy performed worse for the Amazon pipeline. Upon further investigation, this is because L-BFGS solver turns out to be optimal. The performance gains LRU does not take into account the cost of materializing come from caching intermediate features just before the L- an object and so computationally expensive objects may be BFGS solve. 
For the TIMIT pipeline, run with 16k features, evicted by larger objects at larger cache sizes. we see that the end-to-end optimizations only give a 1.3× To give a concrete example of the optimizer in ac- speedup but that selecting the appropriate solver results in a tion, consider the VOC pipeline shown in Figure 5 in 8× speedup over the baseline. Finally in the VOC pipeline Section II. When memory is not constrained, the outputs the whole pipeline optimization gives around 3× speedup. from the SIFT, ReduceDimensions, Normalize and Operator-level optimization chooses good PCA, GMM and TrainingLabels are cached. When memory is restricted, solver operators resulting in a 12× improvement over the only the output from Normalize and TrainingLabels baseline, or 15× if we amortize the optimization costs across are cached. These results show the importance of both per- many runs of a similar pipeline. Optimization overheads are operator and whole-pipeline optimizations. insignificant except for the VOC pipeline. This dataset has relatively few examples, so the sampling strategy takes more E. Scalability time relative to the other datasets. As discussed in previous sections, KeystoneML’s API de- sign encourages the construction of scalable operators. How- D. Automatic Materialization Strategies ever, some estimators like linear solvers need coordination [25] among workers to compute correct results. In Figure 11 we As discussed in Section IV, one key optimization enabled demonstrate the scaling properties from 8 to 128 nodes of by KeystoneML’s ability to capture the complete application the text, image, and Kernel SVM pipelines on the Amazon, DAG to dynamically determine where to materialize reused ImageNet (with 16k features) and TIMIT datasets (with 65k intermediate objects, particularly in the presence of memory features) respectively. The ImageNet pipeline exhibits near- constraints. In Figure 10 we demonstrate the effectiveness of perfect horizontal scalability up to 128 nodes, while the the greedy caching algorithm proposed in Section IV. Since Amazon and TIMIT pipeline scale well up to 64 nodes. the algorithm needs local profiles of each node’s performance, To understand why the Amazon and TIMIT pipeline do not we measured each node’s running time on two samples of 512 scale linearly to 128 nodes, we further analyze the breakdown and 1024 examples. We then extrapolate the node’s memory of time take by each stage. We see that each pipeline is usage and runtime to full scale. We found that memory dominated by a different part of its computation. The TIMIT estimates from this process are highly accurate and runtime pipeline is dominated by its solve stage, while featurization estimates were within 15% of actual runtimes. If estimates dominates the Amazon and ImageNet pipelines. Scaling linear are inaccurate, we fall back to an LRU replacement policy for solvers is known to require coordination [25], which leads di- the cache set determined by this procedure. This process is rectly to sub-linear scalability of the whole pipeline. Similarly, imperfect, but is adequate at identifying relative running times in the Amazon pipeline, one of the featurization steps uses an and is sufficient for our purpose of resource management. aggregation tree which does not scale linearly. Amazon TIMIT ImageNet machine learning. We present a type safe API and optimization 500 framework for such a system. 
The version we present in this paper differs in its use of type-safe operations, support for complex data flows, internal DAG representation and optimizations discussed in Sections III and IV.

Time (minutes) Query Optimization, Modular Design, Caching: There 0 0 0 8 16 32 64 128 8 16 32 64 128 8 16 32 64 128 are several similarities between the optimizations made by Cluster Size (# of nodes) KeystoneML and traditional relational query optimizers. Even the earliest relational query optimizers used multiple physical Loading Train Data Featurization Model Solve Stage implementations of equivalent logical operators, and like many Loading Test Data Model Eval relational optimizers, the KeystoneML optimizer is cost-based. Fig. 11: Time breakdown of workloads by stage. The red line However, KeystoneML supports a much richer set of data indicates ideal strong scaling performance over 8 nodes. types than a traditional relational query system, and our operators lack some relational algebra semantics, such as VI.RELATED WORK commutativity, limiting the system’s ability to perform certain optimizations. Further, KeystoneML switches among operators ML Frameworks: ML researchers have traditionally used that provide exact answers vs approximate ones to save time MATLAB or R packages to develop ML routines. The im- due to the workload setting. Data characteristics such as portance of feature engineering has led to tools like scikit- sparsity are not traditionally considered by optimizers. learn [12] and KNIME [50] adding support for featurization The caching strategy employed by KeystoneML can be for small datasets. Further, existing libraries for large scale viewed as a form of view selection for materialized view ML [51] like Vowpal Wabbit [6], GraphLab [52], MLlib [7], maintenance over queries with expensive user-defined func- RIOT [53], DimmWitted [14] focus on efficient implemen- tions [31], we focus on materialization for intra-query opti- tations of learning algorithms like regression, classification mization, as opposed to inter-query optimization [57], [32], and linear algebra routines. In KeystoneML, we focus on [58], [59]. While much of the related work focuses on the pipelines that include featurization and show how to optimize challenging problem of view maintenance in the presence of performance with end-to-end information. Work in Parameter updates, KeystoneML the immutable properties of this state. Servers [54] has studied how to share model updates. In KeystoneML we implement a high-level API for linear solvers VII.FUTURE WORKAND CONCLUSION and can leverage parameter servers in our architecture. KeystoneML represents a significant first step towards Closely related to KeystoneML is SystemML [11] which easy-to-use, robust, and efficient end-to-end ML at massive also uses an optimization based approach to determine the scale. We plan to investigate pipeline optimizations like node physical execution strategy of ML algorithms. However, Sys- reordering to reduce data transfers and also look at how temML places less emphasis on support for UDFs and fea- hyperparameter tuning [60] can be integrated into the system. turization, while instead focusing on linear algebra operators The existing KeystoneML operator APIs are synchronous and which have well specified semantics. To handle featurization our existing pipelines are acyclic. In the future we plan we develop an extensible API in KeystoneML which allows for to study how algorithms like asynchronous SGD [54] or cost profiling of arbitrary nodes and uses these cost estimates back-propagation can be integrated with the robustness and to make node-level and whole-pipeline optimizations. Other scalability that KeystoneML provides. 
Query Optimization, Modular Design, Caching: There are several similarities between the optimizations made by KeystoneML and traditional relational query optimizers. Even the earliest relational query optimizers used multiple physical implementations of equivalent logical operators, and like many relational optimizers, the KeystoneML optimizer is cost-based. However, KeystoneML supports a much richer set of data types than a traditional relational query system, and our operators lack some relational algebra semantics, such as commutativity, which limits the system's ability to perform certain optimizations. Further, depending on the workload, KeystoneML switches among operators that provide exact answers and ones that provide approximate answers in order to save time. Finally, data characteristics such as sparsity are not traditionally considered by relational optimizers.

The caching strategy employed by KeystoneML can be viewed as a form of view selection for materialized view maintenance over queries with expensive user-defined functions [31]; however, we focus on materialization for intra-query optimization, as opposed to inter-query optimization [57], [32], [58], [59]. While much of the related work focuses on the challenging problem of view maintenance in the presence of updates, KeystoneML exploits the immutability of this intermediate state.
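A minimal sketch of this kind of materialization decision, under the assumption of immutable intermediate results, is shown below. The names (CacheCandidate, planCaching) and the greedy savings-per-byte heuristic are illustrative only, not the algorithm KeystoneML actually implements.

```scala
// Illustrative sketch (not KeystoneML's actual caching algorithm): because intermediate
// results are immutable, caching a node's output is a pure space-for-time tradeoff.
case class CacheCandidate(name: String, sizeBytes: Long, computeSecs: Double, downstreamReads: Int) {
  // Time saved if the output is materialized once instead of recomputed on every read.
  def savedSecs: Double = computeSecs * math.max(downstreamReads - 1, 0)
}

// Greedily pick outputs with the best savings-per-byte until the memory budget is exhausted.
def planCaching(candidates: Seq[CacheCandidate], memoryBudgetBytes: Long): Set[String] = {
  val ranked = candidates.sortBy(c => -c.savedSecs / math.max(c.sizeBytes, 1L).toDouble)
  val (_, chosen) = ranked.foldLeft((0L, Set.empty[String])) {
    case ((used, picked), c) if c.savedSecs > 0 && used + c.sizeBytes <= memoryBudgetBytes =>
      (used + c.sizeBytes, picked + c.name)
    case (acc, _) => acc
  }
  chosen
}
```

Since no intermediate result is ever updated in place, the planner never has to reason about staleness, only about recomputation cost versus memory.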
VII. FUTURE WORK AND CONCLUSION

KeystoneML represents a significant first step towards easy-to-use, robust, and efficient end-to-end ML at massive scale. We plan to investigate pipeline optimizations like node reordering to reduce data transfers, and to look at how hyperparameter tuning [60] can be integrated into the system. The existing KeystoneML operator APIs are synchronous and our existing pipelines are acyclic. In the future we plan to study how algorithms like asynchronous SGD [54] or back-propagation can be integrated with the robustness and scalability that KeystoneML provides.

We have presented the design of KeystoneML, a system that enables the development of end-to-end ML pipelines. By capturing the end-to-end application, KeystoneML can automatically optimize execution at both the operator and whole-pipeline levels, enabling solutions that automatically adapt to changes in data, hardware, and other environmental characteristics.

Acknowledgements: We would like to thank Xiangrui Meng and Joseph Bradley for their help in design discussions, and Henry Milner, Daniel Brucker, Gylfi Gudmundsson, Zongheng Yang, and Vaishaal Shankar for contributions to the KeystoneML source code. We would also like to thank Florent Perronnin, Jorge Sánchez, and Thomas Mensink for their help in developing an image classification pipeline, and Vikas Sindhwani for providing us with the TIMIT dataset. Peter Alvaro, Peter Bailis, Joseph Gonzalez, Eric Jonas, Sanjay Krishnan, Nick Lanham, Kay Ousterhout, Aurojit Panda, Ameet Talwalkar, Stephen Tu, and Eugene Wu provided feedback on earlier drafts. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, DOE Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue Goji, Bosch, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, HP, Huawei, Informatica, Intel, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware.
REFERENCES

[1] C. Manning and D. Klein, “Optimization, maxent models, and conditional estimation without magic,” in HLT-NAACL, Tutorial Vol. 5, 2003.
[2] P.-S. Huang, H. Avron, T. N. Sainath et al., “Kernel methods match deep neural networks on TIMIT,” in ICASSP. IEEE, 2014, pp. 205–209.
[3] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the Fisher vector: Theory and practice,” International Journal of Computer Vision, vol. 105, no. 3, pp. 222–245, 2013.
[4] D. Povey, A. Ghoshal, G. Boulianne, L. Burget et al., “The Kaldi speech recognition toolkit,” 2011.
[5] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. O’Reilly Media, Inc., 2008.
[6] J. Langford, L. Li, and A. Strehl, “Vowpal Wabbit online learning project,” 2007.
[7] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks et al., “MLlib: Machine Learning in Apache Spark,” CoRR, vol. abs/1505.06807, 2015.
[8] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, and M. Young, “Machine learning: The high interest credit card of technical debt,” in SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.
[9] A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.
[10] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang et al., “The MADlib analytics library: or MAD skills, the SQL,” PVLDB, vol. 5, no. 12, pp. 1700–1711, 2012.
[11] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald et al., “SystemML: Declarative machine learning on MapReduce,” in ICDE, 2011.
[12] F. Pedregosa, G. Varoquaux, A. Gramfort et al., “Scikit-learn: Machine learning in Python,” JMLR, vol. 12, pp. 2825–2830, 2011.
[13] M. Abadi, A. Agarwal, P. Barham, E. Brevdo et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[14] C. Zhang and C. Ré, “DimmWitted: A study of main-memory statistical analytics,” PVLDB, vol. 7, no. 12, pp. 1283–1294, 2014.
[15] B. Recht, C. Ré, S. Wright, and F. Niu, “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent,” in NIPS, 2011, pp. 693–701.
[16] A. Crotty, A. Galakatos, and T. Kraska, “Tupleware: Distributed machine learning on small clusters,” IEEE Data Eng. Bull., vol. 37, no. 3, 2014.
[17] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith et al., “MLbase: A Distributed Machine-learning System,” in CIDR, 2013.
[18] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: An evaluation of recent feature encoding methods,” in British Machine Vision Conference, 2011.
[19] E. Jonas, V. Shankar, M. Bobra, and B. Recht, “Flare prediction using photospheric and coronal image data,” AGU Fall Meeting, 2016.
[20] “ENCODE-DREAM in-vivo Transcription Factor Binding Site Prediction Challenge,” https://www.synapse.org/#!Synapse:syn6131484, 2016.
[21] X. Feng, A. Kumar, B. Recht, and C. Ré, “Towards a unified architecture for in-RDBMS analytics,” in SIGMOD, 2012.
[22] M. Stonebraker, J. Becla, D. J. DeWitt, K.-T. Lim, D. Maier, O. Ratzesberger, and S. B. Zdonik, “Requirements for science data bases and SciDB,” in CIDR, vol. 7, 2009, pp. 173–184.
[23] S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” CACM, 2009.
[24] G. M. Ballard, “Avoiding communication in dense linear algebra,” Ph.D. dissertation, University of California, Berkeley, 2013.
[25] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, “Communication-optimal parallel and sequential QR and LU factorizations,” SIAM Journal on Scientific Computing, vol. 34, no. 1, pp. A206–A239, 2012.
[26] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[27] W. Chen, Z. Wang, and J. Zhou, “Large-scale L-BFGS using MapReduce,” in NIPS, 2014, pp. 1332–1340.
[28] N. Halko, P. G. Martinsson, and J. A. Tropp, “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions,” SIAM Review, 2011.
[29] F. Abuzaid, S. Hadjis, C. Zhang, and C. Ré, “Caffe con Troll: Shallow Ideas to Speed Up Deep Learning,” CoRR, vol. abs/1504.04343, 2015.
[30] M. Mathieu, M. Henaff, and Y. LeCun, “Fast Training of Convolutional Networks through FFTs,” ICLR, 2014.
[31] R. Chirkova and J. Yang, “Materialized Views,” Foundations and Trends in Databases, 2012.
[32] D. C. Zilio, C. Zuzarte, G. M. Lohman, H. Pirahesh et al., “Recommending Materialized Views and Indexes with IBM DB2 Design Advisor,” in ICAC, 2004.
[33] C. Zhang, A. Kumar, and C. Ré, “Materialization optimizations for feature selection workloads,” in SIGMOD, 2014.
[34] L. A. Belady, “A study of replacement algorithms for a virtual-storage computer,” IBM Systems Journal, vol. 5, no. 2, pp. 78–101, 1966.
[35] P. Raghavan and M. Snir, “Memory versus randomization in on-line algorithms,” in ICALP. Springer, 1989, pp. 687–703.
[36] A. Agarwal, O. Chapelle, and J. Langford, “A reliable effective terascale linear learning system,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1111–1133, 2014.
[37] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel et al., “The Stanford CoreNLP natural language processing toolkit,” in ACL, 2014.
[38] J. McAuley, R. Pandey, and J. Leskovec, “Inferring networks of substitutable and complementary products,” in KDD, 2015, pp. 785–794.
[39] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., “A practical guide to support vector classification,” https://goo.gl/m68USr, 2003.
[40] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in NIPS, 2007, pp. 1177–1184.
[41] V. Sindhwani and H. Avron, “High-performance kernel machines with implicit distributed optimization and randomization,” CoRR, vol. abs/1409.0940, 2014.
[42] D. G. Lowe, “Object recognition from local scale-invariant features,” in ICCV, vol. 2. IEEE, 1999, pp. 1150–1157.
[43] A. Krizhevsky and G. Hinton, “Convolutional Deep Belief Networks on CIFAR-10,” unpublished manuscript, 2010.
[44] J. Sánchez, F. Perronnin, and T. Mensink, “Improved Fisher vector for large scale image classification,” http://image-net.org/challenges/LSVRC/2010/ILSVRC2010_XRCE.pdf.
[45] “TensorFlow CIFAR-10 Performance as reported in TensorFlow Source Code,” https://git.io/v2b4J.
[46] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz, “Revisiting distributed synchronous SGD,” arXiv preprint arXiv:1604.00981, 2016.
[47] A. Coates and A. Y. Ng, “Learning Feature Representations with K-Means,” in Neural Networks: Tricks of the Trade, 2012.
[48] C. Szegedy, V. Vanhoucke et al., “Rethinking the inception architecture for computer vision,” arXiv preprint arXiv:1512.00567, 2015.
[49] S. Abu-El-Haija, N. Kothari et al., “YouTube-8M: A Large-Scale Video Classification Benchmark,” arXiv preprint arXiv:1609.08675, 2016.
[50] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel et al., “KNIME: The Konstanz information miner,” in Data Analysis, Machine Learning and Applications. Springer, 2008, pp. 319–326.
[51] Z. Cai, Z. J. Gao, S. Luo, L. L. Perez et al., “A comparison of platforms for implementing and running very large scale machine learning algorithms,” in SIGMOD, 2014, pp. 1371–1382.
[52] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin et al., “Distributed GraphLab: A framework for machine learning and data mining in the cloud,” PVLDB, vol. 5, no. 8, pp. 716–727, 2012.
[53] Y. Zhang, H. Herodotou, and J. Yang, “RIOT: I/O-Efficient Numerical Computing without SQL,” in CIDR, 2009.
[54] M. Li, D. G. Andersen, J. W. Park, A. J. Smola et al., “Scaling Distributed Machine Learning with the Parameter Server,” in OSDI, 2014.
[55] C. Qin and F. Rusu, “Scalable I/O-bound Parallel Incremental Gradient Descent for Big Data Analytics in GLADE,” in DanaC, 2013.
[56] X. Meng, J. Bradley, E. Sparks, and S. Venkataraman, “ML Pipelines: A New High-Level API for MLlib,” https://goo.gl/pluhq0, 2015.
[57] S. Chaudhuri and V. R. Narasayya, “AutoAdmin ‘What-if’ Index Analysis Utility,” in SIGMOD, 1998.
[58] I. Elghandour and A. Aboulnaga, “ReStore: Reusing results of MapReduce jobs,” in PVLDB, 2012.
[59] L. Perez and C. Jermaine, “History-aware query optimization with materialized intermediate views,” in ICDE, 2014, pp. 520–531.
[60] E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin et al., “Automating model search for large scale machine learning,” in SoCC, 2015.