tf.data: A Machine Learning Data Processing Framework


Derek G. Murray (Lacework*), Jiří Šimša (Google), Ana Klimovic (ETH Zurich*), Ihor Indyk (Google)

*Work done while at Google.

ABSTRACT

Training machine learning models requires feeding input data for models to ingest. Input pipelines for machine learning jobs are often challenging to implement efficiently as they require reading large volumes of data, applying complex transformations, and transferring data to hardware accelerators while overlapping computation and communication to achieve optimal performance. We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs. The tf.data API provides operators that can be parameterized with user-defined computation, composed, and reused across different machine learning domains. These abstractions enable users to focus on the application logic of data processing, while tf.data's runtime ensures that pipelines run efficiently.

We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models. tf.data delivers the high performance required, while avoiding the need for manual tuning of performance knobs. We show that tf.data features, such as parallelism, caching, static optimizations, and optional non-deterministic execution are essential for high performance. Finally, we characterize machine learning input pipelines for millions of jobs that ran in Google's datacenter fleet, showing that input data processing is highly diverse and consumes a significant fraction of job resources. Our analysis motivates future research directions, such as sharing computation across jobs and pushing data projection to the storage layer.

Figure 1: CDF showing the fraction of compute time that millions of ML training jobs executed in our fleet over one month spend in the input pipeline. 20% of jobs spend more than a third of their compute time ingesting data.

PVLDB Reference Format: Derek G. Murray, Jiří Šimša, Ana Klimovic, and Ihor Indyk. tf.data: A Machine Learning Data Processing Framework. PVLDB, 14(12): 2945-2958, 2021. doi:10.14778/3476311.3476374

PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/tensorflow/tensorflow.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/).

1 INTRODUCTION

Data is the lifeblood of machine learning (ML). Training ML models requires steadily pumping examples for models to ingest and learn from. While prior work has focused on optimizing the accuracy and speed of model training and serving, how we store and preprocess data for machine learning jobs has received significantly less attention. Across the millions of ML jobs we run in Google's datacenters every month, we observe that the input data pipeline accounts for significant resource usage and can greatly impact end-to-end performance. Figure 1 shows how the fraction of compute time that jobs spend in the input pipeline varies, where we define compute time as the time spent on a hardware resource – such as a CPU or an accelerator core – scaled by the compute capability of that resource. The marked point shows that 20% of jobs spend more than a third of their compute time in the input pipeline. When taking into account the total compute time from all jobs in our analysis (§ 5), we find that 30% of the total compute time is spent ingesting data. A complementary study of ML model training with public datasets found that preprocessing data accounts for up to 65% of epoch time [44]. This shows that input data pipelines consume a significant fraction of ML job resources and are important to optimize.

Input pipelines of machine learning jobs are often challenging to implement efficiently as they typically need to ingest large volumes of data, apply complex transformations, overlap communication and computation, and shuffle and batch data with various data ordering guarantees. For example, some jobs require that each example is visited exactly once before any example is visited a second time during training. Moreover, to achieve good performance and avoid input pipeline stalls, the data preprocessing should leverage parallelism and pipelining to overlap preprocessing with model training computations. Determining the optimal degree of parallelism and amount of data to prefetch is often challenging as it depends on the nature of the workload and the hardware resources available.
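To make the overlap between preprocessing and training concrete, here is a minimal sketch, not taken from the paper, of how a user can add parallelism and prefetching to a tf.data pipeline. The preprocess function and batch size are placeholders, and the snippet assumes TensorFlow 2.x, where tf.data.AUTOTUNE is available (older releases expose it as tf.data.experimental.AUTOTUNE).

```python
import tensorflow as tf

# Placeholder preprocessing function standing in for an expensive,
# user-defined transformation (decoding, augmentation, tokenization, ...).
def preprocess(x):
    return tf.math.sqrt(tf.cast(x, tf.float32))

source = tf.data.Dataset.range(1_000_000)

# Sequential: each element is preprocessed one at a time, and the consumer
# (the training step) waits for the next batch to be produced.
sequential = source.map(preprocess).batch(256)

# Overlapped: map runs in parallel across CPU cores, and prefetch keeps a
# buffer of ready batches so preprocessing overlaps with model training.
# tf.data.AUTOTUNE asks the runtime to pick these values dynamically
# instead of hand-tuning them per workload and per machine.
overlapped = (source
              .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
              .batch(256)
              .prefetch(tf.data.AUTOTUNE))
```

The difference between the two variants is exactly the hand-tuning burden described above: the degree of parallelism and the prefetch buffer size either have to be picked per workload or delegated to the runtime via AUTOTUNE.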
Hardware accelerators used for ML training further increase the need for efficient input pipelines. Today's accelerators, such as GPUs and TPUs, are tailored towards executing the linear algebra operations that are common in ML computations, but have limited support for common data preprocessing operations. Hence, input data is commonly processed on the CPU, and feeding an accelerator with data at a sufficient rate to saturate its compute capabilities is becoming increasingly challenging. The high cost of accelerators compared to their CPU hosts makes it particularly important to ensure that accelerators operate at high utilization [6, 20].

We present tf.data, an API and a runtime for building and executing efficient input data pipelines for machine learning jobs. The tf.data API provides generic operators that can be parameterized by user-defined functions, composed, and reused across ML domains. Inspired by the programming models of relational databases [21, 24], declarative collection libraries [29, 42], and data-parallel big-data systems [66, 67], the tf.data API consists of stateless datasets, which are an abstraction for users to define their input pipeline, and stateful iterators, which produce a sequence of elements and maintain the current position within a dataset. These abstractions allow users to focus on the application logic of their input pipeline and leave the task of executing the pipeline efficiently to the tf.data runtime. In particular, tf.data internally represents an input pipeline dataset as a graph and applies static optimizations using graph rewrites. Furthermore, tf.data can automatically tune parameters such as the degree of parallelism and data prefetch buffer sizes, which are critical for performance yet often challenging for an average ML user to tune by hand.
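The dataset/iterator split described above can be illustrated in a few lines; this is a minimal sketch assuming TensorFlow 2.x eager execution, not code from the paper, and the optimization flag toggled through tf.data.Options is just one example of a graph-rewrite knob (the exact set of available flags varies across TensorFlow versions).

```python
import tensorflow as tf

# A dataset is a stateless description of the pipeline: building it does
# not read or transform any data yet.
dataset = (tf.data.Dataset.range(10)
           .map(lambda x: x * 2)
           .batch(4))

# An iterator is the stateful object that actually produces elements and
# remembers the current position within the dataset.
iterator = iter(dataset)
print(next(iterator).numpy())  # [0 2 4 6]
print(next(iterator).numpy())  # [ 8 10 12 14]

# Static optimizations (graph rewrites such as map-and-batch fusion) and
# autotuning are applied by the runtime; they can be steered via Options.
options = tf.data.Options()
options.experimental_optimization.map_and_batch_fusion = True
dataset = dataset.with_options(options)
```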
Our evaluation demonstrates that 1) input pipeline performance is critical to end-to-end training time of state-of-the-art ML benchmarks, 2) tf.data is capable of improving input pipeline latency through a combination of software pipelining, parallelization, and static optimizations, and 3) tf.data dynamic optimizations avoid the need to manually tune performance knobs. For example, we show that introducing parallelism and software pipelining to the input pipeline of a ResNet50 model training on the ImageNet dataset results in a 10.4× decrease in time to convergence. Applying further optimizations with tf.data, such as caching and static optimizations, improves training time by an additional 2×. We also demonstrate that tf.data's auto-tuning matches the performance of expert hand-tuned input pipelines.

The tf.data API and runtime are open source and integrated in TensorFlow [3]. At Google, tf.data has been in production use since 2017 for various ML training jobs, such as supervised learning, federated learning, and reinforcement learning; with different data […]

[…] which implies that preprocessing commonly decreases the volume of data. Most notably, we observe that identical input pipelines are frequently re-executed within and across jobs, suggesting that caching materialized datasets is a promising future direction to explore to improve the performance and efficiency of input data processing for ML. Our findings motivate several other directions for future research, such as processing data closer to storage and disaggregating input data processing from model training to avoid host resource bottlenecks.

Figure 2: CDF of input data size across ML training jobs. 13% of jobs read more than 1 TB of data. These jobs consume over 96% of total compute resources.

2 INPUT PIPELINE REQUIREMENTS

Raw input data, such as images, audio, and text files, undergo both offline and online preprocessing before being ingested for model training. Offline data preprocessing involves extracting features from raw data, validating data [12], and converting data to binary formats, such as Avro [8], Parquet [9], or TFRecord [58], to enable higher throughput data ingestion. Batch computing frameworks such as Apache Spark [67], Beam [1], and Flume [2] are well suited to offline preprocessing. While some data transformations, such as normalization, can be applied during offline preprocessing, ML training also requires applying transformations online as examples are fed to the model. For instance, image models commonly rely on data augmentation, e.g. randomly distorting images, to improve accuracy [17, 54]. Data augmentation multiplies the size of the original dataset, making it prohibitive to store outputs in intermediate files. Our work focuses on online data preprocessing, which executes as part of the input pipeline of ML training jobs.

The input pipeline of ML training can be characterized as a three-stage extract, transform, load (ETL) process. The first stage reads input data from a storage system. Machine learning jobs commonly train on large data sets. Figure 2 shows that 13% of jobs, out of the millions of jobs we analyzed, read at least 1 TB of input data.
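As a rough illustration of the three ETL stages in tf.data terms, the sketch below reads offline-produced TFRecord files, applies online augmentation, and prefetches batches. The file pattern, feature schema, and augmentation choices are hypothetical, not taken from the paper, and the snippet assumes TensorFlow 2.x.

```python
import tensorflow as tf

# Extract: read serialized examples from TFRecord files produced by an
# offline preprocessing job. The file pattern is a placeholder.
files = tf.data.Dataset.list_files("/data/train-*.tfrecord", shuffle=True)
records = files.interleave(tf.data.TFRecordDataset,
                           num_parallel_calls=tf.data.AUTOTUNE)

# Transform: parse the binary records and apply online augmentation that
# would be too expensive to materialize offline. The feature schema and
# augmentation shown here are illustrative.
def parse_and_augment(serialized):
    features = tf.io.parse_single_example(
        serialized,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.random_flip_left_right(image)   # random distortion
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, features["label"]

dataset = (records
           .shuffle(10_000)                          # approximate shuffling
           .map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(128))

# Load: prefetch overlaps producing the next batch with the current
# training step so the accelerator is not starved for input.
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```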
Recommended publications
  • Is 'Distributed' Worth It? Benchmarking Apache Spark with Mesos
Is 'Distributed' worth it? Benchmarking Apache Spark with Mesos. N. Satra (ns532), January 13, 2015. Abstract: A lot of research focus lately has been on building bigger distributed systems to handle 'Big Data' problems. This paper examines whether typical problems for web-scale companies really benefit from the parallelism offered by these systems, or can be handled by a single machine without synchronisation overheads. Repeated runs of a movie review sentiment analysis task in Apache Spark were carried out using Apache Mesos and Mesosphere, on Google Compute Engine clusters of various sizes. Only a marginal improvement in run-time was observed on a distributed system as opposed to a single node. 1 Introduction: Research in high performance computing and 'big data' has recently focussed on distributed computing. The reason is three-pronged: • The rapidly decreasing cost of commodity machines, compared to the slower decline in prices of high performance or supercomputers. • The availability of cloud environments like Amazon EC2 or Google Compute Engine that let users access large numbers of computers on-demand, without upfront investment. • The development of frameworks and tools that provide easy-to-use idioms for distributed computing and managing such large clusters of machines. Large corporations like Google or Yahoo are working on petabyte-scale problems which simply can't be handled by single computers [9]. However, smaller companies and research teams with much more manageable sizes of data have jumped on the bandwagon, using the tools built by the larger companies, without always analysing the performance tradeoffs. It has reached the stage where researchers are suggesting using the same MapReduce idiom hammer on all problems, whether they are nails or not [7].
  • Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation: Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article 6, 15 pages. As Published: http://dx.doi.org/10.1145/2670979.2670985 Publisher: Association for Computing Machinery (ACM) Version: Author's final manuscript Citable link: http://hdl.handle.net/1721.1/101090 Terms of Use: Creative Commons Attribution-Noncommercial-Share Alike Detailed Terms: http://creativecommons.org/licenses/by-nc-sa/4.0/ Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica (University of California, Berkeley; MIT; Databricks). Abstract: Tachyon is a distributed file system enabling reliable data sharing at memory speed across cluster computing frameworks. While caching today improves read workloads, writes are either network or disk bound, as replication is used for fault-tolerance. Tachyon eliminates this bottleneck by pushing lineage, a well-known technique, into the storage layer. Even replicating the data in memory can lead to a significant drop in the write performance, as both the latency and throughput of the network are typically much worse than that of local memory. Slow writes can significantly hurt the performance of job pipelines, where one job consumes the output of another. These pipelines are regularly produced by workflow managers such as Oozie [4] and Luigi [7], e.g., to perform data […]
  • Comparison of Spark Resource Managers and Distributed File Systems
2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom). Comparison of Spark Resource Managers and Distributed File Systems. Solmaz Salehian ([email protected]) and Yonghong Yan ([email protected]), Department of Computer Science & Engineering, Oakland University, Rochester, MI 48309-4401, United States. Abstract— The rapid growth in volume, velocity, and variety of data produced by applications of scientific computing, commercial workloads and cloud has led to Big Data. Traditional solutions of data storage, management and processing cannot meet demands of this distributed data, so new execution models, data models and software systems have been developed to address the challenges of storing data in heterogeneous form, e.g. HDFS, NoSQL database, and for processing data in parallel and distributed fashion, e.g. MapReduce, Hadoop and Spark frameworks. This work comparatively studies the Apache Spark distributed data processing framework. Our study first discusses the resource management subsystems of Spark, and then reviews several of the distributed data storage options available to Spark. Keywords—Big Data, Distributed File System, Resource […] Section 4 provides information on the different resource management subsystems of Spark. In Section 5, four Distributed File Systems (DFSs) are reviewed. Then, the conclusion is included in Section 6. II. CHALLENGES IN BIG DATA: Although Big Data provides users with a lot of opportunities, creating an efficient software framework for Big Data has many challenges in terms of networking, storage, management, analytics, and ethics [5, 6]. Previous studies [6-11] have been carried out to review open challenges in Big Data.
  • Alluxio: A Virtual Distributed File System by Haoyuan Li
Alluxio: A Virtual Distributed File System by Haoyuan Li. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Ion Stoica, Co-chair; Professor Scott Shenker, Co-chair; Professor John Chuang. Spring 2018. Alluxio: A Virtual Distributed File System, Copyright 2018 by Haoyuan Li. Abstract: The world is entering the data revolution era. Along with the latest advancements of the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and Internet of Things (IoT), the amount of data we are generating, collecting, storing, managing, and analyzing is growing exponentially. To store and process these data has exposed tremendous challenges and opportunities. Over the past two decades, we have seen significant innovation in the data stack. For example, in the computation layer, the ecosystem started from the MapReduce framework, and grew to many different general and specialized systems such as Apache Spark for general data processing, Apache Storm, Apache Samza for stream processing, Apache Mahout for machine learning, Tensorflow, Caffe for deep learning, Presto, Apache Drill for SQL workloads. There are more than a hundred popular frameworks for various workloads and the number is growing. Similarly, the storage layer of the ecosystem grew from the Apache Hadoop Distributed File System (HDFS) to a variety of choices as well, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases to realize different tradeoffs in cost, speed and semantics.
  • An Architecture for Fast and General Data Processing on Large Clusters
An Architecture for Fast and General Data Processing on Large Clusters. Matei Zaharia, Electrical Engineering and Computer Sciences, University of California at Berkeley. Technical Report No. UCB/EECS-2014-12, http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.html, February 3, 2014. Copyright © 2014, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. An Architecture for Fast and General Data Processing on Large Clusters by Matei Alexandru Zaharia. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY. Committee in charge: Professor Scott Shenker, Chair; Professor Ion Stoica; Professor Alexandre Bayen; Professor Joshua Bloom. Fall 2013. An Architecture for Fast and General Data Processing on Large Clusters, Copyright © 2013 by Matei Alexandru Zaharia. Abstract: The past few years have seen a major change in computing systems, as growing data volumes and stalling processor speeds require more and more applications to scale out to distributed systems.
  • Thesis: Enabling Autoscaling for In-Memory Storage in Cluster Computing Framework
THESIS: ENABLING AUTOSCALING FOR IN-MEMORY STORAGE IN CLUSTER COMPUTING FRAMEWORK. Submitted by Bibek Raj Shrestha, Department of Computer Science, in partial fulfillment of the requirements for the Degree of Master of Science, Colorado State University, Fort Collins, Colorado, Spring 2019. Master's Committee: Advisor: Sangmi Lee Pallickara; Shrideep Pallickara; Stephen C. Hayne. Copyright by Bibek Raj Shrestha 2019, All Rights Reserved. ABSTRACT: IoT enabled devices and observational instruments continuously generate voluminous data. A large portion of these datasets are delivered with the associated geospatial locations. The increased volumes of geospatial data, alongside the emerging geospatial services, pose computational challenges for large-scale geospatial analytics. We have designed and implemented STRETCH, an in-memory distributed geospatial storage that preserves spatial proximity and enables proactive autoscaling for frequently accessed data. STRETCH stores data with a delayed data dispersion scheme that incrementally adds data nodes to the storage system. We have devised an autoscaling feature that proactively repartitions data to alleviate computational hotspots before they occur. We compared the performance of STRETCH with Apache Ignite and the results show that STRETCH provides up to 3 times the throughput when the system encounters hotspots. STRETCH is built on Apache Spark and Ignite and interacts with them at runtime. ACKNOWLEDGEMENTS: I would like to thank my advisor, Dr. Sangmi Lee Pallickara, for her enthusiastic support and sincere guidance that have resulted in driving this research project to completion. I would also like to thank my thesis committee members, Dr. Shrideep Pallickara and Dr.
  • Performance Tuning and Evaluation of Iterative Algorithms in Spark
Performance Tuning and Evaluation of Iterative Algorithms in Spark. Janani Gururam, Department of Computer Science, University of Maryland, College Park, MD 20742, [email protected]. Abstract: Spark is a widely used distributed, open-source framework for machine learning, relational queries, graph analytics and stream processing. This paper discusses key features of Spark that differentiate it from other MapReduce frameworks, focusing specifically on how iterative algorithms are implemented and optimized. We then analyze the impact of Resilient Distributed Dataset (RDD) caching, Serialization, Disk I/O, and other parameter considerations to improve performance in Spark. Finally, we discuss benchmarking tools currently available to Spark users. While there are several general purpose tools available, we do note that only a few Spark-specific solutions exist, and do not fully bridge the gap between workload characterization and performance implications. 1 Introduction: Spark is an open-source, in-memory distributed framework for big data processing. The core engine of Spark underlays several modules for specialized analytical computation, including MLlib for machine learning applications, GraphX for graph analytics, SparkSQL for queries on structured, relational data, and Spark Streaming [25]. Spark provides several APIs for languages like Java, Scala, R, and Python. The base data structure in Spark is a Resilient Distributed Dataset (RDD), an immutable, distributed collection of records. There are two types of operations in Spark: transformations, which perform operations on an input RDD to produce one or more new RDDs, and actions, which launch the series of transformations to return the final result RDD. Operations in Spark are lazily evaluated.
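This excerpt describes Spark's transformation/action split, lazy evaluation, and RDD caching. The following PySpark fragment is a minimal sketch of those ideas; the application name and the toy computation are made up for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-lazy-eval-sketch")

# Transformations (map, filter) only build a lineage of RDDs; nothing is
# computed yet because Spark evaluates them lazily.
numbers = sc.parallelize(range(1_000_000))
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Caching marks the RDD to be kept in memory after it is first computed,
# which matters for iterative algorithms that reuse the same data.
evens.cache()

# Actions (count, take) trigger execution of the whole transformation chain.
print(evens.count())
print(evens.take(5))

sc.stop()
```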
  • From Cloud Computing to Sky Computing (Ion Stoica and Scott Shenker, UC Berkeley)
From Cloud Computing to Sky Computing. Ion Stoica and Scott Shenker, UC Berkeley. Abstract: We consider the future of cloud computing and ask how we might guide it towards a more coherent service we call sky computing. The barriers are more economic than technical, and we propose reciprocal peering as a key enabling step. ACM Reference Format: Ion Stoica and Scott Shenker. 2021. From Cloud Computing to Sky Computing. In Workshop on Hot Topics in Operating Systems (HotOS '21), May 31-June 2, 2021, Ann Arbor, MI, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3458336.3465302 1 Introduction: In 1961, John McCarthy outlined what he saw as the future of computing (we regretfully note the gendered language): "computation may someday be organized as a public utility, just as the telephone system is a public utility." […] and switching providers is relatively easy; you can even keep your number when switching. Returning to McCarthy's prediction, what concerns us here is not that there is no single public computing utility, but that cloud computing is not an undifferentiated commodity. In contrast to telephony, the cloud computing market has evolved away from commoditization, with cloud providers striving to differentiate themselves through proprietary services. In this paper we suggest steps we can take to overcome this differentiation and help create a more commoditized version of cloud computing, which we call the Sky computing. Before doing so, we briefly summarize the history of cloud computing to provide context for the rest of the paper. 2 Historical Context: The NSF high performance computing (HPC) initiative in […]
  • ©Copyright Haichen Shen
©Copyright 2019 Haichen Shen. System for Serving Deep Neural Networks Efficiently. Haichen Shen. A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, University of Washington, 2019. Reading Committee: Arvind Krishnamurthy, Chair; Matthai Philipose, Chair; Thomas E. Anderson; Xi Wang. Program Authorized to Offer Degree: Paul G. Allen School of Computer Science and Engineering, University of Washington. Abstract: System for Serving Deep Neural Networks Efficiently. Haichen Shen. Co-Chairs of the Supervisory Committee: Professor Arvind Krishnamurthy, Computer Science & Engineering; Affiliate Professor Matthai Philipose, Computer Science & Engineering. Today, Deep Neural Networks (DNNs) can recognize faces, detect objects, and transcribe speech, with (almost) human performance. DNN-based applications have become an important workload for both edge and cloud computing. However, it is challenging to design a DNN serving system that satisfies various constraints such as latency, energy, and cost, and still achieve high efficiency. First, though server-class accelerators provide significant computing power, it is hard to achieve high efficiency and utilization due to limits on batching induced by the latency constraint. Second, resource management is a challenging problem for mobile-cloud applications because DNN inference strains device battery capacities and cloud cost budgets. Third, model optimization allows systems to trade off accuracy for lower computation demand. Yet it introduces a model selection problem regarding which optimized model to use and when to use it. This dissertation provides techniques to improve the throughput and reduce the cost and energy consumption significantly while meeting all sorts of constraints via better scheduling and resource management algorithms in the serving system.
  • Learning Spark, Second Edition
    2nd Edition Apache Spark 3.0 Covers Learning Spark Lightning-Fast Data Analytics Compliments of Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee Foreword by Matei Zaharia Praise for Learning Spark, Second Edition This book offers a structured approach to learning Apache Spark, covering new developments in the project. It is a great way for Spark developers to get started with big data. —Reynold Xin, Databricks Chief Architect and Cofounder and Apache Spark PMC Member For data scientists and data engineers looking to learn Apache Spark and how to build scalable and reliable big data applications, this book is an essential guide! —Ben Lorica, Databricks Chief Data Scientist, Past Program Chair O’Reilly Strata Conferences, Program Chair for Spark + AI Summit SECOND EDITION Learning Spark Lightning-Fast Data Analytics Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee Beijing Boston Farnham Sebastopol Tokyo Learning Spark by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee Copyright © 2020 Databricks, Inc. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Acquisitions Editor: Jonathan Hassell Indexer: Potomac Indexing, LLC Development Editor: Michele Cronin Interior Designer: David Futato Production Editor: Deborah Baker Cover Designer: Karen Montgomery Copyeditor: Rachel Head Illustrator: Rebecca Demarest Proofreader: Penelope Perkins January 2015: First Edition July 2020: Second Edition Revision History for the Second Edition 2020-06-24: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492050049 for release details.
  • A Survey on Spark Ecosystem for Big Data Processing
A Survey on Spark Ecosystem for Big Data Processing. Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li. Abstract—With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays, due to its good properties including generality, fault tolerance, high performance of in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of provided transformation and action operators whose operating functions can be customized by users according to their applications. It is originally positioned as a fast and general data processing system. A large body of research efforts have been made to make it more efficient (faster) and general by considering various circumstances since its introduction. In this survey, we aim to have a thorough review of various kinds of optimization techniques on the generality and performance improvement of Spark. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and have an investigation and classification of various solving techniques in the literature. Moreover, we also introduce various data management and processing systems, machine learning algorithms and applications supported by Spark. Finally, we make a discussion on the open issues and challenges for large-scale in-memory data processing with Spark. Index Terms—Spark, Shark, RDD, In-Memory Data Processing. 1 INTRODUCTION: In the current era of 'big data', the data is collected at […] streaming computations in the same runtime, which is useful for complex applications that have different computation modes.
  • Mllib: Machine Learning in Apache Spark
MLlib: Machine Learning in Apache Spark. The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation: Meng, Xiangrui et al. "MLlib: Machine Learning in Apache Spark." Journal of Machine Learning Research, 17, 2016, pp. 1-7. © 2016 Xiangrui Meng et al. As Published: http://www.jmlr.org/papers/volume17/15-237/15-237.pdf Publisher: JMLR, Inc. Version: Final published version Citable link: http://hdl.handle.net/1721.1/116816 Terms of Use: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use. Journal of Machine Learning Research 17 (2016) 1-7, Submitted 5/15; Published 4/16. MLlib: Machine Learning in Apache Spark. Xiangrui Meng [email protected] Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105; Joseph Bradley [email protected] Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105; Burak Yavuz [email protected] Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105; Evan Sparks [email protected] UC Berkeley, 465 Soda Hall, Berkeley, CA 94720; Shivaram Venkataraman [email protected] UC Berkeley, 465 Soda Hall, Berkeley, CA 94720; Davies Liu [email protected] Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105; Jeremy Freeman [email protected] HHMI Janelia Research Campus, 19805 Helix Dr, Ashburn, VA 20147; DB Tsai [email protected] Netflix, 970 University Ave, Los Gatos, CA 95032; Manish Amde [email protected] Origami Logic, 1134 Crane Street, Menlo Park, CA 94025; Sean Owen [email protected] Cloudera UK, 33 Creechurch Lane, London EC3A 5EB United Kingdom; Doris Xin [email protected] UIUC, 201 N Goodwin Ave, Urbana, IL 61801; Reynold Xin [email protected] Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105; Michael J.