© Copyright 2019 Haichen Shen

System for Serving Deep Neural Networks Efficiently

Haichen Shen

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington, 2019

Reading Committee: Arvind Krishnamurthy, Chair; Matthai Philipose, Chair; Thomas E. Anderson; Xi Wang

Program Authorized to Offer Degree: Paul G. Allen School of Computer Science and Engineering

University of Washington

Abstract

System for Serving Deep Neural Networks Efficiently

Haichen Shen

Co-Chairs of the Supervisory Committee: Professor Arvind Krishnamurthy, Computer Science & Engineering; Affiliate Professor Matthai Philipose, Computer Science & Engineering

Today, Deep Neural Networks (DNNs) can recognize faces, detect objects, and transcribe speech with (almost) human performance. DNN-based applications have become an important workload for both edge and cloud computing. However, it is challenging to design a DNN serving system that satisfies various constraints such as latency, energy, and cost, and still achieves high efficiency. First, though server-class accelerators provide significant computing power, it is hard to achieve high efficiency and utilization due to limits on batching induced by the latency constraint. Second, resource management is a challenging problem for mobile-cloud applications because DNN inference strains device battery capacities and cloud cost budgets. Third, model optimization allows systems to trade off accuracy for lower computation demand, yet it introduces a model selection problem: which optimized model to use, and when to use it. This dissertation provides techniques that significantly improve throughput and reduce cost and energy consumption while meeting these constraints, via better scheduling and resource management algorithms in the serving system. We present the design, implementation, and evaluation of three systems: (a) Nexus, a serving system for a cluster of accelerators in the cloud that includes a batch-aware scheduler and a query analyzer for complex queries; (b) MCDNN, an approximation-based execution framework across mobile devices and the cloud that manages resource budgets proportionally to their frequency of use and systematically trades off accuracy for lower resource use; (c) sequential specialization, which generates and exploits low-cost, high-accuracy models at runtime once it detects temporal locality in streaming applications. Our evaluation on realistic workloads shows that these systems achieve significant improvements in cost, accuracy, and utilization.

Table of Contents

List of Figures
List of Tables

Chapter 1: Introduction
  1.1 Nexus
  1.2 MCDNN
  1.3 Sequential Specialization
  1.4 Contributions
  1.5 Organization

Chapter 2: Background
  2.1 Deep Neural Networks
  2.2 Model Optimization
  2.3 Related Work
    2.3.1 Nexus
    2.3.2 MCDNN
    2.3.3 Sequential Specialization

Chapter 3: Nexus: Scalable and Efficient DNN Execution on GPU Clusters
  3.1 Background
    3.1.1 Vision pipelines and DNN-based analysis
    3.1.2 Accelerators and the challenge of utilizing them
    3.1.3 Placing, packing and batching DNNs
  3.2 Examples
    3.2.1 Squishy bin packing
    3.2.2 Complex query scheduling
  3.3 System Overview
    3.3.1 Nexus API
  3.4 Global Scheduler
    3.4.1 Scheduling streams of individual DNN tasks
    3.4.2 Scheduling Complex Queries
  3.5 Runtime
    3.5.1 Adaptive Batching
    3.5.2 GPU Multiplexing
    3.5.3 Load balancing
  3.6 Evaluation
    3.6.1 Methodology
    3.6.2 Case Study
    3.6.3 Microbenchmark
    3.6.4 Large-scale deployment with changing workload
  3.7 Conclusion

Chapter 4: MCDNN: Approximation-Based Execution Framework for Mobile-Cloud Environment
  4.1 Background
  4.2 Approximate Model Scheduling
    4.2.1 Fractional packing and paging
    4.2.2 AMS problem definition
    4.2.3 Discussion
  4.3 System Design
    4.3.1 Model catalogs
    4.3.2 System support for novel model optimizations
    4.3.3 MCDNN's approximate model scheduler
    4.3.4 End-to-end architecture
  4.4 Evaluation
    4.4.1 Evaluating optimizations
    4.4.2 Evaluating the runtime
    4.4.3 MCDNN in applications
  4.5 Conclusions

Chapter 5: Sequential Specialization for Streaming Applications
  5.1 Class skew in day-to-day video
  5.2 Specializing Models
  5.3 Sequential Model Specialization
    5.3.1 The Oracle Bandit Problem (OBP)
    5.3.2 The Windowed ε-Greedy (WEG) Algorithm
  5.4 Evaluation
    5.4.1 Synthetic experiments
    5.4.2 Video experiments
  5.5 Conclusion

Chapter 6: Conclusion

Bibliography

Appendix A: Hardness of Fixed-rate GPU Scheduling Problem (FGSP)
Appendix B: Compact Model Architectures
Appendix C: Determine Dominant Classes

List of Figures

1.1 The impact of batching
1.2 Accuracy of object classification versus number of operations
1.3 System suite overview
2.1 Four common operators in deep neural networks
2.2 LeNet model architecture
2.3 AlexNet [77] model architecture
2.4 VGG [117] model architecture
2.5 ResNet model architecture
2.6 Inception [121] model architecture
3.1 Example video streams
3.2 A typical vision processing pipeline
3.3 Impact of batching DNN inputs on model execution throughput and latency
3.4 Resource allocation example
3.5 Query pipeline
3.6 Nexus runtime system overview
3.7 Nexus application API
3.8 Application example of live game streaming analysis
3.9 Traffic monitoring application example
3.10 Process of merging two nodes
3.11 Dataflow representations of two example applications
3.12 Compare the percent of requests that miss their deadlines under three batching policies
3.13 Batch size trace by three batching policies
3.14 Compare the throughput of live game analysis
3.15 Compare the throughput of traffic monitoring application
3.16 Compare the throughput when having multiple models on a single GPU
3.17 Compare the throughput improvement of batching the common prefix
3.18 Evaluate the overhead and memory use of prefix batching
3.19 Compare the throughput between Nexus and Nexus using batch-oblivious resource allocator
3.20 Compare the throughput between Nexus with and without query analyzer
3.21 Compare the throughput of different latency split plans
3.22 Changing workload of 10 applications over time on 64 GPUs
4.1 Basic components of a continuous mobile vision system
4.2 Memory/accuracy tradeoffs in MCDNN catalogs
4.3 Energy/accuracy tradeoffs in MCDNN catalogs
4.4 Latency/accuracy tradeoffs in MCDNN catalogs
4.5 System support for model specialization
4.6 System support for model sharing
4.7 Architecture of the MCDNN system
4.8 Impact of collective optimization on memory consumption
4.9 Impact of collective optimization on energy consumption
4.10 Impact of collective optimization on execution latency
4.11 Impact of MCDNN's dynamically-sized caching scheme
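The tension described in the abstract — larger batches raise accelerator utilization but inflate serving latency — can be made concrete with a small sketch. This is an illustrative model only, not Nexus's scheduler: the linear latency profile lat(b) = alpha + beta*b and the 2*lat(b) worst-case bound are assumptions.

```python
# A minimal sketch (not Nexus's actual implementation) of batch-aware
# batch-size selection: pick the largest batch whose worst-case latency
# still fits the SLO, since throughput grows with batch size.

def best_batch_size(slo_ms: float, alpha: float, beta: float, max_batch: int = 64) -> int:
    """Largest batch size whose worst-case latency fits the SLO.

    A request may wait for a whole batch to form and then execute with it,
    so worst-case latency is roughly 2 * lat(b) for a fully pipelined server.
    """
    best = 0
    for b in range(1, max_batch + 1):
        if 2 * (alpha + beta * b) <= slo_ms:
            best = b
    return best

if __name__ == "__main__":
    # Hypothetical profile: 5 ms fixed overhead, 1.2 ms per extra input.
    b = best_batch_size(slo_ms=100.0, alpha=5.0, beta=1.2)
    lat = 5.0 + 1.2 * b
    print(f"batch={b}, exec latency={lat:.1f} ms, "
          f"throughput={1000 * b / lat:.0f} req/s")
```

With these illustrative numbers, batching lifts throughput several-fold over single-request execution, which is why a tighter SLO (smaller feasible batch) directly lowers achievable utilization.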
Recommended publications
  • Is 'Distributed' Worth It? Benchmarking Apache Spark with Mesos
Is 'Distributed' worth it? Benchmarking Apache Spark with Mesos. N. Satra (ns532). January 13, 2015.

Abstract: A lot of research focus lately has been on building bigger distributed systems to handle 'Big Data' problems. This paper examines whether typical problems for web-scale companies really benefit from the parallelism offered by these systems, or can be handled by a single machine without synchronisation overheads. Repeated runs of a movie review sentiment analysis task in Apache Spark were carried out using Apache Mesos and Mesosphere, on Google Compute Engine clusters of various sizes. Only a marginal improvement in run-time was observed on a distributed system as opposed to a single node.

1 Introduction

Research in high performance computing and 'big data' has recently focussed on distributed computing. The reason is three-pronged:
• The rapidly decreasing cost of commodity machines, compared to the slower decline in prices of high-performance computers or supercomputers.
• The availability of cloud environments like Amazon EC2 or Google Compute Engine that let users access large numbers of computers on-demand, without upfront investment.
• The development of frameworks and tools that provide easy-to-use idioms for distributed computing and managing such large clusters of machines.

Large corporations like Google or Yahoo are working on petabyte-scale problems which simply can't be handled by single computers [9]. However, smaller companies and research teams with much more manageable sizes of data have jumped on the bandwagon, using the tools built by the larger companies, without always analysing the performance tradeoffs. It has reached the stage where researchers are suggesting using the same MapReduce-idiom hammer on all problems, whether they are nails or not [7].
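For context, a minimal PySpark sketch of the kind of workload benchmarked here — sentiment classification over movie reviews — might look as follows. This is not the paper's code; the input path, schema, and feature pipeline are hypothetical placeholders.

```python
# A hedged sketch of a movie-review sentiment job in Spark; only the
# pipeline.fit() call is the distributable work a benchmark would time.
import time
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sentiment-bench").getOrCreate()

# Hypothetical input: tab-separated "label<TAB>review text", one per line.
raw = spark.read.csv("reviews.tsv", sep="\t").toDF("label", "text")
df = raw.withColumn("label", raw["label"].cast("double"))

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),      # split into tokens
    HashingTF(inputCol="words", outputCol="features"),  # hashed bag-of-words
    LogisticRegression(maxIter=10),                     # sentiment classifier
])

start = time.time()
model = pipeline.fit(df)  # the work whose scaling the paper measures
print(f"fit took {time.time() - start:.1f}s "
      f"with parallelism {spark.sparkContext.defaultParallelism}")
```

Running the same script on a single node and on clusters of various sizes is the shape of the experiment the paper reports on.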
  • Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica. University of California, Berkeley; MIT; Databricks.

Citation: Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article 6, 15 pages. http://dx.doi.org/10.1145/2670979.2670985. Author's final manuscript; citable link: http://hdl.handle.net/1721.1/101090. Terms of use: Creative Commons Attribution-NonCommercial-ShareAlike, http://creativecommons.org/licenses/by-nc-sa/4.0/.

Abstract: Tachyon is a distributed file system enabling reliable data sharing at memory speed across cluster computing frameworks. While caching today improves read workloads, writes are either network or disk bound, as replication is used for fault-tolerance. Tachyon eliminates this bottleneck by pushing lineage, a well-known technique, into the storage layer. Even replicating the data in memory can lead to a significant drop in the write performance, as both the latency and throughput of the network are typically much worse than that of local memory. Slow writes can significantly hurt the performance of job pipelines, where one job consumes the output of another. These pipelines are regularly produced by workflow managers such as Oozie [4] and Luigi [7], e.g., to perform data …
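The lineage idea is easy to illustrate: instead of synchronously replicating every write over the network, the store remembers the recipe that produced a block and recomputes it on loss. A toy Python sketch, heavily simplified relative to Tachyon:

```python
# A toy illustration (an assumption-laden simplification, not Tachyon's
# design) of lineage in the storage layer: record how an output was
# derived so a lost block can be recomputed instead of replicated.

class LineageStore:
    def __init__(self):
        self._data = {}      # in-memory blocks
        self._lineage = {}   # key -> (function, input keys)

    def put(self, key, fn, *deps):
        """Write fn(*inputs) and remember the recipe, not extra replicas."""
        self._lineage[key] = (fn, deps)
        self._data[key] = fn(*(self._data[d] for d in deps))

    def get(self, key):
        if key not in self._data:  # lost block: recompute from lineage
            fn, deps = self._lineage[key]
            self._data[key] = fn(*(self.get(d) for d in deps))
        return self._data[key]

store = LineageStore()
store.put("a", lambda: list(range(5)))
store.put("b", lambda a: [x * x for x in a], "a")
del store._data["b"]      # simulate losing the block
print(store.get("b"))     # [0, 1, 4, 9, 16], recomputed rather than re-replicated
```

The write path touches only local memory; the network cost of replication is paid only if a failure actually forces recomputation of a long chain.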
  • Comparison of Spark Resource Managers and Distributed File Systems
2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom). Comparison of Spark Resource Managers and Distributed File Systems. Solmaz Salehian and Yonghong Yan, Department of Computer Science & Engineering, Oakland University, Rochester, MI 48309-4401, United States.

Abstract — The rapid growth in the volume, velocity, and variety of data produced by applications of scientific computing, commercial workloads, and the cloud has led to Big Data. Traditional solutions for data storage, management, and processing cannot meet the demands of this distributed data, so new execution models, data models, and software systems have been developed to address the challenges of storing data in heterogeneous form (e.g., HDFS, NoSQL databases) and of processing data in parallel and distributed fashion (e.g., the MapReduce, Hadoop, and Spark frameworks). This work comparatively studies the Apache Spark distributed data processing framework. Our study first discusses the resource management subsystems of Spark, and then reviews several of the distributed data storage options available to Spark. Keywords — Big Data, Distributed File System, Resource …

Section 4 provides information on the different resource management subsystems of Spark. In Section 5, four Distributed File Systems (DFSs) are reviewed. Then, the conclusion is given in Section 6.

II. CHALLENGES IN BIG DATA. Although Big Data provides users with a lot of opportunities, creating an efficient software framework for Big Data has many challenges in terms of networking, storage, management, analytics, and ethics [5, 6]. Previous studies [6-11] have been carried out to review open challenges in Big Data.
  • Alluxio: A Virtual Distributed File System by Haoyuan Li
Alluxio: A Virtual Distributed File System, by Haoyuan Li. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Ion Stoica, Co-chair; Professor Scott Shenker, Co-chair; Professor John Chuang. Spring 2018. Copyright 2018 by Haoyuan Li.

Abstract: Alluxio: A Virtual Distributed File System, by Haoyuan Li, Doctor of Philosophy in Computer Science, University of California, Berkeley. Professor Ion Stoica, Co-chair; Professor Scott Shenker, Co-chair. The world is entering the data revolution era. Along with the latest advancements of the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and Internet of Things (IoT), the amount of data we are generating, collecting, storing, managing, and analyzing is growing exponentially. Storing and processing these data has exposed tremendous challenges and opportunities. Over the past two decades, we have seen significant innovation in the data stack. For example, in the computation layer, the ecosystem started from the MapReduce framework, and grew to many different general and specialized systems such as Apache Spark for general data processing, Apache Storm and Apache Samza for stream processing, Apache Mahout for machine learning, TensorFlow and Caffe for deep learning, and Presto and Apache Drill for SQL workloads. There are more than a hundred popular frameworks for various workloads and the number is growing. Similarly, the storage layer of the ecosystem grew from the Apache Hadoop Distributed File System (HDFS) to a variety of choices as well, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases to realize different tradeoffs in cost, speed and semantics.
  • An Architecture for Fast and General Data Processing on Large Clusters
An Architecture for Fast and General Data Processing on Large Clusters. Matei Zaharia. Electrical Engineering and Computer Sciences, University of California at Berkeley. Technical Report No. UCB/EECS-2014-12. http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.html. February 3, 2014. Copyright © 2014, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

An Architecture for Fast and General Data Processing on Large Clusters, by Matei Alexandru Zaharia. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Scott Shenker, Chair; Professor Ion Stoica; Professor Alexandre Bayen; Professor Joshua Bloom. Fall 2013. Copyright © 2013 by Matei Alexandru Zaharia.

Abstract: An Architecture for Fast and General Data Processing on Large Clusters, by Matei Alexandru Zaharia, Doctor of Philosophy in Computer Science, University of California, Berkeley, Professor Scott Shenker, Chair. The past few years have seen a major change in computing systems, as growing data volumes and stalling processor speeds require more and more applications to scale out to distributed systems.
  • Thesis: Enabling Autoscaling for In-Memory Storage in Cluster Computing Framework
THESIS: ENABLING AUTOSCALING FOR IN-MEMORY STORAGE IN CLUSTER COMPUTING FRAMEWORK. Submitted by Bibek Raj Shrestha, Department of Computer Science, in partial fulfillment of the requirements for the Degree of Master of Science, Colorado State University, Fort Collins, Colorado, Spring 2019. Master's Committee — Advisor: Sangmi Lee Pallickara; Shrideep Pallickara; Stephen C. Hayne. Copyright by Bibek Raj Shrestha 2019. All Rights Reserved.

ABSTRACT: ENABLING AUTOSCALING FOR IN-MEMORY STORAGE IN CLUSTER COMPUTING FRAMEWORK. IoT enabled devices and observational instruments continuously generate voluminous data. A large portion of these datasets are delivered with the associated geospatial locations. The increased volumes of geospatial data, alongside the emerging geospatial services, pose computational challenges for large-scale geospatial analytics. We have designed and implemented STRETCH, an in-memory distributed geospatial storage that preserves spatial proximity and enables proactive autoscaling for frequently accessed data. STRETCH stores data with a delayed data dispersion scheme that incrementally adds data nodes to the storage system. We have devised an autoscaling feature that proactively repartitions data to alleviate computational hotspots before they occur. We compared the performance of STRETCH with Apache Ignite and the results show that STRETCH provides up to 3 times the throughput when the system encounters hotspots. STRETCH is built on Apache Spark and Ignite and interacts with them at runtime.

ACKNOWLEDGEMENTS: I would like to thank my advisor, Dr. Sangmi Lee Pallickara, for her enthusiastic support and sincere guidance that have resulted in driving this research project to completion. I would also like to thank my thesis committee members, Dr. Shrideep Pallickara and Dr.
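A toy sketch of the proactive repartitioning idea described above: watch per-partition access rates and split a partition before it crosses a hotspot threshold. The 80% trigger and split-in-half policy are illustrative assumptions, not STRETCH's actual scheme.

```python
# Toy proactive-autoscaling sketch (assumptions throughout): a partition
# approaching its hotspot threshold is split early, before latency degrades.
from collections import Counter

class ProactiveStore:
    def __init__(self, partitions, hot_threshold=1000):
        self.partitions = dict(partitions)  # partition id -> list of keys
        self.hits = Counter()
        self.hot_threshold = hot_threshold

    def access(self, pid):
        self.hits[pid] += 1
        # Act early: repartition while *approaching* the threshold,
        # not after the hotspot has already formed.
        if self.hits[pid] >= 0.8 * self.hot_threshold and pid in self.partitions:
            self._split(pid)

    def _split(self, pid):
        keys = self.partitions.pop(pid)
        mid = len(keys) // 2
        # In a real system the halves would migrate to newly added nodes.
        self.partitions[f"{pid}/0"] = keys[:mid]
        self.partitions[f"{pid}/1"] = keys[mid:]
        del self.hits[pid]

store = ProactiveStore({"p0": list(range(100))}, hot_threshold=10)
for _ in range(8):
    store.access("p0")           # the 8th access trips the 80% trigger
print(sorted(store.partitions))  # ['p0/0', 'p0/1']
```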
  • Performance Tuning and Evaluation of Iterative Algorithms in Spark
Performance Tuning and Evaluation of Iterative Algorithms in Spark. Janani Gururam, Department of Computer Science, University of Maryland, College Park, MD 20742.

Abstract: Spark is a widely used distributed, open-source framework for machine learning, relational queries, graph analytics and stream processing. This paper discusses key features of Spark that differentiate it from other MapReduce frameworks, focusing specifically on how iterative algorithms are implemented and optimized. We then analyze the impact of Resilient Distributed Dataset (RDD) caching, serialization, disk I/O, and other parameter considerations to improve performance in Spark. Finally, we discuss benchmarking tools currently available to Spark users. While there are several general purpose tools available, we do note that only a few Spark-specific solutions exist, and they do not fully bridge the gap between workload characterization and performance implications.

1 Introduction

Spark is an open-source, in-memory distributed framework for big data processing. The core engine of Spark underlies several modules for specialized analytical computation, including MLlib for machine learning applications, GraphX for graph analytics, SparkSQL for queries on structured, relational data, and Spark Streaming [25]. Spark provides several APIs for languages like Java, Scala, R, and Python. The base data structure in Spark is a Resilient Distributed Dataset (RDD), an immutable, distributed collection of records. There are two types of operations in Spark: transformations, which perform operations on an input RDD to produce one or more new RDDs, and actions, which launch the series of transformations to return the final result RDD. Operations in Spark are lazily evaluated.
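The caching point can be shown in a few lines of PySpark: an iterative job rescans the same RDD each step, so persisting it avoids re-reading and re-parsing the input on every pass. The input file and loop body here are placeholders, not the paper's benchmark.

```python
# Minimal sketch of RDD caching for an iterative job; "numbers.txt"
# (one float per line) and the threshold sweep are illustrative only.
from pyspark import SparkContext

sc = SparkContext(appName="iterative-caching")

data = sc.textFile("numbers.txt").map(float).cache()  # keep parsed records in memory

threshold = 0.0
for _ in range(10):
    # Bind the current threshold into the closure; count() is the action
    # that triggers a pass over the (now cached) RDD. Without .cache(),
    # every count() would re-read and re-parse the text file.
    n = data.filter(lambda x, t=threshold: x > t).count()
    print(f"threshold={threshold}: {n} values above")
    threshold += 1.0
```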
  • tf.data: A Machine Learning Data Processing Framework
tf.data: A Machine Learning Data Processing Framework. Derek G. Murray (Lacework*), Jiří Šimša (Google), Ana Klimovic (ETH Zurich*), Ihor Indyk (Google).

ABSTRACT: Training machine learning models requires feeding input data for models to ingest. Input pipelines for machine learning jobs are often challenging to implement efficiently as they require reading large volumes of data, applying complex transformations, and transferring data to hardware accelerators while overlapping computation and communication to achieve optimal performance. We present tf.data, a framework for building and executing efficient input pipelines for machine learning jobs. The tf.data API provides operators that can be parameterized with user-defined computation, composed, and reused across different machine learning domains. These abstractions enable users to focus on the application logic of data processing, while tf.data's runtime ensures that pipelines run efficiently. We demonstrate that input pipeline performance is critical to the end-to-end training time of state-of-the-art machine learning models. tf.data delivers the high performance required, while avoiding the need for manual tuning of performance knobs. We show that tf.data features, such as parallelism, caching, static optimizations, and optional non-deterministic execution, are essential …

Figure 1: CDF showing the fraction of compute time that millions of ML training jobs executed in our fleet over one month spend in the input pipeline. 20% of jobs spend more than a third of their compute time ingesting data.

1 INTRODUCTION: Data is the lifeblood of machine learning (ML). Training ML models requires steadily pumping examples for models to ingest and learn from.
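A minimal pipeline built from tf.data's composable operators looks like this. map, shuffle, batch, prefetch, and tf.data.AUTOTUNE are the real API; the file name and record layout are placeholder assumptions.

```python
# A short tf.data input pipeline; "train.csv" and the "label,feat1,..."
# record format are hypothetical, the operators are the actual API.
import tensorflow as tf

def parse(line):
    # Assumed record format: "label,feat1,feat2,..." with numeric fields.
    values = tf.strings.to_number(tf.strings.split(line, ","))
    return values[1:], values[0]

ds = (tf.data.TextLineDataset("train.csv")
      .map(parse, num_parallel_calls=tf.data.AUTOTUNE)  # parallel transform
      .shuffle(10_000)
      .batch(128)
      .prefetch(tf.data.AUTOTUNE))  # overlap input work with model steps

for features, labels in ds.take(1):
    print(features.shape, labels.shape)
```

The prefetch at the tail is what lets the input pipeline run concurrently with the accelerator, which is exactly the overlap of computation and communication the abstract emphasizes.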
  • From Cloud Computing to Sky Computing — Ion Stoica and Scott Shenker, UC Berkeley
From Cloud Computing to Sky Computing. Ion Stoica and Scott Shenker, UC Berkeley.

Abstract: We consider the future of cloud computing and ask how we might guide it towards a more coherent service we call sky computing. The barriers are more economic than technical, and we propose reciprocal peering as a key enabling step.

ACM Reference Format: Ion Stoica and Scott Shenker. 2021. From Cloud Computing to Sky Computing. In Workshop on Hot Topics in Operating Systems (HotOS '21), May 31-June 2, 2021, Ann Arbor, MI, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3458336.3465302

1 Introduction: In 1961, John McCarthy outlined what he saw as the future of computing (we regretfully note the gendered language): "computation may someday be organized as a public utility, just as the telephone system is a public utility." … and switching providers is relatively easy; you can even keep your number when switching.

Returning to McCarthy's prediction, what concerns us here is not that there is no single public computing utility, but that cloud computing is not an undifferentiated commodity. In contrast to telephony, the cloud computing market has evolved away from commoditization, with cloud providers striving to differentiate themselves through proprietary services. In this paper we suggest steps we can take to overcome this differentiation and help create a more commoditized version of cloud computing, which we call Sky computing. Before doing so, we briefly summarize the history of cloud computing to provide context for the rest of the paper.

2 Historical Context: The NSF high performance computing (HPC) initiative in …
  • Learning Spark, Second Edition
Learning Spark: Lightning-Fast Data Analytics, 2nd Edition (covers Apache Spark 3.0). Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee. Foreword by Matei Zaharia. Compliments of Databricks.

Praise for Learning Spark, Second Edition: "This book offers a structured approach to learning Apache Spark, covering new developments in the project. It is a great way for Spark developers to get started with big data." —Reynold Xin, Databricks Chief Architect and Cofounder and Apache Spark PMC Member. "For data scientists and data engineers looking to learn Apache Spark and how to build scalable and reliable big data applications, this book is an essential guide!" —Ben Lorica, Databricks Chief Data Scientist, Past Program Chair O'Reilly Strata Conferences, Program Chair for Spark + AI Summit.

Learning Spark, by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee. Copyright © 2020 Databricks, Inc. All rights reserved. Printed in the United States of America. Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Acquisitions Editor: Jonathan Hassell; Development Editor: Michele Cronin; Production Editor: Deborah Baker; Copyeditor: Rachel Head; Proofreader: Penelope Perkins; Indexer: Potomac Indexing, LLC; Interior Designer: David Futato; Cover Designer: Karen Montgomery; Illustrator: Rebecca Demarest. January 2015: First Edition. July 2020: Second Edition. Revision History for the Second Edition: 2020-06-24: First Release. See http://oreilly.com/catalog/errata.csp?isbn=9781492050049 for release details.
  • A Survey on Spark Ecosystem for Big Data Processing
A Survey on Spark Ecosystem for Big Data Processing. Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li.

Abstract — With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is the state of the art in large-scale data computing systems nowadays, due to its good properties including generality, fault tolerance, high performance of in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of provided transformation and action operators whose operating functions can be customized by users according to their applications. It is originally positioned as a fast and general data processing system. A large body of research efforts have been made to make it more efficient (faster) and general by considering various circumstances since its introduction. In this survey, we aim to have a thorough review of various kinds of optimization techniques on the generality and performance improvement of Spark. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and present an investigation and classification of various solving techniques in the literature. Moreover, we also introduce various data management and processing systems, machine learning algorithms and applications supported by Spark. Finally, we make a discussion on the open issues and challenges for large-scale in-memory data processing with Spark. Index Terms — Spark, Shark, RDD, In-Memory Data Processing.

1 INTRODUCTION: In the current era of 'big data', the data is collected at … streaming computations in the same runtime, which is useful for complex applications that have different computation modes.
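The RDD model the survey describes is easy to see in a few lines: transformations only record lineage, and nothing executes until an action forces evaluation. A minimal PySpark sketch:

```python
# Transformations (flatMap, map, reduceByKey) build a lineage graph lazily;
# the action (collect) is what actually triggers distributed execution.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-model")

lines = sc.parallelize(["spark builds a dag", "a dag of transformations"])
counts = (lines.flatMap(lambda l: l.split())      # transformation: lazy
               .map(lambda w: (w, 1))             # transformation: lazy
               .reduceByKey(lambda a, b: a + b))  # transformation: lazy

print(counts.collect())  # action: runs the whole pipeline
# e.g. [('a', 2), ('dag', 2), ('spark', 1), ...] (order may vary)
```

The user-supplied lambdas passed to each operator are the "customized operating functions" the abstract refers to.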
  • MLlib: Machine Learning in Apache Spark
MLlib: Machine Learning in Apache Spark.

Citation: Meng, Xiangrui, et al. "MLlib: Machine Learning in Apache Spark." Journal of Machine Learning Research, 17, 2016, pp. 1-7. © 2016 Xiangrui Meng et al. As published: http://www.jmlr.org/papers/volume17/15-237/15-237.pdf. Publisher: JMLR, Inc. (final published version). Citable link: http://hdl.handle.net/1721.1/116816. Terms of use: the article is made available in accordance with the publisher's policy and may be subject to US copyright law; please refer to the publisher's site for terms of use.

Journal of Machine Learning Research 17 (2016) 1-7. Submitted 5/15; Published 4/16.

Authors: Xiangrui Meng, Databricks, 160 Spear Street, 13th Floor, San Francisco, CA 94105; Joseph Bradley, Databricks; Burak Yavuz, Databricks; Evan Sparks, UC Berkeley, 465 Soda Hall, Berkeley, CA 94720; Shivaram Venkataraman, UC Berkeley; Davies Liu, Databricks; Jeremy Freeman, HHMI Janelia Research Campus, 19805 Helix Dr, Ashburn, VA 20147; DB Tsai, Netflix, 970 University Ave, Los Gatos, CA 95032; Manish Amde, Origami Logic, 1134 Crane Street, Menlo Park, CA 94025; Sean Owen, Cloudera UK, 33 Creechurch Lane, London EC3A 5EB, United Kingdom; Doris Xin, UIUC, 201 N Goodwin Ave, Urbana, IL 61801; Reynold Xin, Databricks; Michael J.