Spark in Action

Total Pages: 16

File Type: PDF, Size: 1020 KB

Spark in Action, by Petar Zečević and Marko Bonaći (Manning)

MAIN SPARK COMPONENTS, VARIOUS RUNTIME INTERACTIONS, AND STORAGE OPTIONS (inside-cover diagram)
- Components and their core abstractions: Spark Core (RDD), Spark SQL (DataFrame), Spark Streaming (DStream), Spark ML & MLlib (ML model), Spark GraphX (Graph).
- Streaming sources include Kafka, Flume, Twitter, HDFS, and ZeroMQ.
- Spark Streaming uses DStreams to periodically create RDDs; it can use GraphX features on the data it receives, and it can use machine-learning models and Spark SQL to analyze streaming data.
- Spark ML uses DataFrames to represent data; Spark MLlib models use RDDs. Both use features from Spark Core.
- Spark SQL transforms operations on DataFrames to operations on RDDs. Data sources include Hive, JSON, relational databases, NoSQL databases, and Parquet files.
- Spark GraphX uses Spark Core features behind the scenes.
- Filesystems include HDFS, GlusterFS, and Amazon S3.

RDD EXAMPLE DEPENDENCIES (inside-cover diagram)
- RDD lineage: parallelize, map, reduceByKey, map, collect, starting and ending at the driver.
- RDDs created along the way: listrdd (ParallelCollectionRDD), pairs (MapPartitionsRDD), reduced (ShuffledRDD), finalrdd (MapPartitionsRDD).
- Dependencies between them: OneToOne dependency, Shuffle dependency, OneToOne dependency. (This lineage is written out as code after the front matter below.)

Spark in Action
PETAR ZEČEVIĆ
MARKO BONAĆI
MANNING, Shelter Island

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Special Sales Department, Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964. Email: [email protected]

©2017 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964

Development editor: Marina Michaels
Technical development editor: Andy Hicks
Project editor: Karen Gulliver
Copyeditor: Tiffany Taylor
Proofreader: Elizabeth Martin
Technical proofreaders: Michiel Trimpe, Robert Ormandi
Typesetter: Gordan Salinovic
Cover designer: Marija Tudor

ISBN 9781617292606
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 21 20 19 18 17 16

To my mother in heaven. —P.Z.
To my dear wife Suzana and our twins, Frane and Luka. —M.B.
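The RDD-lineage diagram above maps line for line onto a small Spark program. Here is a minimal sketch in Spark's Scala API; the variable names come from the diagram, and the input data is invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object LineageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // parallelize: the driver distributes a local collection -> ParallelCollectionRDD
    val listrdd = sc.parallelize(List(1, 2, 3, 4, 5, 1, 2))

    // map -> MapPartitionsRDD (OneToOne dependency on listrdd)
    val pairs = listrdd.map(x => (x, 1))

    // reduceByKey -> ShuffledRDD (Shuffle dependency on pairs)
    val reduced = pairs.reduceByKey(_ + _)

    // map -> MapPartitionsRDD (OneToOne dependency on reduced)
    val finalrdd = reduced.map { case (k, v) => s"$k appears $v times" }

    // collect: the results travel back to the driver
    finalrdd.collect().foreach(println)

    // prints the lineage Spark actually recorded for finalrdd
    println(finalrdd.toDebugString)

    spark.stop()
  }
}
```

`toDebugString` prints the same chain the diagram shows: two narrow (OneToOne) dependencies around the single shuffle introduced by `reduceByKey`.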
brief contents

PART 1 FIRST STEPS 1
1 ■ Introduction to Apache Spark 3
2 ■ Spark fundamentals 18
3 ■ Writing Spark applications 40
4 ■ The Spark API in depth 66

PART 2 MEET THE SPARK FAMILY 103
5 ■ Sparkling queries with Spark SQL 105
6 ■ Ingesting data with Spark Streaming 147
7 ■ Getting smart with MLlib 180
8 ■ ML: classification and clustering 218
9 ■ Connecting the dots with GraphX 254

PART 3 SPARK OPS 283
10 ■ Running Spark 285
11 ■ Running on a Spark standalone cluster 306
12 ■ Running on YARN and Mesos 331

PART 4 BRINGING IT TOGETHER 361
13 ■ Case study: real-time dashboard 363
14 ■ Deep learning on Spark with H2O 383

contents

preface xvii
acknowledgments xix
about this book xx
about the authors xxiv
about the cover xxv

PART 1 FIRST STEPS 1

1 Introduction to Apache Spark 3
 1.1 What is Spark? 4
   The Spark revolution 5 ■ MapReduce's shortcomings 5 ■ What Spark brings to the table 6
 1.2 Spark components 8
   Spark Core 8 ■ Spark SQL 9 ■ Spark Streaming 10 ■ Spark MLlib 10 ■ Spark GraphX 10
 1.3 Spark program flow 10
 1.4 Spark ecosystem 13
 1.5 Setting up the spark-in-action VM 14
   Downloading and starting the virtual machine 15 ■ Stopping the virtual machine 16
 1.6 Summary 16

2 Spark fundamentals 18
 2.1 Using the spark-in-action VM 19
   Cloning the Spark in Action GitHub repository 20 ■ Finding Java 20 ■ Using the VM's Hadoop installation 21 ■ Examining the VM's Spark installation 22
 2.2 Using Spark shell and writing your first Spark program 23
   Starting the Spark shell 23 ■ The first Spark code example 25 ■ The notion of a resilient distributed dataset 27
 2.3 Basic RDD actions and transformations 28
   Using the map transformation 28 ■ Using the distinct and flatMap transformations 30 ■ Obtaining RDD's elements with the sample, take, and takeSample operations 34
 2.4 Double RDD functions 36
   Basic statistics with double RDD functions 37 ■ Visualizing data distribution with histograms 38 ■ Approximate sum and mean 38
 2.5 Summary 39

3 Writing Spark applications 40
 3.1 Generating a new Spark project in Eclipse 41
 3.2 Developing the application 46
   Preparing the GitHub archive dataset 46 ■ Loading JSON 48 ■ Running the application from Eclipse 50 ■ Aggregating the data 52 ■ Excluding non-employees 53 ■ Broadcast variables 55 ■ Using the entire dataset 57
 3.3 Submitting the application 58
   Building the uberjar 59 ■ Adapting the application 60 ■ Using spark-submit 62
 3.4 Summary 64

4 The Spark API in depth 66
 4.1 Working with pair RDDs 67
   Creating pair RDDs 67 ■ Basic pair RDD functions 68
 4.2 Understanding data partitioning and reducing data shuffling 74
   Using Spark's data partitioners 74 ■ Understanding and avoiding unnecessary shuffling 76 ■ Repartitioning RDDs 80 ■ Mapping data in partitions 81
 4.3 Joining, sorting, and grouping data 82
   Joining data 82 ■ Sorting data 88 ■ Grouping data 91
 4.4 Understanding RDD dependencies 94
   RDD dependencies and Spark execution 94 ■ Spark stages and tasks 96 ■ Saving the RDD lineage with checkpointing 97
 4.5 Using accumulators and broadcast variables to communicate with Spark executors 97
   Obtaining data from executors with accumulators 97 ■ Sending data to executors using broadcast variables 99
 4.6 Summary 101

PART 2 MEET THE SPARK FAMILY 103

5 Sparkling queries with Spark SQL 105
 5.1 Working with DataFrames 106
   Creating DataFrames from RDDs 108 ■ DataFrame API basics 115 ■ Using SQL functions to perform calculations on data 118 ■ Working with missing values 123 ■ Converting DataFrames to RDDs 124 ■ Grouping and joining data 125 ■ Performing joins 128
 5.2 Beyond DataFrames: introducing DataSets 129
 5.3 Using SQL commands 130
   Table catalog and Hive metastore 131 ■ Executing SQL queries 133 ■ Connecting to Spark SQL through the Thrift server 134
 5.4 Saving and loading DataFrame data 137
   Built-in data sources 138 ■ Saving data 139 ■ Loading data 141
 5.5 Catalyst optimizer 142
 5.6 Performance improvements with Tungsten 144
 5.7 Summary 145

6 Ingesting data with Spark Streaming 147
 6.1 Writing Spark Streaming applications 148
   Introducing the example application 148 ■ Creating a streaming context 150 ■ Creating a discretized stream 150 ■ Using discretized streams 152 ■ Saving the results to a file 153 ■ Starting and stopping the streaming computation 154 ■ Saving the computation state over time 155 ■ Using window operations for time-limited calculations 162 ■ Examining the other built-in input streams 164
 6.2 Using external data sources 165
   Setting up Kafka 166 ■ Changing the streaming application to use Kafka 167
 6.3 Performance of Spark Streaming jobs 172
   Obtaining good performance 173 ■ Achieving fault-tolerance 175
 6.4 Structured Streaming 176
   Creating a streaming DataFrame 176 ■ Outputting streaming data 177 ■ Examining streaming executions 178 ■ Future direction of structured streaming 178
 6.5 Summary 178

7 Getting smart with MLlib 180
 7.1 Introduction to machine learning 181
   Definition of machine learning 184 ■ Classification of machine-learning algorithms 184 ■ Machine learning with Spark 187
 7.2 Linear algebra in Spark 187
   Local vector and matrix implementations 188 ■ Distributed matrices 192
 7.3 Linear regression 193
   About linear regression 193 ■ Simple linear regression 194 ■ Expanding the model to multiple linear regression 196
 7.4 Analyzing and preparing the data 198
   Analyzing data distribution 199 ■ Analyzing column cosine similarities 200 ■ Computing the covariance matrix 200 ■ Transforming to labeled points 201 ■ Splitting the data 201 ■ Feature scaling and mean normalization 202
 7.5 Fitting and using a linear regression model 202
   Predicting the target values 203 ■ Evaluating the model's performance 204 ■ Interpreting the model parameters 204 ■ Loading and saving the model 205
 7.6 Tweaking the algorithm 205
   Finding the right step size and number of iterations 206 ■ Adding higher-order polynomials 207 ■ Bias-variance tradeoff and model complexity 209 ■ Plotting residual plots 211 ■ Avoiding overfitting by using regularization 212 ■ K-fold cross-validation
Recommended publications
  • Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions
    Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions
    MUTAZ BARIKA, University of Tasmania; SAURABH GARG, University of Tasmania; ALBERT Y. ZOMAYA, University of Sydney; LIZHE WANG, China University of Geosciences (Wuhan); AAD VAN MOORSEL, Newcastle University; RAJIV RANJAN, China University of Geosciences and Newcastle University
    Interest in processing big data has increased rapidly to gain insights that can transform businesses, government policies and research outcomes. This has led to advancement in communication, programming and processing technologies, including Cloud computing services and technologies such as Hadoop, Spark and Storm. This trend also affects the needs of analytical applications, which are no longer monolithic but composed of several individual analytical steps running in the form of a workflow. These Big Data Workflows are vastly different in nature from traditional workflows. Researchers are currently facing the challenge of how to orchestrate and manage the execution of such workflows. In this paper, we discuss in detail the orchestration requirements of these workflows as well as the challenges in achieving these requirements. We also survey current trends and research that supports orchestration of big data workflows and identify open research challenges to guide future developments in this area.
    CCS Concepts: • General and reference → Surveys and overviews; • Information systems → Data analytics; • Computer systems organization → Cloud computing
    Additional Key Words and Phrases: Big Data, Cloud Computing, Workflow Orchestration, Requirements, Approaches
    ACM Reference format: Mutaz Barika, Saurabh Garg, Albert Y. Zomaya, Lizhe Wang, Aad van Moorsel, and Rajiv Ranjan. 2018. Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions.
  • Building Machine Learning Inference Pipelines at Scale
    Building Machine Learning inference pipelines at scale
    Julien Simon, Global Evangelist, AI & Machine Learning, @julsimon
    Problem statement:
    • Real-life Machine Learning applications require more than a single model.
    • Data may need pre-processing: normalization, feature engineering, dimensionality reduction, etc.
    • Predictions may need post-processing: filtering, sorting, combining, etc.
    Our goal: build scalable ML pipelines with open source (Spark, Scikit-learn, XGBoost) and managed services (Amazon EMR, AWS Glue, Amazon SageMaker).
    Apache Spark (https://spark.apache.org/):
    • Open-source, distributed processing system
    • In-memory caching and optimized execution for fast performance (typically 100x faster than Hadoop)
    • Batch processing, streaming analytics, machine learning, graph databases and ad hoc queries
    • API for Java, Scala, Python, R, and SQL
    • Available in Amazon EMR and AWS Glue
    MLlib – Machine learning library (https://spark.apache.org/docs/latest/ml-guide.html):
    • Algorithms: classification, regression, clustering, collaborative filtering
    • Featurization: feature extraction, transformation, dimensionality reduction
    • Tools for constructing, evaluating and tuning pipelines
    • Transformer – a transform function that maps a DataFrame into a new
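The truncated last bullet is introducing Spark ML's pipeline abstraction: Transformers map one DataFrame into a new one, and Estimators are fitted to produce Transformers. A minimal sketch using standard Spark ML classes (the toy data and column names are invented):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ml-pipeline").master("local[*]").getOrCreate()

    // Toy training data
    val training = spark.createDataFrame(Seq(
      (0L, "spark makes big data simple", 1.0),
      (1L, "cats and dogs", 0.0)
    )).toDF("id", "text", "label")

    // Transformers: each maps a DataFrame into a new DataFrame
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

    // Estimator: fit() produces a Transformer (the fitted model)
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)  // trains the whole pipeline
    model.transform(training).show()    // a fitted pipeline is itself a Transformer

    spark.stop()
  }
}
```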
  • Evaluation of SPARQL Queries on Apache Flink
    applied sciences (Article)
    SPARQL2Flink: Evaluation of SPARQL Queries on Apache Flink
    Oscar Ceballos 1, Carlos Alberto Ramírez Restrepo 2, María Constanza Pabón 2, Andres M. Castillo 1,* and Oscar Corcho 3
    1 Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Ciudad Universitaria Meléndez Calle 13 No. 100-00, Cali 760032, Colombia; [email protected]
    2 Departamento de Electrónica y Ciencias de la Computación, Pontificia Universidad Javeriana Cali, Calle 18 No. 118-250, Cali 760031, Colombia; [email protected] (C.A.R.R.); [email protected] (M.C.P.)
    3 Ontology Engineering Group, Universidad Politécnica de Madrid, Campus de Montegancedo, Boadilla del Monte, 28660 Madrid, Spain; ocorcho@fi.upm.es
    * Correspondence: [email protected]
    Abstract: Existing SPARQL query engines and triple stores are continuously improved to handle more massive datasets. Several approaches have been developed in this context proposing the storage and querying of RDF data in a distributed fashion, mainly using the MapReduce Programming Model and Hadoop-based ecosystems. New trends in Big Data technologies have also emerged (e.g., Apache Spark, Apache Flink); they use distributed in-memory processing and promise to deliver higher data processing performance. In this paper, we present a formal interpretation of some PACT transformations implemented in the Apache Flink DataSet API. We use this formalization to provide a mapping to translate a SPARQL query to a Flink program. The mapping was implemented in a prototype used to determine the correctness and performance of the solution. The source code of the project is available in GitHub under the MIT license.
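The core idea can be sketched with the Flink DataSet API: each triple pattern in a SPARQL basic graph pattern becomes a filter over the triples, and each shared variable becomes a join key. This is an invented illustration of that translation scheme, not the paper's actual SPARQL2Flink mapping:

```scala
import org.apache.flink.api.scala._

object BgpJoinSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // RDF triples as (subject, predicate, object); made-up data
    val triples = env.fromCollection(Seq(
      ("alice", "knows", "bob"),
      ("bob", "worksAt", "acme"),
      ("alice", "worksAt", "globex")
    ))

    // SELECT ?x ?org WHERE { ?x :knows ?y . ?y :worksAt ?org }
    val knows   = triples.filter(_._2 == "knows")    // one filter per triple pattern
    val worksAt = triples.filter(_._2 == "worksAt")

    val result = knows
      .join(worksAt).where(_._3).equalTo(_._1)       // shared variable ?y is the join key
      .map { pair => (pair._1._1, pair._2._3) }      // project ?x and ?org

    result.print()  // executes the plan and prints (alice,acme)
  }
}
```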
  • Flare: Optimizing Apache Spark with Native Compilation
    Flare: Optimizing Apache Spark with Native Compilation for Scale-Up Architectures and Medium-Size Data
    Gregory Essertel, Ruby Tahboub, and James Decker, Purdue University; Kevin Brown and Kunle Olukotun, Stanford University; Tiark Rompf, Purdue University
    https://www.usenix.org/conference/osdi18/presentation/essertel
    This paper is included in the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), October 8–10, 2018, Carlsbad, CA, USA. ISBN 978-1-939133-08-3. Open access to the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.
    Grégory M. Essertel (1), Ruby Y. Tahboub (1), James M. Decker (1), Kevin J. Brown (2), Kunle Olukotun (2), Tiark Rompf (1); (1) Purdue University, (2) Stanford University; {gesserte,rtahboub,decker31,tiark}@purdue.edu, {kjbrown,kunle}@stanford.edu
    Abstract: In recent years, Apache Spark has become the de facto standard for big data processing. Spark has enabled a wide audience of users to process petabyte-scale workloads due to its flexibility and ease of use: users are able to mix SQL-style relational queries with Scala or Python code, and have the resultant programs distributed across an entire cluster, all without having to work with low-level parallelization or network primitives.
    cessing. Systems like Apache Spark [8] have gained enormous traction thanks to their intuitive APIs and ability to scale to very large data sizes, thereby commoditizing petabyte-scale (PB) data processing for large numbers of users. But thanks to its attractive programming interface and tooling, people are also increasingly using Spark for smaller workloads. Even for companies that also have PB-scale data, there is typically a long tail of tasks of much smaller size, which make up a very impor-
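The abstract's "mix SQL-style relational queries with Scala or Python code" is easy to make concrete. A minimal sketch in Spark's Scala API (the view name and data are invented):

```scala
import org.apache.spark.sql.SparkSession

object MixedQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mixed").master("local[*]").getOrCreate()
    import spark.implicits._

    // Plain Scala data, registered as a SQL-queryable view
    val sales = Seq(("books", 12.0), ("games", 99.5), ("books", 5.0)).toDF("category", "amount")
    sales.createOrReplaceTempView("sales")

    // A SQL-style relational query...
    val totals = spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category")

    // ...mixed freely with ordinary Scala code over the same data
    totals.as[(String, Double)]
      .map { case (category, total) => s"$category: $total" }
      .show(truncate = false)

    spark.stop()
  }
}
```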
  • Large Scale Querying and Processing for Property Graphs Phd Symposium∗
    Large Scale Querying and Processing for Property Graphs (PhD Symposium)
    Mohamed Ragab, Data Systems Group, University of Tartu, Tartu, Estonia; [email protected]
    ABSTRACT: Recently, large scale graph data management, querying and processing have experienced a renaissance in several timely application domains (e.g., social networks, bibliographical networks and knowledge graphs). However, these applications still introduce new challenges with large-scale graph processing. Therefore, recently, we have witnessed a remarkable growth in the prevalence of work on graph processing in both academia and industry. Querying and processing large graphs is an interesting and challenging task. Recently, several centralized/distributed large-scale graph processing frameworks have been developed. However, they mainly focus on batch graph analytics. On the other hand, the state-of-the-art graph databases can't sustain efficient distributed querying for large graphs with complex queries. In particular, online large scale graph querying engines are still limited. In this paper, we present a research plan shipped with the state-of-the-art techniques for large-scale property graph querying and processing. We present our goals and initial results for querying and processing large property graphs based on the emerging and promising Apache Spark framework, a de facto standard platform
    Figure 1: A simple example of a Property Graph
    graph data following the core principles of relational database systems [10]. Popular graph databases include Neo4j, Titan, ArangoDB and HyperGraphDB among many others. In general, graphs can be represented in different data models [1]. In practice, the two most commonly-used graph data models are: Edge-Directed/Labelled graph (e.g.
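A property graph of the kind Figure 1 illustrates, with properties on vertices and on directed, labelled edges, can be written down directly in Spark GraphX. A minimal sketch (the two-vertex graph is invented):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object PropertyGraphSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("property-graph").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Vertices carry properties: (id, (name, role))
    val vertices = sc.parallelize(Seq(
      (1L, ("Ann", "author")),
      (2L, ("Bob", "reviewer"))
    ))

    // Directed, labelled edges: Ann -[reviewedBy]-> Bob
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "reviewedBy")))

    val graph = Graph(vertices, edges)

    // Traverse edges together with the properties of both endpoints
    graph.triplets
      .map(t => s"${t.srcAttr._1} -[${t.attr}]-> ${t.dstAttr._1}")
      .collect()
      .foreach(println)

    spark.stop()
  }
}
```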
  • Apache Spark Solution Guide
    Technical White Paper
    Dell EMC PowerStore: Apache Spark Solution Guide
    Abstract: This document provides a solution overview for Apache Spark running on a Dell EMC™ PowerStore™ appliance.
    June 2021, H18663
    Revisions: June 2021, initial release.
    Acknowledgments: Author: Henry Wong
    This document may contain certain words that are not consistent with Dell's current language guidelines. Dell plans to update the document over subsequent future releases to revise these words accordingly. This document may contain language from third-party content that is not under Dell's control and is not consistent with Dell's current guidelines for Dell's own content. When such third-party content is updated by the relevant third parties, this document will be revised accordingly.
    The information in this publication is provided "as is." Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license.
    Copyright © 2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.
    Table of contents
  • HDP 3.1.4 Release Notes Date of Publish: 2019-08-26
    HDP 3.1.4 Release Notes
    Date of Publish: 2019-08-26
    https://docs.hortonworks.com
    Contents:
    - HDP 3.1.4 Release Notes 4
    - Component Versions 4
    - Descriptions of New Features 5
    - Deprecation Notices 6
      - Terminology 6
      - Removed Components and Product Capabilities 6
    - Testing Unsupported Features 6
      - Descriptions of the Latest Technical Preview Features 7
    - Upgrading to HDP 3.1.4 7
    - Behavioral Changes 7
    - Apache Patch Information 11
      - Accumulo
  • Debugging Spark Applications a Study on Debugging Techniques of Spark Developers
    Debugging Spark Applications: A Study on Debugging Techniques of Spark Developers
    Master Thesis, Melike Gecer, from Bern, Switzerland
    Philosophisch-naturwissenschaftliche Fakultät der Universität Bern, May 2020
    Prof. Dr. Oscar Nierstrasz, Dr. Haidar Osman
    Software Composition Group, Institut für Informatik und angewandte Mathematik, University of Bern, Switzerland
    Abstract: Debugging is the main activity to investigate software failures, identify their root causes, and eventually fix them. Debugging distributed systems in particular is burdensome, due to the challenges of managing numerous devices and concurrent operations, detecting the problematic node, lengthy log files, and real-world data being inconsistent. Apache Spark is a distributed framework which is used to run analyses on large-scale data. Debugging Apache Spark applications is difficult as no tool, apart from log files, is available on the market. However, an application may produce a lengthy log file, which is challenging to examine. In this thesis, we aim to investigate various techniques used by developers on a distributed system. In order to achieve that, we interviewed Spark application developers, presented them with buggy applications, and observed their debugging behaviors. We found that most of the time, they formulate hypotheses to allay their suspicions and check the log files as the first thing to do after obtaining an exception message. Afterwards, we use these findings to compose a debugging flow that can help us to understand the way developers debug a project.
  • Adrian Florea Software Architect / Big Data Developer at Pentalog [email protected]
    Adrian Florea
    Software Architect / Big Data Developer at Pentalog
    [email protected]
    Summary: I possess a deep understanding of how to utilize technology to deliver enterprise solutions that meet business requirements. My expertise includes strong hands-on technical skills with large, distributed applications. I am highly skilled in the Enterprise Integration area, various storage architectures (SQL/NoSQL, polyglot persistence), data modelling and, generally, in all major areas of distributed application development. Furthermore, I have proven the ability to manage medium-scale projects, consistently delivering these engagements within time and budget constraints. Specialties: API design (REST, CQRS, Event Sourcing); architecture, design, optimisation and refactoring of large-scale enterprise applications; proposing application architectures and technical solutions; project coordination. Big Data enthusiast with particular interest in Recommender Systems.
    Experience:
    Big Data Developer at Pentalog, March 2017 - Present
    Designing and developing the data ingestion and processing platform for one of the leading actors in multi-channel advertising, a major European and international affiliate network.
    • Programming languages: Scala, Go, Python
    • Storage: Apache Kafka, Hadoop HDFS, Druid
    • Data processing: Apache Spark (Spark Streaming, Spark SQL)
    • Continuous integration: Git (GitLab)
    • Development environments: IntelliJ IDEA, PyCharm
    • Development methodology: Kanban
    Back End Developer at Pentalog, January 2015 - December 2017 (3 years)
  • Big Data Analysis Using Hadoop Lecture 4 Hadoop Ecosystem
    Big Data Analysis using Hadoop
    Lecture 4: Hadoop Ecosystem
    Overview: Hive • HBase • Sqoop • Pig • Mahout / Spark / Flink / Storm • Hadoop & Data Management Architectures
    Hive
    • Data warehousing solution built on top of Hadoop
    • Provides SQL-like query language named HiveQL
    • Minimal learning curve for people with SQL expertise; data analysts are the target audience
    • Ability to bring structure to various data formats
    • Simple interface for ad hoc querying, analyzing and summarizing large amounts of data
    • Access to files on various data stores such as HDFS, etc.
    Website: http://hive.apache.org/
    Download: http://hive.apache.org/downloads.html
    Documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
    • Hive does NOT provide low-latency or real-time queries; even querying small amounts of data may take minutes
    • Designed for scalability and ease of use rather than low-latency responses
    • Translates HiveQL statements into a set of MapReduce jobs which are then executed on a Hadoop cluster (this is changing)
    Hive concepts (re-used from relational databases):
    • Database: set of tables, used for name-conflict resolution
    • Table: set of rows that have the same schema (same columns)
    • Row: a single record; a set of columns
    • Column: provides value and type for a single value
    • Tables can be divided up based on partitions and buckets
    Hive – let's work through a simple example (see the sketch after this list):
    1. Create a table
    2. Load data into the table
    3. Query the data
    4. Drop the table
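A sketch of what those four steps might look like as HiveQL statements, issued here through a Hive-enabled Spark session; the table name, columns, and data path are invented:

```scala
import org.apache.spark.sql.SparkSession

object HiveQlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hiveql-example")
      .enableHiveSupport()  // requires a Spark classpath with Hive support
      .getOrCreate()

    // 1. Create a table
    spark.sql("""CREATE TABLE IF NOT EXISTS pageviews (url STRING, hits INT)
                 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','""")

    // 2. Load data into the table (hypothetical HDFS path)
    spark.sql("LOAD DATA INPATH '/data/pageviews.csv' INTO TABLE pageviews")

    // 3. Query the data
    spark.sql("SELECT url, SUM(hits) AS total FROM pageviews GROUP BY url").show()

    // 4. Drop the table
    spark.sql("DROP TABLE pageviews")

    spark.stop()
  }
}
```

The same statements run unchanged in Hive's own CLI or Beeline; the point of the slide is the SQL-like surface, not the engine underneath.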
  • Mobile Storm: Distributed Real-Time Stream Processing for Mobile Clouds
    MOBILE STORM: DISTRIBUTED REAL-TIME STREAM PROCESSING FOR MOBILE CLOUDS
    A Thesis by QIAN NING
    Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE
    Chair of Committee: Radu Stoleru. Committee Members: I-Hong Hou, Jyh-Charn (Steve) Liu. Head of Department: Dilma Da Silva.
    December 2015. Major Subject: Computer Engineering. Copyright 2015 Qian Ning
    ABSTRACT: Recent advances in mobile technologies have enabled a plethora of new applications. The hardware capabilities of mobile devices, however, are still insufficient for real-time stream data processing (e.g., real-time video stream). In order to process real-time streaming data, most existing applications offload the data and computation to a remote cloud service, such as Apache Storm or Apache Spark Streaming. Offloading streaming data, however, has high costs for users, e.g., significant service fees and battery consumption. To address these challenges, we design, implement and evaluate Mobile Storm, the first stream processing platform for mobile clouds, leveraging clusters of local mobile devices to process real-time stream data. In Mobile Storm, we model the workflow of a real-time stream processing job and decompose it into several tasks so that the job can be executed concurrently and in a distributed manner on multiple mobile devices. Mobile Storm was implemented on Android phones and evaluated extensively through a real-time HD video processing application.
  • Analysis of Real Time Stream Processing Systems Considering Latency Patricio Córdova University of Toronto [email protected]
    Analysis of Real Time Stream Processing Systems Considering Latency
    Patricio Córdova, University of Toronto, [email protected]
    Abstract: As the amount of data received by online companies grows, the need to analyze and understand this data as it arrives becomes imperative. There are several systems available that provide real time stream processing capabilities. This document analyzes two of the most notable systems available these days: Spark Streaming [1] and Storm Trident [2]. The two main contributions of this document are: the description and comparison of the main characteristics of both systems, and the analysis of their latency. A benchmark shows the time taken by both systems to perform the same tasks with different data sizes; the benchmark is followed by a breakdown of the times obtained and the rationalization of the reasons behind them.
    Index Terms: analysis, analytics, apache, benchmarking, big data, comparison, database, database management system, databases, dbms, distributed databases, latency, processing, real, real time analytics
    Apache Spark [8], Google's MillWheel [9], Apache Samza [10], Photon [11], and some others. The purpose of this study is to analyze two of the most notable open source platforms already mentioned: Storm and Spark, considering their Trident and Streaming APIs respectively. Both systems offer real-time processing capabilities over micro-batches of tuples, but their functioning mechanisms, failure response, processing speed and many other characteristics are different. The purpose of the study is to provide an overview of many of the high-level features of both systems and to understand and evaluate their processing speed.
    II. THE SYSTEMS
    A.
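For concreteness, the micro-batch model the comparison refers to, shown as a minimal Spark Streaming word count (hypothetical socket source; the 2-second batch interval is arbitrary):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setAppName("microbatch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))  // 2-second micro-batches

    // Hypothetical text source on localhost:9999 (e.g., fed by `nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()  // emits one word-count result per micro-batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each operator runs once per batch, which is where much of the latency such papers measure comes from: a tuple waits, on average, half a batch interval before processing even begins.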