IN ACTION
Petar Zečević
Marko Bonaći

MANNING

MAIN COMPONENTS, VARIOUS RUNTIME INTERACTIONS, AND STORAGE OPTIONS

[Inside-cover diagram showing the main Spark components (Spark Core, Spark SQL, Spark Streaming, Spark ML & MLlib, and Spark GraphX) and their runtime interactions and storage options; the same diagram appears as figure 1.2 in chapter 1.]

RDD EXAMPLE DEPENDENCIES

[Inside-cover diagram of an RDD lineage: parallelize creates listrdd (a ParallelCollectionRDD); map creates pairs (a MapPartitionsRDD) through a one-to-one dependency; reduceByKey creates reduced (a ShuffledRDD) through a shuffle dependency; a second map creates finalrdd (a MapPartitionsRDD) through another one-to-one dependency; and collect returns the results to the driver.]

Spark in Action


Spark in Action

PETAR ZEČEVIĆ
MARKO BONAĆI

MANNING
Shelter Island

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]

©2017 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Marina Michaels
Technical development editor: Andy Hicks
Project editor: Karen Gulliver
Copyeditor: Tiffany Taylor
Proofreader: Elizabeth Martin
Technical proofreaders: Michiel Trimpe, Robert Ormandi
Typesetter: Gordan Salinovic
Cover designer: Marija Tudor

ISBN 9781617292606 Printed in the United States of America 1 2 3 4 5 6 7 8 9 10 – EBM – 21 20 19 18 17 16

To my mother in heaven. —P.Z.

To my dear wife Suzana and our twins, Frane and Luka. —M.B.

brief contents

PART 1 FIRST STEPS ...... 1

1 ■ Introduction to Apache Spark 3
2 ■ Spark fundamentals 18
3 ■ Writing Spark applications 40
4 ■ The Spark API in depth 66

PART 2 MEET THE SPARK FAMILY ...... 103

5 ■ Sparkling queries with Spark SQL 105
6 ■ Ingesting data with Spark Streaming 147
7 ■ Getting smart with MLlib 180
8 ■ ML: classification and clustering 218
9 ■ Connecting the dots with GraphX 254

PART 3 SPARK OPS ...... 283

10 ■ Running Spark 285
11 ■ Running on a Spark standalone cluster 306
12 ■ Running on YARN and Mesos 331



PART 4 BRINGING IT TOGETHER ...... 361

13 ■ Case study: real-time dashboard 363
14 ■ Deep learning on Spark with H2O 383


contents

preface xvii
acknowledgments xix
about this book xx
about the authors xxiv
about the cover xxv

PART 1 FIRST STEPS ...... 1

1 Introduction to Apache Spark 3
  1.1 What is Spark? 4
      The Spark revolution 5 ■ MapReduce’s shortcomings 5 ■ What Spark brings to the table 6
  1.2 Spark components 8
      Spark Core 8 ■ Spark SQL 9 ■ Spark Streaming 10 ■ Spark MLlib 10 ■ Spark GraphX 10
  1.3 Spark program flow 10
  1.4 Spark ecosystem 13
  1.5 Setting up the spark-in-action VM 14
      Downloading and starting the virtual machine 15 ■ Stopping the virtual machine 16
  1.6 Summary 16



2 Spark fundamentals 18
  2.1 Using the spark-in-action VM 19
      Cloning the Spark in Action GitHub repository 20 ■ Finding Java 20 ■ Using the VM’s Hadoop installation 21 ■ Examining the VM’s Spark installation 22
  2.2 Using Spark shell and writing your first Spark program 23
      Starting the Spark shell 23 ■ The first Spark code example 25 ■ The notion of a resilient distributed dataset 27
  2.3 Basic RDD actions and transformations 28
      Using the map transformation 28 ■ Using the distinct and flatMap transformations 30 ■ Obtaining RDD’s elements with the sample, take, and takeSample operations 34
  2.4 Double RDD functions 36
      Basic statistics with double RDD functions 37 ■ Visualizing data distribution with histograms 38 ■ Approximate sum and mean 38
  2.5 Summary 39

3 Writing Spark applications 40
  3.1 Generating a new Spark project in Eclipse 41
  3.2 Developing the application 46
      Preparing the GitHub archive dataset 46 ■ Loading JSON 48 ■ Running the application from Eclipse 50 ■ Aggregating the data 52 ■ Excluding non-employees 53 ■ Broadcast variables 55 ■ Using the entire dataset 57
  3.3 Submitting the application 58
      Building the uberjar 59 ■ Adapting the application 60 ■ Using spark-submit 62
  3.4 Summary 64

4 The Spark API in depth 66
  4.1 Working with pair RDDs 67
      Creating pair RDDs 67 ■ Basic pair RDD functions 68
  4.2 Understanding data partitioning and reducing data shuffling 74
      Using Spark’s data partitioners 74 ■ Understanding and avoiding unnecessary shuffling 76 ■ Repartitioning RDDs 80 ■ Mapping data in partitions 81


  4.3 Joining, sorting, and grouping data 82
      Joining data 82 ■ Sorting data 88 ■ Grouping data 91
  4.4 Understanding RDD dependencies 94
      RDD dependencies and Spark execution 94 ■ Spark stages and tasks 96 ■ Saving the RDD lineage with checkpointing 97
  4.5 Using accumulators and broadcast variables to communicate with Spark executors 97
      Obtaining data from executors with accumulators 97 ■ Sending data to executors using broadcast variables 99
  4.6 Summary 101

PART 2 MEET THE SPARK FAMILY ...... 103

5 Sparkling queries with Spark SQL 105
  5.1 Working with DataFrames 106
      Creating DataFrames from RDDs 108 ■ DataFrame API basics 115 ■ Using SQL functions to perform calculations on data 118 ■ Working with missing values 123 ■ Converting DataFrames to RDDs 124 ■ Grouping and joining data 125 ■ Performing joins 128
  5.2 Beyond DataFrames: introducing DataSets 129
  5.3 Using SQL commands 130
      Table catalog and Hive metastore 131 ■ Executing SQL queries 133 ■ Connecting to Spark SQL through the Thrift server 134
  5.4 Saving and loading DataFrame data 137
      Built-in data sources 138 ■ Saving data 139 ■ Loading data 141
  5.5 Catalyst optimizer 142
  5.6 Performance improvements with Tungsten 144
  5.7 Beyond DataFrames: introducing DataSets 145
  5.8 Summary 145

6 Ingesting data with Spark Streaming 147
  6.1 Writing Spark Streaming applications 148
      Introducing the example application 148 ■ Creating a streaming context 150 ■ Creating a discretized stream 150 ■ Using discretized streams 152 ■ Saving the results to a file 153 ■ Starting and stopping the streaming computation 154 ■ Saving the computation state over time 155 ■ Using window operations for time-limited calculations 162 ■ Examining the other built-in input streams 164


  6.2 Using external data sources 165
      Setting up Kafka 166 ■ Changing the streaming application to use Kafka 167
  6.3 Performance of Spark Streaming jobs 172
      Obtaining good performance 173 ■ Achieving fault-tolerance 175
  6.4 Structured Streaming 176
      Creating a streaming DataFrame 176 ■ Outputting streaming data 177 ■ Examining streaming executions 178 ■ Future direction of structured streaming 178
  6.5 Summary 178

7 Getting smart with MLlib 180
  7.1 Introduction to machine learning 181
      Definition of machine learning 184 ■ Classification of machine-learning algorithms 184 ■ Machine learning with Spark 187
  7.2 Linear algebra in Spark 187
      Local vector and matrix implementations 188 ■ Distributed matrices 192
  7.3 Linear regression 193
      About linear regression 193 ■ Simple linear regression 194 ■ Expanding the model to multiple linear regression 196
  7.4 Analyzing and preparing the data 198
      Analyzing data distribution 199 ■ Analyzing column cosine similarities 200 ■ Computing the covariance matrix 200 ■ Transforming to labeled points 201 ■ Splitting the data 201 ■ Feature scaling and mean normalization 202
  7.5 Fitting and using a linear regression model 202
      Predicting the target values 203 ■ Evaluating the model’s performance 204 ■ Interpreting the model parameters 204 ■ Loading and saving the model 205
  7.6 Tweaking the algorithm 205
      Finding the right step size and number of iterations 206 ■ Adding higher-order polynomials 207 ■ Bias-variance tradeoff and model complexity 209 ■ Plotting residual plots 211 ■ Avoiding overfitting by using regularization 212 ■ K-fold cross-validation 213
  7.7 Optimizing linear regression 214
      Mini-batch stochastic gradient descent 214 ■ LBFGS optimizer 216
  7.8 Summary 217


8 ML: classification and clustering 218
  8.1 Spark ML library 219
      Estimators, transformers, and evaluators 220 ■ ML parameters 220 ■ ML pipelines 221
  8.2 Logistic regression 221
      Binary logistic regression model 222 ■ Preparing data to use logistic regression in Spark 224 ■ Training the model 229 ■ Evaluating classification models 231 ■ Performing k-fold cross-validation 234 ■ Multiclass logistic regression 236
  8.3 Decision trees and random forests 238
      Decision trees 239 ■ Random forests 244
  8.4 Using k-means clustering 246
      K-means clustering 247
  8.5 Summary 252

9 Connecting the dots with GraphX 254
  9.1 Graph processing with Spark 255
      Constructing graphs using GraphX API 256 ■ Transforming graphs 257
  9.2 Graph algorithms 263
      Presentation of the dataset 263 ■ Shortest-paths algorithm 264 ■ Page rank 265 ■ Connected components 266 ■ Strongly connected components 268
  9.3 Implementing the A* search algorithm 269
      Understanding the A* algorithm 270 ■ Implementing the A* algorithm 272 ■ Testing the implementation 279
  9.4 Summary 280

PART 3 SPARK OPS ...... 283

10 Running Spark 285
  10.1 An overview of Spark’s runtime architecture 286
       Spark runtime components 286 ■ Spark cluster types 288
  10.2 Job and resource scheduling 289
       Cluster resource scheduling 290 ■ Spark job scheduling 290 ■ Data-locality considerations 293 ■ Spark memory scheduling 294


  10.3 Configuring Spark 295
       Spark configuration file 295 ■ Command-line parameters 295 ■ System environment variables 296 ■ Setting configuration programmatically 296 ■ The master parameter 297 ■ Viewing all configured parameters 297
  10.4 Spark web UI 297
       Jobs page 298 ■ Stages page 300 ■ Storage page 302 ■ Environment page 302 ■ Executors page 303
  10.5 Running Spark on the local machine 303
       Local mode 304 ■ Local cluster mode 305
  10.6 Summary 305

11 Running on a Spark standalone cluster 306
  11.1 Spark standalone cluster components 307
  11.2 Starting the standalone cluster 308
       Starting the cluster with shell scripts 309 ■ Starting the cluster manually 312 ■ Viewing Spark processes 313 ■ Standalone master high availability and recovery 313
  11.3 Standalone cluster web UI 315
  11.4 Running applications in a standalone cluster 317
       Location of the driver 317 ■ Specifying the number of executors 318 ■ Specifying extra classpath entries and files 319 ■ Killing applications 320 ■ Application automatic restart 320
  11.5 Spark History Server and event logging 321
  11.6 Running on Amazon EC2 322
       Prerequisites 323 ■ Creating an EC2 standalone cluster 324 ■ Using the EC2 cluster 327 ■ Destroying the cluster 329
  11.7 Summary 329

12 Running on YARN and Mesos 331
  12.1 Running Spark on YARN 332
       YARN architecture 332 ■ Installing, configuring, and starting YARN 334 ■ Resource scheduling in YARN 335 ■ Submitting Spark applications to YARN 336 ■ Configuring Spark on YARN 338 ■ Configuring resources for Spark jobs 339 ■ YARN UI 341 ■ Finding logs on YARN 343 ■ Security considerations 344 ■ Dynamic resource allocation 344


  12.2 Running Spark on Mesos 345
       Mesos architecture 346 ■ Installing and configuring Mesos 349 ■ Mesos web UI 351 ■ Mesos resource scheduling 353 ■ Submitting Spark applications to Mesos 354 ■ Running Spark with Docker 356
  12.3 Summary 359

PART 4 BRINGING IT TOGETHER ...... 361

13 Case study: real-time dashboard 363
  13.1 Understanding the use case 364
       The overall picture 364 ■ Understanding the application’s components 366
  13.2 Running the application 367
       Starting the application in the spark-in-action VM 368 ■ Starting the application manually 372
  13.3 Understanding the source code 374
       The KafkaLogsSimulator project 374 ■ The StreamingLogAnalyzer project 374 ■ The WebStatsDashboard project 381 ■ Building the projects 382
  13.4 Summary 382

14 Deep learning on Spark with H2O 383
  14.1 What is deep learning? 384
  14.2 Using H2O with Spark 385
       What is H2O? 386 ■ Starting Sparkling Water on Spark 386 ■ Starting the H2O cluster 389 ■ Accessing the Flow UI 390
  14.3 Performing regression with H2O’s deep learning 391
       Loading data into an H2O frame 391 ■ Building and evaluating a deep-learning model using the Flow UI 394 ■ Building and evaluating a deep-learning model using the Sparkling Water API 398
  14.4 Performing classification with H2O’s deep learning 402
       Loading and splitting the data 402 ■ Building the model through the Flow UI 403 ■ Building the model with the Sparkling Water API 406 ■ Stopping the H2O cluster 406
  14.5 Summary 407

appendix A Installing Apache Spark 409
appendix B Understanding MapReduce 415
appendix C A primer on linear algebra 418

index 423

preface

Looking back at the last year and a half, I can’t help but wonder: how on Earth did I manage to survive this? These were the busiest 18 months of my life! Ever since Manning asked Marko and me to write a book about Spark, I have spent most of my free time on Apache Spark. And that made this period all the more interesting. I learned a lot, and I can honestly say it was worth it.

Spark is a super-hot topic these days. It was conceived in Berkeley, California, in 2009 by Matei Zaharia (initially as an attempt to prove the Mesos execution platform feasible) and was open sourced in 2010. In 2013, it was donated to the Apache Software Foundation, and it has been the target of lightning-fast development ever since. In 2015, Spark was one of the most active Apache projects and had more than 1,000 contributors. Today, it’s a part of all major Hadoop distributions and is used by many organizations, large and small, throughout the world in all kinds of applications.

The trouble with writing a book about a project such as Spark is that it develops very quickly. Since we began writing Spark in Action, we’ve seen six minor releases of Spark, with many new, important features that needed to be covered. The first major release (version 2.0) came out after we’d finished writing most of the book, and we had to delay publication to cover the new features that came with it.

Another challenge when writing about Spark is the breadth of the topic: Spark is more of a platform than a framework. You can use it to write all kinds of applications (in four languages!): batch jobs, real-time processing systems and web applications executing Spark jobs, processing structured data using SQL and unstructured data using traditional programming techniques, various machine learning and data-munging tasks,



interacting with distributed file systems, various relational and no-SQL databases, real-time systems, and so on. Then there are the runtime aspects—installing, configuring, and running Spark—which are equally relevant.

We tried to do justice to these important topics and make this book a thorough but gentle guide to using Spark. We hope you’ll enjoy it.

acknowledgments

Our technical proofreader, Michiel Trimpe, made countless valuable suggestions. Thanks, too, to Robert Ormandi for reviewing chapter 7. We would also like to thank the reviewers of Spark in Action, including Andy Kirsch, Davide Fiorentino, lo Regio, Dimitris Kouzis-Loukas, Gaurav Bhardwaj, Ian Stirk, Jason Kolter, Jeremy Gailor, John Guthrie, Jonathan Miller, Jonathan Sharley, Junilu Lacar, Mukesh Kumar, Peter J. Krey Jr., Pranay Srivastava, Robert Ormandi, Rodrigo Abreu, Shobha Iyer, and Sumit Pal.

We want to thank the people at Manning who made this book possible: publisher Marjan Bace and the Manning reviewers and the editorial team, especially Marina Michaels, for guidance on writing a higher-quality book. We would also like to thank the production team for their work in ushering the project through to completion.

Petar Zečević

I’d like to thank my wife for her continuous support and patience in all my endeavors. I owe thanks to my parents, for raising me with love and giving me the best learning environment possible. And finally, I’d like to thank my company, SV Group, for providing the resources and time necessary for me to write this book.

Marko Bonaći

I’d like to thank my co-author, Petar. Without his perseverance, this book would not have been written.


about this book

Apache Spark is a general data processing framework. That means you can use it for all kinds of computing tasks. And that means any book on Apache Spark needs to cover a lot of different topics. We’ve tried to describe all aspects of using Spark: from configuring runtime options and running standalone and interactive jobs, to writing batch, streaming, or machine learning applications. And we’ve tried to pick examples and example data sets that can be run on your personal computer, that are easy to understand, and that illustrate the concepts well. We hope you’ll find this book and the examples useful for understanding how to use and run Spark and that it will help you write future, production-ready Spark applications.

Who should read this book

Although the book contains lots of material appropriate for business users and managers, it’s mostly geared toward developers—or, rather, people who are able to understand and execute code. The Spark API can be used in four languages: Scala, Java, Python, and R. The primary examples in the book are written in Scala (Java and Python versions are available at the book’s website, www.manning.com/books/spark-in-action, and in our online GitHub repository at https://github.com/spark-in-action/first-edition), but we don’t assume any prior knowledge of Scala, and we explain Scala specifics throughout the book. Nevertheless, it will be beneficial if you have Java or Scala skills before starting the book. We list some resources to help with that in chapter 2.



Spark can interact with many systems, some of which are covered in the book. To fully appreciate the content, knowledge of the following topics is preferable (but not required):

■ SQL and JDBC (chapter 5)
■ Hadoop (HDFS and YARN, chapters 5 and 12)
■ Kafka (chapter 6)
■ Amazon EC2 (chapter 11)
■ Basics of linear algebra, and the ability to understand mathematical formulas (chapters 7 and 8)
■ Mesos (chapter 12)

We’ve prepared a virtual machine to make it easy for you to run the examples in the book. In order to use it, your computer should meet the software and hardware prerequisites listed in chapter 1.

How this book is organized

This book has 14 chapters, organized in 4 parts. Part 1 introduces Apache Spark and its rich API. An understanding of this information is important for writing high-quality Spark programs and is an excellent foundation for the rest of the book:

■ Chapter 1 roughly describes Spark’s main features and compares them with Hadoop’s MapReduce and other tools from the Hadoop ecosystem. It also includes a description of the spark-in-action virtual machine, which you can use to run the examples in the book.
■ Chapter 2 further explores the virtual machine, teaches you how to use Spark’s command-line interface (the spark-shell), and uses several examples to explain resilient distributed datasets (RDDs): the central abstraction in Spark.
■ In chapter 3, you’ll learn how to set up Eclipse to write standalone Spark applications. Then you’ll write an application for analyzing GitHub logs and execute the application by submitting it to a Spark cluster.
■ Chapter 4 explores the Spark core API in more detail. Specifically, it shows how to work with key-value pairs and explains how data partitioning and shuffling work in Spark. It also teaches you how to group, sort, and join data, and how to use accumulators and broadcast variables.

In part 2, you’ll get to know other components that make up Spark, including Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX:

■ Chapter 5 introduces Spark SQL. You’ll learn how to create and use DataFrames, how to use SQL to query DataFrame data, and how to load data to and save it from external data sources. You’ll also learn about optimizations done by Spark’s SQL Catalyst optimization engine and about performance improvements introduced with the Tungsten project.
■ Spark Streaming, one of the more popular Spark family members, is introduced in chapter 6. You’ll learn about discretized streams, which periodically produce RDDs as a streaming application is running. You’ll also learn how to save computation state over time and how to use window operations. We’ll


examine ways of connecting to Kafka and how to obtain good performance from your streaming jobs. We’ll also talk about structured streaming, a new concept included in Spark 2.0.
■ Chapters 7 and 8 are about machine learning, specifically about the Spark MLlib and Spark ML sections of the Spark API. You’ll learn about machine learning in general and about linear regression, logistic regression, decision trees, random forests, and k-means clustering. Along the way, you’ll scale and normalize features, use regularization, and train and evaluate machine learning models. We’ll explain API standardizations brought by Spark ML.
■ Chapter 9 explores how to build graphs with Spark’s GraphX API. You’ll transform and join graphs, use graph algorithms, and implement the A* search algorithm using the GraphX API.

Using Spark isn’t just about writing and running Spark applications. It’s also about configuring Spark clusters and system resources to be used efficiently by applications. Part 3 explains the necessary concepts and configuration options for running Spark applications on Spark standalone, Hadoop YARN, and Mesos clusters:

■ Chapter 10 explores Spark runtime components, Spark cluster types, job and resource scheduling, configuring Spark, and the Spark web UI. These are concepts common to all cluster managers Spark can run on: the Spark standalone cluster, YARN, and Mesos. The two local modes are also explained in chapter 10.
■ You’ll learn about the Spark standalone cluster in chapter 11: its components, how to start it and run applications on it, and how to use its web UI. The Spark History server, which keeps details about previously run jobs, is also discussed. Finally, you’ll learn how to use Spark’s scripts to start up a Spark standalone cluster on Amazon EC2.
■ Chapter 12 goes through the specifics of setting up, configuring, and using YARN and Mesos clusters to run Spark applications.

Part 4 covers higher-level aspects of using Spark:

■ Chapter 13 brings it all together and explores a Spark streaming application for analyzing log files and displaying the results on a real-time dashboard. The application implemented in chapter 13 can be used as a basis for your own future applications.
■ Chapter 14 introduces H2O, a scalable and fast machine-learning framework with implementations of many machine-learning algorithms, most notably deep learning, which Spark lacks; and Sparkling Water, H2O’s package that enables you to start and use an H2O cluster from Spark. Through Sparkling Water, you can use Spark’s Core, SQL, Streaming, and GraphX components to ingest, prepare, and analyze data, and transfer it to H2O to be used in H2O’s deep-learning algorithms. You can then transfer the results back to Spark and use them in subsequent computations.


Appendix A gives you instructions for installing Spark. Appendix B provides a short overview of MapReduce. And appendix C is a short primer on linear algebra.

About the code

All source code in the book is presented in a mono-spaced typeface like this, which sets it off from the surrounding text. In many listings, the code is annotated to point out key concepts, and numbered bullets are sometimes used in the text to provide additional information about the code.

Source code in Scala, Java, and Python, along with the data files used in the examples, are available for download from the publisher’s website at www.manning.com/books/spark-in-action and from our online repository at https://github.com/spark-in-action/first-edition. The examples were written for and tested with Spark 2.0.

Author Online

Purchase of Spark in Action includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the lead author and from other users. To access the forum and subscribe to it, point your web browser to www.manning.com/books/spark-in-action. This page provides information on how to get on the forum once you’re registered, what kind of help is available, and the rules of conduct on the forum.

Manning’s commitment to our readers is to provide a venue where a meaningful dialog between individual readers and between readers and the authors can take place. It isn’t a commitment to any specific amount of participation on the part of the authors, whose contribution to the Author Online forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

about the authors

Petar Zečević has been working in the software industry for more than 15 years. He started as a Java developer and has since worked on many projects as a full-stack developer, consultant, analyst, and team leader. He currently occupies the role of CTO for SV Group, a Croatian software company working for large Croatian banks, government institutions, and private companies. Petar organizes monthly Apache Spark Zagreb meetups, regularly speaks at conferences, and has several Apache Spark projects behind him.

Marko Bonaći has worked with Java for 13 years. He works for Sematext as a Spark developer and consultant. Before that, he was team lead for SV Group’s IBM Enterprise Content Management team.


about the cover

The figure on the cover of Spark in Action is captioned “Hollandais” (a Dutchman). The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797. Each illustration is finely drawn and colored by hand.

The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress.

The way we dress has changed since then, and the diversity by region, so rich at the time, has faded away. It’s now hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.

At a time when it’s hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.


Part 1 First steps

We begin this book with an introduction to Apache Spark and its rich API. Understanding the information in part 1 is important for writing high-quality Spark programs and is an excellent foundation for the rest of the book.

Chapter 1 roughly describes Spark’s main features and compares them with Hadoop’s MapReduce and other tools from the Hadoop ecosystem. It also includes a description of the spark-in-action virtual machine we’ve prepared for you, which you can use to run the examples in the book.

Chapter 2 further explores the VM, teaches you how to use Spark’s command-line interface (spark-shell), and uses several examples to explain resilient distributed datasets (RDDs)—the central abstraction in Spark.

In chapter 3, you’ll learn how to set up Eclipse to write standalone Spark applications. Then you’ll write such an application to analyze GitHub logs and execute the application by submitting it to a Spark cluster.

Chapter 4 explores the Spark core API in more detail. Specifically, it shows you how to work with key-value pairs and explains how data partitioning and shuffling work in Spark. It also teaches you how to group, sort, and join data, and how to use accumulators and broadcast variables.

Introduction to Apache Spark

This chapter covers
■ What Spark brings to the table
■ Spark components
■ Spark program flow
■ Spark ecosystem
■ Downloading and starting the spark-in-action virtual machine

Apache Spark is usually defined as a fast, general-purpose, distributed computing platform. Yes, it sounds a bit like marketing speak at first glance, but we could hardly come up with a more appropriate label to put on the Spark box.

Apache Spark really did bring a revolution to the big data space. Spark makes efficient use of memory and can execute equivalent jobs 10 to 100 times faster than Hadoop’s MapReduce. On top of that, Spark’s creators managed to abstract away the fact that you’re dealing with a cluster of machines, and instead present you with a set of collections-based APIs. Working with Spark’s collections feels like working



with local Scala, Java, or Python collections, but Spark’s collections reference data distributed on many nodes. Operations on these collections get translated to complicated parallel programs without the user being necessarily aware of the fact, which is a truly powerful concept.

In this chapter, we first shed light on the main Spark features and compare Spark to its natural predecessor: Hadoop’s MapReduce. Then we briefly explore Hadoop’s ecosystem—a collection of tools and languages used together with Hadoop for big data operations—to see how Spark fits in. We give you a brief overview of Spark’s components and show you how a typical Spark program executes using a simple “Hello World” example. Finally, we help you download and set up the spark-in-action virtual machine we prepared for running the examples in the book.

We’ve done our best to write a comprehensive guide to Spark architecture, its components, its runtime environment, and its API, while providing concrete examples and real-life case studies. By reading this book and, more important, by sifting through the examples, you’ll gain the knowledge and skills necessary for writing your own high-quality Spark programs and managing Spark applications.

1.1 What is Spark?

Apache Spark is an exciting new technology that is rapidly superseding Hadoop’s MapReduce as the preferred big data processing platform. Hadoop is an open source, distributed, Java computation framework consisting of the Hadoop Distributed File System (HDFS) and MapReduce, its execution engine. Spark is similar to Hadoop in that it’s a distributed, general-purpose computing platform. But Spark’s unique design, which allows for keeping large amounts of data in memory, offers tremendous performance improvements. Spark programs can be 100 times faster than their MapReduce counterparts.

Spark was originally conceived at Berkeley’s AMPLab by Matei Zaharia, who went on to cofound Databricks, together with his mentor Ion Stoica, as well as Reynold Xin, Patrick Wendell, Andy Konwinski, and Ali Ghodsi. Although Spark is open source, Databricks is the main force behind Apache Spark, contributing more than 75% of Spark’s code. It also offers Databricks Cloud, a commercial product for big data analysis based on Apache Spark.

By using Spark’s elegant API and runtime architecture, you can write distributed programs in a manner similar to writing local ones. Spark’s collections abstract away the fact that they’re potentially referencing data distributed on a large number of nodes. Spark also allows you to use functional programming methods, which are a great match for data-processing tasks. By supporting Python, Java, Scala, and, most recently, R, Spark is open to a wide range of users: to the science community that traditionally favors Python and R, to the still-widespread Java community, and to people using the increasingly popular Scala, which offers functional programming on the Java virtual machine (JVM).
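To give a flavor of this collection-style API, here is a minimal sketch of our own (it is not one of the book’s listings, and it assumes a SparkContext named sc, such as the one the Spark shell provides):

val numbers = sc.parallelize(1 to 1000000)    // distribute a local range across the cluster
val squares = numbers.map(n => n.toLong * n)  // a transformation, evaluated lazily
val total = squares.reduce((a, b) => a + b)   // an action, executed in parallel on the nodes
println(total)

Apart from the call to parallelize, the code reads exactly like an operation on an ordinary local collection.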


Finally, Spark combines MapReduce-like capabilities for batch programming, real-time data-processing functions, SQL-like handling of structured data, graph algorithms, and machine learning, all in a single framework. This makes it a one-stop shop for most of your big data-crunching needs. It’s no wonder, then, that Spark is one of the busiest and fastest-growing Apache Software Foundation projects today.

But some applications aren’t appropriate for Spark. Because of its distributed architecture, Spark necessarily brings some overhead to the processing time. This overhead is negligible when handling large amounts of data; but if you have a dataset that can be handled by a single machine (which is becoming ever more likely these days), it may be more efficient to use some other framework optimized for that kind of computation. Also, Spark wasn’t made with online transaction processing (OLTP) applications in mind (fast, numerous, atomic transactions). It’s better suited for online analytical processing (OLAP): batch jobs and data mining.

1.1.1 The Spark revolution

Although the last decade saw Hadoop’s wide adoption, Hadoop is not without its shortcomings. It’s powerful, but it can be slow. This has opened the way for newer technologies, such as Spark, to solve the same challenges Hadoop solves, but more efficiently. In the next few pages, we’ll discuss Hadoop’s shortcomings and how Spark answers those issues.

The Hadoop framework, with its HDFS and MapReduce data-processing engine, was the first that brought distributed computing to the masses. Hadoop solved the three main problems facing any distributed data-processing endeavor:

■ Parallelization—How to perform subsets of the computation simultaneously
■ Distribution—How to distribute the data
■ Fault tolerance—How to handle component failure

NOTE Appendix A describes MapReduce in more detail.

On top of that, Hadoop clusters are often made of commodity hardware, which makes Hadoop easy to set up. That’s why the last decade saw its wide adoption.

1.1.2 MapReduce’s shortcomings

Although Hadoop is the foundation of today’s big data revolution and is actively used and maintained, it still has its shortcomings, and they mostly pertain to its MapReduce component. MapReduce job results need to be stored in HDFS before they can be used by another job. For this reason, MapReduce is inherently bad with iterative algorithms.

Furthermore, many kinds of problems don’t easily fit MapReduce’s two-step paradigm, and decomposing every problem into a series of these two operations can be difficult. The API can be cumbersome at times.


Hadoop is a rather low-level framework, so myriad tools have sprung up around it: tools for importing and exporting data, higher-level languages and frameworks for manipulating data, tools for real-time processing, and so on. They all bring additional complexity and requirements with them, which complicates any environment. Spark solves many of these issues.

1.1.3 What Spark brings to the table

Spark’s core concept is an in-memory execution model that enables caching job data in memory instead of fetching it from disk every time, as MapReduce does. This can speed the execution of jobs up to 100 times,1 compared to the same jobs in MapReduce; it has the biggest effect on iterative algorithms such as machine learning, graph algorithms, and other types of workloads that need to reuse data.

Imagine you have city map data stored as a graph. The vertices of this graph represent points of interest on the map, and the edges represent possible routes between them, with associated distances. Now suppose you need to find a spot for a new ambulance station that will be situated as close as possible to all the points on the map. That spot would be the center of your graph. It can be found by first calculating the shortest path between all the vertices and then finding the farthest point distance (the maximum distance to any other vertex) for each vertex, and finally finding the vertex with the smallest farthest point distance. Completing the first phase of the algorithm, finding the shortest path between all vertices, in a parallel manner is the most challenging (and complicated) part, but it’s not impossible.2

In the case of MapReduce, you’d need to store the results of each of these three phases on disk (HDFS). Each subsequent phase would read the results of the previous one from disk. But with Spark, you can find the shortest path between all vertices and cache that data in memory. The next phase can use that data from memory, find the farthest point distance for each vertex, and cache its results. The last phase can go through this final cached data and find the vertex with the minimum farthest point distance. You can imagine the performance gains compared to reading and writing to disk every time.

Spark performance is so good that in October 2014 it won the Daytona Gray Sort contest and set a world record (jointly with TritonSort, to be fair) by sorting 100 TB in 1,406 seconds (see http://sortbenchmark.org).

SPARK’S EASE OF USE

The Spark API is much easier to use than the classic MapReduce API. To implement the classic word-count example from appendix A as a MapReduce job, you’d need three classes: the main class that sets up the job, a Mapper, and a Reducer, each 10 lines long, give or take a few.

1 See “Shark: SQL and Rich Analytics at Scale” by Reynold Xin et al., http://mng.bz/gFry. 2 See “A Scalable Parallelization of All-Pairs Shortest Path Algorithm for a High Performance Cluster Environ- ment” by T. Srinivasan et al., http://mng.bz/5TMT.


By contrast, the following is all it takes for the same Spark program written in Scala:

val spark = SparkSession.builder().appName("Spark wordcount").getOrCreate()
val file = spark.sparkContext.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Figure 1.1 shows this graphically. Spark supports the Scala, Java, Python, and R programming languages, so it’s accessible to a much wider audience. Although Java is supported, Spark can take advantage of Scala’s versatility, flexibility, and functional programming concepts, which are a much better fit for data analysis. Python and R are widespread among data scientists and in the scientific community, which brings those users on par with Java and Scala developers.

Furthermore, the Spark shell (read-eval-print loop [REPL]) offers an interactive console that can be used for experimentation and idea testing. There’s no need for compilation and deployment just to find out something isn’t working (again). REPL can even be used for launching jobs on the full set of data.
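As a quick illustration of that interactive workflow, here is a small session sketch of our own (not one of the book’s listings); it assumes a local Spark installation with the spark-shell script on the PATH:

$ spark-shell
scala> val nums = sc.parallelize(1 to 1000)   // sc is the SparkContext the shell creates for you
scala> nums.filter(n => n % 2 == 0).count()   // runs as a Spark job and returns 500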


Figure 1.1 A word-count program demonstrates Spark’s conciseness and simplicity. The program is shown implemented in Hadoop’s MapReduce framework on the left and as a Spark Scala program on the right.


Finally, Spark can run on several types of clusters: Spark standalone cluster, Hadoop’s YARN (yet another resource negotiator), and Mesos. This gives it additional flexibility and makes it accessible to a larger community of users.

SPARK AS A UNIFYING PLATFORM

An important aspect of Spark is its combination of the many functionalities of the tools in the Hadoop ecosystem into a single unifying platform. The execution model is general enough that the single framework can be used for stream data processing, machine learning, SQL-like operations, and graph and batch processing. Many roles can work together on the same platform, which helps bridge the gap between programmers, data engineers, and data scientists. And the list of functions that Spark provides is continuing to grow.

SPARK ANTI-PATTERNS

Spark isn’t suitable, though, for asynchronous updates to shared data3 (such as online transaction processing, for example), because it has been created with batch analytics in mind. (Spark streaming is simply batch analytics applied to data in a time window.) Tools specialized for those use cases will still be necessary.

Also, if you don’t have a large amount of data, Spark may not be required, because it needs to spend some time setting up jobs, tasks, and so on. Sometimes a simple relational database or a set of clever scripts can be used to process data more quickly than a distributed system such as Spark. But data has a tendency to grow, and it may outgrow your relational database management system (RDBMS) or your clever scripts rather quickly.

1.2 Spark components

Spark consists of several purpose-built components. These are Spark Core, Spark SQL, Spark Streaming, Spark GraphX, and Spark MLlib, as shown in figure 1.2. These components make Spark a feature-packed unifying platform: it can be used for many tasks that previously had to be accomplished with several different frameworks. A brief description of each Spark component follows.

1.2.1 Spark Core

Spark Core contains basic Spark functionalities required for running jobs and needed by other components. The most important of these is the resilient distributed dataset (RDD),4 which is the main element of the Spark API. It’s an abstraction of a distributed collection of items with operations and transformations applicable to the dataset. It’s resilient because it’s capable of rebuilding datasets in case of node failures.

3 See “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing“ by Matei Zaharia et al., http://mng.bz/57uJ. 4 RDDs are explained in chapter 2. Because they’re the fundamental abstraction of Spark, they’re also covered in detail in chapter 4.


[Figure 1.2 diagram: Spark Core with Spark SQL, Spark Streaming, Spark ML & MLlib, and Spark GraphX built on top of it. Streaming sources include Kafka, Flume, Twitter, HDFS, and ZeroMQ; Spark Streaming uses DStreams to periodically create RDDs and can apply machine-learning models, GraphX features, and Spark SQL to the data it receives. Spark ML uses DataFrames to represent data, Spark MLlib models use RDDs, and both use features from Spark Core. Spark SQL transforms operations on DataFrames to operations on RDDs; data sources include Hive, JSON, relational databases, NoSQL databases, and Parquet files. Filesystems include HDFS, GlusterFS, and Amazon S3. Spark GraphX works with graph RDDs and uses Spark Core features behind the scenes.]

Figure 1.2 Main Spark components and various runtime interactions and storage options

Spark Core contains logic for accessing various filesystems, such as HDFS, GlusterFS, Amazon S3, and so on. It also provides a means of information sharing between computing nodes with broadcast variables and accumulators. Other fundamental functions, such as networking, security, scheduling, and data shuffling, are also part of Spark Core.
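To hint at how these two sharing mechanisms look in code, here is a small sketch of our own (not one of the book’s listings); it assumes a SparkContext named sc and a hypothetical log file path:

val levels = sc.broadcast(Set("ERROR", "WARN"))    // read-only data shipped once to every executor
val skipped = sc.longAccumulator("skipped lines")  // executors add to it; the driver reads the total
sc.textFile("hdfs://path/to/logs").foreach { line =>
  if (!levels.value.exists(l => line.contains(l))) skipped.add(1)
}
println(skipped.value)

Broadcast variables and accumulators are covered in detail in chapter 4.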

1.2.2 Spark SQL

Spark SQL provides functions for manipulating large sets of distributed, structured data using an SQL subset supported by Spark and Hive SQL (HiveQL). With DataFrames introduced in Spark 1.3, and DataSets introduced in Spark 1.6, which simplified handling of structured data and enabled radical performance optimizations, Spark SQL became one of the most important Spark components. Spark SQL can also be used for reading and writing data to and from various structured formats and data sources, such as JavaScript Object Notation (JSON) files, Parquet files (an increasingly popular file format that allows for storing a schema along with the data), relational databases, Hive, and others.
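For a taste of what this looks like, here is a brief sketch of our own (not one of the book’s listings); it assumes a SparkSession named spark and a hypothetical JSON file whose records have name and age fields:

val people = spark.read.json("hdfs://path/to/people.json")  // the schema is inferred from the JSON records
people.createOrReplaceTempView("people")                    // register the DataFrame for SQL queries
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.write.parquet("hdfs://path/to/adults")               // save the result in Parquet format

Chapter 5 covers DataFrames, DataSets, and Spark SQL’s data sources in depth.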


Operations on DataFrames and DataSets at some point translate to operations on RDDs and execute as ordinary Spark jobs. Spark SQL provides a query optimization framework called Catalyst that can be extended by custom optimization rules. Spark SQL also includes a Thrift server, which can be used by external systems, such as business intelligence tools, to query data through Spark SQL using classic JDBC and ODBC protocols.

1.2.3 Spark Streaming

Spark Streaming is a framework for ingesting real-time streaming data from various sources. The supported streaming sources include HDFS, Kafka, Flume, Twitter, ZeroMQ, and custom ones. Spark Streaming operations recover from failure automatically, which is important for online data processing. Spark Streaming represents streaming data using discretized streams (DStreams), which periodically create RDDs containing the data that came in during the last time window.

Spark Streaming can be combined with other Spark components in a single program, unifying real-time processing with machine learning, SQL, and graph operations. This is something unique in the Hadoop ecosystem. And since Spark 2.0, the new Structured Streaming API makes Spark streaming programs more similar to Spark batch programs.

1.2.4 Spark MLlib

Spark MLlib is a library of machine-learning algorithms grown from the MLbase project at UC Berkeley. Supported algorithms include logistic regression, naïve Bayes classification, support vector machines (SVMs), decision trees, random forests, linear regression, and k-means clustering.

Apache Mahout is an existing open source project offering implementations of distributed machine-learning algorithms running on Hadoop. Although Apache Mahout is more mature, both Spark MLlib and Mahout include a similar set of machine-learning algorithms. But with Mahout migrating from MapReduce to Spark, they’re bound to be merged in the future. Spark MLlib handles machine-learning models used for transforming datasets, which are represented as RDDs or DataFrames.

1.2.5 Spark GraphX

Graphs are data structures comprising vertices and the edges connecting them. GraphX provides functions for building graphs, represented as graph RDDs: EdgeRDD and VertexRDD. GraphX contains implementations of the most important algorithms of graph theory, such as page rank, connected components, shortest paths, SVD++, and others. It also provides the Pregel message-passing API, the same API for large-scale graph processing implemented by Apache Giraph, a project with implementations of graph algorithms running on Hadoop.

1.3 Spark program flow

Let’s see what a typical Spark program looks like. Imagine that a 300 MB log file is stored in a three-node HDFS cluster. HDFS automatically splits the file into 128 MB



Figure 1.3 Storing a 300 MB log file in a three-node Hadoop cluster

parts (blocks, in Hadoop terminology) and places each part on a separate node of the cluster5 (see figure 1.3). Let’s assume Spark is running on YARN, inside the same Hadoop cluster.

A Spark data engineer is given the task of analyzing how many errors of type OutOfMemoryError have happened during the last two weeks. Mary, the engineer, knows that the log file contains the last two weeks of logs of the company’s application server cluster. She sits at her laptop and starts to work. She first starts her Spark shell and establishes a connection to the Spark cluster. Next, she loads the log file from HDFS (see figure 1.4) by using this (Scala) line:

val lines = sc.textFile("hdfs://path/to/the/file")


Figure 1.4 Loading a text file from HDFS

5 Although it’s not relevant to our example, we should probably mention that HDFS replicates each block to two additional nodes (if the default replication factor of 3 is in effect).


To achieve maximum data locality,6 the loading operation asks Hadoop for the locations of each block of the log file and then transfers all the blocks into RAM of the cluster’s nodes. Now Spark has a reference to each of those blocks (partitions, in Spark terminology) in RAM. The sum of those partitions is a distributed collection of lines from the log file referenced by an RDD. Simplifying, we can say that RDDs allow you to work with a distributed collection the same way you would work with any local, nondistributed one. You don’t have to worry about the fact that the collection is distributed, nor do you have to handle node failures yourself.

In addition to automatic fault tolerance and distribution, the RDD provides an elaborate API, which allows you to work with a collection in a functional style. You can filter the collection; map over it with a function; reduce it to a cumulative value; subtract, intersect, or create a union with another RDD, and so on.

Mary now has a reference to the RDD, so in order to find the error count, she first wants to remove all the lines that don’t have an OutOfMemoryError substring. This is a job for the filter function, which she calls like this:

val oomLines = lines.filter(l => l.contains("OutOfMemoryError")).cache()

After filtering the collection so it contains the subset of data that she needs to analyze (see figure 1.5), Mary calls cache on it, which tells Spark to leave that RDD in memory across jobs. Caching is the basic component of Spark’s performance improvements we mentioned before. The benefits of caching the RDD will become apparent later.

Now she is left with only those lines that contain the error substring. For this simple example, we’ll ignore the possibility that the OutOfMemoryError string might occur in multiple lines of a single error. Our data engineer counts the remaining lines


Figure 1.5 Filtering the collection to contain only lines containing the OutOfMemoryError string

6 Data locality is honored if each block gets loaded in the RAM of the same node where it resides in HDFS. The whole point is to try to avoid having to transfer large amounts of data over the wire.


and reports the result as the number of out-of-memory errors that occurred in the last two weeks:

val result = oomLines.count()

Spark enabled her to perform distributed filtering and counting of the data with only three lines of code. Her little program was executed on all three nodes in parallel.

If she now wants to further analyze lines with OutOfMemoryErrors, and perhaps call filter again (but with other criteria) on an oomLines object that was previously cached in memory, Spark won’t load the file from HDFS again, as it would normally do. Spark will load it from the cache.

1.4 Spark ecosystem

We’ve already mentioned the Hadoop ecosystem, consisting of interface, analytic, cluster-management, and infrastructure tools. Some of the most important ones are shown in figure 1.6.

[Figure 1.6 diagram: Hadoop ecosystem tools (among them Sqoop, Flume, Chukwa, Mahout, Storm, Giraph, Pig, Hive, Impala, Drill, Ambari, Oozie, HBase, ZooKeeper, YARN, and HDFS) grouped into interface tools that can be used to transfer data between Hadoop and other systems, analytic tools providing data-transformation and -manipulation functions, a cluster-management tool, and infrastructure tools providing basic data storage, synchronization, and scheduling functions.]

Figure 1.6 Basic infrastructure, interface, analytic, and management tools in the Hadoop ecosystem, with some of the functionalities that Spark incorporates or makes obsolete

7 If you’re interested, you can find a (hopefully) complete list of Hadoop-related tools and frameworks at http://hadoopecosystemtable.github.io.


used instead of Apache Mahout. Apache Storm’s capabilities overlap greatly with those of Spark Streaming, so in many cases Spark Streaming can be used instead. Apache Flume and Apache Sqoop aren’t needed any longer, because the same functionalities are covered by Spark Core and Spark SQL. But even if you have legacy Pig workflows and need to run Pig, the Spork project enables you to run Pig on Spark.

Spark has no means of replacing the infrastructure and management of the Hadoop ecosystem tools (Oozie, HBase, and ZooKeeper), though. Oozie is used for scheduling different types of Hadoop jobs and now even has an extension for scheduling Spark jobs. HBase is a distributed and scalable database, which is something Spark doesn’t provide. ZooKeeper provides fast and robust implementation of common functionalities many distributed applications need, like coordination, distributed synchronization, naming, and provisioning of group services. It is used for these purposes in many other distributed systems, too.

Impala and Drill can coexist alongside Spark, especially with Drill’s coming support for Spark as an execution engine. But they’re more like competing frameworks, mostly spanning the features of Spark Core and Spark SQL, which makes Spark feature-richer (pun not intended).

We said earlier that Spark doesn’t need to use HDFS storage. In addition to HDFS, Spark can operate on data stored in Amazon S3 buckets and plain files. More exciting, it can also use Alluxio (formerly Tachyon), which is a memory-centric distributed filesystem, or other distributed filesystems, such as GlusterFS.

Another interesting fact is that Spark doesn’t have to run on YARN. Apache Mesos and the Spark standalone cluster are alternative cluster managers for Spark. Apache Mesos is an advanced distributed systems kernel bringing distributed resource abstractions. It can scale to tens of thousands of nodes with full fault tolerance (we’ll visit it in chapter 12). Spark Standalone is a Spark-specific cluster manager that is used in production today on multiple sites.

So if we switch from MapReduce to Spark and get rid of YARN and all the tools that Spark makes obsolete, what’s left of the Hadoop ecosystem? To put it another way: Are we slowly moving toward a new big data standard: a Spark ecosystem?

1.5 Setting up the spark-in-action VM

In order to make it easy for you to set up a Spark learning environment, we prepared a virtual machine (VM) that you’ll be using throughout this book. It will allow you to run all the examples from the book without surprises due to different versions of Java, Spark, or your OS. For example, you could have problems running the Spark examples on Windows; after all, Spark is developed on OS X and Linux, so, understandably, Windows isn’t exactly in the focus. The VM will guarantee we’re all on the same page, so to speak.

The VM consists of the following software stack:

■ 64-bit Ubuntu OS, 14.04.4 (nicknamed Trusty)—Currently the latest version with long-term support (LTS).


■ Java 8 (OpenJDK)—Even if you plan on only using Spark from Python, you have to install Java, because Spark’s Python API communicates with Spark running in a JVM.
■ Hadoop 2.7.2—Hadoop isn’t a hard requirement for using Spark. You can save and load files from your local filesystem, if you’re running a local cluster, which is the case with our VM. But as soon as you set up a truly distributed Spark cluster, you’ll need a distributed filesystem, such as Hadoop’s HDFS. Hadoop installation will also come in handy in chapter 12 for trying out the methods of running Spark on YARN, Hadoop’s execution environment.
■ Spark 2.0—We included the latest Spark version at the time this book was finished. You can easily upgrade the Spark version in the VM, if you wish to do so, by following the instructions in chapter 2.
■ Kafka 0.8.2—Kafka is a distributed messaging system, used in chapters 6 and 13.

We chose Ubuntu because it’s a popular Linux distribution and Linux is the preferred Spark platform. If you’ve never worked with Ubuntu before, this could be your chance to start. We’ll guide you, explaining commands and concepts as you progress through the chapters. Here we’ll explain only the basics: how to download, start, and stop the VM. We’ll go into more details about using it in the next chapter.

1.5.1 Downloading and starting the virtual machine

To run the VM, you’ll need a 64-bit OS with at least 3 GB of free memory and 15 GB of free disk space. You first need to install these two software packages for your platform:

■ Oracle VirtualBox—Oracle’s free, open source hardware virtualization software (www.virtualbox.org)
■ Vagrant—HashiCorp’s software for configuring portable development environments (www.vagrantup.com/downloads.html)

When you have these two installed, create a folder for hosting the VM (called, for example, spark-in-action), and enter it. Then download the Vagrant box metadata JSON file from our online repository. You can download it manually or use the wget command on Linux or Mac:

$ wget https://raw.githubusercontent.com/spark-in-action/first-edition/ ➥ master/spark-in-action-box.json

Then issue the following command to download the VM itself:

$ vagrant box add spark-in-action-box.json

The Vagrant box metadata JSON file points to the Vagrant box file. The command will download the 5 GB VM box (this will probably take some time) and register it as the manning/spark-in-action Vagrant box. To use it, initialize the Vagrant VM in the current directory by issuing this command:

$ vagrant init manning/spark-in-action


Finally, start the VM with the vagrant up command (this will also allocate approximately 10 GB of disk space):

$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Checking if box 'manning/spark-in-action' is up to date...
==> default: Clearing any previously set forwarded ports...
==> default: Clearing any previously set network interfaces...
...

If you have several network interfaces on your machine, you'll be asked to choose one of them for connecting to the VM. Choose the one with access to the internet. For example:

==> default: Available bridged network interfaces:
1) 1x1 11b/g/n Wireless LAN PCI Express Half Mini Card Adapter
2) Cisco Systems VPN Adapter for 64-bit Windows
==> default: When choosing an interface, it is usually the one that is
==> default: being used to connect to the internet.
    default: Which interface should the network bridge to? 1
==> default: Preparing network interfaces based on configuration...
...

1.5.2 Stopping the virtual machine

You'll learn how to use the VM in the next chapter. For now, we'll only show you how to stop it. To power off the VM, issue the following command:

$ vagrant halt

This will stop the machine but preserve your work. If you wish to completely remove the VM and free up its space, you need to destroy it:

$ vagrant destroy

You can also remove the downloaded Vagrant box, which was used to create the VM, with this command:

$ vagrant box remove manning/spark-in-action

But we hope you won't feel the need for that for quite some time.

1.6 Summary

■ Apache Spark is an exciting new technology that is rapidly superseding Hadoop's MapReduce as the preferred big data processing platform.
■ Spark programs can be 100 times faster than their MapReduce counterparts.
■ Spark supports the Java, Scala, Python, and R languages.
■ Writing distributed programs with Spark is similar to writing local Java, Scala, or Python programs.


■ Spark provides a unifying platform for batch programming, real-time data-processing functions, SQL-like handling of structured data, graph algorithms, and machine learning, all in a single framework.
■ Spark isn't appropriate for small datasets, nor should you use it for OLTP applications.
■ The main Spark components are Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX.
■ RDDs are Spark's abstraction of distributed collections.
■ Spark supersedes some of the tools in the Hadoop ecosystem.
■ You'll use the spark-in-action VM to run the examples in this book.

Spark fundamentals

This chapter covers
■ Exploring the spark-in-action VM
■ Managing multiple Spark versions
■ Getting to know Spark's command-line interface (spark-shell)
■ Playing with simple examples in spark-shell
■ Exploring RDD actions and transformations and double functions

It's finally time to get down to business. In this chapter, you'll start using the VM we prepared for you and write your first Spark programs. All you need is a laptop or a desktop machine with a usable internet connection and the prerequisites described in chapter 1.

To avoid overwhelming you this early in the book with various options for running Spark, for now you'll be using the so-called Spark standalone local cluster. Standalone means Spark is using its own cluster manager (rather than Mesos or Hadoop's YARN). Local means the whole system is running locally—that is, on your laptop or a desktop machine. We'll talk extensively about Spark running modes and deployment options in the second part of the book. Strap in: things are about to get real!



Rest assured, we aren't assuming any prior Spark or Scala knowledge; in this chapter, you'll start slowly and progress step-by-step, tutorial style, through the process of setting up prerequisites, downloading and installing Spark, and playing with simple code examples in spark-shell (used for accessing Spark from the command prompt).

Although we intend to explain all the Scala specifics throughout the book, we don't have the illusion that you can learn Scala using a book about Spark. Therefore, it might be beneficial to get a dedicated Scala book, such as Nilanjan Raychaudhuri's Scala in Action (Manning, 2013). Or you can use the second edition of Programming in Scala (Artima Inc., 2010), an excellent book by Martin Odersky, father of the Scala programming language. Another awesome, readily available resource we can recommend is Twitter's online Scala School (http://twitter.github.io/scala_school). As you come across a new Scala topic, look it up in your book or online, because that will make it much easier to put things into perspective—especially Scala topics that you need more details on (than we have room to provide).

We hope you've followed our instructions from the last chapter and successfully set up the spark-in-action VM. If for some reason you can't use the VM, check out appendix B for instructions on installing Spark. You'll now use the spark-in-action VM for writing and executing your first Spark program. We guess you're eager to start, so let's get to it!

2.1 Using the spark-in-action VM

To start using the VM, change to the folder where you put Vagrantfile, and, if it's not already running, start the machine with the following command:

$ vagrant up

When the command finishes, you can log in to the VM. Open an SSH connection to the machine, either by issuing Vagrant’s ssh command

$ vagrant ssh

or by using your favorite SSH program (such as ssh on Linux and Mac, or Putty, Kitty, or MobaXTerm if you’re running on Windows) to connect directly to 192.168.10.2, which is the IP address we configured for the spark-in-action VM. Both methods should present the same login prompt. Enter username spark and password spark, and you should be greeted with the following prompt:

Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-85-generic x86_64)
... several omitted lines ...
spark@spark-in-action:~$

You’re in. The first step is behind you!


2.1.1 Cloning the Spark in Action GitHub repository

Before doing anything else, clone our Spark in Action GitHub repository into your home directory by issuing the following command (Git is already installed in the VM):

$ git clone https://github.com/spark-in-action/first-edition

This creates the first-edition folder in your home directory.

2.1.2 Finding Java

We configured the spark user's PATH so that you can easily invoke Java, Hadoop, and Spark commands from wherever you're positioned in the VM. Let's first see where Java is installed. The which command shows the location of the executable file specified, if it can be found in the current PATH:

$ which java
/usr/bin/java

Code formatting and notation
We've established the following notation and formatting rules to distinguish commands entered into the terminal from code entered into the Spark shell, and both of these from terminal and Spark-shell outputs. Terminal commands start with a dollar sign, while code entered into the Spark shell starts with scala>:

$ terminal command
terminal output

scala> a line of code
spark shell output

That’s the default location for system-wide user programs, so it’s hardly surprising. But the file is a symbolic link, which you can trace to Java’s real install location:

spark@spark-in-action:~$ ls -la /usr/bin/java
lrwxrwxrwx 1 root root 22 Apr 19 18:36 /usr/bin/java -> /etc/alternatives/java
spark@spark-in-action:~$ ls -la /etc/alternatives/java
lrwxrwxrwx 1 root root 46 Apr 19 18:36 /etc/alternatives/java -> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

So, the Java install location is /usr/lib/jvm/java-8-openjdk-amd64. The JAVA_HOME variable, which is important for running Hadoop and Spark, has also been set up for you:

$ echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64/jre


Symbolic links
A symbolic link (or symlink) is a reference to a file or a folder. It behaves as though you have access to the same file or folder from two different places in your filesystem. The symlink isn't a copy; it's a reference to the target folder (in the case of a folder symlink) with the ability to navigate inside, as if it were the target folder. Every change you make inside the symlink is applied directly to the target folder and reflected in the symlink. If you were to edit a file symlink using the vi editor, for example, you would be editing the target file, and the changes would be visible in both places.

2.1.3 Using the VM's Hadoop installation

With the spark-in-action VM, you also get a fully functioning Hadoop installation. You'll need it for reading and writing files to and from HDFS and for running YARN later in the book. Hadoop is installed in the folder /usr/local/hadoop. But that is a symlink again, pointing to /opt/hadoop-2.7.2, which is where the Hadoop binaries are located.

Many HDFS shell commands are available in Hadoop, mimicking the usual filesystem commands (for creating, copying, and moving files and folders, and so on). They're issued as arguments to the hadoop fs command. For example, to list the files and folders in the /user HDFS folder, you use the following:

$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x   - spark supergroup          0 2016-04-19 18:49 /user/spark

We don't have the time or space here to explain other Hadoop commands, but you can find the complete Hadoop filesystem command reference in the official documentation: http://mng.bz/Y9FP.

The last command (hadoop fs -ls) works because the spark-in-action VM is configured to automatically start HDFS daemon processes during its startup, so the command can connect to HDFS and query the filesystem. HDFS startup is done by invoking a single script (note that Hadoop's sbin directory isn't on the spark user's PATH):

$ /usr/local/hadoop/sbin/start-dfs.sh

If you wish to stop HDFS daemons, you can invoke the equivalent stop-dfs.sh script:

$ /usr/local/hadoop/sbin/stop-dfs.sh

You should note that the spark user has full access rights (read/write/execute [rwx]) to the /usr/local/hadoop directory, so you won’t have to fiddle with sudo every time you need to make a change (for example, to a configuration file) or start or stop the daemons.


2.1.4 Examining the VM's Spark installation

When installing Spark, you download the appropriate Spark archive from the Spark downloads page (https://spark.apache.org/downloads.html) and unpack it to the folder of your choice. In the spark-in-action VM, similar to Hadoop, Spark is available from the /usr/local/spark folder, which is a symlink pointing to /opt/spark-2.0.0-bin-hadoop2.7, where the Spark binary archive was unpacked. As the folder name suggests, the installed version is 2.0, prebuilt for Hadoop 2.7 or higher, which is what we needed for this VM.

Instead of downloading a prebuilt version, you can build Spark yourself. Please see appendix B for details. The examples in this book were tested with Spark 2.0.0 (the latest version at the time of writing), so make sure you install that version.

MANAGING SPARK RELEASES
Because new versions of Spark are flying out every couple of months, you need a way to manage them so you can have multiple versions installed and easily choose which one to use. By using a symlink in the described way, regardless of the current version of Spark, you can always use /usr/local/spark to reference a Spark installation in all of your programs, scripts, and configuration files. You switch versions by deleting the symlink and creating a new one, pointing to the root installation folder of the Spark version you want to work with.

For example, after unpacking several Spark versions, your /opt folder might contain the following folders:

$ ls /opt | grep spark
spark-1.3.0-bin-hadoop2.4
spark-1.3.1-bin-hadoop2.4
spark-1.4.0-bin-hadoop2.4
spark-1.5.0-bin-hadoop2.4
spark-1.6.1-bin-hadoop2.6
spark-2.0.0-bin-hadoop2.7

To switch from the current version of 2.0 back to 1.6.1, for example, you would remove the current symlink (you would need to use sudo here because the spark user doesn’t have the rights for changing the /usr/local folder)

$ sudo rm -f /usr/local/spark

and then create a new one pointing to version 1.6.1:

$ sudo ln -s /opt/spark-1.6.1-bin-hadoop2.4 /usr/local/spark

The idea is to always refer to the current Spark installation the same way, using the spark symlink.

OTHER SPARK INSTALLATION DETAILS
Many Spark scripts require the SPARK_HOME environment variable to be set. It's already set up for you in the VM, and it points to the spark symlink, as you can check yourself:

$ export | grep SPARK
declare -x SPARK_HOME="/usr/local/spark"


Spark's bin and sbin directories have been added to the spark user's PATH. The spark user is also the owner of the files and folders under /usr/local/spark, so you can change them as necessary without using sudo.

2.2 Using Spark shell and writing your first Spark program

In this section, you'll start the Spark shell and use it to write your first Spark example program. So what is this Spark shell all about?

There are two different ways you can interact with Spark. One way is to write a program in Scala, Java, or Python that uses Spark's library—that is, its API (more on programs in chapter 3). The other is to use the Scala shell or the Python shell.

The shell is primarily used for exploratory data analysis, usually for one-off jobs, because a program written in the shell is discarded after you exit the shell. The other common shell-usage scenario is testing and developing Spark applications. It's much easier to test a hypothesis in a shell (for example, probe a dataset and experiment) than to write an application, submit it to be executed, write results to an output file, and then analyze that output.

Spark shell is also known as Spark REPL, where the REPL acronym stands for read-eval-print loop. It reads your input, evaluates it, prints the result, and then does it all over again—that is, after a command returns a result, it doesn't exit the scala> prompt; it stays ready for your next command (thus loop).

2.2.1 Starting the Spark shell

You should be logged in to the VM as the spark user by now. As we said earlier, Spark's bin directory is in the spark user's PATH, so you should be able to start the Spark shell by entering the following:

$ spark-shell
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1474054368520).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_72-internal)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

And boom! You have a running spark-shell on your machine.

NOTE To write Python programs in the Spark Python shell, type pyspark.


In previous Spark versions, Spark logged all the detailed INFO messages to the console and cluttered the view. That was toned down in later versions, but now those messages, which may be valuable, are no longer available. Let's correct that.

You'll make spark-shell print only errors, but you'll maintain the complete log in the logs/info.log file (relative to the Spark root) for troubleshooting. Exit the shell by typing :quit (or pressing Ctrl-D) and create a log4j.properties file in the conf subfolder, like this:

$ nano /usr/local/spark/conf/log4j.properties

nano is a text editor for UNIX-like systems, available in Ubuntu by default. You are, of course, free to use any other text editor. Copy the contents of the following listing into the newly created log4j.properties file.

Listing 2.1 Contents of Spark’s log4j.properties file

# set global logging severity to INFO (and upwards: WARN, ERROR, FATAL)
log4j.rootCategory=INFO, console, file

# console config (restrict only to ERROR and FATAL)
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.threshold=ERROR
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# file config
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/info.log
log4j.appender.file.MaxFileSize=5MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark=WARN
log4j.logger.org.apache.hadoop=WARN

Exit nano by pressing Ctrl-X and then Y, confirming that you wish to save the file, and press Enter if you’re asked for the file’s name.

LOG4J Although it has been superseded by the logback library and is almost two decades old, log4j is still one of the most widely used Java logging libraries, due to the simplicity of its design.


Then use the same command as before to start the Spark shell:

$ spark-shell

As you can see in the output, you're provided with the Spark context in the form of the sc variable and the Spark session as the spark variable. The Spark context is the entry point for interacting with Spark. You use it for things like connecting to Spark from an application, configuring a session, managing job execution, loading or saving a file, and so on.
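To get a feel for the sc variable, here is a minimal, hedged sketch of a few things you can try at the scala> prompt. These calls (version, master, setLogLevel, parallelize) are part of the standard SparkContext API rather than listings from this book, so treat the comments as expectations, not transcribed output:

scala> sc.version                         // the Spark version the shell is running (2.0.0 in the VM)
scala> sc.master                          // the cluster manager URL, for example local[*]
scala> sc.setLogLevel("ERROR")            // another way to quiet console logging for this session
scala> val nums = sc.parallelize(1 to 5)  // creates a small RDD from a local collection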

2.2.2 The first Spark code example

It's time for your first Spark example. Suppose you want to find out how many third-party libraries that Spark uses are licensed under the BSD license (acronym for Berkeley Software Distribution). Luckily, Spark comes with a file named LICENSE, located in the Spark root directory. The LICENSE file contains the list of all libraries used by Spark and the licenses they're provided under. Lines in the file that name packages licensed under the BSD license contain the word BSD. You could easily use a Linux shell command to count those lines, but that's not the point. Let's see how you can ingest that file and count the lines using the Spark API:

scala> val licLines = sc.textFile("/usr/local/spark/LICENSE")
licLines: org.apache.spark.rdd.RDD[String] = LICENSE MapPartitionsRDD[1] at textFile at <console>:27

scala> val lineCnt = licLines.count
lineCnt: Long = 294

Think of the licLines variable as a collection of lines constructed by splitting LICENSE on the newline character. The count action retrieves the number of lines in the licLines collection.

The number of lines in LICENSE may vary (and often does) between Spark versions.

You now know the total number of lines in the file. What good does that do? You need to find out the number of lines BSD appears in. The idea is to run the licLines collection through a filter that sifts out the lines that don't contain BSD:

scala> val bsdLines = licLines.filter(line => line.contains("BSD"))
bsdLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23

scala> bsdLines.count
res0: Long = 34

The filter call forms a new collection, bsdLines, that contains only lines with the "BSD" substring; count then returns how many elements (lines) the bsdLines collection has.


Function literals
If you've never used Scala, you may be wondering what the snippet with the fat arrow (=>) means. That is a Scala function literal; it defines an anonymous function that takes a string and returns true or false, depending on whether line contains the "BSD" substring.

The fat arrow basically designates the transformation that a function does on the left side of the expression, converting it to the right side, which is then returned. In this case, String (line) is transformed into a Boolean (the result of contains), which is then returned as the function's result.

The filter function evaluates the fat-arrow function on each element of the licLines collection (each line) and returns a new collection, bsdLines, that has only those elements for which the fat-arrow function returned true. The fat-arrow function you use for filtering lines is anonymous, but you could also define the equivalent named function, like this

scala> def isBSD(line: String) = { line.contains("BSD") }
isBSD: (line: String)Boolean

or store (a reference to) the function definition in a variable

scala> val isBSD = (line: String) => line.contains("BSD")
isBSD: String => Boolean = <function1>

and then use it in place of the anonymous function:

scala> val bsdLines1 = licLines.filter(isBSD)
bsdLines1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at filter at <console>:25

scala> bsdLines1.count
res1: Long = 34

To print the lines containing BSD to the console, you call println for each line:

scala> bsdLines.foreach(bLine => println(bLine))
BSD-style licenses
The following components are provided under a BSD-style license. See project link for details.
(BSD 3 Clause) netlib core (com.github.fommil.netlib:core:1.1.2 - https://github.com/fommil/netlib-java/core)
(BSD 3 Clause) JPMML-Model (org.jpmml:pmml-model:1.1.15 - https://github.com/jpmml/jpmml-model)
(BSD 3-clause style license) jblas (org.jblas:jblas:1.2.4 - http://jblas.org/)
(BSD License) AntLR Parser Generator (antlr:antlr:2.7.7 - http://www.antlr.org/)
...


To accomplish the same thing with less typing, you can also use a shortcut version:

scala> bsdLines.foreach(println)

2.2.3 The notion of a resilient distributed dataset

Although licLines and bsdLines feel and look like ordinary Scala collections (filter and foreach methods are available in ordinary Scala collections, too), they aren't. They're distributed collections, specific to Spark, called resilient distributed datasets or RDDs.

The RDD is the fundamental abstraction in Spark. It represents a collection of elements that is
■ Immutable (read-only)
■ Resilient (fault-tolerant)
■ Distributed (dataset spread out to more than one node)

RDDs support a number of transformations that do useful data manipulation, but they always yield a new RDD instance. Once created, RDDs never change; thus the adjective immutable. Mutable state is known to introduce complexity, but besides that, having immutable collections allows Spark to provide important fault-tolerance guarantees in a straightforward manner.

The fact that the collection is distributed on a number of machines (execution contexts, JVMs) is transparent1 to its users, so working with RDDs isn't much different than working with ordinary local collections like plain old lists, maps, sets, and so on.

To summarize, the purpose of RDDs is to facilitate parallel operations on large datasets in a straightforward manner, abstracting away their distributed nature and inherent fault tolerance.

RDDs are resilient because of Spark's built-in fault recovery mechanics. Spark is capable of healing RDDs in case of node failure. Whereas other distributed computation frameworks facilitate fault tolerance by replicating data to multiple machines (so it can be restored from healthy replicas once a node fails), RDDs are different: they provide fault tolerance by logging the transformations used to build a dataset (how it came to be) rather than the dataset itself. If a node fails, only a subset of the dataset that resided on the failed node needs to be recomputed.

For example, in the previous section, the process of loading the text file yielded the licLines RDD. Then you applied the filter function to licLines, which produced the new bsdLines RDD. Those transformations and their ordering are referred to as RDD lineage. It represents the exact recipe for creating the bsdLines RDD, from start to finish. We'll talk more about RDD lineage in later chapters. For now, let's see what else you can do with RDDs.

1 Well, almost transparent. In order to optimize computation and thus gain performance benefits, there are ways to control dataset partitioning (how the RDD is distributed among nodes in a cluster) and persistence options. We’ll talk about both features extensively later in the book.
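As a small, hedged aside (ours, not one of the book's listings), you can already peek at an RDD's lineage in the shell: the toDebugString method prints the chain of transformations Spark would replay to rebuild lost partitions. The exact formatting varies between Spark versions, so the commented output below is only indicative:

scala> bsdLines.toDebugString
// roughly: (2) MapPartitionsRDD[2] at filter at <console>:23 []
//           |  LICENSE MapPartitionsRDD[1] at textFile at <console>:27 []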


2.3 Basic RDD actions and transformations

There are two types of RDD operations: transformations and actions. Transformations (for example, filter or map) are operations that produce a new RDD by performing some useful data manipulation on another RDD. Actions (for example, count or foreach) trigger a computation in order to return the result to the calling program or to perform some actions on an RDD's elements.

Laziness of Spark transformations
It's important to understand that transformations are evaluated lazily, meaning computation doesn't take place until you invoke an action. Once an action is triggered on an RDD, Spark examines the RDD's lineage and uses that information to build a "graph of operations" that needs to be executed in order to compute the action. Think of a transformation as a sort of diagram that tells Spark which operations need to happen and in which order once an action gets executed.
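Here is a minimal sketch of our own (assuming the local shell from this chapter) that makes the laziness visible: the println inside map doesn't run when the transformation is defined, only when an action forces the computation. In local mode the messages appear in the same console, because the executor runs in the shell's JVM:

scala> val squared = sc.parallelize(1 to 3).map { x => println("computing " + x); x * x }
// nothing is printed yet -- map only records what should happen

scala> squared.count
// only now do the "computing ..." messages appear, and the action returns 3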

In this section, you’ll be introduced to a number of other important RDD operations, such as map, flatMap, take, and distinct. We hope you still have your Spark shell open so that you can follow along. Feel free to experiment, turn things around, and make some mistakes. That way, you’ll learn much more efficiently.

2.3.1 Using the map transformation

You saw that filter is used to conditionally remove2 some elements from an RDD. Now let's look at how you can take one RDD's elements, transform them, and create a brand-new RDD with those transformed elements.

The map transformation allows you to apply an arbitrary function to all elements of an RDD. Here is how the map method is declared (we removed parts of the signature that aren't relevant for this discussion):

class RDD[T] {                      // defines the RDD as a class with parameterized type T
  // ... other methods ...
  def map[U](f: (T) => U): RDD[U]   // map takes another function as a parameter and
                                    // returns an RDD of a (possibly) different type
  // ... other methods ...
}

You can read the function signature like this: “Declare a function called map that takes some other function as a parameter and returns an RDD. The RDD that is returned contains elements of a different type than the RDD on which map was called.” So the resulting RDD, unlike with filter, may or may not be of the same type as the RDD that map was called on (this).

2 RDDs are immutable, remember? So when we say “remove” we mean “create a new RDD where some elements are conditionally missing, compared to the RDD you started with (the one on which filter was called).”


Let’s start with a basic example. If you wanted to calculate the squares of an RDD’s elements, you could easily do that using map.

Listing 2.2 Calculating the squares of an RDD's elements using map

scala> val numbers = sc.parallelize(10 to 50 by 10)
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:12

scala> numbers.foreach(x => println(x))
30
40
50
10
20

scala> val numbersSquared = numbers.map(num => num * num)
numbersSquared: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:23

scala> numbersSquared.foreach(x => println(x))
100
400
1600
2500
900

The first command in the listing, Spark context's parallelize method, takes a Seq (Array and List classes both implement the Seq interface) and creates an RDD from its elements. The Seq's elements get distributed to Spark executors in the process. makeRDD is an alias for parallelize, so you can use either of the two. The expression passed in as an argument (10 to 50 by 10) is Scala's way of creating a Range, which is also an implementation of Seq.

Using a slightly different example to illustrate how map can change the RDD's type, imagine a situation where you want to convert an RDD of integers to an RDD of strings and then reverse each of those strings:

scala> val reversed = numbersSquared.map(x => x.toString.reverse)
reversed: org.apache.spark.rdd.RDD[String] = MappedRDD[4] at map at <console>:16

scala> reversed.foreach(x => println(x))
001
004
009
0061
0052

You can also write this last transformation so it's even shorter, using an underscore, which in this context is called a placeholder:

scala> val alsoReversed = numbersSquared.map(_.toString.reverse)
alsoReversed: org.apache.spark.rdd.RDD[String] = MappedRDD[4] at map at <console>:16

scala> alsoReversed.first


res6: String = 001

scala> alsoReversed.top(4)
res7: Array[String] = Array(009, 0061, 0052, 004)

first returns the first element from an RDD, and top returns an ordered array of the k largest elements from an RDD. Note that "009" comes out bigger than "0061": because these are strings, they're sorted alphabetically.3
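If the alphabetical ordering surprises you, here's a quick sketch (ours, not one of the book's listings): calling top on the numeric RDD, before the values are turned into strings, orders them as numbers:

scala> numbersSquared.top(4)
// expected result: Array(2500, 1600, 900, 400) -- largest numbers first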

Placeholder syntax
You can read the placeholder syntax in the previous example like this: "Whatever I'm invoked with, call toString and then reverse on it." We call it a placeholder4 because it holds the place to be filled in with the argument to a function when the function is invoked.

In this case, as map starts going over elements, the placeholder is first replaced with the first element from the numbersSquared collection (100), then the second element (400), and so on.

2.3.2 Using the distinct and flatMap transformations

We continue our tour of RDD operations with the distinct and flatMap transformations. They're similar to the functions of the same name available in some Scala collections (such as Array objects), so they may be familiar to you. The difference is that, when used on RDDs, they operate on distributed data and are lazily evaluated, as we said earlier.

Let's use another example. Imagine that you have a large file containing clients' transaction logs from the last week. Every time a client makes a purchase, the server appends a unique client ID to the end of the log file. At the end of each day, the server adds a new line, so you have a nice file structure of one line of comma-separated user IDs per day. Suppose you're given the task of finding out how many clients bought at least one product during a week. To get that number, you need to remove all duplicate clients: that is, reduce all clients who made multiple purchases to a single entry. After that, all that is left to do is count the remaining entries.

To prepare for this example, you'll create a sample file with several client IDs. Open a new terminal (not spark-shell), and execute the following commands:5

$ echo "15,16,20,20
77,80,94
94,98,16,31
31,15,20" > ~/client-ids.log

echo prints its argument to the standard output, which here is redirected to the client-ids.log file. The numbers represent client IDs. The tilde (~) is another way to refer to your Linux home directory (equivalent to $HOME).

3 Alphabetical order: http://en.wikipedia.org/wiki/Alphabetical_order.
4 More on placeholder syntax: http://mng.bz/c52S.
5 If you're reading the printed version of the book, use the book's GitHub repository to copy the snippets: http://mng.bz/RCb9.


Back in spark-shell, load the log file:

scala> val lines = sc.textFile("/home/spark/client-ids.log")
lines: org.apache.spark.rdd.RDD[String] = client-ids.log MapPartitionsRDD[1] at textFile at <console>:21

Then you split each line on the comma character, which yields an array of strings for each line:

scala> val idsStr = lines.map(line => line.split(","))
idsStr: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:14

scala> idsStr.foreach(println(_))
[Ljava.lang.String;@65d795f9
[Ljava.lang.String;@2cb742ab
... 4 of these ...

Wait, what just happened? Isn't the expected output an array of IDs?

Actually, no: you created idsStr by splitting each of the four lines on the comma character, which created four arrays of IDs. Thus the printout contains four return values of Array.toString:6

scala> idsStr.first
res0: Array[String] = Array(15, 16, 20, 20)

How could you visualize these arrays that found themselves inside your idsStr RDD? Let’s use the RDD’s collect action (you do remember section 2.3, where we explained that RDD operations can either be transformations or actions, right?). collect is an action that creates an array and then, well, collects all elements of an RDD into that array, which it then returns as the result to your shell:

scala> idsStr.collect
res1: Array[Array[String]] = Array(Array(15, 16, 20, 20), Array(77, 80, 94), Array(94, 98, 16, 31), Array(31, 15, 20))

If only there were a function that knows how to flatten those four arrays into a single, union array. There is, and it's called flatMap, and it's used exactly for situations like these, where the result of a transformation yields multiple arrays and you need to get all their elements into one array. It basically works the same as map, in that it applies the provided function to all of the RDD's elements, but the difference is that it concatenates multiple arrays into a collection that has one level of nesting less than what it received. This is its signature:

def flatMap[U](f: (T) => TraversableOnce[U]): RDD[U]

To simplify things and avoid confusion, we’ll say that TraversableOnce7 is a weird name for a collection.

6 What is this [Ljava.lang.String;@... thing? See http://stackoverflow.com/a/3442100/465710. 7 If you like a good puzzle, you can look up TraversableOnce in the Scala API (called scaladoc): http:// mng.bz/OTvD.


The game plan
It isn't yet time to go deeper into Scala; after all, we're sticking to the game plan of not assuming any prior Scala knowledge. Overwhelming you with Scala definitions wouldn't be productive at this point. Instead, we'll keep explaining relevant Scala concepts in context, as we come across them. We'll skip the topics that are hard for a Scala beginner to understand, unless we think that understanding the topic is crucial for working with Spark. TraversableOnce is definitely not (yet) in that category.

For more experienced Scala developers, for each such topic, we'll make sure to include a footnote that contains a link to an online resource.

Now that you know about flatMap, let’s start over with this example. You’ll once again split the lines you loaded into the lines RDD, but this time you’ll use flatMap instead of the regular map:

scala> val ids = lines.flatMap(_.split(","))
ids: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[8] at flatMap at <console>:23

Let’s use collect to see what flatMap returned:

scala> ids.collect
res11: Array[String] = Array(15, 16, 20, 20, 77, 80, 94, 94, 98, 16, 31, 31, 15, 20)

This Array that wraps the IDs, as we already mentioned, is just the way collect returns results. When you used collect the first time, after applying the regular map function, you got back an array of arrays. flatMap yielded one level of nesting less than map. Let’s make sure you don’t have any arrays in the ids RDD:

scala> ids.first
res12: String = 15

We told you so: you get back a single string. If you want to format the output of the collect method, you can use a method of the Array class called mkString:

scala> ids.collect.mkString("; ")
res13: String = 15; 16; 20; 20; 77; 80; 94; 94; 98; 16; 31; 31; 15; 20

mkString isn’t specific to Spark; it’s a method of the Array class from Scala’s standard library, which is used to concatenate all the array elements into a String, dividing them with the provided parameter. For no particular reason, apart from practicing the map transformation and the placeholder syntax, let’s transform the RDD’s elements from String to Int (this could

perhaps be used to avoid sorting elements alphabetically as strings, where "10" is smaller than "2"):

scala> val intIds = ids.map(_.toInt)
intIds: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:25

scala> intIds.collect
res14: Array[Int] = Array(15, 16, 20, 20, 77, 80, 94, 94, 98, 16, 31, 31, 15, 20)

Your task was to find the number of unique clients who bought anything. To find that number, you need distinct. distinct is a method that returns a new RDD with duplicate elements removed:

def distinct(): RDD[T]

When called on an RDD, it creates a new RDD with unique elements (of the same type, of course). So, let's finally make the list of IDs unique:

scala> val uniqueIds = intIds.distinct
uniqueIds: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[12] at distinct at <console>:27

scala> uniqueIds.collect
res15: Array[Int] = Array(16, 80, 98, 20, 94, 15, 77, 31)

scala> val finalCount = uniqueIds.count
finalCount: Long = 8

You now know that only eight distinct clients made purchases during that week. How many transactions were made in total?

scala> val transactionCount = ids.count
transactionCount: Long = 14

So, 8 clients made 14 transactions. You end up concluding that you have a small set of loyal clients.
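As a recap, here's a minimal sketch of ours (not a listing from the book) that chains the same steps into a single expression; it should produce the same count of eight unique clients:

scala> sc.textFile("/home/spark/client-ids.log").flatMap(_.split(",")).distinct.count
// expected result: 8, the same answer as the step-by-step version above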

Pasting blocks of code into the Spark Scala shell
Just so you know, you don't have to paste line by line into your shell. You can copy whole blocks of code, together with their outputs, which can make things a little easier.

Execute the following block of slightly modified commands from this section. If you're reading the print edition, you can copy the snippet from the book's GitHub repository; otherwise, copy and paste the entire block, together with the results, into spark-shell. It will detect that you pasted a shell transcript, and all the commands that start with the scala> prompt will be replayed, after you confirm your intention with a secret keyboard shortcut that will be revealed to you (watch out for // Detected repl transcript in the output):


(continued)

scala> val lines = sc.textFile("/home/spark/client-ids.log")
lines: org.apache.spark.rdd.RDD[String] = client-ids.log MapPartitionsRDD[12] at textFile at <console>:21

scala> val ids = lines.flatMap(_.split(","))
ids: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at flatMap at <console>:23

scala> ids.count
res8: Long = 14

scala> val uniqueIds = ids.distinct
uniqueIds: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[16] at distinct at <console>:25

scala> uniqueIds.count
res17: Long = 8

scala> uniqueIds.collect
res18: Array[String] = Array(16, 80, 98, 20, 94, 15, 77, 31)

Now that you have witnessed a small miracle, let’s continue on with our business.

Still with us? Great! But hold your horses—after all, these are still basics. Let's see whether you can grasp the next transformation, the dreaded sample.

2.3.3 Obtaining RDD's elements with the sample, take, and takeSample operations

Suppose you need to prepare a sample set that contains 30% of client IDs randomly picked from the same log. Good thing the authors of the RDD API anticipated this exact situation, because they implemented a sample method in the RDD class. It's a transformation that creates a new RDD with random elements from the calling RDD (this). The signature looks like this:

def sample(withReplacement: Boolean, fraction: Double, seed: Long = ➥ Utils.random.nextLong): RDD[T]

The first parameter, withReplacement, determines whether the same element may be sampled multiple times. If it's set to false, an element, once sampled, won't be considered for the subsequent sampling (it's removed—that is, not replaced) during the life of that method call. To use an example from Wikipedia (http://mng.bz/kQ7W), if we catch fish, measure them, and immediately return them to the water before continuing with the sample, this is a with-replacement design, because we might end up catching and measuring the same fish more than once. But if we don't return the fish to the water (for example, if we eat the fish), this becomes a without-replacement design.

The second parameter, fraction, determines the expected number of times each element is going to be sampled (as a number greater than zero), when replacement is used. When used without replacement, it determines the expected probability that each element is going to be sampled, expressed as a floating-point number between 0 and 1. Keep in mind that sampling is a probabilistic method, so don't expect the results to be exact every time.


The third parameter represents the seed for random-number generation. The same seed always produces the same quasi-random numbers, which is useful for testing. Under the hood, this method uses scala.util.Random, which in turn uses java.util.Random.

The one new thing you should notice in the method signature is that the seed parameter has a default value, Utils.random.nextLong. In Scala, a parameter's default value is used if you don't provide that argument when calling a function.

You needed to prepare a set of 30% (0.3) of all client IDs. You'll do sampling without replacement:

scala> val s = uniqueIds.sample(false, 0.3)
s: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[19] at sample at <console>:27

scala> s.count
res19: Long = 2

scala> s.collect
res20: Array[String] = Array(94, 21)

You sampled two elements, which is about 30% of the unique set of client IDs. Don't be surprised if you get 1 or 3 elements; as we mentioned, 0.3 is only the probability that each of the elements will end up in the sampled subset.

Let's now see what sampling with replacement does. A fraction of 50% with replacement will make the results more apparent:

scala> val swr = uniqueIds.sample(true, 0.5)
swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[20] at sample at <console>:27

scala> swr.count
res21: Long = 5

scala> swr.collect
res22: Array[String] = Array(16, 80, 80, 20, 94)

Notice that you get back the string "80" two times from the unique set of strings. That wouldn't be possible if you hadn't used replacement (the algorithm doesn't remove elements from the pool of potential sample candidates once they're picked).

If you want to sample an exact number of elements from an RDD, you can use the takeSample action:

def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]

There are two differences between sample and takeSample. The first is that takeSample takes an Int as its second parameter, which determines the number of sampled elements it returns (notice that we didn't say "expected number of elements"—it always returns exactly num elements). The second difference is that, whereas sample is a transformation, takeSample is an action, which returns an array (like collect):

scala> val taken = uniqueIds.takeSample(false, 5)
taken: Array[String] = Array(80, 98, 77, 31, 15)
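One more hedged sketch of ours (not from the book): passing an explicit seed, the third parameter discussed earlier, makes sampling reproducible, which is handy in tests. Sampling the same RDD twice with the same arguments should pick the same elements:

scala> val firstPick  = uniqueIds.sample(false, 0.3, 42)
scala> val secondPick = uniqueIds.sample(false, 0.3, 42)
scala> firstPick.collect.sameElements(secondPick.collect)
// expected to be true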


Another useful action for obtaining a subset of your data is take. It scans enough of an RDD's partitions (parts of data residing on different nodes in the cluster) to return the requested number of elements. Remember this one well, because you'll find yourself needing a simple action for peeking into the data in your RDDs:

scala> uniqueIds.take(3)
res23: Array[String] = Array(80, 20, 15)

Of course, you shouldn't request too many elements, because they all need to be transferred to a single machine.

2.4 Double RDD functions

If you create an RDD containing only Double elements, several extra functions automagically become available, through the concept known as implicit conversion.

Scala's implicit conversion
Implicit conversion is a useful concept, heavily used in Spark, but it can be a bit tricky to understand at first, so we'll explain it here. Let's say you have a Scala class defined like this:

class ClassOne[T](val input: T) { }

ClassOne is type parameterized, so the argument input can be a string, an integer, or any other object. And let's say you want objects of ClassOne to have a method duplicatedString(), but only if input is a string, and to have a method duplicatedInt(), only if input is an integer. You can accomplish this by creating two classes, each containing one of these new methods. Additionally, you have to define two implicit methods that will be used for conversion of ClassOne to these new classes, like this:

class ClassOneStr(val one: ClassOne[String]) {
  def duplicatedString() = one.input + one.input
}
class ClassOneInt(val one: ClassOne[Int]) {
  def duplicatedInt() = one.input.toString + one.input.toString
}
implicit def toStrMethods(one: ClassOne[String]) = new ClassOneStr(one)
implicit def toIntMethods(one: ClassOne[Int]) = new ClassOneInt(one)

The compiler can now perform automatic conversion from type ClassOne[String] to ClassOneStr and from ClassOne[Int] to ClassOneInt, and you can use their methods on ClassOne objects. You can now perform something like this:

scala> val oneStrTest = new ClassOne("test")
oneStrTest: ClassOne[String] = ClassOne@516a4aef

scala> val oneIntTest = new ClassOne(123)
oneIntTest: ClassOne[Int] = ClassOne@f8caa36

scala> oneStrTest.duplicatedString()
res0: String = testtest

scala> oneIntTest.duplicatedInt()
res1: String = 123123


(continued)
But the following snippet gives an error:

scala> oneIntTest.duplicatedString()
error: value duplicatedString is not a member of ClassOne[Int]
       oneIntTest.duplicatedString()

Neat, eh?

This is exactly what happens with RDDs in Spark. RDDs have new methods added to them automatically, depending on the type of data they hold. RDDs containing only Double objects are automatically converted into instances of the org.apache.spark.rdd.DoubleRDDFunctions class, which contains all the double RDD functions described in this section.

For the curious, the only difference between RDDs in Spark and the preceding example is where the implicit methods are defined. You can find them in the RDD companion object, which is an object RDD defined in the same file as the class RDD (and having the same name). Companion objects in Scala hold what would be static members in Java.

We're glad that's out of the way. So, what functions are implicitly added to RDDs containing Double elements? Double RDD functions can give you the total sum of all elements along with their mean value, standard deviation, variance, and histogram. These functions can come in handy when you get new data and you need to obtain information about its distribution.

2.4.1 Basic statistics with double RDD functions

Let's use the intIds RDD you created previously to illustrate the concepts in this section. Although it contains Int objects, they can be automatically converted to Doubles, so double RDD functions can be implicitly applied. Using mean and sum is trivial:

scala> intIds.mean
res0: Double = 44.785714285714285

scala> intIds.sum
res1: Double = 627.0

A bit more involved is the stats action. It calculates the count and sum of all elements; their mean, maximum, and minimum values; and their variance and standard deviation, all in one pass, and returns an org.apache.spark.util.StatCounter object that has methods for accessing all these metrics. The variance and stdev actions are just shortcuts for calling stats().variance and stats().stdev:

scala> intIds.variance
res2: Double = 1114.8826530612246

scala> intIds.stdev
res3: Double = 33.38985853610681
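If you need several of these metrics, it's cheaper to call stats once and reuse the returned StatCounter. A short sketch of ours, using the same intIds RDD (the commented values are what we'd expect for this data, computed in a single pass):

scala> val st = intIds.stats()
scala> st.count   // 14
scala> st.min     // 15.0
scala> st.max     // 98.0
scala> st.mean    // 44.785714285714285, the same value intIds.mean returned earlier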


2.4.2 Visualizing data distribution with histograms

Histograms are used for graphical representation of data. The X axis has value intervals, and the Y axis has the data density, or the number of elements in the corresponding interval. The histogram action on double RDDs has two versions.

The first version takes an array of Double values that represent interval limits and returns an Array with counts of elements belonging to each interval. Interval limits have to be sorted, must contain at least two elements (which represent one interval), and must not contain any duplicates:

scala> intIds.histogram(Array(1.0, 50.0, 100.0))
res4: Array[Long] = Array(9, 5)

The second version takes a number of intervals, which is used to split the input data range into intervals of equal size, and returns a tuple whose second element contains the counts (just like the first version) and whose first element contains the calculated interval limits:

scala> intIds.histogram(3)
res5: (Array[Double], Array[Long]) = (Array(15.0, 42.66666666666667, 70.33333333333334, 98.0),Array(9, 0, 5))

Histograms can offer some insight into data distribution, which usually isn’t obvious from standard deviation and mean values.

2.4.3 Approximate sum and mean

If you have a really large dataset of Double values, calculating their statistics can take longer than you care to spend. You can use two experimental actions, sumApprox and meanApprox, to calculate the approximate sum and mean, respectively, in a specified timeframe:

sumApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]
meanApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

They both take a timeout value in milliseconds, which determines the maximum amount of time the action can run for. If it doesn't return by then, the result computed until that point is returned. The confidence parameter influences the values returned.

Approximate actions return a PartialResult object, which gives access to finalValue and failure fields (the failure field is available only if an exception occurred). finalValue is of type BoundedDouble, which expresses not a single value, but a probable range (low and high), its mean value, and the associated confidence.
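Here is a minimal sketch of ours showing how you might call one of these actions and read the BoundedDouble it produces; we reuse the intIds RDD from earlier, mapped to Double explicitly, so the dataset is tiny and the result arrives well before the timeout:

scala> val approxSum = intIds.map(_.toDouble).sumApprox(1000)
scala> val bounded = approxSum.getFinalValue()   // blocks until completion or the 1,000 ms timeout
scala> bounded.mean                              // the estimated sum
scala> bounded.low                               // lower bound of the probable range
scala> bounded.high                              // upper bound of the probable range
scala> bounded.confidence                        // confidence associated with the range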


2.5 Summary

■ Symbolic links can help you manage multiple Spark releases.
■ The Spark shell is used for interactive one-off jobs and exploratory data analysis.
■ The resilient distributed dataset (RDD) is the fundamental abstraction in Spark. It represents a collection of elements that is immutable, resilient, and distributed.
■ There are two types of RDD operations: transformations and actions. Transformations (for example, filter or map) are operations that produce a new RDD by performing useful data manipulation on another RDD. Actions (for example, count or foreach) trigger a computation in order to return the result to the calling program or to perform some actions on an RDD's elements.
■ The map transformation is the main operation for transforming RDD data.
■ distinct returns another RDD containing only unique elements.
■ flatMap concatenates multiple arrays into a collection that has one level of nesting less than what it received.
■ With sample, take, and takeSample, you can obtain a subset of an RDD's elements.
■ Double RDD functions become available through Scala's implicit conversion and give you the total sum of all of an RDD's elements and their mean value, standard deviation, variance, and histogram.

Writing Spark applications

This chapter covers
■ Generating a new Spark project in Eclipse
■ Loading a sample dataset from the GitHub archive
■ Writing an application that analyzes GitHub logs
■ Working with DataFrames in Spark
■ Submitting your application to be executed

In this chapter, you’ll learn to write Spark applications. Most Spark programmers use an integrated development environment (IDE), such as IntelliJ or Eclipse. There are readily available resources online that describe how to use IntelliJ IDEA with Spark, whereas Eclipse resources are still hard to come by. That is why, in this chapter, you’ll learn how to use Eclipse for writing Spark programs. Nevertheless, if you choose to stick to IntelliJ, you’ll still be able to follow along. After all, these two IDEs have similar sets of features.



You'll start by downloading and configuring Eclipse and then installing Eclipse plug-ins that are necessary for working with Scala. You'll use Apache Maven (a software project-management tool) to configure Spark application projects in this chapter. The Spark project itself is configured using Maven. We prepared a Maven Archetype (a template for quickly bootstrapping Maven projects) in the book's GitHub repository at https://github.com/spark-in-action, which will help you bootstrap your new Spark application project in just a few clicks.

Throughout this chapter, you'll be developing an application that counts GitHub push events (code commits to GitHub) made by your company's employees. You'll use an exciting new construct, the DataFrame, which saw the light of day in Spark 1.3.0. Lots of content awaits you. Ready?

3.1 Generating a new Spark project in Eclipse

This section describes how to create a Spark project in Eclipse. We trust you know how to install Eclipse (you can follow the online instructions at http://wiki.eclipse.org/Eclipse/Installation), so we won't go into details. You can install it onto your development machine or in the spark-in-action VM. The decision is all yours, because it won't significantly affect how you build your Spark project. We installed it in the VM, in the /home/spark/eclipse folder, and used /home/spark/workspace as the workspace folder. To view the Eclipse GUI started from the VM, you'll need to set up an X Window system (on Windows, you can use Xming: https://sourceforge.net/projects/xming) and set the DISPLAY variable in your VM Linux shell to point to the IP address of your running X Window system.

If you wish to use some other IDE, such as IntelliJ, you can skip this section and start from section 3.2. If you wish to continue using Eclipse, you also need to install these two plug-ins:

■ Scala IDE plug-in
■ Scala Maven integration for Eclipse (m2eclipse-scala)

To install the Scala IDE plug-in, follow these steps:
1 Go to Help1 > Install new Software, and click Add in the upper-right corner.
2 When the Add Repository window appears, enter scala-ide in the Name field.
3 Enter http://download.scala-ide.org/sdk/lithium/e44/scala211/stable/site in the Location field.
4 Confirm by clicking OK.
5 Eclipse looks up the URL you entered and displays the available software it found there. Select only the Scala IDE for Eclipse entry and all its subentries.
6 Confirm the selection on the next screen, and accept the license on the one after that. Restart Eclipse when prompted.

1 Readers new to Ubuntu may not know that the toolbar of the currently active window is always located at the top of the screen, and it’s revealed when you hover the cursor near the top.


To install the Scala Maven integration for Eclipse plug-in, follow the same steps as for the Scala IDE plug-in, only enter http://alchim31.free.fr/m2e-scala/update-site in the Location field and m2eclipse-scala in the Name field.

Once you have all these set up, you're ready to start a new Eclipse project that will host your application. To simplify setting up new projects (either for examples in this book or for your future Spark projects), we have prepared an Archetype called scala-archetype-sparkinaction (available from our GitHub repository), which is used to create a starter Spark project in which versions and dependencies have already been taken care of.

To create a project in Eclipse, on your toolbar menu select File > New > Project > Maven > Maven Project. Don't make any changes on the first screen of the New Maven Project wizard, but click Next.

On the second screen, click Configure (which opens the Maven > Archetypes section of the Eclipse Preferences). Click the Add Remote Catalog button, and fill in the following values in the dialog that pops up (see figure 3.1):

■ Catalog File: https://github.com/spark-in-action/scala-archetype-sparkinaction/raw/master/archetype-catalog.xml
■ Description: Spark in Action

Click OK, and then close the Preferences window. You should see a progress bar in the lower-right corner, which is how Eclipse notifies you that it went to look up the catalog you just added. You're now back in the New Maven Project wizard. Select Spark in Action in the Catalog drop-down field and the scala-archetype-sparkinaction artifact (see figure 3.2).

Figure 3.1 Adding the Spark in Action Maven Remote Archetype Catalog to your Eclipse Preferences


Figure 3.2 Choosing the Maven Archetype that you want to use as the new project’s template. Select scala-archetype-sparkinaction.

The next screen prompts you to enter your project’s parameters. Notice (in figure 3.3) that your root package consists of groupId and artifactId.

groupId and artifactId
If you're new to Maven, it may be easier for you to look at artifactId as your project name and groupId as its fully qualified organization name. For example, Spark has groupId org.apache and artifactId spark. The Play framework has com.typesafe as its groupId and play as its artifactId.

You can specify whichever values you like for groupId and artifactId, but it may be easier for you to follow along if you choose the same values as we did (see figure 3.3). Confirm by clicking Finish.

Let's examine the structure of the generated project. Looking from the top, the first (root) entry is the project's main folder, always named the same as the project. We call this folder the project's root (or project root, interchangeably).


Figure 3.3 Creating a Maven project: specifying project parameters

Figure 3.4 The newly generated project in Eclipse’s Package Explorer window


src/main/scala is your main Scala source folder (if you were to add Java code to your project, you would create an src/main/java source folder). In the scala folder, you have the root package,2 org.sia.chapter03App, with a single Scala source file in it: App.scala. This is the Scala file where you'll start writing your application.

Next comes the main test folder with prepared samples for various test types, followed by the container for your Scala library. You're using the Scala that came with the Scala IDE Eclipse plug-in. Next is Maven Dependencies, described a bit later. Below Maven Dependencies is the JDK (Eclipse refers to JRE and JDK in the same way, as JRE System Library).

Further down in the Package Explorer window is again the src folder, but this time in the role of an ordinary folder (non-source or, more precisely, non-JVM-source folder), in case you want to add other types of resources to your project, such as images, JavaScript, HTML, CSS, anything you don't want to be processed by the Eclipse JVM tooling. Then there is the target folder, which is where compiled resources go (like .class or .jar files).

Finally, there is the all-encompassing pom.xml, which is the project's Maven specification. If you open pom.xml from the project root and switch to the Dependency Hierarchy tab, you'll have a much better view of the project's dependencies and their causality (see figure 3.5).


Figure 3.5 Project’s libraries dependency hierarchy (in pom.xml)

2 On the filesystem level, org.sia.chapter03App consists of three folders. In other words, this is the path to the App.scala file: chapter03App/src/main/scala/org/sia/chapter03App/App.scala.


There are only six libraries at the top level (all explicitly listed in pom.xml). Each of those libraries brought with it its own dependencies; and many of those dependencies, in turn, have their own dependencies, and so on.

In the next section, you'll see a plausible real-world example of developing a Spark application. You'll need Eclipse, so don't close it just yet.

3.2 Developing the application
Say you need to create a daily report that lists all employees of your company and the number of pushes3 they've made to GitHub. You can implement this using the GitHub archive site (www.githubarchive.org), put together by Ilya Grigorik from Google (with the help of the GitHub folks), where you can download GitHub archives for arbitrary time periods. You can download a single day from the archive and use it as a sample for building the daily report.

3.2.1 Preparing the GitHub archive dataset
Open your VM's terminal, and type in the following commands:

$ mkdir -p $HOME/sia/github-archive
$ cd $HOME/sia/github-archive
$ wget http://data.githubarchive.org/2015-03-01-{0..23}.json.gz

This downloads all public GitHub activity from March 1, 2015, in the form of 24 JSON files, one for each hour of the day:

2015-03-01-0.json.gz
2015-03-01-1.json.gz
...
2015-03-01-23.json.gz

To decompress the files, you can run

$ gunzip *

After a few seconds, you're left with 24 JSON files. You'll notice that the files are rather large (around 1 GB in total, decompressed), so you can use just the first hour of the day (44 MB) as a sample during development, instead of the entire day.

But the extracted files aren't valid JSON. Each file is a set of valid JSON strings separated with newlines, where each line is a single JSON object—that is, a GitHub event (push, branch, create repo, and so on).4 You can preview the first JSON object with head (which is used to list n lines from the top of a file), like this:

3 To push changes, in distributed source control management systems (such as Git), means to transfer content from your local repository to a remote repository. In earlier SCM systems, this was known as a commit.
4 GitHub API documentation on types of events: https://developer.github.com/v3/activity/events/types/.


$ head -n 1 2015-03-01-0.json
{"id":"2614896652","type":"CreateEvent","actor":{"id":739622,"login":
➥ "treydock","gravatar_id":"","url":"https://api.github.com/users/
➥ treydock","avatar_url":"https://avatars.githubusercontent.com/u/
➥ 739622?"},"repo":{"id":23934080,"name":"Early-Modern-OCR/emop-
➥ dashboard","url":"https://api.github.com/repos/Early-Modern-OCR/emop-
➥ dashboard"},"payload":{"ref":"development","ref_type":"branch","master_
➥ branch":"master","description":"","pusher_type":"user"},"public":true,
➥ "created_at":"2015-03-01T00:00:00Z","org":{"id":10965476,"login":
➥ "Early-Modern-OCR","gravatar_id":"","url":"https://api.github.com/
➥ orgs/Early-Modern-OCR","avatar_url":
➥ "https://avatars.githubusercontent.com/u/10965476?"}}

Uh, that's tough to read. Fortunately, an excellent program called jq (http://stedolan.github.io/jq) makes working with JSON from the command line much easier. Among other things, it's great for pretty-printing and color highlighting JSON. You can download it from http://stedolan.github.io/jq/download. If you're working in the spark-in-action VM, it's already installed for you. To try it out, you can pipe a JSON line to jq:

$ head -n 1 2015-03-01-0.json | jq '.'
{
  "id": "2614896652",
  "type": "CreateEvent",
  "actor": {
    "id": 739622,
    "login": "treydock",
    "gravatar_id": "",
    "url": "https://api.github.com/users/treydock",
    "avatar_url": "https://avatars.githubusercontent.com/u/739622?"
  },
  "repo": {
    "id": 23934080,
    "name": "Early-Modern-OCR/emop-dashboard",
    "url": "https://api.github.com/repos/Early-Modern-OCR/emop-dashboard"
  },
  "payload": {
    "ref": "development",
    "ref_type": "branch",
    "master_branch": "master",
    "description": "",
    "pusher_type": "user"
  },
  "public": true,
  "created_at": "2015-03-01T00:00:00Z",
  "org": {
    "id": 10965476,
    "login": "Early-Modern-OCR",
    "gravatar_id": "",
    "url": "https://api.github.com/orgs/Early-Modern-OCR",
    "avatar_url": "https://avatars.githubusercontent.com/u/10965476?"
  }
}


Wow! That certainly is pretty. Now you can easily see that this first log entry from the file has the "CreateEvent" type and that its payload.ref_type is "branch". So someone named "treydock" (actor.login) created a repository branch called "development" (payload.ref) in the first second of March 1, 2015 (created_at).

This is how you'll distinguish types of events, because all you need to count are push events. Looking at the GitHub API (https://developer.github.com/v3/activity/events/types/#pushevent), you find out that the push event's type is, unsurprisingly, "PushEvent".

OK, you obtained the files needed to develop a prototype, and you know how to prettify their contents so you can understand the structure of GitHub log files. Next, you can start looking into the problem of ingesting a JSON-like structured file into Spark.

3.2.2 Loading JSON
Spark SQL and its DataFrame facility, which was introduced in Spark 1.3.0, provide a means for ingesting JSON data into Spark. At the time of this announcement, everyone was talking about DataFrames and the benefits they were going to bring to computation speed and data interchange between Spark components (Spark Streaming, MLlib, and so on). In Spark 1.6.0, DataSets were introduced as generalized and improved DataFrames.

DataFrame API
A DataFrame is an RDD that has a schema. You can think of it as a relational database table, in that each column has a name and a known type. The power of DataFrames comes from the fact that, when you create a DataFrame from a structured dataset (in this case, JSON), Spark is able to infer a schema by making a pass over the entire JSON dataset that's being loaded. Then, when calculating the execution plan, Spark can use the schema and do substantially better computation optimizations. Note that DataFrame was called SchemaRDD before Spark v1.3.0.

SQL is so ubiquitous that the new DataFrame API was quickly met with acclamation by the wider Spark community. It allows you to attack a problem from a higher vantage point than the Spark Core transformations do. The SQL-like syntax lets you express your intent in a more declarative fashion: you describe what you want to achieve with a dataset, whereas with the Spark Core API you basically specify how to transform the data (to reshape it so you can come to a useful conclusion). You may therefore think of Spark Core as a set of fundamental building blocks on which all other facilities are built. The code you write using the DataFrame API gets translated to a series of Spark Core transformations under the hood.

We'll talk about DataFrames extensively in chapter 5. For now, let's focus on features relevant to the task at hand.


SQLContext is the main interface to Spark SQL (analogous to what SparkContext is to Spark Core). Since Spark 2.0, both contexts have been merged into a single class: SparkSession. Its read method gives access to a DataFrameReader object, which you use to read data from various sources. DataFrameReader's json method reads JSON data. Here's what the scaladocs (http://mng.bz/amo7) say:

def json(paths: String*): DataFrame Loads a JSON file (one object per line) and returns the result as a ➥ [[DataFrame]].

One object per line: that’s exactly how the GitHub archive files are structured. Bring your Eclipse forward, and switch to the Scala perspective (Window > Open Perspective > Other > Scala). Then open the App.scala file (to quickly locate the file, you can use Ctrl-Shift-R and then type the first few letters of the filename in the dialog that pops up), clean it up, and leave only the SparkContext initialization, like this:

import org.apache.spark.sql.SparkSession

object App {
  def main(args : Array[String]) {
    val spark = SparkSession.builder()
      .appName("GitHub push counter")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext
  }
}

To load the first JSON file, add the following snippet. Because neither a tilde (~)5 nor $HOME can be used directly in the path, you end up first retrieving the HOME environment variable so you can use it to compose the JSON file path:

val homeDir = System.getenv("HOME")
val inputPath = homeDir + "/sia/github-archive/2015-03-01-0.json"
val ghLog = spark.read.json(inputPath)

The json method returns a DataFrame, which has many of the standard RDD methods you used before, like filter, map, flatMap, collect, count, and so on. The next task to tackle is filtering the log entries so you're left only with push events. By taking a peek at DataFrame (since Spark 2.0, DataFrame is a special case of DataSet; it's a DataSet containing Row objects) in the scaladocs (http://mng.bz/3EQc), you can quickly find out that the filter function is overloaded and that one version takes a condition expression in the form of a String and another one takes a Column.

5 The tilde is equivalent to $HOME. You can use them interchangeably.


The String argument will be parsed as SQL, so you can write the following line below the line that loads JSON into ghLog:

val pushes = ghLog.filter("type = 'PushEvent'")
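
The Column variant achieves the same thing with a typed expression instead of a SQL string. A minimal sketch, assuming the same ghLog DataFrame (this line isn't part of the book's running example):

// Equivalent filter using a Column expression instead of a SQL string
val pushesCol = ghLog.filter(ghLog("type") === "PushEvent")

Either form yields the same DataFrame of push events; the String form is used here because it reads closest to SQL.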

It's time to see whether the code you have so far works. Will the application compile and start successfully? How will the loading go? Will the schema be inferred successfully? Have you specified the filter expression correctly?

3.2.3 Running the application from Eclipse
To find the answers to these questions, add the following code below the filter:

pushes.printSchema   // Pretty-prints the schema of the pushes DataFrame
println("all events: " + ghLog.count)
println("only pushes: " + pushes.count)
pushes.show(5)       // Prints the first 5 rows of a DataFrame in a tabular format
                     // (defaults to 20 rows if you call it with no arguments)

Then, build the project by right-clicking the main project folder and choosing Run As > Maven Install. Once the process finishes, right-click App.scala in the Package Explorer, and choose Run As > Scala Application. If there is no such option, you will need to create a new Run Configuration. To do this, click Run As > Run Configurations ..., then choose Scala Application and press the New button. Enter Chapter03App in the Name field and org.sia.chapter03App.App in the Main class field and hit Run.

Keyboard shortcut for running a Scala application
You may have noticed the keyboard shortcut located next to Scala Application (in Package Explorer > Run As). It says Alt-Shift-X S. This means you first press Alt, Shift, and X together, then release all the keys, and then press S by itself.

From now on, we won't always explicitly tell you when to run the application. Follow the code (always inserting it at the bottom of the existing code), and every time you see the output, run the application using either of the two methods described here.

Hopefully, you can now see the output in Eclipse's console window, where the printSchema method outputs the inferred schema (if the output is buried in many INFO and WARN messages, you probably skipped the logging-configuration step in section 2.2.1). The inferred schema consists of the union of all JSON keys, where each key has been assigned a type and the nullable attribute (which is always true in inferred schemas because, understandably, Spark leaves that decision to you):

root
 |-- actor: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- id: string (nullable = true)
 |-- org: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- action: string (nullable = true)
 |    |-- before: string (nullable = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- html: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
...

You can see that the inferred schema fits your previous findings regarding the GitHub username. You’ll have to use the actor object and its property login, thus actor.login. You’ll need that information soon, to count the number of pushes per employee. Scroll down to find the first count (all events) and then a bit more to find the second one (only pushes):

...
all events: 17786
...
only pushes: 8793
...

At the bottom of the output, you can see the first five rows (we removed four columns—id, org, payload, and public—from the middle so the output can fit on a single line):

+--------------------+--------------------+<->+--------------------+---------+
|               actor|          created_at|<->|                repo|     type|
+--------------------+--------------------+<->+--------------------+---------+
|[https://avatars... |2015-03-01T00:00:00Z|<->|[31481156,bezerra...|PushEvent|
|[https://avatars... |2015-03-01T00:00:00Z|<->|[31475673,demianb...|PushEvent|
|[https://avatars... |2015-03-01T00:00:00Z|<->|[31481269,ricardo...|PushEvent|
|[https://avatars... |2015-03-01T00:00:00Z|<->|[24902852,actorap...|PushEvent|
|[https://avatars... |2015-03-01T00:00:00Z|<->|[24292601,komasui...|PushEvent|
+--------------------+--------------------+<->+--------------------+---------+

To summarize, there were 17,786 events, out of which 8,793 were push events, in the first hour of March 1, 2015. So, it’s all working as expected.


3.2.4 Aggregating the data
You managed to filter out every type of event but PushEvent, and that's a good start. Next, you need to group the push events by username and, in the process, count the number of pushes in each group (rows for each username):

val grouped = pushes.groupBy("actor.login").count
grouped.show(5)

This groups all rows by the actor.login column and, similar to regular SQL, performs count as the aggregate operation during grouping. Think about it: as multiple rows (with the same value in the actor.login column) get collapsed down to a single row (that's what grouping means), what happens to the values in the other columns? count6 tells Spark to ignore the values in those columns and count the number of rows that get collapsed for each unique login. The result answers the question of how many push events each unique login has. In addition to count, the API lists other aggregation functions, including min, max, avg, and sum. The first five rows of the resulting DataFrame, grouped, are displayed in Eclipse's Console view:

+----------+-----+
|     login|count|
+----------+-----+
|    gfgtdf|    1|
|   mdorman|    1|
|quinngrier|    1|
| aelveborn|    1|
|  jwallesh|    3|
+----------+-----+

That looks good, but it’s impossible to see who had the highest number of pushes. The only thing left to do is to sort the dataset by the count column:

val ordered = grouped.orderBy(grouped("count").desc)
ordered.show(5)

This orders the grouped DataFrame by the value in the count column and names the new, sorted DataFrame ordered. The expression grouped("count") returns the count column from the grouped DataFrame (it implicitly calls DataFrame.apply 7), on which you call desc to order the output by count in descending order (the default ordering is asc):

6 For other operations (such as sum, avg, max, and so on), consult the GroupedData scaladoc: http://mng.bz/X8lA.
7 See the apply method and the examples at the top of the DataFrame API for more info: http://mng.bz/X8lA.


+------------------+-----+
|             login|count|
+------------------+-----+
|      greatfirebot|  192|
|diversify-exp-user|  146|
|     KenanSulayman|   72|
|        manuelrp07|   45|
|    mirror-updates|   42|
+------------------+-----+

It works! That’s great—but this is the list of all users pushing to GitHub, not only employees of your company. So, you need to exclude non-employees from the list.
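
Before moving on to the employee filter, note that the same ordering can also be written with the desc function from org.apache.spark.sql.functions, which some find more readable than the apply-plus-desc form used above. A minimal, equivalent sketch (not part of the running example):

import org.apache.spark.sql.functions.desc
val orderedAlt = grouped.orderBy(desc("count"))

Both calls produce the same descending sort on the count column.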

3.2.5 Excluding non-employees
We prepared a file in our GitHub repository containing the GitHub usernames of the employees of your imaginary company. It should already be downloaded in your home directory, at first-edition/ch03/ghEmployees.txt.

To use the employee list, load it into some type of Scala collection. Because Set has faster random lookup than Seq (sequential collection types, such as Array and List), let's use Set. To load the entire file8 into a new Set, you can do the following (this is only a piece of the complete application, which will be given in the next section):

import scala.io.Source.fromFile

val empPath = homeDir + "/first-edition/ch03/ghEmployees.txt"
val employees = Set() ++ (              // Set() creates an empty, immutable set; the ++ method adds multiple elements to it
  for {                                 // The for expression9 reads each line from the file
    line <- fromFile(empPath).getLines  // and stores it into the line variable
  } yield line.trim                     // yield operates on every cycle of the for loop, adding a value to a hidden collection
)                                       // that is returned (and destroyed) as the result of the entire for expression, once the loop ends

This code snippet works like the following pseudocode:

In each iteration of the for loop:
    Read the next line from the file
    Initialize a new line variable so that it contains the current line as its value
    Take the value from the line variable, trim it, and add it to the temporary collection
Once the last iteration finishes:
    Return the temporary, hidden collection as the result of for
    Add the result of for to an empty set
    Assign the set to the employees variable
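
If the for expression feels heavy, an equivalent one-liner does the same job. A sketch under the same assumptions (empPath points to the employees file):

val employeesAlt = fromFile(empPath).getLines.map(_.trim).toSet

The for-based version is used in the book's listing; the one-liner is purely a stylistic alternative.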

8 Loading an entire file isn't considered good practice, but because you know ghEmployees.txt should never go over 1 MB, in this case it's fine (with 208 employees, it weighs only 2 KB).
9 More about for expressions in Scala: http://mng.bz/k8q2.


This for expression probably looks complicated if you're new to Scala, but give it a couple of tries; Scala's for comprehensions are a powerful and succinct way of dealing with iteration. There are no index variables unless you need them. We encourage you to try a few for-related examples from that Scala book of yours (or other Scala resource you choose to use).

You now have the set of employee usernames, but how do you compare it with your ordered DataFrame that contains usernames and the corresponding counts? Well, you already used DataFrame's filter method. Its scaladocs (http://mng.bz/3EQc) say

def filter(conditionExpr: String): DataSet

filters rows using the given SQL expression:

val oldPeopleDf = peopleDf.filter("age > 15")

But this is only a simple example that compares values of the peopleDf DataFrame's age column with the literal number 15. If a row's age is greater than 15, the row is included in oldPeopleDf. You, on the other hand, need to compare the login column against the set of employees and filter out each row whose login value isn't in that set.

You could use DataSet's filter function to specify filtering criteria based on the contents of individual Row objects (DataFrames are just DataSets containing Row objects), but you can't use that approach in DataFrame SQL expressions. That's where user-defined functions (UDFs) come into play. The SparkSession class (http://mng.bz/j9As) contains the udf method, which is used for registering UDFs. Because you need to check whether each login is in the set of your company's employees, the first thing you need to do is write a general filtering function that checks whether a string is in a set:

val isEmp: (String => Boolean) = (arg: String) => employees.contains(arg)

You explicitly define isEmp as a function that takes a String and returns a Boolean (often said as “function from String to Boolean”), but thanks to Scala’s type inference, you could have made it more terse:

val isEmp = user => employees.contains(user)

Because employees is a Set of Strings, Scala knows that the isEmp function should take a String. In Scala, the return value of a function is the value of its last statement, so it isn’t difficult to infer that the function should return whatever the method contains returns, which is a Boolean. Next you register isEmp as a UDF:

val isEmployee = spark.udf.register("isEmpUdf", isEmp)
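
As a side note, the name you pass to register ("isEmpUdf" here) also makes the function callable from SQL-style string expressions, not just from the Scala API. A hedged sketch, not part of the running example:

// Hypothetical check: the registered name can be referenced inside a string filter
ordered.filter("isEmpUdf(login)").show(5)

The value returned by register, isEmployee, is what you'll use with the Column-based filter in a moment.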


Now, when Spark goes to execute the UDF, it will take all of its dependencies (only the employees set, in this case) and send them along with each and every task, to be exe- cuted on a cluster.

3.2.6 Broadcast variables
We'll talk more about tasks and Spark execution in general in the coming chapters; for now, to explain broadcast variables, we'll tell you that if you were to leave the program like this, you'd be sending the employees set some 200 times over the network (the approximate number of tasks your application will generate for the purpose of excluding non-employees). You don't need to imagine this, because you'll soon see it in your program's output.

Broadcast variables are used for this purpose, because they allow you to send a variable exactly once to each node in a cluster. Moreover, the variable is automatically cached in memory on the cluster nodes, ready to be used during the execution of the program. Spark uses a peer-to-peer protocol, similar to BitTorrent, to distribute broadcast variables, so that, in the case of large clusters, the master doesn't get clogged while broadcasting a potentially large variable to all nodes. This way, worker nodes know how to exchange the variable among themselves, so it spreads out through the cluster organically, like a virus or gossip. In fact, you'll often hear this type of communication between nodes referred to as a gossip protocol, further explained at http://en.wikipedia.org/wiki/Gossip_protocol.

The good thing about broadcast variables is that they're simple to use. All you have to do to "fix" your program is to turn your regular employees variable into a broadcast variable, which you then use in place of employees. You need to add an additional line, just above the isEmp function's definition:

val bcEmployees = sc.broadcast(employees)

Then change how you refer to this variable, because broadcast variables are accessed using their value method (you need to change this line and not add it; the complete source code of the program is given in listing 3.1):

val isEmp = user => bcEmployees.value.contains(user)

That’s all—everything else stays unchanged. The last thing to do here is to finally filter the ordered DataFrame using your newly created isEmployee UDF function:

import spark.implicits._
val filtered = ordered.filter(isEmployee($"login"))
filtered.show()

By writing the previous line, you basically tell filter to apply the isEmployee UDF on the login column. If isEmployee returns true, the row gets included in the filtered DataFrame.
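
For a small, fixed set like this one, a UDF isn't the only way to express the filter. A hedged alternative sketch (not used in the book's listing) relies on Column's isin method instead:

// Sketch: pass the employee names directly to isin (reasonable only for small sets)
val filteredNoUdf = ordered.filter($"login".isin(employees.toSeq: _*))

The UDF approach scales better to membership tests that are more complex than a simple set lookup, which is why it's used here.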


Listing 3.1 Complete source code of the program

package org.sia.chapter03App

import org.apache.spark.sql.SparkSession
import scala.io.Source.fromFile

object App {

  def main(args : Array[String]) {
    // TODO expose appName and master as app. params
    val spark = SparkSession.builder()
      .appName("GitHub push counter")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext

    // TODO expose inputPath as app. param
    val homeDir = System.getenv("HOME")
    val inputPath = homeDir + "/sia/github-archive/2015-03-01-0.json"
    val ghLog = spark.read.json(inputPath)

    val pushes = ghLog.filter("type = 'PushEvent'")
    val grouped = pushes.groupBy("actor.login").count
    val ordered = grouped.orderBy(grouped("count").desc)

    // TODO expose empPath as app. param
    val empPath = homeDir + "/first-edition/ch03/ghEmployees.txt"
    val employees = Set() ++ (
      for {
        line <- fromFile(empPath).getLines
      } yield line.trim
    )
    val bcEmployees = sc.broadcast(employees)   // Broadcasts the employees set

    import spark.implicits._
    val isEmp = user => bcEmployees.value.contains(user)
    val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
    val filtered = ordered.filter(isEmployee($"login"))
    filtered.show()
  }
}

If you were writing this application for real, you would parameterize appName, appMaster, inputPath, and empPath before sending the application to be tested in a production-simulation environment. Who knows which Spark cluster the application is ultimately going to run on? Even if you knew all the parameters in advance, it would still be prudent to specify those parameters from the outside. It makes the application more flexible. One obvious consequence of hard-coding the parameters in the application code is that every time a parameter changes, the application has to be recompiled.

To run the application, follow these steps:


1 In the top menu, go to Run > Run Configurations.
2 At left in the dialog, select Scala Application > App$.
3 Click Run.

The top 20 commit counts are displayed at the bottom of your Eclipse output:

+---------------+-----+
|          login|count|
+---------------+-----+
|  KenanSulayman|   72|
|     manuelrp07|   45|
|        Somasis|   26|
|direwolf-github|   24|
|EmanueleMinotto|   22|
|        hansliu|   21|
|         digipl|   20|
|      liangyali|   19|
|       fbennett|   18|
|         shryme|   18|
|      jmarkkula|   18|
|        chapuni|   18|
|         qeremy|   16|
|     martagalaz|   16|
|     MichaelCTH|   15|
|        mfonken|   15|
|         tywins|   14|
|         lukeis|   12|
|     jschnurrer|   12|
|    eventuserum|   12|
+---------------+-----+

That's it! You have an application that's working on a one-hour subset of data. It's obvious what you need to tackle next. The task was to run the report daily, which means you need to include all 24 JSON files in the calculation.

3.2.7 Using the entire dataset
First, let's create a new Scala source file, to avoid messing with the working one-hour example. In the Package Explorer, right-click the App.scala file, select Copy, right-click App.scala again, and choose Paste. In the dialog that appears, enter GitHubDay.scala as the object's name, and click OK. When you double-click the new file to open it, notice the red around the filename. In the GitHubDay.scala file, you can see the reason for the error: the object's name10 is still App. Piece of cake: rename the object App to GitHubDay, and—nothing happens. Still red. Press Ctrl-S to save the file. Now the red is gone.

Next, open the SparkSession scaladoc (http://mng.bz/j9As) to see whether there is a way to create a DataFrame by ingesting multiple files with a single command.

10 Objects are Scala’s singletons. For more details, see http://mng.bz/y2ja.


Although the read method in the scaladoc says nothing about ingesting multiple files, let's try it. Change the line starting with val inputPath to include all JSON files in the github-archive folder, like this:

val inputPath = homeDir + "/sia/github-archive/*.json"
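
Because the json method is declared as json(paths: String*), you could also enumerate files explicitly instead of relying on the wildcard. A quick sketch (same homeDir as before; the wildcard remains the simpler choice for the daily report):

// Hypothetical variant: pass individual file paths as varargs
val ghLogTwoHours = spark.read.json(
  homeDir + "/sia/github-archive/2015-03-01-0.json",
  homeDir + "/sia/github-archive/2015-03-01-1.json")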

Run the application. In about 90 seconds, depending on your machine, you get the result for March 1, 2015:

+---------------+-----+
|          login|count|
+---------------+-----+
|  KenanSulayman| 1727|
|direwolf-github|  561|
|         lukeis|  288|
|           keum|  192|
|        chapuni|  184|
|     manuelrp07|  104|
|         shryme|  101|
|            uqs|   90|
|   jeff1evesque|   79|
|        BitKiwi|   68|
|         qeremy|   66|
|        Somasis|   59|
|         jvodan|   57|
|     BhawanVirk|   55|
|       Valicek1|   53|
|      evelynluu|   49|
|  TheRingMaster|   47|
|   larperdoodle|   42|
|         digipl|   42|
|      jmarkkula|   39|
+---------------+-----+

Yup, this really works. There is just one more thing to do: you need to assure yourself that all the dependencies are available when the application is run, so it can be used from anywhere and run on any Spark cluster.

3.3 Submitting the application
The application will be run on your in-house Spark cluster. When an application is run, it must have access to all the libraries it depends on; and because your application will be shipped off to be executed on a Spark cluster, you know it'll have access to all Spark libraries and their dependencies (Spark is always installed on all nodes of a cluster). To visualize the application's dependencies, open pom.xml from the root of your project and switch to the Dependency Hierarchy tab. You can see that your application depends only on libraries that are available in your cluster (scala, spark-core, and spark-sql).

After you finish packaging the application, you'll probably first send it to the testing team, so it can be tested before the word production is even spoken out loud.


But there is a potential problem: you can't be 100% sure that the Spark operations team (with their custom Spark builds) includes spark-sql when they prepare the testing environment. It may well happen that you send your app to be tested, only to get an angry email.

3.3.1 Building the uberjar
You have two options for including additional JAR files in Spark programs that are going to be run in production (there are other ways, but only these two are production-grade):
■ Use the --jars parameter of the spark-submit script, which will transfer all the listed JARs to the executors.
■ Build a so-called uberjar: a JAR that contains all needed dependencies.

To avoid fiddling with libraries manually11 on multiple clusters, let's build an uberjar. To illustrate this concept, you'll introduce another dependency into the application: another library you need to include and distribute along with it. Let's say you would like to include the commons-email library for simplified email sending, provided by Apache Commons (although you won't be using it in the code of the current example). The uberjar will need to contain only the code you wrote, the commons-email library, and all the libraries that commons-email depends on. Add the following dependency to pom.xml (right below the Spark SQL dependency):

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-email</artifactId>
    <version>1.3.1</version>
    <scope>compile</scope>
</dependency>

Looking again at the dependency hierarchy in pom.xml, you can see that commons-email depends on the mail and activation libraries. Some of those libraries may, in turn, have their own dependencies. For instance, mail also depends on activation. This dependency tree can grow arbitrarily long and branchy. You would probably start worrying, if you didn't know about maven-shade-plugin. Yes, Maven comes to the rescue yet again: maven-shade-plugin is used to build uberjars. We've also included a maven-shade-plugin configuration in pom.xml.

Because you wish to include the commons-email library in the uberjar, its scope needs to be set to compile. Maven uses the scope property of each dependency to determine the phase during which a dependency is required. compile, test, package, and provided are some of the possible values for scope. If scope is set to provided, the library and all its dependencies won't be included in the uberjar. If you omit the scope, Maven defaults to compile, which means the library is needed during application compilation and at runtime.

11 For example, in the test environment, you can provide libraries locally and let the driver (a machine from which you connect to a cluster) expose the libraries over a provisional HTTP server to all other nodes (see “Advanced Dependency Management” at http://spark.apache.org/docs/latest/submitting-applications.html).


As always, after you change pom.xml, update the project by right-clicking its root and selecting Maven > Update Project; then click OK without changing the defaults. Depending on the type of changes made in pom.xml, updating the project may or may not be necessary. If a project update is needed and you haven't yet done so, Eclipse draws your attention to the fact by displaying an error in the Problems view and putting a red marker on the project root.

3.3.2 Adapting the application
To adapt your application to be run using the spark-submit script, you need to make some modifications. First, remove the application name and Spark master settings from the SparkSession builder (because those will be provided as arguments to spark-submit) and create the session with a plain getOrCreate() call. The final result is shown in the following listing.

Listing 3.2 Final version of the adapted, parameterized application

package org.sia.chapter03App

import scala.io.Source.fromFile
import org.apache.spark.sql.SparkSession

object GitHubDay {
  def main(args : Array[String]) {
    val spark = SparkSession.builder().getOrCreate()

    val sc = spark.sparkContext

    val ghLog = spark.read.json(args(0))

    val pushes = ghLog.filter("type = 'PushEvent'")
    val grouped = pushes.groupBy("actor.login").count
    val ordered = grouped.orderBy(grouped("count").desc)

    val employees = Set() ++ (
      for {
        line <- fromFile(args(1)).getLines
      } yield line.trim
    )
    val bcEmployees = sc.broadcast(employees)

    import spark.implicits._
    val isEmp = user => bcEmployees.value.contains(user)
    val sqlFunc = spark.udf.register("SetContainsUdf", isEmp)
    val filtered = ordered.filter(sqlFunc($"login"))

    filtered.write.format(args(3)).save(args(2))
  }
}

The last line saves the result to an output file, but it does so in such a way that the person who's invoking spark-submit decides on the path and format of the written output (the currently available built-in formats are JSON, Parquet,12 and JDBC).
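
As an illustration, a hedged sketch of what that last line amounts to for one concrete choice of arguments (the output path here is made up):

// Equivalent call with the format and path spelled out
filtered.write.format("parquet").save("/home/spark/sia/emp-gh-push-output-parquet")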

12 Parquet is a fast columnar file format that contains a schema: http://parquet.apache.org/.


So, the application will take four arguments:
■ Path to the input JSON files
■ Path to the employees file
■ Path to the output file
■ Output format

To build the uberjar, select the project root in the Package Explorer, and then choose Run > Run Configurations. Choose Maven Build, and click New Launch Configuration. In the dialog that appears (see figure 3.6), enter Build uberjar in the Name field, click the Variables button, and choose project_loc in the dialog (which fills the Base Directory field with the ${project_loc} value).13 In the Goals field, enter clean package. Select the Skip Tests check box, save the configuration by clicking Apply, and trigger the build by clicking Run.


Figure 3.6 Specifying the run configuration for uberjar packaging

13 You should use a variable so that any future project can also use this run configuration (which wouldn’t be possible if you hard-coded the project’s path with, for example, a Browse Workspace button). This way, the build will be triggered against the project that is currently selected in the Package Explorer.


After running the build, you should see a result similar to the following (truncated for a cleaner view):

[INFO] --- maven-shade-plugin:2.4.2:shade (default) @ chapter03App ---
[INFO] Including org.scala-lang:scala-library:jar:2.10.6 in the shaded jar.
[INFO] Replacing original artifact with shaded artifact.
[INFO] Replacing /home/spark/workspace/chapter03App/target/chapter03App...
[INFO] Dependency-reduced POM written at: /home/spark/workspace/chapter0...
[INFO] Dependency-reduced POM written at: /home/spark/workspace/chapter0...
[INFO] Dependency-reduced POM written at: /home/spark/workspace/chapter0...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 23.670 s
[INFO] Finished at: 2016-04-23T10:55:06+00:00
[INFO] Final Memory: 20M/60M
[INFO] ------------------------------------------------------------------------

You should now have a file named chapter03App-0.0.1-SNAPSHOT.jar in your project's target folder. Because the file isn't visible in Eclipse, check the filesystem by right-clicking the target folder in the Package Explorer and selecting Show In > System Explorer (or open a terminal and navigate to that folder manually). Let's do a quick test on a local Spark installation. That should flush out most of the potential errors.

3.3.3 Using spark-submit
The Spark documentation on submitting applications (http://mng.bz/WY2Y) gives docs and examples of using the spark-submit shell script:

spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

spark-submit is a helper script that's used to submit applications to be executed on a Spark cluster. It's located in the bin subfolder of your Spark installation.

Before submitting the application, open another terminal to display the application log as it runs. Recall from chapter 2 that you changed the default log4j configuration so that the complete log is written to /usr/local/spark/logs/info.log. You can still see the log in real time by using the tail command, which displays content from the end of a file. It's similar to head, which you used to get the first line of a JSON file in section 3.2.1.


If you supply the -f parameter, tail waits until content is appended to the file; as soon as that happens, tail outputs it in the terminal. Issue the following command in your second terminal:

$ tail -f /usr/local/spark/logs/info.log

Bring the first terminal back to the front, and enter the following:

$ spark-submit --class org.sia.chapter03App.GitHubDay --master local[*]
➥ --name "Daily GitHub Push Counter" chapter03App-0.0.1-SNAPSHOT.jar
➥ "$HOME/sia/github-archive/*.json"
➥ "$HOME/first-edition/ch03/ghEmployees.txt"
➥ "$HOME/sia/emp-gh-push-output" "json"

Submitting a Python application
When submitting a Python application, you specify a Python filename instead of the application JAR file. You also skip the --class argument. For the GitHubDay example:

$ spark-submit --master local[*] --name "Daily GitHub Push Counter"
➥ GitHubDay.py "$HOME/sia/github-archive/*.json"
➥ "$HOME/sia/ghEmployees.txt"
➥ "$HOME/sia/emp-gh-push-output" "json"

For more information, see the Python version of the GitHubDay application in our online repository.

One or two minutes later, depending on your machine, the command will end without any errors. List the contents of the output folder:

$ cd $HOME/sia/emp-gh-push-output
$ ls -la

You see as many as 42 files (filenames are shortened here for nicer output):

-rw-r--r-- 1 spark spark 720 Apr 23 09:40 part-r-00000-b24f792c-...
-rw-rw-r-- 1 spark spark  16 Apr 23 09:40 .part-r-00000-b24f792c-....crc
-rw-r--r-- 1 spark spark 529 Apr 23 09:40 part-r-00001-b24f792c-...
-rw-rw-r-- 1 spark spark  16 Apr 23 09:40 .part-r-00001-b24f792c-....crc
-rw-r--r-- 1 spark spark 328 Apr 23 09:40 part-r-00002-b24f792c-...
-rw-rw-r-- 1 spark spark  12 Apr 23 09:40 .part-r-00002-b24f792c-....crc
-rw-r--r-- 1 spark spark 170 Apr 23 09:40 part-r-00003-b24f792c-...
-rw-rw-r-- 1 spark spark  12 Apr 23 09:40 .part-r-00003-b24f792c-....crc
-rw-r--r-- 1 spark spark   0 Apr 23 09:40 part-r-00004-b24f792c-...
-rw-rw-r-- 1 spark spark   8 Apr 23 09:40 .part-r-00004-b24f792c-....crc
...
-rw-r--r-- 1 spark spark   0 Apr 22 19:20 _SUCCESS
-rw-rw-r-- 1 spark spark   8 Apr 22 19:20 ._SUCCESS.crc


The presence of the _SUCCESS file signifies that the job finished successfully. The .crc files are used to verify file validity by calculating the cyclic redundancy check (CRC)14 code for each data file. The file named ._SUCCESS.crc signifies that the CRC calculation for all the files was successful. To see the contents of the first data file, you can use the cat command, which sends the contents of the file to the standard output (terminal):

$ cat $HOME/sia/emp-gh-push-output/part-r-00000-b24f792c-c0d0-425b-85db-
➥ 3322aab8f3e0
{"login":"KenanSulayman","count":1727}
{"login":"direwolf-github","count":561}
{"login":"lukeis","count":288}
{"login":"keum","count":192}
{"login":"chapuni","count":184}
{"login":"manuelrp07","count":104}
{"login":"shryme","count":101}
{"login":"uqs","count":90}
{"login":"jeff1evesque","count":79}
{"login":"BitKiwi","count":68}
{"login":"qeremy","count":66}
{"login":"Somasis","count":59}
{"login":"jvodan","count":57}
{"login":"BhawanVirk","count":55}
{"login":"Valicek1","count":53}
{"login":"evelynluu","count":49}
{"login":"TheRingMaster","count":47}
{"login":"larperdoodle","count":42}
{"login":"digipl","count":42}
{"login":"jmarkkula","count":39}
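
As a quick sanity check (a sketch, not one of the book's steps), you could also load the saved output back into a DataFrame from the Spark shell and confirm the top committers:

// Read the job's JSON output back and inspect it
val saved = spark.read.json(System.getenv("HOME") + "/sia/emp-gh-push-output")
saved.orderBy(saved("count").desc).show(5)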

3.4 Summary
■ Online resources describe how to use IntelliJ IDEA with Spark, but Eclipse resources are still hard to come by. That's why we chose Eclipse for writing the Spark programs in this chapter.
■ We've prepared an Archetype called scala-archetype-sparkinaction (available from our GitHub repository), which is used to create a starter Spark project in which the versions and dependencies are taken care of.
■ The GitHub archive site (https://www.githubarchive.org/) provides GitHub archives for arbitrary time periods.
■ Spark SQL and DataFrame (which is a DataSet containing Row objects) provide a means for ingesting JSON data into Spark.
■ DataFrameReader's json method, accessed through SparkSession's read, provides a means for ingesting JSON data. Each line in an input file needs to be a complete JSON object.

14 A cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data. http://en.wikipedia.org/wiki/Cyclic_redundancy_check.


■ DataSet's filter method can parse SQL expressions and return a subset of the data.
■ You can run a Spark application directly from Eclipse.
■ The SparkSession class contains the udf method, which is used to register user-defined functions.
■ Broadcast variables are used to send a variable exactly once to each node in a cluster.
■ maven-shade-plugin is used to build an uberjar containing all of your application's dependencies.
■ You can run a Spark application on a cluster using the spark-submit script.

The Spark API in depth

This chapter covers
■ Working with key-value pairs
■ Data partitioning and shuffling
■ Grouping, sorting, and joining data
■ Using accumulators and broadcast variables

The previous two chapters explained RDDs and how to manipulate them with basic actions and transformations. You've seen how to run Spark programs from the Spark REPL and how to submit standalone applications to Spark.

In this chapter, you'll delve further into the Spark Core API and become acquainted with a large number of Spark API functions. But don't faint just yet! We'll be gentle, go slowly, and take you safely through these complicated and comprehensive, but necessary, topics.

You'll also learn how to use RDDs of key-value pairs called pair RDDs. You'll see how Spark partitions data and how you can change and take advantage of RDD partitioning. Related to partitioning is shuffling, which is an expensive operation, so you'll focus on avoiding unnecessary shuffling of data. You'll also learn how to group, sort, and join data. You'll learn about accumulators and broadcast variables and how to use them to share data among Spark executors while a job is running.



Finally, you'll get into more advanced aspects of the inner workings of Spark, including RDD dependencies. Roll up your sleeves!

4.1 Working with pair RDDs
Storing data as key-value pairs offers a simple, general, and extensible data model, because each key-value pair can be stored independently and it's easy to add new types of keys and new types of values. This extensibility and simplicity have made this practice fundamental to several frameworks and applications. For example, many popular caching systems and NoSQL databases, such as memcached and Redis, are key-value stores. Hadoop's MapReduce also operates on key-value pairs (as you can see in appendix A).

Keys and values can be simple types such as integers and strings, as well as more complex data structures. Data structures traditionally used to represent key-value pairs are associative arrays, also called dictionaries in Python and maps in Scala and Java. In Spark, RDDs containing key-value tuples are called pair RDDs.

Although you don't have to use data in Spark in the form of key-value pairs (as you've seen in the previous chapters), pair RDDs are well suited (and indispensable) for many use cases. Having keys along with the data enables you to aggregate, sort, and join the data, as you'll soon see. But before doing any of that, the first step is to create a pair RDD, of course.

4.1.1 Creating pair RDDs
You can create a pair RDD in a couple of ways. Some SparkContext methods produce pair RDDs by default (for example, methods for reading files in Hadoop formats, which are covered later). You can also use a keyBy transformation, which accepts a function (let's call it f) for generating keys from an ordinary RDD's elements and maps each element into a tuple (f(element), element). Finally, you can transform the data into two-element tuples manually; a short sketch of these two approaches appears a bit further down.

Creating pair RDDs in Java
To create a pair RDD in Java, you need to use the JavaPairRDD class. You can create a JavaPairRDD object with the JavaSparkContext.parallelizePairs method by providing a list of Tuple2[K, V] objects. Or you can use the mapToPair transformation on a JavaRDDLike object and pass it a function that will be used to map each RDD's element into Tuple2[K, V] objects. A number of other Java RDD transformations return RDDs of type JavaPairRDD.

No matter which method you use, if you create an RDD of two-element tuples, the pair RDD functions automagically become available, through the concept known as implicit conversion. This concept was described in chapter 2, so feel free to glance back if you don’t remember how it functions. The class hosting these special pair RDD functions is PairRDDFunctions, and RDDs of two-element tuples get implicitly converted to an instance of this class.
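
Here's a brief sketch of the keyBy and manual-tuple approaches mentioned earlier, using toy data rather than the book's dataset:

// keyBy derives the key from each element; map builds the (key, value) tuples by hand
val words = sc.parallelize(List("apple", "banana", "cherry"))
val keyedByFirstLetter = words.keyBy(word => word(0))      // (a,apple), (b,banana), (c,cherry)
val keyedManually = words.map(word => (word, word.length)) // (apple,5), (banana,6), (cherry,6)

Both results are RDDs of two-element tuples, so both pick up the pair RDD functions described next.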


Let’s see what functionalities are implicitly added to pair RDDs.

4.1.2 Basic pair RDD functions
Let's say your marketing department wants to give complimentary products to customers according to some rules. They want you to write a program that will go through yesterday's transactions and add the complimentary ones. The rules for adding complimentary products are these:
■ Send a bear doll to the customer who made the most transactions.
■ Give a 5% discount for two or more Barbie Shopping Mall Playsets bought.
■ Add a toothbrush for more than five dictionaries bought.
■ Send a pair of pajamas to the customer who spent the most money overall.

The complimentary products should be represented as additional transactions with a price of 0.00. Marketing would also like to know which transactions were made by the customers who are getting the complimentary products.

To begin this task, start up your Spark shell. Assuming you're running in the spark-in-action VM, logged in as spark (in which case spark-shell is on your PATH), you can issue the spark-shell command. Make sure you're starting it from the /home/spark directory. The Spark in the VM is already configured to start a cluster with the local[*] master by default, so you don't have to provide the --master argument:

$ spark-shell

We assume you cloned our GitHub repository and that the ch04_data_transactions.txt file1 is available from the first-edition/ch04 directory (otherwise, you can get the file from https://github.com/spark-in-action/first-edition/tree/master/ch04). Each line in the file contains a transaction date, time, customer ID, product ID, quantity, and product price, delimited with hash signs. The following snippet creates a pair RDD with customer IDs as keys and the complete transaction data as values:

scala> val tranFile = sc.textFile("first-edition/ch04/" +
     "ch04_data_transactions.txt")                                   // B: Loads data
scala> val tranData = tranFile.map(_.split("#"))                     // C: Parses data
scala> var transByCust = tranData.map(tran => (tran(2).toInt, tran)) // D: Creates the pair RDD

After you execute the code, tranFile B contains lines from the file and tranData C contains an array of strings containing the parsed data. The customer ID is in the third column, so in order to create the pair RDD transByCust D, you map the parsed data into a tuple whose first element is the element with index 2 (converted to an integer)

1 The file was generated using the Mockaroo website: www.mockaroo.com.

and whose second element is the complete parsed transaction. You declare transByCust to be a variable so that you can keep RDDs containing new and changed transactions (which you'll calculate later) in a single variable and just update it.

GETTING KEYS AND VALUES
Now that you have your pair RDD, you decide to first see how many customers bought anything yesterday. You can get a new RDD containing only the keys or only the values with the pair RDD transformations prosaically named keys and values. You use this line to get a list of customer IDs, remove any duplicates, and count the number of unique customer IDs:

scala> transByCust.keys.distinct().count()
res0: Long = 100

The RDD returned by the keys transformation should contain 1,000 elements and include duplicate IDs. To get the number of different customers who bought a product, you first need to eliminate the duplicates with the distinct() transformation, which gives you 100 as a result. The values transformation behaves analogously, but you don't need it right now. These two transformations are shortcuts for map(_._1) and map(_._2), only easier to type.

COUNTING VALUES PER KEY
What was the task, again? Oh, yes. Give a complimentary bear doll to the customer who made the most transactions. Each line in the transactions file is one transaction. So, to find out how many transactions each customer made, it's enough to count the lines per customer.

The corresponding RDD function is the countByKey action. As a reminder: unlike RDD transformations, RDD actions immediately return the result as a Java (Scala or Python) object. countByKey gives you a Scala Map containing the number of occurrences of each key:

scala> transByCust.countByKey()
res1: scala.collection.Map[Int,Long] = Map(69 -> 7, 88 -> 5, 5 -> 11,
10 -> 7, 56 -> 17, 42 -> 7, 24 -> 9, 37 -> 7, 25 -> 12, 52 -> 9, 14 -> 8,
20 -> 8, 46 -> 9, 93 -> 12, 57 -> 8, 78 -> 11, 29 -> 9, 84 -> 9, 61 -> 8,
89 -> 9, 1 -> 9, 74 -> 11, 6 -> 7, 60 -> 4,...

The sum of all values of this Map is 1,000, of course, which is the total number of transactions in the file. To calculate this, you can use the following snippet:

scala> transByCust.countByKey().values.sum
res3: Long = 1000

map and sum are Scala's standard methods and aren't part of Spark's API.
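
Keep in mind that countByKey is an action, so its result is collected to the driver as a local Map. With only 100 distinct customers that's perfectly fine; for a huge number of keys you'd keep the counts distributed instead. A hedged sketch using reduceByKey (covered later in this chapter):

scala> val txCountsRdd = transByCust.mapValues(_ => 1).reduceByKey(_ + _)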


You can also use standard Scala methods to find the customer who made the most purchases:

scala> val (cid, purch) = transByCust.countByKey().toSeq.sortBy(_._2).last
cid: Int = 53
purch: Long = 19

The customer with ID 53 made 19 transactions, and you need to give them a complimentary bear doll (product ID 4). Let's create a variable, complTrans, which will hold the complimentary products (transactions) as Arrays of Strings:

scala> var complTrans = Array(Array("2015-03-30", "11:59 PM", "53", "4", "1", "0.00"))

You'll later add the transactions from this Array to the final transactions RDD.

LOOKING UP VALUES FOR A SINGLE KEY
Marketing would also like to know exactly which transactions were made by the customers who are getting the complimentary products. You can get that information for the customer with ID 53 using the lookup() action:

scala> transByCust.lookup(53)
res1: Seq[Array[String]] = WrappedArray(Array(2015-03-30, 6:18 AM, 53, 42,
5, 2197.85), Array(2015-03-30, 4:42 AM, 53, 3, 6, 9182.08), ...

The WrappedArray class that you see in the result is Scala's way to present an Array as a Seq (mutable sequence) object through implicit conversion. A word of warning, though: the lookup action will transfer the values to the driver, so you have to make sure the values will fit in its memory. You can use some plain Scala functions to pretty-print the result so that you can copy it into an email and send it to the marketing folks:

scala> transByCust.lookup(53).foreach(tran => println(tran.mkString(", ")))
2015-03-30, 6:18 AM, 53, 42, 5, 2197.85
2015-03-30, 4:42 AM, 53, 3, 6, 9182.08
...

USING THE MAPVALUES TRANSFORMATION TO CHANGE VALUES IN A PAIR RDD
The second task is to give a 5% discount for two or more Barbie Shopping Mall Playsets bought. The mapValues transformation helps you do this: it changes the values contained in a pair RDD without changing the associated keys. And that's exactly what you need. The Barbie Shopping Mall Playset has ID 25, so you apply the discount like this:

scala> transByCust = transByCust.mapValues(tran => {
     if(tran(3).toInt == 25 && tran(4).toDouble > 1)
         tran(5) = (tran(5).toDouble * 0.95).toString
     tran
   })


The function you give to the mapValues transformation checks whether the product ID (the element with index 3 of the transaction array) is 25 and the quantity (index 4) is greater than 1; in that case, it decreases the total price (index 5) by 5%. Otherwise, it leaves the transaction array untouched. You assign the new RDD to the same variable, transByCust, just to make things simpler.

USING THE FLATMAPVALUES TRANSFORMATION TO ADD VALUES TO KEYS
You still have two more tasks to do. Let's tackle the dictionary one first: you need to add a complimentary toothbrush (ID 70) to customers who bought five or more dictionaries (ID 81). That means you need to add transactions (represented as arrays) to the transByCust RDD.

The flatMapValues transformation fits the bill because it enables you to change the number of elements corresponding to a key by mapping each value to zero or more values. That means you can add new values for a key or remove a key altogether. The signature of the transformation function it expects is V => TraversableOnce[U] (we said in chapter 2 that TraversableOnce is just a special name for a collection). From each of the values in the returned collection, a new key-value pair is created for the corresponding key. If the transformation function returns an empty list for one of the values, the resulting pair RDD will have one fewer element. If the transformation function returns a list with two elements for one of the values, the resulting pair RDD will have one more element. Note that the mapped values can be of a different type than before. So this is what you do:

scala> transByCust = transByCust.flatMapValues(tran => {
     if(tran(3).toInt == 81 && tran(4).toDouble >= 5) {        // Filters by product and quantity
        val cloned = tran.clone()                              // Clones the transaction array
        cloned(5) = "0.00"; cloned(3) = "70"; cloned(4) = "1"; // Sets the clone's price to 0.00, product ID to 70, and quantity to 1
        List(tran, cloned)                                     // Returns two elements
     }
     else
        List(tran)                                             // Returns one element
   })

The anonymous function given to the transformation maps each transaction into a list. The list contains only the original transaction if the condition isn't met, or an additional transaction with a complimentary toothbrush if the condition holds true. The final transByCust RDD now contains 1,006 elements (because there are 6 transactions with 5 or more dictionaries in the file).
USING THE REDUCEBYKEY TRANSFORMATION TO MERGE ALL VALUES OF A KEY
reduceByKey lets you merge all the values of a key into a single value of the same type. The merge function you pass to it merges two values at a time until there is only one value left. The function should be associative; otherwise you won't get the same result every time you perform the reduceByKey transformation on the same RDD.
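As a quick, hedged sketch of the idea (the next subsection solves the same task with foldByKey), summing each customer's spend with reduceByKey could look like this; addition is associative, so the order in which Spark merges the partial sums doesn't change the result:

scala> val amounts = transByCust.mapValues(t => t(5).toDouble)
scala> amounts.reduceByKey((v1, v2) => v1 + v2).collect()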


You could use reduceByKey for your final task—finding the customer who spent the most overall—but you can also use the similar foldByKey transformation.
USING THE FOLDBYKEY TRANSFORMATION AS AN ALTERNATIVE TO REDUCEBYKEY
foldByKey does the same thing as reduceByKey, except that it requires an additional parameter, zeroValue, in an extra parameter list that comes before the one with the reduce function. The complete method signature is as follows:

foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]

zeroValue should be a neutral value (0 for addition, 1 for multiplication, Nil for lists, and so forth). It’s applied on the first value of a key (using the input function), and the result is applied on the second value. You should be careful here, because, unlike in the foldLeft and foldRight methods in Scala, zeroValue may be applied multiple times. This happens because of RDD’s parallel nature. You finally get to the last task: finding the customer who spent the most. But your original dataset isn’t appropriate for that, because you need to sum up the prices, and the dataset contains arrays of strings. So you first map values to contain only the prices and then use foldByKey. Finally, you sort the resulting Array by price and take the largest element:

scala> val amounts = transByCust.mapValues(t => t(5).toDouble)
scala> val totals = amounts.foldByKey(0)((p1, p2) => p1 + p2).collect()
res0: Array[(Int, Double)] = Array((84,53020.619999999995), (96,36928.57), (66,52130.01), (54,36307.04), ...
scala> totals.toSeq.sortBy(_._2).last
res1: (Int, Double) = (76,100049.0)

Whoa! Not bad! This person spent $100,049 on your company's website! They definitely deserve a pair of pajamas. But before giving it to them, let's illustrate the point about zeroValue being applied multiple times and try the same foldByKey operation with a zeroValue of 100,000:

scala> amounts.foldByKey(100000)((p1, p2) => p1 + p2).collect()
res2: Array[(Int, Double)] = Array((84,453020.62), (96,436928.57), (66,452130.0099999999), (54,436307.04), ...

You can see that a “zero value” of 100,000 was added to the values more than once (as many times as there are partitions in the RDD), which isn’t something you typically want to do (unless you like random results). Now you give a pair of pajamas (ID 63) to the customer with ID 76 by adding a transaction to the temporary array complTrans that you created before:

scala> complTrans = complTrans :+ Array("2015-03-30", "11:59 PM", "76", "63", "1", "0.00")


The complTrans array should have two transactions now. All that is left for you to do is add these two transactions to the transByCust RDD (adding client IDs as keys and complete transactions arrays as values), which holds the other changes you made, and save everything to a new file:

scala> transByCust = transByCust.union(sc.parallelize(complTrans).map(t => (t(2).toInt, t)))
scala> transByCust.map(t => t._2.mkString("#")).saveAsTextFile("ch04output-transByCust")

Someone else will now use a batch job to transform this file into shipping orders. You can relax now; you've finished all the tasks marketing gave you.
USING AGGREGATEBYKEY TO GROUP ALL VALUES OF A KEY
While you're relaxing, we'll go on to explain a few things about the aggregateByKey transformation. But don't relax too much: leave your Spark shell open, because you're still going to need it.
aggregateByKey is similar to foldByKey and reduceByKey in that it merges values and takes a zero value, but it also transforms values to another type. In addition to the zeroValue argument, it takes two functions as arguments: a transform function for transforming values from type V to type U (with the signature (U, V) => U) and a merge function for merging the transformed values (with the signature (U, U) => U). The double parameter list is a Scala feature known as currying (www.scala-lang.org/old/node/135). If given only the zeroValue argument (the only argument in the first parentheses), aggregateByKey returns a parameterized function that takes the other two arguments. But you wouldn't normally use aggregateByKey like that; you would provide both sets of parameters at the same time.
Let's say you need a list of all products your customers purchased. You can use aggregateByKey to accomplish that:

scala> val prods = transByCust.aggregateByKey(List[String]())(   // B: an empty list as the zero value
     (prods, tran) => prods ::: List(tran(3)),                   // C: adds products to lists
     (prods1, prods2) => prods1 ::: prods2)                      // D: concatenates two lists of the same key
scala> prods.collect()
res0: Array[(Int, List[String])] = Array((88,List(47.149.147.123, 74.211.5.196,...)), (82,List(8.140.151.84, 23.130.185.187,...)), ...)

NOTE The ::: operator is a Scala list operator for concatenating two lists.

The zeroValue B is an empty List. The first combine function C in aggregateByKey is applied to elements of each of the RDD's partitions, and the second one D is used for merging the results. But to understand fully what is going on here, you should first understand data partitioning.


4.2 Understanding data partitioning and reducing data shuffling
Data partitioning is Spark's mechanism for dividing data between multiple nodes in a cluster. It's a fundamental aspect of RDDs that can have a big effect on performance and resource consumption. Shuffling is another important aspect of Spark, closely related to data partitioning, that you'll learn about in this section. Many RDD operations offer ways of manipulating data partitioning and shuffling, so our in-depth exploration of the Spark API wouldn't be complete without explaining them.
In part 3 of the book, we'll talk about Spark deployment types, where we discuss Spark cluster options in depth. For the purpose of explaining partitions, think of a cluster as a set of interconnected machines (nodes) that are being used in parallel.
Each part (piece or slice) of an RDD is called a partition.2 When you load a text file from your local filesystem into Spark, for example, the file's contents are split into partitions, which are evenly distributed to nodes in a cluster. More than one partition may end up on the same node. The sum of all those partitions forms your RDD. This is where the word distributed in resilient distributed dataset comes from. Figure 4.1 shows the distribution of lines of a text file loaded into an RDD in a five-node cluster. The original file had 15 lines of text, so each RDD partition was formed with 3 lines of text. Each RDD maintains a list of its partitions and an optional list of preferred locations for computing the partitions.

NOTE The list of an RDD's partitions can be obtained from that RDD's partitions field. It's an Array, so you can get the number of RDD partitions by reading its partitions.size field (aggrdd.partitions.size for the previous example).
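A quick check in the shell makes this concrete. This is just a sketch with made-up data; parallelize with an explicit number of slices gives you exactly that many partitions:

scala> val rdd = sc.parallelize(1 to 100, 4)   // request 4 partitions explicitly
scala> rdd.partitions.size
res0: Int = 4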

The number of RDD partitions is important because, in addition to influencing data distribution throughout the cluster, it also directly determines the number of tasks that will be running RDD transformations. If this number is too small, the cluster will be underutilized. Furthermore, memory problems could result, because working sets might get too big to fit into the memory of executors. We recommend using three to four times more partitions than there are cores in your cluster. Moderately larger values shouldn't pose a problem, so feel free to experiment. But don't get too crazy, because management of a large number of tasks could create a bottleneck. Let's now see how data partitioning is achieved in Spark.

4.2.1 Using Spark's data partitioners
Partitioning of RDDs is performed by Partitioner objects that assign a partition index to each RDD element. Two implementations are provided by Spark: HashPartitioner and RangePartitioner. Pair RDDs also accept custom partitioners.

2 Partitions were previously called splits. The term split can still be found in Spark’s source code (it’s eventually going to be refactored).



Figure 4.1 Simplified look at partitions of an RDD in a five-node cluster. The RDD was created by loading a text file using the textFile method of SparkContext. The loaded text file had 15 lines of text, so each partition was formed with 3 lines of text.

UNDERSTANDING HASHPARTITIONER
HashPartitioner is the default partitioner in Spark. It calculates a partition index based on an element's Java hash code (or a key's hash code in pair RDDs), according to this simple formula: partitionIndex = hashCode % numberOfPartitions. The partition index is determined quasi-randomly; consequently, the partitions most likely won't be exactly the same size. In large datasets with a relatively small number of partitions, though, the algorithm is likely to distribute data evenly among them.
The default number of data partitions when using HashPartitioner is determined by the Spark configuration parameter spark.default.parallelism. If that parameter isn't specified by the user, it will be set to the number of cores in the cluster. (Chapter 12 covers setting Spark configuration parameters.)
UNDERSTANDING RANGEPARTITIONER
RangePartitioner partitions data of sorted RDDs into roughly equal ranges. It samples the contents of the RDD passed to it and determines the range boundaries according to the sampled data. You aren't likely to use RangePartitioner directly.
UNDERSTANDING PAIR RDD CUSTOM PARTITIONERS
Pair RDDs can be partitioned with custom partitioners when it's important to be precise about the placement of data among partitions (and among tasks working on them). You may want to use a custom partitioner, for example, if it's important that each task


processes only a specific subset of key-value pairs, where all of them belong to a single database, single database table, single user, or something similar.
Custom partitioners can be used only on pair RDDs, by passing them to pair RDD transformations. Most pair RDD transformations have two additional overloaded methods: one that takes an additional Int argument (the desired number of partitions) and another that takes an additional argument of the (custom) Partitioner type. The method that takes the number of partitions uses the default HashPartitioner. For example, these two lines of code are equal, because they both apply HashPartitioner with 100 partitions:

rdd.foldByKey(afunction, 100)
rdd.foldByKey(afunction, new HashPartitioner(100))

All other pair RDD transformations offer these two additional versions, except mapValues and flatMapValues (these two always preserve partitioning). You can specify a custom partitioner by using the second version. If pair RDD transformations don't specify a partitioner, the number of partitions used will be the maximum number of partitions of parent RDDs (RDDs that were transformed into this one). If none of the parent RDDs have a partitioner defined, HashPartitioner will be used with the number of partitions specified by the spark.default.parallelism parameter.
Another method for changing the default placement of data among partitions in pair RDDs is using the default HashPartitioner, but changing the keys' hash code according to some algorithm. This might be simpler to do, depending on your use case, and it could perform better by avoiding inadvertent shuffling, which is covered next.
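To make the custom-partitioner route more concrete, here is a minimal sketch (not part of the marketing example; the key format and class name are made up) of a Partitioner that keeps all keys of the same department in the same partition:

import org.apache.spark.Partitioner

// Hypothetical keys of the form "department-productId", e.g. "toys-25"
class DepartmentPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val dept = key.toString.split("-")(0)
    val mod = dept.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod   // keep the partition index nonnegative
  }
}

// It would then be passed to a pair RDD transformation, for example:
// rdd.foldByKey(zeroValue, new DepartmentPartitioner(10))(someFunction)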

4.2.2 Understanding and avoiding unnecessary shuffling
Physical movement of data between partitions is called shuffling. It occurs when data from multiple partitions needs to be combined in order to build partitions for a new RDD. When grouping elements by key, for example, Spark needs to examine all of the RDD's partitions, find elements with the same key, and then physically group them, thus forming new partitions.
To visualize what happens with partitions during a shuffle, we'll use the previous example with the aggregateByKey transformation (from section 4.1.2). The shuffle that occurs during that transformation is illustrated in figure 4.2. For this example, assume that you have only three partitions on three worker nodes and you'll simplify the data. As you may recall, aggregateByKey takes two functions: a transform function for transforming and merging two values to a value of the target type, and a merge function for merging the transformed values themselves. But what we didn't say before is that the first one merges values in partitions, and the second one merges them between partitions.



Figure 4.2 A shuffle during the example aggregateByKey transformation on an RDD with three partitions. The transform function, passed to aggregateByKey, merges values in partitions. The merge function merges values between partitions during the shuffle phase. Intermediate files hold values merged per partition and are used during the shuffle phase.

This was the example:

scala> val prods = transByCust.aggregateByKey(List[String]())(
     (prods, tran) => prods ::: List(tran(3)),
     (prods1, prods2) => prods1 ::: prods2)

This gathers all values of a single key into a list, so that prods has one list per key. The lists are presented in figure 4.2 in parentheses. The transform function puts all values of each key in a single partition (partitions P1 to P3) into a list. Spark then writes this data to intermediate files on each node. In the next phase, the merge function is called to merge lists from different partitions, but of the same key, into a single list for each key. The default partitioner (HashPartitioner) then kicks in and puts each key in its proper partition.


Tasks that immediately precede and follow the shuffle are called map and reduce tasks, respectively. The results of map tasks are written to intermediate files (often to the OS's filesystem cache only) and read by reduce tasks. In addition to being written to disk, the data is sent over the network, so it's important to try to minimize the number of shuffles during Spark jobs.
Although most RDD transformations don't require shuffling, for some of them shuffling happens only under certain conditions. So to minimize the number of shuffle occurrences, you need to understand these conditions.
SHUFFLING WHEN EXPLICITLY CHANGING PARTITIONERS
We already mentioned pair RDD custom partitioners. Shuffling always occurs when using a custom partitioner in methods that allow you to do that (most pair RDD transformations). Shuffling also occurs if a different HashPartitioner than the previous one is used. Two HashPartitioners are the same if they have the same number of partitions (because they'll always choose the same partition for the same object, if the number of partitions is the same). So shuffling will also occur if a HashPartitioner with a different number of partitions than the previous one is used in the transformation.

TIP Because changing the partitioner provokes shuffles, the safest approach, performance-wise, is to use a default partitioner as much as possible and avoid inadvertently causing a shuffle.

For example, the following lines always cause shuffling to occur (assuming rdd's parallelism is different than 100):

rdd.aggregateByKey(zeroValue, 100)(seqFunc, comboFunc).collect() rdd.aggregateByKey(zeroValue, new CustomPartitioner())(seqFunc, ➥ comboFunc).collect()

In the first example, the number of partitions is changed; and in the second one, a custom partitioner is used. A shuffle is invoked in both cases.
SHUFFLE CAUSED BY PARTITIONER REMOVAL
Sometimes a transformation causes a shuffle, although you were using the default partitioner. map and flatMap transformations remove the RDD's partitioner, which doesn't cause a shuffle per se. But if you transform the resulting RDD (with one of the transformations previously mentioned, for example), even using the default partitioner, a shuffle will occur. In the following snippet, the second line doesn't induce a shuffle, but the third one does:

scala> val rdd:RDD[Int] = sc.parallelize(1 to 10000)
scala> rdd.map(x => (x, x*x)).map(_.swap).count()                    // B: no shuffle
scala> rdd.map(x => (x, x*x)).reduceByKey((v1, v2)=>v1+v2).count()   // C: shuffle occurs


The second line B creates a pair RDD by using the map transformation, which removes the partitioner, and then switches its keys and values by using another map transformation. That alone won't cause a shuffle. The third line C uses the same pair RDD as before, but this time the reduceByKey transformation instigates a shuffle.
Here's a complete list of transformations that cause a shuffle after map or flatMap transformations:
■ Pair RDD transformations that can change the RDD's partitioner: aggregateByKey, foldByKey, reduceByKey, groupByKey, join, leftOuterJoin, rightOuterJoin, fullOuterJoin, and subtractByKey
■ RDD transformations: subtract, intersection, and groupWith
■ sortByKey transformation (which always causes a shuffle)
■ partitionBy or coalesce with shuffle=true (covered in the next section)
OPTIMIZING SHUFFLING WITH AN EXTERNAL SHUFFLE SERVICE
During shuffling, executors need to read files from one another (a shuffle is pull-based). If some executors get killed, other executors can no longer get shuffle data from them, and the data flow is interrupted. An external shuffle service is meant to optimize the exchange of shuffle data by providing a single point from which executors can read intermediate shuffle files. If an external shuffle service is enabled (by setting spark.shuffle.service.enabled to true), one external shuffle server is started per worker node.
PARAMETERS THAT AFFECT SHUFFLING
Spark has two shuffle implementations: sort-based and hash-based. Sort-based shuffling has been the default since version 1.2 because it's more memory efficient and creates fewer files.3 You can define which shuffle implementation to use by setting the value of the spark.shuffle.manager parameter to either hash or sort.
The spark.shuffle.consolidateFiles parameter specifies whether to consolidate intermediate files created during a shuffle. For performance reasons, we recommend that you change this to true (the default value is false) if you're using an ext4 or XFS filesystem.4
Shuffling can require a lot of memory for aggregation and co-grouping. The spark.shuffle.spill parameter specifies whether the amount of memory used for these tasks should be limited (the default is true). In that case, any excess data will spill over to disk. The memory limit is specified by the spark.shuffle.memoryFraction parameter (the default is 0.2). Furthermore, the spark.shuffle.spill.compress parameter tells Spark whether to use compression for the spilled data (the default is again true).

3 For more information, see http://mng.bz/s6EA. 4 For more information, see http://mng.bz/O304.


The spill threshold shouldn't be too high, because it can cause out-of-memory exceptions. If it's too low, spilling can occur too frequently, so it's important to find a good balance. Keeping the default value should work well in most situations.
Some useful additional parameters are as follows:
■ spark.shuffle.compress specifies whether to compress intermediate files (the default is true).
■ spark.shuffle.spill.batchSize specifies the number of objects that will be serialized or deserialized together when spilling to disk. The default is 10,000.
■ spark.shuffle.service.port specifies the port the server will listen on if an external shuffle service is enabled.
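In a standalone application (as opposed to the shell, where sc already exists), these parameters would typically be set on the SparkConf before the SparkContext is created. A hedged sketch, using the Spark 1.x parameters discussed above:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-tuning-example")            // hypothetical application name
  .set("spark.shuffle.manager", "sort")            // sort-based shuffle (the default since 1.2)
  .set("spark.shuffle.consolidateFiles", "true")   // consolidate intermediate files
  .set("spark.shuffle.service.enabled", "true")    // enable the external shuffle service
val sc = new SparkContext(conf)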

4.2.3 Repartitioning RDDs
Now you can return to your exploration of the Spark API with operations that let you change data partitioning at runtime. Why would you want to do that? As we discussed previously, in some situations you need to explicitly repartition an RDD in order to distribute the workload more efficiently or avoid memory problems. Some Spark operations, for example, default to a small number of partitions, which results in partitions that have too many elements (take too much memory) and don't offer adequate parallelism.
Repartitioning of RDDs can be accomplished with the partitionBy, coalesce, repartition, and repartitionAndSortWithinPartitions transformations.
REPARTITIONING WITH PARTITIONBY
partitionBy is available only on pair RDDs. It accepts only one parameter: the desired Partitioner object. If the partitioner is the same as the one used before, partitioning is preserved and the RDD remains the same. Otherwise, a shuffle is scheduled and a new RDD is created.
REPARTITIONING WITH COALESCE AND REPARTITION
coalesce is used for either reducing or increasing the number of partitions. The full method signature is coalesce(numPartitions: Int, shuffle: Boolean = false). The second (optional) parameter specifies whether a shuffle should be performed (false by default). If you want to increase the number of partitions, it's necessary to set the shuffle parameter to true. The repartitioning algorithm balances new partitions so they're based on the same number of parent partitions, matching the preferred locality (machines) as much as possible, but also trying to balance partitions across the machines. The repartition transformation is just a coalesce with shuffle set to true.
It's important to understand that if a shuffle isn't specified, all transformations leading up to coalesce, if they themselves didn't include a shuffle, will be run using the newly specified number of executors (the number of partitions). If the shuffle is specified, the previous transformations are run using the original number of executors, and only the future ones (the ones after the coalesce) will be run with the new number of partitions.
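A minimal sketch (made-up data) of these repartitioning calls in the shell:

scala> val rdd = sc.parallelize(1 to 10000, 10)   // 10 partitions
scala> rdd.coalesce(2).partitions.size            // reducing needs no shuffle
res0: Int = 2
scala> rdd.coalesce(20, true).partitions.size     // increasing requires shuffle = true
res1: Int = 20
scala> rdd.repartition(20).partitions.size        // same as coalesce(20, true)
res2: Int = 20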


REPARTITIONING WITH REPARTITIONANDSORTWITHINPARTITIONS
The final transformation for repartitioning RDDs is repartitionAndSortWithinPartitions. It's available only on sortable RDDs (pair RDDs with sortable keys), which are covered later, but we mention it here for the sake of completeness. It also accepts a Partitioner object and, as its name suggests, sorts the elements within each partition. This offers better performance than calling repartition and manually sorting, because part of the sorting can be done during the shuffle. A shuffle is always performed when using repartitionAndSortWithinPartitions.

4.2.4 Mapping data in partitions
The last aspect of data partitioning we want to tell you about is mapping data in partitions. Spark offers a way to apply a function not to an RDD as a whole, but to each of its partitions separately. This can be a precious tool in optimizing your transformations. Many can be rewritten to map data in partitions only, thus avoiding shuffles. RDD operations for working on partitions are mapPartitions, mapPartitionsWithIndex, and glom, a specialized partition-mapping transformation.
UNDERSTANDING MAPPARTITIONS AND MAPPARTITIONSWITHINDEX
Similar to map, mapPartitions accepts a mapping function, but the function has to have the signature Iterator[T] => Iterator[U]. This way, it can be used for iterating over elements within each partition and creating partitions for the new RDD. mapPartitionsWithIndex is different in that its mapping function also accepts the partition's index: (Int, Iterator[T]) => Iterator[U]. The partition's index can then be used in the mapping function (see the short sketch that follows this discussion).
The mapping function can transform the input Iterator into a new one with some of Scala's Iterator functions.5 For example:
■ You can map, flatMap, zip, and zipWithIndex values in an Iterator.
■ You can take only some elements with take(n) or takeWhile(condition).
■ You can skip elements with drop(n) or dropWhile(condition).
■ You can filter the elements.
■ You can get an Iterator with a subset of elements with slice(m,n).
All of these create a new Iterator that can be used as the result of the mapping function.
Both transformations accept an additional optional parameter preservePartitioning, which is false by default. If it's set to true, the new RDD will preserve partitioning of the parent RDD. If it's set to false, the partitioner will be removed, with all the consequences we discussed previously.
Mapping partitions can help you solve some problems more efficiently than using other transformations that don't operate on partitions explicitly. For example, if the mapping function involves expensive setup (such as opening a database connection), it's much better to do it once per partition than once per element.
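Here is a minimal sketch (made-up data; the exact split of elements across partitions is an implementation detail) of mapPartitionsWithIndex, tagging every element with the index of the partition it lives in:

scala> val rdd = sc.parallelize(1 to 10, 3)
scala> rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x))).collect()
res0: Array[(Int, Int)] = Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9), (2,10))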

5 For a complete reference, see http://mng.bz/CWA6.


COLLECTING PARTITION DATA WITH A GLOM TRANSFORMATION
glom (the word means to grab) gathers elements of each partition into an array and returns a new RDD with those arrays as elements. The number of elements in the new RDD is equal to the number of its partitions. The partitioner is removed in the process. For a quick example of a glom transformation, we'll use parallelized random data:

scala> val list = List.fill(500)(scala.util.Random.nextInt(100))
list: List[Int] = List(88, 59, 78, 94, 34, 47, 49, 31, 84, 47, ...)
scala> val rdd = sc.parallelize(list, 30).glom()
rdd: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[0]
scala> rdd.collect()
res0: Array[Array[Int]] = Array(Array(88, 59, 78, 94,...), ...)
scala> rdd.count()
res1: Long = 30

This creates an RDD of 30 partitions and gloms it. The count of array objects in the new RDD, containing data from each partition, is also 30.
glom could be used as a quick way to put all of an RDD's elements into a single array. You could first repartition the RDD into one partition and then call glom. The result is an RDD with a single array element containing all of the RDD's previous elements. Of course, this applies only to RDDs small enough so that all of their elements fit into a single partition.
4.3 Joining, sorting, and grouping data
Imagine now that marketing comes to you with another request. They want additional data for a report: names of products with totals sold, sorted alphabetically; a list of products the company didn't sell yesterday; and some statistics about yesterday's transactions per customer: average, maximum, minimum, and the total price of products bought. Being thorough and eager for knowledge as you are, you'll try out all the ways this can be done with the core Spark API.6

4.3.1 Joining data
Okay, you need names of products with totals sold (you'll do the sorting part later). You already have transaction data loaded in the RDD tranData (used in section 4.1), but you first need to key the transactions by product ID:

val transByProd = tranData.map(tran => (tran(3).toInt, tran))

Then you calculate the totals per product using the reduceByKey transformation:

val totalsByProd = transByProd.mapValues(t => t(5).toDouble). reduceByKey{case(tot1, tot2) => tot1 + tot2}

6 The list of transformations in these sections is rather long, but in order to write good Spark programs, it’s necessary to become thoroughly acquainted with RDD transformations and know when to apply each of them. As your mother would probably say: it’s for your own good!


Names of products are kept in a different file (ch04_data_products.txt from the online repository), and you'll obviously need to join the product names with yesterday's transaction data. You load the products file and convert it into a pair RDD:

scala> val products = sc.textFile("first-edition/ch04/"+
     "ch04_data_products.txt").
     map(line => line.split("#")).
     map(p => (p(0).toInt, p))

How can you join one with the other? To join contents of several RDDs, Spark offers classic joins, similar to joins in relational databases, but also transformations such as zip, cartesian, and intersection. Let's see how they work.
THE FOUR CLASSIC JOIN TRANSFORMATIONS
The four classic joins in Spark function just like RDBMS joins of the same names, but are performed on pair RDDs. When called on an RDD of (K, V) elements, and passing in an RDD of (K, W) elements, the four join transformations give different results:
■ join—Equivalent to an inner join in RDBMS, this returns a new pair RDD with the elements (K, (V, W)) containing all possible pairs of values from the first and second RDDs that have the same keys. For the keys that exist in only one of the two RDDs, the resulting RDD will have no elements.
■ leftOuterJoin—Instead of (K, (V, W)), this returns elements of type (K, (V, Option(W))). The resulting RDD will also contain the elements (key, (v, None)) for those keys that don't exist in the second RDD. Keys that exist only in the second RDD will have no matching elements in the new RDD.
■ rightOuterJoin—This returns elements of type (K, (Option(V), W)); the resulting RDD will also contain the elements (key, (None, w)) for those keys that don't exist in the first RDD. Keys that exist only in the first RDD will have no matching elements in the new RDD.
■ fullOuterJoin—This returns elements of type (K, (Option(V), Option(W))); the resulting RDD will contain both (key, (v, None)) and (key, (None, w)) elements for those keys that exist in only one of the two RDDs.
If the RDDs you're trying to join have duplicate keys, these elements will be joined multiple times. Similar to some of the other pair-RDD transformations, all four join transformations have two additional versions that expect a Partitioner object or a number of partitions. If a number of partitions is specified, HashPartitioner with that number of partitions will be used. If no partitioner is specified (nor the number of partitions), Spark takes the first partitioner from the two RDDs being joined. If even the two RDDs don't have the partitioner defined, a new HashPartitioner is created, using the number of partitions equal to either spark.default.parallelism (if it's defined), or the largest number of partitions in the two RDDs. In a word, it's complicated.
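A tiny illustration with made-up data (a sketch only; the ordering of collect's results may vary):

scala> val left = sc.parallelize(List((1, "a"), (2, "b")))
scala> val right = sc.parallelize(List((2, "x"), (3, "y")))
scala> left.join(right).collect()            // Array((2,(b,x)))
scala> left.leftOuterJoin(right).collect()   // Array((1,(a,None)), (2,(b,Some(x))))
scala> left.rightOuterJoin(right).collect()  // Array((2,(Some(b),x)), (3,(None,y)))
scala> left.fullOuterJoin(right).collect()   // Array((1,(Some(a),None)), (2,(Some(b),Some(x))), (3,(None,Some(y))))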


Now that you know all this, you call join to attach product data to the totals:

scala> val totalsAndProds = totalsByProd.join(products)
scala> totalsAndProds.first()
res0: (Int, (Double, Array[String])) = (84,(75192.53,Array(84, Cyanocobalamin, 2044.61, 8)))

And now you have, for each product ID, a tuple containing two elements: the total and the complete product data as an Array of Strings.
Okay, that was easy. But what about the list of products the company didn't sell yesterday? That's obviously the leftOuterJoin or rightOuterJoin transformation, depending on the RDD you call it on. The RDD with more data (products, in this case) should be on the side mentioned in the transformation name. So these two lines have (almost) the same result:

val totalsWithMissingProds = products.leftOuterJoin(totalsByProd)
val totalsWithMissingProds = totalsByProd.rightOuterJoin(products)

The difference in the results is the position of the Option object. In the case of rightOuterJoin, totalsWithMissingProds will contain elements of type (Int, (Option[Double], Array[String])). For missing products, the Option object will be equal to None. To retrieve the missing products, you filter the RDD and then map it so that you only get the product data, without the key and the None objects. Assuming you used rightOuterJoin:

val missingProds = totalsWithMissingProds. filter(x => x._2._1 == None). map(x => x._2._2)

Finally, you print out the contents of the missingProds RDD:

scala> missingProds.foreach(p => println(p.mkString(", ")))
43, Tomb Raider PC, 2718.14, 1
63, Pajamas, 8131.85, 3
3, Cute baby doll, battery, 1808.79, 2
20, LEGO Elves, 4589.79, 4

NOTE The Option object in the result from *OuterJoin transformations is Scala's way to avoid NullPointerExceptions. The result from the join transformation doesn't contain Option objects because null elements aren't possible in that case. An Option object can be either a None or a Some object. To check whether an Option has a value, you can call isEmpty; and to get the value itself, you can call get. A convenient shortcut is to call getOrElse(default). It will return the default expression if the Option is None, or get otherwise.
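A quick sketch of those Option methods in plain Scala (no Spark involved):

scala> val price: Option[Double] = Some(12.5)
scala> val missing: Option[Double] = None
scala> price.isEmpty
res0: Boolean = false
scala> price.getOrElse(0.0)
res1: Double = 12.5
scala> missing.getOrElse(0.0)
res2: Double = 0.0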

Well done! But you may be wondering if you can do this another way.


USING SUBTRACT AND SUBTRACTBYKEY TRANSFORMATIONS TO REMOVE COMMON VALUES
The answer is: yes, you can. subtract returns elements from the first RDD that aren't present in the second one. It works on ordinary RDDs and compares complete elements (not just their keys or values).
subtractByKey works on pair RDDs and returns an RDD with pairs from the first RDD whose keys aren't in the second RDD. The second RDD doesn't need to have values of the same type as the first one. This is perfect for your task. It gives you elements from products whose keys don't exist in totalsByProd:

val missingProds = products.subtractByKey(totalsByProd).values

The result is the same:

scala> missingProds.foreach(p => println(p.mkString(", ")))
20, LEGO Elves, 4589.79, 4
3, Cute baby doll, battery, 1808.79, 2
43, Tomb Raider PC, 2718.14, 1
63, Pajamas, 8131.85, 3

Both subtract and subtractByKey have two additional versions that accept the number of partitions and a Partitioner object.
JOINING RDDS WITH THE COGROUP TRANSFORMATION
Yet another way of finding both the names of purchased products and the products that no one bought is the cogroup transformation. cogroup performs a grouping of values from several RDDs by key and returns an RDD whose values are arrays of Iterable objects (a fancy name for a Scala collection) containing values from each RDD. So, cogroup groups values of several RDDs by key and then joins these grouped RDDs. You can pass up to three RDDs to it, all of which need to have the same key type as the enclosing RDD. For example, the signature of the cogroup function for cogrouping three RDDs (including the enclosing one) is as follows:

cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]):
  RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

If you cogroup totalsByProd and products, you'll get an RDD containing keys present in one or the other and the matching values accessible through two Iterators:

scala> val prodTotCogroup = totalsByProd.cogroup(products)
prodTotCogroup: org.apache.spark.rdd.RDD[(Int, (Iterable[Double], Iterable[Array[String]]))]...

If one of the two RDDs doesn't contain one of the keys, the corresponding iterator will be empty. So this is how you can filter out the missing products:

scala> prodTotCogroup.filter(x => x._2._1.isEmpty).
     foreach(x => println(x._2._2.head.mkString(", ")))
43, Tomb Raider PC, 2718.14, 1


63, Pajamas, 8131.85, 3
3, Cute baby doll, battery, 1808.79, 2
20, LEGO Elves, 4589.79, 4

The expression x._2._1 selects the iterator with matching values from totalsByProd (totals as Doubles), and x._2._2 selects the iterator with products as Arrays of Strings. The x._2._2.head expression takes the first element of the iterator (the two RDDs don’t contain duplicates, so the Iterator objects contain at most one element). You can obtain the totals and the products (the totalsAndProds RDD, which you created using the join transformation) in a similar way:

val totalsAndProds = prodTotCogroup.filter(x => !x._2._1.isEmpty). map(x => (x._2._2.head(0).toInt,(x._2._1.head, x._2._2.head)))

This totalsAndProds RDD now has the same elements as the one obtained with the join transformation.
USING THE INTERSECTION TRANSFORMATION
intersection and the cartesian, zip, and zipPartitions transformations aren't particularly useful for your current task, but we'll mention them here for the sake of completeness.
intersection accepts an RDD of the same type as the enclosing one and returns a new RDD that contains elements present in both RDDs. It isn't useful for this use case because your transactions already contain only a subset of the products, and there is no point in intersecting them. But imagine that totalsByProd contains products from different departments, and you only want to see which products from a certain department (contained in products RDD) are among them. Then you need to map both RDDs to product IDs and then intersect them:

totalsByProd.map(_._1).intersection(products.map(_._1))

intersection has two additional versions: one accepting a number of partitions and the other accepting a Partitioner object.
COMBINING TWO RDDS WITH THE CARTESIAN TRANSFORMATION
A cartesian transformation makes a cartesian product (a mathematical operation) of two RDDs in the form of an RDD of tuples (T, U) containing all possible pairs of elements from the first RDD (containing elements of type T) and second RDD (containing elements of type U). For example, say you have rdd1 and rdd2 defined as follows:

scala> val rdd1 = sc.parallelize(List(7,8,9))
scala> val rdd2 = sc.parallelize(List(1,2,3))

cartesian gives you this:

scala> rdd1.cartesian(rdd2).collect()
res0: Array[(Int, Int)] = Array((7,1), (7,2), (7,3), (8,1), (9,1), (8,2), (8,3), (9,2), (9,3))


Naturally, in large datasets, cartesian can incur a lot of data transfer because data from all partitions needs to be combined. And the resulting RDD contains as many elements as the product of the two RDDs' sizes, so the memory requirements also aren't negligible.
You can use cartesian to compare elements of two RDDs. For example, you could use it to get all pairs from the previous two RDDs that are divisible:

scala> rdd1.cartesian(rdd2).filter(el => el._1 % el._2 == 0).collect()

The result is as follows: res1: Array[(Int, Int)] = Array((7,1), (8,1), (9,1), (8,2), (9,3))

You could also use it on your transactions data set to compare all transactions with each other (tranData.cartesian(tranData)) and detect fishy behavior (for example, too many transactions from the same customer in a short period of time).
JOINING RDDS WITH THE ZIP TRANSFORMATION
The zip and zipPartitions transformations are available on all RDDs (not only pair RDDs). zip functions just like the zip function in Scala: if you call it on an RDD with elements of type T and give it an RDD with elements of type U, it will create an RDD of pairs (T, U) with the first pair having the first elements from each RDD, the second pair having the second elements, and so forth. Unlike Scala's zip function, though, it will throw an error if the two RDDs don't have the same number of partitions and the same number of elements in them. Two RDDs will satisfy these requirements if one of them is a result of a map transformation on the other one, but this makes zip a bit hard to use in Spark. Zipping is otherwise not easy to do, if you think about it, because processing data sequentially isn't inherent to Spark, so zip can come in handy under the strict circumstances just outlined. Here's an example:

scala> val rdd1 = sc.parallelize(List(1,2,3))
scala> val rdd2 = sc.parallelize(List("n4","n5","n6"))
scala> rdd1.zip(rdd2).collect()
res1: Array[(Int, String)] = Array((1,n4), (2,n5), (3,n6))

You can get around the requirement for all partitions to have the same number of elements with the zipPartitions transformation.
JOINING RDDS WITH THE ZIPPARTITIONS TRANSFORMATION
zipPartitions is similar to mapPartitions, in that it enables you to iterate over elements in partitions, but you use it to combine several RDDs' partitions (four RDDs at most, including this one). All RDDs need to have the same number of partitions (but not the same number of elements in them).
zipPartitions accepts two sets of arguments. In the first set, you give it RDDs; and in the second, a function that takes a matching number of Iterator objects used for accessing elements in each partition. The function must return a new Iterator,


which can be a different type (matching the resulting RDD). This function has to take into account that RDDs may have different numbers of elements in partitions and guard against iterating beyond Iterator lengths.

NOTE zipPartitions still (v.1.4) isn't available in Python.

The zipPartitions transformation takes one more optional argument (in the first set of arguments): preservesPartitioning, which is false by default. If you're certain that your function will leave the data properly partitioned, you can set it to true. Otherwise, the partitioner is removed so a shuffle will be performed during one of the future transformations.
Again, a quick example. You'll take two RDDs—one containing 10 integers in 10 partitions, and the other one containing 8 strings in 10 partitions—and zip their partitions to create a string-formatted representation of their elements:

scala> val rdd1 = sc.parallelize(1 to 10, 10)
scala> val rdd2 = sc.parallelize((1 to 8).map(x=>"n"+x), 10)
scala> rdd1.zipPartitions(rdd2, true)((iter1, iter2) => {
     iter1.zipAll(iter2, -1, "empty")
       .map({case(x1, x2)=>x1+"-"+x2})
   }).collect()
res1: Array[String] = Array(1-empty, 2-n1, 3-n2, 4-n3, 5-n4, 6-empty, 7-n5, 8-n6, 9-n7, 10-n8)

Scala's zipAll function is used here to combine two iterators because it can zip two collections of different sizes. If the first iterator has more elements than the second one, the remaining elements will be zipped along with the dummy value of empty (and with the value of -1 in the opposite case). In the resulting RDD, you can see that rdd2 had zero elements in partitions 1 and 6. How you handle these empty values depends on your use case. Note that you can also change the number of elements in the partitions by using an iterator function such as drop or flatMap. For a refresher, consult section 4.2.4.

4.3.2 Sorting data
Okay, you have your products together with corresponding transaction totals in the RDD totalsAndProds. You also have to sort the results alphabetically. But how do you do that?
The main transformations for sorting RDD data are sortByKey, sortBy, and repartitionAndSortWithinPartitions. The last one was covered in section 4.2.3. As we said, it can repartition and sort more efficiently than calling those two operations separately. Using sortBy is easy:

scala> val sortedProds = totalsAndProds.sortBy(_._2._2(1))
scala> sortedProds.collect()
res0: Array[(Int, (Double, Array[String]))] = Array((90,(48601.89,Array(90, AMBROSIA TRIFIDA POLLEN, 5887.49, 1))), (94,(31049.07,Array(94, ATOPALM MUSCLE AND JOINT, 1544.25, 7))), (87,(26047.72,Array(87, Acyclovir, 6252.58, 4))), ...


The expression _._2 references the value, which is a tuple (total, transaction array). _._2._2(1) references the second element of the transaction Array. This is the same as keying the RDD by product name (keyBy(_._2._2(1))) and calling sortByKey. If you further map this RDD, the ordering will be preserved.
It's trickier if you have keys that are complex objects. Similar to pair RDD transformations, which are available only on RDDs with key-value tuples through implicit conversion, sortByKey and repartitionAndSortWithinPartitions are available only on pair RDDs with orderable keys.

JAVA In Java, the sortByKey method takes an object that implements the Comparator interface (http://mng.bz/5Suh). That's the standard way of sorting in Java. There is no sortBy method in the JavaRDD class.

You have two ways to make a class orderable in Scala and use it for sorting RDDs: via the Ordered trait and via the Ordering trait.
MAKING A CLASS ORDERABLE USING THE ORDERED TRAIT
The first way to make a class orderable is to create a class that extends Scala's Ordered trait, which is similar to Java's Comparable interface. The class extending Ordered has to implement the compare function, which takes as an argument an object of the same class against which to perform the comparison. The function returns a positive integer if the enclosing object (this) is greater than the one taken as an argument (other), a negative integer if the enclosing object is smaller, and zero if the two should be considered equal. The sortByKey transformation requires an argument of type Ordering (discussed next), but there's an implicit conversion in Scala from Ordered to Ordering so you can safely use this method.
For example, the following case class can be used for keys in sortable RDDs for sorting employees according to their last names:

case class Employee(lastName: String) extends Ordered[Employee] {
  override def compare(that: Employee) =
    this.lastName.compare(that.lastName)
}
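As a quick check of the idea (made-up data), an RDD keyed by Employee becomes sortable once the class extends Ordered, because Scala derives the required Ordering from it:

scala> val employees = sc.parallelize(List((Employee("Smith"), 1), (Employee("Anderson"), 2)))
scala> employees.sortByKey().collect()
res0: Array[(Employee, Int)] = Array((Employee(Anderson),2), (Employee(Smith),1))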

MAKING A CLASS ORDERABLE USING THE ORDERING TRAIT
The second way to make a class orderable uses the Ordering trait, which is similar to Java's Comparator interface. Let's presume you can't change the preceding Employee class and make it extend Ordered, but you'd still like to sort employees by their last names. In that case, you can define an object of type Ordering[Employee] somewhere in the scope of the function calling sortByKey. For example:

implicit val emplOrdering = new Ordering[Employee] {
  override def compare(a: Employee, b: Employee) =
    a.lastName.compare(b.lastName)
}


or

implicit val emplOrdering: Ordering[Employee] = Ordering.by(_.lastName)

If defined within its scope, this implicit object will be picked up by the sortByKey transformation (called on an RDD with keys of type Employee), and the RDD will become sortable. The previous sort by product name worked because the Scala standard library contains Orderings for simple types. But if you had a complex key, you would need to implement an approach similar to the one sketched previously.
PERFORMING A SECONDARY SORT
A few more things about sorting are worth mentioning. Sometimes you may also want to sort values within keys. For example, you may group transactions by customer ID and wonder how to further sort them by transaction time. There's a relatively new groupByKeyAndSortValues transformation, which enables you to do exactly this. For an RDD of (K, V) pairs, it expects an implicit Ordering[V] object to be present in the scope and a single argument: either a Partitioner object or a number of partitions. It will give you an RDD of (K, Iterable(V)) elements with values sorted according to the implicit Ordering object. But it will first group values by key, which can be expensive in terms of memory and network.
A cheaper method for performing a secondary sort, without grouping, is this:7
1 Map an RDD[(K, V)] to RDD[((K, V), null)]. For example: rdd.map(kv => (kv, null)).
2 Use a custom partitioner that partitions only on the K part of the new composite key, so that all elements with the same K part end up in the same partition.
3 Call repartitionAndSortWithinPartitions, which has to sort by the complete composite key (K, V): first by keys, then by values.
4 Map the RDD back to RDD[(K, V)].
An example of the result of this operation is shown in figure 4.3 (a sketch of the code follows the figure caption).

Figure 4.3 Example of a secondary sort using repartitionAndSortWithinPartitions: each resulting partition holds all values of its keys, sorted first by key and then by value.
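A minimal sketch of those four steps, with made-up data and a hypothetical KeyPartitioner class (not code from the chapter's example):

import org.apache.spark.{HashPartitioner, Partitioner}

// Partitions only on the K part of the ((K, V), null) composite key (step 2)
class KeyPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (k, _) => delegate.getPartition(k)
  }
}

val rdd = sc.parallelize(List((1, 5), (3, 2), (1, 1), (2, 4), (3, 1)))
val secondarySorted = rdd
  .map(kv => (kv, null))                                      // step 1: composite key
  .repartitionAndSortWithinPartitions(new KeyPartitioner(3))  // steps 2 and 3
  .map(_._1)                                                  // step 4: back to (K, V)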

7 Thanks to Patrick Wendell for this idea. See https://issues.apache.org/jira/browse/SPARK-3655.


As the figure shows, this procedure gives you partitions sorted first by key and then by value, and you can iterate over them by calling mapPartitions. No grouping of values occurs, which makes this method better from a performance point of view.
USING TOP AND TAKEORDERED TO FETCH SORTED ELEMENTS
To fetch the first or last n objects from an RDD, you can use the takeOrdered(n) and top(n) actions, respectively. What first or last means, on an RDD with elements of type T, is determined by an implicit Ordering[T] object defined in the scope. Consequently, for pair RDDs, top and takeOrdered won't return elements sorted by keys (as sortByKey does), but by (K, V) tuples. So for pair RDDs, you need to have an implicitly defined Ordering[(K, V)] object in scope (which is true for simple keys and values).
top and takeOrdered don't perform full sorting of data among partitions. Instead, they first take the first (or last) n elements from each partition, merge the results, and then take the first (or last) n elements from the merged list. This is much faster than doing sortBy and then calling take, because much less data needs to be transferred over the network. But just like collect, top and takeOrdered bring all n results into the driver's memory, so you should make sure n isn't too big.
Wait a minute—we got carried away and drifted away from your task! You retrieved the products that no one bought yesterday, and you joined the products with their totals from yesterday's transactions (using several methods). And you managed to sort the results by product names.
There is one more task for you to do, and that is to calculate statistics about yesterday's transactions per customer: average, maximum, minimum, and total price of products bought. Transactions in your transaction file are organized per customer and product so you first need to group them by customer. The statistics can then be calculated in several ways. You'll use the combineByKey transformation. It's used for grouping data, so let's first say a few words about that.

4.3.3 Grouping data
To group data means to aggregate data into a single collection based on certain criteria. Several pair RDD transformations can be used for grouping data in Spark: aggregateByKey (used in section 4.1.2), groupByKey (and the related groupBy), and combineByKey.
GROUPING DATA WITH THE GROUPBYKEY AND GROUPBY TRANSFORMATIONS
The groupByKey transformation creates a pair RDD containing a single key-value pair for all elements with the same key. For example:

(A, 1)
(A, 2)        (A, (1, 2))
(B, 1)   ->   (B, (1, 3))
(B, 3)        (C, (1))
(C, 1)


Each pair’s value becomes an Iterable for iterating over all the key’s values. The resulting RDD, therefore, is of type RDD[(K, Iterable[V])]. groupBy is available on non-pair RDDs and offers a shortcut for transforming an RDD to a pair RDD and then calling groupByKey. So, having an rdd of type RDD[T] (containing elements of type T) and the function f: T => K for creating keys of type K, the following two lines are equivalent:

rdd.map(x => (f(x), x)).groupByKey()
rdd.groupBy(f)
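For instance, on a small made-up pair RDD, groupByKey gathers values like this (key order in the result may vary):

scala> val pairs = sc.parallelize(List(("A", 1), ("A", 2), ("B", 1), ("B", 3), ("C", 1)))
scala> pairs.groupByKey().mapValues(_.toList).collect()
res0: Array[(String, List[Int])] = Array((A,List(1, 2)), (B,List(1, 3)), (C,List(1)))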

The groupByKey transformation can be memory expensive because it has to get all the values of each key in memory, so you have to be careful when using it. We recommend using aggregateByKey, reduceByKey, or foldByKey for simpler scenarios that don't require full grouping to occur, such as calculating the average per key. As is the case with other pair RDD transformations, groupByKey and groupBy can also accept the desired number of partitions or a custom partitioner (two additional implementations are available).
GROUPING DATA WITH THE COMBINEBYKEY TRANSFORMATION
combineByKey is the transformation you'll use to calculate the statistics per customer. It's a generic transformation that lets you specify custom functions for merging values into combined values and for merging the combined values themselves. It expects a partitioner to be specified. You can also specify two optional arguments: the mapSideCombine flag and a custom serializer. The full function signature is as follows:

def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)]

The createCombiner function is used to create the first combined value (of type C) from the first key’s value (of type V) in each partition. The mergeValue function is used to merge additional key values with the combined value, in a single partition, and the mergeCombiners function is used to merge combined values themselves among partitions. As we said previously, if the partitioner (a required argument) is the same as the existing partitioner (which also means the existing one isn’t missing), there’s no need for a shuffle, because all the elements with the same key are already in the same (and correct) partition. That’s why the mergeCombiners function won’t be used if there’s no shuffle (and if spilling to disk isn’t enabled, which is another story), but you still have to provide it.


The last two optional parameters are relevant only if the shuffle will be performed. With the mapSideCombine parameter, you can specify whether to merge combined values in partitions before the shuffle. The default is true. This parameter isn't relevant if there's no shuffle, because in that case, combined values are always merged in partitions. Finally, with the serializer parameter, you can go into expert mode and specify a custom serializer to use if you don't want to use the default (as specified by the Spark configuration parameter spark.serializer).
The numerous options make combineByKey versatile and flexible. No wonder it was used to implement aggregateByKey, groupByKey, foldByKey, and reduceByKey. Feel free to check the Spark source code (http://mng.bz/Z6fp) to see how it was done. Let's now look at how it can help you in your task.
You have the transByCust pair RDD that you created in section 4.1.2. It holds transactions keyed by customer IDs (but not grouped by customers). To calculate the average, maximum, minimum, and total price of products bought per customer, the combined values need to keep track of minimum, maximum, count, and total, while merging values. The average is then calculated by dividing the total by the count. Each transaction in transByCust contains a quantity (index 4 in the transaction array), so that value needs to be taken into account. And you also need to parse the numeric values because you have them as Strings:

def createComb = (t:Array[String]) => {       // B: create combiners function
  val total = t(5).toDouble
  val q = t(4).toInt
  (total/q, total/q, q, total)
}
def mergeVal:((Double,Double,Int,Double),Array[String]) =>       // C: merge values function
    (Double,Double,Int,Double) = {
  case((mn,mx,c,tot),t) => {
    val total = t(5).toDouble
    val q = t(4).toInt
    (scala.math.min(mn,total/q), scala.math.max(mx,total/q), c+q, tot+total)
  }
}
def mergeComb:((Double,Double,Int,Double),(Double,Double,Int,Double)) =>   // D: merge combiners function
    (Double,Double,Int,Double) = {
  case((mn1,mx1,c1,tot1),(mn2,mx2,c2,tot2)) =>
    (scala.math.min(mn1,mn2), scala.math.max(mx1,mx2), c1+c2, tot1+tot2)
}
val avgByCust = transByCust.combineByKey(createComb, mergeVal, mergeComb,    // E: performs the combineByKey transformation
    new org.apache.spark.HashPartitioner(transByCust.partitions.size)).
  mapValues({case(mn,mx,cnt,tot) => (mn,mx,cnt,tot,tot/cnt)})                // F: adds the average to the tuple

The create combiners function B needs to set the count to the parsed quantity (variable q), the total to the parsed variable total, and the minimum and maximum to the price of a single product (total/q). The merge values function C increases the count by the parsed quantity and the total by the parsed total and updates the minimum and maximum with the price of a single product (total/quantity).


The merge combiners function (mergeComb) sums the counts and totals and compares the minimums and maximums of the two combined values. Then the combineByKey transformation is finally performed (the partitioning is preserved by using the previous number of partitions). This gives you an RDD containing tuples of minimum price, maximum price, count, and total price for each customer. Finally, you add the average value to the tuple by using the mapValues transformation. You check the contents of avgByCust:

scala> avgByCust.first()
res0: (Int, (Double, Double, Int, Double, Double)) = (96,(856.2885714285715,4975.08,57,36928.57,647.869649122807))

And that looks good. Finally, you’re done! You can save the results from avgByCust and totalsAndProds in a CSV format (with the # separator):

scala> totalsAndProds.map(_._2).map(x => x._2.mkString("#")+", "+x._1).
  saveAsTextFile("ch04output-totalsPerProd")
scala> avgByCust.map{ case (id, (min, max, cnt, tot, avg)) =>
  "%d#%.2f#%.2f#%d#%.2f#%.2f".format(id, min, max, cnt, tot, avg)}.
  saveAsTextFile("ch04output-avgByCust")

But before calling it a day, let's learn a few more things about RDD dependencies, accumulators, and broadcast variables. You never know when those might come in handy.

4.4 Understanding RDD dependencies

In this section, we'll take a closer look at two aspects of Spark's inner workings: RDD dependencies and RDD checkpointing, both important mechanisms in Spark. We need these two to complete the picture of the Spark Core API. RDD dependencies make RDDs resilient. They also affect the creation of Spark jobs and tasks.

4.4.1 RDD dependencies and Spark execution

We said in the previous chapters that Spark's execution model is based on directed acyclic graphs (DAGs). You've surely seen graphs before: they consist of vertices (nodes) and edges (lines) connecting them. In directed graphs, edges have a direction from one vertex to another (but they can also be bidirectional). In directed acyclic graphs, edges connect vertices in such a way that you can never reach the same vertex twice if you follow the direction of the edges (there are no cycles in the graph; hence the name). In Spark DAGs, RDDs are vertices and dependencies are edges. Every time a transformation is performed on an RDD, a new vertex (a new RDD) and a new edge (a dependency) are created. The new RDD depends on the old one, so the direction of the edge is from the child RDD to the parent RDD. This graph of dependencies is also called an RDD lineage.


Two basic types of dependencies exist: narrow and wide (or shuffle). They determine whether a shuffle will be performed, according to the rules explained in section 4.2.2. If no data transfer between partitions is required, a narrow dependency is created. A wide dependency is created in the opposite case, which means a shuffle is performed. By the way, a shuffle is always performed when joining RDDs. Narrow dependencies can be divided further into one-to-one dependencies and range dependencies. Range dependencies are used only for the union transformation; they combine multiple parent RDDs in a single dependency. One-to-one dependencies are used in all other cases when no shuffling is required. Let’s illustrate all this with an example. This example is similar to the one used to examine shuffling after partitioner removal. The code looks like this:

val list = List.fill(500)(scala.util.Random.nextInt(10))  // random list of 500 integers
val listrdd = sc.parallelize(list, 5)                      // RDD with five partitions
val pairs = listrdd.map(x => (x, x*x))                     // map to create a pair RDD
val reduced = pairs.reduceByKey((v1, v2) => v1+v2)         // sum up the RDD's values by key
val finalrdd = reduced.mapPartitions(
  iter => iter.map({case(k,v) => "K="+k+",V="+v}))         // string representation of key-value pairs
finalrdd.collect()

The transformations aren’t meaningful or useful in real life, but they help illustrate the concept of RDD dependencies. The resulting lineage (DAG) is presented in figure 4.4; each rounded box containing lines represents a partition. Fat arrows in the figure rep- resent transformations used to create an RDD. Each transformation produces a new RDD of a specific RDD subclass, and each new RDD becomes a child of the preceding one. The arc arrows represent dependencies in the lineage chain. You can say that finalrdd (a MapPartitionsRDD) depends on reduced; reduced (a ShuffleRDD)

[Figure 4.4 RDD example dependencies: the lineage chain listrdd (ParallelCollectionRDD) -> pairs (MapPartitionsRDD) -> reduced (ShuffledRDD) -> finalrdd (MapPartitionsRDD), created by parallelize, map, reduceByKey, and mapPartitions, connected by OneToOne dependencies around a single Shuffle dependency.]


Light lines in the figure represent dataflow during the execution of the RDD program. As you can see, the two map transformations don't require data to be exchanged between partitions, and their dataflow is confined within partition boundaries. That's why they produce narrow (OneToOne) dependencies. reduceByKey, on the other hand, requires a shuffle, as discussed in section 4.2.2 (because the previous map transformation has removed the partitioner), so a wide (or shuffle) dependency is created. During execution, data is exchanged between the partitions of the pairs RDD and the reduced RDD because each key-value pair needs to get into its proper partition. You can get a textual representation of the RDD's DAG, with information similar to figure 4.4, by calling toDebugString. For the example finalrdd, the result is as follows:

scala> println(finalrdd.toDebugString)
(6) MapPartitionsRDD[4] at mapPartitions at <console>:20 []
 |  ShuffledRDD[3] at reduceByKey at <console>:18 []
 +-(5) MapPartitionsRDD[2] at map at <console>:16 []
    |  ParallelCollectionRDD[1] at parallelize at <console>:14 []

The RDDs in this output appear in reverse order from that in figure 4.4. Here the first RDD in the DAG appears last, so the direction of dependencies is upward. Examining this output can be useful in trying to minimize the number of shuffles your program performs. Every time you see a ShuffledRDD in the lineage chain, you can be sure that a shuffle will be performed at that point (if the RDD is executed). The numbers in parentheses (5 and 6) show the number of partitions of the corresponding RDD.

4.4.2 Spark stages and tasks

All of this is important when considering how Spark packages work to be sent to executors. Every job is divided into stages based on the points where shuffles occur. Figure 4.5 shows the two stages created in the example.

[Figure 4.5 Example DAG divided into stages: Stage 1 covers parallelize, map, and reduceByKey (listrdd, pairs, reduced); Stage 2 covers the second map and collect (finalrdd). The shuffle dependency marks the boundary between the two stages.]


Stage 1 encompasses transformations that result in a shuffle: parallelize, map, and reduceByKey. The results of Stage 1 are saved on disk as intermediate files on executor machines. During Stage 2, each partition receives data from these intermediate files belonging to it, and the execution is continued with the second map transformation and the final collect. For each stage and each partition, tasks are created and sent to the executors. If the stage ends with a shuffle, the tasks created will be shuffle-map tasks. After all tasks of a particular stage complete, the driver creates tasks for the next stage and sends them to the executors, and so on. This repeats until the last stage (in this case, Stage 2), which will need to return the results to the driver. The tasks created for the last stage are called result tasks.

4.4.3 Saving the RDD lineage with checkpointing

Because the RDD lineage can grow arbitrarily long by chaining any number of transformations, Spark provides a way to persist the entire RDD to stable storage. Then, in case of node failure, Spark doesn't need to recompute the missing RDD pieces from the start. It uses a snapshot and computes the rest of the lineage from there. This feature is called checkpointing.

During checkpointing, the entire RDD is persisted to disk—not just its data, as is the case with caching, but its lineage, too. After checkpointing, the RDD's dependencies are erased, as well as the information about its parent(s), because they won't be needed for its recomputation any more.

You can checkpoint an RDD by calling the checkpoint operation, but first you have to set the directory where the data will be saved by calling SparkContext.setCheckpointDir(). This directory is usually an HDFS directory, but it can also be a local one. checkpoint must be called before any jobs are executed on the RDD, and the RDD has to be materialized afterward (some action has to be called on it) for the checkpointing to be done.

So, when should you checkpoint an RDD? If you have an RDD with a long DAG (the RDD has lots of dependencies), rebuilding it could take a long time in case of a failure. If the RDD is checkpointed, reading it from a file could be much quicker than rebuilding it by using all the (possibly complicated) transformations. Checkpointing is also important in Spark Streaming, as you'll soon see.

4.5 Using accumulators and broadcast variables to communicate with Spark executors

The final topics for this chapter are accumulators and broadcast variables. They enable you to maintain a global state or share data across tasks and partitions in your Spark programs.

4.5.1 Obtaining data from executors with accumulators

Accumulators are variables shared across executors that you can only add to. You can use them to implement global sums and counters in your Spark jobs.


Accumulators are created with SparkContext.accumulator(initialValue). You can also create one while specifying its name: sc.accumulator(initialValue, "accumulatorName"). In that case, the accumulator will be displayed in the Spark web UI (on the stage details page), and you can use it to track the progress of your tasks. (For details about the Spark web UI, see chapter 11.)

NOTE You can’t name an accumulator in Python.

To add to an accumulator, you use the add method or += operator. To get its value, you use the value method. You can access an accumulator’s value only from within the driver. If you try to access it from an executor, an exception will be thrown. Here’s an accumulator example:

scala> val acc = sc.accumulator(0, "acc name")
scala> val list = sc.parallelize(1 to 1000000)
scala> list.foreach(x => acc.add(1))     // executes on the executors
scala> acc.value                         // executes on the driver
res0: Int = 1000000
scala> list.foreach(x => acc.value)      // exception occurs

The last line gives you the exception java.lang.UnsupportedOperationException: Can't read accumulator value in task.

If you need to have an accumulated value of one type and add to it values of another type, you can create an Accumulable object. It's created with SparkContext.accumulable(initialValue); and, similar to Accumulator, you can also assign it a name. Accumulator is a special case of the Accumulable class, but is used more often. Accumulable objects are mostly used as custom accumulators (you'll see an example next).

WRITING CUSTOM ACCUMULATORS
The object in charge of adding values to an Accumulator is AccumulatorParam (or AccumulableParam for Accumulables). It has to be implicitly defined in the scope of the calling function. Spark already provides implicit AccumulatorParam objects for numeric types so they can be used as accumulator values without any special actions. But if you want to use a custom class as an accumulator value, you have to create a custom implicit AccumulatorParam object. AccumulatorParam has to implement two methods:
■ zero(initialValue: T)—Creates an initial value that's passed to executors. The global initial value stays the same.
■ addInPlace(v1: T, v2: T): T—Merges two accumulated values.
AccumulableParam needs to implement one additional method:
■ addAccumulator(v1: T, v2: V): T—Adds a new accumulator value to the accumulated value.


For example, you could use a single Accumulable object to track both the sum and count of values (to get an average value) like this:

val rdd = sc.parallelize(1 to 100)              // an RDD of integers from 1 to 100
import org.apache.spark.AccumulableParam
implicit object AvgAccParam extends AccumulableParam[(Int, Int), Int] {
  def zero(v: (Int, Int)) = (0, 0)
  def addInPlace(v1: (Int, Int), v2: (Int, Int)) = (v1._1+v2._1, v1._2+v2._2)
  def addAccumulator(v1: (Int, Int), v2: Int) = (v1._1+1, v1._2+v2)
}
val acc = sc.accumulable((0,0))                 // the compiler finds AvgAccParam automatically
rdd.foreach(x => acc += x)                      // adds all values of the RDD to the accumulator
val mean = acc.value._2.toDouble / acc.value._1 // accesses the accumulator value to calculate the mean
mean: Double = 50.5

The implicit AccumulableParam object in the previous snippet contains all the necessary methods for tracking counts and sums across the executors. AvgAccParam accepts integers and keeps the accumulated value in a tuple: the first element tracks the count, the second tracks the sum.

ACCUMULATING VALUES IN ACCUMULABLE COLLECTIONS
You can also accumulate values in mutable collections, without creating any implicit objects, by using SparkContext.accumulableCollection(). Custom accumulators are flexible, but if you only need to implement a collection accumulator, accumulable collections are much easier to use. For example, using the same rdd object generated in the previous example, you can accumulate its objects into a shared collection like this:

scala> import scala.collection.mutable.MutableList
scala> val colacc = sc.accumulableCollection(MutableList[Int]())
scala> rdd.foreach(x => colacc += x)
scala> colacc.value
res0: scala.collection.mutable.MutableList[Int] = MutableList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 31, 32, 33, ...)

As you can see, the results aren't sorted, because there's no guarantee that the accumulator results from various partitions will return to the driver in any specific order.

4.5.2 Sending data to executors using broadcast variables

As we said in chapter 3, broadcast variables can be shared and accessed from across the cluster, similar to accumulators. But they're the opposite of accumulators in that they can't be modified by executors. The driver creates a broadcast variable, and executors read it.


You should use broadcast variables if you have a large set of data that the majority of your executors need. Typically, variables created in the driver, needed by tasks for their execution, are serialized and shipped along with those tasks. But a single driver program can reuse the same variable in several jobs, and several tasks may get shipped to the same executor as part of the same job. So, a potentially large variable may get serialized and transferred over the network more times than necessary. In these cases, it's better to use broadcast variables, because they can transfer the data in a more optimized way and only once.

Broadcast variables are created with the SparkContext.broadcast(value) method, which returns an object of type Broadcast. The value can be any serializable object. It can then be read by executors using the Broadcast.value method. When an executor tries to read a broadcast variable, the executor will first check to see whether it's already loaded. If not, it requests the broadcast variable from the driver, one chunk at a time. This pull-based approach avoids network congestion at job startup. You should always access its contents through the value method and never directly. Otherwise, Spark will automatically serialize and ship your variable along with tasks, and you'll lose all the performance benefits of broadcast variables.

DESTROYING AND UNPERSISTING BROADCAST VARIABLES
When a broadcast variable is no longer needed, you can destroy it. All information about it will be removed (from the executors and the driver), and the variable will become unusable. If you try to access it after calling destroy, an exception will be thrown. The other option is to call unpersist, which only removes the variable value from the cache in the executors. If you try to use it after unpersisting, it will be sent to the executors again. Finally, broadcast variables are automatically unpersisted by Spark after they go out of scope (if all references to them cease to exist), so it's unnecessary to explicitly unpersist them. Instead, you can remove the reference to the broadcast variable in the driver program.

CONFIGURATION PARAMETERS THAT AFFECT BROADCAST VARIABLES
Several configuration parameters are available that can affect broadcast performance (for details about setting Spark configuration parameters, see chapter 11):
■ spark.broadcast.compress—Specifies whether the variables will be compressed before transfer (you should leave this at true). Variables will be compressed with a codec specified by spark.io.compression.codec.
■ spark.broadcast.blockSize—Indicates the size of the chunks of data used for transferring broadcast data. You should probably leave this at the battle-tested default of 4096.
■ spark.python.worker.reuse—Can greatly affect broadcast performance in Python because, if workers aren't reused, broadcast variables will need to be transferred for each task. You should keep this at true, which is the default value.
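To make the lifecycle described above concrete, here is a minimal sketch of creating, reading, and releasing a broadcast variable in the Spark shell. The lookup map and the transformation that uses it are hypothetical illustrations; only sc.broadcast, Broadcast.value, unpersist, and destroy come from the API discussed in this section.

// a hypothetical lookup table the tasks need; broadcast it once from the driver
val productNames = Map(1 -> "Tomatoes", 2 -> "Olive oil", 3 -> "Pasta")
val bcNames = sc.broadcast(productNames)

// executors read the broadcast value through .value, never through productNames directly
val labeled = sc.parallelize(Seq((1, 4), (3, 2))).
  map { case (prodId, qty) => (bcNames.value.getOrElse(prodId, "unknown"), qty) }
labeled.collect()

bcNames.unpersist()  // drop the cached copies on the executors; re-sent if used again
bcNames.destroy()    // remove it everywhere; using bcNames after this throws an exception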


The main points to remember are that you should use broadcast variables if you have a large set of data that the majority of your workers needs, and to always use the value method to access broadcast variables.

4.6 Summary
■ Pair RDDs contain two-element tuples: keys and values.
■ Pair RDDs in Scala are implicitly converted to instances of class PairRDDFunctions, which hosts special pair RDD operations.
■ countByKey returns a map containing the number of occurrences of each key.
■ mapValues changes the values contained in a pair RDD without changing the associated keys.
■ flatMapValues enables you to change the number of elements corresponding to a key by mapping each value to zero or more values.
■ reduceByKey and foldByKey let you merge all the values of a key into a single value of the same type.
■ aggregateByKey merges values, but it also transforms values to another type.
■ Data partitioning is Spark's mechanism for dividing data between multiple nodes in a cluster.
■ The number of RDD partitions is important because, in addition to influencing data distribution throughout the cluster, it also directly determines the number of tasks that will be running RDD transformations.
■ Partitioning of RDDs is performed by Partitioner objects that assign a partition index to each RDD element. Spark provides two implementations: HashPartitioner and RangePartitioner.
■ Physical movement of data between partitions is called shuffling. It occurs when data from multiple partitions needs to be combined in order to build partitions for a new RDD.
■ During shuffling, in addition to being written to disk, the data is also sent over the network, so it's important to try to minimize the number of shuffles during Spark jobs.
■ RDD operations for working on partitions are mapPartitions and mapPartitionsWithIndex.
■ The four classic joins in Spark function just like the RDBMS joins of the same names: join (inner join), leftOuterJoin, rightOuterJoin, and fullOuterJoin.
■ cogroup performs grouping of values from several RDDs by key and returns an RDD whose values are arrays of Iterable objects containing values from each RDD.
■ The main transformations for sorting RDD data are sortByKey, sortBy, and repartitionAndSortWithinPartitions.


■ Several pair RDD transformations can be used for grouping data in Spark: aggregateByKey, groupByKey (and the related groupBy), and combineByKey.
■ RDD lineage is expressed as a directed acyclic graph (DAG) connecting an RDD with its parent RDDs, from which it was transformed.
■ Every Spark job is divided into stages based on the points where shuffles occur.
■ The RDD lineage can be saved with checkpointing.
■ Accumulators and broadcast variables enable you to maintain a global state or share data across tasks and partitions in your Spark programs.

Part 2 Meet the Spark family

It's time to get to know the other components that make up Spark: Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX. You've already made a brief acquaintance with Spark SQL in chapter 3. In chapter 5, you'll be formally introduced. You'll learn how to create and use DataFrames, how to use SQL to query DataFrame data, and how to load data from and save it to external data sources. You'll also learn about optimizations done by Spark's SQL Catalyst optimization engine and about performance improvements introduced with the Tungsten project.

Spark Streaming, one of the more popular family members, is introduced in chapter 6. There you'll learn about discretized streams, which periodically produce RDDs as the streaming application is running. You'll also learn how to save computation state over time and how to use window operations. We'll examine ways of connecting to Kafka and how to obtain good performance from your streaming jobs.

Chapters 7 and 8 are about machine learning, specifically about the Spark MLlib and Spark ML sections of the Spark API. You'll learn about machine learning in general and about linear regression, logistic regression, decision trees, random forests, and k-means clustering. Along the way, you'll scale and normalize features, use regularization, and train and evaluate machine-learning models. We'll explain the API standardizations brought by Spark ML.

Finally, chapter 9 explores how to build graphs with Spark's GraphX API. You'll transform and join graphs, use graph algorithms, and implement the A* search algorithm using the GraphX API.

Sparkling queries with Spark SQL

This chapter covers
■ Creating DataFrames
■ Using the DataFrame API
■ Using SQL queries
■ Loading and saving data from/to external data sources
■ Understanding the Catalyst optimizer
■ Understanding Tungsten performance improvements
■ Introducing DataSets

You had a taste of working with DataFrames in chapter 3. As you saw there, DataFrames let you work with structured data (data organized in rows and columns, where each column contains only values of a certain type). SQL, frequently used in relational databases, is the most common way to organize and query this data. SQL also figures as part of the name of the first Spark component we're covering in part 2: Spark SQL.



In this chapter, we plunge deeper into the DataFrame API and examine it more closely. In section 5.1, you'll first learn how to convert RDDs to DataFrames. You'll then use the DataFrame API on a sample dataset from the Stack Exchange website to select, filter, sort, group, and join data. We'll show you all that you need to know about using SQL functions with DataFrames and how to convert DataFrames back to RDDs. In section 5.2, we show you how to create DataFrames by running SQL queries and how to execute SQL queries on DataFrame data in three ways: from your programs, through Spark's SQL shell, and through Spark's Thrift server. In section 5.3, we show you how to save and load data to and from various external data sources. In the last two sections of this chapter, you learn about optimizations done by Spark's SQL Catalyst optimization engine and about performance improvements introduced with the Tungsten project, and we give a brief overview of DataSets, which emerged in Spark 1.6. DataFrames have, since Spark 2.0, become a special case of DataSets; they are now implemented as DataSets containing Row objects.

You should have a basic knowledge of SQL to properly understand parts of this chapter. A good reference is w3schools.com's SQL tutorial (www.w3schools.com/SQL). Some familiarity with Hive, a distributed warehouse based on Hadoop, is also beneficial. An HDFS cluster at your disposal would be helpful in this chapter. We hope you'll be using the spark-in-action VM, because HDFS is already installed there, in standalone mode (http://mng.bz/Bu4d). There's a lot to cover, so let's get started.

5.1 Working with DataFrames

In the previous chapter, you learned how to manipulate RDDs, which is important because RDDs represent a low-level, direct way of manipulating data in Spark and are the core of the Spark runtime. Spark 1.3 introduced the DataFrame API for handling structured, distributed data in a table-like representation with named columns and declared column types.

Inspiration for DataFrames came from several languages that used a similar concept with the same name: DataFrames in Python's Pandas package, DataFrames in R, and DataFrames in the Julia language. What makes them different in Spark is their distributed nature and Spark's Catalyst, which optimizes resource use in real time, based on pluggable data sources, rules, and data types. We talk about Catalyst later in this chapter. DataFrames translate SQL code and domain-specific language (DSL) expressions into optimized low-level RDD operations, so that the same API can be used from any supported language (Scala, Java, Python, and R) for accessing any supported data source (files, databases, and so forth) in the same way and with comparable performance characteristics. Since their introduction, DataFrames have become one of the most important features in Spark and made Spark SQL the most actively developed Spark component. Since Spark 2.0, DataFrame is implemented as a special case of DataSet.

Because you almost always know the structure of the data you're handling, DataFrames are applicable in any instance that requires manipulation of structured data, which is the majority of cases in the big data world. DataFrames let you reference the data by column names and access it with "good ole" SQL queries, which is the most natural way for most users to handle data, and it provides many integration possibilities.

Let's say you have a table with user data in a relational database and user activity data in a Parquet file on HDFS, and you want to join the two data sources. (Parquet is a columnar file format that stores schema information along with the data.) Spark SQL lets you do that by loading both of these sources into DataFrames. Once available as DataFrames, these two data sources can be joined, queried, and saved at a third location.

Spark SQL also lets you register DataFrames as tables in the table catalog, which doesn't hold the data itself; it merely saves information about how to access the structured data. Once registered, Spark applications can query the data when you provide a DataFrame's name. What's also interesting is that third-party applications can use standard JDBC and ODBC protocols to connect to Spark and then use SQL to query the data from registered DataFrame tables. Spark's Thrift server is the component that enables this functionality. It accepts incoming JDBC client connections and executes them as Spark jobs using the DataFrame API.

Figure 5.1 illustrates all of this. It shows two types of clients: a Spark application executing a join on two tables using DataFrame DSL, and a non-Spark application executing the same join, but as an SQL query through JDBC connected to Spark's Thrift server.

[Figure 5.1 An example of two clients (a Spark application using DataFrame DSL and a non-Spark client application connecting through JDBC to the Thrift server) executing the same query that joins two tables: the user table residing in a relational database and the user_activity table residing in a Parquet file in HDFS. The table catalog contains information about registered DataFrames and how to access their data; the Thrift server translates SQL queries into Spark jobs.]


We'll talk about Spark applications in this section and about JDBC tables in section 5.2.

A permanent table catalog (surviving Spark context restarts) is available only when Spark is built with Hive support. In case you're wondering, Hive is a distributed warehouse built as a layer of abstraction on top of Hadoop's MapReduce. Initially built by Facebook, it's widely used today for data querying and analysis. It has its own dialect of SQL known as HiveQL. HiveQL is capable of running its jobs not only as MapReduce jobs, but also as Spark jobs. When built with Hive support, Spark includes all of Hive's dependencies. Hive is included by default when you download archives from the Spark downloads page. In addition to bringing a more powerful SQL parser, Hive support enables you to access existing Hive tables and use existing, community-built Hive UDFs. For these reasons, we recommend using Spark built with Hive support.

DataFrames are comparable to, and are based on, RDDs, so the principles described in the previous chapter are valid for DataFrames too. Type information available for every column makes DataFrames much easier to use than RDDs because your query has fewer lines of code, and DataFrame-optimized queries offer better performance. You can create DataFrames in three ways:
■ Converting existing RDDs
■ Running SQL queries
■ Loading external data
The easiest method is to run an SQL query, but we'll leave that for later. First, we'll show you how to create DataFrames from existing RDDs.

5.1.1 Creating DataFrames from RDDs

It's often necessary to first load data into an RDD and then use it to create a DataFrame. If you want to load log files into DataFrames, you first need to load them as text, parse the lines, and identify elements that form each entry; only then does the data become structured and thus ready for consumption by DataFrames. In short, you use RDDs to load and transform unstructured data and then create DataFrames from RDDs if you want to use the DataFrame API.

You can create DataFrames from RDDs in three ways:
■ Using RDDs containing row data as tuples
■ Using case classes
■ Specifying a schema
The first method is rudimentary. It's simple but limited because it doesn't allow you to specify all the schema attributes, which usually isn't satisfactory. The second method includes writing a case class, which is more involved but isn't as limiting as the first method. The third method, considered to be a standard in Spark, involves explicitly specifying a schema. In the first two methods, a schema is specified indirectly (it's inferred).


We cover each of these methods in this section. But first we need to discuss a few prerequisites: the SparkSession object, the necessary implicit methods, and the dataset used in this chapter.

CREATING SPARKSESSION AND IMPORTING IMPLICIT METHODS
To use Spark DataFrames and SQL expressions, begin with the SparkSession object. It's preconfigured in a Spark shell and available as the variable spark. In your own programs, you should construct it yourself, just as you did in previous chapters.

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()

SparkSession is a wrapper around SparkContext and SQLContext, which was directly used for constructing DataFrames in the versions prior to Spark 2.0. The Builder object lets you specify master, appName, and other configuration options, but the defaults will do. Spark provides a set of Scala implicit methods for automatically converting RDDs to DataFrames. You must import these methods before using this feature. Once you have your SparkSession in spark, the import can be accomplished with the following statement:

import spark.implicits._

Note that in Spark shell this is already done for you, but in your own programs you'll have to do it yourself. These implicits add one method, called toDF, to your RDD if the RDD contains objects for which a DataSet Encoder is defined. Encoders are used for converting JVM objects to the internal Spark SQL representation. For the list of Encoders that come with Spark out of the box and can be used for constructing DataFrames, see http://mng.bz/Wa45. We'll illustrate how this works with an example.

UNDERSTANDING AND LOADING THE EXAMPLE DATASET
The examples in this chapter are based on data obtained from Stack Exchange. You probably already know about or use the Stack Exchange website (www.stackexchange.com), especially its subcommunity Stack Overflow, a place for asking and answering programming-related questions. Anybody can ask questions in Stack Exchange communities, and anybody can answer them. A system of upvotes and downvotes exists as a means for marking certain questions and answers useful (or less so). Users earn points for asking and answering questions and can also earn different types of badges for various activities on the site. In 2009, Stack Exchange released¹ an anonymized data dump of all questions and answers in the Stack Exchange communities, and it continues to release data as new questions become available. One of the communities whose data was released is the Italian Language Stack Exchange, and we'll use its data to illustrate Spark SQL concepts in this chapter. We chose this community because of its small size (you can easily download and use it on your laptop) and because we like the language.

1 See the official announcement at http://mng.bz/Ct8l.


The original data is available in XML format. We preprocessed it and created a comma-separated values (CSV) file that you can find in our GitHub repository (which you have, hopefully, cloned) under the folder ch05. The first file you'll use is italianPosts.csv. It contains Italian language–related questions (and answers) with the following fields (delimited with tilde signs):
■ commentCount—Number of comments related to the question/answer
■ lastActivityDate—Date and time of the last modification
■ ownerUserId—User ID of the owner
■ body—Textual contents of the question/answer
■ score—Total score based on upvotes and downvotes
■ creationDate—Date and time of creation
■ viewCount—View count
■ title—Title of the question
■ tags—Set of tags the question has been marked with
■ answerCount—Number of related answers
■ acceptedAnswerId—If the post is a question, the ID of its accepted answer
■ postTypeId—Type of the post; 1 is for questions, 2 for answers
■ id—Post's unique ID
After downloading the file and starting your Spark shell, you can parse the data and load it into an RDD with the following snippet:

scala> val itPostsRows = sc.textFile("first-edition/ch05/italianPosts.csv")
scala> val itPostsSplit = itPostsRows.map(x => x.split("~"))
itPostsSplit: org.apache.spark.rdd.RDD[Array[String]] = ...

This gives you an RDD containing arrays of strings. You could convert this RDD to a DataFrame with only one column containing arrays of Strings, but that's not what we want to do here. We need to map each of those Strings to a different column.

CREATING A DATAFRAME FROM AN RDD OF TUPLES
So let's create a DataFrame by converting the RDD's element arrays to tuples and then calling toDF on the resulting RDD. There's no elegant way to convert an array to a tuple, so you have to resort to this ugly expression:

scala> val itPostsRDD = itPostsSplit.map(x => (x(0),x(1),x(2),x(3),x(4),
  x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12)))
itPostsRDD: org.apache.spark.rdd.RDD[(String, String, ...


Then use the toDF function:

scala> val itPostsDFrame = itPostsRDD.toDF()
itPostsDFrame: org.apache.spark.sql.DataFrame = [_1: string, ...

Now you have your DataFrame, and you can use all the goodies that come with it. First, you can get the contents of the first 10 rows in a nice, formatted textual output with the show method (the output is cropped to the right to fit on the page):

scala> itPostsDFrame.show(10)
+---+--------------------+---+--------------------+---+--------------------
| _1|                  _2| _3|                  _4| _5|                  _6
+---+--------------------+---+--------------------+---+--------------------
|  4|2013-11-11 18:21:...| 17|&lt;p&gt;The infi...| 23|2013-11-10 19:37:...
|  5|2013-11-10 20:31:...| 12|&lt;p&gt;Come cre...|  1|2013-11-10 19:44:...
|  2|2013-11-10 20:31:...| 17|&lt;p&gt;Il verbo...|  5|2013-11-10 19:58:...
|  1|2014-07-25 13:15:...|154|&lt;p&gt;As part ...| 11|2013-11-10 22:03:...
|  0|2013-11-10 22:15:...| 70|&lt;p&gt;&lt;em&g...|  3|2013-11-10 22:15:...
|  2|2013-11-10 22:17:...| 17|&lt;p&gt;There's ...|  8|2013-11-10 22:17:...
|  1|2013-11-11 09:51:...| 63|&lt;p&gt;As other...|  3|2013-11-11 09:51:...
|  1|2013-11-12 23:57:...| 63|&lt;p&gt;The expr...|  1|2013-11-11 10:09:...
|  9|2014-01-05 11:13:...| 63|&lt;p&gt;When I w...|  5|2013-11-11 10:28:...
|  0|2013-11-11 10:58:...| 18|&lt;p&gt;Wow, wha...|  5|2013-11-11 10:58:...
+---+--------------------+---+--------------------+---+--------------------

If you call the show method without arguments, it shows the first 20 rows. This is nice, but the column names are generic and aren't particularly helpful. You can rectify that by specifying column names when calling the toDF method:

scala> val itPostsDF = itPostsRDD.toDF("commentCount", "lastActivityDate",
  "ownerUserId", "body", "score", "creationDate", "viewCount", "title",
  "tags", "answerCount", "acceptedAnswerId", "postTypeId", "id")

If you call show now, you'll see that the column names appear in the output. You can also examine the schema of the DataFrame with the printSchema method:

scala> itPostsDF.printSchema
root
 |-- commentCount: string (nullable = true)
 |-- lastActivityDate: string (nullable = true)
 |-- ownerUserId: string (nullable = true)
 |-- body: string (nullable = true)
 |-- score: string (nullable = true)
 |-- creationDate: string (nullable = true)
 |-- viewCount: string (nullable = true)
 |-- title: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- answerCount: string (nullable = true)
 |-- acceptedAnswerId: string (nullable = true)
 |-- postTypeId: string (nullable = true)
 |-- id: string (nullable = true)


This method shows the information that the DataFrame has about its columns. You can see that column names are now available, but all the columns are of type String and they're all nullable. This could be desirable in some cases, but in this case it's obviously wrong. Here, the id column should be of type long, counts should be integers, and the date columns should be timestamps. The following two RDD-to-DataFrame conversion methods let you specify the desirable column types and their names.

CONVERTING RDDS TO DATAFRAMES USING CASE CLASSES
The second option for converting RDDs to DataFrames is to map each row in an RDD to a case class and then use the toDF method. First you need to define the case class that will hold the data. Here's the Post class that can hold each row of the dataset:

import java.sql.Timestamp
case class Post(
  commentCount:Option[Int],
  lastActivityDate:Option[java.sql.Timestamp],
  ownerUserId:Option[Long],
  body:String,
  score:Option[Int],
  creationDate:Option[java.sql.Timestamp],
  viewCount:Option[Int],
  title:String,
  tags:String,
  answerCount:Option[Int],
  acceptedAnswerId:Option[Long],
  postTypeId:Option[Long],
  id:Long)

Nullable fields are declared to be of type Option[T], which means they can contain either a Some object with the value of type T, or a None (null in Java). For timestamp columns, Spark supports the java.sql.Timestamp class. Before mapping itPostsRDD’s rows to Post objects, first declare an implicit class that helps to write that in a more elegant way:2

object StringImplicits {
  implicit class StringImprovements(val s: String) {
    import scala.util.control.Exception.catching
    def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
    def toLongSafe = catching(classOf[NumberFormatException]) opt s.toLong
    def toTimestampSafe = catching(classOf[IllegalArgumentException]) opt
      Timestamp.valueOf(s)
  }
}

The implicit StringImprovements class in the StringImplicits object defines three methods that can be implicitly added to Scala's String class and used to safely convert strings to integers, longs, and timestamps. Safely means that, if a string can't be converted to the target type, the methods return None instead of throwing an exception.

2 Thanks to Pierre Andres for this idea: http://mng.bz/ih7n.

The catching function returns an object of type scala.util.control.Exception.Catch, whose opt method can be used to map the result of the specified function (s.toInt, for example) to an Option object, returning None if the specified exception occurs or Some otherwise. This makes parsing of rows much more elegant:

import StringImplicits._
def stringToPost(row:String):Post = {
  val r = row.split("~")
  Post(r(0).toIntSafe,
    r(1).toTimestampSafe,
    r(2).toLongSafe,
    r(3),
    r(4).toIntSafe,
    r(5).toTimestampSafe,
    r(6).toIntSafe,
    r(7),
    r(8),
    r(9).toIntSafe,
    r(10).toLongSafe,
    r(11).toLongSafe,
    r(12).toLong)
}
val itPostsDFCase = itPostsRows.map(x => stringToPost(x)).toDF()

You first need to import the implicit class you just declared and then parse the individual fields (strings) into objects of the appropriate types using the safe methods. Note that the last column (id) can't be null, so a safe method isn't used there. Now your DataFrame contains the proper types and nullable flags:

scala> itPostsDFCase.printSchema
root
 |-- commentCount: integer (nullable = true)
 |-- lastActivityDate: timestamp (nullable = true)
 |-- ownerUserId: long (nullable = true)
 |-- body: string (nullable = true)
 |-- score: integer (nullable = true)
 |-- creationDate: timestamp (nullable = true)
 |-- viewCount: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- answerCount: integer (nullable = true)
 |-- acceptedAnswerId: long (nullable = true)
 |-- postTypeId: long (nullable = true)
 |-- id: long (nullable = false)

Before looking further into the DataFrame API, we’ll show you another way of creating DataFrames from RDDs.


CONVERTING RDDS TO DATAFRAMES BY SPECIFYING A SCHEMA
The last method for converting RDDs to DataFrames is to use SparkSession's createDataFrame method, which takes an RDD containing objects of type Row and a StructType. In Spark SQL, a StructType represents a schema. It contains one or more StructFields, each describing a column. You can construct a StructType schema for the post RDD with the following snippet:

import org.apache.spark.sql.types._
val postSchema = StructType(Seq(
  StructField("commentCount", IntegerType, true),
  StructField("lastActivityDate", TimestampType, true),
  StructField("ownerUserId", LongType, true),
  StructField("body", StringType, true),
  StructField("score", IntegerType, true),
  StructField("creationDate", TimestampType, true),
  StructField("viewCount", IntegerType, true),
  StructField("title", StringType, true),
  StructField("tags", StringType, true),
  StructField("answerCount", IntegerType, true),
  StructField("acceptedAnswerId", LongType, true),
  StructField("postTypeId", LongType, true),
  StructField("id", LongType, false))
)

Supported data types
A DataFrame takes columns of the usual types supported by major relational databases: strings, integers, shorts, floats, doubles, bytes, dates, timestamps, and binary values (in relational databases, called BLOBs). But it can also contain complex data types:

■ Arrays contain several values of the same type.
■ Maps contain key-value pairs, where the key is a primitive type.
■ Structs contain nested column definitions.
You'll find the Scala data type objects that you can use when constructing StructType objects in the org.apache.spark.sql.types package.
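To make the complex types concrete, here is a small, hypothetical schema sketch; the column names are invented for illustration and aren't part of the posts dataset.

import org.apache.spark.sql.types._
// hypothetical schema illustrating array, map, and nested struct columns
val complexSchema = StructType(Seq(
  StructField("tagList", ArrayType(StringType), true),                // array of strings
  StructField("voteCounts", MapType(StringType, IntegerType), true),  // map of string -> int
  StructField("owner", StructType(Seq(                                // nested struct
    StructField("id", LongType, false),
    StructField("name", StringType, true))), true)
))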

The Row class for your RDD needs to contain elements of various types, so you can construct its objects by specifying all the elements or by passing in a Seq or a Tuple. You can change the stringToPost function used previously to a stringToRow function just by changing the Post type to Row (you'll find the stringToRow function in our online repository). But because Spark 2.0 doesn't handle Scala Option objects well when constructing DataFrames in this way, you need to use real Java null values. In other words:


def stringToRow(row:String):Row = {
  val r = row.split("~")
  Row(r(0).toIntSafe.getOrElse(null),
    r(1).toTimestampSafe.getOrElse(null),
    r(2).toLongSafe.getOrElse(null),
    r(3),
    r(4).toIntSafe.getOrElse(null),
    r(5).toTimestampSafe.getOrElse(null),
    r(6).toIntSafe.getOrElse(null),
    r(7),
    r(8),
    r(9).toIntSafe.getOrElse(null),
    r(10).toLongSafe.getOrElse(null),
    r(11).toLongSafe.getOrElse(null),
    r(12).toLong)
}

Then you can create an RDD and the final DataFrame like this:

val rowRDD = itPostsRows.map(row => stringToRow(row))
val itPostsDFStruct = spark.createDataFrame(rowRDD, postSchema)

GETTING SCHEMA INFORMATION
You can verify that the schema of itPostsDFStruct is equivalent to that of itPostsDFCase by calling the printSchema method. You can also access the schema StructType object through the DataFrame's schema field.

Two additional DataFrame functions can give you some information about the DataFrame's schema. The columns method returns a list of column names, and the dtypes method returns a list of tuples, each containing the column name and the name of its type. For the itPostsDFCase DataFrame, the results look like this:

scala> itPostsDFCase.columns
res0: Array[String] = Array(commentCount, lastActivityDate, ownerUserId,
  body, score, creationDate, viewCount, title, tags, answerCount,
  acceptedAnswerId, postTypeId, id)

scala> itPostsDFStruct.dtypes
res1: Array[(String, String)] = Array((commentCount,IntegerType),
  (lastActivityDate,TimestampType), (ownerUserId,LongType), (body,StringType),
  (score,IntegerType), (creationDate,TimestampType), (viewCount,IntegerType),
  (title,StringType), (tags,StringType), (answerCount,IntegerType),
  (acceptedAnswerId,LongType), (postTypeId,LongType), (id,LongType))

5.1.2 DataFrame API basics

Now that you have your DataFrame loaded (no matter which method you used to load the data), you can start exploring the rich DataFrame API. DataFrames come with a DSL for manipulating data, which is fundamental to working with Spark SQL.


Spark's machine learning library (ML) also relies on DataFrames. They've become a cornerstone of Spark, so it's important to get acquainted with their API.

DataFrame's DSL has a set of functionalities similar to the usual SQL functions for manipulating data in relational databases. DataFrames work like RDDs: they're immutable and lazy. They inherit their immutable nature from the underlying RDD architecture. You can't directly change data in a DataFrame; you have to transform it into another one. They're lazy because most DataFrame DSL functions don't return results. Instead, they return another DataFrame, similar to RDD transformations. In this section, we'll give you an overview of basic DataFrame functions and show how you can use them to select, filter, map, group, and join data. All of these functions have their SQL counterparts, but we'll get to that in section 5.2.

SELECTING DATA
Most of the DataFrame DSL functions work with Column objects. When using the select function for selecting data, you can pass column names or Column objects to it, and it will return a new DataFrame containing only those columns. For example (first rename the DataFrame variable to postsDf to make it shorter):

scala> val postsDf = itPostsDFStruct
scala> val postsIdBody = postsDf.select("id", "body")
postsIdBody: org.apache.spark.sql.DataFrame = [id: bigint, body: string]

Several methods exist for creating Column objects. You can create columns using an existing DataFrame object and its col function:

val postsIdBody = postsDf.select(postsDf.col("id"), postsDf.col("body"))

or you can use some of the implicit methods that you imported in section 5.1.1. One of them indirectly converts Scala's Symbol class to a Column. Symbol objects are sometimes used in Scala programs as identifiers instead of Strings because they're interned (at most one instance of an object exists) and can be quickly checked for equality. They can be instantiated with Scala's built-in quote mechanism or with Scala's apply function, so the following two statements are equivalent:

val postsIdBody = postsDf.select(Symbol("id"), Symbol("body"))
val postsIdBody = postsDf.select('id, 'body)

Another implicit method (called $) converts strings to ColumnName objects (which inherit from Column objects, so you can use ColumnName objects as well):

val postsIdBody = postsDf.select($"id", $"body")

Column objects are important for DataFrame DSL, so all this flexibility is justified. That’s how you specify which columns to select. The resulting DataFrame contains only the specified columns. If you need to select all the columns except a single one, you can use the drop function, which takes a column name or a Column object and

returns a new DataFrame with the specified column missing. For example, to remove the body column from the postsIdBody DataFrame (and thus leave only the id column), you can use the following line:

val postIds = postsIdBody.drop("body")

FILTERING DATA
You can filter DataFrame data using the where and filter functions (they're synonymous). They take a Column object or an expression string. The variant taking a string is used for parsing SQL expressions. Again, we'll get to that in section 5.2.

Why do you pass a Column object to a filtering function, you ask? Because the Column class, in addition to representing a column name, contains a rich set of SQL-like operators that you can use to build expressions. These expressions are also represented by the Column class. For example, to see how many posts contain the word Italiano in their body, you can use the following line:

scala> postsIdBody.filter('body contains "Italiano").count
res0: Long = 46

To select all the questions that don't have an accepted answer, use this expression:

scala> val noAnswer = postsDf.filter(('postTypeId === 1) and
  ('acceptedAnswerId isNull))

Here, the filter expression is based on two columns: the post type ID (which is equal to 1 for questions) and the accepted answer ID. Both expressions yield a Column object, which is then combined into a third one using the and operator. You need the extra parentheses to help the Scala parser find its way.

These are just some of the available operators. We invite you to examine the complete set of operators listed in the official documentation at http://mng.bz/a2Xt. Note that you can use these column expressions in select functions as well.

You can select only the first n rows of a DataFrame with the limit function. The following will return a DataFrame containing only the first 10 questions:

scala> val firstTenQs = postsDf.filter('postTypeId === 1).limit(10)

ADDING AND RENAMING COLUMNS
In some situations, you may want to rename a column to give it a shorter or a more meaningful name. That's what the withColumnRenamed function is for. It accepts two strings: the old and the new name of the column. For example:

val firstTenQsRn = firstTenQs.withColumnRenamed("ownerUserId", "owner")

To add a new column to a DataFrame, use the withColumn function and give it the column name and the Column expression.


Let's say you're interested in a "views per score point" metric (in other words, how many views are needed to increase the score by one), and you want to see the questions whose value of this metric is less than some threshold (which means if the question is more successful, it gains a higher score with fewer views). If your threshold is 35 (which is the actual average), this can be accomplished with the following expression:

scala> postsDf.filter('postTypeId === 1).
  withColumn("ratio", 'viewCount / 'score).
  where('ratio < 35).show()

The output is too wide to print on this page, but you can run the command yourself and see that the output contains the extra column called ratio.

SORTING DATA
DataFrame's orderBy and sort functions sort data (they're equivalent). They take one or more column names or one or more Column expressions. The Column class has asc and desc operators, which are used for specifying the sort order. The default is to sort in ascending order. As an exercise, try to list the 10 most recently modified questions. (You'll find the solution in our online repository; one possible approach is sketched below.)
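Here is one possible solution sketch for the exercise, assuming that questions are posts with postTypeId 1 and that "most recently modified" means the latest lastActivityDate; the official solution in the repository may differ.

// questions sorted by last modification time, newest first, keeping only 10
postsDf.filter('postTypeId === 1).
  orderBy('lastActivityDate desc).
  limit(10).show()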

5.1.3 Using SQL functions to perform calculations on data

All relational databases today provide SQL functions for performing calculations on data. Spark SQL also supports a large number of them. SQL functions are available through the DataFrame API and through SQL expressions. In this section, we'll cover their use through the DataFrame API.

SQL functions fit into four categories:
■ Scalar functions return a single value for each row based on calculations on one or more columns.
■ Aggregate functions return a single value for a group of rows.
■ Window functions return several values for a group of rows.
■ User-defined functions include custom scalar or aggregate functions.

USING BUILT-IN SCALAR AND AGGREGATE FUNCTIONS
Scalar functions return a single value for each row based on values in one or more columns in the same row. Scalar functions include abs (calculates the absolute value), exp (computes the exponential), and substring (extracts a substring of a given string). Aggregate functions return a single value for a group of rows. They include min (returns the minimum value in a group of rows), avg (calculates the average value in a group of rows), and so on.

Scalar and aggregate functions reside in the object org.apache.spark.sql.functions. You can import them all at once (note that they're automatically imported in the Spark shell, so you don't have to do this yourself):

import org.apache.spark.sql.functions._


Spark offers numerous scalar functions to do the following:
■ Math calculations—abs (calculates absolute value), hypot (calculates hypotenuse based on two columns or scalar values), log (calculates logarithm), cbrt (computes cube root), and others
■ String operations—length (calculates length of a string), trim (trims a string value left and right), concat (concatenates several input strings), and others
■ Date-time operations—year (returns the year of a date column), date_add (adds a number of days to a date column), and others
Aggregate functions are used in combination with groupBy (explained in section 5.1.4) but can also be used on the entire dataset in the select or withColumn method. Spark's aggregate functions include min, max, count, avg, and sum. These are so common that we believe no special explanation is necessary.

As an example, let's find the question that was active for the largest amount of time. You can use the lastActivityDate and creationDate columns for this and find the difference between them (in days) using the datediff function, which takes two arguments: the end date column and the start date column. Here goes:

scala> postsDf.filter('postTypeId === 1).
  withColumn("activePeriod", datediff('lastActivityDate, 'creationDate)).
  orderBy('activePeriod desc).head.getString(3).
  replace("&lt;","<").replace("&gt;",">")
res0: String =

The plural of braccio is braccia, and the plural of avambraccio is avambracci.

Why are the plural of those words so different, if they both are referring to parts of the human body, and avambraccio derives from braccio?

You use the head DataFrame function to locally retrieve the first Row from the DataFrame and then select its element at index 3, which is the body column. The body column contains HTML-formatted text, so we unescaped it for better readability.

As another example, let's find the average and maximum score of all questions and the total number of questions. Spark SQL makes this easy:

scala> postsDf.select(avg('score), max('score), count('score)).show
+-----------------+----------+------------+
|       avg(score)|max(score)|count(score)|
+-----------------+----------+------------+
|4.159397303727201|        24|        1261|
+-----------------+----------+------------+

WINDOW FUNCTIONS
Window functions are comparable to aggregate functions. The main difference is that they don't group rows into a single output row per group. They let you define a "moving group" of rows, called frames, which are somehow related to the current row and can be used in calculations on the current row. You can use window functions to


calculate moving averages or cumulative sums, for example. These are calculations that typically require subselects or complex joins to accomplish. Window functions make them much simpler and easier. When you want to use window functions, start by constructing a Column definition with an aggregate function (min, max, sum, avg, count) or with one of the functions listed in table 5.1. Then build a window specification (WindowSpec object) and use it as an argument to the column’s over function. The over function defines a column that uses the specified window specification.

Table 5.1 Ranking and analytic functions that can be used as window functions

■ first(column): Returns the value in the first row in the frame.
■ last(column): Returns the value in the last row in the frame.
■ lag(column, offset, [default]): Returns the value in the row that is offset rows behind the current row in the frame, or default if such a row doesn't exist.
■ lead(column, offset, [default]): Returns the value in the row that is offset rows ahead of the current row in the frame, or default if such a row doesn't exist.
■ ntile(n): Divides the frame into n parts and returns the index of the part the row belongs to. If the number of rows in the frame isn't divisible by n and division gives a number between x and x+1, the resulting parts will contain x or x+1 rows, with the parts containing x+1 rows coming first.
■ cumeDist: Calculates the fraction of rows in the frame whose value is less than or equal to the value in the row being processed.
■ rank: Returns the rank of the row in the frame (first, second, and so on), based on the ordering value. Rows with equal values get the same rank, and a gap is left in the ranking after them.
■ denseRank: Like rank, but rows with equal values share the same rank with no gaps in the ranking.
■ percentRank: Returns the rank of the row divided by the number of rows in the frame.
■ rowNumber: Returns the sequential number of the row in the frame.

The window specification is built using the static methods in the org.apache.spark.sql.expressions.Window class. You need to specify one or more columns that define the partition (the same principles apply as with aggregate functions) using the partitionBy function, or specify ordering in the partition (assuming the partition is the entire dataset) using the orderBy function, or you can do both. You can further restrict which rows appear in frames by using the rowsBetween(from, to) and rangeBetween(from, to) functions. The rowsBetween function restricts rows by their row index, where index 0 is the row being processed, -1 is the previous row, and so on. The rangeBetween function restricts rows by their values

and includes only those rows whose values (in the defined column to which the window specification applies) fall in the defined range.
This is a lot of information, so we'll illustrate with two examples using the postsDf DataFrame. First, you'll display the maximum score of all user questions (post type ID 1) and, for each question, how much its score is below the maximum score for that user. As you can probably imagine, without window functions this would require a complicated query. To use window functions, first import the Window class:

import org.apache.spark.sql.expressions.Window

If you filter the posts by type, you can then select the maximum score along with the other columns. Because the window specification partitions the rows by question owner, the max function applies only to the current user's questions. You can then add a column (called toMax in the example) to display how many points the current question's score is below the user's maximum score:

scala> postsDf.filter('postTypeId === 1).
  select('ownerUserId, 'acceptedAnswerId, 'score,
    max('score).over(Window.partitionBy('ownerUserId)) as "maxPerUser").
  withColumn("toMax", 'maxPerUser - 'score).show(10)
+-----------+----------------+-----+----------+-----+
|ownerUserId|acceptedAnswerId|score|maxPerUser|toMax|
+-----------+----------------+-----+----------+-----+
|        232|            2185|    6|         6|    0|
|        833|            2277|    4|         4|    0|
|        833|            null|    1|         4|    3|
|        235|            2004|   10|        10|    0|
|        835|            2280|    3|         3|    0|
|         37|            null|    4|        13|    9|
|         37|            null|   13|        13|    0|
|         37|            2313|    8|        13|    5|
|         37|              20|   13|        13|    0|
|         37|            null|    4|        13|    9|
+-----------+----------------+-----+----------+-----+

For the second example, you’ll display, for each question, the id of its owner’s next and previous questions by creation date. Again, first filter posts by their type to get only the questions. Then use the lag and lead functions to reference the previous and the next rows in the frame. The window specification is the same for both col- umns: partitioning needs to be done by user, and the questions in the frame need to be ordered by creation date. Finally, order the entire dataset by owner user ID and question ID to make the results clearly apparent: scala> postsDf.filter('postTypeId === 1). select('ownerUserId, 'id, 'creationDate, lag('id, 1).over( Window.partitionBy('ownerUserId).orderBy('creationDate)) as "prev", lead('id, 1).over(

122 CHAPTER 5 Sparkling queries with Spark SQL

Window.partitionBy('ownerUserId).orderBy('creationDate)) as "next"). orderBy('ownerUserId, 'id).show(10) +------+----+------+------+ |ownerUserId| id| creationDate|prev| next| +------+----+------+------+ | 4|1637|2014-01-24 06:51:...|null| null| | 8| 1|2013-11-05 20:22:...|null| 112| | 8| 112|2013-11-08 13:14:...| 1| 1192| | 8|1192|2013-11-11 21:01:...| 112| 1276| | 8|1276|2013-11-15 16:09:...|1192| 1321| | 8|1321|2013-11-20 16:42:...|1276| 1365| | 8|1365|2013-11-23 09:09:...|1321| null| | 12| 11|2013-11-05 21:30:...|null| 17| | 12| 17|2013-11-05 22:17:...| 11| 18| | 12| 18|2013-11-05 22:34:...| 17| 19| +------+----+------+----+-----+
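Neither of these two examples needed an explicit frame specification. As an additional illustration (this is a sketch of ours, not an example from the book's dataset walkthrough, and the column name movingAvgScore is our own choice), you could combine a window specification with rowsBetween to compute a moving average, for example the average score of each user's current question and the two preceding ones by creation date:

scala> val last3Avg = avg('score).over(
  Window.partitionBy('ownerUserId).
    orderBy('creationDate).
    rowsBetween(-2, 0))
scala> postsDf.filter('postTypeId === 1).
  withColumn("movingAvgScore", last3Avg).
  show(10)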

If you ever need to write an equivalent SQL query, we're confident you'll appreciate the power and simplicity of window functions. In the previous sections, we covered only some of the SQL functions Spark supports. We encourage you to explore all the available functions at http://mng.bz/849V.
USER-DEFINED FUNCTIONS
In many situations, Spark SQL may not provide the specific functionality you need at a particular moment. User-defined functions (UDFs) let you extend the built-in functionality of Spark SQL. For example, there's no built-in function that counts how many tags each question has. Tags are stored as concatenated tag names surrounded by angle brackets, and the angle brackets are encoded so they appear as &lt; and &gt; instead of < and > (for example, the translation tag appears as &lt;translation&gt;). You can calculate the number of tags by counting the occurrences of the string "&lt;" in the tags column, but there's no built-in function for that.
UDFs are created with the udf function (accessible from the functions object) by passing it a function containing the required logic. Each UDF takes zero or more columns (10 at most) and returns the final value. In this case, you can find all occurrences of one string in another with Scala's regular expressions (r.findAllMatchIn):

scala> val countTags = udf((tags: String) => "&lt;".r.findAllMatchIn(tags).length)
countTags: org.apache.spark.sql.UserDefinedFunction = ...

Another way of accomplishing the same thing is using the SparkSession.udf.register function:

scala> val countTags = spark.udf.register("countTags",
  (tags: String) => "&lt;".r.findAllMatchIn(tags).length)

This way, the UDF gets registered under a name that can also be used in SQL expressions (covered in section 5.3).
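Once a DataFrame is registered as a table (table registration is covered in section 5.3), the registered UDF can be called from SQL by that name. A quick sketch, assuming postsDf has been registered as a temporary table called posts_temp (as it will be later in this chapter):

scala> spark.sql("SELECT tags, countTags(tags) AS tagCnt FROM posts_temp").show(10)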


Now that the countTags UDF is defined, it can be used to produce a Column definition for the select statement (the output was truncated to fit on the page):

scala> postsDf.filter('postTypeId === 1).
  select('tags, countTags('tags) as "tagCnt").show(10, false)
+------------------------------------------------------------------+------+
|tags                                                              |tagCnt|
+------------------------------------------------------------------+------+
|&lt;word-choice&gt;                                               |1     |
|&lt;english-comparison&gt;&lt;translation&gt;&lt;phrase-request&gt;|3     |
|&lt;usage&gt;&lt;verbs&gt;                                        |2     |
|&lt;usage&gt;&lt;tenses&gt;&lt;english-comparison&gt;             |3     |
|&lt;usage&gt;&lt;punctuation&gt;                                  |2     |
|&lt;usage&gt;&lt;tenses&gt;                                       |2     |
|&lt;history&gt;&lt;english-comparison&gt;                         |2     |
|&lt;idioms&gt;&lt;etymology&gt;                                   |2     |
|&lt;idioms&gt;&lt;regional&gt;                                    |2     |
|&lt;grammar&gt;                                                   |1     |
+------------------------------------------------------------------+------+

The false flag tells the show method not to truncate the strings in columns (the default is to truncate them to 20 characters).

5.1.4 Working with missing values
Sometimes you may need to clean your data before using it. It might contain null or empty values, or equivalent string constants (for example, "N/A" or "unknown"). In those instances, the DataFrameNaFunctions class, accessible through the DataFrame's na field, may prove useful. Depending on your use case, you can choose to drop the rows containing null or NaN (the Scala constant meaning "not a number") values, to fill null or NaN values with constants, or to replace certain values with string or numeric constants. Each of these methods has several versions. For example, to remove all the rows from postsDf that contain null or NaN values in at least one of their columns, call drop with no arguments:

scala> val cleanPosts = postsDf.na.drop() scala> cleanPosts.count() res0: Long = 222

This is the same as calling drop("any"), which means null values can be in any of the columns. If you use drop("all"), it removes rows that have null values in all of the columns. You can also specify column names. For example, to remove the rows that don’t have an accepted answer ID, you can do this:

postsDf.na.drop(Array("acceptedAnswerId"))

With the fill function, you can replace null and NaN values with a constant. The constant can be a double or a string value. If you specify only one argument, that argument is used as the constant value for all the columns. You can also specify the columns in the second argument. There's a third option: specifying a Map that maps column names to replacement values. For example, you can replace null values in the viewCount column with zeros using the following expression:

postsDf.na.fill(Map("viewCount" -> 0))

Finally, the replace function enables you to replace certain values in specific columns with different ones. For example, imagine there was a mistake in your data export and you needed to change post ID 1177 to 3000. You can do that with the replace function:

val postsDfCorrected = postsDf.na.
  replace(Array("id", "acceptedAnswerId"), Map(1177 -> 3000))

5.1.5 Converting DataFrames to RDDs
You've seen how to create a DataFrame from an RDD. Now we'll show you how to do the reverse. DataFrames are based on RDDs, so getting an RDD from a DataFrame isn't complicated. Every DataFrame has a lazily evaluated rdd field for accessing the underlying RDD:

scala> val postsRdd = postsDf.rdd postsRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ...

The resulting RDD contains elements of type org.apache.spark.sql.Row, the same class used in section 5.1.1 to convert an RDD to a DataFrame by specifying a schema (the stringToRow function). Row has various get* functions for accessing column values by column index (getString(index), getInt(index), getMap(index), and so forth). It also has a useful function for converting rows to strings (mimicking a similar function available for Scala sequences): mkString(delimiter).
Mapping a DataFrame's data and partitions with the map, flatMap, and mapPartitions transformations is done directly on the underlying rdd field. Naturally, they return a new RDD and not a DataFrame. Everything we said about those transformations in the last chapter applies here as well, with the additional constraint that the functions you pass to them need to work with Row objects. These transformations can change the DataFrame's (RDD's) schema: they can change the order, number, or type of the columns, or convert Row objects to some other type, so automatic conversion back to a DataFrame isn't possible. If you don't change the DataFrame's schema, however, you can use the old schema to create a new DataFrame.
As an exercise, let's replace those nasty-looking &lt; and &gt; strings with < and > symbols in the body (index 3) and tags (index 8) columns. You can map each row to a Seq object, use the Seq object's updated method to replace its elements, map the Seq back to a Row object, and finally use the old schema to create a new DataFrame. The following lines accomplish this:


val postsMapped = postsDf.rdd.map(row => Row.fromSeq(
  row.toSeq.
    updated(3, row.getString(3).replace("&lt;","<").replace("&gt;",">")).
    updated(8, row.getString(8).replace("&lt;","<").replace("&gt;",">"))))
val postsDfNew = spark.createDataFrame(postsMapped, postsDf.schema)

Typically there’s no need to convert DataFrames to RDDs and back because most data- mapping tasks can be done with built-in DSL and SQL functions and UTFs. 5.1.6 Grouping and joining data Grouping data is straightforward with DataFrames. If you understand the SQL GROUP BY clause, you’ll have no trouble understanding grouping data with DataFrames. It starts with the groupBy function, which expects a list of column names or a list of Col- umn objects and returns a GroupedData object. GroupedData represents groups of rows that have the same values in the columns specified when calling groupBy, and it offers standard aggregation functions (count, sum, max, min, and avg) for aggregating across the groups. Each of these functions returns a DataFrame with the specified columns and an additional column containing the aggregated data. To find the number of posts per author, associated tags, and the post type, use the following (treating each combination of tags as a unique value):

scala> postsDfNew.groupBy('ownerUserId, 'tags, 'postTypeId).count.
  orderBy('ownerUserId desc).show(10)
+-----------+----+----------+-----+
|ownerUserId|tags|postTypeId|count|
+-----------+----+----------+-----+
|        862|    |         2|    1|
|        855|    |         1|    1|
|        846|    |         1|    1|
|        835|    |         1|    1|
|        833|    |         2|    1|
|        833|    |         1|    1|
|        833|

You can perform several aggregations on different columns with the agg function. It can take one or more column expressions using aggregate functions from org.apache.spark.sql.functions (which you used in section 5.1.3), or a map of column names to function names. To find the last activity date and the maximum post score per user, you can use the following two expressions, which accomplish the same result:

scala> postsDfNew.groupBy('ownerUserId).
  agg(max('lastActivityDate), max('score)).show(10)
scala> postsDfNew.groupBy('ownerUserId).
  agg(Map("lastActivityDate" -> "max", "score" -> "max")).show(10)


They both show the same output:

+-----------+---------------------+----------+
|ownerUserId|max(lastActivityDate)|max(score)|
+-----------+---------------------+----------+
|        431| 2014-02-16 14:16:...|         1|
|        232| 2014-08-18 20:25:...|         6|
|        833| 2014-09-03 19:53:...|         4|
|        633| 2014-05-15 22:22:...|         1|
|        634| 2014-05-27 09:22:...|         6|
|        234| 2014-07-12 17:56:...|         5|
|        235| 2014-08-28 19:30:...|        10|
|        435| 2014-02-18 13:10:...|        -2|
|        835| 2014-08-26 15:35:...|         3|
|         37| 2014-09-13 13:29:...|        23|
+-----------+---------------------+----------+

The former method is more powerful, however, because it enables you to chain column expressions. For example:

scala> postsDfNew.groupBy('ownerUserId).
  agg(max('lastActivityDate), max('score).gt(5)).show(10)
+-----------+---------------------+----------------+
|ownerUserId|max(lastActivityDate)|(max(score) > 5)|
+-----------+---------------------+----------------+
|        431| 2014-02-16 14:16:...|           false|
|        232| 2014-08-18 20:25:...|            true|
|        833| 2014-09-03 19:53:...|           false|
|        633| 2014-05-15 22:22:...|           false|
|        634| 2014-05-27 09:22:...|            true|
|        234| 2014-07-12 17:56:...|           false|
|        235| 2014-08-28 19:30:...|            true|
|        435| 2014-02-18 13:10:...|           false|
|        835| 2014-08-26 15:35:...|           false|
|         37| 2014-09-13 13:29:...|            true|
+-----------+---------------------+----------------+

USER-DEFINED AGGREGATE FUNCTIONS
In addition to the built-in aggregate functions, Spark SQL lets you define your own. We won't elaborate much on this, because the scope of this chapter doesn't allow us to go into detail. The general approach is to create a class that extends the abstract class org.apache.spark.sql.expressions.UserDefinedAggregateFunction, define the input and buffer schemas, and implement the initialize, update, merge, and evaluate functions (a minimal sketch appears at the end of this section). For details, see the official documentation at http://mng.bz/Gbt3 and a Java example at http://mng.bz/5bOb.
ROLLUP AND CUBE
Data grouping and aggregation can be done with two additional flavors: rollup and cube. You saw that groupBy calculates aggregate values for all combinations of values in the selected columns; cube and rollup also calculate the aggregates for subsets of the selected columns. The difference between the two is that rollup respects the hierarchy of the input columns and always groups by the first column.
An example will make this much clearer. You'll select a subset of the dataset, because the difference between these functions isn't obvious on a large dataset. Begin by selecting the posts from only a couple of users:

scala> val smplDf = postsDfNew.where('ownerUserId >= 13 and 'ownerUserId <= 15)

Counting the posts by owner, tags, and post type (by chance, all the tags are empty and all the post types are 2) gives the following result:

scala> smplDf.groupBy('ownerUserId, 'tags, 'postTypeId).count.show()
+-----------+----+----------+-----+
|ownerUserId|tags|postTypeId|count|
+-----------+----+----------+-----+
|         15|    |         2|    2|
|         14|    |         2|    2|
|         13|    |         2|    1|
+-----------+----+----------+-----+

The rollup and cube functions are used just like groupBy and are also accessible from the DataFrame class. The rollup function returns the same results but adds subtotals per owner (tags and post type are null), per owner and tags (post type is null), and a grand total (all grouping columns null):

scala> smplDf.rollup('ownerUserId, 'tags, 'postTypeId).count.show()
+-----------+----+----------+-----+
|ownerUserId|tags|postTypeId|count|
+-----------+----+----------+-----+
|         15|    |         2|    2|
|         13|    |      null|    1|
|         13|null|      null|    1|
|         14|    |      null|    2|
|         13|    |         2|    1|
|         14|null|      null|    2|
|         15|    |      null|    2|
|         14|    |         2|    2|
|         15|null|      null|    2|
|       null|null|      null|    5|
+-----------+----+----------+-----+

The cube function returns all of these results, but also adds other possible subtotals (per post type, per tags, per post type and tags, per post type and user). We’ll omit the results here for brevity, but you can find them in our online repository, and you can produce them in your Spark shell yourself.
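Returning to the user-defined aggregate functions mentioned earlier, here is a minimal sketch of the approach (this isn't an example from the book; the class, the column choices, and the cast to LongType are our own, with the cast keeping the input consistent with the declared input schema):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Sums only the positive values of a numeric column
class SumPositive extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0L }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0) && input.getLong(0) > 0)
      buffer(0) = buffer.getLong(0) + input.getLong(0)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  }
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

val sumPositive = new SumPositive
postsDfNew.groupBy('ownerUserId).
  agg(sumPositive('score.cast(LongType)) as "positiveScoreSum").show(10)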


Configuring Spark SQL
Knowing how to configure Spark SQL is essential when working with the DataFrame DSL or SQL commands. You'll find details about configuring Spark in general in chapter 10. Although Spark's main configuration can't be changed at runtime, the Spark SQL configuration can. Spark SQL has a separate set of parameters that affect the execution of DataFrame operations and SQL commands.
You can set Spark SQL parameters with the SQL command SET (SET <parameter_name>=<value>) or by calling the set method of the RuntimeConfig object available through SparkSession's conf field. set accepts a parameter name and a value (a String, Boolean, or Long). So, these two lines are equivalent:

spark.sql("SET spark.sql.caseSensitive=true")
spark.conf.set("spark.sql.caseSensitive", "true")

By the way, this parameter enables case sensitivity for query analysis (table and column names), which is something Hive doesn't support natively but Spark SQL does, even when using Hive support.
Another configuration parameter to note is spark.sql.eagerAnalysis, which tells Spark whether to evaluate DataFrame expressions eagerly. If it's set to true, Spark throws an exception as soon as you mention a non-existent column in a DataFrame, instead of waiting for you to perform an action on the DataFrame that fetches the results. We mention several more important parameters in the following sections.

5.1.7 Performing joins
Often you may have related data in two DataFrames that you want to join, so the resulting DataFrame contains rows from both DataFrames whose values match in the join columns. We showed you how to perform joins on RDDs in the previous chapter. Performing them with DataFrames isn't much different.
When calling the join function, you need to provide the DataFrame to be joined and one or more column names or a column definition. If you use column names, they need to be present in both DataFrames. If they aren't, you can always use a column definition. When using a column definition, you can also pass a third argument specifying the join type (inner, outer, left_outer, right_outer, or leftsemi).

An important note about performance
One hidden but important parameter is spark.sql.shuffle.partitions, which determines the number of partitions a DataFrame should have after a shuffle is performed (for example, after a join). As of Spark 1.5.1, the default is 200, which could be too many or too few for your use case and environment. For the examples in this book, you don't need more than 5 to 10 partitions, but if your dataset is huge, 200 may be too few. If you wish to change the number of partitions, set this parameter before performing the action that will trigger the shuffle.


(continued) This isn’t an ideal situation, however, because the number of DataFrame partitions shouldn’t be fixed, but should depend on the data and runtime environment instead. There are two JIRA tickets (https://issues.apache.org/jira/browse/SPARK-9872 and https://issues.apache.org/jira/browse/SPARK-9850) documenting this and propos- ing solutions, so the situation will probably change in a future Spark release.

As an example, load the italianVotes.csv file from our online GitHub repository (which should already be cloned into the first-edition folder) into a DataFrame with the following code:

// assumes imports of java.sql.Timestamp, org.apache.spark.sql.Row,
// and org.apache.spark.sql.types._ (not shown in this excerpt)
val itVotesRaw = sc.textFile("first-edition/ch05/italianVotes.csv").
  map(x => x.split("~"))
val itVotesRows = itVotesRaw.map(row => Row(row(0).toLong, row(1).toLong,
  row(2).toInt, Timestamp.valueOf(row(3))))
val votesSchema = StructType(Seq(
  StructField("id", LongType, false),
  StructField("postId", LongType, false),
  StructField("voteTypeId", IntegerType, false),
  StructField("creationDate", TimestampType, false))
)
val votesDf = spark.createDataFrame(itVotesRows, votesSchema)

Joining the two DataFrames on the postId column can be done like this:

val postsVotes = postsDf.join(votesDf, postsDf("id") === 'postId)

This performs an inner join. You can perform an outer join by adding another argument:

val postsVotesOuter = postsDf.join(votesDf, postsDf("id") === 'postId, "outer")

If you examine the contents of the postsVotesOuter DataFrame, you'll notice some rows with all null values in the votes columns. These are the posts that have no votes. Note that you have to tell Spark SQL exactly which id column you're referencing by creating the Column object from the DataFrame object. postId is unique across both DataFrames, so you can use the simpler syntax and create the Column object through the implicit conversion from Scala's Symbol.
If you'd like to experiment more, you can use the other CSV files from the Italian language dataset in our online repository. They contain data about badges, comments, post history, post links, tags, and users.

5.2 Beyond DataFrames: introducing DataSets
DataSets, introduced in Spark 1.6.0 as an experimental feature, graduated into a pivotal construct in Spark 2.0. The idea behind DataSets "is to provide an API that allows users to easily express transformations on domain objects, while also


providing the performance and robustness advantages of the Spark SQL execution engine" (https://issues.apache.org/jira/browse/SPARK-9999). That essentially means you can store ordinary Java objects in DataSets and take advantage of the Tungsten and Catalyst optimizations.
In a way, DataSets compete with RDDs, because the two have overlapping functionality. Another path the Spark community could have taken was to change the RDD API to include the new features and optimizations, but that would have broken the API and required changes to too many existing applications, so it was decided against. DataFrames are now simply implemented as DataSets containing Row objects.
To convert a DataFrame to a DataSet, you use DataFrame's as method:

def as[U : Encoder]: Dataset[U]

As you can see, you need to provide an Encoder object, which tells Spark how to interpret the DataSet's contents. Encoders for most Java and Scala primitive types, such as String, Int, and so on, are available implicitly, so you don't need to do anything special to create a DataSet containing Strings or Doubles:

val stringDataSet = spark.read.text("path/to/file").as[String]
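Scala case classes get encoders implicitly as well (through spark.implicits._, which the Spark shell imports automatically). A minimal sketch of ours, using a hypothetical Post case class and made-up data rather than the book's dataset:

case class Post(id: Long, title: String)
val postsDs = Seq((1L, "First post"), (2L, "Second post")).
  toDF("id", "title").as[Post]
postsDs.filter(_.title.startsWith("F")).show()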

You can write your own encoders, or you can use encoders for ordinary Java bean classes. Fields of the bean class need to be primitive types (or their boxed versions, such as Integer or Double), BigDecimals, Date or Timestamp objects, arrays, lists, or nested Java beans.
You can manipulate a DataSet's columns just as you did with DataFrames in this chapter. If you look at the DataSet documentation (http://mng.bz/3EQc), you'll also see many transformations and actions familiar from both the RDD and DataFrame APIs. DataSets will surely improve in future versions of Spark and will integrate more fully with other parts of Spark, so keep an eye on them.

5.3 Using SQL commands
The DataFrame DSL functionality presented in the previous sections is also accessible through SQL commands as an alternative interface for programming Spark SQL. Writing SQL is probably easier and more natural for users who are used to working with relational databases, or distributed databases such as Hive. When you write SQL commands in Spark SQL, they get translated into operations on DataFrames.
Because SQL is so widespread, using SQL commands opens Spark SQL and DataFrames to users who have only a SQL interface at their disposal. They can connect to Spark from their applications through the standard JDBC or ODBC protocols by connecting to Spark's Thrift server (we'll talk about it in a bit). In this section, you'll learn how to perform SQL queries referencing your DataFrames and how to let users connect to your Spark cluster through the Thrift server.


Spark supports two SQL dialects: Spark's own SQL dialect and Hive Query Language (HQL). As of Spark 1.5, the Spark community recommends HQL because it has a richer set of functionality. To use Hive functionality, you need a Spark distribution built with Hive support (which is the case for the archives downloaded from the main Spark download page). In addition to bringing a more powerful SQL parser, Hive support lets you access existing Hive tables and use existing, community-built Hive UDFs.
Hive functionality is enabled in Spark by calling enableHiveSupport() on a Builder object while constructing a SparkSession. If you're using a Spark distribution with Hive support, the Spark shell automatically detects this and enables Hive functionality. In your programs, you enable it yourself:

val spark = SparkSession.builder().
  enableHiveSupport().
  getOrCreate()

5.3.1 Table catalog and Hive metastore
As you probably know, most SQL operations operate on tables referenced by name. When executing SQL queries with Spark SQL, you can reference a DataFrame by name by registering it as a table. When you do that, Spark stores the table definition in the table catalog.
For Spark without Hive support, the table catalog is implemented as a simple in-memory map, which means that table information lives in the driver's memory and disappears with the Spark session. A SparkSession with Hive support, on the other hand, uses a Hive metastore to implement the table catalog. A Hive metastore is a persistent database, so DataFrame definitions remain available even if you close the Spark session and start a new one.
REGISTERING TABLES TEMPORARILY
Even with Hive support, you can still create temporary table definitions. In both cases (Spark with or without Hive support), the createOrReplaceTempView method registers a temporary table. You can register the postsDf DataFrame like this:

postsDf.createOrReplaceTempView("posts_temp")

Now you’ll be able to query data from postsDf DataFrame using SQL queries referenc- ing the DataFrame by the name posts_temp. We’ll show you how to do that in a jiffy. REGISTERING TABLES PERMANENTLY As we said, only SparkSession with Hive support can be used to register table definitions that will survive your application’s restarts (in other words, they’re persistent). By default, HiveContext creates a Derby database in the local working directory under the metastore_db subdirectory (or it reuses the database if it already exists). If you wish to change where the working directory is located, set the hive.metastore.warehouse.dir property in your hive-site.xml file (details shortly).


To register a DataFrame as a permanent table, you need to use its write member. Using the postsDf and votesDf DataFrames as an example again:

postsDf.write.saveAsTable("posts") votesDf.write.saveAsTable("votes")

After you save DataFrames to the Hive metastore like this, you can use them in SQL expressions.
WORKING WITH THE SPARK TABLE CATALOG
Since version 2.0, Spark provides a facility for managing the table catalog. It's implemented as the Catalog class, accessible through SparkSession's catalog field. You can use it to see which tables are currently registered:

scala> spark.catalog.listTables().show()
+----------+--------+-----------+---------+-----------+
|      name|database|description|tableType|isTemporary|
+----------+--------+-----------+---------+-----------+
|     posts| default|       null|  MANAGED|      false|
|     votes| default|       null|  MANAGED|      false|
|posts_temp|    null|       null|TEMPORARY|       true|
+----------+--------+-----------+---------+-----------+

You can use the show() method here because listTables returns a DataSet of Table objects. You can immediately see which tables are permanent and which are temporary (the isTemporary column). The MANAGED table type means that Spark also manages the data for the table. A table can also be EXTERNAL, which means its data is managed by another system, such as an RDBMS.
Tables are registered in metastore "databases." The default database, called "default," stores managed tables in the spark-warehouse subdirectory of the directory where you started your Spark application. You can change that location by setting the spark.sql.warehouse.dir parameter to the desired value. You can also use the Catalog object to examine the columns of a specific table:

scala> spark.catalog.listColumns("votes").show() +------+------+------+------+------+------+ | name|description| dataType|nullable|isPartition|isBucket| +------+------+------+------+------+------+ | id| null| bigint| true| false| false| | postid| null| bigint| true| false| false| | votetypeid| null| int| true| false| false| |creationdate| null|timestamp| true| false| false| +------+------+------+------+------+------+

To get a list of all the available SQL functions, call:

scala> spark.catalog.listFunctions.show()


You can also manage which tables are cached in memory, and which aren't, with the cacheTable, uncacheTable, isCached, and clearCache methods.
CONFIGURING A REMOTE HIVE METASTORE
You can also configure Spark to use a remote Hive metastore database. This can be the metastore database of an existing Hive installation or a new database to be used exclusively by Spark. The configuration is done by placing the Hive configuration file hive-site.xml, with the appropriate configuration parameters, in Spark's conf directory. This overrides the spark.sql.warehouse.dir parameter.
The hive-site.xml file must contain a configuration tag with property tags in it. Each property tag has name and value subtags. For example:

<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/hive/metastore/directory</value>
  </property>
</configuration>

To configure Spark to use a remote Hive metastore, the following properties need to be in the hive-site.xml file (a sketch of a complete file follows below):
■ javax.jdo.option.ConnectionURL: JDBC connection URL
■ javax.jdo.option.ConnectionDriverName: class name of the JDBC driver
■ javax.jdo.option.ConnectionUserName: database username
■ javax.jdo.option.ConnectionPassword: database user password
The connection URL must point to an existing database containing Hive tables. To initialize the metastore database and create the necessary tables, you can use Hive's schematool. Please consult the official Hive documentation (http://mng.bz/3HJ5) on how to use it. The JDBC driver you specify must be on the classpath of the driver and all executors. The easiest way to do this is to supply the JAR files with the --jars option when submitting your application or starting your Spark shell.
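Putting this together, here is the promised sketch of a complete hive-site.xml for a hypothetical PostgreSQL metastore (the host name, database name, and credentials are placeholders you'd replace with your own):

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://metastorehost/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>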

5.3.2 Executing SQL queries
Now that you have a DataFrame registered as a table, you can query its data using SQL expressions. This is done with SparkSession's sql function. In the Spark shell, it's automatically imported (import spark.sql), so you can use it directly to write SQL commands:

val resultDf = sql("select * from posts")

The result is again a DataFrame. All the data manipulations performed with DataFrame DSL in the previous sections can also be done through SQL, but Spark SQL offers more


options. When using a SparkSession with Hive support, which is the recommended Spark SQL engine, the majority of Hive commands and data types are supported. For example, with SQL you can use DDL commands such as ALTER TABLE and DROP TABLE. You can find the complete list of supported Hive features in the Spark documentation (http://mng.bz/8AFz). We won't list them here, but you can find the details in the Hive language manual (http://mng.bz/x7k2).
USING THE SPARK SQL SHELL
In addition to the Spark shell, Spark also offers an SQL shell in the form of the spark-sql command, which supports the same arguments as the spark-shell and spark-submit commands (for details, see chapters 10 and 11) but adds some of its own. When run without arguments, it starts an SQL shell in local mode. At the shell prompt, you can enter the same SQL commands you would with the sql function used in the previous sections.
For example, to show the titles (cropped to 70 characters so they fit on this page) of the three most recent questions from the posts table you permanently saved earlier, enter the following at the Spark SQL shell prompt (you'll need to stop your Spark shell first to avoid locking the same Derby metastore):

spark-sql> select substring(title, 0, 70) from posts where postTypeId = 1 order by creationDate desc limit 3;
Verbo impersonale che regge verbo impersonale: costruzione implicita?
Perché si chiama "saracinesca" la chiusura metallica scorren
Perché a volte si scrive l'accento acuto sulla "i" o sulla &
Time taken: 0.375 seconds, Fetched 3 row(s)

NOTE In the SQL shell, you need to terminate SQL expressions with semicolons (;).

You can also run a single SQL query without entering the shell with the -e argument. In this case, you don’t need a semicolon. For example:

$ spark-sql -e "select substring(title, 0, 70) from posts where ➥ postTypeId= 1 order by creationDate desc limit 3"

And to run a file with SQL commands, specify it with the -f argument. The -i argument enables you to specify an initialization SQL file to run before any other SQL commands. Again, you can use the data files in our online repository to play around with Spark SQL and further examine the API.

5.3.3 Connecting to Spark SQL through the Thrift server
In addition to executing SQL queries directly from your programs or through the SQL shell, Spark also lets you run SQL commands remotely, through a JDBC (and ODBC) server called Thrift.3 JDBC (and ODBC) is a standard way of accessing relational

3 Spark’s Thrift server is based on Hive’s server of the same name.

databases, which means the Thrift server opens Spark for use by any application capable of communicating with relational databases.
The Thrift server is a special Spark application, capable of accepting JDBC and ODBC connections from multiple users and executing their queries in a Spark SQL session. It runs in a Spark cluster, like any other Spark application. SQL queries get translated to DataFrames and, finally, to RDD operations (as we discussed previously), and the results are sent back over the JDBC protocol. DataFrames referenced by the queries must be permanently registered in the Hive metastore used by the Thrift server.
STARTING A THRIFT SERVER
You start a Thrift server with the start-thriftserver.sh command from Spark's sbin directory. You can pass it the same arguments as those accepted by the spark-shell and spark-submit commands (for details, see chapters 10 and 11). If you set up a remote metastore database, you can tell the Thrift server where to find the JAR with the JDBC driver for accessing the database by using the --jars argument. If your remote metastore database is PostgreSQL, you can start the Thrift server like this:

$ sbin/start-thriftserver.sh --jars /usr/share/java/postgresql-jdbc4.jar

This script starts a Thrift server in the background and then exits. The default Thrift server port is 10000. You can change the listening port and host name of the Thrift server either with the environment variables HIVE_SERVER2_THRIFT_PORT and HIVE_SERVER2_THRIFT_BIND_HOST (both are needed) or with the hive.server2.thrift.port and hive.server2.thrift.bind.host Hive configuration variables, which you specify with --hiveconf parameters when starting the server.
CONNECTING TO THE THRIFT SERVER WITH BEELINE
Beeline is Hive's command-line shell for connecting to a Thrift server; it's available in Spark's bin directory. You can use it to test the connection to the Thrift server you just started. You need to provide the JDBC URL of the server and, optionally, a username and a password:

$ beeline -u jdbc:hive2://<server_name>:<port> -n <username> -p <password>
Connecting to jdbc:hive2://<server_name>:<port>
Connected to: Spark SQL (version 1.5.0)
Driver: Spark Project Core (version 1.5.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.5.0 by Apache Hive
0: jdbc:hive2://<server_name>:<port>>

Once connected to the server, Beeline shows its command prompt, where you can enter HQL commands for interacting with the tables in your Hive metastore.
CONNECTING FROM THIRD-PARTY JDBC CLIENTS
As an example of a JDBC client, we'll use Squirrel SQL, an open source Java SQL client. We're using Squirrel SQL to show the typical steps and configuration needed for connecting to Spark's Thrift server, but you can use any other JDBC client.


Figure 5.2 Defining a Hive driver in Squirrel SQL as an example of a JDBC client connecting to Spark's Thrift server. You need to provide the driver name and classpath. hive-jdbc-<version>-standalone.jar and hadoop-common-<version>.jar are the only JAR files needed for a successful connection.

To connect to your Thrift server from Squirrel SQL, you define a Hive driver and an alias. The driver definition parameters are shown in figure 5.2. The two JAR files needed on the classpath are hive-jdbc-<version>-standalone.jar and hadoop-common-<version>.jar. The former is part of the Hive distribution (in the lib folder), and the latter can be found in the Hadoop distribution (in the share/hadoop/common folder).
The next step is to define an alias using the defined Hive driver. Figure 5.3 shows the parameters needed: alias name, URL, username, and password. The URL needs to be in the format jdbc:hive2://<server_name>:<port>.

Figure 5.3 Defining an alias for the Thrift server connection in Squirrel SQL as an example of a JDBC client. You need to select the driver defined previously and enter a URL in the format jdbc:hive2://<server_name>:<port>, a username, and a password.


Figure 5.4 Selecting the contents of the posts table through the Thrift server using Squirrel SQL, an open source JDBC client program

And that’s all the configuration you need. When you connect to the Thrift server, you can enter SQL queries in the SQL tab. An example is shown in figure 5.4. This is a con- venient way of connecting your usual visualization and analytical tools to your distrib- uted data through Spark. 5.4 Saving and loading DataFrame data Spark has built-in support for several file formats and databases (generally called data sources in Spark). These include JDBC and Hive, which we mentioned earlier, and the JSON, ORC, and Parquet file formats. For relational databases, Spark specifically sup- ports dialects (meaning data-type mappings) for the MySQL and PostgreSQL databases. Data sources are pluggable, so you can add your own implementations. And you can download and use some external data, such as the CSV (https://github.com/ databricks/spark-csv), Avro (https://github.com/databricks/spark-avro), and Ama- zon Redshift (https://github.com/databricks/spark-redshift) data sources. Spark uses the metastore, which we covered in the previous section, to save infor- mation about where and how the data is stored. It uses data sources for saving and loading the actual data. 5.4.1 Built-in data sources Before explaining how to save and load DataFrame data, we’ll first say a few words about the data formats that Spark supports out of the box. Each has its strengths and


weaknesses, and you should understand when to use one or another. The built-in data formats are JSON, ORC, and Parquet.
JSON
The JSON format is commonly used for web development and is popular as a lightweight alternative to XML. Spark can automatically infer a JSON schema, so JSON can be a great solution for receiving data from and sending data to external systems, but it isn't an efficient permanent data-storage format. It's simple, easy to use, and human-readable.
ORC
The optimized row columnar (ORC) file format was designed to provide a more efficient way to store Hive data (http://mng.bz/m6Mn) than RCFile, which was previously the standard format for storing data in Hadoop. The ORC format is columnar, which means data from a single column is physically stored in close proximity, unlike row formats, where data from a single row is stored sequentially.
ORC files consist of groups of row data, called stripes, a file footer, and a postscript area at the end of the file. The file footer contains the list of stripes in the file, the number of rows per stripe, and the column data types. The postscript area contains compression parameters and the size of the file footer.
Stripes are typically 250 MB, and they contain index data, row data, and a stripe footer. Index data contains minimum and maximum values for each column and row positions in columns. Index data can also include a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter), which can be used to quickly test whether a stripe contains a certain value. Index data can speed up table scans because certain stripes can be skipped and not read at all.
ORC uses type-specific serializers, which can take advantage of a type's specifics to store the data more efficiently. On top of that, stripes are compressed with Zlib or Snappy. The complete ORC file format specification can be found at http://mng.bz/0Z5B.
PARQUET
Unlike the ORC file format, the Parquet format (http://mng.bz/3IOo) started outside Hive and was only later integrated with it. Parquet was designed to be independent of any specific framework and free of unnecessary dependencies, which is why it's more popular in the Hadoop ecosystem than the ORC format.
Parquet is also a columnar file format and also uses compression, but it allows compression schemes to be specified per column. It pays special attention to nested complex data structures, so it works better than the ORC file format on those types of datasets. It supports the LZO, Snappy, and GZIP compression libraries. Furthermore, it keeps min/max statistics about column chunks, so it can also skip some of the data when querying. Parquet is the default data source in Spark.


5.4.2 Saving data DataFrame’s data is saved using the DataFrameWriter object, available as DataFrame’s write field. You already saw an example of using DataFrameWriter in section 5.3.1:

postsDf.write.saveAsTable("posts")

In addition to the saveAsTable method, data can be saved with the save and insertInto methods. saveAsTable and insertInto save data into Hive tables and use the metastore in the process; save doesn't. If you're not using a Spark session with Hive support, saveAsTable and insertInto create (or insert into) temporary tables. You can configure all three methods with DataFrameWriter's configuration functions (explained next).
CONFIGURING THE WRITER
DataFrameWriter implements the builder pattern (https://en.wikipedia.org/wiki/Builder_pattern), which means its configuration functions return objects with configuration fragments, so you can stack them one after another and build the desired configuration incrementally. The configuration functions are the following:
■ format: Specifies the file format for saving data (the data source name), which can be one of the built-in data sources (json, parquet, orc) or a named custom data source. When no format is specified, the default is parquet.
■ mode: Specifies the save mode when a table or a file already exists. Possible values are overwrite (overwrites the existing data), append (appends the data), ignore (does nothing), and error (throws an exception); the default is error.
■ option and options: Add a single parameter name and value (or a parameter-value map) to the data source configuration.
■ partitionBy: Specifies partitioning columns.
The Spark SQL parameter spark.sql.sources.default determines the default data source. Its default value is parquet. (We described the Parquet file format in section 5.4.1.) As we said, you can stack these functions one after another to build a DataFrameWriter object:

postsDf.write.format("orc").mode("overwrite").option(...)

USING THE SAVEASTABLE METHOD
As we said, saveAsTable saves data to a Hive table and registers it in the Hive metastore if you're using Hive support, which we recommend. If you aren't using Hive support, the DataFrame is registered as a temporary table. saveAsTable takes only the table name as an argument. If the table already exists, the mode configuration parameter determines the resulting behavior (the default is to throw an exception).


The embedded Hive libraries take care of saving the table data if you're saving it in a format for which a Hive SerDe (serialization/deserialization class) exists. If no Hive SerDe exists for the format, Spark chooses the format for saving the data (for example, text for JSON). For instance, you can save the postsDf DataFrame in JSON format with the following line:

postsDf.write.format("json").saveAsTable("postsjson")

This saves the posts data as the postsjson file in Spark's metastore warehouse directory. Each line in the file contains a complete JSON object. You can now query the table from the Hive metastore:

scala> sql("select * from postsjson")

USING THE INSERTINTO METHOD
When using the insertInto method, you need to specify a table that already exists in the Hive metastore and that has the same schema as the DataFrame you wish to save. If the schemas aren't the same, Spark throws an exception. Because the table's format is already known (the table exists), the format and option configuration parameters are ignored. If you set the mode configuration parameter to overwrite, the table's contents are deleted and replaced with the DataFrame's contents.
USING THE SAVE METHOD
The save method doesn't use the Hive metastore; it saves data directly to the filesystem. You pass it a direct path to the destination, whether an HDFS path, an Amazon S3 path, or a local path URL. If you pass in a local path, the file is saved locally on every executor's machine.
USING THE SHORTCUT METHODS
DataFrameWriter has three shortcut methods for saving data to the built-in data sources: json, orc, and parquet. Each first calls format with the appropriate data source name and then calls save, passing in the input path argument. This means these three methods don't use the Hive metastore.
SAVING DATA TO RELATIONAL DATABASES WITH JDBC
You can save the contents of a DataFrame with DataFrameWriter's jdbc method. It takes three parameters: a URL string, a table name, and a java.util.Properties object containing connection properties, usually user and password. For example, to save the posts data into a PostgreSQL table named posts in the database mydb on the server postgresrv, you can use the following:

val props = new java.util.Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")
postsDf.write.jdbc("jdbc:postgresql://postgresrv/mydb", "posts", props)


All of postsDf’s partitions connect to the database to save the data, so you have to make sure the DataFrame doesn’t have too many partitions or you may overwhelm your database. You also have to make sure the required JDBC driver is accessible to your executors. This can be done by setting the spark.executor.extraClassPath Spark configuration parameter (for details, see chapter 10). 5.4.3 Loading data You load data with an org.apache.spark.sql.DataFrameReader object, accessible through SparkSession’s read field. It functions analogously to DataFrameWriter. You can configure it with the format and option/options functions and additionally with the schema function. The schema function specifies the schema of the DataFrame. Most data sources automatically detect the schema, but you can speed it up by specify- ing the schema yourself. Similarly to DataFrameWriter’s save method, the load method loads data directly from the configured data source. Three shortcut methods—json, orc, and parquet— analogously call format and then load. You can use the table function to load a DataFrame from a table registered in the Hive metastore. For example, instead of executing the SELECT * FROM POSTS com- mand, as you did previously, you can load the posts table like this

val postsDf = spark.read.table("posts")

or

val postsDf = spark.table("posts")

LOADING DATA FROM RELATIONAL DATABASES USING JDBC
DataFrameReader's jdbc function is similar to DataFrameWriter's jdbc function but has several differences. At a minimum, it accepts a URL, a table name, and a set of properties (in a java.util.Properties object). But you can also narrow the dataset to be retrieved with a set of predicates (expressions that can go into a where clause). For example, to load all posts with more than three views from the example PostgreSQL table you created in the previous section, you can use the following:

val result = spark.read.jdbc("jdbc:postgresql://postgresrv/mydb", "posts", Array("viewCount > 3"), props)

LOADING DATA FROM DATA SOURCES REGISTERED USING SQL
An alternative way of registering temporary tables lets you use SQL to reference existing data sources. You can accomplish (almost) the same result as in the previous example with the following snippet:

scala> sql("CREATE TEMPORARY TABLE postsjdbc "+ "USING org.apache.spark.sql.jdbc "+ "OPTIONS ("+ "url 'jdbc:postgresql://postgresrv/mydb',"+ "dbtable 'posts',"+


"user 'user',"+ "password 'password')") scala> val result = sql("select * from postsjdbc")

The method isn’t quite the same, though, because this way you can’t specify predicates (viewCount > 3). As an example of another built-in data source, to register a Parquet file and load its contents, this is what you can do:

scala> sql("CREATE TEMPORARY TABLE postsParquet "+ "USING org.apache.spark.sql.parquet "+ "OPTIONS (path '/path/to/parquet_file')") scala> val resParq = sql("select * from postsParquet")

5.5 Catalyst optimizer
The Catalyst optimizer is the brain behind DataFrames and DataSets and is responsible for converting DataFrame DSL and SQL expressions into low-level RDD operations. It can be easily extended, and additional optimizations can be added.
Catalyst first creates a parsed logical plan from the DSL and SQL expressions. Then it checks the names of tables, columns, and qualified names (called relations) and creates an analyzed logical plan. In the next step, Catalyst tries to optimize the plan by rearranging and combining the lower-level operations. It might decide to move a filter operation before a join so as to reduce the amount of data involved in the join, for example. This step produces an optimized logical plan. A physical plan is then calculated from the optimized plan. Future Spark versions will implement the generation of several physical plans and the selection of the best one based on a cost model. Figure 5.5 shows all of these steps.
Logical optimization means Catalyst tries to push predicates down to data sources so that subsequent operations work on as small a dataset as possible. During physical planning, instead of performing a shuffle join, for example, Catalyst may decide to broadcast one of the two datasets if it's small enough (smaller than 10 MB).

Figure 5.5 The steps of transforming SQL and DSL expressions into RDD operations include analysis, logical optimization, physical planning, and code generation.


EXAMINING THE EXECUTION PLAN
You can see the results of the optimizations and examine the generated plans in two ways: by using DataFrame's explain method or by consulting the Spark web UI. Let's again use postsDf as an example. Consider the expression you used in section 5.1.2 to filter posts by "views per score point" less than 35:

scala> val postsFiltered = postsDf.filter('postTypeId === 1).
  withColumn("ratio", 'viewCount / 'score).where('ratio < 35)

You can examine DataFrame’s calculated logical and physical plans by calling explain(true). If called without an argument (which is the same as calling it with false), explain displays only the physical plan. For postsFiltered, explain returns the following (the output has been truncated for easier viewing; you can find the full output in our online repository): scala> postsFiltered.explain(true) == Parsed Logical Plan == 'Filter ('ratio < 35) Project [...columns ommitted..., ...ratio expr... AS ratio#21] Filter (postTypeId#11L = cast(1 as bigint)) Project [...columns ommitted...] Subquery posts Relation[...columns ommitted...] ParquetRelation[path/to/posts]

== Analyzed Logical Plan == ...columns ommitted... Filter (ratio#21 < cast(35 as double)) Project [...columns ommitted..., ...ratio expr... AS ratio#21] Filter (postTypeId#11L = cast(1 as bigint)) Project [...columns ommitted...] Subquery posts Relation[...columns ommitted...] ParquetRelation[path/to/posts]

== Optimized Logical Plan == Project [...columns ommitted..., ...ratio expr... AS ratio#21] Filter ((postTypeId#11L = 1) && ((cast(viewCount#6 as double) / cast(score#4 as double)) < 35.0)) Relation[...columns ommitted...] ParquetRelation[path/to/posts]

== Physical Plan == Project [...columns ommitted..., ...ratio expr... AS ratio#21] Filter ((postTypeId#11L = 1) && ((cast(viewCount#6 as double) / cast(score#4 as double)) < 35.0)) Scan ParquetRelation[path/to/posts][...columns ommitted...]


As you can see, the parsed logical plan looks like the original Scala expression if you read it from the bottom up. ParquetRelation in the last line (the first step of the parsed logical plan) means the data is to be read from a Parquet file. Project (the second step) is an internal Spark class that shows which columns are to be selected. The first filter step (the Filter line) filters by post type, the next project step adds the ratio column, and the last filter step filters by the ratio column.
Compare that to the physical plan at the end of the output. The two filters were merged into a single filter, and the ratio column isn't added until the last step in the plan.
Beginning with Spark 1.5, you can get the same output from the Spark web UI, available at port 4040 on the machine where your Spark driver is running (for more information about the Spark web UI, see chapter 10). The SQL tab shows completed SQL queries. For each one, you can click the +Details link in the Details column and get output similar to the one shown previously.
TAKING ADVANTAGE OF PARTITION STATISTICS
The Catalyst optimizer examines the contents of a DataFrame's partitions, calculates statistics of its columns (such as lower and upper bounds, the number of NULL values, and so forth), and then uses this data to skip some partitions while filtering, which brings additional performance benefits. These statistics are automatically calculated when DataFrames are cached in memory, so there's nothing special you need to do to enable this behavior. Just remember to cache DataFrames.

5.6 Performance improvements with Tungsten
Spark 1.5 also introduced the Tungsten project. Tungsten is a complete overhaul of Spark memory management, along with other performance improvements in sorting, aggregating, and shuffling. Beginning with Spark 1.5, Tungsten is enabled by default (through the Spark SQL configuration parameter spark.sql.tungsten.enabled). Since Spark 2.0, its improvements have been extended from structured data in DataFrames to unstructured data in DataSets.
Tungsten's memory-management improvements are based on binary encoding of objects (integers, strings, tuples, and so on) and referencing them directly in memory. Two modes are supported: on-heap and off-heap allocation. On-heap allocation stores binary-encoded objects in large, JVM-managed arrays of longs. Off-heap allocation mode uses the sun.misc.Unsafe class to directly allocate (and free) memory by address, similar to how it's done in the C language. Off-heap mode still uses arrays of longs for storing binary-encoded objects, but these arrays are no longer allocated and garbage-collected by the JVM; they're directly managed by Spark. A new class, UnsafeRow, is used internally to represent rows backed by directly managed memory. Off-heap allocation is disabled by default, but you can enable it by setting the spark.unsafe.offHeap Spark configuration parameter (not a Spark SQL parameter) to true.


Binary-encoded objects take up much less memory than their Java representations. Storing them in arrays of Long (on-heap mode) significantly reduces garbage collection. Allocating those arrays (off-heap mode) removes the need for garbage collection entirely. Tungsten's binary encoding also includes several tricks so that the data can be more efficiently cached in a CPU's L1 and L2 caches.

Project Tungsten also improves shuffle performance. Prior to Spark 1.5, only sort-based and hash-based shuffle managers were available. Now you can use a new Tungsten shuffle manager. It's also sort-based but uses binary encoding (mentioned earlier). You can enable it by setting Spark's spark.shuffle.manager parameter to tungsten-sort. Future Spark versions will bring more performance improvements by further implementing Tungsten's binary encoding in Spark components.

5.7 Summary
■ DataFrames translate SQL code and DSL expressions into optimized, low-level RDD operations.
■ DataFrames have become one of the most important features in Spark and have made Spark SQL the most actively developed Spark component.
■ Three ways of creating DataFrames exist: by converting existing RDDs, by running SQL queries, or by loading external data.
■ You can use DataFrame DSL operations to select, filter, group, and join data.
■ DataFrames support scalar, aggregate, window, and user-defined functions.
■ With the DataFrameNaFunctions class, accessible through DataFrame's na field, you can deal with missing values in the dataset.
■ Spark SQL has its own configuration method.
■ Tables can be registered temporarily and permanently in the Hive metastore, which can reside in a local Derby database or in a remote relational database.
■ The Spark SQL shell can be used to directly write queries referencing tables registered in the Hive metastore.
■ Spark includes a Thrift server that clients can connect to over JDBC and ODBC and use to perform SQL queries.
■ Data is loaded into DataFrames through DataFrameReader, available through SparkSession's read field.
■ Data is saved from DataFrames through DataFrameWriter, available through DataFrame's write field.
■ Spark's built-in data sources are JSON, ORC, Parquet, and JDBC. Third-party data sources are available for download.
■ The Catalyst optimizer (the brain behind DataFrames) can optimize logical plans and create physical execution plans.


■ The Tungsten project introduced numerous performance improvements through binary, cache-friendly encoding of objects, on-heap and off-heap allocation, and a new shuffle manager.
■ DataSets are an experimental feature similar to DataFrames, but they enable you to store plain Java objects instead of generic Row containers.

Ingesting data with Spark Streaming

This chapter covers
■ Using discretized streams
■ Saving computation state over time
■ Using window operations
■ Reading from and writing to Kafka
■ Obtaining good performance

Real-time data ingestion, in today's high-paced, interconnected world, is getting increasingly important. There is much talk today about the so-called Internet of Things or, in other words, a world of devices in use in our daily lives, which continually stream data to the internet and to each other and make our lives easier (in theory, at least). Even without those micro-devices overwhelming our networks with their data, many companies today need to receive data in real time, learn from it, and act on it immediately. After all, time is money, as they say.

It isn't hard to think of professional fields that might (and do) profit from real-time data analysis: traffic monitoring, online advertising, stock market trading, unavoidable social networks, and so on. Many of these cases need scalable and fault-tolerant systems for ingesting data, and Spark boasts all of those features. In addition to enabling scalable analysis of high-throughput data, Spark is also a unifying platform, which means you can use the same APIs from streaming and batch programs. That way, you can build both the speed and batch layers of the lambda architecture (the name and the design come from Nathan Marz; check out his book Big Data [Manning, 2015]).

Spark Streaming has connectors for reading data from Hadoop-compatible filesystems (such as HDFS and S3) and distributed systems (such as Flume, Kafka, and Twitter). In this chapter, you'll first stream data from files and write the results back to files. You'll then expand on that and use Kafka, the scalable and distributed message-queuing system, as the source and destination for the data. The same principles you'll learn there can be applied to other sources as well. At the end of the chapter, we'll show you how to ensure good performance for your streaming applications.

In chapter 13, you'll find a successful application of Spark Streaming to the problem of real-time log analysis. Methods and concepts taught in this chapter will be applied there.

6.1 Writing Spark Streaming applications
As you saw in previous chapters, Spark is great for working with structured and unstructured data. And as you already may have concluded, Spark is batch processing–oriented. But how are Spark's batch-processing features applied to real-time data?

The answer is that Spark uses mini-batches. This means Spark Streaming takes blocks of data, which come in specific time periods, and packages them as RDDs. Figure 6.1 illustrates this concept.

Figure 6.1 The concept of processing streaming data in Apache Spark. Spark Streaming is splitting the input data stream into time-based mini-batch RDDs, which are then processed using other Spark components, as usual.

As shown in the figure, data can come into a Spark Streaming job from various external systems. These include filesystems and TCP/IP socket connections, but also other distributed systems, such as Kafka, Flume, Twitter, and Amazon Kinesis. Different Spark Streaming receiver implementations exist for different sources (data from some data sources is read without using receivers, but let's not complicate things too early). Receivers know how to connect to the source, read the data, and forward it further into Spark Streaming. Spark Streaming then splits the incoming data into mini-batch RDDs, one mini-batch RDD for one time period, and then the Spark application processes it according to the logic built into the application. During mini-batch processing, you're free to use other parts of the Spark API, such as machine learning and SQL. The results of computations can be written to filesystems, relational databases, or to other distributed systems.

6.1.1 Introducing the example application
For purposes of learning Spark Streaming concepts, imagine you need to build a dashboard application for a brokerage firm. Clients of the firm use their internet application to place market orders (for buying or selling securities¹), and brokers need to carry out the orders in the market. The dashboard application you need to build will calculate the number of selling and buying orders per second, the top five clients as measured by the total amounts bought or sold, and the top five securities bought or sold during the last hour.

For starters, to make things simpler, you'll read the data from an HDFS file and write the results back to HDFS. In section 6.2, we'll expand on this and show you how to connect to Kafka, a distributed message-passing system. Similarly, in order to use simpler functionalities first, the implementation of the first version will only count the number of selling and buying orders per second. Later, you'll add calculations for the top five clients and top five securities.

One more thing to note: everything in this chapter will be done using your Spark shell. We'll point out any differences between running Spark Streaming from the Spark shell and from a standalone application. If you went through chapter 3 successfully, you should be able to apply the principles learned there and embed the code from this chapter into a standalone application, which you would submit to a cluster as a JAR archive.

¹ Securities are tradable financial assets, such as bonds, stocks, and the infamous derivatives.


6.1.2 Creating a streaming context
If you'd like to follow along, now is the time to start up your Spark shell. You can start a local cluster in the spark-in-action VM or connect to a Spark standalone, YARN, or Mesos cluster, if it's available to you (for details, please see chapter 10, 11, or 12). In any case, make sure to have more than one core available to your executors, because each Spark Streaming receiver has to use one core (technically, it's a thread) for reading the incoming data, and at least one more core needs to be available for performing the calculations of your program. For example, to run a local cluster, you can issue the following command:

$ spark-shell --master local[4]

Once your shell is up, the first thing you need to do is to create an instance of StreamingContext. From your Spark shell, you instantiate it using the SparkContext object (available as variable sc) and a Duration object, which specifies time intervals at which Spark Streaming should split the input stream data and create mini-batch RDDs. The mini-batch intervals depend on the use case (how important it is to see the latest data) and on the performance requirements and capacity of your cluster. We’ll have more to say about this later. For now, use an interval of five seconds:

scala> import org.apache.spark._
scala> import org.apache.spark.streaming._
scala> val ssc = new StreamingContext(sc, Seconds(5))

NOTE Instead of Seconds, you can use the Milliseconds and Minutes objects to specify duration.

The previous StreamingContext constructor reuses an existing SparkContext, but Spark Streaming can also start a new SparkContext if you give it a Spark configuration object:

val conf = new SparkConf().setMaster("local[4]").setAppName("App name")
val ssc = new StreamingContext(conf, Seconds(5))

You could use this snippet from your standalone application, but it wouldn’t work if you ran it from your shell because you can’t instantiate two Spark contexts in the same JVM.

6.1.3 Creating a discretized stream
As we said, for starters, you'll stream the data from a file. Let's first see what the data will look like.

DOWNLOADING THE DATA TO BE STREAMED
We prepared a file containing 500,000 lines representing buy and sell orders. The data was randomly generated. Each line contains the following comma-separated elements:


■ Order timestamp—Format yyyy-mm-dd hh:MM:ss
■ Order ID—Serially incrementing integer
■ Client ID—Integer randomly picked from the range 1 to 100
■ Stock symbol—Randomly picked from a list of 80 stock symbols
■ Number of stocks to be bought or sold—Random number from 1 to 1,000
■ Price at which to buy or sell—Random number from 1 to 100
■ Character B or S—Whether the event is an order to buy or sell

You can find an archive containing this file in our online repository. You should have already cloned the repository into the /home/spark folder, so you should be able to unzip the file with the following commands:

$ cd first-edition/ch06
$ tar xvfz orders.tar.gz

This is the data that will be arriving into your streaming application.

For streaming the incoming textual data directly from files, StreamingContext provides the textFileStream method, which monitors a directory (any Hadoop-compliant directory, such as HDFS, S3, GlusterFS, and local directories) and reads each newly created file in the directory. The method takes only one argument: the name of the directory to be monitored.

Newly created means it won't process the files that already exist in the folder when the streaming context starts, nor will it react to data that's added to a file. It will process only the files copied to the folder after processing starts.

Because it's unrealistic that all 500,000 events will arrive to your system all at once, we also prepared a Linux shell script named splitAndSend.sh, which splits the unzipped file (orders.txt) into 50 files, each containing 10,000 lines. It then periodically moves the splits to an HDFS directory (supplied as an argument), waiting for three seconds after copying each split. This is similar to what would happen in a real environment.

The user you're logged in as needs to have access to the hdfs command. If you don't have access to HDFS (which you should, if you're using the spark-in-action VM), you can specify a local folder and add an argument: local. The script will then periodically move the splits to the specified local folder. No need to start the script yet. You'll do that later.

CREATING A DSTREAM OBJECT
You should choose a folder (HDFS or a local one) where the splits will be copied to and from where your streaming application will read them (for example, /home/spark/ch06input). Then specify that folder as an argument to the textFileStream method:

scala> val filestream = ssc.textFileStream("/home/spark/ch06input")


The resulting filestream object is an instance of class DStream. DStream (which stands for "discretized stream") is the basic abstraction in Spark Streaming, representing a sequence of RDDs, periodically created from the input stream. Needless to say, DStreams are lazily evaluated, just like RDDs. So when you create a DStream object, nothing happens yet. The RDDs will start coming in only after you start the streaming context, which you'll do in section 6.1.6.

6.1.4 Using discretized streams
Now that you have your DStream object, you need to use it to calculate the number of selling and buying orders per second. But how do you do that? Well, similarly to RDDs, DStreams have methods that transform them to other DStreams. You can use those methods to filter, map, and reduce data in a DStream's RDDs and even combine and join different DStreams.

PARSING THE LINES
For your task, you should first transform each line into something more manageable, like a Scala case class. Let's do that now. First, let's create the Order class, which will hold order data:

scala> import java.sql.Timestamp
scala> case class Order(time: java.sql.Timestamp, orderId:Long,
  clientId:Long, symbol:String, amount:Int, price:Double, buy:Boolean)

You use java.sql.Timestamp to represent time because it's supported by Spark DataFrames so you won't have any trouble using this class for constructing DataFrames, in case you need them.

Now you need to parse the lines from the filestream DStream and thus obtain a new DStream containing Order objects. There are several ways you can accomplish this. Let's use the flatMap transformation, which operates on all elements of all RDDs in a DStream. The reason to use flatMap and not the map transformation is that you'd like to ignore any lines that don't match the format we expect. If the line can be parsed, the function returns a list with a single element and an empty list otherwise. This is the snippet you need:

import java.text.SimpleDateFormat

val orders = filestream.flatMap(line => {
  // Java's SimpleDateFormat is used for parsing timestamps
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
  // Each line is first split by commas
  val s = line.split(",")
  try {
    // The seventh field should be equal to "B" (buy) or "S" (sell)
    assert(s(6) == "B" || s(6) == "S")
    // Constructs an Order object from the parsed fields
    List(Order(new Timestamp(dateFormat.parse(s(0)).getTime()),
      s(1).toLong, s(2).toLong, s(3), s(4).toInt,
      s(5).toDouble, s(6) == "B"))
  }
  catch {
    // If anything goes wrong during parsing, the error is logged (to System.out only,
    // in this example), along with the complete line that caused it
    case e : Throwable => println("Wrong line format ("+e+"): "+line)
    // If a line can't be parsed, an empty list is returned, ignoring the problematic line
    List()
  }
})

Each RDD from the orders DStream now contains Order objects.

COUNTING THE NUMBERS OF BUY AND SELL ORDERS
The task is to count the number of buy and sell orders per second. For this, you'll use the PairDStreamFunctions object. Similar to RDDs, which get implicitly converted to instances of the PairRDDFunctions object if they contain two-element tuples, DStreams containing two-element tuples get automatically converted to PairDStreamFunctions objects. In that way, functions such as combineByKey, reduceByKey, flatMapValues, various joins, and other functions you may recognize from chapter 4 become available on DStream objects. If you map orders to tuples that contain the order type as the key and the count as the value, you can use reduceByKey (there is no countByKey function in PairDStreamFunctions) for counting occurrences of each order type (buy or sell). This is how to do this:

scala> val numPerType = orders.map(o => (o.buy, 1L)). reduceByKey((c1, c2) => c1+c2)

This should be familiar to you from chapter 4. reduceByKey here just sums different values per key, which are all initially equal to 1. Each RDD in the resulting numPerType DStream will contain at most two (Boolean, Long) tuples: one for buy orders (true) and one for sell orders (false).

6.1.5 Saving the results to a file
To save the results of your computation to a file, you can use DStream's saveAsTextFiles method. It takes a String prefix and an optional String suffix and uses them to construct the path at which the data should be periodically saved. Each mini-batch RDD is saved to a folder called <prefix>-<time in milliseconds>.<suffix>, or just <prefix>-<time in milliseconds> if a suffix isn't supplied. This means that every 5 seconds (in this example), a new directory is created. Each of these directories contains one file, named part-xxxxx, for each partition in the RDD (where xxxxx is the partition's number).

To create only one part-xxxxx file per RDD folder, you'll repartition the DStream to only one partition before saving it to a file. We've shown that each RDD will contain at most two elements, so you can be certain that putting all the data into one partition won't cause any memory problems. So this is what you can do:

scala> numPerType.repartition(1).saveAsTextFiles( "/home/spark/ch06output/output", "txt")
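With the five-second mini-batch interval used here, a new output folder appears every five seconds, and its name ends with the batch timestamp in milliseconds followed by the supplied suffix. The timestamp below is purely illustrative; yours will differ on every run:

/home/spark/ch06output/output-1460643250000.txt/part-00000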


The output file can again be a local file (if you’re running a local cluster) or a file on a distributed Hadoop-compatible filesystem, such as HDFS.

NOTE When developing a streaming application, DStream’s print(n) method may prove useful. It prints out the first n elements (10 by default) of each mini-batch RDD.
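For example, while developing this application you could temporarily register a debugging printout of the per-type counts (a throwaway line we're adding here, not part of the final program); nothing is printed until the streaming context is started:

scala> numPerType.print()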

But even after you execute this last command, still nothing happens. That’s because you still haven’t started the streaming context.

6.1.6 Starting and stopping the streaming computation
Finally, you get to see the fruit of your labor. Start the streaming computation by issuing the following command:

scala> ssc.start()

This starts the streaming context, which evaluates the DStreams it was used to create, starts their receivers, and starts running the programs the DStreams represent. In the Spark shell, this is all you need to do to run the streaming computation of your application. Receivers are started in separate threads, and you can still use your Spark shell to enter and run other lines of code in parallel with the streaming computation.

NOTE Although you can construct several StreamingContext instances using the same SparkContext object, you can’t start more than one StreamingContext in the same JVM at a time.

But if you were to start a streaming context like this in a standalone application, although the receiver threads would be started, the main thread of your driver would exit, unless you added the following line:

ssc.awaitTermination()

This line tells Spark to wait for the Spark Streaming computation to stop. You can also use the awaitTerminationOrTimeout() method, which will wait the specified maximum number of seconds for streaming to finish and will return false if the timeout occurred, or true if the streaming computation was stopped before the timeout.

SENDING DATA TO SPARK STREAMING
Now your Spark Streaming application is running, but it doesn't have any data to process. So let's give it some data using the splitAndSend.sh script that we mentioned previously. You might first need to make the script executable:

$ chmod +x first-edition/ch06/splitAndSend.sh

Then, from your command prompt, start the script and specify the input folder you used in the Spark Streaming code (we assume you used /home/spark/ch06input).


Make sure the unzipped orders.txt file is in /home/spark/first-edition/ch06, and don’t forget to add the local argument if the streaming input folder is a local one:

$ cd first-edition/ch06
$ ./splitAndSend.sh /home/spark/ch06input local

This will start copying parts of the orders.txt file to the specified folder, and the application will automatically start counting buy and sell orders in the copied files.

STOPPING THE SPARK STREAMING CONTEXT
You can wait for all the files to be processed (which will take about 2.5 minutes), or you can stop the running streaming context right from your shell. Just paste this line in the shell, and the streaming context will stop:

scala> ssc.stop(false)

The argument false tells the streaming context not to stop the Spark context, which it would do by default. You can't restart a stopped streaming context, but you can reuse the existing Spark context to create a new streaming context. And because the Spark shell allows you to overwrite the previously used variable names, you can paste all the previous lines in your shell and run the whole application again (if you wish to do so).

EXAMINING THE GENERATED OUTPUT
As we said earlier, saveAsTextFiles creates one folder per mini-batch. If you look at your output folders, you'll find two files in each of them, named part-00000 and _SUCCESS. _SUCCESS means writing has finished successfully, and part-00000 contains the counts that were calculated. The contents of the part-00000 file might look something like this:

(false,9969)
(true,10031)

Reading data from all these folders might seem difficult, but it’s simple using the Spark API. You can read several text files in one go using asterisks (*) when specifying paths for SparkContext’s textFile method. To read all the files you just generated into a single RDD, you can use the following expression:

val allCounts = sc.textFile("/home/spark/ch06output/output*.txt")

The asterisk in this case replaces the generated timestamps.
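If you then want to continue processing the collected counts, a small sketch like the following (our own example, not from the original text) parses the saved tuples back into Scala values and sums them over the whole run:

val parsedCounts = allCounts.map(line => {
  // lines look like "(true,10031)": strip the parentheses and split on the comma
  val s = line.stripPrefix("(").stripSuffix(")").split(",")
  (s(0).toBoolean, s(1).toLong)
})
// total number of buy (true) and sell (false) orders across all mini-batches
parsedCounts.reduceByKey(_ + _).collect()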

6.1.7 Saving the computation state over time
You've seen how to do basic calculations with Spark Streaming, but how to do the rest of the required calculations still isn't clear. You have to find the top five clients as measured by the total amounts bought or sold, and the top five securities bought or sold during the last hour.


The previous calculations only needed the data from the current mini-batch, but these new numbers have to be obtained by also taking into account the data from previous mini-batches. To calculate the top five clients, you have to keep track of the total dollar amount bought or sold by each client. In other words, you have to keep track of a state that persists over time and over different mini-batches.

This principle is illustrated in figure 6.2. New data periodically arrives over time in mini-batches. Each DStream is a program that processes the data and produces results. By using Spark Streaming methods to update state, DStreams can combine the persisted data from the state with the new data from the current mini-batch. The results are much more powerful streaming programs. Let's see which methods from Spark Streaming let you do this.

KEEPING TRACK OF THE STATE USING UPDATESTATEBYKEY
In addition to window operations, which will be covered in section 6.1.9, Spark provides two main methods for performing calculations while taking into account the previous state of the computation: updateStateByKey and mapWithState. We'll first show you how to use updateStateByKey. Both methods are accessible from PairDStreamFunctions; in other words, they're only available for DStreams containing key-value tuples. Therefore, before using these methods, you have to create such a DStream.

Figure 6.2 Keeping state over time. A DStream combines new data arriving in mini-batches with the data from state persisted over time, produces results, and updates the state.


You’ll reuse the orders DStream you created previously and expand the previous example by adding and changing a few lines. Creating orders and numPerType DStreams remains the same as in the previous example. You’ll just add the calculation of the state and change the way the results are saved. The complete listing will be given at the end of the section. Let’s first create a DStream that contains client IDs as keys and order dollar amounts as values (number of stocks bought or sold multiplied by their price):

scala> val amountPerClient = orders.map(o => (o.clientId, o.amount * o.price))

Now you can use the updateStateByKey method. There are two basic versions of updateStateByKey. The first one, which you’ll use in this example, lets you work with a DStream’s values. The second version lets you also work with, and potentially change, a DStream’s keys. Both versions return a new state DStream, which contains a state value for each key. The first version, at minimum, takes as an argument a function with this signature:

(Seq[V], Option[S]) => Option[S]

The first argument of this function is a Seq object with new values of a key that came in the current mini-batch. The second argument is the state value of the key, or None if the state for that key hasn’t been calculated yet. If the state for the key has been calculated, but no new values for the key were received in the current mini-batch, the first argument will be an empty Seq. The function should return the new value for the key’s state. This function is the only required argument for the updateStateByKey method, but you can also specify the number of partitions or a Partitioner object to be used for the resulting DStream. That could become important if you have lots of keys to keep track of and large state objects. To apply this to the example and create a state DStream from the amountPerClient DStream, you can use the following snippet:

val amountState = amountPerClient.updateStateByKey((vals,
  totalOpt:Option[Double]) => {
  totalOpt match {
    // If state for this key already exists, sum it up with the sum of new values
    case Some(total) => Some(vals.sum + total)
    // Otherwise, return only the sum of new values
    case None => Some(vals.sum)
  }
})

Now, to find the top five clients by amounts of their orders, you need to sort each RDD in the amountState DStream and leave in each RDD only the first five elements. To leave only the top elements in an RDD, the following will do the job: add an index to each RDD’s element using zipWithIndex, filter out only the elements with the first five indices, and remove the indices using map. The whole snippet looks like this:

val top5clients = amountState.transform(_.sortBy(_._2, false). zipWithIndex.filter(x => x._2 < 5).map(x => x._1))


COMBINING TWO DSTREAMS USING UNION
In order to write both calculated results (the top five clients and numbers of buy and sell orders you calculated previously) only once per batch interval, you first have to combine them in a single DStream. Two DStreams can be combined by key using various join methods or the cogroup method, or they can be merged using union. We'll do the latter.

To merge two DStreams, their elements have to be of the same type. You'll transform the top5clients and numPerType DStreams' elements to tuples whose first elements are keys describing the metric ("BUYS" for the number of buy orders, "SELLS" for the number of sell orders, and "TOP5CLIENTS" for the list of top five clients) and whose second elements are lists of strings. They need to be lists because the top-five-clients metric is a list. You'll convert all values to strings in order to be able to add a list of top stocks (their symbols) later.

Converting numPerType to the new format isn't difficult. If the key is true, the value represents the number of buy orders, and the number of sell orders otherwise:

val buySellList = numPerType.map(t => if(t._1) ("BUYS", List(t._2.toString)) else ("SELLS", List(t._2.toString)) )

To convert the top5clients DStream, you'll first make sure that all five clients are in the same partition by calling repartition(1). Then you remove the amounts and leave just the client IDs (converted to strings, as we said) and call glom to group all the client IDs in the partition in a single array. Finally, you map that array to a tuple with a key equal to the metric name:

val top5clList = top5clients.repartition(1).  // makes sure all the data is in a single partition
  map(x => x._1.toString).                    // leaves only client IDs
  glom().                                     // puts all client IDs into a single array
  map(arr => ("TOP5CLIENTS", arr.toList))     // adds the metric key

Now you can union the two DStreams together:

val finalStream = buySellList.union(top5clList)

You save the combined DStream the same way as previously:

finalStream.repartition(1). saveAsTextFiles("/home/spark/ch06output/output", "txt")

SPECIFYING THE CHECKPOINTING DIRECTORY
You need to do one more thing before starting the streaming context, which is to specify a checkpointing directory:

scala> sc.setCheckpointDir("/home/spark/checkpoint")


As you may recall from chapter 4, checkpointing saves an RDD's data and its complete DAG (an RDD's calculation plan), so that if an executor fails, the RDD doesn't have to be recomputed from scratch. It can be read from disk. This is necessary for DStreams resulting from the updateStateByKey method, because updateStateByKey expands each RDD's DAG in each mini-batch, and that can quickly lead to stack overflow exceptions. By periodically checkpointing RDDs, their calculation plan's dependence on previous mini-batches is broken.

STARTING THE STREAMING CONTEXT AND EXAMINING THE NEW OUTPUT
Finally, you can start the streaming context. The complete code, which you can paste into your Spark shell, is shown in the following listing.

Listing 6.1 Calculating the number of sell/buy orders and finding the top five clients

import org.apache.spark._
import org.apache.spark.streaming._
import java.text.SimpleDateFormat

val ssc = new StreamingContext(sc, Seconds(5))
val filestream = ssc.textFileStream("/home/spark/ch06input")

import java.sql.Timestamp
case class Order(time: java.sql.Timestamp, orderId:Long, clientId:Long,
  symbol:String, amount:Int, price:Double, buy:Boolean)

val orders = filestream.flatMap(line => {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
  val s = line.split(",")
  try {
    assert(s(6) == "B" || s(6) == "S")
    List(Order(new Timestamp(dateFormat.parse(s(0)).getTime()),
      s(1).toLong, s(2).toLong, s(3), s(4).toInt, s(5).toDouble, s(6) == "B"))
  }
  catch {
    case e : Throwable => println("Wrong line format ("+e+"): "+line)
    List()
  }
})

val numPerType = orders.map(o => (o.buy, 1L)).
  reduceByKey((c1, c2) => c1+c2)

val amountPerClient = orders.map(o => (o.clientId, o.amount*o.price))

val amountState = amountPerClient.updateStateByKey((vals,
  totalOpt:Option[Double]) => {
  totalOpt match {
    case Some(total) => Some(vals.sum + total)
    case None => Some(vals.sum)
  }
})

val top5clients = amountState.transform(_.sortBy(_._2, false).map(_._1).
  zipWithIndex.filter(x => x._2 < 5))

val buySellList = numPerType.map(t =>
  if(t._1) ("BUYS", List(t._2.toString))
  else ("SELLS", List(t._2.toString)) )

val top5clList = top5clients.repartition(1).
  map(x => x._1.toString).glom().map(arr => ("TOP5CLIENTS", arr.toList))

val finalStream = buySellList.union(top5clList)

finalStream.repartition(1).
  saveAsTextFiles("/home/spark/ch06output/output", "txt")

sc.setCheckpointDir("/home/spark/checkpoint")

ssc.start()

After you start the streaming context, start the splitAndSend.sh script as you did previously. After a couple of seconds, a part-00000 file in one of your output folders may look like this:

(SELLS,List(4926))
(BUYS,List(5074))
(TOP5CLIENTS,List(34, 69, 92, 36, 64))

You’re making progress, but you still need to find the top-traded securities during the last hour. But before doing that, we have to describe the mapWithState method, which we previously skipped. USING THE MAPWITHSTATE METHOD The mapWithState method is newer than updateStateByKey and contains several per- formance and functional improvements. It’s been available since Spark 1.6. The main difference compared to updateStateByKey is that it lets you maintain a state of one type and return data of another. Let us show you what we mean. mapWithState takes only one argument: an instance of class StateSpec, which is used for building the actual parameters. You can instantiate a StateSpec object by giv- ing it a function with this signature (the first Time argument is optional):

(Time, KeyType, Option[ValueType], State[StateType]) => Option[MappedType]

Similar to updateStateByKey and the function you gave to it, the function you pass to StateSpec (and to mapWithState afterward) will be called for each key's new values (keys are of type KeyType and values of type ValueType) and for each key's existing state (of type StateType). The resulting DStream will have elements of type MappedType, unlike updateStateByKey, whose resulting DStream has elements equal to the maintained state.

The State object that the provided function receives holds the key's state and has several useful methods for manipulating it:
■ exists—Returns true if the state is defined
■ get—Obtains the state value


■ remove—Removes the state for a key
■ update—Updates or sets the new state value for a key

For example, the following function, used with mapWithState, would allow you to obtain the same amountState DStream as you did with updateStateByKey previously:

val updateAmountState = (clientId:Long, amount:Option[Double],
  state:State[Double]) => {
  // Sets the new state value to the new incoming value, if it exists; to zero otherwise
  var total = amount.getOrElse(0.toDouble)
  // Increments the new state's value by the value of the existing state
  if(state.exists())
    total += state.get()
  // Updates the state with the new value
  state.update(total)
  // Returns a tuple with the client ID and the new state value
  Some((clientId, total))
}

You use this function like this:

val amountState = amountPerClient.mapWithState(StateSpec. function(updateAmountState)).stateSnapshots()

Without that last method, stateSnapshots, you'd get a DStream with client IDs and their total amounts, but only for the clients whose orders arrived during the current mini-batch. stateSnapshots gives you a DStream with the whole state (all clients), just like updateStateByKey.

When building a StateSpec object, in addition to specifying the state mapping function, you can specify the number of desired partitions, a Partitioner object to be used, an RDD with initial state values, and a timeout. An RDD with initial state values could be useful in situations when you would like to persist the state and reuse it after you restart the streaming job. In this example, at the end of the day, just after the stock exchange is closed, you could save the list of clients and the amounts they traded, and tomorrow continue where you left off today.

The timeout parameter is also interesting. You can use it to make Spark Streaming remove particular values from the state once the values expire. This can be applied to calculating session timeouts, for example, which has to be done manually when using updateStateByKey. Finally, you can chain all those parameters one after another.

StateSpec.function(updateAmountState).numPartitions(10). timeout(Minutes(30))
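As a fuller illustration, the following sketch (ours, not part of the example program) chains an initial state and a timeout onto the StateSpec; initialAmounts is a hypothetical RDD of client totals saved from a previous run:

// hypothetical RDD[(Long, Double)] of previously saved client totals
val initialAmounts = sc.parallelize(Seq((34L, 250000.0), (69L, 175000.0)))

val spec = StateSpec.function(updateAmountState).
  initialState(initialAmounts).  // continue from the previously persisted amounts
  numPartitions(10).
  timeout(Minutes(30))           // drop a client's state after 30 minutes without new orders

val amountState = amountPerClient.mapWithState(spec).stateSnapshots()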

These were functional improvements, but mapWithState also brings some performance improvements. It can keep 10 times more keys in the maintained state than updateStateByKey can, and it can be 6 times faster² (mostly by avoiding processing when no new keys have arrived).

² These numbers are based on measurements done by people at Databricks: http://mng.bz/42QD.


6.1.8 Using window operations for time-limited calculations
Let's continue with the example. There is one last task left to do: find the top five most-traded securities during the last hour. This is different than the previous task because it's time-limited. In Spark Streaming, this type of calculation is accomplished using window operations.

The main principle is shown in figure 6.3. As you can see, window operations operate on a sliding window of mini-batches. Each windowed DStream is determined by window duration and the slide of the window (how often the window data is recomputed), both multiples of the mini-batch duration.

Figure 6.3 Windowed DStream processing data with slide duration of two mini-batch durations, and with window duration of four mini-batch durations. Results are computed once per slide.

SOLVING THE FINAL TASK WITH WINDOW OPERATIONS
In our example, the window duration is one hour (you need the top five most-traded securities during the last hour). But the slide duration is the same as the mini-batch duration (five seconds), because you want to report the top five most-traded securities in every mini-batch, together with other metrics.

To create a windowed DStream, you can use one of the window methods. For this task, you'll use the reduceByKeyAndWindow method. You need to specify the reduce function and the window duration (you can also specify the slide duration if it's different than the mini-batch duration), and it will create a windowed DStream and reduce it using your reduce function. So, to calculate the amounts traded per stock and per window, you can use this snippet (put this before the finalStream variable initialization):

val stocksPerWindow = orders.
  map(x => (x.symbol, x.amount)).
  reduceByKeyAndWindow((a1:Int, a2:Int) => a1+a2, Minutes(60))

The rest is the same as what you did for top clients:

val topStocks = stocksPerWindow.transform(_.sortBy(_._2, false).map(_._1). zipWithIndex.filter(x => x._2 < 5)).repartition(1). map(x => x._1.toString).glom(). map(arr => ("TOP5STOCKS", arr.toList))

And you need to add this result to the final DStream:

val finalStream = buySellList.union(top5clList).union(topStocks)

The rest is the same as before:

finalStream.repartition(1).
  saveAsTextFiles("/home/spark/ch06output/output", "txt")
sc.setCheckpointDir("/home/spark/checkpoint/")
ssc.start()

Now, when you start your streaming application, the resulting part-00000 files may contain results like these:

(SELLS,List(9969))
(BUYS,List(10031))
(TOP5CLIENTS,List(15, 64, 55, 69, 19))
(TOP5STOCKS,List(AMD, INTC, BP, EGO, NEM))

EXPLORING THE OTHER WINDOW OPERATIONS
The window method isn't the only window operation available. There are a number of others, which can be useful in many situations. Some of them are available for ordinary DStreams and others only for pair DStreams (byKey functions). Table 6.1 lists them all.

Table 6.1 Window operations available in Spark Streaming

■ window(winDur, [slideDur]): Generates RDDs every slideDur with elements that appeared in this DStream during the sliding window of winDur duration. slideDur is equal to the mini-batch duration by default.
■ countByWindow(winDur, slideDur): Every slideDur generates single-element RDDs containing the number of elements that appeared in this DStream during the sliding window of winDur duration.
■ countByValueAndWindow(winDur, slideDur, [numParts]): Counts distinct elements in the window (determined by the winDur and slideDur parameters). numParts enables you to change the default number of partitions used.
■ reduceByWindow(reduceFunc, winDur, slideDur): Every slideDur generates single-element RDDs containing the elements from the window of winDur duration reduced using the reduceFunc function.
■ reduceByWindow(reduceFunc, invReduceFunc, winDur, slideDur): More efficient than reduceByWindow. Every slideDur generates single-element RDDs containing the elements from the window of winDur duration reduced using the reduceFunc function, but subtracts the elements that are leaving the window using the invReduceFunc.
■ groupByKeyAndWindow(winDur, [slideDur], [numParts/partitioner]): Groups elements from the window (determined by the winDur and slideDur parameters; slideDur is optional) by key. You can also specify the number of partitions or a partitioner to be used.
■ reduceByKeyAndWindow(reduceFunc, winDur, [slideDur], [numParts/partitioner]): Reduces elements from the window (determined by the winDur and slideDur parameters; slideDur is optional) by key. You can also specify the number of partitions or a partitioner to be used.
■ reduceByKeyAndWindow(reduceFunc, invReduceFunc, winDur, [slideDur], [numParts], [filterFunc]): A more efficient version that also uses the inverse reduce function to subtract the elements leaving the window. filterFunc is an optional function specifying the condition that key-value pairs need to satisfy to remain in the DStream.

As you’ve probably noticed, instead of using the reduceByKeyAndWindow method in the previous example, you could have also used the window method and then the reduceByKey method.

6.1.9 Examining the other built-in input streams
Before continuing, let's see what other options you have for receiving data using built-in input streams. Spark Streaming has several more methods for creating DStreams, in addition to the textFileStream method you used previously. We'll briefly explain how to use them.

FILE INPUT STREAMS
To read data from files, there are also the binaryRecordsStream method and the more general fileStream method. They both monitor a folder for newly created files, just like textFileStream, but they can read files of different types. binaryRecordsStream reads binary files in records of a specified size (you pass to it a folder name and the records' size) and returns a DStream containing arrays of bytes (Array[Byte]). Using fileStream is more involved and requires you to parameterize it with classes for key type, value type, and input format (a subclass of Hadoop's NewInputFormat) for reading HDFS files. The resulting DStream will contain tuples with two elements of the specified key and value types.

In addition to these classes, you need to provide the path to the folder to be monitored, and you can specify several optional arguments:
■ filter function that for each Path (a Hadoop class representing a file) object determines whether to process it (returns a Boolean)
■ newFilesOnly flag determining whether to process only newly created files in the monitored folder or all files
■ Hadoop Configuration object containing additional configuration options for reading HDFS files

Details about reading files with the Hadoop API are beyond the scope of this book, but good information can be found in Hadoop in Practice by Alex Holmes (Manning, 2014).

SOCKET INPUT STREAMS
You can use Spark Streaming to receive data directly from a TCP/IP socket. You can use the socketStream and socketTextStream methods for this. As you'd expect, socketTextStream returns a DStream whose elements are UTF8-encoded lines, delimited by newline characters. It needs a hostname and a port number to connect to and an optional StorageLevel (the default is StorageLevel.MEMORY_AND_DISK_SER_2, which means memory and disk with replication factor of 2). StorageLevel determines where the data will be kept and whether it will be replicated.

The socketStream method also needs a function for converting Java InputStream objects (for reading binary data) into target objects that will be elements of the resulting DStream. When you start a socket stream, its receiver runs in an executor on one of the worker nodes.

6.2 Using external data sources
We've shown you how to use the built-in data sources: files and sockets. Now it's time to connect to external data sources, which don't come bundled with Spark. Official Spark connectors exist for the following external systems and protocols:
■ Kafka (https://kafka.apache.org)—A distributed, fast, scalable publish-subscribe messaging system. It persists all messages and is capable of acting as a replay queue.
■ Flume (https://flume.apache.org)—A distributed, reliable system for collecting, aggregating, and transferring large amounts of log data.
■ Amazon Kinesis (https://aws.amazon.com/en/kinesis)—An AWS streaming platform similar to Kafka.
■ Twitter (https://dev.twitter.com/overview/documentation)—The popular social network's API service.
■ ZeroMQ (http://zeromq.org)—A distributed messaging system.
■ MQTT (http://mqtt.org)—A lightweight publish-subscribe messaging protocol.


NOTE The source code of all of these data sources, except Kafka and Amazon Kinesis, was moved out of the Spark project into the Spark packages project at https://github.com/spark-packages.

These are interesting systems that have much to offer, but covering them all in this book isn't possible. So we'll concentrate on (arguably) the most popular among them: Apache Kafka. If you don't know much about Kafka, we invite you to first read the official introduction at http://kafka.apache.org/documentation.html#introduction.

In this section, instead of reading sell and buy orders from files, as you did previously, you'll use a shell script we wrote to send the orders to a Kafka topic. The Spark Streaming job will read orders from this topic and write the computed metrics to a different Kafka topic. You'll then use the Kafka console consumer script to receive and display the results.

6.2.1 Setting up Kafka
Examples in this section use Kafka, which is already installed in the spark-in-action VM. If you aren't using our VM, and you wish to follow the examples in this section, you'll need to install and configure Kafka.

To set up Kafka, you first need to download it (from the official download page: http://kafka.apache.org/downloads.html). You should choose the release compatible with your version of Spark³ (for example, the kafka_2.10-0.8.2.1.tgz archive for Spark 2.0). Next, unpack the archive to a folder on your hard drive:

$ tar -xvzf kafka_2.10-0.8.2.1.tgz

In the VM, Kafka has been installed for you in the folder /usr/local/kafka. Kafka requires Apache ZooKeeper, an open source server for reliable, distributed process coordination (https://zookeeper.apache.org/), so you should start it before starting Kafka:

$ cd /usr/local/kafka
$ bin/zookeeper-server-start.sh config/zookeeper.properties &

This will start the ZooKeeper process on port 2181 and leave ZooKeeper working in the background. Next, you can start the Kafka server:

$ bin/kafka-server-start.sh config/server.properties &

Finally, you need to create topics that will be used for sending orders and metrics data:

$ bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic orders
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic metrics

³ You can check which version of Kafka to use for your version of Spark at http://mng.bz/Df88.
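As an optional sanity check (our suggestion, not part of the required setup), you can use Kafka's console scripts to confirm that the topics accept and deliver messages; the invocations below match the Kafka 0.8.x defaults used here:

$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic orders
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic orders --from-beginning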


Now you’re ready to update the Spark Streaming program from the previous section so that it reads and writes data to Kafka, instead of a filesystem.

6.2.2 Changing the streaming application to use Kafka
If you still have your Spark shell open, you'll need to restart it at this point and add the Kafka library and the Spark Kafka connector library to its classpath. You can download the required JAR files manually, or you can use the packages parameter⁴ and have Spark download the files for you (don't execute this yet, though; you'll do that later):

$ spark-shell --master local[4] \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0,org.apache.kafka:kafka_2.11:0.8.2.1

The names between the colons in the packages parameter are the group ID, the artifact ID, and the version of the Spark Kafka connector artifact from the central Maven repository. The version depends on the version of Spark you're using. If you were building the application as a Maven project, you'd accomplish the same thing by adding the following dependencies to your pom.xml file:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka_2.11</artifactId>
  <version>0.8.2.1</version>
</dependency>

USING THE SPARK KAFKA CONNECTOR
There are two versions of the Kafka connector. The first one is a receiver-based connector, and the second one is the newer direct connector. When using the receiver-based connector, in some circumstances, the same message may be consumed multiple times; the direct connector makes it possible to achieve exactly-once processing of incoming messages. The receiver-based connector is also less efficient (it requires a write-ahead log to be set up, which slows down the computation). You'll use the direct connector in this section.

To create a DStream that reads data from a Kafka topic, you need to set up a parameter map containing, at minimum, the metadata.broker.list parameter, which points to addresses of Kafka brokers in your cluster. Instead of a metadata.broker.list parameter, you can also specify the bootstrap.servers parameter with the same value. If you're running a single Kafka server on the same machine as your Spark shell, you just set it to the VM's address (192.168.10.2) and the default port, which is 9092:

val kafkaReceiverParams = Map[String, String]( "metadata.broker.list" -> "192.168.10.2:9092")

⁴ In some situations, the download may fail repeatedly. In that case, try to clear the Ivy cache: delete everything under .ivy/cache in your home directory.


Then you pass the parameter map to the KafkaUtils.createDirectStream method, along with a reference to the streaming context and a set of topic names to which you wish to connect. The createDirectStream method needs to be parameterized with classes to be used for message keys and values, and key and value decoders. In the case of this example, keys and values are strings, and you use Kafka’s StringDecoder classes to decode them:

import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaStream = KafkaUtils.
  createDirectStream[String, String, StringDecoder, StringDecoder](ssc,
    kafkaReceiverParams, Set("orders"))
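Each element of the resulting kafkaStream is a key-message tuple (more on this in a moment), so if you just need the message text, a quick sketch would be:

val lines = kafkaStream.map(_._2)   // keep only the message part of each (key, message) tuple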

The receiver-based consumer saves the last consumed message offset in ZooKeeper. The direct consumer doesn't use ZooKeeper but stores offsets in the Spark checkpoint directory. The auto.offset.reset parameter, which you can place in the parameter map, determines which messages to consume if the offset of the last consumed message isn't available. If it's set to smallest, it will start consuming from the smallest offset. By default, it consumes the latest messages.

The created kafkaStream can now be used the same way you used filestream previously. The only difference is that filestream had elements that were strings, whereas kafkaStream contains tuples with two strings: a key and a message. We'll skip that part of the code for now and first show you how to write messages back to Kafka. The complete code listing will be given at the end of this section.

WRITING MESSAGES TO KAFKA
Previously, you wrote the calculated metrics from the finalStream DStream to files. We'll now change that to write to Kafka's metrics topic instead. This is accomplished with the DStream's useful foreachRDD method, which we didn't mention earlier. You can use it to perform an arbitrary action on each RDD in a DStream. It has two versions:

def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit

Both versions take only a function as an argument, which receives an RDD and returns Unit (which is equal to void in Java). The difference is that the second function also receives a Time object, so it can make decisions based on the moment the RDD data was received. To write messages to Kafka, you use Kafka’s Producer object, which connects to a Kafka broker and sends messages represented as KeyedMessage objects. A Producer needs to be configured using a ProducerConfig object. Producer objects aren’t serializable: they open a connection to Kafka, and you can’t serialize a connection, deserialize it in another JVM, and continue to use it. So you need to create Producer objects in the code that runs in executors. A first naïve attempt at sending messages to Kafka might look like this:

import kafka.producer.Producer
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig

finalStream.foreachRDD((rdd) => {
  val prop = new java.util.Properties
  prop.put("metadata.broker.list", "192.168.10.2:9092")
  rdd.foreach(x => {
    val p = new Producer[Array[Byte], Array[Byte]](
      new ProducerConfig(prop))
    p.send(new KeyedMessage("metrics", x.toString.toCharArray.map(_.toByte)))
    p.close()
  })
})

The function passed to rdd.foreach is executed in executors, and the rest is executed in the driver. But this code creates a new Producer for each message! That's not good. You can optimize this by using foreachPartition and creating a single Producer per RDD partition:

finalStream.foreachRDD((rdd) => {
  val prop = new java.util.Properties
  prop.put("metadata.broker.list", "192.168.10.2:9092")
  rdd.foreachPartition((iter) => {
    val p = new Producer[Array[Byte], Array[Byte]](
      new ProducerConfig(prop))
    iter.foreach(x => p.send(new KeyedMessage("metrics",
      x.toString.toCharArray.map(_.toByte))))
    p.close()
  })
})

That’s better but still not ideal. The best way would be to create a singleton object that initializes a Producer object only once per JVM. You’ll create a singleton object as a companion object of the KafkaProducerWrapper class (we hope you remember com- panion objects from chapter 4). This is the code: import kafka.producer.Producer import kafka.producer.KeyedMessage import kafka.producer.ProducerConfig case class KafkaProducerWrapper(brokerList: String) { val producerProps = { val prop = new java.util.Properties prop.put("metadata.broker.list", brokerList) prop } Method for sending val p = new Producer[Array[Byte], Array[Byte]](new messages ProducerConfig(producerProps)) def send(topic: String, key: String, value: String) { Producer p.send(new KeyedMessage(topic, key.toCharArray.map(_.toByte), object value.toCharArray.map(_.toByte))) } }

170 CHAPTER 6 Ingesting data with Spark Streaming

object KafkaProducerWrapper { Companion object var brokerList = "" lazy val instance = new KafkaProducerWrapper(brokerList) Lazily instantiated } instance

As you may recall from chapter 4, companion objects have to be declared in the same file as the class of the same name. To avoid serialization and instantiation problems in the Spark shell, you compile the KafkaProducerWrapper class into a JAR file, which should already be downloaded in the VM in the first-edition/ch06/ directory. Download the JAR file and start your Spark shell, adding the JAR to your Spark's classpath using the --jars parameter. Including the --packages parameter we mentioned before, the complete command for starting the Spark shell now looks like this:

$ spark-shell --master local[4] \
  --packages org.apache.spark:spark-streaming-kafka_2.11:2.0.0,org.apache.kafka:kafka_2.11:0.8.2.1 \
  --jars first-edition/ch06/kafkaProducerWrapper.jar

Running the Python version

To run the Python version of the program, you'll only need the first package:

$ pyspark --master local[4] --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1

But you’ll also need to install the kafka-python Python package (https:// github.com/dpkp/kafka-python), which isn’t installed in the spark-in-action VM by de- fault. And you need to install pip:

$ sudo apt-get install python-pip
$ sudo pip install kafka-python

The complete Scala program (the Python version is in the repository), which you can paste into your Spark shell and which calculates all the metrics from this chapter and sends them to Kafka, is given in the following listing. In it, we combined all the code you’ve seen so far. There isn’t much left to do but to execute it.

Listing 6.2 Complete code for calculating metrics and sending the results to Kafka

import org.apache.spark._
import kafka.serializer.StringDecoder
import kafka.producer.Producer
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

val ssc = new StreamingContext(sc, Seconds(5))

val kafkaReceiverParams = Map[String, String]( "metadata.broker.list" -> "192.168.10.2:9092")

val kafkaStream = KafkaUtils.
  createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaReceiverParams, Set("orders"))

import java.sql.Timestamp
case class Order(time: java.sql.Timestamp, orderId:Long, clientId:Long,
  symbol:String, amount:Int, price:Double, buy:Boolean)

import java.text.SimpleDateFormat
val orders = kafkaStream.flatMap(line => {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
  val s = line._2.split(",")
  try {
    assert(s(6) == "B" || s(6) == "S")
    List(Order(new Timestamp(dateFormat.parse(s(0)).getTime()),
      s(1).toLong, s(2).toLong, s(3), s(4).toInt, s(5).toDouble, s(6) == "B"))
  } catch {
    case e : Throwable => println("Wrong line format ("+e+"): "+line._2)
    List()
  }
})

val numPerType = orders.map(o => (o.buy, 1L)).reduceByKey((c1, c2) => c1+c2)
val buySellList = numPerType.map(t =>
  if(t._1) ("BUYS", List(t._2.toString))
  else ("SELLS", List(t._2.toString)))

val amountPerClient = orders.map(o => (o.clientId, o.amount*o.price))
val amountState = amountPerClient.updateStateByKey((vals, totalOpt:Option[Double]) => {
  totalOpt match {
    case Some(total) => Some(vals.sum + total)
    case None => Some(vals.sum)
  }
})
val top5clients = amountState.transform(_.sortBy(_._2, false).map(_._1).
  zipWithIndex.filter(x => x._2 < 5))
val top5clList = top5clients.repartition(1).map(x => x._1.toString).
  glom().map(arr => ("TOP5CLIENTS", arr.toList))

val stocksPerWindow = orders.map(x => (x.symbol, x.amount)).
  reduceByKeyAndWindow((a1:Int, a2:Int) => a1+a2, Minutes(60))
val topStocks = stocksPerWindow.transform(_.sortBy(_._2, false).map(_._1).
  zipWithIndex.filter(x => x._2 < 5)).repartition(1).
  map(x => x._1.toString).glom().
  map(arr => ("TOP5STOCKS", arr.toList))

val finalStream = buySellList.union(top5clList).union(topStocks)

import org.sia.KafkaProducerWrapper
finalStream.foreachRDD((rdd) => {
  rdd.foreachPartition((iter) => {
    KafkaProducerWrapper.brokerList = "192.168.10.2:9092"
    val producer = KafkaProducerWrapper.instance
    iter.foreach({ case (metric, list) =>
      producer.send("metrics", metric, list.toString) })
  })
})

sc.setCheckpointDir("/home/spark/checkpoint")
ssc.start()


RUNNING THE EXAMPLE
To see the program in action, you need to open two more Linux shells. We prepared a script that streams lines from the orders.txt file (a line every 0.1 seconds) and sends them to the orders Kafka topic. You can find the streamOrders.sh script in our online repository (which you should have cloned by now) and start it in the first Linux shell. You may first need to set its execution flag:

$ chmod +x streamOrders.sh

The script expects the orders.txt file to be present in the same directory and also needs to have the Kafka bin directory in the PATH (it will invoke the kafka-console-producer.sh script). You can give it the broker list as an argument (the default is 192.168.10.2:9092):

$ ./streamOrders.sh 192.168.10.2:9092

In the second Linux shell, start the kafka-console-consumer.sh script, and have it consume messages from the metrics topic to see the output from your streaming program:

$ kafka-console-consumer.sh --zookeeper localhost:2181 --topic metrics
TOP5CLIENTS, List(62, 2, 92, 25, 19)
SELLS, List(12)
BUYS, List(20)
TOP5STOCKS, List(CHK, DOW, FB, SRPT, ABX)
TOP5CLIENTS, List(2, 62, 87, 52, 45)
TOP5STOCKS, List(FB, CTRE, AU, PHG, EGO)
SELLS, List(28)
BUYS, List(21)
SELLS, List(37)
BUYS, List(12)
TOP5STOCKS, List(FB, CTRE, SDLP, AU, NEM)
TOP5CLIENTS, List(14, 2, 81, 43, 31)

And there you have it. Your Spark Streaming program is sending its calculated metrics back through Kafka.

6.3 Performance of Spark Streaming jobs

You generally want your streaming applications to
■ Process each input record as fast as possible (low latency)
■ Keep up with increases in the flow of incoming data (scalability)
■ Keep ingesting data and not lose any data in case of a node failure (fault tolerance)
There are a few parameters worth mentioning when it comes to tuning performance for Spark Streaming jobs and ensuring that they're fault-tolerant.


6.3.1 Obtaining good performance

With Spark Streaming, the first parameter you need to decide on is the mini-batch duration. There is no exact method of determining its value, because it depends on the type of processing the job is performing and the capacity of your cluster. What can help you with that is the Streaming page of the Spark web UI, which automatically starts for each Spark application. You can access the Spark web UI on port 4040, by default.

The Streaming tab automatically appears on the web UI if you're running a Spark Streaming application (StreamingContext). It shows several useful graphs, shown in figure 6.4, with the following metrics:
■ Input Rate—Number of incoming records per second
■ Scheduling Delay—Time the new mini-batches spend waiting for their jobs to be scheduled
■ Processing Time—Time required to process the jobs of each mini-batch
■ Total Delay—Total time taken to handle a batch

The total processing time per mini-batch (the total delay) should be lower than the mini-batch duration, and it should be more or less constant. If it keeps increasing, the computation isn't sustainable in the long run, and you'll have to decrease the processing time, increase parallelism, or limit the input rate.

Figure 6.4 Streaming page of the Spark web UI, showing various metrics: Input Rate, Scheduling Delay, Processing Time, and Total Delay


LOWERING THE PROCESSING TIME
If you start seeing scheduling delays, the first step is to try to optimize your program and decrease the processing time per batch. Methods in this book will help with that. You need to avoid unnecessary shuffles, as we discussed in chapter 4. If you're sending data to external systems, you need to reuse connections within partitions and use some kind of connection pooling, as we discussed in this chapter.

You can also try to increase the mini-batch duration, because Spark Streaming job-scheduling, task-serialization, and data-shuffling times will even out on larger datasets and thus decrease the processing time per record. But setting the mini-batch duration too high will increase memory requirements for each mini-batch; in addition, lower output frequency may not satisfy your business requirements.

Additional cluster resources can also help to decrease the processing time. Adding memory may lower the garbage-collection rate, for example, and adding CPU cores may increase processing speed.

INCREASING PARALLELISM
But in order to use all CPU cores efficiently and get more throughput, you need to increase parallelism. You can increase it on several levels. First, you can do so at the input source. For example, Kafka has the concept of partitions, which determine the level of parallelism that can be achieved by the consumer. If you're using the Kafka direct connector, it will handle parallelism automatically and match the number of consuming threads in Spark Streaming to the number of partitions in Kafka.

But if you're using a receiver-based connector, you can increase consumer parallelism by creating several DStreams and using them together:

val stream1 = ...
val stream2 = ...
val stream = stream1.union(stream2)

At the next level, parallelism can be increased by explicitly repartitioning DStreams to a higher number of partitions (by using the repartition method). A general rule of thumb is that the number of receivers shouldn't exceed the number of cores available or the number of executors.

LIMITING THE INPUT RATE
Finally, if you can't decrease the processing time or increase parallelism, and you still experience increasing scheduling delays, you may need to limit the rate at which data is ingested. There are two parameters for manually limiting the ingestion rate: spark.streaming.receiver.maxRate for receiver-based consumers and spark.streaming.kafka.maxRatePerPartition for the Kafka direct consumer. The former limits the number of records per receiver-based stream, and the latter the number of records per Kafka partition (more than one Kafka partition can be read by a single direct Kafka stream). Both represent the number of records per second and aren't set by default.
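For example, you could set these limits through the SparkConf used to create the streaming context. This is only a minimal sketch: the application name and the numeric values are arbitrary examples, not recommendations.

val conf = new org.apache.spark.SparkConf().
  setAppName("RateLimitedApp").
  set("spark.streaming.receiver.maxRate", "10000").          // max records/sec per receiver stream
  set("spark.streaming.kafka.maxRatePerPartition", "2000")   // max records/sec per Kafka partition
// The same parameters can also be passed on the command line to
// spark-submit or spark-shell using --conf key=value pairs.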


You can also enable the backpressure feature by setting the spark.streaming.backpressure.enabled parameter to true. It will automatically limit the maximum number of messages that can be received by your application if scheduling delays start to appear. But the rate will never exceed the values of the previous two parameters, if they're set.

6.3.2 Achieving fault-tolerance

Streaming applications are usually long-running applications, and failures of driver and executor processes are to be expected. Spark Streaming makes it possible to survive these failures with zero data loss.

RECOVERING FROM EXECUTOR FAILURES
Data received through receivers running in executors is replicated in the cluster. If an executor running a receiver fails, the driver will restart the executor on a different node, and the data will be recovered. You don't need to enable this behavior specifically. Spark does this automatically.

RECOVERING FROM DRIVER FAILURES
In the case of a failure of the driver process, the connection to the executors is lost, and the application needs to be restarted. As we'll discuss in chapters 11 and 12, the cluster manager can automatically restart the driver process (if submitted with the --supervise option when running in a Spark Standalone cluster, by using cluster mode in YARN, or by using Marathon in Mesos).

Once the driver process is restarted, Spark Streaming can recover the state of the previous streaming application by reading the checkpointed state of the streaming context. Spark's StreamingContext has a special method of initialization to take advantage of this feature. The method is StreamingContext.getOrCreate(), and it takes a checkpoint directory and a function for initializing the context. The function needs to perform all the usual steps to instantiate your streams and initialize a StreamingContext object. The getOrCreate method first checks the checkpoint directory to see if any state exists there. If so, it loads the previous checkpointed state and skips the StreamingContext initialization. Otherwise, it calls the initialization function.

To use this in the example from this chapter, you need to rearrange the code a bit:

def setupStreamContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(5))
  val kafkaReceiverParams = Map[String, String](
    "metadata.broker.list" -> "192.168.10.2:9092")
  //...
  //perform all other DStream computations
  //...
  ssc.checkpoint("checkpoint_dir")
  ssc
}
val ssc = StreamingContext.getOrCreate("checkpoint_dir", setupStreamContext)
ssc.start()


ssc.checkpoint tells the streaming context to periodically save the state of the streams to the specified directory. And that's it. In case of a driver restart, your application will be able to continue from where it left off.

One more thing is required to prevent loss of data in case of a driver restart. Spark Streaming receivers can write all the data they receive to a write-ahead log before the data is processed. They acknowledge to the input source (if the input source allows messages to be acknowledged) that a message was received only after it was written to the write-ahead log. If a receiver (and its executor) is restarted, it reads all the unprocessed data from the write-ahead logs, so no data loss occurs. The Kafka direct connector doesn't need write-ahead logs to prevent data loss, because Kafka provides that functionality. Write-ahead logs aren't enabled by default. You need to explicitly enable them by setting spark.streaming.receiver.writeAheadLog.enable to true.

6.4 Structured Streaming

In Spark 2.0, an experimental new streaming API called structured streaming was introduced. The idea behind it is to make the streaming API similar to the batch API by hiding the details involved in making streaming operations fault-tolerant and consistent.

Structured streaming operations work directly on DataFrames (or rather Datasets). There is no longer a concept of a "stream." There are only streaming and ordinary DataFrames. Streaming DataFrames are implemented as append-only tables. Queries on streaming data return new DataFrames, and you work with them much as you would in a batch program.

6.4.1 Creating a streaming DataFrame

To create a streaming DataFrame, instead of calling read on a SparkSession, you call readStream. It will return a DataStreamReader, with almost the same methods as DataFrameReader. The crucial difference is that it works on data that is continuously arriving.

As an example, let's load the files from the ch06input folder, as you did in the beginning of this chapter, but using the structured streaming API. You will first need the DataFrame implicits:

import spark.implicits._

Then use DataStreamReader’s text method to load the files:

scala> val structStream = spark.readStream.text("ch06input")
structStream: org.apache.spark.sql.DataFrame = [value: string]

As you can see, the resulting object is a DataFrame with a single column called “value.” You can check if it’s a streaming DataFrame by calling its isStreaming method:

scala> structStream.isStreaming
res0: Boolean = true


You can also examine the execution plan:

scala> structStream.explain()
== Physical Plan ==
StreamingRelation FileSource[ch06input], [value#263]

6.4.2 Outputting streaming data

The structStream will monitor the input folder for new files and periodically process them. You still haven't told it how to process them, though. To do anything useful with a streaming DataFrame, and to start the streaming computation, you have to use the DataFrame's writeStream method, which gives you an instance of the DataStreamWriter class. It can also be configured using the builder pattern, which means you can chain configuration functions one after another. Some of the available functions are:
■ trigger—Use it to specify the interval at which execution is triggered. You need to use the ProcessingTime.create function to construct interval descriptions. For example: ProcessingTime.create("5 seconds").
■ format—Specify the output format. Only "parquet", "console", and "memory" are supported in Spark 2.0. The first one writes to Parquet files, the second one prints the DataFrames to the console (using show()), and the third one keeps the data as a table in the driver's memory, which you can query interactively.
■ outputMode—Specify the output mode.
■ option—Specify other specific parameters.
■ foreach—Use it to perform computations on individual DataFrames. You have to specify a class implementing the ForeachWriter interface.
■ queryName—Used with the "memory" format.
Only two output modes are currently (Spark 2.0) supported:
■ append—Output only the data that was received after the last output.
■ complete—Output all available data every time. Can only be used with aggregations.
After you specify all the configuration options you want, call start() to start the streaming computation. To print the first 20 rows of each file arriving into the ch06input folder every five seconds, run this:

scala> import org.apache.spark.sql.streaming.ProcessingTime
scala> val streamHandle = structStream.
  writeStream.
  format("console").
  trigger(ProcessingTime.create("5 seconds")).
  start()

To see the results in your console, start the splitAndSend.sh script in a separate Linux console, as you did at the beginning of this chapter:

$ cd first-edition/ch06
$ ./splitAndSend.sh /home/spark/ch06input local


You should now see the contents of the files in your Spark shell.
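The "memory" format mentioned earlier works in a similar way. As a rough sketch (the query name input_lines is our own choice, not something required by the API), you could keep the incoming rows in an in-memory table and query it with SQL:

scala> val memHandle = structStream.
  writeStream.
  format("memory").
  queryName("input_lines").
  start()
scala> spark.sql("select count(*) from input_lines").show()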

6.4.3 Examining streaming executions

start() returns a StreamingQuery object, which acts as a handle to the streaming execution. You can use it to check the execution's status with the isActive method, stop the execution with the stop() method, block until the execution is over with the awaitTermination() method, examine an exception (in case one occurred) with the exception() method, or get the execution's ID from the id field.

SparkSession also provides the means of querying streaming executions. SparkSession.streams.active returns an array of active streaming executions, SparkSession.streams.get(id) enables you to get a streaming handle using its ID, and SparkSession.streams.awaitAnyTermination blocks until any of the streaming executions finishes.
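For instance (continuing the earlier example, where streamHandle is the StreamingQuery returned by start()), you might inspect and control the execution like this:

scala> streamHandle.isActive                // true while the execution is running
scala> spark.streams.active                 // all active streaming executions
scala> spark.streams.get(streamHandle.id)   // look up an execution by its ID
scala> streamHandle.stop()                  // stop the streaming execution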

6.4.4 Future direction of structured streaming

While still an experimental feature in Spark 2.0, structured streaming is proving to be a powerful concept. It enables true unification of batch and streaming computations and joining of streaming with batch data, a feature not many streaming engines can boast. On top of that, it brings Tungsten performance improvements to Spark streaming.

The Spark community has big plans for structured streaming. Community members want to extend it to all other Spark components: to train machine-learning algorithms on streaming data and to perform ETL transformations using streaming concepts, which could lower capacity requirements. What other improvements await us remains to be seen.

This was a short overview of structured streaming. You can find more information about structured streaming as it develops in the official Structured Streaming Programming Guide (http://mng.bz/bxF9) and in its original design document (http://mng.bz/0ipm).

6.5 Summary

■ Spark Streaming applies Spark's batch-processing features to real-time data by using mini-batches.
■ Spark can ingest data from filesystems and TCP/IP socket connections, but also from other distributed systems, such as Kafka, Flume, Twitter, and Amazon Kinesis.
■ The mini-batch duration is set when initializing a Spark Streaming context. It determines the interval at which input data will be split and packaged as an RDD. It has a big influence on your system's performance.
■ DStream (which stands for discretized stream) is the basic abstraction in Spark Streaming, representing a sequence of RDDs, periodically created from the input stream. You can't write Spark Streaming programs without them.


■ DStreams have methods that transform them to other DStreams. You can use those methods to filter, map, and reduce data in DStream's RDDs and even combine and join different DStreams.
■ DStream's saveAsTextFiles method takes a String prefix and an optional String suffix and uses them to construct the path at which the data should be periodically saved. You can use it to save the results of your computations to the filesystem.
■ You can use the updateStateByKey and mapWithState methods to perform calculations while taking into account the previous state of the computation.
■ The mapWithState method lets you maintain a state of one type and return data of another. It also brings performance improvements.
■ Checkpointing is necessary when saving the streaming state, because otherwise the RDD lineage will get too long; this will eventually result in stack-overflow errors.
■ Window operations operate on a sliding window of mini-batches, determined by the window duration and the slide of the window. You can use them to maintain a time-bounded state and perform calculations on the data contained in them.
■ Before using external sources, their Maven packages need to be added to the Spark classpath.
■ Spark has two Kafka connectors: a receiver-based connector and a direct connector. The direct connector enables exactly-once processing.
■ The best way to write a message to Kafka is to create a singleton object that initializes a Producer object only once per JVM. That way, a single connection per executor can be reused throughout the lifetime of your streaming application.
■ The Streaming page of the Spark web UI contains useful graphs that can help you determine the ideal mini-batch duration.
■ Scheduling delays in streaming applications can be reduced by decreasing processing time, increasing parallelism, adding more resources, or limiting the input rate.
■ Jobs can be made fault-tolerant by enabling automatic restart of the driver process, initializing the StreamingContext with StreamingContext.getOrCreate, and enabling write-ahead logs for receiver-based connectors.

Getting smart with MLlib

This chapter covers
■ Machine-learning basics
■ Performing linear algebra in Spark
■ Scaling and normalizing features
■ Training and applying a linear regression model
■ Evaluating the model's performance
■ Using regularization
■ Optimizing linear regression

Machine learning is a scientific discipline that studies the use and development of algorithms that make computers accomplish complicated tasks without explicitly programming them. That is, the algorithms eventually learn how they can solve a given task. These algorithms include methods and techniques from statistics, probability, and information theory.

Today, machine learning is ubiquitous. Examples include online stores that offer you similar items that other users have viewed or bought, email clients that automatically move emails to spam, advances in autonomous driving recently developed by several car manufacturers, and speech and video recognition. It's also becoming a big part of online business: finding hidden relationships in user habits and actions (and learning from them) can bring critical added value to existing products and services.

But with the advent of companies handling huge amounts of data (known as big data), more scalable machine-learning packages are needed. Spark provides distributed and scalable implementations of various machine-learning algorithms and makes it possible to handle those continuously growing datasets.1

1 Spark isn't the only framework that provides a distributed machine-learning package. There are other frameworks, such as GraphLab, Flink, and TensorFlow.

Spark offers distributed implementations of the most important and most often-used machine-learning algorithms, and new implementations are constantly being added. Spark's distributed nature helps you apply machine-learning algorithms on very large datasets with adequate speed. Spark, as a unifying platform, lets you perform most of the machine-learning tasks (such as data collection, preparation, analysis, model training, and evaluation) all in the same system and using the same API.

In this chapter, you'll use linear regression to predict Boston housing prices. Regression analysis is a statistical process of modeling relationships between variables, and linear regression, as a special type of regression analysis, assumes those relationships to be linear. It's historically one of the most widely used and simplest regression methods in statistics. While using linear regression to predict housing prices, you'll learn about linear regression itself, how to prepare the data, train the model, use the model to make predictions, and evaluate the model's performance and optimize it. We'll begin with a short introduction to machine learning and a primer on using linear regression in Spark.

First, a disclaimer. Machine learning is such a vast subject that it's impossible to fully cover it here. To learn more about machine learning in general, check out Real-World Machine Learning, by Henrik Brink and Joseph W. Richards (Manning, 2016 [est.]), and Machine Learning in Action, by Peter Harrington (Manning, 2012). A sea of other resources can be found online; Stanford's "Machine Learning" course by Andrew Ng (http://mng.bz/K6XZ) is an excellent starting point.

7.1 Introduction to machine learning

Let's start with an example of using machine learning in real life. Let's say you're running a website that lets people sell their cars online. And let's say you'd like your system to automatically propose to your sellers reasonable starting prices when they post their ads. You know that regression analysis can be used for that purpose by taking data of previous sales, analyzing characteristics of the cars and their selling prices, and modeling the relation between them. But you don't have enough ads in your database, so you decide to get car prices from publicly available sources.

You find a lot of interesting car sale records online, but most of the data is available in CSV files, and large parts of it are PDF and Word documents (containing car sale offers). You first parse PDFs and Word documents to identify and match similar fields (manufacturer, model, make, and so on). You know that a regression-analysis model can't handle string values of various fields ("automatic" and "manual," for example), so you come up with a way to convert these values to numeric ones. Then you notice that important fields are missing from some of the records (year manufactured, for example) and you decide to remove those records from your dataset.

When you finally have the data cleaned up and stored somewhere, you start examining various fields—how they're correlated and what their distributions look like (this is important for understanding the hidden dependencies of the data). Then you decide which regression-analysis model to use.

Let's say you choose linear regression, because, based on the correlations you calculated, you assume the main relations to be linear. Before building a model, you normalize and scale the data (more on how and why this is done soon) and split it into training and validation datasets. You finally train your model using the training data (you use the historical data to set weights of the model to predict the future data where the price isn't known; we'll explain this later), and you get a usable linear-regression model.

But when you test it on your validation dataset, the results are horrible. You change some of the parameters used for training the model, test it again, and repeat the process until you get a model with acceptable performance. Finally, you incorporate the model in your web application and start getting emails from your clients wondering how you're doing that (or from clients complaining about bad predictions).

What this example illustrates is that a machine-learning project consists of multiple steps. Although typical steps are shown in figure 7.1, the entire process can usually be broken down into the following:

1 Collecting data—First the data needs to be gathered from various sources. The sources can be log files, database records, signals coming from sensors, and so on. Spark can help load the data from relational databases, CSV files, remote services, and distributed file systems like HDFS, or from real-time sources using Spark Streaming.
2 Cleaning and preparing data—Data isn't always available in a structured format appropriate for machine learning (text, images, sounds, binary data, and so forth), so you need to devise and carry out a method of transforming this unstructured data into numerical features. Additionally, you need to handle missing data and the different forms in which the same values can be entered (for example, VW and Volkswagen are the same carmaker). Often, data also needs to be scaled so that all dimensions are of comparable ranges.
3 Analyzing data and extracting features—Next you analyze the data, examine its correlations, and visualize them (using various tools) if necessary. (The number of dimensions may be reduced in this step if some of them don't bring any extra information: for example, if they're redundant.) You then choose the appropriate machine-learning algorithm (or set of algorithms) and split the data into training and validation subsets—this is important because you'd like to see how the model behaves on the data not seen during the training phase. Or you decide on a different cross-validation method, where you continuously split the dataset into different training and validation datasets and average the results over the rounds.
4 Training the model—You train a model by running an algorithm that learns a set of algorithm-specific parameters from the input data.
5 Evaluating the model—You then put the model to use on the validation dataset and evaluate its performance according to some criteria. At this point, you may decide that you need more input data or that you need to change the way features were extracted. You may also change the feature space or switch to a different model. In any of these cases, you go back to step 1 or step 2.
6 Using the model—Finally, you deploy the built model to the production environment of your website.


Figure 7.1 Typical steps in a machine-learning project


The mechanics of using an API (Spark or some other machine-learning library) to train and test the models is only the last and the shortest part of the process. Equally important are collection, preparation, and analysis of data, where knowledge about the problem domain is needed. Therefore, this and the following chapter on machine learning are mostly about steps 4 and 5, as described previously.

7.1.1 Definition of machine learning

Machine learning is one of the largest research areas in artificial intelligence, a scientific field that studies algorithms for simulating intelligence. Ron Kohavi and Foster Provost, in their article "Glossary of Terms," describe machine learning in these words:

Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from and make predictions on data.2

This is in contrast to traditional programming methods where what an algorithm needs to do (like parsing an XML file with a certain structure) is explicitly programmed into it. Such traditional methods can't be easily expanded to cover similar tasks, like parsing XML files with a similar structure. As another example, making a speech-recognition program that recognizes different accents and voices would be impossible by explicitly programming it, because the sheer number of variations in the way a single word can be pronounced would necessitate that many versions of the program.

Instead of incorporating the explicit knowledge about the problem area in the program itself, machine learning relies on methods from the fields of statistics, probability, and information theory to discover and use the knowledge inherent in data and then change the behavior of a program accordingly in order to be able to solve the initial task (such as recognizing speech).

7.1.2 Classification of machine-learning algorithms

The most basic classification of machine-learning algorithms divides them into two classes: supervised and unsupervised learners. A dataset for supervised learning is prelabeled (information about the expected prediction output is provided with the data), whereas one for unsupervised learning contains no labels, and the algorithm needs to determine them itself.

Supervised learning is used for many practical machine-learning problems today, such as spam detection, speech and handwriting recognition, and computer vision. A spam-detection algorithm, for example, is trained on examples of emails manually marked as spam or not spam (labeled data) and learns how to classify future emails.

Unsupervised learning is also a powerful tool that is widely used. Among other purposes, it's used for discovering structure within data (for example, groups of similar items known as clusters), anomaly detection, image segmentation, and so on.

2 Ron Kohavi and Foster Provost, “Glossary of Terms,” Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Machine Learning, vol. 30 (1998): 271–274.


CLASSIFICATION OF SUPERVISED AND UNSUPERVISED ALGORITHMS
In supervised learning, an algorithm is given a set of known inputs and matching outputs, and it has to find a function that can be used to transform the given inputs to the true outputs, even in the case of input data not seen during the training phase. The same function can then be used to predict outputs of any future input. The typical supervised learning tasks are regression and classification. Regression attempts to predict the values of continuous output variables based on a set of input variables. Classification aims to classify sets of inputs into two or more classes (discrete output variables). Both regression and classification models are trained based on a set of inputs with known outputs (the output variables, values, or classes), which is why they're supervised problems.

In the case of unsupervised learning, the output isn't known in advance, and the algorithm has to find some structure in the data without additional information being provided. A typical unsupervised learning task is clustering. With clustering, the goal of the algorithm is to discover dense regions, called clusters, in the input data by analyzing similarities between the input examples. There are no known classes used as a reference.

For an example of the differences between supervised and unsupervised learning, consider figure 7.2. It shows the often-used Iris flower dataset3 created in 1936. The dataset contains widths and lengths of petals and sepals4 of 150 flowers of three iris species: Iris setosa, Iris versicolor, and Iris virginica (50 flowers of each species). For the sake of simplicity, only sepal length and width are given in figure 7.2. That way, you can plot the dataset in two dimensions. Sepal length and sepal width are features (or dimensions) of the input, and the flower species is the output (or target variable, a label). You'd like your algorithm to find a mapping function that correctly maps sepal length and sepal width to flower species for existing and future examples.


Figure 7.2 Supervised and unsupervised learning in the Iris flower dataset. The dataset for supervised learning is prelabeled, whereas the one for unsupervised learning contains no labels because the algorithm needs to determine them itself.

3 Iris flower dataset, Wikipedia (http://en.wikipedia.org/wiki/Iris_flower_data_set).
4 A sepal is the part of a flower that supports its petals and protects the flower in bud.


NOTE For historical reasons, and because of many possible application areas, a single concept in machine learning can have several different names. Inputs are also called examples, points, data samples, observations, or instances. In Spark, training examples for supervised learning are called labeled points. Features (sepal length and sepal width in the Iris dataset, for example) are also called dimensions, attributes, variables, or independent variables.

In the graph on the left in figure 7.2, flower species corresponding to each input are marked with dots, circles, and Xes, which means the flower species are known in advance. We call this the training set because it can be used to train (or fit) the parameters of the machine-learning model to determine the mapping function. You then test the accuracy of your trained model using a test set containing a different set of labeled examples. If satisfied with its performance, you then let it predict labels for real data.

The graph on the right in figure 7.2 shows clustering (a form of unsupervised learning), which requires the algorithm to find the mapping function and the categories. As you can see, all the examples are marked with the same symbol (a dot), and the algorithm needs to find the most likely grouping system for the given examples. In the graph showing clustering, there is obviously a clear separation between the group of examples in the lower-left corner of the graph and the rest of the examples, but the separation between the other two categories isn't that clear. You can probably already guess that an unsupervised learning algorithm will be less successful in correctly separating this dataset into the three flower categories, because the supervised learning algorithm has much more data to learn from.

ALGORITHM CLASSIFICATION BASED ON THE TYPE OF TARGET VARIABLE
In addition to classifying machine-learning algorithms as supervised and unsupervised, we can also classify them according to the type of the target variable into classification and regression algorithms. The Iris dataset mentioned in the last section is an example of a classification problem because target variables are categorical (or qualitative), which means they can take on a limited number of values (discrete values). In classification algorithms, the target variable is also called a label, class, or category, and the algorithm itself is called a classifier, recognizer, or categorizer. In the case of regression algorithms, the target variable is continuous or quantitative (a real number). Both regression and classification are plainly supervised learning algorithms because the estimation function is fitted according to a priori known values.

Figure 7.3 shows an example of linear regression with only one feature, shown on the X axis. The output value is shown on the Y axis. The goal of regression is to find, based on a set of examples, a mathematical function that is as close an approximation of the relationship between features and the target variable as possible. The regression in figure 7.3 is a simple linear regression because there is only one independent variable (making it simple), and the hypothesis function is modeled as a linear function (a straight line). If there were two variables, you could plot the estimation function in 3D space as a plane. When there are more features, the function becomes a hyperplane.


Figure 7.3 Example of a simple linear-regression problem

7.1.3 Machine learning with Spark

All the advantages of Spark extend to machine learning, too. The most important aspect of Spark is its distributed nature. It enables you to train and apply machine-learning algorithms on very large datasets with adequate speed.

The second advantage is Spark's unifying nature; it offers a platform for performing most tasks. You can collect, prepare, and analyze the data and train, evaluate, and use the model—all in the same system and using the same API.

Spark offers distributed implementations of the most popular machine-learning algorithms, and new ones are constantly added. Spark's primary machine learning API is called MLlib. It's based on the MLbase project in Berkeley, CA. Since its inclusion in Spark 0.8, MLlib has been expanded and developed by the open source community.

Spark 1.2 introduced a new machine learning API called Spark ML. The idea behind Spark ML is to provide a generalized API that can be used for training and tuning different algorithms in the same way. It also provides pipelines, sequences of machine-learning-related processing steps that are collected and handled as a unit. The new Spark ML API is being developed in parallel with the "old" Spark MLlib API. Spark MLlib will continue to be supported and expanded.

Spark relies on several low-level libraries for performing optimized linear algebra operations. These are Breeze and jblas for Scala and Java and NumPy for Python. Refer to the official documentation (http://mng.bz/417O) for how to configure these. We'll use the default Spark build, but that decision shouldn't influence the functional aspects described in this chapter.

7.2 Linear algebra in Spark

Linear algebra is a branch of mathematics focusing on vector spaces and linear operations and mappings between them expressed mainly by matrices.


Linear algebra is essential for understanding the math behind most machine-learning algorithms, so if you don't know much about vectors and matrices, you should have a peek at appendix C for a primer on linear algebra.

Matrices and vectors in Spark can be manipulated locally (in the driver or executor processes) or in a distributed manner. Implementations of distributed matrices in Spark enable you to perform linear algebra operations on huge amounts of data, spanning numerous machines. For local linear algebra operations, Spark uses the very fast Breeze and jblas libraries (and NumPy in Python), and it has its own implementations of distributed ones.

Sparse and dense vectors and matrices

Spark supports sparse and dense vectors and matrices. A vector or matrix is sparse if it contains mostly zeros. It's more efficient to represent such data with pairs of indices and values at those indices. A sparse vector or matrix can be likened to a map (or dictionary in Python).

Conversely, a dense vector or matrix contains all the data—values at all index positions—without storing the indices, similar to an array or a list.

7.2.1 Local vector and matrix implementations

Local vector and matrix implementations in Spark are located in the package org.apache.spark.mllib.linalg. We'll examine Spark's linear algebra API with a set of examples you can run in your Spark shell. To start the shell in local mode, use the command spark-shell --master local[*]. We'll assume you're running Spark in the spark-in-action VM.

GENERATING LOCAL VECTORS
Local vectors in Spark are implemented with two classes—DenseVector and SparseVector—implementing a common interface called Vector, making sure both implementations support exactly the same set of operations. The main class for creating vectors is the Vectors class and its dense and sparse methods. The dense method has two versions: it can take all elements as inline arguments, or it can take an array of elements. For the sparse method, you need to specify a vector size, an array with indices, and an array with values. The following three vectors (dv1, dv2, and sv) contain the same elements and, hence, represent the same mathematical vectors:

import org.apache.spark.mllib.linalg.{Vectors,Vector}
val dv1 = Vectors.dense(5.0,6.0,7.0,8.0)
val dv2 = Vectors.dense(Array(5.0,6.0,7.0,8.0))
val sv = Vectors.sparse(4, Array(0,1,2,3), Array(5.0,6.0,7.0,8.0))

NOTE Make sure to always use sorted indices for constructing your sparse vectors (the second argument of the sparse method). Otherwise, you may get unexpected results.


You can access a specific element in the vector by its index like this:

scala> dv2(2)
res0: Double = 7.0

You can get the size of the vector with the size method:

scala> dv1.size
res1: Int = 4

To get all elements of a vector as an array, you can use the toArray method:

scala> dv2.toArray
res2: Array[Double] = Array(5.0, 6.0, 7.0, 8.0)

LINEAR ALGEBRA OPERATIONS ON LOCAL VECTORS
Linear algebra operations on local vectors can be done using the Breeze library, which Spark uses internally for the same purposes. toBreeze functions exist in Spark vector and matrix local implementations, but they're declared as private. The Spark community has decided not to allow end users access to this library, because they don't want to depend on a third-party library. But you'll most likely need a library for handling local vectors and matrices. An alternative would be to create your own function for converting Spark vectors to Breeze classes, which isn't that hard to do. We propose the following solution:

import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}
import breeze.linalg.{DenseVector => BDV,SparseVector => BSV,Vector => BV}
def toBreezeV(v:Vector):BV[Double] = v match {
  case dv:DenseVector => new BDV(dv.values)
  case sv:SparseVector => new BSV(sv.indices, sv.values, sv.size)
}

Now you can use this function (toBreezeV) and the Breeze library to add vectors and calculate their dot products. For example:

scala> toBreezeV(dv1) + toBreezeV(dv2)
res3: breeze.linalg.Vector[Double] = DenseVector(10.0, 12.0, 14.0, 16.0)
scala> toBreezeV(dv1).dot(toBreezeV(dv2))
res4: Double = 174.0

The Breeze library offers more linear algebra operations, and we invite you to examine its rich set of functionalities. You should note that the names of Breeze classes conflict with the names of Spark classes, so be careful when using both in your code. One solution is to change the class names during import, as in the preceding toBreezeV function example.

GENERATING LOCAL DENSE MATRICES
Similar to the Vectors class, the Matrices class also has the methods dense and sparse for creating matrices.


The dense method expects the number of rows, the number of columns, and an array with the data (elements of type Double). The data should be specified column-wise, which means the elements of the array will be used sequentially to populate columns. For example, to create the following matrix as a DenseMatrix

M = [ 5 0 1 ]
    [ 0 3 4 ]

use a code snippet similar to this one:

scala> import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix, Matrix, Matrices}
scala> import breeze.linalg.{DenseMatrix => BDM, CSCMatrix => BSM, Matrix => BM}
scala> val dm = Matrices.dense(2,3,Array(5.0,0.0,0.0,3.0,1.0,4.0))
dm: org.apache.spark.mllib.linalg.Matrix =
5.0  0.0  1.0
0.0  3.0  4.0

A Matrices object provides shortcut methods for quickly creating identity and diagonal matrices and matrices with all zeros and ones. The eye(n) method5 creates a dense identity matrix of size n × n. The method speye is the equivalent for creating a sparse identity matrix. The methods ones(m, n) and zeros(m, n) create dense matrices with all ones or zeros of size m × n. The diag method takes a Vector and creates a diagonal matrix (its elements are all zeros, except the ones on its main diagonal) with elements from the input Vector placed on its diagonal. Its dimensions are equal to the size of the input Vector.

Additionally, you can generate a DenseMatrix filled with random numbers using the rand and randn methods of the Matrices object. The first method generates numbers uniformly distributed between 0 and 1, and the second generates numbers according to a Gaussian distribution. (Gaussian distribution, also known as normal distribution, has that familiar bell-shaped curve.) Both take the number of rows, the number of columns, and an initialized java.util.Random object as arguments. The sprand and sprandn methods are equivalent methods for generating SparseMatrix objects.

NOTE These methods (eye, rand, randn, zeros, ones, and diag) aren’t avail- able in Python.
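To illustrate a few of these factory methods (Scala only, since they aren't available in Python), here's a small sketch; the matrix sizes and the random seed are arbitrary:

scala> Matrices.eye(3)              // 3 x 3 dense identity matrix
scala> Matrices.ones(2, 3)          // 2 x 3 matrix of ones
scala> Matrices.diag(Vectors.dense(5.0, 3.0, 4.0))    // diagonal matrix built from a vector
scala> Matrices.rand(2, 2, new java.util.Random(42))  // uniformly distributed random entries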

GENERATING LOCAL SPARSE MATRICES
Generating sparse matrices is a bit more involved than generating dense ones. You also pass the number of rows and columns to the sparse method, but the nonzero element values (in sparse matrices, only the nonzero elements are needed) are specified

5 Identity matrices are usually denoted with the letter I, pronounced the same as “eye”; hence the pun in the method name.


in compressed sparse column (CSC) format.6 CSC format is made of three arrays, containing column pointers, row indices, and the nonzero elements. A row indices array contains the row index of each element in the elements array. The column pointers array contains ranges of indices of elements that belong to the same column.

NOTE SparseMatrix isn’t available in Python.

For the matrix M used previously, the arrays for specifying it in CSC format are as follows:

colPtrs = [0 1 2 4], rowIndices = [0 1 0 1], elements = [5 3 1 4]

The colPtrs array tells us that the elements from index 0 (inclusive) to 1 (noninclusive), which is only element 5, belong to the first column. Elements from index 1 to 2, which is only element 3, belong to the second column. Finally, elements from index 2 to 4 (elements 1 and 4) belong to the third column. The row index of each element is given in the rowIndices array. To create the SparseMatrix object corresponding to the matrix M, you use this line of code:

val sm = Matrices.sparse(2,3,Array(0,1,2,4), Array(0,1,0,1), Array(5.,3.,1.,4.))

(Note that the indices are specified as Ints and the values as Doubles.) You can convert a SparseMatrix to a DenseMatrix and vice versa with the corresponding toDense and toSparse methods. But you'll need to explicitly cast the Matrix object to the appropriate class:

scala> import org.apache.spark.mllib.linalg.{DenseMatrix, SparseMatrix}
scala> sm.asInstanceOf[SparseMatrix].toDense
res0: org.apache.spark.mllib.linalg.DenseMatrix =
5.0  0.0  1.0
0.0  3.0  4.0
scala> dm.asInstanceOf[DenseMatrix].toSparse
2 x 3 CSCMatrix
(0,0) 5.0
(1,1) 3.0
(0,2) 1.0
(1,2) 4.0

LINEAR ALGEBRA OPERATIONS ON LOCAL MATRICES
Similar to vectors, you can access specific elements of a matrix by indexing it like this:

scala> dm(1,1)
res1: Double = 3.0

6 Jack Dongarra et al., Compressed Column Storage (CCS), http://mng.bz/Sajv.


You can efficiently create a transposed matrix using the transpose method:

scala> dm.transpose
res1: org.apache.spark.mllib.linalg.Matrix =
5.0  0.0
0.0  3.0
1.0  4.0

As with vectors, other local matrix operations require conversion to Breeze matrices. The online repository contains the toBreezeM and toBreezeD functions, which you can use for converting local and distributed matrices to Breeze objects. Once converted to Breeze matrices, you can use operations like element-wise addition and matrix multiplication. We leave it up to you to further explore the Breeze API.
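As a quick taste, here's a minimal sketch that converts a dense local matrix to Breeze and performs a couple of operations. The toBreezeDense helper is our own simplified stand-in for the repository's toBreezeM and handles only DenseMatrix:

import org.apache.spark.mllib.linalg.DenseMatrix
import breeze.linalg.{DenseMatrix => BDM}

// Both MLlib and Breeze store dense matrices column-major, so the values array can be reused directly
def toBreezeDense(m: DenseMatrix): BDM[Double] = new BDM(m.numRows, m.numCols, m.values)

val bdm = toBreezeDense(dm.asInstanceOf[DenseMatrix])
bdm + bdm     // element-wise addition
bdm * bdm.t   // matrix multiplication (2 x 3 times 3 x 2 gives a 2 x 2 matrix)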

7.2.2 Distributed matrices

Distributed matrices are necessary when you're using machine-learning algorithms on huge datasets. They're stored across many machines, and they can have a large number of rows and columns. Instead of using Ints to index rows and columns, for distributed matrices you use Longs. There are four types of distributed matrices in Spark, defined in the package org.apache.spark.mllib.linalg.distributed: RowMatrix, IndexedRowMatrix, BlockMatrix, and CoordinateMatrix.

ROWMATRIX
RowMatrix stores the rows of a matrix in an RDD of Vector objects. This RDD is accessible as the rows member field. The number of rows and columns can be obtained with numRows and numCols. A RowMatrix can be multiplied by a local matrix (producing another RowMatrix) using the method multiply. RowMatrix also provides other useful methods, not available for other distributed implementations. We'll describe those later. Every other type of Spark distributed matrix can be converted to a RowMatrix using the built-in toRowMatrix methods, but there are no methods for converting a RowMatrix to other distributed implementations.

INDEXEDROWMATRIX
IndexedRowMatrix is an RDD of IndexedRow objects, each containing an index of the row and a Vector with row data. Although there is no built-in method for converting a RowMatrix to an IndexedRowMatrix, it's fairly easy to do:

import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRow
val rmind = new IndexedRowMatrix(rm.rows.zipWithIndex().map(
  x => IndexedRow(x._2, x._1)))
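The rm in the preceding snippet is assumed to be an existing RowMatrix; for completeness, here's one way you might construct such a matrix from an RDD of local vectors (the values are arbitrary):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rm = new RowMatrix(sc.parallelize(List(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0))))
rm.numRows()   // 2
rm.numCols()   // 3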

COORDINATEMATRIX
CoordinateMatrix stores its values as an RDD of MatrixEntry objects, which contain individual entries and their (i,j) positions in the matrix. This isn't an efficient way of storing data, so you should use CoordinateMatrix only for storing sparse matrices. Otherwise, it could consume too much memory.

BLOCKMATRIX
BlockMatrix is the only distributed implementation with methods for adding and multiplying other distributed matrices. It stores its values as RDDs of tuples ((i,j), Matrix). In other words, BlockMatrix contains local matrices (blocks) referenced by their position in the matrix. Sub-matrices take up blocks of the same sizes (of rows-per-block and columns-per-block dimensions), except for the last sub-matrices, which can be smaller (to allow the total matrix to be of any dimensions). The validate method checks whether all blocks are of the same size (except the last ones).

LINEAR ALGEBRA OPERATIONS WITH DISTRIBUTED MATRICES
Linear algebra operations with distributed matrix implementations are somewhat limited, so you'll need to implement some of these yourself. For example, element-wise addition and multiplication of distributed matrices is available only for BlockMatrix matrices. The reason is that only BlockMatrices offer a way to efficiently handle these operations for matrices with many rows and columns. Transposition is available only for CoordinateMatrix and BlockMatrix. The other operations, like matrix inverse, for example, have to be done manually.

7.3 Linear regression

Now you finally get to do some machine learning. In this section, you'll learn how linear regression works and how to apply it on a sample dataset. In the process, you'll learn how to analyze and prepare the data for linear regression and how to evaluate your model's performance. You'll also learn important concepts such as bias-variance tradeoff, cross-validation, and regularization.

7.3.1 About linear regression

Historically, linear regression has been one of the most widely used regression methods and one of the basic analytical methods in statistics, and it's still widely used today. That's because modeling linear relationships is much easier than modeling nonlinear ones. Interpretation of the resulting models is also easier. The theory behind linear regression also forms the basis for more advanced methods and algorithms in machine learning.

Like other types of regression, linear regression lets you use a set of independent variables to make predictions about a target variable and quantify the relationship between them. Linear regression makes an assumption that there is a linear relationship (hence the name) between the independent and target variables. Let's see what this means when you have only one independent and one target variable, which is also called simple linear regression. Using simple linear regression, you can plot the problem in two dimensions: the X-axis is the independent variable, and the Y-axis is the target variable. Later, you'll expand this into a model with more independent variables, which is called multiple linear regression.


7.3.2 Simple linear regression As an example, let’s use the UCI Boston housing dataset.7 Although the dataset is rather small and, as such, doesn’t represent a big-data problem, it’s nevertheless appropriate for explaining machine-learning algorithms in Spark. Besides, this enables you to use it on your local machine if you want to do so. The dataset contains mean values of owner-occupied homes in the suburbs of Bos- ton and 13 features that can be used to predict home values. These features include the crime rate, number of rooms per dwelling, accessibility to highways, and so on. For the simple linear-regression example, you’ll predict home prices based on the average number of rooms per dwelling. You might not need to use linear regression to find out that the price of a house probably rises if there are more rooms in it. That’s obvious and intuitive. But linear regression does enable you to quantify that relation- ship—to say what the expected price is for a certain number of rooms. If you were to plot the average number of rooms on the X-axis and the average price on the Y-axis, you’d get output similar to that shown in figure 7.4. There is obviously a correlation between the two variables: almost no expensive houses have a small number of rooms, and no inexpensive houses have a large num- ber of rooms. Linear regression enables you to find a line that goes through the mid- dle of these data points and, in that way, to approximate the most likely home price you could expect given an average number of rooms. We’ve already calculated this line, as shown in figure 7.4. Let’s see what the method is for finding it. Generally, if you want to draw a line in two-dimensional space, you need two values: the slope of the line and the value at which the line intersects the Y-axis, also called the intercept. If you denote the number of rooms as x, the function for calculating the

Figure 7.4 Mean home prices in Boston by average number of rooms per dwelling. Linear regression was used to find a best-fit line for this data (shown on the graph). (X-axis: average number of rooms; Y-axis: mean home price.)

7 Housing Data Set, UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Housing.

Generally, if you want to draw a line in two-dimensional space, you need two values: the slope of the line and the value at which the line intersects the Y-axis, also called the intercept. If you denote the number of rooms as x, the function for calculating the home price as h (which stands for hypothesis), and the intercept and slope as w0 and w1, respectively, the line can be described with the following formula:

h(x) = w0 + w1x

The goal is to find the weights w0 and w1 that best fit the data. Linear regression's method for finding appropriate weight values is to minimize the so-called cost function. The cost function returns a single value that can be used as a measure of how well the line, determined by the weights, fits all examples in a dataset. Different cost functions can be used. The one used in linear regression is the mean of the squared differences between predicted and real values of the target variable for all m examples in the dataset (mean squared error). The cost function (we call it C in the equation) can be written like this:

m m 1 i i 2 1 i i 2 Cw w ==------hx– y ------w + w x – y 0 1 2m  2m  0 1 i1= i1=

If you give this function a set of m examples x(1) to x(m) (with matching target values y(1) to y(m)) and the weights w0 and w1, which you think would be most appropriate for the data, the function will give you a single error value. If this value is lower than a second one, obtained for a different set of weights, that means the first model (determined by the chosen weights w0 and w1) better fits the dataset. But how do you get the best-fit weights? You can find the minimum of the cost function. If you plot the cost function with respect to weights w0 and w1, it forms a curved plane in three-dimensional space, similar to the one in figure 7.5.

Figure 7.5 The cost function of the housing dataset, depending on weights w0 and w1. The valley in the middle shows that many combinations of w0 and w1 can fit the data equally well. (Axes: w0, w1, and the value of the cost function.)


The shape of the cost function depends on your dataset. In this example, the cost function has a valley along which many points correspond to low error values. That means you could draw many lines (defined by weights w0 and w1) in figure 7.4 that would fit the dataset equally well. The mean-squared-error cost function is used in linear regression because it offers certain benefits: the squares of individual deviations can't cancel each other out (they're always positive), even though the corresponding deviations may be negative; the function is convex, which means there are no local minima, only a global minimum; and an analytical solution for finding its minimum exists.

7.3.3 Expanding the model to multiple linear regression
There is a nice vectorized solution for finding the minimum of the cost function C, but let's first expand the model to use multiple linear regression.
As we said previously, expanding the model to use multiple linear regression means examples will have more dimensions (independent variables). In this example, you need to add the remaining 12 dimensions of the housing dataset. This adds information to the dataset and enables the model to make better predictions based on it. It also means that, from this point on, you won't be able to plot the data or the cost function, because the linear-regression solution now becomes a 13-dimensional hyperplane (instead of a line in 2 dimensions).
After adding the remaining 12 dimensions to the dataset, the hypothesis function becomes

h(\mathbf{x}) = w_0 + w_1 x_1 + \ldots + w_n x_n = \mathbf{w}^T \mathbf{x}

where n, in this example, is equal to 12. On the right side, you can see the vectorized version of the same expression. To be able to introduce the vectorized notation (because the intercept value w0 is multiplied by 1), you need to extend the original vector x with an additional component, x0, which has a constant value of 1:

\mathbf{x}^T = [1 \; x_1 \; \ldots \; x_n]

You can now rewrite the cost function of the multiple linear-regression model like this:

m 1 T i i 2 Cw = ------w x – y 2m  i1=

This is also a vectorized version of the cost function (as indicated by the bold letters in this equation).
FINDING THE MINIMUM WITH THE NORMAL EQUATION METHOD
The vectorized solution to the problem of minimizing the cost function, with respect to the weights w0 to wn, is given by the normal equation method formula:

\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}


X here is a matrix with m rows (m examples) and n + 1 columns (n dimensions plus 1s for x0). w and y are vectors with n + 1 weights and m target values, respectively. Unfortunately, the scope of this book doesn't allow us to explain the math behind this formula.
FINDING THE MINIMUM WITH GRADIENT DESCENT
Directly solving this equation with the previous formula can be very expensive and not easy to do (because of the matrix multiplications and matrix-inversion calculations required), especially if there is a large number of dimensions and rows in the dataset. So we'll use the gradient-descent method, which is more commonly employed—and you can use it in Spark, too.
The gradient-descent algorithm works iteratively. It starts from a certain point, representing a best guess of the weight parameters' values (this point can also be randomly chosen), and for each weight parameter wj calculates the partial derivative of the cost function with respect to that weight parameter. The partial derivative tells the algorithm how to change the weight parameter in question to descend to the minimum of the cost function as quickly as possible. The algorithm then updates the weight parameters according to the calculated partial derivatives and calculates the value of the cost function at the new point. If the new value differs from the previous one by less than some tolerance value, we say that the algorithm has converged, and the process stops. See figure 7.6 for an illustration.
As an example of a gradient-descent algorithm, let's return to the simple linear-regression example and its cost function, shown in figure 7.6. The dots on the white line in the figure are points the algorithm visits in each step. The white line is the shortest path from the starting point to the minimum of the cost function.

Figure 7.6 A gradient-descent algorithm determines the minimum value for the cost function in the simple linear-regression model for the housing dataset. The white line connects the points the algorithm visits in each step. (Axes: w0, w1, and the value of the cost function.)


The partial derivative of the cost function C, with respect to any weight parameter wj, is given by this formula:

\frac{\partial}{\partial w_j} C(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}

(Please note that x0 is equal to 1 for all examples, as mentioned previously.) If the partial derivative is negative, the cost function decreases with an increase of the weight parameter wj. You can now use this value to update the weight parameter wj to decrease the value of the cost function. And you do it for all weight parameters as part of a single step:

w_j := w_j - \gamma \frac{\partial}{\partial w_j} C(\mathbf{w}) = w_j - \gamma \frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}, \quad \text{for every } j

After updating all weight parameters, you calculate the cost function again (the second point along the white line in figure 7.6), and if it's still unacceptably high, you update the weights again using the partial derivatives. You repeat this process until convergence (the value of the cost function remains stable). Parameter γ (the Greek letter gamma) is the step-size parameter that helps stabilize the algorithm. We'll have more to say about the step-size parameter later.
7.4 Analyzing and preparing the data
That was a good dose of theoretical background for the linear-regression example. Now it's time to implement all that using Spark's API. You'll download the housing dataset, prepare the data, fit a linear-regression model, and use the model to predict target values of some examples.
To begin, download the housing dataset (housing.data) from our online repository (use our GitHub repository8 and not the one from the UCI machine learning repository, because we changed the dataset a bit). We'll assume you've cloned the GitHub repository to the /home/spark/first-edition folder in the VM. You can find the description of the dataset in the file ch07/housing.names.
First, start the Spark shell in your home directory, and load the data with the following code:

import org.apache.spark.mllib.linalg.Vectors val housingLines = sc.textFile("first-edition/ch07/housing.data", 6) val housingVals = housingLines.map(x => Vectors.dense(x.split(",").map(_.trim().toDouble)))

8 Find the housing dataset here: https://github.com/spark-in-action/first-edition/blob/master/ch07/ housing.data.


We’re using six partitions for the housingLines RDD, but you can choose another value, depending on your cluster environment.9 Now you have your data parsed and available as Vector objects. But before doing anything useful with it, acquaint yourself with the data. The first step when dealing with any machine-learning problem is to analyze the data and notice its distribution and interrelationships among different variables.

7.4.1 Analyzing data distribution
To get a feeling for the data you just loaded, you can calculate its multivariate statistical summary. You can obtain that value from the corresponding RowMatrix object like this

import org.apache.spark.mllib.linalg.distributed.RowMatrix val housingMat = new RowMatrix(housingVals) val housingStats = housingMat.computeColumnSummaryStatistics()

or you can use the Statistics object for the same purpose:

import org.apache.spark.mllib.stat.Statistics val housingStats = Statistics.colStats(housingVals)

You can now use the obtained MultivariateStatisticalSummary object to examine the average (the mean method), maximum (the max method), and minimum (the min method) values in each column of the matrix. For example, the minimum values in the columns are these:

scala> housingStats.min res0: org.apache.spark.mllib.linalg.Vector = [0.00632,0.0,0.46,0.0,0.385, 3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0]

You can also get the L1 norm (sum of absolute values of all elements per column) and L2 norm (also called the Euclidean norm; equal to the length of a vector/column) for each column, with the methods normL1 and normL2. The variance of each column can be obtained with the variance method.
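For example, these calls read the corresponding statistics off the same summary object (each returns a Vector with one value per column):

housingStats.mean      // column averages
housingStats.variance  // column variances
housingStats.normL1    // sum of absolute values per column
housingStats.normL2    // Euclidean norm per column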

DEFINITION Variance is a measure of dispersion of a dataset and is equal to the average of the squared deviations of values from their mean value. The standard deviation is calculated as the square root of the variance. The covariance is a measure of how much two variables change relative to each other.

All of this can be useful when examining data for the first time, especially when deciding whether feature scaling (described shortly) is necessary.

9 If you need more information about the number of partitions required, chapter 4 is a good place to look.


7.4.2 Analyzing column cosine similarities
Understanding column cosine similarities is another thing that helps when analyzing data. Column cosine similarities represent an angle between two columns, viewed as vectors. A similar procedure can be used for other purposes as well (for example, for finding similar products or similar articles). You obtain column cosine similarities from the RowMatrix object:

val housingColSims = housingMat.columnSimilarities()

PYTHON The columnSimilarities method isn't available in Python.
The resulting object is a distributed CoordinateMatrix containing an upper-triangular matrix (upper-triangular matrices contain data only above their diagonal). The value at the i-th row and j-th column in the resulting housingColSims matrix gives a measure of similarity between the i-th column and the j-th column in the housingMat matrix. The values in the housingColSims matrix range from –1 to 1. A value of –1 means the two columns have completely opposite orientations (directions), a value of 0 means they're orthogonal to one another, and a value of 1 means the two columns (vectors) have the same orientation.
The easiest way to see the contents of this matrix is to convert it to a Breeze matrix using the toBreezeD method and then print it with the utility method printMat from our repository listing, which we omit here for brevity. To do this, first paste the printMat method definition into your shell, and then execute the following:

printMat(toBreezeD(housingColSims))

This will pretty-print the contents of the matrix (you can also find the expected output in our online repository). If you look at the last column of the result, it gives you a measure of how well each dimension in the dataset corresponds to the target variable (average price). The last column reads: 0.224, 0.528, 0.693, 0.307, 0.873, 0.949, 0.803, 0.856, 0.588, 0.789, 0.897, 0.928, 0.670, 0.000. The biggest value here is the sixth value (0.949), which corresponds to the column containing the average number of rooms. Now you can see that it was no coincidence that we chose that column for the previous simple linear-regression example—it has the strongest similarity with the target value and thus represents the most appropriate candidate for simple linear regression.
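If you'd rather skip the Breeze conversion, you can read the same values straight from the CoordinateMatrix entries. A sketch (assuming the target variable is the last, 14th column, index 13):

housingColSims.entries.filter(_.j == 13).   // keep only similarities with the last column
  sortBy(_.i).map(e => (e.i, e.value)).collect()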

7.4.3 Computing the covariance matrix
Another method for examining similarities between different columns (dimensions) of the input set is the covariance matrix. It's important in statistics for modeling linear correspondence between variables. In Spark, you compute the covariance matrix similarly to column statistics and column similarities, using the RowMatrix object:

val housingCovar = housingMat.computeCovariance() printMat(toBreezeM(housingCovar))


PYTHON The computeCovariance method isn't available in Python.
The expected output is also available in our online repository. Notice that there is a large range of values in the matrix and that some of them are negative and some are positive. You'll also probably notice that the matrix is symmetric (that is, each (i, j) element is the same as the (j, i) element).
This is because the variance-covariance matrix contains the variance of each column on its diagonal and the covariance of the two matching columns in all other positions. If the covariance of two columns is zero, there is no linear relationship between them. Negative values mean the values in the two columns move in opposite directions from their averages, whereas the opposite is true for positive values.
Spark also offers two other methods for examining the correlations between series of data: Spearman's and Pearson's methods. An explanation of those methods is beyond the scope of this book. You can access them through the org.apache.spark.mllib.stat.Statistics object.
7.4.4 Transforming to labeled points
Now that you've examined the dataset, you can go on to preparing the data for linear regression. First you have to put each example in the dataset in a structure called a LabeledPoint, which is used in most of Spark's machine-learning algorithms. It contains the target value and the vector with the features.
The housingVals RDD, containing Vector objects with all the variables, and the equivalent housingMat RowMatrix object were useful when you were examining the dataset as a whole (in the previous sections), but now you need to separate the target variable (the label) from the features. To do that, you can transform the housingVals RDD (the target variable is in the last column):

import org.apache.spark.mllib.regression.LabeledPoint val housingData = housingVals.map(x => { val a = x.toArray LabeledPoint(a(a.length-1), Vectors.dense(a.slice(0, a.length-1))) })

7.4.5 Splitting the data The second important step is splitting the data into training and validation sets. A training set is used to train the model, and a validation set is used to see how well the model performs on data that wasn’t used to train it. The usual split ratio is 80% for the training set and 20% for the validation set. You can split the data easily in Spark with the RDD’s built-in randomSplit method:

val sets = housingData.randomSplit(Array(0.8, 0.2)) val housingTrain = sets(0) val housingValid = sets(1)

The method returns an array of RDDs, each containing approximately the requested percentage of the original data.


7.4.6 Feature scaling and mean normalization
We're not done with data preparation yet. As you probably noticed when you were examining the distribution of the data, there are large differences in data spans between the columns. For example, the data in the first column goes from 0.00632 to 88.9762, and the data in the fifth column from 0.385 to 0.871. Interpreting results from a linear-regression model trained with data like this can be difficult and can render some data transformations (which you're going to perform in the coming sections) problematic. It's often a good idea to standardize the data first, which can't hurt your model. There are two ways you can do this: with feature scaling and with mean normalization.
Feature scaling means the ranges of data are scaled to comparable sizes. Mean normalization means the data is translated so that the averages are roughly zero. You can do both in a single pass, but you need a StandardScaler object to do that. In the constructor, you specify which of the standardization techniques you want to use (you'll use both) and then fit it to some data:

import org.apache.spark.mllib.feature.StandardScaler val scaler = new StandardScaler(true, true). fit(housingTrain.map(x => x.features))

Fitting finds the column summary statistics of the input data and uses these statistics (in the next step) to do the scaling. You fitted the scaler on the training set, and you'll then use the same statistics to scale both the training and validation sets (only data from the training set should be used for fitting the scaler):

val trainScaled = housingTrain.map(x => LabeledPoint(x.label, ➥ scaler.transform(x.features))) val validScaled = housingValid.map(x => LabeledPoint(x.label, ➥ scaler.transform(x.features)))

Now you’re finally ready to use the housing dataset for linear regression. 7.5 Fitting and using a linear regression model A linear regression model in Spark is implemented by the class LinearRegression- Model in the package org.apache.spark.mllib.regression. It’s produced by fitting a model and holds the fitted model’s parameters. When you have fitted a Linear- RegressionModel object, you can use its predict method on individual Vector exam- ples to predict the corresponding target variables. You construct the model using the LinearRegressionWithSGD class, which implements the algorithm used for training the model. You can do this in two ways. The first is the standard Spark way of invoking the static train method:

val model = LinearRegressionWithSGD.train(trainScaled, 200, 1.0)


Unfortunately, this doesn’t allow you to find the intercept value (only the weights), so use the second, nonstandard method: Sets the option to find the intercept value Instantiates import org.apache.spark.mllib.regression.LinearRegressionWithSGD object val alg = new LinearRegressionWithSGD() alg.setIntercept(true) Sets the number alg.optimizer.setNumIterations(200) of iterations to trainScaled.cache() Caching input data run validScaled.cache() is important. val model = alg.run(trainScaled) Starts training the model

Within a few seconds of executing this code, you’ll have your Spark linear-regression model ready to use for predictions. The datasets are cached, which is important for iterative algorithms, such as machine learning ones, because they tend to reuse the same data many times.

7.5.1 Predicting the target values
You can now use the trained model to predict the target values of vectors in the validation set by running predict on every element. The validation set contains labeled points, but you only need the features. You also need the predictions together with the original labels, so you can compare them. This is how you can map labeled points to pairs of predicted and original values:

val validPredicts = validScaled.map(x => (model.predict(x.features), x.label))

The moment of truth has arrived. You can see how well your model is doing on the validation set by examining the contents of validPredicts:

scala> validPredicts.collect() res123: Array[(Double, Double)] = Array((28.250971806168213,33.4), (23.050776311791807,22.9), (21.278600156174313,21.7), (19.067817892581136,19.9), (19.463816495227626,18.4), ...

Some predictions are close to original labels, and some are further off. To quantify the success of your model, calculate the root mean squared error (the root of the cost function defined previously):

scala> math.sqrt(validPredicts.map{case(p,l) => math.pow(p-l,2)}.mean()) res0: Double = 4.775608317676729

The average value of the target variables (home prices) is 22.5—which you learned earlier when you were calculating column statistics—so a root mean squared error (RMSE) of 4.78 seems rather large. But if you take into account that the variance of home prices is 84.6, the number suddenly looks much better.


7.5.2 Evaluating the model’s performance This isn’t the only way you can evaluate the performance of your regression model. Spark offers the RegressionMetrics class for this purpose. You give it an RDD with pairs of predictions and labels, and it returns several useful evaluation metrics:

scala> import org.apache.spark.mllib.evaluation.RegressionMetrics scala> val validMetrics = new RegressionMetrics(validPredicts) scala> validMetrics.rootMeanSquaredError res1: Double = 4.775608317676729 scala> validMetrics.meanSquaredError res2: Double = 22.806434803863162

In addition to the root mean squared error that you previously calculated yourself, RegressionMetrics gives you the following:
■ meanAbsoluteError—Average absolute difference between a predicted and real value (3.044 in this case).
■ r2—Coefficient of determination R² (0.71 in this case) is a value between 0 and 1 and represents the fraction of variance explained. It's a measure of how much the model accounts for the variation in the target variable and how much of it is "unexplained." A value close to 1 means the model explains a large part of the variance in the target variable.
■ explainedVariance—A value similar to R² (0.711 in this case).
All of these are used in practice, but the coefficient of determination can give you somewhat misleading results (it tends to rise when the number of features increases, whether or not they're relevant). For that reason, you'll use the RMSE from now on.
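These metrics are read off the same RegressionMetrics object; the numbers in the comments are the ones quoted above and will vary with your split:

scala> validMetrics.meanAbsoluteError   // about 3.044 in this run
scala> validMetrics.r2                  // about 0.71
scala> validMetrics.explainedVariance   // about 0.711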

7.5.3 Interpreting the model parameters
The set of weights the model has learned can tell you something about the influence of individual dimensions on the target variable. If a particular weight is near zero, the corresponding dimension doesn't contribute to the target variable (price of housing) in a significant way (assuming the data has been scaled—otherwise even low-range features might be important). You can inspect absolutes of the individual weights with the following snippet of code:

scala> println(model.weights.toArray.map(x => x.abs). | zipWithIndex.sortBy(_._1).mkString(", ")) (0.112892822124492423,6), (0.163296952677502576,2), (0.588838584855835963,3), (0.939646889835077461,0), (0.994950411719257694,11), (1.263479388579985779,1), (1.660835069779720992,9), (2.030167784111269705,4), (2.072353314616951604,10), (2.419153951711214781,8), (2.794657721841373189,5), (3.113566843160460237,7), (3.323924359136577734,12)


The model's weights Vector is first converted to a Scala Array, then the absolute values are calculated, an index is attached to each weight using Scala's zipWithIndex method, and, finally, the weights are sorted by their values.
You can see that the most influential dimension of the dataset is the one with index 12, which corresponds to the LSTAT column, or the "percentage of lower status of the population." (You can find the column descriptions in the housing.names file in the book's online repository.) The second-most influential dimension is the column with index 7, or "weighted distances to five Boston employment centers," and so on. The two least influential dimensions are the "proportion of owner-occupied units built prior to 1940" and the "proportion of non-retail business acres per town." Those dimensions can be removed from the dataset without influencing the model's performance significantly. In fact, that might even improve it a bit, because the model would then be more focused on the important features.
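If you want to try that, the following sketch drops the two least influential columns before retraining. The indexes 2 and 6 (INDUS and AGE in housing.names) come from this particular run and may differ for your split:

val dropIdx = Set(2, 6)                        // least influential columns in this run
val housingReduced = housingData.map { lp =>
  val kept = lp.features.toArray.zipWithIndex.
    collect { case (v, i) if !dropIdx(i) => v }
  LabeledPoint(lp.label, Vectors.dense(kept))
}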

7.5.4 Loading and saving the model
Because training a model using lots of data can be an expensive and lengthy operation, Spark offers a way to save the model to a filesystem as a Parquet file (covered in chapter 5) and load it later, when needed. Most Spark MLlib models can be saved using the save method. You just pass a SparkContext instance and a filesystem path to it, similar to this:

model.save(sc, "chapter07output/model")

Spark uses the path for creating a directory and creates two Parquet files in it: data and metadata. In the case of linear-regression models, the metadata file contains the model's implementation class name, the implementation's version, and the number of features in the model. The data file contains the weights and the intercept of the linear regression model.
To load the model, use the corresponding load method, again passing to it a SparkContext instance and the path to the directory with the saved model. For example:

import org.apache.spark.mllib.regression.LinearRegressionModel val model = LinearRegressionModel.load(sc, "ch07output/model")

The model can then be used for predictions.
7.6 Tweaking the algorithm
In section 7.3.3, you saw the gradient descent formula:

w_j := w_j - \gamma \frac{\partial}{\partial w_j} C(\mathbf{w}) = w_j - \gamma \frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}, \quad \text{for every } j


Parameter γ (the Greek letter gamma) in the formula is the step-size parameter, which helps to stabilize the gradient descent algorithm. But it can be difficult to find the optimal value for this parameter. If it's too small, the algorithm will take too many small steps to converge. If it's too large, the algorithm may never converge. The right value depends on the dataset.
It's similar with the number of iterations. If it's too large, fitting the model will take too much time. If it's too small, the algorithm may not reach the minimum.
Although you only set the number of iterations for the previous run of the linear-regression algorithm (the step size you used had the default value of 1.0), you can set both of these parameters when using LinearRegressionWithSGD. But you can't tell Spark to "iterate until the algorithm converges" (which would be ideal). You have to find the optimal values for these two parameters yourself.
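For reference, both values are set on the same optimizer object you configured earlier; a minimal sketch:

val alg = new LinearRegressionWithSGD()
alg.setIntercept(true)
alg.optimizer.setNumIterations(400).setStepSize(0.5)   // both parameters live on the optimizer
val model = alg.run(trainScaled)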

7.6.1 Finding the right step size and number of iterations
One way to find satisfactory values for these two parameters is to experiment with several combinations and find the one that gives the best results. We put together a function that can help you do this. Find the iterateLRwSGD function in our online repository (in the ch07-listings.scala and ch07-listings.py files), and paste it into your Spark shell. This is the complete function:

import org.apache.spark.rdd.RDD def iterateLRwSGD(iterNums:Array[Int], stepSizes:Array[Double], train:RDD[LabeledPoint], test:RDD[LabeledPoint]) = { for(numIter <- iterNums; step <- stepSizes) { val alg = new LinearRegressionWithSGD() alg.setIntercept(true).optimizer.setNumIterations(numIter). setStepSize(step) val model = alg.run(train) val rescaledPredicts = train.map(x => (model.predict(x.features), x.label)) val validPredicts = test.map(x => (model.predict(x.features), x.label)) val meanSquared = math.sqrt(rescaledPredicts.map( {case(p,l) => math.pow(p-l,2)}).mean()) val meanSquaredValid = math.sqrt(validPredicts.map( {case(p,l) => math.pow(p-l,2)}).mean()) println("%d, %5.3f -> %.4f, %.4f".format(numIter, step, meanSquared, meanSquaredValid)) } }

The iterateLRwSGD function takes two arrays, containing different numbers of iterations and step-size parameters, and two RDDs, containing training and validation data. For each combination of step size and number of iterations in the input arrays, the function prints the RMSE of the training and validation sets. Here's what the printout should look like:


scala> iterateLRwSGD(Array(200, 400, 600), Array(0.05, 0.1, 0.5, 1, 1.5, 2, ➥ 3), trainScaled, validScaled) 200, 0.050 -> 7.5420, 7.4786 200, 0.100 -> 5.0437, 5.0910 200, 0.500 -> 4.6920, 4.7814 200, 1.000 -> 4.6777, 4.7756 200, 1.500 -> 4.6751, 4.7761 200, 2.000 -> 4.6746, 4.7771 200, 3.000 -> 108738480856.3940, 122956877593.1419 400, 0.050 -> 5.8161, 5.8254 400, 0.100 -> 4.8069, 4.8689 400, 0.500 -> 4.6826, 4.7772 400, 1.000 -> 4.6753, 4.7760 400, 1.500 -> 4.6746, 4.7774 400, 2.000 -> 4.6745, 4.7780 400, 3.000 -> 25240554554.3096, 30621674955.1730 600, 0.050 -> 5.2510, 5.2877 600, 0.100 -> 4.7667, 4.8332 600, 0.500 -> 4.6792, 4.7759 600, 1.000 -> 4.6748, 4.7767 600, 1.500 -> 4.6745, 4.7779 600, 2.000 -> 4.6745, 4.7783 600, 3.000 -> 4977766834.6285, 6036973314.0450

You can see several things from this output. First, the testing RMSE is always greater than the training RMSE (except for some corner cases). That's to be expected. Furthermore, for every number of iterations, both errors decline rapidly as step size increases, following some inverse exponential function. That makes sense because for smaller numbers of iterations and smaller step sizes, there weren't enough iterations to get to the minimum.
Then the error values flatten out, more quickly for larger numbers of iterations. This also makes sense because there are some limitations to how well you can fit a dataset. And models fitted with larger numbers of iterations will perform better. For a step size value of 3, the error values explode. This step size value is too large, and the algorithm misses the minimum. It seems that a step size of 0.5 or 1.0 gives the best results if the number of iterations stays the same. You may also have noticed that running more iterations doesn't help much. For example, a step size of 1.0 with 200 iterations gives you almost the same training RMSE as with 600 iterations.

7.6.2 Adding higher-order polynomials
It seems that a testing RMSE of 4.7760 is the lowest error you can get for the housing dataset with this model. But you can do better by changing the model itself: adding higher-order polynomials produces a different model, so the previous limit no longer applies. Often, data doesn't follow a simple linear formula (a straight line in two-dimensional space) but may follow some kind of curve. Curves can often be described with functions containing higher-order polynomials.


For example:

h(x) = w_0 x^3 + w_1 x^2 + w_2 x + w_3

This hypothesis is capable of matching data governed by a nonlinear relationship. You'll see an example of this in the next section.
Spark doesn't offer a method of training a nonlinear regression model that includes higher-order polynomials, such as the preceding hypothesis. Instead, you can employ a little trick and do something that has a similar effect: you can expand your dataset with additional features obtained by multiplying the existing ones. For example, if you have features x1 and x2, you can expand the dataset to include x1² and x2². Adding the interaction term x1x2 helps in cases when x1 and x2 influence the target variable together.
Let's do that now with this dataset. You'll use a simple function to map each Vector in the dataset to include the square of each feature:

def addHighPols(v:Vector): Vector = {                  // Adds squares to a Vector
  Vectors.dense(v.toArray.flatMap(x => Array(x, x*x)))
}
val housingHP = housingData.map(x =>                   // Maps the original dataset
  LabeledPoint(x.label, addHighPols(x.features)))

The housingHP RDD now contains LabeledPoints from the original housingData RDD, but expanded with additional features containing second-order polynomials. You now have 26 features instead of the previous 13:

scala> housingHP.first().features.size
res0: Int = 26

Next, it's necessary to once again split the dataset into training and testing subsets and to scale the data the same way you did previously:

val setsHP = housingHP.randomSplit(Array(0.8, 0.2))
val housingHPTrain = setsHP(0)
val housingHPValid = setsHP(1)
val scalerHP = new StandardScaler(true, true).
  fit(housingHPTrain.map(x => x.features))
val trainHPScaled = housingHPTrain.map(x => LabeledPoint(x.label,
➥ scalerHP.transform(x.features)))
val validHPScaled = housingHPValid.map(x => LabeledPoint(x.label,
➥ scalerHP.transform(x.features)))
trainHPScaled.cache()
validHPScaled.cache()

You can see how the new model behaves with different numbers of iterations and step sizes:

iterateLRwSGD(Array(200, 400), Array(0.4, 0.5, 0.6, 0.7, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5), trainHPScaled, validHPScaled)


As you can see from the results (omitted for brevity, but available in our online repository), the RMSE explodes for a step size of 1.3, and you get the best results for a step size of 1.1. The error values are lower than before: the best RMSE is 3.9836 (for 400 iterations), compared to 4.776 before. You can conclude that adding higher-order polynomials helped the linear-regression algorithm find a better-performing model.
But is this the lowest RMSE you can get with this dataset? Let's see what happens if you increase the number of iterations (and use the best-performing step size of 1.1):

scala> iterateLRwSGD(Array(200, 400, 800, 1000, 3000, 6000), Array(1.1), ➥ trainHPScaled, validHPScaled) 200, 1.100 -> 4.1605, 4.0108 400, 1.100 -> 4.0378, 3.9836 800, 1.100 -> 3.9438, 3.9901 1000, 1.100 -> 3.9199, 3.9982 3000, 1.100 -> 3.8332, 4.0633 6000, 1.100 -> 3.7915, 4.1138

With more iterations, the testing RMSE even starts to increase. (Depending on your dataset split, you may get different results.) So which step size should you choose? And why is the RMSE increasing?

7.6.3 Bias-variance tradeoff and model complexity
The situation where the testing RMSE is increasing while the training RMSE is decreasing is known as overfitting. What happens is that the model gets too attuned to the "noise" in the training set and becomes less accurate when analyzing new, real-world data that doesn't possess the same properties as the training set. There is also an opposite term—underfitting—where the model is too simple and is incapable of adequately capturing the complexities of the data. Understanding these phenomena is important for correctly using machine-learning algorithms and getting the most out of your data.
Figure 7.7 shows a sample dataset (circles) following a quadratic function. The linear model (graph on the left) isn't capable of properly modeling the data. The quadratic function in the middle is just about right, and the function with higher-order polynomials on the right overfits the dataset.

Figure 7.7 The linear model (left) underfits the dataset, a model with higher-order polynomials (right) overfits it, and a quadratic model (middle) fits nicely.


You normally want your model to fit the data in your training dataset, but also to generalize to other, presently unknown data. It's not necessarily possible to do both perfectly, and that leads us to the bias-variance tradeoff.
Bias here pertains to the model. For example, the linear model on the left in figure 7.7 has high bias: it assumes a linear relationship between the independent and target variables, so it's biased. The model on the right side has high variance, because the values it predicts oscillate more. The bias-variance tradeoff says that you can't necessarily have low bias and low variance at the same time and that you need to seek an equilibrium, or middle ground.
How do you know whether your model has high bias (it's underfitted) or high variance (it's overfitted)? Let's return to the example. Generally, overfitting occurs when the ratio of model complexity to training-set size gets large. If you have a complex model but also a relatively large training set, overfitting is less likely to occur. You saw that the RMSE on the validation set started to rise when you added higher-order polynomials and trained the model with more iterations. Higher-order polynomials bring more complexity to the model, and more iterations overfit the model to the data while the algorithm is converging. Let's see what happens if you try even more iterations:

scala> iterateLRwSGD(Array(10000, 15000, 30000, 50000), Array(1.1), ➥ trainHPScaled, validHPScaled) 10000, 1.100 -> 3.7638, 4.1553 15000, 1.100 -> 3.7441, 4.1922 30000, 1.100 -> 3.7173, 4.2626 50000, 1.100 -> 3.7039, 4.3163

You can see that the training RMSE continues to decrease while the testing RMSE continues to rise. And that's typical for an overfitting situation: the training error falls and then plateaus (which would happen with even more iterations), and the testing error falls and then starts to rise, meaning the model learns training set–specific properties instead of characteristics representative of the whole population. If you were to plot this, you'd get a graph similar to figure 7.8.

Figure 7.8 Error as a function of the number of iterations used. The test RMSE falls but then starts to rise at a certain point. Parameters corresponding to that point should be chosen for the model, because beyond it the model begins to overfit the data. (X-axis: number of iterations; Y-axis: RMSE; curves: training error and test error.)

To answer the question of which values for the number of iterations and step size to choose: choose the values corresponding to the minimum of the testing RMSE curve, at the point before it starts to rise. In this case, 400 iterations and a step size of 1.1 give very good results (testing RMSE of 3.98).


7.6.4 Plotting residual plots
But how can you tell whether you need to keep adding higher-order polynomials, or whether you need to add any in the first place? And where do you stop? Examining residual plots can help you answer those questions.
The residual is the difference between the predicted and actual values of the target variable. In other words, for a single example in the training dataset, the residual is the difference between its label's value and what the model says the label's value should be. Residual plots have residuals on the Y-axis and the predicted values on the X-axis. A residual plot should show no noticeable patterns—it should have the same height at all points on the X-axis, and if you plot a best-fit line (or curve) through the plotted values, the line should stay flat. If it shows a shape similar to the letter u (or an inverted u), that means a nonlinear model would be more appropriate for some of the dimensions.
The two residual plots for the two models (the original linear-regression model and the one with added second-order polynomials) are shown in figure 7.9. The one on the left shows the shape of an inverted u-curve. The one on the right, although still not perfect, shows an improvement: the shape is more balanced.


Figure 7.9 Residual plots for two linear-regression models. The model on the left was fitted using the original housing dataset and shows an inverted u-curve shape. The one on the right was fitted using the dataset with added second-order polynomials and shows a more balanced pattern.


As we said, the new residual plot still isn't perfect, and further dimension transformations might help, but probably not much. A line in the lower-right part of both figures is also visible. This is due to several outliers, or points that represent some kind of exception. In this case, there are several instances of expensive houses ($50,000) whose features suggest they shouldn't be that expensive. This could also be caused by a missing variable—some factor making a house expensive (aesthetics, for example) that isn't present in the dataset.
Residual plots can also help you in a number of other situations. If the plot shows a fan-in or a fan-out shape (the residuals show greater variance at one end of the plot than at the other, a phenomenon called heteroscedasticity10), one solution, in addition to adding higher-order polynomials, may be to transform the target variable logarithmically so that your model predicts log(y) (or some other function of y) and not y. Further discussion of this important topic is beyond the scope of this book. The main thing to remember is that you should always spend time studying residual plots to get further information about how your model is actually doing.
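Spark doesn't plot anything itself, but the residuals are easy to compute and hand to an external plotting tool. A sketch, reusing the validPredicts pairs from section 7.5.1:

val residuals = validPredicts.map { case (pred, label) => (pred, label - pred) }
residuals.collect()   // (prediction, residual) pairs: plot residuals (Y) against predictions (X)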

7.6.5 Avoiding overfitting by using regularization
Let's get back to overfitting. You've seen how it decreases the model's performance. You can avoid overfitting using a method called regularization, which increases the bias of your model and decreases its variance by penalizing large values in the model parameters.
Regularization adds an additional element (we denote it as β) to the cost function that penalizes complexity in the model. There are different regularization types. The most common ones are L1 and L2 regularization (named after the L1 and L2 norms discussed in section 7.4.2), and they're the ones available in Spark. Linear regression with L1 regularization is called Lasso regression, and linear regression with L2 regularization is called Ridge regression. The cost function with the regularization element β looks like this:

C(\mathbf{w}) = \frac{1}{2m} \sum_{i=1}^{m} \left( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)} \right)^2 + \beta = \frac{1}{2m} \sum_{i=1}^{m} \left( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)} \right)^2 + \lambda \lVert \mathbf{w} \rVert

β is a product of two elements: λ, the regularization parameter, and the L1 (‖w‖1) or L2 (‖w‖2) norm of the weight vector. As we said in section 7.4.2, the L1 norm is the sum of the absolute values of the vector's elements, and the L2 norm is the square root of the sum of squares of the vector's elements, which is equal to the length of the vector.
The regularization increases the error in proportion to the absolute weight values. In that way, the optimization process tries to deemphasize individual dimensions and slows down as the weights get larger. L1 regularization (Lasso regression) is more aggressive in this process: it's capable of reducing individual weights to zero and thus completely removing some of the features from the dataset.
In addition, both L1 and L2 regularization in Spark decrease the step size in proportion to the number of iterations. This means the longer the algorithm runs, the smaller the steps it takes. (This isn't related to regularization per se but is part of the L1 and L2 regularization implementations in Spark.)

10 For further information, see https://en.wikipedia.org/wiki/Heteroscedasticity.


Regularization can help you in situations where you're overfitting the model to the dataset. By increasing the regularization parameter, you can decrease overfitting. In addition, regularization can help you get to lower error values more quickly when you have many dimensions, because it decreases the influence of dimensions that have less impact on performance. But the downside is that regularization requires you to configure an extra parameter, which adds complexity to the process.
USING LASSO AND RIDGE REGRESSIONS IN SPARK
In Spark, you can set up Lasso and Ridge regression manually, by changing the regParam and updater properties of the LinearRegressionWithSGD.optimizer object, or by using the LassoWithSGD and RidgeRegressionWithSGD classes. The latter is what we did. You can find two additional methods in our online repository: iterateLasso and iterateRidge. They're similar to the iterateLRwSGD method you used before, but they take an additional regParam argument and train different models.
You can try these two methods and see the RMSE values Lasso and Ridge regression give on the dataset with the second-order polynomials you used earlier (trainHPScaled and validHPScaled), with the same step size as before and with a regularization parameter of 0.01, which gives the best results:

iterateRidge(Array(200, 400, 1000, 3000, 6000, 10000), Array(1.1), 0.01, trainHPScaled, validHPScaled) iterateLasso(Array(200, 400, 1000, 3000, 6000, 10000), Array(1.1), 0.01, trainHPScaled, validHPScaled)
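The helpers aren't reproduced here, but they boil down to configuring one of these classes directly, roughly like this sketch (shown for Ridge regression; the repository versions also set the intercept and may differ in details):

import org.apache.spark.mllib.regression.RidgeRegressionWithSGD
val algRidge = new RidgeRegressionWithSGD()
algRidge.optimizer.setNumIterations(1000).setStepSize(1.1).setRegParam(0.01)
val ridgeModel = algRidge.run(trainHPScaled)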

The results (available in our repository online) show that Ridge gives a lower RMSE than Lasso regression and that Ridge is better even than the ordinary least squares (OLS) regression used previously (3.966 for 1,000 iterations instead of 3.984 for 400 iterations). Note that the increase in test RMSE, which was happening for numbers of iterations larger than 400 and which was the effect of overfitting, also happens for Ridge and Lasso regressions. The overfitting kicks in later for Ridge regressions. If you were to increase the regularization parameter, you would see RMSE increase later, but the RMSE levels would be greater. Which regularization method and which regularization parameter you should choose is difficult to say because it depends on your dataset. You should apply an approach similar to the one you used for finding the number of iterations and the step size (train several models with different parameters and pick the ones with the lowest error). The most common method for doing this is k-fold cross-validation.

7.6.6 K-fold cross-validation
K-fold cross-validation is a method of model validation. It consists of dividing the dataset into k subsets of roughly equal size and training k models, excluding a different subset each time. The excluded subset is used as the validation set, and the union of all the remaining subsets is used as the training set.
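MLlib can generate the folds for you; a sketch (assuming the MLUtils.kFold utility, with k = 5 and an arbitrary seed):

import org.apache.spark.mllib.util.MLUtils
val folds = MLUtils.kFold(housingData, 5, 42)   // Array of (training RDD, validation RDD) pairs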


For each set of parameters you want to validate, you train all k models and calculate the mean error across all k models. Finally, you choose the set of parameters giving you the smallest average error.
Why is this important? Because fitting a model depends very much on the training and validation sets used. If you take the housing dataset, split it randomly into training and validation sets again, and then go through all the actions you did in this chapter, you'll notice that the results and parameters will be different, maybe even dramatically so. K-fold cross-validation can help you decide which of the parameter combinations to choose. We'll have more to say about k-fold cross-validation when we talk about Spark's new ML API in the next chapter.
7.7 Optimizing linear regression
We have a couple more things to say about linear-regression optimization. As you saw in earlier examples, LinearRegressionWithSGD (and its parent class GeneralizedLinearAlgorithm) has an optimizer member object you can configure. You previously used the default GradientDescent optimizer and configured it with the number of iterations and the step size.
There are two additional methods you can employ to make linear regression find the minimum of the cost function more quickly. The first is to configure the GradientDescent optimizer as a mini-batch stochastic gradient descent. The second is to use Spark's LBFGS optimizer (see section 7.7.2).

7.7.1 Mini-batch stochastic gradient descent As explained in section 7.3.3, gradient descent updates the weights in each step by going through the entire dataset. If you recall, the formula used for updating each weight parameter is this:

m 1 i i i w :wy= – ---- hx– y x j m  j i1=

This is also called batch gradient descent (BGD). In contrast, mini-batch stochastic gradient descent uses only a subset of the data in each step; instead of i going from 1 to m (the entire dataset), it only goes from 1 to k (some fraction of m). If k is equal to 1—which means the algorithm considers only one example in each step—the optimizer is called stochastic gradient descent (SGD).
Mini-batch SGD is much less computationally expensive per step, especially when parallelized, but it compensates for this with more iterations. It has more difficulty converging, but it gets close enough to the minimum (except in some rare cases). If the mini-batch size (k) is small, the algorithm is more stochastic, meaning it takes a more random route toward the cost-function minimum. If k is larger, the algorithm is more stable. In both cases, though, it reaches the minimum and can get very close to BGD results.


Let's see how to use mini-batch SGD in Spark. The same GradientDescent optimizer you used before is used for mini-batch SGD, but you need to specify an additional parameter (miniBatchFraction). miniBatchFraction takes a value between 0 and 1. If it's equal to 1 (the default), a mini-batch SGD becomes a BGD because the entire dataset is considered in each step.
Parameters for mini-batch SGD can be chosen similarly to how you did it previously, only now there is one more parameter to be configured. If a step-size parameter worked on BGD, that doesn't mean it will work on mini-batch SGD, so the parameter's value has to be chosen the same way you did it before—or, preferably, using k-fold cross-validation.

alg.optimizer.setMiniBatchFraction(miniBFraction)

The signature of the method is also different because its parameter takes three arrays: in addition to number of iterations and step sizes, an array with mini-batch fractions. The method tries all combinations of the three values and prints the results (training and testing RMSE). You can try it out on the dataset expanded with feature squares (trainHPScaled and validHPScaled RDDs). First, to get a feeling for the step-size parameter in the context of the other two, execute this command:

iterateLRwSGDBatch(Array(400, 1000), Array(0.05, 0.09, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.5, 1), Array(0.01, 0.1), trainHPScaled, validHPScaled)

The results (available online) show that a step size of 0.4 works best. Now use that value and see how the algorithm behaves when you change other parameters:

iterateLRwSGDBatch(Array(400, 1000, 2000, 3000, 5000, 10000), Array(0.4), ➥ Array(0.1, 0.2, 0.4, 0.5, 0.6, 0.8), trainHPScaled, validHPScaled)

The results (again, available online) show that 2,000 iterations are enough to get the best RMSE of 3.965, which is slightly better even than the previous best RMSE of 3.966 (for Ridge regression). The results also show that with more than 5,000 iterations, you

11 Chenxin Ma et al., “Adding vs. Averaging in Distributed Primal-Dual Optimization,” www.cs.berkeley.edu/ ~vsmith/docs/cocoap.pdf.


get into overfitting territory. This lowest RMSE was accomplished with a mini-batch fraction of 0.5. If your dataset is huge, a mini-batch fraction of 0.5 may be too large to get good performance results. You should try lower mini-batch fractions and more iterations. Some experimenting will be needed.
We can conclude that mini-batch SGD can give the same RMSE as BGD. Because of its performance improvements, you should prefer it to BGD.
7.7.2 LBFGS optimizer
LBFGS is a limited-memory approximation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm for minimizing multidimensional functions. The classic BFGS algorithm calculates an approximate inverse of the so-called Hessian matrix, which is the matrix of second-order derivatives of a function, and keeps an n × n matrix in memory, where n is the number of dimensions. LBFGS keeps fewer than 10 of the last-calculated corrections and is more memory efficient, especially for larger numbers of dimensions.

PYTHON An LBFGS regression optimizer isn’t available in Python.

LBFGS can give good performance. And it's much simpler to use, because instead of requiring the number of iterations and the step size, its stopping criterion is the convergence-tolerance parameter: it stops if the RMSE after an iteration changes by less than the value of the convergence-tolerance parameter. This is a much more natural and simpler criterion. You also need to give it the maximum number of iterations to run (in case it doesn't converge), the number of corrections to keep (this should be less than 10, which is the default), and the regularization parameter (it gives you the freedom to use L1 or L2 regularization).
You can find the iterateLBFGS method in our online repository and paste it into your Spark Scala shell to try it out like this—but before running it, you might want to set the Breeze library logging level to WARN (the snippet is available online):

iterateLBFGS(Array(0.005, 0.007, 0.01, 0.02, 0.03, 0.05, 0.1), 10, 1e-5, ➥ trainHPScaled, validHPScaled) 0.005, 10 -> 3.8335, 4.0383 0.007, 10 -> 3.8848, 4.0005 0.010, 10 -> 3.9542, 3.9798 0.020, 10 -> 4.1388, 3.9662 0.030, 10 -> 4.2892, 3.9996 0.050, 10 -> 4.5319, 4.0796 0.100, 10 -> 5.0571, 4.3579

Now, wasn't that fast? It flew by. And it was simple, too. The only parameter you needed to tweak was the regularization parameter, because the other two don't influence the algorithm much and their defaults can be used safely. Obviously, a regularization parameter of 0.02 gives the best RMSE of 3.9662. And that's almost the same as the previous best RMSE, which took great effort to get to.
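For reference, underneath a helper like iterateLBFGS, the call into MLlib's low-level optimizer looks roughly like this sketch (mirroring the documented LBFGS.runLBFGS usage, with a least-squares gradient and an L2 updater; the repository helper may differ in details):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

// Append a constant 1 to each feature vector so the intercept is learned as an extra weight
val lbfgsData = trainHPScaled.map(lp => (lp.label, MLUtils.appendBias(lp.features)))
val (weights, lossHistory) = LBFGS.runLBFGS(
  lbfgsData,
  new LeastSquaresGradient(),   // squared-error loss (linear regression)
  new SquaredL2Updater(),       // L2 regularization
  10,                           // number of corrections to keep
  1e-5,                         // convergence tolerance
  100,                          // maximum number of iterations
  0.02,                         // regularization parameter
  Vectors.zeros(lbfgsData.first()._2.size))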


7.8 Summary
■ Supervised learning uses labeled data for training. Unsupervised-learning algorithms discover the inner structure of unlabeled data through model fitting.
■ Regression and classification differ by the type of target variable: continuous (a real number) for regression and categorical (a set of discrete values) for classification.
■ Before using data for linear regression, it's a good idea to analyze its distribution and similarities. You should also normalize and scale the data and split it into training and validation datasets.
■ The root mean squared error (RMSE) is commonly used for evaluating a linear-regression model's performance.
■ The learned parameters of a linear-regression model can give you insight into how each feature affects the target variable.
■ Adding higher-order polynomials to the dataset enables you to apply linear regression to nonlinear problems and can yield better results on some datasets.
■ Increasing a model's complexity can lead to overfitting. The bias-variance tradeoff says that you can't have both low bias and low variance at the same time; you have to find a balance.
■ Ridge and Lasso regularization help reduce overfitting for linear regression.
■ Mini-batch stochastic gradient descent optimizes the performance of the linear-regression algorithm.
■ The LBFGS optimizer in Spark takes much less time to train and offers great performance.

ML: classification and clustering

This chapter covers
■ The Spark ML library
■ Logistic regression
■ Decision trees and random forests
■ K-means clustering

In the previous chapter, you got acquainted with Spark MLlib (Spark's machine learning library), with machine learning in general, and with linear regression, the most important method of regression analysis. In this chapter, we'll cover two equally important fields in machine learning: classification and clustering.
Classification is a subset of supervised machine-learning algorithms in which the target variable is categorical, which means it takes only a limited set of values. So the task of classification is to categorize input examples into several classes. Recognizing handwritten letters is a classification problem, for example, because each input image needs to be labeled as one of the letters of an alphabet. Recognizing a sickness a patient may have, based on their symptoms, is a similar problem.



Clustering also groups input data into classes (called clusters), but as an unsupervised learning method, it has no properly labeled data to learn from and has to figure out on its own what constitutes a cluster. You could, for example, use clustering for grouping clients by their habits or characteristics (client segmentation) or recognizing different topics in news articles (text categorization).
For classification tasks in Spark, you have logistic regression, naïve Bayes, support vector machines (SVM), decision trees, and random forests at your disposal. They all have their pluses and minuses and different logic and theory behind them. We'll cover logistic regression, decision trees, and random forests in this chapter, as well as k-means clustering, the most often used clustering algorithm. We don't have enough space in this book to cover naïve Bayes, SVM, and other Spark clustering algorithms such as power iteration clustering, Gaussian mixture models, and latent Dirichlet allocation. We'll also have to skip other machine learning methods, such as recommendations with alternating least squares, text mining, and frequent item sets. We'll keep those for some other book and some other time.
As you may remember from the previous chapter, Spark has two machine learning libraries: MLlib, which you used in chapter 7, and the new ML library. They're both being actively developed, but currently the focus of development is more on the ML library. For the most part, you'll use the ML library in this chapter to see how it's used and how it differs from MLlib.
In section 8.1, we'll give you an overview of the ML library; in section 8.2 you'll use it for classification with logistic regression, a well-known classification algorithm. In section 8.3, you'll learn how to use Spark's decision trees and random forests, two algorithms that can be used for both classification and regression. In section 8.4, you'll use a k-means clustering algorithm for clustering sample data. We'll explain the theory behind these algorithms along the way. Let's get started.
8.1 Spark ML library
The Spark ML library was introduced in Spark 1.2. The motivation for a new machine learning library came from the fact that MLlib wasn't scalable and extendable enough, nor was it sufficiently practical for use in real machine learning projects. The goal of the new Spark ML library is to generalize machine learning operations and streamline machine learning processes. Influenced by Python's scikit-learn library,¹ it introduces several new abstractions (estimators, transformers, and evaluators) that can be combined to form pipelines. All of them, including pipelines, can be parameterized with ML parameters in a general way.
Spark ML ubiquitously uses DataFrame objects to represent datasets. This is why the old MLlib algorithms can't simply be upgraded: the Spark ML architecture requires structural changes, so new implementations of the same algorithms are necessary. At the time of writing, the old MLlib library still offers a richer set of algorithms than ML,

1 For more information, see the Spark “Pipelines and Parameters” design document at http://mng.bz/22lY and the corresponding JIRA ticket at https://issues.apache.org/jira/browse/SPARK-3530.


but that is bound to change soon. Because estimators, transformers, evaluators, ML parameters, and pipelines are the main components of Spark ML, let’s examine them more closely.

8.1.1 Estimators, transformers, and evaluators
In Spark, you use transformers (no, not those Transformers) to implement machine learning components that convert one dataset to another. Machine learning models in Spark ML are transformers because they transform datasets by adding predictions. The main method for this is transform, which takes a DataFrame and an optional set of parameters.
Estimators produce transformers by fitting on a dataset. A linear regression algorithm produces a linear regression model with fitted weights and an intercept, which is a transformer. The main method for using estimators is fit, which also takes a DataFrame and an optional set of parameters.
Evaluators evaluate the performance of a model based on a single metric. For instance, regression evaluators can use RMSE and R2 as metrics. Figure 8.1 shows the operations of transformers, estimators, and evaluators graphically. You'll see examples of transformers, estimators, and evaluators throughout this chapter.
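To make the division of labor concrete, here's a minimal sketch of the pattern. The DataFrames named training and validation (with features and label columns) are placeholders assumed for illustration, not part of the book's examples:

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator

val linreg = new LinearRegression()            // an estimator
val model = linreg.fit(training)               // fit produces a transformer (the fitted model)
val predictions = model.transform(validation)  // transform adds a prediction column
val evaluator = new RegressionEvaluator().setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)     // an evaluator reduces predictions to a single metric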

Figure 8.1 Operations of estimators, transformers, and evaluators

8.1.2 ML parameters
Specifying parameters for estimators and transformers is generalized in Spark ML, so all parameters can be specified in the same way with the Param, ParamPair, and ParamMap classes. Param describes parameter types: it holds the parameter name, its class type, a parameter description, a function for validating the parameter, and an optional default value. ParamPair contains a parameter type (a Param object) and its value. ParamMap contains a set of ParamPair objects.
You pass ParamPair or ParamMap objects to the fit or transform methods of estimators and transformers, or you can set the parameters using the specific, named setter methods. You can, for example, call setRegParam(0.1) on a LinearRegression object named linreg, or you can pass a ParamMap(linreg.regParam -> 0.1) object to its fit method.
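As a sketch only, the two equivalent ways of setting that parameter might look like this (linreg and training are assumed to exist, for example as in the previous sketch):

import org.apache.spark.ml.param.ParamMap

linreg.setRegParam(0.1)                // option 1: the named setter method
val m1 = linreg.fit(training)

val m2 = linreg.fit(training,          // option 2: a ParamMap passed to fit,
  ParamMap(linreg.regParam -> 0.1))    // overriding any previously set value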

8.1.3 ML pipelines
As you saw in chapter 7, in machine learning the same steps are often repeated in the same sequence over and over again with slightly different parameters to find those that yield the best results (the lowest error or some other metric). In chapter 7 you trained a linear regression model several times, each time with a different set of parameters. Then you added higher-order polynomials to the dataset and trained the model several times again.
Instead of doing this manually each time, Spark ML lets you create a Pipeline object with two stages. The first stage transforms the dataset by adding higher-order polynomials; the PolynomialExpansion transformer in the ML library does this, using the polynomial degree as a parameter. The second stage performs linear regression analysis. You can then treat the entire pipeline as a single estimator, which produces a PipelineModel. The PipelineModel also has two stages: a polynomial-expansion step and the fitted linear regression model. You can use it on the validation data to see how well it performs. Each time you fit the pipeline, you give it a different set of parameters (a ParamMap that contains parameters for both stages) and then choose the set that gives you the best results. This is much simpler when you have several steps in the process.
This was a brief description of the ML API. You'll see full examples in the following sections.
8.2 Logistic regression
As you may remember, your goal in chapter 7 was to predict median home prices in Boston suburbs. That is a typical example of regression analysis because the goal is to find a single value based on a set of input variables. The goal of classification, on the other hand, is to classify input examples (consisting of input variables' values) into two or more classes.
You can easily transform the problem of predicting median home prices into a classification problem if you want to predict whether the mean price is greater than some fixed amount (let's say $30,000²).

2 $30,000 may seem inexpensive, but bear in mind that the Boston housing dataset was created in 1978.


Then the target variable takes only two possible values: 1, if the price is greater than $30,000; or 0, if it's less than $30,000. This is called a binary response because there are only two possible classes. Logistic regression outputs probabilities that a certain example belongs to a certain class.
The binary-response example is an example of binary logistic regression. We'll describe its model in the next section and then show you how to train and use a logistic-regression model in Spark. We'll also talk about how to evaluate classification results in general, using logistic regression as an example. At the end of this section, we'll expand the logistic regression model to multiclass logistic regression, which can classify examples into more than two classes.

8.2.1 Binary logistic regression model
A linear regression model trained on the home prices dataset gives you a single number that you can then treat either as a 1 (if it's larger than some threshold) or as a 0 (if it's smaller than the threshold). Linear regression can give you very good results in binary-classification problems like this one, but it wasn't designed for predicting categorical variables. Classification methods such as logistic regression instead output a probability p(x) that an example x (a vector) belongs to a specific category.
Probabilities lie in the range from 0 to 1 (corresponding to 0% and 100% probabilities), but linear regression outputs values outside of these boundaries. So, in logistic regression, instead of modeling the probability p(x) with the linear equation

T px==w0 +++w1x1  wnxn w x

it’s modeled with the so-called logistic function (from which logistic regression gets its name):

wTx px==------e ------1 - wTx –wTx 1e+ 1e+

The result of plotting this function for two different sets of weights w is shown in figure 8.2. The plot on the left side of the figure shows the basic logistic function, where w0 is 0 and w1 is 1. The one on the right corresponds to a different set of weights, where w0 is 4 and w1 is -2. You can see that the weight parameter w0 moves the step of the function left or right along the X-axis, and the weight parameter w1 changes the slope of the step and also influences its horizontal position. Because the logistic function always gives values between 0 and 1, it's better suited for modeling probabilities. By further manipulating the logistic function, you can arrive at the following:

\frac{p(x)}{1 - p(x)} = e^{w^T x}


Figure 8.2 A logistic function gives values in the range from 0 to 1 for any input value. This is ideal for modeling probabilities. The plot on the left is a logistic function with weights w0 of 0 and w1 of 1. The plot on the right corresponds to the weights w0 of 4 and w1 of -2.
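If you want to reproduce the shapes in figure 8.2 yourself, a few values of the logistic function are easy to compute in the Scala shell (plain arithmetic, not part of Spark or of the book's listings):

def logistic(w0: Double, w1: Double)(x: Double) =
  1.0 / (1.0 + math.exp(-(w0 + w1 * x)))   // p(x) for a single feature x

val left  = logistic(0, 1) _               // the left plot: w0 = 0, w1 = 1
val right = logistic(4, -2) _              // the right plot: w0 = 4, w1 = -2
Seq(-6.0, 0.0, 6.0).map(left)              // ≈ 0.0025, 0.5, 0.9975
Seq(-6.0, 0.0, 6.0).map(right)             // ≈ 1.0, 0.982, 0.0003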

The expression on the left side of this equation is known as the odds. You may have heard a gambler say something similar to “the odds are three to one.” This is expressed by the odds formula: the probability of an event happening divided by the probability of the opposite. After taking the natural logarithm of the equation, you arrive at the following expression:

\ln \frac{p(x)}{1 - p(x)} = w^T x

The expression on the left side of the equation becomes the logit (or log-odds) of the logistic function; and, as you can see, it linearly depends on x. Note that the vector w contains an intercept w0 and that x0 equals 1. If you think about it, p(x) is actually the probability that input example x belongs to category 1 (the mean home price is greater than $30,000), so 1 - p(x) equals the probability of the opposite case (the mean home price is less than $30,000). This can be written in the following way, using conditional-probability notation:

p(x) = p(y=1 \mid x; w), \qquad 1 - p(x) = p(y=0 \mid x; w)

The right side of the first equation can be read as "the probability that the category is 1, given the example x and parameterized by the vector of weights w."
Optimal values for the weight parameters in logistic regression are determined by maximizing the so-called likelihood function, which gives you the joint probability (the probability that a set of events happens at the same time) of correctly predicting the labels of all examples in the dataset. You want the predicted probability (given by the logistic function and parameterized by the weight values) to be as close as possible to 1 for the


examples that are labeled as such, and as close as possible to 0 for those that aren’t. Expressing this mathematically yields the following equation:

Lw== pxi  1px– j py1 x1;w py2 x2;w pyn xn;w i:yi = 1 j:yj = 0

Taking a natural logarithm of the last expression gives you the log-likelihood function (or log-loss), which is easier to maximize, so it’s used as the cost function for logistic regression:

Lw= ln p y1 x1;w + p y2 x2;w +  + p yn xn;w

If you work out the math, the log-likelihood function gets reduced to the following:

\ell(w) = \sum_{i=1}^{m} \Bigl[ y_i\, w^T x_i - \ln\bigl(1 + e^{w^T x_i}\bigr) \Bigr]

Now that you have the cost function, gradient descent can be applied to finding its optimum, similar to how it's done in linear regression as described in chapter 7. In fact, many of the methods you used for linear regression in the previous chapter, such as L1 and L2 regularizations and LBFGS optimization, can be used for logistic regression, too. The partial derivative of the log-likelihood function with respect to the j-th weight parameter wj (necessary for performing gradient descent) is the following:

T m w x  e i ------lw= y – ------x  i T ij wj w x i1= 1e+ i

But that’s more math than you need in order to use logistic regression in Spark. Let’s see how to do that.

8.2.2 Preparing data to use logistic regression in Spark
In this section, you'll load an example dataset, clean up the data, and package it so that it's usable by the Spark ML API. In the next section, you'll use this data to train a logistic-regression model.
The example dataset that you'll use for logistic regression is the well-known adult dataset (http://archive.ics.uci.edu/ml/datasets/Adult), extracted from the 1994 United States census data. It contains 13 attributes³ with data about a person's sex,

3 We removed the education-num column because it’s a transformation of the education column and as such contains no extra information.

age, education, marital status, race, native country, and so on, and the target variable (income). The goal is to predict whether a person earns more or less than $50,000 per year (the income column contains only the values 1 and 0).
The first step is to download the dataset (you'll find the adult.raw file in our online repository, which you should have cloned by now) and load it into your Spark shell. Start your Spark shell on a local cluster, taking all the available CPU cores (you can also run it on Mesos or YARN, if you want, but we'll assume you're running a local cluster in the spark-in-action VM):

$ cd /home/spark
$ spark-shell --master local[*]

Load the dataset with the following command (the third line converts to doubles all values that can be converted; others are left as strings):

val census_raw = sc.textFile("first-edition/ch08/adult.raw", 4).
  map(x => x.split(", ")).
  map(row => row.map(x => try { x.toDouble } catch { case _ : Throwable => x }))

Let's first examine the data by loading it into a DataFrame (DataFrames should be familiar to you from chapter 5):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val adultschema = StructType(Array(
  StructField("age",DoubleType,true),
  StructField("workclass",StringType,true),
  StructField("fnlwgt",DoubleType,true),
  StructField("education",StringType,true),
  StructField("marital_status",StringType,true),
  StructField("occupation",StringType,true),
  StructField("relationship",StringType,true),
  StructField("race",StringType,true),
  StructField("sex",StringType,true),
  StructField("capital_gain",DoubleType,true),
  StructField("capital_loss",DoubleType,true),
  StructField("hours_per_week",DoubleType,true),
  StructField("native_country",StringType,true),
  StructField("income",StringType,true)
))
val dfraw = sqlContext.createDataFrame(census_raw.map(Row.fromSeq(_)), adultschema)

DEALING WITH MISSING VALUES
There are several small problems (let's call them "challenges") here. First, if you examine the data (by listing the first 20 rows with dfraw.show(), for example), you'll see that some columns have missing values (marked as "?"). You have several options for dealing with missing data:


■ If a lot of data is missing from a column, you can remove the entire column from the dataset, because that column (feature) can negatively affect the results.
■ You can remove the individual examples (rows) from the dataset if they contain too many missing values.
■ You can set the missing data to the most common value in the column.
■ You can train a separate classification or regression model and use it to predict the missing values.
The last option is obviously the most involved and time-consuming, so you'll implement the third option by counting all the values and using the most frequent ones. Missing values occur in only three columns: workclass, occupation, and native_country. Let's examine the count of individual values in the workclass column:

scala> dfraw.groupBy(dfraw("workclass")).count().rdd.foreach(println)
[?,2799]
[Self-emp-not-inc,3862]
[Never-worked,10]
[Self-emp-inc,1695]
[Federal-gov,1432]
[State-gov,1981]
[Local-gov,3136]
[Private,33906]
[Without-pay,21]

You can see that the value Private occurs most often in the workclass column. For the occupation column, the value Prof-specialty is the most common. For the native_country column it is, not surprisingly, United-States. You can now use this information to impute (which is the official term) the missing values with the DataFrameNaFunctions class, available through the DataFrame's na field:

val dfrawrp = dfraw.na.replace(Array("workclass"), Map("?" -> "Private"))
val dfrawrpl = dfrawrp.na.replace(Array("occupation"), Map("?" -> "Prof-specialty"))
val dfrawnona = dfrawrpl.na.replace(Array("native_country"), Map("?" -> "United-States"))
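For completeness, DataFrameNaFunctions can also fill or drop missing values, but those operations work on null values rather than on the "?" strings used in this dataset. The following is therefore only a sketch of the API, with hypothetical defaults, for a DataFrame that really contains nulls:

val noMissing   = dfraw.na.drop()       // drop rows that contain any null value
val mostlyThere = dfraw.na.drop(13)     // keep rows with at least 13 non-null columns
val filled      = dfraw.na.fill(Map(    // per-column defaults (hypothetical values)
  "capital_gain" -> 0.0,
  "workclass"    -> "Private"))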

The replace method takes an array of column names and replaces values as specified by the map in the second argument. DataFrameNaFunctions can also fill missing (null) values with several versions of the fill method and drop rows if they contain a certain number of missing values (implemented by several versions of the drop method).⁴
DEALING WITH CATEGORICAL VALUES
That's settled now, but you still have a challenge: most of the values in the dfrawnona DataFrame are string values, and classification algorithms can't handle them. So you first need to transform the data to numeric values. But even after you do that, you'll

4 For further information, check the official documentation at http://mng.bz/X3Zg.


still have a problem, because a numeric encoding ranks categories by their numeric values, and it's often not clear how they should be ranked. If you encode the marital status field's values (separated, divorced, never married, widowed, married; we omit a few other possible values here) with integer values from 0 to 4, would that be a realistic interpretation of their meanings? Is never married "greater than" separated? No, it isn't. So a technique known as one-hot encoding is more commonly used.
In one-hot encoding, a column is expanded to as many columns as there are distinct values in it so that, for a single row, only one of the columns contains a 1 and all the others contain 0s. For the marital status column example (shown in figure 8.3), the column gets expanded to 5 columns (values from 0 to 4), and if a row contains the value married, the new columns contain the values 0, 0, 0, 0, 1. In this way, all possible values become equally important.

Marital status   Married  Divorced  Separated  Widowed  Never married
Separated        0        0         1          0        0
Divorced         0        1         0          0        0
Widowed          0        0         0          1        0
Married          1        0         0          0        0
Widowed          0        0         0          1        0
Separated        0        0         1          0        0
Never married    0        0         0          0        1
Married          1        0         0          0        0

Figure 8.3 One-hot encoding of the marital status column. It gets expanded into five new columns, each containing only ones and zeros. Each row contains only one 1 in the column corresponding to the original column value.

Three classes in the new Spark ML library can help you deal with categorical values:
■ StringIndexer
■ OneHotEncoder
■ VectorAssembler
USING STRINGINDEXER
StringIndexer helps you convert String categorical values into integer indexes of those values. StringIndexer takes a DataFrame and fits a StringIndexerModel, which is then used for transformations of a column. You have to fit as many StringIndexerModels as there are columns you want to transform. We wrote a method that does this for you:

import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}

def indexStringColumns(df:DataFrame, cols:Array[String]):DataFrame = {
  var newdf = df
  for(col <- cols) {
    val si = new StringIndexer().setInputCol(col).setOutputCol(col+"-num")



    // Fit a StringIndexerModel for each column in the cols argument
    val sm:StringIndexerModel = si.fit(newdf)
    // Put the transformed values in a new column with the suffix "-num" and drop the old column
    newdf = sm.transform(newdf).drop(col)
    // Rename the new column to the old name
    newdf = newdf.withColumnRenamed(col+"-num", col)
  }
  newdf
}
// Transform the columns of the dfrawnona DataFrame to numeric values
val dfnumeric = indexStringColumns(dfrawnona, Array("workclass", "education",
  "marital_status", "occupation", "relationship", "race", "sex", "native_country", "income"))

StringIndexerModel also adds metadata to the columns it transforms. This metadata contains information about the type of values a column contains (binary, nominal, numeric). Some algorithms depend on this metadata.
ENCODING THE DATA WITH ONEHOTENCODER
The second class that helps with data preparation is OneHotEncoder, which one-hot-encodes a column and puts the result into a new column as a one-hot-encoded sparse Vector. Here, we provide the method oneHotEncodeColumns that you can use to one-hot-encode an arbitrary number of columns. You provide a DataFrame object and a list of numeric columns, and it will replace each column with a Vector of one-hot-encoded values:

import org.apache.spark.ml.feature.OneHotEncoder

def oneHotEncodeColumns(df:DataFrame, cols:Array[String]):DataFrame = {
  var newdf = df
  for(c <- cols) {
    // Create a OneHotEncoder for each specified column
    val onehotenc = new OneHotEncoder().setInputCol(c)
    onehotenc.setOutputCol(c+"-onehot").setDropLast(false)
    // Create the new column and drop the old one
    newdf = onehotenc.transform(newdf).drop(c)
    // Rename the new column to the old name
    newdf = newdf.withColumnRenamed(c+"-onehot", c)
  }
  newdf
}
val dfhot = oneHotEncodeColumns(dfnumeric, Array("workclass", "education",
  "marital_status", "occupation", "relationship", "race", "native_country"))

MERGING THE DATA WITH VECTORASSEMBLER
The final step is to merge all these new Vectors and the original columns into a single Vector column containing all the features. Spark ML algorithms work with two columns named features and label by default. If you recall from chapter 7, MLlib algorithms work with RDDs containing LabeledPoint objects. If you convert an RDD containing LabeledPoints into a DataFrame (using the toDF method), the resulting DataFrame contains two columns: features and label. So we can say this is the DataFrame equivalent of LabeledPoint.


This is where a third helpful class, VectorAssembler, comes into play. It takes a number of column names (as the inputCols parameter), an output column name (as the outputCol parameter), and a DataFrame and then assembles the values from all input columns into the output column. The final step is to use VectorAssembler. The input columns are all the columns of the dfhot DataFrame, minus the income column:

import org.apache.spark.ml.feature.VectorAssembler

val va = new VectorAssembler().setOutputCol("features").
  setInputCols(dfhot.columns.diff(Array("income")))

After the transformation, you still need to rename the income column to label:

val lpoints = va.transform(dfhot).select("features", "income").
  withColumnRenamed("income", "label")

VectorAssembler also adds metadata about the features it assembles. Again, some algorithms depend on this metadata. Now that you have a DataFrame with labeled points, you can finally move on to fitting a logistic regression model.

8.2.3 Training the model
As with any machine learning model, you have to train it on the prepared data. This means the algorithm needs to find the model parameters that fit the data as well as possible.
Logistic-regression models in Spark can be trained with the MLlib classes LogisticRegressionWithSGD and LogisticRegressionWithLBFGS (these give you an MLlib LogisticRegressionModel object), and with the new ML API class LogisticRegression (this gives you an ML LogisticRegressionModel object). As we said, you'll use the new ML API in this chapter.
You'll use the same principle of dividing the dataset into training and validation sets that you used in the previous chapter. DataFrames also provide a randomSplit method for this purpose, the same as RDDs:

val splits = lpoints.randomSplit(Array(0.8, 0.2))
val adulttrain = splits(0).cache()
val adultvalid = splits(1).cache()

You'll use the training set to fit your models and then use the validation set to test their performance. Notice that the sets are cached in memory: this is important for machine learning algorithms that are iterative in nature and reuse the same dataset many times. To train a logistic-regression model, set the parameters on a LogisticRegression object and call its fit method, passing in a DataFrame:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression
lr.setRegParam(0.01).setMaxIter(500).setFitIntercept(true)
val lrmodel = lr.fit(adulttrain)


As we said in section 8.1, you can also set the parameters with the fit method:

val lrmodel = lr.fit(adulttrain, ParamMap(lr.regParam -> 0.01, lr.maxIter -> 500, lr.fitIntercept -> true))

Logistic regression in Spark ML uses the LBFGS algorithm (which you encountered in chapter 7) to minimize the loss function because it converges quickly and is easier to use. The logistic-regression implementation in Spark also automatically scales the features.
INTERPRETING THE MODEL PARAMETERS
You can now inspect the model parameters that the algorithm found:

scala> lrmodel.weights
res0: org.apache.spark.mllib.linalg.Vector = [0.02253347752531383,
5.79891265368467E-7,1.4056502945663293E-4,5.405187982713647E-4,
0.025912049724868744,-0.5254963078098936,0.060803010946022244,
-0.3868418367509028,...
scala> lrmodel.intercept
res1: Double = -4.396337959898011

Your results may be somewhat different. What do these numbers mean? In the previous chapter, the trained linear regression model gave you weights whose magnitude directly corresponded to the importance of particular features: in other words, to the influence they had on the target variable. In logistic regression, we’re interested in the probability that a sample is in a certain category, but a model’s weights don’t linearly influence that probability. Instead, they linearly influence the log-odds given by this equation (repeated from section 8.2.1):

\ln \frac{p(x)}{1 - p(x)} = w^T x

From the log-odds equation, the equation for calculating the odds follows:

\frac{p(x)}{1 - p(x)} = e^{w^T x}

To see how individual weight parameters influence the probability, let's see what happens if you increase a single feature (let's say x1) by 1 and leave all other values the same. It can be shown that this is equal to multiplying the odds by e^w1. If you take the age dimension as an example, the corresponding weight parameter is 0.0225335 (rounded). e^0.0225335 equals 1.0228, which means increasing age by 1 increases the odds of the person earning more than $50,000 per year by 2.28%. That is, in a nutshell, the way to interpret logistic-regression parameters.
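You can verify this quickly in the shell (the weight value is the rounded one from this particular run; yours may differ):

val wAge = 0.0225335
math.exp(wAge)   // ≈ 1.0228: each extra year of age multiplies the odds by about 1.0228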


8.2.4 Evaluating classification models
Now that you have a trained model, you can see how well it performs. First, you use the logistic-regression model lrmodel, which is a Transformer, to transform the validation dataset; then you use BinaryClassificationEvaluator to evaluate the model's performance:

val validpredicts = lrmodel.transform(adultvalid)

The validpredicts DataFrame now contains the label and features columns from the adultvalid DataFrame, along with a few additional columns:

scala> validpredicts.show()
+--------------------+-----+----------------+----------------+----------+
|            features|label|   rawPrediction|     probability|prediction|
+--------------------+-----+----------------+----------------+----------+
|(103,[0,1,2,4,5,6...|  0.0|[1.00751014104...|[0.73253259721...|       0.0|
|(103,[0,1,2,4,5,6...|  0.0|[0.41118861448...|[0.60137285202...|       0.0|
|(103,[0,1,2,4,5,6...|  0.0|[0.39603063020...|[0.59773360388...|       0.0|
...

The probability column contains vectors with two values: the probability that the sample isn't in the category (the person is making less than $50,000) and the probability that it is. These two values always add up to 1. The rawPrediction column also contains vectors with two values: the log-odds that a sample doesn't belong to the category and the log-odds that it does. These two values are always opposite numbers (they add up to 0).
The prediction column contains 1s and 0s, indicating whether a sample is likely to belong to the category. A sample is likely to belong to the category if its probability is greater than a certain threshold (0.5 by default). The names of all these columns (including features and label) can be customized using parameters (for example, outputCol, rawPredictionCol, probabilityCol, and so on).
USING BINARYCLASSIFICATIONEVALUATOR
To evaluate the performance of your model, you can use the BinaryClassificationEvaluator class and its evaluate method:

scala> val bceval = new BinaryClassificationEvaluator()
bceval: org.apache.spark.ml...
scala> bceval.evaluate(validpredicts)
res0: Double = 0.9039934862200736

But what does this result mean, and how is it calculated? You can check the metric the BinaryClassificationEvaluator used by calling the getMetricName method:

scala> bceval.getMetricName
res1: String = areaUnderROC


This metric is known as the "area under the receiver operating characteristic curve." BinaryClassificationEvaluator can also be configured to calculate the "area under the precision-recall curve" by setting setMetricName("areaUnderPR").
Great: you know the names of the metrics, but you don't know what they mean yet. To understand these terms, you first need to understand precision and recall.
PRECISION AND RECALL
In chapter 7, you used the RMSE to evaluate the performance of linear regression. When evaluating classification results, which are nominal values, this method isn't appropriate. Instead, you use metrics based on counting good and bad predictions. To evaluate your model's performance, you can count true positives (TP), the examples the model correctly classified as positive, and false positives (FP), those it predicted as positive but that are actually negative. Analogously, there are true negatives (TN) and false negatives (FN). From these four numbers, the precision (P) and recall (R) measures are calculated like this:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}

In other words, precision is the percentage of true positives among all the examples your model labeled as positive. Recall is the percentage of all positives that your model identified (or recalled). Recall is also called sensitivity, the true positive rate (TPR), and the hit rate. So, if your model predicts only 0s, it has both precision and recall equal to 0, because there are no true or false positives. But if it predicts only 1s, recall is equal to 1 and precision will depend on the dataset: if there is a large percentage of positives in the dataset, precision will also be close to 1, which is misleading.
This is why a measure called the f-measure, or f1-score, is more commonly used. It's calculated as the harmonic mean of precision and recall:

f_1 = \frac{2PR}{P + R}

The f1-score will be 0 if either precision or recall is 0, and it will be close to 1 if both of them are.
PRECISION-RECALL CURVE
A precision-recall (PR) curve is obtained when you gradually change the probability threshold at which your model determines whether a sample belongs to a category (say, from 0 to 1), and at each point you calculate precision and recall. Then you plot the obtained values on the same graph (precision on the Y-axis and recall on the X-axis). If you increase the probability threshold, there will be fewer false positives, so precision will go up; but recall will go down, because fewer positives in the dataset will be identified.

If you decrease the threshold, precision will go down, because many more positives (true and false) will be identified, but recall will go up.
PYTHON The BinaryClassificationMetrics class in Python doesn't provide methods for calculating precision and recall values at different threshold values. It does provide the areaUnderPR and areaUnderROC metrics, though.
You can change the threshold of your model using the setThreshold method. The BinaryClassificationMetrics class from the MLlib library can calculate precision and recall for an RDD containing tuples with predictions and labels. We wrote a small method called computePRCurve (you can find it in our online repository) that outputs precision and recall for 11 values of the threshold from 0 to 1. The results are available in our online repository. The resulting plot of the PR curve is shown in figure 8.4.

Figure 8.4 The precision-recall curve for the example model
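The computePRCurve method itself lives in the online repository; the idea behind it can be sketched with MLlib's BinaryClassificationMetrics directly. The snippet below is an illustration only: it assumes the validpredicts DataFrame from earlier and a Spark version where the probability column holds an MLlib Vector.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector

// Build (probability of the positive class, label) pairs from the predictions
val probsAndLabels = validpredicts.select("probability", "label").rdd.
  map(row => (row.getAs[Vector](0)(1), row.getDouble(1)))
val bcm = new BinaryClassificationMetrics(probsAndLabels, 10)  // 10 threshold bins
bcm.precisionByThreshold.collect()   // (threshold, precision) pairs
bcm.recallByThreshold.collect()      // (threshold, recall) pairs
bcm.areaUnderPR()                    // area under the PR curve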

Finally, the area under the PR curve is one of the two metrics available when using the BinaryClassificationEvaluator class we mentioned previously. In this case, the area under the PR curve is 0.7548, which isn't a bad result.
RECEIVER OPERATING CHARACTERISTIC CURVE
When using BinaryClassificationEvaluator, the second available metric (and the default one) is the area under the receiver operating characteristic (ROC) curve. The ROC curve is similar to the PR curve, but it has recall (TPR) plotted on its Y-axis and the false positive rate (FPR) plotted on its X-axis. FPR is calculated as the percentage of false positives out of all negative samples:

FPR = \frac{FP}{FP + TN}

In other words, FPR measures the percentage of all negative samples that your model wrongly classified as positive. The ROC curve for the example model is shown in figure 8.5. The data for it was generated using the computeROCCurve method in our online repository.


An ideal model would have a low FPR (low number of false positives) and a high TPR (low number of false negatives), and the matching ROC curve would pass close to the upper-left corner. A ROC curve close to the diagonal is a sign of a model giving almost random results. If the model places FPR and TPR values in the bottom-right corner, the model can be inverted to get it to output more correct results. The ROC curve shown in figure 8.5 is pretty good. As you’ve already seen, the matching area under the curve is 0.904, which is pretty high.

Figure 8.5 The ROC curve for the example model

PR curves can give more relevant results than ROC curves when your dataset has a small percentage of positive samples. Both ROC curves and PR curves are used to compare different models.

8.2.5 Performing k-fold cross-validation
We briefly touched on k-fold cross-validation in chapter 7, and we promised we would say more about it later. Here's where we do that.
In general, using cross-validation helps you validate the performance of your model more reliably because it validates the model several times and returns the average as the final result. In this way, it's less likely to overfit one particular view of the data. As we said in the last chapter, k-fold cross-validation consists of dividing the dataset into k subsets of equal size and training k models, excluding a different subset each time. The excluded subset is used as the validation set, and all other subsets are used together as the training set. This is shown in figure 8.6.
For each set of parameters you want to validate, you train all k models and then calculate the mean error across all k models (as in figure 8.6). Finally, you choose the set of parameters giving you the smallest average error.



Figure 8.6 An example three-fold cross-validation. The dataset is divided into three subsets that are used to train three models with the same parameters. The average evaluation result is taken as a measure of the model’s performance with the selected parameters.

The CrossValidator class in Spark ML can automate this for you. You give it an estimator (a LogisticRegression object, for example) and an evaluator (a BinaryClassificationEvaluator) and then set the number of folds it is to use (the default is 3):

import org.apache.spark.ml.tuning.CrossValidator

val cv = new CrossValidator().setEstimator(lr).
  setEvaluator(bceval).setNumFolds(5)

CrossValidator takes several sets of parameters (an array of ParamMaps) in the setEstimatorParamMaps method. It performs k-fold cross-validation for each of those ParamMap objects. Another class, ParamGridBuilder, makes it easy to generate combinations of parameters as an array of ParamMaps. You add grids with sets of values for individual parameters and then build the complete grid, like this:

scala> val paramGrid = new ParamGridBuilder().
  addGrid(lr.maxIter, Array(1000)).
  addGrid(lr.regParam, Array(0.0001, 0.001, 0.005, 0.01, 0.05, 0.1)).
  build()

To perform logistic regression with LBFGS, only the regularization parameter is relevant, so only the regParam parameter is varied in the parameter grid, and the maximum number of iterations is kept constant at 1,000. Finally, the parameter grid is fed to the CrossValidator:

scala> cv.setEstimatorParamMaps(paramGrid)

When you call the fit method of the cv CrossValidator, it fits the necessary models and returns the best one, as measured by the bceval evaluator (this may take some time):

scala> val cvmodel = cv.fit(adulttrain)


The returned model is of type CrossValidatorModel, and you can access the selected logistic-regression model through the bestModel field:

scala> cvmodel.bestModel.asInstanceOf[LogisticRegressionModel].coefficients
res0: org.apache.spark.mllib.linalg.Vector = [0.0248435418248564,
7.555156155398289E-7,3.1447428691767557E-4,6.176181173588984E-4,
0.027906992593851074,-0.7258527114344593,...

Furthermore, to find out which regularization parameter was selected as best, you can access the bestModel’s parent (you can’t do this in Python):

scala> cvmodel.bestModel.parent.asInstanceOf[LogisticRegression].getRegParam
res1: Double = 1.0E-4

A regularization parameter of 0.0001 gives the best results. You can now test the model's performance on the validation dataset:

scala> new BinaryClassificationEvaluator().
  evaluate(cvmodel.bestModel.transform(adultvalid))
res2: Double = 0.9073005687252869

As you can see, CrossValidator makes it simple to perform k-fold cross-validation. Although you can't use it for model comparison across different algorithms, it speeds up comparison across different sets of parameters.
8.2.6 Multiclass logistic regression
As we said earlier, multiclass classification means a classifier categorizes input examples into several classes. Spark ML's logistic regression doesn't support multiclass classification at this time, but you can use MLlib's LogisticRegressionWithLBFGS to perform it. We don't have space to get into LogisticRegressionWithLBFGS here,⁶ but we'll show you another method for performing multiclass classification using binary classification models. It's called the one vs. rest strategy.
When using the one vs. rest strategy, you train one model per class, each time treating all other classes (the rest) as negatives. Then, when classifying new samples, you classify them using all the trained models and pick the class corresponding to the model that gives the highest probability. Spark ML provides the OneVsRest class precisely for this purpose. It produces a OneVsRestModel that you can use for dataset transformation. Because a multiclass evaluator doesn't exist in the Spark ML library at the time of writing, you'll use the MulticlassMetrics class from MLlib.
We'll show you how to use these classes on an example dataset containing data extracted from scaled images of handwritten numbers. It's a public dataset available from the UCI machine learning repository,⁷ containing 10,992 samples of handwritten digits from 0 to 9. Each sample contains 16 pixels with intensity values of 0–100.

⁶ You can find an example in the official Spark documentation: http://mng.bz/Ab91.
⁷ You can find it at http://mng.bz/9jHs.


PYTHON OneVsRest isn’t available in Python.

To begin, download the penbased.dat file from our online repository and load it into Spark in much the same way you did the adult dataset (see section 8.2.2):

val penschema = StructType(Array(
  StructField("pix1",IntegerType,true), StructField("pix2",IntegerType,true),
  StructField("pix3",IntegerType,true), StructField("pix4",IntegerType,true),
  StructField("pix5",IntegerType,true), StructField("pix6",IntegerType,true),
  StructField("pix7",IntegerType,true), StructField("pix8",IntegerType,true),
  StructField("pix9",IntegerType,true), StructField("pix10",IntegerType,true),
  StructField("pix11",IntegerType,true), StructField("pix12",IntegerType,true),
  StructField("pix13",IntegerType,true), StructField("pix14",IntegerType,true),
  StructField("pix15",IntegerType,true), StructField("pix16",IntegerType,true),
  StructField("label",IntegerType,true)))
val pen_raw = sc.textFile("first-edition/ch08/penbased.dat", 4).
  map(x => x.split(", ")).
  map(row => row.map(x => x.toDouble.toInt))

import org.apache.spark.sql.Row
val dfpen = spark.createDataFrame(pen_raw.map(Row.fromSeq(_)), penschema)

import org.apache.spark.ml.feature.VectorAssembler
val va = new VectorAssembler().setOutputCol("features")
va.setInputCols(dfpen.columns.diff(Array("label")))
val penlpoints = va.transform(dfpen).select("features", "label")

You already know that you need to split the dataset into training and validation sets:

val pensets = penlpoints.randomSplit(Array(0.8, 0.2))
val pentrain = pensets(0).cache()
val penvalid = pensets(1).cache()

Now you're ready to use the dataset. First you'll specify a classifier for OneVsRest. Here, you'll use a logistic-regression classifier (but you could also use some other classifier):

import org.apache.spark.ml.classification.OneVsRest

val penlr = new LogisticRegression().setRegParam(0.01)
val ovrest = new OneVsRest()
ovrest.setClassifier(penlr)

Finally, you'll fit it on the training set to obtain the model:

val ovrestmodel = ovrest.fit(pentrain)

The one vs. rest model you just obtained contains 10 logistic-regression models (one for each digit). You can now use it to predict the classes of samples from the validation dataset:

val penresult = ovrestmodel.transform(penvalid)

As we said, Spark ML still has no multiclass evaluator, so you'll use the MulticlassMetrics class from Spark MLlib; but it requires an RDD with tuples containing predictions and labels. So you first need to convert the penresult DataFrame to an RDD:

val penPreds = penresult.select("prediction", "label").
  rdd.map(row => (row.getDouble(0), row.getDouble(1)))


And finally, you’ll construct a MulticlassMetrics object:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

val penmm = new MulticlassMetrics(penPreds)

Recall and precision are equal for multiclass classifiers because the sum of all false positives is equal to the sum of all false negatives. In this case, they’re equal to 0.90182. MulticlassMetrics can also give you precision, recall, and f-measure per individual class:

scala> penmm.precision(3)
res0: Double = 0.9026548672566371
scala> penmm.recall(3)
res1: Double = 0.9855072463768116
scala> penmm.fMeasure(3)
res2: Double = 0.9422632794457274

It can also show you the confusion matrix, whose rows and columns correspond to classes. The element in the ith row and jth column shows how many elements from the ith class were classified as the jth class:

scala> penmm.confusionMatrix
res3: org.apache.spark.mllib.linalg.Matrix =
228.0  1.0    0.0    0.0    1.0    0.0    1.0    0.0    10.0   1.0
0.0    167.0  27.0   3.0    0.0    19.0   0.0    0.0    0.0    0.0
0.0    11.0   217.0  0.0    0.0    0.0    0.0    2.0    0.0    0.0
0.0    0.0    0.0    204.0  1.0    0.0    0.0    1.0    0.0    1.0
0.0    0.0    1.0    0.0    231.0  1.0    2.0    0.0    0.0    2.0
0.0    0.0    1.0    9.0    0.0    153.0  9.0    0.0    9.0    34.0
0.0    0.0    0.0    0.0    1.0    0.0    213.0  0.0    2.0    0.0
0.0    14.0   2.0    6.0    3.0    1.0    0.0    199.0  1.0    0.0
7.0    7.0    0.0    1.0    0.0    4.0    0.0    1.0    195.0  0.0
1.0    9.0    0.0    3.0    3.0    7.0    0.0    1.0    0.0    223.0

The values on the diagonal correspond to the correctly classified samples. You can see that this model performs rather well. Let's now see how decision trees and random forests handle this dataset.
8.3 Decision trees and random forests
In this section, we'll show you how to use decision trees and random forests, simple but powerful algorithms that can be used for both classification and regression. We'll follow the same approach we used for other algorithms in these two chapters: we'll explain their theoretical background and then show you how to use them in Spark. We'll do that first for decision trees and then for random forests.
A decision-tree algorithm uses a tree-like set of user-defined or learned rules to classify input examples based on their feature values. It's simple to use and easy to understand because it isn't based on complicated math. It can perform classification and regression analysis using simple decision rules, learned from a training dataset. The learned decision rules can be visualized, and they offer an intuitive explanation of the inner workings of the algorithm.


Furthermore, decision trees don't require data normalization, they can handle numeric and categorical data, and they can work with missing values. They're prone to overfitting (covered in chapter 7), though, and are very sensitive to the input data: small changes in the input dataset can drastically change the decision rules. Training an optimal decision tree is NP-complete⁸ (no efficient way to find a solution is known), so existing practical solutions find locally optimal solutions at each node, which aren't guaranteed to be globally optimal.
Random forests train a certain number of decision trees on data randomly sampled from the original dataset. Methods that use several trained models are generally called ensemble learning methods, and the procedure of using randomly sampled data for training the models and then averaging their results is called bagging. Bagging helps to reduce variance and, thus, overfitting. That isn't all there is to say about the random-forest algorithm, and we'll get to that later. Let's first examine decision trees.

8.3.1 Decision trees
How does a decision-tree algorithm work? It starts by testing how well each feature classifies the entire training dataset. The metrics used for this are called impurity and information gain (we'll say more about these later). The best feature is selected as a node, and new branches leaving the node are created according to the possible values of the selected feature. If a feature contains continuous values, it's binned into subranges (or bins). A parameter decides how many bins will be used per feature.
Spark creates only binary decision trees; that is, each node has only two branches leaving it. The dataset is divided according to the branches (the selected feature's values), and the entire procedure is repeated for each branch. If a branch contains only a single class, or if a certain tree depth for the branch is reached, the branch becomes a leaf node and the procedure for that branch stops.
AN ILLUMINATING EXAMPLE
This may sound complicated, so let's illustrate it with an example. To create an example dataset, we took the adult dataset used earlier in this chapter and simplified it a bit. We selected only the features age, education, sex, hours worked per week, and the target variable income, and converted the income values to categorical labels (1 if a person earns more than $50,000 and 0 otherwise). Then we sampled the data to obtain only 16 samples. The resulting dataset is shown on the left in figure 8.7.
We next used the dataset to train a decision-tree model. The way the algorithm used the original dataset is shown on the right in the figure. In the first step (in the root node of the resulting tree), the algorithm determined that the education feature should be selected first (that is why the education column is shown first in the table on the right). It divided the possible categories of the education feature according to

8 Laurent Hyafil and Ronald L. Rivest, “Constructing Optimal Binary Decision Trees Is NP-Complete,” 1976, http://mng.bz/9G3C.



Figure 8.7 The example dataset on the left serves for training of a decision-tree model. The model’s algorithm first divides the dataset by education. The resulting left branch (black background) is divided by age, and the resulting right branch (gray background) is divided by sex. The built model (tree) has a depth of three and has seven nodes. Each group of colored cells in a column corresponds to a node, plus the root node not visible in the figure.

the target class values (income) into left categories, corresponding to the left branch in figure 8.8 and depicted with a dark gray background in figure 8.7, and right categories, corresponding to the right branch in figure 8.8 and depicted with a light gray background in figure 8.7. The resulting right branch contains only positive examples (income is greater than $50,000), so it's declared to be a leaf node (a final node in the tree).
In the next step, the algorithm uses the age feature. It repeats the same process and divides only the values in the left education branch (because the values in the right education branch became a leaf node) into two sets. Because age is continuous, it uses a threshold value (the value 48), and not a set of categories, to divide the values. This time the left branch becomes a leaf node (predicting the "<$50k" class). For the right branch, the sex feature is chosen, and the two final leaf nodes result. As you can see, the algorithm never got to use the hours per week feature, so that feature will have no influence on predictions. The corresponding decision tree is shown in figure 8.8.

Figure 8.8 The decision tree corresponding to the data shown in figure 8.7. The decision tree has seven nodes and a depth of three. Leaf nodes are depicted with a black or gray background, depending on the prediction.


Nodes with a white background in figure 8.8 correspond to the columns used to split the dataset, and the dark gray and light gray arrows correspond to the resulting left and right subsets of the data. The nodes with dark gray and light gray backgrounds are leaf nodes containing the final prediction values. The trained decision tree has seven nodes and a depth of three. The decision tree can be used to quickly classify incoming feature vectors.
UNDERSTANDING IMPURITY AND INFORMATION GAIN
We said that the decision-tree algorithm uses impurity and information gain to determine which feature to split on. Two measures of impurity are used in decision trees: entropy and Gini impurity. Gini impurity is the default in Spark and in other decision-tree implementations.
Entropy (Shannon entropy) comes from information theory and is a measure of the amount of information contained in a message. The entropy of a dataset D is calculated as:

K ED=  –PCj log2PCj ji=

where K is the number of target classes and p(Cj) is the proportion of the jth class Cj. For the binary-classification example dataset shown in figure 8.7, the entropy is equal to

E(D) = -\frac{5}{16}\log_2\frac{5}{16} - \frac{11}{16}\log_2\frac{11}{16} = 0.3125 \times 1.678072 + 0.6875 \times 0.540568 = 0.896038

If only one class is present in a dataset, the entropy is equal to 0. Entropy reaches a maximum if all classes are equally present in the dataset. For binary classification, this maximum is equal to 1.
Gini impurity is a measure of how often a randomly chosen element from the dataset would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the dataset.⁹ It's calculated like this:

K 2 Gini D = 1PC–  j j1=

It also reaches a maximum if all classes are equally present in the dataset (equal to 0.5 in the case of two classes) and is equal to 0 if the dataset contains only one class (one value of the target variable). For the example dataset shown in figure 8.7, the Gini impurity is equal to 0.4296875.

9 As defined by Wikipedia: https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity.

242 CHAPTER 8 ML: classification and clustering

|D | IG D F = ID– ------s-ID  |D| s ssubsets F

where subsets(F) are the subsets of the feature F after the split, |Ds| is the number of elements in the subset s, and |D| is the number of elements in the dataset. In the previous example, when splitting the dataset according to the education feature, the information gain is equal to the following (note that entropy is used for calculating impurity in this example):

IG(D, F_{education}) = 0.896038 - \frac{13}{16}\left(-\frac{2}{13}\log_2\frac{2}{13} - \frac{11}{13}\log_2\frac{11}{13}\right) - \frac{3}{16} \times 0 = 0.3928
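To double-check these numbers, here's a small stand-alone Scala sketch that uses only the class counts from figure 8.7 (5 positive and 11 negative examples overall; the education split leaves one subset with 2 positives and 11 negatives, and a pure subset of 3 positives):

def log2(x: Double) = math.log(x) / math.log(2)
def entropy(counts: Int*) = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map(c => -(c / total) * log2(c / total)).sum
}
def gini(counts: Int*) = {
  val total = counts.sum.toDouble
  1.0 - counts.map(c => (c / total) * (c / total)).sum
}
entropy(5, 11)                                                   // ≈ 0.896038
gini(5, 11)                                                      // ≈ 0.4296875
entropy(5, 11) - 13.0/16 * entropy(2, 11) - 3.0/16 * entropy(3)  // ≈ 0.3928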

If any other split of the education categories were chosen, the information gain would be lower. This is obvious by examining the education and income columns in figure 8.7. If you move the bachelors value to the right branch, for example, the right branch would no longer contain a single class (the bachelors row's class is "less than $50k," and the other two right-branch education values, Masters and 7th-8th, have the class ">$50k"), so its impurity would be greater than zero, thus reducing the information gain. The algorithm uses information gain to decide how to split the dataset at each node of the decision tree.
TRAINING A DECISION-TREE MODEL
You'll now train a decision-tree model on the same handwritten digit dataset that you used previously for multiclass logistic regression. But before using the dataset, one extra data-preparation step is necessary. You need to add column metadata, which the decision-tree algorithm needs to determine the number of possible classes. You can use the StringIndexer class (the same class you used in section 8.2.2 to convert categorical string values to integer nominal values) for this, because StringIndexer adds the needed metadata to the transformed column. The penlpoints DataFrame you loaded in section 8.2.6 contains the dataset, and you can add the metadata with the following snippet of code:

val dtsi = new StringIndexer().setInputCol("label").setOutputCol("label-i")
val dtsm:StringIndexerModel = dtsi.fit(penlpoints)
val pendtlpoints = dtsm.transform(penlpoints).drop("label").
  withColumnRenamed("label-i", "label")

As always, splitting the dataset into training and validation sets is necessary:

val pendtsets = pendtlpoints.randomSplit(Array(0.8, 0.2))
val pendttrain = pendtsets(0).cache()
val pendtvalid = pendtsets(1).cache()

The decision-tree classification algorithm in Spark ML is implemented by the class DecisionTreeClassifier. A decision tree for regression is implemented by the class


DecisionTreeRegressor, but you won't use it here. It can be configured with several parameters:
■ maxDepth determines the maximum tree depth. The default is 5.
■ maxBins determines the maximum number of bins created when binning continuous features. The default is 32.
■ minInstancesPerNode sets the minimum number of dataset samples each branch needs to have after a split. The default is 1.
■ minInfoGain sets the minimum information gain for a split to be valid (otherwise, the split will be discarded). The default is 0.
The default parameters work fine in most cases. For this example, you need to adjust the maximum depth to 20 (because the default depth of 5 levels isn't enough) and train the model by calling fit on the training set:

val dt = new DecisionTreeClassifier()
dt.setMaxDepth(20)
val dtmodel = dt.fit(pendttrain)

EXAMINING THE DECISION TREE
As we said previously, one of the advantages of decision trees is that the learned decision rules can be visualized, and they can offer an intuitive explanation of the inner workings of the algorithm. How do you examine the learned decision rules of the model in Spark? You can begin by examining the root node of the tree (note that these results are highly dependent on the way the dataset was split, so your results may be somewhat different):

scala> dtmodel.rootNode
res0: org.apache.spark.ml.tree.Node = InternalNode(prediction = 0.0,
  impurity = 0.4296875, split = org.apache.spark.ml.tree.CategoricalSplit@557cc88b)

Here you can see the calculated impurity of the root node (equal to 0.4296875). You can also see which feature was used for the split on the first node by examining the featureIndex field of rootNode's split (you'll need to cast the root node to InternalNode first):

scala> dtmodel.rootNode.asInstanceOf[InternalNode].split.featureIndex
res1: Int = 15

PYTHON You can’t access the root node in Python.

An index of 15 corresponds to the last pixel in the dataset. Using the split field, you can also see which threshold was used to split the values of feature 15:

scala> dtmodel.rootNode.asInstanceOf[InternalNode].split.
  asInstanceOf[ContinuousSplit].threshold
res2: Double = 51.0


If feature 15 were a categorical value, the split would be an instance of the CategoricalSplit class, and you could access its leftCategories and rightCategories fields to examine which categories were used for each branch. Furthermore, you can access the left and right nodes:

dtmodel.rootNode.asInstanceOf[InternalNode].leftChild
dtmodel.rootNode.asInstanceOf[InternalNode].rightChild
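To walk the whole tree programmatically, you could continue this recursively. The following is our own illustrative sketch (not a Spark method and not from the book's repository); it uses only the node and split classes described above:

import org.apache.spark.ml.tree.{Node, InternalNode, LeafNode,
  ContinuousSplit, CategoricalSplit}

// Recursively prints the learned decision rules, indenting by tree depth.
def printTree(node: Node, indent: String = ""): Unit = node match {
  case n: InternalNode =>
    val splitDesc = n.split match {
      case cs: ContinuousSplit  => s"feature ${cs.featureIndex} <= ${cs.threshold}"
      case cs: CategoricalSplit =>
        s"feature ${cs.featureIndex} in ${cs.leftCategories.mkString("[", ", ", "]")}"
    }
    println(s"${indent}if ($splitDesc)  [impurity ${n.impurity}]")
    printTree(n.leftChild, indent + "  ")
    println(s"${indent}else")
    printTree(n.rightChild, indent + "  ")
  case leaf: LeafNode =>
    println(s"${indent}predict ${leaf.prediction}")
}

printTree(dtmodel.rootNode)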

You can continue the procedure for all nodes in the model, as the sketch above does.

EVALUATING THE MODEL
Now you can transform the validation set and evaluate the model using MulticlassMetrics, as you did for multiclass logistic regression:

scala> val dtpredicts = dtmodel.transform(pendtvalid)
scala> val dtresrdd = dtpredicts.select("prediction", "label").rdd.
  map(row => (row.getDouble(0), row.getDouble(1)))
scala> val dtmm = new MulticlassMetrics(dtresrdd)
scala> dtmm.precision
res0: Double = 0.951442968392121
scala> dtmm.confusionMatrix
res1: org.apache.spark.mllib.linalg.Matrix =
192.0  0.0    0.0    9.0    2.0    0.0    2.0    0.0    0.0    0.0
0.0    225.0  0.0    1.0    0.0    1.0    0.0    0.0    3.0    2.0
0.0    1.0    217.0  1.0    0.0    1.0    0.0    1.0    1.0    0.0
9.0    1.0    0.0    205.0  5.0    1.0    3.0    1.0    1.0    0.0
2.0    0.0    1.0    1.0    221.0  0.0    2.0    3.0    0.0    0.0
0.0    1.0    0.0    1.0    0.0    201.0  0.0    0.0    0.0    1.0
2.0    1.0    0.0    2.0    1.0    0.0    207.0  0.0    2.0    3.0
0.0    0.0    3.0    1.0    1.0    0.0    1.0    213.0  1.0    2.0
0.0    0.0    0.0    2.0    0.0    2.0    2.0    4.0    198.0  6.0
0.0    1.0    0.0    0.0    1.0    0.0    3.0    3.0    4.0    198.0

As you can see, without much preparation the decision tree gives you better results than logistic regression. Precision and recall for logistic regression were 0.90182, and here they’re 0.95, which is an increase of 5.5%. But as you’ll see, random forests can give you even better results.

8.3.2 Random forests
We'll now move to random forests, a powerful classification and regression algorithm that can give excellent results without much tuning. As we said previously, the random-forests algorithm is an ensemble method of training a number of decision trees and selecting the best result by averaging results from all of them. This enables the algorithm to avoid overfitting and to find a global optimum that particular decision trees can't find on their own.

Random forests furthermore use feature bagging, where only a subset of features is randomly selected in each node of a decision tree and the best split is determined according to that reduced feature set. The reason for doing this is that the error rate of a random-forests model increases when decision trees are correlated (are similar). Feature bagging makes decision trees less similar.

As we mentioned, random forests give better results, reduce overfitting, and are generally easy to train and use, but they aren't as easy to interpret and visualize as decision trees.

USING RANDOM FORESTS IN SPARK
Random forests in Spark are implemented by the classes RandomForestClassifier and RandomForestRegressor. Because this chapter is about classification, you'll use RandomForestClassifier. You can configure it with two additional parameters:
■ numTrees is the number of trees to train. The default is 20.
■ featureSubsetStrategy determines how feature bagging is done. Its value can be one of the following: all (uses all features), onethird (randomly selects one-third of the features), sqrt (randomly selects sqrt(number of features)), log2 (randomly selects log2(number of features)), or auto, which means sqrt for classification and onethird for regression. The default is auto.
The defaults work fine in most cases. If you want to train a large number of trees, you need to make sure you give enough memory to your driver, because the trained decision trees are kept in the driver's memory. Training a random-forests classifier model is straightforward:

val rf = new RandomForestClassifier()
rf.setMaxDepth(20)
val rfmodel = rf.fit(pendttrain)
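If you do want to override the defaults, the two random-forest-specific parameters listed above can be set the same way as maxDepth. The values below are only illustrative:

// Illustrative settings only
val rfTuned = new RandomForestClassifier()
rfTuned.setMaxDepth(20)
rfTuned.setNumTrees(50)                   // train 50 trees instead of the default 20
rfTuned.setFeatureSubsetStrategy("sqrt")  // make the feature-bagging strategy explicit
val rfTunedModel = rfTuned.fit(pendttrain)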

Having a model available, you can examine the trees it trained by accessing the trees field:

scala> rfmodel.trees
res0: Array[org.apache.spark.ml.tree.DecisionTreeModel] =
  Array(DecisionTreeClassificationModel of depth 20 with 833 nodes,
  DecisionTreeClassificationModel of depth 17 with 757 nodes,
  DecisionTreeClassificationModel of depth 16 with 691 nodes, ...

After transforming the validation set, you can evaluate the model's performance in the usual way, using the MulticlassMetrics class:

scala> val rfpredicts = rfmodel.transform(pendtvalid)
scala> val rfresrdd = rfpredicts.select("prediction", "label").rdd.
  map(row => (row.getDouble(0), row.getDouble(1)))
scala> val rfmm = new MulticlassMetrics(rfresrdd)
scala> rfmm.precision
res1: Double = 0.9894640403114979
scala> rfmm.confusionMatrix
res2: org.apache.spark.mllib.linalg.Matrix =
205.0  0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
0.0    231.0  0.0    0.0    0.0    0.0    0.0    0.0    1.0    0.0
0.0    0.0    221.0  1.0    0.0    0.0    0.0    0.0    0.0    0.0
5.0    0.0    0.0    219.0  0.0    0.0    2.0    0.0    0.0    0.0
0.0    0.0    0.0    0.0    230.0  0.0    0.0    0.0    0.0    0.0
0.0    1.0    0.0    0.0    0.0    203.0  0.0    0.0    0.0    0.0
1.0    0.0    0.0    1.0    0.0    0.0    216.0  0.0    0.0    0.0
0.0    0.0    1.0    0.0    2.0    0.0    0.0    219.0  0.0    0.0
0.0    0.0    0.0    1.0    0.0    0.0    0.0    1.0    212.0  0.0
0.0    0.0    0.0    0.0    0.0    0.0    2.0    2.0    2.0    204.0

Here the precision of the random-forests model is 0.99, which means it has an error rate of only 1%. That's better than the decision tree's precision by 4% and better than logistic regression by 10%. And you didn't need to tune the algorithm at all. These are excellent results. And it's no wonder. Random forests are one of the most popular algorithms because of their excellent performance and ease of use. They have also been shown to perform equally well on high-dimensional datasets,10 which isn't true for other algorithms.

8.4 Using k-means clustering
The final family of machine learning algorithms that we're going to cover in this book is clustering. The task of clustering is to group a set of examples into several groups (clusters), based on some similarity metric. Clustering is an unsupervised learning method, which means unlike classification, the examples aren't labeled prior to clustering: a clustering algorithm learns the labels itself. For instance, a classification algorithm gets a set of images labeled cats and dogs, and it learns how to recognize cats and dogs on future images. A clustering algorithm can try to spot differences between different images and automatically categorize them into two groups, but not knowing the names for each group. At most it can label them “group 1” and “group 2”.

Clustering can be used for many purposes:
■ Partitioning data into groups (for example, customer segmentation or grouping customers by similar habits)
■ Image segmentation (recognizing different regions in an image)
■ Detecting anomalies
■ Text categorization or recognizing topics in a set of articles
■ Grouping search results (for example, the www.yippy.com search engine automatically groups results by their categories)
The reason labels are missing from clustering datasets may be that it's too expensive and time consuming to label all the data (for example, when grouping search results) or that clusters aren't known in advance (for example, market segmentation) and you want the algorithm to find the clusters for you so you can better understand the data.

Spark offers implementations of the following clustering algorithms:

10 Rich Caruana et al., “An Empirical Evaluation of Supervised Learning in High Dimensions,” Cornell University, Ithaca, NY.


■ K-means clustering
■ Gaussian mixture model
■ Power-iteration clustering
K-means clustering is the simplest and the most often used of the three. Unfortunately, it has drawbacks: it has trouble handling non-spherical clusters and unevenly sized clusters (uneven by density or by radius). It also can't make efficient use of the one-hot-encoded features you used in section 8.2.2. It's often used for classifying text documents, along with the term frequency-inverse document frequency (TF-IDF) feature-vectorization method.11

The Gaussian mixture model (or mixture of Gaussians) is a model-based clustering technique, which means each cluster is represented by a Gaussian distribution, and the model is a mixture of these distributions. It performs soft clustering (modeling a probability that an example belongs to a cluster), unlike k-means, which performs hard clustering (modeling whether an example belongs to a cluster). Because the Gaussian mixture model doesn't scale well to datasets with many dimensions, we won't cover it here.

Power-iteration clustering is a form of spectral clustering, and the math behind it is too advanced for this chapter. Its implementation in Spark is based on the GraphX library, which we'll cover in the next chapter.

The rest of this section is devoted to k-means clustering. We'll explain how k-means clustering works, and then you'll perform clustering on the handwritten digit dataset you used before.

8.4.1 K-means clustering
Let's see how a k-means clustering algorithm works. Let's say you have a set of examples in a two-dimensional dataset, shown in the top-left graph in figure 8.9, and you'd like to group these examples into two clusters.

In the first step, the k-means clustering algorithm randomly chooses two points as cluster centers. In the next step, distances from each of the centers to all points in the dataset are calculated. The points are then assigned to the cluster they're closest to. Finally, mean points in each cluster are calculated, and these points become the new cluster centers. Then the distances to all points are calculated again, the points are assigned to the clusters accordingly, and new cluster centers are calculated again. If the new cluster centers didn't move significantly, the process stops.

At the end of this process, you have a set of cluster centers. You can classify each new point as belonging to one of the clusters by calculating its distance from each cluster center and picking the closest one.

11 For more information, see TF-IDF in the Spark documentation: http://mng.bz/4GE3.



Figure 8.9 Running the k-means clustering algorithm. In the first step, two points are randomly chosen as cluster centers. The distances from these centers to all points in the dataset are calculated in the second step, and the points are assigned to the cluster they’re closest to. In the third step, new cluster centers are calculated; the process is repeated until cluster centers don’t move significantly.
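Before moving to Spark's implementation, the following toy, non-Spark Scala sketch of a single k-means iteration on 2D points (the assignment and center-update steps from figure 8.9) may help make the idea concrete. It is only an illustration:

// Toy illustration only; Spark's KMeans does this (and more) in a distributed way.
type Point = (Double, Double)

def squaredDist(a: Point, b: Point): Double =
  math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2)

// One iteration: assign each point to its closest center, then move each center
// to the mean of the points assigned to it.
def kmeansStep(points: Seq[Point], centers: Seq[Point]): Seq[Point] =
  points.groupBy(p => centers.minBy(c => squaredDist(p, c)))
        .values
        .map(cluster => (cluster.map(_._1).sum / cluster.size,
                         cluster.map(_._2).sum / cluster.size))
        .toSeq

// Repeat kmeansStep until the centers stop moving significantly.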

USING K-MEANS CLUSTERING IN SPARK
As you can see, the algorithm is indeed simple. Let's see how that works in Spark. Each image of handwritten digits, which you'll use for this example, is represented as a series of numbers (dimensions) representing image pixels. As such, each image is a point in an n-dimensional space. K-means clustering can group together images that are close in this space. In an ideal case, all of these will be images of the same digit.

To implement k-means, you first have to make sure your dataset is standardized (all dimensions are of comparable ranges), because k-means clustering doesn't work well with non-standardized data. The dimensions of the handwritten digit dataset are already standardized (all the values go from 0 to 100), so you can skip this step now. With clustering algorithms, there's no point in having a validation and a training dataset. So, you'll use the entire dataset contained in the penlpoints DataFrame you used before.

The KMeans estimator can be parameterized with the following parameters:
■ k—Number of clusters to find (default is 2)
■ maxIter—Maximum number of iterations to perform (required)
■ predictionCol—Prediction column name (default is “prediction”)


■ featuresCol—Features column name (default is “features”)
■ tol—Convergence tolerance
■ seed—Random seed value for cluster initialization
The number of clusters is 10, of course (there are 10 digits in the dataset), and the maximum number of iterations can be set to several hundred. A larger number of iterations usually isn't needed. If you suspect you need more iterations, turn on the logging of informational messages (check chapter 2 if you need a refresher) and see if the message “KMeans reached the max number of iterations” appears. If you see that message, you need more iterations.

To train a k-means model, use the following lines:

import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans()
kmeans.setK(10)
kmeans.setMaxIter(500)
val kmmodel = kmeans.fit(penlpoints)
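The remaining parameters from the list above can be set on the estimator in the same way, before calling fit. The values here are only illustrative:

// Illustrative values, not tuned for the pen dataset; set these before fitting.
kmeans.setTol(1e-4)   // convergence tolerance
kmeans.setSeed(42)    // fixed seed so cluster initialization is reproducible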

EVALUATING THE MODEL
Evaluating clustering models can be difficult because of the nature of the clustering problem: clusters aren't known in advance, and it isn't easy to separate good and bad clusters. There's no silver bullet for solving this problem, but several measures can help: cost value, average distance from the center, and the contingency table.

The cost value (a measure also called distortion), calculated as the sum of the squared distances from all points to the matching cluster centers, is the main metric for evaluating k-means models. The KMeansModel class has a computeCost(DataFrame) method you can use to compute the cost on the dataset:

scala> kmmodel.computeCost(penlpoints)
res0: Double = 4.517530920539787E7

This value is dependent on the dataset and can be used to compare different k-means models on the same dataset, but it doesn't provide an intuitive way of understanding how well the model is doing overall. And how do you put this huge value into context?

Average distance from the center may be a bit more intuitive. To obtain it, take the square root of the cost divided by the number of examples in the dataset:

scala> math.sqrt(kmmodel.computeCost(penlpoints)/penlpoints.count())
res2: Double = 66.5102817068467

This way, the value is comparable to the maximum value the features can have (100).


At times, in clustering problems, some of the examples can be manually labeled. If that's the case, you can compare these original labels with the clusters they end up in. With the handwritten digit dataset, you have the labels available, and you can compare them with the predicted labels. But the predicted labels (cluster indexes) were constructed by the k-means clustering algorithm independently of the original labels. Label 4 (handwritten digit 4), for example, doesn't have to correspond to cluster 4. How do you match the two? Well, just find the cluster with the most examples for a particular label in it, and assign the label to that cluster.

We wrote a simple printContingency method (you can find it in our online repository) that will print the so-called contingency table with the original labels as rows and k-means cluster indexes as columns. The cells in the table contain counts of examples belonging both to the original label and the predicted cluster. The method takes an RDD containing tuples with predictions and the original labels (both double values). Before using the method, you need to obtain an RDD with this information. You can do it like this:

val kmpredicts = kmmodel.transform(penlpoints)

The printContingency method also takes an array of label values it will use for constructing the table (it assumes both cluster indexes and original labels are from the same range). For each original label, the method also finds the cluster where the label is most frequent, prints the mappings found, and calculates purity, which is a measure defined as a ratio of correctly classified examples. Finally:

printContingency(kmpredicts, 0 to 9)
orig.class|Pred0|Pred1|Pred2|Pred3|Pred4|Pred5|Pred6|Pred7|Pred8|Pred9
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
0         |    1|  379|   14|    7|    2|  713|    0|    0|   25|    2
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
1         |  333|    0|    9|    1|  642|    0|   88|    0|    0|   70
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
2         | 1130|    0|    0|    0|   14|    0|    0|    0|    0|    0
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
3         |    1|    0|    0|    1|   24|    0| 1027|    0|    0|    2
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
4         |    1|    0|   51| 1046|   13|    0|    1|    0|    0|   32
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
5         |    0|    0|    6|    0|    0|    0|  235|  624|    3|  187
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
6         |    0|    0| 1052|    3|    0|    0|    0|    1|    0|    0
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
7         |  903|    0|    1|    1|  154|    0|   78|    4|    1|    0
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
8         |   32|  433|    6|    0|    0|   16|  106|   22|  436|    4
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
9         |    9|    0|    1|   88|   82|   11|  199|    0|    1|  664
----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
Purity: 0.6672125181950509
Predicted->original label map: Map(8.0 -> 8.0, 2.0 -> 6.0, 5.0 -> 0.0,
  4.0 -> 1.0, 7.0 -> 5.0, 9.0 -> 9.0, 3.0 -> 4.0, 6.0 -> 3.0, 0.0 -> 2.0)
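The purity value printed at the bottom can also be computed directly from the (prediction, label) pairs. The following is a rough sketch of that calculation (it isn't the book's printContingency implementation, just an illustration of the same idea, and it doesn't assume particular column types):

// Count how many examples fall into each (predicted cluster, original label) cell.
val cellCounts = kmpredicts.select("prediction", "label").rdd.
  map(row => (row.get(0), row.get(1))).
  countByValue()

// Purity: for each predicted cluster take the count of its most frequent label,
// sum those maxima, and divide by the total number of examples.
val purity = cellCounts.groupBy { case ((pred, _), _) => pred }.
  map { case (_, cells) => cells.values.max }.
  sum.toDouble / kmpredicts.count()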


Purity is calculated as the sum of the largest values in each column divided by the total number of examples12 (the sum of the values in all cells).

There are some problems with this table. The cluster 0 column has two values that are largest in their rows (labels 2 and 7). That means both labels 2 and 7 are mostly clustered in cluster 0. Because label 2 is more frequent in cluster 0, 2 was chosen to be its label, and now label 7 has no matching cluster. There's also a problem with cluster 1. It holds examples with labels that are more frequent in other clusters, so it has no label at all. Similar to cluster 0, cluster 1 also contains two equally frequent labels: label 0 and label 8.

What happened? Well, the algorithm confused 8s with 0s and 7s with 2s. These digits look similar, so it's no surprise the algorithm could have trouble recognizing the differences between them. There isn't much you can do about this. You can only try to “correct” the dataset by creating features from the images of handwritten digits in a different way. But in a real-life situation, you most likely wouldn't have this option. You would only have a set of cluster centers and sets of clustered samples. Interpreting the meanings of the clusters depends on the nature of the data.

DETERMINING THE NUMBER OF CLUSTERS
In a real-life situation, you'd also have an additional question: what number of clusters should you use? Sometimes you have a rough idea for the number of clusters, but often you don't. In these cases, you can use the elbow method.

It goes like this. You gradually increase the number of clusters, train a model for each number, and look at the cost of each model. The cost will inevitably decrease because as the clusters increase in number, they keep getting smaller and smaller, and the distances from the centers decrease. But if you plot the cost as a function of the number of clusters K, you may notice a few “elbows” where the slope of the function changes abruptly. K numbers corresponding to these points are good candidates for use.

We plotted the costs of models trained on the pen dataset with the numbers of clusters from 2 to 30. The result is shown in figure 8.10. The curve on the figure is mostly smooth. The only slight elbow is at the K value of 15 and then at 36. But these aren't so obvious.

There is much debate about how to choose the right number of clusters. Several methods have been proposed, but none of them is available in Spark out of the box. For example, silhouette, information criterion, information theoretic, and cross validation are some of the proposed approaches. A great overview is available in the paper by Trupti M. Kodinariya and Dr. Prashant R. Makwana.13

12 Purity is equal to precision (and accuracy), and the contingency table contains the same values as the confusion matrix you used for classification.
13 Trupti M. Kodinariya and Dr. Prashant R. Makwana, “Review on Determining Number of Cluster in K-means Clustering,” International Journal of Advance Research in Computer Science and Management Studies 1, no. 6 (November 2013), http://mng.bz/k2up.


Figure 8.10 Cost of models trained on the pen dataset as a function of the number of clusters (cost on the y-axis, on the order of 1e8; number of clusters from 0 to 40 on the x-axis)

8.5 Summary
■ The Spark ML library generalizes machine learning operations and streamlines machine learning processes.
■ Spark ML introduces several new abstractions—estimators, transformers, and evaluators—that can be combined to form pipelines. All four are parameterized in a general way with ML parameters.
■ The goal of classification is to classify input examples into two or more classes.
■ Logistic regression outputs probabilities that a certain example belongs to a certain class using a logistic function.
■ Missing data can be dealt with by removing columns, by removing rows, by setting missing data to the most common value, or by predicting the missing values using a separate model.
■ Categorical values can be one-hot-encoded so that a column is expanded to as many columns as there are distinct values in it; thus, for a single row, only one of the columns contains a 1 and all the others contain 0s.
■ VectorAssembler combines several columns into one column containing all values in a Vector.
■ Area under the precision and recall (PR) curve is one of the metrics for evaluating classification models.
■ Area under the receiver operating characteristic (ROC) curve is another metric for evaluating classification models.
■ K-fold cross-validation validates the performance of models more reliably because it validates the model several times and takes the average as the final result.


■ Multiclass logistic regression classifies examples into more than two classes. The results are evaluated with the MulticlassMetrics class.
■ The decision-trees algorithm uses a tree-like set of user-defined or learned rules to classify input examples based on their feature values.
■ Impurity and information gain are used in decision trees to decide how to split the dataset.
■ The decision-tree structure can be examined and visualized.
■ Random forests train a certain number of decision trees on data randomly sampled from the original dataset and average the results.
■ K-means clustering forms clusters by calculating the distances of data points from cluster centers and moving the cluster centers to the average position in the cluster. It stops when the cluster centers stop moving significantly.
■ A contingency table can be used to visualize the results of clustering.
■ An appropriate number of clusters to use can sometimes be determined using the elbow method.

Connecting the dots with GraphX

This chapter covers
■ Using the GraphX API
■ Transforming and joining graphs
■ Using GraphX algorithms
■ Implementing the A* search algorithm with the GraphX API

This chapter completes your tour of Spark components with an overview of GraphX, Spark's graph-processing API. In this chapter, we'll show you how to use GraphX and give you examples of using graph algorithms in Spark. These include shortest paths, page rank, connected components, and strongly connected components. If you're interested in learning about other algorithms available in Spark (triangle count, LDA, and SVD++), or more about GraphX in general, Michael Malak and Robin East go into much more detail in their book GraphX in Action (Manning, 2016), which we highly recommend.



9.1 Graph processing with Spark
A graph, as a mathematical concept of linked objects, consists of vertices (objects in the graph) and edges that connect the vertices (or links between the objects). In Spark, edges are directed (they have a source and a destination vertex), and both edges and vertices have property objects attached to them. For example, in a graph containing data about pages and links, property objects attached to the vertices may contain information about a page's URL, title, date, and so on, and property objects attached to the edges may contain a description of the link (contents of an HTML tag).

Once presented as a graph, some problems become easier to solve; they give rise naturally to graph algorithms. For example, presenting hierarchical data can be complicated using traditional data-organization methods, such as relational databases, and can be simplified using graphs. In addition to using them to represent social networks and links between web pages, graph algorithms have applications in biology, computer chip design, travel, physics, chemistry, and other areas.

In this section, you'll learn how to construct and transform graphs using GraphX. You'll build an example graph representing a social network (shown in figure 9.1) consisting of seven vertices and nine edges. In this graph, vertices represent people, and edges represent relationships between them. Each vertex has a vertex ID (a number in a circle) and vertex properties (information about the person's name and age) attached to it. Edge properties contain information about a relation type. It would make sense in this case to also draw edges in the opposite direction (because “married to,” for example, works both ways), but we wanted to keep it simple and save some


Figure 9.1 An example graph representing a simple social network involving the Simpsons family (before Maggie was born). The numbered circles are graph vertices, and the solid lines connecting them are graph edges. The properties attached to vertices (with the dotted lines) contain information about name and age. The properties attached to edges (also with the dotted lines) describe the type of relationship.


memory space. This isn’t a limitation of Spark; Spark allows you to define several edges between the same vertices, and they can be directed both ways. Let’s build this graph using the Spark GraphX API.

9.1.1 Constructing graphs using GraphX API
Graph (http://mng.bz/078M) is the main class in GraphX providing access to vertices and edges and various operations for transforming graphs. Vertices and edges in Spark are realized by two special RDD implementations:
■ VertexRDDs contain tuples, which consist of two elements: a vertex ID of type Long and a property object of an arbitrary type.
■ EdgeRDDs contain Edge objects, which consist of source and destination vertex IDs (srcId and dstId, respectively) and a property object of an arbitrary type (attr field).

CONSTRUCTING THE GRAPH
You can construct graphs with Spark in several ways. One way is to instantiate a Graph object using an RDD containing tuples, consisting of a vertex ID and a vertex property object, and using an RDD containing Edge objects. You'll use this method to construct the example graph from figure 9.1. First you need to import the required classes. Open your Spark shell, and paste in the following line:

import org.apache.spark.graphx._

NOTE You can find the complete code for this chapter in our online repository. But because the Spark GraphX API isn't available in Python or Java, you won't find any Python or Java code there.

The Person case class holds properties of the nodes (edge properties are simple strings):

case class Person(name:String, age:Int)

Next you construct the vertex and edge RDDs with the required vertex and edge objects:

val vertices = sc.parallelize(Array((1L, Person("Homer", 39)),
  (2L, Person("Marge", 39)), (3L, Person("Bart", 12)),
  (4L, Person("Milhouse", 12))))
val edges = sc.parallelize(Array(Edge(4L, 3L, "friend"),
  Edge(3L, 1L, "father"), Edge(3L, 2L, "mother"),
  Edge(1L, 2L, "marriedTo")))

Finally, you construct the graph object:

val graph = Graph(vertices, edges)


You can now access the graph's vertices and edges properties and use other graph-transformation and -manipulation methods that we'll get to later:

scala> graph.vertices.count()
res0: Long = 4
scala> graph.edges.count()
res1: Long = 4

In section 9.4, we’ll show you several more methods for creating graphs in GraphX.

9.1.2 Transforming graphs
Let's look at what you can do with graphs and how you can manipulate them. In this section, we'll show you how to map edges and vertices in order to add data to property objects attached to them, or to transform them by calculating their new values. Then we'll show you how to send messages throughout a graph by aggregating them and by using Spark's implementation of Pregel, Google's system for large-scale graph processing. Finally, we'll show you how to join and filter graphs.

As we said, the main class for representing graphs in GraphX is Graph. But there's also the GraphOps class (http://mng.bz/J4jv), whose methods are implicitly added to Graph objects. You need to consider both classes when figuring out which methods to use. Some of the methods we'll mention in this section come from Graph, and some from GraphOps.

MAPPING EDGES AND VERTICES
Let's say you need to change the edges of your graph object to make them instances of the Relationship class (given in a moment) instead of strings as they are now. This is just an example illustrating the concept, and the Relationship class is just a wrapper around the String relation property. But you could add more information to the edges this way, and you could later use this information for additional calculations. For example, you could add information about when two people started their relationship and how often they meet.

You can transform edge and vertex property objects with the mapEdges and mapVertices methods. The mapEdges method requires a mapping function, which takes a partition ID and an iterator of edges in the partition and returns the transformed iterator containing a new edge property object (not a new edge) for each input edge. The following code will do the job:

case class Relationship(relation:String)
val newgraph = graph.mapEdges((partId, iter) =>
  iter.map(edge => Relationship(edge.attr)))

newgraph now has Relationship objects as edge properties:

scala> newgraph.edges.collect()
res0: Array[org.apache.spark.graphx.Edge[Relationship]] =
  Array(Edge(3,1,Relationship(father)), ...)


TIP Another useful method for mapping edges is mapTriplets. It also maps edge property objects to new ones, but the mapping function it expects receives an EdgeTriplet object. An EdgeTriplet object, in addition to containing srcId, dstId, and attr fields as Edge does, holds source and destination vertex property objects (srcAttr and dstAttr). You can use mapTriplets when you need to calculate the contents of a new property object as a function of the edge's current property object and the properties of the connected vertices.
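As a hedged illustration (our own example, reusing the newgraph and Person objects defined above), mapTriplets could tag each edge with the age difference between the two people it connects:

// Illustration only: enrich each edge with the age difference between the
// connected vertices, which requires access to srcAttr and dstAttr.
case class RelationshipWithDiff(relation: String, ageDiff: Int)

val graphWithAgeDiffs = newgraph.mapTriplets(triplet =>
  RelationshipWithDiff(triplet.attr.relation,
    math.abs(triplet.srcAttr.age - triplet.dstAttr.age)))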

Let’s say you want to attach to each person in your graph the number of children, friends, and coworkers they have and add a flag indicating whether they’re married. You first need to change the vertex property objects so they allow for storage of these new properties. Let’s use the PersonExt case class for this:

case class PersonExt(name:String, age:Int, children:Int=0, friends:Int=0, married:Boolean=false)

Now you need to change the graph’s vertices to use this new property class. The mapVertices method maps vertices’ property objects and works similarly to mapEdges. The mapping function you need to give to it takes a vertex ID and a vertex property object and returns the new property object:

val newGraphExt = newgraph.mapVertices((vid, person) => PersonExt(person.name, person.age))

Now all vertices in newGraphExt have these new properties, but they all default to 0, so no person in the graph is married, has children, or has friends. How do you calculate the right values? That's where the aggregateMessages method comes in.

AGGREGATING MESSAGES
The aggregateMessages method is used to run a function on each vertex of a graph and optionally send messages to its neighboring vertices. The method collects and aggregates all the messages sent to each vertex and stores them in a new VertexRDD. It has the following signature:

def aggregateMessages[A: ClassTag](
    sendMsg: EdgeContext[VD, ED, A] => Unit,
    mergeMsg: (A, A) => A,
    tripletFields: TripletFields = TripletFields.All)
  : VertexRDD[A]

You need to provide two functions, sendMsg and mergeMsg, and the tripletFields property. The sendMsg function receives an EdgeContext object for each edge in the graph and, if necessary, uses the edge context to send messages to vertices. The EdgeContext object contains IDs and property objects for source and destination vertices, the property object for the edge, and two methods for sending messages to neighboring vertices:

sendToSrc and sendToDst. The sendMsg function can use the edge context to decide which, if any, message to send to each vertex. The mergeMsg function aggregates messages destined for the same vertex. Finally, the tripletFields argument specifies which fields should be provided as part of the edge context. Possible values are static fields in the TripletFields class (None, EdgeOnly, Src, Dst, and All).

Let's use aggregateMessages to calculate additional properties in the PersonExt objects (number of friends and children and the married flag), for each vertex of the newGraphExt graph:

val aggVertices = newGraphExt.aggregateMessages(
  (ctx:EdgeContext[PersonExt, Relationship, Tuple3[Int, Int, Boolean]]) => {
    if(ctx.attr.relation == "marriedTo") {
      ctx.sendToSrc((0, 0, true)); ctx.sendToDst((0, 0, true));
    } else if(ctx.attr.relation == "mother" || ctx.attr.relation == "father") {
      ctx.sendToDst((1, 0, false));
    } else if(ctx.attr.relation.contains("friend")) {
      ctx.sendToDst((0, 1, false)); ctx.sendToSrc((0, 1, false));
    }
  },
  (msg1:Tuple3[Int, Int, Boolean], msg2:Tuple3[Int, Int, Boolean]) =>
    (msg1._1+msg2._1, msg1._2+msg2._2, msg1._3 || msg2._3)
)

The messages the sendMsg function sends to vertices are tuples containing the number of children (an Int), the number of friends (an Int), and whether the person is married (a Boolean). These values are set to 1 (or true) only if the edge examined is of an appropriate type. The second function (mergeMsg) adds up all the values per vertex so that the resulting tuple contains the sums.

Because the marriedTo and friendOf relationships are represented with only one edge (to save space) but are in reality bidirectional relationships, sendMsg in those cases sends the same message to both source and destination vertices. Mother and father relationships are unidirectional and are used for counting the number of children, so those messages are sent only once. The resulting RDD has the same number of elements as the original VertexRDD, because all the vertices received messages:

scala> aggVertices.collect.foreach(println)
(4,(0,1,false))
(2,(1,0,true))
(1,(1,0,true))
(3,(0,1,false))

The result is a VertexRDD, so your work isn’t done because you still don’t have the resulting Graph. To update the original graph with these newly found values, you’ll need to join the old graph with the new vertices.


JOINING GRAPH DATA
To join the original graph and the new vertex messages, use the outerJoinVertices function, which has the following signature:

def outerJoinVertices[U:ClassTag, VD2:ClassTag](other: RDD[(VertexId, U)])
    (mapFunc: (VertexId, VD, Option[U]) => VD2)
  : Graph[VD2, ED]

You need to provide two arguments: an RDD that contains tuples with vertex IDs and new vertex objects, and a mapping function to combine old vertex property objects (of type VD) and new vertex objects from the input RDD (of type U). If no object exists in the input RDD for a particular vertex ID, the mapping function receives None. The following statement will join your newGraphExt graph with the new information from aggVertices:

val graphAggr = newGraphExt.outerJoinVertices(aggVertices)(
  (vid, origPerson, optMsg) => { optMsg match {
    case Some(msg) => PersonExt(origPerson.name, origPerson.age,
      msg._1, msg._2, msg._3)
    case None => origPerson
  }}
)

The mapping function copies the summed values from an input message into a new PersonExt object, conserving the name and age properties, if an input message for a vertex exists. It returns the original PersonExt object otherwise:

scala> graphAggr.vertices.collect().foreach(println)
(4,PersonExt(Milhouse,12,0,1,false))
(2,PersonExt(Marge,39,1,0,true))
(1,PersonExt(Homer,39,1,0,true))
(3,PersonExt(Bart,12,0,1,false))

The resulting graph is shown in figure 9.2.


Figure 9.2 The transformed graph with additional vertex properties added. These were obtained using the aggregateMessages and outerJoinVertices methods.


The previously described Graph methods, aggregateMessages and outerJoinVertices, along with the mapping functions, are the bread and butter of graph operations in GraphX. You’ll find yourself using them often in your GraphX applications.

GraphX’s Pregel implementation Pregel is Google’s system for large-scale graph processing.1 GraphX contains an im- plementation of a Pregel-like API, which you can use to perform calculations similar to aggregateMessages. But Pregel is more powerful, which is why it was used to im- plement many of the graph algorithms in GraphX. Pregel works by executing a sequence of iterations called supersteps. Each superstep is similar to aggregateMessages in that the sendMsg function is invoked on edges and the mergeMsg function is used to merge messages destined for the same vertex. Additionally, a user-defined vertex program (vprog) is invoked for each vertex. The vprog function receives the incoming message and computes a new value for the vertex. The initial superstep is executed on all vertices, and subsequent supersteps are executed only on those vertices that receive a message. sendMsg is invoked only for the outgoing edges of the vertices that receive a message. The process stops if no new messages are sent or the maximum number of iterations is attained. GraphX’s Pregel API is im- plemented by the Pregel object, whose apply method has the following signature: def apply[VD: ClassTag, ED: ClassTag, A: ClassTag] (graph: Graph[VD, ED], initialMsg: A, maxIterations: Int = Int.MaxValue, activeDirection: EdgeDirection = EdgeDirection.Either) (vprog: (VertexId, VD, A) => VD, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)], mergeMsg: (A, A) => A) : Graph[VD, ED] These are the parameters the method takes:

■ graph—Input graph on which to operate.
■ initialMsg—Message sent to all vertices in the first superstep.
■ maxIterations—Maximum number of supersteps to perform.
■ activeDirection—When to invoke the sendMsg function: on edges whose source vertex received a message (EdgeDirection.Out), on edges whose destination vertex received a message (EdgeDirection.In), if either source or destination received a message (EdgeDirection.Either), or if both vertices received a message (EdgeDirection.Both).
■ vprog—Vertex program function to invoke on each vertex. It receives the message and potentially changes the content of a vertex.
■ sendMsg—Function that receives an EdgeTriplet and returns an iterator of (vertex ID, message) tuples. These messages are sent to the specified vertices.
■ mergeMsg—Function to merge messages directed to the same vertex.
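As a hedged illustration of these parameters (our own example, not from the book), the following computes the number of hops from vertex 1 to every other vertex of the small social graph built earlier, following edge directions:

// Sketch only: single-source shortest paths (in hops) with the Pregel API.
val sourceId: VertexId = 1L
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val hops = Pregel(initialGraph, Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),        // vprog: keep the smaller value
  triplet =>                                             // sendMsg: relax along each edge
    if (triplet.srcAttr + 1.0 < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + 1.0))
    else
      Iterator.empty,
  (a, b) => math.min(a, b))                              // mergeMsg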


1 Grzegorz Malewicz et al., “Pregel: A System for Large-Scale Graph Processing,” https://kowshik.github.io/JPregel/pregel_paper.pdf.


SELECTING GRAPH SUBSETS
Another important operation on graphs is selecting only a part of a graph. Here, we'll show you three methods you can use to accomplish this:
■ subgraph—Selects vertices and edges based on supplied predicates
■ mask—Selects only the vertices present in another graph
■ filter—Combination of the previous two
Let's select from graphAggr only those persons who have children. You can use the subgraph method and give it predicates that vertices and edges have to satisfy in order to be selected. This is subgraph's signature:

def subgraph(
    epred: EdgeTriplet[VD, ED] => Boolean = (x => true),
    vpred: (VertexId, VD) => Boolean = ((v, d) => true))
  : Graph[VD, ED]

As you can see, the edge predicate function (epred) receives an EdgeTriplet object. If it returns true, the particular edge will be included in the resulting graph. The vertex predicate function (vpred) receives a vertex ID and its property object. If you omit the vertex predicate function, the subgraph function keeps all the original vertices in the new graph. Edges whose vertices no longer exist in the new graph are automatically removed. To select only the persons who have children, you can do this:

val parents = graphAggr.subgraph(
  _ => true,                                  // edges are automatically removed if the source or destination vertex no longer exists
  (vertexId, person) => person.children > 0)  // keeps only people with one or more children

If you look at the remaining edges and vertices, you’ll see that only Marge and Homer are left, with the single edge that connects them:

scala> parents.vertices.collect.foreach(println)
(1,PersonExt(Homer,39,1,0,true))
(2,PersonExt(Marge,39,1,0,true))
scala> parents.edges.collect.foreach(println)
Edge(1,2,Relationship(marriedTo))

The mask function is another way to filter a graph in GraphX. With mask, you can project a graph onto another graph and keep only those vertices and edges that also exist in the second graph, not taking into account property objects of either graph. The only argument mask takes is the second graph.

The third function for filtering a graph's contents is filter. It's related to both the subgraph and mask functions. It takes three arguments: a preprocessing function and edge and vertex predicate functions (just like subgraph).


The preprocessing function lets you transform the original graph into another graph, which will then be pruned using the supplied edge and vertex predicate functions. The resulting graph will be used as a mask for the original graph. In other words, filter lets you combine two steps into one and is useful only if you don't need the preprocessed graph except for masking the original one.

9.2 Graph algorithms
Graph algorithms are the reason GraphX exists in the first place: the whole point of organizing data into graphs is to be able to use algorithms built specifically for working with graph data. Many problems can be elegantly solved using graph solutions. In this section, we'll cover several Spark graph algorithms:
■ Shortest paths—Finds the shortest paths to a set of vertices
■ Page rank—Calculates the relative importance of vertices in a graph based on the number of edges leading to and from them
■ Connected components—Finds distinct subgraphs, if they exist in a graph
■ Strongly connected components—Finds clusters of doubly connected vertices
As we've said, if you're interested in learning about other graph algorithms Spark provides, such as triangle count, SVD++ (a model), and Latent Dirichlet allocation (LDA, a topic model for text documents), we recommend Spark GraphX in Action by Michael Malak and Robin East (Manning, 2016).

9.2.1 Presentation of the dataset
In this section, you'll switch to a different dataset. The dataset you'll use can be obtained from Stanford University.2 It was prepared by Robert West and Jure Leskovec as part of their project “Human Wayfinding in Information Networks.”3 The dataset is based on an online game called Wikispeedia (http://snap.stanford.edu/data/wikispeedia.html), also a part of a research project.4 The game contains a subset of Wikipedia articles and challenges a user to connect two articles with as few links as possible.

The dataset archive (http://mng.bz/N2kf) contains several files, but for this section you need only two: articles.tsv and links.tsv. The first contains unique article names (one per line), and the second contains links with source and destination article names separated by a tab character. Both files are available from our GitHub repository (which you cloned in chapter 2).

Using zipWithIndex, the following snippet loads article names, removes empty lines and comment lines, and assigns a unique number (ID) to each article name (assuming you're running in your home directory, where you cloned the repository):

2 Wikispeedia Navigation Paths dataset: http://snap.stanford.edu/data/wikispeedia.html.
3 Robert West and Jure Leskovec, “Human Wayfinding in Information Networks,” 21st International World Wide Web Conference (WWW), 2012.
4 Robert West, Joelle Pineau, and Doina Precup, “Wikispeedia: An Online Game for Inferring Semantic Distances between Concepts,” 21st International Joint Conference on Artificial Intelligence (IJCAI), 2009.


val articles = sc.textFile("first-edition/ch09/articles.tsv").
  filter(line => line.trim() != "" && !line.startsWith("#")).
  zipWithIndex().cache()

Calling cache will make article names and IDs available for quick lookup. Loading lines with links is analogous, except you don’t need to use zipWithIndex:

val links = sc.textFile("first-edition/ch09/links.tsv").
  filter(line => line.trim() != "" && !line.startsWith("#"))

Parse each link line to obtain the article names, and then replace each name with the article ID by joining names with the articles RDD:

val linkIndexes = links.map(x => { val spl = x.split("\t"); (spl(0), spl(1)) }).
  join(articles).map(x => x._2).join(articles).map(x => x._2)

The resulting RDD contains tuples with source and destination article IDs. You can use it to construct a Graph object:

val wikigraph = Graph.fromEdgeTuples(linkIndexes, 0)

You can see that there’s a slight difference in the number of articles and vertices in the graph:

scala> wikigraph.vertices.count()
res0: Long = 4592
scala> articles.count()
res1: Long = 4604

That’s because some articles are missing from the links file. You can check that by counting all the distinct article names in the linkIndexes RDD:

scala> linkIndexes.map(x => x._1).union(linkIndexes.map(x => x._2)).
  distinct().count()
res2: Long = 4592

You’re now ready to get acquainted with the graph algorithms available in GraphX.

9.2.2 Shortest-paths algorithm
For each vertex in a graph, the shortest-paths algorithm finds the minimum number of edges you need to follow in order to reach the starting vertex. If you have a LinkedIn account, you've undoubtedly seen an example of the shortest-paths algorithm. In the section How You're Connected, LinkedIn shows you the persons whom you can get acquainted with through the person you're looking at.

Spark implements the shortest-paths algorithm with the ShortestPaths object. It has only one method, called run, which takes a graph and a Seq of landmark vertex IDs.


The returned graph's vertices contain a map with the shortest path to each of the landmarks, where the landmark vertex ID is the key and the shortest-path length is the value.

One of the challenges Wikispeedia presents is to connect the 14th Century and Rainbow pages. In the paths_finished.tsv file, you'll find examples of successfully completed challenges. Some people finished the challenge with six clicks and some with as few as three clicks. Let's see what Spark says is the minimum number of clicks you need. First, you need to find the vertex IDs of the two pages:

scala> articles.filter(x => x._1 == "Rainbow" || x._1 == "14th_century").
  collect().foreach(println)
(14th_century,10)
(Rainbow,3425)

Then call ShortestPaths’s run method with wikigraph and one of the IDs in a Seq as arguments:

import org.apache.spark.graphx.lib._
val shortest = ShortestPaths.run(wikigraph, Seq(10))

The shortest graph has vertices whose property objects are maps with distances to landmarks. Now you only need to find the vertex corresponding to the Rainbow article:

scala> shortest.vertices.filter(x => x._1 == 3425).collect.foreach(println)
(3425,Map(1772 -> 2))

The actual minimum number of clicks is two; but ShortestPaths doesn't give you the path from one vertex to another, only the number of edges you need to reach it. To find the vertices on the shortest path, you'd need to write your own implementation of the shortest-paths algorithm, which is beyond the scope of this book. But if you need to take that road, GraphX in Action, which we mentioned earlier, can help you. Another thing you can use as a reference is our implementation of the A* algorithm given in section 9.3.

9.2.3 Page rank
The page-rank algorithm was invented by Larry Page, cofounder of Google. It determines the importance of vertices in a graph by counting incoming edges. It's been widely used to analyze the relative importance of web pages in a web graph.

It starts by assigning each vertex a page rank (PR) value of 1. It divides this PR value by the number of outgoing edges and then adds the result to the PR value of all neighboring vertices. This process is repeated until no PR value changes by more than the tolerance parameter. Because the PR value of a page is divided by the number of outgoing edges, the page's influence shrinks proportionally to the number of pages it references. The highest-ranked page is the one that has the minimum number of outgoing links and the maximum number of incoming links.


You can run the page-rank algorithm on a graph by calling its pageRank method and passing in the tolerance parameter. The tolerance parameter determines the amount the page-rank values can change and still be considered to converge. Smaller values mean more accuracy, but the algorithm will also need more time to converge. The result is a graph whose vertices contain the PR values:

val ranked = wikigraph.pageRank(0.001)

Let’s see the 10 highest-ranked pages in the Wikispeedia subset of Wikipedia pages:

val ordering = new Ordering[Tuple2[VertexId,Double]] {
  def compare(x:Tuple2[VertexId, Double], y:Tuple2[VertexId, Double]):Int =
    x._2.compareTo(y._2)
}
val top10 = ranked.vertices.top(10)(ordering)

The top10 array now contains vertex IDs and their PR values of the 10 highest-ranked pages, but you still don’t know to which pages those IDs belong. You can join this array with the articles RDD to find out:

scala> sc.parallelize(top10).join(articles.map(_.swap)).collect.
  sortWith((x, y) => x._2._1 > y._2._1).foreach(println)
(4297,(43.064871681422574,United_States))
(1568,(29.02695420077583,France))
(1433,(28.605445025345137,Europe))
(4293,(28.12516457691193,United_Kingdom))
(1389,(21.962114281302206,English_language))
(1694,(21.77679013455212,Germany))
(4542,(21.328506154058328,World_War_II))
(1385,(20.138550469782487,England))
(2417,(19.88906178678032,Latin))
(2098,(18.246567557461464,India))

The first element of each tuple in the result is the vertex ID; the second element contains the PR value and the page name. As you can see, the page about the United States is the most influential page in the dataset.

9.2.4 Connected components
Finding connected components of a graph means finding subgraphs in which every vertex can be reached from every other vertex by following the edges in the subgraph. A graph is connected if it contains only one connected component and all of its vertices can be reached from every other vertex. A graph with two connected components is shown in figure 9.3: vertices from one connected component can't be reached from the other.

Finding connected components is important in many situations. For example, you should check whether your graph is connected before running other algorithms, because that may affect your results and skew your conclusions.

To find connected components on a GraphX graph, you call its connectedComponents method (implicitly made available from the GraphOps object).



Figure 9.3 A graph with two connected components. Although they belong to the same graph, vertices from one connected component can't be reached from the other.

Connected components are represented by the lowest vertex ID in the connected component. To use the Wikispeedia graph:

val wikiCC = wikigraph.connectedComponents()

Property objects of the wikiCC graph's vertices contain the lowest vertex ID of the connected component to which they belong. To find all the connected components in the graph, you can find the distinct connected component IDs. To find page names of those IDs, you can again join them with the articles RDD:

scala> wikiCC.vertices.map(x => (x._2, x._2)).
  distinct().join(articles.map(_.swap)).collect.foreach(println)
(0,(0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in))
(1210,(1210,Directdebit))

As you can see, the Wikispeedia graph has two separate clusters of web pages. The first is identified by the vertex ID of the page about Áedán mac Gabráin and the second by the vertex ID of the Direct Debit page. Let's see how many pages are in each cluster:

scala> wikiCC.vertices.map(x => (x._2, x._2)).countByKey().foreach(println)
(0,4589)
(1210,3)

The second cluster has only three pages, which means Wikispeedia is well connected.


9.2.5 Strongly connected components
Strongly connected components (SCCs) are subgraphs where all vertices are connected to every other vertex in the subgraph (not necessarily directly). All vertices in an SCC need to be reachable from each other (following the direction of the edges). An example is given in figure 9.4, showing a graph with four strongly connected components. Unlike connected components, SCCs can be connected to each other through some of their vertices.

SCCs have many applications in graph theory and other areas. As a practical example, let's consider LinkedIn again. SCCs in a LinkedIn graph are likely to occur in small companies, on teams, or among lifelong friends, but aren't likely to exist across industries, such as IT and construction.

In Spark, the stronglyConnectedComponents method is also available from the GraphOps object, implicitly added to Graph objects. You just need to give it the maximum number of iterations to perform:

val wikiSCC = wikigraph.stronglyConnectedComponents(100)

As with the connected-components algorithm, the vertices of the graph resulting from the SCC algorithm contain the lowest vertex ID of the strongly connected component to which they belong. The wikiSCC graph contains 519 strongly connected components:

scala> wikiSCC.vertices.map(x => x._2).distinct.count
res0: Long = 519


Figure 9.4 A graph with four strongly connected components. Every vertex in each component is connected to every other vertex.


Let’s see which ones are largest:

scala> wikiSCC.vertices.map(x => (x._2, x._1)).countByKey().
  filter(_._2 > 1).toList.sortWith((x, y) => x._2 > y._2).foreach(println)
(6,4051)
(2488,6)
(1831,3)
(892,2)
(1950,2)
(4224,2)
...

The largest SCC has 4,051 vertices, which is too many to display here. You can examine the names of the pages belonging to the vertices in several smaller SCCs. For the three next-largest SCCs found, members of the first SCC are lists of countries by continent (those pages have changed in Wikipedia since the Wikispeedia dataset was constructed, and they no longer form an SCC):

scala> wikiSCC.vertices.filter(x => x._2 == 2488).
  join(articles.map(x => (x._2, x._1))).collect.foreach(println)
(2490,(2488,List_of_Asian_countries))
(2496,(2488,List_of_Oceanian_countries))
(2498,(2488,List_of_South_American_countries))
(2493,(2488,List_of_European_countries))
(2488,(2488,List_of_African_countries))
(2495,(2488,List_of_North_American_countries))

scala> wikiSCC.vertices.filter(x => x._2 == 1831).
  join(articles.map(x => (x._2, x._1))).collect.foreach(println)
(1831,(1831,HD_217107))
(1832,(1831,HD_217107_b))
(1833,(1831,HD_217107_c))

scala> wikiSCC.vertices.filter(x => x._2 == 892).
  join(articles.map(x => (x._2, x._1))).collect.foreach(println)
(1262,(892,Dunstable_Downs))
(892,(892,Chiltern_Hills))

The second SCC contains pages about a star (HD_217107) and its two planets. Members of the third SCC are regions in England.

9.3 Implementing the A* search algorithm
In this section, you'll implement the A* (pronounced "A star") search algorithm for finding the shortest path between two vertices in a graph. The A* algorithm is widely popular because of its efficiency in path finding. The main reason to implement this algorithm using the GraphX API in this chapter is to help you get a better understanding of GraphX classes and methods.
Through the process, you'll apply graph filtering, message aggregation, and joining vertices (all the methods we talked about in the previous sections) to a real problem. And you'll also learn about the A* search algorithm itself, of course.


9.3.1 Understanding the A* algorithm
The A* algorithm may seem complicated initially but is simple once you get the hang of it. You need a start vertex and an end vertex; the A* algorithm finds the shortest path between them. The algorithm works by calculating the cost of each vertex, relative to the start and end vertices, and then selecting the path containing the vertices with the minimum cost.
We'll illustrate how the A* algorithm works using a simple 2D map (see figure 9.5). The map contains a grid with starting and ending squares and a barrier between them. Imagine this 2D map is represented as a graph, where each square is a vertex connected to its neighboring squares by edges. Vertices aren't connected diagonally, only horizontally and vertically.
The cost of each vertex is calculated using two measures, shown in figure 9.5 with a dotted line and a dashed line. The light gray square represents a vertex for which the cost is being calculated. The dotted line is a path between that vertex and the starting node (the path traversed so far). The length of that path is denoted by the letter G, and in this example it's equal to 2, because the distance between two neighbors is always 1 (each edge has a length, or weight, of 1).

Figure 9.5 A 2D map used to illustrate how the A* algorithm works. Its squares can be represented as nodes of a graph, with edges connecting the neighboring squares. The darker squares are barriers. The goal of the A* algorithm is to find the shortest path between the start and end nodes. The light gray square is the square for which the cost is being calculated. The dotted line is the path between the start vertex and the vertex under consideration. The dashed line is an estimated distance between the vertex under consideration and the end vertex.

The dashed line is an estimated path between the vertex under consideration and the end vertex. The length of that path is denoted by the letter H and is an estimation of how far the end vertex is from the current vertex. In this case, the value H is calculated as the length of a straight line from a vertex to the destination, but other estimation functions can be used too.
The final vertex cost, denoted by the letter F, is calculated by summing the G and H values:

F = G + H

The estimation function is central to the A* algorithm, and you can't use the A* algorithm on graphs that don't offer a way of calculating the estimated distance between two arbitrary vertices. The graph from section 9.1 is such a graph, because you can't estimate the distance between Marge and Milhouse.

You can calculate the distance if you know the path between them, but there's no way to estimate it only by examining the two vertices alone.
Figure 9.6 shows the A* algorithm in action. The algorithm holds vertices in the open and closed groups, and it keeps track of the current vertex. Squares in the closed group are denoted with a light gray background. Squares in the open group have a white background with the calculated numbers written on them. The upper two numbers are the G and H values, and the lower number is the final F value (the sum of G and H).
In the first iteration of the algorithm, the starting vertex is the current vertex. Its G value is equal to 0, and its H value isn't used. In each iteration, the A* algorithm puts the current vertex in the closed group and calculates the G, H, and F values of each of its neighbors not in the closed group. If a neighbor already had its F value calculated (which means it was in the open group), the old F value is compared with the new one, and the smaller one is used.

(Figure 9.6 shows four panels: Iteration 1; Iteration 2; Iteration 3; Iterations 4, 5, and 6. The per-square G, H, and F values appear only in the figure.)

Figure 9.6 An A* algorithm finding a path from the starting vertex (square with the circle) to the end vertex (vertex with an X). In each iteration, the algorithm calculates F, G, and H values for neighbors of the current vertex. Once the values are calculated, the vertices belong to the open group. Next, the algorithm selects the vertex from the open group with the lowest F value as the next current vertex and puts it in the closed group (light gray squares). Once the destination vertex is reached, the algorithm constructs the path to the starting vertex by following the lowest F values (darker gray squares). The black squares in the figure are barriers.


Next, a vertex from the open group with the minimum F value is selected as the new current vertex. If the new current vertex is the destination vertex, the final path is constructed by going back and following the vertices with minimal F values (or by using the vertex parent information, as is the case with our implementation). In figure 9.6, the final path is shown with the darker gray color. The darkest color denotes the barrier.

9.3.2 Implementing the A* algorithm
Now, let's build an implementation of the A* algorithm using the GraphX API. Here you'll use almost all the methods you learned in the previous sections. The complete algorithm is given in the following listing. You can also find it in our online repository. We'll go through it line by line.

Listing 9.1 GraphX implementation of the A* search algorithm

object AStar extends Serializable {
  import scala.reflect.ClassTag

private val checkpointFrequency = 20

  def run[VD: ClassTag, ED: ClassTag](graph:Graph[VD, ED],
      origin:VertexId, dest:VertexId, maxIterations:Int = 100,
      estimateDistance:(VD, VD) => Double,
      edgeWeight:(ED) => Double,
      shouldVisitSource:(ED) => Boolean = (in:ED) => true,
      shouldVisitDestination:(ED) => Boolean = (in:ED) => true):Array[VD] = {

    val resbuf = scala.collection.mutable.ArrayBuffer.empty[VD]

    val arr = graph.vertices.flatMap(n =>
      if(n._1 == origin || n._1 == dest)
        List[Tuple2[VertexId, VD]](n)
      else
        List()).collect()
    if(arr.length != 2)
      throw new IllegalArgumentException("Origin or destination not found")
    val origNode = if (arr(0)._1 == origin) arr(0)._2 else arr(1)._2
    val destNode = if (arr(0)._1 == origin) arr(1)._2 else arr(0)._2
    var dist = estimateDistance(origNode, destNode)

    case class WorkNode(origNode:VD, g:Double=Double.MaxValue,
      h:Double=Double.MaxValue, f:Double=Double.MaxValue,
      visited:Boolean=false, predec:Option[VertexId]=None)

    var gwork = graph.mapVertices{ case(ind, node) => {
      if(ind == origin) WorkNode(node, 0, dist, dist)
      else WorkNode(node)
    }}.cache()

    var currVertexId:Option[VertexId] = Some(origin)
    var lastIter = 0


    for(iter <- 0 to maxIterations
        if currVertexId.isDefined;
        if currVertexId.getOrElse(Long.MaxValue) != dest) {
      lastIter = iter
      println("Iteration "+iter)
      gwork.unpersistVertices()
      gwork = gwork.mapVertices((vid:VertexId, v:WorkNode) => {
        if(vid != currVertexId.get) v
        else WorkNode(v.origNode, v.g, v.h, v.f, true, v.predec)
      }).cache()
      if(iter % checkpointFrequency == 0)
        gwork.checkpoint()

      val neighbors = gwork.subgraph(trip => trip.srcId == currVertexId.get ||
        trip.dstId == currVertexId.get)
      val newGs = neighbors.aggregateMessages[Double](ctx => {
          if(ctx.srcId == currVertexId.get && !ctx.dstAttr.visited &&
              shouldVisitDestination(ctx.attr)) {
            ctx.sendToDst(ctx.srcAttr.g + edgeWeight(ctx.attr))
          } else if(ctx.dstId == currVertexId.get && !ctx.srcAttr.visited &&
              shouldVisitSource(ctx.attr)) {
            ctx.sendToSrc(ctx.dstAttr.g + edgeWeight(ctx.attr))
          }},
        (a1:Double, a2:Double) => a1, // never supposed to happen
        TripletFields.All)

      val cid = currVertexId.get
      gwork = gwork.outerJoinVertices(newGs)((nid, node, totalG) =>
        totalG match {
          case None => node
          case Some(newG) => {
            if(node.h == Double.MaxValue) {
              val h = estimateDistance(node.origNode, destNode)
              WorkNode(node.origNode, newG, h, newG+h, false, Some(cid))
            }
            else if(node.h + newG < node.f) { // the new f is less than the old one
              WorkNode(node.origNode, newG, node.h, newG+node.h, false, Some(cid))
            }
            else node
          }})

      val openList = gwork.vertices.filter(v =>
        v._2.h < Double.MaxValue && !v._2.visited)
      if(openList.isEmpty)
        currVertexId = None
      else {
        val nextV = openList.map(v => (v._1, v._2.f)).
          reduce((n1, n2) => if(n1._2 < n2._2) n1 else n2)
        currVertexId = Some(nextV._1)
      }
    }


    if(currVertexId.isDefined && currVertexId.get == dest) {
      var currId:Option[VertexId] = Some(dest)
      var it = lastIter
      while(currId.isDefined && it >= 0) {
        val v = gwork.vertices.filter(x => x._1 == currId.get).collect()(0)
        resbuf += v._2.origNode
        currId = v._2.predec
        it = it - 1
      }
    }
    else
      println("Path not found!")
    gwork.unpersist()
    resbuf.toArray.reverse
  }
}

INITIALIZING THE ALGORITHM
You run the A* algorithm by calling the run method of the AStar object. The arguments are described here:
■ graph—Graph on which to run the A* algorithm (required).
■ origin—Origin vertex ID (required).
■ dest—Destination vertex ID (required).
■ maxIterations—Maximum number of iterations to run. Default is 100.
■ estimateDistance—Function that takes two vertex property objects and estimates the distance between them (required).
■ edgeWeight—Function that takes an edge property object and calculates its weight: a measure of how costly it would be to include the edge in the path (required).
■ shouldVisitSource—Function that takes an edge property object and determines whether to visit the source vertex. Defaults to true for all edges and vertices. Can be used to simulate a bidirectional graph with unidirectional edges.
■ shouldVisitDestination—Function that takes an edge property object and determines whether to visit the destination vertex. Defaults to true for all edges and vertices. Can be used to simulate a bidirectional graph with unidirectional edges.
The estimateDistance function returns an estimated distance between two vertices, determined by their property objects. For example, as shown in figure 9.6, it takes the X and Y coordinates of two squares and calculates the distance between them using the Pythagorean formula. The edgeWeight function determines the weight for each edge. In figure 9.6, it always returned 1, because squares are connected only to their neighboring squares, which are always one square away. shouldVisitSource and shouldVisitDestination determine whether to visit the edges' source and destination vertices, respectively. You can use these functions to visit both vertices even if the graph is unidirectional, or to determine whether an edge is one-way for only some edges.

If your graph has edges going in both directions, then only shouldVisitDestination should return true; shouldVisitSource should return false.
The run method first checks whether the supplied origin and destination vertex IDs exist in the graph and estimates the starting distance to the destination:

val arr = graph.vertices.flatMap(n =>
  if(n._1 == origin || n._1 == dest)
    List[Tuple2[VertexId, VD]](n)
  else
    List()).collect()
if(arr.length != 2)
  throw new IllegalArgumentException("Origin or destination not found")
val origNode = if (arr(0)._1 == origin) arr(0)._2 else arr(1)._2
val destNode = if (arr(0)._1 == origin) arr(1)._2 else arr(0)._2
var dist = estimateDistance(origNode, destNode)

It then prepares a working graph, which will be used to calculate the F, G, and H values. The WorkNode case class is used for this purpose:

case class WorkNode(origNode:VD, g:Double=Double.MaxValue,
  h:Double=Double.MaxValue, f:Double=Double.MaxValue,
  visited:Boolean=false, predec:Option[VertexId]=None)

WorkNode keeps the original node object in the origNode attribute, but it also contains attributes for the G, H, and F values, the visited flag (if visited is true, the vertex is in the closed group), and the ID of the predecessor vertex (the predec attribute), used to construct the final path.
The working graph (gwork) is created by mapping vertices to WorkNode objects. The origin vertex's F, G, and H values are also set in the process:

var gwork = graph.mapVertices{ case(ind, node) => {
  if(ind == origin) WorkNode(node, 0, dist, dist)
  else WorkNode(node)
}}.cache()

Finally, the origin is set as the current vertex:

var currVertexId:Option[VertexId] = Some(origin)

All the prerequisites for the main loop are now complete.
UNDERSTANDING THE MAIN LOOP
currVertexId will be set to None if there are no more vertices in the open group and the destination hasn't been reached.


It will be equal to the destination vertex ID if the destination has been reached. These are the two conditions for staying in the main loop; the third condition is the number of iterations:

for(iter <- 0 to maxIterations if currVertexId.isDefined; if currVertexId.getOrElse(Long.MaxValue) != dest) {

In the main loop, the following happens:
1 The current vertex is marked as visited (the equivalent of placing it in the closed group).
2 The F, G, and H values of the current vertex's neighboring vertices are calculated.
3 The next current vertex is selected from the open group.
Because the working graph is reused in the main loop, it's cached, and then its vertices are unpersisted before the next iteration. Edges aren't modified, so they stay cached.

Caching and checkpointing graphs
Many graph algorithms need to reuse the same data repeatedly, so it's helpful to have your graph data readily available in memory. Graph's cache method does this for you by caching both vertices and edges. By default, Spark caches vertex and edge data in memory only, but other storage levels can be specified when using the persist method. It takes a StorageLevel as an argument, which has several constants defined (DISK_ONLY, MEMORY_AND_DISK, and so on). You can find more information about storage levels at http://mng.bz/pM17.
But if a graph's data is continually changed and cached, it will soon fill the available memory and force the JVM to run garbage collection (GC) often. Spark can free the cached data more efficiently than the JVM can, so if you frequently persist graph data, you should also unpersist it. To do this, you can use the unpersist method, which uncaches both edges and vertices, or the unpersistVertices method, which uncaches only vertices. Both methods take a boolean argument that determines whether to block until the data is uncached.
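To illustrate, here's a minimal sketch of persisting and releasing a graph; myGraph is a hypothetical graph value standing in for whatever graph your algorithm reuses:

import org.apache.spark.storage.StorageLevel

val cached = myGraph.persist(StorageLevel.MEMORY_AND_DISK) // cache vertices and edges
// ... run iterative computations that reuse `cached` ...
cached.unpersistVertices(blocking = false) // uncache only the vertices
cached.unpersist(blocking = false)         // uncache both vertices and edges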

MARKING THE CURRENT VERTEX AS VISITED
Because RDDs are immutable, you can't change a vertex's property object. You need to transform the vertices RDD. This in itself is simple:

gwork = gwork.mapVertices((vid:VertexId, v:WorkNode) => {
  if(vid != currVertexId.get) v
  else WorkNode(v.origNode, v.g, v.h, v.f, true, v.predec)
}).cache()


You keep the work nodes of all the vertices unchanged, except the current vertex, for which only the visited property is set to true.
CHECKPOINTING THE WORKING GRAPH
The working graph needs to be checkpointed periodically (every 20th iteration), because its RDD DAG grows too large and gradually becomes too expensive to compute:

if(iter % checkpointFrequency == 0)
  gwork.checkpoint()

The working graph is continuously transformed, and if too many iterations are performed, that can lead to stack-overflow errors. Periodic checkpointing avoids this by persisting the DAG plan.
CALCULATING F, G, AND H VALUES
The F, G, and H values need to be calculated only for neighboring vertices, so you first create a subgraph of your working graph containing only edges to and from the current vertex:

val neighbors = gwork.subgraph(trip => trip.srcId == currVertexId.get ||
  trip.dstId == currVertexId.get)

Next, you send a message to each of those neighbors if they haven't already been visited and if the shouldVisit* function says you should visit them. The message you send to each node (using sendToSrc or sendToDst) contains the new G value for that vertex, calculated using the provided edgeWeight function. The result (newGs) is a VertexRDD containing only the new G values as the vertices' values:

val newGs = neighbors.aggregateMessages[Double](ctx => {
    if(ctx.srcId == currVertexId.get && !ctx.dstAttr.visited &&
        shouldVisitDestination(ctx.attr)) {
      ctx.sendToDst(ctx.srcAttr.g + edgeWeight(ctx.attr))
    } else if(ctx.dstId == currVertexId.get && !ctx.srcAttr.visited &&
        shouldVisitSource(ctx.attr)) {
      ctx.sendToSrc(ctx.dstAttr.g + edgeWeight(ctx.attr))
    }},
  (a1:Double, a2:Double) => a1,
  TripletFields.All)

In the next step, the graph containing the new G values is joined with the working graph. If the new F value (the new G plus the H value) is less than the current F value, the vertex is updated, and its predec (predecessor) field is set to the current vertex ID. For those vertices that don't have a new G value calculated in the newGs graph (they aren't neighbors of the current vertex), the totalG value in the join function will be None, and the function leaves the existing vertex object unchanged:

val cid = currVertexId.get
gwork = gwork.outerJoinVertices(newGs)((nid, node, totalG) =>
  totalG match {
    case None => node


    case Some(newG) => {
      if(node.h == Double.MaxValue) {
        val h = estimateDistance(node.origNode, destNode)
        WorkNode(node.origNode, newG, h, newG+h, false, Some(cid))
      }
      else if(node.h + newG < node.f) {
        WorkNode(node.origNode, newG, node.h, newG+node.h, false, Some(cid))
      }
      else node
    }
  })

The new gwork graph now contains the updated F values for all vertices in the current open group. All that's left to do is to choose the next current vertex.
SELECTING THE NEXT CURRENT VERTEX
To find the next current vertex, first select only the vertices in the open group:

val openList = gwork.vertices.filter(v => v._2.h < Double.MaxValue && !v._2.visited)

If the open group is empty, the destination vertex can’t be reached. Otherwise, the vertex with the smallest F value is found:

if(openList.isEmpty)
  currVertexId = None
else {
  val nextV = openList.map(v => (v._1, v._2.f)).
    reduce((n1, n2) => if(n1._2 < n2._2) n1 else n2)
  currVertexId = Some(nextV._1)
}

COLLECTING THE FINAL PATH VERTICES
Finally, after the main loop finishes and the destination is reached, currVertexId is equal to the destination vertex ID. The A* algorithm then follows the predec fields of each vertex, takes the corresponding vertex property objects, and puts them in the final collection (resbuf, a Scala mutable ArrayBuffer):

if(currVertexId.isDefined && currVertexId.get == dest) {
  println("Found!")
  var currId:Option[VertexId] = Some(dest)
  var it = lastIter
  while(currId.isDefined && it >= 0) {
    val v = gwork.vertices.filter(x => x._1 == currId.get).collect()(0)
    resbuf += v._2.origNode
    currId = v._2.predec
    it = it - 1
  }
}

The contents of resbuf are then reversed and returned as the final shortest path:

resbuf.toArray.reverse


9.3.3 Testing the implementation
Let's test the implementation on a simple, easy-to-visualize dataset. The graph you'll use connects points in a three-dimensional space. Each vertex in the graph contains X, Y, and Z coordinates. You'll use the following case class to denote the points:

case class Point(x:Double, y:Double, z:Double)

And you’ll create the graph by specifying the points manually:

val vertices3d = sc.parallelize(Array((1L, Point(1,2,4)), (2L, Point(6,4,4)),
  (3L, Point(8,5,1)), (4L, Point(2,2,2)), (5L, Point(2,5,8)),
  (6L, Point(3,7,4)), (7L, Point(7,9,1)), (8L, Point(7,1,2)),
  (9L, Point(8,8,10)), (10L, Point(10,10,2)), (11L, Point(8,4,3))))
val edges3d = sc.parallelize(Array(Edge(1, 2, 1.0), Edge(2, 3, 1.0),
  Edge(3, 4, 1.0), Edge(4, 1, 1.0), Edge(1, 5, 1.0), Edge(4, 5, 1.0),
  Edge(2, 8, 1.0), Edge(4, 6, 1.0), Edge(5, 6, 1.0), Edge(6, 7, 1.0),
  Edge(7, 2, 1.0), Edge(2, 9, 1.0), Edge(7, 9, 1.0), Edge(7, 10, 1.0),
  Edge(10, 11, 1.0), Edge(9, 11, 1.0)))
val graph3d = Graph(vertices3d, edges3d)

The corresponding points and the edges between them are shown in figure 9.7.


Figure 9.7 The example graph used to test the A* algorithm implementation. The vertices of the graph are points in three-dimensional space. The shortest path from vertex 1 to vertex 10 is shown with a thick line. (The graphic was drawn using Python’s matplotlib library.)


The edges of the graph all have a value of 1. The calcDistance3d function calculates distances between two points in three-dimensional space:

val calcDistance3d = (p1:Point, p2:Point) => {
  val x = p1.x - p2.x
  val y = p1.y - p2.y
  val z = p1.z - p2.z
  Math.sqrt(x*x + y*y + z*z)
}

You can use this function to map edges and calculate the real distance between the vertices each edge connects:

val graph3dDst = graph3d.mapTriplets(t => calcDistance3d(t.srcAttr, t.dstAttr))

Because the A* algorithm uses checkpointing (covered in chapter 4), before running this algorithm, set the Spark checkpointing directory:

sc.setCheckpointDir("/spark/checkpoint/directory")

Now you can run the A* algorithm on the graph3dDst graph. The same calcDistance3d function can be used to calculate the H values (the distance from a vertex to the destination vertex). The function that calculates edge weights can simply return the property object attached to each edge, because you just calculated the weight of each edge and graph3dDst already contains them. Let's calculate the shortest path between vertices 1 and 10:

scala> AStar.run(graph3dDst, 1, 10, 50, calcDistance3d, (e:Double) => e)
res0: Array[Point] = Array(Point(1.0,2.0,4.0), Point(6.0,4.0,4.0),
  Point(7.0,9.0,1.0), Point(10.0,10.0,2.0))

The resulting shortest path is shown with a thick line in figure 9.7.

9.4 Summary
■ A graph is a mathematical concept of linked objects, used to model relations between them.
■ Graphs consist of vertices (nodes in a graph) and edges that connect the vertices. In Spark, edges are directed, and both edges and vertices have property objects attached to them.
■ In Spark, vertices and edges are realized by two special RDD implementations: VertexRDD and EdgeRDD.
■ You can construct a graph object with an RDD containing tuples of vertex IDs and vertex property objects, and an RDD containing Edge objects.


■ You can transform graph objects using the mapEdges, mapVertices, and mapTriplets methods to add more information to the edges and vertices or to transform the property objects attached to them.
■ Using the aggregateMessages method, you can run a function on each vertex of a graph and optionally send messages to its neighboring vertices, collecting and aggregating all the messages sent to each vertex in a new VertexRDD.
■ You can join two graphs using the outerJoinVertices method. aggregateMessages and outerJoinVertices, along with the mapping functions, are the bread and butter of graph operations in GraphX. You'll use them often in your GraphX applications.
■ GraphX contains an implementation of a Pregel-like API. It also functions by sending messages throughout the graph.
■ Graphs can be filtered using the subgraph, mask, and filter functions.
■ The shortest-paths algorithm finds the shortest paths from a set of vertices to every other vertex in a graph. You can use it for various tasks, such as finding the smallest number of links you need to follow from one page to another.
■ Page rank finds the relative importance of a vertex based on the number of edges going in or out of it. You can use it to find the most influential nodes in a graph (such as web pages and the links between them, or people in a social network).
■ The connected-components algorithm finds distinct, disjoint subgraphs of a graph.
■ The strongly-connected-components algorithm finds clusters of doubly connected vertices.
■ The A* search algorithm is a quick algorithm for finding paths. We presented a GraphX implementation of the A* algorithm for educational purposes.

Part 3 Spark ops

Using Spark isn’t just about writing and running Spark applications. It’s also about configuring Spark clusters and system resources to be used efficiently by applications. The necessary concepts and configuration options for running Spark applications on Spark standalone, Hadoop YARN, and Mesos clusters are explained in this part of the book. Chapter 10 explores Spark runtime components, Spark cluster types, job and resource scheduling, configuring Spark, and the Spark web UI. These are con- cepts common to all cluster managers that Spark can run on: the Spark stand- alone cluster, YARN, and Mesos. The two local modes are also explained in chapter 10. You’ll learn about the Spark standalone cluster in chapter 11: its compo- nents, how to start it and run applications on it, and how to use its web UI. Spark History Server, which keeps details about previously run jobs, is also discussed. You’ll also learn how to use Spark’s scripts to start up a Spark standalone cluster on Amazon EC2. Chapter 12 goes through the specifics of setting up, configuring, and using YARN and Mesos clusters for running Spark applications.

Running Spark

This chapter covers
■ Spark runtime components
■ Spark cluster types
■ Job and resource scheduling
■ Configuring Spark
■ Spark web UI
■ Running Spark on the local machine

In previous chapters, we mentioned different ways to run Spark. In this and the next two chapters, we'll discuss ways to set up a Spark cluster. A Spark cluster is a set of interconnected processes, usually running in a distributed manner on different machines. The main cluster types that Spark runs on are YARN, Mesos, and Spark standalone. Two other runtime options, local mode and local cluster mode, although the easiest and quickest methods of setting up Spark, are used mainly for testing purposes. The local mode is a pseudo-cluster running on a single machine, and the local cluster mode is a Spark standalone cluster that's also confined to a single machine. If all this sounds confusing, don't worry. We'll explain these concepts in this chapter one step at a time.



In this chapter, we’ll also describe common elements of the Spark runtime archi- tecture that apply to all the Spark cluster types. For example, driver and executor pro- cesses, as well as Spark context and scheduler objects, are common to all Spark runtime modes. Job and resource scheduling also function similarly on all cluster types, as do usage and configuration for the Spark web UI, used to monitor the execu- tion of Spark jobs. We’ll also show you how to configure a runtime instance of Spark, which is similar for all cluster types. The number of different configuration options and different run- time modes can be bewildering; we’ll explain and list the usage of the most important parameters and configuration options. Familiarity with these common concepts will help you understand specific cluster types in the following two chapters and is very important if you want to control how your Spark programs run. 10.1 An overview of Spark’s runtime architecture When talking about the Spark runtime architecture, we can distinguish the specifics of various cluster types and the typical Spark components shared by all. The next two chap- ters describe the details of Spark standalone, YARN, and Mesos clusters. Here we’ll describe typical Spark components that are the same regardless of the runtime mode you choose.

10.1.1 Spark runtime components
A basic familiarity with Spark runtime components will help you understand how your jobs work. Figure 10.1 shows the main Spark components running in a cluster: the client, the driver, and the executors.

Figure 10.1 Spark runtime components in cluster-deploy mode. Application tasks running in task slots are labeled with a T. Unoccupied task slots are shown as white boxes.

An overview of Spark’s runtime architecture 287

The physical placement of executor and driver processes depends on the cluster type and its configuration. For example, some of these processes could share a single physical machine, or they could all run on different ones. Figure 10.1 shows only the logical components in cluster-deploy mode.
RESPONSIBILITIES OF THE CLIENT-PROCESS COMPONENT
The client process starts the driver program. The client process can be a spark-submit script for running applications, a spark-shell script, or a custom application using the Spark API. The client process prepares the classpath and all configuration options for the Spark application. It also passes application arguments, if any, to the application running in the driver.
RESPONSIBILITIES OF THE DRIVER COMPONENT
The driver orchestrates and monitors execution of a Spark application. There is always one driver per Spark application. You can think of the driver as a wrapper around the application. The driver and its subcomponents (the Spark context and scheduler) are responsible for the following:
■ Requesting memory and CPU resources from cluster managers
■ Breaking application logic into stages and tasks
■ Sending tasks to executors
■ Collecting the results
Chapter 4 describes the logic behind dividing the application's work into stages and tasks. There are two basic ways the driver program can be run:
■ Cluster-deploy mode is shown in figure 10.1. In this mode, the driver process runs as a separate JVM process in the cluster, and the cluster manages its resources (mostly JVM heap memory).
■ Client-deploy mode is shown in figure 10.2. In this mode, the driver runs in the client's JVM process and communicates with the executors managed by the cluster.
The deploy mode you choose affects how you configure Spark and the resource requirements of the client JVM. We'll talk about that in the following chapters.

Figure 10.2 Spark runtime components in client-deploy mode. The driver is running in the client's JVM process.


RESPONSIBILITIES OF THE EXECUTORS
The executors, which are JVM processes, accept tasks from the driver, execute those tasks, and return the results to the driver. The example drivers in figures 10.1 and 10.2 use only two executors, but you can use a much larger number (some companies today run Spark clusters with tens of thousands of executors).
Each executor has several task slots for running tasks in parallel. The executors in the figures have six task slots each. The slots in white boxes are vacant. You can set the number of task slots to a value two or three times the number of CPU cores. Although these task slots are often referred to as CPU cores in Spark, they're implemented as threads and don't have to correspond to the number of physical CPU cores on the machine.
CREATION OF THE SPARK CONTEXT
Once the driver is started, it starts and configures an instance of SparkContext. You've seen code examples of this in previous chapters. When running a Spark REPL shell, the shell is the driver program. Your Spark context is already preconfigured and available as the sc variable. When running a standalone Spark application by submitting a JAR file or by using the Spark API from another program, your Spark application starts and configures the Spark context. There can be only one Spark context per JVM.

NOTE Although the configuration option spark.driver.allowMultipleContexts exists, it's misleading, because using multiple Spark contexts is discouraged. This option is used only for Spark internal tests, and we recommend that you don't use it in your user programs. If you do, you may get unexpected results while running more than one Spark context in a single JVM.

As you've seen in previous chapters, a Spark context comes with many useful methods for creating RDDs, loading data, and so on. It's the main interface for accessing the Spark runtime.
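As a reminder of what this looks like in a standalone application, here's a minimal sketch that creates, uses, and stops its own Spark context; the application name and the example computation are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyApp")
    val sc = new SparkContext(conf)   // only one SparkContext per JVM
    try {
      // the context is the main interface to the Spark runtime
      val count = sc.parallelize(1 to 1000).count()
      println(s"Counted $count elements")
    } finally {
      sc.stop()                       // release the context when the application is done
    }
  }
}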

10.1.2 Spark cluster types
Although Spark can run in local mode and on Spark standalone, YARN, and Mesos clusters, one of these may be more applicable for your environment and use cases. In this section, you'll find the pros and cons of each cluster type.
SPARK STANDALONE CLUSTER
A Spark standalone cluster is a Spark-specific cluster. Because a standalone cluster is built specifically for Spark applications, it doesn't support communication with an HDFS secured with the Kerberos authentication protocol. If you need that kind of security, use YARN to run Spark. A Spark standalone cluster, however, provides faster job startup than jobs running on YARN. We'll cover Spark standalone clusters in chapter 11.


YARN CLUSTER YARN is Hadoop’s resource manager and execution system. It’s also known as Map- Reduce 2 because it superseded the MapReduce engine in Hadoop 1 that supported only MapReduce jobs. Running Spark on YARN has several advantages: ■ Many organizations already have YARN clusters of a significant size, along with the technical know-how, tools, and procedures for managing and monitoring them. ■ YARN lets you run different types of Java applications, not just Spark, so you can mix legacy Hadoop and Spark applications with ease. ■ YARN provides methods for isolating and prioritizing applications among users and organizations, functionality the standalone cluster doesn’t have. ■ It’s the only cluster type that supports Kerberos-secured HDFS. ■ You don’t have to install Spark on all nodes in the cluster. MESOS CLUSTER Mesos is a scalable and fault-tolerant distributed systems kernel written in C++. Run- ning Spark in a Mesos cluster also has advantages. Unlike YARN, Mesos supports C++ and Python applications. And unlike YARN and a standalone Spark cluster, which only schedule memory, Mesos provides scheduling of other types of resources (for exam- ple, CPU, disk space, and ports), although these additional resources aren’t used by Spark in the current release (1.6). Mesos has additional options for job scheduling that other cluster types don’t have (for example, fine-grained mode). Mesos is a “scheduler of scheduler frameworks” because of its two-level scheduling architecture. The jury is still out on whether YARN or Mesos is better; but now, with the Myriad project (http://myriad.incubator.apache.org/), you can run YARN on top of Mesos to solve the dilemma. We’ll discuss YARN and Mesos in chapter 12. SPARK LOCAL MODES Spark local mode and Spark local cluster mode are special cases of a Spark standalone cluster running on a single machine. Because these cluster types are easy to set up and use, they’re convenient for quick tests, but they shouldn’t be used in a production environment. Furthermore, in these local modes, the workload isn’t distributed, thus providing the resource restrictions of a single machine and suboptimal performance. Of course, true high availability isn’t possible on a single machine. We’ll get into the details in section 10.6. 10.2 Job and resource scheduling Resources for Spark applications are scheduled as executors (JVM processes) and CPU (task slots) and then memory is allocated to them. The cluster manager of the cur- rently running cluster and the Spark scheduler grant resources for execution of Spark jobs.


The cluster manager starts the executor processes requested by the driver and starts the driver process itself when running in cluster-deploy mode. The cluster manager can also restart and stop the processes it has started and can set the maximum number of CPUs that executor processes can use.
Once the application's driver and executors are running, the Spark scheduler communicates with them directly and decides which executors will run which tasks. This is called job scheduling, and it affects CPU resource usage in the cluster. Indirectly, it also affects memory use, because more tasks running in a single JVM use more of its heap. The memory, however, isn't directly managed at the task level the way CPU is. Spark manages the JVM heap memory allocated by the cluster manager by separating it into several segments, as you'll soon learn.
A set of dedicated executors is allocated for each Spark application running in a cluster. If several Spark applications (and, possibly, applications of other types) run in a single cluster, they compete for the cluster's resources. Thus, two levels of Spark resource scheduling exist:
■ Cluster resource scheduling for allocating resources to the Spark executors of different Spark applications
■ Spark resource scheduling for scheduling CPU and memory resources within a single application

10.2.1 Cluster resource scheduling
Cluster resource scheduling (dividing cluster resources among several applications running in a single cluster) is the responsibility of the cluster manager. It works similarly on all cluster types supported by Spark, with minor differences. All supported cluster managers provide the requested resources for each application and free those resources when the application closes.
Mesos is unique among the three cluster types: its fine-grained scheduler can allocate resources for each task, instead of for each application. This way, an application can use resources not currently requested by other applications. For more details, we'll explain standalone clusters in chapter 11 and Mesos and YARN clusters in chapter 12. You'll learn about Spark local modes at the end of this chapter.

10.2.2 Spark job scheduling
Once the cluster manager allocates CPU and memory resources for the executors, scheduling of jobs occurs within the Spark application. Job scheduling depends solely on Spark and doesn't rely on the cluster manager. It's implemented by a mechanism for deciding how to split jobs into tasks and how to choose which executors will execute them. As you saw in chapter 4, Spark creates jobs, stages, and tasks based on the RDD's lineage. The scheduler then distributes these tasks to executors and monitors their execution.


It’s also possible for several users (multiple threads) to use the same SparkContext object simultaneously (SparkContext is thread-safe). In that case, several jobs of the same SparkContext compete for its executors’ resources. Figures 10.1 and 10.2 showed each driver using a single scheduler for distributing tasks. Actually, several scheduling objects are in play, but all of them can be abstracted as a single one, as shown in the figures. Spark grants CPU resources in one of two ways: FIFO (first-in-first-out) scheduling and fair scheduling. The Spark parameter spark.scheduler.mode sets the scheduler mode, and it takes two possible values: FAIR and FIFO. (For details on setting Spark parameters, see section 10.3.) FIFO SCHEDULING FIFO scheduling functions on the principle of first-come, first-served. The first job that requests resources takes up all the necessary (and available) executor task slots (assume each job consists of only one stage). If there are 500 task slots in the cluster, Driver Executor and the first job requires only 50, other jobs Spark job 1 T TT can use the remaining 450 while the first job T TT runs. But if the first job requires 800 task T TT slots, all the other jobs must wait until the first Spark job 2 job’s tasks have released executor resources. Executor Figure 10.3 shows another example. T TT T TT In this example, a driver has 2 execu- T TT tors, each having 6 task slots, and 2 Spark T TT jobs to run. The first job has 15 tasks that need to be executed, 12 of which are cur- Figure 10.3 FIFO scheduling mode with two jobs rently executing. The second job has 6 and two executors. Tasks from job 2 must wait until tasks, but they need to wait because job 1 all tasks from the job 1 finish executing. was first to request the resources. Compare this figure with the fair scheduler example in figure 10.4. FIFO scheduling is the Driver Executor default scheduler mode, and it works best Spark job 1 T TT if your application is a single-user applica- T TT T TT tion that is running only one job at a time. T TT FAIR SCHEDULING Executor Fair scheduling evenly distributes available T TT resources (executor threads) among com- T TT peting Spark jobs in a round-robin fashion. Spark job 2 T TT It’s a better option for multiuser applica- tions running several jobs simultaneously. Spark’s fair scheduler was inspired by Figure 10.4 Fair scheduling mode with two jobs and two executors. Tasks from job 2 execute in par- YARN’s fair scheduler (described in chap- allel with tasks from job 1, although job 1 was first ter 12). Figure 10.4 provides an example. to request resources.


The number of jobs and tasks in the example is the same as in the FIFO scheduling example, so that you can see the difference between the two scheduling modes. Tasks from job 2 are now executing in parallel with tasks from job 1. Other tasks from job 1 are waiting for free task slots. In this way, a shorter-running job (job 2) can run immediately without having to wait for the longer-running job (job 1) to finish, although job 2 wasn't the first to request task slots.
The fair scheduler has a concept of scheduler pools, similar to YARN's queues covered in chapter 12. Each pool has a weight value, which determines the amount of resources its jobs get compared to jobs from other pools, and a minimum share value, which determines the number of CPU cores the pool has at its disposal at all times. You specify the pool configuration by setting the spark.scheduler.allocation.file parameter, which must point to an XML configuration file. You can find an example XML configuration in the conf/fairscheduler.xml.template file of your Spark installation. Using pools with the fair scheduler lets you set priorities for different users or job types.
SPECULATIVE EXECUTION OF TASKS
An additional option for configuring the way Spark dispatches tasks to executors is speculative execution, which attempts to solve the problem of stragglers (tasks that take longer to finish than other tasks in the same stage). One of your executors may get bogged down by some other process using up all of its CPU resources, which prevents it from finishing your tasks in a timely fashion. In that case, if speculative execution is turned on, Spark may try to run the same task for that partition on some other executor. If that happens and the new task finishes, Spark accepts the results of the new task and discards the old task's results. That way, a single malfunctioning executor doesn't cause the job to stall.
Speculative execution is turned off by default. Turn it on by setting the spark.speculation parameter to true. When turned on, Spark checks, every spark.speculation.interval, whether any of the tasks need to be restarted. Two additional parameters determine the criteria for selecting which tasks need to be started again: spark.speculation.quantile determines the percentage of tasks that need to complete before speculation is started for a stage, and spark.speculation.multiplier sets how many times slower a task needs to be, compared to the median task duration, before it's considered for restarting.
For some jobs (for example, those that write to external systems such as relational databases), speculative execution isn't desirable, because two tasks can run simultaneously on the same partition and write the same data to the external system. Although one task may finish earlier than the other, it may be prudent to explicitly disable speculation, especially if you don't have complete control over the Spark configuration affecting your application.
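To tie these parameters together, here's a hedged sketch that enables fair scheduling and speculative execution and assigns jobs submitted from one thread to a named pool; the allocation-file path and the pool name are placeholders:

val conf = new org.apache.spark.SparkConf()
conf.set("spark.scheduler.mode", "FAIR")          // use the fair scheduler instead of FIFO
conf.set("spark.scheduler.allocation.file",
  "/path/to/fairscheduler.xml")                   // pool definitions (placeholder path)
conf.set("spark.speculation", "true")             // restart straggling tasks
conf.set("spark.speculation.quantile", "0.75")    // speculate after 75% of a stage's tasks finish
conf.set("spark.speculation.multiplier", "1.5")   // a task must be 1.5 times slower than the median

val sc = new org.apache.spark.SparkContext(conf)
// jobs submitted from this thread run in the pool named "production"
sc.setLocalProperty("spark.scheduler.pool", "production")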


10.2.3 Data-locality considerations
Data locality means Spark tries to run tasks as close to the data location as possible. This affects the selection of executors on which to run tasks and is, therefore, related to job scheduling.
Spark tries to maintain a list of preferred locations for each partition. A partition's preferred location is a list of hostnames or executors where the partition's data resides, so that computation can be moved closer to the data. This information can be obtained for RDDs that are based on HDFS data (HadoopRDD) and for cached RDDs. In the case of RDDs based on HDFS data, the Hadoop API gets this information from the HDFS cluster. In the case of cached RDDs, Spark itself tracks which executor each partition is cached on.
If Spark obtains a list of preferred locations, the Spark scheduler tries to run tasks on the executors where the data is physically present, so that no data transfer is required. This can have a big impact on performance. There are five levels of data locality:
■ PROCESS_LOCAL—Execute a task on the executor that cached the partition.
■ NODE_LOCAL—Execute a task on the node where the partition is available.
■ RACK_LOCAL—Execute the task on the same rack as the partition, if rack information is available in the cluster (currently only on YARN).
■ NO_PREF—No preferred locations are associated with the task.
■ ANY—Default if everything else fails.
If a task slot with the best locality for the task can't be obtained (that is, all the matching task slots are taken), the scheduler waits a certain amount of time and then tries a location with the second-best locality, and so on. The Locality Level column in the Tasks table (section 10.4.2) on the Stage Details page of the Spark web UI shows the locality level for a specific task.
The amount of time the scheduler waits for each locality level before moving to the next is determined by the spark.locality.wait parameter. The default is 30 seconds. You can also set wait times for specific locality levels with spark.locality.wait.process, spark.locality.wait.node, and spark.locality.wait.rack. If any of these parameters is set to 0, the corresponding level is ignored, and tasks won't be assigned according to that level.
If you set any of these wait times to a much higher value, you force the scheduler to always honor the desired locality level by waiting until it becomes available. For example, you can force HDFS data to always be processed on the node where the data resides by increasing spark.locality.wait.node to 10 minutes. This also means a wait of 10 minutes is theoretically possible if something gets stuck on the node in question. Use a high value in situations where data locality is of critical importance (that is, other nodes should not be allowed to process the data).
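For illustration, here's a minimal sketch of tuning these waits; the values are examples, not recommendations:

val conf = new org.apache.spark.SparkConf()
conf.set("spark.locality.wait", "30s")        // base wait before falling back to a worse locality level
conf.set("spark.locality.wait.node", "10m")   // insist on node-local processing (for example, HDFS data)
conf.set("spark.locality.wait.rack", "0")     // skip the rack-local level entirely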


10.2.4 Spark memory scheduling
We'll now move away from CPU resource scheduling and see how Spark schedules memory resources. We said previously that the cluster manager allocates memory for Spark executor JVM processes (and for the driver process, if the driver is running in cluster-deploy mode). Once memory is allocated, Spark schedules and manages its usage with jobs and tasks. Memory problems can be frequent in Spark programs, so it's important to understand how Spark manages memory and how you can configure its management.
MEMORY MANAGED BY THE CLUSTER MANAGER
You set the amount of memory you want allocated for your executors with the spark.executor.memory parameter. (Setting Spark parameters is discussed in the next section.) You can use the g (for gigabytes) and m (for megabytes) suffixes. The default executor memory size is 512 MB (512m). The cluster manager allocates the amount of memory specified with the spark.executor.memory parameter, and Spark then uses and partitions that memory.
MEMORY MANAGED BY SPARK
Spark reserves parts of that memory for cached data storage and for temporary shuffle data. Set the heap for these with the parameters spark.storage.memoryFraction (default 0.6) and spark.shuffle.memoryFraction (default 0.2). Because these parts of the heap can grow before Spark can measure and limit them, two additional safety parameters must be set: spark.storage.safetyFraction (default 0.9) and spark.shuffle.safetyFraction (default 0.8). The safety parameters lower the memory fraction by the amount specified, as shown in figure 10.5.
The actual part of the heap used for storage is, by default, 0.6 × 0.9 (the safety fraction times the storage memory fraction), which equals 54%. Similarly, the part of the heap used for shuffle data is 0.2 × 0.8 (the safety fraction times the shuffle memory fraction), which equals 16%. You then have 30% of the heap reserved for other Java objects and resources needed to run tasks. You should, however, count on only 20%.
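As a small illustration of those fractions (the 4 GB heap is an arbitrary example):

val conf = new org.apache.spark.SparkConf()
conf.set("spark.executor.memory", "4g")           // heap the cluster manager allocates per executor
conf.set("spark.storage.memoryFraction", "0.6")   // fraction reserved for cached data
conf.set("spark.shuffle.memoryFraction", "0.2")   // fraction reserved for shuffle data

// effective shares of the 4 GB heap with the default safety fractions
val storageShare = 0.6 * 0.9   // 0.54, roughly 2.2 GB for caching
val shuffleShare = 0.2 * 0.8   // 0.16, roughly 0.65 GB for shuffle data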

Figure 10.5 Fractions of memory on Spark executors

SETTING THE DRIVER MEMORY
You set the memory for your driver with the spark.driver.memory parameter. This parameter applies when you're starting your application with the spark-shell and spark-submit scripts (in both cluster-deploy and client-deploy modes). If you start a Spark context programmatically from another application (client mode), then that application contains your driver.


Therefore, to increase the memory available to your driver, use the -Xmx Java option to set the maximum size of the Java heap of the containing process.

10.3 Configuring Spark
You affect how Spark runs by setting configuration parameters. For example, you'll most likely need to adjust the memory for the driver and executors (as we discussed in the previous section) or the classpath for your Spark application. Although specifying these settings is straightforward and common to other similar frameworks, a few details aren't so obvious, so it's worth spending a few moments to study the different ways of configuring Spark.
Here we'll describe only the mechanisms for specifying various runtime parameters, not the specific configuration options themselves. We explain specific parameters throughout the book in the appropriate context. You can check the official documentation (http://spark.apache.org/docs/latest/configuration.html) for a list of currently valid configuration parameters.
You can specify Spark configuration parameters using several methods: on the command line, in Spark configuration files, as system environment variables, and from within user programs. The SparkConf object, accessible through SparkContext, contains all currently applied configuration parameters. Parameters specified with the methods described here all end up in the SparkConf object. As you've seen in previous chapters, when using spark-shell, SparkContext is already instantiated, preconfigured, and available as the variable sc. You can get the SparkConf object with its getConf method:

scala> sc.getConf

10.3.1 Spark configuration file
You specify default Spark parameters in the conf/spark-defaults.conf file of your Spark installation. If not otherwise specified, the values from this file are applied to your Spark runtime, no matter what method you use to start Spark.
You can override the filename from the command line using the --properties-file parameter. That way, you can maintain a different set of parameters for specific applications and specify a different configuration file for each one.
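For example, a minimal spark-defaults.conf might look like the following; the master URL and memory sizes are placeholders you would adapt to your cluster:

spark.master            spark://masterhost:7077
spark.driver.memory     4g
spark.executor.memory   2g
spark.serializer        org.apache.spark.serializer.KryoSerializer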

10.3.2 Command-line parameters
You can use command-line parameters as arguments to the spark-shell and spark-submit commands. These parameters are passed to a SparkConf object in the REPL shell (when using the spark-shell command) or in your program (when using the spark-submit command). They take precedence over parameters specified in the Spark configuration file.

NOTE As you probably recall from chapter 3, you have to specify a JAR file containing your application when using spark-submit. Make sure you specify any arguments you want to pass to your application after the JAR filename and any Spark configuration parameters before the JAR filename.


The names of Spark command-line parameters are different than their counterparts in Spark’s configuration file. Furthermore, only some Spark configuration parameters have command-line versions. For example, these two lines accomplish the same thing (set the driver’s memory to 16 GB):

spark-shell --driver-memory 16g
spark-shell --conf spark.driver.memory=16g

You can get a complete list of command-line parameters by running spark-shell (or spark-submit) with the --help option. You can also use the --conf command-line argument to set any Spark configuration parameter, using its name as it appears in the configuration file. Specify a separate --conf argument for each configuration parameter you want to set.

10.3.3 System environment variables
Some configuration parameters can be specified in the spark-env.sh file in the conf directory of your Spark installation. You can also set their defaults as OS environment variables. Parameters specified using this method have the lowest priority of all the configuration methods. Most of these variables have a Spark configuration file counterpart. For example, in addition to the two methods for setting the driver's memory mentioned earlier, you can specify it by setting the SPARK_DRIVER_MEMORY system environment variable.
You can speed up creation of the spark-env.sh file by changing the spark-env.sh.template file that Spark provides. In that file, you can find all the possible variables that can be used in spark-env.sh.

NOTE If you change spark-env.sh and you're running a Spark standalone cluster, you should copy the file to all worker machines so that all executors run with the same configuration.

10.3.4 Setting configuration programmatically
You can set Spark configuration parameters directly in your program by using the SparkConf class. For example:

val conf = new org.apache.spark.SparkConf()
conf.set("spark.driver.memory", "16g")
val sc = new org.apache.spark.SparkContext(conf)

Note that the Spark configuration can’t be changed at runtime using this method, so you need to set up the SparkConf object with all the configuration options you need before creating the SparkContext object. Otherwise, SparkContext uses the default Spark configuration, and your options aren’t applied. Any parameters set this way have precedence over parameters set with the methods mentioned previously (they have the highest priority).


10.3.5 The master parameter
The master parameter tells Spark which cluster type to use. When running the spark-shell and spark-submit commands, you define this parameter like this:

spark-submit --master <spark_master_url>

When specifying it from your application, you can do it in this way

val conf = new org.apache.spark.SparkConf()
conf.set("spark.master", "<spark_master_url>")

or you can use SparkConf’s setMaster method:

conf.setMaster("<spark_master_url>")

The format of <spark_master_url> varies according to the type of cluster used (the different formats are described in the following sections and chapters). If you're submitting your application as a JAR file, it's best not to set the master parameter in the application, because doing so reduces its portability. In that case, specify it as a parameter to spark-submit so that you can run the same JAR file on different clusters by changing only the master parameter. Setting the master in your application is an option only when you're embedding Spark as part of other functionality.

10.3.6 Viewing all configured parameters
To see the list of all options explicitly defined and loaded in the current Spark context, call the sc.getConf.getAll method in your program (assuming sc is the SparkContext instance). For example, this snippet prints the current configuration options:

scala> sc.getConf.getAll.foreach(x => println(x._1+": "+x._2))
spark.app.id: local-1524799343426
spark.driver.memory: 16g
spark.driver.host: <driver_hostname>
spark.app.name: Spark shell
...

To see the complete list of configured parameters affecting your Spark application, consult the Environment page of the Spark web UI (see section 10.4.4).
10.4 Spark web UI
Each time a SparkContext object is initialized, Spark starts a web UI, providing information about the Spark environment and job execution statistics. The web UI's default port is 4040, but if that port is already taken (by another Spark web UI, for example), Spark increments the port number until it finds one that's free. When starting a Spark shell, you'll see an output line similar to this one (unless you turned off INFO log messages):

SparkUI: Started SparkUI at http://svgubuntu01:4040


NOTE You can disable the Spark web UI by setting the spark.ui.enabled configuration parameter to false. You can change its port with the spark.ui.port parameter.

An example Spark web UI Welcome page is shown in figure 10.6. This web UI was started from a Spark shell, so its name is set to Spark shell, as shown in the upper-right corner of the figure. You can change the application name that is displayed in the Spark web UI by programmatically calling the setAppName method of your SparkConf object. You can also set it on the command line when running the spark-submit command with --conf spark.app.name=<your_app_name>, but you can't change the application name when starting a Spark shell. In that case, it always defaults to Spark shell.
10.4.1 Jobs page
The Welcome page of the Spark web UI (figure 10.6) provides statistics about running, completed, and failed jobs. For each job, you see when it started, how long it ran, and how many stages and tasks it ran. (For a refresher on Spark jobs and stages, see section 4.6.2.) If you click a job description, you see information about its completed and failed stages. Click a stage again in the table to display the Stage Details page.
By clicking the Timeline link, you get a graphical representation of jobs as they executed over time. An example is shown in figure 10.7. Clicking a job in the timeline view also takes you to the Job Details page, where you can see its completed and failed stages (see figure 10.8).

Figure 10.6 The Spark web UI Jobs page shows information about active, completed, and failed jobs. Column headings describe the information.


Figure 10.7 Timeline view showing when each job started and when it finished

Figure 10.8 Spark web UI Stages page showing stage duration, number of tasks, and amount of data read or written


10.4.2 Stages page The Stages page (figure 10.8) provides summary information about job stages. There you can see when each stage started, how long it ran or whether it’s still running, how large its input and output were, and shuffle reads and writes. When you click a Details link, you see a stack trace from the point in the code where the stage started. If you set the spark.ui.killEnabled parameter to true, an additional option (Kill) appears next to the Details link (see figure 10.9).

Figure 10.9 The option to kill a long-running stage is available after setting the spark.ui .killEnabled parameter.

After you click the Kill link and confirm your choice, the stage is terminated, and a stack trace similar to this one appears in your log file:

15/03/20 09:58:25 INFO DAGScheduler: Job 0 failed: foreach at
➥ SampleApp.scala:22, took 59,125413 s
Exception in thread "main" org.apache.spark.SparkException: Job 0 cancelled
➥ because Stage 0 was cancelled
at org.apache.spark.scheduler.DAGScheduler....

To open the Stage Details page (figure 10.10), click a stage description. On the Stage Details page, you can find useful information for debugging the state of your jobs. If you see problems with the duration of your jobs, you can use these pages to quickly drill down to the problematic stages and tasks and narrow the problem. If, for example, you see excessive GC time, that’s a signal to increase the available memory or to increase the number of RDD partitions (which lowers the number of elements in partitions, thus lowering memory consumption). If you see excessive shuf- fle reads and writes, you may want to change your program logic to avoid unnecessary shuffling. The page also shows the stage timeline graph with details about how long it took to execute each subcomponent of task processing: task serialization, computa- tion, shuffling and so on. If your Spark job uses accumulators, the appropriate Stage Details page displays a section similar to the one shown in figure 10.11. In the Accumulators section, you can track the value of each accumulator used in your program. As we mentioned in sec- tion 4.5.1, you need to use a named accumulator in order for it to appear here.


Figure 10.10 Spark web UI Stage Details page showing stage and task metrics useful for debugging the state of Spark jobs

Figure 10.11 Accumulators section of the Stage Details page show- ing the current count of your accumulators


10.4.3 Storage page The Storage page gives you information about your cached RDDs and how much memory, Tachyon storage, or disk space the cached data is consuming. For the exam- ple in figure 10.12, several small RDDs are cached in memory.

Figure 10.12 Spark web UI Storage page showing cached RDD metrics

10.4.4 Environment page On the Environment page, you see Java and Scala versions, Java system properties, and classpath entries, in addition to the Spark configuration parameters we talked about previously. An example is shown in figure 10.13.

Figure 10.13 Spark web UI Environment page showing Java and Spark configuration parameters


10.4.5 Executors page
The Executors page (see figure 10.14) gives you a list of all executors configured in your cluster (including the driver) with information about available and used memory and other statistics aggregated per executor/driver. The amount of memory shown in the Memory Used column is the amount of storage memory, which by default is equal to 54% of the heap, as described in section 10.2.4. Click the Thread Dump link to take current stack traces of all threads for a particular executor. This can be useful for debugging purposes when waiting or blocked threads are slowing the execution of your program.

Figure 10.14 Spark web UI Executors page showing executors’ addresses; number of RDD blocks; amount of mem- ory and disk used; number of active, failed, and complete tasks; and other useful metrics

10.5 Running Spark on the local machine Now that we’ve acquainted you with the basics of running Spark and its architecture, we can start exploring different Spark runtime modes. To begin, we’ll look into two ways of running Spark on a local machine: local mode and local cluster mode.


10.5.1 Local mode
We mostly used Spark local mode to run the examples in the previous chapters. This mode is convenient for testing purposes when you don't have access to a full cluster or you want to try something out quickly. In local mode, there is only one executor, running in the same client JVM as the driver, but this executor can spawn several threads to run tasks. This is illustrated in figure 10.15.

Figure 10.15 Spark running in local mode. The driver and the single executor are running in the same JVM.

In local mode, Spark uses your client process as the single executor in the cluster, and the number of threads specified determines how many tasks can be executed in parallel. You can specify more threads than available CPU cores. That way, CPU cores can be better utilized. Although it depends on the complexity of your jobs, multiplying the number of CPU cores by two or three gives you a good starting point for this parameter (for example, for a machine with quad-core CPUs, set the number of threads to a value between 8 and 12). To run Spark in local mode, set the master parameter to one of the following values (examples follow the list):
■ local[<n>]—Run a single executor using <n> threads, where <n> is a positive integer.
■ local—Run a single executor using one thread. This is the same as local[1].
■ local[*]—Run a single executor using a number of threads equal to the number of CPU cores available on the local machine. In other words, use up all CPU cores.
■ local[<n>,<f>]—Run a single executor using <n> threads, and allow a maximum of <f> failures per task. This is mostly used for Spark internal tests.
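For example, the following commands (with a hypothetical application class and JAR file) start a local-mode Spark shell with eight task threads and submit an application using all available cores:

spark-shell --master local[8]
spark-submit --master local[*] --class org.example.MyApp myapp.jar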

NOTE If you use --master local with only one thread, you may notice that your log lines are missing from the driver’s output. That’s because in Spark Streaming, for example, that single thread is used to read streaming data from a source, and the driver wouldn’t have any threads left to print out the results of your program. If you want the output printed to your log file, be sure to specify at least two threads (local[2]).

If you start the spark-shell or spark-submit script with no --master parameter, local mode using all available CPU cores (local[*]) is assumed.


10.5.2 Local cluster mode
The second method of running Spark on the local machine is local cluster mode. Local cluster mode is intended mostly for Spark internal tests, but it can be useful for quick tests and demonstrations requiring inter-process communication. Local cluster mode is a full Spark standalone cluster running on the local machine. The difference between local cluster mode and a full standalone cluster is that the master isn't a separate process but runs in the client JVM. Most configuration parameters affecting the Spark standalone cluster can be applied to local cluster mode, too. We'll go into the details of a Spark standalone cluster in the next chapter, so we won't explain them here.
You start Spark in local cluster mode by setting the master parameter to the value local-cluster[<n>,<c>,<m>] (without spaces). This means that you're running a local Spark standalone cluster with <n> executors, each using <c> threads and <m> megabytes of memory. (Specify memory only as an integer representing megabytes. Don't use the g or m suffixes here.) Each executor in local cluster mode runs in a separate JVM, which makes it similar to a Spark standalone cluster, covered in the next chapter.
10.6 Summary
■ Typical components of the Spark runtime architecture are the client process, the driver, and the executors.
■ Spark can run in two deploy modes: client-deploy mode and cluster-deploy mode. This depends on the location of the driver process.
■ Spark supports three cluster managers: Spark standalone cluster, YARN, and Mesos. Spark local modes are special cases of the Spark standalone cluster.
■ The cluster manager manages (schedules) resources for Spark executors of different Spark applications.
■ Spark itself schedules CPU and memory resources in a single application in two possible modes: FIFO scheduling and fair scheduling.
■ Data locality means Spark tries to run tasks as close to the data location as possible; five locality levels exist.
■ Spark directly manages memory available to its executors by partitioning it into storage memory, shuffle memory, and the rest of the heap.
■ Spark can be configured through the configuration file, using command-line parameters, using system environment variables, and programmatically.
■ The Spark web UI shows useful information about running jobs, stages, and tasks.
■ Spark local mode runs the entire cluster in a single JVM and is useful for testing purposes.
■ Spark local cluster mode is a full Spark standalone cluster running on the local machine, with the master process running in the client JVM.

Running on a Spark standalone cluster

This chapter covers ■ Components of Spark standalone cluster ■ Spinning up the cluster ■ Spark cluster Web UI ■ Running applications ■ Spark History Server ■ Running on Amazon EC2

After describing common aspects of running Spark and examining Spark local modes in chapter 10, now we get to the first “real” Spark cluster type. The Spark standalone cluster is a Spark-specific cluster: it was built specifically for Spark, and it can’t execute any other type of application. It’s relatively simple and efficient and comes with Spark out of the box, so you can use it even if you don’t have a YARN or Mesos installation. In this chapter, we’ll explain the runtime components of a standalone cluster and how to configure and control those components. A Spark standalone cluster


comes with its own web UI, and we’ll show you how to use it to monitor cluster pro- cesses and running applications. A useful component for this is Spark’s History Server; we’ll also show you how to use it and explain why you should. Spark provides scripts for quickly spinning up a standalone cluster on Amazon EC2. (If you aren’t acquainted with it, Amazon EC2 is Amazon’s cloud service, offering virtual servers for rent.) We’ll walk you through how to do that. Let’s get started. 11.1 Spark standalone cluster components A standalone cluster comes bundled with Spark. It has a simple architecture and is easy to install and configure. Because it was built and optimized specifically for Spark, it has no extra functionalities with unnecessary generalizations, requirements, and configuration options, each with its own bugs. In short, the Spark standalone cluster is simple and fast. The standalone cluster consists of master and worker (also called slave) processes. A master process acts as the cluster manager, as we mentioned in chapter 10. It accepts applications to be run and schedules worker resources (available CPU cores) among them. Worker processes launch application executors (and the driver for applications in cluster-deploy mode) for task execution. To refresh your memory, a driver orches- trates and monitors execution of Spark jobs, and executors execute a job’s tasks. Both masters and workers can be run on a single machine, essentially becoming Spark in local cluster mode (described in chapter 10), but this isn’t how a Spark stand- alone cluster usually runs. You normally distribute workers across several nodes to avoid reaching the limits of a single machine’s resources. Naturally, Spark has to be installed on all nodes in the cluster in order for them to be usable as slaves. Installing Spark means unpacking a binary distribution or building your own version from Spark source files (for details, please see the official documen- tation at http://spark.apache.org/docs/latest/building-spark.html). Figure 11.1 shows an example Spark standalone cluster running on two nodes with two workers: Step 1. A client process submits an application to the master. Step 2. The master instructs one of its workers to launch a driver. Step 3. The worker spawns a driver JVM. Step 4. The master instructs both workers to launch executors for the application. Step 5. The workers spawn executor JVMs. Step 6. The driver and executors communicate independent of the cluster’s processes. Each executor has a certain number of threads (CPU cores) allocated to it, which are task slots for running multiple tasks in parallel. In a Spark standalone cluster, for each application, there can be only one executor per worker process. If you need more executors per machine, you can start multiple worker processes. You may want to do this if your JVM heap is really large (greater than 64 GB) and GC is starting to affect job performance.



Figure 11.1 A Spark standalone cluster with an application in cluster-deploy mode. A master and one worker are running on Node 1, and the second worker is running on Node 2. Workers are spawning drivers’ and executors’ JVMs.

The driver in figure 11.1 is running in the cluster or, in other words, in cluster-deploy mode. As we said in chapter 10, it can also run in the client JVM, which is called client- deploy mode. We should also mention that only one application is shown running in this cluster. If there were more, each would have its own set of executors and a separate driver running either in the cluster, like this one, or in its client’s JVM (depending on the deploy mode). An optional History Server is also shown in figure 11.1. It’s used for viewing the Spark web UI after the application has exited. We’ll explain this feature in more detail in section 11.5. 11.2 Starting the standalone cluster Unlike starting Spark in one of the local cluster modes you saw in chapter 10, you must start a Spark standalone cluster before submitting an application or prior to starting the Spark shell. When the cluster is running, connect your application to the


cluster using the master connection URL. A master connection URL for a standalone cluster has the following syntax:

spark://master_hostname:port

If you have a standby master process running (see section 11.2.4), you can specify sev- eral addresses:

spark://master1_hostname:port1,master2_hostname:port2

To start the standalone cluster, you have two basic options: use Spark-provided scripts or start the components manually.

11.2.1 Starting the cluster with shell scripts Startup scripts are the most convenient way to start Spark. They set up the proper envi- ronment and load your Spark default configuration. In order for scripts to run cor- rectly, Spark should be installed at the same location on all the nodes in the cluster. Spark provides three scripts for starting standalone-cluster components (you can find them in the SPARK_HOME/sbin directory): ■ start-master.sh starts the master process. ■ start-slaves.sh starts all defined worker processes. ■ start-all.sh starts both master and worker processes. Counterpart scripts for stopping the processes are also available: stop-master.sh, stop-slaves.sh, and stop-all.sh.
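A typical sequence, assuming you run the commands from SPARK_HOME on the master machine, could be as simple as this:

$ sbin/start-all.sh
# ... work with the cluster ...
$ sbin/stop-all.sh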

NOTE On a Windows platform, no scripts for starting and stopping a stand- alone cluster are provided. The only option is to start and stop the cluster manually, as explained in section 11.2.2.

START-MASTER.SH start-master.sh starts the master process. It takes no arguments and shows just one line when started:

$ sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to log_file

You can use the log file that the script outputs to find the command used to start the master and the master's runtime messages. The default log file is SPARK_HOME/logs/spark-username-org.apache.spark.deploy.master.Master-1-hostname.out. To customize the start-master.sh script, you can use the system environment variables listed in table 11.1. The best way to apply them is to put them in the spark-env.sh file in the conf folder. (If it doesn't exist, you can use the spark-env.sh.template file as a starting point.) The Java parameters in table 11.2 can be specified in the SPARK_MASTER_OPTS variable in this format: -Dparam1_name=param1_value -Dparam2_name=param2_value
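For example, spark-env.sh might contain lines like the following (the values shown are illustrative, not recommendations):

export SPARK_MASTER_PORT=7077
export SPARK_DAEMON_MEMORY=1g
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 -Dspark.worker.timeout=120"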


Table 11.1 System environment variables affecting the behavior of the start-master.sh script

System environment variable Description

SPARK_MASTER_IP Hostname the master should bind to.

SPARK_MASTER_PORT Port the master should bind to (default is 7077).

SPARK_MASTER_WEBUI_PORT Port on which the cluster web UI (described in section 11.4.3) should be started.

SPARK_DAEMON_MEMORY Amount of heap memory to give the master and worker Java pro- cesses (default is 512 MB). The same parameter applies both to worker and master processes. Note that this only affects cluster daemon processes and not the driver or executors.

SPARK_MASTER_OPTS Lets you pass additional Java parameters to the master process.

Table 11.2 Java parameters you can specify in the SPARK_MASTER_OPTS environment variable

Java parameter Description

spark.deploy.defaultCores Default maximum number of cores to allow per application. Applications can override this by setting the spark.cores .max parameter. If it isn’t set, applications take all available cores on the machine.

spark.worker.timeout Maximum number of seconds the master waits for a heartbeat from a worker before considering it lost (default is 60).

spark.dead.worker.persistence Amount of time (measured as multiples of spark.worker .timeout) to keep dead workers displayed in the master web UI (default is 15).

spark.deploy.spreadOut If set to true, which is the default, the master attempts to spread an application’s executors across all workers, taking one core at a time. Otherwise, it starts an application’s executors on the first free workers it finds, taking all available cores. Spread- ing out can be better for data locality when working with HDFS, because applications will run on a larger number of nodes, increasing the likelihood of running where data is stored.

spark.master.rest.enabled Whether to start the standalone REST server for submitting appli- cations (default is true). This is transparent to the end user.

spark.master.rest.port Listening port for the standalone REST server (default is 6066).

spark.deploy.retainedApplications Number of completed applications to display in the cluster web UI (default is 200).

spark.deploy.retainedDrivers Number of completed drivers to display in the cluster web UI (default is 200).


In addition to the Java parameters in table 11.2, in the SPARK_MASTER_OPTS environ- ment variable you can specify parameters for master recovery (which we’ll describe in section 11.2.4). START-SLAVES.SH The start-slaves.sh script is a bit different. Using the SSH protocol, it connects to all machines defined in the SPARK_HOME/conf/slaves file and starts a worker process there. For this to work, Spark should be installed at the same location on all machines in the cluster. A slaves file (similar to a Hadoop slaves file) should contain a list of worker host- names, each on a separate line. If there are any duplicates in the file, starting addi- tional workers on those machines will fail due to port conflicts. If you need more workers per machine, you can start them manually; or you can set the SPARK_WORKER_INSTANCES environment variable to the number of workers you want on each machine, and the start-slaves.sh script will start all of them automatically. By default, the script will try to start all workers in tandem. For this, you need to set up password-less SSH. You can override this by defining any value for the SPARK_SSH_FOREGROUND environment variable. In that case, the script will start workers serially and allow you to enter a password for each remote machine. Similar to the start-master.sh script, start-slaves.sh prints the path to the log file for each worker process it starts. The system environment variables listed in table 11.3 let you customize worker behavior.
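Before looking at those variables, here is what a minimal conf/slaves file might look like (the hostnames are hypothetical):

spark-worker-01
spark-worker-02
spark-worker-03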

Table 11.3 System environment variables affecting the behavior of worker processes

System environment variable Description

SPARK_MASTER_IP Hostname of the master process with which the worker should register.

SPARK_MASTER_PORT Port of the master process with which the worker should register.

SPARK_WORKER_WEBUI_PORT Port on which the worker web UI should be started (default is 8081).

SPARK_WORKER_CORES Maximum combined number of CPU cores (task slots) for all exec- utors launched by the worker.

SPARK_WORKER_MEMORY Maximum combined total size of the Java heap for all executors launched by the worker.

SPARK_WORKER_DIR Directory for the application’s log files (and other application files, such as JAR files).

SPARK_WORKER_PORT Port to which the worker should bind.

SPARK_WORKER_OPTS Additional Java parameters to pass to the worker process.

The Java parameters specified in table 11.4 can be specified in the SPARK_WORKER _OPTS environment variable in this format:

-Dparam1_name=param1_value -Dparam2_name=param2_value


Table 11.4 Java parameters that can be specified in the SPARK_WORKER_OPTS environment variable

Java parameter Description

spark.worker.timeout Worker will be declared dead after this many seconds. Worker will send heartbeats to the master every spark.worker.timeout / 4 seconds.

spark.worker.cleanup.enabled Enables periodic cleanup of old applications' log data and other files from the worker's work directory (default is false).

spark.worker.cleanup.interval Interval for cleanup of old applications’ data (default is 30 minutes).

spark.worker.cleanup.appDataTtl Time in seconds after which an application is consid- ered old (default is 7 days, in seconds).

START-ALL.SH The start-all.sh script calls start-master.sh and then start-slaves.sh.

11.2.2 Starting the cluster manually Spark also provides an option to start cluster components manually, which is the only option available on a Windows platform. This can be accomplished by calling the spark-class script and specifying the complete Spark master or Spark worker class name as an argument. When starting the worker, you also need to specify the master URL. For example:

$ spark-class org.apache.spark.deploy.master.Master
$ spark-class org.apache.spark.deploy.worker.Worker spark://<master_hostname>:<master_port>

Both commands accept several optional parameters, listed in table 11.5. (Some of the parameters apply only to worker processes, as specified by the Only Worker column.) Starting the processes this way makes it a bit easier to specify these parameters, but starting with start-all.sh is still the most convenient method.

Table 11.5 Optional parameters that can be specified when starting masters or workers manually (parameters marked "worker only" apply only to worker processes)

Optional parameter Description

-h HOST or --host HOST Hostname to listen on. For a master, the same as the SPARK_MASTER_HOST environment variable.

-p PORT or --port PORT Port to listen on (default is random). Same as SPARK_WORKER_PORT for a worker or SPARK_MASTER_PORT for a master.

--webui-port PORT Port for the web UI. Same as SPARK_MASTER_WEBUI_PORT or SPARK_WORKER_WEBUI_PORT. Default on a master is 8080 and on a worker is 8081.

--properties-file FILE Path to a custom Spark properties file (default is conf/spark-defaults.conf).

-c CORES or --cores CORES (worker only) Number of cores to use. Same as the SPARK_WORKER_CORES environment variable.

-m MEM or --memory MEM (worker only) Amount of memory to use (for example, 1000 MB or 2 GB). Same as the SPARK_WORKER_MEMORY environment variable.

-d DIR or --work-dir DIR (worker only) Directory in which to run applications (default is SPARK_HOME/work). Same as the SPARK_WORKER_DIR environment variable.

--help Shows help for invoking the script.
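For example, a worker could be started manually with explicit core and memory limits like this (the master hostname is hypothetical):

$ spark-class org.apache.spark.deploy.worker.Worker spark://master01:7077 \
    --cores 4 --memory 8G --webui-port 8081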

11.2.3 Viewing Spark processes If you’re curious which cluster processes are started, you can use the JVM Process Sta- tus Tool (jps command) to view them. The jps command outputs PIDs and the names of JVM processes running on the machine:

$ jps
1696 CoarseGrainedExecutorBackend
403 Worker
1519 SparkSubmit
32655 Master
6080 DriverWrapper

Master and worker processes appear as Master and Worker. A driver running in the cluster appears as DriverWrapper, and a driver spawned by the spark-submit com- mand (that also includes spark-shell) appears as SparkSubmit. Executor processes appear as CoarseGrainedExecutorBackend.

11.2.4 Standalone master high availability and recovery A master process is the most important component in the standalone cluster. Because client processes connect to it to submit applications, the master requests resources from the workers on behalf of the clients, and users rely on it to view the state of run- ning applications. If the master process dies, the cluster becomes unusable: clients can’t submit new applications to the cluster, and users can’t see the state of the cur- rently running ones. Master high availability means the master process will be automatically restarted if it goes down. Worker processes, on the other hand, aren’t critical for cluster availability. This is the case because if one of the workers becomes unavailable, Spark will restart its tasks on another worker.


If the master process is restarted, Spark provides two ways to recover application and worker data that was running before the master died: using the filesystem and using ZooKeeper. ZooKeeper also provides automatic master high availability, as you’ll see. With filesystem recovery, though, you have to set up master high availability your- self (if you need it) by using one of the tools available for that purpose (for example, start the master process from inittab with the respawn option1).

NOTE All parameters mentioned in this section should be specified in the SPARK_MASTER_OPTS variable mentioned earlier and not in the spark-defaults.conf file.

FILESYSTEM MASTER RECOVERY When using filesystem master recovery, a master persists information about registered workers and running applications in the directory specified by the spark.deploy .recoveryDirectory parameter. Normally, if the master restarts, workers re-register automatically, but the master loses information about running applications. This doesn’t affect the applications, but users won’t be able to monitor them through the master web UI. If filesystem master recovery is enabled, the master will restore worker state instantly (no need for them to re-register), along with the state of any running appli- cations. You enable filesystem recovery by setting the spark.deploy.recoveryMode parameter to FILESYSTEM. ZOOKEEPER RECOVERY ZooKeeper is a fast and simple system providing naming, distributed synchronization, and group services. ZooKeeper clients (or Spark, in this case) use it to coordinate their processes and to store small amounts of shared data. In addition to Spark, it’s used in many other distributed systems. ZooKeeper allows client processes to register with it and use its services to elect a leader process. Those processes not elected as leaders become followers. If a leader process goes down, a leader election process starts, producing a new leader. To set up master high availability, you need to install and configure ZooKeeper. Then you start several master processes, instructing them to synchronize through Zoo- Keeper. Only one of them becomes a ZooKeeper leader. If an application tries to reg- ister with a master that currently isn’t a leader, it will be turned down. If the leader fails, one of the other masters will take its place and restore the master’s state using ZooKeeper’s services. Similar to filesystem recovery, the master will persist information about registered workers and running applications, but it will persist that information to ZooKeeper. This way, ZooKeeper provides both recovery and high-availability services for Spark master pro- cesses. To store the recovery data, ZooKeeper uses the directory specified by the spark.deploy.zookeeper.dir parameter on the machines where ZooKeeper is running.
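As a sketch, master recovery could be configured through SPARK_MASTER_OPTS in spark-env.sh in one of the following two ways (the recovery directory and ZooKeeper hosts are hypothetical; the ZOOKEEPER mode is described next):

export SPARK_MASTER_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/var/spark/recovery"

export SPARK_MASTER_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 -Dspark.deploy.zookeeper.dir=/spark"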

1 See the Linux man pages: www.manpages.info/linux/inittab.5.html.


You turn on ZooKeeper recovery by setting the spark.deploy.recoveryMode parameter to ZOOKEEPER. ZooKeeper needs to be accessible at the URLs specified by the spark.deploy.zookeeper.url parameter.
11.3 Standalone cluster web UI
When you start a master or a worker process, each starts its own web UI application. This is different from the Spark web UI that a Spark context starts, as discussed in chapter 10. The Spark web UI shows information about applications, stages, tasks, and so on, whereas the standalone cluster web UI shows information about the master and workers. An example master web UI is shown in figure 11.2.

Figure 11.2 Example Spark master web UI page showing the running workers, running applications and drivers, and the completed applications and drivers


Figure 11.3 Sample Spark worker web UI

On the master web UI pages, you can see basic information about memory and CPU cores used and those available in the cluster, as well as information about workers, applications, and drivers. We’ll talk more about these in the next section. If you click a worker ID, you’re taken to the web UI page started by a worker pro- cess. A sample UI page is shown in figure 11.3. On the worker web UI page, you can see which executors and drivers the worker is managing, and you can examine their log files by clicking the appropriate links. If you click an application’s name when on the master web UI page, you’ll be taken to the Spark web UI page started by that application’s Spark context. If you click an application’s ID, though, you’ll be taken to the application screen of the master web UI (figure 11.4).

Figure 11.4 Sample Spark master web UI application screen


The application screen shows which workers and executors the application is running on. You can access the Spark web UI again by clicking the Application Detail UI link. You can also view the application’s logs on each worker machine. The Spark cluster web UIs (master and workers) and the Spark web UI come with Spark out of the box and offer a way of monitoring applications and jobs. That should be enough in most situations. 11.4 Running applications in a standalone cluster As with the other cluster types, you can run Spark programs on a standalone cluster by submitting them with the spark-submit command, running them in a Spark shell, or instantiating and configuring a SparkContext object in your own application. We already talked about these options in chapter 10. In all three cases, you need to specify a master connection URL with the hostname and port of the master process.

NOTE When connecting your applications to a Spark standalone cluster, it’s important to use the exact hostname in the master connection URL as that used to start the master process (the one specified by the SPARK_MASTER_IP environment variable or your hostname).

You have two basic options when running Spark applications in a standalone cluster, and they differ in the location of the driver process.

11.4.1 Location of the driver As we said in section 10.1.1, the driver process can run in the client process that was used to launch the application (like spark-submit script), or it can run in the cluster. Running in the client process is the default behavior and is equivalent to specifying the --deploy-mode client command-line argument. In this case, spark-submit will wait until your application finishes, and you’ll see the output of your application on the screen.

NOTE The spark-shell script supports only client-deploy mode.

To run the driver in the cluster, you have to specify the --deploy-mode cluster command-line argument. In that case, you’ll see output similar to this:

Sending launch command to spark://<master_hostname>:7077
Driver successfully submitted as driver-20150303224234-0000
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150303224234-0000 is RUNNING
Driver running on <worker_hostname>:55175 (worker-20150303224421-<worker_hostname>-55175)

The option --deploy-mode is only used on standalone and Mesos clusters. YARN has a different master URL syntax.
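For example, a cluster-deploy submission might look like this (the master hostname, application class, and HDFS path are hypothetical):

spark-submit --master spark://master01:7077 \
  --deploy-mode cluster \
  --class org.example.MyApp \
  hdfs:///apps/myapp.jar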


NOTE If you’re embedding SparkContext in your application and you’re not using the spark-submit script to connect to the standalone cluster, there’s currently no way to specify deploy mode. It will default to client-deploy mode, and the driver will run in your application.

In cluster-deploy mode, the cluster manager takes care of the driver’s resources and can automatically restart your application if the driver process fails (see section 11.4.5). If you’re submitting your application in cluster-deploy mode using the spark-submit script, the JAR file you specify needs to be available on the worker (at the location you specified) that will be executing the application. Because there’s no way to say in advance which worker will execute your driver, you should put your application’s JAR file on all the workers if you intend to use cluster-deploy mode, or you can put your appli- cation’s JAR file on HDFS and use the HDFS URL as the JAR filename. Log files from the driver running in cluster mode are available from the master and worker web UI pages. Of course, you can also access them directly on the filesys- tem of the corresponding worker.

NOTE Python applications can’t run in cluster-deploy mode on a standalone cluster.

The example web UI page in figure 11.2 (in section 11.3) shows three configured workers and two applications. One application is a Spark shell; the other is a custom application (called Sample App) submitted as a JAR file in cluster-deploy mode, so a running driver can also be seen on the web UI page. In cluster-deploy mode, the driver is spawned by one of the worker processes and uses one of its available CPU cores. We show this in figure 11.3 (in section 11.3).

11.4.2 Specifying the number of executors Each of the two applications in figure 11.2 uses three cores out of six available (Sam- ple App is using four cores; its driver is using the fourth one). This is accomplished by setting the parameter spark.deploy.defaultCores in the SPARK_MASTER_OPTS envi- ronment variable to 3 (as described previously). You can accomplish the same thing by setting the spark.cores.max parameter for each application you want to prevent from taking all available cores. You can also set the SPARK_WORKER_CORES environment variable to limit the number of cores each application can take per machine. If neither spark.cores.max nor spark.deploy.defaultCores were set, a single application would have taken all the available cores, and the subsequent applications would have had to wait for the first application to finish. To control how many executors are allocated for your application, set spark.cores.max to the total number of cores you wish to use, and set spark.executor .cores to the number of cores per executor (or set their command-line equivalents: --executor-cores and --total-executor-cores). If you wish to use 3 executors with the total of 15 cores, set spark.cores.max to 15 and spark.executor.cores to 5. If you


plan to run only one application, leave these settings at the default value, which is infinite (Int.MaxValue).
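As a sketch of that last scenario (three executors with five cores each), the command-line equivalents could be used like this, with a hypothetical master URL, application class, and JAR file:

spark-submit --master spark://master01:7077 \
  --total-executor-cores 15 \
  --executor-cores 5 \
  --class org.example.MyApp myapp.jar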

11.4.3 Specifying extra classpath entries and files In many situations, it’s necessary to modify the classpath of your application or to make other files available to it. For example, your application may need a JDBC driver to access a relational database or other third-party classes not bundled with Spark. This means you need to modify the classpath of executor and driver processes, because that’s where your application is executing. Special Spark parameters exist for these purposes, and you can apply them at different levels, as is often the case when configuring Spark.

NOTE Techniques described in this section aren’t specific to the standalone cluster and can be used on other cluster types, too.

USING THE SPARK_CLASSPATH VARIABLE You can use the SPARK_CLASSPATH environment variable to add additional JAR files to the driver and executors. If you set it on the client machine, the extra classpath entries will be added to both the driver’s and workers’ classpaths. When using this variable, however, you’ll need to manually copy the required files to the same location on all machines. Multiple JAR files are separated by semicolons (;) on Windows and by colons (:) on all other platforms. USING THE COMMAND-LINE OPTIONS Another option is to use the Spark configuration parameters spark.driver.extra- ClassPath and spark.executor.extraClassPath for JAR files, and spark.driver .extraLibraryPath and spark.executor.extraLibraryPath for native libraries. There are two additional spark-submit parameters for specifying driver paths: --driver-class-path and --driver-library-path. You should use these parameters if the driver is running in client mode, because then --conf spark.driver.extra- ClassPath won’t work. JAR files specified with these options will be prepended to the appropriate executor classpaths. You’ll still need to have these files on your worker machines as well. USING THE --JARS PARAMETER This option uses spark-submit with the --jars parameter, which automatically cop- ies the specified JAR files (separated with commas) to the worker machines and adds them to the executor classpaths. That means the JAR files need not exist on the worker machines before submitting the application. Spark uses the same mechanism to distribute your application’s JAR file to worker machines. When using the --jars option, you can fetch the JAR files from different locations, depending on the prefix before the specified filename (a colon at the end of each prefix is required):


■ file:—The default option described earlier. The file is copied to each worker. ■ local:—The file exists on all worker machines at the exact same location. ■ hdfs:—The file path is HDFS, and each worker can access it directly from HDFS. ■ http:, https:, or ftp:—The file path is a URI.
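For example, the following hypothetical submission distributes a local JDBC driver JAR to the workers and references another library directly from HDFS:

spark-submit --master spark://master01:7077 \
  --jars /opt/jars/postgresql-jdbc.jar,hdfs:///libs/extra-lib.jar \
  --class org.example.MyApp myapp.jar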

NOTE If your application JAR includes classes or JARs also used by Spark itself, and you're experiencing conflicts among class versions, you can set the configuration parameter spark.executor.userClassPathFirst or spark.driver.userClassPathFirst to true to force Spark to load your classes before its own.

You can use a similar option (--files) to add ordinary files to workers (files that aren’t JAR files or libraries). They can also be local, HDFS, HTTP, or FTP files. To use these files on workers, you need to access them with SparkFiles.get(). ADDING FILES PROGRAMMATICALLY There is a programmatic method of adding JARs and files by calling SparkContext’s addJar and addFile methods. The --jars and --files options described earlier call these methods, so most things said previously also apply here. The only addition is that you can use addFile(filename, true) to recursively add an HDFS directory (the second argument means recursive). ADDING ADDITIONAL PYTHON FILES For Python applications, extra .egg, .zip, or .py files can be added with the --py-files spark-submit option. For example:

spark-submit --master <master_url> --py-files file1.py,file2.py main.py

where main.py is the Python file instantiating a Spark context.
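As a brief Scala sketch of the addFile and SparkFiles.get methods mentioned earlier (the file and HDFS paths are hypothetical):

sc.addFile("/home/user/lookup.txt")  // copied to every worker before tasks run
val rdd = sc.textFile("hdfs:///data/input.txt").map { line =>
  // on the executor, resolve the local path of the distributed file
  val localPath = org.apache.spark.SparkFiles.get("lookup.txt")
  line + " " + localPath
}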

11.4.4 Killing applications If you submitted your application to the cluster in cluster mode, and the application is taking too long to complete or you want to stop it for some other reason, you can kill it by using the spark-class command like this:

spark-class org.apache.spark.deploy.Client kill <master_url> <driver_id>

You can do this only for applications whose driver is running in the cluster (cluster mode). For those applications submitted using the spark-submit command in client mode, you can kill the client process. You can still terminate particular stages (and jobs) using the Spark web UI, as described in section 10.4.2.

11.4.5 Application automatic restart When submitting an application in cluster-deploy mode, a special command-line option (--supervise) tells Spark to restart the driver process if it fails (or ends abnor- mally). This restarts the entire application because it isn’t possible to recover the state of a Spark program and continue at the point where it failed.


If the driver is failing every time and keeps getting restarted, you’ll need to kill it using the method described in the previous section. Then you need to investigate the problem and change your application so that the driver executes without failing. 11.5 Spark History Server and event logging We mentioned the History Server earlier. What’s it for? Let’s say you ran your applica- tion using spark-submit. Everything went smoothly, or so you thought. You suddenly notice something strange and would like to check a detail on the Spark web UI. You use the master’s web UI to get to the application page, and you click the Application Detail UI link, but you get the message shown in figure 11.5. Or, even worse, you restarted the master process in the meantime, and your application isn’t listed on the master’s web UI.

Figure 11.5 A web UI message showing that event logging isn’t enabled

Event logging exists to help with these situations. When enabled, Spark logs events necessary for rendering the web UI in the folder specified by spark.eventLog.dir, which is /tmp/spark-events by default. The Spark master web UI will then be able to display this information in a manner identical to the Spark web UI so that data about jobs, stages, and tasks is available even after the application has finished. You enable event logging by setting spark.eventLog.enabled to true. If you restarted (or stopped) the master, and your application is no longer available from the master web UI, you can start the Spark History Server, which displays a Spark web UI for applications whose events have been logged in the event log directory.
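For example, the two event-logging parameters could be set in spark-defaults.conf like this (the HDFS directory is only an example):

spark.eventLog.enabled   true
spark.eventLog.dir       hdfs:///spark-events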

TIP Unfortunately, if an application is killed before finishing, it may not appear in the History Server UI, because the History Server expects to find a file named APPLICATION_COMPLETE in the application's event log directory (under /tmp/spark-events by default). You can manually create an empty file with that name if it's missing, and the application will appear in the UI.

You start the Spark History Server with the script start-history-server.sh in the sbin directory, and you stop it with stop-history-server.sh. The default HTTP port is 18080. You can change this with the spark.history.ui.port parameter. An example History Server page is shown in figure 11.6. Click any application ID link to go to the appropriate web UI pages, which we covered in section 10.4.


Figure 11.6 Spark History Server

You can customize the History Server with several environment variables: use SPARK_DAEMON_MEMORY to specify how much memory it should take, SPARK_PUBLIC_DNS to set its public address, SPARK_DAEMON_JAVA_OPTS to pass additional parameters to its JVM, and SPARK_HISTORY_OPTS to pass spark.history.* parameters to it. For the complete list of these parameters, see the official documentation (http://spark.apache.org/docs/latest/configuration.html). You can also set spark.history.* parameters in the spark-defaults.conf file.
11.6 Running on Amazon EC2
You can use any physical or virtual machine to run Spark, but in this section, we'll show you how to use Spark's EC2 scripts to quickly set up a Spark standalone cluster in Amazon's AWS cloud. Amazon EC2 is Amazon's cloud service that lets you rent virtual servers to run your own applications. EC2 is just one of the services in Amazon Web Services (AWS); other services include storage and databases. AWS is popular because of its ease of use, broad set of features, and relatively low price. Of course, we don't want to start a flame war about which cloud provider is better; there are other providers, and you can manually install Spark on them and set up a standalone cluster as described in this chapter.
To go through this tutorial, we'll use Amazon resources that go beyond the free tier, so you should be prepared to spend a buck or two. You'll first obtain secret AWS keys, necessary for connecting to AWS services, and set up basic security. Then you'll use Spark's EC2 scripts to launch a cluster and log in to it. We'll show you how to stop and restart parts of the cluster you created. Finally, you'll destroy the fruit of all that hard work.


11.6.1 Prerequisites In order to follow along, you should have an Amazon account and obtain these AWS keys: Access Key ID and Secret Access Key. These keys are used for user identification when using the AWS API. You can use the keys of your main user, but this isn’t recom- mended. A better approach would be to create a new user with lower permissions and then generate and use those keys. OBTAINING THE AWS SECRET KEYS You create a new user with Amazon’s Identity and Access Management (IAM) service. For the purposes of this tutorial, we created a user named sparkuser by selecting Ser- vices > IAM from the AWS landing page, going to the Users page, and clicking the Cre- ate New Users button. We entered a single username and left the option Generate an Access Key for Each User checked. Figure 11.7 shows our keys ready for download. You should store the keys in a safe place right away, because you won’t be able to access them later.

Figure 11.7 AWS user created

In order to successfully use this user for the Spark cluster setup, the user has to have adequate permissions. Click the new user’s name, and then click the Attach Policy but- ton on the Permissions tab (see figure 11.8). From the list of available policies, choose AmazonEC2FullAccess. This will be enough for your Spark setup. CREATING A KEY PAIR

Figure 11.8 Giving the user adequate permissions

The next prerequisite is a key pair, which is necessary for securing communication between a client and AWS services. On the EC2 Services page (available from any page through the top menu Services > EC2), under Network & Security, select Key Pairs and then choose your region in the upper-right corner. Choosing the correct region is important because keys generated for one region won't work in another. We chose Ireland (eu-west-1). Click Create Key Pair, and give the pair a name. The name can be anything you like; we chose SparkKey. After creating the key pair, a private key will be automatically downloaded as a .pem file. You should store that file in a secure but accessible place and change its access rights so that only you can read it:

chmod 400 SparkKey.pem

Just to be sure everything is OK before starting the scripts (you could have made a mistake in pasting its contents), check that the key is valid using this command:

openssl rsa -in SparkKey.pem -check

If the command outputs the contents of the file, all is well.

11.6.2 Creating an EC2 standalone cluster Now let’s look at the main spark-ec2 script for managing the EC2 cluster. It used to come bundled with Spark, but has since moved to a separate project. To use it, create an ec2 directory in your SPARK_HOME folder and then clone the AMPLab’s spark-ec2 GitHub repository (https://github.com/amplab/spark-ec2) into it. The spark-ec2 script has the following syntax:

spark-ec2 options action cluster_name

Table 11.6 shows the possible actions you can specify.

Table 11.6 Possible actions for the spark-ec2 script

Action Description

launch Launches EC2 instances, installs the required software packages, and starts the Spark master and slaves

login Logs in to the instance running the Spark master

stop Stops all the cluster instances

start Starts all the cluster instances, and reconfigures the cluster

get-master Returns the address of the instance where the Spark master is running

reboot-slaves Reboots instances where workers are running

destroy An unrecoverable action that terminates EC2 instances and destroys the cluster

Depending on the action argument, you can use the same script to launch the cluster; log in to it; stop, start, and destroy the cluster; and restart the slave machines. Every action requires the cluster_name argument, which is used to reference the machines that will be created; and security credentials, which are the AWS secret keys and the key pair you created before. Options depend on the action chosen and will be explained as we go.


SPECIFYING THE CREDENTIALS
AWS secret keys are specified as the system environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID:

export AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
export AWS_ACCESS_KEY_ID=<your_access_key_id>

The key pair is specified with the --key-pair option (-k for short) containing the key pair name and the --identity-file option (-i for short) pointing to the .pem file with the private key created earlier. You also have to specify the --region option (-r for short) for all actions if you chose a region other than the default us-east-1, as we did. Otherwise, the script won't be able to find your cluster's machines. Up until this point, you have this (don't run this just yet):

spark-ec2 --key-pair=SparkKey --identity-file=SparkKey.pem \
  --region=eu-west-1 launch spark-in-action

or the equivalent:

spark-ec2 -k SparkKey -i SparkKey.pem -r eu-west-1 launch spark-in-action

Running this command now would create a cluster called spark-in-action. But we’d like to change a few things before doing that. CHANGING THE INSTANCE TYPES Amazon offers many types of instances you can use for your VMs. These differ in num- ber of CPUs, amount of memory available, and, of course, price. The default instance type when creating EC2 instances with the spark-ec2 script is m1.large, which has two cores and 7.5 GB of RAM. The same instance type will be used for the master and slave machines, which usually isn’t desirable because the master is less hungry for resources. So we decided to use m1.small for the master. We also opted for m1.medium for slaves. The option for changing a slave’s instance type is --instance- type (-t for short), and the option for changing the master is --master-instance-type (-m for short). CHANGING THE HADOOP VERSION You’ll probably be using Hadoop on your EC2 instances. The default Hadoop version the spark-ec2 script will install is 1.0.4, which may not be something you want. You can change that with the --hadoop-major-version parameter and set it to 2, which will install Spark prebuilt for Cloudera CDH 4.2.0, containing Hadoop 2.0.0 MR1. CUSTOMIZING SECURITY GROUPS By default, access to EC2 instances through internet ports isn’t allowed. That prevents you from submitting applications to your Spark cluster running on EC2 instances directly from a client machine outside the EC2 cluster. EC2 security groups let you change inbound and outbound rules so that machines can communicate with the


Figure 11.9 Adding a custom security rule to allow access to the master instance from anywhere on the internet

outside world. The spark-ec2 script sets up security groups to allow communication between the machines in your cluster, but access to port 7077 (the Spark standalone master default port) from the internet still isn’t allowed. That’s why you should create a security group (accessible from Services > EC2 > Security Groups > Create Security Group) with a rule as shown in figure 11.9 (we named it Allow7077). Although opening a port to everybody like this isn’t recommended, it’s acceptable for short time periods in test environments. For production environments, we recom- mend that you restrict access to a single address. You can assign a security group to all of your instances with the option --additional-security-group. Because you need the security group you just created only for the master (you don’t need to access workers through port 7077), you won’t use this option now and will add this security group manually to the master instance after your cluster is running. LAUNCHING THE CLUSTER The last thing you need to change is the number of slave machines. By default, only one will be created, and in this case you’d like more. The option for changing that is --slaves (-s for short). This is the complete command:

./spark-ec2 --key-pair=SparkKey --identity-file=SparkKey.pem \
  --slaves=3 --region=eu-west-1 --instance-type=m1.medium \
  --master-instance-type=m1.small --hadoop-major-version=2 \
  launch spark-in-action

After you start the script, it will create security groups, launch the appropriate instances, and install these packages: Scala, Spark, Hadoop, and Tachyon. The script downloads the appropriate packages from online repositories and distributes them to workers using the rsync (remote sync) program. You can instruct the script to use an existing master instance and only create the slaves with the option --use-existing-master. To launch the cluster on an Amazon Virtual Private Cloud (VPC), you use the options --vpc-id and --subnet-id; everything else remains the same. After the script is done, you'll see the new instances running in your EC2 console (figure 11.10). The cluster is now ready to be used.
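For example, a launch into an existing VPC that reuses an already running master might look like the following sketch (the VPC and subnet IDs are made up for illustration; substitute your own):

./spark-ec2 -k SparkKey -i SparkKey.pem -r eu-west-1 \
  --vpc-id=vpc-0abc1234 --subnet-id=subnet-0def5678 \
  --use-existing-master launch spark-in-action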

11.6.3 Using the EC2 cluster
Now that you have your cluster, you can log in to see what your command created.


Figure 11.10 Spark cluster machines running on EC2

LOGGING IN
You can log in to your cluster in a couple of ways. First, the spark-ec2 script provides an action for this:

spark-ec2 -k SparkKey -i SparkKey.pem -r eu-west-1 login spark-in-action

This will print out something similar to the following:

Searching for existing cluster spark-in-action...
Found 1 master(s), 3 slaves
Logging into master ec2-52-16-244-147.eu-west-1.compute.amazonaws.com

Last login: ...

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
There are 74 security update(s) out of 262 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2014.09 is available.
[root@ip-172-31-3-54 ~]$

And you're in! Another option is to ssh directly to the public address of one of your instances, still using your private key (secret key environment variables aren't necessary in this case). You can find the addresses of your instances on the EC2 console. In this example, to log in to the master instance, you'd type

$ ssh -i SparkKey.pem [email protected]

If you don’t feel like logging in to your EC2 console to find the address, you can use the spark-ec2 script to obtain the master’s hostname:

$ spark-ec2 -k SparkKey -i SparkKey.pem -r eu-west-1 get-master spark-in-action
Searching for existing cluster spark-in-action...
Found 1 master(s), 3 slaves
ec2-52-16-244-147.eu-west-1.compute.amazonaws.com

Then you can use that address for the ssh command.


CLUSTER CONFIGURATION
Upon logging in, you'll see that software packages were installed in the user's home directory. The default user is root (you can change that with the --user option), so the home directory is /root. If you examine spark-env.sh in the spark/conf subdirectory, you'll see that the spark-ec2 script added a few Spark configuration options. The ones we're most interested in are SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES. The number of instances per worker can be customized with the command-line option --worker-instances. Of course, you can manually alter the configuration in the spark-env.sh file, too. In that case, you should distribute the file to the workers. SPARK_WORKER_CORES can't be customized using the spark-ec2 script because the number of worker cores depends on the instance type selected. In the example, you used m1.medium, which has only one CPU, so the configured value was 1. For the default instance type (m1.large), which has two CPUs, the configured default value is 2. You'll also notice two Hadoop installations: ephemeral-hdfs and persistent-hdfs. Ephemeral HDFS is configured to use temporary storage available only while the machine is running. If the machine restarts, the ephemeral (temporary) data is lost. Persistent HDFS is configured to use Elastic Block Store (EBS) storage, which means it won't be lost if the machine restarts. It also means keeping that data is going to cost you. You can add more EBS volumes to each instance with the --ebs-vol-num, --ebs-vol-type, and --ebs-vol-size options. The spark-ec2 subdirectory contains the contents of the https://github.com/mesos/spark-ec2 GitHub repository. These are the actual scripts for setting up Spark EC2 clusters. One useful script there is copy-dir, enabling you to rsync a directory from one of the instances to the same path on all slave machines (handy for distributing a changed Spark configuration).
CONNECTING TO THE MASTER
If everything went smoothly, you should be able to start a Spark shell and connect to the master. For this, you should use the hostname returned by the get-master command and not its IP address. If you assigned the Allow7077 security group to the master instance, you should be able to connect to the cluster from your client machine, too. Further configuration and application submission work as usual.
STOPPING, STARTING, AND REBOOTING
Stop and start actions obviously stop and start the entire cluster. After stopping, the machines will be in a stopped state, and data in temporary storage will be lost. After starting the cluster again, the necessary scripts will be called to rebuild the temporary data and repeat the cluster configuration. This will also cause any changes you may have made to your Spark configuration to be overwritten, although that data is kept in persistent storage. If you restart the machines using the EC2 console, your changes to your Spark configuration will be preserved.
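As a sketch, stopping a cluster you aren't using and bringing it back later uses the stop and start actions mentioned above (same key pair, identity file, and region as before):

./spark-ec2 -k SparkKey -i SparkKey.pem -r eu-west-1 stop spark-in-action
./spark-ec2 -k SparkKey -i SparkKey.pem -r eu-west-1 start spark-in-action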


spark-ec2 also lets you reboot slave machines. reboot-slaves is used similarly to other actions:

$ ./spark-ec2 -k SparkKey -i SparkKey.pem -r eu-west-1 \
  reboot-slaves spark-in-action
Are you sure you want to reboot the cluster spark-in-action slaves?
Reboot cluster slaves spark-in-action (y/N): y
Searching for existing cluster spark-in-action...
Found 1 master(s), 3 slaves
Rebooting slaves...
Rebooting i-b87d0d5e
Rebooting i-8a7d0d6c
Rebooting i-8b7d0d6d

After a reboot, you’ll have to run the start-slaves.sh script from the master machine, because slaves aren’t started automatically.

11.6.4 Destroying the cluster
A destroy action (to use the AWS terminology) will terminate all cluster instances. Instances that are only stopped and not terminated may incur additional costs, although you may not be using them. For example, Spark instances use EBS permanent storage, and Amazon charges for EBS usage even though the instances are stopped. So it may be prudent to destroy the cluster if you won't be using it for a long time. It's straightforward to do:

spark-ec2 -k SparkKey -i SparkKey.pem destroy spark-in-action
Are you sure you want to destroy the cluster spark-in-action?
The following instances will be terminated:
Searching for existing cluster spark-in-action...
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster spark-in-action (y/N):
Terminating master...
Terminating slaves...

After this step, your only option is to launch another cluster (or pack up and go home).
11.7 Summary
■ A standalone cluster comes bundled with Spark, has a simple architecture, and is easy to install and configure.
■ It consists of master and worker processes.
■ Spark applications on a Spark standalone cluster can run in cluster mode (the driver is running in the cluster) or client-deploy mode (the driver is running in the client JVM).
■ You can start the standalone cluster with shell scripts or manually.
■ The master process can be automatically restarted if it goes down, using filesystem master recovery or ZooKeeper recovery.


■ The standalone cluster web UI gives useful information about running applications, the master, and workers.
■ You can specify extra classpath entries and files using the SPARK_CLASSPATH environment variable, using command-line options, using the --jars argument, and programmatically.
■ The Spark History Server enables you to view the Spark web UI of applications that finished, but only if they were running while event logging was enabled.
■ You can start a Spark standalone cluster on Amazon EC2 using the scripts in the Spark distribution.

Running on YARN and Mesos

This chapter covers
■ YARN architecture
■ YARN resource scheduling
■ Configuring and running Spark on YARN
■ Mesos architecture
■ Mesos resource scheduling
■ Configuring and running Spark on Mesos
■ Running Spark from Docker

We examined a Spark standalone cluster in the previous chapter. Now it’s time to tackle YARN and Mesos, two other cluster managers supported by Spark. They’re both widely used (with YARN still more widespread) and offer similar functionalities, but each has its own specific strengths and weaknesses. Mesos is the only cluster manager supporting fine-grained resource scheduling mode; you can also use Mesos to run Spark tasks in Docker images. In fact, the Spark project was originally started to demonstrate the



usefulness of Mesos,1 which illustrates Mesos's importance. YARN lets you access Kerberos-secured HDFS (Hadoop distributed filesystem restricted to users authenticated using the Kerberos authentication protocol) from your Spark applications. In this chapter, we'll describe the architectures, installation and configuration options, and resource scheduling mechanisms for Mesos and YARN. We'll also highlight the differences between them and how to avoid common pitfalls. In short, this chapter will help you decide which platform better suits your needs. We'll start with YARN.
12.1 Running Spark on YARN
YARN (which, as you may recall, stands for yet another resource negotiator) is the new generation of Hadoop's MapReduce execution engine (for more information about MapReduce, see appendix B). Unlike the previous MapReduce engine, which could only run MapReduce jobs, YARN can run other types of programs (such as Spark). Most Hadoop installations already have YARN configured alongside HDFS, so YARN is the most natural execution engine for many potential and existing Spark users. Spark was designed to be agnostic to the underlying cluster manager, and running Spark applications on YARN doesn't differ much from running them on other cluster managers, but there are a few differences you should be aware of. We'll go through those differences here. We'll begin our exploration of running Spark on YARN by first looking at the YARN architecture. Then we'll describe how to submit Spark applications to YARN, and then explain the differences between running Spark applications on YARN compared to a Spark standalone cluster.
12.1.1 YARN architecture
The basic YARN architecture is similar to Spark's standalone cluster architecture. Its main components are a resource manager (it could be likened to Spark's master process) for each cluster and a node manager (similar to Spark's worker processes) for each node in the cluster. Unlike running on Spark's standalone cluster, applications on YARN run in containers (JVM processes to which CPU and memory resources are granted). An application master for each application is a special component. Running in its own container, it's responsible for requesting application resources from the resource manager. When Spark is running on YARN, the Spark driver process acts as the YARN application master. Node managers track resources used by containers and report to the resource manager. Figure 12.1 shows a YARN cluster with two nodes and a Spark application running in the cluster. You'll notice that this figure is similar to figure 11.1, but the process of starting an application is somewhat different. A client first submits an application to the resource manager (step 1), which directs one of the node managers to allocate a container for the application master (step 2). The node manager launches a container (step 3) for the application master (Spark's

1 See "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center," by Benjamin Hindman et al., http://mesos.berkeley.edu/mesos_tech_report.pdf.

driver), which then asks the resource manager for more containers to be used as Spark executors (step 4). When the resources are granted, the application master asks the node managers to launch executors in the new containers (step 5), and the node managers obey (step 6). From that point on, driver and executors communicate independently of YARN components, in the same way as when they're running in other types of clusters. Clients can query the application's status at any time.

Figure 12.1 YARN architecture in an example cluster of two nodes. The client submits an application, whereby the resource manager starts the container for the application master (Spark driver). The application master requests more containers for Spark executors. Once the containers start, the Spark driver communicates directly with its executors.


Figure 12.1 shows only one application running in the cluster. But multiple applications can run in a single YARN cluster, be they Spark applications or applications of another type. In that case, each application has its own application master. The number of containers depends on the application type. At minimum, an application could consist of only an application master. Because Spark needs a driver and executors, it will always have a container for the application master (Spark driver) and one or more containers for its executors. Unlike Spark's workers, YARN's node managers can launch more than one container (executor) per application.

12.1.2 Installing, configuring, and starting YARN
This section contains an overview of YARN and Hadoop installation and configuration. For more information, we recommend Hadoop in Practice, Second Edition, by Alex Holmes (Manning, 2015) and Hadoop: The Definitive Guide, Fourth Edition, by Tom White (O'Reilly, 2015). YARN is installed together with Hadoop. The installation is straightforward: from the Hadoop download page (https://hadoop.apache.org/releases.html), you need to download and extract Hadoop's distribution archive on every machine that is to be part of your cluster. Similar to Spark, you can use YARN in three possible modes:
■ Standalone (local) mode—Runs as a single Java process. This is comparable to Spark's local mode, described in chapter 10.
■ Pseudo-distributed mode—Runs all Hadoop daemons (several Java processes) on a single machine. This is comparable to Spark's local cluster mode, described in chapter 10.
■ Fully distributed mode—Runs on multiple machines.
CONFIGURATION FILES
Hadoop's XML-based configuration files are located in the etc/hadoop directory of the main installation location. The main configuration files are as follows:
■ slaves—List of hostnames (one per line) of the machines in the cluster. Hadoop's slaves file is the same as the slaves file in Spark's standalone cluster configuration.
■ hdfs-site.xml—Configuration pertaining to Hadoop's filesystem.
■ yarn-site.xml—YARN configuration.
■ yarn-env.sh—YARN environment variables.
■ core-site.xml—Various security, high-availability, and filesystem parameters.
Copy the configuration files to all the machines in the cluster. We'll mention specific configuration options pertinent to Spark in the coming sections.
STARTING AND STOPPING YARN
Table 12.1 lists the scripts for starting and stopping YARN and HDFS daemons. They're available in the sbin directory of the main Hadoop installation location.


Table 12.1 Scripts for starting and stopping YARN and HDFS daemons

Script file What it does

start-dfs.sh / stop-dfs.sh Starts/stops HDFS daemons on all machines listed in the slaves file

start-yarn.sh / stop-yarn.sh Starts/stops YARN daemons on all machines listed in the slaves file

start-all.sh / stop-all.sh Starts/stops both HDFS and YARN daemons on all machines listed in the slaves file
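For example, assuming HADOOP_HOME points to your Hadoop installation on the machine hosting the resource manager, a minimal bring-up and shutdown looks like this:

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
# ... run your jobs ...
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh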

12.1.3 Resource scheduling in YARN
YARN's ResourceManager, mentioned previously, has a pluggable interface to allow different plug-ins to implement its resource-scheduling functions. There are three main scheduler plug-ins: the FIFO scheduler, the capacity scheduler, and the fair scheduler. You specify the desired scheduler by setting the property yarn.resourcemanager.scheduler.class in the yarn-site.xml file to the scheduler's class name. The default is the capacity scheduler (the value org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler). These schedulers treat Spark like any other application running in YARN. They allocate CPU and memory to Spark according to their logic. Once they do, Spark schedules resources for its own jobs internally, as discussed in chapter 10.
FIFO SCHEDULER
The FIFO scheduler is the simplest of the scheduler plug-ins. It lets applications take all the resources they need. If two applications require the same resources, the first application that requests them will be served first (FIFO).
CAPACITY SCHEDULER
The capacity scheduler (the default scheduler in YARN) was designed to allow sharing of a single YARN cluster by different organizations, and it guarantees that each organization will always have a certain amount of resources available (guaranteed capacity). The main unit of resources scheduled by YARN is a queue. Each queue's capacity determines the percentage of cluster resources that can be used by applications submitted to it. A hierarchy of queues can be set up to reflect a hierarchy of capacity requirements by organizations, so that sub-queues (sub-organizations) can share the resources of a single queue and thus not affect others. In a single queue, the resources are scheduled in FIFO fashion. If enabled, capacity scheduling can be elastic, meaning it allows organizations to use any excess capacity not used by others. Preemption isn't supported, which means the excess capacity temporarily allocated to some organizations isn't automatically freed when demanded by organizations originally entitled to use it. If that happens, the "rightful owners" have to wait until the "guests" have finished using their resources.


FAIR SCHEDULER
The fair scheduler tries to assign resources in such a way that all applications get (on average) an equal share. By default, it bases its decisions on memory only, but you can configure it to schedule with both memory and CPU. Like the capacity scheduler, it also organizes applications into queues. The fair scheduler also supports application priorities (some applications should get more resources than others) and minimum capacity requirements. It offers more flexibility than the capacity scheduler. It enables preemption, meaning when an application demands resources, the fair scheduler can take some resources from other running applications. It can schedule resources according to FIFO scheduling, fair scheduling, and dominant resource fairness scheduling. Dominant resource fairness scheduling takes into account CPU and memory (whereas in normal operation, only memory affects scheduling decisions).
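As a sketch, switching the resource manager to the fair scheduler means adding a property like the following to yarn-site.xml (the class name is the one shipped with the Hadoop distribution; adjust if your version differs):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>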

12.1.4 Submitting Spark applications to YARN
As with running applications in a Spark standalone cluster, depending on where the driver process is running, Spark has two modes of running applications on YARN: the driver can run in YARN, or it can run on the client machine. If you want it to run in the cluster, the Spark master connection URL should be as follows:

--master yarn-cluster

If you want it to run on the client machine, use:

--master yarn-client

NOTE The spark-shell can't be started in yarn-cluster mode because an interactive connection with the driver is required.
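As a minimal sketch, a cluster-mode submission looks like this (the class and JAR names are hypothetical placeholders for your own application):

spark-submit --master yarn-cluster \
  --class org.example.MyApp \
  myapp.jar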

Figure 12.1 shows an example of running Spark on YARN in cluster-deploy mode. Figure 12.2 shows Spark running on YARN in client-deploy mode. As you can see, the order of calls is similar to cluster-deploy mode. What is different is that resource allocation and internal Spark communications are now split between Spark's application master and the driver. Spark's application master handles resource allocation and communication with the resource manager, and the driver communicates directly with Spark's executors. Communication between the driver and the application master is now necessary, too. Submitting an application to YARN in client mode is similar to the way it's done for a standalone cluster. You'll see the output of your application in the client window. You can kill the application by stopping the client process.


Figure 12.2 Running Spark in YARN client-deploy mode in a cluster with two nodes. The client submits an application, and the resource manager starts the container for the application master. The application master requests further containers for Spark executors. The Spark driver runs in the client JVM and communicates directly with the executors once the containers start.

When you submit an application to YARN in cluster mode, your client process will stay alive and wait for the application to finish. Killing the client process won’t stop the


application. If you've turned on information (INFO) message logging, your client process will display periodic messages like this one:

INFO Client: Application report for <application ID> (state: RUNNING)

You'll notice that application startup is somewhat slower on YARN than it is on a standalone cluster. This is because of the way YARN assigns resources: it first has to create a container for the application master, which then has to ask the resource manager to create containers for executors. This overhead isn't felt that much when running larger jobs.
STOPPING AN APPLICATION
YARN offers a way to stop a running application with the following command:

$ yarn application -kill <application ID>

You can obtain the application ID from the spark-submit command's output if you enable logging of INFO messages for the org.apache.spark package. Otherwise, you can find the ID on the YARN web UI (see section 12.1.5). You can kill an application regardless of whether it runs in client or cluster mode.
12.1.5 Configuring Spark on YARN
To run Spark on YARN, you only need to have Spark installed on the client node (the node from which you run spark-submit or spark-shell, or on which your application instantiates a SparkContext). The Spark assembly JAR and all configuration options are transferred automatically to the appropriate YARN containers. The Spark distribution package needs to be built with YARN support. You can download a prebuilt version from the Spark official website (https://spark.apache.org/downloads.html) or build your own. As you've probably noticed, the master connection URL for connecting to YARN contains no hostnames. Therefore, before submitting an application to a YARN cluster, you need to tell Spark where to find the YARN resource manager. This is done by setting one of the following two variables to point to the directory that contains the YARN configuration: YARN_CONF_DIR or HADOOP_CONF_DIR. At minimum, the specified directory needs to have a yarn-site.xml file with this configuration:

<property>
  <name>yarn.resourcemanager.address</name>
  <value>{RM_hostname}:{RM_port}</value>
</property>

RM_hostname and RM_port are the hostname and port of your resource manager (the default port is 8050). For other configuration options that you can specify in yarn-site.xml, please see the official Hadoop documentation (http://mng.bz/zB92). If you need to access HDFS, the directory specified by YARN_CONF_DIR or HADOOP_CONF_DIR should also contain the core-site.xml file with the parameter fs.default


.name set to a value similar to this: hdfs://yourhostname:9000. This client-side configuration will be distributed to all Spark executors in the YARN cluster.
SPECIFYING A YARN QUEUE
As we discussed in section 12.1.3, when using the capacity or fair schedulers, YARN applications' resources are allocated by specifying a queue. You set the queue name Spark will use with the --queue command-line parameter, the spark.yarn.queue configuration parameter, or the SPARK_YARN_QUEUE environment variable. If you don't specify a queue name, Spark will use the default queue (named default).
SHARING THE SPARK ASSEMBLY JAR
When submitting Spark applications to a YARN cluster, the JAR files containing Spark classes (all the JARs from the Spark installation's jars folder) need to be transferred to the containers on remote nodes. This upload can take some time because these files can take up more than 150 MB. You can shorten this time by uploading the JARs manually to a specific folder on all executor machines and setting the spark.yarn.jars configuration parameter to point to that folder. Or you can make it point to a central folder on HDFS. There is a third option: you can put the JAR files in an archive and set the spark.yarn.archive parameter to point to the archive (in a folder on each node or in an HDFS folder). This way, Spark will be able to access the JARs from each container when needed, instead of uploading the JARs from the client each time it runs.
MODIFYING THE CLASSPATH AND SHARING FILES
Most of the things we said in the previous chapter about specifying extra classpath entries and files for a standalone cluster also apply to YARN. You can use the SPARK_CLASSPATH variable, the --jars command-line parameter, or spark.driver.extraClassPath (and the other spark.*.extra[Class|Library]Path parameters). For more details, see the official documentation at http://mng.bz/75KO. There are a few differences, though. An additional command-line parameter, --archives, lets you specify an archive name that will be transferred to worker machines and extracted into the working directory of each executor. Additionally, for each file or archive specified with --archives and --files, you can add a reference filename so you can access it from your program running on executors. For example, you can submit your application with these parameters:

--files /path/to/myfile.txt#fileName.txt

Then, you can access the file myfile.txt from your program by using the name fileName.txt. This is specific to running on YARN.
12.1.6 Configuring resources for Spark jobs
YARN schedules CPU and memory resources. A few specifics that you should be aware of are described in this section. For example, the default number of executor cores should be changed in most circumstances (as described shortly). Also, to take advantage


of Spark's memory management, it's important to configure it properly, especially on YARN. Several important, YARN-specific parameters are mentioned here as well.
SPECIFYING CPU RESOURCES FOR AN APPLICATION
The default setup when running on YARN is to have two executors and one core per executor. This usually isn't enough. To change it, use the following command-line options when submitting an application to YARN:
■ --num-executors—Changes the number of executors
■ --executor-cores—Changes the number of cores per executor
SPARK MEMORY MANAGEMENT WHEN RUNNING ON YARN
Just as on a standalone cluster, driver memory can be set with the --driver-memory command-line parameter, the spark.driver.memory configuration parameter, or the SPARK_DRIVER_MEMORY environment variable. For executors, the situation is a bit different. Similar to a standalone cluster, spark.executor.memory determines the executors' heap size (same as the SPARK_EXECUTOR_MEMORY environment variable or the --executor-memory command-line parameter). An additional parameter, spark.executor.memoryOverhead, determines additional memory beyond the Java heap that will be available to YARN containers running Spark executors. This memory is necessary for the JVM process itself. If your executor uses more memory than spark.executor.memory + spark.executor.memoryOverhead, YARN will shut down the container, and your jobs will repeatedly fail.

TIP Failing to set spark.executor.memoryOverhead to a sufficiently high value can lead to problems that are hard to diagnose. Make sure to specify at least 1024 MB.
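As a sketch, a submission that sizes executors explicitly might look like this (all values are illustrative, and the class and JAR names are hypothetical):

spark-submit --master yarn-cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 3g \
  --class org.example.MyApp myapp.jar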

The memory layout of Spark's executors when running on YARN is shown in figure 12.3. It shows the memory overhead along with the sections of the Java heap we described in section 10.2.4: storage memory (whose size is determined by spark.storage.memoryFraction), shuffle memory (size determined by spark.shuffle.memoryFraction), and the rest of the heap, used for Java objects.

Figure 12.3 Components of Spark's executors' memory on YARN

When your application is running in YARN cluster mode, spark.yarn.driver.memoryOverhead determines the memory overhead of the driver's container. spark.yarn.am.memoryOverhead determines the memory overhead of the application master in


client mode. Furthermore, a few YARN parameters (specified in yarn-site.xml) influence memory allocation:
■ yarn.scheduler.maximum-allocation-mb—Determines the upper memory limit of YARN containers. The resource manager won't allow allocation of larger amounts of memory. The default value is 8192 MB.
■ yarn.scheduler.minimum-allocation-mb—Determines the minimum amount of memory the resource manager can allocate. The resource manager allocates memory only in multiples of this parameter. The default value is 1024 MB.
■ yarn.nodemanager.resource.memory-mb—Determines the maximum amount of memory YARN can use on a node overall. The default value is 8192 MB.
yarn.nodemanager.resource.memory-mb should be set to the amount of memory available on a node, minus the memory needed for the OS. yarn.scheduler.maximum-allocation-mb should be set to the same value. Because YARN will round up all allocation requests to multiples of yarn.scheduler.minimum-allocation-mb, that parameter should be set to a value small enough not to waste memory unnecessarily (for example, 256 MB).
CONFIGURING EXECUTOR RESOURCES PROGRAMMATICALLY
Executor resources can also be specified when creating Spark context objects programmatically, with the spark.executor.cores, spark.executor.instances, and spark.executor.memory parameters. But if your application is running in YARN cluster mode, the driver is running in its own container, started in parallel with the executor containers, so these parameters will have no effect. In that case, it's best to set these parameters in the spark-defaults.conf file or on the command line.
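The following Scala sketch shows what the programmatic route might look like in yarn-client mode, where it does take effect (the application name is made up, and the values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Executor resources are set on the SparkConf before the context is created
val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("ResourceDemo")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "3g")
val sc = new SparkContext(conf)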

12.1.7 YARN UI
Similar to a Spark standalone cluster, YARN also provides a web interface for monitoring the cluster state. By default, it starts on port 8088 on the machine where the resource manager is running. You can see the state and various metrics of your nodes, the current capacity usage of the cluster, local and remote log files, and the status of finished and currently running applications. Figure 12.4 shows a sample YARN UI starting page with a list of applications.

Figure 12.4 YARN UI: the All Applications page shows a list of running and finished applications


Figure 12.5 YARN UI: The Application Overview page

From this page, you can click an application's ID, which takes you to the Application Overview page (figure 12.5). There you can examine the application's name, status, and running time, and access its log files. The node at which the application master is running is also displayed.
ACCESSING THE SPARK WEB UI FROM THE YARN UI
The so-called tracking UI (or tracking URL) is available from the YARN UI Applications page and from the Application Overview page. The tracking URL takes you to the Spark web UI if the application is still running. We described the Spark web UI in chapter 10.

TIP If you get the error "Connection refused" in your browser when trying to access the Spark web UI through the Tracking UI link, you should set the YARN configuration parameter yarn.resourcemanager.hostname (in yarn-site.xml) to the exact value of your YARN hostname.

When the application finishes, you'll be taken to the YARN Application Overview page (figure 12.4). If you have the Spark History Server running, the tracking URL for the finished application will point to the Spark History Server.
SPARK HISTORY SERVER AND YARN
If you enabled event logging and started the Spark History Server (as described in chapter 10), you'll be able to access the Spark web UI of a finished Spark application, just like when running on a Spark standalone cluster. But if you're running your application in YARN cluster mode, the driver can run on any node in the cluster. In order for the Spark History Server to see the event log files, don't use a local directory as the event log directory; put it on HDFS.


12.1.8 Finding logs on YARN
By default, YARN stores your application's logs locally on the machines where the containers are running. The directory for storing log files is determined by the parameter yarn.nodemanager.log-dirs (set in yarn-site.xml), which defaults to /logs/userlogs. To view the log files, you need to look in this directory on each container's machine. The directory contains a subdirectory for each application. You can also find the log files through the YARN web UI. The application master's logs are available from the Application Overview page. To view logs from other containers, you need to find the appropriate node by clicking Nodes and then clicking the node name in the list of nodes. Then go to its list of containers, where you can access logs for each of them.
USING LOG AGGREGATION
Another option on YARN is to enable the log aggregation feature by setting the yarn.log-aggregation-enable parameter in yarn-site.xml to true. After an application finishes, its log files will be transferred to a directory on HDFS as specified by the parameter yarn.nodemanager.remote-app-log-dir, which defaults to /tmp/logs. Under this directory, a hierarchy of subdirectories is created, first by the current user's username and then by the application ID. The final application aggregate log directory contains one file per node on which the application executed. You can view these aggregate log files by using the yarn logs command (only after the application has finished executing) and specifying the application ID:

$ yarn logs -applicationId <application ID>

As we already said, you can obtain the application ID from the spark-submit command's output if logging of INFO messages for the org.apache.spark package is enabled. Otherwise, you can find it on the YARN web UI (see section 12.1.5). You can view a single container's logs by specifying the container's ID. To do so, you also need to specify the hostname and port of the node on which the container is executing. You can find this information on the Nodes page of the YARN web UI:

$ yarn logs -applicationId <application ID> -containerId <container ID> \
  -nodeAddress <node hostname>:<node port>

You can then use shell utilities to further filter and grep the logs.
CONFIGURING THE LOGGING LEVEL
If you want to use an application-specific log4j configuration, you need to upload your log4j.properties file using the --files option while submitting the application. Alternatively, you can specify the location of the log4j.properties file (which should already be present on the node machines) using the -Dlog4j.configuration parameter in the spark.executor.extraJavaOptions option. For example, from the command line, you can do it like this:

$ spark-submit --master yarn-client --conf \
  "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/usr/local/conf/log4j.properties" ...
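A minimal log4j.properties file to upload or reference might look like the following sketch (a standard log4j 1.x console configuration; adjust categories and levels to taste):

log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.spark=INFO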


12.1.9 Security considerations
Hadoop provides the means for authorizing access to resources (HDFS files, for example) to certain users, but it has no means of user authentication. Hadoop instead relies on Kerberos, a widely used and robust security framework. Hadoop allows user access or not, depending on Kerberos-provided identity and access control lists in the Hadoop configuration. If Kerberos is enabled in a Hadoop cluster (in other words, the cluster is Kerberized), only Kerberos-authenticated users can access it. YARN knows how to handle Kerberos authentication information and pass it on to HDFS. The Spark standalone and Mesos cluster managers don't have this functionality. You'll need to use YARN to run Spark if you need to access Kerberized HDFS. To submit jobs to a Kerberized YARN cluster, you need a Kerberos service principal name (in the form username/host@KERBEROS_REALM_NAME; the host part is optional) and a keytab file. The service principal name serves as your Kerberos user name. Your keytab file contains pairs of user names and encryption keys used for encrypting Kerberos authentication messages. Your service principal name and keytab file are typically provided by your Kerberos administrator. Before submitting a job to a Kerberized YARN cluster, you need to authenticate with a Kerberos server using the kinit command (on Linux systems):

$ kinit -kt <keytab file> <service principal name>

Then submit your job as usual.
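Put together, the flow might look like this sketch (the keytab path, principal, realm, class, and JAR names are all hypothetical):

kinit -kt /etc/security/keytabs/sparkuser.keytab sparkuser/[email protected]
spark-submit --master yarn-cluster --class org.example.MyApp myapp.jar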

12.1.10 Dynamic resource allocation
As you may recall from the previous chapters, Spark applications obtain executors from the cluster manager and use them until they finish executing. The same executors are used for several jobs of the same application, and the executors' resources remain allocated even though they may be idle between jobs. This enables tasks to reuse data from a previous job's tasks that ran on the same executors. For example, spark-shell may be idle for a long time while the user is away from their computer, but the executors it allocated remain, holding the cluster's resources. Dynamic allocation is Spark's remedy for this situation, enabling applications to release executors temporarily so that other applications can use the allocated resources. This option has been available since Spark 1.2, but only for the YARN cluster manager. Since Spark 1.5, it's also available on Mesos and standalone clusters.
USING DYNAMIC ALLOCATION
You enable dynamic allocation by setting the spark.dynamicAllocation.enabled parameter to true. You should also enable Spark's shuffle service, which is used to serve executors' shuffle files even after the executors are no longer available. If an executor's shuffle files are requested, the executor isn't available, and the service isn't enabled, the shuffle files will need to be recalculated, which wastes resources. Therefore, you should always enable the shuffle service when enabling dynamic allocation.
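From the command line, that combination might look like the following sketch (the executor bounds are illustrative, and the class and JAR names are hypothetical):

spark-submit --master yarn-cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --class org.example.MyApp myapp.jar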


To enable the shuffle service on YARN, you need to add spark-<version>-yarn-shuffle.jar (available from the lib directory of the Spark distribution) to the classpath of all node managers in the cluster. To do this, put the file in the share/hadoop/yarn/lib folder of your Hadoop installation and then add or edit the following properties in the yarn-site.xml file:
■ Set the property yarn.nodemanager.aux-services to the value "mapreduce_shuffle,spark_shuffle" (basically, add spark_shuffle to the string).
■ Set the property yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService.
This will start the service in each node manager in your cluster. To tell Spark that it should use the service, you need to set the spark.shuffle.service.enabled Spark parameter to true. When dynamic allocation is configured and running, Spark will measure the time during which there are pending tasks to be executed. If this period exceeds the interval specified by the parameter spark.dynamicAllocation.schedulerBacklogTimeout (in seconds), Spark will request executors from the resource manager. It will continue to request them every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds if there are pending tasks. Every time Spark requests new executors, the number of executors requested increases exponentially so that it can respond to the demand quickly enough—but not too quickly, in case the application only needs a couple of them. The parameter spark.dynamicAllocation.executorIdleTimeout specifies the number of seconds an executor needs to remain idle before it's removed. You can control the number of executors with these parameters:
■ spark.dynamicAllocation.minExecutors—Minimum number of executors for your application
■ spark.dynamicAllocation.maxExecutors—Maximum number of executors for your application
■ spark.dynamicAllocation.initialExecutors—Initial number of executors for your application
12.2 Running Spark on Mesos
Mesos is the last cluster manager supported by Spark that we'll talk about, but it's certainly not the least. We mentioned that the Spark project was originally started in order to demonstrate the usefulness of Mesos. Apple's Siri application runs on Mesos, as do applications at eBay, Netflix, Twitter, Uber, and many other companies. Mesos provides a distributed systems kernel and serves commodity cluster resources to applications, just like a Linux kernel manages a single computer's resources and serves them to applications running on a single machine. Mesos supports applications written in Java, C, C++, and Python. With version 0.20, Mesos can


run Docker containers, which are packages containing all the libraries and configurations that an application needs in order to run. With Docker support, you can run on Mesos virtually any application that can run in a Docker container. Since Spark 1.4, Spark can run on Mesos in Docker containers, too. A few points where Mesos could use some improvement are security and support for stateful applications (ones that use persistent storage, such as databases). With the current version (1.0.1), it isn't advisable to run stateful applications on Mesos. The community is working on supporting these use cases, too.2 Also, Kerberos-based authentication isn't supported yet (https://issues.apache.org/jira/browse/MESOS-907). Applications, however, can provide authentication using Simple Authentication and Security Layer (SASL, a widely used authentication and data security framework), and intra-cluster communication is secured with Secure Sockets Layer (SSL). Dynamic allocation (explained in section 12.1.10), previously reserved for YARN only, is available on Mesos with Spark 1.5. To begin our exploration of running Spark on Mesos, we'll first examine the Mesos architecture more closely. Then we'll see how running Spark on Mesos differs from running it on YARN and in Spark standalone clusters.

12.2.1 Mesos architecture
It's simpler to compare the Mesos architecture to a Spark standalone cluster than to YARN. Mesos's basic components—masters, slaves, and applications (or frameworks, in Mesos terms)—should be familiar to you from chapter 11. As is the case with a Spark standalone cluster, a Mesos master schedules slave resources among applications that want to use them. Slaves launch the application's executors, which execute tasks. So far, so good. Mesos is more powerful than a Spark standalone cluster and differs from it in several important points. First, it can schedule other types of applications besides Spark (Java, Scala, C, C++, and Python applications). It's also capable of scheduling disk space, network ports, and even custom resources (not just CPU and memory). And instead of applications demanding resources from the cluster (from its master), a Mesos cluster offers resources to applications, which they can accept or refuse. Frameworks running on Mesos (such as Spark applications) consist of two components: a scheduler and an executor. The scheduler accepts or rejects resources offered by the Mesos master and automatically starts Mesos executors on slaves. Mesos executors run tasks as requested by the frameworks' schedulers. Figure 12.6 shows Spark running on a two-node Mesos cluster in client-deploy and coarse-grained modes. You'll learn what coarse-grained means in an instant.

2 For more information, see https://issues.apache.org/jira/browse/MESOS-1554.


Figure 12.6 Spark running on a two-node Mesos cluster in client-deploy and coarse-grained modes

Let's explain the communication steps shown in the figure:
1 Mesos slaves offer their resources to the master.
2 Spark's Mesos-specific scheduler, running in the driver (started by the spark-submit command, for example), registers with the Mesos master.
3 The Mesos master in turn offers the available resources to the Spark Mesos scheduler (this happens continually: by default, every second while the framework is alive).
4 Spark's Mesos scheduler accepts some of the resources and sends a list of resources, along with a list of tasks it wants to run using the resources, to the Mesos master.
5 The master asks the slaves to start the tasks with the requested resources.


6 Slaves launch executors (in this case, Mesos's Command Executors), which launch Spark executors (using the provided command) in task containers.
7 Spark executors connect to the Spark driver and freely communicate with it, executing Spark tasks as usual.
You can see that the figure is similar to the ones for YARN and a Spark standalone cluster. The main difference is that the scheduling is backward: applications don't demand resources, but accept resources offered by the cluster manager.
FINE-GRAINED AND COARSE-GRAINED SPARK MODES
As we said, figure 12.6 shows Spark running in coarse-grained mode. This means Spark starts one Spark executor per Mesos slave. These executors stay alive during the entire lifetime of the Spark application and run tasks in much the same way as they do on YARN and a Spark standalone cluster. Contrast this with fine-grained mode, shown in figure 12.7. Spark's fine-grained mode is available only on Mesos (it isn't available on other cluster types). In fine-grained mode, one Spark executor—and, hence, one Mesos task—is started per Spark task. This means much more communication, data serialization, and setting up of Spark executor processes needs to be done, compared to coarse-grained mode. Consequently, jobs are likely to be slower in fine-grained mode than in coarse-grained mode. The rationale behind fine-grained mode is to use cluster resources more flexibly so that other frameworks running on the cluster can get a chance to use some of the resources a Spark application may not currently need. It's used mainly for batch or streaming jobs that have long-running tasks, because in those cases, the slowdown due to management of Spark executors is negligible. One additional detail visible in figure 12.7 is the Spark Mesos executor. This is a custom Mesos executor used only in Spark's fine-grained mode. Fine-grained mode is the default option, so if you try Mesos and see that it's much slower than YARN or a Spark standalone cluster, first switch to coarse-grained mode by setting the spark.mesos.coarse parameter to true (configuring Spark was described in chapter 10) and try your job again. Chances are, it will be much faster.
MESOS CONTAINERS
Tasks in Mesos are executed in containers, shown in figures 12.6 and 12.7, whose purpose is to isolate resources between processes (tasks) on the same slave so that tasks don't interfere with each other. The two basic types of containers in Mesos are Linux cgroups containers (the default containers) and Docker containers.
■ cgroups (control groups)—A feature of the Linux kernel that limits and isolates processes' resource usage. A control group is a collection of processes to which a set of resource limitations (CPU, memory, network, disk) is applied.
■ Docker containers—Similar to VMs. In addition to limiting resources, as cgroups containers do, Docker containers provide the required system libraries. This is the crucial difference.


Figure 12.7 Spark running on a two-node Mesos cluster in client-deploy and fine-grained modes. Each Spark executor runs only one task.

We'll say more about using Docker in Mesos in section 12.2.6.
MASTER HIGH-AVAILABILITY
Similar to Spark standalone clusters, you can set up Mesos to use several master processes. Master processes use ZooKeeper to elect a leader among themselves. If the leader goes down, the standby masters will elect a new leader, again using ZooKeeper.

12.2.2 Installing and configuring Mesos
The officially recommended way to install Mesos is to build it from source code.3 But if you're lucky enough to be running a Linux version supported by Mesosphere, a quicker and simpler way is to install Mesos from the Mesosphere package repository.4

3 For more information, see the “Getting Started” guide at http://mesos.apache.org/gettingstarted. 4 For more information, see https://mesosphere.com/downloads.


Here we show the steps for installing Mesos from Mesosphere on Ubuntu. If you need help installing Mesos on other platforms, or want more information about installing and configuring Mesos in general, we recommend Mesos in Action by Roger Ignazio (Manning, 2016). You first need to set up the repository:

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
$ echo "deb http://repos.mesosphere.io/ubuntu trusty main" | \
  sudo tee /etc/apt/sources.list.d/mesosphere.list

and then install the package:

$ sudo apt-get install mesos

On the master node, you’ll also need to install Zookeeper:

$ sudo apt-get install zookeeper

BASIC CONFIGURATION
You're (almost) good to go. The only thing left is to tell the slaves where to find the master. Master and slaves look in the file /etc/mesos/zk to find ZooKeeper's master address (you should set up and start ZooKeeper before starting Mesos). Slaves always ask ZooKeeper for the master's address. Once you edit /etc/mesos/zk, you have a fully working Mesos cluster. If you want to further customize your Mesos configuration, you can use these locations:

/etc/mesos
/etc/mesos-master
/etc/mesos-slave
/etc/default/mesos
/etc/default/mesos-master
/etc/default/mesos-slave
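For example, pointing every machine at ZooKeeper could look like this sketch (the ZooKeeper hostnames are made up; the zk://.../mesos form is the conventional Mesos ZooKeeper URL):

echo "zk://zk1.example.com:2181,zk2.example.com:2181/mesos" | sudo tee /etc/mesos/zk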

The list of all configuration options is available at the official documentation page (http://mesos.apache.org/documentation/latest/configuration). You can also obtain it by running the mesos-master --help and mesos-slave --help commands. Environment variables are specified in the /etc/default/mesos* files just listed. Command-line parameters can also be specified on the command line and in the /etc/mesos/* directories. Place each parameter in a separate file, where the name of the file matches the parameter name and the contents of the file contain the parameter value.
STARTING MESOS
You run Mesos by starting the corresponding service. To start the master, you should use

$ sudo service mesos-master start

Use this for slaves:

$ sudo service mesos-slave start


These commands automatically pick up the configurations from the configuration files previously described. You can verify that the services are running by accessing the Mesos web UI at port 5050 (the default port).

12.2.3 Mesos web UI
When you start a Mesos master, it automatically starts a web UI. Figure 12.8 shows the main page.

Figure 12.8 The main Mesos web UI page shows active and completed tasks and basic information about the cluster.

Figure 12.9 Overview of the state of the cluster's CPU and memory resources on the main Mesos web UI page

Figure 12.10 The Mesos Frameworks page shows active and terminated frameworks, with an overview of their resources.

On the main page, you'll find a list of active and completed tasks, information about the master and slaves, and an overview of the cluster's CPU and memory (figure 12.9). The list of active and terminated frameworks is available on the Frameworks page (figure 12.10). You can click one of the frameworks to examine a list of its active and terminated tasks. The Slaves page (figure 12.11) shows a list of registered slaves and their resources. This is where


you can check whether all of your slaves have registered with the master and are available. You can click one of the slaves to see a list of previous or currently running frameworks. If you click one of the frameworks, you'll also see a list of its running and completed executors on that slave.

Figure 12.11 A list of a framework’s executors on a slave. The executor’s environment is available from the Sandbox link.

From this screen, you can access an executor’s environment by clicking the Sandbox link. Figure 12.12 shows the resulting page. The Sandbox page lets you see your executor’s environment: application files and its log files. When you click one of the log files, a separate window opens with a live view of your application’s logs, which is automatically updated as new lines become available. This is useful for debugging your application.

Figure 12.12 An executor’s Sandbox page shows the executor’s working folder. You can also download application files and system out and system error log files.


12.2.4 Mesos resource scheduling
The fine-grained and coarse-grained modes described in section 12.2.1 are Spark's scheduling policies when running on Mesos. Mesos itself knows nothing about them. Resource scheduling decisions in Mesos are made on two levels: by the resource-allocation module running in the Mesos master, and by the framework's scheduler, which we mentioned previously. The resource-allocation module decides which resources to offer to which frameworks and in which order. As we mentioned, Mesos can schedule memory, CPU, disk, network ports, and even custom resources. It seeks to satisfy the needs of all frameworks while fairly distributing resources among them. Different types of workloads and different frameworks require different allocation policies, because some are long-running and some are short-running, some mostly use memory, some CPU, and so forth. That's why the resource-allocation module allows for the use of plug-ins that can change resource-allocation decisions. By default, Mesos uses the Dominant Resource Fairness (DRF) algorithm, which is appropriate for the majority of use cases. It tracks each framework's resources and determines the dominant resource for each framework, which is the resource type the framework uses the most. Then it calculates the framework's dominant share, which is the share of all cluster resources of the dominant resource type used by the framework. If a framework is using 1 CPU and 6 GB of RAM in a cluster of 20 CPUs and 36 GB of RAM, then it's using 1/20 of all CPUs and 1/6 of all RAM; thus its dominant resource is RAM, and its dominant share is 1/6. DRF allocates newly available resources to the framework with the lowest dominant share. The framework can accept or reject the offer. If it currently has no work to do, or if the offer doesn't contain enough resources for its needs, the framework can reject the offer. This can be useful if the framework needs a specific locality level for its tasks. It can wait until a resource offer with an acceptable locality level becomes available. If a framework is holding on to some resources for too long, Mesos can revoke the tasks taking up the resources.
ROLES, WEIGHTS, AND RESOURCE ALLOCATIONS
Roles and weights let you prioritize frameworks so that some get more resource offers than others. Roles are names attached to typical groups of users or frameworks, and weights are numbers attached to roles that specify the priorities the roles will have when resource-allocation decisions are made. You can initialize a master with a list of acceptable roles with the --roles option (for example, --roles=dev,test,stage,prod) or with a list of weights per role (for example, --weights='dev=30,test=50,stage=80,prod=100'). As we explained in the previous section, you can place these values in the /etc/mesos-master/roles and /etc/mesos-master/weights files, respectively. In this example, the prod role will get twice as many resource offers as the test role.


Furthermore, resource allocations can be used on slaves to reserve some resources for a particular role. When starting a slave, you can specify the --resources parameter with the contents formatted as 'resource_name1(role_name1):value1;resource_name2(role_name2):value2'. For example, --resources='cpu(prod):3;mem(prod):2048' will reserve 3 CPUs and 2 GB of RAM exclusively for frameworks in the prod role. Frameworks can specify a role when registering with a master. To tell Spark which role to use when registering with Mesos, you can use the spark.mesos.role parameter (available since Spark version 1.5).

MESOS ATTRIBUTES AND SPARK CONSTRAINTS
Mesos also lets you specify custom attributes per slave with the --attributes parameter. You can use attributes to specify, for example, the type and version of the slave's operating system or the rack the slave is running on. Mesos supports attributes of the following types: float, integer, range, set, and text. For more information, see the official documentation (http://mesos.apache.org/documentation/attributes-resources/). Once specified, the attributes are sent with each resource offer.

Beginning with Spark 1.5, you can specify attribute constraints using the spark.mesos.constraints parameter. This instructs Spark to accept only those offers whose attributes match the specified constraints. You specify the constraints as key-value pairs, separated by semicolons (;). Keys and values themselves are separated by colons (:); and if a value is composed of several values, they're separated by commas (,) (for example, os:ubuntu,redhat;zone:EU1,EU2).
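As an illustration, both parameters could be set on a SparkConf (or equivalently in spark-defaults.conf or with spark-submit --conf). The role name and attribute values below are assumptions that must match the roles and attributes configured on your own Mesos master and slaves:

import org.apache.spark.SparkConf

// Hypothetical role and constraints; adjust them to your own Mesos setup
val conf = new SparkConf()
  .setAppName("Constrained Spark app")
  .set("spark.mesos.role", "prod")
  .set("spark.mesos.constraints", "os:ubuntu,redhat;zone:EU1,EU2")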

12.2.5 Submitting Spark applications to Mesos

The master URL for submitting Spark applications to Mesos starts with mesos:// and needs to specify the master's hostname and port:

$ spark-submit --master mesos://<master_hostname>:<port>

The default Mesos master port is 5050. If you have several masters running, synchronized through ZooKeeper, you can instruct Spark to ask ZooKeeper for the currently elected leader by specifying a master URL in this format:

mesos://zk://<zookeeper_hostname>:<port>/mesos

The ZooKeeper port defaults to 2181.

MAKING SPARK AVAILABLE TO SLAVES
Mesos's slaves need to know where to find the Spark classes to start your tasks. If you have Spark installed on the slave machines in the same location as on the driver, there's nothing special you need to do to make Spark available on the slave machines: Mesos's slaves will automatically pick up Spark from the same location. If Spark is installed on the slave machines, but in a different location, you can specify this location with the Spark parameter spark.mesos.executor.home.


If you don't have Spark installed on the slave machines, then you need to upload Spark's binary package to a location accessible by the slaves. This can be an NFS shared filesystem or a location on HDFS or Amazon S3. You can get Spark's binary package either by downloading it from Spark's official download page or by building Spark yourself and packaging the distribution with the make-distribution.sh command. Then you need to set this location as the value of the parameter spark.executor.uri and as the value of the environment variable SPARK_EXECUTOR_URI.

RUNNING IN CLUSTER MODE
Mesos's cluster mode is available with Spark 1.4. A new component was added to Spark for this purpose: MesosClusterDispatcher. It's a separate Mesos framework used only for submitting Spark drivers to Mesos.

You start MesosClusterDispatcher with the script start-mesos-dispatcher.sh from Spark's sbin directory and then pass the master URL to it. Start the cluster dispatcher with ZooKeeper's version of the master's URL; otherwise, you may have problems submitting jobs (for example, the submit could hang). The complete start command looks like this:

$ start-mesos-dispatcher.sh --master mesos://zk://<zookeeper_host>:2181/mesos

The dispatcher process will start listening at port 7077. You can then submit applications to the dispatcher process instead of the Mesos master. You also need to specify cluster-deploy mode when submitting applications:

$ spark-submit --deploy-mode cluster \
  --master mesos://<dispatcher_host>:7077 ...

If everything goes OK, the driver's Mesos task will be visible in the Mesos UI, and you'll be able to see which slave it's running on. You can use that information to access the driver's Spark web UI.

OTHER CONFIGURATION OPTIONS
A few more parameters are available for configuring how Spark runs on Mesos, in addition to the spark.mesos.coarse, spark.mesos.role, and spark.mesos.constraints parameters that we mentioned before. They're shown in table 12.2.

Table 12.2 Additional configuration parameters for running Spark on Mesos

■ spark.cores.max (default: all available cores): Limits the number of tasks your cluster can run in parallel. It's available on other cluster managers, too.
■ spark.mesos.mesosExecutor.cores (default: 1): Tells Spark how many cores to reserve per Mesos executor, in addition to the resources it takes for its tasks. The value can be a decimal number.
■ spark.mesos.extra.cores (default: 0): Specifies the extra number of cores to reserve per task in coarse-grained mode.
■ spark.mesos.executor.memoryOverhead (default: 10% of spark.executor.memory): Specifies the amount of extra memory to reserve per executor. This is similar to the memory overhead parameter on YARN.
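For reference, here is a sketch of how a few of these parameters, together with the spark.executor.uri parameter mentioned earlier, might be set programmatically; the same keys can also go into spark-defaults.conf or be passed to spark-submit with --conf. All values, including the HDFS location, are illustrative assumptions:

import org.apache.spark.SparkConf

// Illustrative values only; tune them for your own cluster
val conf = new SparkConf()
  .setAppName("Spark on Mesos example")
  .set("spark.cores.max", "8")                        // cap on parallel tasks across the cluster
  .set("spark.mesos.mesosExecutor.cores", "1")        // cores reserved per Mesos executor
  .set("spark.mesos.executor.memoryOverhead", "512")  // extra memory per executor, in MB
  .set("spark.executor.uri",                          // hypothetical location of the Spark package
    "hdfs://namenode:9000/dist/spark-2.0.0-bin-hadoop2.7.tgz")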

12.2.6 Running Spark with Docker

As we said earlier, Mesos uses Linux cgroups as the default "containerizer" for the tasks it's running, but it can also use Docker. Docker lets you package an application along with all of its library and configuration requirements. The name Docker comes from an analogy to freight containers, which all adhere to the same specifications and sizes so they can be handled and transported the same way regardless of their contents. The same principle holds true for Docker containers: they can run in different environments (different OSes with different versions of Java, Python, or other libraries), but they behave the same way on all of them because they bring their own libraries with them. So, you only need to set up your machines to run Docker containers. Docker containers bring everything needed for whichever application they contain.

You can use Docker, Mesos, and Spark in several interesting ways and combinations. You can run Mesos in Docker containers, or you can run Spark in Docker containers. You can have Mesos run Spark and other applications in Docker containers, or you can do both—run Mesos in Docker containers and have it run other Docker containers. EBay, for example, uses Mesos to run Jenkins servers for continuous delivery in the company's development department (http://mng.bz/MLtO).

The benefit of running Docker containers on Mesos is that Docker and Mesos provide two layers of abstraction between your application and the infrastructure it's running on, so you can write your application for a specific environment (contained in Docker) and distribute it to hundreds and thousands of different machines. We'll first show you how to install and configure Docker and then how to use it to run Spark jobs on Mesos.

INSTALLING DOCKER
Installing Docker on Ubuntu is simple because the installation script is available online at https://get.docker.com/.5 You only need to pass it to your shell:

$ curl -sSL https://get.docker.com/ | sh

You can verify your installation with this command:

$ sudo docker run hello-world

5 For other environments, see the official installation documentation: https://docs.docker.com/installation.


To enable other users to run the docker command without having to use sudo every time, create a user group named docker and add your user to the group:

$ sudo addgroup docker
$ sudo usermod -aG docker <username>

After logging out and logging back in, you should be able to run docker without using the sudo command.

USING DOCKER
Docker uses images to run containers. Images are comparable to templates, and containers are comparable to concrete instances of the images. You can build new images from Dockerfiles, which describe the image contents, and you can pull images from Docker Hub, an online repository of public Docker images. If you visit Docker Hub (https://hub.docker.com) and search for mesos or spark, you'll see that dozens of images are available for you to use. In this section, you'll build a Docker image based on the mesosphere/mesos image available from Docker Hub.

When you have built or pulled an image to your local machine, it's available locally, and you can run it. If you try to run an image that doesn't exist locally, Docker pulls it automatically from the Docker Hub. To list all the images available locally, you can use the docker images command. To see the running containers, use the docker ps command. For the complete command reference, use the docker --help command.

You can also run a command in a container in interactive mode by using the -i and -t flags, and by specifying the command as an additional argument. For example, this command starts a bash shell in a container of an image called spark-image:

$ docker run -it spark-image bash

BUILDING A SPARK DOCKER IMAGE
This is the contents of a Dockerfile for building a Docker image that can be used to run Spark on Mesos:

FROM mesosphere/mesos:0.20.1
RUN apt-get update && \
    apt-get install -y python libnss3 openjdk-7-jre-headless curl
RUN mkdir /opt/spark && \
    curl http://www-us.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz \
    | tar -xzC /opt
ENV SPARK_HOME /opt/spark-2.0.0-bin-hadoop2.7
ENV MESOS_NATIVE_JAVA_LIBRARY /usr/local/lib/libmesos.so

Copy those lines into a file called Dockerfile in a folder of your choosing. Then, position yourself in the folder and build the image using this command:

$ docker build -t spark-mesos .


This command instructs Docker to build an image called spark-mesos using the Dockerfile in the current directory. You can verify that your image is available with the docker images command. You need to do this on all the slave machines in your Mesos cluster.

PREPARING MESOS TO USE DOCKER
Before using Docker in Mesos, the Docker containerizer needs to be enabled. You need to execute the following two commands on each of the slave machines in your Mesos cluster:

$ echo 'docker,mesos' > /etc/mesos-slave/containerizers
$ echo '5mins' > /etc/mesos-slave/executor_registration_timeout

Then restart your slaves:

$ sudo service mesos-slave restart

Mesos should now be ready to run Spark executors in Docker images.

RUNNING SPARK TASKS IN DOCKER IMAGES
Before running Spark tasks in your newly built Docker image, you need to set a couple of configuration parameters. First, you need to tell Spark the name of the image by specifying it in the spark.mesos.executor.docker.image parameter (in your spark-defaults.conf file). The image you built has Spark installed in the folder /opt/spark-2.0.0-bin-hadoop2.7, which is probably different than your Spark installation location. So, you'll need to set the parameter spark.mesos.executor.home to the image's Spark installation location. And you'll need to tell Spark executors where to find Mesos's system library. You can do this with the parameter spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY. Your spark-defaults.conf file should now contain these lines:

spark.mesos.executor.docker.image spark-mesos
spark.mesos.executor.home /opt/spark-2.0.0-bin-hadoop2.7
spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY /usr/local/lib/libmesos.so

We’ll use the SparkPi example to demonstrate submitting an application to Docker, but you can use your own (for example, the one built in chapter 3). Position yourself in your Spark home directory, and issue the following command (the name of the JAR file could be different, depending on your Spark version):

$ spark-submit --master mesos://zk://<zookeeper_host>:2181/mesos \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.0.0.jar

If everything goes well, you should see the message Pi is roughly 3.13972 in your console output. To verify that Mesos really used a Docker container to run Spark tasks, you can use the Mesos UI to find your framework.


Click the Sandbox link next to one of the framework's completed tasks, and then open its standard output log file. It should contain the following line:

Registered docker executor on <slave hostname>

FURTHER DOCKER CONFIGURATION
If you need to access the slave's filesystem from within your Docker image, Docker lets you mount the host's folders to the folders in your image with the -v flag. You can instruct Spark to do this for you when launching Docker executors by specifying the parameter spark.mesos.executor.docker.volumes containing a comma-separated list of volume (folder) mappings in the following format:

[host_path:]container_path[:ro|:rw]

The host path is optional if it’s the same as the container path. Docker also lets you connect certain network ports on your image to the ones on the slave’s host. You can specify these port mappings with the parameter spark.mesos.executor.docker.portmaps in this format:

host_port:container_port[:tcp|:udp]
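As a sketch, the two Docker-related parameters just described might be set like this in application code (the volume paths and port numbers are hypothetical; the same keys can also go into spark-defaults.conf):

import org.apache.spark.SparkConf

// Hypothetical volume and port mappings; adjust them to your own image and hosts
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "spark-mesos")
  .set("spark.mesos.executor.docker.volumes", "/var/data:/data:ro,/tmp/spark:/tmp/spark:rw")
  .set("spark.mesos.executor.docker.portmaps", "8080:8080:tcp")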

This way, you'll be able to access the container through ports on the host.

12.3 Summary
■ YARN is the new generation of Hadoop's MapReduce execution engine and can run MapReduce, Spark, and other types of programs.
■ YARN consists of a resource manager and several node managers.
■ Applications on YARN run in containers and provide their application masters.
■ YARN supports three different schedulers: FIFO, capacity, and fair.
■ Spark on YARN runs in yarn-cluster and yarn-client modes.
■ YARN kills containers that use more memory than allowed, so tuning spark.executor.memoryOverhead is important.
■ YARN provides log aggregation for easy log inspection.
■ YARN was the first cluster manager to support dynamic allocation.
■ YARN is the only cluster manager on which Spark can access HDFS secured with Kerberos.
■ Mesos can also run different types of applications (you can even run YARN on Mesos), but unlike YARN, it's capable of scheduling disk, network, and even custom resources.
■ Mesos consists of masters, slaves, frameworks, and executors.
■ Spark on Mesos can run in fine-grained or coarse-grained mode.
■ Mesos provides resource isolation for its tasks through containers, implemented with Linux cgroups or Docker.


■ Mesos's resource scheduling operates on two levels: by a framework's scheduler and by Mesos's resource-allocation module.
■ Mesos allows framework prioritization through roles, weights, and resource allocations.
■ You can use Spark's constraints to accept resource offers from only some of the slaves in the cluster.
■ Spark on Mesos supports both client and cluster modes. Cluster mode is implemented with Spark's Mesos dispatcher.
■ You can run Spark's executors on Mesos from Docker images.

Part 4 Bringing it together

In chapter 13, you'll bring it all together and explore a Spark Streaming application for analyzing log files and displaying the results on a real-time dashboard. The application implemented in chapter 13 can be used as a basis for your own future applications. Chapter 14 introduces H2O, a scalable, fast machine-learning framework with implementations of many machine-learning algorithms, most notably deep learning, which Spark lacks; and Sparkling Water, H2O's package that enables you to start and use an H2O cluster from Spark. Through Sparkling Water, you can use Spark's Core, SQL, Streaming, and GraphX components to ingest, prepare, and analyze data, and transfer it to H2O to be used in H2O's deep-learning algorithms. You can then transfer the results back to Spark and use them in subsequent computations.

Case study: real-time dashboard

This chapter covers
■ An example Spark Streaming application
■ A real-time dashboard application example
■ Explanation of the application components
■ Running the application
■ Examining the source code

In this chapter, we'll show you how to use Spark in a real-life application. For the example application in this chapter, we'll show you how to build a real-time dashboard (a control panel with monitoring instruments) for viewing statistics calculated from a web server's access log files. We'll first explain the main idea behind the application. Then we'll show you how to run it using shell scripts and a Docker image (which we prepared for you at https://github.com/spark-in-action/uc1-docker) and how to run the components manually, if you choose to do so. At the end of the chapter, we'll explain the application code.



You can use this example application as a starting point for your own dashboard. You can extend the example to track additional metrics, or you can use it to show a dashboard for something entirely different than web-access logs. You can replace some of the components with your own, and more. The first step is to see it in action and understand how it works. Let's get to it.

13.1 Understanding the use case

In this section, you'll first learn what this case study is all about. Then we'll explain the components of the application that we built to realize the use case. You'll also get the answers to the "what" and "why" questions you may have (the "how" question will be answered in sections 13.2 and 13.3).

13.1.1 The overall picture

Real-time dashboards are common. People generally want to have insight into the state of their applications and systems, and they want to be able to react quickly to any challenges that may arise. Dashboards can display sensor data, resource-utilization data, click-stream data (log traces that users leave while browsing a website), and so on.

The use case in this chapter analyzes click-stream data coming from access log files and displays the results in a web application called Web Stats Dashboard. You want to receive access log files; calculate the number of active user sessions in each second and the number of requests, errors, and ads clicked per second; and display all that information in real-time graphs. The result is shown in figure 13.1.

As you can see, the web page (the only page in the web application) has two graphs: the upper graph shows the number of active user sessions, and the lower graph shows the number of requests, errors, and ads clicked per second. For debugging purposes, you log the last 100 messages received (shown in the text area on the right). Using the button in the upper-left corner, you can start or stop consuming messages; and using the row of buttons at the top of the page, you can change the time range displayed. The access log data is in this format:

<date and time> <IP address> <sessionId> <URL> <method> <respCode> <duration>

For example:

2016-04-12 21:38:39.138 192.168.0.123 514304dd-dbf4-4ad9-9cff-557dcff47d7b / GET 200 500
2016-04-12 21:38:39.138 192.168.0.252 d7a074e5-77c2-4045-9447-245f7b80269d sia.org/ads/10/123/clickfw GET 200 500
2016-04-12 21:38:39.138 192.168.0.51 df870e59-b67c-45d4-b02b-458aa492052f / GET 200 500



Figure 13.1 Web Stats Dashboard displays two graphs: one showing the number of active user sessions and the other showing the number of requests, errors, and ads clicked per second. You can change the displayed time range using the buttons at the top, and start or stop consuming messages using the button in the upper-left corner. The last 100 messages are shown on the right for debugging purposes.

Actual fields are described in table 13.1.

Table 13.1 Data stored in access logs for each user request in the example application

■ date and time: Date and time formatted as yyyy-MM-dd hh:mm:ss.SSS (for example, "2016-02-01 13:10:50.738")
■ IPaddress: Client's IP address
■ sessionId: Contents of the client's session cookie
■ URL: Visited URL
■ method: HTTP method used (GET, POST, and so on)
■ respCode: Server's HTTP response code
■ duration: Time it took the server to respond to the request


This is, more or less, the data that’s usually stored in web server access log files. But you won’t use all of these fields to calculate statistics.

13.1.2 Understanding the application's components

Figure 13.2 shows the components of this application. Because you don't have a real, running website that would generate access logs, you use Log Simulator, a Java application that simulates access logs generated by users visiting URLs on the website. Log Simulator sends the formatted click data directly to a Kafka topic. (As mentioned in chapter 6, Kafka is a distributed queuing system used in many organizations today for streaming data.)

If you were dealing with click-stream data from a real website (or several sites), you'd need a method of collecting the log files and sending them to Kafka. Flume (mentioned in chapter 1) would be a good choice for that. It can execute the tail -F command, for example, and direct its output to Kafka.

The next component, a Spark Streaming application called Web Log Analyzer, is the main component of the system. It reads log data from Kafka, calculates several statistics, and writes the statistics back to Kafka, but to a different topic.

Figure 13.2 Components of the example application for this chapter include a Java application that simulates user traffic, a Spark Streaming application that analyzes incoming access logs and calculates various statistics, and a web application (including the JavaScript code running in the client browser) to display the calculated statistics in real-time graphs. All three communicate through Kafka.


The Log Analyzer component calculates statistics and sends the results to the second Kafka topic in this format:

<timestamp>:(key1->value1,key2->value2,...)

The timestamp is a long number containing the number of milliseconds since Jan. 1, 1970 (the standard representation of time in Java and several other languages). Table 13.2 shows the possible keys and their meanings. It's assumed that every banner click takes the user to a URL of the format /ads/<ad category>/<ad id>/clickfw, logs the click event, and redirects the user to the corresponding partner site. Keys are encoded as short codes in order to save some network bandwidth, which is a good practice.

Table 13.2 Statistics, descriptions, and matching keys calculated by Log Analyzer to format messages sent to Kafka

■ Active sessions (key SESS): Number of active sessions. Each request has an associated session ID. The number of active sessions is the number of unique session IDs whose last request occurred less than some fixed number of seconds ago. This fixed number of seconds is the session timeout parameter, which should be equal to the application server's session timeout.
■ Requests (key REQ): Number of requests per second, calculated based on timestamps parsed from log file entries.
■ Errors (key ERR): Number of error response codes (400 and higher) per second.
■ Ads 1 (key AD#1): Number of clicks per second on banners of category 1.
■ Ads 2 (key AD#2): Number of clicks per second on banners of category 2.
■ Ads 3 (key AD#3): Number of clicks per second on banners of category 3.
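Putting the format and the keys together, a single message produced by Log Analyzer might look like the following (the timestamp and counts are made-up values):

1460490000000:(SESS->10,REQ->20,ERR->1,AD#1->2,AD#2->1)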

You can easily extend this application and track other statistics as well. You'll see how to do that in section 13.3.

Finally, Web Stats Dashboard reads the calculated statistical data from Kafka again and sends it to client browsers over WebSockets to display on real-time graphs using the D3.js JavaScript library. Additionally, the JavaScript code running in the browser aggregates and sorts the statistics per timestamp (in case the data comes in out of order).

13.2 Running the application

Now that you understand what the application does, you're ready to start the components and see them in action. You have two main options:
■ You can use the scripts and the Docker image (for the application server) that we prepared for you, and run all the components on a single machine, as described in section 13.2.1. The spark-in-action VM already has Spark, Kafka, and Docker installed. Using the scripts in the spark-in-action VM is the easier option for getting everything up and running quickly.


■ You can start the application components manually if you have your own Kafka, Spark, and application server installations, as described in section 13.2.2.
Of course, you could use a combination of the two approaches and run some of the components using our scripts and/or the Docker image, and some of the components on your own. But we recommend that you use the VM, because that way you can be sure everything will work as planned and that you won't have any version clashes and incompatibilities.

13.2.1 Starting the application in the spark-in-action VM

We prepared a Docker image and a set of bash scripts to run the application components in the spark-in-action VM. The scripts make it easy for you to set up the environment and run the application, but they run all of the components on a single machine (your VM) with minimal capacity. To use more resources, you can manually run the application components on different machines and/or using your own Spark cluster (see section 13.2.2).

To get the images for running the application components, first clone the uc1-docker project into your home directory from our GitHub repository using git:

$ git clone https://github.com/spark-in-action/uc1-docker

Then make all the scripts in the folder executable:

$ cd uc1-docker
$ chmod +x *.sh

The uc1-docker folder contains the following scripts for starting the application:
■ start-kafka.sh—Starts ZooKeeper and Kafka in the background
■ start-dashboard.sh—Downloads the Web Stats Dashboard web application, builds the sia-dashboard Docker image containing the IBM WebSphere Liberty Profile application server, and deploys and starts the web application
■ start-spark.sh—Downloads the Log Analyzer JAR file and submits it to a local Spark cluster
■ start-simulator.sh—Downloads and runs the Log Simulator Java process, which generates web-traffic log entries and sends them to Kafka
All the scripts (except start-simulator.sh) should be run in the spark-in-action VM, because they use its IP address (192.168.10.2) and Spark and Kafka install locations. To start the application components, run the scripts in this order:

$ ./start-kafka.sh
$ ./start-dashboard.sh
$ ./start-spark.sh

Or use the run-all.sh script, which will call all of these scripts consecutively:

$ ./run-all.sh


The scripts start ZooKeeper, Kafka, the web application, and Spark. It will take some time to download all the archives and build the sia-dashboard Docker image. This image is based on the websphere-liberty:webProfile7 image from the Docker Hub (https://hub.docker.com/_/websphere-liberty), which contains an IBM WebSphere Liberty Profile application server. The sia-dashboard image automatically downloads, installs, and runs the Web Stats Dashboard web application.

The last script starts Spark in the foreground, which means you'll see the results directly in the console. You can stop the Spark streaming job by pressing Ctrl-C.

To see if the web application has started, open your browser and go to the following URL: http://192.168.10.2/WebStatsDashboard. If you don't get any response, or if you see a "Context Root Not Found" message, wait a minute or two and then try again. You should finally see a screen similar to the one in figure 13.3. If the page looks broken, the problem could be in your ad-blocking software. Please disable it if you're using one. If you don't see any messages in the box on the right, see the next section for troubleshooting.


Figure 13.3 Web Stats Dashboard immediately after starting the application components, before starting Log Simulator


The messages contain only zeros for now, and the lines in the graphs are flat. In order to see some activity, you'll need to start Log Simulator. You can use the start-simulator.sh script for that; but because you left Spark in the foreground, open another shell and position yourself in the uc1-docker folder. If you run the script with the -help argument, it shows all the possible options you can specify; they're listed in table 13.3.

Table 13.3 Arguments for the start-simulator.sh script

■ -brokerList=HOST1:PORT1,...: Comma-separated list of Kafka broker host:port pairs (default is 192.168.10.2:9092).
■ -topic=NAME: Optional. Name of the Kafka topic for sending log messages (default is webstats).
■ -usersToRun=NUM: Optional. Number of users to simulate (default is 10).
■ -runSeconds=NUM: Optional. Number of seconds to run and simulate user traffic (default is 120).
■ -thinkMin=NUM: Optional. Minimum think time between simulated actions, in seconds (default is 5).
■ -thinkMax=NUM: Optional. Maximum think time between simulated actions, in seconds (default is 10).
■ -silent: Optional. Suppresses all messages.
■ -help: Shows help, and exits.

If you call the start-simulator.sh script with no arguments, it runs 10 simulated users for 2 minutes (120 seconds) and sends those messages to the Kafka topic weblogs. You can send it to the background by adding an ampersand (&) at the end of the command line:

$ ./start-simulator.sh &

You can start several processes like this to generate even more traffic. After starting Log Simulator, you should see some activity in the graphs, similar to what is shown in figure 13.4.

STOPPING THE APPLICATION
If you wish to stop the application, it's best to stop the components in the opposite order from the one in which they were started. To stop Log Simulator, you kill the appropriate process or press Ctrl-C if it's running in the foreground. It's the same for the Spark job. To stop the Web Dashboard application, you can use the stop-dashboard.sh script. It will kill the appropriate Docker container. If you wish to remove the Docker images from your system (or from the VM), you need to list the Docker images:

$ docker images


Figure 13.4 Activity on Web Stats Dashboard’s graphs after starting two instances of Log Simulator. The upper graph shows that the number of users increased from 30 to 60 at one point. The lower graph shows that the requests and errors generated and ads clicked by the simulated users also increased.

Find the image ID of the Docker image you wish to remove, and use the ID in the docker rmi command:

$ docker rmi --force <image_id>

Finally, to stop Kafka and ZooKeeper, run the stop-kafka.sh script.

TROUBLESHOOTING THE APPLICATION
If you don't see any messages or activity on the web page after starting the components, you can check a couple of things. First, see whether the Spark Streaming Log Analyzer is working as expected. You can execute the consume-messages.sh script, which starts a Kafka console consumer, reads messages from the stats topic (to which Log Analyzer should be writing), and writes them to standard output. If consume-messages.sh doesn't output any messages, examine the output from the Spark streaming job. You should be able to diagnose the problem using the job's output. If Log Analyzer is producing messages but you still don't see them on the web page, you can examine the web application's log using the show-dashboard-log.sh script, and try to correct the problem.

Note that the time scale on the X-axis of the graphs depends on the date and time of the machine where your browser is running, which may be different from the date and time of the machine where the application is running. Log Simulator generates log events using its machine's local time.


If there's a large difference between the two times, the data may not be displayed in the graph, because it may fall outside of the X-axis' time range. So, the time offset on the two machines is another thing to check (if you're using two machines).

Finally, there's a known issue with the Kafka version installed in the spark-in-action VM. If you stop Spark and Kafka and then start them again, you may see the message "Kafka scheduler has not been started" (the issue is reported here: https://issues.apache.org/jira/browse/KAFKA-1724). The solution is to stop and start Spark and Kafka one more time, after which the application should work. If you're still experiencing problems, you can report them to us using the book's forum at https://forums.manning.com/forums/spark-in-action.

13.2.2 Starting the application manually

If you want to use your existing ZooKeeper and Kafka installations, your existing Spark cluster to run Log Analyzer, or your existing application server to run Web Stats Dashboard, you can manually install, configure, and run the required components as described in this section.

OBTAINING THE ARCHIVES
First, you need to obtain the component archives. You can build them yourself, or you can download them from our GitHub repository. The archive locations are as follows:
■ http://mng.bz/8uuF
■ http://mng.bz/QJvi
■ http://mng.bz/Ak6K
■ The projects with the source files are in our repository's ch13 folder.

CREATING THE KAFKA TOPICS
You'll need to create two Kafka topics: one for log events and one for statistics. We named them weblogs and stats, but you can use different names. In our scripts, we used a replication factor of 1, and one partition for each topic. To use different values, use this command (you need to start ZooKeeper and Kafka before running the command):

$ kafka-topics.sh --create --topic <topic_name> --replication-factor <replication_factor> \
  --partitions <number_of_partitions> --zookeeper <zookeeper_host>:2181

STARTING LOG ANALYZER
After you create the topics, you can start the Log Analyzer job. You submit the job to Spark as usual, specifying the StreamingLogAnalyzer class and at least two additional arguments (brokerList and checkpointDir):
■ brokerList depends on your Kafka installation and should contain a comma-separated list of the Kafka broker IP addresses and ports.
■ checkpointDir is a URL pointing to a directory (local or HDFS) for storing Spark checkpoint data (for details, see section 4.4.3).


The command looks like this:

$ spark-submit --master <your_master_url> \
  --class org.sia.loganalyzer.StreamingLogAnalyzer \
  streaming-log-analyzer.jar \
  -brokerList=<kafka_host>:<port> \
  -checkpointDir=hdfs://<namenode_host>:<port>/<checkpoint_directory>

Additionally, you can specify several optional parameters:
■ inputTopic—Name of the input topic with log events
■ outputTopic—Name of the output topic for writing statistics
■ sessionTimeout—How many seconds should pass since the last request before a session is considered timed out
■ numberPartitions—Number of RDD partitions to use during statistical calculations

STARTING WEB STATS DASHBOARD
Web Stats Dashboard is a Java web application, so you can use any Java application server supporting WebSockets to run it. We chose IBM WebSphere Liberty Profile. Whichever server you choose, you need to set two Java system variables:
■ zookeeper.address—ZooKeeper's host name and port, which are used for connecting to Kafka
■ kafka.topic—Topic name for reading statistical messages
The URL to the application depends on the chosen application server and your installation method. If you change the default URL (/WebStatsDashboard), make sure to change it in the webstats.js file, too. As soon as you visit the page, if everything was set up correctly, the application will start consuming messages from Kafka and displaying them in the graphs. Your application server's System Out log file should contain the following entries:

LogStatsReceiver getting consumer
LogStatsReceiver getting KafkaStream
LogStatsReceiver iterating

The result on the screen should be the same as when using the VM. If not, you'll need to examine log files from the Spark job and the web application server. As we explained in the previous section, it's important that all machines running your components have synchronized clocks. Otherwise, data may display with an offset.

STARTING LOG SIMULATOR
Once all the components are running, you can start Log Simulator, as explained previously. When you're starting everything manually, provide the address of the Kafka broker (probably localhost:9092):

$ ./start-simulator.sh --brokerList=<kafka_host>:<port>

For the full list of arguments, see table 13.3.


13.3 Understanding the source code

If you're curious about how the application works, this is the moment when you learn the answer. The ch13 folder in this book's GitHub repository (https://github.com/spark-in-action/first-edition/tree/master/ch13) contains the three required projects:
■ KafkaLogsSimulator
■ StreamingLogAnalyzer
■ WebStatsDashboard
We examine these projects in this section.

13.3.1 The KafkaLogsSimulator project

KafkaLogsSimulator is a pure Java application consisting of only two classes: LogSimulator and IPPartitioner. This is all straightforward and not Spark-specific, so we won't spend much time on it.

LogSimulator is an executable Java class (it has a main method) and extends the Thread class, which means it can be run as a thread. The main method takes the arguments described in table 13.3. It then spawns the required number of LogSimulator threads and waits for all of them to finish. Each thread has its own IP address and a session ID (generated in the main method), which remain constant throughout the life of the thread.

Each thread creates access-log entries containing the current time, IP address, session ID, URL, HTTP method, response code, and response duration in milliseconds. The response code that Log Simulator generates is 404 (not found) in 2% of cases and 200 (OK) otherwise. The session URL (in the format /ads/<ad category>/<ad id>/clickfw) represents a click to an ad in 3% of cases and a forward slash (/) otherwise. Log Simulator uses only three ad categories (1, 2, and 3), each of which has an equal probability of appearing. The constructed message is then sent to Kafka as a KeyedMessage (line 191):

KeyedMessage<String, String> data =
    new KeyedMessage<String, String>(TOPIC_NAME, ipAddress, message);
producer.send(data);

The IP address is used as the partitioning key. IPPartitioner, used for constructing the producer, partitions messages based on the lowest octet of the IP address key (the number after the last dot in the address).
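As a rough sketch of that partitioning idea (shown here in Scala, although the actual IPPartitioner class is Java), the partition could be derived from the lowest octet like this; the method name and the modulo mapping are assumptions for illustration:

def partitionFor(ipAddress: String, numPartitions: Int): Int = {
  // take the number after the last dot and map it onto the available partitions
  val lowestOctet = ipAddress.substring(ipAddress.lastIndexOf('.') + 1).toInt
  lowestOctet % numPartitions
}

partitionFor("192.168.0.123", 4)  // 123 % 4 = 3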

13.3.2 The StreamingLogAnalyzer project

StreamingLogAnalyzer is a Scala project containing only one file with two classes: StreamingLogAnalyzer and KafkaProducerWrapper. Figure 13.5 shows their relationship. As explained in chapter 6, StreamingLogAnalyzer uses KafkaProducerWrapper to open a single connection per partition to Kafka. In this way, multiple tasks on a single partition can reuse the same connection.


Figure 13.5 An instance of StreamingLogAnalyzer running in a Spark cluster with three executors, each running three Spark Streaming Log Analyzer tasks within the matching partitions. Using a single Kafka producer and a single connection, each task reads data from the weblogs Kafka topic, calculates the statistics, and writes them to the stats topic.

These classes are the core of the example application. We'll analyze the source code thoroughly in this section; it's available in the single file at http://mng.bz/NSOr.

INITIALIZING THE SPARK CONTEXT
StreamingLogAnalyzer takes several arguments, which we described in section 13.2.2. As we said, the two required arguments are brokerList and checkpointDir. After argument validation, the Spark streaming context is initialized:

println("Starting Kafka direct stream to broker list: "+brokerList.get)
val conf = new SparkConf().setAppName("Streaming Log Analyzer")
val ssc = new StreamingContext(conf, Seconds(1))

This is the standard way of streaming context initialization. The batch duration of 1 second is important, as you'll see later.

The updateStateByKey functionality, which you use to keep track of active sessions, requires checkpointing, so you set the checkpoint folder:

ssc.checkpoint(checkpointDir.get)


INITIALIZING THE KAFKA INPUT STREAM
Next, the Kafka stream is initialized using the KafkaUtils helper class. You need to pass to KafkaUtils a list of Kafka brokers as the metadata.broker.list parameter:

val kafkaReceiverParams = Map[String, String](
  "metadata.broker.list" -> brokerList.get)
val kafkaStream = KafkaUtils.
  createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaReceiverParams, Set(logsTopic.get))

The createDirectStream method is parameterized with key and value types (both strings) and their decoders (both StringDecoders). The LogLine case class is used to store parsed access-log lines, and the df SimpleDateFormat object is used for parsing the date and time:

case class LogLine(time: Long, ipAddr: String, sessId: String, url: String,
  method: String, respCode: Int, respTime: Int)
val df = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss.SSS")

PARSING ACCESS-LOG LINES
The first step in analyzing access logs is to read the access-log lines coming in as Kafka messages and parse them into LogLine objects:

val logsStream = kafkaStream.flatMap { t => {
  val fields = t._2.split(" ")
  try {
    List(LogLine(df.parse(fields(0) + " " + fields(1)).getTime(),
      fields(2), fields(3), fields(4), fields(5),
      fields(6).toInt, fields(7).toInt))
  } catch {
    case e: Exception => {
      System.err.println("Wrong line format: "+t)
      List()
    }
  }
}}

If a line doesn't conform to the expected format, a message is printed to the standard error output and the line is discarded.

COUNTING ACTIVE SESSIONS
To count all active sessions, you need to find when the last request happened for each session ID. So, you first map LogLine objects to tuples with the session ID as the key and the time as the value. Then, using reduceByKey, you find the maximum time per session (in case several requests with the same session ID arrived during this batch):

val maxTimeBySession = logsStream.map(r => (r.sessId, r.time)).
  reduceByKey((max1, max2) => {
    Math.max(max1, max2)
  })


The RDDs in the maxTimeBySession DStream now contain one tuple per session ID. You can use the updateStateByKey function to maintain state during the lifetime of your streaming application (for details, see chapter 6). Here, you use it to keep track of all session IDs whose requests happened at most SESSION_TIMEOUT_MILLIS milliseconds ago. You'll "forget" all session IDs with no requests during that timeframe.

updateStateByKey gives you the set of new values from the current batch (the maxTimeNewValues variable, which always has a size of 1 in this case because of the previous reduceByKey) and the maintained state for that key (the maxTimeOldState variable). For each key, you can use updateStateByKey to delete the key by returning None or to change the key's state value by returning the new value:

val stateBySession = maxTimeBySession.updateStateByKey(
  (maxTimeNewValues: Seq[Long], maxTimeOldState: Option[Long]) => {
    if (maxTimeNewValues.size == 0) {
      if (System.currentTimeMillis() - maxTimeOldState.get >
          SESSION_TIMEOUT_MILLIS)
        None
      else
        maxTimeOldState
    }
    else if (maxTimeOldState.isEmpty)
      Some(maxTimeNewValues(0))
    else
      Some(Math.max(maxTimeNewValues(0), maxTimeOldState.get))
  })

For each key (session ID), only the last (maximum) request time is retained. If there are no new requests for a session ID (maxTimeNewValues is empty), you need to check whether the session has expired by comparing the last request time with the current time. If it has expired, you delete it from the state by returning None; you leave the old maximum request time otherwise. If the session ID doesn't have any state yet (this is the first request for that session ID), you return the new request's time. And if the session ID has both a new request and an old state, you return the larger of the two.

Finally, you count all the elements in the stateBySession DStream to get the current count of active sessions:

val sessionCount = stateBySession.count()

COUNTING REQUESTS PER SECOND
Before counting requests, errors, and ad clicks per second, you need to map LogLine objects in the logsStream DStream to key-value tuples, where the key is the second (the request's time with milliseconds removed):

val logLinesPerSecond = logsStream.map(l => ((l.time / 1000) * 1000, l))


Now that you have (second, LogLine) tuples in the logLinesPerSecond DStream, you need to count them by key to get the number of requests per second. Because DStream doesn't have a countByKey method, you use combineByKey for that:

val reqsPerSecond = logLinesPerSecond.combineByKey(
  l => 1L,
  (c: Long, ll: LogLine) => c + 1,
  (c1: Long, c2: Long) => c1 + c2,
  new HashPartitioner(numberPartitions),
  true)

COUNTING ERRORS PER SECOND
Counting errors per second is done the same way as counting requests, except that requests first need to be filtered to include only errors. To do that, you check whether the response code starts with 4 or 5:

val errorsPerSecond = logLinesPerSecond.filter(l => {
  val respCode = l._2.respCode / 100
  respCode == 4 || respCode == 5
}).combineByKey ...

The combineByKey part is identical to what you did to count requests.

COUNTING AD CLICKS PER SECOND
To count ad clicks, not only do you need to filter requests based on their URLs, but you also need to find the ad category that the user clicked to display it on the dashboard. You assume that the web application whose logs you are analyzing accepts ad clicks at URLs of this format: /ads/<ad category>/<ad id>/clickfw. The web application then redirects the user to the appropriate partner website.

As we said, you need to parse the ad category. You use the following regular expression to accomplish that:

val adUrlPattern = new Regex(".*/ads/(\\d+)/\\d+/clickfw", "adtype")

Because you need to find the ad category too, you can't use filter as you did previously. You flatMap logLinesPerSecond to key-value tuples, where the keys are tuples of the timestamp (second) and ad category. If the request URL doesn't match the regular expression adUrlPattern, the flatMap function removes that element from the results:

val adsPerSecondAndType = logLinesPerSecond.flatMap(l => {
  adUrlPattern.findFirstMatchIn(l._2.url) match {
    case Some(urlmatch) =>
      List(((l._1, urlmatch.group("adtype")), l._2))
    case None =>
      List()
  }
}).combineByKey ...


COMBINING ALL STATISTICS
Now you have four DStreams with different statistics: sessionCount, reqsPerSecond, errorsPerSecond, and adsPerSecondAndType. But you need to display them all in a single dashboard. You could send them to Kafka separately, thus sending four messages for a single timestamp (second), or you could combine all the statistics and send only one message per second.

How can you combine the statistics? Well, you can put the different statistics under different keys in a single Map per timestamp. That's exactly what you're going to do. You first need to define the keys:

val SESSION_COUNT = "SESS"
val REQ_PER_SEC = "REQ"
val ERR_PER_SEC = "ERR"
val ADS_PER_SEC = "AD"

Then you map each statistic count value to a Map with the statistic under the appropriate key. The corresponding DStream also has to be keyed by timestamp. Because reqsPerSecond and errorsPerSecond are already keyed by timestamp, you just map the values:

val requests = reqsPerSecond.map(sc => (sc._1, Map(REQ_PER_SEC -> sc._2)))
val errors = errorsPerSecond.map(sc => (sc._1, Map(ERR_PER_SEC -> sc._2)))

The sessionCount DStream, however, isn't keyed by timestamp. It contains only one count per batch (per DStream's RDD), so you also create the key with the current timestamp with milliseconds removed:

val finalSessionCount = sessionCount.map(c =>
  ((System.currentTimeMillis / 1000) * 1000, Map(SESSION_COUNT -> c)))

By the way: this is why the Spark batch duration of 1 second is important. You want to generate a single statistic per second. The granularity of other statistics doesn't depend on the batch duration, because their timestamps are taken from log files and always aggregated per second. For the session count, you generate the key yourself. If the batch duration were set to 2 seconds, for example, you'd generate a session-count statistic only for every other second. Then the final graph, showing active sessions, would oscillate unrealistically (it would have a gap every other second). One possible solution is to fill in the missing values when displaying the graph or change the way the graph is displayed. Another solution is to fix the batch size to 1 second. We opted for the latter.

As for the adsPerSecondAndType DStream, it's keyed by timestamp and ad category. So, you map its keys to keep only the timestamp part and use the ad category for the key in the value Map:

val ads = adsPerSecondAndType.map(stc =>
  (stc._1._1, Map(s"$ADS_PER_SEC#${stc._1._2}" -> stc._2)))


Thus, all the ad keys in the Map will begin with AD# and end with the ad category. Finally, you need to combine all the Map objects into a single Map per timestamp. You union them into a single DStream and reduce all value Maps per key into a single Map object:

val finalStats = finalSessionCount.union(requests).
  union(errors).
  union(ads).
  reduceByKey((m1, m2) => m1 ++ m2)
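After the unions and the final reduceByKey, each element of finalStats pairs a timestamp with a single Map holding all the statistics collected for that second. A hypothetical element (with made-up counts) might look like this:

// One element of the finalStats DStream for a single second
(1460490000000L, Map("SESS" -> 10L, "REQ" -> 20L, "ERR" -> 1L, "AD#1" -> 2L))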

SENDING THE RESULTS TO KAFKA
You're finally ready to send the statistics to Kafka. Here you have a challenge. Kafka Producer objects aren't serializable, because they need to open a connection to Kafka and maintain it. Therefore, you can't instantiate producers on the driver side and transfer them to the workers. You need to instantiate producers in the tasks running in Spark workers. You use the KafkaProducerWrapper class for that purpose.

KafkaProducerWrapper's companion object (companion objects hold the equivalent to static methods and fields in Java, as discussed in section 4.1.1) is used to instantiate a single, lazily instantiated instance of KafkaProducerWrapper, which itself instantiates one instance of a Kafka Producer. KafkaProducerWrapper is parameterized by a list of Kafka brokers:

case class KafkaProducerWrapper(brokerList: String) {
  val producerProps = {
    val prop = new Properties
    prop.put("metadata.broker.list", brokerList)
    prop
  }
  val p = new Producer[Array[Byte], Array[Byte]](
    new ProducerConfig(producerProps))
  def send(topic: String, key: String, value: String) {
    p.send(new KeyedMessage(topic, key.toCharArray.map(_.toByte),
      value.toCharArray.map(_.toByte)))
  }
}
object KafkaProducerWrapper {
  var brokerList = ""
  lazy val instance = new KafkaProducerWrapper(brokerList)
}

Using the KafkaProducerWrapper class and the RDD’s foreachPartition function, you can instantiate a single Kafka Producer per JVM (multiple partitions can share the same executor’s JVM; they would all use the same static instance) and use it to send messages to Kafka:

finalStats.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    KafkaProducerWrapper.brokerList = brokerList.get
    val producer = KafkaProducerWrapper.instance
    partition.foreach {
      case (s, map) =>
        producer.send(statsTopic.get, s.toString,
          s.toString + ":(" +
          map.foldLeft(new Array[String](0)) {
            case (x, y) => { x :+ y._1 + "->" + y._2 }
          }.mkString(",") + ")")
    }//foreach
  })//foreachPartition
})//foreachRDD

Using Scala Map's foldLeft function, the message is formatted as timestamp:(key1->value1,key2->value2,...). The timestamp is used as the Kafka topic's partitioning key. The only thing left to do now is to start the Spark Streaming application and wait for its termination:

println("Starting the streaming context... Kill me with ^C") ssc.start() ssc.awaitTermination()

13.3.3 The WebStatsDashboard project

The WebStatsDashboard project is a web application that reads messages from Kafka and sends them to the clients' browsers over WebSockets. Using the D3.js library, the JavaScript code running in the clients' browsers displays the statistics in real-time graphs. The WebStatsDashboard project consists of four main files:
■ LogStatsObserver.java—Interface denoting classes that receive messages from LogStatsReceiver when they arrive
■ LogStatsReceiver.java—Singleton that reads messages from Kafka and dispatches them to all LogStatsObservers
■ WebStatsEndpoint.java—WebSockets endpoint for managing WebSockets connections and for forwarding messages to clients
■ webstats.js—JavaScript code for receiving statistics from the WebSockets endpoint and displaying that data using the D3.js library
The LogStatsReceiver thread continuously receives messages from a Kafka topic and dispatches them to the registered LogStatsObservers. It functions as a singleton object and starts receiving only when at least one LogStatsObserver is registered. It stops receiving when the last LogStatsObserver deregisters.

The WebStatsEndpoint is a WebSockets endpoint used to connect clients, but it's also a LogStatsObserver. It forwards messages received from the LogStatsReceiver to its clients.

webstats.js is a JavaScript application based on the Multi-Series Line Chart D3.js example (http://bl.ocks.org/mbostock/3884955).


It connects to the WebStatsEndpoint and starts displaying the messages. This book isn't about JavaScript, so we won't explain the code in the webstats.js file.

13.3.4 Building the projects

If you want to build the projects on your own, you'll need to have Maven and Java installed. For each project, we've included a Maven pom.xml file in the root of the project folder. Just issue a maven install command from the root folder, and find the resulting kafka-logs-simulator.jar, streaming-log-analyzer.jar, or WebStatsDashboard.war file in the target subfolder.

13.4 Summary
■ The use case in this chapter is about analyzing click-stream data coming from access log files.
■ The application consists of a Log Simulator application that simulates user activity on a website, Log Analyzer (a Spark Streaming application) to analyze log data and produce statistics, Kafka topics to exchange messages between components, and a Web Stats Dashboard application to consume the statistics. Web Stats Dashboard also sends statistics over WebSockets to client browsers to be displayed in real-time graphs using the D3.js library.
■ The statistics tracked are active user sessions, requests per second, errors per second, and clicks on ads per second and per ad category.
■ We prepared several scripts and a Docker image to run the application components: ZooKeeper, Kafka, Spark, and a WebSphere Liberty Profile application server to run the Web Dashboard web application.
■ Using a set of available scripts, you can easily start, stop, and troubleshoot the application.
■ You can also manually start individual components on your own infrastructure.
■ The source code is organized into three projects: KafkaLogsSimulator, StreamingLogAnalyzer, and WebStatsDashboard.

Deep learning on Spark with H2O

This chapter covers
■ Introduction to H2O
■ Introduction to deep learning
■ Starting an H2O cluster on Spark
■ Building and evaluating a regression deep-learning model using Sparkling Water
■ Building and evaluating a classification deep-learning model using Sparkling Water

Deep learning is a hot topic in the machine-learning world today. We could say that there's a deep-learning revolution going on. Deep learning is a general term denoting a family of machine-learning methods characterized by the use of multiple processing layers of nonlinear transformations. These layers are almost universally implemented as neural networks. Although the core principles aren't new, a lack of computing power and efficient algorithms prevented those principles from being further developed in the previous decades. This has changed in recent years, with many advances in



deep-learning algorithms and their successful applications. One of the many recent breakthroughs is the DeepID system for learning high-level features,1 which is capable of recognizing tens of thousands of faces with a close-to-human accuracy of 97.45% (unlike its accuracy, its capacity is obviously superhuman). Likewise, several software frameworks for doing deep learning have sprung up in recent years. In this chapter, we'll give you an overview of the H2O framework, which seamlessly integrates with Spark and coexists with it nicely. You'll create an H2O cluster and use the housing prices dataset from chapter 7 and the adult dataset from chapter 8 to perform regression and classification using deep learning (H2O neural networks). You'll train deep-learning models through H2O's UI, and we'll show you how to do the same thing using the H2O API from Spark. The datasets will be kept small so that you can use them even without a large Spark cluster, but you can use the same techniques to build deep-learning models using larger datasets.
14.1 What is deep learning?
Deep learning can help you model complex nonlinear relationships in the data and build powerful regression and classification models. Deep learning is an area of machine learning based on usage of artificial neural networks for learning high-level features of the underlying data. Artificial neural networks (ANNs) are modeled according to biological neural networks and consist of artificial neurons (nodes) organized in layers. Each neuron's output depends on outputs from several neurons in the previous layer, on a set of learned weights, and on an activation function, which combines inputs and weights to produce a single output value. A single neuron functions similarly to the logistic regression model described in chapter 8 and learns the values of its input weights through the process of training the ANN. The complete ANN, consisting of many tens, hundreds, or thousands of neurons, is capable of modeling complex, nonlinear functions. Each ANN has an input layer, used to feed the feature values of input examples into the network; an output layer, to output the results; and one or more hidden layers (layers that aren't input or output layers). A deep neural network (DNN) is an ANN with two or more hidden layers. An illustration of an ANN with three layers is shown in figure 14.1. The output from each neuron is a value between 0 and 1, and each neuron calculates its output using the activation function (usually the logistic function we introduced in chapter 8). The input to the activation function is the sum of its inputs multiplied by their weight parameters. For example, the output of neuron h_k equals the following (assuming the logistic function is used)

h_k = \frac{1}{1 + e^{-\sum_{j=1}^{3} i_j w_{jk}}}

where i_j is the output from neuron i_j and w_{jk} is the j-th input weight of neuron h_k.
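To make this concrete, here's a minimal Scala sketch of that calculation (this is just illustrative code, not part of Spark or H2O, and the input and weight values are made up):

def neuronOutput(inputs: Seq[Double], weights: Seq[Double]): Double = {
  // weighted sum of the inputs, followed by the logistic (sigmoid) activation
  val weightedSum = inputs.zip(weights).map { case (i, w) => i * w }.sum
  1.0 / (1.0 + math.exp(-weightedSum))
}

neuronOutput(Seq(0.5, 0.1, 0.9), Seq(0.4, -0.2, 0.7))  // returns approximately 0.69

No matter how large the weighted sum gets, the logistic function squashes it into a value between 0 and 1.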

1 Yi Sun et al., “Deep Learning Face Representation from Predicting 10,000 Classes,” http://mng.bz/W01w.


Figure 14.1 An artificial neural network consisting of three layers: an input layer (neurons i1 to i3), a hidden layer (neurons h1 to h5), and an output layer (neurons o1 and o2). Each neuron's output depends on the outputs from neurons in the previous layer and on a set of weights, learned through the process of training the ANN.

Weights of individual nodes are learned through processes of forward and backward propagation. For each example, forward propagation uses already-learned weights to calculate the output values and the error between calculated and real output values. Backward propagation then goes from the last layer toward the first (for each example) and changes the weights based on a measure of how much each neuron participated in the overall error. The process then repeats with another wave of forward and backward propagations until convergence. Each layer in an ANN learns features on different levels of abstraction. In ANNs built for recognizing faces, the first layer may learn to recognize low-level features, such as edges and their directions. The second layer may learn to recognize basic shapes based on edge features from the previous layer. The third layer may recognize eyes, noses, and so on. This description is just an illustration; interpreting what specific layers learn in this way is very difficult, and ANNs function mostly like black boxes. Mathematical foundations for ANNs were laid down in the 1940s, and the first ANN machines appeared in the 1950s. Although neglected for some time during the '80s and '90s, they're seeing wide adoption again today because of advances in computational capacity and the efficiency of learning algorithms and distributed computing. Neural networks and deep-learning methods are outperforming traditional machine-learning algorithms today in a wide range of application areas, such as image analysis and pattern recognition, speech analysis, decision making, robotics, autonomous driving, and so on.
14.2 Using H2O with Spark
In this section, you'll see how H2O can help you do deep learning on Spark. We'll first give you a short introduction to H2O and Sparkling Water, H2O's Spark package that enables integration between H2O and Spark. Then you'll start an H2O cluster from Spark, using the Sparkling Water API. After that, you'll get acquainted with the H2O Flow UI, its graphical interface.


14.2.1 What is H2O?
H2O is a fast, open source, machine-learning platform. It's backed by the H2O.ai company (http://www.h2o.ai, previously called 0xdata), which was cofounded in 2011 by SriSatish Ambati and Cliff Click, the lead developer of the JVM's just-in-time (JIT) compiler. Arno Candel serves as the chief architect. H2O is an excellent machine-learning platform with basic data-transformation capabilities (fast parse, columnar transformations and joins), but it isn't a general computing engine like Spark. Using them together, you can get the best of both worlds. With Spark, you can read and join data from various sources, parse the data, transform it, transfer it to H2O, and then build and use the H2O machine-learning models. In this way, H2O and Spark can nicely complement each other. H2O algorithms overlap somewhat with Spark's machine-learning algorithms. Although H2O has a few algorithms that Spark still doesn't have (deep learning and generalized low-rank modeling), Spark has several algorithms not available in H2O, such as alternating least squares (ALS), latent Dirichlet allocation, SVMs, tf-idf, Word2vec, and others. H2O supports these machine-learning algorithms:
■ Deep learning—Algorithm used in this chapter.
■ Generalized linear modeling—Generalization of ordinary linear regression (explained in chapter 7) to also cover other types of regressions (such as elastic net regularized logistic regression). Can be used for both classification and regression.
■ Generalized low-rank modeling—New machine-learning approach for reconstructing missing values and identifying important features in heterogeneous data (http://mng.bz/w4OJ).
■ Distributed random forest—Regression and classification algorithm, covered in chapter 8.
■ Naïve Bayes—Probability-based multiclass classification algorithm, also available in Spark.
■ Principal component analysis—Algorithm for dimensionality reduction, also available in Spark.
■ K-means—Clustering algorithm, explained in chapter 8.
■ Gradient-boosting machines—Form of ensemble method, using decision trees as the base learner, similarly to random forests. Can be used for both classification and regression.

14.2.2 Starting Sparkling Water on Spark Sparkling Water is the integration layer between Spark and H2O and is implemented as a Spark package. You can use it to start an H2O cluster in a Spark cluster and exchange data between the two. This is shown in figure 14.2.


Figure 14.2 Sparkling Water architecture. The Spark driver holds a Spark context and an H2O context (Sparkling Water); H2O nodes are running in the Spark executors, forming an H2O cluster.

When starting an H2O cluster in Spark, you first use a Spark context to start or connect to a Spark cluster as usual. You then use the Sparkling Water API to instantiate an instance of H2OContext, which starts H2O nodes in Spark executors. These nodes form an H2O cluster and are used to store compressed data and machine-learning models. Similar to Spark, you have several options for starting Sparkling Water:
■ Start the Sparkling Water shell, and start a new Spark cluster or connect to an existing one.
■ Start the Spark shell by specifying the Sparkling Water package (recommended method).
■ Embed Sparkling Water in your application, and submit it to a Spark cluster.
STARTING SPARKLING WATER WITH THE SPARKLING WATER SHELL
To start the Sparkling Water shell, you first need to build or download Sparkling Water for your version of Spark (www.h2o.ai/download). In the spark-in-action VM, Sparkling Water is already downloaded and unzipped in your home directory (/home/spark/sparkling-water-1.6.3). You also need to set the SPARK_HOME environment variable to point to the root directory of your Spark installation. The Spark version you use has to correspond to the version of Sparkling Water you use. At the time of writing, Sparkling Water for Spark 2.0 was still not available, so set this variable to the Spark 1.6.1 folder, which is also available in the spark-in-action VM:

$ export SPARK_HOME=/opt/spark-1.6.1-bin-hadoop2.6

You can also set the MASTER environment variable to the Spark master you wish to use. If you don't set it, the default is local[*]. For a local, single-JVM cluster, you can use this value:

$ export MASTER=local


For a local cluster with three executors, each with one core and 1024 MB of memory, you can use the following:

$ export MASTER=local-cluster[3,1,1024]

After you set the environment variables, position yourself in the folder where you extracted Sparkling Water, and start the Sparkling Water shell:

$ bin/sparkling-shell

-----
  Spark master (MASTER)     : local[3,1,1024]
  Spark home   (SPARK_HOME) : /usr/local/spark
  H2O build version         : 3.8.1.3 (turan)
  Spark build version       : 1.6.1
----
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_72)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala>

If your Spark master is a local cluster, the Sparkling Water shell will start it automatically. If you're connecting to YARN, Mesos, or a Spark standalone cluster, you'll need to specify the number of executors you wish to use. For a Spark standalone cluster, you need to specify the spark.executor.cores, spark.cores.max, and spark.executor.instances parameters. For YARN, only spark.executor.instances is needed. To connect to a Spark standalone cluster and use six executors, issue the following command:

$ bin/sparkling-shell --master spark:// \
  --conf "spark.cores.max=6" --conf "spark.executor.cores=1" \
  --conf "spark.executor.instances=6"

STARTING SPARKLING WATER WITH THE SPARK SHELL
You can download Sparkling Water automatically to your local Maven repository with the --packages option (download the sparkling-water-examples package too; you'll need it later in this chapter):

$ spark-shell --packages \
  ai.h2o:sparkling-water-core_2.10:1.6.3,\
  ai.h2o:sparkling-water-examples_2.10:1.6.3


You can also load Sparkling Water into a Spark shell by specifying the Sparkling Water assembly JAR with the --jars option. If you’re running in the spark-in-action VM as the spark user, spark-shell should already be in your path and Sparkling Water should be available from the sparkling-water-1.6.3 directory in your home directory:

$ spark-shell --jars \
  /home/spark/sparkling-water-1.6.3/assembly/build/libs/sparkling-water-assembly-1.6.3-all.jar

STARTING SPARKLING WATER FROM YOUR APPLICATION
Finally, to use Sparkling Water from your application, embed it in your fat JAR (by declaring it as a dependency in your packaging manager) or add it with the --jars or --packages option while submitting the application as usual.
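As a rough sketch (the application class and JAR names here are hypothetical placeholders, not something from this chapter's projects), submitting an application together with the Sparkling Water package could look like this:

$ spark-submit \
  --packages ai.h2o:sparkling-water-core_2.10:1.6.3 \
  --class com.example.MyH2OApp \
  my-h2o-app.jar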

14.2.3 Starting the H2O cluster No matter which method you use to start Sparkling Water, once you have your shell open, you can start the H2O cluster by creating and starting an instance of H2OContext:

scala> import org.apache.spark.h2o._
scala> val h2oContext = new H2OContext(sc).start()

The default number of H2O nodes that will be started is the value of the spark.ext.h2o.cluster.size Spark configuration parameter. If it’s not specified, the spark.executor.instances parameter will be used.
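For example, if you want to fix the cluster at three H2O nodes, you could set that parameter when starting the shell (just a sketch; adapt it to however you start Sparkling Water):

$ bin/sparkling-shell --conf "spark.ext.h2o.cluster.size=3"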

NOTE If running in a local, single-JVM cluster, Sparkling Water always starts a single H2O node.

If all goes well, you should see output similar to the following. In this example, the Sparkling Shell was started using six executors on a Spark Standalone cluster:

Sparkling Water Context:
 * H2O name: sparkling-water-_876327168
 * number of executors: 6
 * list of used executors:
  (executorId, host, port)
  ------------------------
  (4,,)
  (0,,)
  (5,,)
  (2,,)
  (1,,)
  (3,,)
  ------------------------

Open H2O Flow in browser: http://: (CMD + click in Mac OSX)

The H2O cluster has started, and you’re good to go.


14.2.4 Accessing the Flow UI
H2O features a web-based console called Flow UI, which enables you to quickly load the data, train models using the supported algorithms, use the trained models, and inspect the results. It also has a helpful assistant to guide you through possible actions. You can use the assistant's forms to manipulate data in the Flow UI, or you can directly enter Flow commands. If you visit the URL that was printed after the H2OContext was started, you'll land on the H2O Flow UI page shown in figure 14.3. The functioning of the Flow UI may look familiar if you've used Jupyter notebooks (http://jupyter.org) or Zeppelin (https://zeppelin.incubator.apache.org). The Flow UI manages notebooks called flows, which are web pages you can save and execute. Each flow has a name and a set of cells for executing commands and displaying the results. Similar to Jupyter notebooks and Zeppelin, cells can also contain markdown code, which can be used as portable documentation.
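As a quick preview, these are a few of the Flow commands you'll run later in this chapter (you can type them directly into a cell instead of going through the assistant's forms):

assist
getFrames
getFrameSummary "housing"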

Figure 14.3 The starting page of the H2O Flow UI, showing a flow called Untitled Flow with a single cell that displays the results of the assist command. You can manipulate cells through the action toolbar. (The figure labels the Flow UI application menu, the flow and cell action toolbar, the flow name, the cell command, the cell command results, and the helper panel.)

Performing regression with H2O’s deep learning 391

At the top of the Flow UI screen is the main Flow UI menu. You can use it to manipulate flows (create, load, save, run all cells in a flow, and so on), cells (run, cut, delete, move, and so on), data (import and split), and models (create, export, import, and so on). Through the Admin submenu, you can view running jobs, view cluster status and CPU usage, inspect logs, and access other advanced options (such as network test). And through the Help submenu, you can get information about using H2O. If you press the H key, the Flow UI presents a helpful pop-up showing keyboard shortcuts. Using the Flow UI is intuitive, and you should find your way around it easily. We'll get you on the right track in the next section, where you'll train several deep-learning models.
14.3 Performing regression with H2O's deep learning
In chapter 7, you used the UCI Boston housing dataset to predict the mean values of owner-occupied homes in suburbs of Boston. You'll use it again here to build a deep-learning model with H2O. By using the same dataset as in chapter 7, we'll be able to concentrate on the features in H2O and Sparkling Water without explaining the details of the dataset and regression, and you can compare the techniques and ease of use of Spark ML against H2O. After loading data into H2O, we'll show you how to build and evaluate a deep-learning model using the H2O Flow UI. Then we'll teach you how to do the same thing using the Sparkling Water API so that you can use H2O machine-learning algorithms from your Spark applications.
14.3.1 Loading data into an H2O frame
The first step is to load data into Sparkling Water and make it available to the H2O cluster as an H2O frame. H2O frames are similar to Spark DataFrames; implicit and explicit conversions from Spark DataFrames and RDDs to H2O frames, and vice versa, are available from the H2OContext class. To use the conversions, you need to import them from your H2OContext object:

import h2oContext._

The housing dataset is available from our online repository (http://mng.bz/o8e0). You can either import this file directly into an H2OFrame from this URL (we’ll show you how to do that later) or download it and parse it into a Spark DataFrame and then into an H2OFrame. In this section, you’ll do the latter. In the subsequent example, you’ll load it directly into H2O. You can use the following code to import the file into a Spark DataFrame. Just replace the path to housing.data and the number of partitions with your own (you can use three partitions for this small dataset):

val housingLines = sc.textFile("first-edition/ch07/housing.data", 3)
val housingVals = housingLines.map(x => x.split(",").map(_.trim().toDouble))
import org.apache.spark.sql.types.{StructType,StructField,DoubleType}


val housingSchema = StructType(Array(
  StructField("crim",DoubleType,true),
  StructField("zn",DoubleType,true),
  StructField("indus",DoubleType,true),
  StructField("chas",DoubleType,true),
  StructField("nox",DoubleType,true),
  StructField("rm",DoubleType,true),
  StructField("age",DoubleType,true),
  StructField("dis",DoubleType,true),
  StructField("rad",DoubleType,true),
  StructField("tax",DoubleType,true),
  StructField("ptratio",DoubleType,true),
  StructField("b",DoubleType,true),
  StructField("lstat",DoubleType,true),
  StructField("medv",DoubleType,true)
))
import org.apache.spark.sql.Row
val housingDF = sqlContext.applySchema(housingVals.map(Row.fromSeq(_)), housingSchema)

This code should be familiar to you by now. If not, please consult chapters 4 and 5. The column names and their meanings are explained in the housing.names file (http://mng.bz/k6kE). To convert the housingDF DataFrame to an H2O frame, use the asH2OFrame method of H2OContext:

val housingH2o = h2oContext.asH2OFrame(housingDF, "housing")

The second argument is optional and specifies the name of the frame. You should now be able to find the new frame in the Flow UI. Select Data > List All Frames from the main Flow UI menu (or click the same option in the Assist cell). This will execute the getFrames Flow command in a new cell, as shown in figure 14.4. By clicking the frame name, you get another cell executing the command getFrameSummary "housing", as shown in figure 14.5. The frame-summary cell shows basic column statistics, such as the number of zeros; the number of missing values;

Figure 14.4 Cell created after selecting the List All Frames menu action, showing all frames in the current flow. The housing frame was created using the Sparkling Water API.

Performing regression with H2O’s deep learning 393 minimum, mean, and maximum values and so on. The frame-summary cell also con- tains several action buttons: you can use them to preview frame data, download the data, split the frame, build a model from a frame, or use the frame’s data for predic- tions on an already-built model.

Figure 14.5 Frame summary displayed after clicking the frame name. It shows basic column statistics and action buttons.

If you click any of the columns, you'll get further information about the data it contains in the column summary cell. Among other things, it contains a graphical representation of the data distribution. The distribution graph for the Lstat column is shown in figure 14.6. Before building a model, you need to split the data into training and test sets. Go back to the housing frame summary cell by deleting the current cell (select the cell and press D twice) or by scrolling upward, and click the Split button. (You can also find all your cells in the Outline view on the right side of the screen.) Another cell appears (see figure 14.7), where you need to define ratios for frame splits. Accept the default ratios of 0.75 and 0.25, but rename the split frames (change their key names) to something recognizable (housing750 and housing250) and then click the Create button. The resulting frames will be displayed (see figure 14.8).

Figure 14.6 Distribution graph for the Lstat column's summary cell

Figure 14.7 Defining the frame's split ratios and the future frames' names

Figure 14.8 The H2O Flow cell showing the results of splitting the housing frame into two frame splits

The results in the rest of the chapter are highly dependent on the exact frames used for training and testing. If you’d like to get results similar to ours, you can download the splits we used from our online repository (housing750.csv and housing250.csv from http://mng.bz/IS40). See section 14.4.1 on how to load files into H2O directly, or load them by parsing them into SQL DataFrames, as you did in this section.
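If you do download the split files, a minimal sketch of loading them directly into H2O frames might look like the following (the local paths are placeholders; point them at wherever you saved the files):

val housing750 = new H2OFrame(new java.net.URI("first-edition/ch14/housing750.csv"))
val housing250 = new H2OFrame(new java.net.URI("first-edition/ch14/housing250.csv"))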

14.3.2 Building and evaluating a deep-learning model using the Flow UI
You'll now use the two new frames to build and evaluate a deep-learning model. Supervised machine-learning models (to which deep-learning models belong), as you learned in chapters 7 and 8, consist of a set of algorithm-specific parameters estimated from the training data used to train the models and a set of hyper-parameters used during the training process. A deep-learning model learns a large number of parameters, which determine the functioning of particular neurons in its layers, as described in section 14.1. Hyper-parameters are specified outside of the learning process; deep-learning hyper-parameters include, for example, the number of hidden layers in the neural network. From now on, we'll refer to hyper-parameters simply as parameters. Once trained, models are evaluated on validation datasets; and once their accuracy is satisfactory, they're used to make predictions and classify samples in a production

Performing regression with H2O’s deep learning 395 environment. Deep-learning models can be used both for regression and classifica- tion. H2O automatically determines this according to the target variable. Click the Housing750 frame, and then click the Build Model button; a Build a Model assistant will be displayed, asking you to choose the algorithm to be used. We’re interested in deep learning in this chapter, so select Deep Learning. You should then see an assistant (a wizard) similar to the one shown in figure 14.9.

Figure 14.9 An assistant for building a deep-learning model. The housing750 frame will be used for training, and housing250 will be used for validation. The target variable is in the Medv response column.

Select Housing750 as the training dataset and Medv as the response column (target variable). These are the only two mandatory parameters. But because you want to use a validation dataset, select Housing250 as the validation frame. You can also specify a large number of other parameters. We'll explain some of them as we go along. Leave the parameters at their defaults, and click the Build Model button. A new cell will be created, executing the buildModel Flow command with all the parameters you previously selected. You can watch the progress as the job is executing (see figure 14.10).

Figure 14.10 H2O Flow cell showing the progress of building a deep-learning model


Figure 14.11 Two metrics of the built deep-learning model shown side by side: the scoring history showing mean deviance (equal to MSE) as a function of the number of epochs, and the final validation metrics.

Once the model is built, click the View button. This will display the built model and the various metrics. The metrics will depend on the exact data splits and may differ from our results (you can see our results in figure 14.11).

NOTE By the way, H2O deep-learning model results are almost never reproducible because of the way the algorithm functions. You can make them reproducible by setting the Reproducible parameter to true and setting a seed, but this is slow and should be done only on small datasets.
The graph on the left in figure 14.11 shows deviance, which is equal to the mean squared error (MSE). You can see in the graph that deviance is still falling and that you probably stopped too early. If you continue to increase the number of epochs, you'll probably get better results (but we'll skip that and continue to a more complex model in the next section). The output on the right shows a couple of metrics, most notably the final deviance and MSE of 19.07 (equal to the RMSE of 4.37). This isn't great, but it's better than the first result you got in chapter 7 when you were using linear regression on this dataset, which was 22.806 (RMSE of 4.78).
BUILDING A MORE COMPLEX MODEL
In chapter 7, you added second-order polynomials to capture nonlinear relationships in the data but used a linear regression model. With higher-order polynomials, you increased the complexity of the model. To increase complexity in a deep-learning model, you can add additional hidden layers, and that is what you'll do now. You should train several models to get a feeling for the dataset and for the algorithm parameters. We don't have the means here to walk you through all the parameters and options, so we invite you to do that yourself. We'll just show you a set of parameters that will give you the best results on the housing dataset.

Performing regression with H2O’s deep learning 397

Go back to the Build a Model assistant (you can use the Outline view on the right side of the screen to navigate more easily), and set the model name to something human-readable, such as housingDL. Then enter 200, 200, 200 in the Hidden field. As you'll recall from chapters 7 and 8, more complex models need more iterations to converge, so enter 200 in the Epochs field. Finally, add L1 regularization by setting L1 to 1e-5. That will help the model suppress the features that don't contribute to its performance. Click the Build Model button, and wait for the job to finish (it may take longer than the first time). The results we got are shown in figure 14.12.

Figure 14.12 Results of building a deep-learning model with three hidden layers, 200 epochs, and L1 regularization

As you can see in the scoring history graph, the validation deviance (equal to the MSE) is now 9.92, which gives an RMSE of 3.15. R2 is 0.907. That is better than the best result you had in chapter 7 (RMSE of 3.4655119390 and R2 of 0.8567), and the Flow UI made it much easier to do.
SAVING THE MODEL
You can now save the model by clicking Export. This will open another cell where you need to enter a directory path (on the server where your Spark driver is running). Then, click Export (see figure 14.13). You can import a saved model using the main menu (Model > Import Model).

Figure 14.13 Exporting a model to the /models/housingDL folder on the server where the Spark driver is running


14.3.3 Building and evaluating a deep-learning model using the Sparkling Water API In this section, you’ll learn how to do all the steps from the previous sections directly from Spark, using the Sparkling Water API, which may come in handy if you plan to use H2O with Spark. First, you need to load and split the data and use it to build an H2O deep-learning model. After that, you’ll evaluate the model’s performance; and finally, you’ll learn how to load and save H2O models. SPLITTING THE DATA If you wish to use the same split frames as in the previous section (we recommend doing so, because that way you can get comparable results), you can load them in your Spark shell (the one from which you started the H2O cluster) using the following command:

scala> val housing750 = new H2OFrame("housing750")
scala> val housing250 = new H2OFrame("housing250")

Otherwise, you can create splits using the Sparkling Water API:

import org.apache.spark.examples.h2o.DemoUtils
val housingSplit = DemoUtils.split(housingH2o, Array("housing750sw", "housing250sw"), Array(0.75))
val housing750 = housingSplit(0)
val housing250 = housingSplit(1)

DemoUtils is a helper class available from the Sparkling Water examples package. You can examine its source (http://mng.bz/pT30) to see how it splits frames. BUILDING THE MODEL The first step in building the model is to create a DeepLearningParameters object, which will hold the algorithm’s parameters:

import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
val dlParams = new DeepLearningParameters()

To get results similar to those from the model built from the Flow UI, you’ll use the exact same parameters you used there. First you need to set the train frame and the validation frame. Every object in H2O (frames, models, and jobs) has a key with which it’s referenced in H2O’s distributed key-value store:

dlParams._train = housing750._key
dlParams._valid = housing250._key

Next, set the other parameters you used through the Flow UI. Parameters that aren’t explicitly set will take the same defaults as those in the Flow UI:

Performing regression with H2O’s deep learning 399 dlParams._response_column = "medv" dlParams._epochs = 200 dlParams._l1 = 1e-5 dlParams._hidden = Array(200, 200, 200)

As a last step, construct a DeepLearning object by passing it the parameters object you just built, and call the trainModel method. trainModel will return a water.Job object, which is used for tracking long-running user actions. Its get method will block until the job's completion and return the result:

import _root_.hex.deeplearning.DeepLearning
val dlBuildJob = new DeepLearning(dlParams).trainModel
val housingModel = dlBuildJob.get

That last command will trigger the model training job and output the results once it's finished:

housingModel: hex.deeplearning.DeepLearningModel =
Model Metrics Type: Regression
 Description: Metrics reported on full training frame
 model id: DeepLearning_model_1460198303597_1
 frame id: housing750-2
 MSE: 5.7617407
 R^2: 0.92605346
 mean residual deviance: 5.7617407
Model Metrics Type: Regression
 Description: Metrics reported on full validation frame
 model id: DeepLearning_model_1460198303597_1
 frame id: housing250-2
 MSE: 10.819859
 R^2: 0.89840055
 mean residual deviance: 10.819859
Status of Neuron Layers (predicting medv, regression, gaussian distribution, Quadratic loss, 83.401 weights/biases, 989,5 KB, 82.616 training samples, mini-batch size 1):
 Layer Units Type Dropout L1 L2 Mean Rate Rate RMS Momentum Mean Weight Weight RMS Mean Bias Bias RMS
 1 ...

The resulting model is automatically stored in the H2O cluster (its distributed key-value store) under its key.
EVALUATING THE MODEL
The model-training step also automatically puts a ModelMetricsRegression object into H2O's key-value store (because housingModel is a regression model), containing the results of the model's evaluation against the validation dataset. You can retrieve the metrics object with the following code:

scala> val housingMetrics = _root_.hex.ModelMetricsRegression.getFromDKV(
  housingModel, housing250)
housingMetrics: hex.ModelMetricsRegression =


Model Metrics Type: Regression
 Description: Metrics reported on full validation frame
 model id: DeepLearning_model_1460198303597_1
 frame id: housing250-2
 MSE: 10.819859
 R^2: 0.89840055
 mean residual deviance: 10.819859

The model is also available from the Flow UI. You can find it by clicking Models and then List All Models. When you click the model name (DeepLearning_model_146…), you can see that the scoring history graph and the validation metrics (see figure 14.14) are similar to those from the housingDL model (figure 14.12).

Figure 14.14 Scoring history and validation metrics of the model built from Sparkling Water with the same parameters as the model built through the Flow UI

TRANSFERRING H2O RESULTS BACK TO SPARK You can now use the model to create predictions based on new data samples. For example, you can get a new H2OFrame, with the individual predictions the model makes on samples from the validation dataset, by calling the model’s score method:

scala> val housingValPreds = housingModel.score(housing250)
housingValPreds: water.fvec.Frame =
Frame _972a703785fcceb561ec3310fa0f4b43 (114 rows and 1 cols):
         predict
    min  6.744033282852284
   mean  24.105874992201343
 stddev  9.2411335636329
    max  51.844403675884294
missing  0.0
      0  33.76862113111221
      1  20.826654125442147
      2  21.0116710955151

Performing regression with H2O’s deep learning 401

      3  21.966950826024252
      4  18.81179674838294
...

To convert these results back to a Spark DataFrame, call the H2OContext's asDataFrame method:

scala> val housingPredictions = h2oContext.asDataFrame(housingValPreds)(sqlContext)
scala> housingPredictions.show()
+------------------+
|           predict|
+------------------+
| 33.76862113111221|
|20.826654125442147|
|  21.0116710955151|
|21.966950826024252|
| 18.81179674838294|
| 18.02565778065803|
|17.795387663167254|
|18.419968615600666|
| 22.92203005950029|
|21.225224043197787|
|  20.0887430162838|
| 21.45854177762687|
| 23.76300250484211|
| 23.48128403269915|
|   21.354705723607|
| 29.75633627046893|
|24.705942488206933|
| 22.37385360846257|
|30.439123158814077|
| 32.83702131412606|
+------------------+
only showing top 20 rows

The resulting DataFrame contains a single predict column, which contains the predictions for individual rows in the input frame.
LOADING AND SAVING MODELS USING THE SPARKLING WATER API
You can save a model to a file and load a saved model using the ModelSerializationSupport class. To save a model, pass the model and a URI pointing to a file (it can also be an HDFS or S3 URI) to its exportH2OModel method:

import water.app.ModelSerializationSupport
import _root_.hex.deeplearning.DeepLearningModel

ModelSerializationSupport.exportH2OModel(housingModel, new java.net.URI("file:///path/to/model/file"))

You can load a saved model with the loadH2OModel method the same way:

val modelImported:DeepLearningModel = ModelSerializationSupport.loadH2OModel(
  new java.net.URI("file:///path/to/model/file"))


14.4 Performing classification with H2O’s deep learning Deep learning can be used for classification as well as for regression. In this section, you’ll again use a dataset you used before: the adult dataset from chapter 8. To refresh your memory, the adult dataset was extracted from the 1994 United States census data. It contains 13 attributes with data about a person’s sex, age, education, marital status, race, native country, and so on, and the target variable (income). The goal is to predict whether a person earns more or less than $50,000 per year. You can find the dataset in our online repository (http://mng.bz/bnCf). You can find descriptions of the columns in the file adult.names (http://mng.bz/KF4i).

14.4.1 Loading and splitting the data As we said previously, in this section you’ll load the data from a file directly to an H2OFrame. It’s straightforward—you need only to provide a path to the file to load:

val censusH2o = new H2OFrame(new java.net.URI("first-edition/ch08/adult.raw"))

The frame is now also available through H2O Flow UI under the name adult_raw.hex. H2O automatically detects which columns contain numbers and which contain categorical values (the enum type in H2O terms). Categorical values are those that can take on only a limited number of possible values, such as education and marital_status. The resulting frame is shown in figure 14.15.

Figure 14.15 Census frame summary showing that H2O sees categorical columns as strings and not as enums

Performing classification with H2O’s deep learning 403

But there’s a problem with the names of the columns. The adult.raw file doesn’t include the column names. So, you can change the column names using the H2O API. This is how you do that:

censusH2o.setNames(Array("age1", "workclass", "fnlwgt", "education", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"))

That line sets the new names for the columns. You can check it with the following command:

scala> censusH2o.name(0)
res0: String = age

But if you execute the Flow command getFrameSummary "adult_raw.hex" from the Flow UI, you’ll see that the names haven’t changed. You need to update them in H2O’s distributed key-value store in order for them to become visible in the Flow UI. You can accomplish that with the following line:

censusH2o.update()

SPLITTING THE DATASET Unlike the housing dataset, you’ll split the census dataset directly using the Sparkling Water API to learn how it’s done. The most straightforward way is to use the DemoUtils class from the Sparkling Water examples (this is why you needed to include the examples package when you started Sparkling Water in section 14.2.2):

import org.apache.spark.examples.h2o.DemoUtils
val censusSplit = DemoUtils.split(censusH2o, Array("census750", "census250"), Array(0.75))
val census750 = censusSplit(0)
val census250 = censusSplit(1)

The new frames are now also accessible through the Flow UI. If you'd like to use the same splits we used, you can find them in our online repository (files census750.csv and census250.csv at http://mng.bz/IS40). You can load them using the line from the beginning of this section.
14.4.2 Building the model through the Flow UI
Building a classification deep-learning model is no different than building a regression model. But you must be wondering what to do with the categorical data—after all, in chapter 8 you had to use the StringIndexer and OneHotEncoder classes to one-hot encode those features. The good news is that H2O does all that automatically; you don't have to worry about it. Just open the Build Model assistant, select the Deep Learning algorithm, and enter the parameters from table 14.1. They're almost identical to the ones you used to build the regression model previously.


Table 14.1 Parameters to build a deep-learning model for classification of census data

Parameter           Value
Model ID            censusDL-Flow
Training frame      census750
Validation frame    census250
Response column     income
Hidden              200, 200, 200
Epochs              50
L1                  1e-5

The training will take longer than for the housing data because the adult dataset is larger. Click the Build Model button, and then click View once the job finishes.
EVALUATING THE MODEL'S PERFORMANCE
Because the target variable (response column) is categorical, H2O automatically builds a classification model and chooses the appropriate evaluation metrics: f1, r2, and ROC curve (to refresh your memory on these terms, consult section 8.2.4). It also shows log-loss (defined in section 8.2.1), not just MSE. The resulting validation metrics are shown in figure 14.16, and the ROC curve is shown in figure 14.17. As you can see, the corresponding area under the curve (AUC) is 0.9118, which is slightly better than the best result from chapter 8. The table on the right in figure 14.17 is associated with the value selected under the criterion drop-down menu (or the threshold, if a criterion isn't selected). You have to select one of those fields in order to see the table. The table shows the threshold and all other metrics that would apply if the selected criterion were to be enforced. The selected criterion in figure 14.17 is the maximum f1 measure. This tells you that if you wanted the maximum f1 measure possible, the threshold would need to be 0.2905 (anything above that value is considered to belong to the target category, meaning the income is greater than $50,000), the TPR would be 0.8007, the FPR would be 0.1550, and so on. You can also examine the scoring history graphs (figure 14.18) that show log-loss and MSE as functions of the number of epochs. You can use them to see if the validation error is rising or falling and to determine whether you're using too few or too many epochs. In this case, we could say that a smaller number of epochs may lead to slightly better results, because the validation error has started to rise. Your results may be different, depending on the frame splits.

Figure 14.16 Validation metrics of the built deep-learning model

Figure 14.17 The ROC curve of the built deep-learning model with an area under the curve of 0.911823. The values in the tables on the right correspond to the maximum possible value of the f1 measure that can be obtained with this model and this dataset.

Figure 14.18 Scoring history for the built deep-learning model, showing log-loss and MSE as functions of the number of epochs


14.4.3 Building the model with the Sparkling Water API Using the Sparkling Water API to build a classification deep-learning model is the same as building a regression model:

val censusdlParams = new DeepLearningParameters()
censusdlParams._train = census750._key
censusdlParams._valid = census250._key
censusdlParams._response_column = "income"
censusdlParams._epochs = 50
censusdlParams._l1 = 1e-5
censusdlParams._hidden = Array(200, 200, 200)

The following line starts the training job:

import _root_.hex.deeplearning.DeepLearning
val censusModel = new DeepLearning(censusdlParams).trainModel.get

Model Metrics Type: Binomial
 Description: Metrics reported on temporary training frame with 9890 samples
 model id: DeepLearning_model_1460198303597_64
 frame id: census750.temporary.sample.27,30%
 MSE: 0.09550598
 R^2: 0.48424917
 AUC: 0.9206606
 logloss: 0.2964316
 CM: Confusion Matrix (vertical: actual; across: predicted):
         <=50K  >50K   Error            Rate
  <=50K   6619   844  0,1131  =   844 / 7.463
   >50K    595  1832  0,2452  =   595 / 2.427
 Totals   7214  2676  0,1455  = 1.439 / 9.890
 Gains/Lift Table (Avg response rate: 24,54 %):
  Group  Cumulative Data Fraction  Lower Threshold  Lift  Cumulative Lift  Response Rate  Cumulative Response Rate  Capture Rate  Cumulative Capture Rate  Gain  Cumulative Gain
...

The only difference is the class used for validation. In this case, it’s ModelMetricsBinomial:

val censusPredictions = censusModel.score(census250)
val censusMetrics = _root_.hex.ModelMetricsBinomial.getFromDKV(censusModel, census250)

Looking at the AUC metric values, this model gives very similar results to the censusDL-Flow model and, again, better results than the best results from chapter 8, on the same dataset.

14.4.4 Stopping the H2O cluster When you’re finished using the H2O cluster, you can shut it down by calling the H2OContext’s stop method:

h2oContext.stop(false)


The argument (true or false; false is the default) tells Sparkling Water whether to first stop the Spark context. The Spark context stops in either case, because the stop method performs the system exit, which causes the JVM to stop.
14.5 Summary
■ Deep learning is an area of machine learning based on using artificial neural networks to learn high-level features of the underlying data.
■ Artificial neural networks are modeled according to biological neural networks and consist of artificial neurons organized in layers. A deep neural network is an ANN with two or more hidden layers.
■ H2O is a fast, open-source, machine-learning platform, featuring a deep-learning algorithm, among others.
■ Sparkling Water is a Spark package that you can use to run an H2O cluster on top of Spark. With Sparkling Water, you can use the H2O API directly from Spark.
■ H2O's web-based console is called the Flow UI. It enables you to quickly load data, train models using the supported algorithms, use the trained models, and inspect the results.
■ Spark DataFrames can be converted to H2O frames, and vice versa. Once converted to H2O frames, they're immediately available in the Flow UI. This way, you can load, transform, and clean data in Spark, transfer it to H2O to build machine-learning models, and transfer the results back to Spark.
■ The Flow UI can be used to examine data distributions, split frames, train models, and load and save data and machine-learning models. The same actions can be accomplished using the Sparkling Water API.
■ H2O automatically chooses the appropriate metrics and deep-learning model (classification or regression) based on the type of the target variable, and provides a wealth of information about your model's performance.

appendix A Installing Apache Spark

Although we provide a VM image where Spark is already installed, we also wanted to give you step-by-step instructions on how to install Apache Spark as it would be done in the real world. This appendix contains instructions for the following:
■ Installing Java (JDK)
■ Downloading, installing, and configuring Apache Spark
If you aren't using Ubuntu, we suggest that you install the VirtualBox hardware-virtualization software and create an Ubuntu VM (www.wikihow.com/Install-Ubuntu-on-VirtualBox).
Prerequisites: installing the JDK
Let's get started. From now on, we'll assume that you're logged in to your Ubuntu OS. If you aren't sure whether you already have the JDK installed and set up correctly, open your terminal (Ctrl-Alt-T) and issue the following command (you can paste the command in the Ubuntu terminal with Ctrl-Shift-V and, if needed, copy from the terminal with Ctrl-Shift-C):

$ which javac

NOTE Skip the dollar sign ($) when you enter commands—that’s just the standard way of designating that commands should be entered into the terminal.

The which command basically tells you which executable file on your filesystem would be triggered if you were to execute the javac command (javac is the Java compiler, which comes only with the JDK). If the command returns a path, such as /usr/bin/javac, make sure your JAVA_HOME environment variable is correctly set (echo $JAVA_HOME should return the root JDK installation folder: for example, /usr/lib/jvm/java-8-openjdk-amd64), and



continue from “Downloading, installing, and configuring Spark.” If the JDK is installed but JAVA_HOME isn’t set, continue from “Setting the JAVA_HOME environment variable.” If the which command returns an empty line, you’ll need to install the JDK. Still in your terminal, execute the following set of commands (requires internet access):

$ sudo apt-get update                        # Updates the apt package repository
$ sudo apt-get -y install openjdk-8-jdk      # Downloads and installs the latest available build of the JDK

sudo is a way to run commands with elevated privileges (as the root user). apt is Ubuntu’s package manager: a way to install programs using online package repositories (or deb prepackaged installations from the local filesystem).

SETTING THE JAVA_HOME ENVIRONMENT VARIABLE
There's one more thing to do, once the installation completes: permanently set the JAVA_HOME environment variable to refer to the OpenJDK installation folder (permanently meaning not only for the duration of the current session, but after a reboot too). Here's how:

$ echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" | sudo tee /etc/profile.d/sia.sh export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

The echo command sends its input to the standard output. The standard output is normally the terminal window, but here you use a pipe command (|), which temporarily redirects standard output to a command that comes after it (tee, in this case). (We provide more details about the pipe command in chapter 3.) The tee command sends its input to both a file, which it received as a parameter, and the standard output (the terminal, because there's no pipe after tee). You again need the sudo command because writing to any file in /etc requires elevated privileges. To summarize, you echo the JAVA_HOME assignment string (export is used to set environment variables), which the pipe redirects to the tee command, which in turn writes that string to both the provided file and to the terminal window. Every file located in /etc/profile.d that has the .sh extension is executed on Ubuntu startup, so you'll have JAVA_HOME set even after you restart Ubuntu. But all the previous command did was to create a new file, sia.sh, and add the export of the JAVA_HOME assignment line to the file; it didn't set JAVA_HOME. Let's see whether that's the case with echo, and then fix it with the source command (which executes the contents of a file):

$ echo $JAVA_HOME                 # Placing $ in front of an environment variable retrieves the variable's value.
-----empty-line-----
$ source /etc/profile.d/sia.sh
$ echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64


Now, with the prerequisites out of your way, you'll finally start dealing with fun stuff: downloading and setting up the latest version of Spark.
Downloading, installing, and configuring Spark
Use Mozilla Firefox, which comes with Ubuntu out of the box, to open the Apache Spark project page (http://spark.apache.org/downloads.html). Make the following selections:
1 Choose a Spark Release: select the latest Spark release.
2 Choose a Package Type: select the latest prebuilt version of Hadoop (or your version of Hadoop, if you have one at your disposal).
3 Choose a Download Type: select Apache Mirror.
4 Click the link that comes after Download Spark in step 3.
When a page with a list of mirrors appears, click the suggested, topmost Apache Mirror site. In the dialog that opens, make sure Save File is selected (this may differ depending on your browser), and click OK. The file will be saved in the Downloads folder in your home folder. Once the download completes, in your terminal, navigate to the $HOME/Downloads folder (if that's where you downloaded Spark), and unpack the Spark bundle:

$ cd $HOME/Downloads
$ tar -xvf spark*
$ rm spark*tgz        # Deletes the downloaded Spark tgz archive.

Instead of using *, you can type "tar -xvf spark" and then press Tab to autocomplete the filename. When you log in to Ubuntu, the HOME environment variable is populated with a path to your home directory.1 For example, because we used the username mbo, HOME points to /home/mbo:

$ echo $HOME          # Prints out the provided input, as you saw when setting JAVA_HOME.
/home/mbo

From now on, we’ll refer to this directory as your home directory (or just home), and you’ll use the HOME environment variable to navigate to it (as shown in the next set of commands). Because the purpose of this Spark installation is not to put Spark in production, but rather to learn, and because it will be used only by you, let’s follow a Linux con- vention (http://mng.bz/0ezc) and place the Spark binaries2 in the bin directory in your home. You have full access rights (read/write/execute, or rwx) to your home directory, so you won’t have to fiddle with sudo every time you need to make a change (for example, to a configuration file).

1 More on the home directory: https://help.ubuntu.com/community/HomeFolder.
2 A fancy name for a compiled program (for example, you can download Spark's source code or Spark binaries).


Because new versions of Spark are flying out every couple of months, you need a way to manage them so you can have multiple versions installed and easily choose which one to use. You’ll create a sparks directory, where you’ll put the current and all future versions of Spark. Create the bin directory with the sparks directory in it, and then move your unpacked Spark binaries from Downloads to this new sparks directory, as follows:

$ cd $HOME                           # Navigates to your home directory
$ mkdir -p bin/sparks                # Creates the bin/sparks directory
$ mv Downloads/spark-* bin/sparks    # Moves the uncompressed spark directory from Downloads to bin/sparks

Our home/bin directory looked like this after moving the Spark binaries to bin/sparks. Yours should look the same, but without earlier versions of Spark:

bin
  spark
  sparks
    spark-1.2.0-bin-hadoop2.4
    spark-1.2.1-bin-hadoop2.4
    spark-1.2.2-bin-hadoop2.4
    spark-1.3.0-bin-hadoop2.4
    spark-1.3.1-bin-hadoop2.6
    spark-1.4.0-bin-hadoop2.6
    spark-1.4.1-bin-hadoop2.6
    ...
    spark-1.6.1-bin-hadoop2.6
    spark-2.0.0-bin-hadoop2.7

Great! You’re nearly there. There’s one thing left to do, if you’re going to follow good Linux practices for managing multiple versions of a program: you need to create a symbolic link directory named Spark in your $HOME/bin directory, which will point to a Spark installation directory you want to use. Let’s do that now:

$ cd $HOME/bin                                    # Navigates to the bin directory in your home directory
$ ln -s sparks/spark-2.0.0-bin-hadoop2.7 spark    # Creates a symbolic link

Now, still in the ~/bin folder, execute the tree -L 2 command. You'll see that the spark folder is actually a symbolic link that points to another folder in the sparks directory. The idea is to always refer to the spark root (see the sidebar) the same way, using the spark symbolic link. When you want to use some other version of Spark, you can change the symbolic link to point to a different version.

SPARK-SHELL
Now you should be able to start the spark-shell command-line interface using these commands in your terminal:

$ cd $HOME/bin/spark    # Navigates to the spark root directory
$ ./bin/spark-shell     # Starts the spark-shell


Symbolic links
A symbolic link (or symlink) is a reference to a file or a folder (the latter, in this case). It behaves as though you have access to the same folder from two different places in your filesystem. The symlink isn't a copy; it's a reference to the target folder, with the ability to navigate within it, as if it were the target folder. Every change you make in the symlink is applied directly to the target folder and reflected in the symlink. By using the symlink this way, regardless of the current version of Spark, you're always using $HOME/bin/spark. You switch versions by deleting the symlink (rm command) and creating one that points to the root installation folder of a Spark version you want to work with. If you wanted to switch to version 1.6.1, you would do this:

$ rm spark
$ ln -s spark-1.6.1-bin-hadoop2.6 spark

From now on, we’ll call $HOME/bin/spark directory the spark root.

Boom! You have a running spark-shell on your machine. The default logging settings are much too verbose, so let’s tone that down a bit. You’ll make spark-shell print only errors, but you’ll maintain the complete log in the logs/info.log file (relative to the spark root) for troubleshooting. Exit the shell by typing :quit or just :q (or pressing Ctrl-D), and create a log4j.properties file in the conf subfolder, like this:

$ gedit conf/log4j.properties

gedit is Ubuntu's built-in text editor. You are, of course, free to use any other text editor. Now, copy the following listing into the newly created log4j.properties file in gedit.

Listing A.1 log4j.properties file

# set global logging severity to INFO (and upwards: WARN, ERROR, FATAL)
log4j.rootCategory=INFO, console, file

# console config (restrict only to ERROR and FATAL)
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.threshold=ERROR
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# file config
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/info.log
log4j.appender.file.MaxFileSize=5MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout

414 APPENDIX A Installing Apache Spark

log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p ➥ %c{1}: %m%n

# Settings to quiet third party logs that are too verbose log4j.logger.org.eclipse.jetty=WARN log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

Save the file, and close gedit.

log4j
Although it has been superseded by the logback library and is almost two decades old, log4j is still one of the most widely used Java logging libraries, due to the simplicity of its design.

Make sure your current directory is still the spark root, and then use the same command as before (instead of typing it again, you can press the up-arrow key twice) to start spark-shell:

$ ./bin/spark-shell

Cleaner, right? So, what is this spark-shell about? There are two ways you can interact with Spark. One is to write a program in Scala, Java, or Python that uses Spark’s library (that is, its API; more on programs in chapter 3). The other is to use the Scala shell or the Python shell.
The shell is primarily used for exploratory data analysis, usually for one-off jobs, because a program written in the shell is discarded after you exit the shell. The other common shell-use scenario is testing and developing Spark applications. It’s much easier to test a hypothesis in a shell (for example, to probe a dataset) than it is to write an application, submit it to be executed, write the results to an output file, and then analyze that output.
spark-shell is also known as the Spark REPL, which, as you recall, stands for read-eval-print loop. It reads your input, evaluates it, prints the result, and then does it all over again; after a command returns a result, the shell doesn’t exit the scala> prompt but stays ready for your next command (hence the loop).
As you can see in the REPL’s initial printout when you start it, it provides you with the Spark context in the form of the sc variable and the SparkSession as spark. SparkSession is the entry point for interacting with Spark. You use it for things like connecting to Spark from an application, configuring a session, managing job execution, loading or saving a file, and so on. If you type spark. and press the Tab key, you’ll see a printout of all the functions SparkSession provides.
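To see that both variables are live, you can run a couple of one-liners right away. This is a minimal sketch, assuming you started the shell from the spark root so that the distribution’s LICENSE file is in the current directory:

scala> val licLines = sc.textFile("LICENSE")   // SparkContext (sc) builds an RDD of the file's lines
scala> licLines.count()                        // an action that returns the number of lines
scala> spark.range(10).show(3)                 // SparkSession (spark) creates and displays a small DataFrame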

appendix B Understanding MapReduce

In December 2004, Google published a paper, “MapReduce: Simplified Data Processing on Large Clusters,” by Jeffrey Dean and Sanjay Ghemawat, summarizing the authors’ solution to Google’s urgent need to simplify cluster computing. Dean and Ghemawat settled on a paradigm in which parts of a job are mapped (dispatched) to all nodes in a cluster. Each node produces a slice of the intermediary result set. All those slices are then reduced (aggregated) back to the final result.
The MapReduce paper (http://mng.bz/8s06) solves these three main problems:
■ Parallelization—How to parallelize the computation
■ Distribution—How to distribute the data
■ Fault-tolerance—How to handle component failure
The core of MapReduce deals with programmatic resolution of those three problems, which effectively hides most of the complexities of dealing with large-scale distributed systems and allows MapReduce to expose a minimal API that consists of only two functions: (wait for it …) map and reduce.
One of the key insights from MapReduce is that you shouldn’t be forced to move data in order to process it. Instead, the program is sent to where the data resides. That is a key differentiator compared to traditional data warehouse systems and relational databases: there’s simply too much data to be moved around.
We’ll explain how MapReduce works with a simple example: the good old simplified word count (big data’s “Hello World”). Suppose you’d like to find the most common words on a website. Let’s say there are two pages on this website, each with a single paragraph:
■ “Is it easy to program with MapReduce?”
■ “It is as easy as Map and Reduce are.”


First, the map function takes each paragraph, tokenizes it into words, normalizes case and punctuation, and emits the pair (word, 1) for each word:

First page (“Is it easy to program with MapReduce?”):
map → ("is", 1)
map → ("it", 1)
map → ("easy", 1)
map → ("to", 1)
map → ("program", 1)
map → ("with", 1)
map → ("mapreduce", 1)

Second page (“It is as easy as Map and Reduce are.”):
map → ("it", 1)
map → ("is", 1)
map → ("as", 1)
map → ("easy", 1)
map → ("as", 1)
map → ("map", 1)
map → ("and", 1)
map → ("reduce", 1)
map → ("are", 1)

So, for each input paragraph, map takes a key and a value and returns a list of key/value pairs. In the intermediate phase that follows, the MapReduce library makes sure the same word always goes to the same reducer, so the counts can be summed up and the computation finalized without any subsequent steps. That is called the shuffle phase, and for the overwhelming majority of applications, it takes the longest time to complete. The reduce function receives map’s output, adds up the counts for each word, and emits the result:

reduce → ("is", 2)
reduce → ("it", 2)
reduce → ("easy", 2)
reduce → ("to", 1)
reduce → ("program", 1)
reduce → ("with", 1)
reduce → ("mapreduce", 1)
reduce → ("as", 2)
reduce → ("map", 1)
reduce → ("and", 1)
reduce → ("reduce", 1)
reduce → ("are", 1)

We can generalize that map takes a key/value pair, applies some arbitrary transformation, and returns a list of so-called intermediate key/value pairs. MapReduce then, behind the scenes, groups those pairs by key, and they become input for the reduce function. reduce combines those values in some useful way and writes the result to a final output file. When all reducers have finished their task, control is returned to the user program.
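To make the two functions concrete, here is a minimal Scala sketch of the word-count mapper and reducer described above. It’s illustrative only: it isn’t Hadoop’s actual Mapper/Reducer API, and the function and parameter names are ours.

// Emits ("word", 1) for every word in a paragraph, lowercased and stripped of punctuation.
def map(pageId: String, paragraph: String): List[(String, Int)] =
  paragraph.toLowerCase.replaceAll("[^a-z ]", " ")
    .split("\\s+").filter(_.nonEmpty)
    .map(word => (word, 1)).toList

// Sums up the 1s that the framework grouped under the same word during the shuffle phase.
def reduce(word: String, counts: List[Int]): (String, Int) =
  (word, counts.sum)

Between the two calls, the library groups the intermediate pairs by key, so reduce("is", List(1, 1)) returns ("is", 2).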


This is just one version of the word-count example. You could do it differently: for example, you could make map do the counting within each paragraph (emitting ("as", 2) directly), you could use natural language processing to stem each word to its root, or you could split words on CamelCase (MapReduce → map, reduce). You could also use a stop-word filter, which uses a list of the most common words in each language to remove those words from the final count because they don’t add value or relevance (that would then be TermVector, not WordCount). But those additions would only complicate and obscure the example, so we skipped them; they’re not important for explaining how MapReduce works.
Having seen MapReduce in action, your first instinct may be that it’s overly complicated for the simple task of counting the frequency of words in two sentences. And you’re right, of course—there are simpler and more intuitive ways to write a trivial method for the task at hand. But keep in mind that MapReduce was designed to tackle terabytes and even petabytes of sentences, from billions of websites, server logs, click streams, and so on.
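The first of those variations (pre-aggregating counts within each paragraph) only changes the map side. Here is a sketch under the same assumptions as the previous listing; the helper name is ours:

// Emits one pair per distinct word in the paragraph, with its local count (for example, ("as", 2)).
def mapWithLocalCounts(pageId: String, paragraph: String): List[(String, Int)] =
  paragraph.toLowerCase.replaceAll("[^a-z ]", " ")
    .split("\\s+").filter(_.nonEmpty)
    .groupBy(identity)                                            // group repeated words within the paragraph
    .map { case (word, occurrences) => (word, occurrences.length) }
    .toList

The reduce side stays the same, because summing pre-aggregated counts gives the same totals as summing 1s.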

MapReduce fault-tolerance
This is an excerpt from Dean and Ghemawat’s MapReduce paper (slightly paraphrased): The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks in progress or completed by the failed worker are reset back to their initial idle state and therefore become eligible for scheduling on other workers.

appendix C A primer on linear algebra

This appendix presents the basic linear algebra operations. They’re important for understanding much of the math behind most machine-learning algorithms. In case you skipped your linear algebra classes, this is your chance to come to grips with it.
Matrices and vectors
A matrix is a set of elements arranged in rows and columns (we’ll only use matrices whose elements are numbers). This is a matrix with two rows and three columns, so it’s called a 2 × 3 matrix:

X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}

Usually, the number of rows in a matrix is denoted by the letter m and the number of columns by the letter n; m and n are the matrix’s dimensions, and we say its size is m × n. We’ll denote matrices with bold, uppercase letters (X in the previous example) and their elements with plain, lowercase letters with subscript indices. For example:

X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{bmatrix}

A column vector is a matrix of size n × 1, and a row vector is a matrix of size 1 × n. We’ll refer to column vectors with lowercase, italic, bold letters (for example, u) and to row vectors just like column vectors, but with an added superscript letter T (for example, u^T):

u = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \qquad u^T = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}

The letter T here means transposed. Transposition is an operation in which the rows of a matrix become its columns, and vice versa. For our example matrix X, its transposed matrix would be as follows:

X^T = \begin{bmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{bmatrix}

So, if the size of the original matrix is m × n, the size of the transposed matrix will be n × m. Note that (X^T)^T = X.
Matrix addition
Two matrices (or vectors, which are special cases of matrices) can be added if they have the same size. The addition is performed element-wise:

\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{bmatrix}

Matrix addition is associative, which means A + (B + C) = (A + B) + C. It’s also a commutative operation, which means A + B = B + A.
Scalar multiplication
A scalar, in contrast to vectors and matrices, is a regular number. Multiplication of matrices by scalars is also performed element-wise, by multiplying each element with the scalar value:

a a a a  11 12 = 11 12 a21 a22 a21 a22


Matrix multiplication
Two matrices can be multiplied only if the number of columns in the left matrix is equal to the number of rows in the right matrix. If two matrices being multiplied have sizes m × n and n × k, the resulting matrix will be of size m × k. In the case of vectors, when multiplying a row vector (1 × n) and a column vector (n × 1), the result will be a scalar value. This is also called a dot product or a scalar product:

a^T b = \begin{bmatrix} a_{11} & a_{12} & a_{13} \end{bmatrix} \begin{bmatrix} b_{11} \\ b_{21} \\ b_{31} \end{bmatrix} = a_{11} b_{11} + a_{12} b_{21} + a_{13} b_{31}

Conversely, when multiplying a column vector of size m × 1 and a row vector of size 1 × n, the result will be a matrix of size m × n. Generally, when multiplying matrices A and B, you can use this formula to calculate each element of the resulting matrix C:

c_{ij} = \sum_{r=1}^{n} a_{ir} b_{rj}, \qquad 1 \le i \le m, \; 1 \le j \le k

For example:

\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix} = \begin{bmatrix} a_{11} b_{11} + a_{12} b_{21} + a_{13} b_{31} & a_{11} b_{12} + a_{12} b_{22} + a_{13} b_{32} \\ a_{21} b_{11} + a_{22} b_{21} + a_{23} b_{31} & a_{21} b_{12} + a_{22} b_{22} + a_{23} b_{32} \end{bmatrix}

In other words, the value of the element in row i and column j of the resulting matrix is equal to the dot product of the i-th row of the first matrix and the j-th column of the second matrix.
Identity matrix
A square matrix is a matrix with the same number of rows and columns (its size is m × m). An identity matrix is a square matrix with all elements on its main diagonal equal to 1 and all other elements equal to 0. For example, this is the identity matrix of size 2:

I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

Any matrix A of size m × n, multiplied by an identity matrix, stays the same: A I_n = I_m A = A.


Matrix inverse
The final definition for this appendix is the matrix inverse. A matrix A multiplied by its inverse (denoted by A^{-1}) gives an identity matrix: A A^{-1} = A^{-1} A = I. Only square matrices can have an inverse, but some square matrices don’t have one. If a square matrix has an inverse, it’s called an invertible or non-singular matrix.
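If you want to experiment with these operations from the Scala shell, here is a minimal sketch using the Breeze library (the linear algebra library that Spark’s MLlib builds on). It assumes Breeze is on your classpath, which should be the case if you simply launch spark-shell:

import breeze.linalg.{DenseMatrix, DenseVector, inv}

val X = DenseMatrix((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))      // the 2 x 3 matrix from the start of the appendix
X.t                                                        // transpose: a 3 x 2 matrix

val A = DenseMatrix((1.0, 2.0), (3.0, 4.0))
val B = DenseMatrix((5.0, 6.0), (7.0, 8.0))
A + B                                                      // element-wise matrix addition
A * 2.0                                                    // scalar multiplication
A * B                                                      // matrix multiplication

DenseVector(1.0, 2.0, 3.0) dot DenseVector(4.0, 5.0, 6.0)  // dot product of two vectors (a scalar)

val I2 = DenseMatrix.eye[Double](2)                        // identity matrix of size 2
A * I2                                                     // equals A
A * inv(A)                                                 // approximately I2, up to rounding errors

Note that this is plain Breeze, not MLlib’s own local matrix and vector classes.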

index

Symbols abs function 118–119 based on type of target acceptedAnswerId 110 variable 186 , character 31 accessing filesystems 9 supervised and unsupervised ::: operator 73 Accumulable class 98 algorithms 185–186 * character 155 AccumulatorParam objects 98 ALS (alternating least & character 370 accumulators 97–99 squares) 386 > string 124 accumulating values in ALTER TABLE command 134 < string 124 accumulable collections 99 Amazon EC2, running stand- # separator 94 writing custom 98 alone cluster on 322–329 += operator 98 creating cluster 324–326 < symbol 124 activeDirection parameter 261 changing Hadoop => (fat arrow) function 26 actor object 51 > symbol 124 add method 98 version 325 ~ character 49 Add Remote Catalog button 42 changing instance addFile method 320 types 325 A --additional-security-group customizing security option 326 groups 325–326 tag 255 addJar method 320 launching cluster 326 A* search algorithm 269–281 adsPerSecondAndType 379 specifying credentials 325 calculating F, G, and H adUrlPattern 378 destroying cluster 329 values 277–278 agg function 125 prerequisites for 323–324 checkpointing working aggrdd.partitions.size 74 creating key pairs 323–324 graph 277 aggregate functions obtaining AWS secret collecting final path built-in 118–119 keys 323 vertices 278 user-defined 126 using cluster 327–329 implementing 272–278 aggregateByKey cluster configuration 328 initializing 274–275 connecting to master 328 main loop 275–276 transformation 73, 76–77 logging in 327–328 marking current vertex as aggregateMessages method visited 276–277 258, 260 rebooting 328–329 overview 270–272 aggregating data 52–53 starting 328–329 selecting next current aggregating graphs 257 stopping 328–329 vertex 278 aggVertices 260 Amazon Web Services. See AWS testing implementation algorithms, for machine ampersand character 370 279–281 learning 184–186 analyzed logical plan 142

423

424 INDEX analyzing and preparing asc operator 118 Broadcast.value method 100 data 198–202 asDataFrame method 401 brokerList argument 372 analyzing column cosine associative arrays 67 Broyden-Fletcher-Goldfarb- similarities 200 AStar object 274 Shanno. See BFGS analyzing data asterisk character 155 BSD (Berkeley Software distribution 199 attr field 256 Distribution) 25 computing covariance attributes 186 bsdLines 26 matrix 200–201 --attributes parameter 354 Build Model button feature scaling and mean AUC (area under the 395, 397, 404 normalization 202 curve) 404 Build uberjar 61 splitting data 201 auto value 245 buying securities 148 transforming to labeled auto.offset.reset parameter 168 points 201 avg function 118, 125 C and operator 117 avgByCust 94 ANNs (artificial neural awaitTerminationOrTimeout -c CORES parameter 313 networks) 384 method 154 cache method 276 answerCount 110 AWS (Amazon Web caching graphs 276 anti-patterns 8 Services) 322 calcDistance3d function 280 ANY level 293 capacity scheduler 335–336 Apache Spark. See Spark B cartesian transformation App.scala file 57 83, 86–87 Application Detail UI link 321 backpressure feature 175 case studies 364–382 application master 332 badges 109, 129 components of 366–367 Application Overview page 342 bagging 239 overview 364–366 applications Base Directory field 61 running application 367–373 developing 46–58 batch gradient descent. See BGD source code 374–382 aggregating data 52–53 bceval evaluator 235 building projects 382 broadcast variables 55–57 Beeline, connecting to Thrift KafkaLogsSimulator excluding non- server 135 project 374 employees 53–55 Berkeley Software Distribution. StreamingLogAnalyzer loading JSON 48–50 See BSD project 374–381 preparing GitHub archive best-fit weights 195 WebStatsDashboard dataset 46–48 bestModel field 236 project 381–382 running applications from BFGS (Broyden-Fletcher- starting application Eclipse 50–51 Goldfarb-Shanno) 216 manually 372–373 using entire dataset 57–58 BGD (batch gradient creating Kafka topics 372 submitting 58–65 descent) 214 obtaining archives 372 adapting applications bias-variance tradeoff 209–210 starting Log Analyzer 60–62 binary response 222 372–373 building uberjar 59–60 BinaryClassificationEvaluator starting Log Simulator 373 spark-submit shell class 231–233 starting Web Stats script 62–65 binaryRecordsStream 164 Dashboard 373 apply function 116 BLOBs 114 stopping application 370–371 approximate sum and mean BlockMatrix distributed troubleshooting 371–372 38–39 matrix 193 cat command 64 area under the curve. See AUC boolean argument 276 Catalog class 132–133 areaUnderPR 233 bootstrap.servers parameter 167 Catalyst optimizer 142–144 areaUnderROC 233 BoundedDouble 38 examining execution Array class 29, 32 Breeze library 189 plan 143–144 Array.toString 31 broadcast variables 55–57, taking advantage of partition ArrayBuffer 278 99–102 statistics 144 artifactId 43 configuration parameters that categorical values, preparing artificial intelligence 184 affect 100–102 data to use in Spark artificial neural networks. See destroying and 226–227 ANNs unpersisting 100 categorical variables 186, 218

INDEX 425

CategoricalSplit class 244 columns method 115 CoordinateMatrix 192–193, 200 category 186 columnSimilarities method 200 core-site.xml file 334 cbrt function 119 combineByKey method 91–92, Core. See Spark Core cells 390 94, 153, 378 cost function 195 checkpointDir argument 372 combined values 92 cost value 249 checkpointing graphs 276 comma character 31 count function 125 child RDD 94 comma-separated values file. See countByKey method chosen weights 195 CSV 69, 153, 378 class argument 63 command-line parameters countByValueAndWindow classification algorithm 186, 295–296 operation 164 219, 242, 246 commentCount 110 countByWindow operation 163 classpath and sharing files 339 commons-email library 59 countTags 123 classpath entries and files, companion objects 380 covariance 199 specifying extra 319–320 Comparable interface 89 covariance matrix 200–201 adding additional Python Comparator interface 89 CPU cores 288, 292, 304 files 320 compare function 89 CRC (cyclic redundancy adding files complTrans variable 70, 73 check) 64 programmatically 320 components, Spark 8–10 create combiners function 93 using --jars parameter Spark Core 8–9 createCombiner function 92 319–320 Spark GraphX 10 createDataFrame method 114 using command-line Spark MLlib 10 createDirectStream options 319 Spark SQL 9–10 method 168, 376 using SPARK_CLASSPATH Spark Streaming 10 CreateEvent 48 variable 319 Compressed Sparse Column createOrReplaceTempView clean package 61 format. See CSC method 131 client-deploy mode 287 computeCovariance creationDate 110, 119 client-process component 287 method 201 CrossValidator class 235 clients 332 computePRCurve method 233 CSC (Compressed Sparse cloning Spark in Action GitHub computeROCCurve Column) format 191 repository 20 method 233 CSV (comma-separated values) closed group 271 concat function 119 file 110 cluster manager, scheduling conf argument 296 cube functions 126–127 memory managed by 294 confidence parameter 38 cumeDist function 120 cluster mode, Mesos 355 Configuration object 165 current vertex 270–272, cluster options 74 configuration tag 133 275–278 cluster resource scheduling 290 configuring Spark 295–297 currentVertexId 278 cluster types 288–289 command-line currVertexId 275 Mesos cluster 289 parameters 295–296 currying 73 Spark local modes 289 master parameter 297 custom partitioners 90 Spark standalone cluster 288 setting programmatically 296 cyclic redundancy check. See YARN cluster 289 Spark configuration file 295 CRC cluster_name argument 324 system environment cluster-deploy mode variables 296 D 287, 307–308, 318, 320 viewing all configured clustering 219 parameters 297 -d DIR parameter 313 coalesce transformation 80 connected components DAGs (directed acyclic coarse-grained Spark mode, algorithm 266–267 graphs) 94 Mesos and 348 connectedComponents dashboard case study 83–382 code formatting 20 method 266 components of 366–367 code notation 20 Connection refused error 342 overview 364–366 cogroup method 158 constructing graphs 256–257 running application 367–373 cogroup transformation 85–86 consume-messages.sh script 371 source code 374–382 collect action 31 containers 332, 348–349 building projects 382 Column class 118 contingency table 249–251, 253 KafkaLogsSimulator column cosine similarities 200 continuous variables 186 project 374

426 INDEX dashboard case study (continued) data partitioning 74–82 converting to RDDs 124–125 StreamingLogAnalyzer data shuffling 76–80 creating from RDDs 108–115 project 374–381 caused by partitioner converting RDDs to WebStatsDashboard removal 78–79 DataFrames 112–115 project 381–382 optimizing with external creating SparkSession starting application shuffle service 79 object and importing manually 372–373 parameters that affect implicit methods 109 creating Kafka topics 372 79–80 example dataset 109 obtaining archives 372 when explicitly changing getting schema starting Log Analyzer partitioners 78 information 115 372–373 mapping data in of tuples 110–112 starting Log Simulator 373 partitions 81–82 grouping and joining starting Web Stats with glom 82 data 125–127 Dashboard 373 with mapPartitions and map- rollup and cube stopping application 370–371 PartitionsWithIndex 81 functions 126–127 troubleshooting 371–372 repartitioning RDDs 80–81 user-defined aggregate data analysis 182 with coalesce and functions 126 data cleaning 182 repartition 80 loading data 141–142 data collection 182 with partitionBy 80 from data sources data distribution 199 with repartitionAndSort- registered using SQL data dump 109 WithinPartition 81 141–142 data file 205 Spark data partitioners 74–76 from relational databases data grouping 91–94, 125–127 HashPartitioner 75 using JDBC 141 rollup and cube pair RDD custom missing values 123–124 functions 126–127 partitioners 75–76 performing joins 128–129 user-defined aggregate RangePartitioner 75 saving data 139–141 functions 126 data preparation 182 configuring writer 139 data samples 186 with combineByKey to relational databases with data shuffling 76–80 transformation 92–94 JDCB 140–141 with groupByKey and groupBy caused by partitioner with insertInto method 140 transformations 91–92 removal 78–79 with save method 140 data joining 82–88, 125–127 optimizing with external with saveAsTable classic join shuffle service 79 method 139–140 transformations 83–84 parameters that affect 79–80 with shortcut methods 140 combining two RDDs 86–87 when explicitly changing intersection partitioners 78 SQL functions 118–123 transformation 86 data sources 137 built-in scalar and aggregate joining RDDs data streaming 177–178 functions 118–119 with cogroup data visualization 38 user-defined transformation 85–86 data-type mappings 137 functions 122–123 with zip transformation 87 DataFrame API 48, 52, 115–118 window functions 119–122 with zipPartitions adding and renaming structured streaming 176–177 transformation 87–88 columns 117–118 DataSets 48, 129–130 removing common values 85 filtering data 117 date_add function 119 rollup and cube selecting data 116–117 date-time operations 119 functions 126–127 sorting data 118 datediff function 119 user-defined aggregate DataFrame construct 41 decision trees 239–244 functions 126 DataFrameNaFunctions evaluating model 244 data locality 12, 293 class 123, 145, 226 examining 243–244 data mapping, in partitions DataFrameReader object 49 example 239–241 81–82 DataFrames 106–129, 137–142 impurity and information with glom 82 built-in data sources 137–138 gain 241–242 with mapPartitions and JSON format 138 training decision-tree mapPartitionsWithIndex ORC format 138 model 242–243 81 Parquet format 138 DecisionTreeClassifier class 242

INDEX 427

DecisionTreeRegressor loading JSON 48–50 domain-specific language. See class 243 preparing GitHub archive DSL deep learning 384–407 dataset 46–48 dominant resource 353 accessing Flow UI 390–391 running applications from Dominant Resource Fairness defined 384–385 Eclipse 50–51 algorithm. See DRF H2O, defined 386 using entire dataset 57–58 dominant share 353 performing dfraw.show() method 225 dot symbol 186 classification 402–407 dfrawnona data frame 226 Double elements 36–37 building model through diag method 190 double RDD functions 36–39 Flow UI 403–405 dialects 131, 137 approximate sum and building model with Spar- dictionaries 67 mean 38–39 kling Water API 406 dimensions 185–186, 196 basic statistics with 37 loading and splitting directed acyclic graphs. See histograms 38 data 402–403 DAGs DoubleRDDFunctions class 37 stopping H2O cluster directed edges 255 downloading virtual 406–407 discrete values 186 machine 15–16 discretized streams performing regression DRF (Dominant Resource 103, 152–153 391–401 Fairness) algorithm 353 counting buy and sell loading data into driver component, orders 153 frame 391–394 responsibilities of 287 creating 150–152 using Flow UI 394–397 driver definition creating DStream parameters 136 using Sparkling Water object 151–152 API 398–401 driver failures, recovering downloading data to be from 175–176 starting cluster 389 streamed 150–151 starting Sparkling Water driver memory, setting 294–295 parsing lines 152–153 drop function 88 API 386–389 DISPLAY variable 41 from application 389 DROP TABLE command 134 distinct transformation 30–34 DSL (domain-specific with Spark shell 388–389 distinct() method 28, 33, 69 language) 106 with Sparkling Water distortion 249 dstAttr 258 shell 387–388 distributed matrices 192–193 dstId 256 deep neural network. See DNN BlockMatrix 193 DStream object DeepLearning object 399 CoordinateMatrix 192–193 combining two 158 IndexedRowMatrix 192 default partitioners 78 creating 151–152 linear algebra in 192–193 DemoUtils class 398, 403 dummy values 88 linear algebra operations dense local matrices 189–190 duplicatedInt() method 36 on 193 dense method 188–189 duplicatedString() method 36 RowMatrix 192 dense vectors 188 Duration object 150 DenseMatrix 190–191 distributed RDDs 27 distribution 5 dynamic resource allocation, denseRank function 120 YARN 344–345 DenseVector class 188 -Dlog4j.configuration parameter 343 dependencies 94 DNN (deep neural E Dependency Hierarchy tab network) 384 45, 58 Docker 356–360 -e argument 134 deploying model 183 building Spark Docker EBS (Elastic Block Store) 328 desc operator 118 image 357–358 --ebs-vol-num option 328 dest argument 274 installing 356–357 --ebs-vol-size option 328 destroy action 324 preparing Mesos to use 358 --ebs-vol-type option 328 Details link 300 running Spark tasks in Docker Eclipse developing applications 46–58 images 358–359 generating new projects aggregating data 52–53 using 357 in 41–46 broadcast variables 55–57 docker images command 357 running applications excluding non-employees docker ps command 357 from 50–51 53–55 docker rmi command 371 ecosystem 13–14

428 INDEX

Edge objects 256 F flows 390 edge property object 257 FN (false negatives) 232 EdgeContext object 258 -f argument 134 foldByKey 72–73, 79, 92 EdgeRDDs 10, 256 f function 67 folder symlink 21 edges 94 –f parameter 63 foldLeft method 72, 381 EdgeTriplet object 258, 262 f-measure 232, 238 foldRight method 72 edgeWeight function 274, 277 f1-score 232 for expression 53–54 Elastic Block Store. See EBS failure field 38 foreach method 27 elbow method 251 fair scheduling 291–292, foreachPartition function Employee class 89 335–336 169, 380 employees variable 53, 55 false argument 155 foreachRDD method 168 Encoders 109 false negatives. See FN format function 139–140 ensemble learning methods 239 false positive rate. See FPR FP (false positives) 232 entropy 241–242 false positives. See FP FPR (false positive rate) 233 Environment page (Spark web fan-in shape 212 fraction of variance UI) 302 fan-out shape 212 explained 204 fraction parameter 34 ERR key 367 farthest point distance 6 frames 119 errorsPerSecond 379 fat arrow function 26 fault-tolerance 175–176 framework’s scheduler 353 estimateDistance function 274 recovering from driver fullOuterJoin 79, 83 estimators 220 failures 175–176 fully distributed mode 334 Euclidian norm 199 recovering from executor function literals 26 evaluators 220 failures 175 event logging 321–322 feature bagging 244–245 G examples 186 feature extraction 182 --executor-cores option 340 feature scaling 202 Gaussian distribution 190 executors features 185 Gaussian mixture model obtaining data from with featureSubsetStrategy 219, 247 accumulators 97–99 parameter 245 GC (garbage collection) 276 accumulating values in FIFO scheduling 291, 335 generalized linear accumulable file footer 138 modeling 386 collections 99 file input streams 164–165 generalized low-rank writing custom filesystem master recovery 314 modeling 386 accumulators 98 fill method 123, 226 GeneralizedLinearAlgorithm RDD dependencies and Spark filter function 26–27, 49, class 214 execution 94–96 165, 262 get method 160, 399 recovering from failures 175 filter graph 257 get-master command 324, 328 responsibilities of 288 Filter line 144 getConf method 295 sending data to with broadcast finalrdd 95–96 getFrames command 392 variables 99–102 finalStream variable 163 getFrameSummary configuration parameters finalValue field 38 command 392, 403 that affect broadcast fine-grained Spark mode 348 getMetricName method 231 variables 100–102 first (column) function 120 getOrCreate method 175 destroying and unpersist- fit method 220 Gini impurity 241 ing broadcast flatMap function 28, 30–34, GitHub repository variables 100 152, 378 overview 198 specifying number of when flatMapValues function preparing archive dataset running applications in 71, 76, 153 46–48 standalone cluster 318–319 Flow command 395 Spark in Action, cloning 20 Executors page (Spark web Flow UI GitHubDay.scala file 57 UI) 303 accessing 390–391 glom transformation 82 exists method 160 classification 403–405 gossip protocol 55 exp function 118 regression 394–397 gradient descent explainedVariance 204 complex model 396–397 finding minimum with 197–198 eye(n) method 190 saving model 397 mini-batch stochastic 214–216

INDEX 429 gradient-boosting machines 386 user-defined aggregate HIVE_SERVER2_THRIFT_BIND GradientDescent functions 126 _HOST variable 135 optimizer 214–215 with combineByKey HIVE_SERVER2_THRIFT_PORT graph argument 274 transformation 92–94 variable 135 graph3dDst graph 280 with groupByKey and groupBy hive-jdbc--standalone.jar 136 GraphOps class 257, 266, 268 transformations 91–92 hive-site.xml file 133 GraphX 254–281 groupWith transformation 79 hive.metastore.warehouse.dir A* search algorithm 269–281 guaranteed capacity 335 property 131 calculating F, G, and H gwork graph 275, 278 HiveQL 9 values 277–278 $HOME variable 49 checkpointing working H housing.names file 205 graph 277 housingModel 399 collecting final path -h HOST parameter 312 hyper-parameters 394 vertices 278 H2O hyperplane 186 implementing 272–278 performing classification hypot function 119 initializing 274–275 with 402–407 hypothesis 195 main loop 275–276 building model through marking current vertex as Flow UI 403–405 I visited 276–277 building model with Spar- overview 270–272 kling Water API 406 i argument 134 selecting next current loading and splitting IAM (Identity and Access vertex 278 data 402–403 Management) 323 testing implementation stopping H2O cluster id column 117, 129 279–281 406–407 IDE (integrated development algorithms 263–269 performing regression environment) 40 connected with 391–401 --identity-file option 325 components 266–267 loading data into idsStr RDD 31 page rank 265–266 frame 391–394 immutable nature 116 presentation of using Flow UI 394–397 immutable RDDs 27 dataset 263–264 using Sparkling Water implicit conversion 36, 39 shortest-paths API 398–401 impurity 239 algorithm 264–265 H2OContext 387 independent variables 186, 196 strongly connected HADOOP_CONF_DIR IndexedRowMatrix 192 components 268–269 variable 338 INFO message 50 constructing graphs 256–257 hadoop-common-.jar 136 information gain 239, 241–242 transforming graphs 257–263 --hadoop-major-version initialMsg parameter 261 aggregating messages parameter 325 inputCols parameter 229 258–259 hard clustering 247 inputs 186 joining graph data 260–261 hash-based shuffling 79 inputTopic parameter 373 mapping edges and HashPartitioner data insertInto method 139–140 vertices 257–258 partitioner 75 --instance-type 325 selecting graph hdfs command 151 instances 186 subsets 262–263 HDFS daemons 21 Int argument 76 GraphX. See Spark GraphX hdfs-site.xml file 334 Int parameter 35 groupBy function 91–92, 125 -help argument 370 interaction term 208 groupByKey transformation --help parameter 313 intercept 194 91–92 heteroscedasticity 212 intersection transformation groupByKeyAndSortValues higher-order polynomials 79, 83, 86 transformation 90 207–209 invReduceFunc 164 groupByKeyAndWindow histogram action 38 IPPartitioner 374 operation 164 History Server 321–322 isEmp function 54–55 groupId 43 hit rate 232 isEmployee function 55 grouping data 91–94, 125–127 Hive metastore, configuring Iterable objects 85, 101 rollup and cube remote 133 iterateLasso method 213 functions 126–127 Hive support 108, 128 iterateLBFGS method 216

430 INDEX iterateLRwSGD function intersection KafkaUtils.createDirectStream 206, 215 transformation 86 method 168 iterateLRwSGDBatch joining RDDs Kerberized clusters 344 method 215 with cogroup --key-pair option 325 iterateRidge method 213 transformation 85–86 key-value pairs 66–67, 76 with zip transformation 87 keyboard shortcuts 50 J with zipPartitions keyBy transformation 67 transformation 87–88 keys --jars argument 135 removing common values 85 adding values to 71 jars parameter 59 rollup and cube counting values per key Java 20 functions 126–127 69–70 Java virtual machine. See JVM user-defined aggregate getting 69 JAVA_HOME variable 20 functions 126 looking up values for java.sql.Timestamp class joint probability 223 single 70 112, 152 jps command 313 keytab file 344 java.util.Properties 140–141 jq program 47 Kill option 300 java.util.Random 35, 190 JSON (JavaScript Object KMeans class 248 JavaPairRDD object 67 Notation) 9, 48–50 KMeansModel class 249 JavaRDDLike object 67 JSON data format 138 JavaScript Object Notation. See json method 49, 140–141 L JSON jth class 238, 241 JavaSparkContext.parallelize- JVM (Java virtual machine) 4 L1 norm 199 Pairs method 67 L2 norm 199 javax.jdo.option.Connection- K labeled points 186, 201 DriverName 133 LabeledPoint 201, 228 javax.jdo.option.Connection- k models 234 lag function 120 Password 133 k-fold cross-validation 213–214, Lasso regressions 212–213 javax.jdo.option.Connection- 234–236 LassoWithSGD class 213 URL 133 k-means clustering 246–253 last (column) function 120 javax.jdo.option.Connection- determining number of lastActivityDate 110, 119 UserName 133 clusters 251–253 latent Dirichlet allocation 219 JDBC clients evaluating model 249–251 launch action 324 connecting to Thrift server in Spark 248–249 lazily transformations 28 from 135–137 Kafka LBFGS optimizer 216–217 loading data from relational changing streaming applica- lead function 120 databases with 141 tion to use 167–172 leaf node 239–240 saving data to relational data- running example 172 leftCategories field 244 bases with 140–141 using Spark Kafka leftOuterJoin 79, 83–84 job scheduling 290–292 connector 167–168 length function 119 data-locality writing messages 168–170 LICENSE file 25 considerations 293 real-time dashboard case licLines 25–26 fair scheduling 291–292 study 372, 376, 380–381 likelihood function 223 FIFO scheduling 291 setting up 166–167 limit function 117 speculative execution of kafka-console-consumer.sh linear algebra 187–193 tasks 292 script 172 distributed matrices 192–193 Jobs page (Spark web UI) kafka-console-producer.sh BlockMatrix 193 298–299 script 172 CoordinateMatrix 192–193 join graph 257 kafka-python package 170 IndexedRowMatrix 192 join method 158 kafka.topic variable 373 linear algebra operations join transformation 83 KafkaLogsSimulator project 374 on 193 joining data 82–88, 125–127 KafkaProducerWrapper RowMatrix 192 classic join class 169–170, 374, 380 in Spark 187–193 transformations 83–84 kafkaStream 168 distributed matrices combining two RDDs 86–87 KafkaUtils class 376 192–193

INDEX 431 linear algebra (continued) local argument 151 LogisticRegressionWithLBFGS local vector and matrix local cluster mode 289, 305 class 229 implementations local machine, running Spark LogisticRegressionWithSGD 188–192 on 303–305 class 229 local matrices local cluster mode 305 logit 223 generating dense 189–190 local mode 304 LogLine class 376 generating sparse 190–191 local matrices LogSimulator class 374 linear algebra operations generating dense 189–190 LogStatsObserver.java 381 on 191–192 generating sparse 190–191 LogStatsReceiver.java 381 local vectors linear algebra operations lookup() method 70 generating 188–189 on 191–192 lrmodel 231 linear algebra operations sparse, generating 190–191 Lstat column 393 on 189 local mode 289, 304 linear regression 193–198, local vectors M 202–205, 214–217 generating 188–189 evaluating model linear algebra operations -m MEM parameter 313 performance 204 on 189 m2eclipse-scala 41 fitting and using linear Log Analyzer 372–373 machine learning 181–253 regression model 202–205 log function 119 algorithms 184–186, 205–214 evaluating model’s Log Simulator 366, 368–371, performance 204 avoiding overfitting 373–374, 382 212–213 interpreting model log-likelihood function 224 parameters 204–205 based on type of target log-loss 224 loading and saving variable 186 log-odds 223 model 205 bias-variance tradeoff and log2 value 245 predicting target values 203 model interpreting model logging YARN 343 complexity 209–210 parameters 204–205 login action 324 higher-order LBFGS optimizer 216–217 login property 51 polynomials 207–209 loading and saving model 205 logistic function 222 K-fold cross-validation mini-batch stochastic logistic regression 221–238 213–214 gradient descent 214–216 binary logistic regression plotting residual plots multiple 196–198 model 222–224 211–212 finding minimum with gra- evaluating classification step size and number of dient descent 197–198 models 231–234 iterations 206–207 finding minimum with precision-recall curve supervised and unsuper- normal equation 232–233 vised algorithms method 196–197 receiver operating charac- 185–186 optimizing 214–217 teristic curve 233–234 algorithms for, classification LBFGS optimizer 216–217 with BinaryClassification- of 184–186 SGD, minibatch 214–216 Evaluator 231–232 based on type of target overview 193 k-fold cross-validation variable 186 predicting target values 203 234–236 supervised and unsuper- simple linear regression multiclass logistic vised algorithms 194–196 regression 236–238 185–186 LinearRegression object 221 preparing data to use in analyzing data LinearRegressionModel Spark 224–229 column cosine class 202 categorical values 226–227 similarities 200 LinearRegressionSGD class 214 encoding data 228 data distribution 199 LinearRegressionWithSGD merging data 228–229 decision trees 239–244 class 202, 206 missing values 225–226 evaluating model 244 Linux cgroups containers 348 StringIndexer 227–228 examining 243–244 List class 29 training model 229–230 example 239–241 listrdd 96 LogisticRegression object impurity and information load method 141, 205 229, 235 gain 241–242

432 INDEX machine learning (continued) pipelines 221 maxIterations argument 274 training decision-tree transformers 220 maxIterations parameter 261 model 242–243 with Spark 187 maxTimeNewValues defined 184 make-distribution.sh variable 377 k-means clustering 246–253 command 355 maxTimeOldState variable 377 determining number of map edges 257, 280 mean method 199 clusters 251–253 map method 69 mean normalization 202 evaluating model 249–251 map operation 28 mean squared error. See MSE in Spark 248–249 map tasks 78 meanAbsoluteError 204 linear algebra in Spark map transformation 28–29, 32, meanApprox action 38 187–193 39, 79, 87, 96–97 mechanisms 295 distributed matrices mapEdges method 257 memory scheduling 294–295 192–193 mapPartitions driver memory 294–295 local matrices 189–192 transformation 81, 87 memory managed by cluster local vectors 188–189 MapPartitionsRDD 95–96 manager 294 linear regression 193–198, mapPartitionsWithIndex memory managed by 202–205, 214–217 transformation 81 Spark 294 evaluating model MappedType 160 merge combiners function 94 performance 204 mapping data in partitions merge function 76 interpreting model 81–82 mergeCombiners function 92 parameters 204–205 with glom 82 mergeMsg function LBFGS optimizer 216–217 with mapPartitions and 258–259, 261 loading and saving mapPartitionsWithIndex mergeValue function 92 model 205 81 merging values 92–93 mini-batch stochastic gradi- MapReduce component 5–6 Mesos 345–360 ent descent 214–216 maps 67 architecture of 346–349 multiple 196–198 mapSideCombine fine-grained and coarse- overview 193 parameter 92–93 grained Spark predicting target values 203 mapToPair transformation 67 modes 348 simple 194–196 mapTriplets method 258 master high-availability 349 logistic regression 221–238 MapValues transformation Mesos containers 348–349 binary logistic regression 70–71 configuring 350 model 222–224 mapVertices method 257–258 Docker 356–360 evaluating classification mapWithState method building Spark Docker models 231–234 156, 160–161 image 357–358 k-fold cross-validation married flag 259 installing 356–357 234–236 mask function 262 preparing Mesos to use 358 multiclass logistic master argument 68 running Spark tasks in regression 236–238 master high availability Docker images preparing data to use in 313–315, 349 358–359 Spark 224–229 master parameter 297, 304–305 using 357 training model 229–230 MASTER variable 387 installing 349–350 preparing data --master-instance-type 325 resource scheduling 353–354 computing covariance math calculations 119 Mesos attributes and Spark matrix 200–201 Matrices class 189–190 constraints 354 feature scaling and mean matrices, local resource allocations 353–354 normalization 202 dense 189–190 roles 353–354 splitting data 201 linear algebra operations weights 353–354 transforming to labeled on 191–192 starting 350–351 points 201 sparse 190–191 submitting applications random forests 244–246 maven-shade-plugin 59, 65 to 354–356 Spark ML library 219–221 max method 125, 199 making Spark available to estimators 220 maxBins parameter 243 slaves 354–355 evaluators 220 maxDepth parameter 243 running in cluster mode 355 parameters 220–221 maxIter parameter 248 web UI 351–352

INDEX 433

Mesos cluster 289 LBFGS optimizer 216–217 newline characters 165 MesosClusterDispatcher 355 mini-batch stochastic gradi- newly created 151 metadata file 205 ent descent 214–216 NO_PREF level 293 metadata.broker.list tweaking algorithm 205–214 node manager 332 parameter 167 adding higher-order NODE_LOCAL level 293 min method 118, 125, 199 polynomials 207–209 normal distribution 190 mini-batch, stochastic gradient avoiding overfitting normal equation method descent 214–216 by using 196–197 miniBatchFraction regularization 212–213 normL1 method 199 parameter 215 bias-variance tradeoff and normL2 method 199 minInfoGain parameter 243 model not a number. See NaN minInstancesPerNode complexity 209–210 notebooks 390 parameter 243 finding right step size and NP-complete 239 missing values, preparing data to number of ntile function 120 use in Spark 225–226 iterations 206–207 null values 123–124, 127, 129 missingProds 84 k-fold cross-validation nullable attribute 50 mkString class 32 213–214 nullable flags 113 MLlib plotting residual plots NullPointerExceptions 84 analyzing and preparing 211–212 --num-executors option 340 data 198–202 See also machine learning numberPartitions analyzing column cosine mode function 139 parameter 373 similarities 200 model evaluation 183 numbersSquared collection 30 analyzing data model training 183 numParts 164 distribution 199 ModelMetricsRegression numTrees parameter 245 computing covariance object 399 matrix 200–201 ModelSerializationSupport O feature scaling and mean class 401 normalization 202 MSE (mean squared error) 396 observations 186 splitting data 201 multiclass logistic occupation column 226 transforming to labeled regression 236–238 odds 223, 230 points 201 MulticlassMetrics class 236–237, off-heap allocation 144 fitting and using linear 245, 253 OLAP (online analytical regression model 202–205 multiple linear regression processing) 5 evaluating model’s 196–198 OLS (ordinary least performance 204 finding minimum with squares) 213 interpreting model gradient descent 197–198 OLTP (online transaction parameters 204–205 finding minimum with normal processing) 5 loading and saving the equation method 197 on-heap allocation 144 model 205 multiply method 192 one-hot encoding 227 predicting target values 203 MultivariateStatisticalSummary one-to-one dependencies 95 linear algebra in Spark object 199 oneHotEncodeColumns 187–193 mutable state 27 method 228 distributed matrices OneHotEncoder class 228 192–193 N ones(m, n) method 190 local vector and matrix onethird value 245 implementations n elements 91, 154 OneVsRest class 236–237 188–192 NaN (not a number) 123 OneVsRestModel 236 linear regression 193–198 nano 24 online analytical processing. See multiple linear narrow dependencies 95 OLAP regression 196–198 native_country column 226 online transaction processing. overview 193 New Maven Project wizard 42 See OLTP simple linear newFilesOnly flag 165 oomLines object 13 regression 194–196 newG graph 277 open group 271 optimizing linear newGraphExt graph 259–260 OpenJDK 15 regression 214–217 NewInputFormat 165 opt method 113

434 INDEX optimizer object 214 pageRank method 266 with partitionBy 80 optimizers 142–144 pair RDDs (resilient distributed with repartitionAndSort- examining execution datasets) 67–73 WithinPartition 81 plan 143–144 adding values to keys 71 Spark data partitioners 74–76 taking advantage of partition changing values 70–71 HashPartitioner 75 statistics 144 counting values per key 69–70 pair RDD custom option function 139 creating 67–68 partitioners 75–76 Option object 84 custom partitioners 75–76 RangePartitioner 75 options function 139 getting keys and values 69 partitions 12, 157, 161, 164, options parameter 140 grouping all values of key 73 166, 174 Oracle VirtualBox 15 looking up values for single partitions.size field 74 ORC (optimized row key 70 pasting code 33, 63 columnar) 138 merging values of key paths_finished.tsv file 265 orc method 140–141 using foldByKey payload.ref_type 48 Order class 152 transformation 72–73 peer-to-peer protocol 55 orderable keys 89 using reduceByKey penlpoints DataFrame 242, 248 orderBy function 120 transformation 71–72 percentRank function 120 ordered trait 89–90 PairDStreamFunctions 153, 156 performance tuning 172–175 Ordering object 90 PairRDDFunctions class 67, 153 increasing parallelism 174 ordinary least squares. See OLS ParallelCollectionRDD 96 limiting input rate 174–175 org.apache.spark package parallelization 5 lowering processing time 174 338, 343 parallelize method 29 persist method 276 org.apache.spark.mllib.linalg Param class 220 PersonExt object 260 package 188 ParamMap class 220 physical plan 142–144 org.apache.spark.mllib.regres- ParamMaps class 235 PipelineModel 221 sion package 202 ParamPair class 220 pipelines 187 org.apache.spark.mllib.stat.Sta- parent RDD 94 placeholder syntax 30 tistics object 201 Parquet data format 138 points 186 org.apache.spark.sql.types Parquet files 9 package 114 parquet method 140–141 PolynomialExpansion transformer 221 origin argument 274 ParquetRelation 144 origNode attribute 275 parsed logical plan 142, 144 pom.xml file 45 outerJoinVertices method 260 PartialResult object 38 postscript area 138 outliers 211 partition 74 postsDf 123 OutOfMemoryError 11–12 partitionBy function 80, 120, 139 postsFiltered 143 outputCol parameter 229 Partitioner object 80–81, 83, 90, postTypeId 110 outputTopic parameter 373 157, 161 power-iteration clustering over function 120 partitioning data 74–82 219, 247 overfitting data shuffling 76–80 PR (page rank) algorithm avoiding by using caused by partitioner 265–266 regularization 212–213 removal 78–79 PR (precision-recall) curve avoiding using optimizing with external 232–233 regularization 212–213 shuffle service 79 precision 232 overview of 209, 239, 245 parameters that affect predec attribute 275 ownerUserId 110 79–80 predec fields 278 when explicitly changing predicates 141–142 P partitioners 78 predict method 202 mapping data in prediction column 231 -p PORT parameter 312 partitions 81–82 preferred location 293 Package Explorer window, with glom 82 Pregel implementation 261 Eclipse 44 with mapPartitions and map- preparing data --packages option 388 PartitionsWithIndex 81 computing covariance --packages parameter 170 repartitioning RDDs 80–81 matrix 200–201 page rank algorithm. See PR with coalesce and feature scaling and mean algorithm repartition 80 normalization 202

INDEX 435 preparing data (continued) rank function 120 real-time dashboard case splitting data 201 ratio column 118, 144 study 197–382 transforming to labeled rawPrediction column 231 components of 366–367 points 201 RDBMS (relational database overview 364–366 preprocessing function 262 management system) 8 running application 367–373 preservePartitioning RDDs (resilient distributed source code 374–382 parameter 81 datasets) 8, 27 building projects 382 principal component basic actions and KafkaLogsSimulator analysis 386 transformations 27–36 project 374 print(n) method 154 distinct and flatMap StreamingLogAnalyzer printContingency method 250 transformations 30–34 project 374–381 printMat method 200 map transformation 28–29 WebStatsDashboard printSchema method obtaining RDD’s elements project 381–382 50, 111, 115 with sample, take, and starting application priori 186 takeSample manually 372–373 probability column 231 operations 34–36 creating Kafka topics 372 PROCESS_LOCAL level 293 combining two 86–87 obtaining archives 372 ProducerConfig object 168 converting DataFrames starting Log Analyzer program flow 10–13 to 124–125 372–373 project root 43 converting to DataFrames starting Log Simulator 373 --properties-file FILE by specifying schema starting Web Stats parameter 313 114–115 Dashboard 373 property tag 133 using case classes 112–113 stopping application 370–371 pseudo-distributed mode 334 creating DataFrames troubleshooting 371–372 purity 250 from 108–115 reboot-slaves option 324, 329 pushes 46 converting RDDs to recall 232 PushEvent 52 DataFrames 112–115 receiver operating characteristic pyspark 23 creating SparkSession curve. See ROC object and importing reduce function 162, 164 Python files 320 implicit methods 109 reduce tasks 78 Python shell 23 example dataset 109 reduceByKey function 79, 92, getting schema 96, 153 Q information 115 reduceByKeyAndWindow RDDs of tuples 110–112 method 162, 164 quantitative variables 186 dependencies 94–97 reduceByWindow operation 164 queue 335 saving RDD lineage with reduced RDD 96 --queue parameter 339 checkpointing 97 reduceFunc function 164 Spark execution and 94–96 --region option 325 R Spark stages and tasks regParam argument 213 96–97 regParam parameter 235 r2 204 joining regression algorithm 186 RACK_LOCAL level 293 with cogroup regression analysis model 182 rand method 190 transformation 85–86 regression evaluators 220 randn method 190 with zip transformation 87 RegressionMetrics class 204 random forests 244–246 with zipPartitions regularization, avoiding over- RandomForestClassifier transformation 87–88 fitting by using 212–213 class 245 repartitioning 80–81 relational database management RandomForestRegressor with coalesce and relations 142 class 245 repartition 80 Relationship class 257 randomSplit method 201, 229 with partitionBy 80 remove method 161 range dependencies 95 with repartitionAndSort- repartition method 174 rangeBetween function 120 WithinPartition 81 repartition transformation 80 RangePartitioner data read method 58 repartitionAndSortWithin- partitioner 75 read-eval-print loop. See REPL Partition 80–81, 88, 90

436 INDEX

REPL (read-eval-print loop) changing instance types 325 SchemaRDD 48 7, 23 customizing security schematool 133 replace function 124 groups 325–326 scikit-learn library 219 REQ key 367 launching cluster 326 scope property 59 reqsPerSecond 379 specifying credentials 325 score method 400 residual plots 211–212 destroying cluster 329 search algorithm 269–281 resilient distributed datasets. See prerequisites for 323–324 calculating F, G, and H RDDs creating key pairs 323–324 values 277–278 resource manager 332 obtaining AWS secret checkpointing working resource scheduling keys 323 graph 277 cluster resource using cluster 327–329 collecting final path scheduling 290 cluster configuration 328 vertices 278 for Mesos 353–354 connecting to master 328 implementing 272–278 Mesos attributes and Spark logging in 327–328 initializing 274–275 constraints 354 rebooting 328–329 main loop 275–276 resource allocations starting 328–329 marking current vertex as 353–354 stopping 328–329 visited 276–277 roles 353–354 -runSeconds=NUM overview 270–272 weights 353–354 argument 370 selecting next current for YARN 335–336 runtime architecture 286–289 vertex 278 capacity scheduler 335 cluster types 288–289 testing implementation fair scheduler 336 Mesos cluster 289 279–281 FIFO scheduler 335 Spark local modes 289 Secure Sockets Layer. See SSL Spark standalone memory scheduling 294–295 securities 148 cluster 288 driver memory 294–295 seed parameter 35, 249 YARN cluster 289 memory managed by cluster SELECT * FROM POSTS components 286–288 manager 294 command 141 client-process memory managed by select function 116 Spark 294 component 287 selling securities 148 --resources parameter 354 driver component 287 sendMsg function 258–259, 261 respawn option 314 executors 288 sendToDst 259, 277 Ridge regressions 212–213 Spark context 288 sendToSrc 259, 277 RidgeRegressionWithSGD class 213 S sensitivity 232 rightCategories field 244 sepal length 185 sepal width 185 rightOuterJoin sample action 35 transformation 83–84 sample operation 34–36 Seq interface 29, 157 RM_hostname 338 Sandbox link 352 serializer parameter 93 RM_port 338 SASL (Simple Authentication SESS key 367 RMSE (root mean squared and Security Layer) 346 sessionCount 379 error) 206 save method 140, 205 sessionTimeout parameter 373 ROC (receiver operating charac- saveAsTable method 139 SET command 128 teristic) curve 233–234 saveAsTextFiles method 153 setAppName method 298 rollup functions 126–127 sc variable 25, 150 setEstimatorParamMaps root mean squared error. See sc.getConf.getAll method 297 method 235 RMSE Scala IDE plug-in 41 setMetricName 232 RowMatrix object 192, 199–201 Scala Maven integration setThreshold method 233 rowsBetween function 120 plug-in 41–42 SGD (stochastic gradient rsync program 326 scala.util.Random 35 descent) 214–216 run method 264–265, 274–275 scala> prompt 23, 33 Shannon entropy 241 running standalone scaladocs 58 shell scripts, starting standalone clusters 322–329 scalar functions, built-in 118–119 cluster with 309–312 creating cluster 324–326 SCCs (strongly connected compo- start-all.sh 312 changing Hadoop nents) algorithm 268–269 start-master.sh 309–311 version 325 scheduler pools 292 start-slaves.sh 311

INDEX 437 shortcut methods, saving data benefits of 6–8 spark-in-action virtual with 140 components 8–10 machine 14–17 shortest-paths algorithm Spark Core 8–9 downloading and 264–265 Spark GraphX 10 starting 15–16 ShortestPaths object 264 Spark MLlib 10 stopping 16–17 shouldVisit* function 277 Spark SQL 9–10 web UI 297–303 shouldVisitDestination Spark Streaming 10 Environment page 302 argument 274 configuration 295–297 Executors page 303 shouldVisitSource command-line Jobs page 298–299 argument 274 parameters 295–296 Stages page 300–301 show-dashboard-log.sh Storage page 302 master parameter 297 script 371 Spark API 97–102 setting programmatically shuffle dependencies 95 accumulators 97–99 296 shuffle parameter 80 accumulating values in shuffle-map tasks 97 Spark configuration accumulable ShuffledRDD 96 file 295 collections 99 ShuffleRDD 95 system environment writing custom 98 shuffling data 76–80 variables 296 broadcast variables 99–102 caused by partitioner viewing all configured configuration parameters removal 78–79 parameters 297 that affect 100–102 optimizing with external ease of use 6–8 destroying and shuffle service 79 ecosystem 13–14 unpersisting 100 parameters that affect 79–80 generating new projects in data partitioning 74–82 when explicitly changing Eclipse 41–46 data shuffling 76–80 partitioners 78 job scheduling 290–292 mapping data in -silent argument 370 data-locality partitions 81–82 Simple Authentication and Secu- considerations 293 repartitioning RDDs 80–81 rity Layer. See SASL fair scheduling 291–292 Spark data partitioners simple linear regression FIFO scheduling 291 74–76 194–196 speculative execution of grouping data 91–94 size method 189 tasks 292 with combineByKey Skip Tests check box 61 linear algebra in 187–193 transformation 92–94 slave processes 307 distributed matrices with groupByKey and slaves file 334 192–193 groupBy slideDur 163 local matrices 189–192 transformations 91–92 slope value 194 local vector and matrix joining data 82–88 socket input streams 165 implementations classic join socketStream method 165 188–192 transformations 83–84 socketTextStream method 165 local vectors 188–189 combining two RDDs 86–87 soft clustering 247 machine learning with 187 intersection sort-based shuffling 79 transformation 86 problems solved by 5 sortable RDDs 89 joining RDDs 85–88 program flow 10–13 sortBy transformation 88 removing common sortByKey transformation resource scheduling values 85 79, 88–90 cluster resource pair RDDs 67–73 sorted RDDs 75 scheduling 290 adding values to keys 71 sorting data 88–91 memory scheduling changing values 70–71 fetching sorted elements 91 294–295 counting values per key making classes orderable running on local 69–70 89–90 machine 303–305 creating 67–68 performing secondary local cluster mode 305 getting keys and values 69 sorts 90–91 local mode 304 grouping all values 73 Spark runtime architecture 286–289 looking up values for single anti-patterns 8 cluster types 288–289 key 70 as unifying platform 8 components 286–288 merging values 71–73


Spark API (continued) SQL commands 130–137 performance tuning 172–175 RDD dependencies 94–97 executing SQL increasing parallelism 174 saving RDD lineage with queries 133–134 limiting input rate 174–175 checkpointing 97 Hive metastore 133 lowering processing Spark execution and 94–96 table catalog 131–133 time 174 Spark stages and tasks Thrift server 134–137 saving computation state over 96–97 Spark standalone cluster time 155–161 sorting data 88–91 288–329 combining two fetching sorted components of 307–308 DStreams 158 elements 91 History Server and event keeping track of state making classes logging 321–322 156–157 orderable 89–90 running applications in mapWithState performing secondary 317–321 method 160–161 sorts 90–91 application automatic specifying checkpointing Spark context, creation of 288 restart 320–321 directory 158–159 Spark Core 8–9 killing applications 320 starting streaming context Spark GraphX 10 location of driver 317–318 and examining new Spark in Action GitHub specifying extra classpath output 159–160 repository 20 entries and files saving results to file 153–154 Spark in Action option 42 319–320 socket input streams 165 Spark ML library 219–221 specifying number of starting and stopping stream- estimators 220 executors 318–319 ing computation 154–155 evaluators 220 running on Amazon examining generated parameters 220–221 EC2 322–329 output 155 pipelines 221 creating cluster 324–326 sending data to Spark transformers 220 destroying cluster 329 Streaming 154–155 Spark shell 23–25 prerequisites for 323–324 stopping streaming Spark SQL using cluster 327–329 context 155 Catalyst optimizer 142–144 starting 308–315 structured streaming 176–178 examining execution manually 312 creating DataFrame plan 143–144 master high availability and 176–177 taking advantage of parti- recovery 313–315 data streaming 177–178 tion statistics 144 viewing Spark executions 178 DataFrames 106–129, processes 313 future of 178 137–142 with shell scripts 309–312 window operations for time- adding and renaming web UI 315–316 limited calculations columns 117–118 Spark Streaming 10, 148–179 162–164 built-in data sources creating streaming spark symlink 22 137–138 context 150 Spark Table catalog, working converting to RDDs discretized streams with 132–133 124–125 creating 150–152 SPARK_CLASSPATH creating from RDDs using 152–153 variable 319, 339 108–115 example application 148–149 SPARK_DAEMON_JAVA_OPTS filtering data 117 external data sources variable 322 grouping and joining 165–172 SPARK_DAEMON_MEMORY data 125–127 changing streaming applica- variable 310, 322 loading data 141–142 tion to use Kafka SPARK_DRIVER_MEMORY missing values 123–124 167–172 variable 296, 340 performing joins 128–129 setting up Kafka 166–167 SPARK_EXECUTOR_MEMORY saving data 139–141 fault-tolerance 175–176 variable 340 selecting data 116–117 recovering from driver SPARK_HISTORY_OPTS sorting data 118 failures 175–176 variable 322 SQL functions 118–123 recovering from executor SPARK_HOME variable 22, 387 performance improvements failures 175 SPARK_MASTER_IP with Tungsten 144–145 file input streams 164–165 variable 310–311, 317


SPARK_MASTER_OPTS spark.deploy.spreadOut spark.mesos.executor.docker variable 310 parameter 310 .image parameter 358 SPARK_MASTER_PORT spark.deploy.zookeeper.dir spark.mesos.executor.docker variable 310–311 parameter 314 .portmaps parameter 359 SPARK_MASTER_WEBUI_PORT spark.driver.extraClassPath spark.mesos.executor.home variable 310 parameter 319 parameter 354 SPARK_PUBLIC_DNS spark.driver.extraClassPath spark.mesos.executor.memory- variable 322 variable 339 Overhead parameter 356 SPARK_WORKER_CORES spark.driver.memory spark.mesos.extra.cores variable 311, 318, 328 parameter 294, 340 parameter 356 SPARK_WORKER_DIR spark.driver.userClassPathFirst spark.mesos.mesosExecutor variable 311 parameter 320 .cores parameter 355 SPARK_WORKER_INSTANCES spark.dynamicAllocation spark.python.worker.reuse 100 variable 328 .enabled parameter 344 spark.scheduler.allocation.file SPARK_WORKER_MEMORY spark.dynamicAllocation.initial- parameter 292 variable 311 Executors parameter 345 spark.scheduler.mode SPARK_WORKER_OPTS spark.dynamicAllocation.max- parameter 291 variable 311 Executors parameter 345 spark.shuffle.consolidateFiles SPARK_WORKER_PORT spark.dynamicAllocation.min- parameter 79 variable 311 Executors parameter 345 spark.shuffle.manager SPARK_WORKER_WEBUI spark.eventLog.dir 321 parameter 79, 145 _PORT variable 311 spark.executor.cores spark.shuffle.memoryFraction SPARK_YARN_QUEUE parameter 341, 388 parameter 79, 294, 340 variable 339 spark.executor.extraClassPath spark.shuffle.service.enabled spark-ec2 script 324 parameter 319 parameter 345 spark-env.sh file 296 spark.executor.extraJava- spark.shuffle.spill parameter 79 Options option 343 spark-in-action 68, 368–370 spark.shuffle.spill.compress spark.executor.instances spark-shell command parameter 79 parameter 341, 388–389 68, 134–135, 295 spark.speculation spark.executor.memory spark-shell script 287 parameter 292 parameter 294, 341 spark.speculation.interval 292 spark-sql command 134 spark.executor.memoryOver- spark-submit command 134–135, head parameter 340 spark.speculation.multiplier 292 295, 298, 313, 317, 320 spark.executor.uri spark.speculation.quantile 292 spark-submit script 59, 62–65, parameter 355 spark.sql.shuffle.partitions 128 287, 304 spark.executor.userClassPath- spark.sql.sources.default 139 spark.broadcast.blockSize 100 First parameter 320 spark.sql.tungsten.enabled 144 spark.broadcast.compress 100 spark.ext.h2o.cluster.size spark.storage.memoryFraction spark.cores.max parameter parameter 389 parameter 294, 340 310, 318, 355, 388 spark.history.ui.port spark.streaming.backpressure spark.dead.worker.persistence parameter 321 .enabled parameter 175 parameter 310 spark.io.compression.codec 100 spark.streaming.kafka.maxRate- spark.default.parallelism spark.locality.wait PerPartition 174 parameter 75–76 parameter 293 spark.streaming.receiver.max- spark.default.partitions 83 spark.locality.wait.node 293 Rate 174 spark.deploy.defaultCores spark.locality.wait.process 293 spark.ui.enabled parameter parameter 310, 318 spark.locality.wait.rack 293 298 spark.deploy.recoveryDirectory spark.master.rest.enabled spark.ui.killEnabled parameter 314 parameter 310 parameter 300 spark.deploy.recoveryMode spark.master.rest.port spark.unsafe.offHeap parameter 314–315 parameter 310 parameter 144 spark.deploy.retainedApplica- spark.mesos.coarse spark.worker.cleanup.app- tions parameter 310 parameter 348 DataTtl parameter 312 spark.deploy.retainedDrivers spark.mesos.constraints spark.worker.cleanup.enabled parameter 310 parameter 354 parameter 312

440 INDEX spark.worker.cleanup.interval SQL commands 130–137 viewing Spark parameter 312 executing SQL queries processes 313 spark.worker.timeout 133–134 with shell scripts 309–312 parameter 310, 312 Hive metastore 133 web UI 315–316 spark.yarn.archive loading data from data standard deviation 199 parameter 339 sources registered standardized dataset 248 spark.yarn.driver.memoryOver- with 141–142 StandardScaler object 202 head parameter 340 table catalog 131–133 start action 324 spark.yarn.jars configuration Thrift server 134–137 start-all.sh file 335 parameter 339 connecting from third-party start-dashboard.sh 368 spark.yarn.queue JDBC clients 135–137 start-hdfs.sh file 335 parameter 339 connecting to with start-history-server.sh 321 SparkConf object 60 Beeline 135 start-kafka.sh 368 SparkContext class 75, 150 starting 135 start-master.sh script 309–311 SparkContext.accumulable- SQL functions 118–123 start-simulator.sh 368 Collection() method 99 built-in scalar and aggregate start-slaves.sh script 311 SparkContext.broadcast functions 118–119 start-thriftserver.sh method 100 user-defined functions command 135 SparkContext.setCheckpoint- 122–123 start-yarn.sh file 335 Dir() method 97 window functions 119–122 starting Sparkling Water API 386–389 sqrt value 245 Spark shell 23–25 classification 406 srcAttr 258 virtual machine 15–16 regression 398–401 srcId 256 State object 160 building model 398–399 ssh command 19 stateSnapshots method 161 evaluating model 399–400 SSL (Secure Sockets Layer) 346 StateSpec class 160 loading and saving Stack Overflow 109 statistics, with double RDD models 401 Stage Details page 293, 300–301 functions 37 splitting data 398 Stages page (Spark web stats().stdev 37 transferring results back to UI) 300–301 stats().variance 37 Spark 400–401 standalone (local) mode 334 step size parameter 198, 206 standalone clusters 307–329 stochastic gradient descent. See starting components of 307–308 SGD from application 389 History Server and event stop method 324, 406–407 with Spark shell 388–389 logging 321–322 stop-all.sh file 335 with Sparkling Water running applications in stop-dfs.sh script 21 shell 387–388 317–321 stop-hdfs.sh file 335 sparkling-water-examples application automatic stop-history-server.sh 321 package 388 restart 320–321 stop-yarn.sh file 335 SparkPi example 358 killing applications 320 stopping, virtual machine 16–17 SparkSession class 49 location of driver 317–318 Storage page (Spark web SparkSession.udf.register specifying extra classpath UI) 302 function 122 entries and files StorageLevel argument 165, 276 sparse local matrices 190–191 319–320 stragglers 292 sparse method 188, 190 specifying number of streaming 148–179 sparse vectors 188 executors 318–319 creating streaming SparseMatrix object 191 running on Amazon context 150 SparseVector class 188 EC2 322–329 discretized streams speculative execution 292 creating cluster 324–326 creating 150–152 speye method 190 destroying cluster 329 using 152–153 Split button 393 prerequisites for 323–324 example application 148–149 split field 243 using cluster 327–329 external data sources 165–172 splitAndSend.sh script 154, 160 starting 308–315 changing streaming applica- splitting data 201 manually 312 tion to use Kafka sprand method 190 master high availability and 167–172 sprandn method 190 recovery 313–315 setting up Kafka 166–167

INDEX 441 streaming (continued) counting requests per T fault-tolerance 175–176 second 377–378 recovering from driver initializing Kafka input table catalog 131–133 failures 175–176 stream 376 table function 141 recovering from executor initializing Spark context 375 tags column 122 failures 175 parsing access-log lines 376 tail command 62 file input streams 164–165 sending results to Kafka take operation 28, 34–36 performance tuning 172–175 380–381 takeOrdered action 91 increasing parallelism 174 streamOrders.sh script 172 takeSample operation 34–36 limiting input rate 174–175 String argument 50 target variable 218, 221, 225, lowering processing string operations 119 230, 239, 241 time 174 String prefix 153 task slots 286, 288–289, 291–293 saving computation state over String suffix 153 test set 186 time 155–161 text categorization 219 StringImprovements class 112 combining two textFile method 75, 155 StringIndexer class DStreams 158 textFileStream method 151, 164 227–228, 242 keeping track of state TF-IDF (term frequency-inverse 156–157 StringIndexerModel 227 document frequency) 247 mapWithState stringToPost function 114 -thinkMax=NUM argument 370 method 160–161 stringToRow function 114, 124 -thinkMin=NUM argument 370 specifying checkpointing stripes 138 Thread Dump link 303 directory 158–159 strongly connected components Thrift server 134–137 starting streaming context stronglyConnectedComponents connecting from third-party and examining new method 268 JDBC clients 135–137 output 159–160 structured streaming 176–178 connecting to with saving results to file 153–154 creating DataFrame 176–177 Beeline 135 socket input streams 165 data streaming 177 starting 135 starting and stopping stream- executions 178 tilde 49 ing computation 154–155 Time argument 160 examining generated future of 178 Timeline link 298 output 155 subgraph method 262 TN (true negatives) 232 sending data to Spark submitting applications 58–65 toArray method 189 Streaming 154–155 adapting applications 60–62 toBreeze function 189 stopping streaming building uberjar 59–60 context 155 spark-submit shell script toBreezeD function 192, 200 structured streaming 176–178 62–65 toBreezeM function 192 creating DataFrame subsets 242 toBreezeV function 189 176–177 substring function 118 toDebugString 96 toDense method 191 data streaming 177–178 subtract transformation 79, 85 toDF method 111–112 executions 178 subtractByKey toMax column 121 future of 178 transformation 85 top action 91 window operations for time- SUCCESS file 64 limited calculations -topic=NAME argument 370 sum 38–39 toRowMatrix method 192 162–164 sumApprox action 38 StreamingContext.getOrCreate() toSparse method 191 sun.misc.Unsafe class 144 method 175 toString method 30 StreamingLogAnalyzer supersteps 261 total variable 93 project 374–381 --supervise option 175, 320 totalsAndProds 88, 94 combining all statistics 379–380 supervised algorithms 185–186 totalsByProd 85 counting active sessions supervised learners 184 totalsWithMissingProds 84 376–377 SVMs (support vector TP (true positives) 232 counting ad clicks per machines) 10, 219 TPR (true positive rate) 232 second 378 symbolic links 20–21 tracking UI 342 counting errors per system environment tracking URL 342 second 378 variables 296 train method 202

442 INDEX trainHPScaled 213, 215 -usersToRun=NUM W training set 186, 201 argument 370 trainModel method 399 Utils.random.nextLong 35 WARN message 50 tranData 68, 82 water.Job object 399 tranFile 68 V Web Log Analyzer 366 transByCust 68, 93 Web Stats Dashboard 364–365, transform function 73, 76 Vagrant 15 367–369, 371–372, 382 transforming graphs 257–263 vagrant up command 16 web UI 297–303 aggregating messages Vagrantfile 19 Environment page 302 258–259 validate method 193 Executors page 303 joining graph data 260–261 validation set 201 for Mesos 351–352 mapping edges and validHPScaled 213, 215 for standalone cluster 315–316 vertices 257–258 validPredicts 203 for YARN 341–342 selecting graph subsets value method 98 Jobs page 298–299 262–263 values Stages page 300–301 transforming values 73 adding to keys 71 Storage page 302 translation tag 122 changing 70–71 webstats.js 381 TraversableOnce 31–32, 71 counting per key 69–70 WebStatsDashboard trim function 119 getting 69 project 381–382 tripletFields argument 259 grouping 73 WebStatsEndpoint.java 381 TripletFields class 259 looking up 70 --webui-port PORT true negatives. See TN merging 71–73 parameter 312 true positive rate. See TPR missing 123–124 weights 195, 353–354 true positives. See TP removing common 85 where clause 141 tungsten-sort parameter 145 values transformation 69 which command 20 tuples 110–112, 193 variables 186 wide dependencies 95 Variables button 61 wikiSCC graph 268 U variance method 199 Wikispeedia dataset 269 VectorAssembler class 228–229 window functions 119–122 Vectors class 188–189 uberjar 59 window operations, for time- vectors, local UCI Boston housing dataset 194 limited calculations 162–164 generating 188–189 WindowSpec object 120 udf function 122 linear algebra operations UDFs (user-defined winDur 163 on 189 with replacement design 34 functions) 54 vertex program (vprog) 261 uncached data 276 withColumn method 119 VertexRDDs 256 withColumnRenamed underfitting 209 vertices 255 unifying platform 8 function 117 View button 396 without replacement design 34 union array 31 viewCount 110, 124 union method 158 withReplacement parameter 34 visited flag 275 workclass column 226 union transformation 95 visited property 277 working graph 275 unique strings 35 VM (virtual machine) 14–17, WorkNode class 275 unpersist method 276 19–23 WrappedArray class 70 unpersistVertices method 276 cloning Spark in Action UnsafeRow class 144 GitHub repository 20 unsupervised algorithms downloading and starting X 185–186 15–16 unsupervised learners 184 examining Spark installation XFS filesystem 79 update method 161 of 22–23 -Xmx Java option 295 updated method 124 finding Java 20 updater property 213 stopping 16–17 Y updateStateByKey method using Hadoop installation 156–157, 159–160, 375, 377 of 21 YARN (yet another resource --use-existing-master option 326 votes columns 129 negotiator) 8, 332–345 user-defined functions. See UDFs VPC (Virtual Private Cloud) 326 architecture of 332–334


YARN (continued) log aggregation 343 yarn.nodemanager.resource.mem- configuring 334 resource scheduling 335–336 ory-mb parameter 341 configuring resources for capacity scheduler 335 yarn.resourcemanager.scheduler Spark jobs 339–341 fair scheduler 336 .class 335 configuring executor FIFO scheduler 335 yarn.scheduler.maximum-alloca- resources running applications on tion-mb parameter 341 programmatically 341 stopping 338 yarn.scheduler.minimum-alloca- Spark memory tion-mb parameter 341 submitting 336–338 management 340–341 year function 119 specifying CPU security considerations 344 starting and stopping resources 340 Z configuring Spark on 338–339 334–335 modifying classpath and web UI 341–342 zeros(m, n) methods 190 sharing files 339 YARN_CONF_DIR variable 338 zeroValue argument 72–73 sharing Spark assembly yarn-env.sh file 334 JAR 339 zip function 87 yarn-site.xml file 334 zip transformation 83, 87 specifying YARN queue 339 yarn.log-aggregation-enable dynamic resource zipAll function 88 parameter 343 allocation 344–345 zipPartitions installing 334 yarn.nodeamanager.log-dirs transformation 87–88 logs on 343 parameter 343 zipWithIndex method 205, 263 configuring logging yarn.nodemanager.remote-app- ZooKeeper recovery 314–315 level 343 log-dir parameter 343 zookeeper.address variable 373


TYPICAL STEPS IN A MACHINE-LEARNING PROJECT

[Diagram: 1. Data collection → 2. Data clean-up and preparation → 3. Data analysis and visualization → 4. Training the model → 5. Model evaluation → 6. Using the model for predictions on new data]
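The same six steps map onto Spark's ML API covered in chapters 7 and 8. What follows is a minimal, illustrative sketch, not an excerpt from the book's examples: the input file housing.csv and its size, rooms, and price columns are hypothetical placeholders, and a real project involves far more exploration and tuning than shown here.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator

object MlWorkflowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ml-workflow-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // 1. Data collection: load the (hypothetical) raw dataset
    val raw = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("/path/to/housing.csv")

    // 2. Data clean-up and preparation: drop incomplete rows and
    //    assemble the feature columns into a single vector column
    val cleaned = raw.na.drop()
    val assembler = new VectorAssembler()
      .setInputCols(Array("size", "rooms"))
      .setOutputCol("features")
    val prepared = assembler.transform(cleaned)

    // 3. Data analysis and visualization: a quick look at summary statistics
    prepared.describe("price").show()

    val Array(train, test) = prepared.randomSplit(Array(0.8, 0.2), seed = 42)

    // 4. Training the model
    val lr = new LinearRegression().setLabelCol("price").setFeaturesCol("features")
    val model = lr.fit(train)

    // 5. Model evaluation: RMSE on the held-out test set
    val rmse = new RegressionEvaluator()
      .setLabelCol("price")
      .setMetricName("rmse")
      .evaluate(model.transform(test))
    println(s"Test RMSE: $rmse")

    // 6. Using the model for predictions on new data
    val newData = Seq((120.0, 3.0)).toDF("size", "rooms")
    model.transform(assembler.transform(newData)).show()

    spark.stop()
  }
}

Run in the Spark shell, the same code works without the surrounding object and main method, because a SparkSession and its implicits are already in scope.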

A SPARK STANDALONE CLUSTER WITH AN APPLICATION IN CLUSTER-DEPLOY MODE

[Diagram: the client JVM holds only the Spark application that submits the job; the driver JVM runs inside the cluster and hosts the Spark context and the scheduler in its JVM heap; executor JVMs on the worker nodes run tasks (T) in task slots inside their JVM heaps.]
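The layout above follows from how the application is submitted. The command below is a sketch of the corresponding spark-submit call; the master host, application class, JAR path, and resource settings are placeholders, but the flags are the standard ones for a standalone cluster.

# Submit to a standalone master in cluster-deploy mode: the master launches
# the driver JVM (Spark context and scheduler) on one of the worker nodes,
# and the client JVM only submits the application.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  --class org.example.MyApp \
  --executor-memory 2g \
  --total-executor-cores 8 \
  /path/to/myapp.jar

With --deploy-mode client (the default), the driver, and with it the Spark context, would stay in the client JVM instead; --supervise asks the master to restart the driver automatically if it fails.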

DATA/PROGRAMMING

Spark in Action
Petar Zečević ● Marko Bonaći

Big data systems distribute datasets across clusters of machines, making it a challenge to efficiently query, stream, and interpret them. Spark can help. It is a processing system designed specifically for distributed data. It provides easy-to-use interfaces, along with the performance you need for production-quality analytics and machine learning. And Spark 2 adds improved programming APIs, better performance, and countless other upgrades.

Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. You'll get comfortable with the Spark CLI as you work through a few introductory examples. Then, you'll start programming Spark using its core APIs. Along the way, you'll work with structured data using Spark SQL, process near-real-time streaming data, apply machine learning algorithms, and munge graph data using Spark GraphX. For a zero-effort startup, you can download the preconfigured virtual machine ready for you to try the book's code.

What's Inside
● Updated for Spark 2.0
● Real-life case studies
● Spark DevOps with Docker
● Examples in Scala, and online in Java and Python

Written for experienced programmers with some background in big data or machine learning.

"Dig in and get your hands dirty with one of the hottest data processing engines today. A great guide."
—Jonathan Sharley, Pandora Media

"Must-have! Speed up your learning of Spark as a distributed computing framework."
—Robert Ormandi, Yahoo!

"An ambitiously comprehensive overview of Spark and its diverse ecosystem."
—Jonathan Miller, Optensity

"An easy-to-follow, step-by-step guide."
—Gaurav Bhardwaj, 3Pillar Global

Petar Zečević and Marko Bonaći are seasoned developers heavily involved in the Spark community.

To download their free eBook in PDF, ePub, and Kindle formats, owners of this book should visit www.manning.com/books/spark-in-action

MANNING $49.99 / Can $57.99 [INCLUDING eBOOK]