Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Hadoop and MapReduce, Spark, Streaming Analytics: Overview

Big data, Hadoop, MapReduce, Spark, SparkSQL, MLlib, streaming analytics, and other trends

2 The landscape is incredibly complex

3 Heard about Hadoop? Spark? H2O?

Many vendors with their "big data and analytics" stack

Amazon, Cloudera, Datameer, DataStax, Dell, Oracle, IBM, MapR, Pentaho, Databricks, Microsoft, Hortonworks, EMC2, ...

There's always "roll your own"
Open source, or walled garden? Support? What's up to date? Which features?

4 Two sides emerge

Infrastructure

"Big Data" "Integration" "Architecture" "Streaming"

5 Two sides emerge

Analytics

"Data Science" "Machine Learning" "AI"

6 There's a difference

7 Previously

In-memory analytics

Together with some intermediate techniques: mostly based on disk swapping and directed acyclic execution graphs

Now: moving to the world of big data

Managing, storing, querying data
Storage and computation in a distributed setup: over multiple machines
And (hopefully still, though it will be difficult for a while...): analytics

8 Hadoop

9 Hadoop

We all know of Hadoop and hear it being mentioned every time a team is talking about some big daunting task related to the analysis or management of big data

Have lots of volume? Hadoop! Unstructured data? Hadoop! Streaming data? Hadoop! Want to run super-fast machine learning in parallel? You guessed it… Hadoop!

So what is Hadoop and what is it not?

10 History

The genesis of Hadoop came from the Google File System paper, published in 2003
This spawned another research paper from Google: MapReduce
Hadoop itself started as part of a search engine called Nutch, which was being worked on by Doug Cutting and Mike Cafarella: 5k lines of code for NDFS (Nutch Distributed File System) and 6k lines of code for MapReduce
In 2006, Cutting joined Yahoo! to work in its search engine division. The part of Nutch that dealt with distributed computing and processing (initially built to handle the simultaneous parsing of enormous amounts of web links in an efficient manner) was split off and renamed "Hadoop", after the toy elephant of his son
In 2008, Yahoo! open-sourced Hadoop
Hadoop became part of an ecosystem of technologies which are managed by the non-profit Apache Software Foundation
Today, most of the hype around Hadoop has passed, for reasons we'll see later on

11 Hadoop

Even when talking about “raw” Hadoop, it is important to know that it describes a stack containing four core modules:

1. Hadoop Common (a set of shared libraries)
2. Hadoop Distributed File System (HDFS), a Java-based file system to store data across multiple machines
3. MapReduce (a programming model to process large sets of data in parallel)
4. YARN (Yet Another Resource Negotiator), a framework to schedule and handle resource requests in a distributed environment

In MapReduce version 1 (Hadoop 1), HDFS and MapReduce were tightly coupled, which didn't scale well to really big clusters. In Hadoop 2, the resource management and scheduling tasks are separated from MapReduce by YARN

12 HDFS

HDFS is the distributed file system used by Hadoop to store data in the cluster

HDFS lets you connect nodes (commodity personal computers, which was a big deal at the time) over which data files are distributed
You can then access and store the data files as one seamless file system
HDFS is fault tolerant and provides high-throughput access
Theoretically, you don't need to have it running and files could instead be stored elsewhere

HDFS is composed of a NameNode, an optional SecondaryNameNode (for data recovery in the event of failure), and DataNodes which hold the actual data

The NameNode holds all the metadata regarding the stored files, manages namespace operations like opening, closing, and renaming files and directories, and maps data blocks to DataNodes
DataNodes handle read and write requests from HDFS clients and also create, delete, and replicate data blocks according to instructions from the governing NameNode
A typical installation cluster has a dedicated machine that runs the NameNode and at least one DataNode
DataNodes continuously loop, asking the NameNode for instructions
HDFS supports a hierarchical file organization of directories and files inside them

13 HDFS

HDFS replicates file blocks for fault tolerance

An application can specify the number of replicas of a file at the time it is created, and this number can be changed any time after that. The name node makes all decisions concerning block replication

One of the main goals of HDFS is to support large files. The size of a typical HDFS block is 64MB

Therefore, each HDFS file consists of one or more 64MB blocks
HDFS tries to place each block on separate DataNodes

14 HDFS

15 HDFS

16 HDFS

17 HDFS

18 HDFS

HDFS provides a native Java API and a native C-language wrapper for the Java API, as well as shell commands to interface with the file system

byte[] fileData = readFile();
String filePath = "/data/course/participants.csv";

Configuration config = new Configuration();
org.apache.hadoop.fs.FileSystem hdfs = org.apache.hadoop.fs.FileSystem.get(config);
org.apache.hadoop.fs.Path path = new org.apache.hadoop.fs.Path(filePath);

org.apache.hadoop.fs.FSDataOutputStream outputStream = hdfs.create(path);
outputStream.write(fileData, 0, fileData.length);

In layman’s terms: a massive, distributed, C:-drive…

And note: reading in a massive file in a naive way will still end up in trouble

19 MapReduce

What is MapReduce?

A "programming framework" for coordinating tasks in a distributed environment
HDFS uses this behind the scenes to make access fast: reading a file is converted to a MapReduce task that reads across multiple DataNodes and streams the resulting file
Can be used to construct scalable and fault-tolerant operations in general

HDFS provides a way to store files in a distributed fashion; MapReduce allows you to do something with them in a distributed fashion

20 MapReduce

The concepts of "map" and "reduce" existed long before Hadoop and stem from the domain of functional programming

Map: apply a function on every item in a list: result is a new list of values

numbers = [1,2,3,4,5]
numbers.map(λ x : x * x)  # [1,4,9,16,25]

Reduce: apply function on a list: result is a scalar

numbers.reduce(λ x : sum(x)) # 15
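The same two operations in plain, runnable Python (using the built-in map and functools.reduce; nothing Hadoop-specific is involved):

from functools import reduce

numbers = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x * x, numbers))  # [1, 4, 9, 16, 25]
total = reduce(lambda x, y: x + y, numbers)    # 15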

21 MapReduce

A Hadoop map-reduce pipeline works over lists of (key, value) pairs

The map operation maps each pair to a list of output key-value pairs (zero, one, or more)
This operation can be run in parallel over the input pairs
The input list could also contain a single key-value pair

Next, the output entries are shuffled and distributed so that all output entries belonging to the same key are assigned to the same worker

All of these workers then apply a reduce function to each group

Producing a final key-value pair for each distinct key
The resulting, final outputs are then (optionally) sorted per key to produce the final outcome

22 MapReduce

# Input: a list of key-value pairs
documents = [
    ('document1', 'two plus two does'),
    ('document2', 'not equal two'),
]

def map(key, value):
    # For each word, produce an output pair
    for word in value.split(' '):
        yield (word, 1)

for (key, value) in documents:
    map(key, value)

# [ (two, 1), (plus, 1), (two, 1), (does, 1) ]
# [ (not, 1), (equal, 1), (two, 1) ]

def reduce(key, values):
    # For each key, produce output as sum of values
    yield (key, sum(values))

reduce('two', [1, 1, 1])  # ('two', 3)
reduce('plus', [1])       # ('plus', 1)
# ... and so on

23 MapReduce: word count example

24 MapReduce: word count example

25 MapReduce: averaging example

def map(key, value):
    yield (value['genre'], value['nrPages'])

26 MapReduce: averaging example

# Minibatch-style approach would also be possible
def map(key, value):
    for record in value:
        yield (record['genre'], record['nrPages'])

27 MapReduce: averaging example

def reduce(key, values):
    yield (key, sum(values) / len(values))

28 MapReduce: averaging example

There's a gotcha, however: the reduce operation should work on partial results and be able to be applied multiple times in a chain

29 MapReduce: averaging example

30 MapReduce

The reduce operation should work on partial results and be able to be applied multiple times in a chain

1. The reduce function should output the same structure as emitted by the map function, since this output can be used again in an additional reduce operation
2. The reduce function should provide correct results even if called multiple times on partial results

31 MapReduce: correct averaging example

def map(key, value):
    yield (value['genre'], (value['nrPages'], 1))

def reduce(key, values):
    total, newcount = 0, 0
    for (avg, count) in values:       # each value is a (running average, count) pair
        total = total + avg * count
        newcount = newcount + count
    yield (key, (total / newcount, newcount))

Instead of using a running average as a value, our value will now itself be a pair of (running average, number of records already seen)
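A quick sanity check in plain Python shows why this works: reducing two partial results and then reducing their outputs again gives the same answer as reducing all records at once (the genre and page counts below are made up for the illustration):

def reduce_avg(key, values):
    total, newcount = 0, 0
    for (avg, count) in values:
        total += avg * count
        newcount += count
    yield (key, (total / newcount, newcount))

part1 = next(reduce_avg('drama', [(200, 1), (220, 1)]))   # ('drama', (210.0, 2))
part2 = next(reduce_avg('drama', [(140, 1)]))             # ('drama', (140.0, 1))
combined = next(reduce_avg('drama', [part1[1], part2[1]]))
print(combined)  # ('drama', (186.66..., 3)): indeed the average of 200, 220, and 140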

32 MapReduce: correct averaging example

33 Testing it out

A small piece of Python code map_reduce.py will be made available as background material

You can use this to play around with the MapReduce paradigm without setting up a full Hadoop stack

34 Testing it out: word count example

from map_reduce import runtask

documents = ['een twee drie drie drie vier twee acht',
             'acht drie twee zes vijf twee',
             'zes drie acht']

# Provide a mapping function of the form mapfunc(value)
# Must yield (k,v) pairs
def mapfunc(value):
    for x in value.split():
        yield (x, 1)

# Provide a reduce function of the form reducefunc(key, list_of_values)
# Must yield (k,v) pairs
def reducefunc(key, values):
    yield (key, sum(values))

# Pass your input list, mapping and reduce functions
runtask(documents, mapfunc, reducefunc)

35 Testing it out

36 Testing it out: minimum per group example

from map_reduce import runtask

documents = [
    ('drama', 200), ('education', 100), ('action', 20), ('thriller', 20),
    ('drama', 220), ('education', 150), ('action', 10), ('thriller', 160),
    ('drama', 140), ('education', 160), ('action', 20), ('thriller', 30)
]

# Provide a mapping function of the form mapfunc(value)
# Must yield (k,v) pairs
def mapfunc(value):
    genre, pages = value
    yield (genre, pages)

# Provide a reduce function of the form reducefunc(key, list_of_values)
# Must yield (k,v) pairs
def reducefunc(key, values):
    yield (key, min(values))

# Pass your input list, mapping and reduce functions
runtask(documents, mapfunc, reducefunc)
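As a further exercise, the corrected averaging pattern from earlier can be run through the same helper. This is a sketch under the assumption that map_reduce.runtask accepts the same arguments as in the examples above:

from map_reduce import runtask

documents = [
    ('drama', 200), ('education', 100), ('action', 20), ('thriller', 20),
    ('drama', 220), ('education', 150), ('action', 10), ('thriller', 160),
]

def mapfunc(value):
    genre, pages = value
    yield (genre, (pages, 1))          # value is a (running average, count) pair

def reducefunc(key, values):
    total, count = 0, 0
    for (avg, n) in values:
        total += avg * n
        count += n
    yield (key, (total / count, count))

runtask(documents, mapfunc, reducefunc)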

37 Back to Hadoop

On Hadoop, MapReduce tasks are written using Java

Bindings for Python and other languages exist as well, but Java is the "native" environment
The Java program is packaged as a JAR archive and launched using the command:

hadoop jar myfile.jar ClassToRun [args...]

hadoop jar wordcount.jar RunWordCount /input/dataset.txt /output/

38 Back to Hadoop

public static class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int sum = 0;
    IntWritable result = new IntWritable();

    for (IntWritable val : values) {
      sum += val.get();
    }

    result.set(sum);
    context.write(key, result);
  }
}

39 Back to Hadoop

hadoop jar wordcount.jar WordCount /users/me/dataset.txt /users/me/output/

40 Back to Hadoop

$ hadoop fs -ls /users/me/output
Found 2 items
-rw-r--r--   1 root hdfs       0 2017-05-20 15:11 /users/me/output/_SUCCESS
-rw-r--r--   1 root hdfs    2069 2017-05-20 15:11 /users/me/output/part-r-00000

$ hadoop fs -cat /users/me/output/part-r-00000
and     2
first   1
is      3
line    2
second  1
the     2
this    3

41 Back to Hadoop

MapReduce tasks can consist of more than mappers and reducers

Partitioners, Combiners, Shufflers, and Sorters

42 MapReduce

Constructing MapReduce programs requires “a certain skillset” in terms of programming (to put it lightly)

One does not simply implement Random Forest on MapReduce
There's a reason why most tutorials don't go much further than counting words

Tradeoffs in terms of speed, memory consumption, and scalability

Big does not mean fast
Does your use case really align with a search engine?

43 YARN

How is a MapReduce program coordinated amongst the different nodes in the cluster?

In the former Hadoop 1 architecture, the cluster was managed by a service called the JobTracker
TaskTracker services lived on each node and would launch tasks on behalf of jobs (instructed by the JobTracker)
The JobTracker would serve information about completed jobs
The JobTracker could still become overloaded, however!

In Hadoop 2, MapReduce is split into two components

The cluster resource management capabilities have become YARN, while the MapReduce-specific capabilities remain MapReduce

44 YARN

YARN's setup is relatively complex… but the new architecture has a couple of advantages

First, by breaking up the JobTracker into a few different services, it avoids many of the scaling issues facing Hadoop 1
It also makes it possible to run frameworks other than MapReduce on a Hadoop cluster. For example, Impala can also run on YARN and share resources on a cluster with MapReduce
I.e. it can be used for all sorts of coordination of tasks
This will be a huge advantage once we move away from Hadoop (see later)
Even then, people have also proposed alternatives to YARN (see later)

I.e. a general coordination and resource management framework

45 So... Hadoop?

Standard Hadoop: definitely not a turn-key solution for most environments
Just a big hard drive and a way to do scalable MapReduce? In a way which is not fun to program at all?
As such, many implementations and vendors also mix in a number of additional projects such as:

HBase: a distributed database which runs on top of the Hadoop core stack (no SQL, just MapReduce)
Hive: a data warehouse solution with SQL-like query capabilities to handle data in the form of tables
Pig: a framework to manipulate data stored in HDFS without having to write complex MapReduce programs from scratch
Cassandra: another distributed database
Ambari: a web interface for managing Hadoop stacks (managing all these other fancy keywords)
Flume: a framework to collect and deal with streaming data intakes
Oozie: a more advanced job scheduler that cooperates with YARN
Zookeeper: a centralized service for maintaining configuration information and naming (a cluster on its own)
Sqoop: a connector to move data between Hadoop and relational databases
Atlas: a system to govern metadata and its compliance
Ranger: a centralized platform to define, administer and manage security policies consistently across Hadoop components
Spark: a computing framework geared towards data analytics

46 So... Hadoop?

47 SQL on Hadoop

48 The first letdown

"From the moment a new distributed data store gets popular, the next question will be how to run SQL on top of it…"

"What do you mean it's a file system? How do we query this thing? We need SQL!"

2008: the first release of Apache Hive, the original SQL-on-Hadoop solution
It rapidly became one of the de facto tools included with almost all Hadoop installations
Hive converts SQL queries to a series of MapReduce jobs, and presents itself to clients in a way which very much resembles a MySQL server
It also offers a command line client, Java APIs and JDBC drivers, which made the project wildly successful and quickly adopted by organizations which were beginning to realize that they'd taken a step back from their traditional data warehouse setups in their rush to switch to Hadoop as soon as possible

SELECT genre, SUM(nrPages) FROM books   --\
GROUP BY genre                          -- > convert to MapReduce job
ORDER BY genre                          --/

49 There is (was?) also HBase

The first database on Hadoop
Native database on top of Hadoop
No SQL, own get/put/filter operations
Complex queries as MapReduce jobs

hbase(main):009:0> scan 'users'
ROW    COLUMN+CELL
 seppe column=email:, timestamp=1495293082872, [email protected]
 seppe column=name:first, timestamp=1495293050816, value=Seppe
 seppe column=name:last, timestamp=1495293067245, value=vanden Broucke
1 row(s) in 0.1170 seconds

hbase(main):011:0> get 'users', 'seppe'
COLUMN      CELL
 email:     timestamp=1495293082872, [email protected]
 name:first timestamp=1495293050816, value=Seppe
 name:last  timestamp=1495293067245, value=vanden Broucke
4 row(s) in 0.1250 seconds

50 There is (was?) also Pig

Another way to ease the pain of writing MapReduce programs
Still not very easy though
People still wanted good ole SQL

timesheet = LOAD 'timesheet.csv' USING PigStorage(',');
raw_timesheet = FILTER timesheet BY $0 > 1;
timesheet_logged = FOREACH raw_timesheet GENERATE $0 AS driverId, $2 AS hours_logged, $3 AS miles_logged;

grp_logged = GROUP timesheet_logged BY driverId;

sum_logged = FOREACH grp_logged GENERATE group AS driverId,
    SUM(timesheet_logged.hours_logged) AS sum_hourslogged,
    SUM(timesheet_logged.miles_logged) AS sum_mileslogged;

51 Hive

2008: the first release of Apache Hive, the original SQL-on-Hadoop solution Hive converts SQL queries to a series of map-reduce jobs, and presents itself to clients in a way which very much resembles a MySQL server

SELECT genre, SUM(nrPages) FROM books   --\
GROUP BY genre                          -- > convert to MapReduce job
ORDER BY genre                          --/

Hive is handy… but SQL-on-Hadoop technologies are not perfect implementations of relational database management systems:

They sacrifice on features such as speed and SQL language compatibility
Support for complex joins is lacking
For Hive, the main drawback was its lack of speed
Because of the overhead incurred by translating each query into a series of MapReduce jobs, even the simplest of queries can consume a large amount of time

Big does not mean fast

52 So... without MapReduce?

For a long time, companies such as Hortonworks were pushing the development of Hive forward, mainly by putting effort behind Apache Tez, which provides a new backend for Hive, no longer based on the map-reduce paradigm but on directed-acyclic-graph pipelines
In 2012, Cloudera, another well-known Hadoop vendor, introduced their own SQL-on-Hadoop technology as part of their "Impala" stack
Cloudera also opted to forego map-reduce completely, and instead uses its own set of execution daemons, which have to be installed alongside Hive-compatible DataNodes. It offers SQL-92 syntax support, a command line client, and ODBC drivers
Much faster than a standard Hive installation, allowing for immediate feedback after queries, hence making them more interactive
Today: Apache Impala is open source
It didn't take long for other vendors to take notice of the need for SQL-on-Hadoop, and in recent years we saw almost every vendor joining the bandwagon and offering their own query engines (IBM's BigSQL platform or Oracle's Big Data SQL, for instance)
Some better, some worse

But…

53 Hype, meet reality

“ In a tech startup industry that loves its shiny new objects, the term “Big Data” is in the unenviable position of sounding increasingly “3 years ago”

- Matt Turck

Hadoop was created in 2006!

It’s now been more than a decade since Google’s papers on MapReduce

Interest in the concept of “Big Data” reached fever pitch sometime between 2011 and 2014

Big Data was the new "black", "gold" or "oil"
There's an increasing sense of having reached some kind of plateau
2015 was probably the year when people started moving to AI and its many related concepts and flavors: machine intelligence, deep learning, etc.
Today, we're in the midst of a new "AI summer" (with its own hype as well)

54 Hype, meet reality

Big Data wasn’t a very likely candidate for the type of hype it experienced in the first place

Big Data, fundamentally, is… plumbing
There's a reason why most map-reduce examples don't go much further than counting words
The early years of the Big Data phenomenon were propelled by a very symbiotic relationship among a core set of large Internet companies
Fast forward a few years, and we're now in the thick of the much bigger, but also trickier, opportunity: adoption of Big Data technologies by a broader set of companies
Those companies do not have the luxury of starting from scratch

Big Data success is not about implementing one piece of technology (like Hadoop), but instead requires putting together a collection of technologies, people and processes

55 Today

Today, the data side of the field has stabilized: the storage and querying aspect has found a good marriage between big data techniques, speed, a return to relational databases, and NoSQL-style scalability

E.g. Amazon Redshift, Snowflake, CockroachDB, Presto...

"Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes"

Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores
A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization
Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow "free" solution that requires excessive hardware
Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse

We'll revisit this later on when talking about NoSQL...

56 Big Analytics?

"What do you mean it's a file system and some MapReduce? How do we query this thing? We need SQL!"
"Hive is too slow! Can we do it without MapReduce?"

Most managers worth their salt have (finally) realized that Hadoop-based solutions might not be the right fit

Proper cloud-based databases might be

But the big unanswered question right now:

"How to use Hadoop for machine learning and analytics?"

57 Big Analytics?

It turns out that MapReduce was never very well suited for analytics

Extremely hard to convert techniques to a map-reduce paradigm
Slow due to lots of in-out swapping to HDFS
Ask the Mahout project, they tried
Slow for most "online" tasks…
Querying is nice, but… do we just end up with business intelligence dashboarding and pretending we have big data?

“2015 was the year of Apache Spark”

Bye bye, Hadoop!
Spark has been embraced by a variety of players, from IBM to Cloudera-Hortonworks
Spark is meaningful because it effectively addresses some of the key issues that were slowing down the adoption of Hadoop: it is much faster (benchmarks have shown Spark is 10 to 100 times faster than Hadoop's MapReduce), easier to program, and lends itself well to machine learning

58 Spark

59 Time for a spark

Just as Hadoop was perhaps not the right solution to satisfy common querying needs, it was also not the right solution for analytics

In 2015, another project, Apache Spark, entered the scene in full with a radically different approach

Spark is meaningful because it effectively addresses some of the key issues that were slowing down the adoption of Hadoop: it is much faster (benchmarks have shown Spark is 10 to 100 times faster than Hadoop’s MapReduce), easier to program, and lends itself well to machine learning

60 Spark

Apache Spark is a top-level project of the Apache Software Foundation: an open-source, distributed, general-purpose cluster computing framework with an in-memory data processing engine that can do ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (streaming processing), with rich, concise, high-level APIs for the programming languages Scala, Python, Java, R, and SQL

Spark's speed, simplicity, and broad support for existing development environments and storage systems make it increasingly popular with a wide range of developers, and relatively accessible to those learning to work with it for the first time
The project supporting Spark's ongoing development is one of Apache's largest and most vibrant, with over 500 contributors from more than 200 organizations responsible for code in the current software release

61 So what do we throw out?

The resource manager (YARN)?

We're still running on a cluster of machines
Spark can run on top of YARN, but also Mesos (an alternative resource manager), or even in standalone mode

The data storage (HDFS)?

Again, Spark can work with a variety of storage systems:
Google Cloud
Amazon S3
Apache Cassandra
HDFS
Apache HBase
Apache Hive
Flat files (JSON, Parquet, CSV, others)

62 So what do we throw out?

One thing that we do "kick out" is MapReduce

Spark is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning (aha!)
How? Apache Spark replaces the MapReduce paradigm with an advanced DAG execution engine that supports cyclic data flow and in-memory computing

A smarter way to distribute jobs over machines! Note the similarities with previous projects such as Dask...

63 Spark's building blocks

64 Spark core

This is the heart of Spark, responsible for management functions such as task scheduling

Spark core also implements the core abstraction to represent data elements: the Resilient Distributed Dataset (RDD)

The Resilient Distributed Dataset is the primary data abstraction in Apache Spark
It represents a collection of data elements
It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient
Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data
Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes
Once data is loaded into an RDD, two basic types of operation can be carried out:
Transformations, which create a new RDD by changing the original through processes such as mapping, filtering, and more
Actions, such as counts, which measure but do not change the original data

65 RDDs are distributed, fault-tolerant, efficient

66 RDDs are distributed, fault-tolerant, efficient

Note that an RDD represents a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel

Any sort of element collection: a collection of text lines, a collection of single words, a collection of objects, a collection of images, a collection of instances, ...
The only feature provided is automatic distribution and task management over this collection

Through transformations and actions: do things with the RDD

The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged, and can be repeated in the event of data loss or the failure of a cluster node
Transformations are said to be lazily evaluated: they are not executed until a subsequent action has a need for the result
Where possible, these RDDs remain in memory, greatly increasing the performance of the cluster, particularly in use cases with a requirement for iterative queries or processes
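A small PySpark sketch of this lazy behavior (assuming a SparkContext sc is available, as in the shell examples that follow):

rdd = sc.parallelize(range(1, 1001))

# Nothing runs yet: transformations only extend the lineage graph
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# The action triggers execution of the whole chain on the cluster
print(squares.count())  # 500
print(squares.take(3))  # [4, 16, 36]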

67 RDDs are distributed, fault-tolerant, efficient

68 Writing Spark-based programs

Similar as with MapReduce, an application using the RDD framework can be coded in Java or Scala and packaged as a JAR file to be launched on the cluster

However, Spark also provides an interactive shell interface (Spark shell) to its cluster environment

And also exposes APIs to work directly with the RDD concept in a variety of languages

Scala Java Python R SQL

69 Spark shell (./bin/pyspark)

PySpark is the "driver program": runs on the client and will set up a "SparkContext" (a connection to the Spark cluster)

>>> textFile = sc.textFile("README.md") # sc is the SparkContext

# textFile is now an RDD (each element represents a line of text)

>>> textFile.count()  # Number of items in this RDD
126

>>> textFile.first()  # First item in this RDD
u'# Apache Spark'

# Chaining together a transformation and an action:
# How many lines contain "Spark"?
>>> textFile.filter(lambda line: "Spark" in line).count()
15

70 SparkContext

SparkContext sets up internal services and establishes a connection to a Spark execution environment

Data operations are not executed on your machine: the client sends them to be executed by the Spark cluster!

No data is loaded in the client... unless you'd perform a .toPandas()
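A tiny sketch of this division of labor, assuming the same SparkContext sc as in the shell example above:

rdd = sc.parallelize(range(1000000))  # the data lives on the cluster, not on the client

n = rdd.count()        # computed by the cluster; only the number travels back
sample = rdd.take(5)   # only now do (a few) elements come back to the client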

71 Deploying an application

Alternative to the interactive mode:

from pyspark import SparkContext

# Set up the context ourselves
sc = SparkContext("local", "Simple App")

logData = sc.textFile("README.md")

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
sc.stop()

Execute using:

/bin/spark-submit MyExampleApp.py

Lines with a: 46, Lines with b: 23

72 More on the RDD API

So what can we do with RDDs?

Transformations:
map(func)
filter(func)
flatMap(func)
mapPartitions(func)
sample(withReplacement, fraction, seed)
union(otherRDD)
intersection(otherRDD)
distinct()
groupByKey()
reduceByKey(func)
sortByKey()
join(otherRDD)

Actions:
reduce(func)
count()
first()
take(n)
takeSample(withReplacement, n)
saveAsTextFile(path)
countByKey()
foreach(func)

73 Examples

https://github.com/wdm0006/DummyRDD

A test class that walks like an RDD, talks like an RDD, but is actually just a list
No real Spark behind it
Nice for testing and learning, however

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

# Make an RDD from a Python list: a collection of numbers
rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.count())
print(rdd.map(lambda x: x**2).collect())

74 Examples: word count

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

# Make an RDD from a text file: a collection of lines
text_file = sc.textFile("kuleuven.txt")

counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

print(counts)

75 Examples: filtering

from dummy_spark import SparkContext, SparkConf

sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)

rdd = sc.parallelize(list(range(1, 21)))

print(rdd.filter(lambda x: x % 3 == 0).collect())

76 SparkSQL, DataFrames and Datasets

77 These RDDs "feel" a lot like MapReduce...

Indeed, many operations are familiar: map, reduce, reduceByKey, ...

But remember: the actual execution is way more optimized

However, from the perspective of the user, this is still very low-level

Nice if you want low-level control to perform transformations and actions on your dataset
Or when your data is unstructured, such as streams of text
Or you actually want to manipulate your data with functional programming constructs
Or you don't care about imposing a schema, such as a columnar format

But what if you do want to work with tabular structured data... like a data frame?

78 SparkSQL

Like Apache Spark in general, SparkSQL is all about distributed in-memory computations

SparkSQL builds on top of Spark Core with functionality to load and query structured data using queries that can be expressed using SQL, HiveQL, or through high-level APIs similar to e.g. pandas (called the "DataFrame" and "Dataset" APIs in Spark)
At the core of SparkSQL is the Catalyst query optimizer

Since Spark 2.0, Spark SQL is the primary and feature-rich interface to Spark’s underlying in-memory distributed platform (hiding Spark Core’s RDDs behind higher-level abstractions)

79 SparkSQL

# Note the difference: SparkSession instead of SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Python Spark SQL example") \
    .getOrCreate()

# A Spark "DataFrame"
df = spark.read.json("people.json")

df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

df.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

80 SparkSQL

df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
# |   name|(age + 1)|
# +-------+---------+
# |Michael|     null|
# |   Andy|       31|
# | Justin|       20|
# +-------+---------+

81 SparkSQL

df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+

df.groupBy("age").count().show()
# +----+-----+
# | age|count|
# +----+-----+
# |  19|    1|
# |null|    1|
# |  30|    1|
# +----+-----+

82 SparkSQL

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

83 DataFrames

Like an RDD, a DataFrame is an immutable distributed collection of data elements

It extends the "free-form" elements by imposing that every element is organized as a set of values in named columns, e.g. (age=30, name=Seppe)
This imposes some additional structure on top of RDDs

Designed to make large data sets processing easier

This allows for an easier and higher-level abstraction
Provides a domain-specific language API to manipulate your distributed data (see examples above)
Makes Spark accessible to a wider audience
Finally, much in line with what data scientists are actually used to

84 DataFrames

pyspark.sql.SparkSession : Main entry point for DataFrame and SQL functionality

pyspark.sql.DataFrame : A distributed collection of data grouped into named columns

pyspark.sql.Row : A row of data in a DataFrame

pyspark.sql.Column : A column expression in a DataFrame

pyspark.sql.GroupedData : Aggregation methods, returned by DataFrame.groupBy()

pyspark.sql.DataFrameNaFunctions : Methods for handling missing data (null values)

pyspark.sql.DataFrameStatFunctions : Methods for statistics functionality

pyspark.sql.functions : List of built-in functions available for DataFrame

pyspark.sql.types : List of data types available

pyspark.sql.Window : For working with window functions
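As a small illustration of the last entry, a window function sketch (assuming a SparkSession spark; the toy genre data is made up for the example):

from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("drama", 200), ("drama", 220), ("action", 20), ("action", 10)],
    ["genre", "nrPages"])

# Rank books by page count within each genre
w = Window.partitionBy("genre").orderBy(F.desc("nrPages"))
df.withColumn("rank", F.row_number().over(w)).show()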

85 DataFrames

Class pyspark.sql.DataFrame : A distributed collection of data grouped into named columns:

agg(*exprs) : Aggregate on the entire DataFrame without groups

columns : Returns all column names as a list

corr(col1, col2, method=None) : Calculates the correlation of two columns

count() : Returns the number of rows in this DataFrame

cov(col1, col2) : Calculate the sample covariance for the given columns

crossJoin(other) : Returns the cartesian product with another DataFrame

crosstab(col1, col2) : Computes a pair-wise frequency table of the given columns

describe(*cols) : Computes statistics for numeric and string columns

distinct() : Returns a new DataFrame containing the distinct rows in this DataFrame.

drop(*cols) : Returns a new DataFrame that drops the specified column

dropDuplicates(subset=None) : Return a new DataFrame with duplicate rows removed

dropna(how='any', thresh=None, subset=None) : Returns new DataFrame omitting rows with null values

fillna(value, subset=None) : Replace null values

86 DataFrames

Class pyspark.sql.DataFrame : A distributed collection of data grouped into named columns:

filter(condition) : Filters rows using the given condition; where() is an alias for filter()

first() : Returns the first row as a Row

foreach(f) : Applies the f function to all rows of this DataFrame

groupBy(*cols) : Groups the DataFrame using the specified columns

head(n=None) : Returns the first n rows

intersect(other) : Return an intersection with another DataFrame

join(other, on=None, how=None) : Joins with another DataFrame, using the given join expression

orderBy(*cols, **kwargs) : Returns a new DataFrame sorted by the specified column(s)

printSchema() : Prints out the schema in the tree format

randomSplit(weights, seed=None) : Randomly splits this DataFrame with the provided weights

replace(to_replace, value, subset=None) : Returns DataFrame replacing a value with another value

select(*cols) : Projects a set of expressions and returns a new DataFrame

toPandas() : Returns the contents of this DataFrame as a Pandas data frame

union(other) : Return a new DataFrame containing union of rows in this frame and another frame

87 DataFrames

Can be loaded in from:

Parquet files
Hive tables
JSON files
CSV files (Spark 2)
JDBC (to connect with a database)
AVRO files (using the "spark-avro" library, or built-in in Spark 2.4)
Normal RDDs (given that you specify or infer a "schema")

Can also be converted back to a standard RDD
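A minimal sketch of both directions, assuming a SparkSession spark and a hypothetical books.csv file:

# Load a CSV file into a DataFrame (schema inferred from the data)
df = spark.read.csv("books.csv", header=True, inferSchema=True)

# DataFrame -> RDD of Row objects, and back again
rdd = df.rdd
df2 = spark.createDataFrame(rdd)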

88 SparkR

Implementation of the Spark DataFrame API for R

An R package that provides a light-weight frontend to use Apache Spark from R
The way of working is very similar to dplyr
Can convert R data frames to SparkDataFrame objects

df <- as.DataFrame(faithful)

groupBy(df, df$waiting) %>%
  summarize(count = n(df$waiting)) %>%
  head(3)

##   waiting count
## 1      70     4
## 2      67     1
## 3      69     2

89 Datasets

Spark Datasets is an extension of the DataFrame API that provides a type-safe, object-oriented programming interface

Introduced in Spark 1.6
Like DataFrames, Datasets take advantage of Spark's optimizer by exposing expressions and data fields to a query planner
Datasets extend these benefits with compile-time type safety: meaning production applications can be checked for errors before they are run

A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema

At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and tabular representation

Core idea: where a DataFrame represents a collection of Rows (with a number of named Columns), a Dataset represents a collection of typed objects (with their according typed fields) which can be converted from and to table rows

90 Datasets

Since Spark 2.0, the DataFrame API has merged with the Dataset API, unifying data processing capabilities across libraries

Because of this unification, developers now have fewer concepts to learn or remember, and work with a single high-level and type-safe API called Dataset
However, DataFrame as a name is still used: a DataFrame is a Dataset[Row], so a collection of generic Row objects

91 Datasets

Starting in Spark 2.0, Dataset takes on two distinct APIs characteristics: a strongly-typed API and an untyped API

Dataset represents a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java
Consider DataFrame as an alias for Dataset[Row], where a Row represents a generic untyped JVM object
Since Python and R have no compile-time type-safety, there's only the untyped API, namely DataFrames

Language   Main Abstraction
Scala      Dataset[T] & DataFrame (= Dataset[Row])
Java       Dataset[T]
Python     DataFrame (= Dataset[Row])
R          DataFrame (= Dataset[Row])

92 Datasets

Benefits:

Static typing and run-time type safety: both syntax and analysis errors can now be caught during compilation of our program
High-level abstraction and custom view into structured and semi-structured data
Ease-of-use of APIs with structure
Performance and optimization

For us R and Python users, we can continue using DataFrames knowing that they are built on Dataset[Row]

Most common use case anyways

(Example will be posted in background information for those interested)

93 MLlib

94 MLlib

MLlib is Spark’s machine learning (ML) library

Its goal is to make practical machine learning scalable and easy
Think of it as a "scikit-learn"-on-Spark

Provides:

ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines (see the sketch after this list)
Persistence: saving and loading algorithms, models, and Pipelines
Utilities: linear algebra, statistics, data handling, etc.
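A minimal sketch of the Pipelines idea, assuming a SparkSession is running and that DataFrames named training and test with "text" and "label" columns exist (both names are made up for the example):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)

# Chain featurization and model fitting into a single, tunable pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)

# The fitted PipelineModel applies the same stages to new data
predictions = model.transform(test)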

95 MLlib

As of Spark 2.0, the primary Machine Learning API for Spark is the DataFrame-based API in the spark.ml package

Before: spark.mllib was RDD-based
Not a very helpful way of working

MLlib still supports the RDD-based API in spark.mllib

Since Spark 2.3, MLlib's DataFrame-based API has reached feature parity with the RDD-based API

After reaching feature parity, the RDD-based API will be deprecated
The RDD-based API is expected to be removed in Spark 3.0
Why: DataFrames provide a more user-friendly API than RDDs

This does mean that MLlib is in a bit of a mess for now…

96 MLlib

Classification:
Logistic regression
Decision tree classifier
Random forest classifier
Gradient-boosted tree classifier
Multilayer perceptron classifier
One-vs-Rest classifier (a.k.a. One-vs-All)
Naive Bayes

Regression:
Linear regression
Generalized linear regression
Decision tree regression
Random forest regression
Gradient-boosted tree regression
Survival regression
Isotonic regression

Clustering:
K-means
Latent Dirichlet allocation (LDA)
Bisecting k-means
Gaussian Mixture Model (GMM)

Recommender systems:
Collaborative filtering

Validation routines

97 MLlib example

from pyspark.ml.classification import LogisticRegression

training = spark.read.format("libsvm").load("data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)

print("Coefs: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

from pyspark.ml.clustering import KMeans

dataset = spark.read.format("libsvm").load("data/data.txt")

kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
predictions = model.transform(dataset)

centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

98 Conclusions so far

99 Spark versus...

Spark: a high-performance in-memory data-processing framework

Has been widely adopted and still one of the main computing platforms today

Versus:

MapReduce (a mature batch-processing platform for the petabyte scale): Spark is faster, better suited in an online, analytics setting, and implements data frame and ML concepts and algorithms
Apache Tez: "aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data"
Hortonworks: Spark is a general purpose engine with APIs for mainstream developers, while Tez is a framework for purpose-built tools such as Hive and Pig
Cloudera was rooting for Spark, Hortonworks for Tez (a few years ago…)
Today: Tez is out! (Hortonworks had to also adopt Spark, and merged with Cloudera)
Apache Mahout: "the goal is to build an environment for quickly creating scalable performant machine learning applications"
A simple and extensible programming environment and framework for building scalable algorithms
A wide variety of premade algorithms for Apache Spark, H2O, Apache Flink
Before: also a "MapReduce all things" approach
Kind of an extension to Spark
Though most of the algorithms are also in MLlib… so not that widely used any more!

100 Spark versus...

One contender that still is very much in the market is H2O (http://www.h2o.ai/)

"H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform"

The core is written in Java. Inside H2O, a distributed key/value store is used to access and reference data, models, objects, etc., across all nodes and machines
The algorithms are implemented on top of H2O's distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading
The data is read in parallel, distributed across the cluster, and stored in memory in a columnar format in a compressed way
H2O's REST API allows access to all the capabilities of H2O from an external program or script via JSON over HTTP

So they also had the idea of coming up with a better "MapReduce" engine

Based on a distributed key-value store
In-memory map/reduce
Can work on top of Hadoop (YARN) or standalone
Though not as efficient as Spark's engine

101 H2O

102 H2O

However, H2O was quick to realize the benefits of Spark, and the role they could play: “customers want to use Spark SQL to make a query, feed the results into H2O Deep Learning to build a model, make their predictions, and then use the results again in Spark”

“Sparkling Water”

103 H2O

Web based UI: Flow

Strong support for algorithms

Killer app on Spark!

Productionization in mind!

E.g. documentation describes “What happens when you try to predict on a categorical level not seen during training?” and “How does the algorithm handle missing values during testing?”

"A better MLlib"

We see a lot of companies embracing H2O on Spark as the "next extension"

104 H2O

library(h2o)
h2o.init(nthreads=-1, max_mem_size = "2G")
h2o.removeAll()

df <- h2o.importFile(path = normalizePath("./covtype.full.csv"))
splits <- h2o.splitFrame(df, c(0.6, 0.2))

train <- h2o.assign(splits[[1]], "train.hex")
valid <- h2o.assign(splits[[2]], "valid.hex")
test <- h2o.assign(splits[[3]], "test.hex")

rf1 <- h2o.randomForest(
  training_frame = train,
  validation_frame = valid,
  x = 1:12, y = 13,
  model_id = "rf_covType_v1",
  ntrees = 200,
  stopping_rounds = 2,
  score_each_iteration = T)

summary(rf1)
rf1@model$validation_metrics
h2o.hit_ratio_table(rf1, valid = T)[1, 2]

h2o.shutdown(prompt=FALSE)

105 H2O

Though faster implementations continue to arrive… see, e.g.: https://github.com/aksnzhy/xforest

http://datascience.la/benchmarking-random-forest-implementations/

106 Further reading

https://spark.apache.org/docs/latest/index.html (!)
https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark
https://mapr.com/ebooks/spark/03-apache-spark-architecture-overview.html
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

107 Summary so far

Hadoop letdown: MapReduce not that user-friendly

Not convenient for analytics purposes
More for high-volume batch operations, ETL
What about SQL? What about analytics?

Spark re-uses components of Hadoop

HDFS (or HBase, Hive, flat files, Cassandra, …)
YARN (or Mesos, or stand-alone)

108 Summary so far

DAG approach adopted by many other projects

E.g. Dask, https://dask.pydata.org/en/latest/

Spark has a Directed Acyclic Graph (DAG) execution engine

Supporting cyclic data flow and in-memory computing

109 Summary so far

DAG engine powers core concept of Resilient Distributed Datasets (RDDs)

Fault-tolerant and efficient
Two types of operations: transformations and actions
Represents an unstructured collection of data (e.g. lines of text, images, vectors, …)

Programs are written interactively (Spark shell)

Or packaged as an application (like with a MapReduce app)
Bindings to Scala, Java, Python, R, and support for SQL

RDDs are still a bit hard to learn and use

Low-level control to perform transformations and actions on your dataset
Data not necessarily structured, columnar
SparkSQL is the engine to work with semi-structured data
Exposes DataFrame and Dataset APIs
Underlyingly uses RDDs, but more suited towards structured analyses

110 Summary so far

MLlib is Spark’s machine learning (ML) library

Before: spark.mllib was RDD-based
Not a very helpful way of working
MLlib will still support the RDD-based API in spark.mllib
MLlib will not add new features to the RDD-based API

Newer features: spark.ml
Classification, regression, clustering, recommender systems

111 Streaming analytics

112 Streaming analytics?

Not only serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making

Next to Spark, other exciting frameworks emerge and gain momentum

Kafka, Samza, Flink, Ignite, Kudu, Splunk

Lots of attention today towards streaming / realtime analytics

Although many projects are still in an early stage
"Streaming analytics" might not be that hard

113 Streaming analytics?

With regards to the why, examples include:

Advertising: from mass branding to 1-1 targeting
Fintech: general advice to personalized robo-advisors
Healthcare: mass treatment to designer medicine
Retail: static branding to real-time personalization
Manufacturing: break-then-fix to repair-before-break

Note: many of these have more to do with “personalization”

Beware of slogans such as "self-learning models"

114 Streaming analytics

We need to differentiate between different “aspects” of “streaming”

Streaming data: my data source emits events, or a stream of instances, or a stream of …

How can I access historical events? How do I store this? How do I convert this to a data set?
How do I send such streams across my data center, clients, applications? (Plumbing?)

Streaming training: my algorithm needs to be trained on a continuous stream

"On-line" algorithms: a very hard problem for many techniques
Requires ad-hoc changes and modifications of each algorithm

Streaming prediction: my model needs to predict on a stream of instances, events, …

A matter of deployment, of operationalizing the model

115 Streaming analytics

We need to differentiate between different “aspects” of “streaming”

Streaming data: my data source emits events, or a stream of instances, or a stream of …

If we can do this, do we even need streaming training? Is this really a hard setting?

Streaming training: my algorithm needs to be trained on a continuous stream

See above

Streaming prediction: my model needs to predict on a stream of instances, events, …

Is this really a large issue in your setting?

116 Think about your use case

Netflix: 9 million events per second at peak

LinkedIn: 500 billion events per day, ~24 GB per second during peak hours

Bombardier showcased its C Series jetliner that carries Pratt & Whitney’s Geared Turbo Fan (GTF) engine, which is fitted with 5,000 sensors that generate up to 10 GB of data per second. A single twin-engine aircraft with an average 12-hr. flight-time can produce up to 844 TB of data

WhatsApp, Uber, ...?

How do you compare?

117 Streaming data, streaming engine

Data can be bounded (finite) or unbounded

Execution engine is streaming or batch

Combinations of both possible!

You can pretend a finite data set comes in as a stream
And you can handle an infinite data set in batches

For finite data-sets, accurate processing is relatively simple

On failure, schedule a new reprocessing
Use checkpoints to make it more efficient
Effectively what Spark does (the "resilient" in RDD)

For infinite data sets, this is harder

Need to take time into account: time of event creation, ingestion, processing

118 We also need to rethink aggregations

119 We also need to rethink aggregations

120 The reality

What about “on-line” algorithms?

Do we need them? In many cases: perhaps not
Also hard to find "on-line" implementations
Streaming linear regression, online k-means clustering, incremental matrix factorization

E.g. Netflix: 9 million events per second

Real-time (re)training of recommender matrix, but main training still offline

In most cases:

You might need to deal with streaming data: how to store it, how to access history?
We want to be able to perform online predictions on this data
Training can be done offline
The model can be deployed in a streaming setup
Depending on your needs: re-train every month, week, day, ... but this depends more on how fast the feature space changes, not really on how fast the data comes in

121 Streaming overview

https://www.slideshare.net/sbaltagi/apache-flink-realworld-use-cases-for-streaming-analytics

122 Event collectors

(We're on a quest for something to help with analytics in a streaming setting)

Event collectors gather, collect, centralize events

Examples of event collectors include:

Apache Flume: one of the oldest Apache projects, designed to collect, aggregate, and move large data sets such as web server logs to a centralized location
Main use case: streaming logs from multiple sources capable of running a JVM
Not that helpful for analytics
Apache NiFi: a relatively new project. It is based on Enterprise Integration Patterns (EIP), where the data flows through multiple stages and transformations before reaching the destination
Apache NiFi comes with a highly intuitive graphical interface that makes it easy to design data flows and transformations
Main use case: ETL, EIP, data transformations
Not that helpful for analytics

123 Event brokers

Message/event/data brokers handle message validation, transformation and routing

Message-oriented middleware ("MOM")
It mediates communication amongst applications, minimizing the mutual awareness that applications should have of each other in order to be able to exchange messages, effectively implementing decoupling
Message routing (one or more destinations), message transformation, simple message aggregation
In a way which is resilient, fail-safe, scalable
A lot of the streaming data "plumbing" is thus handled by message/event/data brokers

Examples include:

Apache ActiveMQ, Apache Kafka, Celery, RabbitMQ, Redis, ZeroMQ
Especially Kafka is a popular choice: can be easily integrated with Spark
But in itself: not really much analytics
E.g. in between Uber's mobile app and data lakes

124 Event processors

Here we find the actual intelligence

Spark streaming and Spark structured streaming as two very popular options

Spark Streaming enables developers to build streaming applications through Sparks’ high-level API

Since it runs on Spark, Spark Streaming lets developers reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state
Spark Streaming operates in micro-batching mode, where the batch size is much smaller than in conventional batch processing
Can be put on top of Kafka acting as the message broker: a common approach these days (see the sketch below)
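A hedged sketch of that Kafka-plus-Spark-Streaming combination, using the spark-streaming-kafka integration that shipped with Spark 1.x/2.x (the topic name and broker address below are made up):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "KafkaWordCount")
ssc = StreamingContext(sc, 10)

# Read directly from a Kafka topic; each element is a (key, value) message pair
stream = KafkaUtils.createDirectStream(
    ssc, ["clickstream"], {"metadata.broker.list": "localhost:9092"})

counts = stream.map(lambda kv: kv[1]) \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()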

125 Spark Streaming

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data

DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams
Internally, a DStream is represented as a sequence of RDDs
This helps with regard to ease of use: many of the same actions and operations can be applied

126 Spark Streaming

pyspark.streaming.StreamingContext is the main entry point for all streaming functionality

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to localhost:9999 as a source lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print out the first ten elements of each RDD generated in this DStream
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

127 Spark Streaming

A Discretized Stream or DStream is the basic abstraction provided by Spark Streaming

It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream
Internally, a DStream is represented by a continuous series of RDDs
Each RDD in a DStream contains data from a certain interval
Similar to RDDs, transformations allow the data from the input DStream to be modified
DStreams support many of the transformations available on normal Spark RDDs
The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream
Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data

128 Spark Streaming

Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream

window length: the duration of the window (3 below)
sliding interval: the interval at which the window operation is performed (2 below)

windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, 30, 10)

129 Spark Streaming

You can perform different kinds of joins on streams in Spark Streaming

windowedStream1 = stream1.window(20)
windowedStream2 = stream2.window(60)
joinedStream = windowedStream1.join(windowedStream2)

dataset = ...  # some RDD
windowedStream = stream.window(20)
joinedStream = windowedStream.transform(lambda rdd: rdd.join(dataset))

130 Spark Streaming

Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems

Triggers the actual execution of all the DStream transformations

print() prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application, useful for development and debugging (this is called pprint() in the Python API)

saveAsTextFiles(prefix, [suffix]) saves this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]"

saveAsHadoopFiles(prefix, [suffix]) saves this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

foreachRDD(func) is the most generic output operator that applies a function, func, to each RDD generated from the stream
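As a hedged sketch of using foreachRDD in practice: the snippet below pushes each micro-batch of the word counts computed earlier to an external system, one partition at a time so that a connection is created per partition rather than per record; createNewConnection is a hypothetical placeholder for whatever client library you actually use.

def sendPartition(records):
    # createNewConnection is a hypothetical helper standing in for your client library
    connection = createNewConnection()
    for record in records:
        connection.send(record)
    connection.close()

# Apply the function to every RDD (i.e. every micro-batch) in the DStream
wordCounts.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))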

131 Spark Streaming

Note that Spark Streaming utilizes RDDs heavily

Not very pleasant

But: you can quite easily use DataFrames and SQL operations on streaming data

Just convert the RDD to a DataFrame
You have to create a SparkSession using the SparkContext that the StreamingContext is using
Furthermore, this has to be done in such a way that it can be restarted on driver failures; this is done by creating a lazily instantiated singleton instance of SparkSession

132 Spark Streaming

from pyspark.sql import SparkSession, Row

def getSparkSessionInstance(sparkConf):
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = \
            SparkSession.builder.config(conf=sparkConf).getOrCreate()
    return globals()["sparkSessionSingletonInstance"]

words = ... # DStream of strings

def process(time, rdd):
    print("======%s ======" % str(time))
    try:
        # Get the singleton instance of SparkSession
        spark = getSparkSessionInstance(rdd.context.getConf())
        # Convert RDD[String] to RDD[Row] and then to DataFrame
        rowRdd = rdd.map(lambda w: Row(word=w))
        wordsDataFrame = spark.createDataFrame(rowRdd)
        # Create a temporary view using the DataFrame
        wordsDataFrame.createOrReplaceTempView("words")
        # Do word count on the table using SQL and print it
        wordCountsDataFrame = spark.sql(
            "select word, count(*) as total from words group by word")
        wordCountsDataFrame.show()
    except:
        pass

words.foreachRDD(process)

133 Spark Streaming and MLlib

You can also easily use machine learning algorithms provided by MLlib

First of all, there are streaming (on-line) machine learning algorithms which can simultaneously learn from the streaming data and apply the model on the streaming data (see the sketch below)
Beyond these, for a much larger class of machine learning algorithms, you can learn a model offline (i.e. using historical data) and then apply it online on streaming data (the more common case)
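As a hedged illustration of the first (on-line) case, the sketch below uses MLlib's StreamingKMeans from the older pyspark.mllib API, which updates its cluster centres on every micro-batch; trainingStream and testStream are assumed to be DStreams (of feature vectors and of LabeledPoint objects, respectively) created elsewhere.

from pyspark.mllib.clustering import StreamingKMeans

# trainingStream: an assumed DStream of feature vectors
# testStream: an assumed DStream of LabeledPoint objects

# Two clusters over 3-dimensional data, with randomly initialized centres
model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(3, 1.0, 0)

model.trainOn(trainingStream)  # update the model on every micro-batch
predictions = model.predictOnValues(
    testStream.map(lambda lp: (lp.label, lp.features)))
predictions.pprint()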

134 Spark Structured Streaming

Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine

The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended. This leads to a stream processing model that is very similar to a batch processing model
You express your streaming computation as a standard batch-like query, as on a static table, and Spark runs it as an incremental query on the unbounded input table
In other words, you can express your streaming computation the same way you would express a batch computation on static data
The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive
You can use the Dataset/DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc.
The computation is executed on the same optimized Spark SQL engine
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs
Structured Streaming thus provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming
Structured Streaming was alpha in Spark 2.1 but is stable as of 2.2

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

135 Spark Structured Streaming

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

136 Spark Structured Streaming

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

137 Spark Structured Streaming

Note that Structured Streaming does not materialize the entire table

It reads the latest available data from the streaming data source, processes it incrementally to update the result, and then discards the source data
It only keeps around the minimal intermediate state data required to update the result

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

138 Spark Structured Streaming

Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations

In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column
In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into
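As a hedged sketch of a window-based aggregation (in the style of the word-count example later in this section), the snippet below counts words per 10-minute event-time window, sliding every 5 minutes; it assumes a streaming DataFrame named words with columns timestamp and word.

from pyspark.sql.functions import window

# words: an assumed streaming DataFrame with columns "timestamp" and "word"
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()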

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

139 Spark Structured Streaming

Structured Streaming can maintain the intermediate state for partial aggregates for a long period of time such that late data can update aggregates of old windows correctly
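The programming guide's mechanism for bounding this state is watermarking; as a hedged sketch, the snippet below extends the windowed count above with a watermark that tells Spark how late data may arrive (10 minutes here) before the state of old windows can be dropped.

from pyspark.sql.functions import window

# Accept data arriving up to 10 minutes late; older window state can be dropped
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word) \
    .count()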

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

140 Spark Structured Streaming

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("StructuredNetworkWordCount") \
    .getOrCreate()

# This will be a streaming data frame
lines = spark.readStream.format("socket").option("host", "localhost") \
    .option("port", 9999).load()

# Split the lines into words and generate the word count
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode("complete") \
    .format("console").start()
query.awaitTermination()
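As a hedged aside: to read from Kafka instead of a socket, Structured Streaming offers a built-in kafka source (the spark-sql-kafka connector package must be on the classpath); the broker address and topic name below are placeholders.

# Read from a Kafka topic as a streaming DataFrame (placeholder broker/topic)
kafkaLines = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

# The Kafka value column is binary; cast it to a string before processing
lines = kafkaLines.selectExpr("CAST(value AS STRING) AS value")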

141 Other event processors

Apache Storm was originally developed by Nathan Marz at BackType, a company that was acquired by Twitter. After the acquisition, Twitter open-sourced Storm before donating it to Apache
Used by Flipboard, Yahoo!, and Twitter, it was a long-time standard for developing distributed, real-time data processing platforms
Storm is often referred to as “the Hadoop for real-time processing”. It is primarily designed for scalability and fault-tolerance
For new projects: not that much in use any more (mainly replaced by Kafka + Spark setups)
Apache Flink: a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
Gaining traction, for use cases where micro-batching is not viable
Apache Ignite: a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time
For very heavy-duty applications

142 Event indexers, search, visualization

Tools like Splunk, Kibana

Good commercial offerings available
More concerned with dashboarding on real-time data, not so much with learning

143 Conclusions

144 The Apache family

Most of the "big data tooling" are Apache Foundation projects

Open source, Java, enterprisy
Many of these depend on others…
Who can make sense of this all?
Some relatively new

Future to be seen in terms of adoption and maturity

Hadoop → Spark → Spark + H2O → Spark + Kafka + H2O → Flink → ...

Important: perform a thorough assessment before committing

Features, but also: documentation, security, integration, ease-of-use, … “Which version are we getting?”

145 Focus on what matters

https://landscape.cncf.io/category=streaming-messaging&format=card-mode&grouping=category
https://news.ycombinator.com/item?id=19016997

“All of these distributed streaming architectures are massively overused. They certainly have their place, but you probably don't need them. I see it all the time with ML work. You wind up using a cluster to overcome the memory inefficiency of Spark, when you could have just used a single machine.”

“This has been my experience, too. I worked at just one place that had a really good handle on high-volume, high-velocity streaming data, and they didn't use Flink or Storm or Kafka or anything like that. They mostly just used the KISS principle and a protobuf-style wire format. There is definitely a point where these sorts of scale-out-centric solutions are unavoidable. Short of that point, though, they're probably best avoided.”

146 Focus on what matters

https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

147 Platforms: have someone solve it for you?

Either combine a number of Apache projects

Hadoop on premise
Spark in the cloud

Or provide a one-click Jupyter environment

Installing packages
Scheduling and monitoring models

Options for "infrastructure in the cloud": Amazon, Google, ...

Challenge remains deployment, management, governance, and starting from the right question

Recall discussion in "Evaluation" and "Data Science Tools"!

148 Focus on what matters

“Just 5% of data scientists surveyed by O’Reilly use any SAS software, and 0% use any IBM analytics software. In the slightly broader KDnuggets poll, 6% use SAS Enterprise Miner, and 8% say they use IBM SPSS Modeler. Gartner’s obsession with ‘Citizen Data Scientists’ leads it to criticize Domino and H2O because they are ‘hard to use’. Imagine that! If you want to use a data science platform, you need to know how to do data science... Gartner is clueless about open source software”

https://thomaswdinsmore.com/2017/02/28/gartner-looks-at-data-science-platforms/

2017 Magic Quadrant for Data Science Platforms -- Gartner

149 Focus on what matters

2018 Magic Quadrant for Data Science Platforms -- Gartner

150 Focus on what matters

Maybe this is what we actually need

151 So what should you do?

Steal from the best...

https://eng.uber.com/scaling-michelangelo/

152 So what should you do?

Steal from the best...

https://eng.uber.com/uber-big-data-platform/

153 So what should you do?

Steal from the best...

https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
https://airflow.apache.org/
https://github.com/spotify/luigi

154 So what should you do?

Steal from the best...

https://blog.keen.io/architecture-of-giants-data-stacks-at-facebook-netflix-airbnb-and-pinterest/

155 Most importantly

Start with the analytics: big data stacks bring the plumbing, but that doesn't mean you can make soup

Not every organization is engineering-driven
You probably do not have the capacity, or even the need, to keep up to date
Be honest: do you have big data? What do you mean by “real time”, “streaming”? Make an informed decision
The algorithms are the same everywhere anyway
One use case with small data can lead to much more value than a data lake filled with stuff. Data is not gold
People and process matter: the right governance, thinking about deployment, maintenance, monitoring, and most of all: the business question

156

157