Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Hadoop and Spark Streaming Analytics Overview
Introduction Hadoop: HDFS and MapReduce Spark: SparkSQL and MLlib Streaming analytics and other trends
2 Recall…
3 Two sides emerge
Infrastructure
Big Data Integration Architecture NoSQL and NewSQL Streaming AI and ML ops
4 Two sides emerge
Analytics
Data Science Machine Learning AI NLP But also still: BI and Visualization
5 There’s a difference
6 Previously
In-memory analytics
Together with some intermediate techniques (Dask and friends): based on disk swapping and directed acyclic execution graphs
Now: moving to the world of big data
Managing, storing, querying data
Storage and computation in a distributed setup: a setting of multiple machines is assumed
And (hopefully) the same for analytics: distributed data frames, distributed model training
7 Hadoop
8 Hadoop
At some point, Hadoop was mentioned every time a team was talking about some big daunting task related to the analysis or management of big data
Have lots of volume? Hadoop! Unstructured data? Hadoop! Streaming data? Hadoop! Want to run super-fast machine learning in parallel? You guessed it… Hadoop!
So what is Hadoop and what is it not?
9 Hadoop
The genesis of Hadoop came from the Google File System paper published in 2003, which spawned another research paper from Google: MapReduce. Hadoop itself started as part of a search engine called Nutch, which was being worked on by Doug Cutting and Mike Cafarella: 5k lines of code for NDFS (Nutch Distributed File System) and 6k lines of code for MapReduce. In 2006, Cutting joined Yahoo! to work in its search engine division. The part of Nutch which dealt with distributed computing and processing (initially constructed to handle the simultaneous parsing of enormous amounts of web links in an efficient manner) was split off and renamed "Hadoop", after the toy elephant of Cutting's son. In 2008, Yahoo! open-sourced Hadoop, and it became part of an ecosystem of technologies managed by the non-profit Apache Software Foundation
Today, most of the hype around Hadoop has passed, for reasons we’ll see later
10 Hadoop
“Raw” Hadoop contains four core modules:
1. Hadoop Common (a set of shared libraries)
2. Hadoop Distributed File System (HDFS), a Java-based file system to store data across multiple machines
3. MapReduce (a programming model to process large sets of data in parallel)
4. YARN (Yet Another Resource Negotiator), a framework to schedule and handle resource requests in a distributed environment
In MapReduce version 1 (Hadoop 1), HDFS and MapReduce were tightly coupled. This didn’t scale well to really big clusters. In Hadoop 2, the resource management and scheduling tasks are separated from MapReduce by YARN
11 HDFS
HDFS is the distributed file system used by Hadoop to store data in the cluster
HDFS lets you connect nodes (commodity computers) contained within clusters over which data files are distributed
You can then access and store the data files as one seamless file system
Theoretically, you don't need to have it running and files could instead be stored elsewhere
HDFS replicates file blocks for fault tolerance and high-throughput access
An application can specify the number of replicas of a file at the time it is created, and this number can be changed any time after that
A "name node" makes all decisions concerning block replication; "data nodes" hold the actual data blocks
One of the main goals of HDFS is to support large files. The size of a typical HDFS block is 64MB
Therefore, files can consist of one or more 64MB blocks
HDFS tries to place each block on separate data nodes
12 HDFS
HDFS provides a native Java API and a native C-language wrapper for the Java API, as well as shell commands to interface with the file system
byte[] fileData = readFile();
String filePath = "/data/course/participants.csv";
Configuration config = new Configuration();
org.apache.hadoop.fs.FileSystem hdfs = org.apache.hadoop.fs.FileSystem.get(config);
org.apache.hadoop.fs.Path path = new org.apache.hadoop.fs.Path(filePath);
org.apache.hadoop.fs.FSDataOutputStream outputStream = hdfs.create(path);
outputStream.write(fileData, 0, fileData.length);
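The same upload can be done from the shell interface (a sketch; the paths are illustrative):

hadoop fs -mkdir -p /data/course
hadoop fs -put participants.csv /data/course/participants.csv
hadoop fs -ls /data/course
hadoop fs -setrep -w 2 /data/course/participants.csv   # change the replication factor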
In layman’s terms: a massive, distributed, C:-drive…
And note: reading in a massive file in a naive way will still end up badly
Distributed storage, not computing!
13 MapReduce
What is MapReduce?
A "programming framework" for coordinating tasks in a distributed environment
HDFS uses this "behind the scenes"
Reading a file is converted to a MapReduce task to read across multiple DataNodes and stream the resulting file
Can be used to construct scalable and fault-tolerant operations in general
HDFS provides a way to store files in a distributed fashion; MapReduce allows you to do something with them in a distributed fashion
14 MapReduce
The concepts of “map” and “reduce” existed long before Hadoop and stem from the domain of functional programming
Map: apply a function on every item in a list: result is a new list of values
numbers = [1, 2, 3, 4, 5]
numbers.map(λ x : x * x)  # [1, 4, 9, 16, 25]
Reduce: apply function on a list: result is a scalar
numbers.reduce(λ x, y : x + y)  # 15
15 MapReduce
A Hadoop map-reduce pipeline works over lists of (key, value) pairs
The map operation maps each pair to a list of output key-value pairs (zero, one, or more)
This operation can be run in parallel over the input pairs
The input list could also contain a single key-value pair
Next, the output entries are shuffled and distributed so that all output entries belonging to the same key are assigned to the same worker
All of these workers then apply a reduce function to each group
Producing a final key-value pair for each distinct key
The resulting final outputs are then (optionally) sorted per key to produce the final outcome
16 MapReduce: word count example
17 MapReduce: word count example
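In the same pseudocode style as the averaging example on the following slides, word count boils down to (a sketch; each input value is assumed to be a line of text):

def map(key, value):
    # emit (word, 1) for every word in the line
    for word in value.split(" "):
        yield (word, 1)

def reduce(key, values):
    # key is a word, values are the counts emitted for it
    yield (key, sum(values))

Note that this reduce can safely be re-applied to partial results: a sum of sums is still a sum.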
18 MapReduce: averaging example
def map(key, value):
    yield (value['genre'], value['nrPages'])
19 MapReduce: averaging example
def reduce(key, values):
    yield (key, sum(values) / len(values))
20 MapReduce: averaging example
There’s a gotcha, however: the reduce operation should work on partial results and be able to be applied multiple times in a chain
21 MapReduce: averaging example
22 MapReduce
The reduce operation should work on partial results and be able to be applied multiple times in a chain
1. The reduce function should output the same structure as emitted by the map function, since this output can be used again in an additional reduce operation 2. The reduce function should provide correct results even if called multiple times on partial results
def map(key, value):
    yield (value['genre'], (value['nrPages'], 1))
def reduce(key, values):
    total, newcount = 0, 0
    for (avg, count) in values:
        # values are (running average, records seen) pairs; initially (nrPages, 1)
        total = total + avg * count
        newcount = newcount + count
    yield (key, (total / newcount, newcount))
Instead of using a running average as a value, our value will now itself be a pair of (running average, number of records already seen)

23 MapReduce: correct averaging example
24 Testing it out

map_reduce.py will be made available as background material
You can use this to play around with the MapReduce paradigm without setting up a full Hadoop stack
# Minimum per group example
from map_reduce import runtask
documents = [
    ('drama', 200), ('education', 100), ('action', 20), ('thriller', 20),
    ('drama', 220), ('education', 150), ('action', 10), ('thriller', 160),
    ('drama', 140), ('education', 160), ('action', 20), ('thriller', 30)
]
# Provide a mapping function of the form mapfunc(value)
# Must yield (k,v) pairs
def mapfunc(value):
    genre, pages = value
    yield (genre, pages)
# Provide a reduce function of the form reducefunc(key, list_of_values)
# Must yield (k,v) pairs
def reducefunc(key, values):
    yield (key, min(values))
# Pass your input list, mapping and reduce functions
runtask(documents, mapfunc, reducefunc)

25 Back to Hadoop
On Hadoop, MapReduce tasks are written using Java
Bindings for Python and other languages exist as well, but Java is the "native" environment
The Java program is packaged as a JAR archive and launched using the command:
hadoop jar myfile.jar ClassToRun [args...]
hadoop jar wordcount.jar RunWordCount /input/dataset.txt /output/
26 Back to Hadoop
public static class MyReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        IntWritable result = new IntWritable();
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
27 Back to Hadoop

hadoop jar wordcount.jar WordCount /users/me/dataset.txt /users/me/output/
28 Back to Hadoop
$ hadoop fs -ls /users/me/output
Found 2 items
-rw-r--r--   1 root hdfs       0 2017-05-20 15:11 /users/me/output/_SUCCESS
-rw-r--r--   1 root hdfs    2069 2017-05-20 15:11 /users/me/output/part-r-00000
$ hadoop fs -cat /users/me/output/part-r-00000
and     2
first   1
is      3
line    2
second  1
the     2
this    3
29 Back to Hadoop
MapReduce tasks can consist of more than mappers and reducers
Partitioners, Combiners, Shufflers, and Sorters
30 MapReduce
Constructing MapReduce programs requires a certain skillset in terms of programming (to put it lightly)
There’s a reason why most tutorials don’t go much further than counting words
Tradeoffs in terms of speed, memory consumption, and scalability
Big does not mean fast
Does your use case really align with that of a search engine?
31 YARN
How is a MapReduce program coordinated amongst the different nodes in the cluster?
In the former Hadoop 1 architecture, the cluster was managed by a service called the JobTracker
In Hadoop 2, MapReduce is split into two components
The cluster resource management capabilities have become YARN, while the MapReduce-specific capabilities remain MapReduce
32 So… Hadoop?
Standard Hadoop: definitely not a turn-key solution for most environments
Just a big hard drive and a way to do scalable MapReduce? In a way which is not fun to program at all?
As such, many implementations and vendors also mix in a number of additional projects such as:
HBase: a distributed database which runs on top of the Hadoop core stack (no SQL, just MapReduce)
Hive: a data warehouse solution with SQL-like query capabilities to handle data in the form of tables
Pig: a framework to manipulate data stored in HDFS without having to write complex MapReduce programs from scratch
Cassandra: another distributed database
Ambari: a web interface for managing Hadoop stacks (managing all these other fancy names)
Flume: a framework to collect and deal with streaming data intakes
Oozie: a more advanced job scheduler that cooperates with YARN
Zookeeper: a centralized service for maintaining configuration information and naming (a cluster on its own)
Sqoop: a connector to move data between Hadoop and relational databases
Atlas: a system to govern metadata and its compliance
Ranger: a centralized platform to define, administer and manage security policies consistently across Hadoop components
Spark: a computing framework geared towards data analytics
33 So… Hadoop?
34 SQL on Hadoop
35 The first letdown
"From the moment a new distributed data store gets popular, the next question will be how to run SQL on top of it… What do you mean it's a file system? How do we query this thing? We need SQL!"
2008: the first release of Apache Hive, the original SQL-on-Hadoop solution
Rapidly became one of the de-facto tools included with almost all Hadoop installations
Hive converts SQL queries to a series of map-reduce jobs, and presents itself to clients in a way which very much resembles a MySQL server
It also offers a command line client, Java APIs and JDBC drivers, which made the project wildly successful and quickly adopted by organizations which were beginning to realize that they'd taken a step back from their traditional data warehouse setups in their rush to switch to Hadoop as soon as possible
SELECT genre, SUM(nrPages)
FROM books
GROUP BY genre
ORDER BY genre
--> converted to a MapReduce job
36 There is (was?) also HBase
The first database on Hadoop
Native database on top of Hadoop
No SQL, own get/put/filter operations
Complex queries as MapReduce jobs
hbase(main):009:0> scan 'users'
ROW    COLUMN+CELL
 seppe column=email:, timestamp=1495293082872, [email protected]
 seppe column=name:first, timestamp=1495293050816, value=Seppe
 seppe column=name:last, timestamp=1495293067245, value=vanden Broucke
1 row(s) in 0.1170 seconds

hbase(main):011:0> get 'users', 'seppe'
COLUMN      CELL
 email:     timestamp=1495293082872, [email protected]
 name:first timestamp=1495293050816, value=Seppe
 name:last  timestamp=1495293067245, value=vanden Broucke
4 row(s) in 0.1250 seconds
37 There is (was?) also Pig
Another way to ease the pain of writing MapReduce programs
Still not very easy though
People still wanted good ole SQL
timesheet = LOAD 'timesheet.csv' USING PigStorage(',');
raw_timesheet = FILTER timesheet BY $0 > 1;
timesheet_logged = FOREACH raw_timesheet GENERATE $0 AS driverId,
    $2 AS hours_logged, $3 AS miles_logged;

grp_logged = GROUP timesheet_logged BY driverId;

sum_logged = FOREACH grp_logged GENERATE group AS driverId,
    SUM(timesheet_logged.hours_logged) AS sum_hourslogged,
    SUM(timesheet_logged.miles_logged) AS sum_mileslogged;
38 Hive
Hive is handy… but SQL-on-Hadoop technologies are not perfect implementations of relational database management systems:
Sacrifice on features such as speed and SQL language compatibility
Support for complex joins lacking
For Hive, the main drawback was its lack of speed
Because of the overhead incurred by translating each query into a series of map-reduce jobs, even the simplest of queries can consume a large amount of time
Big does not mean fast
39 So… without MapReduce?
For a long time, companies such as Hortonworks were pushing the development of Hive, mainly by putting effort behind Apache Tez, which provides a new backend for Hive, no longer based on the map-reduce paradigm but on directed-acyclic-graph pipelines
In 2012, Cloudera, another well-known Hadoop vendor, introduced their own SQL-on-Hadoop technology as part of their "Impala" stack. Cloudera also opted to forego map-reduce completely
It didn’t take long for other vendors to take notice of the need for SQL-on-Hadoop, and in recent years, we saw almost every vendor joining the bandwagon and offering their own query engines (IBM’s BigSQL platform or Oracle’s Big Data SQL, for instance)
Some better, some worse
But…
40 Hype meets reality
"In a tech startup industry that loves its shiny new objects, the term 'Big Data' is in the unenviable position of sounding increasingly '3 years ago'" – Matt Turck
Hadoop was created in 2006!
It’s now been more than a decade since Google’s papers on MapReduce
Interest in the concept of “Big Data” reached fever pitch sometime between 2011 and 2014
Big Data was the new "black", "gold" or "oil"
2015 was probably the year when people started moving to AI and its many related concepts and flavors: machine intelligence, deep learning, etc.
Today, we're in the midst of a new "AI summer" (with its own hype as well)
41 Today
Today, the storage and querying aspect has stabilized and found a good marriage between big data techniques, speed, a return to relational databases, and NoSQL-style scalability
E.g. Amazon Redshift, Snowflake, CockroachDB, Presto, Dremio and many others…
Dozens of other data storage and querying solutions
Cloud warehousing went mainstream again
Don't do it yourself
This is mainly what you’ll hear big data architects talk about
Storage Management Integration Pipelines … plumbing?
42 Big Analytics?
Most managers worth their salt have realized that Hadoop-based solutions might not be the right fit
Proper cloud-based databases might be
This is all nice, but from an analytics point of view, we’re back to the BI days…
SQL-based reports
Dashboards, metrics
On the predictive analytics, machine learning or AI front, we're not much further yet (AI is not a live dashboard with a modern color scheme)
Except for a (more annoying) data storage layer which slows things down
The big unanswered question right now:
“ How to use Hadoop for machine learning and analytics? “
Or rather:
“ How to support distributed analytics? “
43 Big Analytics?
So it turns out that MapReduce was never very well suited for analytics either
Extremely hard to convert techniques to a map-reduce paradigm
Slow due to lots of in-out swapping to HDFS
Ask the Mahout project, they tried
Slow for most "online" tasks…
Querying is nice, but… we just end up with business intelligence dashboarding and pretending we have big data?
“2015 was the year of Apache Spark”
Bye bye, Hadoop!
Spark has been embraced by virtually all players
Spark is meaningful because it effectively addresses some of the key issues that were slowing down the adoption of Hadoop: it is much faster (benchmarks have shown Spark is 10 to 100 times faster than Hadoop's MapReduce), easier to program, and lends itself well to machine learning
44 Spark
45 Spark
Apache Spark focuses on real-time, in-memory, parallelized processing of data. It also comes with its own SQL engine, Spark SQL, which builds on top of it to allow SQL queries to be written against data, and which has become very popular, especially in data mining/science circles
46 So what do we throw out?
The resource manager (YARN)?
We're still running on a cluster of machines
Spark can run on top of YARN, but also Mesos (an alternative resource manager), or even in standalone mode
The data storage (HDFS)?
Again, Spark can work with a variety of storage systems:
Google Cloud
Amazon S3
Apache Cassandra
Apache Hadoop (HDFS)
Apache HBase
Apache Hive
Flat files (JSON, Parquet, CSV, others)
47 So what do we throw out?
One thing that we do “kick out” is MapReduce
Spark is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning
How? Apache Spark replaces the MapReduce paradigm with an advanced DAG execution engine that supports cyclic data flow and in-memory computing

A smarter way to distribute jobs over machines!
Note the similarities with previous projects such as Dask…
48 Spark’s building blocks
49 Spark core
Spark core implements the core abstraction to represent data elements: the Resilient Distributed Dataset (RDD)
The Resilient Distributed Dataset is the primary data abstraction in Apache Spark and represents a collection of data elements
It is designed to support in-memory data storage, distributed across a cluster, in a manner that is demonstrably both fault-tolerant and efficient
Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data
Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes
Once data is loaded into an RDD, two types of operations can be carried out:
Transformations, which create a new RDD by changing the original through processes such as mapping, filtering, and more
Actions, such as counts, which measure but do not change the original data
50 RDDs are distributed, fault-tolerant, efficient
Note that an RDD represents a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel
Any sort of element collection: a collection of text lines, a collection of single words, a collection of objects, a collection of images, a collection of instances, …
The only feature provided is automatic distribution and task management over this collection
Through transformations and actions: do things with the RDD
The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged, and can be repeated in the event of data loss or the failure of a cluster node
Transformations are said to be lazily evaluated: they are not executed until a subsequent action has a need for the result
Where possible, these RDDs remain in memory, greatly increasing the performance of the cluster, particularly in use cases with a requirement for iterative queries or processes
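A small sketch of this laziness in the PySpark shell (the file name is illustrative):

lines = sc.textFile("README.md")                         # transformation: nothing is read yet
words = lines.flatMap(lambda l: l.split(" "))            # still lazy: only extends the lineage
long_words = words.filter(lambda w: len(w) > 5).cache()  # ask Spark to keep the result in memory
print(long_words.count())   # action: only now does the whole chain execute
print(long_words.first())   # served from the cached RDD where possible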
51 RDDs are distributed, fault-tolerant, efficient
52 Writing Spark-based programs
As with MapReduce, an application using the RDD framework can be coded in Java or Scala and packaged as a JAR file to be launched on the cluster
However, Spark also provides an interactive shell interface (Spark shell) to its cluster environment
And also exposes APIs to work directly with the RDD concept in a variety of languages
Scala Java Python R SQL
53 Spark shell (pyspark)
PySpark is the “driver program”: runs on the client and will set up a “SparkContext” (a connection to the Spark cluster)
>>> textFile = sc.textFile("README.md") # sc is the SparkContext
# textFile is now an RDD (each element represents a line of text)
>>> textFile.count()  # Number of items in this RDD
126

>>> textFile.first()  # First item in this RDD
u'# Apache Spark'
# Chaining together a transformation and an action:
# How many lines contain "Spark"?
>>> textFile.filter(lambda line: "Spark" in line).count()
15
54 SparkContext
SparkContext sets up internal services and establishes a connection to a Spark execution environment
Data operations are not executed on your machine: the client sends them to be executed by the Spark cluster!
No data is loaded in the client… unless you'd perform a .toPandas()
55 Deploying an application
Alternative to the interactive mode:
from pyspark import SparkContext
# Set up the context ourselves
sc = SparkContext("local", "Simple App")
logData = sc.textFile("README.md")
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
sc.stop()
Execute using:
/bin/spark-submit MyExampleApp.py
Lines with a: 46, lines with b: 23

56 More on the RDD API
So what can we do with RDDs?
Transformations                             Actions
map(func)                                   reduce(func)
filter(func)                                count()
flatMap(func)                               first()
mapPartitions(func)                         take(n)
sample(withReplacement, fraction, seed)     takeSample(withReplacement, n)
union(otherRDD)                             saveAsTextFile(path)
intersection(otherRDD)                      countByKey()
distinct()                                  foreach(func)
groupByKey()
reduceByKey(func)
sortByKey()
join(otherRDD)
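A small sketch chaining a few of these together (using the SparkContext sc from before):

rdd = sc.parallelize(range(1, 11))
evens = rdd.filter(lambda x: x % 2 == 0)     # transformation
doubled = evens.map(lambda x: x * 2)         # transformation
print(doubled.take(3))                       # action: [4, 8, 12]

pairs = rdd.map(lambda x: (x % 3, x))        # key-value pairs
print(pairs.reduceByKey(lambda a, b: a + b).collect())
# e.g. [(0, 18), (1, 22), (2, 15)] (order may differ)

print(rdd.union(sc.parallelize([1, 2, 100])).distinct().count())  # 11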
57 Examples

https://github.com/wdm0006/DummyRDD
A test class that walks like an RDD and talks like an RDD, but is actually just a list
No real Spark behind it
Nice for testing and learning, however
from dummy_spark import SparkContext, SparkConf
sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)
# Make an RDD from a Python list: a collection of numbers
rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.count())
print(rdd.map(lambda x: x**2).collect())
58 Examples: word count
from dummy_spark import SparkContext, SparkConf
sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)
# Make an RDD from a text file: a collection of lines
text_file = sc.textFile("kuleuven.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \ .map(lambda word: (word, 1)) \ .reduceByKey(lambda a, b: a + b)
print(counts)
59 Examples: filtering
from dummy_spark import SparkContext, SparkConf
sconf = SparkConf()
sc = SparkContext(master='', conf=sconf)
rdd = sc.parallelize(list(range(1, 21)))
print(rdd.filter(lambda x: x % 3 == 0).collect())
60 SparkSQL, DataFrames and Datasets
61 These RDDs still “feel” a lot like MapReduce…
Indeed, many operations are familiar: map , reduce , reduceByKey , …
But remember: the actual execution is more optimized
However, from the perspective of the user, this is still very low-level
Nice if you want low-level control to perform transformations and actions on your dataset
Or when your data is unstructured, such as streams of text
Or you actually want to manipulate your data with functional programming constructs
Or you don't care about imposing a schema, such as a columnar format
But what if you do want to work with tabular structured data… like a data frame?
62 SparkSQL
Like Apache Spark in general, SparkSQL is all about distributed in-memory computations
SparkSQL builds on top of Spark Core with functionality to load and query structured data using queries that can be expressed in SQL, HiveQL, or through high-level APIs similar to e.g. pandas (called the "DataFrame" and "Dataset" APIs in Spark)
At the core of SparkSQL is the Catalyst query optimizer

Since Spark 2.0, Spark SQL is the primary and feature-rich interface to Spark's underlying in-memory distributed platform (hiding Spark Core's RDDs behind higher-level abstractions)
63 SparkSQL
# Note the difference: SparkSession instead of SparkContext from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Python Spark SQL example") \
    .getOrCreate()
# A Spark "DataFrame"
df = spark.read.json("people.json")
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
df.printSchema() # root # |-- age: long (nullable = true) # |-- name: string (nullable = true)
64 SparkSQL
df.select("name").show() df.filter(df['age'] > 21).show() # +------+ # +---+----+ # | name| # |age|name| # +------+ # +---+----+ # |Michael| # | 30|Andy| # | Andy| # +---+----+ # | Justin| # +------+ df.groupBy("age").count().show() # +----+-----+ df.select(df['name'], df['age'] + 1)\ # | age|count| .show() # +----+-----+ # +------+------+ # | 19| 1| # | name|(age + 1)| # |null| 1| # +------+------+ # | 30| 1| # |Michael| null| # +----+-----+ # | Andy| 31| # | Justin| 20| # +------+------+
65 SparkSQL
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
66 DataFrames
Like an RDD, a DataFrame is an immutable distributed collection of data elements
Extends the "free-form" elements by imposing that every element is organized as a set of values in named columns, e.g. (age=30, name=Seppe)
Imposes some additional structure on top of RDDs

Designed to make processing of large data sets easier

This allows for an easier and higher-level abstraction
Provides a domain specific language API to manipulate your distributed data (see examples above)
Makes Spark accessible to a wider audience
Finally, much more in line with what data scientists are actually used to
67 DataFrames
pyspark.sql.SparkSession : Main entry point for DataFrame and SQL functionality
pyspark.sql.DataFrame : A distributed collection of data grouped into named columns
pyspark.sql.Row : A row of data in a DataFrame pyspark.sql.Column : A column expression in a DataFrame
pyspark.sql.GroupedData : Aggregation methods, returned by DataFrame.groupBy()
pyspark.sql.DataFrameNaFunctions : Methods for handling missing data (null values)
pyspark.sql.DataFrameStatFunctions : Methods for statistics functionality
pyspark.sql.functions : List of built-in functions available for DataFrame pyspark.sql.types : List of data types available
pyspark.sql.Window : For working with window functions
68 DataFrames
Class pyspark.sql.DataFrame : A distributed collection of data grouped into named columns:
agg(*exprs) : Aggregate on the entire DataFrame without groups
columns : Returns all column names as a list
corr(col1, col2, method=None) : Calculates the correlation of two columns
count() : Returns the number of rows in this DataFrame
cov(col1, col2) : Calculate the sample covariance for the given columns
crossJoin(other) : Returns the cartesian product with another DataFrame
crosstab(col1, col2) : Computes a pair-wise frequency table of the given columns
describe(*cols) : Computes statistics for numeric and string columns
distinct() : Returns a new DataFrame containing the distinct rows in this DataFrame
drop(*cols) : Returns a new DataFrame that drops the specified column
dropDuplicates(subset=None) : Return a new DataFrame with duplicate rows removed
dropna(how='any', thresh=None, subset=None) : Returns new DataFrame omitting rows with null values
fillna(value, subset=None) : Replace null values

69 DataFrames
Class pyspark.sql.DataFrame : A distributed collection of data grouped into named columns:
filter(condition) : Filters rows using the given condition; where() is an alias for filter()
first() : Returns the first row as a Row
foreach(f) : Applies the f function to all rows of this DataFrame
groupBy(*cols) : Groups the DataFrame using the specified columns
head(n=None) : Returns the first n rows
intersect(other) : Returns an intersection with another DataFrame
join(other, on=None, how=None) : Joins with another DataFrame, using the given join expression
orderBy(*cols, **kwargs) : Returns a new DataFrame sorted by the specified column(s)
printSchema() : Prints out the schema in the tree format
randomSplit(weights, seed=None) : Randomly splits this DataFrame with the provided weights
replace(to_replace, value, subset=None) : Returns a new DataFrame replacing a value with another value
select(*cols) : Projects a set of expressions and returns a new DataFrame
toPandas() : Returns the contents of this DataFrame as a Pandas data frame
union(other) : Returns a new DataFrame containing the union of rows in this frame and another frame
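A sketch combining several of the methods above, reusing the df with "age" and "name" columns from the earlier examples:

from pyspark.sql import functions as F

clean = df.fillna({'age': 0}).dropDuplicates()
adults = clean.filter(clean['age'] >= 21).orderBy('age')
adults.select('name', 'age').show()
adults.agg(F.avg('age')).show()                       # aggregate over the whole DataFrame
train, test = clean.randomSplit([0.8, 0.2], seed=42)  # e.g. for model building
pdf = adults.toPandas()                               # pulls the (small!) result to the driver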
70 DataFrames
Can be loaded in from:
Parquet files
Hive tables
JSON files
CSV files (as of Spark 2)
JDBC (to connect with a database)
AVRO files (using the "spark-avro" library, built-in as of Spark 2.4)
Normal RDDs (given that you specify or infer a "schema")
Can also be converted back to a standard RDD
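For instance (file names and connection settings are illustrative):

df1 = spark.read.parquet("books.parquet")
df2 = spark.read.csv("books.csv", header=True, inferSchema=True)
df3 = spark.read.json("books.json")
df4 = (spark.read.format("jdbc")
       .option("url", "jdbc:postgresql://localhost/mydb")
       .option("dbtable", "books")
       .option("user", "me").option("password", "secret")
       .load())

rdd = df1.rdd                      # back to a plain RDD of Row objects
df5 = spark.createDataFrame(rdd)   # and back again (schema inferred from the Rows)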
71 SparkR
Implementation of the Spark DataFrame API for R
An R package that provides a light-weight frontend to use Apache Spark from R
Way of working very similar to dplyr
Can convert R data frames to SparkDataFrame objects
df <- as.DataFrame(faithful)
groupBy(df, df$waiting) %>%
  summarize(count = n(df$waiting)) %>%
  head(3)

##  waiting count
##1      70     4
##2      67     1
##3      69     2
72 Datasets
Spark Datasets are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface
Like DataFrames, Datasets take advantage of Spark's optimizer by exposing expressions and data fields to a query planner
Datasets extend these benefits with compile-time type safety, meaning production applications can be checked for errors before they are run
A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema
At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and tabular representation
Core idea: where a DataFrame represents a collection of Rows (with a number of named Columns), a Dataset represents a collection of typed objects (with their according typed fields) which can be converted from and to table rows
73 Datasets
Since Spark 2.0, the DataFrame API has merged with the Dataset API, unifying data processing capabilities across libraries
Because of this unification, developers now have fewer concepts to learn or remember, and work with a single high-level and type-safe API called Dataset
However, DataFrame as a name is still used: a DataFrame is a Dataset[Row], so a collection of generic Row objects
74 Datasets
Starting in Spark 2.0, Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API
Dataset represents a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java
Consider DataFrame as an alias for Dataset[Row], where a Row represents a generic untyped JVM object
Since Python and R have no compile-time type-safety, there's only the untyped API, namely DataFrames
Language   Main Abstraction
Scala      Dataset[T] & DataFrame (= Dataset[Row])
Java       Dataset[T]
Python     DataFrame (= Dataset[Row])
R          DataFrame (= Dataset[Row])
75 Datasets
Benefits:
Static typing and run-time type safety: both syntax and analysis errors can now be caught during compilation of our program
High-level abstraction and custom view into structured and semi-structured data
Ease-of-use of APIs with structure
Performance and optimization
For us R and Python users, we can continue using DataFrames knowing that they are built on Dataset[Row]
Most common use case anyways
(A more detailed example will be posted in the background information for those interested)
76 MLlib
77 MLlib
MLlib is Spark’s machine learning (ML) library
Its goal is to make practical machine learning scalable and easy
Think of it as "scikit-learn on Spark"
Provides:
ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines (see the sketch below)
Persistence: saving and loading algorithms, models, and Pipelines
Utilities: linear algebra, statistics, data handling, etc.
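A sketch of a small Pipeline tying these together (the DataFrame df and its columns "genre", "nrPages" and "label" are assumed for illustration):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

indexer = StringIndexer(inputCol="genre", outputCol="genreIdx")
assembler = VectorAssembler(inputCols=["genreIdx", "nrPages"], outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(df)    # fits all stages in order
model.transform(df).select("label", "prediction").show()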
78 MLlib
As of Spark 2.0, the primary Machine Learning API for Spark is the DataFrame-based API in the spark.ml package
Before: spark.mllib was RDD-based
Not a very helpful way of working
MLlib still supported the RDD-based API in spark.mllib
Since Spark 2.3, MLlib's DataFrame-based API has reached feature parity with the RDD-based API, and in Spark 3 the RDD-based API is in "maintenance mode"
79 MLlib
Classification: Logistic regression, Decision tree classifier, Random forest classifier, Gradient-boosted tree classifier, Multilayer perceptron classifier, One-vs-Rest classifier (One-vs-All), Naive Bayes
Regression: Linear regression, Generalized linear regression, Decision tree regression, Random forest regression, Gradient-boosted tree regression, Survival regression, Isotonic regression
Clustering: K-means, Latent Dirichlet allocation (LDA), Bisecting k-means, Gaussian Mixture Model (GMM)
Recommender systems: Collaborative filtering
Validation routines

80 MLlib example
from pyspark.ml.classification import LogisticRegression
training = spark.read.format("libsvm").load("data.txt")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
print("Coefs: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
from pyspark.ml.clustering import KMeans
dataset = spark.read.format("libsvm").load("data/data.txt")
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)
predictions = model.transform(dataset)

centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
81 Conclusions so Far
82 Spark versus…
Spark: a high-performance in-memory data-processing framework
Has been widely adopted and still one of the main computing platforms today
Versus:
MapReduce (a mature batch-processing platform for the petabyte scale): Spark is faster, better suited in an online, analytics setting, and implements data frame and ML concepts and algorithms
Apache Tez: "aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data"
Cloudera was rooting for Spark, Hortonworks for Tez (a few years ago…)
Today: Tez is out! (Hortonworks had to adopt Spark as well, and merged with Cloudera)
Apache Mahout: "the goal is to build an environment for quickly creating scalable performant machine learning applications"
A simple and extensible programming environment and framework for building scalable algorithms
A wide variety of premade algorithms for Apache Spark, H2O, Apache Flink
Before: also a "MapReduce all the things" approach
Kind of an extension to Spark
Though most of the algorithms are also in MLlib… so not that widely used any more!

83 Spark versus…
One contender that is doing well is H2O (http://www.h2o.ai/)
"H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform"
Core is written in Java. Inside H2O, a Distributed Key/Value store is used to access and reference data, models, objects, etc., across all nodes and machines
The algorithms are implemented on top of H2O's distributed Map/Reduce framework and utilize the Java Fork/Join framework for multi-threading
The data is read in parallel, distributed across the cluster, and stored in memory in a columnar format in a compressed way
H2O's REST API allows access to all the capabilities of H2O from an external program or script via JSON over HTTP
They also had the idea of coming up with a better “MapReduce” engine
Based on a distributed Key-value store
In-memory map/reduce
Can work on top of Hadoop (YARN) or standalone
Though not as efficient as Spark's engine

84 H2O
85 H2O
However, H2O was quick to realize the benefits of Spark, and the role they could play: “customers want to use Spark SQL to make a query, feed the results into H2O Deep Learning to build a model, make their predictions, and then use the results again in Spark”
“Sparkling Water”
86 H2O
Web based UI: Flow
Strong support for algorithms
Productionization in mind!
E.g. documentation describes “What happens when you try to predict on a categorical level not seen during training?” and “How does the algorithm handle missing values during testing?”
“A better MLlib”
We see a lot of companies embracing H2O on Spark as the “next extension”
87 H2O
library(h2o)
h2o.init(nthreads=-1, max_mem_size = "2G")
h2o.removeAll()

df <- h2o.importFile(path = normalizePath("./covtype.full.csv"))
splits <- h2o.splitFrame(df, c(0.6, 0.2))

train <- h2o.assign(splits[[1]], "train.hex")
valid <- h2o.assign(splits[[2]], "valid.hex")
test  <- h2o.assign(splits[[3]], "test.hex")

rf1 <- h2o.randomForest(
  training_frame = train,
  validation_frame = valid,
  x = 1:12, y = 13,
  model_id = "rf_covType_v1",
  ntrees = 200,
  stopping_rounds = 2,
  score_each_iteration = T)

summary(rf1)
rf1@model$validation_metrics
h2o.hit_ratio_table(rf1, valid = T)[1,2]
h2o.shutdown(prompt=FALSE)
88 Summary so far
Hadoop letdown: MapReduce not that user-friendly
Not convenient for analytics purposes
More for high-volume batch operations, ETL
What about SQL? What about analytics?
Spark re-uses components of Hadoop
HDFS (or HBase, Hive, flat files, Cassandra, …)
YARN (or Mesos, or stand-alone)
Spark docs:
https://spark.apache.org/docs/latest/index.html
89 Summary so far
Spark has a Directed Acyclic Graph (DAG) execution engine
Supporting cyclic data flow and in-memory computing
DAG approach adopted by many other projects
E.g. Dask, https://dask.pydata.org/en/latest/, Airflow, …
90 Summary so far
DAG engine powers core concept of Resilient Distributed Datasets (RDDs)
Fault-tolerant and efficient
Two types of operations: transformations and actions
Represents an unstructured collection of data (e.g. lines of text, images, vectors, …)
Programs are written interactively (Spark shell)
Or packaged as an application (like with a MapReduce app) Bindings to Scala, Java, Python, R, and support for SQL
RDDs are still a bit hard to learn and use
Low-level control to perform transformations and actions on your dataset
Data not necessarily structured, columnar
SparkSQL is the engine to work with semi-structured data and exposes DataFrame and Dataset APIs
Under the hood it uses RDDs, but it is more suited towards structured analyses

91 Summary so far
MLlib is Spark’s machine learning (ML) library
Before: spark.mllib was RDD-based
Not a very helpful way of working
MLlib will still support the RDD-based API in spark.mllib
MLlib will not add new features to the RDD-based API
Newer features: spark.ml
Classification, regression, clustering, recommender systems
92 Streaming Analytics
93 Streaming analytics?
Not only serve predictions using static models, but also as tightly-integrated components of feedback loops involving dynamic, real-time decision making
Next to Spark, other exciting frameworks have emerged and gained momentum
Kafka, Samza, Flink, Ignite, Kudu, Splunk
Lots of attention today towards streaming / realtime analytics
Although many projects are still in an early stage
And "streaming analytics" might not be that hard
94 Streaming analytics?
With regards to the why, examples include:
Advertising: from mass branding to 1-1 targeting
Fintech: from general advice to personalized robo-advisors
Healthcare: from mass treatment to designer medicine
Retail: from static branding to real-time personalization
Manufacturing: from break-then-fix to repair-before-break
Note: many of these have more to do with “personalization”
Be wary of slogans such as "self-learning models"
95 Streaming analytics
We need to differentiate between different “aspects” of “streaming”
Streaming data: my data source emits events, or a stream of instances, or a stream of …
How can I access historical events? How do I store this? How do I convert this to a data set? How do I send such streams across my data center, clients, applications? (Plumbing?)
Streaming training: my algorithm needs to be trained on a continuous stream
"On-line" algorithms: a very hard problem for many techniques
Requires ad-hoc changes and modifications of each algorithm
Streaming prediction: my model needs to predict on a stream of instances, events, …
Matter of deployment, operationalizing the model

96 Streaming analytics
We need to differentiate between different “aspects” of “streaming”
Streaming data: my data source emits events, or a stream of instances, or a stream of …
If we can do this, do we even need streaming training? Is this really a hard setting?
Streaming training: my algorithm needs to be trained on a continuous stream
See above
Streaming prediction: my model needs to predict on a stream of instances, events, …
Is this really an issue in your setting?
97 Think about your use case
Netflix: 9 million events per second at peak
LinkedIn: 500 billion events per day, ~24 GB per second during peak hours
Bombardier showcased its C Series jetliner carrying Pratt & Whitney's Geared Turbo Fan (GTF) engine, which is fitted with 5,000 sensors that generate up to 10 GB of data per second; a single twin-engine aircraft with an average 12-hr flight time can produce up to 844 TB of data
WhatsApp, Uber, …?
How do you compare?
98 Streaming data, streaming engine
Data can be bounded (finite) or unbounded
Execution engine is streaming or batch
Combinations of both possible!
You can pretend a finite data set comes in as a stream
And you can handle an infinite data set in batches
For finite data-sets, accurate processing is relatively simple
On failure, schedule a new reprocessing
Use checkpoints to make it more efficient
Effectively what Spark does (the resilient in RDD)
For infinite data sets, this is harder
Need to take time into account: time of event creation, ingestion, processing
99 We also need to rethink aggregations
100 We also need to rethink aggregations
101 The reality
What about “on-line” algorithms?
Do we need them? In many cases: perhaps not
Also hard to find "on-line" implementations
Streaming linear regression, online k-means clustering, incremental matrix factorization
E.g. Netflix: 9 million events per second
Real-time (re)training of recommender matrix, but main training still offline
In most cases:
You might need to deal with streaming data: how to store it, access history?
We want to be able to perform online predictions on this data
Training can be done offline
The model can be deployed in a streaming setup
Depending on your needs: re-train every month, week, day, … but this depends more on how fast the feature space changes, not really on how fast the data comes in
102 Streaming overview
https://www.slideshare.net/sbaltagi/apache-flink-realworld-use-cases-for-streaming-analytics
103 Streaming analytics
(We’re on a quest for something to help with analytics in a streaming setting)
Event collectors gather, collect, centralize events
Examples of event collectors include:
Apache Flume: one of the oldest Apache projects, designed to collect, aggregate, and move large data sets such as web server logs to a centralized location
Apache NiFi: a relatively new project. It is based on Enterprise Integration Patterns (EIP), where the data flows through multiple stages and transformations before reaching the destination
Not that helpful…
104 Event brokers
Message/event/data brokers handle message validation, transformation and routing
Message oriented middleware ("MOM")
Mediates communication amongst applications, minimizing the mutual awareness that applications should have of each other in order to be able to exchange messages, effectively implementing decoupling
Message routing (one or more destinations), message transformation, simple message aggregation
In a way which is resilient, fail-safe, scalable
A lot of the streaming data "plumbing" is thus handled by message/event/data brokers
Examples include:
Apache ActiveMQ, Apache Kafka, Celery, RabbitMQ, Redis, ZeroMQ
Especially Kafka is a popular choice: it can be easily integrated with Spark
But in itself: not really much analytics
E.g. sits in-between Uber's mobile app and data lakes
Not that helpful…

105 Event processors
Here we find the actual intelligence
Spark Streaming and Spark Structured Streaming as two very popular options
Spark Streaming enables developers to build streaming applications through Spark's high-level API

Since it runs on Spark, Spark Streaming lets developers reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state
Spark Streaming operates in micro-batching mode, where the batch size is much smaller than in conventional batch processing
Can be put on top of Kafka acting as the message broker: a common approach these days
106 Spark Streaming
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data
DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams
Internally, a DStream is represented as a sequence of RDDs
This helps with regards to ease of use: many of the same actions and operations can be applied

107 Spark Streaming
108 Spark Streaming

pyspark.streaming.StreamingContext is the main entry point for all streaming functionality
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
# Create a DStream that will connect to localhost:9999 as a source
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print out the first ten elements of each RDD generated in this DStream
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
109 Spark Streaming
A Discretized Stream or DStream is the basic abstraction provided by Spark Streaming
It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream
Internally, a DStream is represented by a continuous series of RDDs
Each RDD in a DStream contains data from a certain interval
Similar to RDDs, transformations allow the data from the input DStream to be modified
DStreams support many of the transformations available on normal Spark RDDs
The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream
Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data
110 Spark Streaming
Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream
window length: the duration of the window (3 below)
sliding interval: the interval at which the window operation is performed (2 below)
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, 30, 10)
111 Spark Streaming
You can perform different kinds of joins on streams in Spark Streaming
windowedStream1 = stream1.window(20)
windowedStream2 = stream2.window(60)
joinedStream = windowedStream1.join(windowedStream2)

dataset = ...  # some RDD
windowedStream = stream.window(20)
joinedStream = windowedStream.transform(lambda rdd: rdd.join(dataset))
112 Spark Streaming
Output operations allow a DStream's data to be pushed out to external systems like a database or a file system
Triggers the actual execution of all the DStream transformations
print() prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application; useful for development and debugging (this is called pprint() in the Python API)
saveAsTextFiles(prefix, [suffix]) saves this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]"
saveAsHadoopFiles(prefix, [suffix]) saves this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]"
foreachRDD(func) is the most generic output operator: it applies a function, func, to each RDD generated from the stream (see the sketch below)
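A sketch of foreachRDD in use, continuing the word count example (the output target is hypothetical):

def send_partition(partition):
    # open one connection per partition and push every record out (sketch)
    for record in partition:
        pass  # e.g. connection.send(record)

wordCounts.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))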
113 Spark Streaming
Note that Spark Streaming heavily utilizes RDDs
Not very pleasant
But: you can quite easily use DataFrames and SQL operations on streaming data
Convert the RDD to a DataFrame
You have to create a SparkSession using the SparkContext that the StreamingContext is using
Furthermore, this has to be done such that it can be restarted on driver failures, by creating a lazily instantiated singleton instance of SparkSession (see the sketch below)
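A sketch following the pattern from the Spark Streaming documentation, with words the DStream from the earlier word count example:

from pyspark.sql import Row, SparkSession

def getSparkSessionInstance(sparkConf):
    # lazily instantiated singleton, so it survives driver restarts
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]

def process(time, rdd):
    spark = getSparkSessionInstance(rdd.context.getConf())
    wordsDF = spark.createDataFrame(rdd.map(lambda w: Row(word=w)))
    wordsDF.createOrReplaceTempView("words")
    spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()

words.foreachRDD(process)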
114 Spark Streaming and MLlib
You can also easily use machine learning algorithms provided by MLlib
First of all, there are streaming (on-line) machine learning algorithms which can simultaneously learn from the streaming data as well as apply the model on the streaming data Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data (more common case)
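For the first category, the RDD-based MLlib API offers e.g. StreamingKMeans (a sketch; trainingStream and testStream are assumed DStreams of feature vectors):

from pyspark.mllib.clustering import StreamingKMeans

model = StreamingKMeans(k=2, decayFactor=1.0) \
    .setRandomCenters(3, 1.0, 42)    # dimension, initial weight, seed
model.trainOn(trainingStream)        # cluster centers update as batches arrive
predictions = model.predictOn(testStream)
predictions.pprint()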
115 Spark Structured Streaming
Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended. This leads to a new stream processing model that is very similar to a batch processing model: you express your streaming computation as a standard batch-like query as on a static table, and Spark runs it as an incremental query on the unbounded input table
You can express your streaming computation the same way you would express a batch computation on static data
The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive
You can use the Dataset/DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc
The computation is executed on the same optimized Spark SQL engine
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs
Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming
Structured Streaming was alpha in Spark 2.1 and stable as of 2.2

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
116 Spark Structured Streaming
117 Spark Structured Streaming
118 Spark Structured Streaming
Note that Structured Streaming does not materialize the entire table
It reads the latest available data from the streaming data source, processes it incrementally to update the result, and then discards the source data
It only keeps around the minimal intermediate state data required to update the result
119 Spark Structured Streaming
Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations
In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column
In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into
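A sketch (the streaming DataFrame words is assumed to carry a timestamp column):

from pyspark.sql.functions import window

windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),  # window length, slide interval
    words.word
).count()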
120 Spark Structured Streaming
Structured Streaming can maintain the intermediate state for partial aggregates for a long period of time such that late data can update aggregates of old windows correctly
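To keep that state bounded, a watermark tells the engine how late data may still arrive before it is dropped (a sketch, continuing from the windowed example above):

windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word) \
    .count()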
121 Spark Structured Streaming
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
# This will be a streaming data frame
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()
# Split the lines into words and generate word counts
words = lines.select(
    explode(split(lines.value, " ")).alias("word")
)
wordCounts = words.groupBy("word").count()
# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
122 Other event processors
Apache Storm: originally developed by Nathan Marz at BackType, a company that was acquired by Twitter. After the acquisition, Twitter open sourced Storm before donating it to Apache
Used by Flipboard, Yahoo!, and Twitter, it was a long-time standard for developing distributed, real-time, data processing platforms
Storm is often referred to as "the Hadoop for real-time processing". It is primarily designed for scalability and fault-tolerance
For new projects: not that much in use any more (mainly replaced by Kafka + Spark setups)
Apache Samza: tightly coupled to YARN. Used at LinkedIn and some other places, not much more
Apache Apex: also tightly coupled to YARN. Not much used any more…
Apache Flink: a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
Gaining traction, for use cases where mini-batching is not viable
E.g. used by Lyft, Huawei, Tencent, https://flink.apache.org/poweredby.html
Apache Ignite: a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time
For very heavy duty applications
Apache Beam: based on a unified model for defining and executing data-parallel processing pipelines
123 Event indexers, search, visualization
Tools like Splunk, Kibana
Good commercial offerings available
More concerned with dashboarding on real-time data, not so much learning
124 Conclusions
125 The Apache family
Most of the “big data tooling” are Apache Foundation projects
Open source
Java, enterprisey
Many of these depend on others…
Some new but unfinished, some old and useless?
Some relatively new
Who can make sense of this all?
Future to be seen in terms of adoption and maturity
Hadoop → Spark → Spark + H2O → Spark + Kafka + H2O → Flink → …
126 Key analytics patterns
Key patterns and insights:
1. The hype of big data is over (luckily)
   Distributed storage and querying (the data plumbing): take the best of breed
   Most likely a cloud-based solution (Snowflake, Redshift, BigQuery)!
2. For analytics, multiple patterns emerge
   Development environment: heavily notebook driven in all cases
   Pure Python / notebook / virtual environment driven, either hosted or not (Google, Amazon)
   Spark (+ H2O) (+ Kafka)… other "big" Apache projects losing steam
   Or: Kubeflow and other DAG-based approaches (e.g. Dask, Ray, Airflow) + containerization technologies (e.g. Docker) for scalability and reproducibility
   In some cases: pure TensorFlow/PyTorch-based (deep learning only)
   Choice between hosted and do-it-yourself for all of the above
Important: perform a thorough assessment before committing
No solutions before problems!

127 Focus on what matters
https://landscape.cncf.io/category=streaming-messaging&format=card-mode&grouping=category
"All of these distributed streaming architectures are massively overused. They certainly have their place, but you probably don't need them. I see it all the time with ML work. You wind up using a cluster to overcome the memory inefficiency of Spark, when you could have just used a single machine."

"This has been my experience, too. I worked at just one place that had a really good handle on high-volume, high-velocity streaming data, and they didn't use Flink or Storm or Kafka or anything like that. They mostly just used the KISS principle and a protobuf-style wire format. There is definitely a point where these sorts of scale-out-centric solutions are unavoidable. Short of that point, though, they're probably best avoided."

– https://news.ycombinator.com/item?id=19016997
128 Focus on what matters
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
129 Focus on what matters
130 So what should you do?
Steal from the best…
https://eng.uber.com/scaling-michelangelo/
https://github.com/uber/manifold

131 So what should you do?
Steal from the best…
https://eng.uber.com/uber-big-data-platform/
132 So what should you do?
Steal from the best…
https://medium.com/netflix-techblog/notebook-innovation-591ee3221233
133 So what should you do?
Steal from the best…
https://labs.spotify.com/2020/02/27/how-we-improved-data-discovery-for-data-scientists-at-spotify/
134 So what should you do?
Steal from the best…
https://airflow.apache.org/
https://github.com/spotify/luigi

135 But most importantly
Start with the analytics: big data stacks bring the plumbing, but that doesn't mean you can make soup
Not every organization is engineering driven
You probably do not have the capacity, or even the need, to keep up to date
Be honest: do you have big data? What do you mean with "real time", "streaming"? Make an informed decision
The algorithms are the same everywhere anyway
One use case with small data can lead to much more value than a data lake filled with stuff. Data is not gold
People and process matter: the right governance, thinking about deployment, maintenance, monitoring, and most of all, the business question