“No one at Google uses MapReduce anymore”

Cloud Dataflow, the Dataflow model and parallel processing in 2016.

Martin Görner

#Dataflow @martin_gorner

Before we start...

(Quick Google BigQuery demo)


The Dataflow Model (2015), MillWheel (2013), FlumeJava (2010)

Streaming pipeline

Example: hash tag auto-complete

Pipeline p = Pipeline.create();
p.apply(PubSub.Read...)                   // read tweets: “#argentina scores”, “my #art project”, “watching #armenia vs #argentina” ...
 .apply(Window.into(...))
 .apply(ParDo.of(new ExtractTags()))      // #argentina #art #armenia ...
 .apply(Count.create())                   // (argentina, 5M) (art, 9M) (armenia, 2M)
 .apply(ParDo.of(new ExpandPrefixes()))   // a->(argentina,5M) ar->(argentina,5M) arg->(argentina,5M) ar->(art,9M) ...
 .apply(Top.largestPerKey(3))             // a->[apple, art, argentina] ar->[art, argentina, armenia]
 .apply(PubSub.Write...);                 // write predictions
p.run();
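ExtractTags and ExpandPrefixes are the talk's own DoFns, not shown here. As a rough illustration of what each stage computes, here is a plain-Java simulation over an in-memory list, with no Dataflow SDK involved (class and method names are made up for this sketch):

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the pipeline's stages (no Dataflow SDK):
// ExtractTags -> Count -> ExpandPrefixes -> Top(3). Inputs are made up.
public class HashTagPrefixes {
    // ExtractTags: pull #tags out of each tweet
    static List<String> extractTags(List<String> tweets) {
        List<String> tags = new ArrayList<>();
        for (String t : tweets)
            for (String w : t.split("\\s+"))
                if (w.startsWith("#")) tags.add(w.substring(1));
        return tags;
    }

    // Count: tag -> number of occurrences
    static Map<String, Long> count(List<String> tags) {
        return tags.stream().collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    // ExpandPrefixes + Top(3): every prefix of a tag maps to that tag,
    // keeping only the 3 most frequent tags per prefix
    static Map<String, List<String>> top3PerPrefix(Map<String, Long> counts) {
        Map<String, List<Map.Entry<String, Long>>> byPrefix = new HashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet())
            for (int i = 1; i <= e.getKey().length(); i++)
                byPrefix.computeIfAbsent(e.getKey().substring(0, i), k -> new ArrayList<>()).add(e);
        Map<String, List<String>> top = new HashMap<>();
        byPrefix.forEach((prefix, entries) -> top.put(prefix, entries.stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(3).map(Map.Entry::getKey).collect(Collectors.toList())));
        return top;
    }

    public static void main(String[] args) {
        List<String> tweets = List.of("#argentina scores", "my #art project",
                                      "watching #armenia vs #argentina");
        Map<String, List<String>> top = top3PerPrefix(count(extractTags(tweets)));
        System.out.println(top.get("ar"));  // argentina first (count 2), then art and armenia
    }
}
```

The real pipeline does the same thing, but over an unbounded tweet stream, in parallel, with the counts maintained incrementally per window.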

MapReduce?

[diagram: three map workers (M) shuffling their output to two reduce workers (R)]

Associative example: count

[diagram: eighteen 1s spread over three map+reduce workers; because sum is associative, each worker pre-sums its shard locally (SUM ⇒ 3 3, 3 4, 2 3) and the two reducers only add the partial sums 3 3 2 3 4 3: SUM ⇒ 8 and SUM ⇒ 9]
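The associativity trick above can be sketched in a few lines of plain Java (shard sizes are made up):

```java
import java.util.*;

// Sketch: counting is associative, so each map worker can pre-sum its own
// shard ("combine") and the reducers only add the partial sums.
public class AssociativeCount {
    static int sum(List<Integer> xs) { return xs.stream().mapToInt(Integer::intValue).sum(); }

    public static void main(String[] args) {
        // 18 ones, split across three map+combine workers
        List<List<Integer>> shards = List.of(
                Collections.nCopies(6, 1), Collections.nCopies(7, 1), Collections.nCopies(5, 1));
        // each worker emits one partial sum instead of shipping all its ones
        List<Integer> partials = shards.stream().map(AssociativeCount::sum).toList();
        // the reduce step only adds a handful of partial sums
        int total = sum(partials);
        System.out.println(partials + " -> " + total);  // [6, 7, 5] -> 18
    }
}
```

The payoff is in the shuffle: only one number per worker crosses the network instead of every element.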

Dataflow programming model

PCollection

ParDo(DoFn)         in:  PCollection<T>
                    out: PCollection<S>

GroupByKey          in:  PCollection<KV<K, V>>            // multimap
                    out: PCollection<KV<K, Iterable<V>>>  // unimap

Combine(CombineFn)  in:  PCollection<KV<K, Iterable<V>>>  // unimap
                    out: PCollection<KV<K, V>>            // unimap

Flatten             in:  PCollection<T>, PCollection<T>, ...
                    out: PCollection<T>
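GroupByKey is the only primitive here that changes the shape of the data. Its multimap-to-unimap semantics can be sketched on plain collections (the class name is made up for this sketch):

```java
import java.util.*;

// Sketch of GroupByKey semantics on plain collections: a multimap
// (repeated keys) becomes a "unimap" (each key once, values collected).
public class GroupByKeyDemo {
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> multimap) {
        Map<K, List<V>> unimap = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : multimap)
            unimap.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return unimap;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> kvs = List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        System.out.println(groupByKey(kvs));  // {a=[1, 3], b=[2]}
    }
}
```

In a real runner this is the shuffle: all values for a key are brought to the same worker.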

reinventing Count

Count            in:  PCollection<T>
                 out: PCollection<KV<T, Long>>   // unimap

ParDo:           T => KV<T, 1>

GroupByKey:      PCollection<KV<T, Long>>            // multimap
              => PCollection<KV<T, Iterable<Long>>>  // unimap

Combine(Sum): => PCollection<KV<T, Long>>
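The three steps above can be reassembled on plain collections (a sketch, not SDK code; the class name is made up):

```java
import java.util.*;
import java.util.stream.*;

// Reinventing Count from the three primitives, on plain collections:
// ParDo (T -> KV<T,1>), then GroupByKey, then Combine(Sum).
public class ReinventCount {
    static <T> Map<T, Long> count(List<T> in) {
        // ParDo: each element becomes (element, 1)
        List<Map.Entry<T, Long>> kvs =
                in.stream().map(t -> Map.entry(t, 1L)).collect(Collectors.toList());
        // GroupByKey: multimap -> unimap of KV<T, Iterable<Long>>
        Map<T, List<Long>> grouped = new LinkedHashMap<>();
        for (Map.Entry<T, Long> e : kvs)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        // Combine(Sum): sum each key's ones
        Map<T, Long> out = new LinkedHashMap<>();
        grouped.forEach((k, ones) -> out.put(k, ones.stream().mapToLong(Long::longValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("art", "argentina", "art")));  // {art=2, argentina=1}
    }
}
```

Because Sum is associative, a runner can push it before the shuffle, exactly as in the associative count slide.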

reinventing JOIN

join  in:  two multimaps with the same key type
           PCollection<KV<K, V1>>, PCollection<KV<K, V2>>
      out: a unimap
           PCollection<KV<K, Pair<Collection<V1>, Collection<V2>>>>

ParDo:      KV<K, V1> => KV<K, TaggedUnion<V1, V2>>
            KV<K, V2> => KV<K, TaggedUnion<V1, V2>>

Flatten:    => PCollection<KV<K, TaggedUnion<V1, V2>>>            // multimap

GroupByKey: => PCollection<KV<K, Iterable<TaggedUnion<V1, V2>>>>  // unimap

ParDo:      final type transform
            => PCollection<KV<K, Pair<Collection<V1>, Collection<V2>>>>
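The same recipe on plain collections, with a string prefix standing in for TaggedUnion (a sketch, not SDK code; names are made up):

```java
import java.util.*;

// Sketch of the join recipe: a ParDo tags each side's values (TaggedUnion),
// Flatten merges the two multimaps, GroupByKey groups by key, and a final
// ParDo splits every group back into (left values, right values).
public class ReinventJoin {
    static Map<String, List<List<String>>> join(
            List<Map.Entry<String, String>> left, List<Map.Entry<String, String>> right) {
        // ParDo + Flatten: tag values as "L:" / "R:" and merge into one multimap
        List<Map.Entry<String, String>> flat = new ArrayList<>();
        left.forEach(e -> flat.add(Map.entry(e.getKey(), "L:" + e.getValue())));
        right.forEach(e -> flat.add(Map.entry(e.getKey(), "R:" + e.getValue())));
        // GroupByKey
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : flat)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        // final ParDo: untag into (Collection<V1>, Collection<V2>)
        Map<String, List<List<String>>> out = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> {
            List<String> l = new ArrayList<>(), r = new ArrayList<>();
            for (String v : vs) (v.startsWith("L:") ? l : r).add(v.substring(2));
            out.put(k, List.of(l, r));
        });
        return out;
    }

    public static void main(String[] args) {
        var out = join(List.of(Map.entry("s1", "cart")),
                       List.of(Map.entry("s1", "paid"), Map.entry("s2", "paid")));
        System.out.println(out);  // {s1=[[cart], [paid]], s2=[[], [paid]]}
    }
}
```

One shuffle serves both inputs, which is why the optimiser later treats the whole join as a single MapReduce-like block.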

Execution graph

example: extracting shopping session data from checkout logs, payment logs and website logs

[diagram: IN1 (checkout log) → A → OUT1 (formatted checkout log); IN2, IN3 (payment logs 1 and 2) → B, C → flatten; the flattened payment logs are joined by session id (D, F) with page views per session counted from the website logs (IN4 → E), producing OUT2, a shopping session log: session id, cart contents, payment OK?, page views in session]

graph optimiser: ParallelDo fusion (+ = Combine, GBK = GroupByKey)

[diagram: consumer-producer fusion — a ParallelDo C feeding a ParallelDo D is fused into one ParallelDo C+D; sibling fusion — two ParallelDos C and D reading the same input are fused into one C+D]
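Consumer-producer fusion is essentially function composition: C then D runs as one step, so C's intermediate output is never materialised. A minimal sketch (names made up):

```java
import java.util.List;
import java.util.function.Function;

// Sketch of producer-consumer fusion: two chained ParallelDos C and D
// run as one fused step C+D (function composition), avoiding the
// materialisation of C's intermediate output.
public class FusionDemo {
    // fused = C then D, applied in a single pass per element (C+D)
    static List<String> applyFused(List<String> in,
                                   Function<String, String> c, Function<String, String> d) {
        return in.stream().map(c.andThen(d)).toList();
    }

    public static void main(String[] args) {
        Function<String, String> c = String::toLowerCase;  // stands in for ParallelDo C
        Function<String, String> d = s -> s + "!";         // stands in for ParallelDo D
        List<String> in = List.of("Foo", "Bar");
        List<String> twoSteps = in.stream().map(c).map(d).toList();  // materialises C's output
        System.out.println(applyFused(in, c, d).equals(twoSteps));   // true
    }
}
```

Sibling fusion is the same idea sideways: one pass over the shared input computes both outputs.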

graph optimiser: sinking Flattens (+ = Combine, GBK = GroupByKey)

[diagram: a Flatten of B and C, followed by a ParallelDo D and a GBK, is rewritten by pushing a copy of D above the Flatten onto each branch (B→D, C→D); the Flatten then sits directly before the GBK and can be absorbed into it]

graph optimiser: identify MR-like blocks — “MSCR” for “Map Shuffle Combine Reduce” (+ = Combine, GBK = GroupByKey)

[diagram: fused map stages M1, M2, M3 on IN1–IN3 feed two GBK + Combine + Reduce stages producing OUT1–OUT3; each such block is one MSCR]

graph optimiser on the shopping-session example (+ = Combine, GBK = GroupByKey)

[diagram sequence, one optimisation step per slide:
1. the join and the count are expanded into their primitives — tagging ParallelDos J1, a Flatten, a GBK and an untagging ParallelDo J2 for the join; Cnt, a GBK and a Combine (+) for the count;
2. the Flatten is sunk below the J1 ParallelDos and then absorbed into the GBK;
3. ParallelDos are fused: A+J1, B+D+J1, C+D+J1, E+J1 and J2+F;
4. the remaining MapReduce-like blocks are identified: MSCR1 around the join’s GBK and MSCR2 around the count’s GBK]

graph optimiser: final result

[diagram: the fully optimised pipeline — fused stages such as A+B+C+D+E+J1 on the inputs feed the GBKs, F+J2 produces OUT2, and IN4 goes through Cnt and a Combine (+); the whole graph now executes as a handful of MSCR stages]

Dataflow model

What are you computing?

Where in event time? (windowing)

When are results emitted? (triggering)

How do refinements relate?

Dataflow model: windowing

Windowing divides data into event-time-based finite chunks.

[diagram: per-key event streams over time, chopped into Fixed windows (1 2 3 4), Sliding windows (overlapping 1 2 3) and Sessions (gap-separated bursts of activity per key)]

Often required when doing aggregations over unbounded data.
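For the simplest case, fixed windows, assignment is just bucketing by event timestamp; a plain-Java sketch (no SDK, names made up):

```java
import java.util.*;

// Sketch of fixed-window assignment: each event is bucketed by its event
// timestamp into a window of fixed width; aggregation then runs per window.
public class FixedWindowsDemo {
    static long windowStart(long eventTimeMillis, long widthMillis) {
        return eventTimeMillis - (eventTimeMillis % widthMillis);
    }

    public static void main(String[] args) {
        long twoMin = 2 * 60 * 1000;
        long[] eventTimes = {10_000, 90_000, 130_000, 250_000};  // made-up timestamps
        Map<Long, Integer> countsPerWindow = new TreeMap<>();
        for (long t : eventTimes)
            countsPerWindow.merge(windowStart(t, twoMin), 1, Integer::sum);
        System.out.println(countsPerWindow);  // {0=2, 120000=1, 240000=1}
    }
}
```

Sliding windows assign each event to several overlapping buckets, and sessions merge windows whose events are closer than the gap; the per-window aggregation step stays the same.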

What Where When How

Event-time windowing: out-of-order

[diagram: input elements plotted against processing time (10:00–15:00) arrive out of order relative to the event-time windows (10:00–15:00) they are output into]


Watermark

[diagram: event time vs processing time; the watermark tracks how far the system believes event time has progressed, and the lag between processing time and event time is the skew]

The watermark gives the correct definition of “late data”: data that arrives behind the watermark.


What, When, Where, How?

PCollection<KV<String, Integer>> scores = p
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
    .triggering(AfterWatermark().pastEndOfWindow()            // default behaviour
      .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
      .withLateFirings(AtCount(1)))
    .withAllowedLateness(Duration.standardHours(1))
    .accumulatingFiredPanes() / .discardingFiredPanes())
  .apply(Sum.integersPerKey());


The Dataflow model

What are you computing?     ParDo, GroupByKey, Flatten, Combine

Where in event time?        FixedWindows, SlidingWindows, Sessions

When are results emitted?   triggering: AfterWatermark, AfterProcessingTime, AfterPane.elementCount…, withEarlyFirings, withLateFirings, withAllowedLateness

How do refinements relate?  accumulatingFiredPanes, discardingFiredPanes

Apache Beam (incubating): the “Dataflow model” as an open-source lingua franca for unified batch and streaming data processing.

Run it in the cloud: Google Cloud Dataflow

Run it on premise: Apache Flink runner for Beam, Apache Spark runner for Beam

Demo time

1 week of data

3M taxi rides

One point every 2s on each ride

Accelerated 8x

20,000 events/s

Streamed from Google Pub/Sub

Demo: NYC Taxis

[architecture: PubSub (event queue) → Dataflow → PubSub (event queue) → Visualisation (Javascript); Dataflow also writes to BigQuery (data warehousing and interactive analysis)]

Thank you!

cloud.google.com/dataflow

beam.incubator.apache.org

Martin Görner

Google Developer relations @martin_gorner
