“No one at Google uses MapReduce anymore”
Cloud Dataflow, the Dataflow model and parallel processing in 2016.
Martin Görner
#Dataflow @martin_gorner Before we start...
(Quick Google BigQuery demo)
The Dataflow Model (2015) · MillWheel (2013) · FlumeJava (2010)
Streaming pipeline
Example: hash tag auto-complete
Pipeline p = Pipeline.create();
p.apply(PubSub.Read...)                   // Tweets: "#argentina scores", "my #art project", "watching #armenia vs #argentina"
 .apply(Window.into())
 .apply(ParDo.of(new ExtractTags()))      // #argentina #art #armenia #argentina
 .apply(Count.create())                   // (argentina, 5M) (art, 9M) (armenia, 2M)
 .apply(ParDo.of(new ExpandPrefixes()))   // a->(argentina,5M) ar->(argentina,5M) arg->(argentina,5M) ar->(art,9M) ...
 .apply(Top.largestPerKey(3))             // a->[apple, art, argentina] ar->[art, argentina, armenia]
 .apply(PubSub.Write...);                 // write Predictions
p.run();
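The ExtractTags and ExpandPrefixes DoFns are not shown on the slide. As a sketch of what ExpandPrefixes might do (plain Java, no Dataflow SDK; the method name `prefixes` is illustrative), each tag is expanded into every prefix a user could have typed so far:

```java
import java.util.ArrayList;
import java.util.List;

public class ExpandPrefixes {
    // Expand a tag into all of its non-empty prefixes, so that
    // "argentina" can be suggested after typing "a", "ar", "arg", ...
    static List<String> prefixes(String tag) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= tag.length(); i++) {
            out.add(tag.substring(0, i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("art")); // [a, ar, art]
    }
}
```

In the pipeline, each emitted prefix becomes a key, so Top(3) can later keep the three most frequent completions per prefix.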
MapReduce?
(diagram: three Map workers feed two Reduce workers)
Associative example: count
(diagram: 18 input values of 1; three combined M+R workers pre-sum their shards — SUM⇒ 3 3, SUM⇒ 3 4, SUM⇒ 2 3 — and two final reducers sum the partials: SUM⇒ 8, SUM⇒ 9)
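Because SUM is associative, each worker can pre-aggregate its own shard and only the small partial results cross the network to the reducers. A minimal plain-Java sketch of the idea (no Dataflow SDK involved):

```java
import java.util.Arrays;
import java.util.List;

public class AssociativeCount {
    // Combined map+reduce on each shard: pre-sum locally.
    static int partialSum(List<Integer> shard) {
        return shard.stream().mapToInt(Integer::intValue).sum();
    }

    // Final reduce: sum the partials. Associativity guarantees this
    // equals summing all elements in one place.
    static int total(List<List<Integer>> shards) {
        return shards.stream().mapToInt(AssociativeCount::partialSum).sum();
    }

    public static void main(String[] args) {
        List<List<Integer>> shards = Arrays.asList(
            Arrays.asList(1, 1, 1, 1, 1, 1),
            Arrays.asList(1, 1, 1, 1, 1, 1, 1),
            Arrays.asList(1, 1, 1, 1, 1));
        System.out.println(total(shards)); // 18
    }
}
```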
Dataflow programming model
PCollection<T>
GroupByKey: in PCollection<KV<K, V>> => out PCollection<KV<K, Iterable<V>>>
reinventing Count
in: PCollection<T>
ParDo: T => KV<T, 1>
GroupByKey: => PCollection<KV<T, Iterable<Integer>>>
Combine(Sum): => PCollection<KV<T, Integer>>
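The three steps above can be sketched in plain Java (again without the SDK): emit (element, 1), group the 1s by key, then sum each group:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReinventCount {
    // Steps 1+2: the ParDo emits KV<T, 1>, GroupByKey collects the 1s per key.
    static Map<String, List<Integer>> groupByKey(List<String> in) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String t : in) {
            grouped.computeIfAbsent(t, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Step 3: Combine(Sum) reduces each Iterable<Integer> to a count.
    static Map<String, Integer> count(List<String> in) {
        Map<String, Integer> counts = new HashMap<>();
        groupByKey(in).forEach((k, ones) ->
            counts.put(k, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("art", "art", "armenia")));
    }
}
```

Because SUM is associative, a runner is free to apply the Combine before the shuffle as well, which is exactly the optimisation the previous slide showed.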
reinventing JOIN
in: two multimaps with the same key type: PCollection<KV<K, V1>> and PCollection<KV<K, V2>>
ParDo: KV<K, Vi> => KV<K, TaggedUnion>
Flatten: => one PCollection<KV<K, TaggedUnion>>
GroupByKey: => PCollection<KV<K, Iterable<TaggedUnion>>>
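Likewise, the join can be sketched in plain Java: tag values from each side so their origin survives, flatten both into one collection, and group by key (the class name `Tagged` is illustrative, not an SDK type):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReinventJoin {
    // TaggedUnion: remembers which input collection a value came from.
    static final class Tagged {
        final int side; final Object value;
        Tagged(int side, Object value) { this.side = side; this.value = value; }
    }

    // ParDo tags each side, Flatten merges them, GroupByKey groups per key.
    static Map<String, List<Tagged>> join(Map<String, String> left,
                                          Map<String, Integer> right) {
        List<Map.Entry<String, Tagged>> flat = new ArrayList<>();
        left.forEach((k, v) -> flat.add(new AbstractMap.SimpleEntry<>(k, new Tagged(1, v))));
        right.forEach((k, v) -> flat.add(new AbstractMap.SimpleEntry<>(k, new Tagged(2, v))));
        Map<String, List<Tagged>> grouped = new HashMap<>();
        for (Map.Entry<String, Tagged> e : flat) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, List<Tagged>> g =
            join(Map.of("session1", "cart: 2 items"), Map.of("session1", 42));
        System.out.println(g.get("session1").size()); // 2
    }
}
```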
Execution graph
Example: extracting shopping session data.
(diagram: IN1 checkout logs → A → OUT1, a formatted checkout log; IN2 payment logs 1 → B and IN3 payment logs 2 → C → flatten → join D by session id; IN4 website logs → E → count of page views in session; the join and the count feed F → OUT2, a shopping session log: session id, cart contents, payment OK?, page views)
graph optimiser: ParallelDo fusion
(legend: + = Combine, GBK = GroupByKey)
consumer-producer fusion: a ParDo C feeding a ParDo D becomes a single ParDo C+D
sibling fusion: ParDos C and D reading the same input become a single ParDo C+D
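Consumer-producer fusion is essentially function composition: instead of materialising C's output and re-reading it in D, the fused ParDo applies D directly to each element C emits. A plain-Java sketch (the functions `c` and `d` are illustrative stand-ins for two ParDos):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class Fusion {
    // Two ParDos...
    static Function<String, String> c = String::trim;
    static Function<String, Integer> d = String::length;

    // ...fused into one: C+D runs in a single pass, and no intermediate
    // collection is written between the two steps.
    static List<Integer> fused(List<String> in) {
        Function<String, Integer> cPlusD = c.andThen(d);
        return in.stream().map(cPlusD).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(fused(List.of("  art ", "armenia"))); // [3, 7]
    }
}
```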
graph optimiser: sink Flattens
(animation: B and C feed a Flatten whose output runs through a ParDo D into a GBK; sinking the Flatten duplicates D onto the B branch and the C branch, and the Flatten, now feeding the GBK directly, can be absorbed into it)
graph optimiser: identify MapReduce-like blocks
“Map Shuffle Combine Reduce” (MSCR)
(diagram: fused ParDos M1, M2, M3 on IN1, IN2, IN3 feed two GBK + Combine + Reduce stages producing OUT1, OUT2, OUT3; the whole block runs as one MSCR)
graph optimiser: the shopping-session example, step by step.
Start from the raw graph: IN1 → A → OUT1; IN2 → B and IN3 → C → flatten → join D; IN4 → E → count; the join and the count feed F → OUT2.
First, expand the composite count into its primitives: a ParDo Cnt, a GBK and a Combine (+) on the IN4 → E branch.
Next, expand the composite join: a ParDo J1 on each input branch, a Flatten, a GBK and a ParDo J2 ahead of F.
Sink the Flattens: D and J1 are duplicated onto the B and C branches, so the Flatten moves directly in front of the GBK, where it can be absorbed.
Fuse the ParallelDos: consumer-producer and sibling fusion collapse the chains into single ParDos A+J1, B+D+J1, C+D+J1 and E+J1, and J2 fuses with F into J2+F.
Finally, identify the MapReduce-like blocks: the join's GBK and the count's GBK each anchor one MSCR (MSCR1 and MSCR2), so the whole pipeline executes as two MapReduce-like stages.
graph optimiser: the final optimised graph
(diagram: each input runs a single fused ParDo, e.g. A+B+C+D+E+J1, feeding the two stages; F+J2 produces the outputs and Cnt with its Combine produces the count)
Dataflow model: four questions.
What are you computing?
Where in event time?
When are results emitted?
How do refinements relate?
Dataflow model: windowing
Windowing divides data into event-time-based finite chunks.
(diagram: Fixed windows 1 2 3 4; Sliding windows 1 2 3; Sessions per key — Key 1, Key 2, Key 3 — split where gaps in activity occur)
Often required when doing aggregations over unbounded data.
What Where When How
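Window assignment itself is simple arithmetic over event timestamps. A plain-Java sketch for fixed and sliding windows, under the assumption that windows are identified by their start time (session windows additionally merge gap-extended intervals that overlap):

```java
import java.util.ArrayList;
import java.util.List;

public class Windows {
    // Fixed windows: every timestamp falls into exactly one window of
    // the given size, aligned at multiples of size.
    static long fixedWindowStart(long eventTs, long size) {
        return eventTs - (eventTs % size);
    }

    // Sliding windows: a new window starts every `period`, so each
    // timestamp falls into size/period overlapping windows.
    static List<Long> slidingWindowStarts(long eventTs, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long last = eventTs - (eventTs % period); // latest window starting at or before eventTs
        for (long s = last; s > eventTs - size; s -= period) {
            starts.add(s);
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(125, 60));        // 120
        System.out.println(slidingWindowStarts(125, 60, 30)); // [120, 90]
    }
}
```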
Event-time windowing: out-of-order data
(diagram: the same elements plotted against processing time 10:00–15:00 on input and against event time 10:00–15:00 on output; elements arrive out of order relative to their event times)
Watermark
(diagram: the watermark as a curve in processing time vs event time; its horizontal distance from the ideal line is the skew)
The watermark gives the correct definition of “late data”: an element is late if it arrives after the watermark has passed its event time.
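With a watermark in hand, lateness becomes a precise predicate rather than a heuristic. A minimal sketch, assuming timestamps and the watermark are plain longs on the same event-time axis:

```java
public class Watermark {
    // The watermark is the system's claim: "all data with event time
    // earlier than this has (probably) arrived."
    static boolean isLate(long eventTime, long watermark) {
        return eventTime < watermark;
    }

    // With allowed lateness, late elements are still accepted for a
    // while after the watermark before being dropped.
    static boolean isDroppable(long eventTime, long watermark, long allowedLateness) {
        return eventTime < watermark - allowedLateness;
    }

    public static void main(String[] args) {
        System.out.println(isLate(1000, 1200));           // true
        System.out.println(isDroppable(1000, 1200, 500)); // false: within allowed lateness
    }
}
```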
What, Where, When, How?
PCollection
    .triggering(AfterWatermark().pastEndOfWindow()            // default behaviour
        .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
        .withLateFirings(AtCount(1)))
    .withAllowedLateness(Duration.standardHours(1))
    .accumulatingFiredPanes()  // or .discardingFiredPanes()
    .apply(Sum.integersPerKey());
The Dataflow model
What are you computing?      ParDo, GroupByKey, Flatten, Combine
Where in event time?         FixedWindows, SlidingWindows, Sessions
When are results emitted?    triggering: AfterWatermark, AfterProcessingTime, AfterPane.elementCount…, withEarlyFirings, withLateFirings
How do refinements relate?   withAllowedLateness, accumulatingFiredPanes, discardingFiredPanes
Apache Beam (incubating)
Apache Beam: the “Dataflow model” as an open-source lingua franca for unified batch and streaming data processing.
Run it in the cloud: Google Cloud Dataflow
Run it on premise: Apache Flink runner for Beam, Apache Spark runner for Beam
Demo time
1 week of data
3M taxi rides
One point every 2s on each ride
Accelerated 8x
20,000 events/s
Streamed from Google Pub/Sub
Demo: NYC Taxis
PubSub (event queue) → Dataflow → PubSub (event queue) → Visualisation (Javascript)
Dataflow → BigQuery (data warehousing and interactive analysis)
Thank you!
cloud.google.com/dataflow
beam.incubator.apache.org
Martin Görner
Google Developer relations @martin_gorner