“No one at Google uses MapReduce anymore”
“No one at Google uses MapReduce anymore”
Cloud Dataflow, the Dataflow model and parallel processing in 2016.
Martin Görner
#Dataflow @martin_gorner

Before we start... (Quick Google BigQuery demo)

Lineage of the model:
  The Dataflow Model (2015)
  MillWheel (2013)
  FlumeJava (2010)

Streaming pipeline example: hash tag auto-complete

  Tweets -> read -> Window -> ExtractTags -> Count -> ExpandPrefixes -> Top(3) -> write -> Predictions

  tweets:   "#argentina scores", "my #art project", "watching #armenia vs #argentina"
  tags:     #argentina #art #armenia #argentina
  counts:   (argentina, 5M) (art, 9M) (armenia, 2M)
  prefixes: a->(argentina,5M)  ar->(argentina,5M)  arg->(argentina,5M)  ar->(art,9M) ...
  top 3:    a->[apple, art, argentina]  ar->[art, argentina, armenia]

Pipeline p = Pipeline.create();
p.apply(PubSub.Read...)
 .apply(Window.into(...))
 .apply(ParDo.of(new ExtractTags()))
 .apply(Count.create())
 .apply(ParDo.of(new ExpandPrefixes()))
 .apply(Top.largestPerKey(3))
 .apply(PubSub.Write...);
p.run();

MapReduce?
[diagram: three mappers M feeding two reducers R through a shuffle]

Associative example: count
[diagram: because SUM is associative, each of the three M+R workers pre-sums
the ones in its own shard per key, and the two reducers R then only sum the
partial sums (SUM ⇒ 8, SUM ⇒ 9)]

Dataflow programming model

PCollection<T>

ParDo(DoFn)
  in:  PCollection<T>
  out: PCollection<S>

GroupByKey
  in:  PCollection<KV<K,V>>               // multimap
  out: PCollection<KV<K,Collection<V>>>   // unimap

Combine(CombineFn)
  in:  PCollection<KV<K,Collection<V>>>   // unimap
  out: PCollection<KV<K,V>>               // unimap

Flatten
  in:  PCollection<T>, PCollection<T>, ...
  out: PCollection<T>
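Outside the SDK, the semantics of the four primitives can be sketched with plain Java collections. This is a toy in-memory model, not the Dataflow API; the class and method names are mine:

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// Toy in-memory model of the four Dataflow primitives (not the real SDK).
class DataflowModel {
    // ParDo: apply a DoFn to every element; each call may emit 0..n outputs.
    static <T, S> List<S> parDo(List<T> in, Function<T, List<S>> doFn) {
        return in.stream().flatMap(t -> doFn.apply(t).stream()).collect(Collectors.toList());
    }

    // GroupByKey: multimap of (K,V) pairs -> unimap K -> Collection<V>.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> in) {
        Map<K, List<V>> out = new HashMap<>();
        for (Map.Entry<K, V> e : in)
            out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return out;
    }

    // Combine: collapse each key's values with an associative CombineFn.
    static <K, V> Map<K, V> combine(Map<K, List<V>> in, BinaryOperator<V> fn) {
        Map<K, V> out = new HashMap<>();
        in.forEach((k, vs) -> out.put(k, vs.stream().reduce(fn).orElseThrow()));
        return out;
    }

    // Flatten: concatenate several collections of the same element type.
    @SafeVarargs
    static <T> List<T> flatten(List<T>... ins) {
        List<T> out = new ArrayList<>();
        for (List<T> in : ins) out.addAll(in);
        return out;
    }
}
```

The multimap/unimap distinction on the slides is exactly the difference between the `List<Map.Entry<K,V>>` input and the `Map<K, List<V>>` output of `groupByKey` here.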
Reinventing Count
  in:  PCollection<T>
  out: PCollection<KV<T,Integer>>                       // unimap

  ParDo:        T => KV<T, 1>
  GroupByKey:   PCollection<KV<T, 1>>                   // multimap
                => PCollection<KV<T, Collection<1>>>    // unimap
  Combine(Sum): => PCollection<KV<T, Integer>>

Reinventing JOIN
  in:  two multimaps with the same key type:
       PCollection<KV<K,V>>, PCollection<KV<K,W>>
  out: a unimap PCollection<KV<K, Pair<Collection<V>,Collection<W>>>>

  ParDo:       KV<K,V> => KV<K, TaggedUnion<V,W>>
               KV<K,W> => KV<K, TaggedUnion<V,W>>
  Flatten:     => PCollection<KV<K, TaggedUnion<V,W>>>              // multimap
  GroupByKey:  => PCollection<KV<K, Collection<TaggedUnion<V,W>>>>  // unimap
  ParDo:       final type transform
               => PCollection<KV<K, Pair<Collection<V>, Collection<W>>>>

Execution graph example: extracting shopping session data
[diagram: IN1 checkout logs -> A -> OUT1 formatted checkout log;
 IN2 payment logs 1 -> B and IN3 payment logs 2 -> C; B, C -> flatten -> D;
 IN4 website logs -> E -> count of page views in session;
 A, D and the count joined by session id -> F -> OUT2 shopping session log:
 session id, cart contents, payment OK?, page views in session]
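The JOIN recipe above (ParDo to tag, Flatten, GroupByKey, ParDo to retype) can be sketched in plain Java. Again a toy model, not the SDK; `Tagged` simulates the slide's `TaggedUnion<V,W>` with a left/right pair, and all names are mine:

```java
import java.util.*;

// Toy join of two multimaps with the same key type, following the slide's
// ParDo -> Flatten -> GroupByKey -> ParDo recipe (not the Dataflow SDK).
class ToyJoin {
    // TaggedUnion<V,W>: carries either a V (left) or a W (right).
    record Tagged<V, W>(V left, W right) {}
    // Per-key join result: the two value collections, paired.
    record Joined<V, W>(List<V> lefts, List<W> rights) {}

    static <K, V, W> Map<K, Joined<V, W>> join(List<Map.Entry<K, V>> lhs,
                                               List<Map.Entry<K, W>> rhs) {
        // ParDo + Flatten: tag both inputs into one stream of KV<K, Tagged<V,W>>.
        List<Map.Entry<K, Tagged<V, W>>> tagged = new ArrayList<>();
        for (var e : lhs) tagged.add(Map.entry(e.getKey(), new Tagged<V, W>(e.getValue(), null)));
        for (var e : rhs) tagged.add(Map.entry(e.getKey(), new Tagged<V, W>(null, e.getValue())));
        // GroupByKey + final ParDo: split each key's tagged values back into two lists.
        Map<K, Joined<V, W>> out = new HashMap<>();
        for (var e : tagged) {
            Joined<V, W> j = out.computeIfAbsent(e.getKey(),
                    k -> new Joined<>(new ArrayList<>(), new ArrayList<>()));
            if (e.getValue().left() != null) j.lefts().add(e.getValue().left());
            else j.rights().add(e.getValue().right());
        }
        return out;
    }
}
```

A key absent from one side simply ends up with an empty collection on that side, which is what the shopping-session join needs (e.g. a session with no payment record).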
Graph optimiser (legend: + = Combine, GBK = GroupByKey)

ParallelDo fusion
  Consecutive (producer-consumer) and sibling ParallelDos C and D are fused
  into a single ParallelDo C+D.

Sinking Flattens
  A Flatten consumed by a ParallelDo (B, C -> flatten -> D -> GBK) is rewritten
  by pushing copies of the consumer above the Flatten
  (B -> D, C -> D -> flatten -> GBK), so each copy of D can then be fused
  with its producer.

Identify MapReduce-like blocks: "Map Shuffle Combine Reduce" (MSCR)
  A group of mapper ParallelDos (M1, M2, M3), a shuffle stage (GBK + Combine)
  and reducer ParallelDos (R1, R2) are packaged into a single MSCR operation.

Applied to the shopping-session graph, step by step:
  - the count expands into Cnt = ParDo + GBK + Combine, and the join into
    J1 (tagging ParDos) + flatten + GBK + J2;
  - the Flatten is sunk below the tagging ParDos J1;
  - ParallelDos are fused: A+J1, B+D+J1, C+D+J1, E+J1, J2+F;
  - two MapReduce-like blocks are identified: MSCR1 (the join's GBK feeding
    J2+F -> OUT2) and MSCR2 (the count's GBK + Combine);
  - the final graph is a handful of fused stages (A+B+C+D+E+J1 mappers,
    F+J2 and Cnt reducers) instead of the original many-step pipeline.
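The payoff of producer-consumer fusion can be shown outside the SDK: two chained per-element functions become one composed function applied in a single pass, with no materialised intermediate collection. A toy illustration (my names, not FlumeJava's actual rewriter):

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// Toy producer-consumer ParallelDo fusion: stages C and D collapse into C+D.
class Fusion {
    // Unfused: two traversals and one intermediate collection between C and D.
    static List<Integer> unfused(List<Integer> in,
                                 Function<Integer, Integer> c,
                                 Function<Integer, Integer> d) {
        List<Integer> intermediate = in.stream().map(c).collect(Collectors.toList());
        return intermediate.stream().map(d).collect(Collectors.toList());
    }

    // Fused: compose the DoFns (C+D) and traverse the input once.
    static List<Integer> fused(List<Integer> in,
                               Function<Integer, Integer> c,
                               Function<Integer, Integer> d) {
        return in.stream().map(c.andThen(d)).collect(Collectors.toList());
    }
}
```

Both produce the same PCollection, which is why the optimiser is free to rewrite the graph; the fused form just does less I/O and scheduling work.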
Dataflow model: four questions
  What are you computing?
  Where in event time?           (windowing)
  When are results emitted?      (triggering)
  How do refinements relate?

Dataflow model: windowing
Windowing divides data into event-time-based finite chunks: Fixed windows,
Sliding windows, or per-key Sessions. It is often required when doing
aggregations over unbounded data.
[diagram: fixed windows 1..4; overlapping sliding windows; session windows
for Key 1, Key 2, Key 3 along a time axis]

Event-time windowing: out-of-order input
[diagram: the same elements plotted by processing time (10:00-15:00, as they
arrive) and by event time (10:00-15:00, when they occurred)]

Watermark
The watermark is the system's estimate of how far event time has progressed;
the lag between processing time and event time is the watermark skew. It gives
the correct definition of "late data": data arriving behind the watermark.
[diagram: watermark curve on processing-time vs event-time axes]

What, When, Where, How?

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark().pastEndOfWindow()   // default behaviour
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .withAllowedLateness(Duration.standardHours(1))
        .accumulatingFiredPanes())  // or .discardingFiredPanes()
    .apply(Sum.integersPerKey());

The Dataflow model
  What are you computing?    ParDo, GroupByKey, Flatten, Combine
  Where in event time?       FixedWindows, SlidingWindows, Sessions
  When are results emitted?  triggering: AfterWatermark, AfterProcessingTime,
                             AfterPane.elementCount…, withAllowedLateness,
                             withEarlyFirings, withLateFirings
  How do refinements relate? accumulatingFiredPanes, discardingFiredPanes
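How an event timestamp maps to windows can be sketched with simple arithmetic. A toy illustration with my own names (in the real SDK a WindowFn does this per element; timestamps and durations here are plain seconds):

```java
import java.util.*;

// Toy event-time window assignment (timestamps and durations in seconds).
class Windows {
    // Fixed windows: each timestamp falls into exactly one window.
    static long fixedWindowStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    // Sliding windows: each timestamp falls into size/period overlapping windows.
    static List<Long> slidingWindowStarts(long ts, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long last = ts - Math.floorMod(ts, period); // last window starting at or before ts
        for (long s = last; s > ts - size; s -= period) starts.add(s);
        return starts;
    }

    // Sessions: an event extends the current session for its key unless the
    // gap since the previous event reaches the session gap duration.
    static boolean sameSession(long prevTs, long ts, long gap) {
        return ts - prevTs < gap;
    }
}
```

Fixed and sliding windows are pure functions of the timestamp, which is why they parallelise trivially; sessions depend on neighbouring events per key, which is why they need the GroupByKey machinery.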
Apache Beam (incubating)
Apache Beam: the "Dataflow model" as an open-source lingua franca for unified
batch and streaming data processing.
  Run it in the cloud:  Google Cloud Dataflow
  Run it on premise:    Apache Flink runner for Beam, Apache Spark runner for Beam

Demo time
  1 week of data, 3M taxi rides
  One point every 2s on each ride
  Accelerated 8x: 20,000 events/s
  Streamed from Google Pub/Sub

Demo: NYC Taxis
[diagram: PubSub (event queue) -> Dataflow -> PubSub (event queue) ->
 Visualisation (Javascript); Dataflow also writes to BigQuery
 (data warehousing and interactive analysis)]

Thank you!
  cloud.google.com/dataflow
  beam.incubator.apache.org
Martin Görner, Google Developer relations, @martin_gorner