Portable Stateful Big Data Processing in Apache Beam

Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC Software Engineer @ Google https://s.apache.org/ffsf-2017-beam-state [email protected] / @KennKnowles Flink Forward San Francisco 2017 Agenda 1. What is Apache Beam? 2. State 3. Timers 4. Example & Little Demo What is Apache Beam? TL;DR (Flink draws it more like this) 4 DAGs, DAGs, DAGs Apache Beam Apache Flink Apache Cloud Hadoop Apache Apache Dataflow Spark Samza MapReduce Apache Apache Apache (paper) Storm Gearpump Apex (incubating) FlumeJava (paper) Heron MillWheel (paper) Dataflow Model (paper) 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 Apache Flink local, on-prem, The Beam Vision cloud Cloud Dataflow: Java fully managed input.apply( Apache Spark Sum.integersPerKey()) local, on-prem, cloud Sum Per Key Apache Apex Python local, on-prem, cloud input | Sum.PerKey() Apache Gearpump (incubating) ⋮ ⋮ 6 Apache Flink local, on-prem, The Beam Vision cloud Cloud Dataflow: Python fully managed input | KakaIO.read() Apache Spark local, on-prem, cloud KafkaIO Apache Apex ⋮ local, on-prem, cloud Apache Java Gearpump (incubating) class KafkaIO extends UnboundedSource { … } ⋮ 7 The Beam Model PTransform Pipeline PCollection (bounded or unbounded) 8 The Beam Model What are you computing? (read, map, reduce) Where in event time? (event time windowing) When in processing time are results produced? (triggers) How do refinements relate? (accumulation mode) 9 What are you computing? Read ParDo Grouping Composite Parallel connectors to Per element Group by key, Encapsulated external systems "Map" Combine per key, subgraph "Reduce" 10 State What are you computing? Read ParDo Grouping Composite Parallel connectors to Per element Group by key, Encapsulated external systems "Map" Combine per key, subgraph "Reduce" 12 State for ParDo ParDo(DoFn) 13 ParDo.of(new DoFn<...>() { "Example" // declare some state @ProcessElement public void process(...) { // update quantiles // output if needed } }) Per Key Quantiles 14 Partitioned Quantiles Quantiles Quantiles 15 Parallelism! new DoFn<Foo, Quantiles<Foo>>() { @StateId("quantiles") private final StateSpec<...> quantilesState = StateSpecs.combining(); … } Quantiles Quantiles Quantiles DoFn DoFn DoFn 16 Windowed Window into Fixed windows of one hour Quantiles Quantiles Quantiles DoFn DoFn DoFn Expected result: Quantiles for each hour 17 Windowed Window into windows of 30 min sliding by 10 min Quantiles Quantiles Quantiles DoFn DoFn DoFn Expected result: Quantiles for 30 minutes sliding by 10 min 18 State is per key and window ... <k, w>1 <k, w>2 <k, w>3 "x" 3 7 15 "y" "fizz" "7" "fizzbuzz" ... Bonus: automatically garbage collected when a window expires (vs manual clearing of keyed state) Windowed Window into Fixed windows of one hour Quantiles Quantiles Quantiles DoFn DoFn DoFn Expected result: Quantiles for each hour 20 What about Combine? Window into Fixed windows of one hour Quantiles Quantiles Quantiles Combiner Combiner Combiner Expected result: Quantiles for each hour 21 Combine vs State (naive conceptualization) Window hourly Window hourly Quantiles Quantiles Combiner DoFn Expected result: Expected result: Quantiles for each hour Quantiles for each hour 22 Combine vs State (likely execution plan) Quantiles Quantiles non-associative* Combiner Combiner non-commutative* Shuffle Elements multi-output Shuffle side outputs Accumulators (user is in control) Quantiles Quantiles associative Combiner DoFn commutative single-output enables optimizations (engine is in control) Combine vs State Quantiles Combiner Quantiles DoFn output governed by trigger (data/computation unaware) "output only when there's an interesting change" (data/computation aware) Example: Arbitrary indices E B A C D non-associative Assign indices non-commutative non-deterministic E,4 B,3 and totally fine! A,2 C,1 D,0 Kinds of state ● Value - just a mutable cell for a value ● Bag - supports "blind writes" ● Combining - has a CombineFn built in; can support blind writes and lazy accumulation ● Set - membership checking ● Map - lookups and partial writes Timers new DoFn<...>() { Timers for ParDo @TimerId("timeout") private final TimerSpec timeoutTimer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME); @OnTimer("timout") public void timeout(...) { // access state, set new timers, output } … } Stateful ParDo 28 Timers in processing time buffer request "call me back in 5 On Element seconds" On Timer batched RPC output based only output return on incoming value of batched element RPC Timers in event time "call me back when store speculative result the watermark hits the end of the On Element window" On Timer output output replacement speculative result value only if immediately changed More example uses for state & timers ● Per-key arbitrary numbering ● Output only when result changes ● Tighter "side input" management for slowly changing dimension ● Streaming join-matrix / join-biclique ● Fine-grained combine aggregation and output control ● Per-key "workflows" like user sign up flow w/ expiration ● Low-latency deduplication (let the first through, squash the rest) Performance considerations (cross-runner) ● Shuffle to colocate keys ● Linear processing of elements for key+window ● Window merging ● Storage of state and timers ● GC of state 32 Demo 33 Summary State and Timers in Beam... ● … unlock new uses cases ● … they "just work" with event time windowing ● … are portable across runners (implementation ongoing) Thank you for listening! This talk: ● Me - @KennKnowles ● These Slides - https://s.apache.org/ffsf-2017-beam-state Go Deeper ● Design doc - https://s.apache.org/beam-state ● Blog post - https://beam.apache.org/blog/2017/02/13/stateful-processing.html Join the Beam community: ● User discussions - [email protected] ● Development discussions - [email protected] ● Follow @ApacheBeam on Twitter You can contribute to Beam + Flink ● New types of state ● Easy launch of Beam job on Flink-on-YARN ● Integration tests at scale ● Fit and finish: polish, polish, polish! ● … and lots more! https://beam.apache.org.

Portable Stateful Big Data Processing in Apache Beam

Apache Flink™: Stream and Batch Processing in a Single Engine

Evaluation of SPARQL Queries on Apache Flink

Regeldokument

Trifacta Data Preparation for Amazon Redshift and S3 Must Be Deployed Into an Existing Virtual Private Cloud (VPC)

Apache Flink Fast and Reliable Large-Scale Data Processing

The Forrester Wave™: Streaming Analytics, Q3 2019 the 11 Providers That Matter Most and How They Stack up by Mike Gualtieri September 23, 2019

The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: Extended Survey

Scalable and Flexible Middleware for Dynamic Data Flows

Big Data Analysis Using Hadoop Lecture 4 Hadoop Ecosystem

Researching Algorithmic Institutions Essay

CIF21 Dibbs: Middleware and High Performance Analytics Libraries for Scalable Data Science

Issues at the Intersection of AI, Streaming, HPC, Data Centers And