Introduction to Stateful Stream Processing with Apache Flink

Robert Metzger (@rmetzger_)

Ververica Platform
Original creators of open source Apache Flink
Apache Flink® + Application Manager

Apache Flink 101

Apache Flink
● Apache Flink is an open source stream processing framework
  ○ Low latency
  ○ High throughput
  ○ Stateful
  ○ Distributed

What is Apache Flink?
● Data Stream Processing: real-time results from data streams
● Batch Processing: process static and historic data
● Event-driven Applications: data-driven actions and services
Stateful Computations Over Data Streams

Use Case: Streaming ETL
● Periodic ETL is the traditional approach
  ○ External tool periodically triggers ETL batch job
● Data pipelines (real-time ETL) continuously move data
  ○ Ingestion with low latency
  ○ No artificial data boundaries

Use Case: Data Analytics
● Batch analytics is great for ad-hoc queries
  ○ Queries change faster than data
  ○ Interactive analytics / prototyping
● Stream analytics continuously processes data
  ○ Data changes faster than queries
  ○ Live / low latency results
  ○ No Lambda architecture required!

Use Case: Event-driven applications
● Traditional application design (transactional application)
  ○ Compute & data tier architecture
  ○ React to and process events
  ○ State is stored in (remote) database
● Event-driven application
  ○ State is maintained locally
  ○ Guaranteed consistency by periodic state checkpoints
  ○ Tight coupling of logic and data (microservice architecture)
  ○ Highly scalable design

Hardened at scale
Details about their use cases and more users are listed on Flink's website at https://flink.apache.org/poweredby.html

Case Study: Singles' Day
● Chinese shopping festival
● Very high peak load
  ○ Hundreds of millions of records per second
  ○ Hundreds of thousands of payments per second
  ○ 10 TBs of managed state
  ○ Tens of thousands of cores
● Flink is used in various areas of the process, incl. payment, shipping, real-time recommendations and the giant dashboard
References
● https://www.ververica.com/blog/singles-day-2018-data-in-a-flink-of-an-eye
● https://medium.com/@alitech_2017/how-to-cope-with-peak-data-traffic-on-11-11-the-secrets-of-alibaba-stream-computing-17d5e807980c

Building blocks

The Core Building Blocks
● Event Streams: real-time and hindsight
● State: complex business logic
● (Event) Time: consistency with out-of-order data and late data
● Snapshots: forking / versioning / time-travel

Stateful Event & Stream Processing

    // Source
    val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer[String](…))

    // Transformation
    val events: DataStream[Event] = lines.map((line) => parse(line))

    // Transformation (assumes MyAggregationFunction implements AggregateFunction)
    val stats: DataStream[Statistic] = events
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))
      .aggregate(new MyAggregationFunction())

    // Sink
    stats.addSink(new RollingSink(path))

Streaming Dataflow: Source → Transform → Window (state read/write) → Sink

Stateful Event & Stream Processing
[Figure: Source → Filter / Transform → State read/write → Sink]

Stateful Event & Stream Processing
● Scalable embedded state
  ○ Access at memory speed & scales with parallel operators

Stateful Event & Stream Processing
● Rolling back computation / re-processing
  ○ Re-load state
  ○ Reset positions in input streams

Time: Different Notions of Time
[Figure: Event Producer → Message Queue (broker with partition 1 and partition 2) → Flink Data Source → Flink Window Operator, marking Event Time, Ingestion Time and Window Processing Time]

Time: Event Time Example
● Episodes in processing-time order (release year): IV (1977), V (1980), VI (1983), I (1999), II (2002), III (2005), VII (2015)
● Event time follows the episode number; processing time follows the release year

Recap: The Core Building Blocks
● Event Streams, State, (Event) Time, Snapshots

APIs

The APIs
● Analytics: Stream SQL / Table API (dynamic tables)
● Stream- & Batch Processing: DataStream API (streams, windows)
● Stateful Event-Driven Applications: Process Function (events, state, time)

Process Function

    class MyFunction extends ProcessFunction[MyEvent, Result] {

      // declare state to use in the program
      lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext().getState(…)

      override def processElement(event: MyEvent, ctx: Context, out: Collector[Result]): Unit = {
        // work with event and state
        (event, state.value) match { … }

        out.collect(…)   // emit events
        state.update(…)  // modify state

        // schedule a timer callback
        ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
      }

      override def onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[Result]): Unit = {
        // handle callback when event-/processing-time instant is reached
      }
    }

Data Stream API

    val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer[String](…))

    val events: DataStream[Event] = lines.map((line) => parse(line))

    // assumes MyAggregationFunction implements AggregateFunction
    val stats: DataStream[Statistic] = events
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))
      .aggregate(new MyAggregationFunction())

    stats.addSink(new RollingSink(path))

Table API & Stream SQL

Deployment & Integrations

Deployment

Integrations
● Event logs:
  ○ Kafka, Kinesis, Pulsar*
● File systems:
  ○ S3, HDFS, NFS, MapR FS, ...
● Encodings:
  ○ Avro, JSON, CSV, ORC, Parquet
● Databases:
  ○ JDBC, HCatalog
● Key-Value Stores:
  ○ Cassandra, Elasticsearch, Redis*

Concluding…

Early Bird ticket sales end July 15th

Q & A
Get in touch via Twitter: @rmetzger_, @ApacheFlink

What's happening in Flink these days …
● INSERT INTO flink_sql SELECT * FROM blink_sql
● Turning Table API into an API unified across batch and streaming (FLINK-11439)
● Integration with Hive ecosystem (FLINK-10556)
● Batch runtime improvements: fine-grained recovery (FLINK-4256), more schedulers (FLINK-10429), pluggable shuffle service (FLINK-10653)
● Machine Learning Pipelines on Table API (FLIP-39)
● Table API: caching of intermediate results (.cache() API) (FLINK-11199)
● Table API: Python support (FLINK-12308)
[Figure: API stack with DataStream, Table / SQL and DataSet (deprecated) on top of the Stream Operator & DAG API, on top of the Runtime]

Implementation: State Checkpointing

State, Snapshots, Recovery
● Coordination via markers, injected into the streams

Implementation: Stream SQL

Stream SQL: Intuition
● Stream / Table Duality
  ○ Table without Primary Key
  ○ Table with Primary Key

Stream SQL: Implementation
● Differential Computation (add/mod/del)
● Query Compilation

Implementation: Rescaling

Rescaling State / Elasticity
● Similar to consistent hashing
● Split key space into key groups
● Assign key groups to tasks
[Figure: key space split into key groups #1 to #4]

Rescaling State / Elasticity
● Rescaling changes key group assignment
● Maximum parallelism is defined by the number of key groups
● Rescaling happens through restoring a savepoint using the new parallelism

Implementation: Time-handling

Time: Different Notions of Time
[Figure: Event Producer → Message Queue (broker with partition 1 and partition 2) → Flink Data Source → Flink Window Operator, marking Event Time, Ingestion Time and Window Processing Time]

Time: Watermarks
Legend: event (with event timestamp), watermark W(t)
Stream (in order): 23 21 20 19 18 17 15 14 11 10 9 9 7, with watermarks W(20) and W(11)
Stream (out of order): 21 19 20 17 22 12 17 14 12 9 15 11 7, with watermarks W(17) and W(11)

Time: Watermarks in Parallel
[Figure: two parallel pipelines, Source (1) → map (1) → window (1) and Source (2) → map (2) → window (2). Events are labelled id|timestamp (for example A|30, B|31, Q|44). Watermarks such as W(33) and W(17) are generated at the sources. Each operator tracks the event time at its input streams and the resulting event time at the operator.]

Implementation: Queryable State

Queryable State
[Figure: traditional architecture compared with queryable state]

Queryable State: Implementation
Query: /job/operation/state-name/key
● (1) Query Client asks the State Location Server for the location of the "key-partition" for "operator" of "job"
● (2) State Location Server looks up the location
● (3) State Location Server responds with the location
● (4) Query Client queries state-name and key
[Figure: the Job Manager holds the State Location Server and the ExecutionGraph; it deploys tasks and receives status. Each Task Manager runs window()/sum() with local state and a State Registry, where state is registered.]

State, Snapshots, Recovery
● Events flow without replication or synchronous writes
  ○ State index (hash table or RocksDB)
  ○ Events are persistent and ordered (per partition / key) in the log (e.g., Apache Kafka)
● Trigger checkpoint
  ○ Inject checkpoint barrier
● Take state snapshot
  ○ RocksDB: trigger state copy-on-write
● Persist state snapshots
  ○ Durably persist snapshots asynchronously
  ○ Processing pipeline continues
[Figure on each slide: source → stateful transform operation]

Powerful Abstractions
Layered abstractions to navigate simple to complex use cases
● High-level Analytics API: Stream SQL / Tables (dynamic tables)
● Stream- & Batch Data Processing: DataStream API (streams, windows)

    val stats = stream
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))
      .reduce((a, b) => a.add(b))

● Stateful Event-Driven Applications: Process Function (events, state, time)

    def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
      // work with event and state
      (event, state.value) match { … }

      out.collect(…)   // emit events
      state.update(…)  // modify state

      // schedule a timer callback
      ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
    }

Event Sourcing + Memory Image
● Process: handles each event / command and updates local variables/structures in main memory
● Event log: persists events (temporarily)
● Periodically snapshot the memory

Event Sourcing + Memory Image
● Recovery: restore snapshot and replay events since snapshot
● Event log: persists events (temporarily)

Distributed Memory Image
● Distributed application, many memory images
● Snapshots are all consistent together
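
The recovery rule on the Event Sourcing + Memory Image slides (restore the snapshot, then replay the events since the snapshot) can be sketched in plain Scala. This is a toy single-process model and not Flink code: the `Add` command, the `Snapshot` case class and all function names are made up for illustration.

```scala
// Toy model of "event sourcing + memory image"; all names are hypothetical.
object MemoryImage {
  // Hypothetical command: add `amount` to the counter stored under `key`.
  final case class Add(key: String, amount: Long)

  // A snapshot records the state plus how many log entries it already reflects.
  final case class Snapshot(state: Map[String, Long], offset: Int)

  // Apply one event to the in-memory state.
  def applyEvent(state: Map[String, Long], e: Add): Map[String, Long] =
    state.updated(e.key, state.getOrElse(e.key, 0L) + e.amount)

  // Process the log from the snapshot's offset up to `upTo` (exclusive).
  def run(from: Snapshot, log: Vector[Add], upTo: Int): Snapshot =
    Snapshot(log.slice(from.offset, upTo).foldLeft(from.state)(applyEvent), upTo)

  // Recovery: restore the last snapshot, then replay only the events after it.
  def recover(last: Snapshot, log: Vector[Add]): Map[String, Long] =
    run(last, log, log.length).state
}
```

Recovering from a snapshot taken after the first two events and replaying the rest yields the same state as processing the whole log from the start, which is the property the slides rely on.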
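
The key group bullets on the Rescaling State / Elasticity slides can be modelled in a few lines of plain Scala. This is a simplified sketch, not Flink's implementation: the object and method names are invented here, the plain `hashCode` stands in for whatever hash function Flink applies internally, and the contiguous-range assignment is just one simple way to hand key groups to tasks. The number of key groups plays the role of the maximum parallelism.

```scala
// Simplified model of key groups; names and hashing are assumptions.
object KeyGroups {
  // A key always maps to the same key group, whatever the current parallelism.
  def keyGroupFor(key: Any, maxParallelism: Int): Int =
    Math.floorMod(key.hashCode, maxParallelism)

  // Each of `parallelism` tasks owns a contiguous range of key groups.
  def taskFor(keyGroup: Int, maxParallelism: Int, parallelism: Int): Int =
    keyGroup * parallelism / maxParallelism

  // Full assignment for one parallelism: index = key group, value = task.
  def assignment(maxParallelism: Int, parallelism: Int): Vector[Int] =
    (0 until maxParallelism).map(g => taskFor(g, maxParallelism, parallelism)).toVector
}
```

With four key groups, as in the slide's figure, `KeyGroups.assignment(4, 2)` and `KeyGroups.assignment(4, 4)` give different task assignments, while `keyGroupFor` is unaffected by the change. Rescaling therefore only moves whole key groups between tasks, and no task can be added beyond the number of key groups.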
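
To make the watermark slides more concrete, here is a small plain-Scala model of how watermarks drive event-time windows. It is a sketch under simplifying assumptions, not Flink's API: the `Event` and `Watermark` classes, the function names and the rule that late events are simply dropped are all choices made for this example. An operator with several inputs takes the minimum of its input watermarks as its own event time, and a tumbling window is emitted once a watermark reaches its last timestamp.

```scala
// Plain-Scala model of watermarks and tumbling windows; names are hypothetical.
object WatermarkDemo {
  sealed trait Element
  final case class Event(timestamp: Long) extends Element
  final case class Watermark(timestamp: Long) extends Element

  // Event time at an operator: the minimum of the latest watermark per input.
  def operatorEventTime(latestPerInput: Seq[Long]): Long = latestPerInput.min

  // Count events per tumbling window [start, start + size).
  // A window is emitted when a watermark reaches start + size - 1.
  // Events for windows that were already emitted are dropped as late.
  def windowCounts(stream: Seq[Element], size: Long): Vector[(Long, Int)] = {
    var open = Map.empty[Long, Int]   // window start -> running count
    var clock = Long.MinValue         // highest watermark seen so far
    var fired = Vector.empty[(Long, Int)]
    stream.foreach {
      case Event(ts) =>
        val start = ts - Math.floorMod(ts, size)
        if (start + size - 1 > clock)
          open = open.updated(start, open.getOrElse(start, 0) + 1)
      case Watermark(ts) =>
        clock = math.max(clock, ts)
        val (done, pending) = open.partition { case (start, _) => start + size - 1 <= clock }
        fired = fired ++ done.toVector.sortBy(_._1)
        open = pending
    }
    fired
  }
}
```

Out-of-order events that arrive before the watermark still land in the right window. An event that arrives after the watermark has passed its window is ignored in this model, and a window with no closing watermark is never emitted.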
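
The Stream SQL slides mention stream / table duality and differential computation (add/mod/del) without showing either. A minimal plain-Scala sketch of both ideas follows. It is not how Flink compiles queries: the `Upsert` and `Delete` change types and both function names are assumptions made for this example, with `Upsert` covering both "add" and "mod".

```scala
// Toy model of stream/table duality; all names are hypothetical.
object StreamTableDuality {
  sealed trait Change
  final case class Upsert(key: String, value: Long) extends Change // add or modify
  final case class Delete(key: String) extends Change

  // Stream -> table: applying a changelog in order materializes a table
  // with `key` acting as the primary key.
  def materialize(changelog: Seq[Change]): Map[String, Long] =
    changelog.foldLeft(Map.empty[String, Long]) {
      case (table, Upsert(k, v)) => table.updated(k, v)
      case (table, Delete(k))    => table - k
    }

  // Differential computation of a per-key count: every input row produces
  // one change to the result instead of a full recomputation.
  def countPerKey(rows: Seq[String]): Vector[Change] = {
    var counts = Map.empty[String, Long]
    rows.toVector.map { k =>
      val c = counts.getOrElse(k, 0L) + 1
      counts = counts.updated(k, c)
      Upsert(k, c)
    }
  }
}
```

Feeding the output of `countPerKey` into `materialize` gives the current result table of the count query at any point in the stream.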
