Apache Beam intro, use-cases and demo

Master plan
1 Technical intro: Technical introduction to what Apache Beam and Google Cloud Dataflow are, how they relate and what makes them special.
2 Use cases: Overview of how these technologies are used at other companies, and a deep-dive into the Sky use case.
3 Demo: Code walkthrough & small (interactive) demo.

01 Technical intro
Distributed data processing
[Figure: ❌ / ✅ comparison]

Batch processing
[Figure: timeline with day 1, day 2, day 3 along a time axis]
Photo by Umanoide on Unsplash

Stream processing
[Figure: timeline with day 1, day 2, day 3 along a time axis]
Stream processing: Event time and processing time
[Figure: two timelines, Event Time and Processing Time, each marked 10:00, 11:00, 12:00, 13:00]
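The difference between the two clocks can be made concrete with a small plain-Python sketch. This is not Beam code: the `Event` class, its field names and the sample timestamps are all hypothetical, chosen only to show that the same events land in different hourly buckets depending on which clock is used.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    name: str
    event_time: datetime        # when it happened on the device
    processing_time: datetime   # when the pipeline observed it


def hourly_counts(events, by):
    """Count events per hour, bucketing on the attribute named by `by`."""
    return Counter(getattr(e, by).strftime("%H:00") for e in events)


# Hypothetical sample data: the "pause" event is delayed by over an hour.
events = [
    Event("play", datetime(2019, 1, 1, 10, 5), datetime(2019, 1, 1, 10, 6)),
    Event("pause", datetime(2019, 1, 1, 10, 50), datetime(2019, 1, 1, 12, 10)),
    Event("stop", datetime(2019, 1, 1, 11, 20), datetime(2019, 1, 1, 11, 21)),
]

by_event_time = dict(hourly_counts(events, "event_time"))
by_processing_time = dict(hourly_counts(events, "processing_time"))
```

Bucketing on event time puts two events in the 10:00 hour; bucketing on processing time spreads the same three events over 10:00, 11:00 and 12:00.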
Stream processing: Watermark
[Figure: Processing Time plotted against Event Time, showing the Ideal line, the Watermark and the Skew]
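One way to picture a watermark is as a running estimate of how far event time has progressed. The sketch below uses a simple heuristic (largest event time seen so far, minus an assumed maximum delay). This is an illustration only, not how Beam or Dataflow compute their watermarks; the class name `HeuristicWatermark` and the `max_delay` parameter are hypothetical.

```python
class HeuristicWatermark:
    """Tracks `max event time seen - max_delay` as a completeness estimate."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    def current(self):
        """Return the current watermark, or None before any event is seen."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_delay

    def observe(self, event_time):
        """Record an arriving event; return True if it is behind the watermark."""
        watermark = self.current()
        late = watermark is not None and event_time < watermark
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return late


# Event times in seconds, listed in arrival (processing-time) order.
arrivals = [100, 105, 103, 120, 90, 118]
wm = HeuristicWatermark(max_delay=10)
late_flags = [wm.observe(t) for t in arrivals]
```

Only the event with event time 90 is flagged as late: by the time it arrives the watermark has already advanced to 110.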
Stream processing: Windowing
[Figure: window 1, window 2, window 3]
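A minimal plain-Python sketch of fixed (tumbling) windows, assuming events are `(event_time_in_seconds, value)` pairs. The function names are hypothetical and this is not the Beam API; it only shows that each event is assigned to a window by its event time, regardless of arrival order.

```python
from collections import defaultdict


def fixed_window(event_time, size):
    """Return the [start, end) fixed window that contains event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def sum_per_window(events, size):
    """Sum values per fixed event-time window; events are (event_time, value)."""
    totals = defaultdict(int)
    for event_time, value in events:
        totals[fixed_window(event_time, size)] += value
    return dict(totals)


# Events arrive out of order; the window assignment is unaffected.
events = [(5, 1), (61, 2), (59, 3), (130, 4), (62, 5)]
totals = sum_per_window(events, size=60)
```

With 60-second windows, the events at 5 and 59 fall into window 1, those at 61 and 62 into window 2, and the event at 130 into window 3.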
Stream processing: Triggering

Stream processing: Trade-offs
● Completeness (1 + 1 = 2)
● Latency
● Cost ($$$)
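The trade-off between completeness and latency shows up in when a window emits results. The sketch below is a plain-Python illustration for a single window, assuming a hypothetical input format that mixes `("event", event_time, value)` and `("watermark", time)` items in arrival order. It emits early panes every few elements (low latency, incomplete), one on-time pane when the watermark passes the end of the window, and a late pane for each straggler. It is not Beam's trigger API.

```python
def run_window(stream, window_end, early_every):
    """Emit (label, running_sum) panes for one window ending at window_end."""
    panes = []
    total = 0
    pending = 0      # elements seen since the last pane was emitted
    closed = False   # True once the watermark has passed window_end
    for item in stream:
        if item[0] == "event":
            _, event_time, value = item
            if event_time >= window_end:
                continue  # belongs to a later window
            total += value
            pending += 1
            if closed:
                panes.append(("late", total))
                pending = 0
            elif pending == early_every:
                panes.append(("early", total))
                pending = 0
        elif item[1] >= window_end and not closed:
            closed = True
            panes.append(("on-time", total))
            pending = 0
    return panes


stream = [
    ("event", 10, 1),
    ("event", 20, 1),
    ("event", 30, 1),
    ("watermark", 60),
    ("event", 50, 1),  # arrives after the watermark: late data
]
panes = run_window(stream, window_end=60, early_every=2)
```

Each extra pane gives an answer sooner or a more complete answer later, at the price of more computation and more output to store.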
Abstraction of the execution engine: pipeline authoring vs pipeline execution
[Figure: layered diagram. Languages: Beam Python, Beam Java, Beam Go. Pipeline (Runner API). Runners: Apache Flink, Cloud Dataflow, Apache Spark. Execution (Fn API). Three Execution boxes.]
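The split between authoring and execution can be sketched in a few lines of plain Python. Everything here is hypothetical and far simpler than Beam itself: the `Pipeline`, `LocalRunner` and `ThreadedRunner` classes are made up for illustration and are not the Runner API or Fn API. The point is only that the pipeline is described once and then handed to interchangeable execution engines.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Authoring side: records the steps without running anything."""

    def __init__(self):
        self.steps = []

    def map(self, fn):
        self.steps.append(("map", fn))
        return self

    def filter(self, fn):
        self.steps.append(("filter", fn))
        return self


def apply_steps(steps, element):
    """Run one element through all steps; return a list of 0 or 1 outputs."""
    for kind, fn in steps:
        if kind == "map":
            element = fn(element)
        elif not fn(element):
            return []
    return [element]


class LocalRunner:
    """Execution side, variant 1: a single in-process loop."""

    def run(self, pipeline, data):
        return [out for x in data for out in apply_steps(pipeline.steps, x)]


class ThreadedRunner:
    """Execution side, variant 2: same pipeline, elements spread over threads."""

    def run(self, pipeline, data):
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = pool.map(lambda x: apply_steps(pipeline.steps, x), data)
            return [out for chunk in results for out in chunk]


# The pipeline is authored once...
pipeline = Pipeline().map(str.strip).filter(bool).map(str.upper)
data = [" beam ", "", " flink", "spark "]

# ...and executed by whichever runner is chosen.
local = LocalRunner().run(pipeline, data)
threaded = ThreadedRunner().run(pipeline, data)
```

Both runners return the same result; only the way the work is carried out differs.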
Abstraction of the execution engine: Runners
● Apache Spark
● Apache MapReduce
● Apache Gearpump
● Apache Flink
● Direct Runner
● GCP Dataflow
● Hazelcast Jet
● JStorm
● Apache Apex

Abstraction of the execution engine: Google Cloud Dataflow runner
1 Fully managed, serverless data processing solution
2 Autoscaling of resources
3 Optimisation (graph, dynamic work rebalancing, ...)
4 Monitoring and logging

“The Kubernetes of the data processing world.”
Matthias Baetens & friends, Beam Summit organisers

02 Use cases

Lyft
Dynamic pricing Lyft rides using Apache Beam

LinkedIn
Feature Generation at LinkedIn

Sky
Large scale streaming analytics using Apache Beam & Google Cloud Dataflow

Sky: intro
● TV platform
● Events from customer set-top boxes
● Goals:
○ Analyse user journey
○ Insights into feature and content usage

Sky: considerations
● Human nature of the events
● End-to-end:
○ Firmware on the box
○ Data customers:
■ What questions are they trying to answer?
■ How often do they need to answer queries?

Sky: architecture
● Data generated by the set-top box (fact data)
○ Unbounded or streaming nature
○ Important to decide on protocol between schema developers and pipeline developers
● Data in on-prem Sky systems (dimension data)
○ Bounded or batch nature
○ Important to take this data into account to have a comprehensive view and perform meaningful analytics
[Figure: Sky customers home (Sky Q boxes): Raw data. Sky on-prem systems: Reference data.]

Sky: architecture ingesting data
● Data in motion: message queue to capture and transfer messages as a stream from STB to processing pipeline
● Data at rest: Getting the on-prem reference data in the right place and shape to be joined to the fact data
[Figure: Sky customers home (Sky Q boxes), Ingest (streaming): Raw messages, Cloud Pub/Sub. Sky on-prem systems, Ingest (batch): Reference data, BigQuery.]

Sky: architecture processing data
● Beam:
○ Parsing data:
■ Enforce schema
■ Log faulty messages to dead letter queue
○ Filtering data
○ Archiving data
○ Sessionising data
● BigQuery:
○ Intensive joins between fact and dimensional data
[Figure: Sky customers home (Sky Q boxes), Ingest (streaming): Raw messages, Cloud Pub/Sub. Processing: Cloud Dataflow. Raw, joined events: BigQuery. Sky on-prem systems, Ingest (batch): Reference data, BigQuery.]
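Two of the processing steps listed for Beam, parsing with a dead letter queue and sessionising, can be sketched in plain Python. This is not Sky's pipeline and not Beam code: the message format, the field names (`box_id`, `ts`, `action`), the schema and the 30-minute gap are all assumptions made for illustration.

```python
import json
from collections import defaultdict

# Assumed schema: field name -> expected Python type.
REQUIRED_FIELDS = {"box_id": str, "ts": int, "action": str}


def parse(raw_messages):
    """Split raw JSON strings into (valid events, dead-letter entries)."""
    valid, dead_letter = [], []
    for raw in raw_messages:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict):
                raise ValueError("message is not a JSON object")
            for field, expected in REQUIRED_FIELDS.items():
                if not isinstance(event.get(field), expected):
                    raise ValueError(f"bad or missing field: {field}")
            valid.append(event)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            dead_letter.append({"raw": raw, "error": str(err)})
    return valid, dead_letter


def sessionise(events, gap):
    """Group each box's events into sessions split where the pause exceeds gap."""
    per_box = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        per_box[event["box_id"]].append(event)
    sessions = []
    for box_id, box_events in per_box.items():
        current = [box_events[0]]
        for event in box_events[1:]:
            if event["ts"] - current[-1]["ts"] > gap:
                sessions.append((box_id, [e["action"] for e in current]))
                current = []
            current.append(event)
        sessions.append((box_id, [e["action"] for e in current]))
    return sessions


raw_messages = [
    '{"box_id": "a", "ts": 0, "action": "home"}',
    '{"box_id": "a", "ts": 30, "action": "play"}',
    'not json',
    '{"box_id": "a", "ts": 4000, "action": "home"}',
    '{"box_id": "b", "action": "play"}',
]
valid, dead_letter = parse(raw_messages)
sessions = sessionise(valid, gap=1800)
```

Faulty messages are kept together with the reason they failed instead of being dropped, so they can be inspected and replayed later; the valid events for box `a` split into two sessions because of the long pause before the third event.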
Sky: architecture enriching data
● A second processing pipeline was used to analyse the customer journey, and we enriched the data with certain parts of the interface the user visited
● Ease of using imperative languages (Java / Python) vs declarative languages (SQL) for calculations
[Figure: Processing: Cloud Dataflow. Raw, joined events: BigQuery. Enrich: Enriching, Cloud Dataflow. Customer journey: BigQuery.]

Sky: architecture enriching data
● Advanced data users:
○ Writing their own complex queries, involving joins and processing code
○ Benefit from the low latency of the results
● Data users:
○ Can also do ad-hoc queries and analysis on prepared data
● Dashboard users:
○ Get the right daily data, just catered for their use case and dashboards
[Figure: Advanced data users: Raw events, BigQuery. Data users: Customer journey, BigQuery. Dashboard users: Dashboards, Tableau.]

Sky: architecture Architectural overview
[Figure: architectural overview. Sky customers home (Sky Q boxes), Ingest (streaming): Raw messages, Cloud Pub/Sub. Sky on-prem systems, Ingest (batch): Reference data, BigQuery. Processing: Cloud Dataflow. Raw, joined events: BigQuery. Enrich: Enriching, Cloud Dataflow. Customer journey: BigQuery. Advanced data users: Raw events, BigQuery. Data users: Customer journey, BigQuery. Dashboard users: Dashboards, Tableau. Scheduling: Airflow. Monitoring: Stackdriver.]

Result
          On-prem Hadoop                           Google Cloud Dataflow
Stack     Network, servers & software to manage    Serverless
Scaling   Fixed, scale in months                   Dynamic autoscaling
Costs     High fixed costs                         Pay for what you use
Dev       Months                                   Weeks
Ops       2-3 man days per week                    Near zero ops

03 Demo
Instructions
● Grab your phone
● Go to: bit.ly/Beam-Demo
● Submit a sentence in your favourite language

Overview
[Figure: demo architecture with App Engine, Cloud Pub/Sub, Cloud Dataflow, Translation API, BigQuery, Cloud Pub/Sub, Cloud Functions, Firebase]

What does this look like in the Console?

What does this look like in Code?

Q&A

Conclusion & resources

Conclusion
01 Technical
● Distributed data processing
● Batch & streaming
● Language & runner abstraction
● Serverless? Google Cloud Dataflow
02 Use cases
● Dynamic pricing
● Feature pipelines
● Streaming analytics
03 Demo
● Real-time analytics with language inference

Resources
Website: beam.apache.org
Summit: beamsummit.org
YouTube: youtube.com/apachebeamyt
Twitter: @ApacheBeam @BeamSummit
Mailinglists: user@ and [email protected]
Blogposts: oreilly.com/ideas/the-world-beyond-batch-streaming-101, oreilly.com/ideas/the-world-beyond-batch-streaming-102

Contact
@matthiasbaetens