Apache Beam intro, use-cases and demo

Master plan

1 Technical intro: Technical introduction to what Apache Beam and Cloud Dataflow are, how they relate and what makes them special.

2 Use cases: Overview of use cases (how these technologies are used at other companies) and a deep-dive into the Sky use case.

3 Demo: Code walkthrough & small (interactive) demo.

Technical intro 01

Distributed data processing

Batch processing

Figure: data collected on day 1, day 2 and day 3 along a time axis.

Photo by Umanoide on Unsplash

Stream processing

Figure: events arriving continuously along a time axis.

Stream processing: Event time and processing time

Figure: event time (10:00 to 13:00) plotted against processing time.

Stream processing: Watermark

Figure: event time (10:00 to 13:00) against processing time, showing the ideal line, the watermark and the (unknown) skew between them.
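To make the event time, processing time and watermark ideas concrete, here is a minimal sketch in plain Python (it does not use the Beam API). The event timestamps, the 15-minute allowed delay and the function names are all assumptions chosen for illustration; real runners estimate watermarks in more sophisticated ways.

```python
from datetime import datetime, timedelta

# Hypothetical events as (event_time, processing_time) pairs.
EVENTS = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 1)),
    (datetime(2024, 1, 1, 10, 30), datetime(2024, 1, 1, 12, 0)),  # delayed in transit
    (datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 2)),
]

ALLOWED_DELAY = timedelta(minutes=15)  # assumed heuristic, not a Beam default


def skews(events):
    """Skew per event: how long after it happened the system observed it."""
    return [processing - event for event, processing in events]


def find_late_events(events, allowed_delay=ALLOWED_DELAY):
    """Replay events in processing-time order and flag those behind the watermark.

    The watermark here is a simple heuristic: the newest event time seen so far
    minus an allowed delay.
    """
    late = []
    watermark = None
    max_event_time = None
    for event_time, _ in sorted(events, key=lambda pair: pair[1]):
        if watermark is not None and event_time < watermark:
            late.append(event_time)
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_delay
    return late


print(skews(EVENTS))
print(find_late_events(EVENTS))
```

In this toy data the 10:30 event only shows up at 12:00, after the 11:00 event has already pushed the heuristic watermark to 10:45, so it is flagged as late.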

Stream processing: Windowing

Figure: the event stream divided into window 1, window 2 and window 3.
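Windowing by event time can be sketched in a few lines of plain Python (again not the Beam API). The one-hour window size, the helper names and the sample timestamps are assumptions for illustration only.

```python
from collections import defaultdict

WINDOW_SIZE = 60 * 60  # one-hour fixed windows, in seconds (assumed size)


def hhmm(hours, minutes):
    """Helper: seconds since midnight for a wall-clock time."""
    return hours * 3600 + minutes * 60


def window_start(event_ts, size=WINDOW_SIZE):
    """Start of the fixed window that contains the given event timestamp."""
    return event_ts - (event_ts % size)


def count_per_window(event_timestamps, size=WINDOW_SIZE):
    """Group events by the window their event time falls into and count them."""
    counts = defaultdict(int)
    for ts in event_timestamps:
        counts[window_start(ts, size)] += 1
    return dict(counts)


# Arrival order does not matter: each event lands in the window of its event time.
timestamps = [hhmm(10, 5), hhmm(11, 10), hhmm(10, 50), hhmm(12, 30)]
print(count_per_window(timestamps))
```

Because assignment depends only on the event timestamp, an event that arrives out of order still ends up in the window where it belongs.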

Stream processing: Triggering

Figure: 1 + 1 = 2

Stream processing: Trade-offs

Completeness, latency and cost ($$$)
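Triggering decides when a window's result is emitted, which is where completeness, latency and cost pull against each other. The sketch below is plain Python under stated assumptions, not the Beam trigger API: the pane names (early, on_time, late), the `early_every` parameter and the sample arrivals are hypothetical.

```python
def run_window(arrivals, window_end, early_every=2):
    """Sum the values of one window and record every emitted pane.

    arrivals: (value, watermark_at_arrival) pairs in processing order.
    Returns a list of (pane_kind, running_sum) tuples.
    """
    panes = []
    total = 0
    count = 0
    closed = False
    for value, watermark in arrivals:
        # The watermark passing the end of the window fires the on-time pane.
        if not closed and watermark >= window_end:
            panes.append(("on_time", total))
            closed = True
        total += value
        count += 1
        if closed:
            # Data behind the watermark refines the result in a late pane.
            panes.append(("late", total))
        elif count % early_every == 0:
            # Speculative result: lower latency, but possibly incomplete.
            panes.append(("early", total))
    if not closed:
        panes.append(("on_time", total))
    return panes


# Two elements arrive before the watermark reaches 10, one arrives after.
print(run_window([(1, 5), (1, 8), (1, 12)], window_end=10))
```

Firing more panes lowers latency but costs more computation and downstream writes; waiting for the watermark (and for late data) improves completeness at the price of latency.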

Abstraction of the execution engine: pipeline authoring vs pipeline execution

Diagram: Languages (Beam Python, Beam Java, Beam Go) -> Pipeline (Runner API) -> runners (Cloud Dataflow, Apache Spark) -> Execution (Fn API)
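The split between pipeline authoring and pipeline execution can be illustrated with a toy model in plain Python. This is a sketch of the idea only, not the Beam SDK or its Runner API: the `Pipeline`, `StepWiseRunner` and `ElementWiseRunner` classes are hypothetical names invented for this example.

```python
class Pipeline:
    """Authoring side: records the transforms, executes nothing."""

    def __init__(self):
        self.steps = []

    def apply(self, name, fn):
        # Each fn takes one element and returns an iterable of outputs.
        self.steps.append((name, fn))
        return self


class StepWiseRunner:
    """Execution strategy 1: run each step over the whole dataset in turn."""

    def run(self, pipeline, data):
        for _, fn in pipeline.steps:
            data = [out for item in data for out in fn(item)]
        return data


class ElementWiseRunner:
    """Execution strategy 2: push one input through all steps before the next."""

    def run(self, pipeline, data):
        results = []
        for item in data:
            batch = [item]
            for _, fn in pipeline.steps:
                batch = [out for x in batch for out in fn(x)]
            results.extend(batch)
        return results


# The pipeline is written once, without reference to any runner.
pipeline = (
    Pipeline()
    .apply("split", lambda line: line.split())
    .apply("lower", lambda word: [word.lower()])
    .apply("drop_short", lambda word: [word] if len(word) > 2 else [])
)

lines = ["Apache Beam is", "a unified model"]
print(StepWiseRunner().run(pipeline, lines))
print(ElementWiseRunner().run(pipeline, lines))
```

Both runners produce the same result from the same pipeline definition, which is the property the abstraction is meant to provide: the author chooses the transforms, the runner chooses how to execute them.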

Abstraction of the execution engine: pipeline authoring vs pipeline execution

Runners: Apache Spark, Apache MapReduce, Apache Gearpump, Apache Flink, Direct Runner, GCP Dataflow, Hazelcast Jet, Apache Apex, JStorm

Abstraction of the execution engine: Google Cloud Dataflow runner

1 Fully managed, serverless data processing solution

2 Autoscaling of resources

3 Optimisation (graph, dynamic work rebalancing, ...)

4 Monitoring and logging

“The [...] of the data processing world.”

Matthias Baetens & friends, Beam Summit organisers

Use cases 02

Lyft

Dynamic pricing of Lyft rides using Apache Beam

LinkedIn

Feature Generation at LinkedIn

Sky

Large scale streaming analytics using Apache Beam & Google Cloud Dataflow

Sky: intro

● TV platform

● Events from customer set-top boxes

● Goals:

○ Analyse user journey

○ Insights into feature and content usage

Sky: considerations

● Human nature of the events

● End-to-end:

○ Firmware on the box

○ Data customers:

■ What questions are they trying to answer?

■ How often do they need to answer queries?

Sky: architecture (raw data)

● Data generated by the set-top box (fact data)

○ Unbounded or streaming nature

○ Important to decide on protocol between schema developers and pipeline developers

● Data in on-prem Sky systems (dimension data)

○ Bounded or batch nature

○ Important to take this data into account to have a comprehensive view and perform meaningful analytics

Diagram: Sky Q boxes in Sky customers' homes; Sky on-prem systems holding reference data

Sky: architecture (ingesting data)

● Data in motion: message queue to capture and transfer messages as a stream from STB to processing pipeline

● Data at rest: Getting the on-prem reference data in the right place and shape to be joined to the fact data

Diagram: Sky Q boxes in Sky customers' homes -> Ingest (streaming): raw messages in Cloud Pub/Sub; Sky on-prem systems -> Ingest (batch): reference data in BigQuery

Sky: architecture (processing data)

● Beam:

○ Parsing data:

■ Enforce schema

■ Log faulty messages to dead letter queue

○ Filtering data

○ Archiving data

○ Sessionising data

● BigQuery:

○ Intensive joins between fact and dimensional data

Diagram: Ingest (streaming) and Ingest (batch) -> Processing (Cloud Dataflow) -> Raw, joined events (BigQuery)

Sky: architecture (enriching data)

Diagram: Processing (Cloud Dataflow) -> Enrich: Enriching (Cloud Dataflow)

● A second processing pipeline was used to analyse the customer journey, and we enriched the data with certain parts of the interface the user visited

● Ease of using imperative languages (Java / Python) vs declarative languages (SQL) for calculations
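The kind of logic that is easier to express imperatively includes the processing steps described for the Sky pipelines: enforcing a schema, routing faulty messages to a dead letter queue, and sessionising events per box. The sketch below is plain Python, not the code used at Sky; the field names (`box_id`, `ts`, `action`), the 30-minute session gap and the sample messages are all assumptions made for illustration.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"box_id": str, "ts": int, "action": str}
SESSION_GAP = 30 * 60  # seconds of inactivity that end a session (assumed)


def parse(raw_messages):
    """Split raw JSON strings into valid events and dead-letter records."""
    valid, dead_letters = [], []
    for raw in raw_messages:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict):
                raise ValueError("not a JSON object")
            for field, kind in REQUIRED_FIELDS.items():
                if not isinstance(event.get(field), kind):
                    raise ValueError(f"bad or missing field: {field}")
            valid.append(event)
        except ValueError as err:  # json.JSONDecodeError is a ValueError
            dead_letters.append({"raw": raw, "error": str(err)})
    return valid, dead_letters


def sessionise(events, gap=SESSION_GAP):
    """Group each box's events into sessions separated by more than `gap` seconds."""
    sessions = {}
    for event in sorted(events, key=lambda e: (e["box_id"], e["ts"])):
        box_sessions = sessions.setdefault(event["box_id"], [])
        if box_sessions and event["ts"] - box_sessions[-1][-1]["ts"] <= gap:
            box_sessions[-1].append(event)
        else:
            box_sessions.append([event])
    return sessions


raw = [
    '{"box_id": "a", "ts": 0, "action": "home"}',
    '{"box_id": "a", "ts": 600, "action": "play"}',
    '{"box_id": "a", "ts": 5000, "action": "home"}',
    "not json",
    '{"box_id": "b", "ts": "oops", "action": "home"}',
]
events, dead = parse(raw)
print(len(events), len(dead))
print({box: [len(s) for s in sess] for box, sess in sessionise(events).items()})
```

Keeping the faulty messages, together with the reason they failed, means they can be inspected and replayed later instead of silently disappearing from the analytics.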

Sky: architecture (enriching data)

● Advanced data users:

○ Writing their own complex queries, involving joins and processing code

○ Benefit from the low latency of the results

● Data users:

○ Can also do ad-hoc queries and analysis on prepared data

● Dashboard users:

○ Get the right daily data, just catered for their use case and dashboards

Diagram: Advanced data users -> Raw events (BigQuery); Data users -> Customer journey (BigQuery); Dashboard users -> Dashboards (Tableau)

Sky: architecture (architectural overview)

Diagram:

Sources: Sky Q boxes in Sky customers' homes; Sky on-prem systems (reference data)

Ingest (streaming): Raw messages, Cloud Pub/Sub

Ingest (batch): Reference data, BigQuery

Processing: Cloud Dataflow -> Raw, joined events (BigQuery)

Enrich: Enriching, Cloud Dataflow -> Customer journey (BigQuery)

Consumers: Advanced data users (Raw events, BigQuery); Data users (Customer journey, BigQuery); Dashboard users (Dashboards, Tableau)

Scheduling: Airflow

Monitoring: Stackdriver

Result

          On-prem Hadoop                          Google Cloud Dataflow
Stack     Network, servers & software to manage   Serverless
Scaling   Fixed, scale in months                  Dynamic autoscaling
Costs     High fixed costs                        Pay for what you use
Dev       Months                                  Weeks
Ops       2-3 man days per week                   Near zero ops

Demo 03

Instructions

● Grab your phone

● Go to: bit.ly/Beam-Demo

● Submit a sentence in your favourite language

Overview

Diagram components: App Engine, Cloud Pub/Sub, Cloud Dataflow, Translation API, BigQuery, Cloud Pub/Sub, Cloud Functions, Firebase

What does this look like in the Console?

What does this look like in Code?

Q&A

Conclusion & resources

Conclusion

01 Technical

● Distributed data processing

● Batch & streaming

● Language & runner abstraction

● Serverless? Google Cloud Dataflow

02 Use cases

● Dynamic pricing

● Feature pipelines

● Streaming analytics

03 Demo

● Real-time analytics with language inference

Resources

Website: beam.apache.org

Summit: beamsummit.org

YouTube: youtube.com/apachebeamyt

Twitter: @ApacheBeam @BeamSummit

Mailinglists: user@ and [email protected]

Blogposts: oreilly.com/ideas/the-world-beyond-batch-streaming-101 and oreilly.com/ideas/the-world-beyond-batch-streaming-102

Contact

@matthiasbaetens

[email protected]