Apache Beam Intro, Use-Cases and Demo

Master plan

1. Technical intro: Technical introduction to what Apache Beam and Google Cloud Dataflow are, how they relate and what makes them special.
2. Use cases: Overview of how these technologies are used at other companies, and a deep-dive into the Sky use case.
3. Demo: Code walkthrough & small (interactive) demo.

01 Technical intro

Technical intro: Distributed data processing

Technical intro: Batch processing
[Figure: data collected per day (day 1, day 2, day 3) along a time axis]

Technical intro: Stream processing
[Figure: data arriving continuously across day 1, day 2, day 3 along a time axis]

Technical intro: Stream processing: Event time and processing time
[Figure: event time plotted against processing time, both axes running from 10:00 to 13:00]

Technical intro: Stream processing: Watermark
[Figure: processing time vs event time, showing the ideal line, the watermark, and the skew between them]

Technical intro: Stream processing: Windowing
[Figure: window 1, window 2, window 3]

Technical intro: Stream processing: Triggering

Technical intro: Stream processing: Trade-offs
● Completeness (1 + 1 = 2)
● Latency
● Cost ($$$)

Technical intro: Abstraction of the execution engine: pipeline authoring vs pipeline execution
● Languages: Beam Python, Beam Java, Beam Go
● Pipeline (Runner API)
● Execution engines: Apache Flink, Cloud Dataflow, Apache Spark
● Execution (Fn API)

Technical intro: Abstraction of the execution engine: pipeline authoring vs pipeline execution
● Runners: Apache Spark, Apache MapReduce, Apache Gearpump, Apache Flink, Direct Runner, GCP Dataflow, Hazelcast Jet, Apache Apex, JStorm

Technical intro: Abstraction of the execution engine: Google Cloud Dataflow runner
1. Fully managed, serverless data processing solution
2. Autoscaling of resources
3. Optimisation (graph, dynamic work rebalancing, ...)
4. Monitoring and logging

“The Kubernetes of the data processing world.”
Matthias Baetens & friends, Beam Summit organisers

02 Use cases

Lyft: Dynamic pricing Lyft rides using Apache Beam
LinkedIn: Feature Generation at LinkedIn
Sky: Large scale streaming analytics using Apache Beam & Google Cloud Dataflow

Sky: intro
● TV platform
● Events from customer set-top boxes
● Goals:
  ○ Analyse user journey
  ○ Insights into feature and content usage

Sky: considerations
● Human nature of the events
● End-to-end:
  ○ Firmware on the box
  ○ Data customers:
    ■ What questions are they trying to answer?
    ■ How often do they need to answer queries?

Sky: architecture: Raw data
[Diagram: Sky Q boxes in Sky customers' homes produce raw data; Sky on-prem systems hold reference data]
● Data generated by the set-top box (fact data)
  ○ Unbounded or streaming nature
  ○ Important to decide on a protocol between schema developers and pipeline developers
● Data in on-prem Sky systems (dimension data)
  ○ Bounded or batch nature
  ○ Important to take this data into account to have a comprehensive view and perform meaningful analytics

Sky: architecture: ingesting data
[Diagram: Sky Q boxes, Ingest (streaming), raw messages in Cloud Pub/Sub; Sky on-prem systems, Ingest (batch), reference data in BigQuery]
● Data in motion: message queue to capture and transfer messages as a stream from the set-top box (STB) to the processing pipeline
● Data at rest: getting the on-prem reference data in the right place and shape to be joined to the fact data

Sky: architecture: processing data
[Diagram: raw messages in Cloud Pub/Sub, Processing in Cloud Dataflow, raw, joined events in BigQuery; reference data in BigQuery]
● Beam:
  ○ Parsing data:
    ■ Enforce schema
    ■ Log faulty messages to dead letter queue
  ○ Filtering data
  ○ Archiving data
  ○ Sessionising data
● BigQuery:
  ○ Intensive joins between fact and dimensional data

Sky: architecture: enriching data
[Diagram: raw, joined events in BigQuery, Enrich in Cloud Dataflow, customer journey in BigQuery]
● A second processing pipeline was used to analyse the customer journey, and we enriched the data with certain parts of the interface the user visited
● Ease of using imperative languages (Java / Python) vs declarative languages (SQL) for calculations

Sky: architecture: enriching data
[Diagram: advanced data users read raw events in BigQuery; data users read the customer journey in BigQuery; dashboard users read dashboards in Tableau]
● Advanced data users:
  ○ Writing their own complex queries, involving joins and processing code
  ○ Benefit from the low latency of the results
● Data users:
  ○ Can also do ad-hoc queries and analysis on prepared data
● Dashboard users:
  ○ Get the right daily data, just catered for their use case and dashboards

Sky: architecture: Architectural overview
● Sources: Sky Q boxes (Sky customers' homes), Sky on-prem systems
● Ingest (streaming): raw messages, Cloud Pub/Sub
● Ingest (batch): reference data, BigQuery
● Processing: Cloud Dataflow, producing raw, joined events in BigQuery
● Enrich: Cloud Dataflow, producing the customer journey in BigQuery
● Advanced data users: raw events, BigQuery
● Data users: customer journey, BigQuery
● Dashboard users: dashboards, Tableau
● Scheduling: Airflow
● Monitoring: Stackdriver

Result

           On-prem Hadoop                          Google Cloud Dataflow
Stack      Network, servers & software to manage   Serverless
Scaling    Fixed, scale in months                  Dynamic autoscaling
Costs      High fixed costs                        Pay for what you use
Dev        Months                                  Weeks
Ops        2-3 man days per week                   Near zero ops

03 Demo

Instructions
● Grab your phone
● Go to: bit.ly/Beam-Demo
● Submit a sentence in your favourite language

Overview
[Diagram components: App Engine, Cloud Pub/Sub, Cloud Dataflow, Translation API, BigQuery, Cloud Pub/Sub, Cloud Functions, Firebase]

What does this look like in the Console?

What does this look like in Code?
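
The demo code itself is not part of these slides. As a rough illustration of the event-time windowing idea from the technical intro, the plain-Python sketch below counts submitted sentences per language in fixed windows. It is not Beam code and not the demo pipeline: the 60-second window size, the `(event_time, language)` tuples and the function names are all assumptions made for the example, and the language is given directly rather than inferred through the Translation API.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size, in seconds of event time


def window_start(event_time, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def count_per_window(events, size=WINDOW_SECONDS):
    """Group (event_time, language) pairs by event-time window and count languages."""
    counts = defaultdict(Counter)
    for event_time, language in events:
        counts[window_start(event_time, size)][language] += 1
    return {start: dict(counter) for start, counter in sorted(counts.items())}


# Hypothetical events, listed in arrival (processing-time) order.
# The third event arrives late but still lands in the first window,
# because assignment uses event time rather than arrival order.
events = [
    (5, "nl"),
    (70, "en"),
    (42, "en"),
    (95, "en"),
    (130, "fr"),
]

result = count_per_window(events)
print(result)
```

The sketch sees all events before it emits anything, so it sidesteps the questions that watermarks and triggers answer in a real streaming pipeline: when a window can be considered complete and when results should be emitted. In the Beam Python SDK, the window assignment step corresponds to applying `beam.WindowInto(beam.window.FixedWindows(60))` to a collection of timestamped elements.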
Q&A

Conclusion & resources

Conclusion

01 Technical
● Distributed data processing
● Batch & streaming
● Language & runner abstraction
● Serverless? Google Cloud Dataflow

02 Use cases
● Dynamic pricing
● Feature pipelines
● Streaming analytics

03 Demo
● Real-time analytics with language inference

Resources
● Website: beam.apache.org
● Summit: beamsummit.org
● YouTube: youtube.com/apachebeamyt
● Twitter: @ApacheBeam, @BeamSummit
● Mailing lists: user@ and [email protected]
● Blog posts:
  ○ oreilly.com/ideas/the-world-beyond-batch-streaming-101
  ○ oreilly.com/ideas/the-world-beyond-batch-streaming-102

Contact
● @matthiasbaetens
● [email protected]
