
Apache Beam intro, use-cases and demo

Master plan
1. Technical intro: Technical introduction to what Apache Beam and Google Cloud Dataflow are, how they relate and what makes them special.
2. Use cases: Overview of use cases: how these technologies are used at other companies, and a deep-dive into the Sky use case.
3. Demo: Code walkthrough & small (interactive) demo.

01 Technical intro

Distributed data processing

Batch processing
[Figure: timeline with day 1, day 2 and day 3 along a time axis]
Photo by Umanoide on Unsplash

Stream processing
[Figure: timeline with day 1, day 2 and day 3 along a time axis]

Stream processing: Event time and processing time
[Figure: axes labelled Event Time and Processing Time, each running from 10:00 to 13:00]

Stream processing: Watermark
[Figure: axes labelled Event Time and Processing Time, with the labels Ideal, Watermark and Skew]

Stream processing: Windowing
[Figure: window 1, window 2, window 3]

Stream processing: Triggering
[Figure: 1 + 1 = 2]

Stream processing: Trade-offs
● Completeness
● Latency
● Cost ($$$)

Abstraction of the execution engine: pipeline authoring vs pipeline execution
● Languages: Beam Python, Beam Java, Beam Go
● Pipeline (Runner API)
● Apache Flink, Cloud Dataflow, Apache Spark
● Execution (Fn API)

Abstraction of the execution engine: pipeline authoring vs pipeline execution
Runners:
● Apache Spark
● Apache MapReduce
● Apache Gearpump
● Apache Flink
● Direct Runner
● GCP Dataflow
● Hazelcast Jet
● Apache Apex
● JStorm

Abstraction of the execution engine: Google Cloud Dataflow runner
1. Fully managed, serverless data processing solution
2. Autoscaling of resources
3. Optimisation (graph, dynamic work rebalancing, ...)
4. Monitoring and logging

“The Kubernetes of the data processing world.”
Matthias Baetens & friends, Beam Summit organisers

02 Use cases

● Lyft: Dynamic pricing Lyft rides using Apache Beam
● LinkedIn: Feature Generation at LinkedIn
● Sky: Large scale streaming analytics using Apache Beam & Google Cloud Dataflow

Sky: intro
● TV platform
● Events from customer set-top boxes
● Goals:
  ○ Analyse user journey
  ○ Insights into feature and content usage

Sky: considerations
● Human nature of the events
● End-to-end:
  ○ Firmware on the box
  ○ Data customers:
    ■ What questions are they trying to answer?
    ■ How often do they need to answer queries?

Sky: architecture
Raw data
● Data generated by the set-top box (fact data)
  ○ Unbounded or streaming nature
  ○ Important to decide on protocol between schema developers and pipeline developers
● Data in on-prem Sky systems (dimension data)
  ○ Bounded or batch nature
  ○ Important to take this data into account to have a comprehensive view and perform meaningful analytics
[Diagram: Sky customers home, Sky Q boxes; Sky on-prem systems, Reference data]

Sky: architecture
Ingesting data
● Data in motion: message queue to capture and transfer messages as a stream from STB to processing pipeline
● Data at rest: Getting the on-prem reference data in the right place and shape to be joined to the fact data
[Diagram: Sky customers home, Sky Q boxes; Ingest (streaming): Raw messages, Cloud Pub/Sub; Sky on-prem systems, Reference data; Ingest (batch): Reference data, BigQuery]

Sky: architecture
Processing data
● Beam:
  ○ Parsing data:
    ■ Enforce schema
    ■ Log faulty messages to dead letter queue
  ○ Filtering data
  ○ Archiving data
  ○ Sessionising data
● BigQuery:
  ○ Intensive joins between fact and dimensional data
[Diagram: Ingest (streaming): Raw messages, Cloud Pub/Sub; Processing: Cloud Dataflow; Ingest (batch): Reference data, BigQuery; Raw, joined events: BigQuery]

Sky: architecture
Enriching data
● A second processing pipeline was used to analyse the customer journey, and we enriched the data with certain parts of the interface the user visited
● Ease of using imperative languages (Java / Python) vs declarative languages (SQL) for calculations
[Diagram: Processing: Cloud Dataflow; Raw, joined events: BigQuery; Enrich: Enriching, Cloud Dataflow; Customer journey: BigQuery]

Sky: architecture
Enriching data
● Advanced data users:
  ○ Writing their own complex queries, involving joins and processing code
  ○ Benefit from the low latency of the results
● Data users:
  ○ Can also do ad-hoc queries and analysis on prepared data
● Dashboard users:
  ○ Get the right daily data, just catered for their use case and dashboards
[Diagram: Advanced data users: Raw events, BigQuery; Data users: Customer journey, BigQuery; Dashboard users: Dashboards, Tableau]

Sky: architecture
Architectural overview
[Diagram: Sky customers home, Sky Q boxes; Ingest (streaming): Raw messages, Cloud Pub/Sub; Processing: Cloud Dataflow; Enrich: Enriching, Cloud Dataflow; Sky on-prem systems, Reference data; Ingest (batch): Reference data, BigQuery; Raw, joined events: BigQuery; Raw events: BigQuery; Customer journey: BigQuery; Dashboards: Tableau; Advanced data users, Data users, Dashboard users; Scheduling: Airflow; Monitoring: Stackdriver]

Result
          On-prem Hadoop                            Google Cloud Dataflow
Stack     Network, servers & software to manage     Serverless
Scaling   Fixed, scale in months                    Dynamic autoscaling
Costs     High fixed costs                          Pay for what you use
Dev       Months                                    Weeks
Ops       2-3 man days per week                     Near zero ops

03 Demo

Instructions
● Grab your phone
● Go to: bit.ly/Beam-Demo
● Submit a sentence in your favourite language

Overview
[Diagram: App Engine, Cloud Pub/Sub, Cloud Dataflow, Translation API, BigQuery, Cloud Pub/Sub, Cloud Functions, Firebase]

What does this look like in the Console?

What does this look like in Code?
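
The event-time ideas from the technical intro (windowing, watermark, late data) can be sketched in plain Python. This is not the Beam SDK and not the demo's code: the one-hour window size, the watermark heuristic (largest event time seen so far), and the names `window_start` and `count_per_window` are all assumptions made for illustration. Real runners estimate watermarks differently.

```python
from collections import defaultdict

WINDOW_SIZE = 3600  # seconds; hypothetical one-hour fixed windows


def window_start(event_time: int, size: int = WINDOW_SIZE) -> int:
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def count_per_window(events, allowed_lateness: int = 0):
    """Count events per event-time window, consuming them in arrival order.

    `events` is an iterable of (event_time, payload) tuples, ordered by
    processing time (arrival), not by event time.

    Assumption: the watermark is simply the largest event time seen so far.
    An event is dropped as late when its window ended at least
    `allowed_lateness` seconds before the current watermark.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = 0
    for event_time, payload in events:
        start = window_start(event_time)
        window_end = start + WINDOW_SIZE
        if window_end + allowed_lateness <= watermark:
            # The window is already considered complete: treat as late data.
            dropped.append((event_time, payload))
            continue
        counts[start] += 1
        watermark = max(watermark, event_time)
    return dict(counts), dropped
```

With event times given as seconds since midnight, an event stamped 10:45 that arrives after an 11:10 event is dropped when `allowed_lateness` is 0 and counted in the 10:00 window when `allowed_lateness` is 3600. That is the completeness versus latency trade-off in miniature: waiting longer gives more complete windows but later results.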
Q&A

Conclusion & resources

Conclusion
01 Technical
● Distributed data processing
● Batch & streaming
● Language & runner abstraction
● Serverless? Google Cloud Dataflow
02 Use cases
● Dynamic pricing
● Feature pipelines
● Streaming analytics
03 Demo
● Real-time analytics with language inference

Resources
● Website: beam.apache.org
● Summit: beamsummit.org
● YouTube: youtube.com/apachebeamyt
● Twitter: @ApacheBeam, @BeamSummit
● Mailing lists: user@ and [email protected]
● Blogposts:
  ○ oreilly.com/ideas/the-world-beyond-batch-streaming-101
  ○ oreilly.com/ideas/the-world-beyond-batch-streaming-102

Contact
@matthiasbaetens
[email protected]