Large-scale data analysis at cloud scale
johnwilkes@.com, 2016-09

Based on slides from Greg DeMichillie, Frances Perry, Tyler Akidau, and Dan Halperin.

GCP Big Data reference architecture
● Data & Analytics
● Application Development
● Infrastructure & Operations

One possible flow

Application logs (impression, start stream, add to playlist, share, ...) → Cloud Pub/Sub (ingest and propagate) → Cloud Dataflow (processing and imperative analysis) → BigQuery (durable storage and interactive analysis)

Questions you might ask:
● Which application sections are receiving the most impressions?
● Who is the top artist by stream starts in the last minute?
● What is the average session length for users in Seattle?

10+ years of tackling Big Data problems

Google papers, 2002–2015: GFS, MapReduce, BigTable, Dremel, PubSub, FlumeJava, Millwheel, Dataflow, TensorFlow

The Dremel paper. In: Proc. 36th VLDB, Sept 2010.

Dremel in 2016

Dremel is mission critical for Google: in production for 10+ years, in every Google datacenter. Internal usage every month:

● O(exabytes) analyzed
● O(quadrillions) of rows scanned
● 80% of Googlers use it

BigQuery ≃ Dremel + cloud

Our idea of highly available, managed analytics:

● no indexing, no resharding, no storage resizing
● just ... RUN QUERY

BigQuery ≃ Dremel + cloud

Fully managed data warehouse: General Availability in 2012, same engineering team, same code base.

● Fast, petabyte-scale, with a SQL interface
● Encrypted, durable, and highly available
● Near-unlimited resources; only pay for what you use
● Streaming ingest for unbounded data sets

But what have we done lately?

BigQuery: 5+ years of innovation

[Chart: code submits per month, 2010–2016, growing from 0 to ~1,200]

Milestones along the way: Beta release at Google I/O (2010), public launch, big JOIN support, large query results, 100k qps streaming, user-defined functions, Dremel X, faster shuffle, dynamic execution, Capacitor.

Dremel architecture: 2006–2015

Mixer 0 → Mixer 1s → Leaves → Distributed Storage

● Partial reduction
● Long-lived serving tree
● Diskless data flow
● Columnar storage

Dremel X architecture (2015–now)

Master → Shards → Distributed Storage

● Dynamic serving tree
● Columnar storage

Storage engine: ColumnIO (2006–2015)

SELECT play_count FROM songs WHERE name CONTAINS “Sun”;

Every row must be decompressed before the filter can run:

Data        Decompress                   Filter            Emit
F$#h5 xc*   {Hey Jude, 5375}             CONTAINS(“Sun”)
rm7y5 c8!   {My Michelle, 2188}          CONTAINS(“Sun”)
rm7y5 8ec   {My Michelle, 9363}          CONTAINS(“Sun”)
F$#h5 7h!   {Hey Jude, 9502}             CONTAINS(“Sun”)
4t#@h a7c   {Here Comes The Sun, 7833}   CONTAINS(“Sun”)   {7833}
rm7y5 c-%   {My Michelle, 3912}          CONTAINS(“Sun”)

Storage engine: Capacitor (2016–now)

SELECT play_count FROM songs WHERE name CONTAINS “Sun”;

Names are dictionary-encoded, so the filter runs once per dictionary entry instead of once per row:

Dictionary             Filter
0 Hey Jude             CONTAINS(“Sun”) = F
1 My Michelle          CONTAINS(“Sun”) = F
2 Here Comes the Sun   CONTAINS(“Sun”) = T

Data     Lookup   Emit
0 xc*    F
1 c8!    F
1 8ec    F
0 7h!    F
2 a7c    T        {7833}
1 c-%    F

Performance improvement:
● 2x faster on average over all queries
● 10–1000x faster for selective filters
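The dictionary trick above can be sketched in a few lines of Python (a toy model with invented names, not BigQuery's actual implementation): the predicate is evaluated once per distinct value, and per-row work collapses to a cheap integer lookup.

```python
def filter_dictionary_encoded(indices, values, dictionary, predicate):
    """Toy model of a dictionary-encoded column scan.

    indices:    per-row dictionary index for the filtered column
    values:     per-row values of the emitted column
    dictionary: distinct values of the filtered column
    predicate:  evaluated once per dictionary entry, not once per row
    """
    # Evaluate the (possibly expensive) predicate once per distinct value.
    matches = [predicate(entry) for entry in dictionary]
    # Per-row work is now a cheap integer lookup.
    return [v for i, v in zip(indices, values) if matches[i]]

dictionary = ["Hey Jude", "My Michelle", "Here Comes the Sun"]
indices = [0, 1, 1, 0, 2, 1]                  # one entry per row
play_counts = [5375, 2188, 9363, 9502, 7833, 3912]

result = filter_dictionary_encoded(
    indices, play_counts, dictionary, lambda name: "Sun" in name)
print(result)  # [7833]
```

The speedup for selective filters falls out directly: a predicate over millions of rows is evaluated only once per distinct name.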

Cloud Dataflow

One possible flow (recap)

Application logs → Cloud Pub/Sub → Cloud Dataflow → BigQuery

Time-to-answers matters

● Sequence a human genome
● Process point-of-sale transaction logs
● Who is the best at collecting Poké Balls?
● Who/what is trending?

Data boundedness

Bounded: finite data set; fixed schema; complete; typically at rest.
Unbounded: infinite data set; potentially changing schema; never complete; typically not at rest.

Latency happens

● Transmit: network delay or unavailable; hardware failure
● Ingest: throughput (read latency); ingest delay (write latency); internal backlog; ingest failure
● Process: poor engine design; starved resources; backlog; confused heuristics

The evolution of Cloud Dataflow

Colossus, BigTable, PubSub, Dremel, Spanner, Megastore, Millwheel, Flume, MapReduce → Google Cloud Dataflow → Apache Beam

Apache Beam (incubating) is a unified programming model designed to provide efficient and portable data processing pipelines.

1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
5. Streaming with retractions
6. Sessions

Formalizing event-time skew

Watermarks describe event-time progress in the processing-time space: “No timestamp earlier than the watermark will be seen.”

Often based on heuristics:
● Too slow? Results are delayed :-(
● Too fast? Some data is late :-(
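One common heuristic can be sketched as follows (a toy model with invented names, not how any Beam runner actually computes watermarks): the watermark trails the maximum observed event time by a fixed slack, and anything older than the watermark counts as late. A large slack makes results slow; a small slack makes more data late.

```python
class HeuristicWatermark:
    """Toy watermark: trail the max observed event time by a fixed slack.

    Too much slack ("too slow"): results are delayed.
    Too little slack ("too fast"): more data arrives late.
    """
    def __init__(self, slack_seconds):
        self.slack = slack_seconds
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        """Advance on new input; return True if this element is late."""
        is_late = event_time < self.current()
        self.max_event_time = max(self.max_event_time, event_time)
        return is_late

    def current(self):
        return self.max_event_time - self.slack

wm = HeuristicWatermark(slack_seconds=60)
print(wm.observe(1000))  # False: first element cannot be late
print(wm.observe(1200))  # False: event time is advancing
print(wm.observe(1150))  # False: within the 60 s slack
print(wm.observe(900))   # True: older than the watermark (1200 - 60)
```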

What / Where / When / How

● What are you computing?
● Where in event time?
● When in processing time?
● How do refinements relate?

What are you computing?

● Per-element transformations
● Aggregations
● Compositions (subgraphs)

What: computing integer sums.
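As a minimal illustration of the "What" (an aggregation), here is a per-key integer sum in plain Python; the names are invented, and a real Beam pipeline would express this as a per-key sum transform independent of windowing, triggering, or refinement.

```python
from collections import defaultdict

def sum_integers_per_key(elements):
    """Per-key integer sum: the 'What' of a pipeline, independent of
    windowing (Where), triggering (When), and refinement (How)."""
    totals = defaultdict(int)
    for key, value in elements:
        totals[key] += value
    return dict(totals)

events = [("julia", 3), ("omar", 8), ("julia", 4), ("omar", 1)]
print(sum_integers_per_key(events))  # {'julia': 7, 'omar': 9}
```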

Where in event time?

Windowing divides data into event-time-based finite chunks.

[Figure: fixed, sliding, and session windows across three keys over time]

Often required when doing aggregations over unbounded data.

Where: fixed 2-minute windows.
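Fixed-window assignment can be sketched as follows (a toy model with invented names; in Beam's Java SDK this is expressed with a transform like `Window.into(FixedWindows.of(...))`): each element is mapped to the window containing its event timestamp, and sums are kept per (key, window).

```python
from collections import defaultdict

def sum_per_fixed_window(elements, window_seconds):
    """Assign each element to a fixed event-time window, then sum per
    (key, window). A window covers [start, start + window_seconds)."""
    totals = defaultdict(int)
    for key, value, event_time in elements:
        window_start = (event_time // window_seconds) * window_seconds
        totals[(key, window_start)] += value
    return dict(totals)

# (key, value, event-time seconds); 2-minute (120 s) fixed windows.
events = [("scores", 5, 10), ("scores", 7, 65), ("scores", 3, 130)]
print(sum_per_fixed_window(events, 120))
# {('scores', 0): 12, ('scores', 120): 3}
```

Because assignment depends only on the event timestamp, late data lands in the right window whenever it arrives; triggering (the "When") decides when that window's result is emitted.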

When in processing time?

● Triggers control when results are emitted.
● Triggers are often relative to the watermark.

When: triggering at the watermark; early and late firings.

How do refinements relate?

Q: If there are multiple panes per window, what should be emitted?
● discard: emit the delta over the pane
● accumulate: emit the running total
● accumulate + retract*: retract the last sum and emit the new running total

A: Driven by the needs of the downstream consumer.

*Accumulating & retracting are not yet implemented in Apache Beam.
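The three refinement modes can be sketched as follows (a toy model with invented names): given the successive pane deltas for one window, each mode determines what the downstream consumer sees.

```python
def pane_outputs(pane_values, mode):
    """Toy model of pane refinement modes for one window.

    pane_values: the delta contributed by each successive pane
    mode: 'discard', 'accumulate', or 'accumulate_retract'
    """
    outputs, running = [], 0
    for delta in pane_values:
        previous = running
        running += delta
        if mode == "discard":
            outputs.append(delta)            # just this pane's delta
        elif mode == "accumulate":
            outputs.append(running)          # running total so far
        elif mode == "accumulate_retract":
            if previous:
                outputs.append(-previous)    # retract the last sum
            outputs.append(running)          # then emit the new total
    return outputs

panes = [3, 4, 5]  # early, on-time, and late firings of one window
print(pane_outputs(panes, "discard"))             # [3, 4, 5]
print(pane_outputs(panes, "accumulate"))          # [3, 7, 12]
print(pane_outputs(panes, "accumulate_retract"))  # [3, -3, 7, -7, 12]
```

A consumer that sums its inputs wants discarding panes; one that overwrites a dashboard cell wants accumulating; one that itself aggregates further needs retractions to avoid double counting.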

How: add the newest pane, remove the previous one.

Customizing What, Where, When, How:
1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
5. Streaming with retractions

The straggler problem

[Figure: per-worker timelines; effects are cumulative per stage]

Work is unevenly distributed across tasks:
● Runtime effects
● Runtime differences
● Processing data size

Underlying Beam readers enable dynamic adaptation.

Beam readers provide simple progress signals, enabling runners to take action based on execution-time characteristics.

APIs for how much work is pending:
● Bounded: double getFractionConsumed()
● Unbounded: long getBacklogBytes()

APIs for work-stealing (bounded):
● Source splitAtFraction(double)
● int getParallelismRemaining()
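The work-stealing split can be sketched as follows (a toy model with invented names; Beam's real `splitAtFraction` operates on a live reader mid-scan): the runner asks a slow reader to split off the unread tail of its range, and a fresh worker picks up the residual.

```python
class RangeSource:
    """Toy bounded source over [start, end); pos tracks consumption."""
    def __init__(self, start, end):
        self.start, self.end, self.pos = start, end, start

    def get_fraction_consumed(self):
        return (self.pos - self.start) / (self.end - self.start)

    def split_at_fraction(self, fraction):
        """Split off the residual range at `fraction`; return it, or
        None if that point has already been consumed."""
        split = self.start + int(fraction * (self.end - self.start))
        if split <= self.pos:
            return None  # too late: the primary already passed it
        residual = RangeSource(split, self.end)
        self.end = split  # primary keeps only [start, split)
        return residual

straggler = RangeSource(0, 1000)
straggler.pos = 300                        # 30% done, progress is slow
stolen = straggler.split_at_fraction(0.5)  # steal the second half
print((straggler.start, straggler.end))    # (0, 500)
print((stolen.start, stolen.end))          # (500, 1000)
```

Because the split only moves the unread tail, it is safe to perform while the primary keeps reading, which is what lets a runner rebalance without restarting tasks.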

Dynamic work rebalancing

[Figures: per-task timelines showing done work, active work, and predicted completion; without rebalancing, stragglers push out the predicted mean completion time; with rebalancing, work is redistributed and finishes sooner]

Dynamic work rebalancing: a real example

Beam pipeline on the Cloud Dataflow runner.

Savings: a 2-stage pipeline split “evenly” is uneven in practice; the same pipeline with dynamic work rebalancing completes sooner.

Dynamic work rebalancing + autoscaling

Beam pipeline on the Google Cloud Dataflow runner:
● Initially allocate ~80 workers based on input size
● Multiple rounds of upsizing, enabled by dynamic splitting
● Long-running tasks aborted without causing stragglers
● Scale up to 1000 workers; tasks stay well-balanced, without initial oversplitting

Cloud Dataflow execution runner

[Diagram: pipeline from source (bounded or unbounded input) through managed Compute and Storage to sink]

● Resource management
● Graph optimization
● Work scheduler
● Resource auto-scaler
● Dynamic work rebalancer
● Intelligent watermarking
● Auto-healing
● Monitoring
● Log collection

The Dataflow Model & Cloud Dataflow

● Dataflow Model & SDKs: a unified model for batch and stream processing
● Google Cloud Dataflow: a no-ops, fully managed service

The Beam Model & Cloud Dataflow

● Apache Beam: a unified model for batch and stream processing, supporting multiple runtimes
● Google Cloud Dataflow: a great place to run Beam

What is in Apache Beam?

1. The Beam Model: What / Where / When / How

2. SDKs for writing Beam pipelines – starting with Java

3. Runners for existing distributed processing backends:
● Apache Flink (thanks to data Artisans)
● Apache Spark (thanks to Cloudera)
● Google Cloud Dataflow (fully managed service)
● Local (in-process) runner for testing

Categorizing runner capabilities

http://beam.incubator.apache.org/capability-matrix/

Apache Beam roadmap

● 02/01/2016: Enter Apache Incubator
● 02/25/2016: 1st commit to ASF repository
● 1H 2016: Refactoring (slight chaos)
● 2H 2016: Portability basics + multiple runners + community growth
● 1H 2017: Stable APIs ⇒ towards a Top-Level Project

Growing the Beam community

Collaborate – a community-driven effort across many organizations and contributors

Grow – the Beam ecosystem and community with active, open involvement

Now and next

● 1st wave: colocation
● 2nd wave (now): virtualized datacenters; user managed, user configured, user maintained
● Next: intelligent services; auto everything

Big Data programming with Google

Focus on insights, not infrastructure

From batch to real-time understanding.

Apache Beam offers a common model across execution engines: please contribute! [email protected]