Large-scale data analysis at cloud scale
johnwilkes@google.com, 2016-09
Based on slides from Eric Schmidt, Greg DeMichillie, Frances Perry, Tyler Akidau, Dan Halperin.

GCP Big Data reference architecture
● Data & Analytics
● Application Development
● Infrastructure & Operations
One possible flow
Application logs (Impression, Start stream, Add to playlist, Share, ...)
→ Cloud Pub/Sub: ingest and propagate
→ Cloud Dataflow: processing and imperative analysis
→ BigQuery: durable storage and interactive analysis

Questions you might ask
- Which application sections are receiving the most impressions?
- Who is the top artist by stream starts in the last minute?
- What is the average session length for users in Seattle?
10+ years of tackling Big Data problems
Google papers, 2002–2015: GFS, MapReduce, BigTable, Dremel, PubSub, FlumeJava, Millwheel, Dataflow, TensorFlow
The Dremel paper — in: Proc. 36th VLDB, Sept 2010.
Dremel in 2016
Dremel is mission critical for Google: in production for 10+ years, in every Google datacenter.
Internal usage every month:
● O(exabytes) analyzed
● O(quadrillions) of rows scanned
● 80% of Googlers use it
BigQuery ≃ Dremel + cloud
Our idea of highly available, managed analytics:
● no indexing, no resharding, no storage resizing
● just ... RUN QUERY
BigQuery ≃ Dremel + cloud
Fully managed data warehouse – General Availability in 2012 – same engineering team, same code base.
● Fast, petabyte-scale, with a SQL interface
● Encrypted, durable, and highly available
● Near-unlimited resources; only pay for what you use
● Streaming ingest for unbounded data sets
But what have we done lately?
BigQuery: 5+ years of innovation
[Chart: code submits per month, growing from ~100 in 2010 to ~1,200 in 2016, annotated with milestones:]
● Beta release at Google I/O
● Public launch
● Big JOIN support
● Large query results
● User-defined functions
● 100k qps streaming
● Dynamic Execution
● Faster shuffle
● Capacitor
● Dremel X
Dremel architecture: 2006–2015
Mixer 0 → Mixer 1s → leaves → Distributed Storage
● Long-lived serving tree
● Partial reduction
● Diskless data flow
● Columnar storage
Dremel X architecture (2015–now)
Master → shards → Distributed Storage
● Dynamic serving tree
● Columnar storage
Storage engine: ColumnIO (2006–2015)
SELECT play_count FROM songs WHERE name CONTAINS “Sun”;
Every row must be decompressed and run through the filter:

Data → decompress → filter → emit
F$#h5 xc* → {Hey Jude, 5375} → CONTAINS(“Sun”)? no
rm7y5 c8! → {My Michelle, 2188} → no
rm7y5 8ec → {My Michelle, 9363} → no
F$#h5 7h! → {Hey Jude, 9502} → no
4t#@h a7c → {Here Comes The Sun, 7383} → yes → {7383}
rm7y5 c-% → {My Michelle, 3912} → no
Storage engine: Capacitor (2016–now)
SELECT play_count FROM songs WHERE name CONTAINS “Sun”;
Names are dictionary-encoded, so the filter runs once per dictionary entry rather than once per row:

Dictionary: 0 = Hey Jude → CONTAINS(“Sun”)? no; 1 = My Michelle → no; 2 = Here Comes the Sun → yes
Rows (dictionary id, compressed play_count): (0, xc*), (1, c8!), (1, 8ec), (0, 7h!), (2, a7c), (1, c-%)
Only the row with id 2 is decompressed and emitted: {7383}

Performance improvement:
● 2x faster on average over all queries
● 10–1000x faster for selective filters
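To make the dictionary trick concrete, here is a small Python sketch (illustrative only — not Capacitor's actual implementation; all names are hypothetical). The key point is that the predicate runs once per distinct dictionary entry instead of once per row:

```python
# Illustrative sketch of dictionary-encoded filtering: the predicate is
# evaluated once per dictionary entry, then rows are matched by id.

def filter_dictionary_encoded(dictionary, row_ids, values, predicate):
    """dictionary: id -> string; row_ids: per-row dictionary ids;
    values: per-row payload (e.g. play_count); predicate: str -> bool."""
    # One predicate evaluation per distinct string, not per row.
    matches = {i: predicate(name) for i, name in dictionary.items()}
    # Emit payloads only for rows whose dictionary id matched.
    return [v for rid, v in zip(row_ids, values) if matches[rid]]

songs = {0: "Hey Jude", 1: "My Michelle", 2: "Here Comes the Sun"}
row_ids = [0, 1, 1, 0, 2, 1]
play_counts = [5375, 2188, 9363, 9502, 7383, 3912]

result = filter_dictionary_encoded(songs, row_ids, play_counts,
                                   lambda name: "Sun" in name)
print(result)  # [7383]
```

With 6 rows but only 3 distinct names, the predicate runs 3 times instead of 6; on real columns with very low cardinality the savings are much larger, which is one intuition behind the "10–1000x faster for selective filters" claim.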
Cloud Dataflow
One possible flow (recap)
Application logs → Cloud Pub/Sub (ingest and propagate) → Cloud Dataflow (processing and analysis) → BigQuery (durable storage and interactive analysis)
Time-to-answers matters
● Sequence a human genome
● Process point-of-sale transaction logs
● Who is the best at collecting Poké Balls?
● Who/what is trending?
Data boundedness
● Bounded: finite data set; fixed schema; complete; typically at rest
● Unbounded: infinite data set; potentially changing schema; never complete; typically not at rest
Latency happens
● Transmit: network delay or unavailability; hardware failure
● Ingest: throughput (read latency); ingest delay (write latency); internal backlog; ingest failure
● Process: poor engine design; starved resources; backlog; confused heuristics
The evolution of Apache Beam
Google internal systems — MapReduce, Colossus, BigTable, Spanner, Megastore, Dremel, PubSub, Millwheel, Flume — led to Google Cloud Dataflow, which in turn led to Apache Beam.
Apache Beam (incubating) is a unified programming model designed to provide efficient and portable data processing pipelines.
1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
5. Streaming with retractions
6. Sessions
Formalizing event-time skew
Watermarks describe event-time progress in the processing-time space: "No timestamp earlier than the watermark will be seen."
Often based on heuristics:
● Too slow? Results are delayed :-(
● Too fast? Some data is late :-(
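A toy sketch of a heuristic watermark (hypothetical Python, not Beam's API): track the maximum event time seen and subtract an estimated skew; an element whose event time falls behind the current watermark is "late".

```python
# Toy heuristic watermark: max event time seen minus an estimated skew.
# This is an illustration of the concept, not how any real runner does it.

class HeuristicWatermark:
    def __init__(self, estimated_skew):
        self.estimated_skew = estimated_skew
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self):
        return self.max_event_time - self.estimated_skew

wm = HeuristicWatermark(estimated_skew=5)
late = []
for event_time in [1, 3, 9, 2, 12, 4]:
    if event_time < wm.current():   # arrived behind the watermark
        late.append(event_time)
    wm.observe(event_time)

print(late)  # [2, 4]
```

A larger estimated skew delays results (the "too slow" case); a smaller one marks more data late (the "too fast" case).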
What / Where / When / How
● What are you computing?
● Where in event time?
● When in processing time?
● How do refinements relate?
What: what are you computing?
● Per-element
● Aggregations
● Compositions (subgraphs)
What: computing integer sums
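As a minimal illustration of "computing integer sums" per key — a plain-Python stand-in for a Beam sum-per-key aggregation, not the Beam SDK itself:

```python
# Per-key integer sums: the "What" of the running example,
# sketched without any framework.
from collections import defaultdict

def sum_per_key(elements):
    """elements: iterable of (key, int) pairs -> dict of key -> sum."""
    totals = defaultdict(int)
    for key, value in elements:
        totals[key] += value
    return dict(totals)

scores = [("amy", 3), ("bob", 4), ("amy", 8), ("bob", 1), ("cat", 7)]
print(sum_per_key(scores))  # {'amy': 11, 'bob': 5, 'cat': 7}
```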
Where: where in event time?
Windowing divides data into event-time-based finite chunks: fixed windows, sliding windows, or sessions.
[Figure: elements for three keys assigned to fixed, sliding, and session windows along an event-time axis.]
Often required when doing aggregations over unbounded data.
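The three window shapes can be sketched as plain event-time assignment functions (illustrative Python, assuming non-negative integer timestamps; not the Beam SDK's windowing classes):

```python
# Event-time window assignment, sketched directly. Windows are
# half-open [start, end) intervals.

def fixed_window(ts, size):
    """Each timestamp lands in exactly one fixed window."""
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Each timestamp lands in every sliding window that contains it."""
    last_start = ts - (ts % period)
    starts = range(last_start, ts - size, -period)
    return [(s, s + size) for s in starts if s <= ts < s + size]

def sessions(timestamps, gap):
    """Timestamps closer than `gap` merge into one session window."""
    out = []
    for ts in sorted(timestamps):
        if out and ts - out[-1][1] < gap:
            out[-1] = (out[-1][0], ts)   # extend the current session
        else:
            out.append((ts, ts))         # start a new session
    return out
```

For example, `fixed_window(7, 4)` gives the single window `(4, 8)`, while `sliding_windows(7, 4, 2)` gives both windows of size 4 that contain timestamp 7.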
Where: fixed 2-minute windows
When: when in processing time?
● Triggers control when results are emitted.
● Triggers are often relative to the watermark.

When: triggering at the watermark
When: early and late firings
How: how do refinements relate?
Q: if there are multiple panes per window, what should be emitted?
● discard – emit the delta over the pane
● accumulate – emit the running total
● accumulate + retract* – retract the last sum & emit the new running total
A: driven by the needs of the downstream consumer
*Accumulating & retracting is not yet implemented in Apache Beam.
How: add newest, remove previous
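A small sketch of how successive panes of one window relate under the three modes (illustrative Python; per the caveat above, "accumulate + retract" is shown here even though Beam had not yet implemented it):

```python
# Given the running totals a window reaches at each trigger firing,
# show what each accumulation mode would emit downstream.

def panes(sums, mode):
    """sums: running totals per firing, e.g. [5, 12, 14]."""
    if mode == "discard":
        # emit only the delta since the previous pane
        return [cur - prev for prev, cur in zip([0] + sums, sums)]
    if mode == "accumulate":
        return list(sums)                      # emit the running total
    if mode == "accumulate_retract":
        out = []
        for prev, cur in zip([None] + sums, sums):
            if prev is not None:
                out.append(-prev)              # retract the last sum
            out.append(cur)                    # emit the new running total
        return out
    raise ValueError(mode)

print(panes([5, 12, 14], "discard"))             # [5, 7, 2]
print(panes([5, 12, 14], "accumulate"))          # [5, 12, 14]
print(panes([5, 12, 14], "accumulate_retract"))  # [5, -5, 12, -12, 14]
```

Discarding suits consumers that re-sum downstream; accumulating suits consumers that overwrite; retractions keep downstream aggregates correct when panes are themselves aggregated again.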
Customizing What / Where / When / How
1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
5. Streaming with retractions
The straggler problem
Effects are cumulative per stage; work is unevenly distributed across tasks:
● runtime effects
● runtime differences
● processing data size
Underlying Beam readers enable dynamic adaptation.
Beam readers provide simple progress signals, enabling runners to take action based on execution-time characteristics.
APIs for how much work is pending:
● Bounded: double getFractionConsumed()
● Unbounded: long getBacklogBytes()
APIs for work-stealing:
● Bounded: Source splitAtFraction(double), int getParallelismRemaining()
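The progress and work-stealing signals can be sketched with a toy bounded source over a range of records (hypothetical Python names mirroring the Java signatures above; not the actual Beam classes):

```python
# Toy bounded source: a runner can poll fraction consumed and steal
# the unread remainder with split_at_fraction().

class RangeSource:
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.pos = start

    def read_one(self):
        record = self.pos
        self.pos += 1
        return record

    def get_fraction_consumed(self):
        return (self.pos - self.start) / (self.end - self.start)

    def split_at_fraction(self, fraction):
        split = self.start + int((self.end - self.start) * fraction)
        if split <= self.pos:
            return None                  # can't split already-read work
        remainder = RangeSource(split, self.end)
        self.end = split                 # this worker keeps [start, split)
        return remainder

src = RangeSource(0, 100)
for _ in range(30):
    src.read_one()
print(src.get_fraction_consumed())       # 0.3
rest = src.split_at_fraction(0.5)        # hand [50, 100) to another worker
print((src.end, rest.start, rest.end))   # (50, 50, 100)
```

This is the mechanism behind dynamic work rebalancing: a straggling task keeps only the prefix it is already reading, and the runner reschedules the stolen remainder elsewhere.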
Dynamic work rebalancing
[Figures: per-task timelines of done work, active work, and predicted completion, before and after rebalancing, with the predicted mean completion time marked at "now".]
Dynamic work rebalancing: a real example
Beam pipeline on the Google Cloud Dataflow runner: a 2-stage pipeline split "evenly" but uneven in practice, vs. the same pipeline with dynamic work rebalancing → savings.

Dynamic work rebalancing + autoscaling
Beam pipeline on the Google Cloud Dataflow runner:
● initially allocate ~80 workers based on input size
● multiple rounds of upsizing enabled by dynamic splitting
● long-running tasks aborted without causing stragglers
● scale up to 1,000 workers, with tasks staying well-balanced, without initial oversplitting

Cloud Dataflow execution runner
From sources (bounded and unbounded) to sinks:
● Resource management; compute and storage
● Graph optimization
● Intelligent watermarking
● Work scheduler
● Auto-healing
● Resource auto-scaler
● Monitoring
● Dynamic work rebalancer
● Log collection
The Dataflow Model & Cloud Dataflow
● Dataflow Model & SDKs: a unified model for batch and stream processing
● Google Cloud Dataflow: a no-ops, fully managed service

The Beam Model & Cloud Dataflow
● Apache Beam: a unified model for batch and stream processing
● Google Cloud Dataflow: a great place to run Beam, supporting multiple runtimes
What is in Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines – starting with Java
3. Runners for existing distributed processing backends:
● Apache Flink (thanks to data Artisans)
● Apache Spark (thanks to Cloudera)
● Google Cloud Dataflow (fully managed service)
● Local (in-process) runner for testing
Categorizing runner capabilities
http://beam.incubator.apache.org/capability-matrix/
Apache Beam roadmap
● 02/01/2016: enter Apache Incubator
● 02/25/2016: 1st commit to ASF repository
● 1H 2016: refactoring (slight chaos)
● 2H 2016: portability basics + multiple runners + community growth
● 1H 2017: stable APIs ⇒ towards a Top-Level Project
Growing the Beam community
Collaborate – a community-driven effort across many organizations and contributors
Grow – the Beam ecosystem and community with active, open involvement
Now and next
● 1st wave: colocation
● 2nd wave: virtualized datacenters — user managed, user configured, user maintained
● Now: intelligent services — auto everything; Big Data programming with Google
Focus on insights, not infrastructure
From batch to real-time understanding
Apache Beam offers a common model across execution engines: please contribute! [email protected]