Large-scale data analysis at cloud scale
johnwilkes@.com, 2016-09

Based on slides from Greg DeMichillie, Frances Perry, Tyler Akidau, and Dan Halperin.

GCP Big Data reference architecture
● Data & Analytics
● Application Development
● Infrastructure & Operations

One possible flow

Application logs (impression, start stream, add to playlist, share, ...) → Cloud Pub/Sub (ingest and propagate) → Cloud Dataflow (processing and imperative analysis) → BigQuery (durable storage and interactive analysis)

Questions you might ask:
● Which application sections are receiving the most impressions?
● Who is the top artist by stream starts in the last minute?
● What is the average session length for users in Seattle?

10+ years of tackling Big Data problems

Google papers, 2002–2015: GFS, MapReduce, BigTable, Dremel, PubSub, FlumeJava, Millwheel, Dataflow, TensorFlow

The Dremel paper. In: Proc. 36th VLDB, Sept 2010.

Dremel in 2016

Dremel is mission critical for Google: in production for 10+ years, in every Google datacenter. Internal usage every month:

● O(exabytes) analyzed
● O(quadrillions) of rows scanned
● 80% of Googlers use it

BigQuery ≃ Dremel + cloud

Our idea of highly available, managed analytics:

● no indexing, no resharding, no storage resizing
● just ... RUN QUERY

BigQuery ≃ Dremel + cloud

Fully managed data warehouse: General Availability in 2012, same engineering team, same code base.

● Fast, petabyte-scale, with a SQL interface
● Encrypted, durable, and highly available
● Near-unlimited resources; only pay for what you use
● Streaming ingest for unbounded data sets

But what have we done lately?

BigQuery: 5+ years of innovation

[Chart: code submits per month, 2010–2016, growing from 0 to ~1,200]

Milestones along the way: Beta release at Google I/O (2010), public launch, big JOIN support, large query results, 100k qps streaming, user-defined functions, Dremel X, faster shuffle, dynamic execution, Capacitor.

Dremel architecture: 2006–2015

Mixer 0 → Mixer 1s → Leaves → Distributed Storage

● Partial reduction
● Long-lived serving tree
● Diskless data flow
● Columnar storage

Dremel X architecture (2015–now)

Master → Shards → Distributed Storage

● Dynamic serving tree
● Columnar storage

Storage engine: ColumnIO (2006–2015)

SELECT play_count FROM songs WHERE name CONTAINS “Sun”;

Every row must be decompressed before the filter can run:

Data        Decompress                   Filter            Emit
F$#h5 xc*   {Hey Jude, 5375}             CONTAINS(“Sun”)
rm7y5 c8!   {My Michelle, 2188}          CONTAINS(“Sun”)
rm7y5 8ec   {My Michelle, 9363}          CONTAINS(“Sun”)
F$#h5 7h!   {Hey Jude, 9502}             CONTAINS(“Sun”)
4t#@h a7c   {Here Comes The Sun, 7833}   CONTAINS(“Sun”)   {7833}
rm7y5 c-%   {My Michelle, 3912}          CONTAINS(“Sun”)

Storage engine: Capacitor (2016–now)

SELECT play_count FROM songs WHERE name CONTAINS “Sun”;

Names are dictionary-encoded, so the filter runs once per dictionary entry instead of once per row:

Dictionary             Filter
0 Hey Jude             CONTAINS(“Sun”) = F
1 My Michelle          CONTAINS(“Sun”) = F
2 Here Comes the Sun   CONTAINS(“Sun”) = T

Data     Lookup   Emit
0 xc*    F
1 c8!    F
1 8ec    F
0 7h!    F
2 a7c    T        {7833}
1 c-%    F

Performance improvement:
● 2x faster on average over all queries
● 10–1000x faster for selective filters
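The dictionary trick above can be sketched in a few lines of Python (a toy model with invented names, not BigQuery's actual implementation): the predicate is evaluated once per distinct value, and per-row work collapses to a cheap integer lookup.

```python
def filter_dictionary_encoded(indices, values, dictionary, predicate):
    """Toy model of a dictionary-encoded column scan.

    indices:    per-row dictionary index for the filtered column
    values:     per-row values of the emitted column
    dictionary: distinct values of the filtered column
    predicate:  evaluated once per dictionary entry, not once per row
    """
    # Evaluate the (possibly expensive) predicate once per distinct value.
    matches = [predicate(entry) for entry in dictionary]
    # Per-row work is now a cheap integer lookup.
    return [v for i, v in zip(indices, values) if matches[i]]

dictionary = ["Hey Jude", "My Michelle", "Here Comes the Sun"]
indices = [0, 1, 1, 0, 2, 1]                  # one entry per row
play_counts = [5375, 2188, 9363, 9502, 7833, 3912]

result = filter_dictionary_encoded(
    indices, play_counts, dictionary, lambda name: "Sun" in name)
print(result)  # [7833]
```

The speedup for selective filters falls out directly: a predicate over millions of rows is evaluated only once per distinct name.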

Cloud Dataflow

One possible flow (recap)

Application logs → Cloud Pub/Sub → Cloud Dataflow → BigQuery

Time-to-answers matters

● Sequence a human genome
● Process point-of-sale transaction logs
● Who is the best at collecting Poké Balls?
● Who/what is trending?

Data boundedness

Bounded: finite data set; fixed schema; complete; typically at rest.
Unbounded: infinite data set; potentially changing schema; never complete; typically not at rest.

Latency happens

● Transmit: network delay or unavailable; hardware failure
● Ingest: throughput (read latency); ingest delay (write latency); internal backlog; ingest failure
● Process: poor engine design; starved resources; backlog; confused heuristics

The evolution of Cloud Dataflow

Colossus, BigTable, PubSub, Dremel, Spanner, Megastore, Millwheel, Flume, MapReduce → Google Cloud Dataflow → Apache Beam

Apache Beam (incubating) is a unified programming model designed to provide efficient and portable data processing pipelines.

1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
5. Streaming with retractions
6. Sessions

Formalizing event-time skew

Watermarks describe event-time progress in the processing-time space: “No timestamp earlier than the watermark will be seen.”

Often based on heuristics:
● Too slow? Results are delayed :-(
● Too fast? Some data is late :-(
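One common heuristic can be sketched as follows (a toy model with invented names, not how any Beam runner actually computes watermarks): the watermark trails the maximum observed event time by a fixed slack, and anything older than the watermark counts as late. A large slack makes results slow; a small slack makes more data late.

```python
class HeuristicWatermark:
    """Toy watermark: trail the max observed event time by a fixed slack.

    Too much slack ("too slow"): results are delayed.
    Too little slack ("too fast"): more data arrives late.
    """
    def __init__(self, slack_seconds):
        self.slack = slack_seconds
        self.max_event_time = float("-inf")

    def observe(self, event_time):
        """Advance on new input; return True if this element is late."""
        is_late = event_time < self.current()
        self.max_event_time = max(self.max_event_time, event_time)
        return is_late

    def current(self):
        return self.max_event_time - self.slack

wm = HeuristicWatermark(slack_seconds=60)
print(wm.observe(1000))  # False: first element cannot be late
print(wm.observe(1200))  # False: event time is advancing
print(wm.observe(1150))  # False: within the 60 s slack
print(wm.observe(900))   # True: older than the watermark (1200 - 60)
```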

What / Where / When / How

● What are you computing?
● Where in event time?
● When in processing time?
● How do refinements relate?

What are you computing?

● Per-element transformations
● Aggregations
● Compositions (subgraphs)

What: computing integer sums.
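As a minimal illustration of the "What" (an aggregation), here is a per-key integer sum in plain Python; the names are invented, and a real Beam pipeline would express this as a per-key sum transform independent of windowing, triggering, or refinement.

```python
from collections import defaultdict

def sum_integers_per_key(elements):
    """Per-key integer sum: the 'What' of a pipeline, independent of
    windowing (Where), triggering (When), and refinement (How)."""
    totals = defaultdict(int)
    for key, value in elements:
        totals[key] += value
    return dict(totals)

events = [("julia", 3), ("omar", 8), ("julia", 4), ("omar", 1)]
print(sum_integers_per_key(events))  # {'julia': 7, 'omar': 9}
```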

Where in event time?

Windowing divides data into event-time-based finite chunks.

[Figure: fixed, sliding, and session windows across three keys over time]

Often required when doing aggregations over unbounded data.

Where: fixed 2-minute windows.
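Fixed-window assignment can be sketched as follows (a toy model with invented names; in Beam's Java SDK this is expressed with a transform like `Window.into(FixedWindows.of(...))`): each element is mapped to the window containing its event timestamp, and sums are kept per (key, window).

```python
from collections import defaultdict

def sum_per_fixed_window(elements, window_seconds):
    """Assign each element to a fixed event-time window, then sum per
    (key, window). A window covers [start, start + window_seconds)."""
    totals = defaultdict(int)
    for key, value, event_time in elements:
        window_start = (event_time // window_seconds) * window_seconds
        totals[(key, window_start)] += value
    return dict(totals)

# (key, value, event-time seconds); 2-minute (120 s) fixed windows.
events = [("scores", 5, 10), ("scores", 7, 65), ("scores", 3, 130)]
print(sum_per_fixed_window(events, 120))
# {('scores', 0): 12, ('scores', 120): 3}
```

Because assignment depends only on the event timestamp, late data lands in the right window whenever it arrives; triggering (the "When") decides when that window's result is emitted.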

When in processing time?

● Triggers control when results are emitted.
● Triggers are often relative to the watermark.

When: triggering at the watermark; early and late firings.

How do refinements relate?

Q: If there are multiple panes per window, what should be emitted?
● discard: emit the delta over the pane
● accumulate: emit the running total
● accumulate + retract*: retract the last sum and emit the new running total

A: Driven by the needs of the downstream consumer.

*Accumulating & retracting are not yet implemented in Apache Beam.
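The three refinement modes can be sketched as follows (a toy model with invented names): given the successive pane deltas for one window, each mode determines what the downstream consumer sees.

```python
def pane_outputs(pane_values, mode):
    """Toy model of pane refinement modes for one window.

    pane_values: the delta contributed by each successive pane
    mode: 'discard', 'accumulate', or 'accumulate_retract'
    """
    outputs, running = [], 0
    for delta in pane_values:
        previous = running
        running += delta
        if mode == "discard":
            outputs.append(delta)            # just this pane's delta
        elif mode == "accumulate":
            outputs.append(running)          # running total so far
        elif mode == "accumulate_retract":
            if previous:
                outputs.append(-previous)    # retract the last sum
            outputs.append(running)          # then emit the new total
    return outputs

panes = [3, 4, 5]  # early, on-time, and late firings of one window
print(pane_outputs(panes, "discard"))             # [3, 4, 5]
print(pane_outputs(panes, "accumulate"))          # [3, 7, 12]
print(pane_outputs(panes, "accumulate_retract"))  # [3, -3, 7, -7, 12]
```

A consumer that sums its inputs wants discarding panes; one that overwrites a dashboard cell wants accumulating; one that itself aggregates further needs retractions to avoid double counting.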

How: add the newest pane, remove the previous one.

Customizing What, Where, When, How:
1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
5. Streaming with retractions

The straggler problem

[Figure: per-worker timelines; effects are cumulative per stage]

Work is unevenly distributed across tasks:
● Runtime effects
● Runtime differences
● Processing data size

Underlying Beam readers enable dynamic adaptation.

Beam readers provide simple progress signals, enabling runners to take action based on execution-time characteristics.

APIs for how much work is pending:
● Bounded: double getFractionConsumed()
● Unbounded: long getBacklogBytes()

APIs for work-stealing (bounded):
● Source splitAtFraction(double)
● int getParallelismRemaining()
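The work-stealing split can be sketched as follows (a toy model with invented names; Beam's real `splitAtFraction` operates on a live reader mid-scan): the runner asks a slow reader to split off the unread tail of its range, and a fresh worker picks up the residual.

```python
class RangeSource:
    """Toy bounded source over [start, end); pos tracks consumption."""
    def __init__(self, start, end):
        self.start, self.end, self.pos = start, end, start

    def get_fraction_consumed(self):
        return (self.pos - self.start) / (self.end - self.start)

    def split_at_fraction(self, fraction):
        """Split off the residual range at `fraction`; return it, or
        None if that point has already been consumed."""
        split = self.start + int(fraction * (self.end - self.start))
        if split <= self.pos:
            return None  # too late: the primary already passed it
        residual = RangeSource(split, self.end)
        self.end = split  # primary keeps only [start, split)
        return residual

straggler = RangeSource(0, 1000)
straggler.pos = 300                        # 30% done, progress is slow
stolen = straggler.split_at_fraction(0.5)  # steal the second half
print((straggler.start, straggler.end))    # (0, 500)
print((stolen.start, stolen.end))          # (500, 1000)
```

Because the split only moves the unread tail, it is safe to perform while the primary keeps reading, which is what lets a runner rebalance without restarting tasks.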

Dynamic work rebalancing

[Figures: per-task timelines showing done work, active work, and predicted completion; without rebalancing, stragglers push out the predicted mean completion time; with rebalancing, work is redistributed and finishes sooner]

Dynamic work rebalancing: a real example

Beam pipeline on the Cloud Dataflow runner.

Savings: a 2-stage pipeline split “evenly” is uneven in practice; the same pipeline with dynamic work rebalancing completes sooner.

Dynamic work rebalancing + autoscaling

Beam pipeline on the Google Cloud Dataflow runner:
● Initially allocate ~80 workers based on input size
● Multiple rounds of upsizing, enabled by dynamic splitting
● Long-running tasks aborted without causing stragglers
● Scale up to 1000 workers; tasks stay well-balanced, without initial oversplitting

Cloud Dataflow execution runner

[Diagram: pipeline from source (bounded or unbounded input) through managed Compute and Storage to sink]

● Resource management
● Graph optimization
● Work scheduler
● Resource auto-scaler
● Dynamic work rebalancer
● Intelligent watermarking
● Auto-healing
● Monitoring
● Log collection

The Dataflow Model & Cloud Dataflow

● Dataflow Model & SDKs: a unified model for batch and stream processing
● Google Cloud Dataflow: a no-ops, fully managed service

The Beam Model & Cloud Dataflow

● Apache Beam: a unified model for batch and stream processing, supporting multiple runtimes
● Google Cloud Dataflow: a great place to run Beam

What is in Apache Beam?

1. The Beam Model: What / Where / When / How

2. SDKs for writing Beam pipelines – starting with Java

3. Runners for existing distributed processing backends:
● Apache Flink (thanks to data Artisans)
● Apache Spark (thanks to Cloudera)
● Google Cloud Dataflow (fully managed service)
● Local (in-process) runner for testing

Categorizing runner capabilities

http://beam.incubator.apache.org/capability-matrix/

Apache Beam roadmap

● 02/01/2016: Enter Apache Incubator
● 02/25/2016: 1st commit to ASF repository
● 1H 2016: Refactoring (slight chaos)
● 2H 2016: Portability basics + multiple runners + community growth
● 1H 2017: Stable APIs ⇒ towards a Top-Level Project

Growing the Beam community

Collaborate – a community-driven effort across many organizations and contributors

Grow – the Beam ecosystem and community with active, open involvement

Now and next

● 1st wave: colocation
● 2nd wave (now): virtualized datacenters; user managed, user configured, user maintained
● Next: intelligent services; auto everything

Big Data programming with Google

Focus on insights, not infrastructure

From batch to real-time understanding.

Apache Beam offers a common model across execution engines: please contribute! [email protected]