Large-Scale Data and Computation: Challenges and Opportunities
Jeff Dean, Google
Joint work with many collaborators
Saturday, January 19, 13

Google's Computational Environment Today
• Many datacenters around the world
Zooming In...

Lots of machines...

Save a bit of power: turn out the lights...

Cool...
Decomposition into Services

[Diagram: a query arrives at a Frontend Web Server, which calls the Ad System, a super root, and spelling correction; below sit the search verticals (News, Local, Video, Images, Blogs, Books, Web), all layered over shared infrastructure: Storage, Scheduling, Naming, ...]
Replication
• Data loss – replicate the data on multiple disks/machines (GFS/Colossus)
• Slow machines – replicate the computation (MapReduce)
• Too much load – replicate for better throughput (nearly all of our services)
• Bad latency – utilize replicas to improve latency – improved worldwide placement of data and services
Shared Environment

[Diagram, built up incrementally: a single machine runs many things at once]
• Linux at the bottom
• file system chunkserver, scheduling system, various other system services
• Bigtable tablet server
• cpu intensive job, random MapReduce #1, random app, random app #2
Shared Environment
• Huge benefit: greatly increased utilization
• ... but hard-to-predict effects increase variability:
  – network congestion
  – background activities
  – bursts of foreground activity
  – not just your jobs, but everyone else's jobs, too
  – not static: change happening constantly
• Exacerbated by large fan-out systems
The Problem with Shared Environments
• Server with 10 ms avg. but 1 sec 99%ile latency
  – touch 1 of these: 1% of requests take ≥1 sec
  – touch 100 of these: 63% of requests take ≥1 sec
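The arithmetic behind those two bullets can be checked directly: with independent 1% tails per server, the chance that at least one of the N servers touched is slow is 1 − 0.99^N. A minimal sketch (the function name is ours):

```python
# Probability that a request is slow when it must touch n servers,
# each independently having a 1% chance of taking >= 1 second.
def prob_slow(n_servers, p_slow_one=0.01):
    return 1.0 - (1.0 - p_slow_one) ** n_servers

print(f"touch 1:   {prob_slow(1):.0%}")    # 1%
print(f"touch 100: {prob_slow(100):.0%}")  # 63%
```

This is why the tail, not the average, dominates at scale: fan-out multiplies exposure to rare slowness.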
Tolerating Faults vs. Tolerating Variability
• Tolerating faults:
  – rely on extra resources (RAIDed disks, ECC memory, distributed system components, etc.)
  – make a reliable whole out of unreliable parts
• Tolerating variability:
  – use these same extra resources
  – make a predictable whole out of unpredictable parts
• Time scales are very different:
  – variability: 1000s of disruptions/sec, scale of milliseconds
  – faults: 10s of failures per day, scale of tens of seconds
Latency Tolerating Techniques
• Cross-request adaptation
  – examine recent behavior
  – take action to improve latency of future requests
  – typically related to balancing load across a set of servers
  – time scale: 10s of seconds to minutes
• Within-request adaptation
  – cope with slow subsystems in the context of a higher-level request
  – time scale: right now, while the user is waiting
• Many such techniques [The Tail at Scale, Dean & Barroso, to appear in CACM Feb. 2013]
Tied Requests

[Animation: Client, Server 1 (queue: req 3, req 5, req 6), Server 2 (queue empty)]
• The client sends req 9 to Server 1 tagged "also: server 2", and to Server 2 tagged "also: server 1" — each request identifies the other server(s) to which it might be sent
• Server 2, with the empty queue, starts req 9 first and notifies the other replica: "Server 2: Starting req 9"
• Server 1 drops its queued copy of req 9
• Server 2 sends the reply to the client

Similar to Michael Mitzenmacher's work on "The Power of Two Choices", except send to both, rather than just picking the "best" one.
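The tied-request idea can be sketched in a few lines. This is an illustrative sketch, not Google's implementation: server names and latencies are invented, and client-side task cancellation stands in for the server-to-server "Starting req 9" message described above.

```python
import asyncio

# Each replica simulates queueing + service time before answering.
async def replica(name, latency, req):
    await asyncio.sleep(latency)
    return f"{name} served {req}"

async def tied_request(req):
    # Send the same request to both replicas...
    tasks = [asyncio.create_task(replica("server1", 0.20, req)),   # busy queue
             asyncio.create_task(replica("server2", 0.05, req))]   # idle
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()                 # ...and cancel the slower copy
    return done.pop().result()

print(asyncio.run(tied_request("req 9")))  # server2 served req 9
```

The client pays for two sends, but latency becomes the minimum of the two replicas' response times instead of a single draw from the tail.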
Tied Requests: Bad Case

[Animation: Client, Server 1 (queue: req 3, req 5), Server 2 (queue: req 3, req 5)]
• The client sends req 9 to Server 1 ("also: server 2") and to Server 2 ("also: server 1")
• Both servers' queues drain at about the same time, and both start req 9 nearly simultaneously
• The cancellation messages ("Server 1: Starting req 9", "Server 2: Starting req 9") cross in flight, too late to stop either execution
• Both servers execute the request; one reply reaches the client, and the other is redundant work

Likelihood of this bad case is reduced with lower latency networks.
Tied Requests: Performance Benefits
• Read operations in distributed file system client
  – send tied request to first replica
  – wait 2 ms, and send tied request to second replica
  – servers cancel tied request on other replica when starting read
• Measure higher-level monitoring ops that touch disk

Cluster state   Policy              50%ile   90%ile   99%ile    99.9%ile
Mostly idle     No backups          19 ms    38 ms    67 ms     98 ms
                Backup after 2 ms   16 ms    28 ms    38 ms     51 ms
+Terasort       No backups          24 ms    56 ms    108 ms    159 ms
                Backup after 2 ms   19 ms    35 ms    67 ms     108 ms

• 99%ile latency drops 43% on the mostly idle cluster, and 38% with Terasort running
• Backups cause only ~1% extra disk reads
• Backups with a big sort job running give the same read latencies as no backups on an idle cluster!
Cluster-Level Services
• Our earliest systems made things easier within a cluster:
  – GFS/Colossus: reliable cluster-level file system
  – MapReduce: reliable large-scale computations
  – Cluster scheduling system: abstracted individual machines
  – BigTable: automatic scaling of higher-level structured storage
• Solve many problems, but leave many cross-cluster issues to human-level operators:
  – different copies of same dataset have different names
  – moving or deploying new service replicas is labor intensive
Spanner: Worldwide Storage
• Single global namespace for data
• Consistent replication across datacenters
• Automatic migration to meet various constraints
  – resource constraints: "The file system in this Belgian datacenter is getting full..."
  – application-level hints: "Place this data in Europe and the U.S." "Place this data in flash, and place this other data on disk"
• System underlies Google's production advertising system, among other uses
• [Spanner: Google's Globally-Distributed Database, Corbett, Dean, ... et al., OSDI 2012]
Higher Level Systems
• Systems that provide a high level of abstraction that "just works" are incredibly valuable:
  – GFS, MapReduce, BigTable, Spanner, tied requests, etc.
• Can we build high-level systems that just work in other domains like machine learning?
Scaling Deep Learning
• Much of Google is working on approximating AI. AI is hard
• Many people at Google spend countless person-years hand-engineering complex features to feed as input to machine learning algorithms
• Is there a better way?
• Deep Learning: use very large scale brain simulations
  – improve many Google applications
  – make significant advances towards perceptual AI

Deep Learning
• Algorithmic approach
  – automatically learn high-level representations from raw data
  – can learn from both labeled and unlabeled data
• Recent academic deep learning results improve on state-of-the-art in many areas (Hinton, Ng, Bengio, LeCun, et al.):
  – images, video, speech, NLP, ...
  – ... using modest model sizes (<= ~50M parameters)
• We want to scale this to much bigger models & datasets
  – currently: ~2B parameters, want ~10B-100B parameters
  – general approach: parallelize at many levels
Deep Networks

[Figure, built up incrementally:]
• Input: image (or video)
• Each unit computes some scalar, nonlinear function of a local image patch
• Many responses at a single location — in many models these are independent, but some allow strong nonlinear interactions
• Multiple "maps" of such responses over the whole input form Layer 1
Unsupervised Training
• Core idea: try to reconstruct input from just the learned representation
  Input Image (or video) → Layer 1 → Reconstruction layer
• Due to Geoff Hinton, Yoshua Bengio, Andrew Ng, and others
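The reconstruction objective can be made concrete with a toy example. This is our illustrative sketch, not the models from the talk: a single linear layer with tied weights, trained by gradient descent to reconstruct random 16-D inputs through a 4-D learned representation (real layers add nonlinearities and pooling).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))            # 100 toy input "patches", 16-D each
W = rng.normal(scale=0.1, size=(16, 4))   # encode down to a 4-D representation

def recon_err(W):
    return float(np.mean((x @ W @ W.T - x) ** 2))  # decode(encode(x)) vs. x

err_before = recon_err(W)
for _ in range(500):
    h = x @ W                     # encode: the learned representation
    err = h @ W.T - x             # how far the reconstruction misses the input
    grad = (x.T @ err @ W + err.T @ x @ W) / len(x)
    W -= 5e-3 * grad              # gradient step on squared reconstruction error
err_after = recon_err(W)
print(err_after < err_before)     # True: reconstruction improves with training
```

Nothing here is labeled — the only training signal is how well the representation can rebuild its own input.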
[Animation: stacking layers]
• Learn Layer 1 on the input image (or video)
• Then learn Layer 2 on top of Layer 1's output, again via a reconstruction layer
• The top of the stack yields an output feature vector, which can be fed to traditional ML tools
Partition Model Across Machines
• Partition assignment in vertical silos: each of Layers 0-3 is split across Partition 1, Partition 2, Partition 3
• Minimal network traffic: the most densely connected areas are on the same partition
• One replica of our biggest models: 144 machines, ~2300 cores
Asynchronous Distributed Stochastic Gradient Descent

[Animation: Parameter Server ⇄ Model ⇄ Data]
• Model replica fetches current parameters p from the parameter server
• Replica computes a gradient update ∆p on its data
• Parameter server applies the update: p' = p + ∆p
• Repeat: fetch p', compute ∆p', apply p'' = p' + ∆p'
• Scaled out: many model workers, each reading its own data shard, all updating the parameter server asynchronously
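The update loop above can be sketched with threads standing in for machines. This is a toy, not the production system: one scalar parameter, an invented quadratic objective per data shard, and Python threads as model workers; a real deployment shards both the model and the parameter server across many machines.

```python
import threading

class ParameterServer:
    def __init__(self, p):
        self.p = p
    def get(self):
        return self.p              # workers may read slightly stale parameters
    def apply(self, delta):
        self.p += delta            # p' = p + Δp, with no locking (asynchronous)

def model_worker(ps, shard_optimum, steps=200, lr=0.05):
    for _ in range(steps):
        p = ps.get()                          # fetch current parameters
        delta = lr * (shard_optimum - p)      # gradient step on this shard's data
        ps.apply(delta)                       # push the update back

ps = ParameterServer(0.0)
workers = [threading.Thread(target=model_worker, args=(ps, opt))
           for opt in (4.0, 6.0)]            # two data shards / model workers
for w in workers: w.start()
for w in workers: w.join()
# Final p ends up between the two shard optima; exactly where depends on
# how the asynchronous updates interleave -- that nondeterminism is the
# price of skipping coordination.
print(ps.p)
```

The staleness and lost-update races visible here are tolerated, not prevented: the gradient noise they add is small relative to SGD's own noise.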
Deep Learning Systems Tradeoffs
• Lots of tradeoffs can be made to improve performance. Which ones are possible without hurting learning performance too much?
• For example:
  – use lower precision arithmetic
  – send 1 or 2 bits instead of 32 bits across the network
  – drop results from slow partitions
• What's the right hardware for training and deploying these sorts of systems?
  – GPUs? FPGAs? Lossy computational devices?
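The "1 or 2 bits instead of 32" idea can be illustrated with a sign-plus-scale scheme. The encoding here is our invention for illustration (the talk does not specify one): transmit only the sign of each gradient value plus one shared scale, the mean magnitude.

```python
import numpy as np

def quantize_1bit(grad):
    scale = np.abs(grad).mean()                   # one shared float scale
    return np.sign(grad).astype(np.int8), scale   # ~1 bit of information/value

def dequantize(signs, scale):
    return signs * scale

g = np.array([0.5, -1.5, 0.25, -0.75])
signs, scale = quantize_1bit(g)
g_hat = dequantize(signs, scale)
print(g_hat)   # [ 0.75 -0.75  0.75 -0.75] -- signs preserved, magnitudes averaged
```

Each value's direction survives while its magnitude is lossy — the kind of degradation SGD often absorbs without much harm to learning.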
Applications
• Acoustic Models for Speech • Unsupervised Feature Learning for Still Images • Neural Language Models
Acoustic Modeling for Speech Recognition
• [Diagram: deep network mapping acoustic input to an output label]
• Close collaboration with Google Speech team
• Trained in <5 days on cluster of 800 machines
• 30% reduction in Word Error Rate ("equivalent to 20 years of speech research")
• Deployed in Jellybean release of Android
Applications
• Acoustic Models for Speech • Unsupervised Feature Learning for Still Images • Neural Language Models
Purely Unsupervised Feature Learning in Images
• Network: a stack of Encode/Decode/Pool layers over the input image
• 60,000 neurons at top level
• 1.15 billion parameters (50x larger than largest deep network in the literature)
• Trained on 16k cores for 1 week using Async-SGD
• Do unsupervised training on one frame from each of 10 million YouTube videos (200x200 pixels)
• No labels!
• Details in our ICML paper [Le et al. 2012]
Purely Unsupervised Feature Learning in Images
• Top-level neurons seem to discover high-level concepts. For example, one neuron is a decent face detector:
• [Histogram: frequency vs. feature value, with non-faces clustered at low feature values and faces at high ones]
Purely Unsupervised Feature Learning in Images
• Most face-selective neuron: top 48 stimuli from the test set, and the optimal stimulus found by numerical optimization
• It is YouTube... we also have a cat neuron! Top stimuli from the test set, and its optimal stimulus
Semi-supervised Feature Learning in Images
• Are the higher-level representations learned by unsupervised training a useful starting point for supervised training?
• We do have some labeled data, so let's fine-tune this same network for a challenging image classification task.
• ImageNet:
  – 16 million images
  – ~21,000 categories
  – recurring academic competitions
Aside: 20,000 is a lot of categories....

01496331 electric ray, crampfish, numbfish, torpedo
01497118 sawfish
01497413 smalltooth sawfish, Pristis pectinatus
01497738 guitarfish
01498041 stingray
01498406 roughtail stingray, Dasyatis centroura
01498699 butterfly ray
01498989 eagle ray
01499396 spotted eagle ray, spotted ray, Aetobatus narinari
01499732 cownose ray, cow-nosed ray, Rhinoptera bonasus
01500091 manta, manta ray, devilfish
01500476 Atlantic manta, Manta birostris
01500854 devil ray, Mobula hypostoma
01501641 grey skate, gray skate, Raja batis
01501777 little skate, Raja erinacea
01501948 thorny skate, Raja radiata
01502101 barndoor skate, Raja laevis
01503976 dickeybird, dickey-bird, dickybird, dicky-bird
01504179 fledgling, fledgeling
01504344 nestling, baby bird
Semi-supervised Feature Learning in Images
• ImageNet Classification Results (ImageNet 2011, 20k categories):
  – Chance: 0.005%
  – Best reported: 9.5%
  – Our network: 16% (+70% relative)
Semi-supervised Feature Learning in Images
• Example top stimuli after fine-tuning on ImageNet:
• [Image grids of top stimuli for Neurons A through E, and Neurons F through I]
Applications
• Acoustic Models for Speech • Unsupervised Feature Learning for Still Images • Neural Language Models
Embeddings

[Figure: a ~100-D joint embedding space. "dolphin" and "porpoise" sit close together, "SeaWorld" is added nearby, while "Obama" and "Paris" land in their own, distant regions of the space.]
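The geometry in that picture is what makes embeddings useful: similar things end up close, so nearest-neighbor lookups recover semantic relatives. A sketch with made-up 3-D vectors (real embeddings are ~100-D and learned, not hand-assigned):

```python
import numpy as np

# Hypothetical embedding vectors for the words in the figure.
emb = {
    "dolphin":  np.array([0.90, 0.10, 0.00]),
    "porpoise": np.array([0.85, 0.15, 0.05]),
    "SeaWorld": np.array([0.60, 0.50, 0.10]),
    "Obama":    np.array([0.00, 0.20, 0.90]),
    "Paris":    np.array([0.10, 0.80, 0.50]),
}

def nearest(word):
    # Cosine similarity: angle between vectors, ignoring their lengths.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w != word), key=lambda w: cos(emb[word], emb[w]))

print(nearest("dolphin"))   # porpoise
```

The same lookup works across anything embedded in the joint space — words, queries, images — which is the point of making the space shared.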
Neural Language Models
• Hinge loss // softmax output layer
• Hidden layers
• Word embedding matrix E, shared across positions:  E E E E ← "the cat sat on the"
• E is a matrix of dimension ||Vocab|| x d
• Top prediction layer has ||Vocab|| x h parameters
• 100s of millions of parameters, but gradients very sparse
• Most ideas from Bengio et al 2003, Collobert & Weston 2008
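The forward pass of this architecture fits in a few lines. A toy sketch under assumed tiny sizes (5-word vocabulary, d=4, h=8; the slides use ||Vocab||=1M, d=50) — the shared matrix E embeds each context word, the vectors are concatenated, and a softmax over the vocabulary predicts the next word:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, d, h, ctx = len(vocab), 4, 8, 4

E = rng.normal(scale=0.1, size=(V, d))       # ||Vocab|| x d embedding matrix
W1 = rng.normal(scale=0.1, size=(ctx * d, h))
W2 = rng.normal(scale=0.1, size=(h, V))      # prediction layer: ||Vocab|| x h params

def next_word_probs(context):                # e.g. ["the", "cat", "sat", "on"]
    ids = [vocab.index(w) for w in context]
    x = E[ids].reshape(-1)        # the same E looks up every position; concatenate
    hidden = np.tanh(x @ W1)      # hidden layer
    logits = hidden @ W2
    p = np.exp(logits - logits.max())
    return p / p.sum()            # softmax over the vocabulary

p = next_word_probs(["the", "cat", "sat", "on"])
assert abs(p.sum() - 1.0) < 1e-9  # a proper distribution over the 5 words
```

Note why gradients are sparse: a training example touches only the ctx rows of E for its context words and the output rows involved in the loss, not all ||Vocab|| rows.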
Embedding sparse tokens in an N-dimensional space
• Example: 50-D embedding trained for semantic similarity
• [Figure: the tokens "apple", "stab", and "iPhone" plotted in the space]
Neural Language Models
• 7 billion word Google News training set
• 1 million word vocabulary
• 8 word history, 50 dimensional embedding
• Three hidden layers each w/ 200 nodes
• 50-100 asynchronous model workers

Perplexity Scores
  Traditional 5-gram:  XXX
  NLM:                 +15%
  5-gram + NLM:        -33%
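Perplexity, the metric in that table, is the exponential of the average negative log-probability the model assigns to each word in held-out text — roughly, the effective number of words the model is "choosing among" (lower is better, so the -33% row is the win). A minimal sketch:

```python
import math

def perplexity(word_probs):
    # exp of the average negative log-probability assigned to each word
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

uniform = [1.0 / 8] * 4   # a model that always spreads mass over 8 words
print(round(perplexity(uniform), 3))   # 8.0 -- as uncertain as an 8-way guess
```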
Deep Learning Applications
Many other applications not discussed today: • Clickthrough prediction for advertising • Video understanding • Recommendation systems ...
Thanks! Questions...?

Further reading:
• Ghemawat, Gobioff, & Leung. The Google File System, SOSP 2003.
• Barroso, Dean, & Hölzle. Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, 2003.
• Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.
• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A Distributed Storage System for Structured Data, OSDI 2006.
• Brants, Popat, Xu, Och, & Dean. Large Language Models in Machine Translation, EMNLP 2007.
• Le, Ranzato, Monga, Devin, Chen, Corrado, Dean, & Ng. Building High-Level Features Using Large Scale Unsupervised Learning, ICML 2012.
• Dean et al. Large Scale Distributed Deep Networks, NIPS 2012.
• Corbett, Dean, ..., et al. Spanner: Google's Globally-Distributed Database, OSDI 2012.
• Dean & Barroso. The Tail at Scale, to appear in CACM Feb. 2013.
• Protocol Buffers: http://code.google.com/p/protobuf/
• Snappy: http://code.google.com/p/snappy/
• Google Perf Tools: http://code.google.com/p/google-perftools/
• LevelDB: http://code.google.com/p/leveldb/

See: http://research.google.com/people/jeff and http://research.google.com/papers