Amdahl's Law for Tail Latency

DOI: 10.1145/3232559

Queueing theoretic models can guide design trade-offs in systems targeting tail latency, not just average performance.

BY CHRISTINA DELIMITROU AND CHRISTOS KOZYRAKIS

Translating the impact of Amdahl's Law on tail latency provides new insights into what future generations of data-center hardware and software architectures should look like. The emphasis on latency, instead of just throughput, puts increased pressure on system designs that improve both parallelism and single-thread performance.

Computer architecture is at an inflection point. The emergence of warehouse-scale computers has brought large online services to the forefront in the form of Web search, social networks, software-as-a-service, and more. These applications service millions of user queries daily, run distributed over thousands of machines, and are concerned with the tail latency (such as the 99th percentile) of user requests in addition to high throughput.6 These characteristics represent a significant departure from previous systems, where the performance metric of interest was only throughput or, at most, average latency. Optimizing for tail latency is already changing the way we build operating systems, cluster managers, and data services.7,8 This article investigates how the focus on tail latency affects hardware designs, including what types of processor cores to build, how much chip area to invest in caching structures, how much resource interference between services matters, how to schedule different user requests in multicore chips, and how these decisions interact with the desire to minimize energy consumption at the chip or data-center level.2

While the precise answers will come from detailed experiments with both simulated and real systems, there is great value in having an analytical framework that identifies the major trade-offs and challenges in latency-sensitive cloud systems. We aim here to complement the previous analyses of Amdahl's Law for parallel and multicore systems1,11 by designing a model that draws from basic queueing theory (see Figure 1 in the sidebar "Analytical Framework") and can provide first-order insights into how design decisions interact with tail latency. As was the case with the previous analyses based on Amdahl's Law, our model has significant implications for processor designs for cloud servers.

Key insights:

  • Optimizing for tail latency makes Amdahl's Law more consequential than when optimizing for average performance.
  • Queueing theory can provide accurate first-order insights into how hardware for future interactive services should be designed.
  • As service responsiveness and predictability become more critical, finding a balance between compute and memory resources likewise becomes more critical.

Analytical Framework

Amdahl's Law describes the speedup of a program when a fraction f of the computation is accelerated by a factor S. Speedup is then defined as

$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{S}}$$

In a multicore machine, Amdahl's Law captures the benefit from multiple cores in average performance. While this interpretation is still relevant, it is, by itself, insufficient for describing tail latency requirements. To bridge the gap, we build upon ideas from queueing theory, which provides a framework to reason about task-arrival rates, service times, and end-to-end response times. Simple models (such as M/M/1 and M/M/k) are particularly attractive for first-order performance calculations because they can concisely describe performance in closed-form expressions.

M/M/1 model. We start with one of the simplest queueing models: the M/M/1 queue, modeling a system in which a single server processes incoming tasks. Tasks arrive under a Poisson process with rate λ. The service times also follow an exponential distribution, with rate parameter µ and mean service time Ts = 1/µ (µ = perf(r) in the main text of the article). A larger µ means a more powerful server and results in lower latency. Tasks are processed in a simple first-in-first-out order. This simple queueing system is stable when µ > λ. In contrast, when λ ≥ µ, the number of queued tasks keeps increasing, leading to instability. The load of the system is defined as ρ = λ/µ. Given these definitions, the mean number of tasks in the system is

$$E[N] = \frac{\rho}{1 - \rho}$$

where N is a random variable for the number of tasks. Likewise, the mean of task response time (using random variable R) is

$$E[R] = \frac{E[N]}{\lambda} = \frac{1}{\mu - \lambda}$$

and the p-th percentile of response time is

$$R_p = E[R]\,\ln\!\left(\frac{100}{100 - p}\right)$$

Figure 1a outlines the 99th percentile of request latency as a function of the service rate µ. As µ increases, tail latency drops both at low and high load.
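The M/M/1 expressions above translate directly into code. The following is a minimal sketch of our own (the function name and example values are illustrative, not from the article) that computes E[N], E[R], and the p-th response-time percentile:

```python
import math

def mm1_metrics(lam, mu, p=99.0):
    """E[N], E[R], and the p-th percentile of response time for an
    M/M/1 queue with arrival rate lam and service rate mu."""
    if lam >= mu:
        raise ValueError("unstable: requires mu > lam")
    rho = lam / mu                        # load
    mean_tasks = rho / (1 - rho)          # E[N]
    mean_resp = 1.0 / (mu - lam)          # E[R] = E[N]/lam (Little's law)
    resp_pct = mean_resp * math.log(100.0 / (100.0 - p))  # R_p
    return mean_tasks, mean_resp, resp_pct

# At 50% load (lam = 0.5, mu = 1), E[R] = 2*Ts, while the 99th
# percentile is ~9.2*Ts: the tail is far longer than the mean.
print(mm1_metrics(0.5, 1.0))
```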
M/M/k model. We now extend the M/M/1 model to a more realistic system with k equivalent servers in order to model a multicore machine. Tasks are now added to a single, shared queue, from which the servers draw them for processing. As with the M/M/1 model, tasks arrive under a Poisson process with arrival rate λ, and each server processes tasks with service rate µ. Closed-form solutions for the mean response time and response-time percentiles exist but are more complicated than in the M/M/1 model. Specifically, system load is ρ = λ/(kµ). The probability that a new task must be enqueued is given by Erlang's C formula

$$P_Q = \frac{(k\rho)^k / k!}{(1 - \rho)\sum_{i=0}^{k-1} (k\rho)^i / i! \;+\; (k\rho)^k / k!}$$

and the mean number of tasks in the system is

$$E[N] = k\rho + \frac{\rho}{1 - \rho}\,P_Q$$

The average response time is

$$E[R] = \frac{E[N]}{\lambda} = \frac{1}{\mu} + \frac{P_Q}{k\mu - \lambda}$$

Finally, the p-th percentile of queueing time is

$$Q_p = \frac{1}{k\mu - \lambda}\,\ln\!\left(\frac{100\,P_Q}{100 - p}\right)$$

which applies when P_Q > (100 − p)/100; otherwise, the p-th percentile of queueing time is zero.

Figure 1b outlines how the 99th percentile of queueing time correlates with the service rate µ for one and four servers. Higher service rates correspond to less time spent by requests in the queue.
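As a sketch of how these M/M/k expressions compose (our own code, using the standard Erlang C form; the percentile function assumes the tail is set by tasks that actually queue):

```python
import math

def erlang_c(k, lam, mu):
    """Probability P_Q that an arriving task must queue in an M/M/k system."""
    a = lam / mu                          # offered load, a = k * rho
    rho = a / k
    if rho >= 1:
        raise ValueError("unstable: requires lam < k * mu")
    tail = (a ** k) / math.factorial(k)
    partial = sum((a ** i) / math.factorial(i) for i in range(k))
    return tail / ((1 - rho) * partial + tail)

def mmk_queueing_percentile(k, lam, mu, p=99.0):
    """p-th percentile of queueing (waiting) time in an M/M/k system."""
    pq = erlang_c(k, lam, mu)
    if pq <= (100.0 - p) / 100.0:         # percentile falls at zero wait
        return 0.0
    return math.log(100.0 * pq / (100.0 - p)) / (k * mu - lam)

# Four servers at 50% load: lam = 2, mu = 1, k = 4.
print(mmk_queueing_percentile(4, 2.0, 1.0))
```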
We use the M/M/k model for analysis of system trade-offs unless otherwise specified. In the article's section on validation, we verify that this model closely reflects real system behavior. For applications with non-Poisson arrival- and service-time distributions, more general queueing models may be needed (such as the G/G/k model).10,24 For more complex applications (such as multi-tier services), system architects would need a more sophisticated analytical model (such as a queueing network).

Figure 1. Building system insights from queueing theory: (a) 99th percentile response time in an M/M/1 model; and (b) 99th percentile queueing time in an M/M/4 model, as a function of the service rate µ, at 10%, 50%, and 99% load.

While analytical models help draw first-order insights, they run the risk of not accurately reflecting the complex operation of a real system. In Figure 2, we show a brief validation study of the queueing model, as discussed in the sidebar, with {1, 4, 8, 16} compute cores against a real instantiation of memcached, a popular in-memory key-value store, with the same number of cores. We set the mean interarrival rate and service time of the queueing model based on the measured times with memcached. In both cases, when providing memcached with exponentially distributed input load, the memcached request latency is close to the one estimated by the queueing model across load levels.

To connect the queueing model to hardware, the processor's area or power budget is expressed in basic core equivalents (BCEs), and a core built from r resource units achieves performance perf(r), which determines the service rate µ. Because performance scales sublinearly with resources, this creates trade-offs between opting for few brawny or many wimpy cores. By default, we follow Shekhar Borkar3 and use perf(r) = sqrt(r) but have also investigated how higher roots affect the corresponding insights.

Brawny Versus Wimpy Cores

We first examine a system where all cores are homogeneous and have identical cost. An important question the designer must answer is: Given a constrained aggregate power or area budget, should architects build a few large cores or many small cores? The answer has been heavily debated in recent years in both academia and industry,4,12,14,17,19,22 as it relates to the introduction of new designs (such as the ARM server chips and throughput processors like Xeon Phi). Assuming the total budget is R = 100 BCEs, an architect can build 100 basic cores of 1BCE each, 25 cores of 4BCEs each, one large core of 100BCEs, or in general R/U cores of U units each, as shown in Figure 3.

Figure 4a shows how throughput in queries per second (QPS) changes for different latency QoS targets, under the M/M/k queueing model described in the sidebar. Throughput of 100QPS for QoS = 10Ts means the system achieved 100QPS for which the 99th latency percentile is 10Ts. The x-axis captures the size of the selected cores, moving from many small cores on the left side to a single core of 100BCEs on the right side. We examine all core sizes from 1BCE up to 100BCEs in increments of a single resource unit. In configurations with multiple cores, throughput is aggregated across all cores. The discontinuities in the graph are an artifact of the limited resource budget and homogeneous design; for example, for U = 51, an architect can build a single 51BCE core, while 49 resource units remain unused. Throughput for 10Ts for cores greater than 7BCE overlaps with 100Ts, as does throughput for 5Ts for cores of more than 12BCEs.

Finding 1. Very strict QoS targets put significant pressure on single-thread performance, favoring a few brawny cores over many wimpy ones.
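The sweep behind a Figure 4a-style analysis can be sketched as follows. This is our own illustration, not the authors' code: it assumes R = 100 BCEs, perf(U) = sqrt(U) normalized so a 1BCE core has Ts = 1, reuses mmk_queueing_percentile from the sketch above, and approximates the 99th-percentile response time as the 99th-percentile queueing time plus the mean service time.

```python
R = 100  # total budget in basic core equivalents (BCEs)

def tail_latency(k, lam, mu, p=99.0):
    """Approximate p-th percentile of response time for k cores:
    queueing-time percentile plus mean service time (a simplification)."""
    return mmk_queueing_percentile(k, lam, mu, p) + 1.0 / mu

def max_qps(k, mu, qos, iters=60):
    """Binary-search the largest arrival rate whose approximate
    99th-percentile latency stays within qos (in units of the 1BCE Ts)."""
    lo, hi = 0.0, k * mu                  # lam must stay below k * mu
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if tail_latency(k, mid, mu) <= qos:
            lo = mid
        else:
            hi = mid
    return lo

# Homogeneous designs: k = R // U cores of U BCEs each, mu = sqrt(U);
# any leftover BCEs (R - k*U) stay unused, as in the article's example.
for U in (1, 4, 16, 25, 50, 100):
    k = R // U
    print(f"U={U:3d} BCEs, k={k:3d} cores, QPS@10Ts={max_qps(k, U ** 0.5, 10.0):.2f}")
```

Under these assumptions, loose targets such as 10Ts favor many small cores, while tightening the target shifts the optimum toward fewer, larger cores, matching the trend Finding 1 describes.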
