Caladan: Mitigating Interference at Microsecond Timescales
Joshua Fried and Zhenyuan Ruan, MIT CSAIL; Amy Ousterhout, UC Berkeley; Adam Belay, MIT CSAIL

https://www.usenix.org/conference/osdi20/presentation/fried

This paper is included in the Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, November 4–6, 2020. ISBN 978-1-939133-19-9. Open access to the Proceedings is sponsored by USENIX.

Abstract

The conventional wisdom is that CPU resources such as cores, caches, and memory bandwidth must be partitioned to achieve performance isolation between tasks. Both the widespread availability of cache partitioning in modern CPUs and the recommended practice of pinning latency-sensitive applications to dedicated cores attest to this belief.

In this paper, we show that resource partitioning is neither necessary nor sufficient. Many applications experience bursty request patterns or phased behavior, drastically changing the amount and type of resources they need. Unfortunately, partitioning-based systems fail to react quickly enough to keep up with these changes, resulting in extreme spikes in latency and lost opportunities to increase CPU utilization.

Caladan is a new CPU scheduler that can achieve significantly better quality of service (tail latency, throughput, etc.) through a collection of control signals and policies that rely on fast core allocation instead of resource partitioning. Caladan consists of a centralized scheduler core that actively manages resource contention in the memory hierarchy and between hyperthreads, and a kernel module that bypasses the standard Linux kernel scheduler to support microsecond-scale monitoring and placement of tasks. When colocating memcached with a best-effort, garbage-collected workload, Caladan outperforms Parties, a state-of-the-art resource partitioning system, by 11,000×, reducing tail latency from 580 ms to 52 µs during shifts in resource usage while maintaining high CPU utilization.

1 Introduction

Interactive, data-intensive web services like web search, social networking, and online retail commonly distribute requests across thousands of servers. Minimizing tail latency is critical for these services because end-to-end response times are determined by the slowest individual response [4, 14]. Efforts to reduce tail latency, however, must be carefully balanced with the need to maximize datacenter efficiency; large-scale datacenter operators often pack several tasks together on the same machine to improve CPU utilization in the presence of variable load [22, 57, 66, 71]. Under these conditions, tasks must compete over shared resources such as cores, memory bandwidth, caches, and execution units. When shared resource contention is high, latency increases significantly; this slowdown of tasks due to resource contention is called interference.

The need to manage interference has led to the development of several hardware mechanisms that partition resources. For example, Intel's Cache Allocation Technology (CAT) uses way-based cache partitioning to reserve portions of the last-level cache (LLC) for specific cores [21]. Many systems use these partitioning mechanisms to improve performance isolation [8, 12, 28, 38, 62, 73]. They either statically assign enough resources for peak load, leaving significant CPU utilization on the table, or else make dynamic adjustments over hundreds of milliseconds to seconds. Because each adjustment is incremental, converging to the right configuration after a change in resource usage can take dozens of seconds [8, 12, 38].

Unfortunately, real-world workloads experience changes in resource usage over much shorter timescales. For example, network traffic was observed to be very bursty in Google's datacenters, sometimes consuming more than a dozen cores over short time periods [42], and a study of Microsoft's Bing reports highly bursty thread wakeups on the order of microseconds [27]. Phased resource usage is also common. For example, we found that tasks that rely on garbage collection (GC) periodically consume all available memory bandwidth (§2). Detecting and reacting to such sudden changes in resource usage is not possible with existing systems.

Our goal is to maintain both high CPU utilization and strict performance isolation (for throughput and tail latency) under realistic conditions in which resource usage, and therefore interference, changes frequently. A key requirement is faster reaction times, as even microsecond delays can impact latency after an abrupt increase in interference (§2). There are two challenges to achieving microsecond reaction times. First, there are many types of interference in a shared CPU (hyperthreading, memory bandwidth, LLC, etc.), and obtaining the right control signals that can accurately detect each of them over microsecond timescales is difficult. Second, existing systems face too much software overhead to either gather control signals or adjust resource allocations quickly.

To overcome these challenges, we present an interference-aware CPU scheduler, called Caladan. Caladan consists of a centralized, dedicated scheduler core that collects control signals and makes resource allocation decisions, and a Linux kernel module, called KSCHED, that efficiently adjusts resource allocations. Our scheduler core distinguishes between high-priority, latency-critical (LC) tasks and low-priority, best-effort (BE) tasks. To avoid the reaction time limitations imposed by hardware partitioning (§3), Caladan relies exclusively on core allocation to manage interference.

Caladan uses a carefully selected set of control signals and corresponding actions to quickly and accurately detect and respond to interference over microsecond timescales. We observe that interference has two interrelated effects: first, interference slows down the execution speed of cores (more cache misses, higher memory latency, etc.), impacting the service times of requests; second, as cores slow down, compute capacity drops; when it falls below offered load, queueing delays increase dramatically.

Caladan's scheduler targets these effects. It collects fine-grained measurements of memory bandwidth usage and request processing times, using these to detect memory bandwidth and hyperthreading interference, respectively. It then restricts cores from the antagonizing BE task(s), eliminating most of the impact on service times. For LLC interference, Caladan cannot eliminate service time overheads directly, but it can still prevent a decrease in compute capacity by allowing LC tasks to steal extra cores from BE tasks.

The KSCHED kernel module accelerates scheduling operations such as waking tasks and collecting interference metrics. It does so by amortizing the cost of sending interrupts, offloading scheduling work from the scheduler core to the tasks' cores, and providing a non-blocking API that allows the scheduler core to handle many inflight operations at once. These techniques eliminate scheduling bottlenecks, allowing Caladan to react quickly while scaling to many cores and tasks, even under heavy interference.

To the best of our knowledge, Caladan is the first system that can maintain both strict performance isolation and high CPU utilization under frequently changing interference and load. To achieve these benefits, Caladan imposes two new requirements on applications: the adoption of a custom runtime system for scheduling and the need for LC tasks to expose their internal concurrency (§8). In exchange, Caladan is able to converge to the right resource configuration 500,000× faster than the typical speed reported for Parties, a state-of-the-art resource partitioning system [12]. We show that this speedup yields an 11,000× reduction in tail latency when colocating memcached

[Figure 1: Periodic GC in a background task (shaded regions: GC Init, Mark Phase, Sweep Phase) increases memory bandwidth usage (top panel, Mem. B/W %), causing severe latency spikes for a colocated memcached instance (bottom panel, 99.9% latency in µs). Note the log-scaled y-axis in the bottom graph.]

…in resource usage can abruptly increase interference. This degrades request service times and causes request queues to grow when the rate of arriving requests exceeds the rate at which a task can process them.

To better understand the challenges associated with time-varying interference, we consider what happens when we colocate an LC task, memcached [43], with a BE workload that exhibits phased behavior due to garbage collection. In this example, we use the Boehm GC (see §7), which employs the mark-sweep algorithm to reclaim dead heap objects [10]. We have observed similar problems with more sophisticated, incremental GCs, such as the Go Language Runtime [60].

In this experiment, we offer a fixed load to memcached and statically partition cores between the two tasks. memcached is given enough cores to keep its 99.9th percentile tail latency below 50 µs when run in isolation. As shown in Figure 1, this allocation is sufficient to protect tail latency when the GC is not running, but it fails when the GC starts. The GC pauses normal execution of the BE task for 100–200 ms and scans
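The observation that queueing delays increase dramatically once compute capacity falls below offered load can be illustrated with a toy discrete-time queue. This is a sketch for intuition only, not Caladan's model; all rates and phase lengths below are invented.

```python
# Toy discrete-time queue: when interference cuts per-tick service capacity
# below the offered load, backlog grows linearly until capacity recovers,
# and every queued request pays that backlog as added delay.
def simulate(arrival_rate: float, capacities: list[float]) -> list[float]:
    """Return queue length after each tick, given per-tick service capacity."""
    q, lengths = 0.0, []
    for cap in capacities:
        q = max(0.0, q + arrival_rate - cap)  # work arrives, capacity drains
        lengths.append(q)
    return lengths

# Offered load: 10 requests/tick. Capacity dips from 12 to 6 during a
# 20-tick "interference" phase (think: a GC burst), then recovers.
caps = [12.0] * 10 + [6.0] * 20 + [12.0] * 50
qs = simulate(10.0, caps)
print(max(qs))   # backlog peaks at the end of the interference phase
print(qs[-1])    # queue eventually drains once capacity recovers
```

The asymmetry is the point: backlog accumulates at 4 requests/tick during the dip but drains at only 2/tick afterward, so a brief capacity shortfall inflates latency for far longer than the shortfall itself, which is why reaction time matters so much.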
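The introduction's claim that end-to-end response times are determined by the slowest individual response can also be made concrete with a short calculation (again a sketch, not from the paper): if each of N servers in a fan-out independently responds slowly with probability p, the request stalls whenever at least one leg is slow, with probability 1 - (1 - p)^N.

```python
# Why tail latency dominates at scale: a fan-out request waits on its
# slowest leg. Probabilities below are illustrative, not measured.
def p_any_slow(n_servers: int, p_slow: float) -> float:
    """P(at least one of n independent legs exceeds the latency target)."""
    return 1.0 - (1.0 - p_slow) ** n_servers

# Even if only 0.1% of individual responses are slow, a request that
# fans out to 1000 servers almost always waits on a straggler.
for n in (1, 100, 1000):
    print(n, p_any_slow(n, 0.001))
```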