Attack of the Killer Microseconds
contributed articles

DOI:10.1145/3015146

COMMUNICATIONS OF THE ACM | APRIL 2017 | VOL. 60 | NO. 4

Microsecond-scale I/O means tension between performance and productivity that will need new latency-mitigating ideas, including in hardware.

BY LUIZ BARROSO, MIKE MARTY, DAVID PATTERSON, AND PARTHASARATHY RANGANATHAN

Illustration by Peter Crowther Associates

key insights
˽ A new breed of low-latency I/O devices, ranging from faster datacenter networking to emerging non-volatile memories and accelerators, motivates greater interest in microsecond-scale latencies.
˽ Existing system optimizations targeting nanosecond- and millisecond-scale events are inadequate for events in the microsecond range.
˽ New techniques are needed to enable simple programs to achieve high performance when microsecond-scale latencies are involved, including new microarchitecture support.

THE COMPUTER SYSTEMS we use today make it easy for programmers to mitigate event latencies in the nanosecond and millisecond time scales (such as DRAM accesses at tens or hundreds of nanoseconds and disk I/Os at a few milliseconds) but significantly lack support for microsecond (µs)-scale events. This oversight is quickly becoming a serious problem for programming warehouse-scale computers, where efficient handling of microsecond-scale events is becoming paramount for a new breed of low-latency I/O devices ranging from datacenter networking to emerging memories (see the first sidebar "Is the Microsecond Getting Enough Respect?").

Processor designers have developed multiple techniques to facilitate a deep memory hierarchy that works at the nanosecond scale by providing a simple synchronous programming interface to the memory system. A load operation will logically block a thread's execution, with the program appearing to resume after the load completes. A host of complex microarchitectural techniques make high performance possible while supporting this intuitive programming model. Techniques include prefetching, out-of-order execution, and branch prediction. Since nanosecond-scale devices are so fast, low-level interactions are performed primarily by hardware.

At the other end of the latency-mitigating spectrum, computer scientists have worked on a number of techniques, typically software based, to deal with the millisecond time scale. Operating system context switching is a notable example. For instance, when a read() system call to a disk is made, the operating system kicks off the low-level I/O operation but also performs a software context switch to a different thread to make use of the processor during the disk operation. The original thread resumes execution sometime after the I/O completes. The long overhead of making a disk access (milliseconds) easily outweighs the cost of two context switches (microseconds). Millisecond-scale devices are slow enough that the cost of these software-based mechanisms can be amortized (see Table 1).

These synchronous models for interacting with nanosecond- and millisecond-scale devices are easier than the alternative of asynchronous models. In an asynchronous programming model, the program sends a request to a device and continues processing other work until the request has finished. To detect when the request has finished, the program must either periodically poll the status of the request or use an interrupt mechanism.
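The contrast between the two models can be sketched in a few lines of Python. This is an illustrative toy, not code from the article: the device, its latency, and the `read_sync`/`submit_read` helper names are all invented for the example.

```python
import threading
import time

class FakeDevice:
    """Toy stand-in for an I/O device with a fixed service latency."""
    def __init__(self, latency_s=0.01):
        self.latency_s = latency_s

    # --- Synchronous model: the calling thread simply blocks. ---
    def read_sync(self):
        time.sleep(self.latency_s)   # models the device's service time
        return b"data"

    # --- Asynchronous model: submit a request, then poll it. ---
    def submit_read(self):
        req = {"done": False, "result": None}
        def complete():
            time.sleep(self.latency_s)
            req["result"] = b"data"
            req["done"] = True       # completion flag the caller polls
        threading.Thread(target=complete).start()
        return req

dev = FakeDevice()

# Synchronous: one straight-line call, easy to write, tune, and debug.
data = dev.read_sync()

# Asynchronous: the program must interleave other work with polling.
req = dev.submit_read()
other_work = 0
while not req["done"]:   # periodic polling, as described in the text
    other_work += 1      # stand-in for useful overlapped work
data2 = req["result"]
```

Even in this tiny example, the asynchronous version forces the caller to manage request state and completion detection by hand; in real systems that burden compounds across callbacks, error paths, and languages.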
Our experience at Google leads us to strongly prefer a synchronous programming model. Synchronous code is a lot simpler, hence easier to write, tune, and debug (see the second sidebar "Synchronous/Asynchronous Programming Models"). The benefits of synchronous programming models are further amplified at scale, when a typical end-to-end application can span multiple small closely interacting systems, often written across several languages (C, C++, Java, JavaScript, Go, and the like) where the languages all have their own differing and ever-changing idioms for asynchronous programming.2,11 Having simple and consistent APIs and idioms across the codebase significantly improves software-development productivity.

As one illustrative example of the overall benefits of a synchronous programming model in Google, a rewrite of the Colossus6 client library to use a synchronous I/O model with lightweight threads gained a significant improvement in performance while making the code more compact and easier to understand. More important, however, was that shifting the burden of managing asynchronous events away from the programmer to the operating system or the thread library makes the code significantly simpler. This transfer is very important in warehouse-scale environments where the codebase is touched by thousands of developers, with significant software releases multiple times per week.

Recent trends point to a new breed of low-latency I/O devices that will work in neither the millisecond nor the nanosecond time scales. Consider datacenter networking. The transmission time for a full-size packet at 40Gbps is less than one microsecond (300ns). At the speed of light, the time to traverse the length of a typical datacenter (say, 200 meters to 300 meters) is approximately one microsecond. Fast datacenter networks are thus likely to have latencies on the order of microseconds. Likewise, raw flash device latency is on the order of tens of microseconds. The latencies of emerging new non-volatile memory technologies (such as the Intel-Micron Xpoint 3D memory9 and Moneta3) are expected to be at the lower end of the microsecond regime as well. In-memory systems (such as the Stanford RAMCloud14 project) have also estimated latencies on the order of microseconds. Fine-grain GPU offloads (and other accelerators) have similar microsecond-scale latencies.

Table 1. Events and their latencies showing emergence of a new breed of microsecond events.

nanosecond events: register file: 1ns–5ns; cache accesses: 4ns–30ns; memory access: 100ns
microsecond events: datacenter networking: O(1µs); new NVM memories: O(1µs); high-end flash: O(10µs); GPU/accelerator: O(10µs)
millisecond events: low-end flash: O(1ms); disk: O(10ms); wide-area networking: O(10ms)

Figure 1. Cumulative software overheads, all in the range of microseconds, can degrade performance by a few orders of magnitude. (Bar chart: latency in µs, 0–100, accumulating across RDMA, two-sided messaging, thread dispatch, interrupts, RPC, and TCP.)

Not only are microsecond-scale hardware devices becoming available, we also see a growing need to exploit lower-latency storage and communication. One key reason is the demise of Dennard scaling and the slowing of Moore's Law in 2003 that ended the rapid improvement in microprocessor performance; today's processors are approximately 20 times slower than if they had continued to double in performance every 18 months.8 In response, to enhance online services, cloud companies have increased the number of computers they use in a customer query. For instance, a single-user search query already turns into thousands of remote procedure calls (RPCs), a number that will increase in the future.
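The network figures quoted above are easy to verify with back-of-envelope arithmetic. The 1,500-byte packet size (the standard Ethernet MTU) and the free-space speed of light are assumptions made for this sketch; the article itself only quotes the resulting latencies.

```python
# Transmission time of a full-size packet on a 40Gbps link.
PACKET_BYTES = 1500                        # standard Ethernet MTU (assumed)
LINK_GBPS = 40
transmit_ns = PACKET_BYTES * 8 / LINK_GBPS # bits / (Gbit/s) gives ns
# 1500 * 8 / 40 = 300 ns, matching the article's figure

# Propagation time across a 300-meter datacenter at the speed of light.
SPEED_OF_LIGHT_M_PER_NS = 0.3              # ~0.3 m/ns in free space
propagation_ns = 300 / SPEED_OF_LIGHT_M_PER_NS
# ~1000 ns, i.e., approximately one microsecond
```

Both components land squarely in the microsecond regime, which is why no amount of faster signaling alone pushes datacenter round trips back into nanoseconds.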
Techniques optimized for nanosecond or millisecond time scales do not scale well for this microsecond regime. Superscalar out-of-order execution, branch prediction, prefetching, simultaneous multithreading, and other techniques for nanosecond time scales do not scale well to the microsecond regime; system designers do not have enough instruction-level parallelism or hardware-managed thread contexts to hide the longer latencies. Likewise, software techniques to tolerate millisecond-scale latencies (such as software-directed context switching) scale poorly down to microseconds; the overheads in these techniques often equal or exceed the latency of the I/O device itself.

Table 2. Service times measuring number of instructions between I/O events for a production-quality-tuned web search workload. The flash and DRAM data are measured for real production use; the data for the new memory/storage tiers is based on detailed trace analysis.

Task: data-intensive workload / Service time (task length between I/Os)
Flash: 225K instructions = O(100µs)
Fast flash: ~20K instructions = O(10µs)
New NVM memory: ~2K instructions = O(1µs)
DRAM: 500 instructions = O(100ns–1µs)

Is the Microsecond Getting Enough Respect?

A simple comparison, as in the figure here, of the occurrence of the terms "millisecond," "microsecond," and "nanosecond" in Google's n-gram viewer (a tool that charts frequencies of words in a large corpus of books printed from 1800 to 2012, https://books.google.com/ngrams) points to the lack of adequate attention to the microsecond-level time scale. Microprocessors moved out of the microsecond scale toward nanoseconds, while networking and storage latencies have remained in the milliseconds. With the rise of a new breed of I/O devices in the datacenter, it is time for system designers to refocus on how to achieve high performance and ease of programming at the microsecond scale.

(Figure: n-gram viewer of ms, µs, and ns. Word frequencies from 1920 to 2000; "millisecond" and "nanosecond" both appear far more often than "microsecond.")
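Why software overheads that are harmless for disks become fatal for microsecond devices comes down to simple arithmetic. The numbers below are illustrative, chosen to be consistent with the orders of magnitude in Table 2 and Figure 1; they are not measurements from the article.

```python
# Assumed fixed software overhead per I/O (context switches, interrupts,
# RPC stack, ...): a few microseconds, per the Figure 1 breakdown.
OVERHEAD_US = 5.0

def efficiency(device_us, overhead_us=OVERHEAD_US):
    """Fraction of each I/O spent in the device rather than in software."""
    return device_us / (device_us + overhead_us)

disk  = efficiency(10_000)  # O(10ms) disk: overhead is noise
flash = efficiency(100)     # O(100µs) flash: still tolerable
nvm   = efficiency(1)       # O(1µs) NVM: software now dominates
```

With the same 5µs of software on every path, a disk access is >99% device time, flash is ~95%, but a 1µs NVM access spends over 80% of its latency in overhead, squandering the fast hardware.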
As we will see, it is quite easy to take fast