
contributed articles

DOI:10.1145/3015146

Attack of the Killer Microseconds

Microsecond-scale I/O means tension between performance and productivity that will need new latency-mitigating ideas, including in hardware.

BY LUIZ BARROSO, MIKE MARTY, DAVID PATTERSON, AND PARTHASARATHY RANGANATHAN

key insights
- A new breed of low-latency I/O devices, ranging from faster datacenter networking to emerging non-volatile memories and accelerators, motivates greater interest in microsecond-scale latencies.
- Existing system optimizations targeting nanosecond- and millisecond-scale events are inadequate for events in the microsecond range.
- New techniques are needed to enable simple programs to achieve high performance when microsecond-scale latencies are involved, including new microarchitecture support.

THE COMPUTER SYSTEMS we use today make it easy for programmers to mitigate event latencies in the nanosecond and millisecond time scales (such as DRAM accesses at tens or hundreds of nanoseconds and disk I/Os at a few milliseconds) but significantly lack support for microsecond (µs)-scale events. This oversight is quickly becoming a serious problem for programming warehouse-scale computers, where efficient handling of microsecond-scale events is becoming paramount for a new breed of low-latency I/O devices ranging from datacenter networking to emerging memories (see the first sidebar "Is the Microsecond Getting Enough Respect?").

Processor designers have developed multiple techniques to facilitate a deep memory hierarchy that works at the nanosecond scale by providing a simple synchronous programming interface to the memory system. A load operation will logically block a thread's execution, with the program appearing to resume after the load completes. A host of complex microarchitectural techniques make high performance possible while supporting this intuitive programming model. Techniques include prefetching, out-of-order execution, and branch prediction. Since nanosecond-scale devices are so fast, low-level interactions are performed primarily by hardware.

At the other end of the latency-mitigating spectrum, computer scientists have worked on a number of techniques—typically software based—to deal with the millisecond time scale. Operating system context switching is a notable example. For instance, when a read() system call to a disk is made, the operating system kicks off the low-level I/O operation but also performs a software context switch to a different thread to make use of the processor during the disk operation. The original thread resumes execution sometime after the I/O completes. The long overhead of making a disk access (milliseconds) easily outweighs the cost of two context switches (microseconds). Millisecond-scale devices are slow enough that the cost of these software-based mechanisms can be amortized (see Table 1).

These synchronous models for interacting with nanosecond- and millisecond-scale devices are easier to program than the alternative asynchronous models. In an asynchronous programming model, the program sends a request to a device and continues processing other work until the request has finished.


To detect when the request has finished, the program must either periodically poll the status of the request or use an interrupt mechanism. Our experience at Google leads us to strongly prefer a synchronous programming model. Synchronous code is a lot simpler, hence easier to write, tune, and debug (see the second sidebar "Synchronous/Asynchronous Programming Models").

The benefits of synchronous programming models are further amplified at scale, when a typical end-to-end application can span multiple small, closely interacting systems, often written across several languages—C, C++, Java, JavaScript, Go, and the like—where the languages all have their own differing and ever-changing idioms for asynchronous programming.2,11 Having simple and consistent APIs and idioms across the codebase significantly improves software-development productivity. As one illustrative example of the overall benefits of a synchronous programming model in Google, a rewrite of the Colossus6 client library to use a synchronous I/O model with lightweight threads gained a significant improvement in performance while making the code more compact and easier to understand. More important, however, was that shifting the burden of managing asynchronous events away from the programmer to the operating system or the thread library makes the code significantly simpler. This transfer is very important in warehouse-scale environments where the codebase is touched by thousands of developers, with significant software releases multiple times per week.

Recent trends point to a new breed of low-latency I/O devices that will work in neither the millisecond nor the nanosecond time scales. Consider datacenter networking. The transmission time for a full-size packet at 40Gbps is less than one microsecond (300ns). At the speed of light, the time to traverse the length of a typical datacenter (say, 200 meters to 300 meters) is approximately one microsecond. Fast datacenter networks are thus likely to have latencies of the order of microseconds. Likewise, raw flash device latency is on the order of tens of microseconds. The latencies of emerging new non-volatile memory technologies (such as the Intel-Micron Xpoint 3D memory9 and Moneta3) are expected to be at the lower end of the microsecond regime as well. In-memory systems (such as the Stanford RAMCloud14 project) have also estimated latencies on the order of microseconds. Fine-grain GPU offloads (and other accelerators) have similar microsecond-scale latencies.

Table 1. Events and their latencies showing emergence of a new breed of microsecond events.

nanosecond events           microsecond events               millisecond events
register file: 1ns–5ns      datacenter networking: O(1µs)    disk: O(10ms)
cache accesses: 4ns–30ns    new NVM memories: O(1µs)         low-end flash: O(1ms)
memory access: 100ns        high-end flash: O(10µs)          wide-area networking: O(10ms)
                            GPU/accelerator: O(10µs)

Not only are microsecond-scale hardware devices becoming available, we also see a growing need to exploit lower-latency storage and communication. One key reason is the demise of Dennard scaling and the slowing of Moore's Law in 2003 that ended the rapid improvement in microprocessor performance; today's processors are approximately 20 times slower than if they had continued to double in performance every 18 months.8 In response, to enhance online services, cloud companies have increased the number of computers they use in a customer query. For instance, a single-user search query already turns into thousands of remote procedure calls (RPCs), a number that will increase in the future.

Techniques optimized for nanosecond or millisecond time scales do not scale well for this microsecond regime. Superscalar out-of-order execution, branch prediction, prefetching, simultaneous multithreading, and other techniques for nanosecond time scales do not scale well to the microsecond regime; system designers do not have enough instruction-level parallelism or hardware-managed thread contexts to hide the longer latencies. Likewise, software techniques to tolerate millisecond-scale latencies (such as software-directed context switching) scale poorly down to microseconds; the overheads in these techniques often equal or exceed the latency of the I/O device itself.
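To make that last point concrete, the back-of-envelope comparison below contrasts the two regimes. The numbers are illustrative assumptions (the article gives only orders of magnitude, and the ~5µs charged for a pair of context switches is a hypothetical figure):

// Back-of-envelope comparison; values are illustrative assumptions only.
constexpr double kDiskAccessUs  = 10000.0;  // O(10ms) disk read
constexpr double kNvmAccessUs   = 2.0;      // microsecond-scale NVM access
constexpr double kTwoSwitchesUs = 5.0;      // hypothetical cost of two context switches
// Software overhead relative to the device access being hidden:
//   disk: 5us / 10,000us = 0.05%  -> easily amortized
//   NVM:  5us / 2us      = 250%   -> the overhead exceeds the access itself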


Is the Microsecond Getting Enough Respect?

A simple comparison, as in the figure here, of the occurrence of the terms "millisecond," "microsecond," and "nanosecond" in Google's n-gram viewer (a tool that charts the frequency of words in a large corpus of books printed from 1800 to 2012, https://books.google.com/ngrams) points to the lack of adequate attention to the microsecond-level time scale. Microprocessors moved out of the microsecond scale toward nanoseconds, while networking and storage latencies have remained in the milliseconds. With the rise of a new breed of I/O devices in the datacenter, it is time for system designers to refocus on how to achieve high performance and ease of programming at the microsecond scale.

n-gram viewer of ms, µs, and ns.

[Chart: n-gram frequency (percent of corpus) of "millisecond (All)," "nanosecond (All)," and "microsecond (All)" in books published 1920–2000.]

As we will see, it is quite easy to just take fast hardware and throw away its performance with software designed for millisecond-scale devices.

How to Waste a Fast Datacenter Network

To better understand how optimizations can target the microsecond regime, consider a high-performance network. Figure 1 is an illustrative example of how a 2µs fabric can, through a cumulative set of software overheads, turn into a nearly 100µs datacenter fabric. Each measurement reflects the median round-trip latency (from the application), with no queueing delays; that is, unloaded latencies.

Figure 1. Cumulative software overheads, all in the range of microseconds, can degrade performance by a few orders of magnitude. [Chart: median round-trip latency (µs) as overheads accumulate: RDMA, two-sided, thread dispatch, interrupts, RPC, TCP.]

A very basic remote direct memory access (RDMA) operation in a fast datacenter network takes approximately 2µs. An RDMA operation offloads the mechanisms of operation handling and transport reliability to a specialized hardware device. Making it a "two-sided" primitive (involving remote software rather than just remote hardware) adds several more microseconds. Dispatching overhead from a network thread to an operation thread (on a different processor) further increases latency due to processor-wakeup and kernel-scheduler activity. Using interrupt-based notification rather than spin polling adds many more microseconds. Adding a feature-filled RPC stack incurs significant software overhead in excess of tens of microseconds. Finally, using a full-fledged TCP/IP stack rather than the RDMA-based transport brings the final overhead to more than 75µs in this particular experiment.

In addition, there are other more unpredictable, and more non-intuitive, sources of overhead. For example, when an RPC reaches a server where the core is in a sleep state, additional latencies—often tens to hundreds of microseconds—might be incurred to come out of that sleep state (and potentially warm up processor caches). Likewise, various mechanisms (such as interprocessor interrupts, data copies, context switches, and core hops) all add overheads, again in the microsecond range. We have also measured standard Google debugging features degrading latency by up to tens of microseconds. Finally, queueing overheads—in the host, application, and network fabric—can all incur additional latencies, often on the order of tens to hundreds of microseconds. Some of these sources of overhead have a more severe effect on tail latency than on median latency, which can be especially problematic in distributed computations.4

Similar observations can be made about overheads for new non-volatile storage. For example, the Moneta project3 at the University of California, San Diego, discusses how the latency of access for a non-volatile memory with a baseline raw access latency of a few microseconds can increase by almost a factor of five due to different overheads across the kernel, interrupt handling, and data copying.


System designers need to rethink the hardware and software stack in the context of the microsecond challenge. System design decisions like operating system-managed threading and interrupt-based notification that were in the noise with millisecond-scale designs now have to be redesigned more carefully, and system optimizations (such as storage I/O schedulers targeted explicitly at millisecond scales) have to be rethought for the microsecond scale.

How to Waste a Fast Datacenter Processor

The other significant negative effect of microsecond-scale events is on processor efficiency. If we measure processor resource efficiency in terms of throughput for a stream of requests (requests per second), a simple queuing theory model tells us that as service time increases, throughput decreases. In the microsecond regime, when service time is composed mostly of "overhead" rather than useful computation, throughput declines precipitously.

Illustrating the effect, Figure 2 shows efficiency (fraction of achieved vs. ideal throughput) on the y-axis and service time on the x-axis. The different curves show the effect of changing "microsecond overhead" values; that is, the amount of time spent on each overhead event, with the line marked "No overhead" representing a hypothetical ideal system with zero overhead. The model assumes a simple closed queuing model with deterministic arrivals and service times, where service times represent the time between overhead events.

Figure 2. Efficiency degrades significantly at low service times due to microsecond-scale overheads. [Chart: efficiency (%) vs. service time (µs, 0–40) for per-event overheads of 1µs and 16µs, plus a "No overhead" ideal.]

As expected, for short service times, an overhead of just a single microsecond leads to a dramatic reduction in overall throughput efficiency. How likely are such small service times in the real world? Table 2 lists the service times for a production web-search workload, measuring the number of instructions between I/O events when the workload is tuned appropriately. As we move to systems that use fast flash or new non-volatile memories, service times in the range of 0.5µs to 10µs are to be expected. Microsecond overheads can significantly degrade performance in this regime.

Table 2. Service times measuring the number of instructions between I/O events for a production-quality-tuned web search workload. The flash and DRAM data are measured for real production use; the data for the new memory/storage tiers is based on detailed trace analysis.

Task: data-intensive workload    Service time (task length between I/Os)
Flash                            225K instructions = O(100µs)
Fast flash                       ~20K instructions = O(10µs)
New NVM memory                   ~2K instructions = O(1µs)
DRAM                             500 instructions = O(100ns–1µs)

At longer service times, sub-microsecond overheads are tolerable and throughput is close to the ideal. Higher overheads in the tens of microseconds, possibly from the software overheads detailed earlier, can lead to degraded performance, so system designers still need to optimize for killer microseconds.
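To give a feel for the shape of such curves, a minimal first-order sketch (not the authors' exact closed-queuing model) charges one overhead event of O microseconds per service interval of S microseconds of useful work, giving an efficiency of roughly S / (S + O):

#include <cstdio>

// First-order efficiency estimate: each service interval of 's' microseconds
// of useful work pays one overhead event of 'o' microseconds. This is an
// illustrative approximation, not the paper's exact model.
double Efficiency(double s_us, double o_us) { return s_us / (s_us + o_us); }

int main() {
  const double service_times[] = {1, 2, 5, 10, 20, 40};  // microseconds
  for (double s : service_times) {
    std::printf("S=%5.1fus  overhead 1us: %4.0f%%   overhead 16us: %4.0f%%\n",
                s, 100 * Efficiency(s, 1), 100 * Efficiency(s, 16));
  }
}

With 1µs of overhead per event, a 1µs service time runs at roughly 50% efficiency while a 40µs service time runs at roughly 98%, matching the qualitative behavior described for Figure 2.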

Other overheads go beyond the basic mechanics of accessing microsecond-scale devices. A 2015 paper summarizing a multiyear longitudinal study at Google10 showed that 20%–25% of fleet-wide processor cycles are spent on low-level overheads we call the "datacenter tax." Examples include serialization and deserialization of data, memory allocation and de-allocation, network stack costs, compression, and encryption. The datacenter tax adds to the killer microsecond challenge. A logical question is whether system designers can address reduced processor efficiency by offloading some of the overheads to a separate core or accelerator. Unfortunately, at single-digit microsecond I/O latencies, I/O operations tend to be closely coupled with the main work on the processor.

It is this frequent and closely coupled nature of these processor overheads that is even more significant, as in "death by 1,000 cuts." For example, if microsecond-scale operations are made infrequently, then conservation of processor performance may not be a concern; application threads could just busy-poll to wait for the microsecond operation to complete. Alternatively, if these operations are not coupled closely to the main computation on the processor, traditional offload engines can be used.
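As a small illustration of the busy-polling option just mentioned, the sketch below spins on a completion flag; the DeviceReadAsync interface and its 2µs latency are hypothetical stand-ins for a microsecond-scale device, not an API from this article:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Toy stand-in for a microsecond-scale device; names and the 2us latency are
// illustrative assumptions.
struct Completion { std::atomic<bool> done{false}; };

void DeviceReadAsync(uint64_t block, char* buf, Completion* c) {
  std::thread([=] {
    std::this_thread::sleep_for(std::chrono::microseconds(2));  // fake device
    buf[0] = static_cast<char>(block);                          // pretend data arrived
    c->done.store(true, std::memory_order_release);
  }).detach();
}

// Busy-polling keeps the core occupied; tolerable only if microsecond-scale
// operations are infrequent relative to other useful work on the thread.
void ReadBlocking(uint64_t block, char* buf) {
  Completion c;
  DeviceReadAsync(block, buf, &c);
  while (!c.done.load(std::memory_order_acquire)) {
    // spin (a pause/yield hint could go here); no useful work gets done
  }
}

int main() {
  char buf[512];
  ReadBlocking(7, buf);
  return buf[0] == 7 ? 0 : 1;
}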

52 COMMUNICATIONS OF THE ACM | APRIL 2017 | VOL. 60 | NO. 4 contributed articles

Synchronous/Asynchronous Programming Models

Synchronous APIs maintain the model of sequential execution, and a thread's state can be stored on the stack without explicit management from the application. This leads to code that is shorter, easier to understand, more maintainable, and potentially even more efficient. In contrast, with an asynchronous model, whenever an operation is triggered, the code must typically be split up and moved into different functions. The control flow becomes obfuscated, and it becomes the application's responsibility to manage the state between asynchronous event-handling functions that run to completion.

Consider a simple example where a function returns the number of words that appear in a given document identifier, where the implementation may store the cached documents on a microsecond-scale device. The example represented in the first code portion makes two synchronous accesses to microsecond-scale devices: the first retrieves an index location for the document, and the second uses that location to retrieve the actual document content for counting words. With a synchronous programming model, not only is this straightforward, but the model can also abstract away the device accesses altogether by providing a natural, synchronous CountWords function with familiar syntax.

// Returns the number of words found in the document specified by 'doc_id'.
int CountWords(const string& doc_id) {
  Index index;
  bool status = ReadDocumentIndex(doc_id, &index);
  if (!status) return -1;
  string doc;
  status = ReadDocument(index.location, &doc);
  if (!status) return -1;
  return CountWordsInString(doc);
}

The C++11 example in the second code portion shows an asynchronous model where a "callback" (commonly known as a "continuation") is used. When the asynchronous operation completes, the event-handling loop invokes the callback to continue the computation. Since two asynchronous operations are made, two callbacks are needed. Moreover, the CountWords API now must be asynchronous itself. And unlike the synchronous example, state between asynchronous operations must be explicitly managed and tracked on the heap rather than on a thread's stack.

// Heap-allocated state tracked between asynchronous operations.
struct AsyncCountWordsState {
  bool status;
  std::function<void(int)> done_callback;
  Index index;
  string doc;
};

// Forward declarations of the callback functions.
void ReadDocumentIndexDone(AsyncCountWordsState* state);
void ReadDocumentDone(AsyncCountWordsState* state);

// Invokes the 'done' callback, passing the number of words found in the
// document specified by 'doc_id'.
void AsyncCountWords(const string& doc_id, std::function<void(int)> done) {
  // Kick off the first asynchronous operation, and invoke
  // ReadDocumentIndexDone when it finishes. State between asynchronous
  // operations is tracked in a heap-allocated 'state' object.
  auto state = new AsyncCountWordsState();
  state->done_callback = done;
  AsyncReadDocumentIndex(doc_id, &state->status, &state->index,
                         std::bind(&ReadDocumentIndexDone, state));
}

// First callback function.
void ReadDocumentIndexDone(AsyncCountWordsState* state) {
  if (!state->status) {
    state->done_callback(-1);
    delete state;
  } else {
    // Kick off the second asynchronous operation, and invoke the
    // ReadDocumentDone function when it finishes. The 'state' object
    // is passed to the second callback for final cleanup.
    AsyncReadDocument(state->index.location, &state->status, &state->doc,
                      std::bind(&ReadDocumentDone, state));
  }
}

// Second callback function.
void ReadDocumentDone(AsyncCountWordsState* state) {
  if (!state->status) {
    state->done_callback(-1);
  } else {
    state->done_callback(CountWordsInString(state->doc));
  }
  delete state;
}

Callbacks make the flow of control explicit rather than just invoking a function and waiting for it to complete. While languages and libraries can make it somewhat easier to use callbacks and continuations (such as support for async/await, lambdas, tasks, and futures), the result remains code that is arguably messier and more difficult to understand than a simple synchronous function-calling model.

In this asynchronous example, wrapping the code inside a synchronous CountWords() function—in order to abstract away the presence of an underlying asynchronous microsecond-scale device—requires support for a wait primitive. The existing approaches for waiting on a condition invoke operating system support for thread scheduling, thereby incurring significant overhead when wait times are at the microsecond scale.
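To make the sidebar's final point concrete, one plausible wrapper (an assumption, not code from the article) builds the synchronous facade on std::promise/std::future; the blocking call in future::get() relies on operating-system thread scheduling, which is exactly the overhead that becomes hard to amortize at microsecond scale:

#include <future>
#include <string>

// Assumes the AsyncCountWords API from the sidebar above.
// The calling thread parks inside future::get(); the block/wake transitions
// can cost as much as the microsecond-scale device access they hide.
int CountWordsBlocking(const std::string& doc_id) {
  std::promise<int> result;
  std::future<int> f = result.get_future();
  AsyncCountWords(doc_id, [&result](int count) { result.set_value(count); });
  return f.get();  // wait for the callback chain to finish
}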

However, given the fine-grain and closely coupled nature of these overheads, new hardware optimizations are needed at the processor microarchitectural level to rethink the implementation and scheduling of the core functions that comprise future microsecond-scale I/O events.

Solution Directions

This evidence indicates several major classes of opportunities ahead in the upcoming "era of the killer microsecond." First, and more near-term, it is relatively easy to squander all the benefits from microsecond devices by progressively adding suboptimal supporting software not tuned for such small latencies. Computer scientists thus need to design "microsecond-aware" systems stacks. They also need to build on related work from the past five years in this area (such as Caulfield et al.,3 Nanavati et al.,12 and Ousterhout et al.14) to continue redesigning traditional low-level system optimizations—reduced lock contention and synchronization, lower-overhead interrupt handling, efficient resource utilization during spin-polling, improved job scheduling, and hardware offloading—but for the microsecond regime.

The high-performance computing industry has long dealt with low-latency networking. Nevertheless, the techniques used in supercomputers are not directly applicable to warehouse-scale computers (see Table 3). For one, high-performance systems have slower-changing workloads, and fewer programmers need to touch the code. Code simplicity and greatest programmer productivity are thus not as critical as they are in deployments like Amazon or Google where key software products are released multiple times per week. High-performance workloads also tend to have simpler and static data structures that lend themselves to simpler, faster networking. Second, the emphasis is primarily on performance (vs. performance-per-total-cost-of-ownership in large-scale Web deployments).


Consequently, they can keep processors highly underutilized when, say, blocking for MPI-style rendezvous messages. In contrast, a key emphasis in warehouse-scale computing systems is the need to optimize for low latencies while achieving greater utilizations.

Table 3. High-performance computing and warehouse-scale computing systems compared. Though high-performance computing systems are often optimized for low-latency networking, their designs and techniques are not directly applicable to warehouse-scale computers.

                      High-Performance Computing                  Warehouse-Scale Computing
Workloads             Supercomputing workloads that often         Large-scale online data-intensive
                      model the physical world; simpler,          workloads; operate on big data and
                      static data structures.                     complex dynamic data structures;
                                                                  response latency critical.
Programming           Code touched by fewer programmers;          Codebase touched by thousands of
environment           slower-changing workloads. Hardware         developers; significant software
                      concurrency visible to programmers          releases 100 times per year.
                      at compile time.                            Automatic scale-out of queries per second.
System                Focus on highest performance; recent        Focus on highest performance per dollar.
constraints           emphasis on performance per watt.           Significant effort to avoid stranding
                      Stranding of resources (such as             of resources (such as processor,
                      underutilized processors) acceptable.       memory, and power).
Reliability/          Reliability in hardware; often no           Commodity hardware; reliability across
security              long-lived mutable data; no encryption.     the stack. Encryption/authentication
                                                                  requirements.

As discussed, traditional processor optimizations to hide latency run out of instruction-level pipeline parallelism when asked to tolerate microsecond latencies. System designers need new hardware optimizations to extend the use of synchronous blocking mechanisms and thread-level parallelism to the microsecond range.

Context switching can help, albeit at the cost of increased power and latency. Prior approaches for fast context switching (such as the Denelcor HEP15 and Tera MTA1 computers) traded off single-threaded performance, giving up the latency advantages of locality and private high-level caches, and consequently have limited appeal in a broader warehouse-scale computing environment where programmers want to tolerate microsecond events with low overhead and ease of programmability. Some languages and runtimes (such as Go and Erlang) feature lightweight threads5,7 to reduce the memory and context-switch overheads associated with operating system threads, but these systems fall back to heavier-weight mechanisms when dealing with I/O. For example, the Grappa platform13 builds an efficient task scheduler and communication layer for small messages but trades off a more restricted programming environment and less-efficient performance, and also optimizes for throughput. New hardware ideas are needed to enable context switching across a large number of threads (tens to hundreds per processor, though finding the sweet spot is an open question) at extremely fast latencies (tens of nanoseconds).

Hardware innovation is also needed to help orchestrate communication with pending I/O, efficient queue management and task scheduling/dispatch, and better processor state (such as cache) management across several contexts. Ideally, future schedulers will have rich support for I/O (such as being able to park a thread based on the readiness of multiple I/O operations). For instance, facilities similar to the x86 monitor/mwait instructions(a) could allow a thread to yield until an I/O operation completes with very low overhead; meanwhile, the hardware could seamlessly schedule a different thread. To reduce overhead further, a potential hardware implementation could cache the thread's context in either the existing L1/L2/L3 hierarchy or a special-purpose context cache. Improved resource isolation and quality-of-service control in hardware will also help, as will hardware support for new instrumentation to track microsecond overheads.

(a) The x86 monitor/mwait instructions allow privileged software to wait on a single memory word.
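For illustration only, the canonical monitor/mwait parking idiom looks roughly like the sketch below; mwait is a privileged instruction on most x86 processors (newer user-level variants such as umwait relax this), so this pattern belongs in kernel or runtime code, and the completion-flag type is an assumption:

#include <atomic>
#include <pmmintrin.h>  // _mm_monitor / _mm_mwait (SSE3; compile with -msse3)

// Park the calling hardware thread until *done becomes true.
// Schematic only: illustrates the mechanism a scheduler might use, not a
// user-level library call.
void WaitForCompletion(const std::atomic<bool>* done) {
  while (!done->load(std::memory_order_acquire)) {
    // Arm the monitor on the cache line holding the flag, then re-check to
    // close the race between the load above and entering the wait state.
    _mm_monitor(static_cast<const void*>(done), 0, 0);
    if (!done->load(std::memory_order_acquire)) {
      _mm_mwait(0, 0);  // low-power wait until that line is written
    }
  }
}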
Finally, techniques to enable microsecond-scale devices should not necessarily seek to keep processor pipelines busy. One promising solution might instead be to enable a processor to stop consuming power while a microsecond-scale access is outstanding and shift that power to other cores not blocked on accesses.

Conclusion

System designers can no longer ignore efficient support for microsecond-scale I/O, as the most useful new warehouse-scale computing technologies start running at that time scale. Today's hardware and system software make an inadequate platform, particularly given that support for synchronous programming models is deemed critical for software productivity. Novel microsecond-optimized system stacks are needed, reexamining questions around appropriate layering and abstraction, control and data plane separation, and hardware/software boundaries. Such optimized designs at the microsecond scale, and correspondingly faster I/O, can in turn enable a virtuous cycle of new applications and programming models that leverage low-latency communication, dramatically increasing the effective computing capabilities of warehouse-scale computers.

Acknowledgments

We would like to thank Al Borchers, Robert Cypher, Lawrence Greenfield, Mark Hill, Urs Hölzle, Christos Kozyrakis, Noah Levine, Milo Martin, Jeff Mogul, John Ousterhout, Amin Vahdat, Sean Quinlan, and Tom Wenisch for detailed comments that improved this article. We also thank the teams at Google that build, manage, and maintain the systems that contributed to the insights we have explored here.

References
1. Alverson, R. et al. The Tera computer system. In Proceedings of the Fourth International Conference on Supercomputing (Amsterdam, The Netherlands, June 11–15). ACM Press, New York, 1990, 1–6.
2. Boost C++ Libraries. Boost asio library; http://www.boost.org/doc/libs/1_59_0/doc/html/boost_asio.html
3. Caulfield, A. et al. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 2010 IEEE/ACM International Symposium on Microarchitecture (Atlanta, GA, Dec. 4–8). IEEE Computer Society Press, 2010.
4. Dean, J. and Barroso, L.A. The tail at scale. Commun. ACM 56, 2 (Feb. 2013), 74–80.
5. Erlang. Erlang User's Guide Version 8.0. Processes; http://erlang.org/doc/efficiency_guide/processes.html
6. Fikes, A. Storage architecture and challenges. In Proceedings of the 2010 Google Faculty Summit (Mountain View, CA, July 29, 2010); http://www.systutorials.com/3306/storage-architecture-and-challenges/
7. Golang.org. Effective Go. Goroutines; https://golang.org/doc/effective_go.html#goroutines
8. Hennessy, J. and Patterson, D. Computer Architecture: A Quantitative Approach, Sixth Edition. Elsevier, Cambridge, MA, 2017.
9. Intel Newsroom. Intel and Micron produce breakthrough memory technology, July 28, 2015; http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology
10. Kanev, S. et al. Profiling a warehouse-scale computer. In Proceedings of the 42nd International Symposium on Computer Architecture (Portland, OR, June 13–17). ACM Press, New York, 2015.
11. Microsoft. Asynchronous Programming with Async and Await (C# and Visual Basic); https://msdn.microsoft.com/en-us/library/hh191443.aspx
12. Nanavati, M. et al. Non-volatile storage: Implications of the datacenter's shifting center. Commun. ACM 59, 1 (Jan. 2016), 58–63.
13. Nelson, J. et al. Latency-tolerant software distributed shared memory. In Proceedings of the USENIX Annual Technical Conference (Santa Clara, CA, July 8–10). USENIX Association, Berkeley, CA, 2015.
14. Ousterhout, J. et al. The RAMCloud storage system. ACM Transactions on Computer Systems 33, 3 (Sept. 2015), 7:1–7:55.
15. Smith, B. A pipelined shared-resource MIMD computer. Chapter in Advanced Computer Architecture. D.P. Agrawal, Ed. IEEE Computer Society Press, Los Alamitos, CA, 1986, 39–41.
16. Wikipedia.org. Google n-gram viewer; https://en.wikipedia.org/wiki/Google_Ngram_Viewer

Luiz André Barroso ([email protected]) is a Google Fellow and Vice President of Engineering at Google Inc., Mountain View, CA.

Michael R. Marty ([email protected]) is a senior staff software engineer and manager at Google Inc., Madison, WI.

David Patterson ([email protected]) is an emeritus professor at the University of California, Berkeley, and a distinguished engineer at Google Inc., Mountain View, CA.

Parthasarathy Ranganathan ([email protected]) is a principal engineer at Google Inc., Mountain View, CA.

Copyright held by the authors.

Watch the authors discuss their work in this exclusive Communications video: http://cacm.acm.org/videos/the-attack-of-the-killer-microseconds
