I/O Speculation for the Microsecond Era
Michael Wei, University of California, San Diego; Matias Bjørling and Philippe Bonnet, IT University of Copenhagen; Steven Swanson, University of California, San Diego

https://www.usenix.org/conference/atc14/technical-sessions/presentation/wei

This paper is included in the Proceedings of USENIX ATC '14: 2014 USENIX Annual Technical Conference, June 19-20, 2014, Philadelphia, PA. ISBN 978-1-931971-10-2. Open access to the Proceedings is sponsored by USENIX.

Abstract

Microsecond latencies and access times will soon dominate most datacenter I/O workloads, thanks to improvements in both storage and networking technologies. Current techniques for dealing with I/O latency are targeted at either very fast (nanosecond) or slow (millisecond) devices. These techniques are suboptimal for microsecond devices: they either block the processor for tens of microseconds or yield the processor only to be ready again microseconds later. Speculation is an alternative technique that resolves the issues of yielding and blocking by enabling an application to continue running until it produces an externally visible side effect. State-of-the-art techniques for speculating on I/O requests involve checkpointing, which can take up to a millisecond, squandering any of the performance benefits microsecond-scale devices have to offer. In this paper, we survey how speculation can address the challenges that microsecond-scale devices will bring. We measure applications for the potential benefit to be gained from speculation and examine several classes of speculation techniques. In addition, we propose two new techniques: hardware checkpoint and checkpoint-free speculation. Our exploration suggests that speculation will enable systems to extract the maximum performance of I/O devices in the microsecond era.

Device                        Read      Write
Millisecond Scale
  10G Intercontinental RPC    100 ms    100 ms
  10G Intracontinental RPC    20 ms     20 ms
  Hard Disk                   10 ms     10 ms
  10G Interregional RPC       1 ms      1 ms
Microsecond Scale
  10G Intraregional RPC       300 µs    300 µs
  SATA NAND SSD               200 µs    50 µs
  PCIe/NVMe NAND SSD          60 µs     15 µs
  10Ge Inter-Datacenter RPC   10 µs     10 µs
  40Ge Inter-Datacenter RPC   5 µs      5 µs
  PCM SSD                     5 µs      5 µs
Nanosecond Scale
  40 Gb Intra-Rack RPC        100 ns    100 ns
  DRAM                        10 ns     10 ns
  STT-RAM                     <10 ns    <10 ns

Table 1: I/O device latencies. Typical random read and write latencies for a variety of I/O devices. The majority of I/Os in the datacenter will be in the microsecond range.

1 Introduction

We are at the dawn of the microsecond era: current state-of-the-art NAND-based Solid State Disks (SSDs) offer latencies in the sub-100µs range at reasonable cost [16, 14]. At the same time, improvements in network software and hardware have brought network latencies closer to their physical limits, enabling sub-100µs communication latencies. The net result of these developments is that the datacenter will soon be dominated by microsecond-scale I/O requests.

Today, an operating system uses one of two options when an application makes an I/O request: either it can block and poll for the I/O to complete, or it can complete the I/O asynchronously by placing the request in a queue and yielding the processor to another thread or application until the I/O completes. Polling is an effective strategy for devices with submicrosecond latency [2, 20], while programmers have used yielding and asynchronous I/O completion for decades on devices with millisecond latencies, such as disk. Neither of these strategies, however, is a perfect fit for microsecond-scale I/O requests: blocking will prevent the processor from doing work for tens of microseconds, while yielding may reduce performance by increasing the overhead of each I/O operation.

A third option exists as a solution for dispatching I/O requests: speculation. Under the speculation strategy, the operating system completes I/O operations speculatively, returning control to the application without yielding. The operating system monitors the application: in the case of a write operation, the operating system blocks the application if it produces a side effect, and in the case of a read operation, the operating system blocks the application if it attempts to use data that the OS has not read yet. In addition, the operating system may have a mechanism to roll back if the I/O operation does not complete successfully. By speculating, an application can continue to do useful work even if the I/O has not completed. In the context of microsecond-scale I/O, speculation can be extremely valuable since, as we discuss in the next section, there is often enough work available to hide microsecond latencies. We expect that storage class memories, such as phase-change memory (PCM), will especially benefit from speculation since their access latencies are unpredictable and variable [12].

Any performance benefit to be gained from speculation is dependent upon the performance overhead of speculating. Previous work in I/O speculation [9, 10] has relied on checkpointing to enable rollback in case of write failure. Even lightweight checkpointing, which utilizes copy-on-write techniques, has a significant overhead that can exceed the access latency of microsecond devices.

In this paper, we survey speculation in the context of microsecond-scale I/O devices, and attempt to quantify the performance gains that speculation has to offer. We then explore several techniques for speculation, including existing software-based checkpointing techniques. We also propose new techniques that exploit the semantics of the traditional I/O interface. We find that while speculation could allow us to maximize the performance of microsecond-scale devices, current techniques for speculation cannot deliver the performance that microsecond-scale devices require.

2 Background

Past research has shown that current systems have built in the assumption that I/O is dominated by millisecond-scale requests [17, 2]. These assumptions have impacted the core design of the applications and operating systems we use today, and may not be valid in a world where I/O is an order of magnitude faster. In this section, we discuss the two major strategies for handling I/O, show that they do not adequately address the needs of microsecond-scale devices, and give an overview of I/O speculation.

2.1 Interfaces versus Strategies

When an application issues a request for an I/O, it uses an interface to make that request. A common example is the POSIX write/read interface, where applications make I/O requests by issuing blocking calls. Another example is the POSIX asynchronous I/O interface, in which applications enqueue requests to complete asynchronously and retrieve the status of their completion at some later time.

Contrast interfaces with strategies, which refer to how the operating system actually handles I/O requests. For example, even though the write interface is blocking, the operating system may choose to handle the I/O asynchronously, yielding the processor to some other thread.

In this work, we primarily discuss operating system strategies for handling I/O requests, and assume that application developers are free to choose interfaces.

2.2 Asynchronous I/O - Yielding

Yielding, or the asynchronous I/O strategy, follows the traditional pattern for handling I/O requests within the operating system: when a userspace thread issues an I/O request, the I/O subsystem issues the request and the scheduler places the thread in an I/O wait state. Once the I/O device completes the request, it informs the operating system, usually by means of a hardware interrupt, and the operating system then places the thread into a ready state, which enables the thread to resume when it is rescheduled.

Yielding has the advantage of allowing other tasks to utilize the CPU while the I/O is being processed. However, yielding introduces significant overhead, which is particularly relevant for fast I/O devices [2, 20]. For example, yielding introduces context switches, cache and TLB pollution, as well as interrupt overhead that may exceed the cost of doing the I/O itself. These overheads are typically in the microsecond range, which makes the cost of yielding minimal when dealing with millisecond latencies, as with disks and slow WANs, but high when dealing with nanosecond devices, such as fast NVMs.

2.3 Synchronous I/O - Blocking

Blocking, or the synchronous I/O strategy, is a solution for dealing with devices like fast NVMs. Instead of yielding the CPU in order for I/O to complete, blocking prevents unnecessary context switches by having the application poll for I/O completions, keeping the entire context of execution within the executing thread. Typically, the application stays in a spin-wait loop until the I/O completes, and resumes execution once the device flags the I/O as complete.

Blocking prevents the CPU from incurring the cost of context switches, cache and TLB pollution, as well as the interrupt overhead that the yielding strategy incurs.
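The spin-wait loop at the heart of the blocking strategy can be sketched in a few lines. The following is a toy Python model, not a real device driver: FakeDevice, its latency, and the poll/result methods are illustrative assumptions standing in for a device's completion flag.

```python
import threading
import time

class FakeDevice:
    """Toy I/O device: a request completes some time after it is issued."""
    def __init__(self, latency_s):
        self.latency_s = latency_s
        self._done = threading.Event()
        self._data = None

    def issue_read(self, data):
        self._done.clear()
        def complete():
            time.sleep(self.latency_s)   # models the device access time
            self._data = data
            self._done.set()             # device flags the I/O as complete
        threading.Thread(target=complete, daemon=True).start()

    def poll(self):
        return self._done.is_set()

    def result(self):
        return self._data

def blocking_read(device, data):
    """Synchronous strategy: issue the I/O, then spin-wait until the
    device flags completion -- no context switch, no interrupt."""
    device.issue_read(data)
    while not device.poll():   # spin-wait loop; the CPU stalls here
        pass
    return device.result()

dev = FakeDevice(latency_s=0.001)     # pretend millisecond-scale device
print(blocking_read(dev, b"hello"))   # prints b'hello'
```

The spin-wait keeps the entire context of execution within the issuing thread, which is exactly the trade-off the surrounding text describes: no scheduling overhead, but no useful work during the wait.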
However, the CPU is stalled for the amount of time the I/O takes to complete. If the I/O is fast, then this strategy is optimal, since the amount of time spent waiting is much shorter than the amount of CPU time lost due to software overheads.

[Figure: timeline of the yielding strategy, in which a second application (app2) runs while the issuing thread waits, with scheduling (t_sched), I/O start and completion (t_io_start, t_io_comp), and device time (t_device) marked.]
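The speculation strategy described in the introduction can also be modeled in miniature. The sketch below is a toy Python illustration of the idea, not the paper's mechanism: SpeculativeFile, output, and the chosen latency are invented names. A write returns control immediately, and the application is forced to wait only when it attempts an externally visible side effect while speculative I/O is still in flight.

```python
import threading
import time

class SpeculativeFile:
    """Toy model of speculative write handling."""
    def __init__(self, latency_s):
        self.latency_s = latency_s
        self._pending = []      # in-flight speculative writes
        self.committed = []     # durable state once the device completes

    def write(self, data):
        done = threading.Event()
        def complete():
            time.sleep(self.latency_s)   # models the device write latency
            self.committed.append(data)
            done.set()
        threading.Thread(target=complete, daemon=True).start()
        self._pending.append(done)
        return len(data)        # speculative success: control returns now

    def _wait_for_pending(self):
        # Block until every speculative I/O has resolved.
        for done in self._pending:
            done.wait()
        self._pending.clear()

    def output(self, msg, sink):
        # An externally visible side effect must not escape while a
        # speculative write could still fail, so block here first.
        self._wait_for_pending()
        sink.append(msg)

f = SpeculativeFile(latency_s=0.001)
f.write(b"record-1")        # returns immediately; no yield, no spin
work = sum(range(1000))     # useful work overlapped with the I/O
sink = []
f.output("done", sink)      # the side effect forces the wait
```

A full implementation would also need the rollback path the text mentions, undoing the application's speculative progress if the write ultimately fails; this sketch shows only the overlap-until-side-effect discipline.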