
A Portable Kernel Abstraction For Low-Overhead Ephemeral Mapping Management

Khaled Elmeleegy, Anupam Chanda, and Alan L. Cox
Department of Computer Science, Rice University, Houston, Texas 77005, USA
{kdiaa,anupamc,alc}@cs.rice.edu

Willy Zwaenepoel
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
[email protected]

Abstract

Modern operating systems create ephemeral virtual-to-physical mappings for a variety of purposes, ranging from the implementation of interprocess communication to the implementation of process tracing and debugging. With succeeding generations of processors the cost of creating ephemeral mappings is increasing, particularly when an ephemeral mapping is shared by multiple processors.

To reduce the cost of ephemeral mapping management within an operating system kernel, we introduce the sf_buf ephemeral mapping interface. We demonstrate how in several kernel subsystems — including pipes, memory disks, sockets, execve(), ptrace(), and the vnode pager — the current implementation can be replaced by calls to the sf_buf interface.

We describe the implementation of the sf_buf interface on the 32-bit i386 architecture and the 64-bit amd64 architecture. This implementation reduces the cost of ephemeral mapping management by reusing wherever possible existing virtual-to-physical address mappings. We evaluate the sf_buf interface for the pipe, memory disk and networking subsystems. Our results show that these subsystems perform significantly better when using the sf_buf interface. On a multiprocessor platform interprocessor interrupts are greatly reduced in number or eliminated altogether.

1 Introduction

Modern operating systems create ephemeral virtual-to-physical mappings for a variety of purposes, ranging from the implementation of interprocess communication to the implementation of process tracing and debugging. To create an ephemeral mapping two actions are required: the allocation of a temporary kernel virtual address and the modification of the virtual-to-physical address mapping. To date, these actions have been performed through separate interfaces. This paper demonstrates the benefits of combining these actions under a single interface.

This work is motivated by the increasing cost of ephemeral mapping creation, particularly the increasing cost of modifications to the virtual-to-physical mapping. To see this trend, consider the latency in processor cycles of the invlpg instruction across several generations of the IA32 architecture. This instruction invalidates the Translation Look-aside Buffer (TLB) entry for the given virtual address. In general, the operating system must issue this instruction when it changes a virtual-to-physical mapping. When this instruction was introduced in the 486, it took 12 cycles to execute. In the Pentium, its latency increased to 25 cycles. In the Pentium III, its latency increased to ∼100 cycles. Finally, in the Pentium 4, its latency has reached ∼500 to ∼1000 cycles. So, despite a factor of three decrease in the cycle time between a high-end Pentium III and a high-end Pentium 4, the cost of a mapping change measured in wall clock time has actually increased.
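As an illustration of this wall-clock claim, assume clock rates of roughly 1 GHz for a high-end Pentium III and 3 GHz for a high-end Pentium 4; these frequencies are an assumption consistent with the factor-of-three difference in cycle time, not figures stated in the text:

    t(PIII) ≈ 100 cycles / 1 GHz = 100 ns
    t(P4)   ≈ 500 to 1000 cycles / 3 GHz ≈ 170 to 330 ns

Even at the low end of the Pentium 4's latency range, the invalidation takes longer in absolute time than it does on the Pentium III.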

Furthermore, on a multiprocessor, the cost of ephemeral mapping creation can be significantly higher if the mapping is shared by two or more processors. Unlike data coherence, TLB coherence is generally implemented in software by the operating system [2, 12]: The processor initiating a mapping change issues an interprocessor interrupt (IPI) to each of the processors that share the mapping; the interrupt handler that is executed by each of these processors includes an instruction, such as invlpg, that invalidates that processor's TLB entry for the mapping's virtual address. Consequently, a mapping change is quite costly for all processors involved.

In the past, TLB coherence was only an issue for multiprocessors. Today, however, some implementations of Simultaneous Multi-Threading (SMT), such as the Pentium 4's, require the operating system to implement TLB coherence in a single-processor system.

To reduce the cost and complexity of ephemeral mapping management within an operating system kernel, we introduce the sf_buf ephemeral mapping interface. Like Mach's pmap interface [11], our objective is to provide a machine-independent interface enabling variant, machine-specific implementations. Unlike pmap, our sf_buf interface supports allocation of temporary kernel virtual addresses. We describe how various subsystems in the operating system kernel benefit from the sf_buf interface.

We present the implementation of the sf_buf interface on two representative architectures, i386, a 32-bit architecture, and amd64, a 64-bit architecture. This implementation is efficient: it performs creation and destruction of ephemeral mappings in O(1) expected time on i386 and O(1) time on amd64. The sf_buf interface enables the automatic reuse of ephemeral mappings so that the high cost of mapping changes can be amortized over several uses. In addition, this implementation of the sf_buf interface incorporates several techniques for avoiding TLB coherence operations, eliminating the need for costly IPIs.

We have evaluated the performance of the pipe, memory disk and networking subsystems using the sf_buf interface. Our results show that these subsystems benefit significantly from its use. For the bw_pipe program from the lmbench benchmark [10] the sf_buf interface improves performance by up to 168% on one of our test platforms. In all of our experiments the number of TLB invalidations is greatly reduced or eliminated.

The rest of the paper is organized as follows. The next two sections motivate this work from two different perspectives: First, Section 2 describes the many uses of ephemeral mappings in an operating system kernel; second, Section 3 presents the execution costs for the machine-level operations used to implement ephemeral mappings. We define the sf_buf interface and its implementation on two representative architectures in Section 4. Section 5 summarizes the lines-of-code reduction in an operating system kernel from using the sf_buf interface. Section 6 presents an experimental evaluation of the sf_buf interface. We present related work in Section 7 and conclude in Section 8.

2 Ephemeral Mapping Usage

We use FreeBSD 5.3 as an example to demonstrate the use of ephemeral mappings. FreeBSD 5.3 uses ephemeral mappings in a wide variety of places, including the implementation of pipes, memory disks, sendfile(), sockets, execve(), ptrace(), and the vnode pager.

2.1 Pipes

Conventional implementations of Unix pipes perform two copy operations to transfer the data from the writer to the reader. The writer copies the data from the source buffer in its user address space to a buffer in the kernel address space, and the reader later copies this data from the kernel buffer to the destination buffer in its user address space.

In the case of large data transfers that fill the pipe and block the writer, FreeBSD uses ephemeral mappings to eliminate the copy operation by the writer, reducing the number of copy operations from two to one. The writer first determines the set of physical pages underlying the source buffer, then wires each of these physical pages, disabling their replacement or page-out, and finally passes the set to the receiver through the object implementing the pipe. Later, the reader obtains the set of physical pages from the pipe object. For each physical page, it creates an ephemeral mapping that is private to the current CPU and is not used by other CPUs. Henceforth, we refer to this kind of mapping as a CPU-private ephemeral mapping. The reader then copies the data from the kernel virtual address provided by the ephemeral mapping to the destination buffer in its user address space, destroys the ephemeral mapping, and unwires the physical page, reenabling its replacement or page-out.

2.2 Memory Disks

Memory disks have a pool of physical pages. To read from or write to a memory disk, a CPU-private ephemeral mapping for the desired pages of the memory disk is created. Then the data is copied between the ephemerally mapped pages and the read/write buffer provided by the user. After the read or write operation completes, the ephemeral mapping is freed.

2.3 sendfile(2) and Sockets

The zero-copy sendfile(2) system call and zero-copy socket send use ephemeral mappings in a similar way. For zero-copy send the kernel wires the physical pages corresponding to the user buffer in memory and then creates ephemeral mappings for them. For sendfile() it does the same for the pages of the file. The ephemeral mappings persist until the corresponding mbuf chain is freed, e.g., when TCP acknowledgments are received. The kernel then frees the ephemeral mappings and unwires the corresponding physical pages. These ephemeral mappings are not CPU-private because they need to be shared among all the CPUs — any CPU may use the mappings to retransmit the pages.

Zero-copy socket receive uses ephemeral mappings to implement a form of page remapping from the kernel to the user address space [6, 4, 8]. Specifically, the kernel allocates a physical page, creates an ephemeral mapping to it, and injects the physical page and its ephemeral mapping into the network stack at the device driver. After the network interface has stored data into the physical page, the physical page and its mapping are passed upward through the network stack. Ultimately, when an application asks to receive this data, the kernel determines if the application's buffer is appropriately aligned and sized so that the kernel can avoid a copy by replacing the application's current physical page with its own. If so, the application's current physical page is freed, the kernel's physical page replaces it in the application's address space, and the ephemeral mapping is destroyed. Otherwise, the ephemeral mapping is used by the kernel to copy the data from its physical page to the application's.

2.4 execve(2)

The execve(2) system call transforms the calling process into a new process. The new process is constructed from the given file. This file is either an executable or data for an interpreter, such as a shell. If the file is an executable, FreeBSD's implementation of execve(2) uses the ephemeral mapping interface to access the image header describing the executable.

2.5 ptrace(2)

The ptrace(2) system call enables one process to trace or debug another process. It includes the capability to read or write the memory of the traced process. To read from or write to the traced process's memory, the kernel creates CPU-private ephemeral mappings for the desired physical pages of the traced process. The kernel then copies the data between the ephemerally mapped pages and the buffer provided by the tracing process. The kernel then frees the ephemeral mappings.

2.6 Vnode Pager

The vnode pager creates ephemeral mappings to carry out I/O. These ephemeral mappings are not CPU-private. They are used for paging to and from file systems with small block sizes.

3 Cost of Ephemeral Mappings

We focus on the hardware trends that motivate the need for the sf_buf interface. In particular, we measure the costs of local and remote TLB invalidations in modern processors. The act of invalidating an entry from a processor's own TLB is called a local TLB invalidation. A remote TLB invalidation, also referred to as TLB shoot-down, is the act of a processor initiating invalidation of an entry from another processor's TLB. When an entry is invalidated from all TLBs in a multiprocessor environment, it is called a global TLB invalidation.

We examine two microbenchmarks: one to measure the cost of a local TLB invalidation and another to measure the cost of a remote TLB invalidation. We modify the kernel to add a custom system call that implements these microbenchmarks. For local invalidation, the system call invalidates a page mapping from the local TLB 100,000 times. For remote invalidation, IPIs are sent to invalidate the TLB entry of the remote CPUs. The remote invalidation is also repeated 100,000 times in the experiment. We perform this experiment on the Pentium Xeon processor and the Opteron processor. The Xeon is an i386 processor while the Opteron is an amd64 processor. The Xeon processor implements SMT and has two virtual processors. The Opteron machine has two physical processors. The Xeon operates at 2.4 GHz while the Opteron operates at 1.6 GHz.
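To clarify the structure of these microbenchmarks, the following user-space sketch shows the cycle-counting loop only. The actual measurements run inside a custom system call because invlpg and IPI delivery are privileged operations; measured_op() below is a hypothetical stand-in for the invalidation being timed.

/*
 * Illustrative cycle-counting harness (user space, x86). The real
 * microbenchmark executes invlpg, or triggers a remote shoot-down IPI,
 * inside the kernel; measured_op() is a hypothetical placeholder.
 */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>               /* __rdtsc() */

#define ITERATIONS 100000

static volatile int sink;

static void measured_op(void)
{
        sink++;                      /* stand-in for the timed invalidation */
}

int main(void)
{
        uint64_t start = __rdtsc();
        for (int i = 0; i < ITERATIONS; i++)
                measured_op();
        uint64_t cycles = __rdtsc() - start;
        printf("average cost: %.1f cycles per operation\n",
               (double)cycles / ITERATIONS);
        return 0;
}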

For the Xeon the cost of a local TLB invalidation is around 500 CPU cycles when the page table entry (PTE) resides in the data cache, and about 1,000 cycles when it does not. On a Xeon machine with a single physical processor but two virtual processors, the CPU initiating a remote invalidation has to wait for about 4,000 CPU cycles until the remote TLB invalidation completes. On a Xeon machine with two physical processors and four virtual processors, that time increases to about 13,500 CPU cycles.

For the Opteron a local TLB invalidation costs around 95 CPU cycles when the PTE exists in the data cache, and 320 cycles when it does not. Remote TLB invalidations on an Opteron machine with two physical processors cost about 2,030 CPU cycles.

4 Ephemeral Mapping Management

We first present the ephemeral mapping interface. Then, we describe two distinct implementations on representative architectures, i386 and amd64, emphasizing how each implementation is optimized for its underlying architecture. This section concludes with a brief characterization of the implementations on the three other architectures supported by FreeBSD 5.3.

4.1 Interface

The ephemeral mapping management interface consists of four functions that either return an ephemeral mapping object or require one as a parameter. These functions are sf_buf_alloc(), sf_buf_free(), sf_buf_kva(), and sf_buf_page(). Table 1 shows the full signature for each of these functions. The ephemeral mapping object is entirely opaque; none of its fields are public. For historical reasons, the ephemeral mapping object is called an sf_buf.

sf_buf_alloc() returns an sf_buf for the given physical page. A physical page is represented by an object called a vm_page. An implementation of sf_buf_alloc() may, at its discretion, return the same sf_buf to multiple callers if they are mapping the same physical page. In general, the advantages of shared sf_bufs are (1) that fewer virtual-to-physical mapping changes occur and (2) that less kernel virtual address space is used. The disadvantage is the added complexity of reference counting. The flags argument to sf_buf_alloc() is either 0 or one or more of the following values combined with bitwise or:

• "private", denoting that the mapping is for the private use of the calling thread;

• "no wait", denoting that sf_buf_alloc() must not sleep if it is unable to allocate an sf_buf at the present time; instead, it may return NULL; by default, sf_buf_alloc() sleeps until an sf_buf becomes available for allocation;

• "interruptible", denoting that the sleep by sf_buf_alloc() should be interruptible by a signal; if sf_buf_alloc()'s sleep is interrupted, it may return NULL.

If the "no wait" option is given, then the "interruptible" option has no effect. The "private" option enables some implementations, such as the one for i386, to reduce the cost of virtual-to-physical mapping changes. For example, the implementation may avoid remote TLB invalidation. Several uses of this option are described in Section 2 and evaluated in Section 6.

sf_buf_free() frees an sf_buf when its last reference is released.

sf_buf_kva() returns the kernel virtual address of the given sf_buf.

sf_buf_page() returns the physical page that is mapped by the given sf_buf.

4.2 i386 Implementation

Conventionally, the i386's 32-bit virtual address space is split into user and kernel spaces to avoid the overhead of a TLB flush on entry to and exit from the kernel. Commonly, the split is 3GB for the user space and 1GB for the kernel space. In the past, when physical memories were much smaller than the kernel space, a fraction of the kernel space would be dedicated to a permanent one-to-one, virtual-to-physical mapping for the machine's entire physical memory. Today, however, i386 machines frequently have physical memories in excess of their kernel space, making such a direct mapping an impossibility.

To accommodate machines with physical memories in excess of their kernel space, the i386 implementation allocates a configurable amount of the kernel space and uses it to implement a virtual-to-physical mapping cache that is indexed by the physical page.

struct sf_buf *  sf_buf_alloc(struct vm_page *page, int flags)
void             sf_buf_free(struct sf_buf *mapping)
vm_offset_t      sf_buf_kva(struct sf_buf *mapping)
struct vm_page * sf_buf_page(struct sf_buf *mapping)

Table 1: Ephemeral Mapping Interface
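To illustrate how a kernel subsystem is expected to use these four functions, the following self-contained user-space mock shows the allocate/use/release pattern. The types, function bodies, the simplified return type of sf_buf_kva(), and the flag name SF_PRIVATE are stand-ins for illustration only; they are not FreeBSD's implementation.

/*
 * User-space mock of the sf_buf call pattern from Table 1. Only the
 * pattern is meaningful: map a physical page, obtain its kernel virtual
 * address, access the data, and release the mapping. The flag name and
 * all bodies are hypothetical; sf_buf_kva() is simplified to return a
 * char * instead of vm_offset_t.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE  4096
#define SF_PRIVATE 0x1                       /* hypothetical "private" flag */

struct vm_page { char data[PAGE_SIZE]; };    /* stand-in for a physical page */
struct sf_buf  { struct vm_page *m; };

struct sf_buf *sf_buf_alloc(struct vm_page *page, int flags)
{
        struct sf_buf *sf = malloc(sizeof(*sf));
        (void)flags;                         /* the mock ignores the options */
        sf->m = page;
        return sf;
}

void sf_buf_free(struct sf_buf *sf)            { free(sf); }
char *sf_buf_kva(struct sf_buf *sf)            { return sf->m->data; }
struct vm_page *sf_buf_page(struct sf_buf *sf) { return sf->m; }

int main(void)
{
        struct vm_page page;
        struct sf_buf *sf = sf_buf_alloc(&page, SF_PRIVATE);

        memcpy(sf_buf_kva(sf), "hello", 6);  /* access the page via its mapping */
        printf("%s\n", sf_buf_kva(sf));
        sf_buf_free(sf);                     /* drop the last reference */
        return 0;
}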

In other words, an access to this cache provides a physical page and receives a kernel virtual address for accessing the provided physical page. An access is termed a cache hit if the physical page has an existing virtual-to-physical mapping in the cache. An access is termed a cache miss if the physical page does not have a mapping in the cache and one must be created.

The implementation of the mapping cache consists of two structures containing sf_bufs: (1) a hash table of valid sf_bufs that is indexed by physical page and (2) an inactive list of unused sf_bufs that is maintained in least-recently-used order. An sf_buf can appear in both structures simultaneously. In other words, an unused sf_buf may still represent a valid mapping.

Figure 1 defines the i386 implementation of the sf_buf. It consists of six fields: an immutable virtual address, a pointer to a physical page, a reference count, a pointer used to implement a hash chain, a pointer used to implement the inactive list, and a CPU mask used for optimizing CPU-private mappings. An sf_buf represents a valid mapping if and only if the pointer to a physical page is valid, i.e., it is not NULL. An sf_buf is on the inactive list if and only if the reference count is zero.

The hash table and inactive list of sf_bufs are initialized during kernel initialization. The hash table is initially empty. The inactive list is filled as follows: A range of kernel virtual addresses is allocated by the ephemeral mapping module; for each virtual page in this range, an sf_buf is created, its virtual address initialized, and inserted into the inactive list.

The first action by the i386 implementation of sf_buf_alloc() is to search the hash table for an sf_buf mapping the given physical page. If one is found, then the next two actions are determined by that sf_buf's cpumask. First, if the executing processor does not appear in the cpumask, a local TLB invalidation is performed and the executing processor is added to the cpumask. Second, if the given flags do not include "private" and the cpumask does not include all processors, a remote TLB invalidation is issued to those processors missing from the cpumask and those processors are added to the cpumask. The final three actions by sf_buf_alloc() are (1) to remove the sf_buf from the inactive list if its reference count is zero, (2) to increment its reference count, and (3) to return the sf_buf.

If, however, an sf_buf mapping the given page is not found in the hash table by sf_buf_alloc(), the least recently used sf_buf is removed from the inactive list. If the inactive list is empty and the given flags include "no wait", sf_buf_alloc() returns NULL. If the inactive list is empty and the given flags do not include "no wait", sf_buf_alloc() sleeps until an inactive sf_buf becomes available. If sf_buf_alloc()'s sleep is interrupted because the given flags include "interruptible", sf_buf_alloc() returns NULL.

Once an inactive sf_buf is acquired by sf_buf_alloc(), it performs the following five actions. First, if the inactive sf_buf represents a valid mapping, specifically, if it has a valid physical page pointer, then it must be removed from the hash table. Second, the sf_buf's physical page pointer is assigned the given physical page, the sf_buf's reference count is set to one, and the sf_buf is inserted into the hash table. Third, the page table entry for the sf_buf's virtual address is changed to map the given physical page. Fourth, TLB invalidations are issued and the cpumask is set. Both of these operations depend on the state of the old page table entry's accessed bit and the mapping options given. If the old page table entry's accessed bit was clear, then the mapping cannot possibly be cached by any TLB. In this case, no TLB invalidations are issued and the cpumask is set to include all processors. If, however, the old page table entry's accessed bit was set, then the mapping options determine the action taken. If the given flags include "private", then a local TLB invalidation is performed and the cpumask is set to contain the executing processor. Otherwise, a global TLB invalidation is performed and the cpumask is set to include all processors. Finally, the sf_buf is returned.
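The following self-contained, single-threaded sketch models the lookup and reuse logic just described: a hash table of valid mappings keyed by physical page plus a least-recently-used list of inactive sf_bufs. It deliberately omits the page table update, TLB invalidations, the cpumask, sleeping, and locking, and uses an integer page number in place of a vm_page pointer; it is an illustration of the algorithm, not FreeBSD's code.

/*
 * Simplified model of the i386 mapping cache: a hash table of valid
 * sf_bufs keyed by physical page plus an LRU list of unused
 * (ref_count == 0) sf_bufs.
 */
#include <stdio.h>
#include <stddef.h>

#define NSFBUFS 4
#define HASHSZ  8

struct sf_buf {
        struct sf_buf *hash_next;   /* hash chain (valid mappings) */
        struct sf_buf *lru_next;    /* inactive list, LRU at the head */
        int            page;        /* stands in for struct vm_page * */
        int            ref_count;
};

static struct sf_buf pool[NSFBUFS];
static struct sf_buf *hash[HASHSZ];
static struct sf_buf *inactive;     /* head = least recently used */

static void inactive_insert(struct sf_buf *sf)   /* append at tail (MRU) */
{
        struct sf_buf **pp = &inactive;
        while (*pp) pp = &(*pp)->lru_next;
        sf->lru_next = NULL;
        *pp = sf;
}

static void hash_remove(struct sf_buf *sf)
{
        struct sf_buf **pp = &hash[sf->page % HASHSZ];
        while (*pp != sf) pp = &(*pp)->hash_next;
        *pp = sf->hash_next;
}

struct sf_buf *sf_buf_alloc(int page)
{
        /* 1. Reuse an existing mapping for this page if one is cached. */
        for (struct sf_buf *sf = hash[page % HASHSZ]; sf; sf = sf->hash_next)
                if (sf->page == page) {
                        if (sf->ref_count++ == 0) {
                                /* unlink the sf_buf from the inactive list */
                                struct sf_buf **pp = &inactive;
                                while (*pp != sf) pp = &(*pp)->lru_next;
                                *pp = sf->lru_next;
                        }
                        return sf;
                }
        /* 2. Otherwise evict the least recently used inactive sf_buf. */
        struct sf_buf *sf = inactive;
        if (sf == NULL)
                return NULL;        /* "no wait" behaviour; the real code may sleep */
        inactive = sf->lru_next;
        if (sf->page >= 0)
                hash_remove(sf);    /* it still held a valid old mapping */
        sf->page = page;            /* here the real code rewrites the PTE */
        sf->ref_count = 1;
        sf->hash_next = hash[page % HASHSZ];
        hash[page % HASHSZ] = sf;
        return sf;
}

void sf_buf_free(struct sf_buf *sf)
{
        if (--sf->ref_count == 0)
                inactive_insert(sf); /* mapping stays valid for later reuse */
}

int main(void)
{
        for (int i = 0; i < NSFBUFS; i++) {
                pool[i].page = -1;   /* -1 marks "never mapped" */
                inactive_insert(&pool[i]);
        }
        struct sf_buf *a = sf_buf_alloc(42);
        sf_buf_free(a);
        struct sf_buf *b = sf_buf_alloc(42); /* cache hit: same sf_buf returns */
        printf("reused: %s\n", a == b ? "yes" : "no");
        return 0;
}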

The implementation of sf_buf_free() decrements the sf_buf's reference count, inserting the sf_buf into the free list if the reference count becomes zero. When an sf_buf is inserted into the free list, a sleeping sf_buf_alloc() is awakened.

The implementations of sf_buf_kva() and sf_buf_page() return the corresponding field from the sf_buf.

4.3 amd64 Implementation

The amd64 implementation of the ephemeral mapping interface is trivial because of this architecture's 64-bit virtual address space.

During kernel initialization, a permanent, one-to-one, virtual-to-physical mapping is created within the kernel's virtual address space for the machine's entire physical memory using 2MB superpages. Also, by design, the inverse of this mapping is trivially computed, using a single arithmetic operation. This mapping and its inverse are used to implement the ephemeral mapping interface. Because every physical page has a permanent kernel virtual address, there is no recurring virtual address allocation overhead associated with this implementation. Because this mapping is trivially invertible, mapping a physical page back to its kernel virtual address is easy. Because this mapping is permanent there is never a TLB invalidation.

In this implementation, the sf_buf is simply an alias for the vm_page; in other words, an sf_buf pointer references a vm_page. Consequently, the implementations of sf_buf_alloc() and sf_buf_page() are nothing more than cast operations evaluated at compile-time: sf_buf_alloc() casts the given vm_page pointer to the returned sf_buf pointer; conversely, sf_buf_page() casts the given sf_buf pointer to the returned vm_page pointer. Furthermore, none of the mapping options given by the flags passed to sf_buf_alloc() requires any action by this implementation: it never performs a remote TLB invalidation so distinct handling for "private" mappings serves no purpose; it never blocks so "interruptible" and "no wait" mappings require no action. The implementation of sf_buf_free() is the empty function. The only function to have a non-trivial implementation is sf_buf_kva(): It casts an sf_buf pointer to a vm_page pointer, dereferences that pointer to obtain the vm_page's physical address, and applies the inverse direct mapping to that physical address to obtain a kernel virtual address.

4.4 Implementations For Other Architectures

The implementations for alpha and ia64 are identical to that of amd64. Although the sparc64 architecture has a 64-bit virtual address space, its virtually-indexed and virtually-tagged cache for instructions and data complicates the implementation. If two or more virtual-to-physical mappings for the same physical page exist, then to maintain cache coherence either the virtual addresses must have the same color, meaning they conflict with each other in the cache, or else caching must be disabled for all mappings to the physical page [5]. To make the best of this, the sparc64 implementation is, roughly speaking, a hybrid of the i386 and amd64 implementations: The permanent, one-to-one, virtual-to-physical mapping is used when its color is compatible with the color of the user-level address space mappings for the physical page. Otherwise, the permanent, one-to-one, virtual-to-physical mapping cannot be used, so a virtual address of a compatible color is allocated from a free list and managed through a dictionary as in the i386 implementation.

5 Using the sf_buf Interface

Of the places where the FreeBSD kernel utilizes ephemeral mappings, only three were non-trivially affected by the conversion from the original implementation to the sf_buf-based implementation: The conversion of pipes eliminated 42 lines of code; the conversion of zero-copy receive eliminated 306 lines of code; and the conversion of the vnode pager eliminated 18 lines of code. Most of the eliminated code was for the allocation of temporary virtual addresses. For example, to minimize the overhead of allocating temporary virtual addresses, each pipe maintained its own, private cache of virtual addresses that were obtained from the kernel's general-purpose allocator.
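The direct-map arithmetic used by sf_buf_kva() on amd64 can be illustrated with a few lines of self-contained C; the base address and helper names below are invented for illustration and are not FreeBSD's actual constants or macros.

/*
 * Sketch of the amd64 approach: with a permanent direct map of all
 * physical memory at a fixed kernel virtual address, translating in
 * either direction is a single addition or subtraction.
 */
#include <stdint.h>
#include <stdio.h>

#define DMAP_BASE 0xffff800000000000ULL   /* hypothetical start of the direct map */

static inline uint64_t phys_to_kva(uint64_t pa) { return DMAP_BASE + pa; }
static inline uint64_t kva_to_phys(uint64_t va) { return va - DMAP_BASE; }

int main(void)
{
        uint64_t pa = 0x12345000;         /* some physical page address */
        uint64_t va = phys_to_kva(pa);    /* sf_buf_kva() reduces to this */
        printf("pa 0x%llx <-> kva 0x%llx <-> pa 0x%llx\n",
               (unsigned long long)pa, (unsigned long long)va,
               (unsigned long long)kva_to_phys(va));
        return 0;
}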

struct sf_buf {
        LIST_ENTRY(sf_buf)  list_entry;  /* hash list */
        TAILQ_ENTRY(sf_buf) free_entry;  /* inactive list */
        struct vm_page      *m;          /* currently mapped page */
        vm_offset_t         kva;         /* virtual address of mapping */
        int                 ref_count;   /* usage of this mapping */
        cpumask_t           cpumask;     /* cpus on which mapping is valid */
};

Figure 1: The i386 Ephemeral Mapping Object (sf_buf)

6 Performance Evaluation

This section presents the experimental platforms and evaluation of the sf_buf interface on the pipe, memory disk and network subsystems.

6.1 Experimental Platforms

The experimental setup consisted of five platforms. The first platform is a Pentium Xeon 2.4 GHz machine, with hyper-threading enabled, having 2 GB of memory. We refer to this platform as Xeon-HTT. Due to hyper-threading the Xeon-HTT has two virtual CPUs on a single physical processor. The next three platforms are identical to Xeon-HTT but have different processor configurations. The second platform runs a uniprocessor kernel, resulting in having a single virtual and physical processor. Henceforth, we refer to this platform as Xeon-UP. The third platform has two physical CPUs, each with hyper-threading disabled; we refer to this platform as Xeon-MP. The fourth platform has two physical CPUs with hyper-threading enabled, resulting in having four virtual CPUs. We refer to this platform as Xeon-MP-HTT. Unlike Xeon-UP, multiprocessor kernels run on the other Xeon platforms. The Xeon has an i386 architecture. Our fifth platform is a dual-processor Opteron model 242 (1.6 GHz) with 3 GB of memory. Henceforth, we refer to this platform as Opteron-MP. The Opteron has an amd64 architecture. All the platforms run FreeBSD 5.3.

6.2 Executive Summary of Results

This section presents an executive summary of the experimental results. For the rest of this paper we refer to the kernel using the sf_buf interface as the sf_buf kernel and the kernel using the original techniques of ephemeral mapping management as the original kernel. Each experiment is performed once using the sf_buf kernel and once using the original kernel on each of the platforms. For all experiments on all platforms, the sf_buf kernel provides noticeable performance improvements.

For the Opteron-MP the performance improvement is due to two factors: (1) complete elimination of virtual address allocation cost and (2) complete elimination of local and remote TLB invalidations. Under the original kernel, the machine-independent code always allocates a virtual address for creating an ephemeral mapping. The corresponding machine-independent code, under the sf_buf kernel, does not allocate a virtual address but makes a call to the machine-dependent code. The cost of virtual address allocation is avoided in the amd64 machine-dependent implementation of the sf_buf interface, which returns the permanent one-to-one physical-to-virtual address mappings. Secondly, since the ephemeral mappings returned by the sf_buf interface are permanent, all local and remote TLB invalidations for ephemeral mappings are avoided under the sf_buf kernel. The above explanation holds true for all experiments on the Opteron-MP and, hence, we do not repeat the explanation for the rest of the paper.

For the various platforms on the Xeon, the performance improvements under the sf_buf kernel were due to: (1) reduction of physical-to-virtual address allocation cost and (2) reduction of local and remote TLB invalidations. On the i386 architecture, the sf_buf interface maintains a cache of physical-to-virtual address mappings. While creating an ephemeral mapping under the sf_buf kernel, a cache hit results in reuse of a physical-to-virtual mapping. The associated cost is lower than the cost of allocating a new virtual address, which is done under the original kernel. Further, a cache hit avoids local and remote TLB invalidations which would have been required under the original kernel.

Secondly, if an ephemeral mapping is declared CPU-private, it requires no remote TLB invalidations on a cache miss under the sf_buf kernel. For each of our experiments in the following sections we articulate the reasons for performance improvement under the sf_buf kernel. Unless stated otherwise, the sf_buf kernel on a Xeon machine uses a cache of 64K entries of physical-to-virtual address mappings, where each entry corresponds to a single physical page. This cache can map a maximum footprint of 256 MB. For some of the experiments we vary this cache size to study its effects.

The Xeon-UP platform outperforms all other Xeon platforms when the benchmark is single-threaded. Only the web server is multi-threaded; thus only it can exploit symmetric multi-threading (Xeon-HTT), multiple processors (Xeon-MP), or the combination of both (Xeon-MP-HTT). Moreover, Xeon-UP runs a uniprocessor kernel which is not subject to the synchronization overhead incurred by multiprocessor kernels running on the other Xeon platforms.

6.3 Pipes

This experiment used the lmbench bw_pipe program [10] under the sf_buf kernel and the original kernel. This benchmark creates a Unix pipe between two processes, transfers 50 MB through the pipe in 64 KB chunks and measures the bandwidth obtained. Figure 2 shows the result for this experiment on our test platforms. The sf_buf kernel achieved 67%, 129%, 168%, 113% and 22% higher bandwidth than the original kernel for the Xeon-UP, Xeon-HTT, Xeon-MP, Xeon-MP-HTT and the Opteron-MP respectively. For the Opteron-MP, the performance improvement is due to the reasons explained in Section 6.2. For the Xeon platforms, the small set of physical pages used by the benchmark are mapped repeatedly, resulting in a near 100% cache-hit rate and complete elimination of local and remote TLB invalidations, as shown in Figure 3. For all the experiments in this paper we count the number of remote TLB invalidations issued and not the number of remote TLB invalidations that actually happen on the remote processors.

[Figure 2: Pipe bandwidth in MB/s]

[Figure 3: Local and remote TLB invalidations issued for the pipe experiment]

6.4 Memory Disks

This section presents two experiments — one using Disk Dump (dd) and another using the PostMark benchmark [9] — to characterize the effect of the sf_buf interface on memory disks.

6.4.1 Disk Dump (dd)

This experiment uses dd to transfer a memory disk to the null device using a block size of 64 KB and observes the transfer bandwidth. We perform this experiment for two sizes of memory disks — 128 MB and 512 MB. The size of the sf_buf cache on the Xeons is 64K entries, which can map a maximum of 256 MB, larger than the smaller memory disk but smaller than the larger one. Under the sf_buf kernel two configurations are used — one using the private mapping option and the other eliminating its use and thus creating default shared mappings.

Figures 4 and 6 show the bandwidth obtained on each of the platforms for the 128 MB disk and 512 MB disk respectively. For the Opteron-MP, using the sf_buf interface increases the bandwidth by about 37%. On the Xeons, the sf_buf interface increases the bandwidth by up to 51%. Using the private mapping option has no effect on the Opteron-MP because all local and remote TLB invalidations are avoided by the use of permanent, one-to-one physical-to-virtual mappings. Since there is no sf_buf cache on the Opteron-MP, similar performance is obtained on both disk sizes.

For the Xeons, the 128 MB disk can be mapped entirely by the sf_buf cache, causing no local and remote TLB invalidations even when the private mapping option is eliminated. This is shown in Figure 5. Hence, using the private mapping option has negligible effect for the 128 MB disk, as shown in Figure 4. However, the 512 MB disk cannot be mapped entirely by the cache. The sequential disk access of dd causes almost a 100% cache miss under the sf_buf kernel. Using the private mapping option reduces the cost of these cache misses by eliminating remote TLB invalidations and thus improves the performance, which is shown in Figure 6. As shown in Figure 7, the use of the private mapping option eliminates all remote TLB invalidations from all Xeon platforms for the 512 MB memory disk.

[Figure 4: Disk dump bandwidth in MB/s for 128 MB memory disk]

[Figure 5: Local and remote TLB invalidations issued for the disk dump experiment on 128 MB memory disk]

[Figure 6: Disk dump bandwidth in MB/s for 512 MB memory disk]

[Figure 7: Local and remote TLB invalidations issued for the disk dump experiment on 512 MB memory disk]

6.4.2 PostMark

PostMark is a file system benchmark simulating an electronic mail server workload [9]. It creates a pool of continuously changing files and measures the transaction rates, where a transaction is creating, deleting, reading from or appending to a file. We used the benchmark's default parameters, i.e., block size of 512 bytes and file sizes ranging from 500 bytes up to 9.77 KB.

We used a 512 MB memory disk for the PostMark benchmark. We used the three prescribed configurations of PostMark. The first configuration has 1,000 initial files and performs 50,000 transactions. The second has 20,000 files and performs 50,000 transactions. The third configuration has 20,000 initial files and performs 100,000 transactions.

PostMark reports the number of transactions performed per second (TPS), and it measures the read and write bandwidth obtained from the system. Figure 8 shows the TPS obtained on each of our platforms for the largest configuration of PostMark. Corresponding results for read and write bandwidths are shown in Figure 9. The results for the two other configurations of PostMark exhibit similar trends and, hence, are not shown in the paper.

For the Opteron-MP, using the sf_buf interface increased the TPS by about 11% to about 27%. Read and write bandwidth increased by about 11% to about 17%.

For the Xeon platforms, using the sf_buf interface increased the TPS by about 4% to about 13%. Read and write bandwidth went up by about 4% to 15%. The maximum footprint of the PostMark benchmark is about 150 MB under the three configurations used and is completely mapped by the sf_buf cache on the Xeons. We did not eliminate the use of the private mapping option on the Xeons for the sf_buf kernel as there were no remote TLB invalidations under these workloads. The performance improvement on the Xeons is thus due to the elimination of local and remote TLB invalidations as shown in Figure 10.

[Figure 8: Transactions per second for PostMark with 20,000 files and 100,000 transactions]

[Figure 9: Read/Write Throughput (in MB/s) for PostMark with 20,000 files and 100,000 transactions]

[Figure 10: Local and remote TLB invalidations issued for PostMark with 20,000 files and 100,000 transactions]

6.5 Networking Subsystem

This section uses two sets of experiments — one using netperf and another using a web server — to examine the effects of the sf_buf interface on the networking subsystem.

6.5.1 Netperf

This experiment examines the throughput achieved between a netperf client and server on the same machine. TCP socket send and receive buffer sizes are set to 64 KB for this experiment. Sockets are configured to use zero-copy send. We perform two sets of experiments on each platform, one using the default Maximum Transmission Unit (MTU) size of 1500 bytes and another using a large MTU size of 16K bytes.

Figures 11 and 12 show the network throughput obtained under the sf_buf kernel and the original kernel on each of our platforms. The larger MTU size yields higher throughput because less CPU time is spent doing TCP segmentation. The throughput improvements from the sf_buf interface on all platforms range from about 4% to about 34%. Using the larger MTU size makes the cost of creation of ephemeral mappings a bigger factor in network throughput. Hence, the performance improvement is higher when using the sf_buf interface under this scenario.

Reduction in local and remote TLB invalidations explains the above performance improvement, as shown in Figures 13 and 14. The sf_buf interface greatly reduces TLB invalidations on the Xeons, and completely eliminates them on the Opteron-MP.

[Figure 11: Netperf throughput in Mbits/s for large MTU]

[Figure 12: Netperf throughput in Mbits/s for small MTU]

[Figure 13: Local and remote TLB invalidations issued for Netperf experiments with large MTU]

[Figure 14: Local and remote TLB invalidations issued for Netperf experiments with small MTU]

6.5.2 Web Server

We used apache 2.0.50 as the web server on each of our platforms. We ran an emulation of 30 concurrent clients on a separate machine to generate a workload on the server. The server and client machines were connected via a Gigabit Ethernet link. Apache was configured to use sendfile(2). For this experiment we measure the throughput obtained from the server and count the number of local and remote TLB invalidations on the server machine. The web server was subject to real workloads of web traces from NASA and Rice University's Computer Science Department that have been used in published literature [7, 15]. For the rest of this paper we refer to these workloads as the NASA workload and the Rice workload respectively. These workloads have footprints of 258.7 MB and 1.1 GB respectively.

Figures 15 and 16 show the throughput for all the platforms using both the sf_buf kernel and the original kernel for the NASA and the Rice workloads respectively. For the Opteron-MP, the sf_buf kernel improves performance by about 6% for the NASA workload and about 14% for the Rice workload. The reasons behind these performance improvements are the same as described earlier in Section 6.2.

For the Xeons, using the sf_buf kernel results in performance improvement of up to about 7%. This performance improvement is a result of the reduction in local and remote TLB invalidations as shown in Figures 17 and 18.

For the above experiments the Xeon platforms employed an sf_buf cache of 64K entries. To study the effect of the size of this cache on web server throughput we reduced it down to 6K entries. A smaller cache causes more misses, thus increasing the number of TLB invalidations. The implementation of the sf_buf interface on the i386 architecture employs an optimization which avoids TLB invalidations if the page table entry's (PTE) access bit is clear. With TCP checksum offloading enabled, the CPU does not touch the pages to be sent, and as a result the corresponding PTEs have their access bits clear and on a cache miss TLB invalidation is avoided. With TCP checksum offloading disabled, the CPU touches the pages and the corresponding PTEs, causing TLB invalidations on cache misses. So for each cache size we did two experiments, one with TCP checksum offloading enabled and the other with it disabled.

Figure 19 shows the throughput for the NASA workload on the Xeon-MP for the above experiment. For the larger cache size slightly higher throughput is obtained because of more reduction in local and remote TLB invalidations, as shown in Figure 20. Also, enabling checksum offloading brings local and remote TLB invalidations further down because of the access bit optimization. Reducing the cache size from 64K to 6K entries does not significantly reduce throughput because the hit rate of the ephemeral mapping cache drops from nearly 100% to about 82%. This lower cache hit rate is sufficient to avoid any noticeable performance degradation.

[Figure 15: Throughput (in Mbits/s) for the NASA workload]

[Figure 16: Throughput (in Mbits/s) for the Rice workload]

[Figure 17: Local and remote TLB invalidations issued for the NASA workload]

[Figure 18: Local and remote TLB invalidations issued for the Rice workload]

[Figure 19: Throughput (in Mbits/s) for the NASA workload on Xeon-MP with the sf_buf cache having 64K or 6K entries and the original kernel, with TCP checksum offloading enabled or disabled]

[Figure 20: Local and remote TLB invalidations issued for the NASA workload on Xeon-MP with the sf_buf cache having 64K or 6K entries and the original kernel, with TCP checksum offloading enabled or disabled]

7 Related Work

Chu describes a per-process mapping cache for zero-copy TCP in Solaris 2.4 [6]. Since the cache is not shared among all processes in the system its benefits are limited. For example, a multi-processed web server using FreeBSD's sendfile(2), like apache 1.3, will not get the maximum benefit from the cache if more than one process transmits the same file. In this case the file pages are the same for all processes, so having a common cache would serve best.

Ruan and Pai extend the mapping cache to be shared among all processes [13]. This cache is sendfile(2)-specific. Our work extends the benefits of the shared mapping cache to additional kernel subsystems, providing substantial modularity and performance improvements. We provide a uniform API for processor-specific implementations. We also study the effect of the cache on multiprocessor systems.

Bonwick and Adams [3] describe Vmem, a generic resource allocator, where kernel virtual addresses are viewed as a type of resource. However, the goals of our work are different from those of Vmem. In the context of the kernel virtual address resource, Vmem's goal is to achieve fast allocation and low fragmentation. However, it makes no guarantee that the allocated kernel virtual addresses are "safe", i.e., that they require no TLB invalidations. In contrast, the sf_buf interface returns kernel virtual addresses that are completely safe in most cases, requiring no TLB invalidations. Additionally, the cost of using the sf_buf interface is small.

Bala et al. [1] design a software cache of TLB entries to provide fast access to entries on a TLB miss. This cache mitigates the costs of a TLB miss. The goal of our work is entirely different: to maintain a cache of ephemeral mappings. Because our work re-uses address mappings, it can augment such a cache of TLB entries. This is because the mappings corresponding to the physical pages with entries in the sf_buf interface do not need to change in the software TLB cache. In other words, the effectiveness of such a cache (of TLB entries) can be increased with the sf_buf interface.

Thekkath and Levy [14] explore techniques for achieving low-latency communication and implement a low-latency RPC system. They point out re-mapping as one of the sources of the cost of communication. On multiprocessors, this cost is increased due to TLB coherency operations [2]. The sf_buf interface obviates the need for re-mapping and hence lowers the cost of communication.

8 Conclusions

Modern operating systems create ephemeral virtual-to-physical mappings for a variety of purposes, ranging from the implementation of interprocess communication to the implementation of process tracing and debugging. The hardware costs of creating these ephemeral mappings are generally increasing with succeeding generations of processors. Moreover, if an ephemeral mapping is to be shared among multiprocessors, those processors must act to maintain the consistency of their TLBs. In this paper we have provided a software solution to alleviate this problem.

In this paper we have devised a new abstraction to be used in the operating system kernel, the ephemeral mapping interface. This interface allocates ephemeral kernel virtual addresses and virtual-to-physical address mappings. The interface is low cost, and greatly reduces the number of costly interprocessor interrupts. We call our ephemeral mapping interface the sf_buf interface. We have described its implementation in the FreeBSD 5.3 kernel on two representative architectures — the i386 and the amd64 — and outlined its implementation for the three other architectures supported by FreeBSD. Many kernel subsystems — pipes, memory disks, sockets, execve(), ptrace(), and the vnode pager — benefit from using the sf_buf interface. The sf_buf interface also centralizes redundant code from each of these subsystems, reducing their overall size.

We have evaluated the sf_buf interface for the pipe, memory disk and networking subsystems. For the bw_pipe program of the lmbench benchmark [10] the bandwidth improved by up to about 168% on one of our platforms. For memory disks, a disk dump program resulted in about 37% to 51% improvement in bandwidth. For the PostMark benchmark [9] on a memory disk we demonstrate up to a 27% increase in transaction throughput. The sf_buf interface increases netperf throughput by up to 34%. We also demonstrate tangible benefits for a web server workload with the sf_buf interface.

In all these cases, the ephemeral mapping interface greatly reduced or completely eliminated the number of TLB invalidations.

Acknowledgments

We wish to acknowledge Matthew Dillon of the DragonFly BSD Project, Tor Egge of the FreeBSD Project, and David Andersen, our shepherd. Matthew reviewed our work and incorporated it into DragonFly BSD. He also performed extensive benchmarking and developed the implementation for CPU-private mappings that is used on the i386. Tor reviewed parts of the i386 implementation for FreeBSD. Last but not least, David was an enthusiastic and helpful participant in improving the presentation of our work.

References

[1] K. Bala, M. F. Kaashoek, and W. E. Weihl. Software Prefetching and Caching for Translation Lookaside Buffers. In First Symposium on Operating Systems Design and Implementation, pages 243–253, Monterey, California, Nov. 1994.

[2] D. L. Black, R. F. Rashid, D. B. Golub, C. R. Hill, and R. V. Baron. Translation lookaside buffer consistency: A software approach. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 113–122, Dec. 1989.

[3] J. Bonwick and J. Adams. Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources. In USENIX Annual Technical Conference, Boston, Massachusetts, June 2001.

[4] J. C. Brustoloni and P. Steenkiste. Effects of buffering semantics on I/O performance. In Operating Systems Design and Implementation, pages 277–291, 1996.

[5] R. Cheng. Virtual address cache in Unix. In Proceedings of the 1987 Summer USENIX Conference, pages 217–224, 1987.

[6] H.-K. J. Chu. Zero-copy TCP in Solaris. In USENIX 1996 Annual Technical Conference, pages 253–264, Jan. 1996.

[7] K. Elmeleegy, A. Chanda, A. L. Cox, and W. Zwaenepoel. Lazy asynchronous I/O for event-driven servers. In USENIX 2004 Annual Technical Conference, June 2004.

[8] A. Gallatin, J. Chase, and K. Yocum. Trapeze/IP: TCP/IP at near-gigabit speeds. In Proceedings of the FREENIX Track: 1999 USENIX Annual Technical Conference, June 1999.

[9] J. Katcher. PostMark: A new file system benchmark. At http://www.netapp.com/tech library/3022.html.

[10] L. McVoy and C. Staelin. Lmbench - tools for performance analysis. At http://www.bitmover.com/lmbench/, 1996.

[11] R. Rashid, A. Tevanian, M. Young, D. Golub, R. Baron, D. Black, W. Bolosky, and J. Chew. Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 31–39, 1987.

[12] B. S. Rosenburg. Low-synchronization translation lookaside buffer consistency in large-scale shared-memory multiprocessors. In Proceedings of the 12th ACM Symposium on Operating System Principles, pages 137–146, Litchfield Park, AZ, Dec. 1989.

[13] Y. Ruan and V. Pai. Making the Box transparent: System call performance as a first-class result. In USENIX 2004 Annual Technical Conference, pages 1–14, June 2004.

[14] C. A. Thekkath and H. M. Levy. Limits to Low-Latency Communication on High-Speed Networks. ACM Transactions on Computer Systems, 11(2):179–203, May 1993.

[15] H.-y. Kim, V. S. Pai, and S. Rixner. Increasing Web Server Throughput with Network Interface Data Caching. In Architectural Support for Programming Languages and Operating Systems, pages 239–250, San Jose, California, Oct. 2002.
