Proceedings of the Linux Symposium

Volume One

June 27th–30th, 2007, Ottawa, Ontario, Canada

Contents

The Price of Safety: Evaluating IOMMU Performance (Ben-Yehuda, Xenidis, Ostrowski, Rister, Bruemmer, Van Doorn)

Linux on Cell Broadband Engine status update (Arnd Bergmann)

Linux Kernel Debugging on Google-sized clusters (M. Bligh, M. Desnoyers, & R. Schultz)

Ltrace Internals (Rodrigo Rubira Branco)

Evaluating effects of cache memory compression on embedded systems (Anderson Briglia, Allan Bezerra, Leonid Moiseichuk, & Nitin Gupta)

ACPI in Linux – Myths vs. Reality (Len Brown)

Cool Hand Linux – Handheld Thermal Extensions (Len Brown)

Asynchronous System Calls (Zach Brown)

Frysk 1, Kernel 0? (Andrew Cagney)

Keeping Kernel Performance from Regressions (T. Chen, L. Ananiev, & A. Tikhonov)

Breaking the Chains—Using LinuxBIOS to Liberate Embedded x86 Processors (J. Crouse, M. Jones, & R. Minnich)

GANESHA, a multi-usage with large cache NFSv4 server (P. Deniel, T. Leibovici, & J.-C. Lafoucrière)

Why Virtualization Fragmentation Sucks (Justin M. Forbes)

A New Network File System is Born: Comparison of SMB2, CIFS, and NFS (Steven French)

Supporting the Allocation of Large Contiguous Regions of Memory (Mel Gorman)

Kernel Scalability—Expanding the Horizon Beyond Fine Grain Locks (Corey Gough, Suresh Siddha, & Ken Chen)

Kdump: Smarter, Easier, Trustier (Vivek Goyal)

Using KVM to run Xen guests without Xen (R.A. Harper, A.N. Aliguori, & M.D. Day)

Djprobe—Kernel probing with the smallest overhead (M. Hiramatsu & S. Oshima)

Desktop integration of Bluetooth (Marcel Holtmann)

How virtualization makes power management different (Yu Ke)

Ptrace, Utrace, Uprobes: Lightweight, Dynamic Tracing of User Apps (J. Keniston, A. Mavinakayanahalli, P. Panchamukhi, & V. Prasad)

kvm: the Linux Virtual Machine Monitor (A. Kivity, Y. Kamay, D. Laor, U. Lublin, & A. Liguori)

Linux Telephony (Paul P. Komkoff, A. Anikina, & R. Zhnichkov)

Linux Kernel Development (Greg Kroah-Hartman)

Implementing Democracy (Christopher James Lahey)

Extreme High Performance Computing or Why Microkernels Suck (Christoph Lameter)

Performance and Availability Characterization for Linux Servers (Linkov & Koryakovskiy)

“Turning the Page” on Hugetlb Interfaces (Adam G. Litke)

Resource Management: Beancounters (Pavel Emelianov, Denis Lunev, & Kirill Korotaev)

Manageable Virtual Appliances (D. Lutterkort)

Everything is a virtual filesystem: libferris (Ben Martin)

Conference Organizers

Andrew J. Hutton, Steamballoon, Inc., Linux Symposium, Thin Lines Mountaineering
C. Craig Ross, Linux Symposium

Review Committee

Andrew J. Hutton, Steamballoon, Inc., Linux Symposium, Thin Lines Mountaineering
Dirk Hohndel, Intel
Martin Bligh, Google
Gerrit Huizenga, IBM
Dave Jones, Red Hat, Inc.
C. Craig Ross, Linux Symposium

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc.
Gurhan Ozen, Red Hat, Inc.
John Feeney, Red Hat, Inc.
Len DiMaggio, Red Hat, Inc.
John Poelstra, Red Hat, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.

The Price of Safety: Evaluating IOMMU Performance

Muli Ben-Yehuda, IBM Haifa Research Lab, [email protected]
Jimi Xenidis, IBM Research, [email protected]
Michal Ostrowski, IBM Research, [email protected]
Karl Rister, IBM LTC, [email protected]
Alexis Bruemmer, IBM LTC, [email protected]
Leendert Van Doorn, AMD, [email protected]

Abstract

An I/O Memory Management Unit (IOMMU) creates one or more unique address spaces which can be used to control how a DMA operation, initiated by a device, accesses host memory. This functionality was originally introduced to increase the addressability of a device or bus, particularly when 64-bit host CPUs were being introduced while most devices were designed to operate in a 32-bit world. The uses of IOMMUs were later extended to restrict the host memory pages that a device can actually access, thus providing an increased level of isolation, protecting the system from user-level device drivers and eventually virtual machines. Unfortunately, this additional logic does impose a performance penalty.

The widespread introduction of IOMMUs by Intel [1] and AMD [2] and the proliferation of virtual machines will make IOMMUs a part of nearly every computer system. There is no doubt with regards to the benefits IOMMUs bring... but how much do they cost? We seek to quantify, analyze, and eventually overcome the performance penalties inherent in the introduction of this new technology.

Although they provide valuable services, IOMMUs can impose a performance penalty due to the extra memory accesses required to perform DMA operations. The exact performance degradation depends on the IOMMU design, its caching architecture, the way it is programmed and the workload. This paper presents the performance characteristics of the Calgary and DART IOMMUs in Linux, both on bare metal and in a hypervisor environment. The throughput and CPU utilization of several IO workloads, with and without an IOMMU, are measured and the results are analyzed. The potential strategies for mitigating the IOMMU's costs are then discussed. In conclusion a set of optimizations and resulting performance improvements are presented.

1 Introduction

IOMMUs, IO Memory Management Units, are hardware devices that translate device DMA addresses to machine addresses. An isolation capable IOMMU restricts a device so that it can only access parts of memory it has been explicitly granted access to. Isolation capable IOMMUs perform a valuable system service by preventing rogue devices from performing errant or malicious DMAs, thereby substantially increasing the system's reliability and availability. Without an IOMMU a peripheral device could be programmed to overwrite any part of the system's memory. Operating systems utilize IOMMUs to isolate device drivers; hypervisors utilize IOMMUs to grant secure direct hardware access to virtual machines. With the imminent publication of the PCI-SIG's IO Virtualization standard, as well as Intel and AMD's introduction of isolation capable IOMMUs in all new servers, IOMMUs will become ubiquitous.

1.1 IOMMU design

A broad description of current and future IOMMU hardware and designs from various companies can be found in the OLS '06 paper entitled Utilizing IOMMUs for Virtualization in Linux and Xen [3]. The design of a system with an IOMMU can be broadly broken down into the following areas:

• IOMMU hardware architecture and design.

• Hardware ↔ software interfaces.


• Pure software interfaces (e.g., between userspace and kernelspace, or between kernelspace and hypervisor).

It should be noted that these areas can and do affect each other: the hardware/software interface can dictate some aspects of the pure software interfaces, and the hardware design dictates certain aspects of the hardware/software interfaces.

This paper focuses on two different implementations of the same IOMMU architecture that revolves around the basic concept of a Translation Control Entry (TCE). TCEs are described in detail in Section 1.1.2.

1.1.1 IOMMU hardware architecture and design

Just as a CPU-MMU requires a TLB with a very high hit-rate in order to not impose an undue burden on the system, so does an IOMMU require a translation cache to avoid excessive memory lookups. These translation caches are commonly referred to as IOTLBs.

The performance of the system is affected by several cache-related factors:

• The cache size and associativity [13].

• The cache replacement policy.

• The cache invalidation mechanism and the frequency and cost of invalidations.

The optimal cache replacement policy for an IOTLB is probably significantly different than for an MMU-TLB. MMU-TLBs rely on spatial and temporal locality to achieve a very high hit-rate. DMA addresses from devices, however, do not necessarily have temporal or spatial locality. Consider for example a NIC which DMAs received packets directly into application buffers: packets for many applications could arrive in any order and at any time, leading to DMAs to wildly disparate buffers. This is in sharp contrast with the way applications access their memory, where both spatial and temporal locality can be observed: memory accesses to nearby areas tend to occur closely together.

Cache invalidation can have an adverse effect on the performance of the system. For example, the Calgary IOMMU (which will be discussed later in detail) does not provide a software mechanism for invalidating a single cache entry—one must flush the entire cache to invalidate an entry. We present a related optimization in Section 4.

It should be mentioned that the PCI-SIG IOV (IO Virtualization) working group is working on an Address Translation Services (ATS) standard. ATS brings in another level of caching, by defining how I/O endpoints (i.e., adapters) inter-operate with the IOMMU to cache translations on the adapter and communicate invalidation requests from the IOMMU to the adapter. This adds another level of complexity to the system, which needs to be overcome in order to find the optimal caching strategy.

1.1.2 Hardware ↔ Software Interface

The main hardware/software interface in the TCE family of IOMMUs is the Translation Control Entry (TCE). TCEs are organized in TCE tables. TCE tables are analogous to page tables in an MMU, and TCEs are similar to page table entries (PTEs). Each TCE identifies a 4KB page of host memory and the access rights that the bus (or device) has to that page. The TCEs are arranged in a contiguous series of host memory pages that comprise the TCE table. The TCE table creates a single unique IO address space (DMA address space) for all the devices that share it.

The translation from a DMA address to a host memory address occurs by computing an index into the TCE table by simply extracting the page number from the DMA address. The index is used to compute a direct offset into the TCE table that results in a TCE that translates that IO page. The access control bits are then used to validate both the translation and the access rights to the host memory page. Finally, the translation is used by the bus to direct a DMA transaction to a specific location in host memory. This is illustrated in Figure 1.
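To make the lookup concrete, the following sketch mirrors the translation just described (illustrative C only; all names are invented here and the entry layout is simplified relative to the real formats in Tables 1 and 2):

    /* Sketch of the DMA-address-to-host-address translation.
     * Names are illustrative; see Tables 1 and 2 for the real
     * Calgary and DART entry layouts. */
    #define IO_PAGE_SHIFT 12                      /* 4KB IO pages */
    #define IO_PAGE_MASK  ((1UL << IO_PAGE_SHIFT) - 1)

    struct tce {
            unsigned long rpn;                    /* real page number */
            unsigned int  read_ok, write_ok;      /* access control   */
    };

    static long tce_translate(struct tce *table, unsigned long dma_addr,
                              int is_write)
    {
            unsigned long idx = dma_addr >> IO_PAGE_SHIFT; /* IO page number  */
            struct tce *e = &table[idx];          /* direct offset into table */

            if (is_write ? !e->write_ok : !e->read_ok)
                    return -1;                    /* invalid or forbidden */

            /* real page number plus byte offset within the 4KB page */
            return (e->rpn << IO_PAGE_SHIFT) | (dma_addr & IO_PAGE_MASK);
    }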

Figure 1: TCE table (the IO page number portion of an IO address indexes the TCE table; the selected entry supplies access control bits and a real page number, which is combined with the byte offset to form the host memory address)

The TCE architecture can be customized in several ways, resulting in different implementations that are optimized for a specific machine. This paper examines the performance of two TCE implementations. The first one is the Calgary family of IOMMUs, which can be found in IBM's high-end System x (x86-64 based) servers, and the second one is the DMA Address Relocation Table (DART) IOMMU, which is often paired with the PowerPC 970 processors that can be found in Apple G5 and IBM JS2x blades, as implemented by the CPC945 Bridge and Memory Controller.

The format of the TCEs is the first level of customization. Calgary is designed to be integrated with a Host Bridge Adapter or South Bridge that can be paired with several architectures—in particular ones with a huge addressable range. The Calgary TCE has the following format:

Bits    Field      Description
0:15    Unused
16:51   RPN        Real Page Number
52:55   Reserved
56:61   Hub ID     Used when a single TCE table isolates several busses
62      W*         W=1 ⇒ Write allowed
63      R*         R=1 ⇒ Read allowed

*R=0 and W=0 represent an invalid translation

Table 1: Calgary TCE format
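Expressed in code, assembling such an entry could look like the sketch below (bit positions follow Table 1, which numbers bit 0 as the most significant bit of the 64-bit entry; this is not the kernel's actual implementation):

    /* Sketch: pack a Calgary-style TCE per Table 1. Bit 63 of the
     * table (R) is the least significant bit of the word here. */
    static unsigned long long make_calgary_tce(unsigned long long rpn,
                                               unsigned int hub_id,
                                               int w, int r)
    {
            unsigned long long tce = 0;

            tce |= (rpn & 0xFFFFFFFFFULL) << 12;             /* bits 16:51, RPN    */
            tce |= (unsigned long long)(hub_id & 0x3F) << 2; /* bits 56:61, hub ID */
            tce |= (unsigned long long)(w & 1) << 1;         /* bit 62, W          */
            tce |= (unsigned long long)(r & 1);              /* bit 63, R          */
            return tce;                           /* R=0 and W=0 means invalid */
    }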

The 36 bits of RPN represent a generous 48 bits (256 TB) of addressability in host memory (a 36-bit real page number plus the 12-bit offset within a 4KB page). On the other hand, the DART, which is integrated with the North Bridge of the PowerPC 970 system, can take advantage of the system's maximum 24-bit RPN for 36 bits (64 GB) of addressability and reduce the TCE size to 4 bytes, as shown in Table 2.

Bits   Field      Description
0      Valid      1 - valid
1      R          R=0 ⇒ Read allowed
2      W          W=0 ⇒ Write allowed
3:7    Reserved
8:31   RPN        Real Page Number

Table 2: DART TCE format

This allows DART to reduce the size of the table by half for the same size of IO address space, leading to better (smaller) host memory consumption and better host cache utilization.

1.1.3 Pure Software Interfaces

The IOMMU is a shared hardware resource, which is used by drivers, which could be implemented in user-space, kernel-space, or hypervisor-mode. Hence the IOMMU needs to be owned, multiplexed and protected by system software—typically, an operating system or hypervisor.

In the bare-metal (no hypervisor) case, without any userspace driver, with Linux as the operating system, the relevant interface is Linux's DMA-API [4][5]. In-kernel drivers call into the DMA-API to establish and tear down IOMMU mappings, and the IOMMU's DMA-API implementation maps and unmaps pages in the IOMMU's tables. Further details on this API and the Calgary implementation thereof are provided in the OLS '06 paper entitled Utilizing IOMMUs for Virtualization in Linux and Xen [3].
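A typical in-kernel user of this interface follows the standard streaming DMA-API pattern of that era (a sketch; most error handling omitted):

    #include <linux/dma-mapping.h>

    /* Streaming DMA-API usage as seen from a driver: each map call
     * creates an IOMMU translation (a TCE on Calgary/DART) and each
     * unmap call tears it down. */
    static int send_buffer(struct device *dev, void *buf, size_t len)
    {
            dma_addr_t dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

            if (dma_mapping_error(dma))
                    return -EIO;
            /* ... program the device to read from 'dma' and wait ... */
            dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
            return 0;
    }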

The hypervisor case is implemented similarly, with a hypervisor-aware IOMMU layer which makes hypercalls to establish and tear down IOMMU mappings. As will be discussed in Section 4, these basic schemes can be optimized in several ways.

It should be noted that for the hypervisor case there is also a common alternative implementation tailored for guest operating systems which are not aware of the IOMMU's existence, where the IOMMU's mappings are managed solely by the hypervisor without any involvement of the guest operating system. This mode of operation and its disadvantages are discussed in Section 4.3.1.

2 Performance Results and Analysis

This section presents the performance of IOMMUs, with and without a hypervisor. The benchmarks were run primarily using the Calgary IOMMU, although some benchmarks were also run with the DART IOMMU. The benchmarks used were FFSB [6] for disk IO and netperf [7] for network IO. Each benchmark was run in two sets of runs, first with the IOMMU disabled and then with the IOMMU enabled. The benchmarks were run on bare-metal Linux (Calgary and DART) and Xen dom0 and domU (Calgary).

For network tests the netperf [7] benchmark was used, using the TCP_STREAM unidirectional bulk data transfer option. The tests were run on an IBM x460 system (with the Hurricane 2.1 chipset), using 4 x dual-core Paxville processors (with hyperthreading disabled). The system had 16GB RAM, but was limited to 4GB using mem=4G for IO testing. The system was booted and the tests were run from a QLogic 2300 Fiber Card (PCI-X, volumes from a DS3400 hooked to a SAN). The on-board Broadcom Gigabit Ethernet adapter was used. The system ran SLES10 x86_64 Base, with modified kernels and Xen.

The netperf client system was an IBM e326 system, with 2 x 1.8 GHz Opteron CPUs and 6GB RAM. The NIC used was the on-board Broadcom Gigabit Ethernet adapter, and the system ran an unmodified RHEL4 U4 distribution. The two systems were connected through a Cisco 3750 Gigabit Switch stack.

A 2.6.21-rc6 based tree with additional Calgary patches (which are expected to be merged for 2.6.23) was used for bare-metal testing. For Xen testing, the xen-iommu and linux-iommu trees [8] were used. These are IOMMU development trees which track xen-unstable closely. xen-iommu contains the hypervisor bits and linux-iommu contains the xenolinux (both dom0 and domU) bits.

2.1 Results

For the sake of brevity, we present only the network results. The FFSB (disk IO) results were comparable. For Calgary, the system was tested in the following modes:

• netperf server running on a bare-metal kernel.

• netperf server running in Xen dom0, with dom0 driving the IOMMU. This setup measures the performance of the IOMMU for a “direct hardware access” domain—a domain which controls a device for its own use.

• netperf server running in Xen domU, with dom0 driving the IOMMU and domU using virtual-IO (netfront or blkfront). This setup measures the performance of the IOMMU for a “driver domain” scenario, where a “driver domain” (dom0) controls a device on behalf of another domain (domU).

The first test (netperf server running on a bare-metal kernel) was run for DART as well.

Each set of tests was run twice, once with the IOMMU enabled and once with the IOMMU disabled. For each test, the following parameters were measured or calculated: throughput with the IOMMU disabled and enabled (off and on, respectively), CPU utilization with the IOMMU disabled and enabled, and the relative difference in throughput and CPU utilization. Note that due to different setups the CPU utilization numbers are different between bare-metal and Xen. Each CPU utilization number is accompanied by the potential maximum.

For the bare-metal network tests, summarized in Figures 2 and 3, there is practically no difference between the throughput with and without an IOMMU. With an IOMMU, however, the CPU utilization can be as much as 60% more (!), albeit it is usually closer to 30%. These results are for Calgary—for DART, the results are largely the same.

For Xen, tests were run with the netperf server in dom0 as well as in domU. In both cases, dom0 was driving the IOMMU (in the tests where the IOMMU was enabled). In the domU tests domU was using the virtual-IO drivers. The dom0 tests measure the performance of the IOMMU for a “direct hardware access” scenario and the domU tests measure the performance of the IOMMU for a “driver domain” scenario.

Network results for netperf server running in dom0 are summarized in Figures 4 and 5. For messages of sizes 1024 and up, the results strongly resemble the bare-metal case: no noticeable throughput difference except for very small packets, and 40–60% more CPU utilization when the IOMMU is enabled. For messages with sizes of less than 1024, the throughput is significantly less with the IOMMU enabled than it is with the IOMMU disabled.

For Xen domU, the tests show up to 15% difference in throughput for message sizes smaller than 512 and up to 40% more CPU utilization for larger messages. These results are summarized in Figures 6 and 7.

3 Analysis

The results presented above tell mostly the same story: throughput is the same, but CPU utilization rises when the IOMMU is enabled, leading to up to 60% more CPU utilization. The throughput difference with small network message sizes in the Xen network tests probably stems from the fact that the CPU isn't able to keep up with the network load when the IOMMU is enabled. In other words, dom0's CPU is utilized to the maximum even with the IOMMU disabled, and enabling the IOMMU pushes it over the edge.

On one hand, these results are discouraging: enabling the IOMMU to get safety and paying up to 60% more in CPU utilization isn't an encouraging prospect. On the other hand, the fact that the throughput is roughly the same when the IOMMU code doesn't overload the system strongly suggests that software is the culprit, rather than hardware. This is good, because software is easy to fix!

Profile results from these tests strongly suggest that mapping and unmapping an entry in the TCE table is the biggest performance hog, possibly due to lock contention on the IOMMU data structures lock. For the bare-metal case this operation does not cross address spaces, but it does require taking a spinlock, searching a bitmap, modifying it, performing several arithmetic operations, and returning to the user. For the hypervisor case, these operations require all of the above, as well as switching to hypervisor mode.

As we will see in the next section, most of the optimizations discussed are aimed at reducing both the number and costs of TCE map and unmap requests.

4 Optimizations

This section discusses a set of optimizations that have either already been implemented or are in the process of being implemented. “Deferred Cache Flush” and “Xen multicalls” were implemented during the IOMMU's bring-up phase and are included in the results presented above. The rest of the optimizations are being implemented and were not included in the benchmarks presented above.
4.1 Deferred Cache Flush

The Calgary IOMMU, as it is used in Intel-based servers, does not include software facilities to invalidate selected entries in the TCE cache (IOTLB).

Figure 2: Bare-metal Network Throughput

Figure 3: Bare-metal Network CPU Utilization

Figure 4: Xen dom0 Network Throughput

Figure 5: Xen dom0 Network CPU Utilization

Figure 6: Xen domU Network Throughput

Figure 7: Xen domU Network CPU Utilization

The only way to invalidate an entry in the TCE cache is to quiesce all DMA activity in the system, wait until all outstanding DMAs are done, and then flush the entire TCE cache. This is a cumbersome and lengthy procedure.

In theory, for maximal safety, one would want to invalidate an entry as soon as that entry is unmapped by the driver. This will allow the system to catch any “use after free” errors. However, flushing the entire cache after every unmap operation proved prohibitive—it brought the system to its knees. Instead, the implementation trades a little bit of safety for a whole lot of usability. Entries in the TCE table are allocated using a next-fit allocator, and the cache is only flushed when the allocator rolls around (starts to allocate from the beginning). This optimization is based on the observation that an entry only needs to be invalidated before it is re-used. Since a given entry will only be reused once the allocator rolls around, roll-around is the point where the cache must be flushed.
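In pseudo-C, the allocation path embodies the rule “flush only on roll-around” (a simplification; the real code allocates runs of entries from a bitmap under a spinlock, and the helper names here are invented):

    /* Sketch of deferred flushing: the IOTLB only has to be clean
     * before an entry is reused, and reuse can only happen after
     * the next-fit cursor wraps around. */
    static unsigned long next;              /* next-fit cursor      */

    static long tce_alloc_entry(struct tce_table *tbl)
    {
            if (next >= tbl->size) {        /* roll-around          */
                    flush_whole_tce_cache(tbl); /* the only flush   */
                    next = 0;
            }
            return next++;                  /* hand out next entry  */
    }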

The downside to this optimization is that it is possible for a driver to reuse an entry after it has unmapped it, if that entry happened to remain in the TCE cache. Unfortunately, closing this hole by invalidating every entry immediately when it is freed cannot be done with the current generation of the hardware. The hole has never been observed to occur in practice.

This optimization is applicable to both bare-metal and hypervisor scenarios.

4.2 Xen multicalls

The Xen hypervisor supports “multicalls” [12]. A multicall is a single hypercall that includes the parameters of several distinct logical hypercalls. Using multicalls it is possible to reduce the number of hypercalls needed to perform a sequence of operations, thereby reducing the number of address space crossings, which are fairly expensive.

The Calgary Xen implementation uses multicalls to communicate map and unmap requests from a domain to the hypervisor. Unfortunately, profiling has shown that the vast majority of map and unmap requests (over 99%) are for a single entry, making multicalls pointless.

This optimization is only applicable to hypervisor scenarios.

4.3 Overhauling the DMA API

Profiling of the above mentioned benchmarks shows that the number one culprits for CPU utilization are the map and unmap calls. There are several ways to cut down on the overhead of map and unmap calls:

• Get rid of them completely.

• Allocate in advance; free when done.

• Allocate and free in large batches.

• Never free.

4.3.1 Using Pre-allocation to Get Rid of Map and Unmap

Calgary provides somewhat less than a 4GB DMA address space (exactly how much less depends on the system's configuration). If the guest's pseudo-physical address space fits within the DMA address space, one possible optimization is to only allocate TCEs when the guest starts up and free them when the guest shuts down. The TCEs are allocated such that TCE i maps the same machine frame as the guest's pseudo-physical address i. Then the guest could pretend that it doesn't have an IOMMU and pass the pseudo-physical address directly to the device. No cache flushes are necessary because no entry is ever invalidated.

This optimization, while appealing, has several downsides: first and foremost, it is only applicable to a hypervisor scenario. In a bare-metal scenario, getting rid of map and unmap isn't practical because it renders the IOMMU useless—if one maps all of physical memory, why use an IOMMU at all? Second, even in a hypervisor scenario, pre-allocation is only viable if the set of machine frames owned by the guest is “mostly constant” through the guest's lifetime. If the guest wishes to use page flipping or ballooning, or any other operation which modifies the guest's pseudo-physical to machine mapping, the IOMMU mapping needs to be updated as well so that the IO to machine mapping will again correspond exactly to the pseudo-physical to machine mapping. Another downside of this optimization is that it protects other guests and the hypervisor from the guest, but provides no protection inside the guest itself.
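The essence of the pre-allocation scheme can be sketched as follows (all helper and type names here are hypothetical; it assumes the guest fits in the DMA address space):

    /* Sketch: at guest start-up, make TCE i map the machine frame
     * backing pseudo-physical page i, so that DMA address ==
     * pseudo-physical address and no map/unmap calls are needed
     * later. */
    static void prealloc_guest_tces(struct guest *g)
    {
            unsigned long i;

            for (i = 0; i < g->nr_pages; i++)
                    tce_set(g->tce_table, i, pseudo_to_machine(g, i),
                            TCE_READ | TCE_WRITE);
            /* no flushes required: no entry is ever invalidated */
    }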

4.3.2 Allocate In Advance And Free When Done

This optimization is fairly simple: rather than using the “streaming” DMA API operations, use the alloc and free operations to allocate and free DMA buffers and then use them for as long as possible. Unfortunately this requires a massive change to the Linux kernel since driver writers have been taught since the days of yore that DMA mappings are a scarce resource and should only be allocated when absolutely needed. A better way to do this might be to add a caching layer inside the DMA API for platforms with many DMA mappings so that driver writers could still use the map and unmap API, but the actual mapping and unmapping will only take place the first time a frame is mapped. This optimization is applicable to both bare-metal and hypervisor scenarios.

4.3.3 Allocate And Free In Large Batches

This optimization is a twist on the previous one: rather than modifying drivers to use alloc and free rather than map and unmap, use map_multi and unmap_multi wherever possible to batch the map and unmap operations. Again, this optimization requires fairly large changes to the drivers and subsystems and is applicable to both bare-metal and hypervisor scenarios.

4.3.4 Never Free

One could sacrifice some of the protection afforded by the IOMMU for the sake of performance by simply never unmapping entries from the TCE table. This will reduce the cost of unmap operations (but not eliminate it completely—one would still need to know which entries are mapped and which have been theoretically “unmapped” and could be reused) and will have a particularly large effect on the performance of hypervisor scenarios. However, it will sacrifice a large portion of the IOMMU's advantage: any errant DMA to an address that corresponds with a previously mapped and unmapped entry will go through, causing memory corruption.

4.4 Grant Table Integration

This work has mostly been concerned with “direct hardware access” domains which have direct access to hardware devices. A subset of such domains are Xen “driver domains” [11], which use direct hardware access to perform IO on behalf of other domains. For such “driver domains,” using Xen's grant table interface to pre-map TCE entries as part of the grant operation will save an address space crossing to map the TCE through the DMA API later. This optimization is only applicable to hypervisor (specifically, Xen) scenarios.

5 Future Work

Avenues for future exploration include support and performance evaluation for more IOMMUs such as Intel's VT-d [1] and AMD's IOMMU [2], completing the implementations of the various optimizations that have been presented in this paper and studying their effects on performance, coming up with other optimizations, and ultimately gaining a better understanding of how to build “zero-cost” IOMMUs.

6 Conclusions

The performance of two IOMMUs, DART on PowerPC and Calgary on x86-64, was presented, through running IO-intensive benchmarks with and without an IOMMU on the IO path. In the common case throughput remained the same whether the IOMMU was enabled or disabled. CPU utilization, however, could be as much as 60% more in a hypervisor environment and 30% more in a bare-metal environment, when the IOMMU was enabled.

The main CPU utilization cost came from too-frequent map and unmap calls (used to create translation entries in the DMA address space). Several optimizations were presented to mitigate that cost, mostly by batching map and unmap calls in different levels or getting rid of them entirely where possible. Analyzing the feasibility of each optimization and the savings it produces is a work in progress.

Acknowledgments

The authors would like to thank Jose Renato Santos and Yoshio Turner for their illuminating comments and questions on an earlier draft of this manuscript.

References

[1] Intel Virtualization Technology for Directed I/O Architecture Specification, 2006, ftp://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf

[2] AMD I/O Virtualization Technology (IOMMU) Specification, 2006, http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/34434.pdf

[3] Utilizing IOMMUs for Virtualization in Linux and Xen, by M. Ben-Yehuda, J. Mason, O. Krieger, J. Xenidis, L. Van Doorn, A. Mallick, J. Nakajima, and E. Wahlig, in Proceedings of the 2006 Ottawa Linux Symposium (OLS), 2006.

[4] Documentation/DMA-API.txt.

[5] Documentation/DMA-mapping.txt.

[6] Flexible Filesystem Benchmark (FFSB), http://sourceforge.net/projects/ffsb/

[7] Netperf Benchmark http://www.netperf.org

[8] Xen IOMMU trees, 2007, http://xenbits.xensource.com/ext/xen-iommu.hg, http://xenbits.xensource.com/ext/linux-iommu.hg

[9] Xen and the Art of Virtualization, by B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer, in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.

[10] Xen 3.0 and the Art of Virtualization, by I. Pratt, K. Fraser, S. Hand, C. Limpach, A. Warfield, D. Magenheimer, J. Nakajima, and A. Mallick, in Proceedings of the 2005 Ottawa Linux Symposium (OLS), 2005.

[11] Safe Hardware Access with the Xen Virtual Machine Monitor, by K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, M. Williamson, in Proceedings of the OASIS ASPLOS 2004 workshop, 2004.

[12] Virtualization in Xen 3.0, by R. Rosen, http://www.linuxjournal.com/article/8909

[13] Computer Architecture, Fourth Edition: A Quantitative Approach, by J. Hennessy and D. Patterson, Morgan Kaufmann publishers, September, 2006.

Linux on Cell Broadband Engine status update

Arnd Bergmann, IBM Linux Technology Center
[email protected]

Abstract

With Linux for the Sony PS3, the IBM QS2x blades and the Toshiba Celleb platform having hit mainstream Linux distributions, programming for the Cell BE is becoming increasingly interesting for developers of high-performance computing. This talk is about the concepts of the architecture and how to develop applications for it.

Most importantly, there will be an overview of new feature additions and latest developments, including:

• Preemptive scheduling on SPUs (finally!): While it has been possible to run concurrent SPU programs for some time, there was only a very limited version of the scheduler implemented. Now we have a full time-slicing scheduler with normal and real-time priorities, SPU affinity and gang scheduling.

• Using SPUs for offloading kernel tasks: There are a few compute intensive tasks like RAID-6 or IPsec processing that can benefit from running partially on an SPU. Interesting aspects of the implementation are how to balance kernel SPU threads against user processing, how to efficiently communicate with the SPU from the kernel, and measurements to see if it is actually worthwhile.

• Overlay programming: One significant limitation of the SPU is the size of the local memory that is used for both its code and data. Recent compilers support overlays of code segments, a technique widely known in the previous century but mostly forgotten in Linux programming nowadays.

1 Background

The architecture of the Cell Broadband Engine (Cell/B.E.) is unique in many ways. It combines a general purpose PowerPC processor with eight highly optimized vector processing cores called the Synergistic Processing Elements (SPEs) on a single chip. Despite implementing two distinct instruction sets, they share the design of their memory management units and can access memory in a cache-coherent way.

The Linux operating system runs on the PowerPC Processing Element (PPE) only, not on the SPEs, but the kernel and associated libraries allow users to run special-purpose applications on the SPE as well, which can interact with other applications running on the PPE. This approach makes it possible to take advantage of the wide range of applications available for Linux, while at the same time utilizing the performance gain provided by the SPE design, which could not be achieved by just recompiling regular applications for a new architecture.

One key aspect of the SPE design is the way that memory access works. Instead of a cache memory that speeds up memory accesses as in most current designs, data is always transferred explicitly between the local on-chip SRAM and the virtually addressed system memory. An SPE program resides in the local 256KiB of memory, together with the data it is working on. Every time it wants to work on some other data, the SPE tells its Memory Flow Controller (MFC) to asynchronously copy between the local memory and the virtual address space.

The advantage of this approach is that a well-written application practically never needs to wait for a memory access but can do all of these in the background. The disadvantages include the limitation to 256KiB of directly addressable memory, which limits the set of applications that can be ported to the architecture, and the relatively long time required for a context switch, which needs to save and restore all of the local memory and the state of ongoing memory transfers instead of just the CPU registers.
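On the SPE, such transfers are issued through the MFC intrinsics from spu_mfcio.h; a minimal sketch of the pattern:

    #include <spu_mfcio.h>

    /* Sketch: start an asynchronous DMA from effective-address
     * space into local store, overlap it with other work, then
     * wait for completion on the chosen tag group. */
    #define MY_TAG 3

    static void fetch(void *ls_buf, unsigned long long ea,
                      unsigned int size)
    {
            mfc_get(ls_buf, ea, size, MY_TAG, 0, 0);
            /* ... useful work can overlap with the transfer ... */
            mfc_write_tag_mask(1 << MY_TAG);
            mfc_read_tag_status_all();      /* block until done */
    }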


1.1 Linux port

Linux on PowerPC has a concept of platform types that the kernel gets compiled for; there are for example separate platforms for IBM System p and the Apple Power Macintosh series. Each platform has its own hardware specific code, but it is possible to enable combinations of platforms simultaneously. For the Cell/B.E., we initially added a platform named “cell” to the kernel, which has the drivers for running on the bare metal, i.e. without a hypervisor. Later, the code for both the Toshiba Celleb platform and Sony's PlayStation 3 platform were added, because each of them has its own hypervisor abstractions that are incompatible with each other and with the hypervisor implementations from IBM. Most of the code that operates on SPEs however is shared and provides a common interface to user processes.

2 Programming interfaces

There is a variety of APIs available for using SPEs; I'll try to give an overview of what we have and what they are used for. For historic reasons, the kernel and toolchain refer to SPUs (Synergistic Processing Units) instead of SPEs, of which they are strictly speaking a subset. For practical purposes, these two terms can be considered equivalent.

Figure 1: Stack of interfaces for accessing SPEs (top to bottom: Multicore Plus, elfspe, ALF, SPURS; libspe2; SPU File System; SPU Base; Hypervisor; Hardware)

2.1 Kernel SPU base

There is a common interface for simple users of an SPE in the kernel; the main purpose is to make it possible to implement the SPU file system (spufs). The SPU base takes care of probing for available SPEs in the system and mapping their registers into the kernel address space. The interface is provided by the include/asm-powerpc/spu.h file. Some of the registers are only accessible through hypervisor calls on platforms where Linux runs virtualized, so accesses to these registers get abstracted by indirect function calls in the base.

A module that wants to use the SPU base needs to request a handle to a physical SPU and provide handler callbacks that will be called in case of events like page faults, stop events or error conditions.

The SPU file system is currently the only user of the SPU base in the kernel, but some people have implemented experimental other users, e.g. for acceleration of device drivers with SPUs inside of the kernel. Doing this is an easy way for prototyping kernel code, but we are recommending the use of spufs even from inside the kernel for code that you intend to have merged upstream. Note that as in-kernel interfaces, the API of the SPU base is not stable and can change at any time. All of its symbols are exported only to GPL-licensed users.

2.2 The SPU file system

The SPU file system provides the interface for accessing SPUs from the kernel. Similar to procfs and sysfs, it is a purely virtual file system and has no block device as its backing. By convention, it gets mounted world-writable to the /spu directory in the root file system.

Directories in spufs represent SPU contexts, whose properties are shown as regular files in them. Any interaction with these contexts is done through file operations like read, write or mmap. At the time of this writing, there are 30 files that are present in the directory of an SPU context; I will describe some of them as an example later.

Two system calls have been introduced for use exclusively together with spufs, spu_create and spu_run. The spu_create system call creates an SPU context in the kernel and returns an open file descriptor for the directory associated with it. The open file descriptor is significant, because it is used as a measure to determine the life time of the context, which is destroyed when the file descriptor is closed.

Note the explicit difference between an SPU context and a physical SPU. An SPU context has all the properties of an actual SPU, but it may not be associated with one and only exists in kernel memory. Similar to task switching, SPU contexts get loaded into SPUs and removed from them again by the kernel, and the number of SPU contexts can be larger than the number of available SPUs.

The second system call, spu_run, acts as a switch for a Linux thread to transfer the flow of control from the PPE to the SPE. As seen by the PPE, a thread calling spu_run blocks in that system call for an indefinite amount of time, during which the SPU context is loaded into an SPU and executed there. An equivalent to spu_run on the SPU itself is the stop-and-signal instruction, which transfers control back to the PPE. Since an SPE does not run signal handlers itself, any action on the SPE that triggers a signal, or others sending a signal to the thread, also causes it to stop on the SPE and resume running on the PPE.

Files in a context include:

mem: The mem file represents the local memory of an SPU context. It can be accessed as a linear file using read/write/seek or mmap operations. It is fully transparent to the user whether the context is loaded into an SPU or saved to kernel memory, and the memory map gets redirected to the right location on a context switch. The most important use of this file is for an object file to get loaded into an SPU before it is run, but mem is also used frequently by applications themselves.

regs: The general purpose registers of an SPU can not normally be accessed directly, but they can be in a saved context in kernel memory. This file contains a binary representation of the registers as an array of 128-bit vector variables. While it is possible to use read/write operations on the regs file in order to set up a newly loaded program or for debugging purposes, every access to it means that the context gets saved into a kernel save area, which is an expensive operation.

wbox: The wbox file represents one of three mail box files that can be used for unidirectional communication between a PPE and a thread running on the SPE. Similar to a FIFO, you can not seek in this file, but only write data to it, which can be read using a special blocking instruction on the SPE.

phys-id: The phys-id does not represent a feature of a physical SPU but rather presents an interface to get auxiliary information from the kernel, in this case the number of the SPU that a context is loaded into, or -1 if it happens not to be loaded at all at the point it is read. We will probably add more files with statistical information similar to this one, to give users better analytical functions, e.g. with an implementation of top that knows about SPU utilization.

2.3 System call vs. direct register access

Many functions of spufs can be accessed through two different ways. As described above, there are files representing the registers of a physical SPU for each context in spufs. Some of these files also allow the mmap() operation that puts a register area into the address space of a process.

Accessing the registers through mmap can significantly reduce the system call overhead for frequent accesses, but it carries a number of disadvantages that users need to worry about:

• When a thread attempts to read or write a register of an SPU context running in another thread, a page fault may need to be handled by the kernel. If that context has been moved to the context save area, e.g. as the result of preemptive scheduling, the faulting thread will not make any progress until the SPU context becomes running again. In this case, direct access is significantly slower than indirect access through file operations that are able to modify the saved state.

• When a thread tries to access its own registers while it gets unloaded, it may block indefinitely and need to be killed from the outside.

• Not all of the files that can get mapped on one kernel version can be on another one. When using 64k pages, some files can not be mapped due to hardware restrictions, and some hypervisor implementations put different limitations on what can be mapped. This makes it very hard to write portable applications using direct mapping.

• In concurrent access to the registers, e.g. two threads writing simultaneously to the mailbox, the user application needs to provide its own locking mechanisms, as the kernel can not guarantee atomic accesses.

In general, application writers should use a library like libspe2 to do the abstraction. This library contains functions to access the registers with correct locking and provides a flag that can be set to attempt using the direct mapping or fall back to using the safe file system access.
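Put together, a minimal PPE-side user of the raw spufs interface looks roughly like the following sketch (there are no glibc wrappers for the two system calls, and the program image symbols are hypothetical stand-ins):

    #include <fcntl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    extern const char spu_image[];        /* hypothetical SPU program image */
    extern unsigned long spu_image_size;

    int run_spu_program(void)
    {
            int ctx = syscall(SYS_spu_create, "/spu/myctx", 0, 0755);
            int mem = openat(ctx, "mem", O_RDWR);
            unsigned int npc = 0, status = 0;

            write(mem, spu_image, spu_image_size); /* fill local store */
            return syscall(SYS_spu_run, ctx, &npc, &status);
    }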

2.4 elfspe

For users that want to worry as little as possible about the low-level interfaces of spufs, the elfspe helper is the easiest solution. Elfspe is a program that takes an SPU ELF executable and loads it into a newly created SPU context in spufs. It is able to handle standard callbacks from a C library on the SPU, which are needed e.g. to implement printf on the SPU by running some of the code on the PPE.

By installing elfspe with the miscellaneous binary format kernel support, the kernel execve() implementation will know about SPU executables and use /sbin/elfspe as the interpreter for them, just like it calls interpreters for scripts that start with the well-known “#!” sequence.

Many programs that use only the subset of library functions provided by newlib, which is a C runtime library for embedded systems, and fit into the limited local memory of an SPE are instantly portable using elfspe. Important functionality that does not work with this approach includes:

shared libraries: Any library that the executable needs also has to be compiled for the SPE, and its size adds up to what needs to fit into the local memory. All libraries are statically linked.

threads: An application using elfspe is inherently single-threaded. It can neither use multiple SPEs nor multiple threads on one SPE.

IPC: Inter-process communication is significantly limited by what is provided through newlib. Use of system calls directly from an SPE is not easily available with the current version of elfspe, and any interface that requires shared memory requires special adaptation to the SPU environment in order to do explicit DMA.

2.5 libspe2

Libspe2 is an implementation of the operating-system-independent “SPE Runtime Management Library” specification [1]. This is what most applications are supposed to be written for in order to get the best degree of portability. There was an earlier libspe 1.x, which is not actively maintained anymore since the release of version 2.1.

Unlike elfspe, libspe2 requires users to maintain SPU contexts in their own code, but it provides an abstraction from the low-level spufs details like file operations, system calls and register access.

Typically, users want to have access to more than one SPE from one application, which is typically done through multithreading the program: each SPU context gets its own thread that calls the spu_run system call through libspe2. Often, there are additional threads that do other work on the PPE, like communicating with the running SPE threads or providing a GUI. In a program where the PPE hands out tasks to the SPEs, libspe2 provides event handles that the user can call blocking functions like epoll_wait() on, to wait for SPEs requesting new data.

[1] http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3/$file/cplibspe.pdf
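The corresponding per-thread pattern with libspe2 looks like this sketch (one POSIX thread per SPE context; the embedded program handle name is made up):

    #include <libspe2.h>
    #include <pthread.h>

    extern spe_program_handle_t my_spu_program; /* embedded SPU ELF */

    /* Sketch: each thread creates, loads and runs one SPU context;
     * spe_context_run() blocks on the PPE while the SPE executes. */
    static void *spe_thread(void *unused)
    {
            spe_context_ptr_t ctx = spe_context_create(0, NULL);
            unsigned int entry = SPE_DEFAULT_ENTRY;

            spe_program_load(ctx, &my_spu_program);
            spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
            spe_context_destroy(ctx);
            return NULL;
    }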

2.6 Middleware

There are multiple projects targeted at providing a layer on top of libspe2 to add application-side scheduling of jobs inside of an SPU context. These include the SPU Runtime System (SPURS) from Sony, the Accelerated Library Framework (ALF) from IBM and the MultiCore Plus SDK from Mercury Computer Systems.

All these projects have in common that there is no public documentation or source code available at this time, but that will probably change in the time until the Linux Symposium.

3 SPU scheduling

While spufs has had the concept of abstracting SPU contexts from physical SPUs from the start, there has not been any proper scheduling for a long time. An initial implementation of a preemptive scheduler was first merged in early 2006, but then disabled again as there were too many problems with it.

After a lot of discussion, a new implementation of the SPU scheduler from Christoph Hellwig has been merged in the 2.6.20 kernel, initially only allowing SCHED_RR and SCHED_FIFO real-time priority tasks to preempt other tasks, but later work was done to add time slicing as well for regular SCHED_OTHER threads.

Since SPU contexts do not directly correspond to Linux threads, the scheduler is independent of the Linux process scheduler. The most important difference is that a context switch is performed by the kernel, running on the PPE, not by the SPE which the context is running on.

The biggest complication when adding the scheduler is that a number of interfaces expect a context to be in a specific state. Accessing the general purpose registers from GDB requires the context to be saved, while accessing the signal notification registers through mmap requires the context to be running. The new scheduler implementation is conceptually simpler than the first attempt in that it no longer attempts to schedule in a context when it gets accessed by someone else, but rather waits for the context to be run by means of another thread calling spu_run.

Accessing one SPE from another one shows effects of non-uniform memory access (NUMA), and application writers typically want to keep a high locality between threads running on different SPEs and the memory they are accessing. The SPU code therefore has been able for some time to honor node affinity settings done through the NUMA API. When a thread is bound to a given CPU while executing on the PPE, spufs will implicitly bind the thread to an SPE on the same physical socket, to the degree that relationship is described by the firmware.

This behavior has been kept with the new scheduler, but has been extended by another aspect, affinity between SPE cores on the same socket. Unlike the NUMA interfaces, we don't bind to a specific core here, but describe the relationship between SPU contexts. The spu_create system call now gets an optional argument that lets the user pass the file descriptor of an existing context. The spufs scheduler will then attempt to move these contexts to physical SPEs that are close on the chip and can communicate with lower overhead than distant ones.

Another related interface is the temporal affinity between threads. If the two threads that you want to communicate with each other don't run at the same time, the spatial affinity is pointless. A concept called gang scheduling is applied here, with a gang being a container of SPU contexts that are all loaded simultaneously. A gang is created in spufs by passing a special flag to spu_create, which then returns a descriptor to an empty gang directory. All SPU contexts created inside of that gang are guaranteed to be loaded at the same time.

In order to limit the number of expensive operations of context switching an entire gang, we apply lazy context switching to the contexts in a gang. This means we don't load any contexts into SPUs until all contexts in the gang are waiting in spu_run to become running. Similarly, when one of the threads stops, e.g. because of a page fault, we don't immediately unload the contexts but wait until the end of the time slice. Also, like normal (non-gang) contexts, the gang will not be removed from the SPUs unless there is actually another thread waiting for them to become available, independent of whether or not any of the threads in the gang execute code at the end of the time slice.
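At the spufs level this maps to flags of spu_create; a sketch (the flag value matches the spufs patches, but treat the exact constants as illustrative):

    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SPU_CREATE_GANG
    #define SPU_CREATE_GANG 0x0002    /* per the spufs patches */
    #endif

    /* Sketch: contexts created inside a gang directory become gang
     * members and are guaranteed to be loaded simultaneously. */
    static void create_gang(int *a, int *b)
    {
            syscall(SYS_spu_create, "/spu/gang", SPU_CREATE_GANG, 0755);
            *a = syscall(SYS_spu_create, "/spu/gang/a", 0, 0755);
            *b = syscall(SYS_spu_create, "/spu/gang/b", 0, 0755);
    }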

4 Using SPEs from the kernel

As mentioned earlier, the SPU base code in the kernel allows any code to get access to SPE resources. However, that interface has the disadvantage of removing the SPE from the scheduling, so valuable processing power remains unused while the kernel is not using the SPE. That should be most of the time, since compute-intensive tasks should not be done in kernel space if possible.

For tasks like IPsec, RAID6 or dmcrypt processing offload, we usually want the SPE to be blocked only while the disk or network is actually being accessed; otherwise it should be available to user space.

Sebastian Siewior is working on code to make it possible to use the spufs scheduler from the kernel, with the concrete goal of providing cryptoapi offload functions for common algorithms.

For this, the in-kernel equivalent of libspe is created, with functions that directly do low-level accesses instead of going through the file system layer. Still, the SPU contexts are visible to user space applications, so they can get statistic information about the kernel space SPUs.

Most likely, there should be one kernel thread per SPU context used by the kernel. It should also be possible to have multiple unrelated functions that are offloaded from the kernel in the same executable, so that when the kernel needs one of them, it calls into the correct location on the SPU. This requires some infrastructure to link the SPU objects correctly into a single binary. Since the kernel does not know about the SPU ELF file format, we also need a new way of initially loading the program into the SPU, e.g. by creating a saved context image as part of the kernel build process.

First experiments suggest that an SPE can do an AES encryption about four times faster than a PPE. It will need more work to see if that number can be improved further, and how much of it is lost as communication overhead when the SPE needs to synchronize with the kernel. Another open question is whether it is more efficient for the kernel to synchronously wait for the SPE or if it can do something else at the same time.

5 SPE overlays

One significant limitation of the SPE is the size that is available for object code in the local memory. To overcome that limitation, recent binutils versions support overlaying ELF segments into concurrent regions. In the most simple case, you can have two functions that both have their own segment, with the two segments occupying the same region. The size of the region is the maximum of either segment size, since they both need to fit in the same space.

When a function in an overlay is called, the calling function first needs to call a stub that checks if the correct overlay is currently loaded. If not, a DMA transfer is initiated that loads the new overlay segment, overwriting the segment loaded into the overlay region before. This makes it possible to even do function calls in different segments of the same region.

There can be any number of segments per region, and the number of regions is only limited by the size of the local storage. However, the task of choosing the optimal configuration of which functions go into what segment is up to the application developer. It gets specified through a linker script that contains a list of OVERLAY statements, each of them containing a list of segments that go into an overlay.

It is only possible to overlay code and read-only data, but not data that is written to, because overlay segments only ever get loaded into the SPU, but never written back to main memory.
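In GNU ld syntax such a description looks like the following sketch (two overlaid segments sharing one region; the object file and section names are made up):

    /* ld script sketch: .ovl1 and .ovl2 share one region of local
     * store; the region is as large as the bigger of the two. */
    OVERLAY :
    {
            .ovl1 { fft.o(.text) }
            .ovl2 { filter.o(.text) }
    }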
in local store, the file it came from, and the offset in that This makes it possible to even do function calls in dif- file at which the ELF executable starts. ferent segments of the same region. To make things more complicated, oprofile also needs to There can be any number of segments per region, and deal with overlays, which can have different code at the the number of regions is only limited by the size of the same location in local storage at different times. In order 2007 Linux Symposium, Volume One • 27 to get these right, oprofile parses some of the ELF head- ers of that file in kernel space when it is first loaded, and locates an overlay table in SPE local storage with this to find out which overlay was present for each sample it took. Another twist is self-modifying code on the SPE, which happens to be used rather frequently, e.g. in order to do system calls. Unfortunately, there is nothing that opro- file can safely do about this.

7 Combined Debugger

One of the problems with earlier version of GDB for SPU was that GDB can only operate on either the PPE or the SPE. This has now been overcome by the work of Ulrich Weigand on a combined PPE/SPE debugger. A single GDB binary now understands both instruction sets and knows how switch between the two. When GDB looks at the state of a thread, it now checks if it is in the process of executing the spu_run system call. If not, it shows the state of the thread on the PPE side using , otherwise it looks at the SPE registers through spufs. This can work because the SIGSTOP signal is handled similarly in both cases. When gdb sends this signal to a task running on the SPE, it returns from the spu_run system call and suspends itself in the kernel. GDB can then do anything to the context and when it sends a SIGCONT, spu_run will be restarted with updated ar- guments.

8 Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM. IBM, IBM (logo), e-business (logo), pSeries, e (logo) server, and xSeries are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries. Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both, and are used under license therefrom. MultiCore Plus is a trademark of Mercury Computer Systems, Inc. Linux is a registered trademark of Linus Torvalds. Other company, product, and service names may be trademarks or service marks of others.

Linux Kernel Debugging on Google-sized clusters

Martin Bligh, Google, [email protected]
Mathieu Desnoyers, École Polytechnique de Montréal, [email protected]
Rebecca Schultz, Google, [email protected]

Abstract

This paper will discuss the difficulties and methods involved in debugging the Linux kernel on huge clusters. Intermittent errors that occur once every few years are hard to debug and become a real problem when running across thousands of machines simultaneously. The more we scale clusters, the more reliability becomes critical. Many of the normal debugging luxuries like a serial console or physical access are unavailable. Instead, we need a new strategy for addressing thorny intermittent race conditions. This paper presents the case for a new set of tools that are critical to solve these problems and also very useful in a broader context. It then presents the design for one such tool created from a hybrid of a Google internal tool and the open source LTTng project. Real world case studies are included.

1 Introduction

Well established techniques exist for debugging most Linux kernel problems; instrumentation is added, the error is reproduced, and this cycle is repeated until the problem can be identified and fixed. Good access to the machine via tools such as hardware debuggers (ITPs), VGA and serial consoles simplifies this process significantly, reducing the number of iterations required. These techniques work well for problems that can be reproduced quickly and produce a clear error such as an oops or kernel panic. However, there are some types of problems that cannot be properly debugged in this fashion, as they are:

• Not easily reproducible on demand;

• Only reproducible in a live production environment;

• Occur infrequently, particularly if they occur infrequently on a single machine, but often enough across a thousand-machine cluster to be significant;

• Only reproducible on unique hardware; or

• Performance problems, that don't produce any error condition.

These problems present specific design challenges; they require a method for extracting debugging information from a running system that does not impact performance, and that allows a developer to drill down on the state of the system leading up to an error, without overloading them with inseparable data. Specifically, problems that only appear in a full-scale production environment require a tool that won't affect the performance of systems running a production workload. Also, bugs which occur infrequently may require instrumentation of a significant number of systems in order to catch the bug in a reasonable time-frame. Additionally, for problems that take a long time to reproduce, continuously collecting and parsing debug data to find relevant information may be impossible, so the system must have a way to prune the collected data.

This paper describes a low-overhead, but powerful, kernel tracing system designed to assist in debugging this class of problems. This system is lightweight enough to run on production systems all the time, and allows for an arbitrary event to trigger trace collection when the bug occurs. It is capable of extracting only the information leading up to the bug, provides a good starting point for analysis, and it provides a framework for easily adding more instrumentation as the bug is tracked. Typically the approach is broken down into the following stages:

1. Identify the problem – for an error condition, this is simple; however, characterization may be more difficult for a performance issue.

2. Create a trigger that will fire when the problem occurs – it could be the error condition itself, or a timer that expires.


• Use the trigger to dump a buffer containing the trace information leading up to the error.

• Log the trigger event to the trace for use as a starting point for analysis.

3. Dump information about the succession of events leading to the problem.

4. Analyze results.

In addition to the design and implementation of our tracing tool, we will also present several case studies illustrating the types of errors described above in which our tracing system proved an invaluable resource.

After the bug is identified and fixed, tracing is also extremely useful to demonstrate the problem to other people. This is particularly important in an open source environment, where a loosely coupled team of developers must work together without full access to each other's machines.

2 Related Work

Before being used widely in such large-scale contexts, kernel tracers have been the subject of a lot of work in the past. Besides each and every kernel programmer writing his or her own ad-hoc tracer, a number of formalized projects have presented tracing systems that cover some aspect of kernel tracing.

Going through the timeline of such systems, we start with the Linux Trace Toolkit [6], which aimed primarily at offering a kernel tracing infrastructure to trace a static, fixed set of important kernel-user events useful to understand interactions between kernel and user-space. It also provided the ability to trace custom events. User-space tracing was done through a device write. Its high-speed kernel-to-user-space buffering system for extraction of the trace data led to the development of RelayFS [3], now known as Relay, and part of the Linux kernel.

The K42 [5] project, at IBM Research, included a kernel and user-space tracer. Both kernel and user-space applications write trace information in a shared memory segment using a lockless scheme. This has been ported to LTT and inspired the buffering mechanism of LTTng [7], which will be described in this paper.

The SystemTAP [4] project has mainly been focused on providing tracing capabilities to enterprise-level users for diagnosing problems on production systems. It uses the kprobes mechanism to provide dynamic connection of probe handlers at particular instrumentation sites by insertion of breakpoints in the running kernel. SystemTAP defines its own probe language that offers the security guarantee that a programmer's probes won't have side-effects on the system.

Ingo Molnar's IRQ latency tracer, Jens Axboe's blktrace, and Rick Lindsley's schedstats are examples of in-kernel single-purpose tracers which have been added to the mainline kernel. They provide useful information about the system's latency, block I/O, and scheduler decisions.

It must be noted that tracers have existed in proprietary real-time operating systems for years—for example, the WindRiver Tornado (now replaced by LTTng in their Linux products). Irix has had an in-kernel tracer for a long time, and Sun provides DTrace [1], an open source tracer for Solaris.

3 Why do we need a tracing tool?

Once the cause of a bug has been identified, fixing it is generally trivial. The difficulty lies in making the connection between an error conveyed to the user—an oops, panic, or application error—and the source. In a complex, multi-threaded system such as the Linux kernel, which is both reentrant and preemptive, understanding the paths taken through kernel code can be difficult, especially where the problem is intermittent (such as a race condition). These issues sometimes require powerful information gathering and visualization tools to comprehend.

Existing solutions, such as statistical profiling tools like oprofile, can go some way to presenting an overall view of a system's state and are helpful for a wide class of problems. However, they don't work well for all situations. For example, identifying a race condition requires capturing the precise sequence of events that occurred; the tiny details of ordering are what is needed to identify the problem, not a broad overview. In these situations, a tracing tool is critical. For performance issues, tools like oprofile are useful for identifying hot functions, but don't provide much insight into intermittent latency problems, such as some fraction of a query taking 100 times as long to complete for no apparent reason.

Often the most valuable information for identifying these problems is in the state of the system preceding the event. Collecting that information requires continuous logging and necessitates preserving information about the system for at least some previous section of time.

In addition, we need a system that can capture failures at the earliest possible moment; if a problem takes a week to reproduce, and 10 iterations are required to collect enough information to fix it, the debugging process quickly becomes intractable. The ability to instrument a wide spectrum of the system ahead of time, and provide meaningful data the first time the problem appears, is extremely useful. Having a system that can be deployed in a production environment is also invaluable. Some problems only appear when you run your application in a full cluster deployment; re-creating them in a sandbox is impossible.

Most bugs seem obvious in retrospect, after the cause is understood; however, when a problem first appears, getting a general feel for the source of the problem is essential. Looking at the case studies below, the reader may be tempted to say "you could have detected that using existing tool X;" however, that is done with the benefit of hindsight. It is important to recognize that in some cases, the bug behavior provides no information about what subsystem is causing the problem, or even what tools would help you narrow it down. Having a single, holistic tracing tool enables us to debug a wide variety of problems quickly. Even if not all necessary sites are instrumented prior to the fact, it quickly identifies the general area the problem lies in, allowing a developer to quickly and simply add instrumentation on top of the existing infrastructure.

If there is no clear failure event in the trace (e.g. an OOM kill condition, or watchdog trigger), but a more general performance issue instead, it is important to be able to visualize the data in some fashion to see how performance changes around the time the problem is observed. By observing the elapsed time for a series of calls (such as a system call), it is often easy to build an expected average time for an event, making it possible to identify outliers. Once a problem is narrowed down to a particular region of the trace data, that part of the trace can be more closely dissected and broken down into its constituent parts, revealing which part of the call is slowing it down.

Since the problem does not necessarily present itself at each execution of the system call, logging data (local variables, static variables) when the system call executes can provide more information about the particularities of an unsuccessful or slow system call compared to the normal behavior. Even this may not be sufficient—if the problem arises from the interaction of other CPUs or interrupt handlers with the system call, one has to look at the trace of the complete system. Only then can we have an idea of where to add further instrumentation to identify the code responsible for a race condition.

4 Case Studies

4.1 Occasional poor latency for I/O write requests

Problem Summary: The master node of a large-scale distributed system was reporting occasional timeout errors on writes to disk, causing a cluster fail-over event. No visible errors or detectable hardware problems seemed to be related.

Debugging Approach: By setting our tracing tool to log trace data continuously to a circular buffer in memory, and stopping tracing when the error condition was detected, we were able to capture the events preceding the problem (from a point in time determined by the buffer size, e.g. 1GB of RAM) up until it was reported as a timeout. Looking at the start and end times for write requests matching the process ID reporting the timeout, it was easy to see which request was causing the problem.

By then looking at the submissions and removals from the IO scheduler (all of which are instrumented), it was obvious that there was a huge spike in IO traffic at the same time as the slow write request. Through examining the process ID which was the source of the majority of the IO, we could easily see the cause, or as it turned out in this case, two separate causes:

1. An old legacy process left over from the 2.2 kernel era that was doing a full sync() call every 30s.

2. The logging process would occasionally decide to rotate its log files, and then call fsync() to make sure it was done, flushing several GB of data.

Once the problem was characterized and understood, it was easy to fix.

1. The sync process was removed, as its duties have been taken over in modern kernels by pdflush, etc.

2. The logging process was set to rotate logs more often and in smaller data chunks; we also ensured it ran in a separate thread, so as not to block other parts of the server.

Application developers assumed that since the individual writes to the log files were small, the fsync would be inexpensive; however, in some cases the resulting fsync was quite large.

This is a good example of a problem that first appeared to be a kernel bug, but was in reality the result of a user-space design issue. The problem occurred infrequently, as it was only triggered by the fsync and sync calls coinciding. Additionally, the visibility that the trace tool provided into system behavior enabled us to make general latency improvements to the system, as well as fixing the specific timeout issue.

4.2 Race condition in OOM killer

Problem summary: In a set of production clusters, the OOM killer was firing with an unexpectedly high frequency and killing production jobs. Existing monitoring tools indicated that these systems had available memory when the OOM condition was reported. Again this problem didn't correlate with any particular application state, and in this case there was no reliable way to reproduce it using a benchmark or load test in a controlled environment.

While the rate of OOM killer events was statistically significant across the cluster, it was too low to enable tracing on a single machine and hope to catch an event in a reasonable time frame, especially since some amount of iteration would likely be required to fully diagnose the problem. As before, we needed a trace system which could tell us what the state of the system was in the time leading up to a particular event. In this case, however, our trace system also needed to be lightweight and safe enough to deploy on a significant portion of a cluster that was actively running production workloads. The effect of tracing overhead needed to be imperceptible as far as the end user was concerned.

Debugging Approach: The first step in diagnosing this problem was creating a trigger to stop tracing when the OOM killer event occurred. Once this was in place we waited until we had several trace logs to examine. It was apparent that we were failing to scan or successfully reclaim a suitable number of pages, so we instrumented the main reclaim loop. For each pass over the LRU list, we recorded the reclaim priority, the number of pages scanned, the number of pages reclaimed, and kept counters for each of 33 different reasons why a page might fail to be reclaimed.

From examining this data for the PID that triggered the OOM killer, we could see that the memory pressure indicator was increasing consistently, forcing us to scan an increasing number of pages to successfully reclaim memory. However, suddenly the indicator would be set back to zero for no apparent reason. By backtracking and examining the events for all processes in the trace, we were able to see that a different process had reclaimed a different class of memory, and then set the global memory pressure counter back to zero.

Once again, with the problem fully understood, the bug was easy to fix through the use of a local memory pressure counter. However, to send the patch back upstream into the mainline kernel, we first had to convince the external maintainers of the code that the problem was real. Though they could not see the proprietary application, or access the machines, by showing them a trace of the condition occurring, it was simple to demonstrate what the problem was.

4.3 Timeout problems following transition from local to distributed storage

Problem summary: While adapting Nutch/Lucene to a clustered environment, IBM transitioned the filesystem from local disk to a distributed filesystem, resulting in application timeouts.

The software stack consisted of the Linux kernel, the open source Java application Nutch/Lucene, and a distributed filesystem. With so many pieces of software, the number and complexity of interactions between components was very high, and it was unclear which layer was causing the slowdown. Possibilities ranged from sharing filesystem data that should have been local, to lock contention within the filesystem, with the added possibility of insufficient bandwidth.

Identifying the problem was further complicated by the nature of error handling in the Nutch/Lucene application. It consists of multiple monitor threads running periodically to check that each node is executing properly. This separated the error condition, a timeout, from the root cause. It can be especially challenging to find the source of such problems, as they are seen only in relatively long tests, in this case of 15 minutes or more. By the time the error condition was detected, its cause was no longer apparent or even observable: it had passed out of scope. Only by examining the complete execution window of the timeout—a two-minute period, with many threads—can one pinpoint the problem.

Debugging Approach: The cause of this slowdown was identified using the LTTng/LTTV tracing toolkit. First, we repeated the test with tracing enabled on each node, including the user-space application. This showed that the node triggering the error condition varied between runs. Next, we examined the trace from this node at the time the error condition occurred in order to learn what happened in the minutes leading up to the error. Inspecting the source code of the reporting process was not particularly enlightening, as it was simply a monitoring process for the whole node. Instead, we had to look at the general activity on this node; which was the most active thread, and what was it doing?

The results of this analysis showed that the most active process was doing a large number of read system calls. Measuring the duration of these system calls, we saw that each was taking around 30ms, appropriate for disk or network access, but far too long for reads from the data cache. It thus became apparent that the application was not properly utilizing its cache; increasing the cache size of the distributed system completely resolved the problem.

This problem was especially well suited to an investigation through tracing. The timeout error condition presented by the program was a result of a general slowdown of the system, and as such would not present any obvious connection with the source of the problem. The only usable source of information was the two-minute window in which the slowdown occurred. A trace of the interactions between each thread and the kernel during this window revealed the specific execution mode responsible for the slowdown.

4.4 Latency problem in printk on slow serialization

Problem Summary: User-space applications randomly suffer from scheduler delays of about 12ms.

While some problems can be blamed on user-space design issues that interact negatively with the kernel, most user-space developers expect certain behaviors from the kernel, and unexpected kernel behaviors can directly and negatively impact user-space applications, even if they aren't actually errors. For instance, [2] describes a problem in which an application sampling video streams at 60Hz was dropping frames. At this rate, the application must process one frame every 16.6ms to remain synchronized with incoming data. When tracing the kernel timer interrupt, it became clear that delays in the scheduler were causing the application to miss samples. Particularly interesting was the jitter in timer interrupt latency as seen in Figure 1.

A normal timer IRQ should show a jitter lower than the actual timer period in order to behave properly. However, tracing showed that under certain conditions, the timing jitter was much higher than the timer interval. This was first observed around tracing start and stop. Some timer ticks, accounting for 12ms, were missing (3 timer ticks on a 250HZ system).

Debugging Approach: Instrumenting each local_irq_{save,restore,disable,enable} macro provided the information needed to find the problem, and extracting the instruction pointer at each call to these macros revealed exactly which address disabled the interrupts for too long around the problematic behavior.

Inspecting the trace involved first finding occurrences of the problematic out-of-range intervals of the interrupt timer and using this timestamp to search backward for the last irq_save or irq_disable event. Surprisingly, this was release_console_sem from printk. Disabling the serial console output made the problem disappear, as evidenced by Figure 2. Disabling interrupts while waiting for the serial port to flush the buffers was responsible for this latency, which not only affects the scheduler, but also general timekeeping in the Linux kernel.
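A sketch of the kind of instrumentation used here: wrap the interrupt-disable macro so that each call site records its instruction pointer. The wrapper and the trace_log_ip() hook are illustrative, not the actual LTTng macros:

```c
/*
 * Illustrative only: not the actual LTTng instrumentation.
 * local_irq_disable() and _THIS_IP_ are real kernel primitives;
 * trace_log_ip() stands in for the tracer's logging call.
 */
#include <linux/irqflags.h>	/* local_irq_disable() */
#include <linux/kernel.h>	/* _THIS_IP_ */

void trace_log_ip(unsigned long ip);	/* hypothetical tracer hook */

#define local_irq_disable_traced()			\
	do {						\
		trace_log_ip(_THIS_IP_);	/* record who disables */ \
		local_irq_disable();			\
	} while (0)
```

Pairing this with a matching wrapper on the enable side gives, for every interrupts-off region, the address that opened it and the time it stayed closed.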

[Figure 1: Problematic traced timer events interval (interval in ms vs. event number)]

[Figure 2: Correct traced timer events interval (interval in ms vs. event number)]

4.5 Hardware problems causing a system delay

Problem Summary: The video/audio acquisition software running under Linux at Autodesk, while in development, was affected by delays induced by the PCI-Express version of a particular card. However, the manufacturer denied that their firmware was the cause of the problem, and insisted that the problem was certainly driver or kernel-related.

Debugging Approach: Using LTTng/LTTV to trace and analyze the kernel behavior around the experienced delay led to the discovery that this specific card's interrupt handler was running for too long. Further instrumentation within the handler permitted us to pinpoint the problem more exactly—a register read was taking significantly longer than expected, causing the deadlines to be missed for video and audio sampling. Only when confronted with this precise information did the hardware vendor acknowledge the issue, which was then fixed within a few days.

5 Design and Implementation

We created a hybrid combination of two tracing tools—Google's Ktrace tool and the open source LTTng tool, taking the most essential features from each, while trying to keep the tool as simple as possible. The following set of requirements for tracing was collected from users and from experience through implementation and use:

• When not running, must have zero effective impact.

• When running, should have low enough impact so as not to disturb the problem, or impede production traffic.

• Spooling data off the system should not completely saturate the network.

• Compact data format—must be able to store large amounts of data using as little storage as possible.

• Applicability to a wide range of kernel points, i.e., able to profile in interrupt context, and preferably in NMI context.

• User tools should be able to read multiple different kernel versions, deal with custom debug points, etc.

• One cohesive mechanism (and time ordered stream), not separate tools for scheduler, block tracing, VM tracing, etc.

The resulting design has four main parts, described in detail in the sections that follow:

1. a logging system to collect and store trace data and make it available in user-space;

2. a triggering system to identify when an error has occurred and potentially stop tracing;

3. an instrumentation system that meets the performance requirements and also is easily extensible; and

4. an analysis tool for viewing and analyzing the resulting logs.

5.1 Collection and Logging

The system must provide buffers to collect trace data whenever a trace point is encountered in the kernel and have a low-overhead mechanism for making that data available in user-space. To do this we use preallocated, per-CPU buffers as underlying data storage and fast data copy to user-space performed via Relay. When a "trigger" event occurs, assuming the machine is still in a functional state, passing data to user-space is done via simple tools reading the Relay interfaces. If the system has panicked, we may need to spool the data out over the network to another machine (or to local disk), as in the netdump or crashdump mechanisms.

The in-kernel buffers can be configured to operate in three modes:

• Non-overwrite – when the buffer is full, drop events and increment an event lost counter.

• Overwrite – use the buffer as a circular log buffer, overwriting the oldest data.

• Hybrid – a combination of the two, where high-rate data is overwritten, but low-rate state information is treated as non-overwrite.
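A compact way to picture the three modes is as a write-side policy on a ring buffer. The types and the reservation helper below are hypothetical, not the actual tool code:

```c
/*
 * Illustrative sketch of the three buffer modes. Assumes a single
 * writer per buffer and len <= size; the hybrid mode is realized by
 * giving each channel (high/medium/low rate) its own mode.
 */
#include <stddef.h>

enum trace_buf_mode {
	TRACE_NON_OVERWRITE,	/* drop events when full, count losses */
	TRACE_OVERWRITE,	/* circular log: overwrite oldest data */
};

struct trace_buf {
	char		*data;
	size_t		size;
	size_t		head;		/* next write offset */
	unsigned long	lost_events;
	enum trace_buf_mode mode;
};

/* Returns the offset to write at, or -1 if the event must be dropped. */
static long reserve_slot(struct trace_buf *b, size_t len)
{
	if (b->head + len > b->size) {
		if (b->mode == TRACE_NON_OVERWRITE) {
			b->lost_events++;	/* full: count the loss */
			return -1;
		}
		b->head = 0;	/* wrap: oldest data is overwritten */
	}
	b->head += len;
	return (long)(b->head - len);
}
```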

Each trace buffer actually consists of a group of per-CPU buffers, each assigned to high, medium, and low rate data. High-rate data accounts for the most common event types described in detail below—system call entry and exits, interrupts, etc. Low-rate data is generally static throughout the trace run and consists in part of the information required to decode the resulting trace: system data type sizes, alignment, etc. Medium-rate channels record meta-information about the system, such as the mapping of interrupt handlers to devices (which might change due to Hotplug), process names, their memory maps, and opened file descriptors. Loaded modules and network interfaces are also treated as medium-rate events. By iterating on kernel data structures we can record a listing of the resources present at trace start time, and update it whenever it changes, thus building a complete picture of the system state.

Separating high-rate events (prone to fill the buffers quickly) from lower rate events allows us to use the maximum space for high-rate data without losing the valuable information provided by the low- and medium-rate channels. Also, it makes it easy to create a hybrid mode system where the last few minutes of interrupt or system call information can be viewed, and we can also get the mapping of process IDs to names even if they were not created within that time window.

Multiple channels can also be used to perform fast user-space tracing, where each process is responsible for writing the trace to disk by itself without going through a system call, and Xen hypervisor tracing. The trace merging is performed by the analysis tool in the same manner in which the multiple CPU buffers are handled, permitting merging the information sources at post-processing time.

It may also be useful to integrate other forms of information into the trace, in order to get one merged stream of data—i.e., we could record readprofile-style data (where the instruction pointer was at a given point in time) either in the timer tick event, or as a periodic dump of the collated hash table data. Also, functions to record meminfo, slabinfo, ps data, and user-space and kernel stacks for the running threads might be useful, though these would have to be enabled on a custom basis. Having all the data in one place makes it significantly easier to write analysis and visualization tools.

5.2 Triggering

Often we want to capture the state of the system in a short period of time preceding a critical error or event. In order to avoid generating massive amounts of data and the performance impact of disk or network writes to the system, we leave the system logging into a circular buffer, then stop tracing when the critical event occurs.

To do this, we need to create a trigger. If this event can easily be recognized by a user-space process, we can simply call the usual tracing interface with an instruction to stop tracing. For some situations, a small in-kernel trigger is more appropriate. Typical trigger events we have used include:

• OOM kill;

• Oops / panic;

• User-space locks up (processes are not getting scheduled);

• User application indicates poor response from system; or

• Manual intervention from user.

5.3 Instrumentation

When an instrumentation point is encountered, the tracer takes a timestamp and the associated event data and logs it to our buffers. Each encountered instrumentation point must have minimum overhead, while providing the most information.

Section 5.3.1 explains how our system minimizes the impact of instrumentation and compares and contrasts static and dynamic instrumentation schemes. We will discuss the details of our event formats in Section 5.3.2 and our approach to timestamping in Section 5.3.3.

To eliminate cache-line bouncing and potential race conditions, each CPU logs data to its own buffer, and system-wide event ordering is done via timestamps. Because we would like to be able to instrument reentrant contexts, we must provide a locking mechanism to avoid potential race conditions. We have investigated two options, described in Section 5.3.4.
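In kernel code, per-CPU buffers of this kind are typically declared with the per-CPU primitives. A hedged sketch, reusing the hypothetical trace_buf and reserve_slot() from the earlier sketch:

```c
/*
 * Sketch of per-CPU event logging as described above. DEFINE_PER_CPU,
 * per_cpu(), and get_cpu()/put_cpu() are real kernel primitives; the
 * buffer type and reservation helper are the hypothetical ones from
 * the previous sketch.
 */
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/string.h>

static DEFINE_PER_CPU(struct trace_buf, trace_cpu_buf);

static void log_event(const void *event, size_t len)
{
	struct trace_buf *buf;
	long off;

	/* get_cpu() disables preemption, pinning us to this CPU */
	buf = &per_cpu(trace_cpu_buf, get_cpu());
	off = reserve_slot(buf, len);
	if (off >= 0)
		memcpy(buf->data + off, event, len);
	put_cpu();
}
```

Because each CPU writes only to its own buffer, no cache line is shared between writers; ordering across CPUs is recovered later from the per-event timestamps.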

5.3.1 Static vs. Dynamic Instrumentation Points

There are two ways we can insert trace points—at static markers that are pre-defined in the source code, or by dynamically inserting them while the system is running. For standard events that we can anticipate the need for in advance, the static mechanism has several advantages. For events that are not anticipated in advance, we can either insert new static points in the source code, compile a new kernel and reboot, or insert dynamic probes via a mechanism such as kprobes. Static vs. dynamic markers are compared below:

• Trace points from static markers are significantly faster in use. Kprobes uses a slow int3 mechanism; development efforts have been made to create faster dynamic mechanisms, but they are not finished, very complex, cannot instrument fully preemptible kernels, and they are still significantly slower than static tracing.

• Static trace points can be inserted anywhere in the code base; dynamic probes are limited in scope.

• Dynamic trace points cannot easily access local variables or registers at arbitrary points within a function.

• Static trace points are maintained within the kernel source tree and can follow its evolution; dynamic probes require constant maintenance outside of the tree, and new releases if the traced code changes. This is more of a problem for kernel developers, who mostly work with mainline kernels that are constantly changing.

• Static markers have a potential performance impact when not being used—with care, they can be designed so that this is practically non-existent, and this can be confirmed with performance benchmarks.

We use a marker infrastructure which is a hook-callback mechanism. Hooks are our markers placed in the kernel at the instrumentation site. When tracing is enabled, these are connected to the callback probes—the code executed to perform the tracing. The system is designed to have an impact as low as possible on the system performance, so markers can be compiled into a production kernel without appreciable performance impact. The probe callback connection to its markers is done dynamically. A predicted branch is used to skip the hook stack setup and function call when the marker is "disabled" (no probe is connected). Further optimizations can be implemented for each architecture to make this branch faster.

The other key facet of our instrumentation system is the ability to allow the user to extend it. It would be impossible to determine in advance the complete set of information that would be useful for a particular problem, and recording everything occurring on a system would clearly be impractical, if not infeasible. Instead, we have designed a system for adding instrumentation iteratively from a coarse-grained level, including major events like system calls, scheduling, interrupts, faults, etc., to a finer-grained level, including kernel synchronization primitives and important user-space functions. Our tool is capable of dealing with an extensible set of user-definable events, including merged information coming from both kernel and user-space execution contexts, synchronized in time.

Events can also be filtered; the user can request which event types should be logged, and which should not. By filtering only by event type, we get an effective, if not particularly fine-grained, filter, and avoid the concerns over inserting buggy new code into the kernel, or the whole new languages that tools like DTrace and SystemTAP invent in order to fix this problem. In essence, we have chosen to do coarse filtering in the kernel, and push the rest of the task to user-space. This design is backed up by our efficient probes and logging, compact logging format, and efficient data relay mechanism to user-space (Relay).
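The disabled-marker fast path can be pictured as a predicted branch around the probe call. A hypothetical sketch (not the actual marker macro):

```c
/*
 * Hypothetical sketch of a static marker with a predicted branch.
 * When no probe is connected, unlikely() steers the compiler to lay
 * out the fall-through path so the only cost is a flag test; stack
 * setup and the indirect call happen only when a probe is attached.
 * unlikely() is the real kernel branch-prediction hint macro.
 */
#include <linux/compiler.h>

struct marker {
	int enabled;				/* probe connected? */
	void (*probe)(const char *fmt, ...);	/* callback to run */
};

#define MARK(m, fmt, ...)				\
	do {						\
		if (unlikely((m).enabled))		\
			(m).probe(fmt, ##__VA_ARGS__);	\
	} while (0)
```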

5.3.2 Event Formats

It would be beneficial to log as much data about the system state as possible, but instrumenting every interrupt or system call clearly will rapidly generate large volumes of data. To maximize the usefulness of our tool, we must store our event data in the most efficient way possible. In Google's Ktrace tool, for the sake of compactness and alignment, we chose to make our most common set of events take up 8 bytes. The best compromise between data compactness and information completeness within these bytes was to use the first 4 bytes for type and timestamp information, and the second 4 for an event-specific data payload. The format of our events is shown in Figure 3.

[Figure 3: Common event format (type: 5 bits; tsc_shifted: 27 bits; data: 32 bits; 8 bytes total)]

[Figure 4: Expanded event format (type: 5 bits; tsc_shifted: 27 bits; major: 8 bits; minor: 8 bits; length: 16 bits; followed by a variable-length data payload)]
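The 8-byte common format maps naturally onto a packed structure. A sketch, assuming the bit widths shown in Figures 3 and 4 (real code would also have to pin down endianness and bitfield layout, which are compiler-dependent):

```c
/* Illustrative structs mirroring Figures 3 and 4; layout is a sketch. */
#include <stdint.h>

struct common_event {
	uint32_t type        : 5;	/* event type */
	uint32_t tsc_shifted : 27;	/* truncated TSC timestamp */
	uint32_t data;			/* event-specific payload */
};					/* 8 bytes total */

struct expanded_event {
	uint32_t type        : 5;	/* expanded-event marker type */
	uint32_t tsc_shifted : 27;
	uint32_t major       : 8;	/* expanded event class (256) */
	uint32_t minor       : 8;	/* expanded event subtype (256) */
	uint32_t length      : 16;	/* payload bytes that follow */
	/* 'length' bytes of payload follow the header */
};
```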

Commonly logged events include:

• System call entry / exit (including system call number, lower bytes of first argument)

• Interrupt entry / exit

• Schedule a new task

• Fork / exec of a task, new task seen

• Network traffic

• Disk traffic

• VM reclaim events

In addition to the basic compact format, we required a mechanism for expanding the event space and logging data payloads larger than 4 bytes. We created an expanded event format, shown in Figure 4, that can be used to store larger events needing more data payload space (up to 64K). The normal 32-bit data field is broken into major and minor expanded event types (256 of each) and a 16-bit length field specifying the length of the data payload that follows.

LTTng's approach is similar to Ktrace; we use 4-byte event headers, followed by a variable size payload. The compact format is also available; it records the timestamp, the event ID, and the payload in 4 bytes. It dynamically calculates the minimum number of bits required to represent the TSC and still detect overflows. It uses the timer frequency and CPU frequency to determine this value.

5.3.3 Timestamps

Our instrumentation system must provide an accurate, low-overhead timestamp to associate with each logged event. The ideal timestamp would be a high-resolution, fixed-frequency counter that has very low cost to retrieve, is always monotonic, and is synchronized across all CPUs and readable from both kernel and user-space. However, due to the constraints of current hardware, we are forced to an uncomfortable compromise.

If we look at a common x86-style architecture (32- or 64-bit), choices of time source include the PIT, TSC, and HPET. The only time source with acceptable overhead is the TSC; however, it is not constant frequency, nor well synchronized across platforms. It is also too high-frequency to be compactly logged. The chosen compromise has been to log the TSC at every event, truncated (both on the left and right sides)—effectively, in Ktrace:

tsc_timestamp = (tsc >> 10) & (2^27 - 1)

On a 2GHz processor, this gives an effective resolution of 0.5us, and takes 27 bits of space to log. LTTng calculates the shifting required dynamically.

However, this counter will roll over every 128 seconds. To ensure we can both unroll this information properly and match it up to the wall time (e.g. to match user-space events) later, we periodically log a timestamp event, whose format is shown in Figure 5.

[Figure 5: Timestamp format (seconds: 32 bits; nanoseconds: 32 bits; tsc_mult: 32 bits; 12 bytes total)]

A new timestamp event must be logged:

1. More frequently than the logged timestamp derived from the TSC rolls over.

2. Whenever the TSC frequency changes.

3. Whenever TSCs are resynchronized between CPUs.

The effective time of an event is derived by comparing the event TSC to the TSC recorded in the last timestamp and multiplying by a constant representing the current processor frequency:

delta_walltime = (event_tsc - timestamp_tsc) * k_tsc_freq
event_walltime = delta_walltime + timestamp_walltime
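Written out as code, the truncation and the wall-time reconstruction look roughly as follows. This is a sketch under the Ktrace constants quoted above (shift of 10, 27-bit field); the multiplier argument is an assumed stand-in for the tsc_mult value carried by the timestamp event:

```c
/* Sketch of Ktrace-style timestamp truncation and unrolling. */
#include <stdint.h>

#define TSC_SHIFT	10
#define TSC_BITS	27
#define TSC_MASK	((1u << TSC_BITS) - 1)

static inline uint32_t tsc_truncate(uint64_t tsc)
{
	return (uint32_t)((tsc >> TSC_SHIFT) & TSC_MASK);
}

/*
 * Reconstruct wall time from an event's truncated TSC and the most
 * recent timestamp event, which carries full wall time, its own
 * truncated TSC, and a cycles-to-nanoseconds multiplier.
 */
static uint64_t event_walltime_ns(uint32_t event_tsc,
				  uint32_t stamp_tsc,
				  uint64_t stamp_walltime_ns,
				  uint64_t ns_per_unit)
{
	/* masked unsigned subtraction tolerates one 27-bit wrap */
	uint32_t delta = (event_tsc - stamp_tsc) & TSC_MASK;

	return stamp_walltime_ns + (uint64_t)delta * ns_per_unit;
}
```

This is exactly why a timestamp event must be logged more often than the truncated counter wraps: with at most one wrap between timestamp events, the masked subtraction above remains unambiguous.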

5.3.4 Locking

One key design choice for the instrumentation system for this tool was how to handle potential race conditions from reentrant contexts. The original Google tool, Ktrace, protected against re-entrant execution contexts by disabling interrupts at the instrumentation site, while LTTng uses a lock-less algorithm based on atomic operations local to one CPU (asm/local.h) to take timestamps and reserve space in the buffer. The atomic method is more complex, but has significant advantages—it is faster, and it permits tracing of code paths reentering even when IRQs are disabled (lockdep lock dependency checker instrumentation and NMI instrumentation are two examples where it has shown to be useful). The performance improvement of using atomic operations (local compare-and-exchange: 9.0ns) instead of disabling interrupts (save/restore: 210.6ns) on a 3GHz Pentium 4 removes 201.6ns from each probe's execution time. Since the average probe duration of LTTng is about 270ns in total, this is a significant performance improvement.

The main drawback of the lock-less scheme is the added code complexity in the buffer-space reservation function. LTTng's reserve function is based on work previously done on the K42 research kernel at IBM Research, where the timestamp counter read is done within a compare-and-exchange loop to ensure that the timestamps will increment monotonically in the buffers. LTTng made some improvements in how it deals with buffer boundaries; instead of doing a separate timestamp read, which can cause timestamps at buffer boundaries to go backward compared to the last/first events, it computes the offsets of the buffer switch within the compare-and-exchange loop and effectively does it when the compare-and-exchange succeeds. The rest of the callbacks called at buffer switch are then called out-of-order. Our merged design considered the benefit of such a scheme to outweigh the complexity.
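The shape of such a lock-less reservation can be sketched as a compare-and-exchange loop with the timestamp read inside it, so reserved offsets and timestamps stay in the same order. This illustration uses a GCC builtin and a plain x86 rdtsc read for self-containment; the real code uses asm/local.h operations that are atomic only with respect to the local CPU, which is cheaper:

```c
/* Illustrative lock-less buffer reservation; not the LTTng code. */
#include <stdint.h>

struct chan {
	unsigned long head;	/* next free offset, updated by CAS */
	unsigned long size;
};

static inline uint64_t read_tsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

static long reserve(struct chan *c, unsigned long len, uint64_t *tsc_out)
{
	unsigned long old, new;

	do {
		old = c->head;
		*tsc_out = read_tsc();	/* read inside the loop */
		new = old + len;
		if (new > c->size)
			return -1;	/* caller handles buffer switch */
	} while (!__sync_bool_compare_and_swap(&c->head, old, new));

	return (long)old;	/* offset where the event may be written */
}
```

If the CAS fails, a reentrant context (interrupt, NMI) won the race; the loop simply retries with a fresh timestamp, so no reservation ever carries a timestamp older than one already committed.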

5.4 Analysis

There are two main usage modes for the tracing tools:

• Given an event (e.g. user-space lockup, OOM kill, user-space noticed event, etc.), we want to examine data leading up to it.

• Record data during an entire test run, and sift through it off-line.

Whenever an error condition is not fatal or recurring, taking only one sample of this condition may not give a full insight into what is really happening on the system. One has to verify whether the error is a single case or periodic, and see if the system always triggers this error or if it sometimes shows a correct behavior. In these situations, recording the full trace of the systems is useful because it gives a better overview of what is going on globally on the system.

However, this approach may involve dealing with huge amounts of data, on the order of tens of gigabytes per node. The Linux Trace Toolkit Viewer (LTTV) is designed to do precisely this. It gives both a global graphical overview of the trace, so patterns can be easily identified, and permits the user to zoom into the trace to get the highest level of detail.

Multiple different user-space visualization tools have been written (in different languages) to display or process the tracing data, and it's helpful for them to share this pre-processing phase. These tools fall into two categories:

1. Text printer – one event per line, formatted in a way to make it easy to parse with simple scripts, and fairly readable by a kernel developer with some experience and context.

2. Graphical – easy visualization of large amounts of data. More usable by non-kernel-developers.

6 Future Work

The primary focus of this work has been on creating a single-node trace tool that can be used in a clustered environment, but it is still based on generating a view of the state of a single node in response to a particular trigger on that node. This system lacks the ability to track dependent events between nodes in a cluster or to follow dependencies between nodes. The current configuration functions well when the problem can be tracked to a single node, but doesn't allow the user to investigate a case where events on another system caused or contributed to an error. To build a cluster-wide view, additional design features would be needed in the triggering, collection, and analysis aspects of the trace tool:

• Ability to start and stop tracing across an entire cluster when a trigger event occurs on one node.

• Low-overhead method for aggregating data over the network for analysis.

• Sufficient information to analyze communication between nodes.

• A unified time base from which to do such analysis.

• An analysis tool capable of illustrating the relationships between systems and displaying multiple parallel traces.

Relying on NTP to provide said synchronization appears to be too imprecise. Some work has been started in this area, primarily aiming at using TCP exchanges between nodes to synchronize the traces. However, it is restrained to a limited subset of network communication: it does not deal with UDP and ICMP packets.

References

[1] Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal. Dynamic instrumentation of production systems. In USENIX '04, 2004.

[2] Mathieu Desnoyers and Michel Dagenais. Low disturbance embedded system tracing with Linux Trace Toolkit Next Generation. In ELC (Embedded Linux Conference) 2006, 2006.

[3] Mathieu Desnoyers and Michel Dagenais. The LTTng tracer: A low impact performance and behavior monitor for GNU/Linux. In OLS (Ottawa Linux Symposium) 2006, pages 209–224, 2006.

[4] Vara Prasad, William Cohen, Frank Ch. Eigler, Martin Hunt, Jim Keniston, and Brad Chen. Locating system problems using dynamic instrumentation. In OLS (Ottawa Linux Symposium) 2005, 2005.

[5] Robert W. Wisniewski and Bryan Rosenburg. Efficient, unified, and scalable performance monitoring for multiprocessor operating systems. In Supercomputing, 2003 ACM/IEEE Conference, 2003.

[6] Karim Yaghmour and Michel R. Dagenais. The Linux Trace Toolkit. Linux Journal, May 2000.

[7] Tom Zanussi, Karim Yaghmour, Robert Wisniewski, Richard Moore, and Michel Dagenais. relayfs: An efficient unified approach for transmitting data from kernel to user space. In OLS (Ottawa Linux Symposium) 2003, pages 519–531, 2003.

Ltrace Internals

Rodrigo Rubira Branco, IBM, [email protected]

Abstract It calls read_config_file() first with SYSCONFDIR’s ltrace.conf file. If ltrace is a program that permits you to track runtime $HOME is set, it then calls the function with library calls in dynamically linked programs without re- $HOME/.ltrace.conf. This function opens compiling them, and is a really important tool in the de- the specified file and reads in from it line-by-line, bugging arsenal. This article will focus in how it has sending each line to the process_line() function been implemented and how it works, trying to cover to verify the syntax of the config file based on the line the actual lacks in academic and in-deep documenta- supplied to it. It then returns a function structure based tion of how this kind of tool works (setting the break- on the function information obtained from said line. points, analysing the executable/library symbols, inter- If opt_e is set, then a list is output by the debug() preting elf, others). function.

1 Introduction If passed a command invocation, ltrace will execute it via the execute_program() function which takes the return value of the open_program() function as ltrace is divided into many source files; some of an argument. these contain architecture-dependent code, while some others are generic implementations. Ltrace will attached to any supplied pids using the open_pid() function. The idea is to go through the functions, explaining what each is doing and how it works, beginning from the en- At the end of this function the process_event() try point function, main. function is called in an infinite loop, receiving the return value of the wait_for_something() function as 2 int main(int argc, char **argv) – ltrace.c its argument. 3 struct process *open_program(char The main function sets up ltrace to perform the rest of *filename, pid_t pid) – proc.c its activities.

It first sets up the terminal using the guess_cols() This function implements a number of important tasks function that tries to ascertain the number of columns needed by ltrace. open_program allocates a process in the terminal so as to display the information output structure’s memory and sets the filename and pid (if by ltrace in an ordely manner. The column count is needed), adds the process to the linked-list of processes initially queried from the $COLUMNS environment vari- traced by ltrace, and most importantly initalizes break- able (if that is not set, the TIOCGWINSZ is used points by calling breakpoints_init(). instead). Then the program options are handled using 4 void breakpoints_init(struct process *proc) the process_options() function to processes the – breakpoints.c ltrace command line arguments, using the getopt() and getopt_long() functions to parse them. The breakpoints_init() function is responsible It then calls the read_config_file() function on for setting breakpoints on every symbol in the pro- two possible configuration files. gram being traced. It calls the read_elf() function

4 void breakpoints_init(struct process *proc) – breakpoints.c

The breakpoints_init() function is responsible for setting breakpoints on every symbol in the program being traced. It calls the read_elf() function, which returns an array of library_symbol structures, which it processes based on opt_e. Then it iterates through the array of library_symbol structures and calls the insert_breakpoint() function on each symbol.

5 struct library_symbol *read_elf(struct process *proc) – elf.c

This function retrieves a process's list of symbols to be traced. It calls do_init_elf() on the executable name of the traced process and for each library supplied by the -l option. It loops across the PLT information found therein.

For each symbol in the PLT information, a GElf_Rel structure is returned by a call to gelf_getrel() if the d_type is ELF_T_REL, and by gelf_getrela() if not. If the return value of this call is NULL, or if the value returned by ELF64_R_SYM(rela.r_info) is greater than the number of dynamic symbols, or the rela.r_info symbol is not found, then the function calls the error() function to exit the program with an error.

If the symbol value is NULL and the PLTs_initialized_by_here flag is set, then the need_to_reinitialize_breakpoints member of the proc structure is set.

The name of the symbol is calculated and this is passed to a call to in_load_libraries(). If this returns a positive value, then the symbol address is calculated via the arch_plt_sym_val() function and the add_library_symbol() function is called to add the symbol to the library_symbols list of dynamic symbols. At this point, if the need_to_reinitialize_breakpoints member of the proc structure is set, then a pt_e_t structure main_cheat is allocated and its values are set. After this a loop is made over the values passed to opt_x (via the -x option), and if the PLTs_initialized_by_here variable matches the name of one of the values, then main_cheat is freed and the loop is broken. If no match is found, then opt_x is set to the final value of main_cheat.

A loop is then made over the symtab, or symbol table, variable. For each symbol gelf_getsym() is called, which if it fails provokes ltrace to exit with an error message via the error() function. A nested loop is then made over the values passed to opt_x via the -x option. For each value a comparison is made against the name of each symbol. If there is a match, then the symbol is added to the library_symbols list via add_library_symbol() and the nested loop breaks.

At the end of this loop, a final loop is made over the values passed to opt_x via the -x option. For each value with a valid name member, a comparison is made to the E_ENTRY_NAME value, which represents the program's entry point. If this comparison should prove true, then the symbol is entered into the library_symbols list via add_library_symbol().

At the end of the function, any libraries passed to ltrace via the -l option are closed via the do_close_elf() function and the library_symbols list is returned. (do_close_elf() is called to close open ELF images. A check is made to see if the ltelf structure has an associated hash value allocated, and if so this hash value is deallocated via a call to free(). After this, elf_end() is called and the file descriptor associated with the image is closed.)

6 static void do_init_elf(struct ltelf *lte, const char *filename) – elf.c

The passed ltelf structure is set to zero and open() is called to open the passed filename as a file. If this fails, then ltrace exits with an error message. The elf_begin() function is then called, following which various checks are made via elf_kind() and gelf_getehdr(). The type of the ELF header is checked so as to only process executable files or dynamic library files. If the file is not of one of these types, then ltrace exits with an error. ltrace also exits with an error if the ELF binary is from an unsupported architecture.

The ELF section headers are iterated over and the elf_getscn() function is called, then the variable name is set via the elf_strptr() function (if any of the above functions fail, ltrace exits with an error message). A comparison is then made against the section header type and the data for it is obtained via a call to elf_getdata().

For SHT_DYNSYM (dynamic symbols), the lte-> • If the function was unable to find the .relplt dynsym is filled via a call to elf_getdata(), where section then ltrace exits with an error message. the dynsym_count is calcuated by dividing the sec- tion header size by the size of each entry. If the at- 7 static void add_library_symbol(GElf_Addr tempt to get the dynamic symbol data fails, ltrace ex- addr, const char *name, struct its with an error message. The elf_getscn() func- library_symbol **library_symbolspp, int tion is then called, passing the section header sh_link use_elf_plt2addr, int is_weak) – elf.c variable. If this fails, then ltrace exits with an error mes- sage. Using the value returned by elf_getscn(), the This function allocates a library_symbol structure gelf_getshdr() function is called and if this fails, and inserts it into the linked list of symbols represented ltrace exits with an error message. by the library_symbolspp variable.

For SHT_DYNAMIC an Elf_Data structure data is set The structure is allocated with a call to malloc(). The via a call to elf_getdata() and if this fails, ltrace elements of this structure are then set based on the argu- exits with an error message. Every entry in the sec- ments passed to the function. And the structure is linked tion header is iterated over and the following occurs: into the linked list using its next element. gelf_getdyn() The function is called to retrieve the 8 static GElf_Addr elf_plt2addr(struct ltelf .dynamic data and if this fails, ltrace exits with an er- *lte, void *addr) – elf.c ror message; relplt_addr and relplt_size are calculated from the returned dynamic data. In this function the opd member of the lte structure is checked and if it is NULL, the function returns the SHT_HASH Elf_Data data For values an structure passed address argument as the return value. If opd is elf_getdata() is set via a call to and if this fails, non-NULL, then following occurs: ltrace exits with an error message. If the entry size is 4 then lte->hash is simply set to the dynamic data buf 1. An offset value is calculated by subtracting the data->d_buf. Otherwise it is 8. The correct amount opd_addr element of the ltr structure from the of memory is allocated via a call to malloc and the passed address. hash data into copied into lte->hash. 2. If this offset is greater than the opd_size element of the lte structure then ltrace exits with an error. For SHT_PROGBITS, checks are made to see if the name value is .plt or .pd, and if so, the correct el- 3. The return value is calculated as the base address ements are set in the lte->plt_addr/lte->opd (passed as lte->opd->d_buf) plus the calcu- and lte->plt_size and lte->pod_size struc- lated offset value. tures. In the case of OPD, the lpe->opd structure is set via a call to elf_rawdata(). If neither the dy- 4. This calculated final return value is returned as a namic symbols or the dynamic strings have been found, GElf_Addr variable. then ltrace exits with an error message. If relplt_ addr and lte->plt_addr are non-null, the section 9 static int in_load_libraries(const char headers are iterated across and the following occurs: *name, struct ltelf *lte) – elf.c

• The elf_getscn() function is called. This functions checks if there are any libraries passed to ltrace as arguments to the -l option. If not, then the • If the sh_addr is equal to the relpt_addr and function immediately returns 1 (one) because there is no the sh_size matches the relplt_size (i.e., filtering (specified libraries) in place; otherwise, a hash this section is the .relplt section) then lte-> is calculated for the library name arguments by way of relplt is obtained via a call to elf_getdata() the elf_hash() function. and lte->relplt_count is calculated as the For each library argument, the following occurs: size of section divided by the size of each entry. If the call to elf_getdata() fails then ltrace exits 1. If the hash for this iteration is NULL the loop con- with an error message. tinues to the next iteration. 44 • Ltrace Internals

2. The nbuckets value is obtained and the buckets breakpoint value is found in sysdeps/linux-gnu/*/ and chain values are calculated based on this value arch.h. from the hash. 12 void execute_program(struct process *sp, 3. For each bucket the following occurs: char **argv) – execute-program.c The gelf_getsym() function is called to execute_program() get the symbol; if this fails, then ltrace exits with The function executes a pro- an error. gram whose name is supplied as an argument to ltrace. It fork()s a child, changes the UID of the running child A comparison is made between the passed process if necessary, calls the trace_me() (simply name and the name of the current dynamic symbol. calls ptrace() using the PTRACE_TRACEME argument, Should there be a match, the function will return a which allows the process to be traced) function and then positive value (one). executes the program using execvp(). 4. If the code reaches here, 0 (zero) is returned. 13 struct event *wait_for_something(void) – wait_for_something.c

12 void execute_program(struct process *sp, char **argv) – execute-program.c

The execute_program() function executes a program whose name is supplied as an argument to ltrace. It fork()s a child, changes the UID of the running child process if necessary, calls the trace_me() function (which simply calls ptrace() with the PTRACE_TRACEME argument, allowing the process to be traced), and then executes the program using execvp().

13 struct event *wait_for_something(void) – wait_for_something.c

The wait_for_something() function literally waits for an event to occur and then handles it. The events that it treats are: syscalls, sysrets, exits, exit signals, and breakpoints. wait_for_something() calls the wait() function to wait for an event.

When it awakens, it calls get_arch_dep() on the proc member of the event structure. If breakpoints were not enabled earlier (due to the process not yet being run), they are enabled by calling enable_all_breakpoints(), trace_set_options(), and then continue_process() (this function simply calls continue_after_signal()). In this case the event is then returned as LT_EV_NONE, which does not receive processing.

To determine the type of event that has occurred, the following algorithm is used: the syscall_p() function is called to detect whether a syscall has been called via int 0x80 (LT_EV_SYSCALL) or whether there has been a return-from-syscall event (LT_EV_SYSRET). If neither of these is true, it checks to see if the process has exited or has sent an exit signal. If neither of these is the case and the process has not stopped, an LT_EV_UNKNOWN event is returned. If the process is stopped and the stop signal was not SIGTRAP, an LT_EV_SIGNAL event is returned. If none of the above cases is found to be true, it is assumed that this was a breakpoint, and an LT_EV_BREAKPOINT event is returned.
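That classification order reads naturally as a chain of wait-status checks. Below is a hedged sketch using the standard <sys/wait.h> macros; the lt_event enum and the syscall_p() prototype are written out here only so the fragment is self-contained, and may not match ltrace's real declarations:

#include <signal.h>
#include <sys/wait.h>

enum lt_event {
    LT_EV_NONE, LT_EV_SIGNAL, LT_EV_EXIT, LT_EV_EXIT_SIGNAL,
    LT_EV_SYSCALL, LT_EV_SYSRET, LT_EV_BREAKPOINT, LT_EV_UNKNOWN
};

struct process;                                       /* opaque here */
int syscall_p(struct process *proc, int status, int *sysnum);

enum lt_event classify_event(struct process *proc, int status)
{
    int sysnum;

    switch (syscall_p(proc, status, &sysnum)) {
    case 1: return LT_EV_SYSCALL;                     /* entering a syscall */
    case 2: return LT_EV_SYSRET;                      /* returning from one */
    }
    if (WIFEXITED(status))   return LT_EV_EXIT;
    if (WIFSIGNALED(status)) return LT_EV_EXIT_SIGNAL;
    if (!WIFSTOPPED(status)) return LT_EV_UNKNOWN;
    if (WSTOPSIG(status) != SIGTRAP) return LT_EV_SIGNAL;
    return LT_EV_BREAKPOINT;                          /* the default assumption */
}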

14 void process_event(struct event *event) – process_event.c

The process_event() function receives an event structure, which is generally returned by the wait_for_something() function.

It uses a switch-case construct based on the event->thing element and processes the event using one of the following functions: process_signal(), process_exit(), process_exit_signal(), process_syscall(), process_sysret(), or process_breakpoint(). In the case of a syscall or sysret event, it also calls the sysname() function.
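A sketch of that dispatch is below. It redefines the same hypothetical lt_event enum as the previous fragment, and the handler prototypes are assumptions, not ltrace's actual declarations:

enum lt_event { LT_EV_NONE, LT_EV_SIGNAL, LT_EV_EXIT, LT_EV_EXIT_SIGNAL,
                LT_EV_SYSCALL, LT_EV_SYSRET, LT_EV_BREAKPOINT, LT_EV_UNKNOWN };

struct event { enum lt_event thing; /* plus proc and per-event details */ };

void process_signal(struct event *), process_exit(struct event *),
     process_exit_signal(struct event *), process_syscall(struct event *),
     process_sysret(struct event *), process_breakpoint(struct event *);

void process_event(struct event *event)
{
    switch (event->thing) {
    case LT_EV_SIGNAL:      process_signal(event);      break;
    case LT_EV_EXIT:        process_exit(event);        break;
    case LT_EV_EXIT_SIGNAL: process_exit_signal(event); break;
    case LT_EV_SYSCALL:     process_syscall(event);     break; /* name via sysname() */
    case LT_EV_SYSRET:      process_sysret(event);      break; /* name via sysname() */
    case LT_EV_BREAKPOINT:  process_breakpoint(event);  break;
    default:                break; /* LT_EV_NONE and LT_EV_UNKNOWN: nothing to do */
    }
}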

15 int syscall_p(struct process *proc, int status, int *sysnum) – sysdeps/linux-gnu/*/trace.c

This function detects whether a call to, or a return from, a system call occurred. It does this first by checking the value of EAX (on x86 platforms), which it obtains with a ptrace PTRACE_PEEKUSER operation.

It then checks the program's call stack, as maintained by ltrace, and, looking at the last stack frame, it sees if the is_syscall element of the proc structure is set, which indicates a system call already in progress. If this is set, then 2 is returned, which indicates a sysret event. If not, then 1 is returned, provided that there was a value in EAX.

16 static void process_signal(struct event *event) – process_event.c

This function tests the signal. If the signal is SIGSTOP, it calls disable_all_breakpoints(), then untrace_pid() (this function merely calls the ptrace interface using a PTRACE_DETACH operation), removes the process from the list of traced processes using the remove_proc() function, and then calls continue_after_signal() (this function simply calls ptrace with a PTRACE_SYSCALL operation) to allow the process to continue.

In the case that the signal was not SIGSTOP, the function calls the output_line() function to display the fact of the signal and then calls continue_after_signal() to allow the process to continue.

17 static void process_exit(struct event *event) – process_event.c

This function is called when a traced process exits. It simply calls output_line() to display that fact in the terminal and then calls remove_proc() to remove the process from the list of traced processes.

18 static void process_exit_signal(struct event *event) – process_event.c

This function is called when a traced program is killed. It simply calls output_line() to display that fact in the terminal and then calls remove_proc() to remove the process from the list of traced processes.

19 static void process_syscall(struct event *event) – process_event.c

This function is called when a traced program invokes a system call. If the -S option has been used to run ltrace, the output_left() function is called to display the syscall invocation, using the sysname() function to find the name of the system call.

It checks whether the system call will result in a fork or execute operation, using the fork_p() and exec_p() functions, which test the system call against those known to trigger this behavior. If it is such a system call, the disable_all_breakpoints() function is called.

After this, callstack_push_syscall() is called, followed by continue_process().

20 static void process_sysret(struct event *event) – process_event.c

This function is called when the traced program returns from a system call. If ltrace was invoked with the -c or -T options, the calc_time_spent() function is called to calculate the amount of time that was spent inside the system call.

After this, the fork_p() function is called to test whether the system call was one that would have caused a process fork. If this is true, and the -f option was set when running ltrace, then the gimme_arg() function is called to get the pid of the child, and the open_pid() function is called to begin tracing the child. In any case, enable_all_breakpoints() is called.

Following this, the callstack_pop() function is called. Then the exec_p() function tests whether the system call was one that would have executed another program within this process; if true, the gimme_arg() function is called. Otherwise the event->proc structure is re-initialized with the values of the new program, and the breakpoints_init() function is called to initialize breakpoints. If gimme_arg() does not return zero, the enable_all_breakpoints() function is called.

At the end of the function, the continue_process() function is called.

21 static void process_breakpoint(struct event *event) – process_event.c

This function is called when the traced program hits a breakpoint, or when entering or returning from a library function.

It checks the value of the event->proc->breakpoint_being_enabled variable to determine if the breakpoint is in the middle of being enabled, in which case it calls the continue_enabling_breakpoint() function and this function returns. Otherwise this function continues.

It then begins a loop through the traced program's call stack, checking if the address where the breakpoint occurred matches the return address of a called function, which indicates that the process is returning from a library call.

At this point a hack allows for PPC-specific behavior, and it re-enables the breakpoint. All of the library function addresses are retrieved from the call stack and translated via the plt2addr() function. Provided that the architecture is EM_PPC, the address2bpstruct()² function is called to translate the address into a breakpoint structure. The value at the address is read via the ptrace PTRACE_PEEKTEXT operation and this value is compared to a breakpoint value. If they do not match, a breakpoint is inserted at the address.

If the architecture is not EM_PPC, the address is compared against the address of the breakpoint previously applied to the library function. If they do not match, a breakpoint is inserted at the address.

Upon leaving the PPC-dependent hack, the function then loops across callstack frames using the callstack_pop() function until reaching the frame that the library function has returned to, which is normally a single callstack frame. Again, if the -c or -T options were set, calc_time_spent() is called.

The callstack_pop() function is called one final time to pop the last callstack frame, and the process' return address is set in the proc structure as the breakpoint address. The output_right() function is called to log the library call, and the continue_after_breakpoint() function is called to allow the process to continue, following which the function returns.

If no return address in the callstack matches the breakpoint address, the process is entering, not returning from, a library function. The address2bpstruct() function is called to translate the address into a breakpoint structure.

Provided that this was a success, the following occurs:

• The stack pointer and return address to be saved in the proc structure are obtained using the get_stack_pointer() and get_return_address() functions.

• The output_left() function is called to log the library function call, and the callstack_push_symfunc() function is called. A check is then made to see whether the PLTs_initialized_by_here variable is set, whether the function matches the called library function's symbol name, and whether the need_to_reinitialize_breakpoints variable is set. If all this is true, the reinitialize_breakpoints() function is called.

Finally, continue_after_breakpoint() is called and the function returns.

If the address2bpstruct() call above was not successful, output_left() is called to show that an unknown and unexpected breakpoint was hit. The continue_process() function is called and the function returns.

² This function merely calls dict_find_entry() to find the correct entry in proc->breakpoints and returns it.

22 static void callstack_push_syscall(struct process *proc, int sysnum) – process_event.c

This function simply pushes a callstack_element structure onto the callstack array held in the proc structure. This structure's is_syscall element is set, to differentiate this callstack frame from one which represents a library function call. The proc structure's callstack_depth member is incremented to reflect the callstack's growth.

23 static void callstack_push_symfunc(struct process *proc, struct library_symbol *sym) – process_event.c

As in the callstack_push_syscall() function described above, a callstack_element structure is pushed onto the callstack array held in the proc structure, and the callstack_depth element is incremented to reflect this growth.

24 static void callstack_pop(struct process *proc) – process_event.c

This function performs the reverse of the two functions described above. It removes the last structure from the callstack array and decrements the callstack_depth element.

25 void enable_all_breakpoints(struct process *proc) – breakpoints.c

This function begins by checking the breakpoints_enabled element of the proc structure. Only if it is not set does the rest of the function continue.

If the architecture is PPC and the -L option was not used, the function checks whether the PLT has been set up, using a ptrace PTRACE_PEEKTEXT operation. If not, the function returns at this point.

If proc->breakpoints is set, the dict_apply_to_all() function is called using the enable_bp_cb()³ callback function. This call will set proc->breakpoints_enabled.

³ This function is a callback that simply calls the function enable_breakpoint().

26 void disable_all_breakpoints(struct process *proc) – breakpoints.c

If proc->breakpoints_enabled is set, this function calls dict_apply_to_all() with disable_bp_cb() as the callback function. It then sets proc->breakpoints_enabled to zero and returns.

27 static void disable_bp_cb(void *addr, void *sbp, void *proc) – breakpoints.c

This function is a callback called by dict_apply_to_all(); it simply calls the disable_breakpoint() function (which does the reverse of enable_breakpoint(), copying the saved data from the breakpoint location back over the breakpoint instruction using the ptrace PTRACE_POKETEXT interface).

28 void reinitialize_breakpoints(struct process *proc) – breakpoints.c

This function retrieves the list of symbols as a library_symbol linked-list structure from proc->list_of_symbols and iterates over this list, checking each symbol's need_init element and calling insert_breakpoint() for each symbol for which it is true.

If need_init is still set after insert_breakpoint(), an error condition occurs; the error is reported and ltrace exits.

29 void continue_after_breakpoint(struct process *proc, struct breakpoint *sbp) – sysdeps/linux-gnu/trace.c

A check is made to see whether the breakpoint is enabled, via the sbp->enabled flag. If it is, then disable_breakpoint() is called.

After this, set_instruction_pointer()⁴ is called to set the instruction pointer to the address of the breakpoint. If the breakpoint is still enabled, continue_process() is called. If not, then on the SPARC or ia64 architectures the continue_process() function is called, and otherwise the ptrace interface is invoked with a PTRACE_SINGLESTEP operation.

⁴ This function sets the current value of the instruction pointer, using the ptrace interface with values of PTRACE_POKEUSER and EIP.

30 void open_pid(pid_t pid, int verbose) – proc.c

The trace_pid() function is called on the passed pid; if this fails, the function prints an error message and returns.

The filename for the process is obtained using the pid2name() function, and open_program() is called with this filename passed as an argument.

Finally, the breakpoints_enabled flag is set in the proc structure returned by open_program().

31 static void remove_proc(struct process *proc) – process_event.c

This function removes a process from the linked list of traced processes.

If list_of_processes is equal to proc (i.e., the process was the first in the linked list), there is a simple unlink operation: list_of_processes = list_of_processes->next.

If not, and the searched-for process is in the middle of the list, the list is iterated over until the process is found, and tmp->next is set to tmp->next->next, simply cutting the searched-for process out of the linked list.
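This unlink logic amounts to a standard singly linked list removal. A small sketch, with list_of_processes assumed to be the global head of the traced-process list:

struct process { struct process *next; /* ... other members ... */ };

static struct process *list_of_processes;

static void remove_proc(struct process *proc)
{
    if (list_of_processes == proc) {          /* head of the list */
        list_of_processes = list_of_processes->next;
        return;
    }
    for (struct process *tmp = list_of_processes; tmp->next; tmp = tmp->next) {
        if (tmp->next == proc) {
            tmp->next = tmp->next->next;      /* cut proc out of the chain */
            return;
        }
    }
}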

32 int fork_p(struct process *proc, int sysnum) – sysdeps/linux-gnu/trace.c

This function checks whether the given sysnum integer refers to a system call that would cause a child process to be created. It does this by checking the fork_exec_syscalls table, using the proc->personality value and an index, i, to check each system call in the table sequentially, returning 1 if there is a match.

If the proc->personality value is greater than the size of the table, or should there not be a match, zero is returned.

33 int exec_p(struct process *proc, int sysnum) – sysdeps/linux-gnu/trace.c

This function checks whether the given sysnum integer refers to a system call that would cause another program to be executed. It does this by checking the fork_exec_syscalls table, using the proc->personality value and an index, i, to check each system call in the table sequentially, returning 1 if there is a match.

If the proc->personality value is greater than the size of the table, or should there not be a match, zero is returned.
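A sketch of the described table lookup follows. The table layout and the i386 syscall numbers in it are illustrative assumptions, not the actual fork_exec_syscalls contents:

#include <stddef.h>

#define MAX_FE 4

struct fork_exec_entry {
    int fork_syscalls[MAX_FE];   /* zero-terminated */
    int exec_syscalls[MAX_FE];   /* zero-terminated */
};

static const struct fork_exec_entry fork_exec_syscalls[] = {
    /* personality 0: i386 numbers, purely illustrative */
    { { 2 /* fork */, 120 /* clone */, 190 /* vfork */, 0 },
      { 11 /* execve */, 0 } },
};

int fork_p(int personality, int sysnum)
{
    size_t n = sizeof(fork_exec_syscalls) / sizeof(fork_exec_syscalls[0]);

    if (personality < 0 || (size_t)personality >= n)
        return 0;                /* personality outside the table */
    for (int i = 0; fork_exec_syscalls[personality].fork_syscalls[i] != 0; i++)
        if (fork_exec_syscalls[personality].fork_syscalls[i] == sysnum)
            return 1;
    return 0;
}

exec_p() would be identical, walking exec_syscalls instead of fork_syscalls.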

34 void output_line(struct process *proc, char *fmt, ...) – output.c

If the -c option is set, the function returns immediately. Otherwise the begin_of_line() function⁵ is called, and the fmt argument data is output to the output stream (which can be a file chosen using -o, or stderr) using fprintf().

⁵ Prints the beginning part of each output line. It prints the process ID, the time passed since the last output line, and either the return address of the current function or the instruction pointer.

35 void output_left(enum tof type, struct process *proc, char *function_name) – output.c

If the -c option was set, the function returns immediately. If the current_pid variable is set, the message <unfinished ...> is output, and current_pid and current_column are set to zero.

Otherwise current_pid is set to the pid element of the proc structure, and current_depth is set to proc->callstack_depth. The begin_of_line() function is called.

If USE_DEMANGLE is #defined, the function name is output by way of my_demangle(); otherwise it is just output plain.

A variable func is assigned by passing the function_name to name2func(). If this failed, a loop is iterated four times, calling display_arg() in succession to display four arguments; at the end of the loop it is called a fifth time.

Should the call to name2func() succeed, another loop iterates over the number of parameters that the function receives, and the display_arg() function is called for each of them.

Finally, if func->params_right is set, save_register_args() is called.

36 void output_right(enum tof type, struct process *proc, char *function_name) – output.c

A function structure is allocated via the name2func() function.

If the -c option was set: provided the dict_opt_c variable is not set, it is allocated via a call to dict_init(). An opt_c_struct structure is allocated by dict_find_entry(). If this should fail, the structure is allocated manually by malloc() and the function name is entered into the dictionary using the dict_enter() function. There are various time calculations, and the function returns.

If current_pid is set, is not equal to proc->pid, and current_depth is not equal to the process' callstack_depth, the message <... resumed> is output and current_pid is set to zero. If current_pid is not equal to the proc structure's pid element, begin_of_line() is called; then, if USE_DEMANGLE is defined, the function name is output as part of a resumed message using fprintf() via my_demangle(). If USE_DEMANGLE is not defined, fprintf() alone is used to output the message. If func is not set, arguments are displayed using ARGTYPE_UNKNOWN; otherwise they are displayed using the correct argument type from the proc structure.

37 int display_arg(enum tof type, struct process *proc, int arg_num, enum arg_type at) – display_args.c

This function displays one of the arguments: the arg_num'th argument to the function whose name is currently being output to the terminal by the output functions.

It uses a switch-case to decide how to display the argument. Void, int, uint, long, ulong, octal, char, and address types are displayed using the fprintf() stdio function. String and format types are handled by the display_string(), display_stringN() (which sets string_maxlength by calling gimme_arg() with the arg2 variable, and then calls display_string()), and display_format() functions, respectively.

Unknown values are handled by the display_unknown() function.

38 static int display_unknown(enum tof type, struct process *proc, int arg_num) – display_args.c

The display_unknown() function performs a calculation on the argument, retrieved using the arg_num variable and use of the gimme_arg() function. Should the value be less than 1,000,000 and greater than –1,000,000, it is displayed as a decimal integer value; if not, it is interpreted as a pointer.

39 static int display_string(enum tof type, struct process *proc, int arg_num) – display_args.c

The display_string() function uses the gimme_arg() function to retrieve, from the stack, the address of the string to be displayed. If this fails, the function returns and outputs the string NULL.

Memory is allocated for the string using malloc(); should this fail, the function returns and outputs ??? to show that the string was unknown.

The umovestr() function is called to copy the string from its address, and the length of the string is determined by either the value passed to -s or the maximum length of a string (by default infinite). Each character is displayed by the display_char() function (which outputs the supplied character using fprintf(), converting control characters such as \r (carriage return), \n (newline), and EOF (end of file) to printable versions).

Should the string be longer than the imposed maximum string length, the string "..." is output to show that there was more data to be shown.

The function returns the length of the output.
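The display_unknown() heuristic described above reduces to a single range check. A minimal sketch, returning the output length as the display functions do:

#include <stdio.h>

/* Print like display_unknown(): small magnitudes as signed decimals,
   everything else as a pointer. Returns the length of the output. */
int display_unknown_value(long value)
{
    if (value > -1000000 && value < 1000000)
        return printf("%ld", value);
    return printf("%p", (void *)value);
}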

40 static char *sysname(struct process *proc, int sysnum) – process_event.c

This function retrieves the name of a system call based on its system call number.

It checks the personality element of the proc structure and the sysnum value, to verify that they fit within the size of the syscalents[] array.

If proc->personality does not, the abort() function is called. If sysnum does not, a string value beginning with SYS_ is returned.

Provided that both numbers fit within the syscalents ar- This function retrieves the stack pointer of the traced ray the correct value is obtained using the sysnum vari- process by using the ptrace interface with values of able. A string value of SYS_ PTRACE_PEEKUSER and UESP. is returned. 45 void *get_return_addr(struct process 41 long gimme_arg(enum tof type, struct *proc, void *stack_pointer) – process *proc, int arg_num) – sysdeps/linux-gnu/*/regs.c sysdeps/linux-gnu/*/trace.c This function retrieves the current return address of the For x86 architecture this function checks if arg_num is current stack frame using the ptrace interface PTRACE_ –1, if so then the value of the EAX register is returned, PEEKTEXT operation to retrieve the value from the which is obtained via the ptrace PTRACE_PEEKUSER memory pointed to by the current stack pointer. operation. 46 struct dict *dict_init(unsigned int If type is equal to LT_TOF_FUNCTION or LT_TOF_ (*key2hash) (void *), int (*key_cmp) (void FUNCTIONR then the arg_num’th argument is re- *, void *)) – dict.c turned from the stack via a ptrace PTRACE_PEEKUSER operation based on the current stack pointer (from the malloc() proc structure) and the argument number. A dict structure is allocated using , follow- ing which the buckets array of this structure is iterated If the type is LT_TOF_SYSCALL or LT_TOF_ over and each element of the array is set to NULL. SYSCALLR then a register value is returned based on the argument number as so: 0 for EBX, 1 for ECX, 2 for The key2hash and key_cmp elements of the dict EDX, 3 for ESI, and 4 for EDI. structure are set to the representative arguments passed to the function and the function returns. If the arg_num does not match one of the above or the type value does not match either of the above cases, 47 int dict_enter(struct dict *d, void *key, ltrace exits with an error message. void *value) – dict.c

42 static void calc_time_spent(struct process *proc) – process_event.c

This function calculates the time spent in a system call or library function. It retrieves a callstack_element structure from the current frame of the process' callstack, calls gettimeofday() to obtain the current time, and compares the saved time in the callstack_element structure to the current time. This difference is then stored in the current_diff variable.

43 void *get_instruction_pointer(struct process *proc) – sysdeps/linux-gnu/*/regs.c

This function retrieves the current value of the instruction pointer, using the ptrace interface with values of PTRACE_PEEKUSER and EIP.

44 void *get_stack_pointer(struct process *proc) – sysdeps/linux-gnu/*/regs.c

This function retrieves the stack pointer of the traced process, using the ptrace interface with values of PTRACE_PEEKUSER and UESP.

45 void *get_return_addr(struct process *proc, void *stack_pointer) – sysdeps/linux-gnu/*/regs.c

This function retrieves the return address of the current stack frame, using the ptrace PTRACE_PEEKTEXT operation to retrieve the value from the memory pointed to by the current stack pointer.

46 struct dict *dict_init(unsigned int (*key2hash) (void *), int (*key_cmp) (void *, void *)) – dict.c

A dict structure is allocated using malloc(), following which the buckets array of this structure is iterated over and each element of the array is set to NULL.

The key2hash and key_cmp elements of the dict structure are set to the respective arguments passed to the function, and the function returns.

47 int dict_enter(struct dict *d, void *key, void *value) – dict.c

This function enters a value into the linked lists represented by the dict structure passed as the first argument.

A hash is calculated by the key2hash() function, using the key argument to the function, and a dict_entry structure, new_entry, is allocated with malloc(). The elements of new_entry are set using key, value, and hash.

An index is calculated by reducing the hash value modulo the size of the d->buckets array, and the new_entry structure is entered into this array at that index, by linking it to the start of the linked list held there.
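The insert path is a classic chained hash table. A minimal sketch follows; the bucket count and field names are assumptions rather than the actual ltrace definitions:

#include <stdlib.h>

#define DICT_BUCKETS 997

struct dict_entry {
    unsigned int hash;
    void *key, *value;
    struct dict_entry *next;
};

struct dict {
    struct dict_entry *buckets[DICT_BUCKETS];
    unsigned int (*key2hash)(void *);
    int (*key_cmp)(void *, void *);
};

int dict_enter(struct dict *d, void *key, void *value)
{
    struct dict_entry *e = malloc(sizeof(*e));
    unsigned int hash = d->key2hash(key);

    if (e == NULL)
        return -1;
    e->hash = hash;
    e->key = key;
    e->value = value;

    /* Index by hash modulo the bucket count; prepend to the chain. */
    e->next = d->buckets[hash % DICT_BUCKETS];
    d->buckets[hash % DICT_BUCKETS] = e;
    return 0;
}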

48 void dict_clear(struct dict *d) – dict.c

This function iterates over both the d->buckets array and the linked list held in each d->buckets array element. For each linked-list element it frees the entry before unlinking it from the list. For each emptied bucket it sets the d->buckets element to NULL.

49 void *dict_find_entry(struct dict *d, void *key) – dict.c

A hash is created using the d->key2hash function pointer and the passed key argument.

This hash is then used to index into the d->buckets array. The linked list held in this element of the array is iterated over, comparing the calculated hash value to the hash value held in each element of the linked list.

Should the hash values match, a comparison is made between the key argument and the key element of that linked-list entry. If this comparison proves true, the function returns the entry. Otherwise the function returns NULL, if no matches are ultimately found.

50 void dict_apply_to_all(struct dict *d, void (*func) (void *key, void *value, void *data), void *data) – dict.c

This function iterates over all the elements in the d->buckets array, and iterates over the linked list held in each element of said array.

For each element of each linked list, the passed func function pointer is called with the key and value elements of that entry, along with the data argument supplied to dict_apply_to_all().

51 unsigned int dict_key2hash_string(void *key) – dict.c

This function creates a hash value from a character string passed as the void pointer key.

The key is first cast to a character pointer, and for each character in this string the following is carried out:

• The integer total is XORed with the value of the character shifted left by the value shift, which starts out as zero and is incremented by five for each iteration.

• Should the shift pass the value of 24, it is reduced to zero.

After processing each character in the supplied string, the function returns the value held in the variable total as the final hash value.
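Reconstructed from the prose above (and therefore an approximation rather than a copy of the ltrace source), the hash looks like this:

unsigned int dict_key2hash_string(void *key)
{
    const char *s = key;
    unsigned int total = 0, shift = 0;

    while (*s != '\0') {
        total ^= (unsigned int)*s++ << shift;   /* fold the character in */
        shift += 5;                             /* five more bits next time */
        if (shift > 24)
            shift = 0;                          /* wrap, per the prose */
    }
    return total;
}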

52 dict_key_* helper functions – dict.c

Ltrace has many simple functions to help in the key comparisons:

• int dict_key_cmp_string(void *key1, void *key2) – dict.c
A very simple function that returns the result of a call to strcmp() using the two supplied pointer values.

• unsigned int dict_key2hash_int(void *key) – dict.c
A very simple function that returns the supplied pointer value cast to an unsigned int type.

• int dict_key_cmp_int(void *key1, void *key2) – dict.c
A very simple function that returns the mathematical difference of key2 from key1.

Evaluating effects of cache memory compression on embedded systems

Anderson Farias Briglia – Nokia Institute of Technology – [email protected]
Allan Bezerra – Nokia Institute of Technology – [email protected]
Leonid Moiseichuk – Nokia Multimedia, OSSO – [email protected]
Nitin Gupta – VMware Inc. – [email protected]

Abstract

Cache memory compression (or compressed caching) was originally developed for desktop and server platforms, but has also attracted interest on embedded systems, where memory is generally a scarce resource and hardware changes bring more costs and energy consumption. Cache memory compression brings a considerable advantage in input-output-intensive applications by means of using a virtually larger cache for the local file system through compression algorithms. As a result, it increases the probability of fetching the necessary data in RAM itself, avoiding the need to make slow calls to local storage. This work evaluates an Open Source implementation of cache memory compression applied to Linux on an embedded platform, dealing with the unavoidable processor and memory resource limitations as well as with existing architectural differences.

We will describe the Compressed Cache (CCache) design, the compression algorithm used, memory behavior tests, performance and power consumption overhead, and CCache tuning for embedded Linux.

1 Introduction

Compressed caching is the introduction of a new level into the virtual memory hierarchy. Specifically, RAM is used to store both an uncompressed cache of pages in their 'natural' encoding, and a compressed cache of pages in some compressed format. By using RAM to store some number of compressed pages, the effective size of RAM is increased, and so the number of page faults that must be handled by very slow hard disks is decreased. Our aim is to improve system performance. When that is not possible, our goal is to introduce no (or minimal) overhead when compressed caching is enabled in the system.

Experimental data show that not only can we improve data input and output rates, but also that the system behavior can be improved, especially in memory-critical cases, leading, for example, to such improvements as postponing the out-of-memory activities altogether. Taking advantage of the kernel swap system, this implementation adds a virtual swap area (as a dynamically sized portion of the main memory) to store the compressed pages. Using a dictionary-based compression algorithm, page-cache (file-system) pages and anonymous pages are compressed and spread into variable-sized memory chunks. With this approach, the fragmentation can be reduced to almost zero whilst achieving a fast page recovery process. The size of the Compressed Cache can be adjusted separately for page-cache and anonymous pages on the fly, using procfs entries, giving more flexibility to tune the system to the required use cases.

2 Compressed Caching

2.1 Linux Virtual Memory Overview

Physical pages are the basic unit of memory management [8], and the MMU is the hardware that translates virtual page addresses into physical page addresses and vice-versa. This compressed caching implementation, CCache [3], adds some new flags to help with compressed-page identification and uses the same lists used by the PFRA (Page Frame Reclaiming Algorithm). When the system is under a low-memory condition, it evicts pages from memory. It uses Least Recently Used (LRU) criteria to determine the order in which to evict pages.

It maintains two LRU lists: active and inactive. These lists may contain both page-cache (file-backed) and swap-cache (anonymous) pages. When under memory pressure, pages in the inactive list are freed as follows:

• Swap-cache pages are written out to swap disks using the swapper_space writepage() (swap_writepage()).

• Dirty page-cache pages are flushed to filesystem disks using the filesystem-specific writepage().

• Clean page-cache pages are simply freed.

2.1.1 About Swap Cache

This is the cache for anonymous pages. All swap cache pages are part of a single swapper_space. A single radix tree maintains all pages in the swap cache. swp_entry_t is used as a key to locate the corresponding pages in memory. This value identifies the location in the swap device reserved for this page.

typedef struct {
    unsigned long val;
} swp_entry_t;

Figure 1: Fields in swp_entry_t (a 5-bit type and a 27-bit offset, for the default setup of MAX_SWAPFILES=32)

In Figure 1, 'type' identifies things we can swap to.
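To make the split concrete, here is a small user-space illustration of packing and unpacking such an entry, assuming the 5/27 split of Figure 1; the kernel's own swp_entry()/swp_type()/swp_offset() macros differ in detail:

#include <stdio.h>

typedef struct { unsigned long val; } swp_entry_t;

#define SWP_OFFSET_BITS 27   /* 5 type bits when MAX_SWAPFILES=32 */

static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
{
    swp_entry_t e = { (type << SWP_OFFSET_BITS) | offset };
    return e;
}

static unsigned long swp_type(swp_entry_t e)
{
    return e.val >> SWP_OFFSET_BITS;
}

static unsigned long swp_offset(swp_entry_t e)
{
    return e.val & ((1UL << SWP_OFFSET_BITS) - 1);
}

int main(void)
{
    swp_entry_t e = swp_entry(3, 4096);   /* swap device 3, slot 4096 */
    printf("type=%lu offset=%lu\n", swp_type(e), swp_offset(e));
    return 0;
}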

2.1.2 About Page Cache

This is the cache for file-system pages. Like the swap cache, this also uses a radix tree to keep track of file pages. Here, the offset in the file is used as the search key. Each open file has a separate radix tree. For pages present in memory, the corresponding radix node points to the struct page for the memory page containing the file data at that offset.

2.2 Compressed Cache Overview

For the compressed cache to be effective, it needs to store both swap-cache and page-cache (clean and dirty) pages. So, a way is needed to transparently (i.e., with no changes required for user applications) take these pages in and out of the compressed cache.

This implementation handles anonymous pages and page-cache (filesystem) pages differently, due to the way they are handled by the kernel:

• For anonymous pages, we create a virtual swap. This is a memory-resident area where we store compressed anonymous pages. The swap-out path then treats this as yet another swap device (with highest priority), and hence only minimal changes were required for this part of the kernel. The size of this swap can be dynamically adjusted using the provided proc nodes.

• For page-cache pages, we make the corresponding page-cache entry point to the location in the compressed area instead of to the original page. So when a page is again accessed, we decompress the page and make the page-cache entry point back to this page. We did not use the 'virtual swap' approach here, since these (file-system) pages are never 'swapped out.' They are either flushed to file-system disk (for dirty pages) or simply freed (for clean pages).

In both cases, the actual compressed page is stored as a series of variable-sized 'chunks' in a specially managed part of memory, which is designed to have minimum fragmentation in storing these variable-sized areas, with quick storage and retrieval operations. All kinds of pages share the common compressed area.

The compressed area begins as a few memory pages. As more pages are compressed, the compressed area inflates (up to a maximum size, which can be set using the procfs interface), and when requests for these compressed pages arrive, they are decompressed and the corresponding memory 'chunks' are put back onto the free-list.

Figure 2: Memory hierarchy with Compressed Caching

2.3 Implementation Design

When a page is to be compressed, the radix node pointing to the page is changed to point to the chunk_head; this in turn points to the first of the chunks for the compressed page, and all the chunks are also linked. This chunk_head structure contains all the information required to correctly locate and decompress the page (compressed page size, compression algorithm used, location of the first chunk, etc.).

When the compressed page is accessed or required later, a page-cache/swap-cache (radix) lookup is done. If we get a chunk_head structure instead of a page structure on lookup, we know this page was compressed. Since chunk_head contains a pointer to the first of the chunks for this page and all chunks are linked, we can easily retrieve the compressed version of the page. Then, using the information in the chunk_head structure, we decompress the page and make the corresponding radix node point back to this newly decompressed page.

Figure 3: A sample of the compressed storage view highlighting 'chunked' storage. Identically colored blocks belong to the same compressed page, and white is free space. An arrow indicates related chunks linked together as a singly linked list. A long horizontal line across chunks shows that these chunks are also linked together as a doubly linked list, in addition to whatever other lists they might belong to.
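A hypothetical C rendering of this bookkeeping follows; the field names are invented for illustration and the real CCache structures differ:

struct chunk {
    void *data;                     /* start of this chunk's bytes */
    unsigned short size;            /* at most PAGE_SIZE; never crosses a page */
    struct chunk *next_related;     /* free list, or same-page chunk list */
    struct chunk *master_prev;      /* doubly linked master chunk list */
    struct chunk *master_next;
};

struct chunk_head {
    struct chunk *first;            /* first chunk of the compressed page */
    unsigned short compressed_size; /* needed for decompression */
    unsigned char algorithm;        /* which compressor produced the data */
};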

2.3.1 Compressed Storage

The basic idea is to store compressed pages in variable-sized memory blocks (called chunks). A compressed page can be stored in several of these chunks. Memory space for chunks is obtained by allocating 0-order pages, one at a time, and managing this space using chunks. All the chunks are always linked in a doubly linked list called the master chunk list. Related chunks are also linked in a singly linked list using a related-chunk list; e.g., all free chunks are linked together, and all chunks belonging to the same compressed page are linked together. Thus all chunks are linked using the master chunk list, and related chunks are additionally linked using one of the related-chunk lists (e.g., the free list, or the list of chunks belonging to the same compressed page).

Note that:

• A chunk cannot cross page boundaries, as is shown for the 'green' compressed page in Figure 3. A chunk is split unconditionally at page boundaries. Thus, the maximum chunk size is PAGE_SIZE.

• This structure will reduce fragmentation to a minimum, as all the variable, free-space blocks are being tracked.

• When compressed pages are taken out, the corresponding chunks are added to the free-list, and physically adjacent free chunks are merged together (while making sure chunks do not cross page boundaries). If the final merged chunk spans an entire page, the page is released.

So, the compressed storage begins as a single chunk of size PAGE_SIZE, and the free-list contains this single chunk.

An LRU list is also maintained, which contains these chunks in the order in which they are added (i.e., the 'oldest' chunk is at the tail).

2.3.2 Page Insert and Delete

Page Insert: The uncompressed page is first compressed into a buffer page. Then a number of free chunks are taken from the free list (or a new page is allocated to get a new chunk) according to the size of the compressed page. These chunks are linked together in a singly linked related-list. The remaining space from the last chunk is added back to the free list, and merged with adjacent free chunks, if present. The entry in the page-cache radix tree is now made to point to the chunk_head allocated for this compressed page. Pages that are not compressible (whose size increases on compression) are never added to CCache; the usual reclaim path applies to them.

Page Delete: When a page is looked up in memory, we normally get a struct page corresponding to the actual physical page. But if the page was compressed, we instead get a struct chunk_head rather than a struct page at the corresponding node (identified by the PG_compressed flag being set), which gives a link to the first chunk. Since all related chunks are linked together, the compressed data is collected from all these chunks into a separate page, and then decompression is done. The freed chunks are merged with adjacent free chunks, if present, and then added to the free list.

2.4 Compression Methods

In the fundamental work [5], a number of domain-specific considerations are discussed for compression algorithms:

1. Compression must be lossless, i.e., a one-to-one mapping between compressed and decompressed forms, so exactly the same, original data must be restored.

2. Compression must be in memory, so no kind of external memory can be used. Additionally, the following properties of typical in-memory data can be exploited to achieve good compression:

• it is word- or double-word-aligned for faster handling by the processor;

• it contains a large number of integers with a limited range of values;

• it contains pointers (usually of integer size), mostly to the same memory region, so the majority of the information is stored in the lower bits;

• it contains many regular sequences, often of zeros.

3. Compression must incur a low overhead. Speed matters for both compression and decompression. A compression algorithm can also be asymmetric, since typically decompression is required for many read operations, making it important to have as cheap a decompression as possible. A small space overhead is also required, since this overhead memory has to be allocated when the system is already running low on memory.

4. Compressible data, on average, should incur a 50% size reduction. If a page cannot be compressed to 50%, it increases the complexity of the procedure for handling compressed pages.

Thus, general-purpose compression algorithms may not be good for our domain-specific compression requirements, due to high overhead. In this case we can choose a low-overhead compression method, which takes into account the majority of in-memory data regularities and produces a sufficient compression ratio. We can also balance between compressibility and overhead by using several compression algorithms {A1..AN} sequentially.
Assuming that the probabilities of compressing a page are {P1..PN} and the compression times are {C1..CN}, the average compression time can be estimated as

C = ∑ Ci ∗ Pi        (1)

Note that since non-compressible pages can exist, we can determine that

∑ Pi < 1.0        (2)
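As a worked example of equations (1) and (2), suppose two algorithms are tried in sequence; the numbers below are invented purely for illustration:

#include <stdio.h>

int main(void)
{
    /* P[i]: probability that algorithm i is the one that compresses a
       page; C[i]: its relative cost. P[0] + P[1] = 0.90 < 1.0, since
       some pages are not compressible at all. */
    double P[2] = { 0.70, 0.20 };
    double C[2] = { 1.0, 3.5 };
    double avg = 0.0;

    for (int i = 0; i < 2; i++)
        avg += C[i] * P[i];     /* equation (1) */

    printf("average compression cost: %.2f\n", avg);   /* prints 1.40 */
    return 0;
}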

Thus, the best result from the speed-versus-compressibility point of view will be obtained by minimizing the time C at compile- or run-time. The simplest way is to apply the fastest algorithm first, then the second fastest, and so on, leaving the slowest for last. Typically this scheme will work pretty well if C1 ≪ CN; nevertheless, any runtime adaptation can be used. Originally the following compression methods were used for cache memory compression [5]:

• The WK in-memory compression family of algorithms, developed by Paul Wilson and Scott F. Kaplan. These algorithms are based on the assumption that the target system has a 32-bit word size, 22 bits of which match exactly, while 10 bits differ in the values which will be looked for. Due to this assumption, this family of algorithms is not suitable for 64-bit systems, for which another lookup approach should be used. The methods perform close to real-life usage:

– WK4x4 is the variant of WK compression that achieves the highest (tightest) compression by itself, by using a 4x4 set-associative dictionary of recently seen words. It is an implementation of a recency-based compression scheme that operates on machine-word-sized tokens.

– WKdm compresses nearly as tightly as WK4x4, but is much faster because of its simple, direct-mapped dictionary of recently seen words.

• miniLZO is a lightweight subset of the very fast Lempel-Ziv (LZO) implementation by Markus Oberhumer [9], [15]. Key points can be highlighted from its description that make this method very suitable for our purpose:

– Compression is pretty fast and requires 64KB of memory.

– Decompression is simple and fast, and requires no memory.

– It allows you to dial up extra compression at a speed cost in the compressor. The speed of the decompressor is not reduced.

– It includes compression levels for generating pre-compressed data, which achieve a quite competitive compression ratio.

– There is also a compression level which needs only 8 KB for compression.

– The algorithm is thread-safe.

– The algorithm is lossless.

– LZO supports overlapping compression and in-place decompression.

– The expected speed of LZO1X-1 is about 4–5 times faster than the fastest zlib compression level.

3 Experimental Results

This section will describe how CCache was tested on an OMAP platform device, the Nokia Internet Tablet N800 [6]. The main goal of the tests is to evaluate the characteristics of the Linux kernel and overall system behavior when CCache is added to the system.

As said before, this CCache implementation can compress two types of memory pages: anonymous pages and page-cache pages. The latter never go to swap; once they are mapped on a block device file and written to the disk, they can be freed. But anonymous pages are not mapped on disk and should go to swap when the system needs to evict some pages from memory.

We have tests with and without a real swap partition, using an MMC card. Our tests intend to evaluate CCache against a real swap device, so we decided to use just anonymous-page compression. This way we can better compare CCache performance against a system with real swap. For the comparison, we measured the following quantities:

• How many pages were maintained in memory (CCache) and did not go to swap (avoiding I/O overhead).

• Changes in power requirements.

• Comparison of different compression algorithms: compression and decompression times, compression ratio.

3.1 Test Suite and Methodology

We used a mobile device with embedded Linux as the testing platform. The Nokia Internet Tablet N800 has a 330MHz ARM1136 processor from Texas Instruments (OMAP 2420), 128MB of RAM, and 256MB of flash memory used to store the OS and applications. The OMAP2420 includes an integrated ARM1136 processor (330 MHz), a TI TMS320C55x DSP (220 MHz), a 2D/3D graphics accelerator, an imaging and video accelerator, high-performance system interconnects, and industry-standard peripherals [4]. Two SD/MMC slots, one internal and another external, can be used to store audio/video files, pictures, documents, etc.

As we can see from the device's specification, it is focused on multimedia usage. As we know, multimedia applications have high processor and memory usage rates. Our tests intend to use memory intensively; hence we can measure the CCache impact on system performance. The tests were divided into two groups: tests with real applications and tests using synthetic benchmarks. Tests with real applications tried to simulate a high memory-consumption scenario, with a lot of applications being executed and with some I/O operations, to measure the memory and power consumption behavior when CCache is running. Synthetic tests were used to evaluate the CCache performance, since they provided an easy way to measure the time spent on the tests.

Tests with real applications consist of:

• Playing a 7.5MB video on Media player. The N800’s OS, also known as Internet Tablet OS 2007 (IT OS 2007), is based on the linux-omap [10] kernel • Opening a PDF document. and has some features customized to this device. One of these, the most relevant in this case, is the swap system behavior. The Linux kernel virtual memory subsystem We used an application called xautomation [12] to in- (VM) operation can be tuned using the files included teract with the X system through bash scripts from the at /proc/sys/vm directory. There we can configure command line. Xautomation controls the interface, al- OOM-killer parameters, swap parameters, writeout of lowing mouse movements, mouse right and left clicks, dirty data to disk, etc. On this case, just swap-related key up and down, etc. Reading the /proc/meminfo parameters were modified, since the evaluation must be file, we have the memory consumption, and some graph- as close as possible to reality—in the other words, the ics related to the memory consumption can be plotted. default system configuration. They will be shown and commented on in the following sections. The swap system behavior on the device’s kernel is configured as if a swap partition were not present in The tests using synthetic benchmarks were executed us- the system. We configured two parameters to make ing MemTest [11]. MemTest is a group of small tests to the test execution feasible under low-memory scenar- evaluate the stability and consistency of the Linux mem- : /proc/sys/vm/swappiness and /proc/sys/ ory management system. It contains several tests, but vm/min_free_kbytes. we used just one, since our main goal is to measure the CCache performance: fillmem. This test intends to test the system memory allocation. It is useful to verify the • swappiness [7] is a parameter which sets the ker- virtual memory system against the memory allocation nel’s balance between reclaiming pages from the operation, pagination, and swap usage. It has one pa- page cache and swapping process memory. The de- rameter which defines the size of memory allocated by fault value is 60. itself. • min_free_kbytes [7] is used to force the Linux VM to keep a minimum number of kilobytes free. All the tests, using real applications or synthetic bench- marking, were applied using a pre-configured scenarios, depending on what would be measured. Basically we If the user wants the kernel to swap out more pages, have scenarios with different RAM memory sizes, with which in effect means more caching, the swappiness 2007 Linux Symposium, Volume One • 59

If the user wants the kernel to swap out more pages, which in effect means more caching, the swappiness parameter must be increased. The min_free_kbytes parameter controls when the kernel will start to force pages to be freed, before out-of-memory conditions occur.

The default values for swappiness and min_free_kbytes are 0 and 1280, respectively.

During our tests, we had to modify these values, since CCache uses the default swap system and, with the default values, CCache would not be used as expected. Another fact is that we added a real swap partition, and the swappiness must be adjusted to support this configuration. Increasing the swappiness to its default value, 60, we noticed that more pages were swapped, and the differences between CCache usage and a real swap partition could be measured.

Figure 4: CCache x time: max_anon_cc_size = 1024, Mem = 128M

During our tests, the available memory was consumed so quickly that CCache could not act: the OOM killer was called, and the test was killed. To give more time to CCache, min_free_kbytes was increased. Another motivation is that pages taken by CCache are marked as pinned and never leave memory; this can contribute to anticipating the OOM-killer call. This happens because the CCache pages are taken out of the LRU list and, due to this, the PFRA (Page Frame Reclaiming Algorithm) cannot select such pages to be freed.

Other parameters that must be adjusted are max_anon_cc_size and max_fs_backed_cc_size. These parameters set the maximum size for anonymous pages and for page-cache pages, respectively. The CCache will grow until the maximum size is reached. Note that those parameters are set in number of pages (4KB), and the pages are used to store the chunk lists and the metadata needed to manage those chunks. These parameters are exported by CCache via /proc/sys/vm/max_anon_cc_size and /proc/sys/vm/max_fs_backed_cc_size. We have one limitation here: max_anon_cc_size and max_fs_backed_cc_size can be configured only one time; dynamic re-sizing is not permitted yet.

As previously mentioned, the tests covered anonymous-page compression; therefore only max_anon_cc_size was used. After initialization, IT OS 2007 has about 50M of free memory available, from a total of 128M. After making some empirical tests, we decided to set max_anon_cc_size to about 10% of the device's total free memory (5 MB), or, in other words, 1280 pages of 4 KB each. Since CCache pages are pinned in memory and adaptive resizing is not yet implemented, we chose to have only a small percentage of memory assigned to CCache (10%). With a higher percentage, we might end up calling the OOM killer too soon.

3.2 Memory Behavior Tests

The main goal of these tests is to see what is happening with the memory when CCache is compressing and decompressing pages. To execute the tests we used real application scenarios, using applications provided by the default installation of the N800's distribution.

Using a bash script with XAutomation [12], we can interact with the X server and load some applications while another program collects the memory variables present in the /proc/meminfo file. Figure 4 shows the memory consumption against time on a kernel with CCache and max_anon_cc_size set to 1024 pages (4MB).

As we can see in Figure 4, the CCache consumption was very low once the swappiness parameter was configured with the default value. Therefore more pages are cached in memory instead of going to swap, even if the available free memory is low. Actually this behavior is expected, since the system doesn't have swap by default. In Figure 5, we limited the memory to 100M at boot, which caused increased memory pressure on the system. The figure shows that the OOM killer was called at 600 ms, and CCache did not have time to react.

Figure 5: CCache x time: max_anon_cc_size = 1024, Mem = 100M

Using the same test but with swappiness and min_free_kbytes configured with 60 and 3072 respectively, the OOM killer is not called any more, and we have an increased usage of CCache. See Figure 6.

Figure 6: CCache x time: swappiness = 60 and min_free_kbytes = 3072

Figure 6 shows that the CCache size is not enough for the memory load of the test. But we have a problem here: if the CCache size is increased, more pages are taken off the LRU list and cannot be freed. The best solution here is implementing adaptive [1, 13] CCache resizing. We will cover a bit more about adaptive CCache in Section 4.

In the tests, the new page size was collected for each page that was compressed. With this, we can have measurements using the most common ranges of compressed page size. Figure 7 illustrates the compression size distribution in five ranges: pages sized between 0K–0.5K, 0.5K–1K, 1K–2K, 2K–3K, and 3K–4K. About 42% of pages have sizes between 1K and 2K after compression.

Figure 7: Compression size distribution

These results indicate that we have a general system gain with CCache, since most page sizes are between 1K and 2K and we adopted a chunk-based approach with no fragmentation. In general, we have an increase of 100% in 'system available memory.' It is an important CCache advantage: applications that before could not be executed on the system now have more 'visible available memory' and can be executed, allowing an embedded system to support applications that it would otherwise be unable to run. It is important to remember that in CCache, compressed pages with a size of more than 4KB are discarded.

3.3 Performance Tests

Performance tests aim to analyze the CCache overhead: compression overhead, chunk-list handling overhead, and page recovery overhead. With these tests we expect to prove that CCache, even with all those overheads, is faster than using a swap partition on a block device. Only anonymous pages are being considered here.

As explained in Section 2.2, there are some steps when a page is compressed. Table 1 shows the results when running fillmem allocating 70MB of memory (swappiness = 60, min_free_kbytes = 1024KB, max_cc_anon_size = 1280 pages):

Test       No-CCache   CCache WKdm   CCache WK4x4   CCache LZO
Time (s)   14.68       4.64          –              4.25

Table 1: fillmem(70): CCache time x Real Swap time

The test using WK4x4 triggered the OOM killer and could not be completed. But taking a look at the other values, we can conclude that the swap operation using CCache is about 3 times faster than using a swap partition on an MMC card. But how about the OOM killer?

As discussed before, CCache does not have dynamic resizing. This implies that for some use cases CCache has no benefits, and the memory allocated to the compressed data becomes prejudicial to the system. Since we do not have the adaptivity feature implemented, it is very important to tune CCache according to the use case.

3.3.1 Compression Algorithms Comparison

As we wrote above, it is essential to minimize overhead by using an optimal compression algorithm. To compare a number of algorithms, we performed tests on the target N800 device. In addition to those algorithms used in CCache (WKdm, WK4x4, zlib, and miniLZO), we tested methods used in the Linux kernel (rubin, rtime) and LZO compression.

Speed results are shown below relative to memcpy function speed, so 2.4 means something works 2.4 times slower than memcpy. Reported data size is the size of data produced by the compiler (.data and .bss segments). The test pattern contained 1000 4K pages of an ARM ELF file, which is the basic test data for CCache. Input data was compressed page-by-page (4 KB slices).

Name        ARM11 code    ARM11 data    Comp. time   Comp.       Decomp. time   Speed
            size (bytes)  size (bytes)  (relative)   ratio (%)   (relative)     asymm.
WKdm        3120          16            2.3          89          1.4            1.8
WK4x4       4300          4             3.5          87          2.6            0.9
miniLZO     1780          131076        5.6          73          1.6            3.5
zlib        49244         716           73.5         58          5.6            13.1
rubin_mips  180           4             152          98          37             4.2
dyn_rubin   180           4             87.8         92          94.4           0.9
rtime       568           4             7            97          1              7
lzo1        2572          0             8.8          76          1.6            5.5
lzo2        4596          0             62.8         62          2              31.4

Table 2: Compression Algorithms Comparison

From these tests we can draw the following conclusions for the target system:

1. zlib is a very slow method, but it provides the best compression;

2. using WK4x4 makes no practical sense, because the number of pages compressed to 50% or better is the same as for WKdm, but in terms of speed WK4x4 is about 2 times slower;

3. lzo2 has good compression and very good decompression speed, so it will be used to replace zlib everywhere;

4. by using WKdm and miniLZO sequentially, we can obtain good compression levels with small overhead.

3.4 Power Consumption Tests

In order to evaluate the power consumption using CCache, we replaced the device's battery with an Agilent direct-current power supply. This Agilent equipment shows the total current in real time and, with this, the power consumption was calculated using an output voltage of 4V.

Figure 8 shows the power consumption for the interactive user test used and described in Section 3.2. Steps 1 to 5 can be discarded in an analysis of power consumption, since in these steps CCache is active but it is not called. The important steps are 5 to 8. In these steps we have a memory-stress situation and CCache is working. As we can see, we have a gain of about 3% on average when compression is used. It can be explained by smaller I/O data rates. It is important to take a look at the 'Video Play' step: this kind of application needs more energy to process the video stream. These results show that in this situation we have a good gain when CCache is working. It is important to note that some variation is accepted, since all interactive test results are use-case dependent.

Figure 8: Power Consumption tests using CCache (power in W over test steps 1–11, with and without CCache)

4 Related Work

The first implementation of compressed cache was done by Douglis [2] in 1993, and the results were not conclusive. He achieved speed-ups for some applications, and slowdowns for others. Later, in 1999, Kaplan [5] pointed out that the machine Douglis used had a difference between the processor speed and the access times of the hard disk that was much smaller than encountered nowadays. Kaplan also proposed a new adaptive scheme, since that used by Douglis was not conclusive about the applications' performance.

Following the same scheme, Rodrigo Castro [1] implemented and evaluated compressed cache and compressed swap using the 2.4.x Linux kernel. He proposed a new approach to reduce fragmentation of the compressed pages, based on contiguously allocated memory areas (cells). He also proposed a new adaptability policy that adjusts the compressed cache size on the fly. Rodrigo's adaptive compressed cache approach is based on a tight compressed cache, without allocation of superfluous memory. The cells used by the compressed cache are released to the system as soon as they are no longer needed. The compressed cache starts with a minimum size and, as soon as the VM system starts to evict pages, the compressed cache increases its memory usage in order to store them.

All those implementations are focused on desktop or server platforms. One implementation which is focused on embedded devices is CRAMES [14], Compressed RAM for Embedded Systems. CRAMES was implemented as a loadable module for the Linux kernel and evaluated on a battery-powered embedded system. CRAMES supports in-RAM compressed filesystems of any type and uses a chunks approach (not the same as described in this paper) to store compressed pages.

The compressed cache implementation evaluated in this paper still does not have the adaptive feature implemented. Following previous work, CCache uses the same compression algorithms used by Rodrigo's implementation, but the storage of compressed pages is quite different. The chunks approach reduces fragmentation to close to zero and speeds up page recovery. CCache is the first Open Source compressed cache implementation for the 2.6.x Linux kernel, and it is under development both for desktop/server platforms and for embedded devices, as presented in this paper.

5 Conclusions and Future Work

Storing memory pages as compressed data decreases the number of access attempts to storage devices such as, for example, hard disks, which are typically accessed much more slowly than the main memory of a system. As such, we can observe more benefits when the difference between the access time to the main memory and to the storage device is considerable. This characteristic is not typical for embedded Linux systems, and the benefits of storing pages as compressed data are much smaller than on an x86 architecture. Storage devices in embedded Linux systems are typically flash memory, for instance MMC cards, and, as we know, access times for these devices are much faster than access times for a hard disk. This allows us to come to the conclusion that wide use of CCache is not justified in embedded systems.

On the other hand, embedded systems have limitations in the available memory. Thus, the experimental test results show that CCache can improve not only the input and output performance but the system behavior in general, by improving memory management operations like swapping, allocating big pieces of memory, or out-of-memory handling. Be that as it may, it is also common knowledge that this scenario has been changing along with the development of embedded Linux. The best benefit of this CCache implementation is that developers can adjust it to their system's characteristics. Implementing a software-based solution to handle the memory limitations is much better than hardware changes, which can increase the market price of a product.

The power tests show that CCache makes a positive impact on power consumption, but for the N800 architecture the overall benefit is minor, due to the high level of system integration, which is based on the OMAP 2420.

Finally, we can make some important points which can improve this CCache implementation:

• use only fast compression methods like WKdm and

[2] F. Douglis. The compression cache: Using on-line compression to extend physical memory. 1993. http://www.lst.inf.ethz.ch/research/publications/publications/USENIX_2005/USENIX_2005.pdf

[3] Nitin Gupta. Compressed caching for Linux, project site, 2006. http://linuxcompressed.sourceforge.net/

[4] Texas Instruments, 2006. http://focus.ti.com/general/docs/wtbu/wtbuproductcontent.tsp?
minilzo, maybe with lzo2; contentId=4671&navigationId= 11990&templateId=6123. • make performance and memory optimization for compression take into account CPU and platform [5] S. Kaplan. Compressed Caching and Modern hardware-accelerating features; Virtual Memory Simulation. PhD thesis, University of Texas at Austin, 1999. • enable compression according to instant memory load: fast methods when memory consumption is [6] LinuxDevices.com, 2007. moderated, and slow when high memory consump- http://www.linuxdevices.com/ tion is reached. Articles/AT9561669149.html.

• compress/decompress pages that go to/from a real [7] linux.inet.hr, 2007. http://linux.inet. swap partition. hr/proc_sys_vm_hierarchy.html. [8] . Linux Kernel Development. • adaptive size of compressed cache: detection of Press, 2nd edition, 2005. ISBN 0-672-32720-1. a memory consumption pattern and predict the CCache size to support this load. [9] Markus Franz Xaver Johannes Oberhumer, 2005. http://www.oberhumer.com/ opensource/lzo/lzodoc.php. 6 Acknowledgements [10] Linux omap mailing list, 2007. http://linux.omap.com/mailman/ First we have to thank Nitin Gupta for designing and im- listinfo/linux-omap-open-source. plementing CCache on Linux 2.6.x kernels. We would like to thank Ilias Biris for the paper’s revision and sug- [11] Juan Quintela, 2006. http://carpanta.dc. gestions, Mauricio Lin for helping on the Linux VM fi.udc.es/~quintela/memtest/. subsystem study, and Nokia Institute of Technology in partnership with Nokia for providing the testing devices [12] Steve Slaven, 2006. http://hoopajoo. and support to work on this project. net/projects/xautomation.html. [13] Irina Chihaia Tuduce and Thomas Gross. Adaptive main memory compression. 2005. References http: //www.lst.inf.ethz.ch/research/ [1] Rodrigo Souza de Castro, Alair Pereira Lago, and publications/publications/USENIX_ Dilma da Silva. Adaptive compressed caching: 2005/USENIX_2005.pdf. Design and implementation. 2003. http: //linuxcompressed.sourceforge. [14] Lei Yang, Robert P. Dick, Haris Lekatsas, and net/docs/files/paper.pdf. Srimat Chakradhar. Crames: Compressed ram for 64 • Evaluating effects of cache memory compression on embedded systems

embedded systems. 2005. http://ziyang. eecs.northwestern.edu/~dickrp/ publications/yang05sep.pdf.

[15] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978. ACPI in Linux R – Myths vs. Reality

Len Brown Intel Open Source Technology Center [email protected]

Abstract Users with an Intel R processor supporting Hyper- Threaded Technology (HT) will notice that HT is en- Major Linux distributors have been shipping ACPI in abled in ACPI-mode, and not available in legacy-mode. Linux for several years, yet mis-perceptions about ACPI persist in the Linux community. This paper addresses Many systems today are multi-processor and thus use an the most common myths about ACPI in Linux. IO-APIC to direct multiple interrupt sources to multiple processors. However, they often do not include legacy MPS (Multi-Processor Specification) support. Thus, 1 Myth: There is no benefit to enabling ACPI ACPI is the only way to configure the IO-APIC on these on my notebook, desktop, or server. systems, and they’ll run in XT-PIC mode when booted without ACPI. When users boot in ACPI-mode instead of legacy-mode, the first thing they notice is that the power button is now But to look at ACPI-mode vs. legacy-mode a bit deeper, under software control. In legacy-mode, it is a physi- it is necessary to look at the specifications that ACPI cal switch which immediately removes power. In ACPI replaces. The most important one is Advanced Power mode, the button interrupts the OS, which can shutdown Management [APM], but ACPI also obsoletes the Multi- gracefully. Indeed, ACPI standardizes the power, sleep, Processor Specification [MPS] and the PCI IRQ Routing and lid buttons and the OS can map them to whatever Specification [PIRQ]. actions it likes.

In addition to standardizing power button events, ACPI 1.1 ACPI vs. Advanced Power Management (APM) also standardizes how the OS invokes software con- trolled power-off. So a software-initiated power off removes power from the system after Linux has shut- APM and ACPI are mutually exclusive. While a flexible down, while on many systems in legacy-mode, the op- OS can include support for both, the OS can not enable erator must manually remove power. both on the same platform at the same time. Users notice improved battery life when running in ACPI-mode. On today’s notebooks, a key contributor to APM 1.0 was published in 1992 and was supported by R R improved battery life is processor power management. Windows 3.1. The final update to APM, When the processor is idle, the Linux idle loop takes 1.2, was published in 1996. advantage of ACPI CPU idle power states (C-states) to save power. When the processor is partially idle, the ACPI 1.0 was developed in 1996, and Microsoft first R Linux cpufreq sub-system takes advantage of ACPI pro- added support for it in Windows NT . For a period, cessor performance states (P-states) to run the processor Windows preferred ACPI, but retained APM support to at reduced frequency and voltage to save power. handle older platforms. With the release of Windows VistaTM , Microsoft has completed the transition to ACPI Users may notice other differences depending on the by dropping support for APM. platform and the GUI they use, such as battery capacity alarms, thermal control, or the addition of or changes in Many platform vendors deleted their APM support dur- the the ability to invoke suspend-to-disk or suspend-to- ing this transition period, and few will retain APM sup- RAM, etc. port now that this transition is complete.

• 65 • 66 • ACPI in Linux – Myths vs. Reality

The APM BIOS can report events to the APM Driver. Applications User For example, after detecting an idle period, the APM BIOS may issue a System Standby Request Notification telling the OS that it wants to suspend. The OS must answer by calling a Set Power State function within a Kernel APM Driver certain time. If the OS doesn’t answer within the appro- priate time, the BIOS may suspend the system anyway. On resume, APM issues a System Standby Resume No- tification to let the OS know what happened. This is the OS’s opportunity to update its concept of time-of-day. Platform APM BIOS The OS can disable APM’s built-in concept of request- ing a suspend or standby, and can instead manually ask APM to perform these transitions on demand. Figure 1: APM Architecture The OS can instrument its idle loop to call into the APM 1.1.1 APM overview BIOS to let it know that the processor is idle. The APM BIOS would then perhaps throttle the CPU if it appeared to be running faster than necessary. The goal of the APM specification was to extend battery life, while hiding the details of how that is done from the The APM BIOS knows about AC/DC status. The APM OS in the APM BIOS. Driver can query the BIOS for current status, and can also poll for AC/DC change events. APM defines five general system power states: Full On, APM Enabled, APM Standby, APM Suspend, and Off. The APM BIOS knows about battery topology and sta- The Full On state has no power management. The APM tus. The APM Driver can query for configuration as well Enabled state may disable some unused devices. The as capacity, and can poll for Battery Low Notification. APM Standby state was intended to be a system state APM supports a hard-coded list of devices for power with instantaneous wakeup latency. The APM Suspend management including display, disk, parallel ports, se- was optionally suspend-to-RAM, and/or hibernate-to- rial ports, network adapters, and PCMCIA sockets. The disk. OS can query for their state, enable/disable the APM APM defines analogous power states for devices: De- BIOS from managing the devices, and poll for state vice On, Device Power Managed, Device Lower Power, changes. and Device Off. Device context is lost in the Device Off state. APM is somewhat vague about whether it is the job of the OS or the APM BIOS 1.1.2 Why APM is not viable to save and restore the device-operational parameters around Device Off. APM is fundamentally centered around the the APM BIOS. The APM BIOS is entered from OS APM Driver APM defines CPU Core control states—Full On, Slow calls as well as from System Management Interrupts Clock, and Off. Interrupts transition the CPU back to (SMI) into System Management Mode (SMM). SMM is Full On instantaneously. necessary to implement parts of APM since BIOS code needs to run on the processor without the knowledge or An APM-aware OS has an APM Driver that connects support of the OS. with the APM BIOS. After a connection is established, the APM Driver and APM BIOS “cooperatively perform But it turns out that calling into the BIOS is a really power management.” What this means is that the OS scary thing for an operating system to do. The OS puts makes calls into the BIOS to discover and modify the the stability of the system in the hands of the BIOS default policies of the APM BIOS, and the OS polls the developer on every call. Indeed, the only thing more BIOS at least once per second for APM BIOS events. 
frightening to the OS is SMM itself, which is completely 2007 Linux Symposium, Volume One • 67 transparent to the OS and thus virtually impossible to 1997. The primary goal of MPS was to standard- debug. The largest problem with both of these is that ize multi-processor configurations such that a “shrink- if the state of the system was not as the BIOS devel- wrap” OS could boot and run on them without cus- oper expected it, then it may not be restored properly on tomization. Thus a customer who purchased an MPS- BIOS return to the OS. compliant system would have a choice of available Op- erating Systems. So the quality of the “APM experience” varied between platforms depending on the platform’s APM BIOS im- MPS mandated that the system be symmetric—all pro- plementation. cessors, memory, and I/O are created equal. It also man- dated the presence of Local and I/O APICs. Further, the APM specification includes hard-coded lim- itations about the system device configuration. It is not The Local APIC specified support for inter-processor extensible such that the platform firmware can accom- interrupts—in particular, the INIT IPI and STARTUP modate system configurations that did not exist when IPI used to bring processors on-line. the specification was written. While the specification calls it an “MP BIOS,” the code The philosophy of ACPI, on the other hand, is to put is much different from the “APM BIOS.” The MP BIOS the OS, not the BIOS, in charge of power management simply puts all the processors into a known state so that policy. ACPI calls this OSPM, or “Operating System- the OS can bring them on-line, and constructs static MP directed configuration and Power Management.” OSPM configuration data structures—the MP tables—that enu- never jumps into the BIOS in ACPI-mode. However, it merate the processors and APICS for the OS. does access system devices and memory by interpreting BIOS ACPI Machine Language (AML) in kernel mode. MPS also specified a standard memory map. However, this memory map was later replaced by e820, which is ACPI reduces the the need for SMM, but SMM is still a part of the ACPI specification. tool available to BIOS writers to use when they see fit. The MPS tables enumerate processors, buses, and IO- ACPI’s AML is extensible. It can describe resources APICs; and the tables map interrupt sources to IO-APIC and capabilities for devices that the specification knows input pins. nothing about—giving the OS the ability to configure and power-manage a broad range of system configura- MPS mandated that SMP siblings be of equal capabil- tions over time. ity. But when Intel introduced Hyper-Threading Tech- nology (HT) with the Pentium R 4 processor, suddenly ACPI 1.0 was published at the end of 1996. It is proba- siblings where not all created equal. What would hap- bly fair to say that platforms did not universally deploy it pen to the installed base if MPS enumerated HT sib- until ACPI 2.0 was published in 2000. At that time, Mi- lings like SMP siblings? Certainly if an HT-ignorant crosoft released Windows 2000, and the APM era was OS treated HT siblings as SMP, it would not schedule effectively over. tasks optimally. So if you’ve got a notebook from 1998 or 1999 that So the existing MPS 1.4 was not extended to handle includes both APM and ACPI support, you may find HT,1 and today HT siblings are only available to the OS that its APM implementation is more mature (and bet- by parsing the ACPI tables. ter tested) than its new ACPI implementation. 
Indeed, it is true that the upstream Linux kernel enables ACPI But MPS had a sister specification to help machines han- on all systems that advertise ACPI support, but I recom- dle mapping interrupt wires in PIC an IO-APIC mode— mend that Linux distributors ship with CONFIG_ACPI_ the PIRQ spec. BLACKLIST_YEAR=2000 to disable ACPI in Linux on machines from the last century. 1.3 ACPI vs. PCI IRQ Routers (PIRQ)

1.2 Multi-Processor Specification (MPS) IRQ routers are motherboard logic devices that connect physical IRQ wires to different interrupt controller in-

MPS 1.1 was issued in 1994. The latest revision, MPS 1Some BIOSes have a SETUP option to enumerate HT siblings 1.4, was issued in 1995, with minor updates until May, in MPS, but this is a non-standard feature. 68 • ACPI in Linux – Myths vs. Reality put pins. Microsoft published [PIRQ], describing OS- available, STD uses the “platform” method to power off visible tables for PIRQ routers. However, the spec ex- the machine instead of the “shutdown” method. This cludes any mention of a standard method to get and allows more devices to be enabled as wakeup devices, set the routers—instead stating that Microsoft will work as some can wake the system from suspend to disk, but with the chipset vendors to make sure Windows works not from soft power-off. with their chipsets. Many end-users think think that STD and ACPI are ACPI generalizes PIRQ routers into ACPI PCI Interrupt practically synonyms. One reason for this is because in Link Devices. In real-life, these are just AML wrap- the past, /proc/acpi/sleep was used to invoke STD. pers for both the contents of the PIRQ tables, and the However, this method is now deprecated in favor of the undocumented chipset-specific get/set methods above. generic /sys/power/state interface. The key is that the OS doesn’t need any chipset-specific Note that suspend-to-RAM (STR) is more dependent knowledge to figure out what links can be set to, what on ACPI than STD, as the sleep and wakeup paths are they are currently set to, or to change the settings. What ACPI-specific. this means is that the ACPI-aware OS is able to route in- terrupts on platforms for which it doesn’t have intimate However, the vast majority of both STD and STR fail- knowledge. ures today have nothing to do with ACPI itself, but in- stead are related to device drivers. You can often iso- 1.4 With benefits come risks late these issues by unloading a device driver before sus- pend, and re-loading it after resume. It is fair to say that the Linux/ACPI sub-system is large and complex, particularly when compared with BIOS- 3 Myth: The buttons on my notebook don’t based implementations such as APM. It is also fair to work, it must be ACPI’s fault. say that enabling ACPI—effectively an entire suite of features—carries with it a risk of bugs. Indeed it carries The ACPI specification standardizes only 3 buttons— with it a risk of regressions, particularly on pre-2000 power, sleep, and lid. If these buttons do not work in systems which may have a mature APM implementation ACPI-mode, then it is, indeed, an ACPI issue. and an immature ACPI implementation. The other buttons on the keyboard are handled in a vari- But the hardware industry has effectively abandoned the ety of platform-specific methods. previous standards that are replaced by ACPI. Further, First there are the standard keyboard buttons that go di- Linux distributors are now universally shipping and sup- rectly to the input sub-system. When these malfunc- porting ACPI in Linux. So it is critical that the Linux tion, it cannot be blamed on ACPI, because if there is a community continue to build the most robust and full problem with them, they’ll have the same problem with featured ACPI implementation possible to benefit its acpi=off. users. The “ACPI issues” appear with “hot-keys,” which are platform-specific buttons that are not handled by the 2 Myth: Suspend to Disk doesn’t work, it must standard keyboard driver. be ACPI’s fault. 
When in acpi=off mode, these are sometimes han- The suspend-to-disk (STD) implementation in Linux dled in the BIOS with SMI/SMM. But when in ACPI- has very little to do with ACPI. Indeed, if STD is not mode, this SMM support is disabled by the platform working on your system, try it with acpi=off or vendor on the assumption that if ACPI-mode is enabled, CONFIG_ACPI=n. Only if it works better without ACPI then a modern OS sporting a platform-specific hot-key can you assume that it is an ACPI-specific issue. driver is available to handle the hot-keys. For Win- dows, the vendor may be able to guarantee this is true. ACPI’s role during suspend-to-disk is very much like However, to date, no platform vendor has volunteered its role in a normal system boot and a normal system to create or support the community in the creation of power-off. The main difference is that when ACPI is platform-specific hot-key drivers for Linux. 2007 Linux Symposium, Volume One • 69 x Sometimes these keys actually do use the ACPI subsys- Open Resolved Closed tem to get their work done. However, they report events to vendor-specific ACPI devices, which need vendor- 200 specific ACPI device drivers to receive the events and 180 map them to actions. 160 140 120 4 Myth: My motherboard boots with 100 acpi=off, but fails otherwise, it must 80 be ACPI’s fault. 60 40 With the advent of multi-core processors, SMP systems 20

are very common, and all modern x86 SMP system have 0

r

f

S

r

r

l

f

n

e

o

s

r

g s

s

e

ry a

an IO-APIC. re

a

O

t h r

e

e

rs

u

t

l

h ke l

p e

so d

the

o t

te

r-O

BI

a p

t

v O

rm

u

t r-F i

r

e

es

ss

rr

-O

A-co

we

h c

-W

-Tab g

-d

we

i te

C

er-O

ce

p

-Ho

However, many notebook and desktop BIOS vendors do f g

i n

Po

e

g f n

er-Ba r-T wer-Vi

i

rm

Po

-I

f

n

-Pro

g

r

n

fo

not include MPS support in their BIOS. So when booted i

Pow

Po

Co

we

t

f

ACPI

g-Pro

Co

a

n

Pow

r-Sle

we

fi Co

acpi=off Po

with , these systems revert all the way back Pl

Co

we

on Po

to 8259 PIC mode. So only in ACPI-mode is the IO- C Po APIC enabled, and thus any IO-APIC mode issues get blamed on ACPI. Figure 2: ACPI sighting profile at bugzilla.kernel.org The real ACPI vs. non-ACPI, apples-versus-apples com- parison would be acpi=off vs. noapic—for both of these will boot in PIC mode. Sometimes Linux fails to work properly on a system in the field and the root cause is that the ACPI specification But why do so many systems have trouble with IO-APIC was vague. This caused the Linux implementation to do mode? The most common reason is the periodic HZ one thing, while the BIOS vendors and Microsoft did timer. Linux typically uses the 8254 Programmable In- another thing. terrupt Timer (PIT) for clock ticks. This timer is typi- cally routed to IRQ0 via either IO-APIC pin0 or pin2. The Linux/ACPI team in the Intel OpenPage 1 Source Tech- nology Center submits “specification clarifications” di- But Windows doesn’t always use the PIT; it uses the rectly to the ACPI committee when this happens. The RTC on IRQ8. So a system vendor that validates their specification is updated, and Linux is changed to match system only with Windows and never even boots Linux the corrected specification. before releasing their hardware may not notice that the 8254 PIT used by Linux is not hooked up properly. 6 Myth: ACPI bugs are all due to sub- standard platform BIOS. 5 Myth: The Linux community has no influ- ence on the ACPI Specification. When Linux implemented and shipped ACPI, we ran into three categories of failures: HP, Intel, Microsoft, Phoenix, and Toshiba co- developed the ACPI specification in the mid-1990s, but it continues to evolve over time. Indeed, version 3.0b 1. Linux fails because the platform BIOS clearly vio- was published in October, 2006. lates the written ACPI specification. Linux plays a role in that evolution. The latest version These failures exist because until Linux imple- of the specification included a number of “clarifications” mented ACPI, platform vendors had only Windows that were due to direct feedback from the Linux commu- OS compatibility tests to verify if their ACPI im- nity. plementation was correct. 70 • ACPI in Linux – Myths vs. Reality

Unfortunately, an OS compatibility test is not a us for knowingly causing regressions, insisting that even specification compliance test. The result is that if a system is working by mistake, changes should never many BIOS implementations work by accident be- knowingly break the installed base. He was right, of cause they have been exercised by only one OS. course, and ever since the Linux/ACPI team has made Today, the Linux-ready Firmware Developer Kit regressions the highest-priority issues. [FWKIT] is available so that vendors who care But while this was happening, a team at Intel was creat- about Linux have the tools they need to assure their ing three test suites that today are used to to verify that BIOS implementation is compatible with Linux. Linux/ACPI is constantly improving. 2. Linux fails because the platform BIOS writer and Linux programmer interpreted the ACPI specifica- 1. The ACPICA ASL Test Suite (ASLTS) is dis- tion differently. tributed in open source along with the ACPICA As mentioned in the previous section, we treat source package on intel.com. [ACPICA] these as Linux bugs, fix them, and update the spec- ASLTS runs a suite of over 2,000 ASL tests against ification to match the actual industry practice. the ACPICA AML interpreter in a simulation envi- ronment. This is the same interpreter that resides in 3. Bugs in the Linux/ACPI implementation. These the Linux Kernel. During the development of this are simply bugs in Linux, like any other bugs in test suite, over 300 issues were found. Today there Linux. are fewer than 50 unresolved. ACPICA changes are not released if there are any regressions found by this test suite. The myth is that a large number of failures are due BIOS bugs in category #1. The reality is shown by Figure 2— 2. The ACPICA API Test Suite exercises the inter- under 10% of all Linux/ACPI bugs can be blamed on faces to the ACPICA core as seen from the OS. broken BIOSes. Like ASLTS, the API tests are done in a simulation environment. The majority of bugs have actually been reported against category #3, the Linux-specific code. 3. ACPI ABAT—Automated Basic Acceptance Tests—which run on top of Linux, exercising user-visible features that are implemented by 7 Myth: ACPI code seems to change a lot, but ACPI. ACPI ABAT is published in open source on isn’t getting any better. the Linux/ACPI home page.2

When ACPI was still new in Linux and few distributors Finally, one can examine the bug profile at bugzilla. were shipping it, there were many times when changes kernel.org and observe that 80% of all sightings are would fix several machines, but break several others. To now closed. be honest, a certain amount of experimentation was go- ing on to figure out how to become bug-compatible with the installed base of systems. 8 Myth: ACPI is slow and thus bad for high- performance cpufreq governors such as “on- Marcelo Tosatti was maintaining Linux-2.4, and he demand.” walked up to me and in a concerned voice asked why we’d break some systems while fixing others. It was It is true that the BIOS exports AML to the OS, which clear we needed validation tests to prevent regressions, must use an AML interpreter to parse it. It is true that but we didn’t have them yet. And before we did, Linux parsing AML is not intended to occur on performance- distributors cut over to Linux-2.6, and almost univer- critical paths. So how can ACPI possibly be appropriate sally started shipping ACPI. Suddenly we had a large for enabling P-state transitions such as those made by installed base running Linux/ACPI. the high-performance “ondemand” P-state governor— many times per second? For a short while we didn’t mend our ways of experi- menting on the user base. Then Linus Torvalds scolded 2http://acpi.sourceforge.net 2007 Linux Symposium, Volume One • 71

The answer is that AML is used to parse the ACPI tables Recently, acpi-cpufreq has been anointed the preferred to tell cpufreq what states ondemand has to choose driver of the two, and speedstep-centrino is scheduled from. ondemand then implements its run-time policy for removal from the source tree as un-maintainable. without any involvement from ACPI.

The exception to this rule is that the platform may decide 10 Myth: More CPU idle power states (C- at run-time that the number of P-states should change. states) are better than fewer states. This is a relatively rare event, e.g. on an AC→DC tran- sition, or a critical thermal condition. In this case, ACPI Users observe the system C-states in /proc/acpi/ re-evaluates the list of P-states and informs cpufreq processor/*/power and assume that systems with what new states are available. Cpufreq responds to this more C-states save more power in idle than systems with change and then proceeds to make its run-time governor fewer C-states. If they look at the data book for an Intel decisions without any involvement from ACPI. CoreTM 2 Duo processor and try to relate those states to this file, then that is a reasonable assumption. 9 Myth: Speedstep-centrino is native and thus faster than ACPI-based ‘acpi-cpufreq.’ However, with some limitations, the mapping between hardware C-states and the ACPI C-states seen by Linux is arbitrary. The only things that matter with C-states is To change the processor frequency and voltage, the OS the amount of power saved, and the latency associated can either write directly to native, model-specific regis- with waking up the processor. Some systems export ters (MSR), or it can access an IO address. There can lots of C-states, others export fewer C-states and have be a significant efficiency penalty for IO access on some power-saving features implemented behind the scenes. systems, particularly those which trap into SMM on that access. An example of this is Dynamic Cache Sizing. This is not So the community implemented speedstep-centrino, a under direct OS or C-state control. However, the pro- cpufreq driver with hard-coded P-state tables based on cessor recognizes that when deep C-states are entered, CPUID and the knowledge of native MSRs. Speedstep- it can progressively flush more and more of the cache. centrino did not need ACPI at all. When the system is very idle, the cache is completely flushed and is totally powered off. The user cannot tell At that time, using acpi-cpufreq instead of speedstep- if this feature is implemented in the processor by look- centrino meant using the less-efficient IO accesses. So ing at how many C-states are exported to the OS—it is the myth was true—but two things changed. implemented behind the scenes in processor firmware.

1. Speedstep-centrino’s hard-coded P-state tables 11 Myth: Throttling the CPU will always use turned out to be difficult to maintain. So ACPI- less energy and extend battery life. table capability was added to speedstep-centrino where it would consult ACPI for the tables first, Energy = Power ∗ Time. That is to say, [Watt-Hours] = and use the hard-coded tables as a backup. [Watts] * [Hours]. 2. Intel published the “Intel Processor Vendor- Specific ACPI Interface Specification” along with Say the processor is throttled to half clock speed so that [ACPICA]. This specification made public the bits it runs at half power, but takes twice as long to get the necessary for an ACPI implementation to use na- job done. The energy consumed to retire the workload tive MSR access. So native MSR access was added is the same and the only effect was to make the work to acpi-cpufreq. take twice as long. Energy/work is constant. There are, however, some second-order effects which The result was that both drivers could talk ACPI, and make this myth partially true. For one, batteries are both could talk to MSRs. Speedstep-centrino still had not ideal. They tend to supply more total energy when its hard-coded tables, and acpi-cpufreq could still talk drained at a lower rate than when drained at a higher to IO addresses if the system asked it to. rate. 72 • ACPI in Linux – Myths vs. Reality

Secondly, systems with fans require energy to run those ACPI-related functions on a broad range of systems, the fans. If the system can retire the workload without get- easier it is for the developers to improve the subsystem. ting hot, and succeeds in running the fans slower (or off), then less energy is required to retire the workload. Your system should boot as well (or better) in ACPI- mode using no boot parameters as it does with acpi= Note that processor clock throttling (ACPI T-states) dis- off or other workarounds. Further, the power manage- cussed here should not be confused with processor per- ment features supported by ACPI such as suspend-to- formance states (ACPI P-states). P-states simultane- RAM and processor power management should func- ously reduce the voltage with the clock speed. As tion properly and should never stop functioning prop- power varies as voltage squared, deeper P-states do take erly. the processor to a more efficient energy/work operating point and minimize energy/work. It is a huge benefit to the community and the quality of the Linux ACPI implementation when users insist 12 Myth: I can’t contribute to improving that their machines work properly—without the aid of ACPI in Linux. workarounds. When users report regressions, file bugs, and test fixes, they are doing the community a great ser- vices that has a dramatic positive impact on the quality The basis of this last myth may be the existence of of ACPI in Linux. ACPICA.

ACPICA (ACPI Component Architecture) is Intel’s ref- 13 Conclusion erence implementation of the ACPI interpreter and sur- rounding OS-agnostic code. In addition to Linux, BSD R , SolarisTM , and other operating systems rely on Forget your initial impressions of Linux/ACPI made ACPICA as the core of their ACPI implementation. For years ago. ACPI in Linux is not a myth, it is now uni- this to work, Intel holds the copyright on the code, and versally deployed by the major Linux distributors, and publishes under dual BSD and GPL licenses. it must function properly. Insist that the ACPI-related features on your system work perfectly. If they don’t, In Linux, 160 ACPICA files reside in sub-directories un- complain loudly and persistently3 to help the develop- der /drivers/acpi. When a patch is submitted from ers find and fix the issues. the Linux community to those files, the Linux/ACPI maintainer asks for their permission to license the The community must maintain its high standards for change to Intel to re-distribute under both licenses on ACPI in Linux to continuously improve into the high- the file, not just the GPL. That way, Intel can share the est quality implementation possible. fix with the other ACPICA users rather than having the multiple copies diverge. References The ACPICA license isn’t a barrier for open source con- tributors, but since it isn’t straight GPL and extra per- [ACPI] Hewlett-Packard, Intel, Microsoft, Phoenix, mission is requested, it does generate a false impression Toshiba, Advanced Configuration and Power that patches are unwelcome. Interface, Revision 3.0b, October, 2006. http://www.acpi.info Further, it is the 40 pure-GPL files in /drivers/acpi that are most interesting to the majority of the Linux [ACPICA] Intel, ACPI Component Architecture, community anyway, for those files contain all the Linux- http://www.intel.com/technology/ specific code and ACPI-related policy—treating the iapc/acpi/downloads.htm ACPICA core as a “black box.” [APM] Intel, Microsoft Advanced Power Management But submitting patches is only one way to help. BIOS Interface Specification, Revision 1.2, As described earlier, a lot of the work surrounding February, 1996. http://www.microsoft. Linux/ACPI is determining what it means to be bug- com/whdc/archive/amp_12.mspx compatible with common industry platform BIOS prac- 3Start with [email protected] for tice. The more people that are testing and poking at Linux/ACPI related issues. 2007 Linux Symposium, Volume One • 73

[FWKIT] Linux-Ready Firmware Developer Kit, http://www.linuxfirmwarekit.org

[MPS] Intel, MultiProcessor Specification, Version 1.4, May, 1997. http://www.intel.com/ design/intarch/MANUALS/242016.htm

[PIRQ] Microsoft, PCI IRQ Routing Table Specification, Version 1.0, February, 1996. http://www.microsoft.com/whdc/ archive/pciirq.mspx 74 • ACPI in Linux – Myths vs. Reality Cool Hand Linux R – Handheld Thermal Extensions

Len Brown Harinarayanan Seshadri Intel Open Source Technology Center Intel Ultra-Mobile Group [email protected] [email protected]

Abstract • The CPU is the dominant heat generator on most notebooks. But for handhelds, the CPU may be a Linux has traditionally been centered around processor minor contributor as compared to other devices. performance and power consumption. Thermal manage- ment has been a secondary concern—occasionally used for fan speed control, sometimes for processor throt- 2 ACPI Thermal Capabilities tling, and once in a rare while for an emergency thermal shutdown. Linux notebooks today use a combination of ACPI and native-device thermal control. Handheld devices change the rules. Skin temperature is a dominant metric, the processor may be a minority The ACPI specification [ACPI] mandates that if a note- player in heat generation, and there are no fans. book has both active and passive cooling capability, then the platform must expose them to the OS via ACPI. The This paper describes extensions to Linux thermal man- reasoning behind this is that the OS should be involved agement to meet the challenges of handheld devices. in the cooling policy decision when deciding between active versus passive cooling. 1 Handhelds Thermal Challenges But a large number of notebooks implement thermal The new generation handheld computing devices are ex- control “behind the scenes” and don’t inform or involve ploding with new usage models like navigation, info- ACPI or the OS at all. Generally this means that they tainment, health, and UMPC. “Internet on the Go” is a control the fan(s) via chipset or embedded controller; common feature set the handheld devices must provide but in some cases, they go further, and also implement for all these new usage models. This requires handheld thermal throttling in violation of the ACPI spec. devices to be high-performance. 2.1 Linux will not use the ACPI 3.0 Thermal Ex- High-performance handhelds have magnified power and tensions thermal challenges as compared to notebooks.

ACPI 3.0 added a number of thermal extensions— • Notebooks today are infrequently designed to sit collectively called the “3.0 Thermal Model.” These ex- in your lap. Most of them are designed to sit on tensions include the ability to relate the relative contri- a table, and many actually require free air flow butions of multiple devices to multiple thermal zones. from beneath the case. Users demand that hand- These relative contributions are measured at system held computers be cool enough that their hand does design-time, and encoded in an ACPI thermal relation- not sweat. Thus, there are strict skin temperature ship table for use by the OS in balancing the multiple limits on handhelds. contributors to thermal load. This is a sophisticated so- • Notebooks typically have fans. Handhelds typi- lution requiring thorough measurements by the system cally do not have fans. designer, as well as knowledge on the part of the OS about relative performance tradeoffs. • Handheld form factors are physically smaller than a typical notebook, and thus the thermal dissipation The handheld effort described in this paper does not within the platform is limited. need the ACPI 3.0 thermal extensions. Indeed, no Linux

• 75 • 76 • Cool Hand Linux – Handheld Thermal Extensions

_TZ0 (CPU) _TZ1 (Skin) notebook has yet materialized that needs those exten- _TMP _TMP _CRT _CRT sions. _PSV _TZD _PSL CPU, WLAN, WWAN, MEM So when the BIOS AML uses _OSI and asks Linux if it CPU _PLD (top) supports the “3.0 Thermal Model,” Linux will continue to answer “no,” for the immediate future. CPU WLAN

Mem

2.2 How the ACPI 2.0 Thermal Model works Graphics WWAN

ACPI defines a concept of thermal zones. A thermal _TZ5 (Comms) zone is effectively a thermometer with associated trip _TMP _TZ4 (Skin) _CRT points and devices. _TMP _TZD _TZ3 (MEM) _CRT WLAN _TZ2 (GRAPHICS) _TMP _TZD WWAN A single CRT trip point provokes a critical system shut- _TMP _CRT ... _CRT _TZD _PLD (bottom) down. A single HOT trip point provokes a system _TZD MEM suspend-to-disk. A single PSV trip point activates the GRAPHICS OS’s passive cooling algorithm. One or more ACx trip points control fans for active cooling. Each active trip Figure 1: Example Sensor-to-Thermal-Zone Mapping point may be associated with one or multiple fans, or with multiple speeds of the same fan. That leaves the single PSV trip point. ACPI 2.0 can as- When a PSV trip point fires, the Linux processor_ sociate (only) a processor throttling device with a trip thermal driver receives the event and immediately re- point. Yes, handhelds have processors, but the proces- quests the Linux cpufreq subsystem to enter the deep- sor isn’t expected to always be the dominant contributor est available processor performance state (P-state). As to thermal footprint on handhelds like it often is on note- P-states reduce voltage along with frequency, they are books. more power-efficient than simple clock throttling (T- states), which lower frequency only. ACPI 2.0 includes the _TZD method to associate de- vices with thermal zones. However, ACPI doesn’t say The processor thermal throttling algorithm then period- anything about how to throttle non-processor devices— ically polls the thermal zone for temperature, and throt- so that must be handled by native device drivers. tles the clock accordingly. When the system has cooled and the algorithm has run its course, the processor is 2.4 So why use ACPI at all? un-throttled, and cpufreq is again allowed to control P- states. Inexpensive thermal sensors do not know how to gener- ate events. They are effectively just thermometers that 2.3 Why is the ACPI 2.0 Thermal Model not suffi- need to be polled. However, using the CPU to poll cient? the sensors would be keeping a relatively power-hungry component busy on a relatively trivial task. The solu- It can’t hurt to implement ACPI’s CRT trip point for tion is to poll the sensors from a low-power embedded critical system shutdown—but that isn’t really the focus controller (EC). here. The EC is always running. It polls all the sensors and The HOT trip point doesn’t really make sense, since on is responsible for interrupting the main processor with a handheld, shutdown and hibernate to disk (if one even events. ACPI defines a standard EC, and so it is easy exists) are likely to be synonymous. to re-use that implementation. Of course, it would be an equally valid solution to use a native device driver Active trip points are of no use on systems which have to talk to an on-board microcontroller that handles the no fans. low-level polling. 2007 Linux Symposium, Volume One • 77

User Thermal Management User Thermal Management Policy Control Application Policy Control Application

Kernel Thermal Driver Kernel Native Temperature Notify Handler Sensor Driver

Interrupt native_intr()

AML _Qxx: TZx: Notify(TZx) _TMP Platform Temperature Sensor

Interrupt acpi_intr()

Figure 3: Thermal Event Delivery via native driver

Platform T Embedded Controller However, for the non-processor thermal zones, a single passive trip point is insufficient. For those we will use ACPI’s concept of “temperature change events.” Figure 2: Thermal Event Delivery via ACPI When the EC decides that the temperature has changed by a meaningful amount—either up or down—it sends 2.5 Mapping Platform Sensors to ACPI Thermal a temperature change event to the thermal zone object. Zones If the thermal zone is associated with a processor, then the kernel can invoke its traditional processor thermal Figure 1 shows an example of mapping platform sensors throttling algorithm. to the ACPI Thermal zones. Four different scenarios are illustrated in the figure: As shown in Figure 2, for non-processor thermal zones, the thermal driver will query the temperature of the zone, and send a message to user-space identi- • Sensors that are built into the CPU. fying the zone and the current temperature.

• Sensors that are associated with a single non- Figure 3 shows the same event delivered by a native plat- processor device, such as DRAM or graphics. form, sensor-specific sensor driver. • Sensors that are associated with cluster of compo- nents, for example, Wireless LAN (WLAN) and 3 Proposed Handheld Thermal Solution Wireless WAN (WWAN). Multiple works ([LORCH], [VAHDAT]) focus on low- • Skin sensors that indicate overall platform temper- power OS requirements and low-power platform design. ature. While low-power optimizations have a positive impact on thermals, this only addresses thermal issues at a com- ponent level. To address the platform-level thermal is- 2.6 How to use ACPI for Handheld Thermal Events sues, we need to look at the thermal problem in a more complete platform perspective. This requires support For the CPU, Linux can continue to handle a single from OS, user applications, etc. ACPI passive trip point with an in-kernel processor ther- mal throttling algorithm. 3.1 Design Philosophy

For critical thermal events, Linux can continue to handle • Thermal monitoring will be done using inexpen- a single ACPI critical trip point with a system shutdown. sive thermal sensors—polled by a low-power EC. 78 • Cool Hand Linux – Handheld Thermal Extensions

User Thermal Management Policy Control Application

Kernel

sysfs sensor I/F. sysfs throttle I/F.

Thermal sysfs I/F

ACPI ACPI Native Sensor Driver Native Thermal Driver Processor Driver Device Driver

Platform Platform hardware, BIOS, firmware

Figure 4: Thermal Zone Driver Stack Architecture

• Thermal management policy decisions will be 3.2.1 Thermal sysfs driver made from user space, as the user has a compre- hensive view of the platform. The thermal sysfs driver exports two inter- faces (thermal_control_register() and • The kernel provides only the mechanism to deliver thermal_control_deregister()) to compo- thermal events to user space, and the mechanism nent drivers, which the component drivers can call to for user space to communicate its throttling deci- register their control capability to the thermal zone sions to native device drivers. sysfs driver.

The thermal sysfs driver also exports two interfaces—thermal_sensor_register() and Figure 4 shows the thermal control software stack. The thermal_sensor_deregister()—to the platform- thermal management policy control application sits on specific sensor drivers, where the sensor drivers can use top. It receives netlink messages from the kernel ther- this interface to register their sensor capability. mal zone driver. It then implements device-specific ther- mal throttling via sysfs. Native device drivers supply This driver is responsible for all thermal sysfs entries. the throttling controls in sysfs and implement device- It interacts with all the platform specific thermal sensor specific throttling functions. drivers and component drivers to populate the sysfs en- tries.

3.2 Thermal Zone module The thermal zone driver also provides a notification-of- temperature service to a component driver. The ther- mal zone sensor driver as part of registration exposes its The thermal zone module has two components—a ther- sensing and thermal zone capability. mal zone sysfs driver and thermal zone sensor driver. The thermal zone sysfs driver is platform-independent, and handles all the sysfs interaction. The thermal zone 3.2.2 Thermal Zone sensor driver sensor driver is platform-dependent. It works closely with the platform BIOS and sensor driver, and has The thermal zone sensor driver provides all the knowledge of sensor information in the platform. platform-specific sensor information to the thermal 2007 Linux Symposium, Volume One • 79 sysfs driver. It is platform-specific in that it has prior User User Interface information about the sensors present in the platform. Policy The thermal zone driver directly maps the ACPI 2.0 thermal zone definition, as shown in Figure 1. The ther- Thermal Managment mal zone sensor driver also handles the interrupt notifi- Algorithim cation from the sensor trips and delivers it to user space through netlink socket.

3.3 Component Throttle driver Event Sysfs Handler Interface

All the component drivers participating in the given thermal zone can register with the thermal driver, each providing the set of thermal ops it can support. The ther- mal driver will redirect all the control requests to the appropriate component drivers when the user programs the throttling level. Its is up to the component driver to Figure 5: Thermal Policy Control Application implement the thermal control.

For example, a component driver associated with 3.5 Throttling Sysfs Properties DRAM would slow down the DRAM clock on throttling requests. Devices that support throttling will have two addi- tional properties associated with the device nodes: 3.4 Thermal Zone Sysfs Property throttling and throttling_max.

Table 1 shows the directory structure exposing each A value of 0 means maximum performance, though no thermal zone sysfs property to user space. throttling. A value of throttling_max means maxi- mum power savings in the deepest throttling state avail- The intent is that any combination of ACPI and native able before device state is lost. thermal zones may exist on a platform, but the generic sysfs interface looks the same for all of them. Thus, the syntax of the files borrows heavily from the Linux 3.6 Netlink Socket Kobject event notification hwmon sub-system.1 Events will be passed from the kernel to user-space us- Each thermal zone provides its current temperature and ing the Linux netlink facility. Interrupts from the sen- an indicator that can be used by user-space to see if the sor or EC are delivered to user-space through a netlink current temperature has changed since the last read. socket. If a critical trip point is present, its value is indicated here, as well as an alarm indicator showing whether it 3.7 Policy Control Application has fired. If a passive trip point is present, its value is indicated Figure 5 shows a thermal policy control application. here, as well as an alarm indicator showing whether it has fired. The control application interacts with the user, via GUI or configuration files, etc., such that it can understand There are symbolic links to the device nodes of the de- both the dependencies within the system, and the de- vices associated with the thermal zone. Those devices sired current operating point of the system. will export their throttling controls under their device nodes. The thermal management module can monitor the plat- form and component temperature through the sysfs in- 1Documentation/hwmon/sysfs-interface defines the names of the sysfs files, except for the passive files, which are terface, which is a simple wrapper around the sysfs new. layer. The application receives temperature change 80 • Cool Hand Linux – Handheld Thermal Extensions

sysfs ACPI Description R/W temp1_input _TMP Current temperature RO temp1_alarm Temperature change occurred RW temp1_crit _CRT Critical alarm temperature RO temp1_crit_alarm Critical alarm occurred RW temp1_passive _PSV Passive alarm temperature RO temp1_passive_alarm Passive alarm occurred RW Link to device1 associated with zone RO Link to device2 associated with zone RO ...... RO

Table 1: Thermal Zone sysfs entries

events via netlink, so it can track temperature trends. events smaller than 5C. But when the temperature is When it decides to implement throttling, it accesses the higher, change events of 0.5C may be needed for fine appropriate native device’s sysfs entries via the sysfs in- control of the throttling algorithm. terface.

The thermal throttling algorithms implemented inter- 4 Conclusion nally to the policy control application are beyond the scope of this paper. On handheld computers it is viable for a platform- specific control application to manage the thermal pol- 3.8 Possible ACPI extensions icy. With the policy moved to user-space, the kernel component of the solution is limited to delivering events This proposal does make use of the ACPI critical trip and exposing device-specific throttling controls. point. Depending on the device, the policy manager may This approach should be viable on a broad range of sys- decide that either the device or the entire system must be tems, both with and without ACPI support. shut down in response to a critical trip point.

This proposal also retains the ACPI 2.0 support for a References passive trip point associated with a processor, and in- kernel thermal throttling of the processor device. [ACPI] Hewlett-Packard, Intel, Microsoft, Phoenix, Toshiba, Advanced Configuration and Power However, the main use of ACPI in this proposal is sim- Interface 3.0b, October, 2006 ply as a conduit that associates interesting temperature http://www.acpi.info change events with thermal zones. [LORCH] J. Lorch and A.J. Smith. Software Strategies What is missing from ACPI is a way for the policy man- for Portable Computer Energy Management, ager to tell the firmware via ACPI what events are inter- IEEE Personal Communications Magazine, esting. 5(3):60-73, June 1998.

As a result, the EC must have built-in knowledge about [VAHDAT] Vahdat, A., Lebeck, A., and Ellis, C.S. what temperature change events are interesting across 2000. Every joule is precious: the case for the operating range of the device. revisiting operating system design for energy However, it would be more flexible if the policy con- efficiency, Proceedings of the 9th Workshop on trol application could simply dictate what granularity of ACM SIGOPS European Workshop: Beyond the temperature change events it would like to see surround- PC: New Challenges For the Operating System, ing the current temperature. Kolding, Denmark, September 17–20, 2000. EW 9. ACM Press, New York, NY, 31–36. http:// For example, when the temperature is 20C, the pol- doi.acm.org/10.1145/566726.566735 icy application may not care about temperature change Asynchronous System Calls Genesis and Status

Zach Brown Oracle [email protected]

Abstract memory and each piece is hashed. With synchronous read() calls, the CPU is idle while each piece is read The Linux kernel provides a system call interface for from the file. Then as the CPU hashes the file in memory asynchronous I/O (AIO) which has not been widely the disk is idle. If it takes the same amount of time to adopted. It supports few operations and provides asyn- read a piece as it takes to calculate its hash (a bit of a chronous submission and completion of those opera- stretch these days) then the system is working at half tions in limited circumstances. It has succeeded in pro- capacity. viding cache-avoiding reads and writes to regular files, If AIO is used, our example process fully utilizes both used almost exclusively by database software. Main- the CPU and the disk by letting it issue the read for the taining this minimal functionality has proven to be dis- next piece without blocking the CPU. After issuing the proportionately difficult which has in turn discouraged read, the process is free to use the CPU to hash the cur- adding support for other operations. rent piece. Once this hashing is complete the process Recently Ingo Molnar introduced a subsystem called finds that the next read has completed and is available syslets [3]. Syslets give user space a new system call for hashing. interface for performing asynchronous operations. They The general principles of this trivial example apply support efficient asynchronous submission and comple- to more relevant modern software systems. Trade a tion of almost every existing system call interface in the streaming read from a large file for random reads from kernel. a block device and trade hashing for non-blocking net- work IO and you have an ISCSI target server. This paper documents the path that leads from the limits of the existing AIO implementation to syslets. Syslets have the potential to both reduce the cost and broaden 1.2 KAIO Implementation Overview the functionality of AIO support in the Linux kernel. The kernel provides AIO for file IO with a set of sys- tem calls. These system calls and their implementation 1 Background in the kernel have come to be known as KAIO. Appli- cations use KAIO by first packing the arguments for IO Before analyzing the benefits of syslets we first review operations into iocb structs. IO operations are initiated the motivation for AIO and explain the limitations of the by passing an array of iocbs, one element per IO op- kernel’s current support for AIO. eration, to io_submit(). Eventually the result of the operations are made available as an array of io_event structures—one for each completed IO operation. 1.1 Briefly, Why AIO? In the kernel, the io_submit() system call handler AIO can lead to better utilization by translates each of the iocbs from user space into a letting a single process do independent work in parallel. kiocb structure—a kernel representation of the pend- ing operation. To initiate the IO, the synchronous file Take the trivial example of calculating the cryptographic IO code paths are called. The file IO paths have two hash of a very large file. The file is read in pieces into choices at this point.

• 81 • 82 • Asynchronous System Calls

The first option is for a code path to block process- mutex_lock() are only two of the most impor- ing the IO and only return once the IO is complete. tant. These interfaces have to be taught to return an The process calling io_submit() blocks and when error instead of blocking. Not only does this push it eventually returns, the IO is immediately available for changes out to core subsystems, it requires rewrit- io_getevents(). This is what happens if a buffered ing code paths to handle errors from these func- file IO operation is initiated with KAIO. tions which might not have handled errors before.

The second option is for the file IO path to note that • KAIO’s special return codes must be re- it is being called asynchronously and take action to turned from their source, which has promised avoid blocking the caller. The KAIO subsystem pro- to call aio_complete(), all the way vides some infrastructure to support this. The file IO back to io_setup(), which will call path can return a specific error, -EIOCBQUEUED, to aio_complete() if it does not see the special indicate that the operation was initiated and will com- error codes. Code that innocently used to overwrite plete in the future. The file IO path promises to call an existing error code with its own, say returning aio_complete() on the kiocb when the IO is -EIO when O_SYNC metadata writing fails, can complete. lead to duplicate calls to aio_complete() . This invariant must be enforced through any mid- The O_DIRECT file IO path is the most significant level helpers that might have no idea that they’re mainline kernel path to implement the second op- being called in the path between io_submit() tion. It uses block IO completion handlers to call and the source of KAIO’s special error codes. aio_complete() after returning -EIOCBQUEUED rather than waiting for the block IO to complete before • iocbs duplicate system call arguments. For any returning. operation to be supported by KAIO it must have its arguments expressed in the iocb structure. This duplicates all the problems that the system call in- 1.3 KAIO Maintenance Burden terface already solves. Should we use native types and compatibility translation between user space The KAIO infrastructure has gotten us pretty far but its and kernel for different word sizes? Should we use implementation imposes a significant maintenance bur- fixed-width types and create subtle inconsistencies den on code paths which support it. with the synchronous interfaces? If we could re- use the existing convention of passing arguments • Continuation can’t reference the submitting pro- to the kernel, the system call ABI, we’d avoid this cess. Kernel code has a fundamental construct for mess. referencing the currently executing process. KAIO requires very careful attention to when this is done. Far better would be a way to provide an asynchronous Once the handler returns -EIOCBQUEUED then system call interface without having to modify, and risk the submission system call can return and kernel breaking, existing system call handlers. code will no longer be executing in the context of the submitting process. This means that an oper- 2 Fibrils ation can only avoid blocking once it has gotten everything it needs from the submitting process. 2.1 Blame Linus This keeps O_DIRECT, for example, from avoid- ing blocking until it has pinned all of the pages In November of 2005, Linus Torvalds expressed frus- from user space. It performs file system block off- tration with the current KAIO implementation. He sug- set lookups in the mean time which it must block gested an alternative he called micro-threads. It built on on. two fundamental, and seemingly obvious, observations: • Returning an error instead of blocking implies far- reaching changes to core subsystems. A large num- 1. The C stack already expresses the partial progress ber of fundamental kernel interfaces block and cur- of an operation much more naturally than explicit rently can’t return an error. lock_page() and progress storage in kiocb ever will. 2007 Linux Symposium, Volume One • 83

2. schedule(), the core of the kernel task sched- greatly increases the problem. Every mem- uler, already knows when an operation blocks for ber of task_struct would need to be audited to en- any reason. Handling blocked operations in the sure that concurrent access would be safe. scheduler removes the need to handle blocking at every potential blocking call site. Second, the kernel interfaces for putting a code path to sleep and waking it back up are also implemented in terms of these task_structs. If we now have multi- The proposal was to use the call stack as the represen- ple code paths being scheduled as stacks in the context tation of a partially completed operation. As an opera- of one task then we have to rework these interfaces to tion is submitted its handler would be executed as nor- wake up the stacks instead of the task_struct that mal, exactly as if it was executed synchronously. If it they all belong to. These sleeping interfaces are some blocked, its stack would be saved away and the stack of of the most used in the kernel. Even if the changes are the next runnable operation would be put into place. reasonably low-risk this is an incredible amount of code churn. The obvious way to use a scheduler to switch per- operation stacks is to use a full kernel thread for each It took about a year to get around to seriously consider- operation. Originally it was feared that a full kernel ing making these large changes. The result was a proto- thread per operation would be too expensive to manage type that introduced a saved stack which could be sched- and switch between because the existing scheduler in- uled under a task, called a fibril [1]. terface would have required initiating each operation in its own thread from the start. This is wasted effort if it 3 Syslets turns out that the operations do not need their own con- text because they do not block. Scheduling stacks only The fibrils prototype succeeded in sparking debate of when an operation blocks defers the scheduler’s work generic AIO system calls. In his response to fibrils [4], until it is explicitly needed. Ingo Molnar reasonably expressed dissatisfaction with After experience with KAIO’s limitations, scheduling the fibrils construct. First, the notion of a secondary sim- stacks offers tantalizing benefits: pler fibrils scheduler will not last over time as people ask it to take on more functionality. Eventually it will end up as complicated as the task scheduler it was trying to • The submitting process never blocks avoid. There were already signs of this in the lack of support for POSIX AIO’s prioritized operations. Sec- • Very little cost is required to issue non-blocking ond, effort would be better spent adapting the main task operations through the AIO interface scheduler to support asynchronous system calls instead of creating a secondary construct to avoid the perceived Most importantly, it requires no changes to system call cost of the task scheduler. handlers—they are almost all instantly supported. Two weeks later, he announced [2] an interface and scheduler changes to support asynchronous system calls 2.2 Fibrils prototype which he called syslets.1

There were two primary obstacles to implementing this The syslet implementation first makes the assertion that proposal. the modern task scheduler is efficient enough to be used for swapping blocked operations in and out. It uses full First, the kernel associates a structure with a given task, kernel tasks to express each blocked operation. called the task_struct. By convention, it’s con- sidered private to code that is executing in the context The syslet infrastructure modifies the scheduler so that of that task. The notion of scheduling stacks changes each operation does not require execution in a dedicated this fundamental convention in the kernel. Schedul- task at the time of submission. If a task submits a sys- ing stacks, even if they’re not concurrently executing tem call with syslets and the operation blocks, then the on multiple CPUs, implies that code which accesses 1The reader may be forgiven for giggling at the similarity to task_struct must now be re-entrant. Involuntary Chicklets, a brand of chewing gum. 84 • Asynchronous System Calls scheduler performs an implicit clone(). The submis- Cancellation would need to be supported. Signals could sion call then returns as the child of the submitting task. be sent to the tasks which are executing syslets which This is carefully managed so that the user space regis- could cause the handler to return. The same accounting ters associated with the submitting task are migrated to which associated a user’s iocb with a syslet could be the new child task that will be returning to user space. annotated to indicate that the operation should complete This requires a very small amount of support code in as canceled instead of returning the result of the system each architecture that wishes to support syslets. call.

Returning from a blocked syslet operation in a cloned 4.1 Benefits child is critical to the simplicity of the syslets approach. Fibrils tried to return to user space in the same con- text that submitted the operation. This lead to exten- Implementing KAIO with syslets offers to simplify ex- sive modifications to allow a context to be referenced isting paths which support KAIO. Those paths will also by more than one operation at a time. The syslet infras- provide greater KAIO support by blocking less fre- tructure avoids these modifications by declaring that a quently. blocked syslet operation will return to user space in a System call handlers would no longer need to know new task. about kiocbs. They could be removed from the file This is a viable approach because nearly all significant IO paths entirely. Synchronous file IO would no longer user space thread state is either shared between tasks need to work with these kiocbs when they are not pro- or is inherited by a new child task from its parent. It viding KAIO functionality. The specific error codes that will be very rare that user space will suffer ill effects needed to be carefully maintained could be discarded. of returning in a new child task. One consequence is KAIO submission would not block as often as it does that gettid() will return a new value after a syslet now. As explored in our trivial file hashing exam- operation blocks. This could require some applications ple, blocking in submission can lead to resource under- to more carefully manage per-thread state. utilization. Others have complained that it can make it difficult for an application to measure the latency of 4 Implementing KAIO with syslets operations which are submitted concurrently. This has been observed in the field as O_DIRECT writes issued with KAIO block waiting for an available IO request So far we’ve considered asynchronous system calls, structure. both fibrils and syslets, which are accessible through a set of new system calls. This gives user space a pow- erful new tool, but it does nothing to address the kernel 4.2 Risks maintenance problem of the KAIO interface. The KAIO interface can be provided by the kernel but implemented Implementing KAIO with syslets runs the risk of adding in terms of syslets instead of kiocbs. significant memory and CPU cost to each operation. It must be very carefully managed to keep these costs un- As the iocb structures are copied from user space their der control. arguments would be used to issue syslet operations. As the system call handler returns in the syslet thread it Memory consumption will go up as each blocked opera- would take the result of the operation and insert it into tion is tracked with a kernel thread instead of a kiocb. the KAIO event completion ring. This may be alleviated by limiting the number of KAIO operations which are issued as syslets at any given time. Since the syslet interface performs an implicit clone we This measure would only be possible if KAIO contin- cannot call the syslet submission paths directly from ues to only support operations which are guaranteed to the submitting user context. Current KAIO users are make forward progress. not prepared to have their thread context change under them. This requires worker threads which are very care- Completion will require a path through the scheduler. fully managed so as not to unacceptably degrade perfor- O_DIRECT file IO demonstrates this problem. Previ- mance. 
ously it could call aio_complete() from block IO 2007 Linux Symposium, Volume One • 85 completion handlers which would immediately queue the operation for collection by user space. With syslets the block IO completion handlers would wake the blocked syslet executing the IO. The syslet would wake up and run to completion and return at which point the operation would be queued for collection by user space.

4.3 Remaining Work

An initial rewrite of the KAIO subsystem has been done and put through light testing. At the time of this writing conclusive results are not yet available. There is much work yet to be done. Results may be found in the future on the web at http://oss.oracle.com/~zab/ kaio-syslets/.

5 Conclusion

KAIO has long frustrated the kernel community with its limited functionality and high maintenance cost. Syslets offer a powerful interface for user space to move to- wards in the future. With luck, syslets may also ease the burden of supporting existing KAIO functionality while addressing significant limitations of the current imple- mentation.

6 Acknowledgments

Thanks to Valerie Henson and Mark Fasheh for their valuable feedback on this paper.

References

[1] Zach Brown. Generic aio by scheduling stacks. http: //lkml.org/lkml/2007/1/30/330.

[2] Ingo Molnar. Announce: Syslets, generic asynchronous system call support. http: //lkml.org/lkml/2007/2/13/142.

[3] Ingo Molnar. downloadable syslets patches. http://people.redhat.com/mingo/ syslet-patches/.

[4] Ingo Molnar. in response to generic aio by scheduling stacks. http://lkml.org/lkml/2007/1/31/34. 86 • Asynchronous System Calls Frysk 1, Kernel 0?

Andrew Cagney Red Hat Canada, Inc. [email protected]

Abstract

Frysk is a user-level, always-on, execution analysis and debugging tool designed to work on large applications running on current Linux kernels. Since Frysk, inter- nally, makes aggressive use of the utrace, ptrace, and /proc interfaces, Frysk is often the first tool to identify regressions and problems in those areas. Con- sequently, Frysk, in addition to its GNOME application and command line utilities, includes a kernel regression test-suite as part of its installation.

This paper will examine Frysk’s approach to testing, in particular the development and inclusion of unit tests di- rectly targeted at kernel regressions. Examples will in- Figure 1: Frysk’s Monitor Tool clude a number of recently uncovered kernel bugs.

1 Overview In addition, as a secondary goal, the Frysk project is en- deavoring to encourage the advancement of debug tech- This paper will first present an overview of the Frysk nologies, such as kernel-level monitoring and debug- project, its goals, and the technical complexities or risks ging interface (e.g., utrace), and debug libraries (e.g., of such an effort. libdwfl for DWARF debug info), available to Linux users and developers. The second section will examine in more detail one source of technical problems or risk—Linux’s ptrace 2.1 Frysk’s Tool Set interface—and the way that the Frysk project has ad- dressed that challenge. A typical desktop or visual user of Frysk will use the In the concluding section, several specific examples of monitoring tool to watch their system, perhaps focusing problems (both bugs and limitations) in the kernel that on a specific application (see Figure 1). identified will be discussed. When a problem is noticed Frysk’s more traditional de- bugging tool can be used to drill down to the root-cause 2 The Frysk Project (see Figure 2).

The principal goal of the Frysk Project is to develop a For more traditional command-line users, there are suite of always-on, or very-long-running, tools that al- Frysk’s utilities, which include: low the developer and administrator to both monitor and debug complex modern user-land applications. Further, • fstack – display a stack back-trace of either a by exploiting the flexibility of the underlying Frysk ar- running process or a core file; chitecture, Frysk also makes available a collection of more traditional debugging utilities. • fcore – create a core file from a running program;

• 87 • 88 • Frysk 1, Kernel 0?

$ fcatch /usr/lib/frysk/funit-stackframe fcatch: from PID 17430 TID 17430: SIGSEGV detected - dumping stack trace for TID 17430 #0 0x0804835c in global_st_size () from: \ldots/funit-stackframe.S#50 #1 0x08048397 in main () from: \ldots/funit-stackframe.S#87 #2 0x00c2d4e4 in __libc_start_main () #3 0x080482d1 in _start ()

Figure 3: Running the fcatch utility

$ fstep -s 1 [3299] 0xbfb840 mov %esp,%eax [3299] 0xbfb842 call 0xbfbf76 [3299] 0xbfbf76 push %ebp [3299] 0xbfbf77 mov %esp,%ebp [3299] 0xbfbf79 push %edi [3299] 0xbfbf7a push %esi ....

Figure 4: Running the fstep utility

• kernel interface – handles the reception of system events (e.g., clone, fork, exec) reported to Frysk by the kernel; currently implemented using ptrace.

• the core – the engine that uses the events supplied by the kernel to implement an abstract model of the system.

• utilities and GNOME interface – implemented us- ing the core.

Figure 2: Frysk’s Debugger Tool 4 Project Risks

From the outset, a number of risks were identified with • fstep – instruction-step trace a running program the Frysk project. Beyond the technical complexities (see Figure 4); of building a monitoring and debugging tool-set, the • fcatch – catch a program as it is crashing, and Frysk project has additional “upstream” dependencies display a stack back-trace of the errant thread (see that could potentially impact on the project’s effort: Figure 3);

gcj and of course the very traditional: • – the Free GNU Java compiler; Frysk chose to use gcj as its principal compiler.

• fhpd – command-line debugger. • Java-GNOME bindings – used to implement a true GNOME interface; Frysk is recognized as an early 3 The Frysk Architecture adopter of the Java-GNOME bindings.

Internally, Frysk’s architecture consists of three key • Kernel’s ptrace and /proc interfaces – cur- components: rently used by Frysk to obtain ; 2007 Linux Symposium, Volume One • 89

it was recognized that Frysk would be far more ag- occasionally exposed to use by its clients. The long gressive in its use of these interfaces than any ex- lead-in time and the great number of changes that occur isting clients and hence was likely to uncover new before an under-used kernel component is significantly problems and limitations. tested greatly increases the risk that the kernel compo- nent will contain latent problems. • Kernel’s utrace interface – the maturing frame- work being developed to both replace and extend The ptrace interface provides a good illustration of ptrace this problem. Both changes directly to that module, such as a new implementation like utrace, or indi- The next two sections will review the Frysk–Kernel in- rectly, such as modifications to the exec code, can lead terface, problems that were encountered, and the actions to latent problems or limitations that are only detected taken to address those problems. much later when the significantly modified kernel is de- ployed by a distribution. Specific examples of this are discussed further in the third section. 5 Managing Project Stability—Testing

One key aspect of a successful project is its overall sta- 5.3 Quickening the Feedback bility, to that end the quality of testing is a key factor. This section will identify development processes that Frysk, being heavily dependent on both the existing can both help and hinder that goal, with specific refer- ptrace interface and the new utrace interface, rec- ence to the kernel and its ptrace and /proc inter- ognized its exposure very early on in its development. faces. To mitigate the risk of that exposure, Frysk implemented automated testing at three levels: 5.1 Rapid Feedback—Linux Kernel

Modify 1. Integration test (Frysk) – using Dogtail (a test Kernel framework for GUI applications) and ExpUnit (an expect framework integrated into JUnit), test the user-visible functionality of Frysk.

Crash Build 2. Unit test (Frysk) – applying test-driven develop- Kernel Kernel ment ensured the rapid creation of automated tests that exercised Frysk’s internal interfaces and ex- ternal functionality; the unit tests allow Frysk de- velopers to quickly isolate a new problem down to Figure 5: Short Feedback Cycle—Kernel a sub-system and then, possibly, that sub-system’s interaction with the kernel. Linux’s rapid progress is very much attributed to its 3. Regression test (“upstream”) – where the root open-source development model. In addition, and as cause of a problem was determined to be an “up- illustrated by Figure 5, the Linux developer and the stream” or system component on which Frysk de- project’s overall stability also benefits from very short pended (e.g., kernel, compiler, library), a stan- testing and release cycles—just the act of the dalone automated test demonstrating the specific modified kernel is a significant testing step. Only occa- upstream problem was implemented. sionally do problems escape detection in the initial test- ing and review phases. Further, to encourage the use of these test suites, and 5.2 Slow Feedback—ptrace Component Clients ensure their likely use by even kernel developers, these test suites were included in the standard Frysk installa- In contrast, as illustrated by Figure 6, the short develop- tion (e.g., in Fedora the frysk-devel RPM contains ment and testing cycle fails when a component is only all of Frysk’s test suites). 90 • Frysk 1, Kernel 0?

Build Import Release Kernel Kernel Distro

Modify Crash Run Kernel Kernel Frysk

Figure 6: Long Feedback Cycle—ptrace interface

$ /usr/lib/frysk/fsystest SKIP: frysk2595/ptrace_after_forked_thread_exits PASS: frysk2595/ptrace_after_exec .... PASS: frysk3525/exit47 PASS: frysk3595/detach-multi-thread PASS: frysk2130/-clone-exec.sh XFAIL: frysk2595/ptrace_peek_wrong_thread

Figure 7: Running Frysk’s fsystest

Figure 7 illustrates the running of Frysk’s “upstream” or system test suite using fsystest, and Figure 8 illus- trates the running of Frysk’s own test suite, implemented fork() using the JUnit test framework.

ptrace() 6 Problems Identified by Frysk TRACEME In this final section, two specific tests included in Frysk’s “upstream” test suite will be described. waitpid() exec() 6.1 Case 1: Clone-Exec Crash clone() In applying test-driven development, Frysk developers first enumerate, and then implement all identifiable se- quences of a given scenario. For instance, to ensure that exec() exec can be correctly handled, the Frysk test suite se- quences many scenarios including the following: "hi" • 32-bit program exec’ing a 64-bit program

• main thread exec’ing exit()

• non-main thread exec’ing Figure 9: Clone Exec Crash

The last case, as illustrated in Figure 9, has proven to be especially interesting. When first implemented, 2007 Linux Symposium, Volume One • 91

$ /usr/lib/frysk/funit Running testAssertEOF(frysk.expunit.TestExpect) ...PASS Running testTimeout(frysk.expunit.TestExpect) ...PASS Running testEquals(frysk.expunit.TestExpect) ...PASS Running testIA32(frysk.sys.proc.TestAuxv) ...PASS Running testAMD64(frysk.sys.proc.TestAuxv) ...PASS Running testIA64(frysk.sys.proc.TestAuxv) ...PASS Running testPPC32(frysk.sys.proc.TestAuxv) ...PASS Running testPPC64(frysk.sys.proc.TestAuxv) ...PASS ....

Figure 8: Running Frysk’s funit

it was found that the 2.6.14 kernel had a regres- sion causing Frysk’s unit-test to fail—the traced pro- User Event gram would core dump. Consequently, a standalone Interface Loop test strace-clone-exec.sh was added to Frysk’s Thread Thread “upstream” test suite demonstrating the problem, and the fix was pushed upstream. ptrace() Then later with the availability of the 2.6.18 ker- ATTACH nels with the utrace patch, it was found that waitpid() running Frysk’s test suite could trigger a kernel panic. This was quickly isolated down to the same strace-clone-exec.sh test, but this time running the test caused a kernel panic. Since the test was already widely available, a fix could soon be written and incor- ptrace() porated upstream. PEEK

6.2 Case 2: Threaded ptrace Calls Figure 10: Multi-threaded ptrace Graphical debugging applications, such as Frysk, are of- ten built around two or more threads:

• an event-loop thread handling process start, stop, 7 Conclusion and other events being reported by the kernel. • a user-interface thread that responds to user re- quests such as displaying the contents of a stopped As illustrated by examples such as the Exec Crash bug process’ memory, while at the same time ensuring described in Section 6.1, Frysk, by both implementing that the graphical display remains responsive. an “upstream” test suite (focused largely on the kernel) and including that test suite in a standard install, has As illustrated in Figure 10, Linux restricts all helped to significantly reduce the lag between a kernel ptrace requests to the thread that made the initial change affecting a debug-interface on which it depends PTRACE_ATTACH. Consequently, any application us- (such as ptrace) and that change been detected and ing ptrace is forced to route all calls through a dedi- resolved. cated thread. In the case of Frysk, initially a dedicated ptrace thread was created, but later that thread’s func- And the final score? Since Frysk’s stand-alone test suite tionality was folded into the event-loop thread. is identifying limitations and problems in the existing The “upstream” test ptrace_peek_wrong_thread ptrace and /proc interfaces and the problems in the was added to illustrate this kernel limitation. new utrace code, this one can be called a draw. 92 • Frysk 1, Kernel 0?

8 Acknowledgments

Special thanks to the Toronto Windsurfing Club Hatteras Crew for allowing me to write this, and not forcing me to sail.

References

The Frysk Project, http://sourceware.org/frysk

Clone-exec Crash Bug, http://sourceware.org/ bugzilla/show_bug.cgi?id=2130

Threaded ptrace Calls Bug, http://sourceware. org/bugzilla/show_bug.cgi?id=2595

Roland McGrath, utrace Patches, http://people.redhat.com/roland/utrace/

The JUnit Project, http://www.junit.org/index.htm

The Java-GNOME Project, http://java-gnome.sourceforge.net/

GCJ, http://gcc.gnu.org/java/

ExpUnit, http: //sourceware.org/frysk/javadoc/public/ frysk/expunit/package-summary.html

Dogtail, http://people.redhat.com/zcerza/dogtail/ expect, http://expect.nist.gov/ Keeping Kernel Performance from Regressions

Tim Chen Leonid I. Ananiev Alexander V. Tikhonov Intel Corporation [email protected] Intel Corporation [email protected] [email protected]

Abstract 8000 7000 The Linux* kernel is evolving rapidly with thousands of 6000 patches monthly going into the base kernel. With devel- 5000 opment at this pace, we need a way to ensure that the 4000 patches merged into the mainline do not cause perfor- 3000 mance regressions. 2000

1000 The Linux Kernel Performance project was started in Number of Patches per Kernel Release 0 July 2005 and is Intel’s effort to ensure every dot release 2.6.12 2.6.13 2.6.14 2.6.15 2.6.16 2.6.17 2.6.18 2.6.19 2.6.20 from Linus is evaluated with key workloads. In this pa- Kernel Version per, we present our test infrastructure, test methodology, and results collected over the 2.6 kernel development Figure 1: Rate of change in Linux kernel cycle. We also look at examples of historical perfor- mance regressions that occurred and how Intel and the Linux community worked together to address them to benchmarks; they started regular testing and analysis of make Linux a world-class enterprise operating system. Linux kernel’s performance. We began to publish our test data on project website since July 2005. 1 Introduction 2 Testing Process In recent years, Linux has been evolving very rapidly, with patches numbering up to the thousands going into Each release candidate of the Linux kernel triggers our the kernel for each major release (see Figure 1) in test infrastructure, which starts running a benchmark roughly a two- to three-month cycle. The performance test suite within an hour whenever a new kernel get pub- and scalability of the Linux kernel have been key ingre- lished. Otherwise, if no new -rc version appears within dients of its success. However, with this kind of rapid a week, we pick the latest snapshot (-) kernel for evolution, changes detrimental to performance could testing over the weekend. The test results are reviewed slip in without detection until the change is in the dis- weekly. Anomalous results are double-checked, and re- tributions’ kernels and deployed in production systems. run if needed. The results are uploaded to a database This underscores the need for a systematic and disci- accessible by a web interface. If there were any sig- plined way to characterize, test, and track Linux kernel nificant performance changes, we would investigate the performance, to catch any performance issue of the ker- causes and discuss them on Linux kernel mailing list nel at the earliest time possible to get it corrected. (see Figure 2).

Intel’s Open Source Technology Center (OTC) We also make our data available on our website publicly launched the Linux Kernel Performance Project (LKP) for community members to review performance gains in the summer of 2005 (http://kernel-perf. and losses with every version of the kernel. Ultimately, sourceforge.net) to address the need to monitor we hope that this data catches regressions before ma- kernel performance on a regular basis. A group of OTC jor kernel releases, and results in consistent performance engineers set up the test machines, infrastructure, and improvement.

• 93 • 94 • Keeping Kernel Performance from Regressions

Short Description name

Kbuild Measures the speed of Linux ker- nel compilation. Reaim7 Stresses the scheduler with up to thousands of threads each generat- ing load on memory and I/O. Volanomark A chatroom benchmark to test java thread scheduling and net- work scalability. Netperf Measures the performance of TCP/IP network stack. Tbench Load testing of TCP and process Figure 2: Performance testing process. scheduling. Dbench A stress test emulating Netbench load on file system. Tiobench Multithread IO subsystem perfor- 2.1 Benchmarks Suite mance test. Fileio Sysbench component for file I/O We ran a set of benchmarks covering core components workloads. of the Linux kernel (memory management, I/O subsys- Iozone Tests the file system I/O under dif- tem, process scheduler, file system, network stack, etc.). ferent ratio of file size to system memory. Table 1 lists and describes the benchmarks. Most of the Aiostress Tests asynchronous I/O perfor- benchmarks are open source and can be duplicated eas- mance on file system and block device. ily by others. Mmbench Memory management perfor- mance benchmark. 2.2 Test Platforms Httperf Measures web server perfor- mance; also measures server power usage information under R Currently we have a mix of Itanium SMP machines specific offered load levels. R and Xeon SMP machines to serve as our test plat- Cpu-int/fp An industry standard CPU inten- forms, with the configurations as listed below: sive benchmark suite on integer and floating point operations. R TM 4P Intel Itanium 2 processor (1.6Ghz) Java-business An industry standard benchmark 4P Dual-Core Intel R Itanium R processor (1.5Ghz) to measure server-side Java*, tests 2P Intel R Xeon R MP processor (3.4Ghz) scheduler scalability. 4P Dual-Core Intel R Xeon R MP processor (3.0Ghz) 2P Xeon R CoreTM 2 Duo processor (2.66Ghz) Table 1: Performance benchmark suite 2P Xeon R CoreTM 2 Quad processor (2.40Ghz)

test harness, which was easy to customize for adding the 2.3 Test Harness capabilities we need.

We needed a test harness to automate the regular exe- Our test harness provides a list of services that are item- cution of benchmarks on test machines. Even though ized below: there were test harnesses from the Scalable Test Plat- form (http://sourceforge.net/projects/stp) • It can detect and download new Linux kernels from and Linux Test Project (http://ltp.sourceforge. kernel.org within 30 minutes after release, and net), they did not fully meet all of our testing require- then automatically install the kernel and initiate ments. We elected to create a set of scripts for our benchmark suites on multiple test platforms. 2007 Linux Symposium, Volume One • 95

• It can test patches on any kernel and compare re- other: one for domain validation, and one for host con- sults with other kernel version. troller initialization. When there were two host con- trollers, while the second host controller was brought • It can locate a patch that causes a performance up, the initialization thread temporarily disabled the change by automating the cycle of git-bisect, ker- channel for the first controller. However, domain vali- nel install for filtering out the relevant patch. dation was in progress on first channel in another thread (and possibly running on another CPU). The effect of • It uploads results from benchmark runs for differ- disabling the first channel during in-progress domain ent kernels and platforms into a database. The validation was that it caused all subsequent domain val- results and corresponding profile data can be ac- idation commands to fail. This resulted in the lowest cessed with a friendly web interface. possible performance setting for almost all disks pend- • It can queue tasks for a test machine so that differ- ing domain validation. Ken provided a patch and cor- ent test runs can be executed in sequence without rected the problem. interference. 20 IOZone write throughput • It can invoke a different mix of benchmarks and 0

profiling tools. -20

-40

We use a web interface to allow easy comparison of -60 results from multiple kernels and review of profiling -80

data. The results may be rerun using the web inter- Percent Change from Baseline (%) -100 face to confirm a performance change, or automated git- 2.6.9 2.6.12-rc3 2.6.12-rc4 2.6.12-rc5 2.6.12-rc6 2.6.12 2.6.13-rc1 2.6.13-rc2 2.6.13-rc3 2.6.13-rc4 2.6.13-rc5 2.6.13-rc6 2.6.13-rc7 2.6.13 2.6.14-rc1 2.6.14-rc2 2.6.14-rc3 2.6.14-rc4 2.6.14-rc5 2.6.14 2.6.15-rc1 2.6.15-rc2 2.6.15-rc3 2.6.15-rc4 2.6.15-rc5 bisect command be initiated to locate the patch respon- sible. The results are published in external site (http: //kernel-perf.sourceforge.net) after they have Kernel been reviewed. Figure 3: MPT Fusion driver bug

3 Performance Changes 3.2 Scheduler During the course of the project, our systematic testing has revealed performance issues in the kernels. A partial 3.2.1 Missing Inter-Processor Interrupts list of the performance changes are listed in Table 2. We will go over some of those in details. During the 2.6.15 time frame, there was a 60% decrease in Volanomark throughput on Itanium R test machines (see Figure 4). It was caused by a patch that caused 3.1 Disk I/O rescheduled Inter Processor Interrupts (IPI) not to be sent from resched_task(), ending up delaying the 3.1.1 MPT Fusion Driver Bug rescheduling task until next timer tick, thus causing the performance regression. The problem was quickly re- There was a sharp drop in disk performance for the ported and corrected by our team. 2.6.13 kernel (see Figure 3). Initially we thought that it was related to the change in system tick from 1000Hz 3.2.2 Scheduler Domain Corruption to 250Hz. After further investigation, it was found that the change in Hz actually revealed a race condition bug in the MPT fusion driver’s initialization code. During the testing of benchmark Httperf, we noticed unusual variation on the order of 80% in the response Our colleague Ken Chen found that there were two time of the web server under test for the 2.6.19 ker- threads during driver initialization interfering with each nel. This was caused by a bug introduced when the 96 • Keeping Kernel Performance from Regressions

Kernel Patch causing change Effect 2.6.12-rc4 noop-iosched: kill O(N) merge scan. Degraded IO throughput for noop IO sched- uler by 32%. 2.6.13-rc2 Selectable Timer Interrupt Frequency of 100, Degraded IO throughput by 43% due to MPT 250, and 1000 HZ. Fusion driver. 2.6.15-rc1 sched: resched and cpu_idle rework. Degraded performance of Netperf (-98%) and Volanomark (-58%) on ia64 platforms. 2.6.15-rc2 ia64: cpu_idle performance bug fix Fixed Volanomark and netperf degradations in 2.6.15-rc1. 2.6.15-rc5 [SCSI] mptfusion : driver performance fix. Restored fileio throughput. 2.6.16-rc1 x86_64: Clean up copy_to/from_user. Re- Degraded Netperf by 20% on Xeon R MP. move optimization for old B stepping Opteron. 2.6.16-rc3 x86_64: Undid the earlier changes to remove Reverted the memory copy regression in unrolled copy/memset functions for Xeon R 2.6.16-rc1. MP. 2.6.18-rc1 lockdep: irqtrace subsystem, move ac- Netperf degraded by 3%. count_system_vtime() calls into softirq.c. 2.6.18-rc4 Reducing local_bh_enable/disable overhead Netperf performance degradation in 2.6.18- in irq trace. rc1 restored. 2.6.19-rc1 mm: balance dirty pages Now that we can IOzone sequential write dropped by 55%. detect writers of shared mappings, throttle them. 2.6.19-rc1 Send acknowledge each 2nd received seg- Volanomark benchmark throughput reduced ment. by 10%. 2.6.19-rc1 Let WARN_ON return the condition. Tbench degraded by 14%. 2.6.19-rc2 Fix WARN_ON regression. Tbench performance restored. 2.6.19-rc2 elevator: move the back merging logic into Noop IO scheduler performance in 2.6.18-rc4 the elevator core fixed and restored to 2.6.12-rc3 level

Table 2: Linux kernel performance changes seen by test suites

cpu_isolated_map structure was designated as init data. However, the structure could be accessed again after the kernel was initialized and booted when a re- 30 Volanomark build of sched_domain was triggered by setting the 20 sched_mc_power_savings policy. Subsequently, the 10 corrupted sched_domain caused bad load-balancing 0 behavior and caused erratic response time. -10

-20 -30 3.2.3 Rotating Staircase Dead Line Scheduler -40 Percent Change from Baseline (%) -50 2.6.9 2.6.12-rc3 2.6.12-rc4 2.6.12-rc5 2.6.12-rc6 2.6.12 2.6.13-rc1 2.6.13-rc2 2.6.13-rc3 2.6.13-rc4 2.6.13-rc5 2.6.13-rc6 2.6.13-rc7 2.6.13 2.6.14-rc1 2.6.14-rc2 2.6.14-rc3 2.6.14-rc4 2.6.14-rc5 2.6.14 2.6.15-rc1 2.6.15-rc2 2.6.15-rc3 2.6.15-rc4 2.6.15-rc5 2.6.15-rc6 2.6.15-rc7 2.6.15 2.6.16-rc1 2.6.16-rc2 2.6.16-rc3 2.6.16-rc4 2.6.16-rc5 2.6.16-rc6 The recently proposed RSDL (Rotating Staircase Dead Line) scheduler has generated a lot of interest due to its

Kernel elegant handling of interactivity. We put RSDL 0.31 to test and found that for Volanomark, there is a big 30% Figure 4: IPI scheduling bug to 80% slowdown. It turned out that the yield semantics in RSDL 0.31 were too quick to activate the yielding process again. RSDL 0.33 changed the yield semantics to allow other processes a chance to run, and the perfor- mance recovered. 2007 Linux Symposium, Volume One • 97

3.3 Memory Access 50 TCP-streaming throughput 45 40 35 3.3.1 Copy Between Kernel and User Space 30 25 20 During the 2.6.15-rc1 timeframe, we detected a drop up 15 to 30% in Netperf’s throughput on older Xeon R pro- 10 5 cessor MP-based machines (see Figure 5). This was 0 Percent Change from Baseline (%) caused by a switch in the copy between user and ker- -5 2.6.12 2.6.13 2.6.14 2.6.15 2.6.16 2.6.17 2.6.18 2.6.19-rc1 2.6.19-rc2 2.6.19-rc3 2.6.19-rc4 2.6.19-rc5 2.6.19-rc6 2.6.19 2.6.20 nel space to use repeat move string instructions which are slower than loop-based copy on Xeon R processor MP. This problem was corrected quickly. Later, when Kernel

10 R TM TCP-streaming throughput Figure 6: Xeon Core 2 Duo’s Netperf TCP- 5 streaming performance improved with string copy op- erations 0

-5 3.4.2 IRQ Trace Overhead -10

-15 When the IRQ trace feature was intro- -20 duced in 2.6.18-rc1, it unconditionally added Percent Change from Baseline (%) -25 local_irq_save(flags) and local_irq_ 2.6.9 2.6.12 2.6.13 2.6.14 2.6.15 2.6.16-rc1 2.6.16-rc2 2.6.16-rc3 2.6.16-rc4 2.6.16-rc5 2.6.16-rc6 2.6.16 2.6.17 2.6.18 2.6.19 2.6.20 restore(flags) when enabling/disabling bottom halves. This additional overhead caused a 3% regres- sion in Netperf’s UDP streaming tests, even when the Kernel IRQ tracing feature was unused. This problem was Figure 5: Xeon R processor MP’s Netperf TCP- detected and corrected. streaming performance made worse using string copy operations 3.4.3 Cache Line Bouncing the newer CoreTM 2 Duo based Xeon R s became available with efficient repeat move string instructions, a switch There was a 16% degradation of tbench in 2.6.18-rc14 to use string instructions in the 2.6.19 kernel actually (see Figure 7) We found that a change in the code trig- greatly improved throughput. (see Figure 6). gered an inappropriate object code optimization in older gcc 3.4.5, which turned a rare write into a global vari- able into an always write event to avoid a conditional 3.4 Miscellaneous jump. As a result, cache line bouncing among the cpus increased by 70% from our profiling. A patch was later merged by Andrew into the mainline to sidestep this gcc 3.4.1 Para-Virtualization problem.

4 Test Methodology The para-virtualization option was introduced in the 2.6.20 time frame, and we detected a 3% drop in Net- perf and Volanomark performance. We found that 4.1 Test Configuration Para-virtualization has turned off VDSO, causing int 0x80 rather than the more efficient sysenter to be We tried to tune our workloads so they could sat- used for system calls, causing the drop. urate the system as much as possible. For a pure 98 • Keeping Kernel Performance from Regressions

4 Tbench 4.2 Dealing with Test Variation 2 0 -2 Variations in performance measurements are part of any -4 -6 experiment. To some extent the starting state of the sys- -8 tem, like cpu cache, file I/O cache, TLB, and disk ge- -10 ometry, played a role. However, a large variation makes -12 -14 the detection of change in performance difficult.

Percent Change from Baseline (%) -16 2.6.18-rc1-git17 2.6.18-rc2 2.6.18-rc2-git1 2.6.18-rc3 2.6.18-rc4 2.6.18-rc4-git1 2.6.18-rc4-git2 2.6.18-rc5 2.6.18-rc5-git6 2.6.18-rc6 2.6.18-rc6-git3 2.6.18-rc7 2.6.18-rc7-git1 2.6.18 2.6.18-git2 2.6.18-git14 2.6.19-rc1 2.6.19-rc1-git3 2.6.19-rc2 2.6.19-rc2-git6 2.6.19-rc3 2.6.19-rc3-git5 2.6.19-rc4 2.6.19-rc4-git9 2.6.19-rc5 2.6.19-rc6 2.6.19-rc6-git2 2.6.19-rc6-git10 2.6.19-rc1 2.6.19-git5 To minimize variation, we do the following:

• Start tests with a known system state;

Kernel • Utilize a warm-up workload to bring the system to a steady state; Figure 7: Tbench performance problem caused by cache line bouncing • Use a long run time and run the benchmark mul- tiple times to get averages of performance where possible. CPU-bound workload, the CPU utilization was close to 100%. However for a workload involving I/O, To get our system in known state, we rebooted our sys- the system utilization was limited by the time wait- tem before our test runs. We also reformatted the disk ing for file and network I/O. The detailed bench- and installed the test files. This helped to ensure the lay- mark options for each test are described on our out of the test file and the location of journal on the disk website (http://kernel-perf.sourceforge. to remain the same for each run. net/about_tests.php). Table 3 gives a sense of We also ran warm-up workloads before the actual the average loading for different benchmarks. The load- benchmark run. This helped bring the CPU caches, ing profile is the standard one from vmstat. TLB, and file I/O cache into a steady state before the actual testing.

For the disk-bound test workload, we reduced the The third approach we took was to either run the bench- amount of main memory booted to only 1GB (that’s mark for a long time or to repeat the benchmark run mul- only a quarter to one-eighth of the memory of our sys- tiple times and measure the average performance. Short tem). The test file size was a few times of the size of tests like Kbuild, when run repeatedly for 15 times in memory booted. This made the actual effect of I/O dom- our test run, got a good average value with standard de- inant and reduced the effect of file I/O cache. viation below 1%. The average performance value has reduced variance and resembles more closely a well be- haved Gaussian distribution [5]. Single run results for Name % cpu % io % mem % user % sys some benchmarks are far from a Gaussian distribution. Reaim7 100 1 68 85 15 One such example is Tbench. Aiostress 1 36 83 0 1 Dbench 37 28 95 1 36 Figure 8 superimposes the distribution of throughput Fileio 1 14 100 0 1 IOzone 1 23 99 0 1 from a single run of Tbench versus a normal distribu- Kbuild 79 9 90 74 5 tion with the same standard deviation. It can be seen Mmbench 2 66 99 0 2 that the distribution from a single benchmark run is bi- Netperf 40 0 34 2 38 modal and asymmetric. Therefore using a single mea- Cpu-int/fp 100 0 75 100 0 Java-business 39 0 89 39 0 surement for comparison is problematic with the issues tbench 97 0 41 5 92 raised in [1-4]. For these cases, a more accurate per- Volanomark 99 0 96 45 54 formance comparison is made by using average values, Table 3: Sample system loading under benchmarks which resemble much more closely a normal distribu- tion and have smaller deviations. 2007 Linux Symposium, Volume One • 99

1470 30 normal dist. tbench throughput dist. 1460 25 1450

1440 20

1430 15 1420 Throughput

Frequency (%) 10 1410

1400 5 1390

0 1380 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 535000 540000 545000 550000 555000 560000 565000 570000 575000 580000 Deviation from Average Number of Context Switches

Figure 8: Tbench throughput distribution Figure 9: Tbench throughput vs. context switches

80

Sometimes we can better understand the underlying rea- 79.5 son for performance variation by correlating the perfor- 79 mance variation with changes in other profiling data. For example, with Tbench, context switch has a 0.98 78.5 correlation coefficient with the throughput (see Fig- 78

ure 9). This gives an indication that the variation in Run Time (sec) context switch rate is highly correlated with the varia- 77.5 tion in throughput. Another example is Kbuild (see Fig- 77 ure 10), where we find the number of merged IO blocks 76.5 had a –0.96 correlation coefficient with the kernel com- 150 160 170 180 190 200 210 220 230 240 250 260 Number of Merged Writes pile time, showing that the efficiency of disk I/O opera- tions in merging IO blocks is critical to throughput. Figure 10: Kbuild runtime vs. number of merged IO blocks This kind of almost one-to-one correlation between throughput and profiling data can be a big help to check whether there is a real change in system behavior. Even 4.3 Profiling though there are variations in throughput from each run, the ratio between the throughput and profile data should be stable. So when comparing two kernels, if there is a Our test harness collected profiling data during bench- significant change in this ratio, we will know that there mark runs with a set of profiling tools: vmstat, are significant changes in the system behavior. iostat, sar, mpstat, ps, and readprofile. The profiling data provided information about the load We have also performed a large number of runs of on the CPU from user applications and the activities of benchmarks on a baseline kernel to establish the bench- the kernel’s subsystems: scheduler, memory, file, net- mark variation value. Both max and min values are work, and I/O. Information about I/O queue length and saved in a database to establish a confidence interval for hot kernel functions had been useful for us in locating a benchmark. This value is used for results compari- bottlenecks in the kernel and to investigate the cause of son: if the difference in measured performance values performance changes. The waiting time from vmstat is more than the confidence interval, then there is a sig- can be combined with wchan information from ps nificant change in the kernel performance that should to gain insight to time spent by processes waiting for be looked into. In general, disk-I/O-bound benchmarks events. Table 4 provides a profile of waited events for have much higher variation, making it much harder to a run snapshot of Aiostress and Reaim7 benchmarks as detect small changes in performance in them. an example. 100 • Keeping Kernel Performance from Regressions

Aiostress Reaim7 4.6 Performance Index 1209 pause 18075 pause 353 io_getevents 8120 wait 13 get_write_access 1218 exit It is useful to have a single performance index that sum- 12 sync_buffer 102 pipe_wait marizes the large set of results from all the benchmarks 6 stext 62 start_this_handle 3 sync_page 1 sync_page being run. This approach has been advocated in the liter- 1 congestion_wait 2 sync_page ature (see [1]-[4]). This is analogous to a stock market 1 get_request_wait 2 cond_resched index, which gives a sense of the overall market trend from the perspective of individual stock, each weighted Table 4: The events waited by Aiostress and Reaim7 according to a pre-determined criterion.

Number Deviation Weight of % per 4.4 Automated Git-Bisect Benchmark subtests metric Reaim7 1 0.46 2 Aiostress 8 12.8 0.01 The git bisect utility is a powerful tool to locate the patch Cpu-int/fp 2 0.6 1 responsible for a change in behavior of the kernel from Dbench 1 11.3 0.1 a set of patches. However, manually running it to bisect fileio 1 11.8 0.1 a large patch-set repeatedly to find a patch is tedious. Iozone 21 14.7 0.01 One has to perform the steps of bisecting the patch set Kbuild 1 1.4 1 into two, rebuild, install, and reboot the kernel for one Mmbench 1 4.9 0.2 Netperf 7 1.6 0.15 patch set, run the benchmark to determine if the patch Java-Business 1 0.6 1 set causes an improvement or degradation to the perfor- tbench 1 12.7 0.5 mance, and determine which subset of the two bisected Tiobench 9 11.4 0.01 patch sets contains the patch responsible for the change. Volanomark 1 0.8 1 Then the process repeats again on a smaller patch set containing the culprit. The number of patches between Table 5: Number of subtests, variations weights on sub- two rc releases are in the hundreds, and often 8 to 10 tests for each benchmark repetitions are needed. We added capability in our test harness to automate the bisect process and benchmark run. It is a very useful tool to automatically locate the We use the geometric mean of ratios of performance patch responsible for any performance change in O(log metric to its baseline value of each benchmark as a per- n) iterations. formance index, as suggested in [2]. We weigh each benchmark according to its reliability (i.e., benchmarks with less variance are weighed more heavily). If a 4.5 Results Presentation benchmark has a large number of subtests producing multiple metrics, we put less weight on each metric so the benchmark will not be over-represented in the per- After our benchmark runs have been completed, a wrap- formance index. Table 5 shows the weights being cho- per script collects the output from each benchmark sen for some of our benchmarks. and puts it into a ‘comma separated value’ format file that is parsed into a MySQL* database. The re- We use a weighted version of the geometric mean to sults are accessible through an external web site http: aid us in summarizing the performance of the kernel. //kernel-perf.sourceforge.net as a table and This weighted geometric index, though somewhat sub- chart of percentage change of the performance com- jective, is a very useful tool to monitor any change in pared to a baseline kernel (selected to be 2.6.9 for older overall kernel performance at a glance and help guide machines, and 2.6.18 for newer ones). Our internal web us to the specific benchmarked component causing the site shows additional runtime data, kernel config file, change. Figure 11 shows the performance index pro- profile results, and a control to trigger a re-run or to per- duced over our benchmark suite. It is interesting to note form a git bisect. that from the limited perspective of the benchmark suite 2007 Linux Symposium, Volume One • 101

1 0 -1 -2 -3 -4 -5 Percent Change from Baseline -6 2.6.9 2.6.12-rc3 2.6.12-rc4 2.6.12-rc5 2.6.12-rc6 2.6.12 2.6.13-rc1 2.6.13-rc2 2.6.13-rc3 2.6.13-rc4 2.6.13-rc5 2.6.13-rc6 2.6.13-rc7 2.6.13 2.6.14-rc1 2.6.14-rc2 2.6.14-rc3 2.6.14-rc4 2.6.14-rc5 2.6.14 2.6.15-rc1 2.6.15-rc2 2.6.15-rc3 2.6.15-rc4 2.6.15-rc5 2.6.15-rc6 2.6.15-rc7 2.6.15 2.6.16-rc1 2.6.16-rc2 2.6.16-rc3 2.6.16-rc4 2.6.16-rc5 2.6.16-rc6 2.6.16 2.6.17-rc1 2.6.17-rc2 2.6.17-rc3 2.6.17-rc4 2.6.17-rc5 2.6.17-rc6 2.6.17 2.6.18-rc1 2.6.18-rc2 2.6.18-rc3 2.6.18-rc4 2.6.18-rc5 2.6.18-rc6 2.6.18-rc7 2.6.18 2.6.19-rc1 2.6.19-rc2 2.6.19-rc3 2.6.19-rc4 2.6.19-rc5 2.6.19-rc6 2.6.19 2.6.20-rc1 2.6.20-rc2 2.6.20-rc3 2.6.20-rc4 2.6.20-rc5 2.6.20-rc6 2.6.20-rc7 2.6.20

Figure 11: Weighted geometric mean performance for all benchmarks

we run regularly, the index for the 2.6 kernel series has Legal been trending upwards. This paper is copyright c 2007 by Intel. Redistribution rights are granted per submission guidelines; all other rights are re- 5 Conclusion served.

Our project set up the infrastructure to systematically *Other names and brands may be claimed as the property of test every kernel release candidate across multiple plat- others. forms and benchmarks, and also made the test data available to the community on the project website, References http://kernel-perf.sourceforge.net. As a re- [1] J.E. Smith, “Characterizing Computer sult, we have been able to catch some kernel regressions Performance with a Single Number,” quickly, and worked with the community to fix them. Communications of ACM, 31(10):1202–1206, However, with rapid changes in the kernel, the limited October 1988. coverage from our regular benchmark runs could un- cover only a portion of performance regressions. We [2] J.R. Mashey, “War of the Benchmark Means: hope this work will encourage more people to do regu- Time for a Truce” ACM SIGARCH Computer lar and systematic testing of the Linux kernel, and help Architecture News. Volume 32, Issue 4 prevent performance problems from propagating down- (September 2004), pp. 1–14. ACM Press, New stream into distribution kernels. This will help to solid- York, NY, USA. ify Linux’s position as a world-class enterprise system. [3] J.R. Mashey, “Summarizing Performance is No Mean Feat” Workload Characterization Acknowledgments Symposium, 2005. Proceedings of the IEEE International. 6–8 Oct. 2005 Page(s): 1. Digital The authors are grateful to our previous colleagues Ken Object Identifier 10.1109/IISWC.2005.1525995 Chen, Rohit Seth, Vladimir Sheviakov, Davis Hart, and Ben LaHaise, who were instrumental in the creation [4] L. John, “More on finding a Single Number to of the project and its execution. A special thanks to indicate Overall Performance of a Benchmark Ken, who was a primary motivator and contributor to Suite,” Computer Architecture News, Vol. 32, No the project. 1, pp. 3–8, March 2004. 102 • Keeping Kernel Performance from Regressions

[5] D.J. Lilja, “Measuring computer performance: a practitioner’s guide,” Cambridge University Press, 2005. Breaking the Chains—Using LinuxBIOS to Liberate Embedded x86 Processors

Jordan H. Crouse Marc E. Jones Ronald G. Minnich Advanced Micro Devices, Inc. Advanced Micro Devices, Inc. Sandia National Labs [email protected] [email protected]

Abstract systems have abandoned the BIOS hardware interfaces and access the hardware directly. The desktop computer While x86 processors are an attractive option for embed- focus of the traditional BIOS model frustrates embed- ded designs, many embedded developers avoid them be- ded systems designers and developers, who struggle to cause x86-based systems remain dependent on a legacy get a BIOS that embraces their unique platforms. Due to BIOS ROM to set up the system. LinuxBIOS is an the very specific requirements for system boot time and open source solution that replaces the proprietary BIOS resource usage, it is difficult to meet embedded stan- ROMs with a light-weight loader. LinuxBIOS frees the dards with a BIOS designed for two decades of desktop developer from complex CPU and chipset initialization computers. and allows a variety of payloads to be loaded, including The LinuxBIOS project exists to address legacy BIOS the Linux kernel itself. issues. It is licenced under the GNU Public License This presentation reviews the journey of the AMD (GPL) to promote a transparent and open loader. Lin- GeodeTM processors and chipset as they were integrated uxBIOS provides CPU and chipset initialization for x86, into LinuxBIOS to become the centerpoint of the One x86_64, and Alpha systems and allows the flexibility to Laptop Per Child (OLPC) project. We also discuss how load and run any number of different payloads. embedded developers can take advantage of the Lin- This paper discusses the development and use of Lin- uxBIOS environment for their own x86-based projects. uxBIOS for embedded x86 platforms based on AMD Geode processors. The first section examines the history 1 Introduction of LinuxBIOS and the AMD Geode processors. The next section moves into detail about the system initial- Ever since the x86 Personal Computer (PC) architec- ization process. The final section discusses integrating ture was introduced in 1981, it has been accompanied by payloads with the LinuxBIOS firmware. bootstrap code known as the Basic Input/Output System (BIOS) that executes a Power On Self Test (POST). Al- 2 History most every year since the PC’s introduction, hardware and operating system features have increased in com- “History is a guide to navigation in perilous plexity. Each new feature adds complexity to the BIOS, times. History is who we are and why we are which must maintain compatibility with older operating the way we are.” systems and yet also provide support for new ones. The —David C. McCullough end result is a convoluted and cryptic combination of old standards (such as software interrupts for accessing the 2.1 LinuxBIOS History display and storage devices) and new standards (such as Advanced Configuration and Power Interface (ACPI)). Ron Minnich started the LinuxBIOS project at Los Almost all BIOS implementations are proprietary and Alamos National Lab (LANL) in September 1999 to ad- many Open Source developers are in conflict with what dress problems caused by the PC BIOS in large clusters. is perceived to generally be a “black magic” box. Due to The team agreed that the ideal PC cluster node would the arcane nature of the BIOS, most modern operating have the following features:


• Boot directly into an OS from non-volatile RAM;

• Configure only the network interfaces;

• Connect to a control node using any working network interface;

• Take action only at the direction of the control node.

At that time, the LANL team felt that Linux did a better job of running the hardware than the PC BIOS. Their concept was to use a simple hardware bootstrap to load a small Linux kernel from flash to memory. Leveraging work from the OpenBIOS project, the LinuxBIOS team booted an Intel L440GX+ motherboard after approximately six months of development. Early on, the team decided that assembly code would not be the future of LinuxBIOS. OpenBIOS was disregarded because it was based on a great deal of assembly code and a difficult-to-master build structure. The team found a simple loader from STMicroelectronics called STPC BIOS that was written in C and available to be open sourced, so it became the basis for the first version of LinuxBIOS. (Version 2 started after the addition of HyperTransport technology support changed the device model enough to warrant a version bump.)

In 2000, Linux NetworX and Linux Labs joined the effort. The LinuxBIOS team added Symmetric Multiple Processor (SMP) support, an Alpha port, and created the first 13-node LinuxBIOS-based supercomputing clusters. Since 2001, the team has added developers, and they continue to port to new platforms, including AMD Opteron processor- and AMD Athlon processor-based platforms. Interestingly enough, LinuxBIOS was originally designed for clusters, yet LinuxBIOS use on non-cluster platforms now far exceeds the cluster use.

In 2005, some current and past members of the MIT Media Lab joined together to create the One Laptop Per Child (OLPC) program, dedicated to making a low-cost laptop for educational projects around the globe. The OLPC team designed an x86 platform that incorporates an AMD Geode solution. As low price and open technology were part of the core requirements for the laptop, the designers decided to use a royalty-free open source BIOS solution, ultimately choosing LinuxBIOS. The first two board revisions included the AMD Geode GX processor based LinuxBIOS loader, originally utilizing a Linux-as-bootloader payload. This later transitioned to OpenFirmware after it became available in the middle of 2006. In 2007, AMD Geode LX processor support was added to the loader, making it a freely available reference BIOS design for interested developers of other AMD Geode solutions.

2.2 AMD Geode History

The AMD Geode processor is the offspring of the MediaGX processor released by Cyrix in 1997. The MediaGX saved total system cost by embedding a graphics engine that used one of the first Unified Memory Architecture (UMA) implementations. It also featured an integrated northbridge memory controller and SoundBlaster emulation in the CPU. The MediaGX broke the sub-$1000, sub-$500, and sub-$200 price barriers on the Compaq Presario 2100 in 1996 and 1997. In 1997, Cyrix was purchased by National Semiconductor, who renamed the MediaGX line to Geode. National Semiconductor released the Geode GX2 (today, just called the GX) and the CS5535 companion chip in 2002. In 2003, the Geode division was sold to AMD. AMD focused heavily on power, price, and performance, and in 2005 released the AMD Geode LX [email protected] processor and CS5536 companion chip, with the LX [email protected] processor following in 2007.

The AMD Geode GX and LX processors support the i586 instruction set, along with the MMX and 3DNow! extensions. The LX features a 64K instruction and a 64K data L1 cache and a 128K L2 cache. Both processors have on-board 2D graphics and video accelerators. The LX adds an on-board AES engine and a true random number generator. The CS5536 companion chip provides the southbridge capabilities: IDE, USB 2.0, SMBus, AC97, timers, GPIO pins, and legacy x86 peripherals.

3 Geode LinuxBIOS ROM image

While the entire image is known as LinuxBIOS, it is constructed of individual pieces that work together. An AMD Geode LinuxBIOS ROM image is made up of three main binary pieces (OLPC also adds additional binary code for its embedded controller):

• LinuxBIOS Loader: system initialization code;

• VSA2: the AMD Geode processor's System Management Interrupt (SMI) handler;

• Payload: the image or program to be loaded to boot the OS.

3.1 LinuxBIOS Architecture

LinuxBIOS version 2 is structured to support multiple motherboards, CPUs, and chipsets. The overall platform configuration is described in Config.lb in the mainboard directory. The Config.lb file contains important information like what CPU architecture to build for, what PCI devices and slots are present, and where code should be addressed. The mainboard directory also contains the pre-DRAM initialization file, auto.c. ROMCC compiles auto.c and generates a stackless assembly code file, auto.inc. The use of ROMCC works well for small sections of simple C code, but for complicated memory controller initialization there are some issues with code size and C variable-to-register space conversion.

To work around the ROMCC issues, Yinghai Lu of AMD developed support for the AMD64 architecture's Cache-as-RAM (CAR) feature [1]. Compiled C code makes heavy use of the stack. Only a few lines of assembly code are needed to set up the CPU cache controller to be used as temporary storage for the stack. All the pre-DRAM initialization (including memory initialization) is compiled as normal C code. Once the memory controller is configured, the stack is copied to real memory and the cache can be configured as normal. The AMD Geode processors are one of two CPUs to use a CAR implementation in LinuxBIOS version 2 (see the Future Enhancements section for more details about CAR in LinuxBIOS version 3).
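To make the CAR sequence concrete, here is a minimal illustrative sketch of the flow just described. This is not the LinuxBIOS source: every helper named here is hypothetical, and in practice the cache setup itself must be written in assembly.

/* Illustrative sketch of the Cache-as-RAM boot flow (all helper
 * names are hypothetical, not LinuxBIOS functions). */
extern void setup_cache_as_ram(void);   /* few lines of asm: a cache region acts
                                           as RAM and the stack pointer is set into it */
extern void sdram_init(void);           /* ordinary compiled C, with a real stack,
                                           running before DRAM exists */
extern void move_stack_to_dram(void);   /* copy the temporary stack into real memory */
extern void enable_normal_caching(void);

void early_init(void)                   /* reached shortly after the reset vector */
{
        setup_cache_as_ram();
        sdram_init();                   /* memory-controller programming in plain C */
        move_stack_to_dram();
        enable_normal_caching();        /* the cache goes back to being a cache */
}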

3.2 LinuxBIOS Directory Structure

The LinuxBIOS source tree can be a bit daunting to a newcomer. The following is a short tour of the LinuxBIOS directory structure, highlighting the parts interesting to a systems developer.

The cpu/ directory contains the initialization code for VSA2 and the AMD Geode graphics device.

linuxbios/src
|-- cpu
    |-- amd
        |-- model_gx2
        |-- model_lx

The mainboard/ directory contains platform-specific configuration and code. The platform Config.lb file contains the PCI device configuration and IRQ routing. This directory also contains the source file compiled by ROMCC.

linuxbios/src
|-- mainboard
    |-- amd
        |-- norwich
        |-- olpc
            |-- rev_a

(Note: 'Norwich' is the code name for an AMD Geode development platform.)

The source code in northbridge/ includes memory initialization and the PCI bridge 0 configuration and initialization. In the AMD Geode processor's architecture, the northbridge is integrated into the CPU, so the directory name is the same as the CPU's.

linuxbios/src
|-- northbridge
    |-- amd
        |-- gx2
        |-- lx

The southbridge/ directory contains the source for SMBus, flash, UART, and other southbridge device configuration and initialization.

linuxbios/src
|-- southbridge
    |-- amd
        |-- cs5536

The target/ directory contains the platform build directories. These include the configuration files that specify the build features, including ROM size, VSA2 binary size, and the desired payload binary. This is also where the ROM image is built. A script called buildtarget in the linuxbios/target directory parses the target configuration files and builds the Makefiles for the ROM image in the platform target directory.

linuxbios/targets
|-- amd
    |-- norwich
    |-- olpc
        |-- rev_a

3.3 LinuxBIOS Boot Process

Figure 1 details the process of booting a system with LinuxBIOS.

1. The AMD Geode processor fetches code from the reset vector and starts executing noncompressed (pre-DRAM) LinuxBIOS from the flash ROM. The early CPU, northbridge, and southbridge initialization takes place. Once the memory is initialized, LinuxBIOS decompresses and copies the rest of itself to low memory.

2. LinuxBIOS continues system initialization by walking the PCI tree starting at bus 0. Most of the AMD Geode device's internal configuration and initialization happens at this stage. Cache, system memory, and PCI region properties are configured. The VSA2 code is decompressed into low memory and executed.

3. The VSA2 initialization makes adjustments to the UMA for graphics memory and itself. VSA2 is then copied to the top of RAM and located logically in mid-PCI memory space, 0x80000000. VSA2 initializes the runtime components. Upon completion, control is returned to LinuxBIOS.

4. LinuxBIOS finishes enumeration and initialization of all the PCI devices in the system. PCI devices are allocated memory and I/O (including the VSA2 virtualized headers) and then enabled. During the southbridge PCI configuration, the presence of IDE-versus-flash capability and other configurations not controlled directly in PCI space are set up. The CPU device is the last device to enumerate, and an end-of-POST SMI is generated to signal to VSA2 that the system is configured.

5. The last stage of LinuxBIOS is to load the payload. LinuxBIOS copies itself to the top of system memory and then locates and decompresses the payload image into memory. Finally, the payload is executed. See Section 4 for more details about the payload.

3.4 VSA2

Virtual System Architecture (VSA) is the AMD Geode device's System Management Mode (SMM) software. VSA2 is the second generation of VSA; it supports the GX and LX CPUs and the CS5535 and CS5536 chipsets. In a traditional BIOS, VSA2 handles normal SMI/SMM tasks like bug fixes, legacy USB, and power management (legacy, APM, and ACPI). VSA2 also handles virtual PCI configuration space for the AMD Geode device's internal controllers: graphics, IDE, flash, etc. PCI virtualization translates PCI configuration-space accesses to the internal device's GeodeLink Model-Specific Registers (MSRs). PCI configuration access is infrequent, and virtualization is a good way to save silicon real-estate with software.

Since Linux manages most hardware devices on its own, it only requires VSA2 PCI virtualization. Linux kernel drivers handle the power management, USB, and graphics controllers that would normally be controlled by VSA2 in a legacy BIOS environment. In the embedded Linux environment, only the PCI virtualization portion of VSA2 is required. Omitting the unneeded code saves space in the LinuxBIOS ROM image for larger payloads. VSA2 is in the process of being ported to GNU tools and will be released as open source. This will enable the open source community to write Virtual System Modules (VSMs) for additional features, or to replace VSA2 entirely with a new AMD Geode chipset SMI handler.

The VSA2 image is compressed with NRV2B and concatenated to the beginning of the LinuxBIOS (with payload) image. (VSA2 is added at the beginning because execution starts at the end of the ROM image, where LinuxBIOS is located.)
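As a rough illustration of what PCI virtualization means here, the following conceptual sketch shows the shape of a virtualized configuration-space read. It is not the VSA2 source: both helper functions and the MSR mapping are assumptions.

/* Conceptual sketch only: an SMI handler satisfying a virtualized
 * PCI configuration read by translating it to a GeodeLink MSR
 * access.  Both helpers are hypothetical. */
#include <stdint.h>

extern uint64_t geodelink_msr_read(uint32_t msr_addr);                   /* hypothetical */
extern uint32_t msr_addr_for_pci(uint8_t dev, uint8_t fn, uint8_t reg);  /* hypothetical */

uint32_t vsa2_pci_cfg_read(uint8_t dev, uint8_t fn, uint8_t reg)
{
    /* Map the virtual PCI header register onto the internal
     * device's MSR and return it as the config-space value. */
    return (uint32_t)geodelink_msr_read(msr_addr_for_pci(dev, fn, reg));
}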

[Figure 1: LinuxBIOS Memory Map. The diagram shows the LinuxBIOS ROM image (LinuxBIOS loader, VSA2, payload) and the five boot stages that relocate VSA2, graphics memory, and the payload within PCI memory space and system memory, up to the top of RAM.]

4 Payloads

Once LinuxBIOS provides CPU and chipset initialization for the platform, it passes control to a payload that can continue the booting process. This is analogous to a traditional BIOS ROM, which also handles CPU and chipset initialization and then passes control to code that manages the BIOS services (such as the setup screen, access to block devices, and ultimately starting the process that ends up in the secondary bootloader). The traditional BIOS code is tied closely to the loader and only provides support for a limited set of features. By contrast, a LinuxBIOS payload is far more flexible. In theory LinuxBIOS can load and run any correctly formatted ELF file (though in practice, the payload must be able to run autonomously without any operating system services). This allows the developer to choose from any number of available open source options, from simple loaders to the Linux kernel itself. This flexibility also allows embedded developers to easily craft custom solutions for their unique platform—for instance, supporting diskless boot with Etherboot, or loading and running a kernel from NAND flash or other non-traditional media.

When LinuxBIOS has finished initializing and enumerating the system, it passes control to the ELF loader to load the payload. The payload loader locates the stream on the ROM and decodes the ELF header to determine where the segments should be copied into memory. Before copying, the loader first moves the firmware to the very top of system memory to lessen the chance that it will be overwritten by the payload. LinuxBIOS stays resident in case the ELF loader fails to load the specified payload. Many "standard" payloads (such as memtest86 and the Linux kernel) are designed to run on a system with a traditional BIOS image. Those payloads are loaded into firmware-friendly memory locations such as 0x100000. After copying and formatting the segments, the loader passes control to the entry point specified in the ELF header. System control leaves LinuxBIOS and passes to the payload.
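The following minimal sketch shows the standard program-header walk such an ELF loader performs. It assumes a 32-bit payload image already copied to memory at image, and omits validation, the firmware-relocation step, and error handling.

#include <elf.h>
#include <string.h>

/* Minimal sketch of an ELF payload loader: copy each PT_LOAD
 * segment to its load address, zero the BSS tail, then jump to
 * the entry point. */
typedef void (*payload_entry_t)(void);

static void load_elf_payload(const void *image)
{
    const Elf32_Ehdr *ehdr = image;
    const Elf32_Phdr *phdr =
        (const Elf32_Phdr *)((const char *)image + ehdr->e_phoff);

    for (int i = 0; i < ehdr->e_phnum; i++) {
        if (phdr[i].p_type != PT_LOAD)
            continue;
        memcpy((void *)phdr[i].p_paddr,
               (const char *)image + phdr[i].p_offset,
               phdr[i].p_filesz);
        memset((char *)(phdr[i].p_paddr + phdr[i].p_filesz), 0,
               phdr[i].p_memsz - phdr[i].p_filesz);
    }
    ((payload_entry_t)ehdr->e_entry)();   /* control leaves the firmware here */
}

These are the same program headers that readelf -l prints, as shown later in Table 1.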

4.1 Linux Kernel Payloads

While there are many different types of loaders, loading the Linux kernel directly from the ROM was the original goal of the LinuxBIOS project. The kernel can either be loaded by itself and use a local or network-based filesystem, or it can be accompanied by a small RAMdisk filesystem that provides additional services for finding and booting the final kernel image. This is known as "Linux as Bootloader," or simply LAB.

The Linux kernel is a very compelling payload for several reasons. The kernel already supports a huge number of different devices and protocols, supporting virtually any platform or system scenario. The kernel is also a well known and well supported entity, so it is easy to integrate and extend. Finally, the great majority of LinuxBIOS implementations are booting the Linux kernel anyway, so including it in the ROM greatly simplifies and accelerates the boot process. Using Linux as a bootloader further extends the flexibility by including a RAMdisk with user-space applications that can access a network or provide graphical boot menus and debug capabilities. (Some LinuxBIOS developers have been experimenting with fitting an entire root filesystem into the ROM; see reference [2].)

The challenge to using a main kernel as a LinuxBIOS payload is that it is often difficult to shrink the size of the kernel to fit in the ROM. This can be mitigated by using a larger ROM. In most cases the additional cost of the flash ROM is offset by the improved security and convenience of having the main kernel in the ROM image. Another concern is the ability to safely and quickly upgrade the kernel in the ROM image. It is a dangerous matter to flash the ROM, since a failed attempt usually results in a "brick" (an unbootable machine). This can be avoided in part by increasing the size of the flash ROM and providing a safe "fallback" image that gets invoked in case of a badly flashed image. The advantages outweigh the costs for embedded applications that rarely upgrade the kernel image.

As may be expected, the standard Linux binary files require some manipulation before they can be loaded. A tool called mkelfimage (written by Eric Biederman, Joshua Aune, Jake Page, and Andrew Ip) is used to combine the kernel text and data segments and to add setup code and an optional RAMdisk into a single loadable ELF file. Table 1 shows the program headers read by readelf from the loadable ELF file created by mkelfimage from a kernel image and a 1MB RAMdisk.


Type  Offset    VirtAddr    PhysAddr    FileSiz   MemSiz
LOAD  0x000144  0x00010000  0x00010000  0x0561c   0x1ab24
LOAD  0x005760  0x00091000  0x00091000  0x00000   0x00070
LOAD  0x005760  0xc0100000  0x00100000  0x264018  0x264018
LOAD  0x269778  0xc0365000  0x00365000  0x4b086   0xaf000
LOAD  0x2b47fe  0x00800000  0x00800000  0x100000  0x100000

Table 1: ELF Sections from Loadable Kernel

The first segment contains code similar to the Linux startup code that de-compresses the kernel and prepares the system to boot. This section also contains setup information such as the kernel command line string. The next segment allocates space for a GDT table that is used by the setup code. Kernel developers will note the familiar .text segment loaded to 0x100000 and the subsequent .data segment. Finally, the 1MB RAMdisk is copied to address 0x800000.

4.2 Other Payloads

Several popular Open Source Software (OSS) standalone applications have been adapted to run as LinuxBIOS payloads. These include the memtest86 memory tester and the etherboot network boot utility. etherboot is particularly interesting since it provides an open source alternative to the PXE protocol. It can easily enable any system to boot an image from a network, even with network cards that do not natively support PXE boot. Another interesting option appeared during the early days of the OLPC project, when Sun Microsystems unexpectedly released key portions of the OpenFirmware loader. Also known as OpenBoot, OpenFirmware is a firmware package programmed in Forth that serves as the bootloader on SPARC-based workstations and PowerPC-based systems from Apple and IBM. When it became available, OpenFirmware was quickly adapted to load on the OLPC platform as a LinuxBIOS payload.

4.3 Building Payloads

The payload is integrated with the LinuxBIOS loader during the LinuxBIOS build process. During configuration, the developer specifies the size of the LinuxBIOS ROM image and a pointer to the payload binary. Optionally, the payload can be NRV2B- or LZMA-compressed to conserve space at the expense of speed. Figure 2 shows a set of example configuration options for an AMD Geode processor-based target with a 1MB flash ROM and a compressed payload.

# option CONFIG_COMPRESSED_ROM_STREAM_NRV2B=0
option CONFIG_COMPRESSED_ROM_STREAM_LZMA=1
option CONFIG_PRECOMPRESSED_ROM_STREAM=1
# Need room for VSA
option ROM_SIZE=(1024*1024)-(64*1024)
...
romimage "fallback"
    ...
    payload /tmp/payload.elf
end

Figure 2: OLPC LinuxBIOS Configuration

During the LinuxBIOS build, the payload is compressed (if so configured) and integrated in the final ROM image as shown previously in Figure 1.

4.4 BuildROM

Constructing a LinuxBIOS ROM image from start to finish can be a complicated and tedious process involving a number of different packages and tools. BuildROM is a series of makefiles and scripts that simplify the process of building a ROM image by consolidating tasks into a single make target. This provides a reproducible build that can be replicated as required. BuildROM was inspired by Buildroot (http://buildroot.busybox.org) and was originally designed to build Linux-as-bootloader (LAB) based ROM images for the OLPC project. The OLPC LAB used a simple RAM filesystem that was based on uClibc and Busybox, and ran a simple graphical tool that could be used to load a kernel from USB, NAND, or the network. This involved no less than six packages and a number of tools—a nightmare for the release manager, and very difficult for the core team to duplicate and run on their own platforms. BuildROM simplified the entire process and makes it easy to build new ROM image releases as they are required. More recently, it has been extended to build a number of different platforms and payloads.

5 Advantages and Disadvantages of LinuxBIOS

Like most open source projects, LinuxBIOS continues to be a work in progress, with both positive and negative aspects.

Chief among the positive aspects is that LinuxBIOS is developer-friendly, especially when compared to traditional BIOS solutions. LinuxBIOS is mostly C-based, which greatly simplifies development. However, machine-generated code is almost always larger and slower than hand-tuned assembly, which is a liability, especially in the pre-DRAM section where speed and size are of the essence. As mentioned before, ROMCC does an amazing job of generating stackless assembly code, but due to the complexity of its task, it is difficult to optimize the code for minimum size and maximum efficiency.

Even though the LinuxBIOS code is written in C, the developer is not freed from having to look through the generated assembly to verify and debug the solution. Assembly code in LinuxBIOS is written in the AT&T format (as in all GNU tools-based projects), but many traditional BIOS projects and x86 CPU debuggers use the Intel format. This may create a learning barrier for developers transitioning to LinuxBIOS, as well as making it somewhat difficult to port existing source code to LinuxBIOS.

The current AMD Geode LinuxBIOS implementation is slower than expected. Benchmarks show that decompression and memory copy are slower than in other ROM implementations. More investigation is needed to determine why this happens.

The positive aspects of LinuxBIOS more than make up for these minor issues. LinuxBIOS uses a development environment familiar to embedded Linux developers. It is written in C and uses 32-bit flat mode. There is no need to worry about dealing with 16-bit real or big real modes.
In the end, while LinuxBIOS is backed by a strong open source community, it cannot exist without the support of the hardware vendors. The growth of LinuxBIOS will ultimately depend on convincing hardware companies that there is a strong business case for developing and supporting LinuxBIOS ports for their platforms.

6 Future Enhancements

There is still much to be done for the AMD Geode chipset LinuxBIOS project. LinuxBIOS version 3 promises to be a great step forward. Among the changes planned are:

• A new configuration system based on the kernel config system;

• Replacing remaining stackless pre-DRAM code with Cache-as-RAM (CAR) implementations;

• Speed and size optimizations in all facets of the boot process.

The AMD Geode chipset code will be transitioned to work with LinuxBIOS version 3, including better integration with the default CAR mode, and speed optimizations. Also, more work needs to be done to support a fallback image to reduce the chance that a failed ROM flash will break the target machine.

Changes are also in store for VSA2. The code will be ported to compile with GNU tools, and fully released so that others can build on the existing SMI framework. Further VSA2 work will center around power management, which will be new ground for LinuxBIOS-based ROMs. Finally, continuing work will occur to enhance BuildROM and to help make more diagnostic tools available to validate and verify LinuxBIOS in an open source environment.

7 Conclusion

LinuxBIOS is an exciting development in the world of the AMD Geode chipsets and x86 platforms in general. It facilitates the efforts of developers by avoiding the pitfalls of a traditional BIOS and provides great flexibility in the unique scenarios of embedded development. There is a great advantage for the AMD Geode processors in supporting LinuxBIOS, because LinuxBIOS allows designers to consider AMD Geode solutions in ways they never before thought possible (as evidenced by the early success story of the very non-traditional OLPC platform). We look forward to continuing to participate with LinuxBIOS as it transitions into version 3 and beyond.

8 Acknowledgements

The authors would like to extend a special thank you to all the people who helped make the AMD Geode LinuxBIOS image possible: Ollie Lo and the LinuxBIOS team, Yinghai Lu, Tom Sylla, Steve Goodrich, Tim Perley and the entire AMD Embedded Computing Solutions team, and the One Laptop Per Child core software team.

9 Legal Statement

Copyright © 2007 Advanced Micro Devices, Inc. Permission to redistribute in accordance with Linux Symposium submission guidelines is granted; all other rights reserved.

AMD, AMD Geode, AMD Opteron, AMD Athlon, and combinations thereof, and GeodeLink and 3DNow! are trademarks of Advanced Micro Devices, Inc. Linux is a registered trademark of Linus Torvalds. All other trademarks mentioned herein are the property of their respective owners.

References

[1] Yinghai Lu, Li-Ta Lo, Gregory Watson, Ronald Minnich. CAR: Using Cache as RAM in LinuxBIOS. http://linuxbios.org/data/yhlu/cache_as_ram_lb_09142006.pdf

[2] LinuxBIOS with X Server Inside, posted to the LinuxBIOS Developers Mailing List, March 2007. http://www.openbios.org/pipermail/linuxbios/2007-March/018817.html

[3] Ronald Minnich. LinuxBIOS at Four. Linux Journal #118, February 2004. http://www.linuxjournal.com/article/7170

[4] Ronald Minnich. Porting LinuxBIOS to the AMD SC520. Linux Journal #136, August 2005. http://www.linuxjournal.com/article/8120

[5] Ronald Minnich. Porting LinuxBIOS to the AMD SC520: A Followup Report. July 2005. http://www.linuxjournal.com/article/8310

[6] Advanced Micro Devices, Inc. AMD Geode LX Processors Data Book, June 2006. http://www.amd.com/files/connectivitysolutions/geode/geode_lx/33234E_LX_databook.pdf

[7] Advanced Micro Devices, Inc. AMD Geode CS5536 Companion Device Data Book, March 2006. http://www.amd.com/files/connectivitysolutions/geode/geode_lx/33238f_cs5536_ds.pdf

[8] Advanced Micro Devices, Inc. AMD Geode GX and LX Processor Based Systems Virtualized PCI Configuration Space, November 2006. http://www.amd.com/files/connectivitysolutions/geode/geode_gx/32663C_lx_gx_pciconfig.pdf

[9] LinuxBIOS. http://linuxbios.org, svn://openbios.org/repos/trunk/LinuxBIOSv2

[10] One Laptop Per Child. http://laptop.org

[11] OpenFirmware. http://firmworks.com, svn://openbios.org/openfirmware

[12] Memtest86. http://memtest86.com

[13] Etherboot. http://www.etherboot.org/wiki/index.php

GANESHA, a multi-usage with large cache NFSv4 server

Philippe Deniel, Thomas Leibovici, Jacques-Charles Lafoucrière
CEA/DIF
{philippe.deniel,thomas.leibovici,jc.lafoucriere}@cea.fr

Abstract

GANESHA is a user-space NFSv2, NFSv3, and NFSv4 server. It runs on Linux, BSD variants, and POSIX-compliant systems. It is available under the CeCILL license, which is a French transposition of the GPL and is fully GPL-compatible. The protocol implementation is fairly complete, including GSSAPI security hooks. GANESHA is currently in production at our site, where, thanks to a large cache and a lot of threads, it delivers up to a hundred thousand NFS operations per minute. This paper describes the current implementation as well as future developments. This includes GANESHA as a NFS Proxy server and NFSv4.1 enhancements, but also access to LDAP and SNMP information using the file system paradigm.

1 Introduction

NFS is a well known and venerable network protocol which is widely used. NFSv4 is the latest version of the protocol; it fully reconsiders its semantics and the way NFS can be used.

We manage a huge compute center at CEA. In the past three years, we have had to face a strong increase in the amount of data produced by our supercomputers, up to tens of terabytes a day. Archived results and files are stored in HPSS, a third-party vendor's HSM which had a NFS interface. NFS fits our needs well in terms of file meta-data management, but there were several limitations in the product that made for a difficult bridge between the HSM and NFS, and we believed it was time to step to something new. The HPSS product has a user-space API, complete enough to do all manipulation of files and directories. The decision to write a brand new daemon to handle the NFS interface we needed to HPSS was natural, and the following ideas led the design process:

• The new product should be able to manage very large data and meta-data caches (up to millions of records), to avoid congestion on the underlying file system.

• The new product should be able to provide the NFS interface we needed to HPSS, but should also be able to access other file systems.

• The new product should support the NFSv4 protocol, and its related features in terms of scalability, adaptability, and security.

• The new product should be able to scale as much as possible: software congestion and bottlenecks should be avoided; the only limits would come from the hardware.

• The new product should be a free software program.

• The new product should run on Linux, but be portable to other platforms.

These considerations drove the design of GANESHA. This paper will provide you with additional information about it. The generic architecture and the way it works will be described, and you'll see how GANESHA can be turned into a "very generic" NFS server (using only POSIX calls from LibC) or a NFSv4 Proxy as well. Information will also be provided on the way to write packages to extend GANESHA in order to make it manage various name-spaces.

The paper first describes NFSv4 and the technical reasons that led to a user-space NFS daemon. The architecture of the product is then detailed, including the issues that were met and how they were solved. Some actual results are shown before concluding.


2 Why a NFSv4 server in User Space?

GANESHA is not a replacement for the NFSv4 server implemented in the kernel; it is a brand new program, with its advantages and disadvantages. For some aspects the NFSv4 server in the kernel should be more efficient, but there are several domains (for example, building a NFSv4 Proxy server) in which the user-space approach provides many interesting things.

First of all, working in user space makes it possible to allocate very large pieces of memory. This memory can then be used to build internal caches. Feedback from using GANESHA in production showed that 4 Gigabytes were enough for making a million-entry cache. On a x86_64 platform, it is possible to allocate even bigger memory chunks (up to 16 or 32 GB, depending on the machine's resources). Caching about 10 million entries becomes possible.

A second point is portability. If you write kernel code, it will be acquainted with the kernel's structure and it won't be possible to port it to a different OS. We kept Linux (i686 or x86_64) as the primary target, but we also wanted to compile and run it on different architectures, keeping them as secondary targets. Most of the Free Software community is very close to Linux, but there are other free operating systems (FreeBSD or OpenSolaris) and we have wanted to be compatible with them since the beginning of the project. Another consideration is the code itself: something that compiles and runs on different platforms is generally safer than a "one target" product. Our experience as developers showed that this approach always pays back; it often reveals bugs that would not have been so easily detected on Linux, because resources are managed differently. Portability doesn't only mean "running on several OSes"; for a NFS server it also means "managing different file systems." The NFSv2 and NFSv3 protocols have semantics very close to the way Unixes manage file systems. Because of this, it was almost impossible to have NFS support for a non-UNIX-related file system. One design consideration of NFSv4 was to make the protocol able to manage as many file systems as possible. Because of this, it requires a very reduced subset of file/directory attributes to be supported by the underlying file system, and it can manage things as simple as a FAT16 file system (which has almost none of the attributes you expect in "regular" file systems). When designing GANESHA, we wanted to keep this idea: managing as many file systems as possible. In fact, it is possible with the NFSv4 semantics to manage every set of data whose organization is similar to a file system: trees whose nodes are directories and whose leaves are files or symbolic links. This structure (referred to as the name-space structure in this paper) maps to many things: file systems of course, but also information accessible through a SNMP MIB or LDAP-organized data. We chose to integrate this functionality into GANESHA, making it a generic NFSv4 server that can manage everything that can be managed by NFSv4. Doing this is not very easy within the kernel (kernel programming is subject to lots of constraints); designing the daemon to run in user space then became natural.

A last point is also to be considered: accessing services located in user space is very easy when you already are in user space. NFSv4 support in the kernel introduced the rpc_pipefs mechanism, which is a bridge used by kernel services to address user-space services. It is very useful for managing security with kerberos5, or when the idmapd daemon is asked for a user-name conversion. This is not required with GANESHA: it uses the regular API for the related service.

These reasons naturally led the project to a user-space daemon. We also wanted to write something new and open. There was already efficient NFSv4 support within the kernel code; rewriting something else would have made no sense. This is why GANESHA is a user-space daemon.

3 A few words about NFSv4

NFS in general, and more specifically NFSv4, is a central aspect of this paper. People are often familiar with NFS, but fewer are aware of the features of NFSv4.

NFSv2 was developed by Sun Microsystems in 1984. It showed limits, and this led to the birth of NFSv3, which was designed in a more public forum by several companies. Things were a bit different with NFSv4. The protocol has been fully developed by an IETF working group (IETF is responsible for the standardization of protocols like IPv4, IPv6, UDP, and TCP, or "higher-level" things like FTP, DNS, and HTTP). The design began with a birds-of-a-feather meeting at IETF meetings. One of the results was the formation of the NFS version 4 working group in July, 1997.

Goals of the working group when designing the protocol were:

• improve access and performance on the Internet;

• strong security with negotiation built into the protocol;

• easier cross-platform interoperability;

• the protocol should be ready for protocol extensions.

NFSv4 integrates features that allow it to work correctly on a WAN, which is a network with low bandwidth and high latency. This is done by using experience obtained with protocols like WebNFS. NFSv4 will then use compound requests to limit messages and send as much information as possible in each of them. To reduce traffic, the caching capabilities were truly extended, making the protocol ready for implementations of very aggressive caching and for an NFSv4 proxy server.

Scalability and availability were improved, too; a strong stateful mechanism is integrated in the protocol. This is a major evolution compared to NFSv2 and NFSv3, which were stateless protocols. A complex negotiation process occurs between clients and server. Due to this, NFSv4 can allow a server with a strong load to relocate some of its clients to a less busy server. This mechanism is also used when a client or server crash occurs, to reduce the time to full recovery on both sides.

Security is enhanced by making the protocol a connection-oriented protocol. The use of RPCSEC_GSS is mandatory (this protocol is an evolution of ONC/RPC that supports extended security management—for example the krb5 or SPKM-3 paradigm—by use of the GSSAPI framework) and provides "RPC-based" security. The protocol is connection-oriented and will require TCP (not UDP like NFSv2 and NFSv3), which makes it easier to have connection-based security.

The structure and semantics of NFSv3 were very close to those of UNIX. For other platforms, it was difficult to "fit" in this model. NFSv4 manages attributes as bitmaps, with absolutely no link to previously defined structures. Users and groups are identified as strings, which allows platforms that do not manage uid/gid like UNIX to interoperate via NFSv4.

The protocol can be extended by support of "minor versions." NFSv4 is released and defined by RFC 3530, but evolutions are to be integrated into it, providing new features. For example, support for RDMA, support for the PNFS paradigm, and the new mechanism for "directory delegation" are to be integrated in NFSv4. They will be part of NFSv4.1, whose definition is in process.

4 Overview

This section describes the design considerations for GANESHA. The next sections will show you how these goals were achieved.

4.1 The CeCILL License

GANESHA is available as a Free Software product under the terms of the CeCILL license. This license is a French transposition of the GPL made by several French research organizations, including CEA, CNRS, and INRIA. It is fully GPL-compatible.

The use of the GNU General Public License raised some legal issues. These issues led to uncertainties that may prevent contributions to Free Software. To provide better legal safety while keeping the spirit of these licenses, three French public research organizations, the CEA, the CNRS, and INRIA, launched a project to write Free Software licenses conforming to French law. CEA, CNRS, and INRIA released CeCILL in July, 2004. CeCILL is the first license defining the principles of use and dissemination of Free Software in conformance with French law, following the principles of the GNU GPL. This license is meant to be used by companies, research institutions, and all organizations willing to release software under a GPL-like license while ensuring a standard level of legal safety. CeCILL is also perfectly suited to international projects.

4.2 A project on Open-Source products

GANESHA was fully written and developed using Free Software. The resources available for system programming are huge and comprehensive, and this made the task much easier on Linux than on other Unixes.

The tools used were:

• gcc (of course. . . )

• gdb for debugging, often used jointly with Electric Fence or the Dmalloc library for memory debugging.

• valgrind for caring about memory leaks.

• doxygen for generating the various documents about the APIs' calls and structures.

• GIT as source code repository manager.

• PERL and SWIG to wrap API calls in order to write non-regression scripts.

• the Connectathon test suite, which is a test suite designed for the validation of NFS client-server behavior.

• PyNFS, a non-regression test tool written in Python by the CITI folks. (CITI's site contains bunches of interesting stuff for people interested in NFSv4.)

4.3 A layered product

[Figure 1: GANESHA's layered architecture. The block diagram shows clients feeding the RPC Dispatcher; the RPCSEC_GSS and Dup Req modules with GSSAPI security; the Mount v1/v3 and NFS v2/v3/v4 protocol modules; the Metadata Cache and File Content Cache backed by Hash Tables and the Memory Manager; the Logging Module (Syslog API) and Administration Module (External Control API); and the File System Abstraction Layer sitting on the Name Space API.]

GANESHA is designed as a layered product. Each layer is a module dedicated to a specific task: data and meta-data caching, RPCSEC_GSS and protocol management, accessibility to the file system, and so on. All these functionalities are handled by specific modules. Each module has a well defined interface that was designed before starting to write a single line of code. Such a modular design is good for future code maintenance. Furthermore, one can write new algorithms within a layer without changing the rest of the code: cache management could change within the cache layers, or a different name-space could be managed, but these changes should not impact the other modules. Efforts were made to reduce adherences between layers. This was costly at the beginning of the project, but on a mid-range time scale it appeared that this simplified a lot of the rest of the project. Each layer could be developed independently, by different developers, with their own validation and non-regression tests. A "global make" step can then re-assemble all the pieces; the risk of breakage should be reduced if all of them complete their validation tests.

A few modules are the very core of GANESHA:

• The Buddy Malloc module manages the memory used by GANESHA.

• The RPCSEC_GSS module handles the data transport via the RPCSEC_GSS protocol. It manages security by accessing the security service (usually krb5, SPKM-3, or LIPKEY).

• The NFS protocol modules perform the management of the structures used for the NFS messages.

• The Cache Layer manages a very large cache for meta-data.

• The File Content Layer manages data caching. It is closely acquainted with the Cache Inode Layer.

• The File System Abstraction Layer is a very important module: it wraps, via a well defined interface, the calls used to access a name-space. The objects it addresses are then cached by the Cache Inode and File Content layers.

• The Hash Table Module provides Red-Black-Trees-based hash tables. This generic module is widely used within GANESHA to provide associative addressing.

These modules will be discussed in more detail in the next sections.

4.4 Managing memory

The main issue is memory management. Almost all modules within GANESHA's architecture have to perform dynamic memory allocation. For example, a thread managing a NFS request may need to allocate a buffer for storing the requested result. If the regular LibC malloc/free calls are used, there are risks of fragmenting memory, because some modules will allocate large buffers while others will use much smaller ones. This could lead to a situation where part of the memory used by the program is swapped to disk, and performance would quickly drop.

For this reason, GANESHA implements its own memory manager. This module, which is used by all the other parts of GANESHA, allows each thread to allocate its own piece of memory at startup. When a thread needs a buffer, it will look into this space to find an available chunk of the correct size. This allocation is managed by the Buddy Malloc algorithm, the same that is used by the kernel. Use is also made of the madvise syscall to tell the Linux memory manager not to move the related pages. The behavior of the daemon towards memory is then to allocate a single large piece of memory. If there is no other "resource consuming" daemon running on the same machine, the probability of this piece of memory not being swapped is high. This will maintain performance at a good level.
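As a hedged sketch of the startup side of this scheme (the names and the pool size are illustrative, and GANESHA's actual Buddy Malloc API is not shown), each thread might obtain its region like this:

#include <sys/mman.h>

#define THREAD_POOL_SIZE (64UL * 1024 * 1024)   /* illustrative size */

/* One large, page-aligned region per thread, handed to the buddy
 * allocator; the paper mentions madvise() as a hint to the kernel
 * about these pages (MADV_WILLNEED is one plausible such hint). */
static void *thread_pool_init(void)
{
    void *pool = mmap(NULL, THREAD_POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED)
        return NULL;
    madvise(pool, THREAD_POOL_SIZE, MADV_WILLNEED);
    return pool;   /* the buddy allocator carves buffers out of this */
}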
4.5 Managing the CPU resource

The second resource is the CPU. This is much easier to manage than memory. GANESHA is massively multi-threaded and will have dozens of threads at the same time (most of them are "worker threads," as we'll see later). POSIX calls for managing threads help us a lot here: we can use them to tell the Linux scheduler not to manage the pack of threads as a whole, but to consider each of them separately. (This is the PTHREAD_SCOPE_SYSTEM behavior, as opposed to the PTHREAD_SCOPE_PROCESS policy, which would not lead to the expected result.) With a multi-processor machine, such an approach allows the workload to "spread across" all of the CPUs. What is also to be considered is potential deadlocks. In a multi-threaded environment, it is logical to have mutexes to protect some resources from concurrent accesses. But having bunches of threads is not useful if most of them are stuck on a bottleneck. Design considerations were taken into account to avoid this situation.

First, reader/writer locks were preferred to simple mutexes. Because the behavior of reader/writer locks may differ from one system to another, a small library was written to provide this service (which was a required enhancement in terms of portability).

Second, if threads share resources, this common pool could turn into a bottleneck when many threads exist together. This was avoided by allocating resources per thread. This consideration has a strong impact on the threads' behavior, because there can't be a dedicated garbage collector. Each thread has to perform its own garbage collection and has to reassemble its resources regularly. To avoid congestion, a mechanism (located in the "dispatcher thread" described below) prevents too many threads from performing this operation at the same time (a period during which they are not available for doing their "regular" job). Cache layers that require this kind of garbage collection have been designed so that the process can be divided into several steps, each undertaken by a separate agent. Experience "in real life" shows that this solution was suitable when the number of threads is large compared to the number of threads allowed to start garbage collecting (60 threads running concurrently when 3 could stop working at the same time). This experience also showed that the memory chunk allocated per thread was far larger than what is needed for a single request (about 20 times the size). In this situation, the impact of memory management is almost invisible: an incoming request finds a non-busy thread most of the time. Side effects will only become visible under a very large load (hundreds to thousands of requests per second).
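Both mechanisms mentioned here are plain POSIX. A minimal sketch (the worker body is hypothetical):

#include <pthread.h>

extern void *worker_thread(void *arg);   /* hypothetical worker body */

/* Reader/writer lock protecting a shared structure: many readers
 * may proceed concurrently, writers get exclusive access. */
pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

int start_worker(pthread_t *tid)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* compete for the CPU against all threads in the system,
     * not only against those of this process */
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    return pthread_create(tid, &attr, worker_thread, NULL);
}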

4.6 The Hash Tables: a core module for associative addressing

Associative addressing is a service that is required by many modules in GANESHA—for example, finding an inode knowing its parent and name, or finding the structure related to a NFSv4 client, knowing its client ID. The API for this kind of service will be called very often: it has to be very efficient to enhance the daemon's global performance. The choice was made to use an array of Red-Black Trees (abbreviated RBT in the rest of this paper). RBTs have an interesting feature: they re-balance themselves automatically after add/update operations and so stay well balanced. RBTs use a computed value, defined as the RBT value in this document, to identify a specific contiguous region of the tree. Several entries stored in the RBT can produce the same RBT value; they'll reside in the same area, but this will decrease the performance. Having a function to compute "well diversified" RBT values is then critical.

This supposes actual knowledge of the data on which the value is computed. Because of this, it is hard to have a "generic RBT value function"; a new one is to be developed for each use.

Bottlenecks could occur if a single RBT were used: several threads could perform add/update operations at the same time, causing conflicting re-balances simultaneously. It then appears that RBTs are to be protected by reader/writer locks, and this could quickly become a bottleneck. Working around this issue is not difficult: using several RBTs (stored in an array) solves it. If the number of RBTs used is large compared to the number of concurrent threads that can access them (more than 15 times bigger), then the probability of two threads working on the same tree becomes pretty small. This will not use more memory: each of the 15 (or more) trees will be 15 times smaller than the single one would have been. There is an inconvenience: an additional function is required to compute the index of the RBT to be used. Implementing two functions is then needed for a single hash table: one to compute the index, the other to compute the RBT value (see the sketch below). They must be different enough to split data across all the trees. If not, some RBTs would be very small and others very large. Experience shows that specific non-regression tests were necessary to check for the "independence" of these two functions.
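A minimal sketch of this double-hashing scheme follows; the hash functions and names are illustrative, not GANESHA's.

#include <stdint.h>

#define NB_RBT 256   /* many more trees than concurrent threads */

struct rbt;          /* opaque Red-Black Tree type */
extern struct rbt *rbt_array[NB_RBT];
extern void *rbt_find(struct rbt *t, uint64_t rbt_value, const void *key);

/* first function: selects which tree of the array to use */
static uint64_t hash_index(const char *name, uint64_t parent)
{
    uint64_t h = parent;
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % NB_RBT;
}

/* second, independent function: the RBT value used inside the tree */
static uint64_t hash_rbt_value(const char *name, uint64_t parent)
{
    uint64_t h = parent ^ 0x9e3779b97f4a7c15ULL;   /* different mixing */
    while (*name)
        h = (h << 5) ^ (h >> 3) ^ (unsigned char)*name++;
    return h;
}

void *cache_lookup(const char *name, uint64_t parent_handle)
{
    struct rbt *t = rbt_array[hash_index(name, parent_handle)];
    /* a reader/writer lock on t would be taken here */
    return rbt_find(t, hash_rbt_value(name, parent_handle), name);
}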

4.7 A massively multi-threaded daemon

GANESHA runs lots of threads internally. As shown in the previous sections, most of its design considerations were oriented to this massively multi-threaded architecture. The threads are of different types:

• GANESHA supports NFSv2, NFSv3, NFSv4, and the ancillary MOUNT protocol v1 and v3. The dispatcher thread listens for incoming NFS/MOUNT requests, but won't decode them. It will choose the least busy worker and add the request to its list of requests to process. Duplicate request management is done here: this thread keeps track of the previously managed requests by keeping the replies sent within the last 10 minutes (they are stored in a hash table and addressed with the RPC Xid value; see the definition of the ONC/RPC protocol for details). Before associating a worker with a request, it looks at this DRC (Duplicate Request Cache). If a matching RPC Xid is found, then the former reply is sent back again to the client (a sketch of this check follows the list). This thread will use the RPCSEC_GSS layer, mostly.

• The worker threads do most of the job. Many instances (several dozen) of this kind of thread exist concurrently. They wait for the dispatcher thread to provide them with a request to manage. They will decode it and use Cache Inode API and File Content API calls to perform the operation required for this request. These threads are the very core of the NFS processing in GANESHA.

• The statistics manager collects stats from every layer for every thread. It periodically writes down the data in CSV (Comma Separated Value) format for further treatment. A dedicated PERL script, ganestat.pl, is available with the GANESHA rpm as a "pretty printer" for this CSV file.

• The admin gateway manages a dedicated protocol. This allows administrative operations to be done remotely on the daemon. These operations include flushing caches, syncing data to FSAL storage, or performing a slow and clean shutdown. The ganeshadmin program, provided with the distribution, is used to interact with this thread.
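A sketch of the duplicate-request check with hypothetical names (the real DRC keys a hash table with the RPC Xid, as described above):

#include <stdint.h>
#include <time.h>

struct cached_reply { uint32_t xid; time_t sent_at; const void *reply; };

extern struct cached_reply *drc_lookup(uint32_t xid);   /* hypothetical */
extern void resend(const void *reply);                  /* hypothetical */

int dispatch_is_duplicate(uint32_t xid)
{
    struct cached_reply *r = drc_lookup(xid);
    if (r != NULL && time(NULL) - r->sent_at < 600) {
        resend(r->reply);   /* client retransmitted: replay the former reply */
        return 1;           /* do not hand the request to a worker */
    }
    return 0;
}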
4.8 Dealing with huge caches

As stated above, GANESHA uses a large piece of memory to build large caches. The data and meta-data caches are the largest caches in GANESHA.

Let's focus first on the meta-data cache, located in the Cache Inode Layer. Each of its entries is associated with an entry in the name-space (a file, a symbolic link, or a directory; in the current version, objects of type socket, character, or device are not managed by GANESHA). This entry is itself associated with a related object in the File System Abstraction Layer (see the next section) identified by a unique FSAL handle. The meta-data cache layer maps in memory the structure it reads from the FSAL calls, and it tries to keep in memory as many entries as possible, with their parent-children dependencies. The meta-data cache uses hash tables intensively to address the entries, using the FSAL handle to address the entry associatively. With the current version of GANESHA, a simple write-through cache policy is implemented. The attributes kept for each object (the file attributes and the content of the directories) expire after a configurable grace period. If expired, they'll be renewed if they are accessed before being erased from the cache.

Garbage collection is more sophisticated. Because there is no common resource pool, each thread has to perform garbage collection itself. Each thread keeps a LRU list of the entries on which it works. A cached entry can exist only within one and only one of these lists, so if a thread accesses an entry which was previously accessed by another, it acquires this entry, forcing the other thread to release it. When garbage collection starts, the thread goes through this list, starting from the oldest entry. It then uses a specific garbage policy to decide whether each entry should be kept or purged. This policy is somewhat specific. The meta-data cache is supposed to be very large (up to millions of entries) and no garbage collection will occur before at least 90% of this space is used. We chose to keep as much as possible the "tree topology" of the name-space viewed by the FSAL in the cache. In this topology, nodes are directories and leaves are files and symbolic links. Leaves are garbage-collected before nodes. Nodes are garbaged only when they contain no more leaves (typically an empty directory, or a directory where all entries were previously garbaged). This approach explicitly considers that directories are more important than files or symbolic links, but this should not be an issue. Usually a name-space contains a lot more files than directories, so it makes sense to garbage files first: they occupy most of the available space. Because the cache is very large, parts of it tend to be "sleeping areas" that are no longer accessed. The garbage collection routine within each worker thread, which manages the oldest entries first, quickly locates these and cleans them. With our workload and file system usage, this policy revealed no problem. When the garbage collection's high water mark is reached, the number of entries cached begins to oscillate regularly between the low water mark and the high water mark. The period of the oscillation is strongly dependent on the average load on the server.

The data cache is not managed separately: if the content of a file is stored in the data cache, this becomes a characteristic of the meta-data cached entry. The data cache is then a "child cache" to the meta-data cache: if a file is data-cached, then it is also meta-data cached. This avoids incoherencies between these two caches, since they are two sides of the same coin. Contents of the files which are cached are stored in dedicated directories in a local file system. A data-cache entry corresponds to two files in this directory: the index file and the data file. The index file contains the basic meta-data information about the file; the most important one is its FSAL handle. The data file is the actual data corresponding to the cached file. The index file is used to rebuild the data cache in the event that the server crashes without cleanly flushing it: the FSAL handle will be read from this file, and then the corresponding meta-data cache entry will be re-inserted as well, made to point at the data file so that the data-cached entry can be reconstructed. Garbage collection is performed at the same time as meta-data cache garbage collection. Before garbaging files, the meta-data cache asks the data cache whether it knows a given entry or not. If not, regular meta-data garbage collection is performed. If yes, the meta-data cache asks the data cache to apply its garbage policy on it, and eventually flush or purge it. If the file is cleaned from the data cache, it can be garbaged from the meta-data cache. A consequence of this is that a file which has an active entry in the data cache will never be cleaned from the meta-data cache. This way of working fits well with the architecture of GANESHA: the worker threads can manage the data cache and meta-data cache at the same time, in a single pass. As stated above, the two caches are in fact the same, so no incoherence can occur between them. The data cache has no scalability issue (the paths to the related files are always known by the caches) and does not impact the performance of the meta-data cache. The policy used for the data cache is a "write-back" policy, and only "small" files (smaller than 10 MB) are managed; others are accessed directly, ignoring the data cache. Smarter or more sophisticated algorithms can be implemented—for example, the capability, for very large files, to cache a region of the file but not the whole file. This implementation could be linked to NFSv4 improvements like NFSv4 named attributes or the use of the PNFS paradigm (which is part of the NFSv4.1 draft protocol).
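A condensed sketch of this per-thread collection pass (illustrative names; purge() is assumed to free the entry and update the entry count):

#include <stddef.h>

struct cache_entry {
    struct cache_entry *next;        /* LRU order, oldest first */
    int is_directory;
    int nb_children;
};

extern unsigned long nb_entries, high_water, low_water;
extern void purge(struct cache_entry *e);   /* hypothetical: frees e, decrements nb_entries */

void gc_pass(struct cache_entry *oldest)
{
    if (nb_entries < high_water)
        return;                      /* cache not yet ~90% full: nothing to do */
    while (oldest != NULL && nb_entries > low_water) {
        struct cache_entry *next = oldest->next;
        /* leaves first; a directory only once it has no children */
        if (!oldest->is_directory || oldest->nb_children == 0)
            purge(oldest);
        oldest = next;
    }
}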
They exist in differ- cache is then a ‘child cache’ to the meta-data cache: if ent incarnations: HPSS FSAL, POSIX FSAL, NFSv4 120 • GANESHA, a multi-usage with large cache NFSv4 server

Proxy FSAL, SNMP FSAL, and LDAP FSAL. They objects. All the file systems managed by the machine provide access to the underlying file name-space. They on which the daemon is running (depending on its ker- wrap all the calls used for accessing it into a well defined nel) will be accessible via these functions; using them API. This API is then used by the Cache Inode and File in GANESHA provides generic NFS access to all of Content module. FSAL can use dedicated APIs to ac- them. The inconvenience is that POSIX calls often use cess the name-space (for example, the SNMP API in the the pathnames to the objects to identify them. This is no case of SNMP FSAL), but this API will completely hid- persistent information about the object (a rename could den from the other modules. FSAL semantics are very be performed on it, changing its name). This does not fit close to the NFSv4 semantics, an approach that is re- with the pre-requisite to build FSAL, as described in the peated in the Cache Layers. This uniformity of seman- previous subsection. Another “more persistent” identi- tics, close to native NFSv4, makes the implementation fier is to be found. The choice was made to use an an- of this protocol much easier. Objects within FSAL are cillary database (basically a PostgreSQL base) to build addressed by an FSAL Handle. This handle is supposed and keep the identifier we need. The tuple (inode num- to be persistent-associated with a single FSAL object by ber, file system ID, ctime attributes) is enough to fully an injective relationship: two different objects will al- identify an object, but the name should be used to call ways have different handles. If an object is destroyed, the POSIX functions. The database will keep parent- its handle will never be re-used for another FSAL ob- hood relationship between objects, making it possible to ject. Building a new FSAL is the way to make GANE- rebuild the full path to it, by making a kind of “reverse SHA support a new name-space. If the produced FSAL lookup” when needed. SQL optimization and pathname fits correctly with the provided non-regression and vali- caching were used a lot in the module. A complete de- dation tests, then the GANESHA daemon need only be scription of the process would require a full paper. Why recompiled with this new FSAL to provide export over develop such a module when it could be much easier NFS for it. Some implementation documents are avail- to use the NFS interface in the kernel? The answer is able in the GANESHA distribution. External contribu- linked with the resource we use at our compute center. tors may actively participate to GANESHA by writing additional FSALs. Templates for FSAL source code are GANESHA can access more file systems than most available in the GANESHA package. available kernels at our site. We had the need to access the LUSTRE file system, but some machines were not LUSTRE clients. In most cases, they are not Linux ma- 5.1 The HPSS FSAL chines. We strongly needed them to be able to access the LUSTRE name-space. This could not be done via NFS This FSAL is not related to Free Software, but a few kernel support: this NFS implementation uses the VFS words must be said for historical reasons, because it layer a lot, a part of the kernel that is often bypassed strongly contributed to the origin of the project. We by the LUSTRE implementation for optimization. This are using the HSM named HPSS,8 a third-party vendor approach, using the simple POSIX calls to access LUS- product sold by the IBM company. 
5.3 The NFSv4 Proxy FSAL

When designing GANESHA, we had one thought: having an NFSv4 proxy would be great. NFSv4 has lots of features that are designed for implementing aggressive cache policies (file delegation is a good example). GANESHA is designed to manage huge caches. The "wedding" seems very natural here. The NFSv4 Proxy FSAL wraps NFSv4 client calls into FSAL calls. It turns the back-end part of GANESHA into an NFSv4 client, turning the whole daemon into an NFSv4 proxy server. The mechanism of file delegation is a feature of NFSv4 that is quite interesting here. It allows a file to be "fully acquired" by a client for a given period of time. Operations on the file, such as IO operations and modification of its attributes, will be done on the client directly, without disturbing the server; this guarantees that no other clients will access it. Depending on the kind of delegation used, the server may use transient callbacks to update information about the file. When the delegation ends, the server recovers the file, getting the new state of the file from the client. Delegation, used jointly with GANESHA's meta-data and data caches, is very efficient: accessing a file's content will be done through the data cache, once a delegation on the file has been acquired. The policy for the NFSv4 Proxy FSAL will be to acquire as many delegations as possible, populating GANESHA's caches. With a well populated cache, GANESHA will become able to answer many requests by proxy. In NFSv4.1, a new feature is added: directory delegation. This will allow the contents of directories to be delegated and acquired by clients in the same way that file contents are. Used with GANESHA's meta-data cache, this feature will be very interesting.

This module is still under development.

5.4 The "Ghost FS" FSAL

This FSAL is a very simple one and is not designed for production use. It just emulates the behavior of a file system in memory, with no persistent storage. The calls to this FSAL are very quick to return because all the work is done in memory; no other resources are used. Other FSALs are always much slower than the cache layer (otherwise there would have been no need for caches), which makes it hard to evaluate the performance of the meta-data and data cache modules on their own. With the "Ghost FS" FSAL, calls to these layers can be easily qualified, and it is possible to identify the most costly calls, and thus to optimize GANESHA.

This module is available.

5.5 The LUSTRE FSAL

As mentioned above, LUSTRE is a file system we use a lot, and we would like to access it from machines that are not LUSTRE clients. We already developed the POSIX FSAL for this, but having something more closely acquainted with LUSTRE would be nicer. Having a user-space LUSTRE API able to perform operations in a handle-based way would be something very interesting: it would allow us to wrap the API into a LUSTRE FSAL, making access to this file system via the GANESHA NFSv4 interface much more efficient than what we have with the POSIX FSAL. We also hope to use the NFSv4 named attributes (which are basically the way NFSv4 manages extended attributes) to provide clients with LUSTRE-specific information about a file; the resident OST of the file is a good example (an OST, or Object Storage Target, is the way LUSTRE views a storage resource).

This module is under definition. It will be finalized as soon as a handle-based LUSTRE API is available.

5.6 The SNMP FSAL

The SNMP protocol organizes sets of data as trees. The overall structure of the trees is defined by files named MIBs (Management Information Bases). Knowing the MIB yields the ability to compute the OID (Object ID) used to access a given management value. This OID is basically a list of numbers: each of them identifies a node at the given level in the tree, and the last one identifies the leaf where the data resides. For example, .1.3.6.1.4.1.9362.1.1.0 identifies the Uptime value in the SNMPv2 MIB. This OID is used to query an SNMP agent about the time since the last reboot of the machine. OIDs can also be printed in a "symbolic" way, making them more human readable. In the previous example, .1.3.6.1.4.1.9362.1.1.0 is printed as SNMPv2-MIB::system.sysUpTime. This tree structure is in fact a name-space: each SNMP-accessible variable can be seen as a "file object" whose content is the value of the variable, and there are "directories," which are the nodes in the MIB structure. OIDs are very good candidates for being handles to SNMP objects, and are to be mapped to names (the symbolic version of the OID). This clearly shows that SNMP has enough features to build an FSAL on top of it. Using it with GANESHA will map the SNMP information into an NFS export, able to be browsed like a file system. It is then possible to browse SNMP in a similar way to the /proc file system. In our example, handle .1.3.6.1.4.1.9362.1.1.0 would map to (mounted NFS PATH)/SNMPv2-MIB/system/sysUpTime, and a read operation on SNMPv2-MIB/system/sysUpTime would yield the corresponding value.

Some SNMP values are settable: in this approach, they could be changed by writing to the file corresponding to them.

This module is under development.
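
To make the mapping concrete, the following hypothetical helper (not GANESHA code) renders a numeric OID as a slash-separated relative path; the caller would prepend the export root, and a real implementation would substitute the symbolic MIB names instead of raw numbers:

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical helper, not GANESHA code: render an OID such as
     * .1.3.6.1.4.1.9362.1.1.0 as the relative path
     * 1/3/6/1/4/1/9362/1/1/0 below an NFS export root. */
    static int oid_to_path(const unsigned *oid, size_t len,
                           char *buf, size_t buflen)
    {
        size_t off = 0;

        for (size_t i = 0; i < len; i++) {
            int n = snprintf(buf + off, buflen - off,
                             i ? "/%u" : "%u", oid[i]);
            if (n < 0 || (size_t)n >= buflen - off)
                return -1;          /* buffer too small */
            off += (size_t)n;
        }
        return 0;
    }
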

5.7 The LDAP FSAL

The idea for this FSAL is the same as for the SNMP FSAL. LDAP has a name-space structure and is accessible via a user-space API. This FSAL simply wraps this API to provide FSAL support, and then NFS support via GANESHA for LDAP. LDAP information will then be browsable, like /proc, via NFS.

This module is under development.

6 Performances and results

In this section, we will show GANESHA's scalability through an actual test. The test is as follows: a specific tool was written to perform, in a multi-threaded way (the number of threads is configurable), what find . -ls does, which is scanning a whole large tree in a name-space. This tree contained 2220 directories on 3 levels; each of them contained 50 files (which means more than 110,000 files were in the whole tree). The test utility ran on several client nodes (up to 4 machines) using the same server. The multi-threaded test utility was run on each of these 4 clients with 64 threads each. This was equivalent to 256 cache-less clients operating concurrently. The server machine was an IBM x366 server with four Intel Xeon 3 GHz processors and 4 GB of RAM, running GANESHA built with the POSIX FSAL. Two groups of measurements were made. The first was done with a server whose meta-data cache is empty (Figure 2), and the second (Figure 3) with the same server with a preloaded cache. In this second step, the read entries exist in the memory of the server, and the performance of the meta-data cache can be compared to the raw FSAL performance.

Figure 2: Performance with an empty metadata-cache (throughput, in entries listed per second, versus the number of GANESHA worker threads, scanning a 100k-entry filesystem from an empty server cache)

Figure 3: Performance with a preloaded metadata-cache (same axes, scanning from a preloaded server cache)

Figure 2 shows that saturation of the FSAL occurs quickly. Increasing the number of worker threads increases the performance, but no throughput larger than 5,000 entries read per second can be reached. Observations made on the server showed that no CPU or memory contention led to this saturation effect. The reason was that the POSIX FSAL, on top of the underlying POSIX calls, did not scale to these values.

Figure 3 shows different results. Due to the meta-data cache, most of the operations are done directly in memory, greatly reducing the calls to the POSIX FSAL. The throughput rises up to 90,000 entries read per second. The dependence between this throughput and the number of worker threads is linear, which shows the scalability of the process. After 11 worker threads, we no longer see such linearity. The reason for this was CPU congestion: the OS could not allocate enough CPU time to all the workers, and they started waiting to be scheduled. This test should be performed on a larger platform.

This test shows that the multi-threaded architecture in GANESHA provides good scalability.

7 Conclusion and perspectives

GANESHA has been in production at our site for more than one full year. It fits the needs we had when the decision was taken to start the project. Its large cache management capability allowed an increase in the incoming NFS requests on the related machines, a need that was critical for several other projects.

When the product started in full production, in January, 2006, this provided us with very useful feedback that helped in fixing bugs and improving the whole daemon. Thanks to this, GANESHA is a very stable product in our production context at our site. Making GANESHA Free Software is an experience that will certainly be very positive; we expect the same kind of feedback from the Open Software community. GANESHA can also be of some interest to this community; we actually believe that it could serve well as an NFSv4 proxy or as an SNMP or LDAP gateway.

NFSv4 is also a very exciting protocol, with plenty of interesting features. It can be used in various domains and will probably be even more widely used than the former version of NFS. Lots of work is being done around this protocol, like discussions about implementing its features or extending it with new features (see the NFSv4.1 drafts). GANESHA will evolve as NFSv4 does. We hope that you will find this as exciting as we did, and we are happy to share GANESHA with the community. We are eagerly awaiting contributions from external developers.

Why Virtualization Fragmentation Sucks

Justin M. Forbes rPath, Inc. [email protected]

Abstract

Mass adoption of virtualization is upon us. A plethora of virtualization vendors have entered the market. Each has a slightly different set of features, disk formats, configuration files, and guest kernel drivers. As concepts such as Virtual Appliances become mainstream, software vendors are faced with new challenges. Previously, software vendors had to port their application to multiple operating systems: Solaris, Linux, AIX, etc. The new "port" becomes one where software vendors will be expected to produce images that drop in to VMware, Xen, Parallels, SLES, RHEL, and even Microsoft Virtual Server.

This paper will explore the state of existing virtualization technology in meeting the goal of providing ready-to-run guest images. This includes: comparing, contrasting, and poking fun at virtual disk formats; bemoaning the assortment of kernel drivers needed to improve performance in a guest (vmware-tools, paravirt drivers...); dark muttering about incompatibilities between Xen guests and hosts; and lamenting all the different configuration files that define a guest.

Virtualization has moved into the mainstream of computing. Most businesses are no longer asking if they should deploy a virtualization solution; they are instead asking which vendors support the technologies they have already implemented. In many ways, this is a great new age for software vendors. With virtualization technology so common, it is possible to reduce the costs associated with supporting a product on a large assortment of operating systems. Software vendors can now ship just the right amount of operating system required to support their software application in a software appliance model. Distributing a software appliance allows vendors to fully certify one stack, without the worry about which packages or versions a particular operating system distribution chooses to ship. The extensive QA and support models that go along with shipping a separate application are drastically simplified. Software vendors no longer need to decide which operating systems to support; the new question is "which virtualization technologies do I support?"

This question should be easy to answer. Unfortunately, it is becoming increasingly difficult. It does not have to be. The hypervisor is the new platform, and many vendors have entered the market, with more vendors on the way. Each vendor offers products to meet a similar requirement: allow fully isolated containers, or virtual machines, to consume the resources they require, without stepping on other containers.

The advantages of virtualization are many. To the software consumer, virtualization takes away much of the concern surrounding the full stack. The fact that different applications may require conflicting library support is no longer a concern. The ability to better manage resources, increasing utilization of hardware without increasing the risk of multiple applications stepping on each other, is a tremendous benefit. Live migration, the ability to move virtual machines across physical hosts in real time, substantially increases availability. It is possible to maintain both performance and availability with a fraction of the hardware resources that were once required.

In the software appliance model, vendor relationships are improved, as the customer can go to a single vendor for support, without playing intermediary between the application vendor, tools vendors, and operating system vendors. Software consumers need not worry about testing and patching large numbers of security and bug fixes for software which gets installed with most general purpose operating systems, but is never used or installed within their virtualized environment.

To the software producer and distributor, virtualization means simplified development, QA, and testing cycles. When the application ships with its full stack, all of the time required to ensure compatibility with any number of supported platforms goes away. The little compromises required to make sure that applications run reliably across all platforms have a tendency to ensure that those applications do not run optimally on any platform, or increase the code complexity exponentially. This pain goes away when the platform is a part of the application, and unnecessary components which exist in a general purpose application are no longer present. Software vendors can distribute fully tested and integrated appliances, and know exactly what components are on the appliance, without concern that a critical component was updated to fix a bug in some other application unrelated to what the vendor's application provides or supports.

With so many benefits to virtualization, and so many options available to the consumer, what could possibly go wrong? No one ever likes the answers to that question. The proliferation of options in the virtualization market has brought a new kind of insanity for software vendors. "Which technologies do I provide ready-to-run guest images for?" The simple answer should be all of the major players. Providing choice to customers with minimal effort is a great thing. Unfortunately, the effort is not so minimal at the moment.

Each vendor has a business need to distinguish itself by providing unique features or a unique combination of features. Unfortunately for the consumer, even generic features are provided in unique ways by each virtualization provider, needlessly complicating life both for the software vendor and the end user.

This doesn't have to be so hard.

1 Disk Formats

There are several possibilities for the virtual machine disk format, none of which is universally supported. While most of the commonly-used formats offer a similar set of features, including sparse allocation and copy on write or some form of snapshot, they are not directly interchangeable. VMware's VMDK and Microsoft's VHD are among the most common formats supported. The QCOW format also offers similar functionality, though it should be noted that there are now multiple incompatible versions of the QCOW format, making QCOW more of a format family than a format. The disk format will typically include the raw file system or hard disk image; a bit of metadata describing the supported features, versioning, and creation method, as well as specific implementation metadata in the case of sparse allocation or copy on write. It may also contain information used to define the characteristics of the virtual machine associated with the images.

While VMDK, VHD, and QCOW are among the most commonly supported disk formats, they are far from universal. Some technologies still require raw file system images or other proprietary formats. The good news here is that conversion utilities exist for most formats available. If the desire is to have a single reference image that can be packaged for many different virtualization technologies, perhaps a raw file system or hard disk image is the best choice, as those can be directly converted to most other formats with little effort. Still, the question remains: why does this have to be so complicated? With a similar feature set among the most popular virtual disk formats, what is it that separates them? The difference lies in the metadata implementations. Perhaps in the future, common ground can be established and the disk format conversions will no longer be necessary.
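
As a rough illustration of how thin some of these format differences are, the leading "magic" bytes alone are usually enough to tell the common container formats apart. A minimal sketch, with magic values taken from the published format descriptions and error handling elided:

    #include <stdio.h>
    #include <string.h>

    /* Minimal sketch: identify a virtual disk image by its magic bytes.
     * VMDK sparse extents begin with "KDMV", QCOW images with "QFI\xfb",
     * and VHD files carry a "conectix" cookie (in the footer, and also at
     * offset 0 for dynamic images). A raw image has no signature at all. */
    static const char *guess_image_format(const unsigned char *buf, size_t len)
    {
        if (len >= 4 && memcmp(buf, "KDMV", 4) == 0)
            return "VMDK (sparse extent)";
        if (len >= 4 && memcmp(buf, "QFI\xfb", 4) == 0)
            return "QCOW family (a version field distinguishes the variants)";
        if (len >= 8 && memcmp(buf, "conectix", 8) == 0)
            return "VHD (dynamic)";
        return "unknown -- possibly a raw image";
    }

    int main(int argc, char **argv)
    {
        unsigned char buf[8] = { 0 };
        FILE *f = fopen(argc > 1 ? argv[1] : "disk.img", "rb");

        if (!f)
            return 1;
        size_t n = fread(buf, 1, sizeof(buf), f);
        printf("%s\n", guess_image_format(buf, n));
        fclose(f);
        return 0;
    }
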

2 Virtual Machine Configuration

When we look at what defines a virtual machine, there are many of the same pieces we find in stand-alone hardware. Along with the virtual disk or physical file systems, a virtual machine is allocated the basic resources required for a machine to function and be useful. This includes one or more virtual processors, a chunk of memory, and virtual or physical I/O devices. These are the core building blocks, and will differ among deployments of a given appliance based on customer requirements and available resources. In addition to these simple building blocks, there are typically a number of additional configuration options which help to define the virtual machine. These include migration and availability options, crash or debugging behavior, and console or terminal definitions. All of this configuration is specific to the actual deployment of an image, and should be defined within the hypervisor, controlling domain, or management infrastructure.

In addition to these site-specific configuration options, many virtualization vendors provide options for which kernel to boot, the initial ram disk or initrd image to be used, and other typical boot options, as the example below shows. While there are specific circumstances where keeping such configuration options separate from the virtual machine container itself would be desirable, it is more frequently a management headache for the consumer of virtual appliances. When the boot kernel is defined—or worse, located—outside of the central image, there is no simple way for an image to be self-maintained. While mechanisms exist for updating a virtual image similar to those for updating physical hosts, the guest does not typically have permission or ability to write to the host's file systems for such activities as updating the guest kernel or initrd. Simply put, any effective software appliance must be self-contained, and use standard tools for managing the contents of the appliance itself. Ideally the bootloader is also contained in the appliance image, but at the very minimum, a bootloader installed on the hypervisor should be able to read and interpret boot configuration information from within a guest image.
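
For example, a typical Xen 3.x paravirtualized guest definition keeps the kernel and initrd on the host, entirely outside the appliance image. All paths and names here are hypothetical:

    # /etc/xen/appliance -- hypothetical Xen 3.x domU definition.
    # Note that kernel and ramdisk name files on the *host*, so the
    # appliance cannot update its own kernel from inside the guest.
    kernel  = "/boot/vmlinuz-2.6.18-xen"
    ramdisk = "/boot/initrd-2.6.18-xen.img"
    memory  = 256
    name    = "appliance"
    vif     = [ '' ]
    disk    = [ 'file:/var/lib/xen/images/appliance.img,xvda,w' ]
    root    = "/dev/xvda1 ro"
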
3 Paravirtualization vs. Full Virtualization

Both paravirtualized and fully virtualized machines have been around for quite some time, and each has distinct advantages. For the paravirtualized machine, the guest operating system is fully aware that it is operating under hypervisor control. The kernel has been modified to work directly with the hypervisor, and as much as possible to avoid instructions that are expensive to virtualize. The largest advantage of paravirtualization is performance. Generally speaking, a paravirtualized guest will outperform a fully virtualized guest on the same hardware, often by a substantial margin. With this being the case, why isn't every guest paravirtualized?

There are several obstacles to paravirtualization. The source level changes required to build a paravirtualized kernel can be large and invasive, in many cases tens of thousands of lines of code. These changes occur in core kernel code and can be much more complex than higher level driver code. While the nature of the changes required to support a given hypervisor can be very similar, the implementation details and ABI will vary from vendor to vendor, and even among versions of the hypervisor from the same vendor. It is not uncommon for a paravirtualization patch set to be several revisions behind the latest upstream kernel, or to skip upstream revisions altogether. This is unlikely to change until a given implementation has been accepted into the upstream kernel. The resources required to maintain such a large patch set outside of the upstream tree are considerable; maintaining the same code in the upstream kernel requires far fewer resources. There is also the small matter of guest operating systems which are not open source, or whose license does not allow the redistribution of changes. In these instances, paravirtualization is extremely difficult, if not impossible.

In the fully virtualized machine, the guest operating system does not need to know that it is being virtualized at all. The kernel operates exactly as it would on standard hardware. The hypervisor will trap the necessary instructions and virtualize them without assistance from the guest. Standard devices such as network and block drivers are typically presented as virtual implementations of fairly well-supported physical devices to the guest, so that no special drivers are needed. Modern CPUs include hardware support for virtualization, which improves performance and compatibility. While this method is an effective way to ensure compatibility with a large variety of guest operating systems, there is a high overhead in trapping all of the necessary instructions. To help with this performance problem, it is common for the hypervisor to support a number of paravirtualized device drivers. By replacing the common and well supported device drivers with new devices which are aware of the hypervisor, certain expensive instructions can be avoided and performance is improved dramatically. Typically, a virtualized guest with paravirtualized drivers will achieve performance much closer to that of a true paravirtualized guest.

This is another area of difficulty for the software appliance distributor. The vast number of virtualization vendors each have their own paravirtualized kernels, drivers, or guest tools. Some of these are open source, some are not. Regardless of source code availability, there is the question of which kernel versions these drivers will build against, or might be supported with. It is not uncommon for a vendor to have no working driver or tool set for two or three of the most recent upstream kernel versions, leaving them outside of the upstream stable support cycle altogether. It is also possible that upstream kernel versions are skipped over entirely, making it difficult to find a common kernel version that can be supported by all of the desired virtualization targets that a software vendor might have. Luckily, it is quite possible to have a user space that supports a variety of kernel releases, ensuring that the kernel version and associated virtualization drivers or tools are the only substantial changes between images. This leaves most of the QA and testing work intact, and still provides a substantial support savings over supporting entirely different general purpose operating systems.

It is hoped that some of these problems can be addressed generically. Changes to the Linux kernel are being made which make it possible to eventually build a single kernel which supports multiple virtualization solutions. Examples include paravirt_ops, which shipped in the 2.6.20 kernel, and the VMI interface on top of paravirt_ops, which is included in the 2.6.21 Linux kernel. While the initial groundwork for paravirt_ops and the VMI layer is present in mainline kernels, there is still a lot of work remaining to make them beneficial to the vast majority of users. In the short term, we have simply added yet another option for building virtualization solutions. Until stable releases from the major players in virtualization have patches or products available to support these new kernel interfaces, and the older products are phased out, these interfaces simply represent one more option that must be supported. It really does have to get worse before it gets better.

Another proposal that has been floating around is a set of common paravirtualized drivers, which could be built as modules and provide many of the benefits associated with vendor-provided tools and drivers while decreasing the number of configurations which must be built and supported. Unfortunately this proposal is in its early stages and faces several obstacles. For instance, Xen provides xenbus instead of relying on the PCI specification for I/O virtualization. There is also the question of finding common ground for block I/O, as many virtualization vendors have put considerable effort into optimizing block I/O for virtualized guests, and these implementations are not guaranteed to be compatible with one another. Still, if an agreement could be reached, the result would be basic paravirtualized drivers which could be maintained upstream, and present in the majority of Linux vendor kernels without overhead. Virtualization providers would still have the option of further optimization by using platform-specific drivers, but end users would see less basic overhead when using an image that for one reason or another could not easily deploy the platform-specific drivers.
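
The idea behind paravirt_ops can be sketched as follows. This is a deliberately simplified illustration, not the actual 2.6.20 structure, which has many more members with different signatures: privileged operations are routed through a table of function pointers that is filled in at boot for whichever platform is detected.

    /* Simplified sketch of the paravirt_ops idea -- not the actual
     * kernel definition. */
    struct pv_ops_sketch {
        const char *name;                 /* "bare hardware", "Xen", "VMI", ... */
        void (*irq_disable)(void);        /* cli on bare metal, hypercall otherwise */
        void (*irq_enable)(void);
        void (*write_cr3)(unsigned long); /* switch page-table base */
        void (*flush_tlb_user)(void);
    };

    /* Filled in at boot for whichever hypervisor (if any) was detected,
     * so a single kernel binary can run on several platforms. */
    extern struct pv_ops_sketch pv_ops;

    static inline void local_irq_disable_sketch(void)
    {
        pv_ops.irq_disable();
    }
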
4 Architectural incompatibility

Even when dealing with a single virtualization vendor, there are a few architectural anomalies to keep in mind. One particularly painful current example is PAE support. When dealing with 32-bit systems, both guest and host, there is a question of exactly how much memory is supported. In order for a 32-bit system to address more than 4GB of memory, PAE is supported on most modern x86 processors. In Linux, PAE support is determined at kernel build time. Unfortunately a PAE-enabled kernel will not boot on physical or virtual hardware which does not actually support PAE mode. This is because PAE mode causes fairly significant changes to the page table structure regardless of the amount of actual memory in a system. This is important to know because several mainstream virtualization solutions take different approaches to PAE support. In the VMware case, PAE is supported on the host in modern versions, meaning the hypervisor can address more than 4GB of memory, but the guest does not support PAE mode even in instances where the host has more than 4GB available. While it is not a horrible limitation to say that a guest can only support less than 4GB of memory, it also means that a guest kernel cannot be built with PAE support and still work on all VMware deployments. (Whether PAE is supported in a VMware guest depends on the host kernel and on image-specific configuration settings.)

In the Xen case, the rules are less clear-cut. Xen has essentially three parts: the hypervisor, the domain 0 kernel, and the guest or unprivileged domain kernel. The hypervisor and the domain 0 kernel must always have matching PAE support, meaning that if the domain 0 kernel is built with PAE support, the Xen hypervisor must be built with PAE support as well. For guest domains, the situation is split between paravirtualized guests and hardware virtual machines using the hardware virtualization features of modern CPUs from Intel and AMD. A hardware virtualized machine can run with PAE either enabled or disabled, regardless of the domain 0 and hypervisor. For paravirtualized guest domains, the kernel must be built with the same PAE features as the hypervisor and domain 0. It is not possible to mix and match PAE between paravirtualized guests and the hypervisor with current releases of Xen. While it would be simple enough to say that PAE support should always be enabled, there are a few obstacles to this. Some hardware does not support PAE mode, particularly a large number of laptops with Intel Pentium M CPUs. Additionally, there are existing Xen hosts which do not support PAE for one reason or another. It is believed that over time non-PAE implementations of 32-bit Xen will fall out of use, but the current issue is real and still somewhat common.

Guest architecture support will also vary according to the hypervisor used. While many hypervisors currently available offer support for both 64-bit and 32-bit guests under a 64-bit hypervisor, several do not. Although the hardware transition is nearly complete (it is difficult to find mainstream desktops or servers which do not support x86_64 these days), it will still be some time before older 32-bit hardware is retired from use, and even longer before 32-bit applications are no longer supported by many vendors. This means that it may be necessary for software appliance vendors to offer both 32-bit and 64-bit guest images if they wish to ensure compatibility with the largest number of virtualization technologies. For applications which are only available in 32-bit flavors, it means that guests will have to run a 64-bit kernel in some circumstances, though a 32-bit user space is generally supported.
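
A quick way to check whether a given 32-bit host can run PAE kernels at all is to look at the CPU feature flags the kernel reports. A small sketch:

    #include <stdio.h>
    #include <string.h>

    /* Minimal sketch: report whether the first CPU listed in
     * /proc/cpuinfo advertises the "pae" feature flag. */
    int main(void)
    {
        char line[1024];
        FILE *f = fopen("/proc/cpuinfo", "r");

        if (!f)
            return 2;
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "flags", 5) == 0) {
                printf("PAE %ssupported\n",
                       strstr(line, " pae") ? "" : "not ");
                break;
            }
        }
        fclose(f);
        return 0;
    }
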

Conclusion

With well over a dozen virtualization solutions in use today, and more on the way, there is a lot of choice available to the consumer. Choice can be a double-edged sword. Competition drives innovation, and we are seeing results from this at a rather fast pace today. Competition in the virtualization space also has the potential of driving support overhead to painful levels. Differing approaches to virtualization can ensure that the correct tool is available for any given job, but if the tool is too difficult to use, it is (more often than not) simply ignored in favor of the easier option.

Software vendors can leverage the benefits of virtual appliances now. While there are certainly obstacles to be overcome, they are not insurmountable. The advantages to a software appliance model are great, and the pains associated with this growth in virtualization technologies have to be addressed.

As developers, providers, and integrators of virtualization technology, we have to address these issues without allowing things to get out of hand. We need to look beyond the technology itself, and see how it will be used. We need to make sure that the technology is consumable without a massive amount of effort from the consumers.

A New Network File System is Born: Comparison of SMB2, CIFS, and NFS

Steven M. French IBM Samba Team [email protected]

Abstract

In early 2007, SMB2 became the first widely deployed network file system protocol since NFS version 4. This presentation will compare it with its predecessors (CIFS and SMB) as well as with common alternatives. The strengths and weaknesses of SMB/CIFS (the most widely deployed network file system protocol), NFS versions 3 and 4 (the next most popular protocols), and SMB2 will also be described.

Now that the CIFS POSIX Protocol extensions are implemented in the Linux kernel, Samba, and multiple operating systems, it is a good time to analyze whether SMB2 would be better for Linux compared to the CIFS POSIX Protocol extensions. In addition, alternatives such as HTTP, WebDav, and cluster file systems will be reviewed. Implementations of SMB2 are included in not just Vista and Longhorn, but also Samba client libraries and Wireshark (decoding support). Linux implementation progress and alternatives for SMB2 clients and servers will also be described, along with recommendations for future work in this area.

1 Introduction

Although a few experimental network file system protocols were developed earlier, the first to be widely deployed started in the mid-1980s: SMB (by IBM, Microsoft and others), AT&T's RFS protocol, AFS from Carnegie-Mellon University, NFS version 2 from Sun [1], and Novell's NCP. The rapid increase in numbers of personal computers and engineering workstations quickly made network file systems an important mechanism for sharing programs and data. More than twenty years later, the successors to the ancient NFS and SMB protocols are still the default network file systems on almost all operating systems.

Even if HTTP were considered a network file system protocol, it is relatively recent, dating from the early 1990s, and its first RFC [RFC 1945] was dated May 1996. HTTP would clearly be a poor protocol for a general purpose network file system on most operating systems including Linux. Since HTTP lacked sufficient support for "distributed authoring" (it has no locking operations, little file metadata, and no directory operations), "HTTP Extensions for Distributed Authoring—WEBDAV" (RFC 2518) was released in February 1999. WEBDAV did not, however, displace CIFS or NFS, and few operating systems have a usable in-kernel implementation of WEBDAV.

So after more than twenty years, despite the invention of some important cluster file systems and the explosion of interest in web servers, we are almost back where we started—comparing NFS [3] Version 4 with the current CIFS extensions and with a new SMB: the SMB2 protocol.

File systems still matter. Network file systems are still critical in many small and large enterprises. File systems represent about 10% (almost 500 KLOC) of the 2.6.21 Linux kernel source code, and are among the most actively maintained and optimized components. The nfs and cifs modules are among the larger in-kernel file systems. (Lowercase "nfs" and "cifs" are used here to refer to the implementations of the NFS and CIFS protocols, e.g. the Linux nfs.ko and cifs.ko kernel modules, while uppercase "NFS" and "CIFS" refer to the network protocols.)

The SMB2 protocol, introduced in Microsoft Vista this year, is the default network file system on most new PCs. It differs from its predecessors in interesting ways.

Network file systems matter—the protocols that they depend on are more secure, full featured, and much more complex than their ancestors. Some of the better NAS (Network Attached Storage) implementations can perform as well as SAN and cluster file systems for key workloads.

2 Network File System Characteristics

Network protocols can be considered to be layered. Network file system protocols are the top layer—far removed from the physical devices, such as Ethernet adapters, that send bits over the wire. In the Open System Interconnection (OSI) model, network file system protocols would be considered layer 6 and 7 ("Presentation" and "Application") protocols. Network file system protocols rely on lower level transport protocols (e.g. TCP) for reliable delivery of the network file system protocol data units (PDUs), or include intermediate layers (as NFS has done with SunRPC) to ensure reliable delivery.

Network file system protocols share some fundamental characteristics that distinguish them from other "application level" protocols. Network file system clients and servers (and the closely related Network Attached Storage, NAS, servers) differ in key ways from cluster file systems and web browsers/servers:

• Files vs. Blocks or Objects: This distinction is easy to overlook when comparing network file system protocols with network block devices, cluster file systems, and SANs. Network file systems read and write files, not blocks of storage on a device. A file is more abstract—a container for a sequential series of bytes. A file is seekable. A file conventionally contains useful metadata such as ACLs or other security information, timestamps, and size. Network file systems request data by file handle, filename, or identifier, while cluster file systems operate on raw blocks of data. Network file system protocols are therefore more abstract, less sensitive to disk format, and can more easily leverage file ownership and security information.

• Network file system protocol operations match local file system entry points: Network file system protocol operations closely mirror the function layering of the file system layer (VFS) of the operating system on the client. Network file system operations on the wire often match one to one with the abstract VFS operations (read, write, open, close, create, rename, delete) required by the operating system. The OS/2 heritage of early SMB/CIFS implementations and the Solaris heritage of NFS are visible in a few network file system requests.

• Directory Hierarchy: Most network file systems assume a hierarchical namespace for file and directory objects and the directories that contain them.

• Server room vs. intranet vs. Internet: Modern network file system protocols have security and performance features that make them usable outside of the server room (while many cluster file systems are awkward to deploy securely across multiple sites). Despite this, HTTP and primitive FTP are still the most commonly used choices for file transfers over the Internet. Extensions to NFS version 4 and CIFS (DFS) allow construction of a global hierarchical namespace, facilitating transparent failover and easier configuration.

• Application optimization: Because the pattern of network file system protocol requests often more closely matches the requests made by the application than would be the case for a SAN, and since the security and process context of most application requests can be easily determined, network file system servers and NAS servers can do interesting optimizations.

• Transparency: Network file systems attempt to provide local/remote transparency so that local applications detect little or no difference between running over a network file system and a local file system.

• Heterogeneity: Network file system clients and servers are often implemented on quite different operating systems—clients access files without regard to their on-disk format. In most large enterprises, client machines running quite different operating systems access the same data on the same server at the same time. The CIFS (or NFS) network file system client that comes by default with their operating system neither knows nor cares about the operating system of the server. The Samba server has been ported to dozens of operating systems, yet the server operating system is mostly transparent to SMB/CIFS clients. Network file systems are everywhere, yet are not always seen when running in multi-tier storage environments. They often provide consistent file access under large web servers, database servers, or media servers. A network file system server such as Samba can easily export data on other network file systems, on removable media (CD or DVD), or on a local file system (ext3, XFS, JFS)—and with far more flexibility than is possible with most cluster file systems.

Network file systems differ in fundamental ways from web clients/servers and cluster file systems.

2.1 History of SMB Protocol

The SMB protocol was invented by Dr. Barry Feigenbaum of IBM's Boca Raton laboratory during the early development of personal computer operating system software. It was briefly named after his initials ("BAF") before the protocol name was changed to "Server Message Block," or SMB. IBM published the initial SMB Specification book at the 1984 IBM PC Conference. A few years later a companion document, a detailed LAN Technical Reference for the NetBIOS protocol (which was used to transport SMB frames), was published. An alternative transport mechanism using TCP/IP rather than the Netbeui frames protocol was documented in RFCs 1001 and 1002 in 1987.

Microsoft, with early assistance from Intel and 3Com, periodically released documents describing new dialects of the SMB protocol. The LANMAN1.0 SMB dialect became the default SMB dialect used by OS/2. At least two other dialects were added to subsequent OS/2 versions.

In 1992, X/Open CAE Specification C209 provided better documentation for this increasingly important standard. The SMB protocol was not only the default network file system for DOS and Windows, but also for OS/2. IBM added Kerberos and Directory integration to the SMB protocol in its DCE DSS project in the early 1990s. A few years later Microsoft also added Kerberos security to the SMB security negotiation in their Windows 2000 products. Microsoft's Kerberos authentication encapsulated service tickets using SPNEGO in a new SMB SessionSetup variant, rather than using the original SecPkgX mechanism used by earlier SMB implementations (which had been documented by X/Open). The SMB protocol increasingly was used for purposes other than file serving, including remote server administration, network printing, network messaging, locating network resources, and security management. For these purposes, support for various network interprocess communication mechanisms was added to the SMB protocol, including Mailslots, Named Pipes, and the LANMAN RPC. Eventually more complex IPC mechanisms were built, allowing encapsulation of DCE/RPC traffic over SMB (even supporting complex object models such as DCOM).

In the mid 1990s, the SMBFS file system for Linux was developed. Leach and Naik authored various CIFS IETF Drafts in 1997, but soon CIFS documentation activity moved to SNIA. Soon thereafter CIFS implementations were completed for various operating systems, including OS/400 and HP/UX. The CIFS VFS for Linux was included in the Linux 2.6 kernel. After nearly four years, the SNIA CIFS Technical Reference [4] was released in 2002, and included not just Microsoft extensions to CIFS, but also CIFS Unix and Mac Extensions.

In 2003 an additional set of CIFS Unix Extensions was proposed, and Linux and Samba prototype implementations were begun. By 2005, the Linux client and Samba server had added support for POSIX ACLs (strictly speaking, "POSIX ACLs" are not part of the official POSIX API; POSIX 1003.1e draft 17 was abandoned before standardization), POSIX path names, and a request to return all information needed by statfs. Support for very large read requests and very large write responses was also added. (In this paper, "POSIX" refers narrowly to the file API semantics that a POSIX-compliant operating system needs to implement. The CIFS "POSIX" Protocol Extensions are not part of the POSIX standard, but rather a set of extensions to the network file system protocol that make it easier for network file system implementations to provide POSIX-like file API semantics.)

In April 2006, support for POSIX (rather than Windows-like) byte range lock semantics was added to the Samba server and Linux cifs client (Linux kernel 2.6.17). Additional CIFS extensions were proposed to allow file I/O to be better POSIX compliant. In late 2006 and early 2007, joint work among four companies and the Samba team to define additional POSIX extensions to the CIFS protocol led to the creation of a CIFS Unix Extensions wiki, as well as implementations of these new extensions [8] in the Linux CIFS client and Samba server (Mac client and others in progress). The CIFS protocol continues to evolve, with security and clustering extensions among the suggestions for the next round of extensions. As the technical documentation of these extensions improves, more formal documentation is being considered.

2.2 History of NFS Protocol

NFS version 1 was not widely distributed, but NFS version 2 became popular in the 1980s, and was documented in RFC 1094 in 1989. Approximately 10 years after NFS version 2, NFS version 3 was developed. It was documented by Sun in RFC 1813 in 1995.
Eight years later, RFC 3530 defined NFS version 4 (obsoleting the earlier RFC 3010, and completing a nearly five year standardization process). An extension to NFS version 3, "WebNFS," documented by Sun in 1996, attempted to show the performance advantages of a network file system for Internet file traffic in some workloads (over HTTP). The discussion of WebNFS increased the pressure on other network file systems to perform better over the Internet, and may have been a factor in the renaming of the SMB protocol—from "Server Message Block" to "Common Internet File System." Related to the work on NFS version 4 was an improvement to the SunRPC layer that NFS uses to transport its PDUs. The improved RPCSECGSS allowed support for Kerberos for authentication (as does CIFS), and allows negotiation of security features, including whether to sign (for data integrity) or seal (for data privacy) all NFS traffic from a particular client to a particular server. The NFS working group is developing additional extensions to NFS (NFS version 4.1, pNFS, NFS over RDMA, and improvements to NFS's support for a global namespace).

The following shows new protocol operations introduced by NFS protocol versions 3 and 4:

NFS Version 2 operations:
GETATTR (1), SETATTR (2), ROOT (3), LOOKUP (4), READLINK (5), WRITE (8), CREATE (9), REMOVE (10), RENAME (11), LINK (12), SYMLINK (13), MKDIR (14), RMDIR (15), READDIR (16), STATFS (17)

New NFS Version 3 operations:
ACCESS (4), READ (6), MKNOD (11), READDIRPLUS (17), FSSTAT (18), FSINFO (19), PATHCONF (20), COMMIT (21)

New NFS Version 4 operations:
CLOSE (4), DELEGPURGE (7), DELEGRETURN (8), GETFH (10), LOCK (12), LOCKT (13), LOCKU (14), LOOKUPP (16), NVERIFY (17), OPEN (18), OPENATTR (19), OPEN-CONFIRM (20), OPEN-DOWNGRADE (21), PUTFH (22), PUTPUBFH (23), PUTROOTFH (24), RENEW (30), RESTOREFH (31), SAVEFH (32), SECINFO (33), SETATTR (34), SETCLIENTID (35), SETCLIENTID-CONFIRM (36), VERIFY (37), RELEASE-LOCKOWNER (39)

3 Current Network File System Alternatives

Today there are a variety of network file systems included in the Linux kernel, which support various protocols including NFS, SMB/CIFS, NCP, AFS, and Plan9. In addition there are two cluster file systems now in the mainline Linux kernel: OCFS2 and GFS2. A few popular kernel cluster file systems for Linux that are not in mainline are Lustre and IBM's GPFS. The cifs and nfs file system clients for Linux are surprisingly similar in size (between 20 and 30 thousand lines of code) and change rate. The most common SMB/CIFS server for Linux is Samba, which is significantly larger than the Linux NFS server in size and scope. The most common Linux NFS server is of course nfsd, implemented substantially in kernel.

Windows Vista also includes support for various network file system protocols including SMB/CIFS, SMB2, and NFS.

4 SMB2 Under the Hood

The SMB2 protocol differs [7] from the SMB and CIFS protocols in the following ways:

• The SMB header is expanded to 64 bytes, and better aligned. This allows for increased limits on the number of active connections (uids and tids) as well as the number of process ids (pids).

• The SMB header signature string is no longer 0xFF followed by "SMB" but rather 0xFE and then "SMB." In the early 1990s, LANtastic made a similar change in signature string (in that case from "SMB" to "SNB") to distinguish their requests from SMB requests.

• Most operations are handle based, leaving Create (Open/Create/OpenDirectory) as the only path based operation.

• The file handle has been increased to 64 bits.

• Better support for symlinks has been added. Windows Services for Unix did not have native support for symlinks, but emulated them.

• Various improvements to DFS and other miscellaneous areas of the protocol that will become usable when new servers are available.

• "Durable file handles" [10] allowing easier reconnection after temporary network failure.

• Larger maximum operation sizes, and improved compound operation ("AndX") support also have been claimed but not proved.

Currently 19 SMB2 commands are known:

    0x00 NegotiateProtocol    0x0A Lock
    0x01 SessionSetupAndX     0x0B Ioctl
    0x02 SessionLogoff        0x0C Cancel
    0x03 TreeConnect          0x0D KeepAlive
    0x04 TreeDisconnect       0x0E Find
    0x05 Create               0x0F Notify
    0x06 Close                0x10 GetInfo
    0x07 Flush                0x11 SetInfo
    0x08 Read                 0x12 Break
    0x09 Write

Many of the infolevels used by the GetInfo/SetInfo commands will be familiar to those who have worked with CIFS.
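
A rough C rendering of the expanded header described above. The field names and ordering here are approximate, taken from public protocol decodes, and should not be read as an authoritative definition:

    #include <stdint.h>

    /* Illustrative sketch of the 64-byte SMB2 header -- field names and
     * ordering are approximate. Note the 0xFE 'S' 'M' 'B' signature and
     * the widened handle-related fields. */
    struct smb2_hdr_sketch {
        uint8_t  protocol_id[4];   /* { 0xFE, 'S', 'M', 'B' } */
        uint16_t header_length;    /* 64 */
        uint16_t reserved;
        uint32_t status;           /* NT status code */
        uint16_t command;          /* 0x00 NegotiateProtocol ... 0x12 Break */
        uint16_t credits;
        uint32_t flags;
        uint32_t chain_offset;     /* for compounded requests */
        uint64_t message_id;       /* far wider limits than SMB's 16-bit mid */
        uint32_t pid;
        uint32_t tid;              /* tree identifier */
        uint64_t session_id;       /* replaces the 16-bit uid */
        uint8_t  signature[16];    /* packet signing */
    };
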

5 POSIX Conformance

5.1 NFS

NFS version 3 defined 21 network file system operations (four more than NFS version 2), roughly corresponding to the common VFS (virtual file system) entry points that Unix-like operating systems require. NFS versions 2 and 3 were intended to be idempotent (stateless), and thus had difficulty preserving POSIX semantics. With the addition of a stateful lock daemon, an NFS version 3 client could achieve better application compatibility, but it still can behave differently [6] than local file systems in at least four areas:

1. Rename of an open file. For example, the silly rename approach often used by nfs clients for renaming open files could cause rm -rf to fail.

2. Deleting an existing file or directory can appear to fail (as if the file were not present) if the request is retransmitted.

3. Byte range lock security (since these services are distinct from the nfs server, both lockd and statd have had problems in this area).

4. Write semantics (when caching was done on the client).

NFS also required additional protocol extensions to be able to support POSIX ACLs, and also lacked support for xattrs (OS/2 EAs), creation time (birth time), nanosecond timestamps, and certain file flags (immutable, append-only, etc.). Confusingly, the NFS protocol lacked file open and close operations until NFS version 4, and thus could only implement a weak cache consistency model.

5.2 NFSv4

NFS version 4, borrowing ideas from other protocols including CIFS, added support for an open and close operation, became stateful, added support for a rich ACL model similar to NTFS/CIFS ACLs, and added support for safe caching and a wide variety of extended attributes (additional file metadata). It is possible for an NFS version 4 implementation to achieve better application compatibility than before without necessarily sacrificing performance.

5.3 CIFS

The CIFS protocol can be used by a POSIX compliant operating system for most operations, but compensations are needed in order to properly handle POSIX locks and special files, and to determine approximate reasonable values for the mode and owner fields. There are other problematic operations that, although not strictly speaking POSIX issues, are important for a network file system in order to achieve true local/remote transparency. They include symlink, statfs, POSIX ACL operations, xattrs, directory change notification, and some commonly used ioctls (for example, those used by the lsattr and chattr utilities). Without protocol extensions, the CIFS protocol can adequately be used for most important operations, but differences are visible, as seen in Figure 1.

A range of information levels above 0x200 has been reserved by Microsoft and the SNIA CIFS Working Group for Unix Extensions. These include Query/SetFileInformation and Query/SetPathInformation levels:

    QUERY_FILE_UNIX_BASIC      0x200  Part of the initial Unix Extensions
    QUERY_FILE_UNIX_LINK       0x201  Part of the initial Unix Extensions
    QUERY_POSIX_ACL            0x204  Requires CIFS_UNIX_POSIX_ACL_CAP
    QUERY_XATTR                0x205  Requires CIFS_UNIX_XATTR_CAP
    QUERY_ATTR_FLAGS           0x206  Requires CIFS_UNIX_EXTATTR_CAP
    QUERY_POSIX_PERMISSION     0x207
    QUERY_POSIX_LOCK           0x208  Requires CIFS_UNIX_FCNTL_CAP
    SMB_POSIX_PATH_OPEN        0x209  Requires CIFS_UNIX_POSIX_PATH_OPS_CAP
    SMB_POSIX_PATH_UNLINK      0x20a  Requires CIFS_UNIX_POSIX_PATH_OPS_CAP
    SMB_QUERY_FILE_UNIX_INFO2  0x20b  Requires CIFS_UNIX_EXTATTR_CAP

5.4 CIFS with Unix Protocol Extensions

As can be seen in Figure 2, with the CIFS Unix Extensions it is possible to more accurately emulate local semantics for complex applications such as a Linux desktop.

The Unix Extensions to the CIFS Protocol have been improved in stages. An initial set, which included various new infolevels to TRANSACT2 commands in the range from 0x200 to 0x2FF (inclusive), was available when CAP_UNIX was included among the capabilities returned by the SMB negotiate protocol response.

Currently the CIFS Unix Extensions also include the following Query/SetFileSystemInformation levels, which allow retrieving information about a particular mounted export ("tree connection") and negotiating optional capabilities:

    SMB_QUERY_CIFS_UNIX_INFO   0x200  (Part of the original Unix Extensions)
    SMB_QUERY_POSIX_FS_INFO    0x201
    SMB_QUERY_POSIX_WHO_AM_I   0x202

Note that unlike NFS and SMB/CIFS, the CIFS Unix Extensions allow different capabilities to be negotiated in a more granular fashion, by "tree connection" rather than by server session. If a server is exporting resources located on two very different file systems, this can be helpful.

Additional POSIX extensions are negotiated via a get and set capabilities request on the tree connection, via a QueryFSInfo and SetFSInfo level. Following is a list of the capabilities that may be negotiated currently:

These Unix Extensions allow a CIFS client to set and • CIFS_UNIX_FCNTL_LOCKS_CAP return fields such as uid, gid and mode, which otherwise have to be approximated based on CIFS ACLs. They • CIFS_UNIX_POSIX_ACLS_CAP also drastically reduce the number of network roundtrips and operations required for common path based opera- • CIFS_UNIX_XATTR_CAP tions. For example, with the older CIFS Unix Exten- sions, a file create operation takes many network opera- • CIFS_UNIX_EXATTR_CAP tions: QueryPathInfo, NTCreateX, SetPathInfo, Query- • CIFS_UNIX_POSIX_PATHNAMES_CAP (all PathInfo in order to implement local Unix create seman- except slash supported in pathnames) tics correctly. File creation can be done in one network roundtip using the new SMB_POSIX_PATH_OPEN, • CIFS_UNIX_POSIX_PATH_OPS_CAP which reduces latency and allows the server to better 2007 Linux Symposium, Volume One • 137

Figure 1: Without extensions to CIFS, local (upper window) vs. remote (below) transparency problems are easily visible optimize. The improved atomicity of mkdir and create servers so that clients and/or “Browse Masters” contain makes error handling easier (e.g. in case a server failed current lists of the active servers in a resource domain. after a create operation, but before the SetPathInfo). There are differences between these protocols that could significantly affect performance. Some examples in- 5.5 SMB2 clude: compound operations, maximum read and write sizes, maximum number of concurrent operations, en- The SMB2 protocol improves upon its predecessors by dian transformations, packet size, field alignment, dif- including symlink support. However, retrieving mode ficult to handle operations, and incomplete operations and Unix uid and gid from NTFS/CIFS ACLs is still that require expensive compensations. awkward. SMB2 appears to be only slightly improved in this area, and substantially worse than the CIFS Unix To contrast features that would affect performance it is Extensions for this purpose. helpful to look at some examples.

6.1 Opening an existing file 6 Performance The SMB2 implementation needs a surprising eight re- CIFS has often been described as a chatty protocol, im- quests to handle this simple operation. plying that it is inherently slower than NFS, but this is misleading. Most of the chattiness observed in CIFS 6.2 Creating a new file is the result of differences between the operating sys- tem implementations being compared (e.g. Windows vs. The SMB2 protocol appears to match perfectly the re- Linux). Another factor that leads to the accusation of the quirements of the Windows client here. Attempting a CIFS protocol being chatty (wasteful of network band- simple operation like: width) is due to periodic broadcast frames that contain echo new file data > newfile server announcements (mostly in support of the Win- results in the minimum number of requests that would dows Network Neighborhood). These are not a required reasonably be expected (opencreate, write, close). Three part of CIFS, but are commonly enabled on Windows requests and three responses (823 bytes total). 138 • A New Network File System is Born: Comparison of SMB2, CIFS, and NFS

Figure 2: Better local (upper window) vs. remote (below) transparency with CIFS Unix extensions

6.3 Mount (NET USE) SMB/CIFS protocol, and should be slightly better de- spite the slightly larger frame size caused by the larger Once again the SMB2 protocol appears to match well header. With fewer commands to optimize and better the requirement of the client with only 11 requests aligned fields, performance may be slightly improved as (four are caused by the Windows desktop trying to open server developers better tune their SMB2 implementa- Desktop.ini and AutoRun.inf). tions. Despite the addition of support for symlinks, the SMB2 7 Linux Implementation protocol lacks sufficient support for features needed by Unix and Linux clients. Adding Unix extensions to Much of the progress on SMB2 has been due to ex- SMB2, similar to what has been done with CIFS, is pos- cellent work by the Samba 4 team, led by Dr. Andrew sible and could reuse some of the existing Unix specific Tridgell. Over the past year and a half, they have imple- infolevels. mented a comprehensive client library for SMB2, im- plemented a test suite (not as comprehensive yet), im- With current Linux kernels, NFS version 4 and CIFS plemented DCE/RPC over SMB2 (for remote admin- (cifs client/Samba server) are good choices for network istration), implemented a SMB2 server (not complete), file systems for Linux to Linux. NFS performance and in cooperation with Ronnie Sahlberg, implemented for large file copy workloads is better, and NFS offers a wireshark (ethereal) protocol analyzer. some security options that the Linux cifs client does not. In heterogeneous environments that include Win- dows clients and servers, Samba is often much easier to 8 Future Work and Conclusions configure.

Although great progress has been made on a proto- type user space client in Samba 4, an implementation 9 Acknowledgements of SMB2 in kernel on Linux also needs to be com- pleted. We have started a prototype. The SMB2 pro- The author would like to express his appreciation to the tocol represents a modest improvement over the older Samba team, members of the SNIA CIFS technical work 2007 Linux Symposium, Volume One • 139

octet 1 2 3 4 5 6 7 8 RFC 1001 SMB length (some reserve top 7 bits) 0xFF 'S' 'M' 'B' msg type (session) SMB Status (error) code SMB flags SMB flags2 Command Process ID (high order) SMB Signature SMB signature (continued) Reserved Tree Identifier Process Id (Low) SMB User Identifier Word Count (variable number of 16 bit Byte Count (size of data area) (data area parameters follow) follows) Table 1: SMB Header Format (39 bytes + size of command specific wct area)

octet 1 2 3 4 5 6 7 8 RFC 1001 SMB length 0xFE 'S' 'M' 'B' msg type (session) SMB Header length (64) reserved Status (error) code SMB2 Command Unknown SMB2 Flags Reserved Sequence number Sequence Number (continued) Process Id Tree Identifier SMB User Identifier SMB User Identifier SMB Signature SMB Signature (continued) SMB Signature (continued) SMB2 Parameter length (in Variable Variable bytes) length SMB length SMB Parm Data Table 2: SMB2 Header Format (usually 68 bytes + size of command specific parameter area)

octet 1 2 3 4 5 6 7 8 SunRPC Fragment Header XID Message Type (Request vs. Response) SunRPC Version Program: NFS (100003) Program Version (e.g. 3) NFS Command Authentication Flavor (e.g. AUTH_UNIX) Credential Length Credential Stamp Machine Name length Machine name (variable size) Machine Name (continued, variable length) Unix UID Unix GID Auxiliary GIDs (can be much larger) Verifier Flavor Verifier Length NFS Command Parameters and/or Data follow Table 3: SunRPC/NFSv3 request header format (usually more than 72 bytes + size of nfs command) 140 • A New Network File System is Born: Comparison of SMB2, CIFS, and NFS group, and others in analyzing and documenting the [7] Dr. A. Tridgell. Exploring the SMB2 Protocol. SMB/CIFS protocol and related protocols so well over SNIA Storage Developer Conference. September the years. This is no easy task. In addition, thanks to the 2006. Wireshark team and Tridge for helping the world under- http://samba.org/~tridge/smb2.pdf stand the SMB2 protocol better, and of course thanks to the Linux NFSv4 developers and the NFS RFC authors, [8] CIFS Unix Extensions. http://wiki.samba. for implementing and documenting such a complex pro- org/index.php/UNIX_Extensions tocol. Thanks to Dr. Mark French for pointing out some [9] Linux CIFS Client and Documentation. of the many grammar errors that slipped through. http://linux-cifs.samba.org

10 Legal Statement [10] What’s new in SMB in Windows Vista http://blogs.msdn.com/chkdsk/ This work represents the view of the author and does not nec- archive/2006/03/10/548787.aspx essarily represent the view of IBM. IBM, OS/2, GPFS, and OS/400 are registered trademarks of International Business Machines Corporation in the United States and/or other coun- tries. Microsoft, Windows and Windows Vista are either a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries. UNIX is a reg- istered trademark of The Open Group in the United States and other countries. POSIX is a registered trademark of The IEEE in the United States and other countries. Linux is a reg- istered trademark of Linus Torvalds. Other company, product and service names may be trademarks or service marks of others.

References

[1] Sun Microsystems Inc. RFC 1094: NFS: Network File System Protocol Specification. March 1989. See http: //www.ietf.org/rfc/rfc1094.txt

[2] Callaghan et al. RFC 1813: NFS Version 3 Protocol Specification. June 1995. See http: //www.ietf.org/rfc/rfc1813.txt

[3] S. Shepler, et al. RFC 3530: Network File System (NFS) version 4 Protocol. April 2003. See http: //www.ietf.org/rfc/rfc3530.txt

[4] J. Norton, et al. SNIA CIFS Technical Reference. March 2002. See http: //www.snia.org/tech_activities/ CIFS/CIFS-TR-1p00_FINAL.pdf

[5] C. Hertel. Implementing CIFS. 2004. See http://ubiqx.org/cifs/

[6] O. Kirch. Why NFS Sucks. Proceedings of the 2006 Ottawa Linux Symposium, Ottawa, Canada, July 2006. Supporting the Allocation of Large Contiguous Regions of Memory

Mel Gorman Andy Whitcroft IBM and University of Limerick IBM [email protected] [email protected]

Abstract the master tables in main memory. When the master table is used to resolve a translation, a TLB Miss is in- Some modern processors such as later Opterons R and curred. This can have as significant an impact on Clock Power R processors are able to support large pages sizes cycles Per Instruction (CPI) as CPU cache misses [3]. such as 1GiB and 16GiB. These page sizes are imprac- To compound the problem, the percentage of memory tical to reserve at boot time because of the amount of covered by the TLB has decreased from about 10% of memory that is potentially wasted. Currently, Linux R physical memory in early machines to approximately as it stands is not well suited to support multiple page 0.01% today [6]. As a means of alleviating this, mod- sizes because it makes no effort to satisfy allocations for ern processors support multiple page sizes, usually up to contiguous regions of memory. This paper will discuss several megabytes, but gigabyte pages are also possible. features under development that aim to support the allo- The downside is that processors commonly require that cation of large contiguous areas. physical memory for a page entry be contiguous.

This paper begins by discussing the current status of In this paper, mechanisms that improve the success rates mechanisms to reduce external fragmentation in the of large contiguous allocations in the Linux kernel are page allocator. The reduction of external fragmenta- discussed. Linux already supports two page sizes, re- tion results in sparsely populated superpages that must ferred to as the base page and the huge page. The paper be reclaimed for contiguous allocations to succeed. We begins with an update on previous work related to the describe how poor reclaim decisions offset the perfor- placement of pages based on their mobility [2]. A per- mance benefits of superpages in low-memory situations, centage of pages allocated by the kernel are movable before introducing a mechanism for the intelligent re- due to the data being referenced by page tables or triv- claim of contiguous regions. Following a presentation ially discarded. By grouping these pages together, con- of metrics used to evaluate the features and the results, tiguous areas will be moved or reclaimed to satisfy large we propose a memory compaction mechanism that mi- contiguous allocations. As a bonus, pages used by the grates pages from sparsely populated to dense regions kernel are grouped in areas addressable by fewer large when enough memory is free to avoid reclaiming pages. TLB entries, reducing TLB misses. We conclude by highlighting that parallel allocators pre- vent contiguous allocations by taking free pages from Work on the intelligent reclaim of contiguous areas is regions being reclaimed. We propose a method for ad- then discussed. The regular reclaim algorithm reclaims dressing this by making pages temporarily unavailable base pages but has no awareness of contiguous areas. to allocators. Poor page selections result in higher latencies and lower success rates when allocating contiguous regions.

1 Introduction Metrics are introduced that evaluate the placement pol- icy and the modified reclaim algorithm. After describing Any architecture supporting virtual memory is required the test scenario, it is shown how the placement policy to map virtual addresses to physical addresses through keeps a percentage of memory usable for contiguous al- an address translation mechanism. Recent translations locations, and the performance impact is discussed. It are stored in a cache called a Translation Lookaside is then shown how the reclaim algorithm satisfies the al- Buffer (TLB). TLB Coverage is defined as memory ad- location of contiguous pages faster than the regular re- dressable through this cache without having to access claim.

• 141 • 142 • Supporting the Allocation of Large Contiguous Regions of Memory

The remainder of the paper proposes future features to free pages in a zone are stored in contiguous areas. De- improve the allocation of large contiguous regions. In- pending on the value of min_free_kbytes, a num- vestigation showed that contiguous areas were sparsely ber of areas are marked RESERVE. Pages from these populated and that compacting memory would be a logi- areas are allocated when the alternative is to fail. cal step. A memory compaction mechanism is proposed page→flags that will be rarely triggered, as an effective placement Previously, bits from were used to track policy significantly reduces the requirement for com- individual pages. A bitmap now records the mobility MAX_ORDER_NR_PAGES paction [1] [4]. Additional investigation revealed that type of pages within a area. struct zone parallel allocators use pages being reclaimed, preempt- This bitmap is stored as part of the for SPARSEMEM ing the contiguous allocation. This is called the racing all memory models except the , where the allocator problem. It is proposed that pages being re- memory section is used. As well as avoiding the con- page→flags claimed be made unavailable to other allocators. sumption of bits, tracking page type by area controls fragmentation better.

2 External Fragmentation Previously effective control of external fragmentation required that min_free_kbytes be 5% to 10% of physical memory, but this is no longer necessary. When External fragmentation refers to the inability to satisfy it is known there will be bursts of high-order atomic an allocation because a sufficiently sized free area does allocations during the lifetime of the system, min_ not exist despite enough memory being free overall [7]. free_kbytes should be increased. If it is found Linux deals with external fragmentation by rarely re- that high order allocations are failing, increasing min_ quiring contiguous pages. This is unsuitable for large, free_kbytes will free pages as contiguous blocks contiguous allocations. In our earlier work, we defined over time. metrics for measuring external fragmentation and two mechanisms for reducing it. This section will discuss 2.2 Partitioning Memory the current status of the work to reduce external frag- mentation. It also covers how page types are distributed When grouping pages by mobility, the maximum num- throughout the physical address space. ber of superpages that may be allocated on demand is dependent on the workload and the number of mov- 2.1 Grouping Pages By Mobility able pages in use. This level of uncertainty is not al- ways desirable. To have a known percentage of memory Previously, we grouped pages based on their abil- available as contiguous areas, we create a zone called ity to be reclaimed and called the mechanism anti- ZONE_MOVABLE that only contains pages that may be fragmentation. Pages are now grouped based on their migrated or reclaimed. Superpage allocations are not ability to be moved, and this is called grouping pages movable or reclaimable but if a is set, superpage by mobility. This takes into account the fact that pages allocations are allowed to use the zone. Once the zone mlock()ed in memory are movable by page migration, is sized, there is a reasonable expectation that the super- but are not reclaimable. page pool can be grown at run-time to at least the size of ZONE_MOVABLE while leaving the pages available for The standard buddy allocator always uses the smallest ordinary allocations if the superpages are not required. possible block for an allocation, and a minimum num- The size in bytes of the MAX_ORDER area varies be- ber of pages are kept free. These free pages tend to re- tween systems. On x86 and x86-64 machines, it is 4MiB main as contiguous areas until memory pressure forces a of data. On PPC64, it is 16MiB; on an IA-64 support- split. This allows occasional short-lived, high-order al- ing huge pages, it is often 1GiB. If a developer requires locations to succeed, which is why setting min_free_ a 1GiB superpage on Power there is no means to pro- kbytes to 163841 benefits Ethernet cards using jumbo viding it, as the buddy allocator does not have the nec- frames. The downside is that once split, there is no guar- essary free-lists. Larger allocations could be satisfied antee that the pages will be freed as a contiguous area. by increasing MAX_ORDER, but this is a compile-time When grouping by mobility, the minimum number of change and not desirable in the majority of cases. Gen- 1A common recommendation when using jumbo frames. erally, supporting allocations larger than MAX_ORDER_ 2007 Linux Symposium, Volume One • 143

allocated as contiguous blocks. With memory partition- ing, a known percentage of memory will be available.

3 Reclaim

When an allocation request cannot be satisfied from the free pool, memory must be reclaimed or compacted. Re- Standard Allocator Grouping By Mobility claim is triggered when the free memory drops below a Figure 1: Distribution of Page Types watermark, activating kswapd, or when available free memory is so low that memory is reclaimed directly by a process. NR_PAGES requires adjacent MAX_ORDER_NR_PAGES The allocator is optimised for base page-sized alloca- areas to be the same mobility type. The memory par- tions, but the system generates requests for higher or- tition provides this guarantee. der areas. The regular reclaim algorithm fares badly in The patches to group pages by mobility and partition the face of such requests, evicting significant portions of memory were merged for wider testing in the Linux ker- memory before areas of the requested size become free. nel version 2.6.21-rc2-mm2 despite some scepticism This is a result of the Least Recently Used (LRU) reclaim on the wider utility of the patches. Patches to allocate policy. It evicts pages based on age without taking page adjacent MAX_ORDER_NR_PAGES are planned. locality into account.

Assuming a random page placement at allocation and 2.3 Comparing Distribution random references over time, then pages of similar LRU age are scattered throughout the physical memory space. Figure 1 shows where pages of different mobility types Reclaiming based on age will release pages in random are placed with the standard allocator and when group- areas. To guarantee the eviction of two adjacent pages ing by mobility. Different colours are assigned to pages requires 50% of all pages in memory to be reclaimed. of different mobility types and each square represents This requirement increases with the size of the requested MAX_ORDER_NR_PAGES. When the snapshot was taken area tending quickly towards 100%. Awareness is re- the system had been booted, a 32MiB file was cre- quired of the size and alignment of the allocation or the ated and then deleted. It is clear in the figure that performance benefits of superpage use are offset by the MAX_ORDER_NR_PAGES areas contain pages of mixed cost of unnecessarily reclaiming memory. types with the standard allocator, but the same type when when grouping by mobility. This “mixing” in the standard allocator means that contiguous allocations are 3.1 Linear Area Reclaim not likely to succeed even if all reclaimable memory is released. To fulfil a large memory request by reclaiming, a con- It is clear from the figure that grouping pages by mobil- tiguous aligned area of pages must be evicted. The re- ity does not guarantee that all of memory may be allo- sultant free area is then returned to the requesting pro- cated as contiguous areas. The algorithm depends on a cess. Simplistically this can be achieved by linearly number of areas being marked MOVABLE so that they scanning memory, applying reclaim to suitable contigu- may be migrated or reclaimed. To a lesser extent it de- ous areas. As reclaim completes, pages coalesce into a pends on RECLAIMABLE blocks being reclaimed. Re- contiguous area suitable for allocation. claiming those blocks is a drastic step because there is no means to target the reclamation of kernel objects in a The linear scan of memory ignores the age of the page specific area. in the LRU lists. But as previously discussed, scanning based on page age is highly unlikely to yield a contigu- Despite these caveats, the result when grouping pages ous area of the required size. Instead, a hybrid of these by mobility is that a high percentage of memory may be two approaches is used. 144 • Supporting the Allocation of Large Contiguous Regions of Memory

Inactive Active x86-64 Test Machine CPU Opteron R 2GHz Free # Physical CPUs 2 Inactive

Active # CPUs 4 O(3) O(3) O(3) O(3) Main Memory 1024MiB

Figure 2: Area Reclaim area selection PPC64 Test Machine Power5 R PPC64 CPU 1.9GHz 3.2 LRU-Biased Area Reclaim # Physical CPUs 2 # CPUs 4 LRU-Biased Area Reclaim (Area Reclaim)2 is a hybrid Main Memory 4019MiB reclaim algorithm incorporating linear area reclaim into the regular LRU-based aging algorithm. Instead of lin- Figure 3: Specification of Test Machines early scanning memory for a suitable area, the tails of the active and inactive LRU lists are used as a starting point. same area are all targeted prematurely. The higher the order of allocation, the more pages that are unfairly Area Reclaim follows the regular approach of target- treated. This is unavoidable. ing pages in the LRU list order. First, an area in the active list is selected based on the oldest page in that list. Active pages in that area are rotated to the inactive 4 Experimental Methodology list. An area in the inactive list is then selected based on the oldest page in that list. All pages on the LRU lists within this area are then reclaimed. When there is The mechanisms are evaluated using three tests: one allocation pressure at a higher order, this tends to push related to performance, and two that are known to ex- groups of pages in the same area from the active list onto ternally fragment the system under normal conditions. the head of the inactive list increasing the chances of re- Each of the tests is run in order, without intervening claiming areas at its tail in the future. On the assump- reboots, to maximise the chances of the system being tion there will be future allocations of the same size, fragmented. The tests are as follows. kswapd applies pressure at the largest size required by kernbench extracts the kernel and then builds it three an in-progress allocation. times. The number of jobs make runs is the number of Figure 2 shows an example memory layout. As an processors multiplied by two. The test gives an over- example, consider the application of Area Reclaim at all view of the kernel’s performance and is sensitive to order-3. The active and inactive lists work their way changes that alter scalability. through the pages in memory indicating the LRU-age of those pages. In this example, the end of the active HugeTLB-Capability is unaltered from our previous list is in the second area and the end of the inactive list study. For every 250MiB of physical memory, a kernel is in the third area. Area Reclaim first applies pressure compile is executed in parallel. Attempts are made to to the active list, targeting the second area. Next it ap- allocate hugepages through the kernel’s proc interface plies pressure to the inactive list, targeting area three. under load and at the end of test. All pages in this area are reclaimable or free and it will coelesce. Highalloc-Stress builds kernels in parallel, but in addi- tion, updatedb runs so that the kernel will make many While targeting pages in an area, we maintain the age- small unmovable allocations. Persistent attempts are based ordering of the LRU to some degree. The down- made to allocate hugepages at a constant rate such that side is that younger, active and referenced pages in the kswapd should not queue more than 16MiB/s I/O. Pre- vious tests placed 300MiB of data on the I/O queue in 2This was initially based on Peter Zijlstra’s modifications to last year’s Linear Reclaim algorithm, and is commonly known as Lumpy the first three seconds, making comparisons of reclaim Reclaim. algorithms inconclusive. 2007 Linux Symposium, Volume One • 145

All benchmarks were run using driver scripts from VM- E = (Ar ∗ 100)/At Regress 0.803 in conjunction with the system that gen- erates the reports on http://test.kernel.org.

Two machines were used to run the benchmarks based where Ar is the number of superpages allocated and At R R on 64-bit AMD Opteron and POWER5 architec- is the total number of superpages in the system. During tures, as shown in Figure 3. the Highalloc-Stress test, attempts are made to allocate superpages under load and at rest, and the effectivess On both machines, a minimum free reserve of 5 ∗ is measured. Grouping pages by mobility should show MAX_ORDER_NR_PAGES was set, representing 2% an increase in the effectiveness when the system is at of physical memory. The placement policy is effective rest at the end of a test. Under load, the metric is a with a lower reserve, but this gives the most predictable key indicator of the intelligent reclaim algorithm’s area results. This is due to the low frequency that page types selection. are mixed under memory pressure. In contrast, our pre- vious study required fives times more space to reduce the frequency of fallbacks. 5.3 Reclaim Fitness

5 Metrics Reclaim fitness validates the reclaim algorithm. When applied to 100% of memory, any correct algorithm will In this section, five metrics are defined that evaluate achieve approximately the same effectiveness. A low fragmentation reduction and the modified reclaim algo- variance implies a correct algorithm. An incorrect algo- rithm. The first metric is system performance, used to rithm may prematurely exit or fail to find pages. evaluate the overhead incurred when grouping pages by mobility. The second metric is overall allocator effec- tiveness, measuring the ability to service allocations. Ef- 5.4 Reclaim Cost fectiveness is also used as the basis of our third metric, reclaim fitness, measuring the correctness of the reclaim algorithm. The fourth metric is reclaim cost, measuring The reclaim cost metric measures the overhead incurred the processor and I/O overheads from reclaim. The final by scanning, unmapping pages from processes, and metric is inter-allocation latency, measuring how long swapping out. Any modification to the reclaim algo- it takes for an allocation to succeed. rithm affects the overall cost to the system. Cost is de- fined as: 5.1 System Performance

The performance of the kernel memory allocator is crit- C = log10((Cs ∗Ws) + (Cu ∗Wu) + (Cw ∗Ww)) ical to the overall system. Grouping pages by mobility affects this critical path and any significant degradation is intolerable. The kernbench benchmark causes high where Cs is the number of pages scanned, Cu is the num- load on the system, particularly in the memory alloca- ber of pages unmapped, and Cw is the number of pages tor. Three compilation runs are averaged, giving pro- we have to perform I/O for in order to release them. The cessor utilisation figures to evaluate any impact of the scaling factors are chosen to give a realistic ratio of each allocator paths. of these operations. Ws is given a weight of 1, Wu is weighted as 1000 and Ww is weighted at 1,000,000. 5.2 Effectiveness The cost metric is calculated by instrumenting the kernel to report the scan, unmap, and write-out rates. These are The effectiveness metric measures what percentage of collected while testing effectiveness. This metric gives a physical memory can be allocated as superpages measure of the cost to the system when reclaiming pages 3http://www.csn.ul.ie/~mel/projects/ by indicating how much additional load the system ex- vmregress/vmregress-0.80.tar.gz periences as a result of reclaim. 146 • Supporting the Allocation of Large Contiguous Regions of Memory

Regular Reclaim x86-64 Test Machine 1024 pass1 CPU Mobility Delta 256

64 Time Off On (%)

Latency 16 User 85.83 86.78 1.10 4 System 35.92 34.07 –5.17 1 0 2000 4000 6000 8000 10000 12000 14000 Total 121.76 120.84 –0.75 Time Area Reclaim

1024 pass1 PPC64 Test Machine 256 CPU Mobility Delta 64 Time Off On (%)

Latency 16

4 User 312.43 312.24 –0.06

1 0 2000 4000 6000 8000 10000 12000 14000 System 16.89 17.24 2.05 Time Total 329.32 329.48 0.05

Figure 4: Inter-Allocation Latencies Figure 5: Mobility Performance Comparison

5.5 Inter-Allocation Latency 6 Results

Four different scenarios were tested: two sets of runs evaluating the effectiveness and performance of the page Given a regular stream of allocation attempts of the mobility changes, and two sets of runs evaluating the same size, the inter-allocation latency metric measures Regular and Area Reclaim algorithms. the time between successful allocations. There are three parts to the metric. The inter-allocation latency vari- Figure 5 shows the elapsed processor times when run- ability is defined as the standard deviation of the inter- ning kernbench. This is a fork()-, exec()-, I/O-, allocation delay between successful allocations. The and processor-intensive workload and a heavy user of mean is the arithmetic mean of these inter-allocation de- the kernel page allocator, making it suitable for measur- lays. The worst-case allocation time is simply the worst ing performance regressions there. On x86-64, there is a inter-allocation delay. significant and measurable improvement in system time and overall processor time despite the additional work The inter-allocation times are recorded when measur- required by the placement policy. In Linux, the kernel ing effectivess. Figure 4 shows the raw inter-allocation address space is a linear mapping of physical memory times for regular and Area Reclaim. The fixed-height using superpage Page Table Entries (PTEs). Grouping vertical red lines indicate where an additional 5% of pages by mobility keeps kernel-related allocations to- memory was allocated as superpages. A vertical green gether in the same superpage area, reducing TLB misses bar indicates when an allocation succeeded and the in the kernel portion of the working set. A performance height indicates how long since the last success. The gain is found on machines where the kernel portion of large portions of white on the left side of the regular re- the working set exceeds TLB reach when the kernel al- claim graph indicate the time required for enough mem- locations are mixed with other allocation types. ory to be reclaimed for contiguous allocations to suc- In contrast, the PPC64 figures show small performance ceed. In contrast, Area Reclaim does not regularly fail regressions. Here, the kernel portion of the address until 50% of memory is allocated as superpages. space is backed by superpage entries, but the PPC64 processor has at least 1024 TLB entries, or almost ten A key feature of a good reclaim algorithm is to provide times the number of x86-64. In this case, the TLB is able a page of the requested size in a timely manner. The to hold the working set of the kernel portion whether inter-allocation variability metric gives us a measure of pages are grouped by mobility or not. This is com- the consistency of the algorithm. This is particularly sig- pounded by running kernbench immediately after boot, nificant during the transition from low to high allocator causing allocated physical pages to be grouped together. load, as allocations are time critical. PPC64 is expected to show similar improvements due 2007 Linux Symposium, Volume One • 147 to reduced TLB misses with workloads that have large all of memory, so the figures should always be compa- portions of their working sets in the kernel address space rable. The figures are comparable or improved, so it is and when the system has been running for a long time. known that Area Reclaim will find all reclaimable pages when under pressure. This validates the Area Reclaim Where there are no benefits due to reduced TLB misses scanner. 
in the kernel portion of a working set, regressions of be- tween 0.1% and 2.1% are observed in kernbench. How- x86-64 Test Machine ever, the worst observed regression in overall processor Regular time is 0.12%. Given that kernbench is an unusually Area 0 1 2 3 4 5 6 7 8 9 heavy user of the kernel page allocator and that super- pages potentially offer considerable performance bene- Cost (log scale) fits, this minor slowdown is acceptable. PPC64 Test Machine x86-64 Test Machine Regular Mobility Off Area 0 2 4 6 8 10 12 Mobility On

0 10 20 30 40 50 60 70 80 Cost (log scale)

Allocated (%) Figure 8: Cost Comparison

PPC64 Test Machine Figure 8 shows that the cost of Area Reclaim are higher Mobility Off than those of Regular reclaim, but not by a considerable Mobility On margin. Later we will show the the time taken to allocate 0 10 20 30 40 50 60 70 80 the required areas is much lower with Area Reclaim and justifies the cost. Cost is a trade-off. On one hand the Allocated (%) algorithm must be effective at reclaiming areas of the Figure 6: Mobility Effectiveness requested size in a reasonable time. On the other, it must avoid adversely affecting overall system utility while it Figure 6 shows overall effectiveness. The pair of bars is in operation. show the percentage of memory successfully allocated The first row of graphs in Figure 9 shows the inter- as superpages under load during the Highalloc-Stress allocation variability as it changes over the course of the stress and at the end when the system is at rest. The test. Being able to deliver ten areas 20 seconds after they figures show that grouping the pages significantly im- are requested is typically of no use. Note that in both al- proves the success rates for allocations. In particular on gorithms, the variability is worse towards the start, with PPC64, dramatically fewer superpages were available at Area Reclaim significantly out-performing Regular Re- the end of the tests in the standard allocator. It has been claim. observed that the longer systems run without page mo- bility enabled, the closer to 0% the success rates are. The second row of graphs in Figure 9 shows the mean inter-allocation latency as it changes over the course of Figure 7 compares the percentage of memory allocated the test. It is very clear that the Area Reclaim inter- as huge pages at the end of all test for both reclaim al- allocation latency is more consistent for longer, and gen- gorithms. Both algorithms should be dumping almost erally lower than that for Regular reclaim.

The third row of graphs in Figure 9 shows the worst- Reclaim Delta Arch case inter-allocation latency. Although the actual worst- Regular Area (%) case latencies are very similar with the two algorithms, x86-64 64.53 73.15 8.62 it is clear that Area Reclaim maintains lower latency for PPC64 72.11 70.12 -1.99 much longer. The worst-case latencies for Area Reclaim are at the end of the test, when very few potential con- Figure 7: Correctness tiguous regions exist. 148 • Supporting the Allocation of Large Contiguous Regions of Memory

x86-64 (2MiB Huge Page) PPC64 (16MiB Huge Page)

x86-64 Latency Variability PPC64 Latency Variability 350 7000 Regular Reclaim Regular Reclaim Area Reclaim Area Reclaim

300 6000

250 5000

200 4000

Variability 150 Variability 3000

100 2000

50 1000

0 0 0 10 20 30 40 50 60 70 0 5 10 15 20 25 30 35 40 Allocations (%) Allocations (%)

x86-64 Mean Latency PPC64 Mean Latency 140 7000 Regular Reclaim Regular Reclaim Area Reclaim Area Reclaim

120 6000

100 5000

80 4000

60 3000 Mean Latency (jiffies) Mean Latency (jiffies)

40 2000

20 1000

0 0 0 10 20 30 40 50 60 70 0 5 10 15 20 25 30 35 40 Allocations (%) Allocations (%)

x86-64 Worst Latency PPC64 Worst Latency 1600 20000 Regular Reclaim Regular Reclaim Area Reclaim Area Reclaim 18000 1400

16000 1200 14000

1000 12000

800 10000

Latency (jiffies) Latency (jiffies) 8000 600

6000 400 4000

200 2000

0 0 0 10 20 30 40 50 60 70 0 5 10 15 20 25 30 35 40 Allocations (%) Allocations (%)

Figure 9: Latency Comparison 2007 Linux Symposium, Volume One • 149

The superpage size on PPC64 is eight times larger than type information stored for a MAX_ORDER_NR_PAGES that on x86-64. Increased latencies between the archi- area will be used to select where pages should be mi- tectures is to be expected, as reclaiming the required grated. Intuitively the best strategy is to move pages contiguous area is exponentially more difficult. Inspec- from sparsely to densely populated areas. This is un- tion of the graphs shows that Area Reclaim allocates suitable for two reasons. First, it involves a full scan of superpages faster on both x86-64 and PPC64. The in- all pages in a zone and sorting areas based on density. creased effectiveness of the Area Reclaim algorithm is Second, when compaction starts, areas that were previ- particularly noticeable on PPC64 due to the larger su- ously sparse may no longer be sparse unless the system perpage size. was frozen during compaction, which is unacceptable.

The compaction mechanism will group movable pages 6.1 Results Summary towards the end of a zone. When grouping pages by mo- bility, the location of unmovable pages is biased towards The effectiveness and performance comparisons show the lower addresses, so these strategies work in con- that grouping pages by mobility and Area Reclaim is junction. As well as supporting the allocation of very considerably better at allocating contiguous regions than large contiguous areas, biasing the location of movable the standard kernel, while still performing well. The pages towards the end of the zone potentially benefits cost and latency metrics show that for a moderate in- the hot-removal of memory and simplifies scanning for crease in work, we get a major improvement in alloca- free pages to migrate to. tion latency under load. A key observation is that early allocations under load with Area Reclaim have consider- When selecting free pages, the free page scanner begins ably lower latency and variability in comparison to the its search at the end of a zone and moves towards the standard allocator. Typically a memory consumer will start. Areas marked MOVABLE are selected. The free not allocate all of memory as contiguous areas, but allo- pages contained within are removed from the free-lists cate small numbers of contiguous areas in short bursts. and stored on a private list. No further scanning for free Even though it is not possible to anticipate bursts of con- pages occurs until the pages on the private list are de- tiguous allocations, they are handled successfully and pleted. with low latency by the combination of grouping pages by mobility and Area Reclaim. When selecting pages to move, the migration scanner searches from the start of a zone and moves towards the end. It searches for pages that are on the LRU lists, as 7 Memory Compaction these are highly likely to be migratable. The selection of an area is different when a process or a compaction Reclaim is an expensive operation and the cost might daemon is scanning. This process will be described in exceeed the benefits of using superpages, particularly the next two sections. when there is enough memory free overall to satisfy the allocation. This paper proposes a memory compaction A compaction run always ends when the two scanners mechanism that moves pages so that there are fewer, but meet. At that point, it is known that it is very unlikely larger and contiguous, free areas. The term defragmen- that memory can be further compacted unless memory tation is avoided because it implies all pages are mov- is reclaimed. able, which is not the case in Linux. At the time of writ- ing, no such memory compaction daemon has been de- 7.2 Direct Compaction veloped, although a page migration mechanism[5] does exist in the kernel.4 At the time of an allocation failure, a process must de- 7.1 Compaction Mechanism cide whether to compact or reclaim memory. The ex- tent of external fragmentation depends on the size of the allocation request. In our previous paper, two metrics The compaction mechanism will use the existing page were defined that measure external fragmentation. They migration mechanism to move pages. The page mobility are reintroduced here, but not discussed in depth except 4As implemented by Christoph Lameter. as they apply to memory compaction. 150 • Supporting the Allocation of Large Contiguous Regions of Memory

Fragmentation index is the first metric and is only mean- 1. Attempt allocation ingful when an allocation fails due to a lack of a suitable 2. On success, return area 3. Calculate fragmentation index free area. It determines if the allocation failure was due 4. If low memory goto 11 to a lack of free memory or external fragmentation. The 5. Scan areas marked MOVABLE index is calculated as 6. Calculate index based on MOVABLE areas alone TotalFree/2 j 7. If low memory goto 11 Fi( j) = 1 − AreasFree 8. Compact memory 9. Attempt allocation 10. On success, return block where TotalFree is the number of free pages, j is the 11. Reclaim pages order of the desired allocation, and AreasFree is the 12. Attempt allocation number of contiguous free areas of any size. When 13. On success, return block AreasFree is 0, Fi( j) is defined as 0. 14. Failed, return NULL

A value tending towards 0 implies the allocation failed Figure 10: Direct Compaction due to lack of memory and the process should reclaim. A value tending towards 1 implies the failure is due to external fragmentation and the process should compact. an area. When a non-movable allocation is improperly If a process tries to compact memory and fails to satisfy placed, compactd will be woken up. The objective is to the allocation, it will then reclaim. reduce the probability that non-movable allocations will be forced to use a area reserved for MOVABLE because It is difficult to know in advance if a high fragmenta- movable pages were improperly placed. When kswapd tion index is due to areas used for unmovable alloca- is woken because the high watermark for free pages is tions. If it is, compacting memory will only consume reached, compactd is also woken up on the assumption CPU cycles. Hence when the index is high but before that movable pages can be moved from unmovable areas compaction starts, the index is recalculated using only to the newly freed pages. blocks marked MOVABLE. This scan is expensive, but can take place without holding locks, and it is consider- If not explicitly woken, the daemon will wake regularly ably cheaper than unnecessarily compacting memory. and decide if compaction is necessary or not. The metric used to make this determination is called the unusable When direct compaction is scanning for pages to move, free space index. It measures what fraction of available only pages within MOVABLE areas are considered. The free memory may be used to satisfy an allocation of a compaction run ends when a suitably sized area is free specific size. The index is calculated as and the operation is considered successful. The steps taken by a process allocating a contiguous area are i=n i TotalFree−∑i= j 2 ki shown in Figure 10. Fu( j) = TotalFree

7.3 Compaction Daemon where TotalFree is the number of free pages, 2n is the largest allocation that can be satisfied, j is the order of kswapd is woken when free pages drop below a water- the desired allocation, and ki is the number of free page i mark to avoid processes entering direct reclaim. Sim- blocks of size 2 . When TotalFree is 0, Fu is defined as ilarly, compactd will compact memory when external 1. fragmentation exceeds a given threshold. The daemon becomes active when woken by another process or at a By default the daemon will only be concerned with al- timed interval. locations of order-3, the maximum contiguous area nor- mally considered to be reasonable. Users of larger con- There are two situations were compactd is woken up to tiguous allocations would set this value higher. Com- unconditionally compact memory. When there are not paction starts if the unusable free space index exceeds enough pages free, grouping pages by mobility may be 0.75, implying that 75% of currently free memory is un- forced to mix pages of different mobility types within usable for a contiguous allocation of the configured size. 2007 Linux Symposium, Volume One • 151

In contrast to a process directly reclaiming, the mi- cess rates and a reduction in the overall cost for releas- grate scanner checks all areas, not just those marked ing those areas. In order to evaluate the utility of seg- MOVABLE. The compaction daemon does not exit until regating the memory being released, a prototype cap- the two scanners meet. The pass is considered a success ture system was implemented. Benchmarking showed if the unusable free space index was below 0.75 before significant improvement in reallocation rates, but trig- the operation started, and above 0.75 after it completes. gered unexpected interactions with overall effectiveness and increased the chances of the machine going out-of- memory. More work is required. 8 Capturing Page Ranges

A significant factor in the efficiency of the reclaim algo- 9 Future Considerations rithm is its vulnerability to racing allocators. As pres- sure is applied to an area of memory, pages are evicted The intelligent reclaim mechanism focuses on the re- and released. Some pages will become free very rapidly claim of LRU pages because the required code is well as they are clean pages; others will need expensive disk understood and reliable. However, a significant percent- operations to record their contents. Over this period, age of areas are marked RECLAIMABLE, usually mean- the earliest pages are vulnerable to being used in servic- ing that they are backed by the slab allocator. The slab ing another allocation request. The loss of even a single allocator is able to reclaim pages, but like the vanilla page in the area prevents it being used for a contiguous allocator, it has no means to target which pages are re- allocation request, rendering the work Area Reclaim re- claimed. This is particularly true when objects in the dundant. dentry cache are being reclaimed. Intelligently reclaim- ing slab will be harder because there may be related ob- jects outside of the area that need to be reclaimed first. 8.1 Race Severity This needs investigation.

In order to get a feel for the scale of the problem, the Memory compaction will linearly scan memory from kernel was instrumented to record and report how often the start of the address space for pages to move. This pages under Area Reclaim were reallocated to another is suitable for compactd, but when direct compacting, allocator. This was measured during the Highalloc- it may be appropriate to select areas based on the LRU stress test as described in Section 4. lists. Tests similar to those used when reclaiming will be used to determine if the contiguous area is likely to The results in Figure 11 show that only a very small be successfully migrated. There are some potential is- portion of the areas targeted for reclaim survive to be sues with this, such as when the LRU pages are already allocated. Racing allocations steal more than 97% of all at the end of memory, but it has the potential to reduce areas before they coalesce. The size of the area under stalls incurred during compaction. reclaim directly contributes to the time to reclaim and the chances of being disrupted by an allocation. The initial results from the page capture prototype are promising but, they are not production-ready and need Area Allocations Rate further development and debugging. We have consid- Arch (MB) Good Raced (%) ered making the direct reclaim of contiguous regions x86-64 2 500 19125 2.61 synchronous to reduce the latency and the number of PPC64 16 152 13544 1.12 pages reclaimed.

Figure 11: Racing Allocations 10 Acknowledgements

This material is based upon work supported by the De- 8.2 Capture fense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. We would like to If the reallocation of pages being reclaimed could be thank Steve Fox, Tim Pepper, and Badari Pulavarty for prevented, there would be a significant increase in suc- their helpful comments and suggestions on early drafts 152 • Supporting the Allocation of Large Contiguous Regions of Memory of this paper. A big thanks go to Dave Hansen for his insights into the interaction between grouping pages by mobility and the size of the PPC64 TLB. Finally, we would like to acknowledge the help of the community with their review and helpful commentary when devel- oping the mechanisms described in this paper.

Legal Statement

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM R , PowerPC R , and POWER R are registered trademarks of International Business Machines Corporation in the United States, other countries, or both.

Linux R is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, and service names may be the trade- marks or service marks of others.

References

[1] P. J. Denning. Virtual memory. Computing Surveys, 2(3):153–189, Sept. 1970.

[2] M. Gorman and A. Whitcroft. The what, the why and the where to of anti-fragmentation. In Ottawa Linux Symposium 2006 Proceedings Volume 1, pages 361–377, 2006.

[3] Henessny, J. L. and Patterson, D. A. Computer Architecture a Quantitative Approach. Morgan Kaufmann Publishers, 1990.

[4] D. E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1. Addison-Wesley Publishing Co., Reading, Massachusetts, 2 edition, 1973.

[5] C. Lameter. Page migration. Kernel Source Documentation Tree, 2006.

[6] J. E. Navarro. Transparent operating system support for superpages. PhD thesis, 2004. Chairman-Peter Druschel.

[7] B. Randell. A note on storage fragmentation and program segmentation. Commun. ACM, 12(7):365–369, 1969. Kernel Scalability—Expanding the Horizon Beyond Fine Grain Locks

Corey Gough Suresh Siddha Ken Chen Intel Corp. Intel Corp. Google, Inc. [email protected] [email protected] [email protected]

Abstract node consisting of a set of one or more processors and physical memory units. Local node memory references, System time is increasing on enterprise workloads including references to physical memory on the same as multi-core and non-uniform memory architecture node and cache-to-cache transfers between processors (NUMA) systems become mainstream. A defining char- on the same node, are less expensive due to lower la- acteristic of the increase in system time is an increase in tency. References that cross nodes, or remote node ref- reference costs due to contention of shared resources— erences, are done at a higher latency due to the addi- instances of poor memory locality also play a role. An tional costs introduced by crossing a node interconnect. exploration of these issues reveals further opportunities As more nodes are added to a system, the cost to refer- to improve kernel scalability. ence memory on a far remote node may increase further as multiple interconnects are needed to link the source This paper examines kernel lock and scalability issues and destination nodes. encountered on enterprise workloads. These issues are examined at a software level through structure definition In terms of memory latency, the most expensive kernel and organization, and at a hardware level through cache memory references can be broadly categorized as re- line contention characteristics and system performance mote memory reads, or reads from physical memory on metrics. Issues and opportunities are illustrated in the a different node, and as remote cache-to-cache transfers, process scheduler, I/O paths, timers, slab, and IPC. or reads from a processor cache on a different node.

1 Introduction Many improvements have been added to the Linux ker- nel to reduce remote memory reads. Examples include Processor stalls due to memory latency are the most sig- libnuma and the kernel mbind() interface, NUMA nificant contributor to kernel CPI (cycles per instruc- aware slab allocation, and per-cpu scheduler group al- tion) on enterprise workloads. NUMA systems drive locations. These are all used to optimize memory lo- kernel CPI higher as cache misses that cost hundreds cality. Similarly, many improvements have been added of clock cycles on traditional symmetric multiprocess- to the kernel to improve lock sequences which decrease ing systems (SMP) can extend to over a thousand clock latency from cache-to-cache transfers. Use of the RCU cycles on large NUMA systems. With a fixed amount (read-copy update) mechanism has enabled several scal- of kernel instructions taking longer to execute, system ability improvements and continues to be utilized in time increases, resulting in fewer clock cycles for user new development. Use of finer grain locks such as applications. Increases in memory latency have a sim- array locks for SYSV semaphores, conversion of the ilar effect on the CPI of user applications. As a result, page_table_lock to per pmd locks for fast concur- minimizing the effects of memory latency increases on rent page faults, and per device block I/O unplug, all NUMA systems is essential to achieving good scalabil- contribute to reduce cache line contention. ity.

Cache coherent NUMA systems are designed to over- Through characterization of kernel memory latency, come limitations of traditional SMP systems, enabling analysis of high latency code paths, and examination of more processors and higher bandwidth. NUMA systems kernel data structure layout, we illustrate further oppor- split hardware resources into multiple nodes, with each tunities to reduce latency.

• 153 • 154 • Kernel Scalability—Expanding the Horizon Beyond Fine Grain Locks

1.1 Methodology lines that contain data that is read, but rarely, if ever, modified. Kernel memory latency characteristics are studied us- Cache lines in M (modified) state are only present in ing an enterprise OLTP (online transaction processing) one processor cache at a time. The data is dirty; it workload with Oracle Database 10g Release 2 on Dual- is modified and typically does not match the image in Core Intel R Itanium R 2 processors. The configuration main memory. Cache lines in M state can be directly includes a large database built on a robust, high per- transferred to another processors cache, with the abil- formance storage subsystem. The workload is tuned to ity to satisfy another processor’s read request detected make best use of the kernel, utilizing fixed process pri- through snooping. Before the cache line can be trans- ority and core affinity, optimized binding of interrupts, ferred, it must be written back to main memory. and use of kernel tunable settings where appropriate. Cache lines in I (invalid) state have been invalidated, Itanium processors provide a mechanism to precisely and they cannot be transferred to another processor’s profile cache misses. The Itanium EAR (event address cache. Cache lines are typically invalidated when there registers) performance monitoring registers latch data are multiple copies of the line in S state, and one of the reads that miss the L1 cache and collect an extended processors needs to invalidate all copies of the line so it set of information about targeted misses. This includes can modify the data. Cache lines in I state are evicted memory latency in clock cycles, the address of the load from a cache without being written back to main mem- instruction that caused the miss, the processor that re- ory. tired the load, and the address of the data item that was missed. 2.1 Cache Line Contention Several additional processing steps are taken to supple- ment the data collected by hardware. Virtual data ad- Coherency is maintained at a cache line level—the co- dresses are converted to a physical address during col- herency protocol does not distinguish between individ- lection using the tpa (translate physical address) instruc- ual bytes on the same cache line. For kernel structures tion and the resulting physical addresses are looked up that fit on a single cache line, modification of a single against SRAT (static resource affinity table) data struc- field in the structure will result in any other copies of tures to determine node location. Kernel symbols and the cache line containing the structure to be invalidated. gdwarf-2 annotated assembly code are analyzed to Cache lines are contended when there are several translate data addresses to kernel global variable names, threads that attempt to write to the same line concur- structure names, and structure field names. Symbols are rently or in short succession. To modify a cache line, enhanced by inlining spinlocks and sinking them into a processor must hold it in M state. Coherency opera- wrapper functions, with the exported function name de- tions and state transitions are necessary to accomplish scribing the lock being acquired, where it is acquired, this, and these come at a cost. When a cache line con- and by whom it is acquired. taining a kernel structure is modified by many differ- ent threads, only a single image of the line will exist 2 Cache Coherency across the processor caches, with the cache line trans- ferring from cache to cache as necessary. 
2.1 Cache Line Contention

Coherency is maintained at a cache line level—the coherency protocol does not distinguish between individual bytes on the same cache line. For kernel structures that fit on a single cache line, modification of a single field in the structure will result in any other copies of the cache line containing the structure being invalidated.

Cache lines are contended when there are several threads that attempt to write to the same line concurrently or in short succession. To modify a cache line, a processor must hold it in M state. Coherency operations and state transitions are necessary to accomplish this, and these come at a cost. When a cache line containing a kernel structure is modified by many different threads, only a single image of the line will exist across the processor caches, with the cache line transferring from cache to cache as necessary. This effect is typically referred to as cache line bouncing.

Cache lines are also contended when global variables or fields that are frequently read are located on the same cache line as data that is frequently modified. Coherency operations and state transitions are required to transition cache lines to S state so multiple processors can hold a copy of the cache line for reading. This effect is typically referred to as false sharing.

Cache line contention also occurs when a thread referencing a data structure is migrated to another processor, or when a second thread picks up computation based on a structure where a first thread left off—as is the case in interrupt handling. This behavior mimics contention between two different threads, as cache lines need to be transferred from one processor's cache to another to complete the processing.

Issues with cache line contention expand further with several of the critical kernel structures spanning multiple cache lines. Even in cases where a code path is referencing only a few fields in a structure, we frequently have contention across several different cache lines.

Trends in system design further intensify issues with cache line contention. Larger caches increase the chance that a cache miss on a kernel structure will hit in another processor's cache. Doubling the number of cores and threads per processor also increases the number of threads that can concurrently reference a kernel structure. The prevalence of NUMA increases the number of nodes in a system, adding latency to the reference types mentioned in the preceding section. Cache line contention that appears fairly innocuous on small servers today has the potential to transform into significant scalability problems in the future.
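As a concrete illustration of the two contention patterns above, the sketch below (hypothetical code, not from the paper) shows two hot fields sharing a 128-byte line, and the alignment fix that gives each field its own line:

    #include <stdio.h>
    #include <stddef.h>

    #define CACHE_LINE 128   /* line size on the Itanium 2 systems studied */

    struct counters_bad {
        long a;              /* written by CPU 0                  */
        long b;              /* written by CPU 1: false sharing   */
    };

    struct counters_good {
        long a __attribute__((aligned(CACHE_LINE)));
        long b __attribute__((aligned(CACHE_LINE)));  /* own line */
    };

    int main(void)
    {
        printf("bad:  a at %zu, b at %zu (same line)\n",
               offsetof(struct counters_bad, a),
               offsetof(struct counters_bad, b));
        printf("good: a at %zu, b at %zu (separate lines)\n",
               offsetof(struct counters_good, a),
               offsetof(struct counters_good, b));
        return 0;
    }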
3 NUMA Costs

In an experiment to characterize NUMA costs, a system is populated with two processors and 64 GB of memory using two different configurations. In the single node configuration, both processors and memory are placed into a single node—while the platform is NUMA, this configuration is representative of traditional SMP. In the split node configuration, processors and memory are divided evenly across two NUMA nodes. This experiment essentially toggles NUMA on and off without changing the physical computing resources, revealing interesting changes in kernel memory latency.

With the OLTP workload, scalability deteriorated when comparing the 2.6.18 kernel to the 2.6.9 kernel. On the 2.6.9 kernel, system time increases from 19% to 23% when comparing single node to split node configurations. With less processor time available for the user workload, we measure a 6% performance degradation. On the 2.6.18 kernel, comparing single node to split node configurations, system time increases from 19% to 25%, resulting in a more significant 10% performance degradation.

Kernel data cache miss latency increased 50% as we moved from single to split nodes, several times greater than the increase in user data cache miss latency. A kernel memory latency increase of this magnitude causes several of the critical kernel paths to take over 30% longer to execute—with kernel CPI increasing from 1.95 to 2.55. The two kernel paths that exhibited the largest increases in CPI were in process scheduling and I/O paths.

A comparison of single to split node configurations also reveals that a surprisingly large amount of kernel data cache misses are satisfied by cache-to-cache transfers. The most expensive cache misses were due to remote cache-to-cache transfers—reads from physical memory on a remote node accounted for a much smaller amount of the overall kernel memory latency.

4 2.6.20 Kernel

4.1 Memory Latency Characterization

Figure 1 illustrates the frequency of kernel misses at a given latency value. This data was collected using a NUMA system with two fully-populated nodes on the 2.6.20 kernel running the OLTP workload. Micro-benchmarks were used to measure latency lower bounds, confirming assumptions regarding latency of different reference types.

Figure 1: Kernel memory latency histogram (2.6.20 kernel; samples plotted against latency in core clocks, with four labeled peaks)

Four significant peaks exist on the kernel memory latency histogram. The first, largest peak corresponds to local node cache-to-cache transfers, and the second peak corresponds to local node reads from main memory. The third peak corresponds to a remote node read from main memory, and the fourth peak highlights remote node cache-to-cache transfers. Peaks do not represent absolute boundaries between different types of references, as the upper bound is not fixed for any reference type. They do, however, illustrate specific latency ranges where one reference type is much more frequent than another.

46% of the kernel data cache misses are node local; however, these inexpensive misses only account for 27% of the overall kernel latency. The remote node cache misses are considerably more expensive, particularly the remote node cache-to-cache transfers. The height and width of the fourth peak captures the severity of remote node cache-to-cache transfers. The tail of the fourth peak indicates references to a heavily contended cache line, for instance a contended spinlock.

A two node configuration is used in these experiments to illustrate that latency based scalability issues are not exclusive to big iron. As we scale up the number of NUMA nodes, latencies for remote cache-to-cache transfers increase and the chance that cache-to-cache transfers will be local inexpensive references decreases.
Figure 1 also illustrates that the most frequently referenced kernel structures have a tendency to hit in processor caches. Memory reads are, relatively speaking, infrequent. Optimizations targeting the location or NUMA allocation of data do not address cache-to-cache transfers because these structures are infrequently read from main memory.

In examining memory latency histograms for well tuned user applications, the highest peaks are typically local node cache-to-cache transfers and local node memory reads. The most expensive reference type, the remote node cache-to-cache transfer, is typically the smallest peak. In a well tuned user application, the majority of latency comes from local node references, or is at least split evenly between local and remote node references. In comparison to user applications, the kernel's resource management, communication, and synchronization responsibilities make it much more difficult to localize processing to a single node.

4.2 Cache Line Analysis

With cache-to-cache transfers making up the two tallest peaks in the kernel memory latency histogram, and remote cache-to-cache transfers representing the most expensive overall references, 2.6.20 cache line analysis is focused on only those samples representative of a cache-to-cache transfer.

Figure 2: Top 500 kernel cache misses based on total latency (breakdown by structure, as recovered from the chart: rq [scheduler] 16%, scsi_host [scsi] 13%, scsi_device [scsi] 10%, tvec_base [timers] 9%, request_queue [block] 7%, qla_host [scsi] 7%, blk_queue_tag [block] 6%, list_head [scheduler] 6%, time keeping - 7 globals 6%, kmem_list3 [slab] 5%, kioctx [aio] 4%, network - 6 structs 4%, *alien array_cache [slab] 3%, other 3%, slab [slab] 1%)

Figure 2 illustrates the top contended structures in the kernel, based on the top 500 cache lines contributing the most to kernel memory latency. While the top 500 cache lines represent only a fraction of 1% of the total kernel cache lines sampled during measurement, they account for 33% of total kernel latency. Among the thousands of structures in the kernel, only twelve structures, spread across the scheduler, timers, and I/O, make up the majority of the breakdown.

The chart is organized to co-locate nearby or related kernel paths. I/O includes structures from AIO, through block, through SCSI, down to the storage driver. These represent approximately half of the breakdown. Structures for slab allocation are also included on this side of the chart, as the I/O paths and related structures are the primary users of slab allocation. Scheduler paths, including the run queue, and list structures contained in the priority arrays, account for 21% of the breakdown—several percent more if we include IPC as part of this path. Dynamic timers account for 9% of the breakdown.

4.3 Cache Line Visualization

Figure 3 is an example of a concise data format used to illustrate issues with cache line contention.

samples remote module inst. address inst. function data address data

338 1.18% vmlinux 0xA000000100646036 cmpxchg4.acq [spinlock] @ schedule 0xE000001000054B50 rq_lock 633 59.40% vmlinux 0xA000000100069086 cmpxchg4.acq [spinlock] @ try_to_wake_up 0xE000001000054B50 rq_lock 67 22.39% vmlinux 0xA000000100009106 ld4 ia64_spinlock_contention 0xE000001000054B50 rq_lock 9 77.78% vmlinux 0xA000000100009126 cmpxchg4.acq ia64_spinlock_contention 0xE000001000054B50 rq_lock 72 2.78% vmlinux 0xA00000010006B386 cmpxchg4.acq [spinlock] @ task_running_tick 0xE000001000054B50 rq_lock 1 0.00% vmlinux 0xA00000010064B840 cmpxchg4.acq [spinlock] @ wake_sleeping_dep... 0xE000001000054B50 rq_lock 3 66.67% vmlinux 0xA00000010006BC76 ld4.acq resched_task 0xE000001000054B50 rq_lock 18 44.44% vmlinux 0xA000000100069EA6 cmpxchg4.acq [spinlock] @ try_to_wake_up 0xE000001000054B50 rq_lock

5 0.00% vmlinux 0xA000000100646130 ld8 schedule 0xE000001000054B58 nr_running 2 0.00% vmlinux 0xA000000100646726 ld8 schedule 0xE000001000054B58 nr_running 2 0.00% vmlinux 0xA000000100070806 ld8 try_to_wake_up 0xE000001000054B58 nr_running 1 0.00% vmlinux 0xA00000010006CAF0 ld8 move_tasks 0xE000001000054B58 nr_running 3 0.00% vmlinux 0xA00000010006AE60 ld8 deactivate_task 0xE000001000054B58 nr_running 1 0.00% vmlinux 0xA0000001000753C6 ld8 scheduler_tick 0xE000001000054B58 nr_running 32 81.25% vmlinux 0xA00000010006BBD0 ld8 __activate_task 0xE000001000054B58 nr_running

30 0.00% vmlinux 0xA000000100068010 ld8 find_busiest_group 0xE000001000054B60 raw_weighted_load 3 0.00% vmlinux 0xA000000100070720 ld8 try_to_wake_up 0xE000001000054B60 raw_weighted_load 2 0.00% vmlinux 0xA000000100070740 ld8 try_to_wake_up 0xE000001000054B60 raw_weighted_load 10 0.00% vmlinux 0xA000000100070780 ld8 try_to_wake_up 0xE000001000054B60 raw_weighted_load 1 0.00% vmlinux 0xA00000010006CAA6 ld8 move_tasks 0xE000001000054B60 raw_weighted_load 1 0.00% vmlinux 0xA00000010006CAF6 ld8 move_tasks 0xE000001000054B60 raw_weighted_load 6 0.00% vmlinux 0xA000000100075486 ld8 scheduler_tick 0xE000001000054B60 raw_weighted_load 21 61.90% vmlinux 0xA00000010006BBD6 ld8 __activate_task 0xE000001000054B60 raw_weighted_load

14 0.00% vmlinux 0xA000000100068030 ld8 find_busiest_group 0xE000001000054B68 cpu_load[0] 2 0.00% vmlinux 0xA000000100070786 ld8 try_to_wake_up 0xE000001000054B68 cpu_load[0] 5 0.00% vmlinux 0xA000000100070790 ld8 try_to_wake_up 0xE000001000054B68 cpu_load[0] 2 0.00% vmlinux 0xA000000100067F70 ld8 find_busiest_group 0xE000001000054B68 cpu_load[0]

1 0.00% vmlinux 0xA000000100068030 ld8 find_busiest_group 0xE000001000054B70 cpu_load[1] 1 0.00% vmlinux 0xA000000100075490 ld8 scheduler_tick 0xE000001000054B70 cpu_load[1]

Figure 3: First cache line of rq struct

Every path that references a structure field satisfied by a cache-to-cache transfer is listed. Samples that are local cache hits or cache misses that reference main memory are not included; as a result, we do not see each and every field referenced in a structure. Several columns of data are included to indicate contention hotspots and to assist in locating code paths for analysis.

• samples – The number of sampled cache misses at a given reference point

• remote – The ratio of cache misses that reference a remote node compared to local node references

• module – The kernel module that caused the cache miss

• instruction address – The virtual address for the instruction that caused the miss

• instruction – Indicates the instruction. This illustrates the size of the field; for example, ld4 is a load of a four byte field. For locks, the instruction indicates whether a field was referenced atomically, as is the case for a compare and exchange or an atomic increment / decrement

• function – The kernel function that caused the miss. In the case of spinlocks, we indicate the spinlock call as well as the caller function

• data address – The virtual address of the data item missed. This helps to illustrate spatial characteristics of contended structure fields

• data – Structure field name or symbolic information for the data item

4.4 Process Scheduler

The process scheduler run queue structure spans four 128 byte cache lines, with the majority of latency coming from the try_to_wake_up() path. Cache lines from the rq struct are heavily contended due to remote wakeups and local process scheduling referencing the same fields. In the first cache line of rq, the most expensive references to the rq lock and remote references from __activate_task come from the wakeup path.

Remote wakeups are due to two frequent paths. First, database transactions are completed when a commit occurs—this involves a write to an online log file. Hundreds of foreground processes go to sleep in the final stage of completing a transaction and need to be woken up when their transaction is complete. Second, database processes are submitting I/O on one processor and the interrupt is arriving on another processor, causing processes waiting for I/O to be woken up on a remote processor.

samples remote module inst. address inst. function data address data

2 0.00% vmlinux 0xA000000100647420 ld8 schedule 0xE000001000014B80 nr_switches

1 0.00% vmlinux 0xA000000100646100 ld8 schedule 0xE000001000014B88 nr_uninterruptible

2 0.00% vmlinux 0xA00000010006CB60 ld8 move_tasks 0xE000001000014B98 most_recent_timestamp

752 63.70% vmlinux 0xA000000100066000 ld8 idle_cpu 0xE000001000014BA0 curr (task_struct*) 5 0.00% vmlinux 0xA00000010006C4F0 ld8 move_tasks 0xE000001000014BA0 curr (task_struct*) 6 0.00% vmlinux 0xA0000001006468B0 ld8 schedule 0xE000001000014BA0 curr (task_struct*) 1 0.00% vmlinux 0xA000000100070F10 ld8 try_to_wake_up 0xE000001000014BA0 curr (task_struct*)

799 65.96% vmlinux 0xA000000100066006 ld8 idle_cpu 0xE000001000014BA8 idle (task_struct*) 1 0.00% vmlinux 0xA000000100646770 ld8 schedule 0xE000001000014BA8 idle (task_struct*) 10 0.00% vmlinux 0xA0000001006468B6 ld8 schedule 0xE000001000014BA8 idle (task_struct*) 4 0.00% vmlinux 0xA000000100647056 ld8 schedule 0xE000001000014BA8 idle (task_struct*) 1 0.00% vmlinux 0xA0000001006472E0 ld8 schedule 0xE000001000014BA8 idle (task_struct*) 31 0.00% vmlinux 0xA0000001000753A6 ld8 scheduler_tick 0xE000001000014BA8 idle (task_struct*) 4 0.00% vmlinux 0xA000000100065C90 ld8 account_system_time 0xE000001000014BA8 idle (task_struct*) 176 1.14% vmlinux 0xA000000100645E46 ld8 schedule 0xE000001000014BA8 idle (task_struct*)

2 0.00% vmlinux 0xA000000100075526 ld8 scheduler_tick 0xE000001000014BB0 next_balance

3 0.00% vmlinux 0xA000000100647DE6 ld8 schedule 0xE000001000014BB8 prev_mm (mm_struct*)

6 0.00% vmlinux 0xA000000100646950 ld8 schedule 0xE000001000014BC0 active (prio_array*) 1 0.00% vmlinux 0xA00000010006B306 ld8 task_running_tick 0xE000001000014BC0 active (prio_array*) 17 64.71% vmlinux 0xA00000010006BB86 ld8 __activate_task 0xE000001000014BC0 active (prio_array*)

5 0.00% vmlinux 0xA00000010006C4F6 ld8 move_tasks 0xE000001000014BC8 expired (prio_array*)

13 0.00% vmlinux 0xA00000010006AD86 ld4 dequeue_task 0xE000001000014BD0 arrays[0].nr_active 2 50.00% vmlinux 0xA00000010006B2B6 ld4 enqueue_task 0xE000001000014BD0 arrays[0].nr_active

1 0.00% vmlinux 0xA00000010006AE10 ld4.acq dequeue_task 0xE000001000014BE8 arrays[0].bitmap

Figure 4: Second cache line of rq struct

A new feature being discussed in the Linux kernel community, called syslets, could be used to address issues with remote wakeups. Syslets are simple lightweight programs consisting of system calls, or atoms, that the kernel can execute autonomously and asynchronously. The kernel utilizes a different thread to execute the atoms asynchronously, even if the user application making the call is single threaded. Using this mechanism, user applications can wake up foreground processes on local nodes in parallel by splitting the work between a number of syslets.

A near term approach, and one complementary to node local wakeups using syslets, would be to minimize the number of remote cache line references in the try_to_wake_up() path. Figure 3 confirms that the run queue is not cache line aligned. By aligning the run queue, the rq structure uses one less cache line. This results in a measurable performance increase, as the scheduler stats ttwu field on the fourth cache line is brought into the third cache line alongside other data used in the wakeup.

The second cache line of the rq struct in Figure 4 shows a high level of contention between the idle and curr task_struct pointers in idle_task. This issue originates from the decision to schedule a task on an idle sibling if the processor targeted for a wakeup is busy. In this path, we reference the sibling's runqueue to check if it is busy or not. When a wakeup occurs remotely, the sibling's runqueue status is also checked, resulting in additional remote cache-to-cache transfers. Load balancing at the SMT sched_domain happens more frequently, influencing the equal distribution of the load between siblings. Instead of checking the sibling, idle_cpu() can simply return the target cpu if there is more than one task running, because it is likely that siblings will also have tasks running. This change reduces contention on the second cache line, and also results in a measurable performance increase.

The third line contains one field that contributes to cache line contention, the sd sched_domain pointer used in try_to_wake_up(). The fourth rq cache line contains scheduler stats fields, with the most expensive remote references coming from the try to wake up count field mentioned earlier. The list_head structures in the top 500 are also contended in the wakeup path. When a process is woken up, it needs to be added to the prio_array queue.
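A minimal sketch of the idle_cpu() short-circuit described above, assuming simplified 2.6.20-era runqueue fields (an illustration of the idea, not the actual patch):

    struct task_struct;

    struct rq_sketch {
        unsigned long nr_running;
        struct task_struct *curr, *idle;
    };

    extern struct rq_sketch runqueues[];   /* hypothetical per-cpu array */

    static int idle_cpu_sketch(int cpu)
    {
        struct rq_sketch *rq = &runqueues[cpu];

        /* With more than one task running, siblings are likely busy
         * too; answer without touching a sibling's cache lines. */
        if (rq->nr_running > 1)
            return 0;

        return rq->curr == rq->idle;
    }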

4.5 Timers

Processes use the per-processor tvec_base that corresponds to the processor they are running on when adding timers. This ensures timers are always added locally. The majority of scalability issues with timers are introduced during timer deletion. With I/O submitted on a different processor than it is completed on, it is necessary to acquire a remote tvec_base lock in order to detach a timer from that base. Figure 5 illustrates cache line contention for the tvec_base lock. The mod_timer() path represents deletes of the block queue plug timers, which are unnecessary given we are using direct I/O in this workload. Cache line contention for the tvec_base array lock in del_timer() represents deletes from the scsi_delete_timer() path—the result of deleting a SCSI command eh_timeout timer upon I/O completion.

Getting the base_lock for every I/O completion results in tens of thousands of expensive remote cache-to-cache transfers per second. The tvec_base locks are among the most contended locks in this workload. An alternative approach would be to batch timers for deletion, keeping track of a timer's state. A timer's state could be changed locally during an I/O completion, since the timer is part of the scsi_cmnd struct that is read into the cache earlier in the path. Timer state could indicate both when the timer can be detached and when the kernel can free the memory.

Where we have a large number of timers to delete across a number of tvec_base structures, we can prefetch locks, hiding latency for the majority of the lock references in the shadow of the first tvec_base lock referenced.
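The batching alternative above can be sketched as follows (all names are hypothetical, not the kernel timer API):

    enum t_state { T_PENDING, T_ZOMBIE };

    struct batched_timer {
        enum t_state state;            /* flipped locally at completion */
        struct batched_timer *next;
    };

    /* Completion path: mark only; no remote tvec_base lock is taken.
     * The timer is part of the scsi_cmnd already pulled into cache. */
    static void timer_mark_dead(struct batched_timer *t)
    {
        t->state = T_ZOMBIE;
    }

    /* Later, on the CPU owning the base: one lock acquisition reaps a
     * whole batch of dead timers, detaching and freeing each one. */
    static void timer_reap(struct batched_timer *head)
    {
        for (struct batched_timer *t = head; t; t = t->next)
            if (t->state == T_ZOMBIE) {
                /* detach from the tvec_base, then free the memory */
            }
    }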
4.6 Slab

Submitting I/O on one processor and handling the interrupt on another processor also affects scalability of slab allocation. Comparisons between the 2.6.20 and the 2.6.9 kernel, which did not have NUMA aware slab allocation, indicate the slab allocation is hurting more than it is helping on the OLTP workload. Figures 6, 7, and 8 illustrate scalability issues introduced by freeing slabs from a remote node. The alien pointer in the kmem_list3 structure and the nodeid in the slab structure are mostly read, so we may be able to reduce false sharing of these fields by moving them to another cache line. If cache lines with these fields exist in S state, several copies of the data can be held simultaneously by multiple processors, and we can eliminate some of the remote cache-to-cache transfers.

Further opportunity may exist in using kmem_cache_alloc calls that result in refills to free local slabs currently batched to be freed on remote processors. This has the potential to reduce creation of new slabs, localize memory references for slabs that have been evicted from the cache, and provide more efficient lock references.

A number of additional factors limit scalability. Alien array sizes are limited to 12 and are not resized based on slab tunables. In addition, the total size of the alien arrays increases quadratically as the number of nodes increases, putting additional memory pressure on the system.

A similar analysis for the new slub (unqueued slab) allocator is in progress on this configuration to determine if similar issues exist.

4.7 I/O: AIO

As illustrated throughout this paper, cache line contention of I/O structures is primarily due to submitting I/O on one processor and handling the interrupt on another processor. A superior solution to this scalability issue would be to utilize hardware and device drivers that support extended message-signaled interrupts (MSI-X) and support multiple per-cpu or per-node queues. Using extended features of MSI, a device can send messages to a specific set of processors in the system. When an interrupt occurs, the device can decide a target processor for the interrupt. In the case of the OLTP workload, these features could be used to enable I/O submission and completion on the same processor. This approach also ensures that slab objects and timers get allocated, referenced, and freed within a local node.

Another approach, explored in earlier Linux kernels, is to delay the I/O submission so it can be executed on the processor or node which will receive the interrupt during I/O completion. Performance results with this approach were mixed, with a range of both good and bad results across different workloads. This model was also burdened with the complexity of handling many different corner cases.
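The affinity idea above might be expressed as in the following sketch; the helper, types, and per-cpu vector table are hypothetical, and no real driver API is implied:

    struct io_request_sketch {
        int submit_cpu;                  /* recorded at submission time */
    };

    extern int msix_vector_for_cpu[];    /* one MSI-X vector per CPU */

    static int completion_vector(const struct io_request_sketch *req)
    {
        /* Deliver the completion interrupt to the submitting CPU, so
         * slab objects and timers are allocated, referenced, and
         * freed within one node. */
        return msix_vector_for_cpu[req->submit_cpu];
    }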

samples remote module inst. address inst. function data address data

39 17.95% vmlinux 0xA000000100009106 ld4 ia64_spinlock_contention 0xE00000100658C000 lock 9 11.11% vmlinux 0xA000000100009126 cmpxchg4.acq ia64_spinlock_contention 0xE00000100658C000 lock 307 2.93% vmlinux 0xA00000010009B4C6 cmpxchg4.acq [spinlock] @ __mod_timer 0xE00000100658C000 lock 861 58.07% vmlinux 0xA0000001000995A6 cmpxchg4.acq [spinlock] @ del_timer 0xE00000100658C000 lock 10 0.00% vmlinux 0xA00000010009B666 cmpxchg4.acq [spinlock] @ run_timer_softirq 0xE00000100658C000 lock 982 40.22% vmlinux 0xA0000001000996C6 cmpxchg4.acq [spinlock] @ mod_timer 0xE00000100658C000 lock 24 0.00% vmlinux 0xA0000001000997E6 cmpxchg4.acq [spinlock] @ try_to_del_timer_sync 0xE00000100658C000 lock 1 0.00% vmlinux 0xA00000010009B966 cmpxchg4.acq [spinlock] @ run_timer_softirq 0xE00000100658C000 lock

7 14.29% vmlinux 0xA00000010009B466 ld8 __mod_timer 0xE00000100658C008 running_timer (timer_list*)

3 0.00% vmlinux 0xA00000010009A906 ld8 internal_add_timer 0xE00000100658C010 timer_jiffies 78 1.28% vmlinux 0xA00000010009B606 ld8 run_timer_softirq 0xE00000100658C010 timer_jiffies

Figure 5: Cache line with tvec_base struct

samples remote module inst. address inst. function data address data

9 100.00% vmlinux 0xA0000001001580F0 ld8 ____cache_alloc_node 0xE00000207945BD00 slabs_partial.next (list_head*) 5 100.00% vmlinux 0xA0000001002C3556 ld8 list_del 0xE00000207945BD00 slabs_partial.next (list_head*) 1 100.00% vmlinux 0xA0000001002C3726 ld8 __list_add 0xE00000207945BD00 slabs_partial.next (list_head*) 2 100.00% vmlinux 0xA0000001002C3816 ld8 list_add 0xE00000207945BD00 slabs_partial.next (list_head*)

2 100.00% vmlinux 0xA0000001002C35C0 ld8 list_del 0xE00000207945BD08 slabs_partial.prev (list_head*) 22 59.09% vmlinux 0xA0000001002C36B6 ld8 __list_add 0xE00000207945BD08 slabs_partial.prev (list_head*) 17 70.59% vmlinux 0xA000000100159EE0 ld8 free_block 0xE00000207945BD08 slabs_partial.prev (list_head*)

3 66.67% vmlinux 0xA0000001002C3556 ld8 list_del 0xE00000207945BD10 slabs_full.next (list_head*) 1 100.00% vmlinux 0xA0000001002C3726 ld8 __list_add 0xE00000207945BD10 slabs_full.next (list_head*)

2 100.00% vmlinux 0xA000000100158200 ld8 ____cache_alloc_node 0xE00000207945BD30 free_objects 24 54.17% vmlinux 0xA000000100159E10 ld8 free_block 0xE00000207945BD30 free_objects

408 100.00% vmlinux 0xA0000001001580D6 cmpxchg4.acq [spinlock] @ ____cache_alloc_node 0xE00000207945BD40 list_lock 17 0.00% vmlinux 0xA00000010015A206 cmpxchg4.acq [spinlock] @ cache_flusharray 0xE00000207945BD40 list_lock 114 100.00% vmlinux 0xA00000010015A3B6 cmpxchg4.acq [spinlock] @ __drain_alien_cache 0xE00000207945BD40 list_lock 18 0.00% vmlinux 0xA000000100158656 cmpxchg4.acq [spinlock] @ cache_alloc_refill 0xE00000207945BD40 list_lock 57 84.21% vmlinux 0xA000000100009106 ld4 ia64_spinlock_contention 0xE00000207945BD40 list_lock 3 100.00% vmlinux 0xA000000100009126 cmpxchg4.acq ia64_spinlock_contention 0xE00000207945BD40 list_lock

727 0.00% vmlinux 0xA0000001001596B0 ld8 kmem_cache_free 0xE00000207945BD50 alien (array_cache**)

Figure 6: Cache line with kmem_list3 struct

Figure 9 illustrates the primary contention point for the kioctx structure, the ctx_lock that protects a per-process AIO context. This structure is dynamically allocated as one per process and lives throughout the lifetime of an AIO context. The first cache line is contended due to lock references in both the I/O submission and completion paths.

One approach to improve the kioctx structure would be to reorder structure fields to create a cache line with fields most frequently referenced in the submit path, and another cache line with fields most frequently referenced in the complete path. Further analysis is required to determine the feasibility of this concept. Contention in the I/O submission path occurs with __aio_get_req, which needs the lock to put an iocb on a linked list, and with aio_run_iocb, which needs the lock to mark the current request as running. Contention in the I/O completion path occurs with aio_complete, which needs the lock to put a completion event into the event queue, unlink an iocb from a linked list, and perform a process wakeup if there are waiters.

Raman, Hundt, and Mannarswamy introduced a technique for structure layout in multi-threaded applications that optimizes for both improved spatial locality and reduced false sharing [7]. Reordering fields in a few optimized structures in the HP-UX operating system kernel improved performance up to 3.2% on enterprise workloads. Structures across the I/O layers appear to be good candidates for such optimization.

The aio_complete function is a heavyweight function with very long lock hold times. Having a lock hold time dominated by one processor contributes to longer contention wait times, with other shorter paths occurring frequently on all the other processors. In some cases, increasing the number of AIO contexts may help address this.

Figure 10 illustrates the AIO context internal structure used to track the kernel mapping of the AIO ring_info structure and the AIO event buffer.
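The reordering proposed above might look like the following sketch; the field selection is abridged and hypothetical, and only the split by path is the point:

    #define CACHE_LINE 128

    struct kioctx_sketch {
        /* line 0: fields hot in the submission path
         * (__aio_get_req, aio_run_iocb) */
        void *submit_side_lock;
        unsigned reqs_active;

        /* line 1: fields hot in the completion path (aio_complete) */
        struct {
            void *ring_pages;
            unsigned tail;
        } complete __attribute__((aligned(CACHE_LINE)));
    };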

samples remote module inst. address inst. function data address data

9 0.00% vmlinux 0xA000000100159740 ld4 kmem_cache_free 0xE000000128DA5580 avail 1 0.00% vmlinux 0xA000000100159770 ld4 kmem_cache_free 0xE000000128DA5580 avail

5 0.00% vmlinux 0xA00000010015A3F0 ld4 __drain_alien_cache 0xE000000128DA5584 limit 8 0.00% vmlinux 0xA000000100159746 ld4 kmem_cache_free 0xE000000128DA5584 limit

30 0.00% vmlinux 0xA000000100009106 ld4 ia64_spinlock_contention 0xE000000128DA5590 lock 2 0.00% vmlinux 0xA000000100009126 cmpxchg4.acq ia64_spinlock_contention 0xE000000128DA5590 lock 1793 0.00% vmlinux 0xA000000100159716 cmpxchg4.acq [spinlock] @ cache_free_alien 0xE000000128DA5590 lock

2 0.00% vmlinux 0xA0000001002C07A0 ld8 __copy_user 0xE000000128DA5598 objpp (entry[0]*)

2 0.00% vmlinux 0xA0000001002C07A6 ld8 __copy_user 0xE000000128DA55A0 objpp[i] 3 0.00% vmlinux 0xA000000100159C20 ld8 free_block 0xE000000128DA55A0 objpp[i]

4 0.00% vmlinux 0xA000000100159C20 ld8 free_block 0xE000000128DA55B0 objpp[i]

1 0.00% vmlinux 0xA000000100159C20 ld8 free_block 0xE000000128DA55C0 objpp[i]

1 0.00% vmlinux 0xA000000100159C20 ld8 free_block 0xE000000128DA55D8 objpp[i]

2 0.00% vmlinux 0xA000000100159C20 ld8 free_block 0xE000000128DA55E0 objpp[i]

Figure 7: Cache line with alien array_cache struct

samples remote module inst. address inst. function data address data

4 100.00% vmlinux 0xA0000001002C3556 ld8 list_del 0xE0000010067D0000 list.next (list_head*) 1 100.00% vmlinux 0xA0000001002C35B0 ld8 list_del 0xE0000010067D0000 list.next (list_head*) 1 100.00% vmlinux 0xA0000001002C3726 ld8 __list_add 0xE0000010067D0000 list.next (list_head*)

88 98.86% vmlinux 0xA0000001002C35C0 ld8 list_del 0xE0000010067D0008 list.prev (list_head*) 79 100.00% vmlinux 0xA0000001002C36B6 ld8 __list_add 0xE0000010067D0008 list.prev (list_head*)

64 100.00% vmlinux 0xA000000100158160 ld4 ____cache_alloc_node 0xE0000010067D0020 inuse 2 100.00% vmlinux 0xA000000100159DF6 ld4 free_block 0xE0000010067D0020 inuse

4 100.00% vmlinux 0xA000000100158226 ld4 ____cache_alloc_node 0xE0000010067D0024 free (kmem_bufctl_t)

513 45.03% vmlinux 0xA000000100159636 ld2 kmem_cache_free 0xE0000010067D0028 nodeid

Figure 8: Cache line with slab struct

A small percentage of the cache line contention comes from the I/O submit path, where the kernel needs to look up the kernel mapping for the AIO ring_info structure. A large percentage of the cache line contention comes from the I/O interrupt path, where aio_complete needs to look up the kernel mapping of the AIO event buffer. In this case, the majority of contention comes from many I/O completions happening one after the other.

4.8 I/O: Block

Figure 11 illustrates contention over request_queue structures. The primary contention points are between the queue_flags used in the submit path to check the block device queue's status and the queue_lock used in the return path to perform a reference count on device_busy. This suggests further opportunity for structure ordering based on submit and complete paths.

Figure 12 illustrates the blk_queue_tag structure. Lack of alignment with this structure causes unnecessary cache line contention: the size of blk_queue_tag is less than half a cache line, so we end up with portions of two to three different tags sharing the same 128 byte cache line. Cache lines are frequently bounced between processors with multiple tags from independent devices sharing the same cache line.

4.9 I/O: SCSI

Both the scsi_host and scsi_device structures span several cache lines and may benefit from reordering. Both structures feature a single cache line with multiple contended fields, and several other lines which have a single field that is heavily contended.

Figure 13 illustrates the most heavily contended cache line of the scsi_host structure. The primary source of contention is in the submit path, with an unconditional spin lock acquire in __scsi_put_command to put a scsi_cmnd on a local free_list, and reads of the cmd_pool field. Multiple locks on the same cache line are detrimental, as this slows progress in the submit path with the cache line bouncing between processors as they submit I/O.
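A sketch of the alignment fix implied above: forcing each tag structure onto its own 128-byte line so tags from independent devices no longer share one (fields abridged and illustrative):

    struct blk_queue_tag_sketch {
        unsigned long *tag_map;
        int busy;
        int max_depth;
        int real_max_depth;
    } __attribute__((aligned(128)));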

samples remote module inst. address inst. function data address data

1 0.00% vmlinux 0xA0000001001AB240 fetchadd4.rel lookup_ioctx 0xE0000010252A1900 users 2 0.00% vmlinux 0xA0000001001ABD60 fetchadd4.rel sys_io_getevents 0xE0000010252A1900 users

44 2.27% vmlinux 0xA0000001001AB1A6 ld8 lookup_ioctx 0xE0000010252A1910 mm (mm_struct*)

14 71.43% vmlinux 0xA000000100066E56 cmpxchg4.acq [spinlock] @ wake_up_common 0xE0000010252A1920 wait.lock (wait_queue_head) 1 0.00% vmlinux 0xA000000100009126 cmpxchg4.acq ia64_spinlock_contention 0xE0000010252A1920 wait.lock (wait_queue_head) 4 0.00% vmlinux 0xA0000001000B9F76 cmpxchg4.acq [spinlock] @ add_wait_queue_exc.. 0xE0000010252A1920 wait.lock (wait_queue_head) 6 0.00% vmlinux 0xA0000001000B9FD6 cmpxchg4.acq [spinlock] @ remove_wait_queue 0xE0000010252A1920 wait.lock (wait_queue_head)

1 0.00% vmlinux 0xA0000001002C3556 ld8 list_del 0xE0000010252A1928 wait.next 5 60.00% vmlinux 0xA000000100065EB0 ld8 __wake_up_common 0xE0000010252A1928 wait.next 4 75.00% vmlinux 0xA0000001001A9F90 ld8 aio_complete 0xE0000010252A1928 wait.next

372 1.34% vmlinux 0xA0000001001AA2D6 cmpxchg4.acq [spinlock] @ aio_run_iocb 0xE0000010252A1938 ctx_lock 1066 1.13% vmlinux 0xA0000001001AAA36 cmpxchg4.acq [spinlock] @ __aio_get_req 0xE0000010252A1938 ctx_lock 14 0.00% vmlinux 0xA0000001001ACC36 cmpxchg4.acq [spinlock] @ io_submit_one 0xE0000010252A1938 ctx_lock 1214 62.03% vmlinux 0xA0000001001A8FB6 cmpxchg4.acq [spinlock] @ aio_complete 0xE0000010252A1938 ctx_lock 181 34.25% vmlinux 0xA000000100009106 ld4 ia64_spinlock_contention 0xE0000010252A1938 ctx_lock 23 26.09% vmlinux 0xA000000100009126 cmpxchg4.acq ia64_spinlock_contention 0xE0000010252A1938 ctx_lock 33 69.70% vmlinux 0xA0000001001A9586 ld4.acq __aio_put_req 0xE0000010252A1938 ctx_lock 39 82.05% vmlinux 0xA0000001001A97D6 ld4.acq __aio_put_req 0xE0000010252A1938 ctx_lock 52 0.00% vmlinux 0xA0000001001A99B6 cmpxchg4.acq [spinlock] @ aio_put_req 0xE0000010252A1938 ctx_lock

4 0.00% vmlinux 0xA0000001001AABD0 ld4 __aio_get_req 0xE0000010252A193C reqs_active 28 64.29% vmlinux 0xA0000001001A98C6 ld4 __aio_put_req 0xE0000010252A193C reqs_active

2 50.00% vmlinux 0xA0000001002C35C0 ld8 list_del 0xE0000010252A1948 active_reqs.prev

24 4.17% vmlinux 0xA0000001001A80D0 ld8 aio_read_evt 0xE0000010252A1978 ring_info.ring_pages (page**) 7 0.00% vmlinux 0xA0000001001A82B0 ld8 aio_read_evt 0xE0000010252A1978 ring_info.ring_pages (page**) 1 0.00% vmlinux 0xA0000001001AAA96 ld8 __aio_get_req 0xE0000010252A1978 ring_info.ring_pages (page**) 3 33.33% vmlinux 0xA0000001001A9C56 ld8 aio_complete 0xE0000010252A1978 ring_info.ring_pages (page**) 16 75.00% vmlinux 0xA0000001001A9D90 ld8 aio_complete 0xE0000010252A1978 ring_info.ring_pages (page**)

Figure 9: Cache line with kioctx struct

samples remote module inst. address inst. function data address data

55 1.82% vmlinux 0xA0000001001A8216 cmpxchg4.acq [spinlock] @ aio_read_evt 0xE0000010252A1980 lock

2 0.00% vmlinux 0xA0000001001A8240 ld4 aio_read_evt 0xE0000010252A1990 nr 2 0.00% vmlinux 0xA0000001001A83E0 ld4 aio_read_evt 0xE0000010252A1990 nr

3 0.00% vmlinux 0xA0000001001A9CD0 ld4 aio_complete 0xE0000010252A1994 tail

41 0.00% vmlinux 0xA0000001001A80E0 ld8 aio_read_evt 0xE0000010252A1998 internal_pages[0] (page*) 415 1.20% vmlinux 0xA0000001001AAAA6 ld8 __aio_get_req 0xE0000010252A1998 internal_pages[0] (page*) 1033 58.95% vmlinux 0xA0000001001A9C76 ld8 aio_complete 0xE0000010252A1998 internal_pages[0] (page*) 1 100.00% vmlinux 0xA0000001001A9DB0 ld8 aio_complete 0xE0000010252A1998 internal_pages[0] (page*)

1 100.00% vmlinux 0xA0000001001A9DB0 ld8 aio_complete 0xE0000010252A19A0 internal_pages[1] (page*)

3 0.00% vmlinux 0xA0000001001A82D0 ld8 aio_read_evt 0xE0000010252A19A8 internal_pages[2] (page*)

Figure 10: Cache line with ring_info struct

Figure 14 illustrates contention in the scsi_device struct, with scsi_request_fn() referencing the host pointer to process the I/O, and the low level driver checking the scsi_device queue_depth to determine whether it should change the queue_depth on a specific SCSI device. These reads lead to false sharing on the queue_depth field, as scsi_adjust_queue_depth() is not called during the workload. In this case, the scsi_qla_host flags could be extended to indicate whether queue_depth needs to be adjusted, resulting in the interrupt service routine making frequent local references rather than expensive remote references.

Figure 15 illustrates further contention between the I/O submit and complete paths, as sd_init_command() reads of the timeout and changed fields in the submit path conflict with writes of iodone_cnt in the complete path.

5 Conclusions

Kernel memory latency increases significantly as we move from traditional SMP to NUMA systems, resulting in less processor time available for user workloads. The most expensive references, remote cache-to-cache transfers, primarily come from references to a select few structures in the process wakeup and I/O paths. Several approaches may provide mechanisms to alleviate scalability issues in the future. Examples include hardware and software utilizing message-signaled interrupts (MSI-X) with per-cpu or per-node queues, and syslets. In the near term, several complementary approaches can be taken to improve scalability.

samples remote module inst. address inst. function data address data

3 66.67% vmlinux 0xA0000001001D1F26 ld8 __blockdev_direct_IO 0xE00000012C905C00 backing_dev_info.unplug_io_fn

1 100.00% vmlinux 0xA00000010029A0A0 ld8 blk_backing_dev_unplug 0xE00000012C905C08 backing_dev_info.unplug_io_data

1 0.00% scsi 0xA00000021E2A9B80 ld8 scsi_run_queue 0xE00000012C905C10 queuedata

147 59.86% vmlinux 0xA0000001002942E6 ld4.acq generic_make_request 0xE00000012C905C28 queue_flags 1 0.00% vmlinux 0xA000000100294750 ld4.acq __freed_request 0xE00000012C905C28 queue_flags 3 33.33% vmlinux 0xA000000100294766 cmpxchg4.acq __freed_request 0xE00000012C905C28 queue_flags 1 0.00% vmlinux 0xA000000100294780 ld4.acq __freed_request 0xE00000012C905C28 queue_flags 6 16.67% vmlinux 0xA000000100294796 cmpxchg4.acq __freed_request 0xE00000012C905C28 queue_flags 20 55.00% vmlinux 0xA000000100296B66 cmpxchg4.acq blk_remove_plug 0xE00000012C905C28 queue_flags 1 100.00% scsi 0xA00000021E2AF2D6 cmpxchg4.acq scsi_request_fn 0xE00000012C905C28 queue_flags 1 0.00% vmlinux 0xA000000100297456 ld4.acq blk_plug_device 0xE00000012C905C28 queue_flags 4 25.00% vmlinux 0xA000000100297486 cmpxchg4.acq blk_plug_device 0xE00000012C905C28 queue_flags 3 66.67% vmlinux 0xA000000100295930 ld4.acq get_request 0xE00000012C905C28 queue_flags 6 50.00% vmlinux 0xA00000010028DE96 ld4.acq elv_insert 0xE00000012C905C28 queue_flags

6 50.00% vmlinux 0xA00000010029C046 cmpxchg4.acq [spinlock] @ __make_request 0xE00000012C905C30 __queue_lock 3 66.67% vmlinux 0xA000000100290156 cmpxchg4.acq [spinlock] @ blk_run_queue 0xE00000012C905C30 __queue_lock 18 50.00% vmlinux 0xA00000010029AB56 cmpxchg4.acq [spinlock] @ generic_unplug_d.. 0xE00000012C905C30 __queue_lock 5 40.00% scsi 0xA00000021E2ACF06 cmpxchg4.acq [spinlock] @ scsi_device_unbusy 0xE00000012C905C30 __queue_lock 28 28.57% vmlinux 0xA000000100009106 ld4 ia64_spinlock_contention 0xE00000012C905C30 __queue_lock 13 23.08% vmlinux 0xA00000010029B2B6 cmpxchg4.acq [spinlock] @ __make_request 0xE00000012C905C30 __queue_lock 16 56.25% scsi 0xA00000021E2A9336 cmpxchg4.acq [spinlock] @ scsi_end_request 0xE00000012C905C30 __queue_lock 5 20.00% scsi 0xA00000021E2AF806 cmpxchg4.acq [spinlock] @ scsi_request_fn 0xE00000012C905C30 __queue_lock 3 33.33% scsi 0xA00000021E2AFA36 cmpxchg4.acq [spinlock] @ scsi_request_fn 0xE00000012C905C30 __queue_lock

1 0.00% vmlinux 0xA00000010029C000 ld8 __make_request 0xE00000012C905C38 queue_lock* 124 33.06% scsi 0xA00000021E2ACEE6 ld8 scsi_device_unbusy 0xE00000012C905C38 queue_lock* 2 50.00% vmlinux 0xA000000100298EE0 ld8 blk_run_queue 0xE00000012C905C38 queue_lock* 5 60.00% vmlinux 0xA000000100299006 ld8 blk_run_queue 0xE00000012C905C38 queue_lock* 10 30.00% scsi 0xA00000021E2AB540 ld8 scsi_end_request 0xE00000012C905C38 queue_lock* 1 0.00% scsi 0xA00000021E2AB5C6 ld8 scsi_end_request 0xE00000012C905C38 queue_lock* 11 45.45% scsi 0xA00000021E2AF7C6 ld8 scsi_request_fn 0xE00000012C905C38 queue_lock* 1 100.00% scsi 0xA00000021E2AF9B6 ld8 scsi_request_fn 0xE00000012C905C38 queue_lock* 1 0.00% scsi 0xA00000021E2AF9F6 ld8 scsi_request_fn 0xE00000012C905C38 queue_lock*

Figure 11: Cache line with request_queue struct

samples remote module inst. address inst. function data address data

119 56.30% vmlinux 0xA000000100299F06 ld8 blk_queue_start_tag 0xE000000128E8BC08 tag_map

128 54.69% vmlinux 0xA000000100299F00 ld4 blk_queue_start_tag 0xE000000128E8BC24 max_depth

113 28.32% vmlinux 0xA000000100291C26 ld4 blk_queue_end_tag 0xE000000128E8BC28 real_max_depth

122 53.28% vmlinux 0xA0000001002B30E0 ld8 find_next_zero_bit 0xE000000128E8BC40 tag_map 96 12.50% vmlinux 0xA000000100291C66 ld4 blk_queue_end_tag 0xE000000128E8BC40 tag_map 16 50.00% vmlinux 0xA000000100299F76 cmpxchg4.acq blk_queue_start_tag 0xE000000128E8BC40 tag_map

Figure 12: Cache line with blk_queue_tag struct

Improved code sequences reducing remote cache-to-cache transfers, similar to the optimization targeting idle_cpu() calls or reducing the number of rq cache lines referenced in a remote wakeup, are beneficial and need to be pursued further. Proper alignment of structures also reduces cache-to-cache transfers, as illustrated in the blk_queue_tag structure. Opportunities exist in ordering structure fields to increase cache line sharing and reduce contention. Examples include separating read only data from contended read / write fields, and careful placement or isolation of frequently referenced spinlocks. In the case of scsi_host and, to a lesser extent, kioctx, several atomic semaphores placed on the same cache line quickly increase contention for a structure.

Data ordering also extends to good use of the __read_mostly attribute for global variables, which also has a significant impact. In the 2.6.9 kernel, false sharing occurred between inode_lock and a hot read-only global on the same cache line, ranking it number 1 in the top 50. In the 2.6.20 kernel, the hot read-only global was given the __read_mostly attribute, and the cache line with inode_lock dropped out of the top 50 cache lines. The combination of the two new cache lines contributed 45% less latency with the globals separated.

Operations that require a lock to be held for a single frequently-occurring operation, such as a timer delete from a tvec_base, can be batched so that many operations can be completed with a single lock reference. In addition, iterating across all nodes for per-node operations may provide an opportunity to prefetch locks and data ahead for the next node to be processed, hiding the latency of expensive memory references. This technique may be utilized to improve the freeing of slab objects.
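The __read_mostly separation described above looks like the following sketch. Outside a kernel tree, the stub macro stands in for the definition in <linux/cache.h>, and hot_ro_global is a hypothetical stand-in for the unnamed global:

    #define __read_mostly __attribute__((__section__(".data.read_mostly")))

    typedef struct { volatile unsigned int slock; } spinlock_t;

    spinlock_t inode_lock;            /* hot, frequently written       */
    int hot_ro_global __read_mostly;  /* hot, read-only: placed in a
                                         separate, rarely written
                                         section, off the cache line
                                         holding inode_lock            */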

samples remote module inst. address inst. function data address data

852 50.23% scsi 0xA00000021E29DA80 ld8 __scsi_get_command 0xE0000010260C8020 cmd_pool 2 100.00% scsi 0xA00000021E29DF10 ld8 __scsi_put_command 0xE0000010260C8020 cmd_pool

292 69.86% scsi 0xA00000021E29C536 cmpxchg4.acq [spinlock] @ scsi_put_command 0xE0000010260C8028 free_list_lock

3 0.00% scsi 0xA00000021E29DE60 ld8 __scsi_put_command 0xE0000010260C8030 free_list.next

5 60.00% scsi 0xA00000021E2AA0F6 ld8 scsi_run_queue 0xE0000010260C8040 starved_list.next

292 40.75% qla 0xA00000021E4063E6 cmpxchg4.acq [spinlock] @ qla2x00_queuecom.. 0xE0000010260C8050 default_lock 41 58.54% scsi 0xA00000021E29C416 cmpxchg4.acq [spinlock] @ scsi_dispatch_cmd 0xE0000010260C8050 default_lock 98 50.00% vmlinux 0xA000000100009106 ld4 ia64_spinlock_contention 0xE0000010260C8050 default_lock 14 50.00% vmlinux 0xA000000100009126 cmpxchg4.acq ia64_spinlock_contention 0xE0000010260C8050 default_lock 95 49.47% scsi 0xA00000021E2AF456 cmpxchg4.acq [spinlock] @ scsi_request_fn 0xE0000010260C8050 default_lock 97 67.01% scsi 0xA00000021E2A95B6 cmpxchg4.acq [spinlock] @ scsi_device_unbu.. 0xE0000010260C8050 default_lock 42 78.57% scsi 0xA00000021E2A9736 cmpxchg4.acq [spinlock] @ scsi_run_queue 0xE0000010260C8050 default_lock

27 37.04% qla 0xA00000021E4062B6 ld8 qla2x00_queuecommand 0xE0000010260C8058 host_lock* 197 57.36% qla 0xA00000021E4063A6 ld8 qla2x00_queuecommand 0xE0000010260C8058 host_lock* 41 48.78% scsi 0xA00000021E29E3A0 ld8 scsi_dispatch_cmd 0xE0000010260C8058 host_lock* 3 66.67% scsi 0xA00000021E29E526 ld8 scsi_dispatch_cmd 0xE0000010260C8058 host_lock* 355 65.63% scsi 0xA00000021E2ACDB0 ld8 scsi_device_unbusy 0xE0000010260C8058 host_lock* 14 50.00% scsi 0xA00000021E2ACEA6 ld8 scsi_device_unbusy 0xE0000010260C8058 host_lock* 123 55.28% scsi 0xA00000021E2AF436 ld8 scsi_request_fn 0xE0000010260C8058 host_lock* 34 50.00% scsi 0xA00000021E2AF746 ld8 scsi_request_fn 0xE0000010260C8058 host_lock* 38 76.32% scsi 0xA00000021E2A9E80 ld8 scsi_run_queue 0xE0000010260C8058 host_lock*

Figure 13: Cache line with scsi_host struct

samples remote module inst. address inst. function data address data

26 38.46% scsi 0xA00000021E29EB10 ld8 scsi_put_command 0xE00000012464C000 host (scsi_host*) 4 25.00% scsi 0xA00000021E29ED10 ld8 scsi_get_command 0xE00000012464C000 host (scsi_host*) 11 45.45% scsi 0xA00000021E2ACD90 ld8 scsi_device_unbusy 0xE00000012464C000 host (scsi_host*) 22 59.09% scsi 0xA00000021E29CDE0 ld8 scsi_finish_command 0xE00000012464C000 host (scsi_host*) 90 54.44% scsi 0xA00000021E2AF0F0 ld8 scsi_request_fn 0xE00000012464C000 host (scsi_host*) 1 100.00% scsi 0xA00000021E2A9B96 ld8 scsi_run_queue 0xE00000012464C000 host (scsi_host*) 2 100.00% scsi 0xA00000021E29DFB0 ld8 scsi_dispatch_cmd 0xE00000012464C000 host (scsi_host*)

2 0.00% scsi 0xA00000021E2AAF10 ld8 scsi_next_command 0xE00000012464C008 request_queue (request_queue*) 1 0.00% scsi 0xA00000021E2ACF26 ld8 scsi_device_unbusy 0xE00000012464C008 request_queue (request_queue*) 1 100.00% scsi 0xA00000021E2AB826 ld8 scsi_io_completion 0xE00000012464C008 request_queue (request_queue*)

6 33.33% scsi 0xA00000021E29C2F6 cmpxchg4.acq [spinlock] @ scsi_put_command 0xE00000012464C034 list_lock 21 52.38% scsi 0xA00000021E29C3B6 cmpxchg4.acq [spinlock] @ scsi_get_command 0xE00000012464C034 list_lock

2 0.00% scsi 0xA00000021E2AF686 ld8 scsi_request_fn 0xE00000012464C048 starved_entry.next

92 35.87% qla 0xA00000021E420D60 ld4 qla2x00_process_completed_req... 0xE00000012464C060 queue_depth

2 100.00% qla 0xA00000021E41EDF0 ld4 qla2x00_start_scsi 0xE00000012464C074 lun

Figure 14: First cache line of scsi_device struct

Through a combination of improvements to the existing kernel sources, new feature development targeting the reduction of remote cache-to-cache transfers, and improvements in hardware and software capabilities to identify and characterize cache line contention, we can ensure improved scalability on future platforms.

6 Acknowledgements

Special thanks to Chinang Ma, Doug Nelson, Hubert Nueckel, and Peter Wang for measurement and analysis contributions to this paper. We would also like to thank Collin Terrell for draft reviews.

References

[1] MSI-X ECN Against PCI Conventional 2.3. http://www.pcisig.com/specifications/conventional/.

[2] Intel Corporation. Intel VTune Performance Analyzer for Linux – dsep and sfdump5. http://www.intel.com/cd/software/products/asmo-na/eng/vtune/239144.htm.

samples remote module inst. address inst. function data address data

1 100.00% sd 0xA00000021E247D80 ld4 sd_init_command 0xE00000012B304880 sector_size

1 0.00% qla 0xA00000021E406100 ld8 qla2x00_queuecommand 0xE00000012B304888 hostdata (void*)

67 50.75% sd 0xA00000021E2476E6 ld8 sd_init_command 0xE00000012B3048C8 changed 7 71.43% scsi 0xA00000021E2A9B90 ld8 scsi_run_queue 0xE00000012B3048C8 changed

13 46.15% scsi 0xA00000021E29E260 fetchadd4.rel scsi_dispatch_cmd 0xE00000012B3048D8 iorequest_cnt

104 27.88% scsi 0xA00000021E29D2B0 fetchadd4.rel __scsi_done 0xE00000012B3048DC iodone_cnt

129 48.06% sd 0xA00000021E247566 ld4 sd_init_command 0xE00000012B3048E4 timeout

Figure 15: Second cache line of scsi_device struct


[3] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manuals. http://developer.intel.com/products/processor/manuals/index.htm.

[4] Intel Corporation. Intel Itanium 2 Architecture Software Developer's Manuals. http://developer.intel.com/design/itanium2/documentation.htm.

[5] Hewlett-Packard Laboratories. q-tools project. http://www.hpl.hp.com/research/linux/q-tools/.

[6] Ingo Molnar. Syslets, generic asynchronous system call support. http://redhat.com/~mingo/syslet-patches/.

[7] Easwaran Raman, Robert Hundt, and Sandya Mannarswamy. Structure Layout Optimization for Multithreaded Programs. In Proceedings of the 2007 International Symposium on Code Generation and Optimization, 2007.

7 Legal

Intel and Itanium are registered trademarks of Intel Corporation. Oracle and Oracle Database 10g Release 2 are registered trademarks of Oracle Corporation. Linux is a registered trademark of Linus Torvalds.

All other trademarks mentioned herein are the property of their respective owners.

Kdump: Smarter, Easier, Trustier

Vivek Goyal, IBM, [email protected]
Neil Horman, Red Hat, Inc., [email protected]
Ken'ichi Ohmichi, NEC Soft, [email protected]
Maneesh Soni, IBM, [email protected]
Ankita Garg, IBM, [email protected]

Abstract

Kdump, a kexec based kernel crash dumping mechanism, has witnessed a lot of significant new development activity recently. Features like the relocatable kernel, dump filtering, and initrd based dumping have enhanced kdump capabilities and are important steps towards making it a more reliable and easy to use solution. New tools like the Linux Kernel Dump Test Module (LKDTM) provide an opportunity to automate kdump testing. It can be especially useful for distributions to detect and weed out regressions relatively quickly. This paper presents various newly introduced features and provides implementation details wherever appropriate. It also briefly discusses future directions, such as early boot crash dumping.

1 Introduction

Kdump is a kernel crash dumping mechanism where a pre-loaded kernel is booted in to capture the crash dump after a system crash [12]. This pre-loaded kernel, often called the capture kernel, runs from a different physical memory area than the production kernel, or regular kernel. As of today, a capture kernel is specifically compiled and linked for a specific memory location, and is shipped as an extra kernel to capture the dump. A relocatable kernel implementation gets rid of the requirement to run the kernel from the address it has been compiled for; instead, one can load the kernel at a different address and run it from there. Effectively, distributions and kdump users don't have to build an extra kernel to capture the dump, enhancing ease of use. Section 2 provides the details of the relocatable kernel implementation.

Modern machines are being shipped with bigger and bigger RAMs, and a high end configuration can possess a tera-byte of RAM. Capturing the contents of the entire RAM would result in a proportionately large core file, and managing a tera-byte file can be difficult. One does not need the contents of the entire RAM to debug a kernel problem, and many pages, like userspace pages, can be filtered out. Now an open source userspace utility is available for dump filtering, and Section 3 discusses the working and internals of the utility.

Currently, a kernel crash dump is captured with the help of init scripts in userspace in the capture kernel. This approach has some drawbacks. Firstly, it assumes that the root filesystem did not get corrupted and is still mountable in the second kernel. Secondly, minimal work should be done in the second kernel, and one should not have to run various user space init scripts. This led to the idea of building a custom initial ram disk (initrd) to capture the dump and improve the reliability of the operation. Various implementation details of initrd based dumping are presented in Section 4.

Section 5 discusses the Linux Kernel Dump Test Module (LKDTM), a kernel module which allows one to set up and trigger crashes from various kernel code paths at run time. It can be used to automate the kdump testing procedure to identify bugs and eliminate regressions with less effort. This paper also gives a brief update on device driver hardening efforts in Section 6, and concludes with future work in Section 7.

2 Relocatable bzImage

Generally, the Linux® kernel is compiled for a fixed address and it runs from that address. Traditionally, for the i386 and x86_64 architectures, it has been compiled and run from the 1MB physical memory location. Later, Eric W. Biederman introduced a config option,


Later, Eric W. Biederman introduced a config option, CONFIG_PHYSICAL_START, which allowed a kernel to be compiled for a different address. This option effectively shifted the kernel in the virtual address space, and one could compile and run the kernel from a physical address of, say, 16MB. Kdump used this feature and built a separate kernel for capturing the dump. This kernel was specifically built to run from a reserved memory location.

The requirement of building an extra kernel for dump capturing has not gone over too well with distributions, as they end up shipping an extra kernel binary. Apart from the disk space requirements, it also led to increased effort in terms of supporting and testing this extra kernel. Also, from a user's perspective, building an extra kernel is cumbersome.

The solution to the above problem is a relocatable kernel, where the same kernel binary can be loaded and run from a different address than the one it has been compiled for. Jan Kratochvil had posted a set of prototype patches to kick off the discussion on the fastboot mailing list [6]. Later, Eric W. Biederman came up with another set of patches and posted them to LKML for review [8]. Finally, Vivek Goyal picked up Eric's patches, cleaned them up, fixed a number of bugs, incorporated various review comments, and went through multiple rounds of reposting to LKML for inclusion into the mainline kernel. A relocatable kernel implementation is very architecture dependent; support for the i386 architecture has been merged with version 2.6.20 of the mainline kernel. Patches for x86_64 have been posted on LKML [11] and are now part of -mm. Hopefully, these will be merged with the mainline kernel soon.

2.1 Design Approaches

The following design approaches have been discussed for the relocatable bzImage implementation.

• Modify kernel text/data mapping at run time: At run time, the kernel determines where it has been loaded by the boot-loader, and it updates its page tables to reflect the right mapping between kernel virtual and physical addresses for kernel text and data. This approach has been adopted by the x86_64 implementation.

• Relocate using relocation information: This approach forces the linker to generate relocation information. These relocations are processed and packed into the bzImage. The uncompressed kernel code decompresses the kernel, performs the relocations, and transfers control to the protected mode kernel. This approach has been adopted by the i386 implementation.

2.2 Design Details (i386)

In i386, kernel text and data are part of the linearly mapped region, which has hard-coded assumptions about the virtual to physical address mapping. Hence, it is probably difficult to adopt the page-table modification approach for implementing a relocatable kernel. Instead, a simpler, non-intrusive approach is to ask the linker to generate relocation information, pack this relocation information into the bzImage, and have the uncompressed kernel code process these relocations before jumping to the 32-bit kernel entry point (startup_32()).

2.2.1 Relocation Information Generation

Relocation information can be generated in many ways. The initial experiment was to compile the kernel as a shared object file (-shared), which generated the relocation entries. Eric had posted patches for this approach [9], but it was found that the linker also generated relocation entries for absolute symbols (for some historical reason) [1]. By definition, absolute symbols are not to be relocated, but with this approach absolute symbols also ended up being relocated. Hence, this method did not prove to be a viable one.

Later, a different approach was taken, where the i386 kernel is built with the linker option --emit-relocs. This option builds an executable vmlinux and still retains the relocation information in the fully linked executable. This increases the size of vmlinux by around 10%, though this information is discarded at runtime. The kernel build process goes through these relocation entries and filters out PC-relative relocations, as these don't have to be adjusted if the bzImage is loaded at a different physical address. It also filters out relocations generated with respect to absolute symbols, because absolute symbols don't have to be relocated. The rest of the relocation offsets are packed into the compressed vmlinux. Figure 1 depicts the new i386 bzImage build process.
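The filtering pass can be pictured with a short sketch. The following is illustrative only, loosely modeled on what a relocs-style build tool might do under the rules above (drop PC-relative entries, drop absolute symbols, emit everything else); it is not the kernel's actual relocs.c.

  #include <elf.h>
  #include <stdio.h>

  static int sym_is_absolute(const Elf32_Sym *sym)
  {
          /* symbols in SHN_ABS sections must not be relocated */
          return sym->st_shndx == SHN_ABS;
  }

  static void emit_offset(const Elf32_Rel *rel, const Elf32_Sym *symtab)
  {
          const Elf32_Sym *sym = &symtab[ELF32_R_SYM(rel->r_info)];

          switch (ELF32_R_TYPE(rel->r_info)) {
          case R_386_PC32:
                  /* PC-relative: unaffected by the load address, drop it */
                  break;
          case R_386_32:
                  if (!sym_is_absolute(sym))
                          /* this image offset must be patched at boot */
                          printf("%08lx\n", (unsigned long)rel->r_offset);
                  break;
          }
  }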

Figure 1: i386 bzImage build process (diagram: vmlinux is linked with --emit-relocs; the relocations are processed by relocs.c into vmlinux.relocs; vmlinux is stripped to vmlinux.bin and concatenated with vmlinux.relocs into vmlinux.bin.all, which is compressed to vmlinux.bin.gz and built into piggy.o; piggy.o is linked with misc.o and head.o, stripped, and the real-mode code is appended to form the bzImage)

2.2.2 In-place Decompression

The code for decompressing the kernel has been changed so that decompression can be done in-place. Now the kernel is not first decompressed to some other memory location and then merged and put at the final destination, and there are no hard-coded assumptions about the intermediate location [7]. This allows the kernel to be decompressed within the bounds of its uncompressed size, and it will not overwrite any other data. Figure 2 depicts the new bzImage decompression logic.

Figure 2: In-place bzImage decompression (diagram: the compressed bzImage is loaded by the boot-loader, usually at 1MB; the kernel is moved to a safe location (1) and then decompressed (2) at the location it will run from)

At the same time, the decompressor is compiled as position independent code (-fPIC) so that it is not bound to a physical location, and it can run from anywhere. This code has been carefully written to make sure that it runs even if no relocation processing is done.

2.2.3 Perform Relocations

After decompression, all the relocations are performed. The uncompressed code calculates the difference between the address for which the kernel was compiled and the address at which it is loaded. This difference is added to the locations specified by the relocation offsets, and control is transferred to the 32-bit kernel entry point.
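A minimal sketch of this step, under the assumption that the packed relocation table is a zero-terminated list of 32-bit image offsets (the helper name and table layout are illustrative, not the kernel's exact code):

  #include <stdint.h>

  static void apply_relocs(uint8_t *image, const uint32_t *relocs,
                           uint32_t compiled_addr, uint32_t loaded_addr)
  {
          uint32_t delta = loaded_addr - compiled_addr;

          if (delta == 0)
                  return;         /* loaded where compiled; nothing to fix */

          for (; *relocs != 0; relocs++) {
                  /* each entry locates a 32-bit absolute address
                   * inside the decompressed image */
                  uint32_t *site = (uint32_t *)(image + *relocs);
                  *site += delta;
          }
          /* control then jumps to startup_32() */
  }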

2.2.4 Kernel Config Options

Several new config options have been introduced for the relocatable kernel implementation. CONFIG_RELOCATABLE controls whether the resulting bzImage is relocatable or not. If this option is not set, no relocation information is generated in the vmlinux.

Generally, the bzImage decompresses itself to the address it has been compiled for (CONFIG_PHYSICAL_START) and runs from there. But if CONFIG_RELOCATABLE is set, it runs from the address it has been loaded at by the boot-loader and ignores the compile time address.

The CONFIG_PHYSICAL_ALIGN option allows a user to specify the alignment restriction on the physical address the kernel is running from.

2.3 Design Details (x86_64)

In x86_64, kernel text and data are not part of the linearly mapped region and are mapped in a separate 40MB virtual address range. Hence, one can easily remap the kernel text and data region depending on where the kernel is loaded in the physical address space.

The kernel decompression logic has been changed to do an in-place decompression. The changes are similar to those discussed for the i386 architecture.

2.3.1 Kernel Text and Data Mapping Modification

Normally, the kernel text/data virtual and physical addresses differ by an offset of __START_KERNEL_map (0xffffffff80000000UL). At run time, this offset will change if the kernel is not loaded at the same address it has been compiled for. The kernel's initial boot code determines where it is running from, and it updates its page tables accordingly. This shift in address is calculated at run time and is stored in the variable phys_base. Figure 3 depicts the various mappings.

Figure 3: x86_64 kernel text/data mapping update (diagram: the kernel image in the physical address space is mapped both into the linearly mapped region and into the kernel text/data region of the virtual address space; the difference between the kernel load address and the kernel compile address is phys_base)

2.3.2 __pa_symbol() and __pa() Changes

Given the fact that the kernel text/data mapping changes at run time, some __pa()-related macro definitions need to be modified.

As mentioned previously, the kernel determines the difference between the address it has been compiled for and the address it has been loaded at, and stores that shift in the variable phys_base.

Currently, __pa_symbol() is used to determine the physical address associated with a kernel text/data virtual address. Now this mapping is not fixed and can vary at run time. Hence, __pa_symbol() has been updated to take into account the offset phys_base while calculating the physical address associated with a kernel text/data area.

  #define __pa_symbol(x)                          \
  ({unsigned long v;                              \
    asm("" : "=r" (v) : "0" (x));                 \
    ((v - __START_KERNEL_map) + phys_base); })

__pa() should be used only for virtual addresses belonging to the linearly mapped region. Previously, this macro could map both the linearly mapped region and the kernel text/data region. It has now been updated to map only the kernel linearly mapped region, keeping in line with the rest of the architectures. As the linearly mapped region's mappings don't change with the kernel image location, __pa() does not have to handle the kernel load address shift (phys_base).

  #define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)

2.3.3 Kernel Config Options

This is similar to i386, except that there is no CONFIG_PHYSICAL_ALIGN option and the alignment is set to 2MB.

2.4 bzImage Protocol Extension

The bzImage protocol has been extended to communicate relocatable kernel information to the boot-loader. Two new fields, kernel_alignment and relocatable_kernel, have been added to the bzImage header. The first one specifies the physical address alignment requirement for the kernel, and the second one indicates whether this protected mode kernel is relocatable or not.

A boot-loader can look at the relocatable_kernel field and decide whether the protected mode component should be loaded at the hard-coded 1MB address, or whether it can be loaded at other addresses too. Kdump uses this feature, and the kexec boot-loader loads the relocatable bzImage at a non-1MB address, in a reserved memory area.
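As a rough illustration, a boot-loader might consult the two new fields along these lines. The 0x230/0x234 header offsets follow our reading of the x86 boot protocol and should be treated as assumptions, as should the helper itself:

  #include <stdint.h>

  struct new_header_fields {
          uint32_t kernel_alignment;    /* assumed header offset 0x230 */
          uint8_t  relocatable_kernel;  /* assumed header offset 0x234 */
  };

  static uint32_t pick_load_address(const struct new_header_fields *h,
                                    uint32_t preferred)
  {
          if (!h->relocatable_kernel)
                  return 0x100000;      /* legacy hard-coded 1MB */

          /* round the preferred address up to the required alignment */
          return (preferred + h->kernel_alignment - 1)
                 & ~(h->kernel_alignment - 1);
  }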

3 Dump Filtering

On modern machines with a large amount of memory, capturing the contents of the entire RAM could create a huge dump file, which might be in the range of terabytes. It is difficult to manage such a huge dump file, both while storing it locally and while sending it to a remote host for analysis. makedumpfile is a userspace utility to create a smaller dump file by dump filtering, by compressing the dump data, or both [5].

3.1 Filtering Options

Often, many pages like userspace pages, free pages, and cache pages might not be of interest to the engineer analyzing the crash dump. Hence, one can choose to filter out those pages. The following are the types of pages one can filter out:

• Pages filled with zero: makedumpfile distinguishes this page type by reading each page. These pages are not part of the dump file, but the analysis tool is returned zeros when accessing a filtered zero page.

• Cache pages: makedumpfile distinguishes this page type by reading the members flags and mapping in struct page. If both the PG_lru bit and the PG_swapcache bit of flags are on, and the PAGE_MAPPING_ANON bit of mapping is off, the page is considered to be a cache page.

• User process data pages: makedumpfile distinguishes this page type by reading the member mapping in struct page. If the PAGE_MAPPING_ANON bit of mapping is on, the page is considered to be a user process data page.

• Free pages: makedumpfile distinguishes this page type by reading the member free_area in struct zone. If the page is linked into the member free_area, the page is considered to be a free page.

3.2 Filtering Implementation Details

makedumpfile examines the various memory management related data structures in the core file to distinguish between page types. It uses the crashed kernel's vmlinux, compiled with debug information, to retrieve a variety of information like data structure sizes, member offsets, symbol addresses, and so on.

The memory management information depends on the Linux version, the architecture of the processor, and the memory model (FLATMEM, DISCONTIGMEM, SPARSEMEM). For example, the symbol name for struct pglist_data is node_data on linux-2.6.18 x86_64 DISCONTIGMEM, but it is pgdat_list on linux-2.6.18 ia64 DISCONTIGMEM. makedumpfile supports these varieties.

To begin with, makedumpfile infers the memory model used by the crashed kernel by searching for the symbol mem_map, mem_section, node_data, or pgdat_list in the binary file of the production kernel. If the symbol mem_map is present, the crashed kernel used the FLATMEM memory model; if mem_section is present, the crashed kernel used the SPARSEMEM memory model; and if node_data or pgdat_list is present, the crashed kernel used the DISCONTIGMEM memory model.

Later, it examines the struct page entry of each page frame and retrieves the members flags and mapping. The size of struct page and the member field offsets are extracted from the .debug_info section of the debug-compiled vmlinux of the production kernel. The various symbol virtual addresses are retrieved from the symbol table of the production kernel binary.

The organization of the struct page entry arrays depends on the memory model used by the kernel. For the FLATMEM model on linux-2.6.18 i386, makedumpfile determines the virtual address of the symbol mem_map from vmlinux. This address is translated into a file offset with the help of the /proc/vmcore ELF headers, and finally it reads the mem_map array at the calculated file offset from the core file. Page types under the other memory models are distinguished in a similar fashion.
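The classification rules above can be condensed into a small sketch. The struct and helpers below are hypothetical stand-ins, not makedumpfile's code; they simply mirror the bit tests described in Section 3.1:

  #include <stdbool.h>

  #define PAGE_MAPPING_ANON 1   /* low bit of page->mapping */

  struct page_info {            /* hypothetical flattened view of struct page */
          bool pg_lru;
          bool pg_swapcache;
          unsigned long mapping;
  };

  static bool is_cache_page(const struct page_info *p)
  {
          return p->pg_lru && p->pg_swapcache &&
                 !(p->mapping & PAGE_MAPPING_ANON);
  }

  static bool is_user_data_page(const struct page_info *p)
  {
          return (p->mapping & PAGE_MAPPING_ANON) != 0;
  }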

3.3 Dump File Compression

The dump file can be compressed using standard compression tools like gzip to generate a smaller footprint. The only drawback is that one will have to uncompress the whole file before starting the analysis. Alternatively, one can compress individual pages and decompress a particular page only when the analysis tool accesses it. This reduces the disk space requirement while analyzing a crash dump. makedumpfile allows for the creation of compressed dump files where compression is done on a per-page basis. diskdump has used this feature in the past, and makedumpfile has borrowed the idea [3].

linux-2.6.18, x86_64, Memory: 5GB

  Filtering option          Size reduction
  Pages filled with zero    76%
  Cache pages               16%
  User process data pages   1%
  Free pages                78%
  All the above types       97%

Table 1: Dump filtering on a system containing many free pages

linux-2.6.18, x86_64, Memory: 5GB

  Filtering option          Size reduction
  Pages filled with zero    3%
  Cache pages               91%
  User process data pages   1%
  Free pages                1%
  All the above types       94%

Table 2: Dump filtering on a system containing many cache pages
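A minimal sketch of the per-page scheme described in Section 3.3: each page is compressed independently, so the analysis tool can later decompress only the pages it touches. The record layout below is invented for illustration and is not the kdump-compressed format:

  #include <stdint.h>
  #include <stdio.h>
  #include <zlib.h>

  #define PAGE_SIZE 4096

  struct page_record {                  /* invented on-disk header */
          uint64_t pfn;                 /* page frame number */
          uint32_t compressed_len;
  };

  static int write_compressed_page(FILE *out, uint64_t pfn,
                                   const unsigned char page[PAGE_SIZE])
  {
          unsigned char buf[PAGE_SIZE * 2];
          uLongf len = sizeof(buf);
          struct page_record rec = { .pfn = pfn };

          if (compress(buf, &len, page, PAGE_SIZE) != Z_OK)
                  return -1;
          rec.compressed_len = (uint32_t)len;

          /* header first, then the compressed payload, so a reader
           * can seek from page to page without inflating them all */
          if (fwrite(&rec, sizeof(rec), 1, out) != 1)
                  return -1;
          if (fwrite(buf, 1, len, out) != len)
                  return -1;
          return 0;
  }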

3.4 Dump File Format

By default, makedumpfile creates a dump file in the kdump-compressed format. It is based on the diskdump file format with minor modifications. The crash utility can analyze the kdump-compressed format.

makedumpfile can also create a dump file in ELF format, which can be opened by both crash and gdb. The ELF format does not support compressed dump files.

3.5 Sample Dump Filtering Results

The dump file size depends on the production kernel's memory usage. Tables 1 and 2 show the dump file size reduction in two possible cases. In the first table, most of the production kernel's memory is free, as the dump was captured immediately after a system boot, and filtering out free pages is effective. In the second table, most of the memory is used as cache, as a huge file was being decompressed while the dump was captured, and filtering out cache pages is effective.

4 Kdump initramfs

In the early days of kdump, crash dump capturing was automated with the help of init scripts in userspace. This approach was simple and easy, but it assumed that the root filesystem was not corrupted during the system crash and could still be mounted safely in the second kernel. Another consideration is that one should not have to run various other init scripts before saving the dump. Other scripts unnecessarily consume precious kernel memory and can possibly lead to reduced reliability.

These limitations led to the idea of capturing the crash dump from early userspace (initramfs context). Saving the dump before even a single init script runs probably adds to the reliability of the operation, and precious memory is not consumed by unneeded init scripts. Also, one could specify a dump device other than the root partition, which is guaranteed to be safe.

A prototype implementation of initrd-based dumping was available in Fedora 6. This was a basic scheme implemented along the lines of the nash shell based standard boot initrd, and it had various drawbacks, like a bigger ramdisk size, a limited set of supported dump destination devices, and limited error handling capability because of the constrained scripting environment.

The above limitations triggered the redesign of the initrd-based dumping mechanism. The following sections provide the details of the new design and also highlight the shortcomings of the existing implementation, wherever appropriate.

4.1 Busybox-based Initial RAM Disk

The initial implementation of initrd-based dumping was roughly based on the initramfs files generated by the mkinitrd utility. The newer design uses Busybox [2] utilities to generate the kdump initramfs. The advantages of this scheme become evident in the following discussion.

4.1.1 Managing the initramfs Size

One of the primary design goals was to keep the initramfs as small as possible, for two reasons. First, one wants to reserve as little memory as possible for the crash dump kernel to boot. Second, out of the reserved memory, one wants to keep the free memory pool as large as possible, to be used by the kernel and drivers.

Initially it was considered to implement all of the required functionality for kdump in a statically linked binary, written in C. This binary would have been smaller than Busybox, as it would avoid the inclusion of unused Busybox bits. But the maintainability of this approach was a big concern, keeping in mind the vast array of functionality it had to support. The feature list included the ability to copy files to nfs mounts, to local disk drives, to local raw partitions, and to remote servers via ssh.

Upon a deeper look, the problem space resembled more and more that of an embedded system, which made the immediate solution to many of the constraints self-evident: Busybox [2].

The following are some of the utilities which are typically packed into the initial ramdisk and contribute to the size bloat of the initramfs:

• nash: a non-interactive shell-like environment

• The cp utility

• The scp and ssh utilities, if an scp remote target is selected

• The ifconfig utility

• The dmsetup and lvm utilities, for software raided and lvm file systems

Some of these utilities are already built statically. However, even if one required the utilities to be dynamically linked, various libraries have to be pulled in to satisfy dependencies, and the initramfs image size skyrockets. In the case of the earlier initramfs for kdump, depending on the configuration, the inclusion of ssh, scp, ifconfig, and cp required the inclusion of the following libraries: libacl.so.1, libz.so.1, libselinux.so.1, libnsl.so.1, libc.so.6, libcrypt.so.1, libattr.so.1, libgssapi_krb5.so.2, libdl.so.2, libkrb5.so.3, libsepol.so.1, libk5crypto.so.3, linux-gate.so.1, libcom_err.so.2, libresolv.so.2, libkrb5support.so, libcrypto.so.6, ld-linux.so.2, and libutil.so.1.

Given these required utilities and libraries, the initramfs was initially between 7MB and 11MB uncompressed, which seriously cut into the available heap presented to the kernel and the userspace applications which needed it during the dump recovery process.

Busybox immediately provided a remedy to many of the size issues. By using Busybox, the cp and ifconfig utilities were no longer needed, and with them went away the need for most of the additional libraries. With Busybox, our initramfs size was reduced to a range of 2MB to 10MB.

4.1.2 Enhanced Flexibility

A Busybox-based initramfs implementation vastly increased kdump system flexibility. Initially, the nash interpreter allowed us a very small degree of freedom in terms of how we could capture crash dumps. Given that nash is a non-interactive script interpreter with an extremely limited conditional handling infrastructure, we were forced in our initial implementation to determine, at initramfs creation time, exactly what our crash procedure would be. In the event of any malfunction, there was only one error handling path to choose, which was failing to capture the dump and rebooting the system.

Now, with Busybox, we are able to replace nash with any of the supported Busybox shells (msh was chosen, since it was the most bash-like shell that Fedora's Busybox build currently enables). This switch gave us several improvements right away, such as an increase in our ability to make decisions in the init script. Instead of a script that was in effect a strictly linear progression of commands, we now had the ability to create if-then-else conditionals and loops, as well as the ability to create and reference variables in the script. We could now write an actual shell script in the initramfs, which allowed us, among many other things, to recover from errors in a less drastic fashion. We now have the choice to do something other than simply reboot the system and lose the core file. Currently, the script tries to mount the root filesystem and continue the boot process, or drops to an interactive shell within the initramfs so that a system administrator can attempt to recover the dump manually.

In fact, the switch to Busybox gave us a good deal of additional features that we were able to capitalize on. Of specific note was the additional networking ability in Busybox. The built-in ifup/ifdown network interface framework allowed us to bring network interfaces up and down easily, while the built-in vconfig utility allowed us to support remote core capture over vlan interfaces. Furthermore, since we now had a truly scriptable interface, we were able to use the sysfs interface to enslave interfaces to one another, emulating the ifenslave utility, which in turn allows us to dump cores over bonded interfaces. Through the use of sysfs, we are also able to dynamically query the devices that are found at boot time and create device files for them on the fly, rather than having to anticipate them at initramfs creation time. Add to that the ability to use Busybox's findfs utility to identify local partitions by disk label or uuid, and we are able to dynamically determine our dump location at boot time without needing to undergo a kdump initramfs rebuild every time the local disk geometry changes.

4.2 Future Goals

In the recent past, our focus in terms of the kdump userspace implementation has been on moving to Busybox in an effort to incorporate and advance upon the functionality offered by previous dump capture utilities, while minimizing the size impact of the initramfs and promoting maintainability. Now that we have achieved these goals, at least in part, our next set of goals includes the following:

• Cleanup of initramfs generation: The generation of the initramfs has been an evolutionary process. The current initramfs generation script is a heavily modified version of its predecessor to support the use of Busybox. This script needs to be re-implemented to be more maintainable.

• Config file formalization: The configuration file syntax for kdump is currently very ad hoc, and does not easily support the expansion of configuration directives in any controlled manner. The configuration file syntax should be formalized.

• Multiple dump targets: Currently, the initramfs allows the configuration of one dump target, and a configurable failure action in the event the dump capture fails. Ideally, the configuration file should support the listing of several dump targets as alternatives in case of failures.

• Further memory reduction: While we have managed to reduce memory usage in the initramfs by a large amount, some configurations still require the use of large memory footprint binaries (most notably scp and ssh). Eventually, we hope to switch to using a smaller statically linked ssh client for remote core capture instead, to reduce the top end of our memory usage.

5 Linux Kernel Dump Test Module

Before adopting any dumping mechanism, it is important to ascertain that the solution performs reliably in most crash scenarios. To achieve this, one needs a tool which can be used to trigger crash dumps from various kernel code paths without patching and rebuilding the kernel. LKDTM (Linux Kernel Dump Test Module) is a dynamically loadable kernel module that can be used to force a system crash in various scenarios, and it helps in evaluating the reliability of a crash dumping solution. It has been merged with the mainline kernel and is available in kernel version 2.6.19.

LKDTM is based on LKDTT (Linux Kernel Dump Test Tool) [10], but has an entirely different design. LKDTT inserts the crash points statically, and one must patch and rebuild the kernel before it can be tested. On the other hand, LKDTM makes use of the jprobes infrastructure and allows crash points to be inserted dynamically.

5.1 LKDTM Design

LKDTM artificially induces system crashes at predefined locations and triggers dumps for correctness testing. The goal is to widen the coverage of the tests, to take into account the different conditions in which the system might crash, for example, the state of the hardware devices, system load, and context of execution.

LKDTM achieves the crash point insertion by using jumper probes (jprobes), the dynamic kernel instrumentation infrastructure of the Linux kernel. The module places a jprobe at the entry point of a critical function. When the kernel control flow reaches the function, as shown in Figure 4, the probe causes the registered helper function to be called before the actual function is executed. At the time of insertion, each crash point is associated with two attributes: the action to be triggered, and the number of times the crash point is to be hit before triggering the action (similar to LKDTT). In the helper function, if it is determined that the count associated with the crash point has been hit, the specified action is performed. The supported action types, referred to as Crash Types, are kernel panic, oops, exception, and stack overflow.

Figure 4: LKDTM functioning (flow diagram: lkdtm.ko is inserted and a jprobe is registered; when the jprobe is hit, the jprobe handler runs; if the countdown is not yet zero, control returns from the jprobe handler, otherwise the crash is triggered and the dump is captured)

jprobes was chosen over kprobes in order to ensure that the action is triggered in the same context as that of the critical function. In the case of kprobes, the helper function is executed in the int 3 trap context. Whereas, when a jprobe is hit, the underlying kprobes infrastructure points the saved instruction pointer to the jprobe's handler routine and returns from the int 3 trap (refer to Documentation/kprobes.txt for the workings of kprobes/jprobes). The helper routine is then executed in the same context as that of the critical function, thus preserving the kernel's execution mode.
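The mechanism can be illustrated with a bare-bones module that plants a jprobe on __do_IRQ and panics after a countdown, in the spirit of LKDTM. The details (the probed function's signature, the symbol_name field) follow the jprobes API of that kernel generation as we understand it, and should be treated as illustrative:

  #include <linux/kernel.h>
  #include <linux/module.h>
  #include <linux/kprobes.h>

  static int count = 10;                /* hits before the crash fires */

  /* must mirror the probed function's signature; __do_IRQ() took the
   * IRQ number in this kernel generation (an assumption) */
  static unsigned int jp_do_irq(unsigned int irq)
  {
          if (--count == 0)
                  panic("dump test: crash point hit in __do_IRQ");

          jprobe_return();              /* mandatory jprobe handler exit */
          return 0;                     /* never reached */
  }

  static struct jprobe test_probe = {
          .entry = (kprobe_opcode_t *)jp_do_irq,
          .kp.symbol_name = "__do_IRQ",
  };

  static int __init test_init(void)
  {
          return register_jprobe(&test_probe);
  }

  static void __exit test_exit(void)
  {
          unregister_jprobe(&test_probe);
  }

  module_init(test_init);
  module_exit(test_exit);
  MODULE_LICENSE("GPL");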

5.2 Types of Crash Points

The basic crash points supported by LKDTM are the same as those supported by LKDTT. They are as follows:

IRQ handling with IRQs disabled (INT_HARDWARE_ENTRY): The jprobe is placed at the head of the function __do_IRQ, which processes interrupts with IRQs disabled.

IRQ handling with IRQs enabled (INT_HW_IRQ_EN): The jprobe is placed at the head of the function handle_IRQ_event, which processes interrupts with IRQs enabled.

Tasklet with IRQs disabled (TASKLET): This crash point recreates crashes that occur when tasklets are being executed with interrupts disabled. The jprobe is placed at the function tasklet_action.

Block I/O (FS_DEVRW): This crash point crashes the system when the filesystem accesses the low-level block devices. It corresponds to the function ll_rw_block.

Swap-out (MEM_SWAPOUT): This crash point causes the system to crash while memory swapping is being performed.

Timer processing (TIMERADD): The jprobe is placed at the function hrtimer_start.

SCSI command (SCSI_DISPATCH_CMD): This crash point is placed in the SCSI dispatch command code.

IDE command (IDE_CORE_CP): This crash point brings down the system while handling I/O on IDE block devices.

New crash points can be added, if required, by making changes to the drivers/misc/lkdtm.c file.

5.3 Usage

The LKDTM module can be built by enabling the CONFIG_LKDTM config option under the Kernel hacking menu. It can then be inserted into the running kernel by providing the required command line arguments, as shown:

  # modprobe lkdtm cpoint_name=<> cpoint_type=<> [cpoint_count={>0}] [recur_count={>0}]

5.4 Advantages/disadvantages of LKDTM

LKDTT has kernel space and user space components. In order to make use of LKDTT, one has to apply the kernel patch and rebuild the kernel. Also, it makes use of the Generalised Kernel Hooks Interface (GKHI), which is not part of the mainline kernel. On the other hand, using LKDTM is extremely simple, and it is merged into mainline kernels. A crash point can be injected into a running kernel by simply inserting the kernel module. The only shortcoming of LKDTM is that a crash point cannot be placed in the middle of a function without changing the context of execution, unlike LKDTT.

5.5 Kdump Testing Automation

So far, kdump testing was done manually, which was a difficult and very time consuming process. Now it has been automated with the help of the LKDTM infrastructure and some scripts. These scripts set up a cron job which starts on a reboot and inserts either LKDTM or an elementary testing module called crasher. Upon a crash, a crash dump is automatically captured and saved to a pre-configured location. This is repeated for the various crash points supported by LKDTM. Later, these scripts also open the captured dump and do some basic sanity verification.

The tests can be started by simply executing the following from within the tarball directory:

  # ./setup
  # ./master run

Detailed instructions on the usage have been documented in the README file, which is part of the tarball.

LTP (Linux Test Project) seems to be the right place for such a testing automation framework. A patch has been posted to the LTP mailing list [4]. This should greatly help distributions in quickly identifying regressions with every new release.

6 Device Driver Hardening

Device driver initialization in a capture kernel continues to be a pain point. Various kinds of problems have been reported. A very common problem is pending messages/interrupts on the device from the previous kernel's context. Such an interrupt is delivered to the driver in the capture kernel's context, and the driver often crashes because of the state mismatch. A solution is based on the fact that the device should have a way to allow the driver to reset it. A reset should bring it to a known state, from where the driver can continue to initialize the device.

A PCI bus reset can probably be of help here, but it is uncertain how the PCI bus can be reset from software. There does not seem to be a generic way, but PowerPC firmware allows doing a software reset of the PCI buses, and the Extended Error Handling (EEH) infrastructure makes use of it. We are looking into using the EEH functionality to reset the devices while the capture kernel boots.

Even if there is a way to reset the device, device drivers might not want to reset it all the time, as resetting is generally time consuming. To resolve this issue, a new command line parameter, reset_devices, has been introduced. When this parameter is passed on the command line, it is an indication to the driver that it should first try to reset the underlying device and then go ahead with the rest of the initialization.

Some drivers like megaraid, mptfusion, ibmvscsi, and ibmveth reported issues and have been fixed. MPT tries to reset the device if it is not in an appropriate state, and megaraid sends a FLUSH/ABORT message to the device to flush all the commands sent from the previous kernel's context. More problems have been reported with the aacraid and cciss drivers, which are yet to be fixed.

7 Early Boot Crash Dumping

Currently, kdump does not work if the kernel crashes before the boot process is completed. For kdump to work, the kernel should be booted far enough that the new kernel has been pre-loaded. During the development phase, many developers run into issues where the kernel is not booting at all, and making crash dumps work in those scenarios would be useful.

One idea is to load the capture kernel from early userspace (initrd/initramfs), but that does not solve the problem entirely, because the kernel can crash earlier than that.

Another possibility is to use the kboot boot-loader to pre-load a dump capture kernel somewhere in memory and then launch the production kernel. The production kernel will jump to the already loaded capture kernel in case of a boot time crash. Figure 5 illustrates the above design.

Figure 5: Early boot crash dumping (diagram: the machine boots into the kboot boot-loader (1); kboot pre-loads the dump capture kernel (2); kboot loads and boots into the production kernel (3); if the production kernel crashes during boot, it jumps to the dump capture kernel (4))

8 Conclusions

Kdump has come a long way since the initial implementation was merged into the 2.6.13 kernel. Features like the relocatable kernel, dump filtering, and initrd-based dumping have made it an even more reliable and easy to use solution. Distributions are in the process of merging these features into upcoming releases for mass deployment.

The only problem area is the device driver initialization issues in the capture kernel. Currently, these issues are being fixed on a per-driver basis as they are reported. We need more help from device driver maintainers to fix the reported issues. We are exploring the idea of performing device resets using the EEH infrastructure on Power, and that should further improve the reliability of the kdump operation.

9 Legal Statement

Copyright © 2007 IBM.

Copyright © 2007 Red Hat, Inc.

Copyright © 2007 NEC Soft, Ltd.

This work represents the view of the authors and does not necessarily represent the view of IBM, Red Hat, or NEC Soft.

IBM, and the IBM logo, are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Fedora is a registered trademark of Red Hat in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

This document is provided "AS IS" with no express or implied warranties. Use the information in this document at your own risk.

References

[1] Absolute symbol relocation entries. http://lists.gnu.org/archive/html/bug-binutils/2006-07/msg00080.html.

[2] Busybox. http://www.busybox.net/.

[3] diskdump project. http://sourceforge.net/projects/lkdump.

[4] LTP kdump test suite patch tarball. http://marc.theaimsgroup.com/?l=ltp-list&m=117163521109921&w=2.

[5] makedumpfile project. https://sourceforge.net/projects/makedumpfile.

[6] Jan Kratochvil's prototype implementation. http://marc.info/?l=osdl-fastboot&m=111829071126481&w=2.

[7] Werner Almsberger. Booting Linux: The History and the Future. In Proceedings of the Linux Symposium, Ottawa, 2000.

[8] Eric W. Biederman. Initial prototype implementation. http://marc.info/?l=linux-kernel&m=115443019026302&w=2.

[9] Eric W. Biederman. Shared library based implementation. http://marc.info/?l=osdl-fastboot&m=112022378309030&w=2.

[10] Fernando Luis Vazquez Cao. Evaluating Linux Kernel Crash Dumping Mechanisms. In Proceedings of the Linux Symposium, Ottawa, 2005.

[11] Vivek Goyal. x86-64 relocatable bzImage patches. http://marc.info/?l=linux-kernel&m=117325343223898&w=2.

[12] Vivek Goyal. Kdump, A Kexec Based Kernel Crash Dumping Mechanism. In Proceedings of the Linux Symposium, Ottawa, 2005.

Using KVM to run Xen guests without Xen

Ryan A. Harper, IBM, [email protected]
Michael D. Day, IBM, [email protected]
Anthony N. Liguori, IBM, [email protected]

Abstract

The inclusion of the Kernel Virtual Machine (KVM) driver in Linux 2.6.20 has dramatically improved Linux's ability to act as a hypervisor. Previously, Linux was only capable of running UML guests and containers. KVM adds support for running unmodified x86 operating systems using hardware-based virtualization. The current KVM user community is limited by the availability of said hardware.

The Xen hypervisor, which can also run unmodified operating systems with hardware virtualization, introduced a paravirtual ABI for running modified operating systems such as Linux, Netware, FreeBSD, and OpenSolaris. This ABI not only provides a performance advantage over running unmodified guests, it does not require any hardware support, increasing the number of users that can utilize the technology. The modifications to the Linux kernel to support the Xen paravirtual ABI are included in some Linux distributions, but have not been merged into mainline Linux kernels.

This paper will describe the modifications to KVM required to support the Xen paravirtual ABI, and a new module, kvm-xen, implementing the requisite hypervisor-half of the ABI. Through these modifications, Linux can add to its list of supported guests all of the paravirtual operating systems currently supported by Xen. We will also evaluate the performance of a XenoLinux guest running under the Linux hypervisor, to compare Linux's ability to act as a hypervisor against a more traditional hypervisor such as Xen.

1 Background

The x86 platform today is evolving to better support virtualization. Extensions present in both AMD and Intel's latest generation of processors include support for simplifying CPU virtualization. Both AMD and Intel plan on providing future extensions to improve MMU and hardware virtualization.

Linux has also recently gained a set of x86 virtualization enhancements. For the 2.6.20 kernel release, the paravirt_ops interface and KVM driver were added. paravirt_ops provides a common infrastructure for hooking portions of the Linux kernel that would traditionally prevent virtualization. The KVM driver adds the ability for Linux to run unmodified guests, provided that the underlying processor supports either AMD or Intel's CPU virtualization extensions.

Currently, there are three consumers of the paravirt_ops interface: Lguest [Lguest], a "toy" hypervisor designed to provide an example implementation; VMI [VMI], which provides support for guests running under VMware; and XenoLinux [Xen], which provides support for guests running under the Xen hypervisor.

1.1 Linux as a Hypervisor

As of 2.6.20, Linux has the ability to virtualize itself, using either hardware assistance on newer processors or paravirtualization on existing and older processors.

It is now interesting to compare Linux to existing hypervisors such as VMware [VMware] and Xen [Xen]. Some important questions include:

• What level of security and isolation does Linux provide among virtual guests?

• What are the practical advantages and disadvantages of using Linux as a hypervisor?

• What are the performance implications versus a more traditional hypervisor design such as that of Xen?


To begin answering these questions and others, we use Linux's virtualization capabilities to run a XenoLinux kernel as a guest within Linux.

The Xen hypervisor is now in several Linux distributions and forms the basis of two commercial hypervisors. Therefore, the paravirtual kernel interface that Xen supports is a good proof point of completeness and performance. If Linux can support guests well using a proven interface such as Xen's, then Linux can be a practical hypervisor.

2 Virtualization and the x86 Platform

Starting in late 2005, the historical shortcomings of the x86 platform with regard to virtualization were remedied by Intel and AMD. It will take another several years before most x86 platforms in the field have hardware virtualization support. There are two classes of problems when virtualizing older x86 platforms:

1. Functional issues, including the existence of privileged instructions that are available to non-privileged code [Robin].

2. Performance issues that arise when multiple kernels are running on the platform.

Most of the work done by VMware and Xen, as well as the code in Linux, paravirt_ops, and Lguest, is focused on overcoming these x86 shortcomings.

Performance issues caused by virtualization, including TLB flushing, cache thrashing, and the need for additional layers of memory management, are being addressed in current and future versions of Intel and AMD processors.

2.1 Ring Compression and Trap-and-Emulate

Traditionally, a monitor mode and a guest mode have been required in the CPU to virtualize a platform [Popek]. This allows a hypervisor to know whenever one of its guests is executing an instruction that needs hypervisor intervention, since executing a privileged instruction in guest mode will cause a trap into monitor mode.

A second requirement for virtualization is that all instructions that potentially modify the guest isolation mechanisms on the platform must trap when executed in guest mode.

Both of these requirements are violated by Intel and AMD processors released prior to 2005. Nevertheless VMware, and later Xen, both successfully virtualized these older platforms by devising new methods and working with x86 privilege levels in an unanticipated way.

The x86 processor architecture has four privilege levels for segment descriptors and two for page descriptors. Privileged instructions are able to modify control registers or otherwise alter the protection and isolation characteristics of segments and pages on the platform. Privileged instructions must run at the highest protection level (ring 0). All x86 processors prior to the introduction of hardware virtualization fail to generate a trap on a number of instructions when issued outside of ring 0. These "non-virtualizable" instructions prevent the traditional trap-and-emulate virtualization method from functioning.

Ring compression is the technique of loading a guest kernel in a less-privileged protection zone, usually ring 3 (the least privileged) for user space and ring 1 for the guest kernel (or ring 3 for the x86_64 guest kernel). In this manner, every time the guest kernel executes a trappable privileged instruction, the processor enters the hypervisor (fulfilling the first requirement above, the ability to monitor the guest), giving the hypervisor full control to disallow or emulate the trapped instruction. Ring compression almost works to simulate a monitor mode on x86 hardware. It is not a complete solution because some x86 instructions that should cause a trap in certain situations do not.

2.2 Paravirtualization and Binary Translation

On x86 CPUs prior to 2005, there are some cases where non-privileged instructions can alter isolation mechanisms. These include pushing or popping flags or segment registers, and returning from interrupts: instructions that are safe when only one OS is running on the host, but not safe when multiple guests are running on the host. These instructions are unsafe because they do not necessarily cause a trap when executed by non-privileged code. Traditionally, hypervisors have required this ability to use traps as a monitoring mechanism.

2.1 Ring Compression and Trap-and-Emulate On x86 CPUs prior to 2005 there are some cases where non-privileged instructions can alter isolation Traditionally, a monitor and a guest mode have been re- mechanisms. These include pushing or popping flags quired in the CPU to virtualize a platform. [Popek] This or segment registers and returning from interrupts— allows a hypervisor to know whenever one of its guests instructions that are safe when only one OS is running is executing an instruction that needs hypervisor inter- on the host, but not safe when multiple guests are run- vention, since executing a privileged instruction in guest ning on the host. These instructions are unsafe because mode will cause a trap into monitor mode. they do not necessarily cause a trap when executed by non-privileged code. Traditionally, hypervisors have re- A second requirement for virtualization is that all in- quired this ability to use traps as a monitoring mecha- structions that potentially modify the guest isolation nism. 2007 Linux Symposium, Volume One • 181

VMware handles the problem of non-trapping privi- 3.1 Avoiding TLB Flushes leged instructions by scanning the binary image of the kernel, looking for specific instructions [Adams] and re- x86 processors have a translation-look-aside buffer placing them with instructions that either trap or call into (TLB) that caches recently accessed virtual address to the hypervisor. The process VMware uses to replace bi- physical address mappings. If a virtual address is ac- nary instructions is similar to that used by the Java just- cessed that is not present in the TLB, several hundred in-time bytecode compiler. cycles are required to determine the physical address of Xen handles this problem by modifying the OS kernel at the memory. compile time, so that the kernel never executes any non- virtualizable instructions in a manner that can violate Keeping the TLB hit rate as high as possible is neces- the isolation and security of the other guests. A Xen sary to achieve high performance. The TLB, like the kernel replaces sensitive instructions with register-based instruction cache, takes advantage of locality of refer- function calls into the Xen hypervisor. This technique is ence, or the tendency of kernel code to refer to the same also known as paravirtualization. memory addresses repeatedly. Locality of reference is much reduced when a host is 2.3 x86 Hardware Virtualization Support running more than one guest operating system. The practical effect of this is that the TLB hit rate is reduced Beginning in late 2005 when Intel started shipping Intel- significantly when x86 platforms are virtualized. VT extensions, x86 processors gained the ability to have both a host mode and a guest mode. These two modes This problem cannot be solved by simply increasing the co-exist with the four levels of segment privileges and size of the TLB. Certain instructions force the TLB to two levels of page privileges. AMD also provides pro- be flushed, including hanging the address of the page cessors with similar features with their AMD-V exten- table. A hypervisor would prefer to change the page sions. Intel-VT and AMD-V are similar in their major table address every time it switches from guest to host features and programming structure. mode, because doing so simplifies the transition code.

With Intel-VT and AMD-V, hardware ring compression and binary translation are obviated by the new hard- 3.2 Using a Memory Hole to Avoid TLB Flushes ware instructions. The first generation of Intel and AMD virtualization support remedied many of the functional Devising a safe method for the hypervisor and guest to shortcomings of x86 hardware, but not the most impor- share the same page tables is the best way to prevent an tant performance shortcomings. automatic TLB flush every time the execution context changes from guest to host. Xen creates a memory hole Second-generation virtualization support from each immediately above the Linux kernel and resides in that company addresses key performance issues by reduc- hole. To prevent the Linux kernel from having access to ing unnecessary TLB flushes and cache misses and by the memory where the hypervisor is resident, Xen uses providing multi-level memory management support di- segment limits on x86_32 hardware. The hypervisor is rectly in hardware. resident in mapped memory, but the Linux kernel cannot KVM provides a user-space interface for using AMD- gain access to hypervisor memory because it is beyond V and Intel-VT processor instructions. Interestingly, the limit of the segment selectors used by the kernel. A KVM does this as a loadable kernel module which turns general representation of this layout is shown in Figure Linux into a virtual machine monitor. 1.

This method allows both Xen and paravirtual Xen guests 3 Linux Techniques in to share the same page directory, and hence does not More Detail force a mandatory TLB flush every time the execu- tion context changes from guest to host. The memory To allow Linux to run paravirtual Xen guests, we use hole mechanism is formalized in the paravirt_ops techniques that are present in Xen, KVM, and Lguest. reserve_top_address function. 182 • Using KVM to run Xen guests without Xen

3.3 Hypercalls

Hypervisor A paravirtual Xen guest uses hypercalls to avoid caus- ing traps, because a hypercall can perform much better FIXADDR END than a trap. For example, instead of writing to a pte that is flagged as read-only, the paravirtualized kernel will pass the desired value of the pte to Xen via a hypercall. Xen will validate the new value of the pte, write to the pte, and return to the guest. (Other things happen be- FIXADDR START fore execution resumes at the guest, much like returning etext from an interrupt in Linux).

Linux Xen hypercalls are similiar to software interrupts. They pass parameters in registers. 32-bit unprivileged guests Guest Kernel in Xen, hypercalls are executed using an int in- struction. In 64-bit guests, they are executed using a 0xC0000000 PAGE OFFSET syscall instruction.

, 3.3.1 Handling Hypercalls # , , # ,, # , # # , , # , # , # , ## , # # Executing a hypercall transfers control to the hypervisor 0x00000000 running at ring 0 and using the hypervisor’s stack. Xen has information about the running guest stored in several Figure 1: Using segment limits on i386 hardware allows data structures. Most hypercalls are a matter of validat- the hypervisor to use the same page directory as each ing the requested operation and making the requested guest kernel. The guest kernel has a different segment changes. Unlike instruction traps, not a lot of intro- selector with a smaller limit. spection must occur. By working carefully with page ta- bles and processor caches, hypercalls can be much faster than traps. 3.2.1 x86_64 hardware and TLB Flushes

3.4 The Lguest Monitor The x86_64 processors initially removed segment limit checks, so avoiding TLB flushes in 64-bit works differ- ently from 32-bit hardware. Transitions from kernel to Lguest implements a virtual machine monitor for hypervisor occur via the syscall instruction, which is Linux that runs on x86 processors that do not have hard- available in 64-bit mode. syscall forces a change to ware support for virtualization. A userspace utility pre- segment selectors, dropping the privilege level from 1 to pares a guest image that includes an Lguest kernel and 0. a small chunk of code above the kernel to perform con- text switching from the host kernel to the guest kernel. There is no memory hole per se in 64-bit mode, because segment limits are not universally enforced. An unfortu- Lguest uses the paravirt_ops interface to install nate consequence is that a guest’s userspace must have trap handlers for the guest kernel that reflect back into a different page table from the guest’s kernel. There- Lguest switching code. To handle traps, Lguest fore, the TLB is flushed on a context switch from user switches to the host Linux kernel, which contains a to kernel mode, but not from guest (kernel) to host (hy- paravirt_ops implementation to handle guest hy- pervisor) mode. percalls. 2007 Linux Symposium, Volume One • 183

3.5 KVM Hypervisor

The AMD-V and Intel-VT instructions available in newer processors provide a mechanism to force the processor to trap on certain sensitive instructions even if the processor is running in privileged mode. The processor uses a special data area, known as the VMCB or VMCS on AMD and Intel respectively, to determine which instructions to trap and how the trap should be delivered. When a VMCB or VMCS is being used, the processor is considered to be in guest mode. KVM programs the VMCB or VMCS to deliver sensitive instruction traps back into host mode. KVM uses the information provided in the VMCB and VMCS to determine why the guest trapped, and emulates the instruction appropriately.

Since this technique does not require any modifications to the guest operating system, it can be used to run any x86 operating system that runs on a normal processor.

4 Our Work

Running a Xen guest on Linux without the Xen hypervisor present requires some basic capabilities. First, we need to be able to load a Xen guest into memory. Second, we have to initialize a Xen compatible start-of-day environment. Lastly, Linux needs to be able to switch between running the guest as well as to support handling the guest's page tables. For the purposes of this paper we are not addressing virtual IO, nor SMP; however, we do implement a simple console device. We discuss these features in Section 6. We expand on our choice of using QEMU as our mechanism to load and initialize the guest for running XenoLinux in Section 4.1. Using the Lguest switching mechanism and tracking the guest state is explained in Section 4.2. Section 4.3 describes using Linux's KVM infrastructure for virtual machine creation and shadowing a VM's page tables. The virtual console mechanism is explained in Section 4.4. Finally, we describe the limitations of our implementation in Section 4.5.

Figure 2: Linux host memory layout with KVM, kvm-xen, and a XenoLinux guest (diagram: the host kernel occupies the space above PAGE_OFFSET, 0xC0000000, with the kvm and kvm-xen modules; QEMU and the XenoLinux guest image live in the process address space below, with the guest image mapped just under TOP_ADDR minus 4MB)

4.1 QEMU and Machine Types

QEMU is an open source machine emulator which uses translation or virtualization to run operating systems or programs for various machine architectures. In addition to emulating CPU architectures, QEMU also emulates platform devices. QEMU combines a CPU and a set of devices together in a QEMU machine type. One example is the "pc" machine, which emulates the 32-bit x86 architecture and emulates a floppy controller, RTC, PIT, IOAPIC, PCI bus, VGA, serial, parallel, network, USB, and IDE disk devices.

A paravirtualized Xen guest can be treated as a new QEMU machine type. Specifically, it is a 32-bit CPU which only executes code in ring 1 and contains no devices. In addition to being able to define which devices are to be emulated on a QEMU machine type, we can also control the initial machine state. This control is quite useful, as Xen's start-of-day assumptions are not the same as a traditional 32-bit x86 platform. In the end, the QEMU Xen machine type can be characterized as an eccentric x86 platform that does not run code in ring 0, nor has any hardware besides the CPU.

Paravirtualized kernels, such as XenoLinux, are built specifically to be virtualized, allowing QEMU to use an accelerator, such as KVM. KVM currently relies on the presence of hardware virtualization to provide protection while virtualizing guest operating systems. Paravirtual kernels do not need hardware support to be virtualized, but they do require assistance in transitioning control between the host and the guest. Lguest provides a simple mechanism for the transitions, as we explain below.

4.2 Lguest Monitor and Hypercalls

Lguest is a simple x86 paravirtual hypervisor designed to exercise the paravirt_ops interface in Linux 2.6.20. No hypercalls are implemented within Lguest's hypervisor; instead, it will transfer control to the host to handle the requested work. This delegation of work ensures a very small and simple hypervisor. Paravirtual Xen guests rely on hypercalls to request that some work be done on their behalf.

For our initial implementation, we chose not to implement the Xen hypercalls in our hypervisor directly, but instead reflect the hypercalls to the QEMU Xen machine. Handling a Xen hypercall is fairly straightforward. When the guest issues a hypercall, we examine the register state to determine which hypercall was issued. If the hypercall is handled in-kernel (as some should be, for performance reasons), then we simply call the handler and return to the guest when done. If the hypercall handler is not implemented in-kernel, we transition control to the QEMU Xen machine. This transition is done by setting up specific PIO operations in the guest state and exiting kvm_run. The QEMU Xen machine will handle the hypercalls and resume running the guest.

4.3 KVM Back-end

The KVM infrastructure does more than provide an interface for using new hardware virtualization features. The KVM interface gives Linux a generic mechanism for constructing a VM using kernel resources. We needed to create a new KVM back-end which provides access to the infrastructure without requiring hardware support.

Our back-end for KVM is kvm-xen. As with other KVM back-ends such as kvm-intel and kvm-amd, kvm-xen is required to provide an implementation of the struct kvm_arch_ops. Many of the arch ops are designed to abstract the hardware virtualization implementation. This allows kvm-xen to provide its own method for getting, setting, and storing guest register state, as well as hooking on guest state changes to enforce protection with software, in situations where kvm-intel or kvm-amd would rely on hardware virtualization.

We chose to re-use Lguest's structures for tracking guest state. This choice was obvious after deciding to re-use Lguest's hypervisor for handling the transition from host to guest. Lguest's hypervisor represents the bare minimum needed in a virtual machine monitor.
For our initial work, we saw no compelling reason to write our own version of the switching code.

During our hardware setup, we map the hypervisor into the host virtual address space. We suffer the same restrictions on the availability of a specific virtual address range due to Lguest's assumption that on most machines the top 4 megabytes are unused. During VCPU creation, we allocate a struct lguest, reserve space for guest state, and allocate a trap page.

After the VCPU is created, KVM initializes its shadow MMU code. Using the vcpu_setup hook, which fires after the MMU is initialized, we set up the initial guest state. This setup involves building the guest GDT, IDT, TSS, and the segment registers.

When set_cr3 is invoked, the KVM MMU resets the shadow page table and calls into back-end specific code. In kvm-xen, we use this hook to ensure that the hypervisor pages are always mapped into the shadow page tables, and to ensure that the guest cannot modify those pages.

We use a modified version of Lguest's run_guest routine when supporting the kvm_run call. Lguest's run_guest will execute the guest code until it traps back to the host. Upon returning to the host, it decodes the trap information and decides how to proceed. kvm-xen follows the same model, but replaces Lguest-specific responses, such as replacing Lguest hypercalls with Xen hypercalls.

4.4 Virtual Console

The virtual console expected by a XenoLinux guest is a simple ring queue. Xen guests expect a reference to a page of memory shared between the guest and the host, and a Xen event channel number. The QEMU Xen machine type allocates a page of the guest's memory, selects an event channel, and initializes the XenoLinux start-of-day with these values.
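The host side of such a ring can be sketched as follows; the field names and sizes are illustrative, not Xen's exact console interface:

  #include <stdint.h>

  struct console_ring {                 /* illustrative layout */
          char     out[1024];           /* guest-to-host bytes */
          uint32_t out_cons, out_prod;  /* consumer/producer indices */
  };

  /* host side: drain whatever the guest has queued */
  static int console_drain(struct console_ring *ring,
                           void (*emit)(char c))
  {
          int n = 0;

          while (ring->out_cons != ring->out_prod) {
                  emit(ring->out[ring->out_cons % sizeof(ring->out)]);
                  ring->out_cons++;
                  n++;
          }
          /* a real implementation would now notify the guest's
           * event channel so it can reclaim the ring space */
          return n;
  }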
During the guest booting process, it will start writing console output to the shared page and will issue a hypercall to notify the host of a pending event. After control is transferred from the guest, the QEMU Xen machine examines which event was triggered. For console events, the QEMU Xen machine reads data from the ring queue on the shared page, increments the read index pointer, and notifies the guest that it received the message via the event channel bitmap, which generates an interrupt in the guest upon returning from the host. The data read from and written to the shared console page is connected to a QEMU character device. QEMU exposes the character devices to users in many different ways, including telnet, Unix socket, PTY, and TCP socket. In a similar manner, any writes to QEMU character devices will put data into the shared console page, increment the write index pointer, and notify the guest of the event.

4.5 Current Restrictions

The current implementation for running Xen guests on top of Linux via KVM supports 32-bit UP guests. We have not attempted to implement any of the infrastructure required for virtual IO beyond simple console support. Additionally, while using Lguest's hypervisor simplified our initial work, we do inherit the requirement that the top 4 megabytes of virtual address space be available on the host. We discuss virtual IO and SMP issues as future work in Section 6.

5 Xen vs. Linux as a hypervisor

One of the driving forces behind our work was to compare a more traditional microkernel-based hypervisor with a hypervisor based on a monolithic kernel. kvm-xen allows a XenoLinux guest to run with Linux as the hypervisor, allowing us to compare this environment to a XenoLinux guest running under the Xen hypervisor. For our evaluation, we chose three areas to focus on: security, manageability, and performance.

A popular metric to use when evaluating how secure a system can be is the size of the Trusted Computing Base We clearly see from Figure 4 that since the TCB of both 186 • Using KVM to run Xen guests without Xen kvm-xen and Xen include Linux, the TCB comparison of abstraction. While some researchers are proposing really reduces down to the size of the kvm-xen module new models to simplify virtualization [Sotomayor], we versus the size of the Xen hypervisor. In this case, we believe that applying existing management models to can see that the size of the Xen TCB is over an order virtualization is an effective way to address the problem. magnitude larger than kvm-xen. The general deployment model of Xen is rather com- plicated. It first requires deploying the Xen hypervi- 5.2 Driver Domains sor which must boot in place of the operating system. A special kernel is then required to boot the privileged While most x86 platforms do not contain IOMMUs, guest–Domain-0. There is no guarantee that device both Intel and AMD are working on integrating drivers will work under this new kernel, although the IOMMU functionality into their next generation plat- vast majority do. A number of key features of modern forms [VT-d]. If we assume that eventually, IOMMUs Linux kernels are also not available such as frequency will be common for the x86, one could argue that Xen scaling and software suspend. Additionally, regardless has an advantage since it could more easily support of whether any guests are running, Xen will reserve a driver domains. certain amount of memory for the hypervisor—typically around 64MB. 5.3 Guest security model kvm-xen, on the other hand, is considerably less intru- The Xen hypervisor provides no security model for re- sive. No changes are required to a Linux install when stricting guest operations. Instead, any management kvm-xen is not in use. A special host kernel is not functionality is simply restricted to root from the privi- needed and no memory is reserved. To deploy kvm-xen, leged domain. This simplistic model requires all man- one simply needs to load the appropriate kernel module. agement software to run as root and provides no way to Besides the obvious ease-of-use advantage of kvm-xen, restrict a user’s access to a particular set of guests. the fact that it requires no resources when not being kvm-xen, on the other hand, inherits the Linux user se- used means that it can be present on any Linux installa- curity model. Every kvm-xen guest appears as a process tion. There is no trade-off, other than some disk space, which means that it also is tied to a UID and GID. A to having kvm-xen installed. This lower barrier to en- major advantage of kvm-xen is that the supporting soft- try means that third parties can more easily depend on ware that is needed for each guest can be run with non- a Linux-based virtualization solution such as kvm-xen root privileges. Consider the recent vulnerability in Xen than a microkernel-based solution like Xen. related to the integrated VNC server [CVE]. This vul- Another benefit of kvm-xen is that is leverages the full nerability actually occurred in QEMU which is shared infrastructure of QEMU. QEMU provides an integrated between KVM and Xen. It was only a security issue in VNC server, a rich set of virtual disk formats, userspace Xen, though, as the VNC server runs as root. In KVM, virtual network, and many other features. 
It is consider- the integrated VNC server runs as a lesser user, giving ably easier to implement these features in QEMU since the VM access only to files on the host that are accessi- it is a single process. Every added piece of infrastruc- ble by its user. ture in Xen requires creating a complex communication protocol and dealing with all sorts of race conditions. Perhaps the most important characteristic of the process security model for virtualization is that it is well under- Under kvm-xen, every XenoLinux guest is a process. stood. A Linux administrator will not have to learn all This means that the standard tools for working with pro- that much to understand how to secure a system using cesses can be applied to kvm-xen guests. This greatly kvm-xen. This reduced learning curve will inherently simplifies management as eliminates the need to create result in a more secure deployment. and learn a whole new set of tools.

5.4 Tangibility 5.5 Performance

Virtualization has the potential to greatly complicate At this early stage in our work, we cannot definitively machine management, since it adds an additional layer answer the question of whether a XenoLinux guest un- 2007 Linux Symposium, Volume One • 187

Pro ject Cycles Based on these results, we can assume that kvm-xen xen-pv 154 should be able to at least perform as well as an SVM kvm-xen 1151 guest can today. We also know from many sources kvm-svm 2696 [XenSource] that SVM guests can perform rather well on many workloads, suggesting that kvm-xen should Figure 5: Hypercall latency also perform well on these workloads.

6 Future Work der kvm-xen will perform as well or better than running under the Xen hypervisor. We can, however, use our To take kvm-xen beyond our initial work, we must ad- work to attempt to determine whether the virtualization dress how to handle Xen’s virtual IO subsystem, SMP model that kvm-xen uses is fundamentally less perfor- capable guests, and hypervisor performance. mant than the model employed by Xen. Further, we can begin to determine how significant that theoretical per- 6.1 Xen Virtual IO formance difference would be.

The only fundamental difference between kvm-xen and A fully functional Xen virtual IO subsystem is com- the Xen hypervisor is the cost of a hypercall. With prised of several components. The XenoLinux kernel the appropriate amount of optimization, just about ev- includes a virtual disk and network driver built on top ery other characteristic can be made equivalent between of a virtual bus (Xenbus), an inter-domain page-sharing the two architectures. Hypercall performance is rather mechanism (grant tables), and a data persistence layer important in a virtualized environment as most of the (Xenstore). For kvm-xen to utilize the existing support privileged operations are replaced with hypercalls. in a XenoLinux kernel, we need to implement support for each of these elements. As we previously discussed, since the Xen Hypervisor is microkernel-based, the virtual address space it requires The Xenbus element is mostly contained within the can be reduced to a small enough amount that it can fit XenoLinux guest, not requiring significant work to be within the same address space as the guest. This means utilized by kvm-xen. Xenbus is driven by interaction that a hypercall consists of only a privilege transition. with Xenstore. As changes occur within the data tracked Due to the nature of x86 virtualization, this privilege by Xenstore, Xenbus triggers events within the Xeno- transition is much more expensive than a typical syscall, Linux kernel. At a minimum, kvm-xen needs to im- but is still relatively cheap. plement the device enumeration protocol in Xenstore so that XenoLinux guests have access to virtual disk and Since kvm-xen uses Linux as its hypervisor, it has to use network. a small monitor to trampoline hypercalls from a guest to the host. This is due to the fact that Linux cannot Xen’s grant-tables infrastructure is used for controlling be made to fit into the small virtual address space hole how one guest shares pages with other domains. As with that the guest provides. Trampolining the hypercalls in- Xen’s Domain 0, the QEMU Xen machine is also capa- volves changing the virtual address space and, subse- ble of accessing all of the guest’s memory, removing the quently, requires a TLB flush. While there has been a need to reproduce grant-table-like functionality. great deal of work done on the performance impact of Xenstore is a general-purpose, hierarchical data persis- this sort of transition [Wiggins], for the purposes of this tence layer. Its implementation relies on Linux notifier paper we will attempt to consider the worst-case sce- chains to trigger events with a XenoLinux kernel. kvm- nario. xen would rely on implementing a subset of Xenstore In the above table, we see that kvm-xen hypercalls are functionality in the QEMU Xen machine. considerably worse than Xen hypercalls. We also note though that kvm-xen hypercalls are actually better than 6.2 SMP Guests hypercalls when using SVM. Current SVM-capable pro- cessors require an address space change on every world Providing support for XenoLinux SMP guests will be switch so these results are not surprising. very difficult. As of this writing, KVM itself does not 188 • Using KVM to run Xen guests without Xen support SMP guests. In addition to requiring KVM [Robin] Robin, J. and Irvine, C. Analysis of Intel to become SMP capable, XenoLinux kernels rely on Pentium’s Ability to Support a Secure Virtual Machine the Xen hypervisor to keep all physical CPU Time Monitor. 
Proceedings of the 9th USENIX Security Stamp Counter (TSC) registers in relative synchroniza- Symposium, Denver, CO, August 2000. tion. Linux currently does not utilize TSCs in such a fashion using other more reliable time sources such as [Popek] Popek, G. and Goldberg, R. Formal ACPI PM timers. Requirements for Virtualizable Third Generation Architectures. Communications of the ACM, July 1974. 6.3 Hypervisor Performance [Adams] Adams, K. and Agesen, O. A Comparison of Xen guests that utilize shadow page tables benefit signif- Software and Hardware Techniques for x86 icantly from the fact that the shadow paging mechanism Virtualization. ASPLOS, 2006 is within the hypervisor itself. kvm-xen uses KVM’s MMU, which resides in the host kernel, and XenoLinux [Farris] Farris, J. Remote Attestation. University of guests running on kvm-xen would benefit greatly from Illinois at Urbana-Champaign. 6 Dec. 2005. moving the MMU into the hypervisor. Additionally, sig- [Qiang] Qiang, H. Security architecture of trusted nificant performance improvements would be expected virtual machine monitor for trusted computing. Wuhan from moving MMU and context-switch-related hyper- University Journal of Natural Science. Wuhan calls out of the QEMU Xen machine and into the hyper- University Journals Press. 15 May 2006. visor. [VT-d] Abramson, D.; Jackson, J.; Muthrasanallur, S.; 7 Conclusion Neiger, G.; Regnier, G.; Sankaran, R.; Schoinas, I.; Uhlig, R.; Vembu, B.; Wiegert, J. Intel R Virtualization Technology for Directed I/O. Intel Technology Journal. With kvm-xen we have demonstrated that is possible http://www.intel.com/technology/itj/ to run a XenoLinux guest with Linux as its hyper- 2006/v10i3/ (August 2006). visor. While the overall performance picture of run- ning XenoLinux guests is not complete, our initial re- [CVE] Xen QEMU Vnc Server Arbitrary Information sults indicate that kvm-xen can achieve adequate per- Disclosure Vunerability. CVE-2007-0998. 14 Mar. formance without using a dedicated microkernel-based 2007 hypervisor like Xen. There are still some significant http://www.securityfocus.com/bid/22967 challenges for kvm-xen—namely SMP guest support— though as KVM and the paravirt_ops interface in [Sotomayor] Sotomayor, B. (2007). A Resource Linux evolve, implementing SMP support will become Management Model for VM-Based Virtual Workspaces. easier. Unpublished masters thesis, University of Chicago, Chicago, Illinois, United States.

8 References [Wiggins] Wiggins, A., et al. Implementation of Fast Address-Space Switching and TLB Sharing on the [Lguest] Russell, Rusty. lguest (formerly lhype). StrongARM Processor. University of New South Wales. Ozlabs. 2007. 10 Apr. 2007 http://lguest.ozlabs.org. [XenSource] A Performance Comparision of Commercial Hypervisors. XenSource. 2007. http: [VMI] Amsden, Z. Paravirtualization API Version 2.5. //blogs.xensource.com/rogerk/wp-content/ VMware. 2006. uploads/2007/03/hypervisor_performance_ comparison_1_0_5_with_esx-data.pdf [Xen] Barham P., et al. Xen and the art of virtualization. In Proc. SOSP 2003. Bolton Landing, New York, U.S.A. Oct 19–22, 2003.

[VMware] http://www.vmware.com Djprobe—Kernel probing with the smallest overhead

Masami Hiramatsu Satoshi Oshima Hitachi, Ltd., Systems Development Lab. Hitachi, Ltd., Systems Development Lab. [email protected] [email protected]

Abstract system-support venders may like to know why their sys- tem crashed by salvaging traced data from the dumped Direct Jump Probe (djprobe) is an enhancement to memory image. In both cases, if the overhead due to kprobe, the existing facility that uses breakpoints to cre- probing is high, it will affect the result of the mea- ate probes anywhere in the kernel. Djprobe inserts jump surment and reduce the performance of applications. instructions instead of breakpoints, thus reducing the Therefore, it is preferable that the overhead of probing overhead of probing. Even though the kprobe “booster” becomes as small as possible. speeds up probing, there still is too much overhead due to probing to allow for the tracing of tens of thousands From our previous measurement [10] two years ago, the of events per second without affecting performance. processing time of kprobe was about 1.0 usec whereas Linux Kernel State Tracer (LKST) [2] was less than 0.1 This presentation will show how the djprobe is designed usec. From our previous study of LKST [9], about 3% to insert a jump, discuss the safety of insertion, and de- of overhead for tracing events was recorded. Therefore, scribe how the cross self-modification (and so on) is we decided our target for probing overhead should be checked. This presentation also provides details on how less than 0.1 usec. to use djprobe to speed up probing and shows the per- formance improvement of djprobe compared to kprobe 1.2 Previous Works and kprobe-booster. Figure 1 illustrates how the probing behaviors are dif- 1 Introduction ferent among kprobe, kprobe-booster and djprobe when a process hits a probe point. 1.1 Background As above stated, our goal of the processing time is less For the use of non-stop servers, we have to support than 0.1 usec. Thus we searched for improvements to a probing feature, because it is sometimes the only reduce the probing overhead so as it was as small as method to analyze problematic situations. possible. We focused on the probing process of kprobe, which causes exceptions twice when each probe hit. We Since version 2.6.9, Linux has kprobes as a very unique predicted that most of the overhead came from the ex- probing mechanism [7]. In kprobe ((a) in Figure 1), ceptions, and we could reduce it by using jumps instead an original instruction at probe point is copied to an of the exceptions. out-of-line buffer and a break-point instruction is put at the probe point. The break-point instruction triggers a We developed the kprobe-booster as shown in Fig- break-point exception and invokes pre_handler() ure 1(b). In this improvement, we attempted to replace of the kprobe from the break-point exception handler. the single-step exception with a jump instruction, be- After that, it executes the out-of-line buffer in single- cause it was easier than replacing a break-point. Thus, step mode. Then, it triggers a single-step exception and the first-half of processing of kprobe-booster is same as invokes post_handler(). Finally, it returns to the the kprobe, but it does not use a single-step exception instruction following the probe point. in the latter-half. This improvement has already been merged into upstream kernel since 2.6.17. This probing mechanism is useful. For example, system administrators may like to know why their system’s per- Last year, Ananth’s paper [7] unveiled efforts for im- formance is not very good under heavy load. Moreover, proving kprobes, so the probing overheads of kprobe

• 189 • 190 • Djprobe—Kernel probing with the smallest overhead

(a)Kprobe Program Text There are several difficulties to implement this concept. break point A jump instruction must occupy 5 bytes on i386, re- placement with a jump instruction changes the instruc- tion boundary, original instructions are executed on an-

Original Breakpoint Trap other place, and these are done on the running kernel. Instruction Handler Handler So. . . out of line buffer

pre_ post_ handler handler • Replacement of original instructions with a jump instruction must not block other threads. (b)Kprobe-booster Program Text

break point • Replacement of original instructions which are targeted by jumps must not cause unexpectable crashes.

jump Original Breakpoint instruction Instruction Handler • Some instructions such as an instruction with rel-

out of line buffer ative addressing mode can not be executed at out-

pre_ of-line buffer. handler • There must be at least one instruction following the (c)Djprobe Program Text replaced original instructions to allow for the re- jump turning from the probe. instruction • Cross code modification in SMP environment may

jump cause by Intel Erratum. Interrupt Original instruction Emulation code Instructions • Djprobe (and also kprobe-booster) does not sup- out-of-line buffer User port the post_handler. Handler

Figure 1: Kprobe, kprobe-booster and djprobe Obviously, some tricks are required to make this con- cept real. This paper discribes how djprobe solve these issues. and kprobe-booster became about 0.5 usec and about 0.3 usec respectively. Thus the kprobe-booster succeeded to 1.4 Terminology reduce the probing overhead by almost half. However, its performance is not enough for our target. Before discussing details of djprobe, we would like to introduce some useful terms. Figure 2 illustrates an ex- Thus, we started developing djprobe: Direct Jump ample of execution code in CISC architecture. The first Probe. instruction is a 2 byte instruction, second is also 2 bytes, and third is 3 bytes. 1.3 Concept and Issues In this paper, IA means Insertion Address, which speci- fies the address of a probe point. DCR means Detoured The basic idea of djprobe is simply to use a jump in- Code Region, which is a region from insertion address to struction instead of a break-point exception. In djprobe the end of a detoured code. The detoured code consists ((c) in Figure 1), a process which hits a probe point of the instructions which are partially or fully covered jumps to the out-of-line buffer, calls probing handler, by a jump instruction of djprobe. JTPR means Jump executes the “original instructions” on the out-of-line Target Prohibition Region, which is a 4 bytes (on i386) buffer directly, and jumps back to the instruction fol- length region, starts from the next address of IA. And, lowing the place where the original instructions existed. RA means Return Address, which points the instruction We will see the result of this improvement in Section 4. next to the DCR. 2007 Linux Symposium, Volume One • 191

IA-1 IA IA+1 IA+2 IA+3 IA+4 IA+5 IA+6 IA+7 (a)before probed Program Text DCR

insn1 insn2 insn3 ins1 ins2 ins3 jump instruction

DCR target

JTPR RA = IA+sizeof(DCR) (b)after probed DCR

jump ins1: 1st Instruction ins1 ins2 ins3 jump instruction instruction ins2: 2nd Instruction target ins3: 3rd Instruction IA: Insertion Address RA: Return Address Figure 3: Corruption by jump into JTPR JTPR: Jump Target Prohibition Region DCR: Detoured Code Region means there are 4 issues that should be checked stati- cally. See below. Static safety check must be done be- Figure 2: Terminology fore registering the probe and it is enough that it be done just once. Djprobe requires that a user must not insert 2 Solutions of the Issues a probe, if the probe point doesn’t pass safety checks. They must use kprobe instead of djprobe at the point. In this section, we discuss how we solve the issues of djprobe. The issues that are mentioned above can be 2.1.1 Jumping in JTPR Issue categorized as follows. Replacement with a jump instruction involves changing • Static-Analysis Issues instruction boundaries. Therefore, we have to ensure that no jump or call instructions in the kernel or kernel • Dynamic-Safety Issue modules target JTPR. For this safety check, we assume that other functions never jump into the code other than • Cross Self-modifying Issue the entry point of the function. This assumption is ba- sically true in gcc. An exception is setjmp()/longjmp(). • Functionality-Performance Tradeoff Issue Therefore, djprobe cannot put a probe where setjmp() is used. Based on this assumption, we can check whether The following section deeply discuss how to solve the JTPR is targeted or not by looking through within the issues of djprobe. function. This code analysis must be changed if the as- sumption is not met. Moreover, there is no effective way to check for assembler code currently. 2.1 Static Analysis Issues

First, we will discuss a safety check before the probe is 2.1.2 IP Relative Addressing Mode Issue inserted. Djprobe is an enhancement of kprobes and it based on implementation of Kprobes. Therefore, it in- If the original instructions in DCR include the IP (In- cludes every limitation of kprobes, which means djprobe struction Pointer, EIP in i386) relative addressing mode cannot probe where kprobes cannot. As figure 3 shows, instruction, it causes the problem because the original the DCR may include several instructions because the instruction is copied to out-of-line buffer and is executed size of jump instruction is more than one byte (relative directly. The effective address of IP relative addressing jump instruction size is 5 bytes in i386 architecture). In mode is determined by where the instruction is placed. addition, there are only a few choices of remedies at ex- Therefore, such instructions will require a correction of ecution time because “out-of-line execution” is done di- a relative address. The problem is that almost all rel- rectly (which means single step mode is not used). This ative jump instructions are “near jumps” which means 192 • Djprobe—Kernel probing with the smallest overhead destination must be within –128 to 127 bytes. How- Program Text break DCR ever, out-of-line buffer is always farther than 128 bytes. point Thus, the safety check disassembles the probe point and Execute Execute checks whether IP relative instruction is included. If the Detour Original jump Kprobe IP relative address is found, the djprobe can not be used. Instructions instruction

out-of-line buffer 2.1.3 Prohibit Instructions in JTPR Issue Figure 4: Bypass method There are some instructions that cannot be probed by djprobe. For example, a call instruction is prohibited. When a thread calls a function in JTPR, the address in the JTPR is pushed on the stack. Before the thread re- size is one byte. In other words, the break-point instruc- turns, if a probe is inserted at the point, ret instruction tion modifies only one instruction. Therefore, we de- triggers a corruption because instruction boundary has cided to use the “bypass method” for embedding a jump been changed. The safety check also disassembles the instruction. probe point and check whether prohibited instructions are included. Figure 4 illustrates how this bypass method works. 2.1.4 Function Boundary Issue

This method is similar to the highway construction. The Djprobe requires at least one instruction must follow highway department makes a bypass route which de- DCR. If DCR is beyond the end of the function, there is tours around the construction area, because the traffic no space left in the out-of-line buffer to jump back from. can not be stopped. Similarly, since the entire system This safety check can easily be done because what we also can not be stopped, djprobe generates an out-of- have to do is only to compare DCR bottom address and line code as a bypass from the original code, and uses function bottom address. a break-point of the kprobe to switch the execution ad- dress from the original address to the out-of-line code. 2.2 Dynamic-Safety Issue In addition, djprobe adds a jump instruction in the end of the out-of-line code to go back to RA. In this way, Next, we discuss the safety of modifying multiple in- other threads detour the region which is overwritten by structions when the kernel is running. The dynamic- a jump instruction while djprobe do it. safety issue is a kind of atomicity issue. We have to take care of interrupts, other threads, and other proces- sors, because we can not modify multiple instructions What we have to think of next is when the other threads atomically. This issue becomes more serious on the pre- execute from within the detoured region. In the case emptive kernel. of non-preemptive kernel, these threads might be inter- rupted within the DCR. The same issue occurs when we release the out-of-line buffers. Since some threads 2.2.1 Simultaneous Execution Issue might be running or be interrupted on the out-of-line code, we have to wait until those return from there. As Djprobe has to overwrite several instructions by a jump you know, in the case of the non-preemptive kernel, instruction, since i386 instruction set is CISC. Even if interrupted kernel threads never call scheduler. Thus, we write this jump atomically, there might be other to solve this issue, we decided to use the scheduler threads running on the middle of the instructions which synchronization. Since the synchronize_sched() will be overwritten by the jump. Thus, “atomic write” function waits until the scheduler is invoked on all pro- can not help us in this situation. In contrast, we can cessors, we can ensure all interrupts, which were oc- write the break-point instruction atomically because its curred before calling this function, finished. 2007 Linux Symposium, Volume One • 193

2.2.2 Simultaneous Execution Issue on Preemptive 2.4 Functionality-Performance Tradeoff Issue Kernel From Figure 1, djprobe (and kprobe-booster) does not call post_handler(). We thought that is a trade- This wait-on-synchronize_sched method is off between speed and the post_handler. Fortu- premised on the fact that the kernel is never pre- nately, the SystemTap [3], which we were assuming empted. In the preemptive kernel, we must use another as the main use of kprobe and djprobe, did not use function to wait the all threads sleep on the known post_handler. Thus, we decided to choose speed places, because some threads may be preempted on rather than the post_handler support. the DCR. We discussed this issue deeply and decided to use the freeze_processes() recommended 3 Design and Implementation by Ingo Molnar [6]. This function tries to freeze all active processes including all preempted threads. So, Djprobe was originally designed as a wrapper routine preempted threads wake up and run after they call the of kprobes. Recently, it was re-designed as a jump op- try_to_freeze() or the refrigerator(). timization functionality of kprobes.1 This section ex- Therefore, if the freeze_processe() succeeds, plains the latest design of djprobe on i386 architecture. all threads are sleeping on the refrigerator() function, which is a known place. 3.1 Data Structures

To integrate djprobe into kprobes, we introduce the 2.3 Cross Self-modifying Issue length field in the kprobe data structure to spec- ify the size of the DCR in bytes. We also introduce djprobe_instance data structure, which has three fields: kp, list, and stub. The kp field is a kprobe The last issue is related to a processor specific erratum. that is embedded in the djprobe_instance data The Intel R processor has an erratum about unsynchro- structure. The list is a list_head for registration nized cross-modifying code [4]. On SMP machine, if and unregistration. The stub is an arch_djprobe_ one processor modifies the code while another proces- stub data structure to hold a out-of-line buffer. sor pre-fetches unmodified version of the code, unpre- dictable General Protection Faults will occur. We sup- From the viewpoint of users, a djprobe_instance posed this might occur as a result of hitting a cache- looks like a special aggregator probe, which aggregates line boundary. On the i386 architecture, the instruc- several probes on the same probe point. This means that tions which are bigger than 2 bytes may be across the a user does not specify the djprobe_instance data cache-line boundary. These instructions will be pre- structure directly. Instead, the user sets a valid value fetched twice from 2nd cache. Since a break-point in- to the length field of a kprobe, and registers that. struction will change just a one byte, it is pre-fetched at Then, that kprobe is treated as an aggregated probe once. Other bigger instructions, like a long jump, will on a djprobe_instance. This allows you to use be across the cache-alignment and will cause an unex- djprobe transparently as a kprobe. Figure 5 illustrates pected fault. In this erratum, if the other processors issue these data structures. a serialization such as CPUID, the cache is serialized and the cross-modifying is safely done. 3.2 Static Code Analysis

Therefore, after writing the break-point, we do not write Djprobe requires the safety checks, that were discussed the whole of the jump code at once. Instead of that, we in Section 2.1, before calling register_kprobe(). write only the jump address next to the break-point. And Static code analysis tools, djprobe_static_code_ then we issue the CPUID on each processor by using analyzer, is available from djprobe development IPI (Inter Processor Interrupt). At this point, the cache site [5]. This tool also provides the length of DCR. of each processor is serialized. After that, we overwrite Static code analysis is done as follows. the break-point by a jump op-code whose size is just one 1For this reason, djprobe is also known as jump optimized byte. kprobe. 194 • Djprobe—Kernel probing with the smallest overhead

Djprobe defined object can be probed by djprobe, because direct execution of djprbe_instance those instructions sets IP to valid place in the kernel and there is no need to jump back. kprobe

list arch_djprobe_stub 3.2.2 Jump in JTPR Check Point out-of-line buffer list Next, djprobe_static_code_analyzer disassem- Link bles the probed function of the kernel (or the module) User defined probe by using objdump tool. The problem is the current

kprobe version of objdump cannot correctly disassemble if the

list BUG() macro is included in the function. In that case, it simply discards the output following the BUG() macro and retries to disassemble from right after the BUG(). Figure 5: The instance of djprobe This disassembly provides not only the boundaries in- formation in DCR but also the assembler code in the [37af1b] subprogram function. sibling [37af47] external yes name "register_kprobe" Then, it checks that all of jump or call instructions in decl_file 1 the function do not target JTPR. It returns 0, if it find an decl_line 874 instruction target JTPR. prototyped yes type [3726a0] low_pc 0xc0312c3e If the probe instruction is 5 bytes or more, it simply re- high_pc 0xc0312c46 turns the length of probed instruction, because there is frame_base 2 byte block [ 0] breg4 4 no boundary change in JTPR.

Figure 6: Example of debuginfo output by eu-readelf 3.2.3 Prohibited Relative Addressing Mode and In- struction Check

3.2.1 Function Bottom Check djprobe_static_code_analyzer checks that DCR does not include a relative jump instruction or prohibited djprobe_static_code_analyzer requires a instructions. debuginfo file for the probe target kernel or module. It is provided by the kernel compilation option if you use vanilla kernel. Or it is provided as debuginfo package 3.2.4 Length of DCR in the distro. Djprobe requires the length of DCR as an argu- Figure 6 shows what debuginfo looks like. ment of register_kprobe() because djprobe does not have a disassembler in the current implemen- First of all, this tool uses the debuginfo file to find tation. djprobe_static_code_analyzer ac- the top (low_pc) and bottom (high_pc) addresses of the quires it and returns the length in case that the probe function where the probe point is included, and makes point passes all checks above. a list of these addresses. By using this list, it can check whether the DCR bottom exceeds function bottom. If it 3.3 Registration Procedure finds this to be true, it returns 0 as “can’t probe at this point.” This is done by calling the register_kprobe() There are two exceptions to the function bottom check. function. Before that, a user must set the address of a 2 If the DCR includes an absolute jump instruction or a probe point and the length of DCR. function return instruction, and the last byte of these in- 2If the length field of a kprobe is cleared, it is not treated as a structions equals the bottom of the function, the point djprobe but a kprobe. 2007 Linux Symposium, Volume One • 195

3.3.1 Checking Conflict with Other Probes Out-of-line buffer

Copy the template code

First, register_kprobe() checks whether other Template code probes are already inserted on the DCR of the specified Call 0 Embed the address of probe point or not. These conflicts can be classified in djprobe_callback() Template code following three cases. Call djprobe_callback()

Copy the instructions and write a jump 1. Some other probes are already inserted in the same Template code The instructions in DCR jump probe point. In this case, register_kprobe() treats the specified probe as one of the collocated Figure 7: Preparing an out-of-line buffer probes. Currently, if the probe which previously inserted is not djprobe, the jump optimization is not executed. This behavior should be improved to do function can handle only single size of memory slots, we jump optimization when feasible. modified it to handle various length memory slots. After 2. The DCR of another djprobe covers the speci- allocating the buffer, it copies the template code of the fied probe point. In this case, currently, this buffer from the djprobe_template_holder() function just returns -EEXIST. However, ide- and embeds the address of the djprobe_instance ally, the djprobe inserted previously should be un- object and the djprobe_callback() function optimized for making room for the specified probe. into the template. It also copies the original code in the DCR of the specified probe to the next to the 3. There are some other probes in the DCR of the template code. Finally, it adds the jump code which specified djprobe. In this case, the specified returns to the next address of the DCR and calls djprobe becomes a normal kprobe. This means the flush_icache_range() to synchronize i-cache. length field of the kprobe is cleared.

3.3.4 Register the djprobe_instance Object 3.3.2 Creating New djprobe_instance Object

After calling arch_prepare_djprobe_ Next, register_kprobe() calls the register_ instance(), register_djprobe() regis- djprobe() function. It allocates a djprobe_ ters the kp field of the djprobe_instance instance object. This function copies the values of by using __register_kprobe_core(), and addr field and length field from the original kprobe adds the list field to the registering_list to the kp field of the djprobe_instance. Then, it global list. Finally, it adds the user-defined also sets the address of the djprobe_pre_handler() kprobe to the djprobe_instance by using the to the pre_handler field of the kp field in the register_aggr_kprobe() and returns. djprobe_instance. Then, it invokes the arch_ prepare_djprobe_instance() function to pre- pare an out-of-line buffer in the stub field. 3.4 Committing Procedure

This is done by calling the commit_djprobes() 3.3.3 Preparing the Out-of-line Buffer function, which is called from commit_kprobes().

Figure 7 illustrates how an out-of-line buffer is com- posed. 3.4.1 Check Dynamic Safety The arch_prepare_djprobe_instance() allo- cates a piece of executable memory for the out-of-line The commit_djprobes() calls the check_ buffer by using __get_insn_slot() and setup its safety() function to check safety of dynamic- contents. Since the original __get_insn_slot() self modifying. In other words, it ensures that 196 • Djprobe—Kernel probing with the smallest overhead

Insertion Address Return Address Return Address DCR Program Text Program Text jump instruction int3

Jump Jump

out-of-line buffer save call restore jump int3 Addr Original regs function regs instruction Instructions Call Issue CPUID (Serialization) Execute directly

jmp Addr djprobe_callback()

Call

Figure 8: Optimization Procedure User Handler

Figure 9: Probing Procedure no thread is running on the DCR nor is it pre- empted. For this purpose, check_safety() call synchronize_sched() if the kernel is 3.5 Probing Procedure non-preemptive, and freeze_processes() and thaw_processes() if the kernel is preemptive. Figure 9 illustrates what happens when a process hits a These functions may take a long time to return, so we probe point. call check_safety() only once. When a process hits a probe point, it jumps to the out- of-line buffer of a djprobe. And it emulates the break- 3.4.2 Jump Optimization point on the first-half of the buffer. This is accom- plished by saving the registers on the stack and calling the djprobe_callback() to call the user-defined Jump optimization is done by calling the handlers related to this probe point. After that, djprobe arch_preoptimize_djprobe_instance() and restores the saved registers, directly executes continuing the arch_optimize_djprobe_instance(). The several instructions copied from the DCR, and jumps commit_djprobes() invokes the former function back to the RA which is the next address of the DCR. to write the destination address (in other words, the address of the out-of-line buffer) into the JTPR of 3.6 Unregistration Procedure the djprobe, and issues CPUID on every online CPU. After that, it invokes the latter function to change This is done by calling unregister_kprobe(). the break-point instruction of the kprobe to the jump Unlike the registration procedure, un-optimization is instruction. Figure 8 illustrates how the instructions done in the unregistration procedure. around the insertion address are modified.

3.6.1 Checking Whether the Probe Is Djprobe 3.4.3 Cleanup Probes First, the unregister_kprobe() checks whether After optimizing registered djprobes, the commit_ the specified kprobe is one of collocated kprobes. If djprobe() releases the instances of the djprobe in it is the last kprobe of the collocated kprobes which the unregistering_list list. These instances are are aggregated on a aggregator probe, it also tries linked by calling unregister_kprobe() as de- to remove the aggregator. As described above, the scribed Section 3.6. Since the other threads might be djprobe_instance is a kind of the aggregator running on the out-of-line buffer as described in the Sec- probe. Therefore, the function also checks whether the tion 2.2, we can not release it in the unregister_ aggregator is djprobe (this is done by comparing the kprobe(). However, the commit_djprobe() al- pre_handler field and the address of djprobe_ ready ensured safety by using the check_safety(). pre_handler()). If so it calls unoptimize_ Thus we can release the instances and the out-of-line djprobe() to remove the jump instruction written by buffers safely. the djprobe. 2007 Linux Symposium, Volume One • 197

Insertion Address Return Address DCR method orignal booster djprobe Program Text kprobe 563 248 49 jmp Addr kretprobe 718 405 211

Table 1: Probing Time on Pentium R M in nsec int3 Addr

Issue CPUID (Serialization) method orignal booster djprobe int3 kprobe 739 302 61 kretprobe 989 558 312 Figure 10: Un-optimization Procedure Table 2: Probing Time on CoreTM Duo in nsec

3.6.2 Probe Un-optimization We can see djprobe could reduce the probing overhead to less than 0.1 usec (100 nsec) on each processor. Thus, Figure 10 illustrates how a probe point is un-optimized. it achived our target performance. Moreover, kretprobe can also be accelerated by djprobe, and the djprobe- The unoptimized_djprobe() invokes arch_ based kretprobe is as fast as kprobe-booster. unoptimize_djprobe_instance() to restore the original instructions to the DCR. First, it inserts a break- point to IA for protect the DCR from other threads, 5 Example of Djprobe and issues CPUID on every online CPUs by using IPI for cache serialization. After that, it copies the bytes Here is an example of djprobe. The differences between of original instructions to the JTPR. At this point, the kprobe and djprobe can be seen at two points: setting the djprobe becomes just a kprobe, this means it is un- length field of a kprobe object before registration, optimized and uses a break-point instead of a jump. and calling commit_kprobes() after registration and unregistration.

3.6.3 Removing the Break-Point /* djprobe_ex.c -- Direct Jump Probe Example */ #include #include After calling unoptimize_djprobe(), the #include unregister_kprobe() calls arch_disarm_ #include kprobe() to remove the break-point of the kprobe, static long addr=0; and waits on synchronize_sched() for cpu module_param(addr, long, 0444); serialization. After that, it tries to release the aggregator static long size=0; module_param(size, long, 0444); if it is not a djprobe. If the aggregator is a djprobe, it just calls unregister_djprobe() to add the list static int probe_func(struct kprobe *kp, struct pt_regs *regs) { field of the djprobe to the unregistering_list printk("probe point:%p\n", (void*)kp->addr); global list. return 0; }

static struct kprobe kp; 4 Performance Gains static int install_probe(void) { if (addr == 0) return -EINVAL; We measured and compared the performance of djprobe memset(&kp, sizeof(struct kprobe), 0); and kprobes. Table 1 and Table 2 show the processing kp.pre_handler = probe_func; time of one probing of kprobe, kretprobe, its boosters, kp.addr = (void *)addr; and djprobes. The unit of measure is nano-seconds. We kp.length = size; measured it on Intel R Pentium R M 1600MHz with UP TM if (register_kprobe(&kp) != 0) return -1; kernel, and on Intel R Core Duo 1667MHz with SMP kernel by using linux-2.6.21-rc4-mm1. commit_kprobes(); 198 • Djprobe—Kernel probing with the smallest overhead

7.3 Porting to Other Architectures return 0; } static void uninstall_probe(void) { Current version of djprobe supports only i386 architec- unregister_kprobe(&kp); ture. Development for x86_64 is being considered. Sev- eral difficulties are already found, such as RIP relative commit_kprobes(); } instructions. In x86_64 architecuture, RIP relative ad- dressing mode is expanded and we must assume it might module_init(install_probe); module_exit(uninstall_probe); be used. Related to dynamic code modifier, djprobe MODULE_LICENSE("GPL"); must modify the effective address of RIP relative ad- dressing instructions.

6 Conclusion To realize this, djprobe requires instruction boundary in- formation in DCR to recognize every instruction. This should be provided by djprobe_static_code_ In this paper, we proposed djprobe as a faster probing analyser or djprobe must have essential version of method, discussed what the issues are, and how djprobe disassembler in it. can solve them. After that, we described the design and the implementation of djprobe to prove that our proposal can be implemented. Finally, we showed the perfor- 7.4 Evaluating on the Xen Kernel mance improvement, and that it could reduce the prob- ing overhead dramatically. You can download the latest In the Xen kernel, djprobe has bigger advantage than patch set of djprobe from djprobe development site [5]. on normal kernel, because it does not cause any inter- Any comments and contributions are welcome. rupts. In the Xen hypervisor, break-point interruption switches a VM to the hypervisor and the hypervisor up- calls the break-point handler of the VM. This procedure 7 Future Works is so heavy that the probing time becomes almost dou- ble.

We have some plans about future djprobe development. In contrast, djprobe does not switch the VM. Thus, we are expecting the probing overhead of djprobe might be much smaller than kprobes. 7.1 SystemTap Enhancement

8 Acknowledgments We have a plan to integrate the static analysis tool into the SystemTap for accelerating kernel probing by using djprobe. We would like to thank Ananth Mavinakayanahalli, Prasanna Panchamukhi, Jim Keniston, Anil Keshava- murthy, Maneesh Soni, Frank Ch. Eigler, Ingo Mol- 7.2 Dynamic Code Modifier nar, and other developers of kprobes and SystemTap for helping us develop the djprobe.

Currently, djprobe just copies original instructions from We also would like to thank Hideo Aoki, Yumiko Sugita DCR. This is the main reason why the djprobe cannot and our colleagues for reviewing this paper. probe the place where the DCR is including execution- address-sensitive code. 9 Legal Statements If djprobe analyzes these sensitive codes and replaces its parameter to execute it on the out-of-line buffer, the Copyright c Hitachi, Ltd. 2007 djprobe can treat those codes. This idea is basically done by kerninst [1, 11] and GILK [8]. Linux is a registered trademark of Linus Torvalds. 2007 Linux Symposium, Volume One • 199

Intel, Pentium, and Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other company, product, and service names may be trademarks or service marks of others.

References

[1] kerninst. http://www.paradyn.org/ html/kerninst.html.

[2] Lkst. http://lkst.sourceforge.net/.

[3] SystemTap. http://sourceware.org/systemtap/.

[4] Unsynchronized cross-modifying code operation can cause unexpected instruction execution results. Intel Pentium II Processor Specification Update.

[5] Djprobe development site, 2005. http:// lkst.sourceforge.net/djprobe.html.

[6] Re: djprobes status, 2006. http://sourceware.org/ml/ /2006-q3/msg00518.html.

[7] Ananth N. Mavimalayanahalli et al. Probing the Guts of Kprobes. In Proceedings of Ottawa Linux Symposium, 2006.

[8] David J. Pearce et al. Gilk: A dynamic instrumentation tool for the linux kernel. In Computer Performance Evaluation/TOOLS, 2002.

[9] Toshiaki Arai et al. Linux Kernel Status Tracer for Kernel Debugging. In Proceedings of Software Engineering and Applications, 2005.

[10] Masami Hiramatsu. Overhead evaluation about kprobes and djprobe, 2005. http://lkst.sourceforge.net/docs/ probe-eval-report.pdf.

[11] Ariel Tamches and Barton P. Miller. Fine-Grained Dynamic Instrumentation of Commodity Operating System Kernels. In OSDI, February 1999. 200 • Djprobe—Kernel probing with the smallest overhead Desktop integration of Bluetooth

Marcel Holtmann BlueZ Project [email protected]

Abstract

The BlueZ project has enhanced the Bluetooth imple- mentation for Linux over the last two years. It now seamlessly integrates with D-Bus and provides a really simple and easy to use interface for the UI applications. The current API covers all needed Bluetooth core func- tionalities and allows running the same daemons on all Linux distributions, the Maemo or OpenMoko frame- works, and other embedded systems. The user interface is the only difference between all these systems. This al- lows GNOME and KDE applications to share the same Figure 1: D-Bus API overview list of remote Bluetooth devices and many more com- mon settings. As a result of this, the changes to integrate 2 Bluetooth applet Bluetooth within the UI guidelines of Maemo or Open- Moko are really small. In return, all Maemo and Open- The Bluetooth applet is the main entry point when it Moko users help by fixing bugs for the Linux desktop comes to device configuration and handling of security- distributions like Fedora, , etc., and vice versa. related interactions with the user, like the input of a PIN code. 1 Introduction One of the simple tasks of the applet is to display a Blue- tooth icon that reflects the current status of the Blue- The desktop integration of Bluetooth technology has tooth system such as whether a device discovery is in always been a great challenge since the Linux kernel progress, or a connection has been established, and so was extended with Bluetooth support. For a long time, on. It is up to the desktop UI design guidelines to de- most of the Bluetooth applications were command-line cide if the icon itself should change or if notification utilities only. With the D-Bus interface for the BlueZ messages should be displayed to inform the user of sta- protocol stack, it became possible to write desktop- tus changes. independent applications. This D-Bus interface has been explicitly designed for use by desktop and embed- ded UI applications (see Figure 1).

For the desktop integration of Bluetooth, three main ap- plications are needed:

• Bluetooth status applet; Figure 2: Bluetooth applet notification • Bluetooth properties dialog; Besides the visible icon, the applet has to implement • Bluetooth device wizard. the default passkey and authorization agent interfaces.

• 201 • 202 • Desktop integration of Bluetooth

These two interfaces are used to communicate with the Bluetooth core daemon. The task of the applet is to dis- play dialog boxes for requesting PIN codes or authoriza- tion question to the end user. The input will be handed back to the daemon which then actually interacts with the Bluetooth hardware.

Additionally, the applet might provide shortcuts for fre- quently used Bluetooth tasks. An example would be the launch of the Bluetooth configuration dialog or device setup wizard.

Figure 2 shows the notification of a pairing request for the GNOME Bluetooth applet.

3 Bluetooth properties

While the applet shows the current status of the Blue- tooth system and handles the security related tasks, the properties dialog can be used to configure the local Figure 4: Bluetooth adapter configuration Bluetooth adapter (see Figure 3). In addition to the Bluetooth adapter configuration, the Bluetooth properties application can also control the be- havior of the applet application (see Figure 4)—for ex- ample, the visibility of the Bluetooth status icon. It is possible to hide the icon until an interaction with the user is required.

These additional configuration options are desktop- and user-specific. The GNOME desktop might implement them differently than KDE.

4 Bluetooth wizard

With the status applet and the properties dialog, the desktop task for user interaction, notification, and the general adapter configuration are covered. The missing task is the setup of new devices. The Bluetooth wiz- ard provides an automated process to scan for devices in range and setup any discovered devices to make them usable for the user (see Figure 5). Figure 3: Bluetooth adapter configuration The wizard uses the basic Bluetooth core adapter in- The D-Bus interface restricts the possible configurable terface to search for remote devices in range. Then, it options to the local adapter name, class of device, and presents the user a list of possible devices filtered by the mode of operation. No additional options have been class of device. After device selection, the wizard tries found useful. The Bluetooth core daemon can adapt to automatically setup the services. For these tasks it other options automatically when needed. uses special Bluetooth service daemons. 2007 Linux Symposium, Volume One • 203

Currently the Bluetooth subsystem provides the follow- 5 Big picture ing service daemons that can be used by the wizard or any other Bluetooth application: The BlueZ architecture has grown rapidly and the whole system became really complex. Figure 6 shows a sim- plified diagram of the current interactions between the • Network service Bluetooth subsystem of the Linux kernel, the Bluetooth core daemons and services, and the user interface appli- – PAN support (NAP, GN and PANU) cations. – LAN access (work in progress) – ISDN dialup (work in progress)

• Input service

– HID support (report mode with recent kernel versions) – Emulated input devices (headset and propri- etary protocols) – Wii-mote and PlayStation3 Remote

• Audio service Figure 6: BlueZ architecture – Headset and Handsfree support – High quality audio support (work in progress) All communication between daemons and a user appli- cation are done via D-Bus. Figure 7 gives an overview • Serial service on how this interaction and communication via D-Bus works. – Emulated serial ports

Figure 7: D-Bus communication

6 Conclusion

The bluez-gnome project provides an implementation for all three major Bluetooth applications needed by a Figure 5: Bluetooth device selection modern GNOME desktop. For KDE 4, a similar set of 204 • Desktop integration of Bluetooth applications exists that uses the same D-Bus infrastruc- ture for Bluetooth. A KDE 3 backport is currently not planned.

The desktop applications don’t have to deal with any Bluetooth low-level interfaces. These are nicely ab- stracted through D-Bus. This allows other desktop or embedded frameworks like Maemo or OpenMoko to re- place the look and feel quite easily (see Figure 6).

Figure 8: User interface separation

The goal of the BlueZ Project is to unify desktop and embedded Bluetooth solutions. While the user interface might be different, the actual protocol and service im- plementation will be the same on each system.

References

[1] Special Interest Group Bluetooth: Bluetooth Core Specification Version 2.0 + EDR, November 2004

[2] freedesktop.org: D-BUS Specification Version 0.11 How virtualization makes power management different

Kevin Tian, Ke Yu, Jun Nakajima, and Winston Wang Intel Open Source Technology Center {kevin.tian,ke.yu,jun.nakajima,winston.l.wang}@intel.com

Abstract a power management implementation in virtualization.

Unlike when running a native OS, power management 1.1 Power Management state in ACPI activity has different types in a virtualization world: vir- tual and real. A virtual activity is limited to a virtual machine and has no effect on real power. For example, ACPI [1] is an open industry specification on power virtual S3 sleep only puts the virtual machine into sleep management and is well supported in Linux 2.6, so this state, while other virtual machines may still work. On paper focuses on the power management states defined the other hand, a real activity operates on the physical in the ACPI specification. hardware and saves real power. Since the virtual ac- tivity is well controlled by the guest OS, the remaining ACPI defines several kinds of power management state: problem is how to determine the real activity according to the virtual activity. There are several approaches for • Global System state (G-state): they are: G0 this problem. (working), G1 (sleeping), G2 (soft-off) and G3 (mechanical-off). 1. Purely based on the virtual activity. Virtual Sx state support is a good example. Real S3 sleep will be • Processor Power state (C-state): in the G0 state, executed if and only if all the virtual S3 are exe- the CPU has several sub-states, C0 ∼ Cn. The CPU cuted. is working in C0, and stops working in C1 ∼ Cn. C1 ∼ Cx differs in power saving and entry/exit la- 2. Purely based on the global information, regardless tency. The deeper the C-state, the more power sav- of the virtual activity. For example, CPU Px state ing and the longer latency a system can get. can be determined by the global CPU utilization. • Processor Performance state (P-state): again, in 3. Combination of (1) and (2): in some environments, C0 state, there are several sub-CPU performance VM can directly access physical hardware with as- states (P-States). In P-states, the CPU is working, sists from hardware (e.g., Intel Virtualization Tech- but CPU voltage and frequency vary. The P-state is nology for Directed I/O, a.k.a. VT-d). In this case, a very important power-saving feature. the combination of (1) and (2) will be better. • Processor Throttling state (T-state): T-state is This paper first presents the overview of power man- also a sub state of C0. It saves power by only agement in virtualization. Then it describes how each changing CPU frequency. T-state is usually used power management state (Sx/Cx/Px) can be handled in to handle thermal event. a virtualization environment by utilizing the above ap- proaches. Finally, the paper reviews the current status • Sleeping state: In G1 state, it is divided into sev- and future work. eral sub state: S1 ∼ S4. They differs in power sav- ing, context preserving and sleep/wakeup latency. 1 Overview of Power Management in Virtual- S1 is lightweight sleep, with only CPU caches lost. ization S2 is not supported currently. S3 has all context lost except system memory. S4 save context to disk and This section introduces the ACPI power management then lost all context. Deeper S-state is, more power state and virtualization mode, and later the overview of saving and the longer latency system can get.


• Device states (D-state): ACPI also defines power states for devices, i.e., D0 ∼ D3. D0 is the working state and D3 is the power-off state; D1 and D2 lie between D0 and D3. D0 ∼ D3 differ in power saving, device context preservation, and entry/exit latency.

Figure 1 in Len Brown's paper [2] clearly illustrates the state relationships.

1.2 Virtualization Model

Virtualization software is emerging in the open source world. Different virtualization models may implement power management differently, so it is worth reviewing the virtualization models first.

• Hypervisor model: the virtual machine monitor (VMM) is a new layer below the operating system and owns all the hardware. The VMM not only needs to provide the normal virtualization functionality, e.g., CPU virtualization and memory virtualization, but also needs to provide the I/O device drivers for every device.

• Host-based model: the VMM is built upon a host operating system. All the platform hardware, including CPU, memory, and I/O devices, is owned by the host OS. In this model, the VMM usually exists as a kernel module and can leverage much of the host OS functionality, e.g., I/O device drivers and the scheduler. Current KVM is a host-based model.

• Hybrid model: the VMM is a thin layer compared to the hypervisor model, covering only basic virtualization functionality (CPU, memory, etc.) and leaving I/O devices to a privileged VM. This privileged VM provides I/O services to the other VMs through inter-VM communication. Xen [3] is a hybrid model.

Meanwhile, with some I/O virtualization technology introduced, e.g., Intel® Virtualization Technology for Directed I/O (VT-d), an I/O device can be directly owned by a virtual machine.

1.3 Power Management in Virtualization

Power management in virtualization basically has the following two types:

• Virtual power management: the power management within a virtual machine. It only affects the power state of that VM and does not affect other VMs or the hypervisor/host OS. For example, virtual S3 sleep only brings the VM into a virtual sleep state, while the hypervisor/host OS and the other VMs keep working. Virtual power management usually does not save real power, but it can sometimes influence real power management.

• Real power management: the power management of the global virtual environment, including the VMM/host OS and all VMs. It operates on real hardware and saves real power. The main guideline for global power management is that only the owner of a device may perform real power management operations on it, and the VMM/host OS is responsible for coordinating the power management sequence.

This paper elaborates how each power management state (Sx/Cx/Px/Dx) is implemented for both the virtual and real types.

2 Sleep States (Sx) Support

Linux currently supports S1 (stand-by), S3 (suspend-to-RAM), and S4 (suspend-to-disk). This section mainly discusses S3 and S4 state support in the virtualization environment. S1 and S3 are similar, so the S3 discussion also applies to S1.

2.1 Virtual S3

Virtual S3 is S3 suspend/resume within a virtual machine, which is similar to the native process. When the guest OS sees that the virtual platform has S3 capability, it can start the S3 process, either requested by the user or forced by a control tool under certain predefined conditions (e.g., the VM being idle for more than one hour). First, the guest OS freezes all processes and writes a wakeup vector to the virtual ACPI FACS table. Then, the guest OS saves all contexts, including I/O device context and CPU context. Finally, the guest OS issues the hardware S3 command, which is normally an I/O port write. The VMM captures the I/O port write and handles the S3 command by resetting the virtual CPU. The VM is now in the virtual sleep state.
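The I/O-port capture just described can be sketched as follows. This is an illustration only: the register layout (SLP_TYPx in bits 10-12, SLP_EN in bit 13) comes from the ACPI PM1 control register definition, while struct vm and the helpers (save_wakeup_vector(), vm_reset_vcpus(), real_s3_sleep()) are hypothetical names, not an actual VMM API. The sketch also folds in approach (1) from the overview: real S3 is triggered only once every VM has executed virtual S3.

    #include <stdint.h>

    #define SLP_TYP_SHIFT 10
    #define SLP_TYP_MASK  (7u << SLP_TYP_SHIFT)  /* ACPI PM1 control: SLP_TYPx */
    #define SLP_EN        (1u << 13)             /* ACPI PM1 control: SLP_EN  */

    struct vm {
        int s3_slp_typ;      /* SLP_TYP value advertised in the _S3 package */
        int in_virtual_s3;   /* set once this VM has executed virtual S3 */
        struct vm *next;
    };

    /* Hypothetical helpers provided elsewhere in the VMM. */
    extern void save_wakeup_vector(struct vm *vm);  /* guest wrote it to FACS */
    extern void vm_reset_vcpus(struct vm *vm);      /* enter virtual sleep */
    extern void real_s3_sleep(void);                /* physical S3 trigger */

    /* Device-model hook: the guest writes SLP_TYP|SLP_EN to the
     * emulated PM1A control register to request sleep. */
    static void pm1a_cnt_write(struct vm *vm, struct vm *all_vms, uint16_t val)
    {
        struct vm *v;

        if (!(val & SLP_EN))
            return;
        if ((int)((val & SLP_TYP_MASK) >> SLP_TYP_SHIFT) != vm->s3_slp_typ)
            return;                 /* not an S3 request */

        save_wakeup_vector(vm);
        vm_reset_vcpus(vm);         /* VM is now in virtual sleep */
        vm->in_virtual_s3 = 1;

        /* Approach (1): go to real S3 only when every VM is asleep. */
        for (v = all_vms; v; v = v->next)
            if (!v->in_virtual_s3)
                return;
        real_s3_sleep();
    }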

[Figure 1: ACPI State Relationship — G0 (S0) Working contains C0 (Execute, with P-states and throttling) and the idle states C1-C3; G1 Sleeping contains S1 (Standby), S3 (Suspend), and S4 (Hibernate); G2 (S5) is Soft Off; G3 is Mechanical Off.]

Guest OS S3 wakeup is the reverse process. First, the VMM puts the virtual CPU into real mode and starts the virtual CPU from the guest BIOS POST code. The BIOS POST detects that this is an S3 wakeup and thus jumps to the S3 wakeup vector stored in the guest ACPI FACS table. The wakeup routine in turn restores all CPU and I/O context and unfreezes all processes. The guest OS now resumes to the working state.

From the above virtual S3 suspend/resume process, it is easy to see that the VMM needs the following work to support virtual S3:

• Guest ACPI Table: the ACPI DSDT table should have a _S3 package to tell the guest OS that the virtual platform has S3 capability; otherwise, the guest OS won't even start S3 sleep. The guest ACPI table can also have optional OEM-specific fields if required.

• Guest BIOS POST Code: logic must be added here to detect the S3 resume, get the wakeup vector address from the ACPI FACS table, and then jump to the wakeup vector.

• S3 Command Interception: first, the device model should emulate the ACPI PM1A control register so that it can capture the S3 request. In the KVM and Xen cases, this can be done on the QEMU side, and it is normally implemented as a system I/O port. Second, to handle the S3 request, the VMM needs to reset all virtual CPUs.

2.2 Real S3 State

Unlike virtual S3, real S3 puts the whole system into the sleep state, including the VMM/host OS and all the virtual machines, so it is more meaningful in terms of power saving.

Linux already has fundamental S3 support (freezing/unfreezing processes, suspending/resuming I/O devices, hot-plugging/unplugging CPUs in the SMP case, etc.) to conduct a complete S3 suspend/resume process.

Real S3 in virtualization needs a similar sequence. The key difference is that system resources may be owned by different components, so the guideline is to ensure that the right owner suspends/resumes the resources it owns.

Take Xen as an example. The suspend/resume operation must be coordinated among the hypervisor, the privileged VM, and the driver domains. Most I/O devices are owned by the privileged VM (domain0) and the driver domains, so suspend/resume of those devices is mostly done in domain0 and the driver domains. The hypervisor then covers the rest:

• Hypervisor-owned devices: APIC, PIC, UART, platform timers like the PIT, etc. The hypervisor needs to suspend/resume those device contexts.

• CPU: owned by the hypervisor, and thus managed here as well.

• Wakeup routine: at wakeup, the hypervisor needs to be the first one to get control, so the wakeup routine is also provided by the hypervisor.

• ACPI PM1x control register: the major ACPI sleep logic is covered by domain0, with the only exception of the PM1x control register. Domain0 notifies the hypervisor at the place where it would normally write to the PM1x register. The hypervisor then covers the above work and writes to this register at the final stage, which triggers the physical S3 sleep.

For a driver domain that is assigned a physical I/O device, the hypervisor notifies the domain to do virtual S3 first, so that it powers off its I/O device before domain0 starts its sleep sequence.

Figure 2 illustrates the Xen real S3 sequence.

2.3 Virtual S4 State and Real S4 State

Virtual S4 is suspend-to-disk within a virtual machine. The guest OS is responsible for saving all contexts (CPU, I/O device, memory) to disk and entering the sleep state. Virtual S4 is a useful feature because it can reduce guest OS booting time.

From the VMM point of view, the virtual S4 support implementation is similar to virtual S3. The guest ACPI table also needs to export S4 capability, and the VMM needs to capture the S4 request. The major difference is how the VMM handles the S3/S4 request. For S3, the VMM resets the VCPU and jumps to the wakeup vector when the VM resumes. For S4, the VMM only needs to destroy the VM, since it doesn't need to preserve the VM memory. To resume from S4, the user can recreate the VM with the previous disk image; the guest OS will recognize the S4 resume and start resuming from S4.

Real S4 state support is also similar to the native S4 state. The host-based model can leverage host OS S4 support directly, but it's more complex in a hybrid model like Xen. The key design issue is how to coordinate the hypervisor and domain0 along the suspend process. For example, the disk driver can only be suspended after the VMM dumps its own memory to disk; but once the hypervisor finishes its memory dump, later changes to domain0's virtual CPU context can not be saved any more. After wakeup, both the domain0 and hypervisor memory images need to be restored, and the sequence is important here. This is still an open question.

3 Processor Power States (Cx) Support

Processor power states, while in the G0 working state, generally refer to the active or idle states of the CPU. C0 stands for the normal power state where the CPU dispatches and executes instructions, while C1, C2 ··· Cn indicate low-power idle states where no instructions are executed and power consumption is reduced to different levels. Generally speaking, a larger value of Cx brings greater power savings, but at the same time adds longer entry/exit latency. It's important for OSPM to understand the ability and implications of each C-state, and then define an appropriate policy to suit the activities of the time:

• Methods to trigger a specific C-state

• Worst-case latency to enter/exit the C-state

• Average power consumption at a given C-state

A progressive policy may hurt components that don't tolerate big delays, while a conservative policy makes less use of the power-saving capability provided by the hardware. For example, OSPM should be aware that cache coherency is not maintained by the processor in the C3 state, and thus it needs to manually flush the cache before entering C3 in an SMP environment. Depending on the hardware implementation, the TSC may stop, as may the LAPIC timer interrupt. When Cx comes into virtualization, things become more interesting.

3.1 Virtual C-states

Virtual C-states are presented to a VM as a 'virtual' power capability on a 'virtual' processor. The straightforward effect of virtual C-states is to remove the virtual processor from the scheduler when Cx is entered, and to wake the virtual processor back up upon a break event. Since the virtual processor is a 'virtual' context created and managed by the VMM, transitions among virtual C-states have nothing to do with the power state of the real processors, but they do have the ability to provide useful hints in some cases.

[Figure 2: Xen Real S3 sequence — OSPM in Dom0 and in the driver domain first suspends the device drivers for their own devices (Dom0 devices: disk, NIC, ACPI register; driver-domain device: the assigned device); the hypervisor's PM handler then suspends the hypervisor-owned devices (PIT/APIC/...) and finally issues the S3 command.]

The way to implement virtual C-states can vary with the virtualization model. For example, a hardware-assisted guest may be presented with a C-states capability fully conforming to the ACPI specification, while a para-virtualized guest can simply make a quick hyper-call request. Basically it doesn't make much sense to differentiate among C1, C2 ··· Cn in a virtualization world, but it may be useful in some cases. One direct case is to test the processor power management logic of a given OSPM, or even to try whether some newer C-state is a meaningful model before the hardware is ready. Another interesting usage would be to help the physical C-state governor make the right decision, since virtual C-state requests reveal the activities within a guest.

3.1.1 Para-virtualized guest

A para-virtualized guest is a modified guest which can cooperate with the VMM to improve performance. A virtual C-state for a para-virtualized guest just borrows the term from ACPI, with no need to bind to any ACPI context. A simple policy can just provide 'active' and 'sleep' categories for a virtual processor, without differentiating among C1 ··· Cn. When the idle thread is scheduled with nothing to handle, the time events in the near future are walked to find the nearest interval, which is then taken as the parameter of a sleep hyper-call issued to the VMM. The VMM then drops the virtual CPU from the run-queue and may wake it up later upon any break event (like an interrupt) or when the specified interval times out. A perfect match for a tick-less time model! Since this is more like normal 'HALT' instruction usage, the policy is simple and tightly coupled with the time sub-system.

It's also easy to extend a para-virtualized guest with more fine-grained processor power states by extending the above hyper-calls. Such a hyper-call based interface can be hooked into the generic Linux processor power management infrastructure, with the common policies unchanged but a different low-level power control interface added.
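A minimal sketch of the para-virtualized idle loop described above. next_timer_event_ns() and HYPERVISOR_vcpu_sleep() are hypothetical names standing in for the timer walk and the sleep hyper-call; they are not an actual Xen or Linux API.

    typedef unsigned long long u64;

    /* Hypothetical helpers: nearest pending time event, and the
     * sleep hyper-call that parks this VCPU in the VMM. */
    extern u64 next_timer_event_ns(void);
    extern void HYPERVISOR_vcpu_sleep(u64 timeout_ns);

    /* Para-virtualized idle loop: instead of HALT, tell the VMM how
     * long we can sleep; any break event (e.g., an interrupt) or the
     * timeout puts the VCPU back on the run-queue. */
    static void pv_idle(void)
    {
        for (;;) {
            u64 timeout = next_timer_event_ns();
            HYPERVISOR_vcpu_sleep(timeout);
            /* woken up: pending events are handled, then we loop */
        }
    }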

3.1.2 Hardware-assisted guest

A hardware-assisted guest is an unmodified guest run with hardware (e.g., Intel VT or AMD-V) support. Unlike a para-virtualized guest, which carries changes within the guest, here the virtual platform (i.e., the device model) needs to export an exact control interface conforming to the ACPI spec and emulate the desired effect that the hardware-assisted guest expects. By providing the same processor power management capability, no change is required within the hardware-assisted guest.

• Virtual C2 – an ability provided by the chipset which controls the clock input signal. First, the device model needs to construct a correct ACPI table to expose the related information, including the trigger approach, latency, and power consumption, as the ACPI spec defines. The ACPI FADT table contains fixed-format information, like the P_LVL2 command register for the trigger. Recent ACPI versions also add a more generic object, _CST, to describe all C-state information, e.g., C-state dependency and the mwait extension. The device model may want to provide both methods if taken as a test environment.

The device model then needs to send a notification to the VMM after detecting a virtual C2 request from the guest. As an acceleration, the Cx information can be registered with the VMM so that the VMM can handle it directly. Actually, for the virtual C2 state, the device model doesn't need to be involved at run time: C2 is defined as a low-power idle state with the bus snooped and cache coherency maintained, and basic virtual MMU management and DMA emulation have already ensured this effect at any given time.

• Virtual C3 – almost the same as virtual C2, with P_LVL3 or _CST describing the basic information. But virtual C3 also affects the device model besides the virtual processor. The device model needs to provide PM2_CNT.ARB_DIS, which disables bus master cycles and thus DMA activities. PM1x_STS.BM_STS, an optional feature of chipset virtualization, reveals the bus activity status, which is a good hint for OSPM to choose C2 or C3. More importantly, PM1x_CNT.BM_RLD provides the option to take bus master requests as break events to exit C3. To provide correct emulation, tight cooperation between the device model and the VMM is required, which brings overhead. So it's reasonable for the device model to give up such support if it is not aimed at testing OSPM behavior under C3.

• Deeper virtual Cx – similar to C3, but more chipset logic virtualization is required.

3.2 Real C-states

The VMM takes ownership of the physical CPUs and thus is required to provide physical C-state management for 'real' power saving. The way to retrieve the C-state information and conduct transitions is similar to what today's OSPM does according to the ACPI spec. For a host-based VMM like KVM, that control logic is already there in the host environment and nothing needs to be changed. For a hybrid VMM like Xen, domain0 can parse and register the C-state information with the hypervisor, which is equipped with the necessary low-level control infrastructure.

There are some interesting implementation approaches. For example, the VMM can take virtual Cx requests into consideration. Normally guest activities occupy the major portion of CPU cycles, which can then be taken as a useful factor for the C-state decision. The VMM may then track the virtual C-state requests from different guests, which represent the real activities on a given CPU. That info can be hooked into existing governors to help make better decisions. For example:

• Never issue a Cx transition if no guest has such a virtual Cx request pending

• Only issue a Cx transition if all guests have the same virtual Cx request

• Pick the Cx with the most virtual Cx requests in a given period

Of course, the above is very rough and may not result in a really efficient power-saving model. For example, one guest with poor C-state support may prevent the whole system from entering a deeper state even when the conditions are satisfied. But it does seem to be a good area for research to leverage guest policies, since different OSes may have different policies for their specific workloads.
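The third policy above can be sketched in a few lines. This is an illustration only: the per-C-state request counters and the max_cstate bound are hypothetical, not part of an existing governor.

    /* Pick the physical C-state with the most pending virtual
     * requests; fall back to shallow C1 when nothing is pending. */
    static int pick_cstate(const unsigned int requests[], int max_cstate)
    {
        int cx, best = 1;

        for (cx = 2; cx <= max_cstate; cx++)
            if (requests[cx] > requests[best])
                best = cx;

        return requests[best] ? best : 1;
    }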

4 Processor Performance States (Px) Support

P-states provide OSPM an opportunity to change both the frequency and voltage of a given processor at run-time, which thus brings a more efficient power-saving ability. Current Linux has several sets of governors, which can be user-cooperative, static, or on-demand in style. P-states in the virtualization world are basically similar to the above C-states discussion in many concepts, so only their specialties are described below.

4.1 Virtual P-states

A frequency change on a real processor has the net effect of slowing the execution flow, while a voltage change works at a more fundamental level to lower power consumption. When it comes to a virtual processor, voltage is a no-op, but frequency does have a useful implication for the scheduler: a penalty from the scheduler has a similar slowing effect as a frequency change. We can therefore easily plug virtual P-state requests into the scheduling information, for example:

• Halve its weight in a weight-based scheduler

• Lower its priority in a priority-based scheduler

Furthermore, the scheduler may bind a penalty level to each virtual P-state and export this information to the guest via the virtual platform. The virtual platform may take this info and construct exact P-states to be presented to the guest. For example, P1 and P2 can be presented if the scheduler has two penalty levels defined. These set up the bridge between virtual P-states and scheduler hints. Based on this infrastructure, the VMM is aware of the guest's requirements and can then grant cycles more efficiently to the guest with the more urgent workload.
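As a sketch of the weight-based variant, assuming a hypothetical per-VCPU weight as used by a credit-style scheduler:

    /* Map a virtual P-state to a scheduling weight: P0 keeps the
     * full weight, and each deeper virtual P-state halves it.
     * The halving rule is an illustrative choice, not a fixed design. */
    static unsigned int vcpu_weight(unsigned int base_weight,
                                    unsigned int pstate)
    {
        return base_weight >> pstate;
    }

A VCPU requesting virtual P2 would then receive a quarter of its base weight until it requests P0 again.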

4.2 Real P-states

Similar to real C-states for virtualization, we can either reuse the native policy or add virtualization hints. One interesting extension is based on the user-space governor. We can connect together all the guest user-space governors and have one governor act as the server to collect that information. This server can be a user-space governor in the host for a host-based VMM, or in the privileged VM for a hybrid VMM. This user-space governor can then incorporate the decisions from the other user-space governors and make a final one. Another good point of this approach is that a hybrid VMM can directly follow the request from the privileged VM by a simple "follow" policy.

5 Device Power States (Dx) Support

Devices consume another major portion of the power supply, and thus the power features of devices also play an important role. Some buses, like PCI, have a well-defined power management feature for devices, and ACPI covers the rest if it is missing. A power state transition for a given device can be triggered in either a passive or an active way. When OSPM conducts a system-level power state transition, like S3/S4, all devices are forced to enter an appropriate low-power state. OSPM can also introduce active, on-demand device power management at run-time on a device that has been inactive for some period. Care must be taken to ensure that the power state change of one node does not affect others that depend on it. For example, all the nodes on a wake path have to satisfy the minimal power requirement of that wake method.

5.1 Virtual D-states

Devices seen by a guest basically split into three categories: emulated, para-virtualized, and direct-assigned. Direct-assigned devices are real, with nothing different regarding D-states. Emulated and para-virtualized devices are physically absent, and thus the device power states on them are also virtual.

Normally, a real device class defines what subset of capabilities is available at each power level. Then, by choosing the appropriate power state matching the functional requirements of the time, OSPM can request that the device switch to that state for direct power saving at the electrical level. Virtual devices, instead, are completely software logic, either emulated as a real device or para-virtualized as a new device type. So virtual D-states normally show up as a reduction of workload, which has an indirect effect on processor power consumption and thus also contributes to power saving.

For emulated devices, the device model presents exactly the same logic, and thus the same D-state definitions, as a real device. Para-virtualized devices normally consist of front-end and back-end drivers, and the connection state between the pair can represent the virtual D-state. Both the device model and the back-end need to dispatch requests from the guest and then hand back the desired results. Timers, callbacks, kernel threads, etc. are possible components for making such processing efficient. As a result of a virtual D-state change, such resources may be frozen or even freed to reduce the workload imposed on the physical processor. For example, the front-end driver may change the connection state to 'disconnected' when the OSPM in the guest requests a D3 state transition. The back-end driver can then stop its dispatch thread to avoid any unnecessary activity in the idle phase. The same policy also applies to the device model, which may, for example, stop the timer for periodic screen updates.

The virtual bus power state can be treated with the same policy as the virtual device power state, and most of the time it may be just a no-op if the virtual bus only consists of function calls.
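The front-end/back-end reaction to a virtual D3 request described above might look like the following sketch. struct pv_dev, its fields, and notify_backend() are hypothetical names used purely for illustration.

    enum conn_state { CONN_CONNECTED, CONN_DISCONNECTED };

    struct pv_dev {
        enum conn_state conn;   /* front-end/back-end connection state */
    };

    /* Hypothetical inter-VM notification to the back-end driver. */
    extern void notify_backend(struct pv_dev *dev);

    /* Front-end hook for a guest-requested D-state transition. */
    static void frontend_set_power(struct pv_dev *dev, int d_state)
    {
        if (d_state == 3) {
            /* Virtual D3: drop the connection; on notification the
             * back-end parks its dispatch thread until D0 is requested. */
            dev->conn = CONN_DISCONNECTED;
            notify_backend(dev);
        }
    }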

5.2 Real D-states

Real device power state management in virtualization is a bit more complex, especially when a device may be directly assigned to a guest (known as a driver domain). To make this area clear, we first show the case without driver domains, and then unveil the tricky issues when the latter are concerned.

5.2.1 Basic virtualization environment

The basic virtualization model has all physical devices owned by one privileged component, say host Linux for KVM and domain-0 for Xen. The OSPM of that privileged component deploys the policies and takes control of device power state transitions. Device models or back-end drivers are clients on top of the related physical devices, and their requests are counted into OSPM's statistics for a given device automatically. So there's nothing different from the existing OSPM's point of view.

For example, OSPM will not place a disk into deeper D-states while the device model or back-end driver is still busy handling disk requests from a guest, since those add to the workload on the real disk.

In comparison to the OSPM within the guests, we refer to this special OSPM as the "dominate OSPM." Also, "dominator" is used as an alias for the above host Linux and domain-0 in the text below, for clarity.

[Figure 3: A simple case — the dominator's OSPM manages Bus0; Bridge1 hangs off Bus0 and connects Bus1 with devices P1, P2, and P3; P2 is assigned to guest GA.]

5.2.2 Driver domains

Driver domains are guests with some real devices assigned exclusively, to either relieve the I/O virtualization bottleneck or simply speed up the guest directly. The fact that OSPM needs to care about device dependencies causes a mismatch in this model: the dominate OSPM, with only local knowledge, needs to cover device dependencies across multiple running environments.

A simple case (Figure 3) is to assign P2, under PCI Bridge1, to guest GA, with the rest still owned by the dominator. Say an on-demand D-state governor is active in the dominate OSPM, and all devices under Bridge1 except P2 have been placed into D3. Since all the devices on bus 1 are inactive now based on local knowledge, the dominate OSPM may further decide to lower the voltage and stop the clock on bus 1 by putting Bridge1 into a deeper power state. Devil take it! P2 can never work now, and GA has to deal with a dead device that gives no response.

The idea to fix this issue is simple: extend the local dominate OSPM to construct a full device tree across all domains. The implication is that the on-demand device power governor can't simply depend on in-kernel statistics; hooks have to be allowed from other components. Figure 4 is one example of such an extension.

Device assignment means the grant of port I/O, MMIO, and interrupts in substance, but the way to find an assigned device is actually virtualized. For example, PCI device discovery is done by PCI configuration space access, which is virtualized in all cases as part of the virtual platform. That's the trick of how the above infrastructure works. For a hardware-assisted guest, the device model intercepts the access, in the traditional 0xcf8/0xcfc or memory-mapped style. A para-virtualized guest can have a PCI frontend/backend pair to abstract the PCI configuration space operations, as already provided by today's Xen. Based on this reality, the device model or PCI backend is a good place to reveal the activity of devices owned by other guests, since standard power state transitions are done by PCI configuration space access as defined by the PCI spec. Then, based on hints from both in-kernel and virtualization-related components, the dominate OSPM can now precisely decide when to idle a parent node whose child nodes are shared among guests.

[Figure 4: The extended case — the dominator's OSPM takes activity hints not only from its kernel drivers (for the devices on Bus1 under Bridge1) but also from the device model serving a hardware-assisted guest and the PCI backend serving a para-virtualized guest.]

However, when another bus type without an explicit power definition is concerned, it's more complex to handle the device dependencies. For example, for devices with power information provided by ACPI, the control method is completely encapsulated within ACPI AML code. The way to intercept power state changes then has to be a case-specific approach, based on ACPI internal knowledge. Fortunately, most of the time only PCI devices are preferred with regard to device assignment.

6 Current Status

Right now our work in this area is mainly carried out on Xen. Virtual S3/S4 for hardware-assisted guests has been supported with some extensions to the ACPI component within QEMU. This should also apply to other VMM software with the same hardware-assisted support.

Real S3 support is also ready. Real S3 stability relies on the quality of Linux S3 support, since domain0, as a Linux instance, takes most of the responsibility, with the only exception being the final trigger point. Some Linux S3 issues have been met. For example, the SATA driver in AHCI mode has a stability issue on 2.6.18, which unfortunately is the domain0 kernel version at the time of writing. Another example is VGA resume. Ideally, real systems that support Linux should restore video in the BIOS. Native Linux graphics drivers should also restore video when they are used. If that does not work, you can find some workarounds in Documentation/power/video.txt. The positive side is that Linux S3 support is becoming more and more stable as time goes by. Real S4 support has not been started yet.

Both virtual and real Cx/Px/Tx/Dx support are in development; these are areas with many possibilities worthy of investigation. Efficient power management policies covering both virtual and real activities are very important to power saving in a run-time virtualization environment. The foregoing sections are some early findings from this investigation, and surely we can anticipate more fun from this area in the future.

References

[1] "Advanced Configuration and Power Interface Specification," Revision 3.0b, 2006, Hewlett-Packard, Intel, Microsoft, Phoenix, Toshiba. http://www.acpi.info

[2] "ACPI in Linux," L. Brown, A. Keshavamurthy, S. Li, R. Moore, V. Pallipadi, L. Yu, in Proceedings of the Linux Symposium (OLS), 2005.

[3] "Xen 3.0 and the Art of Virtualization," I. Pratt, K. Fraser, S. Hand, C. Limpach, A. Warfield, D. Magenheimer, J. Nakajima, and A. Mallick, in Proceedings of the Linux Symposium (OLS), 2005.

This paper is copyright 2007 by Intel. Redistribution rights are granted per submission guidelines; all other rights are reserved.

*Other names and brands may be claimed as the property of others.

Ptrace, Utrace, Uprobes: Lightweight, Dynamic Tracing of User Apps

Jim Keniston, IBM, [email protected]
Ananth Mavinakayanahalli, IBM, [email protected]
Prasanna Panchamukhi, IBM, [email protected]
Vara Prasad, IBM, [email protected]

Abstract

The ptrace system-call API, though useful for many tools such as gdb and strace, generally proves unsatisfactory when tracing multithreaded or multi-process applications, especially in timing-dependent debugging scenarios. With the utrace kernel API, a kernel-side instrumentation module can track interesting events in traced processes. The uprobes kernel API exploits and extends utrace to provide kprobes-like, breakpoint-based probing of user applications.

We describe how utrace, uprobes, and kprobes together provide an instrumentation facility that overcomes some limitations of ptrace. For attendees, familiarity with a tracing API such as ptrace or kprobes will be helpful but not essential.

1 Introduction

For a long time now, debugging user-space applications has been dependent on the ptrace system call. Though ptrace has been very useful and will almost certainly continue to prove its worth, some of the requirements it imposes on its clients are considered limiting. One important limitation is performance, which is influenced by the high context-switch overheads inherent in the ptrace approach.

The utrace patchset [3] mitigates this to a large extent. Utrace provides in-kernel callbacks for the same sorts of events reported by ptrace. The utrace patchset re-implements ptrace as a client of utrace.

Uprobes is another utrace client. Analogous to kprobes for the Linux® kernel, uprobes provides a simple, easy-to-use API to dynamically instrument user applications.

Details of the design and implementation of uprobes form the major portion of this paper.

We start by discussing the current situation in the user-space tracing world. Sections 2 and 3 discuss the various instrumentation approaches possible and/or available. Section 4 goes on to discuss the goals that led to the current uprobes design, while Section 5 details the implementation. In the later sections, we put forth some of the challenges, especially with regard to modifying text and handling of multithreaded applications. Further, there is a brief discussion on how and where we envision this infrastructure can be put to use. We finally conclude with a discussion on where this work is headed.

2 Ptrace-based Application Tracing

Like many other flavors of UNIX, Linux provides the ptrace system-call interface for tracing a running process. This interface was designed mainly for debugging, but it has been used for tracing purposes as well. This section surveys some of the ptrace-based tracing tools and presents limitations of the ptrace approach for low-impact tracing.

Ptrace supports the following types of requests (a sketch of their use follows this list):

• Attach to, or detach from, the process being traced (the "tracee").

• Read or write the process's memory, saved registers, or user area.

• Continue execution of the process, possibly until a particular type of event occurs (e.g., a system call is called or returns).
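For concreteness, here is a minimal user-space sketch exercising each of these request types via the standard ptrace(2) interface. The target pid and address are placeholders, and error handling is omitted for brevity.

    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static void trace_one_syscall(pid_t pid, void *addr)
    {
        ptrace(PTRACE_ATTACH, pid, NULL, NULL);   /* attach; tracee stops */
        waitpid(pid, NULL, 0);

        /* Read one word of the tracee's memory. */
        long word = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
        (void)word;

        /* Continue until the next system-call entry or exit. */
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
        waitpid(pid, NULL, 0);

        ptrace(PTRACE_DETACH, pid, NULL, NULL);   /* detach */
    }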


Events in the tracee turn into SIGCHLD signals that are delivered to the tracing process. The associated siginfo_t specifies the type of event.

2.1 Gdb

gdb is the most widely used application debugger in Linux, and it runs on other flavors of UNIX as well. gdb controls the program to be debugged using ptrace requests. gdb is used mostly as an interactive debugger, but also provides a batch option through which a series of gdb commands can be executed, without user intervention, each time a breakpoint is hit. This method of tracing has significant performance overhead. gdb's approach to tracing multithreaded applications is to stop all threads whenever any thread hits a breakpoint.

2.2 Strace

The strace command provides the ability to trace calls and returns from all the system calls executed by the traced process. strace exploits ptrace's PTRACE_SYSCALL request, which directs ptrace to continue execution of the tracee until the next entry or exit from a system call. strace handles multithreaded applications well, and it has significantly less performance overhead than the gdb scripting method; but performance is still the number-one complaint about strace. Using strace to trace itself shows that each system call in the tracee yields several system calls in strace.

2.3 Ltrace

The ltrace command is similar to strace, but it traces calls and returns from dynamic library functions. It can also trace system calls, and extern functions in the traced program itself. ltrace uses ptrace to place breakpoints at the entry point and return address of each probed function. ltrace is a useful tool, but it suffers from the performance limitations inherent in ptrace-based tools. It also appears not to work for multithreaded programs.

2.4 Ptrace Limitations

If gdb, strace, and ltrace don't give you the type of information you're looking for, you might consider writing your own ptrace-based tracing tool. But consider the following ptrace limitations first:

• Ptrace is not a POSIX system call. Its behavior varies from operating system to operating system, and has even varied from version to version in Linux. Vital operational details ("Why do I get two SIGCHLDs here? Am I supposed to pass the process a SIGCONT or no signal at all here?") are not documented, and are not easily gleaned from the kernel source.

• The amount of perseverance and/or luck you need to get a working program goes up as you try to monitor more than one process or more than one thread.

• Overheads associated with accessing the tracee's memory and registers are enormous: on the order of 10x to 100x or more, compared with equivalent in-kernel access. Ptrace's PEEK-and-POKE interface provides very low bandwidth and incurs numerous context switches.

• In order to trace a process, the tracer must become the tracee's parent. To attach to an already running process, then, the tracer must muck with the tracee's lineage. Also, if you decide you want to apply more instrumentation to the same process, you have to detach the tracer already in place.

3 Kernel-based Tracing

In the early days of Linux, the kernel code base was manageable and most people working on the kernel knew their core areas intimately. There was a definite pushback from the kernel community towards including any tracing and/or debugging features in the mainline kernel.

Over time, Linux became more popular and the number of kernel contributors increased. A need for a flexible tracing infrastructure was recognized. To that end, a number of projects sprung up and have achieved varied degrees of success.

We will look at a few of these projects in this section. Most of them are based on the kernel-module approach.

3.1 Kernel-module approach

The common thread among the following approaches is that the instrumentation code needs to run in kernel mode.

• Thread control: Utrace clients can inject signals, 3.1.1 Kprobes request that a thread be stopped from running in user-space, single-step, block-step, etc. Kprobes [2] is perhaps the most widely accepted of all • Thread machine state access: While in a callback, dynamic instrumentation mechanisms currently avail- a client can inspect and/or modify the thread’s core able for the Linux kernel. Kprobes traces its roots back state, including the registers and u-area. to DProbes (discussed later). In fact, the first version of kprobes was a patch created by just taking the minimal, kernel-only portions of the DProbes framework. Utrace works by placing tracepoints at strategic points in kernel code. For traced threads, these tracepoints yield Kprobes allows a user to dynamically insert “probes” calls into the registered utrace clients. These callbacks, into specific locations in the Linux kernel. The user though happening in the context of a user process, hap- specifies “handlers” (instrumentation functions) that run pen when the process is executing in kernel mode. In before and/or after the probed instruction is executed. other words, utrace clients run in the kernel. Ptrace is When a probepoint (which typically is an architecture- one utrace client that lives in the kernel. Other clients— specific breakpoint instruction) is hit, control is passed especially those used for ad hoc instrumentation—may to the kprobes infrastructure, which takes care of execut- be implemented as kernel modules. ing the user-specified handlers. [2] provides an in-depth treatment of the kprobes infrastructure (which, inciden- tally, includes jprobes and function-return probes). 3.1.3 Uprobes The kernel-module approach was a natural choice for kprobes: after all, the goal is to access and instrument Uprobes is another client of utrace. As such, uprobes the Linux kernel. Given the privilege and safety re- can be seen as a flexible user-space probing mecha- quirements necessary to access kernel data structures, nism that comes with all the power, but not all the con- the kernel-module approach works very well. straints, of ptrace. Just as kprobes creates and manages probepoints in kernel code, uprobes creates and man- ages probepoints in user applications. The uprobes in- 3.1.2 Utrace frastructure, like ptrace, lives in the kernel.

A relatively new entrant to this instrumentation space A uprobes user writes a kernel module, specifying for is utrace. This infrastructure is intended to serve as an each desired probepoint the process and virtual address abstraction layer to write the next generation of user- to be probed and the handler to run when the probepoint space tracing and debugging applications. is hit.

One of the primary grouses kernel hackers have had A uprobes-based module can also use the utrace and/or about ptrace is the lack of separation/layering of kprobes APIs. Thus a single instrumentation module code between architecture-specific and -agnostic parts. can collect and correlate information from the user ap- Utrace aims to mitigate this situation. Ptrace is now but plication(s), shared libraries, the kernel/user interfaces, one of the clients of utrace. and the kernel itself. Utrace, at a very basic level, is an infrastructure to mon- itor individual Linux “threads”—each represented by a task_struct in the kernel. An “engine” is utrace’s 3.1.4 SystemTap basic control unit. Typically, each utrace client estab- lishes an engine for each thread of interest. Utrace pro- SystemTap [4] provides a mechanism to use the kernel vides three basic facilities on a per-engine basis: (and later, user) instrumentation tools, through a simple 218 • Ptrace, Utrace, Uprobes: Lightweight, Dynamic Tracing of User Apps awk-like event/action scripting language. A SystemTap Tool Event Counted Overhead (usec) script specifies code points to probe, and for each, the per Event instrumentation code to run when the probepoint is hit. ltrace -c function calls 22 The scripting language provides facilities to aggregate, gdb -batch function calls 265 collate, filter, present, and analyze trace data in a manner strace -c system calls 25 that is intuitive to the user. kprobes function calls 0.25 uprobes function calls 3.4 utrace system calls 0.16 From a user’s script, the stap command produces and loads a kernel module containing calls to the kprobes Table 1: Performance of ptrace-based tools (top) vs. ad API. Trace data from this module is passed to user space hoc kernel modules (bottom) using efficient mechanisms (such as relay channels), and the SystemTap post-processing infrastructure takes care of deciphering and presenting the gathered information 3.3 Advantages of kernel-based application tracing in the format requested by the user, on the stap com- mand’s standard output. • Kernel-based instrumentation is inherently fast. SystemTap can be viewed as a wrapper that enables easy Table 1 shows comparative performance numbers use of kernel instrumentation mechanisms, including, of ptrace-based tools vs. kernel modules using but not limited to, kprobes. Plans are afoot to exploit kprobes, uprobes, and utrace. These measurements R utrace, uprobes, the “markers” static tracing infrastruc- were taken on a Pentium M (1495 MHz) running ture, and a performance-monitoring-hardware infras- the utrace implementation of ptrace. tructure (perfmon), as they become part of the mainline • Most systemic problems faced by field engineers kernel. involve numerous components. Problems can per- colate from a user-space application all the way up 3.2 Interpreter-based approaches to a core kernel component, such as the block layer or the networking stack. Providing a unified view of the flow of control and/or data across user space Two more instrumentation tools, DProbes and DTrace, and kernel is possible only via kernel-based trac- bear mention. Both tools endeavor to provide integrated ing. tracing support for user applications and kernel code. • Kernel code runs at the highest privilege and as DProbes [1] traces its roots to IBM’s OS/2 R operating such, can access all kernel data structures. 
In ad- system, but has never found a foothold in Linux, ex- dition, the kernel has complete freedom and access cept as the forebear of kprobes. DProbes instrumenta- to the process address space of all user-space pro- tion is written in the RPN , the cesses. By using safe mechanisms provided in ker- assembly language for a virtual machine whose inter- nel to access process address spaces, information preter resides in the kernel. For the i386 architecture, related to the application’s data can be gleaned eas- the DProbes C Compiler (dpcc) translates instrumenta- ily. tion in a C-like language into RPN. Like kprobes and • Ptrace, long the standard method for user-space in- uprobes, DProbes allows the insertion of probepoints strumentation, has a number of shortcomings, as anywhere in user or kernel code. discussed in Section 2.4. DTrace is in use on Solaris and FreeBSD. Instrumenta- tion is written in the D programming language, which 3.4 Drawbacks of kernel-based application tracing provides aggregation and presentation facilities similar to those of SystemTap. As with DProbes, DTrace instru- • The kernel runs at a higher privilege level and in mentation runs on an interpreter in the kernel. DTrace a more restricted environment than applications. kernel probepoints are limited to a large but predefined Assumptions and programming constructs that are set. A DTrace probe handler is limited to a single basic valid for user space don’t necessarily hold for the block (no ifs or loops). kernel. 2007 Linux Symposium, Volume One • 219

• Not everybody is a kernel developer. It's not prudent to expect someone to always write "correct" kernel code. And, while dealing with the kernel, one mistake may be too many. Most application developers and system administrators, understandably, are hesitant to write kernel modules.

• The kernel has access to user-space data, but can't easily get at all the information (e.g., symbol table, debug information) to decode it. This information must be provided in or communicated to the kernel module, or the kernel must rely on post-processing in user space.

SystemTap goes a long way in mitigating these drawbacks. For kernel tracing, SystemTap uses the running kernel's debug information to determine source file, line number, location of variables, and the like. For applications that adhere to the DWARF format, it wouldn't be hard for SystemTap to provide address/symbol resolution.

4 Uprobes Design Goals

The idea of user-space probes has been around for years, and in fact there have been a variety of proposed implementations. For the current, utrace-based implementation, the design goals were as follows:

• Support the kernel-module approach. Uprobes follows the kprobes model of dynamic instrumentation: the uprobes user creates an ad hoc kernel module that defines the processes to be probed, the probepoints to establish, and the kernel-mode instrumentation handlers to run when probepoints are hit. The module's init function establishes the probes, and its cleanup function removes them. (A minimal module along these lines is sketched after this list.)

• Interoperate with kprobes and utrace. An instrumentation module can establish probepoints in the kernel (via kprobes) as well as in user-mode programs. A uprobe or utrace handler can dynamically establish or remove kprobes, uprobes, or utrace-event callbacks.

• Minimize limitations on the types of applications that can be probed and the way in which they can be probed. In particular, a uprobes-based instrumentation module can probe any number of processes of any type (related or unrelated), and can probe multithreaded applications of all sorts. Probes can be established or removed at any stage in a process's lifetime. Multiple instrumentation modules can probe the same processes (and even the same probepoints) simultaneously.

• Probe processes, not executables. In earlier versions of uprobes, a probepoint referred to a particular instruction in a specified executable or shared library. Thus a probepoint affected all processes (current and future) running that executable. This made it relatively easy to probe a process right from exec time, but the implications of this implementation (e.g., inconsistency between the in-memory and on-disk images of a probed page) were unacceptable. Like ptrace, uprobes now probes per-process, and uses access_process_vm() to ensure that a probed process gets its own copy of the probed page when a probepoint is inserted.

• Handlers can sleep. As discussed later in Section 5, uprobes learns of each breakpoint hit via utrace's signal-callback mechanism. As a result, the user-specified handler runs in a context where it can safely sleep. Thus, the handler can access any part of the probed process's address space, resident or not, and can undertake other operations (such as registering and unregistering probes) that a kprobe handler cannot. Compared with earlier implementations of uprobes, this provides more flexibility at the cost of some additional per-hit overhead.

• Performance. Since a uprobe handler runs in the context of the probed process, ptrace's context switches between the probed process and the instrumentation parent on every event are eliminated. The result is significantly less overhead per probe hit than in ptrace, though significantly more than in kprobes.

• Safe from user-mode interference. Both uprobes and the instrumentation author must account for the possibility of unexpected tinkering with the probed process's memory (e.g., by the probed process itself or by a malicious ptrace-based program) and ensure that while the sabotaged process may crash, the kernel's operation is not affected.

• Minimize intrusion on existing Linux code. For hooks into a process's events of interest, uprobes uses those provided by utrace. Uprobes creates per-task and per-process data structures, but maintains them independently of the corresponding data structures in the core kernel and in utrace. As a result, at this writing, the base uprobes patch, including i386 support, includes only a few lines of patches against the mainline kernel source.

• Portable to multiple architectures. At the time of this writing, uprobes runs on the i386 architecture. Ports are underway to PowerPC, x86_64, and zLinux (s390x). Except for the code for single-stepping out of line, which you need (adapting from kprobes, typically) if you don't want to miss probepoints in multithreaded applications, a typical uprobes port is on the order of 50 lines.
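Following the kernel-module model above, a minimal instrumentation module might look like the sketch below. The uprobe field names (pid, vaddr, handler) follow the paper's description of the uprobe object; the header name, handler signature, and the probed pid/address are assumptions for illustration, not a definitive rendering of the patchset's API.

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/uprobes.h>   /* assumed header from the uprobes patchset */

    static struct uprobe up;

    /* Runs in the context of the probed process on each hit;
     * per the design goals, it may sleep. */
    static void my_handler(struct uprobe *u, struct pt_regs *regs)
    {
        printk(KERN_INFO "probepoint at %#lx hit\n", u->vaddr);
    }

    static int __init my_init(void)
    {
        up.pid = 4321;            /* hypothetical process to probe */
        up.vaddr = 0x08048510;    /* hypothetical probepoint address */
        up.handler = my_handler;
        return register_uprobe(&up);
    }

    static void __exit my_exit(void)
    {
        unregister_uprobe(&up);
    }

    module_init(my_init);
    module_exit(my_exit);
    MODULE_LICENSE("GPL");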

5 Uprobes Implementation Overview

Uprobes can be thought of as containing the following overlapping pieces:

• data structures;

• the register/unregister API;

• utrace callbacks; and

• architecture-specific code.

In this section, we'll describe each of these pieces briefly.

5.1 Data structures

Uprobes creates the following internal data structures:

• uprobe_process – one for each probed process;

• uprobe_task – one for each thread (task) in a probed process; and

• uprobe_kimg (Kernel IMaGe) – one for each probepoint in a probed process. (Multiple uprobes at the same address map to the same uprobe_kimg.)

Each uprobe_task and uprobe_kimg is owned by its uprobe_process. Data structures associated with return probes (uretprobes) and single-stepping out of line are described later.

5.2 The register/unregister API

The fundamental API functions are register_uprobe() and unregister_uprobe(), each of which takes a pointer to a uprobe object defined in the instrumentation module. A uprobe object specifies the pid and virtual address of the probepoint, and the handler function to be run when the probepoint is hit.

The register_uprobe() function finds the uprobe_process specified by the pid, or creates the uprobe_process and uprobe_task(s) if they don't already exist. register_uprobe() then creates a uprobe_kimg for the probepoint, queues it for insertion, requests (via utrace) that the probed process "quiesce," sleeps until the insertion takes place, and then returns. (If there's already a probepoint at the specified address, register_uprobe() just adds the uprobe to that uprobe_kimg and returns.)

Note that since all threads in a process share the same text, there's no way to register a uprobe for a particular thread in a multithreaded process.

Once all threads in the probed process have quiesced, the last thread to quiesce inserts a breakpoint instruction at the specified probepoint, rouses the quiesced threads, and wakes up register_uprobe().

unregister_uprobe() works in reverse: queue the uprobe_kimg for removal, quiesce the probed process, sleep until the probepoint has been removed, and delete the uprobe_kimg. If this was the last uprobe_kimg for the process, unregister_uprobe() tears down the entire uprobe_process, along with its uprobe_tasks.

5.3 Utrace callbacks

Aside from registration and unregistration, everything in uprobes happens as the result of a utrace callback. When a uprobe_task is created, uprobes calls utrace_attach() to create an engine for that thread, and listens for the following events in that task:

• breakpoint signal (SIGTRAP on most architectures): Utrace notifies uprobes when the probed task hits a breakpoint. Uprobes determines which probepoint (if any) was hit, runs the associated probe handler(s), and directs utrace to single-step the probed instruction.

• single-step signal (SIGTRAP on most architectures): Utrace notifies uprobes after the probed instruction has been single-stepped. Uprobes does any necessary post-single-step work (discussed in later sections), and directs utrace to continue execution of the probed task.

• fork/clone: Utrace notifies uprobes when a probed task forks a new process or clones a new thread. When a new thread is cloned for the probed process, uprobes adds a new uprobe_task to the uprobe_process, complete with an appropriately programmed utrace engine. In the case of fork(), uprobes doesn't attempt to preserve probepoints in the child process, since each uprobe object refers to only one process. Rather, uprobes iterates through all the probe addresses in the parent and removes the breakpoint instructions in the child.

• exit: Utrace notifies uprobes when a probed task exits. Uprobes deletes the corresponding uprobe_task. If this was the last uprobe_task for the process, uprobes tears down the entire uprobe_process. (Since the process is exiting, there's no need to remove breakpoints.)

• exec: Utrace notifies uprobes when a probed task execs a new program. (Utrace reports exit events for any other threads in the process.) Since probepoints in the old program are irrelevant in the new program, uprobes tears down the entire uprobe_process. (Again, there's no need to remove breakpoints.)

• quiesce: Uprobes listens for the above-listed events all the time. By contrast, uprobes listens for quiesce events only while it's waiting for the probed process to quiesce, in preparation for insertion or removal of a breakpoint instruction. (See Section 6.2.)

5.4 Architecture-specific code

Most components of the architecture-specific code for uprobes are simple macros and inline functions. Support for return probes (uretprobes) typically adds 10–40 lines. By far the majority of the architecture-specific code relates to "fix-ups" necessitated by the fact that we single-step a copy of the probed instruction, rather than single-stepping it in place. See "Tracing multithreaded applications: SSOL" in the next section.

6 Uprobes Implementation Notes

6.1 Tracing multithreaded applications: SSOL

Like some other tracing and debugging tools, uprobes implements a probepoint by replacing the first byte(s) of the instruction at the probepoint with a breakpoint instruction, after first saving a copy of the original instruction. After the breakpoint is hit and the handler has been run, uprobes needs to execute the original instruction in the context of the probed process. There are two commonly accepted ways to do this:

• Single-stepping inline (SSIL): Temporarily replace the breakpoint instruction with the original instruction; single-step the instruction; restore the breakpoint instruction; and allow the task to continue. This method is typically used by interactive debuggers such as gdb.

• Single-stepping out of line (SSOL): Place a copy of the original instruction somewhere in the probed process's address space; single-step the copy; "fix up" the task's state as necessary; and allow the task to continue. If the effect of the instruction depends on its address (e.g., a relative branch), the task's registers and/or stack must be "fixed up" after the instruction is executed (e.g., to make the program counter relative to the address of the original instruction, rather than the instruction copy). This method is used by kprobes for kernel tracing.

The SSIL approach doesn't work acceptably for multithreaded programs. In particular, while the breakpoint instruction is temporarily removed during single-stepping, another thread can sail past the probepoint. We considered the approach of quiescing, or otherwise blocking, all threads every time one hits a probepoint, so that we could be sure of no probe misses during SSIL, but we considered the performance implications unacceptable.

In terms of implementation, SSOL has two important implications:

but there are additional implications for uprobes. • If the targeted thread is already quiesced, setting For example, uprobes must be prepared to handle the UTRACE_ACTION_QUIESCE flag causes the any instruction in the architecture’s instruction set, quiesce callback to run immediately in the context not just those used in the kernel; and for some of the [un]register_uprobe() task. This architectures, uprobes must be able to handle both breaks the “sleep until the last quiescent thread 32-bit and 64-bit user applications. wakes me” paradigm.

• The instruction copy to be single-stepped must re- • If a thread hits a breakpoint on the road to quies- side somewhere in the probed process’s address cence, its quiesce callback is called before the sig- space. Since uprobes can’t know what other nal callback. This turns out to be a bad place to stop threads may be doing while a thread is single- for breakpoint insertion or removal. For example, stepping, the instruction copy can’t reside in any if we happen to remove the uprobe_kimg for the location legitimately used by the program. After probepoint we just hit, the subsequent signal call- considering various approaches, we decided to al- back won’t know what to do with the SIGTRAP. locate a new VM area in the probed process to hold • If [un]register_uprobe() is called from a the instruction copies. We call this the SSOL area. uprobe handler, it runs in the context of the probed task. Again, this breaks the “sleep until the last To minimize the impact on the system, uprobes allo- quiescent thread wakes me” paradigm. cates the SSOL area only for processes that are actually probed, and the area is small (typically one page) and It turned out to be expedient to establish an “alternate of fixed size. “Instruction slots” in this area are allo- quiesce point” in uprobes, in addition to the quiesce call- cated on a per-probepoint basis, so that multiple threads back. When it finishes handling a probepoint, the up- can single-step in the same slot simultaneously. Slots robes signal callback checks to see whether the process are allocated only to probepoints that are actually hit. is supposed to be quiescing. If so, it does essentially If uprobes runs out of free slots, slots are recycled on a what the quiesce callback does: if it’s the last thread to least-recently-used basis. “quiesce,” it processes the pending probe insertion or re- moval and rouses the other threads; otherwise, it sleeps in a pseudo-quiesced state until the “last” thread rouses 6.2 Quiescing it. Consequently, if [un]register_uprobe() sees that a thread is currently processing a probepoint, it Uprobes takes a fairly conservative approach when in- doesn’t try to quiesce it, knowing that it will soon hit serting and removing breakpoints: all threads in the the alternate quiesce point. probed process must be “quiescent” before the break- point is inserted/removed. 6.3 User-space return probes

Our approach to quiescing the threads started out fairly Uprobes supports a second type of probe: a return probe simple: For each task in the probed process, the fires when a specified function returns. User-space re- [un]register_uprobe() function sets the UTRACE_ turn probes (uretprobes) are modeled after return probes ACTION_QUIESCE flag in the uprobe_task’s en- in kprobes (kretprobes). gine, with the result that utrace soon brings the task to a stopped state and calls uprobes’s quiesce callback When you register a uretprobe, you specify the process, for that task. After setting all tasks on the road to qui- the function (i.e., the address of the first instruction), and escence, [un]register_uprobe() goes to sleep. a handler to be run when the function returns. For the last thread to quiesce, the quiesce callback in- serts or removes the requested breakpoint, wakes up Uprobes sets a probepoint at the entry to the function. [un]register_uprobe(), and rouses all the qui- When the function is called, uprobes saves a copy of esced threads. the return address (which may be on the stack on in a register, depending on the architecture) and replaces the It turned out to be more complicated than that. For ex- return address with the address of the “uretprobe tram- ample: poline,” which is simply a breakpoint instruction. 2007 Linux Symposium, Volume One • 223
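To make the registration interface concrete, here is a hypothetical sketch of an instrumentation module that plants a uretprobe. The header, structure fields, and function names are illustrative assumptions based on the description above, not the exact uprobes API of the period:

    /*
     * Hypothetical sketch of registering a user-space return probe.
     * Names and fields are illustrative; the real uprobes interface
     * was still evolving at the time.
     */
    #include <linux/module.h>
    #include <linux/sched.h>
    #include <linux/uprobes.h>      /* assumed header for the uprobes API */

    /* Runs in the probed task's context when the function returns. */
    static void report_return(struct uretprobe *rp, struct pt_regs *regs)
    {
            printk(KERN_INFO "probed function in pid %d returned\n",
                   current->pid);
    }

    static struct uretprobe my_uretprobe = {
            .u.pid   = 4321,        /* process to probe (example value) */
            .u.vaddr = 0x0804a1b0,  /* first instruction of the function */
            .handler = report_return,
    };

    static int __init rp_init(void)
    {
            /* Uprobes plants a breakpoint at function entry and hijacks
             * the return address, so report_return() fires on return. */
            return register_uretprobe(&my_uretprobe);
    }

    static void __exit rp_exit(void)
    {
            unregister_uretprobe(&my_uretprobe);
    }

    module_init(rp_init);
    module_exit(rp_exit);
    MODULE_LICENSE("GPL");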

When the function returns, control passes to the trampoline, the breakpoint is hit, and uprobes gains control. Uprobes runs the user-specified handler, then restores the original return address and allows the probed function to return.

In uprobes, the return-probes implementation differs from kprobes in several ways:

• The user doesn't need to specify how many "return-probe instance" objects to preallocate. Since uprobes runs in a context where it can use kmalloc() freely, no preallocation is necessary.

• Each probed process needs a trampoline in its address space. We use one of the slots in the SSOL area for this purpose.

• As in kprobes, it's permissible to unregister the return probe while the probed function is running. Even after all probes have been removed, uprobes keeps the uprobe_process and its uprobe_tasks (and utrace engines) around as long as necessary to catch and process the last hit on the uretprobe trampoline.

6.4 Registering/unregistering probes in probe handlers

A uprobe or uretprobe handler can call any of the functions in the uprobes API. A handler can even unregister its own probe. However, when invoked from a handler, the actual [un]register operations do not take place immediately. Rather, they are queued up and executed after all handlers for that probepoint have been run and the probed instruction has been single-stepped. (Specifically, queued [un]registrations are run right after the previously described "alternate quiesce point.") If the registration_callback field is set in the uprobe object to be acted on, uprobes calls that callback when the [un]register operation completes.

An instrumentation module that employs such dynamic [un]registrations needs to keep track of them: since a module's uprobe objects typically disappear along with the module, the module's cleanup function should not exit while any such operations are outstanding.

7 Applying Uprobes

We envision uprobes being used in the following situations, for debugging and/or for performance monitoring:

• Tracing timing-sensitive applications.

• Tracing multithreaded applications.

• Tracing very large and/or complex applications.

• Diagnosis of systemic performance problems involving multiple layers of software (in kernel and user space).

• Tracing applications in creative ways – e.g., collecting different types of information at different probepoints, or dynamically adjusting which code points are probed.

8 Future Work

What's next for utrace, uprobes, and kprobes? Here are some possibilities:

• Utrace += SSOL. As discussed in Section 6.1, single-stepping out of line is crucial for the support of probepoints in multithreaded processes. This technology may be migrated from uprobes to utrace, so that other utrace clients can exploit it. One such client might be an enhanced ptrace.

• SystemTap += utrace + uprobes. Some SystemTap users want support for probing in user space. Some potential utrace and uprobes users might be more enthusiastic given the safety and ease of use provided by SystemTap.

• Registering probes from kprobe handlers. Utrace and uprobe handlers can register and unregister utrace, uprobes, and kprobes handlers. It would be nice if kprobes handlers could do the same. Perhaps the effect of sleep-tolerant kprobe handlers could be approximated using a kernel thread that runs deferred handlers. This possibility is under investigation.

• Tracing Java. Uprobes takes us closer to dynamic tracing of the Java Virtual Machine and Java applications.

• Task-independent exec hook. Currently, uprobes can trace an application if it's already running, or if it is known which process will spawn it. Allowing tracing of applications that are yet to be started and are of unknown lineage will help to solve problems that creep in during application startup.

9 Acknowledgements

The authors wish to thank Roland McGrath for coming up with the excellent utrace infrastructure, and also for providing valuable suggestions and feedback on uprobes. Thanks are due to David Wilder and Srikar Dronamraju for helping out with testing the uprobes framework. Thanks also to Dave Hansen and Andrew Morton for their valuable suggestions on the VM-related portions of the uprobes infrastructure.

10 Legal Statements

Copyright © IBM Corporation, 2007.

This work represents the view of the authors and does not necessarily represent the view of IBM.

IBM, PowerPC, and OS/2 are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Linux is a registered trademark of Linus Torvalds.

Pentium is a registered trademark of Intel Corporation.

Other company, product, and service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

This document is provided “AS IS,” with no express or implied warranties. Use the information in this document at your own risk.

kvm: the Linux Virtual Machine Monitor

Avi Kivity (Qumranet) [email protected]
Yaniv Kamay (Qumranet) [email protected]
Dor Laor (Qumranet) [email protected]
Uri Lublin (Qumranet) [email protected]
Anthony Liguori (IBM) [email protected]

Abstract

Virtualization is a hot topic in operating systems these days. It is useful in many scenarios: server consolidation, virtual test environments, and for Linux enthusiasts who still can not decide which distribution is best. Recently, hardware vendors of commodity x86 processors have added virtualization extensions to the instruction set that can be utilized to write relatively simple virtual machine monitors.

The Kernel-based Virtual Machine, or kvm, is a new Linux subsystem which leverages these virtualization extensions to add a virtual machine monitor (or hypervisor) capability to Linux. Using kvm, one can create and run multiple virtual machines. These virtual machines appear as normal Linux processes and integrate seamlessly with the rest of the system.

1 Background

Virtualization has been around almost as long as computers. The idea of using a computer system to emulate another, similar, computer system was early recognized as useful for testing and resource utilization purposes. As with many computer technologies, IBM led the way with their VM system. In the last decade, VMware's software-only virtual machine monitor has been quite successful. More recently, the Xen [xen] open-source hypervisor brought virtualization to the open source world, first with a variant termed paravirtualization and, as hardware became available, full virtualization.

2 x86 Hardware Virtualization Extensions

x86 hardware is notoriously difficult to virtualize. Some instructions that expose privileged state do not trap when executed in user mode, e.g. popf. Some privileged state is difficult to hide, e.g. the current privilege level, or cpl.

Recognizing the importance of virtualization, hardware vendors [Intel][AMD] have added extensions to the x86 architecture that make virtualization much easier. While these extensions are incompatible with each other, they are essentially similar, consisting of:

• A new guest operating mode – the processor can switch into a guest mode, which has all the regular privilege levels of the normal operating modes, except that system software can selectively request that certain instructions, or certain register accesses, be trapped.

• Hardware state switch – when switching to guest mode and back, the hardware switches the control registers that affect processor operation modes, as well as the segment registers that are difficult to switch, and the instruction pointer so that a control transfer can take effect.

• Exit reason reporting – when a switch from guest mode back to host mode occurs, the hardware reports the reason for the switch so that software can take the appropriate action.

3 General kvm Architecture

Under kvm, virtual machines are created by opening a device node (/dev/kvm). A guest has its own memory, separate from the userspace process that created it. A virtual cpu is not scheduled on its own, however.


[Figure 1: kvm Memory Map]

3.1 /dev/kvm Device Node

kvm is structured as a fairly typical Linux character device. It exposes a /dev/kvm device node which can be used by userspace to create and run virtual machines through a set of ioctl()s.

The operations provided by /dev/kvm include:

• Creation of a new virtual machine.

• Allocation of memory to a virtual machine.

• Reading and writing virtual cpu registers.

• Injecting an interrupt into a virtual cpu.

• Running a virtual cpu.

Figure 1 shows how guest memory is arranged. Like user memory in Linux, the kernel allocates discontiguous pages to form the guest address space. In addition, userspace can mmap() guest memory to obtain direct access. This is useful for emulating dma-capable devices.

Running a virtual cpu deserves some further elaboration. In effect, a new execution mode, guest mode, is added to Linux, joining the existing kernel mode and user mode. Guest execution is performed in a triply-nested loop:

• At the outermost level, userspace calls the kernel to execute guest code until it encounters an I/O instruction, or until an external event such as arrival of a network packet or a timeout occurs. External events are represented by signals.

• At the kernel level, the kernel causes the hardware to enter guest mode. If the processor exits guest mode due to an event such as an external interrupt or a shadow page table fault, the kernel performs the necessary handling and resumes guest execution. If the exit reason is due to an I/O instruction or a signal queued to the process, then the kernel exits to userspace.

• At the hardware level, the processor executes guest code until it encounters an instruction that needs assistance, a fault, or an external interrupt.

[Figure 2: Guest Execution Loop]

Refer to Figure 2 for a flowchart-like representation of the guest execution loop.
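A minimal userspace sketch of this flow follows. It is not code from kvm itself: the exact ioctl set, and the memory and register setup a real client must perform, have varied across kvm versions, so only the outermost control flow is shown.

    /*
     * Minimal sketch of driving /dev/kvm from userspace.
     * Treat this as illustrative, not a complete client.
     */
    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    int main(void)
    {
            int kvm = open("/dev/kvm", O_RDWR);
            if (kvm < 0) { perror("open /dev/kvm"); return 1; }

            /* Create a new virtual machine; returns a VM file descriptor. */
            int vm = ioctl(kvm, KVM_CREATE_VM, 0);
            if (vm < 0) { perror("KVM_CREATE_VM"); return 1; }

            /* Create one virtual cpu; returns a vcpu file descriptor.
             * (A real client must also set up guest memory and registers.) */
            int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
            if (vcpu < 0) { perror("KVM_CREATE_VCPU"); return 1; }

            /* Outermost loop: run the guest until it needs userspace help. */
            for (;;) {
                    ioctl(vcpu, KVM_RUN, 0);
                    /* Inspect the exit reason here, emulate I/O or deliver
                     * pending interrupts, then loop again. */
                    break;  /* sketch only */
            }
            return 0;
    }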

3.2 Reconciling Instruction Set Architecture Differences

Unlike most of the x86 instruction set, where instruction set extensions introduced by one vendor are adopted by the others, hardware virtualization extensions are not standardized. Intel and AMD processors have different instructions, different semantics, and different capabilities.

kvm handles this difference in the traditional Linux way of introducing a function pointer vector, kvm_arch_ops, and calling one of the functions it defines whenever an architecture-dependent operation is to be performed. Base kvm functionality is placed in a module, kvm.ko, while the architecture-specific functionality is placed in the two arch-specific modules, kvm-intel.ko and kvm-amd.ko.

4 Virtualizing the MMU

As with all modern processors, x86 provides a virtual memory system which translates user-visible virtual addresses to physical addresses that are used to access the bus. This translation is performed by the memory management unit, or mmu. The mmu consists of:

• A radix tree, the page table, encoding the virtual-to-physical translation. This tree is provided by system software on physical memory, but is rooted in a hardware register (the cr3 register)

• A mechanism to notify system software of missing translations (page faults)

• An on-chip cache (the translation lookaside buffer, or tlb) that accelerates lookups of the page table

• Instructions for switching the translation root in order to provide independent address spaces

• Instructions for managing the tlb

The hardware support for mmu virtualization provides hooks to all of these components, but does not fully virtualize them. The principal problem is that the mmu provides for one level of translation (guest virtual → guest physical) but does not account for the second level required by virtualization (guest physical → host physical).

The classical solution is to use the hardware virtualization capabilities to present the real mmu with a separate page table that encodes the combined translation (guest virtual → host physical) while emulating the hardware's interaction with the original page table provided by the guest. The shadow page table is built incrementally; it starts out empty, and as translation failures are reported to the host, missing entries are added.

A major problem with shadow page tables is keeping the guest page table and the shadow page table synchronized. Whenever the guest writes to a page table, the corresponding change must also be performed on the shadow page table. This is difficult as the guest page table resides in ordinary memory and thus is not normally trapped on access.

4.1 Virtual TLB Implementation

The initial version of the shadow page table algorithm in kvm used a straightforward approach that reduces the amount of bugs in the code while sacrificing performance. It relies on the fact that the guest must use the tlb management instructions to synchronize the tlb with its page tables. We trap these instructions and apply their effect to the shadow page table in addition to the normal effect on the tlb.

Unfortunately, the most common tlb management instruction is the context switch, which effectively invalidates the entire tlb. (Actually, kernel mappings can be spared from this flush; but the performance impact is nevertheless great.) This means that workloads with multiple processes suffer greatly, as rebuilding the shadow page table is much more expensive than refilling the tlb.

4.2 Caching Virtual MMU

In order to improve guest performance, the virtual mmu implementation was enhanced to allow page tables to be cached across context switches. This greatly increases performance at the expense of much increased code complexity.

As related earlier, the problem is that guest writes to the guest page tables are not ordinarily trapped by the virtualization hardware. In order to receive notifications of such guest writes, we write-protect guest memory pages that are shadowed by kvm. Unfortunately, this causes a chain reaction of additional requirements:

• To write protect a guest page, we need to know which translations the guest can use to write to the page. This means we need to keep a reverse mapping of all writable translations that point to each guest page.

• When a write to a guest page table is trapped, we need to emulate the access using an x86 instruction interpreter so that we know precisely the effect on both guest memory and the shadow page table.

• The guest may recycle a page table page into a normal page without a way for kvm to know. This can cause a significant slowdown as writes to that page will be emulated instead of proceeding at native speeds. kvm has heuristics that determine when such an event has occurred and decache the corresponding shadow page table, eliminating the need to write-protect the page.

At the expense of considerable complexity, these requirements have been implemented and kvm context switch performance is now reasonable.

5 I/O Virtualization

Software uses programmed I/O (pio) and memory-mapped I/O (mmio) to communicate with hardware devices. In addition, hardware can issue interrupts to request service by system software. A virtual machine monitor must be able to trap and emulate pio and mmio requests, and to simulate interrupts from virtual hardware.

5.1 Virtualizing Guest-Initiated I/O Instructions

Trapping pio is quite straightforward as the hardware provides traps for pio instructions and partially decodes the operands. Trapping mmio, on the other hand, is quite complex, as the same instructions are used for regular memory accesses and mmio:

• The kvm mmu does not create a shadow page table translation when an mmio page is accessed

• Instead, the x86 emulator executes the faulting instruction, yielding the direction, size, address, and value of the transfer.

In kvm, I/O virtualization is performed by userspace. All pio and mmio accesses are forwarded to userspace, which feeds them into a device model in order to simulate their behavior, and possibly trigger real I/O such as transmitting a packet on an Ethernet interface. kvm also provides a mechanism for userspace to inject interrupts into the guest.

5.2 Host-Initiated Virtual Interrupts

kvm also provides interrupt injection facilities to userspace. Means exist to determine when the guest is ready to accept an interrupt, for example, the interrupt flag must be set, and to actually inject the interrupt when the guest is ready. This allows kvm to emulate the APIC/PIC/IOAPIC complex found on x86-based systems.

5.3 Virtualizing Framebuffers

An important category of memory-mapped I/O devices are framebuffers, or graphics adapters. These have characteristics that are quite distinct from other typical mmio devices:

• Bandwidth – framebuffers typically see very high bandwidth transfers. This is in contrast to typical devices which use mmio for control, but transfer the bulk of the data with direct memory access (dma).

• Memory equivalence – framebuffers are mostly just memory: reading from a framebuffer returns the data last written, and writing data does not cause an action to take place.

In order to efficiently support framebuffers, kvm allows mapping non-mmio memory at arbitrary addresses such as the pci region. Support is included for the VGA windows which allow physically aliasing memory regions, and for reporting changes in the content of the framebuffer so that the display window can be updated incrementally.

6 Linux Integration

Being tightly integrated into Linux confers some important benefits to kvm:

• On the developer level, there are many opportunities for reusing existing functionality within the kernel, for example, the scheduler, NUMA support, and high-resolution timers.

• On the user level, one can reuse the existing Linux process management infrastructure, e.g., top(1) to look at cpu usage and taskset(1) to pin virtual machines to specific cpus. Users can use kill(1) to pause or terminate their virtual machines.

7 Live Migration

One of the most compelling reasons to use virtualization is live migration, or the ability to transport a virtual machine from one host to another without interrupting guest execution for more than a few tens of milliseconds. This facility allows virtual machines to be relocated to different hosts to suit varying load and performance requirements.

Live migration works by copying guest memory to the target host in parallel with normal guest execution. If a guest page has been modified after it has been copied, it must be copied again. To that end, kvm provides a dirty page log facility, which provides userspace with a bitmap of modified pages since the last call. Internally, kvm maps guest pages as read-only, and only maps them for write after the first write access, which provides a hook point to update the bitmap.

Live migration is an iterative process: as each pass copies memory to the remote host, the guest generates more memory to copy. In order to ensure that the process converges, we set the following termination criteria:

• Two, not necessarily consecutive, passes were made which had an increase in the amount of memory copied compared to the previous pass, or,

• Thirty iterations have elapsed.

8 Future Directions

While already quite usable for many workloads, many things remain to be done for kvm. Here we describe the major features missing; some of them are already work-in-progress.

8.1 Guest SMP Support

Demanding workloads require multiple processing cores, and virtualization workloads are no exception. While kvm readily supports SMP hosts, it does not yet support SMP guests.

In the same way that a virtual machine maps to a host process under kvm, a virtual cpu in an SMP guest maps to a host thread. This keeps the simplicity of the kvm model and requires remarkably few changes to implement.

8.2 Paravirtualization

I/O is notoriously slow in virtualization solutions. This is because emulating an I/O access requires exiting guest mode, which is a fairly expensive operation compared to real hardware.

A common solution is to introduce paravirtualized devices, or virtual "hardware" that is explicitly designed for virtualized environments. Since it is designed with the performance characteristics of virtualization in mind, it can minimize the slow operations to improve performance.

8.3 Memory Management Integration

Linux provides a vast array of memory management features: demand paging, large pages (hugetlbfs), and memory-mapped files. We plan to allow a kvm guest address space to directly use these features; this can enable paging of idle guest memory to disk, or loading a guest memory image from disk by demand paging.

8.4 Scheduler Integration

Currently, the Linux scheduler has no knowledge that it is scheduling a virtual cpu instead of a regular thread. We plan to add this knowledge to the scheduler so that it can take into account the higher costs of moving a virtual cpu from one core to another, as compared to a regular task.
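Returning to the dirty page log described in Section 7, userspace retrieves the bitmap roughly as sketched below. The structure and ioctl follow the KVM_GET_DIRTY_LOG interface as later stabilized; the exact layout at the time of this paper may have differed.

    /* Sketch of polling kvm's dirty page log during live migration. */
    #include <linux/kvm.h>
    #include <string.h>
    #include <sys/ioctl.h>

    int fetch_dirty_pages(int vm_fd, int slot, void *bitmap)
    {
            struct kvm_dirty_log log;

            memset(&log, 0, sizeof(log));
            log.slot = slot;            /* which memory slot to query */
            log.dirty_bitmap = bitmap;  /* one bit per guest page */

            /* Returns the pages dirtied since the previous call and
             * resets the log, so each pass copies only new writes. */
            if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
                    return -1;

            /* The caller walks the bitmap and (re)sends dirty pages. */
            return 0;
    }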

8.5 New Hardware Virtualization Features

Virtualization hardware is constantly being enhanced with new capabilities, for example, full mmu virtualization, a.k.a. nested page tables or extended page tables, or allowing a guest to securely access a physical device [VT-d]. We plan to integrate these features into kvm in order to gain the performance and functionality benefits.

8.6 Additional Architectures

kvm is currently only implemented for the i386 and x86-64 architectures. However, other architectures such as powerpc and ia64 support virtualization, and kvm could be enhanced to support these architectures as well.

9 Conclusions

kvm brings an easy-to-use, fully featured integrated virtualization solution for Linux. Its simplicity makes extending it fairly easy, while its integration into Linux allows it to leverage the large Linux feature set and the tremendous pace at which Linux is evolving.

10 References

[] Bellard, F. (2005). Qemu, a Fast and Portable Dynamic Translator. In Usenix annual technical conference.

[xen] Barham P., et al. Xen and the art of virtualization. In Proc. SOSP 2003. Bolton Landing, New York, U.S.A. Oct 19-22, 2003.

[Intel] Intel Corp. IA-32 Intel® Architecture Software Developer's Manual, Volume 3B: System Programming Guide, Part 2. Order number 25366919.

[AMD] AMD Inc. AMD64 Architecture Programmer’s Manual Volume 2: System Programming.

[VT-d] Abramson, D.; Jackson, J.; Muthrasanallur, S.; Neiger, G.; Regnier, G.; Sankaran, R.; Schoinas, I.; Uhlig, R.; Vembu, B.; Wiegert, J. Intel® Virtualization Technology for Directed I/O. Intel Technology Journal, http://www.intel.com/technology/itj/2006/v10i3/ (August 2006).

Linux Telephony: A Short Overview

Paul P. Komkoff (Google Ireland Ltd.) [email protected]
Anna Anikina (Saratov State University) [email protected]
Roman Zhnichkov (Saratov State University) [email protected]

Abstract

This paper covers the topic of implementing voice services in packet-switched Voice Over IP (VoIP) and circuit-switched (traditional telephony) networks using Linux and commodity hardware. It contains a short introduction into general telephony, VoIP, and Public Switched Telephony Networks (PSTN). It also provides an overview of VoIP services, including the open-source software packages used to implement them, and the kernel interfaces they include. It explains kernel support for connecting Public Switched Telephony Networks using digital interfaces (E1/T1) via the Zaptel framework, and user-space integration issues. The conclusion examines current trends in Linux-based and open-source telephony.

A basic understanding of networking concepts is helpful for understanding this presentation.

1 Introduction to general telephony, VoIP and PSTN

Although VoIP products like Skype™ and Google Talk™ are storming the telecommunications landscape, almost everyone still owns a telephone and uses it daily. Understanding how telephony works is still an obscure topic for most people. As a result, we would like to begin our paper with a short introduction to telephony in general, combined with the two major types presently in use—packet-switched and circuit-switched.

Every telephony system needs to transmit data between parties, but before it can do so, it needs to find the other party and a route to it. This is called call setup. Activity of this nature is usually performed by a signalling protocol. Data can be passed via the same route and channel as signalling, or via the same route and a different channel, or via an entirely different route. There are obvious quality of service (QOS) requirements for signalling and data—signalling requires guaranteed delivery of every message, and data requires low-latency transmission, but can lose individual samples/frames.

1.1 VoIP

Voice over IP (VoIP) is widely used to describe various services, setups, and protocols that pass audio data in pseudo real-time over IP networks. Although the actual implementations are very different, the basic requirements are to pass voice data in two directions and to allow two-party conversation. VoIP is packet-switched telephony because the underlying network is packet-switched. To make conversations possible a virtual circuit is built between parties.

There are many protocols used for VoIP conversations. The most widespread is Session Initiation Protocol (SIP). This is a signalling protocol in that it only handles call setup and termination. Actual session setup between endpoints is handled by SDP (Session Description Protocol), and data is transmitted by RTP (Real-Time Protocol). SIP is described in RFC 3261 and endorsed by the IETF as an Internet standard.

Another protocol with a large user-base is H.323. SIP, designed mostly by network people, is similar to HTTP (its messages are text Name:Value pairs); H.323, being endorsed by ITU—telephony people—looks like a traditional telephony protocol. Its messages and information elements are described in ASN.1 (Abstract Syntax Notation) and coded by ASN.1 BER (Basic Encoding Rules). H.323 is also a signalling protocol, and also uses RTP for data transmission.

Inter-Asterisk Exchange (IAX) is another major VoIP protocol. It was invented by the Asterisk® authors. In this protocol, data and signalling streams are not separated, which allows easier NAT traversal. IAX is also able to pack multiple conversations into a single stream, thus reducing trunking overhead. There are two versions of IAX; IAX1 is deprecated, and IAX2 is used under the name of IAX.

Other lesser known VoIP protocols are either proprietary (SCCP is used in Cisco phones) or designed for a specific purpose (MGCP—Media Gateway Control Protocol—is used to manipulate VoIP gateways). Some protocols allow direct conversation between endpoints while others require a server to operate.

VoIP protocols define the way voice data should be transmitted. Conversion of digitized audio into payload (portions of data to be delivered by the VoIP protocol) is performed by voice codecs. The most popular voice codecs are: G.711, G.723, G.726–G.729, iLBC, GSM06.10, and Speex. Each has different bandwidth requirements, CPU requirements, quality, and patents associated with them. These codecs are not all VoIP-specific—G.711 is used in traditional telephony, and GSM06.10 is used in GSM mobile networks.

1.2 PSTN

Public Switched Telephony Network (PSTN) is the traditional telephony network. Capable nodes in this network are addressed using global and unique E.164 addresses—telephone numbers. Nowadays PSTN includes not only traditional telephones, but mobile phones as well. Most of the PSTN is digital (except customer analog lines and very old installments in developing countries). PSTN is circuit-switched, which means that the call setup procedure assembles a circuit between the parties on the call for the duration of the entire call. The circuit is either fixed-bandwidth digital (typically 64kbps for traditional telephone networks and 9600bps for mobile networks) or analog—spanning multiple units of communication equipment. In digital networks the circuit is made over E1 (for Europe) or T1 (America) lines, which contain 31 or 24 DS0 (64kbps) circuits, TDM-multiplexed together and thus often called timeslots.

The call management, call routing, circuit assignment and maintenance procedures are performed by the signaling protocol. The de facto standard for interconnecting networks is Signaling System 7 (SS7). Connections to customer PBX (Private Branch Exchange) are often performed using ISDN PRI (Primary Rate Interface) connections and ISDN Q.931 signaling. SS7 is not a single protocol, but a protocol stack. It contains parts which facilitate running SS7 itself and allows user parts to run on top of it. The user part that is responsible for voice call setup between parties is called ISUP (ISDN User Part). Services for mobile subscribers, ranging from registration to SMS, are handled by MAP over TCAP (Mobile Application Part over Transaction Capabilities Application Part) of SS7.

2 Implementing VoIP services on Linux

To implement any voice service using VoIP, we do not need any special hardware. Both clients and servers are done in software. We only need to implement a particular VoIP protocol (SIP, H.323, IAX, MGCP, SCCP) and a particular set of voice codecs.

After we have everything ready on the protocol and codec sides, we can implement the voice service. For example, we can build a server that will handle client registrations and calls to each other. This piece of software is typically called a softswitch because it functions much like a hardware switch—building virtual circuits between parties. Softswitches typically have the ability to provide optional services commonly found in traditional proprietary PBXes like conferencing, voicemail, and IVR (Interactive Voice Response) for an additional charge. Modern opensource VoIP software logic is driven by powerful scripting languages—domain-specific (for building dialplans) or general purpose. This allows us to integrate with almost anything. For example, we can try opensource speech synthesis/recognition software.

Many softswitches utilize special properties of particular VoIP protocols. For example, SIP and the H.323 architecture provide the ability to pass data directly between endpoints to reduce contention and minimize latency. Thousands of endpoints can be registered to one SIP server to control only signalling, allowing billing and additional services. This is much better than sitting in the middle of RTP streams between those clients. Moreover, sometimes it is possible to pass data directly between two endpoints while one of them is using SIP and another—H.323. This setup is called a signalling proxy.

Some softswitches are suitable only for VoIP clients (including VoIP-PSTN gateways acting as VoIP endpoints) while more general solutions are able to act as a switch between different VoIP protocols and PSTN itself. The softswitches of the first kind are, for example, SIP Express Router and sipX, while the major players of the second kind are: Asterisk, CallWeaver (described later on), YATE, and FreeSwitch.

To test the most widespread softswitch—Asterisk—using VoIP only, download its source code from http://asterisk.org, compile it, and install. Without any additional software you have out-of-the-box support for SIP and IAX, a working IVR demo (in the extensions.conf file), and many functions which you can attach to numbers in extensions.conf—asterisk applications. However, conferencing won't work and music-on-hold can stutter.

Call-center solutions based on Asterisk usually utilize the Queue() application. Using different AGI (Asterisk Gateway Interface) scripts and builtin applications like SayNumber(), you can build an automatic answering machine which reports the current time or account balance. Asterisk can make outgoing calls as well if a specifically formatted text-file is added to the special directory for each call.

Another software package to try is Yate. Its architecture is different; however, you still can easily test basic functions of an IP PBX. Yate can be configured to be a signalling proxy between H.323 and SIP—a desired usage when building VoIP exchanges.

What does Linux support for VoIP mean here? It means fast, capable UDP (and ioctls which permit setting a particular DSCP on outgoing packets), a CPU scheduler which will not starve us receiving (if we are using a blocking/threaded model), sending, and processing, and a preemptive kernel to reduce receive latency. However, there are still problems when a large number of clients are passing data through a single server.

Recent improvements in the kernel, namely in scheduling, preemption, and high-precision timers, have greatly improved its ability to run userspace telephony applications.

2.1 VoIP clients

There are two primary types of VoIP clients or endpoints—those running on dedicated hardware (handsets plugged into Ethernet, analog telephone adapters, dedicated VoIP gateways), and softphones—installable applications for your favorite operating system.

Popular open source softphones include: Ekiga (previously Gnome Meeting), Kphone, and Kiax, which support major protocols (H.323, SIP, and IAX). The supported voice codecs list is not as long as it might be due to patent issues. Even with access to an entirely free alternative like Speex, the user is forced to use patented codecs to connect to proprietary VoIP gateways and consumer devices.

Skype™, a very popular proprietary softphone, implements its own proprietary VoIP protocol.

3 Connecting to PSTN

In order to allow PSTN users to use the services described above, or at a minimum send and receive calls from other VoIP users, they need to connect to the PSTN. There are several ways to do that:

• analog connection, either FXO (office) or FXS (station) side

• ISDN BRI (Basic Rate Interface) connection

• ISDN PRI or SS7 on E1/T1 line

We will concentrate on the most capable way to connect: an E1/T1 digital interface (supporting ISDN PRI or SS7 directly) to a VoIP server. Carrier equipment is interconnected in this way. E1/T1-capable hardware and kernel support are required for this.

The "original" digital telephony interface cards compatible with Asterisk are manufactured by Digium®. Each contains up to 4 E1/T1/J1 ports. Other manufacturers have also unveiled similar cards, namely Sangoma and Cronyx. Clones of the Digium cards are also available in the wild (OpenVox) which behave in exactly the same way as the Digium ones.

One way to present telephony interfaces to an application is by using the Zaptel framework. The official zaptel package, released by Digium together with Asterisk, contains the zaptel framework and drivers for Digium and Digium-endorsed hardware. Although drivers for the other mentioned hardware have different architectures, they implement zaptel hooks and are thus compatible (to a certain extent) with the largest user base of such equipment. However, other software can use other ways of interacting with hardware. For example, Yate (partially sponsored by Sangoma) can use native ways of communicating with Sangoma cards. Extremely high performance has been shown in those setups.

After you have ISDN PRI provisioned to you and a Digium card at hand, obtain the latest Zaptel drivers (also at http://asterisk.org) and compile them. If everything goes well and you successfully insmod (load) the right modules, you will notice a couple of new device nodes under /dev/zap. Before starting any telephony application, you need to configure zaptel ports using the ztcfg tool. After configuration you will have additional nodes /dev/zap/X, one for each channel you configured. In ISDN PRI, the 16th timeslot of E1 is the dedicated signalling channel (D-chan). As a result it runs Q.931 over Q.921 over HDLC. All other timeslots are clear-channels (B-chan) and are used to transmit data. At a minimum, the ISDN PRI application needs to talk Q.931 over the D-channel, negotiate a B-channel number for conversations, and read/write digitized audio data from/to the specific B-channel.

Achieving SS7 connectivity is slightly more difficult. Until 2006, there was no working opensource SS7 implementation. Even today, you still need to find a carrier who will allow an uncertified SS7 device on their network. On the other hand, when you are the carrier, having opensource SS7 is extremely useful for a number of reasons. One might be your traditional PSTN switch—which has only SS7 ports free when buying ISDN PRI ports isn't an option.

Today there is at least one usable SS7 implementation for asterisk—Sifira's chan_ss7, available at http://www.sifira.dk/chan-ss7/. An opensource SS7 stack for Yate (yss7) is in progress.

What kind of services can we implement here? VoIP-PSTN gateway? Indeed, if we are able to capture the voice onto the system, we can transmit and receive it over the network. Because we use an open-source softswitch for this purpose, we get a full-fledged IP-PBX with PSTN interface, capable of registering softphones and hardphones and intelligently routing calls between VoIP and PSTN. This also includes call-center, IVR, and Voicemail out-of-the-box, and is flexible enough to add custom logic. If our network requires multiple such gateways, we can replicate some of the extra functionality between them and set up call routing in a way that eliminates unneeded voice transfers over the network, thus reducing latency.

The described setup can also be used as a dedicated PSTN system. With this method, you can still use the networking features if your setup consists of more than one node—for configuration data failover or bridging of calls terminated on different nodes.

Advanced usage scenarios for hardware with single or multiple E1 interfaces are usually targeted for signalling. If we take a card with 2 E1 interfaces, cross-connect together all the timeslots except 16 from port 1 to port 2, and then run an application which speaks Q.931 on timeslot 16 of port 1 and transparently translates it to SS7 ISUP on timeslot 16 of port 2, we will have a signalling converter. This is used to connect ISDN-only equipment to a SS7-only switch. If we implement SS7 TCAP/MAP, we can create a SMS center out of the same hardware or build an IN SCP (Intelligent Network Service Control Point).

Although the E1/T1 connection option is used in the majority of large-scale voice services, you may still need an analog or ISDN BRI line to connect to your server. Digium and other vendors offer analog and ISDN BRI cards which also fit into the Zaptel framework.

3.1 Echo

When interconnecting VoIP and PSTN it is not uncommon to have problems with echo. Hybrid transition refers to the situation where the incoming and outgoing signals are passed via a single 2-wire line and separated afterwards, thereby reflecting some of the outgoing signal back. It is also possible for analog phones or headsets to "leak" audio from headphones or speakers to the microphone. Circuit-switched phone networks are very fast and as a result echo is not noticeable. This is because there are two digital-analog conversions on each side and digitized audio is passed in single-byte granularity resulting in low latency. VoIP installations which include voice codecs (adding more overhead) and passing multiple samples in one packet may introduce enough latency to result in noticeable echo.

To eliminate echo, echo cancellation can be added, which subtracts the remnants of the outgoing signal from the incoming channel, thus separating them. It is worth mentioning, however, that if in an A-B conversation party A hears an echo, there is nothing you can do on the A side—the problem (unbalanced hybrid, audio leak, broken echo canceller) is on the B side.
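Since the /dev/zap/X channel nodes described above are ordinary character devices, a minimal application can consume the digitized audio with nothing more than open() and read(). The sketch below is illustrative only; the channel number and any required ioctl() setup depend on the local ztcfg configuration.

    /* Minimal sketch: reading digitized audio from one configured
     * zaptel clear channel (B-channel). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            unsigned char buf[8];   /* 8 bytes = 1 ms of 8 kHz G.711 audio */
            int fd = open("/dev/zap/1", O_RDONLY);
            if (fd < 0) { perror("open /dev/zap/1"); return 1; }

            /* Each read returns the next chunk of samples from the span. */
            while (read(fd, buf, sizeof(buf)) == sizeof(buf)) {
                    /* feed the samples to the application's audio path here */
            }
            close(fd);
            return 0;
    }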

3.2 Fax over IP transmission

Faxing over VoIP networks does not work for a number of reasons. First, voice codecs used to reproduce voice data are lossy, which confuses faxmodems. Using G.711 can be lossless if you are connected using a digital interface. This is because when used in PSTN DS0 circuits, the unmodified data is passed as the payload to the VoIP protocol. If you do this, however, then echo cancellation will still mangle the signal to the point that modems cannot deal with it. As a result, you also need to turn off echo cancellation. Unfortunately, this means the internal modem echo-canceller will not be able to deal with VoIP echo and jitter, to the point where it will not work.

There are a number of solutions for this problem. The first is called T.37—a store and forward protocol. With store and forward, the VoIP gateway on the sending end captures the fax and transmits it to the gateway on the receiving side using SMTP. The second method is T.38, which tries to emulate realtime fax behavior. This is usually more convenient when you send faxes in the middle of the conversation.

4 Zaptel architecture

Most digital interface cards come with one or a combination of interfaces, which together form a span. Data in a G.703 E1 stream travels continuously, but cards are usually programmed to signal an interrupt after a predetermined configured amount of data is received in a buffer. In addition, the interrupt is generated when data transmission is finished.

The Zaptel hardware driver provides the following functionality:

• empty data from the device receive buffer, rearrange it channelwise (if needed) and fill the zaptel receive buffer

• call echo cancellation hooks and call zt_receive(&p->span)—on the receive path

• call zt_transmit(&p->span), call echo cancellation hooks, empty data of the zaptel transmit buffer, rearrange it channelwise and put it into the device transmit buffer, and queue transmission—on the transmit path.

This basic API makes writing drivers very easy. Advanced features can be implemented as well. For example, some E1 cards have the ability to cross-connect timeslots without passing the data to the software—useful when both parties are sharing channels of the same span, or for special telephony applications. This feature can be implemented (with some effort) and is supported by current versions of Asterisk.

Clear channels, used for voice data, are passed to userspace unmodified. Signalling channels, however, need modification performed by the Zaptel level. ISDN signalling (Q.931 on top of Q.921) requires HDLC framing in the channel, which must be implemented in the kernel. The ztcfg tool is used to configure the channel as D-chan.

While HDLC framing is done at the kernel-level, Q.931 signalling itself must be done in userspace. Digium offers a library (libpri) for this. It was originally used in the zaptel channel driver in asterisk, and is now used in most ISDN-capable software. SS7 signalling is slightly more difficult as it requires continuous Fill-In Signal Unit (FISU) generation, which must be placed in the kernel (at the zaptel level) for reliability.

4.1 Code quality issues

Although the zaptel-related hardware driver part seems straightforward, zaptel itself isn't that good. Its 200-kilobyte, 7,000-line single source file includes everything plus the kitchen sink, which Asterisk depends on heavily. Due to the continuous flow of data, zaptel devices are often used as a stable source of timing, particularly in the IAX trunking implementation and for playing Music-On-Hold to VoIP clients. To use this feature without Zaptel hardware you need the special ztdummy driver, which uses the RTC and emulates the zaptel timing interface. Also, for reasons we cannot explain, the zaptel kernel module contains a user-space API for conferencing. This module allows the attachment of multiple readers/writers to a particular device node and does all mixing in kernel space. Thus, to enable asterisk conferencing, you also need zaptel hardware or ztdummy. Echo cancellation is selectable and configurable only at compile-time. This is inconvenient when troubleshooting echo problems.

Consistent with every external kernel module that is supposed to work with 2.4 and 2.6 kernels, zaptel contains lots of #ifdefs and wrapper macros. It is unclear if Digium will ever try to push Zaptel to mainline—in its current state we think that is impossible.

4.2 Cost, scalability and reliability

Most telco equipment is overpriced. Although we have found PBXes with an E1 port and 30 customer ports for a reasonable price, the base feature set is often very limited. Each additional feature costs additional money and you still will not receive the level of flexibility provided by open-source software packages. Options for interconnecting PSTN and IP are even more expensive.

Telco equipment is overpriced for a number of reasons—mostly due to reliability and scalability. By building a telco system out of commodity hardware, the only expensive part is the E1 digital interface.
Even the cost of hardware EC/DTMF detectors. with this part we are able to keep the cost of single unit low enough to allow 1 + 1 (or even 1 + 1 + 1spare) However, using more servers with less E1 ports may be configuration, and the price of hardware will still be wise from a reliability point of view. Modern CPUs have much lower. This approach allows us to reach an even enough processing power to drive 4 E1 interfaces even higher level of reliability than simply having one tele- with a totally unoptimized zaptel stack and userspace. phony switch. This is because we can take units down Thus, for large setups we can have any number of 4-port for maintenance one-by-one. servers connected to a high-speed network. If we are interconnecting with VoIP clients here, we can split the Combining different existing solutions also reduces load across the 4-port nodes, and the maximum number some limitations. For example, if the number of VoIP of VoIP clients will be no more than 120. clients in our VoIP-PBX with PSTN connection is so high that asterisk cannot handle the load, we can put a lightweight SIP proxy (OpenSER) in front of it, and all 5 Current trends internal VoIP calls will close there. Until recently, Asterisk dominated the opensource tele- 4.3 Performance issues phony landscape. Zaptel and Asterisk were directed by Digium which sells its own telephony hardware. Re- There are some inefficiencies in PSTN processing from cently, however, other players stepped up both on the a performance point of view, which are dictated by the hardware and software fronts. Zaptel architecture. Some cards generate interrupts for each port. For example, with a sample length of 1ms Sangoma Technologies, a long time procucer of E1/T1 (ZT_CHUNKSIZE == 8) there will be 1,000 interrupts cards, modified its WANPIPE drivers to support Zap- per second per port. If we add a large number of ports tel. Cronyx Engineering’s new drivers package also in- in a single machine, this number will be multiplied ac- cludes the zaptel protocol module. cordingly. There are ways to reduce interrupt load. For example, the card can generate a single interrupt for all There are three issues in the Asterisk universe which its ports. Another way is to use larger samples, but this resulted in the forking of OpenPBX, later renamed to introduces significant latency and is thus discouraged. CallWeaver. Those issues are:

Another zaptel problem is that it creates individual de- vice nodes for every channel it handles. Although with 1. Requirement to disclaim all copyrights to Digium recent kernels, we can easily handle lots of minors, read- on code submission, due to Asterisk dual-licensing ing from individual channels just does not scale. This and Digium commercial offerings. 2007 Linux Symposium, Volume One • 237

2. Because of dual licensing, Asterisk is not depen- 1. Pure VoIP IVR and information service for call- dent on modern software libraries. Instead, it con- center employees using Asterisk. tains embedded (dated) Berkeley DB version 1 for internal storage. 2. Software load testing of proprietary VoIP equip- ment using Asterisk. 3. Strict Digium control on what changes go into the software. 3. VoIP exchange using Yate. 4. Softswitch with call-center and two E1 PSTN in- CallWeaver was forked from Asterisk 1.2 and its devel- terfaces using Asterisk and Digium E1 equipment. opment is progressing very rapidly. Less than a year ago they switched from a fragile custom build system 5. ISDN signalling proxy in Python, using Cronyx E1 to automake, from zaptel timing to POSIX timers, from equipment. zaptel conferencing to a userspace mixing engine, from 6. Hybrid PSTN/VoIP telephony network for Sara- internal DSP functions to Steve Underwood’s SpanDSP tov State University—multiple gateways using As- library, and from awkward db1 to flexible SQLite. Call- terisk and (lately) OpenPBX plus OpenSER on Weaver has working T.38 support, and is still compatible Cronyx E1 equipment. with zaptel hardware. CallWeaver developers are also trying to fix architectural flaws in Asterisk by allow- ing proper modularization and changing internal storage All implementations were based on i386 and x86_64 from linked lists to hash tables. hardware platforms and Linux as the operating system kernel. Since these setups were put into operation, we Although CallWeaver contains many improvements have had no problems with the reliability, stability, or over Asterisk, it still shares its PBX core, which was de- performance of the software we chose. This was a re- signed around some PSTN assumptions. For example, sult of careful capacity planning, clustering, and 1 + 1 it is assumed that audio data is sampled at 8khz. This is reservations of critical components. good for pure PSTN applications (or PSTN/VoIP gate- waying), but in VoIP environments we might want to In this paper, we have provided reasons for why build- support other sampling rates and data flows. ing a softswitch or PSTN-connected system from com- modity hardware and open-source software may be de- FreeSWITCH is designed from the ground up to be sirable, and why Linux is a good platform for imple- more flexible in its core, and uses as many existing li- menting voice services. However, there are some defi- braries and tools as it can. Its development started in ciencies in the current implementations, both in the ker- January 2006, and although there aren’t any official re- nel and in some of the opensource packages, that can leases at the time of writing this paper, the feature set is potentially result in scalability issues. There are ways to already complete—for a softswitch. Unfortunately there avoid these issues or solve them completely. Our sug- is only basic support for PSTN, a native module for San- gestions include improving the Zaptel framework or in- goma. troducing a new, more efficient framework. Another software package is Yate, started three years ago. It is written in C++, its source code is an order 7 References of magnitude smaller than Asterisk, and it has a cleaner architecture which grants much more flexibility. Yate VOIP Wiki, http://voip-info.org can use the native WANPIPE interface to drive Sangoma hardware, delivering extremely high performance with Nathan Willis. Finding voice codecs for free software. high-density Sangoma cards. 
http://software.newsforge.com/ article.pl?sid=05/09/28/1646243 6 Conclusion

Running telephony systems with Linux implementaions for the past three years has resulted in the following suc- cessful working setups: 238 • Linux Telephony Linux Kernel Development How Fast it is Going, Who is Doing It, What They are Doing, and Who is Sponsoring It

Greg Kroah-Hartman (SuSE Labs / Novell Inc.) [email protected]

1 Introduction created that would contain bug fixes and security up- dates for the past kernel release before the next major The Linux kernel is one of the most popular Open release happened. Source development projects, and yet not much atten- This is best explained with the diagram shown in Fig- tion has been placed on who is doing this development, ure 1. The kernel team released the 2.6.19 kernel as a who is sponsoring this development, and what exactly is stable release. Then the developers started working on being developed. This paper should help explain some new features and started releasing the -rc versions as of these facts by delving into the kernel changelogs and development kernels so that people could help test and producing lots of statistics. debug the changes. After everyone agreed that the de- velopment release was stable enough, it was released as This paper will focus on the kernel releases of the past the 2.6.20 kernel. two and 1/3 years, from the 2.6.11 through the 2.6.21 release. While the development of new features was happen- ing, the 2.6.19.1, 2.6.19.2, and other stable kernel ver- sions were released, containing bug fixes and security 2 Development vs. Stability updates.

In the past, the Linux kernel was split into two different For this paper, we are going to focus on the main kernel trees, the development branch, and the stable branch. releases, and ignore the -stable releases, as they con- The development branch was specified by using an odd tain a very small number of bugfixes and are not where number for the second release number, while the stable any development happens. branch used an even number. As an example, the 2.5.32 release was a development release, while the 2.4.24 re- 3 Frequency of release lease is a stable release. When the kernel developers first decided on this new After the 2.6 kernel series was created, the developers development cycle, it was said that a new kernel would decided to change this method of having two different be released every 2-3 months, in order to prevent lots of trees. They declared that all 2.6 kernel releases would new development from being “backed up.” The actual be considered “stable,” no matter how quickly develop- number of days between releases can be seen in Table 1. ment was happening. These releases would happen ev- ery 2 to 3 months and would allow developers to add It turns out that they were very correct, with the average new features and then stabilize them in time for the next being 2.6 months between releases. release. This was done in order to allow distributions to be able to decide on a release point easier by always hav- ing at least one stable kernel release near a distribution 4 Rate of Change release date. When modifying the Linux kernel, developers break To help with stability issues while the developers are their changes down into small, individual units of creating a new kernel version, a -stable branch was change, called patches. These patches usually do only


Kernel      Days of
Version     development
2.6.11      69
2.6.12      108
2.6.13      73
2.6.14      61
2.6.15      68
2.6.16      77
2.6.17      91
2.6.18      95
2.6.19      72
2.6.20      68
2.6.21      81

Table 1: Frequency of kernel releases

4 Rate of Change

When modifying the Linux kernel, developers break their changes down into small, individual units of change, called patches. These patches usually do only one thing to the source tree, and are built on top of each other, modifying the source code by changing, adding, or removing lines of code. At each change point in time, the kernel should be able to be successfully built and operate. By enforcing this kind of discipline, the kernel developers must break their changes down into small logical pieces. The number of individual changes that go into each kernel release is very large, as can be seen in Table 2.

Kernel      Changes per
Version     Release
2.6.11      4,041
2.6.12      5,565
2.6.13      4,174
2.6.14      3,931
2.6.15      5,410
2.6.16      5,734
2.6.17      6,113
2.6.18      6,791
2.6.19      7,073
2.6.20      4,983
2.6.21      5,349

Table 2: Changes per kernel release

When you compare the number of changes per release with the length of time for each release, you can determine the number of changes per hour, as can be seen in Table 3.

Kernel      Changes
Version     per Hour
2.6.11      2.44
2.6.12      2.15
2.6.13      2.38
2.6.14      2.69
2.6.15      3.31
2.6.16      3.10
2.6.17      2.80
2.6.18      2.98
2.6.19      4.09
2.6.20      3.05
2.6.21      2.75

Table 3: Changes per hour by kernel release
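The per-hour figures in Table 3 follow directly from Tables 1 and 2: each release's change count divided by its development time in hours. A small illustrative sketch (not part of the original paper; two rows are hard-coded here from the tables above):

```python
# Reproduce the first two rows of Table 3 from Tables 1 and 2.
days_of_development = {"2.6.11": 69, "2.6.12": 108}     # Table 1
changes_per_release = {"2.6.11": 4041, "2.6.12": 5565}  # Table 2

for version, days in days_of_development.items():
    rate = changes_per_release[version] / (days * 24)
    print(f"{version}: {rate:.2f} changes per hour")
# -> 2.6.11: 2.44 changes per hour
# -> 2.6.12: 2.15 changes per hour
```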

So, from the 2.6.11 to the 2.6.21 kernel release, a total of 852 days, there were 2.89 patches applied to the kernel tree per hour. And that is only the patches that were accepted.

5 Kernel Source Size

The Linux kernel keeps growing in size over time, as more hardware is supported and new features are added. For the following numbers, I count everything in the released Linux source tarball as "source code" even though a small percentage is the scripts used to configure and build the kernel, as well as a minor amount of documentation. This is done because someone creates those files, and they are worthy of being mentioned.

The information in Table 4 shows the number of files and lines in each kernel version.

Kernel      Files       Lines
Version
2.6.11      17,091      6,624,076
2.6.12      17,361      6,777,860
2.6.13      18,091      6,988,800
2.6.14      18,435      7,143,233
2.6.15      18,812      7,290,070
2.6.16      19,252      7,480,062
2.6.17      19,554      7,588,014
2.6.18      20,209      7,752,846
2.6.19      20,937      7,976,221
2.6.20      21,281      8,102,533
2.6.21      21,615      8,246,517

Table 4: Size per kernel release

Over these releases, the kernel team has a very constant growth rate of about 10% per year, a very impressive number given the size of the code tree.

When you combine the number of lines added per release, and compare it to the amount of time per release, you can get some very impressive numbers, as can be seen in Table 5.

Kernel      Lines per
Version     Hour
2.6.11      77.6
2.6.12      59.3
2.6.13      120.4
2.6.14      105.5
2.6.15      90.0
2.6.16      102.8
2.6.17      49.4
2.6.18      72.3
2.6.19      129.3
2.6.20      77.4
2.6.21      74.1

Table 5: Lines per hour by kernel release

Summing up these numbers, it comes to a crazy 85.63 new lines of code being added to the kernel tree every hour for the past 2 1/3 years.

6 Where the Change is Happening

The Linux kernel source tree is highly modular, enabling new drivers and new architectures to be added quite easily. The source code can be broken down into the following categories:

• core: this is the core kernel code, run by everyone and included in all architectures. This code is located in the subdirectories block/, ipc/, init/, kernel/, lib/, mm/, and portions of the include/ directory.

• drivers: these are the drivers for different hardware and virtual devices. This code is located in the subdirectories crypto/, drivers/, sound/, security/, and portions of the include/ directory.

• architecture: this is the CPU-specific code, where anything that is only for a specific processor lives. This code is located in the arch/ directory and portions of the include/ directory.

• network: this is the code that controls the different networking protocols. It is located in the net/ directory and the include/net subdirectory.

• filesystems: this is the code that controls the different filesystems. It is located in the fs/ directory.

• miscellaneous: this is the rest of the kernel source code, including the code needed to build the kernel, and the documentation for various things. It is located in the Documentation/, scripts/, and usr/ directories.

The breakdown of the 2.6.21 kernel's source tree by the number of different files in each category is shown in Table 6, while Table 7 shows the breakdown by the number of lines of code.

Category        Files       % of kernel
core            1,371       6%
drivers         6,537       30%
architecture    10,235      47%
network         1,095       5%
filesystems     1,299       6%
miscellaneous   1,068       5%

Table 6: 2.6.21 Kernel size by files

Category        Lines of Code   % of kernel
core            330,637         4%
drivers         4,304,859       52%
architecture    2,127,154       26%
network         506,966         6%
filesystems     702,913         9%
miscellaneous   263,848         3%

Table 7: 2.6.21 Kernel size by lines of code

In the 2.6.21 kernel release, the architecture section of the kernel contains the majority of the different files, but the majority of the different lines of code are by far in the drivers section.

I tried to categorize what portions of the kernel are changing over time, but there did not seem to be a simple way to represent the different sections changing based on kernel versions. Overall, the percentage of change seemed to be evenly spread, based on the percentage that each category takes up within the overall kernel structure.

7 Who is Doing the Work

The number of different developers who are doing Linux kernel development, and the identifiable companies who are sponsoring this work (the identification of the different companies is described in the next section), have been slowly increasing over the different kernel versions, as can be seen in Table 8.

Kernel      Number of       Number of
Version     Developers      Companies
2.6.11      479             30
2.6.12      704             38
2.6.13      641             39
2.6.14      632             45
2.6.15      685             49
2.6.16      782             56
2.6.17      787             54
2.6.18      904             60
2.6.19      887             67
2.6.20      730             75
2.6.21      838             68
All         2998            83

Table 8: Number of individual developers and employers

Factoring in the amount of time between each individual kernel release and the number of developers and employers shows that there really is an increase in the size of the community, as can be seen in Table 9.

Kernel      Developers      Companies
Version     per day         per day
2.6.11      6.94            0.43
2.6.12      6.52            0.35
2.6.13      8.78            0.53
2.6.14      10.36           0.74
2.6.15      10.07           0.72
2.6.16      10.16           0.73
2.6.17      8.65            0.59
2.6.18      9.52            0.63
2.6.19      12.32           0.93
2.6.20      10.74           1.10
2.6.21      10.35           0.84

Table 9: Number of individual developers and employers over time

Despite this large number of individual developers, there is still a small number who are doing the majority of the work. Over the past two and one half years, the top 10 individual developers have contributed 15 percent of the number of changes, and the top 30 developers have contributed 30 percent. The list of individual developers, the number of changes they have contributed, and the percentage of the overall total can be seen in Table 10.

Name                    Number of Changes   Percent of Total
Al Viro                 1326                2.2%
David S. Miller         1096                1.9%
Adrian Bunk             1091                1.8%
Andrew Morton           991                 1.7%
Ralf Baechle            981                 1.7%
Andi Kleen              856                 1.4%
Russell King            788                 1.3%
Takashi Iwai            764                 1.3%
Stephen Hemminger       650                 1.1%
Neil Brown              626                 1.1%
Tejun Heo               606                 1.0%
Patrick McHardy         529                 0.9%
Randy Dunlap            486                 0.8%
Jaroslav Kysela         463                 0.8%
Trond Myklebust         445                 0.8%
Jean Delvare            436                 0.7%
Christoph Hellwig       435                 0.7%
Linus Torvalds          433                 0.7%
Ingo Molnar             429                 0.7%
Jeff Garzik             424                 0.7%
David Woodhouse         413                 0.7%
Paul Mackerras          411                 0.7%
David Brownell          398                 0.7%
Jeff Dike               397                 0.7%
Ben Dooks               392                 0.7%
Greg Kroah-Hartman      388                 0.7%
Herbert Xu              376                 0.6%
Dave Jones              371                 0.6%
Ben Herrenschmidt       365                 0.6%
Mauro Chehab            365                 0.6%

Table 10: Individual Kernel contributors

8 Who is Sponsoring the Work

Despite the broad use of the Linux kernel in a wide range of different types of devices, and the reliance on it by a number of different companies, the number of individual companies that help sponsor the development of the Linux kernel remains quite small, as can be seen in the list of different companies for each kernel version in Table 8.

The identification of the different companies was deduced through the use of company email addresses and the known sponsoring of some developers. It is possible that a small number of different companies were missed; however, based on the analysis of the top contributors to the kernel, the majority of the contributions are attributed in this paper.

The large majority of contributions still come from individual contributors, either because they are students, they are contributing on their own time, or their employers are not allowing them to use their company email addresses for their kernel development efforts. As seen in Table 11, almost half of the contributions are done by these individuals.

Company Name            Number of Changes   Percent of Total
Unknown                 27976               47.3%
Red Hat                 6106                10.3%
Novell                  5923                10.0%
                        4843                8.2%
IBM                     3991                6.7%
Intel                   2244                3.8%
SGI                     1353                2.3%
NetApp                  636                 1.1%
Freescale               454                 0.8%
linutronix              370                 0.6%
HP                      360                 0.6%
Harvard                 345                 0.6%
SteelEye                333                 0.6%
Oracle                  319                 0.5%
Conectiva               296                 0.5%
MontaVista              291                 0.5%
Broadcom                285                 0.5%
Fujitsu                 266                 0.4%
Veritas                 219                 0.4%
QLogic                  218                 0.4%
Snapgear                214                 0.4%
Emulex                  147                 0.2%
LSI Logic               130                 0.2%
SANPeople               124                 0.2%
Qumranet                106                 0.2%
Atmel                   91                  0.2%
Toshiba                 90                  0.2%
Samsung                 82                  0.1%
Renesas Technology      81                  0.1%
VMWare                  78                  0.1%

Table 11: Company Kernel Contributions

9 Conclusion

The Linux kernel is one of the largest and most successful open source projects that has ever come about. The huge rate of change and the number of individual contributors show that it has a vibrant and active community, constantly driving the evolution of the kernel so that it survives the number of different environments it is used in. However, despite the large number of individual contributors, the sponsorship of these developers seems to be consolidated in a small number of individual companies. It will be interesting to see if, over time, the companies that rely on the success of the Linux kernel will start to sponsor the direct development of the project, to help ensure that it remains valuable to those companies.

10 Thanks

The author would like to thank the thousands of individual kernel contributors; without them, papers like this would not be interesting to anyone.

I would also like to thank Jonathan Corbet, whose gitdm tool was used to create a large number of these different statistics. Without his help, this paper would have taken even longer to write, and would not have been as informative.

11 Resources

The information for this paper was retrieved directly from the Linux kernel releases as found at the kernel.org web site and from the git kernel repository. Some of the logs from the git repository were cleaned up by hand due to email addresses changing over time, and minor typos in authorship information. A spreadsheet was used to compute a number of the statistics. All of the logs, scripts, and the spreadsheet can be found at http://www.kernel.org/pub/linux/kernel/people/gregkh/kernel_history/

Implementing Democracy
a large scale cross-platform desktop application

Christopher James Lahey
Participatory Culture Foundation
[email protected]

Abstract

Democracy is a cross-platform video podcast client. It integrates a large number of functions, including searching, downloading, and playing videos. Thus, it is a large-scale application integrating a number of software libraries, including a browser, a movie player, a bittorrent client, and an RSS reader.

The paper and talk will discuss a number of techniques used, including using Pyrex to link from python to C libraries, using a web browser and a templating system to build the user interface for cross-platform desktop software (including a different web browser on each platform), and our object store used to keep track of everything in our application, store our state to disk, and bring updates to the UI.

1 Internet video

Internet video is becoming an important part of modern culture, currently through video blogs, video podcasts, and YouTube. Video podcasting gives everyone the ability to decide what they want to make available, but the spread of such systems as YouTube and Google Video suggests that large corporations will have a lot to say in the future of internet video.

The most popular use of internet video right now is YouTube. YouTube videos are popping up all over the place. Unfortunately, this gives one company a lot of power over content. It lets them take down whatever they find inconvenient or easily block certain content from reaching certain people.

Video podcasts are RSS feeds with links to videos. Podcasting allows the publisher to put whatever videos he wants on his personal webspace. Podcasting clients download these videos for display on different devices. This gets around the problem of one company controlling everything. That is, except for the fact that the most prominent podcast client is iTunes and it's used for downloading to iPods. Once again, the company has the ability to censor.

Some folks in Worcester, Massachusetts saw this as a problem and so sought funding and formed the non-profit Participatory Culture Foundation. The goal of the Participatory Culture Foundation is to make sure that everyone has a voice and that no one need be censored. We are approaching this from a number of different angles. We have a project writing tutorials for people that want to make and publish internet video. We are in planning for a server project to let people post their video podcasts. And most importantly, we write the Democracy player.

The reason the player is so important to us is that we want to make sure that publishing and viewing are independent. If they aren't, then there are two types of lock-in. Firstly, if a user wants to see a particular video, they're forced to use the particular publisher's viewing software. Secondly, once a user starts using a particular publisher's viewing software, that publisher gains control over what the viewer can see. These two could easily join together in a feedback loop that leads to a monopoly situation.

However, to separate publishing and viewing, we need a standard for communication. RSS fills this role perfectly. In fact, it's already in use for this purpose. The role we want Democracy to fill is that of a good player that encourages viewers to use RSS. Well, we also just want it to be a great video player.

2 Democracy

Democracy's main job is to download and play videos from RSS feeds. We also decided to make it able to be the heart of your video experience. Thus it plays local videos, searches for videos on the internet, and handles videos downloaded in your web browser.

To do all this we integrated a number of other tools. We included other python projects wholesale, like feedparser and BitTorrent. We link to a number of standard libraries through either standard python interfaces or through the use of Pyrex as a glue language.

For widest adoption, we decided it was important for Democracy to be cross-platform. Windows and Mac would get us the most users, but Linux is important to us since we create free software. So far two of our new developers (myself included) have come from Linux-land and one from OSX-land.

3 Major objects in Democracy

There are two major object types that we deal with: Feeds and Items.

Feeds tend to be RSS feeds, but they can also be scrapes of HTML pages, watches on local directories, and other things. Since we don't know at feed creation time whether a URL will return an RSS feed or an HTML page, we create a Feed object which is mainly a proxy to a FeedImpl object that can be created later. Python makes this proxy action almost completely transparent. We implement a __getattr__ handler which gets called for methods and data that aren't defined in the Feed object. In this handler, we simply return the corresponding method or data of the FeedImpl object. We use this trick in a couple of other similar proxy situations in Democracy.
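The proxy trick described above takes only a few lines of python. The sketch below is illustrative rather than the actual Democracy source; the class and method names are assumptions:

```python
class FeedImpl:
    """Concrete implementation, created once we know what the URL returned."""
    def __init__(self, url):
        self.url = url

    def update(self):
        print("updating", self.url)

class Feed:
    """Proxy that forwards unknown attribute lookups to its FeedImpl."""
    def __init__(self, url):
        self.impl = FeedImpl(url)  # in Democracy this may be created later

    def __getattr__(self, name):
        # Called only for attributes *not* already defined on Feed itself.
        return getattr(self.impl, name)

feed = Feed("http://example.com/videos.rss")
feed.update()    # transparently dispatched to FeedImpl.update
print(feed.url)  # data attributes are forwarded the same way
```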
Items are individual entries in an RSS feed. Most of them represent a video that Democracy could potentially download. You can also use Democracy to download either directories of multiple videos or non-video files. They can be either available, downloading, or downloaded. We also have FileItems which are used to designate local files that don't have a corresponding URL. These can either be the child videos of a directory we download, or files found on the local disk. We have special feeds that monitor a local directory and create any found files. You can also pass Democracy a filename and it will create an item for that video.

We've spent a lot of time tweaking the behavior of all of these objects. One of the things we've discovered is that the more features we have, the harder new features are to implement. Anyone who has any experience at all shouldn't be surprised to hear this, but it's amazing the difference between hearing about it in books and getting specific examples in your work.

4 Object Store

To keep track of everything that is happening in our application, we have a collection of objects. Every important object has a global ID. This includes all feeds and items, as well as playlists and channel guides.

However, we need fast access to different sets of objects. We need a list of all items in a particular feed, for example. To implement this, we have a system of views into the database. Each view acts like another database and can thus have subviews. We have a number of different view types. The first is filters, which are subsets of the database. The second is maps, which create a new object for each member of the original database. The third is indexes, which create a whole bunch of subdatabases and put each item in one of the subdatabases. Finally we have sorts, though these are redundant, as the other view types can be sorted as well.

The other important part of the object database is that you can monitor a view for changes. We send signals on new objects being added or removed. Each object has a signalChange function which signals that the data in that object has changed. You monitor this by watching for changes on a database view.

This in-memory object database has worked quite well for us. We have a list of all objects that we care about, while still being able to have lists of the objects that we care about right now. An example of a use of views is that each feed doesn't keep a list of its items. It just has a view into the database, and as the items get created, the feed gets each item added to its item list. However, the biggest use of views is in our cross-platform UI.
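As a rough sketch of how such live views can hang together (the names `filter`, `consider`, and the callback protocol here are invented for illustration; they are not Democracy's actual API):

```python
class View:
    """A live subset of the object database; callbacks fire on membership changes."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.objects = {}
        self.callbacks = []

    def consider(self, obj_id, obj):
        if self.predicate(obj) and obj_id not in self.objects:
            self.objects[obj_id] = obj
            for cb in self.callbacks:
                cb("added", obj)

class Database:
    def __init__(self):
        self.objects = {}
        self.views = []

    def filter(self, predicate):
        view = View(predicate)
        self.views.append(view)
        # existing objects are considered too, so views are always current
        for obj_id, obj in self.objects.items():
            view.consider(obj_id, obj)
        return view

    def add(self, obj_id, obj):
        self.objects[obj_id] = obj
        for view in self.views:  # each insert is pushed to every view
            view.consider(obj_id, obj)

db = Database()
items_for_feed_7 = db.filter(lambda obj: obj.get("feed_id") == 7)
items_for_feed_7.callbacks.append(lambda event, obj: print(event, obj["title"]))
db.add(1, {"feed_id": 7, "title": "episode 1"})  # prints: added episode 1
```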

5 Cross Platform UI

To implement the cross-platform user interface, we use HTML with CSS. We have two main HTML areas and platform-specific menus and buttons. The HTML is generated automatically based on the objects in our object database.

We start with XML templates that are compiled into python code that in turn generates HTML. You can pass in plain XHTML and it will work. To start with, we had a series of key-value pairs that were set in python code and could then be accessed within the templates. That proved to be a pain of having to change the python every time we needed to change a referenced key, so we switched to simply allowing the python to be embedded directly in the XML. This is safe, since we provide all the XML and don't take any from the outside world, and it's much more maintainable.

The template system can do a number of different things with the results of the python. It can embed the result directly in the HTML. It can encode the result as a string and then embed it in the HTML.

Slightly more interestingly, it can hide a chunk of HTML based on the return value. It can update a section of the template whenever a view changes. The most interesting part though is that it can repeat a section of HTML for every object in a view.

repeatForView takes a chunk of template and repeats it for every object in a view. When an object is added to or removed from the view, it adds or removes the corresponding HTML. When an object changes, it recalculates the HTML for that object.

Some of our team members are not entirely happy with HTML as our solution. It means working within the system that the browser gives us. It also means supporting both OSX's webkit and mozilla with our code (we use gtkmozembed on Linux and xul on Windows). Finally, it sticks us with xul on Windows. We originally tried using the Windows framework by hand. We decided this was just too much work from python. When that didn't work, we tried using gtk and embedding mozilla, but found that gtkmozembed doesn't work on Windows. Finally we switched to xul, but xul is much harder to code to than either OSX or gtk. We may switch to using gtk+ on Windows, and to support that, we would switch to using some other rendering system, perhaps our own XML language that maps to cairo on gtk+ and something else on OSX.

I personally would prefer to stick with HTML plus CSS. It gives us a wide range of developers who know our rendering model. It gives us a bunch of free code to do the rendering. The only problem is getting one of those sets of code to work on Windows.

6 LiveStorage

To save our database to disk, we originally just pickled the object database to disk. Python pickle is a library that takes a python object and encodes all the data in it to an on-disk format. The same library will then decode that data and create the corresponding objects in memory again. It handles links to other objects, including reference loops, and it handles all the standard string and number data types.

This worked as long as we didn't change the in-memory representation of any of our objects. We worked around a number of different issues, but in the end we decided to remove some classes, and pickle throws an exception if it has an object on disk that doesn't have a corresponding class in memory.

The next step was to add a system that copied the data into python dictionaries and then pickled that created object. To do this, we created a schema object which describes what sort of data gets stored. This makes removing a field trivial. The system that copies the data out of the python dictionaries at load time simply ignores any fields not listed in the schema.

Adding fields is a bit more complicated, but to solve this, we store a version number in the database. Every time we change the schema, we increment this version number. At load time, the system compares the version in the database to the version in the running application and runs a series of upgrades on the data. These upgrades happen on the dictionary version of the objects and thus involve no running code. They can also remove or add objects, which allows us to remove old classes that aren't necessary and add objects that are.
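In outline, the version-upgrade machinery can look something like the following. This is a hypothetical reconstruction of the approach, not Democracy's code; the field and function names are made up:

```python
# Sketch of versioned, dictionary-based storage upgrades (illustrative only).
SCHEMA_VERSION = 3

# One upgrade function per schema bump; each works on plain dicts,
# so no application classes need to exist at load time.
def upgrade_1_to_2(objects):
    for obj in objects:
        obj.setdefault("play_count", 0)  # a field added in version 2
    return objects

def upgrade_2_to_3(objects):
    # upgrades may also drop whole objects, e.g. instances of removed classes
    return [obj for obj in objects if obj.get("kind") != "obsolete"]

UPGRADES = {1: upgrade_1_to_2, 2: upgrade_2_to_3}

def load(db_version, objects):
    while db_version < SCHEMA_VERSION:
        objects = UPGRADES[db_version](objects)
        db_version += 1
    return objects

print(load(1, [{"kind": "item"}, {"kind": "obsolete"}]))
# -> [{'kind': 'item', 'play_count': 0}]
```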
Our next problem with data storage was that it was super slow to save the database. The larger the database got, the slower it was, to the tune of 45 seconds on large data sets, and we want to save regularly so that the user doesn't lose any data.

To solve this, we decided to save each object individually. Unfortunately, the objects referred to one another. Pickle handles this just fine when you ask it to pickle all the objects at once, but it isn't able to do that when you want to pickle just a single object (in fact, it will basically pickle every object related to the one you requested, and then at load time will not connect the objects saved in different pickle runs). The biggest example of this was that each feed kept a list of the items in that feed.

So the first step was to make the objects not refer to each other. For the most part we wanted to do this by just replacing references to other objects with their database IDs. This works, but we also decided that we didn't want to keep redundant data. For example, a simple replacement of references with IDs in the database would lead to a feed having a list of items in that feed and the items each having the ID of their feed. Luckily, our in-memory database already had filters. We just made the feeds not store their children, and instead the list of children is simply a database filter.

After this, we replaced the single pickle with a Berkeley database storing the list of objects. We still had to worry about keeping changes in sync. For example, when first creating a feed, you need to make sure that all of its items are saved as well. To solve this, we simply stored the list of changed items and did the actual save to the database in a timeout loop. We used a transaction, and since the old database save happened in a timeout loop as well, we have the exact same semantics for syncing of different objects.
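The change-tracking idea is simple to sketch. This toy version uses sqlite (which the project later moved to, as described below) rather than Berkeley DB, and all names are illustrative:

```python
import pickle
import sqlite3

class Store:
    """Track changed objects and flush them in one transaction per timeout tick."""
    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS objects (id INTEGER PRIMARY KEY, data BLOB)")
        self.dirty = {}

    def mark_changed(self, obj_id, obj):
        self.dirty[obj_id] = obj  # cheap; the actual I/O is deferred

    def flush(self):              # called from the timeout loop
        with self.conn:           # one transaction covers all pending saves
            for obj_id, obj in self.dirty.items():
                self.conn.execute("REPLACE INTO objects VALUES (?, ?)",
                                  (obj_id, pickle.dumps(obj)))
        self.dirty.clear()

store = Store(":memory:")
store.mark_changed(1, {"feed_id": 7, "title": "episode 1"})
store.flush()
```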

This worked great for a good while. We even used the schema upgrade functions to do things other than database upgrades, such as to automatically work around bugs in old versions. In fact we had no problem with this on Linux or Windows, but on Mac OSX we got frequent reports of database errors on load, and many people lost their data. We tried reproducing the error to debug it, and we asked about the issue on Berkeley DB's usenet groups (where we'd gotten useful information before), but there was no response. From there we decided to switch to sqlite. We're still using it to just store a list of pickled objects, but it's working fairly well for us.

The next step we'd like to take is not to have the entire database in memory at all times. Having it in memory increases our memory usage and limits the number of feeds a user can monitor. We'd like to change to using a more standard relational database approach to storing our data. This would, however, completely change how we access our data. We've decided that the size of this change means we should wait until after 1.0 to make it.

There are some other major obstacles to making this change, other than the number of pieces of code that would have to change. The first is that we use change notification extensively. An object changes and people interested in that object are notified. Similarly, objects added or removed are signalled. To get around this, we would need either a relational database that does change notifications of this sort, or we would need to add a change notification layer on top of the database. Currently these notifications happen based on the different filters in our database, so we would need to duplicate the SQL searches that we do as monitoring code to know who needs to be signalled. For this reason, we're hoping that we find a relational database that will handle live SQL searches and send notification of changes, additions, and removals.

The second obstacle is less of a problem with the change and more a reason that it won't help. Specifically, we touch most of the database at load time anyway. It would mean that we could start the user interface having loaded less data, but we queue an update of every RSS feed at load time. To run this update, we need to load all the items in that feed so we can compare them to the items we download (so that we don't create duplicate items). At that point we could unload all that data.

7 BitTorrent

We act as a bittorrent client as well. This can either be by loading a BitTorrent file by hand or by including it in an RSS feed. As a user, I found it very pleasant to have things just download. In fact, at first, I didn't realize that I was using a BitTorrent client. I think this can help increase usage of BitTorrent, since people won't be intimidated by technology if they don't know they're using it. Another good example of this is the downloader used for World of Warcraft.

Unfortunately, the primary BitTorrent source code has become non-free software. For this reason, we eventually switched to using BitTornado. Unfortunately BitTornado introduced a number of bugs, and we decided that for us, fixing those bugs would be harder than recreating the new features that BitTornado supplied. We still don't have all the features that we want, but in switching back to the old BitTorrent code, we've got a codebase that tends to work quite well.

Looking into the future, we'd love to see a free software BitTorrent client take off. Post 1.0, we're considering adding a bunch of new BitTorrent features and protocol improvements to the code base we're currently using. Our goal would certainly be to maintain the code as a separate project, so that we don't waste our time and so that the free software community gets a good BitTorrent client. Of course, this would depend on other developers, but we will see what happens.
8 FeedParser

For parsing RSS feeds, we've had a huge amount of success with feedparser.py. It's a feed parser designed to be used either separately or as a library. It parses the RSS feeds and gives them to us as useful data structures. So far the library has just worked for us in almost every situation.

The only thing we've had trouble with has been that feedparser derives a new class based on python dictionaries. It does this so that it can treat a number of different keys as the same key. For instance, url and location are treated as the same key, so that if you set either one of them, parser_dict["url"] will give the value. Unfortunately, this aggregation is done at read time instead of write time. This makes the dictionary a bit slower to use, but more importantly, it means that the == operator doesn't have the behavior that you might naively expect. We've had to write a fairly complicated replacement for it which is probably much slower than == on two plain dictionaries. We may change this behavior going forward to change the keys when writing to the dictionaries instead, but I will resist it until we have profiling data that shows that it slows things down.
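A toy version of such read-time key aliasing shows both the convenience and the equality pitfall (the alias table here is invented; feedparser's real mapping is more extensive):

```python
class AliasedDict(dict):
    """Treat several keys as the same key, resolved at read time."""
    ALIASES = {"url": "href", "location": "href"}  # illustrative mapping

    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            return dict.__getitem__(self, self.ALIASES.get(key, key))

d = AliasedDict(href="http://example.com/feed.rss")
print(d["url"])       # the alias resolves at read time
print(d["location"])  # same underlying entry

# Plain dict equality compares the stored keys, so two dicts that answer
# every aliased lookup identically can still compare unequal:
e = AliasedDict(url="http://example.com/feed.rss")
print(d["url"] == e["url"])  # True
print(d == e)                # False -- hence the complicated replacement
```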

9 Unicode

python currently has two classes to represent strings. The first is str, which is a list of bytes, and the second is unicode, which is a list of characters. These two classes automatically convert back and forth as needed, but this conversion can both produce the wrong value and cause exceptions. This usually happens because the automatic conversion isn't quite the conversion you were expecting. It actually assumes ASCII on many machines and just throws an exception when you mix a str object and a unicode object with characters greater than 127.

Unfortunately, these sorts of bugs rarely show up for an English speaker, because most text is ASCII and thus converts correctly. Therefore the bugs can be hard to reproduce, since they can happen all over the place.

A reasonable solution might be to use unicode everywhere. This was our general policy for a long time, but there were exceptions. The first was developer forgetfulness. When you write a literal string in python, unless you specify that it's unicode, it creates a str object. The second exception is that there are certain python classes that only work with str objects, such as cStringIO. For these classes we would convert unicode into strs of a certain encoding, but we would sometimes forget, and we would do multiple conversion steps in some cases. Thirdly, there are OS differences. Specifically, filenames are actually different classes on different OSes. When you do a listdir on Windows, you get unicode objects in the returned list, but on Linux and OSX, you get str objects.

Our recently introduced policy is twofold. First is that you can use different types of objects in different places. We define three object types. There's str, there's unicode, and there's filenameType. filenameType is defined in the platform-specific code, so it is different on the different platforms.

In the platform-specific code, we also provide conversion functions between the different types. There's unicodeToFilename and filenameToUnicode, which do the obvious conversions. They do not do reversible conversions, but instead provide a conversion that a user would be happy to see. We also have makeURLSafe, and because we need to undo that change, unmakeURLSafe. unmakeURLSafe(makeURLSafe(obj)) == obj if obj is a filenameType. We will see if these conversion functions are sufficient or if we need to make more functions like this.

The second part of our policy is that we enforce the types passed to and returned from many functions. We've introduced functions that take a passed-in object and check that it's of the right type, and we've introduced decorators that check the return value of the method. In both cases, an exception is thrown if an object is of the wrong type. This means that we see many bugs much sooner than if we just waited for them to be found by people using other languages.

There has been a period of transition to these new policies, since all the exposed bugs have to be fixed. We're still going through this transition phase, but it's going well so far.
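The enforcement helpers can be as small as a check function plus a decorator. A sketch under assumed names (Democracy's actual helpers differ, and the paper's python 2 distinguished str from unicode; modern python's str stands in for unicode here):

```python
def check_text(obj):
    """Raise immediately if obj is not text; used at function boundaries."""
    if not isinstance(obj, str):  # would be 'unicode' in python 2
        raise TypeError("expected text, got %r" % type(obj))
    return obj

def returns_text(func):
    """Decorator that checks the return value of the wrapped function."""
    def wrapper(*args, **kwargs):
        return check_text(func(*args, **kwargs))
    return wrapper

@returns_text
def title_for_item(item):
    return item["title"]

print(title_for_item({"title": "episode 1"}))  # fine
# title_for_item({"title": b"bytes"}) would raise TypeError right away,
# surfacing the bug long before a non-ASCII user hits it.
```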

10 More info

You can learn more about the Democracy project at getdemocracy.com and more about the Participatory Culture Foundation at participatoryculture.org.

Extreme High Performance Computing or Why Microkernels Suck

Christoph Lameter
sgi
[email protected]

Abstract

One often wonders how well Linux scales. We frequently get suggestions that Linux cannot scale because it is a monolithic operating system kernel. However, microkernels have never scaled well, and Linux has been scaled up to support thousands of processors, terabytes of memory, and hundreds of petabytes of disk storage, which is the hardware limit these days. Some of the techniques used to make Linux scale were per cpu areas, per node structures, lock splitting, cache line optimizations, memory allocation control, scheduler optimizations, and various other approaches. These required significant detail work on the code but no change in the general architecture of Linux.

The presentation will give an overview of why Linux scales and shows the hurdles microkernels would have to overcome in order to do the same. The presentation will assume a basic understanding of how operating systems work and familiarity with what functions a kernel performs.

1 Introduction

Monolithic kernels are in wide use today. One wonders though how far a monolithic kernel architecture can be scaled, given the complexity that would have to be managed to keep the operating system working reliably. We ourselves were initially skeptical that we could go any further when we were first able to run our Linux kernel with 512 processors, because we encountered a series of scalability problems that were due to the way the operating system handled the unusually high numbers of processes and processors. However, it was possible to address the scalability bottlenecks with some work, using a variety of synchronization methods provided by the operating system, and we were then surprised to encounter only minimal problems when we later doubled the number of processors to 1024. At that point the primary difficulties seemed to shift to other areas having more to do with the limitations of the hardware and firmware. We were then able to further double the processor count to two thousand and finally four thousand processors, and we were still encountering only minor problems that were easily addressed. We expect to be able to handle 16k processors in the near future.

As the number of processors grew, so did the amount of memory. In early 2007, machines are deployed with 8 terabytes of main memory. Such a system, with a huge amount of memory and a large set of processors, creates the high performance capabilities in a traditional Unix environment that allow for the running of traditional applications, avoiding major efforts to redesign the basic logic of the software. Competing technologies, such as compute clusters, cannot offer such an environment. Clusters consist of many nodes that run their own operating systems, whereas a scaled-up monolithic operating system has a single address space and a single operating system. The challenge in clustered environments is to redesign the applications so that processing can be done concurrently on nodes that only communicate via a network. A large monolithic operating system with lots of processors and memory is easier to handle, since processes can share memory, which makes synchronization via Unix shared memory possible and data exchange simple.

Large scale operating systems are typically based on Non-Uniform Memory Architecture (NUMA) technology. Some memory is nearer to a processor than other memory that may be more distant and more costly to access. Memory locality in such a system determines the overall performance of the applications. The operating system has a role in that context of providing heuristics in order to place memory in such a way that memory latencies are reduced as much as possible. These have to be heuristics because the operating system cannot know how an application will access allocated memory in the future. The access patterns of the application should ideally determine the placement of data, but the operating system has no way of predicting application behavior.

A new memory control API was therefore added to the operating system so that applications can set up memory allocation policies to guide the operating system in allocating memory for the application. The notion of memory allocation control is not standardized, and so most contemporary programming languages have no ability to manage data locality on their own (there are some encouraging developments in this area, with Unified Parallel C supporting locality information in the language itself; see Tarek, [2]). Libraries need to be provided that allow the application to provide information to the operating system about desirable memory allocation strategies.

One idea that we keep encountering in discussions of these large scale systems is that micro-kernels should allow us to handle the scalability issues in a better way, and that they may actually allow a better designed system that is easier to scale. It was suggested that a microkernel design is essential to manage the complexity of the operating system and ensure its reliable operation. We will evaluate that claim in the following sections.

2 Micro vs. Monolithic Kernel

Microkernels allow the use of the context control primitives of the processor to isolate the various components of the operating system. This allows a fine grained design of the operating system with natural APIs at the boundaries of the subsystems. However, separate address spaces require context switches at the boundaries, which may create a significant overhead for the processors. Thus many microkernels are compromises between speed and the initially envisioned fine grained structure (a hybrid approach). To some extent that problem can be overcome by developing a very small low level kernel that fits into the processor cache (for example L4) [11], but then we no longer have an easily programmable and maintainable operating system kernel. A monolithic kernel usually has a single address space, and all kernel components are able to access memory without restriction.

2.1 IPC vs. function call

Context switches have to be performed in order to isolate the components of a microkernel. Thus communication between different components must be controlled through an Inter Process Communication mechanism that incurs overhead similar to a system call in a monolithic kernel. Typically microkernels use message queues to communicate between different components. In order to communicate between two components of a microkernel the following steps have to be performed:

1. The originating thread in the context of the originating component must format and place the request (or requests) in a message queue.

2. The originating thread must somehow notify the destination component that a message has arrived. Either interrupts (or some other form of signaling) are used, or the destination component must be polling its message queue.

3. It may be necessary for the originating thread to perform a context switch if there are not enough processors around to continually run all threads (which is common).

4. The destination component must now access the message queue, interpret the message, and then perform the requested action. Then we potentially have to redo the 4 steps in order to return the result of the request to the originating component.

A monolithic operating system typically uses function calls to transfer control between subsystems that run in the same operating system context:

1. Place arguments in processor registers (done by the compiler).

2. Call the subroutine.

3. Subroutine accesses registers to interpret the request (done by the compiler).

4. Subroutine returns the result in another register.
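The structural difference between the two lists can be modeled in a few lines of python, with a thread standing in for the destination component and queues standing in for its message queues. This is purely a schematic of the control flow, not a kernel benchmark:

```python
import queue
import threading

requests, replies = queue.Queue(), queue.Queue()

def component():
    """The destination component: polls its queue, interprets, replies."""
    while True:
        op, arg = requests.get()      # step 4: pick the message up
        if op == "stop":
            break
        replies.put(arg * 2)          # the reply travels back the same way

threading.Thread(target=component).start()

def via_ipc(arg):
    requests.put(("double", arg))     # steps 1-2: format, enqueue, notify
    return replies.get()              # step 3: block until the reply arrives

def via_function_call(arg):
    return arg * 2                    # the monolithic path: registers, call, return

print(via_ipc(21), via_function_call(21))
requests.put(("stop", None))
```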

From the description above it is already evident that the monolithic operating system can rely on much lower level processor components than the microkernel, and is well supported by the existing languages used to code operating system kernels. The microkernel has to manipulate message queues, which are higher level constructs and—unlike registers—cannot be directly modified and handled by the processor (there have been attempts to develop processors that handle message queues, but no commercially viable solution exists; in contemporary High Performance Computing, message based interfaces are common for inter process communication between applications running on different machines).

In a large NUMA system an even more troubling issue arises: while a function call uses barely any memory on its own (apart from the stack), a microkernel must place the message into queues. That queue must have a memory address. The queue location needs to be carefully chosen in order to make the data accessible in a fast way to the other operating system component involved in the message transfer. The complexity of determining where to allocate the message queue will typically be higher than the message handling overhead itself, since such a determination will involve consulting system tables to figure out memory latencies. If the memory latencies are handled by another component of the microkernel, then queuing a message may require first queuing a message to another subsystem. One may avoid the complexity of memory placement in small configurations with just a few memory nodes, but in a very large system with hundreds of nodes the distances are a significant performance issue. There is only a small fraction of memory local to each processor, and so it is highly likely that a simple minded approach will cause excessive latencies.

It seems that some parts of the management of memory latency knowledge cannot be handled by a single subsystem; each subsystem of the microkernel must include the necessary logic to perform some form of advantageous data placement. It seems therefore that each microkernel component must at least contain pieces of a memory allocator in order to support large scale memory architectures.

2.2 Isolation vs. integration of operating system components

The fundamental idea of a microkernel is to isolate components, whereas the monolithic kernel integrates all the separate subsystems into one common process environment. The argument in favor of a microkernel is that it allows a system to be fail safe, since a failure may be isolated into one system component.

However, the isolation will introduce additional complexities. Operating systems usually service applications running on top of them. The operating system must track the state of the application. A failure of one key component typically also includes the loss of relevant state information about the application. Some microkernel components that track memory use and open files may be so essential to the application that the application must terminate if either of these components fails. If one wanted to make a system fail safe in a microkernel environment, then additional measures, such as check pointing, may have to be taken in order to guarantee that the application can continue. However, the isolation of the operating system state into different modules will make it difficult to track the overall system state that needs to be preserved in order for check pointing to work. The state information is likely dispersed among various separate operating system components.

The integration of a monolithic operating system into a single operating system process enables access to all state information that the operating system keeps on a certain application. This seems to be a basic requirement in order to enable fail safe mechanisms like check pointing. Isolation of operating system components may actually make reliable systems more difficult to realize.

Performance is also a major consideration in favor of integration. Isolation creates barriers for accessing operating system state information that may be required in order for the operating system to complete a certain task. Integration allows access to all state information by any operating system component.

Monolithic kernels today are complex. An operating system may contain millions of lines of code (Linux currently has 1.4 million lines). It is impossible to audit all that code in order to be sure that the operating system stays secure. An approach that isolates operating system components is certainly beneficial to ensure secure behavior of the components. In a monolithic kernel, methods have to be developed to audit the kernel automatically or manually by review. In the Linux kernel we have the example of a community review network that keeps verifying large sections of the kernel. Problems can be spotted early to secure the integrity of the kernel. However, such a process may only be possible for community based code development where a large number of developers is available. A single company may not have the resources to keep up the necessary ongoing review of source code.

2.3 Modularization

Modularization is a mechanism to isolate operating system components that may also occur in monolithic kernels. In microkernels this is the prime paradigm, and modularization results in modules with a separate process state, a separate executable, and separate source code. Each component can be separately maintained and built. Such strong modularization usually does not work well for monolithic operating systems. Any part of the operating system may refer to information from another part. Mutual dependencies exist between many of the components of the operating system. Therefore the operating system kernel has to be built as a whole. Separate executable portions can only be created by restricting the operating system state information accessible by the separated-out module.

What has been done in monolithic operating systems is the implementation of a series of weaker modes of modularization at a variety of levels.

2.3.1 Source code modularization

There are a number of ways to modularize source code. Code is typically arranged into a directory structure that in itself imposes some form of modularization. Each C source code file can also be seen as a modular unit. The validity of identifiers can be restricted to one source module only (for example through the static attribute in C). The scoping rules of the compiler may be used to control access to variables. Hidden variables are still reachable within the process context of the kernel, but there is no way to easily reach these memory locations via a statement in C.

Header files are another typical use of modularization in the kernel. Header files allow the exposing of a controlled API to the rest of the kernel. The other components must use that API in order to use the exported services. In many ways this is similar to what the strict isolation in a microkernel would provide. However, since there is no limitation to message passing methods, more efficient means of providing functionality may be used. For example, it is typical to define macros for performance sensitive operations in order to avoid function calls. Another method is the use of inline functions that also avoid function calls. Monolithic kernels have more flexible ways of modularization. It is not that the general idea of modularization is rejected; it is just that microkernels carry the modularization approach too far. The rigidity of microkernel design limits the flexibility to design APIs that provide the needed performance.

2.3.2 Loadable operating system modules

One way for monolithic operating systems to provide modularity is through loadable operating system modules. This is possible by exposing a binary kernel API. The loaded modules must conform to that API, and the kernel will have to maintain compatibility to that API. The loaded modules run in the process context of the kernel but have access to the rest of the kernel only through the exported API.

The problem with these approaches is that the API becomes outdated over time. Both the kernel and the operating system modules must keep compatibility to the binary API. Over time updated APIs will invariably become available, and then both components may have to be able to handle different releases of the APIs. Over time the complexity of the API—and the necessary workarounds to handle old versions of the API—increases.

Some open source operating systems (most notably Linux) have decided to not support stable APIs. Instead each kernel version exports its own API. The API fluctuates from kernel release to kernel release. The idea of a stable binary API was essentially abandoned. These approaches work because of source code availability. Having no stable API avoids the work of maintaining backward compatibility to previous APIs. Changes to the API are easy since no guarantee of stability has been given in the first place. If all the source code of the operating system and of the loadable modules is available, then changes to the kernel APIs can be made in one pass through all the different components of the kernel. This will work as long as the API stays consistent within the source of one kernel release, but it imposes a mandate to change the whole kernel on those submitting changes to the kernel.

2.3.3 Loadable drivers

A loadable driver is simply a particular instance of a loadable operating system module. The need to support loadable device drivers is higher, though, than the need to support loadable components of the operating system in general, since operating systems have to support a large quantity of devices that may not be in use in a particular machine on which the operating system is currently running. Having loadable device drivers cuts down significantly on the size of the executable of the operating system.

Loadable device drivers are supported by most operating systems. Device drivers are a common source of failures, since drivers are frequently written by third parties with limited knowledge about the operating system. However, even here different paradigms exist. In the Linux community, third party writing of drivers is encouraged, but then community review and integration into the kernel source itself is suggested. This usually means an extended review process in which the third party device driver is verified and updated to satisfy all the requirements of the kernel itself. Such a review process increases the reliability and stability of device drivers and reduces their failure rate.

Another solution to the frequent failure of device drivers is to provide a separate execution context for these device drivers (as done in some versions of the Microsoft Windows operating system). That way failures of device drivers cannot impact the rest of the operating system. In essence this is the same approach as suggested by proponents of microkernels. Again these concepts are used in a restricted context. Having a special operating system subsystem that creates a distinct context for device drivers is expensive. The operating system already provides such contexts for user space. The logical path here would be to have device drivers that run in user space, thus avoiding the need to maintain a process context for device drivers.

3 Techniques Used to Scale Monolithic Kernels

Proper serialization is needed in order for monolithic operating systems—such as Linux—to run on large processor counts. Access to core memory structures needs to be serialized in such a way that a large number of processors can access and modify the data as needed. Cache lines are the units in which a processor handles data. Cache lines that are read-only are particularly important for performance, since these cache lines can be shared. A cache line that is written has first to be removed from all processors that have a copy of that cache line. It is therefore desirable to have data structures that are not frequently written to.

The methods that were used to make Linux scale are discussed in the following sections. They are basically a variety of serialization methods. As the system was scaled up to higher and higher processor counts, a variety of experiments were performed to see how each data structure needed to be redesigned and what type of serialization would need to be employed in order to reach the highest performance. Development of higher scalability is an evolutionary approach that involves various attempts to address the performance issues that were discovered during testing.

3.1 Serialization

The Linux kernel has two basic ways of locking. Semaphores are sleeping locks that require a user process context. A process will go to sleep, and the scheduler will run other processes, if the sleeping lock has already been taken by another process. Spinlocks are used if there is no process context. Without the process context we can only repeatedly check if the lock has been released. A spinlock may create high processor usage because the processor is busy continually checking for a lock to be released. Spinlocks are only used for locks that have to be held briefly.

Both variants of locking come in a straight lock/unlock and a reader/writer lock version. Reader/writer locks allow multiple readers but only one writer. Lock/unlock is used for simple exclusion.

3.2 Coarse vs. fine grained locking

The Linux kernel first became capable of supporting multiprocessing by using a single large lock, the Big Kernel Lock (BKL). (The BKL still exists for some limited purposes. For a theoretical discussion of such a kernel, see Chapter 9, "Master-Slave Kernels," in [12].) Over time, coarse grained locks were gradually replaced with finer grained locks. The evolution of the kernel was determined by a continual stream of enhancements by various contributors to address performance limitations that were encountered when running common computing loads. For example, the page cache was initially protected by a single global lock that covered every page. Later these locks became more fine grained. Locks were moved to the process level and later to sections of the address space. These measures gradually increased performance and allowed Linux to scale better and better on successively larger hardware configurations. Thereby it became possible to support more memory and more processors (see Chapter 10, "Spin-locked Kernels," in [12]).

A series of alternate locking mechanisms were proposed. In addition to the four types of locking mentioned above, new locking schemes for special situations were developed. For example, seq_locks emerged as a solution to the problem of reading a series of values to determine system time. seq_locks do not block; they simply repeat a critical section until sequence counters taken at the beginning and end of the critical section indicate that the result was consistent (for more details on synchronization under Linux, see [5]).

Creativity to develop finer-grained locking that would reach higher performance was targeted at specific areas of the kernel that were particularly performance sensitive. In some areas locking was avoided in favor of lockless approaches using atomic operations and Read-Copy-Update (RCU) based techniques. The evolution of new locking approaches is by no means complete. In the area of page cache locking there exists—for example—a project to develop ways to do page cache accesses and updates locklessly via a combination of RCU and atomic operations [9].

The introduction of new locking methods involves various tradeoffs. Finer grained locking requires more locks and more complex code to handle the locks the right way. Multiple locks may be interacting in complex ways in order to ensure that a data structure maintains its consistency. The justification of complex locking schemes became gradually easier as processor speeds increased and memory speeds could not keep up. Processors became able to handle complex locking protocols, using locking information that is mostly in the processor caches to negotiate access to data in memory that is relatively expensive to access.

3.3 Per cpu structures

Access to data via locking is expensive. It is therefore useful to have data areas that do not require locking. One such natural area is data that can only be accessed by a single processor. If no other processors use the data, then no locking is necessary. This means that a thread of execution needs to be bound to one single processor as long as the per cpu data is used. The process can only be moved to another processor if no per cpu data is used.

Linux has the ability to switch the rescheduling of a kernel thread off by disabling preemption. A counter of the number of preemptions taken is kept to allow nested access to multiple per cpu data structures. The execution thread will only be rescheduled to run on other processors if the preemption counter is zero.

Each processor usually has its own memory cache hierarchy. If a cache line needs to be written, then it needs first to be cleared from the caches of all other processors. Thus dirtying a cache line is an expensive operation if copies of that cache line exist in the caches of other processors. The cost of dirtying a cache line increases with the number of processors in the system and with the latency to reach memory.

Per cpu data has performance advantages because it is only accessed by a single cpu. There will be no need to clear cache lines on other processors. Memory for per cpu areas is typically set up early in the bootstrap process of the kernel. At that point it can be placed in memory that has the shortest latency for the processor the memory is attached to. Thus memory accesses to per cpu memory are usually the fastest possible. The per cpu cache lines will stay in the cpu caches for a long time—even if they are dirtied—since no other processor will invalidate the cache lines by writing to the per cpu variables of another processor (not entirely true: in special situations, for example setup and tear down of per cpu areas, such writes will occur).

A typical use of per cpu data is to manage information about available local memory. If a process requires memory and we can satisfy the request from local memory that is tracked via structures in per cpu memory, then the performance of the allocator will be optimal. Most of the Linux memory allocators are structured in such a way as to minimize access to shared memory locations. Typically it takes a significant imbalance in memory use for an allocator to start assigning memory that is shared with other processors. The sweet spot in terms of scalability is encountered when the allocator can keep on serving only local memory.

Another use of per cpu memory is to keep statistics. that case it may be advisable to spread out the mem- Maintaining counters about resource use in the system ory accesses evenly via memory policies in order to not is necessary for the operating system to be able to adjust overload a single node. to changing computing loads. However, these counters should not impact performance negatively. For that rea- 3.5 Lock locality son Linux keeps essential counters in per processor ar- eas. These counters are periodically consolidated in or- der to maintain a global state of memory in the system. In a large system the location of locks is a performance critical element. Lock acquisition typically means gain- The natural use of per cpu data is the maintenance of ing exclusive access to a cache line that may be heav- information about the processor stats and the environ- ily contended. Some processors in the system may be ment of the processor. This includes interrupt handling, nearer to the cache line than others. These will have where local memory can be found, timer information as an advantage over the others that are more remote. If well as other hardware information. the cache line becomes heavily contended then pro- cesses on remote nodes may not be able to make much progress (starvation). It is therefore imperative that the 3.4 Per node structures system implement some way to give each processor a fair chance to acquire the cache line. Frequently such an algorithm is realized in hardware. The hardware so- A node in the NUMA world refers to a section of the lutions have turned out to be effective so far on the plat- system that has its own memory, processors and I/O forms that support high processor counts. It is likely channels. Per node structures are not as lightweight as though that commodity hardware systems now growing per cpu variables because multiple processors on one into the space, earlier only occupied by the highly scal- node may use that per node information. Synchroniza- able platforms, will not be as well behaved. Recent dis- tion is required. However, per node accesses stay within cussions on the Linux kernel mailing lists indicate that the same hardware enclosure meaning that per node ref- these may not come with the advanced hardware that erences are to local memory which is more efficient than solve the lock locality issues. Software solutions to this accessing memory on other nodes. It is advantageous if problem—like the hierarchical back off lock developed only local processors use the per node structures. But by Zoran Radovic—may become necessary [10]. other remote processors from other nodes may also use any per node structures since we already need locks to provide exclusion for the local processors. Performance 3.6 Atomic operations is acceptable as long as the use from remote processors is not excessive. Atomic operations are the lowest level synchroniza- tion primitives. Atomic operations are used as building It is natural to use per node structures to manage the blocks for higher level constructs. The locks mentioned resources of such a NUMA node. Allocators typically earlier are examples of such higher level synchroniza- have first of all per cpu queues where some objects are tion constructs that are realized using atomic operations. held ready for immediate access. 
However, if those per cpu queues are empty then the allocators will fall back Linux defines a rich set of atomic operations that can be to per node resources and attempt to fill up their queues used to improvise new forms of locking. These opera- first from the local node and then—if memory gets tight tions include both bit operations and atomic manipula- on one node—from remote nodes. tion of integers. The atomic operation themselves can be used to synchronize events if they are used to gener- Performance is best if the accesses to per node structures ate state transitions. However, the available state transi- stay within the node itself. Off node allocation scenar- tions are limited and the set of state transitions observ- ios usually involve a degradation in system performance able varies from processor to processor. A library of but that may be tolerable given particular needs of an ap- widely available state transitions via atomic operations plication. Applications that must access more memory has been developed over time. Common atomic opera- than available on one node will have to deal with the tions must be supported by all processors supported by effects of intensive off node memory access traffic. In Linux. However, some of the rarer breeds of processors 258 • Extreme High Performance Computing or Why Microkernels Suck may not support all necessary atomic operations. Emu- If so then the page is returned to the page allocator for lation of some atomic operations using locking may be other uses. necessary. Ironically the higher level constructs are then used to realize low level atomic operations. One problem with reference counters is that they re- quire write access to a cache line in the object. Con- Atomic operations are the lowest level of access to tinual establishment of new references and the drop- synchronization. Many of the performance critical ping of old references may cause cache line contention data structures in Linux are customarily modified using in the same way as locking. Such a situation was re- atomic operations that are wrapped using macros. For cently observed with the zero page on a 1024 proces- example the state of the pages in Linux must be modi- sor machine. A threaded application began to read con- fied in such a way. Kernel components may rely on state currently from unallocated memory (which causes ref- transitions of these flags for synchronization. erences to the zero page to be established). It took a long time for the application to start due to the cache line with The use of these lower level atomic primitives is com- the reference counter starting to bounce back and forth plex and therefore the use of atomic operations is typ- between the caches of various processors that attempted ically reserved for performance critical components to increment or decrement the counter. Removal of ref- where enough human resources are available to main- erence counting for the zero page resulted in dramatic tain such custom synchronization schemes. If one of improvements in the application startup time. these schemes turns out to be unmaintainable then it is usually replaced by a locking scheme based on higher The establishment of a reference count on an object level constructs. is usually not sufficient in itself because the reference count only guarantees the continued existence of the ob- ject. In order to serialize access to attributes of the ob- 3.7 Reference counters ject, one still will have to implement a locking scheme. 
The pages in Linux have an additional page lock that has Reference counters are a higher level construct real- to be taken in order to modify certain page attributes. ized in Linux using atomic operations. Reference coun- The synchronization of page attributes in Linux is com- ters use atomic increment and decrement instructions plex due to the interaction of the various schemes that to track the number of uses of an object in the kernel, are primarily chosen for their performance and due to that way concurrent operation on objects can be per- the fluctuation over time as the locking schemes are formed. If a user of the structure increments the ref- modified. erence counter then the object can be handled with the knowledge that it cannot concurrently be freed. The 3.8 Read-Copy-Update user of a structure must decrement the reference counter when the object is no longer needed. RCU is yet another method of synchronization that be- The state transition to and from zero is of particular im- comes more and more widespread as the common lock- portance here since a zero counter is usually used to in- ing schemes begin to reach their performance limits. dicate that no references exist anymore. If a reference The main person developing the RCU functionality for counter reaches zero then an object can be disposed and Linux has been Paul McKenney.7 The main advantage reclaimed for other uses. of RCU over a reference counter is that object existence is guaranteed without reference counters. No exclusive One of the key resources managed using reference coun- cache line has to be acquired for object access which is ters are the operating system pages themselves. When a a significant performance advantage. page is allocated then it is returned from the page allo- cator with a reference count of one. Over the lifetime RCU accomplishes that feat through a global serializa- multiple references may be established to the page for tion counter that is used to establish when an object can a variety of purposes. For example multiple applica- be freed. The counter only reaches the next state when tions may map the same memory page into their process 7See his website at http://www.rdrop.com/users/ memory. The function to drop a reference on a page paulmck/RCU/. Retrieved 12 April, 2007. A recent publication checks whether the reference count has reached zero. is [3], which contains an extensive bibliography. 2007 Linux Symposium, Volume One • 259 no references to RCU objects are held by a process. Ob- order to guarantee the proper cache line alignment of jects can be reclaimed when they have been expired. All the fields it is customary to align the structure itself on a processes referring to the object must have only refer- cache line boundary. enced the object in earlier RCU periods. If one can increase the data density in the cache lines RCU is frequently combined with the use of other that are at the highest level of the cpu cache stack then atomic primitives as well as the exploiting of the atom- performance of the code will increase. Rearranging data icity of pointer operations. The combination of atomic in proper cache lines is an important measure to reach operations and RCU can be tricky to manage and it is that goal. not easy to develop a scheme that is consistent and has no “holes” where a data structure can become inconsis- 3.10 Controlling memory allocation tent. Projects to implement RCU measures for key sys- tem components can take a long time. 
For example the The arrangement in cache lines increases the density of project to develop a lockless page cache using RCU has information in the cpu cache and can be used to keep already taken a couple of years.8 important data near to the processor. In a large system, memory is available at various distances to a proces- sor and the larger the system the smaller the amount of 3.9 Cache line aliasing / placement memory with optimal performance for a processor. The operating system must attempt to provide fast memory Another element necessary to reach high performance so that the processes running on the processor can run is the careful placement of data into cache lines. Ac- efficiently. quiring write access to a cache line can cause a perfor- mance issue because it requires exclusive access to the However, the operating system can only provide heuris- cache line. If multiple unrelated variables are placed tics. The usual default is to allocate memory as local in the same cache line then the performance of the ac- to the process as possible. Such an allocation method cess to one variable may be affected by frequents up- is only useful if the process will keep on running ex- dates of another (false aliasing) because the cache line clusively on the initial processor. Multithreaded appli- may need to be frequently reacquired due to eviction to cations may run on multiple processors that may have exclusive accesses by other processors. A hotly updated to access a shared area of memory. Care must be taken variable may cause a frequently read variable to become about how shared memory is allocated. If a process is costly to access because the cache line cannot be contin- started on a particular processor and allocates the mem- ually kept in the cache hierarchy. Linux solves this issue ory it needs then the memory will be local to the startup by providing a facility to arrange variables according to processor. The application may then spawn multiple their access patterns. Variables that are commonly read threads that work on the data structures allocated. These and rarely written to can be placed in a separate sec- new processes may be moved to distant processors and tion through a special attribute. The cache lines from will now overwhelmingly reference remote memory that the mostly read section can then be kept in the caches of is not placed optimally. Moreover all new processes multiple processors and are rarely subject to expulsion may concurrently access the memory allocated on the due to a write request. node of the initial processor which may exhaust the pro- cessing power of the single memory node. Fields of key operating system structures are similarly It is advisable that memory be allocated differently in organized based on common usage and frequency of us- such scenarios. A common solution is to spread the age. If two fields are frequently needed in the same memory out over all nodes that run processes for the ap- function then it is advantageous to put the fields next plication. This will balance the remote cache line pro- to each other which increases the chance that both are cessing load over the system. However, the operating placed in the same cache line. Access to one field makes system has no way of knowing what the processes of the the other one available. It is typical to place frequently application will do. 
Linux has a couple of subsystems used data items at the head of a structure to have as many that allow the processes to specify memory allocation as possible available with a single cache line fetch. In policies and allocation constraints for a process. Mem- 8See the earlier mentioned work by Nick Piggin, [9]. ory can be placed optimally if an application sets up the 260 • Extreme High Performance Computing or Why Microkernels Suck proper policies depending on how it will access the data. (Intel-64 based) machines provide 2M TLB entries to However, this memory control mechanism is not stan- map kernel memory. dardized. One will have to link programs to special li- braries in order to make use of these facilities. There In order to increase the memory coverage, another sub- hugetlb are new languages on the horizon though that may inte- system has been added to Linux that is called the file system grate the locality specification into the way data struc- . On Intel-64 this will allow the management tures are defined.9 These new languages may eventually of memory mapped via 2M TLB entries. On IA64 mem- standardize the specification of allocation methods and ory can be managed in a variety of sizes from 2M to 1 avoid the use of custom libraries. Gigabytes. However, hugetlb memory cannot be treated like regular memory. Most importantly files cannot be memory mapped using hugetlbfs. I/O is only possible 3.11 Memory coverage of Translation Lookaside in 4 kilobyte blocks through buffered file I/O and direct Buffers (TLB) I/O. Projects are underway to use huge pages for exe- cutables and provide transparent use of huge pages for Each of the processes running on modern processors has process data [6]. a virtual address space context. The address space con- A microkernel would require the management of ad- text is provided by TLB entries that are cached by the ditional address spaces via additional TLB entries that processor in order to allow a user process access to phys- would compete for the limited TLB slots in a proces- ical memory. The amount of TLB entries in a processor sor. TLB pressure would increase and we would have is limited and the limit on the number of TLB entries in more overhead coming about through the separate ad- turn limits the amount of physical memory that a pro- dress spaces of a microkernel that would degrade per- cessor may access without incurring a TLB miss. The formance. size of available physical memory is ever growing and so the fraction of memory physically accessible without 4 Multicore / Chip Multithreading a TLB miss is ever shrinking.

Under Linux, TLB misses are a particular problem since Recent developments are leading to increased multi most architectures use a quite small page size of 4 kilo- threading on a single processor. Multiple cores are bytes. The larger systems support 16 kilobytes. On the placed on a single chip. The inability to increase the smaller systems—even with a thousand TLB entries— clock frequency of processors further leads to the de- one will only be able to access 4 megabytes without a velopment of processors that are able to execute a large TLB miss. TLB miss overhead varies between proces- number of threads concurrently. In essence we see the sors and ranges from a few dozen clock cycles if the miniaturization of contemporary on a corresponding page table entry is in the cache (Intel- chip. The complex interaction of the memory caches of 64) to hundreds and occasionally even a few thousand multi core processors will present additional challenges cycles on machines that require the implementation of to organizing memory and to balancing of a computing TLB lookups as an exception handler (like IA64). load to run with maximum efficiency. It seems that the future is owned by multithreaded applications and oper- For user processes, Linux is currently restricted to a ating system kernels that have to use complex synchro- small 4k page size. In kernel space an attempt is made nization protocols in order to extract the maximum per- to directly map all of memory via 1-1 mappings. These formance from the available computational resources. are TLB entries that provide no translation at all. The Rigid microkernel concepts require isolation of kernel main use of these TLBs is to specify the access param- subsystems. It is likely going to be a challenge to imple- eters for kernel memory. Many processors also support ment complex locking protocols between kernel compo- a larger page size. It is therefore common that the ker- nents that can only communicate via messages or some nel itself use larger TLB entries for its own memory. form of inter process communication. Instead processes This increases the TLB coverage when running in kernel wanting to utilize the parallel execution capabilities to mode significantly. The sizes in use on larger Linux ma- the fullest must have a shared address space in which it chines (IA64) are 16M TLB entries whereas the smaller is possible to realize locking schemes as needed to deal 9As realized for example in Unified Parallel C. with the synchronization of the individual tasks. 2007 Linux Symposium, Volume One • 261

5 Process Contention for System Resources coarse grained or more refined depending on the per- formance requirements for a kernel component. Critical The scaling of individual jobs on a large system de- operating system paths can be successively refined or pends on the use of shared resources. Processes that even be configurable for different usage scenarios. For only access local resources and that have separate ad- example the page table locking scheme in Linux is con- dress spaces run with comparable performance to that figurable depending on the number of processors. For a on smaller machines since there is minimal locking small number of processors, there will be only limited overhead. On a machine with a couple of thousand pro- contention on page table and therefore a single page ta- cessors, one can run a couple of thousand independent ble lock is sufficient. If a large number of processors ex- processes that all work with their own memory with- ists in a system then contention may be an issue and hav- out scaling concerns. This ability shows that the op- ing smaller grained locks is advantageous. For higher erating system itself has been optimized to fully take processor counts the Linux kernel can implement a two advantage of process isolation for scaling. The situa- tier locking scheme where the higher page table layers tion becomes different if all these processes share a sin- are locked by a single lock whereas the lowest layer has gle address space. In that case certain functions—like locks per page of page table entries. The locking scheme the mapping of a page into the common memory space becomes more complicated—which will have a slight of these processes—must be serialized by the operating negative performance impact on smaller machines—but system. Performance bottlenecks can result if many of provides performance advantages for highly concurrent the processes perform operations that require the same applications. operating system resource. At that point the synchro- nization mechanisms of the operating system become As a result, the Linux operating system as a mono- key to reduce the performance impact of contention for lithic operating system can adapt surprisingly well to operating system resources. high processor counts and large memory sizes. Perfor- mance bottlenecks that were discovered while the sys- However, the operating system itself cannot foresee, tem was gradually scaled up to higher and higher pro- in detail, how processes will behave. Policies can be cessor counts were addressed through alternating ap- specified describing how the operating system needs to proaches using a variety of locking approaches. In 2007 manage resources but the operating system itself can Linux supports up to 4096 processors with around 16 only provide heuristics for common process behavior. terabytes of memory on 1024 nodes. Configurations Invariably sharing resources in a large supercomputer of up to 1024 processors are supported by commercial for complex applications requires careful planning and Linux distributions. There are a number of supercom- proper setup of allocation policies so that bottleneck can puter installation that use these large machines for sci- be avoided. It is necessary to plan how to distribute entific work at the boundaries of contemporary science. shared memory depending on the expected access pat- terns to memory and common use of operating system The richness of the locking protocols that made the scal- resources. 
Applications can be run on supercomputers ing possible requires an open access policy within the without such optimizations but then memory use, oper- kernel. It seems that microkernel based designs are ating system resource use may not be optimal. fundamentally inferior performance-wise because the strong isolation of the components in other process con- 6 Conclusion texts limits the synchronization methods that can be em- ployed and causes overhead that the monolithic kernel A monolithic operating system such as Linux has no re- does not have to deal with. In a microkernel data struc- strictions on how locking schemes can be developed. A tures have to be particular to a certain subsystem. In unified address space exists that can be accessed by all Linux data structures may contain data from many sub- kernel components. It is therefore possible to develop systems that may be protected by a single lock. Flex- a rich multitude of synchronization methods in order to ibility in the choice of synchronization mechanism is make best use of the processor resources. The freedom core to Linux success in scaling from embedded sys- to do so has been widely used in the Linux operating tems to supercomputers. Linux would never have been system to scale to high processor counts. The lock- able to scale to these extremes with a microkernel based ing methodology can be varied and may be alternatively approach because of the rigid constraints that strict mi- 262 • Extreme High Performance Computing or Why Microkernels Suck crokernel designs place on the architecture of operating [10] Radovic, Zoran. Software Techniques for system structures and locking algorithms. Distributed Shared Memory. Uppsala: Uppsala University, 2005, pp. 33–54.

References [11] Roch, Benjamin. Monolithic kernel vs. Microkernel. Retrieved 9 April 2007. [1] Catanzaro, Ben. Multiprocessor Systems http://www.vmars.tuwien.ac.at/ Architectures: A Technical Survey of courses/akti12/journal/04ss/ Multiprocessor/ Multithreaded Systems using article_04ss_Roch.pdf. SPARC, Multilevel Bus Architectures and Solaris (SunOS). Mountain View: Sun Microsystems, [12] Schimmel, Kurt. UNIX Systems for Modern 1994. Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers. New York: [2] El-Ghazawi, Tarek, William Carlson, Thomas Addison-Wesley, 1994. Sterling, and Katherine Yelick. UPC: Distributed Shared-Memory Programming. Wiley [13] Tannenbaum, Andrew S. Modern Operating Interscience, 2003. Systems. New Jersey: Prentice Hall, 1992.

[3] Hart, Thomas E., Paul E. McKenney, and Angela [14] Tannenbaum, Andrew S., Albert S. Woodhul. D. Brown. Making Lockless Synchronization Fast: Operating Systems Designs and Implementation Performance Implications of Memory Reclaim. (3rd Edition). New Jersey: Prentice-Hall, 2006. Parallel and Distributed Processing Symposium, 2006.

[4] Hwang, Kai and Faye A. Briggs. Computer Architecture and Parallel Processing. McGraw-Hill, New York: 1984.

[5] Lameter, Christoph. Effective Synchronization on Linux/NUMA Systems. Palo Alto: Gelato Foundation, 2005. Retrieved April 11, 2006. http://kernel.org/pub/linux/ kernel/people/christoph/gelato/ gelato2005-paper.pdf

[6] H.J. Lu, Rohit Seth, Kshitij Doshi, and Jantz Tran. “Using Hugetlbfs for Mapping Application Text Regions” in Proceedings of the Linux Symposium: Volume 2. pp. 75–82. (Ottawa, Ontario: 2006).

[7] Milojic, Dejan S. Implementation for the Mach Microkernel. Friedrich Vieweg & Sohn Verlag, 1994.

[8] Mosberger, David. Stephane Eranian. ia-64 linux kernel: design and implementation. New Jersey: Prentice Hall, 2002.

[9] Piggin, Nick. “A LockLess Page Cache in Linux” in Proceedings of the Linux Symposium: Volume 2 (Ottawa, Ontario: 2006). Retrieved 11 April 2006. http://ols2006.108.redhat.com/ reprints/piggin-reprint.pdf. Performance and Availability Characterization for Linux Servers

Vasily Linkov Oleg Koryakovskiy Motorola Software Group Motorola Software Group [email protected] [email protected]

Abstract software solutions that were very expensive and in many cases posed a lock-in with specific vendors. In the cur- The performance of Linux servers running in mission- rent business environment, many players have come to critical environments such as telecommunication net- the market with variety of cost-effective telecommuni- works is a critical attribute. Its importance is growing cation technologies including packed data technologies due to incorporated high availability approaches, espe- such as VoIP, creating server-competitive conditions for cially for servers requiring five and six nines availability. traditional providers of wireless types of voice commu- With the growing number of requirements that Linux nications. To be effective in this new business environ- servers must meet in areas of performance, security, re- ment, the vendors and carriers are looking for ways to liability, and serviceability, it is becoming a difficult task decrease development and maintenance costs, and de- to optimize all the architecture layers and parameters to crease time to market for their solutions. meet the user needs. Since 2000, we have witnessed the creation of several industry bodies and forums such as the Service Avail- Other Linux servers, those not operating in a mission- ability Forum, Communications Platforms Trade As- critical environment, also require different approaches sociation, Linux Foundation Ini- to optimization to meet specific constraints of their op- tiative, PCI Industrial Computer Manufacturers Group, erating environment, such as traffic type and intensity, SCOPE Alliance, and many others. Those industry types of calculations, memory, and CPU and IO use. forums are working on defining common approaches This paper proposes and discusses the design and imple- and standards that are intended to address fundamental mentation of a tool called the Performance and Avail- problems and make available a modular approach for ability Characterization tool, PAC for short, which op- telecommunication solutions, where systems are built erates with over 150 system parameters to optimize over using well defined hardware specifications, standards, 50 performance characteristics. The paper discusses the and Open Source APIs and libraries for their middle- PAC tool’s architecture, multi-parametric analysis algo- ware and applications [11] (“Technology Trends” and rithms, and application areas. Furthermore, the paper “The .org player” chapters). presents possible future work to improve the tool and ex- The Linux operating system has become the de facto tend it to cover additional system parameters and char- standard operating system for the majority of telecom- acteristics. munication systems. The Carrier Grade Linux initia- tive at the Linux Foundation addresses telecommunica- 1 Introduction tion system requirements, which include availability and performance [16]. The telecommunications market is one of the fastest Furthermore, companies as Alcatel, Cisco, Ericsson, growing industries where performance and availabil- Hewlett-Packard, IBM, Intel, Motorola, and Nokia use ity demands are critical due to the nature of real-time Linux solutions from MontaVista, Red Hat, SuSE, and communications tasks with requirement of serving thou- WindRiver for such their products as softswitches, tele- sands of subscribers simultaneously with defined quality com management systems, packet data gateways, and of service. Before Y2000, telecommunications infras- routers [13]. 
Examples of existing products include Al- tructure providers were solving performance and avail- catel Evolium BSC 9130 on the base of the Advanced ability problems by providing proprietary hardware and TCA platform, and Nortel MSC Server [20], and many

• 263 • 264 • Performance and Availability Characterization for Linux Servers more of them are announced by such companies as Mo- not about performance problems, but about performance torola, Nokia, Siemens, and Ericsson. characterization of the system. In many cases, people working on the new system development and fortunately As we can see, there are many examples of Linux-based, having performance requirements agreed up front use carrier-grade platforms used for a variety of telecommu- prototyping techniques. That is a straightforward but nication server nodes. Depending on the place and func- still difficult way, especially for telecommunication sys- tionality of the particular server node in the telecommu- tems where the load varies by types, geographic loca- nication network infrastructure, there can be different tion, time of the day, etc. Prototyping requires creation types of loads and different types of performance bot- of an adequate but inexpensive model which is problem- tlenecks. atic in described conditions. Many articles and other materials are devoted to ques- The authors of this paper are working in telecommu- tions like “how will Linux-based systems handle per- nication software development area and hence tend to formance critical tasks?” In spite of the availability mostly consider problems that they face and solve in of carrier-class solutions, the question is still important their day-to-day work. It was already said that perfor- for systems serving a large amount of simultaneous re- mance issues and characterization are within the area of quests, e.g. WEB Servers [10] as Telecommunication- interest for a Linux-based system developer. Character- specific systems. ization of performance is about inexpensive modeling Telecommunication systems such as wireless/mobile of the specific solution with the purpose of predicting networks have complicated infrastructures imple- future system performance. mented, where each particular subsystem solves its spe- cific problem. Depending on the problem, the critical What are the other performance-related questions that systems’ resource could be different. For example, Dy- may be interesting when working in telecommunica- namic Memory Allocation could become a bottleneck tions? It isn’t just by chance we placed the word Per- for Billing Gateway, Fraud Control Center (FCC), and formance close to Availability; both are essential char- Data Monitoring (DMO) [12] even in SMP architecture acteristics of a modern telecommunication system. If environment. Another example is WLAN-to-WLAN we think for a moment about the methods of achiev- handover in UMTS networks where TCP connection re- ing of some standard level of availability (let’s say the establishment involves multiple boxes including HLR, five- or six- nines that are currently common industry DHCP servers and Gateways, and takes significant time standards), we will see that it is all about redundancy, (10–20 sec.) which is absolutely unacceptable for VoIP reservation, and recovery. Besides specific requirements applications [15]. A similar story occurred with TCP to the hardware, those methods require significant soft- over CDMA2000 Networks, where a bottleneck was ware overhead functionality. That means that in addi- found in the buffer and queue sizes of a BSC box [17]. tion to system primary functions, it should provide al- The list of the examples can be endless. gorithms for monitoring failure events and providing ap- propriate recovery actions. 
These algorithms are obvi- If we consider how the above examples differ, we would ously resource-consuming and therefore impact overall find out that in most cases performance issues appear to system performance, so another problem to consider is be quite difficult to deal with, and usually require rework a reasonable tradeoff between availability and produc- and redesign of the whole system, which may obviously tivity [8], [18]. be very expensive. Let’s consider some more problems related to telecom- The performance improvement by itself is quite a well- munication systems performance and availability char- known task that is being solved by the different ap- acterization that are not as fundamental as those de- proaches including the Clustering and the Distributed scribed above, but which are still important (Figure 1). Dynamic Load Balancing (DDLB) methods [19]; this can take into account load of each particular node (CPU) Performance profiling. The goal of performance pro- and links throughput. However, a new question may filing is to verify that performance requirements have arise: “Well. We know the load will be even and dy- been achieved. Response times, throughput, and other namically re-distributed, but what is the maximum sys- time-sensitive system characteristics should be mea- tem performance we can expect?” Here we are talking sured and evaluated. The performance profiling is ap- 2007 Linux Symposium, Volume One • 265

Performance Load one more approach that was successfully used by the profiling testing authors in their work.

2 Overview of the existing methods for Perfor- Performance TARGET Stress tuning PROBLEMS testing mance and Availability Characterization

A number of tools and different approaches for perfor- mance characterization exist and are available for Linux Performance issues investigation systems. These tools and approaches target different problems and use different techniques for extracting system data to be analyzed as well, and support differ- Figure 1: Performance and Availability Characterization ent ways to represent the results of the analysis. For the target problems simplest cases of investigating performance issues, the standard Linux tools can be used by anyone. For exam- plicable for release-to-release testing. ple, the GNU profiler gprof provides basic information about pieces of code that are consuming more time to Load testing. The goal of load testing is to determine be executed, and which subprograms are being run more and ensure that the system functions properly beyond frequently than others. Such information offers under- the expected maximum workload. The load testing sub- standing where small improvements and enhancements jects the system to varying workloads to evaluate the can give significant benefits in performance. The cor- performance behaviors and ability of the system to func- responding tool kprof gives an opportunity to analyze tion properly under these different workloads. The load graphical representation gprof outputs in form of call- testing also could be applicable on design stage of a trees, e.g. comprehensive information about the system project to choose the best system architecture and ensure can be received from /proc (a reflection of the system in that requirements will be achieved under real/similar memory). Furthermore, for dealing with performance system workloads [21], [22]. issues, a variety of standard debugging tools such as in- strumentation profilers (oprofile which is a system-wide Stress testing. The goal of stress testing is to find per- profiler), debuggers kdb and kgdb, allowing kernel de- formance issues and errors due to low resources or com- bugging up to source code level as well as probes crash petition for resources. Stress testing can also be used to dumps and many others are described in details in pop- identify the peak workload that the system can handle. ular Linux books [3]. These tools are available and pro- Performance issue investigation. Any type of perfor- vide a lot of information. At the same time a lot of work mance testing in common with serious result analysis is required to filter out useful information and to analyze could be applicable here. Also in some cases, snap- it. The next reasonable step that many people working shot gathering of system characteristics and/or profiling on performance measurement and tuning attempt to do could be very useful. is to create an integrated and preferably automated so- lution which incorporates in it the best features of the Performance Tuning. The goal of performance tuning available standalone tools. is to find optimal OS and Platform/Application settings, process affinity, and schedule policy for load balancing Such tools set of benchmarks and frameworks have ap- with the target of having the best compromise between peared such as the well known package lmbench, which performance and availability. The multi-objective op- is actually a set of utilities for measurement of such timization algorithm can greatly reduce the quantity of characteristics as memory bandwidth, context switch- tested input parameter combinations. ing, file system, process creating, signal handling la- tency, etc. 
It was initially proposed and used as a univer- This long introduction was intended to explain why we sal performance benchmarking tool for Unix-based sys- started to look at performance and availability character- tems. There were several projects intended to develop ization problems and their applications to Linux-based, new microbenchmark tools on the basis of lmbench in carrier-grade servers. Further along in this paper, we order to improve measurement precision and applica- will consider existing approaches and tools and share bility for low-latency events by using high-resolution 266 • Performance and Availability Characterization for Linux Servers timers and internal loops with measurement of the av- tified resources and make the conclusion after running erage length of events calculated through a period of and analyzing these benchmarks on the target HW/SW time, such as Hbench-OS package [4]. It is notice- platform, without the necessity of implementing the ap- able that besides widely used performance benchmarks, plication and/or environment. This approach is called there are examples of availability benchmarks that are workload characterization [2]. specifically intended to evaluate a system from the high availability and maintainability point of view by simu- Looking back to the Introduction section, we see that lating failure situations over a certain amount of time all the target questions of Performance and Availability and gathering corresponding metrics [5]. characterization are covered by the tools we have briefly looked through above. At the same time there is no sin- Frameworks to run and analyze the benchmarks were gle universal tool that is able to address all these ques- the next logical step to customize this time-consuming tions. Further in the paper we are introducing the Perfor- process of performance characterization. Usually a mance and Availability Characterization (PAC) tool that framework is an automated tool providing additional combines the essential advantages of all the approaches customization, automation, representation, and analysis considered in this chapter and provides a convenient means on top of one or several sets of benchmarks. It framework to perform comprehensive Linux-based plat- makes process of benchmarking easier, including auto- forms characterization for multiple purposes. mated decision making about the appropriate amount of cycles needed to get trustworthy results [23]. 3 Architectural Approach Therefore, we can see that there are a number of tools and approaches one may want to consider and use to 3.1 Experimental Approach characterize a Linux-based system in terms of perfor- mance. Making the choice we always keep in mind the Anyone who is trying to learn about the configuration of main purpose of the performance characterization. Usu- Linux servers running in mission-critical environments ally people pursue getting these characteristics in order and running complex applications systems will have to to prove or reject the assumption that a particular system address the following challenges: will be able to handle some specific load. So if you are working on a prototype of a Linux-based server for use as a wireless base site controller that should handle e.g. • An optimal configuration, suitable for any state of one thousand voice and two thousand data calls, would environmental workload, does not exist; you be happy to know from the benchmarks that your system is able to handle e.g. 
fifty thousand TCP connec- • Systems are sophisticated: Distributed, Multipro- tions? The answer isn’t trivial in this case. To make cessor, Multithreaded; sure, we have to prepare a highly realistic simulated en- vironment and run the test with the required number of • Hundreds or even thousands of configuration pa- voice and data calls. It is not easy, even if the system is rameters can be changed; already implemented, because you will have to create or • Parameters can be poorly documented, so the result simulate an external environment that is able to provide of a change for a group of parameters or even single an adequate type and amount of load, and which behaves parameter can be totally unpredictable. similarly to a live wireless infrastructure environment. In case you are in the design phase of your system, it is just impossible. You will need to build your conclu- Based on the above described conditions, an analytical sion on the basis of a simplified system model. Fortu- approach is scarcely applicable, because a system model nately, there is another approach—to model the load, not is not clear. An empirical approach could be more appli- the system. Looking at the architecture, we can assume cable to find optimal configuration of a system, but only what a specific number of voice and data calls will en- experimental evaluation can be used to validate the cor- tail in the system in terms of TCP connections, memory, rectness of optimal configuration on a real system. The timers, and other resources required. Having this kind heart of PAC is the concept of the experimentation. A of information, we can use benchmarks for the iden- single experiment consists of the following parts: 2007 Linux Symposium, Volume One • 267

• Input parameters: let us call them Xs. Input pa- 3.2 Overview of PAC Architecture rameters are all that you want to set up on a target system. Typical examples here are Linux kernel XML SKL variables, loader settings, and any system or appli- cation settings. Test Case Scenario • Output parameters: let us call them Ys. Output pa- rameters are all that you want to measure or gather PAC Engine on a target system: CPU and Memory utilization, ELF

any message throughput and latency, system ser- Test Case Processor vices bandwidth, and more. Sources for Ys could

be: /proc file system, loaders output, profiling XML

data, and any other system and application output. Single Experiment Host 1 • Experiment scenarios: An experiment scenario is a description of actions which should be executed on Scenario Processor ELF Preprocessor target hosts. & substitution logic Host 2

ELF Typical experiment scenario follows a sequence of ac- Scenario Executor tion: setup Xs that can’t be applied on-the-fly (includ- Host N ing execution required actions to apply such Xs like XML restart node or processes, down and up network inter- Experiment Results faces, etc.), then setup Xs that can be applied on-the-fly and loader’s Xs, start loaders, next setup Xs like: sched- ule policy, priority, CPU binding etc., finally collect Ys XML such as CPU/Memory usage, stop loaders, and overall statistics. Logs Results Every scenario file may use preprocessor directives and S-Language statements. S-Language is a script lan- Figure 2: PAC Architecture guage which is introduced specifically for the project. Both preprocessor and S-Language are described in A test case consists of a set of experiments. Each ex- more detail following. One of the important parts of the periment is essentially a set of Xs that should be used scenario executor is a dynamic table of variables. Vari- while executing a scenario. Set of Xs within one Test able is a pair-variable name and variable value. There Case boundaries is constant and only values of these Xs are two sources of the variables in the dynamic table: are variable. Each experiment is unambiguously linked • Xs (Input variables). They are coming from an ex- to a scenario. A scenario resides in a separate file or periment. in a group of files. The PAC engine overall logic is as follows: • Ys (Collected variables). They are coming from remote hosts. • Takes a test case file; In the case of input variables, the names of the variables • For each experiment in the test case, performs the are provided by the XML-formatted single experiment steps below; file. In the case of collected variables, the names of the variables are provided by scripts or other executa- • Selects the corresponding scenario file; bles on the target hosts’ side. Whenever the same exe- cutable could be run on many different hosts, a names- • Executes all the scenario instructions using settings pace mechanism is introduced for the variable names. for fine tuning the execution logic; A host identifier is used as a namespace of the variable name. • Saves the results into a separate result file; 268 • Performance and Availability Characterization for Linux Servers

• Saves log files where the execution details are are collected from target hosts. At the end of the sce- stored. nario execution, two files are generated: a log file and a results file. The log file contains the report on what was executed and when, on which host, as well as the return Test Cases The basic unit of test execution is an experi- codes of the commands. The results file contains a set ment. A single experiment holds a list of variables, and of collected variables (Xs and Ys). each variable has a unique name. Many experiments form a test case. The purpose of varying Xs’ values de- The main purpose of the introduced S-Language is to pends on a testing goal. Those Xs’ names are used in simplify a textual description of the action sequences scenarios to be substituted with the values for a certain which are being executed consecutively and/or in par- experiment. allel on many hosts. Figure 3 shows an example task execution sequence. Scenarios A scenario is a description of actions which should be executed on target hosts in a sequence or in parallel in order to set up Xs’ values in accordance with Task 1 Test Case/Experiments and gather the values of Ys. The solution introduces a special language for writing sce- narios. The language simplifies description of actions that should be executed in parallel on many hosts, data Task 3 collection, variable values, substitution, etc. Task 2

Results The results files are similar to Test Case files. Task 4 However, they contain set of Ys coming from target hosts and from input experiment variables (Xs).

The scenario processor consists of two stages, as de- Task 5 picted in the figure above. At the bottom line there is a (a) Block diagram scenario executor which deals with a single scenario file. TITLE ‘‘Scenario example’’

From the scenario executor’s point of view, a scenario is #include "TestHosts.incl" a single file; however, it is a nice feature to be able to #include "CommonDef.incl" group scenario fragments into separate files. To support /* Task1 */ WAIT ssh://USER:PASSWD @ {X_TestHost} "set_kernel_tun.sh SEM \ this feature the preprocessor and substitution logic is in- {X_IPC_SEM_KernelSemmni}{X_IPC_SEM_KernelSemopm}" troduced in the first stage. The standard C programming PARALLEL { language preprocessor is used at this stage, so anything /* Task2 */ COLLECT @ X_TestHost "sem_loader -d {X_Duration}\ which is supported by the preprocessor can be used in a -r {X_IPC_SEM_NumPVOps} -t {X_IPC_SEM_LoaderNumThreads}" SERIAL scenario file. Here is a brief description of the C prepro- { /* Task3&4 */ cessor features which is not a complete one and is given COLLECT @ {X_TestHost} "get_overall_CPUusage.pl -d {X_Duration}" here for reference purposes only: COLLECT [exist(Y_Memory_Free)] @ {X_TestHost}\ "get_overall_Memoryusage.pl -d X_Duration" } } • /* Task5 */ Files inclusion; NOWAIT [IPC_iteration >1] @ {X_TestHost} ‘‘cleanup_timestamps.sh’’ (b) S-Language Code • Macro substitutions; Figure 3: Scenario Example • Conditional logic. ExecCommand is a basic statement of the S-Language. Summarizing, the complete sequence of actions is as It instructs the scenario executor to execute a command follows: The single experiment from the Test Case is ap- on a target host. The non-mandatory Condition ele- plied to the scenario file. It assumes macro substitutions ment specifies the condition on when the command is to of the experiment values (Xs), file inclusions, etc. The be executed. There are five supported command mod- scenario executor follows instructions from the scenario ifiers: RAWCOLLECT, COLLECT, WAIT, NOWAIT, file. While executing the scenario some variables (Ys) and IGNORE. The At Clause part specifies on which 2007 Linux Symposium, Volume One • 269 host the command should be executed. The At Clause is The functionality depends on a loader type. For exam- followed by a string literal, which is the command to be ple, the file system loader performs a set of file opera- executed. Substitutions are allowed in the string literal. tions, while the shared memory loader performs a set of shared memory operations, and so on. If the required 3.3 PAC Agents functionality has been executed before the end of the given time slice, a loader just sleeps until the end of the slice. If the functionality takes longer than a time We are going to refer to all software objects located slice, the loader increments the corresponding statistic’s on a target system as PAC agents. The server side of counter and proceeds. PAC does not contain any Performance and Availability specifics, but it is just intended to support any type of There are several common parameters for loaders. complex testing and test environment. Everybody can Input: use PAC itself to implement their own scenario and tar- get agents in order to solve their own specific problem • The first one is a number of threads/processes. The related to the system testing and monitoring. main thread of each loader’s responsibility is to create the specified number of threads/processes PAC agents, which are parts of the PAC tool, are the and wait until they are finished. Each created following: thread performs the loader-specific operations with the specified rate. • Linux service loaders; • The second common thing is the total loader work- • Xs adjusting scripts; ing time. 
This specifies when a loader should stop performing operations. • Ys gathering scripts. • Loaders support a parameter which provides the number of operations per one “call of F() function- In this paper, we consider only loaders as more inter- ality.” For example, a signal loader takes an argu- esting part of PAC agents. The diagram in Figure 4 is ment of how many signals should be sent per time intended to show the common principle of the loader slice. This parameter, together with the number of implementation. Every loader receives a command line threads, rate, and number of objects to work with (like number of message queues), gives the actual Rate: number of time slices per base interval load. sleep F ( ) F ( ) • Besides that, each loader accepts specific parame-

time ters (shared memory block size in kilobytes, mes- sage size, and signal number to send, and so on). Time taken Time Does not meet by F( ) slice rate requirements Base Interval Output:

F ( ) { for ( NumberOfCycles ) • Number of fails due to the rate requirement not be- { RequiredFunctionality(); F (functionality): } the function that performs a load; ing met. } depends on loader type (file operations, queue operations, ...) • Statistics—figures which are specific for a loader Figure 4: Loader Implementation (messages successfully sent, operations success- fully performed, etc.) argument which provides the number of time slices a base interval (usually one second) is going to be divided The loaders implementation described above allows not into. For example: --rate 20 means only the identification of Linux service breakpoints, but that a one-second interval will be divided into 20 slices. also—with help of fine rate/load control—the discovery of the service behavior at different system loads and set- At the very beginning of each time slice, a loader calls tings. The following loaders are available as part of PAC a function which performs a required functionality/load. tool: 270 • Performance and Availability Characterization for Linux Servers

• IPC loaders: • Identify list of output parameters (Ys) that you would like to measure during an experiment: ev- – Shared memory loader; erything you want to learn about the system when – Semaphore loader; it is under a given load. – Message queues loader; If we are talking about Linux systems, you are lucky – Timer loader. then, because you can find in the PAC toolset all the necessary components for the PAC agent that have been • CPU loaders: already implemented: set scripts for Xs, get scripts for Ys, and predefined scenarios for Linux’s every service. – CPU loader; – Signal loader; If you are not using Linux, you can easily implement your own scripts, scenarios, and loaders. When you – Process loader. have identified all the parameters that you want to set up • IP loaders: and measure, you can move on to plan the experiments to run. – TCP loader; We will start with some semaphore testing for kernels – UDP loader. 2.4 and 2.6 on specific hardware. Let’s consider the first test case (see Table 1). Every single line represents a • FS loader (File & Storage); single experiment that is a set of input values. You can see some variation for number of semaphores, operation • Memory loader. rate, and number of threads for the loader; all those val- ues are Xs. A test case also has names of the values that One more PAC agent is PPA (Precise Process Ac- should be collected—Ys. counting). PPA is a kernel patch that has been con- As soon as the numbers are collected, let’s proceed to tributed by Motorola. PPA enhances the Linux ker- the data analysis. Having received results of the experi- nel to accurately measure user/system/interrupt time ments for different scenarios, we are able to build charts both per-task and system wide (all stats per CPU). and visually compare them. Figure 5 shows an example It measures time by explicitly time-stamping in the of semaphore charts for two kernels: 2.4 on the top, and kernel and gathers vital system stats such as system 2.6 on the bottom. The charts show that kernel 2.4 has a calls, context switches, scheduling latency, and ad- lack of performance in the case of many semaphores. It ditional ones. More information on PPA is avail- is not easy to notice the difference between the kernels able from the PPA SourceForge web site: http:// without having a similar tool for collecting performance sourceforge.net/projects/ppacc/. characteristics. The charts in Figure 6 are built from the same data as the previous charts; however, a CPU 4 Performance and Availability Characteriza- measurement parameter was chosen for the vertical axis. tion in Use The charts show that the CPU consumption is consider- ably less on 2.6 kernel in comparison with 2.4. In the Introduction section, we presented some challenges re- Let us assume that you already have the PAC tool and lated to performance and availability. In this section, we you have decided to use it for your particular task. First will cover how we face these challenges by experiment of all, you will have to prepare and plan your experi- planning and appropriate data analysis approach. ments: Performance Profiling • Identify list of input parameters (Xs) that you • Set up predefined/standard configuration for the would like to set up on the target. That could be kernel and system services. kernel parameters, a loader setting like operational rate, number of processes/threads, CPU binding, • Setup loaders to generate the workload as stated in etc. your requirements. 
2007 Linux Symposium, Volume One • 271 Experiment_ID X_IPC_SEM_KernelSemmni X_IPC_SEM_KernelSemmns X_IPC_SEM_KernelSemopm X_IPC_SEM_LoaderNumSems X_IPC_SEM_NumPVOps X_IPC_SEM_LoaderNumThreads Y_CPU_Percentage_cpu1 Y_CPU_Percentage_cpu2 Y_CPU_Percentage_cpu3 Y_CPU_Percentage_cpu4 Y_MEMORY_Free Y_IPC_SEM_NumSems Y_IPC_SEM_LdrLockOp Y_IPC_SEM_LdrUnlockOp Y_IPC_SEM_LdrRateFailed TC_ID 30 18 32000 800 1024 300 170 8 1 3 0 0 8061550592 300 52132.7 52132.7 0.0666667 30 19 32000 800 1024 300 300 8 3 0 1 0 8055160832 300 89916.8 89916.8 0 30 20 32000 800 1024 300 800 8 0 3 9 3 8053686272 300 233750 233750 0 30 21 32000 800 1024 300 1700 8 7 0 6 0 8054013952 300 494471 494471 0 30 22 32000 800 1024 300 2000 8 7 14 7 14 8059387904 300 584269 584269 0.0777778 30 23 32000 800 1024 300 2300 8 16 0 8 8 8058470400 300 674156 674156 0.0333333 30 24 32000 800 1024 300 2700 8 19 10 0 9 8062369792 300 791193 791193 0.0666667 30 25 32000 800 1024 1 10000 20 1 0 0 0 8045821952 1 9712.8 9712.8 0 30 26 32000 800 1024 1 50000 20 0 0 0 4 8059224064 1 48474.1 48474.1 0 30 27 32000 800 1024 1 100000 20 8 0 0 0 8068726784 1 96912.2 96912.2 0 30 28 32000 800 1024 1 250000 20 0 21 0 0 8060928000 1 242340 242340 0 30 29 32000 800 1024 1 500000 20 0 41 0 0 8052441088 1 484651 484651 0 30 30 32000 800 1024 1 600000 20 2 47 0 0 8070725632 1 581599 581599 0 30 31 32000 800 1024 1 700000 20 0 57 0 0 8054112256 1 678266 678266 0 30 32 32000 800 1024 1 800000 20 59 0 0 0 8082391040 0 775435 775435 0 30 33 32000 800 1024 100 100 20 0 0 0 0 8063451136 100 9991.11 9991.11 0 30 34 32000 800 1024 100 500 20 1 1 0 0 8058863616 100 50958.4 50958.4 0 30 35 32000 800 1024 100 1000 20 0 0 1 0 8046870528 100 98917.5 98917.5 0 30 36 32000 800 1024 100 2500 20 4 1 3 4 8058142720 100 242667 242667 0 30 37 32000 800 1024 100 5000 20 11 0 0 3 8047525888 100 485289 485289 0 30 38 32000 800 1024 100 6000 20 8 9 0 9 8052932608 100 584133 584133 0

Table 1: Table of Experiments

• Perform experiments. • Check that all experiments are done and Ys are in valid boundaries. • Check whether the measured data shows that re- quirements are met. Performance tuning Load Testing • Plan your experiments from a workload perspec- tive with the target to simulate a real load on the • Set up predefined/standard configuration for the system. kernel and system services. • Vary Linux settings, process affinity, schedule po- • Use a long experiment duration. lice, number of threads/instances, etc. • Mix the workload for all available services. • Analyze the results in order to identify the optimal • Vary workloads. configuration for your system. Actually we believe a multi-objective optimization can be very useful • Vary the number of threads and instances for the for that. This approach is described in more detail platform component. later on. • Analyze system behavior. System Modeling • Check that Ys are in valid boundaries. • Take a look at your design. Count all the system Stress Testing objects that will be required from an OS perspec- tive, like the number of queues, TCP/UDP link, • Use a high workload. timers, semaphores, shared memory segment, files, etc. • Operate by the loader with the target to exhaust system resources like CPU, memory, disk space, • Examine your requirements in order to extrapolate etc. this on every OS service workload.

• Analyze system behavior. • Prepare test case(s). 272 • Performance and Availability Characterization for Linux Servers

(a) kernel 2.4 (a) kernel 2.4

(b) kernel 2.6 (b) kernel 2.6

Figure 5: Semaphore chart Figure 6: CPU usage chart

• Analyze the obtained results to understand whether she/he identifies right interval of system parameters— your hardware can withstand your design. i.e., where to search for valuable statistical results. This is still valuable, but requires the boring sorting of a sig- 5 Future Plans for Further Approaches Devel- nificant amount of statistical information. In reality, opment there may be more than one hundred tunable parame- ters. Some of them will not have any impact on certain system characteristics; others will certainly have impact. In its current version, the PAC tool allows us to reach It is not always easy to predict it just from common per- the goals we set for ourselves, and has a lot of poten- spective. An even more difficult task is to imagine and tial opportunities for further improvements. We realize predict the whole complexity of the inter-relationships a number of areas where we can improve this solution of parameters. Automation of this process seems to be to make it more and more applicable for a variety of reasonable here and there is a math theory devoted to performance- and availability-related testing. this task that we believe could be successfully applied. Through real applications of the PAC tool to existing telecommunication platforms, we realized that this ap- We are talking about multi-parametric optimization. It proach from the experimental perspective could be very is a well described area of mathematics, but many fruitful. However, we noticed that results may vary de- constraints are applied to make this theory applica- pending on the overall understanding and intuition of ble for discrete and non-linear dependencies (which is the person performing the planning of the experiments. true for most of dependencies we can meet in sys- If a person using PAC does not spend enough time to tem tunable parameters area). We are currently look- investigate the nature of the system, she/he may need ing for numeric approaches for these kinds of multi- to spend several cycles of experiments-planning before parametric optimizations—e.g., NOMAD (Nonlinear 2007 Linux Symposium, Volume One • 273

Optimization for Mixed variables and Derivatives) [1], presented the reason why in each particular case the GANSO (Global And Non-Smooth Optimization) [7], set of tools used should be different, as well as men- or ParEGO Hybrid algorithm [14]. A direct search tioned that in general the procedure of performance and for the optimal parameters combination would take too availability characterization is very time- and resource- much time (many years) to sort out all possible combi- consuming. nations, even with fewer than one hundred parameters. Using math theory methods, we will cut the number of We presented on the need to have a common integrated experiments down to a reasonable number and shorten approach, i.e., the PAC tool. We discussed the tool ar- the test cycle length in the case of using PAC for load chitecture and used examples that significantly simplify and stress testing purposes. and unify procedures of performance and availability characterization and may be used in any target problem Our discussion in the previous paragraph is about per- areas starting from Linux platform parameters tuning, formance tuning, but it is not the only task that we per- and finishing with load/stress testing and system behav- form using the PAC tool. Another important thing where ior modeling. the PAC tool is very valuable is the investigation of per- formance and availability issues. Currently, we perform 7 Acknowledgements

We would like to the members of the PAC team— namely Sergey Satskiy, Denis Nikolaev, and Alexander Velikotny—for their substantial contributions to the de- sign and development of the tool. We appreciate the Interface Math functionality Stability Compatibility Flexibility Project activity Automation Kst 7 5 5 7 8 8 8 review and feedback from of Ibrahim Haddad from Mo- Grace 6 7 8 7 8 6 7 torola Software Group. We are also thankful to the LabPlot 8 2 7 7 8 8 7 KDEedu (KmPlot) 7 2 2 7 2 2 8 team of students from Saint-Petersburg State Polytech- Metagraf-3D - - - 2 - - 6 nic University for their help in investigating possible ap- GNUPLOT 6 7 5 8 4 6 4 Root 6 7 8 8 8 8 9 proaches to information characterization. Finally, we would like to offer our appreciation to Mikhail Cher- Table 2: Visualization tools norutsky, Head of Telecommunication Department at Motorola Software Group in Saint-Petersburg, for his analysis of the results received manually through the use contributions in facilitating and encouraging the work of packages such as MiniTAB, which in many case is on this paper. time consuming. Our plan is to incorporate statistical analysis methods in the PAC tool in order to allow it to Ed. Note: the Formatting Team thanks Máirín Duffy for generate statistical analysis reports and to perform re- creating structured graphics versions of Figures 1, 2, 3, sults visualization automatically by using GNU GSL or and 4 with Inkscape. similar packages for data analysis, and such packages as GNUPLOT, LabPlot, or Root for visualization [9][6]. References

[1] Charles Audet and Dominique Orbany. Finding 6 Conclusion optimal algorithmic parameters using a mesh adaptive direct search, 2004. In this paper, we have provided an overview of specific http://www.optimization-online. performance and availability challenges encountered in org/DB_HTML/2004/12/1007.html. Linux servers running on telecommunication networks, and we demonstrated a strong correlation between these [2] Alberto Avritzer, Joe Kondek, Danielle Liu, and challenges and the current trend from Linux vendors to Elaine J. Weyuker. Software Performance Testing focus on improving the performance and availability of Based on Workload Characterization. In the Linux kernel. Proceedings of the 3rd international workshop on Software and performance, 2002. We have briefly described the existing means to address http://delivery.acm.org/10.1145/ basic performance and availability problem areas, and 590000/584373/p17-avritzer.pdf. 274 • Performance and Availability Characterization for Linux Servers

[3] Steve Best. Linux R Debugging and Performance 2006. http://www.linuxjournal.com/ Tuning: Tips and Techniques. Prentice Hall PTR, article/9128. 2005. ISBN 0-13-149247-0. [12] Daniel Haggander and Lars Lundberg. Attacking [4] Aaron B. Brown. A Decompositional Approach the Dynamic Memory Problem for SMPs. In the to Computer System Performance Evaluation, 13th International Conference on Parallel and 1997. http://www.flashsear.net/fs/ System (PDCS’2000). prof/papers/ University of Karlskrona/Ronneby Dept. of harvard-thesis-tr03-97.pdf. Computer Science, 2000. http: //www.ide.hk-r.se/~dha/pdcs2.ps. [5] Aaron B. Brown. Towards Availability and Maintainability Benchmarks: A Case Study of [13] Mary Jander. Will telecom love Linux?, 2002. Software RAID Systems. Computer science http://www.networkworld.com/ division technical report, UC Berkeley, 2001. newsletters/optical/01311301.html. http://roc.cs.berkeley.edu/ papers/masters.pdf. [14] Joshua Knowles. ParEGO: A Hybrid Algorithm with On-line Landscape Approximation for [6] Rene Brun and Fons Rademakers. ROOT - An Expensive Multiobjective Optimization Problems. Object Oriented Data Analysis Framework. In In Proceedings of IEEE TRANSACTIONS ON Proceedings of AIHENP conference in EVOLUTIONARY COMPUTATION, VOL. 10, Laussanne, 1996. http://root.cern.ch/ NO. 1, 2005. http://ieeexplore.ieee. root/Publications.htm. org/iel5/4235/33420/01583627.pdf. [7] CIAO. GANSO. Global And Non-Smooth [15] Jouni Korhonen. Performance Implications of the Optimization. School of Information Technology Multi Layer Mobility in a Wireless Operator and Mathematical Sciences, University of Networks, 2004. http://www.cs. Ballarat, 2005. Version 1.0 User Manual. helsinki.fi/u/kraatika/Courses/ http://www.ganso.com.au/ganso.pdf. Berkeley04/KorhonenPaper.pdf. [8] Christian Engelmann and Stephen L. Scott. [16] Rick Lehrbaum. Embedded Linux Targets Symmetric Active/Active High Availability for Telecom Infrastructure. LINUX Journal, May High-Performance Computing System Services. 2002. http://www.linuxjournal.com/ JOURNAL OF COMPUTERS, December 2006. article/5850. http://www.academypublisher.com/ jcp/vol01/no08/jcp01084354.pdf . [17] Karim Mattar, Ashwin Sridharan, Hui Zang, [9] Mark Galassi, Jim Davies, James Theiler, Brian Ibrahim Matta, and Azer Bestavros. TCP over Gough, Gerard Jungman, Michael Booth, and CDMA2000 Networks : A Cross-Layer Fabrice Rossi. GNU Scientific Library. Free Measurement Study. In Proceedings of Passive Software Foundation, Inc., 2006. Reference and Active Measurement Conference (PAM 2007). http: Manual. Edition 1.8, for GSL Version 1.8. Louvain-la-neuve, Belgium, 2007. http://sscc.northwestern.edu/ //ipmon.sprint.com/publications/ uploads/1xRTT_active.pdf docs/gsl-ref.pdf. .

[10] Ibrahim Haddad. Open-Source Web Servers: [18] Kiran Nagaraja, Neeraj Krishnan, Ricardo Performance on a Carrier-Class Linux Platform. Bianchini, Richard P. Martin, and Thu D. Nguyen. Linux Journal, November 2001. Quantifying and Improving the Availability of http://www2.linuxjournal.com/ High-Performance Cluster-Based Internet article/4752. Services. In Proceedings of the ACM/IEEE SC2003 Conference (SCŠ03). ACM, 2003. [11] Ibrahim Haddad. Linux and Open Source in http://ieeexplore.ieee.org/iel5/ Telecommunications. Linux Journal, September 10619/33528/01592930.pdf. 2007 Linux Symposium, Volume One • 275

[19] Neeraj Nehra, R.B. Patel, and V.K. Bhat. A Framework for Distributed Dynamic Load Balancing in Heterogeneous Cluster. Journal of Computer Science v3, 2007. http://www.scipub.org/fulltext/ jcs/jcs3114-24.pdf.

[20] Nortel. Nortel MSC Server. The GSM-UMTS MSC Server, 2006. http://www.nortel.com/solutions/ wireless/collateral/nn117640.pdf.

[21] Hong Ong, Rajagopal Subramaniyan, and R. Scott Studham. Performance Implications of the Multi Layer Mobility in a Wireless Operator Networks, 2005. Performance Modeling and Application Profiling Workshop, SDSC,http://xcr.cenit.latech.edu/ wlc/papers/openwlc_sd_2005.pdf.

[22] Sameer Shende, Allen D. Malony, and Alan Morris. Workload Characterization using the TAU Performance System, 2007. http://www.cs. uoregon.edu/research/paracomp/ papers/talks/para06/para06b.pdf.

[23] Charles P. Wright, Nikolai Joukov, Devaki Kulkarni, Yevgeniy Miretskiy, and Erez Zadok. Towards Availability and Maintainability Benchmarks: A Case Study of Software RAID Systems. In proceedings of the 2005 Annual USENIX Technical Conference, FREENIX Track, 2005. http://www.am-utils.org/docs/ apv2/apv2.htm. 276 • Performance and Availability Characterization for Linux Servers “Turning the Page” on Hugetlb Interfaces

Adam G. Litke IBM [email protected]

Abstract This paper will discuss extensions provided by hardware which can serve to alleviate TLB coverage issues. Prop- HugeTLBFS was devised to meet the needs of large erly leveraging these extensions in the operating sys- database users. Applications use filesystem calls to ex- tem is challenging due to the persistent trade-offs be- plicitly request superpages. This interface has been ex- tween code complexity, performance benefits, and the tended over time to meet the needs of new users, leading need to maintain system stability. We propose an exten- to increased complexity and misunderstood semantics. sible mechanism that enables TLB coverage issues to be For these reasons, hugeTLBFS is unsuitable for poten- solved without adverse effects. tial users like HPC, embedded, and the desktop. This paper will introduce a new interface abstraction for su- perpages, enabling multiple interfaces to coexist, each 2 Hardware Management with separate semantics. Virtual memory is a technique employed by most mod- To begin, a basic introduction to virtual memory and ern and operating systems. The con- how page size can affect performance is described. cept can be implemented in many ways but this paper fo- Next, the differing semantic properties of common cuses on the Linux virtual memory manager (VM). The memory object types will be discussed with a focus on physical memory present in the system is divided into how those differences relate to superpages. A brief his- equally-sized units called page frames. Memory within tory of hugeTLBFS and an overview of its design is pre- these page frames can be referred to by a hardware- sented, followed by an explanation of some of its prob- assigned location called a physical address. Only the lems. Next, a new abstraction layer that enables alter- operating system kernel has direct access to page frames native superpage interfaces to be developed is proposed. via the physical address. Programs running in user mode We describe some extended superpage semantics and a must access memory through a virtual address. This ad- character device that could be used to implement them. dress is software-assigned and arbitrary. The VM is re- The paper concludes with some goals and future work sponsible for establishing the mapping from virtual ad- items. dresses to physical addresses so that the actual data in- side of the page frame can be accessed. In addition to 1 Introduction address translation, the VM is responsible for knowing how each page frame is being used. Quite a few data Address translation is a fundamental operation in virtual structures exist for tracking this information. A page ta- memory systems. Virtual addresses must be converted ble entry (PTE), besides storing a physical address, lists to physical addresses using the system page tables. The the access permissions for a page frame. The assigned Translation Lookaside Buffer (TLB) is a hardware cache permissions are enforced at the hardware level. A struct that stores these translations for quick retrieval. While page is also maintained for each page frame in the sys- system memory sizes increase exponentially, the TLB tem [2]. This structure stores status information such as has remained small. TLB coverage, the percent of total the number of current users and whether the page frame virtual memory that can be translated into physical ad- is undergoing disk I/O. dresses directly through the cache, has decreased by a factor of 100 in the last ten years [4]. 
This has resulted When a program accesses a block of memory for the in a greater number of TLB misses and reduced system first time, a translation will not be available to the CPU performance. so control is passed to the VM. A page frame is allocated

• 277 • 278 • “Turning the Page” on Hugetlb Interfaces

page ta- pgd pmd pte page for the process and a PTE is inserted into the offset offset offset offset bles. The TLB is filled with the new virtual to physical 2 bits 9 bits 9 bits 12 bits pmd translation and execution continues. Subsequent refer- pte ences to this page will trigger a TLB look-up in hard- ware. A TLB hit occurs if the translation is found in pgd 2^12 = 4k the cache and execution can continue immediately. Fail- ure to find the translation is called a TLB miss. When a TLB miss occurs, a significant amount of overhead is 2^21 = 2M incurred. The page tables must be traversed to find the pte page frame which results in additional memory refer- 2 bits 9 bits 21 bits pgd pte page ences. offset offset offset

The TLB is conceptually a small array, each slot con- Figure 1: Example page table structure given different taining translation information for one page of memory. page sizes Over time it has remained small, generally containing 128 or fewer entries [6]. On a system where the page size is 4 KiB, a TLB with 128 slots could cache transla- measure the performance of superpages. Some work- tions for 512 KiB of memory. This calculation provides loads such as scientific applications and large databases a measure of TLB reach. Maximizing TLB reach is a commonly see gains between 10 and 15 percent when worthy endeavor because it will reduce TLB misses. using superpages. Gains as high as 30% have been mea- sured in some cases [5]. Superpages are ideal for an The increasing complexity of today’s applications re- application that has a persistent, large, densely accessed quire large working sets. A working set is the smallest working set. The best candidates tend to be computa- collection of information that must be present in main tionally intensive. memory to ensure efficient operation of the program [1]. When TLB reach is smaller than the working set, a prob- Virtual memory management with superpages is more lem may arise for programs that do not exhibit locality complex. Due to the way a virtual address indexes the of reference. The principle states that data stored in the page table structures and the target page, multiple page same place is likely to be used at the same time. If a sizes necessitate multiple page table formats. Figure 1 large working set is accessed sparsely, the limited num- contrasts the page table layout for two different page ber of TLB entries will suffer a cycle of eviction and sizes using the x86 architecture as an example. As the refilling called TLB thrashing. figure shows, a virtual address is divided into a series of offsets. The page offset must use enough bits to refer- Scientific applications, databases, and various other ence every byte in the page to which the virtual address workloads exhibit this behavior and the resulting refers. A 4 KiB page requires 12 bits but a 2 MiB page marginalization of the TLB. One way to mitigate the needs 21. Since, in this example, the virtual address is problem is to increase TLB reach. As a simple prod- only 32 bits, supporting a big page offset necessitates uct, it can be increased either by enlarging the TLB or changes to the other offsets. In this example, the pmd by increasing the page size. Hardware vendors are ad- page table level is simply removed which results in a dressing the issue by devoting more silicon to the TLB two level page table structure the 2 MiB pages. and by supporting the use of an increasing number of page sizes. It is up to the operating system to leverage For most workloads, superpages have no effect or may these hardware features to realize the maximum bene- actually hinder performance. Due to internal fragmen- fits. tation, larger page sizes can result in wasted memory and additional, unnecessary work for the operating sys- tem. For example, to satisfy security requirements a 2.1 Superpages newly allocated page must be filled with zeroes before it is given to a process. Even if the process only intends Superpages is the term used to refer to any page with a to use a small portion of the page, the entire page must size that is greater than the base page size. Significant be allocated and zeroed. 
This condition is worst for ap- research has been done to develop algorithms for, and plications that sparsely access their allocated memory. 2007 Linux Symposium, Volume One • 279

Although superpages are complex and improve perfor- a disk or other permanent storage. When pages of this mance only for programs with a specific set of charac- memory type are accessed, the virtual memory manager teristics, where they do help the impact is great. For transparently performs any disk I/O necessary to ensure this reason, any implementation of superpages should be the in memory copy of the data is in sync with the mas- flexible to maximize performance potential while plac- ter version on disk. Maintaining the coherency of many ing as small of a burden on the system as possible. copies of the same data is a significant source of com- plexity in the Linux memory management code.

3 Address Space Management Anonymous memory areas generally have no associated backing data store. Under memory pressure, the data By insulating the application from the physical mem- may be associated with a swap file and written out to ory layout, the kernel can regulate memory usage to im- disk, but this is not a persistent arrangement as with file- prove security and performance. Programs can be guar- backed areas. When pages are allocated for this type of anteed private memory allowing sharing only under con- memory, they are filled with zeroes. trolled circumstances. Optimizations, such as automat- ically sharing common read-only data among processes Linux is focused on furnishing superpage memory with is made possible. This leads to memory with different anonymous and shared semantics. This means that su- semantic characteristics, two of which are particularly perpages can be trivially used with only a small subset relevant when discussing superpages and their imple- of the commonly used memory objects. mentation in Linux. 3.3 Memory Objects 3.1 Shared Versus Private Memory is used by applications for many different pur- A block of memory can be either shared or private to poses. The application places requirements on each a process. When memory is being accessed in shared area, for example the code must be executable and the mode, multiple processes can read and, depending on data writable. The operating system enforces additional access permissions, write to the same set of pages which constraints, such as preventing writes to the code or pre- are shared among all the attached processes. This means venting execution of the stack. Together these define the that modifications made to the memory by one process required semantics for each area. will be seen by every other process using that memory. The stack is crucial to the procedural programming This mode is clearly useful for things such as interpro- model. It is organized into frames, where each frame cess communication. stores the local variables and return address for a func- Memory is most commonly accessed with private se- tion in the active function call chain. Memory in this mantics. A private page appears exclusive to one ad- area is private and anonymous and is typically no larger dress space and therefore only one process may modify than a few base pages. The access pattern is localized its contents. As an optimization, the kernel may share a since executing code tends to only access data in its own private page among multiple processes as long as it re- frame at the top of the stack. It is unlikely that such a mains unmodified. When a write is attempted on such a small, densely accessed area would benefit from super- page it must first be unshared. This is done by creating pages. a new copy of the original page and permitting changes Another class of objects are the memory resident com- only to the new copy. This operation is called copy on ponents of an executable program. These can be fur- write (COW) [2]. ther divided into the machine executable code (text) and the program variables (data). These objects use file- 3.2 File-backed Versus Anonymous backed, private memory. The access pattern depends heavily on the specific program, but when they are suf- The second semantic characteristic concerns the source ficiently large, superpages can provide significant per- of the data that occupies a memory area. Memory can formance gains. The text and data segments of shared be either file-backed or anonymous. 
File-backed mem- libraries are handled slightly different but have similar ory is essentially an in memory cache of file data from semantic properties to executable program segments. 280 • “Turning the Page” on Hugetlb Interfaces

Stack 4 Introduction to HugeTLBFS

Anonymous In 2002, developers began posting patches to make su- Mmap Areas Private Shared perpages available to Linux applications. Several ap- proaches were proposed and, as a result of the ensuing File-backed discussions, a set of requirements evolved. First and Mmap Areas foremost, the existing VM code could not be unduly

Private Shared Shared complicated by superpage support. Second, databases Writable Read-only were the application class most likely to benefit from superpage usage at the time, so the interface had to be Shared Memory HugeTLBFS Segments suitable for them. was deemed to satisfy these design requirements and was merged at the end of 2002. Shared Libraries Text Text Text HugeTLBFS is a RAM-based filesystem. The data it contains only exists within superpages in mem- Data Data Data ory. There is no backing store such as a hard disk. HugeTLBFS supports one page size called a huge page. Heap The size of a huge page varies depending on the archi- tecture and other factors. To access huge page backed memory, an application may use one of two methods. A file can be created and mmap()ed on a mounted hugeTLBFS filesystem, or a specially created shared Program Data and BSS memory segment can be used. HugeTLBFS continues to serve databases well, but other applications use memory

Program Text in different ways and find this mechanism unsuitable. The last few years have seen a dramatic increase in en- Figure 2: Memory object types thusiasm for superpages from the scientific community. Researchers run multithreaded jobs to process massive amounts of data on fast computers equipped with huge The heap is used for dynamic memory allocation such amounts of memory. The data sets tend to reside in as with malloc. It is private, anonymous memory and portions of memory with private semantics such as the can be fairly trivially backed with superpages. Programs BSS and heap. Heavy usage of these memory areas is a that use lots of dynamic memory may see performance general characteristic of the Fortran programming lan- gains from superpages. guage. To accommodate these new users, private map- ping support was added to hugeTLBFS in early 2006. Shared memory segments are used for interprocess LibHugeTLBFS was written to facilitate the remapping communication. Processes attach them by using a of executable segments and heap memory into huge global shared memory identifier. Database management pages. This improved performance for a whole new systems use large shared memory segments to cache class of applications, but for a price. databases in memory. Backing that data with super- pages has demonstrated substantial benefits and inspired For shared mappings, the huge pages are reserved at cre- the inclusion of superpage support in Linux. ation time and are guaranteed to be available. Private mappings are subject to non-deterministic COW opera- It is already possible to use superpages for some mem- tions which make it impossible to determine the number ory types directly. Shared memory segments and spe- of huge pages that will actually be needed. For this rea- cial, explicit memory maps are supported. By using li- son, successful allocation of huge pages to satisfy a pri- braries, superpages can be extended in a limited way to vate mapping is not guaranteed. Huge pages are a scarce a wider variety of memory objects such as the heap and resource and they frequently run out. If this happens program segments. while trying to satisfy a huge page fault, the application 2007 Linux Symposium, Volume One • 281 will be killed. This unfortunate consequence makes the User Space use of private huge pages unreliable. System Calls

The data mapped into huge pages by applications is ac- VFS cessible in files via globally visible mount points. This Address Space Operations ext2fs makes it easy for any process with sufficient privileges VM Operations to read, and possibly modify, the sensitive data of an- File Operations other process. The problem can be partially solved through careful use of multiple hugeTLBFS mounts Address Space Operations jfs with appropriate permissions. Despite safe file per- VM Operations missions, an unwanted covert communication channel File Operations could still exist among processes run by the same user.

Address Space Operations hugetlbfs HugeTLBFS has become a mature kernel interface with VM Operations countless users who depend on consistency and stability. File Operations Adapting the code for new superpage usage scenarios, Page Table Operations or even to fix the problems mentioned above has become difficult. As hardware vendors make changes to solve Virtual Memory TLB coverage issues such as adding support for multiple Manager superpage sizes, Linux is left unable to capitalize due to the inflexibility of the hugeTLBFS interface. Page Tables

For a real example of these problems, one must Hardware only look back to the addition of demand faulting to hugeTLBFS. Before enablement of this feature, huge pages were always prefaulted. This means that Figure 3: How hugeTLBFS interacts with the rest of the when huge pages were requested via an mmap() or kernel shmat() system call, all of them were allocated and installed into the mapping immediately. If enough pages were not available or some other error occurred, the sys- superpage usage, a mechanism for testing competing tem call would fail right away. Demand faulting de- and potentially incompatible semantics and interfaces lays the allocation of huge pages until they are actually is needed. Making this possible requires overriding the needed. A side effect of this change is that huge page superpage-related special cases that are currently hard- allocation errors are delayed along with the faults. wired to hugeTLBFS.

A commercial database relied on the strict accounting In finding a solution to this problem, it is important to semantics provided by prefault mode. The program remember the conditions that grounded the development mapped one huge page at a time until it failed. This ef- of hugeTLBFS. Additional complexity cannot be added fectively reserved all available huge pages for use by the to the VM. Transparent, fully-integrated superpage sup- database manager. When hugeTLBFS switched to de- port is thus an unrealistic pursuit. As with any kernel mand faulting, the mmap() calls would never fail so the change, care must be taken to not disturb existing users. algorithm falsely assumed an inflated number of huge The behavior of hugeTLBFS must not change and per- pages were available and reserved. Even with a smarter formance regressions of any kind must be avoided. Fur- huge page reservation algorithm, the database program ther, a complete solution must fully abstract existing could be killed at a non-deterministic point in the future special cases but remain extensible. when the huge page pool is exhausted. Like any other filesystem, hugeTLBFS makes use of 5 Page Table Abstraction an abstraction called the virtual filesystem (VFS) layer. This object-oriented interface is what makes it possible HugeTLBFS is complex and its semantics are rigid. for the VM to consistently interact with a diverse array This makes it an unsuitable vehicle for future develop- of filesystems. Many of the special cases introduced by ment. To quantify the potential benefits of expanded supporting superpages are hidden from the VM through 282 • “Turning the Page” on Hugetlb Interfaces hugeTLBFS’ effective use of the VFS API. This inter- collected into an operations structure without modi- face was not designed to solve page size issues so some fying them in any way. Instead of calling hard-wired parts of the kernel remain hugeTLBFS-aware. Most of functions, the function to call is looked up in the page these cases are due to the need for an alternate page table table operations. Changing only these two elements format for superpages. Figure 3 shows how hugeTLBFS ensures that hugeTLBFS behavior is unchanged. fits into the VFS layer as compared to other filesystems. Superpages exist solely for performance reasons so an Taking the VFS layer as inspiration, we propose an ab- implementation that is inefficient serves only to under- straction for page table operations that is capable of un- mine its own existence. Two types of overhead are wiring the remaining hugeTLBFS-specific hooks. It is considered with respect to the proposed changes. One simply a structure of six function pointers that fully rep- pointer will be added to the VMA. Already, this struc- resent the set of special page table operations needed. ture does not fit into a cache line on some CPUs, so care must be taken to place the new field at a sensi- fault() is called when a page fault occurs somewhere ble offset within the structure definition. To this end, inside a VMA. This function is responsible for finding the pagetable_ops pointer has been placed near the the correct page and instantiating it into the page tables. vm_ops and vm_flags fields, which are also used frequently when manipulating page tables. copy_vma() is used during fork to copy page table entries from a VMA in the parent process to a VMA in The second performance consideration is related to the the child process. pointer indirection added when a structure of function pointers is used. 
Instead of simply jumping to an ad- change_protection() is called to modify the pro- dress that can be determined at link time, a small amount tection bits of PTEs for a range of memory. of additional work is required. The address of the as- signed operations is looked up in the VMA. This ad- pin_pages() is used to instantiate the pages in a dress, plus an offset, is dereferenced to yield the ad- range of userspace memory. The kernel is not gen- dress of the function to be called. VMAs that do not erally permitted to take page faults. When accessing implement the page table operations do not incur any userspace, the kernel must first ensure that all pages in overhead. Instead of checking vm_flags for VM_ the range to be accessed are present. HUGETLB, pagetable_ops is checked for a NULL unmap_page_range() is needed when unmapping value. a VMA. The page tables are traversed and all instan- tiated pages for a given memory range are unmapped. 6 Using the Abstraction Interface The PTEs for the covered range are cleared and any pages which are no longer in use are freed. With a small amount of simple abstraction code, the ker- nel has become more flexible and is suitable for further free_pgtable_range() is called after all of the superpage development. HugeTLBFS and its existing PTEs in a mapping are cleared to release the pages that users remain undisturbed and performance of the sys- were used to store the actual page tables. tem has not been affected in any measurable way. We now describe some possible applications of this new ex- 5.1 Evaluating the Solution tensible interface. A character device is an attractive alternative to the The page table operations produce a complete and ex- existing filesystem interface for several reasons. Its tensible solution. All the infrastructure is in place to semantics provide a more natural and secure method enable hugeTLBFS to be built as a module, further sep- for allocating anonymous, private memory. The inter- arating it from the core VM. Other independent super- face code is far simpler than that of hugeTLBFS which page interface modules can be added to the system with- makes it a much more agile base for the following ex- out causing interference. tensions.

The implementation is simple. The existing, The optimal page size for a particular memory area de- hugeTLBFS page table manipulation functions are pends on many factors such as: total size, density of 2007 Linux Symposium, Volume One • 283

struct pagetable_operations_struct page_pagetable_ops = { .copy_vma = copy_hugetlb_page_range, .pin_pages = follow_hugetlb_page, .unmap_page_range = unmap_hugepage_range, .change_protection = hugetlb_change_protection, .free_pgtable_range = hugetlb_free_pgd_range, .fault = page_fault, };

Figure 4: A sample page table operations structure access, and frequency of access. If the page size is 6.1 A Simple Character Device too small, the system will have to service more page faults, the TLB will not be able to cache enough virtual The character device is a simple driver modeled after to physical translations, and performance will suffer. If /dev/zero. While the basic functionality could be the page size is too large, both time and memory are implemented via hugeTLBFS, that would subject it to R wasted. On the PowerPC architecture, certain proces- the functional limitations previously described. Addi- sors can use pages in sizes of 4KiB, 64KiB, 16MiB, and tionally, it could not be used to support development of larger. Other platforms can also support more than the the extensions just described. For these reasons, the im- two page sizes Linux allows. Architecture specific code plementation uses page table abstraction and is indepen- can be modified to support more than two page sizes at dent of hugeTLBFS. the same time. Fitted with a mechanism to set the de- sired size, a character device will provide a simple page The code is self contained and can be divided into three allocation interface and make it possible to measure the distinct components. The first component is the stan- effects of different page sizes on a wide variety of appli- dard infrastructure needed by all drivers. This device cations. is relatively simple and needs only a small function to register with the driver layer.

Superpages are a scarce resource. Fragmentation of The second component is the set of device-specific memory over time makes allocation of large, contiguous structures that define the required abstracted operations. blocks of physical memory nearly impossible. Without Figure 4 shows the page table operations for the charac- contiguous physical memory, superpages cannot be con- ter device. Huge page utility functions are already well structed. Unlike with normal pages, swap space cannot separated from the hugeTLBFS interface which makes be used to reclaim superpages. One strategy for dealing code reuse possible. Five of the six page table opera- with this problem is demotion. When a superpage al- tions share the same functions used by hugeTLBFS. location fails, a portion of the process’ memory can be The third component is what makes this device unique. converted to use normal pages. This allows the process It handles page faults differently than hugeTLBFS so to continue running in exchange for a sacrifice of some a special fault() function is defined in the page of the superpage performance benefits. table operations. This function is simpler than the hugeTLBFS fault handler because it does not need to handle shared mappings or filesystem operations such Selecting the best page size for the memory areas of a as truncation. Most of the proposed semantic changes process can be complex because it depends on factors can be implemented by further modifying the fault han- such as application behavior and the general system en- dler. vironment. It is possible to let the system choose the best page size based on variables it can monitor. For ex- ample, a population map can be used to keep track of 7 Conclusions allocated base pages in a memory region [5]. Densely populated regions could be promoted to superpages au- The page table operations abstraction was developed tomatically. to enable advanced superpage development and work 284 • “Turning the Page” on Hugetlb Interfaces around some problems in hugeTLBFS. It does not alter 9 Legal Statement the behavior of the current interface nor complicate the kernel in any significant way. No performance penalty This work represents the view of the authors and does not could be measured. Work on the topics previously de- necessarily represent the view of IBM. scribed can now begin. Linux R is a trademark of Linus Torvalds in the United States, other countries, or both. The described changes have also led to some general improvements in the way hugeTLBFS interacts with Other company, product, and service names may be the trade- the rest of the kernel. By collecting a set of dispersed marks or service marks of others. hugeTLBFS-specific page table calls into one structure, the list of overloaded operations becomes clear. This This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement API, when coupled with other pending cleanups, re- No. HR0011-07-9-0002. moves hard-coded special cases from the kernel. The result is a hugeTLBFS implementation that is even fur- ther “on the side” and decoupled from the core memory References manager. [1] P. Denning. The working set model for program behavior. Communications of the ACM, 8 Future Work 11(5):323–333, 1968. [2] M. Gorman. Understanding the Linux Virtual During development of the page table abstraction inter- Memory Manager. Prentice Hall Upper Saddle face and the character device, a few additional oppor- River, NJ, 2004. 
tunities to clean up the interface between hugeTLBFS and the rest of the kernel became apparent. Every effort [3] M. Gorman and A. Whitcroft. The What, The Why should be made to extricate the remaining hugeTLBFS and the Where To of Anti-Fragmentation. Ottawa special cases from the kernel. Moving superpage- Linux Symposium, 1:369–384, 2006. related logic behind the appropriate abstractions makes [4] J. Navarro. Transparent operating system support for a simpler VM and at the same time enables richer for superpages. PhD thesis, RICE UNIVERSITY, superpage support. 2004.

The work presented in this paper enables a substantial [5] J. Navarro, S. Iyer, P. Druschel, and A. Cox. body of research and development using Linux. We Practical, transparent operating system support for intend to implement different superpage semantics and superpages. ACM SIGOPS Operating Systems measure their performance effects across a variety of Review, 36(si):89, 2002. workloads and real applications. If good gains can be achieved with reasonable code we hope to see those [6] T. Selén. Reorganisation in the skewed-associative gains realized outside of our incubator in production TLB. Department of Information Technology, kernels. Uppsala University.

A separate effort to reduce memory fragmentation is un- derway [3]. If this body of work makes it into the kernel, it will have a positive effect on the usability of super- pages in Linux. When contiguous blocks of physical memory are readily available, superpages can be allo- cated and freed as needed. This makes them easier to use in more situations and with greater flexibility. For example, page size promotion and demotion become more effective if memory can be allocated directly from the system instead of from the static pool of superpages that exists today. Resource Management: Beancounters

Pavel Emelianov Denis Lunev Kirill Korotaev xemul@.org [email protected] [email protected]

Abstract RLIMIT_CORE and RLIMIT_RSS limits are mostly ignored by the Linux kernel. The paper outlines various means of resource man- Again, most of these resource limits apply to a single agement available in the Linux kernel, such as per- process, which means, for example, that all the memory process limits (the setrlimit(2) interface), shows may be consumed by a single user running the appropri- their shortcomings, and illustrates the need for another ate number of processes. Setting the limits in such a way resource control mechanism: beancounters. as to have the value of multiplying the per-process limit Beancounters are a set of per-process group parameters by the number of processes staying below the available (proposed and implemented by and Andrey values is impractical. Savochkin and further developed for OpenVZ) which can be used with or without containers. 2 Beancounters Beancounters history, architecture, goals, efficiency, and some in-depth implementation details are given. For some time now Linus has been accepting patches adding the so-called namespaces into the kernel. Namespaces allow tasks to observe their own set of par- 1 Current state of resource management in the ticular kernel objects such as IPC objects, PIDs, network Linux kernel devices and interfaces, etc. Since the kernel provides such a grouping for tasks, it is necessary to control the Currently the Linux kernel has only one way to con- resource consumption of these groups. trol resource consumption of running processes—it is UNIX-like resource limits (rlimits). The current mainline kernel lacks the ability to track the kernel resource usage in terms of arbitrary task groups, Rlimits set upper bounds for some resource usage pa- and this ability is required rather badly. rameters. Most of these limits apply to a single process, except for the limit for the number of processes, which The list of the resources that the kernel provides is huge, applies to a user. but the main resources are:

The main goal of these limits is to protect processes • memory, from an accidental misbehavior (like infinite memory consumption) of other processes. A better way to or- • CPU, ganize such a protection would be to set up a minimal amount of resources that are always available to the pro- • IO bandwidth, cess. But the old way (specifying upper limits to catch some cases of excessive consumption) may work, too. • disk space, • Networking bandwidth. It is clear that the reason for introducing limits was the protection against an accidental misbehavior of pro- cesses. For example, there are separate limits for the This article describes the architecture, called “bean- data and stack size, but the limit on the total memory counters,” which the OpenVZ team proposes for con- consumed by the stack, data, BSS, and memory mapped trolling the first resource (memory). Other resource con- regions does not exist. Also, it is worth to note that the trol mechanisms are outside the scope of this article.

• 285 • 286 • Resource Management: Beancounters

2.1 Beancounters history More precisely, each beancounter contains not just usage-limit pairs, but a more complicated set. It in- cludes the usage, limit, barrier, fail counter, and max- All the deficiencies of the per-process resource account- imal usage values. ing were noticed by Alan Cox and Andrey Savochkin long ago. Then Alan introduced an idea that crucial ker- The notion of “barrier” has been introduced to make it nel resources must be handled in terms of groups and set possible to start rejecting the allocation of a resources the “beancounter” name for this group. He also stated before the limit has been hit. For example when a bean- that once a task moves to a separate group it never comes counter hits the mappings barrier, the subsequent sbrk back and neither do the resources allocated on its re- and mmap calls start to fail, although execve, which quests. maps some areas, still works. This allows the adminis- trator to “warn” the tasks that a beancounter is short of These ideas were further developed by Andrey Sav- resources before the corresponding group hits the limit ochkin, who proposed the first implementation of bean- and the tasks start dying due to unrecoverable rejections counters [UB patch]. It included the tracking of process of resource allocation. page tables, the numbers of tasks within a beancounter, and the total length of mappings. Further versions in- Allocating a new resource is preceded by “charging” it cluded such resources as file locks, pseudo terminals, to the beancounter the current task belongs to. Essen- open files, etc. tially the charging consists of an atomic checking that the amount of resource consumed doesn’t exceed the Nowadays the beancounters used by OpenVZ control limit, and adding the resource size to the current usage. the following resources: Here is a list of memory resources controlled by the beancounters subsystem: • kernel memory,

• user-space memory, • total size of allocated kernel space objects,

• number of tasks, • size of network buffers,

• number of open files and sockets, • lengths of mappings, and • number of physical pages. • number of PTYs, file locks, pending signals, and iptable rules, Below are some details on the implementation. • total size of network buffers,

• active dentry cache size, i.e. the size of the dentries 3 Memory management with beancounters that cannot be shrunk, This section describes the way beancounters are used to • dirty page cache that is used to track the IO activity. track memory usage. The kernel memory is accounted separately from the userspace one. First, the userspace memory management is described. 2.2 The beancounters basics

3.1 Userspace memory management The core object is the beancounter itself. The bean- counter represents a task group which is a resource con- The kernel defines two kinds of requests related to mem- sumer. ory allocation. Basically a beancounter consists of an ID to make it possible to address the beancounter from user space for 1. The request to allocate a memory region for phys- changing its parameters and the set of usage-limit pairs ical pages. This is done via the mmap(2) system to control the usage of several kernel resources. call. The process may only work within a set of 2007 Linux Symposium, Volume One • 287

mmap-ed regions, which are called VMAs (virtual Process memory areas). Such a request does not lead to al- Unreclaimable memory location of any physical memory to the task. On VMA the other hand, it allows for a graceful reject from the kernel space, i.e. an error returned by the sys- Unused tem call makes it possible for the task to take some pages actions rather than die suddenly and silently. P r Touched i 2. The request to allocate a physical page within one pages P v of the VMAs created before. This request may also h v create some kernel space objects, e.g. page tables y Reclaimable m s entries. The only way to reject this request is to VMA p p send a signal—SEGV, BUS, or KILL—depending a a on the failure severity. This is not a good way of g g rejecting as not every application expects critical Touched e e pages s signals to come during normal operation. s

The beancounters introduce the “unused page” term to indicate a page in the VMA that has not yet been touched, i.e. a page whose VMA has already been al- Unused pages located with mmap, but the page itself is not yet there. The “used” pages are physical pages.

VMAs may be classified as:

• reclaimable VMAs, which are backed by some file on the disk. For example, when the kernel needs some more memory, it can safely push the pages from such a VMA to the disk with almost no limitation.

• unreclaimable VMAs, which are not backed by any file; the only way to free the pages within such VMAs is to push them to the swap space. This way has certain limitations, in that the system may have no swap at all or the swap size may be too small. The reclamation of pages from this kind of VMA is more likely to fail in comparison with the previous kind.

In the terms defined above, the OpenVZ beancounters account for the following resources:

• the number of used pages within all the VMAs,

• the sum of unused and used pages within unreclaimable VMAs and used pages in reclaimable VMAs.

The first resource is account-only (i.e. not limited) and is called "physpages." The second one is limited in respect of only unused pages allocation and is called "privvmpages." This is illustrated in Figure 1.

Figure 1: User space memory management (diagram of a process's reclaimable and unreclaimable VMAs, with their touched and unused pages, omitted).

The only parameter that remains unlimited—the size of pages touched from a disk file—does not create a security hole, since the size of files is limited by the OpenVZ disk quotas.

3.2 Physpages management

While privvmpages accounting works with whole pages, physpages accounts for page fractions in case some pages are shared among beancounters [RSS]. This may happen if a task changes its beancounter or if tasks from different groups map and use the same file.

This is how it works. There is a many-to-many dependence between the mapped pages and the beancounters in the system. This dependence is tracked by a ring of page beancounters associated with each mapped page (see Figure 2). What is important is that each page has its associated circular list of page beancounters, and the head of this list has a special meaning.

Figure 2: Page shared among beancounters (diagram of a page and its ring of page beancounters BC1 . . . BCN omitted).

Naturally, if a page is shared among N beancounters, each beancounter is to get a 1/N-th part of it. This approach would be the most precise, but also the worst, because adding a page to a new beancounter would require an update of all the beancounters among which the page is already shared.

Instead of doing that, parts of the page equal to

    1 / 2^shift(page, beancounter)

are charged, where shift(page, beancounter) is calculated such that

    Σ_beancounters 1 / 2^shift(page, beancounter) = 1

is true for each page.

When mapping a page into a new beancounter, half of the part charged to the head of the page's beancounters list is moved to the new beancounter. Thus, when the page is sequentially mapped into 4 different beancounters, its fractions would look like:

    bc1   bc2   bc3   bc4
    1
    1/2   1/2
    1/2   1/4   1/4
    1/4   1/4   1/4   1/4

When unmapping a page from a beancounter, the fraction of the page charged to this beancounter is returned to one or two of the beancounters on the page list. For example, when unmapping the page from bc4, the fractions change like

    {1/4, 1/4, 1/4, 1/4} → {1/2, 1/4, 1/4}

i.e. the fraction from bc4 was added to only one beancounter—bc3.

Next, when unmapping the page from bc3, whose fraction is now 1/2, its charge will be added to two beancounters, to keep the fractions powers of two:

    {1/2, 1/4, 1/4} → {1/2, 1/2}

With that approach the following has been achieved:

• the algorithm of adding and removing references to beancounters has O(1) complexity;

• the sum of the physpages values from all the beancounters is the number of RAM pages actually used in the kernel.
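The halving scheme can be modeled in a few lines of userspace C. This toy (with invented helper names) only demonstrates the arithmetic; the kernel tracks the parts in a ring of page beancounters and performs the same updates in O(1) via the ring head, while the toy simply searches. Tie-breaking may assign parts to different sharers than the table above, but the multiset of parts matches:

```c
#include <assert.h>
#include <stdio.h>

#define MAX_SHARERS 16

static int shift[MAX_SHARERS];	/* sharer i holds 1/2^shift[i] */
static int nshare;

/* Map the page into one more beancounter: halve the largest part
 * (the "head") and hand the other half to the newcomer. */
static void frac_map(void)
{
	int i, head = 0;

	if (nshare == 0) {
		shift[nshare++] = 0;	/* sole owner: the whole page */
		return;
	}
	for (i = 1; i < nshare; i++)
		if (shift[i] < shift[head])
			head = i;
	shift[head]++;
	shift[nshare++] = shift[head];
}

/* Give a part of size 1/2^s back to the remaining sharers, keeping
 * every part a power of two: double an equal part if one exists,
 * otherwise split into two halves and place each recursively. */
static void give_back(int s)
{
	int i;

	for (i = 0; i < nshare; i++)
		if (shift[i] == s) {
			shift[i]--;	/* 1/2^s + 1/2^s = 1/2^(s-1) */
			return;
		}
	assert(s < 30);			/* cannot happen for valid states */
	give_back(s + 1);
	give_back(s + 1);
}

/* Unmap sharer i, redistributing its part. */
static void frac_unmap(int i)
{
	int s = shift[i];

	shift[i] = shift[--nshare];
	if (nshare > 0)
		give_back(s);
}

int main(void)
{
	int i;

	for (i = 0; i < 4; i++)
		frac_map();	/* parts: 1/4 1/4 1/4 1/4 */
	frac_unmap(3);		/* parts: 1/2 1/4 1/4 (one sharer doubled) */
	frac_unmap(0);		/* parts: 1/2 1/2 (two sharers doubled)    */
	for (i = 0; i < nshare; i++)
		printf("1/%d ", 1 << shift[i]);
	printf("\n");
	return 0;
}
```

Running it reproduces the transitions shown above: four mappings give {1/4, 1/4, 1/4, 1/4}, the first unmap gives {1/2, 1/4, 1/4}, and the second gives {1/2, 1/2}.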

3.2.1 Dirty page cache management

The above architecture is used for the dirty page cache, and thus for IO accounting.

Let's see how IO works in the Linux kernel [RLove]. Each page may be either clean or dirty. Dirty pages are marked with the bit of the same name and are about to be written to disk. The main point here is that dirtying a page doesn't imply that it will be written to disk immediately. When the actual writing starts, the task (and thus the beancounter) that made the page dirty may already be dead.

Another peculiarity of IO is that a page may be marked as dirty in a context different from the one that really owns the page. Arbitrary pages are unmapped when the kernel shrinks the page cache to free more memory. This is done by checking the pte dirty flag set by a CPU.

Thus it is necessary to save the context in which a page became dirty until the moment the page becomes clean, i.e. its data is written to disk. To track the context in question, an IO beancounter is added between the page and its page beancounters ring (see Figure 3). This IO beancounter keeps the appropriate beancounter while the page stays dirty.

Figure 3: Tracking the dirtier of the page (diagram of the IO beancounter between a page and its ring of page beancounters omitted).

To determine the context of a page, the page beancounter ring mentioned above is used. When a page becomes dirty in a synchronous context, e.g. a regular write happens, the current beancounter is used. When a page becomes dirty in an arbitrary context, e.g. from a page shrinker, the first beancounter from this ring is used.

3.3 Kernel memory management

Privvmpages allow a user to track the memory usage of tasks and give the applications a chance for a graceful exit rather than being killed, but they do not provide any protection. Kernel memory size—the "kmemsize" in beancounters terms—is the way to protect the system from a DoS attack from userspace.

The kernel memory is a scarce resource on 32-bit systems due to the limited normal zone size. But even on 64-bit systems, applications causing the kernel to consume an unlimited amount of RAM can easily DoS it.

There is a great difference between the user memory and the kernel memory, which results in a dramatic difference in implementation.

When a user page is freed, the task, and thus the beancounter it belongs to, is always well known. When a kernel space object is freed, the beancounter it belongs to is almost never known, due to the refcounting mechanism and RCU. Thus each kernel object should carry a pointer to the beancounter it was allocated by.

The way beancounters store this pointer differs between the different techniques of allocating objects.

There are three mechanisms for object allocation in the kernel:

1. buddy page allocator – the core mechanism, used directly and by the other two. To track the beancounter of a directly allocated page there is an additional pointer on struct page. Another possible way would be to allocate an array of pointers mirroring mem_map, or to reuse the mapping field on struct page.

2. vmalloc – To track the owner of a vmalloc-ed object, the owner of its first (more precisely—its zeroth) page is used. This is simple as well.

3. kmalloc – the way to allocate small objects (down to 32 bytes in size). Many objects allocated with kmalloc must be tracked and thus have a pointer to the corresponding beancounter. First versions of beancounters changed such objects explicitly by adding a struct beancounter *owner field to the structures. To unify this process, in new versions mm/slab.c is modified by adding an array of pointers to beancounters at the tail of each slab (see Figure 4). This array is indexed with the sequence number of the object on the slab.

Figure 4: Extending a slab for the kernel memory tracking (diagram of a slab, its bufctl, and the array of beancounter pointers at its tail omitted).

Such an architecture has two potential side effects:

1. slabs will become shorter, i.e. one slab will carry fewer objects than it did before; and

2. slabs may become "offslab" as the gap will exceed the offslab border.

Table 1 shows the results for the size-X caches. As seen from this table, less than 1% of the available objects from "on-page" slabs are lost. "Offslab" caches do not lose any objects, and no slabs become offslab.

    Slab size   # of objects      offslab-ness
                before   after    before   after
    size-32     113      101      −        −
    size-64     59       56       −        −
    size-128    30       29       −        −
    size-256    15       15       −        −
    size-512    8        8        +        +
    size-1024   4        4        +        +
    size-2048   2        2        +        +
    size-4096   1        1        +        +

Table 1: The comparison of size-X caches with and without the beancounters patch
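The kmalloc case can be pictured with the following schematic userspace model (names invented; the real change lives in mm/slab.c): the owner pointers sit in an array at the slab's tail, indexed by the object's sequence number within the slab.

```c
#include <stddef.h>

struct beancounter;

/* Schematic model of a slab extended with an owner array. */
struct slab_page {
	char *objects;			/* start of the packed objects */
	size_t obj_size;		/* size of one object          */
	unsigned int num;		/* objects per slab            */
	struct beancounter **owners;	/* tail array: one per object  */
};

/* Sequence number of an object within its slab. */
static unsigned int obj_index(struct slab_page *s, void *obj)
{
	return (unsigned int)(((char *)obj - s->objects) / s->obj_size);
}

/* Record the charging beancounter at allocation time... */
static void slab_set_owner(struct slab_page *s, void *obj,
			   struct beancounter *bc)
{
	s->owners[obj_index(s, obj)] = bc;
}

/* ...and retrieve it at free time, when the current task's
 * beancounter is not necessarily the one that paid for the object. */
static struct beancounter *slab_get_owner(struct slab_page *s, void *obj)
{
	return s->owners[obj_index(s, obj)];
}
```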

3.4 Network buffers management

Network buffers management [AFTM] fundamentally differs for the send and receive buffers. Basically, the host does not control the incoming traffic, but fully controls the outgoing traffic.

Unfortunately, the mainstream kernel socket accounting mechanism cannot be reused, as it has known deficiencies:

• slab overhead is not included in the accounted packet size. For Ethernet the difference is around 25%–30%, as the size-2048 slab is used for packets. Unfortunately, the skb->truesize calculations can't be changed without a massive TCP/IP stack fix, as this would lead to serious performance degradation due to TCP window reduction; and

• the accounting is not strict, and limits can be overused.

3.4.1 Send buffer space

TCPv4 now, and TCPv6 with DCCP in the future, are accounted separately from all the other outgoing traffic. Only the packets residing in the socket send queue are charged, since skb "clones" sent to the device for transmission have a very limited lifetime.

Netlink send operations charging is not uniform with respect to the send direction. The buffers are charged to the user socket only, even if they are sent from the kernel. This is fair as, usually, a data stream from the kernel to a user is initiated by the user.

Beancounters track memory usage, with the appropriate overheads, on a per-socket basis. Each socket has a guarantee calculated as

    (limit − barrier) / Nsockets

where Nsockets is the limit on the number of sockets of the appropriate type on the beancounter.

Beancounters don't break the semantics of the select system call, i.e. if it returns that the socket can be written to, write will send at least one byte. To achieve this goal, an additional field, poll_reserve, has been introduced in the socket structure. The actual resource usage is thus shown as

    Σ_{s ∈ sockets} ( poll_reserve(s) + Σ_{skb ∈ write_queue(s)} size(skb) )

3.4.2 Receive buffer space

Management of the receive buffers for non-TCP sockets is simple. Incoming packets are simply dropped if the limit is hit. This is normal for raw IP sockets and UDP, as there is no protocol guarantee that the packet will be delivered.

There is a problem with the netlink sockets, though. In general, the user (ip or a similar tool) sends a request to the kernel and starts waiting for an answer. The answer may never arrive, as the kernel does not retransmit netlink responses. The practical impact of this deficiency is acceptable, though.

TCP buffer space accounting is mostly the same, except for the guarantee for one packet. The amount equal to Nsockets ∗ max_advmss is reserved for this purpose. Beancounters rely on the generic code for TCP window space shrinking, though a better window management policy for all the sockets inside a beancounter is being investigated.
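As a toy illustration of the send-space guarantee formula above (the numbers are illustrative only, not values from the paper):

```c
#include <stdio.h>

/* Per-socket send guarantee from Section 3.4.1:
 * (limit - barrier) / Nsockets, where Nsockets is the limit on the
 * number of sockets of this type in the beancounter. */
static unsigned long send_guarantee(unsigned long limit,
				    unsigned long barrier,
				    unsigned long nsockets)
{
	return (limit - barrier) / nsockets;
}

int main(void)
{
	/* e.g. a 1 MB limit, 512 KB barrier, and at most 64 sockets:
	 * each socket is guaranteed (1024 - 512) KB / 64 = 8 KB. */
	printf("%lu bytes per socket\n",
	       send_guarantee(1024 * 1024, 512 * 1024, 64));
	return 0;
}
```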

4 Performance

The OpenVZ team spent a lot of time improving the performance of the beancounters patches. The following techniques are employed:

• Pre-charging of resources on task creation. When the task later tries to charge a new portion of a resource, it may take the desired amount from this reserve. The same technique is used in network buffers accounting, but there the pre-charge is stored on the socket.

• On-demand accounting. Per-group accounting is not performed when the overall resource consumption is low enough. When the system runs low on some resource, per-beancounter accounting is turned on, and vice versa—when the system has a lot of free resources, per-beancounter accounting is turned off. Nevertheless, this switch is rather slow, as it implies seeking out all the resources belonging to a beancounter, and thus it should happen rarely.

The results of the unixbench test are shown for the following kernels:

• vanilla 2.6.18 kernel,

• OpenVZ 2.6.18-028stab025 kernel,

• OpenVZ 2.6.18-028stab025 kernel without page-sharing accounting.

Tests were run on a Dual-Core Intel® Xeon™ 3.20GHz machine with 3GB of RAM.

Table 2 shows the results of comparing the vanilla kernel against the OpenVZ kernel without page-sharing accounting; Table 3 compares the vanilla kernel against the full beancounters patch.

    Test name          Vanilla   OVZ    %
    Process Creation   7456      7288   97%
    Execl Throughput   2505      2490   99%
    Pipe Throughput    4071      4084   100%
    Shell Scripts      4521      4369   96%
    File Read          1051      1041   99%
    File Write         1070      1086   101%

Table 2: Unixbench results comparison (p.1)

    Test name          Vanilla   OVZ    %
    Process Creation   7456      6788   91%
    Execl Throughput   2505      2290   91%
    Pipe Throughput    4071      4064   99%
    Shell Scripts      4521      3969   87%
    File Read          1051      1031   98%
    File Write         1070      1066   99%

Table 3: Unixbench results comparison (p.2)

5 Conclusions

Described in this article is the memory management implemented in the "beancounters" subsystem. The userspace memory management, including RSS accounting, and the kernel memory management, including network buffers, were described.

The beancounters architecture has proven its efficiency and flexibility. It has been used in OpenVZ kernels for as long as the OpenVZ project has existed.

The questions that are still unsolved are:

• TCP window management based on the beancounter resource availability;

• on-demand RSS accounting; and

• integration with the mainstream kernel.

References

[UB patch] Andrey Savochkin, User Beancounter Patch, http://mirrors.paul.sladen.org/ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html

[RSS] Pavel Emelianov, RSS fractions accounting, http://wiki.openvz.org/RSS_fractions_accounting

[AFTM] ATM Forum, Traffic Management Specification, version 4.1

[RLove] Robert Love, Linux Kernel Development (2nd Edition), Ch. 15. Novell Press, January 12, 2005.

Manageable Virtual Appliances

David Lutterkort, Red Hat, Inc., [email protected]
Mark McLoughlin, Red Hat, Inc., [email protected]

Abstract

Virtual Appliances are a relatively new phenomenon, and expectations vary widely on their use and precise benefits. This paper details some usage models, how requirements vary depending on the environment the appliance runs in, and the challenges connected to them. In particular, appliances for the enterprise have to fit into already existing infrastructure and work with existing sitewide services such as authentication, logging, and backup.

Appliances can be most simply distributed and deployed as binary images; this requires support in existing tools and a metadata format to describe the images. A second approach, basing appliances entirely on a metadata description that includes their setup and configuration, gives users the control they need to manage appliances in their environment.

1 Introduction

Virtual Appliances promise to provide a simple and cost-efficient way to distribute and deploy applications bundled with the underlying operating system as one unit. At its simplest, an appliance consists of disk images and a description of how these disk images can be used to create one or more virtual machines that provide a particular function, such as a routing firewall or a complete webmail system. Distributing a whole system instead of an application that the end user has to install themselves has clear benefits: installing the appliance is much simpler than setting up an application, and the appliance supplier tests the appliance as a whole and tunes all its components for the appliance's needs. In addition, appliances leave the appliance supplier much more latitude in the selection of the underlying operating system; this is particularly attractive for applications that are hard to port even amongst successive versions of the same distribution.

In addition to these installation-related benefits, some of the more general benefits of virtualization make appliances attractive, especially hardware isolation, better utilization of existing systems, and isolation of applications for improved security and reliability.

Because of their novelty, there is little agreement on what kind of appliance is most beneficial in which situation, how different end user requirements influence the tooling around appliances, and how users deploy and maintain their appliances. Examples of applications that are readily available as appliances range from firewalls and asterisk-based VoIP appliances, to complete Wikis and blogs, and database servers and LAMP stacks.

A survey of existing appliances makes one thing clear: there are many open questions around appliances; for them to fulfill their potential, these questions must be answered, and those answers must be backed by working, useful tools.

Comparing how current Linux distributions are built, delivered, and managed to what is available for appliances points to many of these shortcomings: installing an appliance is a completely manual process; getting it to run often requires an intimate understanding of the underlying virtualization platform and the ability to configure the virtual machine for the appliance manually; and managing it depends too often on appliance-specific one-off solutions. As an example, for an appliance based on paravirtualized Xen, the host and the appliance must agree on whether to use PAE or not, a build-time configuration. A mismatch here leads to an obscure error message when the appliance is started. As another example, in an effort to minimize the size of the appliance image, appliance builders sometimes strip out vital information such as the package database, making it impossible to assess whether the appliance is affected by a particular security vulnerability, or, more importantly, update it when it is.

In addition, appliances bring some problems into the spotlight that exist for bare-metal deployments, too, but are made more pressing because appliances rely on the distribution of entire systems. The most important of these problems is that of appliance configuration—very few appliances can reasonably be deployed without any customization by the end-user. While appliances will certainly be a factor in pushing the capabilities and use of current discovery and zeroconf mechanisms, it is unlikely that that is enough to address all the needs of users in the near future.

We restrict ourselves to appliances consisting of a single virtual machine. It is clear that there are uses for appliances made up of multiple virtual machines, for example, for three-tier web applications, with separate virtual machines for the presentation logic, the business logic, and the database tier. Such appliances, though, add considerable complexity to the single-VM case: the virtual machines of the appliance need to be able to discover each other, might want to create a private network for intra-appliance communication, and might want to constrain how the virtual machines of the appliance should be placed on hosts. For the sake of simplicity, and given that handling single-VM appliances well is a necessary stepping stone to multi-VM appliances, we exclude these issues and focus solely on single-VM appliances.

The rest of the paper is organized as follows: Section 2 discusses appliance builders and users, and how their expectations for appliances vary depending on the environment in which they operate; Section 3 introduces images and recipes, two different approaches to appliances; Sections 4 and 5 describe images and recipes in detail; and the final Section 6 discusses these two concepts, and some specific problems around appliances, based on the example of a simple web appliance.

2 Builders and Users

Appliances available today cover a wide variety of applications and therefore target users: from consumer software, to applications aimed at SMBs, to enterprise software. It is worthwhile to list in more detail the expectations and needs of different applications and target audiences—it is unlikely that the differences that exist today between a casual home user and a sysadmin in a large enterprise will disappear with appliances.

Two separate groups of people are involved with appliances: appliance builders, who create and distribute appliances, and appliance users, who deploy and run them. We call the first group builders rather than developers for two reasons: while appliance creation will often involve a heavy dose of software development, it will, to a large extent, be very similar to traditional development for bare-metal systems, and the need for appliance-specific development tools will be minimal; besides creating appliances through very specific development, they can also be created by assembling existing components with no or minimal custom development. The value of these appliances comes entirely from them being approved, pretested, and tuned configurations of software. In that sense, a sysadmin in an enterprise who produces "golden images" is an appliance builder.

What all appliance builders have in common is that they need to go through repeated cycles of building the appliance from scratch, and running, testing, and improving it. They therefore need ways to make these steps repeatable and automatable. Of course, they also share the goal of distributing the appliance, externally or internally, which means that they need to describe the appliance in as much detail as is necessary for users to deploy it easily and reliably.

Users and their expectations vary as much as the environments in which they operate, and expectations are closely coupled to the environment in which the appliance will run. Because of the differences in users, the requirements on the appliances themselves naturally vary, too: whereas for consumer software ease of installation and configuration is crucial, appliances meant for an enterprise setting need to focus on being manageable in bulk and fitting into an environment with preexisting infrastructure and policies. As an example, a typical hardware appliance, a DSL router with a simple web configuration interface, is a great consumer device, since it lets consumers perform complicated tasks with ease, while it is completely unsuited for a data center, where ongoing, automated management of a large number of routers is more important than the ease of first-time setup.

These differences have their repercussions for virtual appliances, too: an appliance developer for a consumer appliance will put more emphasis on an easy-to-use interface, whereas for enterprise appliances the emphasis will be on reliability, manageability, and the problems that only appear in large-scale installations. Building the user interface for a consumer appliance is tightly coupled to the appliance's function, and we consider it part of the custom development for the appliance; addressing the concerns of the enterprise user, the sysadmin, requires tools that go beyond the individual appliance, and therefore should be handled independently of the function of the appliance.

While for consumer appliances it is entirely reasonable to predefine what can and cannot be configured about the appliance, this is not feasible for data center appliances: the infrastructures and policies at different sites are too varied to allow an appliance developer to anticipate all the changes that are required to make the appliance fit in. Often, these changes will have little bearing on the functioning of the appliance, and are concerned with issues such as logging to a central logging server, auditing shell access, monitoring the appliance's performance and resource usage, etc. At the same time, enterprise users will demand that they can intervene in any aspect of the appliance's functioning should that become necessary, especially for security reasons.

Consumers and enterprise users also differ sharply in their expectations for a deployment tool: for a consumer, a simple graphical tool such as virt-manager is ideal, and adding features for downloading and running an appliance to it will provide consumers with a good basis for appliance use. For enterprise users, who will usually be sysadmins, it is important that appliances can be deployed, configured, and managed in a way as automated as possible.

3 Images and Recipes

This paper details two different approaches to building and distributing appliances: Appliance Images, appliances distributed as one or more disk images and a description of how to create a virtual machine from them, and Appliance Recipes, appliances completely described in metadata.

Appliances can be built for any number of virtualization technologies and hypervisors; deployment tools for appliances should be able to understand appliance descriptions regardless of the virtualization technology they are built on, and should be able to at least let the user know ahead of time if a given appliance can be deployed on a given host. The appliance description therefore needs to carry information on the expected platform. Rather than encode this knowledge in appliance-specific deployment tools, we base the appliance description on libvirt and its XML metadata format for virtual machines, since libvirt can already deal with a number of virtualization platforms, such as paravirtualized and fully-virtualized Xen hosts, qemu, and kvm, and abstracts their differences away. This approach also makes it possible for related tools such as virt-manager [1] and virt-factory [2] to integrate appliance deployment and management seamlessly.

The goal of both the image and the recipe approach is to describe appliances in a way that integrates well with existing open-source tools, and to introduce as few additional requirements on the infrastructure as possible.

Deployment of Appliance Images is very simple, consisting of little more than creating a virtual machine using the disk images; for Appliance Recipes, additional steps amounting to a system install are needed. On the other hand, changing the configuration of Appliance Images is much harder and requires special care to ensure changes are preserved across updates, especially when updates are image-based, whereas Appliance Recipes provide much more flexibility in this area. In addition, Appliance Recipes provide a clear record of what exactly goes into the appliance, something that can be hard to determine with image-based appliances.

By virtue of consisting of virtual machines, appliances share some characteristics with bare-metal machines, and some of the techniques for managing bare-metal machines are also useful for appliances: for example, Stateless Linux [3] introduced the notion of running a system readonly root by making almost the entire file system readonly. For Stateless, this allows running multiple clients off the same image, since the image is guaranteed to be immutable. For image-based appliances, this can be beneficial since it cleanly delineates the parts of the appliance that are immutable content from those that contain user data. It does make it harder for the user to modify arbitrary parts of the appliance, especially its configuration, though Stateless provides mechanisms to make parts of the readonly image mutable, and to preserve client-specific persistent state.

Similarly, managing the configuration of machines is not a problem unique to appliances, and using the same tools for managing bare-metal and appliance configurations is very desirable, especially in large-scale enterprise deployments.
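To make the libvirt basis concrete, here is a minimal libvirt domain description of the kind a deployment tool would generate for a paravirtualized Xen guest. The names, paths, and values are illustrative only, and the exact schema depends on the libvirt version:

```xml
<domain type='xen'>
  <name>webmail</name>
  <memory>262144</memory>  <!-- in KiB -->
  <vcpu>1</vcpu>
  <os>
    <type>linux</type>
    <kernel>/var/lib/xen/vmlinuz-appliance</kernel>
    <initrd>/var/lib/xen/initrd-appliance.img</initrd>
    <cmdline>root=/dev/xvda1</cmdline>
  </os>
  <devices>
    <disk type='file'>
      <source file='/var/lib/xen/images/webmail-system.img'/>
      <target dev='xvda'/>
    </disk>
    <interface type='bridge'>
      <source bridge='xenbr0'/>
      <mac address='00:16:3e:00:12:34'/>
    </interface>
  </devices>
</domain>
```

Note how deploy-time choices such as the MAC address live in this document; this is exactly why, as discussed below, libvirt XML cannot be used verbatim as the appliance description.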

3.1 Relation to Virtual Machines

Eventually, an appliance, i.e., the artifacts given to the appliance user by the appliance builder, is used to create and run a virtual machine. If a user will only ever run the appliance in a single virtual machine, the appliance and the virtual machine are interchangeable, particularly for Appliance Images, in the sense that the original image the appliance user received is the one off which the virtual machine runs.

When a user runs several instances of the same appliance in multiple virtual machines, for example, to balance the load across several instances of the same web appliance, this relation is more complicated: appliances in general are free to modify any of their disk images, so that each virtual machine running the appliance must run off a complete copy of all of the appliance's disk images. The existence of the copies complicates image-based updates, as the copy for each virtual machine must be found and updated, making it necessary to track the relation between the original appliance image and the images run by each virtual machine. For Appliance Recipes, this is less of an issue, since the metadata takes the place of the original appliance image, and a separate image is created for every virtual machine running that appliance recipe.

3.2 Combining Images and Recipes

Appliance Images and Appliance Recipes represent two points in a spectrum of ways to distribute appliances. Image-based appliances are easy to deploy, whereas recipe-based appliances give the user a high degree of control over the appliance at the cost of a more demanding deployment.

A hybrid approach, where the appliance is completely described in metadata and shipped as an image, combines the advantages of both approaches: to enable this, the appliance builder creates images from the recipe metadata, and distributes both the metadata and the images together.

4 Appliance Images

The description of an Appliance Image, by its very nature, is focused on describing a virtual machine in a transportable manner. Such a description is not only useful for distributing appliances, but also anywhere a virtual machine and all its parts need to be saved and recreated over long distances or long periods of time, for example for archiving an entire application and all its dependencies.

Appliance Images are also suitable for situations where the internals of the virtual machine are unimportant, or where the appliance is running an O/S that the rest of the environment can't understand in more detail.

With no internals exposed, the appliance then consists of just a number of virtual disks, descriptions of the needed virtual hardware, most importantly a NIC, and constraints on which virtualization platforms the appliance can run.

Appliance Images need to carry just enough metadata in their description to allow running them safely in an environment that knows nothing but the metadata about them. To enable distribution, Appliance Images need to be bundled in a standard format that makes it easy to download and install them. For simplicity, we bundle the metadata, as an XML file, and the disk images as normal tarballs.

Ultimately, we need to create a virtual machine based on the appliance metadata and the disk images; therefore, it has to be possible to generate a libvirt XML description of the virtual machine from the appliance's metadata. Using libvirt XML verbatim as the appliance description is not possible, since it contains some information, such as the MAC address for the appliance's NIC, that clearly should be determined when the appliance is deployed, not when it is built.

4.1 Metadata for Appliance Images

The metadata for an Appliance Image consists of three parts (a hypothetical sketch of such a descriptor follows the list):

1. General information about the appliance, such as a name and a human-readable label

2. The description of the virtual machine for the appliance

3. The storage for the appliance, as a list of image files
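The paper does not reproduce the descriptor schema itself, so the following sketch is hypothetical—the element names are invented purely to illustrate the three parts listed above (the disk categories are described in Section 4.1.2):

```xml
<appliance>
  <!-- 1. General information -->
  <name>webmail</name>
  <label>Example webmail appliance</label>

  <!-- 2. The virtual machine: one platform/boot entry per
          supported virtualization platform -->
  <vm memory="262144" vcpus="1">
    <platform type="xen" arch="i686" pae="yes">
      <boot bootloader="pygrub" root="/dev/xvda1"/>
    </platform>
    <nic/>
  </vm>

  <!-- 3. Storage: one entry per disk image, with its category -->
  <storage>
    <image file="webmail-system.img" use="system"/>
    <image file="webmail-data.img" use="data"/>
    <image file="webmail-scratch.img" use="scratch"/>
  </storage>
</appliance>
```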

4.1.1 Virtual Machine Descriptor

The metadata included with an appliance needs to contain enough information to make a very simple decision that anybody (and any tool) wanting to run the appliance will be faced with: can this appliance run on this host? Since the differences between running a virtual machine on a paravirtualized and on a fully-virtualized host are, from an application's point of view, small, the appliance metadata allows describing them simultaneously, with enough detail for tools to determine ahead of time how to boot the appliance; for each supported virtualization platform, the metadata lists how to boot the virtual machine on that platform, together with a description of the platform.

The boot descriptor roughly follows libvirt's os element: for fully-virtualized domains, it contains an indication of which bootloader and what boot device to use; for paravirtualized domains, it either lists the kernel and initrd together, or states that pygrub should be used, as well as the root device and possible kernel boot options. If the kernel/initrd are mentioned explicitly, they must be contained as separate files in the tarball used to distribute the appliance.

The platform description, based on libvirt's capabilities, indicates the type of hypervisor (for example, xen or hvm), the expected architecture (for example, i686 or ia64), and additional features such as whether the guest needs pae or not.

Besides these two items, the virtual machine metadata lists how the disk images should be mapped into the virtual machine, how much memory and how many virtual CPUs it should receive, whether a (graphical) console is provided, and which network devices to create.

4.1.2 Disk Images

The disk images for the appliance are simple files; for the time being, we use only (sparse) raw files, though it would be desirable to use compressed qcow images for disks that are rarely written to. (Unfortunately, the Xen blktap driver has a bug that makes it impossible to use qcow images that were not created by Xen tools.)

Disk images are classified into one of three categories, to enable a simple update model where parts of the appliance are replaced on update. Such an update model requires that the appliance separates the user's data from the appliance's code, and keeps them on separate disks. The image categories are:

• system disks that contain the O/S and application, and are assumed not to change materially over the appliance's lifetime

• data disks that contain application data that must be preserved across updates

• scratch disks that can be erased at will between runs of the appliance

The classification of disk images closely mirrors how a filesystem needs to be labeled for Stateless Linux: in a Stateless image, most files are readonly, and therefore belong on a system disk. Mutable files come in two flavors: files whose content needs to be preserved across reboots of the image, which therefore belong on a data disk, and files whose content is transient, which therefore belong on a scratch disk.

Updates of an Appliance Image can be performed in one of two ways: by updating from within, using the update mechanisms of the appliance's underlying operating system, for example yum, or by updating from the outside, replacing some of the appliance's system disks. If the appliance is intended to be updated by replacing the system disks, it is very desirable, though not strictly necessary, that the system disks are mounted readonly by the appliance; to prepare the appliance for that, the same techniques as for Stateless Linux need to be applied, in particular, marking writable parts of the filesystem in /etc/rwtab and /etc/statetab.

Data disks don't have to be completely empty when the appliance is shipped: as an example, a web application with a database backend, bundled as an appliance, will likely ship with a database that has been initialized and the application's schema loaded.
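As a concrete illustration of the /etc/rwtab labeling mentioned above, a readonly-root system marks the paths that must stay writable; the entries below are illustrative, and the exact set depends on the appliance:

```
# /etc/rwtab -- illustrative entries for a readonly-root appliance.
# "empty" mounts a fresh writable (tmpfs) directory over the path;
# "files" and "dirs" copy the named files/trees into a writable
# overlay at boot so the originals on the system disk stay readonly.
empty   /tmp
empty   /var/tmp
files   /etc/resolv.conf
dirs    /var/log
```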

4.2 Building an Appliance Image

An appliance supplier creates the appliance by first installing and configuring the virtual machine for the appliance any way they see fit; typically this involves installing a new virtual machine, starting it, logging into it, and making manual configuration changes. Even though these steps will produce a distributable Appliance Image, they will generally not make building the Appliance Image reproducible in an automated manner. When this is important, for example, when software development rather than simple assembly is an important part of the appliance build process, the creation of the Appliance Image should be based on an Appliance Recipe, even if the appliance is ultimately only distributed as an image.

Once the appliance supplier is satisfied with the setup of the virtual machine, they create an XML description of the appliance, based on the virtual machine's characteristics. Finally, the virtual machine's disk is compressed and, together with the appliance descriptor, packed into a tarball.

This process contains some very mechanical steps, particularly creating the initial appliance descriptor and packing the tarball, that will be supported by a tool. The tool will similarly support unpacking, installing, and image-based updating of the appliance.

4.3 Deploying an Appliance Image

An appliance user deploys an appliance in two steps: first, she downloads and installs the appliance into a well-known directory, uncompressing the disk images. As a second step, the user simply runs virt-install to create the virtual machine for the appliance; virt-install is a command line tool generally used to provision operating systems into virtual machines. For Appliance Images, it generates the libvirt XML descriptor of the appliance's virtual machine from the appliance descriptor and additional user-provided details, such as an explicit MAC address for the virtual machine's NIC, or whether and what kind of graphical console to enable.

To allow for multiple virtual machines running the same appliance concurrently, the disk images for the appliance are copied into per-VM locations, and virt-install records the relation between the appliance and the VM image. This is another reason why using qcow as the image format is preferable to simple raw images, as it has a facility for overlay images that only record the changes made to a base image. With qcow images, instantiating a virtual machine could avoid copying the usually large system disks, creating only an overlay for them. Data disks, of course, still should be created by copying from the original appliance image, while scratch disks should be created as new files, as they are empty by definition.

It is planned to integrate Appliance Image deployment into virt-manager, too, to give users a simple graphical interface for appliance deployment.

In both cases, the tools check that the host requirements in the appliance descriptor are satisfied by the actual host on which the appliance is deployed.

4.4 Packaging Appliance Images as RPMs

Packaging appliances as tarballs provides a lowest-common-denominator for distributing appliances that is distribution, if not operating system, agnostic. Most distributions, however, already have a package format tailored to distributing and installing software, and tools built around it.

For RPM-based distributions, it seems natural to package appliances as RPMs, too. This immediately sets a standard for, amongst others, how appliances should be made available (as RPMs in yum repositories), how their authenticity can be guaranteed (by signing them with a key known to RPM), and how they are to be versioned.

5 Appliance Recipes

Appliance Recipes describe a complete virtual machine in metadata during its whole lifecycle, from initial provisioning to ongoing maintenance while the appliance is in use. The recipe contains the specification of the appliance's virtual machine, which is very similar to that for Appliance Images, and a description of how the appliance is to be provisioned initially and how it is to be configured. In contrast to Appliance Images, it does not contain any disk images. An Appliance Recipe consists of the following parts:

1. An appliance descriptor describing the virtual machine; the descriptor is identical to that for Appliance Images, except that it must contain the size of each disk to be allocated for the virtual machine.

2. A kickstart file, used to initially provision the virtual machine from the recipe.

3. A puppet manifest, describing the appliance's configuration. The manifest is used both for initial provisioning and during the lifetime of the appliance; updating the manifest provides a simple way to update existing appliances without having to reprovision them.

Before an appliance based on a recipe can be deployed in a virtual machine, the virtual machine's disks need to be created and populated, by installing a base system into empty disk images. Once the disk images have been created, deployment follows the same steps as for an Appliance Image.

The Appliance Recipe shifts who builds the disk images from the appliance builder to the appliance user, with one very important difference: the appliance user has a complete description of the appliance, consumable by tools, that they can adapt to their needs. With that, the appliance user can easily add site-specific customizations to the appliance by amending the appliance's metadata; for example, if all syslog messages are to be sent to a specific server, the appliance user can easily add their custom syslog.conf to the appliance's description. It also provides a simple mechanism for the appliance builder to leave certain parts of the appliance's configuration as deploy-time choices, by parameterizing and templating that part of the appliance's metadata.

We use kickstart, Fedora's automated installer, for the initial provisioning of the appliance image, and puppet, a configuration management tool, for the actual configuration of the appliance. It is desirable to expose as much of the appliance's setup to puppet and to only define a very minimal system with kickstart, for several reasons: keeping the kickstart files generic makes it possible to share them across many appliances and rely only on a small set of well-tested stock kickstart files; since kickstarting only happens when a virtual machine is first provisioned, any configuration the kickstart file performs is hard to track over the appliance's lifetime; and, most importantly, by keeping the bulk of the appliance's configuration in the puppet manifest, it becomes possible for the appliance user to adapt as much as possible of the appliance to their site-specific needs. The base system for the appliance only needs to contain a minimal system with a DHCP client, yum, and the puppet client.

Strictly speaking, an Appliance Recipe shouldn't carry a full kickstart file, since it contains many directives that should be controlled by the appliance user, not the appliance builder, such as the timezone for the appliance's clock. The most important parts of the kickstart file that the appliance builder needs to influence are the layout of the appliance's storage, how disks are mounted into it, and the yum repositories needed during provisioning. The recipe should therefore only ship with a partial kickstart file that is combined with user- or site-specific information upon instantiation of the image.

5.1 Configuration Management

The notion of Appliance Recipes hinges on describing the configuration of a virtual machine in its entirety in a simple, human-readable format. That description has to be transported from the appliance builder to the appliance user, leaving the appliance user the option of overriding almost arbitrary aspects of the configuration. These overrides must be kept separate from the original appliance configuration to make clean upgrades of the latter possible: if the user directly modifies the original appliance configuration, updates will require cumbersome and error-prone merges.

The first requirement, describing the configuration of a system, is the bread and butter of a number of configuration management tools, such as cfengine, bcfg2, and puppet. These tools are also geared towards managing large numbers of machines, and provide convenient and concise ways of expressing the similarities and differences between the configurations of individual machines. We chose puppet as the tool for backing recipes, since it meets the other two requirements, making configurations transportable and overridable, particularly well.
The actual setup and configuration of an appliance, i.e. the actual appliance functionality, is expressed as a puppet manifest. The manifest, written in puppet's declarative language, describes the configuration through a number of resource definitions that describe basic properties of the configuration, such as which packages have to be installed, which services need to be enabled, and which custom config files to deploy. Files can either be deployed as simple copies or by instantiating templates.

Resource definitions are grouped into classes, logical units describing a specific aspect of the configuration; for example, the definition of a class webserver might express that the httpd package must be installed, the httpd service must be enabled and running, and that a configuration file foo.conf must be deployed to /etc/httpd/conf.d.
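A minimal sketch of that webserver class in puppet's language might look as follows; the file source URL is illustrative, and details of the syntax vary between puppet versions:

```puppet
# Sketch of the webserver class described above.
class webserver {
    package { "httpd": ensure => installed }

    service { "httpd":
        ensure  => running,
        enable  => true,
        require => Package["httpd"],
    }

    # Deploy foo.conf as a simple copy; a template could be
    # instantiated here instead.
    file { "/etc/httpd/conf.d/foo.conf":
        source => "puppet:///webserver/foo.conf",
        notify => Service["httpd"],
    }
}
```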

The complete configuration of an actual system is made up by mapping any number of classes to that system, a node in puppet's lingo. This two-level scheme of classes and nodes is at the heart of expressing the similarities and differences between systems, where, for example, all systems at a site will have their logging subsystem configured identically, but the setup of a certain web application may only apply to a single node.

Site-specific changes to the original appliance configuration fall into two broad categories: overriding core parts of the appliance's configuration, for example, to have its webserver use a site-specific SSL certificate; and adding to the appliance's setup, without affecting its core functionality, for example, to send all syslog messages to a central server.

These two categories are mirrored by two puppet features: overrides are performed by subclassing classes defined by the appliance and substituting site-specific bits into the otherwise unchanged appliance configuration; additions are performed by mapping additional, site-specific classes to the node describing the appliance.

The configuration part of an Appliance Recipe consists of a puppet module, a self-contained unit made up of the classes and supporting files. Usually, puppet is used in client/server mode, where the configuration of an entire site is stored on a central server, the puppetmaster, and clients receive their configuration from it. In that mode, the appliance's module is copied onto the puppetmaster when the recipe is installed.

5.2 Deploying an Appliance Recipe

Deploying an Appliance Recipe requires some infrastructure that is not needed for Appliance Images, owing to the fact that the appliance's image needs to be provisioned first. Recipes are most easily deployed using virt-factory, which integrates all the necessary tools on a central server. It is possible, though, to use a recipe without virt-factory's help; all that is needed is the basic infrastructure to perform kickstart-based installs with virt-install, and a puppetmaster from which the virtual machine will receive its configuration once it has been booted.

In preparation for the appliance instantiation, the appliance's puppet manifest has to be added to the puppetmaster by copying it to the appropriate place on the puppetmaster's filesystem. With that in place, the instantiation is entirely driven by virt-install, which performs the following steps:

1. Create empty image files for the virtual machine

2. Create and boot the virtual machine and install it according to the appliance's kickstart file

3. Bootstrap the puppet client during the install

4. Reboot the virtual machine

5. Upon reboot, the puppet client connects to the puppetmaster and performs the appliance-specific configuration and installation

5.3 Creating an Appliance Recipe

For the appliance builder, creating an Appliance Recipe is slightly more involved than creating an Appliance Image. The additional effort is caused by the need to capture the appliance's configuration in a puppet manifest. The manifest can simply be written by hand, being little more than formalized install instructions.

It is more convenient, though, to use cft [4], a tool that records changes made to a system's configuration and produces a puppet manifest from them. Besides noticing and recording changes made to files, it also records more complex and higher-level changes such as the installation of a package, the modification of a user, or the starting and enabling of a service.

With that, the basic workflow for creating an Appliance Recipe is similar to that for an Appliance Image:

1. Write (or reuse) a kickstart file and install a base system using virt-install

2. Start the virtual machine with the base system

3. Log into the virtual machine and start a cft session to capture configuration changes

4. Once finished configuring the virtual machine, finish the cft session and have it generate the puppet manifest

5. Generate the appliance descriptor for the virtual machine

6. Package the appliance descriptor, kickstart file, and puppet manifest into an Appliance Recipe

Note that all that gets distributed for an Appliance Recipe is a number of text files, making recipes considerably smaller than Appliance Images. Of course, when the user instantiates the recipe, images are created and populated; but that does not require that the user download large amounts of data for this one appliance—rather, they are more likely to reuse already existing local mirrors of yum repositories.

6 Example

As an example, we built a simple web calendaring appliance based on kronolith, part of the horde PHP web application framework; horde supports various storage backends, and we decided to use PostgreSQL as the storage backend.

To separate the application from its data, we created a virtual machine with two block devices: one for the application, and one for the PostgreSQL data, both based on 5GB raw image files on the host. We then installed a base system with a DHCP client, yum, and puppet on it, and started the virtual machine.

After logging in on the console, we started a cft session and performed the basic setup of kronolith: installing the necessary packages, opening port 80 on the firewall, and starting services such as httpd and postgresql. Following kronolith's setup instructions, we created a database user and schema for it, and gave the web application permission to connect to the local PostgreSQL database server. Using a web browser, we used horde's administration UI to change a few configuration options, in particular to point it at its database, and to create a sample user.

These steps gave us both an Appliance Image and an Appliance Recipe: the Appliance Image consists of the two image files, the kernel and initrd for the virtual machine, and the appliance XML description. The Appliance Recipe consists of the appliance XML description, the kickstart file for the base system, and the puppet manifest generated by the cft session.

Even an application as simple as kronolith requires a small amount of custom development to be fully functional as an appliance. For kronolith, there were two specific problems that needed to be addressed: firstly, kronolith sends email alerts of upcoming appointments to users, which means that it has to be able to connect to a mailserver; and secondly, database creation is not fully scripted.

We addressed the first problem, sending email, in two different ways for the image and the recipe variants of the appliance: for the image variant, kronolith's web interface can be used to point it to an existing mail hub. Since this modifies kronolith's configuration files in /etc, these files need to be moved to the data disk to ensure that they are preserved during an image-based update. For the recipe case, we turned the appropriate config file into a template, requiring that the user fill in the name of the mail hub in a puppet manifest before deployment.

The fact that database creation is not fully scripted is not a problem for the kronolith Appliance Image, since the database is created by the appliance builder, not the user. For the recipe case, recipe instantiation has to be fully automated; in this case, though, the problem is easily addressed with a short shell script that is included with the puppet manifest and run when the appliance is instantiated.

Another issue illustrates how low-level details of the appliance are connected to the environment in which it is expected to run: as the appliance is used over time, it is possible that its data disk fills up. For consumer or SMB use, it would be enough to provide a page in the web UI that shows how full the appliance's disk is—though a nontechnical user will likely lack the skill to manually expand the data disk, and therefore needs the appliance to provide mechanisms to expand the data disk. The appliance cannot do that alone, though, since the file backing the data disk must be expanded from outside the appliance first. For this use, the appliance tools have to provide easy-to-use mechanisms for storage management.

For enterprise users, the issue presents itself completely differently: such users in general have the necessary skills to perform simple operations such as increasing the storage for the appliance's data. But they are not very well served by an indication in the appliance's UI of how full the data disk is, because such mechanisms scale poorly to large numbers of machines; for the same reason, sending an email notification when the data disk becomes dangerously full is not all that useful either. What serves an enterprise user best is being able to install a monitoring agent inside the appliance. Since different sites use different monitoring systems, this is a highly site-dependent operation. With the recipe-based appliance, adding the monitoring agent and its configuration to the appliance is very simple, no different from how the same is done for other systems at the site. With the image-based appliance, whether this is possible at all depends on whether the appliance builder made it possible to gain shell access. Other factors, such as whether the appliance builder kept a package database in the appliance image and what operating system and version the appliance is based on, determine how hard it is to enable monitoring of the appliance.

Over time, and if appliances find enough widespread use, we will probably see solutions to these problems, such as improved service discovery and standardized interfaces for monitoring and storage management, that alleviate these issues. Whether they can cover all the use cases that require direct access to the insides of the appliance is doubtful, since in aggregate they amount to managing every aspect of a system from the outside. In any event, such solutions are neither mature nor widespread enough to make access and use of traditional management techniques unnecessary in the near future.

Details of the kronolith example, including the appliance descriptor and the commented recipe, can be found on a separate website [5].

Acknowledgements

We wish to thank Daniel Berrange for many fruitful discussions, particularly around appliance descriptors, libvirt capabilities, and the ins and outs of matching guests to hosts.

References

[1] http://virt-manager.et.redhat.com/

[2] http://virt-factory.et.redhat.com/

[3] http://fedoraproject.org/wiki/StatelessLinux

[4] http://cft.et.redhat.com/

[5] http://people.redhat.com/dlutter/kronolith-appliance.html

Everything is a virtual filesystem: libferris

Ben Martin, affiliation pending, [email protected]

Abstract

Libferris is a user address space virtual (Semantic) filesystem. Over the years it has grown to be able to mount relational databases, XML, and many applications including X Window. Rich support for index and search has been added. Recently the similarities between modern Linux kernel filesystems, Semantic Filesystems, and the XML data model have been exploited in libferris. This leads to enabling XSLT filesystems and the evaluation of XQuery directly on a filesystem. XQueries are evaluated using both indexing and shortcut loading to allow things like db4 files, or kernel filesystems with directory name caching, to be used in XQuery evaluation, so that large lookup tables can be efficiently queried. As the result of an XQuery is usually XML, and given the similarities between XML and filesystems discussed below, the option is there for queries to generate filesystems.

Prior knowledge of the existence of Extended Attributes, as well as some familiarity with XML and XQuery, will be of help to the reader.

1 Introduction A semantic file system differs from a traditional file sys- tem in two major ways: Libferris [2, 9, 11, 13, 14] is a user address space virtual filesystem [1]. The most similar projects to libferris are • Interesting meta data from files is made available gnome-vfs and kio_slaves. However, the scope of libfer- as key-value pairs. ris is extended both in terms of its capability to mount things, its indexing and its metadata handling. • Allowing the results of a query to be presented as a virtual file system. Among its “non conventional” data sources, libferris is able to mount relational databases, XML, db4, Evolu- tion, Emacs, Firefox and X Window. The notion of interesting meta data is similar to modern Linux kernel Extended Attributes. The main difference The data model of libferris includes rich support for uni- being that meta data in a Semantic Filesystem is inferred fying metadata from many sources and presenting appli- at run time whereas Linux kernel Extended Attributes cable metadata on a per filesystem object basis. Index- are read/write persisted byte streams. In the context of ing and querying based on both fulltext and metadata libferris, the term Extended Attributes can refer to both predicates complements this data model allowing users persisted and inferred data. In this way the Extended


In a Semantic filesystem, interesting meta data is extracted from a file's byte content using what are referred to as transducers [4]. An example of a transducer would be a little bit of code that can extract the width of a specific image file format. A transducer to extract some image related metadata is shown conceptually in Figure 1. Image dimensions are normally available at specific offsets in the file's data depending on the image format. A transducer which understands the PNG image encoding will know how to find the location of the width and height information given a PNG image file.

Figure 1: The image file Foo.png is shown with its byte contents displayed from offset zero on the left extending to the right. The png image transducer knows how to find the metadata about the image file's width and height and, when called on, will extract or infer this information and return it through a metadata interface as an Extended Attribute.

Queries are submitted to the file system embedded in the path, and the results of the query form the content of the virtual directory. For example, to find all documents that have been modified in the last week one might read the directory "/query/(mtime>=begin last week)/". The results of a query directory are calculated every time it is read. Any metadata which can be handled by the transducers [4] (the metadata extraction process) can be used to form the query.

Libferris allows the filesystem itself to automatically chain together implementations. The filesystem implementation can be varied at any file or directory in the filesystem. For example, in Figure 2, because an XML file has a hierarchical structure it might also be seen as a filesystem. The ability to select a different implementation at any directory in a URL requires various filesystems to be overlaid on top of each other in order to present a uniform filesystem interface.

When the filesystem implementation is varied at a file or directory, two different filesystem handlers are active at once for that point. The left side of Figure 2 is shown with more detail in Figure 3. In this case both the operating system kernel implementation and the XML stream filesystem implementation are active at the URL file://tmp/order.xml. The XML stream implementation relies on the kernel implementation to provide access to a byte stream which is the XML file's contents. The XML implementation knows how to interpret this byte stream and how to allow interaction with the XML structure through a filesystem interface.

Note that because the XML implementation can interact with the operating system kernel implementation to complete its task, this is subtly different to standard UNIX mounting, where a filesystem completely overrides the mount point.

Figure 2: A partial view of a libferris filesystem. Arrows point from children to their parents; file names are shown inside each rectangle. Extended Attributes are not shown in the diagram. The box partially overlapped by order.xml is the contents of that file. On the left side, an XML file at path /tmp/order.xml has a filesystem overlaid to allow the hierarchical data inside the XML file to be seen as a virtual filesystem. On the right, relational data can be accessed as one of the many data sources available through libferris.

The core abstractions in libferris can be seen as the ability to offer many filesystem implementations and select from among them automatically where appropriate for the user, the presentation of key-value attributes that files possess, a generic stream interface [6] for file and metadata content, indexing services, and the creation of arbitrary new files.

Filesystem implementations allow one to drill into composite files such as XML, ISAM databases (Indexed Sequential Access Method, e.g., B-Tree data stores such as Berkeley db), or tar files and view them as a file system. This is represented in Figure 2. Having the virtual filesystem able to select among filesystem implementations in this fashion allows libferris to provide a single file system model on top of a number of heterogeneous data sources; among those libferris currently handles are http, ftp, db4, dvd, edb, eet, ssh, tar, gdbm, sysV shared memory, LDAP, mbox, sockets, mysql, tdb, and XML.
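As a concrete sketch of this drilling, the session below treats an XML file as a directory using the ferrisls and fcat tools that appear in the figures later in this paper. The path, element names, and output are illustrative, borrowed from Figure 2, rather than a captured run:

# List the elements inside an XML file as though it were a directory
# (path and element names are illustrative, as in Figure 2).
$ ferrisls /tmp/order.xml
order

$ ferrisls /tmp/order.xml/order
customer

# Read the byte content of a node inside the XML file.
$ fcat /tmp/order.xml/order/customer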

Figure 3: The filesystem implementation for an XML file is selected to allow the hierarchical structure of the XML to be exposed as a filesystem. Two different implementations exist at the "order.xml" file level: an implementation using the operating system's kernel IO interface and an implementation which knows how to present a stream of XML data as a filesystem. The XML implementation relies on the kernel IO implementation to provide the XML data itself.

Presentation of key-value attributes is performed either by storing attributes on disk or by creating synthetic attributes whose values can be dynamically generated and which can perform actions when their values are changed. Both stored and generated attributes in libferris are referred to simply as Extended Attributes (EAs). Examples of EAs that can be generated include the width and height of an image, the bit rate of an mp3 file, or the MD5 hash of a file (MD5 hash function RFC, http://www.ietf.org/rfc/rfc1321.txt). This arrangement is shown in Figure 4.

For an example of a synthetic attribute that is writable, consider an image file which has the key-value EA width=800 attached to it. When one writes a value of 640 to the EA width for this image, the file's image data is scaled to be only 640 pixels wide. Having performed the scaling of the image data, the next time the width EA is read for this image it will generate the value 640, because the image data is 640 pixels wide. In this way the differences between generated and stored attributes are abstracted from applications.
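A brief sketch of how such generated EAs surface to the user: the session below asks ferrisls to show a few attributes of an image file, in the style of the later figures. The file name is illustrative, and the attribute spellings are assumed from the examples in the text rather than being confirmed names:

# Show generated EAs alongside the file name; width and height are
# inferred from the image data rather than stored on disk.
$ ferrisls --show-ea=name,width,height Foo.png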

Another way libferris extends the EA interface is by offering schema data for attributes. Such meta data allows default sorting orders to be set for a datatype, filtering to use the correct comparison operator (integer vs. string comparison), and GUIs to present data in a format which the user will find intuitive.

3 Data models: XML and Semantic Filesystems

As the semantic filesystem is designed as an extension of the traditional Unix filesystem data model, the two are very similar. Considering the relatively new adoption of Extended Attributes by kernel filesystems, the data models of the two filesystem types are identical.

The main difference is the performance difference between deriving attributes (semantic filesystem and transducers) and storing attributes (Linux kernel Extended Attributes). Libferris extends the more traditional data model by allowing type information to be specified for its Extended Attributes and by allowing many binary attributes to form part of a classification ontology [12, 7].

Type information is exposed using the same EA interface. For example, an attribute foo would have a schema:foo which contains the URL of the schema for the attribute foo. To avoid infinite nesting, the schema for schema:foo, i.e., schema:schema:foo, will always have the type schema and there will be no schema:schema:schema:foo.

As the libferris data model is a superset of the standard Linux kernel filesystem data model, one may ignore libferris specific features and consider the two data models in the same light. This is in fact what takes place when libferris is exposed as a FUSE filesystem [1].
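The schema EAs described above lend themselves to a short sketch. Assuming the schema:foo naming convention holds for the width attribute, and with an illustrative file name, one might read an attribute and the URL of its schema side by side:

# Read an attribute and the URL of its schema through the same EA
# interface (schema:width follows the schema:foo convention above).
$ ferrisls --show-ea=name,width,schema:width Foo.png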

Figure 4: Metadata is presented via the same Extended Attribute (EA) interface. The values presented can be derived from the file itself, derived from the values of other EAs, taken from the operating system's Extended Attribute interface, or from an external RDF repository.

Taking an abstract view of the data model of libferris one arrives at: files with byte contents, files nested in a hierarchy, and arbitrary attributes associated with each file. This is in many ways similar to the data model of XML: elements with an ordered list of byte content, elements nested in an explicit ordering, with attributes possibly attached to each element.

The differences between the data models may raise issues and require explicit handling. The differences have been found to be:

• XML elements can contain multiple contiguous byte runs serving as their "contents." Files may have many sections of contiguous bytes separated by holes. Holes serve to allow the many sections of contiguous bytes to appear in a single offset range. For example, I could have a file with the two words "alice" and "wonderland" logically separated by 10 bytes. The divergence of the data models in this respect is that the many sections of contiguous bytes in an XML element are not explicitly mapped into a single logical contiguous byte range.

• XML elements are explicitly ordered by their physical location in the document. For any two elements with a common parent element it will be apparent which element comes "before" the other. Normally files in a filesystem are ordered by an implementation detail: their inode. The inode is a unique number (across the filesystem itself) identifying that file. Many tools which interact with a filesystem will sort a directory by file name to be more palatable to a human reader.

• The notions of file name and element name have different syntax requirements. A file name can contain any character apart from the "/" character. There are much more stringent requirements on XML element names: no leading numbers, and a large range of characters which are forbidden.

• For all XML elements with a common parent it is not required that each child's name be unique. Any two files in a directory must have different names.

The differences are shown in Figure 5.

The identification of this link between data models, and of various means to address the issues where differences arise, helps both the semantic filesystem and XML communities by bringing new possibilities to both: for example, the direct evaluation of XQuery on a semantic filesystem instead of on an XML document. The blurring of the filesystem and XML also allows modern Office suites to directly manipulate filesystems [14].

The file name uniqueness issue is only present if XML is being seen as a semantic filesystem. In this case it can be acceptable to modify the file name to include a unique number as a postfix.

Figure 5: On the left an XML Element node is shown with some child nodes. On the right a filesystem node is shown with some similar child nodes. Note that XML Text nodes can be considered to provide the byte content of the synonymous filesystem abstraction, but metadata about their arrangement can not be easily communicated. Child nodes on the XML side do not need to have unique names for the given parent node and maintain a strict document order. Child nodes on the filesystem side can contain more characters in their file names, but the ordering is implementation defined by default.

In cases such as resolution of XPath or XQueries, file names should be tested without consideration of the unique postfix so that query semantics are preserved.

As XML elements can not contain the "/" character, exposing XML as a semantic filesystem poses no issue with mapping XML element names into file names. Unfortunately the heavy restrictions on XML element name syntax do present an issue in the other direction. The most convenient solution has been found to be mapping illegal characters in file names into a more verbose description of the character itself. For example, a file name such as "foo<bar>.txt" might be canonicalized to an XML element name of "foo-lt-bar-gt.txt". The original unmodified file name can be accessed through a name spaced XML attribute on the XML element.

XML element ordering can be handled by exposing the place at which the XML element appeared in document order. For example, for a document with "b" containing "c,d,e" in that order, the "c" file would have a place of zero, and "e" would be two. With this attribute available, the original document ordering can be obtained through the semantic filesystem by sorting the directory on that attribute. As there is no (useful) document ordering for a filesystem, this is not an issue when exposing a filesystem as XML.

There is no simple solution to the fact that XML elements can have multiple text nodes as children. In cases where XML with multiple child text nodes exists, they are merged into a single text node containing the concatenation in document order of the child text nodes. Files with holes are presented as though the hole contained null bytes.

4 Information Search

In recent years much emphasis has been placed on so called "Desktop search". Few machines exist as islands, and as such the index and search capabilities of any non trivial system must allow seamless integration of Internet, Intranet, and desktop sources. The term "filesystem search" at least removes the connotation that search is limited to the desktop alone.

Details of indexing have been presented in prior publications [10, 9, 11]. Briefly, the indexing capabilities in libferris are exposed through plugins. Much of the emphasis has been placed on indexing of metadata, leaving full text indexing [17] to implementations such as Lucene and TSearch2. Two of the main metadata plugins use sorted Berkeley db4 files or a PostgreSQL database.
The chosen query syntax is designed based on "The String Representation of LDAP Search Filters" [5]. This is a very simple syntax which provides a small set of comparative operators to make key operator value terms, and a means to combine these terms with boolean and, or and not. All terms are contained in parentheses, with boolean combining operators located before the terms they operate on.

The comparative operators have been enhanced and in some cases modified from the original semantics [5]. Syntax changes include the use of == instead of = for equality testing. Approximate matching ~= was dropped in favor of regular expression matching using perl operator syntax =~. Operators which are specific to the LDAP data model have been removed. The operators and semantics are presented in Table 1. Coercion of the rvalue is performed both for sizes and relative times. For example, "begin today" will be converted into the operating system's time representation for the start of the day on which the query is executed.

  OP   Semantics
  =~   lvalue matches the regular expression in rvalue
  ==   lvalue is exactly equal to rvalue
  <=   lvalue is less than or equal to rvalue
  >=   lvalue is greater than or equal to rvalue

Table 1: Comparative operators supported by the libferris search syntax. The operators are used infix; there is a key on the left side and a value on the right. The key is used to determine which EA is being searched. The lvalue is the name of the EA being queried. The rvalue is the value the user supplied in the query.

Resolution of and and or is performed (conceptually) by merging the sets of matching file names using an intersection or a union operation respectively. The semantics of negation are like a set subtraction: the files matching the negated subquery are removed from the set of files matching the portion of the query that the negation will combine with. If negation is applied as the top level operation, the set of files to combine with is considered to be the set of all files. The nesting of and, or and not will define what files the negation subquery will combine with. As an example of negation resolution, consider the fquery which combines a width search with a negated size search: (&(width<=640)(!(size<=900))). The set of files which have a width satisfying the first subquery is found; call this set A. The set of files which have a size matching the second part of the query, i.e., size<=900, is found; call this set B. The result of the query is A \ B.

The eaq:// virtual filesystem takes a query as a directory name and will populate the virtual directory with files matching the query. A closely related query filesystem is the eaquery:// tree. The eaquery:// filesystem has slightly longer URLs, but it allows you to set limits on the number of results returned and to set how conflicting file names are resolved. Some example queries are shown in Figure 6. Normally a file's URL is used as its file name for eaquery:// filesystems. The shortnames option uses just the file's name, and when two results from different directories happen to have the exact same file name it appends a unique number to one of the results' file names. This is likely to happen for common file names such as README.

Full text queries can be evaluated using the fulltextquery:// or ftxq:// URL schemes. Both metadata and fulltext predicates can be evaluated to produce a single result filesystem [11].
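To tie the syntax and the query filesystems together, the session below runs the negation example from the text through eaq://, and a boolean full text query through fulltextquery://. The commands follow the style of Figure 6; the query strings come from the text and from Figure 12, and any output is omitted:

# Files whose width EA is at most 640, excluding those whose size is
# at most 900: the A \ B negation example from the text.
$ ferrisls "eaq://(&(width<=640)(!(size<=900)))"

# A top level negation combines with the set of all files.
$ ferrisls "eaq://(!(size<=900))"

# A boolean full text query via the fulltextquery:// scheme, as
# constructed programmatically in Figure 12.
$ ferrisls "fulltextquery://boolean/alice wonderland"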
One major area where the index and search in libferris diverge from similar tools is the application of Formal Concept Analysis (FCA) [3]. FCA can be seen as unsupervised machine learning, and is a formal method for dealing with the natural groupings within a given set of data. The result of FCA is a Concept Lattice. A Concept Lattice has many formal mathematical properties but may be considered informally as a specialization hierarchy where, the further down the lattice one goes, the more attributes the files in each node have. Files can be in multiple nodes at the same time. For example, if there are two attributes (mtime>=begin last week) and (mtime>=begin last month), then a file with the first attribute will also have the latter.

Using the SELinux type and identity of the example 201,759 files, the concept lattice shown in Figure 7 is generated. The concept 11 in the middle of the bottom row shows that the user_u identity is only active for 3 fonts_t typed files. Many of the links to the lower concepts are caused by the root and system identities being mutually exclusive, while the system identity combines with every attribute that the root identity does.

Readers interested in FCA with libferris should see [8, 16, 15].

4.1 XQuery

Being able to view an entire filesystem as an XML data model allows the evaluation of XQuery directly on the filesystem.

Libferris implements the native XQilla data model and attempts to offer optimizations to XQuery evaluation. Some of the possibilities of this are very nice, bringing together db4, PostgreSQL, XML and file:// in a single XQuery.

There are also many efficiency gains that are available by having multiple data sources (filesystems) available. In an XML only query, if you are looking up a user by a key, things can be very slow. If you copy the user info into a db4 file and use libferris to evaluate your XQuery, a user lookup becomes only a handful of db4 btree or hash lookups.
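As a sketch of the db4 shortcut just described, one might drill into a Berkeley db4 file exactly as Figure 2 drills into XML, so that a keyed lookup touches only a few btree pages. The database path and key below are illustrative and assume db4 support is compiled in:

# Drill into a db4 file as a directory and fetch one record by key;
# this costs a handful of btree lookups rather than an XML scan.
$ ferrisls /tmp/users.db
$ fcat /tmp/users.db/alice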

Mounted PostgreSQL functions allow efficient access to relational data from XQuery. PostgreSQL function arguments are passed through the file path from XQuery variables. This is a reasonable stop gap to being able to use prepared SQL statements with arguments as a filesystem. If the result is only a hand full of tuples, it will be very quick for libferris to make it available to the XQuery as a native document model.

A PostgreSQL function is set up as shown in Figure 8. This can subsequently be used as any other read only filesystem, as shown in Figure 9. A simple XQuery is shown in Figure 10 and its evaluation in Figure 11.

A major area which is currently optimized in the evaluation of XQuery with libferris is the evaluation of XPath expressions. This is done in the Node::getAxisResult() method, specifically when the axis is XQStep::CHILD. Both files and directories are represented as Context objects in libferris. When seeking a specific child node, Context::isSubContextBound() is used to quickly test if such a child exists. The isSubContextBound() method indirectly calls Context::priv_getSubContext(), which is where filesystem plugins can offer the ability to read a directory piecewise.

The normal opendir(3), readdir(3) sequence of events used to read a directory can be preempted by calls to Context::priv_getSubContext() to discover partial directory contents. As libferris is a virtual filesystem, some other filesystem implementations also implement Context::priv_getSubContext() and piecewise directory reading. A specific example is the db4 filesystem, which allows very efficient loading of a hand full of files from a large directory.

Where the direct evaluation on a filesystem as shown above becomes too slow, the filesystem indexes can also be used in an XQuery by making an XPath that uses the filesystem interface to queries shown in Figure 6.

A more complete example is shown in Figure 12. This uses the filesystem index and search along with XQuery variables to search for files which contain a person AND a given location as a boolean style full text query.

Note that multiple use of indexes, in particular the use of federated filesystem indexes [11], is possible, together with immediate evaluation of other queries on db4 or RDF files, to generate a combined result.

5 The Future

Closer integration of XML and libferris, and in particular the ability to arbitrarily stack the two in any order. For example, being able to run an XQuery and take its results as the input to xsltfs:// to generate an office document to edit with Open Office. As xsltfs:// does not enforce a strict isomorphism between filesystems, the resulting document when edited and saved could effect changes on the underlying filesystem objects that the XQuery dealt with, as well as any other desired side effects.

More efficient and user friendly access to the Formal Concept Analysis in libferris. There are still some complex persistence and processing tasks which need to be improved before the use of FCA on filesystems will see broad adoption.

References

[1] Fuse, http://fuse.sf.net. Visited Feb 2007.

# All files modified recently
$ ferrisls -lh "eaq://(mtime>=begin last week)"

# Same as above but limited to 100 results,
# as an XML file
$ ferrisls --xml \
    "eaquery://filter-100/(mtime>=begin last week)"

# limit of 10,
# resolve conflicts with version numbers,
# include the desired metadata in the XML result
$ ferrisls --xml \
    --show-ea=mtime-display,url,size-human-readable \
    "eaquery://filter-shortnames-10/(mtime>=blast week)"

Figure 6: Query results as a filesystem.

Figure 7: Concept lattice for SELinux type and identity of files in /usr/share/ on a Fedora Core 4 Linux installation. The Hasse diagram is arranged with three major sections: direct parents of the root are in a row across the top; refinements of selinux_identity_system_u are down the right side, with combinations of the top row in the middle and left of the diagram. Attributes are shown above a node and apply transitively to all nodes reachable downwards. The number of files in each node is shown below it.

# psql junkdb
# CREATE TYPE junk_result AS (f1 int, f2 text);
# drop function junk( int, int );
# CREATE FUNCTION junk( int, int )
  returns setof junk_result AS $BODY$
DECLARE
  iter int;
  rec junk_result;
BEGIN
  iter = $1;
  for rec in select $1*10,  $2*100
       union select $1*100, $2*1000 LOOP
    return next rec;
  END LOOP;
  return;
END;
$BODY$ LANGUAGE 'plpgsql';
# exit

Figure 8: Setting up a PostgreSQL function to be mounted by libferris

$ ferrisls --show-ea=f1,f2,name --xml "postgresql://localhost/play/junk(1,2)"

$ fcat postgresql://localhost/play/junk(1,2)/10-200
...

$ ferriscp \
    postgresql://localhost/play/junk(1,2)/10-200 \
    /tmp/plan8.xml

Figure 9: Viewing the output of a PostgreSQL function with ferrisls

$ cat fdoc-pg.xq
{
  for $b in ferris-doc("postgresql://localhost/play/junk(1,2)")
  return $b
}

Figure 10: A trivial XQuery to show the output of calling a PostgreSQL function through libferris

$ ferris-xqilla --show-ea=f1,f2,name fdoc-pg.xq
10,200
<context f1="10" f2="200" />
<context f1="100" f2="2000" />

Figure 11: The evaluation of the XQuery in Figure 10. The embedded < etc. come from the "content" of the file, which in this case is the same as the above ferrisls command.

$ cat xquery-index.xq
declare variable $qtype := "boolean";
declare variable $person := "alice";
declare variable $location := "wonderland";
{
  for $idx in ferris-doc(
      concat("fulltextquery://", $qtype, "/",
             $person, " ", $location))
  for $res in $idx/*
  return
}

$ ferris-xqilla xquery-index.xq