Rethinking the I/O Memory Management Unit (IOMMU)


Rethinking the I/O Memory Management Unit (IOMMU)

Research Thesis. Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Moshe Malka

Submitted to the Senate of the Technion – Israel Institute of Technology. Adar 5775, Haifa, March 2015. (Technion, Computer Science Department, M.Sc. Thesis MSC-2015-10, 2015.)

This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty of Computer Science. Some results in this thesis have been published as articles by the author and research collaborators in conferences and journals during the course of the author's research period, the most up-to-date versions of which are:

1. Moshe Malka, Nadav Amit, Muli Ben-Yehuda, and Dan Tsafrir. rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring Buffers. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2015).
2. Moshe Malka, Nadav Amit, and Dan Tsafrir. Efficient IOMMU Intra-Operating System Protection. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST 2015).

Acknowledgements. I would like to thank my advisor Dan Tsafrir for his devoted guidance and help, my research team Nadav Amit and Muli Ben-Yehuda, my parents, and my friends. The generous financial help of the Technion is gratefully acknowledged.

Contents

List of Figures
Abstract
Abbreviations and Notations
1 Introduction
2 Background
  2.1 Virtual Memory
    2.1.1 Physical and Virtual Addressing
    2.1.2 Address Spaces
    2.1.3 Page Table
    2.1.4 Virtual Memory as a Tool for Memory Protection
    2.1.5 Address Translation
  2.2 Direct Memory Access
    2.2.1 Transferring Data from the Memory to the Device
    2.2.2 Transferring Data from the Device to the Memory
  2.3 Adding Virtual Memory to I/O Transactions
3 rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring Buffers
  3.1 Introduction
  3.2 Background
    3.2.1 Operating System DMA Protection
    3.2.2 IOMMU Design and Implementation
    3.2.3 I/O Devices Employing Ring Buffers
  3.3 Cost of Safety
    3.3.1 Overhead Components
    3.3.2 Protection Modes and Measured Overhead
    3.3.3 Performance Model
  3.4 Design
  3.5 Evaluation
    3.5.1 Methodology
    3.5.2 Results
    3.5.3 When IOTLB Miss Penalty Matters
    3.5.4 Comparing to TLB Prefetchers
  3.6 Related Work
4 Efficient IOMMU Intra-Operating System Protection
  4.1 Introduction
  4.2 Intra-OS Protection
  4.3 IOVA Allocation and Mapping
  4.4 Long-Lasting Ring Interference
  4.5 The EiovaR Optimization
    4.5.1 EiovaR with Strict Protection
    4.5.2 EiovaR with Deferred Protection
  4.6 Evaluation
    4.6.1 Methodology
    4.6.2 Results
  4.7 Related Work
5 Reducing the IOTLB Miss Overhead
  5.1 Introduction
  5.2 General Description of All the Prefetchers We Explore
  5.3 Markov Prefetcher (MP)
    5.3.1 Markov Chain Theorem
    5.3.2 Prefetching Using the Markov Chain
    5.3.3 Extension to IOMMU
  5.4 Recency-Based Prefetching (RP)
    5.4.1 TLB Hit
    5.4.2 TLB Miss
    5.4.3 Extension to IOMMU
  5.5 Distance Prefetching (DP)
  5.6 Evaluation
    5.6.1 Methodology
    5.6.2 Results
  5.7 Measuring the Cost of an Intel IOTLB Miss
6 Conclusions
  6.1 rIOMMU
  6.2 EiovaR
  6.3 Reducing the IOTLB Miss Overhead
Hebrew Abstract

List of Figures

2.1 A system that uses physical addressing.
2.2 A system that uses virtual addressing.
2.3 Flat page table.
2.4 Allocating a new virtual page.
2.5 Using virtual memory to provide page-level memory protection.
2.6 Address translation with a page table.
2.7 Page hit.
2.8 Components of a virtual address that are used to access the TLB.
2.9 TLB hit.
2.10 TLB miss.
2.11 A two-level page table hierarchy. Notice that addresses increase from top to bottom.
2.12 Address translation with a k-level page table.
2.13 Addressing for a small memory system. Assume 14-bit virtual addresses (n = 14), 12-bit physical addresses (m = 12), and 64-byte pages (P = 64).
2.14 TLB, page table, and cache for a small memory system. All values in the TLB, page table, and cache are in hexadecimal notation.
2.15 DMA transaction flow with the IOMMU (sequence diagram).
3.1 The IOMMU is for devices what the MMU is for processes.
3.2 Intel IOMMU data structures for IOVA translation.
3.3 A driver drives its device through a ring. With an IOMMU, pointers are IOVAs (both registers and target buffers).
3.4 The I/O device driver maps an IOVA v to a physical target buffer p. It then assigns v to the DMA descriptor.
3.5 The I/O device writes the packet it receives to the target buffer through v, which the IOMMU translates to p.
3.6 After the DMA completes, the I/O device driver unmaps v and passes p to a higher-level software layer.
3.7 CPU cycles used for processing one packet. The top bar labels are relative to Cnone = 1,816 (bottommost grid line).
3.8 Throughput of Netperf TCP stream as a function of the average number of cycles spent on processing one packet.
3.9 The rIOMMU data structures. (e) is used only by hardware. The last two fields of rRING are used only by software.
3.10 rIOMMU data structures for IOVA translation.
3.11 Outline of the rIOMMU logic. All DMAs are carried out with IOVAs that are translated by the rtranslate routine.
3.12 Outline of the rIOMMU OS driver, implementing map and unmap, which respectively correspond to Figures 3.4 and 3.6.
3.13 Absolute performance numbers of the IOMMU modes when using the Mellanox (top) and Broadcom (bottom) NICs.
4.1 IOVA translation using the Intel IOMMU.
4.2 Pseudocode of the baseline IOVA allocation scheme. The functions rb_next and rb_prev return the successor and predecessor of the node they receive, respectively.
4.3 The length of each alloc_iova search loop in a 40K (sub)sequence of alloc_iova calls performed by one Netperf run. One Rx-Tx interference leads to regular linearity.
4.4 Netperf TCP stream iteratively executed under strict protection. The x axis shows the iteration number.
4.5 Average cycle breakdown of map with Netperf/strict.
4.6 Average cycle breakdown of unmap with Netperf/strict.
4.7 Netperf TCP stream iteratively executed under deferred protection. The x axis shows the iteration number.
4.8 Under deferred protection, EiovaR-k eliminates costly linear searches when k exceeds the high-water mark W.
4.9 Length of the alloc_iova search loop under the EiovaR-k deferred protection regime for three k values when running Netperf TCP Stream. Bigger capacity implies that the searches become shorter on average. Big enough capacity (k ≥ W = 250) eliminates the searches altogether.
4.10 The performance of baseline vs. EiovaR allocation, under strict and deferred protection regimes, for the Mellanox (top) and Broadcom (bottom) setups. Except in the case of Netperf RR, higher values indicate better performance.
4.11 Netperf Stream throughput (top) and CPU usage (bottom) for different message sizes in the Broadcom setup.
4.12 Impact of increased concurrency on Memcached in the Mellanox setup. EiovaR allows the performance to scale.
5.1 General scheme.
5.2 Markov state transition diagram, represented as a directed graph (right) or a matrix (left).
5.3 Schematic implementation of the Markov prefetcher.
5.4 Schematic depiction of the recency prefetcher on a TLB hit.
5.5 Schematic depiction of the recency prefetcher on a TLB miss.
5.6 Schematic depiction of the distance prefetcher on a TLB miss.
5.7 Hit rate simulation of Apache benchmarks with message sizes of 1K (top) and 1M (bottom).
5.8 Hit rate simulation of Netperf stream with message sizes of 1K (top) and 4K (bottom).
5.9 Hit rate simulation of Netperf RR (top) and Memcached (bottom).
5.10 Difference between the RTT when the IOMMU is enabled and the RTT when the IOMMU is disabled.
Recommended publications
  • Memory Protection at Option

Memory Protection at Option: Application-Tailored Memory Safety in Safety-Critical Embedded Systems (German title: "Speicherschutz nach Wahl – Auf die Anwendung zugeschnittene Speichersicherheit in sicherheitskritischen eingebetteten Systemen"). Dissertation submitted to the Faculty of Engineering of the Universität Erlangen-Nürnberg for the degree of Doktor-Ingenieur by Michael Stilkerich, Erlangen, 2012. Accepted as a dissertation by the Faculty of Engineering (submitted 09.07.2012, defended 30.11.2012; dean: Prof. Dr.-Ing. Marion Merklein; reviewers: Prof. Dr.-Ing. Wolfgang Schröder-Preikschat and Prof. Dr. Michael Philippsen).

Abstract: With the increasing capabilities and resources available on microcontrollers, there is a trend in the embedded industry to integrate multiple software functions on a single system to save cost, size, weight, and power. This integration raises new requirements, among them the need for spatial isolation, which is commonly established by using a memory protection unit (MPU) that can constrain access to the physical address space to a fixed set of address regions. MPU-based protection is limited in terms of available hardware, flexibility, granularity, and ease of use. Software-based memory protection can provide an alternative to, or a complement of, MPU-based protection, but has found little attention in the embedded domain. In this thesis, I evaluate the qualitative and quantitative advantages and limitations of MPU-based memory protection and of software-based protection based on a multi-JVM. I developed a framework composed of the AUTOSAR-OS-like operating system CiAO and KESO, a Java implementation for deeply embedded systems. The framework allows choosing from no memory protection, MPU-based protection, software-based protection, and a combination of the two.
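The MPU mechanism this abstract contrasts with software-based protection boils down to a small table of permitted address regions consulted on every access. Below is a schematic C model of that check; it is illustrative only, and the region count, field names, and permission flags are assumptions for the sketch, not CiAO/KESO code or any real MPU's register layout.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Schematic model of an MPU: a fixed, small set of address regions,
 * each granting read/write/execute permission. Real MPUs typically
 * offer on the order of 8-16 such regions. */
enum { MPU_REGIONS = 8, PERM_R = 1, PERM_W = 2, PERM_X = 4 };

struct mpu_region {
    uintptr_t base;
    uintptr_t size;   /* region length in bytes; 0 = slot unused */
    uint8_t   perms;
};

static struct mpu_region mpu[MPU_REGIONS];

/* Spatial isolation check: an access is allowed only if it falls
 * inside some configured region that grants the requested permission. */
static bool mpu_allows(uintptr_t addr, uint8_t perm)
{
    for (int i = 0; i < MPU_REGIONS; i++)
        if (mpu[i].size != 0 &&
            addr >= mpu[i].base &&
            addr <  mpu[i].base + mpu[i].size &&
            (mpu[i].perms & perm) == perm)
            return true;
    return false;   /* fault: outside every permitted region */
}

int main(void)
{
    /* Grant read/write to one hypothetical 4 KiB RAM region. */
    mpu[0] = (struct mpu_region){ .base = 0x20000000, .size = 0x1000,
                                  .perms = PERM_R | PERM_W };

    printf("%d %d\n", mpu_allows(0x20000010, PERM_W),   /* 1: allowed */
                      mpu_allows(0x10000000, PERM_R));  /* 0: fault   */
    return 0;
}
```

Software-based protection, as with the KESO multi-JVM approach, obtains the same spatial isolation by construction (checked references in the language runtime) instead of from such hardware region registers.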
  • A Minimal PowerPC™ Boot Sequence for Executing Compiled C Programs

Order Number: AN1809/D, Rev. 0, 3/2000. Semiconductor Products Sector Application Note. A Minimal PowerPC™ Boot Sequence for Executing Compiled C Programs. PowerPC Systems Architecture & Performance, [email protected].

This document describes the procedures necessary to successfully initialize a PowerPC processor and begin executing programs compiled using the PowerPC embedded application binary interface (EABI). The items discussed in this document have been tested for MPC603e™, MPC750, and MPC7400 microprocessors. The methods and source code presented in this document may work unmodified on similar PowerPC platforms as well. This document contains the following topics:

• Part I, "Overview," provides an overview of the conditions and exceptions for the procedures described in this document.
• Part II, "PowerPC Processor Initialization," provides information on the general setup of the processor registers, caches, and MMU.
• Part III, "PowerPC EABI Compliance," discusses aspects of the EABI that apply directly to preparing to jump into a compiled C program.
• Part IV, "Sample Boot Sequence," describes the basic operation of the boot sequence and the many configuration options, explains in detail a sample configurable boot and how the code may be modified for use in different environments, and discusses the compilation procedure using the supporting GNU build environment.
• Part V, "Source Files," contains the complete source code for the files ppcinit.S, ppcinit.h, reg_defs.h, ld.script, and Makefile.

This document contains information on a new product under development by Motorola. Motorola reserves the right to change or discontinue this product without notice. © Motorola, Inc., 2000. All rights reserved.

Part I, Overview: the procedures discussed in this document perform only the minimum amount of work necessary to execute a user program.
  • Memory Management

Memory Management. These slides are created by Dr. Huang of George Mason University (CS471). Students registered in Dr. Huang's courses at GMU can make a single machine-readable copy and print a single copy of each slide for their own reference, as long as the slide contains the copyright statement and the GMU facilities are not used to produce the paper copies. Permission for any other use, either in machine-readable or printed form, must be obtained from the author in writing.

Memory: a set of data entries indexed by addresses (the slide depicts byte addresses 0000 through 000F). Typically the basic data unit is the byte; in 32-bit machines, 4 bytes are grouped into words. Have you seen those DRAM chips in your PC?

Logical vs. Physical Address Space. The addresses used by the RAM chips are called physical addresses. In primitive computing devices, the address a programmer/processor uses is the actual address: when the process fetches byte 000A, the content of 000A is provided. In advanced computers, the processor operates in a separate address space, called the logical (or virtual) address space. A Memory Management Unit (MMU) is used to map logical addresses to physical addresses. Various mapping technologies are discussed below; the MMU is a hardware component, and modern processors have their MMU on the chip (Pentium, Athlon, ...).

Continuous Mapping: Dynamic Relocation. With the process loaded at physical address 4000, the processor wants byte 0010 and the 4010th byte is fetched; an MMU for dynamic relocation implements exactly this, as sketched below.

Segmented Mapping. Obviously, a more sophisticated MMU is needed to implement this.

Swapping. A process can be swapped temporarily out of memory to a backing store (a hard drive), and then brought back into memory for continued execution.
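As a concrete illustration of the continuous-mapping scheme, here is a minimal C sketch of dynamic relocation. The register names BASE and LIMIT are invented for the example; a real MMU performs this addition and bounds check in hardware on every access.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical relocation registers for one process; in the slides'
 * example the process image is loaded at physical address 4000. */
static const uint32_t BASE  = 4000;   /* start of the process in RAM */
static const uint32_t LIMIT = 8192;   /* size of its logical space   */

/* Dynamic relocation: physical = BASE + logical, after a bounds
 * check that provides the protection the MMU enforces. */
static uint32_t translate(uint32_t logical)
{
    if (logical >= LIMIT) {               /* out of range: the MMU  */
        fprintf(stderr, "fault at %u\n", logical); /* raises a trap */
        exit(1);
    }
    return BASE + logical;
}

int main(void)
{
    /* The processor asks for byte 0010; byte 4010 of RAM is fetched. */
    printf("logical 10 -> physical %u\n", translate(10));
    return 0;
}
```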
  • Arm System Memory Management Unit Architecture Specification

Arm® System Memory Management Unit Architecture Specification, SMMU architecture version 3. Document number ARM IHI 0070, document version D.a, non-confidential. Copyright © 2016-2020 Arm Limited or its affiliates. All rights reserved.

Release information:
• 2020/Aug/31, version D.a: update with SMMUv3.3 architecture; amendments and clarifications.
• 2019/Jul/18, version C.a: amendments and clarifications.
• 2018/Mar/16, version C: update with SMMUv3.2 architecture; further amendments and clarifications.
• 2017/Jun/15, version B: amendments and clarifications.
• 2016/Oct/15, version A: first release.

Proprietary Notice. This document is protected by copyright and other related rights, and the practice or implementation of the information contained in this document may be protected by one or more patents or pending patent applications. No part of this document may be reproduced in any form by any means without the express prior written permission of Arm. No license, express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document unless specifically stated. Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use the information for the purposes of determining whether implementations infringe any third party patents. THIS DOCUMENT IS PROVIDED "AS IS". ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation with respect to, and has undertaken no analysis to identify or understand the scope and content of, patents, copyrights, trade secrets, or other rights.
  • Quantifying the Performance of Garbage Collection Vs. Explicit Memory Management

Quantifying the Performance of Garbage Collection vs. Explicit Memory Management. Matthew Hertz, Computer Science Department, Canisius College, Buffalo, NY 14208, [email protected]. Emery D. Berger, Dept. of Computer Science, University of Massachusetts Amherst, Amherst, MA 01003, [email protected].

Abstract: Garbage collection yields numerous software engineering benefits, but its quantitative impact on performance remains elusive. One can compare the cost of conservative garbage collection to explicit memory management in C/C++ programs by linking in an appropriate collector. This kind of direct comparison is not possible for languages designed for garbage collection (e.g., Java), because programs in these languages naturally do not contain calls to free. Thus, the actual gap between the time and space performance of explicit memory management and precise, copying garbage collection remains unknown. We introduce a novel experimental methodology that lets us quantify the performance of precise garbage collection versus explicit memory management. Our system allows us to treat unaltered Java programs as if they used explicit memory management by relying on oracles to insert calls to free. These oracles are generated from profile information gathered in earlier application runs. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Dynamic storage management; D.3.4 [Processors]: Memory management (garbage collection). General Terms: Experimentation, Measurement, Performance. Keywords: oracular memory management, garbage collection, explicit memory management, performance analysis, time-space tradeoff, throughput, paging.

1. Introduction. Garbage collection, or automatic memory management, provides …
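The oracular mechanism can be pictured with a toy trace-driven sketch. This is entirely illustrative: the real system instruments unaltered Java programs with profile-derived oracles, whereas the mock-up below just shows the idea of freeing each object right after its recorded last access.

```c
#include <stdio.h>
#include <stdlib.h>

#define NOBJ 3

/* A toy "oracle": from a profiled earlier run we know, for each
 * object, the index of the program step that touches it last. */
static const int last_access_step[NOBJ] = {4, 2, 5};

int main(void)
{
    void *obj[NOBJ];
    for (int i = 0; i < NOBJ; i++)
        obj[i] = malloc(64);

    for (int step = 0; step <= 5; step++) {
        /* ... the unaltered program would run its step here ... */

        /* The oracle inserts free() immediately after an object's
         * recorded last access, emulating explicit management. */
        for (int i = 0; i < NOBJ; i++)
            if (obj[i] && last_access_step[i] == step) {
                free(obj[i]);
                obj[i] = NULL;
                printf("oracle freed object %d after step %d\n", i, step);
            }
    }
    return 0;
}
```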
  • Introduction to uClinux

Introduction to uClinux. Michael Opdenacker, Free Electrons, http://free-electrons.com. Created with OpenOffice.org 2.x. Thanks to Nicolas Rougier (Copyright 2003, http://webloria.loria.fr/~rougier/) for the Tux image. © Copyright 2004-2007, Free Electrons, Creative Commons Attribution-ShareAlike 2.5 license. Nov 20, 2007.

Rights to copy (Attribution – ShareAlike 2.5; © Copyright 2004-2007 Free Electrons, [email protected]). You are free: to copy, distribute, display, and perform the work; to make derivative works; to make commercial use of the work. Under the following conditions: Attribution — you must give the original author credit; Share Alike — if you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one. For any reuse or distribution, you must make clear to others the license terms of this work. Any of these conditions can be waived if you get permission from the copyright holder. Your fair use and other rights are in no way affected by the above. License text: http://creativecommons.org/licenses/by-sa/2.5/legalcode. Document sources, updates and translations: http://free-electrons.com/articles/uclinux. Corrections, suggestions, contributions and translations are welcome!

Best viewed with... This document is best viewed with a recent PDF reader or with OpenOffice.org itself! Take advantage of internal and external hyperlinks, so don't hesitate to click on them! Find pages quickly thanks to automatic search.
  • I.T.S.O. PowerPC An Inside View

SG24-4299-00. PowerPC: An Inside View. IBM.

Take Note! Before using this information and the product it supports, be sure to read the general information under "Special Notices" on page xiii. First Edition (September 1995). This edition applies to the IBM PC PowerPC hardware and software products currently announced at the date of publication. Order publications through your IBM representative or the IBM branch office serving your locality; publications are not stocked at the address given below. An ITSO Technical Bulletin Evaluation Form for reader's feedback appears facing Chapter 1. If the form has been removed, comments may be addressed to: IBM Corporation, International Technical Support Organization, Dept. JLPC, Building 014, Internal Zip 5220, 1000 NW 51st Street, Boca Raton, Florida 33431-1328. When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you. Copyright International Business Machines Corporation 1995. All rights reserved. Note to U.S. Government Users: documentation related to restricted rights; use, duplication or disclosure is subject to restrictions set forth in the GSA ADP Schedule Contract with IBM Corp.

Abstract: This document provides technical details on the PowerPC technology. It focuses on the features and advantages of the PowerPC Architecture and includes an historical overview of the development of reduced instruction set computer (RISC) technology. It also describes in detail the IBM Power Series product family based on PowerPC technology, including the IBM Personal Computer Power Series 830 and 850 and the IBM ThinkPad Power Series 820 and 850.
  • Understanding the Linux Kernel, 3rd Edition by Daniel P. Bovet and Marco Cesati

Understanding the Linux Kernel, 3rd Edition. By Daniel P. Bovet, Marco Cesati. Publisher: O'Reilly. Pub date: November 2005. ISBN: 0-596-00565-2. Pages: 942.

In order to thoroughly understand what makes Linux tick and why it works so well on a wide variety of systems, you need to delve deep into the heart of the kernel. The kernel handles all interactions between the CPU and the external world, and determines which programs will share processor time, and in what order. It manages limited memory so well that hundreds of processes can share the system efficiently, and expertly organizes data transfers so that the CPU isn't kept waiting any longer than necessary for the relatively slow disks.

The third edition of Understanding the Linux Kernel takes you on a guided tour of the most significant data structures, algorithms, and programming tricks used in the kernel. Probing beyond superficial features, the authors offer valuable insights to people who want to know how things really work inside their machine. Important Intel-specific features are discussed. Relevant segments of code are dissected line by line. But the book covers more than just the functioning of the code; it explains the theoretical underpinnings of why Linux does things the way it does. This edition of the book covers version 2.6, which has seen significant changes to nearly every kernel subsystem, particularly in the areas of memory management and block devices. The book focuses on the following topics:

• Memory management, including file buffering, process swapping, and Direct Memory Access (DMA)
• The Virtual Filesystem layer and the Second and Third Extended Filesystems
• Process creation and scheduling
• Signals, interrupts, and the essential interfaces to device drivers
• Timing
• Synchronization within the kernel
• Interprocess Communication (IPC)
• Program execution

Understanding the Linux Kernel will acquaint you with all the inner workings of Linux, but it's more than just an academic exercise.
  • 18-447: Computer Architecture Lecture 18: Virtual Memory III

18-447: Computer Architecture, Lecture 18: Virtual Memory III. Yoongu Kim, Carnegie Mellon University, Spring 2013, 3/1.

Upcoming schedule. Today: Lab 3 due; lecture/recitation. Monday (3/4): lecture, Q&A session. Wednesday (3/6): Midterm 1, 12:30-2:20, closed book, one letter-sized cheat sheet (can be double-sided, typed or written).

Readings. Required: P&H, Chapter 5.4; Hamacher et al., Chapter 8.8. Recommended: Denning, P. J., Virtual Memory, ACM Computing Surveys, 1970; Jacob, B., and Mudge, T., Virtual Memory in Contemporary Microprocessors, IEEE Micro, 1998. References: Intel manuals for 8086/80286/80386/IA32/Intel64; MIPS manual.

Review of last lecture. Two approaches to virtual memory: 1. segmentation (not as popular today) and 2. paging (what is usually meant today by "virtual memory"). Virtual memory requires HW+SW support; the HW component is called the MMU (memory management unit). How to translate between virtual and physical addresses?

1. Segmentation: divide the address space into segments, with Physical Address = BASE + Virtual Address. Case studies: Intel 8086, 80286, x86, x86-64. Advantages: modularity/isolation/protection, and translation is simple. Disadvantages: complicated management, fragmentation, and only a few segments are addressable at the same time.

2. Paging: the virtual address space is large, contiguous, and imaginary; a page is a fixed-size chunk of the address space; the mapping takes virtual pages to physical pages; the page table is the data structure that stores the mappings. Problem #1: the page table is too large; solution: hierarchical page tables (see the sketch below). Problem #2: large translation latency; solution: the Translation Lookaside Buffer (TLB). Case study: Intel 80386. Today, we'll talk more about paging.
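To make the hierarchical solution concrete, here is a small C sketch of a software walk of a hypothetical two-level page table with 4 KiB pages and 10-bit indices per level. The layout and constants are illustrative for the example, not the 80386's actual entry format.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 /* 4 KiB pages             */
#define LVL_BITS   10                 /* 10 index bits per level */
#define LVL_MASK   ((1u << LVL_BITS) - 1)

typedef struct {
    uint64_t pte[1 << LVL_BITS];      /* level 2: frame base | valid bit */
} l2_table;

typedef struct {
    l2_table *next[1 << LVL_BITS];    /* level 1: pointers to L2 tables  */
} l1_table;

/* Walk a two-level page table: split the virtual address into
 * (L1 index, L2 index, page offset) and follow the pointers.
 * Returns 0 on a page fault (missing mapping). */
uint64_t walk(const l1_table *root, uint32_t vaddr)
{
    uint32_t off = vaddr & ((1u << PAGE_SHIFT) - 1);
    uint32_t i2  = (vaddr >> PAGE_SHIFT) & LVL_MASK;
    uint32_t i1  = (vaddr >> (PAGE_SHIFT + LVL_BITS)) & LVL_MASK;

    const l2_table *l2 = root->next[i1];
    if (!l2) return 0;                /* page fault: no L2 table */

    uint64_t pte = l2->pte[i2];
    if (!(pte & 1)) return 0;         /* page fault: not present */

    return (pte & ~1ull) | off;       /* frame base | page offset */
}

int main(void)
{
    static l2_table l2;
    static l1_table l1;
    l2.pte[5] = 0x40000 | 1;          /* map one page, valid bit set */
    l1.next[0] = &l2;

    uint32_t vaddr = (5u << PAGE_SHIFT) | 0x2A;
    printf("vaddr 0x%x -> paddr 0x%llx\n", vaddr,
           (unsigned long long)walk(&l1, vaddr));
    return 0;
}
```

A hardware TLB simply caches the translations this walk produces, so the extra memory references per access are paid only on a miss.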
  • Virtual Memory and Linux

Virtual Memory and Linux. Matt Porter, Embedded Linux Conference Europe, October 13, 2016. About the original author, Alan Ott: unfortunately, he is unable to be here at ELCE 2016. A veteran embedded systems and Linux developer; Linux architect at SoftIron (64-bit ARM servers and data center appliances; a hardware company, strong on software; Overdrive 3000, with more products in process).

Physical Memory: Single Address Space. Simple systems have a single address space, which memory and peripherals share: memory is mapped to one part, peripherals to another. All processes and the OS share the same memory space, so there is no memory protection: processes can stomp one another, and user space can stomp kernel memory! CPUs with a single address space include the 8086-80286, ARM Cortex-M, 8- and 16-bit PIC, AVR, SH-1, SH-2, and most 8- and 16-bit systems.

x86 Physical Memory Map. Lots of legacy: RAM is split (DOS area and extended), hardware is mapped between the RAM areas, and high and extended memory are accessed differently.

Limitations. Portable C programs expect flat memory, and multiple memory access methods limit portability. Management is tricky: you need to know or detect total RAM and keep processes separated. No protection: rogue programs can corrupt the entire system.

Virtual Memory: What is Virtual Memory? Virtual memory is a system that uses an address mapping: it maps the virtual address space to the physical address space, mapping virtual addresses both to physical RAM and to hardware devices (PCI devices, GPU RAM, on-SoC IP blocks). Advantages: each process can have a different memory mapping, so one process's RAM is inaccessible (and invisible) to other processes.
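On Linux, the virtual-to-physical mapping described above can actually be inspected from user space through /proc/self/pagemap, which holds one 64-bit entry per virtual page. A minimal sketch follows; the interface is real, but note that modern kernels report the frame number as zero to unprivileged readers, so run it as root to see a nonzero physical address.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Look up the physical frame backing a virtual address via
 * /proc/self/pagemap: bit 63 = page present, bits 0-54 = frame number. */
int main(void)
{
    long psize = sysconf(_SC_PAGESIZE);
    char *buf = malloc(psize);
    buf[0] = 1;                               /* touch it so it's mapped */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t entry;
    off_t idx = (uintptr_t)buf / psize;       /* pagemap index for buf */
    pread(fd, &entry, sizeof entry, idx * sizeof entry);

    if (entry >> 63)                          /* page present in RAM?  */
        printf("virtual %p -> physical 0x%llx\n", (void *)buf,
               (unsigned long long)((entry & ((1ull << 55) - 1)) * psize
                                    + (uintptr_t)buf % psize));
    else
        printf("virtual %p not present\n", (void *)buf);

    close(fd);
    free(buf);
    return 0;
}
```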
  • ACDC: Towards a Universal Mutator for Benchmarking Heap Management Systems

ACDC: Towards a Universal Mutator for Benchmarking Heap Management Systems. Martin Aigner, Christoph M. Kirsch, University of Salzburg, [email protected].

Abstract: We present ACDC, an open-source benchmark that may be configured to emulate explicit single- and multi-threaded memory allocation, sharing, access, and deallocation behavior to expose virtually any relevant allocator performance differences. ACDC mimics periodic memory allocation and deallocation (AC) as well as persistent memory (DC). Memory may be allocated thread-locally and shared among multiple threads to study multicore scalability and even false sharing. Memory may be deallocated by threads other than the allocating threads to study blowup memory fragmentation. Memory may be accessed and deallocated sequentially in allocation order or in tree-like traversals to expose allocator deficiencies in exploiting spatial locality. We demonstrate ACDC's capabilities with seven state-of-the-art allocators for C/C++ in an empirical study which also reveals interesting performance differences between the allocators.

[Figure 1: the lifecycle of an object, shown along a time axis spanning allocation, accesses, last access, and deallocation, with the liveness and deallocation-delay intervals marked.]

The lifecycle of an object begins with its allocation, continues with accesses to the allocated memory, and ends with the deallocation of the allocated memory. The time from allocation to deallocation is called the lifetime of an object. The time from allocation to last access is called the liveness of an object, which ACDC, unlike other benchmarking tools, also emulates explicitly by controlling object access. The difference between the lifetime and liveness of an object, here called the deallocation delay, emulates mutator inefficiencies in identifying dead objects for deallocation, which may in turn expose allocator inefficiencies in handling dead memory.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Memory management. General Terms: Performance, Measurement. Keywords: benchmark; explicit heap management; multicore.
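The three intervals from Figure 1 can be read directly off a fragment of mutator code. This is a toy illustration of the terminology only, not ACDC's actual workload loop.

```c
#include <stdlib.h>
#include <string.h>

static void do_unrelated_work(void) { /* mutator works on other data */ }

/* The lifecycle ACDC emulates, in miniature:
 *   allocation ... accesses ... last access ........ deallocation
 *   |-------------- liveness --------------|
 *   |----------------------- lifetime ----------------------|
 * The gap between last access and free() is the deallocation delay. */
void lifecycle(void)
{
    char *obj = malloc(64);      /* lifetime and liveness begin   */

    memset(obj, 0, 64);          /* accesses...                   */
    obj[0] = 42;                 /* ...last access: liveness ends */

    do_unrelated_work();         /* deallocation delay: obj is    */
                                 /* dead but still allocated      */
    free(obj);                   /* lifetime ends                 */
}

int main(void)
{
    lifecycle();
    return 0;
}
```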
  • Memory Management In

Memory Management. Prof. James L. Frankel, Harvard University. Version of 7:34 PM, 2-Oct-2018. Copyright © 2018, 2017, 2015 James L. Frankel. All rights reserved.

Memory management. The ideal memory is large, fast, and non-volatile (keeps state without power). The memory hierarchy: an extremely limited number of registers in the CPU; a small amount of fast, expensive memory (caches); lots of medium-speed, medium-price main memory; and terabytes of slow, cheap disk storage. The memory manager handles the memory hierarchy.

Basic memory management: three simple ways of organizing memory for monoprogramming without swapping or paging (that is, an operating system with one user process).

Multiprogramming with fixed partitions: fixed memory partitions, with either separate input queues for each partition or a single input queue.

Probabilistic model of multiprogramming. Each process is in CPU wait for a fraction f of the time, and there are n processes with one processor. If the processes were independent of each other, the probability that all processes are in CPU wait at once would be f^n, so the probability that the CPU is busy is 1 - f^n (see the sketch below). However, the processes are not independent: they are all competing for one processor, and more than one process may be using any one I/O device, so a better model would be constructed using queuing theory. Modeling multiprogramming: CPU utilization as a function of the number of processes in memory (the degree of multiprogramming).

Analysis of multiprogramming system performance: arrival and work requirements of 4 jobs; CPU utilization for 1-4 jobs with 80% I/O wait; the sequence of events as jobs arrive and finish (the numbers show the amount of CPU time jobs get in each interval).

Relocation and protection. At the time a program is written, it is uncertain where the program will be loaded in memory; therefore, the address locations of variables and code cannot be absolute, and relocation must be enforced. We must also ensure that a program does not access other processes' memory, enforcing protection. Static vs.
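A quick way to see the model's point is to tabulate the CPU utilization 1 - f^n referenced above. A toy C program using the slides' 80% I/O-wait figure:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double f = 0.8;   /* fraction of time a process waits on I/O */

    /* CPU utilization = 1 - f^n: the CPU is idle only when all n
     * (assumed independent) processes are in I/O wait at once. */
    for (int n = 1; n <= 10; n++)
        printf("n = %2d  ->  utilization = %4.1f%%\n",
               n, 100.0 * (1.0 - pow(f, n)));
    return 0;
}
```

With f = 0.8, one process keeps the CPU only 20% busy, four processes about 59%, and ten about 89%, which is the usual argument for a higher degree of multiprogramming.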