Rethinking the I/O Memory Management Unit (IOMMU)

Moshe Malka

Rethinking the I/O Memory Management Unit (IOMMU)

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Moshe Malka

Submitted to the Senate of the Technion — Israel Institute of Technology Adar 5775 Haifa March 2015

This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty of Computer Science.

Some results in this thesis have been published as articles by the author and research collaborators in conferences and journals during the course of the author’s research period, the most up-to-date versions of which being:

1. Moshe Malka, Nadav Amit, Muli Ben-Yehuda and Dan Tsafrir. rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring Buffers. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2015).

2. Moshe Malka, Nadav Amit, and Dan Tsafrir. Efficient IOMMU Intra-Operating System Protection. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST 2015).

Acknowledgements

I would like to thank my advisor Dan Tsafrir for his devoted guidance and help, my research team Nadav Amit and Muli Ben-Yehuda, my parents and my friends.

The generous financial help of the Technion is gratefully acknowledged.

Contents

List of Figures

Abstract 1

Abbreviations and Notations 3

1 Introduction 5

2 Background ...... 7
2.1 Virtual Memory ...... 8
2.1.1 Physical and Virtual Addressing ...... 8
2.1.2 Address Spaces ...... 10
2.1.3 Page Table ...... 10
2.1.4 Virtual Memory as a Tool for Memory Protection ...... 11
2.1.5 Address Translation ...... 13
2.2 Direct Memory Access ...... 22
2.2.1 Transferring Data from the Memory to the Device ...... 23
2.2.2 Transferring Data from the Device to the Memory ...... 23
2.3 Adding Virtual Memory to I/O Transactions ...... 24

3 rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring Buffers ...... 27
3.1 Introduction ...... 27
3.2 Background ...... 29
3.2.1 Operating System DMA Protection ...... 29
3.2.2 IOMMU Design and Implementation ...... 30
3.2.3 I/O Devices Employing Ring Buffers ...... 31
3.3 Cost of Safety ...... 32
3.3.1 Overhead Components ...... 33
3.3.2 Protection Modes and Measured Overhead ...... 33
3.3.3 Performance Model ...... 37
3.4 Design ...... 38
3.5 Evaluation ...... 44
3.5.1 Methodology ...... 44
3.5.2 Results ...... 47
3.5.3 When IOTLB Miss Penalty Matters ...... 50
3.5.4 Comparing to TLB Prefetchers ...... 50
3.6 Related Work ...... 51

4 Efficient IOMMU Intra-Operating System Protection ...... 53
4.1 Introduction ...... 53
4.2 Intra-OS Protection ...... 55
4.3 IOVA Allocation and Mapping ...... 57
4.4 Long-Lasting Ring Interference ...... 59
4.5 The EiovaR Optimization ...... 61
4.5.1 EiovaR with Strict Protection ...... 63
4.5.2 EiovaR with Deferred Protection ...... 65
4.6 Evaluation ...... 66
4.6.1 Methodology ...... 66
4.6.2 Results ...... 69
4.7 Related Work ...... 75

5 Reducing the IOTLB Miss Overhead ...... 77
5.1 Introduction ...... 77
5.2 General description of all the prefetchers we explore ...... 78
5.3 Markov Prefetcher (MP) ...... 79
5.3.1 Markov Chain Theorem ...... 80
5.3.2 Prefetching Using the Markov Chain ...... 81
5.3.3 Extension to IOMMU ...... 81
5.4 Recency Based Prefetching (RP) ...... 81
5.4.1 TLB hit ...... 82
5.4.2 TLB miss ...... 82
5.4.3 Extension to IOMMU ...... 84
5.5 Distance Prefetching (DP) ...... 84
5.6 Evaluation ...... 86
5.6.1 Methodology ...... 86
5.6.2 Results ...... 86
5.7 Measuring the cost of an Intel IOTLB miss ...... 90

6 Conclusions ...... 93
6.1 rIOMMU ...... 93
6.2 EiovaR ...... 93
6.3 Reducing the IOTLB Miss Overhead ...... 93

Hebrew Abstract i

List of Figures

2.1 A system that uses physical addressing. ...... 8
2.2 A system that uses virtual addressing. ...... 9
2.3 Flat page table. ...... 11
2.4 Allocating a new virtual page. ...... 12
2.5 Using virtual memory to provide page-level memory protection. ...... 13
2.6 Address translation with a page table. ...... 14
2.7 Page hit. ...... 14
2.8 Components of a virtual address that are used to access the TLB. ...... 15
2.9 TLB hit. ...... 16
2.10 TLB miss. ...... 17
2.11 A two-level page table hierarchy. Notice that addresses increase from top to bottom. ...... 18
2.12 Address translation with a k-level page table. ...... 19
2.13 Addressing for small memory system. Assume 14-bit virtual addresses (n = 14), 12-bit physical addresses (m = 12), and 64-byte pages (P = 64). ...... 20
2.14 TLB, page table, and cache for small memory system. All values in the TLB, page table, and cache are in hexadecimal notation. ...... 21
2.15 DMA transaction flow with IOMMU sequence diagram. ...... 24

3.1 IOMMU is for devices what the MMU is for processes. ...... 28
3.2 Intel IOMMU data structures for IOVA translation. ...... 30
3.3 A driver drives its device through a ring. With an IOMMU, pointers are IOVAs (both registers and target buffers). ...... 32
3.4 The I/O device driver maps an IOVA v to a physical target buffer p. It then assigns v to the DMA descriptor. ...... 34
3.5 The I/O device writes the packet it receives to the target buffer through v, which the IOMMU translates to p. ...... 34
3.6 After the DMA completes, the I/O device driver unmaps v and passes p to a higher-level software layer. ...... 34
3.7 CPU cycles used for processing one packet. The top bar labels are relative to C_none = 1,816 (bottommost grid line). ...... 36
3.8 Throughput of Netperf TCP stream as a function of the average number of cycles spent on processing one packet. ...... 36
3.9 The rIOMMU data structures; (e) is used only by hardware. The last two fields of rRING are used only by software. ...... 38
3.10 rIOMMU data structures for IOVA translation. ...... 39
3.11 Outline of the rIOMMU logic. All DMAs are carried out with IOVAs that are translated by the rtranslate routine. ...... 40
3.12 Outline of the rIOMMU OS driver, implementing map and unmap, which respectively correspond to Figures 3.4 and 3.6. ...... 41
3.13 Absolute performance numbers of the IOMMU modes when using the Mellanox (top) and Broadcom (bottom) NICs. ...... 47

4.1 IOVA translation using the Intel IOMMU. ...... 56
4.2 Pseudo code of the baseline IOVA allocation scheme. The functions rb_next and rb_prev return the successor and predecessor of the node they receive, respectively. ...... 59
4.3 The length of each alloc_iova search loop in a 40K (sub)sequence of alloc_iova calls performed by one Netperf run. One Rx-Tx interference leads to regular linearity. ...... 61
4.4 Netperf TCP stream iteratively executed under strict protection. The x axis shows the iteration number. ...... 63
4.5 Average cycles breakdown of map with Netperf/strict. ...... 64
4.6 Average cycles breakdown of unmap with Netperf/strict. ...... 64
4.7 Netperf TCP stream iteratively executed under deferred protection. The x axis shows the iteration number. ...... 65
4.8 Under deferred protection, EiovaR_k eliminates costly linear searches when k exceeds the high-water mark W. ...... 67
4.9 Length of the alloc_iova search loop under the EiovaR_k deferred protection regime for three k values when running Netperf TCP Stream. Bigger capacity implies that the searches become shorter on average. Big enough capacity (k ≥ W = 250) eliminates the searches altogether. ...... 67
4.10 The performance of baseline vs. EiovaR allocation, under strict and deferred protection regimes for the Mellanox (top) and Broadcom (bottom) setups. Except in the case of Netperf RR, higher values indicate better performance. ...... 68
4.11 Netperf Stream throughput (top) and used CPU (bottom) for different message sizes in the Broadcom setup. ...... 74
4.12 Impact of increased concurrency on Memcached in the Mellanox setup. EiovaR allows the performance to scale. ...... 75

5.1 General scheme. ...... 78
5.2 Markov state transition diagram, which is represented as a directed graph (right) or a matrix (left). ...... 80
5.3 Schematic implementation of the Markov Prefetcher. ...... 82
5.4 Schematic depiction of the recency prefetcher on a TLB hit. ...... 83
5.5 Schematic depiction of the recency prefetcher on a TLB miss. ...... 84
5.6 Schematic depiction of the distance prefetcher on a TLB miss. ...... 85
5.7 Hit rate simulation of Apache benchmarks with message sizes of 1k (top) and 1M (bottom). ...... 87
5.8 Hit rate simulation of Netperf stream with message sizes of 1k (top) and 4k (bottom). ...... 88
5.9 Hit rate simulation of Netperf RR (top) and Memcached (bottom). ...... 89
5.10 Subtraction between the RTT when the IOMMU is enabled and the RTT when the IOMMU is disabled. ...... 91

Abstract

Processes are encapsulated within virtual memory spaces and access the memory via virtual addresses (VAs) to ensure, among other things, that they access only those memory regions they have been explicitly granted access to. These memory spaces are created and maintained by the OS, and the translation from virtual to physical addresses (PAs) is done by the MMU. Analogously, I/O devices can be encapsulated within I/O virtual memory spaces and access the memory using I/O virtual addresses (IOVAs), which are translated by the input/output memory management unit (IOMMU) to physical addresses. This encapsulation increases system availability and reliability, since it prevents devices from overwriting any part of the memory, including memory that might be used by other entities. It also prevents rogue devices from performing errant or malicious accesses to the memory and ensures that buggy devices will not cause the loss of important data. Chip makers understood the importance of this protection and added IOMMUs to the chipsets of all servers and some PCs. However, the protection comes at the cost of performance degradation, which depends on the IOMMU design, the way it is programmed, and the workload. We found that Intel's IOMMU degrades the throughput of I/O-intensive workloads by up to an order of magnitude.

We investigate all the possible causes of the overhead of the IOMMU and its driver and suggest a solution for each. First, we identify that the complexity of the kernel subsystem in charge of IOVA allocation is linear in the number of allocated IOVAs and is thus a major source of overhead. We optimize the allocation in a manner that ensures that the complexity is typically constant and never worse than logarithmic, and we improve the performance of the Netperf, Apache, and Memcached benchmarks by up to 4.6x. Observing that the IOTLB miss rate can be as high as 50%, we then suggest hiding the IOTLB misses with a prefetcher. We extend some of the state-of-the-art prefetchers to the IOTLB and compare them. In our experiments we achieve a hit rate of up to 99% on some configurations and workloads. Finally, we observe that many devices such as network and disk controllers typically interact with the OS via circular ring buffers that induce a sequential, completely predictable workload. We design a ring IOMMU (rIOMMU) that leverages this characteristic by replacing the virtual memory page table hierarchy with a circular, flat table. Using standard networking benchmarks, we show that rIOMMU provides up to 7.56x higher throughput relative to the baseline IOMMU, and that it is within 0.77-1.00x the throughput of a system without an IOMMU.



Abbreviations and Notations

ALH : Address Locality Hint
API : Application Programming Interface
CI : Cache Index
CO : Cache Offset
CPU : Central Processing Unit
CT : Cache Tag
DMA : Direct Memory Access
DP : Distance Prefetcher
DRAM : Dynamic Random Access Memory
DS : Data Structure
FIFO : First-In First-Out
GB : Gigabyte
Gbps : Gigabits per second
HW : Hardware
I/O : Input/Output
IOMMU : I/O Memory Management Unit
IOPF : I/O Page Fault
IOTLB : I/O Translation Lookaside Buffer
IOVA : I/O Virtual Address
IP : Internet Protocol
KB : Kilobyte
K : Kilo
KVM : Kernel-based Virtual Machine
LRU : Least Recently Used
MB : Megabyte
MMU : Memory Management Unit
MP : Markov Prefetcher
MPRE : Mapping Prefetch


NIC : Network Interface Controller
OS : Operating System
PA : Physical Address
PCI : Peripheral Component Interconnect
PFN : Physical Frame Number
PPN : Physical Page Number
PPO : Physical Page Offset
PTBR : Page Table Base Register
PTE : Page Table Entry
RAM : Random Access Memory
RDMA : Remote Direct Memory Access
rIOMMU : ring IOMMU
RP : Recency Prefetcher
RR : Request Response
RTT : Round Trip Time
Rx : Receiver
SUP : Supervisor
SW : Software
TCP : Transmission Control Protocol
TLBI : Translation Lookaside Buffer Index
TLB : Translation Lookaside Buffer
TLBT : Translation Lookaside Buffer Tag
Tx : Transmitter
UDP : User Datagram Protocol
VA : Virtual Address
VM : Virtual Machine
VPN : Virtual Page Number
VPO : Virtual Page Offset
VP : Virtual Page
VT-d : Virtualization Technology for Direct I/O
VT : Virtualization Technology


Chapter 1

Introduction

I/O transactions, such as reading data from a hard disk or a network card, constitute a major part of the workloads in today's computer systems, especially in servers. They are, therefore, an important factor in system performance. To reduce CPU utilization, most of today's peripheral devices can bypass the CPU and directly read from and write to the main memory using a direct memory access (DMA) hardware unit [23, 17]. This unit is described in section 2.2.

Although accessing the memory directly gives performance advantages, it can also lead to stability and security problems. A peripheral device with direct memory access can be programmed to overwrite any part of the system's memory or to cause bugs in the system; it can also be made vulnerable to infection with malicious software [15]. These disadvantages of direct access have received a lot of attention recently. However, similar problems also existed for CPU processes back before the virtual memory mechanism was added to CPUs. Virtual memory encapsulates the processes with an abstraction layer that prevents direct access to the memory. Many of the problems solved by virtual memory closely resemble problems that emerged due to direct memory access. Hardware designers have noticed this similarity and duplicated the virtual memory mechanism for I/O devices, calling it I/O virtual memory. The hardware unit in charge of I/O virtual memory is called the I/O memory management unit (IOMMU). Duplicating the mechanism was natural because virtual memory is a major part of almost every computer system. In chapter 2 we expand on the required background regarding virtual memory, the direct memory access mechanism, how the I/O virtual memory mechanism fits into the picture, and the differences between processes and I/O devices in the context of virtual memory.

Although I/O virtual memory solves the problems mentioned above, hardware designers did not take into account the difference between the workloads of I/O devices and processes. As a result, systems that perform I/O transactions using I/O virtual memory experience a significant reduction in performance. The goal of this work is to study all the causes of this performance reduction and suggest new designs


that reduce the performance overhead. We carefully investigated the workload experienced by the I/O virtual memory and found that hardware overheads are not exclusively to blame for its high cost. Rather, the cost is amplified by software, due to the overhead of the IOMMU driver.

In our primary paper we observe that many devices such as network and disk controllers typically interact with the OS via circular ring buffers that induce a sequential, completely predictable workload. We design a ring IOMMU (rIOMMU) that leverages this characteristic by replacing the virtual memory page table hierarchy with a circular, flat table. A flat table is adequately supported by exactly one IOTLB entry, making every new translation an implicit invalidation of the former and thus requiring explicit invalidations only at the end of I/O bursts. Using standard networking benchmarks, we show that rIOMMU provides up to 7.56x higher throughput relative to the baseline IOMMU, and that it is within 0.77–1.00x the throughput of a system without IOMMU protection. We describe the design and evaluation of our newly proposed rIOMMU in chapter 3. A paper reflecting the content of this chapter has been accepted for publication at the 20th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.

In a second paper we identify the kernel subsystem in charge of IOVA allocation as a major source of performance degradation for I/O-intensive workloads, because it regularly suffers from a costly linear overhead. We suggest an efficient I/O virtual address allocation mechanism (EiovaR). Utilizing the characteristics of the load experienced by IOMMUs, we optimize the allocation in a manner that ensures that the complexity is typically constant and never worse than logarithmic. Our allocation scheme is immediately applicable and improves the performance of the Netperf, Apache, and Memcached benchmarks by up to 4.6x. We describe EiovaR in chapter 4. A paper reflecting the content of this chapter has been accepted for publication at the 13th USENIX Conference on File and Storage Technologies (FAST), 2015.

The IOMMU contains a small cache of the virtual memory translations, called the I/O translation lookaside buffer (IOTLB). IOTLB misses cause the IOMMU to translate the virtual address itself, an action that includes multiple memory accesses. In order to hide the performance reduction caused by IOTLB misses, we explored the prefetch mechanism. This aspect of our research is described in chapter 5, where we review three state-of-the-art prefetchers and extend them to work with the IOMMU. We then simulate the miss rate of an IOMMU that contains these prefetchers.


Chapter 2

Background

The concept of virtual memory was first developed by German physicist Fritz-Rudolf Güntsch at the Technische Universität Berlin in 1956 in his doctoral thesis, Logical Design of a Digital Computer with Multiple Asynchronous Rotating Drums and Automatic High Speed Memory Operation. In 1961, the Burroughs Corporation independently released the first commercial computer with virtual memory, the B5000, with segmentation rather than paging. The first minicomputer to introduce virtual memory was the Norwegian NORD-1; during the 1970s, other minicomputers implemented virtual memory, notably VAX models running VMS.

Before 1982, all Intel CPUs were designed to work in Real Mode. Real Mode, also called real address mode, is characterized by unlimited direct software access to all memory, I/O addresses, and peripheral hardware, but provides no support for memory protection. This situation continued until Protected Mode was added to the architecture in 1982. Introduced with the release of Intel's 80286 (286) processor, Protected Mode was later extended, with the release of the 80386 in 1985, to include features such as virtual memory and safe multi-tasking.

Virtual memory is an abstraction layer over the main memory that separates the processes from the physical memory. It is a combination of operating system software, disk files, hardware virtual address translation, hardware exceptions, and main memory that provides processes with a large private address space without any intervention from the application programmer. Virtual memory provides three main capabilities:

1. It provides programmers with a large uniform address space and lets them focus on designing the program rather than dealing with managing the memory used by the processes.

2. It uses main memory efficiently by treating it as a cache for data stored on disk, keeping only the active areas in main memory, and swapping data between disk and memory as needed.

3. It protects the address space of each process from corruption by other processes.


2.1 Virtual Memory

The goal of this subsection is to explain how virtual memory works, with a focus on:

• The functionality relevant to the I/O: address space, address translation and so forth;

• The ability to provide encapsulation of the memory used by each device;

• Protecting the system from devices that try to gain unauthorized access to the memory.

The reader is referred to the references on which this subsection is based for more in-depth information about virtual memory [18].

2.1.1 Physical and Virtual Addressing


Figure 2.1: A system that uses physical addressing.

A computer system’s main memory is organized as an array of M contiguous byte- sized cells. Each byte has a unique physical address (PA), as follows:

• The first byte has an address of 0;


• The next byte has an address of 1;

• The next byte has an address of 2, and so on.

As the organization is simple, physical addressing is the most natural way for a CPU to access memory. Figure 2.1 shows an example of physical addressing for a load instruction that reads the word starting at physical address 4. An effective physical address is created when the CPU executes the load instruction. It then passes it to main memory over the memory bus. The main memory fetches the 4-byte word that starts at physical address 4 and returns it to the CPU, which stores it in a register.

Physical addressing was used in early personal computers, and systems such as digital signal processors and embedded microcontrollers still employ it, as do Cray supercomputers. Modern processors, however, no longer use physical addressing. Instead, they use virtual addressing, as shown in Figure 2.2. In virtual addressing, the CPU accesses the main memory by generating a virtual address (or VA, for short), which is converted to the appropriate physical address before it is sent to the memory. Converting a virtual address into a physical address is called address translation. Address translation requires the CPU hardware and the operating system to work together closely.


Figure 2.2: A system that uses virtual addressing.

Virtual addresses are translated on the fly by the memory management unit (MMU),


which is dedicated hardware on the CPU chip that uses a look-up table stored in main memory. The look-up table contains the translations for the mapped virtual addresses and is managed by the operating system.

2.1.2 Address Spaces

An address space is an ordered set of non-negative integer addresses 0, 1, 2, .... In a system that uses virtual memory, the CPU generates virtual addresses from an address space of N = 2^n addresses called the virtual address space: 0, 1, 2, ..., N − 1. The size of an address space is determined by the number of bits required to represent the largest address. For example, a virtual address space with N = 2^n addresses is called an n-bit address space. Modern computer systems typically support either 32-bit or 64-bit virtual address spaces. A system also has a physical address space that corresponds to the M bytes of physical memory in the system: 0, 1, 2, ..., M − 1. M is not required to be a power of two, but for our purposes we will assume that M = 2^m. The address space is an important concept because it makes a clear distinction between:

• Data objects (bytes); and

• Their attributes (addresses).

This distinction allows us to generalize, and also allows each data object to have multiple independent addresses, each chosen from a different address space. The basic idea of virtual memory is that each byte of main memory has a virtual address chosen from the virtual address space, and a physical address chosen from the physical address space.

2.1.3 Page Table

Virtual memory is a tool for caching process data from disk to main memory, memory management, and memory protection. These capabilities are provided by a combination of:

• The operating system software;

• Address translation hardware in the memory management unit; and

• A data structure known as a page table. It is stored in physical memory and maps virtual pages to physical pages. The address translation hardware reads the page table each time it converts a virtual address to a physical address.

The operating system maintains the contents of the page table and transfers pages back and forth between disk and DRAM. The basic organization of a page table is shown in Figure 2.3


Figure 2.3: Flat page table.

A page table is an array of page table entries (PTEs). Each page in the virtual address space has a page table entry at a fixed offset in the page table. For our purposes, we will assume that each page table entry consists of a valid bit and an n-bit address field. The valid bit tells us whether the virtual page is currently cached in DRAM. If the valid bit is set, the address field indicates the start of the corresponding physical page in DRAM where the virtual page is cached. If the valid bit is not set, a null address tells us that the virtual page has not yet been allocated. Otherwise, the address points to the start of the virtual page on disk. Figure 2.3 shows a page table for a system with eight virtual pages and four physical pages. The four virtual pages, VP 1, VP 2, VP 4, and VP 7, are currently cached in DRAM. Two pages, VP 0 and VP 5, are yet to be allocated, while pages VP 3 and VP 6 have been allocated - but are not currently cached. Because the DRAM cache is fully associative, any physical page can contain any virtual page.
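As a rough illustration of this organization, the page table can be modeled as an array of entries, each holding a valid bit and an address field. The sketch below is illustrative only; the field names and widths are assumptions, not those of any particular OS or architecture.

#include <stdbool.h>
#include <stdint.h>

/* One page table entry: a valid bit plus an address field.
 * If valid is set, addr is the physical page in DRAM that caches the
 * virtual page; if valid is clear, addr is either null (page not yet
 * allocated) or the location of the virtual page on disk. */
struct pte {
    bool     valid;
    uint64_t addr;
};

/* A flat page table is simply an array indexed by the virtual page number. */
#define NUM_VIRTUAL_PAGES 8
struct pte page_table[NUM_VIRTUAL_PAGES];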

Allocating Memory and Mapping It to Virtual Space

When the operating system allocates a new page of virtual memory, such as the result of calling malloc, this will affect the page table, as shown in the example in Figure 2.4. In the example, VP 5 is allocated by creating room on disk and updating page table entry 5 to point to the newly created page on disk.

2.1.4 Virtual Memory as a Tool for Memory Protection

Modern computer systems must provide the means by which the operating system controls access to the memory system. A user process should be prevented from modifying its read-only text section. Nor should it be allowed to read or modify any of the code and data structures in the kernel. It should also be prevented from reading or writing to the private memory of other processes. Furthermore, unless


Figure 2.4: Allocating a new virtual page.

all parties expressly allow it, such as by calling specific inter-process communication system calls, modification of any virtual pages shared with other processes should be forbidden. Having separate virtual address spaces makes it easy to isolate the private memories of different processes. However, the address translation mechanism can be extended by adding some additional permission bits to the page table entries to provide even greater access control. This is possible because the address translation hardware reads a page table entry each time the CPU generates an address. Figure 2.5 shows how this can be done. In this example we have added three permission bits to each page table entry:

• The SUP bit indicates whether processes must be running in kernel (supervisor) mode to access the page. Processes running in kernel mode can access any page, while processes running in user mode can access pages for which SUP is 0.

• The ”read” and ”write” bits control read and write access to the page. For example, if process i is running in user mode, then it has permission to read VP 0 and to read or write VP 1, but it cannot access VP 2.

If an instruction contravenes these permissions, the CPU triggers a fault that transfers control to an exception handler in the kernel. This exception is reported as a "segmentation fault" in Unix shells.
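To make the permission check concrete, the sketch below extends a page table entry with SUP, READ, and WRITE bits and shows the test that the translation hardware conceptually performs on every access; the structure and function names are hypothetical, chosen only for illustration.

#include <stdbool.h>
#include <stdint.h>

struct protected_pte {
    bool     sup;    /* page accessible only in kernel (supervisor) mode */
    bool     read;   /* read access allowed  */
    bool     write;  /* write access allowed */
    uint64_t ppn;    /* physical page number */
};

/* Conceptual check performed during translation; returning false
 * corresponds to the fault that the kernel reports as a
 * "segmentation fault". */
static bool access_allowed(const struct protected_pte *pte,
                           bool kernel_mode, bool is_write)
{
    if (pte->sup && !kernel_mode)
        return false;
    return is_write ? pte->write : pte->read;
}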


Figure 2.5: Using virtual memory to provide page-level memory protection.

2.1.5 Address Translation

In covering the basics of address translation in this section, our goal is to provide an understanding of the role hardware plays in supporting virtual memory, in sufficient detail to allow the reader to work through some examples on his or her own. Do bear in mind that we are omitting a number of details, especially those related to timing. Although they are important to hardware designers, such details are beyond the scope of this thesis. How the memory management unit uses the page table to perform the virtual address mapping is shown in Figure 2.6. The page table base register (PTBR), which is a control register in the CPU, points to the current page table. The n-bit virtual address has two components:

1. A p-bit virtual page offset (VPO); and

2. An (n - p)-bit virtual page number (VPN).

The memory management unit uses the virtual page number to select the appropriate page table entry. For example, VPN 0 selects PTE 0, VPN 1 selects PTE 1, and so on. The corresponding physical address is the concatenation of the physical page number (PPN) from the page table entry and the virtual page offset from the virtual address. As both the physical and virtual pages are P bytes, the physical page offset (PPO) is


Figure 2.6: Address translation with a page table.

identical to the virtual page offset. The steps the CPU hardware performs when there is a page hit are shown in Figure 2.7.


Figure 2.7: Page hit.

1. The processor generates a virtual address and sends it to the memory management unit.


2. The memory management unit generates the page table entry address, and requests it from the main memory.

3. The cache and/or main memory return the page table entry to the memory management unit. The memory management unit creates the physical address and sends it to main memory.

4. The main memory returns the requested data word to the processor.

Page hits are handled entirely by hardware. Handling a page fault requires cooperation between hardware and the operating system kernel. As it is not relevant to our work, we do not explain the process here.
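The translation itself reduces to a few shifts and masks. The following sketch mirrors the page-hit path described above for a single-level table; the page-size parameter and the table representation are assumptions for illustration, and fault handling is omitted.

#include <stdint.h>

#define P_BITS 12                            /* assume 4 KB pages (p = 12)  */
#define PAGE_OFFSET_MASK ((1u << P_BITS) - 1)

/* Translate a virtual address on a page hit: split it into VPN and VPO,
 * look up the PPN in the page table, and concatenate the PPN with the
 * unchanged offset (PPO is identical to VPO). */
uint64_t translate(uint64_t va, const uint64_t *ppn_of_vpn)
{
    uint64_t vpn = va >> P_BITS;             /* virtual page number */
    uint64_t vpo = va & PAGE_OFFSET_MASK;    /* virtual page offset */
    uint64_t ppn = ppn_of_vpn[vpn];          /* page table lookup   */
    return (ppn << P_BITS) | vpo;            /* physical address    */
}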

Speeding up Address Translation with a Translation Lookaside Buffer (TLB)

Each time the CPU generates a virtual address, the memory management unit must refer to a page table entry to translate the virtual address into a physical address. This requires an additional fetch from memory that takes tens or even hundreds of cycles. If the page table entry is cached in L1, then the overhead is reduced to only one or two cycles. However, even this low cost can be further reduced or even eliminated by including a small cache of page table entries in the memory management unit. This is called a translation lookaside buffer (or TLB for short). A TLB is a small, virtually addressed cache where each line holds a block consisting of a single page table entry. A translation lookaside buffer usually has a high degree of associativity. This is shown in Figure 2.8, in which the index and tag fields used for set selection and line matching are extracted from the virtual page number in the virtual address. If the translation lookaside buffer has T = 2^t sets, then the translation lookaside buffer index (TLBI) consists of the t least significant bits of the virtual page number, and the translation lookaside buffer tag (TLBT) consists of the remaining bits in the virtual page number.
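The TLB index and tag are likewise simple bit fields of the virtual page number, as the short sketch below illustrates (the number of index bits t is an assumed parameter).

#include <stdint.h>

#define T_BITS 2   /* assume T = 2^t sets with t = 2 */

/* TLBI: the t least significant bits of the VPN. */
static inline uint64_t tlb_index(uint64_t vpn)
{
    return vpn & ((1u << T_BITS) - 1);
}

/* TLBT: the remaining high-order bits of the VPN. */
static inline uint64_t tlb_tag(uint64_t vpn)
{
    return vpn >> T_BITS;
}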


Figure 2.8: Components of a virtual address that are used to access the TLB.

The steps involved when there is a translation lookaside buffer hit (the usual case) are shown in Figure 2.9. The important point is that all of the address translation steps are performed inside the on-chip memory management unit. Because of this, they are performed very quickly.


Figure 2.9: TLB hit.

1. The CPU generates a virtual address.

2. The memory management unit checks if the translation exists on the translation lookaside buffer.

3. The memory management unit fetches the appropriate page table entry from the translation lookaside buffer.

4. The memory management unit translates the virtual address into a physical address, and then sends it to the main memory.

5. The main memory returns the requested data word to the CPU.

The memory management unit must fetch the page table entry from the L1 cache if there is a translation lookaside buffer miss. This is shown in Figure 2.10. The newly fetched page table entry is then stored in the translation lookaside buffer and may possibly overwrite an existing entry.

Multi-Level Page Tables

Until now we have assumed that the system uses a single page table for address translation. If, however, we had a 32-bit address space, 4 KB pages, and a 4-byte page table entry, we would require a 4 MB page table resident in memory at all times, even if the application referenced only a small part of the virtual address space. This problem is compounded for systems with 64-bit address spaces.
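To see where these numbers come from: a 32-bit address space with 4 KB pages contains 2^32 / 2^12 = 2^20 virtual pages, and at 4 bytes per page table entry a single-level table therefore occupies 2^20 × 4 bytes = 4 MB.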


Figure 2.10: TLB miss.

A commonly used method of compacting the page table is to use a hierarchy of page tables. The idea is best explained using an example in which a 32-bit virtual address space is partitioned into 4 KB pages, with page table entries that are 4 bytes each, and for which the virtual address space has the following form:

• The first 2K pages of memory are allocated for code and data;

• The next 6K pages are unallocated;

• The next 1023 pages are also unallocated; and

• The next page is allocated for the user stack.

Figure 2.11 shows how a two-level page table hierarchy for this virtual address space might be constructed. Each page table entry in the level-1 table is responsible for mapping a 4 MB segment of the virtual address space, in which each segment consists of 1024 contiguous pages. For example, page table entry 0 maps the first segment, page table entry 1 the next segment, and so on. As the address space is 4 GB, 1024 page table entries are sufficient to cover the entire space. If every page in segment i is unallocated, level 1 page table entry i will be null. For example, in Figure 2.11, segments 2 to 7 are unallocated. However, if at least one page in segment i is allocated, level 1 page table entry i will point to the base of a level 2 page table. This is shown in Figure 2.11, where all or portions of segments 0, 1, and 8 are allocated, so their level 1 page table entries point to level 2 page tables.


Figure 2.11: A two-level page table hierarchy. Notice that addresses increase from top to bottom.

Each page table entry in a level 2 page table maps a 4 KB page of virtual memory, just as in a single-level page table. With 4-byte page table entries, each level 1 and level 2 page table is 4 KB, which, conveniently, is the same size as a page. Moreover, only the level 1 table needs to be in main memory all the time. The level 2 page tables can be created and paged in and out by the virtual memory system as they are needed. This further reduces demand on the main memory. Only the most frequently used level 2 page tables need be cached in the main memory. Figure 2.12 summarizes address translation with a k-level page table hierarchy.

• The virtual address is partitioned into k virtual page numbers and a virtual page offset.

• Each virtual page number i, 1 ≤ i ≤ k, is an index of a page table at level i.

• Each page table entry in a level-j table, 1 ≤ j ≤ k − 1, points to the base of some page table at level j + 1.


• Each page table entry in a level-k table contains either the physical page number of some physical page, or the address of a disk block.

To create the physical address, the memory management unit must access k page table entries before it can determine the physical page number. Again, as with a single-level hierarchy, the physical page offset is identical to the virtual page offset.


Figure 2.12: Address translation with a k-level page table.

At first glance, accessing k page table entries appears to be expensive and impractical. However, the translation lookaside buffer compensates for this by caching page table entries from the page tables at the different levels, the effect of which is that address translation with multi-level page tables is not significantly slower than with single-level page tables.
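A k-level walk can be sketched as a short loop: at each level, the corresponding VPN field of the virtual address indexes the current table, and the entry either points to the next-level table or, at the last level, supplies the PPN. The constants and the way entries are interpreted below are assumptions for illustration (null entries and faults are not handled).

#include <stdint.h>

#define LEVELS   2      /* assume a two-level hierarchy  */
#define IDX_BITS 10     /* assume 1024 entries per table */
#define P_BITS   12     /* assume 4 KB pages             */

/* Walk the hierarchy: VPN 1 indexes the level 1 table, VPN 2 the
 * level 2 table, and so on; the last entry yields the PPN. */
uint64_t walk(uint64_t va, const uint64_t *level1_table)
{
    const uint64_t *table = level1_table;

    for (int level = 0; level < LEVELS; level++) {
        int      shift = P_BITS + (LEVELS - 1 - level) * IDX_BITS;
        uint64_t idx   = (va >> shift) & ((1u << IDX_BITS) - 1);
        uint64_t entry = table[idx];

        if (level == LEVELS - 1)                 /* last level: entry is the PPN */
            return (entry << P_BITS) | (va & ((1u << P_BITS) - 1));

        /* intermediate level: entry holds the base of the next-level table */
        table = (const uint64_t *)(uintptr_t)entry;
    }
    return 0; /* not reached */
}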

End-to-end Address Translation (Page Walk)

We put it all together in this subsection with an example of end-to-end address translation on a small system with a translation lookaside buffer and L1 d-cache. For simplicity, we make the following assumptions:

• The memory is byte addressable.

• Memory accesses are to 1-byte words (not 4-byte words).

• Virtual addresses are 14 bits wide (n = 14).

• Physical addresses are 12 bits wide (m = 12).


Figure 2.13: Addressing for small memory system. Assume 14-bit virtual addresses (n = 14), 12-bit physical addresses (m = 12), and 64-byte pages (P = 64).

• The page size is 64 bytes (P = 64).

• The translation lookaside buffer is a four-way set associative with a total of 16 entries.

• The L1 d-cache is physically addressed and direct mapped, with a 4-byte line size and a total of 16 sets.

The formats of the virtual and physical addresses are shown in Figure 2.13. Since each page is composed of 2^6 = 64 bytes, the low-order 6 bits of the virtual and physical addresses serve as the virtual page offset and physical page offset respectively. The high-order 8 bits of the virtual address serve as the virtual page number. The high-order 6 bits of the physical address serve as the physical page number. Figure 2.14(a) shows a snapshot of this memory system, including the translation lookaside buffer; Figure 2.14(b) shows a portion of the page table; and Figure 2.14(c) shows the L1 cache. We have also shown how the bits of the virtual and physical addresses are partitioned by the hardware as it accesses these devices. This can be seen above the figures for the translation lookaside buffer and cache.

• Translation lookaside buffer: The translation lookaside buffer is virtually addressed using the bits of the virtual page number. As the translation lookaside buffer has four sets, the 2 low-order bits of the virtual page number serve as the set index (the translation lookaside buffer index). The remaining 6 high-order bits serve as the tag (translation lookaside buffer tag) that distinguishes the different virtual page numbers that might map to the same translation lookaside buffer set.

• Page table: The page table is a single-level design with a total of 2^8 = 256 page table entries. However, we are only interested in the first sixteen of these.


Figure 2.14: TLB, page table, and cache for small memory system. All values in the TLB, page table, and cache are in hexadecimal notation.

For convenience, we have labeled each page table entry with the virtual page number that indexes it. Keep in mind, however, that these virtual page numbers are not part of the page table and not stored in memory. Also keep in mind that the physical page number of each invalid page table entry is marked with a dash or minus sign to emphasize that the bit values stored there are not meaningful.

• Cache. The direct-mapped cache is addressed by the fields in the physical address. As each block is 4 bytes, the low-order 2 bits of the physical address serve as the block offset (CO) and, since there are 16 sets, the next 4 bits serve as the set index (CI). The remaining 6 bits serve as the tag (CT).

What happens when the CPU executes a load instruction that reads the byte at address 0x03d4? Recall that the hypothetical CPU reads one-byte words, not four-byte words. In starting a manual simulation such as this, it is helpful to:

• Write down the bits in the virtual address;


• Identify the various fields we will need; and

• Work out their hex values.

A similar task is performed by the hardware when it decodes the address. The memory management unit extracts the virtual page number 0x0F from the virtual address. It then checks whether the translation lookaside buffer has cached a copy of page table entry 0x0F from some previous memory reference. The translation lookaside buffer extracts the translation lookaside buffer index 0x3 and the translation lookaside buffer tag 0x03, hitting on a valid match in the second entry of Set 0x3. It then returns the cached physical page number 0x0D to the memory management unit. The memory management unit would need to fetch the PTE from main memory if the translation lookaside buffer had missed. However, that did not happen in our example. Instead, we had a translation lookaside buffer hit. The memory management unit now has everything required to create the physical address, which it does by concatenating the physical page number (0x0D) from the page table entry with the virtual page offset (0x14) from the virtual address. This forms the physical address 0x354. The memory management unit then sends the physical address to the cache, which extracts from the physical address:

1. The cache offset (CO) of 0x0;

2. The cache set index (CI) of 0x5; and

3. The cache tag (CT) of 0x0D.

Because the tag in Set 0x5 matches the cache tag, the cache detects a hit, reads out the data byte (0x36) at offset CO, and returns it to the memory management unit, which then passes it back to the CPU. It is possible to have other routes through the translation, an example being that if the translation lookaside buffer misses, the memory management unit has to fetch the physical page number from a page table entry in the page table. If the resulting page table entry is invalid, this indicates a page fault and the kernel must reload the missing page and rerun the load instruction. Another possibility is that the page table entry is valid, but the necessary memory block misses in the cache.
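The bit manipulations in this walk-through are easy to check mechanically. The sketch below hard-codes the parameters of the small memory system (14-bit virtual addresses, 64-byte pages, 4 TLB sets, 4-byte cache blocks, 16 cache sets) and reproduces the field values derived above for address 0x03d4; the PPN 0x0D is taken as given rather than read from a modeled TLB.

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint16_t va  = 0x03d4;            /* 14-bit virtual address          */
    uint16_t vpo = va & 0x3f;         /* low 6 bits: 64-byte pages       */
    uint16_t vpn = va >> 6;           /* high 8 bits                     */
    assert(vpn == 0x0f && vpo == 0x14);

    uint16_t tlbi = vpn & 0x3;        /* 4 TLB sets -> 2 index bits      */
    uint16_t tlbt = vpn >> 2;         /* remaining 6 bits                */
    assert(tlbi == 0x3 && tlbt == 0x03);

    uint16_t ppn = 0x0d;              /* returned by the TLB lookup      */
    uint16_t pa  = (uint16_t)((ppn << 6) | vpo);
    assert(pa == 0x354);

    uint16_t co = pa & 0x3;           /* 4-byte blocks -> 2 offset bits  */
    uint16_t ci = (pa >> 2) & 0xf;    /* 16 sets -> 4 index bits         */
    uint16_t ct = pa >> 6;            /* remaining 6 bits                */
    assert(co == 0x0 && ci == 0x5 && ct == 0x0d);
    return 0;
}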

2.2 Direct Memory Access

Direct memory access (DMA) is the hardware mechanism that allows peripheral components to read from the memory or to write to it directly without involving the CPU. This mechanism not only allows us to free the CPU to execute other commands, but


it can also significantly improve the throughput of the memory transactions from the device to the memory and vice versa. Data can be transferred either in software (by calling functions such as read) or by the hardware itself; hardware transfers are known as Rx when the data is transferred from the device to the memory, and Tx when it is transferred from the memory to the device. We will focus on the hardware DMA transaction to the memory, which is done asynchronously in this case. That is, the device transfers the data at its own rate without the involvement of the CPU, and the CPU continues its execution simultaneously. These transactions are done using DMA descriptors, which include the location in memory to access. A DMA descriptor is a data structure that contains all the information the hardware needs to execute its operations, such as read or write. The descriptor is prepared by the OS in advance. Its location in memory is known to the device. Once the hardware becomes available to execute the next DMA command, it reads the descriptor, executes the relevant command, advances to read the next descriptor, and so on until it reaches an empty descriptor.
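A descriptor ring can be pictured as a fixed-size array of small records shared between the driver and the device, where an ownership flag says who may currently use each entry. The layout below is purely illustrative; real NICs and disk controllers define their own descriptor formats.

#include <stdint.h>

#define RING_SIZE 256

/* One DMA descriptor: where to read/write, how many bytes, and who
 * currently owns the entry (the OS or the device). */
struct dma_descriptor {
    uint64_t buffer_addr;       /* buffer address (an IOVA when the IOMMU is enabled) */
    uint32_t length;            /* number of bytes to transfer                        */
    uint32_t owned_by_device;   /* 1: device may process it; 0: belongs to the OS     */
};

/* The ring itself: the device walks it in order and stops at the first
 * descriptor that it does not own. */
struct dma_descriptor ring[RING_SIZE];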

2.2.1 Transferring Data from the Memory to the Device

Consider an example of a case in which the device reads data from the memory, as happens when a packet is sent to the NIC. On driver registration, the driver allocates a set of descriptors. Once there is data to transfer to the device, the driver chooses a descriptor belonging to the OS, updates it to point to the data buffer, writes in it the size of the buffer, marks the descriptor as belonging to the device, and notifies the device to wake it up. The device reads the first descriptor that belongs to it, and from that descriptor it reads the pointer and the size of the buffer to be read. The device now knows the number of bytes to read and where to read them from, and it starts the transaction. After finishing the transaction, the device marks the descriptor as belonging to the OS, advances to the next descriptor, and interrupts the OS. The OS detaches the buffer from the descriptor, leaving the descriptor free for the next DMA command.
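In code, this transmit path amounts to filling the next free descriptor and handing it over to the device. Everything below (the field names and the notify_device hook standing in for a doorbell/tail-register write) is a hypothetical sketch, not the code of any particular driver.

#include <stdint.h>

#define RING_SIZE 256

struct tx_desc {
    uint64_t buffer_addr;
    uint32_t length;
    uint32_t owned_by_device;
};

static struct tx_desc tx_ring[RING_SIZE];
static unsigned tx_tail;                 /* next descriptor the OS may fill */

/* Hypothetical hook that tells the device new work is ready,
 * e.g. by writing its tail/doorbell register. */
void notify_device(unsigned tail);

/* Queue one buffer for transmission. */
void xmit(uint64_t buf_addr, uint32_t len)
{
    struct tx_desc *d = &tx_ring[tx_tail];

    d->buffer_addr     = buf_addr;       /* point the descriptor at the data */
    d->length          = len;            /* how many bytes the device reads  */
    d->owned_by_device = 1;              /* hand ownership to the device     */

    tx_tail = (tx_tail + 1) % RING_SIZE;
    notify_device(tx_tail);              /* wake the device                  */
}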

2.2.2 Transferring Data from the Device to the Memory

When the device writes data to the memory, it interrupts the OS to announce that new data has arrived. Then, the OS uses the interrupt handler to allocate a buffer and tells the hardware where to transfer its data. After the device writes the data to the buffer and raises another interrupt, the OS wakes up the relevant process and passes the packet to it. A network card is a typical example of a device that transfers data asynchronously to the memory. The OS prepares the descriptor in advance, allocates a buffer, links it to the descriptor, and marks the descriptor as belonging to the device. If no packet arrives, the descriptor contains a link to an allocated buffer waiting for data to be written to it. At the arrival of a packet from the network, the network card reads the descriptor in order to identify the address of the buffer to write the data to, writes the data to the


buffer, and raises an interrupt to the OS. Using the interrupt handler, the OS detaches the buffer from the descriptor, allocates a new buffer, links it to the descriptor, and passes the packet written to the buffer to the network stack. Now the descriptor, with a newly allocated buffer, again belongs to the device and waits until a new packet arrives.
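The receive side mirrors this flow: the interrupt handler detaches the filled buffer, re-arms the descriptor with a freshly allocated buffer, and passes the packet upward. Again, all names below are hypothetical, ring initialization and error handling are omitted, and the user-space malloc stands in for whatever buffer allocator a real driver would use.

#include <stdint.h>
#include <stdlib.h>

#define RING_SIZE 256
#define BUF_SIZE  2048

struct rx_desc {
    uint64_t buffer_addr;
    uint32_t length;
    uint32_t owned_by_device;
};

static struct rx_desc rx_ring[RING_SIZE];
static void *rx_buffers[RING_SIZE];      /* CPU-side pointers to the buffers */
static unsigned rx_head;                 /* next descriptor the device fills */

/* Hypothetical upper-layer entry point (e.g. the network stack). */
void deliver_packet(void *buf, uint32_t len);

/* Called from the interrupt handler after the device signals new data. */
void rx_interrupt(void)
{
    while (!rx_ring[rx_head].owned_by_device) {   /* device returned it to the OS */
        void    *full = rx_buffers[rx_head];
        uint32_t len  = rx_ring[rx_head].length;

        void *fresh = malloc(BUF_SIZE);           /* allocate a replacement buffer */
        rx_buffers[rx_head] = fresh;
        /* In a real driver this would be the buffer's DMA (or IOVA) address. */
        rx_ring[rx_head].buffer_addr     = (uint64_t)(uintptr_t)fresh;
        rx_ring[rx_head].owned_by_device = 1;     /* re-arm the descriptor         */

        deliver_packet(full, len);                /* hand the packet upward        */
        rx_head = (rx_head + 1) % RING_SIZE;
    }
}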

2.3 Adding Virtual Memory to I/O Transactions

What follows is a brief description of what I/O virtual memory adds to the flow of I/O transactions. Before updating the DMA descriptor with the buffer address, the driver asks the IOMMU driver to map the physical address of the buffer and receives back a virtual address. This address is inserted into the DMA descriptor instead of the physical one, which would have been used had the IOMMU not been enabled. The flow is described in more detail in Figure 2.15.

Figure 2.15: DMA transaction flow with IOMMU sequence diagram.

I/O device transactions work as follows (each number refers to the corresponding number in Figure 2.15):

1. When a device needs to perform an I/O transaction to/from the memory, its driver (the piece of kernel software that controls the device) issues a request for an I/O buffer.

2. The OS updates the lookup table (page table) and returns an IOVA to the driver.


3. The device driver directs the device to transfer the data to and/or from a virtual address via a corresponding DMA unit.

4. The device starts the transaction to an IOVA via a DMA unit.

5. The IOMMU translates the IOVA to a physical address, and starts the transaction to and/or from a physical address.

6. When the transaction ends, the device raises an interrupt to the driver.

7. The driver issues a request for unmapping the I/O buffer.

8. The OS updates the page table (a radix tree) so that the mapping is no longer available.

There are several strategies for deciding when to map and unmap the I/O buffer. Strict is the common strategy and the one that offers maximum protection. Before a DMA transaction takes place, all the memory accessed by the I/O device is mapped and then unmapped once the transaction is complete (right after steps 3-6). Other strategies postpone the unmap operation and add it to a list, the items on which will be unmapped together. The longer the unmap operation is delayed, the less the system is protected from misbehaving device drivers.
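In Linux, this map-before/unmap-after discipline corresponds to the kernel's streaming DMA mapping API. The fragment below is a simplified sketch of the strict pattern (the device pointer, the buffer, and the device-specific DMA programming are assumed or elided); it is not a complete driver.

#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Strict protection: map just before the DMA, unmap right after it. */
static int do_one_rx_dma(struct device *dev, void *buf, size_t len)
{
    dma_addr_t iova;

    /* Steps 1-2: ask the DMA layer (and thus the IOMMU driver) to map the
     * buffer for a device-to-memory transfer and hand back an I/O address. */
    iova = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, iova))
        return -ENOMEM;

    /* Steps 3-6: program the device's DMA descriptor with 'iova' and wait
     * for the transfer to complete (device specific, omitted here). */

    /* Steps 7-8: tear the mapping down so the device can no longer touch the
     * buffer; under deferred protection the IOTLB invalidation may be batched. */
    dma_unmap_single(dev, iova, len, DMA_FROM_DEVICE);
    return 0;
}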


Chapter 3

rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring Buffers

3.1 Introduction

I/O device drivers initiate direct memory accesses (DMAs) to asynchronously move data from their devices into memory and vice versa. In the past, DMAs used physical memory addresses. But such unmediated access made systems vulnerable to (1) rogue devices that might perform errant or malicious DMAs [14, 19, 40, 44, 65], and to (2) buggy drivers that account for most operating system (OS) failures and might wrongfully trigger DMAs to arbitrary memory locations [13, 21, 31, 50, 59, 63]. Subsequently, all major chip vendors introduced I/O memory management units (IOMMUs) [3, 11, 36, 40], which allow DMAs to execute with I/O virtual addresses (IOVAs). The IOMMU translates the IOVAs into physical addresses according to I/O page tables that are set up by the OS. The OS thus protects itself by adding a suitable translation just before the corresponding DMA, and by removing the translation right after [17, 23, 64]. We explain in detail how the IOMMU is implemented and used in §3.2.

DMA protection comes at a cost that can be substantial in terms of performance [4, 15, 64], notably for newer, high-throughput I/O devices like 10/40 Gbps network controllers (NICs), which can deliver millions of packets per second. Our measurements indicate that using DMA protection with such devices can reduce the throughput by up to 10x. This penalty has motivated OS developers to trade off some protection for performance. For example, when employing the "deferred" IOMMU mode, the kernel defers IOTLB invalidations for a short while instead of performing them immediately when necessary, because they are slow. The kernel then processes the accumulated invalidations en masse by flushing the entire IOTLB, thus amortizing the overhead at the risk of allowing devices to erroneously utilize stale IOTLB entries.

While this tradeoff can double the performance relative to the stricter IOMMU mode, the throughput is still 5x lower than when the IOMMU is disabled. We analyze and model the overheads associated with using the IOMMU in §3.3.

We argue that the degraded performance is largely due to the IOMMU needlessly replicating the design of the regular MMU, which is based on hierarchical page tables. Our claim pertains to the most widely used I/O devices, such as NICs and disk drives, which utilize circular "ring" buffers in order to interact with the OS. A ring is an array of DMA descriptors that the OS driver and the I/O device process in sequence, one after the other. Importantly, (1) the ring semantics dictate that these descriptors encapsulate short-lived DMAs, and (2) the driver sets the DMA details, including the associated IOVAs, when initiating the DMAs. Thus, the order in which IOVAs are allocated, used, and deallocated is linearly predictable: each IOVA is placed in the ring, used in turn, and deallocated.

We propose a ring IOMMU (rIOMMU) that supports this pervasive sequential model using flat (1D) page tables that directly correspond to rings. It has three advantages over baseline IOMMU DMA protection. First, building/destroying an IOVA translation in a flat table is significantly quicker than in a hierarchical structure. Second, the overhead of (de)allocating IOVAs (the virtual addresses) is reduced in our design, because IOVAs are actual integers serving as indices of the flat table, which makes their (de)allocation faster. Finally, the frequency of IOTLB invalidations is substantially reduced, because rIOMMU designates only one IOTLB entry per ring. Consequently, every newly inserted translation implicitly removes the previous one, eliminating the need to explicitly invalidate the latter. And since the OS handles high-throughput I/O in bursts, explicit invalidations become rare. We describe rIOMMU in §3.4.

Figure 3.1: IOMMU is for devices what the MMU is for processes.

We evaluate the performance of rIOMMU using standard networking benchmarks. We find that rIOMMU improves the throughput by 1.00–7.56x, shortens latency by 0.80–0.99x, and reduces CPU consumption by 0.36–1.00x relative to the baseline DMA

r IOMMU I/OI/O device device protection. Our fastest rIOMMU variant is within 0.77–1.00x the throughput, 1.00–1.04x the latency, and 1.00–1.22x the CPU consumption of a system that disables the IOMMU entirely. We describe our experimental evaluation in §4.6.

3.2 Background

3.2.1 Operating System DMA Protection

The role the IOMMU plays for I/O devices is similar to the role the regular MMU plays for processes, as illustrated in Figure 3.1. Processes typically access the memory using virtual addresses, which are translated to physical addresses by the MMU. Analogously, I/O devices commonly access the memory via DMAs associated with IOVAs. The IOVAs are translated to physical addresses by the IOMMU. The IOMMU provides inter- and intra-OS protection [4, 62, 64, 66]. Inter-OS protection is applicable in virtual setups. It allows for “direct I/O”, where the host assigns a device directly to a guest virtual machine (VM) for its exclusive use, largely removing itself from the guest’s I/O path and thus improving its performance [30, 50]. In this mode of operation, the VM directly programs device DMAs using its notion of (guest) “physical” addresses. The host uses the IOMMU to redirect these accesses to where the VM memory truly resides, thus protecting its own memory and the memory of the other VMs. With inter-OS protection, IOVAs are mapped to physical memory locations infrequently, typically only upon such events as VM creation and migration. Such mappings are therefore denoted static or persistent [64]; they are not the focus of this paper. Intra-OS protection allows the OS to defend against the DMAs of errant/malicious devices [14, 19, 24, 40, 44, 65] and of buggy drivers, which account for most OS failures [21, 13, 31, 50, 59, 63]. Drivers and their I/O devices can perform DMAs to arbitrary memory addresses, and IOMMUs allow OSes to protect themselves (and their processes) against such accesses, by restricting them to specific physical locations. In this mode of work, map operations (of IOVAs to physical addresses) and unmap operations (invalidations of previous maps) are frequent and occur within the I/O critical path, such that each DMA is preceded and followed by the mapping and unmapping of the corresponding IOVA [44, 52]. Due to their short lifespan, these mappings are denoted dynamic [17], streaming [23] or single-use [64]. This strategy of IOMMU-based intra-OS protection is the focus of this paper. It is recommended by hardware vendors [40, 32, 44] and employed by operating systems [9, 17, 23, 37, 51, 64].1 It is applicable in non-virtual setups where the OS has direct control over the IOMMU. It is likewise applicable in

1 For example, the DMA API of Linux notes that “DMA addresses should be mapped only for the time they are actually used and unmapped after the DMA transfer” [52]. In particular, “once a buffer has been mapped, it belongs to the device, not the processor. Until the buffer has been unmapped, the [OS] driver should not touch its contents in any way. Only after [the unmap of the buffer] has been called is it safe for the driver to access the contents of the buffer” [23].
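As a concrete illustration of the discipline quoted above, the sketch below brackets a single receive DMA with the Linux DMA API calls dma_map_single and dma_unmap_single. It is only a sketch: the buffer size and the program_rx_dma()/poll_rx_complete() helpers are hypothetical placeholders, not part of any real driver.

    /* Minimal sketch of the map -> DMA -> unmap discipline (illustrative only). */
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>
    #include <linux/slab.h>

    /* Hypothetical device-specific helpers (placeholders). */
    void program_rx_dma(void *hw, dma_addr_t iova, size_t len);
    void poll_rx_complete(void *hw);

    static int rx_one_buffer(struct device *dev, void *hw)
    {
        size_t len = 2048;
        void *buf = kmalloc(len, GFP_KERNEL);
        dma_addr_t iova;

        if (!buf)
            return -ENOMEM;

        /* Map: the buffer becomes reachable by the device at 'iova'
         * (an IOVA when the IOMMU is enabled). */
        iova = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, iova)) {
            kfree(buf);
            return -ENOMEM;
        }

        program_rx_dma(hw, iova, len);   /* write iova into a ring descriptor */
        poll_rx_complete(hw);            /* wait until the DMA has finished   */

        /* Unmap: from here on the device can no longer reach the buffer,
         * and the CPU may safely touch its contents. */
        dma_unmap_single(dev, iova, len, DMA_FROM_DEVICE);

        /* ... hand buf to the upper layers, then release it ... */
        kfree(buf);
        return 0;
    }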


Figure 3.2: Intel IOMMU data structures for IOVA translation.

virtual setups where IOMMU functionality is exposed to VMs via paravirtualization [15, 50, 57, 64], full emulation [4], and, more recently, hardware support for nested IOMMU translation [3, 40].

3.2.2 IOMMU Design and Implementation

Given a target memory buffer of a DMA, the OS associates the physical address (PA) of the buffer with an IOVA. The OS maps the IOVA to the PA by inserting the IOVA⇒PA translation to the IOMMU data structures. Figure 3.2 depicts these structures as implemented by Intel x86-64 [40]. The PCI protocol dictates that each DMA operation is associated with a 16-bit requester identifier comprised of a bus-device-function triplet that uniquely identifies the corresponding I/O device. The IOMMU uses the 8-bit bus number to index the root table in order to retrieve the physical address of the context table. It then indexes the context table using the 8-bit concatenation of the device and function numbers. The result is the physical location of the root of the page table hierarchy that houses all of the IOVA⇒PA translations of that I/O device. The purpose of the IOMMU page table hierarchy is similar to that of the MMU hierarchy: recording the mapping from virtual to physical addresses by utilizing a 4-level radix tree. Each 48-bit (I/O) virtual address is divided into two: the 36 high-order bits, which constitute the virtual page number, and the 12 low-order bits, which are the offset within the page. The translation procedure applies to the virtual page number only, converting it into a physical frame number (PFN) that corresponds to the physical


memory location being addressed. The offset is the same for both physical and virtual pages.

Let Tj denote a page table in the j-th radix tree level for j = 1, 2, 3, 4, such that T1 is the root of the tree. Each Tj is a 4KB page containing up to 2^9 = 512 pointers to physical locations of next-level Tj+1 tables. Last-level—T4—tables contain PFNs of target buffer locations. Correspondingly, the 36-bit virtual page number is split into a sequence of four 9-bit indices i1, i2, i3 and i4, such that ij is used to index Tj in order to find the physical address of the next Tj+1 along the radix tree path. Logically, in C pointer notation, T1[i1][i2][i3][i4] is the PFN of the target memory location.

Similarly to the MMU translation lookaside buffer (TLB), the IOMMU caches translations using an IOTLB, which it fills on-the-fly as follows. Upon an IOTLB miss, the IOMMU hardware hierarchically walks the page table as described above, and it inserts the IOVA⇒PA translation to the IOTLB. IOTLB entries are invalidated explicitly by the OS as part of the corresponding unmap operation. An IOMMU table walk fails if a matching translation was not previously established by the OS, a situation which is logically similar to encountering a null pointer value during the walk. A walk additionally fails if the DMA being processed conflicts with the read/write permission bits found within the page table entries along the traversed radix tree path. We note in passing that, at present, in contrast to MMU memory accesses, DMAs are typically not restartable. Namely, existing systems usually do not support "I/O page faults", and hence the OS cannot populate the IOMMU page table hierarchy on demand. Instead, IOVA translations of valid DMAs are expected to be successful, and the corresponding pages must be pinned to memory. (Albeit I/O page fault standardization does exist [54].)
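The walk amounts to four array lookups. The following user-space sketch models it in C, in the spirit of the T1[i1][i2][i3][i4] notation above; the PTE layout it assumes (bit 0 as a present bit, the next-level frame in the upper bits, tables addressed through ordinary pointers) is purely illustrative and not the exact Intel encoding.

    /* Software model of a 4-level IOVA->PA walk (illustrative PTE layout). */
    #include <stdint.h>

    #define PRESENT    0x1ULL
    #define FRAME_MASK (~0xfffULL)   /* bits 12 and up hold the next-level frame */

    /* Walk the radix tree rooted at 'root' (a pointer to T1 in this model). */
    static int walk(const uint64_t *root, uint64_t iova, uint64_t *pa)
    {
        const uint64_t *table = root;
        uint64_t pte = 0;

        for (int level = 0; level < 4; level++) {
            /* 9-bit index per level: bits 47..39, 38..30, 29..21, 20..12. */
            unsigned idx = (unsigned)((iova >> (39 - 9 * level)) & 0x1ff);

            pte = table[idx];
            if (!(pte & PRESENT))
                return -1;            /* same as hitting a null pointer */
            if (level < 3)            /* levels 1..3 point at the next table ... */
                table = (const uint64_t *)(uintptr_t)(pte & FRAME_MASK);
        }
        /* ... while the last-level entry holds the PFN of the target buffer. */
        *pa = (pte & FRAME_MASK) | (iova & 0xfff);   /* PFN plus 12-bit offset */
        return 0;
    }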

3.2.3 I/O Devices Employing Ring Buffers

Many I/O devices—notably NICs and disk drives—deliver their I/O through one or more producer/consumer ring buffers. A ring is an array shared between the OS device driver and the associated device, as illustrated in Figure 3.3. The ring is circular in that the device and driver wrap around to the beginning of the array when they reach its end. The entries in the ring are called DMA descriptors. Their exact format and content vary between devices, but they specify at least the address and size of the corresponding target buffers. Additionally, the descriptors commonly contain status bits that help the driver and the device to synchronize. Devices must also know the direction of each requested DMA, namely, whether the data should be transmitted from memory (into the device) or received (from the device) into memory. The direction can be specified in the descriptor, as is typical for disk controllers. Alternatively, the device can employ different rings for receive and transmit activity, in which case the direction is implied by the ring. The receive and transmit rings are denoted Rx and Tx, respectively. NICs employ at least one Rx and one Tx


Figure 3.3: A driver drives its device through a ring. With an IOMMU, pointers are IOVAs (both registers and target buffers).

per port. They may employ multiple Rx/Tx rings per port to promote scalability, as different rings can be handled concurrently by different cores. Upon initialization, the OS device driver is responsible for allocating the rings and configuring the I/O device with their size and base location. For each ring, the device and driver utilize head and tail pointers to delimit the content of the ring that can be used by the device: [head, tail). The device iteratively consumes (removes) descriptors from the head, and it increments the head to point to the descriptor that it will use next. Similarly, the driver adds descriptors to the tail, and it increments the tail to point to the entry it will use subsequently. A device asynchronously informs its OS driver that data was transmitted or received by triggering an interrupt. The device coalesces interrupts when their rate is high. Upon receiving an interrupt, the driver of a high-throughput device handles the entire I/O burst. Namely, it sequentially iterates through and processes all the descriptors whose corresponding DMAs have completed.
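The sketch below captures the ring discipline just described: a fixed array of descriptors, a head advanced by the device, and a tail advanced by the driver. The descriptor fields are a generic placeholder rather than any particular NIC's layout.

    /* Generic producer/consumer DMA ring (field layout is illustrative). */
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 256

    struct dma_desc {
        uint64_t addr;     /* IOVA of the target buffer when an IOMMU is used */
        uint32_t len;      /* size of the target buffer */
        uint32_t status;   /* e.g., a "done" bit set by the device */
    };

    struct dma_ring {
        struct dma_desc desc[RING_SIZE];
        uint32_t head;     /* next descriptor the device will consume */
        uint32_t tail;     /* next descriptor the driver will fill */
    };

    /* Driver side: post a new buffer at the tail, extending [head, tail). */
    static bool ring_post(struct dma_ring *r, uint64_t iova, uint32_t len)
    {
        uint32_t next = (r->tail + 1) % RING_SIZE;
        if (next == r->head)            /* ring full */
            return false;
        r->desc[r->tail] = (struct dma_desc){ .addr = iova, .len = len, .status = 0 };
        r->tail = next;
        return true;
    }

    /* Device side (modeled in software): consume one descriptor at the head. */
    static bool ring_consume(struct dma_ring *r, struct dma_desc *out)
    {
        if (r->head == r->tail)         /* ring empty */
            return false;
        *out = r->desc[r->head];
        r->head = (r->head + 1) % RING_SIZE;
        return true;
    }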

3.3 Cost of Safety

This section enumerates the overhead components involved in using the IOMMU in the Linux/Intel kernel (§3.3.1). It experimentally quantifies the overhead of each component (§3.3.2). And it provides and validates a simple performance model that allows us to understand how the overhead affects performance and to assess the benefits of reducing it (§3.3.3).


3.3.1 Overhead Components

Suppose that a device driver that employs a ring wants to transmit or receive data from/to a target buffer. Figure 3.4 lists the actions it carries out. First, it allocates the target buffer, whose physical address is denoted p (1). (For simplicity, let us assume that p is page aligned.) It pins p to memory and then asks the IOMMU driver to map the buffer to some IOVA, such that the I/O device would be able to access p (2). The IOMMU driver invokes its IOVA allocator, which returns a new IOVA v—an integer that is not associated with any other page currently accessible to the I/O device (3). The IOMMU driver then inserts the v ⇒ p translation to the page table hierarchy of the I/O device (4), and it returns v to the device driver (5). Finally, when updating the corresponding ring descriptor, the device driver uses v as the address for the target buffer of the associated DMA operation (6).

Assume that the latter is a receive DMA. Figure 3.5 details the activity that takes place when the I/O device gets the data. The device reads the DMA descriptor within the ring through its head register. As the address held by the head is an IOVA, it is intercepted by the IOMMU (1). The IOMMU consults its IOTLB to find a translation for the head IOVA. If the translation is missing, the IOMMU walks the page table hierarchy of the device to resolve the miss (2). Equipped with the head's physical address, the IOMMU translates the head descriptor for the I/O device (3). The head descriptor specifies that v (IOVA defined above) is the address of the target buffer (4), so the I/O device writes the received data to v (5). The IOMMU intercepts v, walks the page table if the v ⇒ p translation is missing (6), and redirects the received data to p (7).

Figure 3.6 shows the actions the device driver carries out after the DMA operation is completed. The device driver asks the IOMMU driver to unmap the IOVA v (1). In response, the IOMMU driver removes the v ⇒ p mapping from the page table hierarchy (2), purges the mapping from the IOTLB (3), and deallocates v (4). (The order of these actions is important.) Once the I/O device can no longer access p, it is safe for the device driver to hand the buffer to higher levels in the software stack for further processing (5).
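The following sketch condenses the map and unmap steps of Figures 3.4 and 3.6 into two functions, mainly to make the ordering explicit. iova_alloc(), iova_free(), pte_insert(), pte_remove(), and iotlb_invalidate() are hypothetical stand-ins for the corresponding IOMMU-driver internals, not actual Linux function names.

    /* Hypothetical IOMMU-driver internals; only the call order mirrors §3.3.1. */
    typedef unsigned long iova_t;
    typedef unsigned long phys_t;

    extern iova_t iova_alloc(void);                 /* step 3 of Figure 3.4 */
    extern void   iova_free(iova_t v);              /* step 4 of Figure 3.6 */
    extern void   pte_insert(iova_t v, phys_t p);   /* step 4 of Figure 3.4 */
    extern void   pte_remove(iova_t v);             /* step 2 of Figure 3.6 */
    extern void   iotlb_invalidate(iova_t v);       /* step 3 of Figure 3.6 */

    /* Map: called by the device driver before handing the buffer to the device. */
    iova_t iommu_map(phys_t p)
    {
        iova_t v = iova_alloc();    /* pick an IOVA not yet visible to the device */
        pte_insert(v, p);           /* make v => p reachable through the page table */
        return v;                   /* the driver writes v into the DMA descriptor */
    }

    /* Unmap: called after the DMA completed; the order matters, since the
     * buffer must become unreachable before its IOVA can be recycled. */
    void iommu_unmap(iova_t v)
    {
        pte_remove(v);              /* 1. drop the translation from the page table */
        iotlb_invalidate(v);        /* 2. purge any cached copy from the IOTLB     */
        iova_free(v);               /* 3. only now may v be handed out again       */
    }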

3.3.2 Protection Modes and Measured Overhead

We experimentally quantify the overhead components of the map and unmap functions of the IOMMU driver as outlined in Figures 3.4 and 3.6. To this end, we execute the standard Netperf TCP stream benchmark, which attempts to maximize network throughput between two machines over a TCP connection. (The experimental setup is detailed in §3.5.1.)

Strict Protection We begin by profiling the kernel in its safer IOMMU mode, denoted strict, which strictly follows the map/unmap procedures described in §3.3.1.


Figure 3.4: The I/O device driver maps an IOVA v to a physical target buffer p. It then assigns v to the DMA descriptor.

Figure 3.5: The I/O device writes the packet it receives to the target buffer through v, which the IOMMU translates to p.

Figure 3.6: After the DMA completes, the I/O device driver unmaps v and passes p to a higher-level software layer.


function  component    strict  strict+  defer  defer+
map       iova alloc     3986       92   1674     108
          page table      588      590    533     577
          other            44       45     44      42
          sum            4618      727   2251     727
unmap     iova find       249      418    263     454
          iova free       159       62    189      57
          page table      438      427    471     504
          iotlb inv      2127     2135      9       9
          other            26       25    205     216
          sum            2999     3067   1137    1240

Table 3.1: Average cycles breakdown of the map and unmap functions of the IOMMU driver for different protection modes.

Table 3.1 shows the average duration of the components of these procedures in cycles. When examining the breakdown of strict/map, we see that its most costly component is, surprisingly, IOVA allocation (Step 3 in Figure 3.4). Upon further investigation, we found that the reason for this high cost is a nontrivial pathology in the Linux IOVA allocator that regularly causes some allocations to be linear in the number of currently allocated IOVAs. We were able to come up with a more efficient IOVA allocator, which consistently allocates/frees in constant time [7]. We denote this optimized IOMMU mode—which is quicker than strict but equivalent to it in terms of safety—as strict+. Table 3.1 shows that strict+ indeed reduces the allocation time from nearly 4,000 cycles to less than 100. The remaining dominant strict(+)/map overhead is the insertion of the IOVA to the IOMMU page table (Step 4 in Figure 3.4). The 500+ cycles of the insertion are attributed to explicit memory barriers and cacheline flushes that the driver performs when updating the hierarchy. Flushes are required, as the I/O page walk is incoherent with the CPU caches on our system. (This is common nowadays; Intel started shipping servers with coherent I/O page walks only recently.) Focusing on the unmap components of strict/strict+, we see that finding the unmapped IOVA in the allocator’s data structure is costlier in strict+ mode. The reason: like the baseline strict, strict+ utilizes a red-black tree to hold the IOVAs. But the strict+ tree is fuller, so the logarithmic search is longer. Conversely strict+/free (Step 4 in Figure 3.6) is done in constant time, rather than logarithmic, so it is quicker. The other unmap components are: removing the IOVA from the page tables (Step 2 in Figure 3.6) and the IOTLB (Step 3). The removal takes 400+ cycles, which is comparable to the duration of insertion. IOTLB invalidation is by far the slowest unmap component at around 2,000 cycles; this result is consistent with previous work [4, 66].
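For intuition, here is one way an IOVA allocator can run in constant time, assuming single-page allocations drawn from a bounded pool: a simple LIFO free list of page-granularity slots. This is only a sketch of the general idea; it is neither the allocator of [7] nor the red-black-tree allocator of baseline Linux, and the base address and pool size are arbitrary.

    /* Constant-time IOVA allocation via a LIFO free list (sketch, single-page IOVAs). */
    #include <stdint.h>

    #define IOVA_BASE   0x100000000ULL   /* arbitrary start of the I/O virtual range */
    #define PAGE_SHIFT  12
    #define SLOTS       4096             /* bounded pool of page-sized IOVAs */

    static uint32_t free_stack[SLOTS];   /* slot indices that are currently free */
    static uint32_t free_top;            /* number of free slots on the stack */

    static void iova_pool_init(void)
    {
        for (uint32_t i = 0; i < SLOTS; i++)
            free_stack[i] = i;
        free_top = SLOTS;
    }

    /* O(1): pop a slot and turn it into an IOVA. Returns 0 on exhaustion. */
    static uint64_t iova_alloc_page(void)
    {
        if (free_top == 0)
            return 0;
        return IOVA_BASE + ((uint64_t)free_stack[--free_top] << PAGE_SHIFT);
    }

    /* O(1): push the slot back. */
    static void iova_free_page(uint64_t iova)
    {
        free_stack[free_top++] = (uint32_t)((iova - IOVA_BASE) >> PAGE_SHIFT);
    }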

Deferred Protection In order to reduce the high cost of invalidating IOTLB entries, the Linux deferred protection mode relaxes strictness somewhat, trading off some safety


Figure 3.7: CPU cycles used for processing one packet. The top bar labels are relative to Cnone=1,816 (bottommost grid line).

Figure 3.8: Throughput of Netperf TCP stream as a function of the average number of cycles spent on processing one packet.


for performance. Instead of invalidating entries right away, the IOMMU driver queues the invalidations until 250 freed IOVAs accumulate. It then processes all of them in bulk by invalidating the entire IOTLB. This approach affects the cost of (un)mapping in two ways, as shown in Table 3.1 in the defer and defer+ columns. (Defer+ is to defer what strict+ is to strict.) First, as intended, it eliminates the cost of invalidating individual IOTLB entries. And second, it reduces the cost of IOVA allocation in the baseline deferred mode as compared to strict (1,674 vs. 3,986), because deallocating IOVAs in bulk reduces somewhat the aforementioned linear pathology. The drawback of deferred protection is that the I/O device might erroneously access target buffers through stale IOTLB entries after the buffers have already been handed back to higher software stack levels (Step 5 in Figure 3.6). Notably, at this point, the buffers could be (re)used for other purposes.
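The batching policy just described can be summarized in a few lines: freed IOVAs are parked until a threshold is reached, and only then is the IOTLB flushed and the IOVAs actually recycled. The 250-entry threshold comes from the text above; flush_entire_iotlb() and iova_free() are hypothetical placeholders for driver internals.

    /* Sketch of deferred protection: batch IOTLB invalidations, then flush en masse. */
    #define BATCH 250                     /* threshold used by the Linux driver per the text */

    typedef unsigned long iova_t;

    extern void flush_entire_iotlb(void); /* hypothetical: global IOTLB invalidation */
    extern void iova_free(iova_t v);      /* hypothetical: return v to the allocator */

    static iova_t pending[BATCH];
    static int    npending;

    /* Called on unmap instead of invalidating the single entry right away.
     * Until the flush happens, the device may still hit stale IOTLB entries. */
    void deferred_unmap(iova_t v)
    {
        pending[npending++] = v;
        if (npending == BATCH) {
            flush_entire_iotlb();         /* one costly operation amortized over BATCH unmaps */
            for (int i = 0; i < BATCH; i++)
                iova_free(pending[i]);    /* safe to recycle only after the flush */
            npending = 0;
        }
    }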

3.3.3 Performance Model

Let C denote the average number of CPU cycles required to process one packet. Figure 3.7 shows C for each of the aforementioned IOMMU modes in our experimental setup.

The bottommost horizontal grid line shows Cnone, which is C when the IOMMU is turned off. We can see, for example, that Cstrict is nearly 10x higher than Cnone. Our experimental setup employs a NIC that uses two target buffers per packet: one for the header and one for the data. Each packet thus requires two map and two unmap invocations. So the processing of the packet includes: two IOVA (de)allocations; two page table insertions and deletions; and two invalidations of IOTLB entries. The corresponding aggregated cycles are respectively depicted as the three top stacked sub-bars in the figure. The bottom, "other" sub-bar embodies all the rest of the packet processing activity, notably TCP/IP and interrupt processing. As noted, the deferred modes eliminate the IOTLB invalidation overhead, and the "+" modes reduce the overhead of IOVA (de)allocation. But even Cdefer+ (the most performant mode, which introduces a vulnerability window) is still over 3.3x higher than Cnone.

We find that the way the specific value of C affects the overall throughput of Netperf is simple and intuitive. Specifically, if S denotes the cycles-per-second clock speed of the core, then S/C is the number of packets the core can handle per second. And since every Ethernet packet carries 1,500 bytes, the throughput of the system in Gbps should be Gbps(C) = 1500 byte × 8 bit × S/C, assuming S is given in GHz. Figure 3.8 shows that this simple model (thick line) is accurate. It coincides with the throughput obtained when systematically lengthening Cnone using a carefully controlled busy-wait loop (thin line). It also coincides with the throughput measured under the different IOMMU modes (cross points).
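Plugging our testbed's 3.10GHz clock (§3.5.1) and the Cnone = 1,816 cycles of Figure 3.7 into this formula is a one-liner; the snippet below also evaluates the model for a few larger C values to reproduce the shape of the thick line in Figure 3.8.

    /* Evaluate Gbps(C) = 1500 * 8 * S / C for S in GHz (i.e., cycles per nanosecond). */
    #include <stdio.h>

    static double gbps(double S_ghz, double cycles_per_packet)
    {
        return 1500.0 * 8.0 * S_ghz / cycles_per_packet;
    }

    int main(void)
    {
        const double S = 3.10;                           /* core clock of the testbed, GHz */
        const double C[] = { 1816, 4000, 8000, 16000 };  /* Cnone plus a few larger values */

        for (unsigned i = 0; i < sizeof(C) / sizeof(C[0]); i++)
            printf("C = %5.0f cycles/packet -> %5.1f Gbps\n", C[i], gbps(S, C[i]));
        /* C = 1816 yields ~20.5 Gbps, matching the 'none' point in Figure 3.8. */
        return 0;
    }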

Consequences As our model accurately reflects reality, we conclude that the translation activity carried out by the IOMMU (as depicted in Figure 3.5) does not affect


the performance of the system, even when servicing demanding benchmarks like Netperf. Instead, the cost of IOMMU protection is entirely determined by the number of cycles the core spends establishing and destroying IOVA mappings. Consequently, we can later simulate and accurately assess the expected performance of our proposed IOMMU by likewise spending cycles; there is no need to simulate the actual IOMMU hardware circuitry external to the core. A second conclusion rests on the understanding that throughput is proportional to 1/C. If C is high (right-hand side of Figure 3.8), incrementally improving it would make little difference. The required change must be

significant enough to bring C to the proximity of Cnone.

3.4 Design

a) struct rDEVICE {
       u16   size;
       rRING rings[size];
   };

b) struct rRING {
       u18  size;
       rPTE ring[size];
       u18  tail;      // SW only
       u18  nmapped;   // SW only
   };

c) struct rPTE {
       u64 phys_addr;
       u30 size;
       u02 dir;
       u01 valid;
       u31 unused;
   }; // = 128bit

d) struct rIOVA {
       u30 offset;
       u18 rentry;
       u16 rid;
   }; // = 64bit

e) struct rIOTLB_entry {
       u16  bdf;
       u16  rid;
       u18  rentry;
       rPTE rpte;
       rPTE next;
   };

Figure 3.9: The rIOMMU data structures. e) is used only by hardware. The last two fields of rRING are used only by software.

Our goal is to design rIOMMU, an efficient IOMMU for devices that employ rings. We aim to substantially reduce all IOVA-related overheads: (de)allocation, insertion/deletion to/from the page table hierarchy, and IOTLB invalidation (see Figure 3.7). We base our design on the observation that ring semantics (§3.2.3) dictate a well-defined memory access order. The OS sequentially produces ring entries, one after the other. And the I/O device sequentially consumes these entries, one after the other, making its DMA pattern predictable. We contend that the x86 hierarchical structure of page tables is poorly suited for the ring model. For each DMA, the OS has to walk the hierarchical page table in order to map the associated IOVA. Then, the device faults on the IOVA and so the IOMMU must walk the table too. Shortly after, the OS has to walk the table yet again in order to unmap the IOVA. Contributing to this overhead are the aforementioned memory barriers and cacheline flushes required for propagating page table changes. In a nutshell, we propose to replace the table hierarchy with a per-ring flat page table (1D array) as shown in Figure 3.10. IOVAs would constitute indices of the array,


Figure 3.10: rIOMMU data structures for IOVA translation.

thus eliminating IOVA (de)allocation overheads. Not having to walk a hierarchical structure would additionally reduce the table walk cost. In accordance with Figure 3.9e, we propose an IOTLB that holds at most one entry per ring, such that each table walk removes the previous IOVA translation. Consequently, given a burst of unmaps, only the last IOVA in the sequence requires explicit invalidation. We further discuss this point later on. We next describe the rIOMMU design in detail. There are several ways to realize the rIOMMU concept, and our description should be viewed as an example. Contrary to the baseline IOMMU, which provides protection in page granularity, our rIOMMU facilitates protection of any specified size.

Data Structures Figure 3.9 defines the rIOMMU data structures. The rDEVICE (Figure 3.9a) is to the rIOMMU what the root page table is to the baseline IOMMU. It is uniquely associated with a bus-device-function (bdf) triplet and is pointed to by the context table (Figure 3.2). As noted, each DMA carries with it a bdf, allowing the rIOMMU to find the corresponding rDEVICE when needed. The rDEVICE consists of a physical pointer to an array of rRING structures (Figure 3.9b) and a matching size. Each rRING entry represents a flat page table. It likewise contains the table's physical address and size. The OS associates with each rRING: (1) a tail pointing to the next entry to be allocated in the flat table, and (2) the current number of valid mappings in the table. The latter two are not architected and are unknown to the rIOMMU hardware. We include them in rRING to simplify the description.


u64 rtranslate(u16 bus_dev_func, rIOVA iova, u2 dir) {
    rIOTLB_entry e = riotlb_find( bus_dev_func, iova.rid );
    if( ! e ) {
        e = rtable_walk( bus_dev_func, iova );
        riotlb_insert( e );
    }
    if( e.rentry != iova.rentry )
        riotlb_entry_sync( bus_dev_func, iova, e );
    if( iova.offset >= e.rpte.size || ! (e.rpte.dir & dir) )
        io_page_fault();
    return e.rpte.phys_addr + iova.offset;
}

void riotlb_entry_sync(u16 bus_dev_func, rIOVA iova, rIOTLB_entry e) {
    rDEVICE d = get_domain( bus_dev_func );
    u18 next = (e.rentry + 1) % d.rings[e.rid].size;
    if( e.next.valid && (iova.rentry == next) ) {
        e.rpte = e.next;
        e.rentry = next;
        e.next.valid = 0;
    } else
        e = rtable_walk( bus_dev_func, iova );
    rprefetch( d, e );
}

rIOTLB_entry rtable_walk(u16 bus_dev_func, rIOVA iova) {
    rDEVICE d = get_domain( bus_dev_func );
    if( iova.rid >= d.size ||
        iova.rentry >= d.rings[iova.rid].size ||
        ! d.rings[iova.rid].ring[iova.rentry].valid )
        io_page_fault();

    rIOTLB_entry e;
    rRING r = d.rings[iova.rid];
    e.bdf = bus_dev_func;
    e.rid = iova.rid;
    e.rentry = iova.rentry;
    e.rpte = r.ring[e.rentry]; // copy
    rprefetch( d, e );
    return e;
}

// async
void rprefetch(rDEVICE d, rIOTLB_entry e) {
    rRING r = d.rings[e.rid];
    u18 next = (e.rentry + 1) % r.size;
    if( r.size > 1 && r.ring[next].valid )
        e.next = r.ring[next]; // copy
}

Figure 3.11: Outline of the rIOMMU logic. All DMAs are carried out with IOVAs that are translated by the rtranslate routine.


rIOVA map(rDEVICE d, u16 rid, u64 pa, u30 size, u2 direction) {
    rRING r = d.rings[rid];
    locked {
        if( r.nmapped == r.size )
            return OVERFLOW;
        u18 t = r.tail;
        r.tail = (r.tail + 1) % r.size;
        r.nmapped++;
    }
    r.ring[t].phys_addr = pa;
    r.ring[t].size = size;
    r.ring[t].dir = direction;
    r.ring[t].valid = 1;
    sync_mem( & r.ring[t] );
    return pack_iova( 0/*offset*/, t/*rentry*/, rid );
}

void unmap(rDEVICE d, rIOVA iova, bool end_of_burst) {
    rRING r = d.rings[iova.rid];
    r.ring[iova.rentry].valid = 0;
    locked { r.nmapped--; }
    sync_mem( & r.ring[iova.rentry] );
    if( end_of_burst )
        riotlb_invalidate( bus_dev_func(d), iova.rid );
}

void sync_mem(void * line) {
    if( ! riommu_pt_is_coherent() ) {
        memory_barrier();
        cache_line_flush( line );
    }
    memory_barrier();
}

Figure 3.12: Outline of the rIOMMU OS driver, implementing map and unmap, which respectively correspond to Figures 3.4 and 3.6.


Each ring buffer of the I/O device is associated with two rRINGs in the rDEVICE array. The first corresponds to IOVAs pointing to the device ring buffer (Step 1 of Figure 3.5 for translating the head register). The second corresponds to IOVAs that the device finds within its ring descriptors (Step 5 in Figure 3.5 for translating target buffers). The IOVAs that reside in the first flat table are mapped as part of the I/O device initialization. They will be unmapped only when the device is brought down, as the device rings are always accessible to the device. IOVAs residing in the second flat table are associated with DMA target buffers; they are mapped/unmapped repeatedly and are valid only while their DMA is in flight. The flat table pointed to by rRING.ring is an array of rPTE structures (Figure 3.9c). An rPTE consists of the physical address and size associated with the corresponding IOVA; two bits that specify the DMA direction, which can be from the device, to it, or both; and a bit that indicates whether the rPTE (and thus the corresponding IOVA) are valid. The physical address need not be page aligned and the associated size can have any value, allowing for fine-grained protection. The rIOVA structure (Figure 3.9d) defines the format of IOVAs. As noted, every DMA has a bdf that uniquely identifies its rDEVICE. The rIOVA.rid (ring ID) serves as an index to the corresponding rDEVICE.rings array, and thus it uniquely identifies the rRING of the rIOVA. Likewise, rIOVA.rentry serves as an index to the rRING.ring array, and thus it uniquely identifies the rPTE of the rIOVA. The target address of the rIOVA is computed by adding rIOVA.offset to rPTE.phys_addr. The data structures discussed so far are used by both software and hardware. They are set up by the OS and utilized by the rIOMMU to translate rIOVAs. The last one (Figure 3.9e) is a hardware-only structure, representing one rIOTLB entry. The combination of its first two fields (bdf+rid) uniquely identifies a rRING flat page table, which we denote as T. The rIOTLB utilizes at most one rIOTLB entry per T. The combination of the first three fields (bdf+rid+rentry) uniquely identifies T's "current" rPTE—the PTE associated with the most recently translated rIOVA that belongs to T. The current rPTE is cached by rIOTLB_entry.rpte (holds a copy). The rIOTLB_entry.next field may or may not contain a prefetched copy of T's subsequent rPTE. (Our design does not depend on the latter field and works just as well without it.)
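The pack_iova call used by map in Figure 3.12 is just bit packing of the three rIOVA fields. The sketch below assumes the offset occupies bits 29:0, rentry bits 47:30, and rid bits 63:48, which matches the u30/u18/u16 field widths of Figure 3.9d; the exact bit positions are our reading of Figure 3.10, not an architected format.

    /* Packing/unpacking a 64-bit rIOVA (bit positions assumed from Figures 3.9d/3.10). */
    #include <stdint.h>

    static inline uint64_t pack_iova(uint32_t offset, uint32_t rentry, uint32_t rid)
    {
        return ((uint64_t)(rid    & 0xffff)  << 48) |   /* u16 ring id     */
               ((uint64_t)(rentry & 0x3ffff) << 30) |   /* u18 ring entry  */
               ((uint64_t)(offset & 0x3fffffff));       /* u30 byte offset */
    }

    static inline uint32_t iova_rid(uint64_t iova)    { return (uint32_t)(iova >> 48) & 0xffff; }
    static inline uint32_t iova_rentry(uint64_t iova) { return (uint32_t)(iova >> 30) & 0x3ffff; }
    static inline uint32_t iova_offset(uint64_t iova) { return (uint32_t)(iova & 0x3fffffff); }

The driver always packs offset 0; callers may then add byte offsets up to the size recorded in the corresponding rPTE, as noted in the Software discussion below.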

Hardware The rtranslate routine (Figure 3.11 top/left) outlines how rIOMMU translates a rIOVA to a physical address. First, it searches for e, the rIOTLB entry of the rRING that is associated with the rIOVA. (Recall that there is only one such entry per rRING.) If e is missing from the rIOTLB, rIOMMU walks the table using the data structures defined above, finds the rPTE, and inserts to the rIOTLB a matching entry. Doing the table walk ensures that e.rpte is the rPTE of the given rIOVA. However, if e was initially found in the rIOTLB, then e and the rIOVA might mismatch. rIOMMU therefore compares the rentry numbers of e and the IOVA, and it updates e if they are different. Now that e is up-to-date, rIOMMU checks that the direction of the DMA is


permitted according to the rPTE. It also checks that the offset of the IOVA is in range, namely, smaller than the associated rPTE.size. Violating these conditions constitutes an error, causing rIOMMU to trigger an I/O page fault (IOPF). IOPFs are not expected to occur (drivers pin target buffers to memory), and OSes typically reinitialize the I/O device if they do. If no violation is detected, rIOMMU finally performs the translation by adding the offset of the IOVA to rPTE.phys_addr. The rtable_walk routine (Figure 3.11 top/right) ensures that the rIOVA complies with the rIOMMU data structure limits as well as points to a valid rPTE. Noncompliance might be the result of, e.g., an errant DMA or a buggy driver. After validation, rtable_walk initializes the rIOTLB entry in a straightforward manner based on the rIOVA and its rPTE. It additionally attempts to prefetch the subsequent rPTE by invoking rprefetch (Figure 3.11 bottom/right), which succeeds if the next rPTE is valid. Prefetching can be asynchronous. The riotlb_entry_sync routine (Figure 3.11 bottom/left) is used by rtranslate to synchronize e (the rIOTLB entry) with the current IOVA. The two become unsynchronized, e.g., whenever the device handles a new DMA descriptor. The required rPTE can then be found in e.next if prefetching was previously successful, in which case the routine assigns e.next to e.rpte. Otherwise, it uses rtable_walk to fetch the needed rPTE. Finally, it attempts to prefetch the subsequent rPTE.

Software The (un)map functions comprising the rIOMMU OS driver are shown in Figure 3.12. Their prototypes are logically similar to the associated Linux functions from the baseline IOMMU OS driver (Figures 3.4 and 3.6), with minor adjustments. The map flow corresponds to Figure 3.4. It gets a device, a ring ID, a physical address to be mapped, and the associated size and direction of the DMA. The first part of the code allocates a ring entry rPTE at the ring's tail and then updates the tail/nmapped fields accordingly. This allocation—which consists of incrementing two integers—is analogous to the costly IOVA allocation of baseline Linux. The second part of map initializes the newly allocated rPTE. When the rPTE is ready, the map function invokes sync_mem, which ensures that the rPTE memory updates are visible to the rIOMMU. This part of the code is analogous to walking and updating the page table hierarchy of the baseline IOMMU, but it is simpler since the page table is flat. The return statement of the map function packs the rentry index and its ring ID into an IOVA as dictated by the rIOVA data structure (Figure 3.9d). The offset is always set to be 0 by the rIOMMU driver. Callers of map can later manipulate the offset as they please, provided they conform to the size constraint encoded into the corresponding rPTE. The flow of unmap (Figure 3.12/right) corresponds to Figure 3.6. Unmap gets an rIOVA, marks the associated rPTE as invalid (analogous to walking the table hierarchy), decrements the ring's nmapped (analogous to IOVA deallocation), and synchronizes the memory to make the rPTE update visible to the rIOMMU.


the device driver is notified that its device has finished some DMAs, it loops through the relevant descriptors and sequentially unmaps their IOVAs (§3.2.3). The driver sets the end_of_burst parameter of unmap to true at the end of this loop, upon the last IOVA. One invalidation is sufficient because, by design, each rRING has at most one rIOTLB entry allocated in the rIOTLB. In our experimental testbed, our measurements indicate that the average loop length of a throughput-sensitive workload such as Netperf is ~200 iterations. This is long enough to make the amortized cost of IOTLB invalidations negligible, as in the deferred mode, but without sacrificing safety. Amortization, however, does not apply to latency-sensitive workloads. Nonetheless, the invalidation cost is small in comparison to the overall latency as will shortly be demonstrated. Finally, we consider the problem of synchronizing the memory between the IOMMU and its driver. In sync_mem (Figure 3.12 bottom/right), we see support for two hardware modes, corresponding to whether the IOMMU table walk is coherent with the CPU caches. The baseline Linux kernel queries the relevant IOMMU capability bit. If it finds that the two are not in the same coherency domain, it introduces an additional memory barrier followed by a cacheline flush. In the following section, we experimentally evaluate two simulated rIOMMU versions corresponding to these two modes.
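As a usage example of the end_of_burst parameter, a hypothetical driver's completion loop could look as follows, in the same pseudo-C style as Figure 3.12. The nloop counter and descriptor_iova() helper are placeholders for whatever bookkeeping the real driver maintains.

    // Hypothetical completion handler built on the unmap of Figure 3.12.
    void handle_completions(rDEVICE d, u16 rid, int completed) {
        for (int i = 0; i < completed; i++) {
            rIOVA v = descriptor_iova(d, rid, i);   // IOVA recorded in the i-th finished descriptor
            bool last = (i == completed - 1);
            unmap(d, v, last);                      // the rIOTLB is invalidated only once, on the last unmap
            // ... hand the corresponding buffer to the upper layers ...
        }
    }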

Limitations Let R be a ring of an I/O device. Let D be the number of DMA descriptors comprising R. Let L be the maximal number of R’s live IOVAs whose DMAs are currently in flight. And let N be the size of the associated rRING. N is set by the device driver upon startup. Optimally, N ≥ L, or else the driver would experience overflow (2nd line of map in Figure 3.12). While suboptimal, overflow is legal as with other devices employing rings; it just means that the driver must slow down. D is typically hundreds or a few thousands. In some I/O devices, each descriptor can hold only a constant number of IOVAs (K), in which case setting N = D × K would prevent overflow. Some devices support scatter-gather lists, whose K might be large or theoretically unbounded. Developers of device drivers must therefore make a judicious decision regarding N based on their domain-specific knowledge about L. (In our experiments, L was at most 8K for all rings.) Alternatively, developers can opt for using the baseline IOMMU. Importantly, there are devices for which rIOMMU is unsuitable, notably NICs that implement remote direct memory access (RDMA). We therefore do not propose to replace the baseline IOMMU, but only to supplement it.

3.5 Evaluation

3.5.1 Methodology

Simulating rIOMMU We experimentally evaluate the seven IOMMU modes defined in §3.3–3.4: (1) strict, which is the completely safe Linux baseline; (2) strict+, which


enhances strict with our faster IOVA allocator; (3) defer, which is the Linux variant that trades off some protection for performance by batching IOTLB invalidations; (4) defer+, which is defer with our IOVA allocator; (5) riommu- (in lowercase), which is the newly proposed rIOMMU when assuming no I/O page table coherency; (6) riommu, which does assume coherent I/O page tables; and (7) none, which turns off the IOMMU. The five non-rIOMMU modes are executed as is. They constitute full implementations of working systems and do not require a simulation component. To simulate the two rIOMMU modes, we start with the none mode as the baseline. We then supplement the baseline with calls to the (un)map functions, similarly to the way they are called in the non-simulated IOMMU-enabled modes. But instead of invoking the native functions of the Linux IOMMU driver (Figures 3.4 and 3.6), we invoke the (un)map functions that we implement in the simulated rIOMMU driver (Figure 3.12). All the code of the rIOMMU driver can be—and is—executed, with one exception. Since there is no real rIOTLB, we must simulate the invalidation of rIOTLB entries. We do so by busy waiting for 2,150 cycles upon each entry invalidation, in accordance with the measurements specified in Table 3.1.

Notice that our methodology does not account for differences between the existing and proposed IOMMU translation mechanism. Namely, we only account for actions shown in Figures 3.4 and 3.6 but not those in Figure 3.5. Notably, we ignore the fact that the IOMMU works harder than the rIOMMU due to IOTLB misses that rIOMMU avoids via prefetching. We likewise ignore the fact that rIOMMU works harder than the no-IOMMU mode, since it translates addresses whereas the no-IOMMU mode does not. We ignore these differences, as the model validated in §3.3.3 shows that throughput is entirely determined by the number of cycles it takes the core—not the device or the IOMMU—to process a DMA request, even for the most demanding I/O-intensive workloads. The system behaves this way probably because the device and IOMMU operate in parallel to the CPU and are apparently fast enough so as not to constitute a bottleneck.

We revalidate our methodology and show that it is also applicable for latency-sensitive workloads by using the standard Netperf UDP request-response (RR) benchmark, which repeatedly sends one byte to its peer and waits for an identical response. We run RR under two IOMMU modes: hardware pass-through (HWpt) and software pass-through (SWpt). With HWpt, the IOMMU is enabled but never experiences IOTLB misses; instead, it translates each IOVA to an identical physical address without consulting any page table. SWpt provides an equivalent functionality by using a page table that maps the entire physical memory and associates each physical page address with an identical IOVA. Under SWpt, Netperf RR experiences an IOTLB miss on every packet it sends and receives. Nonetheless, we find that the performance of HWpt and SWpt is identical, because the network stack and interrupt processing introduce far greater latencies that hide the IOTLB miss penalty. Moreover, we find that the RR performance of HWpt/SWpt is identical to that of no-IOMMU.
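A sketch of one way such a calibrated busy-wait can be implemented on x86 is shown below; it uses the timestamp counter and assumes a constant-rate TSC, and is meant only to convey the idea, not to reproduce our exact simulation code.

    /* Busy-wait for a fixed number of TSC cycles (x86, assumes constant-rate TSC). */
    #include <stdint.h>
    #include <x86intrin.h>          /* __rdtsc(), _mm_pause() */

    static inline void burn_cycles(uint64_t cycles)
    {
        uint64_t start = __rdtsc();
        while (__rdtsc() - start < cycles)
            _mm_pause();            /* be polite to the sibling hyperthread */
    }

    /* Stand-in for an rIOTLB entry invalidation in the simulated rIOMMU modes. */
    static inline void simulated_riotlb_invalidate(void)
    {
        burn_cycles(2150);          /* matches the measured IOTLB-invalidation cost (Table 3.1) */
    }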


Throughput performance of Netperf stream with HWpt and SWpt is smaller by ~10% relative to no-IOMMU. But here too the difference is entirely caused by the core: about 200 CPU cycles spent on unrelated kernel abstraction code that executes under HWpt/SWpt but not under no-IOMMU.

Experimental Setup In an effort to get more general results, we conduct the evaluation using two setups involving two different NICs, as follows. The Mellanox setup (mlx for short) is comprised of two identical Dell PowerEdge R210 II Rack Server machines that communicate through Mellanox ConnectX3 40Gbps NICs. The NICs are connected back to back via a 40Gbps optical fiber and are configured to use Ethernet. We use one machine as the server and the other as a workload generator client. Each machine has 8GB of 1333MHz memory and a single-socket 4-core Intel Xeon E3-1220 CPU running at 3.10GHz. The chipset is Intel C202, which supports VT-d, Intel's Virtualization Technology that provides IOMMU functionality. We configure the server to utilize one core only and turn off all power optimizations—sleep states (C-states) and dynamic voltage and frequency scaling (DVFS)—to avoid reporting artifacts caused by nondeterministic events. The machines run Ubuntu 12.04 with the Linux 3.4.64 kernel. All experimental findings described thus far were obtained with the mlx setup.

The Broadcom setup (brcm for short) is similar, likewise utilizing two R210 machines. The differences are that the two machines communicate through Broadcom NetXtreme II BCM57810 10GbE NICs (connected via a CAT7 10GBASE-T cable for fast Ethernet); that they are equipped with 16GB memory; and that they run the Linux 3.11.0 kernel. The mlx and brcm device drivers differ substantially. Notably, mlx utilizes more ring buffers and allocates more IOVAs (we observed a total of ~12K addresses for mlx and ~3K for brcm). The mlx driver uses two target buffers per packet (header and body) and thus two IOVAs, whereas the brcm driver allocates only one buffer/IOVA per packet.

Benchmarks To drive our experiments we utilize the following benchmarks:

1. Netperf TCP stream [42] is a standard tool to measure networking performance in terms of throughput. It maximizes the amount of data sent over one TCP connection, simulating an I/O-intensive workload. Its default message size is 16KB. This is the application we used in §3.3.

2. Netperf UDP RR (request-response) is the second canonical configuration of Netperf. As noted, it models a latency-sensitive workload by repeatedly exchanging one-byte messages in a ping-pong manner. The per-message latency can be calculated as the inverse of the number of messages per second (which we show later on).


Figure 3.13: Absolute performance numbers of the IOMMU modes when using the Mellanox (top) and Broadcom (bottom) NICs.

3. Apache [26, 27] is a popular HTTP web server. We drive it using ApacheBench [8], the workload generator distributed with Apache. It measures the number of requests per second that the server is capable of handling by requesting a static page of a given size. We run it on the client machine configured to generate 32 concurrent requests. We use two instances of the benchmark, respectively requesting a smaller (1KB) and a bigger (1MB) file.

4. Memcached [28] is a high-performance in-memory key-value storage server. It is often used to cache slow disk queries. We used the Memslap benchmark [2] (part of the libmemcached client library), which runs on the client machine and measures the completion rate of the requests that it generates. By default, Memslap generates a workload comprised of 90% get and 10% set operations, with 64B and 1KB key and value sizes, respectively. It too is set to use 32 concurrent requests.

3.5.2 Results

throughput
                       riommu- divided by                  riommu divided by
NIC   benchmark    strict strict+ defer defer+ none    strict strict+ defer defer+ none
mlx   stream         5.12    2.90  2.57   1.74 0.52      7.56    4.28  3.79   2.57 0.77
      rr             1.23    1.07  1.05   1.02 0.95      1.25    1.09  1.07   1.03 0.96
      apache 1M      5.30    1.62  1.58   1.20 0.76      5.80    1.77  1.73   1.31 0.83
      apache 1K      2.32    1.08  1.07   1.03 0.92      2.32    1.08  1.07   1.03 0.92
      memcached      4.77    1.17  1.25   1.03 0.82      4.88    1.19  1.28   1.05 0.83
brcm  stream         2.17    1.00  1.00   1.00 1.00      2.17    1.00  1.00   1.00 1.00
      rr             1.19    1.05  1.04   1.02 0.99      1.21    1.06  1.05   1.03 1.00
      apache 1M      1.20    1.01  1.00   1.00 1.00      1.20    1.01  1.00   1.00 1.00
      apache 1K      1.24    1.13  1.08   1.02 0.89      1.29    1.18  1.13   1.07 0.93
      memcached      1.76    1.35  1.18   1.10 0.78      1.88    1.45  1.27   1.18 0.84

cpu
                       riommu- divided by                  riommu divided by
NIC   benchmark    strict strict+ defer defer+ none    strict strict+ defer defer+ none
mlx   stream         1.00    1.00  1.00   1.00 1.00      1.00    1.00  1.00   1.00 1.00
      rr             0.94    0.99  0.98   0.99 1.01      0.93    0.98  0.96   0.98 1.00
      apache 1M      0.99    0.99  1.00   1.00 1.00      0.99    0.99  0.99   1.00 1.00
      apache 1K      0.99    1.00  1.00   1.00 1.00      0.99    1.00  1.00   1.00 1.00
      memcached      1.00    1.00  1.00   1.00 1.00      1.00    1.00  1.00   1.00 1.00
brcm  stream         0.40    0.50  0.64   0.81 1.21      0.36    0.45  0.58   0.73 1.09
      rr             0.86    0.96  0.96   1.00 1.11      0.84    0.93  0.93   0.98 1.08
      apache 1M      0.48    0.49  0.60   0.75 1.41      0.41    0.42  0.52   0.65 1.22
      apache 1K      0.99    0.99  0.99   1.00 1.00      0.99    1.00  1.00   1.00 1.00
      memcached      1.00    1.00  1.00   1.00 1.00      1.00    1.00  1.00   1.00 1.00

Table 3.2: Relative performance numbers.

We run each benchmark 100 times. Each individual run is configured to take ~10 seconds. We treat the first 10 runs as warmup and report the average of the remaining 90 runs. Figure 3.13 shows the resulting throughput and CPU consumption for the mlx (top) and brcm (bottom) setups. The corresponding normalized performance is shown in Table 3.2, specifying the relative improvement of the two rIOMMU variants over the other modes. The top/left plot in Figure 3.13 corresponds to the analysis and data shown in Figures 3.7–3.8. Let us discuss the results in Figure 3.13, left to right.

The greatest observed improvement by rIOMMU is attained with mlx / Netperf stream (Figure 3.13/top/left). This result is to be expected considering the model from §3.3.3 showing that every cycle shaved off the IOVA (un)mappings translates into increased throughput. CPU cycles constitute the bottleneck resource, as is evident from the mlx/stream/CPU curve, which is at 100% for all IOMMU modes. The notable difference between riommu- and riommu is due to ~1.1K cycles that the former adds to the latter, which is the cost of four additional memory barriers and four additional cacheline flushes, per packet. (Specifically, a barrier and a cacheline flush in both map and unmap for two IOVAs corresponding to the packet's header and data.) Riommu- and riommu provide 2.90–7.56x higher throughput relative to the completely safe IOMMU modes strict and strict+, and 1.74–3.79x higher throughput relative to the deferred modes. The latter, however, does not constitute an apples-to-apples comparison, since the deferred modes are vulnerable whereas the rIOMMU modes are safe. Riommu- and riommu deliver 0.52x and 0.77x lower throughput relative to the unprotected, no-IOMMU optimum.

The brcm/stream results (Figure 3.13/bottom/left) are quantitatively and qualitatively different. In particular, all IOMMU modes except strict have enough cycles to saturate the Broadcom NIC and achieve its line rate, which is 10 Gbps. The brcm setup requires fewer cycles per packet because its device driver is more efficient, e.g., due to utilizing only one IOVA per packet instead of two. In setups of this type—where the network is saturated—the performance metric of interest becomes the CPU consumption.


NIC    strict  strict+  defer  defer+  riommu-  riommu  none
mlx      17.3     15.1   14.9    14.4     14.1    13.9  13.4
brcm     41.9     36.7   36.6    35.8     35.1    34.7  34.6

Table 3.3: Netperf RR round-trip time in microseconds.

By Table 3.2, we can see that riommu and riommu- consume 0.36–0.50x fewer CPU cycles than the two strict modes; 0.58–0.81x fewer cycles than the deferred modes; and 1.09x and 1.21x more cycles than the no-IOMMU optimum, respectively. The improvement by rIOMMU is less pronounced when running RR, in both mlx and brcm, with 1.02–1.25x higher throughput and 0.84–1.00x lower CPU consumption relative to the strict and deferred variants. It is less pronounced due to RR's ping-pong nature, which implies that CPU cycles are in low demand, as indicated by the CPU curves at 28–30% for mlx and at 12–15% for brcm. For this reason, in comparison to mlx/RR/none, rIOMMU has 4–5% lower throughput and nearly identical CPU consumption. In comparison to brcm/RR/none, rIOMMU has 8–11% higher CPU consumption and nearly identical throughput. Although the per-packet processing time at the core is smaller in brcm, overall, the mlx hardware transmits packets faster, as indicated by its higher RR throughput. The corresponding round-trip time of the different modes (which, as noted, is the inverse of the throughput in RR's case) is shown in Table 3.3. The results of Apache 1MB are qualitatively identical to those of Netperf stream, because the benchmark transmits a lot of data per request and is thus throughput sensitive. Conversely, Apache 1KB is not throughput sensitive. Its smaller 1KB requests make the performance of mlx and brcm look remarkably similar despite their networking infrastructure difference. In both cases, the bottleneck is the CPU, while the volume of the transmitted data is only a small fraction of the NICs capacity. (Both deliver ~12K requests per second of 1KB files, yielding a transfer rate of ~0.1 Gbps.) This is because Apache requires heavy processing for each http request. This overhead is amortized over hundreds of packets in the case of Apache 1MB, but over only one packet in the case of 1KB. Consequently, the computational processing dominates the throughput of Apache 1KB, and so the role of the networking infrastructure is marginalized. Even so, rIOMMU demonstrates a ~1.24x and ~2.32x throughput improvement over brcm/strict and mlx/strict, respectively. It is up to 1.18x higher relative to the other IOMMU- enabled modes. And ~0.9x lower relative to the unprotected optimum.2 The network activity of Apache 1KB is somewhat similar to that of the Memcached benchmark, because both are configured with 32 concurrent requests, both receive queries comprised of a few dozens of bytes (file name or key item), and both transmit 1KB responses (file content or data item). The difference is that the Memcached internal

2We note in passing that our Apache 1KB throughput results coincide with those of Soares et al. [58], who reported a latency of 22ms for 256 concurrent requests, which translates to 1000/22 × 256 ≈ 12K requests/second.


logic is simpler, as its purpose is merely to serve as an in-memory LRU cache. For this reason, it achieves an order of magnitude higher throughput relative to Apache 1KB.3 The shorter per-request processing time makes the differences between the IOMMU configurations more pronounced, with rIOMMU throughput that is 1.17–4.88x higher than the completely safe modes, 1.03–1.28x higher than the deferred modes, and 0.78–0.84x lower than the optimum.

3.5.3 When IOTLB Miss Penalty Matters

Our experiments thus far indicated that using the IOMMU affects performance because it forces the OS to spend CPU cycles on creating and destroying IOVA mappings. We were unable to measure the overhead caused by the actual IOMMU translation activity of walking the page tables upon an IOTLB miss (Figures 3.2 and 3.5). In §3.5.1, we attributed this inability to the substantially longer latencies induced by interrupt processing and the TCP/IP stack. In Table 3.3, we specified the round-trip latencies, whose magnitude (13–42 µs) seems to suggest that the occasional cost of 4 memory references per table walk is negligible in comparison. There are, however, high performance environments that enjoy lower latencies in the order of a µs [22, 29, 55, 61], which is required, e.g., "where a fraction of a microsecond can make a difference in the value of a transaction" [1]. User-level I/O, for example, might permit applications to (1) utilize raw Ethernet packets to eliminate TCP/IP overheads, and to (2) poll the I/O device to eliminate interrupt delays. With the help of the ibverbs library [38, 47], we established such a configuration on top of the mlx setup. We ran two experiments. The first iteratively and randomly selects a buffer from a large pool of previously mapped buffers and transmits it, thus ensuring that the probability for the corresponding IOVA to reside in the IOTLB is low. The second experiment does the same but with only one buffer, thus ensuring that the IOTLB always hits. The latency difference—which is the cost of an IOTLB miss—was ~0.3 µs (1013 cycles on average); we believe it is reasonable to assume that it approximates the benefit of using rIOMMU over the existing IOMMU in high performance environments of this type.
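Schematically, the two experiments differ only in the size of the buffer pool they draw from. The harness below shows that structure; post_and_wait() is a hypothetical stand-in for the raw send-and-poll path (queue-pair setup omitted), so the sketch conveys the measurement method rather than our exact code.

    /* Skeleton of the IOTLB-miss measurement: large pool vs. single buffer. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    extern void post_and_wait(void *buf);   /* hypothetical: transmit buf and poll for completion */

    /* Average cycles per send when cycling through 'nbufs' pre-mapped buffers.
     * nbufs large -> an IOTLB miss on (almost) every send;
     * nbufs == 1  -> the single translation stays cached, so the IOTLB always hits. */
    static double avg_cycles(void **bufs, int nbufs, int iters)
    {
        uint64_t start = __rdtsc();
        for (int i = 0; i < iters; i++)
            post_and_wait(bufs[rand() % nbufs]);
        return (double)(__rdtsc() - start) / iters;
    }

    /* The miss penalty is then approximately
     *   avg_cycles(pool, POOL_SIZE, N) - avg_cycles(pool, 1, N),
     * which came out to roughly 1,000 cycles (~0.3 us) on our setup. */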

3.5.4 Comparing to TLB Prefetchers

rIOMMU is not a prefetcher. Rather, it is a new IOMMU design that allows for efficient IOVA (un)mappings while minimizing costly IOTLB invalidations. (Unrelated to prefetching.) But rIOMMU does have a prefetching component, since it loads to the rIOTLB the next IOVA to be used ahead of time. While this component turned out to be useful only in specialized setups (§3.5.3), it is still interesting to compare this aspect of our work to previously proposed TLB prefetchers.

3Our Memcached results are comparable to those of Gordon et al. [30].


For lack of space, we only briefly describe the bottom line. We modified the IOMMU layer of KVM/QEMU to log the DMAs that its emulated I/O devices perform. We ran our benchmarks in a VM and generated DMA traces. We fed the traces to three simulated TLB prefetchers: Markov [43], Recency [56], and Distance [46], as surveyed by Kandiraju and Sivasubramaniam [45]. We found their baseline versions to be ineffective, as IOVAs are invalidated immediately after being used. We modified them and allowed them to store invalidated addresses, but mandated them to walk the page table and check that their predictions are mapped before making them. Distance was still ineffective. Recency and Markov, however, were able to predict most accesses, but only if the number of entries comprising their history data structure grew larger than the ring. In contrast, rIOTLB requires only two entries per ring and its "predictions" are always correct.

3.6 Related Work

The overhead of IOMMU mapping and unmapping is a well-known issue. Ben-Yehuda et al. [15] showed that using the Calgary IOMMU can impose a 30% increase in CPU utilization. In their work they proposed methods that can reduce the IOMMU mapping layer overhead, yet these would require significant changes to existing device drivers. Several studies focused on the overhead of virtual IOMMUs, yet most of their approaches are applicable to native systems as well. Willmann et al. [64] showed that sharing mappings among DMA descriptors can reduce the overhead without sacrificing security, yet, as the authors admitted, the extent of the performance improvement is workload-dependent and sometimes negligible. Other techniques—Willmann's persistent mappings and validation of DMA buffers; Yassour et al.'s [66] mapping prefetching and invalidation batching; and Amit et al.'s [4] asynchronous invalidations—all improve performance at the cost of relaxed IOMMU protection.

Other research works addressed inefficiencies of the I/O virtual address space allocator. Tomonori [60] proposed to enhance the allocator performance by managing the I/O virtual address space using bitmaps instead of red-black trees. Cascardo [20] showed that IOVA allocators suffer from lock contention, and that mitigating this contention can significantly improve the performance of multicore systems. These studies do not address the overhead associated with IOTLB invalidations, and are therefore orthogonal to our work.



Chapter 4

Efficient IOMMU Intra-Operating System Protection

4.1 Introduction

The role that the I/O memory management unit (IOMMU) plays for I/O devices is similar to the role that the regular memory management unit (MMU) plays for processes. Processes typically access the memory using virtual addresses, which the MMU translates to physical addresses. Likewise, I/O devices commonly access the memory via direct memory access operations (DMAs) associated with I/O virtual addresses (IOVAs), which the IOMMU translates to physical addresses. Both hardware units are implemented similarly, with a page table hierarchy that the operating system (OS) maintains and the hardware walks upon an (IO)TLB miss. The IOMMU can provide inter- and intra-OS protection [4, 62, 64, 66]. Inter-OS protection is applicable in virtualized setups. It allows for “direct I/O”, where the host assigns a device directly to a guest virtual machine (VM) for its exclusive use, largely removing itself from the guest’s I/O path and thus improving its performance [30, 50]. In this mode, the VM directly programs device DMAs using its notion of (guest) “physical” addresses. The host uses the IOMMU to redirect these accesses to where the VM memory truly resides, thus protecting its own memory and the memory of the other VMs. With inter-OS protection, IOVAs are mapped to physical memory locations infrequently, typically only upon such events as VM creation and migration. Such mappings are therefore denoted persistent or static [64]. Intra-OS protection allows the OS to defend against errant/malicious devices and buggy drivers, which account for most OS failures [21, 59]. Drivers and their I/O devices are able to perform DMAs to arbitrary memory locations, and IOMMUs allow OSes to protect themselves by restricting these DMAs to specific physical memory


locations. Intra-OS protection is applicable in non-virtual setups where the OS has direct control over the IOMMU. It is likewise applicable in virtual setups where IOMMU functionality is exposed to VMs via paravirtualization [15, 50, 57, 64], full emulation [4], or, recently, hardware support for nested IOMMU translation [3, 40]. In this mode, IOVA (un)mappings are frequent and occur within the I/O critical path. The OS programs DMAs using IOVAs rather than physical addresses, such that each DMA is preceded and followed by the mapping and unmapping of the associated IOVA to the physical address it represents [44, 52]. For this reason, such mappings are denoted single-use or dynamic [17]. The context of this chapter is intra-OS protection. We discuss it in more detail in §4.2. To do its job, the intra-OS protection mapping layer must allocate IOVA values: ranges of integer numbers that would serve as page identifiers. IOVA allocation is similar to regular memory allocation. But it is different enough to merit its own allocator. One key difference is that regular allocators dedicate much effort to preserving locality and to combating fragmentation, whereas the IOVA allocator disallows locality and enjoys a naturally “unfragmented” workload. This difference makes the IOVA allocator 1–2 orders of magnitude smaller in terms of lines of code. Another key difference is that, by default, the IOVA subsystem trades off some safety for performance. It systematically delays the completion of IOVA deallocations while letting the OS believe that the deallocations have already been processed. Specifically, part of freeing an IOVA is purging it from the IOTLB such that the associated physical buffer is no longer accessible to the I/O device. But invalidating IOTLB entries is a costly, slow operation. So the IOVA subsystem opts for batching the invalidations until enough accumulate. Then, it invalidates the entire IOTLB, en masse, thus reducing the amortized price. This default mode is called deferred protection. Users can turn it off at boot time by instructing the kernel to use strict protection. We discuss the IOVA allocator and its protection modes in detail in §4.3. Single-use mappings that stress the IOVA mapping layer are usually associated with I/O devices that employ ring buffers in order to communicate with their OS drivers in a producer-consumer manner. The ring buffer is a cyclic memory array whose entries correspond to DMA requests that the OS initiates and the I/O device must fulfill. The ring entries contain IOVAs that the mapping layer allocates and frees before and after the associated DMAs are processed by the device. We carefully analyze the performance of the IOVA mapping layer and find that its allocation scheme is efficient despite its simplicity, but only if the device is associated with a single ring. Devices, however, often employ more rings, in which case our analysis indicates that the IOVA allocator seriously degrades the performance. We carefully study this deficiency and find that its root cause is a pathology we call long-lasting ring interference. The pathology occurs when I/O asynchrony prompts an event that confuses the allocator into migrating an IOVA from one ring to another, henceforth repetitively destroying the contiguity of the ring’s I/O space upon which the allocator relies for efficiency. We conjecture that this


harmful effect remained hidden thus far because of the well-known slowness associated with manipulating the IOMMU. The hardware took most of the blame for the high price of intra-OS protection even though software is equally guilty, as it turns out. We demonstrate and analyze long-lasting ring interference in §4.4. To resolve the problem, we introduce the EiovaR optimization (Efficient IOVA allocatoR) to the kernel’s mapping subsystem. In designing EiovaR, we exploit the following two observations: (1) the workload handled by the IOVA mapping layer is largely comprised of allocation requests for same-size ranges, and (2) since the workload is ring-induced, the difference D between the cumulative number of allocation and deallocation requests at any given time is proportional to the size of the ring, which is relatively small. EiovaR is thus a simple, thin layer on top of the baseline IOVA allocator that proxies all the (de)allocation requests. It caches previously freed ranges and reuses them to quickly satisfy subsequent allocations. It is successful because the requests are similar, it is frugal in terms of memory consumption because D is small, and it is compact (implementation-wise) because it is mostly an array of free-lists with a bit of logic. EiovaR entirely eliminates the baseline allocator’s aforementioned reliance on I/O space contiguity, ensuring all (de)allocations are efficient. We describe EiovaR and experimentally explore its interaction with strict and deferred protection in §4.5. We evaluate the performance of EiovaR using micro- and macrobenchmarks and different I/O devices. On average, EiovaR satisfies (de)allocations in about 100 cycles, and it improves the throughput of Netperf, Apache, and Memcached benchmarks by up to 4.58x and 1.71x for strict and deferred protection, respectively. In configurations that achieve the maximal throughput of the I/O device, EiovaR reduces the CPU consumption by up to 0.53x. Importantly, EiovaR delivers strict protection with performance similar to that of the baseline system when employing deferred protection. We conduct the experimental evaluation of EiovaR in §4.6.

4.2 Intra-OS Protection

DMA refers to the ability of I/O devices to read from or write to the main memory without CPU involvement. It is a heavily used mechanism, as it frees the CPU to continue to do work between the time it programs the DMA and the time the associated data is sent or received. As noted, drivers of devices that stress the IOVA mapping layer initiate DMA operations via a ring buffer, which is a circular array in main memory that constitutes a shared data structure between the driver and its device. Each entry in the ring contains a DMA descriptor, specifying the address(es) and size(s) of the corresponding target buffer(s); the I/O device will write/read the data to/from the latter, at which point it will trigger an interrupt to let the OS know that the DMA has completed. (Interrupts are coalesced if their rate is high.) I/O devices are commonly associated with more than one ring, e.g., a receive ring denoted Rx for DMA read operations, and a transmit ring denoted Tx for DMA write operations.
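As a rough illustration, a receive ring and its descriptors might be declared along the following lines in C; the structures and field names are hypothetical and do not correspond to any particular NIC.

    #include <stdint.h>

    /* Hypothetical DMA descriptor: one entry of a receive (Rx) ring.
     * The driver fills in the IOVA and size of a target buffer; the NIC
     * DMA-writes a received packet there and marks the entry completed. */
    struct rx_desc {
        uint64_t buf_iova;   /* I/O virtual address of the target buffer   */
        uint16_t buf_len;    /* size of the target buffer (e.g., 1500 B)   */
        uint16_t pkt_len;    /* written by the NIC: actual packet length   */
        uint32_t status;     /* written by the NIC: completion indication  */
    };

    /* Hypothetical ring: a cyclic array shared by the driver and the device,
     * accessed in a producer-consumer fashion. */
    struct rx_ring {
        struct rx_desc *desc;  /* n descriptors in DMA-able memory             */
        void          **buf;   /* the driver's own pointer to each buffer      */
        uint32_t        n;     /* ring size (a few hundreds or thousands)      */
        uint32_t        head;  /* next descriptor the device will fill         */
        uint32_t        tail;  /* next descriptor the driver will re-arm       */
    };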


Figure 4.1: IOVA translation using the Intel IOMMU. The figure shows the 16-bit requester identifier (bus, device, function) indexing the root table and then the context table, and the IOVA (index fields plus a 12-bit page offset) being translated by a 3-4 level page table hierarchy into a physical frame number (PFN) and offset.

In the past, I/O devices had to use physical addresses in order to access the main memory, namely, each DMA descriptor contained a physical address of its target buffer. Such unmediated DMA activity directed at the memory makes the system vulnerable to rogue devices performing errant or malicious DMAs [14, 19, 44, 65], or to buggy drivers that might wrongfully program their devices to overwrite any part of the system memory [13, 31, 50, 59, 63]. Subsequently, all major chip vendors introduced IOMMUs [3, 11, 36, 40], alleviating the problem as follows. The OS associates each DMA target buffer with some IOVA, which it uses instead of the buffer’s physical address when filling out the associated ring descriptor. The I/O device is oblivious to this change; it processes the DMA using the IOVA as if it were a physical address. The IOMMU circuitry then translates the IOVA to the physical address of the target buffer, routing the operation to the appropriate memory location. Figure 4.1 illustrates the translation process as performed by the Intel IOMMU, which we use in this chapter. The PCI protocol dictates that each DMA operation is associated with a 16-bit requester identifier comprised of a bus-device-function triplet, which is uniquely associated with the corresponding I/O device. The IOMMU uses the 8-bit bus number to index the root table and thus retrieve the physical address of the context table. It then indexes the context table using the 8-bit concatenation of the device and function numbers, yielding the physical location of the root of the page table hierarchy that houses the device’s IOVA translations. Similarly to the MMU, the IOMMU accelerates translations using an IOTLB. The functionality of the IOMMU hierarchy is similar to that of the regular MMU: it will permit an IOVA memory access to go through only if the OS previously inserted


a matching mapping between the IOVA and some physical page. The OS can thus protect itself by allowing a device to access a target buffer just before the corresponding DMA occurs (inserting a mapping), and by revoking access just after (removing the mapping), exerting fine-grained control over what portions of memory may be used in I/O transactions at any given time. This state-of-the-art strategy of IOMMU-based protection was termed intra-OS protection by Willmann et al. [64]. It is recommended by hardware vendors [32, 44], and it is used by operating systems [10, 17, 37, 51]. For example, the DMA API of Linux—which we use in this study—notes that “DMA addresses should be mapped only for the time they are actually used and unmapped after the DMA transfer” [52].
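This map-before/unmap-after discipline translates into a simplified use of the Linux streaming DMA API along the following lines in a driver's transmit path. The sketch below is illustrative rather than taken from a real driver: dma_map_single, dma_mapping_error, and dma_unmap_single are the standard kernel calls, while the surrounding ring structures (struct my_tx_ring and friends) are hypothetical.

    /* Sketch of intra-OS protection in a driver's Tx path
     * (Linux streaming DMA API; ring management and errors largely elided). */
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    struct my_tx_desc { dma_addr_t buf_iova; u32 len; };   /* hypothetical */
    struct my_tx_ring { struct my_tx_desc *desc; u32 tail; };

    /* Map just before programming the DMA ... */
    static int xmit_frame(struct device *dev, struct my_tx_ring *ring,
                          void *data, size_t len)
    {
        dma_addr_t iova = dma_map_single(dev, data, len, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, iova))
            return -ENOMEM;

        ring->desc[ring->tail].buf_iova = iova;  /* the device only sees the IOVA */
        ring->desc[ring->tail].len = len;
        /* ... advance tail and ring the doorbell ... */
        return 0;
    }

    /* ... and unmap right after the device reports completion, revoking the
     * device's access to the buffer (under strict protection this also purges
     * the corresponding IOTLB entry). */
    static void tx_complete(struct device *dev, struct my_tx_ring *ring, u32 idx)
    {
        dma_unmap_single(dev, ring->desc[idx].buf_iova,
                         ring->desc[idx].len, DMA_TO_DEVICE);
        /* ... free or recycle the buffer ... */
    }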

4.3 IOVA Allocation and Mapping

The task of generating IOVAs—namely, the actual integer numbers that the OS assigns to descriptors and the devices then use—is similar to regular memory allocation. But it is sufficiently different to merit its own allocator, because it optimizes for different objectives, and because it is required to make different tradeoffs, as follows:

Locality Memory allocators spend much effort in trying to (re)allocate memory chunks in a way that maximizes reuse of TLB entries and cached content. The IOVA mapping layer of the OS does the opposite. The numbers it allocates correspond to whole pages, and they are not allowed to stay warm in hardware caches in between allocations. Rather, they must be purged from the IOTLB and from the page table hierarchy immediately after the DMA completes. Moreover, while purging an IOVA, the mapping layer must flush each cache line that it modifies in the hierarchy, as the IOMMU and CPU do not reside in the same coherence domain.1

Fragmentation Memory allocators invest much effort in combating fragmentation, attempting to eliminate unused memory “holes” and to utilize the memory they already have before requesting more from the system. As we further discuss in §4.4–§4.5, it is trivial for the IOVA mapping layer to avoid fragmentation due to the simple workload that it services, which is induced by the circular ring buffers, and which is overwhelmingly comprised of 2^j-page requests.

Complexity Simplicity and compactness matter and are valued within the kernel. Not having to worry about locality and fragmentation while enjoying a simple workload, the mapping layer allocation scheme is significantly simpler than regular memory allocators. In Linux, it is comprised of only a few hundred lines of code instead of thousands [48, 49] or tens of thousands [16, 33].

1The Intel IOMMU specification documents a capability bit that indicates whether IOMMU-CPU coherence can be turned on [40], but we do not own such hardware and believe it is not yet common.


Safety & Performance Assume that a thread T0 frees a memory chunk M, and then another thread T1 allocates memory. A memory allocator may give M to T1, but only after it processes the free of T0. Namely, it would never allow T0 and T1 to use M together. Conversely, the IOVA mapping layer purposely allows T0 (the device) and T1 (the OS) to access M simultaneously for a short period of time. The reason: invalidation of IOTLB entries is costly [4, 64]. Therefore, by default, the mapping layer trades off safety for performance by (1) accumulating up to W unprocessed ’free’ operations and only then (2) freeing those W IOVAs and (3) invalidating the entire IOTLB en masse. Consequently, target buffers are actively being used by the OS while the device might still access them through stale IOTLB entries. This weakened safety mode is called deferred protection. Users can instead employ strict protection—which processes invalidations immediately—by setting a kernel command line parameter.
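A minimal sketch of this batching idea is given below; it is illustrative only and not the actual Linux implementation. The helpers iotlb_flush_all and iova_do_free are assumed stand-ins for the hardware invalidation and for the allocator's real free path.

    /* Sketch of deferred protection: park up to W 'free' operations, then
     * perform one global IOTLB invalidation and complete all of them at once. */
    #define W 250                      /* high-water mark (kernel parameter) */

    struct deferred_queue {
        unsigned long iova[W];
        unsigned long pages[W];
        int           count;
    };

    void iotlb_flush_all(void);                                   /* assumed */
    void iova_do_free(unsigned long iova, unsigned long pages);   /* assumed */

    void free_iova_deferred(struct deferred_queue *q,
                            unsigned long iova, unsigned long pages)
    {
        /* The caller believes the buffer is already protected, but the device
         * may still reach it through a stale IOTLB entry until the flush. */
        q->iova[q->count]  = iova;
        q->pages[q->count] = pages;
        if (++q->count == W) {
            iotlb_flush_all();               /* one costly global invalidation */
            for (int i = 0; i < W; i++)      /* now it is safe to really free  */
                iova_do_free(q->iova[i], q->pages[i]);
            q->count = 0;
        }
    }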

Technicalities Memory allocators typically use the memory that their clients free to store their internal data structures. (For example, a linked list of freed pages where each “next” pointer is stored at the beginning of the corresponding page.) The IOVA mapping layer cannot do that, because the IOVAs that it invents are pointers to memory that is used by some other entity (the device or the OS). An IOVA is just an additional identifier for a page, which the mapping layer does not own. Another difference between the two types of allocators is that memory allocators running on 64-bit machines use native 64-bit pointers. The IOVA mapping layer prefers to use 32-bit IOVAs, as utilizing 64-bit addresses for DMA would force a slower, dual address cycle on the PCI bus [17].

In accordance with the above, the allocation scheme employed by the Linux/x86 IOVA mapping layer is different from, and independent of, the regular kernel memory allocation subsystem. The underlying data structure of the IOVA allocator is the generic Linux kernel red-black tree. The elements of the tree are ranges. A range is a pair of integer numbers [L, H] that represents a sequence of currently allocated I/O virtual page numbers L, L + 1, ..., H − 1, H, where L ≤ H stand for “low” and “high”, respectively. The ranges in the tree are pairwise disjoint, namely, given two ranges

[L1, H1] ≠ [L2, H2], then either H1 < L2 or H2 < L1. Newly requested IOVA integers are allocated by scanning the tree right-to-left from the highest possible value downwards towards zero in search of a gap that can accommodate the requested range size. The allocation scheme attempts and—as we will later see—ordinarily succeeds to allocate the new range from within the highest gap available in the tree. The allocator begins to scan the tree from a cache node C that it maintains, iterating from C through the ranges in a descending manner until a suitable gap is found. C is maintained such that it usually points to a range that is higher than (to the right of) the highest free gap, as follows. When (1) a range R is freed and C currently points to a range lower than R, then C is updated to point to


struct range_t { int lo, hi; };

range_t alloc_iova(rbtree_t t, int rngsiz)
{
    range_t new_range;
    rbnode_t right = t.cache;
    rbnode_t left  = rb_prev( right );
    while ( right.range.lo - left.range.hi <= rngsiz ) {
        right = left;
        left  = rb_prev( left );
    }
    new_range.hi = right.range.lo - 1;
    new_range.lo = right.range.lo - rngsiz;
    t.cache = rb_insert( t, new_range );
    return new_range;
}

void free_iova(rbtree_t t, rbnode_t d)
{
    if ( d.range.lo >= t.cache.range.lo )
        t.cache = rb_next( d );
    rb_erase( t, d );
}

Figure 4.2: Pseudo code of the baseline IOVA allocation scheme. The functions rb_next and rb_prev return the successor and predecessor of the node they receive, respectively.

R’s successor. And (2) when a new range Q is allocated, then C is updated to point to Q; if Q was the highest free gap prior to its allocation, then C still points higher than the highest free gap after this allocation.

4.4 Long-Lasting Ring Interference

Figure 4.2 lists the pseudo code of the IOVA allocation scheme that was just described. Clearly, the algorithm’s worst-case complexity is linear, due to the ’while’ loop that scans previously allocated ranges beginning at the cache node C. But when factoring in the actual workload that this algorithm services, the situation is not so bleak: the complexity turns out to actually be constant rather than linear, at least conceptually. Recall that the workload is commonly induced by a circular ring buffer, whereby IOVAs of DMA target buffers are allocated and freed in a repeated, cyclic manner. Consider, for example, an Ethernet NIC with an Rx ring of size n, ready to receive packets. Assume the NIC initially allocates n target buffers, each big enough to hold one packet (1500 bytes). The NIC then maps the buffers to n newly allocated, consecutive IOVAs with which it populates the ring descriptors. Assume that the IOVAs are n, n − 1, ..., 2, 1. (The series is descending as IOVAs are allocated from highest to lowest.) The first mapped IOVA is n, so the NIC stores the first received packet in the memory pointed to by n, and it triggers an interrupt to let the OS know that it needs to handle the packet. Upon handling the interrupt, the OS first unmaps the corresponding IOVA, purging it from the IOTLB and IOMMU page table to prevent the device from accessing the


associated target buffer (assuming strict protection). The unmap frees IOVA=n, thus updating C to point to n’s successor in the red-black tree (free_iova in Figure 4.2). The OS then immediately re-arms the ring descriptor for future packets, allocating a new target buffer and associating it with a newly allocated IOVA. The latter will be n, and it will be allocated in constant time, as C points to n’s immediate successor (alloc_iova in Figure 4.2). The same scenario will cyclically repeat itself for n − 1, n − 2, ..., 1 and then again for n, ..., 1 and so on, as long as the NIC is operational. Our soon to be described experiments across multiple devices and workloads indicate that the above description is fairly accurate. IOVA allocation requests are overwhelmingly for one-page ranges (H = L), and the freed IOVAs are indeed re-allocated shortly after being freed, enabling, in principle, the allocator in Figure 4.2 to operate in constant time as described. But the algorithm succeeds in operating in this ideal manner only for some bounded time. We find that, inevitably, an event occurs and ruins this ideality thereafter. The above description assumes there exists only one ring in the I/O virtual address space. In reality, however, there are often two or more, for example, the Rx and Tx receive and transmit rings. Nonetheless, even when servicing multiple rings, the IOVA allocator provides constant-time allocation in many cases, so long as each ring’s free_iova is immediately followed by a matching alloc_iova for the same ring (the common case). Allocating for one ring and then another indeed causes linear IOVA searches due to how the cache node C is maintained. But large bursts of I/O activity flowing in one direction still enjoy constant allocation time. The aforementioned event that forever eliminates the allocator’s ability to accommodate large I/O bursts with constant time occurs when a free-allocate pair of one ring is interleaved with that of another. Then, an IOVA from one ring is mapped to another, ruining the contiguity of the ring’s I/O virtual address space. Henceforth, every cycle of n allocations would involve one linear search, prompted whenever the noncontiguous IOVA is freed and reallocated. We call this pathology long-lasting ring interference and note that its harmful effect increases as additional inter-ring free-allocate interleavings occur. Table 4.1 illustrates the pathology. Assume that a server mostly receives data and occasionally transmits. Suppose that Rx activity triggers an Rx.free_iova(L) of address L (1). Typically, this action would be followed by Rx.alloc_iova, which would then return L (2). But sometimes a Tx operation sneaks in between. If this Tx operation is Tx.free_iova(H) such that H > L (3), then the allocator would update the cache node C to point to H’s successor (4). The next Rx.alloc_iova would be satisfied by H (5), but then the subsequent Rx.alloc_iova would have to iterate through the tree from H (6) to L (7), inducing a linear overhead. Notably, once H is mapped for Rx, the pathology is repeated every time H is (de)allocated. This repetitiveness is experimentally demonstrated in Figure 4.3, showing the per-allocation number of rb_prev invocations. The calls are invoked in the loop in alloc_iova while searching for a free IOVA. We show below that the implications of long-lasting ring interference can be dreadful


operation           |         without Tx          |           with Tx
                    | return   C before   C after | return   C before   C after
Rx.free(L=151) (1)  |             152        152  |             152        152
Tx.free(H=300) (3)  |                             |             152    (4) 301
Rx.alloc            | (2) 151     152        151  | (5) 300     301        300
Rx.free(150)        |             151        151  |             300    (6) 300
Rx.alloc            |     150     151        150  | (7) 151     300        151

Table 4.1: Illustrating why Rx-Tx interferences cause linearity, following the baseline allocation algorithm detailed in Figure 4.2. (Assume that all addresses are initially allocated.)

Figure 4.3: The length of each alloc_iova search loop (y-axis: number of rb_prev calls) in a 40K (sub)sequence of alloc_iova calls (x-axis: allocation serial number) performed by one Netperf run. One Rx-Tx interference leads to regular linearity.

in terms of performance. How, then, is it possible that such a deficiency is overlooked? We contend that the reason is twofold. The first is that commodity I/O devices were slow enough in the past such that IOVA allocation linearity did not matter. The second reason is the fact that using the IOMMU hardware is slow and incurs a high price, motivating the deferred protection safety/performance tradeoff. Being that slow, the hardware served as a scapegoat, wrongfully held accountable for most of the overhead penalty and masking the fact that software is equally to blame.

4.5 The EiovaR Optimization

Suffering from frequent linear allocations, the baseline IOVA allocator is ill-suited for high-throughput I/O devices that are capable of performing millions of I/O transactions per second. It is too slow. One could proclaim that this is just another case of a special-purpose allocator proved inferior to a general-purpose allocator and argue that the


latter should be favored over the former despite the notable differences between the two as listed in §4.3. We, however, contend that the simple and repetitive, inherently ring-induced nature of the workload justifies a special-purpose allocator. We further contend that we are able to modify the existing allocator to consistently support extremely fast (de)allocations while introducing only a minimal change to the existing allocator. We propose the EiovaR optimization (Efficient IOVA allocatoR), which rests on the following observation. I/O devices that stress the intra-OS protection mapping layer are not like processes, in that the size of their virtual address spaces is relatively small, inherently bounded by the size of their rings. A typical ring size n is a few hundred or a few thousand entries. The number of per-device virtual page addresses that the IOVA allocator must simultaneously support is proportional to the ring size, which means it is likewise bounded and relatively small. Moreover, unlike regular memory allocators, the IOVA mapping layer does not allocate real memory pages. Rather, it allocates integer identifiers for those pages. Thus, it is reasonable to keep O(n) of these identifiers alive under the hood for quick (de)allocation, without really (de)allocating them (in the traditional, malloc sense of (de)allocation). In numerous experiments with multiple devices and workloads, the maximal number of per-device different IOVAs we have seen is 12K. More relevant is that, across all experiments, the maximal number of previously-allocated-but-now-free IOVAs has never exceeded 668 and was 155 on average. EiovaR leverages this workload characteristic to cache freed IOVAs so as to satisfy future allocations quickly. It further leverages the fact that, as noted earlier, IOVA allocation requests are overwhelmingly for one-page ranges (H = L), and that, in the handful of cases where H ≠ L, the size of the requested range has always been a power of two (H − L + 1 = 2^j). We have never witnessed a non-power-of-two allocation. EiovaR is a thin layer that masks the red-black tree, resorting to it only when EiovaR cannot fulfill an IOVA allocation on its own using previously freed elements. When configured to have enough capacity, all tree allocations that EiovaR is unable to mask are assured to occur in constant time. EiovaR’s main data structure is called “the freelist”. It has two components. The first is an array, farr, which has M entries, such that 2^(M+12) is the upper bound on the size of the consecutive memory areas that EiovaR supports (M = 28 is enough, implying a terabyte of memory). Entries in farr are linked lists of IOVA ranges. They are empty upon initialization. When an IOVA range [L, H] whose size is a power of two is freed, instead of actually freeing it, EiovaR adds it to the linked list of the corresponding exponent. That is, if H − L + 1 = 2^j, then EiovaR adds the range to its j-th linked list, farr[j]. Ranges comprised of one page (H = L) end up in farr[0]. For completeness, when the size of a freed range is not a power of two, EiovaR stores it in its second freelist component, frbt, which is a red-black tree. Unlike the baseline red-black tree, which sorts [L, H] ranges by their L and H values, frbt sorts ranges by their size (H − L + 1), allowing EiovaR to locate a range of a desirable size in logarithmic time.


Figure 4.4: Netperf TCP stream iteratively executed under strict protection, comparing the baseline against the EiovaR1, EiovaR8, EiovaR64, EiovaR512, and EiovaR∞ variants. Panels: (a) throughput [Gbit/sec]; (b) freelist allocation [percent]; (c) average freelist size [nodes]; (d) average search length [rb_prev ops]. The x axis shows the iteration number.

EiovaR allocation performs the reverse operation of freeing. If the newly allocated range has a power-of-two size (i.e., 2^j), then EiovaR tries to satisfy the allocation using farr[j]. Otherwise, it tries to satisfy the allocation using frbt. EiovaR resorts to the baseline red-black tree only if a suitable range is not found in the freelist. When configured with enough capacity, our experiments indicate that, after a while, all (de)allocations are satisfied by farr in constant time, taking 50–150 cycles per (de)allocation on average, depending on the configuration. When no limit is imposed on the freelist capacity, the allocations that EiovaR satisfies by resorting to the baseline tree are likewise done in constant time, since the tree never observes deallocations, which means its cache node C always points to its smallest, leftmost node (Figure 4.2). To bound the size of the freelist, EiovaR has a configurable parameter k that serves as its maximal capacity. We use the EiovaRk notation to express this limit, with k = ∞ indicating a limitless freelist.
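The following C sketch conveys the essence of the freelist layer described above. It is a simplification for illustration, not the actual kernel patch; rbtree_size_lookup/insert and baseline_alloc_iova/baseline_free_iova are assumed helpers standing in for frbt and for the baseline allocator of Figure 4.2, respectively.

    #include <stddef.h>

    #define M 28                              /* supports up to 2^(M+12) bytes */

    struct iova_range { unsigned long lo, hi; struct iova_range *next; };

    struct eiovar {
        struct iova_range *farr[M + 1];       /* freelists indexed by log2(size) */
        void *frbt;                           /* size-sorted tree for other sizes */
        size_t cached, capacity;              /* current size and limit k         */
    };

    /* assumed helpers (the size-sorted tree and the baseline of Figure 4.2) */
    struct iova_range *rbtree_size_lookup(void *frbt, unsigned long pages);
    void rbtree_size_insert(void *frbt, struct iova_range *r);
    struct iova_range *baseline_alloc_iova(unsigned long pages);
    void baseline_free_iova(struct iova_range *r);

    static int size_log2(unsigned long pages) /* j if pages == 2^j, else -1 */
    {
        for (int j = 0; j <= M; j++)
            if (pages == (1UL << j))
                return j;
        return -1;
    }

    struct iova_range *eiovar_alloc(struct eiovar *e, unsigned long pages)
    {
        int j = size_log2(pages);
        if (j >= 0 && e->farr[j]) {           /* constant-time hit in farr */
            struct iova_range *r = e->farr[j];
            e->farr[j] = r->next;
            e->cached--;
            return r;
        }
        if (j < 0) {                          /* non-power-of-two: try frbt */
            struct iova_range *r = rbtree_size_lookup(e->frbt, pages);
            if (r) { e->cached--; return r; }
        }
        return baseline_alloc_iova(pages);    /* fall back to the red-black tree */
    }

    void eiovar_free(struct eiovar *e, struct iova_range *r)
    {
        unsigned long pages = r->hi - r->lo + 1;
        int j = size_log2(pages);
        if (e->cached >= e->capacity) {       /* freelist full (k reached) */
            baseline_free_iova(r);            /* really free it            */
            return;
        }
        if (j >= 0) { r->next = e->farr[j]; e->farr[j] = r; }
        else        rbtree_size_insert(e->frbt, r);
        e->cached++;
    }

Because both the farr hit path and the free path touch only a few pointers, (de)allocations that the freelist absorbs complete in a small, constant number of operations, which is consistent with the 50–150 cycle figures reported above.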

4.5.1 EiovaR with Strict Protection

To understand the behavior and effect of EiovaR, we begin by analyzing five EiovaRk variants as compared to the baseline under strict protection, where IOVAs are (de)allocated immediately before and after the associated DMAs. We use the standard Netperf stream benchmark, which maximizes throughput on one TCP connection. We initially restart the NIC interface for each allocation variant (thus clearing the IOVA structures), and then we execute the benchmark iteratively. The exact experimental setup is described in §4.6. The results are shown in Figure 4.4. Figure 4.4a shows that the throughput of all EiovaR variants is similar and is 20%–60% better than the baseline. The baseline gradually decreases, except in the last iteration. Figure 4.4b highlights why even EiovaR1 is sufficient to provide the observed benefit. It plots the rate of IOVA allocations that are satisfied by the freelist, showing that k = 1 is enough to satisfy nearly all allocations. This result indicates that each call to free_iova is followed by alloc_iova, such that the IOVA freed by the former is returned by the latter, coinciding with the ideal scenario outlined in §4.3.


Figure 4.5: Average cycles breakdown of map (alloc_iova vs. all the rest) with Netperf/strict, for the baseline (left) and EiovaR (right); the x axis shows the iteration number.

Figure 4.6: Average cycles breakdown of unmap (free_iova vs. all the rest) with Netperf/strict, for the baseline (left) and EiovaR (right); the x axis shows the iteration number.

Figure 4.4c supports this observation by depicting the average size of the freelist. The average for EiovaR1 is inevitably 0.5, as every allocation and deallocation contributes 1 and 0 to the average, respectively. Larger k values behave similarly, with an average of 2.5 because of two additional (de)allocations that are performed when Netperf starts running and that remain in the freelist thereafter. Figure 4.4d shows the average length of the ’while’ loop from Figure 4.2, which searches for the next free IOVA. It depicts a rough mirror image of Figure 4.4a, indicating that throughput is tightly negatively correlated with the traversal length. Figure 4.5 (left) shows the time it takes the baseline to map an IOVA, separating allocation from the other activities. Whereas the latter remains constant, the former exhibits a trend identical to Figure 4.4d. Conversely, the alloc_iova time of EiovaR (Figure 4.5, right) is negligible across the board. EiovaR is immune to long-lasting ring interference, as interfering transactions are absorbed by the freelist and reused in constant time.


Figure 4.7: Netperf TCP stream iteratively executed under deferred protection, comparing the baseline against the EiovaR variants. Panels: (a) throughput [Gbit/sec]; (b) freelist allocation [percent; log-scaled]; (c) average freelist size [nodes; log-scaled]; (d) average search length [rb_prev ops]. The x axis shows the iteration number.


4.5.2 EiovaR with Deferred Protection

Figure 4.6 is similar to Figure 4.5, but it pertains to the unmap operation rather than to map. It shows that the duration of free_iova remains stable across iterations with both EiovaR and the baseline. EiovaR deallocation is still faster, as it is performed in constant time whereas the baseline is logarithmic. But most of the overhead is not due to free_iova. Rather, it is due to the costly invalidation that purges the IOVA from the IOTLB to protect the corresponding target buffer. This is the aforementioned costly hardware overhead that motivated deferred protection, which amortizes the price by delaying invalidations until enough IOVAs are accumulated and then processing all of them together. As noted, deferring the invalidations trades off safety for performance, because the relevant memory is accessible by the device even though it is already used by the kernel for other purposes.

Figure 4.7 compares the baseline and the EiovaR variants under deferred protection. Interestingly, the resulting throughput divides the variants into two groups, with EiovaR512 and EiovaR∞ above 6Gbps and all the rest at around 4Gbps (Figure 4.7a). We again observe a strong negative correlation between the throughput and the length of the search to find the next free IOVA (Figure 4.7a vs. 4.7d). In contrast to the strict setup (Figure 4.4), here we see that EiovaR variants with smaller k values roughly perform as badly as the baseline. This finding is somewhat surprising, because, e.g., 25% of the allocations of EiovaR64 are satisfied by the freelist (Figure 4.7b), which should presumably improve its performance over the baseline. A finding that helps explain this result is that the average size of the EiovaR64 freelist is 32 (Figure 4.7c), even though it is allowed to hold up to k = 64 elements. Notice that EiovaR∞ holds around 128 elements on average, so we know there are enough deallocations to fully populate the EiovaR64 freelist. One might therefore expect that the latter would be fully utilized, but it is not.

The average size of the EiovaR64 freelist is 50% of its capacity due to the following reason. Deferred invalidations are aggregated until a high-water mark W (kernel


parameter) is reached, and then all the W addresses are deallocated in bulk.2 When k < W, the freelist fills up to hold k elements, which become k − 1 after the subsequent allocation, and then k − 2 after the next allocation, and so on until zero is reached, yielding an average size of (1/(k+1)) Σ_{j=0}^{k} j ≈ k/2, as our measurements show. Importantly, the EiovaRk freelist does not have enough capacity to absorb all the W consecutive deallocations when k < W. The remaining W − k deallocations would therefore be freed by the baseline free_iova. Likewise, out of the subsequent W allocations, only k would be satisfied by the freelist, and the remaining W − k would be serviced by the baseline alloc_iova. It follows that the baseline free_iova and alloc_iova are regularly invoked in an uncoordinated way despite the freelist. As described in §4.4, the interplay between these two routines in the baseline version makes them susceptible to long-lasting ring interference that inevitably induces repeated linear searches. In contrast, when k is big enough (≥ W), the freelist has sufficient capacity to absorb all W deallocations, which are then used to satisfy the subsequent W allocations, thus securing the conditions for preventing the harmful effect. Figure 4.8 experimentally demonstrates this threshold behavior, depicting the throughput as a function of the maximal allowed freelist size k. As k gets bigger, the performance slowly improves, because more—but not yet all—(de)allocations are served by the freelist. When k reaches W = 250, the freelist is finally big enough, and the throughput suddenly increases by 26%. Figure 4.9 provides further insight into this result. It shows the per-allocation length of the loop within alloc_iova that iterates through the red-black tree in search of the next free IOVA (similarly to Figure 4.3). The three sub-graphs correspond to three points from Figure 4.8 that are associated with the k values 64, 240, and 250. We see that the smaller k (left) yields more and longer searches than the bigger k (middle), and that the length of the search becomes zero when k = W (right).
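For completeness, the arithmetic behind the k/2 figure is spelled out below; the symbol S̄, denoting the average freelist size under this drain-and-refill pattern, is introduced only for this derivation.

    \bar{S} \;=\; \frac{1}{k+1}\sum_{j=0}^{k} j
            \;=\; \frac{1}{k+1}\cdot\frac{k(k+1)}{2}
            \;=\; \frac{k}{2}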

4.6 Evaluation

4.6.1 Methodology

Experimental Setup We implement EiovaR in the Linux kernel, and we experimentally evaluate its performance and contrast it against the baseline IOVA allocation. In an effort to attain more general results, we conducted the evaluation using two setups involving two different NICs with two corresponding device drivers that generate different workloads for the IOVA allocation layer. The Mellanox setup is comprised of two identical Dell PowerEdge R210 II Rack Server machines that communicate through Mellanox ConnectX3 40Gbps NICs. The NICs are connected back to back via a 40Gbps optical fiber and are configured to use Ethernet.

2They cannot be freed before they are purged from the IOTLB, or else they could be re-allocated, which would be a bug since their stale mappings might reside in the IOTLB and point to somewhere else.


Figure 4.8: Throughput [Gbit/sec] as a function of eiovar’s freelist size. Under deferred protection, EiovaRk eliminates costly linear searches when k exceeds the high-water mark W.

Figure 4.9: Length of the alloc_iova search loop (number of rb_prev calls per allocation) under the EiovaRk deferred protection regime for three k values (k=64, k=240, and k=W=250) when running Netperf TCP Stream. Bigger capacity implies that the searches become shorter on average. Big enough capacity (k ≥ W = 250) eliminates the searches altogether.

We use one machine as the server and the other as a workload-generating client. Each machine is furnished with 8GB of 1333MHz memory and a single-socket 4-core Intel Xeon E3-1220 CPU running at 3.10GHz. The chipset is Intel C202, which supports VT-d, Intel’s Virtualization Technology that provides IOMMU functionality. We configure the server to utilize one core only, and we turn off all power optimizations—


Figure 4.10: The performance of baseline vs. EiovaR allocation under strict and deferred protection regimes for the Mellanox (top) and Broadcom (bottom) setups. Panels: Netperf stream throughput [Gbps]; Netperf RR latency [µsec]; Apache 1MB and Apache 1KB requests/sec; Memcached transactions/sec. Except for Netperf RR, higher values indicate better performance.

sleep states (C-states) and dynamic voltage and frequency scaling (DVFS)—to avoid reporting artifacts caused by nondeterministic events. The two machines run Ubuntu 12.04 and utilize the Linux 3.4.64 kernel. The Broadcom setup is similar, likewise utilizing two Dell PowerEdge R210 machines. The differences are that the two machines communicate through Broadcom NetXtreme II BCM57810 10GbE NICs (connected via a CAT7 10GBASE-T cable); that they are equipped with 16GB of memory; and that they run the Linux 3.11.0 kernel. The device drivers of the Mellanox and Broadcom NICs differ in many respects. Notably, the Mellanox NIC utilizes more ring buffers and allocates more IOVAs (we observed around 12K addresses for Mellanox and 3K for Broadcom). For example, the Mellanox driver uses two buffers per packet—one for the header and one for the body—and hence two IOVAs, whereas the Broadcom driver allocates only one buffer and thus only one IOVA.

Benchmarks To drive our experiments we utilize the following benchmarks:

1. Netperf TCP stream [42] is a standard tool to measure networking performance in terms of throughput. It attempts to maximize the amount of data sent over one TCP connection, simulating an I/O-intensive workload. This is the application we used above when demonstrating long-lasting ring interference and how EiovaR solves it. Unless otherwise stated, Netperf TCP stream employs its default message size, which is 16KB.

2. Netperf UDP RR (request-response) is the second canonical configuration of Netperf. It models a latency sensitive workload by repeatedly sending a single byte and waiting for a matching single byte response. The latency is then calculated as the inverse of the observed number of transactions per second.


3. Apache [26, 27] is a popular HTTP web server. We drive it using ApacheBench [8] (also called “ab”), which is a workload generator that is distributed with Apache. The goal of ApacheBench is to assess the number of concurrent requests per second that the server is capable of handling by requesting a static page of a given size from within several concurrent threads. We run ApacheBench on the client machine, configured to generate 100 concurrent requests. We use two instances of the benchmark, respectively requesting a smaller (1KB) and a bigger (1MB) file. Logging is disabled to avoid the overhead of writing to disk.

4. Memcached [28] is a high-performance in-memory key-value storage server. It is used, for example, by websites for caching results of slow database queries, thus improving the sites’ overall performance and scalability. We used the Memslap benchmark [2] (part of the libmemcached client library), which runs on the client machine and measures the completion rate of the requests that it generates. By default, Memslap generates a random workload comprised of 90% get and 10% set operations. Unless otherwise stated, Memslap is set to use 16 concurrent requests.

Before running each benchmark, we shut down and bring up the interface of the NIC using the ifconfig utility, such that the IOVA allocation is redone from scratch using a clean tree, clearing the impact of previous harmful long-lasting ring interferences. We then iteratively run the benchmark 150 times, such that individual runs are configured to take about 20 seconds. We report the average of the corresponding results.

4.6.2 Results

The resulting average performance is shown in Figure 4.10 for the Mellanox (top) and Broadcom (bottom) setups. The corresponding normalized performance, which specifies relative improvement, is shown in the first part of Tables 4.2 and 4.3. In Figure 4.10, higher numbers indicate better throughput in all cases but Netperf RR, which depicts latency (the inverse of the throughput). No such exception is required in (the first part of) Tables 4.2 and 4.3, which, for consistency, display the normalized throughput for all applications, including Netperf RR.


metric        protect  benchmark        baseline   EiovaR     diff
throughput    strict   Netperf stream      1.00      2.37    +137%
(normalized)           Netperf RR          1.00      1.27     +27%
                       Apache 1MB          1.00      3.65    +265%
                       Apache 1KB          1.00      2.35    +135%
                       Memcached           1.00      4.58    +358%
              defer    Netperf stream      1.00      1.71     +71%
                       Netperf RR          1.00      1.07      +7%
                       Apache 1MB          1.00      1.21     +21%
                       Apache 1KB          1.00      1.11     +11%
                       Memcached           1.00      1.25     +25%
alloc         strict   Netperf stream      7656        88     -99%
(cycles)               Netperf RR         10269       175     -98%
                       Apache 1MB         17776       128     -99%
                       Apache 1KB         49981       204    -100%
                       Memcached          50606       151    -100%
              defer    Netperf stream      2202       103     -95%
                       Netperf RR          2360       183     -92%
                       Apache 1MB          2085       130     -94%
                       Apache 1KB          2642       206     -92%
                       Memcached           3040       171     -94%
search        strict   Netperf stream       153         0    -100%
(length)               Netperf RR           206         0    -100%
                       Apache 1MB           381         0    -100%
                       Apache 1KB          1078         0    -100%
                       Memcached            893         0    -100%
              defer    Netperf stream        32         0    -100%
                       Netperf RR            32         0    -100%
                       Apache 1MB            30         0    -100%
                       Apache 1KB            33         0    -100%
                       Memcached             33         0    -100%
free          strict   Netperf stream       289        66     -77%
(cycles)               Netperf RR           446        87     -81%
                       Apache 1MB           360        70     -81%
                       Apache 1KB           565        85     -85%
                       Memcached            525        73     -86%
              defer    Netperf stream       273        65     -76%
                       Netperf RR           242        66     -73%
                       Apache 1MB           278        65     -76%
                       Apache 1KB           300        66     -78%
                       Memcached            334        65     -80%
cpu           strict   Netperf stream       100       100      +0%
(%)                    Netperf RR            32        29      -8%
                       Apache 1MB           100        99      -0%
                       Apache 1KB            99        98      -1%
                       Memcached            100       100      +0%
              defer    Netperf stream       100       100      +0%
                       Netperf RR            30        29      -5%
                       Apache 1MB            99        99      -0%
                       Apache 1KB            98        98      -0%
                       Memcached            100       100      +0%

Table 4.2: Summary of the results obtained with the Mellanox setup


metric        protect  benchmark        baseline   EiovaR     diff
throughput    strict   Netperf stream      1.00      2.35    +135%
(normalized)           Netperf RR          1.00      1.07      +7%
                       Apache 1MB          1.00      1.22     +22%
                       Apache 1KB          1.00      1.16     +16%
                       Memcached           1.00      1.40     +40%
              defer    Netperf stream      1.00      1.00      +0%
                       Netperf RR          1.00      1.02      +2%
                       Apache 1MB          1.00      1.00      +0%
                       Apache 1KB          1.00      1.10     +10%
                       Memcached           1.00      1.05      +5%
alloc         strict   Netperf stream     14878        70    -100%
(cycles)               Netperf RR          3359       100     -97%
                       Apache 1MB          1469        74     -95%
                       Apache 1KB          2527       116     -95%
                       Memcached           5797       110     -98%
              defer    Netperf stream      1108        96     -91%
                       Netperf RR          1029       118     -89%
                       Apache 1MB           833        88     -89%
                       Apache 1KB          1104       133     -88%
                       Memcached           1021       130     -87%
search        strict   Netperf stream       345         0    -100%
(length)               Netperf RR            68         0    -100%
                       Apache 1MB            27         0    -100%
                       Apache 1KB            39         0    -100%
                       Memcached            128         0    -100%
              defer    Netperf stream        13         0    -100%
                       Netperf RR             9         0    -100%
                       Apache 1MB             9         0    -100%
                       Apache 1KB             9         0    -100%
                       Memcached              9         0    -100%
free          strict   Netperf stream       294        47     -84%
(cycles)               Netperf RR           282        48     -83%
                       Apache 1MB           250        50     -80%
                       Apache 1KB           425        52     -88%
                       Memcached            342        47     -86%
              defer    Netperf stream       268        47     -82%
                       Netperf RR           273        47     -83%
                       Apache 1MB           234        47     -80%
                       Apache 1KB           279        47     -83%
                       Memcached            276        47     -83%
cpu           strict   Netperf stream       100        53     -49%
(%)                    Netperf RR            13        12     -12%
                       Apache 1MB            99        99      -0%
                       Apache 1KB            98        98      -0%
                       Memcached             99        95      -4%
              defer    Netperf stream        55        44     -21%
                       Netperf RR            12        11      -7%
                       Apache 1MB            91        72     -21%
                       Apache 1KB            98        98      -0%
                       Memcached             93        92      -2%

Table 4.3: Summary of the results obtained with the Broadcom setup


Mellanox Setup Results Let us first examine the results of the Mellanox setup (Table 4.2). In the topmost part of the table, we see that EiovaR yields throughput that is 1.07–4.58x better than the baseline, and that improvements are more pronounced under strict protection. In the second part of the table, we see that the reason underlying the improved performance of EiovaR is that it reduces the average IOVA allocation time by 1–2 orders of magnitude, from up to 50K cycles to around 100–200. It also reduces the average IOVA deallocation time by about 75%–85%, from around 250–550 cycles to around 65–85 (fourth part of the table). When comparing the overhead of the IOVA allocation routine to the length of the search loop within this routine (third part of Table 4.2), we can see that, as expected, the two quantities are tightly correlated. Roughly, a longer loop implies a higher allocation overhead. Conversely, notice that there is not necessarily such a direct correspondence between the throughput improvement of EiovaR (first part of the table) and the associated IOVA allocation overhead (second part). The reason is largely that latency-sensitive applications are less affected by the allocation overhead, because other components in their I/O paths have higher relative weights. For example, under strict protection, the latency-sensitive Netperf RR has higher allocation overhead as compared to the throughput-sensitive Netperf Stream (10,269 cycles vs. 7,656, respectively), yet the throughput improvement of RR is smaller (1.27x vs. 2.37x). Similarly, the IOVA allocation overhead of Apache/1KB is higher than that of Apache/1MB (49,981 cycles vs. 17,776), yet its throughput improvement is lower (2.35x vs. 3.65x). While there is not necessarily a direct connection between throughput and allocation overhead when examining only strict safety, the connection becomes apparent when comparing the strict and deferred protection regimes. Clearly, the benefit of EiovaR in terms of throughput is greater under strict protection because the associated baseline allocation overheads are higher than those of deferred protection (7K–50K cycles for strict vs. 2K–3K for deferred).

Broadcom Setup Results Let us now examine the results of the Broadcom setup (Table 4.3). Strict EiovaR yields throughput that is 1.07–2.35x better than the baseline. Deferred EiovaR, on the other hand, only improves the throughput by up to 10%, and, in the case of Netperf Stream and Apache/1MB, it offers no improvement. Thus, while still significant, throughput improvements in this setup are less pronounced. The reason for this difference is twofold. First, as noted above, the driver of the Mellanox NIC utilizes more rings and more IOVAs, increasing the load on the IOVA allocation layer relative to the Broadcom driver and generating more opportunities for ring interference. This difference is evident when comparing the duration of alloc_iova in the two setups, which is significantly lower in the Broadcom case. In particular, the average allocation time in the Mellanox setup across all benchmarks and protection regimes is about 15K cycles, whereas it is only about 3K cycles in the Broadcom setup.


The second reason for the less pronounced improvements in the Broadcom setup is that the Broadcom NIC imposes a 10 Gbps upper bound on the bandwidth, which is reached in some of the benchmarks. Specifically, the aforementioned Netperf Stream and Apache/1MB—which exhibit no throughput improvement under deferred EiovaR—hit this limit. These benchmarks are already capable of obtaining line rate (maximal throughput) in the baseline/deferred configuration, so the lack of throughput improvement in their case should come as no surprise. Importantly, when evaluating I/O performance in a setting whereby the I/O channel is saturated, the interesting evaluation metric ceases to be throughput and becomes CPU usage. Namely, the question becomes which system is capable of achieving line rate using fewer CPU cycles. The bottom part of Table 4.3 shows that EiovaR is indeed the more performant alternative, using 21% fewer CPU cycles in the case of these same Netperf Stream and Apache/1MB benchmarks under deferred protection. (In the Mellanox setup, it is the CPU which is saturated in all cases but the latency-sensitive Netperf RR.)

Deferred Baseline vs. Strict EiovaR We explained above that deferred protection trades off safety to get better performance. We now note that, by Figure 4.10, the performance attained by EiovaR when strict protection is employed is similar to the performance of the baseline configuration that uses deferred protection (the default in Linux). Specifically, in the Mellanox setup, on average, strict EiovaR achieves 5% higher throughput than the deferred baseline, and in the Broadcom setup EiovaR achieves 3% lower throughput. Namely, if strict EiovaR is made the default, it will simultaneously deliver similar performance and better protection as compared to the current default configuration.

Different Message Sizes The default configuration of Netperf Stream utilizes a 16KB message size, which is big enough to optimize throughput. Our next experiment systematically explores the performance tradeoffs when utilizing smaller message sizes. Such messages can overwhelm the CPU and thus reduce the throughput. Another issue that might negatively affect the throughput of small packets is the maximal number of packets per second (PPS), which NICs commonly impose in conjunction with an upper bound on the throughput. (For example, the specification of our Broadcom NIC lists a maximal rate of 5.7 million PPS [34], and a rigorous experimental evaluation of this NIC reports that a single port in it is capable of delivering less than half that much [25].) Figure 4.11 shows the throughput (top) and consumed CPU (bottom) as a function of the message size for strict (left) and deferred safety (right) using the Netperf Stream benchmark in the Broadcom setup. Initially, with a 64B message size, the PPS limit dominates the throughput in all four configurations. Strict/baseline saturates the CPU with a message size as small as 256B; from that point on it achieves the same throughput (4Gbps), because the CPU remains its bottleneck.


Figure 4.11: Netperf Stream throughput [Gbit/s] (top) and used CPU [%] (bottom) for different message sizes (64B to 16KB, log scale) in the Broadcom setup, under strict (left) and deferred (right) protection, for eiovar vs. baseline.

The other three configurations enjoy a gradually increasing throughput until the line rate is reached. However, to achieve the same level of throughput, strict/EiovaR requires more CPU than deferred/baseline, which in turn requires more CPU than deferred/EiovaR.

Concurrency Our final experiment focuses on concurrent I/O streams, as concurrency amplifies harmful long-lasting ring interference. Figure 4.12 depicts the results of running Memcached in the Mellanox setup with an increasing number of clients. The left sub-graph reveals that the baseline allocation hampers scalability, whereas EiovaR allows the benchmark to scale such that it is up to 5.5x more performant than the baseline (with 32 clients). The right sub-graph highlights why, showing that the baseline IOVA allocation becomes costlier proportionally to the number of clients, whereas EiovaR allocation remains negligible across the board.


Figure 4.12: Impact of increased concurrency on Memcached in the Mellanox setup: throughput (transactions/sec, left) and alloc_iova cost (cycles, right) as a function of the number of clients (1–32, log scale), for eiovar vs. baseline. EiovaR allows the performance to scale.

4.7 Related Work

Several studies recognized the poor performance of the IOMMU for OS protection and addressed the issue using various techniques. Willmann et al. [64] studied the poor performance guests exhibit when the IOMMU is used for inter-guest protection. Two of the strategies suggested by the authors to enhance performance are also applicable to native OSes that use the IOMMU for OS protection. The “shared mappings” strategy reuses existing mappings when multiple I/O buffers reside on the same page, yet does not reduce the CPU utilization significantly. The “persistent mappings” strategy defers the actual unmap operations to allow the reuse of mappings when the same I/O buffer is reused later. This strategy improves performance considerably, yet relaxes the protection the IOMMU delivers. A similar work by Amit et al. [4] presented additional techniques that can be applied to enhance the performance of native OSes that use IOMMU protection, at the cost of a short window of vulnerability. Asynchronous IOTLB invalidations and a refinement of the “persistent mappings” approach that limits the time stale mappings may be deferred were shown to reduce the overhead considerably. Nonetheless, these techniques still relax the IOMMU protection and may allow invalid DMA transactions to succeed and corrupt system memory. Tomonori [60] studied the IOMMU performance bottleneck, suggesting that managing the IOVA free space using bitmaps instead of red-black trees can enhance performance by 20% on x86 platforms. Similarly, Cascardo [20] showed that using red-black trees for IOVA free space management on PowerPC systems delivers significantly worse performance. Nonetheless, using bitmaps for free space management consumes significantly more memory and often does not scale well. Apparently, Intel refrained from using bitmaps for these reasons.

Chapter 5

Reducing the IOTLB Miss Overhead

5.1 Introduction

Virtual-to-physical I/O address translation is on the critical path of DMA operations in computer systems that use IOMMUs, since it is invoked on every memory access performed by the device doing the DMA. To decrease the translation latency, the IOMMU uses a cache of recent translations called the I/O Translation Lookaside Buffer (IOTLB). The I/O virtual-to-physical address translation algorithm that the IOMMU executes is similar to the algorithm used by the CPU's MMU, and thus studies of TLBs are relevant for learning the behavior of the IOTLB [41, 6, 35, 46]. Although many proposed optimizations improve TLB performance by reducing the miss rate, none of the studies above devised a TLB with a miss rate close to 0%, if only because at least the first access to an address necessarily misses. Prefetching can hide some or all of the cost of TLB misses, as has been shown in several studies [12, 56, 53, 46]. Amit et al. were, as far as we know, the first to explore prefetching in the context of the IOTLB, using two simple prefetching approaches. The first, proposed by Intel, uses the Address Locality Hints (ALH) mechanism [39]; with this approach, the IOMMU prefetches the translations of adjacent virtual addresses. The second approach is Mapping Prefetch (MPRE), whereby the operating system provides the IOMMU with explicit hints to prefetch the first mapping of each group of streaming DMA mappings, where a group is defined as a scatter-gather list of several contiguous pages that are mapped consecutively. The evaluation of these prefetchers has been covered elsewhere; we refer the interested reader to [5]. In general, we can classify prefetchers into two groups: those that capture strided reference patterns and those that decide whether to prefetch on the basis of history. (A strided reference pattern is a sequence of memory reads and writes whose addresses are separated by a constant interval.)

Kandiraju and Sivasubramaniam reviewed five main prefetching mechanisms for the data TLB [46, 45], two of which belong to the first class: sequential prefetching and arbitrary-stride prefetching. These two mechanisms cannot be adapted to the IOMMU context, because one of them depends on the page table structure and the other relies on a CPU register. Consequently, we used only the other three for comparison against our prefetching mechanism and adjusted them to the I/O context. The first of these is based on limited history and is called Markov prefetching [43]. The second, called recency-based prefetching, is based on a much larger history (going back to the previous miss to a page) and is described in [56]. The third is a stride-based technique called distance prefetching that was proposed for TLBs in [46].

5.2 General description of all the prefetchers we explore

[Figure 5.1 shows the IOMMU (containing the IOTLB, the prediction table, the logic unit, and the prefetch buffer) alongside main memory (containing the page table): (1) a virtual address translation is required; (2) on an IOTLB miss the IOVA is translated to a PA, falling back to the page table when both the IOTLB and the prefetch buffer miss; (3) the prediction table is updated; (4) the relevant IOVA translations are prefetched into the prefetch buffer using the prediction table and the page table.]

Figure 5.1: General scheme.

In a nutshell, the prefetch mechanism wraps the regular translation mechanism, predicts the next IOTLB misses, and brings the translations closer to the IOMMU so they can be supplied rapidly when they are required but not found in the IOTLB.

A prefetcher contains three main parts: a database called the prediction table, a special cache called the prefetch buffer, and a logic unit. These are depicted in Figure 5.1. The prediction table summarizes information about the recent IOTLB misses and provides the logic unit with the data it needs to decide what to prefetch and whether to prefetch at all. Each row in this table has a tag and s slots containing the data needed to calculate the next s address translations to prefetch; the size of the table is therefore the number of rows multiplied by s. In our work we used s = 2. In Markov prefetching, the current missing page is used as the tag, and the table gives two other pages with a high probability of being missed immediately after; these two pages need to be prefetched. Distance prefetching uses as the tag the current distance (the difference in page number between the current miss and the previous miss), and the table gives two other distances that occurred immediately after this distance was encountered. Recency prefetching keeps an LRU stack of TLB misses by maintaining two additional pointers in each page table entry (kept in memory). On each IOTLB miss, the logic unit updates the prediction table and decides which translation to prefetch into the prefetch buffer. The prefetch buffer is as close to the IOMMU as the IOTLB, and it can be searched concurrently with the IOTLB. Hence, in the case of a prefetch buffer hit, the IOMMU gets the translation as quickly as if it were in the IOTLB. Note that the prefetch mechanism observes the miss reference stream to make decisions and never directly loads the fetched entries into the IOTLB. As a result, prefetching does not influence the miss rate of the IOTLB; it can only hide the performance cost of some of the IOTLB misses. Hence the only drawback of the prefetch mechanism is the additional memory traffic of prefetched and unused translations.
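
To make the interplay between the IOTLB, the prefetch buffer, the prediction table, and the logic unit concrete, the following C sketch outlines the translation path described above. It is an illustrative software model only, under our own assumptions; the function and structure names (iotlb_lookup, pb_lookup, pb_insert, page_walk, update_and_predict) are ours and do not correspond to any real IOMMU interface.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t iova_t;
    typedef uint64_t pa_t;

    /* Hypothetical hardware primitives; each prefetching scheme supplies
     * its own prediction table behind the update_and_predict callback.   */
    extern bool iotlb_lookup(iova_t iova, pa_t *pa);   /* regular IOTLB      */
    extern bool pb_lookup(iova_t iova, pa_t *pa);      /* prefetch buffer    */
    extern void pb_insert(iova_t iova, pa_t pa);
    extern pa_t page_walk(iova_t iova);                /* walk the page table */

    struct predictor {
        /* Record an IOTLB miss of 'missed' and return up to 's' predicted
         * IOVAs in 'out'; returns how many predictions were produced.     */
        int (*update_and_predict)(iova_t missed, iova_t out[], int s);
    };

    pa_t translate(struct predictor *p, iova_t iova)
    {
        pa_t pa;

        if (iotlb_lookup(iova, &pa))
            return pa;                                 /* IOTLB hit           */

        /* IOTLB miss: the logic unit observes it and updates the prediction
         * table, whether or not the prefetch buffer happens to hold the
         * translation (the two caches are probed in parallel in hardware). */
        iova_t pred[2];
        int n = p->update_and_predict(iova, pred, 2);  /* s = 2 in this work  */

        if (!pb_lookup(iova, &pa))
            pa = page_walk(iova);                      /* pay the full miss   */

        /* Pull the predicted translations into the prefetch buffer only,
         * never into the IOTLB, at the cost of extra memory traffic.       */
        for (int i = 0; i < n; i++) {
            pa_t ppa;
            if (!pb_lookup(pred[i], &ppa))
                pb_insert(pred[i], page_walk(pred[i]));
        }
        return pa;
    }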

5.3 Markov Prefetcher (MP)

The Markov prefetcher is based on the Markov chain theorem and on the assumption that there is a good chance that history repeats itself. It uses an $X \times Y$ matrix to represent an approximation of a Markov state transition diagram and builds it dynamically. A Markov state transition diagram answers the question: given a visit to a specific state, which states have the highest probability of being visited right after it? The diagram contains states denoting the referenced units (virtual addresses in this context) and transition arcs denoting the probability of moving from one state to the state pointed to by the arc. The probabilities are tracked from prior references to each state, so to know which states are most likely to be visited next, one searches the state's outgoing arcs. Joseph and Grunwald were the first to propose this mechanism, for caches [43]. Kandiraju and Sivasubramaniam extended it to work with TLBs and called it the Markov prefetcher [46], and we extend the latter to work with IOTLBs, as will be discussed shortly. We first briefly describe the Markov chain theorem for a better understanding of how the prefetcher works.

5.3.1 Markov Chain Theorem

Let us assume the following definitions. A stochastic process $X = \{X_t : t \in T\}$ is a collection of random variables representing the evolution of some system of random values over time; we call $X_t$ the state of the process at time $t$. The Markov property is defined as a memoryless transition between states: the next state depends only on the current state and not on the sequence of events that preceded it. Formally, for states $x_1, x_2, \ldots, x_n, x$,
$$\Pr(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \Pr(X_{n+1} = x \mid X_n = x_n).$$
A Markov chain is a stochastic process with the Markov property.

The Markov chain can thus be used to describe systems that follow a chain of linked events represented as states $S = \{s_1, s_2, \ldots, s_n\}$. The process starts in one of these states and moves successively from one state to another; these changes of state are called transitions. If the system is currently in state $s_i$, the probability that the next state will be $s_j$ is denoted by $p_{ij}$, where $p_{ij}$ depends only on the current state $s_i$ of the system and not on the sequence of events that preceded it, nor on the time $t$ (the stationarity assumption):
$$\Pr[X_{t+1} = s_j \mid X_t = s_i] = p_{ij}.$$
The states and transitions are represented by a Markov state transition diagram, which can be drawn as a directed graph or written as a matrix and is usually built from the history of the process.

Example

This can be illustrated by a simple example. Let us assume that the entire World Wide Web consists of only three websites, $e_1$, $e_2$, and $e_3$, visited in the sequence $e_1, e_1, e_2, e_2, e_1, e_2, e_1, e_3, e_3, e_2, e_3$. Then the Markov state transition diagram built from the history of these visits is illustrated in Figure 5.2.

The transition matrix reconstructed from this visit history (rows and columns ordered $e_1, e_2, e_3$; entry $(i, j)$ holds the probability of moving from $e_i$ to $e_j$) is
$$P = \begin{pmatrix} 1/4 & 2/4 & 1/4 \\ 2/4 & 1/4 & 1/4 \\ 0 & 1/2 & 1/2 \end{pmatrix}.$$

Figure 5.2: Markov state transition diagram, which is represented as a directed graph (right) or a matrix (left).

The left-hand side of the figure is the matrix representation and the right-hand side is the directed-graph representation. If we are in state $e_1$, the probability of staying in $e_1$ is 1/4, of moving to $e_2$ is 2/4, and of moving to $e_3$ is 1/4. If we are in state $e_2$, the probability of staying in $e_2$ is 1/4, of moving to $e_1$ is 2/4, and of moving to $e_3$ is 1/4, and so on.
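
In general, the entries of such a transition matrix are simply normalized transition counts; for instance, the first row of the matrix above is obtained from the visit history as follows:
$$p_{ij} = \frac{\operatorname{count}(e_i \rightarrow e_j)}{\sum_k \operatorname{count}(e_i \rightarrow e_k)}, \qquad p_{11} = \frac{1}{4}, \quad p_{12} = \frac{2}{4}, \quad p_{13} = \frac{1}{4},$$
since the sequence leaves $e_1$ four times: once to $e_1$ itself, twice to $e_2$, and once to $e_3$.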

5.3.2 Prefetching Using the Markov Chain

The prediction table for the Markov prefetcher includes a tag column containing the virtual addresses that missed on the TLB. Each row of the table has s slots, with each slot containing a virtual page address that has the highest (approximated) probability to be accessed after the virtual address denoted by the tag of the row. These slots therefore correspond to translations to be prefetched the next time a miss occurs. On a TLB miss, the prefetcher looks up the missed address in the table. If not found, then this entry is added, and the s slots for this entry are kept empty. If the missed address is found, then a prefetch is initiated for the corresponding s slots of this address.

5.3.3 Extension to IOMMU

In the TLB extension of the Markov prefetcher [46], the slots contain an entire page table entry (including the virtual address and the translation). This is in contrast to our IOMMU extension, where the slots contain only virtual addresses; the translations corresponding to these addresses are brought in by invoking a page walk for each address. This difference is due to the fact that in the I/O context the translations are no longer available after a DMA transaction is finished, unlike the CPU context, where the virtual addresses exist until the process is killed or the memory is freed. In addition to prefetching, the prefetcher's logic unit goes to the entry of the previous virtual address that missed and adds the current missed address into one of its s slots (whichever is free). If all the slots are occupied, then it evicts one in accordance with an LRU policy. As a result, the s slots of each entry correspond to different virtual pages that missed immediately after the page denoted by the entry's tag. Since the table has a limited number of entries, an entry (row) could itself be replaced because of conflicts. A simplified hardware block diagram of the Markov prefetcher with s = 2 is given in Figure 5.3.
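
The following C sketch models the table update just described. It assumes s = 2, a direct-mapped table indexed by page number, and a rotating victim as an approximation of the per-row LRU the text mentions; all names and structures are ours, not a real hardware specification.

    #include <stdint.h>

    #define MP_ROWS  256          /* prediction-table rows (varied in Section 5.6) */
    #define MP_SLOTS 2            /* s = 2 predicted addresses per row             */

    typedef uint64_t iova_t;
    #define IOVA_NONE ((iova_t)-1)

    struct mp_row {
        iova_t  tag;              /* IOVA (page) that missed                       */
        iova_t  slot[MP_SLOTS];   /* IOVAs that missed right after the tag         */
        uint8_t victim;           /* rotating victim, approximating per-row LRU    */
    };

    struct markov_table {
        struct mp_row row[MP_ROWS];
        iova_t prev_miss;         /* the previously missed IOVA                    */
    };

    /* Direct-mapped indexing by page number; conflicts replace whole rows,
     * as the text notes can happen when the table has limited entries.     */
    static struct mp_row *mp_row_for(struct markov_table *t, iova_t page)
    {
        struct mp_row *r = &t->row[page % MP_ROWS];
        if (r->tag != page) {                  /* (re)claim the row with empty slots */
            r->tag = page;
            for (int j = 0; j < MP_SLOTS; j++)
                r->slot[j] = IOVA_NONE;
            r->victim = 0;
        }
        return r;
    }

    /* Called on every IOTLB miss of 'page'. Fills 'pred' with candidate IOVAs
     * whose translations the logic unit then fetches with a page walk and
     * inserts into the prefetch buffer. Returns the number of candidates.    */
    static int mp_miss(struct markov_table *t, iova_t page, iova_t pred[MP_SLOTS])
    {
        int n = 0;

        /* 1. Collect the predictions already stored for the current miss.    */
        struct mp_row *cur = &t->row[page % MP_ROWS];
        if (cur->tag == page)
            for (int j = 0; j < MP_SLOTS; j++)
                if (cur->slot[j] != IOVA_NONE)
                    pred[n++] = cur->slot[j];

        /* 2. Record "page missed right after prev_miss" in prev_miss's row.  */
        if (t->prev_miss != IOVA_NONE) {
            struct mp_row *prev = mp_row_for(t, t->prev_miss);
            int already = 0, empty = -1;
            for (int j = 0; j < MP_SLOTS; j++) {
                if (prev->slot[j] == page) already = 1;
                else if (empty < 0 && prev->slot[j] == IOVA_NONE) empty = j;
            }
            if (!already) {
                int j = empty;
                if (j < 0) {                       /* both slots occupied: evict one */
                    j = prev->victim;
                    prev->victim = (prev->victim + 1) % MP_SLOTS;
                }
                prev->slot[j] = page;
            }
        }

        /* 3. Make sure the current miss has a row for future updates.        */
        (void)mp_row_for(t, page);

        t->prev_miss = page;
        return n;
    }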

5.4 Recency Based Prefetching (RP)

While Markov prefetching schemes were initially proposed for caches, recency prefetching has been proposed solely for TLBs [56, 46]. This scheme uses the memory accesses (and the misses in the TLB) as points on the timeline and works on the principle that after a TLB miss of a virtual address with a given recency, the next TLB miss will likely be of a virtual address with the same recency.

The underlying assumption is that if an application accesses a set of data structures sequentially, then there is a temporal ordering between the accesses to these structures, despite the fact that there may be no ordering in either the virtual or the physical address space. To keep an updated data structure that contains all the page table entries ordered by their recency, the recency prefetcher builds an LRU stack of page table entries. The hardware TLB maintains temporal ordering and naturally contains the virtual addresses with the most contemporary recency: if the TLB has N entries, then it implements the upper N entries of the recency stack, and the rest of the entries are located in the prefetcher database (the LRU stack maintained in main memory).

5.4.1 TLB hit

Assume we have a TLB with 4 entries, A, B, C, and D, corresponding to an LRU stack order of 1, 2, 3, and 4, respectively. If virtual address C, which is the third address on the recency stack, is accessed, then only the TLB is updated: C becomes first in the LRU order, A becomes second, B becomes third, and the rest of the recency stack, which resides in main memory, remains the same, as shown in Figure 5.4.

5.4.2 TLB miss

When accessing a virtual address that does not exist in the TLB (a TLB miss), the required virtual address translation is removed from the in-memory LRU stack and inserted into the top of the TLB, and that address has the most contemporary recency at that point in time. All the following addresses are pushed down.

[Figure 5.3 shows the prediction table (virtual page tags and predicted addresses) and the prefetch buffer: (1) P1 misses in the IOTLB after P0 missed; (2) P1 is inserted into P0's predicted addresses; (3) P1's predicted addresses, P2 and P3, are prefetched into the prefetch buffer.]

Figure 5.3: Schematic implementation of the Markov Prefetcher.


Figure 5.4: Schematic depiction of the recency prefetcher on a TLB hit.

As a result, the last entry is evacuated from the TLB to make room for the inserted entry and is pushed onto the in-memory stack. Thus, the only in-memory recency stack manipulations required are those related to TLB misses. In addition to all these updates, the prefetch mechanism prefetches the two LRU entries whose contemporary recency is similar to that of the accessed virtual address, namely the predecessor and the successor of the accessed virtual address before it was removed from the in-memory LRU stack.

Example

Assume we have a TLB with 4 entries, A, B, C, and D, with an LRU stack order of 1, 2, 3, and 4, respectively, and a memory recency stack with 6 entries, U, V, W, X, Y , and Z, with an LRU stack order of 5, 6, 7, 8, 9, and 10, respectively; see Figure 5.5. If virtual address Y, which is 9th in the LRU stack order, is accessed, the LRU stack is updated as follows:

1. Y is removed from the in-memory recency stack and inserted to the first entry in the TLB.

2. D is evacuated from the TLB and inserted to the first place in the in-memory stack; namely, it becomes the 5th in the LRU stack order.


Figure 5.5: Schematic depiction of the recency prefetcher on a TLB miss.

3. A, B, and C stay in the TLB but become 2nd, 3rd, and 4th, respectively, in the LRU stack order. Since Y missed in the TLB, prefetching is executed; that is, X and Z are inserted into the prefetch buffer.

5.4.3 Extension to IOMMU

As explained before, virtual address translations are available only during DMA transactions and are unmapped right after a transaction is finished, so keeping the translations themselves is useless. However, since the virtual address allocator reuses addresses after they are freed, it is probable that the recency prefetcher can still predict miss patterns. We thus propose holding the virtual addresses in the stack even if they are unmapped and, when a prefetch is executed, looking for the translation in the page table and prefetching it only if it exists.
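
The C sketch below models the in-memory part of the recency stack and the miss handling just described. It is illustrative only: in the proposed hardware the prev/next pointers live inside the page-table entries, whereas here they live in ordinary structs, and looking up the node for a missed page is left to the caller; all names are ours.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t iova_t;

    /* One node per IO virtual page. Nodes are kept even for unmapped pages,
     * per the IOMMU extension; a prefetch candidate is walked in the page
     * table and fetched only if a mapping currently exists.                */
    struct rp_node {
        iova_t page;
        bool   in_tlb;                 /* the top-N entries are "inside" the IOTLB */
        struct rp_node *prev, *next;   /* neighbors in the in-memory LRU stack     */
    };

    struct rp_stack {
        struct rp_node *top;           /* most recent entry that is NOT in the TLB */
    };

    /* Unlink a node from the in-memory stack. */
    static void rp_unlink(struct rp_stack *s, struct rp_node *n)
    {
        if (n->prev) n->prev->next = n->next; else s->top = n->next;
        if (n->next) n->next->prev = n->prev;
        n->prev = n->next = NULL;
    }

    /* Push a node on top of the in-memory stack (used for the TLB victim). */
    static void rp_push(struct rp_stack *s, struct rp_node *n)
    {
        n->prev = NULL;
        n->next = s->top;
        if (s->top) s->top->prev = n;
        s->top = n;
        n->in_tlb = false;
    }

    /* IOTLB miss on node 'n': return up to two prefetch candidates (n's stack
     * neighbors, i.e., pages with similar recency), move n into the TLB, and
     * push the entry evicted from the TLB onto the in-memory stack.          */
    static int rp_miss(struct rp_stack *s, struct rp_node *n,
                       struct rp_node *tlb_victim, iova_t cand[2])
    {
        int c = 0;
        if (n->prev) cand[c++] = n->prev->page;
        if (n->next) cand[c++] = n->next->page;

        rp_unlink(s, n);               /* n now has the most contemporary recency */
        n->in_tlb = true;
        if (tlb_victim)
            rp_push(s, tlb_victim);
        return c;
    }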

5.5 Distance Prefetching (DP)

The distance prefetcher works on the hypothesis that by keeping track of the differences ("distances") between successive addresses, we can make more predictions in a smaller space than the previous prefetchers.

Let us define a distance as the difference between two successive addresses, and assume that the sequence of accessed addresses is 0, 2, 10, 12, 20, 22, 30, 32. The distances of this sequence are then 2, 8, 2, 8, 2, 8, 2. Knowing that a distance of 2 is followed by a distance of 8 and vice versa allows us to predict, after only the third access (address 10), the rest of the sequence. This is the main idea behind the distance prefetcher. Although the Markov prefetcher [43, 46] and the recency prefetcher [56, 46] are well suited to learning access patterns in a system, they require considerably more space than DP to hold their data when the domain becomes large (i.e., includes more addresses). In order to detect a pattern as simple as a sequential scan, for example, MP and RP need considerable space. They also take a while to learn a pattern, since a prefetch is issued only after repetitions of the same addresses. According to Kandiraju et al. [46], who proposed it, the distance prefetcher can detect at least as many access patterns as the recency and Markov prefetchers. Unlike the distance prefetcher, the Markov and recency prefetchers cannot be used to predict the first access to a given address.


Figure 5.6: Schematic depiction of the distance prefetcher on a TLB miss.

The hardware implementation is similar to the Markov prefetcher implementation, except that the prediction table contains distances instead of addresses, and calculations are required to obtain these distances, as shown in Figure 5.6. The table rows are indexed by distance, and each row contains two predicted distances. The tag cell holds the distance of the current missed virtual address from the last missed virtual address, and the two other cells contain predictions of possible distances from the current address. When a virtual address misses in the TLB, the current distance is calculated relative to the previous missed virtual address, which is kept in a register. The previous distance is also kept in a register, and the current distance is inserted into the previous distance's row. Then the two distances in the current distance's row are used to calculate two addresses, which will be prefetched into the prefetch buffer.

Now that we have candidate virtual addresses, we look for their translations in the page table and, if found, fetch them from the page table and store them in the prefetch buffer.
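
A compact C model of this logic follows. It assumes a direct-mapped table indexed by the distance value with two predicted-distance slots per row, mirroring the structure in Figure 5.6; the names and the rotating-victim replacement are our own simplifications.

    #include <stdint.h>

    #define DP_ROWS  8      /* 8 rows already suffice in the evaluation (Section 5.6) */
    #define DP_SLOTS 2

    typedef uint64_t iova_t;
    typedef int64_t  dist_t;          /* distance = difference in page numbers */
    #define DIST_NONE INT64_MIN

    struct dp_row {
        dist_t  tag;                   /* observed distance                     */
        dist_t  slot[DP_SLOTS];        /* distances that followed it            */
        uint8_t victim;
    };

    struct dist_table {
        struct dp_row row[DP_ROWS];
        iova_t prev_miss;              /* previously missed page                */
        dist_t prev_dist;              /* previously observed distance          */
        int    have_prev;
    };

    static void dp_init(struct dist_table *t)
    {
        for (int i = 0; i < DP_ROWS; i++) {
            t->row[i].tag = DIST_NONE;
            t->row[i].slot[0] = t->row[i].slot[1] = DIST_NONE;
            t->row[i].victim = 0;
        }
        t->have_prev = 0;
        t->prev_dist = DIST_NONE;
    }

    /* Direct-mapped on the distance value (an assumption of this sketch). */
    static struct dp_row *dp_find(struct dist_table *t, dist_t d)
    {
        return &t->row[(uint64_t)d % DP_ROWS];
    }

    /* Called on an IOTLB miss of 'page'. Fills 'pred' with up to DP_SLOTS
     * candidate pages (current page + predicted distance); the caller then
     * walks the page table and prefetches only translations that exist.   */
    static int dp_miss(struct dist_table *t, iova_t page, iova_t pred[DP_SLOTS])
    {
        int n = 0;
        if (t->have_prev) {
            dist_t cur = (dist_t)page - (dist_t)t->prev_miss;

            /* 1. Record "cur followed prev_dist" in prev_dist's row. */
            if (t->prev_dist != DIST_NONE) {
                struct dp_row *prev = dp_find(t, t->prev_dist);
                if (prev->tag != t->prev_dist) {
                    prev->tag = t->prev_dist;
                    prev->slot[0] = prev->slot[1] = DIST_NONE;
                    prev->victim = 0;
                }
                if (prev->slot[0] != cur && prev->slot[1] != cur) {
                    prev->slot[prev->victim] = cur;
                    prev->victim = (prev->victim + 1) % DP_SLOTS;
                }
            }

            /* 2. Predict: add the distances recorded for 'cur' to 'page'. */
            struct dp_row *r = dp_find(t, cur);
            if (r->tag == cur)
                for (int j = 0; j < DP_SLOTS; j++)
                    if (r->slot[j] != DIST_NONE)
                        pred[n++] = page + r->slot[j];

            t->prev_dist = cur;
        }
        t->prev_miss = page;
        t->have_prev = 1;
        return n;
    }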

5.6 Evaluation

5.6.1 Methodology

We designed a simulator that models different types of IOTLBs and prefetchers. The simulator's input is a specific IOMMU architecture (TLB and prefetcher) and a stream of events (map a memory address, unmap a memory address, and access a memory address); its output is statistics about the IOMMU's performance. To supply the event stream, we implemented an IOMMU emulation in KVM/QEMU that logs the DMAs performed by the emulated I/O devices (we inserted print statements at the relevant locations in the QEMU code). With the logs being written to files, we ran our benchmarks: Netperf TCP stream, Netperf UDP RR, Apache, and Memcached, which are described in Chapters 3 and 4. After the benchmarks finish running, we obtain a set of files that record the benchmark runs and include all the information needed to simulate different IOTLB schemes. Finally, we implemented simulations of all the prefetchers described above. Given a prefetcher and a benchmark log file, the simulator parses the file, creates an event stream, and invokes the corresponding methods of the prefetcher for these events. The virtual machines of our modified QEMU used Ubuntu 12.04 server images and were configured to use emulated e1000 NIC devices. Each IOMMU simulation included a 32-entry fully associative LRU IOTLB and a 32-entry fully associative prefetch buffer. The size of the prefetcher's prediction table varied as shown in Figures 5.7, 5.8 and 5.9.
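
A minimal sketch of how such an event-driven replay loop might look is given below. The line format (MAP/UNMAP/ACCESS followed by hexadecimal fields) and the callback names are assumptions made for the sake of illustration; the thesis's simulator only needs to expose equivalent map, unmap, and access handlers for each simulated scheme.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef uint64_t iova_t;

    /* Hypothetical per-prefetcher hooks; each simulated scheme (Markov,
     * recency, distance) provides its own implementation.                */
    struct sim_ops {
        void (*map)(iova_t iova, uint64_t pa, uint64_t len);
        void (*unmap)(iova_t iova);
        void (*access)(iova_t iova);   /* updates IOTLB/PB hit statistics  */
    };

    /* Replay one recorded benchmark log; returns 0 on success. */
    static int replay(FILE *log, const struct sim_ops *ops)
    {
        char op[16];
        uint64_t a, b, c;

        while (fscanf(log, "%15s", op) == 1) {
            if (strcmp(op, "MAP") == 0 &&
                fscanf(log, "%" SCNx64 " %" SCNx64 " %" SCNx64, &a, &b, &c) == 3)
                ops->map(a, b, c);
            else if (strcmp(op, "UNMAP") == 0 &&
                     fscanf(log, "%" SCNx64, &a) == 1)
                ops->unmap(a);
            else if (strcmp(op, "ACCESS") == 0 &&
                     fscanf(log, "%" SCNx64, &a) == 1)
                ops->access(a);
            else
                return -1;             /* malformed line */
        }
        return 0;
    }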

5.6.2 Results

In order to understand the results, one should understand which memory accesses are included in a DMA transaction. First the device reads the DMA descriptor, which resides in the Rx or Tx descriptor ring, gets the buffer address, and then accesses (writes or reads) that memory address. When the IOMMU is enabled, every access requires a translation, and the IOMMU checks whether the translation is cached in the IOTLB or in the prefetch buffer. Each e1000 device in our virtual machines includes only two rings with 256 descriptors each. One ring occupies 4KB of memory, so a single page is sufficient to contain it. Hence, each virtual machine uses only 2 addresses for the NIC rings, and the maximum possible number of virtual addresses is 514: 2 addresses for the rings and 256 buffers in each of the two rings (Rx and Tx). When the NIC only receives data, there are only 258 virtual addresses: 2 addresses for the rings and 256 Rx buffers. Note that in this case the Rx ring contains the maximum number of buffers and the Tx ring is empty.

[Figure 5.7 plots the hit rate [%] against the size of the prediction table (8 to 512 entries) for the recency, distance, and Markov prefetchers, under the minimum (left) and maximum (right) number of accesses per transaction; the bars distinguish IOTLB hits on ring descriptors, IOTLB hits on buffers, and prefetch buffer hits on buffers.]

Figure 5.7: Hit rate simulation of Apache benchmarks with message sizes of 1k (top) and 1M (bottom).

We observed that the 2 addresses of the rings are inserted into the IOTLB and are never invalidated. This is because 32 entries are sufficient to guarantee that these 2 addresses never become the least recently used in the IOTLB; other entries are always evicted before them. Hence there are three types of possible hits in our results:

1. IOTLB hit caused by accessing a ring descriptor;

2. IOTLB hit caused by accessing a buffer;

3. Prefetch buffer hit caused by accessing a buffer.

We present two versions of the results for each benchmark, because the granularity of the transferred buffer chunks varies from system to system and affects the miss rate of the IOMMU. The right-hand columns in the figures show the results for systems with a granularity of 64 bytes, which causes the maximum number of accesses per transaction, while the left-hand columns show the results for systems whose chunk size equals the size of the transferred buffer, which causes the minimum number of accesses per transaction.


Figure 5.8: Hit rate simulation of Netperf stream with message sizes of 1k (top) and 4k (bottom).

Markov prefetcher

The Markov prefetcher begins to be effective only when the prediction table has 256 or more entries. Not coincidentally, 256 is the size of an e1000 ring and accounts for most of the virtual addresses in use by the NIC.

When sending a 1k buffer, only one Ethernet packet is required, but when sending a 1M buffer, the buffer is split into about 700 Ethernet packets. Those 700 packets are entered one after the other into the Tx descriptor ring, which has 256 entries. The order of sending 256 packets therefore repeats itself about 3 times, so the predictor can learn this pattern after the first 1M buffer is sent, and the results are better than for Apache 1k (Figure 5.7). For the same reason, when sending 1M buffers, 256 entries in the prediction table are sufficient to obtain results as good as those obtained with 512 entries. Any number of entries smaller than 256 results in a zero prefetch buffer hit rate, because the periodicity of the rings is 256.
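
The packet counts above can be sanity-checked with a quick calculation (assuming the standard 1500-byte Ethernet MTU, which the text does not state explicitly):
$$\frac{1\,\mathrm{MB}}{1500\,\mathrm{B}} \approx 700\ \text{packets}, \qquad \frac{700\ \text{packets}}{256\ \text{descriptors}} \approx 2.7,$$
so sending one 1M buffer cycles through the 256-entry Tx ring roughly three times, which is exactly the periodicity the Markov prediction table must be large enough to capture.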

In all the other benchmarks, 256 prediction table entries are sufficient to obtain a prefetch buffer hit rate of about 95% (Figures 5.8, 5.9).


Figure 5.9: Hit rate simulation of Netperf RR (top) and Memcached (bottom).

The Recency prefetcher

As expected, all benchmarks obtain a hit rate of almost 100% when the prediction table contains 512 entries, and a lower hit rate when the prediction table contains at most 256 entries. Because of the IOVA allocation scheme explained in Chapter 4, the Rx and Tx rings swap IOVAs, so the recency of an IOVA that is accessed by both sending and receiving is always above 256. Hence, when running Netperf RR, 256 prediction table entries are not enough for the prefetcher to predict anything. As with the Markov prefetcher, here, too, any prediction table with fewer than 256 entries results in a 0% prefetch buffer hit rate, because the periodicity of the rings is 256.

The Distance prefetcher

The distance prefetcher is influenced neither by the addresses themselves nor by the periodicity of the NIC descriptor rings. The distances between accessed addresses are such that only 8 prediction table entries are enough to make useful predictions, and increasing that number does not really improve the prediction.

Moreover, in some benchmarks (for example, the Netperf Stream results shown in Figure 5.8), increasing the number of prediction table entries reduces the prefetch buffer hit rate, because the relevance of the access pattern changes over time; hence, the bigger the table is, the more irrelevant data it keeps.

5.7 Measuring the cost of an Intel IOTLB miss

Experimental Setup

We used two identical Dell PowerEdge R210 II rack servers that communicate through Mellanox ConnectX-3 40Gbps NICs. The NICs are connected back to back and are configured to use Ethernet. We use one machine as the server and the other as a workload-generating client. Each machine has 8GB of 1333MHz memory and a single-socket 4-core Intel Xeon E3-1220 CPU running at 3.10GHz. The chipset is Intel C202, which supports VT-d, Intel's virtualization technology that provides IOMMU functionality. We configure the server to utilize only one core and turn off all power-optimization sleep states (C-states) and dynamic voltage and frequency scaling (DVFS) to avoid reporting artifacts caused by nondeterministic events. The machines run Ubuntu 12.04 with the Linux 3.11 kernel. With the help of the ibverbs library [47, 38], we bypassed the entire network stack and communicated directly with the NIC; this way we could measure the latency more accurately.

Experimental Results

We first measured the IOMMU latency of sending an Ethernet packet and then measured the packet's round-trip time. Each experiment was performed in two versions. In the first, a buffer was iteratively and randomly selected from a large pool of previously mapped buffers and transmitted, thus ensuring that the probability of the corresponding IOVA residing in the IOTLB is low; we call this version the miss-version. The second version does the same but with a pool containing only one buffer, thus ensuring that the IOTLB always hits; this version is called the hit-version. The two types of experiments are detailed below:

1. Measuring the IOMMU hardware latency of sending an Ethernet packet: On each iteration of this experiment we sent a packet and polled the NIC until receiving notification that the packet was sent. Note that this was done without the OS network stack.

The hit-version results are given in the first line of Table 5.1. The system takes about 3300 cycles to send one packet. As expected, the results of the hit-version are the same with and without the IOMMU.

The miss-version shows a difference of about 1000 cycles between running the experiment with the IOMMU enabled and running it with the IOMMU disabled. Thus, we learn that the IOTLB miss latency for sending a packet is about 0.3 microseconds on our machine.

Intel IOTLB hit rate    Intel IOMMU disabled    Intel IOMMU enabled
100%                    3343                    3357
0%                      5423                    6399

Table 5.1: Hardware latency of sending an Ethernet packet [cycles]

2. Measuring the round-trip time of an Ethernet packet: We used the application ibv_rc_pingpong, which can be found at openfabrics.org under the "examples" directory. This application sets up a connection between two nodes running Mellanox adapters and transfers data. The client sends a message, and the server acknowledges it by sending back a corresponding message; only after the acknowledgement message arrives at the client does it send another message. We repeated this experiment with pools of previously mapped buffers of different sizes (both on the client and the server side).


Figure 5.10: The difference between the RTT when the IOMMU is enabled and the RTT when the IOMMU is disabled, as a function of the number of different buffers sent.

Figure 5.10 shows the difference between the round-trip time (RTT) with the IOMMU enabled and the RTT with the IOMMU disabled, as a function of the buffer pool size. The first thing we notice is that with more than 32 buffers in the pool, the RTT delta jumps from ~0.6 to ~1.2 microseconds.

From this we learn that a buffer pool size larger than 32 causes IOTLB misses and that the size of the IOTLB is between 32 and 64 entries. We also learn that the IOTLB miss penalty for a round trip is ~0.6 microseconds. Finally, we learn that there is an additional cost of ~0.6 microseconds due to the private data structure of the NIC.
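
The two measurements are mutually consistent. As a rough sanity check using the numbers of Table 5.1 and the 3.10GHz clock (the per-direction split is our reading; the text itself only reports the per-packet and per-round-trip figures):
$$\frac{6399 - 5423\ \text{cycles}}{3.10\,\mathrm{GHz}} \approx 0.31\,\mu\mathrm{s}\ \text{per transmitted packet}, \qquad 1.2 - 0.6 \approx 0.6\,\mu\mathrm{s}\ \text{per round trip} \approx 2 \times 0.3\,\mu\mathrm{s},$$
i.e., roughly one IOTLB miss in each direction of the round trip, while the ~0.6 µs delta that remains even when the IOTLB always hits is the cost attributed to the NIC's private data structure.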

Chapter 6

Conclusions

6.1 rIOMMU

The IOMMU design is similar to that of the regular MMU, despite the inherent differences in their workloads. This design worked reasonably well when I/O devices were relatively slow as compared to the CPU. But it hampers the performance of contemporary devices like 10/40 Gbps NICs. We foresee that this problem will get worse due to the ever increasing speed of such devices. We thus contend that it makes sense to rearchitect the IOMMU such that it directly supports the unique characteristics of its workload. We propose rIOMMU as an example for such a redesign and show that the benefits of using it are substantial.

6.2 EiovaR

The IOMMU has been falsely perceived as the party mainly responsible for the significant overheads of intra-OS protection. We find that the OS is equally to blame, suffering from the "long-lasting ring interference" pathology that makes its IOVA allocations slow. We exploit the ring-induced nature of IOVA (de)allocation requests to design a simple optimization called EiovaR. The latter eliminates the harmful effect and makes the baseline allocator orders of magnitude faster. It improves the performance of common benchmarks by up to 4.6x.

6.3 Reducing the IOTLB Miss Overhead

Both the recency and the Markov prefetchers, in our case, require a prediction table with a minimum size of 256, which is the size of the descriptor ring. This size will grow to thousands if the system includes a 40 Gbps NIC, and the hit rate improvement will not be worth this price. In addition, the prefetch buffer hardware could instead be used as additional IOTLB capacity rather than as part of a prefetcher, which can be helpful when multiple devices are used.

The distance prefetcher is a good compromise: only a few entries per device are enough to improve the hit rate by up to 12%. But, as shown in Chapter 3, by changing the page table data structure to a ring and using only two cache entries per ring, it is possible to predict 100% of the accessed addresses, and this is the best solution for devices that use rings.

Bibliography

[1] Dennis Abts and Bob Felderman. A guided tour through data-center networking. ACM Queue, 10(5):10:10–10:23, May 2012.

[2] Brian Aker. Memslap - load testing and benchmarking a server. http://docs. libmemcached.org/bin/memslap.html. libmemcached 1.1.0 documentation.

[3] AMD Inc. AMD IOMMU architectural specification, rev 2.00. http:// support.amd.com/TechDocs/48882.pdf, Mar 2011.

[4] Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, and Assaf Schuster. vIOMMU: efficient IOMMU emulation. In USENIX Ann. Technical Conf. (ATC), pages 73–86, 2011.

[5] Nadav Amit, Muli Ben-Yehuda, and Ben-Ami Yassour. IOMMU: Strategies for mitigating the IOTLB bottleneck. In Proceedings of the 2010 International Conference on Computer Architecture, ISCA'10, pages 256–274, Berlin, Heidelberg, 2012. Springer-Verlag.

[6] Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, and Edward D. Lazowska. The interaction of architecture and operating system design. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 108–120, New York, NY, USA, 1991. ACM.

[7] Anonymized. Efficient IOVA allocation. Submitted, 2014.

[8] Apachebench. http://en.wikipedia.org/wiki/ApacheBench.

[9] Apple Inc. Thunderbolt device driver programming guide: Debugging VT-d I/O MMU virtualization. https://developer.apple.com/library/mac/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/DebuggingThunderboltDrivers/DebuggingThunderboltDrivers.html, 2013. (Accessed: May 2014).

[10] Apple Inc. Thunderbolt device driver programming guide: Debugging VT-d I/O MMU virtualization. https://developer.apple.com/library/mac/documentation/HardwareDrivers/Conceptual/ThunderboltDevGuide/DebuggingThunderboltDrivers/DebuggingThunderboltDrivers.html, 2013. (Accessed: May 2014).

[11] ARM Holdings. ARM system memory management unit architecture specification — SMMU architecture version 2.0. http://infocenter.arm.com/help/topic/com.arm.doc.ihi0062c/IHI0062C_system_mmu_architecture_specification.pdf, 2013.

[12] Kavita Bala, M. Frans Kaashoek, and William E. Weihl. Software prefetching and caching for translation lookaside buffers. In USENIX Ann. Technical Conf. (ATC), pages 243–253, 1994.

[13] Thomas Ball, Ella Bounimova, Byron Cook, Vladimir Levin, Jakob Lichtenberg, Con McGarvey, Bohus Ondrusek, Sriram K. Rajamani, and Abdullah Ustuner. Thorough static analysis of device drivers. In ACM EuroSys, pages 73–85, 2006.

[14] Michael Becher, Maximillian Dornseif, and Christian N. Klein. FireWire: all your memory are belong to us. In CanSecWest applied security conference, 2005.

[15] Muli Ben-Yehuda, Jimi Xenidis, Michal Ostrowski, Karl Rister, Alexis Bruemmer, and Leendert van Doorn. The price of safety: Evaluating IOMMU performance. In Ottawa Linux Symp. (OLS), pages 9–20, 2007.

[16] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In ACM Int’l Conf. on Architecture Support for Programming Languages & Operating systems (ASPLOS), pages 117–128, 2000.

[17] James E.J. Bottomley. Dynamic DMA mapping using the generic device. https://www.kernel.org/doc/Documentation/DMA-API.txt. Linux kernel documentation.

[18] Randal E. Bryant and David R. O'Hallaron. Computer Systems: A Programmer's Perspective. Addison-Wesley Publishing Company, USA, 2nd edition, 2010.

[19] Brian D. Carrier and Joe Grand. A hardware-based memory acquisition procedure for digital investigations. Digital Investigation, 1(1):50–60, Feb 2014.

[20] Thadeu Cascardo. DMA API Performance and Contention on IOMMU Enabled Environments. https://events.linuxfoundation.org/images/ stories/slides/lfcs2013_cascardo.pdf.

[21] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. An empirical study of operating systems errors. In ACM Symp. on Operating Systems Principles (SOSP), pages 73–88, 2001.

[22] Cisco. Understanding latency. http://www.cisco.com/c/en/us/ products/collateral/switches/nexus-3000-series-/white_ paper_c11-661939.html, Jun 2012. White paper. Accessed: Aug 2014.

[23] Jonathan Corbet. Linux Device Drivers, chapter 15: Memory Mapping and DMA. O’Reilly, 3rd edition, 2005.

[24] John Criswell, Nicolas Geoffray, and Vikram Adve. Memory safety for low-level software/hardware interactions. In USENIX Security Symp., pages 83–100, 2009.

[25] Demartek, LLC. QLogic FCoE/iSCSI and IP networking adapter evaluation (previously: Broadcom BCM957810 10Gb). http: //www.demartek.com/Reports_Free/Demartek_QLogic_57810S_FCoE_ iSCSI_Adapter_Evaluation_2014-05.pdf, May 2014. (Accessed: May 2014).

[26] The Apache HTTP server project. http://httpd.apache.org.

[27] Roy T. Fielding and Gail Kaiser. The Apache HTTP server project. IEEE Internet Computing, 1(4):88–90, Jul 1997.

[28] Brad Fitzpatrick. Distributed caching with memcached. Linux J., 2004(124), Aug 2004.

[29] Brice Goglin. Design and implementation of Open-MX: High-performance message passing over generic Ethernet hardware. In IEEE Int’l Parallel & Distributed Processing Symp. (IPDPS), 2008.

[30] Abel Gordon, Nadav Amit, Nadav Har’El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, and Dan Tsafrir. ELI: Bare-metal performance for I/O virtualization. In ACM Int’l Conf. on Architecture Support for Programming Languages & Operating systems (ASPLOS), pages 411–422, 2012.

[31] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum. Failure resilience for device drivers. In IEEE/IFIP Ann. Int’l Conf. on Dependable Syst. & Networks (DSN), pages 41–50, 2007.

[32] Brian Hill. Integrating an EDK custom peripheral with a LocalLink interface into Linux. Technical Report XAPP1129 (v1.0), XILINX, May 2009.

[33] The . http://www.hoard.org/. (Accessed: May 2014).

[34] HP Development Company. Family data sheet: Broadcom NetXtreme network adapters for HP ProLiant Gen8 servers. http://www.broadcom.com/docs/features/netxtreme_ethernet_hp_datasheet.pdf, Aug 2013. Rev. 2. (Accessed: May 2014).

[35] Jerry Huck and Jim Hays. Architectural support for translation table management in large address space machines. In Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA, pages 39–50, New York, NY, USA, 1993. ACM.

[36] IBM Corporation. PowerLinux servers — 64-bit DMA concepts. http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liabm/ liabmconcepts.htm. (Accessed: May 2014).

[37] IBM Corporation. AIX kernel extensions and device support programming concepts. https://publib.boulder.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.kernelext/doc/kernextc/kernextc_pdf.pdf, 2013. (Accessed: May 2014).

[38] ibverbs evaluation. http://www.scalalife.eu/book/export/html/434. Accessed: Aug 2014.

[39] Intel Corporation. Intel virtualization technology for directed I/O, architecture specification, rev. 1.2. http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf, Sep 2008.

[40] Intel Corporation. Intel virtualization technology for directed I/O, architecture specification, rev. 2.2. http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf, Sep 2013.

[41] Bruce L. Jacob and Trevor N. Mudge. A look at several memory management units, tlb-refill mechanisms, and page table organizations. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS, pages 295–306, New York, NY, USA, 1998. ACM.

[42] Rick A. Jones. A network performance benchmark (revision 2.0). Tech- nical report, Hewlett Packard, 1995. http://www.netperf.org/netperf/ training/Netperf.html.

[43] Doug Joseph and Dirk Grunwald. Prefetching using Markov predictors. In Int’l Symp. on Computer Archit. (ISCA), pages 252–263, 1997.

[44] Asim Kadav, Matthew J. Renzelmann, and Michael M. Swift. Tolerating hardware device failures in software. In ACM Symp. on Operating Systems Principles (SOSP), pages 59–72, 2009.

[45] Gokul B. Kandiraju and Anand Sivasubramaniam. Characterizing the d-TLB behavior of SPEC CPU2000 benchmarks. In ACM SIGMETRICS Int’l Conf. on Measurement & Modeling of Comput. Syst., pages 129–139, Jun 2002.

[46] Gokul B. Kandiraju and Anand Sivasubramaniam. Going the distance for TLB prefetching: An application-driven study. In Int’l Symp. on Computer Archit. (ISCA), pages 195–206, 2002.

[47] Gregory Kerr. Dissecting a small InfiniBand application using the Verbs API. Computing Research Repository (arxiv), abs/1105.1827, 2011. http: //arxiv.org/abs/1105.1827.

[48] Doug Lea. A memory allocator. http://g.oswego.edu/dl/html/malloc. html, 2000.

[49] Doug Lea. malloc.c. ftp://g.oswego.edu/pub/misc/malloc.c, Aug 2012. (Accessed: May 2014).

[50] Joshua LeVasseur, Volkmar Uhlig, Jan Stoess, and Stefan Götz. Unmodified device driver reuse and improved system dependability via virtual machines. In USENIX Symp. on Operating System Design & Implementation (OSDI), pages 17–30, 2004.

[51] Vinod Mamtani. DMA directions and Windows. http://download.microsoft.com/download/a/f/d/afdfd50d-6eb9-425e-84e1-b4085a80e34e/sys-t304_wh07.pptx, 2007. (Accessed: May 2014).

[52] David S. Miller, Richard Henderson, and Jakub Jelinek. Dynamic DMA mapping guide. https://www.kernel.org/doc/Documentation/ DMA-API-HOWTO.txt. Linux kernel documentation.

[53] Jang Suk Park and Gwang Seon Ahn. A software-controlled prefetching mechanism for software-managed TLBs. Microprocess. Microprogram., 41(2):121–136, May 1995.

[54] PCI-SIG. Address translation services revision 1.1. https://www.pcisig. com/specifications/iov/ats, Jan 2009.

[55] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. Technical Report UW-CSE-13-10-01, University of Washington, Jun 2014.

[56] Ashley Saulsbury, Fredrik Dahlgren, and Per Stenström. Recency-based TLB preloading. In Int'l Symp. on Computer Archit. (ISCA), pages 117–127, 2000.

[57] Arvind Seshadri, Mark Luk, Ning Qu, and Adrian Perrig. SecVisor: A tiny hypervisor to provide lifetime kernel code integrity for commodity OSes. In ACM Symp. on Operating Systems Principles (SOSP), pages 335–350, 2007.

[58] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In USENIX Symp. on Operating System Design & Implementation (OSDI), pages 33–46, 2010.

[59] Michael Swift, Brian Bershad, and Henry Levy. Improving the reliability of commodity operating systems. ACM Trans. on Comput. Syst. (TOCS), 23(1):77–110, Feb 2005.

[60] Fujita Tomonori. Intel IOMMU (and IOMMU for Virtualization) perfor- mances. https://lkml.org/lkml/2008/6/5/164.

[61] Transpacket Fusion Networks. Ultra low latency of 1.2 microseconds for 1G to 10G Ethernet aggregation. http://tinyurl.com/transpacket-low-latency, Dec 2012. Accessed: Aug 2014.

[62] Carl Waldspurger and Mendel Rosenblum. I/O virtualization. Comm. of the ACM (CACM), 55(1):66–73, Jan 2012.

[63] Dan Williams, Patrick Reynolds, Kevin Walsh, Emin Gün Sirer, and Fred B. Schneider. Device driver safety through a reference validation mechanism. In USENIX Symp. on Operating System Design & Implementation (OSDI), pages 241–254, 2008.

[64] Paul Willmann, Scott Rixner, and Alan L. Cox. Protection strategies for direct access to virtualized I/O devices. In USENIX Ann. Technical Conf. (ATC), pages 15–28, 2008.

[65] Rafal Wojtczuk. Subverting the Xen hypervisor. In Black Hat, 2008. http://www.blackhat.com/presentations/bh-usa-08/Wojtczuk/BH_ US_08_Wojtczuk_Subverting_the_Xen_Hypervisor.pdf. (Accessed: May 2014).

[66] Ben-Ami Yassour, Muli Ben-Yehuda, and Orit Wasserman. On the DMA mapping problem in direct device assignment. In ACM Int’l Systems & Storage Conf. (SYSTOR), pages 18:1–18:12, 2010.

Abstract

I/O traffic, such as reads from a hard disk or from a network card, constitutes a major part of the work of computer systems, especially servers, and therefore directly affects overall system performance. To allow the CPU to perform the tasks it was designed for, most modern I/O devices can carry out I/O transfers without CPU intervention, writing or reading the data directly to or from the computer's memory.

Although direct memory access improves system performance considerably, it can cause stability and security problems. With direct access to the computer's memory, an I/O device can reach any location in memory and erase or overwrite any memory region. This can introduce bugs into the system and/or infect it with malicious software.

This problem has received much attention only recently, although a similar problem existed in the past for software processes running on the CPU. Before the development of processors that support virtual memory, every program running on the computer accessed memory directly, without any supervision. In this mode of operation, one program could, accidentally or otherwise, access the memory of another program.

In 1982, Intel's engineers introduced the 80286 processors, which included support for virtual memory. Virtual memory is a technique for managing and allocating the computer's memory; it forms an abstraction layer above the physical memory and presents the illusion of a large, contiguous memory. Each process has its own virtual address space and is thereby completely isolated from the other processes running alongside it.

The general idea of virtual memory is that a process does not access memory directly but rather through a virtual address that the system maps to a physical address: the system translates the virtual address to a physical one and directs the access (read or write) to the appropriate location in physical memory. This technique allows the system to control where the process may access and to prevent it from overwriting important parts of memory. From the process's point of view everything is transparent, and it need not be written in a way that assumes anything about the memory.

Virtual memory support involves the operating system, the file system, a unit that translates virtual addresses to physical ones, hardware interrupts, and the computer's memory. The hardware unit responsible for virtual memory is called the MMU.

Virtual memory serves three main purposes:

• It provides the programmer with a large, contiguous memory, allowing him to concentrate on writing the software without dealing with memory management. Virtual memory also relieves the programmer of the need to be careful not to destroy important parts of memory.

• It uses main memory efficiently. Main memory serves as a cache for the data of all processes, which is stored on the hard disk; data is moved between main memory and the disk as needed (what is in use resides in memory, and what is not resides on disk).

• It protects each process's memory space from collisions with other processes.

Because the problems that virtual memory solved are very similar to the problems created by direct access of I/O devices to memory, hardware designers duplicated the mechanism and introduced it into the device domain with minor changes. That is, in a computer system that enables virtual memory for I/O devices, the devices access memory through virtual addresses, a hardware unit translates these addresses into physical addresses, and the request (read or write) is then carried out. If an address is not mapped, the device does not have permission to access it, and the request (read or write) is not performed.

The device is thus separated from main memory by a virtual layer, which yields higher system stability and safety. The hardware unit responsible for virtual memory for I/O devices is called the IOMMU.

The benefits mentioned above do not come for free: it turns out that enabling this virtual memory incurs a non-negligible overhead. By running load tests with network cards, we found that performance degrades by up to about eighty percent.

When we studied the nature of I/O activity in depth, we found that there is a difference between the way software processes access memory and the way devices access memory. This difference was not taken into account when the virtual memory mechanism was copied to the I/O domain.

We mapped all the possible causes of the performance degradation that occurs when virtual memory is enabled for devices, measured the overhead of each cause, and explored various ways to reduce it. Our main conclusion is that the IOMMU device driver is the main contributor to the performance degradation. That is, if we manage to reduce the number of CPU cycles wasted on the device driver, we can improve system performance significantly even without improving the overhead of the hardware operation, because the hardware works in parallel with the CPU.

In this thesis we present three works:

• From studying various server workloads, we found that many devices, such as network cards, communicate with the operating system through circular arrays (rings) whose cells are accessed in sequential order, so it is possible to predict in advance when each cell of the array will be accessed. We designed an IOMMU that exploits this ring access pattern: instead of managing a page table implemented as a search tree, it manages a page table implemented as a flat circular array. The IOMMU has a cache that stores the virtual address translations; this cache is organized such that for each ring that is active at a given time, two entries are allocated: one holds the address currently being accessed (the "current" entry), and the other holds the next address to be accessed (the "next" entry). The cache eviction policy is to evict the "current" entry as soon as the "next" address is accessed, turn the "next" address into the "current" one, and fetch a new address from memory. Evaluating the new hardware we proposed with standard load benchmarks, we found an improvement of up to 7.56x in throughput relative to the original hardware. This work was accepted to ASPLOS 2015.

• The IOMMU device driver contains a component that is responsible for allocating the I/O virtual addresses. We measured that the system spends a great deal of time on the work this component performs, and we found that the complexity of the allocation algorithm is, in the worst case, linear in the number of allocated addresses. The author of the algorithm apparently relied on a particular allocation order under which the algorithm runs in constant time for a typical workload and in logarithmic time in the worst case. We proposed a mechanism that allocates addresses in constant time and improves performance by up to 4.6x. This work was accepted to FAST 2015.

• In the third work, we explored ways to reduce the misses in the translation cache by adding a mechanism that predicts which addresses will be accessed in the future and fetches their translations into the cache before they are accessed. To this end, we chose three prefetchers that have been studied in many papers and adapted them to operate in the I/O environment.
