The Real Difference Between Emulation and Paravirtualization of High-Throughput I/O Devices

Arthur Kiyanovski

Technion - Computer Science Department - M.Sc. Thesis MSC-2017-19 - 2017

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Arthur Kiyanovski

Submitted to the Senate of the Technion — Israel Institute of Technology, Av 5777, Haifa, August 2017

This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty of Computer Science.

Acknowledgements

I would like to dedicate this thesis to my late grandfather, Ben-Zion Kiyanovski, who passed away while I was doing the research for this thesis. My grandfather fought courageously against the Nazis in World War II. Without people like him, none of us would be here today.

I would like to thank my dear wife Assya for her infinite support, without which I wouldn't have been able to finish this research. I would like to thank my advisor, Prof. Dan Tsafrir, for his help and guidance along the way.

The research leading to the results presented in this paper was partially supported by the Israel Science Foundation (grant No. 605/12).

The generous financial help of the Technion is gratefully acknowledged.

Contents

List of Figures

Abstract 1

Abbreviations and Notations 3

1 Introduction 5

2 Background 9
  2.1 TCP Essentials 9
    2.1.1 TCP Checksum Offloading 9
    2.1.2 TCP Segmentation Offloading 9
    2.1.3 TCP Congestion Control 10
    2.1.4 TCP SRTT 11
  2.2 Linux Network Stack Implementation Essentials 13
    2.2.1 The Socket Buffer 13
    2.2.2 NAPI 13
  2.3 QEMU Essentials 13
    2.3.1 Main Threads of QEMU 14
    2.3.2 The qemu global mutex 14
  2.4 The Intel Pro/1000 PCI/PCI-X NICs (Bare Metal E1000) 14
    2.4.1 Control Registers 15
    2.4.2 Main Actions During Normal Operation of the Bare Metal E1000 16
  2.5 The QEMU Emulated Intel Pro/1000 PCI/PCI-X NIC (E1000) 17
    2.5.1 Interrupt Coalescing 18
  2.6 The QEMU Virtio-Net Paravirtual NIC 19
    2.6.1 Interrupt and Kick Suppression 19
    2.6.2 TX Interrupts 19
  2.7 Virtio-Net TCP Send Sequences in Throughput Workloads 20
    2.7.1 Virtio-Net Dual Core Send Sequence 20
    2.7.2 Virtio-Net Single Core TCP Throughput Send Sequence 21

3 Motivation 27
  3.1 Interposition 27
  3.2 Emulated I/O Devices 28
  3.3 Paravirtual I/O Devices 28
  3.4 Emulated vs Paravirtual Devices 28
    3.4.1 Guest Modification 28
    3.4.2 Performance 29
  3.5 Emulated vs Paravirtual NICs in Different … 30
  3.6 Emulation vs Paravirtualization Comparison Model 30

4 Experimental Setup 35
  4.1 Hardware Setup 35
  4.2 Benchmarks 35
    4.2.1 Single Core Throughput Benchmark 35
    4.2.2 Dual Core Throughput Benchmark 36

5 Single Core Configuration 37
  5.1 Baseline Comparison 38
  5.2 Removal of TCP Checksum Calculation 38
  5.3 Removal of TCP Segmentation 39
  5.4 Improved Interrupt Coalescing 40
    5.4.1 ITR and TADV Conflict 41
    5.4.2 Static Set Interrupt Rate 42
    5.4.3 Interrupt Rate Considering ITR 43
    5.4.4 Evaluation 44
  5.5 Send from the I/O Thread 44
    5.5.1 Interrupt Coalescing in Virtio-Net 47
  5.6 Exposing PCI-X to Avoid Bounce Buffers 47
  5.7 Dropping Packets to Improve TSO Batching in Linux Guests 48
  5.8 Vectorized Send 53
  5.9 SRTT Calculation Algorithm Bug in Linux 54
    5.9.1 SRTT Calculation in the … 55
    5.9.2 Bug Description 56
    5.9.3 Effects of the Bug 56
    5.9.4 Bug Fix 57
  5.10 Final Throughput Comparison 58
  5.11 Improvements Summary 60

6 Initial Work on a Dual Core Configuration 65
  6.1 Baseline Comparison 65
  6.2 Scalability of the Emulated E1000 66
  6.3 Sidecore 68

    6.3.1 Partial Sidecore Implementation 69

7 Related Work 75

8 Future Work 77
  8.1 Challenges on the Way to Full Sidecore Emulation of E1000 78
    8.1.1 ICR 78
    8.1.2 IMC and IMS 79

9 Conclusion 81

List of Figures

2.1 CWND values over time, for two TCP connections with the same source and destination, one starting transmission at t=0, the other at t=100[sec], and both using the Cubic congestion avoidance algorithm ...... 12

2.2 Baseline Eh register exits ...... 17

2.3 Eh emulation of ICR reading. Some implementation details have been removed ...... 18

2.4 Vh dual core setup, single batch send sequence ...... 22

2.5 Vh single core setup, single batch send sequence ...... 24

2.6 Baseline Vh exits ...... 25

3.1 Google search results illustrating the problems with VMware Tools 29

3.2 Throughput comparison of Eb emulation vs a paravirtual NIC in different hypervisors. In QEMU/KVM and VirtualBox the paravirtual device is virtio-net, and in VMware Workstation it is vmxnet3 31

5.1 Throughput comparison between baseline Vh and baseline Eh 39
5.2 Change in throughput caused by removing the calculation of TCP checksum in Eh 40
5.3 Change in throughput caused by removing the TCP segmentation code in Eh 41
5.4 Throughput difference achieved when using the 2 types of interrupt coalescing heuristics described in subsections 5.4.2 (static) and 5.4.3 (ITR sensitive) 44
5.5 Change in throughput caused by using the improved static interrupt coalescing setting in Eh 45
5.6 Change in throughput caused by moving the sending of packets from the VCPU thread to the I/O thread in Eh 46
5.7 Change in throughput caused by enabling PCI-X mode in Eh 49
5.8 Vh compared to Eh with improvements 1-5 50
5.9 Throughput and CWND values over time, with Eh, running netperf TCP STREAM with default 16KB message size. At around 25 seconds packet dropping is enabled. 52
5.10 Change in throughput caused by adding the packet dropping for better TSO batching in Eh 53
5.11 Change in throughput caused by adding the packet dropping for better TSO batching in Vh 54
5.12 Change in throughput caused by using vectorized sending in Eh 55
5.13 The routine in Linux kernel 3.13 that calculates the new SRTT given the RTT of the currently ACKed packet and the previous SRTT (irrelevant code omitted) 56
5.14 The code of tcp_rtt_estimator() after fixing the bug in SRTT calculation (irrelevant code omitted) 57
5.15 tp->srtt values as they react to RTT values over time, in both the original implementation of tcp_rtt_estimator() on the left and the fixed version on the right 58

5.16 Difference in throughput of Eh with all improvements but packet dropping, with the SRTT bug and after fixing it ...... 58

5.17 Difference in throughput of Vh with the SRTT bug and after fixing it . . 59

5.18 Throughput comparison of the best versions of Vh and Eh ...... 59

5.19 Throughput increase achieved by adding our improvements to Eh . . . . 62

5.20 Throughput increase caused by each of the improvements added to Eh out of the highest achieved throughput ...... 63

6.1 Throughput comparison between baseline Vh and baseline Eh when running our dual core basic throughput benchmark ...... 66

6.2 Throughput difference between the best version of Vh when run on a single core vs when run on 2 cores ...... 67

6.3 Throughput difference between the best version of Eh when run on a single core vs when run on 2 cores ...... 68

6.4 Throughput difference between the best single core version of Eh and the best partial sidecore-emulated version of Eh 71
6.5 Throughput difference between the best partial sidecore-emulated version of Eh and the best version of Vh when running our dual core basic throughput benchmark 72
6.6 Throughput difference between the best partial sidecore-emulated version of Eh and the best version of Vh when running our dual core basic throughput benchmark without packet dropping 73

8.1 State machine of the interrupt raising algorithm for sidecore emulating IMC and IMS ...... 80

Abstract

Emulation of high-throughput Input/Output (I/O) devices for virtual machines (VMs) is appealing because an emulated I/O device works out of the box, without the need to install a new device driver in the VM when moving the VM from one hypervisor to another. The problem is that fully emulating a hardware device can be costly due to the multiple exits it induces. Installations therefore often prefer paravirtual I/O devices, which reduce the number of exits by making VMs aware that they are being virtualized, at the cost of having to install a new device driver when moving from one hypervisor to another. Previous studies report that paravirtual I/O devices provide 5.5–40x higher throughput than emulated ones, leading to the perception that emulated I/O devices are significantly inferior to paravirtual ones, despite the appealing properties of emulation. We challenge this perception and show that the throughput difference between QEMU's emulated e1000 and paravirtual virtio-net network devices is largely due to various implementation differences that are unrelated to virtualization. We resolve many of these differences and show that, consequently, the throughput difference between virtio-net and e1000 can be reduced from 20–77x to as little as 1.2–2.2x. We speculate that resolving the remaining differences will reduce this throughput difference further.



Abbreviations and Notations

I/O : Input/Output
OS : Operating System
DMA : Direct Memory Access
KVM : Kernel-based Virtual Machine
QEMU : Quick Emulator
NIC : Network Interface Controller
TX : Transmit. Usually refers to the transmit ring buffer of a NIC
RX : Receive. Usually refers to the receive ring buffer of a NIC
MTU : Maximum Transmission Unit, in this work 1500 bytes
IP : Internet Protocol
TCP : Transmission Control Protocol
RTT : Round Trip Time
SRTT : Smoothed Round Trip Time
RTO : Retransmission Timeout
CWND : TCP congestion window
ACK : A TCP segment sent to acknowledge the reception of a TCP segment

Eb : Intel PRO/1000 PCI/PCI-X NIC family

Ed : Linux kernel e1000 device driver in the context of a physical machine

Eh : QEMU emulation of the Eb

Eg : Linux kernel e1000 device driver in the context of a guest virtual machine

Vh : QEMU implementation of the virtio-net paravirtual NIC

Vg : virtio-net guest driver



Chapter 1

Introduction

Machine virtualization has been increasing in popularity in recent years as more and more computing is done in the cloud. Virtual machines use virtual I/O devices to perform their I/O operations. The most commonly used virtual I/O devices today are paravirtual I/O devices [KMN+16, Vmw09, Rus08, BDF+03], which are specifically designed for virtual environments and do not use the interface of any physical device. Paravirtual I/O devices combine good performance with interposition, which is necessary for many different applications, such as live migration, I/O device consolidation and aggregation, among others.

However, paravirtual devices have drawbacks for both the user and the hypervisor provider. The user is required to install compatible device drivers when moving from one hypervisor to another, since nowadays different hypervisors support different paravirtual devices. Moreover, the hypervisor provider must implement the required device driver for all operating systems (OS).

The above drawbacks do not exist when using emulated I/O devices, which are virtual I/O devices that implement the same interface as some existing physical I/O device. Like paravirtual I/O devices, emulated I/O devices also provide the benefits of interposition. However, since emulated I/O devices implement the specifications of known physical devices, there is no need to install new device drivers when moving from one hypervisor to another: the device driver for the physical device, already installed in the guest, will also work with the emulated I/O device. Furthermore, in the case of emulated I/O devices, it is unnecessary for the hypervisor provider to implement the device driver for each OS, since the driver installed in the guest OS for the physical device will also work with its emulated counterpart.
Despite these benefits, emulated I/O devices are rarely used in real life scenarios, due to the common conception that their performance is substantially lower than that of paravirtual I/O devices [Mol07, BNT17, RLM13, KPS+09, VMS+16, ESP+09]. This difference in performance is mainly attributed to the significantly larger number of virtualization exits caused by emulated devices [MAC+08, PS12, KMN+16]. To check whether this is indeed the case for throughput workloads, we present in Chapter 3 a model that tries to assess the maximum possible throughput of the e1000


emulated NIC when compared to the virtio-net paravirtual NIC, assuming exits are the only difference between the two NICs. We use NICs because they are the most I/O intensive of all I/O devices, and we wanted to see the most extreme effects possible. Our model predicted 1.13x better throughput for virtio-net, whereas the initial measured throughput was 20x better. This indicates that virtualization exits are not the dominant reason for this difference.

We went on to research the differences between the e1000 and virtio-net devices, in an attempt to find the factors, other than exits, that contribute to the throughput gap. For our research we chose a setup of guest-to-host communication, assigning a single core to the guest machine. This setup does not include a physical NIC in the I/O path, and all network I/O passes through the host kernel. We chose this simple setup to minimize interference and focus on the real differences between the virtual devices.

In Chapter 5 we present the differences, unrelated to virtualization, which negatively affect the throughput of e1000. For each difference we present an improvement to e1000 that reduces this difference as much as possible. Our proposed improvements significantly increased the performance of e1000, reducing virtio-net's advantage from 20x better throughput to as little as 1.2x for large message sizes. This shows that e1000 can achieve throughput much closer to that of virtio-net than suggested in the literature. These results indicate that the superiority of paravirtualization over emulation in high throughput scenarios is actually a misconception, and that the reason for the large throughput gap is not the abundance of exits in emulated devices, but rather the incorrect implementation of I/O processing in emulated devices.
Then, in Chapter 6, we expand our setup by assigning two cores to QEMU to run the guest machine, which enabled us to use one of the cores as a sidecore [KRS07]. The sidecore paradigm was used in several previous works [ABYTS11, HGL+13, KMN+16]. However, to the best of our knowledge, ours is the first attempt to implement a full-fledged emulated I/O device with a sidecore. While we used our sidecore to reduce only some of the exit inducing accesses to the e1000 device, it proved very beneficial for throughput scenarios. When running our dual core benchmarks, we were able to reduce the advantage of virtio-net from 25x better throughput to 1.25x for large message sizes. These results once more show that e1000 can achieve throughput much closer to that of virtio-net than initially suggested by previous works.

In Chapter 8 we describe the performance aspects of emulated vs paravirtual devices that we did not cover in this work, discuss the challenges in using a sidecore to reduce more exit inducing accesses to the e1000 device, and suggest directions for future work. We conclude in Chapter 9.

In this work we will frequently refer to the virtio-net and e1000 virtual NICs, as well as to their device drivers. For simplicity, we will henceforth employ the abbreviations listed in Table 1.1. The contents of this table also appear in the Abbreviations and Notations list; they are repeated here to avoid confusion, as they are not standard.


Abbreviation  Meaning
Eb            Intel PRO/1000 PCI/PCI-X NIC family
Ed            Linux kernel e1000 device driver in the context of a physical machine
Eh            QEMU emulation of Eb
Eg            Linux kernel e1000 device driver in the context of a guest virtual machine
Vh            QEMU implementation of the virtio-net paravirtual NIC
Vg            virtio-net guest driver

Table 1.1: Abbreviations for e1000 and virtio-net virtual NICs and device drivers



Chapter 2

Background

In this chapter we present background on the different aspects of virtual network communications: the TCP protocol, the Linux kernel network stack implementation, and the operation of the e1000 emulated and the virtio-net paravirtual devices.

2.1 TCP Essentials

The Transmission Control Protocol, or TCP, is one of the most prevalent networking protocols on the Internet. We will not describe this complicated protocol in detail, but only those parts of it essential to understanding our work.

2.1.1 TCP Checksum Offloading

To ensure reliable data transfer by TCP, each TCP segment contains a checksum field, which is roughly a 1’s complement sum of all words in the TCP segment. Back in the days when network traffic was relatively scarce, the TCP stack calculated the TCP checksum for every segment as part of the segment creation process. Calculating the TCP checksum for large TCP segments is CPU intensive, and is an undesired overhead when high network performance is required. TCP Checksum Offloading, or TCO, is the capability of a NIC to calculate the TCP checksum for the packets it processes. TCO improves latency, since dedicated NIC hardware performs checksum calculation quicker than a general purpose CPU, and also adds parallelism, since while the NIC is calculating the checksum, the CPU can continue doing other useful work.
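The 1's complement sum mentioned above can be sketched in C roughly as follows. This is an RFC 1071-style fold; the helper name is ours, and a real TCP stack additionally includes a pseudo-header of IP addresses and lengths in the sum, which we omit here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the Internet (1's complement) checksum that TCO offloads
 * to the NIC. Sums 16-bit big-endian words, folds the carries, and
 * returns the 1's complement of the result. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {                       /* sum 16-bit words */
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                                /* odd trailing byte, zero-padded */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)                       /* fold carries into low 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;                  /* 1's complement of the sum */
}
```

A receiver can verify a segment by checksumming it with the checksum field included: the result is 0 when the data is intact, which is the property that makes the per-word loop (and hence its offload to dedicated NIC hardware) worthwhile for large segments.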

2.1.2 TCP Segmentation Offloading

The maximum size of an IP packet is 64KB, but in order for such a packet to be sent on the Ethernet wire, it needs to be segmented into smaller MTU (typically 1500 Bytes) size packets. The trivial way of achieving this is by never creating packets larger than MTU in the kernel. This solution hurts throughput oriented workloads, since the packet


creation process consists of numerous stages (adding a TCP header, adding an IP header, etc.), and the smaller the packet, the larger the relative overhead per byte sent. To improve throughput, TCP segmentation offload, or TSO, was invented. TSO is the capability of a NIC to segment packets larger than MTU into smaller MTU size packets before sending them on the wire. With TSO, packets created by the network stack can be larger than MTU, and thus throughput oriented workloads suffer less from the packet creation overhead.
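The amortization behind TSO can be illustrated with a tiny helper (ours, purely illustrative) that computes how many wire packets one large TSO packet becomes, assuming a maximum segment size of 1460 bytes, i.e., a 1500-byte MTU minus 40 bytes of TCP/IP headers:

```c
#include <assert.h>

/* Assumed MSS: 1500-byte MTU minus 20-byte IP and 20-byte TCP headers. */
enum { MSS = 1460 };

/* How many MTU-sized wire packets the NIC emits for one TSO packet. */
static unsigned tso_segments(unsigned payload_bytes)
{
    return (payload_bytes + MSS - 1) / MSS;   /* ceiling division */
}
```

With TSO, the stack pays its per-packet creation cost once for a 64KB packet, while the NIC emits all 45 wire segments; without it, the stack would run the full creation pipeline 45 times.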

2.1.3 TCP Congestion Control

TCP is a protocol that provides reliability. Reliability means that all the data that was sent from the source will eventually arrive at the destination. For example, if TCP detects that a packet was lost, it will retransmit the packet to ensure reliability. But retransmitting a packet might not be enough, as the reason for the loss of the packet might be high congestion on the path of the TCP connection. High congestion means that somewhere along the path from source to destination there is a network node (e.g. a router) that receives more networking traffic than it can handle, which forces this node to drop some of the packets arriving at it. If TCP only retransmitted a lost packet, future packets would probably be dropped again due to the same congestion. A better solution is not only to retransmit the packet, but also to reduce the rate at which packets are sent, which reduces the congestion and thus the chance of packets being dropped in the future. When to reduce the transmission rate due to congestion and when to increase it again is determined by the congestion avoidance algorithm, which is a part of TCP. There are many different congestion avoidance algorithms, but the general idea is very similar.

Before we continue, we need to define a few terms we will be using. In a TCP connection, "in flight" packets are packets that were already sent but for which no ACK packet has yet been received. It is important to note that, in the Linux kernel implementation of TCP, a packet is considered "in flight" the moment it is added to the TX ring. TCP holds a variable called CWND, which is short for congestion window. CWND holds the number of MTU sized packets that can be in flight for the current TCP connection. So, for example, if CWND=10, TCP will allow a transmission of data equivalent to 10 MTU sized packets before stopping to wait for an ACK for the first packet, at which time it will be able to send another packet.
The TCP connection has different states, but we will discuss only two of them to explain the general idea: Slow Start and Congestion Avoidance. Each connection starts in the slow start state, where it remains for as long as no congestion is detected by TCP. In slow start, CWND is increased by one each time an ACK is received for a packet, ensuring that the transmission rate increases for as long as possible. Once congestion is detected, usually when no ACK for a sent packet has been received for a predefined amount of time, the TCP connection switches to the congestion avoidance


state. Upon switching to the congestion avoidance state, CWND is reduced, to reduce the transmission rate, and thus to reduce the congestion caused by the current TCP connection. To ensure the reliability of the connection, the lost packet is retransmitted. Different congestion avoidance algorithms behave differently at this point. The Tahoe algorithm, for example, reduces CWND to 1 and switches back to slow start. The Reno algorithm, on the other hand, halves CWND (CWND = CWND/2), skips the slow start state and enters a third state, called Fast Recovery, which we will not cover here. For the purposes of this thesis, it is important only that CWND grows as long as no congestion is detected. Once congestion is detected, CWND is reduced, and when CWND increases again, it does so from its new value.

We conclude this explanation of the congestion avoidance algorithm by showing how CWND changes over time in a real life example. Figure 2.1, taken from Figure 2 in [LSM08], shows the values of CWND for two different connections between the same source and destination machines, when running the Cubic congestion avoidance algorithm, which is also the default algorithm used in Linux 3.13, the version used in all machines in this thesis. At time 0, a single TCP connection starts transmitting. Initially, CWND increases until it reaches the maximum possible rate of one of the nodes on the transmission path, at which time it experiences congestion in the form of packet loss by this node. This packet loss causes a drop in CWND from 8500 to 7000. CWND then begins to increase again, until another packet is lost and CWND drops once more. We observe that this pattern is fairly stable, with a CWND median point of around 7800. This is what a stable single TCP connection looks like when there are no changes in the network it flows through.
This pattern is often called the TCP Sawtooth, since it resembles the teeth of a saw. After 100 seconds, another TCP connection initiates transmission to the same destination as the first one. Naturally, the second connection adds congestion to the line. This congestion initially causes more packet losses to the first connection, since its CWND is initially larger, and therefore the chances of a packet loss are greater for the first connection. While the CWND of the first connection is decreasing due to congestion, the CWND of the second connection increases until the path is saturated. As time passes, both connections stabilize around the same CWND value, around half of the value that the first connection stabilized at when it was the only connection on the path. This also demonstrates the fairness property of TCP. Fairness in this context means that after enough time is allowed for stabilization, all TCP connections get the same share of network bandwidth.
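The CWND dynamics described above can be condensed into a toy model. We use Reno-style halving for brevity (as noted, Linux 3.13 actually defaults to the considerably more involved Cubic), and all names here are ours, not the kernel's:

```c
#include <assert.h>

/* Toy Reno-style congestion control state; cwnd counts MTU-sized packets. */
struct cc_state {
    unsigned cwnd;      /* congestion window */
    unsigned ssthresh;  /* slow start / congestion avoidance boundary */
    unsigned acked;     /* ACKs seen since the last cwnd bump */
};

static void cc_on_ack(struct cc_state *c)
{
    if (c->cwnd < c->ssthresh) {
        c->cwnd++;                      /* slow start: +1 per ACK */
    } else if (++c->acked >= c->cwnd) { /* congestion avoidance: */
        c->acked = 0;                   /* roughly +1 per round trip */
        c->cwnd++;
    }
}

static void cc_on_loss(struct cc_state *c)
{
    c->ssthresh = c->cwnd / 2;          /* Reno halves the window */
    c->cwnd = c->ssthresh > 1 ? c->ssthresh : 1;
    c->acked = 0;
}
```

Driving this model with a stream of ACKs punctuated by occasional losses reproduces the sawtooth shape of Figure 2.1: linear-ish growth, a halving at each loss, and regrowth from the new value.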

2.1.4 TCP SRTT

TCP Retransmission Timeout, or RTO, is the amount of time TCP waits for the ACK of a transmitted segment. If the ACK isn’t received within RTO time from the sending of the packet, it is considered lost and needs to be retransmitted [PACM11]. The value


Figure 2.1: CWND values over time, for two TCP connections with the same source and destination, one starting transmission at t=0, the other at t=100[sec], and both using the Cubic congestion avoidance algorithm

of RTO depends on the round trip time, or RTT, of a packet, which is the time it takes for a packet to travel to the destination plus the time it takes for the ACK of this packet to travel back to the source. The dependency of RTO on RTT is reasonable: we would expect a packet sent to a nearby machine on the same local network to arrive within a few microseconds, with an RTO that is also on the order of microseconds, whereas a packet sent to a destination on the other side of the world might take seconds to arrive, with an RTO on the order of seconds.

At times there might be singular spikes in the RTT of a single packet due to interference or congestion on the line. To mitigate those spikes and have a stable estimate of the current value of RTT, a Smoothed Round Trip Time, or SRTT, is maintained. SRTT is updated each time an ACK is received for a sent packet, using the following formula:

SRTT = (1 − α) ∗ SRTT + α ∗ RTT (2.1)

RTT in equation 2.1 is the RTT for the last ACKed packet, and the value of α depends on the system. In our Linux version 3.13, α equals 1/8. In general, the smaller the α, the slower SRTT responds to changes in RTT.
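With α = 1/8, equation 2.1 reduces to the shift-based update below. This is our fixed-point sketch of the idea, in the style kernels tend to use, not the actual Linux routine (which is the subject of Section 5.9):

```c
#include <assert.h>

/* Fixed-point sketch of equation 2.1 with alpha = 1/8:
 * srtt = srtt - srtt/8 + rtt/8, implemented with shifts. */
static unsigned srtt_update(unsigned srtt, unsigned rtt)
{
    return srtt - (srtt >> 3) + (rtt >> 3);
}
```

A steady RTT leaves SRTT unchanged, while a single spike moves it only one eighth of the way toward the new sample, which is exactly the smoothing effect described above.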


2.2 Linux Network Stack Implementation Essentials

2.2.1 The Socket Buffer

NICs use a ring buffer called TX for the queuing of packets for the NIC to send. However, TX is not the only queue used for the sending of packets. The socket implementation in the Linux kernel holds an additional send queue, the socket buffer. The socket buffer is used when the TX ring cannot be used, e.g., because it is already full or because TCP doesn't allow more packets to be sent before ACKs for previous packets are received (as explained in Section 2.1.3). If packets can be added to TX, whenever a new packet is created it is immediately added to TX to be sent. However, once packets can no longer be added to TX, they accumulate in the socket buffer. The socket buffer also contributes to the aggregation of sent messages into large packets in the TCP stack. If the last packet currently in the socket buffer is smaller than the maximum allowable size of an IP packet (64KB), then a new message sent via this socket will be appended to the end of this last packet.
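The tail-coalescing behavior just described can be modeled in a few lines of C (a toy model; the function and constant names are ours):

```c
#include <assert.h>

#define IP_MAX_BYTES 65536u   /* maximum IP packet size */

/* Toy model of socket-buffer tail coalescing: a new message is merged
 * into the last queued packet while that packet stays within the 64KB
 * IP limit; otherwise a fresh packet is started.
 * Returns 1 if a new packet was created, 0 if the message was merged. */
static int sockbuf_queue(unsigned *tail_pkt_bytes, unsigned msg_bytes)
{
    if (*tail_pkt_bytes + msg_bytes <= IP_MAX_BYTES) {
        *tail_pkt_bytes += msg_bytes;   /* merge into the tail packet */
        return 0;
    }
    *tail_pkt_bytes = msg_bytes;        /* start a new tail packet */
    return 1;
}
```

This coalescing is what lets many small application writes leave the stack as a handful of large packets, which in turn is what makes the TSO batching of Section 2.1.2 effective.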

2.2.2 NAPI

New API, or NAPI, is an interface for interrupt mitigation in the Linux kernel. The goal of NAPI is to increase the throughput of a network device by reducing the number of receive interrupts raised by the device. Algorithms 2.1 and 2.2 show how NAPI achieves interrupt mitigation in two stages. Algorithm 2.1 runs as part of the top half of the receive interrupt handler of the network driver. It turns the interrupts off, which stops them from flooding the device driver in the event of a large stream of received packets. Algorithm 2.2, which runs as part of the bottom half of the interrupt handler, processes the packets in RX, up to a budget. If there are more packets than the budget, which means that the packet load in RX is high, interrupts are not enabled. Instead, the bottom half is scheduled again to continue polling RX. If, on the other hand, there were fewer than the budgeted number of packets in RX, which means the load on RX is low, there is no reason to waste CPU cycles on polling RX, and so interrupts are enabled again.

Algorithm 2.1 NAPI top half algorithm
1: turn interrupts off
2: schedule NAPI bottom half

2.3 QEMU Essentials

In this section we present implementation details of the QEMU hypervisor, which we refer to many times in this thesis.


Algorithm 2.2 NAPI bottom half algorithm
1: while received packets < budget AND there are packets in RX do
2:     receive next packet
3: end while
4: if received packets == budget then
5:     reschedule NAPI bottom half
6: else
7:     turn interrupts on
8: end if
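The two NAPI halves can be rendered as compact C over a toy device model (the struct and field names are ours, not the kernel's NAPI API):

```c
#include <assert.h>

/* Toy device model for the NAPI algorithms above. */
struct dev {
    int rx_pending;     /* packets waiting in the RX ring */
    int irq_enabled;    /* are receive interrupts on? */
    int rescheduled;    /* is the bottom half scheduled to poll again? */
};

/* Algorithm 2.1: top half of the receive interrupt handler. */
static void napi_top_half(struct dev *d)
{
    d->irq_enabled = 0;     /* turn interrupts off */
    d->rescheduled = 1;     /* schedule the bottom half */
}

/* Algorithm 2.2: bottom half, polls RX up to a budget. */
static void napi_bottom_half(struct dev *d, int budget)
{
    int done = 0;
    while (done < budget && d->rx_pending > 0) {
        d->rx_pending--;    /* receive next packet */
        done++;
    }
    if (done == budget) {
        d->rescheduled = 1; /* high load: keep polling */
    } else {
        d->rescheduled = 0;
        d->irq_enabled = 1; /* low load: re-enable interrupts */
    }
}
```

Under a burst, the device oscillates between polling passes with interrupts off; once a pass drains RX below the budget, interrupts come back on and the driver returns to interrupt-driven operation.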

2.3.1 Main Threads of QEMU

QEMU has several different types of threads, two of which are relevant to this work: (1) A VCPU thread for each virtual CPU allocated to the guest machine. The guest machine runs in the context of these threads, and whenever an exit from the guest to the host occurs, its initial handling is also done in the context of the VCPU thread. (2) A single I/O thread per guest machine, which handles all asynchronous events via an event loop. An asynchronous event may be a received network packet from the outside world that needs handling, or an event scheduled by a VCPU thread, such as sending a network packet to the outside world.

2.3.2 The qemu global mutex

The qemu global mutex is a mutex used to synchronize accesses to the global resources of QEMU by the different QEMU threads. The qemu global mutex is very similar to the historical Linux Big Kernel Lock [LH02]. In the context of this thesis, we are interested in how the qemu global mutex is used by the I/O and VCPU threads of QEMU. The VCPU thread holds the mutex while running in the context of QEMU (as a result of an exit), and releases it right before reentering the guest context via the call to KVM's ioctl. The I/O thread takes the mutex when it wakes up from polling its file descriptors, holds it while handling the file descriptor events (such as sending and receiving packets), and releases it right before going back to sleep to poll its file descriptors again.

2.4 The Intel Pro/1000 PCI/PCI-X NICs (Bare Metal E1000)

Eb is a family of Gigabit Ethernet NICs. It uses two descriptor ring buffers to perform I/O: the TX ring is used for packet transmission, and the RX ring for packet reception. Each descriptor in the TX and RX rings points to a single buffer, which can be as large as 16KB according to the specifications of Eb, but the buffers are actually set to be 4KB in size by Ed. This means that a single packet can take more than a single descriptor on the ring.


Eb also has control registers, which can be used by software to manage it. These registers are exposed as a single contiguous array. The array is logically divided into 4KB pages, with each page containing all the registers pertaining to a certain functionality; for example, the registers in pages 2 and 3 control the reception and transmission of packets, respectively. These registers are accessed by Ed via Memory-Mapped I/O (MMIO). In the following subsections we describe the control registers that are of most interest to us in this thesis, as well as the main actions taken by both Eb and Ed during normal I/O processing by Eb.

2.4.1 Control Registers

The following list describes the control registers accessed during normal I/O processing by Eb.

STATUS Device Status. A bitmask of status bits. For example, bit 1 in the STATUS register is set iff the link is up.

ICR Interrupt Cause Read. Once an interrupt is raised, ICR holds all the causes of the interrupt. For example, bit 1 is set iff the TX ring is empty. The Eb specifications [Int09] define read/clear behavior for this register, meaning that when ICR is read by software it is atomically set to 0.

ITR Interrupt Throttling. The minimum interval between two consecutive interrupts, in 256ns increments.

TADV Transmit Absolute Interrupt Delay Value. The maximum interval between two consecutive transmit-completed (TXDW) interrupts, in 1.024 microsecond increments.

IMS Interrupt Mask Set. A bitmask of allowed interrupts, in the same bit order as the ICR register. For example, Eb will raise an interrupt when the TX ring is empty only if bit 1 of IMS is set.

IMC Interrupt Mask Clear. A bitmask of interrupts to be cleared in IMS. For example, if the device driver wants to turn off the interrupt raised when the TX ring is empty, it writes the value 0x2 (only bit 1 set) into IMC, which causes Eb to clear bit 1 of IMS.

RDH Receive Descriptor Head. Points to the place on the RX ring where the next received packet will be placed by the NIC.

RDT Receive Descriptor Tail. Points to the place on the RX ring where the device driver will find the next received packet to process.

TDH Transmit Descriptor Head. Points to the place on the TX ring where the next packet to be sent by the NIC resides.


TDT Transmit Descriptor Tail. Points to the place on the TX ring where the device driver should put the next packet it wants to send.

2.4.2 Main Actions During Normal Operation of the Bare Metal E1000

During normal operation of Eb there are four main actions that involve accesses to Eb's control registers by Ed: sending a packet, receiving a packet, the top half of the interrupt handler, and the NAPI bottom half of the interrupt handler. We now describe how each of these actions is performed by both Ed and Eb, emphasizing the control registers accessed. For easier reading, whenever we refer to one of the control registers of Eb, we use the name of the register in capital letters without the word "register" after it; e.g., "the TDT register" is written as "TDT".

Sending a Packet

The Linux kernel network stack sends a packet by calling the e1000_xmit_frame() function, which puts the packet on the TX ring starting from the descriptor pointed to by TDT, and advances TDT by the number of descriptors the new packet occupies in the TX ring. A single packet might take more than one descriptor on the ring if it is larger than the maximum size of a single buffer that a descriptor can point to, which is 4KB in the current Ed. The update of TDT serves as a notification to Eb that there is a new packet on the TX ring that needs to be sent. In response to the change in TDT, Eb sends the packet and sets bit 0 in ICR to indicate that the sending is complete. Finally, Eb raises an interrupt if bit 0 in IMS is set, once the interrupt timers (ITR, TADV) indicate that an interrupt may be raised.

Receiving a Packet

When a new packet arrives on the Ethernet line, Eb puts the packet on the RX ring, starting from the descriptor pointed to by RDH, and advances RDH by the number of descriptors the new packet occupies in the RX ring. Eb then sets bit 7 in ICR to indicate that a packet was received. Finally, Eb raises an interrupt if bit 7 in IMS is set, once the interrupt timers (ITR, TADV) indicate that an interrupt may be raised.

Top Half of Interrupt Handling

When an interrupt is raised by Eb, the top half of the interrupt handler (e1000_intr()) in Ed is called. The top half first reads ICR to retrieve all the interrupt causes that accumulated in it. It then signals Eb to stop raising interrupts by setting all bits of IMC. This write to IMC is then flushed by reading the STATUS register. Finally, the top half schedules the NAPI bottom half to run later.


NAPI Bottom Half of Interrupt Handling

The NAPI bottom half (e1000_clean()) works in five stages:
1. It clears all the descriptors on the TX ring whose packets have already been sent by Eb.
2. It goes over the RX ring from RDT to RDH and pushes the packets pointed to by these descriptors up the network stack, while rearming the processed descriptors with newly allocated buffers.
3. It advances RDT over the already processed descriptors.
4. It updates the value of ITR according to the workload properties observed by Ed.
5. It enables all interrupts by setting the bits in IMS back on. This write to IMS is then flushed by reading the STATUS register.
Since this is a NAPI bottom half, stages 2 and 3 are done in chunks; after each chunk, if there are still unhandled packets on the RX ring, the bottom half is rescheduled without proceeding to the next stages. Stages 4 and 5 run only when there are no packets left on the RX ring.

2.5 The QEMU Emulated Intel Pro/1000 PCI/PCI-X NIC (E1000)

Eh uses the trap-and-emulate [Gol74] paradigm to emulate Eb. As already explained in Section 2.4, Ed controls Eb by accessing its control registers. Therefore the natural way to emulate Eb is to trap each access to these control registers by Eg, and emulate the necessary behavior according to the accessed register. To accomplish this, when Eg asks to map the control registers to memory, Eh leaves this memory unmapped. From that point on, each access by Eg to the control registers causes an exit, which is then handled by Eh. Figure 2.2 shows the exits that occur during the transmission of a single packet and the reception of a single ACK, when running the netperf TCP_STREAM benchmark on a single core, as described in our single core benchmark in Section 4.2.

[Figure: guest/host timeline of a single packet send and receive. The guest exits once on the write to TDT (packet send + receive emulation), three times in the interrupt top half (ICR, IMC, STATUS), and four times in the interrupt bottom half (RDT, ITR, IMS, STATUS), with single packet preparation in between.]

Figure 2.2: Baseline Eh register exits

Figure 2.2 is a good graphic representation of the actions involving accesses to control registers by Eg, as described in subsection 2.4.2. On the left side of the figure we see the exit caused by the access to TDT. During the emulation of this exit, the packet added to the TX ring is sent, and if there are packets available for reception, they are put on the RX ring. After handling the TX and RX rings, Eh raises an interrupt to signal Eg that the rings were handled. At this point the top half of the interrupt handler


static uint32_t
mac_icr_read(E1000State *s, int index)
{
    uint32_t ret = s->mac_reg[ICR];

    s->mac_reg[ICR] = 0;
    return ret;
}

Figure 2.3: Eh emulation of ICR reading. Some implementation details have been removed.

in Eg begins. As already explained in subsection 2.4.2, three registers are accessed in the top half: ICR, IMC and STATUS. In the bottom half, as described in subsection 2.4.2, we see four exits due to accesses to RDT, ITR, IMS and STATUS. This shows that in our setup, when baseline Eh is used to send a stream of TCP packets, each packet sent causes an overhead of 8 control-register-related exits.
Figure 2.3 is presented to give the reader a better sense of what it means to emulate a register access. The figure shows the code of the mac_icr_read() method that is called by Eh whenever an exit is caused by reading ICR in Eg. mac_icr_read() simply clears the value of ICR and returns the previous one. This creates the necessary illusion for the guest that reading ICR clears it atomically. Other implementation details of mac_icr_read() were removed for clarity.

All control register exits of Eh are handled in the context of the VCPU thread of QEMU. This includes the actual sending of the packets by Eh to their destination. The I/O thread of QEMU is used only for receiving packets.

2.5.1 Interrupt Coalescing

Algorithm 2.3 shows the interrupt coalescing mechanism of Eh. It is called whenever Eh decides an interrupt is needed, e.g., when a packet is received, and also when the software timer mit_timer expires. The algorithm uses the IMS, TADV and ITR registers, as well as mit_timer, which is used to defer interrupts according to the values of the ITR and TADV registers set by the guest. The guest also disables interrupts by writing to the IMC register (which clears bits in the IMS register) while the NAPI bottom half is receiving packets, which further defers interrupts.

Algorithm 2.3 Eh interrupt coalescing algorithm
1: if IMS == 0 OR mit_timer running then
2:     don't inject interrupt
3: else
4:     start mit_timer according to the values in ITR and TADV
5:     inject interrupt
6: end if
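Algorithm 2.3 maps to a few lines of C. The sketch below is a simplification under our own names (coalesce_state, maybe_inject_irq); it is not QEMU's actual implementation, and it models mit_timer as a plain flag rather than a real timer.

```c
#include <stdint.h>

/* simplified state for Algorithm 2.3; names are ours, not QEMU's */
struct coalesce_state {
    uint32_t ims;           /* interrupt mask set by the guest */
    int mit_timer_armed;    /* software timer deferring interrupts */
};

/* returns 1 iff an interrupt is injected now */
static int maybe_inject_irq(struct coalesce_state *s)
{
    if (s->ims == 0 || s->mit_timer_armed)
        return 0;           /* masked or throttled: defer the interrupt */
    /* arm mit_timer for an interval derived from ITR and TADV */
    s->mit_timer_armed = 1;
    return 1;
}

/* called when mit_timer expires; a still-pending cause re-runs the check */
static void mit_timer_expired(struct coalesce_state *s)
{
    s->mit_timer_armed = 0;
}
```

The two guards mirror the two deferral mechanisms in the text: IMS == 0 models the guest masking interrupts via IMC, and the armed timer models ITR/TADV throttling.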


2.6 The QEMU Virtio-Net Paravirtual NIC

Like Eh, Vh also uses the trap-and-emulate paradigm, and uses two descriptor ring buffers (called vring virtqueues), TX and RX, for transmitting and receiving packets respectively. Unlike with Eh, each descriptor in the rings of Vh points to a list of all the buffers of a single packet, and the maximum buffer size is 64KB.

The numerous registers of Eh are replaced by a single notification port in Vh. The act of writing to this notification port via Port I/O is called a "kick". Whenever Vg adds a packet to the TX ring, it kicks the TX virtqueue, which causes an exit to Vh, and Vh sends the packets on the TX ring.

2.6.1 Interrupt and Kick Suppression

Similarly to Eb's IMC register, Vh also supports suppression of interrupts. For this purpose the virtqueue has a variable called used_event, which holds the index of a descriptor in the virtqueue. Only when the descriptor pointed to by used_event is consumed by Vh will an interrupt be raised. Vg sets the used_event variable in its NAPI bottom half to point to the first available place on the RX ring. This way, the first received packet initiates an interrupt, but all subsequent packets that are received while the NAPI bottom half is still scheduled do not.

Vh also allows suppression of kick exits. For this purpose the virtqueue has a variable called avail_event, which holds the index of a descriptor in the virtqueue. When a new packet is added to the virtqueue, the kick writes to the notification port of the virtqueue, causing an exit to QEMU, only if avail_event points to the place in the virtqueue where the packet was added; otherwise, the kick skips the exit-inducing write to the notification port. The packet sending routine in Vh (start_xmit()) first sends all available packets on the TX ring, and then sets avail_event to point to the first available place on the TX ring. Thus, in a burst of packets sent by Vg, the kick following the first packet in the burst causes an exit, but the following packets do not, as they are added before start_xmit() sends all the packets on the TX ring. This means that effectively there is a single kick exit for every burst of sent packets.

2.6.2 TX Interrupts

Vh is designed to reduce the number of TX interrupts to an absolute minimum without hurting performance. It does so by removing the already processed packets from the TX queue in the guest driver's send function rather than in the interrupt handler. This means that Vg does not wait for the next interrupt to clear the TX ring, so the chances of the TX ring ever becoming full are greatly reduced, and there is no need for a TX interrupt to clear the TX ring. This is in contrast to Eh, where the ring is cleared in the interrupt handler, not in the send function, thus requiring a TX interrupt.


2.7 Virtio-Net TCP Send Sequences in Throughput Workloads

We present here a detailed description of how I/O is processed by virtio-net during a TCP send throughput workload. We first describe the send sequence when running our dual core benchmark from Section 4.2, which is easier to understand, and then the send sequence when running our single core benchmark from Section 4.2, which is less intuitive and builds on the explanation of the dual core send. We will reference the single core send sequence in Chapter 5 to compare the send sequence of Eh to that of Vh when relevant.

2.7.1 Virtio-Net Dual Core Send Sequence

Vh was originally designed to work like a physical NIC, where I/O is processed by the device asynchronously. This is achieved by running the I/O processing of Vh on the I/O thread of QEMU, enabling the device to run in parallel to the VCPU threads of the guest. We now present an example that illustrates some key implementation optimizations used in Vh to improve throughput. In this example we run the dual core benchmark from Section 4.2.

We wish to highlight three features of the implementation of Vh, as seen in the following description of Figure 2.4. First, packets are sent and received in batches. This batching is due to the nature of TCP, which sends packets in batches to avoid congestion and packet loss; the sent batches cause ACK batches on the receive side. Second, there is only a single kick exit per batch, which reduces the exits caused by Vh. Third, there is only a single interrupt per batch, which is enabled by NAPI and greatly reduces the number of interrupts caused by Vh. As explained in subsection 2.1.3, at any given moment the TCP protocol allows a certain number of packets to be sent before the ACKs for these packets arrive. Let us assume that at this moment we can send N packets. This translates in the code to being able to add only N packets to TX during a send burst.
Figure 2.4 shows a graphic representation of sending a stream of TCP packets when

running the above dual core benchmark. In (a), Vg starts adding packets to be sent to TX, and the first packet added causes a kick exit because avail_event points to the first empty place on TX. During the handling of the kick exit, the I/O thread is signaled that there are packets to send, so it starts sending them. (b) shows that Vg continues adding more packets to TX, in parallel to the I/O thread sending them. After adding N packets to TX, the network stack stops adding further packets and waits for ACKs for the previously sent packets. When the I/O thread finishes sending all of the packets on TX, it moves avail_event to the first free place on TX, as seen in (c). Moving avail_event effectively re-enables the kick exit, which will be necessary to wake up the I/O thread to start sending the next batch. At some point the ACKs for the


sent packets arrive. (d) shows what happens with the first ACK. The I/O thread wakes up, puts the ACK on RX, and since used_event is pointing to the first empty place in RX, injects an interrupt to the guest. The interrupt handler in Vg schedules the NAPI receive function, which passes the ACK to the TCP stack. Since this is an ACK, the number of in-flight packets decreases, enabling a new packet to be added to TX. The TCP stack immediately inserts the next packet to be sent (t1) into TX. Adding t1 causes a kick exit since avail_event points to the first free place in TX. Since the I/O thread is currently receiving packets, t1 will not be sent until there are no more packets to receive. (e) shows that the I/O thread continues adding packets r1..rN to RX without causing an interrupt, since used_event continues pointing at the place where r1 was added. The NAPI bottom half on the VCPU thread continues reading the packets from RX and adding new packets to TX. (f) shows that once NAPI finishes clearing the RX ring, it moves used_event to point to the next free position in RX. Meanwhile the I/O thread, which no longer has packets to receive, goes on to send all the packets on TX.

2.7.2 Virtio-Net Single Core TCP Throughput Send Sequence

In subsection 2.7.1 we attributed the high throughput Vh achieves when running our dual core benchmark to three features: packet batching in both send and receive, reduced kick exits, and reduced interrupts via NAPI. We now wish to explore the send sequence of Vh when using our single core benchmark, and show how it is still able to achieve these three features. Figure 2.5 illustrates the send sequence for a single core. As in the dual core send sequence, the received ACK packets initiate the sending of the next batch of packets. Therefore, we begin from the point in time right before the first ACK in a batch is received. At this point both TX and RX are empty and the VCPU thread is halted. From the description of (e) in the figure, it will be easy to see why this is the case, but for now we take these initial conditions as a given.

In Figure 2.5 (a), packet r1 has been received, and Vh, in the context of the I/O thread, puts r1 at the head of RX. Since used_event is pointing to the current location in RX, Vh injects an interrupt to the guest without releasing the qemu_global_mutex. The interrupt wakes up the halted VCPU thread, which now has a higher priority than the I/O thread, and therefore context is switched to the VCPU thread. (b) In response to the interrupt, the guest runs the NAPI bottom half, which reads r1 and pushes it up the TCP stack. The TCP stack sees that it is an ACK, which lowers the number of in-flight packets and allows the next packet, t1, to be sent. Packet t1 is added to the TX tail as part of the processing of r1. Since avail_event is pointing to the TX tail, the kick that comes after adding t1 causes an exit to QEMU. The VCPU thread blocks upon trying to acquire the qemu_global_mutex, since it is being held by the I/O thread, and does not inform the I/O thread that there are new packets on TX. Context is switched back to the I/O thread in (c), where Vh continues adding packets r2,..,rN to


[Figure: six panels (a)-(f) showing the TX and RX virtqueues shared between the guest and QEMU, the positions of used_event and avail_event at each stage, and the kick, send, receive, and interrupt events of each stage.]

Figure 2.4: Vh dual core setup, single batch send sequence


RX, but since used_event is still pointing to the location of r1 in RX, interrupts are not injected to the guest during this process. (d) After placing packets r2,..,rN on RX, the I/O thread considers its work done, since it is still unaware of t1 being on TX. It thus releases the qemu_global_mutex, which wakes up the VCPU thread. Context is switched to the VCPU thread, which acquires the qemu_global_mutex, sends a signal to the I/O thread telling it to wake up and handle the packets on TX, releases the qemu_global_mutex, and enters the guest at the point of the last exit, right after the kick from (b). We continue in the guest, in the context of the VCPU, where NAPI resumes reading packets from RX and pushing them up the TCP stack. As with r1, each ACK packet received results in the TCP stack taking a packet from the socket buffer and placing it on TX. However, this time the kicks do not cause exits, since avail_event is still pointing to the position where t1 was added to TX. After consuming r1..rN, the NAPI bottom half has no more packets to read from RX. Therefore it moves used_event to the first free place on RX, effectively re-enabling interrupts. (e) Since packets were removed from the socket buffer during the receiving of r1,..,rN, netperf is now awakened from blocking, and continues calling sendto() in a loop until it blocks again when there is no more space in the socket buffer. At this point, since netperf is the only process in the guest, the guest halts, which causes a HALT exit. Control is switched to QEMU. Since the VCPU is halted, QEMU runs the I/O thread, which sends t1,..,tN. After sending t1,..,tN, avail_event is moved to the next free space on TX, effectively re-enabling the kick exit. (f) is the final stage, at which we start the next batch all over again. At this stage the VCPU is halted and TX and RX are empty, as in the initial state right before (a).

Even on a single core, Vh achieves the three benefits described in subsection 2.7.1. However, we now wish to emphasize two steps in the above sequence that are crucial to the batching of sent packets, and without which the other two benefits could not be achieved: (1) In stage (a) of Figure 2.5 the qemu_global_mutex is not released prior to injecting an interrupt to the guest. If it were, packets would not be sent in batches, because the mutex would be unlocked when the kick exit occurred, enabling the VCPU to lock the mutex and signal the I/O thread to wake up and handle the packet on TX. Context would immediately switch from the VCPU to the I/O thread to send this packet, and the kick exit would immediately be enabled again after sending it. (2) In stage (d) of Figure 2.5, when the kick exit is finally handled on the VCPU, context is not switched to the I/O thread for sending t1, due to the scheduling priorities of the VCPU and I/O threads. The I/O thread has lower priority since it was busy adding packets to RX while the VCPU thread was blocked on the mutex. Of course it is still possible to increase the priority of the I/O thread (or decrease that of the VCPU thread), which would cause control to switch to the I/O thread and again cancel batching.
Figure 2.6 summarizes the single core send sequence exits. It illustrates the much lower number of exits Vh causes when compared to Eh. As we saw in Figure 2.2, Eh


[Figure: six panels (a)-(f) showing the TX and RX virtqueues shared between the guest and QEMU, the positions of used_event and avail_event at each stage, and the kick, send, receive, and interrupt events of each stage.]

Figure 2.5: Vh single core setup, single batch send sequence


causes 8 exits per packet, and here we see that Vh causes 2 exits per batch of packets. When running the single core benchmark with 64K message sizes, Vh achieves batches of 48 packets each. Therefore, we get 2/48≈0.04 exits per packet in Vh, meaning that Eh causes 192x more exits per packet than Vh in this case.
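The exit arithmetic above can be checked in a couple of lines; the 8 exits per packet for Eh and the measured batch size of 48 are the values quoted in the text.

```c
/* exits per packet: Eh pays 8 exits for every packet, while Vh pays
 * 2 exits (one kick, one halt) per batch of packets */
static double exits_per_packet(double exits, double packets)
{
    return exits / packets;
}
```

With a batch of 48 packets, Vh pays 2/48 ≈ 0.042 exits per packet, and the ratio 8 / (2/48) comes out to 192.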

[Figure: guest/host timeline for a single batch: many packets are prepared for sending at once in the guest (batching), a single kick exit and a single halt exit cross to the host, where many packets are sent and received at once (batching).]

Figure 2.6: Baseline Vh exits



Chapter 3

Motivation

In this chapter we present the main motivation for our research. We start with a comparison between emulated and paravirtual I/O devices. Then we show that paravirtual devices achieve substantially higher throughput than emulated ones, both by quoting throughput measurements from previous works and by performing our own measurements using NICs in different hypervisors. We use NICs because they are the most I/O-intensive devices, and we wanted to observe the most extreme effects possible. We then examine the conception that the large number of virtualization exits in emulated devices is the reason for the large throughput gap between the two device types. We develop a model to calculate the theoretical maximum throughput that can be achieved with Eh, assuming the only difference between Eh and Vh is the number of exits the two cause. Our model indicates that, contrary to the common conception, exits alone should be a small factor in the throughput difference between Vh and Eh. This serves as our motivation to find the factors, besides exits, that hurt the throughput of Eh, to reduce these factors as much as possible, and to see the real throughput difference between Vh and Eh.

3.1 Interposition

Traditional I/O virtualization is implemented using the trap-and-emulate [Gol74, PG74] paradigm, in which the hypervisor exposes a virtual device to the guest, traps all requests of the guest to the virtual device, and emulates these requests using its physical I/O devices. When using the trap-and-emulate method, the virtual I/O of the guest is decoupled from the physical I/O performed by the host, allowing the host to interpose on the I/O activity of the guest. There are many benefits to interposition [RW11]. It allows the encapsulation of the state of a guest machine. The machine can be stopped, its state representation written to a file, and this file can be copied to a different host, where it can be later resumed without the guest ever knowing it was stopped. Live migration [CFH+05, NLH05] even allows a guest to be moved from one host to another without stopping its execution.


Consolidation of physical I/O devices is possible by multiplexing multiple guest I/O devices over a single physical I/O device in the host. Multiple physical I/O devices can also be aggregated to service a single virtual I/O device. This improves the performance of the device and allows masking of failures: when one of the physical devices malfunctions, others can take over. I/O interposition can also add capabilities that are not present in physical devices, such as replication of disk writes to multiple disks to allow recovery from disk failures, I/O compression, I/O encryption, and I/O metering and filtering, among others. Two types of virtual I/O devices implement interposition: emulated and paravirtual, which we present in the following sections.

3.2 Emulated I/O Devices

An emulated I/O device is a virtual I/O device implemented using the trap-and-emulate paradigm, imitating a known physical device. The hypervisor exposes to the guest the same interface as the device it emulates, and therefore the guest interacts with the emulated device exactly as it would interact with the physical device being emulated.

3.3 Paravirtual I/O Devices

A paravirtual I/O device, like its emulated counterpart, is a virtual I/O device implemented using the trap-and-emulate paradigm. However, a paravirtual device does not imitate any physical device. Instead, paravirtual I/O devices implement a new kind of device that is designed specifically for virtual environments and reduces the number of exits necessary to drive it.

3.4 Emulated vs Paravirtual Devices

Emulated and paravirtual devices differ in two aspects: guest modification and performance.

3.4.1 Guest Modification

When using an emulated device, no guest modification is necessary, since the guest uses the same device driver it would have used if it were connected directly to the physical device being emulated. In the case of a paravirtual device, however, it is necessary to install the compatible device driver on the guest machine. This has implications for both the user and the virtual device developers. The user needs to install a new driver in the guest each time it is moved to a cloud provider that uses a different hypervisor. The current reality is such that different hypervisor vendors use different paravirtual devices, e.g., VMware's vmxnet3 [Vmw09], KVM/QEMU's virtio-net [Rus08]


and Xen's PV [BDF+03]. Furthermore, installation of paravirtual device drivers often causes problems for the user. Figure 3.1 shows that searching Google for "problems with vmware tools" (a suite of paravirtual drivers for VMware's hypervisors) yields 2.5 million results, illustrating that users do indeed experience difficulties with the installation of paravirtual drivers. As for the paravirtual device developers, they need to develop and maintain a device driver for many different OSes. This is not the case with emulated devices, since the drivers written for the physical device by its vendors also work with the emulated device.

Figure 3.1: Google search results illustrating the problems with vmware tools

3.4.2 Performance

Paravirtual I/O devices are known to achieve higher performance than emulated ones. Table 3.1 summarizes the throughput differences between emulated and paravirtual network devices reported in the literature, with their respective citations. The table shows that throughput can be 5.5x-40x higher for paravirtual devices. This difference in performance is attributed in the literature [MAC+08, PS12, KMN+16] to the large number of exits emulated devices cause, while paravirtual devices are designed specifically to minimize exits. The superior performance of paravirtual devices makes them the popular choice in most of today's real-world virtualization applications [KMN+16].


Source                      Emu. Tput   Para. Tput   Para./Emu.
Molnar [Mol07]                   7.41       303.35         40x
Bugnion et al. [BNT17]            239         5230         22x
Rizzo et al. [RLM13]              250         3250         13x
Koh et al. [KPS+09]               100          550         5.5x
Vrijders et al. [VMS+16]          300       10,000         33x
Eiraku et al. [ESP+09]            200         1800         9x

Table 3.1: Emulated vs paravirtual network device throughput in Mbps in the literature

3.5 Emulated vs Paravirtual NICs in Different Hypervisors

To further examine the throughput difference between emulated and paravirtual I/O devices, we performed our own measurements using different hypervisors. We ran the netperf TCP_STREAM test from guest to host, as described in our single core benchmark in Section 4.2. Figure 3.2 compares the throughput of Eb emulation to that of the paravirtual NIC for three different hypervisors: QEMU/KVM, VirtualBox and VMware Workstation, when using our baseline single core, guest-to-host setup. The paravirtual NIC in the first two hypervisors is Vh, while in VMware Workstation it is vmxnet3, since a Vh implementation is unavailable. As predicted by the literature, the results for QEMU/KVM and VirtualBox show that the paravirtual Vh significantly outperforms the emulated Eh. We were unable to explain why Eh outperforms vmxnet3 in VMware Workstation, but we speculate that this is due to some problem with the installation of the paravirtual vmxnet3 drivers.

3.6 Emulation vs Paravirtualization Comparison Model

The results from the literature presented in Table 3.1 and our own measurements in Figure 3.2 show that paravirtual network devices indeed achieve throughput much higher than that of emulated network devices. However, the explanation that exits are the reason for this large throughput gap seemed unrealistic to us in high throughput scenarios, where we expect most of the I/O processing to be done in batches, with relatively few control register related exits. This discrepancy between the throughput achieved and our intuition regarding the overhead of exits led us to ask ourselves, “what is the real difference between emulation and paravirtualization?”

To answer this question, we devised a model for comparison between emulation and paravirtualization that starts with the assumption that the difference is indeed the much larger number of exits in emulation, as stated in the literature. Our model describes the throughput achieved by NICs as a special case of all I/O devices. In our model we assume that sending packets requires work that should be done regardless of and unrelated to virtualization. This work should require



Figure 3.2: Throughput comparison of Eb emulation vs a paravirtual NIC in different hypervisors. In QEMU/KVM and VirtualBox the paravirtual device is virtio-net, and in VMware Workstation it is vmxnet3

more or less the same effort, in CPU cycles, in both emulated and paravirtual devices. Otherwise, if our goal is to answer our question, the comparison is invalid. Qualitatively speaking, if said work takes W cycles per packet, the overhead of a single exit in cycles is E, there are N exits per packet sent, and the frequency of the CPU is CPUfreq, then the throughput of a virtual NIC can be described by equation 3.1.

throughput = CPUfreq / (cycles per packet) ∗ bytes per packet = CPUfreq / (W + N ∗ E) ∗ bytes per packet    (3.1)

Since CPUfreq and bytes per packet are held constant for both e1000 and virtio-net, we can restrict the comparison to cycles per packet.

Equations 3.2 and 3.3 below describe the above cycles per packet for virtio-net and e1000, denoted by Cv and Ce respectively. We mark the N and E variables with v and e subscripts for virtio-net and e1000 respectively. W is denoted without a subscript since, under our assumption, it is the same for both NICs.

Cv = W + Nv ∗ Ev (3.2)


Ce = W + Ne ∗ Ee    (3.3)

We subtract equation 3.2 from 3.3 to get equation 3.4:

Ce = Cv − Nv ∗ Ev + Ne ∗ Ee (3.4)

This equation is very straightforward. To calculate the time it takes to send a packet using e1000, we subtract the exit overhead of virtio-net from the time it takes to send a packet in virtio-net and add the exit overhead of e1000. We now express the variables on the right-hand side of equation 3.4 using the knowledge we acquired about the exits in e1000 and virtio-net by exploring their implementations in QEMU, as described in Sections 2.4, 2.5 and 2.6.

Nv can be expressed using equation 3.5. In this equation, Bv is the number of packets (b for batch size) sent by virtio-net for a single kick exit.

Nv = 1 / Bv    (3.5)

Ne can be expressed using equation 3.6. In this equation, the 1 is the exit due to an access to the TDT register, which occurs for every packet added to the TX ring. The second part of the right-hand side describes the number of exits in the handling of an interrupt. There are 7 exits on average per interrupt in e1000, as described in section 2.5. We multiply the total number of exits per interrupt by Iv, the number of interrupts per second in virtio-net, to get the total number of interrupt related exits per second. We allow ourselves to use Iv, since the number of interrupts per second is part of W, which under our assumption is the same for both NICs. Then we divide by Pe, the number of packets per second sent by e1000, to get the per-packet number of exits that occur during interrupt handling.

Ne = 1 + (Iv ∗ 7) / Pe    (3.6)

Pe can be expressed using equation 3.7. We divide the CPU frequency by the number of cycles it takes to send a single packet by e1000 to get how many packets are sent every second.

Pe = CPUfreq / Ce    (3.7)

Finally, after combining equations 3.4, 3.5, 3.6 and 3.7, we get equation 3.8, which is our final model of the number of cycles it takes for a single packet to be sent by e1000.
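The combination step can be verified numerically: the closed form for Ce that results (equation 3.8 below) must satisfy equations 3.4-3.7 for any parameter values. A small Python sketch of this check (the function names are ours, the symbols mirror the thesis notation):

```python
def cycles_per_packet_e1000(Cv, Ev, Ee, Bv, Iv, cpu_freq):
    """Closed-form Ce of equation 3.8."""
    return (Cv - Ev / Bv + Ee) / (1 - 7 * Iv * Ee / cpu_freq)

def check_model(Cv, Ev, Ee, Bv, Iv, cpu_freq):
    """Verify that the Ce above satisfies equations 3.4-3.7."""
    Ce = cycles_per_packet_e1000(Cv, Ev, Ee, Bv, Iv, cpu_freq)
    Pe = cpu_freq / Ce        # eq 3.7: packets/second sent by e1000
    Ne = 1 + 7 * Iv / Pe      # eq 3.6: exits per packet in e1000
    Nv = 1 / Bv               # eq 3.5: exits per packet in virtio-net
    # eq 3.4: Ce = Cv - Nv*Ev + Ne*Ee must hold up to rounding error
    return abs(Ce - (Cv - Nv * Ev + Ne * Ee)) < 1e-6 * Ce
```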

Ce = (Cv − Ev / Bv + Ee) / (1 − (Iv ∗ 7 ∗ Ee) / CPUfreq)    (3.8)

To get the values for the variables in this equation, we run our single core throughput


benchmark described in section 4.2 and measure the necessary parameters. We set the message size in netperf to 65160 bytes to make the packet size constant. To explain why this message size leads to a constant packet size, we must explain how the Linux network stack handles large packets. The maximum IP packet size is 64KB, but for packets larger than the MTU, the Linux network stack creates only packets whose payload divides evenly into MTU-sized segments, to enable easier segmentation for transmission on physical media. An MTU-sized packet in our setup can hold 1448 bytes of data, which, combined with the 52 header bytes of TCP/IP, adds up to the 1500 bytes of an MTU. 65160 bytes is the largest message size that fits in an IP packet and is divisible by 1448. When using this message size, the network stack in the guest creates exactly one packet from every message added, so we effectively force the packet size created by the network stack to be constant for both virtio-net and e1000.

Whereas most measurements necessary for the evaluation of equation 3.8 are collected in a straightforward manner, e.g., by looking at packet and interrupt counts provided by standard tools like ifconfig and procfs, measuring the exit overhead in cycles is more complicated. We define the overhead of an exit to be the time it takes for control to pass from the exit-causing instruction in the guest, through KVM, to QEMU, until the exit handling function in QEMU is called, plus the time from the moment the exit handling function in QEMU finishes execution until control returns to the guest and it resumes execution. To measure the overhead of an exit we use the RDTSC cycle counter instruction before and after the access to the relevant register in e1000's guest driver, and before and after the access to the kick port in virtio-net's guest driver.

The time spent between these two points includes both the overhead of the exit and the time spent in QEMU to handle it. Since we are looking for the exit overhead alone, we need to subtract the time spent handling the exit in QEMU. For this purpose we placed two more cycle counters before and after the functions in QEMU that actually perform the necessary emulation, which in e1000 are e1000_mmio_read() and e1000_mmio_write(), and in virtio-net is virtio_net_handle_tx_bh(). Using these cycle counters we measured the total overhead of exiting from the guest to QEMU (via KVM) and returning back from QEMU to the guest, excluding the actual handling of the exit in QEMU.

Table 3.2 shows the results of our measurements. Inserting these values into equation 3.8, we get that, according to our model, sending a packet with e1000 should take 152,564 cycles, which means that the achievable throughput of e1000 should be 8,200 Mbps. In reality, e1000 achieves 535 Mbps, which is 15x slower than optimally expected. This result strengthened our suspicion that exits are not the only cause of the throughput gap between emulated and paravirtual I/O devices. This was our motivation to research the differences between e1000 and virtio-net to find what really causes the throughput difference between them, as an indication of the real difference between emulation and paravirtualization.
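The bracketing logic of the exit-overhead measurement above can be illustrated in a few lines. The sketch below is an illustrative Python analogue only; the thesis instruments the guest driver and QEMU in C with the RDTSC instruction, not with Python timers:

```python
import time

def bracket(fn):
    """Run fn and return (elapsed_ns, result) -- the outer counter pair,
    analogous to the RDTSC reads around the register access in the guest."""
    start = time.perf_counter_ns()
    result = fn()
    return time.perf_counter_ns() - start, result

def exit_overhead(total_cycles, handler_cycles):
    """Exit overhead = guest-side round trip minus the time spent inside
    QEMU's emulation handler (the inner counter pair)."""
    return total_cycles - handler_cycles
```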


Cv        135,882 cycles
Ev        28,761 cycles
Bv        46.13 packets/kick-exit
Iv        382.7 interrupts/second
Ee        14,788 cycles
CPUfreq   2.4 GHz

Table 3.2: Measurements for calculating e1000 cycles per packet
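Plugging the measured values into equation 3.8 reproduces the figures quoted above; a short Python check (assuming 65160 payload bytes per packet, the netperf message size chosen earlier):

```python
# Measured parameters from Table 3.2.
Cv, Ev, Bv = 135_882, 28_761, 46.13
Iv, Ee, cpu_freq = 382.7, 14_788, 2.4e9

# Equation 3.8: modeled cycles per packet for e1000.
Ce = (Cv - Ev / Bv + Ee) / (1 - 7 * Iv * Ee / cpu_freq)

# Modeled throughput, assuming 65160 payload bytes per packet
# (45 MSS-sized chunks of 1448 bytes, as chosen in the benchmark).
payload = 45 * 1448                    # = 65160 bytes
mbps = cpu_freq / Ce * payload * 8 / 1e6
```

This yields roughly 152,564 cycles per packet and about 8,200 Mbps of modeled throughput, against the 535 Mbps measured in reality.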


Chapter 4

Experimental Setup

4.1 Hardware Setup

Our hypervisor is QEMU 2.2.0. Host and guest machines in our experiments run Ubuntu 14.04 with Linux kernel 3.13. As our representative virtual NICs we chose the emulated Intel PRO/1000 PCI/PCI-X NIC, which is widely emulated by popular hypervisors, and virtio-net, which is the paravirtual NIC in QEMU/KVM. We use the user-space implemented virtio-net device, and not the more performant kernel-space implemented vhost-net [Tsi09] device, since the current emulated Intel PRO/1000 PCI/PCI-X NIC is implemented in the user space and we want to have a fair comparison. The host is a Dell PowerEdge R610 with a single CPU socket, running a 4 core Intel Xeon E5620 processor, and has 16GB of RAM.

4.2 Benchmarks

Throughout this thesis, throughput results for Eh and Vh are achieved by using two types of benchmarks:

4.2.1 Single Core Throughput Benchmark

In this benchmark, the guest machine connects to the host machine via a TAP backend, which is the networking backend recommended by the QEMU documentation [QEM17]. We run the netperf TCP STREAM micro-benchmark, which continuously sends data from the guest to the netserver process in the host using the sendto system call. All threads of QEMU are pinned to a single core, while the netserver process is pinned to a different core on the same socket. We run the netperf test 10 times, for 30 seconds each, and report the average results. There is no actual NIC involved in our setup; the networking is done between the guest and the host via the host kernel. We chose this setup because it is the simplest in which to analyze and find inherent differences between emulated and paravirtual devices.
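A minimal harness for the repeated runs might look as follows. The netperf command line shown in the docstring is an assumption for this setup (-H remote host, -l test length in seconds, -P 0 to suppress banners, -m message size), and the default command is a stub so that the sketch runs anywhere:

```python
import shlex
import statistics
import subprocess

def mean_throughput(cmd="echo 5.0", runs=10):
    """Run a benchmark command `runs` times and average the last numeric
    field of its output. With netperf installed, cmd could be e.g.:
        netperf -H <host> -t TCP_STREAM -l 30 -P 0 -- -m 65160
    whose single result line ends with the throughput in 10^6 bits/sec."""
    results = []
    for _ in range(runs):
        out = subprocess.run(shlex.split(cmd), capture_output=True,
                             text=True, check=True).stdout
        results.append(float(out.split()[-1]))
    return statistics.mean(results)
```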


4.2.2 Dual Core Throughput Benchmark

This benchmark is similar to the single core benchmark, except that instead of pinning the QEMU threads to a single core, we pin the VCPU thread to one core and the I/O thread to another. Using this test we explore some of the synchronization and parallelism differences between Eh and Vh.
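The dual core pinning can be scripted with Python's sched_setaffinity. The sketch below is hedged: the "CPU" thread-name prefix is an assumption about how this QEMU version names its VCPU threads, and should be verified against /proc/&lt;pid&gt;/task/*/comm on the actual system:

```python
import os

def core_for_thread(comm, vcpu_core=0, io_core=1):
    """Map a QEMU thread name to a core: VCPU threads to one core,
    everything else (including the I/O thread) to another.
    The 'CPU' prefix is an assumption about QEMU's thread naming."""
    return vcpu_core if comm.startswith("CPU") else io_core

def pin_qemu_threads(qemu_pid):
    """Pin each thread of a running QEMU process to its designated core
    (Linux only; requires appropriate permissions)."""
    for tid in os.listdir(f"/proc/{qemu_pid}/task"):
        with open(f"/proc/{qemu_pid}/task/{tid}/comm") as f:
            comm = f.read().strip()
        os.sched_setaffinity(int(tid), {core_for_thread(comm)})
```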


Chapter 5

Single Core Configuration

In this chapter we present our efforts to find the real difference between Eh and Vh as representatives of emulation and paravirtualization on a single core setup.

Section 5.1 shows an initial comparison of baseline Eh vs baseline Vh. Then sections 5.2 - 5.8 present a series of differences we found between Vh and Eh, unrelated to virtualization, that hurt the throughput of Eh. For each such difference we describe an improvement to Eh that attempts to reduce the throughput gap as much as possible. Then, for each such improvement, we include an evaluation of the impact of this improvement on the throughput of Eh.

Each evaluation of an improvement shows a figure with 8 graphs, comparing 2 setups of Eh. The red line represents Eh with all previously presented improvements, not including the improvement currently discussed, and the green line represents Eh with all previous improvements including the current one. This way we see both the benefit of adding the current improvement and the gradual growth of the throughput with each added improvement. We present the improvements in the order they were discovered during our research.

In each evaluation figure we present 4 metrics:
1. Throughput - the throughput as reported by the netperf benchmark;
2. Interrupts - interrupts per second injected by the virtual NIC into the guest;
3. Packet size - the size of the IP packet created by the network stack in the guest;
4. Packets/batch - the average number of packets sent by the virtual NIC in a single call to its send function.
Each of the above metrics is represented by 2 graphs, regular and normalized. The regular graph shows the absolute values of the metric, and the normalized graph shows the values normalized to the version of Eh without the current improvement (the red line), i.e., the impact of the improvement relative to the previous setup.

Table 5.1 summarizes the list of improvements we made to Eh in the order presented in this chapter for easy reference, as we will be using the improvement numbers in the evaluation figures.

In section 5.9 we present a bug we found in the Linux kernel's TCP implementation


#   improvement name
1   removal of TCP checksum calculation
2   removal of TCP segmentation
3   improvement of interrupt coalescing
4   sending from the I/O thread
5   exposing PCI-X mode to avoid bounce buffers
6   dropping packets to improve TSO in guest
7   using vectorized sending

Table 5.1: List of improvements made to Eh to improve its throughput

and our proposed fix to this bug. Because this bug greatly affects the throughput of Eh, all the results in this thesis, except for those specifically stating baseline Eh or baseline Vh, were obtained after this bug was fixed in the guest machine.

Finally, in section 5.10 we compare Eh, with all of our improvements applied, to Vh, and discuss the results.

5.1 Baseline Comparison

Figure 5.1 compares the baseline throughput of Eh vs. Vh as they are implemented in our version of QEMU. This figure will serve as our initial point of comparison. In this figure we can see not only the large throughput gap, already presented in Figure 3.2, but also a few hints as to the reasons for it:
1. Except for message sizes of 512-4K, Vh injects significantly fewer interrupts than Eh;
2. The packets created by the network stack in the guest are much larger for Vh than for Eh;
3. Vh sends many packets in a single send sequence in QEMU, while Eh sends only a single packet per send sequence.
We will be addressing these issues in the next sections.

5.2 Removal of TCP Checksum Calculation

Eb devices [Int09] support TCO. Eh emulates TCO by calculating the TCP checksum in software before sending the packet to its destination. To show why this is an unnecessary overhead for most cases, let us consider the two possible destination types for a packet. If the packet destination is within the current host, e.g., it is sent to another guest on the same host or to the host itself, there is no need to calculate the TCP checksum since the packet never travels through a lossy medium. If, on the other hand, the destination is outside of the host, the packet will travel through one of the bare metal NICs installed in the host machine. If the bare metal NIC supports TCO (as is usually the case nowadays), it will be able to calculate the TCP checksum in hardware, much more efficiently than can be done by QEMU.
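The work being removed is the standard 16-bit ones'-complement Internet checksum. A self-contained Python version (illustrative only, not QEMU's actual code) shows that it touches every byte of the payload, which is why its cost grows with packet size:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum over 16-bit words."""
    if len(data) % 2:                  # pad odd-length input with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF
```

Appending the computed checksum to the data makes the checksum of the whole buffer zero, which is how a receiver verifies it.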

Vh does not calculate the TCP checksum when not necessary, and uses virtio-net headers to avoid the dropping of packets due to incorrect checksums. These headers are

38

Figure 5.1: Throughput comparison between baseline Vh and baseline Eh

flags that are added to the beginning of each packet and tell the host network stack whether it needs to verify the TCP checksum before passing the packet on. They can be implemented by any virtual network device, as already done for the e1000e and vmxnet3 devices in QEMU. However, we did not implement virtio-net headers in e1000; instead we added a single-line patch to the tap device in the host kernel telling it to always ignore TCP checksums. We did this as a proof of concept that removing the checksum calculation is indeed beneficial, but a full virtio-net headers implementation is necessary as a complete solution.

Figure 5.2 shows the improvement in throughput gained by removing the unnecessary TCP checksum calculation from Eh when compared to the baseline Eh. As expected, the larger the packet size, the more data is checksummed, and the more time is saved per packet when removing the checksum calculation. This reduced packet processing time translates to higher throughput but also to a higher interrupt rate with larger packets: an interrupt is raised for every sent packet, and since sending a packet takes less time, this happens more frequently.

5.3 Removal of TCP Segmentation

Eb devices support TSO. Eh emulates TSO by segmenting large packets into MTU size ones in software. This is unnecessary in most cases. If the destination of the packet is inside the host, the Linux kernel networking infrastructure (tap, bridge) can handle

39

Figure 5.2: Change in throughput caused by removing the calculation of the TCP checksum in Eh

packets larger than MTU, eliminating the need for segmentation. And if the destination is outside the host, the packet will travel through one of the bare metal NICs installed in the host machine, which probably supports TSO and will be able to segment the packets much more efficiently than QEMU.
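Software TSO emulation amounts to the following splitting step, sketched here in Python for illustration (the real code must also replicate headers and fix up sequence numbers, which we omit):

```python
MSS = 1448  # payload bytes per MTU-sized segment in our setup

def segment(payload: bytes, mss: int = MSS):
    """Split a large TCP payload into MSS-sized chunks, as software TSO
    emulation must do for every oversized packet."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]
```

For the 65160-byte messages used in our benchmarks, this produces exactly 45 full-sized segments per packet.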

Vh does not segment the packets into MTU-size ones when this can be done by the NIC. We did the same for Eh by removing the code that segments the large packets.

Figure 5.3 shows the improvement in throughput gained by removing the unnecessary TCP segmentation code from Eh when compared to the throughput measured after the previous improvement. Just as with the previous improvement, the larger the message size, the larger the packets created by the guest network stack, leading to more segmentation and thus to more time saved per packet when segmentation is eliminated. This reduced packet processing time translates to higher throughput but also to a higher interrupt rate with larger packets: an interrupt is raised for every sent packet, and since sending a packet takes less time, this happens more frequently.

5.4 Improved Interrupt Coalescing

Interrupt coalescing in Eh is implemented using a timer. When this timer expires, an interrupt is injected into the guest. The interval of the timer is set according to the interrupt coalescing registers of Eb. We observed that the interrupt rate used in the


Figure 5.3: Change in throughput caused by removing the TCP segmentation code in Eh

baseline implementation of Eh was too high, greatly decreasing throughput. We first describe some problems we found with the current implementation of the interrupt coalescing registers in Eh (subsection 5.4.1), and then present improved interrupt coalescing in Eh for throughput scenarios (subsections 5.4.2 and 5.4.3), followed by an evaluation of the improved interrupt coalescing. We leave the comparison of interrupt coalescing in Eh to that in Vh to subsection 5.5.1.

5.4.1 ITR and TADV Conflict

There is an inherent conflict between the interrupt coalescing registers TADV and ITR in the official specifications of Eb and the way they are used in Ed. According to the specifications, TADV "can be used to ENSURE that a transmit interrupt occurs at some predefined interval after a transmit is completed" and ITR is defined as the "minimum inter-interrupt interval". While the TADV register is set by Ed to 32 once during the initialization of the NIC (which translates to approximately 32 microseconds), ITR is constantly updated by Ed according to the type of load (throughput or latency) to values between 195 and 976 (which translate to 49 and 250 microseconds respectively). So in a throughput scenario, Ed asks Eb to set the inter-interrupt interval to be both smaller than 32 microseconds and larger than 243 microseconds, which is a contradiction.
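The microsecond figures above follow from the register units in the Intel datasheet: ITR counts in 256 ns increments and TADV in 1.024 µs increments. A quick conversion check:

```python
ITR_UNIT_NS = 256     # ITR resolution per the 8254x datasheet
TADV_UNIT_NS = 1024   # TADV resolution: 1.024 us per unit

def itr_to_us(value):
    """Convert an ITR register value to microseconds."""
    return value * ITR_UNIT_NS / 1000

def tadv_to_us(value):
    """Convert a TADV register value to microseconds."""
    return value * TADV_UNIT_NS / 1000
```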

The implementors of Eh chose to use the highest interrupt rate set by either TADV or


ITR. This effectively disables ITR, as it always instructs lower interrupt rates than TADV, and sets an interrupt rate of one interrupt every 32 microseconds for all workloads. We speculate that this was a cautionary decision to ensure correctness. However, the high interrupt rate hurts throughput workloads. We decided to change this decision and ignore the high interrupt rate set by TADV for throughput scenarios. Instead we chose to use ITR as our guide for interrupt rates in throughput scenarios. We support our decision not only by the obvious throughput benefits we gain but also by implicit signs in the driver implementation showing that ITR is the correct register to use for interrupt coalescing. The implementation of Eg changes the value of ITR on every interrupt raised, to adhere to the current workload type as perceived by Eg. This manipulation of ITR is a clear indication that the driver developer intended to manage the interrupt rate using ITR. And since the driver developer is Intel, which is also the manufacturer of Eb, it is reasonable to assume ITR is the correct register for interrupt coalescing manipulation.

For any reasonable implementation of a device driver for Eb, our decision should not affect correctness, and indeed we did not observe any errors since making this change. The change might harm performance only if the interrupt rate is too low, which does not happen, at least in the scenarios we tested, as we show in the following subsections.

5.4.2 Static Set Interrupt Rate

We wanted to get an indication of an interrupt rate that produces good throughput, ignoring the values in ITR and TADV. We used the netperf benchmark in our guest-to-host setup. We started from the initial inter-interrupt interval of 32µs set in TADV and gradually increased it in increments of 32µs, until the interval was too large and we started seeing a drop in throughput. We got the best throughput when the timer interval was set to the equivalent of TADV=320 or ITR=1280, which is 10 times longer than the interval set in baseline Eh. We will call this inter-interrupt interval TI, short for Throughput Interval. In this static setup, we use TI as our inter-interrupt interval for all scenarios, as it produces greatly improved throughput for all message sizes and is also always greater than ITR, which conforms to our choice to use ITR as our minimum inter-interrupt interval indicator.

We note 2 things about the above choice of an interrupt rate: (1) TI does not necessarily achieve the best throughput for all scenarios and setups. Our objective was to show that the current interrupt rate is not configured well for throughput, and to select an interrupt rate setting that is clearly beneficial. (2) The search for the best interrupt rate was done with all of our improvements to Eh described in this chapter, including those to be presented in the next sections, since each added improvement changes the optimal interrupt rate.

While setting a single interrupt rate for all scenarios improves throughput across all message sizes, it also increases latency. For example, running netperf TCP RR using our


single core setup yields a result of around 357µs per request-response when using TI for interrupt coalescing, as opposed to the 194µs achieved by baseline Eh, when the value in TADV is used as the inter-interrupt interval - an increase of 80% in latency. To address this issue, a heuristic is needed to decide whether the current workload is a latency workload or a throughput workload, and to change the interrupt coalescing timer accordingly. In the next subsection we present our efforts in this direction.

5.4.3 Interrupt Rate Considering ITR

Since our static increase of the inter-interrupt interval increases latency, we needed to somehow decide when Eh is handling a latency workload, and in these cases use the interrupt rate of the baseline Eh. This doesn't mean that the interrupt rate of baseline Eh is optimal for latency, but our work focuses on throughput, and our only concern at this stage is not to negatively affect the latency achieved without our changes.

Eg already looks at the current workload and decides whether it is latency or throughput. When the workload is latency according to Eg, it sets ITR to 195, and when the workload is throughput, Eg sets ITR to numbers higher than 195, up to a maximum of 976. In this heuristic we use the TADV value for the inter-interrupt interval, as in the original implementation of Eh, when ITR=195 (latency), and TI when ITR>195 (throughput scenarios). This way latency workloads are not negatively affected by the decreased interrupt rate.

Figure 5.4 shows the throughput difference achieved when adding both our static interrupt coalescing setting and the ITR-sensitive interrupt coalescing heuristic to all previous improvements. The static setting achieves better throughput across all message sizes, while the ITR-sensitive heuristic makes an incorrect decision that the workload is a latency workload for message sizes smaller than 4K.

There are 2 problems with the ITR-sensitive heuristic: (1) The ITR values we presented are device-driver implementation dependent, so this solution will work only for the current device driver in Linux guests. (2) Eg's decision as to whether the current workload is a latency or throughput workload is not good enough: clear throughput scenarios are interpreted as latency ones for message sizes smaller than 4K. The reason for this might be a calibration of Eg for physical environments that is inappropriate for virtual ones, or simply a bad calibration altogether.

To solve the above 2 problems, it is necessary to develop a new heuristic for deciding whether the current workload is throughput or latency. This heuristic should be implemented in the code of Eh so that it will work with guests running any operating system. We are currently working on the creation of such a heuristic, which will be published in future work. In this work we proceed with our static interrupt coalescing setting. We assume that when such a throughput/latency deciding heuristic is created, it will be able to recognize that our benchmarks are clear throughput workloads.
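The ITR-sensitive heuristic of this subsection reduces to a few lines. The sketch below is our reading of it; the constants are the Linux e1000 driver values quoted above, and TI is the TADV=320 equivalent:

```python
LATENCY_ITR = 195          # value Eg writes to ITR for latency workloads
TADV_BASELINE_US = 32.768  # baseline TADV=32 inter-interrupt interval
TI_US = 327.68             # Throughput Interval (TADV=320 equivalent)

def coalescing_interval_us(itr):
    """Pick the inter-interrupt interval from the guest's current ITR:
    keep the baseline timer for latency workloads, use TI for throughput."""
    return TADV_BASELINE_US if itr <= LATENCY_ITR else TI_US
```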



Figure 5.4: Throughput difference achieved when using the 2 types of interrupt coalescing heuristics described in subsections 5.4.2 (static), 5.4.3 (ITR sensitive)

5.4.4 Evaluation

Figure 5.5 shows the benefit in throughput gained by the static interrupt coalescing setting when adding it to the previous improvements (1-2). The interrupt rate shown in the figure is the same for all message sizes, since the interrupt coalescing timer is set to an interrupt rate of around 3051 interrupts/sec (the equivalent of TADV=320). The actual interrupt rate observed is a bit lower, since the timers are checked at a fixed location in the code after sending and receiving packets; the time that passes until the timer is checked slightly increases the interval between interrupts. The reduction in the number of interrupts also reduces the overhead of exits, which are abundant in the interrupt handler of Eg. This reduced overhead of interrupts and exits translates into higher throughput.

5.5 Send from the I/O Thread

The basic send sequence of Eh works as follows: when Eg is given a packet to send by the network stack, it places the packet at the tail of the TX ring pointed to by TDT, and advances TDT to the next free descriptor on the TX ring. Advancing TDT causes an exit to QEMU, where Eh goes over the TX ring and sends all the packets pointed to by descriptors between TDH and TDT. This sequence seems fairly similar to the one in Vh. However, unlike in the case of Vh, where Vh sends the packets on TX in the context of the I/O thread, Eh sends the packets in the context of the VCPU thread,


Figure 5.5: Change in throughput caused by using the improved static interrupt coalescing setting in Eh

while handling the TDT exit. This send sequence has two basic flaws.

First, it doesn't scale. Ideally we would like to be able to parallelize the send sequence by having one thread (the VCPU thread) add packets to the TX ring, and another thread send the packets to their destination. In fact, this is exactly what happens with hardware NICs: the device driver writes new packets to the TX ring, and the NIC asynchronously sends them. But since the current implementation of Eh does everything on the VCPU thread, such parallelization is impossible.

The second flaw is that batching is not used when sending packets. Each packet added by Eg is immediately sent by Eh. Batching would be better for throughput loads as it increases instruction cache locality, prefetching effectiveness, and branch prediction accuracy.

As explained in subsections 2.7.1 and 2.7.2, Vh solved both of the above problems by sending the packets in the context of the I/O thread of QEMU. This way, Vg adds packets to the TX ring in the context of the VCPU thread, and Vh sends them to their destination in the context of the I/O thread. This solution is scalable, since each of the threads can run on its own core in parallel. Sending from the I/O thread also promotes batching; this was discussed in subsections 2.7.1 and 2.7.2, and will be illustrated in the figures of Chapter 6.

Using the same idea as in Vh, we moved the sending of packets in Eh to the I/O thread by using the QEMU bottom-half scheduling mechanism, which schedules work on the I/O thread. Figure 5.6 shows the improvement in throughput gained by moving the sending of packets to the I/O thread, in addition to improvements 1-3. As mentioned, sending from the I/O thread promotes sending batches of packets each time the I/O thread runs, which can be seen in the packets/batch graph of the figure.

Figure 5.6 also shows 2 indirect results of increased batching that further contribute to the improvement in throughput. First, we can see an increase in the packet size sent by the guest. This increase occurs when packets can't be added to the TX ring (e.g., when it is full or CWND doesn't allow more packets to be sent) and are buffered in the socket buffer. When packets accumulate in the socket buffer they also grow in size, as explained in subsection 2.2.1. Sending packets from the I/O thread defers their sending, giving more opportunity for the TX ring to fill up and for the socket buffer's packet-growing behavior to kick in. Second, we see a significant drop in the interrupt rate. This happens because sending packet batches also results in receiving batches of ACKs for them. The NAPI bottom half keeps interrupts turned off (by setting IMC) while there are more packets to be received, essentially deferring the next interrupt until there are no more packets to be received in the current batch of ACKs. We are not sure why the interrupt rate goes down to around 1000 interrupts per second for all message sizes, and we leave this for future work.

Figure 5.6: Change in throughput caused by moving the sending of packets from the VCPU thread to the I/O thread in Eh ("xmit on I/O thread" vs. "previous (up to 3)"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])


5.5.1 Interrupt Coalescing in Virtio-Net

Interrupt coalescing in Vh is different from that in Eh. As discussed in this section and in section 5.4, Eh uses a delay timer and NAPI for interrupt coalescing, while Vh uses only NAPI without a timer. We were unable to get good performance out of Eh without the interrupt coalescing timer, and we were unable to find the exact reasons for this; we therefore leave it as an open issue. We explain what we do know about the interrupt handling differences between Eh and Vh to aid future exploration of this issue.

First, Vh does not raise TX interrupts. Eh tries to raise an interrupt for every sent packet, and these interrupts are deferred using Eh's interrupt coalescing timer. It is possible for Vh to never raise TX interrupts because Vg reclaims the resources of TX buffers that were already sent by Vh before new packets are added to TX; therefore, the unreclaimed buffers in TX do not interfere with sending new packets. Eg, on the other hand, reclaims the resources of TX buffers that have already been sent by Eh in its interrupt handler. Therefore, if we turn off TX interrupts and no packets are received, the TX ring very quickly fills up with unreclaimed buffers, and it becomes impossible to add new packets to TX. This problem could probably be solved by changing the implementation of Eg to reclaim the resources of TX buffers before adding new packets to TX. For this solution to be legal under our unmodified guest assumption, we would also need to push this change to the device drivers of Eb in all OSs, and make sure that it doesn't affect the correctness of the device drivers when used with Eb.

Second, Vh's kicks cause exits only once per batch of packets added to TX, because of the kick suppression mechanism described in subsection 2.6.1. Eh, on the other hand, writes to TDT for every packet added to TX, causing an exit each time. As we will discuss in section 5.6, adding overhead to packet sending can affect the batching of packets sent, which in turn can affect the batching of ACK packets received. These effects on batching directly influence the time by which NAPI delays an interrupt, since NAPI delays an interrupt as long as packets are being received. Therefore, the smaller the batches, the shorter the interrupt delays and the higher the interrupt rate. Perhaps the TDT exits contribute to NAPI's inability to function as the sole interrupt coalescing mechanism, and the timer helps mitigate this problem. Eh also causes 7 control-register-related exits for every interrupt handled, as explained in section 2.5. While we are not sure how, perhaps these exits also contribute to this problem.
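The exit-count gap described above can be made concrete with back-of-the-envelope arithmetic. The functions below are purely illustrative (our own names, not QEMU code), combining the per-packet TDT exits and the 7 control-register exits per interrupt on the Eh side against a single kick exit per batch on the Vh side:

```c
#include <assert.h>

/* Illustrative exit counts per batch of sent packets, following the
 * text: Eh takes one TDT-write exit per packet plus 7 control register
 * exits per interrupt raised, while Vh takes a single kick exit per
 * batch thanks to kick suppression.  These are our helper names. */
static int e1000_exits(int packets, int interrupts)
{
    return packets + 7 * interrupts;   /* TDT writes + register exits */
}

static int virtio_exits(int packets, int interrupts)
{
    (void)packets;
    (void)interrupts;
    return 1;                          /* one kick exit per batch */
}
```

For a 13-packet batch acknowledged by one interrupt, the sketch gives 20 exits for Eh versus 1 for Vh, which is the kind of per-packet overhead that can shrink send batches.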

5.6 Exposing PCI-X to Avoid Bounce Buffers

One of the stranger phenomena we encountered with Eh was the higher throughput obtained when we gave the guest less RAM. More specifically, maximum throughput was obtained as long as the guest machine RAM was smaller than 4GB. Beyond that point, increasing the RAM used to initialize the guest decreased the throughput. We initially thought it counterintuitive that decreasing the resources would increase performance.


After thoroughly investigating this phenomenon, we discovered the reason: in the current implementation, Eh exposed itself as a 32-bit PCI device, which meant that it could access only 4GB of RAM when performing DMA to access the TX ring and its buffers. When more than 4GB of RAM was assigned to the guest, the guest kernel could allocate buffers at high addresses that were inaccessible to Eh, which meant the guest kernel had to copy those buffers to buffers at lower addresses (called bounce buffers) before Eh could perform DMA reads of them. This extra copying is the reason more RAM meant lower throughput. The solution to this problem was very simple once we understood its cause: all we had to do was enable the PCIX MODE flag in the STATUS register of Eh, which switches Eh to PCI-X mode. When Eh is in PCI-X mode, it uses 64-bit addressing, which means it can access all the RAM, and bounce buffer copying is no longer necessary.
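The bounce-buffer condition itself is simple address arithmetic. The following sketch (the function name and mask macros are ours, not kernel or QEMU identifiers) shows the decision a kernel must make for a device with a limited DMA mask, and how a 64-bit mask makes bouncing unnecessary:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the bounce-buffer condition described above: a 32-bit PCI
 * device can only DMA below 4GB, so any buffer with a byte above that
 * limit must first be copied to a low "bounce" buffer.  PCI-X mode
 * lifts the mask to 64 bits.  Names are ours, not kernel/QEMU code. */
#define DMA_MASK_32BIT 0xFFFFFFFFull   /* plain PCI: only 4GB reachable */
#define DMA_MASK_64BIT (~0ull)         /* PCI-X mode: all of RAM */

static bool needs_bounce(uint64_t buf_addr, uint64_t len, uint64_t dma_mask)
{
    /* Does any byte of [buf_addr, buf_addr + len) lie above the mask? */
    return buf_addr + len - 1 > dma_mask;
}
```

A buffer at physical address 0x1_0000_0000 (just past 4GB) needs bouncing under the 32-bit mask but not under the 64-bit one, which matches the throughput behavior observed with guests larger than 4GB.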

Vh doesn't suffer from this phenomenon as it exposes itself as a 64-bit capable device in the first place.

Figure 5.7 shows the improvement in throughput due to enabling PCI-X mode in Eh and eliminating the need to use bounce buffers, together with improvements 1-4. Throughput improves due to the reduced time needed to send a packet, since no bounce buffer copying is necessary. We are not sure why batching improves so much, but a good guess is that it has to do with the scheduling priorities of the VCPU and I/O threads. As discussed in subsection 2.7.2, the dynamic priority of the QEMU threads plays a role in deciding which thread is going to run. While the batching in the previous section is quite low on average, it is not stable during runtime; for example, with 64KB messages, the batch size fluctuates between 1-packet and 13-packet batches. We believe that batching stabilizes at higher sizes when the overhead of bounce buffer copying is eliminated: eliminating the overhead reduces the runtime in the context of the guest, which increases the priority of the VCPU thread over that of the I/O thread. This in turn decreases the chances that the I/O thread will interfere with the VCPU thread while it is adding packets to the TX ring, thus improving batch size overall.

With the improvements added to Eh up to this stage, the send sequence of Eh looks very similar to that of Vh as described in subsection 2.7.2. The main differences are that in Eh, the TDT access causes an exit for every packet sent, as opposed to a single kick exit per batch in Vh, and that Eh uses a timer for interrupt coalescing in addition to NAPI. (As explained in section 5.4, we are not sure why Eh can't manage interrupt coalescing with NAPI only.)

5.7 Dropping Packets to Improve TSO Batching in Linux Guests

Figure 5.7: Change in throughput caused by enabling PCI-X mode in Eh ("PCI-X enabled" vs. "previous (up to 4)"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

Figure 5.8 shows Eh with all the improvements we described up to section 5.6 compared to Vh. This figure shows that we have brought the throughput of Eh very close to that of Vh. More interesting is that Eh actually achieves higher throughput than Vh for message sizes 4KB and 8KB, which was very strange to us when we first witnessed it. What caught our attention was the packet size graph in this figure, where we can see that for these message sizes the guest sends packets that are significantly larger with Eh than with Vh. This observation led us to investigate why the network stack in the guest achieves such poor TCP segment aggregation in a throughput workload. We expected that under our throughput workload, the network stack in the guest would aggregate multiple sent messages to create large TCP segments and increase throughput. In this section we present our findings regarding this behavior of the guest TCP stack. We also present a method that can be used within any virtual NIC to solve this issue. Our method counterintuitively uses packet dropping to alter the behavior of the guest network stack so that it creates larger packets, which results in higher throughput.

As mentioned in subsection 2.2.1, packets can be added to the TX ring only if TCP allows it. To explain this, let us suppose we are in a steady state of sending packets, with a certain number of packets in flight. At a certain point in time, there will be a packet that would normally have been added to TX, but since there are already CWND (see subsection 2.1.3) packets in flight, this packet will not be added to TX and will instead be added to the socket buffer. The same will happen to all the following packets until enough ACKs for the in-flight packets are received, which will cause the first packet in the socket buffer to be added to TX. Therefore, the value of


Figure 5.8: Vh compared to Eh with improvements 1-5 ("virtio-net" vs. "e1000 up to 5"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

CWND is one reason that packets are aggregated into larger packets in the socket buffer. Ideally, we would have liked CWND to stabilize at a value where the aggregation of packets benefits throughput. In our guest machine, CWND rose to values higher than optimal, which decreases the aggregation of packets. As explained in subsection 2.1.3, CWND is lowered when congestion is detected by the TCP congestion algorithm, and congestion is detected when an ACK for a sent packet is not received in a timely manner. To achieve better packet aggregation in throughput scenarios, we decided to drop packets intentionally in Eh to decrease CWND and thus increase the packet size. To do this we used Algorithm 5.1 in the QEMU code of both Eh and Vh. The algorithm examines the average packet size once in a while, and if it is too small, it drops a packet to reduce CWND, in an attempt to improve packet aggregation in the guest.

To understand this algorithm we must first define the constants used in it. P is the number of preceding packets we consider when calculating the current average packet size being sent by the guest. If P is too small, then we might drop packets too quickly, before the previous drop takes effect, causing a continuous drop in CWND and in throughput. If P is too large, then the time between drops might be too long; this might cause CWND to climb to higher values and remain there until the next drop, reducing packet aggregation and thus throughput. M is the average packet size below which we drop packets. If we set it too high (for example 64KB), then we will get packet drops all the time, since the average packet size will never be so high; throughput will also decrease in this case. But if we set M too low (for example 2KB), then for message sizes larger than M we will never drop any packets, losing all the benefits of the algorithm. Finally, m is the average packet size below which we do not drop packets. We discovered its necessity empirically, after seeing that throughput does not improve if packets are dropped when the average packet size is very small.

Algorithm 5.1 Lowering CWND by dropping packets for better packet aggregation
1: if P packets were sent by the guest since the last check and the average packet size of the last P packets sent by the guest is in [m, M] then
2:     drop packet
3: end if
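Algorithm 5.1 can be sketched as a small per-packet hook on the virtual NIC's transmit path. This is a simplified sketch, not the actual QEMU patch; the struct and function names are ours, and the state handling (one global window, no per-connection tracking) mirrors the proof-of-concept nature of the algorithm:

```c
#include <assert.h>

/* Minimal sketch of Algorithm 5.1 (our names, not the QEMU patch).
 * Every P packets we compute the average size of the last P packets
 * and request a drop if that average lies in [m, M]. */
struct drop_state {
    long p;        /* P: check period, in packets */
    long min_avg;  /* m: below this, never drop */
    long max_avg;  /* M: above this, aggregation is already good */
    long count;    /* packets seen since the last check */
    long bytes;    /* their total size */
};

/* Called once per transmitted packet; returns 1 if this packet should
 * be dropped to push CWND (and thus segment sizes) down. */
static int should_drop(struct drop_state *s, long pkt_bytes)
{
    s->count++;
    s->bytes += pkt_bytes;
    if (s->count < s->p)
        return 0;
    long avg = s->bytes / s->count;
    s->count = 0;
    s->bytes = 0;
    return avg >= s->min_avg && avg <= s->max_avg;
}

/* Feed n packets of equal size through the hook; count requested drops. */
static int drops_after(long p, long m, long M, long pkt_bytes, int n)
{
    struct drop_state s = { p, m, M, 0, 0 };
    int drops = 0;
    for (int i = 0; i < n; i++)
        drops += should_drop(&s, pkt_bytes);
    return drops;
}
```

With P = 4 for brevity (the thesis uses P = 8,000), a run of mid-sized packets triggers exactly one drop per window, while very small or already-large packets trigger none, matching the roles of m and M described above.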

Our current version of the packet dropping algorithm is static, meaning that we initialize the algorithm parameters P, M and m once and never change them. We did not explore the possibility of creating a dynamic algorithm that sets these parameters according to the current networking load of the system; this can be explored in future work. For now we simply experimented with the numbers and chose, for each NIC (Eh and Vh), the 3 parameters that showed good performance. This choice is by no means suitable for all cases, but it shows the potential of this packet dropping method. The parameters that showed good results for Eh are (P, M, m) = (8,000, 60,000, 25,000), and for Vh they are (P, M, m) = (8,000, 62,000, 25,000).

Figure 5.9 shows the throughput achieved by Eh together with the CWND value in the guest TCP stack, when running the netperf TCP_STREAM benchmark with the default message size of 16KB. At first, without packet dropping, a high CWND and low throughput are observed. Then, at around 25 seconds, packet dropping is enabled, which causes CWND to drop and the throughput to rise significantly due to the increased TCP segment sizes achieved by the guest. After CWND drops, it immediately starts rising again, and once the higher CWND values cause the TCP segment size to decrease once again, Eh again drops a packet, which sends the CWND value back down.

Figure 5.9: Throughput and CWND values over time, with Eh, running netperf TCP_STREAM with the default 16KB message size. At around 25 seconds packet dropping is enabled. (axes: throughput [Gbps] and CWND vs. time [seconds])

Figure 5.10 shows the throughput achieved with packet dropping together with improvements 1-5. Most of the improvement stems from the increase in packet size for larger message sizes. Batching in Eh is reduced to around 12-13 for larger message sizes, since the TX ring can hold only around 12-13 packets of the maximum IP packet size of 64KB: the ring is 256 buffers long, and a maximum-size packet takes up around 20 buffers. At this point we would like to note that the size of both TX and RX in Vh is also 256, but in the case of Vh each entry in the queue holds a list of buffers for a single packet, which means the rings in Vh can hold up to 256 packets. We tried increasing the ring size of Eh to the maximum of 4096 buffers to avoid overfilling, but this reduced throughput for reasons we did not explore, and we leave the optimal TX size for future work. In fact, when varying the queue sizes to get a feeling for which size is best, we got the best throughput around the default setting of 256, so we left the queue sizes at this default setting.
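The batch ceiling mentioned above is simple integer arithmetic. The helper below is ours, and the 20-descriptors-per-64KB-packet figure is the text's estimate, not a hardware constant:

```c
#include <assert.h>

/* The arithmetic behind the ~12-13 packet batch ceiling: a TX ring of
 * ring_entries descriptors, where each maximum-size packet occupies
 * descs_per_packet of them, holds this many full-size packets.
 * The 20-descriptor figure is the text's estimate. */
static int ring_capacity_packets(int ring_entries, int descs_per_packet)
{
    return ring_entries / descs_per_packet;
}
```

With the default 256-entry ring this gives 12 maximum-size packets, while virtio-net's one-packet-per-entry rings hold 256 regardless of packet size.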

Figure 5.11 shows the impact of packet dropping on Vh, with and without packet dropping. Packet dropping is a general solution that works regardless of the NIC type; therefore, we see a very similar increase in the average packet size, and thus in throughput, to that shown for Eh in Figure 5.10.

It is important to note that the presented packet dropping algorithm is incomplete and serves only as a proof of concept, as it has a few basic flaws. First, the algorithm parameters are static, as already discussed in this section. Moreover, the algorithm works only for a single TCP connection, since there is a different CWND for each TCP connection and our algorithm doesn't take connections into account. It is possible to use this algorithm with multiple unencrypted TCP connections by inspecting packets in the virtual NIC and maintaining the average packet size per connection. However, this will not work for encrypted connections (such as IPsec), since inspecting such encrypted packets to see which connection they belong to is impossible. A more complete solution should probably be implemented in the TCP stack itself, by better managing packet aggregation in throughput scenarios, but this is out of the scope of our work.


Figure 5.10: Change in throughput caused by adding the packet dropping for better TSO batching in Eh ("drop packets" vs. "previous (up to 5)"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

5.8 Vectorized Send

Packets are sent by a virtual NIC in QEMU by passing them to the networking backend (in our case the TAP backend), which is responsible for handing the packet over to the network stack of the host, where it will be sent to its destination. The backend has 2 types of functions for sending packets: qemu_send_packet_async(), which receives a single buffer containing the whole packet and uses the write() system call to write the packet to the TAP device in the host, and qemu_sendv_packet_async(), which receives a vector of buffers (called an iov, for I/O vector) and uses the writev() system call to write the packet to the TAP device in the host.

Each packet on the TX ring of both Eh and Vh is comprised of a list of buffers. Therefore, the natural choice for sending a packet is the qemu_sendv_packet_async() function, which receives the vector of buffers on the TX ring as input and writes them to the TAP device. Vh indeed uses qemu_sendv_packet_async(), but Eh first copies the packet buffers on the TX ring to a single linear buffer and then uses qemu_send_packet_async() to send the packet.

Since copying the TX buffers to another intermediate buffer seems wasteful, we changed the implementation of Eh to use the qemu_sendv_packet_async() function directly with the vector of buffers held by the TX ring. We call this change "vectorized send".

Figure 5.11: Change in throughput caused by adding the packet dropping for better TSO batching in Vh ("virtio-net drops" vs. "virtio-net no drops"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

Figure 5.12 shows the results of this change. There is a slight improvement in throughput when the guest sends large packets, but throughput decreases for smaller packet sizes. Previous works [GOB01, GKR05] have shown that zero-copy techniques achieve better throughput than copying when the amount of data processed is large, and worse throughput than copying when the amount of processed data is small. We would expect the same kind of effect from our change, since we also remove a copy on the I/O processing path. However, we were unable to determine the reason for the somewhat chaotic pattern of throughput we observed for smaller message sizes. While it is not clear that the throughput curve with vectorized send is better than without it, we chose to present this change to illustrate another difference between the Vh and Eh implementations (albeit not a very significant one).
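The two send paths can be contrasted with a self-contained sketch. The function names send_vectored() and send_linearized() are ours, standing in for the qemu_sendv_packet_async() and qemu_send_packet_async() paths respectively; a pipe substitutes for the TAP device:

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Stand-in for the qemu_sendv_packet_async() path: hand the TX ring's
 * buffer list to the fd in one writev() call, with no linearization. */
static ssize_t send_vectored(int fd, struct iovec *iov, int iovcnt)
{
    return writev(fd, iov, iovcnt);
}

/* Stand-in for the pre-change e1000 path: copy every fragment into one
 * linear buffer first, then a single write(). */
static ssize_t send_linearized(int fd, struct iovec *iov, int iovcnt)
{
    char linear[2048];
    size_t off = 0;
    for (int i = 0; i < iovcnt; i++) {
        memcpy(linear + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    return write(fd, linear, off);
}

/* Send a 2-fragment "packet" through either path via a pipe that stands
 * in for the TAP device; returns the number of bytes written. */
static ssize_t demo(ssize_t (*sendfn)(int, struct iovec *, int))
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    char hdr[] = "hdr", payload[] = "payload";
    struct iovec iov[2] = {
        { .iov_base = hdr,     .iov_len = 3 },
        { .iov_base = payload, .iov_len = 7 },
    };
    ssize_t n = sendfn(fds[1], iov, 2);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Both paths deliver the same bytes; the difference is only the extra memcpy() per fragment in the linearized path, which is the copy that vectorized send removes.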

5.9 SRTT Calculation Algorithm Bug in Linux

While we were running experiments with our improved Eh, we got very inconsistent throughput results. Our investigation uncovered a bug in the Linux kernel's calculation of SRTT (which is explained in subsection 2.1.4). In the following subsections we describe the bug, our fix, and how the fix affects the performance of both Eh and Vh.


Figure 5.12: Change in throughput caused by using vectorized sending in Eh ("vectorized send" vs. "previous (up to 6)"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

5.9.1 SRTT Calculation in the Linux Kernel

The default Linux TCP stack runs in kernel space, where floating point computations are unavailable. Therefore formula 2.1 for the calculation of SRTT is implemented in the Linux kernel using integer computations. Figure 5.13 shows the code in Linux kernel 3.13, which calculates the new SRTT given the old SRTT and the RTT of the currently ACKed packet. To facilitate understanding of the code, we note the following. First, the value held in tp->srtt is not SRTT, but rather 8*SRTT. So the formula you see implemented in this code is the same as formula 2.1, in which α is 1/8, and both sides of the equation are multiplied by 8, giving equation 5.1.

    (8 * SRTT) = (7/8) * (8 * SRTT) + RTT        (5.1)

8*SRTT was probably used to obtain higher precision of SRTT without using floating point numbers, but the achieved precision isn't enough, as we shall demonstrate when we describe the bug we found in the next subsection. Second, to avoid too-small RTO values that cause frequent unnecessary timeouts, the minimum allowed value of RTT is 1, and therefore the minimum value of tp->srtt is 8, which explains the first "if" clause in row 5 of Figure 5.13.


 1  static void tcp_rtt_estimator(struct sock *sk,
 2                                const u32 mrtt) {
 3      struct tcp_sock *tp = tcp_sk(sk);
 4      long m = mrtt;             /* RTT */
 5      if (m == 0)
 6          m = 1;
 7      if (tp->srtt != 0) {
 8          m -= (tp->srtt >> 3);  /* m is now error in rtt est */
 9          tp->srtt += m;         /* rtt = 7/8 rtt + 1/8 new */
10          ...
11      }
12      ...
13  }

Figure 5.13: The routine in Linux kernel 3.13 that calculates the new SRTT given the RTT of the currently ACKed packet and the previous SRTT (irrelevant code omitted)

5.9.2 Bug Description

We can describe the bug in a single sentence: whenever tp->srtt ∈ [8, 14], if tp->srtt increases, it will never return down to its previous value. For example, if tp->srtt is 8, and the RTT of the currently ACKed packet is 2, then tp->srtt will increase to 9 and will never go back down to 8, even if the RTT of every packet from now on is 0. The reason for this bug is the loss of accuracy in row 8 of Figure 5.13. The result of tp->srtt >> 3 is 1 for all the numbers in [8, 15], and since the minimum value of m before row 8 is 1, the minimum value of m when row 8 is executed is 0, which leaves tp->srtt as it was before.
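The stickiness is easy to reproduce with a standalone model of the update in Figure 5.13 (the function name is ours; the arithmetic is the kernel's, with tp->srtt modeled as a plain long holding 8*SRTT):

```c
#include <assert.h>

/* Standalone model of the buggy integer SRTT update from Figure 5.13;
 * srtt8 plays the role of tp->srtt, i.e. it holds 8*SRTT. */
static long srtt_update_buggy(long srtt8, long rtt)
{
    long m = rtt;
    if (m == 0)
        m = 1;               /* minimum allowed RTT */
    if (srtt8 != 0) {
        m -= srtt8 >> 3;     /* values 8..15 all shift down to just 1 */
        srtt8 += m;
    }
    return srtt8;
}

/* Apply three RTT samples in sequence and return the final srtt8. */
static long srtt8_after_rtts(long srtt8, long r1, long r2, long r3)
{
    srtt8 = srtt_update_buggy(srtt8, r1);
    srtt8 = srtt_update_buggy(srtt8, r2);
    srtt8 = srtt_update_buggy(srtt8, r3);
    return srtt8;
}
```

Starting from srtt8 = 8, one RTT sample of 2 pushes it to 9, and subsequent zero-RTT samples leave it stuck at 9: the error term bottoms out at 0, exactly as the text describes.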

5.9.3 Effects of the Bug

To explain the effects of the bug we must first introduce the socket pacing rate in the Linux kernel, which is represented in the code by the sk_pacing_rate variable. The sk_pacing_rate variable ensures that packets are sent out at a pace no slower than a packet per millisecond. One of the parameters used to determine that the pace is slow is the SRTT: more specifically, sk_pacing_rate is divided by tp->srtt if tp->srtt is greater than 10. This means that when tp->srtt grows beyond 10, sk_pacing_rate immediately drops by an order of magnitude. This sets up a chain of events that dramatically slows down the speed at which maximum throughput is achieved by the socket. The sk_pacing_rate variable also limits the amount of data that can be held by the socket, which reduces the number of packets sent per batch, which in turn slows the increase of CWND. CWND is the other parameter that affects sk_pacing_rate, which means that the slow growth of CWND further slows the growth of sk_pacing_rate. In our setup, when using Eh, it takes around 500 seconds to achieve the maximum throughput when running netperf with the default 16KB messages using the original SRTT calculation in the Linux kernel, as compared to around 1 second after our bug fix.

static void tcp_rtt_estimator(struct sock *sk,
                              const u32 mrtt) {
    struct tcp_sock *tp = tcp_sk(sk);
    long m_exact = mrtt << 3;
    if (m_exact == 0) {
        m_exact = 8;
    }
    if (tp->srtt != 0) {
        m_exact -= (tp->srtt_exact >> 3);
        tp->srtt_exact += m_exact;
        tp->srtt = tp->srtt_exact >> 3;
        ...
    }
    ...
}

Figure 5.14: The code of tcp_rtt_estimator() after fixing the bug in SRTT calculation (irrelevant code omitted)
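The cliff in the pacing rate described in subsection 5.9.3 can be sketched as follows. This is a toy model of the coupling, not the kernel's sk_pacing_rate computation; the function name and the base rate are ours:

```c
#include <assert.h>

/* Toy model (our names, not kernel code) of the coupling described in
 * the text: once tp->srtt exceeds 10, the pacing rate is divided by
 * it, so crossing that threshold cuts the rate by an order of
 * magnitude at once. */
static long paced_rate(long base_rate, long srtt)
{
    return srtt > 10 ? base_rate / srtt : base_rate;
}
```

Going from srtt = 10 to srtt = 11 drops the sketched rate from base_rate to base_rate/11, which is why an SRTT stuck one step too high (as with the bug) is so costly.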

5.9.4 Bug Fix

Since the cause of the bug is the loss of accuracy due to tp->srtt being shifted 3 places to the right, the natural solution was to shift all variables participating in the calculation of SRTT 3 places to the left before the calculation, and then shift the results back 3 places to the right when the calculation is complete. The fixed code can be seen in Figure 5.14, where all the variables that were shifted 3 places left have the suffix "_exact" added to their names. Figure 5.15 shows the values of tp->srtt as they react to RTT values in both the original implementation of tcp_rtt_estimator() and in the fixed version. In the original version it can be seen that tp->srtt is monotonically increasing, even when RTT should have lowered it, while in the fixed version tp->srtt reacts correctly to both increases and decreases in RTT.

Figure 5.16 shows the impact of the SRTT bug fix on the throughput of the best version of Eh without the packet dropping improvement, which masks the effects of the bug. The impact is prominent across all message sizes. Figure 5.17 shows the impact of the SRTT bug fix on the throughput of Vh, again without the packet dropping algorithm. Here the impact is noticeable only for message sizes in [512, 4K). This bug has such a dramatic effect on Eh because Eh tends to have spikes in the value of RTT, specifically at the beginning of the netperf benchmark, as can be seen in the left graph of Figure 5.15. We did not observe these spikes in Vh. We were unable to determine the reason for these spikes, but either way, with the bug fixed, the throughput of Eh no longer fluctuates as it did with the bug.
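The recovery behavior of the fixed estimator can be checked with the same kind of standalone model (again our names; srtt8 and srtt_exact mirror tp->srtt and tp->srtt_exact from Figure 5.14):

```c
#include <assert.h>

/* Standalone model of the fixed update from Figure 5.14: the extra
 * 3 bits of precision in srtt_exact let small negative errors survive
 * the arithmetic, so srtt8 can come back down. */
struct srtt_state {
    long srtt8;       /* 8*SRTT, as exposed to the rest of the stack */
    long srtt_exact;  /* 8*SRTT shifted left 3, i.e. 64*SRTT */
};

static void srtt_update_fixed(struct srtt_state *st, long rtt)
{
    long m_exact = rtt << 3;
    if (m_exact == 0)
        m_exact = 8;               /* minimum RTT, in exact units */
    if (st->srtt8 != 0) {
        m_exact -= st->srtt_exact >> 3;
        st->srtt_exact += m_exact;
        st->srtt8 = st->srtt_exact >> 3;
    }
}

/* Start from srtt8 = 8, apply two RTT samples, return the final srtt8. */
static long srtt8_fixed_after(long r1, long r2)
{
    struct srtt_state st = { 8, 64 };
    srtt_update_fixed(&st, r1);
    srtt_update_fixed(&st, r2);
    return st.srtt8;
}
```

From srtt8 = 8, an RTT sample of 2 raises it to 9, and a following zero-RTT sample brings it back to 8, unlike the buggy version, where it stays stuck at 9.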


Figure 5.15: tp->srtt values as they react to RTT values over time, in both the original implementation of tcp_rtt_estimator() (left) and the fixed version (right) (y-axis: time [Jiffies], x-axis: time [sec]; curves: current RTT and original/fixed tp->srtt)

Figure 5.16: Difference in throughput of Eh with all improvements but packet dropping, with the SRTT bug and after fixing it ("e1000 fixed srtt" vs. "e1000 original srtt"; throughput [Gbps] and normalized throughput vs. message size [bytes])

5.10 Final Throughput Comparison

Figure 5.18 compares the best versions of Eh and Vh in terms of throughput achieved: Eh with all of our improvements added, and Vh with two improvements added to it as well, the SRTT bug fix and the packet dropping heuristic. Before all of our improvements, Vh had 20-77x higher throughput than Eh; after all of our improvements, it has only 1.2-2.2x higher throughput. We can also see that the curves of the other parameters in the figure are much more similar than with the baseline versions of the NICs. These results indicate that the notion that paravirtual I/O devices are greatly superior to emulated ones in throughput scenarios is a misconception. The results also show that virtualization exits do not play as large a role in the throughput difference between emulated and paravirtual I/O devices as presented in previous works.

At this point we would like to return to the maximum throughput that can theoretically be achieved by Eh in our single core benchmark, as predicted by our model in section 3.6. To see how close we got to the prediction of our model, we ran our single core throughput benchmark with a 65160-byte message size, using Eh with all of the


Figure 5.17: Difference in throughput of Vh with the SRTT bug and after fixing it ("virtio-net fixed srtt" vs. "virtio-net original srtt"; throughput [Gbps] and normalized throughput vs. message size [bytes])

improvements presented in this chapter, and got a throughput of 7725 Mbps, which is 94% of the 8200 Mbps predicted by our model.

As Figure 5.18 shows, there are still 2 main differences between Eh and Vh that need to be addressed: (1) the interrupt rate of Vh is much lower than that of Eh, and (2) Vh is able to utilize batching much better than Eh. We leave the effort to address these issues to future research.

Figure 5.18: Throughput comparison of the best versions of Vh and Eh ("best virtio-net" vs. "best e1000"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])


category name                                  | improvements in category              | also improves Vh
implementing hardware acceleration in software | TCO removal, TSO removal              | no
parameter setup                                | interrupt coalescing, enabling PCI-X  | no
optimizations                                  | xmit from I/O thread, vectorized send | no
interaction with network stack                 | packet dropping                       | yes
bug fixes                                      | SRTT bug fix                          | yes

Table 5.2: Eh improvement categories

5.11 Improvements Summary

In this section we summarize the improvements to Eh presented in this chapter. Table 5.2 divides our improvements into 5 categories, according to the nature of each improvement. This table gives a good overview of the types of problems we've found with the implementation of Eh. Table 5.3a summarizes the throughput achieved by adding up all of our improvements: each row shows the throughput in Mbps achieved when adding all improvements up to and including the improvement in that row. The same data is also presented graphically in figures 5.19a and 5.19b. Figure 5.19a uses a linear scale Y axis to present the data, and figure 5.19b uses a logarithmic scale Y axis to emphasize the throughput increases achieved at the lower message sizes, which are indistinguishable in figure 5.19a. Table 5.3b shows the increase in throughput achieved by each improvement, in percent of the maximum achieved throughput. For example, 40% of the maximum achieved throughput for 64KB message sizes is attained by moving the transmission to the I/O thread. According to this table, the 4 most significant improvements are interrupt coalescing, xmit on I/O thread, PCI-X enabled, and drop packets. The first and last of these are most beneficial for small and large message sizes respectively, while the other two increase throughput across all message sizes. There are also negative numbers in this table, which mean that the improvement actually reduces the throughput in some cases; for example, vectorized send reduces the maximum throughput achieved by 16% for 1K message sizes. The same data is presented graphically in figure 5.20, which once again shows distinctly the 4 improvements most dominant in the throughput increase we achieved. In this figure, the improvements that reduce throughput at some message sizes (the negative numbers in table 5.3b) are shown as 0% change in throughput, since negative effects on throughput can't be expressed in such a figure.


imprv.\msg. size     64   128   256   512    1K    2K    4K    8K   16K   32K   64K
baseline              3     5    11    21    41    48    76   136   226   341   438
no TCO                3     5    10    21    45    50    81   156   279   464   666
no TSO                3     5    10    21    45    49    83   167   380   691  1344
interrupt coales.    62    68    89   102   116   133   243   414   780  1150  1933
xmit on I/O thrd.    93   205   454   772  1226  2095  2467  3476  4084  4484  4851
PCI-X enabled       104   242   499   861  1507  2729  3181  4416  4597  6102  6801
drop packets         87   205   468   850  1509  3295  5103  5861  6246  6820  6835
vectorized send      86   206   455   806  1272  2140  5065  6033  6721  7287  7291

(a) Throughput in Mbps (rounded) achieved by Eh with each added improvement

imprv.\msg. size     64   128   256   512    1K    2K    4K    8K   16K   32K   64K
baseline              3     2     2     2     3     1     1     2     3     5     6
no TCO                0     0     0     0     0     0     0     0     1     2     3
no TSO                0     0     0     0     0     0     0     0     2     3     9
interrupt coales.    57    26    16     9     5     3     3     4     6     6     8
xmit on I/O thrd.    29    56    73    78    74    60    44    51    49    46    40
PCI-X enabled        11    15     9    10    19    19    14    16     8    22    27
drop packets        -16   -15    -6    -1     0    17    38    24    25    10     0
vectorized send      -1     1    -3    -5   -16   -35    -1     3     7     6     6

(b) Throughput increase caused by each of the improvements added to Eh, as a percentage of the highest achieved throughput

Table 5.3: Throughput increase achieved by adding our improvements to Eh


[Figure: throughput (Gbps) vs. message size (bytes), one curve per cumulative improvement (baseline, no TCO, no TSO, interrupt coalescing, xmit on I/O thread, PCI-X enabled, drop packets, vectorized send). (a) Linear Y axis scaling. (b) Logarithmic Y axis scaling.]

Figure 5.19: Throughput increase achieved by adding our improvements to Eh


[Figure: percent of maximum throughput vs. message size (bytes), stacked by improvement (baseline, no TCO, no TSO, interrupt coalescing, xmit on I/O thread, PCI-X enabled, drop packets, vectorized send).]

Figure 5.20: Throughput increase caused by each of the improvements added to Eh out of the highest achieved throughput



Chapter 6

Initial Work on a Dual Core Configuration

In the previous chapter we showed what happens when QEMU runs a guest machine with a single VCPU, when all of QEMU’s threads are running on a single core. In this chapter we give initial results when running the QEMU threads on 2 cores, one for the VCPU thread, and the other for the I/O thread. Splitting the threads allows for parallel execution, which is expected to improve performance but raises parallelism-related issues.

We first present the baseline comparison of Vh vs Eh when running our dual core benchmark. We then show that Eh with all of the improvements from Chapter 5 performs poorly in our dual core benchmark, and further show that this is due to mutex contention issues, from which Vh suffers less. We then present the sidecore paradigm, which can serve as a solution to the mutex contention problem and also reduce the overhead of exits, and we present a partial sidecore implementation that shows great improvement in throughput. Finally, we compare the best version of Eh against the best version of Vh when running our dual core benchmark. All throughput figures in this chapter are obtained by running our dual core throughput benchmark as presented in section 4.2.

6.1 Baseline Comparison

Figure 6.1 shows a throughput comparison of baseline Eh vs baseline Vh as they are implemented in our version of QEMU, when running our dual core throughput benchmark. This figure will serve as our initial comparison point between Vh and Eh. As in the single core case, the initial throughput difference between Vh and Eh is very large: baseline Vh achieves throughput that is 7–173x better than Eh. In the next section we will examine how well our improvements from Chapter 5 perform in a dual core setup.


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for baseline virtio-net vs. baseline e1000.]

Figure 6.1: Throughput comparison between baseline Vh and baseline Eh when running our dual core basic throughput benchmark

6.2 Scalability of the Emulated E1000

Figure 6.2 shows what happens to the throughput of Vh with packet dropping when running on a dual core setup, in comparison to the single core setup, with drop packet parameters (P,M,m) = (8,000, 62,000, 9,000) for both setups. It is clear that Vh benefits from the extra core across all message sizes. Figure 6.3 shows that Eh with all improvements from Chapter 5, unlike Vh, scales poorly when given an additional CPU core. In fact, not only does Eh scale poorly, it actually achieves throughput that is worse than on a single core across all message sizes except for 64 bytes. The reason we found for this poor scalability is a certain type of contention on the qemu global mutex, which we first introduced in subsection 2.3.2. For best throughput scalability when 2 cores are assigned to a single guest, we would ideally like the VCPU thread to add packets to the TX ring on one core, while at the same time the I/O thread sends these packets to their destination on the other core. However, exits that occur when TDT is accessed by Eg greatly impair this parallel execution in Eh. TDT is advanced for each packet that is added to the TX ring. During the exit caused by this access to TDT, the VCPU thread gets stuck waiting for the qemu global mutex until the I/O thread finishes sending previous packets and doing other I/O thread work. The I/O thread also suffers from contention over the qemu global mutex, since it doesn't start sending the next packet until the qemu global mutex is released by the VCPU thread.
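The serialization just described can be sketched with a toy producer/consumer pair sharing one coarse-grained lock. This is a hypothetical standalone sketch, not QEMU's actual code; all names (`send_packets`, `global_mutex`, the ring variables) are ours:

```c
#include <pthread.h>

/* Sketch of the serialization described above: the "VCPU thread"
 * advances TDT once per packet, and every such advance, like the exit
 * it models, runs under one coarse-grained mutex; the "I/O thread"
 * transmits while holding the same mutex. The two threads therefore
 * make progress strictly one at a time, however many cores they get. */

#define RING_SIZE 256  /* callers must keep n < RING_SIZE */

static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;
static int tx_ring[RING_SIZE];
static int tdt, tdh;                  /* tail (VCPU) and head (I/O) indices */
static int packets_to_send, packets_sent;

static void *vcpu_thread(void *arg) {
    (void)arg;
    for (int i = 0; i < packets_to_send; i++) {
        /* models the exit taken on each TDT write */
        pthread_mutex_lock(&global_mutex);
        tx_ring[tdt] = i;
        tdt = (tdt + 1) % RING_SIZE;
        pthread_mutex_unlock(&global_mutex);
    }
    return NULL;
}

static void *io_thread(void *arg) {
    (void)arg;
    int done = 0;
    while (!done) {
        /* transmission also holds the lock, stalling the next TDT exit */
        pthread_mutex_lock(&global_mutex);
        while (tdh != tdt) {
            tdh = (tdh + 1) % RING_SIZE;   /* "send" one packet */
            packets_sent++;
        }
        done = (packets_sent == packets_to_send);
        pthread_mutex_unlock(&global_mutex);
    }
    return NULL;
}

/* runs both threads and returns the number of packets transmitted */
int send_packets(int n) {
    pthread_t vcpu, io;
    tdt = tdh = packets_sent = 0;
    packets_to_send = n;
    pthread_create(&vcpu, NULL, vcpu_thread, NULL);
    pthread_create(&io, NULL, io_thread, NULL);
    pthread_join(vcpu, NULL);
    pthread_join(io, NULL);
    return packets_sent;
}
```

All packets arrive, but never in parallel: whichever thread holds `global_mutex` blocks the other, which is exactly the behavior mutrace reports as contention.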


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for the best version of virtio-net on 1 core vs. on 2 cores.]

Figure 6.2: Throughput difference between the best version of Vh when run on a single core vs when run on 2 cores

This contention over the qemu global mutex eliminates the batching we saw in the single core setup, since it forces the sending of one packet at a time. The lack of batching in turn cancels NAPI interrupt coalescing and lowers the packet sizes, as explained in section 5.5. All of these effects reduce throughput significantly. We don't see the above contention problem with Vh, since the kick notification of Vg (which is comparable to the advancing of TDT) doesn't cause an exit most of the time, as described in subsection 2.6.1. Therefore, unlike with Eh, there is no exit per packet, which means there is much less contention over the qemu global mutex, and Vh can fully benefit from the parallel execution of the VCPU and I/O threads.

Table 6.1 shows the qemu global mutex contention as reported by the mutrace [Poe09] tool, when running netperf TCP STREAM for 300 seconds with a message size of 64KB. The Locked column indicates how many times the qemu global mutex was locked; Contended, how many times it was contended; % Contended, the percentage of times the mutex was contended out of the times it was locked; and Dual % / Single %, the percentage of contended accesses to the mutex in the dual core benchmark for this NIC divided by the percentage of contended accesses in the single core benchmark, indicating the increase in contention from the single core to the dual core benchmark. For Eh, the qemu global mutex was contended 5x more in the dual core setup than in the single core setup, while for Vh the contention decreased by 10%. The qemu global mutex is locked significantly more in Eh than in Vh due to the larger number of I/O exits in Eh. We will discuss the sidecore row of this table in section 6.3.

[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for the best e1000 on 1 core vs. on 2 cores.]

Figure 6.3: Throughput difference between the best version of Eh when run on a single core vs when run on 2 cores

Setup            Locked       Contended   % Contended   Dual % / Single %
Eh single core   7,112,248    373,375     5%            -
Eh dual core     17,358,389   4,685,595   27%           5X
Vh single core   529,507      115,521     22%           -
Vh dual core     877,896      175,499     20%           0.9X
Eh sidecore      26,374,558   4,080,292   15%           3X

Table 6.1: qemu global mutex contention in different configurations as measured by mutrace

6.3 Sidecore

The contention over the qemu global mutex can be dealt with in different ways. It can be eliminated by using fine-grained locking instead of the single coarse-grained qemu global mutex, or use of the qemu global mutex can be reduced or eliminated altogether. We chose the second option: we used the sidecore paradigm to eliminate exits, thus eliminating the locking of the qemu global mutex during the handling of these exits. The sidecore paradigm works as follows: Control registers are mapped to the host


physical memory by the hypervisor, so that when the guest writes to these registers, no exits occur. In order to perform the necessary actions when these registers change their value, a polling thread is created. This polling thread polls the memory to which the registers are mapped, and whenever the value in one of the registers changes, the polling thread emulates the necessary behavior, the same way it would have done in the emulate phase of the trap-and-emulate paradigm. A special CPU core is dedicated to this polling thread to ensure short reaction times to register changes. This dedicated core is called a sidecore, and thus the paradigm is called the sidecore paradigm. The sidecore paradigm was first introduced by Kumar et al. [KRSG07], and later employed in several other works [ABYTS11, HGL+13, KMN+16], but to our knowledge has never been used before with a full-fledged emulated I/O device.
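A minimal sketch of the polling scheme just described, with a thread standing in for the sidecore and an atomic variable standing in for the mapped TDT-like register. This is our own construction for illustration, not QEMU's code:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Sketch of the sidecore paradigm: the "guest" advances a TDT-like
 * register with a plain memory write that causes no exit, while a
 * dedicated polling thread notices the change and emulates it. */

static atomic_int reg_tdt;            /* stands in for the mapped register page */
static atomic_int guest_done;
static atomic_int packets_emulated;

static void *sidecore_thread(void *arg) {
    (void)arg;
    int seen = 0;                     /* last TDT value already emulated */
    for (;;) {
        int tdt = atomic_load(&reg_tdt);
        while (seen < tdt) {          /* register changed: emulate the writes */
            seen++;
            atomic_fetch_add(&packets_emulated, 1);
        }
        if (atomic_load(&guest_done) && seen == atomic_load(&reg_tdt))
            return NULL;
    }
}

/* "guest" sends n packets; returns how many the sidecore emulated */
int run_guest(int n) {
    pthread_t sidecore;
    atomic_store(&reg_tdt, 0);
    atomic_store(&guest_done, 0);
    atomic_store(&packets_emulated, 0);
    pthread_create(&sidecore, NULL, sidecore_thread, NULL);
    for (int i = 0; i < n; i++)
        atomic_fetch_add(&reg_tdt, 1);   /* no exit: just a memory write */
    atomic_store(&guest_done, 1);
    pthread_join(sidecore, NULL);
    return atomic_load(&packets_emulated);
}
```

The guest's writes here never trap: all emulation work happens on the polling thread, which is why the scheme needs a dedicated core to keep reaction times short.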

6.3.1 Partial Sidecore Implementation

As described in section 2.4, the registers of Eb are grouped into 4KB pages. The registers that are of most interest to us are those described in subsection 2.4.1. These registers reside in pages 0, 2 and 3 of the register array. Due to the limitations imposed by the architecture, pages can only be mapped at whole-page granularity. Therefore, for each of the above 3 pages, we must decide whether to map the whole page, so we can avoid the exits when registers in this page are accessed, or leave it unmapped and suffer the penalty. The decision to map a page or leave it unmapped depends on whether the sidecore can preserve a correct emulation of Eh according to the specifications of Eb. We call a page "sidecore emulatable" if it can be mapped to the host memory and emulated correctly via a sidecore. Page 2, which contains TDH and TDT (among others), is sidecore emulatable, as none of the registers in this page have semantics that prevent a correct sidecore emulation. We implemented a partial sidecore emulation of Eh, where only page 2 of the Eh register pages is mapped in the host memory and polled by the sidecore thread. The other register pages are left unmapped and are emulated using exits as before. In our partial sidecore implementation, we eliminated only exits caused by TDT increments in Eg.

We expected our partial sidecore implementation to improve the throughput of Eh for 2 reasons: First, the sidecore eliminates the overhead of all TDT related exits. Second, since the TDT exits are those that cause the contention problem described in section 6.2, eliminating these exits will also eliminate the related contention and improve the scalability of Eh, by enabling the VCPU and I/O threads to work simultaneously when sending packets.

Implementation Details

For our initial sidecore thread prototype, we decided to add the sidecore polling behavior to the I/O thread of QEMU, instead of creating a separate sidecore thread. There are two reasons for this decision. First, it is very convenient, since the I/O thread already contains a working polling loop, which currently polls event file descriptors as described in subsection 2.3.1. Therefore, all we had to do was add our polling of the Eh registers to the existing polling loop, and change the existing polling in the I/O thread to be non-blocking. Without the second change, the I/O thread would block, which would increase the response time to changes in Eh registers. Second, implementing the sidecore thread as part of the I/O thread also reduces contention over the qemu global mutex, which a separate sidecore thread would need to acquire when performing the necessary emulation of the different registers.
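The loop change described above can be illustrated as follows. `io_loop_iteration` and `check_registers` are hypothetical stand-ins for the real I/O-thread loop and register emulation; the essential point is the `poll()` timeout of 0, which makes the descriptor poll non-blocking:

```c
#include <poll.h>
#include <unistd.h>

/* Illustrative sketch (names are ours, not QEMU's): one iteration of
 * the modified I/O-thread loop first polls its event file descriptors
 * with a timeout of 0, so it never sleeps, and then inspects the
 * mapped e1000 register page. */

typedef void (*reg_check_fn)(void);

int io_loop_iteration(struct pollfd *fds, int nfds, reg_check_fn check_registers) {
    /* timeout 0: return immediately even when no descriptor is ready,
     * instead of blocking as the unmodified I/O thread would */
    int ready = poll(fds, (nfds_t)nfds, 0);
    check_registers();                /* poll the Eh register page */
    return ready;                     /* number of ready descriptors */
}

/* small self-check: a pipe with pending data plus a counting callback */
static int checks_run;
static void count_check(void) { checks_run++; }

int demo(void) {
    int p[2];
    if (pipe(p) != 0)
        return -1;
    (void)write(p[1], "x", 1);        /* make the read end ready */
    struct pollfd fd = { .fd = p[0], .events = POLLIN, .revents = 0 };
    int ready = io_loop_iteration(&fd, 1, count_check);
    close(p[0]);
    close(p[1]);
    return ready == 1 && checks_run == 1;
}
```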

Evaluation

We ran our basic dual core benchmark as described in subsection 4.2.2, to see how well our partial sidecore performs in comparison to Vh. Henceforth, any mention of our sidecore implementation refers to Eh with all improvements from Chapter 5, with our partial sidecore implementation, with drop packet parameters changed to work better for the sidecore case at (P,M,m) = (8,000, 60,000, 8,000), and with increased queue size, from the original 256 of Eh to 4096, which is the maximum Eh allows. As explained in section 5.7, throughput did not improve when the ring sizes of Eh were increased in the single core setup. However, when using a sidecore, throughput did improve when the ring sizes were increased. This also makes for a more equal comparison between Eh and Vh, as with large ring sizes the rings never fill up in either Vh or Eh in our setup, unlike in the single core case, where the TX ring of Eh did fill up.

Figure 6.4 shows the throughput achieved by Eh with our partial sidecore implementation, compared to the throughput achieved by the best version of Eh when running on a single core. With the TDT exits eliminated, Eh with a sidecore scales well, as opposed to 2-core Eh without a sidecore, as shown in Figure 6.3. The improved scalability is due to having eliminated the TDT exit related contention over the qemu global mutex, as described in section 6.2. We can also see that our sidecore restores the batching of packets in Eh, which was lost due to the contention over the qemu global mutex. This batching also causes increased packet sizes, as explained in section 3.2. We expected the interrupt rate to go down, but it fluctuates instead; we are not sure why it increases to 3K between 4KB and 32KB message sizes. Looking back at Table 6.1, we can see that while the sidecore removed the contention due to the exits caused by TDT accesses, the overall contention over the qemu global mutex is still relatively high. We speculate that this contention occurs because the sidecore polls its memory continuously, locking the qemu global mutex every polling iteration at a high rate, while the interrupt handler in Eg causes 7 exits due to accesses to registers other than TDT, contending with the I/O thread. However, this speculation should be verified by further investigation. Figure 6.5 compares the throughput achieved by our partial sidecore implementation


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for the best e1000 sidecore vs. the best e1000 on 1 core.]

Figure 6.4: Throughput difference between the best single core version of Eh and the best partial sidecore-emulated version of Eh

to the throughput achieved by Vh. Figure 6.6 shows the same comparison when neither NIC uses our packet dropping algorithm from section 5.7. Packet dropping is very effective for Vh on dual core setups, but is not so effective for Eh; further investigation is required to determine the reasons. The extent to which the sidecore contributes to throughput can be seen most clearly in Figure 6.6, where the throughput exceeds that of Vh for most message sizes. Figures 6.3, 6.4, 6.5 and 6.6 show that our partial sidecore implementation achieves a great improvement over the non-sidecore versions of Eh. Vh achieves throughput that is only 1.25–2.7x higher than Eh with a partial sidecore when packet dropping is enabled, and sidecore-emulated Eh achieves better throughput than Vh for most message sizes when packet dropping is disabled.

Our partial sidecore emulation eliminates only the exits due to accesses to the TDT register. There are 6 other registers, for a total of 7 exits each time the interrupt handler in Eg is executed, slowing down the performance of Eh. If we can sidecore-emulate the register pages containing these 6 registers as well, we expect to see an even larger boost in performance. We address the known challenges on the way to sidecore emulating all the registers on the data path of Eh in section 8.1.


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for the best virtio-net vs. the best e1000 sidecore.]

Figure 6.5: Throughput difference between the partial sidecore-emulated best version of Eh and the best version of Vh when running our dual core basic throughput benchmark


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for sidecore vs. virtio-net, both without packet dropping.]

Figure 6.6: Throughput difference between the best partial sidecore-emulated version of Eh and the best version of Vh when running our dual core basic throughput benchmark without packet dropping



Chapter 7

Related Work

Rizzo et al. [RLM13] also present performance optimizations to Eh. However, their focus is not data throughput, but rather the throughput of packets for high packet rate switching. They turn off TSO and do not examine the different interactions between Eh and the TCP stack in the guest, which we show to significantly affect throughput. Our work extends their work to high throughput TCP workloads. Their 3 proposed performance optimizations are: (1) Interrupt moderation, in which they were the first to implement the interrupt coalescing registers ITR and TADV in Eh. We show that the interrupt coalescing implementation of Eh is suboptimal, since it effectively ignores the value of ITR. We improve the interrupt coalescing implementation, thereby lowering the interrupt rates for throughput scenarios and achieving a significant throughput increase. (2) Send combining, whereby Eg is modified to advance TDT only after a batch of packets has been added to the TX ring. This has the double effect of both reducing the exits caused by accesses to TDT and improving locality by batching packets. This solution is not applicable in our work, since we assume an unmodified guest. Instead, we propose sending from the I/O thread, which has a similar batching effect in Eh without modifying the guest, and we show that the TDT exits can be completely eliminated by adding a sidecore. (3) A paravirtual extension of Eh, which reduces exits further. This is also not applicable under the assumption of an unmodified guest.

The sidecore technique was first introduced by Kumar et al. [KRSG07], where it was used to accelerate network interrupt management of a self-virtualizing NIC and shadow page table management of paravirtual guest machines. Since then it has been successfully used to improve the performance of an emulated IOMMU [ABYTS11] and paravirtual network devices [LA09, HGL+13, KMN+16]. In SplitX [LBYG11], the authors propose a hypervisor execution model that uses a sidecore combined with proposed but not yet implemented hardware extensions that would handle guest exits on a dedicated core in future processors. To our knowledge, we are the first to apply the sidecore technique to improve the performance of an emulated I/O device.



Chapter 8

Future Work

While the research presented in this thesis shows promising initial improvements in the performance of emulated virtual I/O devices, it is only the beginning. Many issues need to be further researched before emulated I/O devices can compete with paravirtual ones. We looked only at the very specific scenario of throughput micro-benchmarks, using a TCP connection where the guest machine is the sender, is assigned a single core, and the host is the receiver. Additional scenarios to be investigated include: (1) Latency scenarios, where the overhead of exits is more significant. (2) UDP and other protocols. For example, we've seen that ACK packets play a role in the performance of virtual devices in throughput scenarios, but UDP does not have ACK packets. (3) Scenarios where the guest is the receiver of networking traffic. (4) Scenarios where the guest is connected to another guest in the same machine, or to a remote machine. (5) Assigning more than 1 VCPU to the guest. (6) Running macro-benchmarks to see how real-world applications behave.

Another direction is finding a throughput/latency deciding heuristic to replace our current throughput-only interrupt coalescing timer setting from section 5.4. Understanding why we were unable to do without the interrupt coalescing timer of Eh might also help eliminate it altogether, improving both throughput and latency. In section 5.7 we showed that the Linux TCP stack does not fully utilize TCP segment aggregation to achieve maximum throughput in our setup. Improving the segment aggregation algorithm in the Linux kernel may be beneficial for throughput workloads in setups similar to ours. Another research direction is to eliminate the qemu global mutex to reduce the contention effects shown in section 6.2. While this may benefit paravirtual I/O devices as well, we believe it will be more beneficial to emulated ones due to the higher mutex contention caused by emulated devices, closing the gap between the two types of devices. Further improvement in the performance of emulated I/O devices can be achieved by implementing the devices as a kernel module, avoiding exits to user space, as was done in vhost-net [Tsi09], an implementation of the virtio-net network device in a kernel module.


In Chapter 6 we showed a partial sidecore implementation of Eh and the benefits it can bring. The next step would be to expand the use of the sidecore to emulate as many control registers of Eh as possible; we present the challenges we faced when trying to do so in section 8.1. Using a single sidecore to emulate a single NIC is wasteful. Previous work [HGL+13, KMN+16] has already shown that a single sidecore can be used to handle multiple paravirtual I/O devices at once. It makes sense to explore the use of a single sidecore to emulate multiple I/O devices in the same manner.

8.1 Challenges on the Way to Full Sidecore Emulation of E1000

Our partial sidecore emulation removes only the exits due to accesses to registers in page 2, of which TDT is the only register actually accessed by Eg during I/O processing. Emulating page 2 on a sidecore is fairly straightforward, since the semantics of all registers in page 2 are simple. Page 0, on the other hand, contains registers with problematic semantics, which might render it non-sidecore-emulatable. In this section we present the page 0 registers with problematic semantics. For each of these registers we present the problems arising from their semantics, as well as suggestions for possible solutions, in an effort to make page 0 sidecore emulatable. We note that these proposed solutions have not been fully explored, and are presented here as points for further research.

8.1.1 ICR

Eb's specifications [Int09] state that, with regard to ICR, "All register bits are cleared upon read". These semantics cannot be satisfied when using a sidecore, simply because the sidecore can detect changes in registers only when new values are written into them, whereas here it must detect that the register was read. Since there is no indication in the value of ICR that it was read by the guest, there is no way for the sidecore to know when to clear it.

We might circumvent this problem by assuming that Eg reads ICR only once when handling an interrupt. This assumption is reasonable for Linux guests, since ICR is used to inform Eg of the reason for the interrupt, and since ICR is expected to clear upon read, there is no reason to read it again until the next interrupt is raised. Under this assumption, Eh can hold an ICR shadow variable, which is filled instead of ICR and is copied into ICR and cleared atomically right before an interrupt is raised, so that the state of ICR is correct when the interrupt is raised, and ICR shadow is ready to collect interrupt reasons for the next interrupt. In this scheme ICR is never actually cleared, but the correct value is in ICR when it is read by the interrupt handler of the guest. To make the above solution work, there is a problem to solve. In the current


implementation of Eh, the interrupt injection is implemented as a "level" interrupt. When the IRQ level is raised by Eh to inject an interrupt, the injected interrupt causes the interrupt handler of Eg to run. The handler reads ICR to get the interrupt reason. Reading ICR causes an exit, and when Eh handles this exit it both clears ICR and lowers the IRQ level. This implementation ensures that the IRQ level is down when the execution of the interrupt handler in Eg is over. But if the ICR read doesn't cause an exit, as happens when using a sidecore, then Eh has no indication to lower the IRQ level. After the interrupt handler terminates, the Linux kernel writes to the 2 LAPIC MSRs. These writes cause exits to KVM, which checks the IRQ level before entering the guest again, and since the IRQ level is still up, KVM immediately injects another interrupt. This causes an infinite loop of interrupts in the guest.

A solution to this problem might be to change the type of interrupt used by Eh to edge-triggered. With edge-triggered interrupts, there is no need to lower the interrupt level explicitly, avoiding the infinite loop.
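Putting the two pieces of subsection 8.1.1 together (the ICR shadow and edge-triggered injection), a sketch might look like this. This is our construction under the single-read assumption above; none of these names exist in QEMU:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the ICR-shadow idea: interrupt causes accumulate in
 * icr_shadow, and right before an interrupt is raised the shadow is
 * moved into ICR and cleared in one atomic step, so the clear-on-read
 * that the sidecore cannot observe is never actually needed. */

static _Atomic uint32_t icr_shadow;   /* written by the device model */
static uint32_t icr;                  /* the register value the guest reads */

void set_interrupt_cause(uint32_t bits) {
    atomic_fetch_or(&icr_shadow, bits);
}

void raise_interrupt(void) {
    /* move-and-clear atomically: icr gets the causes, shadow restarts empty */
    icr = atomic_exchange(&icr_shadow, 0);
    /* ... inject an (edge-triggered) interrupt into the guest here ... */
}

uint32_t guest_read_icr(void) {
    return icr;                       /* no exit, and no clear required */
}
```

Note that a second guest read would still see the same value, which is harmless under the assumption that the guest reads ICR exactly once per interrupt.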

8.1.2 IMC and IMS

IMC and IMS are both used to set the mask of allowed interrupts in IMS. Eg sets bits in IMS directly, and clears bits in IMS by setting them in IMC. Eb's specifications require that setting a bit in IMC atomically clear this bit in IMS. These semantics are impossible to ensure using a sidecore, since it will take some time for the sidecore thread to notice the change in IMC.

There might be a solution to this problem if we are allowed to deviate from the specifications without affecting any reasonable driver. A reasonable driver is a driver in which the events concerning IMC and IMS take place in the following order: 1. Eh raises an interrupt. 2. Eg starts handling the interrupt, immediately disabling interrupts by writing to IMC. 3. Eg finishes handling the interrupt, after which it enables interrupts again by setting IMS. This order is reasonable since an interrupt cannot be raised after interrupts are disabled, and they will naturally be disabled after an interrupt is raised, in order to prevent further interrupts from interfering with the handling of the current one.

If the above assumption is allowed, we can use IMC and IMS in Eh in the way shown in Algorithm 8.1. This algorithm would run as part of the interrupt raising method of Eh and determine whether raising an interrupt is currently allowed, according to the values of IMC and IMS. Figure 8.1 shows a graphical representation of Algorithm 8.1 as a state machine. The state machine starts in state A. At this point interrupts are enabled and the algorithm allows Eh to raise interrupts. When an interrupt is raised, the interrupt handler in Eg is called, and the handler disables further interrupts by setting IMC, which switches the machine to state B. At state B, if Eh tries to raise an interrupt, it should fail, since IMC was set, and indeed the algorithm won't allow it. After Eg finishes handling the interrupt, it enables interrupts again by setting IMS, which switches the machine to


Algorithm 8.1 Interrupt raising algorithm for sidecore emulating IMC and IMS
1: if IMC == 0 and IMS == 0 then
2:     raise interrupt
3: else if IMC != 0 and IMS != 0 then
4:     IMC = IMS = 0
5:     raise interrupt
6: else
7:     don't raise an interrupt
8: end if

[Figure: state machine with start state A (IMC=0, IMS=0), state B (IMC=1, IMS=0) and state C (IMC=1, IMS=1); Eh raising an interrupt and Eg disabling interrupts moves A to B.]

Figure 8.1: State machine of the interrupt raising algorithm for sidecore emulating IMC and IMS

state C. At state C, if Eh tries to raise an interrupt it should succeed, since IMS was set, and indeed the algorithm will allow it, moving the machine to state A before raising the interrupt to start all over again.

Known Problems with Algorithm 8.1

Algorithm 8.1 is not a full solution but only an initial idea for future work, and several issues remain to be addressed. First, while in state A, before moving to state B, more than one interrupt can theoretically be injected into the guest. This can happen if a second interrupt is injected before the handler called for the first one sets IMC, for example, if the VCPU thread is not scheduled while the interrupts are injected. Second, the algorithm handles only the standard case. At any given point, however, the guest OS might deactivate the network interface that uses Eg, at which point it sets IMC to disable interrupts. If this happens while in state C, the write will be missed by the sidecore, since IMC already equals 1 in state C. In this situation the state machine will allow an interrupt to be raised by Eh when interrupts should be disabled.
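The second problem can be reproduced with a short sketch under the same hypothetical naming as before (these helpers are illustrative, not QEMU's actual API): a redundant IMC write while in state C leaves the shadow values unchanged, so the algorithm keeps allowing interrupts that the guest has just masked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical repro of the second known problem with Algorithm 8.1.
 * The sidecore tracks the guest's IMC/IMS writes in shadow variables;
 * an IMC write in state C (e.g., when the guest OS deactivates the
 * interface) is indistinguishable from no write at all, because the
 * imc shadow is already nonzero. */
typedef struct { uint32_t imc, ims; } shadow_t;

/* Algorithm 8.1, as in the text. */
bool try_raise(shadow_t *s)
{
    if (s->imc == 0 && s->ims == 0) return true;   /* state A */
    if (s->imc != 0 && s->ims != 0) {              /* state C -> A */
        s->imc = s->ims = 0;
        return true;
    }
    return false;                                  /* state B */
}

/* What the sidecore records when it observes a guest register write. */
void observe_imc_write(shadow_t *s) { s->imc = 1; }
void observe_ims_write(shadow_t *s) { s->ims = 1; }
```

Driving the shadows to state C and then observing a second IMC write leaves them in state C, so `try_raise` still returns true, which is exactly the failure described above.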

Chapter 9

Conclusion

Emulated I/O devices are considered inferior to paravirtual I/O devices due to their poorer performance, which the literature commonly attributes to their large number of virtualization exits. Our comparison of QEMU's emulated e1000 NIC with the paravirtual virtio-net NIC shed light on this misconception. We found several key implementation differences between e1000 and virtio-net, unrelated to virtualization, and showed that it was these differences, not the exits, that caused e1000's low throughput. By adding numerous improvements to the implementation of e1000, most of them inspired by the implementation of virtio-net, we greatly narrowed the throughput gap between e1000 and virtio-net. Whereas virtio-net achieved 20–77x better throughput than e1000 without our improvements, it achieved only 1.2–2.2x better throughput than our improved version of e1000 in single core throughput scenarios. In dual core throughput scenarios, virtio-net's advantage shrank from 25–173x over the unimproved e1000 to 1.25–2.7x over our improved version, when using a partial sidecore implementation. This relatively small remaining difference shows that the conception that emulated I/O devices are significantly inferior to paravirtual I/O devices in throughput scenarios is a misconception, and that exits do not play a large role in the throughput difference between emulated and paravirtual I/O devices.

Bibliography

[ABYTS11] Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, and Assaf Schuster. vIOMMU: Efficient IOMMU Emulation. In USENIX Annual Technical Conference (ATC), 2011.

[BDF+03] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Symposium on Operating Systems Principles (SOSP), 2003.

[BNT17] Eduard Bugnion, Jason Nieh, and Dan Tsafrir. Hardware and Software Support for Virtualization. Morgan & Claypool Publishers, 2017.

[CFH+05] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live migration of virtual machines. In Proceedings of the Conference on Symposium on Networked Systems Design & Implementation, 2005.

[ESP+09] Hideki Eiraku, Yasushi Shinjo, Calton Pu, Younggyun Koh, and Kazuhiko Kato. Fast networking with socket-outsourcing in hosted virtual machine environments. In Proceedings of the ACM Symposium on Applied Computing, 2009.

[GKR05] Dror Goldenberg, Michael Kagan, and Ran Ravid. Transparently achieving superior socket performance using zero copy socket direct protocol over 20Gb/s InfiniBand links. In IEEE International Conference on Cluster Computing, 2005.

[GOB01] Karim Ghouas, Knut Omang, and Hakon Bugge. VIA over SCI - consequences of a zero copy implementation, and comparison with VIA over Myrinet. In IEEE International Parallel and Distributed Processing Symposium, 2001.

[Gol74] Robert P. Goldberg. Survey of virtual machine research. In IEEE Computer, 1974.

[HGL+13] Nadav Har'El, Abel Gordon, Alexander Landau, Muli Ben-Yehuda, Avishay Traeger, and Razya Ladelsky. Efficient and Scalable Paravirtual I/O System. In USENIX Annual Technical Conference (ATC), 2013.

[Int09] Intel. PCI/PCI-X Family of Gigabit Ethernet Controllers Software Developer's Manual. Intel, 2009.

[KMN+16] Yossi Kuperman, Eyal Moscovici, Joel Nider, Razya Ladelsky, Abel Gordon, and Dan Tsafrir. Paravirtual Remote I/O. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.

[KPS+09] Younggyun Koh, Calton Pu, Yasushi Shinjo, Hideki Eiraku, Go Saito, and Daiyuu Nobori. Improving Virtualized Windows Network Performance by Delegating Network Processing. In IEEE International Symposium on Network Computing and Applications, 2009.

[KRS07] Sanjay Kumar, Himanshu Raj, and Karsten Schwan. Re-architecting VMMs for Multicore Systems: The Sidecore Approach. In Workshop on Interaction between Operating Systems & Computer Architecture (WIOSCA), 2007.

[KRSG07] Sanjay Kumar, Himanshu Raj, Karsten Schwan, and Ivan Ganev. Re-architecting VMMs for Multicore Systems: The Sidecore Approach. In Proc. of the 2007 Workshop on the Interaction between Operating Systems and Computer Architecture, 2007.

[LA09] Jiuxing Liu and Bulent Abali. Virtualization polling engine (vpe): Using dedicated cpu cores to accelerate i/o virtualization. In Proceedings of the 23rd International Conference on Supercomputing, 2009.

[LBYG11] Alexander Landau, Muli Ben-Yehuda, and Abel Gordon. SplitX: Split guest/hypervisor execution on multi-core. In Workshop on I/O Virtualization, 2011.

[LH02] Rick Lindsley and Dave Hansen. BKL: One Lock to Bind Them All. In Ottawa Linux Symposium, 2002.

[LSM08] D. J. Leith, R.N. Shorten, and G. McCullagh. Experimental evaluation of Cubic-TCP. In The 6th International Workshop on Protocols for Fast Long-Distance Networks (PFLDnet 2008), 2008.

[MAC+08] Dutch T. Meyer, Gitika Aggarwal, Brendan Cully, Geoffrey Lefebvre, Michael J. Feeley, Norman C. Hutchinson, and Andrew Warfield. Parallax: Virtual disks for virtual machines. In Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems, 2008.

[Mol07] Ingo Molnar. KVM/net, paravirtual network device. http://kvm.vger.kernel.narkive.com/hNALI5yI/announce-kvm-net-paravirtual-network-device, January 2007.

[NLH05] Michael Nelson, Beng-Hong Lim, and Greg Hutchins. Fast transparent migration for virtual machines. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, 2005.

[PACM11] Vern Paxson, Mark Allman, H.K. Jerry Chu, and Matt Sargent. Computing TCP's Retransmission Timer. RFC 6298, RFC Editor, June 2011.

[PG74] Gerald J. Popek and Robert P. Goldberg. Formal requirements for virtualizable third generation architectures. Communications of the ACM (CACM), 1974.

[Poe09] Lennart Poettering. Measuring lock contention. http://0pointer.de/blog/projects/mutrace.html, 2009.

[PS12] Darko Petrovic and Andre Schiper. Implementing virtual machine replication: A case study using Xen and KVM. In IEEE International Conference on Advanced Information Networking and Applications, 2012.

[QEM17] QEMU. QEMU networking documentation. http://wiki.qemu.org/Documentation/Networking, 2017.

[RLM13] Luigi Rizzo, Giuseppe Lettieri, and Vincenzo Maffione. Speeding up packet I/O in virtual machines. In ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2013.

[Rus08] Rusty Russell. virtio: towards a de-facto standard for virtual i/o devices. In ACM SIGOPS Operating Systems Review (OSR), 2008.

[RW11] Mendel Rosenblum and Carl Waldspurger. I/O Virtualization. ACM Queue, 9(11):30–39, Nov 2011.

[Tsi09] Michael S. Tsirkin. vhost net: a kernel-level virtio server. https://lwn.net/Articles/346267/, 2009.

[VMS+16] Sander Vrijders, Vincenzo Maffione, Dimitri Staessens, Francesco Salvestrini, Matteo Biancani, Eduard Grasa, Didier Colle, Mario Pickavet, Jason Barron, John Day, and Lou Chitkushev. Reducing the complexity of virtual machine networking. IEEE Communications Magazine, 2016.

[Vmw09] VMware. Performance evaluation of VMXNET3 virtual network device. http://www.vmware.com/pdf/vsp_4_vmxnet3_perf.pdf, 2009.


The Real Difference Between Emulation and Paravirtualization of High-Throughput I/O Devices

Arthur Kiyanovski

The Real Difference Between Emulation and Paravirtualization of High-Throughput I/O Devices

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Arthur Kiyanovski

Submitted to the Senate of the Technion - Israel Institute of Technology
Av 5777, Haifa, August 2017

This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty of Computer Science.

Acknowledgements

This work is dedicated to my late grandfather, Ben-Zion Kiyanovski, who passed away while the research presented here was being carried out. My grandfather fought courageously against the Nazis in World War II. Without people like him, none of us would be here today. I thank my dear wife Assya for her infinite support, without which I would not have been able to finish this research. I thank my advisor, Prof. Dan Tsafrir, for his help and guidance along the way. I thank the Israel Science Foundation (grant No. 605/12) for its generous financial support of my studies.

The generous financial help of the Technion during my studies is gratefully acknowledged.

Abstract

Virtualization has become increasingly popular in recent years, as more and more computing moves to the cloud computing model. Virtual machines use virtual I/O devices to perform their I/O operations. The most common virtual I/O devices today are aware, or paravirtual, I/O devices, which are designed specifically for virtual environments and do not implement the interface of any physical device. Paravirtual I/O devices combine good performance with the ability to interpose between the virtual machine and the physical hardware. This interposition capability is needed for many uses, such as live migration and the consolidation and aggregation of I/O devices, among others. Nevertheless, paravirtual I/O devices have drawbacks, both for the user and for hypervisor developers. The user must install a dedicated driver for the paravirtual device whenever switching hypervisors, since today most hypervisors support paravirtual devices that differ from one another. And hypervisor developers must implement and maintain drivers for all popular operating systems. These drawbacks do not exist in unaware, or emulated, I/O devices, which implement an interface identical to that of some physical I/O device. Like paravirtual devices, emulated devices provide interposition between the virtual machine and the physical hardware, but because they implement the interface of an existing physical device, the user need not install a new driver when switching hypervisors, since the drivers installed in the operating system for the physical device also work with the emulated virtual device. For the same reason, hypervisor developers need not develop dedicated drivers for all guest operating systems. Despite these advantages of emulated I/O devices, they are rarely used in deployments, because the performance they provide is significantly lower than that of paravirtual I/O devices.

The prevailing conception is that the real reason for the performance difference between emulated and paravirtual devices is the large number of virtualization exits caused while running an emulated device, compared with the small number of exits caused while running a paravirtual device. To test this prevailing conception for throughput workloads, we present a model that attempts to estimate the maximal throughput achievable by the emulated e1000 network device compared with the paravirtual virtio-net device, under the assumption that exits are the only difference between the two devices. We used network devices because, of all I/O device types, network devices handle the highest throughput, and we wanted to see the most extreme results possible. Our model predicts that virtio-net achieves 1.13x the throughput of e1000, but our measurements show that virtio-net achieves 20x the throughput of e1000. These results suggest that, contrary to the prevailing assumption in the literature, the number of exits is not the cause of the large difference between emulated and paravirtual I/O devices. We continued to investigate the differences between e1000 and virtio-net, in an attempt to discover the causes of the large gap between the throughput the two devices achieve. In our research we chose a configuration in which the virtual machine communicates directly with the physical machine it runs on, allocating a single CPU to the virtual machine. This configuration does not include a physical network device, and all network traffic passes between the virtual and physical machines through the kernel of the physical machine. We chose this configuration because it is the simplest one we could think of, so as to minimize configuration-induced complications and concentrate on the real differences between the virtual devices. We present the differences we found between the two devices, unrelated to virtualization, that harm e1000's performance. For each such difference we present an improvement to e1000 that eliminates, as far as we managed, the harm to e1000's performance. The improvements we proposed significantly increase the throughput of e1000. We managed to reduce the throughput gap between virtio-net and e1000 from 20x to 1.2x for large messages. These results show that e1000 can achieve throughput much closer to that of virtio-net than the prevailing conception predicts.

More generally, the results indicate that the prevailing conception that paravirtual devices deliver significantly higher throughput than emulated devices is incorrect, as is the prevailing conception that the large number of exits in emulated devices is the reason for their performance being significantly inferior to that of paravirtual devices in high-throughput scenarios. Later in this work we extended the configuration by allocating 2 CPUs to the hypervisor for running the virtual machine. This extension allowed us to use the sidecore approach. The sidecore approach, previously proven effective in reducing virtualization exits for paravirtual devices, had never been used for a full implementation of an emulated I/O device. In this work, although we eliminated only some of e1000's exits using a sidecore, we obtained a significant throughput improvement. Our partial sidecore implementation, together with the other improvements we found in the single-CPU configuration, reduced the throughput gap between virtio-net and e1000 from 25x to 1.25x for large messages. These results again showed that emulated devices can reach throughput close to that of paravirtual devices, contrary to the prevailing conception.