The Real Difference Between Emulation and Paravirtualization of High-Throughput I/O Devices

Arthur Kiyanovski

Technion - Computer Science Department - M.Sc. Thesis MSC-2017-19 - 2017

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Arthur Kiyanovski

Submitted to the Senate of the Technion — Israel Institute of Technology, Av 5777, Haifa, August 2017

This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty of Computer Science.

Acknowledgements

I would like to dedicate this thesis to my late grandfather, Ben-Zion Kiyanovski, who passed away while I was doing the research for this thesis. My grandfather fought courageously against the Nazis in World War II. Without people like him, none of us would be here today.

I would like to thank my dear wife Assya for her infinite support, without which I wouldn't have been able to finish this research. I would like to thank my advisor, Prof. Dan Tsafrir, for his help and guidance along the way.

The research leading to the results presented in this paper was partially supported by the Israel Science Foundation (grant No. 605/12).

The generous financial help of the Technion is gratefully acknowledged.

Contents

List of Figures

Abstract 1

Abbreviations and Notations 3

1 Introduction 5

2 Background 9
  2.1 TCP Essentials 9
    2.1.1 TCP Checksum Offloading 9
    2.1.2 TCP Segmentation Offloading 9
    2.1.3 TCP Congestion Control 10
    2.1.4 TCP SRTT 11
  2.2 Linux Network Stack Implementation Essentials 13
    2.2.1 The Socket Buffer 13
    2.2.2 NAPI 13
  2.3 QEMU Essentials 13
    2.3.1 Main Threads of QEMU 14
    2.3.2 The qemu global mutex 14
  2.4 The Intel Pro/1000 PCI/PCI-X NICs (Bare Metal E1000) 14
    2.4.1 Control Registers 15
    2.4.2 Main Actions During Normal Operation of the Bare Metal E1000 16
  2.5 The QEMU Emulated Intel Pro/1000 PCI/PCI-X NIC (E1000) 17
    2.5.1 Interrupt Coalescing 18
  2.6 The QEMU Virtio-Net Paravirtual NIC 19
    2.6.1 Interrupt and Kick Suppression 19
    2.6.2 TX Interrupts 19
  2.7 Virtio-Net TCP Send Sequences in Throughput Workloads 20
    2.7.1 Virtio-Net Dual Core Send Sequence 20
    2.7.2 Virtio-Net Single Core TCP Throughput Send Sequence 21

3 Motivation 27
  3.1 Interposition 27
  3.2 Emulated I/O Devices 28
  3.3 Paravirtual I/O Devices 28
  3.4 Emulated vs Paravirtual Devices 28
    3.4.1 Guest Modification 28
    3.4.2 Performance 29
  3.5 Emulated vs Paravirtual NICs in Different … 30
  3.6 Emulation vs Paravirtualization Comparison Model 30

4 Experimental Setup 35
  4.1 Hardware Setup 35
  4.2 Benchmarks 35
    4.2.1 Single Core Throughput Benchmark 35
    4.2.2 Dual Core Throughput Benchmark 36

5 Single Core Configuration 37
  5.1 Baseline Comparison 38
  5.2 Removal of TCP Checksum Calculation 38
  5.3 Removal of TCP Segmentation 39
  5.4 Improved Interrupt Coalescing 40
    5.4.1 ITR and TADV Conflict 41
    5.4.2 Static Set Interrupt Rate 42
    5.4.3 Interrupt Rate Considering ITR 43
    5.4.4 Evaluation 44
  5.5 Send from the I/O Thread 44
    5.5.1 Interrupt Coalescing in Virtio-Net 47
  5.6 Exposing PCI-X to Avoid Bounce Buffers 47
  5.7 Dropping Packets to Improve TSO Batching in Linux Guests 48
  5.8 Vectorized Send 53
  5.9 SRTT Calculation Algorithm Bug in Linux 54
    5.9.1 SRTT Calculation in the … 55
    5.9.2 Bug Description 56
    5.9.3 Effects of the Bug 56
    5.9.4 Bug Fix 57
  5.10 Final Throughput Comparison 58
  5.11 Improvements Summary 60

6 Initial Work on a Dual Core Configuration 65
  6.1 Baseline Comparison 65
  6.2 Scalability of the Emulated E1000 66
  6.3 Sidecore 68

    6.3.1 Partial Sidecore Implementation 69

7 Related Work 75

8 Future Work 77
  8.1 Challenges on the Way to Full Sidecore Emulation of E1000 78
    8.1.1 ICR 78
    8.1.2 IMC and IMS 79

9 Conclusion 81

List of Figures

2.1 CWND values over time, for two TCP connections with the same source and destination, one starting transmission at t=0, the other at t=100[sec], and both using the Cubic congestion avoidance algorithm ...... 12

2.2 Baseline Eh register exits ...... 17

2.3 Eh emulation of ICR reading. Some implementation details have been removed ...... 18

2.4 Vh dual core setup, single batch send sequence ...... 22

2.5 Vh single core setup, single batch send sequence ...... 24

2.6 Baseline Vh exits ...... 25

3.1 Google search results illustrating the problems with VMware Tools 29

3.2 Throughput comparison of Eb emulation vs a paravirtual NIC in different hypervisors. In QEMU/KVM and VirtualBox the paravirtual device is virtio-net, and in VMware Workstation it is vmxnet3 31

5.1 Throughput comparison between baseline Vh and baseline Eh 39
5.2 Change in throughput caused by removing the calculation of TCP checksum in Eh 40
5.3 Change in throughput caused by removing the TCP segmentation code in Eh 41
5.4 Throughput difference achieved when using the 2 types of interrupt coalescing heuristics described in subsections 5.4.2 (static) and 5.4.3 (ITR sensitive) 44
5.5 Change in throughput caused by using the improved static interrupt coalescing setting in Eh 45
5.6 Change in throughput caused by moving the sending of packets from the VCPU thread to the I/O thread in Eh 46
5.7 Change in throughput caused by enabling PCI-X mode in Eh 49
5.8 Vh compared to Eh with improvements 1-5 50
5.9 Throughput and CWND values over time, with Eh, running netperf TCP STREAM with default 16KB message size. At around 25 seconds packet dropping is enabled. 52
5.10 Change in throughput caused by adding the packet dropping for better TSO batching in Eh 53
5.11 Change in throughput caused by adding the packet dropping for better TSO batching in Vh 54
5.12 Change in throughput caused by using vectorized sending in Eh 55
5.13 The routine in Linux kernel 3.13 that calculates the new SRTT given the RTT of the currently ACKed packet and the previous SRTT (irrelevant code omitted) 56
5.14 The code of tcp_rtt_estimator() after fixing the bug in SRTT calculation (irrelevant code omitted) 57
5.15 tp->srtt values as they react to RTT values over time, in both the original implementation of tcp_rtt_estimator() on the left and the fixed version on the right 58

5.16 Difference in throughput of Eh with all improvements but packet dropping, with the SRTT bug and after fixing it ...... 58

5.17 Difference in throughput of Vh with the SRTT bug and after fixing it . . 59

5.18 Throughput comparison of the best versions of Vh and Eh ...... 59

5.19 Throughput increase achieved by adding our improvements to Eh . . . . 62

5.20 Throughput increase caused by each of the improvements added to Eh out of the highest achieved throughput ...... 63

6.1 Throughput comparison between baseline Vh and baseline Eh when running our dual core basic throughput benchmark ...... 66

6.2 Throughput difference between the best version of Vh when run on a single core vs when run on 2 cores ...... 67

6.3 Throughput difference between the best version of Eh when run on a single core vs when run on 2 cores ...... 68

6.4 Throughput difference between the best single core version of Eh and the best partial sidecore-emulated version of Eh 71
6.5 Throughput difference between the best partial sidecore-emulated version of Eh and the best version of Vh when running our dual core basic throughput benchmark 72
6.6 Throughput difference between the best partial sidecore-emulated version of Eh and the best version of Vh when running our dual core basic throughput benchmark without packet dropping 73

8.1 State machine of the interrupt raising algorithm for sidecore emulating IMC and IMS ...... 80

Abstract

Emulation of high-throughput Input/Output (I/O) devices for virtual machines (VMs) is appealing because an emulated I/O device works out of the box, without the need to install a new device driver in the VM when moving the VM from one hypervisor to another. The problem is that fully emulating a hardware device can be costly due to the multiple exits it induces. Installations therefore often prefer paravirtual I/O devices, which reduce the number of exits by making VMs aware that they are being virtualized, at the cost of having to install a new device driver when moving from one hypervisor to another. Previous studies report that paravirtual I/O devices provide 5.5–40x higher throughput than emulated ones, leading to the perception that emulated I/O devices are significantly inferior to paravirtual ones, despite the appealing properties of emulation. We challenge this perception and show that the throughput difference between QEMU's emulated e1000 and paravirtual virtio-net network devices is largely due to various implementation differences that are unrelated to virtualization. We resolve many of these differences and show that, consequently, the throughput difference between virtio-net and e1000 can be reduced from 20–77x to as little as 1.2–2.2x. We speculate that resolving the remaining differences will reduce this throughput difference further.



Abbreviations and Notations

I/O : Input/Output
OS : Operating System
DMA : Direct Memory Access
KVM : Kernel-based Virtual Machine
QEMU : Quick Emulator
NIC : Network Interface Controller
TX : Transmit. Usually refers to the transmit ring buffer of a NIC
RX : Receive. Usually refers to the receive ring buffer of a NIC
MTU : Maximum Transmission Unit, in this work 1500 bytes
IP : Internet Protocol
TCP : Transmission Control Protocol
RTT : Round Trip Time
SRTT : Smoothed Round Trip Time
RTO : Retransmission Timeout
CWND : TCP congestion window
ACK : A TCP segment sent to acknowledge the reception of a TCP segment

Eb : Intel PRO/1000 PCI/PCI-X NIC family

Ed : Linux kernel e1000 device driver in the context of a physical machine

Eh : QEMU emulation of the Eb

Eg : Linux kernel e1000 device driver in the context of a guest virtual machine

Vh : QEMU implementation of the virtio-net paravirtual NIC

Vg : virtio-net guest driver



Chapter 1

Introduction

Machine virtualization has been increasing in popularity in recent years as more and more computing is done in the cloud. Virtual machines use virtual I/O devices to perform their I/O operations. The most commonly used virtual I/O devices today are paravirtual I/O devices [KMN+16, Vmw09, Rus08, BDF+03], which are specifically designed for virtual environments and do not use the interface of any physical device. Paravirtual I/O devices combine good performance with interposition, which is necessary for many different applications, such as live migration, I/O device consolidation and aggregation, among others.

However, paravirtual devices have drawbacks for both the user and the hypervisor provider. The user is required to install compatible device drivers when moving from one hypervisor to another, since nowadays different hypervisors support different paravirtual devices. Moreover, the hypervisor provider must implement the required device driver for all operating systems (OS).

The above drawbacks do not exist when using emulated I/O devices, which are virtual I/O devices that implement the same interface as some existing physical I/O device. Like paravirtual I/O devices, emulated I/O devices also provide the benefits of interposition. However, since emulated I/O devices implement the specifications of known physical devices, there is no need to install new device drivers when moving from one hypervisor to another: the device driver for the physical device, already installed in the guest, will also work with the emulated I/O device. Furthermore, in the case of emulated I/O devices, it is unnecessary for the hypervisor provider to implement the device driver for each OS, since the driver installed in the guest OS for the physical device will also work with its emulated counterpart.
Despite these benefits, emulated I/O devices are rarely used in real life scenarios, due to the common conception that their performance is substantially lower than that of paravirtual I/O devices [Mol07, BNT17, RLM13, KPS+09, VMS+16, ESP+09]. This difference in performance is mainly attributed to the significantly larger number of virtualization exits caused by emulated devices [MAC+08, PS12, KMN+16]. To check whether this is indeed the case for throughput workloads, we present in Chapter 3 a model that tries to assess the maximum possible throughput of the e1000


emulated NIC when compared to the virtio-net paravirtual NIC, assuming exits are the only difference between the two NICs. We use NICs because they are the most I/O intensive of all I/O devices, and we wanted to see the most extreme effects possible. Our model predicted 1.13x better throughput for virtio-net, whereas the initial measured throughput was 20x better. This indicates that virtualization exits are not the dominant reason for this difference.

We went on to research the differences between the e1000 and virtio-net devices, in an attempt to find the factors, other than exits, that contribute to the throughput gap. For our research we chose a setup of guest-to-host communication, assigning a single core to the guest machine. This setup does not include a physical NIC in the I/O path, and all network I/O passes through the host kernel. We chose this simple setup to minimize interference and focus on the real differences between the virtual devices.

In Chapter 5 we present the differences, unrelated to virtualization, which negatively affect the throughput of e1000. For each difference we present an improvement to e1000 that reduces this difference as much as possible. Our proposed improvements significantly increased the performance of e1000, reducing virtio-net's advantage from 20x better throughput to as little as 1.2x for large message sizes. This shows that e1000 can achieve throughput much closer to that of virtio-net than suggested in the literature. These results indicate that the superiority of paravirtualization over emulation in high throughput scenarios is actually a misconception, and that the reason for the large throughput gap is not the abundance of exits in emulated devices, but rather the incorrect implementation of I/O processing in emulated devices.
Then, in Chapter 6, we expand our setup by assigning two cores to QEMU to run the guest machine, which enabled us to use one of the cores as a sidecore [KRS07]. The sidecore paradigm was used in several previous works [ABYTS11, HGL+13, KMN+16]. However, to the best of our knowledge, ours is the first attempt to implement a full-fledged emulated I/O device with a sidecore. While we used our sidecore to reduce only some of the exit inducing accesses to the e1000 device, it proved very beneficial for throughput scenarios. When running our dual core benchmarks, we were able to reduce the advantage of virtio-net from 25x better throughput to 1.25x for large message sizes. These results once more show that e1000 can achieve throughput much closer to that of virtio-net than initially suggested by previous works.

In Chapter 8 we describe the performance aspects of emulated vs paravirtual devices that we did not cover in this work, discuss the challenges in using a sidecore to reduce more exit inducing accesses to the e1000 device, and suggest directions for future work. We conclude in Chapter 9.

In this work we will frequently refer to the virtio-net and e1000 virtual NICs, as well as to their device drivers. For simplicity, we will henceforth employ the abbreviations listed in Table 1.1. The contents of this table also appear in the Abbreviations and Notations list; they are repeated here to avoid confusion, as they are not standard.


Abbreviation  Meaning
Eb            Intel PRO/1000 PCI/PCI-X NIC family
Ed            Linux kernel e1000 device driver in the context of a physical machine
Eh            QEMU emulation of Eb
Eg            Linux kernel e1000 device driver in the context of a guest virtual machine
Vh            QEMU implementation of the virtio-net paravirtual NIC
Vg            virtio-net guest driver

Table 1.1: Abbreviations for e1000 and virtio-net virtual NICs and device drivers



Chapter 2

Background

In this chapter we present background on the different aspects of virtual network communications: the TCP protocol, the Linux kernel network stack implementation, and the operation of the e1000 emulated and the virtio-net paravirtual devices.

2.1 TCP Essentials

The Transmission Control Protocol, or TCP, is one of the most prevalent networking protocols on the Internet. We will not describe this complicated protocol in detail, but only those parts of it essential to understanding our work.

2.1.1 TCP Checksum Offloading

To ensure reliable data transfer by TCP, each TCP segment contains a checksum field, which is roughly a 1’s complement sum of all words in the TCP segment. Back in the days when network traffic was relatively scarce, the TCP stack calculated the TCP checksum for every segment as part of the segment creation process. Calculating the TCP checksum for large TCP segments is CPU intensive, and is an undesired overhead when high network performance is required. TCP Checksum Offloading, or TCO, is the capability of a NIC to calculate the TCP checksum for the packets it processes. TCO improves latency, since dedicated NIC hardware performs checksum calculation quicker than a general purpose CPU, and also adds parallelism, since while the NIC is calculating the checksum, the CPU can continue doing other useful work.
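The 1's complement sum mentioned above can be sketched in C roughly as follows. This is an RFC 1071-style fold; the helper name is ours, and a real TCP stack additionally includes a pseudo-header of IP addresses and lengths in the sum, which we omit here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the Internet (1's complement) checksum that TCO offloads
 * to the NIC. Sums 16-bit big-endian words, folds the carries, and
 * returns the 1's complement of the result. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {                       /* sum 16-bit words */
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                                /* odd trailing byte, zero-padded */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)                       /* fold carries into low 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;                  /* 1's complement of the sum */
}
```

A receiver can verify a segment by checksumming it with the checksum field included: the result is 0 when the data is intact, which is the property that makes the per-word loop (and hence its offload to dedicated NIC hardware) worthwhile for large segments.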

2.1.2 TCP Segmentation Offloading

The maximum size of an IP packet is 64KB, but in order for such a packet to be sent on the Ethernet wire, it needs to be segmented into smaller MTU (typically 1500 Bytes) size packets. The trivial way of achieving this is by never creating packets larger than MTU in the kernel. This solution hurts throughput oriented workloads, since the packet


creation process consists of numerous stages (adding a TCP header, adding an IP header, etc.), and the smaller the packet, the larger the relative overhead per byte sent. To improve throughput, TCP segmentation offload, or TSO, was invented. TSO is the capability of a NIC to segment packets larger than MTU into smaller MTU size packets before sending them on the wire. With TSO, packets created by the network stack can be larger than MTU, and thus throughput oriented workloads suffer less from the packet creation overhead.
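The amortization behind TSO can be illustrated with a tiny helper (ours, purely illustrative) that computes how many wire packets one large TSO packet becomes, assuming a maximum segment size of 1460 bytes, i.e., a 1500-byte MTU minus 40 bytes of TCP/IP headers:

```c
#include <assert.h>

/* Assumed MSS: 1500-byte MTU minus 20-byte IP and 20-byte TCP headers. */
enum { MSS = 1460 };

/* How many MTU-sized wire packets the NIC emits for one TSO packet. */
static unsigned tso_segments(unsigned payload_bytes)
{
    return (payload_bytes + MSS - 1) / MSS;   /* ceiling division */
}
```

With TSO, the stack pays its per-packet creation cost once for a 64KB packet, while the NIC emits all 45 wire segments; without it, the stack would run the full creation pipeline 45 times.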

2.1.3 TCP Congestion Control

TCP is a protocol that provides reliability. Reliability means that all the data that was sent from the source will eventually arrive at the destination. For example, if TCP detects that a packet was lost, it will retransmit the packet to ensure reliability. But retransmitting a packet might not be enough, as the reason for the loss of the packet might be high congestion on the path of the TCP connection. High congestion means that somewhere along the path from source to destination there is a network node (e.g. a router) that receives more networking traffic than it can handle, which forces this node to drop some of the packets arriving at it. If TCP only retransmitted a lost packet, future packets would probably be dropped again due to the same congestion. A better solution is not only to retransmit the packet, but also to reduce the rate at which packets are sent, which reduces the congestion and thus the chance of packets being dropped in the future. When to reduce the transmission rate due to congestion and when to increase it again is determined by the congestion avoidance algorithm, which is a part of TCP. There are many different congestion avoidance algorithms, but the general idea is very similar.

Before we continue, we need to define a few terms we will be using. In a TCP connection, "in flight" packets are packets that were already sent but for which no ACK packet has yet been received. It is important to note that, in the Linux kernel implementation of TCP, a packet is considered "in flight" the moment it is added to the TX ring. TCP holds a variable called CWND, which is short for congestion window. CWND holds the number of MTU sized packets that can be in flight for the current TCP connection. So, for example, if CWND=10, TCP will allow a transmission of data equivalent to 10 MTU sized packets before stopping to wait for an ACK for the first packet, at which time it will be able to send another packet.
The TCP connection has different states, but we will discuss only two of them to explain the general idea: Slow Start and Congestion Avoidance. Each connection starts in the slow start state, where it remains for as long as no congestion is detected by TCP. In slow start, CWND is increased by one each time an ACK is received for a packet, ensuring that the transmission rate increases for as long as possible. Once congestion is detected, usually when no ACK for a sent packet has been received for a predefined amount of time, the TCP connection switches to the congestion avoidance


state. Upon switching to the congestion avoidance state, CWND is reduced, to reduce the transmission rate, and thus to reduce the congestion caused by the current TCP connection. To ensure the reliability of the connection, the lost packet is retransmitted. Different congestion avoidance algorithms behave differently at this point. The Tahoe algorithm, for example, reduces CWND to 1 and switches back to slow start. The Reno algorithm, on the other hand, halves CWND (CWND = CWND/2), skips the slow start state and enters a third state, called Fast Recovery, which we will not cover here. For the purposes of this thesis, it is important only that CWND grows as long as no congestion is detected. Once congestion is detected, CWND is reduced, and when CWND increases again, it does so from its new value.

We conclude this explanation of the congestion avoidance algorithm by showing how CWND changes over time in a real life example. Figure 2.1, taken from Figure 2 in [LSM08], shows the values of CWND for two different connections between the same source and destination machines, when running the Cubic congestion avoidance algorithm, which is also the default algorithm used in Linux 3.13, the version used in all machines in this thesis. At time 0, a single TCP connection starts transmitting. Initially, CWND increases until it reaches the maximum possible rate of one of the nodes on the transmission path, at which time it experiences congestion in the form of packet loss by this node. This packet loss causes a drop in CWND from 8500 to 7000. CWND then begins to increase again, until another packet is lost and CWND drops once more. We observe that this pattern is fairly stable, with a CWND median point of around 7800. This is what a stable single TCP connection looks like when there are no changes in the network it flows through.
This pattern is often called the TCP Sawtooth, since it resembles the teeth of a saw. After 100 seconds, another TCP connection initiates transmission to the same destination as the first one. Naturally, the second connection adds congestion to the line. This congestion initially causes more packet losses to the first connection, since its CWND is initially larger, and therefore the chances of a packet loss are greater for the first connection. While the CWND of the first connection is decreasing due to congestion, the CWND of the second connection increases until the path is saturated. As time passes, both connections stabilize around the same CWND value, around half of the value that the first connection stabilized at when it was the only connection on the path. This also demonstrates the fairness property of TCP. Fairness in this context means that after enough time is allowed for stabilization, all TCP connections get the same share of network bandwidth.
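The CWND dynamics described above can be condensed into a toy model. We use Reno-style halving for brevity (as noted, Linux 3.13 actually defaults to the considerably more involved Cubic), and all names here are ours, not the kernel's:

```c
#include <assert.h>

/* Toy Reno-style congestion control state; cwnd counts MTU-sized packets. */
struct cc_state {
    unsigned cwnd;      /* congestion window */
    unsigned ssthresh;  /* slow start / congestion avoidance boundary */
    unsigned acked;     /* ACKs seen since the last cwnd bump */
};

static void cc_on_ack(struct cc_state *c)
{
    if (c->cwnd < c->ssthresh) {
        c->cwnd++;                      /* slow start: +1 per ACK */
    } else if (++c->acked >= c->cwnd) { /* congestion avoidance: */
        c->acked = 0;                   /* roughly +1 per round trip */
        c->cwnd++;
    }
}

static void cc_on_loss(struct cc_state *c)
{
    c->ssthresh = c->cwnd / 2;          /* Reno halves the window */
    c->cwnd = c->ssthresh > 1 ? c->ssthresh : 1;
    c->acked = 0;
}
```

Driving this model with a stream of ACKs punctuated by occasional losses reproduces the sawtooth shape of Figure 2.1: linear-ish growth, a halving at each loss, and regrowth from the new value.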

2.1.4 TCP SRTT

TCP Retransmission Timeout, or RTO, is the amount of time TCP waits for the ACK of a transmitted segment. If the ACK isn’t received within RTO time from the sending of the packet, it is considered lost and needs to be retransmitted [PACM11]. The value


Figure 2.1: CWND values over time, for two TCP connections with the same source and destination, one starting transmission at t=0, the other at t=100[sec], and both using the Cubic congestion avoidance algorithm

of RTO depends on the round trip time, or RTT, of a packet, which is the time it takes for a packet to travel to the destination plus the time it takes for the ACK of this packet to travel back to the source. The dependency of RTO on RTT is reasonable: we would expect a packet sent to a nearby machine on the same local network to arrive within a few microseconds, with an RTO that is also on the order of microseconds, whereas a packet sent to a destination on the other side of the world might take seconds to arrive, with an RTO on the order of seconds.

At times there might be singular spikes in the RTT of a single packet due to interference or congestion on the line. To mitigate those spikes and have a stable estimate of the current value of RTT, a Smoothed Round Trip Time, or SRTT, is maintained. SRTT is updated each time an ACK is received for a sent packet, using the following formula:

SRTT = (1 − α) ∗ SRTT + α ∗ RTT (2.1)

RTT in equation 2.1 is the RTT for the last ACKed packet, and the value of α depends on the system. In our Linux version 3.13, α equals 1/8. In general, the smaller the α, the slower SRTT responds to changes in RTT.
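With α = 1/8, equation 2.1 reduces to the shift-based update below. This is our fixed-point sketch of the idea, in the style kernels tend to use, not the actual Linux routine (which is the subject of Section 5.9):

```c
#include <assert.h>

/* Fixed-point sketch of equation 2.1 with alpha = 1/8:
 * srtt = srtt - srtt/8 + rtt/8, implemented with shifts. */
static unsigned srtt_update(unsigned srtt, unsigned rtt)
{
    return srtt - (srtt >> 3) + (rtt >> 3);
}
```

A steady RTT leaves SRTT unchanged, while a single spike moves it only one eighth of the way toward the new sample, which is exactly the smoothing effect described above.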


2.2 Linux Network Stack Implementation Essentials

2.2.1 The Socket Buffer

NICs use a ring buffer called TX for the queuing of packets for the NIC to send. However, TX is not the only queue used for the sending of packets. The socket implementation in the Linux kernel holds an additional send queue, the socket buffer. The socket buffer is used when the TX ring cannot be used, e.g., because it is already full or because TCP doesn't allow more packets to be sent before ACKs for previous packets are received (as explained in Section 2.1.3). If packets can be added to TX, whenever a new packet is created it is immediately added to TX to be sent. However, once packets can no longer be added to TX, they accumulate in the socket buffer. The socket buffer also contributes to the aggregation of sent messages into large packets in the TCP stack. If the last packet currently in the socket buffer is smaller than the maximum allowable size of an IP packet (64KB), then a new message sent via this socket will be appended to the end of this last packet.
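The tail-coalescing behavior just described can be modeled in a few lines of C (a toy model; the function and constant names are ours):

```c
#include <assert.h>

#define IP_MAX_BYTES 65536u   /* maximum IP packet size */

/* Toy model of socket-buffer tail coalescing: a new message is merged
 * into the last queued packet while that packet stays within the 64KB
 * IP limit; otherwise a fresh packet is started.
 * Returns 1 if a new packet was created, 0 if the message was merged. */
static int sockbuf_queue(unsigned *tail_pkt_bytes, unsigned msg_bytes)
{
    if (*tail_pkt_bytes + msg_bytes <= IP_MAX_BYTES) {
        *tail_pkt_bytes += msg_bytes;   /* merge into the tail packet */
        return 0;
    }
    *tail_pkt_bytes = msg_bytes;        /* start a new tail packet */
    return 1;
}
```

This coalescing is what lets many small application writes leave the stack as a handful of large packets, which in turn is what makes the TSO batching of Section 2.1.2 effective.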

2.2.2 NAPI

New API, or NAPI, is an interface for interrupt mitigation in the Linux kernel. The goal of NAPI is to increase the throughput of a network device by reducing the number of receive interrupts raised by the device. Algorithms 2.1 and 2.2 show how NAPI achieves interrupt mitigation in two stages. Algorithm 2.1 runs as part of the top half of the receive interrupt handler of the network driver. It turns the interrupts off, which stops them from flooding the device driver in the event of a large stream of received packets. Algorithm 2.2, which runs as part of the bottom half of the interrupt handler, processes the packets in RX, up to a budget. If there are more packets than the budget, which means that the packet load in RX is high, interrupts are not enabled. Instead, the bottom half is scheduled again to continue polling RX. If, on the other hand, there were fewer than the budgeted number of packets in RX, which means the load on RX is low, there is no reason to waste CPU cycles on polling RX, and so interrupts are enabled again.

Algorithm 2.1 NAPI top half algorithm
1: turn interrupts off
2: schedule NAPI bottom half

2.3 QEMU Essentials

In this section we present implementation details of the QEMU hypervisor, which we refer to many times in this thesis.


Algorithm 2.2 NAPI bottom half algorithm
1: while received packets < budget AND there are packets in RX do
2:     receive next packet
3: end while
4: if received packets == budget then
5:     reschedule NAPI bottom half
6: else
7:     turn interrupts on
8: end if
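The two NAPI halves can be rendered as compact C over a toy device model (the struct and field names are ours, not the kernel's NAPI API):

```c
#include <assert.h>

/* Toy device model for the NAPI algorithms above. */
struct dev {
    int rx_pending;     /* packets waiting in the RX ring */
    int irq_enabled;    /* are receive interrupts on? */
    int rescheduled;    /* is the bottom half scheduled to poll again? */
};

/* Algorithm 2.1: top half of the receive interrupt handler. */
static void napi_top_half(struct dev *d)
{
    d->irq_enabled = 0;     /* turn interrupts off */
    d->rescheduled = 1;     /* schedule the bottom half */
}

/* Algorithm 2.2: bottom half, polls RX up to a budget. */
static void napi_bottom_half(struct dev *d, int budget)
{
    int done = 0;
    while (done < budget && d->rx_pending > 0) {
        d->rx_pending--;    /* receive next packet */
        done++;
    }
    if (done == budget) {
        d->rescheduled = 1; /* high load: keep polling */
    } else {
        d->rescheduled = 0;
        d->irq_enabled = 1; /* low load: re-enable interrupts */
    }
}
```

Under a burst, the device oscillates between polling passes with interrupts off; once a pass drains RX below the budget, interrupts come back on and the driver returns to interrupt-driven operation.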

2.3.1 Main Threads of QEMU

QEMU has several different types of threads, two of which are relevant to this work: (1) A VCPU thread for each virtual CPU allocated to the guest machine. The guest machine runs in the context of these threads, and whenever an exit from the guest to the host occurs, its initial handling is also done in the context of the VCPU thread. (2) A single I/O thread per guest machine, which handles all asynchronous events via an event loop. An asynchronous event may be a received network packet from the outside world that needs handling, or an event scheduled by a VCPU thread, such as sending a network packet to the outside world.

2.3.2 The qemu global mutex

The qemu global mutex is a mutex used to synchronize accesses to the global resources of QEMU by the different QEMU threads. The qemu global mutex is very similar to the historical Linux Big Kernel Lock [LH02]. In the context of this thesis, we are interested in how the qemu global mutex is used by the I/O and VCPU threads of QEMU. The VCPU thread holds the mutex while running in the context of QEMU (as a result of an exit), and releases it right before reentering the guest context via the call to KVM's ioctl. The I/O thread takes the mutex when it wakes up from polling its file descriptors, holds it while handling the file descriptor events (such as sending and receiving packets), and releases it right before going back to sleep to poll its file descriptors again.

2.4 The Intel Pro/1000 PCI/PCI-X NICs (Bare Metal E1000)

Eb is a family of Gigabit Ethernet NICs. It uses two descriptor ring buffers to perform I/O: the TX ring is used for packet transmission, and the RX ring for packet reception. Each descriptor in the TX and RX rings points to a single buffer, which can be as large as 16KB according to the specifications of Eb, but the buffers are actually set to be 4KB in size by Ed. This means that a single packet can take more than a single descriptor on the ring.


Eb also has control registers, which can be used by software to manage it. These registers are exposed as a single contiguous array. The array is logically divided into 4KB pages, with each page containing all the registers pertaining to a certain functionality; for example, the registers in pages 2 and 3 control the reception and transmission of packets, respectively. These registers are accessed by Ed via Memory-Mapped I/O (MMIO). In the following subsections we describe the control registers that are of most interest to us in this thesis, as well as the main actions taken by both Eb and Ed during normal I/O processing by Eb.

2.4.1 Control Registers

The following list describes the control registers accessed during normal I/O processing by Eb.

STATUS Device Status. A bitmask of status bits. For example, bit 1 in the STATUS register is set iff the link is up.

ICR Interrupt Cause Read. Once an interrupt is raised, ICR holds all the causes of the interrupt. For example, bit 1 is set iff the TX ring is empty. The Eb specifications [Int09] define read/clear behavior for this register, meaning that when ICR is read by software it is atomically set to 0.

ITR Interrupt Throttling. The minimum interval between two consecutive interrupts, in 256ns increments.

TADV Transmit Absolute Interrupt Delay Value. The maximum interval between two consecutive transmit-completed (TXDW) interrupts, in 1.024 microsecond increments.

IMS Interrupt Mask Set. A bitmask of allowed interrupts, in the same bit order as the ICR register. For example, Eb will raise an interrupt when the TX ring is empty only if bit 1 of IMS is set.

IMC Interrupt Mask Clear. A bitmask of interrupts to be cleared in IMS. For example, if the device driver wants to turn off the interrupt raised when the TX ring is empty, it writes the value 0x2 (only bit 1 set) into IMC, which causes Eb to clear bit 1 of IMS.

RDH Receive Descriptor Head. Points to the place on the RX ring where the next received packet will be placed by the NIC.

RDT Receive Descriptor Tail. Points to the place on the RX ring where the device driver will find the next received packet to process.

TDH Transmit Descriptor Head. Points to the place on the TX ring where the next packet to be sent by the NIC resides.


TDT Transmit Descriptor Tail. Points to the place on the TX ring where the device driver should put the next packet it wants to send.

2.4.2 Main Actions During Normal Operation of the Bare Metal E1000

During normal operation of Eb there are four main actions that involve accesses to Eb's control registers by Ed: sending a packet, receiving a packet, the top half of the interrupt handler, and the NAPI bottom half of the interrupt handler. We now describe how each of these actions is performed by both Ed and Eb, emphasizing the control registers accessed. For easier reading, whenever we refer to one of the control registers of Eb, we use the name of the register in capital letters without the word "register" after it; e.g., "the TDT register" is written as "TDT".

Sending a Packet

The Linux kernel network stack sends a packet by calling the e1000_xmit_frame() function, which puts the packet on the TX ring starting from the descriptor pointed to by TDT, and advances TDT by the number of descriptors the new packet occupies in the TX ring. A single packet might take more than one descriptor on the ring if it is larger than the maximum size of a single buffer that a descriptor can point to, which is 4KB in the current Ed. The update of TDT serves as a notification to Eb that there is a new packet on the TX ring that needs to be sent. In response to the change in TDT, Eb sends the packet and sets bit 0 in ICR to indicate that the sending is complete. Finally, Eb raises an interrupt if bit 0 in IMS is set, once the interrupt timers (ITR, TADV) indicate that an interrupt may be raised.

Receiving a Packet

When a new packet arrives on the Ethernet line, Eb puts the packet on the RX ring, starting from the descriptor pointed to by RDH, and advances RDH by the number of descriptors the new packet occupies in the RX ring. Eb then sets bit 7 in ICR to indicate that a packet was received. Finally, Eb raises an interrupt if bit 7 in IMS is set, once the interrupt timers (ITR, TADV) indicate that an interrupt may be raised.

Top Half of Interrupt Handling

When an interrupt is raised by Eb, the top half of the interrupt handler (e1000_intr()) in Ed is called. The top half first reads ICR to retrieve all the interrupt causes that accumulated in it. It then signals Eb to stop raising interrupts by setting all bits of IMC. This write to IMC is then flushed by reading the STATUS register. Finally, the top half schedules the NAPI bottom half to run later.


NAPI Bottom Half of Interrupt Handling

The NAPI bottom half (e1000_clean()) works in five stages:
1. It clears all the descriptors on the TX ring whose packets have already been sent by Eb.
2. It goes over the RX ring from RDT to RDH and pushes the packets pointed to by these descriptors up the network stack, while rearming the processed descriptors with newly allocated buffers.
3. It advances RDT over the already processed descriptors.
4. It updates the value of ITR according to the workload properties observed by Ed.
5. It enables all interrupts by setting the bits in IMS back on. This write to IMS is then flushed by reading the STATUS register.
Since this is a NAPI bottom half, stages 2 and 3 are done in chunks; after each chunk, if there are still unhandled packets on the RX ring, the bottom half is rescheduled without proceeding to the next stages. Stages 4 and 5 run only when there are no packets left on the RX ring.

2.5 The QEMU Emulated Intel Pro/1000 PCI/PCI-X NIC (E1000)

Eh uses the trap-and-emulate [Gol74] paradigm to emulate Eb. As already explained in Section 2.4, Ed controls Eb by accessing its control registers. Therefore the natural way to emulate Eb is to trap each access to these control registers by Eg, and emulate the necessary behavior according to the accessed register. To accomplish this, when Eg asks to map the control registers to memory, Eh leaves this memory unmapped. From that point on, each access by Eg to the control registers causes an exit, which is then handled by Eh. Figure 2.2 shows the exits that occur during the transmission of a single packet and the reception of a single ACK, when running the netperf TCP_STREAM benchmark on a single core, as described in our single core benchmark in Section 4.2.

[Figure: guest/host timeline of a single packet send and receive. The guest exits once on the write to TDT (packet send + receive emulation), three times in the interrupt top half (ICR, IMC, STATUS), and four times in the interrupt bottom half (RDT, ITR, IMS, STATUS), with single packet preparation in between.]

Figure 2.2: Baseline Eh register exits

Figure 2.2 is a good graphic representation of the actions involving accesses to control registers by Eg, as described in subsection 2.4.2. On the left side of the figure we see the exit caused by the access to TDT. During the emulation of this exit, the packet added to the TX ring is sent, and if there are packets available for reception, they are put on the RX ring. After handling the TX and RX rings, Eh raises an interrupt to signal Eg that the rings were handled. At this point the top half of the interrupt handler


static uint32_t
mac_icr_read(E1000State *s, int index)
{
    uint32_t ret = s->mac_reg[ICR];

    s->mac_reg[ICR] = 0;
    return ret;
}

Figure 2.3: Eh emulation of ICR reading. Some implementation details have been removed.

in Eg begins. As already explained in subsection 2.4.2, three registers are accessed in the top half: ICR, IMC and STATUS. In the bottom half, as described in subsection 2.4.2, we see four exits due to accesses to RDT, ITR, IMS and STATUS. This shows that in our setup, when baseline Eh is used to send a stream of TCP packets, each packet sent causes an overhead of 8 control-register-related exits.
Figure 2.3 is presented to give the reader a better sense of what it means to emulate a register access. The figure shows the code of the mac_icr_read() method that is called by Eh whenever an exit is caused by reading ICR in Eg. mac_icr_read() simply clears the value of ICR and returns the previous one. This creates the necessary illusion for the guest that reading ICR clears it atomically. Other implementation details of mac_icr_read() were removed for clarity.

All control register exits of Eh are handled in the context of the VCPU thread of QEMU. This includes the actual sending of the packets by Eh to their destination. The I/O thread of QEMU is used only for receiving packets.

2.5.1 Interrupt Coalescing

Algorithm 2.3 shows the interrupt coalescing mechanism of Eh. It is called whenever Eh decides an interrupt is needed, e.g., when a packet is received, and also when the software timer mit_timer expires. The algorithm uses the IMS, TADV and ITR registers, as well as mit_timer, which is used to defer interrupts according to the values of the ITR and TADV registers set by the guest. The guest also disables interrupts by writing to the IMC register (which clears bits in the IMS register) while the NAPI bottom half is receiving packets, which further defers interrupts.

Algorithm 2.3 Eh interrupt coalescing algorithm
1: if IMS == 0 OR mit_timer running then
2:     don't inject interrupt
3: else
4:     start mit_timer according to the values in ITR and TADV
5:     inject interrupt
6: end if
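Algorithm 2.3 maps to a few lines of C. The sketch below is a simplification under our own names (coalesce_state, maybe_inject_irq); it is not QEMU's actual implementation, and it models mit_timer as a plain flag rather than a real timer.

```c
#include <stdint.h>

/* simplified state for Algorithm 2.3; names are ours, not QEMU's */
struct coalesce_state {
    uint32_t ims;           /* interrupt mask set by the guest */
    int mit_timer_armed;    /* software timer deferring interrupts */
};

/* returns 1 iff an interrupt is injected now */
static int maybe_inject_irq(struct coalesce_state *s)
{
    if (s->ims == 0 || s->mit_timer_armed)
        return 0;           /* masked or throttled: defer the interrupt */
    /* arm mit_timer for an interval derived from ITR and TADV */
    s->mit_timer_armed = 1;
    return 1;
}

/* called when mit_timer expires; a still-pending cause re-runs the check */
static void mit_timer_expired(struct coalesce_state *s)
{
    s->mit_timer_armed = 0;
}
```

The two guards mirror the two deferral mechanisms in the text: IMS == 0 models the guest masking interrupts via IMC, and the armed timer models ITR/TADV throttling.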


2.6 The QEMU Virtio-Net Paravirtual NIC

Like Eh, Vh also uses the trap-and-emulate paradigm, and uses two descriptor ring buffers (called vring virtqueues), TX and RX, for transmitting and receiving packets respectively. Unlike with Eh, each descriptor in the rings of Vh points to a list of all the buffers of a single packet, and the maximum buffer size is 64KB.

The numerous registers of Eh are replaced by a single notification port in Vh. The act of writing to this notification port via Port I/O is called a "kick". Whenever Vg adds a packet to the TX ring, it kicks the TX virtqueue, which causes an exit to Vh, and Vh sends the packets on the TX ring.

2.6.1 Interrupt and Kick Suppression

Similarly to Eb's IMC register, Vh also supports suppression of interrupts. For this purpose the virtqueue has a variable called used_event, which holds the index of a descriptor in the virtqueue. Only when the descriptor pointed to by used_event is consumed by Vh will an interrupt be raised. Vg sets the used_event variable in its NAPI bottom half to point to the first available place on the RX ring. This way, the first received packet initiates an interrupt, but all subsequent packets that are received while the NAPI bottom half is still scheduled do not.

Vh also allows suppression of kick exits. For this purpose the virtqueue has a variable called avail_event, which holds the index of a descriptor in the virtqueue. When a new packet is added to the virtqueue, the kick writes to the notification port of the virtqueue, causing an exit to QEMU, only if avail_event points to the place in the virtqueue where the packet was added; otherwise, the kick skips the exit-inducing write to the notification port. The packet sending routine in Vh (start_xmit()) first sends all available packets on the TX ring, and then sets avail_event to point to the first available place on the TX ring. Thus, in a burst of packets sent by Vg, the kick following the first packet in the burst causes an exit, but the following packets do not, as they are added before start_xmit() sends all the packets on the TX ring. This means that effectively there is a single kick exit for every burst of sent packets.

2.6.2 TX Interrupts

Vh is designed to reduce the number of TX interrupts to an absolute minimum without hurting performance. It does so by removing the already processed packets from the TX queue in the guest driver's send function rather than in the interrupt handler. This means that Vg does not wait for the next interrupt to clear the TX ring, so the chances of the TX ring ever becoming full are greatly reduced, and there is no need for a TX interrupt to clear the TX ring. This is in contrast to Eh, where the ring is cleared in the interrupt handler, not in the send function, thus requiring a TX interrupt.


2.7 Virtio-Net TCP Send Sequences in Throughput Workloads

We present here a detailed description of how I/O is processed by virtio-net during a TCP send throughput workload. We first describe the send sequence when running our dual core benchmark from Section 4.2, which is easier to understand, and then the send sequence when running our single core benchmark from Section 4.2, which is less intuitive and builds on the explanation of the dual core send. We will reference the single core send sequence in Chapter 5 to compare the send sequence of Eh to that of Vh when relevant.

2.7.1 Virtio-Net Dual Core Send Sequence

Vh was originally designed to work like a physical NIC, where I/O is processed by the device asynchronously. This is achieved by running the I/O processing of Vh on the I/O thread of QEMU, enabling the device to run in parallel to the VCPU threads of the guest. We now present an example that illustrates some key implementation optimizations used in Vh to improve throughput. In this example we run the dual core benchmark from Section 4.2.

We wish to highlight three features of the implementation of Vh, as seen in the following description of Figure 2.4. First, packets are sent and received in batches. This batching is due to the nature of TCP, which sends packets in batches to avoid congestion and packet loss; the sent batches cause ACK batches on the receive side. Second, there is only a single kick exit per batch, which reduces the exits caused by Vh. Third, there is only a single interrupt per batch, which is enabled by NAPI and greatly reduces the number of interrupts caused by Vh. As explained in subsection 2.1.3, at any given moment the TCP protocol allows a certain number of packets to be sent before the ACKs for these packets arrive. Let us assume that at this moment we can send N packets. This translates in the code to being able to add only N packets to TX during a send burst.
Figure 2.4 shows a graphic representation of sending a stream of TCP packets when

running the above dual core benchmark. In (a), Vg starts adding packets to be sent to TX, and the first packet added causes a kick exit because avail_event points to the first empty place on TX. During the handling of the kick exit, the I/O thread is signaled that there are packets to send, so it starts sending them. (b) shows that Vg continues adding more packets to TX, in parallel to the I/O thread sending them. After adding N packets to TX, the network stack stops adding further packets and waits for ACKs for the previously sent packets. When the I/O thread finishes sending all of the packets on TX, it moves avail_event to the first free place on TX, as seen in (c). Moving avail_event effectively re-enables the kick exit, which will be necessary to wake up the I/O thread to start sending the next batch. At some point the ACKs for the


sent packets arrive. (d) shows what happens with the first ACK. The I/O thread wakes up, puts the ACK on RX, and since used_event is pointing to the first empty place in RX, injects an interrupt to the guest. The interrupt handler in Vg schedules the NAPI receive function, which passes the ACK to the TCP stack. Since this is an ACK, the number of in-flight packets decreases, enabling a new packet to be added to TX. The TCP stack immediately inserts the next packet to be sent (t1) into TX. Adding t1 causes a kick exit since avail_event points to the first free place in TX. Since the I/O thread is currently receiving packets, t1 will not be sent until there are no more packets to receive. (e) shows that the I/O thread continues adding packets r1..rN to RX without causing an interrupt, since used_event continues pointing at the place where r1 was added. The NAPI bottom half on the VCPU thread continues reading the packets from RX and adding new packets to TX. (f) shows that once NAPI finishes clearing the RX ring, it moves used_event to point to the next free position in RX. Meanwhile the I/O thread, which no longer has packets to receive, goes on to send all the packets on TX.

2.7.2 Virtio-Net Single Core TCP Throughput Send Sequence

In subsection 2.7.1 we attributed the high throughput Vh achieves when running our dual core benchmark to three features: packet batching in both send and receive, reduced kick exits, and reduced interrupts via NAPI. We now wish to explore the send sequence of Vh when using our single core benchmark, and show how it is still able to achieve these three features. Figure 2.5 illustrates the send sequence for a single core. As in the dual core send sequence, the received ACK packets initiate the sending of the next batch of packets. Therefore, we begin from the point in time right before the first ACK in a batch is received. At this point both TX and RX are empty and the VCPU thread is halted. From the description of (e) in the figure, it will be easy to see why this is the case, but for now we take these initial conditions as a given.

In Figure 2.5 (a), packet r1 has been received, and Vh, in the context of the I/O thread, puts r1 at the head of RX. Since used_event is pointing to the current location in RX, Vh injects an interrupt to the guest without releasing the qemu_global_mutex. The interrupt wakes up the halted VCPU thread, which now has a higher priority than the I/O thread, and therefore context is switched to the VCPU thread. (b) In response to the interrupt, the guest runs the NAPI bottom half, which reads r1 and pushes it up the TCP stack. The TCP stack sees that it is an ACK, which lowers the number of in-flight packets and allows the next packet, t1, to be sent. Packet t1 is added to the TX tail as part of the processing of r1. Since avail_event is pointing to the TX tail, the kick that comes after adding t1 causes an exit to QEMU. The VCPU thread blocks upon trying to acquire the qemu_global_mutex, since it is being held by the I/O thread, and does not inform the I/O thread that there are new packets on TX. Context is switched back to the I/O thread in (c), where Vh continues adding packets r2,..,rN to


[Figure: six panels (a)-(f) showing the TX and RX virtqueues shared between the guest and QEMU, the positions of used_event and avail_event at each stage, and the kick, send, receive, and interrupt events of each stage.]

Figure 2.4: Vh dual core setup, single batch send sequence


RX, but since used_event is still pointing to the location of r1 in RX, interrupts are not injected to the guest during this process. (d) After placing packets r2,..,rN on RX, the I/O thread considers its work done, since it is still unaware of t1 being on TX. It thus releases the qemu_global_mutex, which wakes up the VCPU thread. Context is switched to the VCPU thread, which acquires the qemu_global_mutex, sends a signal to the I/O thread telling it to wake up and handle the packets on TX, releases the qemu_global_mutex, and enters the guest at the point of the last exit, right after the kick from (b). We continue in the guest, in the context of the VCPU, where NAPI resumes reading packets from RX and pushing them up the TCP stack. As with r1, each ACK packet received results in the TCP stack taking a packet from the socket buffer and placing it on TX. However, this time the kicks do not cause exits, since avail_event is still pointing to the position where t1 was added to TX. After consuming r1..rN, the NAPI bottom half has no more packets to read from RX. Therefore it moves used_event to the first free place on RX, effectively re-enabling interrupts. (e) Since packets were removed from the socket buffer during the receiving of r1,..,rN, netperf is now awakened from blocking, and continues calling sendto() in a loop until it blocks again when there is no more space in the socket buffer. At this point, since netperf is the only process in the guest, the guest halts, which causes a HALT exit. Control is switched to QEMU. Since the VCPU is halted, QEMU runs the I/O thread, which sends t1,..,tN. After sending t1,..,tN, avail_event is moved to the next free space on TX, effectively re-enabling the kick exit. (f) is the final stage, at which we start the next batch all over again. At this stage the VCPU is halted and TX and RX are empty, as in the initial state right before (a).

Even on a single core, Vh achieves the three benefits described in subsection 2.7.1. However, we now wish to emphasize two steps in the above sequence that are crucial to the batching of sent packets, and without which the other two benefits could not be achieved: (1) In stage (a) of Figure 2.5 the qemu_global_mutex is not released prior to injecting an interrupt to the guest. If it were, packets would not be sent in batches, because the mutex would be unlocked when the kick exit occurred, enabling the VCPU to lock the mutex and signal the I/O thread to wake up and handle the packet on TX. Context would immediately switch from the VCPU to the I/O thread to send this packet, and the kick exit would immediately be enabled again after sending it. (2) In stage (d) of Figure 2.5, when the kick exit is finally handled on the VCPU, context is not switched to the I/O thread for sending t1, due to the scheduling priorities of the VCPU and I/O threads. The I/O thread has lower priority since it was busy adding packets to RX while the VCPU thread was blocked on the mutex. Of course it is still possible to increase the priority of the I/O thread (or decrease that of the VCPU thread), which would cause control to switch to the I/O thread and again cancel batching.
Figure 2.6 summarizes the single core send sequence exits. It illustrates the much lower number of exits Vh causes when compared to Eh. As we saw in Figure 2.2, Eh


[Figure: six panels (a)-(f) showing the TX and RX virtqueues shared between the guest and QEMU, the positions of used_event and avail_event at each stage, and the kick, send, receive, and interrupt events of each stage.]

Figure 2.5: Vh single core setup, single batch send sequence


causes 8 exits per packet, and here we see that Vh causes 2 exits per batch of packets. When running the single core benchmark with 64K message sizes, Vh achieves batches of 48 packets each. Therefore, we get 2/48≈0.04 exits per packet in Vh, meaning that Eh causes 192x more exits per packet than Vh in this case.
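The exit arithmetic above can be checked in a couple of lines; the 8 exits per packet for Eh and the measured batch size of 48 are the values quoted in the text.

```c
/* exits per packet: Eh pays 8 exits for every packet, while Vh pays
 * 2 exits (one kick, one halt) per batch of packets */
static double exits_per_packet(double exits, double packets)
{
    return exits / packets;
}
```

With a batch of 48 packets, Vh pays 2/48 ≈ 0.042 exits per packet, and the ratio 8 / (2/48) comes out to 192.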

[Figure: guest/host timeline for a single batch: many packets are prepared for sending at once in the guest (batching), a single kick exit and a single halt exit cross to the host, where many packets are sent and received at once (batching).]

Figure 2.6: Baseline Vh exits



Chapter 3

Motivation

In this chapter we present the main motivation for our research. We start with a comparison between emulated and paravirtual I/O devices. Then we show that paravirtual devices achieve substantially higher throughput than emulated ones, both by quoting throughput measurements from previous works and by performing our own measurements using NICs in different hypervisors. We use NICs because they are the most I/O-intensive devices, and we wanted to observe the most extreme effects possible. We then examine the conception that the large number of virtualization exits in emulated devices is the reason for the large throughput gap between the two device types. We develop a model to calculate the theoretical maximum throughput that can be achieved with Eh, assuming the only difference between Eh and Vh is the number of exits the two cause. Our model indicates that, contrary to the common conception, exits alone should be a small factor in the throughput difference between Vh and Eh. This serves as our motivation to find the factors, besides exits, that hurt the throughput of Eh, to reduce these factors as much as possible, and to see the real throughput difference between Vh and Eh.

3.1 Interposition

Traditional I/O virtualization is implemented using the trap-and-emulate [Gol74, PG74] paradigm, in which the hypervisor exposes a virtual device to the guest, traps all requests of the guest to the virtual device, and emulates these requests using its physical I/O devices. When using the trap-and-emulate method, the virtual I/O of the guest is decoupled from the physical I/O performed by the host, allowing the host to interpose on the I/O activity of the guest. There are many benefits to interposition [RW11]. It allows the encapsulation of the state of a guest machine. The machine can be stopped, its state representation written to a file, and this file can be copied to a different host, where it can be later resumed without the guest ever knowing it was stopped. Live migration [CFH+05, NLH05] even allows a guest to be moved from one host to another without stopping its execution.


Consolidation of physical I/O devices is possible by multiplexing multiple guest I/O devices over a single physical I/O device in the host. Multiple physical I/O devices can also be aggregated to service a single virtual I/O device. This improves the performance of the device and allows masking of failures: when one of the physical devices malfunctions, others can take over. I/O interposition can also add capabilities that are not present in physical devices, such as replication of disk writes to multiple disks to allow recovery from disk failures, I/O compression, I/O encryption, and I/O metering and filtering, among others. Two types of virtual I/O devices implement interposition: emulated and paravirtual, which we present in the following sections.

3.2 Emulated I/O Devices

An emulated I/O device is a virtual I/O device implemented using the trap-and-emulate paradigm, imitating a known physical device. The hypervisor exposes to the guest the same interface as the device it emulates, and therefore the guest interacts with the emulated device exactly as it would interact with the physical device being emulated.

3.3 Paravirtual I/O Devices

A paravirtual I/O device, like its emulated counterpart, is a virtual I/O device implemented using the trap-and-emulate paradigm. However, a paravirtual device does not imitate any physical device. Instead, paravirtual I/O devices implement a new kind of device that is designed specifically for virtual environments and reduces the number of exits necessary to drive it.

3.4 Emulated vs Paravirtual Devices

Emulated and paravirtual devices differ in two aspects: guest modification and performance.

3.4.1 Guest Modification

When using an emulated device, no guest modification is necessary, since the guest uses the same device driver it would have used if it were connected directly to the physical device being emulated. In the case of a paravirtual device, however, it is necessary to install the compatible device driver on the guest machine. This has implications for both the user and the virtual device developers. The user needs to install a new driver in the guest each time it is moved to a cloud provider that uses a different hypervisor. The current reality is such that different hypervisor vendors use different paravirtual devices, e.g., VMware's vmxnet3 [Vmw09], KVM/QEMU's virtio-net [Rus08]


and Xen's PV [BDF+03]. Furthermore, installation of paravirtual device drivers often causes problems for the user. Figure 3.1 shows that searching Google for "problems with vmware tools" (a suite of paravirtual drivers for VMware's hypervisors) yields 2.5 million results, illustrating that users do indeed experience difficulties with the installation of paravirtual drivers. As for the paravirtual device developers, they need to develop and maintain a device driver for many different OSes. This is not the case with emulated devices, since the drivers written for the physical device by its vendors also work with the emulated device.

Figure 3.1: Google search results illustrating the problems with vmware tools

3.4.2 Performance

Paravirtual I/O devices are known to achieve higher performance than emulated ones. Table 3.1 summarizes the throughput differences between emulated and paravirtual network devices reported in the literature, with their respective citations. The table shows that throughput can be 5.5x-40x higher for paravirtual devices. This difference in performance is attributed in the literature [MAC+08, PS12, KMN+16] to the large number of exits emulated devices cause, while paravirtual devices are designed specifically to minimize exits. The superior performance of paravirtual devices makes them the popular choice in most of today's real-world virtualization applications [KMN+16].


Source                      Emu. Tput   Para. Tput   Para./Emu.
Molnar [Mol07]                   7.41       303.35         40x
Bugnion et al. [BNT17]            239         5230         22x
Rizzo et al. [RLM13]              250         3250         13x
Koh et al. [KPS+09]               100          550         5.5x
Vrijders et al. [VMS+16]          300       10,000         33x
Eiraku et al. [ESP+09]            200         1800         9x

Table 3.1: Emulated vs paravirtual network device throughput in Mbps in the literature

3.5 Emulated vs Paravirtual NICs in Different Hypervisors

To further examine the throughput difference between emulated and paravirtual I/O devices, we performed our own measurements using different hypervisors. We ran the netperf TCP_STREAM test from guest to host, as described in our single core benchmark in Section 4.2. Figure 3.2 compares the throughput of Eb emulation to that of the paravirtual NIC for three different hypervisors: QEMU/KVM, VirtualBox and VMware Workstation, when using our baseline single core, guest-to-host setup. The paravirtual NIC in the first two hypervisors is Vh, while in VMware Workstation it is vmxnet3, since a Vh implementation is unavailable. As predicted by the literature, the results for QEMU/KVM and VirtualBox show that the paravirtual Vh significantly outperforms the emulated Eh. We were unable to explain why Eh outperforms vmxnet3 in VMware Workstation, but we speculate that this is due to some problem with the installation of the paravirtual vmxnet3 drivers.

3.6 Emulation vs Paravirtualization Comparison Model

The results from the literature presented in Table 3.1 and our own measurements in Figure 3.2 show that paravirtual network devices indeed achieve throughput much higher than that of emulated network devices. However, the explanation that exits are the reason for this large throughput gap seemed unrealistic to us in high throughput scenarios, where we expect most of the I/O processing to be done in batches, with relatively few control register related exits. This discrepancy between the throughput achieved and our intuition regarding the overhead of exits led us to ask ourselves, “what is the real difference between emulation and paravirtualization?”

To answer this question, we devised a model for comparison between emulation and paravirtualization that starts with the assumption that the difference is indeed the much larger number of exits in emulation, as stated in the literature. Our model describes the throughput achieved by NICs as a special case of all I/O devices. In our model we assume that sending packets requires work that should be done regardless of and unrelated to virtualization. This work should require



Figure 3.2: Throughput comparison of Eb emulation vs a paravirtual NIC in different hypervisors. In QEMU/KVM and VirtualBox the paravirtual device is virtio-net, and in VMware Workstation it is vmxnet3

more or less the same effort, in CPU cycles, in both emulated and paravirtual devices. Otherwise, if our goal is to answer our question, the comparison is invalid. Qualitatively speaking, if said work takes W cycles per packet, the overhead of a single exit in cycles is E, there are N exits per packet sent, and the frequency of the CPU is CPUfreq, then the throughput of a virtual NIC can be described by equation 3.1.

throughput = CPUfreq / (cycles per packet) ∗ bytes per packet = CPUfreq / (W + N ∗ E) ∗ bytes per packet    (3.1)

Since CPUfreq and bytes per packet are held constant for both e1000 and virtio-net, we can restrict the comparison to cycles per packet.

Equations 3.2 and 3.3 below describe the above cycles per packet for virtio-net and e1000, denoted by Cv and Ce respectively. We mark the N and E variables with v and e subscripts for virtio-net and e1000 respectively. W is denoted without a subscript since, under our assumption, it is the same for both NICs.

Cv = W + Nv ∗ Ev (3.2)


Ce = W + Ne ∗ Ee    (3.3)

We subtract equation 3.2 from 3.3 to get equation 3.4:

Ce = Cv − Nv ∗ Ev + Ne ∗ Ee (3.4)

This equation is very straightforward. To calculate the time it takes to send a packet using e1000, we subtract the exit overhead of virtio-net from the time it takes to send a packet in virtio-net and add the exit overhead of e1000. We now express the variables on the right-hand side of equation 3.4 using the knowledge we acquired about the exits in e1000 and virtio-net by exploring their implementations in QEMU, as described in Sections 2.4, 2.5 and 2.6.

Nv can be expressed using equation 3.5. In this equation, Bv is the number of packets (b for batch size) sent by virtio-net for a single kick exit.

Nv = 1 / Bv    (3.5)

Ne can be expressed using equation 3.6. In this equation, the 1 is the exit due to an access to the TDT register, which occurs for every packet added to the TX ring. The second part of the right-hand side describes the number of exits in the handling of an interrupt. There are 7 exits on average per interrupt in e1000, as described in section 2.5. We multiply the total number of exits per interrupt by Iv, the number of interrupts per second in virtio-net, to get the total number of interrupt related exits per second. We allow ourselves to use Iv, since the number of interrupts per second is part of W, which under our assumption is the same for both NICs. Then we divide by Pe, the number of packets per second sent by e1000, to get the per-packet number of exits that occur during interrupt handling.

Ne = 1 + (Iv ∗ 7) / Pe    (3.6)

Pe can be expressed using equation 3.7. We divide the CPU frequency by the number of cycles it takes to send a single packet by e1000 to get how many packets are sent every second.

Pe = CPUfreq / Ce    (3.7)

Finally, after combining equations 3.4, 3.5, 3.6 and 3.7, we get equation 3.8, which is our final model of the number of cycles it takes for a single packet to be sent by e1000.
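The combination step can be verified numerically: the closed form for Ce that results (equation 3.8 below) must satisfy equations 3.4-3.7 for any parameter values. A small Python sketch of this check (the function names are ours, the symbols mirror the thesis notation):

```python
def cycles_per_packet_e1000(Cv, Ev, Ee, Bv, Iv, cpu_freq):
    """Closed-form Ce of equation 3.8."""
    return (Cv - Ev / Bv + Ee) / (1 - 7 * Iv * Ee / cpu_freq)

def check_model(Cv, Ev, Ee, Bv, Iv, cpu_freq):
    """Verify that the Ce above satisfies equations 3.4-3.7."""
    Ce = cycles_per_packet_e1000(Cv, Ev, Ee, Bv, Iv, cpu_freq)
    Pe = cpu_freq / Ce        # eq 3.7: packets/second sent by e1000
    Ne = 1 + 7 * Iv / Pe      # eq 3.6: exits per packet in e1000
    Nv = 1 / Bv               # eq 3.5: exits per packet in virtio-net
    # eq 3.4: Ce = Cv - Nv*Ev + Ne*Ee must hold up to rounding error
    return abs(Ce - (Cv - Nv * Ev + Ne * Ee)) < 1e-6 * Ce
```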

Ce = (Cv − Ev / Bv + Ee) / (1 − (Iv ∗ 7 ∗ Ee) / CPUfreq)    (3.8)

To get the values for the variables in this equation, we run our single core throughput


benchmark described in section 4.2 and measure the necessary parameters. We set the message size in netperf to 65160 bytes to make the packet size constant. To explain why this message size leads to a constant packet size, we must explain how the Linux network stack handles large packets. The maximum IP packet size is 64KB, but for packets larger than the MTU, the Linux network stack creates only packets whose payload divides evenly into MTU-sized segments, to enable easier segmentation for transmission on physical media. An MTU-sized packet in our setup can hold 1448 bytes of data, which, combined with the 52 header bytes of TCP/IP, adds up to the 1500 bytes of an MTU. 65160 bytes is the largest message size that fits in an IP packet and is divisible by 1448. When using this message size, the network stack in the guest creates exactly one packet from every message added, so we effectively force the packet size created by the network stack to be constant for both virtio-net and e1000.

Whereas most measurements necessary for the evaluation of equation 3.8 are collected in a straightforward manner, e.g., by looking at packet and interrupt counts provided by standard tools like ifconfig and procfs, measuring the exit overhead in cycles is more complicated. We define the overhead of an exit to be the time it takes for control to pass from the exit-causing instruction in the guest, through KVM, to QEMU, until the exit handling function in QEMU is called, plus the time from the moment the exit handling function in QEMU finishes execution until control returns to the guest and it resumes execution. To measure the overhead of an exit we use the RDTSC cycle counter instruction before and after the access to the relevant register in e1000's guest driver, and before and after the access to the kick port in virtio-net's guest driver.

The time spent between these two points includes both the overhead of the exit and the time spent in QEMU to handle it. Since we are looking for the exit overhead alone, we need to subtract the time spent handling the exit in QEMU. For this purpose we placed two more cycle counters before and after the functions in QEMU that actually perform the necessary emulation, which in e1000 are e1000_mmio_read() and e1000_mmio_write(), and in virtio-net is virtio_net_handle_tx_bh(). Using these cycle counters we measured the total overhead of exiting from the guest to QEMU (via KVM) and returning back from QEMU to the guest, excluding the actual handling of the exit in QEMU.

Table 3.2 shows the results of our measurements. Inserting these values into equation 3.8, we get that, according to our model, sending a packet with e1000 should take 152,564 cycles, which means that the achievable throughput of e1000 should be 8,200 Mbps. In reality, e1000 achieves 535 Mbps, which is 15x slower than optimally expected. This result strengthened our suspicion that exits are not the only cause of the throughput gap between emulated and paravirtual I/O devices. This was our motivation to research the differences between e1000 and virtio-net to find what really causes the throughput difference between them, as an indication of the real difference between emulation and paravirtualization.
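The bracketing logic of the exit-overhead measurement above can be illustrated in a few lines. The sketch below is an illustrative Python analogue only; the thesis instruments the guest driver and QEMU in C with the RDTSC instruction, not with Python timers:

```python
import time

def bracket(fn):
    """Run fn and return (elapsed_ns, result) -- the outer counter pair,
    analogous to the RDTSC reads around the register access in the guest."""
    start = time.perf_counter_ns()
    result = fn()
    return time.perf_counter_ns() - start, result

def exit_overhead(total_cycles, handler_cycles):
    """Exit overhead = guest-side round trip minus the time spent inside
    QEMU's emulation handler (the inner counter pair)."""
    return total_cycles - handler_cycles
```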


Cv        135,882 cycles
Ev        28,761 cycles
Bv        46.13 packets/kick-exit
Iv        382.7 interrupts/second
Ee        14,788 cycles
CPUfreq   2.4 GHz

Table 3.2: Measurements for calculating e1000 cycles per packet
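Plugging the measured values into equation 3.8 reproduces the figures quoted above; a short Python check (assuming 65160 payload bytes per packet, the netperf message size chosen earlier):

```python
# Measured parameters from Table 3.2.
Cv, Ev, Bv = 135_882, 28_761, 46.13
Iv, Ee, cpu_freq = 382.7, 14_788, 2.4e9

# Equation 3.8: modeled cycles per packet for e1000.
Ce = (Cv - Ev / Bv + Ee) / (1 - 7 * Iv * Ee / cpu_freq)

# Modeled throughput, assuming 65160 payload bytes per packet
# (45 MSS-sized chunks of 1448 bytes, as chosen in the benchmark).
payload = 45 * 1448                    # = 65160 bytes
mbps = cpu_freq / Ce * payload * 8 / 1e6
```

This yields roughly 152,564 cycles per packet and about 8,200 Mbps of modeled throughput, against the 535 Mbps measured in reality.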


Chapter 4

Experimental Setup

4.1 Hardware Setup

Our hypervisor is QEMU 2.2.0. Host and guest machines in our experiments run Ubuntu 14.04 with Linux kernel 3.13. As our representative virtual NICs we chose the emulated Intel PRO/1000 PCI/PCI-X NIC, which is widely emulated by popular hypervisors, and virtio-net, which is the paravirtual NIC in QEMU/KVM. We use the user-space implemented virtio-net device, and not the more performant kernel-space implemented vhost-net [Tsi09] device, since the current emulated Intel PRO/1000 PCI/PCI-X NIC is implemented in the user space and we want to have a fair comparison. The host is a Dell PowerEdge R610 with a single CPU socket, running a 4 core Intel Xeon E5620 processor, and has 16GB of RAM.

4.2 Benchmarks

Throughout this thesis, throughput results for Eh and Vh are achieved by using two types of benchmarks:

4.2.1 Single Core Throughput Benchmark

In this benchmark, the guest machine connects to the host machine via a TAP backend, which is the networking backend recommended by the QEMU documentation [QEM17]. We run the netperf TCP STREAM micro-benchmark, which continuously sends data from the guest to the netserver process in the host using the sendto system call. All threads of QEMU are pinned to a single core, while the netserver process is pinned to a different core on the same socket. We run the netperf test 10 times, for 30 seconds each, and report the average results. There is no actual NIC involved in our setup; the networking is done between the guest and the host via the host kernel. We chose this setup because it is the simplest in which to analyze and find inherent differences between emulated and paravirtual devices.
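A minimal harness for the repeated runs might look as follows. The netperf command line shown in the docstring is an assumption for this setup (-H remote host, -l test length in seconds, -P 0 to suppress banners, -m message size), and the default command is a stub so that the sketch runs anywhere:

```python
import shlex
import statistics
import subprocess

def mean_throughput(cmd="echo 5.0", runs=10):
    """Run a benchmark command `runs` times and average the last numeric
    field of its output. With netperf installed, cmd could be e.g.:
        netperf -H <host> -t TCP_STREAM -l 30 -P 0 -- -m 65160
    whose single result line ends with the throughput in 10^6 bits/sec."""
    results = []
    for _ in range(runs):
        out = subprocess.run(shlex.split(cmd), capture_output=True,
                             text=True, check=True).stdout
        results.append(float(out.split()[-1]))
    return statistics.mean(results)
```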


4.2.2 Dual Core Throughput Benchmark

This benchmark is similar to the single core benchmark, except that instead of pinning the QEMU threads to a single core, we pin the VCPU thread to one core and the I/O thread to another. Using this test we explore some of the synchronization and parallelism differences between Eh and Vh.
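The dual core pinning can be scripted with Python's sched_setaffinity. The sketch below is hedged: the "CPU" thread-name prefix is an assumption about how this QEMU version names its VCPU threads, and should be verified against /proc/&lt;pid&gt;/task/*/comm on the actual system:

```python
import os

def core_for_thread(comm, vcpu_core=0, io_core=1):
    """Map a QEMU thread name to a core: VCPU threads to one core,
    everything else (including the I/O thread) to another.
    The 'CPU' prefix is an assumption about QEMU's thread naming."""
    return vcpu_core if comm.startswith("CPU") else io_core

def pin_qemu_threads(qemu_pid):
    """Pin each thread of a running QEMU process to its designated core
    (Linux only; requires appropriate permissions)."""
    for tid in os.listdir(f"/proc/{qemu_pid}/task"):
        with open(f"/proc/{qemu_pid}/task/{tid}/comm") as f:
            comm = f.read().strip()
        os.sched_setaffinity(int(tid), {core_for_thread(comm)})
```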


Chapter 5

Single Core Configuration

In this chapter we present our efforts to find the real difference between Eh and Vh as representatives of emulation and paravirtualization on a single core setup.

Section 5.1 shows an initial comparison of baseline Eh vs baseline Vh. Then sections 5.2 - 5.8 present a series of differences we found between Vh and Eh, unrelated to virtualization, that hurt the throughput of Eh. For each such difference we describe an improvement to Eh that attempts to reduce the throughput gap as much as possible. Then, for each such improvement, we include an evaluation of the impact of this improvement on the throughput of Eh.

Each evaluation of an improvement shows a figure with 8 graphs, comparing 2 setups of Eh. The red line represents Eh with all previously presented improvements, not including the improvement currently discussed, and the green line represents Eh with all previous improvements including the current one. This way we see both the benefit of adding the current improvement and the gradual growth of the throughput with each added improvement. We present the improvements in the order they were discovered during our research.

In each evaluation figure we present 4 metrics:
1. Throughput - the throughput as reported by the netperf benchmark;
2. Interrupts - interrupts per second injected by the virtual NIC into the guest;
3. Packet size - the size of the IP packet created by the network stack in the guest;
4. Packets/batch - the average number of packets sent by the virtual NIC in a single call to its send function.
Each of the above metrics is represented by 2 graphs, regular and normalized. The regular graph shows the absolute values of the metric, and the normalized graph shows the values normalized to the version of Eh without the current improvement (the red line), i.e., the impact of the improvement relative to the previous setup.

Table 5.1 summarizes the list of improvements we made to Eh in the order presented in this chapter for easy reference, as we will be using the improvement numbers in the evaluation figures.

In section 5.9 we present a bug we found in the Linux kernel's TCP implementation


#   improvement name
1   removal of TCP checksum calculation
2   removal of TCP segmentation
3   improvement of interrupt coalescing
4   sending from the I/O thread
5   exposing PCI-X mode to avoid bounce buffers
6   dropping packets to improve TSO in guest
7   using vectorized sending

Table 5.1: List of improvements made to Eh to improve its throughput

and our proposed fix to this bug. Because this bug greatly affects the throughput of Eh, all the results in this thesis, except for those specifically stating baseline Eh or baseline Vh, were obtained after this bug was fixed in the guest machine.

Finally, in section 5.10 we compare Eh, with all of our improvements applied, to Vh, and discuss the results.

5.1 Baseline Comparison

Figure 5.1 compares the baseline throughput of Eh vs. Vh as they are implemented in our version of QEMU. This figure will serve as our initial point of comparison. In this figure we can see not only the large throughput gap, already presented in Figure 3.2, but also a few hints as to the reasons for it:
1. Except for message sizes of 512-4K, Vh injects significantly fewer interrupts than Eh;
2. The packets created by the network stack in the guest are much larger for Vh than for Eh;
3. Vh sends many packets in a single send sequence in QEMU, while Eh sends only a single packet per send sequence.
We will be addressing these issues in the next sections.

5.2 Removal of TCP Checksum Calculation

Eb devices [Int09] support TCO. Eh emulates TCO by calculating the TCP checksum in software before sending the packet to its destination. To show why this is an unnecessary overhead for most cases, let us consider the two possible destination types for a packet. If the packet destination is within the current host, e.g., it is sent to another guest on the same host or to the host itself, there is no need to calculate the TCP checksum since the packet never travels through a lossy medium. If, on the other hand, the destination is outside of the host, the packet will travel through one of the bare metal NICs installed in the host machine. If the bare metal NIC supports TCO (as is usually the case nowadays), it will be able to calculate the TCP checksum in hardware, much more efficiently than can be done by QEMU.
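The work being removed is the standard 16-bit ones'-complement Internet checksum. A self-contained Python version (illustrative only, not QEMU's actual code) shows that it touches every byte of the payload, which is why its cost grows with packet size:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum over 16-bit words."""
    if len(data) % 2:                  # pad odd-length input with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF
```

Appending the computed checksum to the data makes the checksum of the whole buffer zero, which is how a receiver verifies it.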

Vh does not calculate the TCP checksum when not necessary, and uses virtio-net headers to avoid the dropping of packets due to incorrect checksums. These headers are

38

Figure 5.1: Throughput comparison between baseline Vh and baseline Eh

flags that are added to the beginning of each packet and tell the host network stack whether it needs to verify the TCP checksum before passing the packet on. They can be implemented by any virtual network device, as already done for the e1000e and vmxnet3 devices in QEMU. However, we did not implement virtio-net headers in e1000; instead we added a single-line patch to the tap device in the host kernel telling it to always ignore TCP checksums. We did this as a proof of concept that removing the checksum calculation is indeed beneficial, but a full virtio-net headers implementation is necessary as a complete solution.

Figure 5.2 shows the improvement in throughput gained by removing the unnecessary TCP checksum calculation from Eh when compared to the baseline Eh. As expected, the larger the packet size, the more data is checksummed, and the more time is saved per packet when removing the checksum calculation. This reduced packet processing time translates to higher throughput but also to a higher interrupt rate with larger packets: an interrupt is raised for every sent packet, and since sending a packet takes less time, this happens more frequently.

5.3 Removal of TCP Segmentation

Eb devices support TSO. Eh emulates TSO by segmenting large packets into MTU size ones in software. This is unnecessary in most cases. If the destination of the packet is inside the host, the Linux kernel networking infrastructure (tap, bridge) can handle

39

Figure 5.2: Change in throughput caused by removing the calculation of the TCP checksum in Eh

packets larger than MTU, eliminating the need for segmentation. And if the destination is outside the host, the packet will travel through one of the bare metal NICs installed in the host machine, which probably supports TSO and will be able to segment the packets much more efficiently than QEMU.
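Software TSO emulation amounts to the following splitting step, sketched here in Python for illustration (the real code must also replicate headers and fix up sequence numbers, which we omit):

```python
MSS = 1448  # payload bytes per MTU-sized segment in our setup

def segment(payload: bytes, mss: int = MSS):
    """Split a large TCP payload into MSS-sized chunks, as software TSO
    emulation must do for every oversized packet."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]
```

For the 65160-byte messages used in our benchmarks, this produces exactly 45 full-sized segments per packet.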

Vh does not segment the packets into MTU-size ones when this can be done by the NIC. We did the same for Eh by removing the code that segments the large packets.

Figure 5.3 shows the improvement in throughput gained by removing the unnecessary TCP segmentation code from Eh when compared to the throughput measured after the previous improvement. Just as with the previous improvement, the larger the message size, the larger the packets created by the guest network stack, leading to more segmentation and thus to more time saved per packet when segmentation is eliminated. This reduced packet processing time translates to higher throughput but also to a higher interrupt rate with larger packets: an interrupt is raised for every sent packet, and since sending a packet takes less time, this happens more frequently.

5.4 Improved Interrupt Coalescing

Interrupt coalescing in Eh is implemented using a timer. When this timer expires, an interrupt is injected into the guest. The interval of the timer is set according to the interrupt coalescing registers of Eb. We observed that the interrupt rate used in the


Figure 5.3: Change in throughput caused by removing the TCP segmentation code in Eh

baseline implementation of Eh was too high, greatly decreasing throughput. We first describe some problems we found with the current implementation of the interrupt coalescing registers in Eh (subsection 5.4.1), and then present improved interrupt coalescing in Eh for throughput scenarios (subsections 5.4.2 and 5.4.3), followed by an evaluation of the improved interrupt coalescing. We leave the comparison of interrupt coalescing in Eh to that in Vh to subsection 5.5.1.

5.4.1 ITR and TADV Conflict

There is an inherent conflict between the interrupt coalescing registers TADV and ITR in the official specifications of Eb and the way they are used in Ed. According to the specifications, TADV "can be used to ENSURE that a transmit interrupt occurs at some predefined interval after a transmit is completed" and ITR is defined as the "minimum inter-interrupt interval". While the TADV register is set by Ed to 32 once during the initialization of the NIC (which translates to approximately 32 microseconds), ITR is constantly updated by Ed according to the type of load (throughput or latency) to values between 195 and 976 (which translate to 49 and 250 microseconds respectively). So in a throughput scenario, Ed asks Eb to set the inter-interrupt interval to be both smaller than 32 microseconds and larger than 243 microseconds, which is a contradiction.
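The microsecond figures above follow from the register units in the Intel datasheet: ITR counts in 256 ns increments and TADV in 1.024 µs increments. A quick conversion check:

```python
ITR_UNIT_NS = 256     # ITR resolution per the 8254x datasheet
TADV_UNIT_NS = 1024   # TADV resolution: 1.024 us per unit

def itr_to_us(value):
    """Convert an ITR register value to microseconds."""
    return value * ITR_UNIT_NS / 1000

def tadv_to_us(value):
    """Convert a TADV register value to microseconds."""
    return value * TADV_UNIT_NS / 1000
```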

The implementors of Eh chose to use the highest interrupt rate set by either TADV or


ITR. This effectively disables ITR, as it always instructs lower interrupt rates than TADV, and sets an interrupt rate of one interrupt every 32 microseconds for all workloads. We speculate that this was a cautionary decision to ensure correctness. However, the high interrupt rate hurts throughput workloads. We decided to change this decision and ignore the high interrupt rate set by TADV for throughput scenarios. Instead we chose to use ITR as our guide for interrupt rates in throughput scenarios. We support our decision not only by the obvious throughput benefits we gain but also by implicit signs in the driver implementation showing that ITR is the correct register to use for interrupt coalescing. The implementation of Eg changes the value of ITR on every interrupt raised, to adhere to the current workload type as perceived by Eg. This manipulation of ITR is a clear indication that the driver developer intended to manage the interrupt rate using ITR. And since the driver developer is Intel, which is also the manufacturer of Eb, it is reasonable to assume ITR is the correct register for interrupt coalescing manipulation.

For any reasonable implementation of a device driver for Eb, our decision should not affect correctness, and indeed we did not observe any errors since making this change. The change might harm performance only if the interrupt rate is too low, which does not happen, at least in the scenarios we tested, as we show in the following subsections.

5.4.2 Static Set Interrupt Rate

We wanted to get an indication of an interrupt rate that produces good throughput, ignoring the values in ITR and TADV. We used the netperf benchmark in our guest-to-host setup. We started from the initial inter-interrupt interval of 32µs set in TADV and gradually increased it in increments of 32µs, until the interval was too large and we started seeing a drop in throughput. We got the best throughput when the timer interval was set to the equivalent of TADV=320 or ITR=1280, which is 10 times longer than the interval set in baseline Eh. We will call this inter-interrupt interval TI, short for Throughput Interval. In this static setup, we use TI as our inter-interrupt interval for all scenarios, as it produces greatly improved throughput for all message sizes and is also always greater than ITR, which conforms to our choice to use ITR as our minimum inter-interrupt interval indicator.

We note 2 things about the above choice of an interrupt rate: (1) TI does not necessarily achieve the best throughput for all scenarios and setups. Our objective was to show that the current interrupt rate is not configured well for throughput, and to select an interrupt rate setting that is clearly beneficial. (2) The search for the best interrupt rate was done with all of our improvements to Eh described in this chapter, including those to be presented in the next sections, since each added improvement changes the optimal interrupt rate.

While setting a single interrupt rate for all scenarios improves throughput across all message sizes, it also increases latency. For example, running netperf TCP RR using our


single core setup yields a result of around 357µs per request-response when using TI for interrupt coalescing, as opposed to the 194µs achieved by baseline Eh, when the value in TADV is used as the inter-interrupt interval - an increase of 80% in latency. To address this issue, a heuristic is needed to decide whether the current workload is a latency workload or a throughput workload, and to change the interrupt coalescing timer accordingly. In the next subsection we present our efforts in this direction.

5.4.3 Interrupt Rate Considering ITR

Since our static increase of the inter-interrupt interval increases latency, we needed to somehow decide when Eh is handling a latency workload, and in these cases use the interrupt rate of the baseline Eh. This doesn't mean that the interrupt rate of baseline Eh is optimal for latency, but our work focuses on throughput, and our only concern at this stage is not to negatively affect the latency achieved without our changes.

Eg already looks at the current workload and decides whether it is latency or throughput. When the workload is latency according to Eg, it sets ITR to 195, and when the workload is throughput, Eg sets ITR to numbers higher than 195, up to a maximum of 976. In this heuristic we use the TADV value for the inter-interrupt interval, as in the original implementation of Eh, when ITR=195 (latency), and TI when ITR>195 (throughput scenarios). This way latency workloads are not negatively affected by the decreased interrupt rate.

Figure 5.4 shows the throughput difference achieved when adding both our static interrupt coalescing setting and the ITR-sensitive interrupt coalescing heuristic to all previous improvements. The static setting achieves better throughput across all message sizes, while the ITR-sensitive heuristic makes an incorrect decision that the workload is a latency workload for message sizes smaller than 4K.

There are 2 problems with the ITR-sensitive heuristic: (1) The ITR values we presented are device-driver implementation dependent, so this solution will work only for the current device driver in Linux guests. (2) Eg's decision as to whether the current workload is a latency or throughput workload is not good enough: clear throughput scenarios are interpreted as latency ones for message sizes smaller than 4K. The reason for this might be a calibration of Eg for physical environments that is inappropriate for virtual ones, or simply a bad calibration altogether.

To solve the above 2 problems, it is necessary to develop a new heuristic for deciding whether the current workload is throughput or latency. This heuristic should be implemented in the code of Eh so that it will work with guests running any operating system. We are currently working on the creation of such a heuristic, which will be published in future work. In this work we proceed with our static interrupt coalescing setting. We assume that when such a throughput/latency deciding heuristic is created, it will be able to recognize that our benchmarks are clear throughput workloads.
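The ITR-sensitive heuristic of this subsection reduces to a few lines. The sketch below is our reading of it; the constants are the Linux e1000 driver values quoted above, and TI is the TADV=320 equivalent:

```python
LATENCY_ITR = 195          # value Eg writes to ITR for latency workloads
TADV_BASELINE_US = 32.768  # baseline TADV=32 inter-interrupt interval
TI_US = 327.68             # Throughput Interval (TADV=320 equivalent)

def coalescing_interval_us(itr):
    """Pick the inter-interrupt interval from the guest's current ITR:
    keep the baseline timer for latency workloads, use TI for throughput."""
    return TADV_BASELINE_US if itr <= LATENCY_ITR else TI_US
```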



Figure 5.4: Throughput difference achieved when using the 2 types of interrupt coalescing heuristics described in subsections 5.4.2 (static), 5.4.3 (ITR sensitive)

5.4.4 Evaluation

Figure 5.5 shows the benefit in throughput gained by the static interrupt coalescing setting when adding it to the previous improvements (1-2). The interrupt rate shown in the figure is the same for all message sizes, since the interrupt coalescing timer is set to an interrupt rate of around 3051 interrupts/sec (the equivalent of TADV=320). The actual interrupt rate observed is a bit lower, since the timers are checked at a fixed location in the code after sending and receiving packets; the time that passes until the timer is checked slightly increases the interval between interrupts. The reduction in the number of interrupts also reduces the overhead of exits, which are abundant in the interrupt handler of Eg. This reduced overhead of interrupts and exits translates into higher throughput.

5.5 Send from the I/O Thread

The basic send sequence of Eh works as follows: when Eg is given a packet to send by the network stack, it places the packet at the tail of the TX ring pointed to by TDT, and advances TDT to the next free descriptor on the TX ring. Advancing TDT causes an exit to QEMU, where Eh goes over the TX ring and sends all the packets pointed to by descriptors between TDH and TDT. This sequence seems fairly similar to the one in Vh. However, unlike in the case of Vh, where Vh sends the packets on TX in the context of the I/O thread, Eh sends the packets in the context of the VCPU thread,


Figure 5.5: Change in throughput caused by using the improved static interrupt coalescing setting in Eh

while handling the TDT exit. This send sequence has two basic flaws.

First, it doesn't scale. Ideally we would like to be able to parallelize the send sequence by having one thread (the VCPU thread) add packets to the TX ring, and another thread send the packets to their destination. In fact, this is exactly what happens with hardware NICs: the device driver writes new packets to the TX ring, and the NIC asynchronously sends them. But since the current implementation of Eh does everything on the VCPU thread, such parallelization is impossible.

The second flaw is that batching is not used when sending packets. Each packet added by Eg is immediately sent by Eh. Batching would be better for throughput loads as it increases instruction cache locality, prefetching effectiveness, and branch prediction accuracy.

As explained in subsections 2.7.1 and 2.7.2, Vh solved both of the above problems by sending the packets in the context of the I/O thread of QEMU. This way, Vg adds packets to the TX ring in the context of the VCPU thread, and Vh sends them to their destination in the context of the I/O thread. This solution is scalable, since each of the threads can run on its own core in parallel. Sending from the I/O thread also promotes batching; this was discussed in subsections 2.7.1 and 2.7.2, and will be illustrated in the figures of Chapter 6.

Using the same idea as in Vh, we moved the sending of packets in Eh to the I/O thread by using the QEMU bottom-half scheduling mechanism, which schedules work on the I/O thread. Figure 5.6 shows the improvement in throughput gained by moving the sending of packets to the I/O thread, in addition to improvements 1-3. As mentioned, sending from the I/O thread promotes sending batches of packets each time the I/O thread runs, which can be seen in the packets/batch graph of the figure.

Figure 5.6 also shows 2 indirect results of increased batching that further contribute to the improvement in throughput. First, we can see an increase in the packet size sent by the guest. This increase occurs when packets can't be added to the TX ring (e.g., when it is full or CWND doesn't allow more packets to be sent) and are buffered in the socket buffer. When packets accumulate in the socket buffer they also grow in size, as explained in subsection 2.2.1. Sending packets from the I/O thread defers their sending, giving more opportunity for the TX ring to fill up and for the socket buffer's packet-growing behavior to kick in. Second, we see a significant drop in the interrupt rate. This happens because sending packet batches also results in receiving batches of ACKs for them. The NAPI bottom half keeps interrupts turned off (by setting IMC) while there are more packets to be received, essentially deferring the next interrupt until there are no more packets to be received in the current batch of ACKs. We are not sure why the interrupt rate goes down to around 1000 interrupts per second for all message sizes, and we leave this for future work.

Figure 5.6: Change in throughput caused by moving the sending of packets from the VCPU thread to the I/O thread in Eh ("xmit on I/O thread" vs. "previous (up to 3)"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])


5.5.1 Interrupt Coalescing in Virtio-Net

Interrupt coalescing in Vh is different from that in Eh. As discussed in this section and in section 5.4, Eh uses a delay timer and NAPI for interrupt coalescing, while Vh uses only NAPI without a timer. We were unable to get good performance out of Eh without the interrupt coalescing timer, and we were unable to find the exact reasons for this; we therefore leave it as an open issue. We explain what we do know about the interrupt handling differences between Eh and Vh to aid future exploration of this issue.

First, Vh does not raise TX interrupts. Eh tries to raise an interrupt for every sent packet, and these interrupts are deferred using Eh's interrupt coalescing timer. It is possible for Vh to never raise TX interrupts because Vg reclaims the resources of TX buffers that were already sent by Vh before new packets are added to TX; therefore, the unreclaimed buffers in TX do not interfere with sending new packets. Eg, on the other hand, reclaims the resources of TX buffers that have already been sent by Eh in its interrupt handler. Therefore, if we turn off TX interrupts and no packets are received, the TX ring very quickly fills up with unreclaimed buffers, and it becomes impossible to add new packets to TX. This problem could probably be solved by changing the implementation of Eg to reclaim the resources of TX buffers before adding new packets to TX. For this solution to be legal under our unmodified guest assumption, we would also need to push this change to the device drivers of Eb in all OSs, and make sure that it doesn't affect the correctness of the device drivers when used with Eb.

Second, Vh's kicks cause exits only once per batch of packets added to TX, because of the kick suppression mechanism described in subsection 2.6.1. Eh, on the other hand, writes to TDT for every packet added to TX, causing an exit each time. As we will discuss in section 5.6, adding overhead to packet sending can affect the batching of packets sent, which in turn can affect the batching of ACK packets received. These effects on batching directly influence the time by which NAPI delays an interrupt, since NAPI delays an interrupt as long as packets are being received. Therefore, the smaller the batches, the shorter the interrupt delays and the higher the interrupt rate. Perhaps the TDT exits contribute to NAPI's inability to function as the sole interrupt coalescing mechanism, and the timer helps mitigate this problem. Eh also causes 7 control-register-related exits for every interrupt handled, as explained in section 2.5. While we are not sure how, perhaps these exits also contribute to this problem.
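The exit-count gap described above can be made concrete with back-of-the-envelope arithmetic. The functions below are purely illustrative (our own names, not QEMU code), combining the per-packet TDT exits and the 7 control-register exits per interrupt on the Eh side against a single kick exit per batch on the Vh side:

```c
#include <assert.h>

/* Illustrative exit counts per batch of sent packets, following the
 * text: Eh takes one TDT-write exit per packet plus 7 control register
 * exits per interrupt raised, while Vh takes a single kick exit per
 * batch thanks to kick suppression.  These are our helper names. */
static int e1000_exits(int packets, int interrupts)
{
    return packets + 7 * interrupts;   /* TDT writes + register exits */
}

static int virtio_exits(int packets, int interrupts)
{
    (void)packets;
    (void)interrupts;
    return 1;                          /* one kick exit per batch */
}
```

For a 13-packet batch acknowledged by one interrupt, the sketch gives 20 exits for Eh versus 1 for Vh, which is the kind of per-packet overhead that can shrink send batches.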

5.6 Exposing PCI-X to Avoid Bounce Buffers

One of the stranger phenomena we encountered with Eh was the higher throughput obtained when we gave the guest less RAM. More specifically, maximum throughput was obtained as long as the guest machine RAM was smaller than 4GB. Beyond that point, increasing the RAM used to initialize the guest decreased the throughput. We initially thought it counterintuitive that decreasing the resources would increase performance.


After thoroughly investigating this phenomenon, we discovered the reason: in the current implementation, Eh exposed itself as a 32-bit PCI device, which meant that it could access only 4GB of RAM when performing DMA to access the TX ring and its buffers. When more than 4GB of RAM was assigned to the guest, the guest kernel could allocate buffers at high addresses that were inaccessible to Eh, which meant the guest kernel had to copy those buffers to buffers at lower addresses (called bounce buffers) before Eh could perform DMA reads of them. This extra copying is the reason more RAM meant lower throughput. The solution to this problem was very simple once we understood its cause: all we had to do was enable the PCIX MODE flag in the STATUS register of Eh, which switches Eh to PCI-X mode. When Eh is in PCI-X mode, it uses 64-bit addressing, which means it can access all the RAM, and bounce buffer copying is no longer necessary.
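The bounce-buffer condition itself is simple address arithmetic. The following sketch (the function name and mask macros are ours, not kernel or QEMU identifiers) shows the decision a kernel must make for a device with a limited DMA mask, and how a 64-bit mask makes bouncing unnecessary:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the bounce-buffer condition described above: a 32-bit PCI
 * device can only DMA below 4GB, so any buffer with a byte above that
 * limit must first be copied to a low "bounce" buffer.  PCI-X mode
 * lifts the mask to 64 bits.  Names are ours, not kernel/QEMU code. */
#define DMA_MASK_32BIT 0xFFFFFFFFull   /* plain PCI: only 4GB reachable */
#define DMA_MASK_64BIT (~0ull)         /* PCI-X mode: all of RAM */

static bool needs_bounce(uint64_t buf_addr, uint64_t len, uint64_t dma_mask)
{
    /* Does any byte of [buf_addr, buf_addr + len) lie above the mask? */
    return buf_addr + len - 1 > dma_mask;
}
```

A buffer at physical address 0x1_0000_0000 (just past 4GB) needs bouncing under the 32-bit mask but not under the 64-bit one, which matches the throughput behavior observed with guests larger than 4GB.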

Vh doesn't suffer from this phenomenon as it exposes itself as a 64-bit capable device in the first place.

Figure 5.7 shows the improvement in throughput due to enabling PCI-X mode in Eh and eliminating the need to use bounce buffers, together with improvements 1-4. Throughput improves due to the reduced time needed to send a packet, since no bounce buffer copying is necessary. We are not sure why batching improves so much, but a good guess is that it has to do with the scheduling priorities of the VCPU and I/O threads. As discussed in subsection 2.7.2, the dynamic priority of the QEMU threads plays a role in deciding which thread is going to run. While the batching in the previous section is quite low on average, it is not stable during runtime; for example, with 64KB messages, the batch size fluctuates between 1-packet and 13-packet batches. We believe that batching stabilizes at higher sizes when the overhead of bounce buffer copying is eliminated: eliminating the overhead reduces the runtime in the context of the guest, which increases the priority of the VCPU thread over that of the I/O thread. This in turn decreases the chances that the I/O thread will interfere with the VCPU thread while it is adding packets to the TX ring, thus improving batch size overall.

With the improvements added to Eh up to this stage, the send sequence of Eh looks very similar to that of Vh as described in subsection 2.7.2. The main differences are that in Eh, the TDT access causes an exit for every packet sent, as opposed to a single kick exit per batch in Vh, and that Eh uses a timer for interrupt coalescing in addition to NAPI. (As explained in section 5.4, we are not sure why Eh can't manage interrupt coalescing with NAPI only.)

5.7 Dropping Packets to Improve TSO Batching in Linux Guests

Figure 5.7: Change in throughput caused by enabling PCI-X mode in Eh ("PCI-X enabled" vs. "previous (up to 4)"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

Figure 5.8 shows Eh with all the improvements we described up to section 5.6 compared to Vh. This figure shows that we have brought the throughput of Eh very close to that of Vh. More interesting is that Eh actually achieves higher throughput than Vh for message sizes 4KB and 8KB, which was very strange to us when we first witnessed it. What caught our attention was the packet size graph in this figure, where we can see that for these message sizes the guest sends packets that are significantly larger with Eh than with Vh. This observation led us to investigate why the network stack in the guest achieves such poor TCP segment aggregation in a throughput workload. We expected that under our throughput workload, the network stack in the guest would aggregate multiple sent messages to create large TCP segments and increase throughput. In this section we present our findings regarding this behavior of the guest TCP stack. We also present a method that can be used within any virtual NIC to solve this issue. Our method counterintuitively uses packet dropping to alter the behavior of the guest network stack so that it creates larger packets, which results in higher throughput.

As mentioned in subsection 2.2.1, packets can be added to the TX ring only if TCP allows it. To explain this, let us suppose we are in a steady state of sending packets, with a certain number of packets in flight. At a certain point in time, there will be a packet that would normally have been added to TX, but since there are already CWND (see subsection 2.1.3) packets in flight, this packet will not be added to TX and will instead be added to the socket buffer. The same will happen to all the following packets until enough ACKs for the in-flight packets are received, which will cause the first packet in the socket buffer to be added to TX. Therefore, the value of


Figure 5.8: Vh compared to Eh with improvements 1-5 ("virtio-net" vs. "e1000 up to 5"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

CWND is one reason that packets are aggregated into larger packets in the socket buffer. Ideally, we would have liked CWND to stabilize at a value where the aggregation of packets benefits throughput. In our guest machine, CWND rose to values higher than optimal, which decreases the aggregation of packets. As explained in subsection 2.1.3, CWND is lowered when congestion is detected by the TCP congestion algorithm, and congestion is detected when an ACK for a sent packet is not received in a timely manner. To achieve better packet aggregation in throughput scenarios, we decided to drop packets intentionally in Eh to decrease CWND and thus increase the packet size. To do this we used Algorithm 5.1 in the QEMU code of both Eh and Vh. The algorithm examines the average packet size once in a while, and if it is too small, it drops a packet to reduce CWND, in an attempt to improve packet aggregation in the guest.

To understand this algorithm we must first define the constants used in it. P is the number of preceding packets we consider when calculating the current average packet size being sent by the guest. If P is too small, then we might drop packets too quickly, before the previous drop takes effect, causing a continuous drop in CWND and in throughput. If P is too large, then the time between drops might be too long; this might cause CWND to climb to higher values and remain there until the next drop, reducing packet aggregation and thus throughput. M is the average packet size below which we drop packets. If we set it too high (for example 64KB), then we will get packet drops all the time, since the average packet size will never be so high; throughput will also decrease in this case. But if we set M too low (for example 2KB), then for message sizes larger than M we will never drop any packets, losing all the benefits of the algorithm. Finally, m is the average packet size below which we do not drop packets. We discovered its necessity empirically, after seeing that throughput does not improve if packets are dropped when the average packet size is very small.

Algorithm 5.1 Lowering CWND by dropping packets for better packet aggregation
1: if P packets were sent by the guest since the last check and the average packet size of the last P packets sent by the guest is in [m, M] then
2:     drop packet
3: end if
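Algorithm 5.1 can be sketched as a small per-packet hook on the virtual NIC's transmit path. This is a simplified sketch, not the actual QEMU patch; the struct and function names are ours, and the state handling (one global window, no per-connection tracking) mirrors the proof-of-concept nature of the algorithm:

```c
#include <assert.h>

/* Minimal sketch of Algorithm 5.1 (our names, not the QEMU patch).
 * Every P packets we compute the average size of the last P packets
 * and request a drop if that average lies in [m, M]. */
struct drop_state {
    long p;        /* P: check period, in packets */
    long min_avg;  /* m: below this, never drop */
    long max_avg;  /* M: above this, aggregation is already good */
    long count;    /* packets seen since the last check */
    long bytes;    /* their total size */
};

/* Called once per transmitted packet; returns 1 if this packet should
 * be dropped to push CWND (and thus segment sizes) down. */
static int should_drop(struct drop_state *s, long pkt_bytes)
{
    s->count++;
    s->bytes += pkt_bytes;
    if (s->count < s->p)
        return 0;
    long avg = s->bytes / s->count;
    s->count = 0;
    s->bytes = 0;
    return avg >= s->min_avg && avg <= s->max_avg;
}

/* Feed n packets of equal size through the hook; count requested drops. */
static int drops_after(long p, long m, long M, long pkt_bytes, int n)
{
    struct drop_state s = { p, m, M, 0, 0 };
    int drops = 0;
    for (int i = 0; i < n; i++)
        drops += should_drop(&s, pkt_bytes);
    return drops;
}
```

With P = 4 for brevity (the thesis uses P = 8,000), a run of mid-sized packets triggers exactly one drop per window, while very small or already-large packets trigger none, matching the roles of m and M described above.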

Our current version of the packet dropping algorithm is static, meaning that we initialize the algorithm parameters P, M and m once and never change them. We did not explore the possibility of creating a dynamic algorithm that sets these parameters according to the current networking load of the system; this can be explored in future work. For now we simply experimented with the numbers and chose, for each NIC (Eh and Vh), the 3 parameters that showed good performance. This choice is by no means suitable for all cases, but it shows the potential of this packet dropping method. The parameters that showed good results for Eh are (P, M, m) = (8,000, 60,000, 25,000), and for Vh they are (P, M, m) = (8,000, 62,000, 25,000).

Figure 5.9 shows the throughput achieved by Eh together with the CWND value in the guest TCP stack, when running the netperf TCP_STREAM benchmark with the default message size of 16KB. At first, without packet dropping, a high CWND and low throughput are observed. Then, at around 25 seconds, packet dropping is enabled, which causes CWND to drop and the throughput to rise significantly due to the increased TCP segment sizes achieved by the guest. After CWND drops, it immediately starts rising again, and once the higher CWND values cause the TCP segment size to decrease once again, Eh again drops a packet, which sends the CWND value back down.

Figure 5.9: Throughput and CWND values over time, with Eh, running netperf TCP_STREAM with the default 16KB message size. At around 25 seconds packet dropping is enabled. (axes: throughput [Gbps] and CWND vs. time [seconds])

Figure 5.10 shows the throughput achieved with packet dropping together with improvements 1-5. Most of the improvement stems from the increase in packet size for larger message sizes. Batching in Eh is reduced to around 12-13 for larger message sizes, since the TX ring can hold only around 12-13 packets of the maximum IP packet size of 64KB: the ring is 256 buffers long, and a maximum-size packet takes up around 20 buffers. At this point we would like to note that the size of both TX and RX in Vh is also 256, but in the case of Vh each entry in the queue holds a list of buffers for a single packet, which means the rings in Vh can hold up to 256 packets. We tried increasing the ring size of Eh to the maximum of 4096 buffers to avoid overfilling, but this reduced throughput for reasons we did not explore, and we leave the optimal TX size for future work. In fact, when varying the queue sizes to get a feeling for which size is best, we got the best throughput around the default setting of 256, so we left the queue sizes at this default setting.
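The batch ceiling mentioned above is simple integer arithmetic. The helper below is ours, and the 20-descriptors-per-64KB-packet figure is the text's estimate, not a hardware constant:

```c
#include <assert.h>

/* The arithmetic behind the ~12-13 packet batch ceiling: a TX ring of
 * ring_entries descriptors, where each maximum-size packet occupies
 * descs_per_packet of them, holds this many full-size packets.
 * The 20-descriptor figure is the text's estimate. */
static int ring_capacity_packets(int ring_entries, int descs_per_packet)
{
    return ring_entries / descs_per_packet;
}
```

With the default 256-entry ring this gives 12 maximum-size packets, while virtio-net's one-packet-per-entry rings hold 256 regardless of packet size.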

Figure 5.11 shows the impact of packet dropping on Vh, with and without packet dropping. Packet dropping is a general solution that works regardless of the NIC type; therefore, we see a very similar increase in the average packet size, and thus in throughput, to that shown for Eh in Figure 5.10.

It is important to note that the presented packet dropping algorithm is incomplete and serves only as a proof of concept, as it has a few basic flaws. First, the algorithm parameters are static, as already discussed in this section. Moreover, the algorithm works only for a single TCP connection, since there is a different CWND for each TCP connection and our algorithm doesn't take connections into account. It is possible to use this algorithm with multiple unencrypted TCP connections by inspecting packets in the virtual NIC and maintaining the average packet size per connection. However, this will not work for encrypted connections (such as IPsec), since inspecting such encrypted packets to see which connection they belong to is impossible. A more complete solution should probably be implemented in the TCP stack itself, by better managing packet aggregation in throughput scenarios, but this is out of the scope of our work.


Figure 5.10: Change in throughput caused by adding the packet dropping for better TSO batching in Eh ("drop packets" vs. "previous (up to 5)"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

5.8 Vectorized Send

Packets are sent by a virtual NIC in QEMU by passing them to the networking backend (in our case the TAP backend), which is responsible for handing the packet over to the network stack of the host, where it will be sent to its destination. The backend has 2 types of functions for sending packets: qemu_send_packet_async(), which receives a single buffer containing the whole packet and uses the write() system call to write the packet to the TAP device in the host, and qemu_sendv_packet_async(), which receives a vector of buffers (called an iov, for I/O vector) and uses the writev() system call to write the packet to the TAP device in the host.

Each packet on the TX ring of both Eh and Vh is comprised of a list of buffers. Therefore, the natural choice for sending a packet is the qemu_sendv_packet_async() function, which receives the vector of buffers on the TX ring as input and writes them to the TAP device. Vh indeed uses qemu_sendv_packet_async(), but Eh first copies the packet buffers on the TX ring to a single linear buffer and then uses qemu_send_packet_async() to send the packet.

Since copying the TX buffers to another intermediate buffer seems wasteful, we changed the implementation of Eh to use the qemu_sendv_packet_async() function directly with the vector of buffers held by the TX ring. We call this change "vectorized send".

Figure 5.11: Change in throughput caused by adding the packet dropping for better TSO batching in Vh ("virtio-net drops" vs. "virtio-net no drops"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

Figure 5.12 shows the results of this change. There is a slight improvement in throughput when the guest sends large packets, but throughput decreases for smaller packet sizes. Previous works [GOB01, GKR05] have shown that zero-copy techniques achieve better throughput than copying when the amount of data processed is large, and worse throughput than copying when the amount of processed data is small. We would expect the same kind of effect from our change, since we also remove a copy on the I/O processing path. However, we were unable to determine the reason for the somewhat chaotic pattern of throughput we observed for smaller message sizes. While it is not clear that the throughput curve with vectorized send is better than without it, we chose to present this change to illustrate another difference between the Vh and Eh implementations (albeit not a very significant one).
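The two send paths can be contrasted with a self-contained sketch. The function names send_vectored() and send_linearized() are ours, standing in for the qemu_sendv_packet_async() and qemu_send_packet_async() paths respectively; a pipe substitutes for the TAP device:

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Stand-in for the qemu_sendv_packet_async() path: hand the TX ring's
 * buffer list to the fd in one writev() call, with no linearization. */
static ssize_t send_vectored(int fd, struct iovec *iov, int iovcnt)
{
    return writev(fd, iov, iovcnt);
}

/* Stand-in for the pre-change e1000 path: copy every fragment into one
 * linear buffer first, then a single write(). */
static ssize_t send_linearized(int fd, struct iovec *iov, int iovcnt)
{
    char linear[2048];
    size_t off = 0;
    for (int i = 0; i < iovcnt; i++) {
        memcpy(linear + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    return write(fd, linear, off);
}

/* Send a 2-fragment "packet" through either path via a pipe that stands
 * in for the TAP device; returns the number of bytes written. */
static ssize_t demo(ssize_t (*sendfn)(int, struct iovec *, int))
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    char hdr[] = "hdr", payload[] = "payload";
    struct iovec iov[2] = {
        { .iov_base = hdr,     .iov_len = 3 },
        { .iov_base = payload, .iov_len = 7 },
    };
    ssize_t n = sendfn(fds[1], iov, 2);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Both paths deliver the same bytes; the difference is only the extra memcpy() per fragment in the linearized path, which is the copy that vectorized send removes.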

5.9 SRTT Calculation Algorithm Bug in Linux

While we were running experiments with our improved Eh, we got very inconsistent throughput results. Our investigation uncovered a bug in the Linux kernel's calculation of SRTT (which is explained in subsection 2.1.4). In the following subsections we describe the bug, our fix, and how the fix affects the performance of both Eh and Vh.


Figure 5.12: Change in throughput caused by using vectorized sending in Eh ("vectorized send" vs. "previous (up to 6)"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])

5.9.1 SRTT Calculation in the Linux Kernel

The default Linux TCP stack runs in kernel space, where floating point computations are unavailable. Therefore formula 2.1 for the calculation of SRTT is implemented in the Linux kernel using integer computations. Figure 5.13 shows the code in Linux kernel 3.13, which calculates the new SRTT given the old SRTT and the RTT of the currently ACKed packet. To facilitate understanding of the code, we note the following. First, the value held in tp->srtt is not SRTT, but rather 8*SRTT. So the formula you see implemented in this code is the same as formula 2.1, in which α is 1/8, and both sides of the equation are multiplied by 8, giving equation 5.1.

    (8 * SRTT) = (7/8) * (8 * SRTT) + RTT        (5.1)

8*SRTT was probably used to obtain higher precision of SRTT without using floating point numbers, but the achieved precision isn't enough, as we shall demonstrate when we describe the bug we found in the next subsection. Second, to avoid too-small RTO values that cause frequent unnecessary timeouts, the minimum allowed value of RTT is 1, and therefore the minimum value of tp->srtt is 8, which explains the first "if" clause in row 5 of Figure 5.13.


 1  static void tcp_rtt_estimator(struct sock *sk,
 2                                const u32 mrtt) {
 3      struct tcp_sock *tp = tcp_sk(sk);
 4      long m = mrtt;             /* RTT */
 5      if (m == 0)
 6          m = 1;
 7      if (tp->srtt != 0) {
 8          m -= (tp->srtt >> 3);  /* m is now error in rtt est */
 9          tp->srtt += m;         /* rtt = 7/8 rtt + 1/8 new */
10          ...
11      }
12      ...
13  }

Figure 5.13: The routine in Linux kernel 3.13 that calculates the new SRTT given the RTT of the currently ACKed packet and the previous SRTT (irrelevant code omitted)

5.9.2 Bug Description

We can describe the bug in a single sentence: whenever tp->srtt ∈ [8, 14], if tp->srtt increases, it will never return down to its previous value. For example, if tp->srtt is 8, and the RTT of the currently ACKed packet is 2, then tp->srtt will increase to 9 and will never go back down to 8, even if the RTT of every packet from now on is 0. The reason for this bug is the loss of accuracy in row 8 of Figure 5.13. The result of tp->srtt >> 3 is 1 for all the numbers in [8, 15], and since the minimum value of m before row 8 is 1, the minimum value of m when row 8 is executed is 0, which leaves tp->srtt as it was before.
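The stickiness is easy to reproduce with a standalone model of the update in Figure 5.13 (the function name is ours; the arithmetic is the kernel's, with tp->srtt modeled as a plain long holding 8*SRTT):

```c
#include <assert.h>

/* Standalone model of the buggy integer SRTT update from Figure 5.13;
 * srtt8 plays the role of tp->srtt, i.e. it holds 8*SRTT. */
static long srtt_update_buggy(long srtt8, long rtt)
{
    long m = rtt;
    if (m == 0)
        m = 1;               /* minimum allowed RTT */
    if (srtt8 != 0) {
        m -= srtt8 >> 3;     /* values 8..15 all shift down to just 1 */
        srtt8 += m;
    }
    return srtt8;
}

/* Apply three RTT samples in sequence and return the final srtt8. */
static long srtt8_after_rtts(long srtt8, long r1, long r2, long r3)
{
    srtt8 = srtt_update_buggy(srtt8, r1);
    srtt8 = srtt_update_buggy(srtt8, r2);
    srtt8 = srtt_update_buggy(srtt8, r3);
    return srtt8;
}
```

Starting from srtt8 = 8, one RTT sample of 2 pushes it to 9, and subsequent zero-RTT samples leave it stuck at 9: the error term bottoms out at 0, exactly as the text describes.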

5.9.3 Effects of the Bug

To explain the effects of the bug we must first introduce the socket pacing rate in the Linux kernel, which is represented in the code by the sk_pacing_rate variable. The sk_pacing_rate variable ensures that packets are sent out at a pace no slower than a packet per millisecond. One of the parameters used to determine that the pace is slow is the SRTT: more specifically, sk_pacing_rate is divided by tp->srtt if tp->srtt is greater than 10. This means that when tp->srtt grows beyond 10, sk_pacing_rate immediately drops by an order of magnitude. This sets up a chain of events that dramatically slows down the speed at which maximum throughput is achieved by the socket. The sk_pacing_rate variable also limits the amount of data that can be held by the socket, which reduces the number of packets sent per batch, which in turn slows the increase of CWND. CWND is the other parameter that affects sk_pacing_rate, which means that the slow growth of CWND further slows the growth of sk_pacing_rate. In our setup, when using Eh, it takes around 500 seconds to achieve the maximum throughput when running netperf with the default 16KB messages using the original SRTT calculation in the Linux kernel, as compared to around 1 second after our bug fix.

static void tcp_rtt_estimator(struct sock *sk,
                              const u32 mrtt) {
    struct tcp_sock *tp = tcp_sk(sk);
    long m_exact = mrtt << 3;
    if (m_exact == 0) {
        m_exact = 8;
    }
    if (tp->srtt != 0) {
        m_exact -= (tp->srtt_exact >> 3);
        tp->srtt_exact += m_exact;
        tp->srtt = tp->srtt_exact >> 3;
        ...
    }
    ...
}

Figure 5.14: The code of tcp_rtt_estimator() after fixing the bug in SRTT calculation (irrelevant code omitted)
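The cliff in the pacing rate described in subsection 5.9.3 can be sketched as follows. This is a toy model of the coupling, not the kernel's sk_pacing_rate computation; the function name and the base rate are ours:

```c
#include <assert.h>

/* Toy model (our names, not kernel code) of the coupling described in
 * the text: once tp->srtt exceeds 10, the pacing rate is divided by
 * it, so crossing that threshold cuts the rate by an order of
 * magnitude at once. */
static long paced_rate(long base_rate, long srtt)
{
    return srtt > 10 ? base_rate / srtt : base_rate;
}
```

Going from srtt = 10 to srtt = 11 drops the sketched rate from base_rate to base_rate/11, which is why an SRTT stuck one step too high (as with the bug) is so costly.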

5.9.4 Bug Fix

Since the cause of the bug is the loss of accuracy due to tp->srtt being shifted 3 places to the right, the natural solution was to shift all variables participating in the calculation of SRTT 3 places to the left before the calculation, and then shift the results back 3 places to the right when the calculation is complete. The fixed code can be seen in Figure 5.14, where all the variables that were shifted 3 places left have the suffix "_exact" added to their names. Figure 5.15 shows the values of tp->srtt as they react to RTT values in both the original implementation of tcp_rtt_estimator() and in the fixed version. In the original version it can be seen that tp->srtt is monotonically increasing, even when RTT should have lowered it, while in the fixed version tp->srtt reacts correctly to both increases and decreases in RTT.

Figure 5.16 shows the impact of the SRTT bug fix on the throughput of the best version of Eh without the packet dropping improvement, which masks the effects of the bug. The impact is prominent across all message sizes. Figure 5.17 shows the impact of the SRTT bug fix on the throughput of Vh, again without the packet dropping algorithm. Here the impact is noticeable only for message sizes in [512, 4K). This bug has such a dramatic effect on Eh because Eh tends to have spikes in the value of RTT, specifically at the beginning of the netperf benchmark, as can be seen in the left graph of Figure 5.15. We did not observe these spikes in Vh. We were unable to determine the reason for these spikes, but either way, with the bug fixed, the throughput of Eh no longer fluctuates as it did with the bug.
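The recovery behavior of the fixed estimator can be checked with the same kind of standalone model (again our names; srtt8 and srtt_exact mirror tp->srtt and tp->srtt_exact from Figure 5.14):

```c
#include <assert.h>

/* Standalone model of the fixed update from Figure 5.14: the extra
 * 3 bits of precision in srtt_exact let small negative errors survive
 * the arithmetic, so srtt8 can come back down. */
struct srtt_state {
    long srtt8;       /* 8*SRTT, as exposed to the rest of the stack */
    long srtt_exact;  /* 8*SRTT shifted left 3, i.e. 64*SRTT */
};

static void srtt_update_fixed(struct srtt_state *st, long rtt)
{
    long m_exact = rtt << 3;
    if (m_exact == 0)
        m_exact = 8;               /* minimum RTT, in exact units */
    if (st->srtt8 != 0) {
        m_exact -= st->srtt_exact >> 3;
        st->srtt_exact += m_exact;
        st->srtt8 = st->srtt_exact >> 3;
    }
}

/* Start from srtt8 = 8, apply two RTT samples, return the final srtt8. */
static long srtt8_fixed_after(long r1, long r2)
{
    struct srtt_state st = { 8, 64 };
    srtt_update_fixed(&st, r1);
    srtt_update_fixed(&st, r2);
    return st.srtt8;
}
```

From srtt8 = 8, an RTT sample of 2 raises it to 9, and a following zero-RTT sample brings it back to 8, unlike the buggy version, where it stays stuck at 9.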


Figure 5.15: tp->srtt values as they react to RTT values over time, in both the original implementation of tcp_rtt_estimator() (left) and the fixed version (right) (y-axis: time [Jiffies], x-axis: time [sec]; curves: current RTT and original/fixed tp->srtt)

Figure 5.16: Difference in throughput of Eh with all improvements but packet dropping, with the SRTT bug and after fixing it ("e1000 fixed srtt" vs. "e1000 original srtt"; throughput [Gbps] and normalized throughput vs. message size [bytes])

5.10 Final Throughput Comparison

Figure 5.18 compares the best versions of Eh and Vh in terms of throughput achieved: Eh with all of our improvements added, and Vh with two improvements added to it as well, the SRTT bug fix and the packet dropping heuristic. Before all of our improvements, Vh had 20-77x higher throughput than Eh; after all of our improvements, it has only 1.2-2.2x higher throughput. We can also see that the curves of the other parameters in the figure are much more similar than with the baseline versions of the NICs. These results indicate that the notion that paravirtual I/O devices are greatly superior to emulated ones in throughput scenarios is a misconception. The results also show that virtualization exits do not play as large a role in the throughput difference between emulated and paravirtual I/O devices as presented in previous works.

At this point we would like to return to the maximum throughput that can theoretically be achieved by Eh in our single core benchmark, as predicted by our model in section 3.6. To see how close we got to the prediction of our model, we ran our single core throughput benchmark with a 65160-byte message size, using Eh with all of the


Figure 5.17: Difference in throughput of Vh with the SRTT bug and after fixing it ("virtio-net fixed srtt" vs. "virtio-net original srtt"; throughput [Gbps] and normalized throughput vs. message size [bytes])

improvements presented in this chapter, and got a throughput of 7725 Mbps, which is 94% of the 8200 Mbps predicted by our model.

As Figure 5.18 shows, there are still 2 main differences between Eh and Vh that need to be addressed: (1) the interrupt rate of Vh is much lower than that of Eh, and (2) Vh is able to utilize batching much better than Eh. We leave the effort to address these issues to future research.

Figure 5.18: Throughput comparison of the best versions of Vh and Eh ("best virtio-net" vs. "best e1000"; panels: throughput [Gbps], interrupts [K/sec], packet size [10KB], packets/batch, regular and normalized, vs. message size [KB])


category name                                  | improvements in category              | also improves Vh
implementing hardware acceleration in software | TCO removal, TSO removal              | no
parameter setup                                | interrupt coalescing, enabling PCI-X  | no
optimizations                                  | xmit from I/O thread, vectorized send | no
interaction with network stack                 | packet dropping                       | yes
bug fixes                                      | SRTT bug fix                          | yes

Table 5.2: Eh improvement categories

5.11 Improvements Summary

In this section we summarize the improvements to Eh presented in this chapter. Table 5.2 divides our improvements into 5 categories, according to the nature of each improvement. This table gives a good overview of the types of problems we've found with the implementation of Eh. Table 5.3a summarizes the throughput achieved by adding up all of our improvements: each row shows the throughput in Mbps achieved when adding all improvements up to and including the improvement in that row. The same data is also presented graphically in figures 5.19a and 5.19b. Figure 5.19a uses a linear scale Y axis to present the data, and figure 5.19b uses a logarithmic scale Y axis to emphasize the throughput increases achieved at the lower message sizes, which are indistinguishable in figure 5.19a. Table 5.3b shows the increase in throughput achieved by each improvement, in percent of the maximum achieved throughput. For example, 40% of the maximum achieved throughput for 64KB message sizes is attained by moving the transmission to the I/O thread. According to this table, the 4 most significant improvements are interrupt coalescing, xmit on I/O thread, PCI-X enabled, and drop packets. The first and last of these are most beneficial for small and large message sizes respectively, while the other two increase throughput across all message sizes. There are also negative numbers in this table, which mean that the improvement actually reduces the throughput in some cases; for example, vectorized send reduces the maximum throughput achieved by 16% for 1K message sizes. The same data is presented graphically in figure 5.20, which once again shows distinctly the 4 improvements most dominant in the throughput increase we achieved. In this figure, the improvements that reduce throughput at some message sizes (the negative numbers in table 5.3b) are shown as 0% change in throughput, since negative effects on throughput can't be expressed in such a figure.


imprv.\msg. size     64   128   256   512    1K    2K    4K    8K   16K   32K   64K
baseline              3     5    11    21    41    48    76   136   226   341   438
no TCO                3     5    10    21    45    50    81   156   279   464   666
no TSO                3     5    10    21    45    49    83   167   380   691  1344
interrupt coales.    62    68    89   102   116   133   243   414   780  1150  1933
xmit on I/O thrd.    93   205   454   772  1226  2095  2467  3476  4084  4484  4851
PCI-X enabled       104   242   499   861  1507  2729  3181  4416  4597  6102  6801
drop packets         87   205   468   850  1509  3295  5103  5861  6246  6820  6835
vectorized send      86   206   455   806  1272  2140  5065  6033  6721  7287  7291

(a) Throughput in Mbps (rounded) achieved by Eh with each added improvement

imprv.\msg. size     64   128   256   512    1K    2K    4K    8K   16K   32K   64K
baseline              3     2     2     2     3     1     1     2     3     5     6
no TCO                0     0     0     0     0     0     0     0     1     2     3
no TSO                0     0     0     0     0     0     0     0     2     3     9
interrupt coales.    57    26    16     9     5     3     3     4     6     6     8
xmit on I/O thrd.    29    56    73    78    74    60    44    51    49    46    40
PCI-X enabled        11    15     9    10    19    19    14    16     8    22    27
drop packets        -16   -15    -6    -1     0    17    38    24    25    10     0
vectorized send      -1     1    -3    -5   -16   -35    -1     3     7     6     6

(b) Throughput increase caused by each of the improvements added to Eh, as a percentage of the highest achieved throughput

Table 5.3: Throughput increase achieved by adding our improvements to Eh


[Figure: throughput (Gbps) vs. message size (bytes), one curve per cumulative improvement (baseline, no TCO, no TSO, interrupt coalescing, xmit on I/O thread, PCI-X enabled, drop packets, vectorized send). (a) Linear Y axis scaling. (b) Logarithmic Y axis scaling.]

Figure 5.19: Throughput increase achieved by adding our improvements to Eh


[Figure: percent of maximum throughput vs. message size (bytes), stacked by improvement (baseline, no TCO, no TSO, interrupt coalescing, xmit on I/O thread, PCI-X enabled, drop packets, vectorized send).]

Figure 5.20: Throughput increase caused by each of the improvements added to Eh out of the highest achieved throughput



Chapter 6

Initial Work on a Dual Core Configuration

In the previous chapter we showed what happens when QEMU runs a guest machine with a single VCPU, when all of QEMU’s threads are running on a single core. In this chapter we give initial results when running the QEMU threads on 2 cores, one for the VCPU thread, and the other for the I/O thread. Splitting the threads allows for parallel execution, which is expected to improve performance but raises parallelism-related issues.

We first present the baseline comparison of Vh vs Eh when running our dual core benchmark. We then show that Eh with all of the improvements from Chapter 5 performs poorly in our dual core benchmark, and further show that this is due to mutex contention issues, from which Vh suffers less. We then present the sidecore paradigm, which can serve as a solution to the mutex contention problem and also reduce the overhead of exits, and we present a partial sidecore implementation that shows great improvement in throughput. Finally, we compare the best version of Eh against the best version of Vh when running our dual core benchmark. All throughput figures in this chapter are obtained by running our dual core throughput benchmark as presented in section 4.2.

6.1 Baseline Comparison

Figure 6.1 shows a throughput comparison of baseline Eh vs baseline Vh as they are implemented in our version of QEMU, when running our dual core throughput benchmark. This figure will serve as our initial comparison point between Vh and Eh. As in the single core case, the initial throughput difference between Vh and Eh is very large: baseline Vh achieves throughput that is 7–173x better than Eh. In the next section we will examine how well our improvements from Chapter 5 perform in a dual core setup.


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for baseline virtio-net vs. baseline e1000.]

Figure 6.1: Throughput comparison between baseline Vh and baseline Eh when running our dual core basic throughput benchmark

6.2 Scalability of the Emulated E1000

Figure 6.2 shows what happens to the throughput of Vh with packet dropping when running on a dual core setup, in comparison to the single core setup, with drop packet parameters (P,M,m) = (8,000, 62,000, 9,000) for both setups. It is clear that Vh benefits from the extra core across all message sizes. Figure 6.3 shows that Eh with all improvements from Chapter 5, unlike Vh, scales poorly when given an additional CPU core. In fact, not only does Eh scale poorly, it actually achieves throughput that is worse than on a single core across all message sizes except for 64 bytes. The reason we found for this poor scalability is a certain type of contention on the qemu global mutex, which we first introduced in subsection 2.3.2. For best throughput scalability when 2 cores are assigned to a single guest, we would ideally like the VCPU thread to add packets to the TX ring on one core, while at the same time the I/O thread sends these packets to their destination on the other core. However, exits that occur when TDT is accessed by Eg greatly impair this parallel execution in Eh. TDT is advanced for each packet that is added to the TX ring. During the exit caused by this access to TDT, the VCPU thread gets stuck waiting for the qemu global mutex until the I/O thread finishes sending previous packets and doing other I/O thread work. The I/O thread also suffers from contention over the qemu global mutex, since it doesn't start sending the next packet until the qemu global mutex is released by the VCPU thread.
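The serialization just described can be sketched with a toy producer/consumer pair sharing one coarse-grained lock. This is a hypothetical standalone sketch, not QEMU's actual code; all names (`send_packets`, `global_mutex`, the ring variables) are ours:

```c
#include <pthread.h>

/* Sketch of the serialization described above: the "VCPU thread"
 * advances TDT once per packet, and every such advance, like the exit
 * it models, runs under one coarse-grained mutex; the "I/O thread"
 * transmits while holding the same mutex. The two threads therefore
 * make progress strictly one at a time, however many cores they get. */

#define RING_SIZE 256  /* callers must keep n < RING_SIZE */

static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;
static int tx_ring[RING_SIZE];
static int tdt, tdh;                  /* tail (VCPU) and head (I/O) indices */
static int packets_to_send, packets_sent;

static void *vcpu_thread(void *arg) {
    (void)arg;
    for (int i = 0; i < packets_to_send; i++) {
        /* models the exit taken on each TDT write */
        pthread_mutex_lock(&global_mutex);
        tx_ring[tdt] = i;
        tdt = (tdt + 1) % RING_SIZE;
        pthread_mutex_unlock(&global_mutex);
    }
    return NULL;
}

static void *io_thread(void *arg) {
    (void)arg;
    int done = 0;
    while (!done) {
        /* transmission also holds the lock, stalling the next TDT exit */
        pthread_mutex_lock(&global_mutex);
        while (tdh != tdt) {
            tdh = (tdh + 1) % RING_SIZE;   /* "send" one packet */
            packets_sent++;
        }
        done = (packets_sent == packets_to_send);
        pthread_mutex_unlock(&global_mutex);
    }
    return NULL;
}

/* runs both threads and returns the number of packets transmitted */
int send_packets(int n) {
    pthread_t vcpu, io;
    tdt = tdh = packets_sent = 0;
    packets_to_send = n;
    pthread_create(&vcpu, NULL, vcpu_thread, NULL);
    pthread_create(&io, NULL, io_thread, NULL);
    pthread_join(vcpu, NULL);
    pthread_join(io, NULL);
    return packets_sent;
}
```

All packets arrive, but never in parallel: whichever thread holds `global_mutex` blocks the other, which is exactly the behavior mutrace reports as contention.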


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for the best version of virtio-net on 1 core vs. on 2 cores.]

Figure 6.2: Throughput difference between the best version of Vh when run on a single core vs when run on 2 cores

This contention over the qemu global mutex eliminates the batching we saw in the single core setup, since it forces the sending of one packet at a time. The lack of batching in turn cancels NAPI interrupt coalescing and lowers the packet sizes, as explained in section 5.5. All of these effects reduce throughput significantly. We don't see the above contention problem with Vh, since the kick notification of Vg (which is comparable to the advancing of TDT) doesn't cause an exit most of the time, as described in subsection 2.6.1. Therefore, unlike with Eh, there is no exit per packet, which means there is much less contention over the qemu global mutex, and Vh can fully benefit from the parallel execution of the VCPU and I/O threads.

Table 6.1 shows the qemu global mutex contention as reported by the mutrace [Poe09] tool, when running netperf TCP STREAM for 300 seconds with a message size of 64KB. The Locked column indicates how many times the qemu global mutex was locked; Contended, how many times it was contended; % Contended, the percentage of times the mutex was contended out of the times it was locked; and Dual % / Single %, the percentage of contended accesses to the mutex in the dual core benchmark for this NIC divided by the percentage of contended accesses in the single core benchmark, indicating the increase in contention from the single core to the dual core benchmark. For Eh, the qemu global mutex was contended 5x more in the dual core setup than in the single core setup, while for Vh the contention decreased by 10%. The qemu global mutex is locked significantly more in Eh than in Vh due to the larger number of I/O exits in Eh. We will discuss the sidecore row of this table in section 6.3.

[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for the best e1000 on 1 core vs. on 2 cores.]

Figure 6.3: Throughput difference between the best version of Eh when run on a single core vs when run on 2 cores

Setup            Locked       Contended   % Contended   Dual % / Single %
Eh single core   7,112,248    373,375     5%            -
Eh dual core     17,358,389   4,685,595   27%           5X
Vh single core   529,507      115,521     22%           -
Vh dual core     877,896      175,499     20%           0.9X
Eh sidecore      26,374,558   4,080,292   15%           3X

Table 6.1: qemu global mutex contention in different configurations as measured by mutrace

6.3 Sidecore

The contention over the qemu global mutex can be dealt with in different ways. It can be eliminated by using fine-grained locking instead of the single coarse-grained qemu global mutex, or use of the qemu global mutex can be reduced or eliminated altogether. We chose the second option: we used the sidecore paradigm to eliminate exits, thus eliminating the locking of the qemu global mutex during the handling of these exits. The sidecore paradigm works as follows: Control registers are mapped to the host


physical memory by the hypervisor, so that when the guest writes to these registers, no exits occur. In order to perform the necessary actions when these registers change their value, a polling thread is created. This polling thread polls the memory to which the registers are mapped, and whenever the value in one of the registers changes, the polling thread emulates the necessary behavior, the same way it would have done in the emulate phase of the trap-and-emulate paradigm. A special CPU core is dedicated to this polling thread to ensure short reaction times to register changes. This dedicated core is called a sidecore, and thus the paradigm is called the sidecore paradigm. The sidecore paradigm was first introduced by Kumar et al. [KRSG07], and later employed in several other works [ABYTS11, HGL+13, KMN+16], but to our knowledge has never been used before with a full-fledged emulated I/O device.
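A minimal sketch of the polling scheme just described, with a thread standing in for the sidecore and an atomic variable standing in for the mapped TDT-like register. This is our own construction for illustration, not QEMU's code:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Sketch of the sidecore paradigm: the "guest" advances a TDT-like
 * register with a plain memory write that causes no exit, while a
 * dedicated polling thread notices the change and emulates it. */

static atomic_int reg_tdt;            /* stands in for the mapped register page */
static atomic_int guest_done;
static atomic_int packets_emulated;

static void *sidecore_thread(void *arg) {
    (void)arg;
    int seen = 0;                     /* last TDT value already emulated */
    for (;;) {
        int tdt = atomic_load(&reg_tdt);
        while (seen < tdt) {          /* register changed: emulate the writes */
            seen++;
            atomic_fetch_add(&packets_emulated, 1);
        }
        if (atomic_load(&guest_done) && seen == atomic_load(&reg_tdt))
            return NULL;
    }
}

/* "guest" sends n packets; returns how many the sidecore emulated */
int run_guest(int n) {
    pthread_t sidecore;
    atomic_store(&reg_tdt, 0);
    atomic_store(&guest_done, 0);
    atomic_store(&packets_emulated, 0);
    pthread_create(&sidecore, NULL, sidecore_thread, NULL);
    for (int i = 0; i < n; i++)
        atomic_fetch_add(&reg_tdt, 1);   /* no exit: just a memory write */
    atomic_store(&guest_done, 1);
    pthread_join(sidecore, NULL);
    return atomic_load(&packets_emulated);
}
```

The guest's writes here never trap: all emulation work happens on the polling thread, which is why the scheme needs a dedicated core to keep reaction times short.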

6.3.1 Partial Sidecore Implementation

As described in section 2.4, the registers of Eb are grouped into 4KB pages. The registers that are of most interest to us are those described in subsection 2.4.1. These registers reside in pages 0, 2 and 3 of the register array. Due to the limitations imposed by the architecture, pages can only be mapped at whole-page granularity. Therefore, for each of the above 3 pages, we must decide whether to map the whole page, so we can avoid the exits when registers in this page are accessed, or leave it unmapped and suffer the penalty. The decision to map a page or leave it unmapped depends on whether the sidecore can preserve a correct emulation of Eh according to the specifications of Eb. We call a page "sidecore emulatable" if it can be mapped to the host memory and emulated correctly via a sidecore. Page 2, which contains TDH and TDT (among others), is sidecore emulatable, as none of the registers in this page have semantics that prevent a correct sidecore emulation. We implemented a partial sidecore emulation of Eh, where only page 2 of the Eh register pages is mapped in the host memory and polled by the sidecore thread. The other register pages are left unmapped and are emulated using exits as before. In our partial sidecore implementation, we eliminated only exits caused by TDT increments in Eg.

We expected our partial sidecore implementation to improve the throughput of Eh for 2 reasons: First, the sidecore eliminates the overhead of all TDT related exits. Second, since the TDT exits are those that cause the contention problem described in section 6.2, eliminating these exits will also eliminate the related contention and improve the scalability of Eh, by enabling the VCPU and I/O threads to work simultaneously when sending packets.

Implementation Details

For our initial sidecore thread prototype, we decided to add the sidecore polling behavior to the I/O thread of QEMU, instead of creating a separate sidecore thread. There are two reasons for this decision. First, it is very convenient, since the I/O thread already contains a working polling loop, which currently polls event file descriptors as described in subsection 2.3.1. Therefore, all we had to do was add our polling of the Eh registers to the existing polling loop, and change the existing polling in the I/O thread to be non-blocking. Without the second change, the I/O thread would block, which would increase the response time to changes in Eh registers. Second, implementing the sidecore thread as part of the I/O thread also reduces contention over the qemu global mutex, which a separate sidecore thread would need to acquire when performing the necessary emulation of the different registers.
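The loop change described above can be illustrated as follows. `io_loop_iteration` and `check_registers` are hypothetical stand-ins for the real I/O-thread loop and register emulation; the essential point is the `poll()` timeout of 0, which makes the descriptor poll non-blocking:

```c
#include <poll.h>
#include <unistd.h>

/* Illustrative sketch (names are ours, not QEMU's): one iteration of
 * the modified I/O-thread loop first polls its event file descriptors
 * with a timeout of 0, so it never sleeps, and then inspects the
 * mapped e1000 register page. */

typedef void (*reg_check_fn)(void);

int io_loop_iteration(struct pollfd *fds, int nfds, reg_check_fn check_registers) {
    /* timeout 0: return immediately even when no descriptor is ready,
     * instead of blocking as the unmodified I/O thread would */
    int ready = poll(fds, (nfds_t)nfds, 0);
    check_registers();                /* poll the Eh register page */
    return ready;                     /* number of ready descriptors */
}

/* small self-check: a pipe with pending data plus a counting callback */
static int checks_run;
static void count_check(void) { checks_run++; }

int demo(void) {
    int p[2];
    if (pipe(p) != 0)
        return -1;
    (void)write(p[1], "x", 1);        /* make the read end ready */
    struct pollfd fd = { .fd = p[0], .events = POLLIN, .revents = 0 };
    int ready = io_loop_iteration(&fd, 1, count_check);
    close(p[0]);
    close(p[1]);
    return ready == 1 && checks_run == 1;
}
```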

Evaluation

We ran our basic dual core benchmark as described in subsection 4.2.2, to see how well our partial sidecore performs in comparison to Vh. Henceforth, any mention of our sidecore implementation refers to Eh with all improvements from Chapter 5, with our partial sidecore implementation, with drop packet parameters changed to work better for the sidecore case at (P,M,m) = (8,000, 60,000, 8,000), and with increased queue size, from the original 256 of Eh to 4096, which is the maximum Eh allows. As explained in section 5.7, throughput did not improve when the ring sizes of Eh were increased in the single core setup. However, when using a sidecore, throughput did improve when the ring sizes were increased. This also makes for a more equal comparison between Eh and Vh, as with large ring sizes the rings never fill up in either Vh or Eh in our setup, unlike in the single core case, where the TX ring of Eh did fill up.

Figure 6.4 shows the throughput achieved by Eh with our partial sidecore implementation, compared to the throughput achieved by the best version of Eh when running on a single core. With the TDT exits eliminated, Eh with a sidecore scales well, as opposed to 2-core Eh without a sidecore, as shown in Figure 6.3. The improved scalability is due to having eliminated the TDT exit related contention over the qemu global mutex, as described in section 6.2. We can also see that our sidecore restores the batching of packets in Eh, which was lost due to the contention over the qemu global mutex. This batching also causes increased packet sizes, as explained in section 3.2. We expected the interrupt rate to go down, but it fluctuates instead; we are not sure why it increases to 3K between 4KB and 32KB message sizes. Looking back at Table 6.1, we can see that while the sidecore removed the contention due to the exits caused by TDT accesses, the overall contention over the qemu global mutex is still relatively high. We speculate that this contention occurs because the sidecore polls its memory continuously, locking the qemu global mutex every polling iteration at a high rate, while the interrupt handler in Eg causes 7 exits due to accesses to registers other than TDT, contending with the I/O thread. However, this speculation should be verified by further investigation. Figure 6.5 compares the throughput achieved by our partial sidecore implementation


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for the best e1000 sidecore vs. the best e1000 on 1 core.]

Figure 6.4: Throughput difference between the best single core version of Eh and the best partial sidecore-emulated version of Eh

to the throughput achieved by Vh. Figure 6.6 shows the same comparison when neither NIC uses our packet dropping algorithm from section 5.7. Packet dropping is very effective for Vh on dual core setups, but is not so effective for Eh; further investigation is required to determine the reasons. The extent to which the sidecore contributes to throughput can be seen most clearly in Figure 6.6, where the throughput exceeds that of Vh for most message sizes. Figures 6.3, 6.4, 6.5 and 6.6 show that our partial sidecore implementation achieves a great improvement over the non-sidecore versions of Eh. Vh achieves throughput that is only 1.25–2.7x higher than Eh with a partial sidecore when packet dropping is enabled, and sidecore-emulated Eh achieves better throughput than Vh for most message sizes when packet dropping is disabled.

Our partial sidecore emulation eliminates only the exits due to accesses to the TDT register. There are 6 other registers, for a total of 7 exits each time the interrupt handler in Eg is executed, slowing down the performance of Eh. If we can sidecore-emulate the register pages containing these 6 registers as well, we expect to see an even larger boost in performance. We address the known challenges on the way to sidecore emulating all the registers on the data path of Eh in section 8.1.


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for the best virtio-net vs. the best e1000 sidecore.]

Figure 6.5: Throughput difference between the partial sidecore-emulated best version of Eh and the best version of Vh when running our dual core basic throughput benchmark


[Figure: throughput (Gbps), interrupts (K/sec), packet size (10KB units) and packets/batch vs. message size, absolute and normalized, for sidecore vs. virtio-net, both without packet dropping.]

Figure 6.6: Throughput difference between the best partial sidecore-emulated version of Eh and the best version of Vh when running our dual core basic throughput benchmark without packet dropping



Chapter 7

Related Work

Rizzo et al. [RLM13] also present performance optimizations to Eh. However, their focus is not data throughput, but rather the throughput of packets for high packet rate switching. They turn off TSO and do not examine the different interactions between Eh and the TCP stack in the guest, which we show to significantly affect throughput. Our work extends their work to high throughput TCP workloads. Their 3 proposed performance optimizations are: (1) Interrupt moderation, in which they were the first to implement the interrupt coalescing registers ITR and TADV in Eh. We show that the interrupt coalescing implementation of Eh is suboptimal, since it effectively ignores the value of ITR. We improve the interrupt coalescing implementation, thereby lowering the interrupt rates for throughput scenarios and achieving a significant throughput increase. (2) Send combining, whereby Eg is modified to advance TDT only after a batch of packets has been added to the TX ring. This has the double effect of both reducing the exits caused by accesses to TDT and improving locality by batching packets. This solution is not applicable in our work, since we assume an unmodified guest. Instead, we propose sending from the I/O thread, which has a similar batching effect in Eh without modifying the guest, and we show that the TDT exits can be completely eliminated by adding a sidecore. (3) A paravirtual extension of Eh, which reduces exits further. This is also not applicable under the assumption of an unmodified guest.

The sidecore technique was first introduced by Kumar et al. [KRSG07], where it was used to accelerate network interrupt management of a self-virtualizing NIC and shadow page table management of paravirtual guest machines. Since then it has been successfully used to improve the performance of an emulated IOMMU [ABYTS11] and paravirtual network devices [LA09, HGL+13, KMN+16]. In SplitX [LBYG11], the authors propose a hypervisor execution model that uses a sidecore combined with proposed but not yet implemented hardware extensions that would handle guest exits on a dedicated core in future processors. To our knowledge, we are the first to apply the sidecore technique to improve the performance of an emulated I/O device.



Chapter 8

Future Work

While the research presented in this thesis shows promising initial improvements in the performance of emulated virtual I/O devices, it is only the beginning. Many issues need to be further researched before emulated I/O devices can compete with paravirtual ones. We looked only at the very specific scenario of throughput micro-benchmarks, using a TCP connection where the guest machine is the sender, is assigned a single core, and the host is the receiver. Additional scenarios to be investigated include: (1) Latency scenarios, where the overhead of exits is more significant. (2) UDP and other protocols. For example, we've seen that ACK packets play a role in the performance of virtual devices in throughput scenarios, but UDP does not have ACK packets. (3) Scenarios where the guest is the receiver of networking traffic. (4) Scenarios where the guest is connected to another guest in the same machine, or to a remote machine. (5) Assigning more than 1 VCPU to the guest. (6) Running macro-benchmarks to see how real-world applications behave.

Another direction is finding a throughput/latency deciding heuristic to replace our current throughput-only interrupt coalescing timer setting from section 5.4. Understanding why we were unable to do without the interrupt coalescing timer of Eh might also help eliminate it altogether, improving both throughput and latency. In section 5.7 we showed that the Linux TCP stack does not fully utilize TCP segment aggregation to achieve maximum throughput in our setup. Improving the segment aggregation algorithm in the Linux kernel may be beneficial for throughput workloads in setups similar to ours. Another research direction is to eliminate the qemu global mutex to reduce the contention effects shown in section 6.2. While this may benefit paravirtual I/O devices as well, we believe it will be more beneficial to emulated ones due to the higher mutex contention caused by emulated devices, closing the gap between the two types of devices. Further improvement in the performance of emulated I/O devices can be achieved by implementing the devices as a kernel module, avoiding exits to user space, as was done in vhost-net [Tsi09], an implementation of the virtio-net network device in a kernel module.


In Chapter 6 we showed a partial sidecore implementation of Eh and the benefits it can bring. The next step would be to expand the use of the sidecore to emulate as many control registers of Eh as possible; we present the challenges we faced when trying to do so in section 8.1. Using a single sidecore to emulate a single NIC is wasteful. Previous work [HGL+13, KMN+16] has already shown that a single sidecore can be used to handle multiple paravirtual I/O devices at once. It makes sense to explore the use of a single sidecore to emulate multiple I/O devices in the same manner.

8.1 Challenges on the Way to Full Sidecore Emulation of E1000

Our partial sidecore emulation removes only the exits due to accesses to registers in page 2, of which TDT is the only register actually accessed by Eg during I/O processing. Emulating page 2 on a sidecore is fairly straightforward, since the semantics of all registers in page 2 are simple. Page 0, on the other hand, contains registers with problematic semantics, which might render it non-sidecore-emulatable. In this section we present the page 0 registers with problematic semantics. For each of these registers we present the problems arising from their semantics, as well as suggestions for possible solutions, in an effort to make page 0 sidecore emulatable. We note that these proposed solutions have not been fully explored, and are presented here as points for further research.

8.1.1 ICR

Eb's specifications [Int09] state that, with regard to ICR, "All register bits are cleared upon read". These semantics cannot be satisfied when using a sidecore, simply because the sidecore can detect changes in registers only when new values are written into them, whereas here it must detect that the register was read. Since there is no indication in the value of ICR that it was read by the guest, there is no way for the sidecore to know when to clear it.

We might circumvent this problem by assuming that Eg reads ICR only once when handling an interrupt. This assumption is reasonable for Linux guests, since ICR is used to inform Eg of the reason for the interrupt, and since ICR is expected to clear upon read, there is no reason to read it again until the next interrupt is raised. Under this assumption, Eh can hold an ICR shadow variable, which is filled instead of ICR and is copied into ICR and cleared atomically right before an interrupt is raised, so that the state of ICR is correct when the interrupt is raised, and ICR shadow is ready to collect interrupt reasons for the next interrupt. In this scheme ICR is never actually cleared, but the correct value is in ICR when it is read by the interrupt handler of the guest. To make the above solution work, there is a problem to solve. In the current


implementation of Eh, the interrupt injection is implemented as a "level" interrupt. When the IRQ level is raised by Eh to inject an interrupt, the injected interrupt causes the interrupt handler of Eg to run. The handler reads ICR to get the interrupt reason. Reading ICR causes an exit, and when Eh handles this exit it both clears ICR and lowers the IRQ level. This implementation ensures that the IRQ level is down when the execution of the interrupt handler in Eg is over. But if the ICR read doesn't cause an exit, as happens when using a sidecore, then Eh has no indication to lower the IRQ level. After the interrupt handler terminates, the Linux kernel writes to the 2 LAPIC MSRs. These writes cause exits to KVM, which checks the IRQ level before entering the guest again, and since the IRQ level is still up, KVM immediately injects another interrupt. This causes an infinite loop of interrupts in the guest.

A solution to this problem might be to change the type of interrupt used by Eh to edge-triggered. With edge-triggered interrupts, there is no need to lower the interrupt level explicitly, avoiding the infinite loop.
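Putting the two pieces of subsection 8.1.1 together (the ICR shadow and edge-triggered injection), a sketch might look like this. This is our construction under the single-read assumption above; none of these names exist in QEMU:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the ICR-shadow idea: interrupt causes accumulate in
 * icr_shadow, and right before an interrupt is raised the shadow is
 * moved into ICR and cleared in one atomic step, so the clear-on-read
 * that the sidecore cannot observe is never actually needed. */

static _Atomic uint32_t icr_shadow;   /* written by the device model */
static uint32_t icr;                  /* the register value the guest reads */

void set_interrupt_cause(uint32_t bits) {
    atomic_fetch_or(&icr_shadow, bits);
}

void raise_interrupt(void) {
    /* move-and-clear atomically: icr gets the causes, shadow restarts empty */
    icr = atomic_exchange(&icr_shadow, 0);
    /* ... inject an (edge-triggered) interrupt into the guest here ... */
}

uint32_t guest_read_icr(void) {
    return icr;                       /* no exit, and no clear required */
}
```

Note that a second guest read would still see the same value, which is harmless under the assumption that the guest reads ICR exactly once per interrupt.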

8.1.2 IMC and IMS

IMC and IMS are both used to set the mask of allowed interrupts in IMS. Eg sets bits in IMS directly, and clears bits in IMS by setting them in IMC. Eb's specifications require that setting a bit in IMC atomically clear this bit in IMS. These semantics are impossible to ensure using a sidecore, since it will take some time for the sidecore thread to notice the change in IMC.

There might be a solution to this problem if we are allowed to deviate from the specifications without affecting any reasonable driver. A reasonable driver is a driver in which the events concerning IMC and IMS take place in the following order: 1. Eh raises an interrupt. 2. Eg starts handling the interrupt, immediately disabling interrupts by writing to IMC. 3. Eg finishes handling the interrupt, after which it enables interrupts again by setting IMS. This order is reasonable since an interrupt cannot be raised after interrupts are disabled, and they will naturally be disabled after an interrupt is raised, in order to prevent further interrupts from interfering with the handling of the current one.

If the above assumption is allowed, we can use IMC and IMS in Eh in the way shown in Algorithm 8.1. This algorithm would run as part of the interrupt raising method of Eh and determine whether raising an interrupt is currently allowed, according to the values of IMC and IMS. Figure 8.1 shows a graphical representation of Algorithm 8.1 as a state machine. The state machine starts in state A. At this point interrupts are enabled and the algorithm allows Eh to raise interrupts. When an interrupt is raised, the interrupt handler in Eg is called, and the handler disables further interrupts by setting IMC, which switches the machine to state B. At state B, if Eh tries to raise an interrupt, it should fail, since IMC was set, and indeed the algorithm won't allow it. After Eg finishes handling the interrupt, it enables interrupts again by setting IMS, which switches the machine to


Algorithm 8.1 Interrupt raising algorithm for sidecore emulating IMC and IMS
1: if IMC == 0 and IMS == 0 then
2:     raise interrupt
3: else if IMC != 0 and IMS != 0 then
4:     IMC = IMS = 0
5:     raise interrupt
6: else
7:     don't raise an interrupt
8: end if

[Figure: state machine with start state A (IMC=0, IMS=0), state B (IMC=1, IMS=0) and state C (IMC=1, IMS=1); Eh raising an interrupt and Eg disabling interrupts moves A to B.]

Figure 8.1: State machine of the interrupt raising algorithm for sidecore emulating IMC and IMS

state C. At state C, if Eh tries to raise an interrupt it should succeed, since IMS was set, and indeed the algorithm will allow it, moving the machine to state A before raising the interrupt to start all over again.

Known Problems with Algorithm 8.1

Algorithm 8.1 is not a full solution but only an initial idea for future work, and several issues remain to be addressed. First, while in state A, before moving to state B, more than one interrupt can theoretically be injected into the guest. This can happen if a second interrupt is injected before the handler called for the first one sets IMC, for example, if the VCPU thread is not scheduled while the interrupts are injected. Second, the algorithm handles only the standard case. At any given point, however, the guest OS might deactivate the network interface that uses Eg, at which point it sets IMC to disable interrupts. If this happens while in state C, the write will be missed by the sidecore, since IMC already equals 1 in state C. In this situation the state machine will allow an interrupt to be raised by Eh when interrupts should be disabled.
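The second problem can be reproduced with a short sketch under the same hypothetical naming as before (these helpers are illustrative, not QEMU's actual API): a redundant IMC write while in state C leaves the shadow values unchanged, so the algorithm keeps allowing interrupts that the guest has just masked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical repro of the second known problem with Algorithm 8.1.
 * The sidecore tracks the guest's IMC/IMS writes in shadow variables;
 * an IMC write in state C (e.g., when the guest OS deactivates the
 * interface) is indistinguishable from no write at all, because the
 * imc shadow is already nonzero. */
typedef struct { uint32_t imc, ims; } shadow_t;

/* Algorithm 8.1, as in the text. */
bool try_raise(shadow_t *s)
{
    if (s->imc == 0 && s->ims == 0) return true;   /* state A */
    if (s->imc != 0 && s->ims != 0) {              /* state C -> A */
        s->imc = s->ims = 0;
        return true;
    }
    return false;                                  /* state B */
}

/* What the sidecore records when it observes a guest register write. */
void observe_imc_write(shadow_t *s) { s->imc = 1; }
void observe_ims_write(shadow_t *s) { s->ims = 1; }
```

Driving the shadows to state C and then observing a second IMC write leaves them in state C, so `try_raise` still returns true, which is exactly the failure described above.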

Chapter 9

Conclusion

Emulated I/O devices are considered inferior to paravirtual I/O devices due to their poorer performance, which the literature commonly attributes to their large number of virtualization exits. Our comparison of QEMU's emulated e1000 NIC with the paravirtual virtio-net NIC shed light on this misconception. We found several key implementation differences between e1000 and virtio-net, unrelated to virtualization, and showed that it was these differences, not the exits, that caused e1000's low throughput. By adding numerous improvements to the implementation of e1000, most of them inspired by the implementation of virtio-net, we greatly narrowed the throughput gap between e1000 and virtio-net. Whereas virtio-net achieved 20–77x better throughput than e1000 without our improvements, it achieved only 1.2–2.2x better throughput than our improved version of e1000 in single core throughput scenarios. In dual core throughput scenarios, virtio-net's advantage shrank from 25–173x over the unimproved e1000 to 1.25–2.7x over our improved version, when using a partial sidecore implementation. This relatively small remaining difference shows that the conception that emulated I/O devices are significantly inferior to paravirtual I/O devices in throughput scenarios is a misconception, and that exits do not play a large role in the throughput difference between emulated and paravirtual I/O devices.

Bibliography

[ABYTS11] Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, and Assaf Schuster. vIOMMU: Efficient IOMMU Emulation. In USENIX Annual Technical Conference (ATC), 2011.

[BDF+03] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Symposium on Operating Systems Principles (SOSP), 2003.

[BNT17] Eduard Bugnion, Jason Nieh, and Dan Tsafrir. Hardware and Software Support for Virtualization. Morgan & Claypool Publishers, 2017.

[CFH+05] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live migration of virtual machines. In Proceedings of the Conference on Symposium on Networked Systems Design & Implementation, 2005.

[ESP+09] Hideki Eiraku, Yasushi Shinjo, Calton Pu, Younggyun Koh, and Kazuhiko Kato. Fast networking with socket-outsourcing in hosted virtual machine environments. In Proceedings of the ACM Symposium on Applied Computing, 2009.

[GKR05] Dror Goldenberg, Michael Kagan, and Ran Ravid. Transparently achieving superior socket performance using zero copy socket direct protocol over 20Gb/s InfiniBand links. In IEEE International Conference on Cluster Computing, 2005.

[GOB01] Karim Ghouas, Knut Omang, and Hakon Bugge. VIA over SCI - consequences of a zero copy implementation, and comparison with VIA over Myrinet. In IEEE International Parallel and Distributed Processing Symposium, 2001.

[Gol74] Robert P. Goldberg. Survey of virtual machine research. In IEEE Computer, 1974.

[HGL+13] Nadav Har'El, Abel Gordon, Alexander Landau, Muli Ben-Yehuda, Avishay Traeger, and Razya Ladelsky. Efficient and Scalable Paravirtual I/O System. In USENIX Annual Technical Conference (ATC), 2013.

[Int09] Intel. PCI/PCI-X Family of Gigabit Ethernet Controllers Software Developer's Manual. Intel, 2009.

[KMN+16] Yossi Kuperman, Eyal Moscovici, Joel Nider, Razya Ladelsky, Abel Gordon, and Dan Tsafrir. Paravirtual Remote I/O. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.

[KPS+09] Younggyun Koh, Calton Pu, Yasushi Shinjo, Hideki Eiraku, Go Saito, and Daiyuu Nobori. Improving Virtualized Windows Network Performance by Delegating Network Processing. In IEEE International Symposium on Network Computing and Applications, 2009.

[KRS07] Sanjay Kumar, Himanshu Raj, and Karsten Schwan. Re-architecting VMMs for Multicore Systems: The Sidecore Approach. In Workshop on Interaction between Operating Systems & Computer Architecture (WIOSCA), 2007.

[KRSG07] Sanjay Kumar, Himanshu Raj, Karsten Schwan, and Ivan Ganev. Re-architecting VMMs for Multicore Systems: The Sidecore Approach. In Proc. of the 2007 Workshop on the Interaction between Operating Systems and Computer Architecture, 2007.

[LA09] Jiuxing Liu and Bulent Abali. Virtualization polling engine (vpe): Using dedicated cpu cores to accelerate i/o virtualization. In Proceedings of the 23rd International Conference on Supercomputing, 2009.

[LBYG11] Alexander Landau, Muli Ben-Yehuda, and Abel Gordon. SplitX: Split guest/hypervisor execution on multi-core. In Workshop on I/O Virtualization, 2011.

[LH02] Rick Lindsley and Dave Hansen. BKL: One Lock to Bind Them All. In Ottawa Linux Symposium, 2002.

[LSM08] D. J. Leith, R.N. Shorten, and G. McCullagh. Experimental evaluation of Cubic-TCP. In The 6th International Workshop on Protocols for Fast Long-Distance Networks (PFLDnet 2008), 2008.

[MAC+08] Dutch T. Meyer, Gitika Aggarwal, Brendan Cully, Geoffrey Lefebvre, Michael J. Feeley, Norman C. Hutchinson, and Andrew Warfield. Parallax: Virtual disks for virtual machines. In Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems, 2008.

[Mol07] Ingo Molnar. KVM/net, paravirtual network device. http://kvm.vger.kernel.narkive.com/hNALI5yI/announce-kvm-net-paravirtual-network-device, January 2007.

[NLH05] Michael Nelson, Beng-Hong Lim, and Greg Hutchins. Fast transparent migration for virtual machines. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, 2005.

[PACM11] Vern Paxson, Mark Allman, H.K. Jerry Chu, and Matt Sargent. Computing TCP's Retransmission Timer. RFC 6298, RFC Editor, June 2011.

[PG74] Gerald J. Popek and Robert P. Goldberg. Formal requirements for virtualizable third generation architectures. Communications of the ACM (CACM), 1974.

[Poe09] Lennart Poettering. Measuring lock contention. http://0pointer.de/blog/projects/mutrace.html, 2009.

[PS12] Darko Petrovic and Andre Schiper. Implementing virtual machine replication: A case study using Xen and KVM. In IEEE International Conference on Advanced Information Networking and Applications, 2012.

[QEM17] QEMU. QEMU networking documentation. http://wiki.qemu.org/Documentation/Networking, 2017.

[RLM13] Luigi Rizzo, Giuseppe Lettieri, and Vincenzo Maffione. Speeding up packet I/O in virtual machines. In ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2013.

[Rus08] Rusty Russell. virtio: towards a de-facto standard for virtual i/o devices. In ACM SIGOPS Operating Systems Review (OSR), 2008.

[RW11] Mendel Rosenblum and Carl Waldspurger. I/O Virtualization. ACM Queue, 9(11):30–39, Nov 2011.

[Tsi09] Michael S. Tsirkin. vhost net: a kernel-level virtio server. https://lwn.net/Articles/346267/, 2009.

[VMS+16] Sander Vrijders, Vincenzo Maffione, Dimitri Staessens, Francesco Salvestrini, Matteo Biancani, Eduard Grasa, Didier Colle, Mario Pickavet, Jason Barron, John Day, and Lou Chitkushev. Reducing the complexity of virtual machine networking. IEEE Communications Magazine, 2016.

[Vmw09] VMware. Performance evaluation of VMXNET3 virtual network device. http://www.vmware.com/pdf/vsp_4_vmxnet3_perf.pdf, 2009.


The Real Difference Between Emulation and Paravirtualization of High-Throughput I/O Devices

Arthur Kiyanovski

The Real Difference Between Emulation and Paravirtualization of High-Throughput I/O Devices

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Arthur Kiyanovski

Submitted to the Senate of the Technion - Israel Institute of Technology
Av 5777, Haifa, August 2017

This research was carried out under the supervision of Prof. Dan Tsafrir, in the Faculty of Computer Science.

Acknowledgements

This work is dedicated to my late grandfather, Ben-Zion Kiyanovski, who passed away while the research presented here was being carried out. My grandfather fought courageously against the Nazis in World War II. Without people like him, none of us would be here today. I thank my dear wife Assya for her infinite support, without which I would not have been able to finish this research. I thank my advisor, Prof. Dan Tsafrir, for his help and guidance along the way. I thank the Israel Science Foundation (grant No. 605/12) for its generous financial support of my studies.

The generous financial help of the Technion during my studies is gratefully acknowledged.

Abstract

Virtualization has become increasingly popular in recent years, as more and more computing moves to the cloud computing model. Virtual machines use virtual I/O devices to perform their I/O operations. The most common virtual I/O devices today are aware, or paravirtual, I/O devices, which are designed specifically for virtual environments and do not implement the interface of any physical device. Paravirtual I/O devices combine good performance with the ability to interpose between the virtual machine and the physical hardware. This interposition capability is needed for many uses, such as live migration and the consolidation and aggregation of I/O devices, among others. Nevertheless, paravirtual I/O devices have drawbacks, both for the user and for hypervisor developers. The user must install a dedicated driver for the paravirtual device whenever switching hypervisors, since today most hypervisors support paravirtual devices that differ from one another. And hypervisor developers must implement and maintain drivers for all popular operating systems. These drawbacks do not exist in unaware, or emulated, I/O devices, which implement an interface identical to that of some physical I/O device. Like paravirtual devices, emulated devices provide interposition between the virtual machine and the physical hardware, but because they implement the interface of an existing physical device, the user need not install a new driver when switching hypervisors, since the drivers installed in the operating system for the physical device also work with the emulated virtual device. For the same reason, hypervisor developers need not develop dedicated drivers for all guest operating systems. Despite these advantages of emulated I/O devices, they are rarely used in deployments, because the performance they provide is significantly lower than that of paravirtual I/O devices.

The prevailing conception is that the real reason for the performance difference between emulated and paravirtual devices is the large number of virtualization exits caused while running an emulated device, compared with the small number of exits caused while running a paravirtual device. To test this prevailing conception for throughput workloads, we present a model that attempts to estimate the maximal throughput achievable by the emulated e1000 network device compared with the paravirtual virtio-net device, under the assumption that exits are the only difference between the two devices. We used network devices because, of all I/O device types, network devices handle the highest throughput, and we wanted to see the most extreme results possible. Our model predicts that virtio-net achieves 1.13x the throughput of e1000, but our measurements show that virtio-net achieves 20x the throughput of e1000. These results suggest that, contrary to the prevailing assumption in the literature, the number of exits is not the cause of the large difference between emulated and paravirtual I/O devices. We continued to investigate the differences between e1000 and virtio-net, in an attempt to discover the causes of the large gap between the throughput the two devices achieve. In our research we chose a configuration in which the virtual machine communicates directly with the physical machine it runs on, allocating a single CPU to the virtual machine. This configuration does not include a physical network device, and all network traffic passes between the virtual and physical machines through the kernel of the physical machine. We chose this configuration because it is the simplest one we could think of, so as to minimize configuration-induced complications and concentrate on the real differences between the virtual devices. We present the differences we found between the two devices, unrelated to virtualization, that harm e1000's performance. For each such difference we present an improvement to e1000 that eliminates, as far as we managed, the harm to e1000's performance. The improvements we proposed significantly increase the throughput of e1000. We managed to reduce the throughput gap between virtio-net and e1000 from 20x to 1.2x for large messages. These results show that e1000 can achieve throughput much closer to that of virtio-net than the prevailing conception predicts.

More generally, the results indicate that the prevailing conception that paravirtual devices deliver significantly higher throughput than emulated devices is incorrect, as is the prevailing conception that the large number of exits in emulated devices is the reason for their performance being significantly inferior to that of paravirtual devices in high-throughput scenarios. Later in this work we extended the configuration by allocating 2 CPUs to the hypervisor for running the virtual machine. This extension allowed us to use the sidecore approach. The sidecore approach, previously proven effective in reducing virtualization exits for paravirtual devices, had never been used for a full implementation of an emulated I/O device. In this work, although we eliminated only some of e1000's exits using a sidecore, we obtained a significant throughput improvement. Our partial sidecore implementation, together with the other improvements we found in the single-CPU configuration, reduced the throughput gap between virtio-net and e1000 from 25x to 1.25x for large messages. These results again showed that emulated devices can reach throughput close to that of paravirtual devices, contrary to the prevailing conception.