SimuBoost: Scalable Parallelization of Functional System Simulation

Dissertation approved by the KIT Department of Informatics
of the Karlsruhe Institute of Technology (KIT)
for the attainment of the academic degree of
Doctor of Engineering (Doktor der Ingenieurwissenschaften)

by Dipl.-Inform. Marc Rittinghaus from Iserlohn

Date of the oral examination: July 19, 2019
First reviewer: Prof. Dr. Frank Bellosa, Karlsruher Institut für Technologie
Second reviewer: Prof. Dr. Hans P. Reiser, Universität Passau

KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association – www.kit.edu

Abstract

Gathering detailed run-time information such as memory access traces in operating system and security research often involves functional full system simulation (FFSS). The simulator runs the workload of interest in a virtual machine (VM), gradually interpreting or translating instructions so that they operate on the state of the VM and allow for comprehensive instrumentation.

While functional full system simulation is a powerful tool, a severe limitation is its immense slowdown. For QEMU, we have measured average slowdowns of 30x and 60x for plain simulation and tracing of memory accesses, respectively. Simulators offering more advanced instrumentation capabilities can even be an order of magnitude slower. This quickly renders functional simulation impractical for long-running, networked, or interactive workloads. Furthermore, the slowdown creates unrealistic timing behavior whenever activities external to the virtual machine (e.g., I/O) are involved. In this thesis, we present SimuBoost, a method for drastically accelerating functional full system simulation. SimuBoost runs the workload in a fast and interactive hardware-assisted virtual machine while periodically taking checkpoints. These checkpoints then serve as starting points for simulations, enabling each interval to be simulated and analyzed simultaneously, with one job per interval. Heterogeneous deterministic replay guarantees that the simulations repeat the exact same execution as in the hardware-assisted run, including interactions and recorded timing.

Our prototype is able to significantly reduce the run time of functional full system simulation while providing full interactivity. Simulating an entire kernel build completes in just 16% more time than needed to run the same workload in a regular hardware-assisted virtual machine. SimuBoost is able to maintain this performance even with full instrumentation for memory tracing.

This thesis represents the first project to apply the concept of partitioning and parallelization of execution time to interactive full system virtualization in a manner that allows for immediate parallel functional simulation. We complement the practical implementation with a performance model to formally describe the properties of the acceleration method and predict speedups. In contrast to previous work, SimuBoost places a strong focus on scalability beyond the limits of a single physical machine. It therefore makes heavy use of virtual machine checkpointing technology. In this context, we present two novel methods for efficiently and effectively reducing the size of periodic checkpoints.

Contents

1 Introduction
  1.1 Contributions
  1.2 Scope of This Thesis
  1.3 Underlying Publications and Theses
  1.4 Organization

2 Background
  2.1 Virtualization
    2.1.1 Virtual Machines
    2.1.2 Conclusion and Terms
  2.2 Virtualization Techniques
    2.2.1 Processor Virtualization
    2.2.2 Memory Virtualization
    2.2.3 I/O Virtualization
    2.2.4 Conclusion and Terms
    2.2.5 Case Study: QEMU/KVM
  2.3 Checkpointing
    2.3.1 Pre- and Post-Copy
    2.3.2 Data Exclusion
    2.3.3 Data Deduplication
    2.3.4 Data Compression
    2.3.5 Other Techniques
    2.3.6 Conclusion
  2.4 Deterministic Replay
    2.4.1 Homogeneous Replay
    2.4.2 Heterogeneous Replay
    2.4.3 Multiprocessor Replay
    2.4.4 Conclusion

3 Functional Full System Simulation
  3.1 Assessment of Simulation Speed
  3.2 Acceleration Techniques
    3.2.1 Optimizing the Execution Engine
    3.2.2 Reducing the Observation Space
    3.2.3 Parallelizing Multicore Simulations
    3.2.4 Parallelizing the Simulation Time
  3.3 Conclusion: Limitations of the State of the Art

4 SimuBoost
  4.1 Goals
  4.2 Approach
    4.2.1 State Deviation
  4.3 Comparison with Related Work
  4.4 Conclusion

5 Performance Model
  5.1 Optimal Setup
    5.1.1 Parallel Simulation Time and Speedup
    5.1.2 Optimal Interval Length
    5.1.3 Optimal Number of Nodes and Efficiency
  5.2 Constrained Setup
    5.2.1 Parallel Simulation Time and Speedup
    5.2.2 Optimal Interval Length
  5.3 Conclusion

6 Continuous Checkpointing
  6.1 Checkpoint Creation
    6.1.1 Incremental Checkpointing
    6.1.2 Dirty Logging Techniques
    6.1.3 Dirty Logging Granularity
    6.1.4 Design and Implementation
  6.2 Checkpoint Storage
    6.2.1 SimuBoost Extension for Simutrace
  6.3 Checkpoint Loading
    6.3.1 Sparse Checkpoints
  6.4 Conclusion

7 Checkpoint Distribution
  7.1 Pulling versus Pushing
  7.2 Checkpoint Data Reduction
    7.2.1 Data Deduplication
    7.2.2 Delta Compression
    7.2.3 Device State Compression
  7.3 Multicast Checkpoint Distribution
  7.4 Conclusion

8 Heterogeneous Deterministic Replay
  8.1 General Architecture
    8.1.1 Landmark
    8.1.2 Replay Boundary
    8.1.3 Evaluation
  8.2 Simulation Refining
    8.2.1 Status Flag Computation
    8.2.2 Read-Write Instructions
    8.2.3 MMU-induced Non-Determinism
    8.2.4 Atomic Instructions (ARM only)
    8.2.5 Miscellaneous
  8.3 Conclusion

9 Evaluation
  9.1 Evaluation Setup
    9.1.1 Hardware and Software Configuration
    9.1.2 Benchmark Scenarios
  9.2 Speedup
  9.3 Scalability and Efficiency
  9.4 Performance Model
  9.5 Conclusion

10 Conclusion
  10.1 Limitations and Future Work

A Deutsche Zusammenfassung (German Summary)

B Additional Figures and Data

Lists
  Tables
  Figures
Bibliography

Chapter 1

Introduction

A common approach to gathering detailed run-time information about applications is to run the software of interest in a functional, that is instruction-level, simulator such as Valgrind [190] or Pin [162]. In contrast to regular execution, where instructions run natively on the CPU, a functional simulator executes applications in a virtual machine (VM). It gradually interprets or translates instructions so that they operate on the state of the VM instead of the physical machine. This way, the execution becomes completely transparent and can be freely instrumented.

Whenever functional simulation is involved, operating-system-centric research usually depends on functional full system simulation (FFSS), which includes system libraries, services, drivers, and privileged kernel-mode components. FFSS has been demonstrated to be very effective for OS debugging [137], security analyses [287], and collecting detailed execution traces [286]. Google engineers identified over 20 kernel security vulnerabilities in Windows 8 by analyzing the memory access patterns at the system call interface [133]. In a similar analysis, we revealed a critical code execution vulnerability in the [279]. However, full system simulation is not only an important tool in operating system research. It has been shown that even for application-level studies, simulating the application without the underlying operating system can be highly inaccurate [53, 135, 169, 292].

While functional full system simulation has proven to be very powerful, a well-known limitation is its immense slowdown. Depending on the required level of detail and the degree of instrumentation, running a workload with FFSS is up to multiple orders of magnitude slower compared to native execution. We have measured average slowdowns of 30x and 60x with QEMU [38] for plain simulation and tracing of memory accesses, respectively. While a kernel build takes less than 13 mins natively, running the exact same execution takes around 5 hours with QEMU, and 10 hours when tracing memory accesses. A single tracing run of povray even takes almost 2 days, whereas a non-instrumented execution

completes in less than 19 mins natively. With Simics [163], a full system simulator offering more advanced instrumentation capabilities, we measured a slowdown of up to 1000x when hooks for memory tracing are installed [219]. Similar slowdowns for functional simulation have been reported by other researchers [58, 135, 164, 200, 286]. In practice, this slowdown creates severe obstacles for comprehensive use of functional full system simulation:

Interactivity: Scenarios that should capture interactivity with a human user or an external network device are not feasible. A single keystroke can quickly take from multiple seconds up to minutes until being fully processed, making human-user interactions cumbersome and unnatural. Network protocols such as TCP, in turn, react to the slowdown with throttling and timeouts.

Accuracy of Results: Since the simulation considerably slows down the simulated applications and operating system, activities dependent on events external to the virtual machine such as I/O operations appear to complete faster – a phenomenon called time dilation [169]. This distorts measurements and produces unrealistic execution behavior.

Coverage: Evaluating a test scenario in full length can take considerable time, forcing researchers to reduce coverage. The authors of the Google study summarize that the slowdown was the primary restrictive factor, which limited the coverage of their analysis to the system boot phase and short desktop usage [133] – potentially missing further vulnerabilities.

A common method to cope with the long run time of full system simulations is to adapt or shorten the workload so that the overall simulation time remains in reasonable bounds [140]. However, developing reduced but representative workloads is a complex and time-consuming task, and depending on the use case (e.g., security analysis), the method is not applicable. It is thus desirable to keep the original workload intact and to accelerate the simulation instead.

The speed of a functional simulation is mostly determined by the ratio of executed physical instructions per simulated instruction. To improve this ratio, simulators have been vigorously optimized [67, 95, 119, 120, 281]. A common denominator of all these techniques, however, is that improvements at the level of the generated code and execution flow do not provide the required leap and generally stay below the 5x speedup mark.

A prevailing practice to accelerate simulation is thus to limit the simulation to samples [229, 237, 284]. The gathered information can then be extrapolated to draw conclusions for the whole workload. At the same time, this represents a decisive disadvantage because sampling can only be used to estimate metrics that can be extrapolated from samples (e.g., instructions per cycle). However, it is less suited to observe the actual system execution as required in security research, malware analysis, or debugging. It does not permit the tracking of individual events such as pairs of memory allocations and deallocations [52] or specific memory access patterns [133]. Instead, these applications require continuous simulation.

Figure 1.1: The workload is executed with fast hardware-assisted virtualization. Checkpoints at the interval boundaries serve as starting points for parallel simulations. The parallel simulation completes much faster than a serial simulation.

A promising approach is the partitioning and parallelization of simulation time [111]. The idea is to split the simulation into disjoint time intervals that can be processed simultaneously. The method proved to possess high scalability with speedups near the number of employed CPUs [101, 192, 211, 229, 267]. However, current solutions based on this concept are not applicable to continuous functional simulation, are restricted to single user-mode processes, or lack interactivity. Moreover, the dependency on process forking for spawning parallel simulations restricts most implementations to the degree of parallelism that can be supplied by a single host1, depriving the concept of its primary strength.

In this thesis, we present SimuBoost [219], an acceleration approach based on partitioning and parallelization of simulation time that drastically reduces the slowdown of functional full system simulation (see Figure 1.1). The core idea is to run the workload in a virtual machine using fast hardware-assisted virtualization (e.g., VT-x [125]). This virtual machine offers full interactivity. At regular intervals2, the hypervisor takes snapshots of the VM (i.e., memory contents, device states, etc.).

1 For instance, pFSA [229] forks a hardware-assisted virtual machine and switches to simulation in the newly created child. Performing a subsequent process migration would in turn require checkpointing the process, which is equivalent to checkpointing the VM.
2 Between hundreds of milliseconds and multiple seconds.

The checkpoints then serve as starting points for simulations, enabling each interval to be simulated and analyzed simultaneously, with one job per interval. By transferring checkpoints over the network and using multiple nodes (i.e., CPU cores and hosts), a heavily parallelized simulation of the target workload can be achieved.

Functional full system simulation can be built to always produce identical runs. Hardware-assisted virtualization, in contrast, is subject to non-deterministic input such as erratic I/O completion timing. SimuBoost records this non-determinism and uses heterogeneous deterministic replay [87,287] to accurately reproduce the execution in the simulations, including realistic timing behavior, as well as user and network input in interactive workloads.

Both checkpointing and recording of non-determinism need to be geared toward low run-time overhead to (1) retain the execution speed difference that drives the parallelization, (2) keep perturbations on the examined workload as small as possible, and (3) preserve seamless interactivity. The simulation nodes, in turn, need to quickly receive and load the checkpoints. Furthermore, the resource consumption (e.g., memory) of simulations should be kept low so as to permit a maximum degree of parallelism on each host. In the course of this thesis, we present and evaluate techniques to cope with these challenges.

Our prototype for QEMU/KVM [38,139] is able to drastically reduce the run time of functional full system simulation while maintaining full interactivity. Simulating an entire kernel build using SimuBoost completes in about 15 mins (in contrast to 5 hours with serial simulation), which is only 16% more time compared to a native run with hardware-assisted virtualization (without SimuBoost). SimuBoost is able to deliver this performance even with memory tracing, completing the instrumented simulation in only 19% more time compared to the same baseline.
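As a rough plausibility check of these numbers (taking the sub-13-minute native kernel build from above as an approximation of the hardware-assisted baseline): 13 min x 1.16 ≈ 15 min for the parallel simulation, and relative to the roughly 5-hour (300 min) serial simulation this amounts to a speedup of about 300 / 15 = 20x.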

1.1 Contributions

This thesis makes the following main contributions:

• SimuBoost is the first project to apply the concept of partitioning and parallelization of execution time to interactive full system virtualization in a manner that allows for immediate parallel functional simulation. Our evaluation examines the applicability and effectiveness of the approach to diverse types of workloads and determines its scalability with regard to increasing simulation slowdown from instrumentation.

• This thesis presents the first formalization of the partitioning and parallelization process with a performance model to predict parallel simulation times and estimate optimal interval lengths for given workloads and simulation cluster sizes.

• With continuous checkpointing constituting a central technology of SimuBoost, this thesis includes a thorough quantitative comparison of widely employed checkpointing and page dirty logging techniques. With pre-scan, we devised a novel method to track page modifications in virtual machines that combines the low downtime of conventional page-protection-based dirty logging with the low overhead of page table scanning.

• In contrast to previous work, SimuBoost places a strong focus on scalability beyond the limits of a single physical machine. This thesis therefore gives a detailed analysis of checkpoint compression methods to allow for live distribution of checkpoints over commodity network infrastructure such as Gigabit Ethernet. In this context, we present a new method for compression of state maps called SimuBoost device state compression (SDS) that achieves 61% higher compression than LZ4 in 21% less time.

• As part of this thesis, we developed the first publicly available full system heterogeneous deterministic replay engine for the platform3. Moreover, this is the first work in the research literature to describe MMU-induced non-determinism, the challenges of its detection, and possible solutions.

1.2 Scope of This Thesis

This thesis serves to evaluate the general applicability of partitioning and parallelization to interactive continuous functional full system simulation with a focus on scalability beyond single machines. That means we concentrate on the conceptual properties of this acceleration method (e.g., its speedup characteristics) and the technical challenges it entails in the fields of checkpoint creation and distribution as well as deterministic replay.

We limit the discussion to single-core virtual machines, thereby building a foundation for future research on accelerating multicore VMs. As we are well aware of the fact that deterministic replay comes with increased overhead in such environments, we provide a detailed overview of the current state of the art in multiprocessor replay in § 2.4.3.

Although SimuBoost is perfectly suitable for malware analysis and we consider this an important use case, we also do not dive into the challenges of preventing malware from breaking replays, for example, by exploiting inaccuracies or bugs in the simulator. This is merely a question of accurate engineering and a deliberate definition of the replay boundary rather than general feasibility as demonstrated by Yan et al. [287].

3 https://github.com/simutrace/

1.3 Underlying Publications and Theses

Among others, we advised the following students when they wrote their study, diploma, bachelor's, or master's theses at the Operating Systems Group. Their evaluations provided valuable insight and directed the final design of SimuBoost. They contributed to this thesis, in chronological order:

• Nikolai Baudis implemented the first prototype for incremental checkpointing as part of his bachelor's thesis [37]. He also investigated data duplication in incremental checkpoints and explored MongoDB [10] as a checkpoint storage solution.

• Bastian Eicher examined alternatives to MongoDB in his master’s thesis [89] in order to reduce the downtime for stop-and-copy checkpoints. This led to the use of plain files. This design eventually made its way into our final prototype – although entirely revised in implementation and format. Bastian Eicher also developed a first version of the performance model for constrained hardware setups and expedited checkpoint distribution by using a pulling approach that uses Simutrace’s built-in network support.

• Jan Ruh expanded upon the analysis of recurring memory pages and disk sectors in incremental checkpoints in his bachelor's thesis [224] by identifying the semantic background of duplicates.

• Nico Böhr supplied a first prototype of SimuBoost's copy-on-write checkpointing in his bachelor's thesis [42] so as to further reduce the downtime of checkpoints and preserve interactivity.

• Janis Schötterl-Glausch investigated in his bachelor's thesis [234] the suitability of Intel Page Modification Logging (PML) [125] for alleviating the run-time overhead incurred by the detection of modified pages in incremental checkpointing.

• Simon Veith implemented a heterogeneous deterministic replay engine for the ARMv7 architecture in his master’s thesis [262], thereby exploring the applicability of SimuBoost to other architectures than x86.

• Thomas Schmidt supported this thesis with his bachelor’s thesis [233] by exploring alternatives to functional full system simulation for memory tracing. He thereby reinforced the motivation for simulation as the method of choice for detailed run-time information retrieval.

• Andreas Pusch worked in his master's thesis [209] on the efficient distribution of checkpoints in a simulation cluster. He tested distributed file systems and developed the multicast-based checkpoint transfer that we envision for SimuBoost. His work has yet to be integrated into our prototype.

• Michael Zangl evaluated in his master’s thesis [293] the applicability of chunk-based recording [215] to heterogeneous deterministic replay for multiprocessor virtual machines. His work shows a possible path for future multiprocessor support in SimuBoost.

• Johannes Werner has made an assessment of virtual machine working sets in his bachelor’s thesis [277] to supply a quantitative basis for the development of sparse checkpoints.

• Jan Ruh subsequently developed a working implementation of sparse checkpointing in his master's thesis [225]. This technology presents a decisive factor in the scalability of SimuBoost as it allows for a much higher density of simulations per physical host.

• Benedikt Morbach studied the origin of MMU-induced non-determinism in his master’s thesis [183]. He discusses different ways to cope with this generally overlooked source of non-determinism and demonstrates a working recording and replay system for it.

• Janis Schötterl-Glausch extended his work on efficient page modification tracking in his master’s thesis [235]. He evaluated the potential of page table scanning as an alternative to setting page protections. He, moreover, implemented the pre-scan method. A majority of his code made its way into our current prototype.

The following paper (identical in name) has been published before the submission of this thesis, introducing the core idea behind SimuBoost:

• M. Rittinghaus, K. Miller, M. Hillenbrand, and F. Bellosa. SimuBoost: Scalable parallelization of functional system simulation. In Proceedings of the 11th International Workshop on Dynamic Analysis, WODA'13, Houston, Texas, USA, Mar. 2013.

In addition, we successively presented preliminary results at the GI Fachgruppentreffen Betriebssysteme in 2013, 2014 (Simutrace [218]), 2016, and 2018.

1.4 Organization

The remainder of this thesis is organized as follows:

Chapter 2 – Background provides a detailed overview of current virtualization technologies and compares their advantages and disadvantages with respect to dynamic analysis in the field of operating system research. The chapter also introduces important research related to checkpointing and deterministic replay.

Chapter 3 – Functional Full System Simulation then sheds light on the possibilities that functional full system simulation offers in the retrieval of detailed run-time information on programs and the operating system. It demonstrates how the slowdown of this technology poses a major obstacle to its comprehensive use. Afterward, the chapter gives a thorough introduction to existing methods for acceleration and explains their shortcomings for operating system research.

Chapter 4 – SimuBoost formulates a set of goals for a suitable acceleration technique, elaborates on the idea behind SimuBoost, and finally highlights differences to existing work based on partitioning and parallelization of simulation time.

Chapter 5 – Performance Model describes a formalization of the parallelization process and deduces a set of equations for predicting parallel simulation times and optimal interval lengths.

Chapter 6 – Continuous Checkpointing contrasts techniques for efficient continuous checkpoint creation, storage, and loading and presents detailed quantitative results.

Chapter 7 – Checkpoint Distribution deals with the distribution of checkpoints in a simulation network. With the goal of being able to use SimuBoost with commodity network infrastructure, this chapter puts a focus on strong checkpoint compression.

Chapter 8 – Heterogeneous Deterministic Replay concludes the design overview with a discussion of the heterogeneous deterministic replay component. The chapter especially elaborates on the refinements we have made to the binary translator in QEMU to faithfully reproduce actual hardware behavior.

Chapter 9 – Evaluation puts all building blocks together and gives a thorough evaluation of the overall SimuBoost concept, examining its speedup and scalability characteristics. In this chapter, we also review the applicability of the performance model using measurements of actual parallel simulations.

Chapter 10 – Conclusion summarizes the key concepts and findings of this thesis and closes with an outlook on future research directions.

Chapter 2

Background

This thesis presents a novel approach to full system analysis that equips researchers and developers with a tool to inspect arbitrary workloads down to instruction-level detail while maintaining interactivity and fidelity. A two-stage analysis workflow leverages a combination of (1) fast hardware-assisted virtualization for realistic workload execution and (2) heterogeneous deterministic replay in a full system simulation for detailed analysis. In contrast to previous work, our approach is streamlined for functional system analysis, rather than microarchitectural analysis. Therefore, it lends itself to holistic operating system and software-level security research. Whereas current tools in this area suffer from the immense slowdown induced by functional system simulation, we present a lightweight checkpointing-based method to efficiently parallelize the simulation, thereby drastically increasing simulation speed.

In this chapter, we introduce terms, techniques, and principles that this thesis is based on. Sections 2.1 and 2.2 supply an introduction to the concept of virtualization and virtual machines, which are a key technology to our approach. A focus is placed on virtual machines capable of running full operating systems with the help of dynamic binary translation and hardware-assisted virtualization. Sections 2.3 and 2.4 expand on the topic of virtualization by covering relevant technologies and literature in the area of virtual machine checkpointing and deterministic execution replay.

2.1 Virtualization

Virtualization describes the process of representing some form of resource through a virtual one. This introduces a level of indirection which can have diverse benefits such as increased control, flexibility, or isolation. The underlying resource can be a physical one or, in the case of nested virtualization, itself be already virtual.

A prominent example is virtual memory, which was first available in the early 1960s [94] and today is the de facto standard in desktop computers, servers, and even mobile devices [32, 239]. Prior to the adoption of virtual memory, physical memory was directly exposed to programs running on the system. As a process can potentially access arbitrary memory in this configuration, it may inadvertently or maliciously address memory owned by another process or even the operating system (OS). This severely complicates memory management, especially with multiprogramming, and at the same time puts the system's stability at risk. In contrast, virtual memory systems add an additional indirection for memory addresses. Subsequently, all addresses used in a process are interpreted as virtual addresses and are translated to physical ones at run time by an address mapping device – the memory management unit (MMU). It is the responsibility of the operating system to establish the mappings between virtual and physical addresses. Typically, each process gets its own set of mappings, thereby creating distinct virtual address spaces. Contrary to direct physical memory access, this makes providing and sharing physical memory an explicit decision by the operating system and greatly improves stability and security. In addition, memory management is simplified because virtually contiguous regions do not need to be contiguous in physical space – a feature called artificial contiguity [214]. Partitioning the RAM for use by multiple processes thus becomes much easier.

Virtualization can also be an opportunity to incorporate new features and extend the interface provided by the raw underlying resource. For instance, the MMU invokes the operating system with a fault if it is unable to translate a virtual address [254, p. 234f]. The OS can leverage this mechanism for implementing memory overcommitment. This creates the illusion for a process to own more physical memory than actually available in the system. To this end, the operating system dynamically swaps data between secondary storage (e.g., a hard disk) and RAM, adjusting mappings as needed. If a process accesses virtual memory that is currently not backed by RAM, the fault gives the operating system the opportunity to trigger swapping. Since execution can generally continue afterward, these operations are transparent to a process.

More formally, the principle of virtualization can be expressed by defining a set of virtual resources $V_t^p = \{v_0, v_1, \ldots, v_n\}$ at time $t$ that are mapped to the physical resources $R_t = \{r_0, r_1, \ldots, r_m\}$, where $p \in \{p_0, p_1, \ldots, p_q\}$ denotes a particular instance of $V$ [77, 102]. For each moment of time $t$ and each instance $p$ of $V$, there exists a function [102]:

$$f : V_t^p \to R_t \cup \{\phi\}$$

such that if $v \in V_t^p$ and $r \in R_t$, then

$$f(v) = \begin{cases} r & \text{if } r \text{ is the physical resource for } v \\ \phi & \text{if } v \text{ does not map to a physical resource} \end{cases}$$

The value $f(v) = \phi$ causes a fault that invokes a handling procedure in some form of privileged execution context which controls $f$. In a virtual memory system, $V^p$ holds all addresses of the virtual address space (i.e., process) $p$, $R$ comprises the set of available physical addresses, and $f(v) = \phi$ denotes an exception, where the CPU transfers control to the fault handler in the OS. For virtual memory, typically $n \cdot q \gg m$ applies, that is, the number of virtual pages $n$ over all $q$ processes is much larger than the number of available physical pages $m$. With nested virtualization, $V$ and $R$ can be interpreted as two adjacent levels of virtual resources, where the physical resources are positioned in layer 0 and $f$ maps virtual resources from layer $k + 1$ to layer $k$ [102].

When considering two consecutive time steps (see Figure 2.1), we can characterize virtualization as the construction of the isomorphism $h \circ f(V_t) = f \circ g(V_t)$, where for a sequence of operations $g$ that modifies $V$ there is a corresponding sequence of operations $h$ that performs an equivalent modification to $R$ [206, 240]. In practice, this means we must implement $h \circ f(V_t)$ to create the illusion of $f \circ g(V_t)$.

In a virtual memory system, possible operations $g$ on $V$ are reads and writes. So in order to implement virtual memory, we must be able to perform equivalent reads and writes on the corresponding physical addresses, thereby constructing $h$. Today's CPUs fully integrate this functionality. For each virtual memory access, the CPU first inquires the physical address. The MMU handles the address translation, thus supplying $f$. The CPU then realizes $h$ by simply performing the requested operations on the physical memory.

Figure 2.1: For each state modification $g$ in $V$, there must be an equivalent modification $h$ in $R$ [240, p. 4].

2.1.1 Virtual Machines

We can generalize V and R to contain not only one particular type of resource such as memory but, for instance, all resources comprising a modern computer system. This includes processors, secondary storage, and connected devices (e.g.,

network adapters). V then defines a virtual machine (VM) and R represents the physical machine. We refer to V as the guest, whereas R is the host. The situation $f(v) = \phi$ is called a VM fault. The privileged code that implements f and h, and which is responsible for catching faults, is the virtual machine monitor (VMM).

Compared to a virtual memory system, a VM imposes considerably higher complexity. However, the same formal definition holds. For building a faithful reproduction of a physical machine, the VMM must provide the following three major components:

Virtual State Description (V) The set V of virtual resources must describe a complete model of the guest's state. This may include the register contents of virtual CPUs, status information of virtual interrupt controllers, a screen buffer for a virtual display adapter, and much more. The model can be restricted to the purely functional level, but may also include microarchitectural features, depending on the virtualization purpose.

Resource Mapping (f) Virtual resources must always be backed by physical resources at some point. To that end, the VMM can partition spatial resources as in the case of main memory and secondary storage, or use multiplexing via time-sharing for resources such as CPU time. Sometimes virtual resources cannot be simply mapped 1:1 to host resources if the architectures of V and R differ. In these cases, the VMM has to explicitly model the virtual resource in R. This additionally has to be done in a way that is compatible with R. For instance, a 64-bit guest CPU register might need to be split into two 32-bit words on a 32-bit host.

Host State Transfer Function (h) While V and f are sufficient to describe a virtual machine at a fixed point in time, execution requires a mechanism that modifies the host in a way that is equivalent to what is dictated by the operations in the guest. For a CPU, this task can be accomplished for example by some form of emulation. The VMM interprets each processor instruction in the guest and modifies the host's representation of the guest's CPU accordingly. In addition, the VMM has to advance the virtual instruction pointer as well as virtual timers and reflect actions on virtual I/O devices.

Tightly connected to the types of virtual resources comprising a virtual machine is the architectural layer at which the virtual machine is constructed. Considering a typical computer system architecture (see Figure 2.2), we can identify three primary layers [240, p. 6ff]:

The application programming interface (API) specifies an expected behavior and methods of communication with a piece of software that implements this behavior. Usually, a developer is confronted with an API as a set of source-code level functions accompanied by a set of well-defined parameters and return values. A prominent example in the field of operating systems is POSIX.1 [124].

Figure 2.2: Applications use a well-defined API and ABI to interface with libraries and the operating system. They may only use unprivileged instructions defined by the user ISA, while the OS also has access to privileged operations specified by the system ISA [240, p. 7].

The application binary interface (ABI), in contrast, specifies the exact machine-level format of data structures (e.g., field alignment) and a calling convention. The latter determines into which registers and stack locations the caller and callee should place parameters and return values, respectively. The ABI thus depends on the characteristics of the hardware on which the software should run.

The instruction set architecture (ISA) builds the boundary between software and hardware. It defines the basic interface of the hardware, including available machine registers, processor instructions, the I/O model as well as memory alignment and consistency constraints. The ISA can be split into an unprivileged interface – the user ISA – available to all processes running on a system, and a privileged one – the system ISA – eligible for operating system access only.

Process Virtual Machines

A VM virtualizing the API, ABI, and user ISA is called a process virtual machine (see Figure 2.3a) [240, p. 13ff]. Characteristic of this type of VM is its restriction to a single user-level process. The VMM is thus often referred to simply as runtime. A runtime does not virtualize the whole physical machine, but merely provides programs an execution environment which can deviate from the one defined by the underlying operating system. The simplest manifestation of a process virtual machine is an unchanged process context, effectively obliterating the runtime. The deviation from this native context, however, may be of any arbitrary extent.

Figure 2.3: (a) Process Virtual Machine, (b) System Virtual Machine. A VM constructed at the level of the API, ABI, and user ISA forms a process virtual machine. It virtualizes a single application only and runs alongside native processes on the hosting OS. A VM implemented at the ISA level (user + system) virtualizes an entire computer system and is called a system VM. It is capable of running operating systems.

With DIGITAL FX!32 [61], Hookway presented a process virtual machine which enabled Windows NT 4.0 systems designed around the Alpha processor to transparently execute x86 NT applications. The API (i.e., the Windows API) thus remained the same, while the (user) ISA and ABI changed and needed to be translated by the runtime.

A more recent example is the Windows Subsystem for Linux (WSL) [291], which runs unmodified x86 ELF64 Linux binaries on Windows. It leaves the instruction set architecture fixed and translates the API and ABI from Windows to Linux. With Wine [31], a process VM operating in the reverse direction is available as well.

Process virtual machines also serve as a runtime environment for high-level languages such as Java. The Java virtual machine (JVM) [155] abstracts the operating system in a set of standard libraries (Java API) and comes with a virtual ISA as part of the binary specification. The JVM thereby performs a translation on all three layers (i.e., API, ABI, and user ISA), making Java applications generally platform independent.

While each of these examples performs excessive virtualization, leaving all three layers untouched is equally reasonable. For example, applications typically come as a collection of dynamically linked libraries, preventing optimization across static program binaries. Dynamo [34] addresses this problem by employing a process VM to transparently optimize programs at run time. Compared to a static compiler, Dynamo can interpret applications as a trace of instructions rather than a set of independent binaries. Since these traces are a product of the input given to the program, Dynamo can tailor the code to the usage profile at hand.

A major drawback of process virtual machines, however, is their inability to run operating systems. This limits their applicability in the field of operating system research, where a holistic view on the interaction of applications, the OS, and the hardware is usually required. Even for application-level studies it has been found that leaving out the effects of operating system invocations (e.g., on CPU caches) can considerably corrupt measurements [53, 135, 292].

System Virtual Machines

A virtual machine that is constructed with the system ISA in mind is called a system virtual machine and the VMM is often referred to as hypervisor (see Figure 2.3b) [240, p. 17ff]. Supporting this level of virtualization comes with the necessity to emulate system events like exceptions and interrupts and to considerably extend the set of virtual resources included in the virtual state description V. Whereas a process VM can generally limit V to CPU registers and the memory of a single user-mode address space, and h to executing unprivileged instructions, a system VM has to virtualize all devices present in a computer system, including a full-fledged MMU. As with process VMs, the ISA may or may not match between the host and the guest. While all this raises the complexity of the VMM, it also puts system VMs into the position to run operating systems. This has made system VMs a very powerful and widely employed concept.

System virtual machines were first used in the 1960s and 1970s to provide software compatibility, aid in modification and testing of operating systems, and run diagnostic programs [103]. IBM VM/370 [106] is the best-known hypervisor of that time supporting full system virtualization. Over the years, many additional use cases have emerged. Examples are migrating operating systems to new platforms [48,260], providing strong isolation to safely run untrusted applications and services [221, 278], flexible resource management via migration within and across data centers [24], fault tolerance [45], software distribution [222], and intrusion detection [87, 97] – to name only a few. Chen et al. [57] even argue that operating systems and applications should generally be relocated to system virtual machines for transparently adding new services below the OS level. In fact, many IaaS and PaaS offerings in the cloud today are based on system virtual machines, the ecosystem around Amazon EC2 being only one prominent example.

The architecture of existing hypervisors can be roughly classified into two general designs, which are illustrated in Figure 2.4:

A type I hypervisor runs directly on bare hardware without the need for an operating system (see Figure 2.4a). This has the advantage that the extent of the trusted computing base (TCB), that is, all software that security depends on [147], is kept small. However, missing the infrastructure provided by an operating system forces a type I hypervisor to reimplement essential system components such as physical memory management and virtual CPU (vCPU) scheduling. Furthermore, a type I hypervisor cannot resort to existing drivers, effectively narrowing the choice of systems that are supported. A popular type I hypervisor is VMware ESXi [266].

A type II hypervisor counters these disadvantages by utilizing a commodity operating system as its interface to the hardware, at the cost of a much larger TCB (see Figure 2.4b). In this setup, the hypervisor runs on a par with other processes on the system [2, 38, 46]. Research has also been done to optimize the hosting OS for this scenario [136]. If the VMM makes use of privileged features such as Intel VT-x [125] or ARM TrustZone with Hyp mode [32], the VMM is typically accompanied by a kernel-mode component. Popular representatives for this design are QEMU/KVM [74, 139] and VirtualBox [195].

In practice, many hybrids of these designs have emerged. Microsoft Hyper-V [173] and Xen [35], for instance, run on the bare machine as in type I but always create a specific privileged virtual machine – called Domain 0 (Dom0) in Xen – which hosts an operating system for hardware control as in type II. VMware Workstation [49], on the other hand, dynamically switches between the host OS and the hypervisor.

Figure 2.4: (a) Type I Hypervisor, (b) Type II Hypervisor. A type I hypervisor runs on bare hardware. It thus has to come with its own drivers. A type II hypervisor is installed on an existing host operating system and builds upon the abstractions provided by the OS to provision resources to virtual machines. Hybrid approaches may additionally resort to direct hardware access (e.g., through a kernel module) for controlling hardware-assisted virtualization features.

2.1.2 Conclusion and Terms

Virtual machines lift the concept of virtualization to a higher level by incorporating multiple resource types into the set of virtual resources. While process virtual machines are limited to a single user-level application, system virtual machines create a whole virtual computer system, capable of running operating systems. At the same time, the level of indirection introduced by virtualization can translate between different APIs, ABIs, and ISAs in the guest and host or provide an anchor for instrumentation, thereby allowing researchers to collect detailed run-time information. This makes system virtual machines particularly interesting for operating system and system security research. In addition, they can improve the accuracy of application-level studies by including OS invocations, thereby resembling a more realistic execution environment.

In the remainder of this thesis, we therefore concentrate on running operating systems and applications in system virtual machines for the purpose of de- tailed, holistic inspection with a focus on research and development. We use the terms hypervisor and virtual machine monitor (VMM) interchangeably and refer to system virtual machines when simply using the term virtual machine (VM). Based on our implementation, we concentrate on type II hypervisor technology, although the proposed concepts are equally applicable to type I hypervisors.

2.2 Virtualization Techniques

The power of virtual machines comes at a high price regarding the complexity of the virtualization layer. The virtual state description V, the resource mapping f, and the host state transfer function h reflect this by having to be specifically designed for each type of resource.

In the following, we divide the virtualization of a computer system into three major resource types (see Figure 2.5): (a) the processor, (b) the memory hierarchy, and (c) miscellaneous I/O devices. We take a brief look at the virtualization techniques commonly used in each area to give an overview of the strengths and weaknesses of the individual approaches and their resulting applications. In this thesis, we focus on functional virtualization which is only concerned with faithful reproduction of the ISA, completely omitting microarchitectural details.

Figure 2.5: We can divide a computer system into three major areas: (a) the processor, (b) the memory hierarchy, and (c) I/O devices.

The following explanations on virtualization techniques are laid out accordingly. The section closes with a case study on QEMU/KVM, the virtualization software used as a basis for our implementation. The technical background helps to comprehend the design decisions and implementation details in this thesis. While the descriptions focus on Intel x86, most concepts are equally applicable to other modern architectures.

2.2.1 Processor Virtualization

The central processing unit (CPU) is the heart of a computer system, and as such, processor virtualization is an integral component in each virtual machine monitor. The task of the VMM is to construct a virtual representation of a physical processor in software – a virtual CPU (vCPU). The vCPU's architecture may or may not match the host processor's architecture.

Virtual State Description (V) The virtual state description has to comprise all state that the guest processor holds. This heavily depends on the level of detail that a CPU model should have. For functional virtualization, this is foremost represented by the processor's externally accessible registers. For instance, in the case of an x86-64 processor, V has to maintain, among others, the general-purpose registers RAX to R15 as well as the instruction pointer RIP, the flags register RFLAGS, and the control registers CRx. In addition, V includes CPU-internal information that is necessary to faithfully reproduce the externally visible behavior – e.g., the current privilege level (CPL). A minimal code sketch of such a state description follows after the description of h below.

Resource Mapping (f) The resource provided by a processor is its computation time. Backing computation time of a guest’s vCPU with physical host resources thus equates to donating host computation time to this particular vCPU. Irrespective of the actual method of implementing this, the net effect is always the execution of guest instructions by one of the host’s processors. This is comparable to an operating system provisioning computation time to a certain thread by dispatching it. In fact, some hypervisors map each vCPU to a dedicated thread [139], thereby delegating vCPU scheduling to the host OS.

Host State Transfer Function (h) Transferring the state of the guest processor from one time step to the next is equivalent to modifying V in accordance with the instructions at the vCPU's instruction pointer (IP). The method that a particular VMM uses to accomplish this task is the major criterion by which the use cases for this VMM are defined. This is because the processor virtualization greatly determines the execution speed of the VM and the degree of instrumentation that can be applied. The choice of the virtualization method also inherently decides if the VMM is capable of translating between different guest and host ISAs.
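To make the virtual state description V from above more tangible, the following C fragment sketches a purely functional x86-64 vCPU model. The field selection is an illustrative assumption; real state descriptions (e.g., QEMU's CPUX86State) contain far more state, such as segment registers, model-specific registers, and FPU/SIMD context.

    #include <stdint.h>

    /* Sketch of a functional virtual state description V for one x86-64 vCPU. */
    typedef struct {
        uint64_t gpr[16];      /* general-purpose registers RAX ... R15           */
        uint64_t rip;          /* instruction pointer                             */
        uint64_t rflags;       /* flags register                                  */
        uint64_t cr[5];        /* control registers CR0 ... CR4                   */
        uint8_t  cpl;          /* current privilege level: internal state that is
                                  needed to faithfully reproduce visible behavior */
        /* segment registers, MSRs, FPU/SSE state, interrupt controller, ...      */
    } vcpu_state_t;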

In the following, we recapitulate the four methods for processor virtualization proposed in the literature. Although the descriptions remain limited to single-core virtualization, the techniques can equally be applied to multicore virtualization, for example, by duplication onto multiple host cores [139] or by multiplexing [260].

Interpretation

To construct a virtual processor the easiest approach is to take a look at how physical CPUs process instructions and develop a design based on that concept. To improve the throughput for a given clock rate via instruction-level parallelism, CPUs divide each instruction into a series of steps, thereby forming a pipeline. Although in its simplicity not representative of modern CPUs, the classic 5-stage RISC pipeline illustrated in Figure 2.6 provides a good overview of what phases each instruction must pass until it retires [112, p. A.5f]:

IF ID EX MEM WB

Figure 2.6: The classic 5-stage RISC processor pipeline: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and write back (WB) [112, p. A.7].

In the first phase, instruction fetch (IF), the processor reads the current instruction from the main memory using the address in the IP. To save space and memory bandwidth, each instruction is encoded in a dense machine-specific format as laid out in the instruction set architecture (ISA). The next stage in the pipeline is therefore instruction decode (ID), which extracts the operation and its operands from the encoded representation and reads the CPU's register file as needed. The execute (EX) cycle is responsible for performing the actual computation using an arithmetic logic unit (ALU) for integer and boolean register operations. For branch instructions, this stage can be used to determine the branch target1. For memory referencing instructions, the execution stage computes the effective memory address, for example, by adding a base register and an offset. The actual memory access is done in the memory access (MEM) stage, whereas the result of register-register and register-immediate operations is simply forwarded to the next phase. The processing of the instruction is finalized in write back (WB), where results are written to the register file.

An interpreter is the simplest software implementation of this procedure (see Figure 2.7). Its core is the decode-and-dispatch loop which processes one instruction at a time until a halt condition such as a VM shutdown is met or a virtual interrupt arrives [240, p. 29ff]. The operations performed in the loop mirror the stages in a basic processor pipeline.

1 With instruction-level parallelism, the branch target is often computed early in the decode stage to reduce wasted cycles where the pipeline is fed with wrong instructions.

Figure 2.7: The core of an interpreter is the decode-and-dispatch loop which processes one instruction at a time. The loop calls dispatch routines specific to each instruction to perform computations on the guest's memory and register file. External events such as interrupts are handled on instruction boundaries.

An important characteristic of interpreters is the use of dispatch routines to implement the execution cycle. The interpreter assigns each instruction an individual routine which carries out the necessary computations on the virtual CPU registers. The appropriate routine is then called from the loop.
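The following C sketch illustrates this decode-and-dispatch structure for a fictional, fixed-width toy ISA. It is not the code of any real interpreter; the vcpu_t layout, the opcode encoding, and the dispatch table are assumptions made purely for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    /* Toy vCPU model: register file and instruction pointer (the relevant part of V). */
    typedef struct {
        uint64_t regs[16];
        uint64_t rip;
        bool     halted;
    } vcpu_t;

    typedef void (*dispatch_fn)(vcpu_t *vcpu, uint8_t *mem);

    /* Two example dispatch routines for the fictional, fixed-width ISA. */
    static void op_halt(vcpu_t *vcpu, uint8_t *mem)
    {
        (void)mem;
        vcpu->halted = true;                       /* stop the virtual CPU            */
    }

    static void op_and_imm(vcpu_t *vcpu, uint8_t *mem)
    {
        uint8_t reg = mem[vcpu->rip + 1];          /* ID: decode the operands         */
        uint8_t imm = mem[vcpu->rip + 2];
        vcpu->regs[reg & 0xF] &= imm;              /* EX/WB: compute and write back   */
        vcpu->rip += 3;                            /* advance the virtual IP          */
    }

    static dispatch_fn dispatch_table[256] = {
        [0x00] = op_halt,
        [0x01] = op_and_imm,
    };

    /* Decode-and-dispatch loop: one guest instruction per iteration; external
     * events (e.g., virtual interrupts) would be checked on instruction boundaries. */
    void interpreter_loop(vcpu_t *vcpu, uint8_t *mem)
    {
        while (!vcpu->halted) {
            uint8_t opcode = mem[vcpu->rip];       /* IF: fetch at the guest IP       */
            dispatch_fn fn = dispatch_table[opcode];
            if (fn)
                fn(vcpu, mem);                     /* routine emulates ID/EX/MEM/WB   */
            else
                vcpu->halted = true;               /* undefined opcode: stop (toy model) */
        }
    }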

To interpret CISC architectures, a simple RISC-like design is not sufficient. On x86, for example, a logical AND instruction is not limited to operating on registers but can also directly work with a memory operand and an immediate value [125]. The interpreter therefore must be able to read, modify, and write memory in one instruction. To that end, the fixed structure of the main loop can be relaxed and the dispatch routines are empowered to load and store guest memory, effectively merging the MEM and WB cycles into the dispatch routines for greater flexibility. Another solution is to generate a set of RISC-like micro-operations in a decode front end just like physical x86 CPUs do [93, 125]. Tröger et al. [258] demonstrated this approach in Bochs [2], a popular interpreter for the x86 platform.

An advantage of interpreters is their simple approach to virtualization, which facilitates implementation. With the addition of dispatch routines, an interpreter can be extended to support new instructions in the guest ISA without difficulty. Since these routines are usually written in a high-level programming language, the interpreter can be built for any host architecture for which a corresponding compiler exists. The translation from the guest ISA to a potentially different host ISA is thus implicitly done by the compiler, making interpretation a portable virtualization technique. Interpretation also lends itself to research and development because the dispatch routines can easily be instrumented to, for example, capture instruction or memory access traces. In contrast to static binary rewriting [150], instrumentation can be dynamically turned on or off without prior preparation of the executed code.

The major disadvantage of interpretation is, however, the slowdown compared to native execution, which quickly exceeds two orders of magnitude [174]. This stems from the fact that an interpreter has to execute a multitude of host instructions for each guest instruction. Mihocka et al. [174] demonstrated that some of the techniques used in silicon, such as branch prediction, prefetching, and decoded instruction caches, are also effective in software, almost doubling the interpreter performance. They also found that the main loop is responsible for around 50% of total execution time. In the literature, various methods have been proposed to reduce the overhead of the main loop, most notably threaded interpretation [240, p. 37f]. However, even with this technique, overall performance remains low compared to native execution [174, 238]. For example, interpreting a register arithmetic instruction such as ADD in Bochs 2.3.7 takes 25 cycles on a Core 2 Duo2, while the same operation takes 0.33 cycles when executed natively [92]. This is even worse for more complex operations. For instance, a floating point multiplication takes 175 times the number of cycles.

Dynamic Binary Translation (DBT)

With the aim of reducing the number of host cycles spent on each guest instruction, dynamic binary translation (DBT) takes a slightly different approach to processor virtualization than interpretation. Instead of using dispatch routines to work on the guest state, a binary translator generates for each guest instruction a set of host instructions that perform an equivalent operation on the guest state, but which can execute directly on the host processor (see Figure 2.8). By recompiling multiple guest instructions en bloc, a translator can reduce the virtualization overhead compared to interpretation. This way, Mimic [168], a full system simulator for System/370 based on binary translation, requires on average only 4 host instructions per guest instruction. This is because the host does not branch to dispatch routines for every guest instruction, but instead executes a sequential code sequence. This also opens the door for optimizations. If, for example, consecutive guest instructions work on the same register, the binary translator can bracket the translated sequence with just two accesses to the vCPU's register file. It is also possible to directly map guest registers to host registers [281]. We can therefore think of the translation result as an optimized version of multiple concatenated dispatch routines.

2 Average number of clock cycles per instruction for a series of independent ADDs in the same thread. The instruction latency is 1 clock cycle [92].


Figure 2.8: A binary translator reduces virtualization overhead by translating consecutive guest instructions to a set of natively executable host instructions which directly operate on the guest state. Helper routines facilitate the virtualization of complex instructions. Translated blocks are held in a cache to reduce overall translation overhead.

Nevertheless, to keep the complexity of the translator within reasonable bounds and preserve maintainability, seldom-used but hard-to-translate guest instructions are usually still realized with support from helper routines [49,168,260]. An example is the cpuid [125] instruction on x86, which returns processor identification and feature information and requires reading the configuration of the virtual machine. Since the generated code resides in the same address space as the VMM, the translator can invoke a helper routine by simply generating a call to the routine's address. The helper routine then naturally jumps back to the translated code on return.

For a system virtual machine, which runs a full operating system and starts programs dynamically, possibly even modifying or generating code on-the-fly (e.g., self-modifying code or just-in-time compilation), a static binary translation ahead of time is not feasible. Aggravating this is the code discovery problem [240, p. 52f], which describes the difficulty, inherent to von Neumann architectures, of separating code from data [65,116]. Therefore, binary translation for system virtual machines is done dynamically at run time by the hypervisor.

Compared to interpretation, having to perform translation at run time before being able to execute guest code increases latency and virtualization overhead. Translations are thus cached for reuse. Since most programs exhibit strong locality of reference [254, p. 216] and even long-term phase behavior [34, 78, 145, 237], a translation cache can amortize translation costs. To maintain translation cache coherency in the presence of potential guest code modification, the translator can either rely on a flush instruction as on SPARC [242], which guest code has to use to indicate modifications [67], or it has to check memory accesses for writes to translated code [38, 75, 281]. This way, the VMM is able to detect stale translations and evict them from the cache.
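The following sketch outlines how a binary translator might drive execution through such a translation cache: look up a translation for the current guest instruction pointer, translate the block on a miss, and then run the host code directly. The helper names (cache_lookup, translate_block, ...) are hypothetical and do not correspond to a specific VMM.

    #include <stdint.h>

    typedef struct vcpu vcpu_t;

    typedef struct tb {
        uint64_t guest_pc;          /* guest address the block starts at */
        void   (*code)(vcpu_t *);   /* host code: prolog + body + epilog */
        struct tb *next;            /* hash chain inside the cache       */
    } tb_t;

    extern uint64_t vcpu_get_pc(vcpu_t *cpu);
    extern void     handle_pending_events(vcpu_t *cpu);
    extern tb_t    *cache_lookup(uint64_t guest_pc);            /* cache probe */
    extern tb_t    *translate_block(vcpu_t *cpu, uint64_t pc);  /* recompile   */
    extern void     cache_insert(tb_t *tb);

    void dbt_loop(vcpu_t *cpu, volatile int *halt)
    {
        while (!*halt) {
            handle_pending_events(cpu);          /* only at block boundaries     */
            uint64_t pc = vcpu_get_pc(cpu);
            tb_t *tb = cache_lookup(pc);         /* reuse amortizes translation  */
            if (!tb) {
                tb = translate_block(cpu, pc);   /* recompile guest code en bloc */
                cache_insert(tb);
            }
            tb->code(cpu);                       /* execute natively on the host */
        }
    }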

The unit of translation is a translation block (TB). A translation block comprises a set of recompiled guest instructions, the body, and is encapsulated by a prolog and an epilog that are responsible for the transition between the hypervisor and the translated guest code, and vice versa (e.g., updating the virtual instruction pointer on TB exit). The VMM sizes translation blocks based on the concept of basic blocks as used in compiler construction. A basic block denotes a straight-line code sequence with no branches in except to the entry and no branches out except at the exit [112, p. 67]. A translation block deviates from this concept in that it is determined by the actual execution flow of the vCPU [240, p. 57]: it begins at the instruction executed immediately after a branch or a jump and ends with the next branch or jump. This is because the prolog requires that execution always starts at the beginning of a TB. Furthermore, at the end of a TB conditional branches represent a natural barrier at which the evaluation of the condition at run time is required to decide which path the translation should follow. The same is true for unconditional branches whose branch target is computed at run time (i.e., a register-relative jump). Although translations can conceptually follow unconditional jumps with a known fixed target, it is also reasonable to avoid this. A branch target may be used by multiple code locations and in combination with a TB’s single entry rule, this can impair translation block reuse.

While the idea of a translation block strictly limits control flow to a single entry and exit, in practice, a TB can also be exited prematurely as part of guest exception emulation – e.g., when the guest divides by zero or on a fault in guest virtual memory. In addition, the prolog may introduce further exit paths to abort execution if necessary. Translation blocks therefore usually possess a single entry, but multiple exits.

There are also binary translators that build much larger translation blocks by taking unconditional jumps and defaulting to a predicted branch target for conditional and register-relative branches, exiting the TB on false prediction [34, 61]. While this is beneficial to reduce costly transitions between the hypervisor and guest code, it requires online profiling techniques to be effective. This complicates the translation mechanism and introduces additional overhead.

A simpler approach is translation block chaining [67]. The idea is illustrated in Figure 2.9. If a translation for a branch target exists, the VMM can connect the two TBs, thereby directly transferring control from one TB to the next. Conditional branches generate two exits that connect to different blocks. This saves costly passes through the VMM just like large translation blocks do. However, chaining


Figure 2.9: Chaining directly connects translation blocks, saving passes through the virtual machine monitor.

does not remove the overhead of each block's prolog and epilog and it does not allow optimizations across basic blocks. Nevertheless, since it brings noticeable performance improvements, is easy to implement, and can also be applied if large translation blocks are used, chaining is commonly found in binary translators [20, 34, 38, 49, 67, 99, 260, 281]. A drawback of chaining is that it makes replacement in the TB cache more difficult because the links create dependencies between blocks. A solution is to leverage back pointers [240, p. 135f] or to employ a more aggressive cache replacement strategy that flushes entire cache partitions while at the same time the cache prohibits interdependence between partitions [240, p. 138f].

A noteworthy property of binary translation is that the handling of external interrupts gets delayed. Compared to a physical machine, where interrupts can occur at arbitrary locations in the execution flow, a binary translator checks for pending events only in the main loop (see Figure 2.8). While this is also true for interpreters, a binary translator works at the granularity of translation blocks – i.e., multiple guest instructions. This delay can actually be measured in the guest and has already been used by malware to protect itself from inspection by security analysts [287]. Chaining even amplifies the problem. To avoid this, the TB prolog can check for pending interrupts and abort execution, or chaining can be actively broken when asynchronous events are encountered [38]. However, the delay caused by a single translation block remains, being inherent to binary translation.

To sum up, we can state that binary translation is a powerful alternative to interpretation. It is equally capable of executing system VMs with a different ISA than the host. As with interpretation, the guest code can easily be instrumented, which makes this technique ideal for research and development. Binary translation generally provides higher performance than interpretation by reducing the expansion factor of guest to host instructions and allowing optimizations within translation blocks. Nevertheless, binary translation still has considerable overhead compared to native execution. Rosenblum et al. [223] report a slowdown of 5x to 10x for their MIPS R4000-based SGI processor translation in SimOS. Similar results have been published for the MIPS R4000 emulation in Embra [281]. For a CISC architecture like x86, the translation overhead is even higher. For example, running a Linux kernel build with binary translation in QEMU 2.6.5 [38] incurs a slowdown of 20x compared to native execution3.

Direct Execution

To improve processor virtualization performance, we can (1) generate more efficient translation results and (2) further reduce the overhead that the translation and the VMM impose. With the restriction that the guest ISA must match the host ISA, direct execution aims at running a dominant subset of the virtual processor’s instructions unmodified on the host processor. This minimizes translation costs and at the same time reduces the expansion factor from guest to host instructions.

In direct execution mode, SimOS incurs a slowdown of 2x [223] compared to native execution. Adams and Agesen [20] even report an average slowdown of only 4% for computation-heavy user-mode code in VMware Workstation [49], while the VM achieves 67% of the native speed when building the Linux kernel. The key challenge with this technique is to identify which guest instructions can be executed directly and for which instructions the VMM must enforce emulation. In this context, Popek and Goldberg [206] studied the requirements for a platform to be virtualizable using direct execution. They identified two important instruction groups:

1. Control sensitive instructions affect the configuration of resources in the machine such as changing the CPU's privilege level and altering the memory configuration, for instance, switching to a different virtual address space (i.e., writing CR3 on x86).

2. Behavior sensitive instructions do not attempt to change the configuration of resources, but their behavior or result depends on it. An example is the POPF instruction on x86, which overwrites the flags register with a word from the stack. In user mode the instruction silently ignores changes to the interrupt enable flag (IF), whereas in kernel mode the flag is applied. POPF's behavior thus depends on the CPU's current privilege level.

A VMM can virtualize an ISA with direct execution if all sensitive instructions are privileged – i.e., they trap when executed in user mode [206], causing a VM fault. A hypervisor can leverage this fact by running guest code physically always in

3 Phoronix Test Suite 5.2.1 [11], Intel Xeon E5-2630v3 @ 2.40GHz, single-core VM, 4 GiB RAM. Native compilation restricted with NUM_CPU_JOBS=1 and pinning the benchmark to core 0.

user mode – a technique called de-privileging [20]. Under the assumption that the VMM implements proper memory virtualization (§ 2.2.2), the hypervisor gains control whenever the guest attempts to access the hardware configuration or performs operations that would otherwise break virtualization by revealing the true state of virtualized hardware resources. The VMM tracks the guest's mode of operation (i.e., user or kernel) with a shadow bit in the vCPU structure. Similar to the dispatch routines employed in interpreters, the VM fault leads to the call of a handler which emulates the specific instruction, respecting the shadowed mode bit. In addition, the hypervisor registers exception handlers to capture all exceptions that the direct execution of guest instructions may raise. The concept is known as classic virtualization or, more figuratively, trap-and-emulate [20,51,103].

A popular VMM based on trap-and-emulate was CP-67/CMS [172], which implemented a time-sharing system on the IBM System/360 Model 67 for transparent multi-user access. Similarly, Disco [48] used VMs based on trap-and-emulate to port commodity operating systems designed for uniprocessors to large-scale shared-memory multiprocessors by running multiple copies of the OS in parallel. Both the Model 67 and the MIPS R10000 RISC processor required by Disco are well suited to direct execution because all sensitive instructions are privileged and trap into the hypervisor. Unfortunately, this is not the case for all architectures. A comprehensive study of the x86 instruction set by Robin and Irvine [126] revealed 17 critical instructions – i.e., sensitive instructions that do not trap – one being the POPF instruction already mentioned. This makes x86 not classically virtualizable. The same is true for ARM [203].

With direct execution, the VM always executes in physical user mode. Therefore, running user-mode guest code is not a problem since the processor runs at the expected privilege level and instructions behave identically to native execution [22]. This is not the case if the vCPU enters kernel mode because the CPU remains in physical user mode. In consequence, critical instructions may behave unexpectedly. Kernel-mode code must thus be virtualized as follows:

(a) Identical Binary Translation A possible solution is to dynamically switch between direct execution and binary translation based on the current state of the virtual CPU [49]. As a direct execution engine is not capable of cross-ISA virtualization, DBT can be optimized for this scenario, primarily generating identical translations [20,49]. However, this requires a rather complicated use of segment registers to enable execution from within the code cache while at the same time allowing unmodified memory accesses [49].

(b) Paravirtualization A simpler solution is to statically replace occurrences of critical instructions in the source code of the guest operating system with explicit invocations of the hypervisor [35,46,82,278]. An obvious drawback of this technique is that the guest operating system needs to be modified, which in turn requires access to the source code; an obstacle for commercial operating systems.

Since trapping is costly, taking 2030 cycles4 for the rdtsc instruction on a Pentium 4 [20], it is also an option to selectively use binary translation to speed up operations that can architecturally be trapped. Adams and Agesen report 1254 cycles for DBT with helper routines and 216 cycles for in-TB emulation [20]. The technique has therefore been adopted by various VMMs [49, 75, 223]. Likewise, paravirtualization is not just limited to virtualizing critical instructions. The Denali Isolation Kernel [278] leverages this approach to extend the ISA with purely virtual registers and instructions, making certain operations such as idling in the guest OS more efficient. Disco [48] uses stores and loads on special predefined addresses to optimize (de-)activation of interrupts and access to privileged registers. Paravirtualization is also commonly used to reduce I/O virtualization overhead (§ 2.2.3).

To sum up, we can conclude that direct execution significantly improves virtualization speed, making it considerably faster than dynamic binary translation, even for platforms that are not classically virtualizable. On the flip side, direct execution limits virtualization to the host architecture. Furthermore, it is inherently less suited for research than interpretation or DBT because it does not allow easy instrumentation. It thus provides little information about the workload's performance or behavior [223]. Using direct execution for x86 requires combining it with binary translation or paravirtualization, which makes it rather complicated or requires extensive modification of the guest operating system [46].

Hardware-Assisted Virtualization (HAV)

With the addition of special virtualization features to the instruction set architecture, it is possible to efficiently virtualize architectures which are not classically virtualizable. Hardware-assisted virtualization (HAV) thus improves upon direct execution in that it does not require binary translation or paravirtualization for kernel-mode code. But even for architectures that are virtualizable with pure trap-and-emulate, HAV can simplify the VMM design by providing an explicit interface for virtualization and taking away some of the burdens from VMM developers such as manually shadowing processor data structures. Prominent examples of HAV extensions are Intel VT-x [125] and AMD SVM [21] for the x86 platform, and TrustZone with Hyp mode for ARM [32].

4 rdtsc takes approximately 80 cycles when executed natively on a Pentium 4 [92].

Since this thesis is centered on the Intel x86 platform, we will restrict the following description to the processor virtualization in VT-x and refer to this particular technology if not stated otherwise. In accordance with the Intel Software Developer's Manual [125], we will further use the term virtual machine extensions (VMX) for the processor and memory virtualization capabilities. Despite the focus on VMX, most of the concepts laid out are equally applicable to other hardware virtualization extensions.

With direct execution, the guest always runs in user mode, thereby enabling the VMM to trap privileged operations by the guest operating system. HAV considerably facilitates virtualization by letting the guest OS run at its intended privilege level. To that end, VMX introduces a new mode of operation which is more privileged than kernel mode: VMX root mode. This mode enables the hypervisor to enforce restrictions upon the guest operating system running in kernel mode, just like an OS controls processes in user mode.

Figure 2.10 illustrates the basic execution cycle. The hypervisor runs in VMX root mode. By executing the VMLAUNCH instruction, the CPU switches to VMX non-root mode, thereby starting to execute guest code. If the CPU encounters an exit condition, it jumps back into the hypervisor. VMX defines 65 exit reasons in total [125], covering vectored events like interrupts and exceptions as well as the attempted execution of a privileged instruction in the guest, for instance, reading or writing a model-specific register (MSR).


Figure 2.10: The VMM runs in VMX root mode. A VM entry brings the processor into VMX non-root mode, thereby switching to the guest. If the CPU encounters an exit condition (e.g., a privileged instruction), the CPU leaves the VMX non-root mode, giving the hypervisor the chance to handle the exit reason. The VM control structure (VMCS) stores the guest and host CPU states and controls VM execution (figure based on [262]).

Unlike with interrupt handlers, a VM exit does not transfer control to a specialized dispatch routine, but instead always continues after the VMLAUNCH instruction. Entering and leaving a VM is thus comparable to a classic context switch between different processes in an operating system. After gaining control, the VMM can retrieve information about the nature of the exit and perform appropriate steps to handle it. The cycle starts anew with an invocation of the VMRESUME instruction.

Central to this scheme is the virtual machine control structure (VMCS). The VMCS is organized into four logical groups [125]:

• Guest State Area The processor's state is saved into the guest state area on VM exits and loaded from there on VM entries. The VMM can freely modify this data to, for example, emulate the effects of a privileged instruction.

• Host State Area The processor’s state is saved into the host state area on VM entry and loaded from there on VM exit.

• VMX Control Fields These fields control the processor’s behavior in VMX non-root mode. The fields can be divided into execution, entry, and exit control. This way, the hypervisor can, for instance, configure what conditions cause a VM exit.

• VM Exit Information To handle a VM exit, the VMM must be able to discern the precise reason for the transition. The exit information comprises a specifically encoded integer value and an additional exit qualification whose format depends on the cause of the exit.
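As an illustration of this cycle, the sketch below shows how a hypervisor-side loop could enter the guest and dispatch on the exit information. It is a rough sketch only: the wrapper functions stand in for the VMLAUNCH/VMRESUME and VMREAD instructions, and the field and reason constants are illustrative placeholders rather than the actual VMCS encodings defined in the manual [125].

    #include <stdint.h>

    /* Hypothetical wrappers for VMLAUNCH/VMRESUME and VMREAD. */
    extern void     vmx_enter_guest(void);            /* returns on VM exit */
    extern uint64_t vmcs_read(uint32_t field);

    /* Illustrative placeholders; the real field encodings and exit reason
     * numbers are defined in the Intel SDM [125]. */
    enum vmcs_field  { VMCS_EXIT_REASON, VMCS_EXIT_QUALIFICATION };
    enum exit_reason { EXIT_CPUID, EXIT_IO_INSTRUCTION, EXIT_EPT_VIOLATION };

    extern void emulate_cpuid(void);
    extern void emulate_pio(uint64_t qualification);
    extern void handle_ept_violation(uint64_t qualification);

    void vmm_run_loop(void)
    {
        for (;;) {
            vmx_enter_guest();    /* VM entry: switch to VMX non-root mode */

            /* VM exit: discern the reason and emulate the trapped action. */
            uint64_t reason = vmcs_read(VMCS_EXIT_REASON);
            uint64_t qual   = vmcs_read(VMCS_EXIT_QUALIFICATION);

            switch (reason) {
            case EXIT_CPUID:          emulate_cpuid();              break;
            case EXIT_IO_INSTRUCTION: emulate_pio(qual);            break;
            case EXIT_EPT_VIOLATION:  handle_ept_violation(qual);   break;
            default:                  /* further exit reasons ... */ break;
            }
        }
    }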

The performance of hardware-assisted virtualization heavily depends on the cost and the frequency of transitions between the hypervisor and the guest. An initial comparison between software virtualization and HAV in 2006 showed mixed results [20]. The main focus of research since then has therefore been to minimize transition frequency and round-trip cost. While the Prescott CPU still required 4000 cycles per round-trip, the number has been considerably reduced to 540 cycles on the more recent Haswell processor [23, 233]. The use of tagged TLBs and extensive address translation caching can further dampen the overhead of the address space switch on VM exits [171, 288]. At the same time, various techniques such as second level address translation have been developed which successfully reduce the exit rate [23,259]. Today, hardware-assisted virtualization thus achieves near native performance which makes it the prevalent processor virtualization technology in widespread productive use.

Since the fundamental concept with HAV is still trap-and-emulate, the same restrictions in research and development concerning the lack of easy instrumentation apply.

2.2.2 Memory Virtualization

Memory management is one of the fundamental tasks of an operating system. This includes managing the physical memory, maintaining address translation structures, configuring mappings, and installing the correct memory configuration on a context switch. The operating system performs all these actions under the assumption that it has full control over the machine's physical memory and configuration. In a virtual machine, this is a false assumption as the hypervisor claims ownership of physical resources to be able to distribute or share them between multiple VMs. In addition, it has to reserve some of the resources for its own use, effectively hiding them from the virtual machines. Just as with processors, the VMM thus has to virtualize the memory in the host so as to give each guest the impression of full control.

In order to comprehend the approaches to memory virtualization relevant for this thesis, we first briefly recapitulate memory management in a non-virtualized environment.

The predominant way of providing access to physical memory in computers today is by means of virtual memory [94]. When a process uses a memory address, the CPU interprets it as a virtual address and the memory management unit (MMU) transparently translates the address to the corresponding physical address at run time – if possible, faulting into the operating system otherwise. Every process has its own set of virtual addresses. In their entirety, they form the virtual address space of the process. Depending on the architectural support, various methods exist to implement address spaces. Examples are base and limit registers, segmentation, and paging [254, p. 181ff], with paging being the prevalent technique. In a page-based architecture, the MMU translates addresses in the granularity of pages, with 4 KiB being a common page size. Each address space possesses a distinct page table which maps virtual to physical pages [254, p. 195ff]. Each page table entry (PTE) describes the mapping of a single virtual page. Typical information in a PTE includes [125]:

• Physical Frame Number (PFN) This field provides the index of the target physical page.

• Present (P) This bit determines if the mapping is valid. If not set, the MMU raises a page fault when the CPU accesses the corresponding virtual page.

• Protection (RWX) These bits specify access permissions such as read (R), write (W), and execute (X). A prohibited access causes a page fault.

• Accessed (A) and Dirty (D) The A/D-bits indicate page usage. The MMU sets the accessed bit when it uses the PTE for translation. If the memory access is a write, the MMU also sets the dirty bit.

Figure 2.11: The virtual address is split into indices into a two-level page table and an offset into the physical page. PTEs point to the PFN of the next page table or to the final physical page, depending on the level.

The virtual address space of a process is usually populated sparsely. In consequence, it is a waste of memory to organize a page table as a linear list of PTEs. Most entries will not be used as they do not map to physical memory anyway – i.e., the present bit is not set. Many architectures such as x86, ARM, and MIPS rather structure a page table as a multi-level hierarchy of page tables as exemplarily depicted in Figure 2.11. The MMU splits the virtual address into a set of indices into page tables (e.g., two on 32-bit x86) and an offset into the target physical page. PTEs in higher levels point to the physical page that contains the next page table, whereas PTEs on the last level point to the target physical page. Since entries in higher levels can be invalid, not all page tables need to be allocated, efficiently reflecting the sparse population of the virtual address space.
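As an illustration, the following sketch performs such a two-level walk in software. The address split (10/10/12 bits) and the flag positions follow 32-bit x86 paging; the accessor phys_read32 is a hypothetical helper for reading physical memory.

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT  (1u << 0)    /* P bit                               */
    #define PTE_WRITE    (1u << 1)    /* W bit                               */
    #define PTE_ACCESSED (1u << 5)    /* A bit                               */
    #define PTE_DIRTY    (1u << 6)    /* D bit                               */
    #define PFN_MASK     0xFFFFF000u  /* upper 20 bits hold the frame number */

    extern uint32_t phys_read32(uint32_t phys_addr);  /* hypothetical accessor */

    /* Walks the two-level hierarchy of Figure 2.11. Returns false where the
     * MMU would raise a page fault (present bit not set). */
    bool walk_page_table(uint32_t page_dir_base, uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t l1_index = (vaddr >> 22) & 0x3FF;    /* level 1 index */
        uint32_t l2_index = (vaddr >> 12) & 0x3FF;    /* level 2 index */
        uint32_t offset   =  vaddr        & 0xFFF;    /* page offset   */

        uint32_t pde = phys_read32(page_dir_base + l1_index * 4);
        if (!(pde & PTE_PRESENT))
            return false;                             /* page fault    */

        uint32_t pte = phys_read32((pde & PFN_MASK) + l2_index * 4);
        if (!(pte & PTE_PRESENT))
            return false;                             /* page fault    */

        *paddr = (pte & PFN_MASK) | offset;
        return true;
    }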

As with regular virtual memory, the core idea to virtualize a computer’s memory architecture is to introduce a level of address indirection, in this case between the guest and the host. This enables the VMM to freely organize the memory of VMs in the same way as virtual memory gives operating systems the ability to freely assign physical memory to processes. The result is that both the process and the VM get the illusion of having full control of the system’s entire memory. We can describe V , f , and h as follows:

Virtual State Description (V) V holds the state of each of the pages in the guest’s physical address space.

Resource Mapping (f) f describes the mapping of guest physical pages to host memory and (virtual) devices.

Host State Transfer Function (h) h translates reads and writes on guest physical pages to equivalent operations on the respective host memory.


Figure 2.12: Virtualizing a virtual memory architecture creates four levels of memory addresses, requiring three stages of translation.

If the guest supports virtual memory, the VMM effectively has to cope with nested memory virtualization. This gives us a total of four address levels (see Figure 2.12): (1) guest virtual address (GVA), (2) guest physical address (GPA), (3) host virtual address (HVA), (4) host physical address (HPA).

Goldberg [102] explicitly formalized this additional step in order to stress the difference between the fault condition f(v) = φ in the translation from GVAs to GPAs and in the translation from GPAs to HVAs. A fault on a GVA is a guest page fault and should eventually lead to a call of the page fault handler in the guest OS. In contrast, a fault on a GPA is a VMM page fault and should transfer control to the hypervisor.

Depending on the design of the VMM, the HVA and HPA can collapse into a single address level. A type I hypervisor, for instance, may choose to directly map guest physical memory to host physical memory, whereas a type II hypervisor may represent the guest physical memory as an area of regular anonymous virtual memory in an address space supplied by the host OS. In the latter case, the translation from a GPA to an HVA can be as simple as adding an offset. At the same time the VMM delegates the assignment of guest physical memory to host physical memory to the host OS.
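In the type II case just described, the GPA-to-HVA translation can indeed be a plain offset calculation. A minimal sketch, assuming hypothetical names and a single contiguous RAM region:

    #include <stddef.h>
    #include <stdint.h>

    /* Guest physical memory backed by one contiguous anonymous mapping,
     * e.g. obtained once via mmap() (setup not shown). */
    static uint8_t *guest_ram_hva;
    static uint64_t guest_ram_size;

    /* GPA -> HVA: only valid for the single RAM region assumed here;
     * addresses outside of it would have to be routed to device memory
     * or reported as a VMM page fault. */
    static inline void *gpa_to_hva(uint64_t gpa)
    {
        if (gpa >= guest_ram_size)
            return NULL;
        return guest_ram_hva + gpa;
    }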

The actual method for implementing memory virtualization is tightly connected to the technique used for processor virtualization. Considering the frequency of memory accesses, efficient memory virtualization is at the same time crucial to the overall performance [20,55,75]. In the following, we describe three standard approaches to memory virtualization that are relevant to this thesis.

We restrict the description to page-based virtual memory because this model is ubiquitous in modern architectures. Every approach thus translates (1) from GVAs to GPAs and (2) from GPAs to host addresses (HAs) – i.e., either HVAs or HPAs. We further assume a forward multi-level page table hierarchy as shown in Figure 2.11 for the GVA to GPA translation and ignore additional translation levels such as segmentation on x86.

Software MMU

If processor virtualization is done via interpretation or binary translation, a natural and very flexible way to achieve memory virtualization is to not only emulate the CPU in software but also the memory management unit – i.e., construct a software MMU [38,260, 281]. The interpreter or DBT engine explicitly calls the software MMU for every memory access in the guest. The software MMU then translates the guest virtual address to a host address and performs the actual memory access on host memory.

In order to translate a GVA to a GPA, the software MMU must walk the page table hierarchy of the guest. Considering that every instruction fetch is a memory access and on average every third instruction performs a data memory access [55, 121, 243, 281], performing a page table walk in each of these cases would be very expensive. Remember that every page table walk itself requires multiple accesses to guest physical memory, triggering (possibly multi-stage) translations from GPAs to HPAs. It is thus vital for the performance to avoid page table walks where possible by caching address translations. In a physical machine, the translation lookaside buffer (TLB) is responsible for this task, reaching a cache hit rate of over 99% [201, p. 439], depending on the workload. Software MMUs consequently copy this design by integrating a software TLB [165, 260].

Embra [281] implements a software TLB with a fixed-size 4 MiB table that maps every possible page in a 32-bit guest virtual address space. On a memory access, Embra retrieves TLB information from the table using the virtual page number of the GVA and performs permission checks afterward. If the entry is not valid or permissions are insufficient, the software MMU calls a helper routine to emulate the guest page fault exception. Otherwise, Embra combines the HVA from the TLB entry with the page offset bits of the memory address and initiates the load or store on the respective location in host memory.

While the approach conceptually offers a 100% hit rate and very fast access through 1:1-mapping, it is not suited for modeling 64-bit virtual address spaces due to the resulting table size. However, while in hardware a limited-size but fully-associative or n-way cache can be installed, comparing the tags of multiple cache lines in software is expensive [256]. A software TLB is thus often a limited-size direct-mapped table which uses the page number modulo the TLB size for line selection [38, 260] – a tradeoff between speed, size, and hit rate5. To optimize for the common case, binary translators inline the whole TLB hit path, exiting the TB only on a TLB miss (see Figure 2.13). The TLB miss handler is then a helper routine which performs the walk of the guest page table and updates the respective TLB entry. If successful, the helper routine returns to the translation

5 Average software TLB miss rate in QEMU ranges from 3% to 8% [256].


Figure 2.13: The software TLB is a direct-mapped cache for fast address translation. A valid entry contains the translated host address for a GVA. While the VMM inlines the common case of a TLB hit, it implements the TLB miss path with a helper routine. Permission checks in the TLB hit path are avoided by creating separate TLBs for reads, writes, and instruction fetches.

block afterward and the CPU can access the desired memory location via the host address. Otherwise, the handler exits the TB with a page fault (PF) exception.

An additional optimization is to create separate TLBs for reads, writes, and instruction fetches, each for kernel and user mode [165, 260]. This obviates the need for an explicit permission check in the TLB hit path. Instead, finding a valid entry in the corresponding TLB equates to also having the required access permission.
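A minimal sketch of such a direct-mapped software TLB hit path is shown below. The entry layout, sizes, and names are hypothetical; only the read path is shown, the miss handler is assumed to refill the entry as described above, and all entries are assumed to start out with an invalid tag.

    #include <stdint.h>

    #define TLB_SIZE   256                           /* entries, power of two */
    #define PAGE_BITS  12
    #define PAGE_MASK  ((1ull << PAGE_BITS) - 1)

    typedef struct {
        uint64_t  gva_page;    /* tag: guest virtual page number        */
        uintptr_t host_base;   /* host address of the start of the page */
    } tlb_entry_t;

    /* Separate tables per access type (read/write/fetch) avoid permission
     * checks in the hit path; only the read TLB is shown. */
    static tlb_entry_t read_tlb[TLB_SIZE];

    /* Slow path: guest page table walk plus GPA-to-host translation; refills
     * the TLB entry and may raise a guest page fault. */
    extern uintptr_t tlb_miss_read(uint64_t gva);

    static inline uint8_t guest_read8(uint64_t gva)
    {
        uint64_t page  = gva >> PAGE_BITS;
        tlb_entry_t *e = &read_tlb[page & (TLB_SIZE - 1)];   /* direct-mapped */

        uintptr_t base = (e->gva_page == page) ? e->host_base
                                               : tlb_miss_read(gva);
        return *(uint8_t *)(base + (gva & PAGE_MASK));
    }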

Nevertheless, the flexibility and ease of instrumentation offered by a software MMU still come at a high cost. Between 23% and 43% of total execution time in a binary translator are attributed to operations in the software MMU [55, 256].

Shadow Page Tables (SPT)

Whenever the processor virtualization is designed for direct execution, memory accesses must also be directly executable by the CPU. A software MMU is thus not an option because it requires explicit invocation. Instead, the CPU has to be configured to natively translate guest virtual addresses. However, activating the current guest page table would break memory virtualization because it does not reproduce the translation from GPAs to HAs and it could give the guest access to arbitrary host memory.


Figure 2.14: A shadow page table (SPT) merges a guest page table with the per-VM host page table and can be directly used by the hardware MMU. Changes must be carefully tracked to keep the SPT up-to-date.

Shadow page tables (SPT) [20] solve this conflict by merging guest page tables with translations from guest physical memory to host physical memory. The concept is illustrated in Figure 2.14. Without shadow page tables, the actual mapping from GVAs to HPAs is accomplished with two page tables6:

1. The guest page table translates from guest virtual to guest physical memory. It is managed by the guest operating system. There is one such page table per guest virtual address space (i.e., usually per guest process).

2. The host page table translates from guest physical to host physical memory. It is managed by the hypervisor. There is one such page table per virtual machine.

Whenever the guest attempts to install a page table – i.e., set the vCPU’s CR3 register on x86 – the VMM intercepts the configuration and creates a shadow page table. To that end, the VMM reads the guest page table and creates for each PTE a shadow page table entry (SPTE) which directly maps to host physical memory. This essentially eliminates the additional level of indirection. The SPTE is set to reflect the intersection of the permissions in the source page tables. The VMM can then install the shadow page table instead of the guest page table and allow the CPU direct execution. On a page fault, the VMM has to trace back the cause to either the guest or the host page table and invoke the appropriate page fault handler.
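The following sketch illustrates how a single shadow PTE could be synthesized from a guest PTE and the per-VM GPA-to-HPA mapping, intersecting the permissions of both levels. It is a simplified 32-bit example with hypothetical helpers, not the algorithm of any particular VMM.

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT (1u << 0)
    #define PTE_WRITE   (1u << 1)
    #define PTE_USER    (1u << 2)
    #define PFN_MASK    0xFFFFF000u

    /* Hypothetical per-VM host mapping: GPA -> HPA plus host permissions.
     * Returns false if the guest physical page is not backed by host memory. */
    extern bool gpa_to_hpa(uint32_t gpa, uint32_t *hpa, uint32_t *host_perm);

    /* Builds one shadow PTE from a guest PTE. A return value of 0 (not
     * present) later surfaces as a page fault that the VMM must attribute
     * to either the guest or the host page table. */
    uint32_t make_shadow_pte(uint32_t guest_pte)
    {
        if (!(guest_pte & PTE_PRESENT))
            return 0;                             /* guest page fault */

        uint32_t hpa, host_perm;
        if (!gpa_to_hpa(guest_pte & PFN_MASK, &hpa, &host_perm))
            return 0;                             /* VMM page fault   */

        uint32_t spte = (hpa & PFN_MASK) | PTE_PRESENT | (guest_pte & PTE_USER);
        if ((guest_pte & PTE_WRITE) && (host_perm & PTE_WRITE))
            spte |= PTE_WRITE;                    /* intersect permissions */
        return spte;
    }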

6 For simplicity, we assume a direct mapping from guest to host physical memory. If the VMM first translates GPAs to HVAs, the shadow page table must also integrate this step.

To maintain shadow page table coherency, the VMM is forced to track all changes to the source page tables and propagate these to the SPT. Since the host page tables are managed by the hypervisor, modifications to these can be easily forwarded to the SPT. The guest page tables, however, can be modified by the guest OS at any time without the VMM taking notice. In practice, three approaches for detecting changes to guest page tables have emerged [18, 35, 49, 240]:

(a) Trap TLB Invalidations For a PTE change to become apparent to the hardware, an operating system has to make sure that the TLB does not contain a stale copy of the PTE by invalidating the TLB entry or flushing the whole TLB. A VMM can trap these operations and synchronize the SPT with the guest page table.

(b) Write Protection The VMM write-protects all guest physical pages in the host page table that contain guest page tables. This way, the write protection is mirrored in the shadow page table and attempts by the guest to modify the guest page table lead to a VMM page fault. The hypervisor can then emulate the modification and propagate it to the SPT. In contrast to the first approach, this method also allows tracking changes to guest page tables that are currently not active and for which the guest OS will not invalidate the TLB. Consequently, the hypervisor can cache and reuse SPTs.

(c) Paravirtualization The guest OS explicitly informs the hypervisor of modi- fications with the help of hypercalls. This approach is much simpler and faster but requires the guest to be adapted for memory virtualization.

An additional challenge with SPTs concerns the coherency of the accessed and dirty bits in the guest page table. When the MMU uses a PTE for address translation it sets the A/D-bits as appropriate. With shadow page tables, the MMU updates the A/D-bits in the SPT instead of the guest page table. The VMM thus has to propagate the A/D-bits back to the guest page table so that page usage information is available to the guest OS.

Although shadow page tables have originally been developed for use in same-ISA virtualization, there are also research projects that have successfully applied the technique to cross-ISA virtualization, which typically suffers from the slow performance of a software MMU [55, 270]. However, since the approach is based on the premise that the host virtual address space can accommodate the guest and provide extra space for the DBT engine and data, a working prototype has only been demonstrated for 32-bit guests on 64-bit hosts.

While shadow page tables allow direct execution of guest instructions, they are still a major performance bottleneck. Agesen et al. state that for many workloads SPTs are responsible for as much as 90% of the guest exits with HAV [23]. This is because page faults always trap into the hypervisor even if they can and have to be handled by the guest OS. In an early study, Adams and Agesen even found

that for workloads that perform much I/O, create processes, or switch contexts rapidly, pure software virtualization with DBT and software MMU outperforms HAV with SPTs [20]. The authors attribute this to the high exit rate and PTE synchronization costs of SPTs.

Second Level Address Translation (SLAT)

The primary challenge with memory virtualization is that the hardware does not natively support the additional level of address indirection between the guest and the host. Software MMUs solve this by leaving the hardware MMU out of the virtualization process, performing the translation of guest virtual addresses entirely in software. Shadow page tables move the address translation from GPAs to HPAs to the construction time of the page tables, allowing final address translation in hardware. However, the reduction step induces high overhead due to page faults and the need for constant synchronization.

The focus of CPU manufacturers has therefore been to improve memory virtualization performance for hardware-assisted virtualization by supporting two levels of address translation natively [21, 32, 125]. Figure 2.15 exemplarily illustrates the concept. In addition to the guest page table, the VMM configures a host page table which translates from GPAs to HPAs. Intel refers to this page table as extended page table (EPT) [125].


Figure 2.15: The hardware MMU natively supports two levels of address translation. For every guest physical address, the MMU performs a walk of the extended page table (EPT), retrieving the host physical address. A single guest page table walk thus requires multiple EPT walks.

To make the EPT visible to the hardware, the hypervisor configures the EPT base pointer in the virtual machine's VMCS, while the guest OS is free to install the guest page table using the conventional control register (e.g., CR3). Whenever the guest then uses a GVA, the hardware MMU translates the address to an HPA without intervention from the VMM. Since each of the levels in the guest page table hierarchy uses guest physical addresses, the hardware MMU has to walk the EPT multiple times per guest page table walk7. Modern processors therefore incorporate a plethora of page table structure caches to successfully mitigate the cost of TLB misses in most cases [171]. A major improvement of EPTs over SPTs is that the hardware MMU is capable of automatically vectoring page faults to the correct page fault handler, depending on whether the fault is caused by a violation in the guest page table or the extended page table – i.e., an EPT violation. This drastically reduces the VM exit rate [23], effectively eliminating VMM interposition from the management of guest page tables. At the same time, the EPT allows the hypervisor to configure protections on guest physical pages to, for example, transparently track dirty pages [72] in a guest-agnostic way.
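The following sketch illustrates why a single guest page table walk triggers multiple EPT walks: every access to a guest page table entry is itself a guest physical access and therefore has to be translated through the EPT first. For brevity, it assumes a two-level guest hierarchy and hypothetical helpers; real 64-bit guest tables and EPTs use four levels each.

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT  (1u << 0)
    #define PFN_MASK     0xFFFFF000u
    #define GUEST_LEVELS 2

    extern bool     ept_walk(uint64_t gpa, uint64_t *hpa);  /* GPA -> HPA       */
    extern uint32_t phys_read32(uint64_t hpa);              /* host memory read */

    /* Translates a guest virtual address to a host physical address. */
    bool translate_gva(uint64_t guest_cr3, uint32_t gva, uint64_t *hpa)
    {
        static const int shift[GUEST_LEVELS] = { 22, 12 };
        uint64_t table_gpa = guest_cr3;

        for (int level = 0; level < GUEST_LEVELS; level++) {
            uint64_t entry_gpa = table_gpa + ((gva >> shift[level]) & 0x3FF) * 4;

            uint64_t entry_hpa;
            if (!ept_walk(entry_gpa, &entry_hpa))     /* one EPT walk per level */
                return false;                         /* EPT violation          */

            uint32_t pte = phys_read32(entry_hpa);
            if (!(pte & PTE_PRESENT))
                return false;                         /* guest page fault       */
            table_gpa = pte & PFN_MASK;
        }

        return ept_walk(table_gpa | (gva & 0xFFF), hpa);  /* final data access  */
    }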

Due to its high performance, ease of use, and the flexibility of using the EPT to transparently implement advanced VMM features, second level address translation has become the standard method for memory virtualization in productive use. Just like hardware-assisted virtualization, the tight hardware integration, however, makes SLAT less suited for research and development as memory accesses cannot be easily instrumented [233].

2.2.3 I/O Virtualization

Besides the virtualization of processors and main memory, the virtualization of I/O devices is the third major pillar of every virtual machine. The decoupling of the logical from the physical device enables time and space multiplexing as well as flexible mapping of I/O resources. The hypervisor has to handle the full range of devices making up or connecting to a computer, such as disks, network adapters, and human interface devices, but also interconnects like PCI. This is where a great amount of the complexity in system VMs originates compared to process virtual machines. The complexity is rooted in the diversity of device types and the ways these devices are interfaced. At the same time, some devices such as graphics adapters or high-speed network interfaces have high performance requirements, requiring a low virtualization overhead (e.g., latency) to be effectively used in a virtual machine. As a result, the type of device eventually determines how it can be practically virtualized and presented to a virtual machine (i.e., V, f, and h).

7 Contrary to what is shown in Figure 2.15, EPTs are only supported for 64-bit address spaces and consequently have four instead of just two levels.

To get a better grasp on the spectrum of I/O devices in a VM, we can classify devices into the following three categories [240, p. 404f]:

• Dedicated Devices This class comprises devices that are solely dedicated to a certain virtual machine. The VMM does not need to employ a mechanism for sharing such a device with other virtual machines or the host. A dedicated device may be entirely virtual (e.g., a network adapter for a virtual network between VMs) or it may correspond to a particular physical device. In the latter case, I/O requests can bypass the hypervisor, allowing performance comparable to native operation.

• Partitioned Devices For a partitioned device, the VMM dedicates a certain amount of a device's resources to a particular virtual machine. The hypervisor may, for instance, split the storage space of a disk into several partitions and assign them to VMs. In contrast to a fully dedicated device, the VMM has to intercept I/O requests so that they target the right partition.

• Shared Devices Some devices can conveniently process requests from multiple independent sources. Network adapters are an example of this. While each virtual machine gets its own virtual network adapter, the physical network adapter in the host is shared on a per-request basis, with the hypervisor routing packets between VMs and the physical device. A subclass of shared devices are spooled devices such as printers.

Being interposed between the hardware on one side and the virtual machine on the other, the hypervisor has to (1) control the physical devices (back end), (2) construct a virtual replica of each device and present it to the VM (front end), and (3) translate requests between these two endpoints.

A type II VMM relieves itself from the necessity to actually drive the physical devices in the back end as it employs a host operating system for this task. This way, it can build on the abstractions exported by the OS. An intuitive example is the use of the host OS's file system. The hypervisor represents the virtual disk as a regular file instead of using whole disks or disk partitions which it has to manage on its own [49]. While this makes type II hypervisors more flexible and increases the range of supported hardware, I/O requests have to traverse additional layers of abstraction. Consequently, I/O performance can be worse than in type I hypervisors, which interface physical devices directly [222, 247, 265]. For entirely virtual devices, control of a physical device is obviously not required. However, this does not obviate the need for a back end component; a virtual network adapter, for instance, still has to transmit packets across virtual machine boundaries to other adapters.

The virtual representation of a device in the front end can be constructed so that it complies with the hardware interface specification – i.e., it faithfully mirrors

all device registers and device memory. The guest operating system then uses the same mechanisms to communicate with the device as on a physical machine, namely memory-mapped I/O (MMIO) and port-mapped I/O (PMIO). The physical address space in a real computer is not solely determined by main memory, but rather represents a composition of RAM and various devices' memory and registers (e.g., graphics memory, registers of controllers, etc.). Thus, in addition to what is generally perceived as virtual memory, another level of memory virtualization can be found at the layer of the physical address. Communicating with memory-mapped devices then boils down to ordinary reads and writes in MMIO regions. PMIO, in contrast, often uses specific instructions for performing I/O (e.g., the IN and OUT instructions on x86 [125]) and targets a different address space than the general physical one. In any case, the device may report events to the operating system by setting a specific result register on which the CPU has to poll, or by firing an interrupt.

A virtual machine replicating the hardware interface thus has to support MMIO and PMIO as well as interrupts, including the implementation of a virtual interrupt controller. Whenever the vCPU accesses an MMIO or PMIO address the VM has to trap into the hypervisor. For a hardware-assisted virtual machine, this can be accomplished by invalidating the pages of MMIO regions in the EPT so that accesses cause a VMM fault. In addition, the VMCS is configured to trap PMIO. A binary translator, on the other hand, can inspect the physical address at run time and route the access appropriately. PMIO instructions are translated to invocations of corresponding helper routines. This way, the request eventually reaches the back end driver, where it interacts with the respective physical device or the host OS (e.g., to perform a read in a virtual disk file). The VMM signals the completion of the I/O request to the VM by changing the value of a device register in the front end and by potentially injecting a virtual interrupt into the VM.
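To make the dispatch step between the trap and the device emulation more concrete, the following sketch shows how a VMM could route a trapped MMIO write to a registered front-end device model. The registration structure and function names are hypothetical.

    #include <stdint.h>

    /* Hypothetical front-end device model: callbacks the VMM invokes when the
     * guest accesses the device's MMIO window. */
    typedef struct {
        uint64_t  base;                 /* guest physical base of the window */
        uint64_t  size;
        void    (*write)(void *dev, uint64_t offset, uint64_t val, int len);
        uint64_t (*read)(void *dev, uint64_t offset, int len);
        void     *dev;                  /* device instance state             */
    } mmio_region_t;

    #define MAX_REGIONS 32
    static mmio_region_t regions[MAX_REGIONS];
    static int num_regions;

    /* Called from the VM exit (or software MMU) path for a trapped MMIO write. */
    int dispatch_mmio_write(uint64_t gpa, uint64_t val, int len)
    {
        for (int i = 0; i < num_regions; i++) {
            mmio_region_t *r = &regions[i];
            if (gpa >= r->base && gpa + len <= r->base + r->size) {
                r->write(r->dev, gpa - r->base, val, len);
                return 0;               /* handled; the VMM resumes the guest */
            }
        }
        return -1;                      /* no device claims this address      */
    }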

While this approach makes the virtual device compatible with native drivers in an unmodified guest OS, a hardware-compliant device model is rather complex to implement. Furthermore, because the hardware interface has generally not been designed to support efficient virtualization, there is considerable overhead in I/O operations. The legacy IDE interface, for example, uses 8-bit port I/O to communicate with the disk controller, requiring repeated traps into the VMM for transmitting the sector number, buffer address, and length [265]. An alternative is to implement a paravirtualized device model where the front end uses a hypercall-based shared-memory interface, optimized for efficient I/O virtualization [35]. Special guest drivers add the support for such an interface without the need for adaptation of the guest operating system [136]. Nevertheless, a generic hardware interface is usually still available to support the installation of guest operating systems and facilitate the guest OS boot phase, where paravirtualized drivers are not loaded yet.

A drawback of a paravirtualized interface is that it is specific to a certain hypervisor. In response to this, a standardized interface between device front ends and back ends has been published under the name Virtio [226].

Besides the selection of the device model, numerous other techniques have been developed to improve virtual machine I/O performance. Intel released a CPU feature called APICv [125, 193], which virtualizes the interrupt-related states and APIC registers in the VMCS. This removes the need to emulate the APIC in software when employing hardware-assisted processor virtualization. Similar solutions have also been introduced by other hardware vendors such as AMD [122] and ARM [73]. With single-root I/O virtualization (SR-IOV) [202], physical devices natively support virtualization by exporting virtual instances. The hypervisor can treat these as dedicated devices, thereby removing VM exits for I/O reads and writes. Direct Interrupt Delivery (DID) [259] combines SR-IOV and APIC virtualization to directly deliver interrupts to the guest, effectively eliminating VM exits due to I/O operations.

2.2.4 Conclusion and Terms

The architecture of a system virtual machine is defined by the techniques employed for the virtualization of the processor, the main memory, and I/O devices. While different virtualization approaches can be used in each field, they are strongly interwoven in practice, forming three general types of system virtual machines:

                   Processor                Memory                 I/O                                        Examples
                   Virtualization           Virtualization         Virtualization
Software           Interpretation, DBT      Software MMU           Software Virtual Devices                   Bochs [2], QEMU [38]
Direct Execution   Direct Execution, DBT    Shadow Page Tables     Software Virtual Devices                   VMware Workstation [49]
Hardware           Hardware-assisted        Extended Page Tables   Software Virtual Devices, SR-IOV, APICv    Hyper-V [173], KVM [139]

Table 2.1: Overview of System Virtual Machines

Software techniques for system virtualization provide the best support for cross-ISA virtualization and transparent instrumentation. They therefore lend themselves to research and development, where a detailed insight into the execution of a system is required. However, the flexibility provided by software solutions comes at a high performance cost, degrading execution speed by up to multiple orders of magnitude compared to a native installation. Furthermore, although software techniques are designed to mimic hardware behavior, they always remain approximations of the true hardware implementations. Besides the change in timing, a study by Martignoni et al. [167] revealed thousands of deviations in


Figure 2.16: Software techniques for system virtualization provide the best support for instrumentation, but suffer from slow virtualization performance and approximated system behavior. Hardware-assisted virtualization offers the best performance and realistic behavior, but little instrumentation capabilities.

the behavior and support of instructions, exceptions, and the computation of CPU flags in popular software virtual machines. These deviations are visible to the guest and can thwart the inspection of applications in the VM [212, 287] and ultimately question the accuracy of results obtained from a software virtual machine. Conversely, hardware-assisted virtual machines on the other end of the spectrum offer near native performance as well as realistic timing and execution behavior. However, they achieve this by removing software interposition where possible, which in turn deprives hardware virtual machines of the capability of detailed and flexible instrumentation, creating a dilemma for researchers and developers.

In the remainder of this work, we will use the term emulation to denote the execution within a software virtual machine (i.e., interpretation, DBT, software MMU, etc.) and differentiate it from the execution within a hardware-assisted virtual machine (i.e., HAV, EPTs, etc.).

2.2.5 Case Study: QEMU/KVM

QEMU [38] is a popular open source type II hypervisor with frequent use in research and development. QEMU can be configured to emulate a process virtual machine or a system virtual machine with support for a wide range of guest and host architectures, including Alpha, ARM, x86, PowerPC, MIPS, and others. The system virtual machine comes with a broad set of emulated virtual devices, allowing users to set up a rich execution environment. With Kernel-based Virtual Machine (KVM) [139], a Linux kernel module allowing privileged access to hardware virtualization extensions, QEMU can be seamlessly extended to run hardware-assisted system virtual machines.


Figure 2.17: Each virtual machine receives its own independent QEMU process running on a host OS. While emulation (left) runs entirely in user mode, hardware-assisted virtualization (right) uses Kernel-based Virtual Machine (KVM). In any case, QEMU organizes and maps the guest physical memory and manages virtual devices.

Selecting between the integrated emulation engine and KVM is as simple as adding the --enable-kvm command line argument, which instructs QEMU to hand over a great part of processor and memory virtualization to KVM. This versatility makes QEMU particularly interesting in the scope of this thesis, where we aim to combine hardware-assisted virtualization and emulation into a fast but flexible full system analysis platform.

Being a type II hypervisor, QEMU runs as a regular user-mode process on top of a host operating system. Every instance of QEMU runs a single independent virtual machine. The lifetime of a virtual machine thus corresponds to the lifetime of the respective QEMU process. There is no central control service such as the configuration store in Xen [35], which simplifies the overall design.

QEMU processes that have hardware-assisted virtualization activated share the KVM kernel module but receive distinct instances of the kernel-level data structures representing a virtual machine. QEMU communicates with KVM through I/O controls8. It accesses the /dev/kvm device file to instantiate a new virtual machine [139]. KVM then creates additional device files to handle requests directed to a particular VM or vCPU [138].
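To make this interface concrete, the following minimal sketch shows how a user-mode hypervisor obtains the per-VM and per-vCPU device files from KVM. It follows the public KVM ioctl API but is not QEMU's actual code; error handling is omitted.

/* Minimal sketch of KVM's ioctl interface (error handling omitted).
 * Not QEMU code, but it follows the documented KVM API. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm  = open("/dev/kvm", O_RDWR | O_CLOEXEC); /* global KVM device file  */
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0UL);       /* creates a per-VM file   */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0UL);      /* creates a per-vCPU file */

    /* Subsequent ioctls on vm and vcpu configure memory slots and registers
     * and eventually run the guest (see the sketches further below). */
    (void)vcpu;
    return 0;
}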

To control the execution of the virtual machine, each QEMU process offers a command line interface called QEMU monitor. It can receive instructions from the user at run time, for example, to suspend or resume the VM. The monitor is driven by the main thread.

8 An I/O control (ioctl) is a system call performed on special files in the operating system to trigger device-specific actions beyond the semantics of read and write [254, p. 771]. An ioctl comprises a control code, identifying the requested operation, and a set of operation-specific arguments. The kernel vectors the ioctl to the handler function of the module that created the device file.

Processor Virtualization

Latest versions of QEMU create one host thread for every virtual processor core, irrespective of whether emulation or hardware-assisted virtualization is used9. The virtual CPUs are thus subject to the scheduling in the host operating system; however, preemption is disabled as long as the vCPU thread is in VMX non-root mode.

The emulation mode in QEMU uses a dynamic binary translator, called Tiny Code Generator (TCG). Translation blocks are based on basic blocks only, so TCG does not translate across jumps or branches. However, TB chaining is employed to improve emulation performance. Since QEMU supports a wide range of guest and host architectures, implementing an n:m translation engine would be cumbersome. Instead, QEMU implements a two-stage pipeline which first translates guest code into a RISC-like intermediate language. In the second phase, the intermediate language is then translated to the host architecture. This effectively decouples the guest and host architectures and, in addition, creates an ISA-agnostic environment in which to apply optimizations and instrumentation.

QEMU is geared toward high execution performance. To achieve this, the translator makes use of various optimizations. For example, while a physical CPU computes flags (e.g., overflow, carry, zero, etc.) instantaneously, QEMU postpones flag computation until the flags are needed – for instance, in a conditional jump or when the vCPU's flags register is pushed onto the stack [38]. Another optimization is the lazy update of the instruction pointer. Instead of writing the IP after every instruction, only a single update is made at the end of a TB. As with most translators, interrupt delivery occurs at the boundary of TBs only [287]. While these and other optimizations equip QEMU with one of the fastest execution engines, this does come at a cost: QEMU lacks support for an accurate instruction counter, there is no instruction or memory tracing facility (not even an easy-to-use hooking facility as provided by Bochs [2]), the emulation cannot be extended with a timing model such as in Simics [163] or gem5 [41], and according to Martignoni et al. [167] QEMU suffers from numerous deviations in its execution behavior compared to a physical CPU.

When QEMU runs with KVM enabled, the entire emulation engine lies fallow and processor virtualization is handed over to KVM. For this purpose, a vCPU thread enters the KVM kernel module with an ioctl on the corresponding vCPU's device file. In the kernel, the general execution flow is comparable to what is depicted in Figure 2.10. However, if an exit from the guest cannot be handled in KVM (e.g., an MMIO access), the vCPU thread returns from the ioctl with a respective error code and leaves solving the situation to the user-mode QEMU process.

9 The version used in this thesis employs only one host thread in emulation mode and switches between virtual CPUs based on a round-robin scheme.
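The user-mode half of the exit handling described above can be sketched as follows. The snippet assumes the vCPU file descriptor created via KVM_CREATE_VCPU; the structure fields follow the public KVM API, but the dispatch logic is purely illustrative.

/* Sketch of a per-vCPU run loop: KVM_RUN enters the guest and returns
 * whenever KVM cannot handle an exit in the kernel (illustrative only). */
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

void run_vcpu(int kvm, int vcpu)
{
    /* The kvm_run structure is shared with the kernel via mmap. */
    int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0UL);
    struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0UL);            /* enter VMX non-root mode   */
        switch (run->exit_reason) {
        case KVM_EXIT_MMIO:
            /* Route run->mmio.phys_addr / run->mmio.data to the
             * corresponding virtual device in user mode. */
            break;
        case KVM_EXIT_HLT:                    /* guest halted              */
            return;
        default:                              /* I/O ports, shutdown, ...  */
            break;
        }
    }
}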

Memory Virtualization

QEMU organizes the mapping of MMIO regions and RAM in the guest physical memory as a directed acyclic graph of memory regions (see Figure 2.18). A memory region associates a range of guest physical addresses with RAM, ROM, or MMIO. Areas that hold actual memory rather than device registers are backed by RAM blocks. A RAM block is an allocation of host virtual memory in the address space of the QEMU process. The PC.RAM block represents the guest's physical memory, allowing guest reads and writes directly from within the QEMU process with regular memory operations. This also reduces the translation from GPAs to HVAs to the addition of an offset. Further RAM blocks are allocated for the BIOS, device ROMs, and video memory. Accesses to MMIO regions are routed to handlers in the virtual devices.

The layout of the guest physical address space is managed by QEMU even if KVM is enabled. However, KVM replicates the mapping in the extended page table – or the shadow page table in legacy systems – to make the memory accessible in hardware-assisted execution. To that end, KVM mirrors each required memory region from QEMU in a data structure called a memory slot. Since the virtual address space of QEMU and thus also the memory of the virtual machine is subject to the memory management of the host operating system, KVM provides callbacks for notifications of paging operations governed by the host OS. In this way, the extended page table remains coherent.
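A sketch of how such a region can be registered as a KVM memory slot is shown below; the slot number and the anonymous host allocation are arbitrary choices for illustration, not QEMU's actual implementation.

/* Sketch: back a guest physical RAM region with host virtual memory and
 * register it as a KVM memory slot (arbitrary slot number, no error checks). */
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

int map_guest_ram(int vm, uint64_t gpa, size_t size)
{
    void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  /* RAM block */

    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = gpa,                       /* where the guest sees it */
        .memory_size     = size,
        .userspace_addr  = (uint64_t)(uintptr_t)hva,  /* GPA -> HVA = offset     */
    };
    return ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);
}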

Figure 2.18: The guest physical address space is a composition of RAM, ROM, and MMIO regions – for example, PC.RAM, the VGA low memory window, PC.BIOS, PC.ROM, the ISA BIOS, VGA.VRAM, VGA.ROM, the VGA MMIO area, and the APIC, HPET, and BIOS ranges. While RAM and ROM areas are backed by actual host memory through RAM blocks, MMIO regions cause a VM exit and the hypervisor routes the access to the respective virtual device.

I/O Virtualization

To asynchronously perform I/O operations, for example in the context of direct memory access (DMA), QEMU maintains a dedicated I/O thread. When a vCPU thread accesses a device register to issue an I/O operation, the access causes a VMM fault (EPT violation). The vCPU thread then exits the KVM kernel module because it is unable to handle the exit reason. In user mode, QEMU directs the execution to the virtual device, which creates an asynchronous I/O request. This is, for instance, a read of the virtual disk that the device implementation translates into a corresponding file system read (see Figure 2.17). Meanwhile, the vCPU thread can continue execution of guest instructions. When the I/O operation completes, the I/O thread kicks the vCPU thread out of the guest to inject a virtual interrupt. Since a transition from VMX non-root mode all the way up to user mode in VMX root mode is expensive, KVM provides its own implementation of some devices which are accessed very frequently – in particular the APIC. This concept has already been successfully used in other hypervisors [247].

In research and development, it is often necessary to maintain a read-only virtual disk so that repeated experiments always start with exactly the same system configuration. QEMU supports this mode with the -snapshot command line argument, which redirects disk writes to a temporary memory overlay that QEMU dismisses when the VM terminates.

2.3 Checkpointing

An advantage of virtual machines is the decoupling of the system state from the physical hardware. This allows the VMM not only to monitor but also to save the live state of a virtual machine in a checkpoint [57, 222]. Based on this checkpoint, the VMM can restore the virtual machine at a later time, potentially even on a different host. Virtual machine checkpoints are therefore the foundation for many technologies that make virtual machines so useful in the first place. Examples are VM migration [44, 66, 189, 230, 282] to optimize the load in and across data centers or to balance availability and cost in the IaaS spot market [236], VM replication for fault tolerance [72, 253, 261], debugging and forensics [137, 252], and fast spawning of virtual machines [146, 264]. Checkpointing is also a key technology in our method for accelerating functional full system simulation.

To successfully restore a virtual machine, a checkpoint must comprise all volatile state. This can be broken down into (1) the state of all virtual devices (e.g., CPU registers), (2) the complete contents of the guest physical memory, and (3) the contents of storage media. If the virtual machine uses a read-only disk image, the latter may be reduced to the modified sectors. This usually makes the guest physical memory the largest item [189].

The difficulty in taking a checkpoint is that the VMM has to capture all this data in a way that creates a consistent image of the virtual machine at a defined point in time. The simplest approach is to stop the VM for the duration of the checkpoint so that no modifications can occur. Afterward, the VM can resume execution. This method is known as stop-and-copy (SnC) [66]. An obvious drawback is the long suspension time inherent in the concept, during which the virtual machine cannot provide services. Taking a checkpoint with stop-and-copy can disrupt a virtual machine for up to multiple seconds [66]. Kangarlou et al. [134], for example, measured 8.5 s for a small guest with only 600 MiB of RAM.

This, however, is only one metric to characterize a checkpointing solution. A more comprehensive evaluation comprises the following parameters (based on [296]):

• Downtime: The duration in which the virtual machine is suspended to retrieve a consistent image. Services are unavailable during this phase. A downtime below 100 ms is generally not perceived by humans [178] and does not break a virtual machine's network connectivity [72].

• Performance Degradation: The slowdown of the virtual machine caused by any asynchronous operations or maintenance of metadata for the purpose of checkpointing. This metric may be measured as a change in, for example, the run time (kernel build) or the number of transactions per second (database). This may also be more generally referred to as the probe effect – the divergence in execution with and without checkpointing.

• Checkpoint Time: The total time to create and save or transmit a checkpoint.

• Checkpoint Size: The total amount of data saved or transmitted.

• Host Overhead: The additional CPU and memory costs on the host.

Over the years, extensive research has been invested to improve the performance of checkpointing solutions (e.g., reduce the downtime) and adapt the technology to various applications. Different optimization techniques can be applied to tailor the characteristics of the checkpointing mechanism to the scenario at hand.

In the following, we take a brief look at commonly used approaches. Although we focus on the guest physical memory, the same techniques apply to secondary block storage such as disks. For simplicity, we will refer to guest physical pages simply as pages.

2.3.1 Pre- and Post-Copy

One of the most frequent scenarios where checkpointing technology comes into action is the migration of virtual machines between different physical hosts. Since the migration should usually be transparent to users of the virtual machine, the most important metric in this scenario is the downtime.

The prevalent migration technique in hypervisors today is pre-copy [255]. Originally designed to migrate single processes, it has found its way into virtual machine migration [66, 189]. The core idea is to split the migration into two major phases: In the first phase, the VMM asynchronously copies the guest physical memory to the destination host. During this stage the virtual machine is still running and able to modify memory. On completion, pre-copy repeats the operation, but only saves the guest physical pages that have been modified during the last iteration10. The process loops for several rounds. While in the first round all pages are saved, the number of pages per iteration decreases and eventually converges to the write working set (WWS) [66] of the virtual machine, which is usually considerably smaller than the entire guest physical memory. The algorithm enters the second phase after a certain number of rounds has been reached or the number of modified pages falls below a threshold. The virtual machine is then suspended and the state of devices (e.g., CPUs) as well as the residual pages are synchronized. Execution can afterward be resumed on the destination machine. By migrating to a fake target on the same host, the technique can also be used to take local checkpoints of a virtual machine [54, 134, 252].

Pre-copy is effective in reducing downtime compared to stop-and-copy. However, the downtime is highly dependent on the page modification rate of the workload because the rate determines the number of pages in the last round. Clark et al. [66], for instance, report 60 ms for a popular game server, 210 ms for SPECweb [15], but 3.5 s for a custom memory stress tool. Nathan et al. [188] measured 28 ms for a file server, 35 ms for RUBiS [13], but 380 ms for a Linux kernel compile. Especially memory intensive applications such as HPC workloads, where memory pages are dirtied faster than they can be transferred over the network, have been reported to suffer from downtimes of multiple seconds [123]. This is even the case for VMs as small as 156 MiB of RAM [158].

Pre-copy trades a higher total migration time as well as a higher total amount of transmitted data for a lower downtime. Since it asynchronously copies and transmits memory, it can also noticeably degrade the performance of the virtual machine through memory and network contention [123]. In consequence, various pre-copy approaches apply dynamic rate-limiting [44, 66, 188] or adjust the termination criteria based on the page modification rate to initiate the stop-and-copy phase when no further reduction in downtime can be expected [123, 282]. In this process, performance models of pre-copy migration aid in predicting the non-trivial dynamics [181, 186].

Another optimization targets the order in which dirty pages are transferred. Due to spatial locality, it is likely that a cluster of pages that will also be modified in the near future forms around dirty pages. Clark et al. [66] therefore copy pages in

10 Dirty pages are detected by write-protecting guest physical memory. An attempt to modify a page triggers a VMM page fault which marks the page as dirty (e.g., in a dirty bitmap) and grants write permission again. See Chapter 6 for a discussion of dirty logging methods.

pseudo-random order to reduce page re-transmits. Svärd et al. [251] reorder pages by assigning page priorities, where less frequently modified pages receive higher priority and are copied first. Moghaddam et al. [181] employ a memory change probability density function. An effective alternative to reordering is to skip saving pages altogether for the iterations in which they are hot [187, 188].

Post-copy [113] immediately stops the source VM and activates a copy of the virtual machine on the destination host without having transferred the memory state first. Instead, the copy operation runs concurrently with the resumed execution in the target VM. If in this course the destination host accesses a missing page, the target VM experiences a VMM page fault and the hypervisor retrieves the respective page from the source VM.

In contrast to pre-copy, post-copy transfers every page only once, reducing the total migration size. Since post-copy immediately moves the execution to the destination host, the technique is well-suited to quickly relocate VMs on sudden load peaks [114]. This, however, comes at the cost of degraded performance due to demand paging. To reduce page faults, Hines et al. propose an adaptive pre-paging of frequently accessed data [113]. A similar technique is used in VMFlockMS [27] to facilitate an early application resume. Jettison [40] can tolerate the cost of demand paging because it is specifically designed to consolidate idle virtual machines, which intrinsically generate only a few page faults.

Since post-copy is a pure migration technique, it cannot be used to create local checkpoints.

2.3.2 Data Exclusion

Pre- and post-copy perform a major part of the checkpointing work concurrently with the execution of the virtual machine. Both approaches thus improve the downtime because they considerably reduce the amount of data that needs to be saved while the VM is suspended. This method can be extended by identifying pages in guest physical memory that do not need to be saved for a successful restore of the virtual machine in the first place. These pages can then be excluded from the checkpoint entirely to reduce the checkpoint size and thereby the downtime. To what extent data exclusion comes with an increased total checkpoint time or induces run-time overhead depends on the method to detect and track eligible pages for exclusion.

Natural targets for memory exclusion are free pages – i.e., guest physical memory that is not assigned to processes or the kernel – and the file system cache, together taking up on average between 50% and 80% of the physical memory [70, 198]. Therefore, excluding these pages from checkpoints has been extensively researched [26, 62, 70, 113, 131, 143, 180, 198].

Hines et al. [113] employ a ballooning driver in the guest operating system, which, when invoked by the checkpointing mechanism, allocates as much memory as possible, forcing the guest OS to use up all free pages and eventually evict pages from the file cache. Since these pages are allocated to the ballooning driver and hence are not used by any other component in the guest, the checkpointing mechanism can safely ignore them. A drawback of this and other paravirtualized methods [26, 117, 198] is that they require a specifically prepared guest. Geiger [132] infers the semantics of guest physical pages without guest intervention by tracking disk accesses, page faults, and page table updates11. Park et al. [198] detect cache pages by intercepting I/O in the hypervisor but require paravirtualization to identify free pages. A similar technique is utilized by Jo et al. [131] to leverage shared storage in live migration. The use of debugging information has also proven effective in closing the semantic gap and has been expedited by multiple research projects [62, 70, 143]. Koto et al. [143] used this method to additionally exclude empty slab caches in the Linux kernel.

Whenever the use case demands periodic checkpointing, for example, to allow rollback recovery on system failure, the checkpointing mechanism can leverage the fact that unchanged data has already been saved by a previous checkpoint. This data can consequently be excluded from the current checkpoint. The concept is known as incremental checkpointing [204] because each checkpoint contains only the modifications since the last one. This makes the performance of incremental checkpointing highly dependent on the workload (i.e., the modification rate) and the interval length. The technique is very similar to the iterative copy rounds in pre-copy but differs in that each pre-copy round does not represent a consistent delta of the guest’s physical memory image.
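As an illustration, an incremental checkpoint can be built on KVM's dirty logging facility. The sketch below assumes that the memory slot was created with the KVM_MEM_LOG_DIRTY_PAGES flag and that save_page() is a hypothetical function appending a page to the checkpoint.

/* Sketch of an incremental checkpoint using KVM's dirty logging. Assumes the
 * memory slot was registered with KVM_MEM_LOG_DIRTY_PAGES; save_page() is a
 * hypothetical function that appends a page to the checkpoint. */
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

void save_page(uint8_t *hva, uint64_t gfn);   /* hypothetical */

void incremental_checkpoint(int vm, uint32_t slot, uint8_t *ram, size_t pages)
{
    uint64_t *bitmap = calloc((pages + 63) / 64, sizeof(*bitmap));
    struct kvm_dirty_log log = { .slot = slot, .dirty_bitmap = bitmap };

    /* Retrieve (and reset) the per-slot dirty bitmap maintained by KVM. */
    ioctl(vm, KVM_GET_DIRTY_LOG, &log);

    for (uint64_t gfn = 0; gfn < pages; gfn++)
        if (bitmap[gfn / 64] & (1ULL << (gfn % 64)))  /* modified since the  */
            save_page(ram + gfn * 4096, gfn);         /* previous checkpoint */

    free(bitmap);
}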

King et al. [137] create an incremental checkpoint every 25 s to enable time-traveling and reverse debugging. Remus [72] and Paratus [86] utilize incremental checkpointing to periodically synchronize a backup host for high availability. The checkpointing interval in Remus is in the range of 25 ms to 100 ms, which is why it comes with a noticeable performance degradation of up to over 100%. Slightly better results have been reported for Kemari [253], which creates incremental checkpoints whenever the VM is about to send an event to an external device instead of using a fixed frequency. VM-µCheckpoint [269] keeps two incremental checkpoints in memory to protect VMs against transient failures. The authors also integrated a prediction of dirty pages to reduce the number of page faults. This combination has also been applied in VPC [161] for consistently checkpointing a cluster of virtual machines. Reported downtimes for a single VM with 512 MiB RAM range from 55 ms for an idle system up to 187 ms for Apache [1]. Unfortunately, the authors provide no information on the checkpointing frequency used in their experiments. Incremental checkpointing on a sub-page level has been demonstrated by Lu et al. [160], showing a reduction in checkpoint size between 2.7x and 4.5x when moving from a block size of 4 KiB to 64 bytes.

11 While tracking page faults and page table updates does not incur any considerable cost with shadow page tables, extended page tables make this approach unattractive because guest page faults do not trap in the hypervisor.

2.3.3 Data Deduplication

The concepts presented so far achieve improved checkpointing performance by asynchronous processing, reordering, and exclusion of guest physical memory. However, they remain on the structural level of the guest memory and do not leverage the potential that lies within the contents themselves. Numerous studies have revealed the high amount of duplicated memory within and across virtual machines and proposed methods to deduplicate redundant pages [36, 107, 177, 179, 266, 283]. This potential can also be leveraged in checkpointing to improve the performance in the same way as the exclusion of memory does – by reducing the number of pages to be saved.

CloudNet [282] uses memory deduplication to speed up pre-copy live migration. It identifies redundant memory by hashing fixed-size blocks12 with SuperFastHash [16] and maintaining a synchronized FIFO cache at the source and destination hosts. If CloudNet encounters a hash match, it verifies equality with a regular memory compare (memcmp()) and sends only a 32-bit index into the cache instead of the actual data.
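The following sketch outlines such hash-based page deduplication. A simple FNV-1a hash stands in for SuperFastHash, and the synchronized FIFO cache as well as the send functions are assumed interfaces rather than CloudNet's actual implementation.

/* Sketch of hash-based page deduplication in the spirit of the scheme above.
 * FNV-1a stands in for SuperFastHash; the cache_* and send_* functions model
 * the synchronized FIFO cache and the transport and are assumptions. */
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096

const uint8_t *cache_lookup(uint32_t hash, uint32_t *index); /* assumed */
void cache_insert(uint32_t hash, const uint8_t *page);       /* assumed */
void send_index(uint32_t index);     /* transmit 4 bytes instead of 4 KiB */
void send_page(const uint8_t *page);

static uint32_t hash_page(const uint8_t *p)
{
    uint32_t h = 2166136261u;                /* FNV-1a, illustrative only */
    for (size_t i = 0; i < PAGE_BYTES; i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

void dedup_and_send(const uint8_t *page)
{
    uint32_t index;
    uint32_t h = hash_page(page);
    const uint8_t *cached = cache_lookup(h, &index);

    if (cached && memcmp(cached, page, PAGE_BYTES) == 0) {
        send_index(index);                   /* hash hit, verified by memcmp */
    } else {
        cache_insert(h, page);
        send_page(page);                     /* full page contents           */
    }
}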

The authors of CloudNet report a duplication rate between 13% and 60%13 for a Linux kernel compile and SPECjbb [14], respectively. An evaluation of memory deduplication in pre-copy live migration by Nathan et al. [187] has shown an average reduction of transmitted data by 17% for a number of different benchmarks. Both studies confirm only slight improvements for sub-page (e.g., 1 KiB) deduplication granularity. Nathan et al. also note the high computational overhead of 11x incurred by hashing. However, they use SHA-1, which is a 160-bit cryptographic hash function, in contrast to the simple 32-bit hash used in CloudNet14.

Memory deduplication is most effective when multiple VMs with homogeneous configurations are considered for deduplication [27, 266]. The technique is therefore particularly popular with solutions that checkpoint or migrate virtual machine clusters [27, 79, 118, 216], showing a data reduction of 30% to 80% for certain workloads (e.g., Sysbench [17]) [79]. As with single VM deduplication, increasing the granularity delivers only moderate improvements.

12 The authors also experimented with Rabin fingerprints over a sliding window, but found them to be much slower without substantially improving the redundancy detection rate [282]. This has also independently been confirmed by Hou et al. [118].
13 The majority of the 60% redundant data is from all-zero pages.
14 Hash collisions potentially decrease the deduplication rate but do not lead to data corruption due to the subsequent memory comparison.

Chiang et al. [62] use virtual machine introspection to deduplicate free guest physical pages to an all-zero page, irrespective of the actual contents of the free pages. A comparable approach has been adopted by Sapuntzakis et al. [230] who actively zero free memory prior to deduplication.

A high degree of content similarity can also be found on the disk [142, 177, 179]. Deduplication is thus equally well-suited when checkpointing disk data.

2.3.4 Data Compression

The most popular approach to reducing the size of a checkpoint based on its contents is the application of a data compression algorithm such as gzip [4]. As data compression is conceptually orthogonal, it can be freely combined with other techniques and is a good baseline measure. Since memory pages generally possess a high number of zero bytes [90] as well as a high degree of similarity in non-zero words [130], the compression ratio is generally good, with gzip reducing the size by more than 60% in most migration scenarios [27, 118, 144].

As with incremental checkpointing, data compression can benefit from a checkpoint history. Delta compression exploits the fact that most memory pages experience only small modifications between periodic checkpoints (or copy rounds in pre-copy). Consequently, it can be more efficient to store the difference – the delta D – between the current contents C and the last contents – the reference R – of a page. The delta then contains mostly zero bytes which can be compressed with run-length encoding (RLE). The difference is usually computed with an XOR (⊕) operation, R ⊕ C = D:

92 AA 2C 62     92 AA 2C 62     00 00 00 00
F0 EA 56 78  ⊕  F0 B6 03 78  =  00 5C 55 00
35 5C D6 3D     35 5C D6 3D     00 00 00 00
D2 40 33 92     D2 40 33 92     00 00 00 00

A possible RLE of D in this example could be 05 02 5C 55 09 (almost 70% reduction), where 05, 02, and 09 alternately encode the number of successive zero and non-zero bytes, and 5C 55 are the literals of the non-zero run. Decompression first unpacks the run-length encoding and then computes C = R ⊕ D. The combination of XOR and RLE is also known as XORed Binary Run-Length Encoding (XBRLE) [108].

Delta compression has found widespread use in migration and checkpointing projects [27, 100, 108, 180, 187, 213, 250, 282, 296]. Although it is not included in Remus [72], the authors propose to use delta compression for reducing the size of incremental checkpoints. A first evaluation employed an address-indexed LRU cache of 8192 previously saved pages and revealed an average 70% reduction in size. In a hybrid compression mode, the prototype switches to gzip if the compression ratio with delta compression falls below a certain threshold. The authors report a saving of 90%, compared to 80% for gzip alone. This suggests that delta compression is not effective for all pages (i.e., heavily modified pages) but can nevertheless significantly reduce checkpoint size and notably improve compression when combined with a general-purpose compression algorithm.
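Going back to the basic scheme, a compact sketch of XOR-plus-RLE compression for a single page might look as follows. The output format – alternating run counts, with the literals of each non-zero run appended – is a simplification for illustration and not the exact encoding used in the cited works.

/* Sketch of XOR-based delta compression with run-length encoding (in the
 * spirit of XBRLE) for one 4 KiB page. Output: alternating counts of zero
 * and non-zero bytes, each non-zero count followed by its literal bytes.
 * Simplified, illustrative format; out must hold up to 2 * PAGE_BYTES. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096

size_t xbrle_compress(const uint8_t *ref, const uint8_t *cur, uint8_t *out)
{
    uint8_t delta[PAGE_BYTES];
    for (size_t i = 0; i < PAGE_BYTES; i++)
        delta[i] = ref[i] ^ cur[i];        /* mostly zero for small changes  */

    size_t o = 0, i = 0;
    while (i < PAGE_BYTES) {
        size_t run = 0;                    /* run of zero bytes (max 255)    */
        while (i < PAGE_BYTES && delta[i] == 0 && run < 255) { run++; i++; }
        out[o++] = (uint8_t)run;

        size_t start = i;                  /* run of non-zero bytes (max 255)*/
        run = 0;
        while (i < PAGE_BYTES && delta[i] != 0 && run < 255) { run++; i++; }
        out[o++] = (uint8_t)run;
        while (start < i)
            out[o++] = delta[start++];     /* literals of the non-zero run   */
    }
    return o;                              /* compressed size in bytes       */
}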

Hou et al. [118] measured similar compression ratios for delta compression and gzip – delta compression even surpassing gzip with 92% versus 80% for RUBiS [13]. However, they also examined different checkpoint frequencies (from 25 ms to 5 s) and found that delta compression considerably loses effectiveness with increasing interval length, for example, falling from 60% to 20% for FFmpeg [3]. This observation is plausible, as pages accumulate more changes over larger intervals, thereby creating noisier deltas. Likewise, the size of the reference page cache must be chosen appropriately.

Delta compression can also be combined with data deduplication. MDD [296] uses the fingerprinting function from Difference Engine [107] to find pages with high similarity, for which MDD then only saves a compressed delta. As before, identical pages can be specified with an index only. MDD hence extends pure delta compression because the reference page is addressed by content similarity instead of the GPA. Deshpande et al. [79] describe the same technique for live gang migration of virtual machines. Gerofi et al. [100] diverge from the method by utilizing a density-based hash function for similarity detection.

Other compression algorithms have also found their way into checkpointing. VMFlockMS [27] uses zlib [96]. Hacking and Hudzia [108] describe PDelta, a modified version of zdelta [257], itself being a delta compression technique based on zlib. Jin et al. [130] propose an adaptive algorithm which compresses pages with either RLE, WKdm15 [280], or LZO [7], depending on the contents' characteristics.

2.3.5 Other Techniques

Although pre-copy can be used for taking local virtual machine checkpoints, in this scenario, shorter downtimes are achievable with copy-on-write (CoW) [254, p. 229], without the cost of increased total checkpoint time and repeated memory copies. Instead of actually copying the guest physical memory during the downtime, the VMM only write-protects the memory in the SPT/EPT. The VM can then resume execution. Concurrently, the VMM writes the guest physical memory into the checkpoint and releases write protection for saved pages. If the VM attempts to modify a page that has not yet been copied, the MMU triggers a VMM page fault, thereby giving the hypervisor a chance to save the respective page prior to

15 WKdm is specifically designed for quickly compressing memory pages with strong word-level similarity and many zero bytes; it was originally developed for compressed caching of virtual memory.

modification. In contrast to pre-copy, pages are asynchronously saved after the point in time that is represented by the checkpoint. This requires only a single copy operation per page to create a consistent image. A notable drawback is the overhead caused by page faults, especially for short intervals.
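For illustration, the following user-space sketch outlines the mechanism with mprotect() and a SIGSEGV handler; a hypervisor would write-protect SPT/EPT entries instead. The bookkeeping of pages already saved by the fault handler is omitted, and save_page() is hypothetical.

/* User-space sketch of copy-on-write checkpointing via write protection.
 * A hypervisor would write-protect SPT/EPT entries instead; tracking of
 * pages already saved by the fault handler is omitted. save_page() is a
 * hypothetical function that appends a page to the checkpoint. */
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_BYTES 4096

static uint8_t *region;        /* guest memory to checkpoint (assumption) */
static size_t   region_size;

void save_page(uint8_t *page); /* hypothetical */

static void cow_fault(int sig, siginfo_t *si, void *ctx)
{
    /* A write hit a still-protected page: save it, then allow the write. */
    uint8_t *page = (uint8_t *)((uintptr_t)si->si_addr &
                                ~((uintptr_t)PAGE_BYTES - 1));
    (void)sig; (void)ctx;
    save_page(page);
    mprotect(page, PAGE_BYTES, PROT_READ | PROT_WRITE);
}

void checkpoint_begin(void)
{
    /* Downtime: only install the handler and write-protect the region. */
    struct sigaction sa = { .sa_sigaction = cow_fault, .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);
    mprotect(region, region_size, PROT_READ);
}

void checkpoint_background(void)
{
    /* Concurrently save the remaining pages and release write protection. */
    for (size_t off = 0; off < region_size; off += PAGE_BYTES) {
        save_page(region + off);
        mprotect(region + off, PAGE_BYTES, PROT_READ | PROT_WRITE);
    }
}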

CoW has been suggested in Remus [72] and included in several other research projects for memory [137, 157, 248, 269] and disk [44, 230] checkpointing. Gerofi et al. [100] combine CoW with incremental checkpointing, thus using CoW only for the pages that have been modified since the last checkpoint. However, since they use page protections to determine dirty pages, this does not reduce the number of page faults. Wang et al. [269] use the information about dirty pages from the last checkpoint to predict the dirty pages for the next interval and copy these during the downtime. While this effectively reduces page faults, it increases the downtime (not evaluated) as well as the checkpoint size (due to misprediction). Unfortunately, only little quantitative information on the performance characteristics of CoW checkpointing is present in the literature, or the provided measurements are sparse and seem implausible16.

Checkpointing comprises many operations that, from a conceptual perspective, can trivially be parallelized, such as copying memory or performing hashing and compression. Song et al. [241] demonstrated that multithreading can considerably improve performance for pre-copy live migration, with a reduction of the total migration time of up to 10x and a reduction of the downtime of at least 2x. MECOM [130] successfully applies multithreading in the compression stage.

An alternative to software solutions is to develop custom hardware. Stevens et al. [246] present the design of a modified memory controller which can checkpoint the memory state directly to an SSD without relying on system software support. Dong et al. [84] propose the use of phase-change RAM as a persistent storage target for checkpoints to alleviate checkpointing overhead in future exascale systems.

2.3.6 Conclusion

Checkpoints capture a consistent state of a virtual machine, including its memory, persistent storage, and devices (e.g., CPU, video adapter, etc.). Checkpoints are the key technology behind virtual machine live migration and high availability systems, and drive advanced debugging solutions. Depending on the application, different performance metrics for checkpointing such as the downtime, the checkpoint size, or the performance degradation of the virtual machine play a particularly important role. Numerous techniques, from data exclusion and data compression to asynchronous processing, exist, allowing the properties of a checkpointing solution to be tailored to the scenario at hand. The common denominator in all approaches is to reduce the amount of data that needs to be included in the checkpoint and to interleave operations with the execution of the virtual machine. In that course, some methods leverage the data-modification characteristics of the workload (e.g., incremental checkpointing), and others benefit from characteristics in the contents of the data itself (e.g., data compression). The performance of each approach is thus heavily dependent on the workload and the requirements that arise from the particular checkpointing application, such as the interval length used.

16 The downtimes measured by Sun and Blough [248], for example, exceed our own results by one to two orders of magnitude.

2.4 Deterministic Replay

Tracking down bugs with traditional debugging techniques has become a challenging task with the advent of increasingly complex and parallel software. It is not always possible to faithfully reproduce the execution that triggers erroneous behavior. Furthermore, the simple act of debugging may mask issues which do not surface while the system is being observed – so-called Heisenbugs [104]. Security researchers are facing a similar situation with the spread of malware that stays dormant when it detects the presence of analysis tools such as a full system simulator. Deterministic replay aids in these situations by providing the ability to reproduce a previous execution of a supervised program or system. The replay run outputs identical results and is indistinguishable from the original run, but allows transparently altering the execution environment if needed – for example, to enable sophisticated debugging and analysis techniques. Over the years, many applications for deterministic replay have been published, including debugging [185, 220, 228], security and malware analyses [63, 64, 88, 287], live migration [157], fault tolerance [45, 231], and remote desktopping [128].

Based on the observation that program execution is per se deterministic and might only be altered by external events, a common approach to deterministic replay is recording and replaying events that change execution in a non-deterministic way. These events can be roughly classified into three categories [215]:

• Synchronous events are operations that are always invoked at fixed locations in the instruction flow but might return non-deterministic results. Examples are reading I/O memory, the timestamp counter (RDTSC), or random numbers (RDRAND). During replay, logged values need to be returned.

• Asynchronous events are triggered by external devices and usually surface as interrupts. While they have a deterministic effect on the system (i.e., the execution of a specific interrupt handler), they appear at arbitrary positions in the instruction stream. It is hence essential to record accurate timing information and to precisely inject the events during replay.

• Compound events describe operations that are non-deterministic in both their timing and their effect on the system. The most prominent example is DMA. The completion time of a DMA operation, as well as the data written by the operation, is neither known in advance nor fixed between multiple repetitive runs. Incoming network traffic also falls into this category.

To accurately reproduce asynchronous and compound events, the replay system must store some form of landmark, which precisely (i.e., instruction-level granularity) identifies a specific point in the instruction flow at which the event occurred. This may, for example, be an instruction counter or a snapshot of the vCPU's state.
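One possible encoding of such a recording log is sketched below; the event types and landmark fields are illustrative and merely mirror the event categories and the landmark discussion above.

/* Illustrative encoding of recorded non-deterministic events; the field
 * choices follow the discussion above and are not a specific system's format. */
#include <stdint.h>

enum event_type {
    EV_PORT_IN,      /* synchronous: value read from an I/O port         */
    EV_RDTSC,        /* synchronous: timestamp counter value             */
    EV_INTERRUPT,    /* asynchronous: external interrupt vector          */
    EV_DMA_WRITE,    /* compound: data written into guest memory by DMA  */
};

/* Landmark: pinpoints a position in the instruction stream so that
 * asynchronous and compound events can be re-injected precisely. */
struct landmark {
    uint64_t branch_count;   /* retired branches since recording start   */
    uint64_t rip;            /* instruction pointer                      */
    uint64_t rcx;            /* disambiguates REP-prefixed iterations    */
};

struct replay_event {
    enum event_type type;
    struct landmark where;   /* ignored for synchronous events           */
    uint32_t        length;  /* payload size in bytes                    */
    /* followed by `length` bytes of payload (port value, DMA data, ...) */
};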

Recording non-deterministic events can introduce notable overhead and degrade the performance of the virtual machine. This is especially the case with hardware-assisted virtual machines because their high performance is achieved by avoiding software interposition where possible. However, for capturing non-deterministic events it is indispensable that the hypervisor is involved in processing these events. For this purpose, deterministic replay systems may have to deactivate hardware virtualization features such as direct interrupt delivery and force additional VM exits, for instance, to trap the execution of certain instructions.

Additional overhead can arise during the replay phase. As asynchronous events must be injected at an exact position in the instruction flow, the hypervisor must diligently observe the execution of the virtual machine and then, at the right point in time, interrupt the execution for replay. On the other hand, the run time of a replay can also be shorter than that of the original run because the VMM can skip phases of idling, and the result of I/O operations can potentially be read immediately from the recording log without having to invoke the respective virtual device.

Over the decades, many deterministic replay solutions have been developed. The following are important metrics that guide research in this area (based on [60]):

• Probe Effect: The effect on the observed system, such as the introduced slowdown. A large recording overhead greatly perturbs the timing behavior of the supervised system and in some cases can contradict the original motivation for deterministic replay (e.g., to record accurate timing information for later analysis). Many published solutions have therefore been geared toward minimizing the recording overhead, thereby limiting the probe effect.

• Log Size: The total amount of data required for the replay. Thousands of non-deterministic events shape the execution of a virtual machine every second. Compound events such as DMA reads may contain considerable amounts of data such as disk sectors or network packets. A dense encoding, possibly including compression, can therefore be beneficial, especially if the log is to be transferred over a network or stored persistently for a longer period of time.

• Replay Slowdown: The increase in run time for the replay compared to the original run or a non-replayed execution of the inspected workload. However, since the relative timing of events in the virtual machine remains faithful during replay irrespective of its slowdown, the overhead may not be critical in itself, depending on the purpose of the deterministic replay (e.g., in an offline analysis scenario).

• Replay Accuracy: The level of detail at which the replay is able to reproduce a faithful execution. Although it seems counter-intuitive at first, it is perfectly viable for a replay to not be a hundred percent accurate. Improper computation of CPU flags or divergent memory contents, for example, technically only become problematic when they influence the execution and thereby eventually break the replay.

Deterministic replay solutions can be classified roughly according to the abstraction level (e.g., processes vs. VMs) and virtualization technology (i.e., emulation, paravirtualization, or HAV) they are targeted at, as well as by their support for multithreading and multiprocessors [60].

In the following, we use this classification to give an overview of deterministic replay approaches present in the literature. Since we deal with multiprocessor support in § 2.4.3 separately, the evaluation results quoted in § 2.4.1 and § 2.4.2 exclusively relate to uniprocessor solutions.

2.4.1 Homogeneous Replay

A homogeneous replay keeps the same execution environment between the recording and replay phases. A recording for a virtual machine executing with fast hardware-assisted virtualization is thus intended to also replay in exactly this configuration. This makes homogeneous replay very efficient because the recording can rely on the fact that the system will behave identically during replay. The CPU will, for instance, return the same identification string and feature set without extra efforts from the replay engine. This eases implementation and to some degree reduces the overall complexity.

Homogeneous replay is an attractive technology for simple debugging scenarios and has been realized in several user-space process debugging tools [127, 228]. These capture and replay at the level of system calls and signals instead of device I/O and interrupts17. While this method can also be used to replay an emulated full system virtual machine by simply recording and replaying the process [98], it is more efficient to embed the technology into the hypervisor itself.

17 A replay will always behave deterministically only at the abstraction levels above the recording. A process replay is thus not capable of replaying operating system internals but merely captures the input from the OS to the process. Contrary to intuition, recording at the virtual machine level – i.e., replaying a full system – incurs less overhead than process-level recording [63].

Bressoud and Schneider [45] first described a deterministic replay that natively targets system virtual machines to implement fault tolerance. Their approach uses the number of executed instructions as a landmark. This allows the replay to leverage the recovery register of HP’s PA-RISC processor for precise event injection because the CPU interrupts execution after the number of instructions configured in the register has been retired.

ReVirt [87] is technically very similar. It records and replays x86 system virtual machines based on UML for the purpose of intrusion analysis. In contrast to the replay system by Bressoud and Schneider, ReVirt must maintain a full log and cannot discard events that happened before a synchronization point. The authors of ReVirt therefore reduced the log size by excluding data read from the hard disk, as it can deterministically be re-read during replay. In addition, logs are compressed with gzip. The log growth rate in all benchmarks is below 1.5 GiB per day, with a kernel build producing only 80 MiB per day. The authors report a run-time overhead of 8% for the recording and 2% for the replay (compared to the recording). ReVirt uses the CPU's instruction pointer accompanied by a branch count and the current value of the ECX register18 as a landmark. ReVirt has been very popular within the research community and has also become the foundation for other replay projects [137].

ReVirt is restricted to Linux virtual machines that run with UMLinux [46], a specifically modified Linux kernel which allows virtualization with direct execution (§ 2.2.1). ReVirt is therefore not well-suited if a particular operating system kernel is required or if an arbitrary operating system should be inspected. A step toward a more generic platform is taken by VMRS [71] and XenTT [50], which perform deterministic replay of unmodified but paravirtualized Linux guests in Xen.

Of particular interest for research are deterministic replay solutions that use emulation because the replay supports powerful analyses. ExecRecorder [76] uses the Bochs interpreter for malware attack analysis. Since ExecRecorder does not depend on paravirtualization, even commercial operating systems such as Windows can be faithfully replayed. Using an interpreter also simplifies the replay itself because non-determinism can easily be captured in a fully software-emulated virtual machine. In addition, the emulation can be extended to provide reliable counters for the landmark. The log growth rate is considerably higher than in ReVirt, reaching 10 GiB per day for a Linux web server. Since ExecRecorder does not compress logs, this suggests that replay logs are generally well compressible. Measurements in [50] and [286] confirm this observation. However, as with ReVirt, measurements with ExecRecorder also reveal that the log growth rate greatly depends on the workload. Similarly, with less than 4%, the average run-time overhead during recording is almost negligible.

18 Allows differentiating iterations for instructions with the repeat (REP) prefix.

Other homogeneous replay projects using emulation are [59, 83, 85, 244], all employing dynamic binary translation with QEMU due to its comparably fast binary translator. Nevertheless, the slow overall execution speed of software-based virtual machines remains a major drawback of such systems, limiting their value in practice.

2.4.2 Heterogeneous Replay

Combining the fast virtualization speed of hardware-assisted virtual machines with the powerful analysis capabilities of emulators has been the goal of various research projects [63, 64, 286, 287]. The general idea is to record all non-deterministic events with hardware-assisted virtualization and replay the events in the emulator, thereby presenting exactly the same execution to analysis tools. Although this approach does not reduce the run time of the emulation, it recreates the realistic timing and potential user and network interaction of the original hardware-assisted run in the emulation. This makes the emulation representative of a productive run, irrespective of the slowdown induced by the emulation and analysis. Since the execution environment changes between recording and replay, we refer to this method as heterogeneous replay.

Compared to homogeneous replay, heterogeneous replay introduces a number of new challenges. Since the CPU implemented in the emulator does not match the host's CPU, instructions such as CPUID on x86 will usually return different values. The recording thus has to capture more information to allow a faithful replay. A larger obstacle is the plethora of inaccuracies hidden in most emulators [167] that can lead to divergence in the replay (see Chapter 8 for a thorough discussion). The emulation thus has to be meticulously adjusted to match the behavior of the physical CPU. While this makes heterogeneous replay less portable and requires much more development effort, the result is a minimally invasive analysis tool that even allows the evaluation of interactive workloads for which traditional emulation is often too slow. However, only a few research projects have targeted full system heterogeneous replay, and most publications originate from VMware.

With ReTrace [286], VMware published a trace collection tool for x86 based on heterogeneous replay that even found its way as an experimental feature into VMware Workstation 6.0. ReTrace first records the workload of interest with hardware-assisted virtualization. In the second phase, ReTrace replays the execution with interpretation, allowing tracing hooks to be installed transparently. Just as reported for projects in the field of homogeneous replay, the run-time recording overhead is on average 5% for CPU intensive workloads but climbs to 2.6x for OS and I/O intensive workloads. A major drawback of ReTrace is that it does not cope with the immense emulation slowdown bound to the tracing phase. The

authors only state that the tracing for a 30 s benchmark was stopped after 2 hours, without giving more detailed information on the exact full emulation run time19.

VMware continued their efforts in the area of heterogeneous replay with Aftersight [63], a framework for integrating complex analyses into the replay. In contrast to ReTrace, Aftersight also supports replay in QEMU because its binary translator is better suited for instrumentation compared to the heavily optimized and specialized one in VMware Workstation. Due to the incompatible device models between VMware and QEMU, the replay cannot just feed non-deterministic events to the emulation, but must also fully replay the output of devices. To generate a device model agnostic log, Aftersight must first perform an additional replay in VMware Workstation and capture all device output. In consequence, the virtual devices in QEMU are no longer needed and the authors could strip out everything except the components that deal with instruction execution and memory access.

The second installment of ReTrace has been published by VMware with Crosscut [64]. In contrast to previous replay systems, Crosscut allows users to slice replay logs along time and abstraction boundaries so that a replay only includes the processes or components that are of interest to the research question at hand. Retargeting a log, for example, to focus on a certain process, requires a dedicated replay run that delivers a new refined log. This concept is a direct evolution of the process used in Aftersight to generate a device model agnostic log.

Restricting the replay to a certain component is also a key feature of V2E [287], a heterogeneous replay engine for malware analysis. The user can define a recording realm (e.g., a certain process or kernel module), which is the only part that is replayed. V2E creates two mutually exclusive guest physical memory spaces, where each individual guest physical page can only be present in the EPT of the recording or the main realm. Whenever the CPU is running in the recording realm and accesses a page present in the main realm, its content is captured in the replay log and the page’s ownership changes. This allows V2E to always present the right memory contents to the recording realm in the replay, despite the absence of the main realm. Just like Aftersight, V2E uses QEMU for analysis (but KVM for recording). Due to the malicious context in malware analysis, V2E strongly depends on a precise replay to prevent the malware from crashing the emulator, as this would torpedo the analysis. In consequence, V2E also records exceptions. Although deterministic by nature, they are difficult to emulate accurately. This extended recording, as well as the additional overhead of capturing realm boundary crossings, comes at the cost of a high recording overhead between 5x and 17x. This potentially perturbs and distorts the original workload execution, making the design less suitable than Crosscut for analyzing time-sensitive behavior.

19 Had the workload finished after 2 hours, the slowdown would have been around 240x. The authors report a general slowdown of 2 orders of magnitude at minimum.

2.4.3 Multiprocessor Replay

Although they support multithreading, multiple processes, or even a whole virtual machine, the approaches presented so far only cope with uniprocessor deterministic replay. To faithfully replay a multithreaded application on a uniprocessor, it is, for example, sufficient to accurately replay the scheduling decisions [127, 227]. Compared to a multiprocessor replay, such a system does not have to handle the non-determinism emerging from the true parallelism of multiple concurrently executing CPUs. In addition to the techniques employed in the uniprocessor case, a multiprocessor replay therefore also has to include information to precisely reconstruct the timing between parallel cores whenever an inter-processor event takes place. These events may be easily observable in the operating system or hypervisor, as in the case of inter-processor interrupts (IPIs), but may also happen without explicit intervention, as in the case of shared memory accesses. The outcome of a race for a shared mutex, for instance, is not known in advance and depends on the many non-deterministic factors inherent to the hardware, such as the memory access ordering, the relative timing between the involved CPUs, and their current cache states. The challenge in multiprocessor replay is thus to make originally transparent inter-processor interaction, such as via shared memory, visible to the recording component without compromising efficiency.

Most of the literature in the field of multiprocessor deterministic replay examines the replay of single applications comprising one or more processes [30, 152, 153, 191, 197, 200, 220, 245]. Instant Replay [152] implements a concurrent-read-exclusive-write (CREW) protocol with a set of specifically crafted read and write locks which must be used in the application to access shared objects. The CREW protocol assigns ownership exclusively to one processor but permits multiple concurrent readers. To replay the order of operations on a shared object, Instant Replay assigns each shared object a version number which is incremented on modifications. During replay, the locks delay access until the object's version number matches the one recorded, thereby enforcing the same access order20.
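The replay side of such a version-number scheme can be sketched as follows; the names and the busy-wait are illustrative only.

/* Sketch of version-number based replay ordering in the spirit of the CREW
 * protocol described above. Recording logs the object's version before each
 * access; during replay, an access is delayed until the object has reached
 * that version again, which enforces the recorded access order. */
#include <stdatomic.h>

struct shared_object {
    atomic_uint version;       /* incremented on every recorded modification */
    /* ... payload ... */
};

void replay_wait(struct shared_object *obj, unsigned recorded_version)
{
    while (atomic_load(&obj->version) != recorded_version)
        ;                      /* a real implementation would yield or sleep */
}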

A similar approach is used in RecPlay [220], which records the order in which threads acquire synchronization objects. While the concept is appealing from a performance perspective, it is not capable of faithfully replaying across race conditions – which do not employ correct synchronization by nature. RecPlay thus utilizes an on-the-fly data race detection during the replay phase which aborts replay on the first race condition.

Respec [153] has been designed with online replay (e.g., to parallelize security checks) in mind, that is, the recorded and replayed processes execute concurrently. Respec splits the execution of a program into epochs and relaxes the accuracy

20 Instant Replay does not reproduce the order of reads of the same value (i.e., same version).

of the replay within the epoch. Instead of forcing exactly the same execution on an instruction-by-instruction basis, Respec only guarantees that the program generates the same output (i.e., invocation of system calls and final program state at the epoch end). Like Instant Replay and RecPlay, Respec only logs the order in which synchronization primitives are acquired. If, however, the replay fails to reproduce the same output (e.g., due to a race condition), Respec rolls back the original execution in the recording stage to the beginning of the current epoch. It then retries execution and replay of the current interval in the hope that the race condition will not be triggered again21. Respec thus tries to avoid race conditions in the recording.

Recap [197] and Flashback [245] record the values read when accessing shared memory and in consequence can faithfully replay race conditions – in contrast to the aforementioned methods. While Recap generates specific memory access code in the compiler and thus requires a recompilation for replay, Flashback uses page protections to trigger a page fault on every shared memory access. Although the authors of Flashback did not explicitly evaluate this solution, a high performance degradation for frequent shared memory accesses can be expected.

Multiprocessor replay for system virtual machines was first demonstrated with the Flight Data Recorder by Xu et al. [285]. FDR is designed as a hardware processor extension for a sequentially consistent memory system, which records the non-determinism in a sliding window of 1 billion CPU cycles before a trigger (e.g., a fatal crash). FDR resolves shared memory accesses by capturing cache coherence messages. An evaluation using a timing simulation revealed an estimated run-time overhead of less than 2%.

A software version of the cache coherency protocol recording in FDR has been adopted in PinPlay [200], a framework for deterministic replay of parallel programs. Since PinPlay uses dynamic binary translation to allow program instrumentation, the logger exhibits a high slowdown of 80x (single-threaded program) to 146x (multithreaded program) compared to native execution.

With SMP-ReVirt [88], Dunlap et al. published a multiprocessor extension for ReVirt [87]. Similar to Instant Replay, SMP-ReVirt employs a CREW protocol, but it applies the protocol to the guest physical memory of a Xen virtual machine, thereby enabling full system multiprocessor replay. The ownership of guest physical pages is controlled by adapting page protections in per-processor EPTs and switching ownership on page faults. While the single-core recording run-time overhead is around 15% for a Linux kernel build, the overhead considerably increases with the number of CPUs, reaching 2x for two processors and 9x for four processors.

Although this suggests that the CREW protocol is not scalable, ReEmu [59] implements an optimized version in a parallel variant of the QEMU full system emulator with promising results. The authors report an average slowdown between 60% and 77% for PARSEC benchmarks with 1 to 16 cores. However, compared to native execution, the run-time overhead reaches up to 33x due to the overhead of DBT.

Besides cache coherence protocol recording and the CREW protocol, a third approach is to split the instruction flow of each processor into a stream of atomic chunks. The processor either commits the chunk and makes its effects (e.g., changes in memory) publicly visible or, in case there is a data dependence across two concurrently executing chunks, performs a rollback. The approach thus controls the original execution so that data races can be resolved. To replay a multiprocessor execution, the system only has to record the order of chunks, which incurs less overhead (run time and log size) than working at the granularity of individual memory accesses. The concept was first presented with DeLorean [182] as a simulated hardware solution for full system multiprocessor replay. Samsara [215], in turn, implements the concept in software for hardware-assisted virtual machines, using CoW and CPU register snapshots to allow rollback. Although generally performing better than SMP-ReVirt, with a slowdown of 6x for a kernel build with four processors, the run-time overhead for recording remains high.

Intel therefore invested in a prototypical x86 architecture extension for multiprocessor replay, called QuickRec [205]. QuickRec also uses a chunk-based recording approach, but as a processor feature it can rely on address snooping to detect conflicts at the moment they occur. This allows the CPU to immediately terminate the chunk, obviating rollbacks. The authors evaluated the performance overhead with an FPGA implementation and found it to be negligible. However, since QuickRec only captures shared memory interleaving, it requires software to capture all remaining non-determinism. This incurs a run-time overhead of 13%. Although the authors demonstrated a prototype for program replay, there is no design capable of full system replay.

21 Respec manipulates the original thread scheduling if the same epoch must be rolled back twice.

2.4.4 Conclusion

By capturing non-deterministic events such as interrupts and the output of I/O devices, deterministic replay is able to precisely reproduce the execution of a program or a whole system. It is thus a powerful technique that aids in tracking down bugs and analyzing malware, and it can even improve virtual machine live migration and fault tolerance systems. Whereas a homogeneous replay executes in the same environment (e.g., a hardware-assisted VM) as the recording, heterogeneous replay switches environments. This makes heterogeneous replay particularly interesting for research and development as it allows recording with hardware-assisted virtualization, providing full interactivity, realistic timing, and transparency to malware, and replaying with emulation for pervasive instrumentation and analysis. However, heterogeneous replay does not reduce the immense slowdown incurred by the (instrumented) emulation and is thus limited in its applicability.

While the size of the resulting recording logs, as well as the run-time overhead during recording, is not a problem for uniprocessor platforms, multiprocessor deterministic replay is generally considered inefficient in software. Unfortunately, complete hardware/software co-designs for commodity architectures such as x86 capable of full system multiprocessor replay have not been presented yet.

Chapter 3

Functional Full System Simulation

By full system simulation, we refer to the process of running a system virtual machine for the purpose of development and research, for example, to collect detailed execution traces. In contrast to a generic binary translator, a full system simulator typically implements additional instrumentation features, performs instruction or cycle counting, and possesses a sophisticated tracing facility. To this end, the full system simulator has to create a software model of the targeted architecture. As with every model, we are – to some degree – free to choose the level of detail we want to integrate into the model. For a virtual processor, it generally will not be reasonable to approximate the result of instructions or to reduce the precision of registers, as this will most probably break compatibility with the targeted ISA and prevent software in the guest from running properly. Since the ISA is the developer-visible interface to the hardware, conforming to the ISA is a lower bound below which a model should not fall. A simulation at this level is called a functional full system simulation (FFSS) because it is only concerned with the correctness of the output of instructions rather than the emulation of the inner workings of system components.

Depending on the use case of the virtualization, it might, however, be desirable to also model microarchitectural details such as the design of the instruction processing pipeline and precise timing. This is especially the case if the effects of changes in the microarchitectural implementation of a processor should be evaluated – traditionally the field of computer hardware architects. However, the publication of the Meltdown [156] and Spectre [141] attacks unveiled a whole class of security vulnerabilities in modern processors that also forced extensive changes in operating systems. Since these attacks are based on the out-of-order and speculative execution within the processor, a model restricted to the ISA level is not sufficient to discover such problems. Microarchitectural models can therefore conceptually be useful not only in hardware design but also in OS and security research. However, detailed information on the inner workings of modern general-purpose CPUs is proprietary knowledge and as such not available to the

research community. Researchers are therefore forced to reverse engineer such CPUs and carefully tune microarchitectural models experimentally [80, 292], or simply make educated guesses [91]. Still, this does not catch all subtleties of the actual hardware, and models of existing commercial CPUs remain an approximation1. The teams behind Meltdown and Spectre consequently had to resort to trial-and-error experiments with real hardware instead of being able to deduce the attacks straight from the microarchitectural design. A primary focus for operating system and security research therefore remains virtualization at the functional level, which has been demonstrated to be very effective for debugging [137], security analyses [133, 287], collecting detailed execution traces [286], and many other use cases.

In contrast to productive use of virtual machines, where virtualization speed and management features such as live migration and fault tolerance are paramount, using VMs for research shifts the focus to sophisticated instrumentation and tracing capabilities. This way, researchers are able to gather valuable information on the run-time behavior of applications and the operating system. While production VMs thus rely on the performance of hardware-assisted virtualization, full system simulations require the flexibility of emulation techniques, that is, interpretation or dynamic binary translation.

For instance, SimOS [223] implements three execution modes. Due to the limited availability of hardware-assisted virtualization back in the 1990s, the fastest mode uses direct execution and can be utilized to set up the virtual machine and fast-forward over the OS boot phase during benchmarks. However, the authors highlight that direct execution inherently supports neither instrumentation nor the modeling of timing aspects. It further requires compatibility between the guest and host architectures. The authors therefore state that direct execution is generally inappropriate for research studies. As hardware-assisted virtualization relinquishes even more software control, the same applies to modern virtualization. The second operation mode in SimOS uses dynamic binary translation. It can be used to perform arbitrary instrumentation, collect execution traces, and gather run-time statistics such as (simulated) cache hits and misses. In the third mode, SimOS falls back to interpretation, which allows it to apply comprehensive processor and memory timing models2.

We can conclude that functional simulation is the predominant technique for simulation-based OS and security research that requires detailed run-time information. Variants of dynamic binary translation or interpretation, in turn, are the virtualization technique of choice for all major full system simulators such as Simics [163], QEMU [38], Bochs [2], and gem5 [41], and they are the driving power behind binary instrumentation tools such as DynInst [47], Pin [162], and Valgrind [190]. Since the main contribution of this thesis is a novel approach to the acceleration of functional full system simulation, the next section provides an assessment of the execution speed of current full system simulators. Building on these results, Section 3.2 takes a look at existing acceleration methods for full system simulation and discusses their applicability in operating system and security research. The chapter concludes by summarizing the limitations of the state-of-the-art techniques in Section 3.3.

1 Average error rates in metrics such as execution time or instructions per cycle (IPC) for microarchitectural models compared to real hardware are around 5% to 15% [25, 91, 159, 292], going up to 40% [25] for some benchmarks. Simple microarchitectural models, albeit used by hundreds of publications, show errors up to 77% [80].
2 Although technically possible, using binary translation in this mode provides little benefit due to the many model-related helper calls. At the same time, it adds significant complexity because the advanced simulation features must be reflected in dynamically generated code.

3.1 Assessment of Simulation Speed

Despite being a versatile technique for detailed system inspection, a major drawback of functional full system simulation is the immense slowdown incurred by the emulation. Figure 3.1 depicts the slowdown for various workloads running with QEMU's binary translator compared to the execution in a hardware-assisted virtual machine3. Whereas a kernel build completes in less than 13 minutes with HAV, a functional simulation takes about 5 hours. The figure also illustrates that the slowdown strongly depends on the instruction mix of the workload.

[Figure 3.1 data: slowdowns relative to hardware-assisted virtualization for postmark, sqlite, gnupg, pybench, phpbench, kernel build, apache, encode-mp3, povray, and idle; series: Serial Sim., Serial Sim.+Tracing (w), Serial Sim.+Tracing (r+w)]

Figure 3.1: Running a serial simulation with QEMU is significantly slower than running the same workload in a fast hardware-assisted virtual machine. Installing hooks for tracing memory accesses slows down the simulation even more.

3 See § 9.1 on page 185 for information on the benchmark environment and the test scenarios.

The FPU-heavy povray benchmark runs for over one day instead of 19 minutes. Installing hooks for tracing memory accesses prolongs a single execution to almost two days, making measurements exceedingly time-consuming. With Simics [163], we measured a slowdown of up to 1000x for SPECint2006 when hooks for tracing (reads and writes) are installed [219]. Similar slowdowns for functional simulation have also been reported by other researchers [58, 135, 164, 200, 286]. In practice, this slowdown creates severe obstacles for comprehensive use of functional full system simulation:

Interactivity: Scenarios that should capture interactivity with a human user or an external network device are impractical. Prominent examples are evaluations that include desktop usage or benchmarks such as SPECweb [15], which require a second virtual system to generate requests. With a slowdown of one to two orders of magnitude, applications are barely usable. A single keystroke can quickly take from multiple seconds up to minutes until being fully processed. Network protocols such as TCP, in turn, react with throttling and timeouts4.

Accuracy of Results: Since the simulation considerably slows down the guest, activities dependent on external events appear to complete faster. For instance, in the same amount of wall-clock time for an I/O operation, the simulated CPU retires drastically fewer instructions than a real CPU, which in relation appears as if the I/O device operates much faster – a phenomenon called time dilation by Chen and Bershad [56]; a brief numeric illustration of this effect follows at the end of this section. Similarly, timer interrupts occur more frequently, which effectively shortens time slices and increases the number of context switches5.

Coverage: With contemporary functional full system simulation, evaluating modern benchmarks in their full length takes considerable time. As apparent from Figure 3.1, the run time is in addition very sensitive to further overhead incurred by instrumentation, quickly rendering simulation impractical for long-running workloads or heavyweight analyses.

The key to productive and comprehensive use of functional full system simulation in operating systems and security research is therefore the reduction of the slowdown and the ability to run full-length, interactive workloads with realistic timing behavior.

4 Simics can also simulate network peers in order to produce a realistic communication pattern. This, however, creates more complex simulations with even higher slowdowns, and the concept is inherently not applicable to human interaction.
5 This can be to some degree alleviated with a virtual clock based on retired instructions. However, without a sophisticated timing model, this still misses the variations in processing speed for different instruction mixes.
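To make the time-dilation effect tangible, consider a hypothetical disk request with a fixed 5 ms wall-clock latency under an assumed 50x simulation slowdown: the guest retires 50 times fewer instructions while waiting, so relative to guest progress the device appears about 50 times faster. The numbers below are purely illustrative and not measurements from this thesis.

```python
# Illustrative time-dilation arithmetic (hypothetical numbers, not measured).
native_ips   = 1e9      # instructions per second retired on real hardware
slowdown     = 50       # simulation slowdown compared to native execution
io_latency_s = 5e-3     # wall-clock latency of a disk request

native_insns_during_io    = native_ips * io_latency_s
simulated_insns_during_io = (native_ips / slowdown) * io_latency_s

print(f"guest work overlapping the I/O, native:    {native_insns_during_io:,.0f} instructions")
print(f"guest work overlapping the I/O, simulated: {simulated_insns_during_io:,.0f} instructions")
# Relative to guest progress, the device appears 'slowdown' times faster.
print(f"apparent speedup of the device: {native_insns_during_io / simulated_insns_during_io:.0f}x")
```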

3.2 Acceleration Techniques

A common method to cope with the limited computational resources in full system simulations is to adapt or shorten the workload so that the overall simulation time remains within reasonable bounds. However, the reduced workload may not retain the original characteristics and thus produces results of questionable value. Developing reduced but representative workloads is a complex and time-consuming task. With MinneSPEC [140], efforts have been made to generate reduced input sets for the popular SPEC CPU2000 benchmark suite. A scaled-down commercial workload suite has also been published [28]. Nonetheless, the principle is not generally applicable and may inherently cut the value of a study by reducing coverage (e.g., evaluating only a small set of test cases [133]).

Instead of adapting the workload, it is conceptually also plausible to run the workload in a process virtual machine such as Pin [162], thereby avoiding the overhead of simulating the guest operating system. However, process virtual machines are unable to track dynamic behavior across a full system, including privileged system components and multiprogramming activity. They are thus not well suited for operating system research and are of no use if kernel internals should be analyzed (e.g., as in [133]). The restriction to the scope of a single application also raises questions concerning the accuracy of results obtained through such narrow inspection. While it comes naturally that I/O-heavy workloads such as databases and web servers are clearly influenced by OS activity [135, 169], Cain et al. found that ignoring the effects of the operating system can lead to severe measuring errors even for CPU-bound benchmarks such as SPECint2000 [53]. It is thus desirable to keep the original workload intact – including the guest operating system – and to accelerate functional full system simulation instead.

3.2.1 Optimizing the Execution Engine

The speed of a functional simulator is mostly determined by the ratio of executed host instructions per simulated guest instruction. Improving the simulation speed thus equates to improving this very ratio. This can be achieved by reducing the time that the CPU spends on executing simulator code (e.g., to control the execution or jump between dispatch routines), and by generating more efficient code in the case of a DBT engine.

Shade [67] and Embra [281], for example, avoid loads and stores to vCPU registers when successive instructions use the same virtual registers. QEMU updates parts of the vCPU state such as CPU flags and the instruction pointer only when they are actually needed [38]. Fu et al. [95] extend the intermediate language in QEMU to support vector instructions, thereby improving the speed of selected SIMD benchmarks. Other research [115, 119, 210] uses compiler back ends such as LLVM [149] to

leverage sophisticated code optimization features when generating TBs from the intermediate language. In contrast, Zhang et al. [295] apply optimizations to the final compilation result.

Reducing the time spent in the simulator has been done, for instance, in SPIRE [129]. The project resolves guest instruction pointers to host instruction pointers at indirect branches in the guest code by installing a lightweight trampoline at the target guest IP, instead of using address hashing and a lookup table. Generating larger translation blocks with more efficient traces (i.e., sequences of hot basic blocks) is another approach [120].

A common denominator of all these techniques is that improvements at the level of the generated code and execution flow do not provide the required leap and generally stay below the 5x speedup mark (in the face of slowdowns of one to two orders of magnitude). In fact, the authors of SoftSDV [260] considered moving various complex operations from helper calls directly into the generated translation blocks but found that the resulting loss in flexibility and maintainability did not justify the optimization.

3.2.2 Reducing the Observation Space

A prevailing practice to make analysis via simulation applicable to long-running workloads is to trade accuracy for speed by limiting the simulation to short time frames – so-called samples. To this end, many simulators support dynamically switching between simulation modes of different detail at run time, enabling the user to fast-forward between samples with some kind of accelerated execution. In Simics [163], the user is able to choose whether timing models are applied or not. SimOS [223] and MARSSx86 [199] allow switching between functional and microarchitectural simulation. PTLsim/X [292] and FSA [229], in turn, provide hardware-assisted virtualization as an alternative mode of execution.

A fundamental challenge with sampling is to find the right samples so that the selected subset reflects the overall workload characteristics. The gathered information can then be extrapolated to draw conclusions for the whole workload. In research, three primary variants have emerged [289]:

In truncated execution, the workload is simulated for only a short duration with the presumption that the abbreviated execution phase is representative of the whole program. Most applications exhibit an initialization phase, where internal data structures are set up and input data is loaded into memory before the application actually starts performing its task. The latter phase then usually dominates the program behavior. A common variant of truncated execution is thus to fast-forward over the initialization phase and start detailed simulation or analysis for a limited duration afterward. Depending on the level of simulation and type of analysis, a warm-up phase is prepended before taking measurements to line up any additional state (e.g., a cache model). Because the policy is easy to implement, it has been widely adopted in the literature. According to a study by Yi et al., over 50% of publications6 at HPCA, ISCA, and MICRO base their results on this technique [289]. However, the same study also revealed that truncated execution is highly inaccurate. This finding has also been confirmed by other groups [101, 273]. The inaccuracy is caused by the fact that the approach does not account for changing program behavior and at the same time depends on manually and arbitrarily chosen parameters such as the time span to simulate.

SimPoint [109, 237] leverages sampling to reduce simulation time and increases accuracy by selecting multiple time frames to sample from. It is thus able to incorporate changing program behavior. Furthermore, the time frames are selected algorithmically by detecting phase behavior in the simulated workload. SimPoint thereby focuses the simulation and analysis on windows with representative characteristics. To gather initial information on the program behavior and to fast-forward between samples, SimPoint requires functional simulation. It is thus only suitable to accelerate more detailed execution modes such as microarchitectural simulation, which face even higher slowdowns than emulation. However, it does not present a solution to accelerate functional simulation, which is the goal of our work7. At the same time, SimPoint's strength of building on phase behavior becomes its weakness when no sufficient phase behavior exists. SimPoint has been developed with a single process in mind. However, even for such a scenario, it has been shown that representative intervals may not be clearly identifiable due to overly complex program behavior (e.g., gcc [273]). For operating system research, where a mix of processes runs alongside OS kernel and driver threads, observing phase behavior becomes even more difficult.
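The phase detection underlying SimPoint can be pictured as summarizing each fixed-size interval of the execution by a basic-block vector (BBV), clustering the vectors, and simulating one representative interval per cluster. The toy sketch below uses a naive k-means on made-up vectors purely to convey this idea; it is not SimPoint's actual algorithm, parameters, or data.

```python
# Toy illustration of phase-based interval selection (SimPoint-like idea).
# Each interval is summarized by a normalized basic-block vector; intervals
# are clustered and one representative per cluster is picked. Not SimPoint's
# real implementation; vectors and parameters are made up.
import random

def normalize(v):
    s = sum(v) or 1.0
    return [x / s for x in v]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, v in enumerate(vectors):
            best = min(range(k), key=lambda c: dist(v, centroids[c]))
            clusters[best].append(idx)
        for c, members in enumerate(clusters):
            if members:
                dims = len(vectors[0])
                centroids[c] = [sum(vectors[m][d] for m in members) / len(members)
                                for d in range(dims)]
    return centroids, clusters

if __name__ == "__main__":
    # Toy BBVs for 8 intervals over 4 basic blocks (two obvious phases).
    bbvs = [normalize(v) for v in
            [[90, 5, 3, 2], [88, 6, 4, 2], [87, 7, 3, 3], [89, 4, 4, 3],
             [5, 10, 45, 40], [4, 12, 44, 40], [6, 9, 46, 39], [5, 11, 43, 41]]]
    centroids, clusters = kmeans(bbvs, k=2)
    for c, members in enumerate(clusters):
        if not members:
            continue
        rep = min(members, key=lambda m: dist(bbvs[m], centroids[c]))
        print(f"phase {c}: intervals {members}, simulate interval {rep}")
```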

SMARTS [284] evades this problem by collecting samples periodically with high frequency, ignoring program behavior. The number of samples taken in SMARTS is thus higher, while each sample is considerably smaller (1000 instructions versus 100 M in SimPoint). SMARTS employs sampling theory to choose a minimal sampling frequency and to achieve a quantifiable accuracy and precision in its measurements. As with SimPoint, SMARTS targets the acceleration of microarchitectural simulations and requires functional simulation to fast-forward between sampling points and to warm up microarchitectural state. The functional simulation consequently occupies more than 99% of the simulation run time. DirectSMARTS [58] demonstrates how simulations using SMARTS can benefit from acceleration methods for functional simulation, in this case, using emulation with dynamic binary translation instead of interpretation in the process-level RSIM [196] simulator. To improve the run time for subsequent experiments with the same benchmark, Wenisch et al. checkpoint the warmed-up state in so-called LivePoints [276]. While subsequent runs can then start off the checkpoints, the concept still requires a complete functional simulation beforehand. FSA [229], a recent publication targeting SMARTS, skips most of the functional simulation by using hardware-assisted virtualization instead. FSA then dynamically switches to functional simulation before each sample to warm up microarchitectural state and finally switches to detailed simulation for the sample itself.

A major drawback of all sampling-based methods is that they are directed toward the estimation of metrics that can be extrapolated from samples (e.g., instructions per cycle). Sampling is less suited to observe the actual system execution as required in security research, malware analysis, or debugging. Moreover, limiting the observation window to discrete samples may not be an option because it does not permit the tracking of individual events. For example, the identification of memory allocation/deallocation pairs cannot be done this way, as used in Undangle [52] to detect invalid pointers in use-after-free and double-free vulnerabilities. The same applies to the memory access pattern analysis in Bochspwn [133] and the evaluation of sharing opportunities for memory deduplication [105, 176, 217]. Instead, such applications demand an acceleration method that offers continuous simulation.

6 Over a ten-year period ending in 2005.
7 In fact, an accelerated functional simulation could be used to generate SimPoints much faster.

3.2.3 Parallelizing Multicore Simulations

An alternative to code optimization and sampling is to parallelize the simulation of cores in multicore simulations. In a simple design, emulating multicore systems can be done by switching between the vCPUs in a round-robin fashion. This reduces the complexity of the simulator because it serializes the simulation and thus requires less synchronization and does not introduce non-determinism from inter-processor interference. The time-sharing, however, further reduces the overall execution speed proportionally to the number of vCPUs. Parallel multicore simulation mitigates this additional slowdown by emulating each vCPU in parallel on a dedicated hardware thread. Many existing simulators support this mode of operation today, officially or through unofficial patches [81, 148, 268, 271, 294]. Graphite [175] is even capable of distributing the simulation of vCPUs across multiple machines to enable many-core (i.e., thousands of cores) simulations. Portero et al. [207] expanded on these capabilities with a simulator that also delivers timing and functional models for on-chip interconnection systems.

While achieving good speedups compared to serial multicore simulations (e.g., 3.8x for a quad-core ARM simulation [81]), the approach of parallel multicore simulation does not accelerate the execution of the vCPUs themselves. Its scalability is therefore inherently limited by the degree of simulated parallelism. Single-core simulations do not benefit. However, to make FFSS a prevalent solution for system inspection, we need to drastically reduce the slowdown for the emulation of a single core because this slowdown presents the entry obstacle to functional full system simulation.

3.2.4 Parallelizing the Simulation Time

The parallelization of simulation time is another method based on parallelization, which was first proposed for microarchitectural simulations. Lauterbach [151] suggested periodically creating instruction traces of short samples and simulating the samples in parallel on multiple processors. A benefit of this method compared to parallel multicore simulation is that it scales almost linearly with the number of processors available for simulation instead of the simulated degree of parallelism. However, the solution by Lauterbach requires a time-consuming (days to weeks [151]) setup phase in which representative samples are detected and instruction traces are generated.

A similar, but much faster variant of sampling with SMARTS has been presented under the name pFSA [229]. In contrast to FSA, which serially switches back and forth between hardware-assisted virtualization and simulation, pFSA forks the VM whenever it reaches a sampling point. The forked VM then performs the functional warming and detailed simulation in parallel to the execution of the hardware-assisted parent and other concurrently running simulations. As with the trace-based approach by Lauterbach, the speedup is almost linear in the number of cores used for simulation. However, since pFSA uses forking instead of checkpoints, scaling the parallelization over multiple hosts is not easily possible. Furthermore, every iteration in the analysis requires a full re-run of the workload. Since pFSA uses hardware-assisted virtualization, the execution is non-deterministic. Rare events such as race conditions cannot be faithfully reproduced this way. In addition, pFSA is not able to simulate interactive workloads because external input is not fed into the simulations.

LiveSim [110] also periodically forks the simulator process during a setup phase in a functional simulation so as to preserve in-memory snapshots of the VM. This is comparable to the approach in pFSA; however, LiveSim forks the children again before actually starting simulations in order to keep the originally saved state for repeated experiments and to allow reproducing identical simulations – given that the simulations are fully deterministic.

Forking has also been utilized in Shadow Profiling [184] to perform parallel profiling of a user-mode process with Pin [162]. In contrast to pFSA and LiveSim, which use periodic sampling, the degree of parallelism in Shadow Profiling is determined by a user-defined load factor that limits the number of concurrent simulations. Whenever the number of simulations falls below the load factor, Shadow Profiling forks off a new simulation at the current instruction pointer. Each simulation then runs for a configured number of instructions.

Although the authors of LivePoints [276] do not present an implementation, they mention the idea of starting parallel simulations of samples based on checkpoints. These can be more easily transferred over a network than live processes.

While this would help with the limited scalability of pFSA and LiveSim, LivePoints still depends on functional simulation and shares the other shortcomings of pFSA and LiveSim, such as the inability to faithfully simulate interactive workloads. Considering that these projects are designed for simulating very short samples (1000 instructions) in microarchitectural simulations, this is not surprising. Instead, it underlines that, while these approaches present innovative concepts, they are not suited for general OS and security research, which requires repeatable continuous simulation of potentially interactive workloads.

Nevertheless, the partitioning and parallelization of simulation time is a promising approach and conceptually not limited to samples; it can also be applied to whole time intervals. This has been successfully shown by Heidelberger and Stone [111], as well as Nguyen et al. [192], who, similarly to Lauterbach, suggest using trace-driven simulation. However, they do not use samples but split the trace into disjoint time intervals that – in a second stage – can be processed in more detail simultaneously. Nguyen et al. measured a speedup of 14x when using sixteen processors. Although this allows continuous full-length simulation of the original workload, it still requires a time-consuming setup with functional simulation to collect the (potentially very large) trace beforehand. In addition, any changes to the information contained in the trace necessitate a full re-run.

To warm up architectural state (e.g., caches) at the interval beginnings, Nguyen et al. propose to overlap intervals. DiST [101] further enhances this idea by introducing automatic warm-up phase scaling. The authors report speedups between 20x and 39x for forty processors. DiST directly parallelizes microarchitectural simulations by spawning parallel processes that use functional simulation to fast-forward to the respective interval. Since both the functional simulations and the microarchitectural simulations are deterministic in nature, it is guaranteed that the executions do not diverge. Nevertheless, DiST is not capable of handling interactive workloads and cannot be applied to accelerate functional full system simulation.

A solution specifically for single-threaded user-mode processes has been demonstrated with SuperPin [267] – a parallelized version of Pin. In contrast to Shadow Profiling, SuperPin splits the workload execution into slices, either on a system call or after a fixed timeout. At interval boundaries, SuperPin forks the process and the child switches to simulation. The child stops when it detects a CPU state signature that SuperPin has recorded at the end of the respective interval. Simulations are therefore delayed until the next interval. Whereas SuperPin reproduces the return values of system calls, the authors did not provide any information about whether SuperPin is able to faithfully replay asynchronous signals. The speedup over serial simulation ranges from 3x to 7x for eight processors.

Parallelization through checkpointing and deterministic replay for multi-threaded processes, including asynchronous signals, has also been found to be very effective

in JetStream [211] – an approach to parallelize dynamic information flow tracking (DIFT) in Pin. The speedup for DIFT queries ranges from 12x to 48x for 128 cores. The authors stress the performance benefits that checkpointing brings compared to fast-forwarding by plain replay as, for instance, used in DiST.

3.3 Conclusion: Limitations of the State of the Art

Virtual machines have proven to be a powerful environment for conducting operating systems and security research. However, the range of applicable tools heavily depends on the technology utilized for virtualization. While interpretation and dynamic binary translation provide researchers with a maximum of flexibility and allow for detailed instrumentation and tracing, in practice, the slowdown of functional full system simulation is a major obstacle. Our measurements confirm the observations published in the literature that the slowdown for functional full system simulation typically ranges from one to two orders of magnitude, with a high sensitivity to additional overhead from analysis code. This makes analyzing long-running workloads infeasible, prohibits interactivity, and at the same time perturbs results by distorting the timing within the simulation. Although incorporating detailed architecture models can produce more faithful timing, this does not address the lack of interactivity and introduces considerable further slowdown. In addition, such models require extensive development efforts to exhibit realistic timing. It has been found that simple timing models, albeit used by hundreds of publications, possess considerable errors.

On the other side of the spectrum, hardware-assisted virtualization brings near-native execution speed, including full interactivity. However, the removal of software interposition deprives researchers of the possibility to leverage instrumentation and tracing tools. This pushes hardware-assisted virtualization toward productive use and makes it less attractive in research and development. While heterogeneous deterministic replay is able to inject recorded interactions into a functional simulation and reproduce the realistic timing of interrupts from the original hardware-assisted run, it does not accelerate the simulation itself.

Current solutions to speed up simulations, in turn, are inadequate for operating systems and security research as they are either not applicable to continuous functional simulation, restricted to single user-mode processes, limited in scalability, or prohibit interactivity. Partitioning and parallelizing the simulation has been shown to be a powerful concept with promising speedup characteristics. However, none of the approaches present today is able to apply the technique to continuous functional full system simulation to allow a comprehensive analysis of long-running and interactive workloads.

Chapter 4

SimuBoost

While functional full system simulation (FFSS) is a powerful technique, its immense slowdown in the range of one to two orders of magnitude (1) deprives simulations of interactivity with human users and non-simulated network peers, (2) limits the coverage of experiments that can be achieved in reasonable time, and (3) cuts the accuracy of measurements through time dilation. As we concluded in Chapter 3, existing methods of acceleration fall short of providing a scalable solution for realistic and repeatable continuous functional simulation of interactive virtual machines. This presents a major obstacle for comprehensive use of FFSS in operating system and security research. It is therefore desirable to develop an acceleration method that provides a drastic speedup over conventional FFSS, thereby mitigating the limitations that developers and researchers face today when utilizing such simulation techniques.

In this chapter, we introduce SimuBoost, our acceleration method for functional full system simulation. Based on the shortcomings of existing techniques, we begin in Section 4.1 by inferring a set of goals to which a solution should adhere. In Section 4.2, we elaborate on the core idea behind SimuBoost and explain how SimuBoost is able to satisfy these goals. To highlight the points in which our work differs from previous publications in this area, we briefly recapitulate the relevant related work and perform a direct comparison with SimuBoost in Section 4.3. A summary in Section 4.4 concludes the chapter. The following chapters then describe and evaluate the key building blocks of our prototypical SimuBoost implementation.

As outlined in § 1.2, we focus on single-core simulations only to demonstrate the general feasibility of our approach. We discuss to what extent SimuBoost is applicable to multicore simulations and what challenges remain to be solved on this route in the course of Chapter 10.

4.1 Goals

An acceleration method for functional simulation that is tailored to operating system research should fulfill the following goals:

Temporal Completeness: Sampling [229, 237, 284] generates speedup by simulating only short windows. This method does not permit the tracking of individual events (e.g., memory (de-)allocations [52]) and is not well suited to discover operational sequences (e.g., memory access patterns [133]). Truncation, on the other side, has been shown to suffer from high inaccuracy by ignoring program phase behavior [289]. An acceleration method should therefore provide full-length continuous simulation.

Spatial Completeness: Completeness does not only relate to the temporal dimension, but also includes the spatial domain – i.e., the code that can be observed. A functional simulation for operating system research must include privileged system components and drivers. Accordingly, a suitable acceleration method must be applicable to full system simulation.

Scalability: Multicore and multiprocessor systems are prevalent today and even clusters are readily available as cloud services. A technique to speed up FFSS should make use of this computational capacity and scale with the available physical parallelism. This should be independent of the simulated degree of parallelism so that single-core simulations benefit as well.

Interactivity: The inability to cover workloads that interact with a human user or a non-simulated network peer is a major downside of conventional functional simulation. The acceleration method should increase the simulation speed to such an extent that full interactivity is possible and is maintained even under the stress of added instrumentation.

Accuracy: Inaccurate simulations inevitably lead to inaccurate measurements. Depending on the information to be obtained from the experiments, a lack of accuracy can render simulation as a methodology useless. This is especially a concern if the additional cost (i.e., development time and run time) of a realistic hardware model is not affordable or the know-how is proprietary. In the absence of such a model, the acceleration method should at least mitigate inaccuracies due to time dilation [56] while minimizing the probe effect [60] – i.e., the divergence from a non-simulated execution with the same speed.

Portability: The field of operating system research naturally deals with a diverse set of hardware architectures. For this reason, it is desirable for the acceleration method to be universally applicable and not depend on implementation details of a particular hardware or simulation platform. Instead, it should be usable on widely available commodity hardware.

4.2 Approach

To achieve a maximum speedup for a given hardware setup, the scalability of an acceleration method is of utmost importance. In § 3.2, we highlighted that the parallelization of simulation time has proven to provide very good scalability with almost linear speedups. In contrast to methods that parallelize the execution of simulated CPUs, the parallelization of simulation time is also applicable to single-core simulations. At the same time, it is not limited to sampling, but equally allows continuous full-length observation (i.e., temporal completeness). Another advantage is that the concept is very portable because it does not merely gain speedup from optimizing implementation details – which exhibits only limited scalability anyway.

Due to its effectiveness, we have chosen to employ the parallelization of simulation time as the core idea in SimuBoost [219]. The run time of the serial simulation is split into disjoint time intervals (see Figure 4.1). The intervals are then distributed over a set of nodes for parallel simulation, where the nodes are dedicated physical CPU cores on one or more hosts. In this model, the overall run time equals the simulation time of the longest interval.

The simulation for interval i[k] with k ∈ [2, n] and n ∈ ℕ requires the simulated machine's state at the beginning of this very interval. The machine state, in turn, results from the execution of the intervals i[1] to i[k−1]. This creates a dependency chain which forbids the early start of parallel simulations. Another drawback of this method is that it neither improves accuracy nor allows interactivity because each interval is still executed with conventional slow functional simulation.


Figure 4.1: The serial simulation is split into disjoint time intervals and distributed over simulation nodes (e.g., CPU cores or hosts) for parallel simulation.

A setup phase with an initial full-length serial simulation as in trace-driven approaches would allow gathering the machine state at the interval boundaries. Subsequent runs could then be executed with parallelization. However, the lack of accuracy and interactivity remains. Also, requiring a time-consuming setup phase is unfavorable and considerably increases turnaround time.

To quickly retrieve the machine state at sampling points, pFSA [229] leverages hardware-assisted virtualization (HAV). Instead of running the workload in a serial functional simulation to fast-forward between samples, pFSA uses a regular fast virtual machine and forks off simulations as appropriate. SimuBoost adopts this general principle (see Figure 4.2) and applies it to continuous simulation: The workload executes in a hardware-assisted virtual machine (HVM) with near-native speed. Periodically, SimuBoost takes a checkpoint to capture the state of the VM, thereby marking interval boundaries. The checkpoint contains a consistent image of the virtual machine's memory, device states, and persistent storage at the time of taking. SimuBoost uses the checkpoint to bootstrap the simulation of the respective interval on a different node. In contrast to forking, checkpoints can more easily be transferred over a network, improving scalability. They can also be saved to disk for repeated runs. Although the simulations do not all start right at the beginning, the slowdown between hardware-assisted virtualization and functional simulation (exemplarily depicted as 4x) drives the parallelization.

Figure 4.2: The workload runs in a hardware-assisted virtual machine with near-native speed. At interval boundaries, SimuBoost takes checkpoints, which it uses to bootstrap parallel simulations. The slowdown between HAV and functional simulation (here 4x) drives the parallelization.

Using a hardware-assisted virtual machine as input to the simulation has the additional benefit of recovering interactivity. Assuming checkpoints can be taken with low overhead (see Chapter 6), the HVM is fast enough to be actively controlled by a user just like a regular virtual machine in productive use. Similarly, the virtual machine is capable of communicating with non-simulated remote peers. Since the simulations execute in parallel to the HVM, the performance of the virtual machine is decoupled from the execution speed of the simulations. This effectively maintains interactivity even in the face of additional instrumentation overhead. The portability of the approach is only limited in that the target architecture must support some form of fast hardware-assisted virtualization. However, this feature is generally present today (e.g., x86, ARM, MIPS, and PowerPC) or planned for future releases (e.g., RISC-V).
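The pipeline described above can be condensed into a few lines: the virtualization stage emits a checkpoint at every interval boundary, and each checkpoint bootstraps an independent simulation job on a free node. The sketch below is only a schematic outline with invented function names; the actual SimuBoost components are the subject of the following chapters.

```python
# Condensed sketch of the SimuBoost pipeline (illustrative names only):
# the HVM run periodically emits checkpoints, each of which bootstraps an
# independent simulation job for the interval it starts.
from concurrent.futures import ProcessPoolExecutor

INTERVALS = 8          # number of intervals in this toy run
NODES = 4              # simulation nodes (here: worker processes)

def take_checkpoint(interval):
    # Stand-in for capturing RAM, device state, and storage at the boundary.
    return {"interval": interval, "state": f"vm-state@{interval}"}

def simulate_interval(checkpoint):
    # Stand-in for loading the checkpoint and simulating until the next boundary.
    return f"trace for interval {checkpoint['interval']}"

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=NODES) as pool:
        jobs = []
        for k in range(INTERVALS):
            # ... hardware-assisted execution of interval k happens here ...
            cp = take_checkpoint(k)              # interval boundary reached
            jobs.append(pool.submit(simulate_interval, cp))
        for job in jobs:
            print(job.result())
```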

4.2.1 State Deviation

Whereas simulations can be built to always emit identical deterministic runs, for example, according to a specified timing model, hardware-assisted virtualization is subject to non-deterministic input such as erratic I/O completion timing. In consequence, the parallel simulations in SimuBoost experience different timing behavior than the hardware-assisted execution of the same interval. Furthermore, the checkpoints only capture the result of past interaction, but they do not contain information on interactions in the current interval. Hence, any external input such as user commands or network packets received by the HVM is not reproduced in the simulations. The same applies to non-deterministic instruction results (e.g., readings of a timestamp counter). Consequently, the executions in the HVM and the FFSS of each interval diverge. Besides effectively losing interactivity, this is problematic in two ways:

1. Since simulations start off checkpoints that originate from the hardware-assisted virtual machine, the divergence breaks the functional continuity between interval boundaries in the simulation stage. That is, the machine state at the end of the simulation of interval i[k] does not match the state at the beginning (i.e., the checkpoint) of interval i[k + 1]. For example, recording a coherent instruction trace under such circumstances is infeasible.

2. For researchers, taking measurements and retrieving data from a simulation which behaves differently from what they can see (i.e., the HVM) is at least confusing. In the worst case, the simulation is of little value if it does not reproduce the desired behavior as triggered in the HVM.

A possible solution is to take checkpoints not periodically but at each non-deterministic event. This inherently captures the point in time as well as the state modification caused by the event. This is similar to what has been done in Kemari [253], where a backup VM is synchronized via a checkpoint whenever the master VM sends network packets or writes to persistent storage. However, Kemari incurs a 2x performance overhead. Furthermore, our experiments show that on average 7400 (max: 230K) non-deterministic events per second occur during a Linux kernel build. This suggests that we would have to take a checkpoint every 135 µs on average and every 4 µs at peak times, which is clearly not feasible without severe performance degradation.
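The checkpoint intervals quoted above follow directly from the measured event rates; the following lines merely recompute them:

```python
# Average and peak inter-event times implied by the measured event rates.
avg_rate, peak_rate = 7_400, 230_000        # non-deterministic events per second
print(f"average: {1e6 / avg_rate:.0f} us between events")    # ~135 us
print(f"peak:    {1e6 / peak_rate:.1f} us between events")    # ~4.3 us
```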

To counter state deviation, SimuBoost therefore leverages heterogeneous deterministic replay. Non-deterministic input such as the timing of interrupts, the payload of I/O operations, and the results of non-deterministic instructions is recorded during the hardware-assisted virtualization as discrete events and is precisely replayed in the simulations. This restores functional continuity and reproduces user as well as network input, which is indispensable to simulate interactive workloads. Since the non-deterministic input needs to be fully captured prior to simulation, SimuBoost delays the parallelization by one interval (compare Figure 4.2).

An advantage of deterministic replay is that it injects realistic timing into the simulation. This frees researchers from having to install a sophisticated timing model. In fact, the authors of PTLsim/X [292] advocate the use of deterministic replay even for microarchitectural simulations in order to create realistic timing. Furthermore, it has been shown that small changes in timing can have a great effect on simulation results [29]. By replaying different runs of the same scenario, this can easily be taken into account with SimuBoost. This stands in stark contrast to deterministic simulations like, for instance, in Simics [163], which always produce exactly the same execution and thus fail to capture variations. If, on the other hand, a repeated, exactly identical execution is needed with SimuBoost, the same recording can be replayed any number of times.

On the flip side, the combination of checkpointing and deterministic replay closely ties the simulation to the hardware-assisted execution. Although Viennot et al. demonstrated that a replay can tolerate modified executable code to some extent [263], experiments generally have to remain passive observations. SimuBoost is thus not suited to, for instance, evaluate the effects of novel memory architectures because this would require a feedback loop into the execution to apply new timing information. Forcing the simulation to deviate from the recorded execution path would break functional continuity and prevent a correct replay of interactions and other non-deterministic events. Nevertheless, SimuBoost is perfectly suited if detailed insight into an existing realistic execution is desired – for example, when debugging or collecting traces. These traces, in turn, can be fed into new architectural models.
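Conceptually, heterogeneous replay pairs every recorded non-deterministic event with a landmark in the instruction stream (e.g., the guest instruction count at which it occurred) and injects it when the simulation reaches that point. The sketch below illustrates this record/replay pairing; the event format and the choice of landmark are simplifying assumptions, not SimuBoost's actual log format.

```python
# Simplified record/replay of non-deterministic events. Each event carries a
# landmark (here: the guest instruction count at which it occurred) plus its
# payload. The log format is an illustrative assumption, not SimuBoost's.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    insn_count: int        # landmark: when the event must be injected
    kind: str              # e.g., "irq", "dma", "rdtsc"
    payload: Any

@dataclass
class ReplayLog:
    events: list = field(default_factory=list)

    def record(self, insn_count, kind, payload):
        self.events.append(Event(insn_count, kind, payload))

    def due(self, insn_count):
        """Return and consume all events that belong to this execution point."""
        ready = [e for e in self.events if e.insn_count <= insn_count]
        self.events = [e for e in self.events if e.insn_count > insn_count]
        return ready

if __name__ == "__main__":
    log = ReplayLog()
    # Recording side (hardware-assisted run): capture timing and payload.
    log.record(1_000, "rdtsc", 123456789)
    log.record(2_500, "irq", {"vector": 0x30})
    # Replay side (functional simulation): inject at the recorded landmarks.
    for executed in (1_000, 2_000, 3_000):
        for ev in log.due(executed):
            print(f"inject {ev.kind} at insn {ev.insn_count}: {ev.payload}")
```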

4.3 Comparison with Related Work

In 2002, the authors of ReVirt [87], one of the early works on homogeneous full system deterministic replay, already mentioned the idea of creating intermediate checkpoints during native execution and then starting replay off these checkpoints, in their case to fast-forward to points of interest. In contrast to SimuBoost, the authors did not discuss parallelization. In fact, being homogeneous in nature, the replay in ReVirt remains a native execution and thus does not involve functional simulation. Moreover, the authors conceived checkpointing to be a rare event (once every few days), which, according to the authors, does not justify special optimization. The actual ReVirt prototype lacks support for any form of checkpointing and is only able to start replay from a powered-off virtual machine.

Three years later, King et al. developed a debugging tool for operating systems based on ReVirt, called TTVM [137]. A distinctive feature is its ability to perform reverse debugging, for which TTVM has to periodically create checkpoints. Hence, King et al. eventually extended ReVirt with support for checkpointing, although not to realize parallelization. Nevertheless, with intervals of 10 s to 25 s, the checkpointing frequency is rather low (too low for online parallelization as in SimuBoost). Since ReVirt supplies the replay functionality, TTVM also inherits its restriction to native execution.

In 2007, Xu et al. [286] proposed an acceleration method close to SimuBoost in a side note of their work on ReTrace. They suggested parallelizing extensive full system execution tracing by combining hardware-assisted virtualization with checkpointing, heterogeneous deterministic replay, and functional full system simulation. However, their implementation in ReTrace does not include any parallelization (or checkpointing) but solely demonstrates tracing in a serial heterogeneous replay. The authors also did not discuss requirements or challenges bound to the parallelization (e.g., in the area of checkpoint creation and distribution), nor did they provide a perspective on the possible gain from this acceleration method. This is in contrast to our work, which covers these aspects and presents a working prototype.

A combination of native execution with checkpointing and deterministic replay has actually been implemented only in JetStream [211], a recent tool to parallelize dynamic information flow tracking. A major limitation of JetStream is its restriction to single user-mode processes. SimuBoost, on the other hand, targets the full system and thus supports holistic system analysis, including groups of processes as well as privileged kernel-mode components – which is indispensable for operating system research. Moreover, contrary to SimuBoost, JetStream splits the execution into intervals offline. That is, it first records an execution, then replays it to create checkpoints, producing one interval for each CPU, and finally reruns the whole execution in parallel using the simulator. While the authors found the approach to be very effective, the offline method considerably delays the start of simulations. SimuBoost creates checkpoints on-the-fly during the original run to allow immediate simulation and reduce turnaround time. To be effective, SimuBoost must thus create checkpoints continuously while at the same time keeping the overhead at a minimum so as to preserve interactivity. Covering the entire system and starting parallel simulations right away as in SimuBoost thus raises the bar for efficient (subsecond) interval checkpoint creation and distribution, where at the same time less semantic information is available and the data volume is higher compared to checkpointing only at the application level.

The restriction to user-mode processes also applies to Shadow Profiling [184] and SuperPin [267]. These projects do use application-level deterministic replay, but besides missing full system support, they are limited in scalability by resorting to forking the simulator process instead of using checkpoints. The latter is also the case with pFSA [229], which is geared toward parallelizing the microarchitectural simulation of discrete samples1. Shadow Profiling only simulates individual samples and the evaluation did not exceed two parallel instances. SuperPin and pFSA are evaluated to the maximum degree of parallelism provided by the available test platforms. Including hyperthreads, the benchmarks, however, did not exceed 16 and 32 parallel instances, respectively. In most cases, hyperthreads also brought no performance improvement. Conversely, the checkpointing in SimuBoost allows distributing the simulation over multiple physical hosts, thereby boosting the available hardware parallelism and enhancing scalability.

Another project related to SimuBoost has been presented by Girbal et al., called DiST [101]. It implements parallelization of microarchitectural simulations across multiple physical hosts but without using checkpoints. Instead, each node fast-forwards to the respective interval by executing the workload with functional simulation. Hence, DiST leverages the speed difference between microarchitectural and functional simulation, similar to the way SimuBoost leverages the speed difference between functional simulation and hardware-assisted virtualization. Yet, an important difference between both approaches is that the checkpoints relieve SimuBoost from having to fast-forward. The authors of DiST report this to be a major scalability issue and consequently advocate the use of checkpoints instead. Due to its dependency on pure functional simulation, DiST also cannot faithfully handle interactive workloads.

To conclude, we can state that splitting the simulation time into parallelizable epochs has proven in the past to provide considerable acceleration potential. However, none of the existing solutions based on this principle applies to continuous functional full system simulation. To the best of our knowledge, SimuBoost is the first approach to enable parallelization of interactive and long-running functional full system simulations.

In addition, this work contributes valuable insights in the area of efficient contiguous checkpointing, relevant also to related research areas (e.g., fault tolerance systems like Remus [72]). Furthermore, only marginal research has been done on how to perform heterogeneous deterministic replay at the machine level; with no solution, including ReVirt, being available to the research community2. SimuBoost, on the other hand, is freely accessible at https://github.com/simutrace/.

1 pFSA does not provide any means to perform replay.
2 The only projects covering full system heterogeneous deterministic replay have been undertaken in closed source by VMware for their proprietary VMware Workstation product [63, 64, 286] and by Yan et al. for V2E [287]. Unfortunately, the authors did not publish their source code and multiple attempts to receive a copy remained unsuccessful.

4.4 Conclusion

Despite the fact that the parallelization of simulation time has been successfully leveraged to substantially accelerate simulation in general, no previous work to date has applied this method to functional full system simulation. SimuBoost strives to provide the first workable prototype. The core idea is to run the workload of interest in an interactive hardware-assisted virtual machine. SimuBoost periodically takes checkpoints with high frequency, creating disjoint intervals. The checkpoints are then distributed in a simulation cluster to start parallel simulations of the respective intervals. Heterogeneous deterministic replay guarantees that simulations precisely reproduce the execution observed in the virtual machine and that functional continuity is maintained.

While the principle behind SimuBoost is easy to describe, the actual implementation comes with many intricate technical challenges. We need a performance model to select the optimal interval length for a certain configuration. Checkpoints have to be created with high frequency, while at the same time SimuBoost must keep the downtime and run-time overhead low in order to maintain interactivity and representativeness. To start simulations with low latency, SimuBoost has to quickly transfer these frequent checkpoints over the network and spawn hundreds to thousands of individual simulations over the course of a single workload. To fulfill the portability goal, it is desirable that this process can still be handled by commodity network infrastructure such as Gigabit Ethernet. And finally, with only little experience in heterogeneous full system replay among the research community and no prior project to build on, implementing such a mechanism from scratch comes with its own set of challenges. In Chapters 5 to 8, we subsequently dedicate one chapter to each of these four areas, before presenting a thorough evaluation of SimuBoost's acceleration performance in Chapter 9.

We have implemented our prototype in QEMU/KVM 2.6.5 (§ 2.2.5) on Linux 4.3. If not stated otherwise, all benchmarks presented in the following four chapters have been run using our main evaluation setup (System V/1) as summarized in § 9.1.1 on page 189 and the following. Results are the median of five runs and error bars indicate the empirical 0.025 and 0.975 quantiles – i.e., they cover 95% of samples. For box plots, we indicate the empirical 0.25 and 0.75 quantiles (50% of samples) with the box and the 0.025 and 0.975 quantiles with the whiskers. A detailed description of the benchmarks can be found in § 9.1.2.

Chapter 5

Performance Model

An open question in our description of SimuBoost so far is the determination of the right interval length. Similarly, we want to estimate the speedup that is attainable with a certain configuration. The performance model for SimuBoost aims to fill this gap.

In offline approaches such as trace-driven simulation, where all required information is available prior to simulation, the run time Tvm of the workload in the hardware-assisted virtual machine can simply be split into N equally-sized chunks, with N being the number of nodes. The interval length L is then L = Tvm / N. Assuming that on average the simulation time for every interval varies only marginally, the simulation load is uniformly distributed and the parallel simulation time roughly corresponds to the simulation time of a single interval (compare Figure 4.1).

However, this model is not applicable to SimuBoost because SimuBoost creates checkpoints on-the-fly – i.e., online. This much more resembles a producer-consumer problem, where the interval length determines the production rate, and the number of nodes and the simulation time per interval limit the consumption rate. In practice, additional factors such as the downtime during checkpointing and the checkpoint loading time come into play.

To make predictions on the run-time characteristics that can be expected from SimuBoost, we have developed a corresponding mathematical model [89, 219, 225]. It provides information about the interval length that should be chosen for a given workload and hardware configuration to achieve the best possible speedup. It also allows us to estimate the parallel simulation time and identify important metrics that affect the performance.

In Section 5.1, we first describe the optimal case, in which a sufficient number of nodes is available to reach the maximum speedup for a given scenario. In Section 5.2, we then extend this model to be applicable in cases where only a limited number of nodes is at the researcher's disposal.

5.1 Optimal Setup

SimuBoost takes checkpoints periodically throughout the run time of the experiment. Parallel simulations may start when a checkpoint becomes available and the non-deterministic events of the respective interval have been collected – i.e., at the end of the interval in the virtualization stage. To reach a maximum degree of parallelization and thus acceleration, intervals need to be simulated as soon as they become available. The optimal model therefore assumes that there is always a free node which can immediately start a simulation.

The model uses the following notation (all times and lengths are in seconds [s], and svm, ssim ≥ 1):

Tvm   := Total time for hardware-assisted execution (without SimuBoost)
Tsim  := Total time for serial simulation (without SimuBoost)
Tps   := Total time for parallel simulation
Tidle := Idle time of node n
Tbusy := Busy time of node n
L     := Interval length
svm   := Slowdown of virtualization due to SimuBoost
ssim  := Slowdown of simulation compared to virtualization (Tvm)
tc    := Checkpointing downtime
ts    := Simulation start-up time

[Figure 5.1: The parallelization hides the execution time for all but the last simulation. The overall parallelized run time is thus the sum of the idle and busy times of the nth node.]

Let N ∈ ℕ be the number of nodes, and let n ∈ ℕ be the number of intervals of length L. Then a simple way to comply with this assumption is:

N = n (5.1)

That means for every interval, we have exactly one dedicated simulation node as illustrated in Figure 5.1. Using more than n nodes (i.e., N > n) does not provide any benefit because these nodes would not receive any work.

5.1.1 Parallel Simulation Time and Speedup

The first metric we can determine using Assumption 5.1 is Tps, which is the total run time of a parallel simulation with SimuBoost, starting with the virtualization stage. Tps thus represents the time a user has to wait from the beginning of an experiment until the simulation of the last interval finishes.

Since today it is common to execute workloads in virtual machines even in productive use, we take the run time Tvm with regular hardware-assisted virtualization as the baseline. However, when executing the workload with SimuBoost, the checkpointing and recording of deterministic events incurs overhead which slows down the hardware-assisted execution, mainly for two reasons. First, additional VM exits occur while the VM is running, for example, for event recording. We denote this overhead by the slowdown factor svm. Second, the checkpointing suspends the virtual machine once per interval to take a consistent snapshot of the machine's state. This further prolongs the effective run time of the workload in the virtualization stage. We refer to the downtime induced by a single checkpoint with tc. Therefore, when SimuBoost takes a checkpoint every (tc + L) seconds in wall-clock time, the VM has performed the same amount of work as in L / svm seconds without SimuBoost.
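For illustration, with the kernel-build parameters reported later in this chapter (tc ≈ 0.012 s and svm ≈ 1.107) and L = 1 s, one checkpointing cycle occupies roughly 1.012 s of wall-clock time, during which the guest makes progress equivalent to only about 1 s / 1.107 ≈ 0.90 s of execution without SimuBoost.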

The first component that determines Tps is the idle time Tidle(n), which is the time the last node has to wait until the last interval in the workload has been executed. Only then, the simulation of the last interval can start (see Figure 5.1):

Tidle(n) := n(tc + L) (5.2)

The second component that determines Tps is the performance of the simulation itself. If we assume that the effort to simulate the workload (i.e., instructions, I/O, etc.) is evenly distributed over the entire run time¹, the simulation of every interval takes roughly the same time. Let ssim be the slowdown factor from hardware-assisted virtualization (Tvm) to conventional full system simulation,

¹ In practice, this may not hold if the workload exhibits clear phase behavior. Nevertheless, for reasons of simplicity, we ignore this in the model and assume uniform load.

including the overhead for analysis. Then (L / svm) · ssim = (ssim / svm) · L is the net simulation time of a single interval and Tsim(n) is the overall time for serial simulation:

Tsim(n) := n · (ssim / svm) · L        (5.3)

Since each simulation starts off a checkpoint, we have to account for the time that is required to transfer the checkpoint to the target node, initialize the simulator,

and load the checkpoint. We denote this start-up time by ts. This gives us the total simulation time per interval:

Tbusy(n) := ts + (1/n) · Tsim(n)        (5.4)

Considering a 1:1 mapping between nodes and intervals – which is the premise in the optimal setup – there is always a free node available to simulate a newly arriving interval. Together with the uniform interval simulation time, this effectively hides the parallel simulation of all intervals, except the last one (i.e., n), behind Tidle(n). We can therefore define the parallel simulation time Tps(n) as:

Tps(n) := Tidle(n) + Tbusy(n)        (5.5)

In practice, the number of intervals is not directly controlled by the user. Instead, the user configures an (optimal) interval length L, and the number of intervals

results from the given run time Tvm of the workload. It is thus more useful to define Tps in terms of L. Since with SimuBoost the effective run time extends to svm Tvm, we can calculate the number of intervals with:

n(L) := ⌊svm Tvm / L⌋        (5.6)

The use of the floor function is required because the model assumes that every interval has the length L. If svm Tvm is not a multiple of L, that is, svm Tvm = nL + r with r > 0, we thus cut off the remaining run time r and do not include it in the parallel simulation as the user can always run the VM for one interval longer, if necessary². For Tps, we receive:

Tps(n(L)) := Tidle(n(L)) + Tbusy(n(L))        (5.7)
          = ⌊svm Tvm / L⌋ · (tc + L) + ts + (ssim / svm) · L        (5.8)

The speedup with SimuBoost over conventional serial simulation is then:

S(n(L)) := Tsim(n(L)) / Tps(n(L))        (5.9)

² In practice, Tps is determined by the completion time of the last interval simulation, irrespective of the index of the interval. The model assumes this to always be interval i[n]. If we simulated r, we would allocate a new node like for every other interval and i[n+1] ≙ r would be the last interval. However, depending on the length of i[n+1], the simulation time might be hidden by i[n] – i.e., i[n+1] completes before i[n]. The model would need to take this into account by finding the maximum completion time between i[n+1] and i[n]. For simplicity, we instead cut off r and require the user to prolong the overall recording phase, if necessary.
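To make these definitions concrete, the following sketch evaluates Equations 5.2 to 5.9 in Python. The function and variable names are ours and simply mirror the model's symbols; the snippet only illustrates the formulas and is not part of the SimuBoost implementation.

    from math import floor

    def n_intervals(L, T_vm, s_vm):
        """Number of full intervals n(L), Equation 5.6."""
        return floor(s_vm * T_vm / L)

    def T_idle(n, L, t_c):
        """Idle time of the last node, Equation 5.2."""
        return n * (t_c + L)

    def T_sim(n, L, s_sim, s_vm):
        """Total serial simulation time of n intervals, Equation 5.3."""
        return n * (s_sim / s_vm) * L

    def T_busy(n, L, t_s, s_sim, s_vm):
        """Start-up plus net simulation time of one interval, Equation 5.4."""
        return t_s + T_sim(n, L, s_sim, s_vm) / n

    def T_ps(L, T_vm, s_vm, s_sim, t_c, t_s):
        """Parallel simulation time in the optimal setup, Equations 5.5/5.8."""
        n = n_intervals(L, T_vm, s_vm)
        return T_idle(n, L, t_c) + T_busy(n, L, t_s, s_sim, s_vm)

    def speedup(L, T_vm, s_vm, s_sim, t_c, t_s):
        """Speedup over conventional serial simulation, Equation 5.9."""
        n = n_intervals(L, T_vm, s_vm)
        return T_sim(n, L, s_sim, s_vm) / T_ps(L, T_vm, s_vm, s_sim, t_c, t_s)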

5.1.2 Optimal Interval Length

If we had an unlimited number of nodes at our disposal, we intuitively could just decrease L (i.e., increase n) to improve the speedup. This would raise the degree of parallelism and improve efficiency by shortening the compulsory idle phases at the beginning and the end (i.e., the "stairway" in Figure 5.1 becomes taller and steeper). At the same time, low interval lengths would reduce the run time of the last interval, which greatly affects Tps. However, with shorter intervals the overall overhead for checkpoint creation, distribution, and loading increases, which impairs the resulting speedup and eventually becomes a limiting factor. Conversely, choosing L too long leads to poor parallelism, which also hurts the speedup. As an example, Figure 5.2a illustrates the model-predicted relationship between speedup and interval length for a Linux kernel build³.

To determine the optimal interval length Lopt for a given scenario, we would need to solve ∂/∂L S(n(L)) = 0 for L. However, due to the floor function in n(L), S(n(L)) is not differentiable (see Figure 5.2b). At each discontinuity d, the number of intervals decreases by one if we increase L (i.e., lim L→d⁻ r = 0), or increases by one if we decrease L (i.e., lim L→d⁺ r = L), respectively.

³ Parameters for Linux kernel build: Tvm = 435 s, ts = 0.44 s, tc = 0.012 s, ssim = 53, svm = 1.107. Values have been determined experimentally with a prototype of SimuBoost for L = 1 s.

[Figure 5.2: Model-predicted speedup over the interval length L for a Linux kernel build, showing S(n(L)) and the approximation S(ñ(L)) in panels (a) and (b), with Lopt ≈ 0.35 s. Choosing the right interval length is crucial: speedup becomes limited by overhead for too short values of L; too long values lead to poor parallelism. The optimal interval length is marked with Lopt.]

This affects the degree of parallelism (N = n) and creates the (small) steps in the speedup function. Besides computing the optimal interval length numerically, we can use the differentiable approximation S(ñ(L)), with:

ñ(L) := svm Tvm / L        (5.10)

S(ñ(L)) equals S(n(L)) for every L where r = 0 and interpolates in between. As shown in Figure 5.2b, S(ñ(L)) overestimates the speedup between steps compared to the predicted non-differentiable solution because the increase in saved simulation time due to parallelization is larger than the increase in simulated workload time. We can determine the relative error between S(ñ(L)) and S(n(L)) with:

∆S(L) := S(ñ(L)) / S(n(L))        (5.11)

Figure 5.3 shows the relative error around the corresponding numerically determined optimal interval lengths. For all benchmarks, the divergence stays below 0.2%, but rises for ascending interval lengths. However, this is not a problem because the optimal setup prefers a high degree of parallelism, which in turn

requires a large number of intervals. Lopt can thus be expected to lie in the range of short interval lengths, where the step width and therefore the error is small.

[Figure 5.3: The error of S(ñ(L)) compared to S(n(L)) around the numerically determined Lopt is below 0.2% for every benchmark. Numerically determined Lopt: kernel build 0.35 s, postmark 0.15 s, sqlite 0.25 s, apache 0.25 s, specjbb 2.05 s.]

Subsequently, we conclude that S(ñ(L)) is sufficiently accurate for calculating Lopt and solve ∂/∂L S(ñ(L)) = 0 for L. Since we can safely assume that all parameters are positive, we omit the negative result and get:

Lopt := √(svm² Tvm tc / ssim)        (5.12)

Sopt := S(Lopt ) (5.13)

5.1.3 Optimal Number of Nodes and Efficiency

The speedup Sopt requires that SimuBoost can allocate a sufficient number of simulation nodes. Otherwise, the interval production rate exceeds the consumption rate, which invalidates the basic premise of the optimal case. In a naive implementation, every interval is simulated on a different node, that is, N = n. Utilizing more than n nodes does not deliver any additional speedup as these nodes will not receive any work. However, even choosing N = n wastes computational resources because when nodes complete their interval, they idle for the rest of Tps. Instead, it is more efficient to reuse nodes whenever they run idle. Assuming that intervals are created periodically and that the simulation of every interval takes roughly the same time, intervals complete at the same rate they are created – producing uniform steps on both sides of the "stairway". In this case, new nodes are required only until the first simulation finishes, which is Tps(1). Subsequent intervals can be scheduled on previously allocated nodes that ran out of work (see Figure 5.4).

[Figure 5.4: New nodes are required until the first simulation finishes. Subsequent intervals can be scheduled on previously allocated nodes.]

The necessary number of nodes N is hence the number of new intervals created until Tps(1). Since we produce one interval every (tc + L) seconds, we can compute N(L) as follows:

N(L) := ⌊Tps(1) / (tc + L)⌋
      = ⌊(tc + L + ts + (ssim / svm) · L) / (tc + L)⌋
      = ⌊(ts + (ssim / svm) · L) / (tc + L)⌋ + 1        (5.14)

Nopt := N(Lopt ) (5.15)

We can then calculate the efficiency E of our approach by looking at the speedup that can be achieved for the number of nodes used:

E(n(L)) := S(n(L)) / N(L)        (5.16)

Eopt := E(Lopt)        (5.17)

As an example, we apply the model to a Linux kernel build and choose the same parameters as in Figure 5.2. With a slowdown factor ssim ≈ 53, the workload would run 6 h and 25 min with conventional serial simulation. For Lopt ≈ 0.35 s, our model predicts a total parallel simulation time Tps(Lopt) of 8 min and 35 s. This would be a speedup of about Sopt ≈ 44. To achieve this, we would require Nopt = 48 nodes, which would give us a parallelization efficiency of Eopt ≈ 93%.
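The example can be reproduced with a few lines of Python based on the closed-form expressions above; the parameter values are the ones reported for the kernel build in Section 5.1.2, and small deviations from the rounded figures quoted in the text stem from the precision of these parameters.

    from math import floor, sqrt

    # Kernel build parameters as reported in Section 5.1.2.
    T_vm, t_s, t_c, s_sim, s_vm = 435.0, 0.44, 0.012, 53.0, 1.107

    L_opt = sqrt(s_vm**2 * T_vm * t_c / s_sim)               # Equation 5.12
    n = floor(s_vm * T_vm / L_opt)                           # Equation 5.6
    T_sim = n * (s_sim / s_vm) * L_opt                       # Equation 5.3
    T_ps = n * (t_c + L_opt) + t_s + (s_sim / s_vm) * L_opt  # Equation 5.8
    S_opt = T_sim / T_ps                                     # Equation 5.9
    N_opt = floor((t_s + (s_sim / s_vm) * L_opt) / (t_c + L_opt)) + 1  # Eq. 5.14
    E_opt = S_opt / N_opt                                    # Equation 5.16

    print(f"L_opt = {L_opt:.2f} s, T_ps = {T_ps/60:.1f} min, "
          f"S_opt = {S_opt:.1f}, N_opt = {N_opt}, E_opt = {E_opt:.0%}")
    # Prints approximately: L_opt = 0.35 s, T_ps = 8.6 min, S_opt = 44.7,
    # N_opt = 48, E_opt = 93%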

5.2 Constrained Setup

The current model assumes that for every submitted checkpoint there is a free node that can immediately start a simulation. However, in practice, this is often not the case because the hardware setup is too small to provide a sufficient number of nodes. That is:

N < Nopt        (5.18)

The current model is thus a convenient way to estimate the maximum benefit one can expect from SimuBoost, but to be practical, the model requires an extension that allows predicting the benefit (and optimal interval length) for smaller hardware configurations. If the number of nodes is not sufficient to keep up with the production rate of checkpoints, at some point all nodes are busy and new intervals need to be queued for later simulation (see Figure 5.5). As soon as a simulation finishes, the respective node can then fetch a checkpoint from the queue and continue simulation.

[Figure 5.5: If too few simulation nodes exist, the production rate exceeds the consumption rate and checkpoints need to be queued.]

5.2.1 Parallel Simulation Time and Speedup

If we assume that (1) all simulations take roughly the same amount of time and (2) simulations are distributed uniformly among the available nodes, we can infer that the node which simulates the last interval determines the overall parallel simulation time (compare Figure 5.5). In the optimal setup, where N = n, this is always the last node⁴. In a constrained setup, where simulations need to be queued, this is not the case. However, the general idea to calculate Tps by adding the idle and busy times of the node that simulates the last interval is still applicable. So for a constrained setup we have to determine Tidle and Tbusy as before, but for the node that simulates the last interval. With uniform simulations, we can achieve optimal assignment of nodes to intervals by utilizing simple round-robin scheduling. We can thus find the index i of the node that simulates the last interval with ((n − 1) mod N) + 1, where i ∈ [1, n]. This lets us define:

Tidle(N, n) := (((n − 1) mod N) + 1) · (tc + L)        (5.19)

Node i has to simulate ⌈n/N⌉ intervals. The busy time of the node can therefore be computed with:

Tbusy(N, n) := ⌈n/N⌉ · (ts + (1/n) · Tsim(n))        (5.20)

This results in a parallel simulation time of:

Tps(N, n) := Tidle(N, n) + Tbusy(N, n)        (5.21)
           = (((n − 1) mod N) + 1) · (tc + L) + ⌈n/N⌉ · (ts + (1/n) · Tsim(n))        (5.22)

⁴ If only Nopt nodes are deployed, the index of the last node changes, but as no queuing is involved, this does not affect Tps and is mathematically equivalent.

Verification

To verify that Tps(n) is just a special case of the more general Tps(N, n), we derive Equation 5.5 from Equation 5.21 by choosing N = n:

Tps(n, n) = (((n − 1) mod n) + 1) · (tc + L) + ⌈n/n⌉ · (ts + (1/n) · Tsim(n))
          = ((n − 1) + 1) · (tc + L) + ⌈1⌉ · (ts + (1/n) · Tsim(n))
          = n · (tc + L) + (ts + (1/n) · Tsim(n))
          = Tidle(n) + Tbusy(n) = Tps(n)   ∎

However, Tps(Nopt, n) ≠ Tps(n) because of the floor function in Equation 5.15 and due to the fact that Tps(N, n) does not incorporate the idle gaps that may happen between simulations (see Figure 5.4, between i[1] and i[n]). The model should thus slightly underestimate the simulation time for Nopt.

With n(L) := ⌊svm Tvm / L⌋ (Equation 5.6), we can express Tidle, Tbusy, and Tps in terms of the interval length L as before:

Tidle(N, n(L)) := (((⌊svm Tvm / L⌋ − 1) mod N) + 1) · (tc + L)        (5.23)

Tbusy(N, n(L)) := ⌈⌊svm Tvm / L⌋ / N⌉ · (ts + (ssim / svm) · L)        (5.24)

Tps(N, n(L)) := Tidle(N, n(L)) + Tbusy(N, n(L))        (5.25)

Equivalent to the optimal setup, the speedup S is:

S(N, n(L)) := Tsim(n(L)) / Tps(N, n(L))        (5.26)
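As an illustration, the constrained model translates directly into code as well; the following self-contained Python sketch implements Equations 5.19 to 5.26 with our own helper names.

    from math import floor, ceil

    def parallel_sim_time(N, L, T_vm, s_vm, s_sim, t_c, t_s):
        """T_ps(N, n(L)) for a constrained setup, Equations 5.23 to 5.25."""
        n = floor(s_vm * T_vm / L)                         # Equation 5.6
        T_idle = (((n - 1) % N) + 1) * (t_c + L)           # Equation 5.23
        T_busy = ceil(n / N) * (t_s + (s_sim / s_vm) * L)  # Equation 5.24
        return T_idle + T_busy

    def speedup(N, L, T_vm, s_vm, s_sim, t_c, t_s):
        """S(N, n(L)), Equation 5.26."""
        n = floor(s_vm * T_vm / L)
        T_sim = n * (s_sim / s_vm) * L                     # Equation 5.3
        return T_sim / parallel_sim_time(N, L, T_vm, s_vm, s_sim, t_c, t_s)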

Figure 5.6a depicts the estimated speedup for a Linux kernel build given various numbers of simulation nodes and interval lengths. The top-most line S(48, n(L)) closely follows the predicted optimal scenario (compare Figure 5.2), with Nopt being 48 for the kernel build. However, since S(Nopt, n) ≠ S(n), Lopt is slightly different from the calculation with S(n). A noticeable difference to the optimal case is the (inverted) sawtooth-like shape, whose strength varies with the number of nodes and the interval length. In addition to this distinctive feature, the function still exhibits the small non-differentiable steps that we can also observe in the optimal case. Figure 5.7 illustrates how these steps emerge. When the interval length increases, the number of intervals gradually diminishes according to n(L) := ⌊svm Tvm / L⌋ (Equation 5.6). A large step occurs each time the number of intervals falls down to a multiple of the number of nodes.

[Figure 5.6: Estimated speedup S(N, n(L)) over the interval length L for a Linux kernel build and various numbers of nodes (Lopt ≈ 0.29 s for N = 48, 0.44 s for N = 32, 0.94 s for N = 16, 1.82 s for N = 8, and 3.65 s for N = 4). In addition to the non-differentiable steps known from the optimal case, we can observe an overlay with a sawtooth-like function, whose strength depends on the number of nodes and the interval length. The approximation S̃(N, ñ(L)) (in blue) does not reproduce these features.]

[Figure 5.7: Large jumps occur when the number of intervals passes a multiple of the number of nodes. Effects on the start time and length of simulations have been omitted for simplicity.]

In the figure, this is the case when n drops from 7 to 6. The larger the interval length, the larger the jump because at the edge of each large step the first node simulates one interval more than any other node – i.e., an increase in L affects the total simulation time of the first node more than the others. The speedup, in turn, jumps up because Tps experiences a noticeable reduction. Small steps in the speedup result from intermediate drops of n, in the figure, for example, from 6 to 5.

For a small number of nodes, both the small and large steps are less accentuated because each node has to simulate a large number of intervals. The effect of a drop by one interval, even at the edge of a large step, is thus relatively small.

On the other hand, if N is (near) Nopt , the idle phase of the last node roughly corresponds to the simulation time of one interval. In consequence, the additional offset between the completion times of the first and last node (creating the large step) disappears. It is also notable that, according to the model, the speedup becomes considerably less sensitive to the chosen interval length for a decreasing number of simulation nodes.

5.2.2 Optimal Interval Length

Just like in the optimal case, S(N, n(L)) is not differentiable. This prevents us from calculating Lopt for a given N directly. Therefore, we again approximate the speedup equation: we assume that simulations are uniformly distributed among the available nodes and that each node is busy for roughly the same amount of time. We consequently ignore some of the effects we have discussed earlier in order to reduce the complexity of the model. Based on this simplified premise, it is enough to look at the completion time of the last node (just like in the optimal case). We then receive the following set of simpler equations:

T̃idle(N, n) := N · (tc + L)        (5.27)

T̃busy(N, n) := (n/N) · (ts + (1/n) · Tsim(n))        (5.28)

T̃ps(N, n) := T̃idle(N, n) + T̃busy(N, n)        (5.29)

S̃(N, n) := Tsim(n) / T̃ps(N, n)        (5.30)

Figure 5.6 depicts the approximation for the exemplary scenario of N = 32 in blue. S̃(N, ñ(L)) matches S(N, n(L)) in points where n = kN with k ∈ ℕ (i.e., at large steps), and overestimates the speedup in between. However, just like in the

optimal setup, Lopt can be expected to lie in the area of interval lengths where the error is negligible⁵ (compare Figure 5.6a). It is thus sufficient to work with the estimated solution, especially considering the fading sensitivity to suboptimal interval lengths with a decreasing number of nodes. Furthermore, compared to the true physical run-time behavior, the model already makes various assumptions (e.g., equal interval simulation times) that are likely to affect the accuracy much more noticeably.

⁵ If a more precise solution is desired, the approximated Lopt can be used as a starting point to numerically find the (predicted) true Lopt. In this course, it is sufficient to look at the interval lengths around the approximated Lopt where n = kN as these mark local maxima.

We can therefore approximate Lopt by solving ∂/∂L S̃(N, ñ(L)) = 0 for L:

Lopt := √(ts svm Tvm) / N        (5.31)

This enables us to calculate the optimal interval length for a given scenario with a limited number of simulation nodes. While the performance model predicts a parallel simulation time of 8 min and 35 s for a Linux kernel build in the optimal case (i.e., 48 nodes), it estimates that Tps almost doubles to 16 min and 30 s when using half the number of nodes and Lopt ≈ 0.6 s; this corresponds to near linear scaling. Accordingly, the speedup falls down to 23x. The efficiency can be calculated analogously to the optimal case, reaching around 97%. We compare model predictions with actual measurements in Chapter 9.
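As a cross-check of these numbers, the approximated model and Equation 5.31 can be evaluated directly; the sketch below uses the kernel-build parameters from Section 5.1.2 and N = 24 (half of the 48 nodes of the optimal setup) and reproduces the quoted figures up to rounding.

    from math import sqrt

    T_vm, t_s, t_c, s_sim, s_vm = 435.0, 0.44, 0.012, 53.0, 1.107
    N = 24                                  # half of the optimal setup's 48 nodes

    L_opt = sqrt(t_s * s_vm * T_vm) / N     # Equation 5.31, ~0.61 s
    n = s_vm * T_vm / L_opt                 # approximation n~(L), Equation 5.10
    T_ps = N * (t_c + L_opt) + (n / N) * (t_s + (s_sim / s_vm) * L_opt)  # Eq. 5.27-5.29
    S = (s_sim * T_vm) / T_ps               # approximated speedup, Equation 5.30

    print(f"L_opt = {L_opt:.2f} s, T_ps = {T_ps/60:.1f} min, S = {S:.0f}x, E = {S/N:.0%}")
    # Prints approximately: L_opt = 0.61 s, T_ps = 16.5 min, S = 23x, E = 97%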

5.3 Conclusion

According to exemplary calculations, SimuBoost should be able to provide considerable acceleration with a high degree of scalability, while at the same time preserving interactivity and delivering full temporal and spatial completeness. Given a certain interval length, the performance model aims at estimating the parallel simulation time for both (a) the optimal setup, where a sufficient number of simulation nodes is present to achieve the highest possible speedup, and (b) constrained configurations, which supply only a suboptimal number of nodes. In both cases, we use the differentiation of the speedup function to find the optimal interval length. The model requires a set of input parameters of which some describe the particular workload at hand (e.g., its run time), whereas others characterize the performance of the mechanisms employed by SimuBoost with respect to the used hardware. The parameters can be acquired with a limited test run of the envisioned scenario. Compared to the actual behavior on physical hardware, the model is subject to a set of simplifications. The most prominent one is the assumption that all intervals possess the same constant simulation time, irrespective of workload phase and selection of simulation node. The model also does not incorporate factors such as a potential rise in simulation slowdown due to shared CPU caches and limited memory bandwidth when locating multiple simulations on the same physical host (e.g., on SMP systems). Similarly, short interval lengths may reduce the efficiency of DBT code caches in simulators. Nevertheless, to keep the model's complexity as well as the number of input parameters in reasonable bounds, we have decided to abstract from these factors and rather design the model to be appropriate for grasping the general performance characteristics of SimuBoost and to guide in the selection of the right interval length. The final evaluation in Chapter 9 clarifies to what extent the model can stand up to the task.

Chapter 6

Continuous Checkpointing

Checkpointing describes the process of capturing a consistent image of an entity's state at the time the checkpoint is taken. Checkpoints may optionally be persisted on a storage device or transferred over a network to a different system and subsequently be used to either revert an existing entity to the saved state or initialize a new entity. In this work, checkpointing refers to the state of a hardware-assisted virtual machine, i.e., a full system with its guest physical memory, secondary storage, and (virtual) devices such as the VM's vCPU.

We use the term continuous checkpointing to denote periodic checkpointing over the course of an experiment. Figure 6.1 illustrates the process in SimuBoost. While the workload of interest is running in a hardware-assisted virtual machine, SimuBoost snapshots the VM’s state at regular intervals and queues the checkpoints for parallel simulation. This chapter deals with the design and implementation of this continuous checkpointing mechanism, including fast loading.

According to the performance model we developed in the previous chapter, SimuBoost requires a comparably high frequency of one checkpoint every few seconds

[Figure 6.1: The workload runs in a hardware-assisted virtual machine. SimuBoost periodically checkpoints the VM and eventually uses these checkpoints to bootstrap parallel simulations.]

up to subsecond intervals to be effective. The demands on the continuous checkpointing mechanism are therefore high since any costs are paid repeatedly in short succession. In § 2.3, we already elaborated on the metrics that can be consulted to measure checkpointing costs. Of particular interest to the checkpoint creation in SimuBoost are the downtime and the probe effect – i.e., the divergence of the checkpointed execution compared to the same run without SimuBoost, for example, as a result of additional page faults.

The frequent downtimes are problematic in two ways. First, injecting these downtimes changes the conditions under which a workload runs and in turn adds to the probe effect. Second, the periodic suspension becomes a critical factor if the downtime crosses the boundary from which on interactivity gets impaired. This is especially a concern in SimuBoost as restoring interactivity is a primary goal of our work. To avoid the latter, we set a baseline target of at most 100 ms for the downtime, as this value is generally not perceived by humans [178] and does not break a virtual machine's network connectivity [72]. However, to minimize the probe effect, we strive to further reduce the downtime – if this is possible and proportionate to the design complexity.

The probe effect is of great importance to our work because we exactly reproduce the hardware-assisted execution – which is to some degree affected by checkpointing and event recording – via deterministic replay in the simulations. Any conclusions drawn from the simulations may thus be directly influenced by the induced overhead. Consequently, a high probe effect has the potential to compromise the representativeness of experiments and we must take care to keep it low. Defining a tolerable limit, however, is not generally possible as it depends, for one, on the metrics used to quantify the probe effect, and, second, on the question that should be answered by an experiment and what metrics are relevant in that course. Running a workload with SimuBoost might, for example, influence the effective quantum length and scheduling order of a compute-bound process (e.g., due to downtimes). This might, however, be irrelevant if an isolated instruction flow of this process and its kernel interaction should be extracted from the simulation.

Another non-trivial question is how to properly quantify the probe effect in the first place. Measuring it in its most general form, that is, as a divergence in the instruction stream and timing, requires some form of monitoring to be able to detect differences. This, in turn, cannot be done on contemporary computing platforms without inducing a probe effect itself. Alternatively, leveraging deterministic simulation as in Simics seems feasible. We could simulate executing the workload in a virtual machine with and without SimuBoost. However, even with a complex timing model, the significance of results from such examinations (for real hardware) is questionable; especially considering that even small differences in timing can greatly affect the outcome [29]. Nor does this answer the question of the results' relevance for a specific workload.

Nevertheless, to somehow quantify the probe effect, we choose to determine the application-specific performance degradation for the examined workload. Since most of our benchmarks process a fixed input set, we can measure this as an increase in run time. For SPECjbb, we take the average of the reported points over all warehouses as the basis. To illustrate the effect of checkpointing on the network, we additionally run the iperf3 benchmark. While these metrics do not provide any information on the true divergence in the execution flow, they do a good job in giving a general sense of the probe effect and they can be determined with negligible overhead. Furthermore, optimizing the checkpoint mechanism to exhibit a low performance degradation is also likely to reduce the divergence altogether.

In the following sections, we describe the design and implementation of the checkpointing mechanism in SimuBoost, using the downtime and the probe effect as our primary criteria for assessment. Since the deterministic replay relieves us from having to explicitly checkpoint the state of secondary storage (see Chapter 8), we focus on methods to save the guest physical memory – usually being the largest item anyway [189]. Nevertheless, the presented concepts are also applicable to block storage¹.

Section 6.1 elaborates on the benefits of incremental checkpoints and discusses techniques to detect modified guest data – so-called dirty logging. We also demonstrate how the downtime can be very effectively reduced by applying copy-on-write. Section 6.2 illustrates how SimuBoost organizes and asynchronously stores checkpoints on disk for later transfer to simulation nodes. This storage backend is also used for all experiments in Section 6.1. A method to perform fast checkpoint loading in the presence of deterministic execution is presented in Section 6.3. We conclude in Section 6.4.

6.1 Checkpoint Creation

The simplest method to capture a consistent image of the guest system is to suspend the virtual machine, synchronously copy all volatile state, and finally resume the execution – a method called stop-and-copy (SnC) [66]. Figure 6.2 depicts the resulting median downtime for various guest memory sizes, interval lengths, and benchmarks using this method. In each case, saving the state of the virtual devices (e.g., the vCPU) takes around 3 ms (≙ 120 KiB). The remaining downtime originates from saving the guest physical memory. Even for small VMs with 256 MiB of RAM, stop-and-copy touches the limit of 100 ms. Larger VMs exceed the limit, with 8 GiB of RAM inducing a downtime of around 2 s, irrespective of the interval length and workload.

¹ To be able to test checkpointing independently from deterministic replay, we implemented lightweight virtual disk checkpointing in our prototype but disabled it in the following.

[Figure 6.2: The downtime for stop-and-copy checkpointing touches the limit of 100 ms even for small VMs and quickly reaches multiple seconds for common configurations. Despite being highly sensitive to the VM size, the downtime is largely unaffected by the interval length L and type of workload; differences are within the variance.]

As expected, the VM experiences a noticeable performance loss with stop-and-copy checkpointing (see Figure 6.3a), where the checkpointing frequency is a decisive factor for the overall run time. This is because the interval length determines the number of checkpoints and thereby the number of downtimes included. A sole increase in guest physical memory size from 256 MiB to 8 GiB with L = 8 s only leads to a moderate 28% slowdown for a kernel build. However, increasing the checkpoint frequency to one checkpoint per second – which is still three times lower compared to the 300 ms intervals suggested by our model – almost quadruples the run time. As any rise in the downtime proportionally weights more when the number of checkpoints is higher, the run time becomes more sensitive to the guest physical memory size with increasing checkpointing frequency.

Running the iperf3 benchmark alongside a kernel build also reveals that the excessive downtimes clearly manifest in the network connection of the guest, pushing the mean bandwidth down to a third (see Figure 6.3b).

We can thus conclude that stop-and-copy checkpointing is not well suited for continuous checkpointing in general and for SimuBoost in particular. It impairs interactivity (user and network) and incurs a considerable run-time overhead. This is especially the case when the checkpointing frequency and guest physical memory size are high.

[Figure 6.3: (a) With stop-and-copy the kernel build time almost quadruples in the most demanding configuration. (b) The downtimes are clearly visible in the network bandwidth.]

6.1.1 Incremental Checkpointing

Our first goal in improving the checkpointing performance is to bring the downtime below the limit of 100 ms so that we preserve interactivity. As a side-effect, this will in turn also lead to a reduction in run-time overhead. Since saving the state of virtual devices such as the vCPU has proven to be comparably inexpensive, we focus on the guest physical memory only. Nevertheless, all downtimes still include device copy time further on.

Application-level checkpointing, as for instance in JetStream [211], benefits from information about the address space layout and can thus distinguish between indispensable and recoverable data and between used and free memory. This allows saving downtime simply by reducing the amount of data that must be checkpointed. Possible ways to apply this technique to VMs and close the semantic gap between host and guest are to leverage paravirtualization or virtual machine introspection (VMI).

With paravirtualization, the guest actively propagates information to the hypervisor, for example, on recoverable file cache pages as well as free memory pages. Paravirtualization thus presumes error-free and reliable cooperation from the guest, otherwise checkpoints may get corrupted. For SimuBoost, corrupted checkpoints mean that the simulation would prematurely fail, preventing any analysis. This is especially unfavorable if malicious clients should be inspected.

With VMI, the hypervisor uses debug symbols to identify and parse the respective guest kernel data structures without active assistance from the guest. Although this makes VMI more robust against erroneous (and malicious) guests, VMI requires explicit support for a specific guest (e.g., a certain Linux version). However, always having to adapt the checkpointing mechanism – a central component of our research tool – to the guest OS is cumbersome. In addition, excluding pages from checkpoints based on VMI information alone can be problematic. For instance, the guest may erroneously access free pages due to an overflow or use-after-free bug. Without preserving these pages in the original execution, any effects of such bugs cannot be faithfully reproduced in the simulation. A popular way to accomplish an effective but guest-agnostic reduction in downtime is to use pre-copy (see § 2.3.1), the prevailing technique for live VM migration. However, a drawback of pre-copy is that the reduction in downtime comes at the cost of repeated page transfers. From the perspective of checkpointing (rather than migration), this means we buy a reduced downtime by increasing the overall amount of memory checkpointed. This, too, is unfavorable if it can be avoided. In contrast to migration, which is a singular event, we can benefit from the periodicity in continuous checkpointing. Similar to pre-copy, we use the observation that a workload modifies only a limited set of pages within a certain time frame. A checkpoint can thus be reduced to these pages because all other pages can be retrieved from previous checkpoints – a method called incremental checkpointing.

[Figure 6.4: The number of dirty pages increases with the interval length, but remains within 85k pages (≈ 330 MiB) for most workloads and L ≤ 8 s. Only stress-ram, the worst-case benchmark, greatly exceeds this rate. The kernel build proves to be a good indicator for the average case.]

Figure 6.4 illustrates the median number of dirty pages per interval over various interval lengths and workloads². The measurements illustrate three central advantages of incremental checkpointing compared to stop-and-copy:

1. The number of dirty pages per interval is generally considerably smaller than the total number of guest physical pages. Except for stress-ram, which dirties memory with maximum speed, all tested benchmarks remain below 70k pages per second (≈ 270 MiB). This is also a greater saving than can be expected from paravirtualization or VMI.

2. Stop-and-copy has shown a strong dependency between run-time overhead and checkpointing frequency. With incremental checkpointing the number of dirty pages inherently decreases for shorter intervals, proportionally lowering the overhead for high checkpoint frequencies.

3. Each workload exhibits an individual page modification rate, depending on the operations carried out. Checkpointing compute-bound workloads (e.g., povray) or idle phases is thus generally very lightweight.

However, when looking at the four most demanding workloads in Figure 6.5 (i.e., a page modification rate over 20k pages/s), we can see that incremental checkpointing is not able to reliably push the downtime below the 100 ms mark. Whereas only two checkpoints during the initial expansion phase of the kernel build exceed the limit, almost 9% of checkpoints for postmark, and 17% of checkpoints for SPECjbb take too long. For the stress-ram benchmark, none of the checkpoints can be created within the permissible time frame. With 800 ms to 1200 ms the downtime is far too high for a fluent interactive experience.

Figure 6.5 also reveals another drawback of incremental checkpointing. Although it is generally advantageous that the downtime is proportional to the page modification rate because it makes the checkpointing of less demanding workloads more lightweight, this very dependence is also a disadvantage. Over the course of a workload's execution, the downtime may strongly fluctuate, which makes it difficult to predict and use as a constant factor in our performance model. It is thus desirable to (1) further decrease the downtime so that it is within bounds even for demanding workloads and (2) to make the downtime less volatile.

To accomplish the first goal we need to either further reduce the overall work for checkpointing or shift the existing work from the downtime to the run time of the workload – i.e., make the work asynchronous. For the second goal, it is necessary to break the dependency between downtime and workload, ideally, without losing the benefits we get from incremental checkpointing.

² In the following, we use write protection to determine dirty pages. For a comparison of dirty logging techniques, see § 6.1.2. We further look at the 4 GiB guest memory configuration only because differences with incremental checkpointing are (inherently) negligible.

[Figure 6.5: Incremental checkpointing cannot reliably keep the downtime below 100 ms for demanding workloads with more than 20k dirty pages/s. Since the downtime is proportional to the page modification rate, it strongly fluctuates. The regression line depicts the mean downtime.]

The first can be achieved by employing copy-on-write. To this end, the hypervisor uses the downtime solely to apply write protections in the EPT, instead of actually copying data. Write protection is set for all pages that need to be included in the incremental checkpoint – i.e., all pages modified during the previous interval. This allows the hypervisor to take a consistent snapshot of the respective pages because it prevents the guest from making any further changes without first triggering the hypervisor via a page fault. Write access to all other guest pages as well as general read access remains permitted. After setting the write protection, the VM resumes execution. The checkpointing mechanism then asynchronously saves the guest pages and releases the write protection for every page copied. If the VM attempts to write to a protected page, the page fault handler in the hypervisor can copy the respective page out of line and restore write permissions so that the guest can continue its operation.
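The following Python sketch illustrates this control flow in a strongly simplified form: guest memory is modeled as a dictionary, write protection as a set of page numbers, and the copy-on-write page fault as an explicit method call. All names are ours; the sketch is meant to convey the ordering of operations, not the actual QEMU/KVM implementation.

    from threading import Lock, Thread

    class CowCheckpointer:
        """Illustrative model of incremental copy-on-write checkpointing."""

        def __init__(self, guest_memory):
            self.mem = guest_memory          # page number -> page contents
            self.dirty = set(guest_memory)   # the first checkpoint covers all pages
            self.protected = set()           # pages still pending an asynchronous copy
            self.checkpoint = {}             # pages saved for the current interval
            self.lock = Lock()

        def take_checkpoint(self):
            """Executed in the downtime: arm write protection only, copy nothing."""
            with self.lock:
                self.checkpoint = {}
                self.protected = self.dirty  # pages modified during the last interval
                self.dirty = set()           # fresh dirty log for the new interval
                protected, checkpoint = self.protected, self.checkpoint
            Thread(target=self._copy_async, args=(protected, checkpoint)).start()
            return checkpoint                # filled asynchronously

        def _copy_async(self, protected, checkpoint):
            """Background worker: copy still-protected pages, then release protection."""
            for page in list(protected):
                with self.lock:
                    if page in protected:            # not already copied by a CoW fault
                        checkpoint[page] = self.mem[page]
                        protected.discard(page)

        def guest_write(self, page, data):
            """Models the guest write path including the copy-on-write page fault."""
            with self.lock:
                if page in self.protected:           # CoW fault: save old contents first
                    self.checkpoint[page] = self.mem[page]
                    self.protected.discard(page)     # restore write permission
                self.dirty.add(page)                 # dirty logging for the next interval
                self.mem[page] = data

The important property is that a page's checkpoint-time contents are always captured before the guest overwrites them – either by the asynchronous worker or, on a write fault, out of line in the fault path.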

Since the copy operation accounts for most of the work associated with checkpointing, adding copy-on-write considerably reduces the downtime (see Figure 6.6). Although the downtime remains relatively high in the case of stress-ram, for all other benchmarks the downtime is in the range of 2.7 ms to 9.8 ms for all checkpoints and interval lengths. Using copy-on-write thereby also helps with fulfilling the second goal, that is, making the downtime less volatile. Although the downtime still fluctuates over the course of a workload's execution³, the reduced overall downtime leads to a smaller absolute variance. Hence, the individual workload phases – which can be clearly seen in Figure 6.5 – are much less pronounced.

[Figure 6.6: Incremental checkpointing with copy-on-write reliably pushes the downtime below 100 ms, even for stress-ram, with 66 ms being the highest value (median downtimes: stress-ram 59.5 ms, postmark 8.6 ms, specjbb 6.4 ms, kernel build 5.3 ms). For all setups, CoW reduces absolute variance within and across workloads. The regression line depicts the mean downtime. Pr95% [a, b] denotes the empirical 0.025 and 0.975 quantiles.]

Checkpointing Frequency

Since the duration of each interval is defined by the configured interval length L, any asynchronous operations such as copying pages must complete within this bound. Otherwise, the checkpointing solution is not able to sustain the checkpointing frequency and successive checkpoints must be delayed⁴. This defines a practical lower limit for the interval length.

³ With copy-on-write, the downtime depends on the number of write protections to set, which in turn depends on the interval length and the current page modification rate of the workload.
⁴ Alternatively, the VM could be suspended. However, this would prolong the downtime, which negatively affects interactivity and the run-time overhead. Delaying the next checkpoint instead acts as an autoregulation.

[Figure 6.7: The downtime with incremental copy-on-write checkpointing is well below the 100 ms limit for all interval lengths and benchmarks. The solid red line marks the interval length as a hard limit for any asynchronous operations – i.e., the sum of asynchronous copy and data storage (see § 6.2 and § 7.2). Depending on the workload, the asynchronous operations can take too long for L < 130 ms (stress-ram: L < 1.8 s), but otherwise complete timely. All three metrics are proportional to the page modification rate of the workload and the interval length. An exception to this is stress-ram. Since it touches memory so fast, changing the interval length has only little effect on the measurements, but merely decreases variance.]

Figure 6.7 summarizes the required total time for asynchronously copying dirty pages and storing them as a checkpoint on disk using our storage backend⁵. The asynchronous processing time is typically much lower than the interval length. However, with L = 50 ms checkpoints may not complete in a timely fashion for workloads with a page modification rate over 20k pages/s. Only for L ≥ 130 ms does the processing time consistently remain below the interval length. An exception to this is stress-ram, for which the checkpointing frequency can be sustained only for L ≥ 1.8 s. However, this is to be expected because stress-ram modifies memory so quickly that the asynchronous copy time alone must inevitably converge toward the time it takes to capture a full (i.e., not incremental) image of the guest memory. In fact, due to the overhead induced by (1) performing the copy asynchronously (e.g., shared memory bandwidth) and (2) superfluously using the data structures for dirty page identification, the pure copy time is even higher. It exceeds the stop-and-copy downtime for the 4 GiB guest memory configuration (≈ 1260 ms, see Figure 6.2) by around 19%, although stress-ram touches only a 3 GiB buffer.

  stress-ram      1800 ms      postmark       75 ms      specjbb      130 ms
  kernel build      60 ms      sqlite         40 ms      apache        50 ms
  gnupg             50 ms      encode-mp3     25 ms      pybench       25 ms
  povray            30 ms      phpbench       25 ms      idle          20 ms

Table 6.1: Estimated shortest interval length per workload⁶.

We can conclude that from the perspective of downtime and asynchronous processing time, incremental copy-on-write checkpointing is generally well-suited for fast continuous checkpointing and only fails to sustain the configured checkpointing frequency when demanding workloads meet very short intervals. An exception to this is stress-ram. However, we consider this scenario to be primarily of synthetic nature, as our results from the real-world benchmarks confirm.

Run-Time Overhead

To get a complete impression of the performance, we additionally have to account for the run-time overhead imposed on the workload (see Figure 6.8). With stop-and-copy checkpointing, the downtime is the only component affecting the run-time overhead. Furthermore, always copying the entire guest memory makes the run-time overhead insensitive to the properties of the workload executing in the VM – i.e., all workloads experience the same run-time overhead. This is not the case with incremental copy-on-write checkpointing.

⁵ The storage backend performs full data reduction as necessary for network transfer over Gigabit Ethernet. See § 6.2 for a general introduction and § 7.2 for details on compression.
⁶ The data does not account for the decreased number of pages to copy for values below 50 ms. Actual interval lengths may thus be a little shorter.

[Figure 6.8: For most workloads, the run-time overhead with incremental copy-on-write checkpointing is much lower than with stop-and-copy. However, stress-ram exhibits almost three times the overhead.]

Just like the downtime, the run-time overhead depends on the page modification rate of the workload as well as on the interval length. Using incremental copy-on-write thus considerably decreases the run-time overhead for most workloads. Whereas we observe a 2.5x overhead for the kernel build with stop-and-copy and L = 1 s (see Figure 6.3), the overhead falls down to 10% for the same configuration with incremental copy-on-write. Nevertheless, for demanding workloads such as postmark and SPECjbb, the run-time overhead remains on a high level. For stress-ram, it even increases significantly from 2.5x to 6x - 7x. As we take the run-time overhead as a metric to estimate the probe effect, it is desirable to look into ways to further reduce it. With incremental copy-on-write, the run-time overhead can primarily originate from two sources (besides the downtime):

• The guest experiences additional page faults whenever the workload accesses a page that is queued for checkpointing but has not been copied yet (and consequently is still write protected).

• To perform incremental checkpointing in the first place, we have to track page modifications – so-called dirty logging (see § 6.1.2).

[Figure 6.9: The majority of pages is saved concurrently to the guest without causing a copy-on-write page fault. For most workloads, the relative share of CoW cases slightly declines with increasing interval lengths⁷.]

Figure 6.9 illustrates the median share of pages that are copied in the course of copy-on-write page faults compared to the total number of pages saved. In addition to the page modification rate, the most decisive factor for a high share of CoW cases is the rate at which the workload writes to previously modified pages – i.e., the locality of write accesses. If the page modification rate or the locality is particularly high, the workload quickly accesses the same pages right after the VM has been resumed from the downtime. This is where the majority of CoW faults happen [235]. The asynchronous copy is not fast enough to prevent these CoW faults. This can be seen especially well in the case of povray. Although povray has the smallest page modification rate (besides idle) with only 550 pages/s, its high write locality results in a share of CoW faults comparable to SPECjbb, which modifies 100x more pages in the same time. The stress-ram benchmark, on the other hand, possesses a page modification rate high enough to revisit the same pages in short succession, despite having modified its whole 3 GiB buffer in the meantime.

A potential way to save run-time overhead could be to perform the asynchronous copy with multiple threads in parallel so as to reduce the chance that a write leads to a copy-on-write page fault. However, a preliminary estimation [235] of the maximum benefit that can be expected with this method revealed only a small margin of 6% improvement in run time for SPECjbb and L = 100 ms⁸. For larger intervals, the potential gain is negligible. Consequently, although further research into this direction is necessary to properly quantify the share of copy-on-write faults in the total overhead, the results suggest only moderate room for improvement.

⁷ For workloads with high write locality, the majority of CoW faults happen very quickly after resuming the VM. Whereas the number of CoW faults remains similar between interval lengths, the total number of pages increases (see Figure 6.4 on page 106). The relative share of pages saved with CoW thus decreases.
⁸ The measurement used dirty logging via EPT scanning (see § 6.1.2) and omitted the write protection for pages to copy, thereby preventing CoW faults.

This, in turn, indicates that using copy-on-write is a cost-effective way to implement asynchronous checkpointing. In the following, we therefore concentrate on the dirty logging technique to describe methods for run-time overhead reduction.

6.1.2 Dirty Logging Techniques

For solutions based on incremental checkpointing, a fundamental task is to track which pages are modified in a certain time frame. For SimuBoost, tracking always happens at the granularity of individual checkpointing intervals. Since checkpointing is a periodic process in SimuBoost, dirty logging is constantly activated. At interval boundaries – i.e., when actually taking a checkpoint – the gathered information is used to select pages for saving. Afterward, the hypervisor marks all pages clean again, resumes the VM, and tracks page modifications in the new interval. In the following, we take a look at three different approaches that we have evaluated for SimuBoost:

Write Protection   A popular method to detect write accesses to guest pages is to remove the write permissions in the EPT. Subsequent attempts to modify a guest page then trigger the hypervisor, where the page fault handler can mark the page as (to be) modified. Afterward, the write permission to the page can be restored and the VM continues execution. This is the same approach as in copy-on-write, except that the respective page is not copied in the page fault handler. This technique can therefore easily be combined with CoW by just applying write protection to all pages, instead of only the ones to be copied. The page fault handler then decides if the page should be saved. In any case, the page must be marked as dirty. A worthwhile optimization is to set write protection only for those pages that have actually been written in the last interval. For all other pages, the write protection is still intact. Of course, this optimization is not applicable to the first checkpoint. We used write protection for all results presented so far that involve page modification data or incremental checkpointing. Accordingly, we can conclude that write protection offers a very low downtime, but may noticeably increase run-time overhead for demanding workloads. This is because for each interval, every first write to a guest page results in a page fault. This produces a page fault rate equivalent to the page modification rate (compare Figures 6.4 and 6.10a).

Scan An alternative to write protection is to leverage the A/D-bits in the EPT of modern CPUs. Whenever the CPU modifies a guest page, it sets the access and dirty bits in the corresponding hierarchy of PTEs9. Instead of trapping (first) write accesses during the execution of the guest, we can scan the EPT for dirty bits. However, we have to do this synchronously in the downtime to get a consistent dirty log. Scanning therefore trades an increased downtime for a reduction in page fault-induced run-time overhead. To get accurate information for the next interval, all dirty bits must be reset before resuming the VM.

A very important optimization is to use the access bits on the higher levels of the EPT to prune the scan for memory areas that have not been touched. Consequently, only a fraction of the EPT is actually evaluated for each checkpoint (see Figure 6.10b). Just like the dirty bits, the access bits on the higher levels (not on the last level) need to be reset to allow accurate pruning the next time.
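The following sketch illustrates the pruned scan on a simplified two-level structure. The types and the two-level layout are simplifications for illustration (the real EPT has four levels), but the pruning idea is the same.

    /* Scan for dirty bits, skipping subtrees whose access bit is clear.
     * Collected bits are reset so the next interval starts with a clean log. */
    #include <stdbool.h>
    #include <stdint.h>

    #define L1_ENTRIES 512            /* leaf PTEs per page table  */
    #define L2_ENTRIES 512            /* page tables in this model */

    struct leaf_pte { bool dirty; };
    struct dir_entry { bool accessed; struct leaf_pte *table; };

    static void scan_dirty(struct dir_entry *dir, void (*log)(uint64_t gfn))
    {
        for (uint64_t i = 0; i < L2_ENTRIES; i++) {
            if (!dir[i].accessed || !dir[i].table)
                continue;             /* pruning: subtree was never touched */
            dir[i].accessed = false;  /* reset for the next interval        */
            for (uint64_t j = 0; j < L1_ENTRIES; j++) {
                if (dir[i].table[j].dirty) {
                    dir[i].table[j].dirty = false;   /* reset dirty bit     */
                    log(i * L1_ENTRIES + j);         /* record dirty page   */
                }
            }
        }
    }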

Pre-Scan As an extension to scanning, we have devised an asynchronous pre-scan which performs a first traversal of the EPT shortly before the downtime while the guest is still running [235]. Similar to a synchronous scan, the pre-scan collects and resets dirty bits as well as access bits on the higher levels. During and after the pre-scan, the VM continues to access and modify pages. Therefore, the final synchronous scan in the downtime is still required to complete the dirty log. However, this final scan is typically considerably faster than without a pre-scan because most of the dirty bits have already been collected (see Figure 6.10b). Performing multiple pre-scan rounds has proven ineffective and does not provide any additional savings [235]. Instead, pre-scan would likely benefit from dirty bits in higher levels of the EPT in order to avoid needless scans of accessed-only PTEs.

Pre-scan aims at reducing the downtime compared to pure scanning at the cost of additional asynchronous work. In contrast to write protection, the asynchronous operations do not directly affect the execution of the guest (e.g., by page faults). Nevertheless, it potentially creates a run-time overhead by taking away from the available memory bandwidth, polluting shared caches, and requiring TLB evictions.

Figure 6.11 summarizes the measured run-time overhead for all dirty logging configurations. Scan is able to noticeably reduce the overhead compared to write protection by avoiding the majority of page faults (CoW faults remain). The improvement is thus proportional to the page fault rate and particularly large for stress-ram; the real-world benchmarks also benefit by up to 30% for L = 100 ms, and still by up to 24% for L = 1 s. As expected, scan considerably prolongs the downtime (Figure 6.12), almost doubling it for postmark. Whereas the downtime remains uncritical for all real-world benchmarks and interval lengths, it exceeds the limit of 100 ms for stress-ram in every case.

9 On x86, only the last level possesses a dirty bit. A/D-bits are only set on TLB misses.

[Figure 6.10: (a) page faults per second per workload for write protection versus scan and pre-scan; (b) PTEs scanned in the downtime (dirty PTEs, scan, pre-scan) out of 1M total PTEs in the guest; both for L = 1 s.]

Figure 6.10: (a) Dirty logging via write protection is responsible for the majority of page faults. Scan avoids these page faults. (b) After pre-scan only a fraction of dirty PTEs has to be collected in the downtime. In total, at most 7% of the guest’s PTEs in a 4 GiB VM need to be evaluated in the downtime (scan: 9%).

Fortunately, pre-scan is able to push the downtime below the critical mark, with only occasional peaks up to 112 ms. For the other benchmarks, pre-scan reaches the level of write protection with a maximum deviation of 1 ms in the most demanding scenario. At the same time, our evaluation shows on average no significant increase in run-time overhead for pre-scan compared to scan, although in some cases (e.g., the kernel build) a slight increase in run time can be observed. In most cases, pre-scan is thus able to combine the reduced run-time overhead of scan with the low downtime of write protection.

Despite the direct dependence between page table size and guest physical memory size, both scan and pre-scan are insensitive to different RAM configurations (see Figure 6.13a). This is due to the early pruning of untouched memory areas.

Compared to stop-and-copy, incremental checkpointing with copy-on-write and pre-scan considerably improves network throughput (see Figure 6.13b). Nevertheless, the downtimes are still clearly distinguishable – which cannot be completely avoided. However, the median (and mean) bandwidth is the same as for a run without checkpointing (native). This is because right after each downtime the QEMU I/O thread runs in a short burst in order to process all pending packets. This explains the increased peak performance with checkpointing enabled.

We also briefly evaluated Intel Page-Modification Logging (PML) [125] as part of a bachelor's thesis [234]. With PML, the hypervisor configures a buffer of 4 KiB size, which the CPU autonomously fills with the guest-physical addresses to which the VM writes. When the buffer is exhausted, the CPU triggers a VM exit, which the hypervisor can use to extract a (partial) dirty log and clear the buffer. To avoid creating one entry for each write access, the CPU only considers writes to pages whose dirty bit is not yet set. Subsequent accesses to the same page are thus ignored until the corresponding dirty bit is reset in the downtime. PML shows an improvement in the run-time overhead of 9 percentage points over write protection for 100 ms intervals during a kernel build. The improvement falls below 0.5 percentage points for 1 s intervals. The downtime is at the level of write protection.

[Figure 6.11: run-time overhead per workload for write protection, scan, pre-scan, and 4 KiB pages only, for interval lengths from L = 100 ms to 8 s; inset: run-time overhead of stress-ram across interval lengths from 50 ms to 8 s, which cannot sustain the checkpointing frequency below Lmin = 1.80 s.]

Figure 6.11: Scan reduces the run-time overhead in every case. Pre-scan induces no significant change in overhead compared to scanning. The green line marks the overhead with no dirty logging but only 4 KiB pages to back VM memory.

[Figure 6.12: downtime per workload for write protection, scan, and pre-scan at interval lengths from L = 50 ms to 8 s; inset: downtime of stress-ram across interval lengths from 50 ms to 8 s.]

Figure 6.12: Scan noticeably increases the downtime for demanding workloads and crosses the limit of 100 ms for stress-ram. For most workloads, pre-scan pushes the downtime close to the level of write protection.

[Figure 6.13: (a) run-time overhead of scan and pre-scan for a kernel build (L = 1 s) with guest RAM sizes from 256 MiB to 8192 MiB; (b) network bandwidth over run time for pre-scan versus native, with a median of 337.7 MBit/s for both.]

Figure 6.13: (a) Scan and pre-scan are insensitive to the guest physical memory size. (b) Downtimes with pre-scan are much less accentuated in the network bandwidth than with stop-and-copy, but still clearly visible.

6.1.3 Dirty Logging Granularity

A cost factor in dirty logging not discussed so far is the size of the memory pages used to back the guest physical memory. With no dirty logging active, the transparent huge page support in Linux automatically merges consecutive small pages (4 KiB) to large pages (2 MiB). However, when dirty logging is active, KVM breaks all large pages into small pages to get a finer-grained dirty log. Switching to small pages means more frequent and longer traversals of the EPT and less efficient use of the TLB. In Figure 6.11, we can see that for most workloads the change in page size alone is responsible for a large part of the run-time overhead (green line). This becomes especially apparent for L = 8 s, where the overhead of actual checkpointing diminishes.

Not all benchmarks suffer from small pages though. In fact, for sqlite the reduction in run-time overhead even exceeds the additional overhead from checkpointing, providing a net improvement in run time between 5% (L = 200 ms) and 25% (L = 8 s). However, comparing the bare native run time with fixed 4 KiB and 2 MiB page sizes, respectively, reveals that the reduction results from deactivating transparent huge page support rather than from choosing a specific page size (see Table 6.2).

             apache   phpbench   sqlite   stress-ram
    4 KiB    66%      80%        73%      89%
    2 MiB    62%      80%        71%      100%

Table 6.2: Run time with fixed page size compared to transparent huge pages.

This is also the case for apache and phpbench. The only exception is stress-ram, for which 4 KiB pages surprisingly seem to be better suited.

All other benchmarks show similar performance with fixed large pages and transparent huge page support but suffer from 4 KiB pages. We can thus conclude that using fixed 2 MiB pages generally gives the best performance (except for stress-ram). It would consequently be beneficial to preserve large pages even with dirty logging enabled.

A direct consequence, however, is that information on which pages have been modified is only present at the coarse granularity of large pages. As expected, stress-ram is only marginally affected by this because the workload touches a large contiguous buffer. For a kernel build, on the other hand, we have to save up to 6x the amount of memory per checkpoint (see Figure 6.14). This is even worse for most of the workloads we have evaluated (see Table 6.3), with phpbench exhibiting the highest excess of two orders of magnitude. Nevertheless, the workloads showing particularly high excess rates generally possess write working sets of only a few MiBs. In these cases, the absolute amount of memory which must be copied with large pages is therefore still tolerable.

[Figure 6.14: dirty 4 KiB and 2 MiB pages across the upper 2 GiB of the guest physical address space for postmark (4 KiB: 264.09 MiB, 2 MiB: 520.00 MiB, excess 255.91 MiB, 1.97x), specjbb (4 KiB: 239.96 MiB, 2 MiB: 634.00 MiB, excess 394.04 MiB, 2.64x), and kernel build (4 KiB: 70.74 MiB, 2 MiB: 410.00 MiB, excess 339.26 MiB, 5.80x), all for L = 1 s.]

Figure 6.14: Upper 2 GiB of VM memory. Using large pages for tracking page modifications requires up to 6x more memory to be copied. The images show the checkpoints representing the median excess.

              stress-ram    sqlite     gnupg      apache      encode-mp3
    4 KiB     3072          38         17         18          6
    2 MiB     3346          194        184        260         118
    Excess    280 (1.09x)   152 (5x)   85 (11x)   244 (14x)   113 (19x)

              pybench      phpbench    povray      idle
    4 KiB     2.47         0.59        2.16        0.45
    2 MiB     174          58          150         38
    Excess    174 (75x)    57 (106x)   151 (67x)   38 (85x)

Table 6.3: Large pages can increase the amount of memory to be copied by two orders of magnitude. Values have been obtained for L = 1 s and are the respective medians, given in MiB10.

For the more demanding workloads such as a kernel build or postmark, this can become a problem though. We can observe that the excess ratio increases for shorter intervals, where the working set is smaller [235]. For postmark, for example, the excess increases from 2x at L = 1 s to 6.6x at L = 100 ms. In consequence, the minimal interval length rises. We can estimate this for postmark to be around 225 ms compared to 75 ms with 4 KiB pages. This exceeds the optimal interval length of 150 ms calculated by our model. Using large pages can thus potentially impair the speedup.

Another effect of the higher data volume is an increase in run-time overhead. This works against the reduction in run-time overhead we gain from switching to large pages. For SPECjbb, we found that we only start to benefit at intervals of 2 s and longer, where we measured a 4% higher benchmark score [235].

An alternative to using large pages for the entire guest physical memory could be to map them only in areas outside the write working set11 or to otherwise demand a high degree of dirtiness in the respective 2 MiB region so as to reduce the copy excess. Attaining detailed access information in areas backed by large pages would, however, not be possible from the dirty logging itself. A solution could be to feed information from the storage backend back into the hypervisor. As we perform data deduplication (see § 7.2) at the granularity of small pages there, we could detect which parts of a large page have actually been modified. However, such complex solutions are beyond the scope of this work. We therefore keep using small pages (or transparent huge pages) for the remainder of this thesis and leave the efficient integration of large pages as a future research topic.

10 Note that we explicitly calculated the median for the excess – i.e., the measurements presented are not simply derived from the 4 KiB and 2 MiB values by subtraction.
11 Excluding the write working set would require a weak overlap between the read and write working sets, which is strongly dependent on the workload [277]. Otherwise, we would not touch the large pages, which is a prerequisite for benefiting from them.

6.1.4 Design and Implementation

We implemented our checkpointing solution in QEMU/KVM (§ 2.2.5). In this process, we added a new thread which periodically creates checkpoints by repeatedly performing the following sequence of operations:

    0. Initialize checkpointing
    1. Suspend VM                                 \
    2. Save device states (e.g., CPU registers)    | Synchronous
    3. Collect dirty bits                          | (Downtime)
    4. Write protect dirty pages                   |
    5. Resume VM                                  /
    6. Save and unprotect dirty pages             \
    7. Sleep until next checkpoint                 | Asynchronous
    8. Perform Pre-Scan (optional)                /
    9. Finalize checkpointing
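Condensed into code, the checkpointing loop looks roughly as follows. All helper names are hypothetical stand-ins for the corresponding QEMU/KVM operations; this sketches the control flow of steps 0 to 9, not the actual patch.

    #include <stdbool.h>
    #include <stdint.h>

    struct ckpt_config {
        int64_t interval_len_ns;     /* configured interval length L           */
        int64_t prescan_estimate_ns; /* moving-average estimate of the pre-scan */
        bool    prescan;             /* pre-scan enabled?                       */
        bool    stop_requested;      /* set by the stop-cp monitor command      */
    };

    /* Hypothetical helpers, assumed to be provided elsewhere. */
    void init_checkpointing(struct ckpt_config *cfg);
    void finalize_checkpointing(void);
    void suspend_vm(void);
    void resume_vm(void);
    void save_device_states(void);
    void collect_dirty_bits(void);
    void write_protect_dirty_pages(void);
    void save_and_unprotect_dirty_pages(void);
    void perform_prescan(struct ckpt_config *cfg);
    void sleep_ns(int64_t ns);
    int64_t now_ns(void);

    void checkpoint_thread(struct ckpt_config *cfg)
    {
        init_checkpointing(cfg);                    /* 0 */
        while (!cfg->stop_requested) {
            suspend_vm();                           /* 1: downtime begins        */
            save_device_states();                   /* 2                         */
            collect_dirty_bits();                   /* 3: bitmaps and/or EPT scan */
            write_protect_dirty_pages();            /* 4: arm CoW for dirty pages */
            resume_vm();                            /* 5: downtime ends          */

            int64_t async_start = now_ns();
            save_and_unprotect_dirty_pages();       /* 6: asynchronous copy      */

            /* 7: the downtime is deliberately not charged against the interval */
            int64_t sleep = cfg->interval_len_ns - (now_ns() - async_start);
            if (cfg->prescan)
                sleep -= cfg->prescan_estimate_ns;
            if (sleep > 0)
                sleep_ns(sleep);

            if (cfg->prescan)
                perform_prescan(cfg);               /* 8: updates the estimate   */
        }
        finalize_checkpointing();                   /* 9 */
    }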

To control the checkpointing thread we extended the QEMU monitor with commands to start (start-cp) and stop (stop-cp) checkpointing, where the start command accepts various arguments to parametrize the process (e.g., frequency, dirty logging technique, etc.).

In the initialization step (0) the checkpointing thread establishes a connection to the storage backend (§ 6.2) and requests buffer space in memory for the first checkpoint. Afterward, the selected dirty logging technique starts. This includes the allocation of dirty bitmaps (one per memory region) in user and kernel space to track page modifications12. The thread then enters the checkpointing loop.

The next operation is to suspend (1) the virtual machine and wait for the vCPU threads to be kicked out of guest mode (if necessary via inter-processor interrupt) and block in user mode. Suspending the VM also completes all outstanding I/O operations so that the virtual machine is in a consistent state. This is when the downtime begins.

To save the device states (2) we use the existing migration code in QEMU. It serializes the states in a form that can easily be saved, transferred, and restored. However, we also found it to be rather slow, taking on average 3 ms to collect the 120 KiB of device data per checkpoint. Considering a median downtime of less than 10 ms, this reveals optimization potential for future versions.

Before actually being able to save modified memory pages, the checkpointing thread has to collect dirty logging information (3) – i.e., the data on which pages have been modified in the last interval. For write protection, this includes the respective dirty bitmaps in user and kernel space. For scan and pre-scan, the thread additionally scans the EPT for dirty bits13. In every case, gathering dirty information also includes resetting the corresponding bits so as to allow tracking page modifications in the next interval.

12 Dirty bitmaps in QEMU primarily track modifications from virtual devices (e.g., DMA).

We use a dedicated bitmap-like data structure – the copy map – to merge dirty bits from the various sources and control the following asynchronous copy operation. The copy map reserves one byte for every guest memory page (i.e., 1 MiB for 4 GiB of VM RAM). This allows us to store three states per page: (a) clean (do not copy), (b) dirty, (c) currently copying. The latter state synchronizes the asynchronous copy with a concurrent CoW fault to the same page. Every element in the copy map thus also functions as a spinlock. Since CoW faults may be caused by the vCPU threads in kernel space or QEMU’s I/O threads in user space, the copy map is accessible from both modes as shared memory.
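The synchronization between the asynchronous copy and concurrent CoW faults can be sketched as follows, assuming C11 atomics. The names are ours; the real implementation is split between QEMU and KVM and organized per RAM block.

    #include <stdatomic.h>
    #include <stdint.h>

    #define COPY_MAP_PAGES (1u << 20)      /* 1 MiB map covers 4 GiB of RAM */
    enum { CLEAN = 0, DIRTY = 1, COPYING = 2 };

    static _Atomic uint8_t copy_map[COPY_MAP_PAGES];

    /* Whoever wins the DIRTY -> COPYING transition saves the page. */
    static int claim_for_copy(uint64_t gfn)
    {
        uint8_t expected = DIRTY;
        return atomic_compare_exchange_strong(&copy_map[gfn], &expected, COPYING);
    }

    /* Checkpointing thread: asynchronously save all pages still marked dirty. */
    void async_copy(void (*save_page)(uint64_t gfn))
    {
        for (uint64_t gfn = 0; gfn < COPY_MAP_PAGES; gfn++) {
            if (claim_for_copy(gfn)) {
                save_page(gfn);
                atomic_store(&copy_map[gfn], CLEAN);
            }
        }
    }

    /* CoW fault handler: save the page before the guest's write proceeds,
     * or spin if the checkpointing thread is currently copying it. */
    void cow_fault(uint64_t gfn, void (*save_page)(uint64_t gfn))
    {
        if (claim_for_copy(gfn)) {
            save_page(gfn);
            atomic_store(&copy_map[gfn], CLEAN);
        } else {
            while (atomic_load(&copy_map[gfn]) != CLEAN)
                ;  /* wait until the concurrent copy has finished */
        }
        /* the write permission for the page can now be restored */
    }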

Using the guest physical address space as the basis for the copy map is impractical because, just like in a real system, this address space is a sparse composition of various MMIO-, ROM-, and RAM-backed regions. The copy map therefore covers RAM blocks (see Figure 2.18 in § 2.2.5), which form a contiguous address space of all potentially accessible memory. This includes device memory such as VGA RAM but excludes the memory regions that merely map device registers. The content of these areas is preserved by saving the device states in the previous step.

To perform incremental checkpointing, the first checkpoint has to capture the entire system state so that subsequent checkpoints only have to store deltas. For the first checkpoint, the copy map is thus initialized to all ones, indicating modification of all pages.

After identifying which pages need to be copied for the current checkpoint, the checkpointing thread write protects all these pages (4) in the EPT. Since this does not prevent the user-mode QEMU process from writing to its own guest physical memory mapping, we further instrumented the respective write methods in QEMU. This allows us to vector accesses to the CoW fault handler if needed.

It is now safe to resume the VM (5). This ends the downtime. All operations following this point happen asynchronously to the VM execution.

The checkpointing thread continues by iterating through the copy map to find pages that need to be copied. When copying (6), the corresponding entry in the copy map is set accordingly to indicate this operation. If the concurrently executing VM attempts writing to the same page at this very time, the CoW fault spins on the copy map entry until the checkpointing thread has finished saving the page14.

13 Note that KVM writes to the dirty bitmaps even when using scan-based dirty logging whenever a write triggers a page fault (e.g., CoW fault) or the instruction needs to be emulated.

The CoW fault then finds this page to be clean and permits the write attempt without any further interruption. The same applies to pages that have already been copied but whose write protection is still intact. If, however, the VM accesses a dirty page before the checkpointing thread, the CoW fault handler takes care of saving the page and marking it clean in the copy map. The checkpointing thread then simply skips the page. For fast access to the EPT, the checkpointing thread performs all copy operations in kernel space.

Our storage backend (§ 6.2) supplies the buffer space to which all checkpointed data is saved. This happens in segments of 64 MiB. When the buffer is exhausted, the checkpointing thread submits the filled segment and requests an empty one. This briefly interrupts the asynchronous copy and the checkpointing thread returns to user mode to communicate with the storage backend. After receiving new buffer space, the checkpointing thread continues where it left off.

When all dirty pages have been saved and unprotected, the checkpoint is complete and the checkpointing thread sleeps (7) until the next checkpoint – i.e., until the current interval ends. The sleep time is computed by taking the configured interval length and subtracting the total time for all asynchronous operations15. In case checkpointing takes longer than permitted by the interval length, the sleep time becomes negative and we immediately take the next checkpoint. If pre-scan (8) is active, the sleep time is additionally shortened by the estimated pre-scan time. We use an exponential moving average for this estimate. The true interval length thus fluctuates by a few hundred microseconds around the set value.

The (asynchronous) pre-scan basically works the same as the synchronous EPT scan during the downtime. However, due to the fact that the VM is concurrently running and setting A/D-bits in the EPT, collecting and resetting these bits needs to be done more carefully in order to avoid inconsistencies [235]: To be effective in shortening the scan in the downtime, we have to reset the access bits in the higher EPT levels. At the same time, we must not reset an access bit without atomically flushing the TLB. Otherwise, writes to clean pages in the corresponding page table may not be reflected in a newly set access bit and go unnoticed. To mitigate this race, we first collect and reset all access bits in the higher EPT levels. We then flush the TLB and use the collected information to direct the actual scan for dirty pages.

When the user enters the stop-cp command, the checkpointing thread finalizes checkpointing (9) the next time it wakes up. This disables dirty logging, frees the respective data structures, and closes the connection to the storage backend. Eventually, the checkpointing thread terminates.

14 Spinning also ends after reaching a threshold, but this causes the checkpoint to fail.
15 We do not include the downtime because, according to our performance model, the interval length should determine how long the guest may execute between checkpoints. This guarantees that the guest makes progress irrespective of the length of the downtime.

6.2 Checkpoint Storage

SimuBoost uses the checkpoints to bootstrap parallel simulations. Since, depending on the simulation speed and the number of available nodes, it can take some time until a certain interval is actually simulated, we have to persist each checkpoint on disk in order to free up main memory for the next one16. This also allows the researcher to repeat the same recorded run with a different analysis, archive the run, or share it with colleagues. An important factor in the design of the storage solution is the fact that we employ incremental checkpointing, because each checkpoint then depends on the data of previous checkpoints:

[Figure 6.15: three incremental checkpoints, each containing device states and the memory pages (A, B, C) modified in the respective interval.]

Figure 6.15: Incremental checkpoints are not self-contained, but depend on previous checkpoints, creating a dependency chain.

In order to load checkpoint 3, we have to restore A, B, and C pages. As in practice some pages (e.g., kernel text segments) live throughout the entire run time, this can create long dependency chains and make checkpoint loading cumbersome and slow. Furthermore, this complicates the distribution of individual checkpoints to simulation nodes because we also have to identify and send the dependent checkpoints.

We therefore break the dependency chain by separating the data from the checkpoints and instead just retain metadata within each checkpoint to locate all required data (see Figure 6.16). As the device states are self-contained, this is only necessary for the memory pages (and disk sectors, if included). The data is concentrated in a global key-value database, with each page or sector having its own unique identifier. The checkpoints only comprise a data structure – called state map – per block-based device (i.e., RAM and disks) that links every address in the device's address space to its data in the database via the corresponding identifier. For the guest memory snapshot, this is done at the granularity of 4 KiB pages.

16 Taking checkpoints of SPECjbb with L = 1 s would require around 370 GiB of main memory.

[Figure 6.16: three checkpoints, each containing device states and a memory map whose entries reference pages stored in the central key-value database.]

Figure 6.16: Separating the checkpoints from the data breaks the dependency chain. Pages reside in a central database and checkpoints contain references only.

Besides the physical memory, the memory state map also covers all other QEMU RAM blocks such as the video buffer. Whenever SimuBoost creates a new incremental checkpoint, it copies the state maps over to the new checkpoint and updates them according to the captured delta. This gives a complete, self-contained image of referenced pages and makes accessing previous checkpoints superfluous. We store the state maps together with the device states on disk as one file per checkpoint. To load a particular checkpoint, a simulation node now conceptually only has to read the respective checkpoint file from a shared network file system, establish a network connection to the database, and retrieve the referenced data blocks.

In our first prototype, we attempted to use various existing key-value stores. However, we did not find a suitable candidate. While we did not evaluate Redis [12] due to its missing support for swapping between memory and disk, LevelDB [8], MongoDB [10], as well as Kyoto Cabinet [6] proved to be too slow for our purposes [37]. Instead, we found that using an ordinary flat file as "database" gives the best performance. In this design, a simple 64-bit file offset functions as a unique identifier and new data is simply appended to the end of the file. Swapping between disk and memory, caching, and read-ahead are taken over by the file system cache. Although a flat file hardly fulfills the requirements of a database, we keep the term to discern it from the individual checkpoint files.
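The essence of this design can be sketched in a few lines (hypothetical types, error handling reduced to return codes; the actual backend manages the file inside Simutrace, see § 6.2.1):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096

    struct state_map {
        uint64_t *offset;    /* one database offset per guest page frame */
        uint64_t  npages;
    };

    /* Append a modified page to the flat-file database and link it into the
     * checkpoint's state map via its 64-bit file offset. */
    static int store_page(FILE *db, struct state_map *map,
                          uint64_t gfn, const void *data)
    {
        if (fseek(db, 0, SEEK_END) != 0)
            return -1;
        map->offset[gfn] = (uint64_t)ftell(db);    /* unique identifier */
        return fwrite(data, PAGE_SIZE, 1, db) == 1 ? 0 : -1;
    }

    /* Loading only needs the state map: every page is read directly from its
     * recorded offset, independent of earlier checkpoints. */
    static int load_page(FILE *db, const struct state_map *map,
                         uint64_t gfn, void *data)
    {
        if (fseek(db, (long)map->offset[gfn], SEEK_SET) != 0)
            return -1;
        return fread(data, PAGE_SIZE, 1, db) == 1 ? 0 : -1;
    }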

6.2.1 SimuBoost Extension for Simutrace

We integrated the entire checkpoint storage logic into Simutrace [218]. Simutrace uses a client-server architecture, where the server is responsible for processing and storing the data. Designed as a fast append-only database for detailed traces (e.g., memory traces), Simutrace offers a stream-based interface. A client creates a data store and can then register streams for this store by supplying a description of the data type that the stream is to receive.

[Figure 6.17: QEMU/KVM acts as a Simutrace client and talks through the SimuBoost library (SimuBoost backend) to the Simutrace server, which stores the master stream, per-checkpoint memory streams (pages), and opaque device state streams in the checkpoint database and per-checkpoint files.]

Figure 6.17: We use a specialized storage backend and a thin wrapper library to utilize Simutrace for fast asynchronous processing and storage of checkpoints.

Although Simutrace supports variable-sized entries, elements in each stream are generally expected to be of fixed size. Accordingly, a type description comprises the size of each record together with a unique type identifier. In the server, each stream is associated with a type-specific en-/decoder using this identifier. If the designated type is unknown, Simutrace selects a default en-/decoder. When the client writes data to a stream, the corresponding encoder in the server receives it, asynchronously processes it (e.g., compression), and (optionally) persists the data on disk. Transmission to the server is done in segments of 64 MiB – for local connections using shared memory. To make Simutrace more flexible, the actual storage backend defining the en-/decoders as well as the store's on-disk format can be exchanged. The client therefore selects the desired storage backend when creating a new store.

For SimuBoost, we have extended Simutrace with a checkpointing-specific storage backend and a small wrapper library exporting a custom client interface (see Figure 6.17). This way, we can make use of the memory management, asynchronous processing capabilities (see § 7.2), and network connectivity already built into Simutrace. The library offers simple functions for creating or opening a checkpoint as well as writing or reading pages. Internally, our extensions create per checkpoint (1) one stream per block-based device (e.g., memory) with the appropriate data type (e.g., 4 KiB page), and (2) a stream with a variable-sized opaque type for the device states. To later identify which stream belongs to which checkpoint, the first stream in the store – the master stream – receives one entry per checkpoint with index data. Since Simutrace allows a client to seek to a particular element in a stream, we can quickly read the right entry using the checkpoint number as the offset.

When QEMU creates a new incremental checkpoint, it writes only the modified pages to the memory stream. It prepends a small header to each page, which contains its address so that the storage backend is able to discern which pages it received. As transmission between client and server happens in units of 64 MiB, the server receives around 16K pages with each segment. Due to shared memory, this does not incur any additional copy.

The server then immediately responds with a free 64 MiB segment in the existing shared memory buffer and asynchronously processes the 16K pages while QEMU can continue saving pages. When QEMU later reads the memory stream to open the respective checkpoint, the decoder does not just return the incrementally saved pages, but instead provides a complete memory image of the checkpointed VM.
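Conceptually, each record in the memory stream can be pictured like this (the field names are ours, not the actual Simutrace type description):

    #include <stdint.h>

    #define PAGE_SIZE 4096

    struct page_record {
        uint64_t guest_phys_addr;     /* where the page belongs in the VM      */
        uint8_t  data[PAGE_SIZE];     /* page contents at checkpoint time      */
    };
    /* With 64 MiB segments, roughly 64 MiB / sizeof(struct page_record),
     * i.e., about 16K records, are handed to the server per transmission. */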

6.3 Checkpoint Loading

We have extended the monitor in QEMU with a load command (load-cp), which accepts a number to designate the desired checkpoint. QEMU then establishes a connection to the Simutrace server17, reads the respective element in the master stream, and eventually opens the referenced streams for the device states and block-based devices. As the SimuBoost storage backend returns complete images, QEMU only has to apply the retrieved elements (e.g., pages) at the address provided by the respective element's header. Afterward, the virtual machine can be resumed and continues execution at the checkpointed location.

In order to run a checkpoint taken in KVM with the binary translation engine (i.e., simulation), we had to make slight modifications to TCG. Otherwise, the vCPU does not correctly transition from kernel to user mode, entering an endless loop of guest kernel page fault handler invocations. Furthermore, the guest kernel must not call any KVM paravirtualization features (e.g., to indicate spinlocks) because these are not implemented in TCG and would consequently freeze the simulation.

Although technically working, restoring full VM images for the interval simulations is suboptimal. This is because only a fraction of the guest's memory will actually be accessed during the execution of the intervals (see Table 6.4). Restoring all memory thus wastes precious resources:

• With a cold file system cache, loading a full image of a 4 GiB VM can take over 10 s for a Linux kernel build. Even with the entire checkpoint database in the cache, the process still requires on average 2.7 s. For comparison, simulating a 100 ms interval with a 60x slowdown takes 6 s. In this case, we spend at least 30% of the overall time on loading.

• When starting a new VM, QEMU reserves space in its virtual address space to hold the VM's various RAM blocks (e.g., guest physical memory, video buffer, etc.). When we restore a checkpoint, we write to every virtual page comprising these areas in QEMU. This, in turn, forces the OS to allocate host physical memory to back all virtual pages, irrespective of whether the VM actually accesses the page later18.

17 All our commands also take a connection string for the Simutrace server and a checkpoint store name, allowing to load a checkpoint from a remote server.
18 Allocated but untouched pages map to a shared zero-page with CoW, not consuming host RAM.

              stress-ram   postmark   specjbb    kernel build   sqlite
    Pages     788525       72441      87572      32903          23507
    Memory    3080 (75%)   283 (7%)   342 (8%)   129 (3%)       92 (2%)

              apache      encode-mp3   pybench    povray     idle
    Pages     14058       2480         1111       2070       385
    Memory    55 (1.3%)   10 (<1%)     4 (<1%)    8 (<1%)    2 (<1%)

Table 6.4: Working set sizes for 1 s intervals. Memory is given in MiB. Percentage values relate to the full 4 GiB RAM of the test VM.

Fortunately, the parallelization of simulations hides most of the loading time spent on unneeded pages when it comes to the impact on the overall parallel simulation time [225]. Still, this unnecessarily delays progress and wastes computing time, which is better invested in simulations. The increased memory consumption, on the other hand, limits the simulation density, that is, the number of parallel simulations per physical host. A simulation for a VM with 4 GiB RAM, for example, requires a little over 4 GiB of host memory – excluding any memory for analysis. Cutting the memory consumption down to the true working set allows the researcher to employ a smaller cluster or even run the parallel simulation on a single workstation with acceptable performance. This reduces complexity and costs, which are important factors regarding the adoption of a new technology.

In this section, we therefore set out to confine the restored data set to what is actually needed in the interval so as to shorten the loading time and reduce the memory consumption.

6.3.1 Sparse Checkpoints

The basis for being able to perform a reduction is the use of deterministic replay in the simulations. This guarantees that the simulations access only the exact same set of pages. Since all other pages are not touched, they can be considered irrelevant for the interval and be omitted. However, this assumes that the analysis does not access these pages either. Otherwise, missing pages need to be explicitly included or loaded on demand.

A challenge with loading only the pages actually necessary for the simulation is to identify these pages in the first place. Intuitively, this is the working set of the virtual machine over the span of the respective interval. In fact, however, only the read working set is needed. This includes pages that the guest reads, or reads and writes, but excludes pages only written, as these do not affect the instruction flow. In practice, though, it is difficult to exclusively measure the read working set when running a VM with hardware-assisted virtualization. This is because [125]:

(a) EPTs do not allow permitting write access without read access.
(b) The MMU sets both the access bit and the dirty bit for write accesses. This prevents us from discerning if a page has only been written or also read.

The deduced working set must therefore include written-only pages to be measurable with current virtualization technology. Nevertheless, using simulation, we found that for most workloads, the share of exclusively written pages is below 4% of the determined working set [277]. An exception is sqlite, for which the share reaches 22%; but even in such cases, the working set is still small compared to the guest memory size (see Table 6.4). We can thus conclude that, although not delivering the optimal result, measuring the full working set instead of the read working set using method (a), i.e., setting page protections, or (b), i.e., collecting A/D-bits, is sufficient for our intended goal of significantly reducing the loading time and memory consumption.

Tracking the working set using page protections is a viable solution. However, as we already have a mechanism for scanning A/D-bits in place, we decided to use method (b). This way, we could simply extend the (pre-)scan in our incremental checkpointing to also collect access bits. These bits are then stored in a separate bitmap (1 bit per page) so that we are still able to identify the modified pages for checkpointing. Just like the copy map, the access bitmap is shared between kernel space and user space, allowing us to quickly send it to Simutrace for storage. We append the access bitmap to the data stream for device states. The storage backend eventually extracts the access information and saves it in a separate file together with the checkpoint.

When loading a checkpoint, we test for the existence of an access bitmap and, if available, send only the pages marked in the bitmap. We call such reduced images sparse checkpoints because only the relevant areas in the guest memory will be populated. Since we have to measure the working set for a checkpoint over the span of the corresponding interval, the access bitmap is only available at the end of the interval (see Figure 6.18). However, this does not constitute a problem because the dependence on the log of non-deterministic events delays the simulations for one interval anyway.
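With the access bitmap in place, restoring a sparse checkpoint conceptually boils down to a filtered loop (hypothetical helper names):

    #include <stdbool.h>
    #include <stdint.h>

    static inline bool page_accessed(const uint64_t *access_bitmap, uint64_t gfn)
    {
        return access_bitmap[gfn / 64] & (1ULL << (gfn % 64));
    }

    /* Only pages marked in the interval's access bitmap are restored; all
     * other guest pages stay unpopulated and consume no host memory. */
    static void load_sparse(const uint64_t *access_bitmap, uint64_t npages,
                            void (*restore_page)(uint64_t gfn))
    {
        for (uint64_t gfn = 0; gfn < npages; gfn++)
            if (page_accessed(access_bitmap, gfn))
                restore_page(gfn);   /* pages outside the working set are skipped */
    }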

[Figure 6.18: timeline of the VM executing intervals i[k] and i[k+1]; the working set is tracked during interval k, and the access map for checkpoint k becomes available at the end of the interval.]

Figure 6.18: Access information is collected over the span of the interval and added to the checkpoint afterward.

[Figure 6.19: (a) loading time per checkpoint for full checkpoints with cold and hot file cache; (b) loading time per checkpoint for SPECjbb with sparse checkpoints, cold versus hot cache; (c) loading time with sparse checkpoints per workload, broken down into init, VM, devices, memory, and miscellaneous; (d) loading time with sparse checkpoints per workload and interval length (100 ms to 8 s).]

Figure 6.19: (a) Loading time of full checkpoints easily reaches multiple seconds per checkpoint. (b) Loading time for SPECjbb with sparse checkpoints ranges from 250 ms to 1.75 s, depending on program phase (clearly visible) and file cache state. (c) Loading time with sparse checkpointing is consistently shorter than for full checkpoints, with a high dependency on the workload. Misc. summarizes various operations such as opening and closing the checkpoint store. (d) Although loading time increases with growing interval length, dependence on workload and program phase is stronger.

[Figure 6.20: unique set size per workload for sparse checkpointing at L = 100 ms, 1 s, and 8 s.]

Figure 6.20: Sparse checkpoints significantly reduce the memory consumption of the simulations. The unique set size (USS) comprises all pages private to the started instance of QEMU. The dashed line denotes the memory consumption when loading full checkpoints.

In Figure 6.19a, we can see that loading full checkpoints takes at least 2 seconds, even when the file cache already holds substantial parts of the checkpoint database. Although we load the same number of pages for the three most demanding workloads, they exhibit a much larger cold-cache loading time. This is because these benchmarks consume large amounts of guest memory, either by explicitly allocating it (e.g., SPECjbb) or by accessing a lot of files (e.g., kernel build), which fills the guest file cache. In both cases, checkpoints contain many pages with unique contents. As we use deduplication (see § 7.2), these checkpoints access more different (uncached) locations in the checkpoint database19. This also explains the large variance for the kernel build because the guest file cache holds fewer unique pages at the beginning of the compilation than at the end. The loading time consequently rises over the run time.

With sparse checkpoints, the loading time no longer depends on the total number of (unique) pages in the guest RAM at the time of checkpoint creation, but on the number of pages accessed in the respective interval. This has three implications: First, the workload phase is reflected in the loading time. In Figure 6.19b, we can clearly identify the increase in warehouses over the duration of the SPECjbb benchmark. Second, the loading time is significantly lower as much fewer pages need to be restored (see Figure 6.19c). Even for the three most demanding workloads, we save around 85% of the loading time (L = 1 s). For idle, the saving is 95%. For stress-ram, with its 3 GiB buffer, the saving is still 30%.

19 An exception to this is stress-ram, which fills its 3 GiB with identical pages. Its loading time is thus around 3 s.

                  L = 100 ms    L = 1 s     L = 8 s
    stress-ram    -             -           48% (140%)
    postmark      111% (87%)    30% (58%)   12% (21%)
    kernel build  47% (68%)     13% (34%)   6% (22%)
    pybench       15% (10%)     -           -

Table 6.5: The savings in loading time and memory consumption come at the cost of a higher run-time overhead (first value). The second value provides the percentage increase compared to regular checkpointing with pre-scan only. For stress-ram, we omit interval lengths for which we cannot sustain the checkpointing frequency; for pybench, we omit values below 1%.

Looking at Figure 6.19d reveals that this trend can be observed over all interval lengths, with the workload having a stronger influence on the loading time than the interval length20. The third implication is the reduction in memory consumption (see Figure 6.20). Except for stress-ram, every workload can be simulated with less than one-third of the memory required with full checkpoints. The less demanding workloads even fit in around 512 MiB, which is only 13% of the original demand.

A noticeable downside of sparse checkpointing is that it comes at the cost of a higher run-time overhead (see Table 6.5). This is due to the additional time required for collecting the access bits in the EPT (i.e., transferring the information to the bitmap and resetting the bits also on the lowest EPT level), which involves a lot of atomic instructions. Thanks to pre-scan, the effect on the downtime remains negligible.

6.4 Conclusion

Naive checkpointing using stop-and-copy incurs significant costs in terms of downtime and run-time overhead and is thus unsuitable for SimuBoost. We have shown that combining incremental checkpointing with copy-on-write and efficient dirty logging achieves superior performance and is fast enough to be leveraged in SimuBoost. We found that scanning the EPT for dirty bits is faster than setting write protections, although scanning noticeably increases the downtime.

We presented pre-scan, a novel approach to dirty logging, which moves most of the scanning time out of the downtime by performing a first scan asynchronously to the execution of the workload. This way, pre-scan successfully combines the low downtime of write-protection-based dirty logging with the reduced run-time overhead of EPT scanning.

20 The spread between the shortest and the longest interval lengths mirrors the access locality of the workload. If the access locality is low, increasing the interval length has a strong effect on the working set size (e.g., sqlite, compare Figure 6.9 on page 113).

With less than 10 ms for all real-world benchmarks and L ≤ 8 s, the downtime is significantly below the limit of 100 ms. This enables us to provide perceptually fluent interaction even during checkpointing. Although the downtimes are still visible in network connections (which cannot be completely avoided), the mean bandwidth is not negatively affected (L = 1 s). The run-time overhead introduced by checkpointing, however, can become comparably large. We measured up to 90% overhead for SPECjbb, 60% for postmark, and 30% for a kernel build with L = 100 ms. With L = 1 s, the run-time overhead drops to one third (i.e., 10% for a kernel build).

Finally, we described sparse checkpoints as a possible way to reduce the loading time of checkpoints from seconds to a couple of hundred milliseconds as well as the memory consumption of simulations from gigabytes to less than 512 MiB. This way, SimuBoost can be run more efficiently on smaller simulation clusters or even on just a single workstation. However, our evaluation has revealed that this comes at the cost of increased run-time overhead. Employing sparse checkpointing thus remains a tradeoff between hardware requirements and probe effect.

Chapter 7

Checkpoint Distribution

When a checkpoint has been created, the next step in SimuBoost is to schedule the simulation of the respective interval based on the available pool of simulation nodes. This can be done in two ways: fixed or dynamic. With fixed scheduling, each node is a priori assigned a defined set of intervals (i.e., checkpoints). The assumption behind this policy is that we can benefit from this knowledge in such a way that the total simulation time is reduced. Since we cannot accelerate the execution of the simulations itself with this method, vectors for optimization emerge primarily in the initialization phase. The initialization comprises (a) receiving a checkpoint, (b) starting the simulator, and (c) loading the checkpoint. In addition, code caches used in the simulation require some time to heat up, which can be considered part of the initialization phase. A scheduling policy providing benefits must therefore save time in at least one of these areas.

If we assign each node a consecutive set of intervals, we get a seemingly optimal case. When we reach the end of an interval, we can simply proceed with the simulation without even loading a checkpoint. Due to the deterministic replay, the end state of the current interval is equal to the initial state of the next interval. Furthermore, we do not have to start a new instance of the simulator and the code caches are already hot. In fact, however, scheduling a consecutive set of intervals is equivalent to choosing a longer interval length (which is only a matter of configuration). As we can expect to start from an optimal interval length, this will hurt the speedup, which in turn results in a higher overall simulation time. We can thus conclude that a fixed scheduling policy should not assign consecutive intervals to the same node.

A viable alternative is to assign each node intervals according to a round-robin policy, that is, node Nk simulates the intervals i[k + jN] with j ∈ [0, n/N), where n is the number of intervals and N is the number of simulation nodes. This complies with our performance model and minimizes the idle phase at the beginning of the parallel simulation. As each node only simulates every Nth interval, we cannot avoid loading the corresponding checkpoints, and the code caches are likely less hot. In the case that checkpoints must be queued because the simulation cluster is too small (i.e., N < Nopt), we can benefit from the fixed assignment by transferring checkpoints to the right nodes before they are actually needed. In the optimal setup (i.e., N ≥ Nopt), on the other hand, the target node always completes its last interval when a new interval becomes ready. In consequence, we cannot save time by transferring checkpoints ahead of time.
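For illustration, the round-robin assignment can be transcribed directly from the formula above (this is not code from the prototype):

    #include <stdio.h>

    /* Node k out of N nodes simulates the intervals i[k], i[k + N],
     * i[k + 2N], ... up to the total number of intervals n. */
    static void assign_round_robin(unsigned n_intervals, unsigned n_nodes)
    {
        for (unsigned k = 0; k < n_nodes; k++)
            for (unsigned j = 0; k + j * n_nodes < n_intervals; j++)
                printf("node %u simulates interval %u\n", k, k + j * n_nodes);
    }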

With dynamic scheduling, we do not a priori assign intervals to nodes, but instead dynamically select the next free node whenever an interval can be simulated. If we apply the assumption from the performance model that every simulation takes the same amount of time, this is, in fact, equivalent to the fixed round-robin policy. If, however, the simulation time varies, for example, because of different phases in the workload (idle = fast, FPU = slow) or due to a heterogeneous simulation cluster, dynamic scheduling is more flexible. It does not have to wait for a certain node to become free, but can select another one and thereby keep all nodes busy. As a downside, transferring checkpoints beforehand is not easily possible.

Dynamic scheduling can be further divided into undirected and directed scheduling. Whereas undirected scheduling picks a free node randomly, directed scheduling attempts to select a free node for which the costs to simulate a particular interval are smaller than for the other free nodes. The scheduler could, for example, incorporate knowledge about redundant data across checkpoints and choose a node which has already received most of the data from previous checkpoints.

Due to the increased flexibility, we have decided to use dynamic scheduling in SimuBoost, but leave it undirected to keep the design simple. In addition, the distribution design as described in this chapter decouples the selection of a simulation node from the actual checkpoint transfer. This is likely to further reduce any potential benefits that can be gained from directed scheduling.

Since there are sophisticated solutions for dynamic (undirected) job scheduling already available (e.g., SLURM [290]), we do not dive any further into scheduling, but concentrate on how to efficiently get the checkpoint data to the corresponding nodes. In Section 7.1, we contrast two fundamental approaches to transfer checkpoint data. In the following Section 7.2, we present how different compression techniques can be leveraged to enable fast transfer also with commodity network infrastructure such as Gigabit Ethernet. Building on this capability, we describe the actual distribution concept for SimuBoost in Section 7.3.

Just like in Chapter 6, we ignore disk checkpointing in the following as it is not needed in SimuBoost with deterministic replay in place. Nevertheless, the discussed techniques are also applicable to disk sectors. We further omit the replay logs as they are generally small in comparison. See Chapter 8 for details on the log growth rate.

7.1 Pulling versus Pushing

[Figure 7.1: (a) pulling – simulation nodes request checkpoints (get(1)) from the virtualization host, whose link becomes the bottleneck; (b) pushing – the virtualization host pushes new pages into the network of simulation nodes.]

Figure 7.1: (a) The simplest method is to pull checkpoints from the virtualization host when needed. In this case, the link on the virtualization host becomes a bottleneck. (b) An alternative is to let the virtualization host push new data (here exemplary memory pages) into the network where they are distributed among the simulation nodes.

When a node has been selected to simulate a certain interval, it has to somehow receive the corresponding checkpoint. An intuitive approach is to connect to the virtualization host (i.e., the system creating the checkpoints) and load the checkpoint over the network. Our storage backend supports TCP remote connections as part of Simutrace (see § 6.2.1). That means we can just open the checkpoint store remotely and read the respective streams over the network. We thereby pull the (full) checkpoint from the virtualization host to the simulation node.

A major weak point of this solution is, however, its limited scalability. The larger the simulation cluster, the more checkpoints need to be sent over the single link of the virtualization host. This quickly exhausts a Gigabit Ethernet adapter and, as a result, loading a checkpoint can take up to multiple minutes in a commodity network [209]. This is even the case for small setups and is amplified by the fact that checkpoint loading may occur in bursts if the simulation time is mostly uniform (see Figure 7.2a). Although sparse checkpoints or extensive compression1 can to some degree free up network bandwidth, the link to the virtualization host remains a conceptual bottleneck.

For checkpoint creation, we take advantage of the fact that only a fraction of the guest physical memory is actually modified between checkpoints. This drastically reduces the amount of data to be saved. We can pass this saving on to the checkpoint distribution by constantly pushing only the new data into the network, instead of sending entire checkpoints by request (see Figure 7.1b). This way, the load on the link to the virtualization host only depends on the workload's page modification rate and the interval length, but not on the size of the simulation cluster2.

1 The compressibility of complete checkpoints declines with the run time as former zero pages start to contain less compressible data. Loading thus slows down over time [209].

[Figure 7.2: (a) network load over run time for a kernel build (L = 5 s) with pulling, 6 physical hosts, and 2 simulations per host; (b) required bandwidth for uncompressed pushing per workload and interval length (100 ms, 1 s, 8 s), with Gigabit and 10 Gigabit Ethernet marked.]

Figure 7.2: (a) Network load with pulling at the virtualization host. Even a small setup quickly exhausts Gigabit Ethernet. In addition, with uniform simulation times, the load follows a sequence of bursts (data from [209]). (b) Gigabit Ethernet does not provide enough bandwidth for pushing either. In fact, demanding workloads can even exhaust 10 Gigabit Ethernet. The box illustrates the first and third quartiles with the median in the middle. The whiskers represent the 2.5 and 97.5 percentiles, respectively. Outliers are plotted as dots3.

Although this does not fully remove the bottleneck, it increases scalability. This is because we can now distribute the data among the simulation nodes so that the load is evenly spread. This can, for example, be achieved with a distributed network file system. When a simulation node then loads a checkpoint, the file system transparently accesses the storage on other simulation nodes, if necessary.

A fundamental prerequisite for this approach is that the amount of data created per interval does not exceed the bandwidth of the link that connects the virtualization host with the simulation cluster. For Gigabit Ethernet, that means if SimuBoost creates one checkpoint per second, each checkpoint must not be larger than 120 MiB. As we can see in Figure 7.2b, this is not the case, especially for demanding workloads such as postmark or SPECjbb. Even the kernel build, being a moderate benchmark, occasionally creates exceedingly large checkpoints with L = 1 s. For 100 ms intervals, half the benchmarks overload Gigabit Ethernet, with SPECjbb peaking at over 12 gigabit/s.
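As a back-of-the-envelope check of the 120 MiB figure (our own arithmetic, assuming the full nominal Gigabit Ethernet rate is available for checkpoint data and one checkpoint per second):

    \[
      \frac{1\,\mathrm{Gbit/s}}{8\,\mathrm{bit/B}\cdot 2^{20}\,\mathrm{B/MiB}} \approx 119\,\mathrm{MiB/s}
      \quad\Rightarrow\quad
      s_{\mathrm{ckpt}} \leq 119\,\mathrm{MiB} \approx 120\,\mathrm{MiB}
      \quad\text{for } L = 1\,\mathrm{s}.
    \]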

2 Having more simulation nodes generally means using a shorter interval length – i.e., more checkpoints – in order to increase parallelism. However, the number of modified pages decreases exponentially with shorter intervals (see Figure 6.4 on page 106).
3 The repetitive pattern of two high outliers for the more lightweight workloads originates from loading the test framework which starts the benchmarks.

As we intend to run SimuBoost on commodity network infrastructure, we must therefore first reduce the size of the incremental checkpoint data. Only then can we proceed with distributing it in the simulation cluster using pushing.

7.2 Checkpoint Data Reduction

Every checkpoint consists of (a) opaque device states (e.g., CPU registers), (b) the state map describing the relationship between guest physical memory addresses and data locations in the checkpoint database, and (c) the incremental data (i.e., pages) itself, which SimuBoost persists in the database.

This primarily gives us two vectors for reducing the data sent over the network:

1. We can deduplicate identical pages (e.g., zero or copied) so as to avoid sending redundant data into the simulation network.

2. For all non-redundant data, we can apply specialized and/or generic lossless compression techniques.

We have implemented a corresponding data reduction pipeline in our checkpoint storage backend for Simutrace (see Figure 7.3):

In step 1, the incoming memory pages are distributed among a set of worker threads which parallelize the subsequent processing. However, we do not create jobs with fewer than 500 pages to limit the overhead for small checkpoints.

Each thread then starts with the data deduplication stage (see § 7.2.1) in step 2. In this phase, the storage backend filters pages that are already present in the checkpoint database. In addition, redundant pages within the newly arriving ones are deduplicated (this works across threads). Whenever the backend recognizes a known page, the offset of the existing entry in the database is written to the memory state map and the new (identical) page is dropped.

1 compare No 2 Data No 3 Delta 4 Generic Pages Deduplica�on Compression Yes Compression

Yes

5 ckpt Memory SDS ckpt Map Compression ckpt

Device 6 Generic States Compression

Figure 7.3: Overview of the data reduction pipeline in SimuBoost. 140 Checkpoint Distribution

a known page, the offset of the existing entry in the database is written to the memory state map and the new (identical) page is dropped. The pipeline forwards all non-redundant data to the delta compression (see § 7.2.2) in step 3. As described in § 2.3.4, memory pages typically experience only minor changes within each checkpointing interval. In these cases, the compression ratio can notably be increased by compressing the delta instead of the original data. However, this effect generally reverses when the changes are too broad. We thus use a simple heuristic based on the Hamming distance to decide if a page should be replaced with its delta. In any case, the pages are passed on to step 4, the generic compression. Afterward, the storage backend appends the compressed pages to the checkpoint database and updates the offsets in the memory state map accordingly.

We employ LZ4 [68] as a generic compressor. LZ4 is a popular derivative of the LZ77 algorithm with a focus on fast lossless operation. Although the amount of data accumulated over a checkpointing session can quickly reach multiple hundred gigabytes, we favor speed over compression ratio in this phase to compensate for the overhead we introduce with the previous stages. Nevertheless, we found this combination to achieve better compression ratios than simply using a more heavyweight generic compression such as LZMA [9]. This is plausible because stages 1 and 2 reduce redundancy in both the spatial and the temporal dimensions, and also respect the special structure of memory pages, whereas a generic compression only works on the current input data without incorporating any further knowledge. Furthermore, the incremental nature of the checkpoints requires that pages can be decompressed individually. Compression must thus be applied separately to each page. This deprives generic compressors of the ability to deduplicate identical pages.

When all pages have been processed, the storage backend can write the state map into the corresponding checkpoint file. Since the map for a VM with 4 GiB of guest physical memory is around 8 MiB in size, the state map is compressed in step 5 beforehand. We use a custom encoder, called SimuBoost Device State Compressor (SDS), as we found it to provide superior performance compared to LZ4 for this particular data structure (see § 7.2.3).

In step 6, the pipeline sends the opaque device states (≈ 120 KiB) through the LZ4 compression and finally appends the data to the checkpoint file. The device states usually compress down to less than 10 KiB, irrespective of the interval length and workload. This completes the processing and the next checkpoint can be created.
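To make the per-page decision flow of steps 2 to 4 more concrete, the following C++ sketch shows how a single incrementally captured page could pass through deduplication, delta compression, and generic LZ4 compression. It is illustrative only: the helpers is_duplicate(), previous_version(), and hamming_words() are hypothetical stand-ins for the storage backend functionality described above, and only the LZ4 call (LZ4_compress_default from the public LZ4 library) refers to a real API.

    #include <lz4.h>
    #include <array>
    #include <cstdint>

    constexpr size_t kPageSize = 4096;
    using Page = std::array<uint8_t, kPageSize>;

    // Hypothetical stand-ins for the storage backend (declarations only):
    //  is_duplicate()     - fixed-size hash table lookup (step 2, see Section 7.2.1)
    //  previous_version() - last stored contents of the same guest page
    //  hamming_words()    - number of differing 64-bit words (step 3, see Section 7.2.2)
    bool is_duplicate(const Page& page, uint64_t* db_offset);
    const Page* previous_version(uint64_t gfn);
    size_t hamming_words(const Page& a, const Page& b);

    // Processes one captured page; returns the number of bytes to append to the
    // checkpoint database (0 if the page was deduplicated and only the state map
    // needs to be updated with the existing offset).
    size_t reduce_page(uint64_t gfn, const Page& page, char* out, int out_capacity) {
        uint64_t existing_offset;
        if (is_duplicate(page, &existing_offset))        // step 2: deduplication
            return 0;

        Page delta;
        const Page* to_compress = &page;
        if (const Page* ref = previous_version(gfn)) {   // step 3: delta compression
            for (size_t i = 0; i < kPageSize; ++i)
                delta[i] = page[i] ^ (*ref)[i];
            // Heuristic: keep the delta only if at most 25% of the 64-bit words differ.
            if (hamming_words(page, *ref) <= (kPageSize / 8) / 4)
                to_compress = &delta;
        }

        // Step 4: generic LZ4 compression of either the raw page or its delta.
        int n = LZ4_compress_default(reinterpret_cast<const char*>(to_compress->data()),
                                     out, kPageSize, out_capacity);
        return n > 0 ? static_cast<size_t>(n)            // compressed successfully
                     : kPageSize;                        // store uncompressed (copy omitted)
    }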

Compression Ratio

To get a better impression of the individual workload characteristics, we omit the first checkpoint in the evaluation, which captures the base image of the VM before the workload starts. Since we perform a fresh boot each time, the image is identical for all configurations. For our 4 GiB test VM, the compression ratio is 17.3:1, resulting in a final checkpoint size of around 250 MiB. In practice, we can therefore expect a one-time delay of around 2 s, depending on the interval length, before the actual parallel simulation begins.

(a) Compressed pushing (required bandwidth [MiB/s]). (b) Overall compression ratio.

Figure 7.4: (a) With compression, Gigabit Ethernet provides enough bandwidth for most checkpoints (compare Figure 7.2b). (b) The compression ratio over all checkpoints (except the first) is between 4:1 and 39:1, with lightweight workloads and short intervals yielding better results. The black points denote results for L ∈ {200, 400, 800, 2000, 4000} ms.

As illustrated in Figure 7.4a, the presented pipeline sufficiently reduces the required network bandwidth to generally allow pushing new checkpoint data into a simulation network based on Gigabit Ethernet4. With compression enabled, the 120 MiB/s limit is exceeded only for postmark and SPECjbb with L = 100 ms, where 13 s of 60 s (22%) and 38 s of 1741 s (2%) of workload run time, respectively, exceed the limit by at most 30 MiB (i.e., 250 ms of transfer time). For postmark, these are primarily the first seconds of run time, whereas for SPECjbb over 60% of the overload situations occur in the last third of the benchmark, where the page modification rate is the highest. We can thus expect slight delays in the parallel simulation for benchmarks with comparable load and a short optimal interval length. Nonetheless, according to our performance model, both benchmarks possess a higher optimal interval length than 100 ms (postmark: 150 ms, SPECjbb: 2 s). This means that at least for the given workloads, this should not be a problem in practice.

The compression ratio (i.e., uncompressed/compressed) inherently depends on the kind of data generated by the workloads. However, we can observe two general trends (see Figure 7.4b): (1) lightweight workloads in terms of the page modification rate are

4We excluded stress-ram because its synthetic nature (all identical pages) leads to compression ratios beyond 11k:1. This reduces the original data volume of over 3 GiB/s to less than 3 MiB/s.

              postmark      specjbb       kernel build   sqlite
L = 100 ms    50 → 5.5      881 → 134     201 → 29       2.64 → 0.14
L = 1 s       11 → 1.5      381 → 82      49 → 9.4       0.87 → 0.07
L = 8 s       1.6 → 0.31    82 → 16       9.5 → 2.0      0.32 → 0.03

              apache        encode-mp3    povray         idle
L = 100 ms    27 → 1.8      4.2 → 0.28    63 → 3.2       4.9 → 0.15
L = 1 s       3.8 → 0.34    0.69 → 0.11   6.6 → 0.38     0.59 → 0.02
L = 8 s       1.3 → 0.13    0.33 → 0.08   1.2 → 0.09     0.13 → 0.003

Table 7.1: Total data per workload before and after compression in GiB.

likely to exhibit a higher overall compression ratio, and (2) the interval length can be used as an indicator for compressibility, where data from short intervals usually compresses better. The latter is probably rooted in the fact that long intervals can accumulate more changes and thereby increase entropy.

In addition to immediate simulation, checkpoints can be stored on the virtualization host for later use5. This is also required for small setups, where no additional network nodes are involved, and helpful for repeated simulations because it allows freeing up the cluster in the meantime. Other reasons for permanent storage may include archiving regulations or the desire to share a recording with other researchers. The total required disk space per session is thus an equally important metric. This is also the case for pushing if the data is not distributed among the simulation nodes, but stored in full on every node (see § 7.3).

Table 7.1 provides a size comparison of raw and compressed data. Our pipeline significantly reduces the required disk space, even for the more lightweight workloads. The additional data volume for the access bitmap in sparse checkpoints is negligible because the compression ratio is generally above 50:1 using LZ4, with the bitmap encompassing 1 MiB for a 4 GiB VM – i.e., 1 bit per physical page – and shrinking down to less than 20 KiB. Summed over all checkpoints, this is only 0.1% of the total compressed data for SPECjbb and 0.3% for a kernel build (L = 100 ms).

When looking at the effectiveness of the individual compression methods in Figure 7.5, we can see that it strongly varies between the different workloads. Whereas postmark benefits in particular from data deduplication, all other methods are only marginally able to reduce the checkpoint size, accounting for less than 10% of the total compression. This emphasizes the importance of a good mix of schemes to achieve sufficient overall reduction. Nevertheless, data deduplication presents itself as a powerful compression technique and for most workloads delivers an additional data reduction of more than 48% over generic compression (i.e., a cut in half).

5The current prototype requires (additional) local storage to allow for data deduplication and delta compression as no cache for the referenced data exists.

(a), (b): L = 1 s. (a) Reduction in size [%]; (b) share in total compression [%]. Methods: Generic (LZ4), Deduplication, Delta, SDS.

Figure 7.5: (a) The plot shows the additional reduction in size (over generic compression) for each of the methods if all methods to the left are also applied. Generic compression is relative to the uncompressed image. For instance, in the case of idle, SDS can save an additional 67% of total data volume compared to generic compression, if all other methods are in effect. (b) The plot shows each method's share in the overall compression.

Since SDS is specific to the state maps, its share increases with the number of stored maps, i.e., checkpoints, which in turn depends on the workload run time and the interval length (compare Figure B.1). On average we can see the trend that shorter intervals benefit the specialized compression schemes, whereas longer intervals improve the effectiveness of the generic compression.

Note that for the benchmarks encompassing only generic compression, we had to redirect the final (compressed) output to /dev/null because the SSD (380 MB/s) in the test system was unable to write out the data fast enough (especially for postmark, which peaks at 758 MB/s even with L = 1 s).

CPU Usage

Introducing the various checkpointing and data reduction steps does come at the cost of CPU time. Figure 7.6 depicts the number of additional CPU cores required to power checkpointing with compression enabled. On average up to three additional cores running in parallel to the one executing the guest VM are needed for L = 100 ms. This includes overhead for (1) asynchronously checkpointing the VM, (2) compressing the captured data, and (3) writing it onto persistent storage.

(a) Full pipeline. (b) Generic compression only. (y-axis: utilized cores per second.)

Figure 7.6: Powering the presented data reduction pipeline does not consume considerably more CPU time than applying fast LZ4 compression only, despite the significant increase in compression ratio6. The band of peaks around 5 cores represents the checkpoints where the benchmark framework in the VM starts.

The overhead falls notably with longer interval lengths (although not linearly). For 1 s intervals, checkpointing can mostly be handled on one additional core, except for SPECjbb. Comparing the overhead of the presented pipeline with employing solely generic compression reveals that the pipeline incurs comparable costs, despite the significant increase in compression ratio.

7.2.1 Data Deduplication

The deduplication stage removes duplicates (1) between new pages and the ones already stored in the database as well as (2) within the stream of incoming new pages itself. The latter is important in order to recognize redundant content as soon as it first appears and not only in subsequent checkpoints; for example, zero pages, which usually make up the majority of the first checkpoint.

To detect identical pages, we compute a hash for all incoming pages. When we were looking for a suitable hash function, we wanted to balance a uniform distribution, which avoids collisions (i.e., false detection of duplicates), against fast computation. We did not consider cryptographic hash functions such as the SHA family because they generally incur high overhead [187]. Instead, we chose the 64-bit version of Google's non-cryptographic FarmHash [5] as it delivers good distribution characteristics (we have not observed a single collision yet) and very high throughput [208].

6In fact, (b) does not include the overhead of writing the compressed data onto the SSD due to the exceedingly large data rate for postmark and SPECjbb. However, the prototype does not allow disabling actual storage in (a) because data deduplication and delta compression are involved and pages need to be available for reference. Including the overhead where possible (not shown) in (b) reveals that our pipeline even requires less CPU power.

Figure 7.7: For each input key, the fixed-size hash table probes a chain of buckets (here incremental probing) until (a) the key is found, (b) an empty bucket is found, or (c) the maximum number of probes is reached, in which case LFU replacement is used. The buckets map each key to an offset in the page database.

Hashing and thereby also the detection of redundant pages is done at the granularity of 4 KiB pages. Whereas previous work has shown slight improvements in deduplication for sub-page (e.g., 1 KiB) granularity (see § 2.3.3), we decided against it to keep the design simple and the number of hash lookups low.

After computing the page hashes, the backend performs a lookup for each hash in a table of previously seen hashes. In an early prototype, we utilized an unbounded hash table (i.e., C++ std::unordered_map) for this purpose, using the page hash as the key. However, we quickly found it to cause frequent delays of multiple seconds during checkpointing. The delays resulted from the data structure performing a key re-hashing when growing to accommodate the increasing number of entries. We therefore decided to use a fixed-size hash table which caches only a limited set of past hashes (see Figure 7.7):

Let B[i] be a bucket in the fixed-size hash table and i ∈ [0, b), with b being the number of buckets. We then perform a lookup for the key k as follows, with j ∈ [0, l) and l denoting the length of the probing chain for conflict resolution:

    i_j := h(k) mod b                  if j = 0
    i_j := (i_{j-1} + p(j)) mod b      if j > 0

The function h is the key hash function. Since we use the page hash as the key, there is no need to apply any further hashing and we select the identity function (i.e., h(k) = k). If the bucket B[i_0] is not free and does not contain k, we continue probing according to the rule implemented by p. We tested the following variants:

    p(j) := 1      (L)inear Probing
    p(j) := j      (I)ncremental Probing
    p(j) := j²     (Q)uadratic Probing

If we reach an empty cell, we insert the hash at this position and consider the page to contain previously unseen contents. If we have a hit (i.e., a key match), we found a potential duplicate of the new page in the checkpoint database. To rule out a hash collision, we do a 1:1 data comparison with the existing entry. Finally, if all probes remain unsuccessful, we evict one of the visited entries. We employ a least frequently used (LFU) replacement policy by selecting the entry with the lowest number of hits in the chain. While this generally favors old entries, we did not see any notable improvement in hit rates with least recently used (LRU).

Figure 7.8: The fixed-size hash table delivers higher hit rates than an n-way associative cache. However, a large direct mapped cache (n = l = 1) performs best overall. The plots give the hit rate in percent (top) compared to an unbounded hash table, the time in milliseconds (bottom) to process the checkpointed pages of a kernel build (L = 1 s), and the probing rule with the best hit-rate-to-time ratio (right).

With this data structure, we can also model a direct mapped cache and an n-way set associative cache. These work analogously to the described hash table, except that for the direct mapped cache, l is always 1, and for the n-way set associative cache, h computes the index of the first entry in the respective set, l equates to the set size n, and we always use linear probing.
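The following sketch illustrates how such a fixed-size table with bounded probing and LFU eviction can look in code. Class and member names are invented for illustration and do not stem from the actual sources; as described above, a hit only marks a candidate, and the caller still performs a 1:1 comparison of the page contents to rule out hash collisions.

    #include <cstdint>
    #include <vector>

    class FixedSizeHashTable {
    public:
        // b buckets, probing chains of length l; keys are 64-bit page hashes
        // (FarmHash64 in the prototype), so h(k) = k (identity).
        FixedSizeHashTable(size_t b, size_t l) : b_(b), l_(l), tbl_(b) {}

        // Returns true and the stored database offset if the key is already known.
        // Otherwise the key is inserted (possibly evicting the least frequently
        // used entry of the probing chain) and false is returned.
        bool lookup_or_insert(uint64_t key, uint64_t offset, uint64_t* out_offset) {
            size_t i = key % b_;                     // i_0 = h(k) mod b
            size_t victim = i;
            for (size_t j = 0; j < l_; ++j) {
                if (j > 0)
                    i = (i + p(j)) % b_;             // i_j = (i_{j-1} + p(j)) mod b
                Bucket& bk = tbl_[i];
                if (!bk.used) {                      // empty bucket: unseen contents
                    bk = Bucket{key, offset, 1u, true};
                    return false;
                }
                if (bk.key == key) {                 // hit: potential duplicate
                    ++bk.hits;
                    *out_offset = bk.offset;
                    return true;
                }
                if (bk.hits < tbl_[victim].hits)     // remember LFU candidate
                    victim = i;
            }
            tbl_[victim] = Bucket{key, offset, 1u, true};  // all probes failed: evict
            return false;
        }

    private:
        struct Bucket { uint64_t key; uint64_t offset; uint32_t hits; bool used; };
        static size_t p(size_t j) { return j; }      // incremental; 1 = linear, j*j = quadratic
        size_t b_, l_;
        std::vector<Bucket> tbl_;
    };

A large direct mapped configuration then simply corresponds to a large number of buckets with a chain length of 1.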

In Figure 7.8, we compare the hit rates of the various configurations of the fixed-size data structure to an unbounded hash table for a kernel build. We can see that the configuration as n-way set associative cache performs worse than the equivalent configurations as hash table (marked in black). The hash table, in turn, does not provide any relevant improvements in the hit rate for l > 3.

In addition, linear probing (L) usually offers the best hit-rate-to-time ratio. If memory consumption is not a concern, however, the configuration as (larger) direct mapped cache delivers the best performance.

For a kernel build (12.5M pages at L = 1 s) and a hash table with 256k buckets and above, we reach over 93% of the optimal hit rate. The same applies to postmark (4M pages) with a table of 64k-128k entries and SPECjbb (100M pages) with a 2M table7. In all cases, the data structure is fully occupied. This suggests the general rule of thumb that for a 93% hit rate, the data structure should comprise about 1 bucket per 8 checkpointed pages. Since the number of checkpointed pages depends on the workload and is not known upfront, it is difficult to choose a universally optimal size in advance. Nevertheless, the configuration as a 1M direct mapped cache proved to be a good tradeoff between memory consumption, hit rate, and processing time in our tests8. Hence, we set this as default in SimuBoost and use it in all of our benchmarks.

Deduplication Ratio

With the data structure detailed in the last section, we have a strong filter at the entrance of the reduction pipeline. Measuring the median deduplication ratio per checkpoint – i.e., the percentage of pages removed from the input of incrementally captured new pages – shows the highly redundant nature of modified memory pages in periodic checkpoints (see Figure 7.9a). This confirms prior research in the field of memory deduplication (see § 2.3.3); although we determined much lower deduplication for SPECjbb 2005 than the authors of CloudNet [282] (up to 20% versus 60%).

Just like with the evaluation of the overall compression performance, we left out the first checkpoint, which would otherwise heavily distort the results for the more lightweight workloads. Since the first checkpoint contains many zero pages, its deduplication ratio of 87% is exceptionally high. This reduces the 4 GiB guest RAM image to around 530 MiB. However, considering the effectiveness of deduplication over generic compression9, we can infer that a substantial share of identical pages in the subsequent checkpoints are not zero pages, as these would also be very effectively compressed generically. In fact, zero pages account for only 0.03% of the total redundancy for postmark and for 41% and 6% in the case of the kernel build and SPECjbb, respectively. Instead, for these three benchmarks we observe between 150k and 1.4M different page hashes, which mostly (40% – 77%) occur only twice during the execution of the workload (see Figure 7.9b).

7See Figure B.2 on page 214.
8While adaptive caching policies such as ARC [170], which dynamically balances between frequency and recency, could potentially improve performance, we see > 90% of the optimal hit rate as sufficient for our purposes.
9See Figure 7.5a on page 143.

(a) Deduplication ratio [%]. (b) Share in duplication [%] over the degree of redundancy.

Figure 7.9: (a) The deduplication ratio over all checkpoints (except the first) is highly workload specific, with the interval length being no clear indicator. The black points denote results for L ∈ {200, 400, 800, 2000, 4000} ms. (b) The degree of redundancy describes how often certain page content can be observed. Most contents show up only twice.

In a study exploring the semantic origin of incrementally captured (i.e., modified) pages in a kernel build [224], we found that most duplicates stem from anonymous and free memory, but only 8% are located in the guest page cache.

7.2.2 Delta Compression

Whereas data deduplication targets identity, delta compression leverages similarity to achieve data reduction. The goal is to generate a patch with zeros in most positions – using XOR (⊕) – so it can be encoded more efficiently than the original data. On this route, we have to tackle three primary design questions:

Reference Selection Delta compression always depends on a reference against which the delta is computed. This reference (1) must be available during decompression, (2) should produce as many zeros as possible, that is, its degree of similarity should be high, and (3) the process of selecting it should be fast. For the first requirement, we must ensure that the simulation node that eventually loads a checkpoint with a delta compressed page is also in possession of the corresponding reference page. Employing delta compression in a network is thus always challenging as it must only be applied within the set of pages that are local to a certain simulation node. This requires a directed simulation scheduling with a scheduling-aware reference page selection. Otherwise, decompression necessitates supplementary network accesses to retrieve missing reference pages, which is counterproductive for reducing network traffic10. Nevertheless, for local operation as well as for the multicast network distribution proposed in § 7.3, delta compression is applicable and beneficial without having to consider the physical storage location of pages. In the following, we hence assume that all pages are available on all nodes, that is, all pages are potential candidates for reference.

The requirements (2) and (3) are tightly intertwined as finding the optimal reference page from the set of all previously seen pages is a computationally non-trivial task. Difference Engine [107] as well as Gerofi et al. [100] utilize hash tables and specifically designed hash functions which provoke hash collisions for similar page contents. We initially envisioned a similar approach, but eventually dismissed it in favor of a simpler and faster design: Under the assumption that on average pages are only marginally modified between successive checkpoints, we can generally expect the previous version of the same page to be a good candidate. This entirely eliminates the search for a reference page at the cost of a potentially lower compression ratio. Considering that we employ further data reduction mechanisms besides delta compression, we see this as an appropriate tradeoff.

Decision Function Delta compression is only effective if the delta can be encoded more efficiently than the original data. However, if changes are too extensive, the encoding efficiency can quickly shift to the detriment of delta compression. If, for example, a randomly filled page R gets cleared by the guest (i.e., all zeros), the delta will contain the original page contents, as ∆ = R ⊕ 0 = R. Encoding the new (zero) page is thus more efficient than encoding the delta. In consequence, we need a decision function which determines if delta compression should be applied. The optimal decision function computes both the fully compressed version (i.e., LZ4) of the delta and the original page and compares the resulting sizes. While this gives the best compression quality, it also consumes the most time. We therefore opted for a heuristic based on the Hamming distance between the page's current and previous contents as a measure of similarity.

We calculate the Hamming distance by deriving the delta and counting the number of non-zero 64-bit words11. If the number exceeds a fixed threshold, we dismiss the delta and keep the original data. Compared to the optimal decision function, we save the costly compression of the delta or the original data, respectively. Figure 7.10 illustrates that selecting the right threshold is crucial for the quality of the heuristic. With a threshold of at most 25% non-zero words (i.e., 75% similarity) we achieve the best performance and reach between 79% and 100% conformity with the optimal decision function.

10Note that this is not the case for data deduplication because no additional data such as the delta must travel the network.
11Although current x86 CPUs implement fast bit counting with the SSE 4.2 popcnt instruction [125], we found word-level counting to be faster and sufficiently precise.
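A minimal sketch of this heuristic, assuming 4 KiB pages and the 25% threshold mentioned above; the function name and interface are illustrative:

    #include <cstdint>
    #include <cstring>

    constexpr size_t kPageSize  = 4096;
    constexpr size_t kWords     = kPageSize / sizeof(uint64_t);  // 512 words
    constexpr size_t kThreshold = kWords / 4;                    // 25% non-zero words

    // XORs the current and previous page contents, counts differing 64-bit words
    // (the word-level "Hamming distance"), and decides whether the delta should
    // replace the original page. The delta is written to delta_out for reuse.
    bool prefer_delta(const uint8_t* cur, const uint8_t* prev, uint8_t* delta_out) {
        size_t nonzero = 0;
        for (size_t w = 0; w < kWords; ++w) {
            uint64_t a, b;
            std::memcpy(&a, cur  + w * 8, 8);    // byte-wise copies avoid alignment issues
            std::memcpy(&b, prev + w * 8, 8);
            uint64_t d = a ^ b;
            std::memcpy(delta_out + w * 8, &d, 8);
            if (d != 0)
                ++nonzero;
        }
        return nonzero <= kThreshold;            // at least 75% similarity
    }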

(L = 100 ms; x-axis: similarity threshold [%], y-axis: decision quality [%].)

Figure 7.10: Choosing a similarity threshold of 75% – i.e., 25% of the delta are non-zero – gives the best average performance, where decisions match the optimal decision function in between 79% and 100% of cases. The shapes denote the maximum of each curve.

Chain Length Conceptually, there is nothing preventing a reference page from being delta-encoded itself. However, each time SimuBoost delta-encodes a page, it also has to load the corresponding reference page during decompression. This creates additional overhead not only for checkpoint loading, but also for deduplication and delta compression, where SimuBoost accesses (compressed) pages in the database. Although a page cache as in Remus [72, 180] and others [118, 187] could generally relax the situation, it is favorable to avoid long dependency chains in the first place. We therefore limit delta compression to non-reference pages. As detailed above, we only consider the previous contents of the same page as a reference to avoid costly reference identification. In consequence, the limitation of delta compression to non-reference pages leads to a constant alternation between delta and non-delta encoding12, wasting potential for compression. We therefore lock the reference until the decision function rejects the delta compression:

    (a) R Δ R Δ R Δ R Δ R
    (b) R Δ Δ Δ Δ R Δ Δ Δ

Figure 7.11: (a) Alternating between delta and non-delta encoding for a certain PFN. (b) The reference remains fixed until the decision function rejects delta encoding using this reference.

12Assuming the particular page is eligible for delta compression according to the decision function.

Delta Compression Ratio

Evaluating the compression ratio of all pages that passed data deduplication (i.e., they contain new contents) reveals that delta compression can be applied on average in 20% to 80% of cases (see Figure 7.12a). The benchmarks again skip the first checkpoint to better reflect workload characteristics. As expected, short intervals are better suited for delta compression because they accumulate fewer changes. In addition, Figure 7.12b shows that delta compression significantly reduces the storage size of the corresponding pages. This also confirms the effectiveness of our selection heuristic. Somewhat counterintuitively, the compression ratio for delta pages increases with the interval length. However, this is plausible because for longer intervals only those pages with infrequent and small modifications remain, whereas the pages with more extensive modifications and thus less compressible deltas are already filtered out.

(a) Delta application ratio [%]. (b) Page size reduction [%].

Figure 7.12: (a) As expected, short intervals and lightweight workloads provide higher potential for delta compression. The values relate to the remaining pages after deduplication. The black points denote results for L ∈ {200, 400, 800, 2000, 4000} ms. (b) Delta compressed pages are typically at least 50% smaller than before.

7.2.3 Device State Compression

In the deduplication and delta compression stages, the incoming data volume is, except for the first checkpoint, determined by the workload's page modification rate, the interval length, and the actual page contents. It is thus difficult to predict the overall amount of data before actually executing the workload.

Looking solely at the RAM state map, on the other hand, the prediction is much simpler. Although this does not include the miscellaneous other device states such as the CPU registers, even for a small 256 MiB VM the state map consumes over 4x the space of the other device states typically stored by QEMU. This makes the state map the next largest item besides the actual guest memory contents.

We therefore focus on the compression of the RAM state map and resort to generic compression for all other device states as illustrated in the overview of our data reduction pipeline in Figure 7.3 on page 139. However, due to the fact that we also had large sparse devices such as disks13 in mind when designing the in-memory representation of state maps as well as their compression, some details go beyond the requirements of a RAM map.

Let P be the number of guest physical pages. We can then estimate that we generate 8 · P / L bytes per second for storing one 64-bit offset into the checkpoint database per guest physical page per checkpoint. Considering our default VM with 4 GiB of RAM14, we generate over 80 MiB/s with L = 100 ms (8 B · 1,052,704 pages / 0.1 s ≈ 80 MiB/s). This accumulates to 47 GiB for povray (75% of its total data) and over 150 GiB for SPECjbb (17%). With L = 1 s, on the other hand, the data rate is only at 8 MiB/s, which is 7% of a Gigabit Ethernet link. As very large VM configurations with tens of gigabytes of RAM are less likely, we can conclude that in practice an efficient device state compression is primarily important for short checkpointing intervals.

From an encoding perspective, the state map possesses a number of interesting properties:

1. Equal Offsets: Due to data deduplication, pages with the same contents map to the same offset in the checkpoint database. In addition, there are often contiguous ranges of pages with the same contents (e.g., zero pages) that all receive the same offset.

2. Small Offset Deltas: Pages that cannot be deduplicated are compressed and appended to the checkpoint database. The delta between the offsets of two consecutive pages in the state map is thus in most cases (i.e., no deduplication) very small; smaller than 4 KiB (+ metadata)15.

3. Data Alignment: Entries in the checkpoint database are aligned to 16 bytes. The low 4 bits of each offset are thus always 0; except for sparse regions (e.g., in disk maps) that hold the invalid offset 0xFFFFFFFF FFFFFFFF, in the following simply referred to as INV_OFF.

13For disks, checkpoints store only modified sectors in reference to a base image.
14That is, 1052704 pages including device memory regions such as the video buffer.
15This depends on the processing order in the storage backend, which in turn is determined by the page submission – i.e., the order in which the VMM sends pages – and the non-deterministic multiprocessing. Nonetheless, we always submit pages in ascending order of their guest physical address. Furthermore, the backend processes at minimum 500 pages per job, among which the order is not affected by multiprocessing.

We devised a custom compression scheme called SimuBoost Device State Compression (SDS) which leverages these properties for dense encoding of state maps. To save space for sparsely populated state maps, the actual in-memory data structure is a hierarchical two-level table, similar to a page table in virtual memory systems. In the default configuration, the directory table holds 2^12 second-level tables, each covering 2^16 64-bit offsets into the checkpoint database, for a maximum device size of 1 TiB with 4 KiB pages. Just like with conventional page tables, each entry's address in the address space of the corresponding device (e.g., the guest physical memory) can be inferred from its location in the table hierarchy. To simplify the design, the (de-)compression works at the granularity of the second-level tables. That means we seek a space-efficient encoding for an array of 2^16 64-bit offsets, leveraging the domain-specific knowledge presented in the properties (1) to (3). The compressed representations of the individual tables are then simply concatenated to create the final output.
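A minimal sketch of this two-level structure; names and the exact allocation strategy are chosen for illustration and are not taken from the actual implementation:

    #include <array>
    #include <cstdint>
    #include <memory>

    constexpr uint64_t INV_OFF  = ~0ULL;            // invalid offset marking sparse regions
    constexpr size_t   kL2Bits  = 16;               // 2^16 offsets per second-level table
    constexpr size_t   kL2Size  = size_t{1} << kL2Bits;
    constexpr size_t   kDirSize = size_t{1} << 12;  // 2^12 directory entries -> 1 TiB @ 4 KiB

    struct StateMap {
        using Level2 = std::array<uint64_t, kL2Size>;
        std::array<std::unique_ptr<Level2>, kDirSize> dir;

        void set(uint64_t page_index, uint64_t db_offset) {
            auto& l2 = dir[page_index >> kL2Bits];
            if (!l2) {                              // second-level tables are allocated lazily
                l2 = std::make_unique<Level2>();
                l2->fill(INV_OFF);
            }
            (*l2)[page_index & (kL2Size - 1)] = db_offset;
        }

        uint64_t get(uint64_t page_index) const {
            const auto& l2 = dir[page_index >> kL2Bits];
            return l2 ? (*l2)[page_index & (kL2Size - 1)] : INV_OFF;
        }
    };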

For the sake of brevity, we only cover the process of compression. Interested readers may consult the source code for details on decompression.

/simutrace/storageserver/simuboost/SimuBoost1DeviceState.cpp

Compression

A simple way to benefit from properties (1) and (2) is to use delta encoding. Instead of storing an array of absolute offsets, we only save the first offset in each table in absolute form. All following 2^16 − 1 offsets are translated so that they are relative to their respective predecessor in the table. The resulting deltas are in most cases much smaller values, which can be represented with less than 64 bits. An exception to this are deduplicated pages, where the relative offset may still span several tens of gigabytes. The zero page, for example, will probably be part of the first checkpoint and in consequence, it will be located at the beginning of the checkpoint database file. If it is referenced by a state map of a much later checkpoint (i.e., many more pages added to the database), the delta for this entry will be a large negative number. The resulting array of small delta values is thus typically interrupted by sporadic large values jumping back and forth between deduplicated and new pages (and INV_OFF). Conversely, a contiguous range of identical offsets leads to a sequence of deltas with value 0. To efficiently encode such areas, we incorporate run-length encoding (RLE). Figure 7.13 summarizes the central compression loop.

Figure 7.13: Overview of the SDS compression loop. O_i denotes the absolute offset in the current table at position i. ∆_i is the corresponding relative representation. l holds the length of a consecutive run of identical deltas. Put writes a value to the final output.

In the next step, SDS tries to find a compact encoding for each delta. If LZ4 cannot further compress a page, the page's final size in the database including metadata is 4144 bytes. We thus observe that most deltas are below this value. To further compress the representation of these deltas, we employ a dictionary that stores previously seen deltas. We use the fixed-size hash table described in § 7.2.1 for this purpose. A configuration with 64 buckets, incremental probing, and a chain length of 4 proved to be effective. Hits – called matches – can then be encoded by their position in the dictionary using only 6 bits (2^6 = 64). Misses, on the other hand, are directly written to the final output. The same applies to overly large deltas. We refer to both as literals.

Figure 7.14: Overview of the SDS encode function. Small delta values are matched against a dictionary of previously seen delta values. Hits are encoded as their position in the dictionary; misses and large deltas are written as literals.

The last step is to write the matches and literals to the final output. In this process, SDS discards the low 4 bits of each offset, which are always 0 (property (3)). Since the majority of literals are usually smaller than 2^16, we provide separate encodings for short and long literals, consuming 2 bytes and 4 bytes, respectively. Matches are encoded using 1 byte only. In each case, a run-length extension of 2 bytes can be added to express up to 2^16 repetitions. See Figures B.3 and B.4 for more details on the encoding format.
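The following sketch condenses the compression loop of Figures 7.13 and 7.14 into code. The emit helpers use a simplified placeholder output format – the real encoder packs matches into 1 byte and literals into 2 or 4 bytes plus an optional run-length extension (Figures B.3 and B.4) – and the dictionary is reduced to a trivial 64-slot variant; all names are illustrative.

    #include <cstdint>
    #include <vector>

    constexpr int64_t kSmallDelta = 4144;   // maximum size of one page entry incl. metadata

    // Trivial 64-slot dictionary; the prototype uses the fixed-size hash table of
    // Section 7.2.1 with 64 buckets, incremental probing, and a chain length of 4.
    inline int dict_lookup_or_insert(int64_t delta) {
        static int64_t slot[64]; static bool used[64];
        int s = static_cast<int>((static_cast<uint64_t>(delta) * 0x9E3779B97F4A7C15ULL) >> 58);
        if (used[s] && slot[s] == delta) return s;     // match
        slot[s] = delta; used[s] = true;               // miss: remember for later
        return -1;
    }

    inline void put(std::vector<uint8_t>& out, const void* p, size_t n) {
        const uint8_t* b = static_cast<const uint8_t*>(p);
        out.insert(out.end(), b, b + n);
    }
    inline void emit_absolute(std::vector<uint8_t>& out, uint64_t off)           { put(out, &off, 8); }
    inline void emit_match  (std::vector<uint8_t>& out, uint8_t s, uint16_t run) { put(out, &s, 1); put(out, &run, 2); }
    inline void emit_literal(std::vector<uint8_t>& out, int64_t d, uint16_t run) { put(out, &d, 8); put(out, &run, 2); }

    // Delta + run-length + dictionary encoding of one second-level table.
    inline void sds_compress_table(const uint64_t* offsets, size_t n, std::vector<uint8_t>& out) {
        if (n == 0) return;
        emit_absolute(out, offsets[0]);                // the first offset stays absolute

        auto flush = [&](int64_t delta, uint32_t run) {
            if (run == 0) return;
            if (delta > -kSmallDelta && delta < kSmallDelta) {
                int s = dict_lookup_or_insert(delta);
                if (s >= 0) { emit_match(out, static_cast<uint8_t>(s), static_cast<uint16_t>(run)); return; }
            }
            emit_literal(out, delta, static_cast<uint16_t>(run));
        };

        int64_t prev = 0; uint32_t run = 0;
        for (size_t i = 1; i < n; ++i) {
            int64_t delta = static_cast<int64_t>(offsets[i] - offsets[i - 1]);
            if (run > 0 && delta == prev) { ++run; continue; }   // extend run of identical deltas
            flush(prev, run);
            prev = delta; run = 1;
        }
        flush(prev, run);
    }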

SDS Compression Ratio and Time

To get an impression of SDS's performance, we contrast its compression ratio and time with those achieved by generic LZ4. Since a key element of SDS is its delta encoding, we further perform a generic LZ4 compression of the delta (LZ4+Delta). This allows us to better discern which aspect of SDS is responsible for performance differences. This time, we include the first checkpoint in the benchmarks because checkpoints inherit the state maps from previous checkpoints anyway.

(a), (b): L = 1 s. (a) Compressed map size [MiB]; (b) compression time [ms].

Figure 7.15: (a) SDS achieves on average 61% higher reduction in size than LZ4 and still 39% higher than LZ4+Delta. The 8 MiB state maps are compressed down to between 256 KiB and 1 MiB, depending on the workload, which corresponds to compression ratios of 32:1 and 8:1, respectively. (b) At the same time, SDS is on average 21% faster than LZ4. Improvements are especially high for heavyweight workloads with complex state maps.

The results show that SDS delivers substantially higher compression than LZ4 in less time (see Figure 7.15). Only for pybench is the compression with SDS on average slightly slower. The comparison with LZ4+Delta reveals that around half of the improvement in size reduction can be attributed to the delta encoding; the other half stems from the dense encoding. As LZ4 also supports run-length encoding, we do not assume RLE to be a major factor. The results for the compression time demonstrate that the delta encoding does not inherently translate to a faster generic encoding. We can thus conclude that the custom encoding scheme in SDS is responsible for the advantage in compression time.

7.3 Multicast Checkpoint Distribution

With effective data compression in place, we are able to push the incremental checkpoint data into the simulation network. Pushing with undirected scheduling and data distribution implies that the data must be stored on the simulation nodes in some form that allows all nodes access to non-local data. A possible solution is to employ a conventional distributed file system such as CephFS [275]. The file system is mounted into each node's virtual file system (VFS) tree and abstracts the actual storage location from applications.

CephFS, for example, transparently stripes files across the participating nodes and optionally replicates frequently used contents to improve throughput and load balancing within the network. A program accessing data on the distributed file system is not aware of any network transfers happening in the background. For SimuBoost, a distributed file system is thus a convenient gateway to the network. We can simply switch the target directory for the checkpoint database and files from a local device to the distributed file system. On the simulation nodes, the files can then be opened as if they were stored locally.

We tested two distributed file systems, namely CephFS [275] and GlusterFS [43]. However, we encountered two major performance issues [209]: First, when appending to the checkpoint database during checkpointing, we observed sporadic delays of up to 90 s with CephFS and frequent delays of multiple seconds with GlusterFS. We attribute these delays to possible re-balancing operations and general load from simultaneous reads by the simulation nodes, which were concurrently starting simulations. However, further experiments are needed to confirm this presumption.

Second, loading checkpoints even from a fully distributed checkpoint database could take minutes with both file systems. This was especially the case with later checkpoints toward the end of a benchmark. The low performance is rooted in the fact that the structure of the append-only database in conjunction with deduplication and delta compression leads to thousands of small I/O requests scattered throughout the entire database (see Figure 7.16). Since SimuBoost uses memory mapping to access the database, reads are triggered by page faults at the granularity of 4 KiB. Benchmarks for GlusterFS suggest that this access pattern is particularly slow, irrespective of the striping or replication scheme, or the number of nodes16. Although these are preliminary results only and both file systems offer vast options for tuning, we decided against using a distributed file system. We expect the access pattern to be generally unsuited for network access.

Panels: kernel build, L = 1 s, DB size: 9.20 GiB; postmark, L = 1 s, DB size: 1.57 GiB.

Figure 7.16: Heatmaps of the references on the respective checkpoint database when loading the last checkpoint. Accesses are heavily scattered throughout the entire file. Resolution is 1 MiB with counts capped to 1000, limiting otherwise dominating areas such as the one containing the zero page. White areas denote ranges with no accesses.

Looking at previous work that deals with the distribution of checkpoints (i.e., virtual machines) [232], we instead opted for distribution based on multicast (point-to-multipoint). This technique has also been successfully utilized in SnowFlock [146] and VMScatter [69] to quickly spawn or migrate VMs targeting multiple destination hosts. In contrast to unicast, each multicast message is sent to all systems configured for the same recipient group in a single transmission, regardless of the number of nodes. In the case of SimuBoost, we can employ multicast so that all simulation nodes receive all checkpoint data and are subsequently able to load every checkpoint without further network access. In addition, writing the received data to disk preheats each node's file system cache, which notably improves loading times for demanding workloads17. On the downside, all nodes need to be equipped with sufficient disk capacity to store the entire checkpoint database as well as all checkpoint files and supplemental information such as replay logs and bitmaps for sparse loading.

IP natively supports multicast with a designated address range to specify (private) multicast groups (239.0.0.0 to 239.255.255.255). We can integrate 1-to-N multicast distribution into SimuBoost by additionally routing writes of compressed checkpoint data on the virtualization host to a multicast socket. On the other end, a listener application, which runs on all simulation nodes, receives the packets and reconstructs the original files. To discern for which file the received data is intended, we add a small header to each packet that includes the file name. Since multicast is inherently not connection-oriented, the User Datagram Protocol (UDP) is often employed at the transport layer. This has important implications:

Packet Ordering Whereas protocols such as TCP guarantee that packets reach the client application in the same order they were sent, UDP packets may be received in arbitrary order. In consequence, the listener application cannot just append new data to the end of the specified file. We can solve this problem by utilizing sparse files at the destination and adding a file offset to the packet header.

16Eight parallel reads using the dd command on 512 KiB files peak at no more than 10 MB/s [43].
17See Figure 6.19 on page 131.
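A minimal sketch of such a listener on a simulation node; the packet header layout (file name plus absolute file offset), the multicast group, and the port are assumptions for illustration only. The enlarged receive buffer anticipates the tuning mentioned at the end of this section.

    #include <arpa/inet.h>
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstring>

    struct PacketHeader {          // hypothetical header prepended by the sender
        char     file[64];         // null-terminated target file name
        uint64_t offset;           // absolute offset of the payload in that file
    };

    int main() {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5000);                         // example port
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

        ip_mreq mreq{};                                      // join the multicast group
        mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");  // example group address
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        int rcvbuf = 20 * 1024 * 1024;                       // enlarged receive buffer
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

        char buf[65536];
        for (;;) {
            ssize_t n = recv(sock, buf, sizeof(buf), 0);
            if (n <= static_cast<ssize_t>(sizeof(PacketHeader)))
                continue;

            PacketHeader hdr;
            std::memcpy(&hdr, buf, sizeof(hdr));
            hdr.file[sizeof(hdr.file) - 1] = '\0';
            int fd = open(hdr.file, O_WRONLY | O_CREAT, 0644);
            if (fd < 0)
                continue;
            // pwrite at the transmitted offset; regions never received stay holes
            pwrite(fd, buf + sizeof(hdr), n - sizeof(hdr), hdr.offset);
            close(fd);               // a real listener would cache open descriptors
        }
    }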

Figure 7.17: With multicast, all simulation nodes receive all checkpoint data and store it locally. If packets get lost, loading an affected checkpoint fails and a lazy repair is done by reading the missing parts from the healthy copy on the virtualization host using a reliable SMB connection.

Reliability UDP does not establish a back channel for informing the sender about successful packet delivery. Packet loss therefore goes undetected. This may happen when the receiver cannot keep up with the data rate and buffers in switches, network adapters, or the operating system overflow. Likewise, physical defects such as broken cables or loose connections can lead to packet loss.

Dealing with unreliability in multicast has a long history in research and many specialized protocols have been presented [33]. A simple technique is to inform the sender of successful transmission using acknowledgments (ACKs) or to signal packet loss with negative acknowledgments (NACKs). While using such a protocol is a viable solution, we found that for SimuBoost strong reliability in multicast is not needed. As each node only loads a limited set of (different) checkpoints, we can tolerate data loss as long as it does not affect the checkpoints of interest. Whereas a reliable multicast protocol would immediately initiate retransmission, it is sufficient for SimuBoost to lazily detect "holes" in the created sparse files when actually attempting access18. The corresponding node can then fetch the missing parts from the original copy on the virtualization host using a reliable file sharing protocol such as Server Message Block (SMB). As we do not have to expect high packet loss in a restricted research network, repairs should be a rare occurrence. Furthermore, due to the simulation slowdown, this might be after the original execution on the virtualization host ended and multicast traffic completed (see Figure 7.17).

Besides packet loss, data corruption is a potential concern. However, UDP includes a 16-bit checksum, and corrupted packets are automatically dropped. In the improbable event of a checksum collision (in a stable research network), we can in turn expect checkpoint loading to fail because most data is compressed. In both cases, a repair can be attempted using the SMB file share.

18This can be done using lseek with SEEK_HOLE.
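Following the footnote, a sketch of how a simulation node could lazily detect missing ranges before loading a checkpoint file (Linux-specific SEEK_HOLE semantics; error handling reduced to a minimum):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // for SEEK_HOLE on Linux
    #endif
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Returns true if the sparse file contains a hole before its end, i.e., parts
    // of the multicast transmission were lost and must be re-fetched over SMB.
    bool file_has_holes(const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return true;                           // treat a missing file as incomplete

        struct stat st;
        fstat(fd, &st);
        off_t hole = lseek(fd, 0, SEEK_HOLE);      // first hole at or after offset 0
        close(fd);

        // A fully populated file only has the implicit hole at its end.
        return hole >= 0 && hole < st.st_size;
    }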

We implemented the described multicast distribution approach in an early prototype and preliminary results confirm its viability [209]. To maintain a tolerable packet loss (less than 1.5% for a kernel build), we had to increase the listener's socket receive buffer from 208 KiB to 20 MiB19. The experiments further showed that lazy repair over a secondary SMB connection is an efficient solution that creates only negligible extra traffic. As a result, the loading times with multicast are comparable to the ones when loading checkpoints with a hot cache locally on the virtualization host. Since we have not yet integrated multicast distribution into the current SimuBoost version, we do not present any further results, but instead refer the reader to [209]. We use an alternative method based on pre-distribution in Chapter 9 to nonetheless evaluate SimuBoost.

7.4 Conclusion

In this chapter, we presented how SimuBoost can distribute checkpoints and accompanying data in a cluster of simulation nodes using contemporary network technology such as Gigabit Ethernet. Whereas pulling entire checkpoints from the virtualization host creates a bottleneck and greatly limits scalability, incrementally pushing new data into the network is a viable solution. A key component in this effort is the demonstrated data reduction pipeline, implementing a combination of data deduplication, delta, and generic compression as well as custom device state compression. It reduces the required network bandwidth by a factor of up to 39, enabling pushing for intervals as short as 100 ms, even for demanding workloads. Compared to pure generic compression, the specialized pipeline achieves significantly higher compression at comparable computational overhead.

Although we do not generally rule out pushing based on distributed file systems, first experiments remained disappointing, with long delays during checkpoint storage and loading. We therefore opted for a distribution built around IP multicast, and preliminary results confirm previous work that recommends this technique for point-to-multipoint transfer of virtual machines. With multicast, all simulation nodes receive all checkpoint data and are thus able to load checkpoints without having to retrieve substantial amounts of data from other nodes first. We discussed various technical challenges bound to the unreliability of multicast transmissions and proposed a lazy repair mechanism based on secondary (reliable) SMB connections. This solution already proved to be effective and efficient in first experiments.

19 The simulation nodes ran a single-threaded multicast listener with synchronous I/O. As the built-in SSDs provided a write rate above Gigabit Ethernet, it is likely that the single thread created a receiver-side bottleneck.

Chapter 8

Heterogeneous Deterministic Replay

As detailed in § 4.2.1, running simulations based on continuous checkpoints is not sufficient to exactly reproduce the execution from the hardware-assisted virtual machine in the (parallel) simulation. This is because the checkpoints miss all non-determinism that affects the instruction flow within the intervals between two consecutive checkpoints. This includes input from users or remote devices, but also less notable factors such as the precise timing of interrupts and even indirect effects from instructions executed outside the virtual machine.

The common approach to capture this non-determinism is deterministic replay, where the hypervisor records enough information during the original run so that it is able to accurately replay the execution. Accurate replay is difficult when the execution environment changes, which is the case when SimuBoost moves from recording in a hardware-assisted virtual machine to a replay in a purely software-based functional simulation. This change considerably increases the level of complexity because it requires that, for all deterministic operations, the software-based simulation behaves functionally equivalently to the previously used hardware platform. Such a heterogeneous replay system thus has to master two primary challenges: (1) accurately inject recorded non-determinism and (2) guarantee identical outcomes for all otherwise deterministic operations. The latter requires us to refine the simulation and, in some circumstances, extend the recording logs beyond what is usually sufficient for homogeneous replay systems.

For SimuBoost, we intend to record the guest system at the machine level, thereby covering the full system including all privileged operating system components. Although this at first seems to be a waste of computational resources when only a particular context or kernel module is of interest, the evaluation of V2E [287] showed that discerning between recording and non-recording realms comes at a significant run-time overhead (5x to 17x slowdown) during recording.

Even the log growth rate in V2E is comparatively high because realm switches often trigger the copy of entire memory pages. When the operating system is to be included (as in our case), it is thus not only more flexible but also more lightweight to record the full system.

In research, but also in the commercial world, heterogeneous full system solutions are rare. In fact, we are aware of only four projects, of which three share the same (closed-source) codebase and have been developed by VMware for their series of proprietary virtualization products (see § 2.4.2). This leaves us with V2E as the only publication with a research focus. Unfortunately, the source code is not publicly available either.

In this chapter, we therefore provide an insight into some of the technical challenges we have faced when building a heterogeneous full system deterministic replay. We start with an overview of the general architecture of our replay system in Section 8.1. This includes an evaluation of the system's run-time overhead, the log growth rate, and log compressibility. In Section 8.2, we present critical points at which we had to adapt the simulation to precisely mimic the behavior of the hardware platform used for virtualization. We conclude in Section 8.3.

We integrated SimuBoost into QEMU/KVM (§ 2.2.5) with a focus on the x86 architecture – i.e., we record the hardware-assisted execution in KVM and replay the events with the TCG binary translator in QEMU. The following explanations are consequently specific to this platform. Conceptually, SimuBoost can be ported to other architectures, virtualization products, and simulators as long as a fast hardware-assisted VM is available. Whereas checkpointing and data distribution are mostly architecture-independent, the recording and replay are tightly linked to low-level (functional) hardware behavior. To explore the potential for other platforms, we have developed a heterogeneous replay system for the ARMv7 architecture [262]. We also include some findings from this work.

8.1 General Architecture

To control recording in KVM and replay in TCG, we added commands to the QEMU monitor that start (start-rr) and stop (stop-rr) the deterministic replay facility, where the start command accepts the mode of operation – i.e., record or replay. Deterministic replay and checkpointing must be synchronized so that on a checkpoint the log is pushed into the simulation network. We therefore added an explicit flush of the event log when SimuBoost creates a checkpoint1. Moreover, each checkpoint remembers the current offset in the log so that after loading a checkpoint in the simulation the replay starts at the right position. In order to recognize an interval end, each checkpoint further adds a special marker event to the log.

1Otherwise, Simutrace receives data only in chunks of 64 MiB, which is the default size of one segment in the shared-memory buffer between client and server.

According to § 2.4, we have to deal with three types of events: synchronous events, asynchronous events, and compound events. In practice, we can generalize this into having synchronous and asynchronous events that may or may not carry additional payload. On top of this, asynchronous events must always possess a landmark which specifies their precise location in the instruction stream. Nonetheless, adding a landmark also to synchronous events helps to detect divergences in the replay. We thus use the following basic structure for all events, where Type specifies the exact event type such as Read APIC and Length provides the size of the payload:

Landmark Type Length Payload …

Figure 8.1: Basic event structure.
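To make the layout concrete, the following is a minimal C sketch of such an event record. The field widths and the exact landmark layout are illustrative assumptions for this sketch, not the on-disk format used by the prototype.

```c
#include <stdint.h>

/* Illustrative landmark: retired-instruction count plus a register snapshot
 * (see Section 8.1.1). Field widths are assumptions for this sketch. */
struct rr_landmark {
    uint64_t icount;     /* retired instructions since interval start */
    uint64_t rip;        /* instruction pointer */
    uint64_t rflags;     /* flags register */
    uint64_t gpr[16];    /* RAX..R15 general-purpose registers */
};

/* Basic event structure from Figure 8.1: Landmark | Type | Length | Payload. */
struct rr_event {
    struct rr_landmark landmark;
    uint16_t type;       /* e.g., RDTSC read, interrupt, DMA write */
    uint32_t length;     /* size of the payload in bytes */
    uint8_t  payload[];  /* optional event-specific data */
};
```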

Despite their common structure, storing synchronous and asynchronous events in the same log is cumbersome. Consider the scenario in Figure 8.2, where we want to replay the execution of a waiting loop. In the recording phase, we captured numerous reads of the CPU's time stamp counter (RDTSC), one at the beginning and one for each loop iteration. During the second iteration, an interrupt (INT) occurred. For brevity, we assume that the binary translator in the simulation generates a single translation block (TB) for the loop body, including getTime() and the condition evaluation. Then the second iteration must run a different version of the TB in order to inject the interrupt. In consequence, we have to know when the next asynchronous event needs to be replayed before generating a translation block and before having replayed an arbitrarily long sequence of preceding synchronous events.

Log: RDTSC (Sync) | RDTSC (Sync) | RDTSC (Sync) | INT (Async) | …
     (before loop)  (1st iteration) (2nd iteration) (2nd iteration)

    start = getTime();
    do {
        now = getTime();
    } while (now - start < 10);
    print("Finished Waiting");

Figure 8.2: The occurrence of asynchronous events must be known before generating a translation block (TB), otherwise the block will not include the replay. Here, the 2nd loop iteration requires a different translation.

[Figure 8.3 sketches the recording path: the vCPU (QEMU CPU 0) and KVM emit synchronous and asynchronous events into separate logs, which are passed through the Simutrace client library to the Simutrace server (SimTrace backend).]

Figure 8.3: We use the general-purpose tracing backend of Simutrace for fast asynchronous compression and storage of non-deterministic events.

To avoid repetitive, time-consuming discovery of asynchronous events during replay (which would amount to a linear search), we therefore put synchronous and asynchronous events into separate logs. This way, we always have direct access to the next asynchronous event.

Just like with the checkpoints, we use Simutrace [218] for storage, this time using the regular tracing backend and general-purpose LZMA [9] compression. The separate event logs are implemented as two streams of the same store, creating a single file on disk. Network distribution (§ 7.3) then works in the same way as for any other file; for example, the checkpoint database.

Over the course of a recording session, SimuBoost captures various types of events (see Tables B.1 and B.2 for a complete list). As we are running a hardware-assisted virtual machine with KVM during this phase, the majority of events are recorded in kernel mode. To simplify the design, we do not directly write the events into the user-mode buffer supplied by Simutrace, but instead use a combination of slab and generic memory allocation to temporarily store the events in kernel memory². Whenever the vCPU thread leaves the kernel, we collect all events using the same technique as for iteratively copying pages from kernel into user mode (§ 6.1.4).
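To illustrate the kernel-side staging, the following is a minimal sketch of how an event could be buffered using a slab cache for the fixed-size structure and a generic allocation for the optional payload. The structure and function names (rr_kevent, rr_record_event) are hypothetical and do not reflect the actual KVM patch; only the standard Linux allocation primitives are real.

```c
#include <linux/slab.h>
#include <linux/list.h>

/* Hypothetical per-vCPU staging of recorded events in kernel memory.
 * Fixed-size headers come from a slab cache; payloads from kmemdup(). */
struct rr_kevent {
	struct list_head link;
	u16 type;        /* event type, e.g., RDTSC, INT, Write DMA */
	u32 length;      /* payload size in bytes */
	void *payload;   /* NULL if the event carries no data */
	/* landmark fields omitted for brevity */
};

static struct kmem_cache *rr_event_cache;   /* created once during setup */

static int rr_record_event(struct list_head *pending, u16 type,
			   const void *data, u32 length)
{
	struct rr_kevent *ev;

	ev = kmem_cache_alloc(rr_event_cache, GFP_ATOMIC);
	if (!ev)
		return -ENOMEM;

	ev->type = type;
	ev->length = length;
	ev->payload = NULL;

	if (length) {
		ev->payload = kmemdup(data, length, GFP_ATOMIC);
		if (!ev->payload) {
			kmem_cache_free(rr_event_cache, ev);
			return -ENOMEM;
		}
	}

	/* The list is drained into the user-mode Simutrace buffer when the
	 * vCPU thread leaves the kernel (cf. § 6.1.4). */
	list_add_tail(&ev->link, pending);
	return 0;
}
```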

8.1.1 Landmark

Existing deterministic replay systems use different landmark designs depending on the target architecture and the intended purpose for replay. With a focus on restricted memory consumption, Sundmark and Thane describe a checksum of CPU registers and stack contents for embedded systems [249]. Similarly, V2E [287] saves a majority of the CPU state (including the instruction pointer, registers, and flags) and simply compares the state during replay – after each instruction – with the saved one of the next asynchronous event. Although the authors do not explicitly quantify the overhead of this approach, it can be expected to have a severe impact on the simulation speed, considering how sensitively DBT engines react to bloat in translation blocks (see § 3.1).

² Otherwise, we would have to potentially request new buffer space from user mode in the middle of dispatching a VM exit. We use the slab cache for the common event structure and the generic memory allocation for the optional event payload.

Besides performance concerns, this solution also entails the risk of false positives as the CPU state alone is not unique; the easiest example being an endless loop of the form 0x00: jmp 0x00.

At the other end of the spectrum, the combination of the retired instructions counter³ and the RCX register creates a minimalistic landmark [63, 194]. It is superior to the aforementioned design because the landmark is unique and the instruction counter allows the DBT engine to pinpoint the exact location for event injection. On the downside, this landmark requires hardware support for instruction counting during recording. Although x86 does come with a hardware instruction counter (implemented as a performance counter), it is known to be unreliable [194, 272].

A popular variant therefore drops the instruction counter and incorporates a combination of branch counter and instruction pointer instead [50, 87, 128]. The resulting landmark is still unique but tends to be more reliable.

However, when designing a landmark, one always has to bear in mind that the underlying counters have to be replicated in the simulation. We therefore chose a landmark based on the retired instructions counter. This is because we see detailed tracing as a primary use case for SimuBoost. In this context, we regard an instruction counter as an important metric to represent the sequence of operations and to locate events in the instruction flow – more so than a branch counter. In addition, research has shown that there are still variations and bugs in counting for the branch counter, depending on the specific processor version [274]. Compensating for unreliable counters thus has to be done (on x86) anyway⁴.

We consequently turned to using a fuzzy landmark. This landmark combines the retired instructions counter, the instruction pointer (RIP), and the RFLAGS register with a snapshot of the 16 general-purpose x86 CPU registers⁵. For asynchronous events, we position a short window around the instruction count given in the landmark and single-step in this range, performing a 1:1 register compare after each instruction. Only when all registers match do we inject the asynchronous event and continue normal execution. This allows us to compensate for small mismatches in the instruction counting between simulation and hardware (see Figure 8.4). In order to prevent errors from accumulating and requiring ever larger windows, we remember the offset and correct subsequent landmarks accordingly. By adapting the landmark values instead of the simulated instruction counter, the counter does not experience jumps, which would be incomprehensible in traces. We apply the same corrections for synchronous events but omit the single-stepping window.

³ The register is necessary because x86 counts REP-prefixed instructions as one, irrespective of the number of repetitions [272]. RCX in turn stores the current iteration.
⁴ The instruction counter on our ARM platform turned out to be perfectly reliable [262].
⁵ RAX – RDX, RSI, RDI, RSP, RBP, and R8 – R15. We could restrict the set to RIP and RCX, but a comprehensive landmark reveals deviating replays more quickly and helps with debugging.

[Figure 8.4 shows a guest instruction stream in which the current asynchronous event (INT) carries the landmark ICOUNT 1008 and EIP 0x0021277A, but that address is only reached at simulated instruction count 1010 – a mismatch of +2. The landmark of the next asynchronous event is corrected accordingly from 1104 to 1104+2.]

Figure 8.4: We position a window around the landmark's instruction count and single-step in this range, doing a 1:1 compare of the register state after each instruction. Following landmarks are automatically corrected.
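The following C sketch illustrates the window-based matching and the landmark correction just described. It is a simplified model, not the actual QEMU/KVM implementation; the structure and function names are made up for illustration, and the window size of eight instructions follows the choice discussed below.

```c
#include <stdbool.h>
#include <stdint.h>

#define RR_WINDOW 8   /* compare registers for up to 8 instructions */

/* Snapshot of the landmark-relevant CPU state (cf. Section 8.1.1). */
struct cpu_snapshot {
    uint64_t icount;
    uint64_t rip;
    uint64_t rflags;
    uint64_t gpr[16];
};

/* Hypothetical helper: decide whether the pending asynchronous event may be
 * injected at the current simulated instruction. 'offset' accumulates the
 * correction that is applied to all subsequent landmarks. */
static bool rr_landmark_match(const struct cpu_snapshot *cur,
                              const struct cpu_snapshot *landmark,
                              int64_t *offset)
{
    uint64_t target = landmark->icount + *offset;

    /* Before or past the single-stepping window: no injection here. */
    if (cur->icount < target || cur->icount > target + RR_WINDOW)
        return false;

    if (cur->rip != landmark->rip || cur->rflags != landmark->rflags)
        return false;

    for (int i = 0; i < 16; i++)
        if (cur->gpr[i] != landmark->gpr[i])
            return false;

    /* Record by how much the hardware counter deviated so that subsequent
     * landmarks are corrected instead of the simulated counter itself. */
    *offset += (int64_t)(cur->icount - target);
    return true;
}
```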

Implementation

To count retired instructions on x86 we use the IA32_PERF_FIXED_CTR0 performance counter, which is fixed to INST_RETIRED.ANY. As we are only interested in instructions executed by the guest and not by the hypervisor, a naive approach is to manually toggle the performance counter at VM enter and exit. However, this also includes host interrupt handlers that are immediately executed after a VM exit. We therefore use a VMX feature that allows automatic switching of MSR configurations when transitioning between host and guest⁶. In addition, we instrument various operations in KVM to manually count instructions; for example, when KVM traps and emulates an instruction like in the case of port I/O (i.e., the IN and OUT instructions). Note that KVM virtualizes the performance monitoring unit (PMU) for the guest. The instruction counter is thus still available.

As expected, we also found the INST_RETIRED.ANY counter to be unreliable, confirming previous work. In our case, it sporadically undercounts by one. We first suspected the special counting rules discovered by Weaver and McKee [272]; for example, that hardware interrupts and page faults count as an additional instruction. However, these are not responsible. Since we have seen only undercounts so far, we configure the window so that we compare registers for up to eight instructions following a landmark (i.e., in Figure 8.4, from 1008 to 1015).

To integrate asynchronous event injection into QEMU, we have to adapt the generation of translation blocks. The code excerpt given in Figure 8.4 translates to a single TB, with the jump instruction at 0x0021278F terminating the block. If the vCPU previously visited the code, a translation block for the whole excerpt may still be in the code cache.

⁶ Toggling the counter using VM_ENTRY_(LOAD|SAVE)_IA32_PERF_GLOBAL_CTRL does not work due to an unfixed CPU bug. Instead, we have to use the MSR-load area in the VMCS.

[Figure 8.5 depicts the CPU loop during replay: before executing a translation block, the prolog checks the remaining distance to the next single-stepping window; if the block would reach into the window, execution returns to the control logic, which compiles a shorter translation block or single-steps (icount limit 1) and performs the landmark comparison inside the window.]

Figure 8.5: Simplified overview of the CPU loop during replay. A check in the TB prolog prevents violating single-stepping windows. If necessary, a shorter TB is generated (figure based on [262]).

When the vCPU then reaches the code again, the control logic takes the existing TB, which blindly runs to its end without injecting the event or even implementing the single-stepping window. Extending the control logic with appropriate checks before starting a translation block is not an option due to TB chaining. While flushing the code cache after each asynchronous event is a possible solution, it is a rather expensive one. We therefore test in the TB prolog how many instructions are left until the beginning of the next single-stepping window. If the length of the TB exceeds this limit, execution returns to the control logic, which in turn requests a block of the remaining permissible length. To perform single-stepping, we allow only one instruction to be executed until the vCPU has to return to the control logic for landmark comparison.

Special care needs to be taken for instructions with the REP prefix. From the binary translator's point of view, each iteration constitutes one instruction because the conditional jump inherent to the REP prefix ends the underlying translation block. During normal operation, the translator achieves high simulation speed by chaining the generated TB to itself. The prefix is often used together with the MOVS and STOS instructions to implement memcpy() or memset(), respectively. According to the retired instructions counter in Intel CPUs, it thereby hides possibly hundreds of repetitions of the same instruction behind what is counted as only one instruction. This can severely prolong the actual single-stepping phase over our intended maximum length of 8 instructions and notably harm simulation speed. We therefore inline the test for the RCX register in this particular TB and only leave the block for a complete landmark check on a match. Otherwise, execution of the REP-prefixed instruction continues as normal, including TB chaining.
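As a rough illustration of the prolog check, the following sketch decides whether a translation block of a given length may run without reaching into the next single-stepping window. The state and function names are hypothetical and do not correspond to QEMU's TCG code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical replay state kept per vCPU (not QEMU's actual data layout). */
struct rr_replay_state {
    uint64_t icount;            /* current simulated instruction count */
    uint64_t next_window_start; /* icount at which single-stepping begins */
};

/* Conceptual TB-prolog check: may a block of 'tb_icount' instructions run to
 * completion without reaching into the next single-stepping window? If not,
 * 'max_icount' tells the control logic how long a shorter block may be. */
static bool rr_tb_may_run(const struct rr_replay_state *rs,
                          uint64_t tb_icount, uint64_t *max_icount)
{
    if (rs->icount >= rs->next_window_start) {
        /* Already inside the window: fall back to single-stepping. */
        *max_icount = 1;
        return tb_icount <= 1;
    }

    /* Outside the window: the block must end before the window starts. */
    *max_icount = rs->next_window_start - rs->icount;
    return tb_icount <= *max_icount;
}
```

When the check fails, the control logic requests a translation block of at most the returned number of instructions, matching the behavior sketched in Figure 8.5.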

Similar performance considerations have to be made for the implementation of the actual simulated instruction counter. Incrementing the counter after every instruction is slow, albeit easy to do. To reduce the frequency of counter manipulations, we only update its value when leaving the translation block; either by reaching the block's end or by invoking helper routines. We further added a new mnemonic to QEMU's intermediate language that translates to a memory-destination ADD instruction on x86 and allows us to increment the counter by a 32-bit immediate value directly in memory.

8.1.2 Replay Boundary

When replaying a full system, we can to some degree decide where to exactly place the replay boundary in relation to the virtual devices connected to the system: The most restrictive approach interprets everything outside the vCPU as part of the replay log. In this configuration, the result of every I/O operation is meticulously recorded so that the replay does not rely on any implementation of virtual devices. Instead, the replay is solely fed by the log and all I/O directed toward virtual devices can simply be dropped [285]. This gives a maximum degree of flexibility when it comes to the choice of the simulator because it only has to provide a corresponding CPU model. Such a restrictive variant is also easier to develop and validate. On the flip side, saving all I/O results produces the largest log.

A less comprehensive logging approach actively involves (some of) the virtual devices in the replay. For example, accesses to secondary storage devices can be allowed to pass so that the log does not have to embed the data read from disk [76, 87], only the interrupts. In this case, a matching base image from the start of the recording phase is needed. This concept, however, cannot generally be applied if the data is not recoverable from its original source during replay, such as in the case of a virtual network adapter.

For SimuBoost, both approaches are viable. The first one is especially appealing due to its simplicity, its flexibility, and because it does not require the checkpoints to include a disk image. Nevertheless, we do not drop all virtual devices in the simulation. This has the following two reasons:

First, when developing the replay, we wanted to be able to monitor the output of the system by displaying the screen contents and connecting to the serial port (read-only). To enable the replaying system to control these devices, we allow the vCPU to write to certain I/O addresses⁷. The initial state of the devices (e.g., the screen buffer) is restored from the checkpoints. In order to maintain the deterministic execution, we do not accept input from any device. That means we always redirect reads of I/O addresses to the replay log and drop newly generated interrupts.

⁷ For a complete list of permitted addresses, see Table B.3.

Second, we checkpoint the device states with the interface that comes with QEMU. This interface does not explicitly save the configuration of the physical address space (e.g., the mapping of MMIO regions, ROMs, etc.), but merely reconfigures the underlying virtual chipset by restoring its configuration registers. Remember that we checkpoint the memory based on QEMU's RAM blocks (see § 2.2.5), which require adequate mapping when loaded. Therefore, our prototype includes the respective chipset and allows certain I/O commands to pass in the replay, as otherwise these memory regions would not be mapped correctly and connected devices such as the VGA adapter would not work. A partial solution would be to checkpoint the final physical address space instead of the individual RAM blocks. However, this would not allow replaying the BIOS and other early boot code that changes the mapping of the physical address space at run time by reconfiguring the chipset. An alternative would be to explicitly record and replay the mapping and changes thereof as asynchronous events, but we have not experimented with this solution yet.
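Combining the two points above, the replay-side port-I/O dispatch can be summarized by the following minimal sketch: reads are always served from the log, while writes only reach a small whitelist of devices (screen, serial port, chipset). The helper functions are hypothetical stand-ins, not QEMU APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks into the replaying machine (not QEMU APIs). */
bool     rr_pio_addr_whitelisted(uint16_t port);              /* cf. Table B.3 */
uint64_t rr_log_next_pio_read(uint16_t port, unsigned size);  /* recorded value */
void     virtual_device_pio_write(uint16_t port, uint64_t val, unsigned size);

static uint64_t replay_pio_read(uint16_t port, unsigned size)
{
    /* Reads never reach a virtual device: the recorded result is returned,
     * keeping the replay independent of device behavior. */
    return rr_log_next_pio_read(port, size);
}

static void replay_pio_write(uint16_t port, uint64_t val, unsigned size)
{
    /* Writes pass only to devices we want to observe (screen, serial port)
     * or that reconfigure the chipset; everything else is dropped. */
    if (rr_pio_addr_whitelisted(port))
        virtual_device_pio_write(port, val, size);
}
```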

Direct Memory Access (DMA)

A majority of I/O in a modern computer system is performed using direct memory access (DMA), that is, a device (e.g., the disk controller) is able to read from or write to main memory without assistance from the CPU. This significantly increases I/O bandwidth and reduces CPU load. Just as with other operations, we are only interested in data that enters the replay domain, which in this case are DMA writes to memory. However, DMA's parallel operation to the CPU is problematic in this context. From the perspective of the deterministic replay, the DMA-performing device can be regarded as an additional processor that accesses a shared memory area (the common physical memory). That means we have to precisely record the order of parallel accesses to this shared area while the DMA transfer is in progress.

Since we are replaying a virtual machine with virtual devices only, we are not dealing with true hardware DMA, but merely have a CPU thread in the hypervisor which performs a memory copy. For QEMU, this is one of the I/O threads that first retrieves the requested data (e.g., a particular disk sector) using OS services and then performs a memcpy() to its user-mode mapping of the guest physical memory. Recording the access order can be done by implementing a single-ownership protocol on the respective DMA target area using page protections. That means only the vCPU or the I/O thread has exclusive access to the memory area, and if the I/O thread is interrupted by the vCPU, we record the thread's copy progress and switch ownership. Ownership changes again when the I/O thread attempts to continue copying⁸.

⁸ Such a protocol can to some degree also be implemented for hardware (R)DMA using IOMMU page faults, but requires device-specific handling since the actual data is dropped on page faults. This is problematic if the data cannot be retrieved again.

In normal operation, parallel accesses to a DMA destination are atypical and likely indicate a bug. Instead, the CPU generally waits for the device to signal completion. In the prototype for SimuBoost, we therefore simplify the aforementioned approach by omitting the access protections and only recording which data has to be written where and when. We thereby treat DMA more like a transaction and do not support a correct replay while the operation is still in progress.

Another problem is getting a consistent landmark for the DMA operation. This is because the I/O thread itself, not being part of the replay domain, does not possess a landmark. Instead, we have to express the completion time with a vCPU landmark. The vCPU, however, may concurrently run and constantly update registers. To get a consistent landmark, we therefore force the vCPU out of the guest into user mode and record the DMA operation in its context. We use QEMU’s run_on_cpu() function for this purpose.
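A rough sketch of this mechanism follows: the I/O thread defers logging of the completed DMA write to the vCPU thread so that the event receives a consistent landmark. QEMU's run_on_cpu() is a real facility for running work in the context of a stopped vCPU, but header locations and its exact signature vary across QEMU versions; rr_record_dma_event() and the surrounding structure are hypothetical.

```c
/* Header locations and the run_on_cpu() signature differ between QEMU
 * versions; this sketch follows the run_on_cpu_data-based variant. */
#include "qemu/osdep.h"
#include "hw/core/cpu.h"

struct rr_dma_write {
    uint64_t gpa;    /* guest physical destination of the DMA write */
    uint64_t len;    /* number of bytes written */
    void *data;      /* copy of the written data */
};

/* Hypothetical logging helper provided by the recording infrastructure. */
void rr_record_dma_event(CPUState *cpu, uint64_t gpa, const void *data,
                         uint64_t len);

static void rr_log_dma_on_vcpu(CPUState *cpu, run_on_cpu_data arg)
{
    struct rr_dma_write *dma = arg.host_ptr;

    /* Runs while the vCPU is outside guest mode, so its registers form a
     * consistent landmark for the asynchronous DMA event. */
    rr_record_dma_event(cpu, dma->gpa, dma->data, dma->len);
}

/* Called from the I/O thread once the memcpy() into guest memory is done. */
static void rr_log_dma_write(CPUState *cpu, struct rr_dma_write *dma)
{
    run_on_cpu(cpu, rr_log_dma_on_vcpu, RUN_ON_CPU_HOST_PTR(dma));
}
```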

8.1.3 Evaluation

Although determining the accuracy of a replay seems simple at first, it is difficult to provide a definitive quantification. Whereas divergent instruction flows generally surface as mismatching landmarks that break the replay, more subtle differences can quickly be overlooked. This, for example, applies to small variations in memory that do not influence the instruction flow but may affect analysis results later on (e.g., when searching for redundant memory pages). Such differences often show up only for short periods of time, for instance on a thread stack. In order to guarantee absolute parity between recording and replay, it would be necessary to compare the system states at each instruction. This, in turn, imposes such an enormous run-time overhead that it eventually produces an entirely different execution, providing little information about the original run. Statements regarding the accuracy of a replay, especially in case of heterogeneous replay, must therefore be done with care.

During development, we utilized varying frequencies of checkpoints, memory checksums, and debug events to detect anomalies and refine the simulation (see § 8.2). However, in the final evaluation, we only measured the share of intervals that successfully complete, i.e., that do not diverge due to remaining inaccuracies in the simulator. For these intervals, we further compared the final memory image with the checkpoint of the next interval. On average, over 97% of intervals run to completion (see Table B.5), while from this set of intervals around 98% of final memory images match the initial state of the next interval. Whereas these rates can be increased by investing more time in the refinement of the simulation, we consider this level sufficiently high to provide reliable results for the evaluation of SimuBoost in Chapter 9.

Even if the replay matches the original execution, recording itself affects the behavior of the workload to some degree and causes differences to an unsupervised execution. However, just like with continuous checkpointing, it is difficult to measure the actual probe effect. We therefore again take the run-time overhead as one of the primary assessment criteria for evaluation, with the others being the log growth rate and the log compressibility, considering that the log must be included in the multicast data distribution.

We do not include a worst-case microbenchmark such as stress-ram for checkpointing. Such a benchmark would simply cause every instruction to trap into the hypervisor (e.g., by only executing RDTSC instructions) and would exhibit a correspondingly high run-time overhead, but with little relevance in practice. We therefore restrict the evaluation to our set of real-world benchmarks.

Run-Time Overhead

Compared to an unmodified version of QEMU/KVM, we had to additionally trap the RDTSC instruction, extend the VM enter and VM exit paths for instruction counting, and install various hooks for event recording. In case of active recording, these hooks allocate memory, read the vCPU registers to create a landmark, and finally store the event together with its data. Otherwise, additional traps, instruction counting, and hooks are still present, but the hooks return early without performing any further operation. That means our modified KVM version incurs a certain overhead even without active recording.

Figure 8.6a summarizes the run-time overhead for recording all non-deterministic events listed in Tables B.1 and B.2 (including optional ones) compared to the execution of the same workload in an unmodified version of QEMU/KVM – i.e., no hooks, no additional traps, and no instruction counting. For 7 out of 10 workloads, recording incurs only negligible overhead, in most cases remaining below 1%. Exceptions to this are phpbench, SPECjbb, and apache, where the latter reacts very sensitively to recording with almost twice the run time.

To determine the source of the overhead, we repeat the comparison in Figure 8.6b, but take our modified version of QEMU/KVM as the baseline. That means we effectively remove any overhead from the observation that stems from the additional traps, the instruction counting, and the (disabled) hooks. This leaves us with the overhead for actual event recording. We can observe that the run-time overhead for the three aforementioned benchmarks is significantly lower. That indicates that the majority of overhead is bound to the added RDTSC trap and the instruction counting. Furthermore, as the other benchmarks do not suffer from comparably high run-time overhead, we can deduce that instruction counting is not the primary cost factor of the two.

[Figure 8.6: bar charts of the run-time overhead in percent per workload, (a) measured against an unmodified KVM from the stock Linux kernel and (b) measured against the modified SimuBoost KVM with recording disabled.]

Figure 8.6: (a) Recording is very lightweight for most workloads, with the run-time overhead often remaining below 1%. However, some workloads, especially apache, react very sensitively. The baseline is an unmodified KVM. (b) The overhead decreases notably when comparing to the modified KVM, suggesting that our modifications cause the majority of overhead, not the actual event recording.

It merely amplifies the overhead for the additional switches between the hypervisor and the VM because each switch entails the configuration and reading of the corresponding performance counter. We can conclude that the majority of the run-time overhead of our deterministic recording solution is caused by the RDTSC trap. This becomes especially visible for phpbench, SPECjbb, and apache, which show above-average rates for synchronous events (see Figure 8.7a) with over 98% of recorded events being timestamp readings (see Figure B.5a).

Log Growth and Log Compressibility

For the distribution with multicast, the log growth and log compressibility are important metrics as they determine the necessary network bandwidth for the recording logs. Figure 8.7b illustrates that for all workloads even uncompressed distribution over Gigabit Ethernet is conceptually viable – also for recording-intensive workloads such as apache. However, under the constraint of parallel checkpoint distribution, compression is appropriate. We employ the default compression method of the trace storage backend in Simutrace, which is a built-in LZMA [9] encoder. In our experiments, all logs show very good compressibility (see Figures 8.7c and 8.7d), reducing the average log growth to less than 100 KiB/s for most workloads. Only apache (3 MiB/s), encode-mp3 (1.3 MiB/s), and SPECjbb (800 KiB/s) exceed this value.

[Figure 8.7: (a) event rates in events/s per workload, split into synchronous and asynchronous events; (b) required bandwidth in MiB/s for uncompressed logs compared to Gigabit Ethernet; (c) required bandwidth in MiB/s with LZMA compression (median and peak); (d) LZMA compression ratios for synchronous, asynchronous, and overall log data; (e) CPU cores utilized by compressed recording.]

Figure 8.7: (a) SPECjbb (75K/s) and apache (300K/s) show above-average (average: 3.5K/s) event rates (many RDTSCs). (b) Uncompressed logs can grow up to 50 MiB/s on average. Compression is thus appropriate. (c) LZMA decreases growth to at most 3 MiB/s on average. (d) Synchronous events generally compress best, whereas DMA notably harms overall results. (e) CPU utilization increases only marginally due to compressed recording. Exceptions are apache and DMA-heavy phases in general.

Whereas in the case of apache and SPECjbb the sheer number of RDTSC readings is responsible for the increased data volume, the log for encode-mp3 comprises almost 70% DMA transactions (see Figure B.5b). Since DMA events include data read from secondary storage, their average compressibility is lower compared to other event types⁹. This is also clearly visible in Figure 8.7d, where the overall compression ratio tends to the lower ratio for asynchronous events (e.g., DMA) if the share of DMA in the log size is high. We can thus observe a clear correlation between increased DMA content and reduced overall compressibility. Nevertheless, even these logs can be transferred over the network concurrently with the checkpoints of the respective workloads¹⁰. However, phases of increased DMA can manifest as short peaks. This is the case for the kernel build, where the outliers in Figure 8.7c are the manifestation of the Linux source code extraction at the beginning of the benchmark – i.e., the DMA events hold an already compressed archive.

The average CPU utilization due to the compressed recording is generally very low, remaining below 0.5 cores even for encode-mp3 and SPECjbb (see Figure 8.7e). Only apache consumes 1.5 additional cores on average. Although the aforementioned initialization phase of the kernel build has little impact on the average CPU utilization, the attempt to compress the already compressed DMA data temporarily (17 s) results in excessively high CPU utilization.

8.2 Simulation Refining

A major challenge with heterogeneous replay compared to homogeneous replay is that the simulation must be refined so that the deterministic parts behave exactly like on the hardware platform that was used for recording. In the following, we discuss some of the changes we made to QEMU to imitate our evaluation system. Most of these (small) adaptations required hours of debugging to track down the code responsible for a single differing bit in memory (with millions of instructions executed in between) or to find the spot in the instruction flow where the replay started to diverge. In the latter case, we instrumented the VM enter and exit to generate debug events, delivering additional landmarks for state verification. Often, we had to further increase the frequency of landmarks with the help of the VMX preemption timer, which forces the CPU out of the guest after a certain number of CPU cycles. This way we could typically limit the search to less than a hundred instructions, albeit investigation remained tedious in code with many branches. For differences that showed up in memory only, we generally installed memory breakpoints, either in KVM or QEMU, depending on whether QEMU implemented the modification in the first place. We can thus confirm previous work underlining the high development efforts for deterministic replay [50].

⁹ This is also the case for encode-mp3, where the DMA events include the WAV input file. Compressing the raw audio with gzip leads to a 4% size reduction only.
¹⁰ Compare Figure 7.4a on page 141.

8.2.1 Status Flag Computation

The x86 architecture stores status flags in the EFLAGS register, which forms the lower 32 bits of the RFLAGS register on 64-bit platforms. The flags are manipulated by a total of 77 instructions [125], for example as a result of the CMP instruction or arithmetic computations. For 36 of the 77 instructions, at least one of the status bits is not defined, that is, the x86 specification does not include a calculation rule. As there is little use in working with undefined bits, programs generally do not access them and the instruction flow remains independent of their value. From this standpoint, undefined status flags are uncritical for deterministic replay. However, pushing the EFLAGS register onto the stack using the PUSHF instruction can leak undefined values into memory. Linux, for example, typically backs up the EFLAGS register this way before deactivating interrupts so that the register can be easily restored on exit. In consequence, when comparing the replayed contents of the guest physical memory to dumps from the original run, we have regularly seen diverging bits spread over the entire guest physical memory. In many cases, these turned out to be leaked undefined status flags that were implemented differently in QEMU. A few times this crashed the replay when the undefined flags ended up surfacing in one of the landmark registers. It would also be straightforward for malware to deliberately exploit this divergence to undermine replay and eventually thwart inspection with SimuBoost. We therefore adapted the flag computation in QEMU to match our Xeon E5-2630 evaluation hardware.

To determine the computation rules, we wrote a small test application which fed the instructions with random values and corner cases, and we inspected the resulting status flags in the hardware's EFLAGS register. Table 8.1 lists some of the instructions for which we tested and/or changed the undefined flag computation in QEMU. We skipped 6 of the 36 instructions because they are not valid in 64-bit mode. In most cases, the auxiliary carry flag (AF), used in BCD arithmetic, is undefined in the specification but cleared by the hardware. This is also the default implementation in QEMU. For the bit scan instructions (BSF/BSR), QEMU uses the wrong operand to compute the parity flag (PF), which indicates an even number of 1-bits, so we modified it to use the first operand (OP1).

Figure 8.8: Status flags in the EFLAGS register [125]: Carry Flag (CF, bit 0), Parity Flag (PF, bit 2), Auxiliary Carry Flag (AF, bit 4), Zero Flag (ZF, bit 6), Sign Flag (SF, bit 7), and Overflow Flag (OF, bit 11).

Instruction           OF          SF          ZF    AF    PF             CF
AND                                                 0
ANDN                                                0†    parity(DST)†
BSF/BSR               0           0                 0     parity(OP1)    0
BZHI                                                0     0†
MUL/IMUL                          sign(DST)   0     0     parity(DST)
OR/XOR                                              0
SAL/SAR/SHL/SHR 1                                   0†
SAL/SAR/SHL/SHR cnt   OF(cnt=1)                     0
TEST                                                0

† Changed compared to the default QEMU implementation.

Table 8.1: Undefined flag computation rules for selected x86 instructions. Blank fields denote defined values.

For the shift operations, QEMU computes the overflow flag (OF), which indicates an overflow for signed integers, from the final shift result, whereas the hardware always sets the OF flag according to a shift by one. From a performance perspective, this change is the most expensive one because we have to perform two shift operations in software. In our prototype, we also compute the flag eagerly, even if it is not used.
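The test application mentioned above is not part of the public material; the following only sketches the underlying technique – execute the instruction under test and capture the resulting RFLAGS via PUSHF – using GCC inline assembly on x86-64. The flag masks are the architectural bit positions from Figure 8.8.

```c
#include <stdint.h>
#include <stdio.h>

#define FLAG_OF (1ULL << 11)
#define FLAG_AF (1ULL << 4)
#define FLAG_PF (1ULL << 2)
#define FLAG_CF (1ULL << 0)

/* Perform SHL by a variable count and return the resulting RFLAGS. */
static uint64_t rflags_after_shl(uint64_t value, uint8_t count)
{
    uint64_t flags;
    __asm__ volatile("shl %%cl, %[v]\n\t"
                     "pushfq\n\t"
                     "pop %[f]"
                     : [v] "+r"(value), [f] "=r"(flags)
                     : "c"(count)
                     : "cc");
    return flags;
}

int main(void)
{
    /* For counts > 1, OF is architecturally undefined; probing the hardware
     * reveals the rule it actually implements (cf. Table 8.1). */
    uint64_t f = rflags_after_shl(0x4000000000000001ULL, 3);
    printf("OF=%d AF=%d PF=%d CF=%d\n",
           !!(f & FLAG_OF), !!(f & FLAG_AF), !!(f & FLAG_PF), !!(f & FLAG_CF));
    return 0;
}
```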

8.2.2 Read-Write Instructions

When replaying the boot of a Linux kernel, the simulation always crashed at a certain point. A first investigation revealed that the divergence happened in the guest's page fault handler in the course of an atomic compare-exchange (CMPXCHG). To aid debugging, we temporarily extended KVM to trap and log page fault exceptions. This eventually showed that the memory access mode between the simulation and the original execution differed. Whereas the simulation first performed a read and as a result caused a read page fault, the hardware raised a write page fault. We observed the same behavior for other instructions that both read and write memory; for instance movs, which copies a word directly in memory. We assume that the hardware first does a write probe as part of gaining exclusive access to the target cache line. As this can be omitted in software, QEMU does not integrate such probes.

Replicating x86 exception behavior is very complex in general and difficult to implement correctly in software. Hence, the authors of V2E decided to remove exceptions from QEMU and instead record and replay them [287]. While this simplifies simulation and makes replay more robust, it also means that guest page faults always have to trap into the hypervisor so they can be recorded. This takes away a great part of the performance that is gained from second level address translation and is a step back toward shadow page tables. We therefore decided against this approach and adapted the simulation instead. Due to parallelization, we can tolerate run-time overhead in this phase, but not in the serial hardware-assisted virtualization run. In total, we extended 28 instructions or variants thereof that read and write to memory with a prior write probe in TCG; a sketch of such a probe follows the list below:

• Integer arithmetic: adc, add, sbb, sub, xadd; dec, inc, neg; and, or, xor, not
• Bit arithmetic: btr, bts, btc; rcl, rcr, shr, shl, sar, sal, shrd, shrl
• Data transfer: xchg, cmpxchg, cmpxchg8b, cmpxchg16b; movs
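The following sketch shows the idea of such a write probe for one representative read-modify-write instruction. The helpers are hypothetical stand-ins; QEMU's TCG provides comparable probe/load/store functions, but their exact names and signatures vary between versions.

```c
#include <stdint.h>

/* Hypothetical memory helpers of the simulator (not actual QEMU APIs):
 * rr_probe_write() raises a write page fault if the page is not writable. */
void     rr_probe_write(uint64_t vaddr, unsigned size);
uint64_t rr_ldq(uint64_t vaddr);
void     rr_stq(uint64_t vaddr, uint64_t val);

/* Emulation of "add [vaddr], reg" with a prior write probe. */
static void emulate_add_mem64(uint64_t vaddr, uint64_t reg)
{
    /* Probe first: if the destination is not writable, the fault is raised
     * before any read, mirroring the write fault seen on the physical CPU. */
    rr_probe_write(vaddr, sizeof(uint64_t));

    uint64_t value = rr_ldq(vaddr);
    rr_stq(vaddr, value + reg);
}
```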

8.2.3 MMU-induced Non-Determinism

With second level address translation (SLAT) enabled, the host's MMU transparently translates guest virtual memory addresses to host physical memory using the guest-controlled page table and the EPT. To speed up subsequent accesses, the MMU caches translations in the TLB. As the different access times to cached and non-cached data can be measured in the guest, the TLB is not fully transparent. That means exits to the hypervisor or run time in other VMs can to some degree become visible through eviction of TLB entries. Nonetheless, from the perspective of deterministic replay, this is uncritical because the simulation accurately replays all time measurements, pretending that the same amount of time has passed. Although the TLB thus seemed unproblematic at first, we eventually found that it is not transparent to deterministic replay for the following reason: Whenever the MMU uses a PTE for address translation, it sets the PTE's accessed and dirty (A/D) bits accordingly. If, however, a matching translation is already present in the TLB, the MMU does not consult the page table and leaves the A/D-bits untouched. This becomes a severe problem for replay when the guest resets these bits in memory without also invalidating the corresponding TLB entries – i.e., when the TLB and the guest page tables fall out of synchronization. In this case, the A/D-bits will only be updated on the next access if the TLB entries have been evicted in the meantime; either on purpose or as part of a replacement. The latter, in turn, depends on the size, architecture, and replacement policy of the TLB (all being CPU specific) as well as on the actual instruction flow and the memory accesses that accompany it.

As the A/D-bits are often used to estimate working sets and implement page replacement in operating systems, they directly affect the instruction flow. We observed this in the page aging algorithm used in Linux since version 3.16 [183]. When the kernel clears the accessed bit in PTEs, it does not invalidate the corresponding

TLB entries in order to avoid the potential performance hit resulting from a TLB miss. Instead, it tolerates inconsistencies between the TLB and the page table because the accessed bits only function as a loose page usage heuristic and the chance of a false prediction is considered "relatively low" by the maintainers [154]. In consequence, we had replays regularly crash when memory pressure in the VM started to rise and Linux invoked its page aging algorithm. We thus come to the conclusion that as soon as resetting A/D-bits and invalidating TLB entries are not synchronized, we see MMU-induced non-determinism that needs to be captured for replay.

Surprisingly, this source of non-determinism has not been widely discussed in the research. It is even more astonishing, considering that the problem is equally relevant to the more broadly researched homogeneous full system replay. In fact, we only found Bressoud and Schneider to describe a vaguely similar situation. In their case, the non-deterministic replacement in the software-controlled TLB on HP 9000/720 processors could lead to non-deterministic invocations of the guest’s TLB miss handler [45]. Although we did not explicitly validate this, we assume that the commercial replay technology used in VMware products correctly handles MMU-induced non-determinism. This is because VMware published a patent (US9928180B2 [166]) that proposes a hardware extension to prevent the non-determinism by stale TLBs. In contrast to the research community, VMware engineers thus seem to be well aware of this problem.

For SimuBoost, we set out to quantify the overhead of feasible software alternatives. Our goal is to guarantee a faithful replay even in the presence of stale TLB entries. Since we cannot replicate the non-deterministic behavior of the TLB in the simulation, we are left with two options for tracking the state of A/D-bits in the hardware-assisted virtualization [183]:

1. We can deny the MMU write access to the guest page tables by enabling write protection for the corresponding guest physical pages in the EPT. Whenever the MMU then attempts to set bits in the guest page table, we receive a page fault in the hypervisor and can log the operation.

2. Alternatively, we deny the guest read access to its own page tables. This way, we get a page fault when the guest tries to read the A/D-bits and we are able to log them before passing them on. This permits diverging A/D-bits in the page tables when replaying, but ensures that such differences are not relevant to the instruction flow as they are overridden by the logged state if necessary.

While the second approach would create a minimal log, disallowing the guest read access to the page tables also prohibits read access for the MMU. We thereby lose SLAT and would have to resort to shadow page tables. We therefore decided against it and evaluated the first approach instead [183].

In order to write-protect the guest page tables in the EPT, we have to identify the corresponding guest physical pages. We do this by trapping (guest) CR3 changes, which indicate that a new page table hierarchy is to be configured. We then traverse this hierarchy to discover all involved guest physical pages. It is further necessary to track subsequent modifications to the active hierarchy such as the addition or removal of page tables. This is straightforward once the write protection is established. For simplicity, we discard any information on previous hierarchies, but lazily keep the write protection until the next page fault.
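A minimal sketch of the hierarchy traversal on a CR3 load is shown below, assuming a 4-level x86-64 page table. The two helpers for reading guest physical memory and for write-protecting a frame in the EPT are hypothetical, and the sketch omits the tracking of later modifications described above.

```c
#include <stdint.h>

#define PTE_PRESENT   (1ULL << 0)
#define PTE_PAGE_SIZE (1ULL << 7)             /* large page (2 MiB / 1 GiB) */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL
#define ENTRIES       512

/* Hypothetical hypervisor hooks: read one 8-byte entry from guest physical
 * memory and write-protect a guest frame in the EPT. */
uint64_t gpa_read64(uint64_t gpa);
void     ept_write_protect_gfn(uint64_t gfn);

/* Walk the hierarchy rooted at CR3 and write-protect every guest page that
 * holds a page table, so MMU writes to A/D-bits trap into the hypervisor. */
static void protect_page_table(uint64_t table_gpa, int level)
{
    ept_write_protect_gfn(table_gpa >> 12);

    if (level == 0)                    /* leaf page table reached */
        return;

    for (int i = 0; i < ENTRIES; i++) {
        uint64_t pte = gpa_read64(table_gpa + i * 8);

        if (!(pte & PTE_PRESENT))
            continue;
        if (level < 3 && (pte & PTE_PAGE_SIZE))
            continue;                  /* large page: no lower-level table */

        protect_page_table(pte & PTE_ADDR_MASK, level - 1);
    }
}

/* Invoked when a guest CR3 write is trapped; the PML4 is level 3. */
static void on_guest_cr3_load(uint64_t cr3)
{
    protect_page_table(cr3 & PTE_ADDR_MASK, 3);
}
```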

The following step is to discern (deterministic) guest writes from (non-deterministic) MMU writes on an EPT violation. The former may entail changes in the guest page table hierarchy, whereas the latter need to be recorded in the replay log. Fortunately, the VM exit qualification lets us infer this information.

In both cases, we finally have to determine the intended modification. We tried different approaches such as single-stepping, but eventually settled on using KVM's instruction emulator, which proved to be the fastest option. This way, we route offending instructions through the software page table walker, where we can log modifications to A/D-bits. In addition, we can inspect modified PTEs for page table hierarchy changes before they become effective – i.e., before entering the virtual machine again. By recording the setting of A/D-bits as asynchronous events, we can use our existing infrastructure to precisely replay these operations in the simulation.

To evaluate whether the approach works as anticipated, we booted a Linux VM with only 256 MiB of guest physical memory. This causes the kernel to activate the page aging algorithm and quickly creates stale TLB entries. Without recording MMU-induced non-determinism, the replay always failed to complete. This is because either diverging PTEs were loaded into CPU registers and resulted in fatal landmark mismatches, or the instruction flow diverged altogether. Tracking and replaying A/D-bits successfully avoids this problem. However, the recording comes at a considerable cost. We measured an additional run-time overhead of 51% over regular recording for a kernel build, 130% for apache, and 346% for a worst-case microbenchmark. Only for sqlite and a CPU stress test, the run-time overhead compared to regular recording was negligible. The overall log size increased between 30% and 100%. The replay time remained within a 1% to 3% increase.

We can thus conclude that, albeit being effective, recording MMU-induced non-determinism is an expensive operation. Considering that the performance improvement of SLAT over shadow page tables has been reported as 38% for compiling the Apache server (which compares best to the kernel build) [39], the question arises whether, after all, resorting to the second approach is the more efficient solution. An alternative could also be to instrument guest page table reads using paravirtualization. However, this solution is not adequate for malware inspection.

Since MMU-induced non-determinism does not surface as long as the guest does not produce stale TLB entries, we regard tracking of A/D-bits as an optional feature that has to be activated for malicious and otherwise "misbehaving" guests only. We moreover see that further research is necessary to reduce the run-time overhead¹¹. For the general evaluation of SimuBoost, we therefore did not integrate tracking of A/D-bits into the final prototype. Instead, we run the test VMs with 4 GiB of RAM, where we rarely observe a replay failing due to MMU-induced non-determinism. Nevertheless, the optimal solution would be to have a hardware extension in future processors as proposed by VMware [166].

8.2.4 Atomic Instructions (ARM only)

The following issue has been observed on an Odroid-XU3 single-board computer, featuring a Samsung Exynos 5422 processor with four low-power Cortex-A7 and four high-performance Cortex-A15 cores. The (single-core) virtual machine for recording was pinned to one of the Cortex-A15 cores.

When experimenting with deterministic replay on the ARMv7 architecture, the replay of a Linux kernel boot always crashed at a certain point [262]. In almost all cases, the instruction count in the simulation was off by five instructions compared to the original recording in KVM. A comparison of the actual instruction stream during native execution and simulation revealed the cause (see Figure 8.9): The code from 0x80132710 to 0x80132720 represents a typical use of exclusive instructions, for example, to implement synchronization primitives: load a value, modify it, attempt to write it back, check for success, retry if necessary (five instructions). The excerpt shown in the figure implements an atomic increment of the value at the address located in register r3. However, the store exclusive (STREX), which should write back the incremented value, fails during recording for no apparent reason and the whole operation is retried. The second attempt finally succeeds. In the simulation, on the other hand, the store exclusive does not fail, leading to a fatal divergence in the instruction count and preventing the replay of the next asynchronous event at 0x804630ec.

The ARM Architecture Reference Manual states this to be valid behavior [32], even with no concurrent access to the locked memory address. The specification only dictates that progress has to be made at some point. For deterministic replay, this constitutes a major problem because STREX cannot be trapped or its result recorded. Furthermore, the instruction is not privileged and thus may also be freely used in user-mode code. This leaves even approaches based on paravirtualization unreliable. Up to this point, we are not aware of any generic solution to this problem.

¹¹ It should also be investigated whether the issue could be used as a covert channel between VMs running on the same processor.

[Figure 8.9 contrasts the recorded and the replayed instruction stream of an atomic increment at 0x80132710:

    ldrex r2, [r3]      ; load exclusive
    add   r2, r2, #1
    strex r0, r2, [r3]  ; r0 = 0 on success
    teq   r0, #0
    bne   0x80132710    ; retry on failure

During recording, the store exclusive fails (r0 = 1) and the loop body executes a second time, whereas in the replay it succeeds immediately; the replayed instruction count is therefore five instructions short when the next asynchronous event at 0x804630ec is due.]

Figure 8.9: The store exclusive during recording fails for no apparent reason, causing a retry of the operation which is not present in the simulation (figure based on [262]).

Since most of the use cases for exclusive instructions in the Linux kernel follow a similar pattern to the presented excerpt, the difference in the instruction count is usually five instructions or multiples thereof – in case of further retries or more than one instance between consecutive landmarks. To some degree, adjusting the instruction counter as detailed in § 8.1.1 can compensate for the divergence. Nevertheless, this is not perfectly reliable, requires a rather large single-stepping window¹², and can still be exploited by malicious code to crash the replay. For the purpose of deterministic replay, we therefore would like to see the capability to trap STREX instructions. It may be possible to imitate this mechanism with the help of performance counters¹³, but this remains open for future research.

8.2.5 Miscellaneous

In addition to the aforementioned issues, we had to perform numerous other changes to the simulation in QEMU in order to faithfully imitate the hardware behavior of our test platform. In the following, we describe three of the more prominent changes.

¹² The most common distances between recorded and replayed instruction counts were either 5 or 30 instructions [262].
¹³ The Cortex-A15 supports the Exclusive Instruction Speculatively Executed – STREX pass/fail events.

Page Faults on Code Pages Besides faithfully reproducing page faults on data accesses, the simulation also has to accurately trigger page faults for code pages. In the case of QEMU, the binary translator reads guest code when it generates a translation block. This creates a fundamentally different access pattern compared to a physical CPU:

• A TB typically comprises multiple instructions but does not cross jumps.

• The translator does not perform speculative execution.

• Previously translated guest code is accessed again only when the respective TB has been removed from the code cache.

To prevent early page faults due to block translation, we adjusted QEMU to stop translation at page boundaries. Regarding the second point, we fortunately did not observe speculative execution triggering page faults, nor did we find corresponding statements in the architecture manual. Our solution is thus consistent with existing heterogeneous replay systems [287]. The last point is uncritical because the physical CPU raises a page fault on a previously faulted code page only if the corresponding mapping is invalidated or changed in the meantime. These are cases that a binary translator has to cope with anyway, so the last point does not constitute a problem for deterministic replay.
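The page-boundary rule can be illustrated with the following sketch of a termination criterion for the translation loop; the constants and the instruction-length helper are placeholders and do not reflect QEMU's actual translator interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define TARGET_PAGE_SIZE 4096ULL
#define TARGET_PAGE_MASK (~(TARGET_PAGE_SIZE - 1))

/* Hypothetical helper returning the length of the guest instruction at pc. */
unsigned decode_insn_len(uint64_t pc);

/* Never let a translation block read guest code beyond the page its first
 * instruction starts on, so code-page faults occur exactly where the
 * hardware raised them during recording. */
static bool tb_may_include(uint64_t tb_start_pc, uint64_t pc)
{
    uint64_t page = tb_start_pc & TARGET_PAGE_MASK;
    uint64_t end  = pc + decode_insn_len(pc);   /* first byte after the insn */

    /* Stop translating once the next instruction would touch another page. */
    return ((end - 1) & TARGET_PAGE_MASK) == page;
}
```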

Resume Flag Bit 16 in the EFLAGS register is the so-called resume flag (RF), which controls whether the CPU stops at an instruction breakpoint [125]. The flag is intended to prevent recurrence of a breakpoint when a debugger continues execution. The CPU clears the RF flag after every instruction and sets it by pushing a corresponding EFLAGS value onto the stack when calling an event handler such as the breakpoint handler. This way, the set resume flag is popped from the stack on return. For interrupt handlers, the CPU sets the resume flag only if the interrupt arrives after any but the last iteration of a repeated (REP) string instruction.

QEMU does not faithfully implement the resume flag, which leads to divergences in memory when the flag is pushed onto the stack. Furthermore, we observed that the physical CPU did not show fully deterministic RF states for interrupts during the last iteration of REP-prefixed instructions. We therefore replay the resume flag from the landmark in case of interrupts and replicate the behavior described in the architecture manual in all other cases.

Floating Point Unit (FPU) Besides the already mentioned reasons for diverging memory images, we found that many differences later in the system boot phase (i.e., starting of user-mode services) stemmed from an incomplete FPU implementation. This became apparent whenever the state of the FPU was written to memory using the FSAVE or XSAVE instructions, for instance on a context switch.

Deficiencies included:

• Missing reflection of FPU exceptions in the SSE status register (MXCSR).

• Missing implementation of the last FPU instruction pointer (FIP), the last FPU data pointer (FDP), and the last FPU instruction opcode (FOP) registers.

• Missing initialization of reserved words in the x87 FPU state and XSAVE areas.

We fixed these issues by extending QEMU and do not have to record any supplemental information.

8.3 Conclusion

Bootstrapping simulations based on periodic checkpoints alone does not reproduce the exact execution of the hardware-assisted virtualization. Instead, this requires recording and replay of non-deterministic events. In SimuBoost, we implemented a heterogeneous deterministic replay that collects events during hardware-assisted execution in KVM and precisely replays these events in QEMU’s binary translator for simulation. To guarantee identical runs, we have to capture at least nine types of events, six synchronous (e.g., CPUID, RDTSC, IN) and three asynchronous (INT, SMI, Write DMA). Although strictly necessary only for asynchronous events, we also capture landmarks for synchronous events, which greatly simplifies debugging. The landmarks are built around the retired instruction count and the RCX register to differentiate individual iterations of repeated instructions (REP). Since, however, the hardware performance counter for instruction counting is not fully reliable on x86, the replay matches supplementary CPU registers in a window around the alleged target instruction to recognize the correct injection time.

A particular challenge with heterogeneous deterministic replay is that, besides exact handling of non-deterministic events, the simulation needs to be refined to match the recorded hardware platform also in the execution of deterministic operations. To this end, we adapted QEMU's status flag computation, added memory write probing, and made many more changes.

Interestingly, we found MMU-induced non-determinism to be entirely ignored in research – in contrast to the commercial products from VMware that seem to handle it. Our results confirm that a software-based solution is feasible, but only with considerable run-time overhead. We thus support VMware's proposal for a corresponding hardware extension. As MMU-induced non-determinism only surfaces in conjunction with stale TLB entries, we regard recording and replaying it as optional in order to protect against malicious guests. We did not include it in our final prototype.

The evaluation of our implementation revealed that trapping RDTSC instructions is responsible for most of the run-time overhead. In consequence, benchmarks making frequent use of the timestamp counter suffer from a notable slowdown (up to 90% for apache). All other benchmarks are barely affected by recording and show run-time overheads below 1%. Compression with LZMA proved to be very effective for recording logs, reducing the necessary network bandwidth to only 3 MiB/s for the most demanding workload. However, the compression ratio strongly correlates with the share of DMA events in the log, which can result in higher bandwidth consumption during DMA-heavy phases – e.g., up to 16 MiB/s during the initialization phase of the kernel build benchmark. Nevertheless, even with parallel checkpoint distribution, Gigabit Ethernet generally provides enough bandwidth. Our replay solution thus fulfills the requirements of SimuBoost.

Chapter 9

Chapter 9

Evaluation

In the previous chapters, we have described the four building blocks of SimuBoost: (1) the performance model, (2) continuous checkpointing, (3) checkpoint distribution, and (4) heterogeneous deterministic replay. In this chapter, we bring all these components together and evaluate to what extent SimuBoost is able to accelerate functional full system simulation. In particular, the evaluation covers:

• Achievable parallel simulation time and speedup

• Scalability and efficiency with increasing number of simulation nodes

• Applicability of the performance model

We start with a detailed description of our evaluation setup in Section 9.1. In the following Section 9.2, we show that SimuBoost is able to drastically reduce the slowdown of functional full system simulation and that it is able to maintain this performance even with heavyweight instrumentation enabled. Section 9.3 demonstrates that SimuBoost delivers scalability beyond the limits of a single physical machine and that this is the basis for high acceleration. We further elaborate on the factors that determine the parallelization efficiency. In Section 9.4, we compare the predictions of the performance model with the actual results from our practical experiments and discuss conceptual weaknesses of the current model. We conclude the results in Section 9.5.

9.1 Evaluation Setup

As mentioned in § 7.3, our final prototype of SimuBoost does not yet integrate multicast distribution. Although this is a central component to perform immediate parallel simulation, we can do a representative evaluation of SimuBoost without the live distribution. For this purpose, we separate the checkpointing and recording phase from the parallel simulation and instead manually copy all data to the simulation nodes in between. Our evaluation thus misses potential effects of multicast transmission delays because all data is already present on the target nodes. The results without live distribution will nevertheless be very close to what can be expected for a complete prototype. In Chapters 7 and 8, we demonstrate that the compression pipeline is typically able to reduce the data volume so that it remains below the bandwidth limit of Gigabit Ethernet. That means that with actual distribution, the simulation would start with at most the delay necessary for compressing and transmitting the first checkpoint. This delay is around 2 s in our experiments. All following checkpoints usually remain below the bandwidth limit and would not cause significant further delays.

Accordingly, we split each experiment into three consecutive phases:

1. Execution in a hardware-assisted virtual machine with active continuous checkpointing (copy-on-write, pre-scan, sparse) and recording of non-deterministic events. All data is compressed live, hence incurring overhead comparable to that of multicast distribution. However, instead of sending the data into the network, we store everything on local storage.

2. Manual distribution of checkpoints and replay data to all systems in the simulation cluster (machine by machine) using rsync. Based on the aforementioned reasoning, we do not include this phase in the results. As stated in § 7.3, using a network file system to avoid the copy phase is not an option.

3. Parallel simulation using timed job submission to mimic the gradual availability of new checkpoints. We log the exact timing of checkpoints in the hardware-assisted run and then reproduce this very sequence (a sketch of this submission follows below).
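The following Python fragment sketches how such a timed submission can be realized; the names are hypothetical and not taken from our benchmark framework:

    import time

    def submit_timed(job_queue, checkpoint_log):
        """checkpoint_log: list of (interval_id, creation_time) pairs, with the
        creation time in seconds relative to the start of the recorded run."""
        start = time.monotonic()
        for interval_id, created_at in checkpoint_log:
            delay = created_at - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)           # wait until the interval "exists"
            job_queue.put(interval_id)      # interval may now be simulated

This reproduces the recorded production rate of checkpoints, so the parallel simulation sees the same gradual availability of intervals as it would with live multicast distribution.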

Figure 9.1 illustrates our main evaluation setup consisting of five SMP systems with 108 physical cores – i.e., nodes for simulation – in total (see Tables 9.2 and 9.1 for details on the hardware and software configurations). Whereas Systems 2 to 5 only perform simulations, System V/1 serves as host for both the hardware-assisted execution at the beginning and simulations later on. To coordinate the parallel simulations, we establish a global job queue that is shared between all machines over the network. We use Python’s capability for synchronized access to remote objects for this purpose. In this case, this is a regular Python multiprocessing queue hosted on System V/1. Since only a single process per machine dequeues jobs (i.e., five in our setup), contention on the queue is not an issue. For larger setups, a more sophisticated job assignment with a cluster scheduler such as SLURM [290] is certainly appropriate.
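As a concrete illustration of this setup, the following sketch shows how such a shared queue can be hosted with Python’s multiprocessing managers. The names, port, and authentication key are placeholders, not the actual configuration of our framework:

    from multiprocessing.managers import BaseManager
    import queue

    class QueueManager(BaseManager):
        pass

    def serve_queue(port=50000, authkey=b"simuboost"):
        """Run once on System V/1: host the global job queue."""
        jobs = queue.Queue()
        QueueManager.register("get_jobs", callable=lambda: jobs)
        manager = QueueManager(address=("", port), authkey=authkey)
        manager.get_server().serve_forever()

    def connect_queue(host, port=50000, authkey=b"simuboost"):
        """Run on every simulation host: obtain a proxy to the shared queue."""
        QueueManager.register("get_jobs")
        manager = QueueManager(address=(host, port), authkey=authkey)
        manager.connect()
        return manager.get_jobs()

A single dequeuing process per machine then calls connect_queue() and fetches interval IDs with get(), matching the setup described above in which contention on the queue is negligible.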

[Figure 9.1: Cluster topology; System V/1 (master, 2x Xeon Gold 6138, 40 cores) hosts the shared job queue and receives the timed job submission, while Systems 2 to 4 (2x Xeon E5-2630 v3, 16 cores each) and System 5 (1x Xeon Gold 6138, 20 cores) act as simulation slaves, all connected via 10 Gigabit Ethernet.]

Figure 9.1: The hardware-assisted virtual machine runs on System V/1. Afterward, the checkpoints and replay data are copied to all other machines. Then, the benchmark framework starts parallel simulations that are coordinated with a shared job queue. A timed submission of jobs mimics the gradual availability of new checkpoints.

To attain a thorough picture of SimuBoost’s speedup characteristics and to be able to verify the predictions of the performance model, we conduct experiments for all combinations of the following parameters:

Workload In the previous chapters, we have found that the run-time overhead for continuous checkpointing and recording of non-deterministic events heavily differs between various benchmarks. As the run-time overhead defines a threshold below which decreasing the interval length no longer increases but rather lowers the speedup, we can expect considerable differences in the maximum achievable speedup between workloads. We run applications from the same set of workloads as in the previous chapters.

Although we see truly interactive workloads such as user-driven desktop usage as an important workload category for operating system research, we do not explicitly include such a benchmark. In Chapter 6, we demonstrate that SimuBoost is able to maintain interactivity for real-world workloads with downtimes below 10 ms. The deterministic replay, in turn, injects all interactions into the simulation. Regarding interactive workloads, this is the decisive improvement of SimuBoost over conventional slow functional full system simulation, where it is impossible to faithfully capture and simulate realistic user behavior. As interactive workloads generally exhibit a disproportionate amount of idle phases, we include the idle workload for comparison. Depending on the applications run, a true interactive scenario will then be located between idle and one of the other workloads.

Number of Simulation Nodes Gradually increasing the number of nodes allows us to explore the scalability of SimuBoost. Although our simulation cluster provides a total of 216 logical cores due to hyperthreading, we only consider physical cores for simulation (i.e., 108). This is because preliminary experiments showed that running simulations on logical cores (i.e., two simulations per physical core) rarely increases performance. In fact, the opposite is often the case, which we attribute to exceeding memory bandwidth limits and mutual cache pollution. This confirms results from Wallace and Hazelwood [267] who also did not observe further gain from using hyperthreads. Accordingly, we repeat all parallel simulations with the following numbers of physical nodes: N ∈ {4, 8, 16, 24, 32, 48, 64, 80, 96, 108}. We use the same set of checkpoints and replay data for every configuration. In order to somewhat balance the load between the hosts, we assign each system a share of parallel jobs proportional to its number of physical cores.
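A proportional split of the parallel jobs can be computed in a few lines; the following helper is illustrative only (not the code of our benchmark environment):

    def job_shares(n_jobs, cores_per_host):
        """Distribute n_jobs across hosts proportionally to their physical cores."""
        total = sum(cores_per_host.values())
        shares = {h: (n_jobs * c) // total for h, c in cores_per_host.items()}
        # hand the jobs lost to integer rounding to the largest hosts first
        for host in sorted(cores_per_host, key=cores_per_host.get, reverse=True):
            if sum(shares.values()) == n_jobs:
                break
            shares[host] += 1
        return shares

    # Example with our cluster (cf. Table 9.2):
    # job_shares(24, {"V/1": 40, "2": 16, "3": 16, "4": 16, "5": 20})
    # -> {"V/1": 9, "2": 4, "3": 3, "4": 3, "5": 5}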

Interval Length In order to find the optimal interval length for a certain number of nodes and to verify the predictions of the performance model, we run each benchmark configuration with different interval lengths: L ∈ {100, 300, 500, 1000, 2000, 3000, 4000} ms. We take the one with the highest speedup.

Simulation Slowdown While the simulation slowdown already varies between workloads due to each workload’s individual instruction mix, we additionally run each simulation a second time with activated hooks for tracing memory writes (w), and a third time with tracing hooks for both memory reads and writes (r+w). This gives an impression of how speedup, scalability, and optimal interval length react to changes in simulation slowdown, as would be the case for different types of analyses connected to the simulation. Since collecting memory traces requires considerable computational resources (e.g., for compression) that we rather want to use for exploring larger simulation clusters, we do not perform actual tracing. Instead, we only activate the hooks. This adds a helper call to a tracing function for each memory access (depending on the tracing mode). Compared to actual tracing, the function only assembles the trace entry but leaves out submitting it. This slows down the simulation without taxing other CPU cores. We omit configurations with N ∈ {4, 8} for the tracing runs as the higher slowdown makes such small setups less interesting.

For a more genuine comparison, we run conventional serial simulations without tracing with an unmodified version of TCG. Besides the tracing hooks, this version also misses the instruction counting and the refinements discussed in § 8.2. These are all features that further slow down the simulation. Both tracing runs, in contrast, add the hooks to our refined version of TCG.

We determine for each combination of workload and tracing mode the best configuration of L and N by gradually increasing N until a further increase does not provide any significant improvement in parallel simulation time but only harms efficiency1. Based on our results, we found a threshold of 10 percentage points to be appropriate. If, for instance, selecting the next higher N only reduces the slowdown from 1.5x to 1.45x, we prefer the smaller setup with the higher slowdown but better efficiency.

1The ratio of the speedup over serial simulation to the number of simulation nodes.
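The selection procedure can be summarized in a few lines of Python; the sketch below is illustrative (hypothetical data layout, not our actual tooling) and assumes that results[(L, N)] holds the measured parallel simulation time and t_kvm the run time of the hardware-assisted reference:

    def best_configuration(results, t_kvm, node_steps, threshold=0.10):
        """Grow N and stop once the gain drops below 10 percentage points."""
        best = None                              # (slowdown, L, N) of chosen setup
        for n in node_steps:                     # e.g., [4, 8, 16, 24, 32, ...]
            # best interval length for this node count
            t_n, l_n = min((t, l) for (l, nn), t in results.items() if nn == n)
            slowdown = t_n / t_kvm
            if best is not None and best[0] - slowdown < threshold:
                break                            # e.g., 1.50x -> 1.45x: keep smaller N
            best = (slowdown, l_n, n)
        return best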

9.1.1 Hardware and Software Configuration

The machines in our simulation cluster are connected with a 10 Gigabit Ethernet network in order to reduce the time for the copy of checkpoints and replay data. Since the transfer time is not taken into account, choosing 10 Gigabit Ethernet over Gigabit Ethernet does not affect the results. In Chapter 8, we identified reading the timestamp counter (RDTSC) to be responsible for the majority of run-time overhead for event recording. When running multiple iterations of benchmarks that make heavy use of the timestamp counter such as apache, we first observed strong fluctuations in the run-time overhead between iterations. Looking at the recording logs revealed that the guest Linux kernel sporadically changes its clock source from the timestamp counter (TSC) to the high precision event timer (HPET) or to the ACPI power management timer (PMT). Both the HPET and the PMT, however, do not just trap into KVM, but also require a round trip through the user-mode device emulation in QEMU, multiplying the overhead for each clock reading. Since this behavior heavily distorts benchmark results, we disable the HPET via QEMU’s -no-hpet command-line argument. We further adjust the fixed ACPI description table (FADT) in QEMU’s ACPI implementation to not report the availability of an ACPI PMT. This can be done by setting the PM_TMR_BLK and PM_TMR_LEN fields to zero and is officially supported by ACPI [19, p. 130ff].
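For illustration, the guest can be started with the HPET disabled as sketched below; the disk image name is a placeholder, and removing the ACPI PMT additionally requires the described patch to QEMU’s FADT construction (not shown here):

    import subprocess

    qemu_cmd = [
        "qemu-system-x86_64",
        "-enable-kvm",
        "-cpu", "qemu64",
        "-smp", "1",
        "-m", "4096",
        "-no-hpet",                  # keep the guest away from the HPET clock source
        "-drive", "file=guest.qcow2,format=qcow2",
    ]
    # subprocess.run(qemu_cmd, check=True)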

Software Specification

Systems V/1 – 5
    Operating System     Ubuntu 16.04.6 LTS (64-bit)
    Kernel (System V/1)  Linux 4.3.0 (KVM modified for SimuBoost)
    Kernel (others)      Linux 4.4.0
    QEMU                 2.6.50 (modified for SimuBoost)
    Simutrace            3.4.1

Guest VM
    Operating System     Ubuntu 16.04.5 LTS (64-bit)
    Kernel               Linux 4.8.10†
    Phoronix Test Suite  5.2.1‡
    SPECjbb 2005 Java    JRE 1.5.0_22
    stress               1.0.4 (modified)

† KVM paravirtualization disabled to allow checkpoints to load in the binary translator.
‡ See § 9.1.2 for the selected benchmarks and their respective versions. Configured to run one iteration only: export FORCE_TIMES_TO_RUN=1.

Table 9.1: Overview of Software Configuration

Component            Model & Specification

System V/1           Primary evaluation system; used for all benchmarks in Chapters 6 to 8. Serves as host for the hardware-assisted virtualization as well as for parallel simulations.
    CPUs             2x Intel Xeon Gold 6138
    Cores            40 (80 logical) @ 2.00 GHz†
    Memory           128 GiB (8x16 GiB DDR4-2666)
    Board            Supermicro X11DPi-NT
    System Disk      Samsung SSD 960 EVO 256 GB
    Data Disks       2x Samsung SSD 850 EVO 1 TB

Systems 2 – 4        Used for parallel simulations only.
    CPUs             2x Intel Xeon E5-2630 v3
    Cores            16 (32 logical) @ 2.40 GHz†
    Memory           64 GiB (8x8 GiB DDR4-2133)
    Board            Supermicro X10DRi
    System Disk      Crucial SSD MX100 256 GB
    Data Disk        Samsung SSD 850 EVO 1 TB

System 5             Used for parallel simulations and runs the time-consuming serial simulations for comparison with SimuBoost.
    CPU              Intel Xeon Gold 6138
    Cores            20 (40 logical) @ 2.00 GHz†
    Memory           64 GiB (4x16 GiB DDR4-2666)
    Board            Supermicro X11DPi-NT
    System Disk      Samsung SSD 960 EVO 256 GB

Guest VM             Benchmark virtual machine; executes all workloads. We checkpoint and replay this VM.
    CPU              P6 Pentium M compatible (qemu64)
    Cores            1 (1 logical)
    Ext. Features    64-bit long mode, NX, SSE3, CX16, MTRR, CLFLUSH, huge pages
    Memory           4 GiB‡
    Board            Intel 440FX+PIIX3 chipset
    System Disk      16 GB virtual disk file (qcow2)

† TurboBoost disabled for consistent performance between runs with few and many busy cores.
‡ High precision event timer (HPET) and ACPI power management timer (PMT) disabled to force the more lightweight timestamp counter (TSC) as clock source.

Table 9.2: Overview of Hardware Configuration

9.1.2 Benchmark Scenarios

Throughout this thesis, we primarily use benchmarks from the Phoronix Test Suite [11], a popular automated open-source benchmark framework. This ensures reproducibility and embeds each application into the same environment. We have selected the benchmarks so that they execute a fixed amount of work (except idle and SPECjbb2005). This way we can determine the slowdown compared to hardware-assisted execution simply by measuring the run time. The version number identifies the exact Phoronix test profile.

pts/apache (v1.7.1) The apache benchmark starts a local instance of the Apache web server (version 2.4.7) and runs the ab tool to create a total of 1M requests, with 100 requests being carried out concurrently. The server responds with a static web page consisting of two files: a 3064-byte HTML page that embeds a 4758-byte PNG logo of the Phoronix Test Suite.

pts/build-linux-kernel (v1.6.0) The kernel build benchmark compiles and links the Linux kernel (version 4.3 – MD5: 737570130236c2256cfa67920fa721cf) from kernel.org with make defconfig. The test starts by first extracting the source code from the compressed archive. Since we run a single-core VM only, the benchmark compiles one source file at a time (-j1).

pts/encode-mp3 (v1.7.1) This workload executes LAME 3.100 to encode a WAV rip of the 5-minute song Demon Seed by Nine Inch Nails to MP3 (-h option for high quality). The input file is 78.1 MiB in size. The encoded output is directly piped to /dev/null, thus creating no disk writes.

pts/gnupg (v2.4.0) The benchmark runs the GnuPG 1.4.22 cryptographic software suite to perform a symmetric AES128 encryption of a 2 GiB file, containing zeros only. The test dynamically creates the input file from /dev/zero using dd. As with encode-mp3, the output is directly dropped by sending it to /dev/null.

pts/phpbench (v1.1.5) The PHP benchmark measures the performance of the PHP interpreter by running 56 test cases which stress input parsing (e.g., removal of comments), language features (e.g., loops, assignments, typecasting), and various built-in functions such as md5() and crc32(). We increase the benchmark’s run time by executing 10M iterations instead of the default 1M.

pts/postmark (v1.1.1) NetApp PostMark 1.5.1 simulates file operations similar to what a web or mail server would do. It first creates 500 files with random contents and size between 5 KiB and 512 KiB. Then, it executes 250K random transactions (e.g., read, append, delete). PostMark closes with removing all files.

pts/pybench (v1.1.2) The pybench workload performs 20 rounds of the Python Benchmark Suite 2.0 to tax the Python interpreter similarly to phpbench.

pts/povray (v1.2.1) A particularly challenging workload for simulation is POV-Ray 3.7.0.7. The ray tracer renders a 512x512 image of a standard 3D scene coming with the software. The benchmark heavily relies on FPU arithmetic, which in software is much slower than integer arithmetic. The povray benchmark thus experiences a tremendous slowdown, even with plain serial simulation. Hence, we skip the N ∈ {4, 8} configurations.

pts/sqlite (v2.0.1) The sqlite benchmark imitates a lightweight database workload that performs 7500 insertions into an SQLite 3.22 database. The final table has the format [SMALLINT, TIMESTAMP, VARCHAR(4), VARCHAR(16)], thereby consuming around 200 KiB at 28 bytes per row.

idle The idle benchmark performs a 1-minute sleep. It is the optimal case in terms of checkpointing and recording overhead and allows the replay to skip most of its run time due to HLT instructions being ignored in replays. However, because its run time is predefined, the serial simulation also runs for only one minute, despite having simulated far fewer idle cycles.

stress-ram (only used in previous chapters) This benchmark serves as the worst case for the checkpointing component (not the compression pipeline) by allocating a 3 GiB buffer and touching it as fast as possible. It first runs over all pages and sets the first byte to ’Z’. It then iterates a second time over the buffer, reading the character back in. We modified the underlying stress 1.0.4 application to allow the specification of a number of iterations rather than a target run time. We use 3000 iterations.

SPECjbb2005 This popular benchmark [14] evaluates the performance of server-side Java by emulating a client/server application that exercises the JVM, JIT compiler, garbage collection, and the operating system. The workload includes XML processing and large decimal computations. Just like the idle benchmark, SPECjbb runs for a predefined amount of time (i.e., 30 mins), irrespective of the virtualization technique. Instead, SPECjbb returns the number of transactions completed in this time frame. As we base our final evaluation on the run time of the workloads, we omit experiments with SPECjbb in this chapter.

9.2 Speedup

To get an impression of the speedup that we gain from SimuBoost, we compare the parallel simulation with a conventional serial simulation. In both cases, we give the slowdown relative to fast hardware-assisted virtualization with a regular unmodified KVM.

[Figure 9.2: Bar chart of slowdowns relative to hardware-assisted virtualization per workload, for serial simulation and serial replay, each without tracing, with tracing of writes (w), and with tracing of reads and writes (r+w).]

Figure 9.2: Without tracing, conventional simulation is typically faster as it does not include instruction counting and the refinements from § 8.2. With all refinements included and tracing enabled, replay is faster. The replay for the idle benchmark skips all halt cycles. The figure shows the replay of the best configuration (see Table B.6).

When comparing SimuBoost with conventional serial simulation, we have to bear in mind that a heterogeneous replay differs from a conventional simulation and that these differences may lead to varying slowdowns for the same combination of workload and degree of instrumentation. This is important as SimuBoost parallelizes a replay, not a conventional serial simulation, and we are using the slowdown as the basis for our assessment. Before looking at the speedup of parallel simulation, we therefore first quantify the difference between serial replay2 and conventional serial simulation.

Detailed numbers for most results in this and the following section can be found in Table B.6. The table also contains additional information (e.g., N and L) on the selected configurations.

In Figure 9.2, we see that for both tracing configurations the slowdown of conventional simulation is between 8% and 25% higher than in the case of the serial replay. This stems from the fact that our conventional simulation does not employ a timing model, whereas the replay reproduces the recorded timing. With the overhead for instrumentation, it takes more time to simulate instructions. However, the virtual timers in the conventional simulation keep firing at the same rate (relative to the wall-clock time), causing more kernel-mode code to be executed over the duration of the workload. Since all workloads process a fixed amount of work, the conventional simulation thus executes more instructions until the workload completes. A higher instrumentation overhead amplifies this effect.

2Replay without parallelization under the premise that event logs already exist. We sum the replay times of the 4 s intervals to get a total time even with potential replay failures.
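A small illustrative calculation (with assumed numbers, not values measured in our experiments) makes this effect tangible: with a periodic guest timer of p = 4 ms and a workload whose hardware-assisted run time is T = 60 s, the virtualized execution services about

    T / p = 60 s / 4 ms = 15,000 timer interrupts,

whereas a conventional simulation with slowdown s = 30 stretches the same guest work over s · T = 1800 s of wall-clock time and thus handles roughly

    s · T / p = 1800 s / 4 ms = 450,000 timer interrupts

on top of the workload. Any instrumentation that raises s raises this count further, which is exactly the additional kernel-mode work the timing-faithful replay avoids.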

[Figure 9.3: Bar chart of slowdowns relative to hardware-assisted virtualization per workload, for SimuBoost (parallel simulation) and serial replay, each without tracing, with tracing of writes (w), and with tracing of reads and writes (r+w).]

Figure 9.3: Parallel simulation with SimuBoost is much faster than conventional serial simulation. Baseline is a run with hardware-assisted virtualization using an unmodified KVM.

In case of plain simulation without tracing, the replay is between 2% and 33% slower because, as mentioned previously, the conventional simulation does not include the instruction counting and the refined TCG in this configuration.

In the following, we use the serial replay as reference for comparison as this is the actual simulation that is parallelized. Moreover, due to its higher representativeness of a timing-realistic execution irrespective of the instrumentation slowdown, the replay is typically also more interesting.

Figure 9.3 contrasts the slowdown of the serial replay with the drastically reduced slowdown attainable with a parallel simulation. SimuBoost is able to fully simulate a kernel build in just 16% more time (N = 24, L = 2.0 s) than required for an execution with fast hardware-assisted virtualization. The baseline is again an unmodified KVM. Even the povray benchmark with its high slowdown completes in only 6% more time (N = 80, L = 0.5 s). By choosing different configurations for N and L (typically higher N and shorter L), SimuBoost is able to deliver the same performance even in the tracing scenarios.
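Note that the remaining slowdown (relative to the KVM baseline) and the speedup over the serial replay are two views of the same measurement: multiplying them recovers the implied serial replay slowdown. With the kernel build numbers above and the speedup of 20.68x reported in Figure 9.5 (both values are rounded, so this is only an approximate cross-check):

    20.68 · 1.16 ≈ 24.

The same relation is applied to povray in Section 9.3.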

Nevertheless, the high run-time overhead for checkpointing and recording for some workloads is reflected in a reduced speedup. This can be seen in the case of postmark, which possesses a 44% overhead in the best configuration (N = 24, L = 0.5 s) and therefore still shows a 63% slowdown with parallel simulation. The apache benchmark is even worse with a 102% overhead during the hardware-assisted virtualization and a final slowdown of 2.19x (N = 16, L = 1.0 s). From the evaluations in Chapters 6 and 8, we know that for postmark the run-time overhead is primarily caused by the checkpointing, whereas for apache the recording is the major cost factor.

These results underline the importance of fast checkpointing and recording technologies in this context. Furthermore, they illustrate that a higher speedup cannot be achieved by just using more nodes for simulation. In order to supply these additional nodes with work, we have to shorten the interval length. This increases the production rate of checkpoints but at the same time entails a higher run-time overhead. As described in Chapter 5 and as visible in the results for postmark and apache, this eventually leads to a higher parallel simulation time and less speedup. Only if the slowdown of the serial simulation increases, for example, due to higher instrumentation costs, is the rise in run-time overhead for more frequent checkpointing compensated.

The results for sqlite and idle represent special cases. As described in § 6.1.3, the sqlite benchmark completes faster with activated checkpointing due to the deactivation of transparent huge page support. This results in a slight speedup for parallel simulation compared to a run with an unmodified KVM (i.e., no checkpointing and active transparent huge page support). The idle benchmark, in turn, is slower with parallel simulation compared to a serial replay because for the serial replay we assume that event logs are already present. That allows the serial replay to skip the entire idle phase, completing the 60 s benchmark in just 2 s. In the case of the parallel simulation with SimuBoost, intervals are simulated individually and gradually as they are produced. This creates overhead for checkpoint loading and prevents the parallel simulation from skipping the entire idle phase at once. Instead, each simulation only skips the idle cycles covered by the corresponding interval. We therefore effectively get a slowdown compared to the serial replay. Nevertheless, the idle benchmark still completes 1 s faster than the reference execution with an unmodified KVM. This is due to saving the idle cycles in the last simulated interval.

9.3 Scalability and Efficiency

Depending on the characteristics of the scenario such as the workload’s run time, the overhead for checkpointing and recording, and the slowdown for serial replay and instrumentation, SimuBoost is able to efficiently utilize a varying number of simulation nodes. Figure 9.4 depicts the attainable speedup over serial replay using an increasing number of simulation nodes. For each combination of workload, tracing mode, and N, the figure shows the configuration of L with the maximum speedup. At first sight, the horizontal lines in the figure suggest low scalability. In fact, however, when the speedup stops rising and begins to progress horizontally, SimuBoost already removed most of the slowdown of the serial replay. In consequence, the breaks are located near the serial replay slowdown of the corresponding combination of workload and tracing mode.

[Figure 9.4: Speedup over serial replay vs. number of nodes N per workload, for panels (a) No Tracing, (b) Tracing (w), and (c) Tracing (r+w); the diagonal marks perfect efficiency (100%).]

Figure 9.4: Breaks in scalability are located close to the serial replay slowdown of the respective workload and tracing mode combination (horizontal lines). The results for povray demonstrate that SimuBoost is able to efficiently scale beyond the limits of a single physical machine. The parallelization efficiency, however, differs between scenarios. N is the configured number of nodes, not the number of actually busy nodes (see below).

The difference is what we observe as the remaining slowdown described in the last section. For instance, with tracing of writes enabled (w), we see a speedup of around 88x for povray. The remaining slowdown of the respective parallel simulation is 6% compared to a run with an unmodified KVM (Figure 9.3), which gives us the original slowdown for serial replay (88 · 1.06 ≈ 93.3).

Since the FPU-heavy computations performed by povray exhibit an exceedingly high simulation slowdown, SimuBoost utilizes more nodes for povray than for any other workload. With full memory tracing enabled, all 108 cores of our cluster are busy3. This demonstrates that SimuBoost can effectively scale beyond the capacity of a single physical machine. A closely related metric in this context is the parallelization efficiency with which additional nodes can be leveraged for parallel simulation. With perfect efficiency, doubling the number of nodes results in double the speedup – at least up to the point where we reach the horizontal line and no further speedup can be made. Especially in Figure 9.4c, we can see that the parallelization of, for instance, pybench exhibits much lower efficiency than, for example, in the case of povray. The degree of efficiency depends on multiple factors that can be best understood graphically.

3The increase of the remaining slowdown from 6% to 30% (Figure 9.3) suggests that even more cores are required for maximum acceleration.

[Figure 9.5: Queue length and number of parallel simulations over time for the kernel build without tracing; panels: L = 4.0 s, N = 4 (speedup 3.97x, efficiency 99%); L = 4.0 s, N = 8 (7.75x, 97%); L = 4.0 s, N = 16 (14.54x, 91%); L = 2.0 s, N = 24 (20.68x, 86%); L = 2.0 s, N = 32 (20.99x, 66%).]

Figure 9.5: With too few nodes (N < 24), intervals need to be queued. With N = 24, production and consumption are in balance and utilization is still high. More nodes provide little benefit but notably reduce parallelization efficiency by just idling. The red line denotes the end of the virtualization run (i.e., end of production). The gray rectangle shows the simulation of the last produced interval. The dotted line marks the number of configured nodes N.

[Figure 9.6: Queue length and number of parallel simulations over time for gnupg without tracing; panels: L = 300 ms, N = 8 (speedup 5.09x, efficiency 64%) and L = 300 ms, N = 16 (speedup 6.17x, efficiency 39%); the first seconds form a phase of intervals with above-average simulation speed.]

Figure 9.6: The intervals with above average simulation speed create a phase of under-utilization that reduces efficiency. See Figure 9.5 for a description of the graphical elements.

Figure 9.5 illustrates the node utilization and the length of the queue of outstanding intervals (i.e., checkpoints created but not yet simulated) for a set of parallel kernel build simulations with increasing number of nodes. With only four nodes, we measure the lowest parallel simulation time with an interval length of 4 s. As the four nodes cannot keep up with the production rate, new checkpoints quickly accumulate and the queue length rises. It takes almost an hour until all intervals have been simulated. However, during this time all nodes are permanently busy. Increasing the number of nodes shortens the overall parallel simulation time and moves the end of the simulation closer to the end of the virtualization. The idle phases typical for SimuBoost can be clearly seen at the beginning and the end.

With L = 2.0 s and N = 24, the production and consumption rates are in perfect balance, which is reflected in the periodicity of the queue length. Except for the idle phases at the beginning and the end, all nodes are busy4. Further increasing the number of nodes to N = 32 provides only a marginal improvement in overall simulation time. As the consumption rate now exceeds the production rate, there is always a free node available and no interval ever gets queued. At this point, some of the simulation nodes remain permanently idle and the parallelization efficiency drops from 86% to 66%.

4Small fluctuations are caused by the job submission, which our benchmark environment does not count as busy phase. Furthermore, it does not account for the simulation time of intervals that failed to replay.
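For reference, these efficiency values follow directly from the definition given in Section 9.1, i.e., the speedup over the serial replay divided by the number of nodes; with the kernel build numbers from Figure 9.5:

    E = S / N:   20.68 / 24 ≈ 0.86   and   20.99 / 32 ≈ 0.66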

[Figure 9.7: Share of initialization time in the total busy time [%] per workload, for no tracing, tracing (w), and tracing (r+w).]

Figure 9.7: The initialization time is typically less than 10% of the busy time.

Idle Time We can conclude that everything creating space between the dotted line (the number of configured nodes) and the blue line (the number of busy nodes) entails a reduction in efficiency by causing idle time. Besides a general oversupply of nodes, this can also be the result of changing program behavior. We can observe this in particular with gnupg (see Figures 9.6 and B.7). The majority of simulations in the first 15 s complete on average in 1.2 s; all following simulations, in contrast, require on average 3.0 s. Whereas eight nodes are more than sufficient for achieving optimal speedup in the first phase, the cluster is clearly overloaded in the second phase (the queue length rises). The growing backlog forces us to use more than eight nodes; otherwise, we do not attain the overall minimum parallel simulation time. However, this creates (even more) idle time in the first program phase and notably reduces the efficiency.

Another factor lowering the efficiency is the share of busy time that is not spent on actual simulation. This factor consists of three components:

Initialization Time This is the time consumed for the initialization of the individual parallel simulations. Figure 9.7 shows that for our set of workloads (except sqlite and idle), the initialization accounts for less than 10% of the total busy time. The share shrinks with increasing simulation slowdown5, but grows with the number of intervals (i.e., shorter intervals) and per-interval initialization cost, the latter being dominated by the checkpoint loading time (see § 6.3).

Per-Interval Simulation Overhead Figure 9.8a illustrates that with a fixed number of nodes (N = 16), the sum of the simulation time over all intervals increases with smaller interval lengths. This indicates that besides the explicit initialization, we have to pay further per-interval costs.

5Hence the high values for sqlite (2x – 3.5x) and idle (<1x), where the slowdowns are very low.

[Figure 9.8: Normalized slowdown [%] per workload; (a) over the interval length L for tracing (r+w) with N = 16, (b) over the number of busy nodes for tracing (r+w) with L = 100 ms.]

Figure 9.8: (a) Short intervals lead to higher accumulated simulation time. The graph shows the slowdown normalized to a simulation with 4 s intervals. (b) With increasing number of utilized nodes, simulations run slower due to the pressure on shared computing resources such as the memory bus and caches.

We assume that most of this overhead stems from the dynamic binary translation, that is, the actual code generation. Since each individual simulation starts with an empty code cache, shorter intervals increase the overall translation effort.

SMP Load Running a multitude of simulations in parallel on the same SMP system negatively affects the simulation speed. This relationship can be seen in Figure 9.8b. Especially povray, which is able to fully utilize our simulation cluster, suffers from this effect when going from 96 to 108 busy nodes. We attribute this to reaching limits in the memory bandwidth and the cache capacity. This is not surprising, considering that SimuBoost runs 40 simulations (i.e., virtual machines) in this configuration in parallel on our primary system alone.

Figure 9.9 depicts the efficiency for the various benchmark scenarios based on the median number of actually busy nodes, not the configured number of (potentially idle) nodes. This way, we remove surplus nodes and get an indication of the true efficiency values for parallelization despite the limited set of node numbers examined. Due to the described idle phase in gnupg, the efficiency is comparably low with only 62%. The efficiency for the idle benchmark is much worse (4% to 6%) as the speedup (the basis for calculating the efficiency) is effectively a slowdown compared to the serial replay. We can thus expect interactive workloads (e.g., user-driven desktop usage) to generally exhibit a rather low efficiency, depending on the share of idle cycles. The other measured scenarios have a parallelization efficiency between 72% (sqlite) and 95% (povray).

[Figure 9.9: Parallelization efficiency E [%] per workload, based on the median number of busy nodes, for no tracing, tracing (w), and tracing (r+w).]

Figure 9.9: Measured parallelization efficiency based on the number of actually busy nodes is between 62% and 95%, except for idle (4% to 6%).

From the findings of this section, we can infer the following characteristics:

• Idle phases in the cluster (e.g., due to changing program behavior) have the most negative effect on the efficiency.

• Reducing the interval length is likely to decrease efficiency, especially for workloads with low code and memory access locality (high translation costs and checkpoint loading time).

• A high serial simulation slowdown originating from the instruction mix or instrumentation is likely to improve efficiency.

Remember that, in accordance with our performance model, the given results for the efficiency relate to the parallel simulation only. They can be used to gauge a reasonable size for the simulation cluster. As outlined in Chapters 6 and 7, additional CPU cores are needed for the compression of the checkpoints and the event log. For the measured optimal interval lengths and the workloads examined in our final evaluation, we have to provide between two and four additional cores. Furthermore, we require one core to run the workload with hardware-assisted virtualization. It is, moreover, likely that utilizing spare cores on the virtualization host for running parallel simulations induces a probe effect in the workload being recorded – similar to the reduction in parallelization efficiency we observe due to SMP load. Depending on the requirements of the specific research question, it may thus be advisable to use a physically separate host for the hardware-assisted virtualization (possibly wasting idle cores). These factors must be taken into account when estimating SimuBoost’s overall resource requirements.
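As an illustrative tally (combining the figures above; the concrete configuration is only an example): the kernel build with N = 24 simulation nodes, the upper bound of four compression cores, and one core for the hardware-assisted execution requires

    24 + 4 + 1 = 29 cores

in total, before deciding whether the virtualization host should additionally contribute simulation nodes.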

9.4 Performance Model

In Chapter 5, we have introduced a first formalization of the parallelization process in order to predict the parallel simulation time (Tps) and find the optimal interval length (Lopt) for a particular scenario.

We begin the evaluation of the performance model by comparing the parallel simulation time that we measured in our experiments with the predictions of the model. We use the same measured best configurations of N and L as in the previous sections. We calculate Tps with Equation 5.8 of the optimal model in case the configured number of nodes N is greater than N(L), which is the model’s estimated optimal number of nodes for the specified interval length (Equation 5.14)6. This describes a scenario in which our simulation cluster is large enough to provide optimal speedup for the given combination of workload and tracing mode. Therefore, the equations of the optimal setup apply. Note that we insert the interval length actually used in the practical experiments, not the model’s estimated optimal interval length Lopt. This allows us to evaluate the accuracy of the Tps estimation separately from the Lopt prediction. If our simulation cluster does not provide sufficient parallelism, that is, N < N(L), we calculate Tps with Equation 5.25 of the constrained model.

We compute the model output using the precise overheads and times observed during the corresponding practical experiment. The model thus receives the input parameters that describe the performance characteristics in the measured optimal configuration. The results thereby represent the minimum error.

Figure 9.10a illustrates the divergence of the model’s estimated parallel simulation time from the measured one. We can see that for most benchmark scenarios the performance model delivers a good estimation, with errors ranging from -4.8% to +0.4%. In the majority of experiments, the model tends to underestimate the parallel simulation time. This is rooted in the fact that the model considers all intervals to be perfectly uniform. For example, it assumes the same simulation slowdown ssim over the whole duration of the workload. In practice, however, this is rarely the case and intervals do show varying characteristics (slowdown, checkpoint loading time, etc.). Since variations are less likely to level out toward the end of the parallel simulation, variations concerning the last intervals have a strong influence on the model’s estimation quality. In gnupg, for instance, the simulation of the last interval has a 50% higher slowdown than the average interval in the experiment. This delays the completion of the parallel simulation. The model, in contrast, assumes the same average simulation time for the last interval as for every other interval and consequently underestimates Tps.

6See Table B.7 for results. In most cases, the prediction is close to the observed number of busy nodes (one to two cores difference).

[Figure 9.10: (a) Tps prediction error [%] per workload for no tracing, tracing (w), and tracing (r+w); (b) observed vs. predicted optimal interval length Lopt per workload.]

Figure 9.10: (a) The performance model tends to underestimate the parallel simulation time (Tps). Uses of the constrained model (i.e., N < Nopt) are marked with the red dotted border. (b) Since the model does not incorporate the increasing costs for reducing the interval length, it typically recommends shorter intervals.

In our experiments, we also observe the opposite case, that is, the last interval completes earlier than expected by the model. That means the model’s assumption that the parallel simulation always finishes with the completion of the last produced interval is not true (see kernel build in Figure 9.5). Nevertheless, this only marginally manifests as prediction error in the case of the kernel build because the overall completion time of the simulation barely changes (see Figure B.6).

Next, we review the accuracy of the model’s estimation of the optimal interval length. Figure 9.10b compares the predicted Lopt with the actually measured one. In many cases, the model recommends a smaller interval length (e.g., kernel build, L = 2 s vs. Lopt = 0.5 s). This reveals a decisive weakness of the model. It ignores how changing the parallelization degree and interval length influences the input parameters that describe overheads and delays:

• svm: Checkpointing and recording overhead

• ssim: Simulation slowdown

• tc: Checkpointing downtime

• ts: Simulation start-up time

Following the model’s recommendation7 in case of the kernel build (no tracing) leads in practice to a 12% higher simulation time with the same number of nodes deployed (which equals the model’s Nopt). The simulation of sqlite even takes 73% longer in the predicted optimal configuration.

7We use the results from the experiments with the closest matching configuration.

[Figure 9.11: Observed and predicted Tps over the number of nodes N for the kernel build with tracing (r+w) and interval lengths of 100 ms, 1 s, and 4 s; (a) model parameters taken from the run with L = 1.0 s and N = 16, (b) model parameters taken from the run with L = 0.1 s and N = 16.]

Figure 9.11: (a) The model noticeably underestimates the parallel simulation time for L = 100 ms if we use the overheads and delays measured in a run with 1 s intervals. (b) The model, in turn, overestimates Tps for L ∈ {1, 4} s if we use the parameters measured with 100 ms intervals.

This is because the model does not incorporate the rising overhead (e.g., translation costs) when shortening the interval length. In consequence, the supposedly optimal interval length can, in fact, be too short. Alternatively, more nodes are needed than estimated to attain a comparable performance (i.e., Nopt is too small).

In practice, this shortcoming makes it difficult to properly characterize a workload and find representative values for the model parameters. The model only delivers good estimations for interval lengths that exhibit similar overheads and delays as the interval length used to determine the parameters. In Figure 9.11a, we measure the parameters for a kernel build with 1 s intervals and 16 nodes and subsequently use the observed values to estimate the parallel simulation time for other configurations. While the model closely predicts Tps with varying number of nodes and L ∈ {1, 4} s, it fails to do so for L = 100 ms, where the overheads are much larger. Conversely, if we determine the parameters with 100 ms intervals, the estimation quality for L ∈ {1, 4} s is unsatisfying, while the results for L = 100 ms are much better.

9.5 Conclusion

In this chapter, we demonstrated that SimuBoost is able to significantly reduce the slowdown of functional full system simulation. For most of the workloads, only a 10% to 30% slowdown compared to fast hardware-assisted virtualization remains. SimuBoost efficiently scales beyond the limits of a single physical machine, thereby also presenting an alternative to user-mode tools such as SuperPin that are restricted in parallelization due to the use of forking instead of checkpointing. We identified various factors that affect the efficiency of the parallelization. Idle times in the cluster that stem from phase behavior in the workload have the most impact. In addition, reducing the interval length generally entails a decrease in efficiency, which is especially notable for workloads with low code and memory access locality. We can confirm previous research in that the concept of partitioning and parallelization of simulation time is an effective and efficient acceleration approach. SimuBoost is the first project to apply this concept to functional full system simulation.

To estimate the parallel simulation time and determine the optimal interval length for a given scenario, we presented a performance model. The evaluation shows that while the model is generally able to provide estimations of the simulation time with low error (-4.5% to +0.4%), this heavily depends on the interval length used to measure the input parameters that describe overheads and delays (e.g., the simulation slowdown). In our benchmarks, we could only attain satisfying prediction quality for interval lengths that exhibit similar input parameters. Consequently, getting accurate estimations for the optimal case requires that the input parameters have been retrieved with this very optimal interval length. As the model infers the optimal interval length from its estimation of the parallel simulation time, this creates a dependency circle and, in turn, results in low prediction quality for the optimal interval length.

Chapter 10

Conclusion

Functional full system simulation is a powerful tool for gathering detailed information on the run-time behavior of applications and the operating system. A major downside of this technology is the immense slowdown that is bound to the software-driven execution model. Workloads that complete in minutes natively can quickly run for hours or even days with functional full system simulation, especially with heavyweight instrumentation. A slowdown between one and two orders of magnitude is common. This not only makes experiments time-consuming, but also prevents natural interaction with the examined workload and distorts timing whenever external input is involved (e.g., I/O).

SimuBoost leverages the concept of partitioning and parallelization of simulation time to drastically reduce the slowdown. The workload is first executed in a fast hardware-assisted virtual machine (VM). Periodic checkpoints serve as starting points for parallel simulations with one job per interval. Heterogeneous deterministic replay guarantees that the simulations repeat the exact same execution as in the hardware-assisted VM, including interactions and recorded timing.

With continuous checkpointing being a key technology for SimuBoost, we evaluated various checkpointing techniques by extending QEMU/KVM and found a combination of asynchronous (i.e., copy-on-write) and incremental checkpointing to provide the best performance. We further devised a fast method for tracking modified pages, which we call pre-scan. Pre-scan asynchronously scans the EPT for set status bits before the downtime so as to reduce the time necessary for a consistent scan while the VM is suspended. This way, SimuBoost attains an average downtime of only 8 ms and preserves interactivity during checkpointing. The run-time overhead varies with the workload and interval length, ranging from less than 1% up to 100% in our experiments.

In order to distribute the checkpoints in a network of simulation nodes, we propose the use of a point-to-multipoint protocol such as IP multicast. This mitigates the bottleneck at the network interface of the checkpointing host.

Since the data volume generated during checkpointing can quickly exceed the bandwidth limit of contemporary networks, we demonstrated how a combination of data deduplication, delta compression, generic compression, and special-purpose compression is able to considerably reduce the size of continuous checkpoints. With a compression ratio of up to 39:1, SimuBoost can be used with Gigabit Ethernet.

The implementation of the heterogeneous deterministic replay turned out to be technically challenging. Eventually getting the intervals to faithfully replay in the simulation required various changes in the binary translation of QEMU. This also includes numerous adaptations to make the simulation work like the recorded hardware counterpart (e.g., status flag computation and instruction counting). Our prototype is capable of successfully replaying on average over 97% of intervals. The additional run-time overhead for recording is typically below 1%. However, we found that the trap for capturing timestamp counter readings can impose a considerable overhead if the workload performs excessive time keeping (e.g., apache: 90% run-time overhead).

Our final evaluation of SimuBoost confirms previous research that has shown significant speedup potential for the partitioning and parallelization of simulation time. SimuBoost, for the first time, demonstrates that the concept can also be very effectively used to accelerate continuous functional full system simulation. For most workloads, we measured a remaining slowdown for simulation over hardware-assisted virtualization of less than 30%, irrespective of the degree of instrumentation (tracing memory reads and writes). Only benchmarks with considerable run-time overhead during the checkpointing and recording phase show higher remaining slowdowns (apache: 120%, postmark: 63% – 81%).

10.1 Limitations and Future Work

While SimuBoost overcomes major limitations of previous work, some aspects have not yet been addressed or show room for further improvement:

Three-Stage SimuBoost

SimuBoost’s property to accurately replay the execution from the hardware-assisted virtualization is one of its greatest strengths because it (re-)produces realistic timing in the simulation. This makes measurements more representative of real-world executions without the necessity of complex CPU and device timing models. However, this is only true as long as the overhead for checkpointing and recording remains low. Otherwise, the overhead can become visible in distorted timing and create a noticeable probe effect, which, for example, manifests itself in shorter effective time slices for guest processes due to intermittent VM exits.

[Figure 10.1: Three-stage design; Node V1 runs the hardware-assisted virtualization and records non-deterministic events only; Node V2 concurrently replays the execution in a second hardware-assisted VM and takes the periodic checkpoints for intervals i[1] … i[n]; the parallel simulations then replay the same events.]

Figure 10.1: The first stage records non-deterministic events only. The execution is concurrently replayed in a second VM that periodically takes the checkpoints. The parallel simulations receive the same events.

In Chapter 6, we demonstrated that, despite advanced checkpointing techniques, heavyweight workloads such as postmark and SPECjbb may still experience high run-time overheads. In Chapter 8, we moreover showed that in the case of frequent timestamp reads, recording of non-deterministic events can incur a notable further overhead. Nevertheless, the majority of the overhead stems from the checkpointing, and the overhead for recording remains below 1% for most workloads. It would thus be desirable to decouple the checkpointing from the recording so that the overhead of the checkpointing is not reflected in the recorded execution. This, in turn, would entirely eliminate any probe effect related to checkpointing from the simulations and produce more stable and realistic results.

We can accomplish this by splitting the checkpointing and recording into dedicated stages (see Figure 10.1). In the first stage, we only record non-deterministic events. We then feed these events to a concurrently running second hardware-assisted virtual machine. This VM consequently reproduces the low-overhead execution from stage one. Since the replay is not influenced by additional overhead (just like in the case of the much slower simulation), we can take checkpoints in the second VM without inflicting a probe effect.

It is, however, an open question to what extent the parallelization is negatively affected by this. Although the second VM may start without delay and immediately receive events1, we can expect the virtual machine to run somewhat slower than in the original design. This is because we probably see additional overhead for asynchronous event injection2, which on x86 requires multiple VM exits [87]. The increased run-time overhead during checkpointing, in turn, possibly leads to slower parallelization and increased parallel simulation time. We already started exploring the advantages and disadvantages of the three-stage design in an ongoing bachelor’s thesis.

1We can pause the VM as soon as the event queues run empty.
2Although the instruction counter can be configured to generate an interrupt, the stop does not occur instantaneously and we may miss the desired value. We thus have to stop earlier and step toward the correct instruction.

Multiprocessor Support

A major limitation of the current SimuBoost prototype is its restriction to single-core virtual machines. SimuBoost shares this shortcoming with many projects based on deterministic replay. As described in § 2.4.3, this originates from the difficulty of efficiently tracking accesses to shared memory regions in multiprocessor systems. While it is technically possible to accurately record and replay the order of memory accesses in software, for example, with a CREW protocol, this generally comes at a high run-time overhead. SMP-Revirt [88], for instance, incurs a 2x slowdown for a kernel build on two processors, and a 9x slowdown for four processors. Samsara [215] improves upon these values with better scalability (e.g., 6x slowdown for four cores). While we validated the general applicability of chunk-based recording to heterogeneous deterministic replay [293], the overhead still presents a considerable probe effect, which cannot even be avoided with the aforementioned three-stage SimuBoost design. The most promising solution is a CPU-integrated hardware recording logic, as demonstrated by Intel with QuickRec [205]. The FPGA-based prototype suffered only negligible performance degradation for tracking shared memory accesses in user processes.

Performance Model

Our evaluation of the performance model reveals that describing the overheads and delays in the model with nonconstant functions could improve the output quality. This is especially necessary to obtain good estimates when changing the interval length, which is indispensable for an accurate prediction of the optimal interval length. Since determining overhead functions is a cumbersome and nontrivial task, it is desirable to develop a method for fast automated workload characterization.
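As a first step toward such a characterization, the overhead functions could be fitted from a handful of calibration runs instead of being assumed constant. The following sketch fits an affine checkpoint downtime model t_c(L) = a + b*L with ordinary least squares; the sample values are fictitious and only illustrate the approach, they are not measurements from our evaluation.

    /* Hedged sketch: fitting a nonconstant overhead function from calibration runs. */
    #include <stdio.h>

    /* Ordinary least squares for y = a + b*x. */
    static void fit_affine(const double *x, const double *y, int n, double *a, double *b)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += x[i];
            sy  += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        *a = (sy - *b * sx) / n;
    }

    int main(void)
    {
        /* Interval lengths L [s] and measured downtimes t_c [ms] (fictitious). */
        double L[]  = { 0.1, 0.3, 0.5, 1.0, 2.0 };
        double tc[] = { 2.8, 3.2, 3.6, 4.3, 6.1 };
        double a, b;

        fit_affine(L, tc, 5, &a, &b);
        printf("t_c(L) = %.2f ms + %.2f ms/s * L\n", a, b);
        return 0;
    }

More expressive models (e.g., piecewise or logarithmic fits) could be plugged in without changing the overall procedure.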

Appendix A

Deutsche Zusammenfassung

Für das Sammeln detaillierter Laufzeitinformationen, wie Speicherzugriffsmustern, wird in der Betriebssystem- und Sicherheitsforschung häufig auf die funktionale Systemsimulation zurückgegriffen. Der Simulator führt dabei die zu untersuchende Arbeitslast in einer virtuellen Maschine (VM) aus, indem er schrittweise Instruktionen interpretiert oder derart übersetzt, sodass diese auf dem Zustand der VM arbeiten. Dieser Prozess ermöglicht es, eine umfangreiche Instrumentierung durchzuführen und so an Informationen zum Laufzeitverhalten zu gelangen, die auf einer physischen Maschine nicht zugänglich sind.

Obwohl die funktionale Systemsimulation als mächtiges Werkzeug gilt, stellt die durch die Interpretation oder Übersetzung resultierende immense Ausführungsverlangsamung eine substanzielle Einschränkung des Verfahrens dar. Im Vergleich zu einer nativen Ausführung messen wir für QEMU eine 30-fache Verlangsamung, wobei die Aufzeichnung von Speicherzugriffen diesen Faktor verdoppelt. Mit Simulatoren, die umfangreichere Instrumentierungsmöglichkeiten mitbringen als QEMU, kann die Verlangsamung um eine Größenordnung höher ausfallen. Dies macht die funktionale Simulation für lang laufende, vernetzte oder interaktive Arbeitslasten uninteressant. Darüber hinaus erzeugt die Verlangsamung ein unrealistisches Zeitverhalten, sobald Aktivitäten außerhalb der VM (z. B. Ein-/Ausgabe) involviert sind.

In dieser Arbeit stellen wir SimuBoost vor, eine Methode zur drastischen Beschleunigung funktionaler Systemsimulation. SimuBoost führt die zu untersuchende Arbeitslast zunächst in einer schnellen hardwaregestützten virtuellen Maschine aus. Dies ermöglicht volle Interaktivität mit Benutzern und Netzwerkgeräten. Während der Ausführung erstellt SimuBoost periodisch Abbilder der VM (engl. Checkpoints). Diese dienen als Ausgangspunkt für eine parallele Simulation, bei der jedes Intervall unabhängig simuliert und analysiert wird. Eine heterogene deterministische Wiederholung (engl. heterogeneous deterministic Replay) garantiert, dass in dieser Phase die vorherige hardwaregestützte Ausführung jedes Intervalls exakt reproduziert wird, einschließlich Interaktionen und realistischem Zeitverhalten.

Unser Prototyp ist in der Lage, die Laufzeit einer funktionalen Systemsimulation deutlich zu reduzieren. Während mit herkömmlichen Verfahren für die Simulation des Bauprozesses eines modernen Linux über 5 Stunden benötigt werden, schließt SimuBoost die Simulation in nur 15 Minuten ab. Dies sind lediglich 16% mehr Zeit, als der Bau in einer schnellen hardwaregestützten VM in Anspruch nimmt. SimuBoost ist imstande, diese Geschwindigkeit auch bei voller Instrumentierung zur Aufzeichnung von Speicherzugriffen beizubehalten.

Die vorliegende Arbeit ist das erste Projekt, welches das Konzept der Partitionierung und Parallelisierung der Ausführungszeit auf die interaktive Systemvirtualisierung in einer Weise anwendet, die eine sofortige parallele funktionale Simulation gestattet. Wir ergänzen die praktische Umsetzung mit einem mathematischen Modell zur formalen Beschreibung der Beschleunigungseigenschaften. Dies erlaubt es, für ein gegebenes Szenario die voraussichtliche parallele Simulationszeit zu prognostizieren, und gibt eine Orientierung zur Wahl der optimalen Intervalllänge. Im Gegensatz zu bisherigen Arbeiten legt SimuBoost einen starken Fokus auf die Skalierbarkeit über die Grenzen eines einzelnen physischen Systems hinaus. Ein zentraler Schlüssel hierzu ist der Einsatz moderner Checkpointing-Technologien. Im Rahmen dieser Arbeit präsentieren wir zwei neuartige Methoden zur effizienten und effektiven Kompression von periodischen Systemabbildern.

Appendix B

Additional Figures and Data

[Figure B.1: four panels, (a) reduction in size [%] and (b) share in total compression [%], each for L = 100 ms and L = 8 s; compression methods: Generic (LZ4), Deduplication, Delta, SDS; workloads: idle, sqlite, apache, povray, specjbb, pybench, postmark, phpbench, kernel build, encode-mp3.]

Figure B.1: On average, the specialized compression techniques profit from shorter intervals (i.e., more checkpoints), whereas longer intervals accumulate more entropy and make generic compression more effective.

[Figure B.2: heatmaps of data deduplication hit rates for a fixed-size hash table (chain lengths 1 to 8) and an n-way set associative cache (n = 1, 2, 4, 8) over the total number of buckets/lines (128 to 2M), for specjbb and postmark at L = 1 s; cells are binned by hit rate (0-60, 60-70, 70-80, 80-90, 90-95, 95-100%).]

Figure B.2: Overview of fixed-size hash table and n-way associative cache hit rates for data deduplication in SPECjbb and postmark.

[Figure B.3: bit-level layout of the SDS short literal, long literal, and match encodings (marker bits, delta or dictionary index, literal bytes, and an optional RLE run length that is only present when the RLE bit is set).]

Figure B.3: SDS match and literal encoding. Short literals comprise deltas up to 2^16 bytes. Long literals describe deltas that span at most 2^40 bytes. That means the offsets in the checkpoint database for two consecutive pages in the state map may not exceed 1 TiB. Runs of up to 2^16 identical deltas (a full table) can be encoded with the RLE extension.

[Figure B.4: (a) share of SDS encoding types (short literal, short literal+RLE, long literal, long literal+RLE, match, match+RLE) in the total encodings per workload; (b) share of run lengths (2^1 up to >2^8) in all RLE encodings for kernel build, annotated with average absolute occurrence counts.]

Figure B.4: (a) The final output of an SDS encoded RAM state map contains mostly short literals (2 bytes) and matches (1 byte). Long literals with RLE extension (7 bytes) are rarely used. (b) A run length of 2 (i.e., 1 repetition) is most frequent, making up 66% of all RLE encodings. Except for 2^1, values depict the share of run lengths between 2^(x-1) and up to 2^x. The annotations provide the average absolute number of occurrences for each run length in kernel build checkpoints for a 4 GiB VM.

CPUID (Site: KVM, Size: 16)
    The return value of the x86 CPUID instruction. It contains information about supported CPU features and a CPU identification.
    Recorded information: EAX(4), EBX(4), ECX(4), EDX(4)

RDTSC (Site: KVM, Size: 8)
    The time stamp returned by the RDTSC instruction.
    Recorded information: EAX(4), EDX(4)

RDMSR, WRMSR [Opt.] (Site: KVM, Size: 12)
    The value read from or written to a model-specific register (MSR) using the RDMSR and WRMSR instructions. Needed to fake unimplemented hardware CPU features. ECX specifies the MSR register.
    Recorded information: EAX(4), EDX(4), ECX(4) [Opt.]

IN, OUT [Opt.] (Site: KVM, Size: 12)
    The value read or written using port I/O.
    Recorded information: Port(4), Value(4), Length(4) [Opt.]

Read APIC†, Write APIC† [Opt.] (Site: KVM, Size: 12)
    The value read from or written to the APIC memory at 0xFEE00000 – 0xFEEFFFFF in guest physical memory. We inject interrupts from the log and do not faithfully emulate the APIC.
    Recorded information: Address(4), Value(4), Length(4) [Opt.]

Read HPET, Write HPET [Opt.] (Site: QEMU, Size: 12)
    The value read from or written to the HPET memory at 0xFED00000 – 0xFED003FF in guest physical memory. Required for correct clock and timer replay.
    Recorded information: Address(4), Value(4), Length(4) [Opt.]

† Requires deactivation of direct interrupt delivery (APICv) to virtual machines.

Table B.1: Synchronous non-deterministic events recorded by SimuBoost. Site denotes the recording location. Optional events or information are not needed for successful replay, but help detecting diverging replays. The value in parentheses indicates the size in bytes. The total size is also in bytes and comprises event-specific data only.
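For illustration, the synchronous events from Table B.1 can be thought of as small fixed-size records behind a common header. The C structures below are a sketch under this assumption; they mirror the field names and sizes listed above, but they are not SimuBoost's actual log format.

    /* Hedged sketch of possible in-log layouts for two synchronous events. */
    #include <stdint.h>

    enum sync_event_type { EV_CPUID, EV_RDTSC, EV_RDMSR, EV_IO, EV_APIC, EV_HPET };

    struct event_header {
        uint16_t type;      /* enum sync_event_type */
        uint16_t size;      /* payload size in bytes, e.g., 8 for RDTSC */
    };

    struct rdtsc_event {    /* payload: EAX(4), EDX(4) = 8 bytes */
        uint32_t eax;
        uint32_t edx;
    };

    struct io_event {       /* payload: Port(4), Value(4), Length(4) = 12 bytes */
        uint32_t port;
        uint32_t value;
        uint32_t length;
    };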

INT† (Site: KVM, Size: 4)
    Interrupt.
    Recorded information: Vector(4)

SMI (Site: KVM, Size: 0)
    System management interrupt (SMI) to enter system management mode (SMM).
    Recorded information: -

Write DMA (Site: QEMU, Size: 8+X)
    Write from a virtual device into guest physical memory using (virtual) DMA.
    Recorded information: Address(8), Data(X)

† Requires deactivation of direct interrupt delivery (APICv) to virtual machines.

Table B.2: Asynchronous non-deterministic events recorded by SimuBoost. Site denotes the recording location. The value in parentheses indicates the size in bytes. The total size is also in bytes and comprises event-specific data only.

Port Address          Description
0x1CE – 0x1D0         VESA BIOS Extension (VBE)
0x3B0 – 0x3DF         Video Graphics Array (VGA)
0x3F8 – 0x3FC         COM1 Serial Port
0x402 [Opt.]          SeaBIOS Debug Port
0xCF8                 PCI Index
0xCFC – 0xCFF         PCI Data (only if the chipset or video adapter is selected via the PCI index port)

Table B.3: To make output over the video adapter or the serial port accessible during replay, the given ports must be forwarded to the corresponding virtual devices. This also allows proper configuration of the physical address space when loading a checkpoint or replaying the BIOS.
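For replay tooling, the port ranges from Table B.3 translate into a simple lookup. The following C sketch is illustrative only; the parameter pci_selects_vga_or_chipset is an assumed stand-in for the PCI index check described above, and the function is not SimuBoost's implementation.

    /* Hedged sketch: deciding during replay whether a guest port I/O access
     * must be forwarded to the corresponding virtual device (see Table B.3). */
    #include <stdbool.h>
    #include <stdint.h>

    bool replay_forwards_port(uint16_t port, bool pci_selects_vga_or_chipset)
    {
        if (port >= 0x1CE && port <= 0x1D0) return true;   /* VESA BIOS Extension */
        if (port >= 0x3B0 && port <= 0x3DF) return true;   /* VGA */
        if (port >= 0x3F8 && port <= 0x3FC) return true;   /* COM1 serial port */
        if (port == 0x402)                  return true;   /* SeaBIOS debug port (optional) */
        if (port == 0xCF8)                  return true;   /* PCI index */
        if (port >= 0xCFC && port <= 0xCFF)                /* PCI data */
            return pci_selects_vga_or_chipset;
        return false;
    }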

[Figure B.5: (a) share of event types (DMA, IN/OUT, RDTSC, INT/SMI, others) in the total number of recorded events per workload; (b) share of event types in the uncompressed log size; workloads: idle, apache, gnupg, povray, specjbb, pybench, postmark, phpbench, kernel build, encode-mp3.]

Figure B.5: (a) With a share of 60% to 99%, timestamp readings (RDTSC) are by far the most common events. (b) Looking at the share in the overall log size, however, reveals that DMA operations possess the highest per-event size. Shares encompass reads and writes. See Tables B.1 and B.2 for a complete list of event types.

Workload        Before    After          Workload        Before    After
postmark            76       26          encode-mp3         144      105
specjbb          22996     1560          pybench             60       26
kernel build      1010      319          povray             684      135
apache            9816      699          phpbench          3427      146
gnupg               71       25          idle                 5      0.6

Table B.4: Total log size per workload before and after compression in MiB.

Workload        L        Failed   Total   Rate           Workload     L        Failed   Total   Rate
sqlite          3.0 s    1        11      9.09%          phpbench     3.0 s    1        145     0.69%
encode-mp3      4.0 s    1        19      5.26%          pybench      0.1 s    1        647     0.15%
kernel build    4.0 s    8        208     3.85%          apache       0.3 s    1        843     0.12%
gnupg           0.5 s    1        86      1.16%          postmark     0.1 s    0        1040    0.00%
povray          3.0 s    4        380     1.05%          idle         0.1 s    0        587     0.00%

Average (∅) failure rate: 2.14%

Table B.5: Maximum replay failure rate observed over all configurations according to Chapter 9.

Columns: HW-Virt. (Tvm), SimuBoost (Tps), Serial Replay (Tsim), and Serial Simulation (Tsim); the slowdown ssim relative to Tvm is given in parentheses.

apache
    HW-Virt.: 01:58    SimuBoost: 04:20 (2.19x)    Serial Replay: 59:03 (29.84x)    Serial Sim.: 57:31 (29.06x)
    L=1.0 s, tc=4.24 ms, svm=2.02, n=241, N=16(16), ts=432 ms, S=13.62x, E=85(85)%
  + tracing (w)
    SimuBoost: 04:18 (2.18x)    Serial Replay: 01:15:32 (38.17x)    Serial Sim.: 01:25:41 (43.30x)
    L=0.5 s, tc=3.87 ms, svm=2.05, n=489, N=24(20), ts=338 ms, S=17.51x, E=73(88)%
  + tracing (r+w)
    SimuBoost: 04:21 (2.20x)    Serial Replay: 01:41:36 (51.34x)    Serial Sim.: 01:57:39 (59.45x)
    L=0.3 s, tc=3.64 ms, svm=2.09, n=825, N=32(27), ts=292 ms, S=23.30x, E=73(86)%

encode-mp3
    HW-Virt.: 01:04    SimuBoost: 01:10 (1.10x)    Serial Replay: 17:53 (16.75x)    Serial Sim.: 16:00 (15.00x)
    L=0.1 s, tc=2.73 ms, svm=1.04, n=659, N=24(20), ts=135 ms, S=15.23x, E=63(76)%
  + tracing (w)
    SimuBoost: 01:11 (1.11x)    Serial Replay: 27:29 (25.74x)    Serial Sim.: 30:40 (28.72x)
    L=0.1 s, tc=2.73 ms, svm=1.04, n=659, N=32(29), ts=136 ms, S=23.10x, E=72(80)%
  + tracing (r+w)
    SimuBoost: 01:15 (1.18x)    Serial Replay: 59:03 (55.30x)    Serial Sim.: 01:07:44 (63.43x)
    L=0.1 s, tc=2.73 ms, svm=1.04, n=665, N=64(60), ts=135 ms, S=47.02x, E=73(78)%

gnupg
    HW-Virt.: 00:33    SimuBoost: 00:45 (1.33x)    Serial Replay: 04:37 (8.17x)    Serial Sim.: 04:30 (7.97x)
    L=0.3 s, tc=3.42 ms, svm=1.20, n=135, N=16(10), ts=227 ms, S=6.17x, E=39(62)%
  + tracing (w)
    SimuBoost: 00:48 (1.42x)    Serial Replay: 09:28 (16.73x)    Serial Sim.: 12:34 (22.20x)
    L=0.3 s, tc=3.42 ms, svm=1.20, n=135, N=24(15), ts=233 ms, S=11.80x, E=49(79)%
  + tracing (r+w)
    SimuBoost: 00:48 (1.43x)    Serial Replay: 18:03 (31.90x)    Serial Sim.: 23:10 (40.94x)
    L=0.1 s, tc=3.00 ms, svm=1.23, n=417, N=48(36), ts=153 ms, S=22.34x, E=47(62)%

idle
    HW-Virt.: 01:00    SimuBoost: 00:59 (1.00x)    Serial Replay: 00:02 (0.04x)    Serial Sim.: 01:00 (1.01x)
    L=1.0 s, tc=4.11 ms, svm=1.00, n=59, N=16(1), ts=125 ms, S=0.04x, E=0(4)%
  + tracing (w)
    SimuBoost: 00:59 (1.00x)    Serial Replay: 00:02 (0.04x)    Serial Sim.: 01:00 (1.00x)
    L=1.0 s, tc=4.11 ms, svm=1.00, n=59, N=16(1), ts=127 ms, S=0.04x, E=0(4)%
  + tracing (r+w)
    SimuBoost: 00:59 (1.00x)    Serial Replay: 00:03 (0.06x)    Serial Sim.: 01:00 (1.00x)
    L=1.0 s, tc=4.39 ms, svm=1.00, n=59, N=16(1), ts=126 ms, S=0.06x, E=0(6)%

kernel build
    HW-Virt.: 12:52    SimuBoost: 14:54 (1.16x)    Serial Replay: 05:08:12 (23.93x)    Serial Sim.: 04:51:49 (22.66x)
    L=2.0 s, tc=6.13 ms, svm=1.08, n=410, N=24(24), ts=439 ms, S=20.68x, E=86(86)%
  + tracing (w)
    SimuBoost: 15:35 (1.21x)    Serial Replay: 07:05:37 (33.05x)    Serial Sim.: 07:53:02 (36.73x)
    L=2.0 s, tc=6.13 ms, svm=1.08, n=410, N=32(32), ts=461 ms, S=27.31x, E=85(85)%
  + tracing (r+w)
    SimuBoost: 15:13 (1.18x)    Serial Replay: 10:25:10 (48.55x)    Serial Sim.: 11:50:43 (55.19x)
    L=1.0 s, tc=5.33 ms, svm=1.10, n=838, N=48(48), ts=516 ms, S=41.06x, E=86(86)%

phpbench
    HW-Virt.: 06:31    SimuBoost: 07:42 (1.18x)    Serial Replay: 02:54:29 (26.78x)    Serial Sim.: 02:14:56 (20.71x)
    L=0.5 s, tc=2.87 ms, svm=1.09, n=849, N=24(24), ts=129 ms, S=22.64x, E=94(94)%
  + tracing (w)
    SimuBoost: 07:24 (1.14x)    Serial Replay: 05:04:21 (46.70x)    Serial Sim.: 04:49:27 (44.42x)
    L=0.3 s, tc=2.78 ms, svm=1.09, n=1424, N=48(47), ts=125 ms, S=41.09x, E=86(87)%
  + tracing (r+w)
    SimuBoost: 07:31 (1.15x)    Serial Replay: 08:26:53 (77.78x)    Serial Sim.: 08:48:01 (81.02x)
    L=0.3 s, tc=2.77 ms, svm=1.09, n=1422, N=80(78), ts=128 ms, S=67.35x, E=84(86)%

postmark
    HW-Virt.: 00:59    SimuBoost: 01:37 (1.63x)    Serial Replay: 25:35 (25.82x)    Serial Sim.: 19:12 (19.37x)
    L=0.5 s, tc=6.51 ms, svm=1.43, n=172, N=24(17), ts=384 ms, S=15.80x, E=66(93)%
  + tracing (w)
    SimuBoost: 01:42 (1.72x)    Serial Replay: 40:29 (40.85x)    Serial Sim.: 44:36 (45.01x)
    L=0.5 s, tc=6.51 ms, svm=1.43, n=172, N=32(27), ts=387 ms, S=23.78x, E=74(88)%
  + tracing (r+w)
    SimuBoost: 01:47 (1.81x)    Serial Replay: 01:00:32 (61.09x)    Serial Sim.: 01:15:58 (76.66x)
    L=0.3 s, tc=5.67 ms, svm=1.54, n=306, N=48(41), ts=511 ms, S=33.72x, E=70(82)%

povray
    HW-Virt.: 18:47    SimuBoost: 19:57 (1.06x)    Serial Replay: 25:11:21 (80.40x)    Serial Sim.: 25:31:56 (81.49x)
    L=0.5 s, tc=2.88 ms, svm=1.01, n=2249, N=80(80), ts=153 ms, S=75.75x, E=95(95)%
  + tracing (w)
    SimuBoost: 19:51 (1.06x)    Serial Replay: 29:14:22 (93.32x)    Serial Sim.: 31:54:10 (101.82x)
    L=0.3 s, tc=2.82 ms, svm=1.01, n=3806, N=96(95), ts=186 ms, S=88.34x, E=92(93)%
  + tracing (r+w)
    SimuBoost: 24:20 (1.29x)    Serial Replay: 41:08:52 (131.33x)    Serial Sim.: 45:17:48 (144.57x)
    L=0.3 s, tc=2.74 ms, svm=1.01, n=3776, N=108(108), ts=182 ms, S=101.44x, E=94(94)%

pybench
    HW-Virt.: 01:02    SimuBoost: 01:11 (1.15x)    Serial Replay: 29:46 (28.51x)    Serial Sim.: 23:55 (22.92x)
    L=0.1 s, tc=2.91 ms, svm=1.03, n=643, N=32(30), ts=171 ms, S=24.87x, E=78(83)%
  + tracing (w)
    SimuBoost: 01:13 (1.17x)    Serial Replay: 52:19 (50.12x)    Serial Sim.: 53:40 (51.42x)
    L=0.1 s, tc=2.91 ms, svm=1.03, n=643, N=64(54), ts=253 ms, S=42.84x, E=67(79)%
  + tracing (r+w)
    SimuBoost: 01:17 (1.23x)    Serial Replay: 01:27:26 (83.76x)    Serial Sim.: 01:39:03 (94.89x)
    L=0.1 s, tc=3.01 ms, svm=1.03, n=646, N=96(88), ts=183 ms, S=68.10x, E=71(77)%

sqlite
    HW-Virt.: 00:28    SimuBoost: 00:25 (0.89x)    Serial Replay: 00:57 (2.04x)    Serial Sim.: 01:57 (4.15x)
    L=0.3 s, tc=3.47 ms, svm=0.84, n=79, N=16(3), ts=233 ms, S=2.28x, E=14(76)%
  + tracing (w)
    SimuBoost: 00:25 (0.91x)    Serial Replay: 01:13 (2.61x)    Serial Sim.: 02:32 (5.41x)
    L=0.3 s, tc=3.47 ms, svm=0.84, n=79, N=16(4), ts=258 ms, S=2.88x, E=18(72)%
  + tracing (r+w)
    SimuBoost: 00:25 (0.92x)    Serial Replay: 01:38 (3.49x)    Serial Sim.: 03:37 (7.70x)
    L=0.3 s, tc=3.41 ms, svm=0.84, n=79, N=16(5), ts=224 ms, S=3.81x, E=24(76)%

Table B.6: Detailed overview of SimuBoost evaluation results. The numbers given in parentheses are the slowdown compared to Tvm for the corresponding execution mode. For N and E, the number in parentheses denotes the median number of actually used nodes and the corresponding efficiency, respectively. Run times are given as [hh:]mm:ss. Measurements reflect the best configuration according to Chapter 9. See Chapter 5 for a description of the symbols.

                 No Tracing            Tracing (w)            Tracing (r+w)
                 N    Busy   N(L)      N    Busy   N(L)       N     Busy   N(L)
apache           16   16     16        24   20     21         32    27     27
encode-mp3       24   20     19        32   29     28         64    60     57
gnupg            16   10     9         24   15     17         48    36     29
idle             16   1      1         16   1      1          16    1      1
kernel build     24   24     24        32   32     33         48    48     48
phpbench         24   24     25        48   47     44         80    78     72
postmark         24   17     19        32   27     29         48    41     43
povray           80   80     81        96   95     94         108   108    133
pybench          32   30     32        64   54     54         96    88     89
sqlite           16   3      4         16   4      5          16    5      6

Table B.7: Estimated optimal number of nodes (Equation 5.14) is close to the measured number of busy nodes when using the measured optimal interval length. In case of povray (r+w), our cluster is not large enough.

[Figure B.6: simulation diagram of a kernel build (L = 2.0 s, N = 24); x-axis: run time [mm:ss] up to 14:54 (894.41 s); y-axis: node (system, CPU); bars: individual simulation intervals.]

Figure B.6: Simulation diagram of a kernel build. Intervals at the beginning complete only marginally faster. The last produced interval (red bar) takes only 50% of the average interval simulation time, so its delayed start only marginally changes the overall completion time. Missing intervals in the diagram failed to replay.

[Figure B.7: simulation diagram of gnupg (L = 300 ms, N = 16); x-axis: run time [mm:ss] up to 00:45 (45.04 s); y-axis: node (system, CPU); bars: individual simulation intervals.]

Figure B.7: Simulation diagram of gnupg. The two program phases are clearly visible in the interval simulation times. The last produced interval marks the end of the parallel simulation, but takes longer than 50% of the average interval simulation time. As we use more nodes than necessary, a lot of idle time is created.

List of Tables

2.1 Overview of System Virtual Machines ...... 41

6.1 Shortest Interval Length per Workload ...... 111 6.2 Run Time per Page Size ...... 120 6.3 Copy Excess with Large Pages ...... 121 6.4 Working Set Sizes for 1 s Intervals ...... 129 6.5 Run-Time Overhead of Sparse Checkpointing ...... 133

7.1 Total Checkpoint Data per Workload ...... 142

8.1 Undefined Flag Computation for Selected X86 Instructions . . . . 176

9.1 Software Configuration ...... 189 9.2 Hardware Configuration ...... 190

B.1 Recorded Synchronous Non-Deterministic Events ...... 216 B.2 Recorded Asynchronous Non-Deterministic Events ...... 217 B.3 Writable Port Addresses During Replay ...... 217 B.4 Total Log Size per Workload ...... 218 B.5 Replay Failure Rate ...... 218 B.6 Detailed SimuBoost Evaluation Results ...... 220 B.7 Estimated Optimal Interval Length ...... 220

List of Figures

1.1 Parallel Simulation with SimuBoost ...... 3

2.1 Virtualization as Isomorphism ...... 11 2.2 Computer System Architecture ...... 13 2.3 Process and System Virtual Machines ...... 14 2.4 Type I and Type II Hypervisors ...... 16 2.5 Computer Hardware Division ...... 17

2.6 Classic RISC Processor Pipeline ...... 19 2.7 Simple Interpreter ...... 20 2.8 Simple Binary Translator ...... 22 2.9 Translation Block Chaining ...... 24 2.10 Virtual Machine Extensions (VMX) Operation ...... 28 2.11 Two-level Page Table ...... 31 2.12 Levels of Memory Virtualization ...... 32 2.13 Software MMU ...... 34 2.14 Shadow Page Tables (SPT) ...... 35 2.15 Second Level Address Translation (SLAT) ...... 37 2.16 Spectrum of Virtualization Techniques ...... 42 2.17 QEMU/KVM: Overview ...... 43 2.18 QEMU/KVM: Memory Regions and RAM Blocks ...... 45 3.1 Slowdown of Serial Full System Simulation ...... 67

4.1 Division of Simulation Time ...... 79 4.2 Parallel Simulation with SimuBoost ...... 80

5.1 Optimal Setup: Overview of Parameters ...... 88 5.2 Optimal Setup: Interval Length ...... 91 5.3 Optimal Setup: ∆S(n) ...... 92 5.4 Optimal Setup: Number of Nodes ...... 93 5.5 Constrained Setup: Queuing of Simulations ...... 95 5.6 Constrained Setup: Interval Length ...... 97 5.7 Constrained Setup: Sawtooth Steps ...... 97

6.1 Continuous Checkpointing in SimuBoost ...... 101 6.2 Downtime with Stop-and-Copy Checkpointing ...... 104 6.3 Run Time and Bandwidth with Stop-and-Copy ...... 105 6.4 Number of Dirty Pages per Interval Length ...... 106 6.5 Per-Checkpoint Downtime with Incremental Checkpointing . . . . 108 6.6 Per-Checkpoint Downtime with Incremental Copy-on-Write . . . . 109 6.7 Downtime and Async. Time with Incremental Copy-on-Write . . . 110 6.8 Run-Time Overhead with Incremental Copy-on-Write ...... 112 6.9 Percentage of Pages Copied via Copy-on-Write Fault ...... 113 6.10 Number of Page Faults and Scanned PTEs ...... 116 6.11 Run-Time Overhead per Dirty Logging Technique ...... 117 6.12 Downtime per Dirty Logging Technique ...... 118 6.13 Run-Time Overhead per Guest Size and Network Bandwidth Comparison ...... 119 6.14 Copy Excess of 2 MiB Pages Compared to 4 KiB Pages ...... 120 6.15 Dependency Chain in Incremental Checkpoints ...... 125 6.16 Separation of Checkpoints and Data ...... 126 6.17 Simutrace as Storage Solution for Checkpoints ...... 127

6.18 Working Set Tracking for Sparse Checkpointing ...... 130 6.19 Loading Time of Full and Sparse Checkpoints ...... 131 6.20 Sparse Memory Consumption and Run-Time Overhead ...... 132

7.1 Pulling versus Pushing ...... 137 7.2 Network Bandwidth Limitations ...... 138 7.3 Data Reduction Pipeline ...... 139 7.4 Compressed Bandwidth and Compression Ratio ...... 141 7.5 Effectiveness of Compression Methods (L = 1000 ms) ...... 143 7.6 CPU Usage ...... 144 7.7 Fixed-Size Hash Table ...... 145 7.8 Data Deduplication Hit Rates for kernel build ...... 146 7.9 Deduplication Ratio ...... 148 7.10 Decision Quality and Configuration for Delta Application Heuristic 150 7.11 Delta Chain ...... 150 7.12 Delta Application and Compression Ratios ...... 151 7.13 SDS Compression Loop ...... 153 7.14 SDS Encode Function ...... 154 7.15 SDS Compression versus LZ4 ...... 155 7.16 Checkpoint Database Reference Distribution ...... 156 7.17 Multicast Repair ...... 158

8.1 Basic Event Structure ...... 163 8.2 Synchronous and Asynchronous Events in a Common Log . . . . . 163 8.3 Simutrace as Storage Solution for Recording Logs ...... 164 8.4 Fuzzy Landmark ...... 166 8.5 Simplified CPU Loop ...... 167 8.6 Run-Time Overhead of Deterministic Recording ...... 172 8.7 Event Rate, Data Volume, and Compression Ratio in Det. Replay . 173 8.8 Status Flags in EFLAGS Register ...... 175 8.9 Store Exclusive Divergence on ARM ...... 181

9.1 Overview of Evaluation Setup ...... 187 9.2 Slowdown of Serial Replay vs. Serial Simulation ...... 193 9.3 Normalized Parallel Simulation Time ...... 194 9.4 Scalability and Efficiency ...... 196 9.5 Queue Length and Parallel Simulations for a Kernel Build . . . . . 197 9.6 Queue Length and Parallel Simulations for GnuPG ...... 198 9.7 Share of Initialization Time in Busy Time ...... 199 9.8 Influences on Simulation Speed ...... 200 9.9 Efficiency based on Median Number of Busy Nodes ...... 201

9.10 Tps and Lopt Errors ...... 203 9.11 Tps Estimation with Varying Base Intervals ...... 204 10.1 Three-Stage SimuBoost ...... 209

B.1 Effectiveness of Compression Methods (L ∈ {100, 8000} ms) . . . 213 B.2 Data Deduplication Hit Rates for SPECjbb and Postmark ...... 214 B.3 SDS Match and Literal Encoding ...... 215 B.4 Share of SDS Encodings and Run Lengths ...... 215 B.5 Shares of Event Types in Replay Logs ...... 218 B.6 Simulation Diagram of a Kernel Build (L = 2000, N = 24) . . . . . 221 B.7 Simulation Diagram of GnuPG (L = 300, N = 16) ...... 222

Bibliography

[1] Apache HTTP Server Project. http://httpd.apache.org/. [Online; retrieved June 11, 2019].

[2] Bochs IA-32 Emulator. http://bochs.sourceforge.net/. [Online; retrieved Apr. 17, 2018].

[3] FFmpeg. http://ffmpeg.org/. [Online; retrieved June 12, 2018].

[4] GNU Gzip. http://www.gnu.org/software/gzip/. [Online; retrieved June 11, 2018].

[5] Google FarmHash. https://github.com/google/farmhash. [Online; retrieved Jan. 28, 2019].

[6] Kyoto Cabinet. http://fallabs.com/kyotocabinet/. [Online; retrieved Dec. 31, 2018].

[7] Lempel-Ziv-Oberhumer (LZO). http://www.oberhumer.com/opensource/lzo/. [Online; retrieved June 12, 2018].

[8] LevelDB. https://github.com/google/leveldb/. [Online; retrieved Dec. 31, 2018].

[9] LZMA SDK. https://www.7-zip.org/sdk.html. [Online; retrieved Jan. 21, 2019].

[10] MongoDB. https://www.mongodb.com/. [Online; retrieved Dec. 31, 2018].

[11] Phoronix Test Suite. http://www.phoronix-test-suite.com/. [Online; retrieved May 29, 2018].

[12] Redis. https://redis.io/. [Online; retrieved Dec. 31, 2018].

[13] RUBiS. http://rubis.ow2.org/. [Online; retrieved June 11, 2018].

[14] SPECjbb2005. http://spec.org/jbb2005/. [Online; retrieved June 11, 2018].

[15] SPECweb99. http://www.spec.org/web99/. [Online; retrieved June 11, 2018].

[16] SuperFastHash. http://www.azillionmonkeys.com/qed/hash.html. [Online; retrieved June 8, 2018].

[17] Sysbench Benchmark Suite. https://github.com/akopytov/sysbench. [Online; retrieved June 12, 2018].

[18] The x86 KVM Shadow MMU. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/virtual/kvm/mmu.txt. [Online; retrieved May 15, 2018].

[19] Advanced Configuration and Power Interface (ACPI) Specification - Version 6.3, Jan. 2019.

[20] K. Adams and O. Agesen. A comparison of software and hardware techniques for x86 virtualization. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’06, pages 2–13, San Jose, California, USA, Oct. 2006. URL: https://doi.org/10.1145/1168857.1168860.

[21] . AMD64 Architecture Programmer’s Manual - Volume 2: System Programming, Dec. 2017.

[22] O. Agesen, A. Garthwaite, J. Sheldon, and P.Subrahmanyam. The evolution of an x86 virtual machine monitor. ACM SIGOPS Operating Systems Re- view, 44(4):3–18, 2010. URL: https://doi.org/10.1145/1899928. 1899930.

[23] O. Agesen, J. Mattson, R. Rugina, and J. Sheldon. Software techniques for avoiding hardware virtualization exits. In Proceedings of the USENIX Annual Technical Conference, USENIX ATC’12, pages 373–385, Boston, Massachusetts, USA, June 2012. URL: https://www.usenix.org/ system/files/conference/atc12/atc12-final158.pdf.

[24] R. W. Ahmad, A. Gani, S. H. A. Hamid, M. Shiraz, F. Xia, and S. A. Madani. Virtual machine migration in cloud data centers: a review, taxonomy, and open research issues. The Journal of Supercomputing, 71(7):2473–2515, July 2015. URL: https://doi.org/10.1007/s11227-015-1400-5.

[25] J. H. Ahn, S. Li, O. Seongil, and N. P.Jouppi. McSimA+: A manycore simula- tor with application-level+ simulation and detailed microarchitecture mod- eling. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS’13, pages 74–85, Austin, Texas, USA, Apr. 2013. URL: https://doi.org/10.1109/ISPASS.2013.6557148.

[26] S. Akiyama, T. Hirofuchi, R. Takano, and S. Honiden. Fast wide area live migration with a low overhead through page cache teleportation. In Proceedings of the International Symposium on Cluster, Cloud and Grid Computing, CCGrid’13, pages 78–82, Delft, Netherlands, May 2013. URL: https://doi.org/10.1109/CCGrid.2013.57.

[27] S. Al-Kiswany, D. Subhraveti, P.Sarkar, and M. Ripeanu. VMFlock: Virtual machine co-migration for the cloud. In Proceedings of the International Symposium on High Performance Distributed Computing, HPDC’11, pages 159–170, San Jose, California, USA, June 2011. URL: https://doi. org/10.1145/1996130.1996153.

[28] A. R. Alameldeen, M. M. Martin, C. J. Mauer, K. Moore, M. Xu, M. D. Hill, D. A. Wood, and D. J. Sorin. Simulating a $2m commercial server on a $2k pc. Computer, 36(2):50–57, 2003. URL: https://doi.org/10.1109/ MC.2003.1178046.

[29] A. R. Alameldeen and D. A. Wood. Variability in architectural simulations of multi-threaded workloads. In Proceedings of the International Symposium on High-Performance Computer Architecture, HPCA-9’03, pages 7–18, Anaheim, California, USA, Feb. 2003. URL: https://doi.org/10.1109/hpca. 2003.1183520.

[30] G. Altekar and I. Stoica. Output-deterministic replay for multicore de- bugging. Technical Report UCB/EECS-2009-108, University of California at Berkeley, EECS Department, Aug. 2009. URL: http://www.eecs. berkeley.edu/Pubs/TechRpts/2009/EECS-2009-108.html.

[31] B. Amstadt and M. K. Johnson. Wine. Linux Journal, 1994(4es):3, Aug. 1994.

[32] ARM Limited. ARM architecture reference manual - ARMv8, for ARMv8-A architecture profile, Dec. 2017.

[33] J. W. Atwood. A classification of reliable multicast protocols. IEEE Network, 18(3):24–34, May 2004. URL: https://doi.org/10.1109/MNET.2004.1301019.

[34] V.Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimization system. In Proceedings of the Conference on Programming Language Design and Implementation, PLDI’00, pages 1–12, Vancouver, British Columbia, Canada, June 2000. URL: https://doi.org/10. 1145/349299.349303.

[35] P.Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the Symposium on Operating Systems Principles, SOSP’03, pages 164–177, Bolton Landing, New York, USA, Oct. 2003. URL: https://doi.org/ 10.1145/945445.945462.

[36] S. K. Barker, T. Wood, P. J. Shenoy, and R. K. Sitaraman. An empiri- cal study of memory sharing in virtual machines. In Proceedings of the USENIX Annual Technical Conference, USENIX ATC’12, pages 273–284, Boston, Massachusetts, USA, June 2012. URL: https://www.usenix. org/system/files/conference/atc12/atc12-final226.pdf.

[37] N. Baudis. Deduplicating virtual machine checkpoints for distributed system simulation. Bachelor’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Nov. 2013.

[38] F. Bellard. QEMU, a fast and portable dynamic translator. In Pro- ceedings of the USENIX Annual Technical Conference, FREENIX Track, USENIX ATC’05, pages 41–46, Anaheim, California, USA, Apr. 2005. URL: https://www.usenix.org/legacy/event/usenix05/tech/ freenix/full_papers/bellard/bellard.pdf.

[39] N. Bhatia. Performance evaluation of Intel EPT hardware assist. Technical report, VMware, Inc, Mar. 2009. URL: https://www.vmware.com/pdf/ Perf_ESX_Intel-EPT-eval.pdf.

[40] N. Bila, E. de Lara, K. Joshi, H. A. Lagar-Cavilla, M. Hiltunen, and M. Satya- narayanan. Jettison: Efficient idle desktop consolidation with partial VM migration. In Proceedings of the European Conference on Computer Sys- tems, EuroSys’12, pages 211–224, Bern, Switzerland, Apr. 2012. URL: https://doi.org/10.1145/2168836.2168858.

[41] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, May 2011. URL: https://doi.org/10.1145/2024716.2024718.

[42] N. Boehr. Evaluating copy-on-write for high frequency checkpoints. Bachelor’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Sept. 2015.

[43] E. B. Boyer, M. C. Broomfield, and T. A. Perrotti. GlusterFS one storage server to rule them all. Technical Report LA-UR-12-23586, Los Alamos National Lab, July 2012.

[44] R. Bradford, E. Kotsovinos, A. Feldmann, and H. Schiöberg. Live wide-area migration of virtual machines including local persistent state. In Proceedings of the International Conference on Virtual Execution Environments, VEE’07, pages 169–179, San Diego, California, USA, June 2007. URL: https: //doi.org/10.1145/1254810.1254834.

[45] T. C. Bressoud and F. B. Schneider. Hypervisor-based fault tolerance. ACM Transactions on Computer Systems (TOCS), 14(1):80–107, 1996. URL: https://doi.org/10.1145/225535.225538.

[46] K. Buchacker and V.Sieh. Framework for testing the fault-tolerance of sys- tems including OS and network aspects. In Proceedings of the International Symposium on High Assurance Systems Engineering, HASE’01, pages 95–105, Oct. 2001. URL: https://doi.org/10.1109/HASE.2001.966811.

[47] B. Buck and J. K. Hollingsworth. An API for runtime code patching. The In- ternational Journal of High Performance Computing Applications, 14(4):317– 329, 2000.

[48] E. Bugnion, S. Devine, K. Govil, and M. Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. ACM Trans- actions on Computer Systems (TOCS), 15(4):412–447, Nov. 1997. URL: https://doi.org/10.1145/265924.265930.

[49] E. Bugnion, S. Devine, M. Rosenblum, J. Sugerman, and E. Y. Wang. Bring- ing virtualization to the x86 architecture with the original VMware work- station. ACM Transactions on Computer Systems (TOCS), 30(4):12, Nov. 2012. URL: https://doi.org/10.1145/2382553.2382554.

[50] A. Burtsev, D. Johnson, M. Hibler, E. Eide, and J. Regehr. Abstractions for practical virtual machine replay. In Proceedings of the International Con- ference on Virtual Execution Environments, VEE’16, pages 93–106, Atlanta, Georgia, USA, Apr. 2016. URL: https://doi.org/10.1145/2892242. 2892257.

[51] J. P. Buzen and U. O. Gagliardi. The evolution of virtual machine architecture. In Proceedings of the National Computer Conference and Exposition, AFIPS’73, pages 291–299, New York, New York, USA, June 1973. URL: https://doi.org/10.1145/1499586.1499667.

[52] J. Caballero, G. Grieco, M. Marron, and A. Nappa. Undangle: Early detec- tion of dangling pointers in use-after-free and double-free vulnerabilities. In Proceedings of the International Symposium on Software Testing and Anal- ysis, ISSTA’12, pages 133–143, Minneapolis, Minnesota, USA, July 2012. URL: https://doi.org/10.1145/2338965.2336769.

[53] H. W. Cain, K. M. Lepak, B. A. Schwartz, and M. H. Lipasti. Precise and accurate processor simulation. In Workshop on Computer Architecture Evaluation using Commercial Workloads, CAECW’02, pages 13–22, Boston, Massachusettes, USA, Feb. 2002.

[54] K. Chanchio, C. Leangsuksun, H. Ong, V.Ratanasamoot, and A. Shafi. An efficient virtual machine checkpointing mechanism for hypervisor-based HPC systems. In High Availability and Performance Computing Workshop, 2008.

[55] C.-J. Chang, J.-J. Wu, W.-C. Hsu, P.Liu, and P.-C.Yew. Efficient memory virtualization for cross-ISA system mode emulation. In Proceedings of the International Conference on Virtual Execution Environments, VEE’14, pages 117–128, Salt Lake City, Utah, USA, Mar. 2014. URL: https://doi.org/ 10.1145/2576195.2576201.

[56] J. B. Chen and B. N. Bershad. The impact of operating system structure on memory system performance. In Proceedings of the Symposium on Operating Systems Principles, SOSP’93, pages 120–133, Asheville, North Carolina, USA, Dec. 1994. URL: https://doi.org/10.1145/168619.168629.

[57] P.M. Chen and B. D. Noble. When virtual is better than real. In Proceedings of the Workshop on Hot Topics in Operating Systems, HotOS’01, pages 133– 138, Elmau, Germany, May 2001. URL: https://doi.org/10.1109/ HOTOS.2001.990073.

[58] S. Chen. Direct SMARTS: Accelerating microarchitectural simulation through direct execution. Master’s thesis, Carnegie Mellon, Computer Architecture Lab, May 2004.

[59] Y. Chen and H. Chen. Scalable deterministic replay in a parallel full-system emulator. In Proceedings of the Symposium on Principles and Practice of Parallel Programming, PPoPP’13, pages 207–218, Shenzhen, China, Feb. 2013. URL: https://doi.org/10.1145/2442516.2442537.

[60] Y. Chen, S. Zhang, Q. Guo, L. Li, R. Wu, and T. Chen. Deterministic replay: A survey. ACM Computing Surveys (CSUR), 48(2):17, Nov. 2015. URL: https://doi.org/10.1145/2790077.

[61] A. Chernoff and R. Hookway. DIGITAL FX!32 running 32-bit x86 applications on Alpha NT. In Proceedings of the USENIX Windows NT Workshop, Seattle, Washington, USA, Aug. 1997. URL: https://www. usenix.org/legacy/publications/library/proceedings/ usenix-nt97/full_papers/chernoff/chernoff.pdf.

[62] J.-H. Chiang, H.-L. Li, and T.-c. Chiueh. Introspection-based memory de- duplication and migration. In Proceedings of the International Conference on Virtual Execution Environments, VEE’13, pages 51–62, Houston, Texas, USA, Mar. 2013. URL: https://doi.org/10.1145/2451512.2451525.

[63] J. Chow, T.Garfinkel, and P.M. Chen. Decoupling dynamic program analysis from execution in virtual environments. In Proceedings of the USENIX Annual Technical Conference, USENIX ATC’08, pages 1–14, Boston, Massachusetts, USA, June 2008. URL: https://www.usenix.org/legacy/event/ usenix08/tech/full_papers/chow/chow.pdf.

[64] J. Chow, D. Lucchetti, T. Garfinkel, G. Lefebvre, R. Gardner, J. Mason, S. Small, and P.M. Chen. Multi-stage replay with Crosscut. In Proceedings of the International Conference on Virtual Execution Environments, VEE’10, pages 13–24, Pittsburgh, Pennsylvania, USA, Mar. 2010. URL: https: //doi.org/10.1145/1735997.1736002.

[65] C. Cifuentes and K. J. Gough. Decompilation of binary programs. Software: Practice and Experience, 25(7):811–829, July 1995. URL: https://doi. org/10.1002/spe.4380250706.

[66] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In Proceedings of the Symposium on Networked Systems Design & Imple- mentation, NSDI’05, pages 273–286, Boston, Massachusetts, USA, May 2005. URL: https://www.usenix.org/legacy/events/nsdi05/ tech/full_papers/clark/clark.pdf.

[67] B. Cmelik and D. Keppel. Shade: A fast instruction-set simulator for execution profiling. In Fast Simulation of Computer Architectures, pages 5–46. Springer, 1995. URL: https://doi.org/10.1007/ 978-1-4615-2361-1_2.

[68] Y. Collet et al. Lz4: Extremely fast compression algorithm. https://lz4. github.io/lz4/, Apr. 2011. [Online; retrieved Jan. 21, 2019].

[69] L. Cui, J. Li, B. Li, J. Huai, C. Hu, T. Wo, H. Al-Aqrabi, and L. Liu. VMScatter: Migrate virtual machines to many hosts. In Proceedings of the International Conference on Virtual Execution Environments, VEE’13, pages 63–72, Houston, Texas, USA, Mar. 2013. URL: https://doi.org/10.1145/2451512.2451528.

[70] L. Cui, T. Wo, B. Li, J. Li, B. Shi, and J. Huai. Pars: A page-aware replication system for efficiently storing virtual machine snapshots. In Proceedings of the International Conference on Virtual Execution Environments, VEE’15, pages 215–228, Istanbul, Turkey, Mar. 2015. URL: https://doi.org/ 10.1145/2731186.2731190.

[71] T. Cui, H. Jin, X. Liao, and H. Liu. A virtual machine replay system based on para-virtualized Xen. In Proceedings on the International Conference on Network and Parallel Computing, NPC’09, pages 44–50, Gold Coast, Queensland, Australia, Oct. 2009. URL: https://doi.org/10.1109/ NPC.2009.29.

[72] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield. Remus: High availability via asynchronous virtual machine replication. In Proceedings of the Symposium on Networked Systems Design and Imple- mentation, NSDI’08, pages 161–174, San Francisco, California, USA, Apr. 2008. URL: https://www.usenix.org/legacy/events/nsdi08/ tech/full_papers/cully/cully.pdf.

[73] C. Dall and J. Nieh. KVM/ARM: Experiences building the linux ARM hy- pervisor. Technical Report CUCS-010-13, Columbia University, Department of Computer Science, June 2013. URL: https://doi.org/10.7916/ D8FN1FQS.

[74] C. Dall and J. Nieh. KVM/ARM: the design and implementation of the Linux ARM hypervisor. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’14, pages 333–348, Salt Lake City, Utah, USA, Mar. 2014. URL: https://doi.org/10.1145/2541940.2541946.

[75] A. d’Antras, C. Gorgovan, J. Garside, J. Goodacre, and M. Luján. HyperMAMBO-X64: Using virtualization to support high-performance transparent binary translation. In Proceedings of the International Conference on Virtual Execution Environments, VEE’17, pages 228–241, Xi’an, China, Apr. 2017. URL: https://doi.org/10.1145/3050748.3050756.

[76] D. A. de Oliveira, J. R. Crandall, G. Wassermann, S. F. Wu, Z. Su, and F. T. Chong. ExecRecorder: VM-based full-system replay for attack analysis and system recovery. In Proceedings of the Workshop on Architectural and System Support for Improving Software Dependability, ASID’06, pages 66–71, San Jose, California, USA, Oct. 2006. URL: https://doi.org/10.1145/ 1181309.1181320.

[77] P. J. Denning. Virtual memory. ACM Computing Surveys (CSUR), 2(3):153–189, Sept. 1970. URL: https://doi.org/10.1145/356571.356573.

[78] P.J. Denning. Working sets past and present. IEEE Transactions on Software Engineering, SE-6(1):64–84, Jan. 1980. URL: https://doi.org/10. 1109/TSE.1980.230464.

[79] U. Deshpande, X. Wang, and K. Gopalan. Live gang migration of virtual ma- chines. In Proceedings of the International Symposium on High Performance Distributed Computing, HPDC’11, pages 135–146, San Jose, California, USA, June 2011. URL: https://doi.org/10.1145/1996130.1996151.

[80] R. Desikan, D. Burger, and S. W. Keckler. Measuring experimental error in microprocessor simulation. In Proceedings of the International Symposium on Computer Architecture, ISCA’11, pages 266–277, Göteborg, Sweden, July 2001. URL: https://doi.org/10.1145/379240.565338.

[81] J.-H. Ding, P.-C.Chang, W.-C. Hsu, and Y.-C. Chung. PQEMU: A parallel system emulator based on QEMU. In Proceedings of the International Conference on Parallel and Distributed Systems, ICPADS’11, pages 276–283, Tainan, Taiwan, Dec. 2011. URL: https://doi.org/10.1109/icpads. 2011.102.

[82] J.-H. Ding, C.-J. Lin, P.-H. Chang, C.-H. Tsang, W.-C. Hsu, and Y.-C. Chung. ARMvisor: System virtualization for ARM. In Proceedings of the Ottawa Linux Symposium, OLS’12, pages 93–107, Ottawa, Ontario, Canada, July 2012. URL: https://www.kernel.org/doc/ols/2012/ ols2012-chang.pdf.

[83] B. Dolan-Gavitt, T. Leek, J. Hodosh, and W. Lee. Tappan zee (north) bridge: mining memory accesses for introspection. In Proceedings of the Conference on Computer & Communications Security, CCS’13, pages 839– 850, Berlin, Germany, Nov. 2013. URL: https://doi.org/10.1145/ 2508859.2516697.

[84] X. Dong, N. Muralimanohar, N. Jouppi, R. Kaufmann, and Y. Xie. Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC’09, page 57, Portland, Oregon, USA, Nov. 2009. URL: https://doi.org/10.1145/1654059.1654117.

[85] P. Dovgalyuk. Deterministic replay of system’s execution with multi-target QEMU simulator for dynamic analysis and reverse debugging. In Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR’12, pages 553–556, Szeged, Hungary, Mar. 2012. URL: https://doi.org/10.1109/CSMR.2012.74.

[86] Y. Du and H. Yu. Paratus: Instantaneous failover via virtual machine replication. In Proceedings of the International Conference on Grid and Cooperative Computing, GCC’09, pages 307–312, Lanzhou, Gansu, China, Aug. 2009. URL: https://doi.org/10.1109/GCC.2009.58.

[87] G. W. Dunlap, S. T. King, S. Cinar, M. A. Basrai, and P.M. Chen. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the Symposium on Operating Systems Design and Implemen- tation, OSDI’02, pages 211–224, Boston, Massachusetts, USA, Dec. 2002. URL: https://doi.org/10.1145/844128.844148.

[88] G. W. Dunlap, D. G. Lucchetti, M. A. Fetterman, and P. M. Chen. Ex- ecution replay of multiprocessor virtual machines. In Proceedings of the International Conference on Virtual Execution Environments, VEE’08, pages 121–130, Seattle, Washington, USA, Mar. 2008. URL: https: //doi.org/10.1145/1346256.1346273.

[89] B. Eicher. Virtual machine checkpoint storage and distribution for Simu- Boost. Master’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Sept. 2015.

[90] M. Ekman and P.Stenstrom. A robust main-memory compression scheme. In Proceedings of the International Symposium on Computer Architecture, ISCA’05, pages 74–85, Madison, Wisconsin USA, June 2005. URL: https: //doi.org/10.1109/ISCA.2005.6.

[91] F. A. Endo, D. Couroussé, and H.-P.Charles. Micro-architectural simula- tion of in-order and out-of-order ARM microprocessors with gem5. In Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS’14, pages 266–273, Agios Konstantinos, Greece, July 2014. URL: https://doi.org/10.1109/ SAMOS.2014.6893220.

[92] A. Fog. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. http://www. agner.org/optimize/instruction_tables.pdf, 2018. [Online; retrieved Apr. 16, 2018].

[93] A. Fog. The microarchitecture of Intel, AMD and VIA CPUs: An optimiza- tion guide for assembly programmers and compiler makers. http://www. agner.org/optimize/microarchitecture.pdf, 2018. [Online; re- trieved Apr. 13, 2018].

[94] J. Fotheringham. Dynamic storage allocation in the Atlas computer, including an automatic use of a backing store. Communications of the ACM, 4(10):435–436, Oct. 1961. URL: https://doi.org/10.1145/366786.366800.

[95] S.-Y. Fu, J.-J. Wu, and W.-C. Hsu. Improving SIMD code generation in QEMU. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, DATE’15, pages 1233–1236, Grenoble, France, Mar. 2015. URL: https://doi.org/10.7873/date.2015.0356.

[96] J.-l. Gailly and M. Adler. Zlib compression library. 2004.

[97] T. Garfinkel and M. Rosenblum. A virtual machine introspection based architecture for intrusion detection. In Proceedings of the Symposium on Network and Distributed System Security, NDSS’03, pages 191–206, San Diego, California, USA, Feb. 2003.

[98] R. Garg. DéjàVu: Live record-replay of virtual machines for malware analysis. Technical report, Northeastern University, Khoury College of Computer Sciences, May 2012. URL: http://www.ccs.northeastern. edu/home/rohgarg/cs5650sp12-14457303-235210-130.pdf.

[99] T. Garnett. Dynamic optimization of IA-32 applications under DynamoRIO. PhD thesis, Massachusetts Institute of Technology, May 2003.

[100] B. Gerofi, Z. Vass, and Y. Ishikawa. Utilizing memory content similarity for improving the performance of replicated virtual machines. In Proceedings of International Conference on Utility and , UCC’11, pages 73–80, Victoria, New South Wales, Australia, Dec. 2011. URL: https: //doi.org/10.1109/UCC.2011.20.

[101] S. Girbal, G. Mouchard, A. Cohen, and O. Temam. DiST: A simple, reli- able and scalable method to significantly reduce processor architecture simulation time. In Proceedings of the International Conference on Measure- ment and Modeling of Computer Systems, SIGMETRICS’03, pages 1–12, San Diego, California, USA, June 2003. URL: https://doi.org/10.1145/ 781027.781029.

[102] R. P. Goldberg. Architecture of virtual machines. In Proceedings of the Workshop on Virtual Computer Systems, pages 74–112, Cambridge, Mas- sachusetts, USA, Mar. 1973. ACM. URL: https://doi.org/10.1145/ 800122.803950.

[103] R. P.Goldberg. Survey of virtual machine research. Computer, 7(6):34–45, June 1974. URL: https://doi.org/10.1109/MC.1974.6323581.

[104] J. Gray. Why do computers stop and what can be done about it? In Proceedings of the Symposium on Reliability in Distributed Software and Database Systems, SRDS’86, pages 3–12, Los Angeles, California, USA, Jan. 1986. URL: https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf.

[105] T. Gröninger. On statistical properties of duplicate memory pages. Diploma thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Oct. 2013.

[106] P.H. Gum. System/370 extended architecture: facilities for virtual ma- chines. IBM Journal of Research and Development, 27(6):530–544, 1983. URL: https://doi.org/10.1147/rd.276.0530.

[107] D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat. Difference Engine: Harnessing memory redun- dancy in virtual machines. In Proceedings of the Symposium on Operating Systems Design and Implementation, OSDI’08, pages 85–93, San Diego, Cal- ifornia, USA, Dec. 2008. URL: https://doi.org/10.1145/1831407. 1831429.

[108] S. Hacking and B. Hudzia. Improving the live migration process of large enterprise applications. In Proceedings of the International Workshop on Virtualization Technologies in Distributed Computing, VTDC’09, pages 51– 58, Barcelona, Spain, June 2009. URL: https://doi.org/10.1145/ 1555336.1555346.

[109] G. Hamerly, E. Perelman, J. Lau, and B. Calder. SimPoint 3.0: Faster and more flexible program phase analysis. Journal of Instruction Level Paral- lelism, 7(4):1–28, 2005. URL: http://cseweb.ucsd.edu/~calder/ papers/JILP-05-SimPoint3.pdf.

[110] S. Hassani, G. Southern, and J. Renau. LiveSim: Going live with mi- croarchitecture simulation. In Proceedings of the International Sympo- sium on High Performance Computer Architecture, HPCA’16, pages 606–617, Barcelona, Spain, Mar. 2016. URL: https://doi.org/10.1109/hpca. 2016.7446098.

[111] P.Heidelberger and H. S. Stone. Parallel trace-driven cache simulation by time partitioning. In Proceedings of the Winter 1990 Simulation Conference, pages 734–737, New Orleans, Louisiana, USA, Dec. 1990. URL: https: //doi.org/10.1109/wsc.1990.129605.

[112] J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative ap- proach. Morgan Kaufmann, fourth edition, 2007. ISBN: 978-0123704900.

[113] M. R. Hines and K. Gopalan. Post-copy based live virtual machine migration using adaptive pre-paging and dynamic self-ballooning. In Proceedings of the International Conference on Virtual Execution Environments, VEE’09, pages 51–60, Washington, D.C., USA, Mar. 2009. URL: https://doi.org/10.1145/1508293.1508301.

[114] T. Hirofuchi, H. Nakada, S. Itoh, and S. Sekiguchi. Reactive Cloud: Con- solidating virtual machines with postcopy live migration. Information and Media Technologies, 7(2):614–626, 2012. URL: https://doi.org/10. 11185/imt.7.614.

[115] D.-Y. Hong, C.-C. Hsu, P.-C.Yew, J.-J. Wu, W.-C. Hsu, P.Liu, C.-M. Wang, and Y.-C. Chung. HQEMU: a multi-threaded and retargetable dynamic binary translator on multicores. In Proceedings of the International Sympo- sium on Code Generation and Optimization, CGO’12, pages 104–113, San Jose, California, USA, Mar. 2012. URL: https://doi.org/10.1145/ 2259016.2259030.

[116] R. N. Horspool and N. Marovac. An approach to the problem of detransla- tion of computer programs. The Computer Journal, 23(3):223–229, Aug. 1980. URL: https://doi.org/10.1093/comjnl/23.3.223.

[117] K.-Y. Hou, K. G. Shin, and J.-L. Sung. Application-assisted live migration of virtual machines with Java applications. In Proceedings of the European Conference on Computer Systems, EuroSys’15, page 15, Bordeaux, France, Apr. 2015. URL: https://doi.org/10.1145/2741948.2741950.

[118] K.-Y. Hou, K. G. Shin, Y. Turner, and S. Singhal. Tradeoffs in compressing virtual machine checkpoints. In Proceedings of the International Workshop on Virtualization Technologies in Distributed Computing, VTDC’13, pages 41–48, New York, USA, June 2013. URL: https://doi.org/10.1145/2465829.2465834.

[119] C.-C. Hsu, P. Liu, C.-M. Wang, J.-J. Wu, D.-Y. Hong, P.-C. Yew, and W.-C. Hsu. LnQ: Building high performance dynamic binary translators with existing compiler backends. In The International Conference on Parallel Processing, ICPP’11, pages 226–234, Taipei City, Taiwan, Sept. 2011. URL: https://doi.org/10.1109/ICPP.2011.57.

[120] C.-C. Hsu, P. Liu, J.-J. Wu, P.-C. Yew, D.-Y. Hong, W.-C. Hsu, and C.-M. Wang. Improving dynamic binary optimization through early-exit guided code region formation. In Proceedings of the International Conference on Virtual Execution Environments, VEE’13, pages 23–32, Houston, Texas, USA, Mar. 2013. URL: https://doi.org/10.1145/2451512.2451519.

[121] J. Huang and T.-C. Peng. Analysis of x86 instruction set usage for DOS/Windows applications and its implication on superscalar design. IEICE Transactions on Information and Systems, 85(6):929–939, June 2002.

[122] W. Huang. Introduction of AMD advanced virtual interrupt controller. XenSummit, San Diego, California, USA, Aug. 2012.

[123] K. Z. Ibrahim, S. Hofmeyr, C. Iancu, and E. Roman. Optimized pre-copy live migration for memory intensive applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC’11, page 40, Seattle, Washington, Nov. 2011. URL: https://doi.org/10.1145/2063384.2063437.

[124] IEEE. Portable Operating System Interface (POSIX(R)) Base Specifications, Issue 7, Jan. 2018.

[125] Intel. Intel 64 and IA-32 architectures software developer’s manual - combined volumes, Dec. 2017.

[126] C. Irvine and J. Robin. Analysis of the Intel Pentium’s ability to support a secure virtual machine monitor. In Proceedings of USENIX Security Symposium, USENIX SEC’00, pages 129–144, Denver, Colorado, USA, Aug. 2000. URL: https://www.usenix.org/legacy/events/sec00/full_papers/robin/robin.pdf.

[127] E. Itskova. Echo: A deterministic record/replay framework for debugging multithreaded applications. Technical report, Imperial College, London, June 2006. URL: http://www3.imperial.ac.uk/pls/portallive/docs/1/18619711.PDF.

[128] S. Jaffer, P. Kedia, and S. Bansal. Improving remote desktopping through adaptive record/replay. In Proceedings of the International Conference on Virtual Execution Environments, VEE’15, pages 161–172, Istanbul, Turkey, Mar. 2015. URL: https://doi.org/10.1145/2731186.2731193.

[129] N. Jia, C. Yang, J. Wang, D. Tong, and K. Wang. SPIRE: improving dynamic binary translation through SPC-indexed indirect branch redirecting. In Proceedings of the International Conference on Virtual Execution Environments, VEE’13, pages 1–12, Houston, Texas, USA, Mar. 2013. URL: https://doi.org/10.1145/2451512.2451516.

[130] H. Jin, L. Deng, S. Wu, X. Shi, and X. Pan. Live virtual machine migration with adaptive memory compression. In Proceedings of the International Conference on Cluster Computing, CLUSTER’09, pages 1–10, New Orleans, Louisiana, USA, Aug. 2009. URL: https://doi.org/10.1109/CLUSTR.2009.5289170.

[131] C. Jo, E. Gustafsson, J. Son, and B. Egger. Efficient live migration of virtual machines using shared storage. In Proceedings of the International Conference on Virtual Execution Environments, VEE’13, pages 41–50, Houston, Texas, USA, Mar. 2013. URL: https://doi.org/10.1145/2451512.2451524.

[132] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Geiger: Monitoring the buffer cache in a virtual machine environment. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’06, pages 14–24, San Jose, California, USA, Oct. 2006. URL: https://doi.org/10.1145/1168857.1168861.

[133] M. Jurczyk and G. Coldwind. Bochspwn: Identifying and exploiting Windows kernel race conditions via memory access patterns. In The Symposium on Security for Asia Network, SyScan’13, Singapore, Sept. 2013. URL: https://ai.google/research/pubs/pub42189.

[134] A. Kangarlou, P. Eugster, and D. Xu. VNsnap: Taking snapshots of virtual networked environments with minimal downtime. In Proceedings of the International Conference on Dependable Systems & Networks, DSN’09, pages 524–533, Lisbon, Portugal, July 2009. URL: https://doi.org/10.1109/DSN.2009.5270298.

[135] S. Kim, F. Liu, Y. Solihin, R. Iyer, L. Zhao, and W. Cohen. Accelerating full-system simulation through characterizing and predicting operating system performance. In Proceedings of the International Symposium on Performance Analysis of Systems & Software, ISPASS’07, pages 1–11, San Jose, California, USA, May 2007. URL: https://doi.org/10.1109/ISPASS.2007.363731.

[136] S. T. King, G. W. Dunlap, and P. M. Chen. Operating system support for virtual machines. In Proceedings of the USENIX Annual Technical Conference, General Track, USENIX ATC’03, pages 71–84, San Antonio, Texas, USA, June 2003. URL: https://www.usenix.org/legacy/events/usenix03/tech/full_papers/king/king.pdf.

[137] S. T. King, G. W. Dunlap, and P. M. Chen. Debugging operating systems with time-traveling virtual machines. In Proceedings of the USENIX Annual Technical Conference, General Track, USENIX ATC’05, pages 1–15, Anaheim, California, USA, Apr. 2005. URL: https://www.usenix.org/events/usenix05/tech/general/king/king.pdf.

[138] J. Kiszka. Architecture of the kernel-based virtual machine (KVM). In 17th International Linux System Technology Conference, Nuremberg, Germany, Sept. 2010. URL: http://www.linux-kongress.org/2010/slides/KVM-Architecture-LK2010.pdf.

[139] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori. kvm: the Linux virtual machine monitor. In Proceedings of the Linux Symposium, pages 225–230, Ottawa, Ontario, Canada, June 2007. URL: https://www.kernel.org/doc/ols/2007/ols2007v1-pages-225-230.pdf.

[140] A. KleinOsowski and D. J. Lilja. MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research. Computer Architecture Letters, 1(1):7, 2002. URL: https://doi.org/10.1109/L-CA.2002.8.

[141] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom. Spectre attacks: Exploiting speculative execution. In Symposium on Security and Privacy, S&P’19, San Francisco, California, USA, May 2019. URL: https://arxiv.org/pdf/1801.01203.

[142] R. Koller and R. Rangaswami. I/O Deduplication: Utilizing content similarity to improve I/O performance. ACM Transactions on Storage (TOS), 6(3):13, Sept. 2010. URL: https://doi.org/10.1145/1837915.1837921.

[143] A. Koto, H. Yamada, K. Ohmura, and K. Kono. Towards unobtrusive VM live migration for cloud computing platforms. In Proceedings of the Asia-Pacific Workshop on Systems, APSYS’12, page 7, Seoul, Republic of Korea, July 2012. URL: https://doi.org/10.1145/2349896.2349903.

[144] M. Kozuch, M. Satyanarayanan, T. Bressoud, and Y. Ke. Efficient state transfer for Internet suspend/resume. Technical Report IRP-TR-02-03, Intel Research Pittsburgh, Apr. 2002. URL: http://www.cs.cmu.edu/afs/cs.cmu.edu/user/satya/Web/docdir/kozuch-irp-tech-report-may-2002.pdf.

[145] C. Krintz and R. Wolski. Using phase behavior in scientific application to guide Linux operating system customization. In 19th International Symposium on Parallel and Distributed Processing, IPDPS’05, page 219, Denver, Colorado, USA, Apr. 2005. URL: https://doi.org/10.1109/IPDPS.2005.449.

[146] H. A. Lagar-Cavilla, J. A. Whitney, A. M. Scannell, P. Patchin, S. M. Rumble, E. De Lara, M. Brudno, and M. Satyanarayanan. SnowFlock: Rapid virtual machine cloning for cloud computing. In Proceedings of the European Conference on Computer Systems, EuroSys’09, pages 1–12, Nuremberg, Germany, Apr. 2009. URL: https://doi.org/10.1145/1519065.1519067.

[147] B. Lampson, M. Abadi, M. Burrows, and E. Wobber. Authentication in distributed systems: Theory and practice. ACM Transactions on Computer Systems (TOCS), 10(4):265–310, Nov. 1992. URL: https://doi.org/10.1145/138873.138874.

[148] R. Lantz. Fast functional simulation with parallel Embra. In Proceedings of the Workshop on Modeling, Benchmarking and Simulation, MoBS’08, page 5, Beijing, China, June 2008. URL: https://www-cs.stanford.edu/~rlantz/papers/lantz-mobs08-final.pdf.

[149] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization, CGO’04, pages 75–88, Palo Alto, California, USA, Mar. 2004. URL: https://doi.org/10.1109/cgo.2004.1281665.

[150] M. A. Laurenzano, M. M. Tikir, L. Carrington, and A. Snavely. PEBIL: Efficient static binary instrumentation for Linux. In Proceedings of the International Symposium on Performance Analysis of Systems & Software, ISPASS’10, pages 175–183, White Plains, New York, USA, Mar. 2010. URL: https://doi.org/10.1109/ISPASS.2010.5452024.

[151] G. Lauterbach. Accelerating architectural simulation by parallel execution of trace samples. In Proceedings of the Hawaii International Conference on System Sciences, HICSS’94, Wailea, Hawaii, USA, Apr. 1994. URL: https://doi.org/10.1109/hicss.1994.323171.

[152] T. J. LeBlanc and J. M. Mellor-Crummey. Debugging parallel programs with Instant Replay. Technical Report ADA179902, Rochester University, Department of Computer Science, Sept. 1986. URL: https://apps.dtic.mil/dtic/tr/fulltext/u2/a179902.pdf.

[153] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: Efficient online multiprocessor replay via speculation and external determinism. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’10, pages 77–90, Pittsburgh, Pennsylvania, USA, Mar. 2010. URL: https://doi.org/10.1145/1736020.1736031.

[154] S. Li. Linux v4.18 Page Aging. https://github.com/torvalds/linux/blob/v4.18/arch/x86/mm/pgtable.c#L522. [Online; retrieved Mar. 4, 2019].

[155] T. Lindholm, F. Yellin, G. Bracha, and A. Buckley. The Java virtual machine specification. Oracle, twelfth edition, Feb. 2019.

[156] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg. Meltdown: Reading kernel memory from user space. In USENIX Security Symposium, USENIX Security’18, Baltimore, Maryland, USA, Aug. 2018. URL: https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-lipp.pdf.

[157] H. Liu, H. Jin, X. Liao, L. Hu, and C. Yu. Live migration of virtual machine based on full system trace and replay. In Proceedings of the International Symposium on High Performance Distributed Computing, HPDC’09, pages 101–110, Garching, Germany, June 2009. URL: https://doi.org/10.1145/1551609.1551630.

[158] P. Liu, Z. Yang, X. Song, Y. Zhou, H. Chen, and B. Zang. Heterogeneous live migration of virtual machines. In International Workshop on Virtualization Technology, IWVT’08, 2008.

[159] G. H. Loh, S. Subramaniam, and Y. Xie. Zesto: A cycle-level simulator for highly detailed microarchitecture exploration. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS’09, pages 53–64, Boston, Massachusetts, USA, Apr. 2009. URL: https://doi.org/10.1109/ISPASS.2009.4919638.

[160] M. Lu and T.-c. Chiueh. Fast memory state synchronization for virtualization-based fault tolerance. In Proceedings of the International Conference on Dependable Systems & Networks, DSN’09, pages 534–543, Lisbon, Portugal, Sept. 2009. URL: https://doi.org/10.1109/DSN.2009.5270295.

[161] P. Lu, B. Ravindran, and C. Kim. VPC: Scalable, low downtime checkpointing for virtual clusters. In Proceedings of the International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD’12, pages 203–210, New York, New York, USA, Oct. 2012. URL: https://doi.org/10.1109/SBAC-PAD.2012.31.

[162] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the Conference on Programming Language Design and Implementation, PLDI’05, pages 190–200, Chicago, Illinois, USA, June 2005. URL: https://doi.org/10.1145/1065010.1065034.

[163] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, Aug. 2002. URL: https://doi.org/10.1109/2.982916.

[164] P. S. Magnusson, F. Larsson, A. Moestedt, B. Werner, J. Nilsson, P. Stenström, F. Lundholm, M. Karlsson, F. Dahlgren, and H. Grahn. Simics/sun4m: A virtual workstation. In Proceedings of the USENIX Annual Technical Conference, USENIX’98, pages 119–130, New Orleans, Louisiana, USA, June 1998. URL: https://www.usenix.org/legacy/publications/library/proceedings/usenix98/full_papers/magnusson/magnusson.pdf.

[165] P. S. Magnusson and B. Werner. Some efficient techniques for simulating memory. Technical Report SICS-R–94/16–SE, Sept. 1994. URL: http://soda.swedish-ict.se/2136/.

[166] V. V. Malyugin, B. Weissman, G. Venkitachalam, and M. Xu. Synchronizing a translation lookaside buffer with page tables, Mar. 2018. US Patent 9,928,180.

[167] L. Martignoni, R. Paleari, G. F. Roglia, and D. Bruschi. Testing CPU emulators. In Proceedings of the International Symposium on Software Testing and Analysis, ISSTA’09, pages 261–272, Chicago, Illinois, USA, July 2009. URL: https://doi.org/10.1145/1572272.1572303.

[168] C. May. Mimic: a fast System/370 simulator. In Proceedings of the Symposium on Interpreters and Interpretive Techniques, pages 1–13, St. Paul, Minnesota, USA, June 1987. URL: https://doi.org/10.1145/29650.29651.

[169] A. M. G. Maynard, C. M. Donnelly, and B. R. Olszewski. Contrasting characteristics and cache performance of technical and multi-user commercial workloads. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’94, pages 145–156, San Jose, California, USA, Oct. 1994. URL: https://doi.org/10.1145/195473.195524.

[170] N. Megiddo and D. S. Modha. ARC: A self-tuning, low overhead replacement cache. In Proceedings of the USENIX Conference on File and Storage Technologies, FAST’03, pages 115–130, San Francisco, California, USA, Mar. 2003. URL: https://www.usenix.org/legacy/event/fast03/tech/full_papers/megiddo/megiddo.pdf.

[171] T. Merrifield and H. R. Taheri. Performance implications of extended page tables on virtualized x86 processors. In Proceedings of the International Conference on Virtual Execution Environments, VEE’16, pages 25–35, Atlanta, Georgia, USA, Apr. 2016. URL: https://doi.org/10.1145/2892242.2892258.

[172] R. A. Meyer and L. H. Seawright. A virtual machine time-sharing system. IBM Systems Journal, 9(3):199–218, 1970. URL: http://www.eecs.harvard.edu/~margo/cs261/papers/meyer-1970.pdf.

[173] Microsoft. Hyper-V. https://msdn.microsoft.com/de-de/library/mt169373(v=ws.11).aspx. [Online; retrieved Apr. 4, 2018].

[174] D. Mihocka and S. Shwartsman. Virtualization without direct execution or jitting: Designing a portable virtual machine infrastructure. In Workshop on Architectural and Microarchitectural Support for Binary Translation, AMAS-BT’08, Beijing, China, June 2008. URL: http://www.emulators.com/docs/VirtNoJit_Paper.pdf.

[175] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. Graphite: A distributed parallel simulator for multicores. In Proceedings of the International Symposium on High Performance Computer Architecture, HPCA’10, pages 1–12, Bangalore, India, Jan. 2010. URL: https://doi.org/10.1109/hpca.2010.5416635.

[176] K. Miller. Efficient Main Memory Deduplication Through Cross Layer Integration. PhD thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, July 2014.

[177] K. Miller, F. Franz, M. Rittinghaus, M. Hillenbrand, and F. Bellosa. XLH: More effective memory deduplication scanners through cross-layer hints. In Proceedings of the 2013 USENIX Annual Technical Conference, USENIX ATC’13, pages 279–290, San Jose, California, USA, June 2013.

[178] R. B. Miller. Response time in man-computer conversational transactions. In Proceedings of the Fall 1968 Joint Computer Conference, pages 267–277, Dec. 1968.

[179] G. Miłós, D. G. Murray, S. Hand, and M. A. Fetterman. Satori: Enlightened page sharing. In Proceedings of the USENIX Annual Technical Conference, USENIX ATC’09, pages 1–14, San Diego, California, USA, June 2009. URL: http://static.usenix.org/events/usenix09/tech/full_papers/milos/milos.pdf.

[180] U. F. Minhas, S. Rajagopalan, B. Cully, A. Aboulnaga, K. Salem, and A. Warfield. RemusDB: Transparent high availability for database systems. The International Journal on Very Large Data Bases (VLDB), 22(1):29–45, Feb. 2013. URL: https://doi.org/10.1007/s00778-012-0294-6.

[181] F. F. Moghaddam and M. Cheriet. Decreasing live virtual machine migration down-time using a memory page selection based on memory change PDF. In Proceedings of the International Conference on Networking, Sensing and Control, ICNSC’10, pages 355–359, Chicago, Illinois, USA, Apr. 2010. URL: https://doi.org/10.1109/ICNSC.2010.5461517.

[182] P. Montesinos, L. Ceze, and J. Torrellas. DeLorean: Recording and deterministically replaying shared-memory multiprocessor execution efficiently. In Proceedings of the International Symposium on Computer Architecture, ISCA’08, pages 289–300, Beijing, China, June 2008. URL: https://doi.org/10.1109/ISCA.2008.36.

[183] B. Morbach. Accurate record and replay of x86 MMU behavior for SimuBoost. Master’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Sept. 2018.

[184] T. Moseley, A. Shye, V. J. Reddi, D. Grunwald, and R. Peri. Shadow Profiling: Hiding instrumentation costs with parallelism. In Proceedings of the International Symposium on Code Generation and Optimization, CGO’07, pages 198–208, San Jose, California, USA, Mar. 2007. URL: https://doi.org/10.1109/cgo.2007.35.

[185] S. Narayanasamy, G. Pokam, and B. Calder. BugNet: Continuously recording program execution for deterministic replay debugging. In Proceedings of the International Symposium on Computer Architecture, ISCA’05, pages 284–295, Madison, Wisconsin, USA, June 2005. URL: https://doi.org/10.1109/ISCA.2005.16.

[186] S. Nathan, U. Bellur, and P. Kulkarni. Towards a comprehensive performance model of virtual machine live migration. In Proceedings of the Symposium on Cloud Computing, SoCC’15, pages 288–301, Kohala Coast, Hawaii, Aug. 2015. URL: https://doi.org/10.1145/2806777.2806838.

[187] S. Nathan, U. Bellur, and P. Kulkarni. On selecting the right optimizations for virtual machine migration. In Proceedings of the International Conference on Virtual Execution Environments, VEE’16, pages 37–49, Atlanta, Georgia, USA, Apr. 2016. URL: https://doi.org/10.1145/2892242.2892247.

[188] S. Nathan, P. Kulkarni, and U. Bellur. Resource availability based performance benchmarking of virtual machine migrations. In Proceedings of the International Conference on Performance Engineering, ICPE’13, pages 387–398, Prague, Czech Republic, Apr. 2013. URL: https://doi.org/10.1145/2479871.2479932.

[189] M. Nelson, B.-H. Lim, and G. Hutchins. Fast transparent migration for virtual machines. In Proceedings of the USENIX Annual Technical Conference, General Track, USENIX ATC’05, pages 391–394, Anaheim, California, USA, Apr. 2005. URL: http://static.usenix.org/legacy/events/usenix05/tech/general/full_papers/short_papers/nelson/nelson.pdf.

[190] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the Conference on Programming Language Design and Implementation, PLDI’07, pages 89–100, San Diego, California, USA, June 2007. URL: https://doi.org/10.1145/1250734.1250746.

[191] R. H. Netzer. Optimal tracing and replay for debugging shared-memory parallel programs. Technical Report CS-93-15, Brown University, Department of Computer Science, Sept. 1993. URL: ftp://ftp.cs.brown.edu/pub/techreports/93/cs93-15.pdf.

[192] A.-T. Nguyen, P. Bose, K. Ekanadham, A. Nanda, and M. Michael. Accuracy and speed-up of parallel trace-driven architectural simulation. In Proceedings of the International Parallel Processing Symposium, IPPS’97, pages 39–44, Geneva, Switzerland, Aug. 1997. URL: https://doi.org/10.1109/ipps.1997.580842.

[193] K. Nguyen. APIC virtualization performance testing and Iozone. https://software.intel.com/en-us/blogs/2013/12/17/apic-virtualization-performance-testing-and-iozone, Dec. 2013. [Online; retrieved May 23, 2018].

[194] O. Oppitz. A particular bug trap: Execution replay using virtual machines. In Workshop on Automated and Algorithmic Debugging, AADEBUG’03, Ghent, Belgium, Sept. 2003. URL: https://arxiv.org/pdf/cs/0310030.

[195] Oracle. VirtualBox. http://www.virtualbox.org. [Online; retrieved Apr. 4, 2018].

[196] V. S. Pai, P. Ranganathan, and S. V. Adve. RSIM: An execution-driven simulator for ILP-based shared-memory multiprocessors and uniprocessors. Technical Committee on Computer Architecture Newsletter, Oct. 1997. URL: https://scholarship.rice.edu/handle/1911/20182.

[197] D. Z. Pan and M. A. Linton. Supporting reverse execution for parallel programs. In Proceedings of the Workshop on Parallel and Distributed Debugging, PADD’88, pages 124–129, Madison, Wisconsin, USA, May 1988. URL: https://doi.org/10.1145/68210.69227.

[198] E. Park, B. Egger, and J. Lee. Fast and space-efficient virtual machine checkpointing. In Proceedings of the International Conference on Virtual Execution Environments, VEE’11, pages 75–86, Newport Beach, California, USA, Mar. 2011. URL: https://doi.org/10.1145/1952682.1952694.

[199] A. Patel, F. Afram, and K. Ghose. MARSS-x86: A QEMU-based microarchitectural and systems simulator for x86 multicore processors. In Proceedings of the International QEMU Users’ Forum, pages 29–30, Grenoble, France, Mar. 2011. URL: http://adt.cs.upb.de/quf/quf11/QUF11-papers/quf2011-10.pdf.

[200] H. Patil, C. Pereira, M. Stallcup, G. Lueck, and J. Cownie. PinPlay: A framework for deterministic replay and reproducible analysis of parallel programs. In Proceedings of the International Symposium on Code Generation and Optimization, CGO’10, pages 2–11, Toronto, Ontario, Canada, Apr. 2010. URL: https://doi.org/10.1145/1772954.1772958.

[201] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, fifth edition, 2013. ISBN: 978-0124077263.

[202] PCI-SIG. Single Root I/O Virtualization and Sharing Specification, Revision 1.1, 2010.

[203] N. Penneman, D. Kudinskas, A. Rawsthorne, B. De Sutter, and K. De Bosschere. Formal virtualization requirements for the ARM architecture. Journal of Systems Architecture, 59(3):144–154, Mar. 2013. URL: https://doi.org/10.1016/j.sysarc.2013.02.003.

[204] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under UNIX. In Proceedings of the USENIX Technical Conference, USENIX TCON’95, New Orleans, Louisiana, USA, Jan. 1995. URL: https://www.usenix.org/legacy/publications/library/proceedings/neworl/plank.html.

[205] G. Pokam, K. Danne, C. Pereira, R. Kassa, T. Kranich, S. Hu, J. Gottschlich, N. Honarmand, N. Dautenhahn, S. T. King, and J. Torrellas. QuickRec: Prototyping an Intel architecture extension for record and replay of multithreaded programs. In Proceedings of the International Symposium on Computer Architecture, ISCA’13, pages 643–654, Tel-Aviv, Israel, June 2013. URL: https://doi.org/10.1145/2485922.2485977.

[206] G. J. Popek and R. P. Goldberg. Formal requirements for virtualizable third generation architectures. Communications of the ACM, 17(7):412–421, July 1974. URL: https://doi.org/10.1145/361011.361073.

[207] A. Portero, A. Scionti, Z. Yu, P. Faraboschi, C. Concatto, L. Carro, A. Garbade, S. Weis, T. Ungerer, and R. Giorgi. Simulating the future kilo x86-64 core processors and their infrastructure. In Proceedings of the Simulation Symposium, ANSS’12, Orlando, Florida, USA, Mar. 2012. URL: http://www3.diism.unisi.it/~giorgi/papers/Portero12a.pdf.

[208] A. Pranckevičius. More hash function tests. https://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/, Aug. 2016. [Online; retrieved Jan. 29, 2019].

[209] A. Pusch. Checkpoint distribution for SimuBoost. Master’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Oct. 2017.

[210] W. Qin, J. D’Errico, and X. Zhu. A multiprocessing approach to accelerate retargetable and portable dynamic-compiled instruction-set simulation. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS’06, pages 193–198, Seoul, Korea, Oct. 2006. URL: https://doi.org/10.1145/1176254.1176302.

[211] A. Quinn, D. Devecsery, P. M. Chen, and J. Flinn. JetStream: Cluster-scale parallelization of information flow queries. In Proceedings of the Symposium on Operating Systems Design and Implementation, OSDI’16, pages 451–466, Savannah, Georgia, USA, Nov. 2016. URL: https://www.usenix.org/system/files/conference/osdi16/osdi16-quinn.pdf.

[212] T. Raffetseder, C. Kruegel, and E. Kirda. Detecting system emulators. In Proceedings of the International Conference on Information Security, ISC’07, pages 1–18, Valparaíso, Chile, Oct. 2007. URL: https://doi.org/10.1007/978-3-540-75496-1_1.

[213] S. Rajagopalan, B. Cully, R. O’Connor, and A. Warfield. SecondSite: Disaster tolerance as a service. In Proceedings of the Conference on Virtual Execution Environments, VEE’12, pages 97–108, London, England, Mar. 2012. URL: https://doi.org/10.1145/2151024.2151039.

[214] B. Randell and C. Kuehner. Dynamic storage allocation systems. Communications of the ACM, 11(5):297–306, May 1968.

[215] S. Ren, L. Tan, C. Li, Z. Xiao, and W. Song. Samsara: Efficient deterministic replay in multiprocessor environments with hardware virtualization extensions. In Proceedings of the USENIX Annual Technical Conference, USENIX ATC’16, pages 551–564, Denver, Colorado, USA, June 2016. URL: https://www.usenix.org/system/files/conference/atc16/atc16_paper-ren.pdf.

[216] P. Riteau, C. Morin, and T. Priol. Shrinker: Improving live migration of virtual clusters over WANs with distributed data deduplication and content-based addressing. In Proceedings of the European Conference on Parallel Processing, Euro-Par’11, pages 431–442, Bordeaux, France, Aug. 2011. URL: https://doi.org/10.1007/978-3-642-23400-2_40.

[217] M. Rittinghaus. Runtime benefits of memory deduplication. Diploma thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, July 2012.

[218] M. Rittinghaus, T. Groeninger, and F. Bellosa. Simutrace: A toolkit for full system memory tracing. White paper, Karlsruhe Institute of Technology (KIT), Operating Systems Group, May 2015.

[219] M. Rittinghaus, K. Miller, M. Hillenbrand, and F. Bellosa. SimuBoost: Scalable parallelization of functional system simulation. In Proceedings of the International Workshop on Dynamic Analysis, WODA’13, Houston, Texas, USA, Mar. 2013.

[220] M. Ronsse and K. De Bosschere. RecPlay: A fully integrated practical record/replay system. ACM Transactions on Computer Systems (TOCS), 17(2):133–152, 1999. URL: https://doi.org/10.1145/312203.312214.

[221] M. Rosenblum. The reincarnation of virtual machines. Queue - Virtual Machines, 2(5):34, July 2004. URL: https://doi.org/10.1145/1016998.1017000.

[222] M. Rosenblum and T. Garfinkel. Virtual machine monitors: Current technology and future trends. Computer, 38(5):39–47, May 2005. URL: https://doi.org/10.1109/MC.2005.176.

[223] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete computer system simulation: The SimOS approach. IEEE Parallel & Distributed Technology: Systems & Applications, 3(4):34–43, 1995. URL: https://doi.org/10.1109/88.473612.

[224] J. Ruh. Analyzing duplication in incremental high frequency checkpoints. Bachelor’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Sept. 2015.

[225] J. Ruh. Optimizing continuous checkpoints for deterministic replay. Master’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, July 2018.

[226] R. Russell. virtio: towards a de-facto standard for virtual I/O devices. ACM SIGOPS Operating Systems Review - Research and Developments in the Linux Kernel, 42(5):95–103, July 2008. URL: https://doi.org/10.1145/1400097.1400108.

[227] M. Russinovich and B. Cogswell. Replay for concurrent non-deterministic shared-memory applications. In Proceedings of the Conference on Programming Language Design and Implementation, PLDI’96, pages 258–266, Philadelphia, Pennsylvania, USA, May 1996. URL: https://doi.org/10.1145/231379.231432.

[228] Y. Saito. Jockey: A user-space library for record-replay debugging. In Proceedings of the International Symposium on Automated Analysis-driven Debugging, AADEBUG’05, pages 69–76, Monterey, California, USA, Sept. 2005. URL: https://doi.org/10.1145/1085130.1085139.

[229] A. Sandberg, N. Nikoleris, T. E. Carlson, E. Hagersten, S. Kaxiras, and D. Black-Schaffer. Full speed ahead: Detailed architectural simulation at near-native speed. In Proceedings of the International Symposium on Workload Characterization, IISWC’15, pages 183–192, Atlanta, Georgia, USA, Oct. 2015. URL: https://doi.org/10.1109/IISWC.2015.29.

[230] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. In Proceedings of the Symposium on Operating Systems Design and Implementation, OSDI’02, pages 377–390, Boston, Massachusetts, USA, Dec. 2002. URL: https://doi.org/10.1145/844128.844163.

[231] D. J. Scales, M. Nelson, and G. Venkitachalam. The design of a practical system for fault-tolerant virtual machines. ACM SIGOPS Operating Systems Review, 44(4):30–39, 2010. URL: https://doi.org/10.1145/1899928.1899932.

[232] M. Schmidt, N. Fallenbeck, M. Smith, and B. Freisleben. Efficient distribution of virtual machines for cloud computing. In Proceedings of the International Conference on Parallel, Distributed and Network-Based Processing, PDP’10, pages 567–574, Pisa, Italy, Feb. 2010. URL: https://doi.org/10.1109/PDP.2010.39.

[233] T. Schmidt. Evaluating techniques for full system memory tracing. Bachelor’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Oct. 2017.

[234] J. Schoetterl-Glausch. Intel page modification logging for lightweight continuous checkpointing. Bachelor’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Oct. 2016.

[235] J. Schötterl-Glausch. Exploring pre-scan, parallel copy, and large pages for continuous checkpointing. Master’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Nov. 2018.

[236] P. Sharma, S. Lee, T. Guo, D. Irwin, and P. Shenoy. SpotCheck: Designing a derivative IaaS cloud on the spot market. In Proceedings of the European Conference on Computer Systems, EuroSys’15, page 16, Bordeaux, France, Apr. 2015. URL: https://doi.org/10.1145/2741948.2741953.

[237] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’02, pages 45–57, San Jose, California, USA, Oct. 2002. URL: https://doi.org/10.1145/605397.605403.

[238] S. Shwartsman and D. Mihocka. How Bochs Works Under the Hood. Second edition, June 2012.

[239] B. Smith. ARM and Intel battle over the mobile chip’s future. Computer, 41(5):15–18, May 2008. URL: https://doi.org/10.1109/MC.2008.142.

[240] J. Smith and R. Nair. Virtual machines: versatile platforms for systems and processes. Morgan Kaufmann, July 2005. ISBN: 978-1558609105.

[241] X. Song, J. Shi, R. Liu, J. Yang, and H. Chen. Parallelizing live migration of virtual machines. In Proceedings of the International Conference on Virtual Execution Environments, VEE’13, pages 85–96, Houston, Texas, USA, Mar. 2013. URL: https://doi.org/10.1145/2451512.2451531.

[242] SPARC International, Inc. The SPARC Architecture Manual, eighth edition, 1992.

[243] T. Spink, H. Wagstaff, and B. Franke. Hardware-accelerated cross-architecture full-system virtualization. ACM Transactions on Architecture and Code Optimization (TACO), 13(4):36, Dec. 2016. URL: https://doi.org/10.1145/2996798.

[244] D. Srinivasan and X. Jiang. Time-traveling forensic analysis of VM-based high-interaction honeypots. In Proceedings of the International Conference on Security and Privacy in Communication Systems, SecureComm’11, pages 209–226, London, England, Sept. 2011. URL: https://doi.org/10.1007/978-3-642-31909-9_12.

[245] S. M. Srinivasan, S. Kandula, C. R. Andrews, and Y. Zhou. Flashback: A lightweight extension for rollback and deterministic replay for software debugging. In Proceedings of the USENIX Annual Technical Conference, General Track, USENIX ATC’04, pages 29–44, Boston, Massachusetts, USA, July 2004. URL: https://www.usenix.org/legacy/events/usenix04/tech/general/full_papers/srinivasan/srinivasan.pdf.

[246] J. Stevens, P. Tschirhart, and B. Jacob. Fast full system memory checkpointing with SSD-aware memory controller. In Proceedings of the International Symposium on Memory Systems, MEMSYS’16, pages 96–98, Alexandria, Virginia, USA, Oct. 2016. URL: https://doi.org/10.1145/2989081.2989126.

[247] J. Sugerman, G. Venkitachalam, and B.-H. Lim. Virtualizing I/O devices on VMware Workstation’s hosted virtual machine monitor. In Proceedings of the USENIX Annual Technical Conference, General Track, USENIX ATC’01, pages 1–14, Boston, Massachusetts, USA, June 2001. URL: http://usenix.org/publications/library/proceedings/usenix01/sugerman/sugerman.pdf.

[248] M. H. Sun and D. M. Blough. Fast, lightweight virtual machine checkpointing. Technical report, Georgia Institute of Technology, 2010. URL: http://hdl.handle.net/1853/36671.

[249] D. Sundmark and H. Thane. Pinpointing interrupts in embedded real-time systems using context checksums. In Proceedings of the International Conference on Emerging Technologies and Factory Automation, ETFA’08, pages 774–781, Hamburg, Germany, Sept. 2008. URL: https://doi.org/10.1109/ETFA.2008.4638487.

[250] P. Svärd, B. Hudzia, J. Tordsson, and E. Elmroth. Evaluation of delta compression techniques for efficient live migration of large virtual machines. In Proceedings of the International Conference on Virtual Execution Environments, VEE’11, pages 111–120, Newport Beach, California, USA, Mar. 2011. URL: https://doi.org/10.1145/1952682.1952698.

[251] P. Svärd, J. Tordsson, B. Hudzia, and E. Elmroth. High performance live migration through dynamic page transfer reordering and compression. In Proceedings of the International Conference on Cloud Computing Technology and Science, CloudCom’11, pages 542–548, Athens, Greece, Nov. 2011. URL: https://doi.org/10.1109/CloudCom.2011.82.

[252] P. Ta-Shma, G. Laden, M. Ben-Yehuda, and M. Factor. Virtual machine time travel using continuous data protection and checkpointing. ACM SIGOPS Operating Systems Review, 42(1):127–134, Jan. 2008. URL: https://doi.org/10.1145/1341312.1341341.

[253] Y. Tamura, K. Sato, S. Kihara, and S. Moriai. Kemari: Virtual machine synchronization for fault tolerance. In USENIX Annual Technical Conference (Poster Session), USENIX ATC’08, Boston, Massachusetts, USA, June 2008. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.580.7704&rep=rep1&type=pdf.

[254] A. S. Tanenbaum and H. Bos. Modern Operating Systems. Pearson Education, Inc, fourth edition, 2015. ISBN: 978-1292061429.

[255] M. M. Theimer, K. A. Lantz, and D. R. Cheriton. Preemptable remote execution facilities for the V-system. Technical Report ADA166948, Stanford University, Department of Computer Science, Sept. 1985. URL: https://apps.dtic.mil/dtic/tr/fulltext/u2/a166948.pdf.

[256] X. Tong, T. Koju, M. Kawahito, and A. Moshovos. Optimizing memory translation emulation in full system emulators. ACM Transactions on Architecture and Code Optimization (TACO), 11(4):60, Jan. 2015. URL: https://doi.org/10.1145/2686034.

[257] D. Trendafilov, N. Memon, and T. Suel. zdelta: An efficient delta compression tool. Technical Report CIS-2002-02, Polytechnic University, Brooklyn, Department of Computer and Information Science, June 2002.

[258] J. Tröger and D. Mihocka. Fast microcode interpretation with transactional commit/abort. In Workshop on Architectural and Microarchitectural Support for Binary Translation, AMAS-BT’11, San Jose, California, USA, June 2011. URL: http://www.emulators.com/docs/amas-bt2011.pdf.

[259] C.-C. Tu, M. Ferdman, C.-t. Lee, and T.-c. Chiueh. A comprehensive implementation and evaluation of direct interrupt delivery. In Proceedings of the International Conference on Virtual Execution Environments, VEE’15, pages 1–15, Istanbul, Turkey, Mar. 2015. URL: https://doi.org/10.1145/2731186.2731189.

[260] R. Uhlig, R. Fishtein, and O. Gershon. SoftSDV: A presilicon software development environment for the IA-64 architecture. Intel Technology Journal, Nov. 1999.

[261] G. Vallee, T. Naughton, H. Ong, and S. L. Scott. Checkpoint/restart of virtual machines based on Xen. In High Availability and Performance Computing Workshop, HAPCW’06, Santa Fe, New Mexico, USA, Oct. 2006. URL: https://www.cct.lsu.edu/~estrabd/LACSI2006/workshops/workshop1/papers/cr-xen-hapcw06-final.pdf.

[262] S. Veith. Towards heterogeneous record and replay on the ARM architecture. Master’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Jan. 2017.

[263] N. Viennot, S. Nair, and J. Nieh. Transparent mutable replay for multicore debugging and patch validation. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’13, pages 127–138, Houston, Texas, USA, Mar. 2013. URL: https://doi.org/10.1145/2451116.2451130.

[264] M. Vrable, J. Ma, J. Chen, D. Moore, E. Vandekieft, A. C. Snoeren, G. M. Voelker, and S. Savage. Scalability, fidelity, and containment in the Potemkin virtual honeyfarm. In Proceedings of the Symposium on Operating Systems Principles, SOSP’05, pages 148–162, Brighton, United Kingdom, Oct. 2005. URL: https://doi.org/10.1145/1095810.1095825.

[265] C. Waldspurger and M. Rosenblum. I/O virtualization. Queue - Virtualization, 9(11):30–39, Nov. 2011. URL: https://doi.org/10.1145/2063166.2071256.

[266] C. A. Waldspurger. Memory resource management in VMware ESX server. In Proceedings of the Symposium on Operating Systems Design and Implementation, OSDI’02, pages 181–194, Boston, Massachusetts, USA, Dec. 2002. URL: https://doi.org/10.1145/844128.844146.

[267] S. Wallace and K. Hazelwood. SuperPin: Parallelizing dynamic instrumentation for real-time performance. In Proceedings of the International Symposium on Code Generation and Optimization, CGO’07, pages 209–220, San Jose, California, USA, Mar. 2007. URL: https://doi.org/10.1109/cgo.2007.37.

[268] K. Wang, Y. Zhang, H. Wang, and X. Shen. Parallelization of IBM mambo system simulator in functional modes. ACM SIGOPS Operating Systems Review, 42(1):71–76, Jan. 2008. URL: https://doi.org/10.1145/1341312.1341325.

[269] L. Wang, Z. Kalbarczyk, R. K. Iyer, and A. Iyengar. VM-µCheckpoint: Design, modeling, and assessment of lightweight in-memory VM checkpointing. IEEE Transactions on Dependable and Secure Computing, 12(2):243–255, Mar. 2015. URL: https://doi.org/10.1109/TDSC.2014.2327967.

[270] Z. Wang, J. Li, C. Wu, D. Yang, Z. Wang, W.-C. Hsu, B. Li, and Y. Guan. HSPT: Practical implementation and efficient management of embedded shadow page tables for cross-ISA system virtual machines. In Proceedings of the International Conference on Virtual Execution Environments, VEE’15, pages 53–64, Istanbul, Turkey, Mar. 2015. URL: https://doi.org/10.1145/2731186.2731188.

[271] Z. Wang, R. Liu, Y. Chen, X. Wu, H. Chen, W. Zhang, and B. Zang. COREMU: A scalable and portable parallel full-system emulator. In Proceedings of the Symposium on Principles and Practice of Parallel Programming, PPoPP’11, pages 213–222, San Antonio, Texas, USA, Feb. 2011. URL: https://doi.org/10.1145/1941553.1941583.

[272] V. M. Weaver and S. A. McKee. Can hardware performance counters be trusted? In Proceedings of the International Symposium on Workload Characterization, IISWC’08, pages 141–150, Seattle, Washington, USA, Sept. 2008. URL: https://doi.org/10.1109/IISWC.2008.4636099.

[273] V. M. Weaver and S. A. McKee. Using dynamic binary instrumentation to generate multi-platform SimPoints: Methodology and accuracy. In Proceedings of the International Conference on High Performance Embedded Architectures and Compilers, HiPEAC’08, pages 305–319, Göteborg, Sweden, Jan. 2008. URL: https://doi.org/10.1007/978-3-540-77560-7_21.

[274] V. M. Weaver, D. Terpstra, and S. Moore. Non-determinism and overcount on modern hardware performance counter implementations. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS’13, pages 215–224, Austin, Texas, USA, Apr. 2013. URL: https://doi.org/10.1109/ISPASS.2013.6557172.

[275] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the Symposium on Operating Systems Design and Implementation, OSDI’06, pages 307–320, Seattle, Washington, USA, Nov. 2006. URL: https://www.usenix.org/legacy/events/osdi06/tech/full_papers/weil/weil.pdf.

[276] T. F. Wenisch, R. E. Wunderlich, B. Falsafi, and J. C. Hoe. Simulation sampling with live-points. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS’06, pages 2–12, Austin, Texas, USA, Mar. 2006. URL: https://doi.org/10.1109/ispass.2006.1620785.

[277] J. Werner. Assessment of virtual machine working-sets in SimuBoost. Bachelor’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Mar. 2018.

[278] A. Whitaker, M. Shaw, and S. D. Gribble. Scale and performance in the Denali isolation kernel. In Proceedings of the Symposium on Operating Systems Design and Implementation, OSDI’02, pages 195–209, Boston, Massachusetts, USA, Dec. 2002. URL: https://doi.org/10.1145/844128.844147.

[279] F. Wilhelm. Tracing privileged memory accesses to discover software vulnerabilities. Master’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Nov. 2015.

[280] P. R. Wilson, S. F. Kaplan, and Y. Smaragdakis. The case for compressed caching in virtual memory systems. In Proceedings of the USENIX Annual Technical Conference, General Track, USENIX ATC’99, pages 101–116, Monterey, California, USA, June 1999. URL: https://www.usenix.org/legacy/events/usenix01/cfp/wilson/wilson.pdf.

[281] E. Witchel and M. Rosenblum. Embra: Fast and flexible machine simulation. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS’96, pages 68–79, Philadelphia, Pennsylvania, USA, May 1996. URL: https://doi.org/10.1145/233013.233025.

[282] T. Wood, K. Ramakrishnan, P. Shenoy, and J. Van der Merwe. CloudNet: dynamic pooling of cloud resources by live WAN migration of virtual machines. In Proceedings of the International Conference on Virtual Execution Environments, VEE’11, pages 121–132, Newport Beach, California, USA, Mar. 2011. URL: https://doi.org/10.1145/1952682.1952699.

[283] T. Wood, G. Tarasuk-Levin, P. Shenoy, P. Desnoyers, E. Cecchet, and M. D. Corner. Memory Buddies: Exploiting page sharing for smart colocation in virtualized data centers. ACM SIGOPS Operating Systems Review, 43(3):27–36, July 2009. URL: https://doi.org/10.1145/1618525.1618529.

[284] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of the International Symposium on Computer Architecture, ISCA’03, pages 84–95, San Diego, California, USA, June 2003. URL: https://doi.org/10.1109/ISCA.2003.1206991.

[285] M. Xu, R. Bodik, and M. D. Hill. A flight data recorder for enabling full-system multiprocessor deterministic replay. In Proceedings of the International Symposium on Computer Architecture, ISCA’03, pages 122–133, San Diego, California, USA, June 2003. URL: https://doi.org/10.1145/859618.859633.

[286] M. Xu, V. Malyugin, J. Sheldon, G. Venkitachalam, and B. Weissman. ReTrace: Collecting execution trace with virtual machine deterministic replay. In Proceedings of the Workshop on Modeling, Benchmarking and Simulation, MoBS’07, page 3, San Diego, California, USA, June 2007. URL: http://www-mount.ece.umn.edu/~jjyi/MoBS/2007/program/01C-Xu.pdf.

[287] L.-K. Yan, M. Jayachandra, M. Zhang, and H. Yin. V2E: combining hardware virtualization and software emulation for transparent and extensible malware analysis. In Proceedings of the Conference on Virtual Execution Environments, VEE’12, pages 227–238, London, England, Mar. 2012. URL: https://doi.org/10.1145/2151024.2151053.

[288] S. Yang. Extending KVM with new Intel Virtualization technology. Napa Valley, California, USA, June 2008. URL: https://www.linux-kvm.org/images/c/c7/KvmForum2008%24kdf2008_11.pdf.

[289] J. J. Yi, S. V. Kodakara, R. Sendag, D. J. Lilja, and D. M. Hawkins. Characterizing and comparing prevailing simulation techniques. In Proceedings of the International Symposium on High-Performance Computer Architecture, HPCA’05, pages 266–277, San Francisco, California, USA, Feb. 2005. URL: https://doi.org/10.1109/HPCA.2005.8.

[290] A. B. Yoo, M. A. Jette, and M. Grondona. Slurm: Simple linux utility for resource management. In Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP’03, pages 44–60, Seattle, Washington, USA, June 2003. URL: https://doi.org/10.1007/10968987_3.

[291] P. Yosifovich, A. Ionescu, M. E. Russinovich, and D. A. Solomon. Windows Internals Part 1. Microsoft Press, seventh edition, 2017. ISBN: 978-0735684188.

[292] M. T. Yourst. PTLsim: A cycle accurate full system x86-64 microarchitectural simulator. In Proceedings of the International Symposium on Performance Analysis of Systems & Software, ISPASS’07, pages 23–34, San Jose, California, USA, Apr. 2007. URL: https://doi.org/10.1109/ISPASS.2007.363733.

[293] M. Zangl. Towards heterogeneous deterministic replay for symmetric multiprocessors. Master’s thesis, Karlsruhe Institute of Technology (KIT), Operating Systems Group, Nov. 2017.

[294] H. Zeng, M. Yourst, K. Ghose, and D. Ponomarev. MPTLsim: A cycle-accurate, full-system simulator for x86-64 multicore architectures with coherent caches. ACM SIGARCH Computer Architecture News, 37(2):2–9, May 2009. URL: https://doi.org/10.1145/1577129.1577132.

[295] X. Zhang, Q. Guo, Y. Chen, T. Chen, and W. Hu. HERMES: a fast cross-ISA binary translator with post-optimization. In Proceedings of the International Symposium on Code Generation and Optimization, CGO’15, pages 246–256, San Francisco, California, USA, Feb. 2015. URL: https://doi.org/10.1109/CGO.2015.7054204.

[296] X. Zhang, Z. Huo, J. Ma, and D. Meng. Exploiting data deduplication to accelerate live virtual machine migration. In Proceedings of the International Conference on Cluster Computing, CLUSTER’10, pages 88–96, Heraklion, Crete, Greece, Sept. 2010. URL: https://doi.org/10.1109/CLUSTER.2010.17.