Yu Zhang

Performance Improvement of Hypervisors for HPC Workload


Universitätsverlag Chemnitz 2018

Impressum

Bibliografische Information der Deutschen Nationalbibliothek

Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der Deutschen Nationalbibliografie; detaillierte bibliografische Angaben sind im Internet über http://www.dnb.de abrufbar.

Das Werk - ausgenommen Zitate, Cover, Logo TU Chemnitz und Bildmaterial im Text - steht unter der Creative-Commons-Lizenz Namensnennung 4.0 International (CC BY 4.0) http://creativecommons.org/licences/by/4.0/deed.de

Titelgrafik: Yu Zhang
Satz/Layout: Yu Zhang

Technische Universität Chemnitz/Universitätsbibliothek
Universitätsverlag Chemnitz
09107 Chemnitz
https://www.tu-chemnitz.de/ub/univerlag

readbox unipress in der readbox publishing GmbH
Am Hawerkamp 31
48155 Münster
http://unipress.readbox.net

ISBN 978-3-96100-069-2
http://nbn-resolving.de/urn:nbn:de:bsz:ch1-qucosa2-318258

Performance Improvement of Hypervisors for HPC Workload

Dissertation zur Erlangung des akademischen Grades

Dr.-Ing.

Herr M. Eng. Yu Zhang, geboren am 7. Juli 1980 in Huhhot, China

Fakultät für Informatik an der Technischen Universität Chemnitz

Dekan: Prof. Dr. rer. nat. Wolfram Hardt
Gutachter: Prof. Dr.-Ing. habil. Matthias Werner
           Prof. Dr.-Ing. Alejandro Masrur

Tag der Einreichung: 30.01.2018
Tag der Verteidigung: 03.07.2018

Zhang, Yu
Performance Improvement of Hypervisors for HPC Workload
Dissertation, Fakultät für Informatik
Technical University of Chemnitz, Januar 2019

Acknowledgments

After eight years of effort, my PhD study is slowly drawing to an end. Many people have rendered their help, expressed concern, and encouraged me. All of this strengthened my confidence to overcome the difficulties along the way and to go this far to create this work.

First, I sincerely thank my former supervisor, Prof. Wolfgang Rehm, who offered me the opportunity to pursue the PhD study at TU Chemnitz and arranged the equipment and environment necessary for scientific research. More importantly, under the guidance of Prof. Rehm, I was ushered into the domains of computer architecture, HPC, and system virtualization. The efforts on a topic spanning these fields led to this dissertation.

Second, I would like to express my heartfelt gratitude to my current supervisor, Prof. Matthias Werner, who sincerely rendered his academic supervision on my research topic and arranged a very comfortable environment for me to complete this research task. Without Prof. Werner's generous support, I cannot imagine having completed this research. As an earnest scholar and an upright man, Prof. Werner truly impressed me.

My thanks extend to my colleagues in the two professors' research groups. René Oertel, Nico Mittenzwey, Hendrik Nöll, and Johannes Hiltscher gave me invaluable suggestions on many issues, from equipment and publications to professional technical skills and my stay in Chemnitz. With his patience and experience, Dr. Peter Tröger helped me avoid quite a few incorrect ideas while researching and drafting each chapter of this dissertation. I benefited enormously from his comments, feedback, and the conversations with him.

It remains among my best memories to have attended the conferences EMS 2013 in Manchester, BigDataScience'14 in Beijing, Euro-Par VHPC'15 in Vienna, and ISC VHPC'17 in Frankfurt. I am grateful to the many anonymous reviewers who evaluated my contributions positively and gave me the chance to get in contact with the world's excellent researchers and scholars. During these conferences, the chairs, Prof. David Al-Dabass, Dr. Alvin Chin, and Dr. Michael Alexander, had very nice talks with me. Dr. Andrew J. Younge, whom I got to know at ISC VHPC'17, gave me valuable and constructive comments on a portion of this manuscript. Prof. Andreas Goerdt encouraged me to keep going even at the tough moments. Other friends, Wang Jian, Chou Chih-Ying, and Bai Qiong, supported me all through the past eight years. They are true friends with whom I can share joy and frustration.

Furthermore, I would like to mention the China Scholarship Council (CSC), whose financing made most of this research possible, and the Deutscher Akademischer Austauschdienst (DAAD), which also sponsored me in the framework of a research project in the last year.

This manuscript was finally proofread by Ms. Jennifer Hofmann and Ms. Christine Jakobs. They took great patience to scrutinize the lengthy text and correct grammatical and typographical errors, punctuation, spelling, and inconsistencies. Their efforts have helped me to produce a significantly clearer and more professional scientific work. A few lines of words could never be enough to express my gratitude.

Finally, I express my gratitude to my parents and my brother, who silently supported me all through the years with their love, concern, and understanding from thousands of miles away. I dedicate this book to my father, who loved me so much but left me forever before the submission.

Abstract

Virtualization technology has many excellent features that are beneficial for today's high-performance computing (HPC). It enables a more flexible and effective utilization of computing resources. However, a major barrier to its wide acceptance in the HPC domain lies in the relatively large performance loss for workloads. Among the major performance-influencing factors, the memory management subsystem for virtual machines is a potential source of performance loss. Many efforts have been invested in seeking solutions that reduce the performance overhead of the guest memory address translation process. This work contributes two novel solutions, "DPMS" and "STDP". Both are presented conceptually and implemented partially for a hypervisor, KVM. The benchmark results for DPMS show that the performance of a number of workloads that are sensitive to the paging method can be more or less improved through the adoption of this solution. STDP illustrates that it is feasible to reduce the performance overhead of the second dimension of paging for those workloads that cannot make good use of the TLB.

Zusammenfassung

Die Virtualisierungstechnologie verfügt über viele hervorragende Eigenschaften, die für das heutige Hochleistungsrechnen von Vorteil sind. Sie ermöglicht eine flexiblere und effektivere Nutzung der Rechenressourcen. Ein Haupthindernis für die Akzeptanz in der HPC-Domäne liegt jedoch in dem relativ großen Leistungsverlust für Workloads. Von den wichtigsten leistungsbeeinflussenden Faktoren ist das Speicherverwaltungs-Subsystem für virtuelle Maschinen eine potenzielle Quelle der Leistungsverluste.

Es wurden viele Anstrengungen unternommen, um Lösungen zu finden, die den Leistungsaufwand beim Konvertieren von Gastspeicheradressen reduzieren. Diese Arbeit liefert zwei neue Lösungen, „DPMS“ und „STDP“. Beide werden konzeptionell vorgestellt und teilweise für einen Hypervisor, KVM, implementiert. Die Benchmark-Ergebnisse für DPMS zeigen, dass die Leistung für eine Reihe von Workloads, die empfindlich auf das Paging-Verfahren reagieren, durch die Einführung dieser Lösung mehr oder weniger verbessert werden kann. STDP veranschaulicht, dass es möglich ist, den Leistungsaufwand im zweidimensionalen Paging für diejenigen Workloads zu reduzieren, die die vom TLB gebotenen Vorteile nicht gut ausnutzen können.

Contents

List of Figures xvii

List of Tables xix

List of Abbreviations xxi

1 Introduction 1
   1.1 Background ...... 1
      1.1.1 HPC and its Major Concerns ...... 1
         Performance ...... 1
         Power Consumption and Efficiency ...... 3
         Hardware Utilization and Efficiency ...... 5
      1.1.2 System Virtualization ...... 6
         Overview of Virtualization ...... 6
         Virtualization's Benefits ...... 7
         Virtualization's Limitations ...... 7
      1.1.3 Exploit Virtualization in HPC ...... 9
         Customizable Execution Environment ...... 9
         System Resource Utilization Enhancing ...... 9
         Power and Hardware Efficiency Enhancing ...... 9
   1.2 Problem Description ...... 10
   1.3 Thesis Contribution to the Current Research ...... 10
   1.4 Outline of the Thesis ...... 11

2 Related Work 13
   2.1 System Virtualization ...... 13
   2.2 Virtualization in HPC ...... 17
   2.3 Performance for Memory Virtualization ...... 21
   2.4 Summary ...... 23

3 A Study of the Performance Loss for Memory Virtualization 25
   3.1 HPC Workload Deployment Patterns ...... 25
   3.2 Benchmark Selection for Intra-node Pattern ...... 27
   3.3 Benchmark Platform ...... 29
      3.3.1 Platform ...... 29
         Intel Core i7-6700K (Skylake) ...... 29
         Intel E5-1620 (Ivy Bridge-E) ...... 30
      3.3.2 AMD Platform ...... 30


         AMD FX-8150 (Bulldozer) ...... 30
      3.3.3 System ...... 31
   3.4 Performance for Memory Paging ...... 31
   3.5 Summary ...... 38

4 DPMS - Dynamic Paging Method Switching and STDP - Simplified Two Dimensional Paging 39
   4.1 Reflections and Solutions ...... 40
      4.1.1 Dynamic Paging Method Switching ...... 40
      4.1.2 STDP with Large Page Table ...... 40
   4.2 DPMS - Dynamic Paging Method Switching ...... 41
      4.2.1 Performance Data Sampling ...... 41
      4.2.2 Data Processing ...... 42
      4.2.3 Decision Making ...... 42
      4.2.4 Switching Mechanism ...... 43
   4.3 STDP - Simplified Two-Dimensional Paging ...... 46
      4.3.1 Revisiting the Current Paging Scheme ...... 46
      4.3.2 Restructured Page Table ...... 50
      4.3.3 Page Fault Handling in TDP ...... 50
      4.3.4 Adaptive Hardware MMU ...... 52
   4.4 Summary ...... 52

5 Implementation 55
   5.1 QEMU-KVM Hypervisor Analysis ...... 55
      5.1.1 About QEMU ...... 56
         Main Components ...... 59
      5.1.2 About KVM ...... 60
         Guest Creation and Execution ...... 60
   5.2 Parameter Study for DPMS ...... 62
   5.3 DPMS on QEMU-KVM for x86-64 ...... 70
      5.3.1 Performance Data Sampling ...... 70
      5.3.2 Data Processing ...... 73
      5.3.3 Decision Making ...... 73
      5.3.4 Switching Mechanism ...... 74
      5.3.5 Repetitive Mechanism ...... 79
      5.3.6 PMC Mechanism in QEMU-KVM Context ...... 80
      5.3.7 DPMS for Multi-Core Processor ...... 81
   5.4 STDP on QEMU-KVM for x86-64 ...... 82
      5.4.1 Restructured Page Table ...... 83
      5.4.2 Adaptive MMU for TDP ...... 85
   5.5 Summary ...... 87

6 Experiments and Evaluation 89
   6.1 Objective ...... 89
   6.2 Testing Design ...... 90
      6.2.1 Functional Correctness Test ...... 91
         Performance Data Sampling ...... 91
         Decision Making ...... 91
         Paging Method Switching ...... 91
      6.2.2 Performance Test ...... 92

   6.3 Test and Benchmark Results of DPMS ...... 93
      6.3.1 Performance Data Sampling ...... 93
      6.3.2 Dynamic Paging Method Switching ...... 93
      6.3.3 Problem Analysis ...... 93
      6.3.4 Convergence and Stability ...... 94
      6.3.5 Sampling Frequency ...... 94
      6.3.6 Performance of DPMS ...... 95
   6.4 Summary of the Evaluation ...... 95

7 Other Aspects of Performance Loss and Future Work 97
   7.1 Processor and Scheduling Aspect ...... 97
   7.2 Network Communication Aspect ...... 100
      7.2.1 Related work ...... 100
      7.2.2 Benchmark Selection for Inter-Node Pattern ...... 101
      7.2.3 Benchmarking for Network Communication ...... 105
   7.3 Future Work ...... 107
      7.3.1 Future Work for Memory Virtualization ...... 107
      7.3.2 Future Work for I/O Virtualization ...... 108

8 Conclusions 111

Bibliography 113

A Some Data Structures in KVM 123

   A.1 Searching of kvm_mem_slot ...... 123
   A.2 Translation between GFN and HVA ...... 123
   A.3 kvm_memory_region and kvm_userspace_memory_region ...... 123
   A.4 kvm_io_bus ...... 124
   A.5 kvm_coalesced_mmio_dev and kvm_coalesced_mmio_ring ...... 124
   A.6 kvm_ioapic ...... 124
   A.7 interrupt message ...... 125
   A.8 kvm_lapic ...... 125
   A.9 iterator ...... 125
   A.10 kvm_mmuslot ...... 126
   A.11 GEN_MMIO ...... 126
   A.12 pte_list_desc ...... 126
   A.13 VMCB layout in AMD NPT ...... 127

B Part of the DPMS Code 129
   B.1 Paging Method Switching Operation in KVM ...... 129
   B.2 VMCS Updating Operation in KVM ...... 130


List of Figures

1.1 Evolution of the processors ...... 3
1.2 Growth of the supercomputer performance ...... 4
1.3 Performance, power, scale of supercomputers ...... 5
1.4 System virtualization ...... 7
1.5 Problem domain on the road map of virtualization's evolution ...... 11
1.6 Outline of the thesis ...... 12

2.1 Stack layout of the virtual machine ...... 14
2.2 Memory address translation in physical machines ...... 15
2.3 Memory address translation in virtual machines ...... 16
2.4 Normalized performance of NAS in typical virtual environments ...... 20

3.1 Typical HPC system architecture and the constitution of performance loss ...... 26
3.2 Testing platforms ...... 29
3.3 The normalized performances of shadow paging and nested paging ...... 31
3.4 Normalized performance of PARSEC 3.0 in KVM ...... 34
3.5 Normalized performance and comparison on Platform 3 ...... 35
3.6 Performance comparison between SPT and EPT ...... 36
3.7 Performance comparison with the page table size of 1 GB ...... 37

4.1 High-level design of DPMS ...... 41
4.2 Ring buffer update ...... 43
4.3 Flow chart of a basic algorithm for Decision Making ...... 43
4.4 Occasion for PM switching in the execution flow ...... 44
4.5 Basic operations for Paging Method Switching ...... 45
4.6 High-level design of STDP ...... 46
4.7 Breakdown of the logical address for x86-64 ...... 47
4.8 The 2-level paging scheme for TDP ...... 49
4.9 Balance between performance and memory utilization efficiency ...... 50
4.10 Restructured page table ...... 50
4.11 Overview of the SPT and TDP mechanisms ...... 51
4.12 Two TDP schemes ...... 53

5.1 Evolution of the QEMU Device Model ...... 57
5.2 Hierarchy formed by the entities in QOM ...... 58
5.3 Hash table used for managing the emulated devices ...... 58
5.4 QEMU and KVM as a whole hypervisor ...... 61
5.5 Occurrences of page fault and vmexit for PARSEC-3.0 workload ...... 65
5.10 Bit field layout of IA32_PERFEVTSELx MSR ...... 70
5.11 Execution path of the QEMU-KVM hypervisor ...... 76
5.12 Call chains formed by the APIs for PMCs control ...... 80
5.13 Cascading call chain where PMC mechanism is integrated in ...... 81


5.14 Data structure shared by the shadow paging and the nested paging ...... 82
5.15 Data structure for the restructured page tables ...... 84
5.16 Bit formats used by the paging process in STDP ...... 86

6.1 Typical quality factors involved in software engineering ...... 90
6.2 Switching in four cases ...... 92
6.3 Theoretical relation between stability and sensitivity ...... 94
6.4 Performance comparison among DPMS, SPT and TDP ...... 96

7.1 Task scheduling and the associated kernel activities ...... 98
7.2 Scenario for VCPU scheduling ...... 99
7.3 I/O device emulation in full and paravirtualization ...... 100
7.4 Virtual clusters ...... 105
7.5 Cluster used for benchmark ...... 106
7.6 Basic ways to run an MPI application ...... 106
7.7 Performance comparisons ...... 109

List of Tables

3.1 A set of typical benchmark suites for performance research ...... 27
3.2 Performance comparison between the shadow paging and the nested paging ...... 32
3.3 Summary of the results observed in benchmark ...... 33

4.1 A complete procedure and the cost for a GVA→HPA translation ...... 52

5.1 A brief comparison of the QDev and QOM ...... 58
5.2 Major Functions implemented by KVM ...... 60
5.3 Statistics of PARSEC-3.0 workloads on P1 and P2 ...... 64
5.4 Allocation and configuration of PMC MSRs ...... 71
5.5 Format of the entry in level-2 table in the second dimension ...... 86
5.6 Format of the entry in level-1 table in the second dimension ...... 87

7.1 Performance Loss over Communication Network (normalized) ...... 107


List of Abbreviations

ABI      application binary interface
AI       artificial intelligence
API      application programming interface
BIOS     basic input/output system
CFD      computational fluid dynamics
CFS      completely fair scheduler
CMP      chip multi-processor
CPAIMD   Car-Parrinello Ab Initio Molecular Dynamics
CPU      central processing unit
CUDA     compute unified device architecture
DBT      dynamic binary translation
DDR      double data rate
DFT      Density Functional Theory
DIMM     dual-in-line memory module
DPMS     dynamic paging method switching
EC2      elastic compute cloud
EPT      extended page table
EPTP     extended page table pointer
FFT      fast fourier transformation
FLOPS    floating-point operations per second
FPGA     field programmable gate array
GPA      guest physical address
GPGPU    general purpose graphic processing unit
GPT      guest page table
GPU      graphic processing unit
GUPS     giga updates per second
GVA      guest virtual address
HPA      host physical address
HPDA     high performance data analysis
HT       hyper transport
I/O      input/output
IC       integrated circuit
IOMMU    input output memory management unit
IPC      instruction per cycle
ISA      instruction set architecture
KVM      kernel virtual machine
LHP      lock holder preemption
LINPACK  LINear equations software PACKage
MIPS     millions of instructions per second
MMU      memory management unit
MPI      message passing interface
MPP      massively parallel processors
MSR      model-specific registers
NFS      network file system
NIC      network interface card
NLP      node-level parallelism
NPT      nested page table
NUMA     non-uniform memory access
NWP      numerical weather prediction
OpenCL   open computing language
OS       operating system
PCI      peripheral component interconnect
PDE      partial differential equation
PF       page fault
PFN      page frame number
PGD      page global directory
PHD      page higher directory
PLD      page lower directory
PMC      performance monitor counter
PMD      page middle directory
PTE      page table entry
PTPC     page table pointer cache
PUD      page upper directory
PVM      parallel virtual machine
PWC      page walk cache
QOM      QEMU object model
QPI      quick path interconnect
REML     restricted maximum likelihood
RMS      recognition, mining and synthesis
RVI      rapid virtualization indexing
SLL      shared last level
SMP      symmetric multiprocessing
SR-IOV   single root input output virtualization
STDP     simplified two dimensional paging
SVM      secure virtual machine
TDDFT    time-dependent density functional theory
TDP      two dimensional paging
TLB      translate look-aside block
TLBM     translation lookaside block miss
TLP      thread level parallelism
TMR      TLB-miss rate
USB      universal serial bus
VALE     virtual local ethernet
VCPU     virtual central processing unit
VirtIO   virtual input and output
VM       virtual machine
VMCB     virtual machine control block
VMCS     virtual machine control structure
VMD      visual molecular dynamics
VMM      virtual machine monitor
VR       virtual reality
VT       virtualization technology

Chapter 1

Introduction

Contents
1.1 Background ...... 1
   1.1.1 HPC and its Major Concerns ...... 1
   1.1.2 System Virtualization ...... 6
   1.1.3 Exploit Virtualization in HPC ...... 9
1.2 Problem Description ...... 10
1.3 Thesis Contribution to the Current Research ...... 10
1.4 Outline of the Thesis ...... 11

This chapter provides the background and describes the problem, the contributions, and the outline of this thesis. First, high performance computing (HPC) is briefly introduced with a focus on its major concerns. Then, system virtualization technology is introduced as a means to address the concerns posed by the evolution of HPC. However, the adoption of virtualization in HPC also has problems, of which the performance loss is the major one. This thesis focuses on the performance loss in guest memory address translation and attempts to work out solutions for this problem.

1.1 Background

1.1.1 HPC and its Major Concerns

High performance computing has been around for a few decades. It is a branch of computing that tackles the most complex and challenging problems arising in science and engineering. By solving fundamental research problems with HPC, humans make scientific discoveries and push forward the frontier of knowledge. As a powerful tool for research and production, HPC makes use of cutting-edge hardware and therefore represents the most high-end computing technology. Breakthroughs achieved in HPC may also benefit other branches of the computing community. In this sense, HPC is a key area and has potential significance for computing as well as for a broad range of science and technology. From a technical perspective, the transformation from monolithic mainframes to cluster-based supercomputers is a major breakthrough in the development of HPC systems. While this enables a more flexible way of construction, scalable performance, and a lower investment in the facility, this growth model for HPC systems increasingly meets serious challenges. Generally, challenges may be posed by the following concerns about an HPC system:

Performance

Intrinsically, the performance of a computer is generally understood as the speed achievable by the computer in accomplishing tasks. For a workload with a fixed amount of work, performance can

be indicated by the execution time. The longer the execution time is, the lower the performance will be, hence the relation1:

performance = 1/execution time
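For illustration with hypothetical numbers: if machine A completes a fixed workload in 50 s while machine B needs 100 s, the relation gives

\[
\frac{\text{performance}_A}{\text{performance}_B}
  = \frac{\text{execution time}_B}{\text{execution time}_A}
  = \frac{100\,\mathrm{s}}{50\,\mathrm{s}} = 2,
\]

i.e., machine A is said to be twice as fast as machine B for this workload.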

Performance is not only a major concern in computing, but also the key driving force behind advances in computer architecture [1] and the primary target of computer design. A great deal of effort is dedicated to the study of performance in computer architecture. Depending on the purpose of the research, the performance of a computer can be measured for the whole system, such as the LINPACK result for a supercomputer, or for an individual component. The result can be either an absolute figure indicating the performance of a given combination of hardware and software, or a ratio for comparing the influence of different systems or of different configurations of the same system.

From a system engineering point of view, performance is a non-functional quality, but it depends on many functional qualities. According to Dongarra et al., "the performance of a computer is a complicated issue, a function of many interrelated quantities" [2]. These factors embrace a broad range of issues in hardware and software, including: the frequency of the processor cores; the size and levels of the caches; the size, levels, and associativity of the TLBs; the frequency and size of the memory chips; the bandwidth and latency of the network card; the frequency of the memory bus and the PCI bus; the volume and speed of the storage; the size of the program; the programming language and the algorithm the program applies; the compiler's capability for code generation and optimization; the static and dynamic libraries; the operating system; the organization and architecture of the computer or computing system; and so on. The overall performance of an application is determined by the interaction between the entire instruction stream and the underlying hardware.

As a result of the enhancements in circuit and architecture technology, processors, memory, and I/O devices saw an unprecedented increase in speed and capacity. Figure 1.1 depicts the evolution of processors in the past few decades. Alongside the decrease in process size and in power (energy consumption per time unit) is the exponential increase in clock speed (frequency), number of transistors per die, and speed of instruction execution (performance). Since the flourishing of the microprocessor and integrated circuit (IC) industry, these trends accord with a famous "rule of thumb", Moore's law, which predicts that the number of components per integrated circuit doubles approximately every two years. According to Intel's former executive David House, a processor chip's performance would double every eighteen months, as a result of both the increased number of transistors and their enhanced speed [3]. The prediction proved fairly accurate during the past decades. However, in the recent decade, the shrinking of the transistor size is approaching physical limits, posing a barrier to further shrinking.

While there are discussions about the effectiveness of Moore's law as the general roadmap for the semiconductor industry in the future [5], processors have begun to follow the multi-core approach to continue growing their performance by adopting spatial redundancy of the hardware. Although the clock speed has come to a stagnation, the theoretical performance can still be scaled by the number of cores. This is the main reason why Moore's law still holds true in the multi-core era. Such an architectural change has a fundamental impact on the way processors are utilized.
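As a rough sanity check of the eighteen-month doubling period quoted above, the implied growth factor over one decade is

\[
2^{\,120/18} = 2^{\,6.67} \approx 100,
\]

i.e., roughly a hundredfold performance increase per decade, which is consistent with the exponential trends visible in Figure 1.1.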
To make effective use of the parallel hardware, software has to be developed in a parallel, multi-core approach instead of the traditional sequential, single-core approach. The actual performance and effective utilization lie in the scalability of the software across cores. From a system point of view, not only the processor, but also the other computer components, as well as architectural innovations, have contributed to the performance growth. In addition to the multi-core technology, the following architectural technologies have also pushed forward the overall performance [1]:

1In practice, other metrics, such as the number of floating-point operations per second (FLOPS) or millions of instructions per second (MIPS), are also used.


Figure 1.1 Evolution of the processors, 1970-2015: clock speed (MHz), process (nm), transistor number, power (W), and performance (MIPS) [4]

• Pipelining and super-pipelining
• Multiple levels of cache memory
• RISC architectures
• Multiple execution units (single data, multiple instruction)
• ISA extension (to take advantage of single instruction, multiple data)

In the recent decade, additional novel technologies continued boosting the overall performance of the computer and computing system, including [1]:

• Multithreading (super-/hyper-threading to exploit thread-level parallelism)
• Speculation and prediction mechanisms (to take advantage of idle execution units)
• Hardware acceleration (inside and outside the processor chip)
• Vector and array processing
• Large-scale parallel and distributed computing

Figure 1.2 illustrates the performance growth of the most powerful computing systems in the world, the top 500 supercomputers, since the flourishing of the HPC industry. The performance of the top system has grown from tens of GFLOPS to tens of PFLOPS and is looking forward to the EFLOPS (exascale2) era.

Power Consumption and Efficiency

Despite the steady technical advances in processors and architecture, the expansion in dimension still remains a major source of performance growth. While large-scale parallel and distributed systems have emerged as the popular computing architecture and the power consumption per processor (core) is decreasing, the power consumed by the whole system is still tremendous. Power consumption and power efficiency have become an increasing concern over the past decade. Nowadays they are emerging as critical limiting factors for high performance computing.

2exa is a prefix denoting a factor of 10^18.


Figure 1.2 Growth of the supercomputer performance, 1990-2020: absolute performance (FLOPS) of the Nr. 1 system, the Nr. 500 system, and the total of the Top500 list [6]

Supercomputers and HPC clusters consume energy not only for computing, but also for cooling to prevent the hazard of excessive heating. Power supply and budget add a heavy burden to the computing industry. Such a development model is unhealthy and unsustainable in the long run. Figure 1.3(a) depicts the evolution of the top 10 fastest supercomputers of the top 500 list since an early age in terms of total performance, total number of cores, and total power. These have undergone rapid growth and have currently reached magnitudes unimaginable in daily life. While the total performance is approaching exascale by using tens of millions of cores, the total power amounts to a hundred megawatts (1 MW = 10^6 W), roughly 0.44% of the installed electrical capacity of the world's currently largest hydro-electric power station3.

Figure 1.3(b), on the other hand, depicts the efficiency aspects over the same period. Efficiency refers to the ratio of output to input. The power efficiency, the core (hardware) efficiency, as well as the "power draw per core" are compared over a relatively long period of time. Power efficiency measures the number of floating-point operations per watt, which has reached a level of 2.8966 GFLOPS/W for the top 10 supercomputers. This is roughly 30.6% of the currently most power-efficient system (NVIDIA DGX-1, 9.4621 GFLOPS/W [8]). "Power draw per core" measures the power consumption per processing unit (core) and indicates the current technical status of energy saving in processor design and manufacture. Over the past decade, this has been reduced to 6.81% of its early-age level (7.05 W/core in June 2006 versus 0.48 W/core in November 2016). This contributed enormously to curbing the drastic growth of the power demand of the entire computing industry over the past decade.
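Recomputing these ratios from the figures quoted above, as a consistency check:

\[
\frac{2.8966\ \mathrm{GFLOPS/W}}{9.4621\ \mathrm{GFLOPS/W}} \approx 0.306 \approx 30.6\,\%,
\qquad
\frac{0.48\ \mathrm{W/core}}{7.05\ \mathrm{W/core}} \approx 0.0681 \approx 6.81\,\%.
\]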

3Three Gorges Dam, currently the largest hydro-electric power station and the largest power-producing facility ever built, has an installed electrical capacity of 22,500 megawatts [7].
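Taking "a hundred megawatts" as roughly 100 MW and the installed capacity from the footnote, the quoted share follows as

\[
\frac{100\ \mathrm{MW}}{22\,500\ \mathrm{MW}} \approx 0.0044 = 0.44\,\%.
\]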


Figure 1.3 Performance, power, and scale of supercomputers, 1990-2020: (a) totals for the Top10 (performance in GFLOPS, number of cores, power in kW), (b) efficiency of the Top10 (power efficiency in MFLOPS/W, core efficiency in MFLOPS/core, power draw per core in W/core) [6]

Hardware Utilization and Efficiency

Figure 1.3(b) also depicts the core (hardware) efficiency, which has increased by two orders of magnitude since the initial stage of the high-performance computing industry. As this is the average over all types of the concerned processing units, it represents the general growth trend of HPC technology. In recent years, the core (hardware) efficiency reached a plateau after climbing a slope, implying that the performance is roughly proportional to the number of cores (the scale of the computing system). To keep the computing system within a reasonable scale, the strategy is to increase the "performance per unit chip area".


Another concern is the hardware utilization. For a long time, computers, especially the servers in large data centers, have been plagued by the underutilization problem, which prevents the effective use of resources and lowers the profits in production. The server farm in a data center, for example, may suffer a huge profit loss due to insufficient utilization of its resources. Owners of such infrastructures are in a dilemma. As reliable service providers, they are expected to maintain high availability of their services, ideally 100% of the time without any downtime. On the other hand, the influx of workloads may fluctuate from time to time and is therefore unpredictable. The workload may be distributed unevenly among servers, which creates load imbalance from server to server. The consequence is that even underutilized or idle servers have to stay powered on, with little or no productivity at all. Studies indicate that a typical server is estimated to spend only 8% to 30% of its uptime running applications on average, with the rest of the time simply wasted running idly (Gartner Group) [9]. A similar situation occurs in companies with production servers: it is estimated that on average 15% of the full-time servers in such companies perform no useful tasks [10], leading to a huge waste of both hardware and energy worldwide.

In high-performance computing, the underutilization problem seems not to be as acute as in the case of data-center servers. This is probably due to the fact that HPC has much smaller user groups that generate dedicated tasks in a more homogeneous way. However, since a supercomputer carries a huge investment but has a much shorter life-span than normal servers, its utilization tends to be accounted for in units of "node hours". Idle and underloaded states of the system simply mean lower production and lower economic efficiency, and they add to the cost per unit of service the system provides. In this sense, a higher utilization of the hardware resources means cost saving and economic gain for high-performance computing.

1.1.2 System Virtualization

Overview of Virtualization

In computer science, a virtual object is a logical entity created on the basis of a real physical entity. According to the definition by Popek and Goldberg, "a virtual machine is taken to be an efficient, isolated duplicate of the real machine" [11]. Virtualization refers to the act of creating a virtual (rather than actual) entity, such as virtual computer hardware platforms, storage devices, and computer network resources [12]. Its purpose is to extend or replace an existing interface to mimic the behavior of another system [13]. In practice, virtualization may occur at several different levels in accordance with the abstraction layers in the architecture of a computer, such as the hardware level (ISA), the operating system level (ABI), and the library level (API) [14]. System virtualization is virtualization at the hardware level; the entity to virtualize is the entire hardware of a computer. One effect that system virtualization creates for the computer architecture is illustrated by Figure 1.4. Virtualization enables the decoupling of the OS from the hardware, as well as of the applications from the OS. These are the bases for almost all other features.

Although the concept of virtualization, and especially of virtual machines, has been in the focus of computer scientists and engineers in the recent two decades, their origins go back to the first generation of computers in the 1960s [15]. Due to the prohibitively high prices and the centralized, non-interactive way of using the mainframe computers, the expensive computing resources tended to suffer from underutilization. A variety of time-sharing, multi-tasking, multi-user systems were created to boost the resource utilization and ease the use. The virtual machine came as a more advanced variant of that technology: system resources can be partitioned into independent fine grains and assigned to multiple users running multiple tasks in an encapsulated, isolated way. This is an excellent feature, which makes virtual machines quite useful for boosting the utilization of computer system resources. Furthermore, virtualization brings more benefits for computing, including:


Figure 1.4 System virtualization: (a) traditional architecture (applications, operating system, physical hardware), (b) virtual architecture (guests with their own OS on a virtualization layer above the physical hardware)

Virtualization’s Benefits

• Better customization: The nature of encapsulation and isolation permits a customizable execution environment for the user applications.
• Higher scalability: Physical servers can be set up easily by rapid and dynamic provisioning. Deploying a new virtual machine is as easy as copying a file. The whole computing system becomes much more scalable by adding or removing virtual machine guests.
• Higher flexibility: Workloads on a physical machine can be migrated, check-pointed, and resumed. Even the virtual machines themselves can be reconfigured dynamically. New workloads can be deployed more quickly and conveniently than on physical machines.
• Higher security: As workloads in one virtual machine are encapsulated and isolated in a single exec