Automatic Generation of Models of Microarchitectures

Dissertation for the degree of Doctor of Engineering (Doktor der Ingenieurwissenschaften), Faculty of Mathematics and Computer Science, Saarland University

by Andreas Abel

Saarbrücken, 2020

Date of the colloquium: June 12, 2020
Dean: Prof. Dr. Thomas Schuster
Examination board:
  Chair: Prof. Dr. Thorsten Herfet
  Reviewers: Prof. Dr. Jan Reineke, Prof. Dr. Wolfgang J. Paul, Dr. Boris Köpf
  Academic assistant: Dr. Roland Leißa

Abstract

Detailed microarchitectural models are necessary to predict, explain, or optimize the performance of software running on modern microprocessors. Building such models often requires significant manual effort, as the documentation provided by hardware manufacturers is typically not precise enough. The goal of this thesis is to develop techniques for generating microarchitectural models automatically.

In the first part, we focus on recent x86 microarchitectures. We implement a tool to accurately evaluate small microbenchmarks using hardware performance counters. We then describe techniques to automatically generate microbenchmarks for measuring the performance of individual instructions and for characterizing cache architectures. We apply our implementations to more than a dozen different microarchitectures.

In the second part of the thesis, we study more general techniques to obtain models of hardware components. In particular, we propose the concept of gray-box learning, and we develop a learning algorithm for Mealy machines that exploits prior knowledge about the system to be learned. Finally, we show how this algorithm can be adapted to minimize incompletely specified Mealy machines, a well-known NP-complete problem. Our implementation outperforms existing exact minimization techniques by several orders of magnitude on a number of hard benchmarks; it is even competitive with state-of-the-art heuristic approaches.
Summary

Predicting, explaining, or optimizing the performance of software on modern microprocessors requires detailed models of the microarchitectures in use. Creating such models often involves considerable effort, as processor manufacturers typically do not make the required information available. The goal of this thesis is to develop techniques for generating such models automatically.

In the first part, we consider recent x86 microarchitectures. We first develop a tool that can evaluate small microbenchmarks using performance counters. We then describe techniques for automatically generating microbenchmarks that measure the performance of individual instructions and characterize the cache architecture.

In the second part of the thesis, we consider more general techniques for generating hardware models. We propose the concept of "gray-box learning", and we develop a learning algorithm for Mealy machines that takes known information about the system to be learned into account. Finally, we show how this algorithm can be applied to the problem of minimizing incompletely specified Mealy machines, a well-known NP-complete problem. On several benchmarks, our implementation is orders of magnitude faster than previous approaches.

Acknowledgements

First and foremost, I would like to thank my advisor, Prof. Jan Reineke. He gave me the freedom to explore my own ideas and was always available for discussions and guidance. I'm looking forward to continuing to work with him!

I would also like to thank Prof. Wolfgang Paul and Dr. Boris Köpf for reviewing my thesis, and Prof. Thorsten Herfet for acting as the chair of the examination board.
Finally, I would like to thank my current and former colleagues at the Real-Time and Embedded Systems Lab and the Compiler Design Lab. In particular, I would like to thank Dr. Roland Leißa for serving as the academic assistant on my examination board.

Contents

1 Introduction
  1.1 Contributions and Structure of This Thesis
  1.2 Publications
2 nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems
  2.1 Introduction
  2.2 Background
    2.2.1 Hardware Performance Counters
    2.2.2 Assembler Instructions
  2.3 Features
    2.3.1 Example
    2.3.2 Generated Code
    2.3.3 Running the Generated Code
    2.3.4 Kernel/User Mode
    2.3.5 Interface
    2.3.6 Loops vs. Unrolling
    2.3.7 Accessing Memory
    2.3.8 Warm-Up Runs
    2.3.9 noMem Mode
    2.3.10 Performance Counter Configurations
    2.3.11 Execution Time of nanoBench
    2.3.12 Supported Platforms
  2.4 Implementation
    2.4.1 Accurate Performance Counter Measurements
    2.4.2 Generating Code
    2.4.3 Kernel Module
    2.4.4 Allocating Physically-Contiguous Memory
  2.5 Related Work
  2.6 Conclusions and Future Work
3 uops.info: Characterizing the Latency, Throughput, and Port Usage of Instructions on x86 Microarchitectures
  3.1 Introduction
  3.2 Related Work
    3.2.1 Information Provided by the Manufacturers
    3.2.2 Measurement-Based Approaches
  3.3 Background
    3.3.1 Pipeline of Intel Core CPUs
    3.3.2 Pipeline of AMD Ryzen CPUs
  3.4 Definitions
    3.4.1 Latency
    3.4.2 Throughput
    3.4.3 Port Usage
  3.5 Algorithms
    3.5.1 Port Usage
    3.5.2 Latency
    3.5.3 Throughput
  3.6 Implementation
    3.6.1 Details of the x86 Instruction Set
    3.6.2 Measurements on the Hardware
    3.6.3 Analysis Using IACA
    3.6.4 Machine-Readable Output
  3.7 Evaluation
    3.7.1 Experimental Setup
    3.7.2 Hardware Measurements vs. Documentation
    3.7.3 Hardware Measurements vs. IACA
    3.7.4 Interesting Results
  3.8 Limitations
  3.9 Conclusions and Future Work
4 Characterizing Cache Architectures
  4.1 Introduction
  4.2 Background
    4.2.1 Cache Organization
    4.2.2 Replacement Policies
  4.3 Cache-Characterization Tools
    4.3.1 CacheInfo
    4.3.2 CacheSeq
    4.3.3 Replacement Policies
    4.3.4 Age Graphs
    4.3.5 Test for Adaptive Policies
  4.4 Results
    4.4.1 L1 Data Caches
    4.4.2 L2 Caches
    4.4.3 L3 Caches
    4.4.4 Resetting the Replacement Policy State
    4.4.5 Implementation Costs
  4.5 Related Work
    4.5.1 Microbenchmark-Based Cache Analysis
    4.5.2 Influence of the Replacement Policy on Performance Prediction Accuracy
    4.5.3 Security Aspects of Replacement Policies
  4.6 Conclusions and Future Work
5 Gray-Box Learning of Serial Compositions of Mealy Machines
  5.1 Introduction
  5.2 Problem Statement
    5.2.1 Basic Notions
    5.2.2 The Gray-Box Learning Problem
  5.3 Preliminaries
  5.4 Approach
    5.4.1 Observation Tables
    5.4.2 Inference Algorithm
  5.5 Implementation
    5.5.1 Computing the Partitions
    5.5.2 Reachability of the Error State
    5.5.3 Checking if Two Machines are Right-Equivalent
    5.5.4 Handling Counterexamples
  5.6 Evaluation
  5.7 Related Work
  5.8 Conclusions and Future Work
  5.A Appendix: Proofs for Chapter 5
6 MeMin: SAT-Based Exact Minimization of Incompletely Specified Mealy Machines
  6.1 Introduction
    6.1.1 Outline
  6.2 Definitions
    6.2.1 Basic Definitions
    6.2.2 Problem Statement
    6.2.3 General Approach
  6.3 Related Work
  6.4 Approach
    6.4.1 Incompatibility Matrix
    6.4.2 Encoding as a SAT Problem
    6.4.3 Computing a Partial Solution
  6.5 Implementation
    6.5.1 Dealing with Partially Specified Outputs
    6.5.2 Dealing with Partially Specified Inputs
    6.5.3 Undefined Reset States
  6.6 Evaluation
    6.6.1 Benchmarks
    6.6.2 Evaluation of MeMin
    6.6.3 Other Tools
    6.6.4 Experimental Setup
  6.7 Conclusions and Future Work
  6.A Appendix: Complete Benchmark Results
7 Summary, Conclusions, and Future Work
  7.1 Summary and Conclusions
    7.1.1 Models of Recent Microarchitectures
    7.1.2 General Models
  7.2 Future Work
Bibliography
Index

1 Introduction

Modern microprocessors are among the most complex man-made systems. As a consequence, it is becoming increasingly difficult to predict, explain, or optimize the performance of software running on such microprocessors. As a basis, one needs detailed models of their microarchitectures.
