
UNIVERSITY OF SOUTHAMPTON

FACULTY OF ENGINEERING AND PHYSICAL SCIENCES Electronics and Computer Science

Efficient and Secure Context Switching and Migrations for Heterogeneous Multiprocessors

by

Ilias Vougioukas

Thesis for the degree of Doctor of Philosophy

June 3, 2019

UNIVERSITY OF SOUTHAMPTON

ABSTRACT

FACULTY OF ENGINEERING AND PHYSICAL SCIENCES Electronics and Computer Science

Doctor of Philosophy

EFFICIENT AND SECURE CONTEXT SWITCHING AND MIGRATIONS FOR HETEROGENEOUS MULTIPROCESSORS

by Ilias Vougioukas

Modern mobile processors are constrained by their limited energy resources and by demanding applications that require fast execution. Single-core designs use voltage/frequency throttling techniques that allow the system to switch between performant and efficient configurations to address this problem. Heterogeneous multicores also need to decide on which core to run, rather than just adjust voltage and frequency. Consequently, they remain an open subject in terms of near-optimal design and operation. This thesis investigates the performance and energy trade-off when migrating between heterogeneous cores and, through a series of contributions, presents designs that enable low overhead transitions between cores.

The first contribution is based on a novel methodology that can directly compare the execution of heterogeneous cores. With it, an in-depth investigation of the effects on the memory system when migrating between such cores is conducted. The analysis reveals that heterogeneous multiprocessor system slowdown is asymmetrical: In-Order core performance relies on the memory subsystem, while Out-of-Order execution depends on accurate speculation. A proposed design minimises migration overheads in In-Order cores without making the design prohibitively complex to implement. This is achieved by sharing only the larger caches and the translation state.

The second contribution is a design that transfers state when a migration occurs between heterogeneous cores. The design eliminates the warm-up for Out-of-Order cores by transferring only minimal state. This improves post-migration accuracy, potentially enabling better performance and energy efficiency.

Finally, security has become a major concern for shared or transferable components in multicore systems, namely the branch predictor. The third contribution in this thesis investigates mitigation techniques for recently discovered side-channel attacks. The proposed design flushes all but the most useful branch predictor state, ensuring isolation with minimal performance loss.

Contents

Declaration of Authorship xiii

Acknowledgements xv

Acronyms xix

Nomenclature xxi

1 Introduction
  1.1 Research Motivation
  1.2 Research Questions
  1.3 Contributions
  1.4 Outline

2 Heterogeneous Multiprocessors
  2.1 Heterogeneous Multiprocessor Types
    2.1.1 ISA Based Classification
    2.1.2 Commercial and Conventional HMPs
    2.1.3 Reasons for HMP Usage
  2.2 HMP Architecture and Microarchitecture
    2.2.1 HMP Space Exploration
    2.2.2 Architectural Optimisation in Conventional Designs
    2.2.3 Merging HMP Cores
    2.2.4 Unconventional Approaches to HMP Design
  2.3 HMP Management
    2.3.1 Estimation and Modelling Techniques
    2.3.2 Load Balancing and Work Stealing
    2.3.3 Management Using Learning
  2.4 Discussion

3 Heterogeneous Multiprocessor Memory Subsystem Optimisation
  3.1 Simulation vs Hardware Experimentation
  3.2 Exploration in Heterogeneous Architectures
    3.2.1 Exhaustive Exploration
    3.2.2 State Space Reduction
    3.2.3 Simulation Methodology for First-Order Migrations
  3.3 Heterogeneous Multiprocessor Organisation
    3.3.1 HMP Resource Sharing


    3.3.2 Internal Core Components
    3.3.3 Comparing Parallel Simulations
  3.4 Experimental Setup
    3.4.1 Infrastructure
  3.5 Migration Cost
    3.5.1 Scenario A: No Sharing
    3.5.2 Scenario B: Share Caches
    3.5.3 Scenario C: Share Caches and TLB
    3.5.4 Scenario D: Share the L2 and TLB
    3.5.5 Comparison of memory sharing schemes
  3.6 Efficient Migrations
  3.7 Discussion

4 Branch Predictor State Transfer for Heterogeneous Cores
  4.1 Branch Predictor Overview
    4.1.1 The Bimodal Predictor
    4.1.2 The TAGE Predictor
    4.1.3 Perceptron Predictors
  4.2 Branch Predictor State Transfer Mechanism
  4.3 Impact of the Branch Predictor State
    4.3.1 Calculating Potential Performance Gains
    4.3.2 Measuring Accuracy Improvements
    4.3.3 Calculating the transfer overheads
  4.4 Summary

5 Security Considerations for Shared Microarchitecture
  5.1 Speculative side-channels
    5.1.1 Branch predictor side-channels
    5.1.2 Mitigation techniques
    5.1.3 Threat model
  5.2 Predictor Flexibility
    5.2.1 Steady-state and transient accuracy
    5.2.2 The Branch Retention Buffer
    5.2.3 Perceptron amplified / reinforced TAGE
  5.3 Experimental Setup
  5.4 Results
    5.4.1 Quantifying transient accuracy
    5.4.2 ParTAGE
    5.4.3 ParTAGE results
  5.5 Discussion

6 Conclusions
  6.1 Summary of Conclusions
  6.2 Research Questions Evaluation
  6.3 Future work
    6.3.1 Research into HMP Scheduling
    6.3.2 Investigating Memory Sharing for HMPs with Emerging Accelerators

    6.3.3 Extending Branch Predictor State Transfer to Other Designs
    6.3.4 Securing Other Shared Resources

A Supplementary Material
  A.1 Migration Overhead Study
  A.2 State Space Reduction to Linear Proof

B Performance and Energy Efficiency
  B.1 Performance
    B.1.1 Parallel Performance
    B.1.2 Benchmarking
  B.2 Power and Energy Efficiency
    B.2.1 Processor Power Dissipation
    B.2.2 Methods of Power Loss Mitigation
      B.2.2.1 DVFS
      B.2.2.2 Core Migrations
  B.3 Performance - Energy Trade-offs
    B.3.1 Pareto Optimality
    B.3.2 The Pareto Frontier

C Selected Publications

Bibliography

List of Figures

1.1 Moore's Law depiction of the exponential increase in transistor count.
1.2 A visual depiction of Dennard's scaling breakdown.
1.3 The gradual decrease of power per single transistor in processors.

2.1 A depiction of clumpiness in HMPs.
2.2 Approximation of the Pareto frontier of an HMP using PaTH.
2.3 Improved scheduling for HMPs using a Predictive Controller.
2.4 Performance comparison between Oracle, Reactive and Predictive Controllers.
2.5 Layout and wiring of two merged OoO cores.
2.6 Code reusability based on locality.
2.7 CoreFusion architecture fuses independent cores together.
2.8 Comparison of OoO and InO performance for a range of SMT configurations.
2.9 The MorphCore microarchitecture.
2.10 Core and memory delay and frequency as functions of voltage.
2.11 The TRIPS architecture.

3.1 Sampling execution with a higher frequency reveals micro-phases.
3.2 HMP system exploration creates an exponential state space.
3.3 Merged states with respect to exploration depth.
3.4 The state space of HMPs broken down in orders of migrations.
3.5 Exploration of first-order migrations.
3.6 A visual representation of the forking methodology.
3.7 The implementation of the methodology in gem5.
3.8 Different types of sharing schemes examined for HMP systems.
3.9 Generalised view of a nested translation.
3.10 Migration cost with private TLB and caches.
3.11 Migration cost when sharing all caches.
3.12 Migration cost when sharing all the caches and TLB.
3.13 Migration cost when sharing the L2 cache and TLB.
3.14 InO and OoO performance degradation with respect to migration frequency.
3.15 Percentage of beneficial migrations when being energy constrained.
3.16 Percentage of beneficial migrations when being EDP constrained.

4.1 Retaining the branch predictor state eliminates negative migration effects.
4.2 The architecture of a bimodal predictor.
4.3 The states of a bimodal predictor.
4.4 A depiction of a TAGE branch predictor.
4.5 The evolution of perceptron designs.


4.6 The design of the perceptron predictor.
4.7 The transfer mechanism between cores.
4.8 Performance cost when migrating without branch predictor state.
4.9 Component-wise misprediction rate of 64kB TAGE.
4.10 Increase in accuracy when transferring state for TAGE.
4.11 CPI breakdown when migrating with and without branch predictor state.

5.1 The transient accuracy of a branch predictor.
5.2 The branch retention buffer mechanism.
5.3 The BRB entry mechanism.
5.4 TAGE MPKI with respect to each component's size.
5.5 The ParTAGE predictor.
5.6 Transient analysis of bimodal predictors.
5.7 A comparison between TAGE and perceptron.
5.8 Comparing bimodal and perceptron designs below 1.25KB as base predictors.
5.9 Calibrating the small perceptron.
5.10 MPKI of ParTAGE variants.
5.11 Comparison of predictors at transient state.
5.12 The steady-state of ParTAGE compared to TAGE and Perceptron.

A.1 OoO performance degradation for shared and private caches.
A.2 InO performance degradation for shared and private caches.
A.3 A representation of the HMP states that can be merged.

B.1 Examples of instruction level parallelism exploitation.
B.2 Amdahl's Law of diminishing returns when 60% of execution is parallel.
B.3 Chip dynamic and static power dissipation progression over the years.
B.4 The formation of a Pareto frontier from Pareto optimal points.

List of Tables

2.1 Specifications and performance of recent HMP core examples.
2.2 Composite Core parameters.

3.1 Access latency for TLBs, caches and memory in line with the conducted simulations.
3.2 The specifications of the system used in the experiments in this chapter.

4.1 The specifications of the system used in the experiments.
4.2 The evaluated branch predictors.

5.1 The evaluated branch predictors.
5.2 The frequency of context switches on a Google Pixel phone.
5.3 Accuracy of various bimodal predictors.


Declaration of Authorship

I, Ilias Vougioukas, declare that the thesis entitled Efficient and Secure Context Switching and Migrations for Heterogeneous Multiprocessors and the work presented in the thesis are both my own, and have been generated by me as the result of my own original research. I confirm that:

• this work was done wholly or mainly while in candidature for a research degree at this University;

• where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated;

• where I have consulted the published work of others, this is always clearly attributed;

• where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work;

• I have acknowledged all main sources of help;

• where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself;

• parts of this work have been published as: Vougioukas et al. (2017), Vougioukas et al. (2018) and Vougioukas et al. (2019).

Signed:......

Date:...... 24/06/2019


Acknowledgements

First and foremost, I would like to express deep gratitude to my industrial supervisor, Dr. Stephan Diestelhorst, for his excellent guidance during my PhD. His knowledge and experience have been vital in shaping a significant part of the concepts, ideas and goals of this project, while his mentoring and personal effort have been truly invaluable.

I would like to equally thank my professor and academic supervisor Dr. Geoff V. Merrett, for his insight, advice and close guidance from an academic standpoint. A special mention should be made to Dr. Bashir Al-Hashimi and Dr. Andreas Hansson, for granting me the opportunity to pursue a PhD.

Special thanks to Andreas Sandberg and Nikos Nikoleris that heavily influenced my research, and helped me develop during my PhD.

Additionally, I would also like to thank all the colleagues that have contributed, one way or another, to my development over the past months. Most notably these include, in no particular order, René DeJong, Rodrigo Gonzalez-Alberquilla, Sascha Bischoff, Stephen Kyle, Rachael Smith, Dimitris Revythis, Konstantinos Psomas, Lucy Gonzalez, Nida Darr and Charlotte Christopherson.

To my family, that sacrificed so much to grant me all possible opportunities to succeed in life. Mom, Dad, Dino, Gerry, Pope and Lak I am eternally grateful.

As a final note, the work contained in this dissertation has been completed as part of a collaboration between the University of Southampton and Arm Research. The work has been funded both by the EPSRC, in the form of a CASE award, and by Arm, without which it would not have been possible to complete.


Αφιερωμένο στο Καλλάκι, τον Λάκη και τον Κώστα.


Acronyms

ALU Arithmetic Logic Unit.
AMAT Average Memory Access Time.

BP Branch Predictor.
BTB Branch Target Buffer.

CAM Content Addressable Memory.
CISC Complex Instruction Set Computer.
CMP Core Multi-Processor.
CPI Cycles Per Instruction.
CPU Central Processing Unit.
CS Context Switches.

DL1 Data Level Cache.
DLP Data Level Parallelism.
DRAM Dynamic Random-Access Memory.
DVFS Dynamic Voltage Frequency Scaling.

EDP Energy Delay Product.

GPU Graphics Processing Unit.

HMP Heterogeneous Multi-Processor.
HPC High Performance Computing.

IL1 Instruction Level Cache.
ILP Instruction Level Parallelism.
InO In-Order execution.
IP Internet Protocol.
IPC Instructions Per Cycle.
IPS Instructions Per Second.
ISA Instruction Set Architecture.

L1 Level 1 Cache.
L2 Level 2 Cache.
LLC Last Level Cache.
LSB Least Significant Bit.
LSQ Load Store Queue.
LSU Load Store Unit.

MIPS Million Instructions Per Second.
MLP Memory Level Parallelism.
MPKI Mispredictions Per Kilo Instructions.

OoO Out-of-Order execution.
OS Operating System.

PA Physical Address.
PC Program Counter.
PHT Pattern History Table.
PID Proportional Integral Derivative Controller.
PMC Performance Monitoring Counter.

RISC Reduced Instruction Set Computer.
ROB Re-Order Buffer.

SIMD Single Instruction Multiple Data.
SMT Simultaneous Multi-Threading.
SoC System-on-Chip.
SPO Strong Pareto Optimum.
SRAM Static Random-Access Memory.

TLB Translation Lookaside Buffer.
TLP Thread Level Parallelism.

VA Virtual Address.
VLIW Very Long Instruction Word.

WaR Write after Read.
WaW Write after Write.
WPO Weak Pareto Optimum.

XML Extensible Markup Language.

Nomenclature

C_L Load capacitance.
V_DD Source voltage.
D Delay.
E Energy.
P Power.
Q Charge.
S_overall Overall speedup.
θ Threshold.
f Frequency.


Chapter 1

Introduction

Digital computers have had an unfathomable rate of progress since the first processing system emerged in 1946 (Goldstine and Goldstine, 1996). Back then, computers filled entire wings of government buildings and institutions to perform mathematical computations impossible for humans. Since then, a gradual shift has happened, from massive specialised machines for particular tasks to small general purpose devices used for all aspects of everyday life.

In fact, many of these devices go unnoticed today as, apart from the mainstream personal computers that we are accustomed to, processors are hidden in cars, planes, toys and even in our bodies (Pannuto et al., 2016). It is this exact phenomenon of extreme progress in performance, size and capabilities that has caused computers to greatly affect our lives and play a pivotal role in human enabling, especially with the emergence of connected and mobile platforms.

As these mobile devices have become increasingly widespread and popular, their computational demands have increased. This has, in consequence, changed them from simple embedded systems into almost all-round high-performance powerhouses. The workloads that today's devices are designed to handle are equally diverse, ranging from multimedia, encryption and networking to advanced artificial intelligence (Mittal, 2016).

Up until recently, technology was able to meet future demands notably due to one observation known as Moore’s law (Moore, 1998; Chien and Karamcheti, 2013). Moore’s law predicted in 1965 (and adjusted in 1975) that transistor density roughly doubles every 18 months. The observation proved to be quite insightful, as it still stands true today. The importance of Moore’s law stems from the direct link between transistor density and processor performance.

The reasons why performance increases with larger integration are manifold. Primarily, at a smaller scale, gates can switch at a faster rate. This happens because the distance signals need to travel inside the chip decreases while electron speed remains constant. Additionally, in smaller implementations the capacitance is reduced, which also contributes to faster switching times of logical units and ultimately higher operating frequencies (Weste and Eshraghian, 2010).
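A rough first-order sketch of this relation (my own illustration, not taken from the thesis; the on-current I_on is my notation and does not appear in the Nomenclature): a gate switches roughly in the time needed to charge its load capacitance, so

    t_switch ≈ (C_L · V_DD) / I_on,        f_max ∝ 1 / t_switch

Shrinking a design therefore lowers C_L (and allows a lower V_DD), which directly raises the achievable operating frequency f.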


Figure 1.1: The exponential increase in transistor count as noted by Gordon Moore. Data collected from Berkeley's extended database (Danowitz et al., 2012).

To complement speed increase, smaller transistors also allow for more complex designs to be implemented that can significantly improve overall system speed. For example, increasing cache size effectively reduces the frequency with which data is fetched from memory and, as a direct consequence, reduces execution stalls due to memory accesses (Hennessy and Patterson, 2006; Jacob et al., 2008). Similarly, advanced concepts such as multiple execution units and branch prediction are used in larger architectures to further improve system performance by unlocking more Instruction Level Parallelism (ILP).

Another interesting observation, Dennard's scaling, showed that the power per transistor was steadily decreasing, which on a macro scale meant that processor power remained roughly the same from generation to generation (Kim et al., 2003; Kamil et al., 2008; Dennard et al., 2007). Combining the two “laws”, it was evident that there was a constant increase in performance without any extra energy cost. With all this in mind, processors were able to cruise along by doubling their density periodically while at the same time increasing the operating frequency.
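The interplay of the two observations can be sketched with the classic dynamic power relation (the activity factor α and the scaling factor κ below are my own notation, not defined in the thesis):

    P_dyn ≈ α · C_L · V_DD² · f

Under ideal constant-field (Dennard) scaling by κ, C_L → C_L/κ, V_DD → V_DD/κ and f → κ·f, so the power of a single transistor drops by roughly κ² while its area also shrinks by κ²: power density stays constant even as density and frequency grow. It is this balance that breaks down once V_DD can no longer be reduced and leakage grows, as discussed next.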

However, an upper speed limit was reached, over which frequency scaling was no longer realistic. The problem appeared as the electrical power required to operate a Central Processing Unit (CPU) started to become prohibitive (Dennard et al., 2007; Pell and Mencer, 2011). Technically, the reason why this breakdown occurs is that leakage current and thermal losses both rise as frequency increases. In essence, while it is possible to create a system operating at higher frequencies, the solution is not economically viable since cooling and power restrictions make such implementations significantly more expensive. This signified the start of the breakdown in Dennard's scaling (Dennard et al., 2007).

Figure 1.2: A visual depiction of Dennard's scaling breakdown. Data collected from Berkeley's extended database (Danowitz et al., 2012).

To get around this, engineers started looking at alternative designs that could operate at lower frequencies while retaining performance levels (Bischoff, 2016; Spiliopoulos et al., 2011; Eyerman and Eeckhout, 2011). This led to multi-core architectures, which can exploit parallelism in processes to achieve the same level of performance (Hennessy and Patterson, 2006; Ceze et al., 2006; Chaudhry et al., 2005, 2009a,b; Rangan et al., 2009). Hence, the multi-core era allowed technology to advance following roughly the same trend as before by exploiting the inherent parallelism in workloads as well as the multi-threaded nature of operation. In spite of the challenge of continuously scaling out the number of cores while maintaining memory bandwidth and efficiently managing caches and traffic, multiprocessors have consistently met performance expectations in recent years (Gutierrez et al., 2014).
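The gains from adding cores are, of course, bounded by the serial fraction of a workload; Appendix B covers this in detail, but a worked instance of Amdahl's Law (stated here in its standard form, using the Nomenclature symbol S_overall; the 60% parallel fraction matches the example of Figure B.2) illustrates the diminishing returns:

    S_overall = 1 / ((1 − p) + p/N)

With a parallel fraction p = 0.6, four cores give S_overall = 1/(0.4 + 0.6/4) ≈ 1.8×, and even infinitely many cores cannot exceed 1/0.4 = 2.5×.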

With the recent surge in core count, interconnection has become more critical as the bandwidth and latency issues are increasingly harder to solve (Chen et al., 2012; Sankaralingam et al., 2003, 2006). Furthermore, the effects of the collapse of Dennard's scaling have become unavoidably prominent, with new designs experiencing power surges well beyond nominal thermal constraints and the safe operating temperature range. The above constraints led to the realisation that future processors would only operate parts of their design while the rest remains inactive. The practice of turning off parts of the design that are temporarily not used is commonly referred to as dark silicon (Sadeghi et al., 2015; Patsilaras et al., 2012). Recent studies have shown that at a 7nm scale of integration, dark silicon may amount to 50%-80%, depending upon the specifics of the workloads, the design, and the implementation (Esmaeilzadeh et al., 2012). The ramifications of this effect are so significant that it has affected not only energy restricted systems such as mobile devices, but also systems that focus on performance, such as High Performance Computing (HPC) and servers (Mittal, 2016; Kamil et al., 2008; Sarma et al., 2015).

Figure 1.3: The gradual decrease of the power consumed by a single transistor in processors. Data collected from Berkeley's extended CPU database (Danowitz et al., 2012).

Modern processors adapted to these circumstances by utilising multiple cores with different characteristics, commonly referred to as Heterogeneous Multi-Processors (HMPs) (Yu et al., 2013; Sarma and Dutt, 2015; Navada et al., 2013; Mittal, 2016). With HMPs, a different set of impediments needs to be surmounted in order to exploit their inherent performance capabilities. In single core architectures, performance relies on correct scheduling of tasks aiming at reducing idle CPU time. They are relatively straightforward to understand and have been researched in detail in the past (Stallings, 2011; Tanenbaum and Bos, 2014). In multi-core designs the challenges are somewhat more complicated, as performance relies on distributing the workload that can be executed in parallel amongst the available cores, while also dealing with common resources, cache coherency and efficiently utilising memory (Gutierrez et al., 2014; Yu et al., 2013; Lukefahr et al., 2012). Furthermore, efforts in developing parallel schedulers have proved to be significantly more complex and less fruitful than their sequential counterparts (Chronaki et al., 2015; Yoo et al., 2015; Vet et al., 2013; Bhattacharjee and Martonosi, 2009b; Padmanabha et al., 2015).

In addition to the general multi-core issues that remain relevant to some degree, in HMPs it is also crucial to designate the best suited type of core for a specific application. This adds an extra dimension of complexity to the problem, as every core has different characteristics and therefore different behaviour for workload archetypes. Research in the field suggests that there is still much to understand and improve, both in terms of workload classification but also in potential architectural redesign to cater better to common workload bottlenecks (Fallin et al., 2014; Rotenberg et al., 2013; Lukefahr et al., 2012, 2014). Especially in terms of future HMP architecture, one decisive matter for future implementations is whether the current trend of generic cores is preferred over many accelerator-type cores that are tailored to handling specific types of workloads (Navada et al., 2013).

1.1 Research Motivation

From a historical perspective, there has been tremendous progress toward better performing and more efficient processors. However, the future seems to increasingly depend on better resource management, as scaling seems to be getting steadily harder to achieve (Esmaeilzadeh et al., 2012; Toumey, 2016; Hennessy and Patterson, 2006; Chien and Karamcheti, 2013). One effort to use systems smartly, expending exactly the desired level of performance and energy, is the use of Dynamic Voltage Frequency Scaling (DVFS) (Bischoff, 2016; Miftakhutdinov et al., 2012; Gopireddy et al., 2016). In the case of frequency and voltage scaling, the behaviour, trade-offs and gains have been thoroughly observed and used, to the point that one can argue that current systems and policies achieve close to optimal results.

On the other hand, with the introduction of HMPs, a similar problem has surfaced: that of deciding on which core to run in order to maximise performance or energy efficiency. The complexity of these systems, however, coupled with their diversity, significantly complicates the task of finding an optimal scheduling policy, or even identifying how close current policies have come to the optimum. Furthermore, potential benefits are tied to the transfer overheads that occur during migrations of execution. Altering the architecture can change the cost of migrations, making the optimal schedule dependent on the interconnection of processing units in an HMP.

Exploring the performance/efficiency of an HMP system is therefore an open, multifaceted problem that has been shown to be rather complex (Kumar et al., 2004; Venkat, 2019; Prodromou et al., 2019). Many studies have tried to improve HMP performance and efficiency, with limited success. In the survey conducted by Mittal (2016), which thoroughly goes through all attempts at HMP design, solutions are topical, optimising for specific cases.

The reason for this slow advancement pace, as shown in Chapter 2, is that currently there is no feasible way to exhaustively analyse schedules of systems that migrate between cores, and no means to measure the effect of proposed architectural modifications in a quantifiable manner. Current methods are based on heuristics and base their validation on a selection of results. As such, there is no way of truly assessing how far from optimal current designs really are, nor is it possible to design future systems with an a priori knowledge of how close they are to the theoretical maximum.
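A back-of-the-envelope count shows why exhaustive analysis is infeasible (the figures are purely illustrative, not results from the thesis): if a workload is split into n scheduling quanta and each quantum can run on one of k core/DVFS configurations, the number of distinct schedules is

    k^n

so even a two-configuration system running a 100-million-instruction workload at a 1k-instruction quantum already has 2^100,000 possible schedules, far beyond what any simulator can enumerate. Chapter 3 returns to how this space can be reduced.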

The motivation, therefore, for this work is to design more efficient and secure HMP systems, using more deterministic and quantifiable methods. To achieve this, several fundamentals need to be thoroughly understood, for instance the effect of migrations in heterogeneous systems, the components that can be shared between cores, and the complications that emerge when the aforementioned changes are implemented. Many of these questions have been addressed in recent work, albeit with simplifications that do not compare to realistic current or future implementations. Additionally, finding a way to identify the components that contribute to critical bottlenecks in HMPs can provide significant insight both in hardware design and in software, accentuating their strengths, revealing their weaknesses and ultimate limitations, and leading to notable improvements in today's technology.

On the hardware front, this can enable new heterogeneous core designs with different architectural characteristics that address the shortcomings of today's designs. One example of research that can derive from the above analysis would be to determine migration granularity for different architectures. Having the ability to fully explore a system's possible behaviour can also lead to new designs no longer guided by an intuitive process, but rather based on a quantitative method. Accelerator-type cores are also possibly beneficial for optimised execution, even though the market need for more generic designs that are cheaper to manufacture poses some challenging future dilemmas.

From a software perspective, many interesting approaches can evolve from this research, both in terms of improving the way schedulers influence migrations and in identifying the hardware infrastructure needed to fully enable software. A common constraint when trying to apply DVFS and smart migrations today is the limited number of Performance Monitoring Counters (PMCs) that help probe the system state. An in-depth analysis of HMP systems can provide insight into which are the principal components for making educated migration decisions in scheduling policies that can overcome the above limitations.

1.2 Research Questions

As is clear from the above analysis, HMP scheduling and design are wide open fields of research that can lead to a fundamental understanding of heterogeneity. Additionally, existing system simulation tools allow us to do in-depth analyses at system level, accurately enough to explore without the need for hardware fabrication. The merit of this is that it is possible to examine multiple different designs as well as do exhaustive exploration of execution on existing and potential systems with minimal resources. The aim of this thesis, therefore, is to present areas where current HMP design can be improved and to present solutions, primarily focusing on the efficient and secure transfer of execution between heterogeneous cores.

Consequently, it addresses the following research questions:

1. What are the best metrics to evaluate the quality of HMP designs? What is an unambiguous way to compare different architectures? These questions focus on investigating prior research in the field, identifying potential areas of improvement for existing designs and methods for managing HMPs to achieve near optimal performance. A critical part of this effort is to find a way of establishing a baseline that can compare heterogeneous cores that are deliberately dissimilar. Direct comparison can help when transfer of execution is necessary between different types of cores.

2. How does the memory hierarchy affect the trade-off between design complexity and improved operation for heterogeneous multiprocessors? This question focuses on identifying the memory components that lead to higher performance and energy efficiency. Sharing all components is not trivial, as some caches are latency sensitive and therefore add a lot of overhead to the execution when shared (a small worked example of this trade-off follows the list below). Identifying the structures that can be shared to benefit the system is critical for future heterogeneous multiprocessors.

3. How does the sharing of internal core components affect the behaviour of heterogeneous systems? This question explores architectural and microarchitectural modifications to the core that can be beneficial to the behaviour of heterogeneous systems. Components such as the register file, the branch predictor and the pipeline are tailored to a specific core design. It is critical to identify which of these resources contribute most to the slowdown caused when migrating between cores, and to devise solutions to increase performance.

4. How do shared resources in HMPs affect the overall system security and what design changes are needed to ensure safe operation? Shared resources in multicore systems contribute significantly to high performance but introduce dangerous security exploits, namely remote timing side-channels that use shared caches and branch predictor state to compromise secure operation. Recent findings show such attacks are feasible in systems with shared resources that transfer execution (Lipp et al., 2018; Kocher et al., 2018; Evtyushkin et al., 2018; Khasawneh et al., 2018; Prout et al., 2018), so ensuring secure operation in HMPs should be a high priority.
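As a sketch of the latency sensitivity raised in question 2 (the cycle counts and miss rates below are illustrative assumptions, not results from the thesis), the Average Memory Access Time makes the asymmetry clear:

    AMAT = t_hit,L1 + m_L1 · (t_hit,L2 + m_L2 · t_mem)

Assuming t_hit,L1 = 2, t_hit,L2 = 15 and t_mem = 80 cycles with miss rates m_L1 = 10% and m_L2 = 20%, the AMAT is 2 + 0.1 · (15 + 0.2 · 80) = 5.1 cycles. Adding two cycles to the L1 hit time because it is shared raises this to 7.1 cycles (a 39% increase), whereas adding the same two cycles to a shared L2 raises it only to 5.3 cycles. This is why sharing the larger, less latency-critical structures is the more attractive option.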

1.3 Contributions

The above questions have been approached through several contributions made as part of this thesis. These contributions allow HMP systems to be more accurately analysed and identify the components that contribute to overheads when transferring execution between cores. The novel designs presented can bring near optimal heterogeneous systems closer to reality.

In detail, the contributions are:

1. A novel and versatile methodology that is able to compare performance between heterogeneous cores and conduct various experiments to uncover the potential, in terms of performance and energy savings, of each design. The work was presented at the CODES+ISSS conference (Best Paper nominee), published in the ACM Transactions on Embedded Computing Systems (Vougioukas et al., 2017), and is reported in further detail in Chapter 3.

2. A heterogeneous architecture that shares the Level 2 Cache (L2) and Translation Lookaside Buffer (TLB) and demonstrates a performance increase of up to 7% and energy savings of up to 6%. Furthermore, the design is more efficient, as on average 50% of the execution operates under a lower Energy Delay Product (EDP). This is reported in Chapter 3, was presented at the CODES+ISSS conference and was published in the ACM Transactions on Embedded Computing Systems (Vougioukas et al., 2017).

3. A novel microarchitectural mechanism, presented in Chapter 4, that transfers branch predictor state between heterogeneous cores to eliminate the hardware migration overheads. The design is able to transfer minimal state between cores, even when the microarchitectural implementations and the sizes of the branch predictors do not match. When combined with the above contributions to the shared memory hierarchy for heterogeneous systems, Out-of-Order execution (OoO) cores can recover as much as 35% of the performance that would otherwise be lost in frequent migrations. This is currently submitted for publication in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).

4. A new structure that fortifies branch prediction against side-channels by retaining a minimal part of the predictor state per context. The component requires little to no additional area overhead to isolate contexts, is applicable to both heterogeneous and homogeneous multiprocessors, and can reduce the misprediction rate by 20% when compared to other isolation techniques. This work is published in HPCA '19 (Vougioukas et al., 2019) and explained in Chapter 5.

The publications for all the above contributions are provided in Appendix C.

1.4 Outline

The breakdown of the rest of the thesis, outlining the key points of each chapter is provided below:

Chapter 2 investigates recent studies relevant to HMP optimisation for performance and energy efficiency. Here, alternative architectures are reviewed, highlighting the significant improvements over past and current designs. Additionally, on the software optimisation side, voltage/frequency scaling and core migrations are analysed, both in terms of online policies and offline space exploration, to maximise hardware capabilities.

Chapter 3 focuses on the contributions that aim at understanding the effect migrations have on the performance and energy efficiency of heterogeneous systems. To fully characterise the behaviour of these designs, a novel simulation methodology is presented, along with the necessary infrastructure, that can quantify migration effects and overheads in HMPs through direct comparisons. The practicality of this technique enables in-depth studies previously not feasible, and it is used as a basis for the experiments conducted as part of this thesis.

The methodology is used for a variety of experiments to identify beneficial memory component sharing between cores, quantify the overhead for different migration frequencies and calculate the maximum possible benefits when migrating. The second contribution in this chapter presents a new memory architecture for heterogeneous processors that, with minimal sharing, accomplishes the same effect as fused architectures, but for a fraction of the design complexity.

In Chapter 4, microarchitectural improvements are detailed concerning the efficient transfer of branch predictor state between different core types. This removes the “cold” start effects that occur after a migration and dramatically improves performance, especially for high frequency migrations that move the execution to an OoO core. As different cores have different prediction needs, an analysis is presented of the techniques that allow state to be transferred between different types of branch predictor designs in a meaningful way.

In Chapter 5, a closer look is taken at the security implications that emerge when sharing components. Side-channels through the caches and the branch predictor have been shown to be possible, and hard to address in current hardware. Focusing on the branch predictor, a mechanism is described that fortifies context switches and migrations against a subset of side-channel attacks, without degrading performance or adding significant area overhead.

Finally, Chapter 6 summarises the insights and contributions presented, and provides a broader view of the impact of the work presented throughout this manuscript. Concluding remarks also suggest what future studies in this field should cover, based on the findings shown.

Chapter 2

Heterogeneous Multiprocessors

In recent years, an abundance of research has focused on making CPUs more efficient under a variety of applications and settings. A considerable amount of this work explores Core Multi-Processor (CMP) optimisation. For this reason, a holistic overview of research in the field is covered, analysing the recent contributions and identifying their possible shortcomings and potential areas for future development, with special focus on HMPs or methods directly applicable to HMPs. This chapter reviews work organised into three main areas:

• Heterogeneous multiprocessor types;
• Heterogeneous multiprocessor architecture and microarchitecture;
• Heterogeneous multiprocessor scheduling.

2.1 Heterogeneous Multiprocessor Types

Since a multitude of terms exist to describe heterogeneous processors, some basic concepts are described here, in order to help keep a common convention throughout the thesis. For this reason, a brief description of the basics of all HMP designs is presented, with the aim of focusing the scope to a narrower research area.

Heterogeneous computing, as a general term, refers to systems that feature multiple dissimilar types of processors. The foundation of this concept is to improve efficiency and performance by adding cores tailored for specific applications instead of multiple copies of the same all-round core. By adding dissimilar processors it is possible to specialise their capabilities to tackle specific bottlenecks and cover a wider range of workload types.

Today, a plethora of diverse heterogeneous processor designs exist; however, behind the extreme differences all of them follow the same basic principles. As the design of heterogeneous cores has only recently been widely adopted, a lot of notions surrounding them have been loosely defined. Srinivasan et al. (2011), Koufaty et al. (2010) and Khan and Kundu (2010) all propose classifications for asymmetrical processors.

In the first two cases the distinguishing factors are the instruction set and the microarchitecture of the cores in the processors, while in Khan and Kundu (2010) the argument is made that heterogeneity can either be deliberate, meaning that the hardware is configured in such a way that it will always have an inherent asymmetry (different instruction set, microarchitecture, general purpose - accelerator type core collaboration etc.), or it can be extemporaneous, in the sense that identical cores can vary based on the online configuration. Examples of the latter case are when cores are running at different DVFS levels or when hardware is dynamically reconfigured, creating clusters of asymmetric operation. This distinction is very useful as it allows the inclusion of reconfigurable systems, which under the other categorisation methods are excluded.

2.1.1 ISA Based Classification

The main distinction according to both Srinivasan et al. (2011) and Koufaty et al. (2010) in HMP architectures is which Instruction Set Architecture (ISA) each core type uses. A clear distinction can be made between single-ISA cores, that despite their architectural differences use the same ISA throughout, and heterogeneous-ISA cores, where each type of core operates under a different ISA. The question of which design is better is not evident, as each offers a different set of advantages and disadvantages, aiming at exploiting different conveniences.

Single-ISA HMPs

The most common approach is to operate all cores under the same instruction set so that migrations between the cores can happen as seamlessly as possible. With the same ISA, cores simply continue execution without any binary translation or Operating System (OS) overhead (Kumar et al., 2003). The simplicity of this concept allows these systems to be easily designed, with no additional process when migrating between cores (Kumar et al., 2004). The shift in execution is seamless, with the heterogeneous cores being completely transparent to the running applications. The OS, depending on the design, can either have knowledge of the underlying heterogeneity or not. Having the OS know the underlying hardware configuration can improve scheduling decisions, while other systems might decide not to expose the cores' specifics, making the hardware more complicated but the OS simpler.

Heterogeneous-ISA HMPs

On the other hand, heterogeneous-ISA HMPs, which do not share the same ISA between dissimilar cores, can supposedly exploit their wider diversity. In theory, this can lead to a more tailored execution when using the correct core for each workload. This can be seen in Devuyst et al. (2012); Barbalace et al. (2014); Kumar et al. (2006); Venkat and Tullsen (2014, 2016), where it is claimed that different ISAs lead to larger divergence, which is desirable as it allows the system to perform better over a larger range of workloads.

However, as the ISA is not common between the cores, a binary translation step is required, which adds significant overhead. Furthermore, process migration across heterogeneous ISAs requires a program state transformation of the registers and memory, which is architecture dependent. This makes any further expansion to other ISAs and designs a costly process.

Devuyst et al. (2012) propose to use analysis to store most of the code in an architecture-independent format. This is claimed to mitigate the overhead of remapping on the new core and, consequently, on the new ISA. As this cannot be done throughout the whole code, equivalence is achieved in “windows” of the execution, which revolve around function calls. To support instant migration, an overhead caused by binary translation is ultimately paid until a function call is reached. After the function call, code can continue to execute on the new ISA.

Venkat and Tullsen (2014) show that heterogeneous ISAs improve the effectiveness of HMPs. Furthermore, they expand on this idea by analysing the distance between two ISAs, taking into account code density, register pressure, instruction complexity, floating point, Single Instruction Multiple Data (SIMD) support and dynamic instruction count. To validate their proposal they use three different ISAs (Thumb 32-bit, x86-64 and Alpha64) with a memory management scheme and compare with a single-ISA system on the SPEC CPU2006 integer and floating point benchmarks. The results show noteworthy improvement against their homogeneous and single-ISA counterparts; however, the design is limited to operating with minimal migrations between different ISA domains, as the binary translation is too costly (Venkat, 2019; Prodromou et al., 2019).

Srinivasan et al. (2011) explore, amongst other designs, the prospect of heterogeneous instruction sets. The approach here is to identify portions that can be offloaded to accelerator-type cores (using their unique ISA). The results in this study do not present any compelling reason to support heterogeneous ISAs.

Sometimes heterogeneous-ISA HMPs are also termed “OS-capable heterogeneous-ISA systems”, specifically to differentiate them from systems that include the Graphics Processing Unit (GPU) and other accelerator-type core designs. In the CPU-GPU case, the graphical unit handles only graphics-specific tasks and cannot run a fully fledged operating system, whereas, in theory, heterogeneous-ISA systems expect all cores to be somewhat capable of this (Mittal, 2016).

2.1.2 Commercial and Conventional HMPs

From the above, it is evident that many studies have been conducted to explore whether heterogeneous-ISA systems are preferable. In spite of this, no viable design currently exists in the industry, as their complexity makes them impractical. For this reason, the focus is primarily on the more conventional single-ISA HMPs, which prove to be a simple yet efficient approach.

This can be seen in numerous studies (Baum et al., 2016; Yu et al., 2013; Khubaib, 2014; Sarma and Dutt, 2015; Lukefahr et al., 2014; Yoo et al., 2015; Chronaki et al., 2015; Tomusk and O'Boyle, 2013), where the main focus is the analysis of, or comparison with, one of the most widespread families of HMPs currently on the market, Arm big.LITTLE processors. These processors bundle cores of the same ISA but with significantly different architectural characteristics, as shown in Table 2.1 below.

                      Cortex A15          Cortex A7           Cortex A57          Cortex A53
ISA                   Armv7 (32-bit)      Armv7 (32-bit)      Armv8 (32/64-bit)   Armv8 (32/64-bit)
Frequency             1.7GHz              1.3GHz              1.9GHz              1.3GHz
Pipeline              OoO                 InO                 OoO                 InO
Issue width           3                   2                   3                   2
Fetch width           3                   2                   3                   2
Pipeline stages       15 to 24            8 to 10             15 to 24            8 to 10
L1 I-cache            32KB/2-way/64B      32KB/2-way/32B      48KB/3-way/64B      32KB/2-way/64B
L1 D-cache            32KB/2-way/64B      32KB/4-way/64B      32KB/2-way/64B      32KB/4-way/64B
L2 cache (64B)        512KB-4MB/16-way    128KB-1MB/8-way     512KB-2MB/16-way    128KB-2MB/16-way
Performance (MB/s)    99.69               77.93               155.29              109.36
Energy (mWh)          19.75               10.56               27.72               17.11
Performance/Energy    5.04                7.38                5.60                6.39

Table 2.1: Examples of four heterogeneous cores, coupled as big (OoO) and Little (InO) (Pricopi et al., 2013). The last three rows show results on the XML Parsing benchmark (BaseMark OS II) (Mittal, 2016).

In the big.LITTLE setup, two types of cores exist: the Out-of-Order execution (OoO) high performance ones and the power efficient In-Order execution (InO) ones. Each can exist alone or grouped into clusters featuring multiples of the same type of core. Usually some cache is shared amongst cores in the same cluster (most commonly the Level 2 cache), while data sharing with cores in different clusters is done either at the Last Level Cache (LLC) or by traversing the interconnect.

2.1.3 Reasons for HMP Usage

The diverse nature of HMPs allows them to tailor execution to a wide range of applications and achieve specific goals, such as improved performance or energy efficient operation. Throughout many studies (Srinivasan et al., 2016; Fedorova et al., 2009; Yoo et al., 2015; Yu et al., 2013; Khubaib, 2014) it has been shown that, as a general case, large cores (OoO, deeper and wider pipelines etc.) usually provide a better Energy Delay Product (EDP) than the simpler architectures in computationally heavy workloads. On the other hand, small cores outclass the high performance designs when applications are memory-intensive, as the many atomic operations and the limited reuse life of data mitigate performance scaling, making energy consumption the crucial factor.
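For reference, the Energy Delay Product used in this comparison (and throughout the thesis) is simply the product of the energy consumed and the execution time, using the Nomenclature symbols:

    EDP = E · D

Lower values are better, and the metric only improves when a performance gain is not bought with a disproportionate increase in energy, or vice versa.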

In reconfigurable systems, their flexibility allows them to cater to specific workload needs, for example when there are memory, instruction or data level parallelism (MLP, ILP, DLP) phases. This has the added benefit of reducing the problem of over-provisioning hardware and ultimately maximising hardware usage (Hill and Marty, 2008).

2.2 HMP Architecture and Microarchitecture

As heterogeneity is a relatively recent direction in computer architecture, the question of what architecture is optimal has been repeatedly posed. This proves to be a rather overloaded question, with a large configuration space covering the specific architecture of the cores, their optimal number, as well as their topology. As such, a significant amount of work and various proposals have been made to find the best possible core design and system topology that yield the most performance/energy efficiency.

2.2.1 HMP Space Exploration

A lot of interest has circulated in selecting the best combination of cores from an abstract perspective.

Figure 2.1: A clumpy (top) and a balanced (bottom) selection of cores covering the design space. Figure reprinted from Tomusk et al. (2014).

Tomusk and O'Boyle (2013) consider that, since the design space is forbiddingly large to explore, a better approach is to investigate weak heterogeneity. According to the authors, instead of exploring vastly different architectures, tweaking minor values in the same design can lead to smaller yet still interesting trade-offs. A limit study varying the Re-Order Buffer (ROB) size is completed in a simulated environment to track variations in EDP. Results show minor variation in performance between cores but emphasise the flexibility in designing migration scheduling policies.

Tomusk et al. (2014) investigate how cores in HMP systems should be designed to achieve the best possible performance and energy. The study argues for the concept of “clumpiness”, where the underlying philosophy is that heterogeneity itself needs to be directed in such a way as to cover the execution/design space in a balanced manner. Clumpiness is a metric that measures the Euclidean distance between the points on an energy-delay plane.
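A minimal sketch of how such an unevenness measure could be computed (my own illustration of the idea, not Tomusk et al.'s exact formulation; the points below are hypothetical normalised power/time pairs):

    from math import hypot, sqrt

    def clumpiness(points):
        # Coefficient of variation of nearest-neighbour distances on a
        # normalised energy-delay plane: 0 means a perfectly even spread,
        # larger values mean the selected cores bunch up ("clump").
        nearest = []
        for i, p in enumerate(points):
            dists = [hypot(p[0] - q[0], p[1] - q[1])
                     for j, q in enumerate(points) if j != i]
            nearest.append(min(dists))
        mean = sum(nearest) / len(nearest)
        var = sum((d - mean) ** 2 for d in nearest) / len(nearest)
        return sqrt(var) / mean

    # Hypothetical (power, time) points for two selections of four cores.
    clumpy   = [(1.0, 3.6), (1.05, 3.5), (1.1, 3.4), (1.6, 1.0)]
    balanced = [(1.0, 3.6), (1.2, 2.7), (1.4, 1.8), (1.6, 1.0)]
    print(clumpiness(clumpy), clumpiness(balanced))  # the clumpy set scores higher

A balanced selection such as the bottom one in Figure 2.1 would score low on such a measure, while a clumpy one such as the top one would score high.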

To validate the proposal, a simulator is used which performs an in-depth search over all the possible core designs on an encryption benchmark. After all the cores have been positioned and the Pareto front formed, a selection based on clumpiness can be made to uniformly cover the design space and offer better tailored execution. This is shown in Figure 2.1.

The concept of using diverse cores is further supported in Tomusk et al. (2015) where, apart from trying to select based on uniformity, flexibility is also proposed in order to improve behaviour. As a decisive factor for core selection, the Kolmogorov-Smirnov non-parametric statistical test is used, allowing a subset of cores to be selected not only based on uniformity, but also taking into account performance flexibility.

Navada et al. (2013) also consider that conventional HMP designs are not ideal, as the cores are too similar and as such could be considered monotonic. Instead, one way to improve the design is to use genetic algorithms to develop non-monotonic cores and fine-tune the characteristics of each. The genetic algorithm uses an objective function based on results extracted from SPEC to guide the design. Furthermore, the observation is made that all generated HMPs converge to the pattern of having one all-round core and shaping the rest in the form of some type of accelerator. This design aims at improving performance by identifying and classifying workloads and migrating their execution to the corresponding specialised core.

Lukefahr et al. (2014) approach the matter differently and try to fully explore the state space, albeit with some degree of tolerable error. While they focus on a big.LITTLE type architecture, they attempt to find the inherent maximum potential. To accomplish this, the Pareto frontier of optimal performance/energy efficiency is approximated using a new method (PaTH). PaTH splits a workload into quanta of 1k instructions, then executes and collects performance/energy statistics for each quantum under all core/DVFS configurations.

As the state space is too large to explore exhaustively, the study assumes that quanta are independent of each other. Working from this assumption, multiple quanta are bundled into segments, effectively creating significantly smaller subsets that are explorable. For each segment, the Pareto frontier is estimated by finding the best and worst case schedules with similar energy and delay profiles.

The assumption that quanta are independent also implies that segments are too and, therefore, their frontiers can be joined with a convolution. This results in a Pareto frontier of the whole state space with a bounded error. The simplifications in this approach however, prohibits this to be applied to actual designs that do not share the whole microarchitecture. This happens as the assumptions taken only hold in ideal systems where the register file, the branch predictor and Chapter 2 Heterogeneous Multiprocessors 17

(a) Pareto frontiers are approximated using the best and worst cases for segments.

(b) All the segment Pareto frontiers are convoluted into one that defines the entire workload.

Figure 2.2: Approximation of the Pareto frontier of an HMP using PaTH. Figure reprinted from Lukefahr et al.(2014).

The shaping of the Pareto frontier with PaTH is shown in Figure 2.2. Related concepts are explained in more detail in Appendix B.

2.2.2 Architectural Optimisation in Conventional Designs

Apart from the system level approach and ground-up HMP redesign, a notable amount of research is conducted in optimising particular designs and features of existing hardware, or proposing targeted architectural improvements that tackle specific issues.

Gutierrez et al.(2014) compare the performance and energy efficiency of private and shared LLCs. The study observes that private LLCs affect the upper bound of migration granularity as, after a point, LLC misses impact the system so much that performance is hindered. This happens because, during a migration from one core to another in a single thread, the cache starts with no data stored (cold) and therefore many misses will be incurred.

Shared caches provide a larger working set and completely remove the LLC warm-up penalty due to migration, but cannot be powered off to save energy. On the other hand, private LLCs get penalised by the migration misses but also benefit from being able to power down to preserve energy. The experimental results indicate that when migrations occur at a relatively large window of instructions (100k and above), performance degradation is negligible in private LLCs over shared ones. The comparison with shared LLCs indicates that private last level caches are preferable in situations of coarse migrations.

To further improve the efficiency Gutierrez et al.(2014) propose some alterations to the coherence protocol that enable ownership forwarding on read snoops in order to reduce the write-backs to the powered off core and memory.

The study shows that performance does not suffer at the 1M instruction granularity while energy savings are significant. Also by extending the cache protocol they manage to increase energy efficiency.

big μEngine: 3 wide [email protected], 12 stage pipeline, 128 ROB entries, 160 entry register file, Tournament branch predictor (Shared)
LITTLE μEngine: 2 wide [email protected], 8 stage pipeline, 32 entry register file, Tournament branch predictor (Shared)
Memory System: 32 KB L1 iCache, 1 cycle access (Shared); 32 KB L1 dCache, 1 cycle access (Shared); 1 MB L2 Cache, 15 cycle access; 1024 MB Main Mem, 80 cycle access

Table 2.2: Experimental Composite Core parameters used in Lukefahr et al.(2012).

2.2.3 Merging HMP Cores

Another approach to HMP design optimisation focuses on merging heterogeneous cores into a hybrid design with a shared execution front-end and separate back-ends. Notable studies discussed below show that significant benefits can be extracted, assuming the designs can be implemented.

Merging InO and OoO cores

Lukefahr et al.(2012) introduce the concept of CompositeCores, which are cores that share the same front end but have different back ends (μEngines). The different μEngines are modelled similarly to Arm’s big.LITTLE architecture and have significantly different characteristics. The big one is an OoO design, which has a wide and deep pipeline and is used for high performance, while the LITTLE μEngine is an InO design with a narrow/shallow pipeline targeted at power efficiency.

As the cores share virtually all other hardware resources (caches, branch predictors, fetch units, etc.), context switches between the μEngines are kept to the absolute minimum, transferring only the register state. As the overhead is negligible, migrations as fine-grained as 1k instructions become possible.

To manage the switch, a special controller called the Reactive Online Controller is proposed, which maximises energy savings for a defined performance loss. The controller comprises the following components:

• The performance estimator, which uses a model to predict the performance of the big μEngine based on the LITTLE μEngine’s behaviour and vice versa. The parameters of the estimator are calculated through regression analysis. In general, the model considers that branch mispredictions heavily affect big performance, whereas the LITTLE μEngine suffers primarily from L2 cache misses, being unable to mask their latency.

• The threshold controller, which is essentially a Proportional Integral Derivative (PID) controller regulating the Cycles Per Instruction (CPI) error based on the performance estimator’s assessment.

• The switching controller, which decides whether to perform a migration from one μEngine to another or to continue operation unaffected (a simplified sketch of this decision loop is given after the list).
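To make the interplay between these components concrete, a minimal sketch of such a reactive switching loop is given here. All coefficients, counter names and thresholds are hypothetical placeholders chosen for illustration and are not the values or the implementation used in Lukefahr et al.(2012).

# Illustrative sketch of a reactive switching loop for a CompositeCore-style system.
# All coefficients and thresholds below are hypothetical placeholders.

class PerformanceEstimator:
    """Predicts the CPI of the inactive uEngine from counters of the active one."""
    def __init__(self, coeffs):
        self.coeffs = coeffs  # obtained offline through regression analysis

    def estimate_other_cpi(self, counters):
        c = self.coeffs
        return (c["base"] + c["cpi"] * counters["cpi"]
                + c["bmiss"] * counters["branch_misses"]
                + c["l2miss"] * counters["l2_misses"])

class ThresholdController:
    """PI-style controller regulating the allowed CPI (performance) error."""
    def __init__(self, kp, ki, target_slowdown):
        self.kp, self.ki, self.target = kp, ki, target_slowdown
        self.integral = 0.0

    def update(self, observed_slowdown):
        error = self.target - observed_slowdown
        self.integral += error
        return self.kp * error + self.ki * self.integral  # new switching threshold

def choose_next_core(big_cpi, little_cpi, threshold):
    """Switching controller: pick the uEngine for the next quantum."""
    slowdown = (little_cpi - big_cpi) / big_cpi
    return "LITTLE" if slowdown <= threshold else "big"

# Example quantum: counters observed on the active (big) uEngine.
estimator = PerformanceEstimator({"base": 0.4, "cpi": 0.8, "bmiss": 0.01, "l2miss": 0.05})
controller = ThresholdController(kp=0.5, ki=0.1, target_slowdown=0.05)
counters = {"cpi": 0.9, "branch_misses": 12, "l2_misses": 3}
little_cpi = estimator.estimate_other_cpi(counters)
threshold = controller.update((little_cpi - counters["cpi"]) / counters["cpi"])
print(choose_next_core(counters["cpi"], little_cpi, threshold))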

Figure 2.3: The lower accuracy shown by a reactive controller vs that of a predictive controller for a fine-grained system. Ideally, points with low performance differences (area shaded grey) could be run on LITTLE for energy gains with little performance loss, while the rest should be run on the big back-end. Reprinted from Padmanabha and Lukefahr(2013).

To validate the theoretical design, CompositeCores were introduced into gem5, a full system simulator (Binkert et al., 2011), to calculate the performance and the energy consumption during operation, and to identify the minimum length of migrations (quantum length). The comparison is conducted between the big, high performance core and the HMP CompositeCore system in terms of performance, using an oracle which selects the LITTLE core when its performance degradation is below 5% compared to big. The method claims energy savings of 18% for 95% of maximum performance.

Building on CompositeCores, Padmanabha and Lukefahr(2013) extend the pre-existing design by adding a predictive trace-based switching controller. The idea behind this addition is that, with ultra-fine-grained switching, reacting to current behaviour is not enough to make correct decisions. This is due to the fact that, in the scope of a few instructions, there is increased volatility in performance, as shown in Figure 2.3.

Instead, with the predictive controller, it is possible to capture traces that detect data flow patterns, named super-traces. These super-traces are then used to predict the upcoming traces and estimate which core would better accommodate the workload. According to the study, the overheads added in hardware and energy are minimal, while the energy savings compared to an OoO core were measured through gem5 to be around 15%, which is close to the architectural optimum of 18% previously mentioned. This can also be seen in Figure 2.4.

Figure 2.4: Performance of the oracle (top), reactive (middle) and predictive (bottom) controllers. The predictive controller adheres closer to the optimum. Reprinted from Padmanabha and Lukefahr(2013).

A second improvement to the CompositeCores design, called DynaMOS, is proposed in Padmanabha et al.(2015) and permits migrating between the cores while keeping the schedule. DynaMOS is a dynamic migration OoO scheduler that is claimed to increase energy savings by 32% for only 5% performance loss compared to running only on big.

The motivation is that the InO core cannot identify ILP and MLP, whereas an OoO core can actually reorder instructions and mask long latencies. The key idea is to record, or memoize, an OoO schedule by executing an instruction trace on the big core and replay the memoized schedule on the InO core for future iterations of the trace. In cases, therefore, where there is a repeatable scheduling pattern in the OoO core, it is beneficial to use the InO core to capitalise on the periodicity and boost energy savings.

To accomplish the above, the CompositeCore design is augmented using a schedule trace cache, an online controller and hardware to enable an extra mode of operation in LITTLE called OinO. Initially, the big μEngine identifies traces that are memoizable and stores them in the schedule trace cache. The LITTLE μEngine can switch into OinO mode to use the memoized traces to boost performance. The online controller decides whether big or LITTLE should run in the near future and, in the case of LITTLE, whether it should run in regular In-Order mode or switch to OinO.

When the execution exhibits phased behaviour DynaMOS is claimed to significantly improve energy efficiency, being able to save up to 32% energy over execution only on big, with only 5% performance degradation.
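The memoize-and-replay idea behind OinO can be sketched as follows. Identifying a trace solely by its start PC and storing a plain issue order are simplifying assumptions made here for illustration only, not details of the actual DynaMOS hardware.

# Illustrative sketch of schedule memoization and replay (not the DynaMOS implementation).

schedule_trace_cache = {}  # start PC -> issue order memoized while running on big

def record_on_big(start_pc, ooo_issue_order):
    """Store the dynamic schedule observed while the big uEngine executed the trace."""
    schedule_trace_cache[start_pc] = list(ooo_issue_order)

def run_on_little(start_pc, program_order):
    """Replay a memoized schedule in OinO mode if one exists, otherwise fall back to InO."""
    memoized = schedule_trace_cache.get(start_pc)
    if memoized is not None and len(memoized) == len(program_order):
        return memoized            # OinO mode: issue in the recorded OoO order
    return list(program_order)     # regular InO mode: issue in program order

# The big core hoisted the second load above the add; LITTLE replays that order later.
record_on_big(0x400100, ["ld r2", "ld r4", "add r3, r2, 1", "st r4"])
print(run_on_little(0x400100, ["ld r2", "add r3, r2, 1", "ld r4", "st r4"]))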

The proposal admits certain caveats, however, which are listed as:

• False register dependencies: In OoO mode, Write after Write (WaW) and Write after Read (WaR) hazards are handled by register renaming. In OinO mode, the renamed register convention must be maintained for correctness.

• Complex traces: Instructions in traces that span branches may have been reordered, causing divergence in execution. The OinO mode must roll back to before the start of the divergence.

• Load/Store aliasing: Loads that may be speculatively moved before older stores cause aliasing. OinO uses a simplified Load Store Queue (LSQ) to surmount this issue.

• Unsupported interrupts: Interrupts in OinO mode are treated as misspeculation events. This causes the pipeline and the LSQ to be flushed and the trace to restart in regular InO mode.

Overall, CompositeCores oversimplify the design of hybrid systems, assuming that transferring the register state can be done with minimal added complexity, while the overhead in terms of cycles is, according to the authors, negligible. These claims are unrealistic in practical implementations, as routing extra wires adds capacitance which effectively introduces more delay, especially at high frequencies. In addition, the extra logic needed to connect or transfer the register file also contributes to this delay, while a clock mismatch between two cores can lead to synchronisation problems.

This has been addressed by Rotenberg et al.(2013); Forbes et al.(2016), where the authors try to implement in hardware a system resembling CompositeCores. Their design manages to transfer threads between two heterogeneous cores using two mechanisms, fast thread migration and Cache-Core Decoupling. Fast thread migration uses a “swap unit” to enable a one-cycle register file exchange and a swap clock to synchronise the inbound and outbound cores.

(a) An illustration of the chip placement. (b) An illustration of the chip routed.

Figure 2.5: Layout and wiring of two merged OoO cores (Rotenberg et al., 2013; Forbes et al., 2016).

The cache-core decoupling mechanism ensures that caches from the core that was active before the switch remain easily accessible to the migrating core.

Details of their implementation show that, even though the sharing of the Level 1 Cache (L1) and the register file is possible, the design must be oversimplified to achieve this. In particular, the two cores used to perform the migration do not differ much, as their two-core implementation features a 1-way OoO core and a 2-way OoO core. The two cores are in effect too similar, negating the benefits of heterogeneous systems, as they use up an equal amount of area (Figure 2.5) and power while having similar performance. Furthermore, as the wiring increases so does the delay in the transfer logic, forcing the operating frequency to drop to 180MHz, which is significantly slower than today’s heterogeneous processors.

Core merging with multiple back-ends for fine grain heterogeneity

Similarly, Fallin et al.(2014) propose merging multiple heterogeneous back-ends into one core, the heterogeneous block architecture. The idea closely resembles CompositeCores in the sense that all cores share the same front-end but execute blocks of instructions on different back-ends. The motivation of the study is based on the observation that, at a finer phase granularity (tens to hundreds of instructions), inherent characteristics are much better defined and detailed in comparison to coarser phases. Hence, instead of categorising in terms of compute and memory bound, smaller code blocks can also be classified by properties such as consistent or volatile dynamic instruction schedules, high or low ILP, etc.

Another observation is that, at a finer granularity, execution blocks can be assumed to be atomic and consequently can be executed on separate back-ends. These atomic blocks enable two major improvements: parallel execution of code blocks and specialised assignment to tailored back-ends.

Parallel execution of blocks is feasible with distinct back-ends which operate similarly to superscalar and OoO execution, but at a higher level of abstraction. Instead of reads and writes, code blocks use live-ins and live-outs to track dependencies between the blocks. If the notion of atomicity is maintained between blocks and there are no live-in/live-out dependencies, code blocks are committed all at once. In the case where this does not hold true, the dependent blocks are flushed and re-executed.
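A much-simplified sketch of this block-level dependency check is given below; representing blocks as plain sets of live-in and live-out registers is an assumption made purely for illustration and not the actual mechanism of the heterogeneous block architecture.

# Illustrative check of whether an atomic block can execute in parallel with in-flight blocks.

def can_issue(block, in_flight_blocks):
    """A block may issue if none of its live-ins are produced by an uncommitted block."""
    pending_outputs = set()
    for other in in_flight_blocks:
        pending_outputs |= other["live_outs"]
    return block["live_ins"].isdisjoint(pending_outputs)

# Block B consumes a value produced by the still-uncommitted block A, so it must wait
# (or, if issued speculatively and the dependency is violated, be flushed and re-executed).
block_a = {"live_ins": {"r1"}, "live_outs": {"r2"}}
block_b = {"live_ins": {"r2"}, "live_outs": {"r3"}}
print(can_issue(block_b, [block_a]))  # False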

Additionally, the ability to steer workloads to specific back-ends with certain properties is claimed to be critical to performance improvement. However, as code blocks are assigned online to back-ends, most of the time an extra step of morphing the code to better suit the architecture is necessary. The correct selection of back-ends is further simplified by the observation that schedules can be reused, because code in close vicinity tends to have similar behaviour. This can be seen in Figure 2.6.

[Plot: fraction of dynamic code chunks with same-schedule vs different-schedule chunks, per benchmark sorted by y-axis value.]

Figure 2.6: Code reusability based on locality. Figure taken from Fallin et al.(2014).

This study is evaluated through a design that consists of three separate back-ends: an OoO, an InO and a Very Long Instruction Word (VLIW) core. In the front-end, blocks are assessed, with metadata being stored in a Block Info Cache, which identifies the dependencies and the block specialisation. This information is then used to steer the blocks to the correct core type. Similar to Padmanabha and Lukefahr(2013), memoization is used to exploit OoO scheduling on VLIW cores and therefore increase performance and reduce expended energy.

An in-house full-system simulator is used to measure the improvement, which is reported to be on average a 36% power reduction compared to a conventional 4-wide OoO core with only 1% performance degradation.

2.2.4 Unconventional Approaches to HMP Design

From the above it is obvious that there is a significant effort to bring cores closer together and sometimes merge them without sacrificing heterogeneity in the process. Another unconventional approach being developed is reconfigurable cores that alter their structure, built to adapt to workloads with different demands.

In this category, Ipek et al.(2007) propose core fusion, a reconfigurable multiprocessor using smaller cores that combine into larger, more complex ones. The design is claimed to offer:

• Greater software flexibility and diversity. Simple cores are ideal for fine-grained parallelism, while high performance can easily be achieved by merging them ad hoc into larger cores.

• Smoother phase transitioning. The transition from small power-efficient cores can be done incrementally and bidirectionally, allowing for “more” core architectures.

• Greater utilisation of hardware using a single design. As the same core design is replicated throughout this multiprocessor, core fusion is essentially a modular structure that can strictly adhere to the workload needs without over-consuming or over-performing.

• Optimised design for parallel code. The ability to isolate cores allows for improved execution of parallel code. Fusing comes at a minimal overhead that is acceptable, as parallelism is argued to yield more benefits.

• Core independence and fault tolerance. Cores can run independently and detect, isolate or correct bugs and hard faults without the need for additional hardware.

[Diagram: eight cores, each with private L1 instruction and data caches, sharing a single L2 cache.]

Figure 2.7: CoreFusion architecture showing two fused cores and two independent cores. One of the fused cores is comprised of four independent cores while the other of two. Figure reprinted from Ipek et al.(2007).

To accomplish the aforementioned flexibility, several non-conventional designs are proposed, such as fusable instruction and data caches, collective instruction fetch amongst cores, instruction steering and renaming, collective execution, distributed memory access and collective commits.

Furthermore, two hardware reconfiguration instructions, called FUSE and SPLIT respectively, are added to the ISA to merge cores together and break them up. Both of these instructions execute conditionally, when the circumstances under which they are called allow it. The FUSE operation can be completed after the end of a parallel region and the start of a sequential region, if a fuse request has been made. If the sequential region is not long enough or the cores are not eligible to fuse, the instruction is instead executed as a no-operation. Otherwise, all the involved cores are flushed along with their respective instruction caches, and their fetching mechanisms are fused and reconfigured. Data caches are managed by the coherence protocol and need no special reconfiguration.

Figure 2.8: Comparison of OoO and InO performance (speedup over OoO with one thread) across a range of SMT configurations. Beyond a certain number of SMT threads, OoO and InO cores exhibit the same behaviour. Figure taken from Khubaib(2014).

The SPLIT operation happens before a parallel region. In this case, the pipeline is drained and copy instructions transfer the state to a selected core (Core 0). When the transfer completes, the designated core continues execution while the rest of the previously merged cores remain available for the application.

To measure the design’s performance, it is compared against static homogeneous and heterogeneous designs using a full-system simulator (SESC: SuperESCalar Simulator). While the results do not show the design to be superior to more conventional ones, they highlight which components impact performance and exemplify its versatility.

To amend CoreFusion’s problems, Khubaib et al.(2012); Khubaib(2014) propose MorphCore, a fused design that uses a 4-wide OoO core with the ability to decompose into small In-Order Simultaneous Multi-Threading (SMT) cores. The approach, which arguably stems from the opposite end of the spectrum to CoreFusion, observes that small cores that bundle together to create more complex designs struggle to deliver high performance. Instead, by starting from a high performance design, it is more feasible to dynamically reduce complexity in order to mitigate energy losses during slow or parallel phases.

The underlying issue that MorphCore tries to tackle is the extreme disparity in power consumption between large OoO designs and smaller designs. The extra power is stated to be justified only when there is enough ILP to increase core throughput; otherwise, bulky designs tend to be very inefficient. To ameliorate the design, another observation is made: SMT designs can achieve OoO performance in cases of high parallelism, as shown in Figure 2.8.

The architecture of MorphCore (shown in Figure 2.9) is essentially an extended SMT OoO core with an In-Order mode. The main idea behind this approach is that when the number of active threads exceeds a threshold, the core is reconfigured to work as an InO SMT core. This is done by disabling or swapping OoO components for InO ones to save energy and maximise Thread Level Parallelism (TLP). Switching between InO and OoO has similar overheads to CoreFusion, as pipeline draining and component enabling/disabling are needed.

Figure 2.9: The MorphCore microarchitecture as shown in Khubaib et al.(2012).

To validate the architecture, a cycle-level simulator was used, coupled with McPAT to collect power data. The results show similar performance between a MorphCore and a 2-way OoO design on single-threaded workloads and a 22% improvement in multi-threaded workloads.

Gopireddy et al.(2016) present another unconventional approach to heterogeneous and many-core design. The motivation in this work is to identify the main weaknesses in current designs and, by making small changes to the hardware, manage to solve them. In that sense, the main dilemma when running on HMPs is identified as the varying execution modes that need to deliver high performance while maintaining respectable energy efficiency. The claim is that DVFS and heterogeneity, while useful, are sub-optimal because both methods lack flexibility and leave a large portion of the hardware unused.

To amend the above, a design of many scalable cores is proposed, borrowing from both the concepts of voltage/frequency throttling and heterogeneous design. Similar to MorphCore, the starting point of the design is a high performance OoO core with two modes: a High Performance (HP) mode and an Energy Efficiency (EE) mode. In HP mode the core runs as a regular OoO core with standard DVFS capabilities. In EE mode, however, a number of changes allow the design to operate while consuming less energy. To achieve this, three main concepts are applied.

(a) Logic and storage delay for various voltage levels. (b) Logic and storage frequency for various voltage levels.

Figure 2.10: Core and memory delay and frequency as functions of voltage. Reprinted from Gopireddy et al.(2016).

The first concept applied is the decoupling of the supply voltage between the logic and the storage structures of the pipeline. This disentanglement stems from the fact that making storage voltage-flexible turns out to be very inefficient at nominal power values. For this reason, logic voltage is lowered to the point of acceptable delay.

With storage and logic decoupled, the second concept is to adjust the voltage of storage so that the frequency is twice as fast as the operating frequency of logic. To do this, a small increase in storage voltage is proposed, that significantly boosts frequency. As far as energy is concerned, the difference is rather small as most of the power dissipated is due to leakage. The faster storage execution is claimed to improve Instructions Per Cycle (IPC) and the overall performance and energy consumption of the core. The decoupling of voltage and frequency for logic and storage is shown in Figure 2.10.

The third concept is to leverage this higher storage speed by fusing parts of the pipeline in EE mode. As the pipeline depth reduces without affecting the core frequency, the IPC of the core improves. In the same manner, storage structures such as the branch predictor, the ROB, the Load Store Unit (LSU) and the register files are increased in size consuming some of the time slack but contributing to more ILP and MLP which in turn increase IPC.

When evaluating the design ScalCores are shown to have several advantages over regular HP OoO cores, mainly:

• For the same power constraints, they increase performance (and reduce execution time) by 31% on average.

• For the same power constraints, the energy consumed improves by 48%.

• The ability to quickly switch between HP and EE modes, allowing for finer-granularity switches.

(a) TRIPS Chip. (b) TRIPS Core. (c) Execution Node.

Figure 2.11: The TRIPS architecture taken from Sankaralingam et al.(2003).

Sankaralingam et al.(2003) present a fully reconfigurable design which is described as a polymorphous architecture. The goal of this study is to strike a balance between fine-grained parallel execution on small, well connected cores and more complex workloads on larger, more “logical” cores.

The architecture is comprised of four reconfigurable OoO (16-wide) cores that can execute ILP code efficiently, and 32KB reconfigurable memory tiles. When needed, these cores can break down into smaller, simpler designs to exploit TLP or DLP. Additionally, the design can reconfigure memory banks into scratchpad memories or synchronisation buffers to accommodate producer-consumer communication between neighbouring cores and exploit MLP. TLP is exploited by space-sharing the Arithmetic Logic Unit (ALU) array between threads, while DLP is obtained when, in predictable control flow, multiple small cores synchronise and execute unrolled loops.

As the design faces several challenges in distributed instruction fetching, extra logic is added to accommodate control protocols etc. In follow-up research, Sankaralingam et al.(2006) show that the design is indeed feasible and that the workarounds do not inhibit performance severely. For distributed operations the design shows acceptable behaviour; however, many challenges still exist, such as further mitigating the overhead of propagating communication beyond neighbouring partitioned cores.

Karpuzcu et al.(2009) try a different approach to many-core architectures. Instead of trying to comply with the power budget dictated by the breakdown of Dennard scaling, their many-core design explores the prospect of sacrificing core life expectancy through extended operation above nominal power to boost performance. To implement this, a new scheme called Dynamic Voltage Scaling for Aging Management (DVSAM) is used, which labels most cores as throughput cores operated only on parallel workloads, while the rest are marked as expendable cores that can be overclocked to up to 16% higher frequency to deal with sequential bottlenecks.

According to the study, in order for the above hypothesis to be feasible, two requirements need to be fulfilled. The primary requirement is an accurate model of core ageing that can help push the boundaries of performance and consumption. The study shows that the model is directly linked to the manufacturing technology used and therefore needs customisation for each unique implementation. Additionally, ageing is also considered a potential restriction for future designs, when cores might wear out at a faster pace. To this extent, the authors propose that expendable cores be gradually overclocked to maximise their performance gains without sacrificing too much of their life expectancy.

The second requirement is a processor design suited to this use. The design presented is BubbleWrap, a 32-core homogeneous multiprocessor specifically designed to accommodate DVSAM. The few expendable cores are used much like sequential accelerators, with the extra restriction that only one is used at a time until it completely dies out. Reflecting on the above, while the BubbleWrap processor is a homogeneous design, there is a clear distinction in the different ways that its sub-components are utilised, arguably allowing it to be classified as a heterogeneous execution design.

2.3 HMP Management

Apart from improvements in design and architecture, resource management arguably plays an equally, if not more, significant role in performance and efficiency. This can be seen in the many studies (Yoo et al., 2015; Yu et al., 2013; Patsilaras et al., 2012) that attempt to approach the best possible utilisation of the cores. Furthermore, even though significant progress has been made towards better utilisation, newer studies (Baum et al., 2016) still show that the field has not yet saturated and significant gains can be obtained by investing in researching it. Taking this into consideration, it is crucial to explore the progress made in managing HMP systems.

Bischoff(2016) remarks that finding the Pareto frontier in terms of DVFS points requires simplification of the state space and the merging of states so that it can be fully explored. Online DVFS governors have been shown to consistently achieve near optimal decisions. In addition, this work shows that finer-grained voltage/frequency throttling inhibits performance and efficiency. Lukefahr et al.(2014) attempt to similarly explore the HMP state space; however, in doing so an oversimplified system is used, disregarding cache warm-up and migration overheads. To this extent, no methodology to determine optimal core scheduling or a realistic migration granularity limit emerges from the literature review.

In conjunction with the above, when comparing architectural optimisation with software improvements in managing already existing hardware, there is a great disparity in effectiveness. In Sarma et al.(2015) it is shown that even basic awareness of the underlying hardware at the software level can lead to over 50% improvement in energy efficiency over a hardware-agnostic Linux kernel.

Navada et al.(2013) propose a hardware/software synergy model, where accelerator-type cores comprise the architecture and a steering algorithm schedules workloads on the designated core. The scheduler, in this case, measures stalls and subsequently finds the best core to handle the respective bottlenecks. This relatively simple methodology is claimed to significantly improve behaviour, with up to 25% energy savings for 2% performance degradation, presumably reaching 99% of optimal behaviour. This indicates that hardware and software co-design might be necessary to fully exploit HMPs.

2.3.1 Estimation and Modelling Techniques

The most common approach to managing HMPs is through modelling and using estimates of performance and energy efficiency to make adjustments. Gupta et al.(2012) take a closer look at heterogeneous systems and note that, apart from the actual cores, a large portion of the energy is consumed by the “uncore”, the neighbouring subsystem shared by the whole design, such as the shared caches, interconnects, power control logic etc.

To accurately estimate a system’s performance and energy efficiency, a model is proposed for big and small cores taking into account the components classified as uncore. The model in turn is used on different types of workloads, from intermittent mobile applications and sustained low-intensity applications like media playback, to multi-threaded parallel workloads. Through the experimental evaluation, it is shown that uncore power and performance are just as important as the actual core as far as energy efficiency is concerned. To fully utilise HMPs, the study concludes that uncore-aware schedulers are needed.

The concept of an uncore is also used by Chen et al.(2012) to model a system and apply DVFS techniques. The estimation technique is based on Average Memory Access Time (AMAT), which is shown to directly affect both performance and energy efficiency. For many-core Systems-on-Chip (SoCs), reducing AMAT can lead to significant performance/energy improvements. A PID-based DVFS controller is used to steer the system towards a better operating state. The results show up to 27% energy reduction for a 3.7% performance drop.
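As a reminder of how AMAT composes the memory hierarchy latencies (the numbers below are hypothetical and chosen only to illustrate the calculation, not values from Chen et al.(2012)):

AMAT = t_{L1} + m_{L1} (t_{L2} + m_{L2} t_{mem})

For instance, with a 1-cycle L1, a 15-cycle L2, an 80-cycle main memory, a 10% L1 miss rate and a 20% L2 miss rate, AMAT = 1 + 0.1 × (15 + 0.2 × 80) = 4.1 cycles; halving the L2 miss rate to 10% lowers this to 3.3 cycles, which illustrates why small changes in the uncore can translate into sizeable performance and energy differences.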

On the same concept of heterogeneity-aware management, Shelepov et al.(2009) develop a scheduler for HMP systems. The algorithm, called the heterogeneity-aware signature scheduler (HASS), attempts to improve scheduling on heterogeneous cores using architectural signatures. These signatures are a compact composition of metadata, including information on “memory-boundedness”, available ILP, clock sensitivity etc. To reduce run-time overhead, the signatures are generated offline and bundled with the application binary.

The static nature of HASS presents the method with some limitations that hinder its performance. Firstly, the static nature of the algorithm makes it susceptible to inaccuracies caused by dynamic changes in application behaviour. In addition, the signatures need to be specifically tailored for each application, which adds a small overhead on the application development side. Finally, the dynamic phases of applications cannot be identified with HASS and, as a consequence, cannot be used to exploit heterogeneity. Testing the scheduler shows that it performs close to the static oracle, the best possible static schedule.

Becchi and Crowley(2008) explore scheduling decisions for HMP architectures, and specifically the difference in performance and behaviour between static and dynamic scheduling. To validate this, two dynamic policies are used and compared to a near optimal static scheduler. The system features three different cores sharing the same ISA. The study shows significant improvement when using a dynamic scheduler over the static one, mainly because the system can adapt to workload changes, which happen often.

2.3.2 Load Balancing and Work Stealing

A study focusing solely on dynamic schedulers with interesting insight is Chronaki et al.(2015). The proposal is a dynamic scheduler, called Criticality Aware Task Scheduler (CATS), that is HMP-aware and maximises efficiency. CATS defines tasks as critical when they are in the longest path in a dependency graph and assigns them to fast cores. As the paths in the dependency graph change dynamically, the scheduler is able to track the change and accordingly adjust and migrate tasks. This way critical tasks get to execute in high performance cores while the little, low-power cores handle the non-critical workloads.

To maximise the efficiency of the method, tasks can either be clustered into groups, separated randomly, or even grouped into families that reduce transfer overheads. CATS was validated both on real hardware and in simulation, with significant improvements ranging from 30% up to 170%.

Kondo et al.(2007); Vet et al.(2013); Bhattacharjee and Martonosi(2009b) extend the above concept of weighing and balancing the criticality of threads and propose task stealing. Kondo et al.(2007) use the concept of thread fairness to improve performance and avoid unbalanced cores. Instead of migrating the workloads to different cores, the algorithm focuses on using DVFS in order to improve fairness and reduce memory contention. Through fairness it is claimed that multiprocessor architectures have increased throughput compared to their unbalanced counterparts. Bhattacharjee and Martonosi(2009b) try to implement load balancing by using currently existing PMCs to track memory statistics and determine thread criticality.

Akram et al.(2016) identify that wherever there is garbage collection, it is beneficial to track its criticality so that it never stalls system operation. When identifying imminent bottlenecks or urgent garbage collection, the scheduler moves the task to the high performance cores to avoid the aforementioned stalls. According to the study this can yield notable benefits of up to 16% better performance and a 20% improved energy-delay product.

2.3.3 Management Using Learning

Another DVFS approach (Sasaki et al., 2007) tries to combine static information, such as breakpoints in execution that signify phase changes, with PMCs such as cache misses, branch predictor misses, IPC etc. The prediction model is based on a statistical regression analysis using the static and dynamic metadata. This way, the relation between performance and the PMCs is learned beforehand. At run time, code for predicting performance is inserted into the execution, and the voltage/frequency level is decided based on the desired predicted performance degradation from the analysis.
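A minimal example of the kind of regression model described above is sketched here; the choice of counters and the synthetic numbers are illustrative only and are not the model or data of Sasaki et al.(2007).

# Illustrative least-squares model relating PMC readings to normalised performance.
import numpy as np

# Rows: [cache_misses, branch_misses, IPC] per sampled interval (synthetic data).
pmcs = np.array([[120, 30, 1.8],
                 [400, 55, 1.1],
                 [260, 40, 1.4],
                 [ 80, 20, 2.0]])
observed_perf = np.array([0.95, 0.60, 0.78, 1.00])

# Learn the coefficients offline; at run time they predict the expected performance,
# and the V/f level is chosen to meet a target predicted degradation.
X = np.hstack([pmcs, np.ones((len(pmcs), 1))])     # add an intercept term
coeffs, *_ = np.linalg.lstsq(X, observed_perf, rcond=None)

new_sample = np.array([300, 50, 1.2, 1.0])         # counters from the current phase
print(f"predicted normalised performance: {float(new_sample @ coeffs):.2f}")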

Tarsa et al.(2014) try an unconventional method of managing HMPs. The study suggests using a deep neural network to train on data from hardware performance counters in order to estimate periods of low system throughput. To improve the accuracy of the neural network and reduce training overhead they use hierarchical sparse coding with a support vector machine for the classification. Sparse coding is used for de-noising, while the hierarchical representation captures complex patterns from few samples. This happens as the data vectors are classified in the transformed feature domain, which is claimed to uncover typically “hidden” patterns.

The simulations show that the method improves prediction and behaviour over linear regression models and achieves an accuracy of over 94% when training on one workload. The results show that deep learning and neural networks can be a promising approach for future HMP management. For this to happen, however, obstacles need to be surmounted in dealing with over-fitting, diverse workloads and user-interactive applications.

2.4 Discussion

Based on the above analysis, it is clear that there is a lot of progress yet to be made in heterogeneous processor design and management. On the architectural side, many designs have been proposed to improve performance while maintaining a low energy footprint; however, the performance/energy improvements are minimal while the complexity of the new features seems prohibitive. The conclusion drawn from the collective results of the above studies is that HMP architecture and topology have not yet been fully understood and researched. On the software side, the studies have shown more progress in terms of improving energy savings. The concept of phases within workloads arises, indicating that their correct prediction can lead to improved efficiency when they are assigned to the correct core. Furthermore, it has been observed that even simple changes in scheduling policies can lead to significantly more efficient system operation. Reversing the perspective, even the most optimised hardware will perform inefficiently when the wrong management decisions are made. On the other hand, without hardware enabling system awareness, it is impossible to implement an optimal processor scheduler.

Consequently, it is vital to redesign hardware to fully exploit the software improvements that have been made. As a result, the remainder of this thesis revolves around finding techniques to quantify the effects of future architectural and microarchitectural changes that can improve system behaviour, particularly in terms of the memory subsystem and the branch predictor.

Chapter 3

Heterogeneous Multiprocessor Memory Subsystem Optimisation

To take full advantage of heterogeneous systems it is necessary to have a full understanding of the effects of migrations, as their benefits lie in being able to switch and adapt to the dynamism of workloads. As shown in Chapter 2, HMP design and management are still open fields for research. This can be primarily attributed to the inability to quantify the performance of current designs and policies.

Research on actual hardware so far focuses on improving scheduling, while studies on simulated HMPs try to improve designs by increasing component sharing between the heterogeneous cores (Shelepov et al., 2009; Butko et al., 2015; Navada et al., 2013; Lukefahr et al., 2012; Padmanabha et al., 2015). Both approaches, however, try to improve performance and energy efficiency without identifying the principal factors that cause performance to degrade. As such, scheduling is improved via heuristics, while HMP designs that feature aggressive component sharing might not be the most pragmatic way to remove bottlenecks.

This work focuses on understanding all aspects of scheduling and migrations in heterogeneous systems. Unlike prior art that simply simulates the overall performance of specific proposed heterogeneous systems (Navada et al., 2013; Butko et al., 2016; Fallin et al., 2014), in this chapter a novel simulation methodology is presented that enables the precise measurement of different HMP system components and their effect on the overall performance, especially during frequent transfers of execution between cores.

A prominent problem that arises throughout all studies of HMPs is the inability to determine, in an absolute way, a design or policy that could yield improved efficiency. Before explaining the details of the proposed methodology, it is useful to understand why in-depth research of heterogeneous systems has proven to be so difficult.


3.1 Simulation vs Hardware Experimentation

When it comes to heterogeneous systems, researchers have over the years debated what is the best way to study their characteristics and find ways to improve them. At the core of this discussion lies the question of whether to use actual hardware or revert to simulated systems to draw conclusions (Butko et al., 2015, 2016; Venkat and Tullsen, 2016; Venkat, 2019). Both approaches have their advantages and disadvantages, so choosing one over the other should be done carefully in order to complement the research rather than hinder it.

Experimentation on actual hardware has two main advantages. Firstly, while simulated devices always have inaccuracies stemming from modelling complexities, physical devices, by definition, are not susceptible to such errors (Walker et al., 2018a). The modelling errors even in accurate simulators can often be in the range of 20% of the actual values (Binkert et al., 2011). This is why hardware is often preferred when accurate measurements are necessary.

Secondly, execution on real hardware can be much faster than that of a simulated system. Depending on the simulation fidelity, simulated execution can be in the order of 100 times slower than running on an identical native platform. Fast execution can be especially useful when many results are needed, for example when using machine learning approaches which require a lot of training data.

On the other hand, simulations allow certain experiments to be conducted which are not otherwise feasible (Lukefahr et al., 2012; Venkat, 2019). The simulation results, while error prone, can often lead to useful conclusions when the models are correlated with actual hardware (Walker et al., 2018b; Reddy et al., 2017; Nikov et al., 2015). Furthermore, unlike real hardware, they allow for in-depth analysis by retrieving significantly more data than what is possible from a physical device. This happens because hardware is restricted to a small number of performance counters that get polled with some degree of imprecision, whereas a simulated run can always extract all the data from the system at once. One more crucial aspect that makes simulations preferable is that they allow experimentation with future designs and enable experiments that can explore all hypothetical means of execution.

To gain a better understanding of heterogeneous systems, part of this thesis focuses on quantifying HMP behaviour, finding ways to assess scheduling decisions and identifying the better migration policies. As such, the use of simulated systems provides a clear advantage in the ability to explore all possible scenarios of execution. As an example, Figure 3.1 shows how an excerpt of execution can be perfectly aligned and used to directly compare the performance between a big and a little core. This analysis is feasible in a simulated system, as the same simulation can be replicated and perturbed, which in the case above swaps the cores. By contrast, as forked execution is not possible in hardware, two separate executions need to be triggered in parallel. However, in hardware the executions are not fully deterministic, so deviations between them will cause them to diverge and therefore the phases will end up misaligned. While a small skew can be tolerated when conducting a coarse-grained investigation, it is especially severe in the case of fine-grained analysis, as even a small skew can lead to misaligned and ultimately wrong results.

[Plot: instructions per cycle of the big and LITTLE cores over the same excerpt of execution, sampled every 1 million instructions (top) and every 100 thousand instructions (bottom), against billions of instructions executed.]

Figure 3.1: An excerpt of execution of dijkstra running on two different core types. Sampling at a higher resolution reveals phases that can potentially be exploited to increase efficiency.

3.2 Exploration in Heterogeneous Architectures

Current approaches (Fallin et al., 2014; Lukefahr et al., 2012; Navada et al., 2013) compare designs and policies and decide on the preferred setup using empirical evidence extracted from simulated systems. The reason for doing so is that, when considering all the scheduling possibilities for HMPs, the state space can become intractably large to assess, as is mentioned in Lukefahr et al.(2012, 2014); Padmanabha and Lukefahr(2013); Padmanabha et al.(2015). This leads to a fundamental problem: an exhaustive exploration is too expensive to find the configuration and schedule with the best possible performance.

3.2.1 Exhaustive Exploration

To simplify the problem, the following analysis focuses on single-threaded execution, where only one core operates at a time. Exploring the whole state space of this simplified system offers only two choices, run on big (b) or run on little (L), for each quantum of execution. This still proves to be incredibly complex, because the number of possible unique states that emerge increases exponentially. As a visual example, a system similar to big.LITTLE is presented in Figure 3.2. The nodes at the edges of the tree represent states that have never migrated, whereas the inner ones have performed at least one migration in the past. The state space can be expressed mathematically using the depth (N) and width (α) as:

State space = α^N    (3.1)

When attempting to perform fine-grained migrations, it is evident that the state space very quickly explodes and becomes unmanageable, as this increases the tree depth. Increasing the possible choices by adding more, diverse cores and enabling various DVFS levels contributes further to the problem by increasing the tree width. A similar problem is described in Bischoff(2016), where the optimal DVFS schedule is sought. The solution proposed in that case is to simplify the state space by taking into consideration that, apart from different operating frequencies, identical cores have identical behaviour under the assumption of single-threaded execution. In that case, it is possible to linearise the space by treating every state as independent from any previous states.
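To put Equation 3.1 into perspective, with α = 2 (one big and one little core) and a migration decision every 1k instructions, a region of only 1M instructions already gives N = 1000 and therefore 2^1000, roughly 10^301, possible schedules; the granularity here is an arbitrary example, but any realistically fine quantum leads to a similarly intractable count.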


Figure 3.2: The state space when exploring the execution of a heterogeneous system; the fan-out after four steps is 2^4 = 16.

In the case of HMPs, this approach is not possible, as heterogeneous cores execute code in fundamentally different ways. This can be seen, for instance, when comparing an OoO and an InO core that execute the same code. The reordering of instructions that the OoO core performs can leave the caches, the TLB or the branch predictor in a different state compared to the execution of the InO core. As such, there is no guarantee that at the end of each state the memory subsystem will be in an equivalent state, and therefore the tree states cannot be merged to reduce the complexity of the analysis.

3.2.2 State Space Reduction

Instead of exploring all the possible states, an approximation can be used that, with limited complexity, can still produce useful results. The first step to simplifying the state space is based on the observation that execution on different cores does not affect the stream of instructions upon commit. This allows the direct comparison of heterogeneous cores that operate at different performance levels and frequencies, as long as epochs are measured in executed instructions and not in cycles or time.

In spite of the above, states cannot be merged because, even though the instructions between the different core designs can be synchronised, the caches might be in a very different state. This happens because OoO cores often reorder read and write instructions to mask cache misses. As InO cores do not have the ability to do so, caches can be in a different state even after executing the same code. This disparity in the cache state can, in certain circumstances, make two states that execute on the same core type have very different outcomes in terms of performance.

To overcome this, the observation is made that systems with private memories have empty or “cold” caches after a migration. This can help us identify the equivalent states and merge them. The affected states are the ones that run the same instructions on the same core with no data in the caches.

[Diagram: after merging, one step yields 2 unique states, two steps yield 4, three steps yield 6, and four steps yield 8.]

Figure 3.3: Depending on the history and the granularity, the complexity of a step changes by 2^H. Arrows of the same colour denote merged states, while unique arrows show states that cannot be merged.

To generalise, any state where the cache contents do not affect the operation and the core executes the same instructions can be merged. As mentioned above, migrating to a core with cold caches is one case that generates equivalent states. Other cases that simplify the state space occur when very volatile workloads constantly thrash their caches; in such workloads performance will not be notably affected by migrations, as the system is constantly missing anyway. On the other side of the spectrum, workloads that do not rely on caches to deliver performance can be simulated independently from previous system states, as the cache effects are negligible.

The number of instructions needed for warm-up varies with each workload. This means that a system can be fully explored at a predefined granularity ignoring the cache effects. Considering, for example, that beyond a certain number of instructions history becomes obsolete (Section 3.5 shows this happens in most cases after approximately 100k instructions), a simpler model can be constructed that does not fan out beyond calculable proportions. At that granularity, as caches do not affect core performance, the model can be reduced to two states per level: one running on the big core and one running on the little one.

Based on this logic, if migrations need to happen at double the granularity of full cache warm-up, the tree needs to be twice as deep. However, some states can be simplified, as noted above, when they follow the same pattern after a migration. Extrapolating from this logic, it is possible to calculate that every step needs 2^H calculations to fill all states, where H is the number of historical steps and is defined as a natural number describing the ratio of the granularity of analysis over the number of instructions needed for full cache warm-up (Figure 3.3).

The overall complexity to fully explore the state space can be calculated by multiplying by the number of quanta in the simulation. This makes the algorithm O(N), assuming that H << N, where N is the number of forks. An in-depth analysis of the methodology that merges the states and reduces the space can be found in Appendix A.2.
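The difference between the exhaustive and the merged-state exploration can be illustrated with a small calculation. The quantum count and history depth below are arbitrary example values, not measurements.

# Illustrative comparison of exhaustive vs merged-state exploration costs.

def exhaustive_states(alpha, N):
    return alpha ** N                  # Equation 3.1

def reduced_simulations(N, H):
    return N * (2 ** H)                # roughly O(N) when H << N

N, H, alpha = 1000, 2, 2               # e.g. 1k-instruction quanta over 1M instructions
print(len(str(exhaustive_states(alpha, N))) - 1)   # exhaustive: ~10^301 states
print(reduced_simulations(N, H))                   # merged: 4000 simulations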


Figure 3.4: The state space of HMPs broken down in orders of migrations.

3.2.3 Simulation Methodology for First-Order Migrations

While the above analysis can theoretically help approximate the generation of the full state space with accuracy, the methodology is impractical for two reasons. First, in order to reduce the state space, systems have to have private caches, as mentioned above. Second, the behaviour of the workload with respect to how it accesses the caches needs to be known beforehand in order to safely merge states for a given granularity of migrations. These prerequisites make the full state exploration with the simplifications described above inflexible in terms of system design. For this reason, a different, more versatile approach is needed to reason about scheduling decisions in HMP systems.

Instead of exploring all the possible states, a simpler approximation can be used with limited complexity but more universal application, which can still produce useful results. In Figure 3.4 the states are categorised by their distance from a migration that has been performed before their execution. Depending on how much complexity can be afforded in the analysis, the number of migrations explored can be selected.

Hence, to avoid exploring all the states, Figure 3.5 shows a methodology that only explores edges from Figure 3.2 and their immediate spawns that migrate. For the remainder of the thesis this subset of states is referred to as first-order migrations, as the system is not explored beyond one migration.


Figure 3.5: For first-order migrations, the faded states are not explored. This can greatly reduce the state space, albeit reducing the scope and accuracy of possible experimentation.

While this limits the scope of possible experiments, the results extracted are based on assumptions that are easy to validate. Using this methodology, it is possible to perform a direct comparison of executions for each phase, between big and little cores. This can be especially insightful when trying to identify how fast migrations can occur and to single out components that contribute more to the migration overheads. Additionally, even though not exhaustive, an approximation can be made about which phases of workloads can be migrated to the opposite core to improve performance or save energy.

The entire process is implemented in gem5, a full system simulator, using the Arm ISA (Binkert et al., 2011). The methodology directly compares the execution of systems with and without migrations by forking the simulator and creating two identical simulations, the main or parent and the forked or child simulation. After the simulator splits, the main simulation continues unperturbed, while the fork performs a migration. The design is shown in Figure 3.6, where the uppercase M refers to the original run while the lowercase m symbolises a migration. The basic concept of the methodology is to create an identical copy of a simulation and alter some properties in the newly spawned simulation.

The benefits of the methodology are not restricted to heterogeneous systems, as it can be generalised to cover a broader set of problems. By simulating migrations between homogeneous cores, it is possible to isolate the warm-up behaviour of specific components, which are identical between the cores. This is possible because, in the simulator, one can control how much of the cores is shared and therefore analyse the effect of each component separately.

Switching the core in the forked simulation could be achieved in multiple ways to enable different observations. One way to do it would be to signal the OS in the child to issue a migration.


Figure 3.6: A visual representation of the forking methodology. The main simulation spawns periodic forks that perform a migration.

The OS can either trigger the migration itself, or the forked simulation can cause the hardware to raise a signal that a migration must be performed. Both of these methods, however, would add the OS migration overhead.

From an architecture perspective, the goal of this methodology is to avoid the software overheads of context switching altogether, as those mask any hardware effects. This is because migrations burden the system with instructions, on the order of millions or more, as transferring the state and rescheduling the task is a complex process.

For this reason, the simulation implementation features a single-core system, where forked simulations “hot-swap” the core, leaving the caches and memory intact and keeping the OS and the software completely agnostic to the entire process. The hot-swap acts like a migration to a new core, except that when execution is resumed the internal components (e.g. pipelines, branch predictors etc.) can be configured to start empty or with information from the swapped-out core.

To add versatility and flexibility to the methodology, the memory subsystem can be manipulated post migration to emulate several scenarios which are considered useful. This allows for different levels of sharing to be examined and directly compared to determine the best possible setup.

As the caches are separated from the core in gem5, to emulate sharing the only thing that is necessary is to make sure that, during the hot-swap process, they are not flushed. The TLB, on the other hand, is tied to the core design; hence, to emulate sharing, the same TLB must be rewired and connected to both cores. When a hot-swap is performed, the system drains in-flight messages, flushes the operating core and afterwards swaps it with the inactive one.

After the fork swaps the core designs, depending on the level of sharing explored, a write-back and invalidate is triggered for the components considered private. The newly active core resumes operation with those structures empty and proceeds to warm them up before assuming steady-state performance, which emulates the effects of a hardware migration.


Figure 3.7: The implementation with which the gem5 simulator “hot-swaps” the cores to emulate a system with varying shared resources.

On the software level the system seamlessly continues to operate, completely agnostic of the swap. This can be accomplished because the gem5 simulator can switch the micro-architecture of components in the system, such as the core design, without altering the architectural state, such as the register values. This is exploited to single out the specific hardware component overheads during migrations. A graphical representation of the methodology infrastructure can be seen in Figure 3.7.

One important detail to note is that, irrespective of the sharing scheme, the register file is always considered to be shared, as the simulator does not account for the cost of transferring it. Several hybrid core proposals claim that the state transfer is possible with negligible overhead (Forbes et al., 2016; Rotenberg et al., 2013; Lukefahr et al., 2012). While this is not considered realistic for actual contemporary hardware, it helps to draw useful conclusions. First, it helps assess a best-case scenario when aggressively sharing components and compare the results to the many studies that claim fine-grain migrations using merged designs are feasible and beneficial (Fallin et al., 2014; Lukefahr et al., 2012, 2016, 2014; Padmanabha et al., 2015; Forbes et al., 2016; Rotenberg et al., 2013). Second, it allows a cleaner observation by singling out the effects of specific components, and provides insight that could otherwise be masked.

3.3 Heterogeneous Multiprocessor Organisation

Having established a mechanism to assess and compare HMP results, it is worth taking a deeper look into migrations and identifying the memory components that provide realistic benefits when shared. In the two previous chapters it was shown that the reasoning behind heterogeneous systems stems from the premise that each workload can be executed by a more suitable, designated core. However, transferring costs can potentially mask any positive effect, so it is crucial to minimise them as much as possible.

From the above, it is clear that migrations have a pivotal role in the design and management of heterogeneous systems, and should naturally be thoroughly researched. The remainder of the chapter explores shared memory schemes that aim to minimise the migration costs. The aim is to quantify the migration overhead and calculate the expected benefits assuming a perfect migration policy.

(a) No sharing. (b) Share caches. (c) Share caches and TLB. (d) Share L2 and TLB.

Figure 3.8: Different types of sharing schemes examined for HMP systems.

3.3.1 HMP Resource Sharing

HMPs heavily rely on switching between different core types to reach certain power/performance points and tailor to a diverse set of workloads and environment restrictions, like battery constraints for instance. Consequently, the level of sharing in a system significantly impacts the ability to perform a migration between cores without significant performance degradation. Even though sharing can notably reduce costs when migrating, it adds complexity, area and latency that can hurt the performance of the cores.

Based on this, conventional design mandates that heterogeneous cores usually share at most the LLC or the main memory. To achieve high frequency switching, recent academic trends propose radical approaches that bring cores closer together, usually by sharing the Level 1 Instruction Cache (IL1), the Level 1 Data Cache (DL1), the L2 cache and in some cases even some internal components such as the pipeline and the branch predictor (Khubaib, 2014; Lukefahr et al., 2012, 2014; Padmanabha et al., 2015; Lukefahr et al., 2016; Rotenberg et al., 2013; Forbes et al., 2016). The feasibility of such designs however still remains questionable, as some parts cannot be shared between drastically different designs, while others (the IL1 and DL1 caches) cannot be shared without lowering performance. A more realistic approach, instead of opting for full or no sharing, is to target the features cores can easily share to increase performance when migrating. With this in mind, four different sharing schemes are considered in the evaluation.

Scenario A: No Sharing

The first case, analysed as a baseline, is when cores have private caches and share no other components apart from main memory (Figure 3.8a). Here every migration causes the system to transfer its register file from the retiring to the incoming core before resuming normal operation. In addition the system also has to warm up the empty or “cold” caches of the new core before it can achieve its maximum performance. The overheads for the no sharing ($T_{private}$) case can be broken down as:

\[
T_{private} = \underbrace{T_{pip} + T_{BP}}_{\text{non-shareable}} + \overbrace{T_{reg} + T_{TLB} + T_{\$1} + T_{\$2}}^{\text{tied to core}} \tag{3.2}
\]

where $T_{pip}$ and $T_{BP}$ are the pipeline and branch predictor warm-up penalties respectively, $T_{reg}$ is the overhead of transferring the register file, $T_{TLB}$ is the extra cost of warming up the TLB, and $T_{\$1}$ and $T_{\$2}$ are the penalties induced when starting with cold L1 and L2 caches respectively.

Scenario B: Share Caches

The next level of sharing is one where the two cores partly or fully share the caches (Figure 3.8b). This can potentially reduce the warm-up time after a switch. The shared cache overhead is described as:

\[
T_{cache} = T_{pip} + T_{BP} + T_{reg} + T_{TLB} \tag{3.3}
\]

However, when workloads have a very small memory footprint, or thrash through data in the cache, the benefits of sharing are limited. Another potential problem with this sharing scheme is that even if caches contain useful data post-migration, empty TLBs can penalise the system significantly.

As systems today commonly operate in a virtual address space, a translation mechanism is required to convert the locations to actual physical addresses.


Figure 3.9: Generalized view of a stage of address translation (Arm Ltd., 2017a).

Nested page tables are used to map a Virtual Address (VA) to a physical one. The process of translating a VA to a Physical Address (PA) is done by a page table walk, which goes through all the page table levels sequentially. However, the walk is slow, as it requires multiple memory loads. If the processor had to do this look-up every time an instruction accessed memory it would cause a substantial slowdown, especially if the data was not cached.

To reduce the latency of such an operation, the TLB acts as a dedicated cache for this look-up. The TLB stores only a few entries, where each TLB entry contains both a VA and its corresponding PA, bypassing the need for a page table walk to take place.

If an instruction asks the processor to do some memory operation on a VA, the processor first checks to see whether the TLB contains an entry for that VA. If it does, then that is a TLB hit for the look-up, and since the TLB entry also contains the PA, the processor immediately knows what physical address to use. In this case, the TLB hit latency is only one cycle.

In the case of a TLB miss however, the processor has to laboriously do the virtual-to-physical conversion through a page table walk, shown in Figure 3.9. In the worst case, if all page tables reside in memory, this can lead to multiple main memory reads, as many as the table levels. The model used for the experiments assumes four levels and a maximum latency of 600 cycles to resolve a TLB miss. Once the processor finishes doing that conversion, it adds an entry to the TLB so that future conversions of that virtual address will not have to pay the same penalty (Randal E. Bryant, 2015). Clearly, TLB misses can be a major contributor to system slowdown if they happen frequently, despite the TLB's relatively small access latency. For a sense of scale, Table 3.1 shows the access latency of all the components considered for sharing, measured in cycles.
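To make the cost of a miss concrete, the look-up can be modelled with a toy Python cost function using the latencies assumed in this chapter (Table 3.1) and a four-level walk; the parameters, and the assumption that each walk level costs either an L2 hit or a DRAM access, are illustrative simplifications rather than the exact gem5 model.

    TLB_HIT_CYCLES = 1
    L2_HIT_CYCLES = 14
    DRAM_CYCLES = 150
    WALK_LEVELS = 4

    def translation_cost(tlb, vpn, levels_cached_in_l2=0):
        """Return the cycle cost of translating virtual page number 'vpn'.
        levels_cached_in_l2 is how many page table levels hit in the L2."""
        if vpn in tlb:
            return TLB_HIT_CYCLES
        # TLB miss: walk every level; levels absent from the caches go to DRAM.
        walk = (levels_cached_in_l2 * L2_HIT_CYCLES +
                (WALK_LEVELS - levels_cached_in_l2) * DRAM_CYCLES)
        tlb.add(vpn)  # fill the TLB so the next access to this page hits
        return TLB_HIT_CYCLES + walk

    tlb = set()
    print(translation_cost(tlb, 0x1234))     # cold miss, all levels in DRAM: ~600 cycles
    print(translation_cost(tlb, 0x1234))     # subsequent hit: 1 cycle
    print(translation_cost(tlb, 0x5678, 3))  # partial walk: 3 L2 hits + 1 DRAM access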

Scenario C: Share Caches and TLB

The full sharing scheme (Figure 3.8c) assumes that both caches and TLBs are shared among heterogeneous cores, but still needs to warm up the pipeline, branch predictors and the register file.

Component      Latency (Cycles)
TLB            1
IL1$           4
DL1$           5
L2$            14
Main memory    150

Table 3.1: Latency for TLBs, caches and memory in-line with the conducted simulations.

Sharing the register file is too complex design-wise, while its relatively small size allows us to ignore the penalty, especially when compared to warming the branch predictors. The pipeline and branch predictors cannot be shared between different core types, as functionally they are very different components; for example, the state of the speculative, reordering, 4-wide and deeper pipeline of the OoO core cannot be mapped onto a 2-wide InO pipeline with half the depth.

For these reasons, the overhead for full sharing becomes:

\[
T_{share\,all} = T_{pip} + T_{BP} + T_{reg} \tag{3.4}
\]

Scenario D: Share L2 Cache and TLB

The issue with full system sharing is that, while the total slowdown of a system is minimised after a migration, sharing the level 1 caches and the TLBs between the cores adds extra latency that can dramatically reduce the overall performance. This is because components like the IL1 cache are very sensitive to latency (≈ 4 cycles per hit) and even a single extra cycle of latency can lead to a 25% performance decrease, as it is constantly being accessed. Sharing the L1 has this effect because, in order to connect two cores to the same cache, extra capacitance from wires is introduced and switching logic (e.g. a multiplexer) is added, both of which add latency.

For this reason an alternate architecture is studied that shares the L2 cache and the TLB (Figure 3.8d).

As the L1 caches will not be shared, the overhead in this case becomes:

\[
T_{mixed} = T_{pip} + T_{BP} + T_{reg} + T_{\$1} \tag{3.5}
\]

Sharing a TLB between cores is not normally considered practical, as the translations, similar to accesses to the IL1 caches, are in the critical path and therefore latency sensitive (Bhattacharjee and Martonosi, 2009b). While this is true, many recent proposals show this can be effectively addressed. Prefetching (Lustig et al., 2013; Pham et al., 2014, 2012) and the use of hierarchical TLBs that share the last level tables (Bhattacharjee and Martonosi, 2009a; Bhattacharjee et al., 2011; Bhattacharjee and Martonosi, 2010) have been shown to be effective in masking the added latency of shared TLBs without losing performance. For this reason, the assessed memory system assumes a monolithic shared TLB.

One important distinction to make is that the latency of each component is not equivalent to, but related to, the overhead values described above. For instance, using Table 3.1 it can be seen that a miss in the L1 is relatively low impact, but retrieving the desired data that might reside only in main memory will end up taking hundreds of cycles. As such, while the overheads can be decomposed, it is impossible to set an upper bound on them, as the worst case depends on the type of workload.

3.3.2 Internal Core Components

While the sharing scheme plays a large part in determining system performance and energy efficiency, the different functionality of cores in heterogeneous systems also affects system behaviour. This is because, despite the fact that there is an inherent penalty to switching, migrating to a more suitable core can amortise the cost and boost performance compared to remaining on a less compatible core (Venkat and Tullsen, 2014; Tomusk et al., 2015; Sankaralingam et al., 2003). For this reason, it is important that a heterogeneous system comprises cores that cover a wide range of applications across a variety of energy/performance points.

For the experimentation, a conventional HMP system like the Arm big.LITTLE is considered as the base design for the simulations (Arm Ltd., 2013). One specific aspect of cores that largely affects the performance and overall hardware overhead is the pipeline. The OoO core is a 3-way design with a deep pipeline, designed to deal with stalls when reading from memory through parallel execution and instruction reordering. On the other hand, the InO core is a 2-way design with significantly fewer pipeline stages, incapable of reordering and fully exploiting ILP.

The deeper and wider the pipeline, the more cycles needed to warm it up. Furthermore, in order to approach steady state performance, the OoO core also needs to fill the re-order buffer and the branch predictor. Both these structures are vital for the cores and, perhaps more importantly, can significantly hinder performance when they are not warmed-up.

The simpler branch predictor and the lack of a reorder buffer of the InO core, while limiting the ability to parallelise instructions and mask memory requests, make it much faster to assume steady performance. Additionally, the simplicity in functionality allows the core’s design to be much smaller and consume less energy. Overall comparisons between the two cores show that the InO core is 3x more power efficient than the OoO core, when clocked at the same frequency (Lukefahr et al., 2012).

3.3.3 Comparing Parallel Simulations

To explore migrations, a main simulation is created that runs on the same core from beginning to end without performing any migrations. After a warm-up interval, the main simulation starts spawning forks periodically until the workload terminates. The period for each sample is measured in instructions instead of time. Using time to determine the sample period can cause the parallel simulations to diverge when there is a difference in performance, as the simulation with the higher IPC will execute more instructions in the allotted time. Sampling with instructions solves this as, irrespective of the performance or time they take, both runs will end up executing the same code.

The forked simulations examine first-order migrations, which effectively run only for a single period and then terminate. This way, for every period running on the main simulation, there is an equivalent simulation that examines how a system would be affected by a migration at that specific phase. The termination of the forked simulations after one period restricts the methodology to only one migration; however, expanding to more migrations would cause the state space to become too large to compute.

Two separate experiments are devised using the methodology, each answering a different question:

• Migration Cost: Taking into consideration the different sharing schemes, what are the warm-up costs for each core?

• Efficient Migrations: What is the best setup (sharing scheme, switching frequency) for HMPs?

To answer the former question, a migration always switches to the same core type, from OoO to OoO and likewise from InO to InO, for all the sharing schemes described in Section 3.3.1. The migration frequency limitations for each core can be calculated by normalising the IPC of the migrations with the corresponding main simulations, averaged over all samples. Applying the arithmetic mean in this case generates a valid result, as all samples run for the same number of cycles. Consequently the fraction numerators can be immediately added, while their denominators will be simplified out of the formula:

\[
\text{Relative Performance} = \frac{\sum \frac{IPC_{mig.}}{IPC_{main}}}{N}, \qquad N \text{ is the number of samples.} \tag{3.6}
\]

To determine what the optimal setup is for HMPs, another experiment is devised that switches the core type, from OoO to InO and vice versa, in the migratory simulations. The IPC comparison for this experiment reveals the performance disparity between the two cores in each phase. For this experiment, two migration policies are used to determine beneficial switches, the no trade-off policy and the EDP policy.

The first is when migrating to the opposite core reduces power or execution delay without sacrificing performance or energy respectively, which can be useful in situations requiring high responsiveness or when energy is a constrained resource. The second policy examines switches that lead to equal or better EDP, which penalises performance loss more severely and is thus expected to favour the OoO core.

To determine whether a migration is beneficial, a threshold is used. The threshold is different for the two policies, as EDP values delay, or in other words performance, more than the purely energy constrained approach. This is quantified by calculating the percentage of the workload for which a migration is beneficial:

\[
\text{Capacity} = \frac{\sum_{i=1}^{N} \left[ \frac{IPC_{mig.}}{IPC_{main}} \geq \theta \right]}{N} \tag{3.7}
\]

The [...] are Iverson brackets (Graham et al., 1989): [P] is defined to be 1 if P is true, and 0 if it is false. In the above formula only the migrations that are better than the main run are accumulated, either by performing as well with less energy or by outperforming it for at most the same consumption. θ is the threshold value for a given policy.
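The two metrics can be computed directly from the per-sample IPC values; the following sketch shows the aggregation of Equations 3.6 and 3.7, with the IPC lists being illustrative inputs rather than measured data.

    from math import sqrt

    def relative_performance(ipc_mig, ipc_main):
        """Equation 3.6: arithmetic mean of the per-sample IPC ratios."""
        ratios = [m / b for m, b in zip(ipc_mig, ipc_main)]
        return sum(ratios) / len(ratios)

    def capacity(ipc_mig, ipc_main, theta):
        """Equation 3.7: fraction of samples whose IPC ratio meets the threshold."""
        hits = sum(1 for m, b in zip(ipc_mig, ipc_main) if m / b >= theta)
        return hits / len(ipc_mig)

    ipc_mig = [1.2, 0.9, 1.6, 0.8]   # hypothetical IPC of the migrated runs
    ipc_main = [1.0, 1.0, 1.0, 1.0]  # hypothetical IPC of the main run

    print(relative_performance(ipc_mig, ipc_main))      # 1.125
    print(capacity(ipc_mig, ipc_main, theta=sqrt(3)))   # EDP policy, InO to OoO
    print(capacity(ipc_mig, ipc_main, theta=1.0))       # no trade-off, OoO to InO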

For both experiments the methodology is repeated for different forking periods ranging from fine to coarse granularity (1k, 10k, 100k, 1M and 10M instruction periods). As such, the finest granularity would spawn an impractically large number of forks to cover the same region of execution. Instead, to reduce the data size while still maintaining valid results, samples are randomly chosen using a uniform distribution for each simulation and are limited to 10,000 samples.

3.4 Experimental Setup

To implement the proposed methodology, the gem5 simulator (Binkert et al., 2011) is used and the forking mechanism, described by Sandberg et al. (2015), is modified to be able to fork on instruction counts and switch the cores. The experimental setup uses a heterogeneous system with models resembling a big OoO and a little InO core, both operating at the same frequency. The simulated system uses the same caches for both the OoO and the InO cores so that a fair comparison between private and shared caches is drawn. The simulator is restored from a previous checkpoint that bypasses the boot-up and, for each benchmark, is allowed to run until no warm-up transient effects occur. A detailed description of the system configuration is shown in Table 3.2.

The sharing architecture simplifications and benchmarks have been selected to represent the best possible case for fine-grain migrations. While most sharing schemes are not feasible with today’s implementation techniques, the hypothetical set-ups aim to answer whether it makes sense to pursue fine-grain migrations in HMP systems, regardless of current physical limitations.

3.4.1 Infrastructure

To implement the above methodology and set the basis for all future research, a simulation environment was needed to conduct all of the experiments. The reason for deciding to use a simulator over real hardware is that a simulator can conduct experiments on multiple setups with relative configuration ease. Furthermore, with a simulator it is easy to probe parts of the system which are impossible to access in hardware, while values extracted in simulation are noise-free.

OoO Core             Value
Pipeline Width       3-wide (integer) / 2-wide (fl. point)
Pipeline Depth       15
ROB Size             60
Branch Predictor     Two-Level
Operating Frequency  1 GHz

In-Order Core        Value
Pipeline Width       2-wide (integer) / 1 fl. point unit
Pipeline Depth       8
Branch Predictor     Conditional + Indirect
Operating Frequency  1 GHz

Shared Resources     Value
L1D Size             32kB
L1I Size             32kB
L1 Associativity     2-way, 4-5 cycle
L2 Size              1MB
L2 Associativity     16-way, 14 cycle
L2 Prefetcher        Stride

Table 3.2: The specifications of the system used in the experiments in this chapter.

On the other hand, simulators are often not a perfect representation of the simulated hardware. This happens because simulating hardware with perfect accuracy is usually much slower than running on the native device. Hence, various simulators exist, ranging from the very fast but inaccurate (Sanchez and Kozyrakis, 2013) all the way to ones that have sacrificed speed for accuracy (Binkert et al., 2011; Sandberg et al., 2015).

For all the experiments conducted, gem5 (Binkert et al., 2011) is used, which is a cycle-accurate full-system simulator that has widespread use. The reasons gem5 is preferable to other simulators are manifold. Firstly, the simulator enables cycle-level simulation, which is necessary for the experimentation proposed. Secondly, gem5 has the ability to fork the simulator and create an identical simulation instance that can then be modified. This is vital for state space exploration as it allows running parallel scenarios of execution on the exact same workload instance, offering the ability to perform direct comparisons.

The simulations run a minimal flavor of Linux and are evaluated with most of the MiBench suite, a set of micro-benchmarks that features various small workloads ranging from encryption and security to telecommunications and network algorithms that run for approximately 10 billion instructions (Guthaus et al., 2001). These micro-benchmarks have very different characteristics and demonstrate high variation in branch, memory, and integer ALU operations. At the same time, the used datasets can fit in caches most of the time, which should work further in favour of the sharing architectures. The small workload size favours the forking experimentation, providing a best-case assessment for systems with shared memory. Furthermore, while the workloads are relatively small in size, they are representative mobile workloads. The benchmarks are presented below, while some are omitted due to incompatibility with the simulated setup.

Automotive and Industrial Benchmarks

The first benchmarks presented are classified as industrial benchmarks. They are workloads meant for use in embedded control systems.

• basicmath: performs simple mathematical calculations that do not have specialised hardware support. For instance, angle conversions from degrees to radians and vice versa, or calculating integer square roots.

• bitcount: tests the bit manipulation abilities of a processor using an array of integers. To achieve better coverage it uses five distinct methods of bit manipulation.

• qsort: is a popular sorting algorithm operating on a large data set of words.

• susan: is an image recognition package, used for detecting corners and edges.

Network

Network benchmarks are common workloads found in embedded processors for networking. The popular implementations are shortest path calculations, table look-ups etc.

• dijkstra: is a famous shortest path algorithm of complexity O(N²). The benchmark creates a large graph in an adjacency matrix, which is then used to calculate shortest paths by repeatedly looping Dijkstra's algorithm.

• patricia: is a sparse leaf trie data structure benchmark, commonly used in routing tables. The input of this benchmark is a list of Internet Protocol (IP) traffic from a busy web server over the span of 2 hours.

Security

This category of benchmarks focuses primarily on encryption, decryption and hashing. As security is becoming an ever increasing necessity in computer systems, MiBench includes several popular algorithms used in embedded processors.

• rijndael: is the algorithm behind the Advanced Encryption Standard. It uses a block cipher of 128-, 192- or 256-bit keys and blocks.

Consumer

Consumer device benchmarks are based on workloads popular in modern devices like scanners, cameras etc. The primary focus in this category are multimedia applications such as image and music encoding/decoding and basic filtering.

• jpeg: is a micro-benchmark performing the encoding and decoding of the respective popular image standard. It is selected as a representative example of image compression and decompression that is commonly used in embedded applications.

Office

This category contains algorithms for office applications like text manipulation, for printers, fax machines and word processors.

• stringsearch: searches for given words in phrases using a case insensitive comparison algorithm.

Telecommunications

Telecommunication applications are crucial in modern embedded systems, especially in the ever-growing mobile market. Hence, this category focuses on mobile device applications such as voice encoding/decoding, frequency analysis and checksumming data.

• FFT: performs a fast Fourier transform and its inverse on an array of data. This algorithm has been a cornerstone of digital signal processing and is used to find the inherent frequencies in an input signal.

• adpcm: stands for adaptive differential pulse code modulation and is a variation of the popular pulse code modulation used to digitally represent sampled analog signals.

3.5 Migration Cost

The first set of experiments compares the execution of OoO and InO cores that migrate to identical core designs against runs that proceed uninterrupted. This experiment is designed to measure the migration overhead for the system under each of the memory sharing schemes described above and, from that, deduce the cost associated with each memory component.

(a) InO performance with no sharing.
(b) OoO performance with no sharing.

Figure 3.10: Migration cost with no sharing.

3.5.1 Scenario A: No Sharing

In Figures 3.10a and 3.10b the performance of the migrating runs is presented, normalised to the main simulation's performance for every sample. This effectively allows for a fair comparison between a system performing a migration with private caches and TLBs and one that does not migrate at all. The performance drop at fine granularity suggests that, with private caches, fast switching is not feasible as the overheads dominate. This represents fairly accurately conventional architectures used today, and shows why migrations occur at granularities of millions of instructions, even without considering the considerable software overhead.

Across almost all workloads the behaviour is similar, showing significant performance degradation especially when performing migrations at a high frequency. One exception to this trend is bitcounts, which does not degrade dramatically, especially at migration periods larger than 1k instructions. This behaviour is mostly attributed to the fact that the specific workload stresses the arithmetic operations and not the memory system or the branch predictor. Comparing the OoO to the InO core, the performance degradation is similar, both on a per workload assessment and consequently also on average.

(a) InO performance sharing the caches.
(b) OoO performance sharing the caches.

Figure 3.11: Migration cost with cache sharing.

3.5.2 Scenario B: Share Caches

When sharing all the caches but not the TLB, the performance of the migratory systems visibly increases. One thing to note in Figures 3.11a and 3.11b is that the performance improvement of the InO core is slightly larger than that of the OoO one, particularly at the 1k migration period. The deeper pipeline and larger misspeculation penalties seem to be causing a greater negative effect on the OoO core. Smaller sized workloads such as dijkstra show the most improvement, as their datasets fit in the caches, mitigating much of the migration cost.

Overall, the performance does not justify this complex design, especially for fine-grain switching. Part of the performance loss is due to misspeculation caused by “cold” structures that do not retain state, i.e. the branch predictor. More interestingly though, this is also due to the fact that, even though the caches are shared, the mandatory TLB misses cause the system to potentially access main memory multiple times, which adds large penalties.

(a) InO performance sharing caches and the TLB.
(b) OoO performance sharing caches and the TLB.

Figure 3.12: Migration cost when sharing caches and TLB.

3.5.3 Scenario C: Share Caches and TLB

When the system shares all memory components (caches and TLB) another interesting result is revealed in Figure 3.12. While the InO core eliminates almost all of the overhead even at the finest migration frequency, the OoO design still exhibits significant performance losses. This exposes the inability of the OoO core to achieve its maximum potential, as it relies on accurate speculation to achieve high IPC. When the OoO mispredicts it is heavily penalised as the deeper pipeline gets flushed. This result provides novel insight, showing that previous best case studies that report feasible 1k instruction period migrations use an oversimplified system that does not accurately factor in the slowdown caused by the internal core components.

Studies from Section 2.2.1 by Fallin et al. (2014) and Lukefahr et al. (2012) that claim that fine-grained migrations are feasible are disproved by the results in Figure 3.12. The above results show that accurately factoring in migration costs yields slowdown which cannot be considered negligible. Furthermore, sharing the L1 caches is not pragmatic, as the additional delay would hinder the system performance severely. This happens because every instruction, which unavoidably accesses the IL1, adds extra cycles to the execution. This effect is exacerbated in the InO core, which is unable to mask the memory access latency through the re-ordering of instructions.

(a) InO performance sharing the L2 and the TLB.
(b) OoO performance sharing the L2 and the TLB.

Figure 3.13: Migration cost when sharing L2 and TLB.

3.5.4 Scenario D: Share the L2 Cache and TLB

A more realistic design is shown in Figure 3.13. When sharing the L2 cache and the TLB but keeping the L1 caches private to each core, the results of the InO core show similar behaviour to the previous sharing scheme. When compared to Scenario C, this design shows a decrease in performance for the InO core for migrations with periods smaller than or equal to 10k instructions, but similar behaviour at larger periods. The OoO core exhibits almost identical behaviour except for super fine grain migrations, where the performance degradation is less than 10% more.

This behaviour is due to the TLB sharing, which helps to avoid accessing main memory to fetch translation data, taking care of the worst case when no translations are stored in the caches. When migrating to a new core with a private TLB, the first translation is guaranteed to miss and trigger a walk. Some of the partial translations will be stored in the shared caches, but it is highly likely that the upper levels will miss and access Dynamic Random-Access Memory (DRAM). This causes significantly more slowdown (see Table 3.1) than an L1 miss, especially for the OoO core. As mentioned earlier, while the TLB is also a latency sensitive component, optimisations like a shared second level TLB as well as translation prefetching are shown to be practical in Bhattacharjee and Martonosi (2009a), Bhattacharjee and Martonosi (2010) and Bhattacharjee et al. (2011).

(a) Performance degradation for the InO core.
(b) Performance degradation for the OoO core.

Figure 3.14: The average performance degradation for the two core types as a function of migration frequency.

3.5.5 Comparison of Memory Sharing Schemes

In Figure 3.14 the performance gap between the proposed level of sharing and the full sharing scheme can be seen, which is 17 percentage points for the OoO and 30 percentage points for the InO core. However, the design complexity of routing both cores to all the caches and the TLB is prohibitive. More importantly though, when migrating with a 100k instruction or greater period, sharing the L2 cache and at least some of the translations performs as well as full sharing without the added complexity. For HMP systems pairing OoO and InO cores, it makes more sense to have a simpler design sharing just the L2 and a TLB. Sharing the L1 caches adds more complexity, but does not allow the system to migrate any faster than every 100k instructions, as the OoO overheads diminish any benefits beyond that point. To push beyond this point the speculation needs to be common between cores, which is not feasible in any current hardware.

Appendix A includes figures from a similar study using Android and the Octane benchmark suite (Google Developers, 2013). Unlike MiBench, Octane features workloads with a larger memory footprint, which helps show the effects of memory sharing. The results show similar trends to the micro-benchmarks presented above, further reinforcing the point that migrations as fine as 1000 instructions are not feasible due to prohibitive design complexity.

3.6 Efficient Migrations

As mentioned in Section 3.3.3, the critical question for HMP systems is whether migrating from one core to another yields any benefits. Based on Equation 3.7, it is possible to find how much of each workload can be migrated to the opposite core, using the two policies mentioned in Section 3.3.3 and adjusting the threshold value θ accordingly. This analysis assumes the InO core is 3x more energy efficient, an estimate in line with current HMP systems (Lukefahr et al., 2012). The results are presented in Figures 3.15 and 3.16, showing only the two extreme cases of sharing.

When using the no trade-off policy, a system on the InO core will migrate only when executing the same phase on the OoO core costs equal or less energy. For this to happen the OoO core must perform 3x better than the InO one, so the threshold is adjusted to θ = 3. In Figure 3.15a irrespective of how much hardware the cores share, the cases where this is beneficial are less than 1% on average and at most 10% of the total execution, for specific workloads.

When considering the switch from OoO to InO with the no trade-off policy, the samples are considered beneficial when the InO core performs as well as or better than the OoO core (θ = 1). This results in the same overall performance for only a third of the power. This case can happen when the OoO core cannot exploit any ILP or MLP and is reduced to InO performance. Figure 3.15b shows that, similarly to the previous result, the beneficial cases are at most 1% on average and 6% of the total execution for a few benchmarks.


(a) Beneficial migrations from the InO to the OoO core with the No trade-off policy.


(b) Beneficial migrations from the OoO to the InO core with the No trade-off policy.

Figure 3.15: Percentage of beneficial migrations when being energy constrained.

While these cases are few, the no trade-off policy reveals that the benefits can mostly be exploited at migration frequencies of 10k and 100k instructions. At a coarser granularity the phases average out, while migrating every 1k instructions yields no benefits as the overheads mask micro-phases.

For equal or better EDP (Figure 3.16), the formula values performance more, as $EDP = Power \times Delay^2$. The number of beneficial switches when migrating from the InO to the OoO core for this policy is calculated using a threshold value of $\theta = \sqrt{3}$. The results show that the potential for switching while maintaining equal or better EDP in this case ranges between 48% and 66% when migrating every 100k, 1M and 10M instructions. When performing the opposite switch with the EDP policy ($\theta = 1/\sqrt{3}$), the results show that, when aiming for the best EDP, the OoO core is more suitable, as the beneficial cases on average over all the workloads do not exceed 31% of the execution.
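The threshold values follow directly from the 3x efficiency assumption; a short derivation, shown here as a sketch under that assumption for the InO-to-OoO switch:

\[
EDP_{OoO} \leq EDP_{InO}
\;\Longleftrightarrow\;
P_{OoO} \, D_{OoO}^{2} \leq P_{InO} \, D_{InO}^{2}
\;\Longleftrightarrow\;
3 P_{InO} \, D_{OoO}^{2} \leq P_{InO} \, D_{InO}^{2}
\;\Longleftrightarrow\;
\frac{D_{InO}}{D_{OoO}} \geq \sqrt{3}
\]

Since both runs execute the same number of instructions, $D_{InO}/D_{OoO} = IPC_{OoO}/IPC_{InO}$, which gives $\theta = \sqrt{3}$; the symmetric argument for the OoO-to-InO switch, where the InO core draws a third of the power, gives $\theta = 1/\sqrt{3}$.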

For both EDP experiments, migration frequencies above 100k instructions are less desirable, as the gains are significantly smaller.


(a) Beneficial migrations from the InO to the OoO core with equal or better EDP policy.


(b) Beneficial migrations from the OoO to the InO core with equal or better EDP policy.

Figure 3.16: Percentage of beneficial migrations when being EDP constrained.

Additionally, in most cases sharing the caches and the TLB is not necessary, as both sharing schemes show similar potential for useful migrations. Appendix A shows a similar study using Octane.

3.7 Discussion

In this chapter, a brief analysis of the differences between hardware and simulation oriented experimentation approaches is presented. Through this, it is shown that for heterogeneous exploration simulations are preferable, as HMP architectures are still not well understood and alternate designs need to be researched. Furthermore, simulations are more efficient at performing a state space exploration analysis when using mechanisms that replicate the simulator.

Furthermore a novel simulation methodology is presented, focusing on its ability to dissect certain effects that are sensitive to noise and indeterminacy, like the migratory effects for instance. Unlike hardware experimentation, the methodology is flexible enough to allow experiments for various HMP topologies with different degrees of resource sharing.

Having established a mechanism to assess and compare HMP results, it is possible to analyse system behaviour when migrating and identify the memory components that provide benefits when shared. In Chapter 1 it was shown that the reasoning behind heterogeneous systems stems from the premise that each workload can be executed by a more suitable, designated core. However, transferring costs can potentially mask any positive effect, so it is crucial to minimise them as much as possible.

Many studies have claimed that, when taking a closer look at execution, programs can be broken down into small parts of code that share similar characteristics, called phases (Lukefahr et al., 2012, 2014; Padmanabha and Lukefahr, 2013; Forbes et al., 2016). In fact, Lukefahr et al. (2016) and Fallin et al. (2014) claim that phases can be as small as a few hundred instructions and that, additionally, they can be exploited with a suitable HMP architecture.

Contrary to these academic results, experiments in this chapter show strong evidence that full sharing and potential merging of OoO and InO cores is not the most promising way to design future systems. Instead, an alternate memory hierarchy for HMPs is presented that tries to simplify the sharing scheme between cores. The system only shares the L2 cache and the TLB which can be a practical design as these components do not add prohibitive wiring complexity to the design and can be designed to mask critical latency. This memory hierarchy is part of the contributions of this thesis on HMP system optimisation.

Taking advantage of the capabilities of the methodology, the proposed sharing scheme was shown to perform as well as or better than the other examined sharing setups, for migrations as fine as 100k instructions. At finer migrations, while sharing all the resources yields slightly better results, the overheads mask any potential benefits and render the pursuit of such designs pointless.

The versatility of the novel methodology is demonstrated in the second set of experiments. These perform switches to different core designs and reveal that most benefits exist at the 100k-instruction migration period, as coarser grain switches do not detect the phases, while finer ones are overpowered by the overheads. The results show that even when sharing the whole memory subsystem the benefits under oracle scheduling are limited, yielding at most 7% performance and 4% energy savings. When considering acceptable trade-offs for the migration policy, an EDP policy shows that the OoO core is overall preferable to execute on. The in-depth look at the results reveals an asymmetry between the OoO and the InO cores. The latter shows high sensitivity to the information in the caches, while the former has significant losses at fine grain migrations despite sharing all the memory and the translations.

Chapter 4

Branch Predictor State Transfer for Heterogeneous Cores

As mentioned in Chapter 2, prior work investigates ways to mitigate the hardware overheads that occur when migrating between cores, prime examples being Lukefahr et al. (2012), Fallin et al. (2014) and Rotenberg et al. (2013). These proposals fuse caches, translation buffers, branch predictors and even the pipelines in order to eliminate the cost of switching between OoO and InO cores. In principle, low overhead switching and accurate scheduling in fused cores can enable applications to use the OoO core during phases that can deliver high performance, and swiftly switch to the InO core for low performance phases. This is claimed to save as much as 36% energy for 1% performance loss compared to a system running constantly on the high performance OoO core, effectively combining InO efficiency with OoO performance (Fallin et al., 2014; Lukefahr et al., 2012).

However, in practice, front-end fusing is complex and adds delay to latency sensitive components, like the L1 caches and the branch predictor. Furthermore, InO and OoO front-ends cater to different operation points. Hence, a shared front-end cannot satisfy both OoO performance and InO efficiency constraints. In the previous chapter, the analysis focused on streamlining the process of migrating between heterogeneous cores by identifying the memory components that can be shared without adding latency and prohibitive complexity to the system. One of the main observations is that, when sharing the caches and the TLB, the InO cores can recover almost all of the performance that is lost when migrating.

For OoO cores on the other hand, sharing the caches and the TLB, while improving the overall performance, still causes significant slowdown in the system. In Figure 4.1 three systems are compared that exemplify this point: one that executes on an OoO core uninterrupted, one that shares caches and TLBs during frequent migrations and one that additionally retains the branch predictor state. The execution excerpt indicates that, at higher frequencies of migrations, misspeculation that occurs after a migration is performed has a significant impact on performance, as OoO cores rely on branch prediction to expose instruction-level parallelism and execute further ahead to mask memory latency.

[Plot: Performance when Retaining Branch Predictor State — Instructions Per Cycle vs. Instruction Count (in billions), for No Migration, Migration w/o BP and Migration w/ BP.]

Figure 4.1: Execution excerpt of two identical OoO systems migrating every 1k instructions. One retains all internal components when migrating, the other loses the branch predictor state. Depending on the amount of instruction-level parallelism, performance can drop by as much as 35%. Results generated using gem5 (Binkert et al., 2011) running qsort from MiBench (Guthaus et al., 2001).

Another observation that can be made is that when retaining branch predictor state, the performance drop of a migratory system is practically eliminated. This indicates that other internal components, such as the pipeline, do not notably affect the system.

The focus of this chapter is the efficient retention of branch predictor state through migrations. The added latency overhead caused by sharing the branch predictor is prohibitive, much like for the L1 caches described in Chapter 3. The reason is that predictions need to happen without any delay early in the pipeline to avoid any execution stalls or “bubbles”. To share components, extra wires and multiplexers are usually needed, which increase capacitance and add extra cycles to the execution respectively.

Furthermore, the branch predictor size and accuracy requirements are different for large OoO and small InO cores. In order to meet those requirements perfectly, a tailored design is used for each core type. In most cases the disparity between the OoO and InO core Branch Predictors (BPs) is so great that one common design cannot cater to the needs of both core types. In fact, the microarchitectural implementations are usually so different that sharing common state is not feasible.

Instead, transferring compatible branch predictor state between cores could be feasible with a microarchitectural redesign. This solution does not add overhead during execution, as the state is transferred in parallel with other migration operations, but it can notably decrease “cold” mispredictions, ultimately improving overall performance. However, the redesign requires an in-depth understanding of the underlying designs in order to ensure state compatibility.

4.1 Branch Predictor Overview

Before introducing the mechanism that transfers branch predictor state between heterogeneous cores, a small introduction is presented of the fundamentals necessary for the comprehension of the design. As the field of branch prediction has evolved dramatically over the past three decades, this chapter focuses only on the designs that are commonly used today. Outdated designs that use only static predictions, such as Evers et al. (1996), Kampe et al. (2002) and Chang et al. (1996), are significantly less accurate and rarely used today. The initial dynamic branch predictor designs (Lee et al., 1984; Mcfarling, 1993; Chih-Chieh Lee et al., 2002; Smith, 2003; Yeh and Patt, 2003, 2004b,a; Young and Smith, 2004) introduced the concepts of local and global history and the use of combinations of subpredictors, but ultimately also deliver worse accuracy than modern designs, as they fail to extract long history patterns. The predictors assessed in this thesis are the bimodal, the TAGE and the perceptron, which are the designs that are predominantly used today (Mittal, 2019).

All predictors today are assessed based on a variety of metrics which aim at increasing system performance while adhering to necessary constraints. The most important are:

• Accuracy is perhaps the most important requirement for branch predictors. Designs use Mispredictions Per Kilo Instructions (MPKI) as a metric of the accuracy of each design. For OoO cores the misprediction penalty is very high and consequently their predictors need to be extremely accurate. Current state-of-the-art predictors of Jiménez (2014a) and Seznec (2014) experience as few as three to four MPKI on average.

• Latency constraints are important, as the BP lies on the critical path and needs to provide a prediction almost as soon as the branch address is known. Usually the latency is 1 cycle; however, complex designs with slow wires operating at a high clock frequency can make the delay larger (Seznec, 2005; Jiménez and Lin, 2002; Jimenez et al., 2000). To address high latency, some designs also provide less accurate preliminary predictions called micro- and nano-predictions (Loh, 2006; Jimenez, 2003).

• Area can be a limiting factor for BPs, as the storage needs of sophisticated designs can be in the order of tens of kilobytes (e.g., 32-64KB), as seen in the 5th Championship Branch Prediction (2016). This is the case for highly accurate predictors that require multiple different structures, large enough to avoid aliasing in their entries.

• Energy needs to also be factored in, as the BP is accessed almost every cycle. The circuitry is usually optimised for low latency at the expense of high energy consumption. Energy efficiency is therefore also important, especially for low power designs.

• Warmup is also an important consideration, as large predictors may take longer to reach their maximum accuracy compared to smaller predictors with less state to store. While in general this is not a significant consideration today, it can become a significant overhead in the presence of frequent context-switching and security exploits that require the state to be flushed. This is investigated in detail in Chapter 5.


Figure 4.2: The architecture of a bimodal predictor

4.1.1 The Bimodal Predictor

The bimodal predictor is one of the simplest designs used today. The idea behind it is to predict branches that exhibit consistently taken or not taken sequences. The design features a one-dimensional table of N entries, each entry containing a 2-bit saturating counter (Figure 4.2). The logic behind this design is simple; the table containing the predictions is addressed using $\log_2 N$ bits of the Program Counter (PC). The first bits can be omitted when the ISA has a fixed instruction length, as they do not contain much useful information. When a prediction is made, a decision is taken based on the most significant bit of the 2-bit value stored in the entry. If that bit is 1 the branch is predicted taken, while if it is 0 the branch is predicted not taken. When updating the bimodal predictor, depending on the actual outcome of the branch, a logical 1 or 0 is shifted from the Least Significant Bit (LSB) into the counter. A state diagram of the prediction and update state machines is shown in Figure 4.3.

Analysing the bimodal on the important metrics mentioned above shows that, as the design relies on a very short history specific to each branch, it accurately captures sequences that constantly have the same outcome. However, it fails to identify alternating taken/not-taken sequences, as it lacks a mechanism that identifies such patterns. Furthermore, the lack of a global history does not allow outcomes of other branches to be factored into the prediction, while implementations also often suffer from entry aliasing when PC bits from different branches point to the same entry in the table.

In terms of latency, the simplicity of the design makes it quite attractive. The single table structure guarantees that all predictions occur in a fixed amount of time and can be implemented so that each prediction returns within a single cycle. This single cycle access can be very important, as it prevents “bubbles” in the core's pipeline. The relatively small number of bits per entry also means that, in terms of area and energy consumption, the design can be considered efficient, especially when compared to more complex structures. One technique to further reduce the size of the predictor is to make the LSBs, commonly called the hysteresis bits, shared between

multiple entries. This has been shown in Loh et al. (2003) to make the predictor more dense without sacrificing virtually any performance.

Figure 4.3: The states of a bimodal predictor (11 and 10 predict taken; 01 and 00 predict not taken).
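The prediction and update logic can be summarised with a short sketch; the table size and the number of dropped alignment bits are illustrative, and the update follows the saturating counter transitions of Figure 4.3 rather than a literal bit shift:

    class BimodalPredictor:
        """Toy bimodal predictor: a table of 2-bit saturating counters
        indexed by the low-order bits of the PC (illustrative parameters)."""

        def __init__(self, entries=4096, inst_shift=2):
            self.entries = entries        # number of table entries (power of two)
            self.inst_shift = inst_shift  # drop alignment bits of a fixed-length ISA
            self.table = [2] * entries    # initialise to weakly taken (binary 10)

        def _index(self, pc):
            return (pc >> self.inst_shift) & (self.entries - 1)

        def predict(self, pc):
            # The most significant bit of the counter gives the prediction.
            return (self.table[self._index(pc)] >> 1) == 1

        def update(self, pc, taken):
            # Saturating increment/decrement towards the observed outcome.
            i = self._index(pc)
            if taken:
                self.table[i] = min(self.table[i] + 1, 3)
            else:
                self.table[i] = max(self.table[i] - 1, 0)

    bp = BimodalPredictor()
    for outcome in (True, True, False, True):  # hypothetical outcomes of one branch
        print(bp.predict(0x4000), outcome)
        bp.update(0x4000, outcome)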

4.1.2 The TAGE Predictor

The TAGE predictor is based on prior work using geometric history lengths in order to address multiple tables (Seznec, 2005). This design uses multiple tables that are indexed using a hash of the PC and part of the global history (H). The history length used for indexing each table is different and follows a geometric series. The lengths are calculated using:

L(i) = \lceil a^{i-1} \times L(1) + 0.5 \rceil

where L is the length of the global history, i is the table identifier and a is the geometric factor. For predictions, TAGE performs a look-up in all the tables and selects the prediction with the longest history from all the available valid predictions. If no prediction is available the predictor falls back to a base predictor, a simple bimodal predictor with hysteresis. Similar to the bimodal, each table entry stores a counter of usually two or three bits. Apart from the prediction counter, TAGE tables have two types of metadata that help increase the accuracy (Seznec and Michaud, 2006).
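As a concrete illustration of the geometric series, the short sketch below computes per-table history lengths; the values chosen for L(1) and a are arbitrary examples and not the configuration of the evaluated predictor.

```python
import math

def history_lengths(num_tables, l1=2.0, a=2.5):
    """Geometric history lengths L(i) = ceil(a**(i-1) * L(1) + 0.5).
    The parameters l1 and a here are example values, not the tuned design."""
    return [math.ceil(a ** (i - 1) * l1 + 0.5) for i in range(1, num_tables + 1)]

print(history_lengths(4))   # -> [3, 6, 13, 32] with the example parameters
```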

The first feature is a 2-bit “useful” saturating counter that improves the accuracy by overriding the standard policy. When a prediction is selected and it turns out to be correct, the useful counter is incremented, while when the prediction was incorrect the useful counter is decremented. When the counter reaches zero the prediction is considered low confidence and is overridden by an alternate one. Another feature of the useful counter is that periodically all entries are reset to zero, to reduce the impact of stale entries (Seznec and Michaud, 2006).
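The useful-counter policy described above can be sketched as follows; the saturation bound, the reset interval and the dictionary-based entry representation are illustrative assumptions.

```python
USEFUL_MAX = 3          # 2-bit saturating counter
RESET_PERIOD = 256_000  # illustrative reset interval, in retired branches

def update_useful(entry, selected_correct):
    # Increment the useful counter when the selected prediction was correct,
    # decrement it otherwise, saturating within [0, USEFUL_MAX].
    if selected_correct:
        entry["useful"] = min(entry["useful"] + 1, USEFUL_MAX)
    else:
        entry["useful"] = max(entry["useful"] - 1, 0)
    # An entry whose counter has reached zero is treated as low confidence and
    # its prediction is overridden by the alternate prediction.

def maybe_reset_useful(tables, branches_retired):
    # Periodically clear every useful counter to age out stale entries.
    if branches_retired % RESET_PERIOD == 0:
        for table in tables:
            for entry in table:
                entry["useful"] = 0
```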

The second feature that makes TAGE more accurate is the fact that it uses tags to cross-validate that the entries correspond to the same branches, effectively mitigating aliasing effects. The tags are stored alongside each entry and are usually a hash of the PC and the global history.

Figure 4.4: A depiction of a TAGE branch predictor. Tables use history length (H) for indexing and tag matching to ensure valid predictions.

When there is a tag match between the entry and the current branch PC hash, the entry is considered a valid prediction, otherwise it is ignored. The design is shown in Figure 4.4. The latest design, TAGE-SC-L, also features a statistical corrector to improve on biased cases that TAGE consistently mispredicts and a specialised loop predictor (Seznec, 2014).

The statistical corrector is a metapredictor that inverts the prediction of the main TAGE predictor in cases where the branches have a statistical bias but are not correlated with global history. While TAGE is generally very accurate, for such branches it can often perform worse than simple PC-based predictors like the bimodal. The corrector uses tables which are indexed with the same global history lengths as the TAGE tables and the output prediction of the TAGE predictor. For each prediction, the corresponding entries are added and, based on the sign of the sum, the branch is considered taken if positive and not taken if negative. If the statistical corrector prediction disagrees with the output of the TAGE predictor and the magnitude is greater than a threshold, then the prediction is reverted (Seznec, 2011).
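A minimal sketch of the corrector’s decision rule is given below, assuming the relevant corrector counters have already been read out; the threshold value and all names are illustrative.

```python
def statistical_corrector(tage_prediction_taken, corrector_counters, threshold=4):
    """Sum the signed corrector counters; revert the TAGE prediction when the
    sum disagrees with it and its magnitude exceeds the threshold.
    Parameter names and the threshold are illustrative assumptions."""
    total = sum(corrector_counters)
    sc_taken = total >= 0                     # positive sum -> taken
    if sc_taken != tage_prediction_taken and abs(total) > threshold:
        return sc_taken                       # corrector overrides TAGE
    return tage_prediction_taken
```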

The loop predictor is a very small additional component that tracks past loops that have occurred during execution and stores their number of iterations. When the PC that corresponds to a stored loop is encountered in the future, the component counts the active iterations to accurately predict when the loop will end (Seznec, 2011).

The TAGE predictor is considered one of the most accurate designs, ranking the highest in the 5th Championship Branch Prediction (2016). To achieve this accuracy, however, the design requires significantly more storage budget to accommodate all the tables. The total size of the original predictor is 8kB or 64kB, with a corresponding misprediction rate of 4.991 and 3.986 MPKI respectively.

The statistical corrector can be quite small, roughly 12% of the total budget, as it rarely needs to intervene. The tables are usually interleaved to reduce the area of the component. The loop predictor delivers marginal improvements to the accuracy. For this reason, very limited budget, in the order of tens of bytes, is dedicated to this component (Seznec, 2014).

Figure 4.5: The evolution of perceptron designs: (a) the bimodal predictor, (b) the original perceptron, (c) the multiperspective perceptron. There is a clear path from the bimodal that only uses the PC, to the modern multiperspective perceptron that uses multiple features to deliver predictions.

In terms of prediction latency, the design is significantly slower than the bimodal as it requires a hashing step before the table accesses, followed by tag matches, a comparison for the longest valid prediction and the latency of the statistical corrector. As this cannot be executed in a single cycle, preliminary predictions are extracted, first by the base bimodal, then potentially altered by the TAGE output and finally by the statistical corrector.

4.1.3 Perceptron Predictors

The Perceptron predictor can be considered an evolution from the simple PC based predictors like the bimodal and gshare (Tarjan and Skadron, 2005). The original design from Jimenez and Lin (2001) is loosely based on neural network theory and uses the concept of a Perceptron neuron to deliver the prediction. This features a single classic neuron that calculates a prediction based on the sum of all the products between the history bits and signed weights. The sign of the sum represents the outcome of the branch while the magnitude is the confidence in the prediction. As shown in Figure 4.5b, the weights are stored in a PC indexed array, similar in structure to the bimodal. This makes the prediction correlate both with the global history and the PC.

The main issue with the original perceptron design is that it has very high prediction latency, as it requires many calculations in sequence. After the PC is used to index the correct weights, the history bits need to be multiplied with the corresponding weights and the resulting products added to deliver the outcome of the branch. Additionally, the design requires a lot of storage budget and area to be able to store all the weights (Loh, 2006; Jimenez, 2003).

A modified version is the multiperspective perceptron by Jiménez (2014a). This design simplifies the prediction process while at the same time increasing the accuracy of the predictor. The idea, visually presented in Figure 4.5c, is to combine the weights with the history by creating multiple tables that are indexed based on a hash of the PC and the global history. As each table uses a different part of the global history, these indexing schemes are called features. The tables effectively behave similarly to TAGE, which predicts based on geometric histories. One other advantage of this design is that, as the feature tables are decoupled from each other, the entries can be multiplexed to cover more branches. The prediction uses fewer calculations:

\text{Prediction} = \sum_{i=1}^{N} W_i(f_i)

where W_i are the feature tables with the weights, and f_i is the value of a specific feature.
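The prediction step can be sketched as below, assuming each feature has already been mapped to an index into its own weight table; all names are placeholders.

```python
def perceptron_predict(feature_tables, feature_indices):
    """Sum one signed weight per feature table; the sign gives the direction,
    the magnitude the confidence. Inputs are illustrative placeholders."""
    total = sum(table[idx] for table, idx in zip(feature_tables, feature_indices))
    taken = total >= 0
    confidence = abs(total)
    return taken, confidence
```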

Another optimisation that improves the predictor is the use of alternate features, different from the PC and the global history. In Jiménez (2014a) many different features are used to index into the weights, the most prominent being:

• Global history keeps track of the outcomes of all the branches that have been executed in the past. The history depth can be quite long, as each branch is represented by only 1 bit.

• Path keeps a record of the sequence of recent branch addresses. To save space, only some bits from the addresses are used, while the information is compressed using hashing techniques. The depth, shift and mixing style are all parameters that are tuned based on the needs of the implementation.

• Recency is a fixed depth stack that keeps track of addresses that have recently been used. The value that is returned is a hash of all of the addresses in the stack. Unlike path, recency evicts addresses from the stack using a least recently used policy.

• Local history keeps track of the outcomes per branch. These local histories are stored in a table indexed by a hash of the branch PC. When a prediction is needed the hash is used to index the corresponding feature table.

• Blurry path tracks large granularity windows of history. Instead of hashing per address as path does, blurry path hashes per region of addresses, tracking more prominent changes in behaviour.

Figure 4.6: The design of the multiperspective perceptron predictor.

• Loop prediction is a feature similar to the component that exists in TAGE, called the innermost loop iterator (IMLI). The IMLI can identify loops by tracking previous continuous repetitions of past backwards branches.

With these higher level features, the generalised version of the multiperspective perceptron, displayed in Figure 4.6, can use multiple different features to better identify program behaviour and more accurately predict the outcome of branches. For simplicity, the remainder of the text will refer to this multiperspective perceptron design simply as perceptron.

4.2 Branch Predictor State Transfer Mechanism

As shown in Figure 4.1, for high frequency migrations to enable potential performance/energy improvements in heterogeneous cores, it is necessary that the branch predictor maintains a high accuracy after each switch. Additionally, improvement in the accuracy must not be negated by extra transfer overheads introduced in the migration process.

Sharing the branch predictor is not feasible for two reasons. First, the latency that is added to a common predictor due to multiplexing eliminates any benefits gained by keeping shared state between migrations. Second, InO and OoO cores have been designed to operate at different performance/power points. OoO cores trade power and area for high performance, while InO cores remain small and energy efficient. As such their front ends, and in particular their branch predictors, are designed differently. The OoO core needs a large and accurate branch predictor to support deep speculation that unlocks instruction-level parallelism, while the InO core uses much smaller designs since it is less reliant on speculation. As a consequence, it is impossible to share the same design without under-provisioning the OoO core’s accuracy or notably increasing the area and power consumption of the InO core for little added performance.

To achieve the benefits of sharing without its drawbacks, a mechanism is presented that transfers some state between the two predictors when a migration is triggered, eliminating the complexity stemming from component sharing. Furthermore, as modern branch predictors combine predictions from multiple components, the design identifies similar components between the branch predictors of the heterogeneous cores. When a migration is triggered, the state of the common components is transferred from the source to the destination core.

Figure 4.7: The transfer mechanism between cores.

From past studies it is observed that both the TAGE and perceptron predictors perform accurately when they are large enough to store sufficient state, as large as 64kB for high performance cores (Seznec, 2014; Jiménez, 2014a). In very constrained designs with small branch predictors (around 1-3 kB), perceptrons have been shown to be viable in modern InO cores (Arm Ltd., 2017b).

Transferring state from one core to the other during a migration must be efficient for the benefits to outweigh the costs. TAGE predictors (Figure 4.4) have a fallback predictor which is small enough to transfer and can deliver predictions independently of the rest of the design. This can be used in a heterogeneous system that uses a small, fully transferable branch predictor for the InO core and an identical predictor as the TAGE fallback for the OoO core. When a migration occurs, the state of the identical components can be copied before the destination core resumes execution, negating any accuracy drop. This is shown in Figures 4.7 and 4.2.

Using a multiperspective perceptron as the high accuracy predictor for the OoO core makes transferring the state over to a smaller design significantly harder. This happens because for every prediction the weights from each feature need to be added to provide a valid prediction. Removing the extra features that exist only in the large perceptron would lead to imbalanced summations and therefore less accurate predictions. Removing entries across all features to reduce the size suffers from the same problem, as each feature entry is used for multiple predictions through all the possible combinations with the other features. Removing entries across all tables not only reduces the number of predictions stored dramatically, it also reduces the accuracy as unavoidably some combinations will be incomplete. One benefit of perceptrons that can be useful in the case of heterogeneous state transfer is that the predictions are densely stored and can achieve higher accuracy than conventional small sized predictors. For this reason, it is worth investigating ways to translate predictor state between small InO cores that use perceptrons and large OoO cores that rely on TAGE.

4.3 Impact of the Branch Predictor State

To measure the impact of transferring the branch predictor state when a heterogeneous system migrates, two studies are performed. The first one focuses on measuring how much a system is impacted when it executes without any branch predictor state. This shows the potential improvement when a heterogeneous system reduces the mispredictions that occur after a migration to a new core. The second study performs an in-depth analysis of the accuracy of TAGE predictors and devises a transfer mechanism that reduces the accuracy hit caused by a migration, without adding noticeable transfer overheads.

4.3.1 Calculating Potential Performance Gains

To measure the performance improvement potential of transferring predictor state in heterogeneous systems, a direct comparison is necessary between two systems that migrate with and without keeping any branch predictor context. This can show the disparity between the worst case, where the predictor is cold, and a system with a warm predictor. This is particularly timely in today’s systems as they often perform power gating to preserve energy (Esmaeilzadeh et al., 2012) or juggle multiple applications that can pollute the branch predictor before the same application runs again on the same core. More importantly though, recent side-channel attacks shown by Kocher et al. (2018) target the branch predictor and require mitigation techniques such as tagging, partitioning or even flushing the predictor to ensure safe operation, which lowers the actual performance.

The methodology used for this experiment is identical to the one described in Chapter 3, using a gem5 simulation (Binkert et al., 2011) of an Arm system. The execution is sampled at regular intervals to simulate the effects for a particular migration frequency, similar to the experiments in Chapter 3. In brief, for each sample the simulator is cloned twice, effectively running three parallel executions:

1. The main one continues to run without any change or interruption

2. The first clone flushes the pipeline and the branch predictor to simulate a system which has just migrated a task to another (identical) core without any branch predictor state.

3. The second clone flushes only the pipeline to simulate a system which has just migrated a task to another (identical) core while transferring the complete branch predictor state.

Upon cloning the simulation state, the original simulation continues, periodically records the performance of the system and creates new clones until it reaches the end of execution. A cloned simulation performs a migration that executes for one period and then terminates. The migration is approximated by clearing all the private and non-transferable state in an otherwise identical system.
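A high-level sketch of this sampling loop is shown below; the simulator interface (fork, flush and IPC queries) is purely illustrative and does not correspond to the actual gem5 API.

```python
def sample_migrations(simulator, period, total_instructions):
    # Illustrative sketch of the clone-and-compare methodology; all methods on
    # `simulator` are placeholders, not gem5 calls.
    samples = []
    while simulator.instructions_retired() < total_instructions:
        cold_bp = simulator.fork()          # migrated core, no branch predictor state
        cold_bp.flush_pipeline()
        cold_bp.flush_branch_predictor()

        warm_bp = simulator.fork()          # migrated core, full branch predictor state
        warm_bp.flush_pipeline()

        for run in (simulator, cold_bp, warm_bp):
            run.run_for(period)             # execute one migration period each

        samples.append((simulator.ipc(), cold_bp.ipc(), warm_bp.ipc()))
    return samples
```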

The data collected from the clones of the simulator are compared to phase-aligned data from the main run, creating synchronous samples. This allows for a direct comparison of the performance, as shown in Figure 4.1. For each sample taken, the IPC from both simulations of each phase is compared to determine the relative performance IPC_main / IPC_migration. The experiment is performed both for OoO and InO cores, comparing the two systems with and without the branch predictor state. All systems retain the caches and the TLB. Additionally, the experiment is repeated for different migration periods ranging from 1k instructions to 10M instructions. Details of the simulated system are shown in Table 4.1.

OoO Core: Pipeline Width 3-wide, Pipeline Depth 15, Operating Frequency 1 GHz
In-Order Core: Pipeline Width 2-wide, Pipeline Depth 8, Operating Frequency 1 GHz
Shared Resources: L1D Size 32kB, L1I Size 32kB, L2 Size 1MB, L2 Associativity 16-way, L2 Prefetcher Stride

Table 4.1: The specifications of the system used in the experiments.

Figure 4.8 shows that while the InO core loses at most 12% of its performance when not migrating the branch predictor state, the OoO core loses as much as 38% of its performance when performing migrations as frequent as every one thousand instructions. This is because the InO core is 2-wide and cannot reorder instructions to mask memory latency, and is therefore heavily affected by cache misses. On the other hand, the wider OoO core is able to speculate and uncover instruction-level parallelism to mask misses. However, to do so, it needs accurate speculation to break dependency chains and feed the pipeline. Effectively, the more aggressive the core, the larger the need for accurate branch prediction. These results make evident that while the InO core can cope with migrations every 1k instructions, the performance lost by the OoO core prohibits such systems if the branch predictor state is not “warm”.

When transferring the branch predictor state, the OoO core practically eliminates the performance degradation as it recovers 35% of the lost performance. Compared to a system that does not migrate, the overhead of migration is less than 3%. This remaining overhead is due to the pipeline and reorder buffer starting empty after a migration. When comparing migrations between identical InO cores that transfer the branch predictor state, the improvement is virtually negligible as less than 1% of performance is reclaimed, resulting in a 7% performance loss compared to an InO core with no migrations. Comparing InO and OoO, migrations become more efficient for the OoO core when transferring the branch predictor state, even for migrations with a 1k instruction period.

Figure 4.8: Performance cost when migrating without branch predictor state. The bars are normalised to the performance of a system that executes without a migration. (a) InO performance breakdown when migrating without transferring the BP state (coloured bars) and with the BP state (white bars). (b) OoO performance breakdown when migrating without transferring the BP state (coloured bars) and with the BP state (white bars).

4.3.2 Measuring Accuracy Improvements

To mitigate the negative effects of migrations, it is necessary to reduce the number of mispredictions that happen when transferring execution to a new core. As shown above, InO cores do not experience significant slowdown from branch misprediction, unlike the high performance OoO cores. The focus therefore is primarily on increasing the accuracy of the OoO predictor after a migration, without affecting the long-term performance of either core.

For this reason, TAGE-SC-L from Seznec (2014) is evaluated, which is considered one of the most accurate designs. A modified version of the 5th Championship Branch Prediction (2016) framework is used to allow per component flushing, to assess in detail the accuracy of a system that would transfer partial state. The predictor is flushed across the range of 1k to 100M instructions to emulate the behaviour of a system that migrates at those frequencies, in accordance with the experiments in Chapter 3. The framework uses over 250 traces from mobile and server applications, both long and short, to cover a wide variety of cases.

Figure 4.9: The misprediction rate for TAGE across the range of migration periods evaluated. Legend shows the instructions per migration.

Figure 4.9 shows the contribution of each TAGE component to the mispredictions measured after a migration. Note that the Y-axis shows only the increase in MPKI over a fully warm TAGE predictor. Results show that retaining none of the components results in 4.3x the nominal MPKI. Keeping the state of all the TAGE tables delivers close to maximum accuracy but requires 55kB of data to be transferred, making fine grain migrations impractical. The bimodal and statistical corrector comparably reduce MPKI, by 32% and 44% respectively for 1k migrations, for example. However, the statistical corrector requires 7.3kB of state to be transferred to achieve this, as opposed to 1.25kB required by the bimodal. Overall, the bimodal is the component with the highest MPKI reduction per kB.
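Using the figures quoted above, the benefit per transferred kilobyte can be compared directly; the percentages and sizes come from the text and the rest is simple arithmetic.

```python
# MPKI reduction per kilobyte of transferred state, for 1k-instruction
# migrations, using the figures quoted in the text above.
components = {
    "bimodal":               {"mpki_reduction_pct": 32, "size_kb": 1.25},
    "statistical corrector": {"mpki_reduction_pct": 44, "size_kb": 7.3},
}

for name, c in components.items():
    print(f"{name}: {c['mpki_reduction_pct'] / c['size_kb']:.1f}% MPKI reduction per kB")
# bimodal: 25.6% per kB, statistical corrector: 6.0% per kB
```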

To measure the impact of the transferred state on branch predictor accuracy, three experiments are conducted. The first uses a large, accurate TAGE predictor for the OoO core and a smaller bimodal predictor for the InO core. The OoO branch predictor uses a 64kB TAGE, while the bimodal is 1.25kB and perfectly matches the fallback predictor in the OoO TAGE.

When migrating to the InO core, the fallback bimodal state of the OoO is transferred entirely. This means that the InO core always resumes with a fully warm predictor, so no performance degradation is expected. To measure the opposite transfer, the trace simulation flushes all the TAGE components except the bimodal, emulating a system that inherits the state from an InO core. This design is referenced as TAGE-B in the analysis below.

Name | Size | Details
bimodal | 1.25kB | 8kbit counters, 2kbit hysteresis
perceptron | 1.25kB / 3kB | 8 features: GHIST (2), GHISTPATH (3), IMLI, GHISTMODPATH, RECENCYPOS
TAGE | 64kB | Loop Predictor, Statistical Corrector, 1.25kB bimodal base predictor
TAGE-P | 64kB / 66kB | Loop Predictor 156B, Statistical Corrector 8kB, 1.25kB / 3kB perceptron base predictor

Table 4.2: The evaluated branch predictors.

Figure 4.10: Misprediction of TAGE variants that transfer the base predictor compared to a TAGE that does not transfer the state upon migration. Benefits of transferring become increasingly noticeable with more frequent migrations.

For the second experiment, the concept is that the TAGE fallback bimodal can be replaced by another predictor with greater accuracy, especially since it will be the sole prediction provider in the InO core. Taking this into consideration, the base bimodal is swapped with a similar sized but more accurate perceptron predictor. The experiments assess two designs for the OoO core that use this hybrid (TAGE-P), one that replaces the 1.25kB bimodal with an equally sized, very constrained perceptron and one with a more conventional small perceptron of 3kB instead. The total size of these two variants is 64kB and 66kB respectively. Both cases assume that the small perceptron in the InO core perfectly matches the fallback perceptron in the TAGE design in terms of size, to avoid reducing the accuracy of the predictions. Details of the predictors used are shown in Table 4.2.

In Figure 4.10 the experiment flushes the entire 64kB TAGE predictor across different migration periods and compares it with TAGE-B, which retains the bimodal, and TAGE-P, which retains a base perceptron predictor (1.25kB and 3kB variants). Even when preserving minimal state, the accuracy of the TAGE predictor is dramatically improved, especially for fine grain migrations where the MPKI reduction is in the best case 62%, using the TAGE predictor with the 3kB base perceptron (TAGE-P 66kB).

Figure 4.11: CPI stacks show the breakdown of the overheads with and without branch predictor state. When factoring in the transfer cost, using a fast SRAM cache, a 3kB perceptron can be transferred without prohibitive overheads.

4.3.3 Calculating the transfer overheads

From the above, it is evident that migrations cause branch mispredictions that lead to significant performance loss. Additionally, the trace analysis showed that with minimal state the majority of the accuracy can be recovered, practically eliminating the performance drop in today’s systems. However, even when pushing minimal branch predictor state between cores, it is necessary to take into consideration potential added overheads that stem from the transfer of data. For this reason, a theoretical calculation is used which can provide insight into how much a heterogeneous system will be affected by the transfer.

Figure 4.11 shows the total cost of a migration for an OoO core, taking into account the transfer of state, as a CPI stack. This shows that when the BP state is retained, the CPI overheads are reduced by 0.43. Assuming a 256-bit bus for the transfer, the CPI increase for the 1kB and 3kB base predictors is 0.04 and 0.12 respectively for 1k migrations. This means that even taking into account the transfer of 3kB of predictor state, the overall CPI is reduced by 0.31. Even in this worst case scenario, where the transfer cost is appended to the migration delay, the system performance improves overall. At coarser migration frequencies the transfer effects are negligible, achieving almost the same CPI as a system with no migrations at a 10k instruction migration period. One potential optimisation that can further improve this result is that the transfer can be designed to happen in parallel with the rest of the migration operation.
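The added CPI can be estimated to first order from the amount of state and the bus width, as in the sketch below; the assumption of one bus beat per cycle with no overlap is illustrative, so the result only approximately matches the figures quoted above.

```python
def transfer_cpi(state_bytes, instructions_per_migration, bus_bits=256):
    """First-order estimate: one 256-bit bus beat per cycle, no overlap with
    execution. These assumptions are illustrative, not taken from the thesis."""
    cycles = (state_bytes * 8) / bus_bits
    return cycles / instructions_per_migration

print(transfer_cpi(1.25 * 1024, 1000))   # ~0.04 added CPI for the 1.25kB base predictor
print(transfer_cpi(3 * 1024, 1000))      # ~0.10 added CPI for the 3kB perceptron
```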

4.4 Summary

Heterogeneous multiprocessors traditionally feature OoO and InO cores, enabling operation at high performance and high energy efficiency respectively. By leveraging migrations, such systems can improve efficiency, executing on the InO core for memory bound phases and switching to the OoO core when it can deliver high performance. Studies show that high frequency core switching could exploit smaller execution phases and save as much as 36% energy without sacrificing performance.

However, these savings cannot currently be achieved due to the overhead of migrations. Sharing caches and TLBs reduces the overhead, but the cold branch predictor also causes a notable performance drop. The branch predictor cannot be shared as this adds prohibitive latency during execution, while InO and OoO cores have non-compatible designs. Instead, transferring a subset of the predictor between cores reduces mispredictions. The experimental results show that, even for migrations as frequent as every 1k instructions, this can reduce mispredictions by 62% on average and help recover as much as 35% of the performance lost otherwise. Even when considering the overheads of transferring partial state, the effects are beneficial, reducing CPI by as much as 31% compared to a system that migrates to a core without any predictor state.

Chapter 5

Security Considerations for Shared Microarchitecture

In Chapters 3 and 4, system level and microarchitectural improvements were described that aimed at streamlining migrations between cores. All the improvements presented are based on the principle that sharing components or transferring state between cores can eliminate some of the warm-up overheads and consequently improve performance. The previous analysis focused on the feasibility of such designs and presented practical architectures that can be implemented in hardware to enable high frequency migrations with minimal realistic overheads.

However, when considering any architectural optimisation, apart from a performance and feasibility analysis it is also important to take into account whether the alterations to a system cause any unexpected adverse behaviours. From this perspective, recent studies have shown that shared structures create dangerous security exploits that can leak sensitive information (Kocher et al., 2018; Lipp et al., 2018; Liu et al., 2015).

Components such as shared caches have been shown to be able to leak information or be manipulated to create covert communication channels between applications with the use of side-channel attacks (Liu et al., 2015; Lipp et al., 2018). More recently, studies found that aggressive branch prediction in modern processors can permit side-channel attacks that exploit the microarchitectural design to manipulate the branch predictor into fetching and leaking sensitive data (Kocher et al., 2018; Evtyushkin et al., 2018).

While this chapter focuses mainly on the security implications of optimisation techniques for heterogeneous multiprocessors, many of these exploits can be generalised to threads that share components within a single core. For this reason, the analysis is extended to include Context Switches (CS) even within a single core, as this can also be a critical enabler with similar characteristics.

The mitigation of security exploits is not a trivial issue as narrow, “hot” fixes for side-channel attacks often do not guarantee that systems are safe from newer variations. Furthermore, keeping components like the caches or the branch predictor separate is costly in terms of area. Crude approaches such as flushing the shared structures at potential vulnerability points, while not increasing the area, negate the benefits of sharing, creating a notable drop in performance while also increasing power consumption as the system constantly expends energy to scrub itself clean. This is depicted in Figure 5.1. Addressing vulnerabilities in software after discovery is often not possible and usually even more costly in terms of performance. For this reason, a more universal approach is desirable that guarantees isolation between contexts for a reasonable performance and energy consumption compromise.

Figure 5.1: Branch predictors do not deliver their full accuracy when flushed frequently in order to ensure secure operation.

The above observations motivate the design of future systems that take into account one additional design constraint, that of being able to securely operate without allowing information to leak. Extensive work has been done to secure the caches and block branch predictor attacks albeit with a notable increase in area (Yan et al., 2018; Khasawneh et al., 2018).

To prevent the aforementioned branch predictor side-channels, a novel microarchitectural component is described in this chapter that guarantees branch predictor state isolation efficiently, without degrading performance. In detail, the contributions are:

• The introduction of a new notion, that of transient prediction accuracy, critical for future side-channel free branch predictors. Transient accuracy is juxtaposed with steady-state accuracy, which is what is used to evaluate designs today.

• An analysis that shows that current state-of-the-art branch predictors perform notably worse in transient state when flushed for security. The TAGE predictor (Seznec and Michaud, 2006) is used to show that large components do not contribute to the prediction accuracy under these new circumstances.

• To solve this, the branch retention buffer, a novel mechanism that reduces cold-start effects by preserving partial branch predictor state per context, is described. Compared to the state-of-the-art, this design achieves the same high accuracy at steady-state, improved transient accuracy, and ensures branch predictor state isolation without increasing the overall area.

5.1 Speculative side-channels

Side-channel attacks in architecture have emerged in recent years as a legitimate source of concern for modern processors. Due to the large variety of side-channel attacks, their severe potential to compromise systems and the specialised nature that is needed to address each case separately, the remainder of this chapter focuses on a subset of branch predictor attacks that manipulate the state of the component.

5.1.1 Branch predictor side-channels

Branch predictors are a critical part of modern core design, drastically increasing processor performance. Recent studies have shown that side-channel attacks leaking sensitive data are possible in most contemporary CPU designs (Kim et al., 2014; Kocher et al., 2018; Evtyushkin et al., 2018; Lipp et al., 2018). These exploits take advantage of hardware oversights at design time, leaving the system vulnerable to code that can cause information to leak outside its defined scope. Mitigation techniques are far from trivial, as current solutions that protect against branch predictor side-channels notably degrade predictor accuracy and reduce overall system performance, while academic proposals are too costly to implement.

Spectre (Kocher et al., 2018) class attacks target branch predictor components in a variety of ways. The focus here is particularly on one specific case (variant 2) that exploits conditional branch misprediction to allow malicious code (a gadget) to be executed. That code uses flush and reload type timing attacks on the caches to leak information (Yarom and Falkner, 2014; Liu et al., 2015). For this attack to be plausible, some knowledge of the micro-architectural behaviour is needed to influence the predictor to misspeculate and execute the gadget. With that, an attacker can poison the predictor entries and guarantee the necessary misprediction. The attack can also be triggered in cases when the branch mispredicts without altering the branch predictor state, although this is significantly harder to orchestrate.

Similar to how Spectre uses the branch predictor to leak information, BranchScope by Evtyushkin et al. (2018) uses a mechanism similar to Spectre variant 2, targeting the Branch Target Buffer (BTB) to influence the Pattern History Table (PHT) of the directional branch predictor. In this attack scheme, the predictor is primed so that it is in a predefined state the attacker can control. From this starting point, the target code is triggered to execute the victim code, which will observably change the PHT state. A simple probe can then calculate the branch flow based on the primed branch predictor conditions. The technique has been shown to successfully leak information from a secure SGX enclave in Intel processors (Evtyushkin et al., 2018).

The potential scenarios to exploit such attacks are numerous. Any transition from one process to another or a process to/from the kernel can be a potential point of vulnerability, as any context switch or system call can be used to let malicious software “hijack” the branch predictor and leak information. For heterogeneous systems that share branch predictor state this is also a serious reason for concern that should be investigated.

5.1.2 Mitigation techniques

On one hand, software and firmware mitigation techniques are often hard to implement and induce high overheads. A recent study from Prout et al. (2018) measures the actual performance loss for Spectre and Meltdown vulnerabilities and finds that it ranges from 15% to 90%. While mitigation measures for known exploits have been deployed, potential unknown exploits can still find ways to exploit shared components such as the caches, the branch prediction logic, and the translation tables, as the fixes are topical and do not guarantee security. Furthermore, some vulnerabilities (i.e. Spectre v2, BranchScope) cannot be addressed in software or firmware and require hardware redesign.

On the other hand, hardware isolation is a sufficient measure to ensure no leaks occur, but often comes at the expense of performance. Cache side-channels are extremely difficult to address without significant performance loss or without sacrificing a lot of area. Creating shadow structures to keep cache and TLB speculative state, proposed in (Khasawneh et al., 2018; Yan et al., 2018), isolates the information in the caches and protects against Spectre and Meltdown type attacks. The size of the shadow structures can be very costly, especially in systems with much larger caches and TLBs than the ones evaluated in that study. Similarly, the aim is to isolate the branch predictor state to achieve similar security properties for some of the exploits without increasing the area or degrading the performance.

For branch prediction, clearing the branch predictor of any state on each context switch can take care of attacks that manipulate or eavesdrop on control flow to access data stored in the caches. However, flushing the entire state results in a significant accuracy drop with every context switch. Other alternatives such as hard partitioning can negatively impact the steady-state accuracy of the predictor, as the effective size per context is reduced. Tagging the branch predictor entries is also not an acceptable solution, because in the worst case it behaves similarly to flushing when most of the entries have been replaced by another context. More importantly though, while using tags eliminates branch predictor entry poisoning, it still allows observation of active entries, which can be flushed to cause deliberate mispredicts and data leaks.

A more secure branch predictor design that uses isolation, can train efficiently, and quickly reaches its steady-state performance will be useful in a post-Meltdown and post-Spectre world (Lipp et al., 2018). Taking into account all of the above, current and future branch predictors need to be able to protect against potential side channels, without significant performance overhead when context switching frequently.

5.1.3 Threat model

This work is based on a threat model that assumes a victim and an attacker application. The attacker in this model attempts to infer victim information without having the authority to access it directly. The model is formulated explicitly from the statements below:

• Both the victim and the attacker reside in the same core (but different threads) and share the same branch predictor, as described in Kocher et al. (2018) and Evtyushkin et al. (2018).

• Slowdown of the victim execution can be measured precisely enough to detect the behaviour of a single branch. This has been proven to be possible in recent studies (Kocher et al., 2018; Evtyushkin et al., 2018; Allan et al., 2016).

• The attacker can force the victim code to execute, allowing vulnerable code to be targeted.

• The attacker has the ability to poison branch predictor entries used by the victim application, forcing a misprediction.

5.2 Predictor Flexibility

Branch predictors have been evaluated based on how accurate their predictions are, given a reasonable amount of history to learn from. However, given the recent findings described in the previous section, in many real world cases they operate in a time-frame much shorter than the ideal, predicting from only a partially warm or cold state. This happens because, in order to fully guarantee isolation of branch predictor state, the predictor needs to be flushed between context switches, causing a notable accuracy drop.

A novel design is described that reduces such overheads by identifying the components that contribute the most to accuracy during the warm-up phase and creating a mechanism to retain their state per context – effectively replicating them. The rest of the predictor, which only contributes to long-term accuracy, is instead flushed, preventing any state leakage. This design addresses the overall security-performance trade-off by guaranteeing isolation while also increasing accuracy for frequent switches.

Figure 5.2: The mechanism which stores the essential state of the predictor. Most of the state is disregarded and only the most useful state is kept to increase transient accuracy.

Large predictors that store more information are usually able to deliver better predictions. However, if the state is lost or invalidated before the predictor has time to warm up, then effectively it will not reach peak performance. In this case, smaller predictors might be able to deliver equivalent accuracy for a fraction of the state.

5.2.1 Steady-state and transient accuracy

As the branch predictor is expected to flush its state to guarantee isolation between contexts, it is important to distinguish between steady-state and transient accuracy. The term steady-state accuracy refers to the performance of the branch predictor when it is fully warm and has reached its highest accuracy. Conversely, transient accuracy describes the behaviour during the warm-up phase.

To quantify transient accuracy, the branch predictor state is flushed across different branch instruction periods and the change in average MPKI is tracked. As shown in Figure 5.1, depending on how frequently the state is flushed, the actual (transient) accuracy can be significantly worse than the nominal, steady-state accuracy normally evaluated against.
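This measurement can be expressed as the simple loop below, assuming a trace-driven predictor model with a flush method; all interfaces are placeholders for the modified CBP framework.

```python
def transient_mpki(predictor, branch_trace, flush_period, total_instructions):
    # Flush the predictor every `flush_period` branch instructions and report
    # the resulting mispredictions per kilo instruction. The predictor and
    # trace interfaces are illustrative placeholders.
    mispredictions = 0
    for i, (pc, taken) in enumerate(branch_trace):
        if i > 0 and i % flush_period == 0:
            predictor.flush()               # emulate a secure context switch
        if predictor.predict(pc) != taken:
            mispredictions += 1
        predictor.update(pc, taken)
    return 1000.0 * mispredictions / total_instructions
```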

5.2.2 The Branch Retention Buffer

From the predictor analysis in Chapter 4, retaining even a small amount of state per context could improve the transient accuracy of the predictor, without hurting the maximum accuracy achieved during long uninterrupted execution. To guarantee isolation and simultaneously improve transient accuracy, a novel mechanism is presented (Figure 5.2), where some state of the branch predictor is retained per context. The mechanism uses a dedicated component called the branch retention buffer (BRB) that replaces part of the predictor.

Figure 5.3: A diagram of the BRB. The retained state is stored in separate Static Random-Access Memory (SRAM) banks which correspond to different contexts.

The branch retention buffer is tightly coupled with the branch predictor and keeps multiple separate entries. Ideally, storing the entire state would enable high accuracy without any warm-up time. However, the size of designs today is too large to fully store without incurring notable overheads. To reduce the additional storage costs, the component is made as small as possible. The design uses 10kbit (1.25kB) sized entries, as they match the size of the base predictor for both the 8kB and 64kB variants. As such, the entries are designed to store entire components that can deliver a standalone prediction.

In short, the low overhead mechanism retains partial state that delivers better accuracy than a large “cold” predictor, with no steady-state accuracy penalty.

The BRB is designed to allow for easy switching between entries without moving or copying data and thus with no added latency during execution. This is done by keeping multiple entries in separate SRAM banks. When a switch occurs, the entry corresponding to the correct context is selected and used instead of swapping out data from the BRB. The selection uses the Address Space ID and triggers only when a context switch occurs. A small Content Addressable Memory (CAM) is used to map to the correct BRB entry ID, shown in Figure 5.3. As a context switch occurs, any delay in the rerouting of BRB entries is masked by the larger overhead of storing the context state, which is conducted in parallel.

After the newly selected entry is activated, the others are put into retention to save energy. While operating under the same context, the active BRB entry is directly accessed for predictions and does not go through the CAM. When the BRB is full and a new context requests an entry, the least recently used entry is evicted. Previous studies show that 3 entries are enough to handle 2 communicating processes and operating system implications (Rubio and Vijaykrishnan, 2007) without negatively impacting system performance. If more contexts are active and are consistently being switched out, more entries can be introduced. The added entries increase the area of the BRB component linearly, while timing remains the same irrespective of the number of entries, for realistic sizes.
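The bookkeeping for entry selection can be sketched as a small lookup keyed by the Address Space ID with least-recently-used eviction; the entry count and interface names below are illustrative.

```python
from collections import OrderedDict

class BranchRetentionBuffer:
    """Illustrative model of the BRB bookkeeping: one retained base-predictor
    state per context, selected by ASID, with LRU eviction when full."""
    def __init__(self, num_entries=3, make_empty_state=dict):
        self.num_entries = num_entries
        self.make_empty_state = make_empty_state
        self.entries = OrderedDict()          # ASID -> retained base-predictor state

    def on_context_switch(self, asid):
        if asid in self.entries:
            self.entries.move_to_end(asid)    # mark as most recently used
        else:
            if len(self.entries) >= self.num_entries:
                self.entries.popitem(last=False)   # evict least recently used entry
            self.entries[asid] = self.make_empty_state()
        return self.entries[asid]             # becomes the active entry; others retained
```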

Figure 5.4: Contribution to MPKI for each TAGE64 component that is retained at 20k branch instructions. The bimodal reduces the MPKI by 16% while the statistical corrector by 10% despite it being 8x larger.

Retaining the reduced state makes isolation of processes efficient and, as a consequence, improves security. Keeping that reduced state separate per process (or between untrusted parts of the same process, such as browser and script engine), while emptying the rest of the structures, ensures that no state is ever shared between mutually untrusted processes and the kernel. The performance hit is softened in this case, as the preserved state increases the transient state accuracy.

The obvious question, in this case, is what method is used to reduce the data in the most impactful way. This question is not trivial, as it is directly tied to the branch predictor implementation and the amount of state that can be stored efficiently when taking into account the overhead constraints.

One way to preserve state is to select certain components that provide a good balance between the amount of data stored and the accuracy achieved, and discard the state of the remaining components. This “vertical cut” method can be used in the case of TAGE, as it is comprised of various components that can provide accurate standalone predictions (Figure 5.4).

For instance, the base predictor can be isolated from the rest of the components and still provide reasonable accuracy. Similarly, separate TAGE tables can be preserved instead of the entire design, to target certain history lengths.

Figure 5.5: The ParTAGE predictor combines benefits from both perceptron and TAGE predictors. It also allows partial state to be preserved to improve transient accuracy.

Other components, however, provide only complementary benefit to the predictions (Jiménez, 2014a). Therefore, it does not make sense to consider preserving their state by themselves. The loop predictor, for instance, is a relatively small component that identifies regular loops with a fixed number of iterations and needs few instructions to warm up. Its overall effect on the accuracy of TAGE-SC-L is measured to be around a 0.3% improvement (Seznec, 2014).

5.2.3 Perceptron amplified / reinforced TAGE

As a practical example using the BRB, a hybrid predictor is also devised named ParTAGE – Perceptron amplified / reinforced TAGE, shown in Figure 5.5. The branch retention buffer in this case stores a small multiperspective perceptron predictor, similar to the one in Chapter4, to replace the bimodal predictor.

Each entry in the BRB stores an independent small perceptron predictor enabling contexts to keep some reduced state separate. The rest of the branch predictor design can be cleared to eliminate the possibility of side-channel leaks targeting the branch predictor.

Two versions of ParTAGE are assessed: one that hardwires the Perceptron predictor to always be chosen during the transient operation of the predictor (based on the context switching frequency), and one that assesses the confidence of the prediction of the base perceptron predictor before selecting the outcome.

Name | Size | Details
BiM | 64kB | 2-bit counters
BiMH | 1.25kB | 8kbit counters, 2kbit hysteresis
BiMH | 625B | 4kbit counters, 1kbit hysteresis
Perceptron | 64kB | 37 feature tables
Perceptron | 8kB | 16 feature tables
TAGE | 64kB / 8kB | Loop Predictor, Statistical Corrector, BiMH base predictor
ParTAGE | 64kB / 8kB | Loop Predictor, Statistical Corrector, 1.25kB 8-table perceptron base predictor

Table 5.1: The evaluated branch predictors.

5.3 Experimental Setup

The experiments conducted use the CBP5 framework with the traces from the 5th JILP Workshop on Computer Architecture Competitions (JWAC-5) (2016) as a base. The framework uses 268 traces, ranging from 100 million to 1 billion instructions, for both mobile and server workloads.

The modified CBP framework from Chapter 4 is used to perform full or partial flushes of the branch predictor state. This enables temporal studies of predictor behaviour, revealing the effects of full or partial loss of state.

In the experiments, a variety of prominent predictors are assessed for completeness. As a baseline design, a set of bimodal type predictors are analysed, with and without hysteresis. To compare more contemporary predictors, the TAGE-SC-L (Seznec, 2014) and multiperspective perceptron without TAGE (Jiménez, 2014a) from CBP5 are selected. Both designs are lightly modified so that they can flush their state, while being careful not to affect their exact steady-state behaviour.

Furthermore, two variants of ParTAGE are evaluated, that express different policies for the selection of the best transient prediction. ParTAGE-S overrides the prediction of the TAGE tables below a period threshold which is set at 200k branch instructions. ParTAGE uses an integrated confidence value that is assessed by the rest of the TAGE design in order to indicate the most accurate prediction.
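The two selection policies can be summarised as in the sketch below; the 200k branch instruction threshold comes from the text, while the confidence comparison is a simplified stand-in for the integrated TAGE confidence logic.

```python
OVERRIDE_THRESHOLD = 200_000   # branch instructions, threshold quoted in the text

def partage_s_select(perceptron_pred, tage_pred, branches_since_flush):
    # ParTAGE-S: always use the retained perceptron while the TAGE tables are cold.
    if branches_since_flush < OVERRIDE_THRESHOLD:
        return perceptron_pred
    return tage_pred

def partage_select(perceptron_pred, perceptron_conf, tage_pred, tage_conf):
    # ParTAGE: pick whichever prediction is deemed more confident
    # (simplified stand-in for the integrated confidence mechanism).
    return perceptron_pred if perceptron_conf >= tage_conf else tage_pred
```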

The focus is on 8kB and 64kB predictors, similar to the ones that are evaluated at CBP. A detailed list of all the evaluated predictors is shown in Table 5.1. The transient accuracy of the evaluated predictors is measured for the cases outlined above. To achieve this, the experiments are repeated for a range of flushing periods from 10 to 60M branch instructions per flush, extracting the optimal design for each use case.

Applications | Time (s) | Context Switches (CS) | CS/s
adobereader | 73 | 459,699 | 6,279
gmail | 23 | 228,924 | 9,833
googleslides | 108 | 343,788 | 3,185
youtube | 36 | 418,254 | 11,487
yt playback | 21 | 224,631 | 10,698
angrybirds rio | 81 | 406,711 | 5,044
camera@60fps | 60 | 1,958,557 | 32,643
geekbench | 60 | 129,533 | 2,159

Table 5.2: The frequency of context switches on a Google Pixel phone.

MPKI by branch instruction flushing period (instructions)
Name | 100 | 200 | 2k | 20k | 200k | 2M | 20M | 40M | 60M
BIM 64kB | 22.22 | 19.52 | 14.12 | 12.13 | 11.40 | 11.16 | 11.14 | 11.15 | 11.15
BIM 1kB | 22.26 | 19.58 | 14.40 | 12.87 | 12.59 | 12.52 | 12.51 | 12.51 | 12.51
BIMH 40kB | 19.33 | 17.23 | 13.40 | 12.04 | 11.56 | 11.41 | 11.39 | 11.39 | 11.39
BIMH 10kB | 19.34 | 17.24 | 13.43 | 12.10 | 11.63 | 11.49 | 11.47 | 11.47 | 11.47
BIMH 1.25kB | 19.39 | 17.33 | 13.68 | 12.53 | 12.17 | 12.08 | 12.07 | 12.07 | 12.07
BIMH 625B | 19.45 | 17.42 | 13.93 | 12.91 | 12.62 | 12.55 | 12.54 | 12.54 | 12.54

Table 5.3: Misprediction rates of a wide range of bimodal predictors across different state flushing periods measured in branch instructions.

5.4 Results

Three different studies are conducted: simple bimodal predictors, current TAGE and perceptron designs, and the ParTAGE design. The frequency of context switches on a modern mobile device is measured (Table 5.2), and it is found that switches can happen on average as often as every 12k branch instructions, assuming that one out of five instructions is a branch instruction, which is typical of the 5th JILP Workshop on Computer Architecture Competitions (JWAC-5) (2016) workloads. This leads to an accuracy loss of 90%, which translates to a core performance loss of 15% based on the limit study shown in Figure 4.1 from the previous chapter.
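A rough back-of-the-envelope check of this figure is shown below, using the most switch-heavy workload in Table 5.2 and the one-in-five branch ratio stated above; the effective instruction rate of around 2 billion instructions per second is an assumption made purely for illustration.

```python
# Back-of-the-envelope check of the switching granularity (illustrative assumptions).
instr_per_second = 2e9          # assumed effective instruction rate
cs_per_second = 32_643          # camera@60fps from Table 5.2, the most switch-heavy case
branch_ratio = 1 / 5            # one in five instructions is a branch (stated above)

instructions_per_switch = instr_per_second / cs_per_second
branches_per_switch = instructions_per_switch * branch_ratio
print(round(branches_per_switch))   # roughly 12,000 branch instructions per context switch
```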

5.4.1 Quantifying transient accuracy

Bimodal accuracy results

The bimodal predictor is used as a simple first experiment, consisting of a single table of counter bits. Table 5.3 shows how the MPKI of different bimodal designs improves as the state retention period increases. Despite a 100x difference in size, the steady-state MPKI increase is only 12.28%. The results reveal that while size contributes to the steady-state accuracy, transient accuracy is not affected by the size of a predictor design.

Figure 5.6: Comparison of sizes and types of bimodal predictors. Size contributes to steady-state accuracy, hysteresis to transient accuracy. Note the Y-axis begins at an 11 MPKI offset.

Instead, a variation in the design such as adding hysteresis improves transient accuracy, even for smaller predictors. This happens as the hysteresis bits also affect neighbouring branches and ultimately warm up the design faster. This is clearly visible at smaller flushing periods, where the misprediction rate is on average 18% lower for the bimodal designs with hysteresis. Figure 5.6 shows the difference in accuracy for different bimodal sizes and designs.

Transient accuracy: TAGE vs perceptron

The second set of results focuses on comparing the transient behaviour of the two most prominent designs in modern systems: TAGE and perceptron. Figure 5.7 shows the different transient behaviour between TAGE and perceptron designs. From the results, it is noteworthy that the 8kB variant of perceptron has the worst cold start; however, it manages to rapidly improve and reach similar steady-state accuracy.

The transient MPKI for a flushing period of 20k branch instructions is 7.75 and 6.98 for the 8kB and 64kB TAGE designs, and 7.57 and 6.93 respectively for perceptron. Switching every 20k branch instructions is within a realistic range for applications like the ones presented in Table 5.2.

Figure 5.7: A comparison between TAGE and perceptron. While both exhibit similar steady-state accuracy, TAGE shows better transient behaviour.

This result shows that TAGE can deliver better steady-state accuracy. However, for applications that perform frequent context switching (20k branches), perceptron is marginally more accurate.

Another observation can be extracted when comparing the 64kB variants with the 8kB ones at smaller windows of uninterrupted execution. Considering, for instance, flushing every 20k or 200k branch instructions, the 8kB predictors perform on average 10% and 15% worse than the 64kB designs. However, the accuracy gap increases to 33% when observing the same designs at steady-state.

To assess the accuracy drop, the steady-state accuracy is used as a baseline and compared against the transient accuracy for flushing periods as fine as 20k branch instructions. From Figure 5.7, the MPKI increase is as much as 90% and 80% for TAGE and perceptron respectively. This reinforces our belief that predictors today are evaluated without taking into account disruptions that occur during realistic execution.

5.4.2 ParTAGE

To improve the transient accuracy of the prediction, the results of the hybrid predictor ParTAGE are presented. The studies illuminate how its components were sized and tuned. Overall results show how ParTAGE compares in terms of accuracy to the current competitive perceptron and TAGE designs.

[Figure: Small Perceptron vs Bimodal Accuracy. MPKI versus branch instructions per flush for BiMH 1kB, BiMH 625B, Perceptron 1.7kB and Perceptron 1.25kB.]

Figure 5.8: Comparing bimodal and perceptron designs below 1.25kB as base predictors. Results show that even when reduced to this size, perceptrons greatly outperform the competition.

Finding the right, “small” predictor

The TAGE breakdown analysis leads to the underlying idea for the hybrid predictor: replace components which use a significant amount of area with other, retainable components that can deliver a higher transient accuracy.

Figure 5.7 shows that perceptron predictors rapidly improve their MPKI over time before reaching steady-state accuracy. This reveals an interesting insight about Multiperspective Perceptron predictors: compared to bimodal predictors, the hashed and shared weight values in modern perceptron designs are a more effective design, able to train aggressively and store information in a denser format. This ability to densely store branch information is the main reason to replace the base predictor, as it allows more context to be preserved and maintains most of the transient accuracy. Furthermore, in Figure 5.4 the statistical corrector is shown to add little accuracy when the TAGE tables are cold, despite its large size (8kB). Another experiment can therefore be conducted that replaces the 8kB of the statistical corrector with a larger base perceptron, swapping steady-state accuracy for better transient accuracy.

[Figure: Perceptron calibration. (a) Calibrating the number of feature tables for a 1.25kB size; MPKI versus number of feature tables (6 to 11). (b) Different sizes of perceptron with a constant 8 feature tables; MPKI versus size (0.25kB to 2.5kB).]

Figure 5.9: Calibrating the small perceptron.

Figure 5.8 compares bimodal and perceptron designs that are roughly within a budget of 1.25kB. Results show that while the perceptron has higher MPKI for periods as short as 200 branch instructions, its steady-state accuracy is significantly better than that of the bimodal.
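A minimal sketch of why a small hashed perceptron can store branch context more densely than a bimodal table is shown below: several features index shared weight tables and the signed sum decides the prediction, so one weight entry contributes to many branches. The feature hashes, table sizes and threshold are illustrative assumptions, not the configuration evaluated in this chapter.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Illustrative hashed perceptron: a few small weight tables indexed by different
// hashes of the PC and global history; the signed sum decides the prediction.
class TinyPerceptron {
public:
    bool predict(uint64_t pc, uint64_t ghist) const { return sum(pc, ghist) >= 0; }

    void update(uint64_t pc, uint64_t ghist, bool taken) {
        const int s = sum(pc, ghist);
        const bool pred = s >= 0;
        if (pred != taken || std::abs(s) < kTheta) {   // train on miss or low margin
            for (std::size_t f = 0; f < kFeatures; ++f) {
                int8_t& w = tables_[f][index(f, pc, ghist)];
                if (taken && w < 63)   ++w;
                if (!taken && w > -64) --w;
            }
        }
    }

private:
    static constexpr std::size_t kFeatures = 8;    // cf. the calibration in Figure 5.9a
    static constexpr std::size_t kEntries  = 128;  // per-table entries (example value)
    static constexpr int kTheta = 20;              // training threshold (example value)

    int sum(uint64_t pc, uint64_t ghist) const {
        int s = 0;
        for (std::size_t f = 0; f < kFeatures; ++f)
            s += tables_[f][index(f, pc, ghist)];
        return s;
    }

    std::size_t index(std::size_t f, uint64_t pc, uint64_t ghist) const {
        // Each feature mixes the PC with a different slice of the history,
        // so one weight entry is shared by (and trains on) many branches.
        const uint64_t h = pc ^ (ghist >> (4 * f)) ^ (f * 0x9e3779b97f4a7c15ULL);
        return static_cast<std::size_t>(h % kEntries);
    }

    std::array<std::array<int8_t, kEntries>, kFeatures> tables_{};
};
```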

Optimizing the base predictor

A limit study on the size of each BRB entry is conducted for ParTAGE that calibrates the number of feature tables and the size of the perceptron predictor. Initially, when constraining the total size to 1.25kB, Figure 5.9a shows that eight tables deliver the most accuracy. As the total budget is shared between the tables, fidelity beyond that point decreases, reducing accuracy. Figure 5.9b shows the best configuration for each size, keeping 8 feature tables. Note that changing the feature sizes affects the prediction more than altering the total size does. The process is repeated for 3kB perceptron entries, which add a 10% area overhead. To adhere to the original budget, a design that removes the statistical corrector is assessed. To calibrate the perceptron to higher accuracy, a genetic algorithm can be used as shown in Jiménez (2014a).

[Figure: Transient MPKI with State Retention. MPKI versus branch instructions per flush, from transient to steady state, for TAGE (Flush), TAGE (B), ParTAGE, ParTAGE-S and ParTAGE-S 3kB.]

Figure 5.10: Decrease of MPKI when retaining 1.25kB and 3kB of state. Note that TAGE clears its entire state, while all other designs keep their base predictor state intact.

5.4.3 ParTAGE results

The ParTAGE predictor is based on observations that better accuracy can be achieved when replacing the 1.25kB bimodal with a small perceptron.

Comparing ParTAGE variants

The results in Figure 5.10 compare the designs featuring the BRB, namely all variants of ParTAGE and TAGE (B) (which retains the bimodal), to TAGE without the BRB. For ParTAGE, both variants, retaining 1.25kB and 3kB BRB entries, are evaluated. ParTAGE-S statically overrides the TAGE tables for a brief period after a flush is performed, forcing the perceptron prediction to be selected, while ParTAGE selects a prediction based on each component's confidence. The 1kB entry designs are overall 3% larger in area, while the 3kB variant removes the statistical corrector, maintaining the same area budget.
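The difference between the two variants comes down to the selection policy. The sketch below is a hypothetical rendering of that policy, not the evaluated implementation: the component interfaces, confidence scale, override window length and stub behaviour are all assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical component interfaces; the confidence scale, the override window
// and the stub behaviour below are illustrative assumptions.
struct ComponentPrediction {
    bool taken;
    int  confidence;   // arbitrary 0-7 scale
    bool valid;        // false while the providing tables are still cold
};

ComponentPrediction tage_predict(uint64_t pc)       { return {(pc & 1) != 0, 4, true}; }
ComponentPrediction perceptron_predict(uint64_t pc) { return {(pc & 2) != 0, 5, true}; }

// ParTAGE-S: statically force the retained perceptron for a window after a flush.
bool partage_s_predict(uint64_t pc, uint64_t branches_since_flush) {
    constexpr uint64_t kOverrideWindow = 20000;     // illustrative window length
    if (branches_since_flush < kOverrideWindow)
        return perceptron_predict(pc).taken;
    const ComponentPrediction t = tage_predict(pc);
    return t.valid ? t.taken : perceptron_predict(pc).taken;
}

// ParTAGE: pick whichever component reports the higher confidence.
bool partage_predict(uint64_t pc) {
    const ComponentPrediction t = tage_predict(pc);
    const ComponentPrediction p = perceptron_predict(pc);
    if (!t.valid) return p.taken;
    return (t.confidence >= p.confidence) ? t.taken : p.taken;
}
```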

When considering the upper limit of context-switching frequency in current systems, the most accuracy is delivered by including the BRB extension in the 3kB ParTAGE variant, which is iso-area compared to TAGE-SC-L. Figure 5.11 shows that the improvement is roughly 20% and 15% for 3kB ParTAGE and TAGE (B) respectively.

[Figure: Added MPKI overhead compared to steady-state TAGE (+0% to +200%) versus branch instructions per flush, for TAGE, TAGE (B), ParTAGE 1kB, ParTAGE 3kB and ParTAGE 3kB + SC.]

Figure 5.11: MPKI increase compared to steady-state when retaining 1.25kB. In coarse granularity the effect is negligible, but for frequent flushes the improvement is significant.

Overall, preserving a minimal state yields a significant improvement (15-20% less MPKI) when the state is frequently flushed, but has a small effect at steady-state (a maximum 5% MPKI increase). To maintain the same steady-state accuracy, the statistical corrector is included (ParTAGE 3kB + SC in Figure 5.11), increasing the overall area by 10% but delivering better accuracy throughout.

Comparing ParTAGE to TAGE and Perceptron

Comparing the different implementations of ParTAGE, ParTAGE-S works well for fine grain switches, but its transient accuracy suffers at larger flushing periods, as the TAGE tables have not been trained adequately. In contrast, ParTAGE, which simply feeds the confidence into TAGE, does not achieve the same transient accuracy. This happens when the TAGE tables are completely cold but their prediction is prioritised over the more accurate one from the perceptron component. This is also why ParTAGE does not improve the transient accuracy at 20k branch instructions. This can be solved with better tuning of the selection policy, if desired.

Improving the steady-state accuracy of the base predictor and retaining its state effectively enables efficient operation at higher switching frequencies. This can be done either by increasing the size of the branch retention buffer to fit more state, or by developing predictors in the 1kB to 3kB range with better steady-state accuracy. For instance, approaching the accuracy of an 8kB perceptron can further reduce the misprediction rate, as shown in Figure 5.7.

While the primary focus of this study is the improvement of the transient accuracy of branch prediction, it is equally important for the proposed design to maintain a competitive steady-state accuracy. A direct comparison is therefore performed between TAGE-SC-L and the Multiperspective Perceptron against the best version of ParTAGE. Figure 5.12 provides a detailed look at the steady-state MPKI compared to TAGE and the best version of the Multiperspective Perceptron (Jiménez, 2014a,b).

[Figure: Steady-State Performance. MPKI (3.60 to 4.30) versus branch instructions per flush (1M to 100M) for TAGE 64kB, Perceptron 64kB and ParTAGE 64kB.]

Figure 5.12: ParTAGE matches or outperforms current high accuracy predictors.

We observe that ParTAGE delivers 3.726 MPKI, competitive with TAGE (3.660 MPKI) and even outperforming perceptron (3.826 MPKI).

5.5 Discussion

Protection from side-channels that target the branch predictor can be disruptive to performance when dealt with in software and costly when dealt with in hardware. In realistic scenarios where such attacks can occur, such as frequent context switches and system calls, the studies in this chapter showed that current mitigation techniques for side-channel attacks targeting the speculation engine are both expensive and impractical. These disruptions create a disconnect between the reported nominal accuracy of branch predictors and the actual accuracy in real-world applications. To distinguish between the two, the notions of steady-state and transient branch predictor accuracy are introduced to better describe the behaviour of a system with frequent branch predictor flushes.

To ensure isolation while maintaining performance, area and energy constraints, the branch retention buffer is described. This novel mechanism keeps a minimal, isolated state per context to reduce the high number of mispredictions. To showcase the effect of the BRB, two designs are presented: first, an extension to TAGE, named TAGE (B), that keeps the state of its bimodal predictor; second, the hybrid branch predictor design, ParTAGE, that replaces the bimodal in TAGE with a perceptron. These variants are evaluated with a new methodology, which modifies the Championship Branch Prediction framework so that it clears predictor state at different frequencies and for different components.
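As a sketch of the mechanism (names, entry sizes, indexing by ASID and the flush callback are illustrative assumptions, not the evaluated design), the BRB can be thought of as a small array of per-context base-predictor images: on a context switch the outgoing base state is parked, the incoming context's image is restored, and every other predictor structure is flushed.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative Branch Retention Buffer: only a small base-predictor image is
// retained per context; everything else is flushed on a switch.
struct BaseState {
    std::vector<int8_t> weights = std::vector<int8_t>(1280, 0);  // ~1.25kB entry
};

class BranchRetentionBuffer {
public:
    template <typename FlushFn>
    void context_switch(uint32_t outgoing_asid, uint32_t incoming_asid,
                        BaseState& live_base, FlushFn flush_rest_of_predictor) {
        entries_[outgoing_asid] = live_base;              // park the outgoing base state
        const auto it = entries_.find(incoming_asid);
        live_base = (it != entries_.end()) ? it->second
                                           : BaseState{}; // cold start if unseen
        flush_rest_of_predictor();   // TAGE tables, statistical corrector, etc.
    }

private:
    std::unordered_map<uint32_t, BaseState> entries_;     // bounded in real hardware
};
```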

Results show that, under certain realistic conditions, branch predictors can have as much as 90% more mispredictions than what is evaluated today at steady-state. With the inclusion of the BRB the system manages to guarantee branch predictor state isolation, and consequently increase security when context switching, while achieving MPKI reductions of 15% and 20% for TAGE and the hybrid predictor design respectively.

The work can be further extended in various directions. Currently the threat model does not cover attacks that are orchestrated within the same thread. This could be incorporated by specifying sensitive regions in the thread to operate in an isolated environment. With the regions specified the same BRB mechanism can be used to make sure that unsafe code does not affect the sensitive region, even within the same thread.

Additionally, the security optimisations presented in this chapter can be combined with the branch predictor sharing scheme presented in Chapter 4. More precisely, contexts can migrate between cores, transferring the base predictor state and flushing the state of the other components in the incoming core. Alternatively, the rest of the predictor state can become inaccessible to preserve it for other applications potentially running on the same core.

Finally, the combination could be further improved with a shared SRAM memory to facilitate this, where the heterogeneous cores always point to different memory banks. With this method the transfer of state is instantaneous, while context isolation is still preserved. This prospective research is left for future studies.

Chapter 6

Conclusions

Incredible progress in processor design is evident over the past 50 years, and it has also triggered revolutionary social changes. Today's mobile devices far exceed our initial expectations, as they are able to perform powerful computations, exhibit high connectivity and feature a plethora of complex peripherals. Limited energy resources in such systems prohibit constantly executing with all resources active and require efficient management. On the other end of the spectrum, high performance computers tackle complex problems that help solve important questions in all scientific fields. However, diminishing returns in the latest silicon innovations no longer deliver enough performance to match the past rate of improvement and satisfy the expected future computational needs.

To overcome these problems, processors have gradually progressed to more efficient heterogeneous multicore designs. However, these heterogeneous multiprocessors are still at an early stage, and many questions regarding their design and usage remain unanswered. In principle, heterogeneity is used because multiprocessor systems with cores specialised for different workloads can deliver more efficient execution across a larger variety of applications. Current commercial heterogeneous designs usually consist of general purpose high performance Out-of-Order execution (OoO) cores and generic power efficient In-Order execution (InO) cores. In the future, though, many more types of cores are predicted to be included, such as Graphics Processing Units (GPUs), mixed/augmented reality and neural network accelerators.

However, to deliver continuous high performance within a restricted power envelope, it is not enough to just include cores specialised for desirable applications. It is also vital for such systems to be able to select and hand over execution to the correct core without loss of performance in the transition. To enable fast and low overhead switches between cores, some state needs to be shared or transferred as efficiently as possible. As shown throughout the thesis, this has proven to be challenging. Furthermore, the problem is expected to be exacerbated as more diverse cores, or even workload accelerators, with different memory interfaces and bandwidth/latency requirements are incorporated in future systems.


6.1 Summary of Conclusions

This work provides a generalised quantitative analysis for Heterogeneous Multi-Processors (HMPs) which ultimately leads to several improvements that facilitate a seamless transfer of execution between cores. These are achieved through several contributions.

First, an infrastructure is presented that enables in-depth research of the behaviour of heterogeneous systems. The methodology is based on the gem5 simulator, which permits cycle-accurate, in-depth analysis of core behaviour, modified so that parallel execution paths can be directly compared. The framework presented has clear strengths over pre-existing methodologies, which do not possess the ability to dissect the system into separate components and separately assess how they affect overall performance. The versatility of the methodology allows for studies that extend beyond simply simulating a heterogeneous system and focus on specific bottlenecks and weaknesses of specific designs in controlled scenarios. The experiments conducted with the methodology support this claim, as through the comparative analysis the performance impact of components such as the caches, the translation buffer, the branch predictor and the pipeline is quantified.

The second contribution derives directly from such an analysis. Experiments in Chapter 3 aim at finding memory sharing schemes that can enable low overhead, high frequency migrations between heterogeneous cores. The methodology is used to measure the impact of each memory component in a HMP comprised of an OoO and an InO core. The results reveal several novel insights that are critical for the design of future HMPs. The first observation is obtained from a system that shares all of the caches and the Translation Lookaside Buffers (TLBs) and uses an oracle scheduler to select the best core to execute on. In this scenario the system gains only marginal benefits in both performance and energy savings, even when switching as frequently as every one thousand instructions. This result is important as it opposes prior findings that claimed that sharing all of the memory components allows for migrations every one thousand instructions without experiencing any overheads.

The detailed per-component breakdown of the performance of the system reveals the reason for this disparity. The InO core, which cannot re-order instructions, relies heavily on the caches to deliver its maximum performance. The OoO core, however, can mask memory latency through re-ordering but requires accurate speculation to constantly keep its deeper pipeline full. Consequently, while sharing the caches eliminates most of the overheads for the InO core, the OoO core requires retention of the Branch Predictor (BP) state to operate through migrations without slowdown. As the branch predictor is not shared between the cores, frequent switching causes the OoO core to mispredict constantly, losing almost half of its nominal performance and making the system in its entirety impractical at fine grain migrations.

Instead, an improved sharing scheme is provided as an alternative, which delivers the same performance as full sharing at realistic switching frequencies but for a fraction of the complexity. The results presented bring two new interesting facts to the forefront. First, migrations as fine as 1k instructions are impractical because sharing latency sensitive components like the Level 1 caches outweighs the potential benefits. Second, most workload phases can be scheduled using coarser grain migrations. From these two new insights, a memory scheme that only shares the L2 cache and the translations can deliver equivalent performance to a system that shares the entire memory subsystem and improve the Energy Delay Product (EDP) over existing architectures.

Retaining the branch predictor state when switching cores practically eliminates the migration overheads (especially for the OoO core) and recovers all of the lost performance in HMPs. Different cores, however, have different design constraints which influence the microarchitectural design of their branch predictors. As such, the BPs of heterogeneous cores match neither in size nor in microarchitectural design. Additionally, predictions need to be available as early as possible in the pipeline and therefore prediction latency is critical. Similar to the L1 caches, this prohibits heterogeneous cores from sharing their branch predictors to retain state through migrations.

The third contribution presents a different approach. It identifies state that can be made common between a small, energy efficient InO predictor and a large, highly accurate predictor commonly found in an OoO core. To accomplish this, an investigation of the two most prominent BP designs used today, TAGE and Perceptron, is conducted. A per-component analysis reveals that the fallback bimodal predictor in TAGE delivers the most accuracy per kilobyte. Consequently, instead of sharing predictors, a more viable solution is to transfer the state of an InO core's bimodal predictor to the matching fallback predictor of an OoO core's TAGE design. This recovers most of the lost predictor accuracy and reduces the overheads incurred when migrating to the OoO core.

A further optimisation is based on the observation that Perceptron predictors are significantly more accurate when size constrained compared to bimodal predictors. Swapping out the InO bimodal and the OoO core’s fallback predictor for a small Perceptron improves the accuracy assumed immediately after a migration and improves the overall performance of the InO core. Retaining minimal BP state is shown through the simulation methodology to ultimately eliminate most of the slowdown caused by migrations, further increasing the potential benefits of a system that switches correctly between cores.

The final consideration stems from recent findings that shared state between contexts (and by extension between cores) can be leveraged to create security exploits that can leak privileged information or compromise correct operation. At the epicentre of such attacks is the branch predictor, which can be manipulated to force side-channel attacks that leak information through the caches. Isolation between contexts is considered to be a sufficient and necessary condition to guarantee secure operation. Flushing the state between context switches or migrations ensures isolation, but comes at the cost of a significant system slowdown and regression in performance. The experiments conducted as part of this thesis reveal that this approach can double the misprediction rate in current mobile devices, significantly impacting their performance.

To guarantee isolation without degrading performance, a mechanism is described that retains and isolates minimal state per context. The microarchitectural component that enables this is the Branch Retention Buffer (BRB), which stores minimal state per context. When a context switch or a migration is performed, the BRB points to the active thread and the rest of the predictor is flushed of any prior state. This eliminates the danger of any prior state being leaked. Furthermore, it improves the transient accuracy experienced when frequently switching, and as a consequence flushing, between contexts, and maintains the same steady-state accuracy as current competitive designs.

6.2 Research Questions Evaluation

The above contributions can also help assess how well the research questions posed in Chapter 1 have been addressed:

1. What are the best metrics to evaluate the quality of HMP designs? What is an unambiguous way to compare different architectures? This was addressed by breaking down the slowdown effects per component for both core architectures that are used in heterogeneous systems today. The simulation methodology that was designed enabled experiments that provided a quantitative way to assess slowdown broken down per component. This, in turn, showed that while retaining the memory state is necessary, internal components also have valuable state that needs to be preserved to maintain high performance. While the simulation methodology identifies how to theoretically improve scheduling and increase performance, this has not been validated in hardware. This would be a useful extension to further validate the results.

2. How does the memory hierarchy affect the trade-off between design complexity and improved operation for heterogeneous multiprocessors? The methodology developed showed that current proposals for future heterogeneous systems are too complex to build, and yield little to no benefits regardless of the complexity. Instead, using this analytic approach a feasible design was found that shares the translation buffer and the Level 2 cache. One valid criticism is that it is too complex to share the entire translation buffer. Instead, it would be more viable to have a hierarchical TLB and share the last tier. This has not been modelled in the simulations and it would be interesting to see how it performs.

3. How does the sharing of internal core components affect the behaviour of heterogeneous systems? The studies presented in Chapter 4 and Chapter 5 showed that during migrations the system slowdown attributed to internal core components can almost exclusively be attributed to the branch predictor. Sharing the branch predictor, however, is prohibitive as predictions have strict latency requirements. To thoroughly answer this question, a transfer mechanism is presented that can preserve minimal predictor state and avoid inducing high latencies. From a critical standpoint, the work only covers sharing schemes that use TAGE as the predictor of the OoO core. It would be useful to explore similar approaches for perceptron based predictors.

4. How do shared resources in HMPs affect the overall system security and what design changes are needed to ensure safe operation? From the literature review, it has been shown that shared resources are indeed vulnerable to security exploits, particularly side-channel attacks. To mitigate the effects, this work considered the concept of isolation of state in shared resources when switching between different contexts. While isolation can be an easy way to mitigate certain types of side-channels, the design modifications are not simple and can have considerable performance implications. This is especially true for components that store large amounts of data such as the caches and the branch target address buffer.

6.3 Future work

The work conducted as part of this thesis currently has certain limitations that need to be addressed in any future work in this direction. The first important limitation is that the simulation methodology presented in Chapter 3 only explores first order migrations, while ignoring any other more complicated scheduling scenarios. The second limitation is that the security mitigation scheme presented in Chapter 5 only protects the branch predictor tables, while leaving all other microarchitectural components vulnerable. The highlighted areas for improvement can lead to an extension of this research in many ways. Below is a brief description of promising related future work.

6.3.1 Research into HMP Scheduling

The simulation methodology compared first-order migrations to measure the best case performance and energy efficiency when migrating between cores. A viable way to extend this would be to perform an analysis over the entire state space. As explained in Chapter 3, the state space can quickly become intractably large, as it increases exponentially in size with each added potential scheduling step. Despite this, preliminary analysis documented in Appendix A shows that some simplifications can be made to the model to merge many states, effectively linearising the problem.

While this still generates a prohibitively large state-space to explore with classical methods, machine learning and artificial intelligence techniques can be used to develop more sophisticated schedulers for heterogeneous systems.

6.3.2 Investigating Memory Sharing for HMPs with Emerging Accelerators

The analysis in Section 3.3 identifies the core components that need to be shared between an In-Order and an Out-of-Order execution core. It is expected, however, that future systems will have many more heterogeneous cores or accelerators specialised for very specific workloads. For instance, neural network and augmented reality accelerators are expected to be prominent in the near future. These accelerators have very different memory access patterns and latency/bandwidth restrictions when compared to conventional general purpose cores. Extending the investigation into systems that incorporate emerging accelerators is a promising future route that would deepen our understanding of bottlenecks in heterogeneous systems in general, but also for specific architectures that are expected to be widely used in the future. The usefulness of these new processing elements is crucial for tackling problems that deal with large amounts of data, and therefore any improvement in their performance or energy efficiency can have a large impact on the spread of their use.

6.3.3 Extending Branch Predictor State Transfer to Other Designs

The work presented in Chapter 4 focuses primarily on transferring state from a TAGE predictor. The modular design of TAGE makes the useful state easy to identify and separate. Furthermore, the implementation of a component that directly matches the fallback predictor in TAGE is relatively straightforward.

Nonetheless, many commercial processors today use perceptron branch predictors, and it would therefore be useful to extend the work to these designs as well. The state stored in the perceptron tables is never used as a standalone prediction, as mentioned in Section 4.1.3. Instead, weights are produced by all the tables and are summed up to deliver a valid prediction. As such, identifying a single table whose state can be transferred to deliver valid predictions seems unlikely. However, it would be worth investigating whether it is feasible to identify a subset of entries from all tables or, alternatively, the tables that contribute the most to accurate predictions, and only transfer those to the core assuming execution after a migration.

6.3.4 Securing Other Shared Resources

A final recommendation for future work would be to extend the concept of isolating minimal state to other resources that are shared and vulnerable to security exploits. Caches are known to leak information through timing side-channel attacks. However, research in this area is still ongoing, as no viable solution with minimal performance impact has been proposed.

It would be worth trying to extend the notion of identifying the most useful state to retain per Address Space Identifier (ASID) to guarantee isolation. More promisingly, the work could be extended to secure partial state from the branch target address cache (BTAC), which has also been shown to be susceptible to Spectre-like attacks. However, as the BTAC stores addresses and not predictions, it would be promising to identify the most recent or most frequent address predictions to store, in order to maintain high accuracy. Additionally, compression techniques could be explored to further increase the amount of state retained in isolation.

Appendix A

Supplementary Material

A.1 Migration Overhead Study

Aside from the data extracted from mibench for the migration overhead studies, a smaller experiment was conducted using the Octane benchmark suite (Google Developers, 2013) running on Android KitKat.

[Figure: Octane OoO performance degradation with shared and private caches. Relative performance per benchmark for migration intervals of 1k, 10k, 100k, 1M and 10M instructions.]

Figure A.1: OoO performance degradation for shared and private caches (No TLB sharing).


The reasoning behind repeating the experiments for Octane is that mibench features microbenchmarks that fit into the caches. Using more realistic workloads with larger footprints can lead to different results, as the increase in necessary memory accesses can lead to different insights. Some of the workloads are omitted as their execution takes too long to realistically simulate. The ones selected for our study are:

• richards: Richards is an OS kernel simulation benchmark, which focuses on property load/store, function/method calls, code optimization and elimination of redundant code.

• deltablue: DeltaBlue is a one-way constraint solver, originally written in Smalltalk, used to test performance with polymorphism and on object oriented programs.

• crypto: Crypto is an encryption and decryption benchmark based on code by Tom Wu. The benchmark has lots of bit operations.

• raytrace: An object-oriented raytracer. The workload does a lot of arithmetic. It benefits from fast object allocation, initialization, and efficient garbage collection.

• earleyboyer: This is a mostly-mechanical Scheme to JS translation of two programs: earley, which does parsing, and boyer, which is a constraint solver. Being based on Scheme, this program creates a lot of small objects, so fast object allocation and generational garbage collection are very important. The extensive use of closures also generates a lot of call operations, so those need to be fast. Finally, the workloads feature lots of tail recursion and general tail calls.

• regexp: A combination of regular expression operations from 50 different sites, running in succession.

• splay: Splay is a data manipulation benchmark that deals with splay trees and exercises the automatic memory management subsystem. The benchmark also stresses the garbage collection subsystem of a virtual machine. SplayLatency instruments the existing Splay code with frequent measurement checkpoints. A long pause between checkpoints is an indication of high latency in the garbage collector. This test measures the frequency of latency pauses, classifies them into buckets and penalizes frequent long pauses with a low score.

• navierstokes: A 2D Navier-Stokes equations solver, which heavily manipulates double precision arrays. As a result the benchmark is useful for testing floating point math performance.

• mandreel: Runs the 3D Bullet Physics Engine ported from C++ to JavaScript via Mandreel. Additionally, it measures the benchmark with frequent time checkpoints. Since Mandreel stresses the virtual machine's compiler, this test provides an indication of the latency introduced by the compiler.


• gameboy: Emulates the portable console's architecture and runs a demanding 3D simulation, all in JavaScript.

• codeload: Measures how quickly a JavaScript engine can start executing code after loading a large JavaScript program, a social widget being a common example.

[Figure: Octane InO performance degradation with shared and private caches. Relative performance per benchmark for migration intervals of 1k, 10k, 100k, 1M and 10M instructions.]

Figure A.2: InO performance degradation for shared and private caches (No TLB sharing).

A.2 State Space Reduction to Linear Proof

The concept behind state space reduction for HMPs is based on cases where states can be merged in order to avoid extra computations. We observe that when a migration to the opposite core occurs, the caches and the cores are flushed and start “cold”. As the states symbolise a small amount of execution measured in instructions, states at the same level starting on the same “empty” core can be considered equivalent. As shown in Figure A.3, the unique states, regardless of the depth of the tree, will always remain a constant number (4 in the case of big and little cores).

[Figure: a binary tree of scheduling decisions alternating between the big (b) and little (L) core, with the unique leaf states marked.]

Figure A.3: A representation of the states that can be merged. The color coordination of the leaf states denotes which states can be merged, according to their most recent migration.

The rest of the cases can be truncated depending on how much history they have in common. We observe that, as half of the cases switch to the opposite core, all these states can be truncated to just two cases: one that represents a system running on big and switching to little, and one performing a migration in the opposite direction. In Figure A.3 the red colour surrounding the states denotes 8 merge-able states, four that migrate to the big and four that migrate to the little core. After the merge, two states remain, one on big and one on little, effectively removing six states from the initial tree.

The same logic for truncations can be extended for migrations that performed the same switch at the same state in the past, albeit with diminishing returns in terms of how many states are merged. These are the states with the dark blue border. Following the same logic, these 4 states can be reduced to two.

As a general rule, we observe that half the states get merged into two “super” states. Of the remaining states, half can be further merged into two new super states, with this pattern continuing until the remaining states are the four aforementioned unique states. Summing up all the truncations and subtracting them from the total number of states, we can construct a rule to extrapolate the complexity:

\[ a_0 = 2, \qquad a_1 = 4 \]

\[ a_n = \underbrace{2^n}_{\text{all states}} - \underbrace{\sum_{k=1}^{n-1}\left(2^k - 2\right)}_{\text{state truncations}} \]

\[ a_{n+1} = 2^{n+1} - \sum_{k=1}^{n}\left(2^k - 2\right) \]

\begin{align*}
a_{n+1} - a_n &= 2^{n+1} - 2^n - \sum_{k=1}^{n}\left(2^k - 2\right) + \sum_{k=1}^{n-1}\left(2^k - 2\right) \\
              &= 2^{n+1} - 2^n - \left(2^n - 2\right) \\
              &= 2^{n+1} - 2 \cdot 2^n + 2 = 2
\end{align*}

\[ \Rightarrow \quad a_{n+1} = a_n + 2 \]

Alternately:

\[ a_n = 2^n - \sum_{k=1}^{n-1}\left(2^k - 2\right) \;\Rightarrow\; a_n = 2n, \qquad \text{because} \quad 2^n - \sum_{k=1}^{n-1} 2^k = 2. \]
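The merging argument can also be checked by brute force, under the simplifying assumption stated above that a state is fully characterised by the core it currently runs on and the number of consecutive scheduling quanta spent on that core since the last migration; the sketch below enumerates every big/little decision sequence and confirms that the number of distinct merged states grows linearly with depth rather than as 2^n.

```cpp
#include <cstdio>
#include <set>
#include <utility>

// Enumerate every big/little scheduling sequence of length n and map it onto the
// merged state (current core, consecutive quanta on that core since the last switch).
int merged_states(int n) {
    std::set<std::pair<int, int>> states;
    for (unsigned seq = 0; seq < (1u << n); ++seq) {
        const int last = (seq >> (n - 1)) & 1;   // most recent scheduling decision
        int run = 1;
        for (int step = n - 2;
             step >= 0 && static_cast<int>((seq >> step) & 1) == last; --step)
            ++run;                               // count the trailing run on that core
        states.insert({last, run});
    }
    return static_cast<int>(states.size());
}

int main() {
    for (int n = 1; n <= 12; ++n)
        std::printf("depth %2d: %3d merged states out of %u raw states\n",
                    n, merged_states(n), 1u << n);
}
```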

Appendix B

Performance and Energy Efficiency

As an auxiliary supplement to fully understand heterogeneous processor performance, a clear definition of the terms energy efficiency and performance is provided, as these two terms are ultimately the underlying metrics for all research in the field. These two terms are usually not orthogonal to each other, and it is therefore also necessary to define a coherent metric for the trade-off between them. This appendix aims at providing a brief background of concepts directly tied to the evaluation of processor performance and efficiency.

B.1 Performance

The term performance vaguely describes the amount of work completed relative to a specific resource. More specifically, processor performance is defined as the amount of Instructions per Second (IPS) or, more realistically, Million Instructions Per Second (MIPS). In essence, a system that has a higher IPS count will complete a specific task faster compared to a system with a lower IPS count, and hence has lower execution time or delay (D).

\[ \text{performance} = \frac{\text{work completed}}{\text{execution time}} \]

Following the above logic, for a fixed amount of instructions, execution time is the inverse of performance and vice versa, or expressed in a mathematical form:

\[ \mathcal{D}_B = n \cdot \mathcal{D}_A \]

where n is the relative improvement of A over B. As performance is the inverse of execution time:

\[ \text{performance} = \frac{1}{\text{execution time}} \]


\[ n = \frac{\text{performance}_A}{\text{performance}_B} \]

Sometimes, workloads may run indefinitely or might be dependent on external events. In those cases, the workload (or instructions) executed is at best not constant and at worst undefined. In those cases performance can be derived by assuming a common execution time and comparing the average amount of work completed during the measurement. This is commonly referred to as throughput.

In general, computers process information at a constant rate using what is commonly called a clock. The smallest time event that registers the amount of work done is called a clock cycle. This way, a 1Hz processor has 1 clock cycle per second and a 2GHz one has 2,000,000,000 cycles per second; however, in both cases 20 cycles will result in the same amount of “progress” in operation. Execution time can therefore be calculated as follows:

\[ \mathcal{D} = \frac{\text{clock cycles}}{\text{clock frequency}} \]

Throughput can therefore also be described in terms of clock cycles with:

\[ \text{CPI} = \frac{\text{clock cycles}}{\text{instructions}} \]

Where CPI is the number of cycles per instruction. Sometimes the inverse, IPC (instructions per cycle), is also used as a metric. Knowing the instructions of a workload, the CPI and the operating frequency, it is easy to deduce the execution time:

\[ \mathcal{D} = \text{instructions} \times \text{CPI} \times \frac{\text{seconds}}{\text{clock cycle}} \]
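As a small worked example of this relation (the instruction count, CPI and frequency below are made-up values, not measurements from this thesis):

```cpp
#include <cstdio>

int main() {
    // D = instructions x CPI x clock period; all values below are illustrative.
    const double instructions = 2.0e9;   // dynamic instruction count
    const double cpi          = 1.25;    // average cycles per instruction
    const double freq_hz      = 2.0e9;   // 2 GHz operating frequency

    const double cycles  = instructions * cpi;
    const double seconds = cycles / freq_hz;   // equivalently cycles x (1 / f)
    std::printf("cycles: %.0f, execution time: %.2f s, IPC: %.2f\n",
                cycles, seconds, 1.0 / cpi);
}
```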

B.1.1 Parallel Processor Performance

As discussed in Chapter 1, performance on a single core reaches a limit, dictated by a frequency cap that prohibits execution at higher speeds due to thermal constraints. However, it is often the case that a system's performance can be enhanced through the exploitation of inherent parallelism of operations. Taking advantage of parallelism is very important to modern systems, as it allows performance to scale without the need to analogously increase frequency.

Individual processors have developed various design improvements that allow parallel execution at different abstraction levels. One of these is to redesign specific operative components so that they can be used simultaneously. Examples of this are carry look-ahead adders, fused multiply accumulate etc. all of which use parallelism to improve their response time. This sort of parallelism is called bit level parallelism.

Another form of parallelism is DLP, which is when the same operation can be done to multiple data at the same time. The idea of exploiting DLP was first introduced in the 1970s, with the Cray-1 supercomputer that used vector registers.

Vector processors contain instructions that operate on one-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items. Today, the popularity of these systems has greatly dwindled, as their price to performance ratio is significantly worse than that of their more versatile scalar counterparts. The idea resurfaced in GPUs, where data level parallelism is dominant. This happens because in 3D rendering each pixel undergoes the exact same operations, albeit with different coefficients (Buck and Hanrahan, 2003). In those cases multiple cores are designed to execute in a SIMD fashion to maximise throughput, much like the vector processors.

Similarly, one could argue that set-associative caches are a form of DLP. The argument in this case is that multiple banks can be accessed in parallel to find blocks faster. Hence, that the same operations are done on hardware in parallel on different data, much like an SIMD design.

[Figure: (a) No ILP (CPI ≥ 5). (b) 5-stage pipeline (IPC ≤ 1). (c) 5-stage superscalar pipeline (IPC ≤ 2).]

Figure B.1: Examples of instruction level parallelism exploitation (Hennessy and Patterson, 2006).

Another possible optimisation is to take advantage of instruction level parallelism (ILP) through pipelining. Pipelining is a technique used to overlap instruction execution. When doing so execution time is effectively reduced while a little overhead latency is introduced. The additional overhead is negligible compared to the performance improvement, especially at high frequencies. Another key improvement in pipelining comes from the fact that often neighbouring instructions are completely independent from each other. In those cases, and assuming that the necessary resources are available, instruction execution can be done partially or even completely parallel. To accomplish this superscalar processors are used, that can execute more than one instruction at the same time in different execution units.

A form of parallelism similar to ILP is memory level parallelism. The underlying concept is that memory has not managed to keep up with the advances in processor speeds and, as a consequence, is the main bottleneck of modern systems. To clarify, when fetching data from memory (or cache), a miss causes a stall in the order of hundreds of cycles. To overcome this problem, Out-of-Order execution is commonly used, which allows instructions to be rearranged to mask memory latency.

To accomplish this, the reordering must not affect the outcome of the original sequence. As a logical consequence, cores that cannot reorder instructions are referred to as In-Order cores.

Lastly, in a multi-threaded environment, and assuming that threads can run independently, it is possible to run in parallel when a system has more than one core or processor. This category of parallelism is called thread level parallelism (TLP). Ideally, TLP can significantly decrease the absolute execution time in a system. However, a more realistic realisation is that certain parts are independent while others require sequential processing or are constrained by specific bottlenecks, such as resources shared between threads (peripherals, memory, etc.).

In general, all the aforementioned methods achieve parallelism to a certain degree. Nonetheless, there are always parts of execution that have to follow a specific order to ensure correct operation. For this reason, a realistic consideration would assume that a system can be split into parts that are sequential and parts that can execute in parallel. The performance increase in these cases can be predicted using Amdahl's law. Amdahl's law calculates the improvement in performance when enhancing part of a system. This improvement is commonly referred to as speedup. A more analytic form of the above can be phrased mathematically as:

\[ \mathcal{S}_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s_{\#}}} \]

Where:

• $\mathcal{S}_{\text{overall}}$ is the theoretical speedup of the overall execution.

• $s_{\#}$ is the speedup of the part that can be executed in parallel.

• p is the percentage of the system that is parallelisable in the original system (before any improvement).

An interesting observation is that Amdahl's law describes the concept of diminishing returns. For example, if the parallelisable fraction p of the execution is 0.4, then for incremental speedups of the parallel part the overall speedup becomes increasingly insensitive, asymptotically peaking at 1.66. A graphical representation can be seen in Figure B.2.

In the general case, the overall speedup is restricted by the serial parts that cannot be optimised. Intuitively therefore, if the amount of work that can be done in parallel is small, then the benefit of scaling with cores will be insignificant. It is important to accurately calculate the performance when using a parallel execution design, as neglecting to factor in the slowdown caused by the necessarily serial execution distorts the performance difference between a system exploiting some form of parallelism and one which does not.
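The diminishing-returns behaviour can be reproduced directly from the formula; the short sketch below evaluates the overall speedup for a parallelisable fraction of 0.4 as the speedup of the parallel part grows (the chosen speedup values are illustrative).

```cpp
#include <cstdio>
#include <initializer_list>

// Amdahl's law: S_overall = 1 / ((1 - p) + p / s)
double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }

int main() {
    const double p = 0.4;   // parallelisable fraction of the workload
    for (double s : {2.0, 5.0, 10.0, 50.0, 100.0})
        std::printf("parallel speedup %6.1f -> overall speedup %.3f\n", s, amdahl(p, s));
    // The overall speedup asymptotically approaches 1 / (1 - p) = 1.667.
}
```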

[Figure: overall speedup versus speedup of parallelisable code (up to 100x); the curve flattens towards its asymptote as the parallel part is accelerated.]

Figure B.2: Amdahl's Law of diminishing returns when 40% of execution is parallel.

B.1.2 Benchmarking

Practically, establishing concrete and concise definitions of performance simplifies the problem of comparing systems significantly. However, performance depends not only on design but also on the type of workload. Without a meaningful workload that projects realistic behaviour, extracting system metrics is meaningless. For this reason, specific standardised workloads exist that help further construe system performance. These are specific programs called benchmarks. The purpose of benchmarks is twofold: to evaluate a system's performance and to facilitate identifying specific behaviour.

Evaluation is done by reporting some metric after completion, which usually is benchmark specific. For example Dhrystone, a CPU-intensive synthetic workload, outputs Dhrystone MIPS or DMIPS which is effectively the average time the processor takes to perform a single loop of a fixed sequence of the benchmark instructions. The metric DMIPS is arguably more meaningful than just evaluating MIPS between two designs because instruction count comparisons between different ISAs can confound simple comparisons. For instance, the same high-level task may require many more instructions on a Reduced Instruction Set Computer (RISC) machine, but might execute faster than a single Complex Instruction Set Computer (CISC) instruction. Similarly, other workloads establish their own meaningful criteria that helps comparisons between different architectures.

Apart from evaluation, benchmarks act as specific workloads that capture very specific system behaviour. Regular programs are not a single pure type of workload but are instead comprised of various different types of phases. As such, it is very difficult to identify what causes a specific program to be slow, and why a specific architecture performs worse than others. These sorts of questions become significantly easier to answer with benchmarks, as they usually tend to focus on one specific behaviour (e.g. memory performance in the case of CacheBench (Mucci et al., 1998)).

Various types of benchmarks exist, that can correctly assess performance for different system types. For example, some focus on generating traffic between nodes in networks or cores within many-core systems, while others might try to stress graphical computations. Depending on what behaviour is needed to be tracked, different types of benchmarks are used. For that reason benchmarks are often categorised as follows:

• Application benchmarks are actual programs that perform common functions relative to the type of application they aim at evaluating. These are usually system-level benchmarks and are very useful at comparing overall performance, but struggle at identifying the specific bottlenecks in systems.

• Component benchmarks are relatively small pieces of code that aim to test specific components. Instead of testing the performance of the system running real applications, component-level benchmarks focus on the performance of subsystems within a system. These subsystems may include the operating system, arithmetic integer unit, arithmetic floating-point unit, memory system, disk subsystem, etc. Even though in depth comparison can be done with component benchmarks, they cannot capture inter-component performance.

• Kernel benchmarks stem from the observation that in most cases a small part of the code takes up the majority of CPU time and resources. These benchmarks therefore are created from the extracted code excerpt that usually slows down systems. A common downside is that kernel benchmarks are usually small, fit in cache and are prone to compiler over-optimisation.

• Synthetic benchmarks are ones that are carefully crafted out of basic functions to be most indicative of system performance. These functions are usually artificially created to match an average execution profile. Synthetic benchmarks ultimately suffer from problems similar to those of kernel benchmarks.

As it is clear, no single benchmark provides an absolute measure for system or even component performance. It is therefore necessary to be aware of all the strengths and weaknesses of the benchmarks used and combine several of them to guarantee an adequate coverage of the system behaviour.

B.2 Power and Energy Efficiency

In every system, ranging from large-scale data centres to mobile devices, energy usage is an important concern. In HPCs, power consumption directly affects the cost of operation, both directly and through the cooling of thermal losses, which requires extra equipment to be dealt with appropriately.

Additionally, the thermal losses also affect the reliability of such systems and their future scalability. In mobile devices, limitations in battery capacity combined with the ever-increasing demand for computationally heavy applications sacrifice usability for prolonged operation. Hence, there is a clear need to track energy use and make operation more energy efficient. However, in order to do so the notions of energy and power efficiency must be clearly defined.

A lot of confusion arises from the fact that the terms energy and power are often used interchangeably despite them having very different meanings. Power (P) is defined as the amount of energy (E) used over time. To clarify, energy efficient systems minimise consumption, which can be important, for example, in mobile devices, as it leads to maximised battery life. Power efficiency, on the other hand, minimises peak power, which is useful when trying to mitigate thermal constraints. An interesting point to make is that a system might be power efficient but not energy efficient and vice versa, depending on the duration of operation. This could be the case when, for instance, a system runs for a very brief period and uses a burst of energy, in which case total energy will be low but power (which is the rate of energy over time) will be high. On the opposite end, a program that runs forever on bounded power will consume an infinite amount of energy.

Of the two concepts, energy is increasingly becoming more important, as operational costs of servers and battery life of mobile systems primarily depend on energy efficiency. It is therefore crucial to find ways to drive energy costs down both by hardware design and software scheduling optimisation. To remove any future confusion for the remainder of the report, efficiency will always refer to energy unless it is specifically stated otherwise.

B.2.1 Processor Power Dissipation

A significant amount of energy is expended by the central processing unit (CPU) and therefore it is crucial to fully understand its power consumption. Several factors contribute to the overall CPU power consumption, which can be classified into dynamic and static power losses:

\[ \mathcal{P}_{\text{CPU}} = \underbrace{\mathcal{P}_{\text{capacitance}} + \mathcal{P}_{\text{short}}}_{\text{dynamic losses}} + \underbrace{\mathcal{P}_{\text{leak}}}_{\text{static losses}} \]

Dynamic losses occur during the switching of transistors from one state to another. These losses can be further categorised as switching capacitance losses and short-circuit losses. The former originates from the charging and discharging of the load capacitance in logic gates. When a gate is charged, current flows from the source and charges the load capacitance ($C_L$). During discharge the flow is reversed, from the capacitance to ground. For a full charge-discharge cycle, the amount of charge that flows from the source $V_{DD}$ to ground and the energy consumed are:

\[ \mathcal{Q} = C_L V_{DD}, \qquad \mathcal{E} = \frac{1}{2} \mathcal{Q} V_{DD} \]

Multiplying with the frequency gives us the power:

\[ \mathcal{P} = \mathcal{E} f = \frac{1}{2} \mathcal{Q} V_{DD} f = \frac{1}{2} C_L V_{DD}^2 f \]

Considering that a processor is comprised of many logic gates that do not all switch at the same time, the dynamic power is proportional to the overall capacitance C, the operating voltage V and the frequency f:

\[ \mathcal{P}_{\text{capacitance}} \propto C V^2 f \]
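To illustrate how strongly the quadratic voltage term dominates this relation, the short sketch below compares the relative dynamic power of two hypothetical operating points; the voltage/frequency pairs are invented for illustration, not measurements.

```cpp
#include <cstdio>

// Relative dynamic power, P ∝ C * V^2 * f; C cancels when comparing two points.
double rel_dynamic_power(double volts, double freq_hz) { return volts * volts * freq_hz; }

int main() {
    // Invented operating points for a hypothetical mobile core.
    const double p_hi = rel_dynamic_power(1.00, 2.0e9);   // 1.00 V at 2.0 GHz
    const double p_lo = rel_dynamic_power(0.80, 1.2e9);   // 0.80 V at 1.2 GHz
    std::printf("low point draws %.0f%% of the high point's dynamic power\n",
                100.0 * p_lo / p_hi);   // about 38% for a 40% frequency reduction
}
```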

Short-circuit losses occur during transistor transition when state changes from “on” to “off” and vice versa. As the rise and fall time is nonzero, during that transition multiple transistors will be “on” creating a short-circuit current flowing directly from source to ground. Naturally the briefer the rise/fall times, the smaller the power losses due to short currents. For this reason manufacturers strive to lower the threshold voltage, which leads to another undesirable static phenomenon which is sub-threshold leakage currents.

[Figure: normalized total chip power dissipation and physical gate length (nm) from 1990 to 2020, showing the dynamic power and sub-threshold leakage trends.]

Figure B.3: Total chip dynamic and static power dissipation progression over the years (Kim et al., 2003)

In semiconductors, leakage is the phenomenon where electrons or holes tunnel through an insulating region. Hence, when a transistor is off, a small amount of power is consumed due to leakage currents. The phenomenon increases exponentially as the thickness of the insulating region decreases. Furthermore, this phenomenon is static, meaning that unlike dynamic effects it has a constant effect on operation and power dissipation. For this reason leakage is currently one of the main factors limiting further integration and a primary factor in the slowing down of Moore's law.

B.2.2 Methods of Power Loss Mitigation

As improvements in materials and the fabrication process progress, processors have constantly been reaping the benefits, improving power efficiency. However, on a macro scale, overall power dissipation in processors remains a primary constraint. To that extent, certain system-level management policies are used to reduce thermal losses.

B.2.2.1 DVFS

One concept with widespread popularity and usage today is altering the operating voltage and frequency of the processor. The idea behind this is that undervolting and underclocking cause a notable reduction in the dynamic losses of the transistors, albeit with a respective performance hit. In mobile devices such as smartphones, however, where energy resources are limited, a small performance hit is acceptable if a worthwhile amount of energy is saved. Underclocking can also be used when system performance is in general low, due to memory stalls, frequent mispredictions and other bottlenecks, or when devices are “idling”. In those cases the processor performance is already low and masks the degradation caused by voltage and frequency scaling.

This reconfiguration of the system's operating specifications can be done online by the operating system kernel with dynamic voltage and frequency scaling (DVFS). Various policies exist that change the DVFS level of operation using an equally wide variety of criteria, ranging from high level application metadata to hardware performance monitoring counters (PMCs). The policies used to steer the system are called governors.

B.2.2.2 Core Migrations

Another approach, used in systems with multicore designs, is to manage what each core is executing. Depending on the system architecture and the desired performance goal, workloads can all be migrated to one core to save energy, by switching the surplus cores off, or balanced across all available cores to boost performance. Similar to DVFS, migrations are also dictated by a governing policy that monitors operation and makes adjustments.
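A comparable sketch for such a migration policy follows: it decides whether to consolidate work onto a single core or spread it across all cores. The load metric and thresholds are again hypothetical.

```python
# A minimal sketch of a migration policy of the kind described above:
# consolidate work when load is light, spread it when a core is saturated.
def plan_migrations(per_core_load, low=0.2, high=0.8):
    """Return 'consolidate', 'spread' or 'keep' for a list of core loads."""
    total = sum(per_core_load)
    if total < low * len(per_core_load):
        return "consolidate"   # pack work onto one core, power the rest down
    if any(load > high for load in per_core_load):
        return "spread"        # balance across all cores for performance
    return "keep"

print(plan_migrations([0.05, 0.10, 0.0, 0.0]))   # consolidate
print(plan_migrations([0.95, 0.40, 0.30, 0.2]))  # spread
```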

B.3 Performance - Energy Trade-offs

Performance and energy efficiency are commonly opposing forces in design. This means that optimising for system efficiency commonly leads to performance deterioration, and maximising system throughput leads to more consumption. For this reason, when evaluating a system's behaviour it is important to observe the relation between energy and delay. As these two notions express different qualities of behaviour, which are correlated but do not directly affect each other, improving a design can be achieved by finding Pareto optimal points and defining the trade-off relation between them.

(Axes: Constraint 1 vs. Constraint 2; the plot distinguishes non-optimal points from Pareto optimal points.)

Figure B.4: The formation of a Pareto frontier from Pareto optimal points.

B.3.1 Pareto Optimality

Pareto optimality, or Pareto efficiency, is a method of optimising resources used in engineering and in economic theory. Given a set of alternative combinations of resources for a set of constraints, a movement from one combination to another that makes at least one constraint better without making any other worse is called a Pareto improvement. A combination is Pareto optimal when no further improvements are possible. This is formally referred to as a Strong Pareto Optimum (SPO), and the different combination states are referred to as allocations.

Sometimes a less stringent requirement is needed to consider a state optimal. For those cases a Weak Pareto Optimum (WPO) is defined, where a new allocation is considered an improvement only if all constraints are better off. Clearly, SPO points are by definition also WPO, while the inverse statement is not true. Another interesting observation is that in most multi-objective optimisation problems the end result will not be a single Pareto optimal point, but instead a set of points forming the Pareto frontier. Mathematically this can be described with the equations below (Schumpeter, 1949; Mathur, 1991).

Considering: X \subseteq \mathbb{R}^n, \quad Y = \{\, y \in \mathbb{R}^m : y = f(x),\ x \in X \,\}

and

f : \mathbb{R}^n \rightarrow \mathbb{R}^m

The Pareto frontier is therefore defined as:

P(Y) = \{\, y_A \in Y : \{\, y_B \in Y : y_B \succeq y_A,\ y_B \neq y_A \,\} = \emptyset \,\} \quad \text{(for SPO)}

and

P(Y) = \{\, y_A \in Y : \{\, y_B \in Y : y_B \succ y_A \,\} = \emptyset \,\} \quad \text{(for WPO)}

Where:

X is a compact set of feasible allocations.

Y is the possible set of constraint values.

f is the set of functions relating the output constraints to a set of input allocations.

y_A, y_B are points in Y.

y_B \succeq y_A signifies that y_B is at least as preferred as y_A in every constraint, and y_B \succ y_A that y_B is strictly preferred over y_A.
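To make the set definition above concrete, the following is a minimal Python sketch that computes the (strong) Pareto frontier of a set of (energy, delay) points, assuming lower values are preferred in both dimensions; the sample points are made up.

```python
# A point is kept on the frontier if no other point is at least as good in
# both dimensions while differing from it, matching the SPO definition above.
def pareto_frontier(points):
    """Return the Pareto-optimal subset of (energy, delay) pairs (lower is better)."""
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b
    return [p for p in points if not any(dominates(q, p) for q in points)]

points = [(2.0, 5.0), (3.0, 3.0), (5.0, 2.0), (4.0, 4.0), (6.0, 6.0)]
print(pareto_frontier(points))  # [(2.0, 5.0), (3.0, 3.0), (5.0, 2.0)]
```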

B.3.2 The E−D Pareto Frontier

The E−D Pareto frontier is incredibly useful in ideal systems, which solely depend on their constraints. In that manner, it is often used in processor design to portray a specific architecture's capabilities, as it focuses on the important trade-offs, abstracting the full range of possible parameters. As such, the E−D pairs that comprise the Pareto frontier of a core are usually the most useful metric to assess core and processor behaviour.

To understand this, it is useful to imagine execution as a succession of states, each with their own E−D point. When, for instance, a DVFS governor executes a specific workload, this leads to an E−D footprint. Comparing that to all other policies captures all other E−D pair points. From the whole set, the policies that lead to the E−D pairs for which no others improve delay without sacrificing energy, and vice versa, are on the Pareto frontier.

Appendix C

Selected Publications

The first article was presented at the CODES+ISSS (ESWEEK) 2017 conference in Seoul, Korea, was nominated for the Best Paper Award, and was published in the ACM Transactions on Embedded Computing Systems (TECS) journal.

The second publication was presented and published at the High Performance Computer Architecture (HPCA) 2019 conference in Washington DC, United States of America.

The third publication is currently under review in the Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) journal.


Nucleus: Finding the sharing limit of heterogeneous cores

ILIAS VOUGIOUKAS, ARM Research and University of Southampton ANDREAS SANDBERG and STEPHAN DIESTELHORST, ARM Research BASHIR M. AL-HASHIMI and GEOFF V. MERRETT, University of Southampton

Heterogeneous multi-processors are designed to bridge the gap between performance and energy efficiency in modern embedded systems. This is achieved by pairing Out-of-Order (OoO) cores, yielding performance through aggressive speculation and latency masking, with In-Order (InO) cores, that preserve energy through simpler design. By leveraging migrations between them, workloads can therefore select the best setting for any given energy/delay envelope. However, migrations introduce execution overheads that can hurt performance if they happen too frequently. Finding the optimal migration frequency is critical to maximize energy savings while maintaining acceptable performance. We develop a simulation methodology that can 1) isolate the hardware effects of migrations from the software, 2) directly compare the performance of different core types, 3) quantify the performance degradation and 4) calculate the cost of migrations for each case. To showcase our methodology we run mibench, a microbenchmark suite, and show that migrations can happen as fast as every 100k instructions with little performance loss. We also show that, contrary to numerous recent studies, hypothetical designs do not need to share all of their internal components to be able to migrate at that frequency. Instead, we propose a feasible system that shares level 2 caches and a translation lookaside buffer that matches their performance and efficiency. Our results show that there are phases, comprising up to 10% of execution, where a migration to the OoO core leads to performance benefits without any additional energy cost over running on the InO core, and up to 6% of phases where a migration to the InO core can save energy without affecting performance. When considering a policy that focuses on improving the energy-delay product, results show that on average 66% of the phases can be migrated to deliver equal or better system operation without having to aggressively share the entire memory system or to revert to migration periods finer than 100k instructions.

CCS Concepts: •Computer systems organization →Heterogeneous (hybrid) systems; Embedded systems; •Computing methodologies →Simulation evaluation;

Additional Key Words and Phrases: HMP, Out-of-Order, gem5, migration, heterogeneous, simulation methodology, embedded systems

ACM Reference format: Ilias Vougioukas, Andreas Sandberg, Stephan Diestelhorst, Bashir M. Al-Hashimi, and Geoff V. Merrett. 2017. Nucleus: Finding the sharing limit of heterogeneous cores. ACM Trans. Embedd. Comput. Syst. 1, 1, Article 1 (October 2017), 16 pages. DOI: 0000001.0000001

This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/K034448/1. Authors' addresses: Ilias Vougioukas, Andreas Sandberg and Stephan Diestelhorst, ARM Ltd., CPC1, Capital Park, Fulbourn Road, Cambridge, CB21 5XE, UK; Bashir M. Al-Hashimi and Geoff V. Merrett, Electronic and Software Systems Group, School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 1539-9087/2017/10-ART1 $15.00 DOI: 0000001.0000001


(Top panel: sampling every 1 million instructions; bottom panel: sampling every 100 thousand instructions; y-axis: instructions per cycle; traces: big and LITTLE cores.)

Fig. 1. An excerpt of execution of dijkstra running on two different core types. Sampling at a higher resolution reveals phases that can potentially be exploited to increase efficiency.

1 INTRODUCTION

In today's embedded devices, performance is always tied to power constraints and energy efficiency. This happens because, while process technology scaling is enabling larger transistor densities, power per area has been shown to increase beyond a certain threshold in transistor size. This latter effect is commonly referred to as the breakdown of Dennard's scaling, which forces designs to operate only a fraction of the whole system, leaving the remaining design in a dark silicon state [8, 10]. To address this, heterogeneous multiprocessors (HMPs) have emerged over other designs, especially in mobile devices [17, 33], where keeping within a strict energy envelope while still being able to deliver performance on demand is crucial. To be able to deliver high performance through Memory Level Parallelism (MLP) and Instruction Level Parallelism (ILP), an Out-of-Order (OoO) core is commonly used that has large caches, does aggressive speculation (branch predictors, prefetchers) and masks memory latency at the cost of significantly increased design complexity, area and power requirements. On the other hand, In-Order (InO) cores aim at conserving energy through a simpler and smaller design, at the expense of performance and lower operating frequency. A characteristic implementation of an HMP architecture is, for example, ARM's big.LITTLE [1], which uses two types of cores that share the same instruction set. Workloads can be broken down into phases, some of which show high ILP potential and can therefore yield high performance, while others heavily access memory and therefore deliver lower instructions per cycle (IPC). Choosing the most suitable core leads to an optimal performance and energy profile, while making a wrong decision can lead to unnecessary slowdown or energy waste. Recent studies claim that micro-phases in the order of thousands of instructions exist [12, 30], as seen in Figure 1, and can be exploited to reduce energy consumption and boost performance through fine-grained migrations. In theory, perfect core matching should lead to optimal behaviour; however, every time a migration is performed the working state needs to be transferred to the operating core, adding both a software and hardware execution overhead.


From the software side, the operating system (OS) handles the scheduling to the appropriate core and sends the required signals to perform the migration. The complexity of the scheduler and the software context switch significantly affects the amount of overhead added to a system, which is on the order of ten million instructions [7, 19, 32, 35]. From a hardware perspective, the overheads depend on how much of the state needs to be transferred and how much is shared between the two migrating cores. The more state shared and preserved, the less warm-up time required, which is the period necessary to achieve maximum performance. Unlike software, hardware overheads can be significantly smaller, as low as thousands of instructions, depending on the amount of resources shared. To reduce overall system slowdown when migrating, several studies propose the implementation of dedicated hardware to efficiently handle the transfer without any software intervention [12, 13, 18, 21, 26] and also fusing cores to share the front-end [20, 22, 24]. However, as hybrid designs are too complex to implement, it is worth investigating a simpler approach, determining which components contribute most to the benefits and sharing only those. The questions we aim to answer therefore are:

• Which core is more suitable for the current workload phase?
• Which components should be shared to improve migrations?
• How often can we switch before overheads suppress the benefits?

For the scope of this study we choose to focus only on the hardware migration overheads for various cache and TLB sharing configurations, as only through hardware acceleration does it make sense to explore fine-grained migrations. Our motivation is based on understanding migrations from an architectural standpoint and finding out how to improve future designs. Our contributions are:

• An architecture for HMPs which shares only the TLB and the last level cache. This effectively achieves the same performance as systems with merged cores [13, 18, 20–22, 24, 26], but is significantly simpler to design.
• Nucleus: a novel simulation methodology that allows the direct comparison of the behaviour of the exact same task on different cores, which is impossible on hardware platforms.
• Quantifying the migration overheads for the OoO and InO core types under various degrees of sharing, and finding that the migration period limit is roughly 100k instructions. Results show that full sharing does not realistically improve performance or reduce the overheads.

The rest of this paper is organised as follows: Section 2 describes the background theory behind resource sharing and migrations. In Section 3 we present the details of our simulation methodology. In Section 4 we perform a limit study stressing the capabilities of our methodology, which quantifies switching overheads and cases where migrations are beneficial. Section 5 puts our work in perspective with respect to related work in the field. Section 6 contains concluding remarks, describing the current limitations and our plans to extend the methodology, and concludes the paper.

2 RESOURCE SHARING

2.1 Heterogeneous multiprocessor organization

The level of sharing in a system significantly impacts the ability to perform a context switch between cores without significant performance degradation. Even though sharing can notably reduce costs when migrating, it adds complexity, area and latency that can hurt the performance of the cores. Based on this, conventional design mandates that heterogeneous cores usually share at best the Last Level Cache (LLC) or the main memory. To achieve high frequency switching, recent academic trends propose extreme attempts to bring cores closer together, usually by sharing the instruction Level 1 (IL1), data Level 1 (DL1) and Level 2 (L2) caches and in some cases even some internal components [13, 18, 20–22, 24, 26].



(a) No sharing. (b) Share caches. (c) Share caches and TLB. (d) Proposal: Share L2 and TLB.

Fig. 2. Different types of sharing schemes examined for HMP systems.

The feasibility of such designs however still remains questionable, as some parts cannot be shared between drastically different designs, while others (IL1, DL1 caches) cannot be shared without lowering performance. A more realistic approach, instead of opting for full or no sharing, is to target the features cores can easily share to increase performance when migrating. With this in mind, we evaluate four different sharing schemes.

No sharing. The first case worth analysing is when cores have private caches and share no components except the main memory (Figure 2a). This sharing scheme is a common setup in chip multiprocessors and HMPs, with numerous industrial and academic examples [1, 6, 16]. Here every migration causes the system to transfer its register file from the retiring to the incoming core before resuming normal operation. In addition, the system also has to warm up the empty or “cold” caches of the new core before it can achieve its maximum performance. The overheads for the no sharing case (T_private) can be broken down as:

T_{private} = \underbrace{T_{pip} + T_{BP} + T_{reg}}_{\text{tied to core (non-shareable)}} + T_{TLB} + T_{\$1} + T_{\$2} \qquad (1)

where T_pip and T_BP are the pipeline and branch predictor warm-up phases respectively, T_reg is the overhead to transfer the register file, T_TLB is the extra cost of warming up the TLB, and T_$1 and T_$2 are the penalties induced when starting with cold L1 and L2 caches respectively.

Share caches. The next level of sharing we consider is one where the two cores partly or fully share the caches (Figure 2b). We consider this setup as an in-between state that can provide insight into system performance with and without sharing the TLB. This can potentially reduce the warm-up time after a switch. However, when workloads have a very small memory footprint, or thrash through data in the cache, the benefits of sharing are limited. Another potential problem with this sharing scheme is that, even if caches contain useful data post-migration, empty TLBs can penalize the system significantly. This happens when, despite the fact that the desired data is still maintained in the cache, the virtual-to-physical translation is not in a TLB. TLB misses are slow, as they need to access main memory, to the point that a miss is much more expensive than an IL1 or DL1 miss, as the system ends up performing a page walk, requiring several memory accesses.

Share caches and TLB. The full sharing scheme (Figure 2c) assumes that both caches and TLBs are shared amongst heterogeneous cores, but still needs to warm up the pipeline, branch predictors

and the register file. Sharing the register file is too complex design-wise, while its relatively small size permits ignoring the penalty, especially when compared to warming the branch predictors. The pipeline and branch predictors cannot be shared between different core types as they are functionally very different components. For example, the state of the speculative, reordering, 4-wide and deeper pipeline in the OoO cannot be transferred to a 2-wide InO pipeline with half the depth in a conventional way. This design archetype has been a popular academic trend recently, with some studies claiming that even the branch predictor and pipeline can be re-purposed and shared to an extent [12, 13, 20–22, 24, 26].

Our Proposal: Share L2 and TLB. The issue with full system sharing is that while the total slowdown of a system is reduced after a migration, sharing the level 1 caches and the TLBs between the cores adds extra latency that can dramatically reduce the overall performance. This is because components like the IL1 cache are very sensitive to latency (∼4 cycles per hit) and even a single extra cycle of latency can lead to a 25% performance decrease, as it is constantly being accessed. Sharing the L1 has this effect as, in order to connect two cores to the same cache, extra capacitance from wires is introduced and a switching signal is added (e.g. a multiplexer), which adds latency. For this reason we propose an alternate architecture (Figure 2d) that shares the L2 cache and the TLB.
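To illustrate how the warm-up cost of Equation (1) changes under the four schemes of Figure 2, the following is a rough sketch (not taken from the paper): the per-component warm-up costs are hypothetical placeholders, and shared components are simply assumed to contribute no warm-up penalty after a migration.

```python
# Hypothetical per-component warm-up costs, in instructions.
COSTS = {"pipeline": 500, "branch_pred": 5000, "regfile": 100,
         "tlb": 2000, "l1": 3000, "l2": 20000}

SHARED = {
    "no sharing":       set(),
    "share caches":     {"l1", "l2"},
    "share caches+TLB": {"l1", "l2", "tlb"},
    "share L2+TLB":     {"l2", "tlb"},
}

for scheme, shared in SHARED.items():
    cost = sum(c for name, c in COSTS.items() if name not in shared)
    print(f"{scheme:17s} ~{cost} instructions of warm-up")
```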

2.2 Internal core components

While the sharing scheme plays a large part in determining system performance and energy efficiency, the different functionality of cores in heterogeneous systems also affects system behaviour. This is because, despite the inherent penalty to switching, migrating to a more suitable core can amortise the cost and boost performance compared to remaining on the less compatible core. For this reason, it is important for heterogeneous systems to be comprised of cores that cover a wide range of applications across a variety of energy/performance points. For the OoO core, the deeper and wider the pipeline, the more cycles needed to warm it up. Furthermore, in order to approach steady state performance, the OoO core also needs to fill the re-order buffer and the branch predictor. Both these structures are vital for Out-of-Order execution and, perhaps more importantly, can significantly hinder performance when they are not warmed up. The simpler branch predictor and the lack of a reorder buffer of the InO core, while limiting the ability to parallelise instructions and mask memory requests, make it much faster to assume steady performance. Additionally, the simplicity in functionality allows the core's design to be much smaller and consume less energy. Overall comparisons between the two cores show that the little core is 3x more power efficient when clocked at the same frequency [21]. In our limit study, presented in Section 3, we will adhere to this power ratio to compare the big and little cores.

3 METHODOLOGY

In this section we describe a gem5 based [4] simulation methodology used to explore the cost of migrations. To allow for direct comparisons, we gather samples where we compare the IPC of two simulations, one executing on a system that runs uninterrupted and one on a system that commences after the migration is triggered, for the exact same phases. Based on our methodology we devise two techniques, one to calculate the overheads for each core type for the sharing schemes in Section 2.1 and one to find potential phases where it makes sense to migrate to a different core. To be able to measure the impact of switching cost in a system, it is necessary to directly compare execution with and without performing migrations. To achieve this, we propose Nucleus, a methodology where we fork the simulator creating two identical simulations, the main or parent simulation and the forked or child simulation. After the simulator splits, the main simulation



continues unperturbed, while the fork performs a migration. Switching the core in the forked simulation can be achieved by signalling the operating system (OS) in the child to issue a migration. The OS can either trigger the migration itself, or the forked simulation can cause a hardware interrupt to signal that a migration must be performed. Both of these methods, however, would add the OS migration overhead which, as described in the previous section, is undesirable as it will obfuscate the hardware effects. Instead, we propose to simulate a single core system, where the forked simulations “hot-swap” the core, leaving the caches and memory unaffected. To share the TLB, which for our simulator is tied to the core design, we just rewire the TLB to be connected to both cores, effectively detaching it from a specific core. When a hot-swap is performed, the system drains in-flight messages, flushes the operating core and afterwards swaps it with the inactive one. One limitation of this approach is that the core operation is mutually exclusive. However, this is deliberate for our limit study as we are interested in single thread behaviour, and it resembles the operation of the first commercial HMP systems [6] and merged cores that share the front end [20, 22, 24]. The infrastructure can easily be extended to accommodate more than one pair of coupled cores; however, the decision tree of migration combinations is significantly more complex and is beyond the scope of this paper. After the fork swaps the core designs, depending on the level of sharing explored, a writeback and invalidate is triggered for the components considered as private. The newly active core resumes operation with those structures empty and proceeds to warm them up before assuming steady state performance, which emulates the effects of a hardware migration. On the software level, the system seamlessly continues to operate completely agnostic of the swap. This can be accomplished because the gem5 simulator can switch the micro-architecture of components in the system, such as the core design, without altering the system architecture state, such as the register values. This is exploited to single out the specific hardware component overheads during migrations. A graphical representation of the methodology infrastructure can be seen in Figure 3, where the CPU slot is the placeholder for the desired CPU model, while the rest of the components remain unchanged.

Fig. 3. The implementation with which the simulator “hot-swaps” the cores to emulate a system with varying shared resources.



Fig. 4. A visual representation of Nucleus. The main simulation spawns periodic forks that perform a migration.

3.1 Nucleus

One important detail to note is that, irrespective of the sharing scheme, the register file is always considered to be shared, as the simulator does not account for the cost of transferring it. Several hybrid core proposals claim that the state transfer is possible with negligible overhead [13, 21, 26]. While we do not consider this realistic for actual contemporary hardware, it helps to draw two useful conclusions from our study. First, it helps us assess a best-case scenario for each sharing scheme, and compare our results to many studies supporting fine grain migrations and merged designs [12, 13, 20–22, 24, 26]. Secondly, it allows a cleaner observation by singling out the effects of specific components, and provides insight that could otherwise be masked. Our methodology allows for certain experiments to be conducted which were not previously feasible. For instance, it is possible to compare the exact same phases in systems with different characteristics. Figure 1 shows how an excerpt of dijkstra is perfectly aligned. To contrast, as forked execution is not possible in hardware, two separate executions need to be triggered in parallel. However, as the executions are not fully deterministic, deviations between them will cause them to diverge and therefore the phases will end up misaligned. This is especially severe in the case of fine grain investigation, as even a small skew can lead to wrong results. By forking from the same starting point Nucleus solves this problem, as every sample is aligned, even at the finest granularity.

3.2 Comparing parallel simulations

To explore migrations, a main simulation is created that runs on the same core from beginning to end without performing any migrations. After a warm-up interval the main simulation starts spawning forks periodically until the workload terminates. The main simulation and forks are symbolised with M and m respectively in Figure 4. The period for each sample is measured in instructions instead of time. Using time to determine the sample period can cause the parallel simulations to diverge when there is a difference in performance, as the simulation with the higher IPC will execute more instructions in the allotted time. Sampling with instructions solves this as, irrespective of the performance or time they take, both runs will end up executing the same code.
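As an illustration of the forking scheme described above, the following is a minimal sketch (not the authors' actual script) built on gem5's Python API (m5.fork, m5.switchCpus, m5.simulate). System construction and the scheduling of instruction-count exit events are elided, and the function and variable names are assumptions.

```python
import os
import m5

def run_with_periodic_forks(system, main_cpu, alt_cpu, num_periods):
    """Fork the simulation each period; the child hot-swaps the core model."""
    for _ in range(num_periods):
        pid = m5.fork()                            # clone the simulation state
        if pid == 0:                               # child: perform the migration
            # alt_cpu is assumed to have been created switched out.
            m5.switchCpus(system, [(main_cpu, alt_cpu)])
            m5.simulate()                          # run one period on the other core
            os._exit(0)                            # first-order migration: terminate
        m5.simulate()                              # parent: continue uninterrupted
```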


The forked simulations examine first-order migrations, which effectively run only for a single period and then terminate. This way, for every period running on the main simulation, there is an equivalent simulation that examines how a system would be affected by a migration at that specific phase. The termination of the forked simulations after one period restricts our methodology to only one migration; however, expanding to more migrations would cause the state space to become too large to compute. We devise two separate experiments with our methodology that answer two separate questions:

• Migration Cost: Taking into consideration the different sharing schemes, what are the warm-up costs for each core?
• Efficient Migrations: What is the best setup (sharing scheme, switching frequency) for HMPs?

Answering the former question, we perform our methodology migrating always to the same core type, from OoO to OoO and likewise from InO to InO, for all the sharing schemes described in Section 2.1. The migration frequency limitations for each core can be calculated by normalizing the performance of the migrations with the corresponding main simulations, averaged over all samples:

\text{Relative Performance} = \frac{\sum \frac{IPC_{mig.}}{IPC_{main}}}{N} \qquad (2)

where N is the number of samples. To determine what the optimal setup is for HMPs we devise another experiment that switches the core type, from OoO to InO and vice versa, in the migratory simulations. The IPC comparison for this experiment reveals the performance disparity between the two cores in each phase. We explore two migration policies that determine beneficial switches, the no trade-off policy and the energy delay product (EDP) policy. The first is when migrating to the opposite core reduces power or execution delay without sacrificing performance or energy respectively, which can be useful in high response situations or when energy is a constrained resource. For the second policy we examine switches that lead to equal or better EDP. We quantify this by calculating the percentage of the workload for which a migration is beneficial:

\text{Beneficial Migrations} = \frac{\sum \left[\frac{IPC_{mig.}}{IPC_{main}} \geq \theta\right]}{N} \qquad (3)

Beneficial Migrations refers to the number of cases where switching from one core type to the opposite leads to an improvement. For instance, if the indicator is 0.6, this means that a system with an InO core as the main core could benefit from running on the OoO core for 60% of the workload. The [ ] are the Iverson brackets [14]: [P] is defined to be 1 if P is true, and 0 if it is false. The formula only counts the migrations that are better than the main run, using θ, which is the threshold we set to determine a beneficial migration. Depending on the policy desired, the threshold θ is adjusted, for instance to find cases that lead to better EDP or to find the migrations that save energy without performance loss.
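A small sketch of how Equations (2) and (3) can be evaluated from per-sample IPC traces follows; the sample values are made up for illustration.

```python
def relative_performance(ipc_mig, ipc_main):
    """Equation (2): mean per-sample IPC ratio of migrating vs. main run."""
    return sum(m / b for m, b in zip(ipc_mig, ipc_main)) / len(ipc_main)

def beneficial_migrations(ipc_mig, ipc_main, theta):
    """Equation (3): fraction of samples whose IPC ratio meets threshold theta."""
    return sum(1 for m, b in zip(ipc_mig, ipc_main) if m / b >= theta) / len(ipc_main)

ipc_main = [1.0, 1.2, 0.8, 1.5]   # hypothetical per-sample IPC, main run
ipc_mig  = [0.9, 1.3, 0.4, 1.5]   # hypothetical per-sample IPC, migrated run
print(relative_performance(ipc_mig, ipc_main))       # ~0.87
print(beneficial_migrations(ipc_mig, ipc_main, 1.0))  # 0.5
```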

4 RESULTS

In this section we present our setup for both experiments, described in Section 3, and their results. We compare the sharing schemes from Section 2.1 to determine whether sharing the caches and TLB is worth the added complexity. We also present our experiments on efficient migrations and provide some insights into the results.


Table 1. The specifications of the system used in our experiments.

Core parameters          OoO Core    In-Order Core
  Pipeline Width         3-wide      2-wide
  Pipeline Depth         15          8
  Branch Predictor       Two-Level   Tournament
  Operating Frequency    1GHz        1GHz

Shared Resources         Value
  L1D Size               32KB
  L1I Size               32KB
  L2 Size                1MB
  L2 Associativity       16-way
  L2 Prefetcher          Stride

4.1 Experimental Setup

To evaluate Nucleus, we use the gem5 simulator [4] in full system mode and modify the forking mechanism [28] to be able to fork on instruction counts and switch the cores. The experimental setup simulates a heterogeneous system using big OoO and little InO publicly available ARMv8 models [27, 31], both operating at the same frequency and using the classic memory subsystem. The simulated system uses the same cache sizes for both the OoO and the InO cores, so that a fair comparison between private and shared caches is drawn. The simulator is restored from a previous checkpoint that bypasses the boot-up, and each benchmark is run until no warm-up transient effects occur. A detailed description of the system configuration is shown in Table 1. The sharing architecture simplifications we have made and the set of benchmarks have been selected to represent the best possible case for fine-grain migrations. While most sharing schemes are not feasible with today's implementation techniques, the hypothetical set-ups aim to answer whether it makes sense to pursue fine-grain migrations in HMP systems, regardless of current physical limitations.

4.2 Migration Cost

For the first set of experiments, the simulated system runs the mibench benchmark suite [15] using a minimal build of Linux Ubuntu 14.04.5 (kernel version 3.14.0) as the operating system. This features microbenchmarks that have very different characteristics and demonstrate high variation in branch, memory, and integer ALU operations. At the same time, the used datasets can fit in caches most of the time, which should work further in favour of the sharing architectures. The relatively small size of the workloads also helps with the computationally expensive process of forking the simulator.

4.2.1 No sharing. In Figures 5a and 5b the performance of the migrating runs is presented, for the InO and OoO cores respectively, normalised to the main simulation's performance for every sample. A relative performance of 100% means that switches at that frequency would not cause a performance hit, whereas a relative performance of 50% means that on average migrating with that frequency halves the IPC. This effectively allows for a fair comparison between a system performing a migration and one that does not. The performance drop at fine granularity suggests that with private caches fast switching is not feasible, as the overheads dominate. This represents fairly accurately conventional architectures we use today, and shows why migrations fall into the category of millions of instructions, even without considering the considerable software overhead.

4.2.2 Share caches. When sharing all the caches but not the TLB, the performance of the migratory samples visibly increases. One thing to note in Figures 5c and 5d is that the performance improvement of the InO core is slightly larger than that of the OoO one at a 1k instruction migration period. The deeper pipeline and larger mis-speculation penalties seem to be causing a greater negative effect on the OoO core. Overall, the performance does not justify this complex design, especially for fine-grain switching.



(a) InO performance with no sharing. (b) OoO performance with no sharing.


(c) InO performance sharing the caches. (d) OoO performance sharing the caches.


(e) InO performance sharing the L2 and the TLB. (f) OoO performance sharing the L2 and the TLB.


(g) InO performance sharing caches and the TLB. (h) OoO performance sharing caches and the TLB.

Fig. 5. Relative performance degradation in mibench for different sharing schemes.

This is due to the fact that even though the caches are shared, TLB misses cause the system to potentially access main memory, which adds huge penalties.

4.2.3 Share caches and TLB. When the system shares all memory components (caches and TLB) another interesting result is revealed. While the InO core eliminates almost all of the overhead even at the finest migration frequency, the OoO design still exhibits significant performance losses. This exposes the inability of the OoO core to achieve its maximum potential, due to the deeper pipelines



and the re-order buffer that need to be filled, as well as the fact that OoO execution heavily relies on speculation and branch prediction to achieve high IPC.

Fig. 6. The average performance degradation for the two core types as a function of migration frequency ((a) InO core, (b) OoO core).

4.2.4 Our Proposal: Share the L2 and TLB. When sharing the L2 and the TLB but keeping the L1 caches private to each core, the results of the InO core show a notable improvement over the previous sharing scheme. Our proposal shows a marginal decrease in performance for the OoO core when performing fine grain migrations, but a significant improvement for the InO core. More importantly, sharing an extra TLB between the cores shows that even though the L1 caches are not shared, the overall system improves in performance over current architectures and performs on par with solutions that try to share all the resources. The improvement, as stated above, is due to the TLB sharing that helps to avoid accessing main memory to fetch data, taking care of the worst case when no pages are stored in the L2 cache. In the event where a TLB miss has occurred in the recent past, the page already exists in the L2, in which case the page walk for this system will be similar to the system that only shares caches. Figure 6 summarises the averages presented in Figure 5, which stresses the point that our proposal can achieve similar performance without resorting to sharing all resources. We see that the performance gap between our proposed method, depicted as the red dotted line with triangles, and the full sharing scheme is 17% for the OoO and 30% for the InO core. However, the design complexity of routing both cores to all the caches and the TLB is prohibitive. More importantly though, we observe that the migration frequency of the OoO core does not cause significant slowdown above 100k instructions. At this granularity, our proposal performs equally well without the added complexity. For HMP systems pairing OoO and InO, it makes more sense to have a simpler design sharing just the L2 and a TLB. Sharing the L1 caches adds more complexity, but does not allow the system to migrate any faster than 100k instructions, as the OoO overheads diminish any benefits beyond that point.

4.3 Efficient Migrations

As introduced in Section 3.2, the critical question for HMP systems is whether migrating from one core to another yields any benefits. Based on Equation 3, we find how much of each workload can be migrated to the opposite core, using the two policies mentioned in Section 3.2 to adjust the threshold value θ. We base our analysis on the fact that the InO core is 3x more energy efficient, an estimate in line with current HMP systems [21].


(a) Beneficial migrations from the InO to the OoO core with the no trade-off policy. (b) Beneficial migrations from the OoO to the InO core with the no trade-off policy. (c) Beneficial migrations from the InO to the OoO core with the equal or better EDP policy. (d) Beneficial migrations from the OoO to the InO core with the equal or better EDP policy.

Fig. 7. Percentage of beneficial migrations when being energy and EDP constrained.

We present in Figure 7 the two extreme cases of sharing for the no trade-off and EDP policies mentioned in Section 3. The percentage in these plots signifies how much of the workload leads to a more desirable state. The solid coloured bars depict the benefits averaged across all workloads, while the shaded ones show the workload with the maximum benefit. When using the no trade-off policy, a system on the InO core will migrate only when executing the same phase on the OoO core costs equal or less energy. For this to happen the big core must perform 3x better than the little one, so the threshold is adjusted to θ = 3. In Figure 7a we see that, irrespective of how much hardware the cores share, the cases where this is beneficial are less than 1% on average and at most 10% of the total execution, for specific workloads. When considering the switch from OoO to InO with the no trade-off policy, the samples are considered beneficial when the InO core performs equally as well as or better than the OoO core (θ = 1). This results in the same overall performance for only a third of the power. This case can happen when the OoO core cannot exploit any ILP or MLP and is reduced to InO performance. Figure 7b shows that, similarly to the previous result, the beneficial cases are at most 1% on average and 6% of the total execution for a few benchmarks. While these cases are few, the no trade-off policy reveals that the benefits can mostly be exploited at migration periods of 10k and 100k instructions. This reveals that micro-phases most probably exist. However, these remain latent at any granularity coarser than 100k instructions, as the phases average out, while migrating every 1k instructions yields no benefits as the overheads mask micro-phases. For equal or better EDP our metric values performance more, as EDP = Power × Delay². The amount of beneficial switches when migrating from the InO to the OoO core for this policy is calculated using a threshold value of θ = √3. The results show that the potential for switching and maintaining equal or better EDP in this case ranges between 48% and 66% when migrating

every 100k, 1M and 10M instructions. When performing the opposite switch with the EDP policy (θ = 1/√3), the results show that when aiming for the best EDP the OoO core is more suitable, as the beneficial cases on average for all the workloads do not exceed 31% of the execution. For both EDP experiments, migration periods larger than 100k instructions are less desirable, as the gains are significantly smaller. Additionally, we observe that in most cases sharing the caches and the TLB is not necessary, as both sharing schemes show similar potential for useful migrations. Furthermore, in terms of the maximum of the beneficial migrations, we notice that for periods larger than 1M instructions both the InO and OoO core show that some workloads are better off running on the opposite core.
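As a worked example of the thresholds used in this section, the short sketch below derives them from the assumed 3x power ratio and EDP = Power × Delay²; it is an illustration, not part of the original evaluation.

```python
import math

POWER_RATIO = 3.0  # P_OoO / P_InO, the power ratio assumed in the text

# No trade-off policy: equal or less energy (InO -> OoO) or no slowdown (OoO -> InO).
theta_no_tradeoff_ino_to_ooo = POWER_RATIO            # 3.0
theta_no_tradeoff_ooo_to_ino = 1.0

# EDP policy: with EDP = P * D^2, equal or better EDP needs a sqrt(3) speedup
# when moving to the 3x more power-hungry core, and vice versa.
theta_edp_ino_to_ooo = math.sqrt(POWER_RATIO)          # ~1.73
theta_edp_ooo_to_ino = 1.0 / math.sqrt(POWER_RATIO)    # ~0.58

print(theta_edp_ino_to_ooo, theta_edp_ooo_to_ino)
```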

5 RELATED WORK

While the problem of maximizing efficiency in HMPs through core migrations has been researched in the past, the effect of migrations on the system is, to the best of our knowledge, still misunderstood, especially in the realm of fine-grained migrations. Some studies have explored the notion of migrating execution amongst heterogeneous cores that takes into account the effect of the memory system [9][16]. Improvement is limited to relatively coarse-grained switching frequencies and modifications are required for the system to achieve them. Other efforts aim at redesigning accelerator cores that can adapt better to the applications [23]. These accelerator type cores are determined through a performance study conducted at coarse instruction granularity. Furthermore, the redesigned cores also operate on coarse application phases rather than periodically deciding when to switch. A similar approach [18] has been explored where the core can morph to exploit MLP and ILP in the workloads. Furthermore, this HMP architecture has the potential for fine-grained migrations, although this has not been fully explored, to the best of our knowledge. Other works [30] have recently examined this approach. Prior work on the existence of phases at fine granularity and their potential micromanagement has shown that it is possible to exploit their benefits through a heterogeneous system with different back-ends sharing the same front end [12, 20–22]. Despite the claim, no experimental data is provided that reveals the actual existence of application phases that appear at a fine granularity, nor are the circumstances under which they can be exploited explained. In [13, 26], the authors have implemented an HMP system sharing the memory system and the register files amongst the heterogeneous cores, claiming under 100 cycle migration latency in actual hardware. They propose three dimensional stacked cores to deal with the wiring and routing problem. One interesting claim made is that the register file can be transferred with a very small overhead (one cycle), which points towards plausible future designs with high sharing. However, the system they propose has certain assumptions that we consider problematic. Firstly, the cores used for sharing are two OoO designs of different width, but using up roughly the same area. In their proposed system the similarity of the two cores makes migrations pointless both in terms of energy efficiency and performance. Additionally, no direct information is given about the operating frequency of the system; however, the reported cycle time is 5.5ns, which implies a frequency of 181MHz, which would undeniably drastically degrade performance. We believe this is indicative of the performance degradation incurred when sharing sensitive components like the register file, the L1 caches and the TLB, as such approaches either add extra processing cycles or clock the system at extremely low frequencies to deal with the extra wiring and routing overheads.


In terms of HMP scheduling investigation, several studies examine whether it is possible to augment the architecture and make it more aware of the heterogeneity to exploit the benefits of fast and low overhead migrations [24, 29]. In the past, studies similar to those of fine-grained thread migrations focused primarily on leveraging homogeneous cores that operate at different energy and performance points by changing the voltage and frequency domains [5]. Significant evidence shows that fine-grained Dynamic Voltage and Frequency Scaling (DVFS) has received mixed results in terms of energy savings, with some claiming notable gains in memory intensive phases [20, 25], while others refute this, noting that, at fine-grain DVFS, the switching latency reduces performance and increases energy consumption [11, 34].

6 CONCLUSIONS

In this work we have proposed Nucleus, a full system simulation methodology using gem5, to study the effects of migrations in heterogeneous systems. Our methodology accomplishes this by forking the simulation, comparing a system that migrates with one that does not. We show that Nucleus can be used in various ways, exploring different aspects of migrations. For instance, it can be used to measure the performance degradation across different migration frequencies. Alternately, we propose using it to calculate how much of each workload can be migrated to a different core to gain in energy or performance. We use our methodology to focus on identifying how much sharing of internal components (e.g. caches, TLB, register file) is most desirable for migrations as fine as 1k instructions. We show that our methodology is well suited to measure the hardware overheads, as it isolates them from the software overheads. Our results show that the two cores have asymmetric warm-up times and behaviour, where the OoO core needs larger instruction windows to amortize the cost of migration than the InO core. Overall this prohibits the beneficial use of migration frequencies finer than 100k instructions. Furthermore, we propose an architecture sharing only the L2 and the TLB to enable more efficient migrations. Compared to other sharing schemes, we find that our design performs equally as well as designs that share all the caches and the TLBs, but can be physically much simpler to implement. To demonstrate the capabilities of Nucleus, we perform a case study running mibench, a suite of diverse embedded benchmarks, on a heterogeneous system. The results show that, even when the overheads are not prohibitive, migrating to save energy without sacrificing performance or vice versa is limited to at most 10% of the execution. Relaxing the policy to find equal or better EDP points of operation shows that migrations can be beneficial on average for 66% of the workload execution. Currently our methodology is limited to exploring only first-order migrations. This prohibits us from performing studies using asymmetrical switching, where the little cores can migrate at higher frequencies than the big cores, as our study suggests. Our aim is to address this in future research, where some of the state can be retained on the big core while it is switched off. Additionally, we plan to expand to higher-order migrations, which will enable our methodology to explore more of the state space for patterns in execution that, when identified, can lead to better scheduling of heterogeneous systems. Finally, as the results of our proposed architecture show promise, we believe that it is worth investigating HMPs with hierarchical TLBs. Such systems can be more beneficial in terms of latency restrictions, as the level 1 TLB can be private and the level 2 TLB shared. Similar approaches have been mentioned in the past for homogeneous multiprocessors [2, 3], but need additional investigation in solving coherency issues in TLBs.


ACKNOWLEDGMENTS

This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/K034448/1 “PRiME: Power-efficient, Reliable, Many-core Embedded systems” (www.prime-project.org). Experimental data used is available at DOI:10.5258/SOTON/D0161. This research was supported by the people of ARM Research. We thank our colleagues for their valuable feedback that greatly assisted us in delivering a better version of this work.

REFERENCES [1] ARM. 2013. big.LITTLE Technology: Œe Future of Mobile. ARM white paper (2013), 12. hŠps://www.arm.com/€les/ pdf/big LITTLE Technology the Future of Mobile.pdf [2] A. BhaŠacharjee, D. Lustig, and M. Martonosi. 2011. Shared last-level TLBs for chip multiprocessors. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture. 62–63. hŠps://doi.org/10.1109/HPCA.2011. 5749717 [3] A. BhaŠacharjee and M. Martonosi. 2009. Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors. In 2009 18th International Conference on Parallel Architectures and Compilation Techniques. 29–40. hŠps://doi.org/10.1109/PACT.2009.26 [4] Nathan Binkert, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. M.D. D Hill, David A. D.A. A Wood, Bradford Beckmann, Gabriel Black, Steven K. S.K. K Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, A. Basil, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. M.D. D Hill, and David A. D.A. A Wood. 2011. Œe gem5 Simulator. Computer Architecture News 39, 2 (2011), 1. hŠps://doi.org/10.1145/2024716.2024718 [5] Kihwan Choi, Ramakrishna Soma, and Massoud Pedram. 2005. Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeo‚ based on the ratio of o‚-chip access to on-chip computation times. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24. 18–28. hŠps://doi.org/10.1109/ TCAD.2004.839485 [6] Hongsuk Chung, Munsik Kang, and Hyun-Duk Cho. 2013. Heterogeneous Multi-Processing Solution of 5 Octa with ARM ® big.LITTLE ™ Technology. (2013), 1–8. hŠps://www.arm.com/€les/pdf/Heterogeneous Multi Processing Solution of Exynos 5 Octa with ARM bigLITTLE Technology.pdf [7] Œeofanis Constantinou, Yiannakis Sazeides, Pierre Michaud, Damien Fetis, Andre Seznec, and Irisa Inria. 2005. Performance implications of single thread migration on a chip multi-core. SIGARCH Comput. Archit. News 33, 4 (nov 2005), 80–91. hŠps://doi.org/10.1145/1105734.1105745 [8] Robert H. Dennard, Jin Cai, and Arvind Kumar. 2007. A perspective on today’s scaling challenges and possible future directions. Solid-State Electronics 51, 4 SPEC. ISS. (2007), 518–525. hŠps://doi.org/10.1016/j.sse.2007.02.004 [9] MaŠhew DeVuyst, Ashish Venkat, and Dean M. Tullsen. 2012. Execution migration in a heterogeneous-ISA chip multiprocessor. In ACM SIGARCH Computer Architecture News, Vol. 40. ACM Press, New York, New York, USA, 261. hŠps://doi.org/10.1145/2189750.2151004 [10] Hadi Esmaeilzadeh, Emily Blem, Renee´ St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2012. Dark silicon and the end of multicore scaling. IEEE Micro 32, 3 (2012), 122–134. hŠps://doi.org/10.1109/MM.2012.17 [11] Stijn Eyerman and Lieven Eeckhout. 2011. Fine-grained DVFS Using On-chip Regulators. ACM Trans. Archit. Code Optim. 8, 1 (2011), 1:1–1:24. hŠps://doi.org/10.1145/1952998.1952999 [12] Chris Fallin, Chris Wilkerson, and Onur Mutlu. 2014. Œe heterogeneous block architecture. In 2014 32nd IEEE International Conference on Computer Design, ICCD 2014, Vol. -. 386–393. hŠps://doi.org/10.1109/ICCD.2014.6974710 [13] EllioŠ Forbes, Zhenqian Zhang, Randy Widialaksono, Brandon Dwiel, Rangeen Basu Roy Chowdhury, Vinesh Srinivasan, Steve Lipa, Eric Rotenberg, W. RheŠ Davis, and Paul D. Franzon. 2016. 
Under 100-cycle thread migration latency in a single-ISA heterogeneous multi-core processor. In 2015 IEEE Hot Chips 27 Symposium, HCS 2015. IEEE, 1–1. hŠps://doi.org/10.1109/HOTCHIPS.2015.7477478 [14] Ronald L Graham, Donald E Knuth, and Oren Patashnik. 1989. Concrete Mathematics: A Foundation for Computer Science. Vol. 2. xiii + 625 pages. hŠps://doi.org/10.2307/3619021 arXiv:arXiv:1011.1669v3 [15] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. 2001. MiBench: A free, commercially representative embedded benchmark suite. In 2001 IEEE International Workshop on Workload Characterization, WWC 2001. IEEE, 3–14. hŠps://doi.org/10.1109/WWC.2001.990739 [16] Anthony Gutierrez, Ronald G. Dreslinski, and Trevor Mudge. 2014. Evaluating private vs. shared last-level caches for energy eciency in asymmetric multi-cores. In Proceedings - International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2014, Vol. -. 191–198. hŠps://doi.org/10.1109/SAMOS.2014.6893211


[17] H. Homayoun. 2016. Heterogeneous chip multiprocessor architectures for big data applications. 2016 ACM International Conference on Computing Frontiers - Proceedings (2016), 400-405. https://doi.org/10.1145/2903150.2908078
[18] Khubaib. 2014. Performance and Energy Efficiency via an Adaptive MorphCore Architecture. Ph.D. Dissertation. University of Texas at Austin.
[19] Chuanpeng Li, Chen Ding, and Kai Shen. 2007. Quantifying the cost of context switch. In Proceedings of the 2007 workshop on Experimental computer science - ExpCS '07. https://doi.org/10.1145/1281700.1281702
[20] Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. 2014. Heterogeneous microarchitectures trump voltage scaling for low-power cores. Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14 (2014), 237-250. https://doi.org/10.1145/2628071.2628078
[21] Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. 2012. Composite cores: Pushing heterogeneity into a core. In Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012. 317-328. https://doi.org/10.1109/MICRO.2012.37
[22] Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald G. Dreslinski, Thomas F. Wenisch, and Scott Mahlke. 2016. Exploring fine-grained heterogeneity with composite cores. IEEE Trans. Comput. 65, 2 (2016), 535-547. https://doi.org/10.1109/TC.2015.2419669
[23] Sandeep Navada, Niket K. Choudhary, Salil V. Wadhavkar, and Eric Rotenberg. 2013. A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. IEEE, 133-144. https://doi.org/10.1109/PACT.2013.6618811
[24] Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, and Scott Mahlke. 2015. DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores. In MICRO '15. 322-333. https://doi.org/10.1145/2830772.2830791
[25] Krishna K. Rangan, Gu-Yeon Wei, and David Brooks. 2009. Thread Motion: Fine-Grained Power Management for Multi-Core Systems. Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09 (2009), 302. https://doi.org/10.1145/1555754.1555793
[26] Eric Rotenberg, Brandon H. Dwiel, Elliott Forbes, Zhenqian Zhang, Randy Widialaksono, Rangeen Basu Roy Chowdhury, Nyunyi Tshibangu, Steve Lipa, W. Rhett Davis, and Paul D. Franzon. 2013. Rationale for a 3D heterogeneous multi-core processor. In 2013 IEEE 31st International Conference on Computer Design, ICCD 2013. IEEE, 154-168. https://doi.org/10.1109/ICCD.2013.6657038
[27] Roxana Rusitoru. 2015. ARMv8 micro-architectural design space exploration for high performance computing using fractional factorial. In PMBS@SC.
[28] Andreas Sandberg, Nikos Nikoleris, Trevor E. Carlson, Erik Hagersten, Stefanos Kaxiras, and David Black-Schaffer. 2015. Full speed ahead: Detailed architectural simulation at near-native speed. In Proceedings - 2015 IEEE International Symposium on Workload Characterization, IISWC 2015. IEEE, 183-192. https://doi.org/10.1109/IISWC.2015.29
[29] Daniel Shelepov, Juan Carlos Saez Alcaide, Stacey Jeffery, Alexandra Fedorova, Nestor Perez, Zhi Feng Huang, Sergey Blagodurov, and Viren Kumar. 2009. HASS: A Scheduler for Heterogeneous Multicore Systems. ACM SIGOPS Operating Systems Review 43, 2 (2009), 66. https://doi.org/10.1145/1531793.1531804
[30] Sudarshan Srinivasan, Nithesh Kurella, Israel Koren, and Sandip Kundu. 2016. Exploring heterogeneity within a core for improved power efficiency. IEEE Transactions on Parallel and Distributed Systems 27, 4 (2016), 1057-1069. https://doi.org/10.1109/TPDS.2015.2430861
[31] D. Sunwoo, W. Wang, M. Ghosh, C. Sudanthi, G. Blake, C. D. Emmons, and N. C. Paver. 2013. A structured approach to the simulation, analysis and characterization of applications. In 2013 IEEE International Symposium on Workload Characterization (IISWC). 113-122. https://doi.org/10.1109/IISWC.2013.6704677
[32] Dan Tsafrir. 2007. The Context-Switch Overhead Inflicted by Hardware Interrupts (and the Enigma of Do-Nothing Loops). ExpCS (2007), 13-14. https://pdfs.semanticscholar.org/86f8/a42a44b82cf76dcfe023209cfa4cdc0c8981.pdf
[33] Violaine Villebonnet, Georges Da Costa, Laurent Lefevre, Jean-Marc Pierson, and Patricia Stolf. 2015. "Big, Medium, Little": Reaching Energy Proportionality with Scheduler. Parallel Processing Letters 25, 03 (Sep 2015), 30. https://doi.org/10.1142/S0129626415410066
[34] Fen Xie, Margaret Martonosi, and Sharad Malik. 2005. Efficient behavior-driven runtime dynamic voltage scaling policies. Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis - CODES+ISSS '05 (2005), 105. https://doi.org/10.1145/1084834.1084864
[35] Kisoo Yu, Donghee Han, Changhwan Youn, Seungkon Hwang, and Jaechul Lee. 2013. Power-aware task scheduling for big.LITTLE. In ISOCC 2013 - 2013 International SoC Design Conference. IEEE, 208-212. https://doi.org/10.1109/ISOCC.2013.6864009

Received May 2017; revised June 2017; accepted July 2017

BRB: Mitigating Branch Predictor Side-Channels

Ilias Vougioukas, Nikos Nikoleris, Bashir M. Al-Hashimi, Geoff V. Merrett (University of Southampton); Andreas Sandberg, Stephan Diestelhorst (Arm Research)

Abstract—Modern processors use branch prediction as an optimization to improve processor performance. Predictors have become larger and increasingly more sophisticated in order to achieve the higher accuracies needed in high performance cores. However, branch prediction can also be a source of side-channel exploits, as one context can deliberately change the branch predictor state and alter the instruction flow of another context. Current mitigation techniques either sacrifice performance for security, or fail to guarantee isolation when retaining the accuracy; achieving both has proven to be challenging. In this work we address this by (1) introducing the notions of steady-state and transient branch predictor accuracy, and (2) showing that current predictors increase their misprediction rate by as much as 90% on average when forced to flush branch prediction state to remain secure. To solve this, (3) we introduce the branch retention buffer, a novel mechanism that partitions only the most useful branch predictor components to isolate separate contexts. Our mechanism makes thread isolation practical, as it stops the predictor from executing cold, with little if any added area and no warm-up overheads. At the same time, our results show that, compared to the state-of-the-art, average misprediction rates are reduced by 15-20% without increasing area, leading to a 2% performance increase.

Keywords—Branch prediction, branch retention buffer, TAGE, side-channel attacks, Perceptron, microarchitecture, context switching.

Figure 1: Comparison of current designs (TAGE, Perceptron) with our predictor that preserves a minimal state between flushes, plotted as kilo-instructions per misprediction against branch instructions per flush. These flushes can be triggered to avoid side-channel attacks that use the branch predictor.

I. INTRODUCTION

Branch prediction contributes to high performance in modern processors with deep pipelines by enabling accurate speculation. Since the inception of the idea of speculative execution, the improvement has gradually increased overall processor performance, as Branch Predictor (BP) designs have steadily become more sophisticated and more complex. Initially, accuracy was the dominant factor for predictor design; however, as power became a limiting factor, designs had to take more constraints into account. Today accuracy is still the primary goal, but power and area are also critical.

Systems in general have become significantly more complex today, featuring multiple types of cores and accelerators on a single die. Software, taking advantage of the aforementioned improvements, has also changed, enabling more applications to be handled simultaneously. Applications today are often multi-threaded, and systems context switch (CS) frequently between processes.

However, recently discovered vulnerabilities in microarchitecture permit side-channel attacks that manipulate the branch predictor [1]. For many of these exploits context switching is a critical enabler, and they are hard to overcome. Using narrow, "hot" fixes for separate security issues that use side-channel attacks does not guarantee that systems are safe from newer variations, whereas scrubbing the entire branch predictor state as a "heavy" approach can have a significant impact on performance. Addressing vulnerabilities in software after discovery is often not possible, and usually even more costly in terms of performance.

The above observations motivate us to design future systems that take into account one additional design constraint: being able to speculate securely without allowing information to leak. Duplicating or partitioning BP state to satisfy isolation is costly in terms of area. Flushing, on the other hand, inevitably causes frequent disruption of the branch predictor's operation and inhibits effective warm-up. This can be described by distinguishing between steady-state and transient predictor accuracy, as shown in Figure 1.

To prevent the aforementioned branch predictor side-channels, our proposal guarantees branch predictor state isolation efficiently, without degrading performance. In detail, our contributions are:

• We introduce the notion of transient prediction accuracy and its relevance to designing future side-channel-free branch predictors, and show that it differs from the steady-state accuracy with which designs are evaluated today.
• We show that current state-of-the-art branch predictors perform notably worse in the transient state when flushed for security. We perform an in-depth analysis of the TAGE predictor [2], and show that large components do not contribute to the prediction accuracy under these new circumstances.
• To solve this, we propose the Branch Retention Buffer (BRB), a novel mechanism that reduces cold-start effects by preserving partial branch predictor state per context. Compared to the state-of-the-art, our design achieves the same high accuracy at steady-state and improved transient accuracy, and ensures branch predictor state isolation without increasing the overall area.

II. BACKGROUND

Due to speculative execution, branch predictors are a critical part of modern core design, drastically increasing processor performance. However, mitigation techniques for recently discovered branch predictor side-channels notably degrade predictor accuracy and reduce overall system performance.

A. Speculative side-channels

Recent studies have shown that side-channel attacks leaking sensitive data are possible in most contemporary CPU designs [3, 1, 4, 5]. These exploits take advantage of hardware oversights at design time, leaving the system vulnerable to code that can cause information to leak outside its defined scope. While such attacks target various components like the caches and the DRAM, we focus on those that are based on exploiting vulnerabilities in branch prediction.

1) Branch predictor side-channels: Spectre [1] class attacks target branch predictor components in a variety of ways. We focus on one specific case (variant 2) that exploits conditional branch misprediction to allow malicious code (a gadget) to be executed. That code uses flush-and-reload type timing attacks on the caches to leak information [6, 7]. For this attack to be plausible, some knowledge of the microarchitectural behavior is needed to influence the predictor to misspeculate and execute the gadget. With that, an attacker can poison the predictor entries and guarantee the necessary misprediction. The attack can also be triggered in cases where the branch mispredicts without altering the branch predictor state, although this is significantly harder to orchestrate.

Similar to how Spectre variant 2 uses the branch predictor to leak information, BranchScope [4] uses a related mechanism that targets the Branch Target Buffer (BTB) in order to influence the Pattern History Table (PHT) of the directional branch predictor. In this attack scheme, the predictor is primed so that it is in a predefined state the attacker can control. From this starting point, the target code is triggered to execute the victim code, which will observably change the PHT state. A simple probe can then calculate the branch flow based on the primed branch predictor conditions. The technique has been shown to successfully leak information from a secure SGX enclave in Intel processors [4].

The potential scenarios to exploit such attacks are numerous. Any transition from one process to another, or from a process to/from the kernel, is a potential point of vulnerability, as any context switch or system call can be used to let malicious software "hijack" the branch predictor and leak information.

2) Mitigation techniques: On one hand, software and firmware mitigation techniques are often hard to implement and induce high overheads. Recent studies [8] measure the actual performance loss for Spectre and Meltdown vulnerabilities and find that it ranges from 15% to 90%. While some mitigations for the known exploits have been deployed, potential unknown exploits can still find ways to exploit shared components such as the caches, the branch prediction logic, and the translation tables, as the fixes are topical and do not guarantee security. Furthermore, some of these vulnerabilities (i.e. Spectre v1, BranchScope) cannot be addressed in software or microcode and require hardware redesign.

On the other hand, hardware isolation is a sufficient measure to ensure no leaks occur, but often comes at the expense of performance. Cache side-channels are extremely difficult to address without significant performance loss or without sacrificing a lot of area. Creating shadow structures to keep cache and TLB speculative state, proposed in [9], isolates the information in the caches and protects against Spectre and Meltdown type attacks. The size of the shadow structures can be very costly, especially in systems with much larger caches and TLBs than the ones evaluated in that study. Similarly, we aim to isolate the branch predictor state to achieve similar security properties for some of the exploits without increasing the area or degrading the performance.

For branch prediction, clearing the branch predictor of any state on each context switch can take care of attacks that manipulate or eavesdrop on control flow to access data stored in the caches. However, flushing the entire state results in a significant accuracy drop with every context switch. Other alternatives, such as hard partitioning, can negatively impact the steady-state accuracy of the predictor, as the effective size per context is reduced. Tagging the BP entries is also not an acceptable solution, because in the worst case it behaves similarly to flushing when most of the entries have been replaced by another context. More importantly, while using tags eliminates branch predictor entry poisoning, it still allows observation of active entries, which can be flushed to cause deliberate mispredicts and data leaks.

A more secure BP design that uses isolation, can train efficiently, and quickly reaches its steady-state performance will be useful in a post Meltdown and Spectre world [5]. Taking all of the above into account, current and future branch predictors need to be able to protect from potential side channels without significant performance overhead when context switching frequently.

3) Threat model: For this work our threat model assumes a victim and an attacker application, where the attacker tries to infer victim information without having the authority to access it directly. In our model we assume:

• Both the victim and the attacker reside on the same core and share the same BP, as described in [1, 4].
• Slowdown of the victim execution, so that the behavior of a single branch can be detected. This has also been proven to be possible in recent studies [1, 4, 10].
• The ability of the attacker to force the victim code to execute, allowing vulnerable code to be targeted.
• The attacker having the ability to poison BP entries used by the victim application, forcing a misprediction.
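To make the "heavy" mitigation concrete, the sketch below shows how a simulator or firmware hook might scrub all predictor state on every context switch. The interface names (BranchPredictorState, flushAll, onContextSwitch) are hypothetical and only illustrate the idea; they are not the mechanism proposed in this chapter, nor a real OS or gem5 API.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical predictor interface, used only to illustrate the "heavy"
// mitigation of scrubbing all prediction state on every context switch.
struct BranchPredictorState {
    std::vector<int8_t> tables;   // all counters/weights, flattened
    void flushAll() { std::fill(tables.begin(), tables.end(), 0); }
};

// Hook invoked when the address space changes (e.g. by OS or firmware).
void onContextSwitch(BranchPredictorState& bp, uint64_t /*newASID*/) {
    bp.flushAll();   // isolation is guaranteed, but every context restarts
                     // cold, which is the transient-accuracy penalty
                     // quantified in the following sections
}
```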

B. Branch prediction design

Branch prediction has evolved over the years [11, 12] from small and simple designs to large, complex structures storing long histories of control flow. Here, we focus on the two most common designs used today as the basis of our study: the latest TAGE predictor [13], and the Multiperspective Perceptron predictor [14].

TAGE-based predictors: The TAGE predictor is one of the most accurate designs. It uses tagged geometric history lengths that capture correlation from remote branch outcomes and recent history [2]. Internally, TAGE is comprised of tables that store the information for different history lengths. In short, when a prediction is needed, TAGE searches for the matching entry belonging to the table with the longest history. If no match is found, it uses its base predictor, a bimodal design, as a fall-back mechanism.

Figure 2: An abstract representation of TAGE.

The TAGE design has been improved over the years, incorporating other small components in order to further improve its accuracy in cases where the original design was shown to frequently mispredict. For that reason, the latest published version is the TAGE-SC-L predictor, which also incorporates a statistical corrector and a loop predictor. In this paper we use the latest version, presented in Figure 2 and described in [13], as our representative example of TAGE-type predictors.

Perceptron-type predictors: The other popular design that is widespread today is the Perceptron predictor [15]. Loosely based on neural network theory, Perceptron-type predictors achieve high accuracy from efficiently stored state. As with the TAGE predictor, we use the latest variant from the 5th Championship Branch Prediction (CBP5) [16], which achieves higher accuracy but with more complexity [17, 14].

Figure 3: An abstract representation of the Multiperspective Perceptron.

The principle behind Perceptron, as shown in Figure 3, is to use a table with sets of weights that are multiplied with the history bits. The output and confidence of the predictor depend on the sign of the sum of all the history × weight products; the confidence can be deduced from the magnitude of the value. Modern versions of Perceptron have reduced the amount of calculation required for each prediction and use multiple hashed tables of weights that are indexed using both the history and the branch address [18]. These tables are commonly referred to as feature tables [14].

III. PREDICTOR FLEXIBILITY

Branch predictors have been evaluated based on how accurate their predictions are, given a reasonable amount of history to learn from. However, given the recent findings described in the previous section, in many real world cases they operate in a time-frame much shorter than the ideal, predicting from only a partially warm or cold state. This happens because, in order to fully guarantee isolation of BP state, the predictor needs to be flushed between context switches, causing a notable accuracy drop.
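As a concrete illustration of the perceptron-style prediction described above, the following sketch computes a prediction as the sign of a weighted sum over hashed feature tables, with the magnitude acting as confidence. The table counts, sizes and hash are illustrative assumptions, not the CBP5 implementation.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

// A minimal perceptron-style predictor sketch (not the CBP5 code):
// each feature table holds small signed weights indexed by a hash of
// the branch address and some view of the branch history.
constexpr int kTables  = 8;     // number of "feature" tables (assumed)
constexpr int kEntries = 160;   // weights per table (assumed)

struct FeatureTable {
    std::array<int8_t, kEntries> w{};
    int index(uint64_t pc, uint64_t hist) const {
        return static_cast<int>((pc ^ hist) % kEntries);  // toy hash
    }
};

struct PerceptronSketch {
    std::array<FeatureTable, kTables> tables{};

    // Prediction: sign of the sum of the selected weights.
    // |sum| acts as the confidence of the prediction.
    bool predict(uint64_t pc, const std::array<uint64_t, kTables>& hist,
                 int& confidence) const {
        int sum = 0;
        for (int t = 0; t < kTables; ++t)
            sum += tables[t].w[tables[t].index(pc, hist[t])];
        confidence = std::abs(sum);
        return sum >= 0;   // taken if the weighted sum is non-negative
    }
};
```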


Figure 4: Contribution to MPKI for each TAGE64 component across different flushing periods.

We mitigate these overheads in our BRB design by identifying the components that contribute the most to accuracy during the warm-up phase and creating a mechanism to retain their state per context – effectively replicating them. The rest of the predictor, which only contributes to long-term accuracy, is instead flushed, preventing any state leakage. Our proposal addresses the overall security-performance trade-off by guaranteeing isolation while also increasing accuracy for frequent switches.

Large predictors that store more information are usually able to deliver better predictions. However, if the state is lost or invalidated before the predictor has had time to warm up, it will effectively never reach peak performance. In this case, a smaller predictor might be able to deliver equivalent accuracy for a fraction of the state.

A. Steady-state and transient accuracy

We therefore distinguish between steady-state and transient accuracy. The term steady-state accuracy refers to the performance of the branch predictor when it is fully warm and has reached its highest accuracy. Conversely, transient accuracy describes the behavior during the warm-up phase.

To quantify transient accuracy, we flush the branch predictor state across different branch instruction periods and track the change in average mispredictions per kilo-instruction (MPKI). As we show in Figure 1, depending on how frequently the state is flushed, the actual (transient) accuracy can be significantly worse than the nominal, steady-state accuracy we normally evaluate against.

Our study examines the TAGE and Perceptron predictors, which are heavily modular. Next, we dissect their behavior in situations where their state gets frequently disrupted.

Dissecting TAGE: TAGE uses a bimodal base predictor, 12 tables for the main TAGE predictor, a statistical corrector, and a loop predictor. These components vary in size and in how much they affect the overall accuracy. We assess their contribution to the prediction as the predictor warms up for different frequencies of state flushing.

We use TAGE-SC-L to analyse how each of the components contributes to the accuracy when in a transient state. For that, we measure the reduction in the overall MPKI caused by each component when preserving its state between flushes. In Figure 4 we track the size needed to retain the components in each configuration and the relative MPKI, normalized to the steady-state accuracy of TAGE. The analysis in the figure shows the transient state for a flushing period ranging from 20k to 20M branch instructions. For instance, it shows that, for 20k branches, flushing the entire predictor increases the MPKI by 90%.

Additionally, Figure 4 shows that the majority of the accuracy is delivered by the TAGE tables, which are too large to preserve. We note that for flushing every 20k branches, the bimodal base predictor and the statistical corrector improve the MPKI by 16% and 10% respectively. However, as the bimodal needs only 1.25kB of state to be preserved, compared to the 8kB of the statistical corrector, its MPKI improvement per area cost is much higher than that of the statistical corrector.

Dissecting Perceptron: The evaluated Multiperspective Perceptron design uses multiple feature tables to deliver a prediction. Based on previous studies [14], the features that contribute the most to increased predictor accuracy are identified. The most prominent ones are:

• Global History: The outcome of a branch (taken or not taken). A subset of the entire history is taken for a specific feature.
• Path: A hash of the recent sequence of branch addresses. The entries use truncations of the actual branch addresses to save space.
• Recency: Similar to Path, this feature keeps a stack of recently encountered branches (using an LRU policy) and hashes them.

For our experiments we use the above features in various configurations, separately or hashed in combination (i.e. Global History XOR Path). Other features can be used to improve the prediction; however, we find that they have limited effect on the overall prediction.
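A minimal sketch of the flushing experiment described in this section is given below: predictor state is cleared every N branch instructions while the misprediction count is accumulated, yielding the transient MPKI for that flushing period. The trace format, the stand-in predictor and the assumed branch density are placeholders for the modified CBP5 framework, not its actual API.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical trace record; the real CBP5 framework uses its own format.
struct BranchRecord { uint64_t pc; bool taken; };

// A trivial stand-in predictor so the sketch is self-contained.
struct BimodalSketch {
    std::array<uint8_t, 4096> ctr{};                       // 2-bit counters
    bool predict(uint64_t pc) const { return ctr[pc % ctr.size()] >= 2; }
    void update(uint64_t pc, bool taken) {
        uint8_t& c = ctr[pc % ctr.size()];
        if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
    }
    void flush() { ctr.fill(0); }                          // full state loss
};

// Replays a trace, flushing every `flushPeriod` branches, and returns the
// transient MPKI for that flushing period (branch density is an assumption).
template <typename Predictor>
double transientMPKI(Predictor bp, const std::vector<BranchRecord>& trace,
                     uint64_t flushPeriod, double instsPerBranch = 5.0) {
    uint64_t branches = 0, mispredicts = 0;
    for (const auto& br : trace) {
        if (flushPeriod && branches && branches % flushPeriod == 0)
            bp.flush();                                    // emulate a context switch
        mispredicts += (bp.predict(br.pc) != br.taken);
        bp.update(br.pc, br.taken);
        ++branches;
    }
    const double kiloInsts = branches * instsPerBranch / 1000.0;
    return kiloInsts > 0.0 ? mispredicts / kiloInsts : 0.0;
}
```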

Figure 5: The mechanism which stores the essential state of the predictor. Most of the state is disregarded and only the most useful state is kept to increase transient accuracy.

Figure 6: Diagram of the BRB. The retained state is stored in separate SRAM banks which correspond to different contexts; a small CAM maps the Address Space ID to the current entry.

B. The Branch Retention Buffer

The predictor analysis shows that retaining even a small amount of state per context could improve the transient accuracy of the predictor, without hurting the maximum accuracy achieved during long uninterrupted execution. To guarantee isolation and simultaneously improve transient accuracy, we propose a mechanism (Figure 5) where the state of a branch predictor component is retained per context. The mechanism uses a dedicated component called the branch retention buffer (BRB) that replaces part of the predictor.

The branch retention buffer is tightly coupled with the branch predictor and keeps multiple separate entries. Ideally, storing the entire state would enable high accuracy without any warm-up time. However, the size of today's designs is too large to store fully without incurring notable overheads. To reduce the additional storage costs, we aim to make this component as small as possible. We find from Figure 4 that, for TAGE, 10kbit (1.25kB) entries are ideal, as they match the size of the base predictor for both the 8kB and 64kB variants. As such, the entries are designed to store entire components that can deliver a standalone prediction. In short, we propose a low overhead mechanism that retains partial state and delivers better accuracy than a large "cold" predictor, with no steady-state accuracy penalty.

Figure 7: TAGE breakdown at 20k branch instructions: contribution to MPKI for each 64kB TAGE component that is retained (normalized MPKI against size of the retained components). The bimodal reduces the MPKI by 16% while the statistical corrector reduces it by 10%, despite being 8x larger.

We implement the BRB to allow for easy switching between entries without moving or copying data, and thus with no added latency during execution. This is done by keeping multiple entries in separate SRAM banks. When a switch occurs, the entry corresponding to the correct context is selected and used, instead of swapping data out of the BRB. The selection uses the Address Space ID and is triggered only when a context switch occurs. A small CAM is used to map to the correct BRB entry ID, as shown in Figure 6. When a context switch occurs, any delay in the rerouting of BRB entries is masked by the larger overhead of storing the context state, which is conducted in parallel.

After the newly selected entry is activated, the others are put into retention to save energy. While operating under the same context, the active BRB entry is accessed directly for predictions and does not go through the CAM. When the BRB is full and a new context requests an entry, the least recently used entry is evicted. Previous studies show that 3 entries are enough to handle 2 communicating processes and operating system implications [19] without negatively impacting system performance. If more contexts are active and are consistently being switched out, more entries can be introduced. The added entries increase the area of the BRB component linearly, while timing remains the same irrespective of the number of entries, for realistic sizes.

Retaining the reduced state makes isolation of processes efficient and, as a consequence, improves security. Keeping that reduced state separate per process (or between untrusted parts of the same process, such as a browser and a script engine), while emptying the rest of the structures, ensures that no state is ever shared between mutually untrusted processes and the kernel. The performance hit is softened in this case, as the preserved state increases the transient state accuracy.

The obvious question, in this case, is what method is used to reduce the data in the most impactful way. This question is not trivial, as it is directly tied to the BP implementation and the amount of state that can be stored efficiently when taking into account the overhead constraints.

One way to preserve state is to select certain components that provide a good balance between the amount of data stored and the accuracy achieved, and discard the state of the remaining components. This "vertical cut" method can be used in the case of TAGE, as it is comprised of various components that can provide accurate standalone predictions (Figure 7). For instance, the base predictor can be isolated from the rest of the components and still provide reasonable accuracy. Similarly, separate TAGE tables can be preserved instead of the entire design, to target certain history lengths.

Other components, however, provide only a complementary benefit to the predictions [14]. Therefore, it does not make sense to consider preserving their state by themselves. The loop predictor, for instance, is a relatively small component that identifies regular loops with a fixed number of iterations and needs few instructions to warm up. Its overall effect on the accuracy of TAGE-SC-L is measured to be around a 0.3% improvement [13].

In Perceptron, the vertical "cut" can be applied by only retaining certain features. In contrast to TAGE, this is significantly more complicated, as the empty features do not properly add up to the sum of all the weights and therefore skew the prediction outcome. To address this, a special mechanism would be needed to adjust all the preserved weights to deliver valid predictions. We consider this to be an impractical extra step needed every time a flush occurs.

Another way to preserve state is to store partial state for all the predictor components. For TAGE this can be done by naively saving a portion of each of the TAGE tables. This "horizontal" approach captures information across all of the history, albeit with less accuracy than storing the entire state. As mentioned, in Perceptron the tables are combined to provide accurate predictions. A "horizontal cut" can be done by merging multiple neighbouring entries, which also reduces the accuracy.
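The following sketch illustrates the BRB bookkeeping described above: a small CAM-like lookup maps an address space ID to one of N retained SRAM banks, with LRU eviction when all banks are occupied. Structure names and the entry count are assumptions for illustration; the hardware performs this selection without copying data.

```cpp
#include <array>
#include <cstdint>

// Toy model of the BRB entry selection logic (illustrative only).
constexpr int kBrbEntries = 3;   // enough for 2 processes + kernel [19] (assumed)

struct BrbEntry {
    uint64_t asid    = 0;        // owning address space
    bool     valid   = false;
    uint64_t lastUse = 0;        // LRU timestamp
    // ... retained predictor component (e.g. a 1.25kB bimodal/perceptron)
};

struct BranchRetentionBuffer {
    std::array<BrbEntry, kBrbEntries> banks{};
    uint64_t tick = 0;

    // Called only on a context switch: CAM-style lookup on the ASID,
    // falling back to LRU eviction (the evicted context will warm up again).
    int selectBank(uint64_t asid) {
        ++tick;
        int lru = 0;
        for (int i = 0; i < kBrbEntries; ++i) {
            if (banks[i].valid && banks[i].asid == asid) {   // CAM hit
                banks[i].lastUse = tick;
                return i;
            }
            if (banks[i].lastUse < banks[lru].lastUse) lru = i;
        }
        banks[lru] = BrbEntry{asid, true, tick};             // evict LRU entry
        return lru;  // a fresh entry starts empty: this context trains once
    }
};
```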

C. Perceptron amplified / reinforced TAGE

As a specific showcase of our BRB design methodology, we propose a hybrid approach to preserving state, which we call ParTAGE – Perceptron amplified / reinforced TAGE – shown in Figure 8. The branch retention buffer in this case stores a Multiperspective Perceptron predictor, using smaller and fewer feature tables than the Perceptron from the literature, as the base predictor of the TAGE design, replacing the bimodal predictor.

Figure 8: The ParTAGE predictor combines benefits from both perceptron and TAGE predictors. It also allows partial state to be preserved to improve transient accuracy.

Each entry in the BRB stores an independent small perceptron predictor, enabling contexts to keep some reduced state separate. The rest of the branch predictor design can be cleared to eliminate the possibility of side-channel leaks targeting the branch predictor.

To reduce the size of the Perceptron to within the allocated budget, we use a similar configuration to the 8kB Multiperspective Perceptron [14], with 8 smaller feature tables instead of 16. The limited size of the Perceptron in ParTAGE enables its state to be preserved when context switching. In the intermediate warm-up state, ParTAGE has to select between the transient prediction of the TAGE tables and the steady-state prediction of the preserved Perceptron. We design two versions of ParTAGE: one that hardwires the Perceptron predictor to always be chosen during the transient operation of the predictor (based on the context switching frequency), and one that assesses the confidence of the prediction of the base perceptron predictor before selecting the outcome.

IV. EXPERIMENTAL SETUP

The experiments conducted use the CBP5 framework with the traces from 2016 [16] as a base. The framework uses 268 traces, ranging from 100 million to 1 billion instructions, for both mobile and server workloads.

Table I: The evaluated branch predictors.

  Name        Size         Details
  BiM         64kB         2-bit counter
  BiMH        1.25kB       8kbit counter, 2kbit hysteresis
  BiMH        625B         4kbit counter, 1kbit hysteresis
  Perceptron  64kB         37 feature tables
  Perceptron  8kB          16 feature tables
  TAGE        64kB / 8kB   Loop predictor, statistical corrector, BiMH base predictor
  ParTAGE     64kB / 8kB   Loop predictor, statistical corrector, 1.25kB 8-table perceptron

Table III: The frequency of context switches on a Google Pixel phone.

  Application     Time (s)   Context switches (CS)   CS/s
  adobereader     73         459,699                 6,279
  gmail           23         228,924                 9,833
  googleslides    108        343,788                 3,185
  youtube         36         418,254                 11,487
  yt playback     21         224,631                 10,698
  angrybirds rio  81         406,711                 5,044
  camera@60fps    60         1,958,557               32,643
  geekbench       60         129,533                 2,159
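A sketch of the two selection policies described above is given below: ParTAGE-S statically trusts the retained perceptron for a fixed window after a flush, while ParTAGE compares component confidences. The function names are illustrative assumptions; the 200k-branch threshold is the value used for ParTAGE-S in this work.

```cpp
#include <cstdint>

// Illustrative selection between the retained small perceptron (already
// warm for this context) and the TAGE tables (possibly cold after a flush).
struct ComponentPrediction { bool taken; int confidence; };

// ParTAGE-S: statically override TAGE for a window after each flush.
bool selectParTAGE_S(const ComponentPrediction& perceptron,
                     const ComponentPrediction& tage,
                     uint64_t branchesSinceFlush,
                     uint64_t overrideWindow = 200000 /* branch insts */) {
    if (branchesSinceFlush < overrideWindow)
        return perceptron.taken;          // TAGE tables assumed still cold
    return tage.taken;
}

// ParTAGE: pick the component that reports the higher confidence.
bool selectParTAGE(const ComponentPrediction& perceptron,
                   const ComponentPrediction& tage) {
    return (perceptron.confidence > tage.confidence) ? perceptron.taken
                                                     : tage.taken;
}
```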

Table II: The specifications of the system used in our gem5 experiments [20].

  OoO Core                       Memory
  Pipeline Width   3-wide        L1D Size   32kB
  Pipeline Depth   15            L1I Size   32kB
  Clk. Frequency   1 GHz         L2 Size    1MB
  L2 Prefetcher    Stride

Figure 9: Impact of periodic BP flushing (1k to 10M instructions) on core performance, normalized to a system without flushing, for the MiBench workloads (FFT, jpeg, qsort, susan, adpcm, dijkstra, patricia, rijndael, bitcounts, basicmath, stringsearch) and their average. Results extracted from gem5 full system simulation [20] of an Arm OoO model running MiBench [21].

We modify the CBP framework so that branch predictors can perform full or partial flushes of their state. This enables temporal studies of the behaviour of branch predictors, revealing the effects of full or partial loss of state.

For our experiments, we use a variety of predictors that are commonly used today. As a baseline design, we implement a set of bimodal type predictors, with and without hysteresis. To compare more contemporary predictors, we use the submitted TAGE-SC-L [13] and Multiperspective Perceptron without TAGE [14] from CBP5. We lightly modify both designs so that we can flush them partially or completely when needed, carefully retaining their exact steady-state behaviour.

Furthermore, we implement two variants of ParTAGE that express different policies for the selection of the best transient prediction. ParTAGE-S overrides the prediction of the TAGE tables below a period threshold, which we have set to 200k branch instructions. ParTAGE uses an integrated confidence value that is assessed by the rest of the TAGE design in order to indicate the most accurate prediction.

We focus on 8kB and 64kB predictors, similar to the ones that are evaluated at CBP. A detailed list of all the evaluated predictors is shown in Table I. We assess the transient accuracy of the evaluated predictors for the cases outlined in Section II. To achieve this, we cover a range of flushing periods from 10 to 60M branch instructions per flush, extracting the optimal design for each use case.

We also perform a limit study that identifies the upper limit of the core performance drop when flushing the predictor periodically, ranging from 1k to 10M total instructions. We use an Arm OoO model (Table II) in a gem5 [20] full system simulation running MiBench [21]. We compare two systems, one that periodically flushes the state and one that does not.

V. RESULTS

We perform three different types of comparisons, focusing on simple bimodal predictors, current TAGE and Perceptron designs, and our ParTAGE proposal. We measure the frequency of context switches on a modern mobile device (Table III) and find that switches can happen on average as often as every 12k branch instructions. This leads to significant core performance loss, as much as 15% based on our limit study (Figure 9).

A. Quantifying transient accuracy

1) Bimodal accuracy results: We use Bimodal as a simple first experiment, consisting of a single table of counter bits. In Table IV we show how the MPKI of different bimodal designs improves as the state retention period increases.

Table IV: Misprediction rates (MPKI) of a wide range of bimodal predictors across different state flushing periods, measured in branch instructions.

  Name          10     100    200    2k     20k    200k   2M     20M    40M    60M
  BiM 64kB      36.77  22.22  19.52  14.12  12.13  11.40  11.16  11.14  11.15  11.15
  BiM 1kB       36.78  22.26  19.58  14.40  12.87  12.59  12.52  12.51  12.51  12.51
  BiMH 40kB     30.97  19.33  17.23  13.40  12.04  11.56  11.41  11.39  11.39  11.39
  BiMH 10kB     30.96  19.34  17.24  13.43  12.10  11.63  11.49  11.47  11.47  11.47
  BiMH 1.25kB   30.97  19.39  17.33  13.68  12.53  12.17  12.08  12.07  12.07  12.07
  BiMH 625B     30.99  19.45  17.42  13.93  12.91  12.62  12.55  12.54  12.54  12.54

Figure 10: Bimodal transient MPKI: comparison of sizes and types of bimodal predictors (MPKI against branch instructions per flush for BiM 64kB, BiMH 40kB, BiM 1kB and BiMH 625B). Size contributes to steady-state accuracy, hysteresis to transient accuracy. Note the Y-axis begins at an 11 MPKI offset.

Figure 11: TAGE vs Perceptron transient behavior: a comparison between TAGE and Perceptron at 64kB and 8kB (MPKI against branch instructions per flush). While both exhibit similar steady-state accuracy, TAGE shows better transient behavior.

Despite a 100x difference in size, the steady-state MPKI increase is only 12.28%. The results reveal that while size contributes to the steady-state accuracy, transient accuracy is not affected by the size of a predictor design.

Instead, a variation in the design, such as adding hysteresis, improves transient accuracy even for smaller predictors. This happens because the hysteresis bits also affect neighbouring branches and ultimately warm up the design faster. This is clearly visible at smaller flushing periods, where the misprediction rate is on average 18% lower for the bimodal designs with hysteresis. Figure 10 shows the difference in accuracy for different bimodal sizes and designs.

2) Transient accuracy: TAGE vs Perceptron: Our second set of results focuses on comparing the transient behaviour of the two most prominent designs in modern systems: TAGE and Perceptron. Figure 11 shows the different transient behaviour between the TAGE and Perceptron designs. We notice that the 8kB variant of Perceptron has the worst cold start; however, it manages to rapidly improve and reach similar steady-state accuracy.

The transient MPKI for a flushing period of 20k branch instructions is 7.75 and 6.98 for the 8kB and 64kB TAGE designs, and 7.57 and 6.93 respectively for Perceptron. Switching every 20k branch instructions is within a realistic range for applications like the ones presented in Table III. This result shows that TAGE can deliver better steady-state accuracy. However, for applications that perform frequent context switching (20k branches), Perceptron is marginally more accurate.

Another observation can be extracted when comparing the 64kB variants with the 8kB ones at smaller windows of uninterrupted execution. Considering, for instance, flushing every 20k or 200k branch instructions, the 8kB predictors perform on average 10% and 15% worse than the 64kB designs. However, the accuracy gap increases to 33% when observing the same designs at steady-state.

Using the steady-state accuracy as a baseline and comparing the transient accuracy, across all granularities as fine as 20k branch instructions, we calculate how much worse the accuracy can be in context-switch-heavy workloads. From Figure 11, the MPKI increase is as much as 90% and 80% for TAGE and Perceptron respectively. This reinforces our belief that predictors today are evaluated without taking into account disruptions that can occur during realistic execution.
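As an illustrative spot check of these figures (using the 64kB transient values quoted above together with the steady-state MPKI reported later in Section V-C, rather than the exact averages across all granularities used in the paper):

  TAGE:       6.98 / 3.660 - 1 ≈ 0.91  (roughly a 90% MPKI increase at a 20k flushing period)
  Perceptron: 6.93 / 3.826 - 1 ≈ 0.81  (roughly an 80% MPKI increase)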

Figure 12: Small Perceptron vs bimodal comparison: bimodal and Perceptron designs below 1.25kB as base predictors (MPKI against branch instructions per flush for BiMH 1kB, BiMH 625B, Perceptron 1.7kB and Perceptron 1.25kB). Results show that, even when reduced to this size, Perceptron greatly outperforms the competition.

Figure 13: Calibrating the small perceptron. (a) Calibrating the number of feature tables for a 1.25kB size (MPKI against number of feature tables). (b) Different sizes of perceptron with a constant 8 feature tables (MPKI against size in kB).

B. ParTAGE

To improve the transient accuracy of branch prediction, we present the results from our proposal, ParTAGE, a design that is influenced by both Perceptron and TAGE. Next, we motivate its design choices and proceed to analyse its accuracy compared to the competitive designs of today.

Finding the right, "small" predictor: The TAGE breakdown analysis leads to the underlying idea for the hybrid predictor: replace components which use a significant amount of area with other, retainable components that can deliver a higher transient accuracy.

From Figure 11 we observed that Perceptron predictors rapidly improve their MPKI over time before reaching steady-state accuracy. This reveals an interesting insight about Multiperspective Perceptron predictors: we find that, compared to bimodal predictors, the hashed and common weight values in modern Perceptron designs are a more effective design, being able to aggressively train and store information in a denser format. We focus on the latter attribute, that of dense branch information storage, to design the base predictor, as we intend to store and restore it for each context, thus never letting it return to its transient accuracy. Furthermore, in Figure 7 the statistical corrector is shown to add little accuracy when the TAGE tables are cold, despite its large size (8kB). We use this to experiment with larger perceptrons while maintaining a balance in terms of size.

In Figure 12 we compare bimodal and Perceptron designs that fit roughly within a 1.25kB budget. Results show that while Perceptron has higher MPKI at granularities as fine as 200 branch instructions, its steady-state accuracy is significantly better than bimodal.

Optimizing the base predictor: We reduce the size of each BRB entry to 1.25kB and calibrate the number of feature tables and the size of the Perceptron predictor. We fix the size of the predictor initially to 1.25kB (Figure 13a) and find that eight feature tables deliver the most accuracy. We repeat the process for 3kB perceptron entries, which increase the size of the predictor by 10%. For this reason, we also evaluate a design without the statistical corrector, maintaining equal area to the original TAGE design.

Figure 13b shows the best configuration for each size. Note that changing the feature sizes affects the prediction more than altering the overall size does. For completeness, a genetic algorithm as proposed in past studies [14] can deliver even better results; we leave these optimisations for future studies.
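As a rough illustration of how a 1.25kB BRB entry could be budgeted (the exact geometry of the calibrated perceptron is not spelled out here, so the numbers below are an assumed example rather than the evaluated configuration): eight feature tables of 160 one-byte weights give 8 x 160 x 8 bits = 10,240 bits = 1.25kB, matching the retained-entry size identified in Figure 4.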

Figure 14: Decrease of MPKI when retaining 1.25kB and 3kB of state (transient MPKI against branch instructions per flush for TAGE (Flush), TAGE (B), ParTAGE and ParTAGE-S 3kB). Note that TAGE clears its entire state, while all other designs keep their base predictor state intact.

Figure 15: MPKI increase compared to steady-state when retaining 1.25kB (overhead against branch instructions per flush for TAGE, TAGE (B), ParTAGE 1kB, ParTAGE 3kB and ParTAGE 3kB + SC). At coarse granularity the effect is negligible, but for frequent flushes the improvement is significant.

Figure 16: Steady-state performance (MPKI at flushing periods of 1M to 100M branch instructions for TAGE 64kB, Perceptron 64kB and ParTAGE 64kB). The steady-state reveals that ParTAGE can perform competitively with, and even outperform, existing high accuracy designs.

C. ParTAGE results

We create the ParTAGE predictor based on our observations for small predictors, replacing the 1.25kB bimodal with a perceptron.

Comparing ParTAGE variants: The results in Figure 14 compare the designs featuring the BRB, all variants of ParTAGE and TAGE (B) (which retains the bimodal), to TAGE without the BRB. For ParTAGE, we evaluate variants that retain both 1.25kB and 3kB BRB entries. ParTAGE-S statically overrides the TAGE tables for a brief period after a flush is performed, forcing the Perceptron prediction to be selected, while ParTAGE selects the prediction based on each component's confidence. The 1kB entry designs are overall 3% larger in area, while the 3kB variant removes the statistical corrector, maintaining the same area budget.

When considering the upper limit of context switching in current systems, we see that the most accuracy is delivered by our proposed BRB extension with the 3kB ParTAGE variant, which is iso-area compared to TAGE-SC-L. Figure 15 shows the improvement is roughly 20% and 15% for 3kB ParTAGE and TAGE (B) respectively. Overall, preserving a minimal state can deliver a significant improvement (15-20% less MPKI) when the state is frequently flushed, but has a small effect at steady-state (a maximum 5% MPKI increase). To maintain the same steady-state accuracy, the statistical corrector is included (ParTAGE 3kB + SC in Figure 15), increasing the overall area by 10% but delivering better accuracy throughout.

Comparing the different implementations of ParTAGE, we notice that while ParTAGE-S works well for fine grain switches, its transient accuracy suffers at larger flushing periods, as the TAGE tables have not been trained adequately. In contrast, ParTAGE, which simply feeds the confidence into TAGE, does not achieve the same transient accuracy. This happens when the TAGE tables are completely cold but their prediction is prioritized over the more accurate one from the perceptron component. This is also why ParTAGE does not improve the transient accuracy at 20k branch instructions. We propose to solve this with better tuning of the selection policy in the future.

Improving the steady-state accuracy of the base predictor and retaining its state effectively enables efficient operation at finer granularities. This can be done by either increasing the size of the branch retention buffer to fit more state, or developing predictors in the 1kB to 3kB range with better steady-state accuracy. For instance, approaching the accuracy of an 8kB perceptron can further reduce the misprediction rate, as shown in Figure 11.
Comparing ParTAGE to TAGE and Perceptron: While the primary focus of this study is the improvement of the transient accuracy of branch prediction, it is equally important to maintain a competitive steady-state accuracy for our proposed design. We perform a direct comparison between TAGE-SC-L, both Multiperspective Perceptron versions submitted to CBP5 [16], and the best version of ParTAGE. Figure 16 provides a detailed look at the steady-state MPKI compared to TAGE and the best version of the Multiperspective Perceptron [22, 14]. We observe that ParTAGE delivers 3.726 MPKI, competitive with TAGE (3.660 MPKI) and even outperforming Perceptron (3.826 MPKI).

VI. RELATED WORK

The effects of context switches on predictor accuracy and overall performance have been studied in the past with varied results. Studies like [23] show that in the past, when systems did not switch often (above 400k instructions) and branch predictors could fully train within 128k instruction periods, the effects of context switching on BP accuracy were negligible.

When switching at a faster rate, [11, 24, 25] show that the branch predictor accuracy drop is in the range of 7% - 20%. We show that context switches on current hardware can happen more frequently than in the past (Table III), in cases as often as every 64k instructions. Furthermore, compared to past studies, current branch predictor designs like TAGE and the Multiperspective Perceptron take much longer than 128k instructions to warm up. Our results show that branch mispredictions can increase by as much as 90% when applications today are switching aggressively and flushing.

Some studies [26, 27] focus on the effect that context switches have on actual performance. They find that when switching is done too frequently, cache misses overshadow all other overheads and greatly degrade performance. Other work, namely [28], shows that even when applications have all of their data in the caches, the performance hit caused by the branch predictor can be as much as 20% for 100k instruction switching frequencies on Out-of-Order cores, while In-Order cores show little degradation. Our results from Figure 9 agree with such findings. Out-of-Order execution successfully masks cache misses, but relies heavily on speculation and branch predictor accuracy to deliver performance. Flushing for security reasons closely matches these simulations and we expect a similar performance drop.

Studies break down the contribution to accuracy of the components in Perceptron predictors [14], but a similar breakdown of accuracy into components has not been done so far for TAGE-type designs.

While many studies track the effects on accuracy across different predictor sizes and switching frequencies [29, 30, 2, 23], we believe the notion of accuracy under frequent flushes to be a novel insight, especially when taking the necessary security precautions against side-channel attacks. Studies have been conducted [19] that focus on isolating the kernel from the applications in the branch predictor for performance. This can potentially increase security but will not completely eliminate BP side channels. Furthermore, that design proposes a hard partition of the entire BP state, which would be costly in terms of area for current large predictor designs, unless steady-state accuracy is sacrificed.

Security studies [1, 5] identify practical threats that can compromise the system using the branch predictor; they mention branch predictor flushing as a mitigation technique for the security aspects, however they do not provide an estimate of the performance loss of such an approach. Our results contribute to quantifying the performance loss in such scenarios. The branch retention mechanism we describe has, to the best of our knowledge, not been previously proposed, and provides a viable solution when the performance degradation is significant.

VII. CONCLUDING REMARKS

In this work, we have focused on creating hardware security mitigations for side-channels in the branch predictor, which can be disruptive to performance when dealt with in software. We highlight realistic scenarios where this can occur, such as frequent context switches and system calls, and show that current mitigation techniques for side-channel attacks targeting the speculation engine are both expensive and impractical. We show that these disruptions create a disconnect between the reported nominal accuracy of branch predictors and the actual one in real world applications. To distinguish between the two, we introduce the notions of steady-state and transient branch predictor accuracy.

We propose a novel mechanism, the Branch Retention Buffer (BRB), that keeps a minimal, isolated state per context to reduce the high number of mispredictions. We propose two designs that store essential context state in the BRB: first, an extension to TAGE, named TAGE (B), that keeps the state of its bimodal predictor; second, a novel hybrid branch predictor design, ParTAGE, that replaces the bimodal in TAGE with a Perceptron. We evaluate these variants with a new methodology, which modifies the Championship Branch Prediction framework so we can clear predictor state across different frequencies and components. We show that branch predictors can have as much as 90% more mispredictions than what is evaluated today at steady-state, under certain realistic conditions. Using the BRB we manage to guarantee BP state isolation, and consequently increase security when context switching, while achieving MPKI reductions of 15% and 20% for TAGE and our hybrid predictor design respectively.

Improving the steady-state was not the goal of this study, as we focused on isolation while simultaneously mitigating the negative effects in the transient state. However, we believe that with future optimisations our designs can improve the steady-state as well. We also aim to enhance the transient accuracy by focusing on improving the policies that select between retained and "cold" state components, in order to fully exploit all the benefits of the base predictor. Finally, we aim to focus on more accurate "small" designs in the 1kB - 3kB range.
https://doi.org/10.5258/SOTON/D0739. [18] D. Tarjan and K. Skadron, “Merging path and gshare indexing This research was supported by the people of Arm Re- in perceptron branch prediction,” ACM Transactions on Ar- search. We thank our colleagues for their valuable feedback chitecture and Code Optimization, vol. 2, no. 3, pp. 280–300, that greatly assisted us in delivering a better version of this 2005. work. [19] J. Rubio and N. Vijaykrishnan, “OS-Aware Branch Prediction: Improving Control Flow Prediction for Op- REFERENCES erating Systems,” IEEE Transactions on Computers, vol. 56, no. 1, pp. 2–17, 2007. [1] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, [20] N. Binkert, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, N. Vaish, M. D. Hill, D. A. Wood, B. Beckmann, G. Black, “Spectre Attacks: Exploiting Speculative Execution,” ArXiv S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, e-prints, 2018. and T. Krishna, “The gem5 simulator,” ACM SIGARCH [2] A. Seznec and P. Michaud, “A case for (partially) TAgged Computer Architecture News, vol. 39, p. 1, 8 2011. GEometric history length branch prediction,” Jilp, vol. 8, [21] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, pp. 1 – 23, 2006. T. Mudge, and R. B. Brown, “MiBench: A free, commer- [3] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, cially representative embedded benchmark suite,” in IEEE C. Wilkerson, K. Lai, and O. Mutlu, “Flipping bits in memory International Workshop on Workload Characterization, pp. 3– without accessing them: An experimental study of DRAM 14, 2001. disturbance errors,” in International Symposium on Computer [22] D. A. Jimenez,´ “Multiperspective Perceptron Predictor with Architecture, pp. 361–372, 2014. TAGE,” JWAC-4: Championship Branch Prediction, 2014. [4] D. Evtyushkin, R. Riley, N. C. Abu-Ghazaleh, ECE, and [23] M. Co and K. Skadron, “The effects of context switching on D. Ponomarev, “BranchScope,” in Architectural Support for branch predictor performance,” in International Symposium Programming Languages and Operating Systems, pp. 693– on Performance Analysis of Systems and Software, pp. 77– 707, 2018. 84, 2001. [5] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, [24] S. Pasricha and A. Veidenbaum, “Improving branch predic- S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Ham- tion accuracy in embedded processors in the presence of burg, “Meltdown,” ArXiv e-prints, 2018. context switches,” in International Conference on Computer [6] Y. Yarom and K. Falkner, “Flush + Reload : a High Resolu- Design, 2003. tion, Low Noise, L3 Cache Side-Channel Attack,” USENIX [25] A. S. Dhodapkar and J. E. Smith, “Saving and restoring Security 2014, pp. 1–14, 2014. implementation contexts with co-designed virtual machines,” [7] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-level 2001. cache side-channel attacks are practical,” in IEEE Symposium [26] J. C. Mogul and A. Borg, “The Effect of Context Switches on on Security and Privacy, pp. 605–622, 2015. Cache Performance,” in Architectural Support for Program- [8] A. Prout, W. Arcand, D. Bestor, B. Bergeron, C. Byun, ming Languages and Operating Systems, vol. 19, pp. 75–84, V. Gadepally, M. Houle, M. Hubbell, M. Jones, A. Klein, 1991. P. Michaleas, L. Milechin, J. Mullen, A. Rosa, S. Samsi, [27] F. Liu and Y. Solihin, “Understanding the behavior and C. Yee, A. Reuther, and J. 
Kepner, “Measuring the Impact implications of context switch misses,” ACM Transactions on of Spectre and Meltdown,” ArXiv, pp. 1–5, 2018. Architecture and Code Optimization, vol. 7, no. 4, pp. 1–28, [9] K. N. Khasawneh, E. M. Koruyeh, C. Song, D. Evtyushkin, 2010. D. Ponomarev, and N. Abu-Ghazaleh, “SafeSpec: Banishing [28] I. Vougioukas, A. Sandberg, S. Diestelhorst, B. Al-Hashimi, the Spectre of a Meltdown with Leakage-Free Speculation,” and G. Merrett, “Nucleus: Finding the sharing limit of hetero- ArXiv, 2018. geneous cores,” ACM Transactions on Embedded Computing [10] T. Allan, B. B. Brumley, K. Falkner, J. van de Pol, and Systems, vol. 16, no. 5, 2017. Y. Yarom, “Amplifying side channels through performance [29] M. Das, A. Banerjee, and B. Sardar, “An empirical study on degradation,” in Annual Conference on Computer Security performance of branch predictors with varying storage bud- Applications, pp. 422–435, 2016. gets,” in International Symposium on Embedded Computing [11] M. Evers, P.-Y. Chang, and Y. N. Patt, “Using hybrid branch and System Design, pp. 1–5, 12 2017. predictors to improve branch prediction accuracy in the [30] C. Zhou, L. Huang, Z. Li, T. Zhang, and Q. Dou, “Design presence of context switches,” in International Symposium Space Exploration of TAGE Branch Predictor with Ultra- on Computer Architecture, vol. 24, pp. 3–11, 1996. Small RAM,” in Great Lakes Symposium on VLSI, pp. 281– [12] T.-Y. Yeh and Y. N. Patt, “Two-level adaptive training branch 286, 2017. prediction,” in International Symposium on Microarchitec- ture, pp. 51–61, 1991. [13] A. Seznec, “TAGE-SC-L branch predictors,” JWAC-4: Cham- pionship Branch Prediction, 2014. [14] D. A. Jimenez,´ “Multiperspective Perceptron Predictor,” JWAC-4: Championship Branch Prediction, 2014. YOMO: You Only Mispredict Once. Transferring Predictor State for Efficient HMPs

Abstract—Modern mobile SoCs use high-performance out-of-order (OoO) cores and energy-efficient in-order (InO) cores. Recent work shows that migrating tasks between OoO and InO cores can unlock additional performance and energy efficiency. In many cases, however, the cost of migration outweighs the benefits. Sharing resources such as caches can reduce the migration cost. For other components, like the branch predictor, which are tightly integrated into the core and designed to meet different needs, sharing can be complex and impractical. In this work, we present a novel mechanism that migrates branch predictor state between cores to reduce the task migration cost. This mechanism does not increase prediction latency or violate the design constraints of the cores. To minimize communication, we transfer a small part of the branch predictor state, sufficient to achieve high prediction accuracy. Our results show that our mechanism reduces mispredictions by as much as 62% and recovers lost performance by as much as 35% in systems with frequent migrations.

Index Terms—Heterogeneous multiprocessing, OoO, Out-of-Order execution, speculation, branch prediction.

Fig. 1: Performance (cycles per instruction) when retaining branch predictor state: an execution excerpt of two identical OoO systems migrating every 1k instructions. One retains all internal components when migrating, the other loses the branch predictor state. Depending on the amount of instruction-level parallelism, performance can drop by as much as 35%. Results generated using gem5 running qsort from MiBench [7], [8].

I. INTRODUCTION

Many devices today, laptops and smartphones for instance, are limited by an energy budget that prevents them from constantly delivering maximum performance. As such, processor cores are usually designed to address very specific power/performance points. To improve overall system efficiency while still being able to deliver high performance when needed, heterogeneous multiprocessors can be used, which traditionally combine high performance Out-of-Order (OoO) cores with energy efficient In-Order (InO) cores [1]–[3]. To make best use of the two design points, heterogeneous systems migrate applications between the cores. To guarantee high performance, demanding phases run on the OoO core, while low instruction-level parallelism (ILP) phases can run on the InO core to conserve energy.

Previous studies show that reducing the migration overheads allows for aggressive switching at a finer granularity and improves energy efficiency [1]–[6]. Earlier proposals have tried to merge the front-ends of different cores and distribute blocks of instructions to separate back-ends [2], [4]. In theory, this can lead to better efficiency, as migration costs are reduced and frequent switching becomes practical. However, fusing front-ends is complex and adds delay to latency-sensitive components, like the Level 1 caches (L1) and the branch predictor. Furthermore, InO and OoO front-ends cater to different operating points. Hence a shared front-end cannot satisfy both OoO performance and InO efficiency constraints.

Instead, we approach this differently by finding the components that significantly reduce migration overheads and searching for practical ways to transfer their state. In this work, we identify that branch predictor state has a large impact on performance after a migration, particularly for OoO cores that rely on branch prediction to expose instruction-level parallelism. This can be seen in Figure 1, where frequent migrations can drop OoO performance by as much as 35%. We focus on transferring branch predictor state, as the added latency overhead prohibits sharing [1], [9]. Furthermore, the size and accuracy requirements differ between heterogeneous cores, and consequently their designs are incompatible.

We propose transferring minimal branch predictor state between cores, which does not add much overhead but notably decreases mispredictions after a migration, ultimately improving overall performance. Our contributions are the following:

• We show that when migrating frequently a large portion of the overheads are caused by branch predictor cold misses. We find that this is especially true for OoO cores, where we measure an average performance loss of 40%.
• We show that shared predictors are impractical, as OoO and InO cores have different prediction requirements and constraints that cannot be satisfied by one universal design. Moreover, the extra latency negates any benefits.
• We identify the branch predictor state that eliminates a large fraction of the migration cold misses when transferred. We use a novel predictor design based on current TAGE and perceptron predictors. Results show a 62% reduction in the misprediction rate, which recovers significant performance during frequent migrations.

II. BACKGROUND

In this section we present relevant prior art that merges cores, reduces migration overheads and consequently tries to increase overall performance/efficiency in heterogeneous cores.
While the potential benefits of branch predictor sharing are significant, the increased complexity can negate any actual improvement. As we focus on maintaining branch predictor state through migrations, we provide detail on the branch predictor designs used in our work.

Fine grain migration in heterogeneous systems

Prior work investigates ways to mitigate the hardware overheads that occur when migrating between cores [2], [4], [9], [10]. Specifically, recent work has been focusing on fusing heterogeneous core front-ends to reduce the cost of migrations between the core back-ends [2]–[6], [9]–[11]. These proposals share caches, translation buffers, the branch predictor and even the pipeline in order to eliminate the cost of switching between OoO and InO cores. In principle, low overhead switching and accurate scheduling in fused cores enable applications to use the OoO core during phases that can deliver high performance, and swiftly switch to the InO core for low performance phases. This is claimed to save as much as 36% energy for a 1% performance loss compared to a system running constantly on the high performance OoO core, effectively combining InO efficiency with OoO performance [2], [4]. Similar approaches merge multiple smaller InO cores to create larger OoO cores which retain their state when switching [12], [13].

[Fig. 2: A depiction of a TAGE branch predictor. Tables use history length (H) for indexing and tag matching to ensure valid predictions.]

These designs have significant drawbacks in reality, as sharing requires notably more complex wiring, which increases latency [9]. Additionally, multiplexing components like the caches and the translation lookaside buffer (TLB) also adds latency, which degrades overall performance. This forces such designs to lower the operating frequency dramatically (186 MHz in [9]). Finally, the energy savings reported [2], [3], [5], [6] do not take into account the increase in energy consumption when using the larger front-end with the InO core. Certain studies consider these overheads [1] and explore sharing schemes that do not introduce latency in the critical path. The overall transfer overhead is calculated to be:

T_migration = (T_pip + T_BP + T_reg) + (T_TLB + T_L1) + T_L2

where T is the warm-up time of each component after a migration; the first group is non-shareable, the second is tied to the core, and the last is easily shareable. The study shows that the overheads during a migration are split into three categories. First, the easily shareable components, like the Level 2 cache (L2). Second, those tied to the core, like the Level 1 caches (L1) and the TLB, where sharing introduces extra latency which causes slowdown. Finally, the non-shareable internal components that are inherently different in heterogeneous cores, like the branch predictor. Recent studies show that sharing the L2 and the last level of a hierarchical TLB is enough to mitigate the effects caused by memory accesses [1], [14]. Reducing the branch predictor overheads (T_BP) can unlock the energy savings mentioned in the studies above [2]–[6], [11].
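As a rough illustration of this decomposition, the sketch below (with placeholder warm-up values, not measurements from this work) sums the per-component warm-up terms and drops the ones removed by sharing:

```cpp
#include <cstdint>
#include <iostream>

// Illustrative only: per-component warm-up cycles paid after a migration.
// The component names follow the equation above; the numbers are placeholders.
struct MigrationWarmup {
    uint64_t pipeline, branch_predictor, registers; // non-shareable state
    uint64_t tlb, l1;                               // tied to the core
    uint64_t l2;                                    // easily shareable
};

// T_migration = (T_pip + T_BP + T_reg) + (T_TLB + T_L1) + T_L2,
// where shared components no longer contribute a warm-up term.
uint64_t migrationOverhead(const MigrationWarmup& w,
                           bool l1_tlb_shared, bool l2_shared) {
    uint64_t cycles = w.pipeline + w.branch_predictor + w.registers; // always paid
    if (!l1_tlb_shared) cycles += w.tlb + w.l1; // avoided only by sharing L1/TLB
    if (!l2_shared)     cycles += w.l2;         // avoided when the L2 is shared
    return cycles;
}

int main() {
    MigrationWarmup w{2000, 15000, 300, 5000, 8000, 40000}; // placeholder values
    std::cout << "overhead: "
              << migrationOverhead(w, /*l1_tlb_shared=*/false, /*l2_shared=*/true)
              << " cycles\n";
}
```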
Branch predictor overview

We focus our investigation on the most prominent designs used today, the TAGE and perceptron predictors.

TAGE: This design looks up multiple tagged tables using a hash of the program counter (PC) and part of the global history (H). Each table is indexed by a different history length following a geometric series. The lengths are calculated using:

L(i) = int(a^(i-1) × L(1) + 0.5)

where L is the length of the global history, i is the table identifier and a is the geometric factor [15]. If no prediction is available the predictor falls back to a base predictor, a simple bimodal predictor with hysteresis. The latest design, TAGE-SC-L, also features a statistical corrector to improve on biased cases that TAGE consistently mispredicts and a specialized loop predictor [16]. The design is shown in Figure 2.

Multiperspective perceptron: The perceptron predictor, which is loosely based on neural network theory, calculates a prediction based on the sum of all weight-history products [17]. The original design required many calculations and only took into account the global history. A more accurate version is the multiperspective perceptron [18], which directly adds weights that are indexed with different features (global history, branch address, recency, etc.). The prediction uses fewer calculations:

Prediction = Σ_{i=1}^{N} W_i(f_i)

where W_i are the feature tables with the weights, and f_i is the value of a specific feature. We focus on this version of the perceptron in the rest of this work.
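The two formulas above can be made concrete with a short sketch. The table count, minimum history length and geometric factor below are arbitrary examples rather than the configuration evaluated here:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Geometric history lengths used to index the tagged TAGE tables:
// L(i) = int(a^(i-1) * L(1) + 0.5), following the formula above.
std::vector<int> tageHistoryLengths(int numTables, double l1, double a) {
    std::vector<int> lengths;
    for (int i = 1; i <= numTables; ++i)
        lengths.push_back(static_cast<int>(std::pow(a, i - 1) * l1 + 0.5));
    return lengths;
}

// Multiperspective perceptron: Prediction = sum_i W_i(f_i); the branch is
// predicted taken when the sum of the selected weights is non-negative.
bool perceptronPredict(const std::vector<std::vector<int>>& featureTables,
                       const std::vector<int>& featureIndices) {
    int sum = 0;
    for (size_t i = 0; i < featureTables.size(); ++i)
        sum += featureTables[i][featureIndices[i]];
    return sum >= 0;
}

int main() {
    // With L(1)=5 and a=2, four tables use history lengths 5, 10, 20, 40.
    for (int len : tageHistoryLengths(4, 5.0, 2.0))
        std::printf("%d ", len);
    std::printf("\n");
}
```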



[Fig. 3: Our design makes the InO core's predictor identical to the base predictor used in the OoO TAGE. The base predictor can either be the bimodal used in the original TAGE (b), or be swapped for a small perceptron predictor (c). Panels: (a) the transfer mechanism between cores; (b) the architecture of a bimodal predictor; (c) the design of the perceptron predictor.]

III. BRANCH PREDICTOR STATE TRANSFER

As shown in Figure 1, to unlock the benefits of high frequency migrations in heterogeneous cores, the branch predictor needs to maintain high accuracy after each switch. However, sharing the branch predictor is not feasible for two reasons. First, the latency added to a multiplexed predictor will eliminate any benefits gained by keeping the state between migrations, as the branch predictor is constantly invoked. Second, InO and OoO cores have been designed to operate at different performance/power points. OoO cores trade power and area for high performance, while InO cores remain small and energy efficient. As such, their front ends, and in particular their branch predictors, are designed differently. The OoO core needs a large and accurate branch predictor to support the deep speculation that unlocks instruction-level parallelism, while the InO core uses much smaller designs since it is less reliant on speculation. As a consequence, it is impossible to share the same design without under-provisioning the OoO core's accuracy or notably increasing the area and power consumption of the InO core for little added performance.

To achieve the benefits of sharing without its drawbacks, we propose a mechanism that transfers some state between the two predictors when a migration is triggered, eliminating the complexity stemming from component sharing. Furthermore, modern branch predictors combine predictions from multiple components. We identify common components between the two branch predictors (source and destination) and transfer a small part of the state that corresponds to the common components. Both the TAGE and perceptron predictors perform accurately when they are large enough to store enough state, as large as 64kB for high performance cores [16], [18]. Far smaller branch predictors, around 1-3kB for perceptrons, have been used successfully in modern InO cores [19].

Transferring state from one core to the other during a migration must be efficient for the benefits to outweigh the costs. Making the observation that the TAGE predictor (Figure 2) uses a fallback bimodal predictor, we consider a heterogeneous system that uses a bimodal branch predictor for the InO core and a TAGE predictor for the OoO core. This allows the state of the InO bimodal to be easily transferred to the OoO TAGE's identical bimodal fallback so that, when a migration occurs, execution resumes with high prediction accuracy. This is shown in Figures 3a and 3b.

Using a multiperspective perceptron as the high accuracy predictor for the OoO core makes transferring the state over to a smaller design significantly harder. This happens because for every prediction the weights from each feature need to be added to provide a valid prediction. Removing the extra features that exist only in the large perceptron would lead to imbalanced summations that could cause mispredictions. Removing entries across all features also suffers from the same problem, as each feature entry is used for multiple predictions when combined with the other table entries. However, one benefit of this type of predictor is that the state is densely stored and can achieve higher accuracy than conventional small sized predictors. We therefore focus our experiments on large, accurate predictors for the OoO cores and a smaller bimodal predictor for the InO core. Our branch predictor design for the OoO core uses a 64kB TAGE, while the InO core uses a 1.25kB bimodal predictor that matches the fallback predictor in the large core's TAGE predictor. As the TAGE predictor is modular, the base predictor can be replaced by another predictor with greater accuracy that can be better suited to the needs of the small InO core. With this starting point we also consider a design that swaps the base bimodal for a small perceptron predictor, as shown in Figure 3c. We experiment with two designs for the OoO core that use this hybrid (TAGE-P): one that replaces the 1.25kB bimodal with an equally sized perceptron, and one where we use a 3kB perceptron instead, directly matching a small perceptron in the InO core. Details of the evaluated predictors are shown in Table I.

Name       | Size         | Details
bimodal    | 1.25kB       | 8kbits counter, 2kbits hysteresis
perceptron | 1.25kB / 3kB | 8 features: GHIST (2), GHISTPATH (3), IMLI, GHISTMODPATH, RECENCYPOS
TAGE       | 64kB         | Loop Predictor, Statistical Corrector, 1.25kB bimodal base predictor
TAGE-P     | 64kB / 66kB  | Loop Predictor 156B, Statistical Corrector 8kB, 1.25kB / 3kB perceptron base predictor

TABLE I: The evaluated branch predictors.

[Fig. 4: Performance degradation relative to the same core that runs uninterrupted in MiBench for different sharing schemes. Migrations are performed between cores that share all the caches and the TLB. (a) InO performance breakdown when migrating without transferring the BP state (colored bars) and with the BP state (white bars); (b) the same breakdown for the OoO core. Relative performance is shown per MiBench benchmark for migration periods from 1k to 10M instructions.]

OoO Core            | Value
Pipeline Width      | 3-wide
Pipeline Depth      | 15
Operating Frequency | 1 GHz

In-Order Core       | Value
Pipeline Width      | 2-wide
Pipeline Depth      | 8
Operating Frequency | 1 GHz

Shared Resources    | Value
L1D Size            | 32kB
L1I Size            | 32kB
L2 Size             | 1MB
L2 Associativity    | 16-way
L2 Prefetcher       | Stride

TABLE II: The specifications of the system used in our experiments.
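A minimal sketch of the transfer step described above, assuming a bimodal base predictor: on a migration only the base tables are copied between cores, while the TAGE tagged tables, loop predictor and statistical corrector are simply reset and re-warm while running. All class and field names here are hypothetical illustrations, not the actual implementation:

```cpp
#include <array>
#include <cstdint>

// Hypothetical bimodal base predictor, roughly matching the 1.25kB entry in
// Table I: 2-bit counters plus hysteresis (stored one per byte here for
// simplicity rather than packed).
struct BimodalBase {
    std::array<uint8_t, 4096> counters{};   // ~8 kbits of 2-bit counters (packed)
    std::array<uint8_t, 2048> hysteresis{}; // ~2 kbits of hysteresis bits (packed)
};

// The InO core's whole predictor is just the base; the OoO TAGE wraps the
// same base layout as its fallback predictor.
struct InOPredictor { BimodalBase base; };

struct OoOTagePredictor {
    BimodalBase base;  // identical layout to the InO predictor
    // ... tagged tables, loop predictor, statistical corrector omitted ...
    void resetLargeComponents() { /* cleared on migration; rebuilt while running */ }
};

// On an InO -> OoO migration, copy only the common (base) state.
void migrateInOToOoO(const InOPredictor& src, OoOTagePredictor& dst) {
    dst.base = src.base;         // ~1.25kB moved over the interconnect
    dst.resetLargeComponents();  // large tagged tables are not transferred
}

// The reverse direction copies the base back to the InO core.
void migrateOoOToInO(const OoOTagePredictor& src, InOPredictor& dst) {
    dst.base = src.base;
}

int main() {
    InOPredictor little;
    OoOTagePredictor big;
    migrateInOToOoO(little, big);  // task moves to the OoO core
    migrateOoOToInO(big, little);  // and later back to the InO core
}
```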

IV. EVALUATION & RESULTS

We present two studies that, combined, show the impact of migrations without the branch predictor state, and how transferring minimal state can reduce the overheads significantly.

Calculating the branch predictor impact when migrating

We perform an experiment that measures the performance degradation when migrating without keeping any branch predictor state. To achieve this, we use gem5 [7] to simulate an Arm system. We sample its execution at regular intervals to simulate frequent context switches. For each sample we clone the simulator, running the execution twice:

1) We flush the pipeline and the branch predictor to simulate a system which has just migrated a task to another (identical) core without any branch predictor state.
2) We flush only the pipeline to simulate a system which has just migrated a task to another (identical) core while transferring the complete branch predictor state.

Upon cloning the simulation state, the original simulation continues, periodically records the performance of the system and creates new clones until it reaches the end of execution. A cloned simulation performs a migration, executes for one period and then terminates. The migration is approximated by clearing all the private and non-transferable state in an otherwise identical system. This is done in the simulator without the knowledge of the simulated operating system, effectively isolating the hardware cost of a migration from the software overhead.

The data collected from the clone of the simulator is paired to the main run to create samples that are completely aligned. This allows for a direct comparison of the performance, as shown in Figure 1. For each sample taken, we compare the instructions per cycle (IPC) from both simulations of each phase to determine the relative performance IPC_main/IPC_migration. We perform the experiment for both OoO and InO cores, comparing two systems: one that shares all the components (caches, TLBs and the branch predictor), and another that does not share the branch predictor. Additionally, we repeat the experiment for different migration periods ranging from 1k to 10M instructions. Details of the simulated system are shown in Table II.

From Figure 4 it can be seen that while the InO core loses at most roughly 12% of its performance when not migrating the branch predictor state, the OoO core loses as much as 38% of its performance when performing migrations as frequently as every 1000 instructions. This is because the InO core is 2-wide and cannot reorder instructions to mask memory latency, and is therefore heavily affected by cache misses. On the other hand, the wider OoO core is able to speculate and uncover instruction-level parallelism to mask misses. However, to do so, it needs accurate speculation to break dependency chains to feed the pipeline. Effectively, the more aggressive the core, the larger the need for accurate branch prediction. These results make evident that while the InO core can cope with migrations every 1k instructions, the performance lost by the OoO core prohibits such systems if the branch predictor state is not "warm".
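For concreteness, the relative-performance metric described above can be sketched as follows. The sample structure and the direction of the ratio (values below 1.0 meaning the migrated run is slower, as plotted in Figure 4) are our reading of the text, not code from the gem5 setup:

```cpp
#include <cstdio>
#include <vector>

// One paired sample: IPC of the uninterrupted (main) run and of the cloned run
// that performed a migration at the same point in execution.
struct PairedSample {
    double ipc_main;       // reference run, state kept
    double ipc_migration;  // cloned run, migrated (state cleared)
};

// Average relative performance across all aligned samples; values below 1.0
// indicate a slowdown of the migrated run relative to the uninterrupted run.
double averageRelativePerformance(const std::vector<PairedSample>& samples) {
    if (samples.empty()) return 0.0;
    double sum = 0.0;
    for (const auto& s : samples)
        sum += s.ipc_migration / s.ipc_main;
    return sum / samples.size();
}

int main() {
    std::vector<PairedSample> samples = {{1.6, 1.0}, {1.5, 1.4}};  // made-up values
    std::printf("relative performance: %.2f\n", averageRelativePerformance(samples));
}
```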

[Fig. 5 ("Component Contribution to MPKI of 64kB TAGE"): A component-wise breakdown of the misprediction rate for TAGE across the range of migration periods evaluated (1k to 100M instructions); the y-axis shows the normalized MPKI increase per set of branch predictor components retained. While the TAGE tables (T) unlock almost all of the accuracy, a transfer of 55kB is prohibitive. Transferring the bimodal base predictor offers the best MPKI decrease per kB, especially for small migration periods.]


When transferring the branch predictor state, the OoO core practically eliminates the performance degradation, as it recovers 35% of the lost performance. Compared to a system that does not migrate, the overhead of migration is less than 3%. This remaining overhead is due to the empty pipeline and reorder buffer. When comparing migrations between identical InO cores that transfer the branch predictor state, we observe that the improvement is virtually negligible, as less than 1% of performance is reclaimed, resulting in a 7% performance loss compared to an InO core with no migrations. Comparing InO and OoO, migrations become more efficient for the OoO core when transferring the branch predictor state, even for migrations with a 1k instruction period.

Transferring branch predictor state

To reduce the effects of the branch predictor, we evaluate the mispredictions per kilo instruction (MPKI) of TAGE-SC-L [16]. We use the championship branch prediction framework to assess the improvements in accuracy when migrating at finer granularities [16], [18], [20]. The framework uses over 250 traces from mobile and server applications, both long and short, to cover a wide variety of cases [20]. We modify the framework to allow flushing of the branch predictor state, in full or only of selected components. This allows us to dissect how each component contributes to the prediction for different migration frequencies. As seen in Figure 5, we dissect how each component of TAGE contributes to the misprediction rate when it is transferred over during a migration. Results show that retaining the TAGE tables delivers close to maximum accuracy. However, to retain the tables across migrations, 55kB of data need to be transferred, making fine grain migrations impractical. The bimodal and the statistical corrector reduce MPKI comparably, by 32% and 44% respectively for 1k-instruction migrations, for example. However, the statistical corrector requires 7.3kB of state to be transferred to achieve this, as opposed to the 1.25kB required by the bimodal. Overall, the bimodal is the component with the highest MPKI reduction per kB.

In Figure 6 we flush the entire 64kB TAGE predictor across different migration periods and compare it with our designs: TAGE-B, which retains the bimodal, and TAGE-P, which retains a base perceptron predictor (1.25kB and 3kB variants). Even when preserving minimal state, the accuracy of the TAGE predictor is dramatically improved, especially for fine grain migrations, where the MPKI reduction is in the best case 62% using the TAGE predictor with the 3kB base perceptron (TAGE-P 66kB). Figure 7 shows the total cost of a migration for an OoO core, taking into account the transfer of state, as a cycles per instruction (CPI) stack. When the BP state is retained, the CPI overhead is reduced by 0.43. Assuming a 256-bit bus for the transfer, the CPI increase for the 1.25kB and 3kB base predictors is 0.04 and 0.12 respectively for 1k-instruction migrations. For coarser migrations the transfer overhead is marginal and does not affect performance.

[Fig. 6 ("Accuracy of TAGE Transferring State"): Mispredictions (MPKI) of TAGE variants that transfer the base predictor, compared to a TAGE that does not transfer any state upon migration, across migration periods from 1k to 100M instructions. The benefits of transferring become increasingly noticeable with more frequent migrations.]

[Fig. 7 ("OoO Migration and Transfer Overheads"): CPI stacks showing the breakdown of the overheads with and without branch predictor state. When factoring in the transfer cost, using a fast SRAM cache bus, a 3kB perceptron can be transferred without prohibitive overheads.]
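As a rough back-of-the-envelope check of the transfer cost discussed above, the sketch below converts a base-predictor payload into bus transfers on a 256-bit link and amortizes it over a migration period. The one-transfer-per-cycle assumption is ours, and the numbers it produces are only indicative, not the paper's measured CPI stacks:

```cpp
#include <cstdio>

// Cycles to move `payload_bytes` over a bus `bus_bits` wide, assuming one
// bus transfer completes per cycle (a simplifying assumption on our part).
long transferCycles(long payload_bytes, int bus_bits) {
    long bytes_per_beat = bus_bits / 8;
    return (payload_bytes + bytes_per_beat - 1) / bytes_per_beat;  // round up
}

int main() {
    const long period = 1000;  // instructions between migrations
    for (long bytes : {1280L, 3072L}) {  // ~1.25kB and ~3kB base predictors
        long cycles = transferCycles(bytes, 256);
        // Worst case: the whole transfer is serialized against execution.
        std::printf("%ld bytes: %ld cycles, worst-case CPI overhead %.3f\n",
                    bytes, cycles, static_cast<double>(cycles) / period);
    }
}
```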
V. RELATED WORK

Recent studies try to improve efficiency in heterogeneous multiprocessors without sacrificing performance [1]–[6], [10]–[12]. Most of the proposals use hardware migrations to shift execution between high performance and energy efficiency. Such systems can lead to significant energy savings, as much as 36% [1], [3], [4]. To accomplish fast migrations, designs propose sharing front-ends between cores to eliminate the need for large transfers. However, other research shows that, when accounting for the multiplexing and wiring overheads of shared components, performance actually drops [1], [9]. Sharing the L2 caches and part of the translations [14] has been found to be more beneficial than full sharing, practically eliminating migration overheads in InO cores [1]. In our work, we find that the branch predictor also factors into the migration overheads, especially for frequent migrations to OoO cores.
To the best of our knowledge, transferring some part of the branch predictor state to reduce these effects has not been researched before.

VI. CONCLUSIONS

Heterogeneous multiprocessors traditionally feature OoO and InO cores, enabling operation at high performance and high energy efficiency respectively. By leveraging migrations, such systems can improve efficiency, executing on the InO core for memory bound phases with low ILP and switching to the OoO core when it can deliver high performance. Studies show that high frequency core switching could exploit smaller execution phases and save as much as 36% energy without sacrificing performance. However, these savings cannot currently be achieved due to the overhead of migrations. Sharing the L2 cache and the L2 TLB eliminates part of the overhead, but after switching, the cold branch predictor causes a notable performance drop. In this work, we focus on the branch predictor, which cannot be shared, as sharing adds latency to the critical path of execution and because InO and OoO cores require differently designed predictors. Instead, we propose transferring a subset of the OoO core's predictor state to the InO core and vice versa to reduce mispredictions. Our results show that, even for migrations as frequent as every 1k instructions, this can reduce mispredictions by 62% on average and help recover as much as 35% of the performance lost otherwise.

REFERENCES

[1] I. Vougioukas, A. Sandberg, S. Diestelhorst, B. Al-Hashimi, and G. Merrett, “Nucleus: Finding the sharing limit of heterogeneous cores,” ACM Transactions on Embedded Computing Systems, vol. 16, no. 5s, 2017.
[2] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke, “Composite cores: Pushing heterogeneity into a core,” in Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pp. 317–328, 2012.
[3] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke, “DynaMOS,” in Proceedings of the 48th International Symposium on Microarchitecture - MICRO-48, pp. 322–333, 2015.
[4] C. Fallin, C. Wilkerson, and O. Mutlu, “The heterogeneous block architecture,” in 2014 32nd IEEE International Conference on Computer Design, ICCD 2014, pp. 386–393, 2014.
[5] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. G. Dreslinski, T. F. Wenisch, and S. Mahlke, “Exploring fine-grained heterogeneity with composite cores,” IEEE Transactions on Computers, vol. 65, no. 2, pp. 535–547, 2016.
[6] A. Lukefahr, S. Padmanabha, R. Das, R. Dreslinski, T. F. Wenisch, and S. Mahlke, “Heterogeneous microarchitectures trump voltage scaling for low-power cores,” in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation - PACT ’14, pp. 237–250, 2014.
[7] N. Binkert, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, D. A. Wood, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, and T. Krishna, “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, p. 1, 2011.
[8] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, “MiBench: A free, commercially representative embedded benchmark suite,” in 2001 IEEE International Workshop on Workload Characterization, WWC 2001, pp. 3–14, IEEE, 2001.
[9] E. Rotenberg, B. H. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. Basu Roy Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon, “Rationale for a 3D heterogeneous multi-core processor,” in 2013 IEEE 31st International Conference on Computer Design, ICCD 2013, pp. 154–168, 2013.
[10] S. Navada, N. K. Choudhary, S. V. Wadhavkar, and E. Rotenberg, “A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors,” in Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pp. 133–144, IEEE, 2013.
[11] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. G. Dreslinski, T. F. Wenisch, and S. Mahlke, “Exploring fine-grained heterogeneity with composite cores,” IEEE Transactions on Computers, vol. 65, no. 2, pp. 535–547, 2016.
[12] Khubaib, M. A. Suleman, M. Hashemi, C. Wilkerson, and Y. N. Patt, “MorphCore: An energy-efficient microarchitecture for high performance ILP and high throughput TLP,” in Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pp. 305–316, 2012.
[13] B. Gopireddy, C. Song, J. Torrellas, N. S. Kim, A. Agrawal, and A. Mishra, “ScalCore: Designing a core for voltage scalability,” in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 681–693, 2016.
[14] A. Bhattacharjee and M. Martonosi, “Characterizing the TLB behavior of emerging parallel workloads on chip multiprocessors,” in Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pp. 29–40, 2009.
[15] A. Seznec and P. Michaud, “A case for (partially) TAgged GEometric history length branch prediction,” Journal of Instruction-Level Parallelism, vol. 8, pp. 1–23, 2006.
[16] A. Seznec, “TAGE-SC-L branch predictors,” JWAC-4: Championship Branch Prediction, 2014.
[17] D. Jimenez and C. Lin, “Dynamic branch prediction with perceptrons,” in Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture, pp. 197–206.
[18] D. A. Jiménez, “Multiperspective Perceptron Predictor,” JWAC-4: Championship Branch Prediction, 2014.
[19] Arm Ltd., “Arm Cortex-A55 Core - Technical Reference Manual,” 2017.
[20] “5th Championship Branch Prediction,” in 5th JILP Workshop on Computer Architecture Competitions (JWAC-5) - ISCA ’16, 2016.

Bibliography

5th JILP Workshop on Computer Architecture Competitions (JWAC-5) (2016). 5th Championship Branch Prediction.

Akram, S., Sartor, J. B., Craeynest, K. V., Heirman, W., and Eeckhout, L. (2016). Boosting the Priority of Garbage. ACM Transactions on Architecture and Code Optimization, 13(1):1–25.

Allan, T., Brumley, B. B., Falkner, K., van de Pol, J., and Yarom, Y. (2016). Amplifying side channels through performance degradation. In Proceedings of the 32nd Annual Conference on Computer Security Applications - ACSAC ’16, pages 422–435.

Arm Ltd. (2013). big.LITTLE Technology: The Future of Mobile. Arm white paper.

Arm Ltd., editor (2017a). Arm Architecture Reference Manual: ARMv8. Arm Ltd.

Arm Ltd. (2017b). Arm Cortex-A55 Core - Technical Reference Manual.

Barbalace, A., Murray, A., Lyerly, R., and Ravindran, B. (2014). Towards Operating System Support for Heterogeneous-ISA Platforms. In Workshop on Systems for Future Multicore Architectures (SFMA).

Baum, F., Raghuraman, A., and Baum, F. (2016). Making Full use of Emerging ARM-based Heterogeneous Multicore SoCs Making Full use of Emerging ARM-based Heterogeneous Multicore SoCs. ERTS 2016.

Becchi, M. and Crowley, P. (2008). Dynamic Thread Assignment on Heterogeneous Multiproces- sor Architectures. Journal of Instruction-Level Parallelism, 10:1–26.

Bhattacharjee, A., Lustig, D., and Martonosi, M. (2011). Shared last-level TLBs for chip multiprocessors. In Proceedings - International Symposium on High-Performance Computer Architecture, pages 62–73. IEEE.

Bhattacharjee, A. and Martonosi, M. (2009a). Characterizing the TLB behavior of emerging par- allel workloads on chip multiprocessors. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pages 29–40.

Bhattacharjee, A. and Martonosi, M. (2009b). Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. Proceedings - International Symposium on Computer Architecture (ISCA), 37(3):290.

Bhattacharjee, A. and Martonosi, M. (2010). Inter-core cooperative TLB for chip multiprocessors. ACM SIGPLAN Notices, 45(3):359.

Binkert, N., Sardashti, S., Sen, R., Sewell, K., Shoaib, M., Vaish, N., Hill, M. D., Wood, D. A., Beckmann, B., Black, G., Reinhardt, S. K., Saidi, A., Basu, A., Hestness, J., Hower, D. R., and Krishna, T. (2011). The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1.

Bischoff, A. S. (2016). User-experience-aware system optimisation for mobile systems. PhD thesis, University of Southampton.

Buck, I. and Hanrahan, P. (2003). Data Parallel Computation on Graphics Hardware. Elements.

Butko, A., Bruguier, F., Gamatié, A., Sassatelli, G., Novo, D., Torres, L., and Robert, M. (2016). Full-System Simulation of big.LITTLE Multicore Architecture for Performance and Energy Exploration. In Proceedings - IEEE 10th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2016, pages 201–208.

Butko, A., Gamatié, A., Sassatelli, G., Torres, L., and Robert, M. (2015). Design exploration for next generation high-performance manycore on-chip systems: Application to big.LITTLE architectures. In Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI, volume 07-10-July, pages 551–556.

Ceze, L., Tuck, J., Cascaval, C., and Torrellas, J. (2006). Bulk disambiguation of speculative threads in multiprocessors. Proceedings - International Symposium on Computer Architecture, 2006:227–238.

Chang, P. Y., Hao, E., Yeh, T. Y., and Patt, Y. (1996). Branch classification: A new mechanism for improving branch predictor performance. International Journal of Parallel Programming, 24(2):133–158.

Chaudhry, S., Caprioli, P., Yip, S., and Tremblay, M. (2005). High-performance throughput computing. IEEE Micro, 25(3):32–45.

Chaudhry, S., Cypher, R., Ekman, M., Karlsson, M., Landin, A., Yip, S., Zeffer, H., and Tremblay, M. (2009a). Rock: A high-performance CMT processor. IEEE Micro, 29(2):6–16.

Chaudhry, S., Cypher, R., Ekman, M., Karlsson, M., Landin, A., Yip, S., Zeffer, H., and Tremblay, M. (2009b). Simultaneous speculative threading. Proceedings of the 36th annual international symposium on Computer architecture - ISCA ’09.

Chen, X., Xu, Z., Kim, H., Gratz, P., Hu, J., Kishinevsky, M., and Ogras, U. (2012). In-network monitoring and control policy for DVFS of CMP networks-on-chip and last level caches. Proceedings of the 2012 6th IEEE/ACM International Symposium on Networks-on-Chip, NoCS 2012, 18(4):43–50.

Chien, A. A. and Karamcheti, V. (2013). Moore’s law: The first ending and a new beginning.

Lee, C.-C., Chen, I.-C., and Mudge, T. (2002). The bi-mode branch predictor. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 4–13. IEEE Comput. Soc.

Chronaki, K., Rico, A., Badia, R. M., and Ayguade,´ E. (2015). Criticality-aware dynamic task scheduling for heterogeneous architectures. Proceedings of the 29th.

Danowitz, A., Kelley, K., Mao, J., and Stevenson, J. P. (2012). CPU DB: Recording microprocessor history. Communications of the ACM, 55(55).

Dennard, R. H., Cai, J., and Kumar, A. (2007). A perspective on today’s scaling challenges and possible future directions. Solid-State Electronics, 51(4 SPEC. ISS.):518–525.

Devuyst, M., Venkat, A., and Tullsen, D. M. (2012). Execution migration in a heterogeneous-ISA chip multiprocessor. In ACM SIGARCH Computer Architecture News, volume 40, page 261, New York, New York, USA. ACM Press.

Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., and Burger, D. (2012). Dark silicon and the end of multicore scaling. IEEE Micro, 32(3):122–134.

Evers, M., Chang, P.-Y., and Patt, Y. N. (1996). Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches. In Proceedings of the 23rd annual international symposium on Computer architecture - ISCA ’96, volume 24, New York, New York, USA. ACM Press.

Evtyushkin, D., Riley, R., Abu-Ghazaleh, N. C., ECE, and Ponomarev, D. (2018). BranchScope. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS ’18, pages 693–707.

Eyerman, S. and Eeckhout, L. (2011). Fine-grained DVFS using on-chip regulators. ACM Transactions on Architecture and Code Optimization, 8(1):1–24.

Fallin, C., Wilkerson, C., and Mutlu, O. (2014). The heterogeneous block architecture. In 2014 32nd IEEE International Conference on Computer Design, ICCD 2014.

Fedorova, A., Saez, J. C., Shelepov, D., and Prieto, M. (2009). Maximizing power efficiency with asymmetric multicore systems. Communications of the ACM, 52(12):48.

Forbes, E., Zhang, Z., Widialaksono, R., Dwiel, B., Chowdhury, R. B. R., Srinivasan, V., Lipa, S., Rotenberg, E., Davis, W. R., and Franzon, P. D. (2016). Under 100-cycle thread migration latency in a single-ISA heterogeneous multi-core processor. In 2015 IEEE Hot Chips 27 Symposium, HCS 2015, pages 1–1. IEEE.

Goldstine, H. and Goldstine, A. (1996). The Electronic Numerical Integrator and Computer (ENIAC). IEEE Annals of the History of Computing, 18(1):10–16.

Google Developers (2013). Octane - The Benchmark Suite.

Gopireddy, B., Song, C., Torrellas, J., Kim, N. S., Agrawal, A., and Mishra, A. (2016). ScalCore: Designing a core for voltage scalability. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 681–693.

Graham, R. L., Knuth, D. E., and Patashnik, O. (1989). Concrete Mathematics: A Foundation for Computer Science, volume 2. Addison-Wesley.

Gupta, V., Brett, P., Koufaty, D., Reddy, D., and Hahn, S. (2012). The Forgotten ’Uncore’: On the Energy-Efficiency of Heterogeneous Cores. 2012 USENIX Federated Conferences Week, page 6.

Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. (2001). MiBench: A free, commercially representative embedded benchmark suite. In 2001 IEEE International Workshop on Workload Characterization, WWC 2001, pages 3–14. IEEE.

Gutierrez, A., Dreslinski, R. G., and Mudge, T. (2014). Evaluating private vs. shared last-level caches for energy efficiency in asymmetric multi-cores. In Proceedings - International Confer- ence on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS.

Hennessy, J. L. and Patterson, D. A. (2006). Computer Architecture, Fourth Edition: A Quantita- tive Approach. Elsevier.

Hill, M. D. and Marty, M. R. (2008). Amdahl’s Law in the Multicore Era. Computer, 41(July):33–38.

Ipek, E., Kirman, M., Kirman, N., and Martinez, J. F. (2007). Core fusion: accommodating software diversity in chip multiprocessors. SIGARCH Comput. Archit. News, 35(2):186–197.

Jacob, B., Ng, S. W., and Wang, D. T. (2008). Memory systems : cache, DRAM, disk. Morgan Kaufmann Publishers.

Jimenez, D., Keckler, S., and Lin, C. (2000). The impact of delay on the design of branch predictors. International Symposium on Microarchitecture, pages 67–76.

Jimenez, D. and Lin, C. (2001). Dynamic branch prediction with perceptrons. In Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture, pages 197–206.

Jimenez, D. A. (2003). Reconsidering complex branch predictors. In Proceedings - International Symposium on High-Performance Computer Architecture, volume 12, pages 43–52. IEEE Comput. Soc.

Jiménez, D. A. (2014a). Multiperspective Perceptron Predictor. JWAC-4: Championship Branch Prediction.

Jiménez, D. A. (2014b). Multiperspective Perceptron Predictor with TAGE. JWAC-4: Championship Branch Prediction.

Jiménez, D. A. and Lin, C. (2002). Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems, 20(4):369–397.

Kamil, S., Shalf, J., and Strohmaier, E. (2008). Power efficiency in high performance computing. Ipdps, pages 1–8.

Kampe, M., Stenström, P., and Dubois, M. (2002). The FAB predictor: Using Fourier analysis to predict the outcome of conditional branches. In Proceedings - International Symposium on High-Performance Computer Architecture, volume 2002-January, pages 223–232. IEEE Comput. Soc.

Karpuzcu, U. R., Greskamp, B., and Torrellas, J. (2009). The BubbleWrap many-core: Popping cores for sequential acceleration. 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO, -(December):447–458.

Khan, O. and Kundu, S. (2010). A Self-Adaptive Scheduler for Asymmetric Multi-cores. Proceedings of the 20th symposium on Great lakes symposium on VLSI - GLSVLSI ’10, page 397.

Khasawneh, K. N., Koruyeh, E. M., Song, C., Evtyushkin, D., Ponomarev, D., and Abu-Ghazaleh, N. (2018). SafeSpec: Banishing the Spectre of a Meltdown with Leakage-Free Speculation. ArXiv.

Khubaib (2014). Performance and Energy Efficiency via an Adaptive MorphCore Architecture. PhD thesis, University of Texas Austin.

Khubaib, Suleman, M. A., Hashemi, M., Wilkerson, C., and Patt, Y. N. (2012). MorphCore: An energy-efficient microarchitecture for high performance ILP and high throughput TLP. Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pages 305–316.

Kim, N. S., Austin, T., Blaauw, D., Mudge, T., Flautner, K., Hu, J. S., Jane Irwin, M., Kandemir, M., and Narayanan, V. (2003). Leakage Current: Moore’s Law Meets Static Power.

Kim, Y., Daly, R., Kim, J., Fallin, C., Lee, J. H., Lee, D., Wilkerson, C., Lai, K., and Mutlu, O. (2014). Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. In Proceedings - International Symposium on Computer Architecture, pages 361–372.

Kocher, P., Genkin, D., Gruss, D., Haas, W., Hamburg, M., Lipp, M., Mangard, S., Prescher, T., Schwarz, M., and Yarom, Y. (2018). Spectre Attacks: Exploiting Speculative Execution. ArXiv.

Kondo, M., Sasaki, H., and Nakamura, H. (2007). Improving fairness, throughput and energy-efficiency on a chip multiprocessor through DVFS. ACM SIGARCH Computer Architecture News, 35(1):31.

Koufaty, D., Reddy, D., and Hahn, S. (2010). Bias Scheduling in Heterogeneous Multicore Architectures. Proceedings of the 5th European conference on Computer systems, pages 125–138.

Kumar, R., Farkas, K. I., Jouppi, N. P., Ranganathan, P., and Tullsen, D. M. (2003). Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In Proceedings of the Annual International Symposium on Microarchitecture, MICRO, volume 2003-Janua, pages 81–92.

Kumar, R., Tullsen, D. M., and Jouppi, N. P. (2006). Core architecture optimization for hetero- geneous chip multiprocessors. Proceedings of the 15th international conference on Parallel architectures and compilation techniques - PACT ’06, page 23.

Kumar, R., Tullsen, D. M., Ranganathan, P., Jouppi, N. P., and Farkas, K. I. (2004). Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. Proceed- ings of the 31st annual international symposium on Computer architecture, page 64.

Lee, J. K. F. and Smith, A. J. (1984). Branch Prediction Strategies and Branch Target Buffer Design. Computer, 17(1):6–22.

Lipp, M., Schwarz, M., Gruss, D., Prescher, T., Haas, W., Mangard, S., Kocher, P., Genkin, D., Yarom, Y., and Hamburg, M. (2018). Meltdown. ArXiv e-prints.

Liu, F., Yarom, Y., Ge, Q., Heiser, G., and Lee, R. B. (2015). Last-level cache side-channel attacks are practical. In Proceedings - IEEE Symposium on Security and Privacy, volume 2015-July, pages 605–622.

Loh, G. H. (2006). Revisiting the performance impact of branch predictor latencies. In Perfor- mance Analysis of Systems and Software, 2006 IEEE International Symposium on IEEE, pages 59–69. IEEE.

Loh, G. H., Henry, D. S., Krishnamurthy, A., and Edu, Y. (2003). Exploiting Bias in the Hysteresis Bit of 2-bit Saturating Counters in Branch Predictors. Journal of Instruction-Level Parallelism, 5:1–32.

Lukefahr, A., Padmanabha, S., Das, R., Dreslinski, R., Wenisch, T. F., and Mahlke, S. (2014). Heterogeneous microarchitectures trump voltage scaling for low-power cores. Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT ’14, pages 237–250.

Lukefahr, A., Padmanabha, S., Das, R., Sleiman, F. M., Dreslinski, R., Wenisch, T. F., and Mahlke, S. (2012). Composite cores: Pushing heterogeneity into a core. In Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pages 317–328.

Lukefahr, A., Padmanabha, S., Das, R., Sleiman, F. M., Dreslinski, R. G., Wenisch, T. F., and Mahlke, S. (2016). Exploring fine-grained heterogeneity with composite cores. IEEE Transactions on Computers, 65(2):535–547.

Lustig, D., Bhattacharjee, A., and Martonosi, M. (2013). TLB Improvements for Chip Multipro- cessors. ACM Transactions on Architecture and Code Optimization, 10(1):1–38.

Mathur, V. K. (1991). How Well Do We Know Pareto Optimality? The Journal of Economic Education, 22(2):172.

McFarling, S. (1993). Combining Branch Predictors. Western Research Laboratory, (June):1–29.

Miftakhutdinov, R., Ebrahimi, E., and Patt, Y. N. (2012). Predicting performance impact of DVFS for realistic memory systems. Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pages 155–165.

Mittal, S. (2016). A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors. ACM Computing Surveys, 48(3, February 2016).

Mittal, S. (2019). A survey of techniques for dynamic branch prediction. Concurrency Computa- tion, 31(1).

Moore, G. E. (1998). Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85.

Mucci, P. J., London, K., and Thurman, J. (1998). The CacheBench Report. CEWES MSRC/PET, 19(TR/98-25).

Navada, S., Choudhary, N. K., Wadhavkar, S. V., and Rotenberg, E. (2013). A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pages 133–144. IEEE.

Nikov, K., Nunez-Yanez, J. L., and Horsnell, M. (2015). Evaluation of Hybrid Run-Time Power Models for the ARM Big.LITTLE Architecture. 2015 IEEE 13th International Conference on Embedded and Ubiquitous Computing, pages 205–210.

Padmanabha, S. and Lukefahr, A. (2013). Trace based phase prediction for tightly-coupled heterogeneous cores. Proceedings of the International Symposium on Microarchitecture, pages 445–456.

Padmanabha, S., Lukefahr, A., Das, R., and Mahlke, S. (2015). DynaMOS. In Proceedings of the 48th International Symposium on Microarchitecture - MICRO-48, pages 322–333.

Pannuto, P., Lee, Y., Kuo, Y. S., Foo, Z. Y., Kempke, B., Kim, G., Dreslinski, R. G., Blaauw, D., and Dutta, P. (2016). MBus: A System Integration Bus for the Modular Microscale Computing Class. IEEE Micro, 36(3):60–70.

Patsilaras, G., Choudhary, N. K., and Tuck, J. (2012). Efficiently Exploiting Memory Level Parallelism on Asymmetric Coupled Cores in the Dark Silicon Era. ACM Trans. Architec. Code Optim, 8(21).

Pell, O. and Mencer, O. (2011). Surviving the end of frequency scaling with reconfigurable dataflow computing. ACM SIGARCH Computer Architecture News, 39(4):60.

Pham, B., Bhattacharjee, A., Eckert, Y., and Loh, G. H. (2014). Increasing TLB reach by exploiting clustering in page translations. In Proceedings - International Symposium on High-Performance Computer Architecture, pages 558–567. IEEE.

Pham, B., Vaidyanathan, V., Jaleel, A., and Bhattacharjee, A. (2012). CoLT: Coalesced large-reach TLBs. In Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pages 258–269. IEEE.

Pricopi, M., Muthukaruppan, T. S., Venkataramani, V., Mitra, T., and Vishin, S. (2013). Power- performance modeling on asymmetric multi-cores. 2013 International Conference on Compil- ers, Architecture and Synthesis for Embedded Systems, CASES 2013.

Prodromou, A., Venkat, A., and Tullsen, D. M. (2019). Deciphering Predictive Schedulers for Heterogeneous-ISA Multicore Architectures. In Proceedings of the 10th International Work- shop on Programming Models and Applications for Multicores and Manycores - PMAM’19, pages 51–60, New York, New York, USA. ACM Press.

Prout, A., Arcand, W., Bestor, D., Bergeron, B., Byun, C., Gadepally, V., Houle, M., Hubbell, M., Jones, M., Klein, A., Michaleas, P., Milechin, L., Mullen, J., Rosa, A., Samsi, S., Yee, C., Reuther, A., and Kepner, J. (2018). Measuring the Impact of Spectre and Meltdown. ArXiv, pages 1–5.

Bryant, R. E. and O’Hallaron, D. R. (2015). Computer Systems: A Programmer’s Perspective. Prentice Hall.

Rangan, K. K., Wei, G.-Y., and Brooks, D. (2009). Thread motion: Fine-Grained Power Management for Multi-Core Systems. Proceedings of the 36th annual international symposium on Computer architecture - ISCA ’09, page 302.

Reddy, B. K., Walker, M. J., Balsamo, D., Diestelhorst, S., Al-Hashimi, B. M., and Merrett, G. V. (2017). Empirical CPU power modelling and estimation in the gem5 simulator. In 2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation, PATMOS 2017, volume 2017-Janua, pages 1–8. IEEE.

Rotenberg, E., Dwiel, B. H., Forbes, E., Zhang, Z., Widialaksono, R., Basu Roy Chowdhury, R., Tshibangu, N., Lipa, S., Davis, W. R., and Franzon, P. D. (2013). Rationale for a 3D heterogeneous multi-core processor. In 2013 IEEE 31st International Conference on Computer Design, ICCD 2013, pages 154–168. IEEE.

Rubio, J. and Vijaykrishnan, N. (2007). OS-Aware Branch Prediction: Improving Microprocessor Control Flow Prediction for Operating Systems. IEEE Transactions on Computers, 56(1):2–17.

Sadeghi, A., Raahemifar, K., Fathy, M., and Asad, A. (2015). Lighting the Dark-Silicon 3D Chip Multi-processors by Exploiting Heterogeneity in . Proceedings - IEEE

9th International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2015, pages 182–186.

Sanchez, D. and Kozyrakis, C. (2013). ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems. Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA ’13, 41:475.

Sandberg, A., Nikoleris, N., Carlson, T. E., Hagersten, E., Kaxiras, S., and Black-Schaffer, D. (2015). Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed. In 2015 IEEE International Symposium on Workload Characterization, pages 183–192.

Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Burger, D., Keckler, S. W., and Moore, C. (2003). Exploiting ILP, TLP, and DLP with the polymorphous . IEEE Micro, 23(6).

Sankaralingam, K., Nagarajan, R., McDonald, R., Desikan, R., Drolia, S., Govindan, M. S., Gratz, P., Gulati, D., Hanson, H., Kim, C., Liu, H., Ranganathan, N., Sethumadhavan, S., Sharif, S., Shivakumar, P., Keckler, S. W., and Burger, D. (2006). Distributed microarchitectural protocols in the TRIPS prototype processor. Proceedings of the Annual International Symposium on Microarchitecture, MICRO, -(January 2016):480–491.

Sarma, S. and Dutt, N. (2015). Cross-Layer Exploration of Heterogeneous Multicore Processor Configurations. 2015 28th International Conference on VLSI Design, -(JANUARY):147–152.

Sarma, S., Muck, T., Bathen, L. A. D., Dutt, N., and Nicolau, A. (2015). SmartBalance. Proceedings of the 52nd Annual Design Automation Conference on - DAC ’15, -(October):1–6.

Sasaki, H., Ikeda, Y., Kondo, M., and Nakamura, H. (2007). An intra-task dvfs technique based on statistical analysis of hardware events. In Proceedings of the 4th Internationa Conference on Computing Frontiers, page 123.

Schumpeter, J. A. (1949). Vilfredo Pareto (1848-1923). The Quarterly Journal of Economics, 63(2):147–173.

Seznec, A. (2005). Analysis of the o-geometric history length branch predictor. In Proceedings - International Symposium on Computer Architecture, pages 394–405.

Seznec, A. (2011). A 64 kbytes ISL-TAGE branch predictor. JWAC-2: Championship Branch Prediction, pages 2–5.

Seznec, A. (2014). TAGE-SC-L branch predictors. JWAC-4: Championship Branch Prediction.

Seznec, A. and Michaud, P. (2006). A case for (partially) TAgged GEometric history length branch prediction. Journal of Instruction-Level Parallelism JILP, 8:1–23.

Shelepov, D., Saez Alcaide, J. C., Jeffery, S., Fedorova, A., Perez, N., Huang, Z. F., Blagodurov, S., and Kumar, V. (2009). Hass: A Scheduler for Heterogenous Multicore Systems. ACM SIGOPS Operating Systems Review, 43(2):66.

Smith, J. E. (2003). A study of branch prediction strategies. In 25 years of the international symposia on Computer architecture (selected papers) - ISCA ’98, pages 202–215, New York, New York, USA. ACM Press.

Spiliopoulos, V., Kaxiras, S., and Keramidas, G. (2011). Green governors: A framework for continuously adaptive DVFS. 2011 International Green Computing Conference and Workshops, IGCC 2011.

Srinivasan, S., Kurella, N., Koren, I., and Kundu, S. (2016). Exploring heterogeneity within a core for improved power efficiency. IEEE Transactions on Parallel and Distributed Systems, 27(4):1057–1069.

Srinivasan, S., Zhao, L., Illikkal, R., and Iyer, R. (2011). Efficient interaction between OS and architecture in heterogeneous platforms. ACM SIGOPS Operating Systems Review, 45(1):62.

Stallings, W. (2011). Operating Systems: Internals and Design Principles. Prentice Hall.

Tanenbaum, A. S. and Bos, H. (2014). Modern Operating Systems, volume 2. Pearson.

Tarjan, D. and Skadron, K. (2005). Merging path and gshare indexing in perceptron branch prediction. ACM Transactions on Architecture and Code Optimization, 2(3):280–300.

Tarsa, S. J., Kumar, A. P., and Kung, H. T. (2014). Workload prediction for adaptive power scaling using deep learning. 2014 IEEE International Conference on IC Design & Technology, pages 1–5.

Tomusk, E., Dubach, C., and O’Boyle, M. (2014). Measuring flexibility in single-ISA heteroge- neous processors. Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT ’14, pages 495–496.

Tomusk, E., Dubach, C., and O’Boyle, M. (2015). Diversity: A Design Goal for Heterogeneous Processors. IEEE Computer Architecture Letters, pages 1–1.

Tomusk, E. and O’Boyle, M. (2013). Weak heterogeneity as a way of adapting multicores to real workloads. In Proceedings of the 3rd International Workshop on Adaptive Self-Tuning Computing Systems - ADAPT ’13, pages 1–3, New York, New York, USA. ACM Press.

Toumey, C. (2016). Less is Moore. Nature Nanotechnology, 11(1):2–3.

Venkat, A. (2019). Composite-ISA Cores : Enabling Multi-ISA Heterogeneity Using a Single ISA. In High Performance Computer Architecture (HPCA). HPCA.

Venkat, A. and Tullsen, D. M. (2014). Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor. In Proceedings - International Symposium on Computer Architecture, pages 121–132.

Venkat, A. and Tullsen, D. M. (2016). HIPStR – Heterogeneous-ISA Program State . In ASPLOS ’16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pages 727–741.

Vet, J. Y., Carribault, P., and Cohen, A. (2013). Multigrain Affinity for Heterogeneous Work Stealing. Conference on High-Performance and Embedded Architectures and (HiPEAC ’12).

Vougioukas, I., Sandberg, A., Diestelhorst, S., Al-Hashimi, B., and Merrett, G. (2017). Nucleus: Finding the sharing limit of heterogeneous cores. ACM Transactions on Embedded Computing Systems, 16(5s).

Vougioukas, I., Sandberg, A., Diestelhorst, S., Al-Hashimi, B., and Merrett, G. (2018). Will it blend? Merging heterogeneous cores. In Adaptive Many-Core Architectures and Systems Workshop.

Vougioukas, I., Sandberg, A., Nikoleris, N., Diestelhorst, S., Al-Hashimi, B., and Merrett, G. (2019). BRB: Mitigating Branch Predictor Side-Channels. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).

Walker, M., Bischoff, S., Diestelhorst, S., Merrett, G., and Al-Hashimi, B. (2018a). Hardware- Validated CPU Performance and Energy Modelling. In Proceedings - 2018 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2018, pages 44–53.

Walker, M., Diestelhorst, S., Merrett, G., and Al-Hashimi, B. (2018b). Accurate and stable empirical CPU power modelling for multi- and many-core systems. In Adaptive Many-Core Architectures and Systems Workshop.

Weste, N. H. and Eshraghian, K. (2010). Principles of CMOS VLSI Design. Principles of CMOS VLSI Design, 2:231–237.

Yan, M., Choi, J., Skarlatos, D., Morrison, A., Fletcher, C., and Torrellas, J. (2018). Invisispec: Making speculative execution invisible in the cache hierarchy. In Proceedings of the Annual International Symposium on Microarchitecture, MICRO, volume 2018-Octob, pages 428–441. IEEE.

Yarom, Y. and Falkner, K. (2014). Flush + Reload : a High Resolution , Low Noise , L3 Cache Side-Channel Attack. USENIX Security 2014, pages 1–14.

Yeh, T.-Y. and Patt, Y. N. (2003). Two-level adaptive training branch prediction. In Proceedings of the 24th annual international symposium on Microarchitecture - MICRO 24, pages 51–61, New York, New York, USA. ACM Press.

Yeh, T.-Y. and Patt, Y. N. (2004a). A comparison of dynamic branch predictors that use two levels of branch history. ACM SIGARCH Computer Architecture News, 21(2):257–266.

Yeh, T.-Y. and Patt, Y. N. (2004b). Alternative implementations of two-level adaptive branch prediction. ACM SIGARCH Computer Architecture News, 20(2):124–134.

Yoo, S., Shim, Y., Lee, S., Lee, S.-a., and Kim, J. (2015). A case for bad big.LITTLE switching. HotPower, pages 1–5.

Young, C. and Smith, M. D. (2004). Improving the accuracy of static branch prediction using branch correlation. ACM SIGOPS Operating Systems Review, 28(5):232–241.

Yu, K., Han, D., Youn, C., Hwang, S., and Lee, J. (2013). Power-aware task scheduling for big.LITTLE mobile processor. In ISOCC 2013 - 2013 International SoC Design Conference, pages 208–212. IEEE.