
UNIVERSITY OF SOUTHAMPTON

FACULTY OF ENGINEERING AND PHYSICAL SCIENCES Electronics and Computer Science

Efficient and Secure Context Switching and Migrations for Heterogeneous Multiprocessors

by

Ilias Vougioukas

Thesis for the degree of Doctor of Philosophy

June 3, 2019

UNIVERSITY OF SOUTHAMPTON

ABSTRACT

FACULTY OF ENGINEERING AND PHYSICAL SCIENCES Electronics and Computer Science

Doctor of Philosophy

EFFICIENT AND SECURE CONTEXT SWITCHING AND MIGRATIONS FOR HETEROGENEOUS MULTIPROCESSORS

by Ilias Vougioukas

Modern mobile processors are constrained by their limited energy resources and by demanding applications that require fast execution. Single-core designs use voltage/frequency throttling techniques that allow the system to switch between performant and efficient configurations to address this problem. Heterogeneous multicores also need to decide on which core to run, rather than just adjust voltage and frequency. Consequently, they remain an open subject in terms of near-optimal design and operation. This thesis investigates the performance and energy trade-off when migrating between heterogeneous cores and, through a series of contributions, presents designs that enable low overhead transitions between cores.

The first contribution is based on a novel methodology that can directly compare the execution of heterogeneous cores. With it, an in-depth investigation of the effects on the memory system when migrating between such cores is conducted. The analysis reveals that heterogeneous multiprocessor system slowdown is asymmetrical: In-Order core performance relies on the memory subsystem, while Out-of-Order execution depends on accurate speculation. A proposed design minimises migration overheads in In-Order cores without making the design prohibitively complex to implement. This is achieved by sharing only the larger caches and the translation state.

The second contribution is a design that transfers state when a migration occurs between heterogeneous cores. The design eliminates the warm-up for Out-of-Order cores by transferring only minimal state. This improves post-migration accuracy, potentially enabling better performance and energy efficiency.

Finally, security has become a major concern for shared or transferable components in multicore systems, namely the branch predictor. The third contribution in this thesis investigates mitigation techniques for recently discovered side-channel attacks. The proposed design flushes all but the most useful branch predictor state, ensuring isolation with minimal performance loss.

Contents

Declaration of Authorship xiii

Acknowledgements xv

Acronyms xix

Nomenclature xxi

1 Introduction
  1.1 Research Motivation
  1.2 Research Questions
  1.3 Contributions
  1.4 Outline

2 Heterogeneous Multiprocessors
  2.1 Heterogeneous Multiprocessor Types
    2.1.1 ISA Based Classification
    2.1.2 Commercial and Conventional HMPs
    2.1.3 Reasons for HMP Usage
  2.2 HMP Architecture and Microarchitecture
    2.2.1 HMP Space Exploration
    2.2.2 Architectural Optimisation in Conventional Designs
    2.2.3 Merging HMP Cores
    2.2.4 Unconventional Approaches to HMP Design
  2.3 HMP Management
    2.3.1 Estimation and Modelling Techniques
    2.3.2 Load Balancing and Work Stealing
    2.3.3 Management Using Learning
  2.4 Discussion

3 Heterogeneous Multiprocessor Memory Subsystem Optimisation
  3.1 Simulation vs Hardware Experimentation
  3.2 Exploration in Heterogeneous Architectures
    3.2.1 Exhaustive Exploration
    3.2.2 State Space Reduction
    3.2.3 Simulation Methodology for First-Order Migrations
  3.3 Heterogeneous Multiprocessor Organisation
    3.3.1 HMP Resource Sharing


    3.3.2 Internal Core Components
    3.3.3 Comparing Parallel Simulations
  3.4 Experimental Setup
    3.4.1 Infrastructure
  3.5 Migration Cost
    3.5.1 Scenario A: No Sharing
    3.5.2 Scenario B: Share Caches
    3.5.3 Scenario C: Share Caches and TLB
    3.5.4 Scenario D: Share the L2 and TLB
    3.5.5 Comparison of memory sharing schemes
  3.6 Efficient Migrations
  3.7 Discussion

4 Branch Predictor State Transfer for Heterogeneous Cores
  4.1 Branch Predictor Overview
    4.1.1 The Bimodal Predictor
    4.1.2 The TAGE Predictor
    4.1.3 Perceptron Predictors
  4.2 Branch Predictor State Transfer Mechanism
  4.3 Impact of the Branch Predictor State
    4.3.1 Calculating Potential Performance Gains
    4.3.2 Measuring Accuracy Improvements
    4.3.3 Calculating the transfer overheads
  4.4 Summary

5 Security Considerations for Shared Microarchitecture
  5.1 Speculative side-channels
    5.1.1 Branch predictor side-channels
    5.1.2 Mitigation techniques
    5.1.3 Threat model
  5.2 Predictor Flexibility
    5.2.1 Steady-state and transient accuracy
    5.2.2 The Branch Retention Buffer
    5.2.3 Perceptron amplified / reinforced TAGE
  5.3 Experimental Setup
  5.4 Results
    5.4.1 Quantifying transient accuracy
    5.4.2 ParTAGE
    5.4.3 ParTAGE results
  5.5 Discussion

6 Conclusions
  6.1 Summary of Conclusions
  6.2 Research Questions Evaluation
  6.3 Future work
    6.3.1 Research into HMP Scheduling
    6.3.2 Investigating Memory Sharing for HMPs with Emerging Accelerators

    6.3.3 Extending Branch Predictor State Transfer to Other Designs
    6.3.4 Securing Other Shared Resources

A Supplementary Material
  A.1 Migration Overhead Study
  A.2 State Space Reduction to Linear Proof

B Performance and Energy Efficiency
  B.1 Performance
    B.1.1 Parallel Performance
    B.1.2 Benchmarking
  B.2 Power and Energy Efficiency
    B.2.1 Processor Power Dissipation
    B.2.2 Methods of Power Loss Mitigation
      B.2.2.1 DVFS
      B.2.2.2 Core Migrations
  B.3 Performance - Energy Trade-offs
    B.3.1 Pareto Optimality
    B.3.2 The Pareto Frontier

C Selected Publications

Bibliography

List of Figures

1.1 Moore's Law depiction of the exponential increase in transistor count.
1.2 A visual depiction of Dennard's scaling breakdown.
1.3 The gradual decrease of power per single transistor in processors.

2.1 A depiction of clumpiness in HMPs.
2.2 Approximation of the Pareto frontier of an HMP using PaTH.
2.3 Improved scheduling for HMPs using a Predictive Controller.
2.4 Performance comparison between Oracle, Reactive and Predictive Controllers.
2.5 Layout and wiring of two merged OoO cores.
2.6 Code reusability based on locality.
2.7 CoreFusion architecture fuses independent cores together.
2.8 Comparison of OoO and InO performance for a range of SMT configurations.
2.9 The MorphCore microarchitecture.
2.10 Core and memory delay and frequency as functions of voltage.
2.11 The TRIPS architecture.

3.1 Sampling execution with a higher frequency reveals micro-phases.
3.2 HMP system exploration creates an exponential state space.
3.3 Merged states with respect to exploration depth.
3.4 The state space of HMPs broken down in orders of migrations.
3.5 Exploration of first-order migrations.
3.6 A visual representation of the forking methodology.
3.7 The implementation of the methodology in gem5.
3.8 Different types of sharing schemes examined for HMP systems.
3.9 Generalised view of a nested translation.
3.10 Migration cost with private TLB and caches.
3.11 Migration cost when sharing all caches.
3.12 Migration cost when sharing all the caches and TLB.
3.13 Migration cost when sharing the L2 cache and TLB.
3.14 InO and OoO performance degradation with respect to migration frequency.
3.15 Percentage of beneficial migrations when being energy constrained.
3.16 Percentage of beneficial migrations when being EDP constrained.

4.1 Retaining the branch predictor state eliminates negative migration effects.
4.2 The architecture of a bimodal predictor.
4.3 The states of a bimodal predictor.
4.4 A depiction of a TAGE branch predictor.
4.5 The evolution of perceptron designs.


4.6 The design of the perceptron predictor.
4.7 The transfer mechanism between cores.
4.8 Performance cost when migrating without branch predictor state.
4.9 Component-wise misprediction rate of 64kB TAGE.
4.10 Increase in accuracy when transferring state for TAGE.
4.11 CPI breakdown when migrating with and without branch predictor state.

5.1 The transient accuracy of a branch predictor.
5.2 The branch retention buffer mechanism.
5.3 The BRB entry mechanism.
5.4 TAGE MPKI with respect to each component's size.
5.5 The ParTAGE predictor.
5.6 Transient analysis of bimodal predictors.
5.7 A comparison between TAGE and perceptron.
5.8 Comparing bimodal and perceptron designs below 1.25KB as base predictors.
5.9 Calibrating the small perceptron.
5.10 MPKI of ParTAGE variants.
5.11 Comparison of predictors at transient state.
5.12 The steady-state of ParTAGE compared to TAGE and Perceptron.

A.1 OoO performance degradation for shared and private caches.
A.2 InO performance degradation for shared and private caches.
A.3 A representation of the HMP states that can be merged.

B.1 Examples of instruction level parallelism exploitation.
B.2 Amdahl's Law of diminishing returns when 60% of execution is parallel.
B.3 Chip dynamic and static power dissipation progression over the years.
B.4 The formation of a Pareto frontier from Pareto optimal points.

List of Tables

2.1 Specifications and performance of recent HMP core examples.
2.2 Composite Core parameters.

3.1 Access latency for TLBs, caches and memory in line with the conducted simulations.
3.2 The specifications of the system used in the experiments in this chapter.

4.1 The specifications of the system used in the experiments.
4.2 The evaluated branch predictors.

5.1 The evaluated branch predictors.
5.2 The frequency of context switches on a Google Pixel phone.
5.3 Accuracy of various bimodal predictors.


Declaration of Authorship

I, Ilias Vougioukas, declare that the thesis entitled Efficient and Secure Context Switching and Migrations for Heterogeneous Multiprocessors and the work presented in the thesis are both my own, and have been generated by me as the result of my own original research. I confirm that:

• this work was done wholly or mainly while in candidature for a research degree at this University;

• where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated;

• where I have consulted the published work of others, this is always clearly attributed;

• where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work;

• I have acknowledged all main sources of help;

• where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself;

• parts of this work have been published as: Vougioukas et al. (2017), Vougioukas et al. (2018) and Vougioukas et al. (2019).

Signed:......

Date:...... 24/06/2019


Acknowledgements

First and foremost, I would like to express deep gratitude to my industrial supervisor, Dr. Stephan Diestelhorst, for his excellent guidance during my PhD. His knowledge and experience have been vital in shaping a significant part of the concepts, ideas and goals of this project, while his mentoring and personal effort have been truly invaluable.

I would like to equally thank my professor and academic supervisor Dr. Geoff V. Merrett, for his insight, advice and close guidance from an academic standpoint. A special mention should be made to Dr. Bashir Al-Hashimi and Dr. Andreas Hansson, for granting me the opportunity to pursue a PhD.

Special thanks to Andreas Sandberg and Nikos Nikoleris that heavily influenced my research, and helped me develop during my PhD.

Additionally, I would also like to thank all the colleagues that have contributed, one way or another, to my development over the past months. Most notably these include, in no particular order, René DeJong, Rodrigo Gonzalez-Alberquilla, Sascha Bischoff, Stephen Kyle, Rachael Smith, Dimitris Revythis, Konstantinos Psomas, Lucy Gonzalez, Nida Darr and Charlotte Christopherson.

To my family, that sacrificed so much to grant me all possible opportunities to succeed in life. Mom, Dad, Dino, Gerry, Pope and Lak I am eternally grateful.

As a final note, the work contained in this dissertation has been completed as part of a collaboration between the University of Southampton and Arm Research. The work has been funded both by the EPSRC, in the form of a CASE award, and by Arm, without which it would not have been possible to complete.


Αφιερωμένο στο Καλλάκι, τον Λάκη και τον Κώστα.


Acronyms

ALU Arithmetic Logic Unit.
AMAT Average Memory Access Time.

BP Branch Predictor.
BTB Branch Target Buffer.

CAM Content Addressable Memory.
CISC Complex Instruction Set Computer.
CMP Core Multi-Processor.
CPI Cycles Per Instruction.
CPU Central Processing Unit.
CS Context Switches.

DL1 Data Level Cache.
DLP Data Level Parallelism.
DRAM Dynamic Random-Access Memory.
DVFS Dynamic Voltage Frequency Scaling.

EDP Energy Delay Product.

GPU Graphics Processing Unit.

HMP Heterogeneous Multi-Processor.
HPC High Performance Computing.

IL1 Instruction Level Cache.
ILP Instruction Level Parallelism.
InO In-Order execution.
IP Internet Protocol.
IPC Instructions Per Cycle.
IPS Instructions Per Second.
ISA Instruction Set Architecture.

L1 Level 1 Cache.
L2 Level 2 Cache.
LLC Last Level Cache.
LSB Least Significant Bit.
LSQ Load Store Queue.
LSU Load Store Unit.

MIPS Million Instructions Per Second.
MLP Memory Level Parallelism.
MPKI Mispredictions Per Kilo Instructions.

OoO Out-of-Order execution.
OS Operating System.

PA Physical Address.
PC Program Counter.
PHT Pattern History Table.
PID Proportional Integral Derivative Controller.
PMC Performance Monitoring Counter.

RISC Reduced Instruction Set Computer.
ROB Re-Order Buffer.

SIMD Single Instruction Multiple Data.
SMT Simultaneous Multi-Threading.
SoC System-on-Chip.
SPO Strong Pareto Optimum.
SRAM Static Random-Access Memory.

TLB Translation Lookaside Buffer.
TLP Thread Level Parallelism.

VA Virtual Address.
VLIW Very Long Instruction Word.

WaR Write after Read.
WaW Write after Write.
WPO Weak Pareto Optimum.

XML Extensible Markup Language.

Nomenclature

C_L Load capacitance.
V_DD Source voltage.
D Delay.
E Energy.
P Power.
Q Charge.
S_overall Overall speedup.
θ Threshold.
f Frequency.


Chapter 1

Introduction

Digital computers have had an unfathomable rate of progress since the first processing system emerged in 1946 (Goldstine and Goldstine, 1996). Back then, computers filled entire wings of government buildings and institutions to perform mathematical computations impossible for humans. Since then, a gradual shift has happened, from massive specialised machines for particular tasks to small general purpose devices used for all aspects of everyday life.

In fact, many of these devices go unnoticed today as, apart from the mainstream personal computers that we are accustomed to, processors are hidden in cars, planes, toys and even in our bodies (Pannuto et al., 2016). It is this exact phenomenon of extreme progress in performance, size and capabilities that has caused computers to greatly affect our lives and play a pivotal role in human enabling, especially with the emergence of connected and mobile platforms.

As these mobile devices have become increasingly widespread and popular, their computational demands have increased. This has, in consequence, changed them from simple embedded systems into almost all-round high-performance powerhouses. The workloads that today's devices are designed to handle are equally diverse, ranging from multimedia, encryption and networking to advanced artificial intelligence (Mittal, 2016).

Up until recently, technology was able to meet future demands notably due to one observation known as Moore’s law (Moore, 1998; Chien and Karamcheti, 2013). Moore’s law predicted in 1965 (and adjusted in 1975) that transistor density roughly doubles every 18 months. The observation proved to be quite insightful, as it still stands true today. The importance of Moore’s law stems from the direct link between transistor density and processor performance.

The reasons why performance increases with larger integration are manifold. Primarily, at a smaller scale, gates can switch at a faster rate. This happens because the distance signals need to travel inside the chip decreases while electron speed remains constant. Additionally, in smaller implementations the capacitance is reduced, which also contributes to faster switching times of logical units and ultimately higher operating frequencies (Weste and Eshraghian, 2010).
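A rough first-order sketch of this relation (my own illustration, not taken from the thesis; the on-current I_on is my notation and does not appear in the Nomenclature): a gate switches roughly in the time needed to charge its load capacitance, so

    t_switch ≈ (C_L · V_DD) / I_on,        f_max ∝ 1 / t_switch

Shrinking a design therefore lowers C_L (and allows a lower V_DD), which directly raises the achievable operating frequency f.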


Figure 1.1: The exponential increase in transistor count as noted by Gordon Moore. Data collected from Berkeley's extended database (Danowitz et al., 2012).

To complement speed increase, smaller transistors also allow for more complex designs to be implemented that can significantly improve overall system speed. For example, increasing cache size effectively reduces the frequency with which data is fetched from memory and, as a direct consequence, reduces execution stalls due to memory accesses (Hennessy and Patterson, 2006; Jacob et al., 2008). Similarly, advanced concepts such as multiple execution units and branch prediction are used in larger architectures to further improve system performance by unlocking more Instruction Level Parallelism (ILP).

Another interesting observation, Dennard's scaling, showed that the power per transistor was steadily decreasing, which on a macro scale meant that processor power remained roughly the same from generation to generation (Kim et al., 2003; Kamil et al., 2008; Dennard et al., 2007). Combining the two “laws”, it was evident that there was a constant increase in performance without any extra energy cost. With all this in mind, processors were able to cruise along by doubling their density periodically while at the same time increasing the operating frequency.
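The interplay of the two observations can be sketched with the classic dynamic power relation (the activity factor α and the scaling factor κ below are my own notation, not defined in the thesis):

    P_dyn ≈ α · C_L · V_DD² · f

Under ideal constant-field (Dennard) scaling by κ, C_L → C_L/κ, V_DD → V_DD/κ and f → κ·f, so the power of a single transistor drops by roughly κ² while its area also shrinks by κ²: power density stays constant even as density and frequency grow. It is this balance that breaks down once V_DD can no longer be reduced and leakage grows, as discussed next.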

However, an upper speed limit was reached, over which frequency scaling was no longer realistic. The problem appeared as the electrical power required to operate a Central Processing Unit (CPU) started to become prohibitive (Dennard et al., 2007; Pell and Mencer, 2011). Technically, the reason why this breakdown occurs is that leakage current and thermal losses both rise as frequency increases. In essence, while it is possible to create a system operating at higher frequencies, the solution is not economically viable since cooling and power restrictions make such implementations significantly more expensive. This signified the start of the breakdown in Dennard's scaling (Dennard et al., 2007).

Figure 1.2: A visual depiction of Dennard's scaling breakdown. Data collected from Berkeley's extended database (Danowitz et al., 2012).

To get around this, engineers started looking at alternative designs that could operate at lower frequencies while retaining performance levels (Bischoff, 2016; Spiliopoulos et al., 2011; Eyerman and Eeckhout, 2011). This led to multi-core architectures, which can exploit parallelism in processes to achieve the same level of performance (Hennessy and Patterson, 2006; Ceze et al., 2006; Chaudhry et al., 2005, 2009a,b; Rangan et al., 2009). Hence, the multi-core era allowed technology to advance following roughly the same trend as before by exploiting the inherent parallelism in workloads as well as the multi-threaded nature of operation. In spite of the challenge of continuously scaling out the number of cores while maintaining memory bandwidth and efficiently managing caches and traffic, multiprocessors have consistently met performance expectations in recent years (Gutierrez et al., 2014).
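The gains from adding cores are, of course, bounded by the serial fraction of a workload; Appendix B covers this in detail, but a worked instance of Amdahl's Law (stated here in its standard form, using the Nomenclature symbol S_overall; the 60% parallel fraction matches the example of Figure B.2) illustrates the diminishing returns:

    S_overall = 1 / ((1 − p) + p/N)

With a parallel fraction p = 0.6, four cores give S_overall = 1/(0.4 + 0.6/4) ≈ 1.8×, and even infinitely many cores cannot exceed 1/0.4 = 2.5×.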

With the recent surge in core count, interconnection has become more critical as the bandwidth and latency issues are increasingly harder to solve (Chen et al., 2012; Sankaralingam et al., 2003, 2006). Furthermore, the effects of the collapse of Dennard's scaling have become unavoidably prominent, with new designs experiencing power surges well beyond nominal thermal constraints and the safe operating temperature range. The above constraints led to the realisation that future processors would only operate parts of their design while the rest remains inactive. The practice of turning off parts of the design that are temporarily not used is commonly referred to as dark silicon (Sadeghi et al., 2015; Patsilaras et al., 2012). Recent studies have shown that at a 7nm scale of integration, dark silicon may amount to 50%-80%, depending upon the specifics of the workloads, the design, and the implementation (Esmaeilzadeh et al., 2012). The ramifications of this effect are so significant that it has affected not only energy restricted systems such as mobile devices, but also systems that focus on performance, such as High Performance Computing (HPC) and servers (Mittal, 2016; Kamil et al., 2008; Sarma et al., 2015).

Figure 1.3: The gradual decrease of the power consumed by a single transistor in processors. Data collected from Berkeley's extended CPU database (Danowitz et al., 2012).

Modern processors adapted to these circumstances by utilising multiple cores with different characteristics, commonly referred to as Heterogeneous Multi-Processors (HMPs) (Yu et al., 2013; Sarma and Dutt, 2015; Navada et al., 2013; Mittal, 2016). With HMPs, a different set of impediments needs to be surmounted in order to exploit their inherent performance capabilities. In single core architectures, performance relies on correct scheduling of tasks aiming at reducing idle CPU time. They are relatively straightforward to understand and have been researched in detail in the past (Stallings, 2011; Tanenbaum and Bos, 2014). In multi-core designs the challenges are somewhat more complicated, as performance relies on distributing the workload that can be executed in parallel amongst the available cores, while also dealing with common resources, cache coherency and efficiently utilising memory (Gutierrez et al., 2014; Yu et al., 2013; Lukefahr et al., 2012). Furthermore, efforts in developing parallel schedulers have proved to be significantly more complex and less fruitful than their sequential counterparts (Chronaki et al., 2015; Yoo et al., 2015; Vet et al., 2013; Bhattacharjee and Martonosi, 2009b; Padmanabha et al., 2015).

In addition to the general multi-core issues that remain relevant to some degree, in HMPs it is also crucial to designate the best suited type of core for a specific application. This adds an extra dimension of complexity to the problem, as every core has different characteristics and therefore different behaviour for workload archetypes. Research in the field suggests that there is still much to understand and improve, both in terms of workload classification but also in potential architectural redesign to cater better to common workload bottlenecks (Fallin et al., 2014; Rotenberg et al., 2013; Lukefahr et al., 2012, 2014). Especially in terms of future HMP architecture, one decisive matter for future implementations is whether the current trend of generic cores is preferred over many accelerator-type cores that are tailored to handling specific types of workloads (Navada et al., 2013).

1.1 Research Motivation

From a historical perspective, there has been tremendous progress toward better performing and more efficient processors. However, the future seems to increasingly depend on better resource management, as scaling seems to be getting steadily harder to achieve (Esmaeilzadeh et al., 2012; Toumey, 2016; Hennessy and Patterson, 2006; Chien and Karamcheti, 2013). One effort to use systems smartly, expending exactly the desired level of performance and energy, is the use of Dynamic Voltage Frequency Scaling (DVFS) (Bischoff, 2016; Miftakhutdinov et al., 2012; Gopireddy et al., 2016). In the case of frequency and voltage scaling, the behaviour, trade-offs and gains have been thoroughly observed and used, to the point that one can argue that current systems and policies achieve close to optimal results.

On the other hand, with the introduction of HMPs, a similar problem has surfaced: that of deciding on which core to run in order to maximise performance or energy efficiency. The complexity of these systems, however, coupled with their diversity, significantly complicates the task of finding an optimal scheduling policy, or even identifying how close current policies have come to the optimum. Furthermore, potential benefits are tied to the transfer overheads that occur during migrations of execution. Altering the architecture can change the cost of migrations, making the optimal schedule dependent on the interconnection of processing units in an HMP.

Exploring the performance/efficiency of an HMP system is therefore an open, multifaceted problem that has been shown to be rather complex (Kumar et al., 2004; Venkat, 2019; Prodromou et al., 2019). Many studies have tried to improve HMP performance and efficiency, with limited success. In the survey conducted by Mittal (2016), which thoroughly goes through all attempts at HMP design, solutions are topical, optimising for specific cases.

The reason for this slow advancement pace, as shown in Chapter 2, is that currently there is no feasible way to exhaustively analyse schedules of systems that migrate between cores, and no means to measure the effect of proposed architectural modifications in a quantifiable manner. Current methods are based on heuristics and base their validation on a selection of results. As such, there is no way of truly assessing how far from optimal current designs really are, nor is it possible to design future systems with an a priori knowledge of how close they are to the theoretical maximum.
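A back-of-the-envelope count shows why exhaustive analysis is infeasible (the figures are purely illustrative, not results from the thesis): if a workload is split into n scheduling quanta and each quantum can run on one of k core/DVFS configurations, the number of distinct schedules is

    k^n

so even a two-configuration system running a 100-million-instruction workload at a 1k-instruction quantum already has 2^100,000 possible schedules, far beyond what any simulator can enumerate. Chapter 3 returns to how this space can be reduced.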

The motivation, therefore, for this work is to design more efficient and secure HMP systems, using more deterministic and quantifiable methods. To achieve this, several fundamentals need to be thoroughly understood, for instance the effect of migrations in heterogeneous systems, the components that can be shared between cores, and the complications that emerge when the aforementioned changes are implemented. Many of these questions have been addressed in recent work, albeit with simplifications that do not compare to realistic current or future implementations. Additionally, finding a way to identify the components that contribute to critical bottlenecks in HMPs can provide significant insight both in hardware design and in software, accentuating their strengths, revealing their weaknesses and ultimate limitations, and leading to notable improvements in today's technology.

On the hardware front, this can enable new heterogeneous core designs with different architectural characteristics that address the shortcomings of today's designs. One example of research that can derive from the above analysis would be to determine migration granularity for different architectures. Having the ability to fully explore a system's possible behaviour can also lead to new designs no longer guided by an intuitive process, but rather based on a quantitative method. Accelerator-type cores are also possibly beneficial for optimised execution, even though the market need for more generic designs that are cheaper to manufacture poses some challenging future dilemmas.

From a software perspective, many interesting approaches can evolve from this research, both in terms of improving the way schedulers influence migrations and in identifying the hardware infrastructure needed to fully enable software. A common constraint when trying to apply DVFS and smart migrations today is the limited number of Performance Monitoring Counters (PMCs) that help probe the system state. An in-depth analysis of HMP systems can provide insight into which are the principal components for making educated migration decisions in scheduling policies that can overcome the above limitations.

1.2 Research Questions

As is clear from the above analysis, HMP scheduling and design are wide open fields of research that can lead to a fundamental understanding of heterogeneity. Additionally, existing system simulation tools allow us to do in-depth analyses at system level, accurately enough to explore without the need for hardware fabrication. The merit of this is that it is possible to examine multiple different designs as well as do exhaustive exploration of execution on existing and potential systems with minimal resources. The aim of this thesis, therefore, is to present areas where current HMP design can be improved and to present solutions, primarily focusing on the efficient and secure transfer of execution between heterogeneous cores.

Consequently, it addresses the following research questions:

1. What are the best metrics to evaluate the quality of HMP designs? What is an unambiguous way to compare different architectures? These questions focus on investigating prior research in the field, identifying potential areas of improvement for existing designs and methods for managing HMPs to achieve near optimal performance. A critical part of this effort is to find a way of establishing a baseline that can compare heterogeneous cores that are deliberately dissimilar. Direct comparison can help when transfer of execution is necessary between different types of cores.

2. How does the memory hierarchy affect the trade-off between design complexity and improved operation for heterogeneous multiprocessors? This question focuses on identifying the memory components that lead to higher performance and energy efficiency. Sharing all components is not trivial, as some caches are latency sensitive and therefore add a lot of overhead to the execution when shared (a small worked example of this trade-off follows the list below). Identifying the structures that can be shared to benefit the system is critical for future heterogeneous multiprocessors.

3. How does the sharing of internal core components affect the behaviour of heterogeneous systems? This question explores architectural and microarchitectural modifications to the core that can be beneficial to the behaviour of heterogeneous systems. Components such as the register file, the branch predictor and the pipeline are tailored to a specific core design. It is critical to identify which of these resources contribute most to the slowdown caused when migrating between cores, and to devise solutions to increase performance.

4. How do shared resources in HMPs affect the overall system security and what design changes are needed to ensure safe operation? Shared resources in multicore systems contribute significantly to high performance but introduce dangerous security exploits, namely remote timing side-channels that use shared caches and branch predictor state to compromise secure operation. Recent findings show such attacks are feasible in systems with shared resources that transfer execution (Lipp et al., 2018; Kocher et al., 2018; Evtyushkin et al., 2018; Khasawneh et al., 2018; Prout et al., 2018), so ensuring secure operation in HMPs should be a high priority.
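As a sketch of the latency sensitivity raised in question 2 (the cycle counts and miss rates below are illustrative assumptions, not results from the thesis), the Average Memory Access Time makes the asymmetry clear:

    AMAT = t_hit,L1 + m_L1 · (t_hit,L2 + m_L2 · t_mem)

Assuming t_hit,L1 = 2, t_hit,L2 = 15 and t_mem = 80 cycles with miss rates m_L1 = 10% and m_L2 = 20%, the AMAT is 2 + 0.1 · (15 + 0.2 · 80) = 5.1 cycles. Adding two cycles to the L1 hit time because it is shared raises this to 7.1 cycles (a 39% increase), whereas adding the same two cycles to a shared L2 raises it only to 5.3 cycles. This is why sharing the larger, less latency-critical structures is the more attractive option.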

1.3 Contributions

The above questions have been approached through several contributions made as part of this thesis. These contributions allow HMP systems to be more accurately analysed and identify the components that contribute to overheads when transferring execution between cores. The novel designs presented can bring near optimal heterogeneous systems closer to reality.

In detail, the contributions are:

1. A novel and versatile methodology that is able to compare performance between heterogeneous cores and conduct various experiments to uncover the potential, in terms of performance and energy savings, of each design. The work was presented at the CODES+ISSS conference (Best Paper nominee), published in the ACM Transactions on Embedded Computing Systems (Vougioukas et al., 2017), and is reported in further detail in Chapter 3.

2. A heterogeneous architecture that shares the Level 2 Cache (L2) and Translation Lookaside Buffer (TLB) and demonstrates a performance increase of up to 7% and energy savings of up to 6%. Furthermore, the design is more efficient, as on average 50% of the execution operates under a lower Energy Delay Product (EDP). This is reported in Chapter 3, was presented at the CODES+ISSS conference and was published in the ACM Transactions on Embedded Computing Systems (Vougioukas et al., 2017).

3. A novel microarchitectural mechanism, presented in Chapter 4, that transfers branch predictor state between heterogeneous cores to eliminate the hardware migration overheads. The design is able to transfer minimal state between cores, even when the microarchitectural implementations and the sizes of the branch predictors do not match. When combined with the above contributions to the shared memory hierarchy for heterogeneous systems, Out-of-Order execution (OoO) cores can recover as much as 35% of the performance that would otherwise be lost in frequent migrations. This is currently submitted for publication in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).

4. A new structure that fortifies branch prediction against side-channels by retaining a minimal part of the predictor state per context. The component requires little to no additional area overhead to isolate contexts, is applicable to both heterogeneous and homogeneous multiprocessors, and can reduce the misprediction rate by 20% when compared to other isolation techniques. This work is published in HPCA '19 (Vougioukas et al., 2019) and explained in Chapter 5.

The publications for all the above contributions are provided in Appendix C.

1.4 Outline

The breakdown of the rest of the thesis, outlining the key points of each chapter is provided below:

Chapter 2 investigates recent studies relevant to HMP optimisation for performance and energy efficiency. Here, alternative architectures are reviewed, highlighting the significant improvements over past and current designs. Additionally, on the software optimisation side, voltage/frequency scaling and core migrations are analysed, both in terms of online policies and offline space exploration, to maximise hardware capabilities.

Chapter 3 focuses on the contributions that aim at understanding the effect migrations have on the performance and energy efficiency of heterogeneous systems. To fully characterise the behaviour of these designs, a novel simulation methodology is presented, along with the necessary infrastructure, that can quantify migration effects and overheads in HMPs through direct comparisons. The practicality of this technique enables in-depth studies previously not feasible, and it is used as a basis for the experiments conducted as part of this thesis.

The methodology is used for a variety of experiments to identify beneficial memory component sharing between cores, quantify the overhead for different migration frequencies and calculate the maximum possible benefits when migrating. The second contribution in this chapter presents a new memory architecture for heterogeneous processors that, with minimal sharing, accomplishes the same effect as fused architectures, but for a fraction of the design complexity.

In Chapter 4, microarchitectural improvements are detailed concerning the efficient transfer of branch predictor state between different core types. This removes the “cold” start effects that occur after a migration and dramatically improves performance, especially for high frequency migrations that move the execution to an OoO core. As different cores have different prediction needs, an analysis is presented of the techniques that allow state to be transferred between different types of branch predictor designs in a meaningful way.

In Chapter 5, a closer look is taken at the security implications that emerge when sharing components. Side-channels through the caches and the branch predictor have been shown to be possible, and hard to address in current hardware. Focusing on the branch predictor, a mechanism is described that fortifies context switches and migrations against a subset of side-channel attacks, without degrading performance or adding significant area overhead.

Finally, Chapter 6 summarises the insights and contributions presented, and provides a broader view of the impact of the work presented throughout this manuscript. Concluding remarks also suggest what future studies in this field should cover, based on the findings shown.

Chapter 2

Heterogeneous Multiprocessors

In recent years, an abundance of research has focused on making CPUs more efficient under a variety of applications and settings. A considerable amount of this work explores Core Multi-Processor (CMP) optimisation. For this reason, a holistic overview of research in the field is covered, analysing the recent contributions and identifying their possible shortcomings and potential areas for future development, with special focus on HMPs or methods directly applicable to HMPs. This chapter reviews work organised into three main areas:

• Heterogeneous multiprocessor types;
• Heterogeneous multiprocessor architecture and microarchitecture;
• Heterogeneous multiprocessor scheduling.

2.1 Heterogeneous Multiprocessor Types

Since a multitude of terms exist to describe heterogeneous processors, some basic concepts are described here, in order to help keep a common convention throughout the thesis. For this reason, a brief description of the basics of all HMP designs is presented, with the aim of focusing the scope to a narrower research area.

Heterogeneous computing, as a general term, refers to systems that feature multiple dissimilar types of processors. The foundation of this concept is to improve efficiency and performance by adding cores tailored for specific applications instead of multiple copies of the same all-round core. By adding dissimilar processors it is possible to specialise their capabilities to tackle specific bottlenecks and cover a wider range of workload types.

Today, a plethora of diverse heterogeneous processor designs exist; however, behind the extreme differences all of them follow the same basic principles. As the design of heterogeneous cores has only recently been widely adopted, a lot of notions surrounding them have been loosely defined. Srinivasan et al. (2011), Koufaty et al. (2010) and Khan and Kundu (2010) all propose classifications for asymmetrical processors.

In the first two cases the distinguishing factors are the instruction set and the microarchitecture of the cores in the processors, while in Khan and Kundu (2010) the argument is made that heterogeneity can either be deliberate, meaning that the hardware is configured in such a way that it will always have an inherent asymmetry (different instruction set, microarchitecture, general purpose - accelerator type core collaboration etc.), or it can be extemporaneous, in the sense that identical cores can vary based on the online configuration. Examples of the latter case are when cores are running at different DVFS levels or when hardware is dynamically reconfigured, creating clusters of asymmetric operation. This distinction is very useful as it allows the inclusion of reconfigurable systems, which under the other categorisation methods are excluded.

2.1.1 ISA Based Classification

The main distinction according to both Srinivasan et al. (2011) and Koufaty et al. (2010) in HMP architectures is which Instruction Set Architecture (ISA) each core type uses. A clear distinction can be made between single-ISA cores, that despite their architectural differences use the same ISA throughout, and heterogeneous-ISA cores, where each type of core operates under a different ISA. The question of which design is better is not evident, as each offers a different set of advantages and disadvantages, aiming at exploiting different conveniences.

Single-ISA HMPs

The most common approach is to operate all cores under the same instruction set so that migrations between the cores can happen as seamlessly as possible. With the same ISA, cores simply continue execution without any binary translation or Operating System (OS) overhead (Kumar et al., 2003). The simplicity of this concept allows these systems to be easily designed, with no additional process when migrating between cores (Kumar et al., 2004). The shift in execution is seamless, with the heterogeneous cores being completely transparent to the running applications. The OS, depending on the design, can either have knowledge of the underlying heterogeneity or not. Having the OS know the underlying hardware configuration can improve scheduling decisions, while other systems might decide not to expose the cores' specifics, making the hardware more complicated but the OS simpler.

Heterogeneous-ISA HMPs

On the other hand, heterogeneous-ISA HMPs, which do not share the same ISA between dissimilar cores, can supposedly exploit their wider diversity. In theory, this can lead to a more tailored execution when using the correct core for each workload. This can be seen in Devuyst et al. (2012); Barbalace et al. (2014); Kumar et al. (2006); Venkat and Tullsen (2014, 2016), where it is claimed that different ISAs lead to larger divergence, which is desirable as it allows the system to perform better over a larger range of workloads.

However, as the ISA is not common between the cores, a binary translation step is required, which adds significant overhead. Furthermore, process migration across heterogeneous ISAs requires a program state transformation of the registers and memory, which is architecture dependent. This makes any further expansion to other ISAs and designs a costly process.

Devuyst et al. (2012) propose to use analysis to store most of the code in an architecture-independent format. This is claimed to mitigate the overhead of remapping on the new core and, consequently, on the new ISA. As this cannot be done throughout the whole code, equivalence is achieved in “windows” of the execution, which revolve around function calls. To support instant migration, an overhead caused by binary translation is ultimately paid until a function call is reached. After the function call, code can continue to execute on the new ISA.

Venkat and Tullsen (2014) show that heterogeneous ISAs improve the effectiveness of HMPs. Furthermore, they expand on this idea by analysing the distance between two ISAs, taking into account code density, register pressure, instruction complexity, floating point, Single Instruction Multiple Data (SIMD) support and dynamic instruction count. To validate their proposal they use three different ISAs (Thumb 32-bit, x86-64 and Alpha64) with a memory management scheme and compare with a single-ISA system on the SPEC CPU2006 integer and floating point benchmarks. The results show noteworthy improvement against their homogeneous and single-ISA counterparts; however, the design is limited to operating with minimal migrations between different ISA domains, as the binary translation is too costly (Venkat, 2019; Prodromou et al., 2019).

Srinivasan et al. (2011) explore, amongst other designs, the prospect of heterogeneous instruction sets. The approach here is to identify portions that can be offloaded to accelerator-type cores (using their unique ISA). The results in this study do not present any compelling reason to support heterogeneous ISAs.

Sometimes heterogeneous-ISA HMPs are also termed “OS-capable heterogeneous-ISA systems”, specifically to differentiate them from systems that include the Graphics Processing Unit (GPU) and other accelerator-type core designs. In the CPU-GPU case, the graphical unit handles only graphics-specific tasks and cannot run a fully fledged operating system, whereas, in theory, heterogeneous-ISA systems expect all cores to be somewhat capable of this (Mittal, 2016).

2.1.2 Commercial and Conventional HMPs

From the above, it is evident that many studies have been conducted to explore whether heterogeneous-ISA systems are preferable. In spite of this, no viable design currently exists in the industry, as their complexity makes them impractical. For this reason, the focus is primarily on the more conventional single-ISA HMPs, which prove to be a simple yet efficient approach.

This can be seen in numerous studies (Baum et al., 2016; Yu et al., 2013; Khubaib, 2014; Sarma and Dutt, 2015; Lukefahr et al., 2014; Yoo et al., 2015; Chronaki et al., 2015; Tomusk and O'Boyle, 2013), where the main focus is the analysis of, or comparison with, one of the most widespread families of HMPs currently on the market, Arm big.LITTLE processors. These processors bundle cores of the same ISA but with significantly different architectural characteristics, as shown in Table 2.1 below.

                      Cortex A15          Cortex A7           Cortex A57          Cortex A53
ISA                   Armv7 (32-bit)      Armv7 (32-bit)      Armv8 (32/64-bit)   Armv8 (32/64-bit)
Frequency             1.7GHz              1.3GHz              1.9GHz              1.3GHz
Pipeline              OoO                 InO                 OoO                 InO
Issue width           3                   2                   3                   2
Fetch width           3                   2                   3                   2
Pipeline stages       15 to 24            8 to 10             15 to 24            8 to 10
L1 I-cache            32KB/2-way/64B      32KB/2-way/32B      48KB/3-way/64B      32KB/2-way/64B
L1 D-cache            32KB/2-way/64B      32KB/4-way/64B      32KB/2-way/64B      32KB/4-way/64B
L2 cache (64B)        512KB-4MB/16-way    128KB-1MB/8-way     512KB-2MB/16-way    128KB-2MB/16-way
Performance (MB/s)    99.69               77.93               155.29              109.36
Energy (mWh)          19.75               10.56               27.72               17.11
Performance/Energy    5.04                7.38                5.60                6.39

Table 2.1: Examples of four heterogeneous cores, coupled as big (OoO) and Little (InO) (Pricopi et al., 2013). The last three rows show results on the XML Parsing benchmark (BaseMark OS II) (Mittal, 2016).

In the big.LITTLE setup, two types of cores exist: the Out-of-Order execution (OoO) high performance ones and the power efficient In-Order execution (InO) ones. Each can exist alone or grouped into clusters featuring multiples of the same type of core. Usually some cache is shared amongst cores in the same cluster (most commonly the Level 2 cache), while data sharing with cores in different clusters is done either at the Last Level Cache (LLC) or by traversing the interconnect.

2.1.3 Reasons for HMP Usage

The diverse nature of HMPs allows them to tailor execution to a wide range of applications and achieve specific goals, such as improved performance or energy efficient operation. Throughout many studies (Srinivasan et al., 2016; Fedorova et al., 2009; Yoo et al., 2015; Yu et al., 2013; Khubaib, 2014) it has been shown that, as a general case, large cores (OoO, deeper and wider pipelines etc.) usually provide a better Energy Delay Product (EDP) than the simpler architectures in computationally heavy workloads. On the other hand, small cores outclass the high performance designs when applications are memory-intensive, as the many atomic operations and the limited reuse life of data mitigate performance scaling, making energy consumption the crucial factor.
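For reference, the Energy Delay Product used in this comparison (and throughout the thesis) is simply the product of the energy consumed and the execution time, using the Nomenclature symbols:

    EDP = E · D

Lower values are better, and the metric only improves when a performance gain is not bought with a disproportionate increase in energy, or vice versa.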

In reconfigurable systems, their flexibility allows them to cater to specific workload needs, for example when there are memory, instruction or data level parallelism (MLP, ILP, DLP) phases. This has the added benefit of reducing the problem of over-provisioning hardware and ultimately maximising hardware usage (Hill and Marty, 2008).

2.2 HMP Architecture and Microarchitecture

As heterogeneity is a relatively recent direction in computer architecture, the question of what architecture is optimal has been repeatedly posed. This proves to be a rather overloaded question, with a large configuration space covering the specific architecture of the cores, their optimal number, as well as their topology. As such, a significant amount of work and various proposals have been made to find the best possible core design and system topology that yield the most performance/energy efficiency.

2.2.1 HMP Space Exploration

A lot of interest has circulated in selecting the best combination of cores from an abstract perspective.

Figure 2.1: A clumpy (top) and a balanced (bottom) selection of cores covering the design space. Figure reprinted from Tomusk et al. (2014).

Tomusk and O'Boyle (2013) consider that, since the design space is forbiddingly large to explore, a better approach is to investigate weak heterogeneity. According to the authors, instead of exploring vastly different architectures, tweaking minor values in the same design can lead to smaller yet still interesting trade-offs. A limit study varying the Re-Order Buffer (ROB) size is completed in a simulated environment to track variations in EDP. Results show minor variation in performance between cores but emphasise the flexibility in designing migration scheduling policies.

Tomusk et al. (2014) investigate how cores in HMP systems should be designed to achieve the best possible performance and energy. The study argues for the concept of “clumpiness”, where the underlying philosophy is that heterogeneity itself needs to be directed in such a way as to cover the execution/design space in a balanced manner. Clumpiness is a metric that measures the Euclidean distance between the points on an energy-delay plane.
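A minimal sketch of how such an unevenness measure could be computed (my own illustration of the idea, not Tomusk et al.'s exact formulation; the points below are hypothetical normalised power/time pairs):

    from math import hypot, sqrt

    def clumpiness(points):
        # Coefficient of variation of nearest-neighbour distances on a
        # normalised energy-delay plane: 0 means a perfectly even spread,
        # larger values mean the selected cores bunch up ("clump").
        nearest = []
        for i, p in enumerate(points):
            dists = [hypot(p[0] - q[0], p[1] - q[1])
                     for j, q in enumerate(points) if j != i]
            nearest.append(min(dists))
        mean = sum(nearest) / len(nearest)
        var = sum((d - mean) ** 2 for d in nearest) / len(nearest)
        return sqrt(var) / mean

    # Hypothetical (power, time) points for two selections of four cores.
    clumpy   = [(1.0, 3.6), (1.05, 3.5), (1.1, 3.4), (1.6, 1.0)]
    balanced = [(1.0, 3.6), (1.2, 2.7), (1.4, 1.8), (1.6, 1.0)]
    print(clumpiness(clumpy), clumpiness(balanced))  # the clumpy set scores higher

A balanced selection such as the bottom one in Figure 2.1 would score low on such a measure, while a clumpy one such as the top one would score high.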

To validate the proposal, a simulator is used which performs an in-depth search over all the possible core designs on an encryption benchmark. After all the cores have been positioned and the Pareto front formed, a selection based on clumpiness can be made to uniformly cover the design space and offer better tailored execution. This is shown in Figure 2.1.

The concept of using diverse cores is further supported in Tomusk et al. (2015) where, apart from trying to select based on uniformity, flexibility is also proposed in order to improve behaviour. As a decisive factor for core selection, the Kolmogorov-Smirnov non-parametric statistical test is used, allowing a subset of cores to be selected not only based on uniformity, but also taking into account performance flexibility.

Navada et al. (2013) also consider that conventional HMP designs are not ideal, as the cores are too similar and as such could be considered monotonic. Instead, one way to improve the design is to use genetic algorithms to develop non-monotonic cores and fine-tune the characteristics of each. The genetic algorithm uses an objective function based on results extracted from SPEC to guide the design. Furthermore, the observation is made that all generated HMPs converge to the pattern of having one all-round core and shaping the rest in the form of some type of accelerator. This design aims at improving performance by identifying and classifying workloads and migrating their execution to the corresponding specialised core.

Lukefahr et al. (2014) approach the matter differently and try to fully explore the state space, albeit with some degree of tolerable error. While they focus on a big.LITTLE type architecture, they attempt to find the inherent maximum potential. To accomplish this, the Pareto frontier of optimal performance/energy efficiency is approximated using a new method (PaTH). PaTH splits a workload into quanta of 1k instructions, then executes and collects performance/energy statistics for each quantum under all core/DVFS configurations.

As the state space is too large to explore exhaustively, the study assumes that quanta are independent of each other. Working from this assumption, multiple quanta are bundled into segments, effectively creating significantly smaller subsets that are explorable. For each segment, the Pareto frontier is estimated by finding the best and worst case schedules with similar energy and delay profiles.

The assumption that quanta are independent also implies that segments are too and, therefore, their frontiers can be joined with a convolution. This results in a Pareto frontier of the whole state space with a bounded error. The simplifications in this approach however, prohibits this to be applied to actual designs that do not share the whole microarchitecture. This happens as the assumptions taken only hold in ideal systems where the register file, the branch predictor and Chapter 2 Heterogeneous Multiprocessors 17

(a) Pareto frontiers are approximated using the best and worst cases for segments.

(b) All the segment Pareto frontiers are convoluted into one that defines the entire workload.

Figure 2.2: Approximation of the Pareto frontier of an HMP using PaTH. Figure reprinted from Lukefahr et al.(2014).

The shaping of the Pareto frontier with PaTH is shown in Figure 2.2. Related concepts are explained in more detail in Appendix B.

2.2.2 Architectural Optimisation in Conventional Designs

Apart from the system level approach and ground-up HMP redesign, a notable amount of research is conducted in optimising particular designs and features of existing hardware, or proposing targeted architectural improvements that tackle specific issues.

Gutierrez et al.(2014) compare the performance and energy efficiency of private and shared LLCs. The study observes that private LLCs affect the upper bound of migration granularity as, after a point, LLC misses impact the system so much that performance is hindered. This happens because, during a migration from one core to another in a single thread, the cache starts with no data stored (cold) and therefore many misses will be incurred.

Shared caches provide a larger working set and completely remove the LLC warm-up penalty due to migration, but cannot be powered off to save energy. On the other hand, private LLCs get penalised by the migration misses but also benefit from being able to power down to preserve energy. The experimental results indicate that when migrations occur at a relatively large window of instructions (100k and above), performance degradation is negligible in private LLCs over shared ones. The comparison with shared LLCs indicates that private last level caches are preferable in situations of coarse migrations.

To further improve the efficiency Gutierrez et al.(2014) propose some alterations to the coherence protocol that enable ownership forwarding on read snoops in order to reduce the write-backs to the powered off core and memory.

The study shows that performance does not suffer at the 1M instruction granularity while energy savings are significant. Also by extending the cache protocol they manage to increase energy efficiency.

big μEngine: 3 wide [email protected], 12 stage pipeline, 128 ROB entries, 160 entry register file, Tournament branch predictor (Shared)
LITTLE μEngine: 2 wide [email protected], 8 stage pipeline, 32 entry register file, Tournament branch predictor (Shared)
Memory System: 32 KB L1 iCache, 1 cycle access (Shared); 32 KB L1 dCache, 1 cycle access (Shared); 1 MB L2 Cache, 15 cycle access; 1024 MB Main Mem, 80 cycle access

Table 2.2: Experimental Composite Core parameters used in Lukefahr et al.(2012).

2.2.3 Merging HMP Cores

Another approach to HMP design optimisation focuses on merging heterogeneous cores into a hybrid design with a shared execution front-end and separate back-ends. Notable studies discussed below show that significant benefits can be extracted, assuming the designs can be implemented.

Merging InO and OoO cores

Lukefahr et al.(2012) introduce the concept of CompositeCores, which are cores that share the same front end but have different back ends (μEngines). The different μEngines are modelled similarly to Arm’s big.LITTLE architecture and have significantly different characteristics. The big one is an OoO design, which has a wide and deep pipeline and is used for high performance, while the LITTLE μEngine is an InO design with a narrow/shallow pipeline targeted at power efficiency.

As the cores share virtually all other hardware resources (caches, branch predictors, fetch units, etc.), context switches between the μEngines are kept to the absolute minimum, transferring only the register state. As the overhead is negligible, migrations as fine-grained as 1k instructions become possible.

To manage the switch, a special controller called the Reactive Online Controller is proposed, which maximises energy savings for a defined performance loss. The controller comprises the following components:

• The performance estimator, which uses a model to predict the performance of the big μEngine based on the LITTLE μEngine’s behaviour and vice versa. The parameters of the estimator are calculated through regression analysis. In general, the model considers that branch mispredictions heavily affect big performance, whereas the LITTLE μEngine suffers primarily from L2 cache misses, being unable to mask their latency.

• The threshold controller, which is essentially a Proportional Integral Derivative (PID) controller regulating the Cycles Per Instruction (CPI) error based on the performance estimator’s assessment.

• The switching controller, which decides whether to perform a migration from one μEngine to another or to continue operation unaffected (a simplified sketch of this decision loop is given after the list).
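To make the interplay between these components concrete, a minimal sketch of such a reactive switching loop is given here. All coefficients, counter names and thresholds are hypothetical placeholders chosen for illustration and are not the values or the implementation used in Lukefahr et al.(2012).

# Illustrative sketch of a reactive switching loop for a CompositeCore-style system.
# All coefficients and thresholds below are hypothetical placeholders.

class PerformanceEstimator:
    """Predicts the CPI of the inactive uEngine from counters of the active one."""
    def __init__(self, coeffs):
        self.coeffs = coeffs  # obtained offline through regression analysis

    def estimate_other_cpi(self, counters):
        c = self.coeffs
        return (c["base"] + c["cpi"] * counters["cpi"]
                + c["bmiss"] * counters["branch_misses"]
                + c["l2miss"] * counters["l2_misses"])

class ThresholdController:
    """PI-style controller regulating the allowed CPI (performance) error."""
    def __init__(self, kp, ki, target_slowdown):
        self.kp, self.ki, self.target = kp, ki, target_slowdown
        self.integral = 0.0

    def update(self, observed_slowdown):
        error = self.target - observed_slowdown
        self.integral += error
        return self.kp * error + self.ki * self.integral  # new switching threshold

def choose_next_core(big_cpi, little_cpi, threshold):
    """Switching controller: pick the uEngine for the next quantum."""
    slowdown = (little_cpi - big_cpi) / big_cpi
    return "LITTLE" if slowdown <= threshold else "big"

# Example quantum: counters observed on the active (big) uEngine.
estimator = PerformanceEstimator({"base": 0.4, "cpi": 0.8, "bmiss": 0.01, "l2miss": 0.05})
controller = ThresholdController(kp=0.5, ki=0.1, target_slowdown=0.05)
counters = {"cpi": 0.9, "branch_misses": 12, "l2_misses": 3}
little_cpi = estimator.estimate_other_cpi(counters)
threshold = controller.update((little_cpi - counters["cpi"]) / counters["cpi"])
print(choose_next_core(counters["cpi"], little_cpi, threshold))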

Figure 2.3: The lower accuracy shown by a reactive controller vs that of a predictive controller for a fine-grained system. Ideally, points with low performance differences (area shaded grey) could be run on LITTLE for energy gains with little performance loss, while the rest should be run on the big back-end. Reprinted from Padmanabha and Lukefahr(2013).

To validate the theoretical design, CompositeCores were introduced into gem5, a full system simulator (Binkert et al., 2011), to calculate the performance and the energy consumption during operation, and to identify the minimum length of migrations (quantum length). The comparison is conducted between the big, high performance core and the HMP CompositeCore system in terms of performance, using an oracle which selects the LITTLE core when its performance degradation is below 5% compared to big. The method claims energy savings of 18% for 95% of maximum performance.

Building on CompositeCores, Padmanabha and Lukefahr(2013) extend the pre-existing design by adding a predictive trace-based switching controller. The idea behind this addition is that, with ultra-fine-grained switching, reacting to current behaviour is not enough to make correct decisions. This is due to the fact that, in the scope of a few instructions, there is increased volatility in performance, as shown in Figure 2.3.

Instead, with the predictive controller, it is possible to capture traces that detect data flow patterns, named super-traces. These super-traces are then used to predict the upcoming traces and estimate which core would better accommodate the workload. According to the study, the overheads added in hardware and energy are minimal, while the energy savings compared to an OoO core were measured through gem5 to be around 15%, which is close to the architectural optimum of 18% previously mentioned. This can also be seen in Figure 2.4.

Figure 2.4: Performance of the oracle (top), reactive (middle) and predictive (bottom) controllers. The predictive controller adheres closer to the optimum. Reprinted from Padmanabha and Lukefahr(2013).

A second improvement to the CompositeCores design, called DynaMOS, is proposed in Padmanabha et al.(2015) and permits migrating between the cores while keeping the schedule. DynaMOS is a dynamic migration OoO scheduler that is claimed to increase energy savings by 32% for only 5% performance loss compared to running only on big.

The motivation is that the InO core cannot identify ILP and MLP, whereas an OoO core can actually reorder instructions and mask long latencies. The key idea is to record, or memoize, an OoO schedule by executing an instruction trace on the big core and replay the memoized schedule on the InO core for future iterations of the trace. In cases, therefore, where there is a repeatable scheduling pattern in the OoO core, it is beneficial to use the InO core to capitalise on the periodicity and boost energy savings.

To accomplish the above, the CompositeCore design is augmented using a schedule trace cache, an online controller and hardware to enable an extra mode of operation in LITTLE called OinO. Initially, the big μEngine identifies traces that are memoizable and stores them in the schedule trace cache. The LITTLE μEngine can switch into OinO mode to use the memoized traces to boost performance. The online controller decides whether big or LITTLE should run in the near future and, in the case of LITTLE, whether it should run in regular In-Order mode or switch to OinO.

When the execution exhibits phased behaviour DynaMOS is claimed to significantly improve energy efficiency, being able to save up to 32% energy over execution only on big, with only 5% performance degradation.
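The memoize-and-replay idea behind OinO can be sketched as follows. Identifying a trace solely by its start PC and storing a plain issue order are simplifying assumptions made here for illustration only, not details of the actual DynaMOS hardware.

# Illustrative sketch of schedule memoization and replay (not the DynaMOS implementation).

schedule_trace_cache = {}  # start PC -> issue order memoized while running on big

def record_on_big(start_pc, ooo_issue_order):
    """Store the dynamic schedule observed while the big uEngine executed the trace."""
    schedule_trace_cache[start_pc] = list(ooo_issue_order)

def run_on_little(start_pc, program_order):
    """Replay a memoized schedule in OinO mode if one exists, otherwise fall back to InO."""
    memoized = schedule_trace_cache.get(start_pc)
    if memoized is not None and len(memoized) == len(program_order):
        return memoized            # OinO mode: issue in the recorded OoO order
    return list(program_order)     # regular InO mode: issue in program order

# The big core hoisted the second load above the add; LITTLE replays that order later.
record_on_big(0x400100, ["ld r2", "ld r4", "add r3, r2, 1", "st r4"])
print(run_on_little(0x400100, ["ld r2", "add r3, r2, 1", "ld r4", "st r4"]))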

The proposal admits certain caveats, however, which are listed as:

• False register dependencies: In OoO mode, Write after Write (WaW) and Write after Read (WaR) hazards are handled by register renaming. In OinO mode, the renamed register convention must be maintained for correctness.

• Complex traces: Instructions in traces that span branches may have been reordered, causing divergence in execution. The OinO mode must roll back to before the start of the divergence.

• Load/Store aliasing: Loads that may be speculatively moved before older stores cause aliasing. OinO uses a simplified Load Store Queue (LSQ) to surmount this issue.

• Unsupported interrupts: Interrupts in OinO mode are treated as misspeculation events. This causes the pipeline and the LSQ to be flushed and the trace to restart in regular InO mode.

Overall, CompositeCores oversimplify the design of hybrid systems, assuming that transferring the register state can be done with minimal added complexity, while the overhead in terms of cycles is, according to the authors, negligible. These claims are unrealistic in practical implementations, as routing extra wires adds capacitance which effectively introduces more delay, especially at high frequencies. In addition, the extra logic needed to connect or transfer the register file also contributes to this delay, while a clock mismatch between two cores can lead to synchronisation problems.

This has been addressed by Rotenberg et al.(2013); Forbes et al.(2016), where the authors try to implement in hardware a system resembling CompositeCores. Their design manages to transfer threads between two heterogeneous cores using two mechanisms, fast thread migration and Cache-Core Decoupling. Fast thread migration uses a “swap unit” to enable a one-cycle register file exchange and a swap clock to synchronise the inbound and outbound cores.

(a) An illustration of the chip placement. (b) An illustration of the chip routed.

Figure 2.5: Layout and wiring of two merged OoO cores (Rotenberg et al., 2013; Forbes et al., 2016).

The cache-core decoupling mechanism ensures that caches from the core that was active before the switch remain easily accessible to the migrating core.

Details of their implementation show that, even though the sharing of the Level 1 Cache (L1) and the register file is possible, the design must be oversimplified to achieve this. In particular, the two cores used to perform the migration do not differ much, as their two-core implementation features a 1-way OoO core and a 2-way OoO core. The two cores are in effect too similar, negating the benefits of heterogeneous systems, as they use up an equal amount of area (Figure 2.5) and power while having similar performance. Furthermore, as the wiring increases so does the delay in the transfer logic, forcing the operating frequency to drop to 180MHz, which is significantly slower than today’s heterogeneous processors.

Core merging with multiple back-ends for fine grain heterogeneity

Similarly, Fallin et al.(2014) propose merging multiple heterogeneous back-ends into one core, the heterogeneous block architecture. The idea closely resembles CompositeCores in the sense that all cores share the same front-end but execute blocks of instructions on different back-ends. The motivation of the study is based on the observation that, at a finer phase granularity (tens to hundreds of instructions), inherent characteristics are much better defined and detailed in comparison to coarser phases. Hence, instead of categorising in terms of compute and memory bound, smaller code blocks can also be classified by properties such as consistent or volatile dynamic instruction schedules, high or low ILP, etc.

Another observation is that, at a finer granularity, execution blocks can be assumed to be atomic and consequently can be executed on separate back-ends. These atomic blocks enable two major improvements: parallel execution of code blocks and specialised assignment to tailored back-ends.

Parallel execution of blocks is feasible with distinct back-ends which operate similarly to superscalar and OoO execution, but at a higher level of abstraction. Instead of reads and writes, code blocks use live-ins and live-outs to track dependencies between the blocks. If the notion of atomicity is maintained between blocks and there are no live-in/live-out dependencies, code blocks are committed all at once. In the case where this does not hold true, the dependent blocks are flushed and re-executed.
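A much-simplified sketch of this block-level dependency check is given below; representing blocks as plain sets of live-in and live-out registers is an assumption made purely for illustration and not the actual mechanism of the heterogeneous block architecture.

# Illustrative check of whether an atomic block can execute in parallel with in-flight blocks.

def can_issue(block, in_flight_blocks):
    """A block may issue if none of its live-ins are produced by an uncommitted block."""
    pending_outputs = set()
    for other in in_flight_blocks:
        pending_outputs |= other["live_outs"]
    return block["live_ins"].isdisjoint(pending_outputs)

# Block B consumes a value produced by the still-uncommitted block A, so it must wait
# (or, if issued speculatively and the dependency is violated, be flushed and re-executed).
block_a = {"live_ins": {"r1"}, "live_outs": {"r2"}}
block_b = {"live_ins": {"r2"}, "live_outs": {"r3"}}
print(can_issue(block_b, [block_a]))  # False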

Additionally, the ability to steer workloads to specific back-ends with certain properties is claimed to be critical to performance improvement. However, as code blocks are assigned online to back-ends, most of the time an extra step of morphing the code to better suit the architecture is necessary. The correct selection of back-ends is further simplified by the observation that schedules can be reused, because code in close vicinity tends to have similar behaviour. This can be seen in Figure 2.6.

[Plot: fraction of dynamic code chunks with same-schedule vs different-schedule chunks, per benchmark sorted by y-axis value.]

Figure 2.6: Code reusability based on locality. Figure taken from Fallin et al.(2014).

This study is evaluated through a design that consists of three separate back-ends: an OoO, an InO and a Very Long Instruction Word (VLIW) core. In the front-end, blocks are assessed, with metadata being stored in a Block Info Cache, which identifies the dependencies and the block specialisation. This information is then used to steer the blocks to the correct core type. Similar to Padmanabha and Lukefahr(2013), memoization is used to exploit OoO scheduling on VLIW cores and therefore increase performance and reduce expended energy.

An in-house full-system simulator is used to measure the improvement, which is reported to be on average a 36% power reduction compared to a conventional 4-wide OoO core with only 1% performance degradation.

2.2.4 Unconventional Approaches to HMP Design

From the above it is obvious that there is a significant effort to bring cores closer together and sometimes merge them without sacrificing heterogeneity in the process. Another unconventional approach being developed is reconfigurable cores that alter their structure, built to adapt to workloads with different demands.

In this category, Ipek et al.(2007) propose core fusion, a reconfigurable multiprocessor using smaller cores that combine into larger, more complex ones. The design is claimed to offer:

• Greater software flexibility and diversity. Simple cores are ideal for fine-grained parallelism, while high performance can easily be achieved by merging them ad hoc into larger cores.

• Smoother phase transitioning. The transition from small power-efficient cores can be done incrementally and bidirectionally, allowing for “more” core architectures.

• Greater utilisation of hardware using a single design. As the same core design is replicated throughout this multiprocessor, core fusion is essentially a modular structure that can strictly adhere to the workload needs without over-consuming or over-performing.

• Optimised design for parallel code. The ability to isolate cores allows for improved execution of parallel code. Fusing comes at a minimal overhead that is acceptable, as parallelism is argued to yield more benefits.

• Core independence and fault tolerance. Cores can run independently and detect, isolate or correct bugs and hard faults without the need for additional hardware.

[Diagram: eight cores, each with private L1 instruction and data caches, sharing a single L2 cache.]

Figure 2.7: CoreFusion architecture showing two fused cores and two independent cores. One of the fused cores is comprised of four independent cores while the other of two. Figure reprinted from Ipek et al.(2007).

To accomplish the aforementioned flexibility, several non-conventional designs are proposed, such as fusable instruction and data caches, collective instruction fetch amongst cores, instruction steering and renaming, collective execution, distributed memory access and collective commits.

Furthermore, two hardware reconfiguration instructions, called FUSE and SPLIT respectively, are added to the ISA to merge cores together and break them up. Both of these instructions execute conditionally, when the circumstances under which they are called allow it. The FUSE operation can be completed after the end of a parallel region and the start of a sequential region, if a fuse request has been made. If the sequential region is not long enough or the cores are not eligible to fuse, the instruction is instead executed as a no-operation. Otherwise, all the involved cores are flushed along with their respective instruction caches, and their fetching mechanisms are fused and reconfigured. Data caches are managed by the coherence protocol and need no special reconfiguration.

Figure 2.8: Comparison of OoO and InO performance (speedup over OoO with one thread) across a range of SMT configurations. Beyond a certain number of SMT threads, OoO and InO cores exhibit the same behaviour. Figure taken from Khubaib(2014).

The SPLIT operation happens before a parallel region. In this case, the pipeline is drained and copy instructions transfer the state to a selected core (Core 0). When the transfer completes, the designated core continues execution while the rest of the previously merged cores remain available for the application.

To measure the design’s performance, it is compared against static homogeneous and heterogeneous designs using a full-system simulator (SESC: SuperESCalar Simulator). While the results do not show the design to be superior to more conventional ones, they highlight which components impact performance and exemplify its versatility.

To amend CoreFusion’s problems, Khubaib et al.(2012); Khubaib(2014) propose MorphCore, a fused design that uses a 4-wide OoO core with the ability to decompose into small In-Order Simultaneous Multi-Threading (SMT) cores. The approach, which arguably stems from the opposite end of the spectrum to CoreFusion, observes that small cores that bundle together to create more complex designs struggle to deliver high performance. Instead, by starting from a high performance design, it is more feasible to dynamically reduce complexity in order to mitigate energy losses during slow or parallel phases.

The underlying issue that MorphCore tries to tackle is the extreme disparity in power consumption between large OoO designs and smaller designs. The extra power is stated to be justified only when there is enough ILP to increase core throughput; otherwise, bulky designs tend to be very inefficient. To ameliorate the design, another observation is made: SMT designs can achieve OoO performance in cases of high parallelism, as shown in Figure 2.8.

The architecture of MorphCore (shown in Figure 2.9) is essentially an extended SMT OoO core with an In-Order mode. The main idea behind this approach is that when the number of active threads exceeds a threshold, the core is reconfigured to work as an InO SMT core. This is done by disabling or swapping OoO components for InO ones to save energy and maximise Thread Level Parallelism (TLP). Switching between InO and OoO has similar overheads to CoreFusion, as pipeline draining and component enabling/disabling are needed.

Figure 2.9: The MorphCore microarchitecture as shown in Khubaib et al.(2012).

To validate the architecture, a cycle-level simulator was used, coupled with McPAT to collect power data. The results show similar performance between a MorphCore and a 2-way OoO design on single-threaded workloads and a 22% improvement in multi-threaded workloads.

Gopireddy et al.(2016) present another unconventional approach to heterogeneous and many-core design. The motivation in this work is to identify the main weaknesses in current designs and, by making small changes to the hardware, manage to solve them. In that sense, the main dilemma when running on HMPs is identified as the varying execution modes that need to deliver high performance while maintaining respectable energy efficiency. The claim is that DVFS and heterogeneity, while useful, are sub-optimal because both methods lack flexibility and leave a large portion of the hardware unused.

To amend the above, a design of many scalable cores is proposed, borrowing from both the concepts of voltage/frequency throttling and heterogeneous design. Similar to MorphCore, the starting point of the design is a high performance OoO core with two modes: a High Performance (HP) mode and an Energy Efficiency (EE) mode. In HP mode the core runs as a regular OoO core with standard DVFS capabilities. In EE mode, however, a number of changes allow the design to operate while consuming less energy. To achieve this, three main concepts are applied.

(a) Logic and storage delay for various voltage levels. (b) Logic and storage frequency for various voltage levels.

Figure 2.10: Core and memory delay and frequency as functions of voltage. Reprinted from Gopireddy et al.(2016).

The first concept applied is the decoupling of the supply voltage between the logic and the storage structures of the pipeline. This disentanglement stems from the fact that making storage voltage-flexible turns out to be very inefficient at nominal power values. For this reason, logic voltage is lowered to the point of acceptable delay.

With storage and logic decoupled, the second concept is to adjust the voltage of storage so that the frequency is twice as fast as the operating frequency of logic. To do this, a small increase in storage voltage is proposed, that significantly boosts frequency. As far as energy is concerned, the difference is rather small as most of the power dissipated is due to leakage. The faster storage execution is claimed to improve Instructions Per Cycle (IPC) and the overall performance and energy consumption of the core. The decoupling of voltage and frequency for logic and storage is shown in Figure 2.10.

The third concept is to leverage this higher storage speed by fusing parts of the pipeline in EE mode. As the pipeline depth reduces without affecting the core frequency, the IPC of the core improves. In the same manner, storage structures such as the branch predictor, the ROB, the Load Store Unit (LSU) and the register files are increased in size consuming some of the time slack but contributing to more ILP and MLP which in turn increase IPC.

When evaluating the design ScalCores are shown to have several advantages over regular HP OoO cores, mainly:

• For the same power constraints, they increase performance (and reduce execution time) by 31% on average.

• For the same power constraints, the energy consumed improves by 48%.

• The ability to quickly switch between HP and EE modes, allowing for finer-granularity switches.

(a) TRIPS Chip. (b) TRIPS Core. (c) Execution Node.

Figure 2.11: The TRIPS architecture taken from Sankaralingam et al.(2003).

Sankaralingam et al.(2003) present a fully reconfigurable design which is described as a polymorphous architecture. The goal of this study is to strike a balance between fine-grained parallel execution on small, well connected cores and more complex workloads on larger, more “logical” cores.

The architecture is comprised of four reconfigurable OoO (16-wide) cores that can execute ILP code efficiently, and 32KB reconfigurable memory tiles. When needed, these cores can break down into smaller, simpler designs to exploit TLP or DLP. Additionally, the design can reconfigure memory banks into scratchpad memories or synchronisation buffers to accommodate producer-consumer communication between neighbouring cores and exploit MLP. TLP is exploited by space-sharing the Arithmetic Logic Unit (ALU) array between threads, while DLP is obtained when, in predictable control flow, multiple small cores synchronise and execute unrolled loops.

As the design faces several challenges in distributed instruction fetching, extra logic is added to accommodate control protocols etc. In follow-up research, Sankaralingam et al.(2006) show that the design is indeed feasible and that the workarounds do not inhibit performance severely. For distributed operations the design shows acceptable behaviour; however, many challenges still exist, such as further mitigating the overhead of propagating communication beyond neighbouring partitioned cores.

Karpuzcu et al.(2009) try a different approach to many-core architectures. Instead of trying to comply with the power budget dictated by the breakdown of Dennard scaling, their many-core design explores the prospect of sacrificing core life expectancy through extended operation above nominal power to boost performance. To implement this, a new scheme called Dynamic Voltage Scaling for Aging Management (DVSAM) is used, which labels most cores as throughput cores operated only on parallel workloads, while the rest are marked as expendable cores that can be overclocked to up to 16% higher frequency to deal with sequential bottlenecks.

According to the study, in order for the above hypothesis to be feasible, two requirements need to be fulfilled. The primary requirement is an accurate model of core ageing that can help push the boundaries of performance and consumption. The study shows that the model is directly linked to the manufacturing technology used and therefore needs customisation for each unique implementation. Additionally, ageing is also considered a potential restriction for future designs, when cores might wear out at a faster pace. To this extent, the authors propose that expendable cores be gradually overclocked to maximise their performance gains without sacrificing too much of their life expectancy.

The second requirement is a processor design suited to this use. The design presented is BubbleWrap, a 32-core homogeneous multiprocessor specifically designed to accommodate DVSAM. The few expendable cores are used much like sequential accelerators, with the extra restriction that only one is used at a time until it completely dies out. Reflecting on the above, while the BubbleWrap processor is a homogeneous design, there is a clear distinction in the different ways that its sub-components are utilised, arguably allowing it to be classified as a heterogeneous execution design.

2.3 HMP Management

Apart from improvements in design and architecture, resource management arguably plays an equally, if not more, significant role in performance and efficiency. This can be seen in the many studies (Yoo et al., 2015; Yu et al., 2013; Patsilaras et al., 2012) that attempt to approach the best possible utilisation of the cores. Furthermore, even though significant progress has been made towards better utilisation, newer studies (Baum et al., 2016) still show that the field has not yet saturated and significant gains can be obtained by investing in researching it. Taking this into consideration, it is crucial to explore the progress made in managing HMP systems.

Bischoff(2016) remarks that finding the Pareto frontier in terms of DVFS points requires simplification of the state space and the merging of states so that it can be fully explored. Online DVFS governors have been shown to consistently achieve near optimal decisions. In addition, this work shows that finer-grained voltage/frequency throttling inhibits performance and efficiency. Lukefahr et al.(2014) attempt to similarly explore the HMP state space; however, in doing so an oversimplified system is used, disregarding cache warm-up and migration overheads. To this extent, no methodology to determine optimal core scheduling or a realistic migration granularity limit emerges from the literature review.

In conjunction with the above, when comparing architectural optimisation with software improvements in managing already existing hardware, there is a great disparity in effectiveness. In Sarma et al.(2015) it is shown that even basic awareness of the underlying hardware at the software level can lead to over 50% improvement in energy efficiency over a hardware-agnostic Linux kernel.

Navada et al.(2013) propose a hardware/software synergy model, where accelerator-type cores comprise the architecture and a steering algorithm schedules workloads on the designated core. The scheduler, in this case, measures stalls and subsequently finds the best core to handle the respective bottlenecks. This relatively simple methodology is claimed to significantly improve behaviour, with up to 25% energy savings for 2% performance degradation, presumably reaching 99% of optimal behaviour. This indicates that hardware and software co-design might be necessary to fully exploit HMPs.

2.3.1 Estimation and Modelling Techniques

The most common approach to managing HMPs is through modelling and using estimates of performance and energy efficiency to make adjustments. Gupta et al.(2012) take a closer look at heterogeneous systems and note that, apart from the actual cores, a large portion of the energy is consumed by the “uncore”, the neighbouring subsystem shared by the whole design, such as the shared caches, interconnects, power control logic etc.

To accurately estimate a system’s performance and energy efficiency, a model is proposed for big and small cores taking into account the components classified as uncore. The model in turn is used on different types of workloads, from intermittent mobile applications and sustained low-intensity applications like media playback, to multi-threaded parallel workloads. Through the experimental evaluation, it is shown that uncore power and performance are just as important as the actual core as far as energy efficiency is concerned. To fully utilise HMPs, the study concludes that uncore-aware schedulers are needed.

The concept of an uncore is also used by Chen et al.(2012) to model a system and apply DVFS techniques. The estimation technique is based on Average Memory Access Time (AMAT), which is shown to directly affect both performance and energy efficiency. For many-core Systems-on-Chip (SoCs), reducing AMAT can lead to significant performance/energy improvements. A PID-based DVFS controller is used to steer the system towards a better operating state. The results show up to 27% energy reduction for a 3.7% performance drop.
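As a reminder of how AMAT composes the memory hierarchy latencies (the numbers below are hypothetical and chosen only to illustrate the calculation, not values from Chen et al.(2012)):

AMAT = t_{L1} + m_{L1} (t_{L2} + m_{L2} t_{mem})

For instance, with a 1-cycle L1, a 15-cycle L2, an 80-cycle main memory, a 10% L1 miss rate and a 20% L2 miss rate, AMAT = 1 + 0.1 × (15 + 0.2 × 80) = 4.1 cycles; halving the L2 miss rate to 10% lowers this to 3.3 cycles, which illustrates why small changes in the uncore can translate into sizeable performance and energy differences.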

On the same concept of heterogeneity-aware management, Shelepov et al.(2009) develop a scheduler for HMP systems. The algorithm, called the heterogeneity-aware signature scheduler (HASS), attempts to improve scheduling on heterogeneous cores using architectural signatures. These signatures are a compact composition of metadata, including information on “memory-boundedness”, available ILP, clock sensitivity etc. To reduce run-time overhead, the signatures are generated offline and bundled with the application binary.

The static nature of HASS presents the method with some limitations that hinder its performance. Firstly, the static nature of the algorithm makes it susceptible to inaccuracies caused by dynamic changes in application behaviour. In addition, the signatures need to be specifically tailored for each application, which adds a small overhead on the application development side. Finally, the dynamic phases of applications cannot be identified with HASS and, as a consequence, cannot be used to exploit heterogeneity. Testing the scheduler shows that it performs close to the static oracle, the best possible static schedule.

Becchi and Crowley(2008) explore scheduling decisions for HMP architectures, and specifically the difference in performance and behaviour between static and dynamic scheduling. To validate this, two dynamic policies are used and compared to a near optimal static scheduler. The system features three different cores sharing the same ISA. The study shows significant improvement when using a dynamic scheduler over the static one, mainly because the system can adapt to workload changes, which happen often.

2.3.2 Load Balancing and Work Stealing

A study focusing solely on dynamic schedulers with interesting insight is Chronaki et al.(2015). The proposal is a dynamic scheduler, called Criticality Aware Task Scheduler (CATS), that is HMP-aware and maximises efficiency. CATS defines tasks as critical when they are in the longest path in a dependency graph and assigns them to fast cores. As the paths in the dependency graph change dynamically, the scheduler is able to track the change and accordingly adjust and migrate tasks. This way critical tasks get to execute in high performance cores while the little, low-power cores handle the non-critical workloads.

To maximise the efficiency of the method, tasks can either be clustered into groups, separated randomly, or even grouped into families that reduce transfer overheads. CATS was validated both on real hardware and in simulation, with significant improvements ranging from 30% up to 170%.

Kondo et al.(2007); Vet et al.(2013); Bhattacharjee and Martonosi(2009b) extend the above concept of weighing and balancing the criticality of threads and propose task stealing. Kondo et al.(2007) use the concept of thread fairness to improve performance and avoid unbalanced cores. Instead of migrating the workloads to different cores, the algorithm focuses on using DVFS in order to improve fairness and reduce memory contention. Through fairness it is claimed that multiprocessor architectures have increased throughput compared to their unbalanced counterparts. Bhattacharjee and Martonosi(2009b) try to implement load balancing by using currently existing PMCs to track memory statistics and determine thread criticality.

Akram et al.(2016) identify that wherever there is garbage collection, it is beneficial to track its criticality so that it never stalls system operation. When identifying imminent bottlenecks or urgent garbage collection, the scheduler moves the task to the high performance cores to avoid the aforementioned stalls. According to the study this can yield notable benefits of up to 16% better performance and a 20% improved energy-delay product.

2.3.3 Management Using Learning

Another DVFS approach (Sasaki et al., 2007) tries to combine static information, such as breakpoints in execution that signify phase changes, with PMCs such as cache misses, branch predictor misses, IPC etc. The prediction model is based on a statistical regression analysis using the static and dynamic metadata. This way, the relation between performance and the PMCs is learned beforehand. At run time, code for predicting performance is inserted into the execution, and the voltage/frequency level is decided based on the desired predicted performance degradation from the analysis.
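A minimal example of the kind of regression model described above is sketched here; the choice of counters and the synthetic numbers are illustrative only and are not the model or data of Sasaki et al.(2007).

# Illustrative least-squares model relating PMC readings to normalised performance.
import numpy as np

# Rows: [cache_misses, branch_misses, IPC] per sampled interval (synthetic data).
pmcs = np.array([[120, 30, 1.8],
                 [400, 55, 1.1],
                 [260, 40, 1.4],
                 [ 80, 20, 2.0]])
observed_perf = np.array([0.95, 0.60, 0.78, 1.00])

# Learn the coefficients offline; at run time they predict the expected performance,
# and the V/f level is chosen to meet a target predicted degradation.
X = np.hstack([pmcs, np.ones((len(pmcs), 1))])     # add an intercept term
coeffs, *_ = np.linalg.lstsq(X, observed_perf, rcond=None)

new_sample = np.array([300, 50, 1.2, 1.0])         # counters from the current phase
print(f"predicted normalised performance: {float(new_sample @ coeffs):.2f}")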

Tarsa et al.(2014) try an unconventional method of managing HMPs. The study suggests using a deep neural network to train on data from hardware performance counters in order to estimate periods of low system throughput. To improve the accuracy of the neural network and reduce training overhead they use hierarchical sparse coding with a support vector machine for the classification. Sparse coding is used for de-noising, while the hierarchical representation captures complex patterns from few samples. This happens as the data vectors are classified in the transformed feature domain, which is claimed to uncover typically “hidden” patterns.

The simulations show that the method improves prediction and behaviour over linear regression models and achieves an accuracy of over 94% when training on one workload. The results show that deep learning and neural networks can be a promising approach for future HMP management. For this to happen, however, obstacles need to be surmounted in dealing with over-fitting, diverse workloads and user-interactive applications.

2.4 Discussion

Based on the above analysis, it is clear that there is a lot of progress yet to be made in heterogeneous processor design and management. On the architectural side, many designs have been proposed to improve performance while maintaining a low energy footprint; however, the performance/energy improvements are minimal while the complexity of the new features seems prohibitive. The conclusion drawn from the collective results of the above studies is that HMP architecture and topology have not yet been fully understood and researched. On the software side, the studies have shown more progress in terms of improving energy savings. The concept of phases within workloads arises, indicating that their correct prediction can lead to improved efficiency when they are assigned to the correct core. Furthermore, it has been observed that even simple changes in scheduling policies can lead to significantly more efficient system operation. Reversing the perspective, even the most optimised hardware will perform inefficiently when the wrong management decisions are made. On the other hand, without hardware enabling system awareness, it is impossible to implement an optimal processor scheduler.

Consequently, it is vital to redesign hardware to fully exploit the software improvements that have been made. As a result, the remainder of this thesis revolves around finding techniques to quantify the effects of future architectural and microarchitectural changes that can improve system behaviour, particularly in terms of the memory subsystem and the branch predictor.

Chapter 3

Heterogeneous Multiprocessor Memory Subsystem Optimisation

To take full advantage of heterogeneous systems it is necessary to have a full understanding of the effects of migrations, as their benefits lie in being able to switch and adapt to the dynamism of workloads. As shown in Chapter 2, HMP design and management are still open fields for research. This can be primarily attributed to the inability to quantify the performance of current designs and policies.

Research on actual hardware so far focuses on improving scheduling, while studies on simulated HMPs try to improve designs by increasing component sharing between the heterogeneous cores (Shelepov et al., 2009; Butko et al., 2015; Navada et al., 2013; Lukefahr et al., 2012; Padmanabha et al., 2015). Both approaches, however, try to improve performance and energy efficiency without identifying the principal factors that cause performance to degrade. As such, scheduling is improved via heuristics, while HMP designs that feature aggressive component sharing might not be the most pragmatic way to remove bottlenecks.

This work focuses on understanding all aspects of scheduling and migrations in heterogeneous systems. Unlike prior art that simply simulates the overall performance of specific proposed heterogeneous systems (Navada et al., 2013; Butko et al., 2016; Fallin et al., 2014), in this chapter a novel simulation methodology is presented that enables the precise measurement of different HMP system components and their effect on the overall performance, especially during frequent transfers of execution between cores.

A prominent problem that arises throughout all studies of HMPs is the inability to determine, in an absolute way, a design or policy that could yield improved efficiency. Before explaining the details of the proposed methodology, it is useful to understand why in-depth research of heterogeneous systems has proven to be so difficult.


3.1 Simulation vs Hardware Experimentation

When it comes to heterogeneous systems, researchers have over the years debated what is the best way to study their characteristics and find ways to improve them. At the core of this discussion lies the question of whether to use actual hardware or revert to simulated systems to draw conclusions (Butko et al., 2015, 2016; Venkat and Tullsen, 2016; Venkat, 2019). Both approaches have their advantages and disadvantages, so choosing one over the other should be done carefully in order to complement the research rather than hinder it.

Experimentation on actual hardware has two main advantages. Firstly, while simulated devices always have inaccuracies stemming from modelling complexities, physical devices, by definition, are not susceptible to such errors (Walker et al., 2018a). The modelling errors even in accurate simulators can often be in the range of 20% of the actual values (Binkert et al., 2011). This is why hardware is often preferred when accurate measurements are necessary.

Secondly, execution on real hardware can be much faster than that of a simulated system. Depending on the simulation fidelity, simulated execution can be in the order of 100 times slower than running on an identical native platform. Fast execution can be especially useful when many results are needed, for example when using machine learning approaches which require a lot of training data.

On the other hand, simulations allow certain experiments to be conducted which are not otherwise feasible (Lukefahr et al., 2012; Venkat, 2019). The simulation results, while error prone, can often lead to useful conclusions when the models are correlated with actual hardware (Walker et al., 2018b; Reddy et al., 2017; Nikov et al., 2015). Furthermore, unlike real hardware, they allow for in-depth analysis by retrieving significantly more data than what is possible from a physical device. This happens because hardware is restricted to a small number of performance counters that get polled with some degree of imprecision, whereas a simulated run can always extract all the data from the system at once. One more crucial aspect that makes simulations preferable is that they allow experimentation with future designs and enable experiments that can explore all hypothetical means of execution.

To gain a better understanding of heterogeneous systems, part of this thesis focuses on quantifying HMP behaviour, finding ways to assess scheduling decisions and identifying the better migration policies. As such, the use of simulated systems provides a clear advantage in the ability to explore all possible scenarios of execution. As an example, Figure 3.1 shows how an excerpt of execution can be perfectly aligned and used to directly compare the performance between a big and a little core. This analysis is feasible in a simulated system, as the same simulation can be replicated and perturbed, which in the case above swaps the cores. By contrast, as forked execution is not possible in hardware, two separate executions need to be triggered in parallel. However, in hardware the executions are not fully deterministic, so deviations between them will cause them to diverge and therefore the phases will end up misaligned. While a small skew can be tolerated when conducting a coarse-grained investigation, it is especially severe in the case of fine-grained analysis, as even a small skew can lead to misaligned and ultimately wrong results.

[Plot: instructions per cycle of the big and LITTLE cores over the same excerpt of execution, sampled every 1 million instructions (top) and every 100 thousand instructions (bottom), against billions of instructions executed.]

Figure 3.1: An excerpt of execution of dijkstra running on two different core types. Sampling at a higher resolution reveals phases that can potentially be exploited to increase efficiency.

3.2 Exploration in Heterogeneous Architectures

Current approaches (Fallin et al., 2014; Lukefahr et al., 2012; Navada et al., 2013) compare designs and policies and decide on the preferred setup using empirical evidence extracted from simulated systems. The reason for doing so is that, when considering all the scheduling possibilities for HMPs, the state space can become intractably large to assess, as is mentioned in Lukefahr et al.(2012, 2014); Padmanabha and Lukefahr(2013); Padmanabha et al.(2015). This leads to a fundamental problem: an exhaustive exploration is too expensive to find the configuration and schedule with the best possible performance.

3.2.1 Exhaustive Exploration

To simplify the problem, the following analysis focuses on single-threaded execution, where only one core operates at a time. Exploring the whole state space of this simplified system offers only two choices, run on big (b) or run on little (L), for each quantum of execution. This still proves to be incredibly complex, because the number of possible unique states that emerge increases exponentially. As a visual example, a system similar to big.LITTLE is presented in Figure 3.2. The nodes at the edges of the tree represent states that have never migrated, whereas the inner ones have performed at least one migration in the past. The state space can be expressed mathematically using the depth (N) and width (α) as:

State space = α^N    (3.1)

When attempting to perform fine-grained migrations, it is evident that the state space very quickly explodes and becomes unmanageable, as this increases the tree depth. Increasing the possible choices by adding more, diverse cores and enabling various DVFS levels contributes further to the problem by increasing the tree width. A similar problem is described in Bischoff(2016), where the optimal DVFS schedule is sought. The solution proposed in that case is to simplify the state space by taking into consideration that, apart from different operating frequencies, identical cores have identical behaviour under the assumption of single-threaded execution. In that case, it is possible to linearise the space by treating every state as independent from any previous states.
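To put Equation 3.1 into perspective, with α = 2 (one big and one little core) and a migration decision every 1k instructions, a region of only 1M instructions already gives N = 1000 and therefore 2^1000, roughly 10^301, possible schedules; the granularity here is an arbitrary example, but any realistically fine quantum leads to a similarly intractable count.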


Figure 3.2: The state space when exploring the execution of a heterogeneous system; the fan-out after four steps is 2^4 = 16.

In the case of HMPs, this approach is not possible, as heterogeneous cores execute code in fundamentally different ways. This can be seen, for instance, when comparing an OoO and an InO core that execute the same code. The reordering of instructions that the OoO core performs can leave the caches, the TLB or the branch predictor in a different state compared to the execution of the InO core. As such, there is no guarantee that at the end of each state the memory subsystem will be in an equivalent state, and therefore the tree states cannot be merged to reduce the complexity of the analysis.

3.2.2 State Space Reduction

Instead of exploring all the possible states, an approximation can be used that, with limited complexity, can still produce useful results. The first step to simplifying the state space is based on the observation that execution on different cores does not affect the stream of instructions upon commit. This allows the direct comparison of heterogeneous cores that operate at different performance levels and frequencies, as long as epochs are measured in executed instructions and not in cycles or time.

In spite of the above, states cannot be merged because, even though the instructions between the different core designs can be synchronised, the caches might be in a very different state. This happens because OoO cores often reorder read and write instructions to mask cache misses. As InO cores do not have the ability to do so, caches can be in a different state even after executing the same code. This disparity in the cache state can, in certain circumstances, make two states that execute on the same core type have very different outcomes in terms of performance.

To overcome this, the observation is made that systems with private memories have empty or “cold” caches after a migration. This can help us identify the equivalent states and merge them. The affected states are the ones that run the same instructions on the same core with no data in the caches.

[Diagram: after merging, one step yields 2 unique states, two steps yield 4, three steps yield 6, and four steps yield 8.]

Figure 3.3: Depending on the history and the granularity, the complexity of a step changes by 2^H. Arrows of the same colour denote merged states, while unique arrows show states that cannot be merged.

To generalise, any state where the cache contents do not affect the operation and the core executes the same instructions can be merged. As mentioned above, migrating to a core with cold caches is one case that generates equivalent states. Other cases that simplify the state space occur when very volatile workloads constantly thrash their caches; in such workloads performance will not be notably affected by migrations, as the system is constantly missing anyway. On the other side of the spectrum, workloads that do not rely on caches to deliver performance can be simulated independently from previous system states, as the cache effects are negligible.

The number of instructions needed for warm-up varies with each workload. This means that a system can be fully explored at a predefined granularity ignoring the cache effects. Considering, for example, that beyond a certain number of instructions history becomes obsolete (Section 3.5 shows this happens in most cases after approximately 100k instructions), a simpler model can be constructed that does not fan out beyond calculable proportions. At that granularity, as caches do not affect core performance, the model can be reduced to two states per level: one running on the big core and one running on the little one.

Based on this logic, if migrations need to happen at double the granularity of full cache warm-up, the tree needs to be twice as deep. However, some states can be simplified, as noted above, when they follow the same pattern after a migration. Extrapolating from this logic, it is possible to calculate that every step needs 2^H calculations to fill all states, where H is the number of historical steps and is defined as a natural number describing the ratio of the granularity of analysis over the number of instructions needed for full cache warm-up (Figure 3.3).

The overall complexity to fully explore the state space can be calculated by multiplying by the number of quanta in the simulation. This makes the algorithm O(N), assuming that H << N, where N is the number of forks. An in-depth analysis of the methodology that merges the states and reduces the space can be found in Appendix A.2.
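The difference between the exhaustive and the merged-state exploration can be illustrated with a small calculation. The quantum count and history depth below are arbitrary example values, not measurements.

# Illustrative comparison of exhaustive vs merged-state exploration costs.

def exhaustive_states(alpha, N):
    return alpha ** N                  # Equation 3.1

def reduced_simulations(N, H):
    return N * (2 ** H)                # roughly O(N) when H << N

N, H, alpha = 1000, 2, 2               # e.g. 1k-instruction quanta over 1M instructions
print(len(str(exhaustive_states(alpha, N))) - 1)   # exhaustive: ~10^301 states
print(reduced_simulations(N, H))                   # merged: 4000 simulations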


Figure 3.4: The state space of HMPs broken down in orders of migrations.

3.2.3 Simulation Methodology for First-Order Migrations

While the above analysis can theoretically help approximate the generation of the full state space with accuracy, the methodology is impractical for two reasons. First, in order to reduce the state space, systems have to have private caches, as mentioned above. Second, the behaviour of the workload with respect to how it accesses the caches needs to be known beforehand in order to safely merge states for a given granularity of migrations. These prerequisites make the full state exploration with the simplifications described above inflexible in terms of system design. For this reason, a different, more versatile approach is needed to reason about scheduling decisions in HMP systems.

Instead of exploring all the possible states, a simpler approximation can be used with limited complexity but more universal application, which can still produce useful results. In Figure 3.4 the states are categorised by their distance from a migration that has been performed before their execution. Depending on how much complexity can be afforded in the analysis, the number of migrations explored can be selected.

Hence, to avoid exploring all the states, Figure 3.5 shows a methodology that only explores edges from Figure 3.2 and their immediate spawns that migrate. For the remainder of the thesis this subset of states is referred to as first-order migrations, as the system is not explored beyond one migration.


Figure 3.5: For first-order migrations, the faded states are not explored. This can greatly reduce the state space, albeit reducing the scope and accuracy of possible experimentation.

While this limits the scope of possible experiments, the results extracted are based on assumptions that are easy to validate. Using this methodology, it is possible to perform a direct comparison of executions for each phase, between big and little cores. This can be especially insightful when trying to identify how fast migrations can occur and to single out components that contribute more to the migration overheads. Additionally, even though not exhaustive, an approximation can be made about which phases of workloads can be migrated to the opposite core to improve performance or save energy.

The entire process is implemented in gem5, a full system simulator, using the Arm ISA (Binkert et al., 2011). The methodology directly compares the execution of systems with and without migrations by forking the simulator and creating two identical simulations, the main or parent and the forked or child simulation. After the simulator splits, the main simulation continues unperturbed, while the fork performs a migration. The design is shown in Figure 3.6, where the uppercase M refers to the original run while the lowercase m symbolises a migration. The basic concept of the methodology is to create an identical copy of a simulation and alter some properties in the newly spawned simulation.

The benefits of the methodology are not restricted to heterogeneous systems, as it can be generalised to cover a broader set of problems. By simulating migrations between homogeneous cores, it is possible to isolate the warm-up behaviour of specific components, which are identical between the cores. This is possible because, in the simulator, one can control how much of the cores is shared and therefore analyse the effect of each component separately.

Switching the core in the forked simulation could be achieved in multiple ways to enable different observations. One way to do it would be to signal the OS in the child to issue a migration.


Figure 3.6: A visual representation of the forking methodology. The main simulation spawns periodic forks that perform a migration.

The OS can either trigger the migration itself, or the forked simulation can cause the hardware to raise a signal that a migration must be performed. Both of these methods, however, would add the OS migration overhead.

From an architecture perspective, the goal of this methodology is to avoid the software overheads of context switching altogether, as those mask any hardware effects. This is because migrations burden the system with instructions, on the order of millions or more, as transferring the state and rescheduling the task is a complex process.

For this reason, the simulation implementation features a single-core system, where forked simulations “hot-swap” the core, leaving the caches and memory intact and keeping the OS and the software completely agnostic to the entire process. The hot-swap acts like a migration to a new core, except that when execution is resumed the internal components (e.g. pipelines, branch predictors etc.) can be configured to start empty or with information from the swapped-out core.

To add versatility and flexibility to the methodology, the memory subsystem can be manipulated post migration to emulate several scenarios which are considered useful. This allows for different levels of sharing to be examined and directly compared to determine the best possible setup.

As the caches are separated from the core in gem5, to emulate sharing the only thing that is necessary is to make sure that, during the hot-swap process, they are not flushed. The TLB, on the other hand, is tied to the core design; hence, to emulate sharing, the same TLB must be rewired and connected to both cores. When a hot-swap is performed, the system drains in-flight messages, flushes the operating core and afterwards swaps it with the inactive one.

After the fork swaps the core designs, depending on the level of sharing explored, a write-back and invalidate is triggered for the components considered private. The newly active core resumes operation with those structures empty and proceeds to warm them up before assuming steady-state performance, which emulates the effects of a hardware migration.


Figure 3.7: The implementation with which the gem5 simulator “hot-swaps” the cores to emulate a system with varying shared resources.

On the software level the system seamlessly continues to operate, completely agnostic of the swap. This can be accomplished because the gem5 simulator can switch the micro-architecture of components in the system, such as the core design, without altering the architectural state, such as the register values. This is exploited to single out the specific hardware component overheads during migrations. A graphical representation of the methodology infrastructure can be seen in Figure 3.7.

One important detail to note is that, irrespective of the sharing scheme, the register file is always considered to be shared, as the simulator does not account for the cost of transferring it. Several hybrid core proposals claim that the state transfer is possible with negligible overhead (Forbes et al., 2016; Rotenberg et al., 2013; Lukefahr et al., 2012). While this is not considered realistic for actual contemporary hardware, it helps to draw useful conclusions. First, it helps assess a best-case scenario when aggressively sharing components and compare the results to the many studies that claim fine-grain migrations using merged designs are feasible and beneficial (Fallin et al., 2014; Lukefahr et al., 2012, 2016, 2014; Padmanabha et al., 2015; Forbes et al., 2016; Rotenberg et al., 2013). Second, it allows a cleaner observation by singling out the effects of specific components, and provides insight that could otherwise be masked.

3.3 Heterogeneous Multiprocessor Organisation

Having established a mechanism to assess and compare HMP results, it is worth taking a deeper look into migrations and identifying the memory components that provide realistic benefits when shared. In the two previous chapters it was shown that the reasoning behind heterogeneous systems stems from the premise that each workload can be executed by a more suitable, designated core. However, transferring costs can potentially mask any positive effect, so it is crucial to minimise them as much as possible.

From the above, it is clear that migrations have a pivotal role in the design and management of heterogeneous systems, and should naturally be thoroughly researched. The remainder of the chapter explores shared memory schemes that aim to minimise the migration costs. The aim is to quantify the migration overhead and calculate the expected benefits assuming a perfect migration policy.

(a) No sharing. (b) Share caches. (c) Share caches and TLB. (d) Share L2 and TLB.

Figure 3.8: Different types of sharing schemes examined for HMP systems.

3.3.1 HMP Resource Sharing

HMPs heavily rely on switching between different core types to reach certain power/performance points and tailor to a diverse set of workloads and environment restrictions, like battery constraints for instance. Consequently, the level of sharing in a system significantly impacts the ability to perform a migration between cores without significant performance degradation. Even though sharing can notably reduce costs when migrating, it adds complexity, area and latency that can hurt the performance of the cores.

Based on this, conventional design mandates that heterogeneous cores usually share at most the LLC or the main memory. To achieve high frequency switching, recent academic trends propose radical approaches that bring cores closer together, usually by sharing the Level 1 Instruction Cache (IL1), the Level 1 Data Cache (DL1), the L2 cache and in some cases even some internal components such as the pipeline and the branch predictor (Khubaib, 2014; Lukefahr et al., 2012, 2014; Padmanabha et al., 2015; Lukefahr et al., 2016; Rotenberg et al., 2013; Forbes et al., 2016). The feasibility of such designs however still remains questionable, as some parts cannot be shared between drastically different designs, while others (the IL1 and DL1 caches) cannot be shared without lowering performance. A more realistic approach, instead of opting for full or no sharing, is to target the features cores can easily share to increase performance when migrating. With this in mind, four different sharing schemes are considered in the evaluation.

Scenario A: No Sharing

The first case, analysed as a baseline, is when cores have private caches and share no other components apart from main memory (Figure 3.8a). Here every migration causes the system to transfer its register file from the retiring to the incoming core before resuming normal operation. In addition the system also has to warm up the empty or “cold” caches of the new core before it can achieve its maximum performance. The overheads for the no sharing ($T_{private}$) case can be broken down as:

\[
T_{private} = \underbrace{T_{pip} + T_{BP}}_{\text{non-shareable}} + \overbrace{T_{reg} + T_{TLB} + T_{\$1} + T_{\$2}}^{\text{tied to core}} \tag{3.2}
\]

where $T_{pip}$ and $T_{BP}$ are the pipeline and branch predictor warm-up penalties respectively, $T_{reg}$ is the overhead of transferring the register file, $T_{TLB}$ is the extra cost of warming up the TLB, and $T_{\$1}$ and $T_{\$2}$ are the penalties induced when starting with cold L1 and L2 caches respectively.

Scenario B: Share Caches

The next level of sharing is one where the two cores partly or fully share the caches (Figure 3.8b). This can potentially reduce the warm-up time after a switch. The shared cache overhead is described as:

\[
T_{cache} = T_{pip} + T_{BP} + T_{reg} + T_{TLB} \tag{3.3}
\]

However, when workloads have a very small memory footprint, or thrash through data in the cache, the benefits of sharing are limited. Another potential problem with this sharing scheme is that even if caches contain useful data post-migration, empty TLBs can penalise the system significantly.

As systems today commonly operate in a virtual address space, a translation mechanism is required to convert the locations to actual physical addresses.


Figure 3.9: Generalized view of a stage of address translation (Arm Ltd., 2017a).

Nested page tables are used to map a Virtual Address (VA) to a physical one. The process of translating a VA to a Physical Address (PA) is done by a page table walk, which goes through all the page table levels sequentially. However, the walk is slow, as it requires multiple memory loads. If the processor had to do this look-up every time an instruction accessed memory it would cause a substantial slowdown, especially if the data was not cached.

To reduce the latency of such an operation, the TLB acts as a dedicated cache for this look-up. The TLB stores only a few entries, where each TLB entry contains both a VA and its corresponding PA, bypassing the need for a page table walk to take place.

If an instruction asks the processor to do some memory operation on a VA, the processor first checks to see whether the TLB contains an entry for that VA. If it does, then that is a TLB hit for the look-up, and since the TLB entry also contains the PA, the processor immediately knows what physical address to use. In this case, the TLB hit latency is only one cycle.

In the case of a TLB miss however, the processor has to laboriously do the virtual-to-physical conversion through a page table walk, shown in Figure 3.9. In the worst case, if all page tables reside in memory, this can lead to multiple main memory reads, as many as the table levels. The model used for the experiments assumes four levels and a maximum latency of 600 cycles to resolve a TLB miss. Once the processor finishes doing that conversion, it adds an entry to the TLB so that future conversions of that virtual address will not have to pay the same penalty (Randal E. Bryant, 2015). Clearly, TLB misses can be a major contributor to system slowdown if they happen frequently, despite the TLB's relatively small access latency. For a sense of scale, Table 3.1 shows the access latency of all the components considered for sharing, measured in cycles.
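To make the cost of a miss concrete, the look-up can be modelled with a toy Python cost function using the latencies assumed in this chapter (Table 3.1) and a four-level walk; the parameters, and the assumption that each walk level costs either an L2 hit or a DRAM access, are illustrative simplifications rather than the exact gem5 model.

    TLB_HIT_CYCLES = 1
    L2_HIT_CYCLES = 14
    DRAM_CYCLES = 150
    WALK_LEVELS = 4

    def translation_cost(tlb, vpn, levels_cached_in_l2=0):
        """Return the cycle cost of translating virtual page number 'vpn'.
        levels_cached_in_l2 is how many page table levels hit in the L2."""
        if vpn in tlb:
            return TLB_HIT_CYCLES
        # TLB miss: walk every level; levels absent from the caches go to DRAM.
        walk = (levels_cached_in_l2 * L2_HIT_CYCLES +
                (WALK_LEVELS - levels_cached_in_l2) * DRAM_CYCLES)
        tlb.add(vpn)  # fill the TLB so the next access to this page hits
        return TLB_HIT_CYCLES + walk

    tlb = set()
    print(translation_cost(tlb, 0x1234))     # cold miss, all levels in DRAM: ~600 cycles
    print(translation_cost(tlb, 0x1234))     # subsequent hit: 1 cycle
    print(translation_cost(tlb, 0x5678, 3))  # partial walk: 3 L2 hits + 1 DRAM access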

Scenario C: Share Caches and TLB

The full sharing scheme (Figure 3.8c) assumes that both caches and TLBs are shared among heterogeneous cores, but still needs to warm up the pipeline, branch predictors and the register file.

Component      Latency (Cycles)
TLB            1
IL1$           4
DL1$           5
L2$            14
Main memory    150

Table 3.1: Latency for TLBs, caches and memory in-line with the conducted simulations.

Sharing the register file is too complex design-wise, while its relatively small size allows us to ignore the penalty, especially when compared to warming the branch predictors. The pipeline and branch predictors cannot be shared between different core types, as functionally they are very different components; for example, the state of the speculative, reordering, 4-wide and deeper pipeline of the OoO core cannot be mapped onto a 2-wide InO pipeline with half the depth.

For these reasons, the overhead for full sharing becomes:

\[
T_{share\,all} = T_{pip} + T_{BP} + T_{reg} \tag{3.4}
\]

Scenario D: Share L2 Cache and TLB

The issue with full system sharing is that, while the total slowdown of a system is minimised after a migration, sharing the level 1 caches and the TLBs between the cores adds extra latency that can dramatically reduce the overall performance. This is because components like the IL1 cache are very sensitive to latency (≈ 4 cycles per hit) and even a single extra cycle of latency can lead to a 25% performance decrease, as it is constantly being accessed. Sharing the L1 has this effect because, in order to connect two cores to the same cache, extra capacitance from wires is introduced and switching logic (e.g. a multiplexer) is added, both of which add latency.

For this reason an alternate architecture is studied that shares the L2 cache and the TLB (Figure 3.8d).

As the L1 caches will not be shared, the overhead in this case becomes:

\[
T_{mixed} = T_{pip} + T_{BP} + T_{reg} + T_{\$1} \tag{3.5}
\]

Sharing a TLB between cores is not normally considered practical, as the translations, similar to accesses to the IL1 caches, are in the critical path and therefore latency sensitive (Bhattacharjee and Martonosi, 2009b). While this is true, many recent proposals show this can be effectively addressed. Prefetching (Lustig et al., 2013; Pham et al., 2014, 2012) and the use of hierarchical TLBs that share the last level tables (Bhattacharjee and Martonosi, 2009a; Bhattacharjee et al., 2011; Bhattacharjee and Martonosi, 2010) have been shown to be effective in masking the added latency of shared TLBs without losing performance. For this reason, the assessed memory system assumes a monolithic shared TLB.

One important distinction to make is that the latency of each component is not equivalent to, but related to, the overhead values described above. For instance, using Table 3.1 it can be seen that a miss in the L1 is relatively low impact, but retrieving the desired data that might reside only in main memory will end up taking hundreds of cycles. As such, while the overheads can be decomposed, it is impossible to set an upper bound on them, as the worst case depends on the type of workload.

3.3.2 Internal Core Components

While the sharing scheme plays a large part in determining system performance and energy efficiency, the different functionality of cores in heterogeneous systems also affects system behaviour. This is because, despite the fact that there is an inherent penalty to switching, migrating to a more suitable core can amortise the cost and boost performance compared to remaining on a less compatible core (Venkat and Tullsen, 2014; Tomusk et al., 2015; Sankaralingam et al., 2003). For this reason, it is important that a heterogeneous system comprises cores that cover a wide range of applications across a variety of energy/performance points.

For the experimentation, a conventional HMP system like the Arm big.LITTLE is considered as the base design for the simulations (Arm Ltd., 2013). One specific aspect of cores that largely affects the performance and overall hardware overhead is the pipeline. The OoO core is a 3-way design with a deep pipeline, designed to deal with stalls when reading from memory through parallel execution and instruction reordering. On the other hand, the InO core is a 2-way design with significantly fewer pipeline stages, incapable of reordering and fully exploiting ILP.

The deeper and wider the pipeline, the more cycles needed to warm it up. Furthermore, in order to approach steady state performance, the OoO core also needs to fill the re-order buffer and the branch predictor. Both these structures are vital for the cores and, perhaps more importantly, can significantly hinder performance when they are not warmed-up.

The simpler branch predictor and the lack of a reorder buffer of the InO core, while limiting the ability to parallelise instructions and mask memory requests, make it much faster to assume steady performance. Additionally, the simplicity in functionality allows the core’s design to be much smaller and consume less energy. Overall comparisons between the two cores show that the InO core is 3x more power efficient than the OoO core, when clocked at the same frequency (Lukefahr et al., 2012).

3.3.3 Comparing Parallel Simulations

To explore migrations, a main simulation is created that runs on the same core from beginning to end without performing any migrations. After a warm-up interval, the main simulation starts spawning forks periodically until the workload terminates. The period for each sample is measured in instructions instead of time. Using time to determine the sample period can cause the parallel simulations to diverge when there is a difference in performance, as the simulation with the higher IPC will execute more instructions in the allotted time. Sampling with instructions solves this as, irrespective of the performance or time they take, both runs will end up executing the same code.

The forked simulations examine first-order migrations, which effectively run only for a single period and then terminate. This way, for every period running on the main simulation, there is an equivalent simulation that examines how a system would be affected by a migration at that specific phase. The termination of the forked simulations after one period restricts the methodology to only one migration; however, expanding to more migrations would cause the state space to become too large to compute.

Two separate experiments are devised using the methodology, each answering a different question:

• Migration Cost: Taking into consideration the different sharing schemes, what are the warm-up costs for each core?

• Efficient Migrations: What is the best setup (sharing scheme, switching frequency) for HMPs?

To answer the former question, a migration always switches to the same core type, from OoO to OoO and likewise from InO to InO, for all the sharing schemes described in Section 3.3.1. The migration frequency limitations for each core can be calculated by normalising the IPC of the migrations with the corresponding main simulations, averaged over all samples. Applying the arithmetic mean in this case generates a valid result, as all samples run for the same number of cycles. Consequently the fraction numerators can be immediately added, while their denominators will be simplified out of the formula:

\[
\text{Relative Performance} = \frac{\sum \frac{IPC_{mig.}}{IPC_{main}}}{N}, \qquad N \text{ is the number of samples.} \tag{3.6}
\]

To determine what the optimal setup is for HMPs, another experiment is devised that switches the core type, from OoO to InO and vice versa, in the migratory simulations. The IPC comparison for this experiment reveals the performance disparity between the two cores in each phase. For this experiment, two migration policies are used to determine beneficial switches, the no trade-off policy and the EDP policy.

The first is when migrating to the opposite core reduces power or execution delay without sacrificing performance or energy respectively, which can be useful in situations requiring high responsiveness or when energy is a constrained resource. The second policy examines switches that lead to equal or better EDP, which penalises performance loss more severely and is thus expected to favour the OoO core.

To determine whether a migration is beneficial, a threshold is used. The threshold is different for the two policies, as EDP values delay, or in other words performance, more than the purely energy constrained approach. This is quantified by calculating the percentage of the workload for which a migration is beneficial:

\[
\text{Capacity} = \frac{\sum_{i=1}^{N} \left[ \frac{IPC_{mig.}}{IPC_{main}} \geq \theta \right]}{N} \tag{3.7}
\]

The [...] are Iverson brackets (Graham et al., 1989): [P] is defined to be 1 if P is true, and 0 if it is false. In the above formula only the migrations that are better than the main run are accumulated, either by performing as well with less energy or by outperforming it for at most the same consumption. θ is the threshold value for a given policy.
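The two metrics can be computed directly from the per-sample IPC values; the following sketch shows the aggregation of Equations 3.6 and 3.7, with the IPC lists being illustrative inputs rather than measured data.

    from math import sqrt

    def relative_performance(ipc_mig, ipc_main):
        """Equation 3.6: arithmetic mean of the per-sample IPC ratios."""
        ratios = [m / b for m, b in zip(ipc_mig, ipc_main)]
        return sum(ratios) / len(ratios)

    def capacity(ipc_mig, ipc_main, theta):
        """Equation 3.7: fraction of samples whose IPC ratio meets the threshold."""
        hits = sum(1 for m, b in zip(ipc_mig, ipc_main) if m / b >= theta)
        return hits / len(ipc_mig)

    ipc_mig = [1.2, 0.9, 1.6, 0.8]   # hypothetical IPC of the migrated runs
    ipc_main = [1.0, 1.0, 1.0, 1.0]  # hypothetical IPC of the main run

    print(relative_performance(ipc_mig, ipc_main))      # 1.125
    print(capacity(ipc_mig, ipc_main, theta=sqrt(3)))   # EDP policy, InO to OoO
    print(capacity(ipc_mig, ipc_main, theta=1.0))       # no trade-off, OoO to InO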

For both experiments the methodology is repeated for different forking periods ranging from fine to coarse granularity (1k, 10k, 100k, 1M and 10M instruction periods). As such, the finest granularity would spawn an impractically large number of forks to cover the same region of execution. Instead, to reduce the data size while still maintaining valid results, samples are randomly chosen using a uniform distribution for each simulation and are limited to 10,000 samples.

3.4 Experimental Setup

To implement the proposed methodology, the gem5 simulator (Binkert et al., 2011) is used and the forking mechanism, described by Sandberg et al. (2015), is modified to be able to fork on instruction counts and switch the cores. The experimental setup uses a heterogeneous system with models resembling a big OoO and a little InO core, both operating at the same frequency. The simulated system uses the same caches for both the OoO and the InO cores so that a fair comparison between private and shared caches is drawn. The simulator is restored from a previous checkpoint that bypasses the boot-up and, for each benchmark, is allowed to run until no warm-up transient effects occur. A detailed description of the system configuration is shown in Table 3.2.

The sharing architecture simplifications and benchmarks have been selected to represent the best possible case for fine-grain migrations. While most sharing schemes are not feasible with today’s implementation techniques, the hypothetical set-ups aim to answer whether it makes sense to pursue fine-grain migrations in HMP systems, regardless of current physical limitations.

3.4.1 Infrastructure

To implement the above methodology and set the basis for all future research, a simulation environment was needed to conduct all of the experiments. The reason for deciding to use a simulator over real hardware is that a simulator can conduct experiments on multiple setups with relative configuration ease. Furthermore, with a simulator it is easy to probe parts of the system which are impossible to access in hardware, while values extracted in simulation are noise-free.

OoO Core             Value
Pipeline Width       3-wide (integer) / 2-wide (fl. point)
Pipeline Depth       15
ROB Size             60
Branch Predictor     Two-Level
Operating Frequency  1 GHz

In-Order Core        Value
Pipeline Width       2-wide (integer) / 1 fl. point unit
Pipeline Depth       8
Branch Predictor     Conditional + Indirect
Operating Frequency  1 GHz

Shared Resources     Value
L1D Size             32kB
L1I Size             32kB
L1 Associativity     2-way, 4-5 cycle
L2 Size              1MB
L2 Associativity     16-way, 14 cycle
L2 Prefetcher        Stride

Table 3.2: The specifications of the system used in the experiments in this chapter.

On the other hand, simulators are often not a perfect representation of the simulated hardware. This happens because simulating hardware with perfect accuracy is usually much slower than running on the native device. Hence, various simulators exist, ranging from the very fast but inaccurate (Sanchez and Kozyrakis, 2013) all the way to ones that have sacrificed speed for accuracy (Binkert et al., 2011; Sandberg et al., 2015).

For all the experiments conducted, gem5 (Binkert et al., 2011) is used, which is a cycle-accurate full-system simulator that has widespread use. The reasons gem5 is preferable to other simulators are manifold. Firstly, the simulator enables cycle-level simulation, which is necessary for the experimentation proposed. Secondly, gem5 has the ability to fork the simulator and create an identical simulation instance that can then be modified. This is vital for state space exploration as it allows running parallel scenarios of execution on the exact same workload instance, offering the ability to perform direct comparisons.

The simulations run a minimal flavor of Linux and are evaluated with most of the MiBench suite, a set of micro-benchmarks that features various small workloads ranging from encryption and security to telecommunications and network algorithms that run for approximately 10 billion instructions (Guthaus et al., 2001). These micro-benchmarks have very different characteristics and demonstrate high variation in branch, memory, and integer ALU operations. At the same time, the used datasets can fit in caches most of the time, which should work further in favour of the sharing architectures. The small workload size favours the forking experimentation, providing a best-case assessment for systems with shared memory. Furthermore, while the workloads are relatively small in size, they are representative mobile workloads. The benchmarks are presented below, while some are omitted due to incompatibility with the simulated setup.

Automotive and Industrial Benchmarks

The first benchmarks presented are classified as industrial benchmarks. They are workloads meant for use in embedded control systems.

• basicmath: performs simple mathematical calculations that do not have specialised hardware support. For instance, angle conversions from degrees to radians and vice versa, or calculating integer square roots.

• bitcount: tests the bit manipulation abilities of a processor using an array of integers. To achieve better coverage it uses five distinct methods of bit manipulation.

• qsort: is a popular sorting algorithm operating on a large data set of words.

• susan: is an image recognition package, used for detecting corners and edges.

Network

Network benchmarks are common workloads found in embedded processors for networking. The popular implementations are shortest path calculations, table look-ups etc.

• dijkstra: is a famous shortest path algorithm of complexity O(N²). The benchmark creates a large graph in an adjacency matrix, which is then used to calculate shortest paths by repeatedly looping Dijkstra's algorithm.

• patricia: is a sparse leaf trie data structure benchmark, commonly used in routing tables. The input of this benchmark is a list of Internet Protocol (IP) traffic from a busy web server over the span of 2 hours.

Security

This category of benchmarks focuses primarily on encryption, decryption and hashing. As security is becoming an ever increasing necessity in computer systems, MiBench includes several popular algorithms used in embedded processors.

• rijndael: is the algorithm behind the Advanced Encryption Standard. It uses a block cipher of 128-, 192- or 256-bit keys and blocks.

Consumer

Consumer device benchmarks are based on workloads popular in modern devices like scanners, cameras etc. The primary focus in this category are multimedia applications such as image and music encoding/decoding and basic filtering.

• jpeg: is a micro-benchmark performing the encoding and decoding of the respective popular image standard. It is selected as a representative example of image compression and decompression that is commonly used in embedded applications.

Office

This category contains algorithms for office applications like text manipulation, for printers, fax machines and word processors.

• stringsearch: searches for given words in phrases using a case insensitive comparison algorithm.

Telecommunications

Telecommunication applications are crucial in modern embedded systems, especially in the ever-growing mobile market. Hence, this category focuses on mobile device applications such as voice encoding/decoding, frequency analysis and checksumming data.

• FFT: performs a fast Fourier transform and its inverse on an array of data. This algorithm has been a cornerstone of digital signal processing and is used to find the inherent frequencies in an input signal.

• adpcm: stands for adaptive differential pulse code modulation and is a variation of the popular pulse code modulation used to digitally represent sampled analog signals.

3.5 Migration Cost

The first set of experiments compares the execution of OoO and InO cores that migrate to identical core designs against runs that proceed uninterrupted. This experiment is designed to measure the migration overhead for the system under each of the memory sharing schemes described above and, from that, deduce the cost associated with each memory component.

(a) InO performance with no sharing.
(b) OoO performance with no sharing.

Figure 3.10: Migration cost with no sharing.

3.5.1 Scenario A: No Sharing

In Figures 3.10a and 3.10b the performance of the migrating runs is presented, normalised to the main simulation's performance for every sample. This effectively allows for a fair comparison between a system performing a migration with private caches and TLBs and one that does not migrate at all. The performance drop at fine granularity suggests that, with private caches, fast switching is not feasible as the overheads dominate. This represents fairly accurately conventional architectures used today, and shows why migrations occur at granularities of millions of instructions, even without considering the considerable software overhead.

Across almost all workloads the behaviour is similar, showing significant performance degradation especially when performing migrations at a high frequency. One exception to this trend is bitcounts, which does not degrade dramatically, especially at migration periods larger than 1k instructions. This behaviour is mostly attributed to the fact that the specific workload stresses the arithmetic operations and not the memory system or the branch predictor. Comparing the OoO to the InO core, the performance degradation is similar, both on a per workload assessment and consequently also on average.

(a) InO performance sharing the caches.
(b) OoO performance sharing the caches.

Figure 3.11: Migration cost with cache sharing.

3.5.2 Scenario B: Share Caches

When sharing all the caches but not the TLB, the performance of the migratory systems visibly increases. One thing to note in Figures 3.11a and 3.11b is that the performance improvement of the InO core is slightly larger than that of the OoO one, particularly at the 1k migration period. The deeper pipeline and larger misspeculation penalties seem to be causing a greater negative effect on the OoO core. Smaller sized workloads such as dijkstra show the most improvement, as their datasets fit in the caches, mitigating much of the migration cost.

Overall, the performance does not justify this complex design, especially for fine-grain switching. Part of the performance loss is due to misspeculation caused by “cold” structures that do not retain state, i.e. the branch predictor. More interestingly though, this is also due to the fact that, even though the caches are shared, the mandatory TLB misses cause the system to potentially access main memory multiple times, which adds large penalties.

(a) InO performance sharing caches and the TLB.
(b) OoO performance sharing caches and the TLB.

Figure 3.12: Migration cost when sharing caches and TLB.

3.5.3 Scenario C: Share Caches and TLB

When the system shares all memory components (caches and TLB) another interesting result is revealed in Figure 3.12. While the InO core eliminates almost all of the overhead even at the finest migration frequency, the OoO design still exhibits significant performance losses. This exposes the inability of the OoO core to achieve its maximum potential, as it relies on accurate speculation to achieve high IPC. When the OoO mispredicts it is heavily penalised as the deeper pipeline gets flushed. This result provides novel insight, showing that previous best case studies that report feasible 1k instruction period migrations use an oversimplified system that does not accurately factor in the slowdown caused by the internal core components.

Studies from Section 2.2.1 by Fallin et al. (2014) and Lukefahr et al. (2012) that claim that fine-grained migrations are feasible are disproved by the results in Figure 3.12. The above results show that accurately factoring in migration costs yields slowdown which cannot be considered negligible. Furthermore, sharing the L1 caches is not pragmatic, as the additional delay would hinder the system performance severely. This happens because every instruction, which unavoidably accesses the IL1, adds extra cycles to the execution. This effect is exacerbated in the InO core, which is unable to mask the memory access latency through the re-ordering of instructions.

(a) InO performance sharing the L2 and the TLB.
(b) OoO performance sharing the L2 and the TLB.

Figure 3.13: Migration cost when sharing L2 and TLB.

3.5.4 Scenario D: Share the L2 Cache and TLB

A more realistic design is shown in Figure 3.13. When sharing the L2 cache and the TLB but keeping the L1 caches private to each core, the results of the InO core show similar behaviour to the previous sharing scheme. When compared to Scenario C, this design shows a decrease in performance for the InO core for migrations with periods smaller than or equal to 10k instructions, but similar behaviour at larger periods. The OoO core exhibits almost identical behaviour except for super fine grain migrations, where the performance degradation is less than 10% more.

This behaviour is due to the TLB sharing, which helps to avoid accessing main memory to fetch translation data, taking care of the worst case when no translations are stored in the caches. When migrating to a new core with a private TLB, the first translation is guaranteed to miss and trigger a walk. Some of the partial translations will be stored in the shared caches, but it is highly likely that the upper levels will miss and access Dynamic Random-Access Memory (DRAM). This causes significantly more slowdown (see Table 3.1) than an L1 miss, especially for the OoO core. As mentioned earlier, while the TLB is also a latency sensitive component, optimisations like a shared second level TLB as well as translation prefetching are shown to be practical in Bhattacharjee and Martonosi (2009a), Bhattacharjee and Martonosi (2010) and Bhattacharjee et al. (2011).

(a) Performance degradation for the InO core.
(b) Performance degradation for the OoO core.

Figure 3.14: The average performance degradation for the two core types as a function of migration frequency.

3.5.5 Comparison of Memory Sharing Schemes

In Figure 3.14 the performance gap between the proposed level of sharing and the full sharing scheme can be seen, which is 17 percentage points for the OoO and 30 percentage points for the InO core. However, the design complexity of routing both cores to all the caches and the TLB is prohibitive. More importantly though, when migrating with a 100k instruction or greater period, sharing the L2 cache and at least some of the translations performs as well as full sharing without the added complexity. For HMP systems pairing OoO and InO cores, it makes more sense to have a simpler design sharing just the L2 and a TLB. Sharing the L1 caches adds more complexity, but does not allow the system to migrate any faster than every 100k instructions, as the OoO overheads diminish any benefits beyond that point. To push beyond this point the speculation needs to be common between cores, which is not feasible in any current hardware.

Appendix A includes figures from a similar study using Android and the Octane benchmark suite (Google Developers, 2013). Unlike MiBench, Octane features workloads with a larger memory footprint, which helps show the effects of memory sharing. The results show similar trends to the micro-benchmarks presented above, further reinforcing the point that migrations as fine as 1000 instructions are not feasible due to prohibitive design complexity.

3.6 Efficient Migrations

As mentioned in Section 3.3.3, the critical question for HMP systems is whether migrating from one core to another yields any benefits. Based on Equation 3.7, it is possible to find how much of each workload can be migrated to the opposite core, using the two policies mentioned in Section 3.3.3 and adjusting the threshold value θ accordingly. This analysis assumes the InO core is 3x more energy efficient, an estimate in line with current HMP systems (Lukefahr et al., 2012). The results are presented in Figures 3.15 and 3.16, showing only the two extreme cases of sharing.

When using the no trade-off policy, a system on the InO core will migrate only when executing the same phase on the OoO core costs equal or less energy. For this to happen the OoO core must perform 3x better than the InO one, so the threshold is adjusted to θ = 3. In Figure 3.15a irrespective of how much hardware the cores share, the cases where this is beneficial are less than 1% on average and at most 10% of the total execution, for specific workloads.

When considering the switch from OoO to InO with the no trade-off policy, the samples are considered beneficial when the InO core performs as well as or better than the OoO core (θ = 1). This results in the same overall performance for only a third of the power. This case can happen when the OoO core cannot exploit any ILP or MLP and is reduced to InO performance. Figure 3.15b shows that, similarly to the previous result, the beneficial cases are at most 1% on average and 6% of the total execution for a few benchmarks.


(a) Beneficial migrations from the InO to the OoO core with the No trade-off policy.


(b) Beneficial migrations from the OoO to the InO core with the No trade-off policy.

Figure 3.15: Percentage of beneficial migrations when being energy constrained.

While these cases are few, the no trade-off policy reveals that the benefits can mostly be exploited at migration frequencies of 10k and 100k instructions. At a coarser granularity the phases average out, while migrating every 1k instructions yields no benefits as the overheads mask micro-phases.

For equal or better EDP (Figure 3.16), the formula values performance more, as $EDP = Power \times Delay^2$. The number of beneficial switches when migrating from the InO to the OoO core for this policy is calculated using a threshold value of $\theta = \sqrt{3}$. The results show that the potential for switching while maintaining equal or better EDP in this case ranges between 48% and 66% when migrating every 100k, 1M and 10M instructions. When performing the opposite switch with the EDP policy ($\theta = 1/\sqrt{3}$), the results show that, when aiming for the best EDP, the OoO core is more suitable, as the beneficial cases on average over all the workloads do not exceed 31% of the execution.
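The threshold values follow directly from the 3x efficiency assumption; a short derivation, shown here as a sketch under that assumption for the InO-to-OoO switch:

\[
EDP_{OoO} \leq EDP_{InO}
\;\Longleftrightarrow\;
P_{OoO} \, D_{OoO}^{2} \leq P_{InO} \, D_{InO}^{2}
\;\Longleftrightarrow\;
3 P_{InO} \, D_{OoO}^{2} \leq P_{InO} \, D_{InO}^{2}
\;\Longleftrightarrow\;
\frac{D_{InO}}{D_{OoO}} \geq \sqrt{3}
\]

Since both runs execute the same number of instructions, $D_{InO}/D_{OoO} = IPC_{OoO}/IPC_{InO}$, which gives $\theta = \sqrt{3}$; the symmetric argument for the OoO-to-InO switch, where the InO core draws a third of the power, gives $\theta = 1/\sqrt{3}$.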

For both EDP experiments, migration frequencies above 100k instructions are less desirable, as the gains are significantly smaller.


(a) Beneficial migrations from the InO to the OoO core with equal or better EDP policy.


(b) Beneficial migrations from the OoO to the InO core with equal or better EDP policy.

Figure 3.16: Percentage of beneficial migrations when being EDP constrained.

Additionally, in most cases sharing the caches and the TLB is not necessary, as both sharing schemes show similar potential for useful migrations. Appendix A shows a similar study using Octane.

3.7 Discussion

In this chapter, a brief analysis of the differences between hardware and simulation oriented experimentation approaches is presented. Through this, it is shown that for heterogeneous exploration simulations are preferable, as HMP architectures are still not well understood and alternate designs need to be researched. Furthermore, simulations are more efficient at performing a state space exploration analysis when using mechanisms that replicate the simulator.

Furthermore a novel simulation methodology is presented, focusing on its ability to dissect certain effects that are sensitive to noise and indeterminacy, like the migratory effects for instance. Unlike hardware experimentation, the methodology is flexible enough to allow experiments for various HMP topologies with different degrees of resource sharing.

Having established a mechanism to assess and compare HMP results, it is possible to analyse system behaviour when migrating and identify the memory components that provide benefits when shared. In Chapter 1 it was shown that the reasoning behind heterogeneous systems stems from the premise that each workload can be executed by a more suitable, designated core. However, transferring costs can potentially mask any positive effect, so it is crucial to minimise them as much as possible.

Many studies have claimed that, when taking a closer look at execution, programs can be broken down into small parts of code that share similar characteristics, called phases (Lukefahr et al., 2012, 2014; Padmanabha and Lukefahr, 2013; Forbes et al., 2016). In fact, Lukefahr et al. (2016) and Fallin et al. (2014) claim that phases can be as small as a few hundred instructions and that, additionally, they can be exploited with a suitable HMP architecture.

Contrary to these academic results, experiments in this chapter show strong evidence that full sharing and potential merging of OoO and InO cores is not the most promising way to design future systems. Instead, an alternate memory hierarchy for HMPs is presented that tries to simplify the sharing scheme between cores. The system only shares the L2 cache and the TLB which can be a practical design as these components do not add prohibitive wiring complexity to the design and can be designed to mask critical latency. This memory hierarchy is part of the contributions of this thesis on HMP system optimisation.

Taking advantage of the capabilities of the methodology, the proposed sharing scheme was shown to perform as well as or better than the other examined sharing setups, for migrations as fine as 100k instructions. At finer migrations, while sharing all the resources yields slightly better results, the overheads mask any potential benefits and render the pursuit of such designs pointless.

The versatility of the novel methodology is demonstrated in the second set of experiments. These perform switches to different core designs and reveal that most benefits exist at the 100k-instruction migration period, as coarser grain switches do not detect the phases, while finer ones are overpowered by the overheads. The results show that even when sharing the whole memory subsystem the benefits under oracle scheduling are limited, yielding at most 7% performance and 4% energy savings. When considering acceptable trade-offs for the migration policy, an EDP policy shows that the OoO core is overall preferable to execute on. The in-depth look at the results reveals an asymmetry between the OoO and the InO cores. The latter shows high sensitivity to the information in the caches, while the former has significant losses at fine grain migrations despite sharing all the memory and the translations.

Chapter 4

Branch Predictor State Transfer for Heterogeneous Cores

As mentioned in Chapter 2, prior work investigates ways to mitigate the hardware overheads that occur when migrating between cores, prime examples being Lukefahr et al. (2012), Fallin et al. (2014) and Rotenberg et al. (2013). These proposals fuse caches, translation buffers, branch predictors and even the pipelines in order to eliminate the cost of switching between OoO and InO cores. In principle, low overhead switching and accurate scheduling in fused cores can enable applications to use the OoO core during phases that can deliver high performance, and swiftly switch to the InO core for low performance phases. This is claimed to save as much as 36% energy for 1% performance loss compared to a system running constantly on the high performance OoO core, effectively combining InO efficiency with OoO performance (Fallin et al., 2014; Lukefahr et al., 2012).

However, in practice, front-end fusing is complex and adds delay to latency sensitive components, like the L1 caches and the branch predictor. Furthermore, InO and OoO front-ends cater to different operation points. Hence, a shared front-end cannot satisfy both OoO performance and InO efficiency constraints. In the previous chapter, the analysis focused on streamlining the process of migrating between heterogeneous cores by identifying the memory components that can be shared without adding latency and prohibitive complexity to the system. One of the main observations is that, when sharing the caches and the TLB, the InO cores can recover almost all of the performance that is lost when migrating.

For OoO cores on the other hand, sharing the caches and the TLB, while improving the overall performance, still causes significant slowdown in the system. In Figure 4.1 three systems are compared that exemplify this point: one that executes on an OoO core uninterrupted, one that shares caches and TLBs during frequent migrations and one that additionally retains the branch predictor state. The execution excerpt indicates that, at higher frequencies of migrations, misspeculation that occurs after a migration is performed has a significant impact on performance, as OoO cores rely on branch prediction to expose instruction-level parallelism and execute further ahead to mask memory latency.

[Plot: Performance when Retaining Branch Predictor State — Instructions Per Cycle vs. Instruction Count (in billions), for No Migration, Migration w/o BP and Migration w/ BP.]

Figure 4.1: Execution excerpt of two identical OoO systems migrating every 1k instructions. One retains all internal components when migrating, the other loses the branch predictor state. Depending on the amount of instruction-level parallelism, performance can drop by as much as 35%. Results generated using gem5 (Binkert et al., 2011) running qsort from MiBench (Guthaus et al., 2001).

Another observation that can be made is that when retaining branch predictor state, the performance drop of a migratory system is practically eliminated. This indicates that other internal components, such as the pipeline, do not notably affect the system.

The focus of this chapter is the efficient retention of branch predictor state through migrations. The added latency overhead caused by sharing the branch predictor is prohibitive, much like for the L1 caches described in Chapter 3. The reason is that predictions need to happen without any delay early in the pipeline to avoid any execution stalls or “bubbles”. To share components, extra wires and multiplexers are usually needed, which increase capacitance and add extra cycles to the execution respectively.

Furthermore, the branch predictor size and accuracy requirements are different for large OoO and small InO cores. In order to meet those requirements perfectly, a tailored design is used for each core type. In most cases the disparity between the OoO and InO core Branch Predictors (BPs) is so great that one common design cannot cater to the needs of both core types. In fact, the microarchitectural implementations are usually so different that sharing common state is not feasible.

Instead, transferring compatible branch predictor state between cores could be feasible with a microarchitectural redesign. This solution does not add overhead during execution, as the state is transferred in parallel with other migration operations, but it can notably decrease “cold” mispredictions, ultimately improving overall performance. However, the redesign requires an in-depth understanding of the underlying designs in order to ensure state compatibility.

4.1 Branch Predictor Overview

Before introducing the mechanism that transfers branch predictor state between heterogeneous cores, a small introduction is presented of the fundamentals necessary for the comprehension of the design. As the field of branch prediction has evolved dramatically over the past three decades, this chapter focuses only on the designs that are commonly used today. Outdated designs that use only static predictions, such as Evers et al. (1996), Kampe et al. (2002) and Chang et al. (1996), are significantly less accurate and rarely used today. The initial dynamic branch predictor designs (Lee et al., 1984; Mcfarling, 1993; Chih-Chieh Lee et al., 2002; Smith, 2003; Yeh and Patt, 2003, 2004b,a; Young and Smith, 2004) introduced the concepts of local and global history and the use of combinations of subpredictors, but ultimately also deliver worse accuracy than modern designs, as they fail to extract long history patterns. The predictors assessed in this thesis are the bimodal, the TAGE and the perceptron, which are the designs that are predominantly used today (Mittal, 2019).

All predictors today are assessed based on a variety of metrics which aim at increasing system performance while adhering to necessary constraints. The most important are:

• Accuracy is perhaps the most important requirement for branch predictors. Designs use Mispredictions Per Kilo Instructions (MPKI) as a metric of the accuracy of each design. For OoO cores the misprediction penalty is very high and consequently their predictors need to be extremely accurate. Current state-of-the-art predictors of Jiménez (2014a) and Seznec (2014) experience as few as three to four MPKI on average.

• Latency constraints are important, as the BP lies on the critical path and needs to provide a prediction almost as soon as the branch address is known. Usually the latency is 1 cycle; however, complex designs with slow wires operating at a high clock frequency can make the delay larger (Seznec, 2005; Jiménez and Lin, 2002; Jimenez et al., 2000). To address high latency, some designs also provide less accurate preliminary predictions called micro- and nano-predictions (Loh, 2006; Jimenez, 2003).

• Area can be a limiting factor for BPs, as the storage needs of sophisticated designs can be in the order of tens of kilobytes (e.g., 32-64KB), as seen in the 5th Championship Branch Prediction (2016). This is the case for highly accurate predictors that require multiple different structures, large enough to avoid aliasing in their entries.

• Energy needs to also be factored in, as the BP is accessed almost every cycle. The circuitry is usually optimised for low latency at the expense of high energy consumption. Energy efficiency is therefore also important, especially for low power designs.

• Warmup is also an important consideration, as large predictors may take longer to reach their maximum accuracy compared to smaller predictors with less state to store. While in general this is not a significant consideration today, it can become a significant overhead in the presence of frequent context-switching and security exploits that require the state to be flushed. This is investigated in detail in Chapter 5.


Figure 4.2: The architecture of a bimodal predictor

4.1.1 The Bimodal Predictor

The bimodal predictor is one of the simplest designs used today. The idea behind it is to predict branches that exhibit consistently taken or not taken sequences. The design features a one-dimensional table of N entries, each entry containing a 2-bit saturating counter (Figure 4.2). The logic behind this design is simple; the table containing the predictions is addressed using $\log_2 N$ bits of the Program Counter (PC). The first bits can be omitted when the ISA has a fixed instruction length, as they do not contain much useful information. When a prediction is made, a decision is taken based on the most significant bit of the 2-bit value stored in the entry. If that bit is 1 the branch is predicted taken, while if it is 0 the branch is predicted not taken. When updating the bimodal predictor, depending on the actual outcome of the branch, a logical 1 or 0 is shifted from the Least Significant Bit (LSB) into the counter. A state diagram of the prediction and update state machines is shown in Figure 4.3.

Analysing the bimodal on the important metrics mentioned above shows that, as the design relies on a very short history specific to each branch, it accurately captures sequences that constantly have the same outcome. However, it fails to identify alternating taken/not-taken sequences, as it lacks a mechanism that identifies such patterns. Furthermore, the lack of a global history does not allow outcomes of other branches to be factored into the prediction, while implementations also often suffer from entry aliasing when PC bits from different branches point to the same entry in the table.

In terms of latency, the simplicity of the design makes it quite attractive. The single table structure guarantees that all predictions occur in a fixed amount of time and can be implemented so that each prediction returns within a single cycle. This single cycle access can be very important, as it prevents “bubbles” in the core's pipeline. The relatively small number of bits per entry also means that, in terms of area and energy consumption, the design can be considered efficient, especially when compared to more complex structures. One technique to further reduce the size of the predictor is to make the LSBs, commonly called the hysteresis bits, shared between

multiple entries. This has been shown in Loh et al. (2003) to make the predictor more dense without sacrificing virtually any performance.

Figure 4.3: The states of a bimodal predictor (11 and 10 predict taken; 01 and 00 predict not taken).
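The prediction and update logic can be summarised with a short sketch; the table size and the number of dropped alignment bits are illustrative, and the update follows the saturating counter transitions of Figure 4.3 rather than a literal bit shift:

    class BimodalPredictor:
        """Toy bimodal predictor: a table of 2-bit saturating counters
        indexed by the low-order bits of the PC (illustrative parameters)."""

        def __init__(self, entries=4096, inst_shift=2):
            self.entries = entries        # number of table entries (power of two)
            self.inst_shift = inst_shift  # drop alignment bits of a fixed-length ISA
            self.table = [2] * entries    # initialise to weakly taken (binary 10)

        def _index(self, pc):
            return (pc >> self.inst_shift) & (self.entries - 1)

        def predict(self, pc):
            # The most significant bit of the counter gives the prediction.
            return (self.table[self._index(pc)] >> 1) == 1

        def update(self, pc, taken):
            # Saturating increment/decrement towards the observed outcome.
            i = self._index(pc)
            if taken:
                self.table[i] = min(self.table[i] + 1, 3)
            else:
                self.table[i] = max(self.table[i] - 1, 0)

    bp = BimodalPredictor()
    for outcome in (True, True, False, True):  # hypothetical outcomes of one branch
        print(bp.predict(0x4000), outcome)
        bp.update(0x4000, outcome)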

4.1.2 The TAGE Predictor

The TAGE predictor is based on prior work using geometric history lengths in order to address multiple tables (Seznec, 2005). This design uses multiple tables that are indexed using a hash of the PC and part of the global history (H). The history length used for indexing each table is different and follows a geometric series. The lengths are calculated using:

L(i) = \lceil a^{i-1} \times L(1) + 0.5 \rceil

where L is the length of the global history, i is the table identifier and a is the geometric factor. For predictions, TAGE performs a look-up in all the tables and selects the prediction with the longest history from all the available valid predictions. If no prediction is available the predictor falls back to a base predictor, a simple bimodal predictor with hysteresis. Similar to the bimodal, each table entry stores a counter of usually two or three bits. Apart from the prediction counter, TAGE tables have two types of metadata that help increase the accuracy (Seznec and Michaud, 2006).
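As a concrete illustration of the geometric series, the short sketch below computes per-table history lengths; the values chosen for L(1) and a are arbitrary examples and not the configuration of the evaluated predictor.

```python
import math

def history_lengths(num_tables, l1=2.0, a=2.5):
    """Geometric history lengths L(i) = ceil(a**(i-1) * L(1) + 0.5).
    The parameters l1 and a here are example values, not the tuned design."""
    return [math.ceil(a ** (i - 1) * l1 + 0.5) for i in range(1, num_tables + 1)]

print(history_lengths(4))   # -> [3, 6, 13, 32] with the example parameters
```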

The first feature is a 2-bit “useful” saturating counter that improves the accuracy by overriding the standard policy. When a prediction is selected and it turns out to be correct, the useful counter is incremented, while when the prediction was incorrect the useful counter is decremented. When the counter reaches zero the prediction is considered low confidence and is overridden by an alternate one. Another feature of the useful counter is that periodically all entries are reset to zero, to reduce the impact of stale entries (Seznec and Michaud, 2006).
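The useful-counter policy described above can be sketched as follows; the saturation bound, the reset interval and the dictionary-based entry representation are illustrative assumptions.

```python
USEFUL_MAX = 3          # 2-bit saturating counter
RESET_PERIOD = 256_000  # illustrative reset interval, in retired branches

def update_useful(entry, selected_correct):
    # Increment the useful counter when the selected prediction was correct,
    # decrement it otherwise, saturating within [0, USEFUL_MAX].
    if selected_correct:
        entry["useful"] = min(entry["useful"] + 1, USEFUL_MAX)
    else:
        entry["useful"] = max(entry["useful"] - 1, 0)
    # An entry whose counter has reached zero is treated as low confidence and
    # its prediction is overridden by the alternate prediction.

def maybe_reset_useful(tables, branches_retired):
    # Periodically clear every useful counter to age out stale entries.
    if branches_retired % RESET_PERIOD == 0:
        for table in tables:
            for entry in table:
                entry["useful"] = 0
```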

The second feature that makes TAGE more accurate is the fact that it uses tags to cross-validate that the entries correspond to the same branches, effectively mitigating aliasing effects. The tags are stored alongside each entry and are usually a hash of the PC and the global history.

Figure 4.4: A depiction of a TAGE branch predictor. Tables use history length (H) for indexing and tag matching to ensure valid predictions.

When there is a tag match between the entry and the current branch PC hash, the entry is considered a valid prediction, otherwise it is ignored. The design is shown in Figure 4.4. The latest design, TAGE-SC-L, also features a statistical corrector to improve on biased cases that TAGE consistently mispredicts and a specialised loop predictor (Seznec, 2014).

The statistical corrector is a metapredictor that inverts the prediction of the main TAGE predictor in cases where the branches have a statistical bias but are not correlated with global history. While TAGE is generally very accurate, for such branches it can often perform worse than simple PC-based predictors like the bimodal. The corrector uses tables which are indexed with the same global history lengths as the TAGE tables and the output prediction of the TAGE predictor. For each prediction, the corresponding entries are added and, based on the sign of the sum, the branch is considered taken if positive and not taken if negative. If the statistical corrector prediction disagrees with the output of the TAGE predictor and the magnitude is greater than a threshold, then the prediction is reverted (Seznec, 2011).
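A minimal sketch of the corrector’s decision rule is given below, assuming the relevant corrector counters have already been read out; the threshold value and all names are illustrative.

```python
def statistical_corrector(tage_prediction_taken, corrector_counters, threshold=4):
    """Sum the signed corrector counters; revert the TAGE prediction when the
    sum disagrees with it and its magnitude exceeds the threshold.
    Parameter names and the threshold are illustrative assumptions."""
    total = sum(corrector_counters)
    sc_taken = total >= 0                     # positive sum -> taken
    if sc_taken != tage_prediction_taken and abs(total) > threshold:
        return sc_taken                       # corrector overrides TAGE
    return tage_prediction_taken
```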

The loop predictor is a very small additional component that tracks past loops that have occurred during execution and stores their number of iterations. When the PC that corresponds to a stored loop is encountered in the future, the component counts the active iterations to accurately predict when the loop will end (Seznec, 2011).

The TAGE predictor is considered one of the most accurate designs, ranking the highest in the 5th Championship Branch Prediction (2016). To achieve this accuracy, however, the design requires significantly more storage budget to accommodate all the tables. The total size of the original predictor is 8kB or 64kB, with a corresponding misprediction rate of 4.991 and 3.986 MPKI respectively.

The statistical corrector can be quite small, roughly 12% of the total budget, as it rarely needs to intervene. The tables are usually interleaved to reduce the area of the component. The loop predictor delivers marginal improvements to the accuracy. For this reason, very limited budget, in the order of tens of bytes, is dedicated to this component (Seznec, 2014).

Figure 4.5: The evolution of perceptron designs: (a) the bimodal predictor, (b) the original perceptron, (c) the multiperspective perceptron. There is a clear path from the bimodal that only uses the PC, to the modern multiperspective perceptron that uses multiple features to deliver predictions.

In terms of prediction latency, the design is significantly slower than the bimodal as it requires a hashing step before the table accesses, followed by tag matches, a comparison for the longest valid prediction and the latency of the statistical corrector. As this cannot be executed in a single cycle, preliminary predictions are extracted, first by the base bimodal, then potentially altered by the TAGE output and finally by the statistical corrector.

4.1.3 Perceptron Predictors

The Perceptron predictor can be considered an evolution from the simple PC based predictors like the bimodal and gshare (Tarjan and Skadron, 2005). The original design from Jimenez and Lin (2001) is loosely based on neural network theory and uses the concept of a Perceptron neuron to deliver the prediction. This features a single classic neuron that calculates a prediction based on the sum of all the products between the history bits and signed weights. The sign of the sum represents the outcome of the branch while the magnitude is the confidence in the prediction. As shown in Figure 4.5b, the weights are stored in a PC indexed array, similar in structure to the bimodal. This makes the prediction correlate both with the global history and the PC.

The main issue with the original perceptron design is that it has very high prediction latency, as it requires many calculations in sequence. After the PC is used to index the correct weights, the history bits need to be multiplied with the corresponding weights and the resulting products added to deliver the outcome of the branch. Additionally, the design requires a lot of storage budget and area to be able to store all the weights (Loh, 2006; Jimenez, 2003).

A modified version is the multiperspective perceptron by Jiménez (2014a). This design simplifies the prediction process while at the same time increasing the accuracy of the predictor. The idea, visually presented in Figure 4.5c, is to combine the weights with the history by creating multiple tables that are indexed based on a hash of the PC and the global history. As each table uses a different part of the global history, these indexing schemes are called features. The tables effectively behave similarly to TAGE, which predicts based on geometric histories. One other advantage of this design is that, as the feature tables are decoupled from each other, the entries can be multiplexed to cover more branches. The prediction uses fewer calculations:

\text{Prediction} = \sum_{i=1}^{N} W_i(f_i)

where W_i are the feature tables with the weights, and f_i is the value of a specific feature.
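The prediction step can be sketched as below, assuming each feature has already been mapped to an index into its own weight table; all names are placeholders.

```python
def perceptron_predict(feature_tables, feature_indices):
    """Sum one signed weight per feature table; the sign gives the direction,
    the magnitude the confidence. Inputs are illustrative placeholders."""
    total = sum(table[idx] for table, idx in zip(feature_tables, feature_indices))
    taken = total >= 0
    confidence = abs(total)
    return taken, confidence
```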

Another optimisation that improves the predictor is the use of alternate features, different from the PC and the global history. In Jiménez (2014a) many different features are used to index into the weights, the most prominent being:

• Global history keeps track of the outcomes of all the branches that have been executed in the past. The history depth can be quite long, as each branch is represented by only 1 bit.

• Path keeps a record of the sequence of recent branch addresses. To save space, only some bits from the addresses are used, while the information is compressed using hashing techniques. The depth, shift and mixing style are all parameters that are tuned based on the needs of the implementation.

• Recency is a fixed depth stack that keeps track of addresses that have recently been used. The value that is returned is a hash of all of the addresses in the stack. Unlike path, recency evicts addresses from the stack using a least recently used policy.

• Local history keeps track of the outcomes per branch. These local histories are stored in a table indexed by a hash of the branch PC. When a prediction is needed the hash is used to index the corresponding feature table.

• Blurry path tracks large granularity windows of history. Instead of hashing per address as path does, blurry path hashes per region of addresses, tracking more prominent changes in behaviour.

Figure 4.6: The design of the multiperspective perceptron predictor.

• Loop prediction is a feature similar to the component that exists in TAGE, called the innermost loop iterator (IMLI). The IMLI can identify loops by tracking previous continuous repetitions of past backwards branches.

With these higher level features, the generalised version of the multiperspective perceptron, displayed in Figure 4.6, can use multiple different features to better identify program behaviour and more accurately predict the outcome of branches. For simplicity, the remainder of the text will refer to this multiperspective perceptron design simply as perceptron.

4.2 Branch Predictor State Transfer Mechanism

As shown in Figure 4.1, for high frequency migrations to enable potential performance/energy improvements in heterogeneous cores, it is necessary that the branch predictor maintains a high accuracy after each switch. Additionally, improvement in the accuracy must not be negated by extra transfer overheads introduced in the migration process.

Sharing the branch predictor is not feasible for two reasons. First, the latency that is added to a common predictor due to multiplexing eliminates any benefits gained by keeping shared state between migrations. Second, InO and OoO cores have been designed to operate at different performance/power points. OoO cores trade power and area for high performance, while InO cores remain small and energy efficient. As such their front ends, and in particular their branch predictors, are designed differently. The OoO core needs a large and accurate branch predictor to support deep speculation that unlocks instruction-level parallelism, while the InO core uses much smaller designs since it is less reliant on speculation. As a consequence, it is impossible to share the same design without under-provisioning the OoO core’s accuracy or notably increasing the area and power consumption of the InO core for little added performance.

To achieve the benefits of sharing without its drawbacks, a mechanism is presented that transfers some state between the two predictors when a migration is triggered, eliminating the complexity stemming from component sharing. Furthermore, as modern branch predictors combine predictions from multiple components, the design identifies similar components between the branch predictors of the heterogeneous cores. When a migration is triggered, the state of the common components is transferred from the source to the destination core.

Figure 4.7: The transfer mechanism between cores.

From past studies it is observed that both the TAGE and perceptron predictors perform accurately when they are large enough to store sufficient state, as large as 64kB for high performance cores (Seznec, 2014; Jiménez, 2014a). In very constrained designs with small branch predictors (around 1-3 kB), perceptrons have been shown to be viable in modern InO cores (Arm Ltd., 2017b).

Transferring state from one core to the other during a migration must be efficient for the benefits to outweigh the costs. TAGE predictors (Figure 4.4) have a fallback predictor which is small enough to transfer and can deliver predictions independently of the rest of the design. This can be used in a heterogeneous system that uses a small, fully transferable branch predictor for the InO core and an identical predictor as the TAGE fallback for the OoO core. When a migration occurs, the state of the identical components can be copied before the destination core resumes execution, negating any accuracy drop. This is shown in Figures 4.7 and 4.2.

Using a multiperspective perceptron as the high accuracy predictor for the OoO core makes transferring the state over to a smaller design significantly harder. This happens because for every prediction the weights from each feature need to be added to provide a valid prediction. Removing the extra features that exist only in the large perceptron would lead to imbalanced summations and therefore less accurate predictions. Removing entries across all features to reduce the size suffers from the same problem, as each feature entry is used for multiple predictions through all the possible combinations with the other features. Removing entries across all tables not only reduces the number of predictions stored dramatically, it also reduces the accuracy as unavoidably some combinations will be incomplete. One benefit of perceptrons that can be useful in the case of heterogeneous state transfer is that the predictions are densely stored and can achieve higher accuracy than conventional small sized predictors. For this reason, it is worth investigating ways to translate predictor state between small InO cores that use perceptrons and large OoO cores that rely on TAGE.

4.3 Impact of the Branch Predictor State

To measure the impact of transferring the branch predictor state when a heterogeneous system migrates, two studies are performed. The first one focuses on measuring how much a system is impacted when it executes without any branch predictor state. This shows the potential improvement when a heterogeneous system reduces the mispredictions that occur after a migration to a new core. The second study performs an in-depth analysis of the accuracy of TAGE predictors and devises a transfer mechanism that reduces the accuracy hit caused by a migration, without adding noticeable transfer overheads.

4.3.1 Calculating Potential Performance Gains

To measure the performance improvement potential of transferring predictor state in heterogeneous systems, a direct comparison is necessary between two systems that migrate with and without keeping any branch predictor context. This can show the disparity between the worst case, where the predictor is cold, and a system with a warm predictor. This is particularly timely in today’s systems as they often perform power gating to preserve energy (Esmaeilzadeh et al., 2012) or juggle multiple applications that can pollute the branch predictor before the same application runs again on the same core. More importantly though, recent side-channel attacks shown by Kocher et al. (2018) target the branch predictor and require mitigation techniques such as tagging, partitioning or even flushing the predictor to ensure safe operation, which lowers the actual performance.

The methodology used for this experiment is identical to the one described in Chapter 3, using a gem5 simulation (Binkert et al., 2011) of an Arm system. The execution is sampled at regular intervals to simulate the effects for a particular migration frequency, similar to the experiments in Chapter 3. In brief, for each sample the simulator is cloned twice, effectively running three parallel executions:

1. The main one continues to run without any change or interruption

2. The first clone flushes the pipeline and the branch predictor to simulate a system which has just migrated a task to another (identical) core without any branch predictor state.

3. The second clone flushes only the pipeline to simulate a system which has just migrated a task to another (identical) core while transferring the complete branch predictor state.

Upon cloning the simulation state, the original simulation continues, periodically records the performance of the system and creates new clones until it reaches the end of execution. A cloned simulation performs a migration that executes for one period and then terminates. The migration is approximated by clearing all the private and non-transferable state in an otherwise identical system.
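A high-level sketch of this sampling loop is shown below; the simulator interface (fork, flush and IPC queries) is purely illustrative and does not correspond to the actual gem5 API.

```python
def sample_migrations(simulator, period, total_instructions):
    # Illustrative sketch of the clone-and-compare methodology; all methods on
    # `simulator` are placeholders, not gem5 calls.
    samples = []
    while simulator.instructions_retired() < total_instructions:
        cold_bp = simulator.fork()          # migrated core, no branch predictor state
        cold_bp.flush_pipeline()
        cold_bp.flush_branch_predictor()

        warm_bp = simulator.fork()          # migrated core, full branch predictor state
        warm_bp.flush_pipeline()

        for run in (simulator, cold_bp, warm_bp):
            run.run_for(period)             # execute one migration period each

        samples.append((simulator.ipc(), cold_bp.ipc(), warm_bp.ipc()))
    return samples
```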

The data collected from the clones of the simulator are compared to phase-aligned data from the main run, creating synchronous samples. This allows for a direct comparison of the performance, as shown in Figure 4.1. For each sample taken, the IPC from both simulations of each phase is compared to determine the relative performance IPC_main / IPC_migration. The experiment is performed both for OoO and InO cores, comparing the two systems with and without the branch predictor state. All systems retain the caches and the TLB. Additionally, the experiment is repeated for different migration periods ranging from 1k instructions to 10M instructions. Details of the simulated system are shown in Table 4.1.

OoO Core: Pipeline Width 3-wide, Pipeline Depth 15, Operating Frequency 1 GHz
In-Order Core: Pipeline Width 2-wide, Pipeline Depth 8, Operating Frequency 1 GHz
Shared Resources: L1D Size 32kB, L1I Size 32kB, L2 Size 1MB, L2 Associativity 16-way, L2 Prefetcher Stride

Table 4.1: The specifications of the system used in the experiments.

Figure 4.8 shows that while the InO core loses at most 12% of its performance when not migrating the branch predictor state, the OoO core loses as much as 38% of its performance when performing migrations as frequent as every one thousand instructions. This is because the InO core is 2-wide and cannot reorder instructions to mask memory latency, and is therefore heavily affected by cache misses. On the other hand, the wider OoO core is able to speculate and uncover instruction-level parallelism to mask misses. However, to do so, it needs accurate speculation to break dependency chains and feed the pipeline. Effectively, the more aggressive the core, the larger the need for accurate branch prediction. These results make evident that while the InO core can cope with migrations every 1k instructions, the performance lost by the OoO core prohibits such systems if the branch predictor state is not “warm”.

When transferring the branch predictor state, the OoO core practically eliminates the performance degradation as it recovers 35% of the lost performance. Compared to a system that does not migrate, the overhead of migration is less than 3%. This remaining overhead is due to the pipeline and reorder buffer starting empty after a migration. When comparing migrations between identical InO cores that transfer the branch predictor state, the improvement is virtually negligible as less than 1% of performance is reclaimed, resulting in a 7% performance loss compared to an InO core with no migrations. Comparing InO and OoO, migrations become more efficient for the OoO core when transferring the branch predictor state, even for migrations with a 1k instruction period.

Figure 4.8: Performance cost when migrating without branch predictor state. The bars are normalised to the performance of a system that executes without a migration. (a) InO performance breakdown when migrating without transferring the BP state (coloured bars) and with the BP state (white bars). (b) OoO performance breakdown when migrating without transferring the BP state (coloured bars) and with the BP state (white bars).

4.3.2 Measuring Accuracy Improvements

To mitigate the negative effects of migrations, it is necessary to reduce the number of mispredictions that happen when transferring execution to a new core. As shown above, InO cores do not experience significant slowdown from branch misprediction, unlike the high performance OoO cores. The focus therefore is primarily on increasing the accuracy of the OoO predictor after a migration, without affecting the long-term performance of either core.

For this reason, TAGE-SC-L from Seznec (2014) is evaluated, which is considered one of the most accurate designs. A modified version of the 5th Championship Branch Prediction (2016) framework is used to allow per component flushing, to assess in detail the accuracy of a system that would transfer partial state. The predictor is flushed across the range of 1k to 100M instructions to emulate the behaviour of a system that migrates at those frequencies, in accordance with the experiments in Chapter 3. The framework uses over 250 traces from mobile and server applications, both long and short, to cover a wide variety of cases.

Figure 4.9: The misprediction rate for TAGE across the range of migration periods evaluated. Legend shows the instructions per migration.

Figure 4.9 shows the contribution of each TAGE component to the mispredictions measured after a migration. Note that the Y-axis shows only the increase in MPKI over a fully warm TAGE predictor. Results show that retaining none of the components results in 4.3x the nominal MPKI. Keeping the state of all the TAGE tables delivers close to maximum accuracy but requires 55kB of data to be transferred, making fine grain migrations impractical. The bimodal and statistical corrector comparably reduce MPKI, by 32% and 44% respectively for 1k migrations, for example. However, the statistical corrector requires 7.3kB of state to be transferred to achieve this, as opposed to 1.25kB required by the bimodal. Overall, the bimodal is the component with the highest MPKI reduction per kB.
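Using the figures quoted above, the benefit per transferred kilobyte can be compared directly; the percentages and sizes come from the text and the rest is simple arithmetic.

```python
# MPKI reduction per kilobyte of transferred state, for 1k-instruction
# migrations, using the figures quoted in the text above.
components = {
    "bimodal":               {"mpki_reduction_pct": 32, "size_kb": 1.25},
    "statistical corrector": {"mpki_reduction_pct": 44, "size_kb": 7.3},
}

for name, c in components.items():
    print(f"{name}: {c['mpki_reduction_pct'] / c['size_kb']:.1f}% MPKI reduction per kB")
# bimodal: 25.6% per kB, statistical corrector: 6.0% per kB
```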

To measure the impact of the transferred state on branch predictor accuracy, three experiments are conducted. The first uses a large, accurate TAGE predictor for the OoO core and a smaller bimodal predictor for the InO core. The OoO branch predictor uses a 64kB TAGE, while the bimodal is 1.25kB and perfectly matches the fallback predictor in the OoO TAGE.

When migrating to the InO core, the fallback bimodal state of the OoO is transferred entirely. This means that the InO core always resumes with a fully warm predictor, so no performance degradation is expected. To measure the opposite transfer, the trace simulation flushes all the TAGE components except the bimodal, emulating a system that inherits the state from an InO core. This design is referenced as TAGE-B in the analysis below.

Name | Size | Details
bimodal | 1.25kB | 8kbit counters, 2kbit hysteresis
perceptron | 1.25kB / 3kB | 8 features: GHIST (2), GHISTPATH (3), IMLI, GHISTMODPATH, RECENCYPOS
TAGE | 64kB | Loop Predictor, Statistical Corrector, 1.25kB bimodal base predictor
TAGE-P | 64kB / 66kB | Loop Predictor 156B, Statistical Corrector 8kB, 1.25kB / 3kB perceptron base predictor

Table 4.2: The evaluated branch predictors.

Figure 4.10: Misprediction of TAGE variants that transfer the base predictor compared to a TAGE that does not transfer the state upon migration. Benefits of transferring become increasingly noticeable with more frequent migrations.

For the second experiment, the concept is that the TAGE fallback bimodal can be replaced by another predictor with greater accuracy, especially since it will be the sole prediction provider in the InO core. Taking this into consideration, the base bimodal is swapped with a similar sized but more accurate perceptron predictor. The experiments assess two designs for the OoO core that use this hybrid (TAGE-P), one that replaces the 1.25kB bimodal with an equally sized, very constrained perceptron and one with a more conventional small perceptron of 3kB instead. The total size of these two variants is 64kB and 66kB respectively. Both cases assume that the small perceptron in the InO core perfectly matches the fallback perceptron in the TAGE design in terms of size, to avoid reducing the accuracy of the predictions. Details of the predictors used are shown in Table 4.2.

In Figure 4.10 the experiment flushes the entire 64kB TAGE predictor across different migration periods and compares it with TAGE-B, which retains the bimodal, and TAGE-P, which retains a base perceptron predictor (1.25kB and 3kB variants). Even when preserving minimal state, the accuracy of the TAGE predictor is dramatically improved, especially for fine grain migrations where the MPKI reduction is in the best case 62%, using the TAGE predictor with the 3kB base perceptron (TAGE-P 66kB).

Figure 4.11: CPI stacks show the breakdown of the overheads with and without branch predictor state. When factoring in the transfer cost, using a fast SRAM cache, a 3kB perceptron can be transferred without prohibitive overheads.

4.3.3 Calculating the transfer overheads

From the above, it is evident that migrations cause branch mispredictions that lead to significant performance loss. Additionally, the trace analysis showed that with minimal state the majority of the accuracy can be recovered, practically eliminating the performance drop in today’s systems. However, even when pushing minimal branch predictor state between cores, it is necessary to take into consideration potential added overheads that stem from the transfer of data. For this reason, a theoretical calculation is used which can provide insight into how much a heterogeneous system will be affected by the transfer.

Figure 4.11 shows the total cost of a migration for an OoO core, taking into account the transfer of state, as a CPI stack. This shows that when the BP state is retained, the CPI overheads are reduced by 0.43. Assuming a 256-bit bus for the transfer, the CPI increase for the 1kB and 3kB base predictors is 0.04 and 0.12 respectively for 1k migrations. This means that even taking into account the transfer of 3kB of predictor state, the overall CPI is reduced by 0.31. Even in this worst case scenario, where the transfer cost is appended to the migration delay, the system performance improves overall. At coarser migration frequencies the transfer effects are negligible, achieving almost the same CPI as a system with no migrations at a 10k instruction migration period. One potential optimisation that can further improve this result is that the transfer can be designed to happen in parallel with the rest of the migration operation.
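The added CPI can be estimated to first order from the amount of state and the bus width, as in the sketch below; the assumption of one bus beat per cycle with no overlap is illustrative, so the result only approximately matches the figures quoted above.

```python
def transfer_cpi(state_bytes, instructions_per_migration, bus_bits=256):
    """First-order estimate: one 256-bit bus beat per cycle, no overlap with
    execution. These assumptions are illustrative, not taken from the thesis."""
    cycles = (state_bytes * 8) / bus_bits
    return cycles / instructions_per_migration

print(transfer_cpi(1.25 * 1024, 1000))   # ~0.04 added CPI for the 1.25kB base predictor
print(transfer_cpi(3 * 1024, 1000))      # ~0.10 added CPI for the 3kB perceptron
```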

4.4 Summary

Heterogeneous multiprocessors traditionally feature OoO and InO cores, enabling operation at high performance and high energy efficiency respectively. By leveraging migrations, such systems can improve efficiency, executing on the InO core for memory bound phases and switching to the OoO core when it can deliver high performance. Studies show that high frequency core switching could exploit smaller execution phases and save as much as 36% energy without sacrificing performance.

However, these savings cannot currently be achieved due to the overhead of migrations. Sharing caches and TLBs reduces the overhead, but the cold branch predictor also causes a notable performance drop. The branch predictor cannot be shared as this adds prohibitive latency during execution, while InO and OoO cores have non-compatible designs. Instead, transferring a subset of the predictor between cores reduces mispredictions. The experimental results show that, even for migrations as frequent as every 1k instructions, this can reduce mispredictions by 62% on average and help recover as much as 35% of the performance lost otherwise. Even when considering the overheads of transferring partial state, the effects are beneficial, reducing CPI by as much as 31% compared to a system that migrates to a core without any predictor state.

Chapter 5

Security Considerations for Shared Microarchitecture

In Chapters 3 and 4, system level and microarchitectural improvements were described that aimed at streamlining migrations between cores. All the improvements presented are based on the principle that sharing components or transferring state between cores can eliminate some of the warm-up overheads and consequently improve performance. The previous analysis focused on the feasibility of such designs and presented practical architectures that can be implemented in hardware to enable high frequency migrations with minimal realistic overheads.

However, when considering any architectural optimisation, apart from a performance and feasibility analysis it is also important to take into account whether the alterations to a system cause any unexpected adverse behaviours. From this perspective, recent studies have shown that shared structures create dangerous security exploits that can leak sensitive information (Kocher et al., 2018; Lipp et al., 2018; Liu et al., 2015).

Components such as shared caches have been shown to be able to leak information or be manipulated to create covert communication channels between applications with the use of side-channel attacks (Liu et al., 2015; Lipp et al., 2018). More recently, studies found that aggressive branch prediction in modern processors can permit side-channel attacks that exploit the microarchitectural design to manipulate the branch predictor into fetching and leaking sensitive data (Kocher et al., 2018; Evtyushkin et al., 2018).

While this chapter focuses mainly on the security implications of optimisation techniques for heterogeneous multiprocessors, many of these exploits can be generalised to threads that share components within a single core. For this reason, the analysis is extended to include Context Switches (CS) even within a single core, as this can also be a critical enabler with similar characteristics.

The mitigation of security exploits is not a trivial issue as narrow, “hot” fixes for side-channel attacks often do not guarantee that systems are safe from newer variations. Furthermore, keeping components like the caches or the branch predictor separate is costly in terms of area. Crude approaches such as flushing the shared structures at potential vulnerability points, while not increasing the area, negate the benefits of sharing, creating a notable drop in performance while also increasing power consumption as the system constantly expends energy to scrub itself clean. This is depicted in Figure 5.1. Addressing vulnerabilities in software after discovery is often not possible and usually even more costly in terms of performance. For this reason, a more universal approach is desirable that guarantees isolation between contexts for a reasonable performance and energy consumption compromise.

Figure 5.1: Branch predictors do not deliver their full accuracy when flushed frequently in order to ensure secure operation.

The above observations motivate the design of future systems that take into account one additional design constraint, that of being able to securely operate without allowing information to leak. Extensive work has been done to secure the caches and block branch predictor attacks albeit with a notable increase in area (Yan et al., 2018; Khasawneh et al., 2018).

To prevent the aforementioned branch predictor side-channels, a novel microarchitectural component is described in this chapter that guarantees branch predictor state isolation efficiently, without degrading performance. In detail, the contributions are:

• The introduction of a new notion, that of transient prediction accuracy, critical for future side-channel free branch predictors. Transient accuracy is juxtaposed with steady-state accuracy, which is what is used to evaluate designs today.

• An analysis that shows that current state-of-the-art branch predictors perform notably worse in transient state when flushed for security. The TAGE predictor (Seznec and Michaud, 2006) is used to show that large components do not contribute to the prediction accuracy under these new circumstances.

• To solve this, the branch retention buffer, a novel mechanism that reduces cold-start effects by preserving partial branch predictor state per context, is described. Compared to the state-of-the-art, this design achieves the same high accuracy at steady-state, improved transient accuracy, and ensures branch predictor state isolation without increasing the overall area.

5.1 Speculative side-channels

Side-channel attacks in architecture have emerged in recent years as a legitimate source of concern for modern processors. Due to the large variety of side-channel attacks, their severe potential to compromise systems and the specialised nature that is needed to address each case separately, the remainder of this chapter focuses on a subset of branch predictor attacks that manipulate the state of the component.

5.1.1 Branch predictor side-channels

Branch predictors are a critical part of modern core design, drastically increasing processor performance. Recent studies have shown that side-channel attacks leaking sensitive data are possible in most contemporary CPU designs (Kim et al., 2014; Kocher et al., 2018; Evtyushkin et al., 2018; Lipp et al., 2018). These exploits take advantage of hardware oversights at design time, leaving the system vulnerable to code that can cause information to leak outside its defined scope. Mitigation techniques are far from trivial, as current solutions that protect against branch predictor side-channels notably degrade predictor accuracy and reduce overall system performance, while academic proposals are too costly to implement.

Spectre (Kocher et al., 2018) class attacks target branch predictor components in a variety of ways. The focus here is particularly on one specific case (variant 2) that exploits conditional branch misprediction to allow malicious code (a gadget) to be executed. That code uses flush and reload type timing attacks on the caches to leak information (Yarom and Falkner, 2014; Liu et al., 2015). For this attack to be plausible, some knowledge of the micro-architectural behaviour is needed to influence the predictor to misspeculate and execute the gadget. With that, an attacker can poison the predictor entries and guarantee the necessary misprediction. The attack can also be triggered in cases when the branch mispredicts without altering the branch predictor state, although this is significantly harder to orchestrate.

Similar to how Spectre uses the branch predictor to leak information, BranchScope by Evtyushkin et al. (2018) uses a mechanism similar to Spectre variant 2, targeting the Branch Target Buffer (BTB) to influence the Pattern History Table (PHT) of the directional branch predictor. In this attack scheme, the predictor is primed so that it is in a predefined state the attacker can control. From this starting point, the target code is triggered to execute the victim code, which will observably change the PHT state. A simple probe can then calculate the branch flow based on the primed branch predictor conditions. The technique has been shown to successfully leak information from a secure SGX enclave in Intel processors (Evtyushkin et al., 2018).

The potential scenarios to exploit such attacks are numerous. Any transition from one process to another or a process to/from the kernel can be a potential point of vulnerability, as any context switch or system call can be used to let malicious software “hijack” the branch predictor and leak information. For heterogeneous systems that share branch predictor state this is also a serious reason for concern that should be investigated.

5.1.2 Mitigation techniques

On one hand, software and firmware mitigation techniques are often hard to implement and induce high overheads. A recent study from Prout et al. (2018) measures the actual performance loss for Spectre and Meltdown vulnerabilities and finds that it ranges from 15% to 90%. While mitigation measures for known exploits have been deployed, potential unknown exploits can still find ways to exploit shared components such as the caches, the branch prediction logic, and the translation tables, as the fixes are topical and do not guarantee security. Furthermore, some vulnerabilities (i.e. Spectre v2, BranchScope) cannot be addressed in software or firmware and require hardware redesign.

On the other hand, hardware isolation is a sufficient measure to ensure no leaks occur, but often comes at the expense of performance. Cache side-channels are extremely difficult to address without significant performance loss or without sacrificing a lot of area. Creating shadow structures to keep cache and TLB speculative state, proposed in (Khasawneh et al., 2018; Yan et al., 2018), isolates the information in the caches and protects against Spectre and Meltdown type attacks. The size of the shadow structures can be very costly, especially in systems with much larger caches and TLBs than the ones evaluated in that study. Similarly, the aim is to isolate the branch predictor state to achieve similar security properties for some of the exploits without increasing the area or degrading the performance.

For branch prediction, clearing the branch predictor of any state on each context switch can take care of attacks that manipulate or eavesdrop on control flow to access data stored in the caches. However, flushing the entire state results in a significant accuracy drop with every context switch. Other alternatives such as hard partitioning can negatively impact the steady-state accuracy of the predictor, as the effective size per context is reduced. Tagging the branch predictor entries is also not an acceptable solution, because in the worst case it behaves similarly to flushing when most of the entries have been replaced by another context. More importantly though, while using tags eliminates branch predictor entry poisoning, it still allows observation of active entries, which can be flushed to cause deliberate mispredicts and data leaks.

A more secure branch predictor design that uses isolation, can train efficiently, and quickly reaches its steady-state performance will be useful in a post-Meltdown and post-Spectre world (Lipp et al., 2018). Taking into account all of the above, current and future branch predictors need to be able to protect against potential side channels, without significant performance overhead when context switching frequently.

5.1.3 Threat model

This work is based on a threat model that assumes a victim and an attacker application. The attacker in this model attempts to infer victim information without having the authority to access it directly. The model is formulated explicitly from the statements below:

• Both the victim and the attacker reside in the same core (but different threads) and share the same branch predictor, as described in Kocher et al. (2018) and Evtyushkin et al. (2018).

• Slowdown of the victim execution can be measured precisely enough to detect the behaviour of a single branch. This has been proven to be possible in recent studies (Kocher et al., 2018; Evtyushkin et al., 2018; Allan et al., 2016).

• The attacker can force the victim code to execute, allowing vulnerable code to be targeted.

• The attacker has the ability to poison branch predictor entries used by the victim application, forcing a misprediction.

5.2 Predictor Flexibility

Branch predictors have been evaluated based on how accurate their predictions are, given a reasonable amount of history to learn from. However, given the recent findings described in the previous section, in many real world cases they operate in a time-frame much shorter than the ideal, predicting from only a partially warm or cold state. This happens because, in order to fully guarantee isolation of branch predictor state, the predictor needs to be flushed between context switches, causing a notable accuracy drop.

A novel design is described that reduces such overheads by identifying the components that contribute the most to accuracy during the warm-up phase and creating a mechanism to retain their state per context – effectively replicating them. The rest of the predictor, which only contributes to long-term accuracy, is instead flushed, preventing any state leakage. This design addresses the overall security-performance trade-off by guaranteeing isolation while also increasing accuracy for frequent switches.

Figure 5.2: The mechanism which stores the essential state of the predictor. Most of the state is disregarded and only the most useful state is kept to increase transient accuracy.

Large predictors that store more information are usually able to deliver better predictions. However, if the state is lost or invalidated before the predictor has time to warm up, then effectively it will not reach peak performance. In this case, smaller predictors might be able to deliver equivalent accuracy for a fraction of the state.

5.2.1 Steady-state and transient accuracy

As the branch predictor is expected to flush its state to guarantee isolation between contexts, it is important to distinguish between steady-state and transient accuracy. The term steady-state accuracy refers to the performance of the branch predictor when it is fully warm and has reached its highest accuracy. Conversely, transient accuracy describes the behaviour during the warm-up phase.

To quantify transient accuracy, the branch predictor state is flushed across different branch instruction periods and the change in average MPKI is tracked. As shown in Figure 5.1, depending on how frequently the state is flushed, the actual (transient) accuracy can be significantly worse than the nominal, steady-state accuracy normally evaluated against.
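This measurement can be expressed as the simple loop below, assuming a trace-driven predictor model with a flush method; all interfaces are placeholders for the modified CBP framework.

```python
def transient_mpki(predictor, branch_trace, flush_period, total_instructions):
    # Flush the predictor every `flush_period` branch instructions and report
    # the resulting mispredictions per kilo instruction. The predictor and
    # trace interfaces are illustrative placeholders.
    mispredictions = 0
    for i, (pc, taken) in enumerate(branch_trace):
        if i > 0 and i % flush_period == 0:
            predictor.flush()               # emulate a secure context switch
        if predictor.predict(pc) != taken:
            mispredictions += 1
        predictor.update(pc, taken)
    return 1000.0 * mispredictions / total_instructions
```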

5.2.2 The Branch Retention Buffer

From the predictor analysis in Chapter 4, retaining even a small amount of state per context could improve the transient accuracy of the predictor, without hurting the maximum accuracy achieved during long uninterrupted execution. To guarantee isolation and simultaneously improve transient accuracy, a novel mechanism is presented (Figure 5.2), where some state of the branch predictor is retained per context. The mechanism uses a dedicated component called the branch retention buffer (BRB) that replaces part of the predictor.

Figure 5.3: A diagram of the BRB. The retained state is stored in separate Static Random-Access Memory (SRAM) banks which correspond to different contexts.

The branch retention buffer is tightly coupled with the branch predictor and keeps multiple separate entries. Ideally, storing the entire state would enable high accuracy without any warm-up time. However, the size of designs today is too large to fully store without incurring notable overheads. To reduce the additional storage costs, the component is made as small as possible. The design uses 10kbit (1.25kB) sized entries, as they match the size of the base predictor for both the 8kB and 64kB variants. As such, the entries are designed to store entire components that can deliver a standalone prediction.

In short, the low overhead mechanism retains partial state that delivers better accuracy than a large “cold” predictor, with no steady-state accuracy penalty.

The BRB is designed to allow for easy switching between entries without moving or copying data and thus with no added latency during execution. This is done by keeping multiple entries in separate SRAM banks. When a switch occurs, the entry corresponding to the correct context is selected and used instead of swapping out data from the BRB. The selection uses the Address Space ID and triggers only when a context switch occurs. A small Content Addressable Memory (CAM) is used to map to the correct BRB entry ID, shown in Figure 5.3. As a context switch occurs, any delay in the rerouting of BRB entries is masked by the larger overhead of storing the context state, which is conducted in parallel.

After the newly selected entry is activated, the others are put into retention to save energy. While operating under the same context, the active BRB entry is directly accessed for predictions and does not go through the CAM. When the BRB is full and a new context requests an entry, the least recently used entry is evicted. Previous studies show that 3 entries are enough to handle 2 communicating processes and operating system implications (Rubio and Vijaykrishnan, 2007) without negatively impacting system performance. If more contexts are active and are consistently being switched out, more entries can be introduced. The added entries increase the area of the BRB component linearly, while timing remains the same irrespective of the number of entries, for realistic sizes.
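The bookkeeping for entry selection can be sketched as a small lookup keyed by the Address Space ID with least-recently-used eviction; the entry count and interface names below are illustrative.

```python
from collections import OrderedDict

class BranchRetentionBuffer:
    """Illustrative model of the BRB bookkeeping: one retained base-predictor
    state per context, selected by ASID, with LRU eviction when full."""
    def __init__(self, num_entries=3, make_empty_state=dict):
        self.num_entries = num_entries
        self.make_empty_state = make_empty_state
        self.entries = OrderedDict()          # ASID -> retained base-predictor state

    def on_context_switch(self, asid):
        if asid in self.entries:
            self.entries.move_to_end(asid)    # mark as most recently used
        else:
            if len(self.entries) >= self.num_entries:
                self.entries.popitem(last=False)   # evict least recently used entry
            self.entries[asid] = self.make_empty_state()
        return self.entries[asid]             # becomes the active entry; others retained
```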

Figure 5.4: Contribution to MPKI for each TAGE64 component that is retained at 20k branch instructions. The bimodal reduces the MPKI by 16% while the statistical corrector by 10% despite it being 8x larger.

Retaining the reduced state makes isolation of processes efficient and, as a consequence, improves security. Keeping that reduced state separate per process (or between untrusted parts of the same process, such as browser and script engine), while emptying the rest of the structures, ensures that no state is ever shared between mutually untrusted processes and the kernel. The performance hit is softened in this case, as the preserved state increases the transient state accuracy.

The obvious question, in this case, is what method is used to reduce the data in the most impactful way. This question is not trivial, as it is directly tied to the branch predictor implementation and the amount of state that can be stored efficiently when taking into account the overhead constraints.

One way to preserve state is to select certain components that provide a good balance between the amount of data stored and the accuracy achieved, and discard the state of the remaining components. This “vertical cut” method can be used in the case of TAGE, as it is comprised of various components that can provide accurate standalone predictions (Figure 5.4).

For instance, the base predictor can be isolated from the rest of the components and still provide reasonable accuracy. Similarly, separate TAGE tables can be preserved instead of the entire design, to target certain history lengths.

Figure 5.5: The ParTAGE predictor combines benefits from both perceptron and TAGE predictors. It also allows partial state to be preserved to improve transient accuracy.

Other components, however, provide only complementary benefit to the predictions (Jiménez, 2014a). Therefore, it does not make sense to consider preserving their state by themselves. The loop predictor, for instance, is a relatively small component that identifies regular loops with a fixed number of iterations and needs few instructions to warm up. Its overall effect on the accuracy of TAGE-SC-L is measured to be around a 0.3% improvement (Seznec, 2014).

5.2.3 Perceptron amplified / reinforced TAGE

As a practical example using the BRB, a hybrid predictor is also devised named ParTAGE – Perceptron amplified / reinforced TAGE, shown in Figure 5.5. The branch retention buffer in this case stores a small multiperspective perceptron predictor, similar to the one in Chapter4, to replace the bimodal predictor.

Each entry in the BRB stores an independent small perceptron predictor enabling contexts to keep some reduced state separate. The rest of the branch predictor design can be cleared to eliminate the possibility of side-channel leaks targeting the branch predictor.

Two versions of ParTAGE are assessed: one that hardwires the Perceptron predictor to always be chosen during the transient operation of the predictor (based on the context switching frequency), and one that assesses the confidence of the prediction of the base perceptron predictor before selecting the outcome.

Name | Size | Details
BiM | 64kB | 2-bit counters
BiMH | 1.25kB | 8kbit counters, 2kbit hysteresis
BiMH | 625B | 4kbit counters, 1kbit hysteresis
Perceptron | 64kB | 37 feature tables
Perceptron | 8kB | 16 feature tables
TAGE | 64kB / 8kB | Loop Predictor, Statistical Corrector, BiMH base predictor
ParTAGE | 64kB / 8kB | Loop Predictor, Statistical Corrector, 1.25kB 8-table perceptron base predictor

Table 5.1: The evaluated branch predictors.

5.3 Experimental Setup

The experiments conducted use the CBP5 framework with the traces from the 5th JILP Workshop on Computer Architecture Competitions (JWAC-5) (2016) as a base. The framework uses 268 traces, ranging from 100 million to 1 billion instructions, for both mobile and server workloads.

The modified CBP framework from Chapter 4 is used to perform full or partial flushes of the branch predictor state. This enables temporal studies of predictor behaviour, revealing the effects of full or partial loss of state.

In the experiments, a variety of prominent predictors are assessed for completeness. As a baseline design, a set of bimodal type predictors are analysed, with and without hysteresis. To compare more contemporary predictors, the TAGE-SC-L (Seznec, 2014) and multiperspective perceptron without TAGE (Jiménez, 2014a) from CBP5 are selected. Both designs are lightly modified so that they can flush their state, while being careful not to affect their exact steady-state behaviour.

Furthermore, two variants of ParTAGE are evaluated, that express different policies for the selection of the best transient prediction. ParTAGE-S overrides the prediction of the TAGE tables below a period threshold which is set at 200k branch instructions. ParTAGE uses an integrated confidence value that is assessed by the rest of the TAGE design in order to indicate the most accurate prediction.
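The two selection policies can be summarised as in the sketch below; the 200k branch instruction threshold comes from the text, while the confidence comparison is a simplified stand-in for the integrated TAGE confidence logic.

```python
OVERRIDE_THRESHOLD = 200_000   # branch instructions, threshold quoted in the text

def partage_s_select(perceptron_pred, tage_pred, branches_since_flush):
    # ParTAGE-S: always use the retained perceptron while the TAGE tables are cold.
    if branches_since_flush < OVERRIDE_THRESHOLD:
        return perceptron_pred
    return tage_pred

def partage_select(perceptron_pred, perceptron_conf, tage_pred, tage_conf):
    # ParTAGE: pick whichever prediction is deemed more confident
    # (simplified stand-in for the integrated confidence mechanism).
    return perceptron_pred if perceptron_conf >= tage_conf else tage_pred
```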

The focus is on 8kB and 64kB predictors, similar to the ones that are evaluated at CBP. A detailed list of all the evaluated predictors is shown in Table 5.1. The transient accuracy of the evaluated predictors is measured for the cases outlined above. To achieve this, the experiments are repeated for a range of flushing periods from 10 to 60M branch instructions per flush, extracting the optimal design for each use case.

Applications | Time (s) | Context Switches (CS) | CS/s
adobereader | 73 | 459,699 | 6,279
gmail | 23 | 228,924 | 9,833
googleslides | 108 | 343,788 | 3,185
youtube | 36 | 418,254 | 11,487
yt playback | 21 | 224,631 | 10,698
angrybirds rio | 81 | 406,711 | 5,044
camera@60fps | 60 | 1,958,557 | 32,643
geekbench | 60 | 129,533 | 2,159

Table 5.2: The frequency of context switches on a Google Pixel phone.

MPKI by branch instruction flushing period (instructions)
Name | 100 | 200 | 2k | 20k | 200k | 2M | 20M | 40M | 60M
BIM 64kB | 22.22 | 19.52 | 14.12 | 12.13 | 11.40 | 11.16 | 11.14 | 11.15 | 11.15
BIM 1kB | 22.26 | 19.58 | 14.40 | 12.87 | 12.59 | 12.52 | 12.51 | 12.51 | 12.51
BIMH 40kB | 19.33 | 17.23 | 13.40 | 12.04 | 11.56 | 11.41 | 11.39 | 11.39 | 11.39
BIMH 10kB | 19.34 | 17.24 | 13.43 | 12.10 | 11.63 | 11.49 | 11.47 | 11.47 | 11.47
BIMH 1.25kB | 19.39 | 17.33 | 13.68 | 12.53 | 12.17 | 12.08 | 12.07 | 12.07 | 12.07
BIMH 625B | 19.45 | 17.42 | 13.93 | 12.91 | 12.62 | 12.55 | 12.54 | 12.54 | 12.54

Table 5.3: Misprediction rates of a wide range of bimodal predictors across different state flushing periods measured in branch instructions.

5.4 Results

Three different studies are conducted: simple bimodal predictors, current TAGE and perceptron designs, and the ParTAGE design. The frequency of context switches on a modern mobile device is measured (Table 5.2), and it is found that switches can happen on average as often as every 12k branch instructions, assuming that one out of five instructions is a branch instruction, which is typical of the 5th JILP Workshop on Computer Architecture Competitions (JWAC-5) (2016) workloads. This leads to an accuracy loss of 90%, which translates to a core performance loss of 15% based on the limit study shown in Figure 4.1 from the previous chapter.
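A rough back-of-the-envelope check of this figure is shown below, using the most switch-heavy workload in Table 5.2 and the one-in-five branch ratio stated above; the effective instruction rate of around 2 billion instructions per second is an assumption made purely for illustration.

```python
# Back-of-the-envelope check of the switching granularity (illustrative assumptions).
instr_per_second = 2e9          # assumed effective instruction rate
cs_per_second = 32_643          # camera@60fps from Table 5.2, the most switch-heavy case
branch_ratio = 1 / 5            # one in five instructions is a branch (stated above)

instructions_per_switch = instr_per_second / cs_per_second
branches_per_switch = instructions_per_switch * branch_ratio
print(round(branches_per_switch))   # roughly 12,000 branch instructions per context switch
```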

5.4.1 Quantifying transient accuracy

Bimodal accuracy results

The bimodal predictor is used as a simple first experiment, consisting of a single table of counter bits. Table 5.3 shows how the MPKI of different bimodal designs improves as the state retention period increases. Despite a 100x difference in size, the steady-state MPKI increase is only 12.28%. The results reveal that while size contributes to the steady-state accuracy, transient accuracy is not affected by the size of a predictor design.

Figure 5.6: Comparison of sizes and types of bimodal predictors. Size contributes to steady-state accuracy, hysteresis to transient accuracy. Note the Y-axis begins at an 11 MPKI offset.

Instead, a variation in the design such as adding hysteresis improves transient accuracy, even for smaller predictors. This happens as the hysteresis bits also affect neighbouring branches and ultimately warm up the design faster. This is clearly visible at smaller flushing periods, where the misprediction rate is on average 18% lower for the bimodal designs with hysteresis. Figure 5.6 shows the difference in accuracy for different bimodal sizes and designs.

Transient accuracy: TAGE vs perceptron

The second set of results focuses on comparing the transient behaviour of the two most prominent designs in modern systems: TAGE and perceptron. Figure 5.7 shows the different transient behaviour between TAGE and perceptron designs. From the results, it is noteworthy that the 8kB variant of perceptron has the worst cold start; however, it manages to rapidly improve and reach similar steady-state accuracy.

The transient MPKI for a flushing period of 20k branch instructions is 7.75 and 6.98 for the 8kB and 64kB TAGE designs, and 7.57 and 6.93 respectively for perceptron. Switching every 20k branch instructions is within a realistic range for applications like the ones presented in Table 5.2.

Figure 5.7: A comparison between TAGE and perceptron. While both exhibit similar steady-state accuracy, TAGE shows better transient behaviour.

This result shows that TAGE can deliver better steady-state accuracy. However, for applications that perform frequent context switching (20k branches), perceptron is marginally more accurate.

Another observation can be extracted when comparing the 64kB variants with the 8kB ones at smaller windows of uninterrupted execution. Considering, for instance, flushing every 20k or 200k branch instructions, the 8kB predictors perform on average 10% and 15% worse than the 64kB designs. However, the accuracy gap increases to 33% when observing the same designs at steady-state.

To assess the accuracy drop, the steady-state accuracy is used as a baseline and compared against the transient accuracy for flushing periods as fine as 20k branch instructions. From Figure 5.7, the MPKI increase is as much as 90% and 80% for TAGE and perceptron respectively. This reinforces our belief that predictors today are evaluated without taking into account disruptions that occur during realistic execution.

5.4.2 ParTAGE

To improve the transient accuracy of the prediction, the results of the hybrid predictor ParTAGE are presented. The studies illuminate how its components were sized and tuned. Overall results show how ParTAGE compares in terms of accuracy to the current competitive perceptron and TAGE designs.

[Figure: Small Perceptron vs Bimodal Accuracy. MPKI versus branch instructions per flush for BiMH 1kB, BiMH 625B, Perceptron 1.7kB and Perceptron 1.25kB.]

Figure 5.8: Comparing bimodal and perceptron designs below 1.25kB as base predictors. Results show that even when reduced to this size, perceptrons greatly outperform the competition.

Finding the right, “small” predictor

The TAGE breakdown analysis leads to the underlying idea for the hybrid predictor: replace components which use a significant amount of area with other, retainable components that can deliver a higher transient accuracy.

Figure 5.7 shows that perceptron predictors rapidly improve their MPKI over time before reaching steady-state accuracy. This reveals an interesting insight about Multiperspective Perceptron predictors: compared to bimodal predictors, the hashed and shared weight values in modern perceptron designs are a more effective design, able to train aggressively and store information in a denser format. This ability to densely store branch information is the main reason to replace the base predictor, as it allows more context to be preserved and maintains most of the transient accuracy. Furthermore, in Figure 5.4 the statistical corrector is shown to add little accuracy when the TAGE tables are cold, despite its large size (8kB). Another experiment can therefore be conducted that replaces the 8kB of the statistical corrector with a larger base perceptron, swapping steady-state accuracy for better transient accuracy.

[Figure: Perceptron calibration. (a) Calibrating the number of feature tables for a 1.25kB size; MPKI versus number of feature tables (6 to 11). (b) Different sizes of perceptron with a constant 8 feature tables; MPKI versus size (0.25kB to 2.5kB).]

Figure 5.9: Calibrating the small perceptron.

Figure 5.8 compares bimodal and perceptron designs that are roughly within a budget of 1.25kB. Results show that while the perceptron has higher MPKI for periods as short as 200 branch instructions, its steady-state accuracy is significantly better than that of the bimodal.
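A minimal sketch of why a small hashed perceptron can store branch context more densely than a bimodal table is shown below: several features index shared weight tables and the signed sum decides the prediction, so one weight entry contributes to many branches. The feature hashes, table sizes and threshold are illustrative assumptions, not the configuration evaluated in this chapter.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Illustrative hashed perceptron: a few small weight tables indexed by different
// hashes of the PC and global history; the signed sum decides the prediction.
class TinyPerceptron {
public:
    bool predict(uint64_t pc, uint64_t ghist) const { return sum(pc, ghist) >= 0; }

    void update(uint64_t pc, uint64_t ghist, bool taken) {
        const int s = sum(pc, ghist);
        const bool pred = s >= 0;
        if (pred != taken || std::abs(s) < kTheta) {   // train on miss or low margin
            for (std::size_t f = 0; f < kFeatures; ++f) {
                int8_t& w = tables_[f][index(f, pc, ghist)];
                if (taken && w < 63)   ++w;
                if (!taken && w > -64) --w;
            }
        }
    }

private:
    static constexpr std::size_t kFeatures = 8;    // cf. the calibration in Figure 5.9a
    static constexpr std::size_t kEntries  = 128;  // per-table entries (example value)
    static constexpr int kTheta = 20;              // training threshold (example value)

    int sum(uint64_t pc, uint64_t ghist) const {
        int s = 0;
        for (std::size_t f = 0; f < kFeatures; ++f)
            s += tables_[f][index(f, pc, ghist)];
        return s;
    }

    std::size_t index(std::size_t f, uint64_t pc, uint64_t ghist) const {
        // Each feature mixes the PC with a different slice of the history,
        // so one weight entry is shared by (and trains on) many branches.
        const uint64_t h = pc ^ (ghist >> (4 * f)) ^ (f * 0x9e3779b97f4a7c15ULL);
        return static_cast<std::size_t>(h % kEntries);
    }

    std::array<std::array<int8_t, kEntries>, kFeatures> tables_{};
};
```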

Optimizing the base predictor

A limit study on the size of each BRB entry is conducted for ParTAGE that calibrates the number of feature tables and the size of the perceptron predictor. Initially, when constraining the total size to 1.25kB, Figure 5.9a shows that eight tables deliver the most accuracy. As the total budget is shared between the tables, fidelity beyond that point decreases, reducing accuracy. Figure 5.9b shows the best configuration for each size, keeping 8 feature tables. Note that changing the feature sizes affects the prediction more than altering the total size does. The process is repeated for 3kB perceptron entries, which add a 10% area overhead. To adhere to the original budget, a design that removes the statistical corrector is assessed. To calibrate the perceptron to higher accuracy, a genetic algorithm can be used as shown in Jiménez (2014a).

[Figure: Transient MPKI with State Retention. MPKI versus branch instructions per flush, from transient to steady state, for TAGE (Flush), TAGE (B), ParTAGE, ParTAGE-S and ParTAGE-S 3kB.]

Figure 5.10: Decrease of MPKI when retaining 1.25kB and 3kB of state. Note that TAGE clears its entire state, while all other designs keep their base predictor state intact.

5.4.3 ParTAGE results

The ParTAGE predictor is based on observations that better accuracy can be achieved when replacing the 1.25kB bimodal with a small perceptron.

Comparing ParTAGE variants

The results in Figure 5.10 compare the designs featuring the BRB, namely all variants of ParTAGE and TAGE (B) (which retains the bimodal), to TAGE without the BRB. For ParTAGE, both variants, retaining 1.25kB and 3kB BRB entries, are evaluated. ParTAGE-S statically overrides the TAGE tables for a brief period after a flush is performed, forcing the perceptron prediction to be selected, while ParTAGE selects a prediction based on each component's confidence. The 1kB entry designs are overall 3% larger in area, while the 3kB variant removes the statistical corrector, maintaining the same area budget.
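The difference between the two variants comes down to the selection policy. The sketch below is a hypothetical rendering of that policy, not the evaluated implementation: the component interfaces, confidence scale, override window length and stub behaviour are all assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical component interfaces; the confidence scale, the override window
// and the stub behaviour below are illustrative assumptions.
struct ComponentPrediction {
    bool taken;
    int  confidence;   // arbitrary 0-7 scale
    bool valid;        // false while the providing tables are still cold
};

ComponentPrediction tage_predict(uint64_t pc)       { return {(pc & 1) != 0, 4, true}; }
ComponentPrediction perceptron_predict(uint64_t pc) { return {(pc & 2) != 0, 5, true}; }

// ParTAGE-S: statically force the retained perceptron for a window after a flush.
bool partage_s_predict(uint64_t pc, uint64_t branches_since_flush) {
    constexpr uint64_t kOverrideWindow = 20000;     // illustrative window length
    if (branches_since_flush < kOverrideWindow)
        return perceptron_predict(pc).taken;
    const ComponentPrediction t = tage_predict(pc);
    return t.valid ? t.taken : perceptron_predict(pc).taken;
}

// ParTAGE: pick whichever component reports the higher confidence.
bool partage_predict(uint64_t pc) {
    const ComponentPrediction t = tage_predict(pc);
    const ComponentPrediction p = perceptron_predict(pc);
    if (!t.valid) return p.taken;
    return (t.confidence >= p.confidence) ? t.taken : p.taken;
}
```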

When considering the upper limit of context-switching frequency in current systems, the most accuracy is delivered by including the BRB extension in the 3kB ParTAGE variant, which is iso-area compared to TAGE-SC-L. Figure 5.11 shows that the improvement is roughly 20% and 15% for 3kB ParTAGE and TAGE (B) respectively.

[Figure: Added MPKI overhead compared to steady-state TAGE (+0% to +200%) versus branch instructions per flush, for TAGE, TAGE (B), ParTAGE 1kB, ParTAGE 3kB and ParTAGE 3kB + SC.]

Figure 5.11: MPKI increase compared to steady-state when retaining 1.25kB. In coarse granularity the effect is negligible, but for frequent flushes the improvement is significant.

Overall, preserving a minimal state yields a significant improvement (15-20% less MPKI) when the state is frequently flushed, but has a small effect at steady-state (a maximum 5% MPKI increase). To maintain the same steady-state accuracy, the statistical corrector is included (ParTAGE 3kB + SC in Figure 5.11), increasing the overall area by 10% but delivering better accuracy throughout.

Comparing ParTAGE to TAGE and Perceptron

Comparing the different implementations of ParTAGE, ParTAGE-S works well for fine grain switches, but its transient accuracy suffers at larger flushing periods, as the TAGE tables have not been trained adequately. In contrast, ParTAGE, which simply feeds the confidence into TAGE, does not achieve the same transient accuracy. This happens when the TAGE tables are completely cold but their prediction is prioritised over the more accurate one from the perceptron component. This is also why ParTAGE does not improve the transient accuracy at 20k branch instructions. This can be solved with better tuning of the selection policy, if desired.

Improving the steady-state accuracy of the base predictor and retaining its state effectively enables efficient operation at higher switching frequencies. This can be done either by increasing the size of the branch retention buffer to fit more state, or by developing predictors in the 1kB to 3kB range with better steady-state accuracy. For instance, approaching the accuracy of an 8kB perceptron can further reduce the misprediction rate, as shown in Figure 5.7.

While the primary focus of this study is the improvement of the transient accuracy of branch prediction, it is equally important for the proposed design to maintain a competitive steady-state accuracy. A direct comparison is therefore performed between TAGE-SC-L and the Multiperspective Perceptron against the best version of ParTAGE. Figure 5.12 provides a detailed look at the steady-state MPKI compared to TAGE and the best version of the Multiperspective Perceptron (Jiménez, 2014a,b).

[Figure: Steady-State Performance. MPKI (3.60 to 4.30) versus branch instructions per flush (1M to 100M) for TAGE 64kB, Perceptron 64kB and ParTAGE 64kB.]

Figure 5.12: ParTAGE matches or outperforms current high accuracy predictors.

We observe that ParTAGE delivers 3.726 MPKI, competitive with TAGE (3.660 MPKI) and even outperforming perceptron (3.826 MPKI).

5.5 Discussion

Protection from side-channels that target the branch predictor can be disruptive to performance when dealt with in software and costly when dealt with in hardware. In realistic scenarios where such attacks can occur, such as frequent context switches and system calls, the studies in this chapter showed that current mitigation techniques for side-channel attacks targeting the speculation engine are both expensive and impractical. These disruptions create a disconnect between the reported nominal accuracy of branch predictors and the actual accuracy in real-world applications. To distinguish between the two, the notions of steady-state and transient branch predictor accuracy are introduced to better describe the behaviour of a system with frequent branch predictor flushes.

To ensure isolation while maintaining performance, area and energy constraints, the branch retention buffer is described. This novel mechanism keeps a minimal, isolated state per context to reduce the high number of mispredictions. To showcase the effect of the BRB, two designs are presented: first, an extension to TAGE, named TAGE (B), that keeps the state of its bimodal predictor; second, the hybrid branch predictor design, ParTAGE, that replaces the bimodal in TAGE with a perceptron. These variants are evaluated with a new methodology, which modifies the Championship Branch Prediction framework so that it clears predictor state at different frequencies and for different components.
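As a sketch of the mechanism (names, entry sizes, indexing by ASID and the flush callback are illustrative assumptions, not the evaluated design), the BRB can be thought of as a small array of per-context base-predictor images: on a context switch the outgoing base state is parked, the incoming context's image is restored, and every other predictor structure is flushed.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative Branch Retention Buffer: only a small base-predictor image is
// retained per context; everything else is flushed on a switch.
struct BaseState {
    std::vector<int8_t> weights = std::vector<int8_t>(1280, 0);  // ~1.25kB entry
};

class BranchRetentionBuffer {
public:
    template <typename FlushFn>
    void context_switch(uint32_t outgoing_asid, uint32_t incoming_asid,
                        BaseState& live_base, FlushFn flush_rest_of_predictor) {
        entries_[outgoing_asid] = live_base;              // park the outgoing base state
        const auto it = entries_.find(incoming_asid);
        live_base = (it != entries_.end()) ? it->second
                                           : BaseState{}; // cold start if unseen
        flush_rest_of_predictor();   // TAGE tables, statistical corrector, etc.
    }

private:
    std::unordered_map<uint32_t, BaseState> entries_;     // bounded in real hardware
};
```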

Results show that, under certain realistic conditions, branch predictors can have as much as 90% more mispredictions than what is evaluated today at steady-state. With the inclusion of the BRB the system manages to guarantee branch predictor state isolation, and consequently increase security when context switching, while achieving MPKI reductions of 15% and 20% for TAGE and the hybrid predictor design respectively.

The work can be further extended in various directions. Currently the threat model does not cover attacks that are orchestrated within the same thread. This could be incorporated by specifying sensitive regions in the thread to operate in an isolated environment. With the regions specified the same BRB mechanism can be used to make sure that unsafe code does not affect the sensitive region, even within the same thread.

Additionally, the security optimisations presented in this chapter can be combined with the branch predictor sharing scheme presented in Chapter 4. More precisely, contexts can migrate between cores, transferring the base predictor state and flushing the state of the other components in the incoming core. Alternatively, the rest of the predictor state can become inaccessible to preserve it for other applications potentially running on the same core.

Finally, the combination could be further improved with a shared SRAM memory to facilitate this, where the heterogeneous cores always point to different memory banks. With this method the transfer of state is instantaneous, while context isolation is still preserved. This prospective research is left for future studies.

Chapter 6

Conclusions

Incredible progress in processor design is evident over the past 50 years, and it has also triggered revolutionary social changes. Today's mobile devices far exceed our initial expectations, as they are able to perform powerful computations, exhibit high connectivity and feature a plethora of complex peripherals. Limited energy resources in such systems prohibit constantly executing with all resources active and require efficient management. On the other end of the spectrum, high performance computers tackle complex problems that help solve important questions in all scientific fields. However, diminishing returns in the latest silicon innovations no longer deliver enough performance to match the past rate of improvement and satisfy the expected future computational needs.

To overcome these problems, processors have gradually progressed to more efficient heterogeneous multicore designs. However, these heterogeneous multiprocessors are still at an early stage, and many questions regarding their design and usage remain unanswered. In principle, heterogeneity is used because multiprocessor systems with cores specialised for different workloads can deliver more efficient execution across a larger variety of applications. Current commercial heterogeneous designs usually consist of general purpose high performance Out-of-Order execution (OoO) cores and generic power efficient In-Order execution (InO) cores. In the future, though, many more types of cores are predicted to be included, such as Graphics Processing Units (GPUs), mixed/augmented reality and neural network accelerators.

However, to deliver continuous high performance within a restricted power envelope, it is not enough to just include cores specialised for desirable applications. It is also vital for such systems to be able to select and hand over execution to the correct core without loss of performance in the transition. To enable fast and low overhead switches between cores, some state needs to be shared or transferred as efficiently as possible. As shown throughout the thesis, this has proven to be challenging. Furthermore, the problem is expected to be exacerbated as more diverse cores, or even workload accelerators, with different memory interfaces and bandwidth/latency requirements are incorporated in future systems.


6.1 Summary of Conclusions

This work provides a generalised quantitative analysis for Heterogeneous Multi-Processors (HMPs) which ultimately leads to several improvements that facilitate a seamless transfer of execution between cores. These are achieved through several contributions.

First, an infrastructure is presented that enables in-depth research of the behaviour of heterogeneous systems. The methodology is based on the gem5 simulator, which permits cycle-accurate, in-depth analysis of core behaviour, modified so that parallel execution paths can be directly compared. The framework presented has clear strengths over pre-existing methodologies, which do not possess the ability to dissect the system into separate components and separately assess how they affect overall performance. The versatility of the methodology allows for studies that extend beyond simply simulating a heterogeneous system and focus on specific bottlenecks and weaknesses of specific designs in controlled scenarios. The experiments conducted with the methodology support this claim, as through the comparative analysis the performance impact of components such as the caches, the translation buffer, the branch predictor and the pipeline is quantified.

The second contribution derives directly from such an analysis. Experiments in Chapter 3 aim at finding memory sharing schemes that can enable low overhead, high frequency migrations between heterogeneous cores. The methodology is used to measure the impact of each memory component in a HMP comprised of an OoO and an InO core. The results reveal several novel insights that are critical for the design of future HMPs. The first observation is obtained from a system that shares all of the caches and the Translation Lookaside Buffers (TLBs) and uses an oracle scheduler to select the best core to execute on. In this scenario the system gains only marginal benefits in both performance and energy savings, even when switching as frequently as every one thousand instructions. This result is important as it opposes prior findings that claimed that sharing all of the memory components allows for migrations every one thousand instructions without experiencing any overheads.

The detailed per-component breakdown of the performance of the system reveals the reason for this disparity. The InO core, which cannot re-order instructions, relies heavily on the caches to deliver its maximum performance. The OoO core, however, can mask memory latency through re-ordering but requires accurate speculation to constantly keep its deeper pipeline full. Consequently, while sharing the caches eliminates most of the overheads for the InO core, the OoO core requires retention of the Branch Predictor (BP) state to operate through migrations without slowdown. As the branch predictor is not shared between the cores, frequent switching causes the OoO core to mispredict constantly, losing almost half of its nominal performance and making the system in its entirety impractical at fine grain migrations.

Instead, an improved sharing scheme is provided as an alternative, which delivers the same performance as full sharing at realistic switching frequencies but for a fraction of the complexity. The results presented bring two new interesting facts to the forefront. First, migrations as fine as 1k instructions are impractical because sharing latency sensitive components like the Level 1 caches outweighs the potential benefits. Second, most workload phases can be scheduled using coarser grain migrations. From these two new insights, a memory scheme that only shares the L2 cache and the translations can deliver equivalent performance to a system that shares the entire memory subsystem and improve the Energy Delay Product (EDP) over existing architectures.

Retaining the branch predictor state when switching cores practically eliminates the migration overheads (especially for the OoO core) and recovers all of the lost performance in HMPs. Different cores, however, have different design constraints which influence the microarchitectural design of their branch predictors. As such, the BPs of heterogeneous cores match neither in size nor in microarchitectural design. Additionally, predictions need to be available as early as possible in the pipeline and therefore prediction latency is critical. Similar to the L1 caches, this prohibits heterogeneous cores from sharing their branch predictors to retain state through migrations.

The third contribution presents a different approach. It identifies state that can be made common between a small, energy efficient InO predictor and a large, highly accurate predictor commonly found in an OoO core. To accomplish this, an investigation of the two most prominent BP designs used today, TAGE and Perceptron, is conducted. A per-component analysis reveals that the fallback bimodal predictor in TAGE delivers the most accuracy per kilobyte. Consequently, instead of sharing predictors, a more viable solution is to transfer the state of an InO core's bimodal predictor to the matching fallback predictor of an OoO core's TAGE design. This recovers most of the lost predictor accuracy and reduces the overheads incurred when migrating to the OoO core.

A further optimisation is based on the observation that Perceptron predictors are significantly more accurate when size constrained compared to bimodal predictors. Swapping out the InO bimodal and the OoO core’s fallback predictor for a small Perceptron improves the accuracy assumed immediately after a migration and improves the overall performance of the InO core. Retaining minimal BP state is shown through the simulation methodology to ultimately eliminate most of the slowdown caused by migrations, further increasing the potential benefits of a system that switches correctly between cores.

The final consideration stems from recent findings that shared state between contexts (and by extension between cores) can be leveraged to create security exploits that can leak privileged information or compromise correct operation. At the epicentre of such attacks is the branch predictor, which can be manipulated to force side-channel attacks that leak information through the caches. Isolation between contexts is considered to be a sufficient and necessary condition to guarantee secure operation. Flushing the state between context switches or migrations ensures isolation, but comes at the cost of a significant system slowdown and regression in performance. The experiments conducted as part of this thesis reveal that this approach can double the misprediction rate in current mobile devices, significantly impacting their performance.

To guarantee isolation without degrading performance, a mechanism is described that retains and isolates minimal state per context. The microarchitectural component that enables this is the Branch Retention Buffer (BRB), which stores minimal state per context. When a context switch or a migration is performed, the BRB points to the active thread and the rest of the predictor is flushed of any prior state. This eliminates the danger of any prior state being leaked. Furthermore, it improves the transient accuracy experienced when frequently switching, and as a consequence flushing, between contexts, and maintains the same steady-state accuracy as current competitive designs.

6.2 Research Questions Evaluation

The above contributions can also help assess how well the research questions posed in Chapter 1 have been addressed:

1. What are the best metrics to evaluate the quality of HMP designs? What is an unambiguous way to compare different architectures? This was addressed by breaking down the slowdown effects per component for both core architectures that are used in heterogeneous systems today. The simulation methodology that was designed enabled experiments that provided a quantitative way to assess slowdown broken down per component. This, in turn, showed that while retaining the memory state is necessary, internal components also have valuable state that needs to be preserved to maintain high performance. While the simulation methodology identifies how to theoretically improve scheduling and increase performance, this has not been validated in hardware. This would be a useful extension to further validate the results.

2. How does the memory hierarchy affect the trade-off between design complexity and improved operation for heterogeneous multiprocessors? The methodology developed showed that current proposals for future heterogeneous systems are too complex to build, and yield little to no benefits regardless of the complexity. Instead, using this analytic approach a feasible design was found that shares the translation buffer and the Level 2 cache. One valid criticism is that it is too complex to share the entire translation buffer. Instead, it would be more viable to have a hierarchical TLB and share the last tier. This has not been modelled in the simulations and it would be interesting to see how it performs.

3. How does the sharing of internal core components affect the behaviour of heterogeneous systems? The studies presented in Chapter 4 and Chapter 5 showed that during migrations the system slowdown attributed to internal core components can almost exclusively be attributed to the branch predictor. Sharing the branch predictor, however, is prohibitive as predictions have strict latency requirements. To thoroughly answer this question, a transfer mechanism is presented that can preserve minimal predictor state and avoid inducing high latencies. From a critical standpoint, the work only covers sharing schemes that use TAGE as the predictor of the OoO core. It would be useful to explore similar approaches for perceptron based predictors.

4. How do shared resources in HMPs affect the overall system security and what design changes are needed to ensure safe operation? From the literature review, it has been shown that shared resources are indeed vulnerable to security exploits, particularly side-channel attacks. To mitigate the effects, this work considered the concept of isolation of state in shared resources when switching between different contexts. While isolation can be an easy way to mitigate certain types of side-channels, the design modifications are not simple and can have considerable performance implications. This is especially true for components that store large amounts of data such as the caches and the branch target address buffer.

6.3 Future work

The work conducted as part of this thesis currently has certain limitations that need to be addressed in any future work in this direction. The first important limitation is that the simulation methodology presented in Chapter 3 only explores first order migrations, while ignoring any other more complicated scheduling scenarios. The second limitation is that the security mitigation scheme presented in Chapter 5 only protects the branch predictor tables, while leaving all other microarchitectural components vulnerable. The highlighted areas for improvement can lead to an extension of this research in many ways. Below is a brief description of promising related future work.

6.3.1 Research into HMP Scheduling

The simulation methodology compared first-order migrations to measure the best case performance and energy efficiency when migrating between cores. A viable way to extend this would be to perform an analysis over the entire state space. As explained in Chapter 3, the state space can quickly become intractably large, as it increases exponentially in size with each added potential scheduling step. Despite this, preliminary analysis documented in Appendix A shows that some simplifications can be made to the model to merge many states, effectively linearising the problem.

While this still generates a prohibitively large state-space to explore with classical methods, machine learning and artificial intelligence techniques can be used to develop more sophisticated schedulers for heterogeneous systems.

6.3.2 Investigating Memory Sharing for HMPs with Emerging Accelerators

The analysis in Section 3.3 identifies the core components that need to be shared between an In-Order and an Out-of-Order execution core. It is expected, however, that future systems will have many more heterogeneous cores or accelerators specialised for very specific workloads. For instance, neural network and augmented reality accelerators are expected to be prominent in the near future. These accelerators have very different memory access patterns and latency/bandwidth restrictions when compared to conventional general purpose cores. Extending the investigation into systems that incorporate emerging accelerators is a promising future route that would deepen our understanding of bottlenecks in heterogeneous systems in general, but also for specific architectures that are expected to be widely used in the future. The usefulness of these new processing elements is crucial for tackling problems that deal with large amounts of data, and therefore any improvement in their performance or energy efficiency can have a large impact on the spread of their use.

6.3.3 Extending Branch Predictor State Transfer to Other Designs

The work presented in Chapter 4 focuses primarily on transferring state from a TAGE predictor. The modular design of TAGE makes the useful state easy to identify and separate. Furthermore, the implementation of a component that directly matches the fallback predictor in TAGE is relatively straightforward.

Nonetheless, many commercial processors today use perceptron branch predictors, and it would therefore be useful to extend the work to these designs as well. The state stored in the perceptron tables is never used as a standalone prediction, as mentioned in Section 4.1.3. Instead, weights are produced by all the tables and are summed up to deliver a valid prediction. As such, identifying a single table whose state can be transferred to deliver valid predictions seems unlikely. However, it would be worth investigating whether it is feasible to identify a subset of entries from all tables or, alternatively, the tables that contribute the most to accurate predictions, and only transfer those to the core assuming execution after a migration.

6.3.4 Securing Other Shared Resources

A final recommendation for future work would be to extend the concept of isolating minimal state to other resources that are shared and vulnerable to security exploits. Caches are known to leak information through timing side-channel attacks. However, research in this area is still ongoing, as no viable solution with minimal performance impact has been proposed.

It would be worth trying to extend the notion of identifying the most useful state to retain per Address Space Identifier (ASID) to guarantee isolation. More promisingly, the work could be extended to secure partial state from the branch target address cache (BTAC), which has also been shown to be susceptible to Spectre-like attacks. However, as the BTAC stores addresses and not predictions, it would be promising to identify the most recent or most frequent address predictions to store, in order to maintain high accuracy. Additionally, compression techniques could be explored to further increase the amount of state retained in isolation.

Appendix A

Supplementary Material

A.1 Migration Overhead Study

Aside from the data extracted from mibench for the migration overhead studies, a smaller experiment was conducted using the Octane benchmark suite (Google Developers, 2013) running on Android KitKat.

[Figure: Octane OoO performance degradation with shared and private caches. Relative performance per benchmark for migration intervals of 1k, 10k, 100k, 1M and 10M instructions.]

Figure A.1: OoO performance degradation for shared and private caches (No TLB sharing).


The reasoning behind repeating the experiments for Octane is that mibench features microbenchmarks that fit into the caches. Using more realistic workloads with larger footprints can lead to different results, as the increase in necessary memory accesses can lead to different insights. Some of the workloads are omitted as their execution takes too long to realistically simulate. The ones selected for our study are:

• richards: Richards is an OS kernel simulation benchmark, which focuses on property load/store, function/method calls, code optimization and elimination of redundant code.

• deltablue: DeltaBlue is a one-way constraint solver, originally written in Smalltalk, used to test performance with polymorphism and on object oriented programs.

• crypto: Crypto is an encryption and decryption benchmark based on code by Tom Wu. The benchmark has lots of bit operations.

• raytrace: An object-oriented raytracer. The workload does a lot of arithmetic. It benefits from fast object allocation, initialization, and efficient garbage collection.

• earleyboyer: This is a mostly-mechanical Scheme to JS translation of two programs: earley, which does parsing, and boyer, which is a constraint solver. Being based on Scheme, this program creates a lot of small objects, so fast object allocation and generational garbage collection are very important. The extensive use of closures also generates a lot of call operations, so those need to be fast. Finally, the workloads feature lots of tail recursion and general tail calls.

• regexp: A combination of regular expression operations from 50 different sites, running in succession.

• splay: Splay is a data manipulation benchmark that deals with splay trees and exercises the automatic memory management subsystem. The benchmark also stresses the garbage collection subsystem of a virtual machine. SplayLatency instruments the existing Splay code with frequent measurement checkpoints. A long pause between checkpoints is an indication of high latency in the garbage collector. This test measures the frequency of latency pauses, classifies them into buckets and penalizes frequent long pauses with a low score.

• navierstokes: A 2D Navier-Stokes equations solver, which heavily manipulates double precision arrays. As a result the benchmark is useful for testing floating point math performance.

• mandreel: Runs the 3D Bullet Physics Engine ported from C++ to JavaScript via Mandreel. Additionally, it measures the benchmark with frequent time checkpoints. Since Mandreel stresses the virtual machine's compiler, this test provides an indication of the latency introduced by the compiler.


• gameboy: Emulates the portable console's architecture and runs a demanding 3D simulation, all in JavaScript.

• codeload: Measures how quickly a JavaScript engine can start executing code after loading a large JavaScript program, a social widget being a common example.

[Figure: Octane InO performance degradation with shared and private caches. Relative performance per benchmark for migration intervals of 1k, 10k, 100k, 1M and 10M instructions.]

Figure A.2: InO performance degradation for shared and private caches (No TLB sharing).

A.2 State Space Reduction to Linear Proof

The concept behind state space reduction for HMPs is based on cases where states can be merged in order to avoid extra computations. We observe that when a migration to the opposite core occurs, the caches and the cores are flushed and start “cold”. As the states symbolise a small amount of execution measured in instructions, states at the same level starting on the same “empty” core can be considered equivalent. As shown in Figure A.3, the unique states, regardless of the depth of the tree, will always remain a constant number (4 in the case of big and little cores).

[Figure: a binary tree of scheduling decisions alternating between the big (b) and little (L) core, with the unique leaf states marked.]

Figure A.3: A representation of the states that can be merged. The color coordination of the leaf states denotes which states can be merged, according to their most recent migration.

The rest of the cases can be truncated depending on how much history they have in common. We observe that, as half of the cases switch to the opposite core, all these states can be truncated to just two cases: one that represents a system running on big and switching to little, and one performing a migration in the opposite direction. In Figure A.3 the red colour surrounding the states denotes 8 merge-able states, four that migrate to the big and four that migrate to the little core. After the merge, two states remain, one on big and one on little, effectively removing six states from the initial tree.

The same logic for truncations can be extended for migrations that performed the same switch at the same state in the past, albeit with diminishing returns in terms of how many states are merged. These are the states with the dark blue border. Following the same logic, these 4 states can be reduced to two.

As a general rule, we observe that half the states get merged into two “super” states. Of the remaining states, half can be further merged into two new super states, with this pattern continuing until the remaining states are the four aforementioned unique states. Summing up all the truncations and subtracting them from the total number of states, we can construct a rule to extrapolate the complexity:

\[ a_0 = 2, \qquad a_1 = 4 \]

\[ a_n = \underbrace{2^n}_{\text{all states}} - \underbrace{\sum_{k=1}^{n-1}\left(2^k - 2\right)}_{\text{state truncations}} \]

\[ a_{n+1} = 2^{n+1} - \sum_{k=1}^{n}\left(2^k - 2\right) \]

\begin{align*}
a_{n+1} - a_n &= 2^{n+1} - 2^n - \sum_{k=1}^{n}\left(2^k - 2\right) + \sum_{k=1}^{n-1}\left(2^k - 2\right) \\
              &= 2^{n+1} - 2^n - \left(2^n - 2\right) \\
              &= 2^{n+1} - 2 \cdot 2^n + 2 = 2
\end{align*}

\[ \Rightarrow \quad a_{n+1} = a_n + 2 \]

Alternately:

\[ a_n = 2^n - \sum_{k=1}^{n-1}\left(2^k - 2\right) \;\Rightarrow\; a_n = 2n, \qquad \text{because} \quad 2^n - \sum_{k=1}^{n-1} 2^k = 2. \]
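The merging argument can also be checked by brute force, under the simplifying assumption stated above that a state is fully characterised by the core it currently runs on and the number of consecutive scheduling quanta spent on that core since the last migration; the sketch below enumerates every big/little decision sequence and confirms that the number of distinct merged states grows linearly with depth rather than as 2^n.

```cpp
#include <cstdio>
#include <set>
#include <utility>

// Enumerate every big/little scheduling sequence of length n and map it onto the
// merged state (current core, consecutive quanta on that core since the last switch).
int merged_states(int n) {
    std::set<std::pair<int, int>> states;
    for (unsigned seq = 0; seq < (1u << n); ++seq) {
        const int last = (seq >> (n - 1)) & 1;   // most recent scheduling decision
        int run = 1;
        for (int step = n - 2;
             step >= 0 && static_cast<int>((seq >> step) & 1) == last; --step)
            ++run;                               // count the trailing run on that core
        states.insert({last, run});
    }
    return static_cast<int>(states.size());
}

int main() {
    for (int n = 1; n <= 12; ++n)
        std::printf("depth %2d: %3d merged states out of %u raw states\n",
                    n, merged_states(n), 1u << n);
}
```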

Appendix B

Performance and Energy Efficiency

As an auxiliary supplement to fully understand heterogeneous processor performance, a clear definition of the terms energy efficiency and performance is provided, as these two terms are ultimately the underlying metrics for all research in the field. These two terms are usually not orthogonal to each other, and it is therefore also necessary to define a coherent metric for the trade-off between them. This appendix aims at providing a brief background of concepts directly tied to the evaluation of processor performance and efficiency.

B.1 Performance

The term performance vaguely describes the amount of work completed relative to a specific resource. More specifically, processor performance is defined as the amount of Instructions per Second (IPS) or, more realistically, Million Instructions Per Second (MIPS). In essence, a system that has a higher IPS count will complete a specific task faster compared to a system with a lower IPS count, and hence has lower execution time or delay (D).

\[ \text{performance} = \frac{\text{work completed}}{\text{execution time}} \]

Following the above logic, for a fixed amount of instructions, execution time is the inverse of performance and vice versa, or expressed in a mathematical form:

\[ \mathcal{D}_B = n \cdot \mathcal{D}_A \]

where n is the relative improvement of A over B. As performance is the inverse of execution time:

\[ \text{performance} = \frac{1}{\text{execution time}} \]


\[ n = \frac{\text{performance}_A}{\text{performance}_B} \]

Sometimes, workloads may run indefinitely or might be dependent on external events. In those cases, the workload (or instructions) executed is at best not constant and at worst undefined. In those cases performance can be derived by assuming a common execution time and comparing the average amount of work completed during the measurement. This is commonly referred to as throughput.

In general, computers process information at a constant rate using what is commonly called a clock. The smallest time event that registers the amount of work done is called a clock cycle. This way, a 1Hz processor has 1 clock cycle per second and a 2GHz one has 2,000,000,000 cycles per second; however, in both cases 20 cycles will result in the same amount of “progress” in operation. Execution time can therefore be calculated as follows:

\[ \mathcal{D} = \frac{\text{clock cycles}}{\text{clock frequency}} \]

Throughput can therefore also be described in terms of clock cycles with:

\[ \text{CPI} = \frac{\text{clock cycles}}{\text{instructions}} \]

Where CPI is the number of cycles per instruction. Sometimes the inverse, IPC (instructions per cycle), is also used as a metric. Knowing the instructions of a workload, the CPI and the operating frequency, it is easy to deduce the execution time:

\[ \mathcal{D} = \text{instructions} \times \text{CPI} \times \frac{\text{seconds}}{\text{clock cycle}} \]
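As a small worked example of this relation (the instruction count, CPI and frequency below are made-up values, not measurements from this thesis):

```cpp
#include <cstdio>

int main() {
    // D = instructions x CPI x clock period; all values below are illustrative.
    const double instructions = 2.0e9;   // dynamic instruction count
    const double cpi          = 1.25;    // average cycles per instruction
    const double freq_hz      = 2.0e9;   // 2 GHz operating frequency

    const double cycles  = instructions * cpi;
    const double seconds = cycles / freq_hz;   // equivalently cycles x (1 / f)
    std::printf("cycles: %.0f, execution time: %.2f s, IPC: %.2f\n",
                cycles, seconds, 1.0 / cpi);
}
```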

B.1.1 Parallel Processor Performance

As discussed in Chapter 1, performance on a single core reaches a limit, dictated by a frequency cap that prohibits execution at higher speeds due to thermal constraints. However, it is often the case that a system's performance can be enhanced through the exploitation of inherent parallelism of operations. Taking advantage of parallelism is very important to modern systems, as it allows performance to scale without the need to analogously increase frequency.

Individual processors have developed various design improvements that allow parallel execution at different abstraction levels. One of these is to redesign specific operative components so that they can be used simultaneously. Examples of this are carry look-ahead adders, fused multiply accumulate etc. all of which use parallelism to improve their response time. This sort of parallelism is called bit level parallelism.

Another form of parallelism is DLP, which is when the same operation can be done to multiple data at the same time. The idea of exploiting DLP was first introduced in the 1970s, with the Cray-1 supercomputer that used vector registers.

Vector processors contain instructions that operate on one-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items. Today, the popularity of these systems has greatly dwindled, as their price to performance ratio is significantly worse than that of their more versatile scalar counterparts. The idea resurfaced in GPUs, where data level parallelism is dominant. This happens because in 3D rendering each pixel undergoes the exact same operations, albeit with different coefficients (Buck and Hanrahan, 2003). In those cases multiple cores are designed to execute in a SIMD fashion to maximise throughput, much like the vector processors.

Similarly, one could argue that set-associative caches are a form of DLP. The argument in this case is that multiple banks can be accessed in parallel to find blocks faster. Hence, that the same operations are done on hardware in parallel on different data, much like an SIMD design.

[Figure: (a) No ILP (CPI ≥ 5). (b) 5-stage pipeline (IPC ≤ 1). (c) 5-stage superscalar pipeline (IPC ≤ 2).]

Figure B.1: Examples of instruction level parallelism exploitation (Hennessy and Patterson, 2006).

Another possible optimisation is to take advantage of instruction level parallelism (ILP) through pipelining. Pipelining is a technique used to overlap instruction execution. When doing so execution time is effectively reduced while a little overhead latency is introduced. The additional overhead is negligible compared to the performance improvement, especially at high frequencies. Another key improvement in pipelining comes from the fact that often neighbouring instructions are completely independent from each other. In those cases, and assuming that the necessary resources are available, instruction execution can be done partially or even completely parallel. To accomplish this superscalar processors are used, that can execute more than one instruction at the same time in different execution units.

A form of parallelism similar to ILP is memory level parallelism. The underlying concept is that memory has not managed to keep up with the advances in processor speeds and, as a consequence, is the main bottleneck of modern systems. To clarify, when fetching data from memory (or cache), a miss causes a stall in the order of hundreds of cycles. To overcome this problem, Out-of-Order execution is commonly used, which allows instructions to be rearranged to mask memory latency.

To accomplish this, the reordering must not affect the outcome of the original sequence. As a logical consequence, cores that cannot reorder instructions are referred to as In-Order cores.

Lastly, in a multi-threaded environment, and assuming that threads can run independently, it is possible to run in parallel when a system has more than one core or processor. This category of parallelism is called thread level parallelism (TLP). Ideally, TLP can significantly decrease the absolute execution time in a system. However, a more realistic realisation is that certain parts are independent while others require sequential processing or are constrained by specific bottlenecks, such as resources shared between threads (peripherals, memory, etc.).

In general, all the aforementioned methods achieve parallelism to a certain degree. Nonetheless, there are always parts of execution that have to follow a specific order to ensure correct operation. For this reason, a realistic consideration would assume that a system can be split into parts that are sequential and parts that can execute in parallel. The performance increase in these cases can be predicted using Amdahl's law. Amdahl's law calculates the improvement in performance when enhancing part of a system. This improvement is commonly referred to as speedup. A more analytic form of the above can be phrased mathematically as:

\[ \mathcal{S}_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s_{\#}}} \]

Where:

• $\mathcal{S}_{\text{overall}}$ is the theoretical speedup of the overall execution.

• $s_{\#}$ is the speedup of the part that can be executed in parallel.

• p is the percentage of the system that is parallelisable in the original system (before any improvement).

An interesting observation is that Amdahl's law describes the concept of diminishing returns. For example, if the parallelisable fraction p of the execution is 0.4, then for incremental speedups of the parallel part the overall speedup becomes increasingly insensitive, asymptotically peaking at 1.66. A graphical representation can be seen in Figure B.2.

In the general case, the overall speedup is restricted by the serial parts that cannot be optimised. Intuitively therefore, if the amount of work that can be done in parallel is small, then the benefit of scaling with cores will be insignificant. It is important to accurately calculate the performance when using a parallel execution design, as neglecting to factor in the slowdown caused by the necessarily serial execution distorts the performance difference between a system exploiting some form of parallelism and one which does not.
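The diminishing-returns behaviour can be reproduced directly from the formula; the short sketch below evaluates the overall speedup for a parallelisable fraction of 0.4 as the speedup of the parallel part grows (the chosen speedup values are illustrative).

```cpp
#include <cstdio>
#include <initializer_list>

// Amdahl's law: S_overall = 1 / ((1 - p) + p / s)
double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }

int main() {
    const double p = 0.4;   // parallelisable fraction of the workload
    for (double s : {2.0, 5.0, 10.0, 50.0, 100.0})
        std::printf("parallel speedup %6.1f -> overall speedup %.3f\n", s, amdahl(p, s));
    // The overall speedup asymptotically approaches 1 / (1 - p) = 1.667.
}
```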

[Figure: overall speedup versus speedup of parallelisable code (up to 100x); the curve flattens towards its asymptote as the parallel part is accelerated.]

Figure B.2: Amdahl's Law of diminishing returns when 40% of execution is parallel.

B.1.2 Benchmarking

Practically, establishing concrete and concise definitions of performance simplifies the problem of comparing systems significantly. However, performance depends not only on design but also on the type of workload. Without a meaningful workload that projects realistic behaviour, extracting system metrics is meaningless. For this reason, specific standardised workloads exist that help further construe system performance. These are specific programs called benchmarks. The purpose of benchmarks is twofold: to evaluate a system's performance and to facilitate identifying specific behaviour.

Evaluation is done by reporting some metric after completion, which usually is benchmark specific. For example Dhrystone, a CPU-intensive synthetic workload, outputs Dhrystone MIPS or DMIPS which is effectively the average time the processor takes to perform a single loop of a fixed sequence of the benchmark instructions. The metric DMIPS is arguably more meaningful than just evaluating MIPS between two designs because instruction count comparisons between different ISAs can confound simple comparisons. For instance, the same high-level task may require many more instructions on a Reduced Instruction Set Computer (RISC) machine, but might execute faster than a single Complex Instruction Set Computer (CISC) instruction. Similarly, other workloads establish their own meaningful criteria that helps comparisons between different architectures.

Apart from evaluation, benchmarks act as specific workloads that capture very specific system behaviour. Regular programs are not a single pure type of workload but are instead comprised of various different types of phases. As such, it is very difficult to identify what causes a specific program to be slow, and why a specific architecture performs worse than others. These sorts of questions become significantly easier to answer with benchmarks, as they usually tend to focus on one specific behaviour (e.g. memory performance in the case of CacheBench (Mucci et al., 1998)).

Various types of benchmarks exist, that can correctly assess performance for different system types. For example, some focus on generating traffic between nodes in networks or cores within many-core systems, while others might try to stress graphical computations. Depending on what behaviour is needed to be tracked, different types of benchmarks are used. For that reason benchmarks are often categorised as follows:

• Application benchmarks are actual programs that perform common functions relative to the type of application they aim at evaluating. These are usually system-level benchmarks and are very useful at comparing overall performance, but struggle at identifying the specific bottlenecks in systems.

• Component benchmarks are relatively small pieces of code that aim to test specific components. Instead of testing the performance of the system running real applications, component-level benchmarks focus on the performance of subsystems within a system. These subsystems may include the operating system, arithmetic integer unit, arithmetic floating-point unit, memory system, disk subsystem, etc. Even though in depth comparison can be done with component benchmarks, they cannot capture inter-component performance.

• Kernel benchmarks stem from the observation that in most cases a small part of the code takes up the majority of CPU time and resources. These benchmarks therefore are created from the extracted code excerpt that usually slows down systems. A common downside is that kernel benchmarks are usually small, fit in cache and are prone to compiler over-optimisation.

• Synthetic benchmarks are ones that are carefully crafted out of basic functions to be most indicative of system performance. These functions are usually artificially created to match an average execution profile. Synthetic benchmarks ultimately suffer from problems similar to those of kernel benchmarks.

As it is clear, no single benchmark provides an absolute measure for system or even component performance. It is therefore necessary to be aware of all the strengths and weaknesses of the benchmarks used and combine several of them to guarantee an adequate coverage of the system behaviour.

B.2 Power and Energy Efficiency

In every system, ranging from large-scale data centres to mobile devices, energy usage is an important concern. In HPCs, power consumption directly affects the cost of operation, both directly and through the cooling of thermal losses, which requires extra equipment to be dealt with appropriately.

Additionally, the thermal losses also affect the reliability of such systems and their future scalability. In mobile devices, limitations in battery capacity combined with the ever-increasing demand for computationally heavy applications sacrifice usability for prolonged operation. Hence, there is a clear need to track energy use and make operation more energy efficient. However, in order to do so the notions of energy and power efficiency must be clearly defined.

A lot of confusion arises from the fact that the terms energy and power are often used interchangeably despite them having very different meanings. Power (P) is defined as the amount of energy (E) used over time. To clarify, energy efficient systems minimise consumption, which can be important, for example, in mobile devices, as it leads to maximised battery life. Power efficiency, on the other hand, minimises peak power, which is useful when trying to mitigate thermal constraints. An interesting point to make is that a system might be power efficient but not energy efficient and vice versa, depending on the duration of operation. This could be the case when, for instance, a system runs for a very brief period and uses a burst of energy, in which case total energy will be low but power (which is the rate of energy over time) will be high. On the opposite end, a program that runs forever on bounded power will consume an infinite amount of energy.

Of the two concepts, energy is increasingly becoming more important, as operational costs of servers and battery life of mobile systems primarily depend on energy efficiency. It is therefore crucial to find ways to drive energy costs down both by hardware design and software scheduling optimisation. To remove any future confusion for the remainder of the report, efficiency will always refer to energy unless it is specifically stated otherwise.

B.2.1 Processor Power Dissipation

A significant amount of energy is expended by the central processing unit (CPU) and therefore it is crucial to fully understand its power consumption. Several factors contribute to the overall CPU power consumption, which can be classified into dynamic and static power losses:

\[ \mathcal{P}_{\text{CPU}} = \underbrace{\mathcal{P}_{\text{capacitance}} + \mathcal{P}_{\text{short}}}_{\text{dynamic losses}} + \underbrace{\mathcal{P}_{\text{leak}}}_{\text{static losses}} \]

Dynamic losses occur during the switching of transistors from one state to another. These losses can be further categorised as switching capacitance losses and short-circuit losses. The former originates from the charging and discharging of the load capacitance in logic gates. When a gate is charged, current flows from the source and charges the load capacitance ($C_L$). During discharge the flow is reversed, from the capacitance to ground. For a full charge-discharge cycle, the amount of charge that flows from the source $V_{DD}$ to ground and the energy consumed are:

\[ \mathcal{Q} = C_L V_{DD}, \qquad \mathcal{E} = \frac{1}{2} \mathcal{Q} V_{DD} \]

Multiplying with the frequency gives us the power:

\[ \mathcal{P} = \mathcal{E} f = \frac{1}{2} \mathcal{Q} V_{DD} f = \frac{1}{2} C_L V_{DD}^2 f \]

Considering that a processor is comprised of many logic gates that do not all switch at the same time, the dynamic power is proportional to the overall capacitance C, the operating voltage V and the frequency f:

\[ \mathcal{P}_{\text{capacitance}} \propto C V^2 f \]
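To illustrate how strongly the quadratic voltage term dominates this relation, the short sketch below compares the relative dynamic power of two hypothetical operating points; the voltage/frequency pairs are invented for illustration, not measurements.

```cpp
#include <cstdio>

// Relative dynamic power, P ∝ C * V^2 * f; C cancels when comparing two points.
double rel_dynamic_power(double volts, double freq_hz) { return volts * volts * freq_hz; }

int main() {
    // Invented operating points for a hypothetical mobile core.
    const double p_hi = rel_dynamic_power(1.00, 2.0e9);   // 1.00 V at 2.0 GHz
    const double p_lo = rel_dynamic_power(0.80, 1.2e9);   // 0.80 V at 1.2 GHz
    std::printf("low point draws %.0f%% of the high point's dynamic power\n",
                100.0 * p_lo / p_hi);   // about 38% for a 40% frequency reduction
}
```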

Short-circuit losses occur during transistor transition when state changes from “on” to “off” and vice versa. As the rise and fall time is nonzero, during that transition multiple transistors will be “on” creating a short-circuit current flowing directly from source to ground. Naturally the briefer the rise/fall times, the smaller the power losses due to short currents. For this reason manufacturers strive to lower the threshold voltage, which leads to another undesirable static phenomenon which is sub-threshold leakage currents.

[Figure: normalized total chip power dissipation and physical gate length (nm) from 1990 to 2020, showing the dynamic power and sub-threshold leakage trends.]

Figure B.3: Total chip dynamic and static power dissipation progression over the years (Kim et al., 2003)

In semiconductors, leakage is the phenomenon where electrons or holes tunnel through an insulating region. Hence, when a transistor is off, a small amount of power is consumed due to leakage currents. The phenomenon increases exponentially as the thickness of the insulating region decreases. Furthermore, this phenomenon is static, meaning that unlike dynamic effects it has a constant effect on operation and power dissipation. For this reason leakage is currently one of the main factors limiting further integration and a primary factor in the slowing down of Moore's law.

B.2.2 Methods of Power Loss Mitigation

As improvements in materials and the fabrication process progress, processors have constantly been reaping the benefits, improving power efficiency. However, on a macro scale, overall power dissipation in processors remains a primary constraint. To that extent, certain system-level management policies are used to reduce thermal losses.

B.2.2.1 DVFS

One concept with widespread popularity and usage today is altering the operating voltage and frequency of the processor. The idea behind this is that undervolting and underclocking cause a notable reduction in the dynamic losses of the transistors, albeit with a respective performance hit. In mobile devices such as smartphones, however, where energy resources are limited, a small performance hit is acceptable if a worthwhile amount of energy is saved. Underclocking can also be used when system performance is in general low, due to memory stalls, frequent mispredictions and other bottlenecks, or when devices are “idling”. In those cases the processor performance is already low and masks the degradation caused by voltage and frequency scaling.

This reconfiguration of the system's operating specifications can be done online by the operating system kernel with dynamic voltage and frequency scaling (DVFS). Various policies exist that change the DVFS level of operation using an equally wide variety of criteria, ranging from high level application metadata to hardware performance monitoring counters (PMCs). The policies used to steer the system are called governors.

B.2.2.2 Core Migrations

Another approach, used in systems with multicore designs, is to manage what each core is executing. Depending on the system architecture and the desired performance goal, workloads can all be migrated to one core to save energy, by switching the surplus cores off, or balanced across all available cores to boost performance. Similar to DVFS, migrations are also dictated by a governing policy that monitors operation and makes adjustments.
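A comparable sketch for such a migration policy follows: it decides whether to consolidate work onto a single core or spread it across all cores. The load metric and thresholds are again hypothetical.

```python
# A minimal sketch of a migration policy of the kind described above:
# consolidate work when load is light, spread it when a core is saturated.
def plan_migrations(per_core_load, low=0.2, high=0.8):
    """Return 'consolidate', 'spread' or 'keep' for a list of core loads."""
    total = sum(per_core_load)
    if total < low * len(per_core_load):
        return "consolidate"   # pack work onto one core, power the rest down
    if any(load > high for load in per_core_load):
        return "spread"        # balance across all cores for performance
    return "keep"

print(plan_migrations([0.05, 0.10, 0.0, 0.0]))   # consolidate
print(plan_migrations([0.95, 0.40, 0.30, 0.2]))  # spread
```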

B.3 Performance - Energy Trade-offs

Performance and energy efficiency are commonly opposing forces in design. This means that optimising for system efficiency commonly leads to performance deterioration, and maximising system throughput leads to more consumption. For this reason, when evaluating a system's behaviour it is important to observe the relation between energy and delay. As these two notions express different qualities of behaviour, which are correlated but do not directly affect each other, improving a design can be achieved by finding Pareto optimal points and defining the trade-off relation between them.

(Axes: Constraint 1 vs. Constraint 2; the plot distinguishes non-optimal points from Pareto optimal points.)

Figure B.4: The formation of a Pareto frontier from Pareto optimal points.

B.3.1 Pareto Optimality

Pareto optimality, or Pareto efficiency, is a method of optimising resources used in engineering and in economic theory. Given a set of alternative combinations of resources for a set of constraints, a movement from one combination to another that makes at least one constraint better without making any other worse is called a Pareto improvement. A combination is Pareto optimal when no further improvements are possible. This is formally referred to as a Strong Pareto Optimum (SPO), and the different combination states are referred to as allocations.

Sometimes a less stringent requirement is needed to consider a state optimal. For those cases a Weak Pareto Optimum (WPO) is defined, where a new allocation is considered an improvement only if all constraints are better off. Clearly, SPO points are by definition also WPO, while the inverse statement is not true. Another interesting observation is that in most multi-objective optimisation problems the end result will not be a single Pareto optimal point, but instead a set of points forming the Pareto frontier. Mathematically this can be described with the equations below (Schumpeter, 1949; Mathur, 1991).

Considering: X \subseteq \mathbb{R}^n, \quad Y = \{\, y \in \mathbb{R}^m : y = f(x),\ x \in X \,\}

and

f : \mathbb{R}^n \rightarrow \mathbb{R}^m

The Pareto frontier is therefore defined as:

P(Y) = \{\, y_A \in Y : \{\, y_B \in Y : y_B \succeq y_A,\ y_B \neq y_A \,\} = \emptyset \,\} \quad \text{(for SPO)}

and

P(Y) = \{\, y_A \in Y : \{\, y_B \in Y : y_B \succ y_A \,\} = \emptyset \,\} \quad \text{(for WPO)}

Where:

X is a compact set of feasible allocations.

Y is the possible set of constraint values.

f is the set of functions relating the output constraints to a set of input allocations.

y_A, y_B are points in Y.

y_B \succeq y_A signifies that y_B is at least as preferred as y_A in every constraint, and y_B \succ y_A that y_B is strictly preferred over y_A.
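To make the set definition above concrete, the following is a minimal Python sketch that computes the (strong) Pareto frontier of a set of (energy, delay) points, assuming lower values are preferred in both dimensions; the sample points are made up.

```python
# A point is kept on the frontier if no other point is at least as good in
# both dimensions while differing from it, matching the SPO definition above.
def pareto_frontier(points):
    """Return the Pareto-optimal subset of (energy, delay) pairs (lower is better)."""
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b
    return [p for p in points if not any(dominates(q, p) for q in points)]

points = [(2.0, 5.0), (3.0, 3.0), (5.0, 2.0), (4.0, 4.0), (6.0, 6.0)]
print(pareto_frontier(points))  # [(2.0, 5.0), (3.0, 3.0), (5.0, 2.0)]
```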

B.3.2 The E−D Pareto Frontier

The E−D Pareto frontier is incredibly useful in ideal systems, which solely depend on their constraints. In that manner, it is often used in processor design to portray a specific architecture's capabilities, as it focuses on the important trade-offs, abstracting the full range of possible parameters. As such, the E−D pairs that comprise the Pareto frontier of a core are usually the most useful metric to assess core and processor behaviour.

To understand this, it is useful to imagine execution as a succession of states, each with their own E−D point. When, for instance, a DVFS governor executes a specific workload, this leads to an E−D footprint. Comparing that to all other policies captures all other E−D pair points. From the whole set, the policies that lead to the E−D pairs for which no others improve delay without sacrificing energy, and vice versa, are on the Pareto frontier.

Appendix C

Selected Publications

The first article was presented at the CODES+ISSS (ESWEEK) 2017 conference in Seoul, Korea, was nominated for the Best Paper Award, and was published in the ACM Transactions on Embedded Computing Systems (TECS) journal.

The second publication was presented and published at the High Performance Computer Architecture (HPCA) 2019 conference in Washington DC, United States of America.

The third publication is currently under review in the Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) journal.


Nucleus: Finding the sharing limit of heterogeneous cores

ILIAS VOUGIOUKAS, ARM Research and University of Southampton ANDREAS SANDBERG and STEPHAN DIESTELHORST, ARM Research BASHIR M. AL-HASHIMI and GEOFF V. MERRETT, University of Southampton

Heterogeneous multi-processors are designed to bridge the gap between performance and energy efficiency in modern embedded systems. This is achieved by pairing Out-of-Order (OoO) cores, yielding performance through aggressive speculation and latency masking, with In-Order (InO) cores, that preserve energy through simpler design. By leveraging migrations between them, workloads can therefore select the best setting for any given energy/delay envelope. However, migrations introduce execution overheads that can hurt performance if they happen too frequently. Finding the optimal migration frequency is critical to maximize energy savings while maintaining acceptable performance. We develop a simulation methodology that can 1) isolate the hardware effects of migrations from the software, 2) directly compare the performance of different core types, 3) quantify the performance degradation and 4) calculate the cost of migrations for each case. To showcase our methodology we run mibench, a microbenchmark suite, and show that migrations can happen as fast as every 100k instructions with little performance loss. We also show that, contrary to numerous recent studies, hypothetical designs do not need to share all of their internal components to be able to migrate at that frequency. Instead, we propose a feasible system that shares level 2 caches and a translation lookaside buffer that matches their performance and efficiency. Our results show that there are phases, comprising up to 10% of execution, where a migration to the OoO core leads to performance benefits without any additional energy cost over running on the InO core, and up to 6% of phases where a migration to the InO core can save energy without affecting performance. When considering a policy that focuses on improving the energy-delay product, results show that on average 66% of the phases can be migrated to deliver equal or better system operation without having to aggressively share the entire memory system or to revert to migration periods finer than 100k instructions.

CCS Concepts: •Computer systems organization →Heterogeneous (hybrid) systems; Embedded systems; •Computing methodologies →Simulation evaluation;

Additional Key Words and Phrases: HMP, Out-of-Order, gem5, migration, heterogeneous, simulation methodology, embedded systems

ACM Reference format: Ilias Vougioukas, Andreas Sandberg, Stephan Diestelhorst, Bashir M. Al-Hashimi, and Geoff V. Merrett. 2017. Nucleus: Finding the sharing limit of heterogeneous cores. ACM Trans. Embedd. Comput. Syst. 1, 1, Article 1 (October 2017), 16 pages. DOI: 0000001.0000001

This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/K034448/1. Authors' addresses: Ilias Vougioukas, Andreas Sandberg and Stephan Diestelhorst, ARM Ltd., CPC1, Capital Park, Fulbourn Road, Cambridge, CB21 5XE, UK; Bashir M. Al-Hashimi and Geoff V. Merrett, Electronic and Software Systems Group, School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 1539-9087/2017/10-ART1 $15.00 DOI: 0000001.0000001


(Top panel: sampling every 1 million instructions; bottom panel: sampling every 100 thousand instructions; y-axis: instructions per cycle; traces: big and LITTLE cores.)

Fig. 1. An excerpt of execution of dijkstra running on two different core types. Sampling at a higher resolution reveals phases that can potentially be exploited to increase efficiency.

1 INTRODUCTION

In today's embedded devices, performance is always tied to power constraints and energy efficiency. This happens because, while process technology scaling is enabling larger transistor densities, power per area has been shown to increase beyond a certain threshold in transistor size. This latter effect is commonly referred to as the breakdown of Dennard's scaling, which forces designs to operate only a fraction of the whole system, leaving the remaining design in a dark silicon state [8, 10]. To address this, heterogeneous multiprocessors (HMPs) have emerged over other designs, especially in mobile devices [17, 33], where keeping within a strict energy envelope while still being able to deliver performance on demand is crucial. To be able to deliver high performance through Memory Level Parallelism (MLP) and Instruction Level Parallelism (ILP), an Out-of-Order (OoO) core is commonly used that has large caches, does aggressive speculation (branch predictors, prefetchers) and masks memory latency at the cost of significantly increased design complexity, area and power requirements. On the other hand, In-Order (InO) cores aim at conserving energy through a simpler and smaller design, at the expense of performance and lower operating frequency. A characteristic implementation of an HMP architecture is, for example, ARM's big.LITTLE [1], which uses two types of cores that share the same instruction set. Workloads can be broken down into phases, some of which show high ILP potential and can therefore yield high performance, while others heavily access memory and therefore deliver lower instructions per cycle (IPC). Choosing the most suitable core leads to an optimal performance and energy profile, while making a wrong decision can lead to unnecessary slowdown or energy waste. Recent studies claim that micro-phases in the order of thousands of instructions exist [12, 30], as seen in Figure 1, and can be exploited to reduce energy consumption and boost performance through fine-grained migrations. In theory, perfect core matching should lead to optimal behaviour; however, every time a migration is performed the working state needs to be transferred to the operating core, adding both a software and hardware execution overhead.


From the software side, the operating system (OS) handles the scheduling to the appropriate core and sends the required signals to perform the migration. The complexity of the scheduler and the software context switch significantly affects the amount of overhead added to a system, which is on the order of ten million instructions [7, 19, 32, 35]. From a hardware perspective, the overheads depend on how much of the state needs to be transferred and how much is shared between the two migrating cores. The more state shared and preserved, the less warm-up time required, which is the period necessary to achieve maximum performance. Unlike software, hardware overheads can be significantly smaller, as low as thousands of instructions, depending on the amount of resources shared. To reduce overall system slowdown when migrating, several studies propose the implementation of dedicated hardware to efficiently handle the transfer without any software intervention [12, 13, 18, 21, 26] and also fusing cores to share the front-end [20, 22, 24]. However, as hybrid designs are too complex to implement, it is worth investigating a simpler approach, determining which components contribute most to the benefits and sharing only those. The questions we aim to answer therefore are:

• Which core is more suitable for the current workload phase?
• Which components should be shared to improve migrations?
• How often can we switch before overheads suppress the benefits?

For the scope of this study we choose to focus only on the hardware migration overheads for various cache and TLB sharing configurations, as only through hardware acceleration does it make sense to explore fine-grained migrations. Our motivation is based on understanding migrations from an architectural standpoint and finding out how to improve future designs. Our contributions are:

• An architecture for HMPs which shares only the TLB and the last level cache. This effectively achieves the same performance as systems with merged cores [13, 18, 20–22, 24, 26], but is significantly simpler to design.
• Nucleus: a novel simulation methodology that allows the direct comparison of the behaviour of the exact same task on different cores, which is impossible on hardware platforms.
• Quantifying the migration overheads for the OoO and InO core types under various degrees of sharing, and finding that the migration period limit is roughly 100k instructions. Results show that full sharing does not realistically improve performance or reduce the overheads.

The rest of this paper is organised as follows: Section 2 describes the background theory behind resource sharing and migrations. In Section 3 we present the details of our simulation methodology. In Section 4 we perform a limit study stressing the capabilities of our methodology, which quantifies switching overheads and cases where migrations are beneficial. Section 5 puts our work in perspective with respect to related work in the field. Section 6 contains concluding remarks, describing the current limitations and our plans to extend the methodology, and concludes the paper.

2 RESOURCE SHARING

2.1 Heterogeneous multiprocessor organization

The level of sharing in a system significantly impacts the ability to perform a context switch between cores without significant performance degradation. Even though sharing can notably reduce costs when migrating, it adds complexity, area and latency that can hurt the performance of the cores. Based on this, conventional design mandates that heterogeneous cores usually share at best the Last Level Cache (LLC) or the main memory. To achieve high frequency switching, recent academic trends propose extreme attempts to bring cores closer together, usually by sharing the instruction Level 1 (IL1), data Level 1 (DL1) and Level 2 (L2) caches and in some cases even some internal components [13, 18, 20–22, 24, 26].



(a) No sharing. (b) Share caches. (c) Share caches and TLB. (d) Proposal: Share L2 and TLB.

Fig. 2. Different types of sharing schemes examined for HMP systems.

The feasibility of such designs however still remains questionable, as some parts cannot be shared between drastically different designs, while others (IL1, DL1 caches) cannot be shared without lowering performance. A more realistic approach, instead of opting for full or no sharing, is to target the features cores can easily share to increase performance when migrating. With this in mind, we evaluate four different sharing schemes.

No sharing. The first case worth analysing is when cores have private caches and share no components except the main memory (Figure 2a). This sharing scheme is a common setup in chip multiprocessors and HMPs, with numerous industrial and academic examples [1, 6, 16]. Here every migration causes the system to transfer its register file from the retiring to the incoming core before resuming normal operation. In addition, the system also has to warm up the empty or “cold” caches of the new core before it can achieve its maximum performance. The overheads for the no sharing case (T_private) can be broken down as:

T_{private} = \underbrace{T_{pip} + T_{BP} + T_{reg}}_{\text{tied to core (non-shareable)}} + T_{TLB} + T_{\$1} + T_{\$2} \qquad (1)

where T_pip and T_BP are the pipeline and branch predictor warm-up phases respectively, T_reg is the overhead to transfer the register file, T_TLB is the extra cost of warming up the TLB, and T_$1 and T_$2 are the penalties induced when starting with cold L1 and L2 caches respectively.

Share caches. The next level of sharing we consider is one where the two cores partly or fully share the caches (Figure 2b). We consider this setup as an in-between state that can provide insight into system performance with and without sharing the TLB. This can potentially reduce the warm-up time after a switch. However, when workloads have a very small memory footprint, or thrash through data in the cache, the benefits of sharing are limited. Another potential problem with this sharing scheme is that, even if caches contain useful data post-migration, empty TLBs can penalize the system significantly. This happens when, despite the fact that the desired data is still maintained in the cache, the virtual-to-physical translation is not in a TLB. TLB misses are slow, as they need to access main memory, to the point that a miss is much more expensive than an IL1 or DL1 miss, as the system ends up performing a page walk, requiring several memory accesses.

Share caches and TLB. The full sharing scheme (Figure 2c) assumes that both caches and TLBs are shared amongst heterogeneous cores, but still needs to warm up the pipeline, branch predictors

and the register file. Sharing the register file is too complex design-wise, while its relatively small size permits ignoring the penalty, especially when compared to warming the branch predictors. The pipeline and branch predictors cannot be shared between different core types as they are functionally very different components. For example, the state of the speculative, reordering, 4-wide and deeper pipeline in the OoO cannot be transferred to a 2-wide InO pipeline with half the depth in a conventional way. This design archetype has been a popular academic trend recently, with some studies claiming that even the branch predictor and pipeline can be re-purposed and shared to an extent [12, 13, 20–22, 24, 26].

Our Proposal: Share L2 and TLB. The issue with full system sharing is that while the total slowdown of a system is reduced after a migration, sharing the level 1 caches and the TLBs between the cores adds extra latency that can dramatically reduce the overall performance. This is because components like the IL1 cache are very sensitive to latency (∼4 cycles per hit) and even a single extra cycle of latency can lead to a 25% performance decrease, as it is constantly being accessed. Sharing the L1 has this effect as, in order to connect two cores to the same cache, extra capacitance from wires is introduced and a switching signal is added (e.g. a multiplexer), which adds latency. For this reason we propose an alternate architecture (Figure 2d) that shares the L2 cache and the TLB.
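To illustrate how the warm-up cost of Equation (1) changes under the four schemes of Figure 2, the following is a rough sketch (not taken from the paper): the per-component warm-up costs are hypothetical placeholders, and shared components are simply assumed to contribute no warm-up penalty after a migration.

```python
# Hypothetical per-component warm-up costs, in instructions.
COSTS = {"pipeline": 500, "branch_pred": 5000, "regfile": 100,
         "tlb": 2000, "l1": 3000, "l2": 20000}

SHARED = {
    "no sharing":       set(),
    "share caches":     {"l1", "l2"},
    "share caches+TLB": {"l1", "l2", "tlb"},
    "share L2+TLB":     {"l2", "tlb"},
}

for scheme, shared in SHARED.items():
    cost = sum(c for name, c in COSTS.items() if name not in shared)
    print(f"{scheme:17s} ~{cost} instructions of warm-up")
```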

2.2 Internal core components

While the sharing scheme plays a large part in determining system performance and energy efficiency, the different functionality of cores in heterogeneous systems also affects system behaviour. This is because, despite the inherent penalty to switching, migrating to a more suitable core can amortise the cost and boost performance compared to remaining on the less compatible core. For this reason, it is important for heterogeneous systems to be comprised of cores that cover a wide range of applications across a variety of energy/performance points. For the OoO core, the deeper and wider the pipeline, the more cycles needed to warm it up. Furthermore, in order to approach steady state performance, the OoO core also needs to fill the re-order buffer and the branch predictor. Both these structures are vital for Out-of-Order execution and, perhaps more importantly, can significantly hinder performance when they are not warmed up. The simpler branch predictor and the lack of a reorder buffer of the InO core, while limiting the ability to parallelise instructions and mask memory requests, make it much faster to assume steady performance. Additionally, the simplicity in functionality allows the core's design to be much smaller and consume less energy. Overall comparisons between the two cores show that the little core is 3x more power efficient when clocked at the same frequency [21]. In our limit study, presented in Section 3, we will adhere to this power ratio to compare the big and little cores.

3 METHODOLOGY

In this section we describe a gem5 based [4] simulation methodology used to explore the cost of migrations. To allow for direct comparisons, we gather samples where we compare the IPC of two simulations, one executing on a system that runs uninterrupted and one on a system that commences after the migration is triggered, for the exact same phases. Based on our methodology we devise two techniques, one to calculate the overheads for each core type for the sharing schemes in Section 2.1 and one to find potential phases where it makes sense to migrate to a different core. To be able to measure the impact of switching cost in a system, it is necessary to directly compare execution with and without performing migrations. To achieve this, we propose Nucleus, a methodology where we fork the simulator creating two identical simulations, the main or parent simulation and the forked or child simulation. After the simulator splits, the main simulation



continues unperturbed, while the fork performs a migration. Switching the core in the forked simulation can be achieved by signalling the operating system (OS) in the child to issue a migration. The OS can either trigger the migration itself, or the forked simulation can cause a hardware interrupt to signal that a migration must be performed. Both of these methods, however, would add the OS migration overhead which, as described in the previous section, is undesirable as it will obfuscate the hardware effects. Instead, we propose to simulate a single core system, where the forked simulations “hot-swap” the core, leaving the caches and memory unaffected. To share the TLB, which for our simulator is tied to the core design, we just rewire the TLB to be connected to both cores, effectively detaching it from a specific core. When a hot-swap is performed, the system drains in-flight messages, flushes the operating core and afterwards swaps it with the inactive one. One limitation of this approach is that the core operation is mutually exclusive. However, this is deliberate for our limit study as we are interested in single thread behaviour, and it resembles the operation of the first commercial HMP systems [6] and merged cores that share the front end [20, 22, 24]. The infrastructure can easily be extended to accommodate more than one pair of coupled cores; however, the decision tree of migration combinations is significantly more complex and is beyond the scope of this paper. After the fork swaps the core designs, depending on the level of sharing explored, a writeback and invalidate is triggered for the components considered as private. The newly active core resumes operation with those structures empty and proceeds to warm them up before assuming steady state performance, which emulates the effects of a hardware migration. On the software level, the system seamlessly continues to operate completely agnostic of the swap. This can be accomplished because the gem5 simulator can switch the micro-architecture of components in the system, such as the core design, without altering the system architecture state, such as the register values. This is exploited to single out the specific hardware component overheads during migrations. A graphical representation of the methodology infrastructure can be seen in Figure 3, where the CPU slot is the placeholder for the desired CPU model, while the rest of the components remain unchanged.

Fig. 3. The implementation with which the simulator “hot-swaps” the cores to emulate a system with varying shared resources.



Fig. 4. A visual representation of Nucleus. The main simulation spawns periodic forks that perform a migration.

3.1 Nucleus

One important detail to note is that, irrespective of the sharing scheme, the register file is always considered to be shared, as the simulator does not account for the cost of transferring it. Several hybrid core proposals claim that the state transfer is possible with negligible overhead [13, 21, 26]. While we do not consider this realistic for actual contemporary hardware, it helps to draw two useful conclusions from our study. First, it helps us assess a best-case scenario for each sharing scheme, and compare our results to many studies supporting fine grain migrations and merged designs [12, 13, 20–22, 24, 26]. Secondly, it allows a cleaner observation by singling out the effects of specific components, and provides insight that could otherwise be masked. Our methodology allows for certain experiments to be conducted which were not previously feasible. For instance, it is possible to compare the exact same phases in systems with different characteristics. Figure 1 shows how an excerpt of dijkstra is perfectly aligned. To contrast, as forked execution is not possible in hardware, two separate executions need to be triggered in parallel. However, as the executions are not fully deterministic, deviations between them will cause them to diverge and therefore the phases will end up misaligned. This is especially severe in the case of fine grain investigation, as even a small skew can lead to wrong results. By forking from the same starting point Nucleus solves this problem, as every sample is aligned, even at the finest granularity.

3.2 Comparing parallel simulations

To explore migrations, a main simulation is created that runs on the same core from beginning to end without performing any migrations. After a warm-up interval the main simulation starts spawning forks periodically until the workload terminates. The main simulation and forks are symbolised with M and m respectively in Figure 4. The period for each sample is measured in instructions instead of time. Using time to determine the sample period can cause the parallel simulations to diverge when there is a difference in performance, as the simulation with the higher IPC will execute more instructions in the allotted time. Sampling with instructions solves this as, irrespective of the performance or time they take, both runs will end up executing the same code.
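As an illustration of the forking scheme described above, the following is a minimal sketch (not the authors' actual script) built on gem5's Python API (m5.fork, m5.switchCpus, m5.simulate). System construction and the scheduling of instruction-count exit events are elided, and the function and variable names are assumptions.

```python
import os
import m5

def run_with_periodic_forks(system, main_cpu, alt_cpu, num_periods):
    """Fork the simulation each period; the child hot-swaps the core model."""
    for _ in range(num_periods):
        pid = m5.fork()                            # clone the simulation state
        if pid == 0:                               # child: perform the migration
            # alt_cpu is assumed to have been created switched out.
            m5.switchCpus(system, [(main_cpu, alt_cpu)])
            m5.simulate()                          # run one period on the other core
            os._exit(0)                            # first-order migration: terminate
        m5.simulate()                              # parent: continue uninterrupted
```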


The forked simulations examine first-order migrations, which effectively run only for a single period and then terminate. This way, for every period running on the main simulation, there is an equivalent simulation that examines how a system would be affected by a migration at that specific phase. The termination of the forked simulations after one period restricts our methodology to only one migration; however, expanding to more migrations would cause the state space to become too large to compute. We devise two separate experiments with our methodology that answer two separate questions:

• Migration Cost: Taking into consideration the different sharing schemes, what are the warm-up costs for each core?
• Efficient Migrations: What is the best setup (sharing scheme, switching frequency) for HMPs?

Answering the former question, we perform our methodology migrating always to the same core type, from OoO to OoO and likewise from InO to InO, for all the sharing schemes described in Section 2.1. The migration frequency limitations for each core can be calculated by normalizing the performance of the migrations with the corresponding main simulations, averaged over all samples:

\text{Relative Performance} = \frac{\sum \frac{IPC_{mig.}}{IPC_{main}}}{N} \qquad (2)

where N is the number of samples. To determine what the optimal setup is for HMPs we devise another experiment that switches the core type, from OoO to InO and vice versa, in the migratory simulations. The IPC comparison for this experiment reveals the performance disparity between the two cores in each phase. We explore two migration policies that determine beneficial switches, the no trade-off policy and the energy delay product (EDP) policy. The first is when migrating to the opposite core reduces power or execution delay without sacrificing performance or energy respectively, which can be useful in high response situations or when energy is a constrained resource. For the second policy we examine switches that lead to equal or better EDP. We quantify this by calculating the percentage of the workload for which a migration is beneficial:

\text{Beneficial Migrations} = \frac{\sum \left[\frac{IPC_{mig.}}{IPC_{main}} \geq \theta\right]}{N} \qquad (3)

Beneficial Migrations refers to the number of cases where switching from one core type to the opposite leads to an improvement. For instance, if the indicator is 0.6, this means that a system with an InO core as the main core could benefit from running on the OoO core for 60% of the workload. The [ ] are the Iverson brackets [14]: [P] is defined to be 1 if P is true, and 0 if it is false. The formula only counts the migrations that are better than the main run, using θ, which is the threshold we set to determine a beneficial migration. Depending on the policy desired, the threshold θ is adjusted, for instance to find cases that lead to better EDP or to find the migrations that save energy without performance loss.
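A small sketch of how Equations (2) and (3) can be evaluated from per-sample IPC traces follows; the sample values are made up for illustration.

```python
def relative_performance(ipc_mig, ipc_main):
    """Equation (2): mean per-sample IPC ratio of migrating vs. main run."""
    return sum(m / b for m, b in zip(ipc_mig, ipc_main)) / len(ipc_main)

def beneficial_migrations(ipc_mig, ipc_main, theta):
    """Equation (3): fraction of samples whose IPC ratio meets threshold theta."""
    return sum(1 for m, b in zip(ipc_mig, ipc_main) if m / b >= theta) / len(ipc_main)

ipc_main = [1.0, 1.2, 0.8, 1.5]   # hypothetical per-sample IPC, main run
ipc_mig  = [0.9, 1.3, 0.4, 1.5]   # hypothetical per-sample IPC, migrated run
print(relative_performance(ipc_mig, ipc_main))       # ~0.87
print(beneficial_migrations(ipc_mig, ipc_main, 1.0))  # 0.5
```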

4 RESULTS

In this section we present our setup for both experiments, described in Section 3, and their results. We compare the sharing schemes from Section 2.1 to determine whether sharing the caches and TLB is worth the added complexity. We also present our experiments on efficient migrations and provide some insights into the results.


Table 1. The specifications of the system used in our experiments.

Core parameters          OoO Core    In-Order Core
  Pipeline Width         3-wide      2-wide
  Pipeline Depth         15          8
  Branch Predictor       Two-Level   Tournament
  Operating Frequency    1GHz        1GHz

Shared Resources         Value
  L1D Size               32KB
  L1I Size               32KB
  L2 Size                1MB
  L2 Associativity       16-way
  L2 Prefetcher          Stride

4.1 Experimental Setup

To evaluate Nucleus, we use the gem5 simulator [4] in full system mode and modify the forking mechanism [28] to be able to fork on instruction counts and switch the cores. The experimental setup simulates a heterogeneous system using big OoO and little InO publicly available ARMv8 models [27, 31], both operating at the same frequency and using the classic memory subsystem. The simulated system uses the same cache sizes for both the OoO and the InO cores, so that a fair comparison between private and shared caches is drawn. The simulator is restored from a previous checkpoint that bypasses the boot-up, and each benchmark is run until no warm-up transient effects occur. A detailed description of the system configuration is shown in Table 1. The sharing architecture simplifications we have made and the set of benchmarks have been selected to represent the best possible case for fine-grain migrations. While most sharing schemes are not feasible with today's implementation techniques, the hypothetical set-ups aim to answer whether it makes sense to pursue fine-grain migrations in HMP systems, regardless of current physical limitations.

4.2 Migration Cost

For the first set of experiments, the simulated system runs the mibench benchmark suite [15] using a minimal build of Linux Ubuntu 14.04.5 (kernel version 3.14.0) as the operating system. This features microbenchmarks that have very different characteristics and demonstrate high variation in branch, memory, and integer ALU operations. At the same time, the used datasets can fit in caches most of the time, which should work further in favour of the sharing architectures. The relatively small size of the workloads also helps with the computationally expensive process of forking the simulator.

4.2.1 No sharing. In Figures 5a and 5b the performance of the migrating runs is presented, for the InO and OoO cores respectively, normalised to the main simulation's performance for every sample. A relative performance of 100% means that switches at that frequency would not cause a performance hit, whereas a relative performance of 50% means that on average migrating with that frequency halves the IPC. This effectively allows for a fair comparison between a system performing a migration and one that does not. The performance drop at fine granularity suggests that with private caches fast switching is not feasible, as the overheads dominate. This represents fairly accurately conventional architectures we use today, and shows why migrations fall into the category of millions of instructions, even without considering the considerable software overhead.

4.2.2 Share caches. When sharing all the caches but not the TLB, the performance of the migratory samples visibly increases. One thing to note in Figures 5c and 5d is that the performance improvement of the InO core is slightly larger than that of the OoO one at a 1k instruction migration period. The deeper pipeline and larger mis-speculation penalties seem to be causing a greater negative effect on the OoO core. Overall, the performance does not justify this complex design, especially for fine-grain switching.



(a) InO performance with no sharing. (b) OoO performance with no sharing.


(c) InO performance sharing the caches. (d) OoO performance sharing the caches.


(e) InO performance sharing the L2 and the TLB. (f) OoO performance sharing the L2 and the TLB.


(g) InO performance sharing caches and the TLB. (h) OoO performance sharing caches and the TLB.

Fig. 5. Relative performance degradation in mibench for different sharing schemes.

This is due to the fact that even though the caches are shared, TLB misses cause the system to potentially access main memory, which adds huge penalties.

4.2.3 Share caches and TLB. When the system shares all memory components (caches and TLB) another interesting result is revealed. While the InO core eliminates almost all of the overhead even at the finest migration frequency, the OoO design still exhibits significant performance losses. This exposes the inability of the OoO core to achieve its maximum potential, due to the deeper pipelines



and the re-order buffer that need to be filled, as well as the fact that OoO execution heavily relies on speculation and branch prediction to achieve high IPC.

Fig. 6. The average performance degradation for the two core types as a function of migration frequency ((a) InO core, (b) OoO core).

4.2.4 Our Proposal: Share the L2 and TLB. When sharing the L2 and the TLB but keeping the L1 caches private to each core, the results of the InO core show a notable improvement over the previous sharing scheme. Our proposal shows a marginal decrease in performance for the OoO core when performing fine grain migrations, but a significant improvement for the InO core. More importantly, sharing an extra TLB between the cores shows that even though the L1 caches are not shared, the overall system improves in performance over current architectures and performs on par with solutions that try to share all the resources. The improvement, as stated above, is due to the TLB sharing that helps to avoid accessing main memory to fetch data, taking care of the worst case when no pages are stored in the L2 cache. In the event where a TLB miss has occurred in the recent past, the page already exists in the L2, in which case the page walk for this system will be similar to the system that only shares caches. Figure 6 summarises the averages presented in Figure 5, which stresses the point that our proposal can achieve similar performance without resorting to sharing all resources. We see that the performance gap between our proposed method, depicted as the red dotted line with triangles, and the full sharing scheme is 17% for the OoO and 30% for the InO core. However, the design complexity of routing both cores to all the caches and the TLB is prohibitive. More importantly though, we observe that the migration frequency of the OoO core does not cause significant slowdown above 100k instructions. At this granularity, our proposal performs equally well without the added complexity. For HMP systems pairing OoO and InO, it makes more sense to have a simpler design sharing just the L2 and a TLB. Sharing the L1 caches adds more complexity, but does not allow the system to migrate any faster than 100k instructions, as the OoO overheads diminish any benefits beyond that point.

4.3 Efficient Migrations

As introduced in Section 3.2, the critical question for HMP systems is whether migrating from one core to another yields any benefits. Based on Equation 3, we find how much of each workload can be migrated to the opposite core, using the two policies mentioned in Section 3.2 to adjust the threshold value θ. We base our analysis on the fact that the InO core is 3x more energy efficient, an estimate in line with current HMP systems [21].


(a) Beneficial migrations from the InO to the OoO core with the no trade-off policy. (b) Beneficial migrations from the OoO to the InO core with the no trade-off policy. (c) Beneficial migrations from the InO to the OoO core with the equal or better EDP policy. (d) Beneficial migrations from the OoO to the InO core with the equal or better EDP policy.

Fig. 7. Percentage of beneficial migrations when being energy and EDP constrained.

We present in Figure 7 the two extreme cases of sharing for the no trade-off and EDP policies mentioned in Section 3. The percentage in these plots signifies how much of the workload leads to a more desirable state. The solid coloured bars depict the benefits averaged across all workloads, while the shaded ones show the workload with the maximum benefit. When using the no trade-off policy, a system on the InO core will migrate only when executing the same phase on the OoO core costs equal or less energy. For this to happen the big core must perform 3x better than the little one, so the threshold is adjusted to θ = 3. In Figure 7a we see that, irrespective of how much hardware the cores share, the cases where this is beneficial are less than 1% on average and at most 10% of the total execution, for specific workloads. When considering the switch from OoO to InO with the no trade-off policy, the samples are considered beneficial when the InO core performs equally as well as or better than the OoO core (θ = 1). This results in the same overall performance for only a third of the power. This case can happen when the OoO core cannot exploit any ILP or MLP and is reduced to InO performance. Figure 7b shows that, similarly to the previous result, the beneficial cases are at most 1% on average and 6% of the total execution for a few benchmarks. While these cases are few, the no trade-off policy reveals that the benefits can mostly be exploited at migration periods of 10k and 100k instructions. This reveals that micro-phases most probably exist. However, these remain latent at any granularity coarser than 100k instructions, as the phases average out, while migrating every 1k instructions yields no benefits as the overheads mask micro-phases. For equal or better EDP our metric values performance more, as EDP = Power × Delay². The amount of beneficial switches when migrating from the InO to the OoO core for this policy is calculated using a threshold value of θ = √3. The results show that the potential for switching and maintaining equal or better EDP in this case ranges between 48% and 66% when migrating

every 100k, 1M and 10M instructions. When performing the opposite switch with the EDP policy (θ = 1/√3), the results show that when aiming for the best EDP the OoO core is more suitable, as the beneficial cases on average for all the workloads do not exceed 31% of the execution. For both EDP experiments, migration periods larger than 100k instructions are less desirable, as the gains are significantly smaller. Additionally, we observe that in most cases sharing the caches and the TLB is not necessary, as both sharing schemes show similar potential for useful migrations. Furthermore, in terms of the maximum of the beneficial migrations, we notice that for periods larger than 1M instructions both the InO and OoO core show that some workloads are better off running on the opposite core.
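As a worked example of the thresholds used in this section, the short sketch below derives them from the assumed 3x power ratio and EDP = Power × Delay²; it is an illustration, not part of the original evaluation.

```python
import math

POWER_RATIO = 3.0  # P_OoO / P_InO, the power ratio assumed in the text

# No trade-off policy: equal or less energy (InO -> OoO) or no slowdown (OoO -> InO).
theta_no_tradeoff_ino_to_ooo = POWER_RATIO            # 3.0
theta_no_tradeoff_ooo_to_ino = 1.0

# EDP policy: with EDP = P * D^2, equal or better EDP needs a sqrt(3) speedup
# when moving to the 3x more power-hungry core, and vice versa.
theta_edp_ino_to_ooo = math.sqrt(POWER_RATIO)          # ~1.73
theta_edp_ooo_to_ino = 1.0 / math.sqrt(POWER_RATIO)    # ~0.58

print(theta_edp_ino_to_ooo, theta_edp_ooo_to_ino)
```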

5 RELATED WORK

While the problem of maximizing efficiency in HMPs through core migrations has been researched in the past, the effect of migrations on the system is, to the best of our knowledge, still misunderstood, especially in the realm of fine-grained migrations. Some studies have explored the notion of migrating execution amongst heterogeneous cores that takes into account the effect of the memory system [9][16]. Improvement is limited to relatively coarse-grained switching frequencies and modifications are required for the system to achieve them. Other efforts aim at redesigning accelerator cores that can adapt better to the applications [23]. These accelerator type cores are determined through a performance study conducted at coarse instruction granularity. Furthermore, the redesigned cores also operate on coarse application phases rather than periodically deciding when to switch. A similar approach [18] has been explored where the core can morph to exploit MLP and ILP in the workloads. Furthermore, this HMP architecture has the potential for fine-grained migrations, although this has not been fully explored, to the best of our knowledge. Other works [30] have recently examined this approach. Prior work on the existence of phases at fine granularity and their potential micromanagement has shown that it is possible to exploit their benefits through a heterogeneous system with different back-ends sharing the same front end [12, 20–22]. Despite the claim, no experimental data is provided that reveals the actual existence of application phases that appear at a fine granularity, nor are the circumstances under which they can be exploited explained. In [13, 26], the authors have implemented an HMP system sharing the memory system and the register files amongst the heterogeneous cores, claiming under 100 cycle migration latency in actual hardware. They propose three dimensional stacked cores to deal with the wiring and routing problem. One interesting claim made is that the register file can be transferred with a very small overhead (one cycle), which points towards plausible future designs with high sharing. However, the system they propose has certain assumptions that we consider problematic. Firstly, the cores used for sharing are two OoO designs of different width, but using up roughly the same area. In their proposed system the similarity of the two cores makes migrations pointless both in terms of energy efficiency and performance. Additionally, no direct information is given about the operating frequency of the system; however, the reported cycle time is 5.5ns, which implies a frequency of 181MHz, which would undeniably drastically degrade performance. We believe this is indicative of the performance degradation incurred when sharing sensitive components like the register file, the L1 caches and the TLB, as such approaches either add extra processing cycles or clock the system at extremely low frequencies to deal with the extra wiring and routing overheads.


In terms of HMP scheduling investigation, several studies examine whether it is possible to augment the architecture and make it more aware of the heterogeneity to exploit the benefits of fast and low overhead migrations [24, 29]. In the past, studies similar to those of fine-grained thread migrations focused primarily on leveraging homogeneous cores that operate at different energy and performance points by changing the voltage and frequency domains [5]. Significant evidence shows that fine-grained Dynamic Voltage and Frequency Scaling (DVFS) has received mixed results in terms of energy savings, with some claiming notable gains in memory intensive phases [20, 25], while others refute this, noting that, at fine-grain DVFS, the switching latency reduces performance and increases energy consumption [11, 34].

6 CONCLUSIONS

In this work we have proposed Nucleus, a full system simulation methodology using gem5, to study the effects of migrations in heterogeneous systems. Our methodology accomplishes this by forking the simulation, comparing a system that migrates with one that does not. We show that Nucleus can be used in various ways, exploring different aspects of migrations. For instance, it can be used to measure the performance degradation across different migration frequencies. Alternately, we propose using it to calculate how much of each workload can be migrated to a different core to gain in energy or performance. We use our methodology to focus on identifying how much sharing of internal components (e.g. caches, TLB, register file) is most desirable for migrations as fine as 1k instructions. We show that our methodology is well suited to measure the hardware overheads, as it isolates them from the software overheads. Our results show that the two cores have asymmetric warm-up times and behaviour, where the OoO core needs larger instruction windows to amortize the cost of migration than the InO core. Overall this prohibits the beneficial use of migration frequencies finer than 100k instructions. Furthermore, we propose an architecture sharing only the L2 and the TLB to enable more efficient migrations. Compared to other sharing schemes, we find that our design performs equally as well as designs that share all the caches and the TLBs, but can be physically much simpler to implement. To demonstrate the capabilities of Nucleus, we perform a case study running mibench, a suite of diverse embedded benchmarks, on a heterogeneous system. The results show that, even when the overheads are not prohibitive, migrating to save energy without sacrificing performance or vice versa is limited to at most 10% of the execution. Relaxing the policy to find equal or better EDP points of operation shows that migrations can be beneficial on average for 66% of the workload execution. Currently our methodology is limited to exploring only first-order migrations. This prohibits us from performing studies using asymmetrical switching, where the little cores can migrate at higher frequencies than the big cores, as our study suggests. Our aim is to address this in future research, where some of the state can be retained on the big core while it is switched off. Additionally, we plan to expand to higher-order migrations, which will enable our methodology to explore more of the state space for patterns in execution that, when identified, can lead to better scheduling of heterogeneous systems. Finally, as the results of our proposed architecture show promise, we believe that it is worth investigating HMPs with hierarchical TLBs. Such systems can be more beneficial in terms of latency restrictions, as the level 1 TLB can be private and the level 2 TLB shared. Similar approaches have been mentioned in the past for homogeneous multiprocessors [2, 3], but need additional investigation in solving coherency issues in TLBs.


ACKNOWLEDGMENTS

This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/K034448/1 “PRiME: Power-efficient, Reliable, Many-core Embedded systems” (www.prime-project.org). Experimental data used is available at DOI:10.5258/SOTON/D0161. This research was supported by the people of ARM Research. We thank our colleagues for their valuable feedback that greatly assisted us in delivering a better version of this work.

REFERENCES [1] ARM. 2013. big.LITTLE Technology: Œe Future of Mobile. ARM white paper (2013), 12. hŠps://www.arm.com/€les/ pdf/big LITTLE Technology the Future of Mobile.pdf [2] A. BhaŠacharjee, D. Lustig, and M. Martonosi. 2011. Shared last-level TLBs for chip multiprocessors. In 2011 IEEE 17th International Symposium on High Performance Computer Architecture. 62–63. hŠps://doi.org/10.1109/HPCA.2011. 5749717 [3] A. BhaŠacharjee and M. Martonosi. 2009. Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors. In 2009 18th International Conference on Parallel Architectures and Compilation Techniques. 29–40. hŠps://doi.org/10.1109/PACT.2009.26 [4] Nathan Binkert, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. M.D. D Hill, David A. D.A. A Wood, Bradford Beckmann, Gabriel Black, Steven K. S.K. K Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, A. Basil, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. M.D. D Hill, and David A. D.A. A Wood. 2011. Œe gem5 Simulator. Computer Architecture News 39, 2 (2011), 1. hŠps://doi.org/10.1145/2024716.2024718 [5] Kihwan Choi, Ramakrishna Soma, and Massoud Pedram. 2005. Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeo‚ based on the ratio of o‚-chip access to on-chip computation times. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24. 18–28. hŠps://doi.org/10.1109/ TCAD.2004.839485 [6] Hongsuk Chung, Munsik Kang, and Hyun-Duk Cho. 2013. Heterogeneous Multi-Processing Solution of 5 Octa with ARM ® big.LITTLE ™ Technology. (2013), 1–8. hŠps://www.arm.com/€les/pdf/Heterogeneous Multi Processing Solution of Exynos 5 Octa with ARM bigLITTLE Technology.pdf [7] Œeofanis Constantinou, Yiannakis Sazeides, Pierre Michaud, Damien Fetis, Andre Seznec, and Irisa Inria. 2005. Performance implications of single thread migration on a chip multi-core. SIGARCH Comput. Archit. News 33, 4 (nov 2005), 80–91. hŠps://doi.org/10.1145/1105734.1105745 [8] Robert H. Dennard, Jin Cai, and Arvind Kumar. 2007. A perspective on today’s scaling challenges and possible future directions. Solid-State Electronics 51, 4 SPEC. ISS. (2007), 518–525. hŠps://doi.org/10.1016/j.sse.2007.02.004 [9] MaŠhew DeVuyst, Ashish Venkat, and Dean M. Tullsen. 2012. Execution migration in a heterogeneous-ISA chip multiprocessor. In ACM SIGARCH Computer Architecture News, Vol. 40. ACM Press, New York, New York, USA, 261. hŠps://doi.org/10.1145/2189750.2151004 [10] Hadi Esmaeilzadeh, Emily Blem, Renee´ St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2012. Dark silicon and the end of multicore scaling. IEEE Micro 32, 3 (2012), 122–134. hŠps://doi.org/10.1109/MM.2012.17 [11] Stijn Eyerman and Lieven Eeckhout. 2011. Fine-grained DVFS Using On-chip Regulators. ACM Trans. Archit. Code Optim. 8, 1 (2011), 1:1–1:24. hŠps://doi.org/10.1145/1952998.1952999 [12] Chris Fallin, Chris Wilkerson, and Onur Mutlu. 2014. Œe heterogeneous block architecture. In 2014 32nd IEEE International Conference on Computer Design, ICCD 2014, Vol. -. 386–393. hŠps://doi.org/10.1109/ICCD.2014.6974710 [13] EllioŠ Forbes, Zhenqian Zhang, Randy Widialaksono, Brandon Dwiel, Rangeen Basu Roy Chowdhury, Vinesh Srinivasan, Steve Lipa, Eric Rotenberg, W. RheŠ Davis, and Paul D. Franzon. 2016. 
Under 100-cycle thread migration latency in a single-ISA heterogeneous multi-core processor. In 2015 IEEE Hot Chips 27 Symposium, HCS 2015. IEEE, 1–1. hŠps://doi.org/10.1109/HOTCHIPS.2015.7477478 [14] Ronald L Graham, Donald E Knuth, and Oren Patashnik. 1989. Concrete Mathematics: A Foundation for Computer Science. Vol. 2. xiii + 625 pages. hŠps://doi.org/10.2307/3619021 arXiv:arXiv:1011.1669v3 [15] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. 2001. MiBench: A free, commercially representative embedded benchmark suite. In 2001 IEEE International Workshop on Workload Characterization, WWC 2001. IEEE, 3–14. hŠps://doi.org/10.1109/WWC.2001.990739 [16] Anthony Gutierrez, Ronald G. Dreslinski, and Trevor Mudge. 2014. Evaluating private vs. shared last-level caches for energy eciency in asymmetric multi-cores. In Proceedings - International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2014, Vol. -. 191–198. hŠps://doi.org/10.1109/SAMOS.2014.6893211


[17] H. Homayoun. 2016. Heterogeneous chip multiprocessor architectures for big data applications. 2016 ACM International Conference on Computing Frontiers - Proceedings (2016), 400-405. https://doi.org/10.1145/2903150.2908078
[18] Khubaib. 2014. Performance and Energy Efficiency via an Adaptive MorphCore Architecture. Ph.D. Dissertation. University of Texas at Austin.
[19] Chuanpeng Li, Chen Ding, and Kai Shen. 2007. Quantifying the cost of context switch. In Proceedings of the 2007 workshop on Experimental computer science - ExpCS '07. https://doi.org/10.1145/1281700.1281702
[20] Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. 2014. Heterogeneous microarchitectures trump voltage scaling for low-power cores. Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14 (2014), 237-250. https://doi.org/10.1145/2628071.2628078
[21] Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. 2012. Composite cores: Pushing heterogeneity into a core. In Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012. 317-328. https://doi.org/10.1109/MICRO.2012.37
[22] Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald G. Dreslinski, Thomas F. Wenisch, and Scott Mahlke. 2016. Exploring fine-grained heterogeneity with composite cores. IEEE Trans. Comput. 65, 2 (2016), 535-547. https://doi.org/10.1109/TC.2015.2419669
[23] Sandeep Navada, Niket K. Choudhary, Salil V. Wadhavkar, and Eric Rotenberg. 2013. A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. IEEE, 133-144. https://doi.org/10.1109/PACT.2013.6618811
[24] Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, and Scott Mahlke. 2015. DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores. In MICRO '15. 322-333. https://doi.org/10.1145/2830772.2830791
[25] Krishna K. Rangan, Gu-Yeon Wei, and David Brooks. 2009. Thread Motion: Fine-Grained Power Management for Multi-Core Systems. Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09 (2009), 302. https://doi.org/10.1145/1555754.1555793
[26] Eric Rotenberg, Brandon H. Dwiel, Elliott Forbes, Zhenqian Zhang, Randy Widialaksono, Rangeen Basu Roy Chowdhury, Nyunyi Tshibangu, Steve Lipa, W. Rhett Davis, and Paul D. Franzon. 2013. Rationale for a 3D heterogeneous multi-core processor. In 2013 IEEE 31st International Conference on Computer Design, ICCD 2013. IEEE, 154-168. https://doi.org/10.1109/ICCD.2013.6657038
[27] Roxana Rusitoru. 2015. ARMv8 micro-architectural design space exploration for high performance computing using fractional factorial. In PMBS@SC.
[28] Andreas Sandberg, Nikos Nikoleris, Trevor E. Carlson, Erik Hagersten, Stefanos Kaxiras, and David Black-Schaffer. 2015. Full speed ahead: Detailed architectural simulation at near-native speed. In Proceedings - 2015 IEEE International Symposium on Workload Characterization, IISWC 2015. IEEE, 183-192. https://doi.org/10.1109/IISWC.2015.29
[29] Daniel Shelepov, Juan Carlos Saez Alcaide, Stacey Jeffery, Alexandra Fedorova, Nestor Perez, Zhi Feng Huang, Sergey Blagodurov, and Viren Kumar. 2009. HASS: A Scheduler for Heterogeneous Multicore Systems. ACM SIGOPS Operating Systems Review 43, 2 (2009), 66. https://doi.org/10.1145/1531793.1531804
[30] Sudarshan Srinivasan, Nithesh Kurella, Israel Koren, and Sandip Kundu. 2016. Exploring heterogeneity within a core for improved power efficiency. IEEE Transactions on Parallel and Distributed Systems 27, 4 (2016), 1057-1069. https://doi.org/10.1109/TPDS.2015.2430861
[31] D. Sunwoo, W. Wang, M. Ghosh, C. Sudanthi, G. Blake, C. D. Emmons, and N. C. Paver. 2013. A structured approach to the simulation, analysis and characterization of applications. In 2013 IEEE International Symposium on Workload Characterization (IISWC). 113-122. https://doi.org/10.1109/IISWC.2013.6704677
[32] Dan Tsafrir. 2007. The Context-Switch Overhead Inflicted by Hardware Interrupts (and the Enigma of Do-Nothing Loops). ExpCS (2007), 13-14. https://pdfs.semanticscholar.org/86f8/a42a44b82cf76dcfe023209cfa4cdc0c8981.pdf
[33] Violaine Villebonnet, Georges Da Costa, Laurent Lefevre, Jean-Marc Pierson, and Patricia Stolf. 2015. "Big, Medium, Little": Reaching Energy Proportionality with Scheduler. Parallel Processing Letters 25, 03 (Sep 2015), 30. https://doi.org/10.1142/S0129626415410066
[34] Fen Xie, Margaret Martonosi, and Sharad Malik. 2005. Efficient behavior-driven runtime dynamic voltage scaling policies. Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis - CODES+ISSS '05 (2005), 105. https://doi.org/10.1145/1084834.1084864
[35] Kisoo Yu, Donghee Han, Changhwan Youn, Seungkon Hwang, and Jaechul Lee. 2013. Power-aware task scheduling for big.LITTLE. In ISOCC 2013 - 2013 International SoC Design Conference. IEEE, 208-212. https://doi.org/10.1109/ISOCC.2013.6864009

Received May 2017; revised June 2017; accepted July 2017

BRB: Mitigating Branch Predictor Side-Channels

Ilias Vougioukas, Nikos Nikoleris, Bashir M. Al-Hashimi, Geoff V. Merrett (University of Southampton); Andreas Sandberg, Stephan Diestelhorst (Arm Research)

Abstract—Modern processors use branch prediction as an optimization to improve processor performance. Predictors have become larger and increasingly more sophisticated in order to achieve the higher accuracies needed in high performance cores. However, branch prediction can also be a source of side-channel exploits, as one context can deliberately change the branch predictor state and alter the instruction flow of another context. Current mitigation techniques either sacrifice performance for security, or fail to guarantee isolation when retaining the accuracy; achieving both has proven to be challenging. In this work we address this by (1) introducing the notions of steady-state and transient branch predictor accuracy, and (2) showing that current predictors increase their misprediction rate by as much as 90% on average when forced to flush branch prediction state to remain secure. To solve this, (3) we introduce the branch retention buffer, a novel mechanism that partitions only the most useful branch predictor components to isolate separate contexts. Our mechanism makes thread isolation practical, as it stops the predictor from executing cold, with little if any added area and no warm-up overheads. At the same time, our results show that, compared to the state-of-the-art, average misprediction rates are reduced by 15-20% without increasing area, leading to a 2% performance increase.

Keywords—Branch prediction, branch retention buffer, TAGE, side-channel attacks, Perceptron, microarchitecture, context switching.

Figure 1: Comparison of current designs (TAGE, Perceptron) with our predictor that preserves a minimal state between flushes, plotted as kilo-instructions per misprediction against branch instructions per flush. These flushes can be triggered to avoid side-channel attacks that use the branch predictor.

I. INTRODUCTION

Branch prediction contributes to high performance in modern processors with deep pipelines by enabling accurate speculation. Since the inception of the idea of speculative execution, the improvement has gradually increased overall processor performance, as Branch Predictor (BP) designs have steadily become more sophisticated and more complex. Initially, accuracy was the dominant factor for predictor design; however, as power became a limiting factor, designs had to take more constraints into account. Today accuracy is still the primary goal, but power and area are also critical.

Systems in general have become significantly more complex today, featuring multiple types of cores and accelerators on a single die. Software, taking advantage of the aforementioned improvements, has also changed, enabling more applications to be handled simultaneously. Applications today are often multi-threaded, and systems context switch (CS) frequently between processes.

However, recently discovered vulnerabilities in microarchitecture permit side-channel attacks that manipulate the branch predictor [1]. For many of these exploits context switching is a critical enabler, and they are hard to overcome. Using narrow, "hot" fixes for separate security issues that use side-channel attacks does not guarantee that systems are safe from newer variations, whereas scrubbing the entire branch predictor state as a "heavy" approach can have a significant impact on performance. Addressing vulnerabilities in software after discovery is often not possible, and usually even more costly in terms of performance.

The above observations motivate us to design future systems that take into account one additional design constraint: being able to speculate securely without allowing information to leak. Duplicating or partitioning BP state to satisfy isolation is costly in terms of area. Flushing, on the other hand, inevitably causes frequent disruption of the branch predictor's operation and inhibits effective warm-up. This can be described by distinguishing between steady-state and transient predictor accuracy, as shown in Figure 1.

To prevent the aforementioned branch predictor side-channels, our proposal guarantees branch predictor state isolation efficiently, without degrading performance. In detail, our contributions are:

• We introduce the notion of transient prediction accuracy and its relevance to designing future side-channel-free branch predictors, and show that it differs from the steady-state accuracy with which designs are evaluated today.
• We show that current state-of-the-art branch predictors perform notably worse in the transient state when flushed for security. We perform an in-depth analysis of the TAGE predictor [2], and show that large components do not contribute to the prediction accuracy under these new circumstances.
• To solve this, we propose the Branch Retention Buffer (BRB), a novel mechanism that reduces cold-start effects by preserving partial branch predictor state per context. Compared to the state-of-the-art, our design achieves the same high accuracy at steady-state and improved transient accuracy, and ensures branch predictor state isolation without increasing the overall area.

II. BACKGROUND

Due to speculative execution, branch predictors are a critical part of modern core design, drastically increasing processor performance. However, mitigation techniques for recently discovered branch predictor side-channels notably degrade predictor accuracy and reduce overall system performance.

A. Speculative side-channels

Recent studies have shown that side-channel attacks leaking sensitive data are possible in most contemporary CPU designs [3, 1, 4, 5]. These exploits take advantage of hardware oversights at design time, leaving the system vulnerable to code that can cause information to leak outside its defined scope. While such attacks target various components like the caches and the DRAM, we focus on those that are based on exploiting vulnerabilities in branch prediction.

1) Branch predictor side-channels: Spectre [1] class attacks target branch predictor components in a variety of ways. We focus on one specific case (variant 2) that exploits conditional branch misprediction to allow malicious code (a gadget) to be executed. That code uses flush-and-reload type timing attacks on the caches to leak information [6, 7]. For this attack to be plausible, some knowledge of the microarchitectural behavior is needed to influence the predictor to misspeculate and execute the gadget. With that, an attacker can poison the predictor entries and guarantee the necessary misprediction. The attack can also be triggered in cases where the branch mispredicts without altering the branch predictor state, although this is significantly harder to orchestrate.

Similar to how Spectre variant 2 uses the branch predictor to leak information, BranchScope [4] uses a related mechanism that targets the Branch Target Buffer (BTB) in order to influence the Pattern History Table (PHT) of the directional branch predictor. In this attack scheme, the predictor is primed so that it is in a predefined state the attacker can control. From this starting point, the target code is triggered to execute the victim code, which will observably change the PHT state. A simple probe can then calculate the branch flow based on the primed branch predictor conditions. The technique has been shown to successfully leak information from a secure SGX enclave in Intel processors [4].

The potential scenarios to exploit such attacks are numerous. Any transition from one process to another, or from a process to/from the kernel, is a potential point of vulnerability, as any context switch or system call can be used to let malicious software "hijack" the branch predictor and leak information.

2) Mitigation techniques: On one hand, software and firmware mitigation techniques are often hard to implement and induce high overheads. Recent studies [8] measure the actual performance loss for Spectre and Meltdown vulnerabilities and find that it ranges from 15% to 90%. While some mitigations for the known exploits have been deployed, potential unknown exploits can still find ways to exploit shared components such as the caches, the branch prediction logic, and the translation tables, as the fixes are topical and do not guarantee security. Furthermore, some of these vulnerabilities (i.e. Spectre v1, BranchScope) cannot be addressed in software or microcode and require hardware redesign.

On the other hand, hardware isolation is a sufficient measure to ensure no leaks occur, but often comes at the expense of performance. Cache side-channels are extremely difficult to address without significant performance loss or without sacrificing a lot of area. Creating shadow structures to keep cache and TLB speculative state, proposed in [9], isolates the information in the caches and protects against Spectre and Meltdown type attacks. The size of the shadow structures can be very costly, especially in systems with much larger caches and TLBs than the ones evaluated in that study. Similarly, we aim to isolate the branch predictor state to achieve similar security properties for some of the exploits without increasing the area or degrading the performance.

For branch prediction, clearing the branch predictor of any state on each context switch can take care of attacks that manipulate or eavesdrop on control flow to access data stored in the caches. However, flushing the entire state results in a significant accuracy drop with every context switch. Other alternatives, such as hard partitioning, can negatively impact the steady-state accuracy of the predictor, as the effective size per context is reduced. Tagging the BP entries is also not an acceptable solution, because in the worst case it behaves similarly to flushing when most of the entries have been replaced by another context. More importantly, while using tags eliminates branch predictor entry poisoning, it still allows observation of active entries, which can be flushed to cause deliberate mispredicts and data leaks.

A more secure BP design that uses isolation, can train efficiently, and quickly reaches its steady-state performance will be useful in a post Meltdown and Spectre world [5]. Taking all of the above into account, current and future branch predictors need to be able to protect from potential side channels without significant performance overhead when context switching frequently.

3) Threat model: For this work our threat model assumes a victim and an attacker application, where the attacker tries to infer victim information without having the authority to access it directly. In our model we assume:

• Both the victim and the attacker reside on the same core and share the same BP, as described in [1, 4].
• Slowdown of the victim execution, so that the behavior of a single branch can be detected. This has also been proven to be possible in recent studies [1, 4, 10].
• The ability of the attacker to force the victim code to execute, allowing vulnerable code to be targeted.
• The attacker having the ability to poison BP entries used by the victim application, forcing a misprediction.
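To make the "heavy" mitigation concrete, the sketch below shows how a simulator or firmware hook might scrub all predictor state on every context switch. The interface names (BranchPredictorState, flushAll, onContextSwitch) are hypothetical and only illustrate the idea; they are not the mechanism proposed in this chapter, nor a real OS or gem5 API.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical predictor interface, used only to illustrate the "heavy"
// mitigation of scrubbing all prediction state on every context switch.
struct BranchPredictorState {
    std::vector<int8_t> tables;   // all counters/weights, flattened
    void flushAll() { std::fill(tables.begin(), tables.end(), 0); }
};

// Hook invoked when the address space changes (e.g. by OS or firmware).
void onContextSwitch(BranchPredictorState& bp, uint64_t /*newASID*/) {
    bp.flushAll();   // isolation is guaranteed, but every context restarts
                     // cold, which is the transient-accuracy penalty
                     // quantified in the following sections
}
```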

B. Branch prediction design

Branch prediction has evolved over the years [11, 12] from small and simple designs to large, complex structures storing long histories of control flow. Here, we focus on the two most common designs used today as the basis of our study: the latest TAGE predictor [13], and the Multiperspective Perceptron predictor [14].

TAGE-based predictors: The TAGE predictor is one of the most accurate designs. It uses tagged geometric history lengths that capture correlation from remote branch outcomes and recent history [2]. Internally, TAGE is comprised of tables that store the information for different history lengths. In short, when a prediction is needed, TAGE searches for the matching entry belonging to the table with the longest history. If no match is found, it uses its base predictor, a bimodal design, as a fall-back mechanism.

Figure 2: An abstract representation of TAGE.

The TAGE design has been improved over the years, incorporating other small components in order to further improve its accuracy in cases where the original design was shown to frequently mispredict. For that reason, the latest published version is the TAGE-SC-L predictor, which also incorporates a statistical corrector and a loop predictor. In this paper we use the latest version, presented in Figure 2 and described in [13], as our representative example of TAGE-type predictors.

Perceptron-type predictors: The other popular design that is widespread today is the Perceptron predictor [15]. Loosely based on neural network theory, Perceptron-type predictors achieve high accuracy from efficiently stored state. As with the TAGE predictor, we use the latest variant from the 5th Championship Branch Prediction (CBP5) [16], which achieves higher accuracy but with more complexity [17, 14].

Figure 3: An abstract representation of the Multiperspective Perceptron.

The principle behind Perceptron, as shown in Figure 3, is to use a table with sets of weights that are multiplied with the history bits. The output and confidence of the predictor depend on the sign of the sum of all the history × weight products; the confidence can be deduced from the magnitude of the value. Modern versions of Perceptron have reduced the amount of calculation required for each prediction and use multiple hashed tables of weights that are indexed using both the history and the branch address [18]. These tables are commonly referred to as feature tables [14].

III. PREDICTOR FLEXIBILITY

Branch predictors have been evaluated based on how accurate their predictions are, given a reasonable amount of history to learn from. However, given the recent findings described in the previous section, in many real world cases they operate in a time-frame much shorter than the ideal, predicting from only a partially warm or cold state. This happens because, in order to fully guarantee isolation of BP state, the predictor needs to be flushed between context switches, causing a notable accuracy drop.
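As a concrete illustration of the perceptron-style prediction described above, the following sketch computes a prediction as the sign of a weighted sum over hashed feature tables, with the magnitude acting as confidence. The table counts, sizes and hash are illustrative assumptions, not the CBP5 implementation.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

// A minimal perceptron-style predictor sketch (not the CBP5 code):
// each feature table holds small signed weights indexed by a hash of
// the branch address and some view of the branch history.
constexpr int kTables  = 8;     // number of "feature" tables (assumed)
constexpr int kEntries = 160;   // weights per table (assumed)

struct FeatureTable {
    std::array<int8_t, kEntries> w{};
    int index(uint64_t pc, uint64_t hist) const {
        return static_cast<int>((pc ^ hist) % kEntries);  // toy hash
    }
};

struct PerceptronSketch {
    std::array<FeatureTable, kTables> tables{};

    // Prediction: sign of the sum of the selected weights.
    // |sum| acts as the confidence of the prediction.
    bool predict(uint64_t pc, const std::array<uint64_t, kTables>& hist,
                 int& confidence) const {
        int sum = 0;
        for (int t = 0; t < kTables; ++t)
            sum += tables[t].w[tables[t].index(pc, hist[t])];
        confidence = std::abs(sum);
        return sum >= 0;   // taken if the weighted sum is non-negative
    }
};
```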


Figure 4: Contribution to MPKI for each TAGE64 component across different flushing periods.

We mitigate these overheads in our BRB design by identifying the components that contribute the most to accuracy during the warm-up phase and creating a mechanism to retain their state per context – effectively replicating them. The rest of the predictor, which only contributes to long-term accuracy, is instead flushed, preventing any state leakage. Our proposal addresses the overall security-performance trade-off by guaranteeing isolation while also increasing accuracy for frequent switches.

Large predictors that store more information are usually able to deliver better predictions. However, if the state is lost or invalidated before the predictor has had time to warm up, it will effectively never reach peak performance. In this case, a smaller predictor might be able to deliver equivalent accuracy for a fraction of the state.

A. Steady-state and transient accuracy

We therefore distinguish between steady-state and transient accuracy. The term steady-state accuracy refers to the performance of the branch predictor when it is fully warm and has reached its highest accuracy. Conversely, transient accuracy describes the behavior during the warm-up phase.

To quantify transient accuracy, we flush the branch predictor state across different branch instruction periods and track the change in average mispredictions per kilo-instruction (MPKI). As we show in Figure 1, depending on how frequently the state is flushed, the actual (transient) accuracy can be significantly worse than the nominal, steady-state accuracy we normally evaluate against.

Our study examines the TAGE and Perceptron predictors, which are heavily modular. Next, we dissect their behavior in situations where their state gets frequently disrupted.

Dissecting TAGE: TAGE uses a bimodal base predictor, 12 tables for the main TAGE predictor, a statistical corrector, and a loop predictor. These components vary in size and in how much they affect the overall accuracy. We assess their contribution to the prediction as the predictor warms up for different frequencies of state flushing.

We use TAGE-SC-L to analyse how each of the components contributes to the accuracy when in a transient state. For that, we measure the reduction in the overall MPKI caused by each component when preserving its state between flushes. In Figure 4 we track the size needed to retain the components in each configuration and the relative MPKI, normalized to the steady-state accuracy of TAGE. The analysis in the figure shows the transient state for a flushing period ranging from 20k to 20M branch instructions. For instance, it shows that, for 20k branches, flushing the entire predictor increases the MPKI by 90%.

Additionally, Figure 4 shows that the majority of the accuracy is delivered by the TAGE tables, which are too large to preserve. We note that for flushing every 20k branches, the bimodal base predictor and the statistical corrector improve the MPKI by 16% and 10% respectively. However, as the bimodal needs only 1.25kB of state to be preserved, compared to the 8kB of the statistical corrector, its MPKI improvement per area cost is much higher than that of the statistical corrector.

Dissecting Perceptron: The evaluated Multiperspective Perceptron design uses multiple feature tables to deliver a prediction. Based on previous studies [14], the features that contribute the most to increased predictor accuracy are identified. The most prominent ones are:

• Global History: The outcome of a branch (taken or not taken). A subset of the entire history is taken for a specific feature.
• Path: A hash of the recent sequence of branch addresses. The entries use truncations of the actual branch addresses to save space.
• Recency: Similar to Path, this feature keeps a stack of recently encountered branches (using an LRU policy) and hashes them.

For our experiments we use the above features in various configurations, separately or hashed in combination (i.e. Global History XOR Path). Other features can be used to improve the prediction; however, we find that they have limited effect on the overall prediction.
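A minimal sketch of the flushing experiment described in this section is given below: predictor state is cleared every N branch instructions while the misprediction count is accumulated, yielding the transient MPKI for that flushing period. The trace format, the stand-in predictor and the assumed branch density are placeholders for the modified CBP5 framework, not its actual API.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical trace record; the real CBP5 framework uses its own format.
struct BranchRecord { uint64_t pc; bool taken; };

// A trivial stand-in predictor so the sketch is self-contained.
struct BimodalSketch {
    std::array<uint8_t, 4096> ctr{};                       // 2-bit counters
    bool predict(uint64_t pc) const { return ctr[pc % ctr.size()] >= 2; }
    void update(uint64_t pc, bool taken) {
        uint8_t& c = ctr[pc % ctr.size()];
        if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
    }
    void flush() { ctr.fill(0); }                          // full state loss
};

// Replays a trace, flushing every `flushPeriod` branches, and returns the
// transient MPKI for that flushing period (branch density is an assumption).
template <typename Predictor>
double transientMPKI(Predictor bp, const std::vector<BranchRecord>& trace,
                     uint64_t flushPeriod, double instsPerBranch = 5.0) {
    uint64_t branches = 0, mispredicts = 0;
    for (const auto& br : trace) {
        if (flushPeriod && branches && branches % flushPeriod == 0)
            bp.flush();                                    // emulate a context switch
        mispredicts += (bp.predict(br.pc) != br.taken);
        bp.update(br.pc, br.taken);
        ++branches;
    }
    const double kiloInsts = branches * instsPerBranch / 1000.0;
    return kiloInsts > 0.0 ? mispredicts / kiloInsts : 0.0;
}
```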

Figure 5: The mechanism which stores the essential state of the predictor. Most of the state is disregarded and only the most useful state is kept to increase transient accuracy.

Figure 6: Diagram of the BRB. The retained state is stored in separate SRAM banks which correspond to different contexts; a small CAM maps the Address Space ID to the current entry.

B. The Branch Retention Buffer

The predictor analysis shows that retaining even a small amount of state per context could improve the transient accuracy of the predictor, without hurting the maximum accuracy achieved during long uninterrupted execution. To guarantee isolation and simultaneously improve transient accuracy, we propose a mechanism (Figure 5) where the state of a branch predictor component is retained per context. The mechanism uses a dedicated component called the branch retention buffer (BRB) that replaces part of the predictor.

The branch retention buffer is tightly coupled with the branch predictor and keeps multiple separate entries. Ideally, storing the entire state would enable high accuracy without any warm-up time. However, the size of today's designs is too large to store fully without incurring notable overheads. To reduce the additional storage costs, we aim to make this component as small as possible. We find from Figure 4 that, for TAGE, 10kbit (1.25kB) entries are ideal, as they match the size of the base predictor for both the 8kB and 64kB variants. As such, the entries are designed to store entire components that can deliver a standalone prediction. In short, we propose a low overhead mechanism that retains partial state and delivers better accuracy than a large "cold" predictor, with no steady-state accuracy penalty.

Figure 7: TAGE breakdown at 20k branch instructions: contribution to MPKI for each 64kB TAGE component that is retained (normalized MPKI against size of the retained components). The bimodal reduces the MPKI by 16% while the statistical corrector reduces it by 10%, despite being 8x larger.

We implement the BRB to allow for easy switching between entries without moving or copying data, and thus with no added latency during execution. This is done by keeping multiple entries in separate SRAM banks. When a switch occurs, the entry corresponding to the correct context is selected and used, instead of swapping data out of the BRB. The selection uses the Address Space ID and is triggered only when a context switch occurs. A small CAM is used to map to the correct BRB entry ID, as shown in Figure 6. When a context switch occurs, any delay in the rerouting of BRB entries is masked by the larger overhead of storing the context state, which is conducted in parallel.

After the newly selected entry is activated, the others are put into retention to save energy. While operating under the same context, the active BRB entry is accessed directly for predictions and does not go through the CAM. When the BRB is full and a new context requests an entry, the least recently used entry is evicted. Previous studies show that 3 entries are enough to handle 2 communicating processes and operating system implications [19] without negatively impacting system performance. If more contexts are active and are consistently being switched out, more entries can be introduced. The added entries increase the area of the BRB component linearly, while timing remains the same irrespective of the number of entries, for realistic sizes.

Retaining the reduced state makes isolation of processes efficient and, as a consequence, improves security. Keeping that reduced state separate per process (or between untrusted parts of the same process, such as a browser and a script engine), while emptying the rest of the structures, ensures that no state is ever shared between mutually untrusted processes and the kernel. The performance hit is softened in this case, as the preserved state increases the transient state accuracy.

The obvious question, in this case, is what method is used to reduce the data in the most impactful way. This question is not trivial, as it is directly tied to the BP implementation and the amount of state that can be stored efficiently when taking into account the overhead constraints.

One way to preserve state is to select certain components that provide a good balance between the amount of data stored and the accuracy achieved, and discard the state of the remaining components. This "vertical cut" method can be used in the case of TAGE, as it is comprised of various components that can provide accurate standalone predictions (Figure 7). For instance, the base predictor can be isolated from the rest of the components and still provide reasonable accuracy. Similarly, separate TAGE tables can be preserved instead of the entire design, to target certain history lengths.

Other components, however, provide only a complementary benefit to the predictions [14]. Therefore, it does not make sense to consider preserving their state by themselves. The loop predictor, for instance, is a relatively small component that identifies regular loops with a fixed number of iterations and needs few instructions to warm up. Its overall effect on the accuracy of TAGE-SC-L is measured to be around a 0.3% improvement [13].

In Perceptron, the vertical "cut" can be applied by only retaining certain features. In contrast to TAGE, this is significantly more complicated, as the empty features do not properly add up to the sum of all the weights and therefore skew the prediction outcome. To address this, a special mechanism would be needed to adjust all the preserved weights to deliver valid predictions. We consider this to be an impractical extra step needed every time a flush occurs.

Another way to preserve state is to store partial state for all the predictor components. For TAGE this can be done by naively saving a portion of each of the TAGE tables. This "horizontal" approach captures information across all of the history, albeit with less accuracy than storing the entire state. As mentioned, in Perceptron the tables are combined to provide accurate predictions. A "horizontal cut" can be done by merging multiple neighbouring entries, which also reduces the accuracy.
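The following sketch illustrates the BRB bookkeeping described above: a small CAM-like lookup maps an address space ID to one of N retained SRAM banks, with LRU eviction when all banks are occupied. Structure names and the entry count are assumptions for illustration; the hardware performs this selection without copying data.

```cpp
#include <array>
#include <cstdint>

// Toy model of the BRB entry selection logic (illustrative only).
constexpr int kBrbEntries = 3;   // enough for 2 processes + kernel [19] (assumed)

struct BrbEntry {
    uint64_t asid    = 0;        // owning address space
    bool     valid   = false;
    uint64_t lastUse = 0;        // LRU timestamp
    // ... retained predictor component (e.g. a 1.25kB bimodal/perceptron)
};

struct BranchRetentionBuffer {
    std::array<BrbEntry, kBrbEntries> banks{};
    uint64_t tick = 0;

    // Called only on a context switch: CAM-style lookup on the ASID,
    // falling back to LRU eviction (the evicted context will warm up again).
    int selectBank(uint64_t asid) {
        ++tick;
        int lru = 0;
        for (int i = 0; i < kBrbEntries; ++i) {
            if (banks[i].valid && banks[i].asid == asid) {   // CAM hit
                banks[i].lastUse = tick;
                return i;
            }
            if (banks[i].lastUse < banks[lru].lastUse) lru = i;
        }
        banks[lru] = BrbEntry{asid, true, tick};             // evict LRU entry
        return lru;  // a fresh entry starts empty: this context trains once
    }
};
```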

C. Perceptron amplified / reinforced TAGE

As a specific showcase of our BRB design methodology, we propose a hybrid approach to preserving state, which we call ParTAGE – Perceptron amplified / reinforced TAGE – shown in Figure 8. The branch retention buffer in this case stores a Multiperspective Perceptron predictor, using smaller and fewer feature tables than the Perceptron from the literature, as the base predictor of the TAGE design, replacing the bimodal predictor.

Figure 8: The ParTAGE predictor combines benefits from both perceptron and TAGE predictors. It also allows partial state to be preserved to improve transient accuracy.

Each entry in the BRB stores an independent small perceptron predictor, enabling contexts to keep some reduced state separate. The rest of the branch predictor design can be cleared to eliminate the possibility of side-channel leaks targeting the branch predictor.

To reduce the size of the Perceptron to within the allocated budget, we use a similar configuration to the 8kB Multiperspective Perceptron [14], with 8 smaller feature tables instead of 16. The limited size of the Perceptron in ParTAGE enables its state to be preserved when context switching. In the intermediate warm-up state, ParTAGE has to select between the transient prediction of the TAGE tables and the steady-state prediction of the preserved Perceptron. We design two versions of ParTAGE: one that hardwires the Perceptron predictor to always be chosen during the transient operation of the predictor (based on the context switching frequency), and one that assesses the confidence of the prediction of the base perceptron predictor before selecting the outcome.

IV. EXPERIMENTAL SETUP

The experiments conducted use the CBP5 framework with the traces from 2016 [16] as a base. The framework uses 268 traces, ranging from 100 million to 1 billion instructions, for both mobile and server workloads.

Table I: The evaluated branch predictors.

  Name        Size         Details
  BiM         64kB         2-bit counter
  BiMH        1.25kB       8kbit counter, 2kbit hysteresis
  BiMH        625B         4kbit counter, 1kbit hysteresis
  Perceptron  64kB         37 feature tables
  Perceptron  8kB          16 feature tables
  TAGE        64kB / 8kB   Loop predictor, statistical corrector, BiMH base predictor
  ParTAGE     64kB / 8kB   Loop predictor, statistical corrector, 1.25kB 8-table perceptron

Table III: The frequency of context switches on a Google Pixel phone.

  Application     Time (s)   Context switches (CS)   CS/s
  adobereader     73         459,699                 6,279
  gmail           23         228,924                 9,833
  googleslides    108        343,788                 3,185
  youtube         36         418,254                 11,487
  yt playback     21         224,631                 10,698
  angrybirds rio  81         406,711                 5,044
  camera@60fps    60         1,958,557               32,643
  geekbench       60         129,533                 2,159
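A sketch of the two selection policies described above is given below: ParTAGE-S statically trusts the retained perceptron for a fixed window after a flush, while ParTAGE compares component confidences. The function names are illustrative assumptions; the 200k-branch threshold is the value used for ParTAGE-S in this work.

```cpp
#include <cstdint>

// Illustrative selection between the retained small perceptron (already
// warm for this context) and the TAGE tables (possibly cold after a flush).
struct ComponentPrediction { bool taken; int confidence; };

// ParTAGE-S: statically override TAGE for a window after each flush.
bool selectParTAGE_S(const ComponentPrediction& perceptron,
                     const ComponentPrediction& tage,
                     uint64_t branchesSinceFlush,
                     uint64_t overrideWindow = 200000 /* branch insts */) {
    if (branchesSinceFlush < overrideWindow)
        return perceptron.taken;          // TAGE tables assumed still cold
    return tage.taken;
}

// ParTAGE: pick the component that reports the higher confidence.
bool selectParTAGE(const ComponentPrediction& perceptron,
                   const ComponentPrediction& tage) {
    return (perceptron.confidence > tage.confidence) ? perceptron.taken
                                                     : tage.taken;
}
```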

Table II: The specifications of the system used in our gem5 experiments [20].

  OoO Core                       Memory
  Pipeline Width   3-wide        L1D Size   32kB
  Pipeline Depth   15            L1I Size   32kB
  Clk. Frequency   1 GHz         L2 Size    1MB
  L2 Prefetcher    Stride

Figure 9: Impact of periodic BP flushing (1k to 10M instructions) on core performance, normalized to a system without flushing, for the MiBench workloads (FFT, jpeg, qsort, susan, adpcm, dijkstra, patricia, rijndael, bitcounts, basicmath, stringsearch) and their average. Results extracted from gem5 full system simulation [20] of an Arm OoO model running MiBench [21].

We modify the CBP framework so that branch predictors can perform full or partial flushes of their state. This enables temporal studies of the behaviour of branch predictors, revealing the effects of full or partial loss of state.

For our experiments, we use a variety of predictors that are commonly used today. As a baseline design, we implement a set of bimodal type predictors, with and without hysteresis. To compare more contemporary predictors, we use the submitted TAGE-SC-L [13] and Multiperspective Perceptron without TAGE [14] from CBP5. We lightly modify both designs so that we can flush them partially or completely when needed, carefully retaining their exact steady-state behaviour.

Furthermore, we implement two variants of ParTAGE that express different policies for the selection of the best transient prediction. ParTAGE-S overrides the prediction of the TAGE tables below a period threshold, which we have set to 200k branch instructions. ParTAGE uses an integrated confidence value that is assessed by the rest of the TAGE design in order to indicate the most accurate prediction.

We focus on 8kB and 64kB predictors, similar to the ones that are evaluated at CBP. A detailed list of all the evaluated predictors is shown in Table I. We assess the transient accuracy of the evaluated predictors for the cases outlined in Section II. To achieve this, we cover a range of flushing periods from 10 to 60M branch instructions per flush, extracting the optimal design for each use case.

We also perform a limit study that identifies the upper limit of the core performance drop when flushing the predictor periodically, ranging from 1k to 10M total instructions. We use an Arm OoO model (Table II) in a gem5 [20] full system simulation running MiBench [21]. We compare two systems, one that periodically flushes the state and one that does not.

V. RESULTS

We perform three different types of comparisons, focusing on simple bimodal predictors, current TAGE and Perceptron designs, and our ParTAGE proposal. We measure the frequency of context switches on a modern mobile device (Table III) and find that switches can happen on average as often as every 12k branch instructions. This leads to significant core performance loss, as much as 15% based on our limit study (Figure 9).

A. Quantifying transient accuracy

1) Bimodal accuracy results: We use Bimodal as a simple first experiment, consisting of a single table of counter bits. In Table IV we show how the MPKI of different bimodal designs improves as the state retention period increases.

Table IV: Misprediction rates (MPKI) of a wide range of bimodal predictors across different state flushing periods, measured in branch instructions.

  Name          10     100    200    2k     20k    200k   2M     20M    40M    60M
  BiM 64kB      36.77  22.22  19.52  14.12  12.13  11.40  11.16  11.14  11.15  11.15
  BiM 1kB       36.78  22.26  19.58  14.40  12.87  12.59  12.52  12.51  12.51  12.51
  BiMH 40kB     30.97  19.33  17.23  13.40  12.04  11.56  11.41  11.39  11.39  11.39
  BiMH 10kB     30.96  19.34  17.24  13.43  12.10  11.63  11.49  11.47  11.47  11.47
  BiMH 1.25kB   30.97  19.39  17.33  13.68  12.53  12.17  12.08  12.07  12.07  12.07
  BiMH 625B     30.99  19.45  17.42  13.93  12.91  12.62  12.55  12.54  12.54  12.54

Figure 10: Bimodal transient MPKI: comparison of sizes and types of bimodal predictors (MPKI against branch instructions per flush for BiM 64kB, BiMH 40kB, BiM 1kB and BiMH 625B). Size contributes to steady-state accuracy, hysteresis to transient accuracy. Note the Y-axis begins at an 11 MPKI offset.

Figure 11: TAGE vs Perceptron transient behavior: a comparison between TAGE and Perceptron at 64kB and 8kB (MPKI against branch instructions per flush). While both exhibit similar steady-state accuracy, TAGE shows better transient behavior.

Despite a 100x difference in size, the steady-state MPKI increase is only 12.28%. The results reveal that while size contributes to the steady-state accuracy, transient accuracy is not affected by the size of a predictor design.

Instead, a variation in the design, such as adding hysteresis, improves transient accuracy even for smaller predictors. This happens because the hysteresis bits also affect neighbouring branches and ultimately warm up the design faster. This is clearly visible at smaller flushing periods, where the misprediction rate is on average 18% lower for the bimodal designs with hysteresis. Figure 10 shows the difference in accuracy for different bimodal sizes and designs.

2) Transient accuracy: TAGE vs Perceptron: Our second set of results focuses on comparing the transient behaviour of the two most prominent designs in modern systems: TAGE and Perceptron. Figure 11 shows the different transient behaviour between the TAGE and Perceptron designs. We notice that the 8kB variant of Perceptron has the worst cold start; however, it manages to rapidly improve and reach similar steady-state accuracy.

The transient MPKI for a flushing period of 20k branch instructions is 7.75 and 6.98 for the 8kB and 64kB TAGE designs, and 7.57 and 6.93 respectively for Perceptron. Switching every 20k branch instructions is within a realistic range for applications like the ones presented in Table III. This result shows that TAGE can deliver better steady-state accuracy. However, for applications that perform frequent context switching (20k branches), Perceptron is marginally more accurate.

Another observation can be extracted when comparing the 64kB variants with the 8kB ones at smaller windows of uninterrupted execution. Considering, for instance, flushing every 20k or 200k branch instructions, the 8kB predictors perform on average 10% and 15% worse than the 64kB designs. However, the accuracy gap increases to 33% when observing the same designs at steady-state.

Using the steady-state accuracy as a baseline and comparing the transient accuracy, across all granularities as fine as 20k branch instructions, we calculate how much worse the accuracy can be in context-switch-heavy workloads. From Figure 11, the MPKI increase is as much as 90% and 80% for TAGE and Perceptron respectively. This reinforces our belief that predictors today are evaluated without taking into account disruptions that can occur during realistic execution.
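As an illustrative spot check of these figures (using the 64kB transient values quoted above together with the steady-state MPKI reported later in Section V-C, rather than the exact averages across all granularities used in the paper):

  TAGE:       6.98 / 3.660 - 1 ≈ 0.91  (roughly a 90% MPKI increase at a 20k flushing period)
  Perceptron: 6.93 / 3.826 - 1 ≈ 0.81  (roughly an 80% MPKI increase)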

Figure 12: Small Perceptron vs bimodal comparison: bimodal and Perceptron designs below 1.25kB as base predictors (MPKI against branch instructions per flush for BiMH 1kB, BiMH 625B, Perceptron 1.7kB and Perceptron 1.25kB). Results show that, even when reduced to this size, Perceptron greatly outperforms the competition.

Figure 13: Calibrating the small perceptron. (a) Calibrating the number of feature tables for a 1.25kB size (MPKI against number of feature tables). (b) Different sizes of perceptron with a constant 8 feature tables (MPKI against size in kB).

B. ParTAGE

To improve the transient accuracy of branch prediction, we present the results from our proposal, ParTAGE, a design that is influenced by both Perceptron and TAGE. Next, we motivate its design choices and proceed to analyse its accuracy compared to the competitive designs of today.

Finding the right, "small" predictor: The TAGE breakdown analysis leads to the underlying idea for the hybrid predictor: replace components which use a significant amount of area with other, retainable components that can deliver a higher transient accuracy.

From Figure 11 we observed that Perceptron predictors rapidly improve their MPKI over time before reaching steady-state accuracy. This reveals an interesting insight about Multiperspective Perceptron predictors: we find that, compared to bimodal predictors, the hashed and common weight values in modern Perceptron designs are a more effective design, being able to aggressively train and store information in a denser format. We focus on the latter attribute, that of dense branch information storage, to design the base predictor, as we intend to store and restore it for each context, thus never letting it return to its transient accuracy. Furthermore, in Figure 7 the statistical corrector is shown to add little accuracy when the TAGE tables are cold, despite its large size (8kB). We use this to experiment with larger perceptrons while maintaining a balance in terms of size.

In Figure 12 we compare bimodal and Perceptron designs that fit roughly within a 1.25kB budget. Results show that while Perceptron has higher MPKI at granularities as fine as 200 branch instructions, its steady-state accuracy is significantly better than bimodal.

Optimizing the base predictor: We reduce the size of each BRB entry to 1.25kB and calibrate the number of feature tables and the size of the Perceptron predictor. We fix the size of the predictor initially to 1.25kB (Figure 13a) and find that eight feature tables deliver the most accuracy. We repeat the process for 3kB perceptron entries, which increase the size of the predictor by 10%. For this reason, we also evaluate a design without the statistical corrector, maintaining equal area to the original TAGE design.

Figure 13b shows the best configuration for each size. Note that changing the feature sizes affects the prediction more than altering the overall size does. For completeness, a genetic algorithm as proposed in past studies [14] can deliver even better results; we leave these optimisations for future studies.
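As a rough illustration of how a 1.25kB BRB entry could be budgeted (the exact geometry of the calibrated perceptron is not spelled out here, so the numbers below are an assumed example rather than the evaluated configuration): eight feature tables of 160 one-byte weights give 8 x 160 x 8 bits = 10,240 bits = 1.25kB, matching the retained-entry size identified in Figure 4.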

Figure 14: Decrease of MPKI when retaining 1.25kB and 3kB of state (transient MPKI against branch instructions per flush for TAGE (Flush), TAGE (B), ParTAGE and ParTAGE-S 3kB). Note that TAGE clears its entire state, while all other designs keep their base predictor state intact.

Figure 15: MPKI increase compared to steady-state when retaining 1.25kB (overhead against branch instructions per flush for TAGE, TAGE (B), ParTAGE 1kB, ParTAGE 3kB and ParTAGE 3kB + SC). At coarse granularity the effect is negligible, but for frequent flushes the improvement is significant.

Figure 16: Steady-state performance (MPKI at flushing periods of 1M to 100M branch instructions for TAGE 64kB, Perceptron 64kB and ParTAGE 64kB). The steady-state reveals that ParTAGE can perform competitively with, and even outperform, existing high accuracy designs.

C. ParTAGE results

We create the ParTAGE predictor based on our observations for small predictors, replacing the 1.25kB bimodal with a perceptron.

Comparing ParTAGE variants: The results in Figure 14 compare the designs featuring the BRB, all variants of ParTAGE and TAGE (B) (which retains the bimodal), to TAGE without the BRB. For ParTAGE, we evaluate variants that retain both 1.25kB and 3kB BRB entries. ParTAGE-S statically overrides the TAGE tables for a brief period after a flush is performed, forcing the Perceptron prediction to be selected, while ParTAGE selects the prediction based on each component's confidence. The 1kB entry designs are overall 3% larger in area, while the 3kB variant removes the statistical corrector, maintaining the same area budget.

When considering the upper limit of context switching in current systems, we see that the most accuracy is delivered by our proposed BRB extension with the 3kB ParTAGE variant, which is iso-area compared to TAGE-SC-L. Figure 15 shows the improvement is roughly 20% and 15% for 3kB ParTAGE and TAGE (B) respectively. Overall, preserving a minimal state can deliver a significant improvement (15-20% less MPKI) when the state is frequently flushed, but has a small effect at steady-state (a maximum 5% MPKI increase). To maintain the same steady-state accuracy, the statistical corrector is included (ParTAGE 3kB + SC in Figure 15), increasing the overall area by 10% but delivering better accuracy throughout.

Comparing the different implementations of ParTAGE, we notice that while ParTAGE-S works well for fine grain switches, its transient accuracy suffers at larger flushing periods, as the TAGE tables have not been trained adequately. In contrast, ParTAGE, which simply feeds the confidence into TAGE, does not achieve the same transient accuracy. This happens when the TAGE tables are completely cold but their prediction is prioritized over the more accurate one from the perceptron component. This is also why ParTAGE does not improve the transient accuracy at 20k branch instructions. We propose to solve this with better tuning of the selection policy in the future.

Improving the steady-state accuracy of the base predictor and retaining its state effectively enables efficient operation at finer granularities. This can be done by either increasing the size of the branch retention buffer to fit more state, or developing predictors in the 1kB to 3kB range with better steady-state accuracy. For instance, approaching the accuracy of an 8kB perceptron can further reduce the misprediction rate, as shown in Figure 11.
Comparing ParTAGE to TAGE and Perceptron: While the primary focus of this study is the improvement of the transient accuracy of branch prediction, it is equally important to maintain a competitive steady-state accuracy for our proposed design. We perform a direct comparison between TAGE-SC-L, both Multiperspective Perceptron versions submitted to CBP5 [16], and the best version of ParTAGE. Figure 16 provides a detailed look at the steady-state MPKI compared to TAGE and the best version of the Multiperspective Perceptron [22, 14]. We observe that ParTAGE delivers 3.726 MPKI, competitive with TAGE (3.660 MPKI) and even outperforming Perceptron (3.826 MPKI).

VI. RELATED WORK

The effects of context switches on predictor accuracy and overall performance have been studied in the past with varied results. Studies like [23] show that in the past, when systems did not switch often (above 400k instructions) and branch predictors could fully train within 128k instruction periods, the effects of context switching on BP accuracy were negligible.

When switching at a faster rate, [11, 24, 25] show that the branch predictor accuracy drop is in the range of 7% - 20%. We show that context switches on current hardware can happen more frequently than in the past (Table III), in cases as often as every 64k instructions. Furthermore, compared to past studies, current branch predictor designs like TAGE and the Multiperspective Perceptron take much longer than 128k instructions to warm up. Our results show that branch mispredictions can increase by as much as 90% when applications today are switching aggressively and flushing.

Some studies [26, 27] focus on the effect that context switches have on actual performance. They find that when switching is done too frequently, cache misses overshadow all other overheads and greatly degrade performance. Other work, namely [28], shows that even when applications have all of their data in the caches, the performance hit caused by the branch predictor can be as much as 20% for 100k instruction switching frequencies on Out-of-Order cores, while In-Order cores show little degradation. Our results from Figure 9 agree with such findings. Out-of-Order execution successfully masks cache misses, but relies heavily on speculation and branch predictor accuracy to deliver performance. Flushing for security reasons closely matches these simulations and we expect a similar performance drop.

Studies break down the contribution to accuracy of the components in Perceptron predictors [14], but a similar breakdown of accuracy into components has not been done so far for TAGE-type designs.

While many studies track the effects on accuracy across different predictor sizes and switching frequencies [29, 30, 2, 23], we believe the notion of accuracy under frequent flushes to be a novel insight, especially when taking the necessary security precautions against side-channel attacks. Studies have been conducted [19] that focus on isolating the kernel from the applications in the branch predictor for performance. This can potentially increase security but will not completely eliminate BP side channels. Furthermore, that design proposes a hard partition of the entire BP state, which would be costly in terms of area for current large predictor designs, unless steady-state accuracy is sacrificed.

Security studies [1, 5] identify practical threats that can compromise the system using the branch predictor; they mention branch predictor flushing as a mitigation technique for the security aspects, however they do not provide an estimate of the performance loss of such an approach. Our results contribute to quantifying the performance loss in such scenarios. The branch retention mechanism we describe has, to the best of our knowledge, not been previously proposed, and provides a viable solution when the performance degradation is significant.

VII. CONCLUDING REMARKS

In this work, we have focused on creating hardware security mitigations for side-channels in the branch predictor, which can be disruptive to performance when dealt with in software. We highlight realistic scenarios where this can occur, such as frequent context switches and system calls, and show that current mitigation techniques for side-channel attacks targeting the speculation engine are both expensive and impractical. We show that these disruptions create a disconnect between the reported nominal accuracy of branch predictors and the actual one in real world applications. To distinguish between the two, we introduce the notions of steady-state and transient branch predictor accuracy.

We propose a novel mechanism, the Branch Retention Buffer (BRB), that keeps a minimal, isolated state per context to reduce the high number of mispredictions. We propose two designs that store essential context state in the BRB: first, an extension to TAGE, named TAGE (B), that keeps the state of its bimodal predictor; second, a novel hybrid branch predictor design, ParTAGE, that replaces the bimodal in TAGE with a Perceptron. We evaluate these variants with a new methodology, which modifies the Championship Branch Prediction framework so we can clear predictor state across different frequencies and components. We show that branch predictors can have as much as 90% more mispredictions than what is evaluated today at steady-state, under certain realistic conditions. Using the BRB we manage to guarantee BP state isolation, and consequently increase security when context switching, while achieving MPKI reductions of 15% and 20% for TAGE and our hybrid predictor design respectively.

Improving the steady-state was not the goal of this study, as we focused on isolation while simultaneously mitigating the negative effects in the transient state. However, we believe that with future optimisations our designs can improve the steady-state as well. We also aim to enhance the transient accuracy by focusing on improving the policies that select between retained and "cold" state components, in order to fully exploit all the benefits of the base predictor. Finally, we aim to focus on more accurate "small" designs in the 1kB - 3kB range.
https://doi.org/10.5258/SOTON/D0739. [18] D. Tarjan and K. Skadron, “Merging path and gshare indexing This research was supported by the people of Arm Re- in perceptron branch prediction,” ACM Transactions on Ar- search. We thank our colleagues for their valuable feedback chitecture and Code Optimization, vol. 2, no. 3, pp. 280–300, that greatly assisted us in delivering a better version of this 2005. work. [19] J. Rubio and N. Vijaykrishnan, “OS-Aware Branch Prediction: Improving Control Flow Prediction for Op- REFERENCES erating Systems,” IEEE Transactions on Computers, vol. 56, no. 1, pp. 2–17, 2007. [1] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, [20] N. Binkert, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, N. Vaish, M. D. Hill, D. A. Wood, B. Beckmann, G. Black, “Spectre Attacks: Exploiting Speculative Execution,” ArXiv S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, e-prints, 2018. and T. Krishna, “The gem5 simulator,” ACM SIGARCH [2] A. Seznec and P. Michaud, “A case for (partially) TAgged Computer Architecture News, vol. 39, p. 1, 8 2011. GEometric history length branch prediction,” Jilp, vol. 8, [21] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, pp. 1 – 23, 2006. T. Mudge, and R. B. Brown, “MiBench: A free, commer- [3] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, cially representative embedded benchmark suite,” in IEEE C. Wilkerson, K. Lai, and O. Mutlu, “Flipping bits in memory International Workshop on Workload Characterization, pp. 3– without accessing them: An experimental study of DRAM 14, 2001. disturbance errors,” in International Symposium on Computer [22] D. A. Jimenez,´ “Multiperspective Perceptron Predictor with Architecture, pp. 361–372, 2014. TAGE,” JWAC-4: Championship Branch Prediction, 2014. [4] D. Evtyushkin, R. Riley, N. C. Abu-Ghazaleh, ECE, and [23] M. Co and K. Skadron, “The effects of context switching on D. Ponomarev, “BranchScope,” in Architectural Support for branch predictor performance,” in International Symposium Programming Languages and Operating Systems, pp. 693– on Performance Analysis of Systems and Software, pp. 77– 707, 2018. 84, 2001. [5] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, [24] S. Pasricha and A. Veidenbaum, “Improving branch predic- S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Ham- tion accuracy in embedded processors in the presence of burg, “Meltdown,” ArXiv e-prints, 2018. context switches,” in International Conference on Computer [6] Y. Yarom and K. Falkner, “Flush + Reload : a High Resolu- Design, 2003. tion, Low Noise, L3 Cache Side-Channel Attack,” USENIX [25] A. S. Dhodapkar and J. E. Smith, “Saving and restoring Security 2014, pp. 1–14, 2014. implementation contexts with co-designed virtual machines,” [7] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-level 2001. cache side-channel attacks are practical,” in IEEE Symposium [26] J. C. Mogul and A. Borg, “The Effect of Context Switches on on Security and Privacy, pp. 605–622, 2015. Cache Performance,” in Architectural Support for Program- [8] A. Prout, W. Arcand, D. Bestor, B. Bergeron, C. Byun, ming Languages and Operating Systems, vol. 19, pp. 75–84, V. Gadepally, M. Houle, M. Hubbell, M. Jones, A. Klein, 1991. P. Michaleas, L. Milechin, J. Mullen, A. Rosa, S. Samsi, [27] F. Liu and Y. Solihin, “Understanding the behavior and C. Yee, A. Reuther, and J. 
Kepner, “Measuring the Impact implications of context switch misses,” ACM Transactions on of Spectre and Meltdown,” ArXiv, pp. 1–5, 2018. Architecture and Code Optimization, vol. 7, no. 4, pp. 1–28, [9] K. N. Khasawneh, E. M. Koruyeh, C. Song, D. Evtyushkin, 2010. D. Ponomarev, and N. Abu-Ghazaleh, “SafeSpec: Banishing [28] I. Vougioukas, A. Sandberg, S. Diestelhorst, B. Al-Hashimi, the Spectre of a Meltdown with Leakage-Free Speculation,” and G. Merrett, “Nucleus: Finding the sharing limit of hetero- ArXiv, 2018. geneous cores,” ACM Transactions on Embedded Computing [10] T. Allan, B. B. Brumley, K. Falkner, J. van de Pol, and Systems, vol. 16, no. 5, 2017. Y. Yarom, “Amplifying side channels through performance [29] M. Das, A. Banerjee, and B. Sardar, “An empirical study on degradation,” in Annual Conference on Computer Security performance of branch predictors with varying storage bud- Applications, pp. 422–435, 2016. gets,” in International Symposium on Embedded Computing [11] M. Evers, P.-Y. Chang, and Y. N. Patt, “Using hybrid branch and System Design, pp. 1–5, 12 2017. predictors to improve branch prediction accuracy in the [30] C. Zhou, L. Huang, Z. Li, T. Zhang, and Q. Dou, “Design presence of context switches,” in International Symposium Space Exploration of TAGE Branch Predictor with Ultra- on Computer Architecture, vol. 24, pp. 3–11, 1996. Small RAM,” in Great Lakes Symposium on VLSI, pp. 281– [12] T.-Y. Yeh and Y. N. Patt, “Two-level adaptive training branch 286, 2017. prediction,” in International Symposium on Microarchitec- ture, pp. 51–61, 1991. [13] A. Seznec, “TAGE-SC-L branch predictors,” JWAC-4: Cham- pionship Branch Prediction, 2014. [14] D. A. Jimenez,´ “Multiperspective Perceptron Predictor,” JWAC-4: Championship Branch Prediction, 2014. YOMO: You Only Mispredict Once. Transferring Predictor State for Efficient HMPs

Abstract—Modern mobile SoCs use high-performance out-of-order (OoO) cores and energy-efficient in-order (InO) cores. Recent work shows that migrating tasks between OoO and InO cores can unlock additional performance and energy efficiency. In many cases, however, the cost of migration outweighs the benefits. Sharing resources such as caches can reduce the migration cost. For other components, like the branch predictor, which are tightly integrated into the core and designed to meet different needs, sharing can be complex and impractical. In this work, we present a novel mechanism that migrates branch predictor state between cores to reduce the task migration cost. This mechanism does not increase prediction latency or violate the design constraints of the cores. To minimize communication, we transfer a small part of the branch predictor state, sufficient to achieve high prediction accuracy. Our results show that our mechanism reduces mispredictions by as much as 62% and recovers lost performance by as much as 35% in systems with frequent migrations.

Index Terms—Heterogeneous multiprocessing, OoO, Out-of-Order execution, speculation, branch prediction.

Fig. 1: Performance (cycles per instruction) when retaining branch predictor state: an execution excerpt of two identical OoO systems migrating every 1k instructions. One retains all internal components when migrating, the other loses the branch predictor state. Depending on the amount of instruction-level parallelism, performance can drop by as much as 35%. Results generated using gem5 running qsort from MiBench [7], [8].

I. INTRODUCTION

Many devices today, laptops and smartphones for instance, are limited by an energy budget that prevents them from constantly delivering maximum performance. As such, processor cores are usually designed to address very specific power/performance points. To improve overall system efficiency while still being able to deliver high performance when needed, heterogeneous multiprocessors can be used, which traditionally combine high performance Out-of-Order (OoO) cores with energy efficient In-Order (InO) cores [1]–[3]. To make best use of the two design points, heterogeneous systems migrate applications between the cores. To guarantee high performance, demanding phases run on the OoO core, while low instruction-level parallelism (ILP) phases can run on the InO core to conserve energy.

Previous studies show that reducing the migration overheads allows for aggressive switching at a finer granularity and improves energy efficiency [1]–[6]. Earlier proposals have tried to merge the front-ends of different cores and distribute blocks of instructions to separate back-ends [2], [4]. In theory, this can lead to better efficiency, as migration costs are reduced and frequent switching becomes practical. However, fusing front-ends is complex and adds delay to latency-sensitive components, like the Level 1 caches (L1) and the branch predictor. Furthermore, InO and OoO front-ends cater to different operating points. Hence a shared front-end cannot satisfy both OoO performance and InO efficiency constraints.

Instead, we approach this differently by finding the components that significantly reduce migration overheads and searching for practical ways to transfer their state. In this work, we identify that branch predictor state has a large impact on performance after a migration, particularly for OoO cores that rely on branch prediction to expose instruction-level parallelism. This can be seen in Figure 1, where frequent migrations can drop OoO performance by as much as 35%. We focus on transferring branch predictor state, as the added latency overhead prohibits sharing [1], [9]. Furthermore, the size and accuracy requirements differ between heterogeneous cores, and consequently their designs are incompatible.

We propose transferring minimal branch predictor state between cores, which does not add much overhead but notably decreases mispredictions after a migration, ultimately improving overall performance. Our contributions are the following:

• We show that when migrating frequently a large portion of the overheads are caused by branch predictor cold misses. We find that this is especially true for OoO cores, where we measure an average performance loss of 40%.
• We show that shared predictors are impractical, as OoO and InO cores have different prediction requirements and constraints that cannot be satisfied by one universal design. Moreover, the extra latency negates any benefits.
• We identify the branch predictor state that eliminates a large fraction of the migration cold misses when transferred. We use a novel predictor design based on current TAGE and perceptron predictors. Results show a 62% reduction in the misprediction rate, which recovers significant performance during frequent migrations.

II. BACKGROUND

In this section we present relevant prior art that merges cores, reduces migration overheads and consequently tries to increase overall performance/efficiency in heterogeneous cores.
While the potential benefits of branch predictor sharing are significant, the increased complexity can negate any actual improvement. As we focus on maintaining branch predictor state through migrations, we provide detail on the branch predictor designs used in our work.

Fine grain migration in heterogeneous systems

Prior work investigates ways to mitigate the hardware overheads that occur when migrating between cores [2], [4], [9], [10]. Specifically, recent work has been focusing on fusing heterogeneous core front-ends to reduce the cost of migrations between the core back-ends [2]–[6], [9]–[11]. These proposals share caches, translation buffers, the branch predictor and even the pipeline in order to eliminate the cost of switching between OoO and InO cores. In principle, low overhead switching and accurate scheduling in fused cores enable applications to use the OoO core during phases that can deliver high performance, and swiftly switch to the InO core for low performance phases. This is claimed to save as much as 36% energy for a 1% performance loss compared to a system running constantly on the high performance OoO core, effectively combining InO efficiency with OoO performance [2], [4]. Similar approaches merge multiple smaller InO cores to create larger OoO cores which retain their state when switching [12], [13].

[Fig. 2: A depiction of a TAGE branch predictor. Tables use history length (H) for indexing and tag matching to ensure valid predictions.]

These designs have significant drawbacks in reality, as sharing requires notably more complex wiring, which increases latency [9]. Additionally, multiplexing components like the caches and the translation lookaside buffer (TLB) also adds latency, which degrades overall performance. This forces such designs to lower the operating frequency dramatically (186 MHz in [9]). Finally, the energy savings reported [2], [3], [5], [6] do not take into account the increase in energy consumption when using the larger front-end with the InO core. Certain studies consider these overheads [1] and explore sharing schemes that do not introduce latency in the critical path. The overall transfer overhead is calculated to be:

T_migration = (T_pip + T_BP + T_reg) + (T_TLB + T_L1) + T_L2

where T is the warm-up time of each component after a migration; the first group is non-shareable, the second is tied to the core, and the last is easily shareable. The study shows that the overheads during a migration are split into three categories. First, the easily shareable components, like the Level 2 cache (L2). Second, those tied to the core, like the Level 1 caches (L1) and the TLB, where sharing introduces extra latency which causes slowdown. Finally, the non-shareable internal components that are inherently different in heterogeneous cores, like the branch predictor. Recent studies show that sharing the L2 and the last level of a hierarchical TLB is enough to mitigate the effects caused by memory accesses [1], [14]. Reducing the branch predictor overheads (T_BP) can unlock the energy savings mentioned in the studies above [2]–[6], [11].
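As a rough illustration of this decomposition, the sketch below (with placeholder warm-up values, not measurements from this work) sums the per-component warm-up terms and drops the ones removed by sharing:

```cpp
#include <cstdint>
#include <iostream>

// Illustrative only: per-component warm-up cycles paid after a migration.
// The component names follow the equation above; the numbers are placeholders.
struct MigrationWarmup {
    uint64_t pipeline, branch_predictor, registers; // non-shareable state
    uint64_t tlb, l1;                               // tied to the core
    uint64_t l2;                                    // easily shareable
};

// T_migration = (T_pip + T_BP + T_reg) + (T_TLB + T_L1) + T_L2,
// where shared components no longer contribute a warm-up term.
uint64_t migrationOverhead(const MigrationWarmup& w,
                           bool l1_tlb_shared, bool l2_shared) {
    uint64_t cycles = w.pipeline + w.branch_predictor + w.registers; // always paid
    if (!l1_tlb_shared) cycles += w.tlb + w.l1; // avoided only by sharing L1/TLB
    if (!l2_shared)     cycles += w.l2;         // avoided when the L2 is shared
    return cycles;
}

int main() {
    MigrationWarmup w{2000, 15000, 300, 5000, 8000, 40000}; // placeholder values
    std::cout << "overhead: "
              << migrationOverhead(w, /*l1_tlb_shared=*/false, /*l2_shared=*/true)
              << " cycles\n";
}
```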
Branch predictor overview

We focus our investigation on the most prominent designs used today, the TAGE and perceptron predictors.

TAGE: This design looks up multiple tagged tables using a hash of the program counter (PC) and part of the global history (H). Each table is indexed by a different history length following a geometric series. The lengths are calculated using:

L(i) = int(a^(i-1) × L(1) + 0.5)

where L is the length of the global history, i is the table identifier and a is the geometric factor [15]. If no prediction is available the predictor falls back to a base predictor, a simple bimodal predictor with hysteresis. The latest design, TAGE-SC-L, also features a statistical corrector to improve on biased cases that TAGE consistently mispredicts and a specialized loop predictor [16]. The design is shown in Figure 2.

Multiperspective perceptron: The perceptron predictor, which is loosely based on neural network theory, calculates a prediction based on the sum of all weight-history products [17]. The original design required many calculations and only took into account the global history. A more accurate version is the multiperspective perceptron [18], which directly adds weights that are indexed with different features (global history, branch address, recency, etc.). The prediction uses fewer calculations:

Prediction = Σ_{i=1}^{N} W_i(f_i)

where W_i are the feature tables with the weights, and f_i is the value of a specific feature. We focus on this version of the perceptron in the rest of this work.
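The two formulas above can be made concrete with a short sketch. The table count, minimum history length and geometric factor below are arbitrary examples rather than the configuration evaluated here:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Geometric history lengths used to index the tagged TAGE tables:
// L(i) = int(a^(i-1) * L(1) + 0.5), following the formula above.
std::vector<int> tageHistoryLengths(int numTables, double l1, double a) {
    std::vector<int> lengths;
    for (int i = 1; i <= numTables; ++i)
        lengths.push_back(static_cast<int>(std::pow(a, i - 1) * l1 + 0.5));
    return lengths;
}

// Multiperspective perceptron: Prediction = sum_i W_i(f_i); the branch is
// predicted taken when the sum of the selected weights is non-negative.
bool perceptronPredict(const std::vector<std::vector<int>>& featureTables,
                       const std::vector<int>& featureIndices) {
    int sum = 0;
    for (size_t i = 0; i < featureTables.size(); ++i)
        sum += featureTables[i][featureIndices[i]];
    return sum >= 0;
}

int main() {
    // With L(1)=5 and a=2, four tables use history lengths 5, 10, 20, 40.
    for (int len : tageHistoryLengths(4, 5.0, 2.0))
        std::printf("%d ", len);
    std::printf("\n");
}
```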



[Fig. 3: Our design makes the InO core's predictor identical to the base predictor used in the OoO TAGE. The base predictor can either be the bimodal used in the original TAGE (b), or be swapped for a small perceptron predictor (c). Panels: (a) the transfer mechanism between cores; (b) the architecture of a bimodal predictor; (c) the design of the perceptron predictor.]

III. BRANCH PREDICTOR STATE TRANSFER

As shown in Figure 1, to unlock the benefits of high frequency migrations in heterogeneous cores, the branch predictor needs to maintain high accuracy after each switch. However, sharing the branch predictor is not feasible for two reasons. First, the latency added to a multiplexed predictor will eliminate any benefits gained by keeping the state between migrations, as the branch predictor is constantly invoked. Second, InO and OoO cores have been designed to operate at different performance/power points. OoO cores trade power and area for high performance, while InO cores remain small and energy efficient. As such, their front ends, and in particular their branch predictors, are designed differently. The OoO core needs a large and accurate branch predictor to support the deep speculation that unlocks instruction-level parallelism, while the InO core uses much smaller designs since it is less reliant on speculation. As a consequence, it is impossible to share the same design without under-provisioning the OoO core's accuracy or notably increasing the area and power consumption of the InO core for little added performance.

To achieve the benefits of sharing without its drawbacks, we propose a mechanism that transfers some state between the two predictors when a migration is triggered, eliminating the complexity stemming from component sharing. Furthermore, modern branch predictors combine predictions from multiple components. We identify common components between the two branch predictors (source and destination) and transfer a small part of the state that corresponds to the common components. Both the TAGE and perceptron predictors perform accurately when they are large enough to store enough state, as large as 64kB for high performance cores [16], [18]. Far smaller branch predictors, around 1-3kB for perceptrons, have been used successfully in modern InO cores [19].

Transferring state from one core to the other during a migration must be efficient for the benefits to outweigh the costs. Making the observation that the TAGE predictor (Figure 2) uses a fallback bimodal predictor, we consider a heterogeneous system that uses a bimodal branch predictor for the InO core and a TAGE predictor for the OoO core. This allows the state of the InO bimodal to be easily transferred to the OoO TAGE's identical bimodal fallback so that, when a migration occurs, execution resumes with high prediction accuracy. This is shown in Figures 3a and 3b.

Using a multiperspective perceptron as the high accuracy predictor for the OoO core makes transferring the state over to a smaller design significantly harder. This happens because for every prediction the weights from each feature need to be added to provide a valid prediction. Removing the extra features that exist only in the large perceptron would lead to imbalanced summations that could cause mispredictions. Removing entries across all features also suffers from the same problem, as each feature entry is used for multiple predictions when combined with the other table entries. However, one benefit of this type of predictor is that the state is densely stored and can achieve higher accuracy than conventional small sized predictors. We therefore focus our experiments on large, accurate predictors for the OoO cores and a smaller bimodal predictor for the InO core. Our branch predictor design for the OoO core uses a 64kB TAGE, while the InO core uses a 1.25kB bimodal predictor that matches the fallback predictor in the large core's TAGE predictor. As the TAGE predictor is modular, the base predictor can be replaced by another predictor with greater accuracy that can be better suited to the needs of the small InO core. With this starting point we also consider a design that swaps the base bimodal for a small perceptron predictor, as shown in Figure 3c. We experiment with two designs for the OoO core that use this hybrid (TAGE-P): one that replaces the 1.25kB bimodal with an equally sized perceptron, and one where we use a 3kB perceptron instead, directly matching a small perceptron in the InO core. Details of the evaluated predictors are shown in Table I.

Name       | Size         | Details
bimodal    | 1.25kB       | 8kbits counter, 2kbits hysteresis
perceptron | 1.25kB / 3kB | 8 features: GHIST (2), GHISTPATH (3), IMLI, GHISTMODPATH, RECENCYPOS
TAGE       | 64kB         | Loop Predictor, Statistical Corrector, 1.25kB bimodal base predictor
TAGE-P     | 64kB / 66kB  | Loop Predictor 156B, Statistical Corrector 8kB, 1.25kB / 3kB perceptron base predictor

TABLE I: The evaluated branch predictors.

[Fig. 4: Performance degradation relative to the same core that runs uninterrupted in MiBench for different sharing schemes. Migrations are performed between cores that share all the caches and the TLB. (a) InO performance breakdown when migrating without transferring the BP state (colored bars) and with the BP state (white bars); (b) the same breakdown for the OoO core. Relative performance is shown per MiBench benchmark for migration periods from 1k to 10M instructions.]

OoO Core            | Value
Pipeline Width      | 3-wide
Pipeline Depth      | 15
Operating Frequency | 1 GHz

In-Order Core       | Value
Pipeline Width      | 2-wide
Pipeline Depth      | 8
Operating Frequency | 1 GHz

Shared Resources    | Value
L1D Size            | 32kB
L1I Size            | 32kB
L2 Size             | 1MB
L2 Associativity    | 16-way
L2 Prefetcher       | Stride

TABLE II: The specifications of the system used in our experiments.
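A minimal sketch of the transfer step described above, assuming a bimodal base predictor: on a migration only the base tables are copied between cores, while the TAGE tagged tables, loop predictor and statistical corrector are simply reset and re-warm while running. All class and field names here are hypothetical illustrations, not the actual implementation:

```cpp
#include <array>
#include <cstdint>

// Hypothetical bimodal base predictor, roughly matching the 1.25kB entry in
// Table I: 2-bit counters plus hysteresis (stored one per byte here for
// simplicity rather than packed).
struct BimodalBase {
    std::array<uint8_t, 4096> counters{};   // ~8 kbits of 2-bit counters (packed)
    std::array<uint8_t, 2048> hysteresis{}; // ~2 kbits of hysteresis bits (packed)
};

// The InO core's whole predictor is just the base; the OoO TAGE wraps the
// same base layout as its fallback predictor.
struct InOPredictor { BimodalBase base; };

struct OoOTagePredictor {
    BimodalBase base;  // identical layout to the InO predictor
    // ... tagged tables, loop predictor, statistical corrector omitted ...
    void resetLargeComponents() { /* cleared on migration; rebuilt while running */ }
};

// On an InO -> OoO migration, copy only the common (base) state.
void migrateInOToOoO(const InOPredictor& src, OoOTagePredictor& dst) {
    dst.base = src.base;         // ~1.25kB moved over the interconnect
    dst.resetLargeComponents();  // large tagged tables are not transferred
}

// The reverse direction copies the base back to the InO core.
void migrateOoOToInO(const OoOTagePredictor& src, InOPredictor& dst) {
    dst.base = src.base;
}

int main() {
    InOPredictor little;
    OoOTagePredictor big;
    migrateInOToOoO(little, big);  // task moves to the OoO core
    migrateOoOToInO(big, little);  // and later back to the InO core
}
```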

IV. EVALUATION & RESULTS

We present two studies that, combined, show the impact of migrations without the branch predictor state, and how transferring minimal state can reduce the overheads significantly.

Calculating the branch predictor impact when migrating

We perform an experiment that measures the performance degradation when migrating without keeping any branch predictor state. To achieve this, we use gem5 [7] to simulate an Arm system. We sample its execution at regular intervals to simulate frequent context switches. For each sample we clone the simulator, running the execution twice:

1) We flush the pipeline and the branch predictor to simulate a system which has just migrated a task to another (identical) core without any branch predictor state.
2) We flush only the pipeline to simulate a system which has just migrated a task to another (identical) core while transferring the complete branch predictor state.

Upon cloning the simulation state, the original simulation continues, periodically records the performance of the system and creates new clones until it reaches the end of execution. A cloned simulation performs a migration, executes for one period and then terminates. The migration is approximated by clearing all the private and non-transferable state in an otherwise identical system. This is done in the simulator without the knowledge of the simulated operating system, effectively isolating the hardware cost of a migration from the software overhead.

The data collected from the clone of the simulator is paired to the main run to create samples that are completely aligned. This allows for a direct comparison of the performance, as shown in Figure 1. For each sample taken, we compare the instructions per cycle (IPC) from both simulations of each phase to determine the relative performance IPC_main/IPC_migration. We perform the experiment for both OoO and InO cores, comparing two systems: one that shares all the components (caches, TLBs and the branch predictor), and another that does not share the branch predictor. Additionally, we repeat the experiment for different migration periods ranging from 1k to 10M instructions. Details of the simulated system are shown in Table II.

From Figure 4 it can be seen that while the InO core loses at most roughly 12% of its performance when not migrating the branch predictor state, the OoO core loses as much as 38% of its performance when performing migrations as frequently as every 1000 instructions. This is because the InO core is 2-wide and cannot reorder instructions to mask memory latency, and is therefore heavily affected by cache misses. On the other hand, the wider OoO core is able to speculate and uncover instruction-level parallelism to mask misses. However, to do so, it needs accurate speculation to break dependency chains to feed the pipeline. Effectively, the more aggressive the core, the larger the need for accurate branch prediction. These results make evident that while the InO core can cope with migrations every 1k instructions, the performance lost by the OoO core prohibits such systems if the branch predictor state is not "warm".
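For concreteness, the relative-performance metric described above can be sketched as follows. The sample structure and the direction of the ratio (values below 1.0 meaning the migrated run is slower, as plotted in Figure 4) are our reading of the text, not code from the gem5 setup:

```cpp
#include <cstdio>
#include <vector>

// One paired sample: IPC of the uninterrupted (main) run and of the cloned run
// that performed a migration at the same point in execution.
struct PairedSample {
    double ipc_main;       // reference run, state kept
    double ipc_migration;  // cloned run, migrated (state cleared)
};

// Average relative performance across all aligned samples; values below 1.0
// indicate a slowdown of the migrated run relative to the uninterrupted run.
double averageRelativePerformance(const std::vector<PairedSample>& samples) {
    if (samples.empty()) return 0.0;
    double sum = 0.0;
    for (const auto& s : samples)
        sum += s.ipc_migration / s.ipc_main;
    return sum / samples.size();
}

int main() {
    std::vector<PairedSample> samples = {{1.6, 1.0}, {1.5, 1.4}};  // made-up values
    std::printf("relative performance: %.2f\n", averageRelativePerformance(samples));
}
```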

[Fig. 5 ("Component Contribution to MPKI of 64kB TAGE"): A component-wise breakdown of the misprediction rate for TAGE across the range of migration periods evaluated (1k to 100M instructions); the y-axis shows the normalized MPKI increase per set of branch predictor components retained. While the TAGE tables (T) unlock almost all of the accuracy, a transfer of 55kB is prohibitive. Transferring the bimodal base predictor offers the best MPKI decrease per kB, especially for small migration periods.]


When transferring the branch predictor state, the OoO core practically eliminates the performance degradation, as it recovers 35% of the lost performance. Compared to a system that does not migrate, the overhead of migration is less than 3%. This remaining overhead is due to the empty pipeline and reorder buffer. When comparing migrations between identical InO cores that transfer the branch predictor state, we observe that the improvement is virtually negligible, as less than 1% of performance is reclaimed, resulting in a 7% performance loss compared to an InO core with no migrations. Comparing InO and OoO, migrations become more efficient for the OoO core when transferring the branch predictor state, even for migrations with a 1k instruction period.

Transferring branch predictor state

To reduce the effects of the branch predictor, we evaluate the mispredictions per kilo instruction (MPKI) of TAGE-SC-L [16]. We use the championship branch prediction framework to assess the improvements in accuracy when migrating at finer granularities [16], [18], [20]. The framework uses over 250 traces from mobile and server applications, both long and short, to cover a wide variety of cases [20]. We modify the framework to allow flushing of the branch predictor state, in full or only of selected components. This allows us to dissect how each component contributes to the prediction for different migration frequencies. As seen in Figure 5, we dissect how each component of TAGE contributes to the misprediction rate when it is transferred over during a migration. Results show that retaining the TAGE tables delivers close to maximum accuracy. However, to retain the tables across migrations, 55kB of data need to be transferred, making fine grain migrations impractical. The bimodal and the statistical corrector reduce MPKI comparably, by 32% and 44% respectively for 1k-instruction migrations, for example. However, the statistical corrector requires 7.3kB of state to be transferred to achieve this, as opposed to the 1.25kB required by the bimodal. Overall, the bimodal is the component with the highest MPKI reduction per kB.

In Figure 6 we flush the entire 64kB TAGE predictor across different migration periods and compare it with our designs: TAGE-B, which retains the bimodal, and TAGE-P, which retains a base perceptron predictor (1.25kB and 3kB variants). Even when preserving minimal state, the accuracy of the TAGE predictor is dramatically improved, especially for fine grain migrations, where the MPKI reduction is in the best case 62% using the TAGE predictor with the 3kB base perceptron (TAGE-P 66kB). Figure 7 shows the total cost of a migration for an OoO core, taking into account the transfer of state, as a cycles per instruction (CPI) stack. When the BP state is retained, the CPI overhead is reduced by 0.43. Assuming a 256-bit bus for the transfer, the CPI increase for the 1.25kB and 3kB base predictors is 0.04 and 0.12 respectively for 1k-instruction migrations. For coarser migrations the transfer overhead is marginal and does not affect performance.

[Fig. 6 ("Accuracy of TAGE Transferring State"): Mispredictions (MPKI) of TAGE variants that transfer the base predictor, compared to a TAGE that does not transfer any state upon migration, across migration periods from 1k to 100M instructions. The benefits of transferring become increasingly noticeable with more frequent migrations.]

[Fig. 7 ("OoO Migration and Transfer Overheads"): CPI stacks showing the breakdown of the overheads with and without branch predictor state. When factoring in the transfer cost, using a fast SRAM cache bus, a 3kB perceptron can be transferred without prohibitive overheads.]
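As a rough back-of-the-envelope check of the transfer cost discussed above, the sketch below converts a base-predictor payload into bus transfers on a 256-bit link and amortizes it over a migration period. The one-transfer-per-cycle assumption is ours, and the numbers it produces are only indicative, not the paper's measured CPI stacks:

```cpp
#include <cstdio>

// Cycles to move `payload_bytes` over a bus `bus_bits` wide, assuming one
// bus transfer completes per cycle (a simplifying assumption on our part).
long transferCycles(long payload_bytes, int bus_bits) {
    long bytes_per_beat = bus_bits / 8;
    return (payload_bytes + bytes_per_beat - 1) / bytes_per_beat;  // round up
}

int main() {
    const long period = 1000;  // instructions between migrations
    for (long bytes : {1280L, 3072L}) {  // ~1.25kB and ~3kB base predictors
        long cycles = transferCycles(bytes, 256);
        // Worst case: the whole transfer is serialized against execution.
        std::printf("%ld bytes: %ld cycles, worst-case CPI overhead %.3f\n",
                    bytes, cycles, static_cast<double>(cycles) / period);
    }
}
```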
V. RELATED WORK

Recent studies try to improve efficiency in heterogeneous multiprocessors without sacrificing performance [1]–[6], [10]–[12]. Most of the proposals use hardware migrations to shift execution between high performance and energy efficiency. Such systems can lead to significant energy savings, as much as 36% [1], [3], [4]. To accomplish fast migrations, designs propose sharing front-ends between cores to eliminate the need for large transfers. However, other research shows that, when accounting for the multiplexing and wiring overheads of shared components, performance actually drops [1], [9]. Sharing the L2 caches and part of the translations [14] has been found to be more beneficial than full sharing, practically eliminating migration overheads in InO cores [1]. In our work, we find that the branch predictor also factors into the migration overheads, especially for frequent migrations to OoO cores.
To the best of our knowledge, transferring some part of the branch predictor state to reduce these effects has not been researched before.

VI. CONCLUSIONS

Heterogeneous multiprocessors traditionally feature OoO and InO cores, enabling operation at high performance and high energy efficiency respectively. By leveraging migrations, such systems can improve efficiency, executing on the InO core for memory bound phases with low ILP and switching to the OoO core when it can deliver high performance. Studies show that high frequency core switching could exploit smaller execution phases and save as much as 36% energy without sacrificing performance. However, these savings cannot currently be achieved due to the overhead of migrations. Sharing the L2 cache and the L2 TLB eliminates part of the overhead, but after switching, the cold branch predictor causes a notable performance drop. In this work, we focus on the branch predictor, which cannot be shared, as sharing adds latency to the critical path of execution and because InO and OoO cores require differently designed predictors. Instead, we propose transferring a subset of the OoO core's predictor state to the InO core and vice versa to reduce mispredictions. Our results show that, even for migrations as frequent as every 1k instructions, this can reduce mispredictions by 62% on average and help recover as much as 35% of the performance lost otherwise.

REFERENCES

[1] I. Vougioukas, A. Sandberg, S. Diestelhorst, B. Al-Hashimi, and G. Merrett, “Nucleus: Finding the sharing limit of heterogeneous cores,” ACM Transactions on Embedded Computing Systems, vol. 16, no. 5s, 2017.
[2] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke, “Composite cores: Pushing heterogeneity into a core,” in Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pp. 317–328, 2012.
[3] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke, “DynaMOS,” in Proceedings of the 48th International Symposium on Microarchitecture - MICRO-48, pp. 322–333, 2015.
[4] C. Fallin, C. Wilkerson, and O. Mutlu, “The heterogeneous block architecture,” in 2014 32nd IEEE International Conference on Computer Design, ICCD 2014, pp. 386–393, 2014.
[5] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. G. Dreslinski, T. F. Wenisch, and S. Mahlke, “Exploring fine-grained heterogeneity with composite cores,” IEEE Transactions on Computers, vol. 65, no. 2, pp. 535–547, 2016.
[6] A. Lukefahr, S. Padmanabha, R. Das, R. Dreslinski, T. F. Wenisch, and S. Mahlke, “Heterogeneous microarchitectures trump voltage scaling for low-power cores,” in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation - PACT ’14, pp. 237–250, 2014.
[7] N. Binkert, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, D. A. Wood, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, and T. Krishna, “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, p. 1, 2011.
[8] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, “MiBench: A free, commercially representative embedded benchmark suite,” in 2001 IEEE International Workshop on Workload Characterization, WWC 2001, pp. 3–14, IEEE, 2001.
[9] E. Rotenberg, B. H. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. Basu Roy Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon, “Rationale for a 3D heterogeneous multi-core processor,” in 2013 IEEE 31st International Conference on Computer Design, ICCD 2013, pp. 154–168, 2013.
[10] S. Navada, N. K. Choudhary, S. V. Wadhavkar, and E. Rotenberg, “A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors,” in Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pp. 133–144, IEEE, 2013.
[11] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. G. Dreslinski, T. F. Wenisch, and S. Mahlke, “Exploring fine-grained heterogeneity with composite cores,” IEEE Transactions on Computers, vol. 65, no. 2, pp. 535–547, 2016.
[12] Khubaib, M. A. Suleman, M. Hashemi, C. Wilkerson, and Y. N. Patt, “MorphCore: An energy-efficient microarchitecture for high performance ILP and high throughput TLP,” in Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pp. 305–316, 2012.
[13] B. Gopireddy, C. Song, J. Torrellas, N. S. Kim, A. Agrawal, and A. Mishra, “ScalCore: Designing a core for voltage scalability,” in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 681–693, 2016.
[14] A. Bhattacharjee and M. Martonosi, “Characterizing the TLB behavior of emerging parallel workloads on chip multiprocessors,” in Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pp. 29–40, 2009.
[15] A. Seznec and P. Michaud, “A case for (partially) TAgged GEometric history length branch prediction,” Journal of Instruction-Level Parallelism, vol. 8, pp. 1–23, 2006.
[16] A. Seznec, “TAGE-SC-L branch predictors,” JWAC-4: Championship Branch Prediction, 2014.
[17] D. Jimenez and C. Lin, “Dynamic branch prediction with perceptrons,” in Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture, pp. 197–206.
[18] D. A. Jiménez, “Multiperspective Perceptron Predictor,” JWAC-4: Championship Branch Prediction, 2014.
[19] Arm Ltd., “Arm Cortex-A55 Core - Technical Reference Manual,” 2017.
[20] “5th Championship Branch Prediction,” in 5th JILP Workshop on Computer Architecture Competitions (JWAC-5) - ISCA ’16, 2016.

Bibliography

5th JILP Workshop on Computer Architecture Competitions (JWAC-5) (2016). 5th Championship Branch Prediction.

Akram, S., Sartor, J. B., Craeynest, K. V., Heirman, W., and Eeckhout, L. (2016). Boosting the Priority of Garbage. ACM Transactions on Architecture and Code Optimization, 13(1):1–25.

Allan, T., Brumley, B. B., Falkner, K., van de Pol, J., and Yarom, Y. (2016). Amplifying side channels through performance degradation. In Proceedings of the 32nd Annual Conference on Computer Security Applications - ACSAC ’16, pages 422–435.

Arm Ltd. (2013). big.LITTLE Technology: The Future of Mobile. Arm white paper.

Arm Ltd., editor (2017a). Arm Architecture Reference Manual: ARMv8. Arm Ltd.

Arm Ltd. (2017b). Arm Cortex-A55 Core - Technical Reference Manual.

Barbalace, A., Murray, A., Lyerly, R., and Ravindran, B. (2014). Towards Operating System Support for Heterogeneous-ISA Platforms. In Workshop on Systems for Future Multicore Architectures (SFMA).

Baum, F., Raghuraman, A., and Baum, F. (2016). Making Full use of Emerging ARM-based Heterogeneous Multicore SoCs Making Full use of Emerging ARM-based Heterogeneous Multicore SoCs. ERTS 2016.

Becchi, M. and Crowley, P. (2008). Dynamic Thread Assignment on Heterogeneous Multiproces- sor Architectures. Journal of Instruction-Level Parallelism, 10:1–26.

Bhattacharjee, A., Lustig, D., and Martonosi, M. (2011). Shared last-level TLBs for chip multiprocessors. In Proceedings - International Symposium on High-Performance Computer Architecture, pages 62–73. IEEE.

Bhattacharjee, A. and Martonosi, M. (2009a). Characterizing the TLB behavior of emerging par- allel workloads on chip multiprocessors. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pages 29–40.

Bhattacharjee, A. and Martonosi, M. (2009b). Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. Proceedings - International Symposium on Computer Architecture (ISCA), 37(3):290.

Bhattacharjee, A. and Martonosi, M. (2010). Inter-core cooperative TLB for chip multiprocessors. ACM SIGPLAN Notices, 45(3):359.

Binkert, N., Sardashti, S., Sen, R., Sewell, K., Shoaib, M., Vaish, N., Hill, M. D., Wood, D. A., Beckmann, B., Black, G., Reinhardt, S. K., Saidi, A., Basu, A., Hestness, J., Hower, D. R., and Krishna, T. (2011). The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1.

Bischoff, A. S. (2016). User-experience-aware system optimisation for mobile systems. PhD thesis, University of Southampton.

Buck, I. and Hanrahan, P. (2003). Data Parallel Computation on Graphics Hardware. Elements.

Butko, A., Bruguier, F., Gamatié, A., Sassatelli, G., Novo, D., Torres, L., and Robert, M. (2016). Full-System Simulation of big.LITTLE Multicore Architecture for Performance and Energy Exploration. In Proceedings - IEEE 10th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2016, pages 201–208.

Butko, A., Gamatié, A., Sassatelli, G., Torres, L., and Robert, M. (2015). Design exploration for next generation high-performance manycore on-chip systems: Application to big.LITTLE architectures. In Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI, volume 07-10-July, pages 551–556.

Ceze, L., Tuck, J., Cascaval, C., and Torrellas, J. (2006). Bulk disambiguation of speculative threads in multiprocessors. Proceedings - International Symposium on Computer Architecture, 2006:227–238.

Chang, P. Y., Hao, E., Yeh, T. Y., and Patt, Y. (1996). Branch classification: A new mechanism for improving branch predictor performance. International Journal of Parallel Programming, 24(2):133–158.

Chaudhry, S., Caprioli, P., Yip, S., and Tremblay, M. (2005). High-performance throughput computing. IEEE Micro, 25(3):32–45.

Chaudhry, S., Cypher, R., Ekman, M., Karlsson, M., Landin, A., Yip, S., Zeffer, H., and Tremblay, M. (2009a). Rock: A high-performance CMT processor. IEEE Micro, 29(2):6–16.

Chaudhry, S., Cypher, R., Ekman, M., Karlsson, M., Landin, A., Yip, S., Zeffer, H., and Tremblay, M. (2009b). Simultaneous speculative threading. Proceedings of the 36th annual international symposium on Computer architecture - ISCA ’09.

Chen, X., Xu, Z., Kim, H., Gratz, P., Hu, J., Kishinevsky, M., and Ogras, U. (2012). In-network monitoring and control policy for DVFS of CMP networks-on-chip and last level caches. Proceedings of the 2012 6th IEEE/ACM International Symposium on Networks-on-Chip, NoCS 2012, 18(4):43–50.

Chien, A. A. and Karamcheti, V. (2013). Moore’s law: The first ending and a new beginning.

Lee, C.-C., Chen, I.-C., and Mudge, T. (2002). The bi-mode branch predictor. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 4–13. IEEE Comput. Soc.

Chronaki, K., Rico, A., Badia, R. M., and Ayguade,´ E. (2015). Criticality-aware dynamic task scheduling for heterogeneous architectures. Proceedings of the 29th.

Danowitz, A., Kelley, K., Mao, J., and Stevenson, J. P. (2012). CPU DB: Recording microprocessor history. Communications of the ACM, 55(55).

Dennard, R. H., Cai, J., and Kumar, A. (2007). A perspective on today’s scaling challenges and possible future directions. Solid-State Electronics, 51(4 SPEC. ISS.):518–525.

Devuyst, M., Venkat, A., and Tullsen, D. M. (2012). Execution migration in a heterogeneous-ISA chip multiprocessor. In ACM SIGARCH Computer Architecture News, volume 40, page 261, New York, New York, USA. ACM Press.

Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., and Burger, D. (2012). Dark silicon and the end of multicore scaling. IEEE Micro, 32(3):122–134.

Evers, M., Chang, P.-Y., and Patt, Y. N. (1996). Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches. In Proceedings of the 23rd annual international symposium on Computer architecture - ISCA ’96, volume 24, New York, New York, USA. ACM Press.

Evtyushkin, D., Riley, R., Abu-Ghazaleh, N. C., ECE, and Ponomarev, D. (2018). BranchScope. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS ’18, pages 693–707.

Eyerman, S. and Eeckhout, L. (2011). Fine-grained DVFS using on-chip regulators. ACM Transactions on Architecture and Code Optimization, 8(1):1–24.

Fallin, C., Wilkerson, C., and Mutlu, O. (2014). The heterogeneous block architecture. In 2014 32nd IEEE International Conference on Computer Design, ICCD 2014.

Fedorova, A., Saez, J. C., Shelepov, D., and Prieto, M. (2009). Maximizing power efficiency with asymmetric multicore systems. Communications of the ACM, 52(12):48.

Forbes, E., Zhang, Z., Widialaksono, R., Dwiel, B., Chowdhury, R. B. R., Srinivasan, V., Lipa, S., Rotenberg, E., Davis, W. R., and Franzon, P. D. (2016). Under 100-cycle thread migration latency in a single-ISA heterogeneous multi-core processor. In 2015 IEEE Hot Chips 27 Symposium, HCS 2015, pages 1–1. IEEE.

Goldstine, H. and Goldstine, A. (1996). The Electronic Numerical Integrator and Computer (ENIAC). IEEE Annals of the History of Computing, 18(1):10–16.

Google Developers (2013). Octane - The Benchmark Suite.

Gopireddy, B., Song, C., Torrellas, J., Kim, N. S., Agrawal, A., and Mishra, A. (2016). ScalCore: Designing a core for voltage scalability. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 681–693.

Graham, R. L., Knuth, D. E., and Patashnik, O. (1989). Concrete Mathematics: A Foundation for Computer Science, volume 2. Addison-Wesley.

Gupta, V., Brett, P., Koufaty, D., Reddy, D., and Hahn, S. (2012). The Forgotten ’Uncore’: On the Energy-Efficiency of Heterogeneous Cores. 2012 USENIX Federated Conferences Week, page 6.

Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. (2001). MiBench: A free, commercially representative embedded benchmark suite. In 2001 IEEE International Workshop on Workload Characterization, WWC 2001, pages 3–14. IEEE.

Gutierrez, A., Dreslinski, R. G., and Mudge, T. (2014). Evaluating private vs. shared last-level caches for energy efficiency in asymmetric multi-cores. In Proceedings - International Confer- ence on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS.

Hennessy, J. L. and Patterson, D. A. (2006). Computer Architecture, Fourth Edition: A Quantita- tive Approach. Elsevier.

Hill, M. D. and Marty, M. R. (2008). Amdahl’s Law in the Multicore Era. Computer, 41(July):33–38.

Ipek, E., Kirman, M., Kirman, N., and Martinez, J. F. (2007). Core fusion: accommodating software diversity in chip multiprocessors. SIGARCH Comput. Archit. News, 35(2):186–197.

Jacob, B., Ng, S. W., and Wang, D. T. (2008). Memory systems : cache, DRAM, disk. Morgan Kaufmann Publishers.

Jimenez, D., Keckler, S., and Lin, C. (2000). The impact of delay on the design of branch predictors. International Symposium on Microarchitecture, pages 67–76.

Jimenez, D. and Lin, C. (2001). Dynamic branch prediction with perceptrons. In Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture, pages 197–206.

Jimenez, D. A. (2003). Reconsidering complex branch predictors. In Proceedings - International Symposium on High-Performance Computer Architecture, volume 12, pages 43–52. IEEE Comput. Soc.

Jiménez, D. A. (2014a). Multiperspective Perceptron Predictor. JWAC-4: Championship Branch Prediction.

Jiménez, D. A. (2014b). Multiperspective Perceptron Predictor with TAGE. JWAC-4: Championship Branch Prediction.

Jiménez, D. A. and Lin, C. (2002). Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems, 20(4):369–397.

Kamil, S., Shalf, J., and Strohmaier, E. (2008). Power efficiency in high performance computing. Ipdps, pages 1–8.

Kampe, M., Stenström, P., and Dubois, M. (2002). The FAB predictor: Using Fourier analysis to predict the outcome of conditional branches. In Proceedings - International Symposium on High-Performance Computer Architecture, volume 2002-January, pages 223–232. IEEE Comput. Soc.

Karpuzcu, U. R., Greskamp, B., and Torrellas, J. (2009). The BubbleWrap many-core: Popping cores for sequential acceleration. 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO, -(December):447–458.

Khan, O. and Kundu, S. (2010). A Self-Adaptive Scheduler for Asymmetric Multi-cores. Proceedings of the 20th symposium on Great lakes symposium on VLSI - GLSVLSI ’10, page 397.

Khasawneh, K. N., Koruyeh, E. M., Song, C., Evtyushkin, D., Ponomarev, D., and Abu-Ghazaleh, N. (2018). SafeSpec: Banishing the Spectre of a Meltdown with Leakage-Free Speculation. ArXiv.

Khubaib (2014). Performance and Energy Efficiency via an Adaptive MorphCore Architecture. PhD thesis, University of Texas Austin.

Khubaib, Suleman, M. A., Hashemi, M., Wilkerson, C., and Patt, Y. N. (2012). MorphCore: An energy-efficient microarchitecture for high performance ILP and high throughput TLP. Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pages 305–316.

Kim, N. S., Austin, T., Blaauw, D., Mudge, T., Flautner, K., Hu, J. S., Jane Irwin, M., Kandemir, M., and Narayanan, V. (2003). Leakage Current: Moore’s Law Meets Static Power.

Kim, Y., Daly, R., Kim, J., Fallin, C., Lee, J. H., Lee, D., Wilkerson, C., Lai, K., and Mutlu, O. (2014). Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. In Proceedings - International Symposium on Computer Architecture, pages 361–372.

Kocher, P., Genkin, D., Gruss, D., Haas, W., Hamburg, M., Lipp, M., Mangard, S., Prescher, T., Schwarz, M., and Yarom, Y. (2018). Spectre Attacks: Exploiting Speculative Execution. ArXiv.

Kondo, M., Sasaki, H., and Nakamura, H. (2007). Improving fairness, throughput and energy-efficiency on a chip multiprocessor through DVFS. ACM SIGARCH Computer Architecture News, 35(1):31.

Koufaty, D., Reddy, D., and Hahn, S. (2010). Bias Scheduling in Heterogeneous Multicore Architectures. Proceedings of the 5th European conference on Computer systems, pages 125–138.

Kumar, R., Farkas, K. I., Jouppi, N. P., Ranganathan, P., and Tullsen, D. M. (2003). Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction. In Proceedings of the Annual International Symposium on Microarchitecture, MICRO, volume 2003-Janua, pages 81–92.

Kumar, R., Tullsen, D. M., and Jouppi, N. P. (2006). Core architecture optimization for hetero- geneous chip multiprocessors. Proceedings of the 15th international conference on Parallel architectures and compilation techniques - PACT ’06, page 23.

Kumar, R., Tullsen, D. M., Ranganathan, P., Jouppi, N. P., and Farkas, K. I. (2004). Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. Proceed- ings of the 31st annual international symposium on Computer architecture, page 64.

Lee, J. K. F. and Smith, A. J. (1984). Branch Prediction Strategies and Branch Target Buffer Design. Computer, 17(1):6–22.

Lipp, M., Schwarz, M., Gruss, D., Prescher, T., Haas, W., Mangard, S., Kocher, P., Genkin, D., Yarom, Y., and Hamburg, M. (2018). Meltdown. ArXiv e-prints.

Liu, F., Yarom, Y., Ge, Q., Heiser, G., and Lee, R. B. (2015). Last-level cache side-channel attacks are practical. In Proceedings - IEEE Symposium on Security and Privacy, volume 2015-July, pages 605–622.

Loh, G. H. (2006). Revisiting the performance impact of branch predictor latencies. In Perfor- mance Analysis of Systems and Software, 2006 IEEE International Symposium on IEEE, pages 59–69. IEEE.

Loh, G. H., Henry, D. S., Krishnamurthy, A., and Edu, Y. (2003). Exploiting Bias in the Hysteresis Bit of 2-bit Saturating Counters in Branch Predictors. Journal of Instruction-Level Parallelism, 5:1–32.

Lukefahr, A., Padmanabha, S., Das, R., Dreslinski, R., Wenisch, T. F., and Mahlke, S. (2014). Heterogeneous microarchitectures trump voltage scaling for low-power cores. Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT ’14, pages 237–250.

Lukefahr, A., Padmanabha, S., Das, R., Sleiman, F. M., Dreslinski, R., Wenisch, T. F., and Mahlke, S. (2012). Composite cores: Pushing heterogeneity into a core. In Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pages 317–328.

Lukefahr, A., Padmanabha, S., Das, R., Sleiman, F. M., Dreslinski, R. G., Wenisch, T. F., and Mahlke, S. (2016). Exploring fine-grained heterogeneity with composite cores. IEEE Transactions on Computers, 65(2):535–547.

Lustig, D., Bhattacharjee, A., and Martonosi, M. (2013). TLB Improvements for Chip Multipro- cessors. ACM Transactions on Architecture and Code Optimization, 10(1):1–38.

Mathur, V. K. (1991). How Well Do We Know Pareto Optimality? The Journal of Economic Education, 22(2):172.

McFarling, S. (1993). Combining Branch Predictors. Western Research Laboratory, (June):1–29.

Miftakhutdinov, R., Ebrahimi, E., and Patt, Y. N. (2012). Predicting performance impact of DVFS for realistic memory systems. Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pages 155–165.

Mittal, S. (2016). A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors. ACM Computing Surveys, 48(3, February 2016).

Mittal, S. (2019). A survey of techniques for dynamic branch prediction. Concurrency Computa- tion, 31(1).

Moore, G. E. (1998). Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85.

Mucci, P. J., London, K., and Thurman, J. (1998). The CacheBench Report. CEWES MSRC/PET, 19(TR/98-25).

Navada, S., Choudhary, N. K., Wadhavkar, S. V., and Rotenberg, E. (2013). A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors. In Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, pages 133–144. IEEE.

Nikov, K., Nunez-Yanez, J. L., and Horsnell, M. (2015). Evaluation of Hybrid Run-Time Power Models for the ARM Big.LITTLE Architecture. 2015 IEEE 13th International Conference on Embedded and Ubiquitous Computing, pages 205–210.

Padmanabha, S. and Lukefahr, A. (2013). Trace based phase prediction for tightly-coupled heterogeneous cores. Proceedings of the International Symposium on Microarchitecture, pages 445–456.

Padmanabha, S., Lukefahr, A., Das, R., and Mahlke, S. (2015). DynaMOS. In Proceedings of the 48th International Symposium on Microarchitecture - MICRO-48, pages 322–333.

Pannuto, P., Lee, Y., Kuo, Y. S., Foo, Z. Y., Kempke, B., Kim, G., Dreslinski, R. G., Blaauw, D., and Dutta, P. (2016). MBus: A System Integration Bus for the Modular Microscale Computing Class. IEEE Micro, 36(3):60–70.

Patsilaras, G., Choudhary, N. K., and Tuck, J. (2012). Efficiently Exploiting Memory Level Parallelism on Asymmetric Coupled Cores in the Dark Silicon Era. ACM Trans. Architec. Code Optim, 8(21).

Pell, O. and Mencer, O. (2011). Surviving the end of frequency scaling with reconfigurable dataflow computing. ACM SIGARCH Computer Architecture News, 39(4):60.

Pham, B., Bhattacharjee, A., Eckert, Y., and Loh, G. H. (2014). Increasing TLB reach by exploiting clustering in page translations. In Proceedings - International Symposium on High-Performance Computer Architecture, pages 558–567. IEEE.

Pham, B., Vaidyanathan, V., Jaleel, A., and Bhattacharjee, A. (2012). CoLT: Coalesced large-reach TLBs. In Proceedings - 2012 IEEE/ACM 45th International Symposium on Microarchitecture, MICRO 2012, pages 258–269. IEEE.

Pricopi, M., Muthukaruppan, T. S., Venkataramani, V., Mitra, T., and Vishin, S. (2013). Power- performance modeling on asymmetric multi-cores. 2013 International Conference on Compil- ers, Architecture and Synthesis for Embedded Systems, CASES 2013.

Prodromou, A., Venkat, A., and Tullsen, D. M. (2019). Deciphering Predictive Schedulers for Heterogeneous-ISA Multicore Architectures. In Proceedings of the 10th International Work- shop on Programming Models and Applications for Multicores and Manycores - PMAM’19, pages 51–60, New York, New York, USA. ACM Press.

Prout, A., Arcand, W., Bestor, D., Bergeron, B., Byun, C., Gadepally, V., Houle, M., Hubbell, M., Jones, M., Klein, A., Michaleas, P., Milechin, L., Mullen, J., Rosa, A., Samsi, S., Yee, C., Reuther, A., and Kepner, J. (2018). Measuring the Impact of Spectre and Meltdown. ArXiv, pages 1–5.

Bryant, R. E. and O’Hallaron, D. R. (2015). Computer Systems: A Programmer’s Perspective. Prentice Hall.

Rangan, K. K., Wei, G.-Y., and Brooks, D. (2009). Thread motion: Fine-Grained Power Management for Multi-Core Systems. Proceedings of the 36th annual international symposium on Computer architecture - ISCA ’09, page 302.

Reddy, B. K., Walker, M. J., Balsamo, D., Diestelhorst, S., Al-Hashimi, B. M., and Merrett, G. V. (2017). Empirical CPU power modelling and estimation in the gem5 simulator. In 2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation, PATMOS 2017, volume 2017-Janua, pages 1–8. IEEE.

Rotenberg, E., Dwiel, B. H., Forbes, E., Zhang, Z., Widialaksono, R., Basu Roy Chowdhury, R., Tshibangu, N., Lipa, S., Davis, W. R., and Franzon, P. D. (2013). Rationale for a 3D heterogeneous multi-core processor. In 2013 IEEE 31st International Conference on Computer Design, ICCD 2013, pages 154–168. IEEE.

Rubio, J. and Vijaykrishnan, N. (2007). OS-Aware Branch Prediction: Improving Microprocessor Control Flow Prediction for Operating Systems. IEEE Transactions on Computers, 56(1):2–17.

Sadeghi, A., Raahemifar, K., Fathy, M., and Asad, A. (2015). Lighting the Dark-Silicon 3D Chip Multi-processors by Exploiting Heterogeneity in . Proceedings - IEEE

9th International Symposium on Embedded Multicore/Manycore SoCs, MCSoC 2015, pages 182–186.

Sanchez, D. and Kozyrakis, C. (2013). ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems. Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA ’13, 41:475.

Sandberg, A., Nikoleris, N., Carlson, T. E., Hagersten, E., Kaxiras, S., and Black-Schaffer, D. (2015). Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed. In 2015 IEEE International Symposium on Workload Characterization, pages 183–192.

Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Burger, D., Keckler, S. W., and Moore, C. (2003). Exploiting ILP, TLP, and DLP with the polymorphous . IEEE Micro, 23(6).

Sankaralingam, K., Nagarajan, R., McDonald, R., Desikan, R., Drolia, S., Govindan, M. S., Gratz, P., Gulati, D., Hanson, H., Kim, C., Liu, H., Ranganathan, N., Sethumadhavan, S., Sharif, S., Shivakumar, P., Keckler, S. W., and Burger, D. (2006). Distributed microarchitectural protocols in the TRIPS prototype processor. Proceedings of the Annual International Symposium on Microarchitecture, MICRO, -(January 2016):480–491.

Sarma, S. and Dutt, N. (2015). Cross-Layer Exploration of Heterogeneous Multicore Processor Configurations. 2015 28th International Conference on VLSI Design, -(JANUARY):147–152.

Sarma, S., Muck, T., Bathen, L. A. D., Dutt, N., and Nicolau, A. (2015). SmartBalance. Proceedings of the 52nd Annual Design Automation Conference on - DAC ’15, -(October):1–6.

Sasaki, H., Ikeda, Y., Kondo, M., and Nakamura, H. (2007). An intra-task dvfs technique based on statistical analysis of hardware events. In Proceedings of the 4th Internationa Conference on Computing Frontiers, page 123.

Schumpeter, J. A. (1949). Vilfredo Pareto (1848-1923). The Quarterly Journal of Economics, 63(2):147–173.

Seznec, A. (2005). Analysis of the o-geometric history length branch predictor. In Proceedings - International Symposium on Computer Architecture, pages 394–405.

Seznec, A. (2011). A 64 kbytes ISL-TAGE branch predictor. JWAC-2: Championship Branch Prediction, pages 2–5.

Seznec, A. (2014). TAGE-SC-L branch predictors. JWAC-4: Championship Branch Prediction.

Seznec, A. and Michaud, P. (2006). A case for (partially) TAgged GEometric history length branch prediction. Journal of Instruction-Level Parallelism JILP, 8:1–23.

Shelepov, D., Saez Alcaide, J. C., Jeffery, S., Fedorova, A., Perez, N., Huang, Z. F., Blagodurov, S., and Kumar, V. (2009). Hass: A Scheduler for Heterogenous Multicore Systems. ACM SIGOPS Operating Systems Review, 43(2):66.

Smith, J. E. (2003). A study of branch prediction strategies. In 25 years of the international symposia on Computer architecture (selected papers) - ISCA ’98, pages 202–215, New York, New York, USA. ACM Press.

Spiliopoulos, V., Kaxiras, S., and Keramidas, G. (2011). Green governors: A framework for continuously adaptive DVFS. 2011 International Green Computing Conference and Workshops, IGCC 2011.

Srinivasan, S., Kurella, N., Koren, I., and Kundu, S. (2016). Exploring heterogeneity within a core for improved power efficiency. IEEE Transactions on Parallel and Distributed Systems, 27(4):1057–1069.

Srinivasan, S., Zhao, L., Illikkal, R., and Iyer, R. (2011). Efficient interaction between OS and architecture in heterogeneous platforms. ACM SIGOPS Operating Systems Review, 45(1):62.

Stallings, W. (2011). Operating Systems: Internals and Design Principles. Prentice Hall.

Tanenbaum, A. S. and Bos, H. (2014). Modern Operating Systems, volume 2. Pearson.

Tarjan, D. and Skadron, K. (2005). Merging path and gshare indexing in perceptron branch prediction. ACM Transactions on Architecture and Code Optimization, 2(3):280–300.

Tarsa, S. J., Kumar, A. P., and Kung, H. T. (2014). Workload prediction for adaptive power scaling using deep learning. 2014 IEEE International Conference on IC Design & Technology, pages 1–5.

Tomusk, E., Dubach, C., and O’Boyle, M. (2014). Measuring flexibility in single-ISA heteroge- neous processors. Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT ’14, pages 495–496.

Tomusk, E., Dubach, C., and O’Boyle, M. (2015). Diversity: A Design Goal for Heterogeneous Processors. IEEE Computer Architecture Letters, pages 1–1.

Tomusk, E. and O’Boyle, M. (2013). Weak heterogeneity as a way of adapting multicores to real workloads. In Proceedings of the 3rd International Workshop on Adaptive Self-Tuning Computing Systems - ADAPT ’13, pages 1–3, New York, New York, USA. ACM Press.

Toumey, C. (2016). Less is Moore. Nature Nanotechnology, 11(1):2–3.

Venkat, A. (2019). Composite-ISA Cores : Enabling Multi-ISA Heterogeneity Using a Single ISA. In High Performance Computer Architecture (HPCA). HPCA.

Venkat, A. and Tullsen, D. M. (2014). Harnessing ISA diversity: Design of a heterogeneous-ISA chip multiprocessor. In Proceedings - International Symposium on Computer Architecture, pages 121–132.

Venkat, A. and Tullsen, D. M. (2016). HIPStR – Heterogeneous-ISA Program State . In ASPLOS ’16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pages 727–741.

Vet, J. Y., Carribault, P., and Cohen, A. (2013). Multigrain Affinity for Heterogeneous Work Stealing. Conference on High-Performance and Embedded Architectures and (HiPEAC ’12).

Vougioukas, I., Sandberg, A., Diestelhorst, S., Al-Hashimi, B., and Merrett, G. (2017). Nucleus: Finding the sharing limit of heterogeneous cores. ACM Transactions on Embedded Computing Systems, 16(5s).

Vougioukas, I., Sandberg, A., Diestelhorst, S., Al-Hashimi, B., and Merrett, G. (2018). Will it blend? Merging heterogeneous cores. In Adaptive Many-Core Architectures and Systems Workshop.

Vougioukas, I., Sandberg, A., Nikoleris, N., Diestelhorst, S., Al-Hashimi, B., and Merrett, G. (2019). BRB: Mitigating Branch Predictor Side-Channels. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).

Walker, M., Bischoff, S., Diestelhorst, S., Merrett, G., and Al-Hashimi, B. (2018a). Hardware- Validated CPU Performance and Energy Modelling. In Proceedings - 2018 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2018, pages 44–53.

Walker, M., Diestelhorst, S., Merrett, G., and Al-Hashimi, B. (2018b). Accurate and stable empirical CPU power modelling for multi- and many-core systems. In Adaptive Many-Core Architectures and Systems Workshop.

Weste, N. H. and Eshraghian, K. (2010). Principles of CMOS VLSI Design. Principles of CMOS VLSI Design, 2:231–237.

Yan, M., Choi, J., Skarlatos, D., Morrison, A., Fletcher, C., and Torrellas, J. (2018). Invisispec: Making speculative execution invisible in the cache hierarchy. In Proceedings of the Annual International Symposium on Microarchitecture, MICRO, volume 2018-Octob, pages 428–441. IEEE.

Yarom, Y. and Falkner, K. (2014). Flush + Reload : a High Resolution , Low Noise , L3 Cache Side-Channel Attack. USENIX Security 2014, pages 1–14.

Yeh, T.-Y. and Patt, Y. N. (2003). Two-level adaptive training branch prediction. In Proceedings of the 24th annual international symposium on Microarchitecture - MICRO 24, pages 51–61, New York, New York, USA. ACM Press.

Yeh, T.-Y. and Patt, Y. N. (2004a). A comparison of dynamic branch predictors that use two levels of branch history. ACM SIGARCH Computer Architecture News, 21(2):257–266.

Yeh, T.-Y. and Patt, Y. N. (2004b). Alternative implementations of two-level adaptive branch prediction. ACM SIGARCH Computer Architecture News, 20(2):124–134.

Yoo, S., Shim, Y., Lee, S., Lee, S.-a., and Kim, J. (2015). A case for bad big.LITTLE switching. HotPower, pages 1–5.

Young, C. and Smith, M. D. (2004). Improving the accuracy of static branch prediction using branch correlation. ACM SIGOPS Operating Systems Review, 28(5):232–241.

Yu, K., Han, D., Youn, C., Hwang, S., and Lee, J. (2013). Power-aware task scheduling for big.LITTLE mobile processor. In ISOCC 2013 - 2013 International SoC Design Conference, pages 208–212. IEEE.