Energy Efficient Branch Prediction

Michael Andrew Hicks

A thesis submitted in partial fulfilment of the requirements of the University of Hertfordshire for the degree of Doctor of Philosophy

December 2007

To my family and friends.

Contents

1 Introduction
  1.1 Thesis Statement
  1.2 Motivation and Energy Efficiency
  1.3 Branch Prediction
  1.4 Contributions
  1.5 Dissertation Structure

2 Energy Efficiency in Modern Processor Design
  2.1 Transistor Level Power Dissipation
    2.1.1 Static Dissipation
    2.1.2 Dynamic Dissipation
    2.1.3 Energy Efficiency Metrics
  2.2 Transistor Level Energy Efficiency Techniques
    2.2.1 Clock Gating and Vdd Gating
    2.2.2 Technology Scaling
    2.2.3 Voltage Scaling
    2.2.4 Logic Optimisation
  2.3 Architecture & Software Level Efficiency Techniques
    2.3.1 Activity Factor Reduction
    2.3.2 Delay Reduction
    2.3.3 Low Power Scheduling
    2.3.4 Frequency Scaling
  2.4 Branch Prediction
    2.4.1 The Branch Problem
    2.4.2 Dynamic and Static Prediction
    2.4.3 Dynamic Predictors
    2.4.4 Power Consumption
  2.5 Summary

3 Related Techniques
  3.1 The Prediction Probe Detector (Hardware)
    3.1.1 Implementation
    3.1.2 Pipeline Gating
  3.2 Software Based Approaches
    3.2.1 Hinting and Hint Instructions
  3.3 Analysis and Summary

4 Initial Investigation and Preliminary Research
  4.1 Research Question Focus
  4.2 Static Methods to Avoid Dynamic Branch Prediction
    4.2.1 Delay Region Scheduling
    4.2.2 Static Prediction and Instruction Hints
    4.2.3 Guarded Execution
  4.3 Hardware Multithreading
  4.4 Initial Experiments
    4.4.1 Removing Dynamic Branch Predictors
    4.4.2 Instruction Stream Research (HTracer)
    4.4.3 I-Cache Experimentation
  4.5 Summary

5 The Combined Approach
  5.1 Local Delay Region Scheduling
  5.2 Profiling
    5.2.1 Assigning a Static Branch Behaviour
    5.2.2 Adaptive Branch Bias Measurement (ABBM)
  5.3 The Combined Algorithm
  5.4 Hardware Implementation
    5.4.1 Instruction Set Modifications
    5.4.2 Hardware Modifications
  5.5 Summary

6 Simulation Tools
  6.1 Introduction
  6.2 Simulator (HWattch)
    6.2.1 Architecture Model
    6.2.2 Architecture Modifications
    6.2.3 Profiling Enhancement
    6.2.4 Instruction Set (PISA)
    6.2.5 Compiler (Custom GCC)
  6.3 Scheduler and Static Prediction Assigner (HACA)
    6.3.1 Combined Algorithm: Practical Implementation
  6.4 EEMBC
    6.4.1 Sub-Suites and Benchmarks
    6.4.2 Bespoke Build System for the Combined Algorithm
  6.5 Summary

7 Simulations and Results
  7.1 Introduction
  7.2 The Baseline Models
    7.2.1 The …
    7.2.2 Scalar Processor
    7.2.3 Multiple Instruction Issue Processor
  7.3 Preamble To Results
    7.3.1 Metrics
    7.3.2 Calculation of Averages and ‘Weighted Averages’
    7.3.3 Important Summary Notes
  7.4 Scalar Processor Results
    7.4.1 Benchmark Breakdown
    7.4.2 Averages
  7.5 Two Instruction Issue Processor Results
    7.5.1 Benchmark Breakdown
    7.5.2 Averages
  7.6 Sixteen Instruction Issue Processor Results
    7.6.1 Benchmark Breakdown
    7.6.2 Averages
  7.7 Overall Analysis
    7.7.1 Results Summary

8 Comparisons and Enhancements
  8.1 Comparison of ABBM with Fixed Bias Level and Compiler Heuristics
    8.1.1 Results and Analysis
  8.2 Reducing Set Associativity in the Branch Target Buffer
    8.2.1 Results and Analysis
  8.3 Summary

9 Conclusion and Discussion
  9.1 Thesis Summary
    9.1.1 Key Novelties and Contributions
  9.2 Generalisation
  9.3 Critique
    9.3.1 Local Delay Region
    9.3.2 Hint Bits
    9.3.3 Timing Issues
    9.3.4 Profiling Duration
    9.3.5 Profiling on a ‘Real’ Architecture
    9.3.6 Dependency on Datasets
  9.4 Related Work Comparison
    9.4.1 Prediction Probe Detector
  9.5 Future Work
    9.5.1 Maximising the Fetch Window of Wide Issue Processors
    9.5.2 Hinting Libraries
    9.5.3 Combining with the Prediction Probe Detector
    9.5.4 Hints and Context Switching
    9.5.5 Profiling and Processor-Wide Power Saving
  9.6 Concluding Remarks

Bibliography

Glossary

Appendix A: Published Papers
  i. Towards an Energy Efficient Branch Prediction Scheme …
  ii. Reducing the Branch Power Cost In Embedded Processors …
  iii. HTracer: A Dynamic Instruction Stream Research Tool
  iv. Enhancing the I-cache to Reduce the Power Consumption …

Appendix B: Technical Reports
  i. An Introduction to Power Consumption Issues in Processor Design
  ii. HTracer V0.5: A User Guide

Appendix C: Additional Background

Appendix D: Raw Data

List of Figures

2.1 An example five stage processor pipeline
2.2 An example of a modern dynamic predictor architecture

3.1 The Prediction Probe Detector

5.1 An example of local delayed branch scheduling
5.2 The basic structure of profiling
5.3 Block model of the profiling and hinting regime
5.4 Hardware modifications required in the instruction fetch stage

6.1 The Wattch simulator in relation to SimpleScalar
6.2 The Wattch simulator pipeline
6.3 A logical representation of the IF stage hint-bits showing ‘1,1’
6.4 A logical representation of the EXE stage hint-bits showing ‘1,0’
6.5 The PISA instruction format
6.6 The location of the two hint-bits within the branch instruction format

7.1 Scalar baseline global power savings (%) compared with ideal (free) prediction
7.2 Scalar baseline average global power savings (%) compared with ideal (free) prediction
7.3 2-way issue baseline global power savings (%) compared with ideal (free) prediction
7.4 2-way issue baseline average power savings (%) compared with ideal (free) prediction
7.5 16-way issue baseline global power savings (%) compared with ideal (free) prediction
7.6 16-way issue baseline average power savings compared with ideal (free) prediction

8.1 Average change in the dynamic instruction stream after resizing the BTB from four-way to two-way set-associativity
8.2 Additional power saving after resizing the BTB

9.1 Series and parallel i-cache/branch predictor access. (1) and (2) represent the direction and target address predictors, respectively

List of Tables

5.1 Static and dynamic branch occurrence for each PISA branch, and its occurrence across the whole EEMBC benchmark suite

6.1 Static and dynamic branch occurrence for each PISA branch, and its occurrence across the whole EEMBC benchmark suite
6.2 The full EEMBC benchmark suite with descriptions

7.1 Scalar Processor Baseline Configuration
7.2 2-Way Issue Processor Baseline Configuration
7.3 16-Way Issue Processor Baseline Configuration
7.4 Benchmark breakdown results for scalar baseline processor
7.5 Average benchmark results for scalar baseline processor
7.6 Benchmark breakdown results for two-way issue baseline processor
7.7 Average benchmark results for two-way issue baseline processor
7.8 Benchmark breakdown results for sixteen-way issue baseline processor
7.9 Average benchmark results for sixteen-way issue baseline processor

8.1 Average benchmark results for fixed bias hinting

Acknowledgements

I would first like to thank Colin Egan. His continual support, both academic and personal, played an essential role in this work. I would also like to thank Bruce Christianson for his insightful assistance and knowledge of the research process. The work described in this dissertation was also greatly aided by rational discussion with Patrick Quick – thank you.

I am indebted to my parents and family. This work would not have been possible without them.

Finally, I would like to thank all of those remaining people that have, either directly or indirectly, assisted me in the completion of this project.

Abstract

Energy efficiency is of the utmost importance in modern high-performance embedded processor design. As the number of transistors on a chip continues to increase each year, and processor logic becomes ever more complex, the dynamic switching power cost of running such processors increases. The continual progression in fabrication processes brings a reduction in the feature size of the transistor structures on chips with each new technology generation. This reduction in size increases the significance of leakage power (a constant drain that is proportional to the number of transistors). Particularly in embedded devices, the proportion of an electronic product’s power budget accounted for by the CPU is significant (often as much as 50%).

Dynamic branch prediction is a hardware mechanism used to forecast the direction, and target address, of branch instructions. This is essential to high performance pipelined and superscalar processors, where the direction and target of branches is not computed until several stages into the pipeline. Accurate branch prediction also acts to increase energy efficiency by reducing the amount of time spent executing mis-speculated instructions. ‘Stalling’ is no longer a sensible option when the significance of static power dissipation is considered. Dynamic branch prediction logic typically accounts for over 10% of a processor’s global power dissipation, making it an obvious target for energy optimisation.

Previous approaches to increasing the energy efficiency of dynamic branch prediction logic have focused on either fully dynamic or fully static techniques. Dynamic techniques include the introduction of a new cache-like structure that can decide whether branch prediction logic should be accessed for a given branch, while static techniques tend to focus on scheduling around branch instructions so that a prediction is not needed (or the branch is removed completely).

This dissertation explores a method of combining static techniques and profiling information with simple hardware support in order to reduce the number of accesses made to a branch predictor. The local delay region is used on unconditional absolute branches to avoid prediction, and, for most other branches, Adaptive Branch Bias Measurement (through profiling) is used to assign a static prediction that is as accurate as a dynamic prediction for that branch. This information is represented as two hint-bits in branch instructions, and then interpreted by simple hardware logic that bypasses both the lookup and update phases for appropriate branches.

The global processor power saving that can be achieved by this Combined Algorithm is around 6% on the experimental architectures shown. These architectures are based upon real contemporary embedded architecture specifications. The introduction of the Combined Algorithm also significantly reduces the execution time of programs on Multiple Instruction Issue processors. This is attributed to the increase achieved in global prediction accuracy.

Chapter 1

Introduction

This dissertation investigates one of the most important areas of contemporary research in computer engineering: energy efficiency [117] [68]. With ever more high performance embedded devices appearing on the market each year [71], requiring more and more power from a fixed capacity battery, the need to improve the energy efficiency of processor designs is paramount to the viability of new architectures [13]. This project proposes the thesis that the energy efficiency of the costly branch predictor unit can be further increased by a combination of traditional and new techniques [44]. This chapter introduces the motivation and scope of this project. The broad thesis and contributions are described, and the structure of the rest of the dissertation is outlined.

1.1 Thesis Statement

The resulting thesis of this PhD project is that software and hardware prediction techniques, traditionally used in isolation, can be combined to reduce the power consumption of dynamic branch prediction units in modern processors. This is achieved with minimal hardware modification and shows that hardware and software prediction strategies can be further combined to increase global processor energy efficiency.

1.2 Motivation and Energy Efficiency

We’re projecting by 2010 there will be more than 2.5 billion wireless handheld devices capable of providing the communications functions combined with the processing power of today’s high-performance PCs.

– Paul Otellini, Intel, 2003 [75]

Energy efficiency is an increasingly important issue in the field of computer engineering. This is because power consumption affects so many aspects of real-world implementations:

Battery Life - The most obvious restriction on mobile devices is the length of time for which the device can be used, and the performance that can be expected. Power efficient designs can extend this but, as shown later, care must be taken in assessing the relative advantages of a design.

Thermal Issues - Power dissipation results in heat. Excessive heat dissipation will affect the design of relevant cooling systems (and thus device packaging/size), reliability and precise timing. One of the most important issues on the desktop machine is packaging size and cost. On large scale servers packaging size will represent a volumetric limit on rack capacity.

Large Scale Power Consumption - This aspect often seems irrelevant when examining the desktop and small scale market, but when one looks at massive arrays of machines the resulting power consumption can be substantial. Small power savings over a large number of processors account for a significant saving for a single user/organisation.

Given the prediction quoted by Intel, one can see the extreme importance of power efficient designs in the coming decades. In fact, power efficiency is a likely lynchpin for new high speed processor designs. If new portable devices are to achieve Intel’s prediction then they must be both energy efficient and high performance [38]. Such requirements pull processor design in two directions at once; high performance and low power have often been considered a dichotomy.

Processors consume power at the transistor level both by switching and by maintaining their state. When a transistor switches, it dissipates a small amount of energy into the circuit. When transistors are maintaining their state, a small amount of electricity can escape through the gate due to imperfections in the production process. Transistors switch when activity occurs in the processor. This includes accesses to caches, functional units and the algorithms used to manage processor activity. With processors now containing over one billion transistors, the issues of power consumption and energy efficiency are more important than ever, particularly to the embedded market.

The thesis of this document is that it is possible to maintain the high performance of a processor, through the use of branch prediction, but relieve some of the significant power burden of high performance designs. The central focus is the embedded market, as this is where energy efficiency is of the utmost importance. To date, most research into energy efficiency has followed the traditional separation of transistor level, dynamic and static approaches. There has been little in the way of handshaking between static and dynamic techniques. The result of the work conducted in this report demonstrates how static and dynamic techniques can be combined into a more effective energy scheme.

If embedded processors continue to increase in performance, there must be a coordinated effort to develop energy efficient algorithms [69]. If this does not happen, the new generation of embedded devices will demand power at a rate that current battery technology cannot sustain.

1.3 Branch Prediction

Modern high performance embedded processors use a technique known as pipelining to increase instruction level parallelism. Pipelining breaks down the processor into a series of different stages with an intermediate store between each stage [84]. This allows a number of instructions to be undergoing execution at any given time. This increases the throughput of the processor, in instructions per second, because a higher number of instructions complete during each clock cycle (as opposed to a single instruction using the entire processor for several cycles). Increasingly commonly, a processor can fetch and issue several instructions in a given clock cycle. This is referred to as Multiple Instruction Issue (MII) and increases instruction level parallelism still further [84].

The process of pipelining means that each instruction requires several stages of ‘execution’ before the result of the given instruction is known. This means that there are several instructions undergoing execution in the preceding pipeline stages. When fetching instructions, this delay before the result of execution poses a particular problem for one class of instructions – branches. A branch instruction changes the flow of instructions, and hence the fetch unit in the processor must know which path to fetch instructions from in order to fully utilise all of the pipeline stages behind the branch instruction. However, the correct path to fetch from is not known until the branch resolution pipeline stage. Filling these pipeline slots is paramount to the performance of the processor when it is considered that around one in every eight instructions is a branch [31].

The technique generally used to overcome this performance problem is called branch prediction [31]. Branch prediction attempts to predict which way a branch instruction will go before the actual behaviour is known. Dynamic branch predictors, which use dedicated hardware, can be highly accurate (as high as 98% correct), but do still inevitably make mispredictions. These mispredictions require recovery logic to stop the pipeline and flush the incorrectly fetched instructions. Such mispredictions pose a severe performance penalty and should be avoided at all costs [93].

Branch prediction has traditionally taken place statically or dynamically. Static prediction takes place at compile time of a program and can involve a variety of techniques such as hint-bits (which give the processor an idea of which way the branch might go) and delay region scheduling (a decision is made about which way the branch might go, and relevant instructions are used to fill the stages behind the branch).

Dynamic prediction uses a special unit in the processor hardware which is updated each time a branch is encountered with information about how it behaved. This prediction unit is then used by the instruction fetch logic to decide which way a branch is likely to go. Dynamic branch predictors are largely the favoured method for branch prediction as they are the most accurate and the simplest to use. A fuller discussion of dynamic branch prediction and branch predictor types is given in the next chapter (Section 2.4).

A dynamic branch predictor accounts for a significant portion of a processor’s global power budget (around 10%), but its accuracy is integral to the energy efficiency of the processor as a whole [82].

This thesis is the result of an investigation in which the global processor power consumption was reduced by increasing the energy efficiency of the dynamic branch predictor, but without negatively impacting on its accuracy. This was achieved through the realisation that not all branches require a dynamic prediction, and that, with appropriate logic, the number of accesses made to a dynamic branch predictor can be significantly reduced by a combination of profiling and static scheduling techniques [44].

1.4 Contributions

This thesis makes contributions in both the fields of branch prediction and energy efficiency. All experimentation is conducted using the EEMBC embedded benchmarks and a variant of the Wattch processor power simulator.

The experimental work contained within this document demonstrates that the power consumption of a dynamic branch predictor unit can be reduced by lowering the number of accesses made to it by the processor logic at run time. This is achieved by combining several existing techniques, and some new ones, to form a modified branch prediction scheme.

Traditionally, branch prediction has occurred as either a static or dynamic process [33]. Either there was no dynamic branch predictor, and the compiler assigned static branch predictions, or the dynamic predictor negated the need for static prediction. Some architectures, such as the PowerPC, allow hints to be assigned in certain heavily biased branch instructions to override a dynamic prediction in the hope of increasing accuracy.

A key proposal of this thesis is the use of an Adaptive Branch Bias Measurement (ABBM). The combined algorithm uses profiling data about the behaviour of branches and the dynamic branch predictor to permit the use of static hints in an instruction for only those instructions that do not require a dynamic prediction. This is combined with static scheduling, for certain branches, in order to significantly reduce the number of accesses made to the dynamic predictor unit. Simple modification of hardware logic permits the avoidance of accessing the predictor, and the advance calculation of the target address of certain branches.

As a result of the application of the algorithm discussed in this report, a significant global power saving can be achieved. The proposed algorithm is also shown to increase performance (by increasing branch predictor accuracy) and so is of interest to the branch prediction community in general.

The significance of this thesis lies in its elegance and ease of implementation using existing hardware and software. Large gains in energy efficiency can be achieved with minimal modifications in the design hierarchy.

1.5 Dissertation Structure

This dissertation is broken down into chapters. Each chapter builds on the last and leads into the next. Individual chapters are not intended to be entirely independent, and should be read with respect to their context in the structure. The chapters deal with the following areas:

1. Introduction.

2. Energy Efficiency in Modern Processor Design – This chapter provides a review of energy efficiency through all levels of abstraction in the design process. This includes an explanation of how power is dissipated in electronic circuits from an elementary level. Finally the chapter concludes with an introduction to dynamic branch predictors and how they consume power.

3. Energy Efficient Branch Prediction (related work)

4. Preliminary Research – This chapter investigates some of the questions which need to be answered before a viable energy efficient branch prediction scheme can be proposed. Existing static scheduling and prediction methods are examined, along with hardware multithreading. The results of some initial experiments are presented.

5. The Combined Approach – This chapter proposes the energy efficient branch prediction scheme for experimentation. The algorithm combines some traditional techniques with a new metric and hardware modifications. The specification of the algorithm here is intentionally abstract for discussion purposes.

6. Simulation – This chapter describes and discusses the experimental method, tools and architecture used to conduct experimentation with the combined algorithm in the next chapter. This chapter also provides a more concrete, implementation based, description of how the combined algorithm is implemented in hardware and software.

7. Experimentation and Results – This chapter uses the simulation tools and methods discussed in the previous chapter to evaluate the proposed combined algorithm on a number of baseline processor models. These models are discussed and the results are presented with a description of the evaluation metrics chosen.

8. Comparisons and Enhancements – This chapter compares the combined algorithm with some other techniques and examines how it could be extended by modifying the hardware of a processor to achieve further power savings. The applicability to a real architecture is then briefly discussed by comparison to the PowerPC instruction set.

9. Conclusion and Discussion – Finally, the overall novelties and contributions are stated in condensed form. This is followed by a discussion of any criticisms of the combined algorithm and the experimentation method. The effectiveness of the combined algorithm is compared to the most closely related work. The opportunities for future work are then presented, and the dissertation closes with the concluding remarks of the thesis.

Chapter 2

Energy Efficiency in Modern Processor Design

The task of constructing energy efficient processors has been extensively examined at almost all levels of abstraction [115]. This chapter offers a basic introduction to the important low-level causes of energy loss in modern processors and explains the main areas of research in processor energy efficiency. Additional background, from an elementary level, can be found in Appendix B and Appendix C.

2.1 Transistor Level Power Dissipation

All CMOS circuits require a power source, and do not function without using some energy/power from this source. There are several ways in which this power is dissipated in CMOS circuits; the key areas of power dissipation are discussed in this section. While there is much research into reducing the loss of power in CMOS circuits [22] [27] [15], it must be noted that this power dissipation can never be reduced to zero as a current always needs to flow.

Equation 2.1 is the accepted [46] approximation for power dissipation in modern CMOS circuits.

    Power ∼ ½ · C · V² · f · A    (2.1)

Each term in the equation is defined as follows:

C – The Capacitance of the circuit. This is a function of all of the transistor interconnects which must be charged and discharged during circuit activity.

V – The Voltage that the circuit is running at. It is important to note that the voltage has a quadratic effect on power dissipation.

f – The clock Frequency of the circuit/processor.

A – The average Activity Factor of the circuit. That is, how often transistors/gates, on average, make the transition from 1 → 0 and 0 → 1. A number between zero and one.
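
To make the terms concrete, the short sketch below evaluates Equation 2.1 for a set of purely illustrative values; the capacitance, voltage, frequency and activity figures are assumptions, not measurements:

    # Evaluate Equation 2.1 for illustrative (assumed) values.
    def dynamic_power(c, v, f, a):
        """Approximate dynamic power: P ~ 1/2 * C * V^2 * f * A."""
        return 0.5 * c * v ** 2 * f * a

    # 1 nF switched capacitance, 1.2 V core, 400 MHz clock, activity factor 0.1
    print(dynamic_power(1e-9, 1.2, 400e6, 0.1))   # ~0.029 W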

The power lost according to this general rule can be divided into two main varieties (and then subdivided further): static and dynamic power dissipation. These are explained in the following two sections.

2.1.1 Static Dissipation

Static power dissipation, or leakage, refers to a constant ‘per-cycle’ energy drain on the power source by the circuit. Traditionally, CMOS circuits have been classed as having almost no power leakage. However, this is no longer true [73]. Consequently, anything that increases the delay of a circuit (stalling in branch prediction, for instance) is now inefficient in terms of energy.

Sub-Threshold Leakage

For a transistor to function as a switch it has a property called the ‘Switching Threshold’. This is the voltage on the gate terminal at which the transistor will permit a current to flow between the source and drain (or base and emitter in some nomenclature) – see Appendix C. The threshold is generally set by the chemical properties of the semiconductor material used by the circuit designers, but can be controlled dynamically in some circuits [6].

Equation 2.1 shows that voltage has a quadratic effect on the power dissipation of a circuit/processor. As such, processor designers have been reducing the core voltage at which processors run in order to save power [27]. This means that all of the transistors in the control logic of the processor must be built with a lower threshold in order to maintain the correct behaviour. The result of this is that the difference between the drain voltage (i.e. 0 Volts) and the threshold of the transistor becomes much smaller. When the gap between the ‘off’ and ‘on’ voltage that a CMOS transistor is designed to work with becomes very small, the transistor will begin to allow small amounts of current to flow even when it is supposed to be switched off. These currents are known as Sub-Threshold Leakage currents, and are now the leading cause of static power dissipation in CMOS circuits [73].

As core voltages have been progressively scaled down to save power, sub-threshold leakage has increased to the extent that it can now account for up to 50% of all CMOS processor power loss [73]. The amount of power lost to sub-threshold leakage is proportional to the number of transistors in the circuit/processor.

2.1.2 Dynamic Dissipation

Dynamic dissipation is still generally the leading cause of power dissipation in most CMOS circuits, and is entirely proportional to the behaviour of the circuit. As such, it is highly interesting to hardware designers at all levels of abstraction.

Capacitive

Capacitive power is dissipated when a transistor changes state from high to low or vice-versa. It is proportional to the energy that is stored on all of the interconnects between two state-changing transistors. During a state transition, the interconnects from the output of the changing transistor must be either charged or discharged in order to change to the correct output level. This process necessarily consumes power from the source. The factors directly affecting this dissipation are: interconnect thickness, circuit layout design, voltage and circuit activity [22].

Short Circuit

When a coupling of CMOS transistors changes state, such as those shown in Appendix C, it does not happen instantly. There is a propagation delay while the input interconnects are charged/discharged to the required voltage. During this transition time, there is a brief point at which both transistors in a CMOS coupling are conducting (since they are both ‘mid-threshold’). This allows a short circuit current to flow directly between the source and drain, and dissipates power. The factors directly affecting this dissipation are: transistor design properties, voltage and circuit activity [22].

2.1.3 Energy Efficiency Metrics

In order to properly quantify and compare the energy efficiency of different circuit designs, the following metrics are often used:

Power × time (Energy Per Operation) – A measure of the power used during the course of a particular operation. This is equal to the average power used multiplied by the duration of the operation. In effect, this is a measure of how much electrical energy was used performing a particular operation.

Et² – Energy multiplied by time squared (also rewritable as Pt³) [68]. The Et² metric was developed as a comparison metric for cross-processor analysis since it is designed to be voltage independent. It is inversely proportional to MIPS³/Power. However, by eliminating dependence on voltage, this metric gives a heavy bias to delay/execution time.
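
The bias of the Et² metric can be seen in a small worked comparison of two hypothetical design points (the power and time figures here are invented for illustration):

    # Hypothetical design points illustrating Energy-per-operation versus Et^2.
    designs = {"baseline": (1.00, 1.00),   # (average power in W, time in s)
               "scaled":   (0.40, 1.50)}   # lower power, but slower

    for name, (power, time) in designs.items():
        energy = power * time              # P x t : energy per operation
        et2 = energy * time ** 2           # Et^2 (equivalently P * t^3)
        print(f"{name}: E = {energy:.2f} J, Et^2 = {et2:.2f}")

    # 'scaled' uses less energy (0.60 J vs 1.00 J) yet scores worse on Et^2
    # (1.35 vs 1.00): the metric heavily penalises the longer execution time.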

A more detailed discussion of these and other metrics is available in Appendix A.

2.2 Transistor Level Energy Efficiency Techniques

Although the focus of this dissertation will be on a higher level of energy abstraction, there are several transistor level efficiency techniques which should be explained. The behaviour of these techniques can be affected by higher level decisions.

2.2.1 Clock Gating and Vdd Gating

‘Gating’ refers to the process of using a logic gate to switch off something in a circuit. In the case of energy efficiency, this usually refers to two processes: clock gating and Vdd gating [73].

Clock Gating

The clock signal of a synchronous circuit has a direct effect on the power dissipation in that circuit. This occurs as a result of the switching activity associated with that signal. As clock rates have increased, the problem of clock signal power dissipation has become important. In response to this, clock gating was introduced. When different units of a processor are found to be idle, the control logic of the processor will switch off the clock signal for that particular unit. This avoids dissipating power for the unit’s period of inactivity. Clock gating stops a unit from functioning, but does not cause its state to be lost [73].

Vdd Gating

Additionally, when a unit is found to be idle, Vdd gating can also be used. When the on-chip control logic has found a unit to be idle, the entire drain connection to that part of the chip can be gated. This has the effect of actually powering down that unit of the chip. The advantage of this is a period of near zero power dissipation. However, Vdd gating can only be carried out when the unit is not required to store a particular state [73].

Defining Idle

When a unit should be marked as idle, and thus use a gating procedure, is decided by complex on-chip control logic. The control logic must monitor the activity of particular units and also use information about what kind of instructions are in the pipeline. Crucially, the behaviour of the program can decide how effective a gating regime will be; long bursts of unit inactivity make for an easier gating decision by the control logic. For instance, when the control logic of the chip is deciding whether to clock gate a branch predictor, a larger cycle-distance between dynamic branch instructions would result in more effective gating performance.
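
The flavour of such a decision can be captured in a minimal sketch of an idle-counter gating policy; the threshold used here is an assumed value, not one taken from any particular design:

    # Idle-counter clock-gating policy (threshold is an assumed value).
    class ClockGate:
        def __init__(self, idle_threshold=64):
            self.idle_threshold = idle_threshold  # cycles of inactivity before gating
            self.idle_cycles = 0
            self.gated = False

        def tick(self, unit_active):
            if unit_active:
                self.idle_cycles = 0
                self.gated = False        # wake the unit; its state was retained
            else:
                self.idle_cycles += 1
                if self.idle_cycles >= self.idle_threshold:
                    self.gated = True     # stop the clock for this unit
            return self.gated

A longer threshold makes gating more conservative: it wastes fewer wake-ups on short idle gaps, but also misses some saving opportunities.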

2.2.2 Technology Scaling

The amount of power dissipated dynamically during circuit activity is linearly proportional to the capacitance of the given circuit. This means that the thickness of the interconnects between transistors, and the transistor sizes themselves, directly affect the power dissipated during each switching event. In order to fit more transistors on a single chip die, the size of the technology process used continually decreases. This has the benefit of decreasing dynamic power dissipation (although, as discussed earlier, it does increase leakage). Furthermore, layout designers can use thinner transistor interconnects. This can drastically reduce power dissipation, but is eventually limited by performance degradation; a thinner interconnect means a higher resistance, and thus a higher transition latency [27].

2.2.3 Voltage Scaling

Equation 2.1 importantly stated that the core voltage of a chip has a quadratic effect on its power dissipation. Hence, it can be seen that the biggest impact of any modification to save power comes from simply reducing the core voltage. This has been the case, and in the late nineteen-nineties and early twenty-first century the core voltage of CPUs dropped dramatically. Unfortunately, however, reducing the core voltage past 1.5 Volts causes severe problems. First of all, as discussed earlier in this section, the transistors must operate at a lower threshold. This has the effect of drastically increasing sub-threshold leakage currents. Additionally, core voltage directly affects signal propagation time in a chip; a lower voltage imposes serious performance limitations.

2.2.4 Logic Optimisation

Due to the sheer size and complexity of modern processors, they are usually designed using a Hardware Description Language (HDL). An HDL is a high level language which enables the processor designer to concentrate on high level logic issues, and avoid circuit layout considerations. A compiler for an HDL can then convert this high level behaviour specification into an actual circuit layout. This process means that there is often a detachment between the behaviour designer and the actual layout of the chip. A great deal of research has been carried out into HDL compiler optimisation techniques, and power compiler techniques, to improve the power efficiency of the resulting circuit layout [15].

2.3 Architecture & Software Level Efficiency Techniques

The important level of energy abstraction for this dissertation is the architecture level. Before moving on to the main architecture focus (branch prediction), this section will cover some of the other significant areas of architecture level research; both in architecture design and compiler design.


2.3.1 Activity Factor Reduction

When designing energy efficient architectures, the central aim of the designer, or computer scientist, is to reduce the activity factor of the underlying circuit. At the architecture level of abstraction, this is achieved by reducing the number of transitions that must occur to generate a particular result. This can be a different consideration to that of high-performance design: if a modification reduces the activity factor significantly enough, it can be accompanied by a performance degradation whilst still improving energy efficiency. That is to say, it is possible for something, a program for instance, to be more energy efficient even if it does not always execute as fast as it would on a high performance architecture. The activity factor is the most important term in determining the dynamic power consumption of a processor, and is dependent upon the software running on the processor [117] [87] [69].

2.3.2 Delay Reduction

Having stated that even a slower architecture can be more energy efficient, it is important to note that this is not ideal. The energy actually consumed from a power source during an operation, the execution of a program for example, is the product of the average power drawn (itself proportional to the activity factor) and the duration of the program. This means that, ideally, a modification made to an architecture to reduce the activity factor should be made without impacting on performance; it is very easy to reduce the average activity factor of a program by simply halving the speed at which it executes [58]. This would, however, double the duration of execution, and thus likely consume more energy. This is one of the reasons that the Et² efficiency metric gives a heavy weighting to the duration of an operation; a considerable energy cost in a processor is the general overhead of control logic through time [67].
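
A toy calculation, using assumed figures, makes the point:

    # Halving the speed halves the average power but doubles the duration,
    # so the energy per operation is unchanged -- and Et^2 gets worse.
    # (All numbers are assumed for illustration.)
    full_power, full_time = 2.0, 1.0
    half_power, half_time = 1.0, 2.0
    print(full_power * full_time)   # 2.0 J
    print(half_power * half_time)   # 2.0 J, before counting leakage and
                                    # control-logic overhead over the longer run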

2.3.3 Low Power Scheduling

Many modern processors have a large instruction set that contains many similar instructions. These instructions can often be used by a compiler to achieve equivalent results in different combinations and frequencies. The principle of low power (static) scheduling is to use a simple energy model of the instruction set to make decisions about how to translate high level computer programs into assembly code. For instance, in some cases it may be more energy efficient to use a series of add instructions instead of a multiply instruction, or to use predicated instruction execution instead of a branch. These decisions require an intelligent compiler, capable of weighing individual instruction cost against performance delay [80] [104] [79].
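
The following minimal sketch illustrates the idea, assuming a hypothetical per-instruction energy model (the costs are invented for illustration):

    # Hypothetical per-instruction energy model (costs in arbitrary units).
    ENERGY = {"mul": 8.0, "add": 1.5, "shl": 1.0}

    def cheapest_multiply_by_constant(k):
        """Choose between a hardware multiply and a shift/add expansion of x * k."""
        ones = bin(k).count("1")
        # expanding x * k needs one shift per set bit of k, plus (ones - 1) adds
        expand_cost = ones * ENERGY["shl"] + (ones - 1) * ENERGY["add"]
        return "expand" if expand_cost < ENERGY["mul"] else "mul"

    print(cheapest_multiply_by_constant(10))    # 2 set bits: expand (3.5 < 8.0)
    print(cheapest_multiply_by_constant(255))   # 8 set bits: mul (18.5 > 8.0)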

2.3.4 Frequency Scaling

Frequency scaling is not really an architecture or compiler design technique; it is a technique used by an architecture-aware Operating System Kernel. The Kernel of an Operating System can measure the instruction throughput of a processor over time. By comparing this throughput to a known peak performance throughput, the Kernel can then reduce the hardware clock frequency that the CPU is running at to a level that is suitably matched to the current utilisation. The ideal result of this is that power is saved, because the clock frequency has a linear effect on power dissipation, and there is no performance degradation as the Kernel constantly monitors and adjusts the clock frequency when required. This reduces the activity factor of the CPU and control logic, and, when performed correctly, should not increase the delay [106].
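
A simplified sketch of such a policy is shown below; the frequency levels and utilisation thresholds are assumptions, and real Kernel governors are considerably more sophisticated:

    # Utilisation-driven frequency-scaling policy (levels/thresholds assumed).
    FREQ_LEVELS_MHZ = [200, 400, 600, 800]

    def next_frequency(current_mhz, utilisation):
        """Step the clock up when busy, down when under-utilised."""
        i = FREQ_LEVELS_MHZ.index(current_mhz)
        if utilisation > 0.85 and i < len(FREQ_LEVELS_MHZ) - 1:
            return FREQ_LEVELS_MHZ[i + 1]   # ramp up to avoid a performance loss
        if utilisation < 0.30 and i > 0:
            return FREQ_LEVELS_MHZ[i - 1]   # ramp down to save dynamic power
        return current_mhz

    print(next_frequency(400, 0.95))  # -> 600
    print(next_frequency(400, 0.10))  # -> 200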

2.4 Branch Prediction

This chapter has so far examined low-level efficiency applications. This section introduces the target subject of this project. Branch prediction is a mechanism used to overcome the possible delay introduced by machine instructions that change the control flow of a program during execution. It has become an extremely important part of almost all processor architectures [40]. Although it is not the purpose of this dissertation to act as a complete introduction to branch prediction, this section discusses the principles of branch prediction, and its significance, in terms of energy/power, to this research project. A more in-depth examination of dynamic prediction techniques and their inner workings can be found in the referenced literature.

2.4.1 The Branch Problem

Modern processors use a design method called pipelining. Pipelining is used to increase the parallelism of processor architectures by breaking down the entire processor into relatively independent stages. Each stage progressively ‘processes’ an instruction, from fetch to execute to commit, and passes its result into the next stage. This technique enables several instructions to be undergoing execution at a given time [40]. Although this causes the execution time of an individual instruction to increase, it can dramatically increase the throughput of a processor [84].

Figure 2.1 shows an example five stage pipeline. Different architectures break the processor down into a different number of pipeline stages. Modern high performance processors often include as many as 28 pipeline stages [5]. Breaking the processor down into more pipeline stages can further increase performance, and permits higher clock speeds, as signals need only propagate through a single pipeline stage in each clock cycle rather than through the entire processor.

Figure 2.1: An example five stage processor pipeline

When a processor is pipelined, the problem of the branch delay region is introduced. A branch instruction can potentially change the flow of control in a program during execution based on the result of some preceding, or included, operations. In a pipelined architecture, the result of such an operation (for instance a comparison by the branch instruction) is not available until the instruction passes what is conventionally known as the execution stage. In Figure 2.1 this is shown by the third stage. The instruction fetch stage/unit must continue to fetch a stream of instructions to fill the pipeline stages behind the branch instruction (two stages in the example). These instructions could be from either path of the branch instruction – the actual direction is not known until the branch has passed the execution stage. If the processor fetches instructions from what turns out to be the wrong path, the processor must be stalled while the incorrect instructions are flushed from the pipeline, and the correct ones are fetched to fill their places. In a high performance processor, this delay causes a significant performance penalty, particularly when it is considered that one in every six to eight instructions in the dynamic instruction stream is a branch. The branch problem is further exacerbated in MII processors (Multiple Instruction Issue – processors with parallel pipelines), where many branch instructions may be undergoing execution at any given time [31].
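
The scale of the problem can be illustrated with a back-of-the-envelope calculation; the branch frequency, misprediction rate and flush penalty used below are assumed, illustrative figures:

    # Back-of-the-envelope cost of the branch problem (all figures assumed).
    branch_fraction = 1 / 7      # ~one in every six to eight instructions
    mispredict_rate = 0.05       # a 95% accurate predictor
    flush_penalty   = 12         # cycles lost per misprediction in a deep pipeline

    extra_cpi = branch_fraction * mispredict_rate * flush_penalty
    print(f"CPI added by mispredictions: {extra_cpi:.3f}")   # ~0.086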

2.4.2 Dynamic and Static Prediction

Several approaches have been devised to ameliorate the effects of the branch problem on performance. These approaches typically fall into the dichotomy of either a purely static or purely dynamic technique. A simple description of these two techniques is:

Dynamic Branch Prediction – Techniques which use processor hardware to dynamically (during a program’s execution) decide from which branch path to fetch instructions when a branch instruction is encountered. These techniques typically base their prediction on the behaviour of previously encountered branch instructions (and their outcomes). The prediction can change each time a branch is encountered dynamically.

Static Branch Prediction – Techniques which use data available at compile time to provide a single, unchanging, prediction for each static branch instruction. The data used can either be from static code analysis, or from execution profiling data.

The best dynamic branch predictors are generally far more accurate than the best static prediction methods [83]. This is because they can better adapt to changes in datasets that affect the behaviour of branch instructions, and to the context sensitive nature of many branch instructions (a branch’s behaviour can depend on which of its several possible preceding basic blocks was executed).

2.4.3 Dynamic Predictors

Most modern processors, particularly the high performance variety, exclusively use dynamic branch predictors to help solve the branch problem [5]. Indeed, the focus of this project is on the existing dynamic predictor technology.

When a branch instruction is encountered dynamically it can change the execution path by changing the address of the next instruction to be fetched. Dynamic branch predictors are split into two complementary units: the first decides whether a branch will be taken or not-taken, and the second (for a taken branch) will attempt to predict the target address of the branch. If both of these predictions are correct, then the branch delay problem is solved for that encounter. Figure 2.2 shows a logical representation of how a dynamic branch predictor would look when located in the instruction fetch stage of the pipeline.

Figure 2.2: An example of a modern dynamic predictor architecture

In Figure 2.2 it can be seen how the branch predictor is segregated into a direction and target predictor, and how these units work together to produce the next fetch address for the processor.

Target Prediction

Target prediction is, conceptually, the most straightforward part of the prediction. A typical target predictor unit is a branch target buffer [84]. A branch target buffer is a small cache with a variable number of entries (typically 256-1024) which is indexed using the program counter (fetch address). The way that the program counter is used to index the cache varies depending on the implementation, but typically a hash or simple remainder-on-division operation is used. The cache can either be direct-mapped or set associative. Typically, set associativity is used since this will yield the best results when address conflicts occur.

When a branch instruction is resolved in the execution stage, its corresponding entry in the branch target address buffer is updated. Branch target address buffers tend to work very well because individual branches often have a fixed target address, or a stable target address for a relatively long period of time. When set associativity is used, the target address prediction can be very accurate once a branch instruction has been encountered, due to the principles of temporal and spatial locality; the target of a branch tends to remain constant, at least for a relatively long period of time.
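
As an illustration of the indexing scheme just described, the following sketch shows a direct-mapped branch target buffer; the entry count, the use of full-address tags, and the shift that strips byte-offset bits are assumptions for illustration (as noted above, real designs typically favour set associativity):

    # Minimal direct-mapped BTB sketch (entry count and tagging assumed).
    class BranchTargetBuffer:
        def __init__(self, entries=512):
            self.entries = entries
            self.tags = [None] * entries
            self.targets = [0] * entries

        def lookup(self, pc):
            i = (pc >> 2) % self.entries       # simple remainder-on-division index
            return self.targets[i] if self.tags[i] == pc else None

        def update(self, pc, target):          # called when the branch resolves
            i = (pc >> 2) % self.entries
            self.tags[i], self.targets[i] = pc, target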

Direction Prediction

The greatest difficulty in dynamic branch prediction is predicting whether a branch instruction will cause a change of flow to its target address or not. This problem is referred to as direction prediction, and there are several hardware approaches that can be used to solve it. Many approaches were, originally, divided into two groups: local and global predictors [40]. A local predictor uses the behavioural history of a given branch to form a prediction, and a global predictor uses the behavioural history of the contextually preceding n dynamic branches when forming a prediction for the current branch. Both kinds of predictor form a direction prediction in the instruction fetch pipeline stage, and are updated during the execution or commit pipeline stage. Though the focus of this project is not on the internals of a branch predictor, some of the most common dynamic branch predictors are listed below, each with a brief description:

Predict Not-Taken – This is the same as making no direction/target prediction for the branch at all, and just assuming that the next instruction in the stream will be executed. This is inaccurate as most branches are taken. A similar, though more accurate, effect is achieved by predicting ‘always taken’ [40].

Predict Backward Pointing Branches as Taken – This is more accurate as many loop branches will be predicted correctly [40].

Bimodal Branch Prediction – These predictors have tables containing two-bit entries. The table is indexed by some part, or all, of the program counter. Each entry holds a counter that represents a prediction of strongly taken, weakly taken, weakly not-taken or strongly not-taken. This entry is used to form a prediction of which branch path will be taken. During the execution stage, the relevant entry is updated by incrementing or decrementing the counter appropriately. Very large bimodal predictors have been shown to saturate (reach their effective ceiling) at up to 93.5% correct [40].

Local Branch Prediction – Local branch predictors are implemented using two tables. The first table is indexed by some low-order bits of the program counter, and stores the taken/not-taken history of the given branch. The second table consists of a similar structure to the bimodal history table, but is indexed by the contents of the indexed entry in the first table. This produces an appropriate direction prediction for the given branch based on its previous behaviour. Very large local branch predictors have been shown to saturate at up to 97.1% correct [40].

Global Branch Prediction – Global branch predictors use the context sensitive nature of branch instructions to predict the likely direction that will be taken on a given encounter. A basic global predictor keeps a ‘shift’ register, containing the history of the last n branches, which is used to index a table of bimodal counters. This simple global predictor is only slightly better than a bimodal scheme for large table sizes, and never as good as local prediction. A more accurate version of a global predictor, called the GShare predictor, XORs the branch instruction address with the shift register, and then indexes the bimodal counter table with the result. The GShare predictor has been shown to saturate at around 96.6% correct – almost as accurate as local prediction, but it can use smaller table sizes. Global prediction is easier to implement at speed than local prediction, as the table lookups are independent, but fast local implementations are possible [31]. (Sketches of the bimodal and GShare schemes are given after this list.)

Combined Branch Prediction – This technique uses three predictors in parallel: a bimodal predictor, a GShare predictor and a bimodal-like predictor to select which of the previous two predictors to ‘listen’ to on a per-branch basis. Combined branch predictors are about as accurate as large local predictors [40].

Agree Prediction – A technique used in conjunction with multiple predictors to help reduce aliasing between table entries. This technique can increase accuracy, but at the cost of complexity [40].

Neural Branch Prediction – Neural branch predictors use neural network structures to form directional branch predictions. Still an emergent type of branch predictor, the main challenge is the high latency of prediction formation, but neural predictors can become accurate more quickly during a program’s execution than other prediction techniques [109].
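
To make the bimodal and GShare schemes referred to above concrete, the following minimal sketches show tables of two-bit saturating counters in both arrangements. The table sizes, initial counter values and the shift that strips byte-offset bits are illustrative assumptions rather than details of any predictor used later in this dissertation.

    # Minimal sketches of bimodal and GShare direction predictors.
    # Table sizes and initial counter values are assumed for illustration.

    class Bimodal:
        """Two-bit saturating counters indexed by part of the program counter."""
        def __init__(self, entries=4096):
            self.entries = entries
            self.table = [2] * entries           # start in 'weakly taken'

        def index(self, pc):
            return (pc >> 2) % self.entries      # drop byte-offset bits, then wrap

        def predict(self, pc):
            return self.table[self.index(pc)] >= 2   # 2/3 taken, 0/1 not-taken

        def update(self, pc, taken):
            i = self.index(pc)
            self.table[i] = min(self.table[i] + 1, 3) if taken \
                            else max(self.table[i] - 1, 0)

    class GShare(Bimodal):
        """Bimodal counters indexed by (branch address XOR global history)."""
        def __init__(self, history_bits=12):
            super().__init__(entries=1 << history_bits)
            self.ghr = 0                         # global history shift register

        def index(self, pc):
            return ((pc >> 2) ^ self.ghr) % self.entries

        def update(self, pc, taken):
            super().update(pc, taken)            # uses the pre-shift history
            self.ghr = ((self.ghr << 1) | int(taken)) % self.entries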

The internals of each of these predictors are not particularly examined in this dissertation, but there are important references made to the choice of branch predictor used in the experimental architectures.

2.4.4 Power Consumption

The relative power consumption of a dynamic branch predictor varies highly depending on the size of the rest of the processor. However, in high performance architectures, the branch predictor unit will typically account for over 10% of a processor’s global power consumption [82]. This significant consumption figure is accounted for by the regularity of accesses and the complexity of the dynamic predictor when accessed.

A dynamic branch predictor is accessed in two situations: when furnishing a prediction in the instruction fetch pipeline stage, and when being updated with the branch’s behaviour in the execution/commit stage. When it is considered that around one in every eight dynamic instructions is a branch, it can be seen why branch predictors are a significant drain. Both the direction predictor and the target address predictor account for this power consumption. The ratio of power consumption between the direction and target address predictors is variable, and depends upon the configuration of both predictors. This is because a structure such as a BTB has not only a large number of entries, but also a large entry width and associativity.

The internals of dynamic branch predictors are highly optimised and have been subjected to many years of scholarly research [81]. Optimising their existing accurate and well-studied behaviour is unlikely to achieve big gains. An interesting remaining question is: to what extent can the activity factor of a dynamic branch predictor be reduced without affecting its accuracy?

2.5 Summary

This chapter has shown the various mechanisms, at various levels of abstraction, that can be used to save power/energy in modern transistor based digital circuits. At the lowest level, fabrication based improvements have allowed for more energy efficient (and smaller) transistors that can be used as part of any processor design. There are also various techniques that can be used at the logic level in order to improve the energy efficiency of a designed circuit. The most relevant level for this project is the highest level of abstraction, the algorithm level, where careful design consideration can reduce the activity and time factors of a given unit/operation.

Dynamic branch predictors consume a significant amount of global processor power [81]. The amount of power consumed is proportional to the number of accesses made to the control logic (the activity factor of the predictor). Although this varies from processor to processor, the amount of global power consumed is always non-trivial. Dynamic branch prediction hardware has been subject to considerable research in recent decades, but this has consistently focused on improving accuracy. While accuracy (one means of improving performance) is important to energy efficiency, it is not the only variable that needs to be considered. The next chapter introduces, and discusses, the contemporary approaches proposed to increase the energy efficiency of dynamic branch predictors.

Chapter 3

Related Techniques

There are many techniques that can be used to reduce the power consumption of a dynamic branch predictor. The previous chapter discussed the broad background of power consumption and the external details of branch prediction. This chapter introduces only the most relevant contemporary hardware and software techniques for energy efficient branch prediction. The most closely related technique is the Prediction Probe Detector, and this is discussed in the most detail. The software techniques are discussed more briefly, but are expanded on in the continuation of related work in the next chapter.

3.1 The Prediction Probe Detector (Hardware)

The Prediction Probe Detector (PPD) is a cache-like hardware unit introduced into a processor to reduce the number of lookups made to a dynamic branch predictor unit. It was developed principally by Dharmesh Parikh and Kevin Skadron at the University of Virginia, and published in 2004 [82] [81].

3.1.1 Implementation

Figure 3.1 shows the logic form of the PPD in relation to the direction and target predictors for branch instructions in the instruction fetch pipeline stage. The prediction probe detector is a small cache consisting of the same number of lines/entries as the i-cache. It is indexed in the same manner as the i-cache, using the program counter, and produces a corresponding entry for each instruction in the i-cache. Each entry in the prediction probe detector consists of two bits; one bit controls accesses to the direction predictor and the other bit controls accesses to the branch target buffer. The two bits are used to avoid accessing the dynamic branch predictor logic for non-branch instructions, and the direction predictor logic for branches that are unconditional.

Figure 3.1: The Prediction Probe Detector

The entries in the PPD are updated with new predecode bits during an i-cache miss (when a new entry is fetched), and reflect the required branch predictor access levels. The PPD avoids having to access the dynamic branch predictor logic for non-branch instructions in a high performance environment by using the properties of a small cache to permit a parallel access of the branch predictor with the i-cache; in highly clocked processors, it is not always possible to wait for the i-cache fetch to complete before accessing the branch predictor. Instead of having to access the entire branch prediction logic in every cycle, the processor now needs only to consult the relatively small PPD unit, which will then dictate whether a branch predictor access is necessary. In processors with a slower cycle time (lower clock rate), a parallel branch predictor access is not always required, and hence a proportion of this effect can be achieved with simple pre-decode logic.

The PPD was shown to conserve, on average, 3.1% [82] of global processor power. This result is significant, but the introduction of additional processor logic clearly introduces additional power dissipation into the processor (although this was accounted for in the study).

An extension to the PPD was then implemented that allowed the PPD to recognise highly biased, or 'unchanging', branches. Never-taken branches were represented using the existing two bits in the PPD entry (by disabling access to both the direction and target predictors). Highly-biased taken branches require an additional bit in the PPD that is used to assume a taken direction prediction. The implementation of this was not made completely clear in publication; however, it seems that an experimental implementation involved using profiling data to mark commonly taken/not-taken branches in a compacted trace. This trace was then read by the processor simulator at execution to set the additional bit(s) in the PPD. This is not a plausible implementation, but it did show encouraging results: an additional global processor power saving of 2%. It is unclear how the decision was made as to whether a branch was heavily biased or not, and how many branches were marked in this way. Any errors incurred by the assumed branch behaviours are simply a new source of branch misprediction.
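To make the gating mechanism concrete, the C sketch below shows how a PPD entry could steer the fetch stage. This is an illustrative model only, not the published design: the papers do not give a concrete data layout, so the entry structure, table size and function names here are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define PPD_ENTRIES 4096             /* assumed: one entry per i-cache slot */

    struct ppd_entry {
        bool use_dir;                    /* gate accesses to the direction predictor  */
        bool use_btb;                    /* gate accesses to the branch target buffer */
    };

    /* Entries are filled from predecode bits when the corresponding
     * i-cache line is brought in on a miss.                          */
    static struct ppd_entry ppd[PPD_ENTRIES];

    /* Stubs standing in for the real predictor structures. */
    static bool direction_predict(uint32_t pc) { (void)pc; return true; }
    static uint32_t btb_lookup(uint32_t pc)    { return pc + 8; }

    /* Fetch stage: consult the small PPD in parallel with the i-cache
     * rather than always driving the full prediction logic.           */
    static uint32_t next_fetch_pc(uint32_t pc)
    {
        struct ppd_entry e = ppd[(pc >> 2) % PPD_ENTRIES];

        if (e.use_btb && (!e.use_dir || direction_predict(pc)))
            return btb_lookup(pc);       /* unconditional or predicted-taken */

        return pc + 4;                   /* non-branch or predicted not-taken:
                                            both predictor structures stay idle */
    }

The point of the structure is that indexing a table only two bits wide is far cheaper than probing the pattern history tables and a wide, associative BTB on every fetch.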

3.1.2 Pipeline Gating

Pipeline gating is a technique implemented alongside the PPD to further save power consumed executing misspeculated branch paths [82]. Pipeline gating works only in conjunction with certain branch predictors, and assigns each branch either a high or low confidence estimation based on the coherence of the two internal predictions of the hybrid predictor. Assigning a confidence to a branch prediction is difficult [39] [56]. When the number of sequentially fetched low-confidence branch instructions exceeds a specified threshold, the pipeline is stalled as it is considered likely to be executing on a misspeculated path. Each time a low-confidence branch is resolved, the counted number of low-confidence branches is decremented.
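The counter mechanism just described can be summarised in a few lines of C. This is a sketch of the general scheme only; the threshold value and function names are assumptions, not figures from the cited study.

    #include <stdbool.h>

    #define GATE_THRESHOLD 3              /* illustrative value only */

    static int low_conf_in_flight = 0;    /* low-confidence branches fetched
                                             but not yet resolved            */

    /* Called when a branch is fetched; low_confidence is raised when the
     * hybrid predictor's two component predictions disagree.              */
    static bool should_gate_fetch(bool low_confidence)
    {
        if (low_confidence)
            low_conf_in_flight++;
        return low_conf_in_flight >= GATE_THRESHOLD;   /* stall fetch */
    }

    /* Called when a low-confidence branch resolves in the execute stage. */
    static void low_conf_branch_resolved(void)
    {
        if (low_conf_in_flight > 0)
            low_conf_in_flight--;
    }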

Pipeline gating is not discussed in great detail here, since it was found to have very poor power conservation potential (much less than was suggested by previous studies). Pipeline gating introduces significant logic into the processor itself, and erroneous pipeline gating behaviour can potentially impact heavily on performance and power consumption. Additionally, as the pipeline depth increases, pipeline gating becomes less effective. Finally, pipeline gating can be implemented irrespective of other branch prediction techniques being used, and so is not on the critical path of this investigation.

3.2 Software Based Approaches

There are many software-based approaches that have been used to increase the energy efficiency of microprocessors. The general concepts of these methods were discussed in the previous chapter. However, none of these methods is explicitly designed to exploit the potential energy savings available in the dynamic branch predictor logic of a processor. The next chapter investigates, and discusses, some older applicable software methods, but the next subsection introduces the most prominent and most thoroughly investigated technique in use.

3.2.1 Hinting and Hint Instructions

Branch prediction, as discussed in the previous chapter, can generally be achieved using two methods: static prediction or dynamic prediction. Static prediction assigns a prediction for a branch at compile time, which is interpreted by the hardware, and dynamic prediction introduces hardware logic that will produce a prediction based on the dynamic behaviour of the branch(es). When the two are compared, dynamic branch prediction is found to have significantly higher accuracy; however, this holds for the program as a whole, not necessarily for every individual branch instruction.

The principle of hinting, to conserve energy, is to assign a static prediction to only certain branches, with the aim of reducing the number of accesses made to a dynamic branch predictor. The hint is some way of communicating the likely direction of the branch to the dynamic hardware. This should, ideally, retain the benefits of dynamic accuracy, but without all of the costs in terms of energy.

Assigning such static predictions, with the aim of conserving energy, has been experimented with in a variety of ways. Some studies, implemented in VLIW processors, have introduced a new instruction, called a hint instruction, which is inserted into the instruction word in front of a branch to inform the hardware of an upcoming branch instruction [76], and that a prediction must be furnished. This avoids the need to predict for non-branch instructions. Static predictions can be assigned using either compiler heuristics or profiling. Profiling is the most accurate; however, it is limited in previous studies [70] by the method used to decide whether a branch should be marked as biased or not: generally a fixed bias threshold is used, and if this is exceeded, then the branch is assigned a hint [35]. This does not work well with dynamic predictors, as it is possible that a biased branch could still be predicted better by the dynamic predictor (some branch patterns can be predicted with almost total accuracy).

An implementation of compiler-based VLIW hint instructions, to avoid direction prediction for some biased branches, was demonstrated by a research group at Politecnico di Milano to reduce the number of branch predictor accesses by up to 86% [70]. However, no reliable indication was given of the possible power savings, and these results were for selected benchmarks only (not a full suite). It is possible that the results published were not fully representative of a cross section of tasks. The principle of the hint instructions has also been further extended to provide a simple direction prediction for heavily biased branches, or branches which will never be taken. These static direction predictions are relatively crude and were assigned based on compiler heuristics that decided how likely a branch is to be biased. Again, these static predictions only removed the burden of direction prediction, not target prediction.
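The fixed-threshold assignment criticised above can be made concrete with a short C sketch. The 0.90 threshold and the function name are assumptions for illustration; the cited studies do not all use the same value.

    typedef enum { HINT_NONE, HINT_TAKEN, HINT_NOT_TAKEN } hint_t;

    /* Fixed-threshold hinting: hint a branch once its measured bias
     * exceeds a hard-coded cut-off, regardless of how well the dynamic
     * predictor handles it.                                            */
    static hint_t assign_hint_fixed(double taken_ratio)
    {
        const double threshold = 0.90;        /* illustrative value */

        if (taken_ratio >= threshold)
            return HINT_TAKEN;
        if (taken_ratio <= 1.0 - threshold)
            return HINT_NOT_TAKEN;
        return HINT_NONE;      /* leave the branch to the dynamic predictor */
    }

The weakness is visible in the signature: the decision sees only the bias, so a branch that is 92% biased but dynamically predicted with 99% accuracy will still be hinted, trading prediction accuracy away for saved lookups.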

3.3 Analysis and Summary

The most closely related and significant work in the field of energy-efficient branch prediction is the Prediction Probe Detector. This is a hardware unit that is used to store information about which cache lines contain branch instructions and, for certain branches, whether they should be assumed taken or not taken. Previously investigated software methods of energy-efficient branch prediction take a similar approach, but use hint instructions in VLIW instruction bundles to communicate the same information. There are several key limitations with both approaches:

1. The focus of energy saving is on the lookup of a branch predictor in the instruction fetch stage, when furnishing a prediction. This is important, but all committed branch instructions will also update the branch predictor, which is equally, if not more, costly than the lookup phase, in terms of energy. The update phase has possibly been ignored [82] due to the lack of reliable static prediction assignments for many branches (the focus was on not predicting for non-branches, in highly clocked processors).

2. The assignment of any static prediction to a branch was achieved using rudimentary methods. Branches were given a statically predicted direction based on whether they appeared 'likely' to be biased.

3. Neither technique makes use of any of the existing methods of branch removal/avoidance and scheduling techniques. These are discussed in further detail in the next chapter.

In addition to the problems stated above, the PPD also introduces significant hardware. While this hardware was accounted for, and did save global processor energy, the actual cost in power was not specified. The PPD is the size of a small cache and would hence have significant power and timing implications. The hinting methods used in previous software efficiency studies have principally focused on VLIW processors (due to the availability of instruction slots in instruction bundles).

The area of energy-efficient branch prediction is not well investigated at the level of the hardware/software interface. It is likely that there is room for improvement if hardware and software techniques can be used in cooperation to achieve a more successful result. This has been considered in the performance arena [83]. The next chapter progresses to discuss some early stage investigations into branch prediction and energy, and also discusses more of the existing software branch avoidance techniques.

Chapter 4

Initial Investigation and Preliminary Research

The previous chapter has introduced the key related work on energy-efficient branch prediction. Previous work [82] has shown that, in order to save power during a program's execution, the key variable to be optimised is the number of accesses made to the dynamic branch predictor. This chapter investigates several other existing methods that could possibly be used to reduce the number of accesses to a dynamic branch predictor. Their initial use was not for energy efficiency but, as discussed in this chapter, it is possible to make use of them to reduce the load on a dynamic branch predictor.

4.1 Research Question Focus

The existing work in the area exhibits the typical dichotomy of static versus dynamic methods, with little handshaking between the two. The main research question being addressed in this project is whether it is possible to use a combination of dynamic branch prediction with static methods to achieve a more beneficial result, in terms of power, than using the two paradigms in isolation. To properly achieve this goal, it is necessary to conduct a review and critique of existing static methods of branch removal and prediction, and examine how these could potentially be used.

Dynamic predictors have been extensively researched and can achieve a very high degree of accuracy. It is unlikely that modifying their highly optimised composition would achieve any great results in terms of power. The question, and goal, of this project is to evaluate a possible method whereby the activity factor of a dynamic branch predictor can be reduced, and thus conserve power/energy.

4.2 Static Methods to Avoid Dynamic Branch Prediction

Dynamic branch predictors are not always a popular method of handling the branch delay problem in pipelined processors. Indeed, when silicon space in a processor is at an absolute premium, and the fabrication process does not permit any additional complex logic, several different methods for statically removing/predicting the behavioural outcome of a branch can be used. Although fabrication technologies and feature sizes now readily permit units like a dynamic branch predictor, it is useful to review the existing static methods and evaluate how useful these methods may be when applied to reducing the power consumption of a dynamic branch predictor.

4.2.1 Delay Region Scheduling

The branch delay region, as discussed previously, is a number of cycles during which the outcome (direction and target) of a branch instruction has not been calculated. The number of 'slots' for instructions in this region typically equates to the number of pipeline stages between the instruction fetch stage (IF) and execution (EXE) stage, inclusively.

Delay region scheduling is a static process whereby the compiler tries to move an appropriate number of instructions into the delay slots following a branch instruction in the static code [84] [40]. Rather than the processor dynamically taking any action upon encountering a branch in the dynamic stream, the processor simply ignores the fact that a branch instruction may affect the control flow of the program, until the execution stage. The instructions immediately following the branch are sequentially fetched as normal. In the execution stage, when the target of the branch instruction is known, the program counter is updated to reflect the appropriate target of the branch.

The method used by delay region scheduling implicitly requires that all instructions moved into the delay region are independent of the outcome of the branch instruction. There are two approaches to this, which are discussed here. Both approaches are similar, but move candidate instructions into the delay region from different types of locations. They are generally used as separate exclusive methods and rarely in conjunction with one another.

Local

Scheduling a branch delay region 'locally' refers to moving branch-independent instructions into the delay region from the same basic block as the branch. This means the instructions precede the branch in the static code.

Instructions can be moved into the delay region from the same basic block when they do not modify registers that either directly, or indirectly, affect the direction or target of a branch instruction. The compiler steps through each instruction in a static basic block in reverse, starting from the branch, and checks for direct or indirect branch dependencies. If there are no dependencies then an instruction can be scheduled (moved) into the delay region. This process is discussed in more detail in the next chapter, and then in practical detail in the simulation chapter.

Local delay region scheduling has the advantage that, where the delay region can be filled, it is always a 'win'. It is relatively simple to implement and requires little in the way of hardware logic. However, if a delay region cannot be filled for a given branch, the (remaining) delay slots must be filled with NOP instructions, which essentially waste processor cycles. Processors with shallow pipelines, and thus small delay regions, are ideal candidates for the use of the local delay region as it is easy to find candidate instructions to schedule into the delay region. However, even in this case, using local delay region scheduling as an exclusive branch resolution mechanism is increasingly difficult with modern compilers that highly optimise code and tend to make branches highly dependent on closely preceding instructions.

Global

Global delay region scheduling is a process where the delay slots for a given branch are filled with instructions from other basic blocks. The 'other' source for candidate instructions can be chosen by two methods:

1. A statically predicted target basic block

2. Analysing the directed graph structure of basic blocks to discover independent instructions from succeeding basic blocks that are always executed after the current basic block

The first method requires some way to decide which basic block, of the two possible targets, would be most profitable to schedule from. The simplest way is to assume that backward-pointing branches are taken (they are usually loop branches), as this is statically the most accurate simple heuristic. However, a more accurate way is to use profiling data to find the most commonly taken dynamic direction for the branch in question. The instructions do not need to be branch independent; simply the first N instructions are selected to fill the delay region (as this is a form of static speculation).

The second method requires very complicated static analysis by the compiler. This process can be very difficult in modern optimised code for the very same reasons that plague local delay region scheduling.

Additionally, both methods of global delay region scheduling implicitly cause code expansion. This is because a basic block can have multiple entrants. Rather than actually moving the instruction statically, the appropriate instructions are copied from the target basic block into the delay region, and a new entry point is created in the target basic block. The delay-scheduled branch in question is then changed to point at the new entry point, while other entrants to the target basic block use the original entry point (as they have not had the same instructions scheduled). The copied instructions result in code expansion as they still exist in the target. A simple example of this is shown below for a processor with three delay slots:

    SCHEDULED_BRANCH_BLOCK:
        INS1
        INS2
        INS3
        INS4
        BRANCH: TARGET_BLOCK'
        T_INS1
        T_INS2
        T_INS3

    TARGET_BLOCK:
        T_INS1
        T_INS2
        T_INS3
    TARGET_BLOCK':
        T_INS4
        T_INS5
        T_INS6

The globally scheduled delay region can be used for all branches, but has the problem that it requires hardware support when speculatively scheduling from a target basic block. The delay region, when filled from a statically predicted target (the first method), is scheduled speculatively and thus requires some way for the hardware to throw away falsely executed instructions, and then recover. This is usually accomplished using a hardware unit such as a reorder buffer.

4.2.2 Static Prediction and Instruction Hints

Static branch prediction, using instruction hints, is a method of assigning an unchanging prediction to a given branch; this prediction is then used dynamically by the processor to decide which direction to assume for the branch. Static branch prediction has been used either as a stand-alone method of branch prediction, or to override the dynamic predictor [51] [47]. The processor, during the instruction fetch stage, will examine the hint-bits in a branch instruction (usually in a fixed location for logical simplicity) and determine which direction to fetch instructions from. The key difficulty in static branch prediction is deciding which direction to assign as a static prediction. There are two main approaches:

1. Use compiler heuristics and analysis to determine, from the context of a branch instruction, whether it is likely to form part of a loop construct and thus be iterated many times and be strongly taken

2. Use dynamic profiling data to determine which branches are strongly taken and which branches are strongly not-taken

The most accurate method tends to be profiling, as it takes into account the dynamic behaviour with a target dataset [42]. Static prediction with instruction hints, however, has one key difficulty: while it provides the likely branch direction, it does not provide the target. This is resolved in different ways depending on the architecture [51] [5]. The simplest solution is to introduce a hardware unit such as a branch target buffer which provides the target address of the branch only on a hinted-taken branch. This usually works well as branch targets are commonly unchanging. However, it introduces significant hardware. Another option, such as that used by the PowerPC architecture, is to design the instruction set orthogonally, such that a target address can be calculated in the instruction fetch stage using very simple decoding.

Static branch prediction fares poorly as a complete prediction mechanism in modern processors. This is because a dynamic branch predictor is almost always more accurate, as most branches are not completely biased to one direction. Using static predictions to override a dynamic prediction is usually fruitless when considering this accuracy, and only has benefit when passing a hint which advises the processor of a very difficult-to-predict branch.

4.2.3 Guarded Execution

Guarded execution is another method that can be used to overcome the branch delay problem, and is often also referred to as conditional or predicated execution [98] [53] [97]. Guarded instructions are special instructions in the instruction set that mimic a small subset of existing instructions, such as move, add, load and store, but provide built-in condition detection. This condition takes the same form as that evaluated by a branch instruction. Instructions from a target path can be scheduled into the delay region and assigned to guarded instructions.

If the conditional guard on the instruction is true, the processor commits the instruction's result. If the conditional guard is false, then the instruction is treated as a NOP instruction, and no value is committed.

Successful utilisation of guarded execution scheduling will remove the need for dynamic prediction for branch instructions, but the benefits, in terms of both performance and power, are contingent on the compiler successfully deciding which branch path to schedule from; this decision can be made in a similar way to global delay region scheduling. In some situations, such as small loops and simple control constructs, guarded execution can completely remove a branch instruction (up to 33% of branch instructions can be removed [97]).

The disadvantages to guarded execution are that it puts pressure on the register file (in complex control structures) and is limited in accuracy in the same way as global delay region scheduling: although it provides a method to avoid dynamic prediction, it is likely to cause a performance penalty when instructions behave as NOPs after poor compiler scheduling.
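The transformation is easiest to see at the source level, even though it is actually performed on machine instructions. The C sketch below is illustrative only; whether the compiler converts the construct into a predicated (conditional-move) instruction depends on the ISA and the optimiser.

    /* Branching form: the hardware must predict 'cond' at the if. */
    int select_branching(int cond, int a, int x)
    {
        if (cond)
            x = a;            /* compiled to a conditional branch */
        return x;
    }

    /* Guarded form: the assignment carries its own condition, so it can
     * be compiled to a predicated/conditional move. The instruction
     * always issues, but its result is only committed when the guard
     * holds; no branch is fetched, so nothing needs predicting.        */
    int select_guarded(int cond, int a, int x)
    {
        x = cond ? a : x;     /* maps to, e.g., a conditional-move */
        return x;
    }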

4.3 Hardware Multithreading

Hardware multithreading refers to an architecture design that has support for following multiple threads of execution in parallel. It has been argued that this design approach could be used to resolve the branch delay problem by following both possible paths of a branch instruction. However, this approach will, by definition, give the equivalent of only a 50% prediction accuracy. In terms of both performance and power, this is useful only for branches that are particularly difficult to predict. Furthermore, the parallel execution units can almost always be better utilised when left to follow multiple threads, as opposed to speculatively executing both paths of a branch instruction.

There are projects that use a large number of parallel units to achieve large scale parallelism and, within this scheme, schedule different threads onto processor cells while a thread is stalled awaiting branch resolution. One project that takes this approach is called microThreading [57]. The problem with this approach lies in its radical nature; it is more desirable, for this project, to have a smaller architectural impact in any approach used to reduce branch predictor power consumption in embedded processors.

4.4 Initial Experiments

4.4.1 Removing Dynamic Branch Predictors

The first, and most obvious, suggestion that is often made when considering the power consumed by a dynamic branch prediction unit is to remove the dynamic branch predictor completely. Indeed, a dynamic branch predictor was originally conceived as a device to increase the performance of high speed processors [31] [84] [40]. It is also true to say that in the embedded, power-sensitive field, the performance, or execution time in seconds/cycles, is not necessarily the metric that is being optimised. Hence, a logical consequence of this often seems to be the outright removal of a dynamic branch predictor: if we don't care so much about performance, and a branch predictor uses significant power, then perhaps it should just be removed?

The dynamic power cost of a branch predictor, often over 10% of the processor's global power consumption [81] [82], is, in fact, offset by the savings from the number of cycles by which it reduces execution time. This is because, when optimising for energy efficiency, a key part of the metric is the amount of time a given operation, or operations, take to complete. While a dynamic branch predictor may use 10% of global power on a per-cycle basis, this does not take into account the change in execution time. The total energy used by a processor is equal to its average global power multiplied by the time (the number of cycles) for which the given program executes. Simply removing the branch predictor fails to take into account its important effect on the number of cycles that a program executes for. The branch predictor will invariably reduce the total execution time of a program.

Processor cycles are very important in energy efficiency. In each cycle, a vast amount of logic is 'clocked' (a clock signal is propagated around the processor logic) and various units within the processor will change state. Clock signals, and state changes in general, are very expensive in terms of power. If these state changes are occurring on redundant instructions, such as misspeculation or pipeline 'bubbles', then energy is wasted on instructions that will never be committed. In this sense, it could well be the case that a branch predictor is actually a power saving device.

A small experiment was carried out using the EEMBC benchmarks and the HWattch processor power simulator. Both of these are discussed in the simulation chapter of this report, but are referenced here for completeness. The experiment used the following general baseline:

• Scalar processor
• Seven stage pipeline
• Separate instruction/data cache
• Full execution of the entire EEMBC benchmark suite
• With branch prediction: a large local predictor (1024 entries)
• Without branch prediction: assume not-taken (fetch the next instruction as normal – PC+4)

A more detailed specification of this baseline can be seen in the baseline section of the experimentation chapter later in this report (7.2); however, for the basis of this small experiment, the above information is sufficient to demonstrate the result.

The outcome of executing the entire EEMBC suite on the baseline shown was that, using a weighted average, global power consumption increased by approximately 10% when the branch predictor was not present. This result confirms similar experiments conducted elsewhere [82]. The importance of this result is two-fold: firstly, it shows that a dynamic branch predictor is actually a power-saving device and cannot simply be removed, and secondly, it demonstrates that any attempt to decrease the power consumption of the dynamic branch predictor logic must not come at the cost of a significant change in accuracy. The experiment shows that reducing the accuracy of any existing form of branch prediction will, at some stage, have an extremely negative observable effect in terms of power.
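To make the power/time trade-off concrete, consider a purely illustrative calculation (the figures are invented for the example, not measured): suppose a processor dissipates 10 W with its predictor and 9 W without it, and that removing the predictor stretches a program's run from 1.0 s to 1.2 s through misspeculated work and refetch stalls. Energy is power multiplied by time, so the machine with the predictor consumes 10 × 1.0 = 10 J, while the nominally 'cheaper' machine consumes 9 × 1.2 = 10.8 J, an 8% increase in energy for the same work. This is precisely the shape of the effect observed in the experiment above.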

4.4.2 Instruction Stream Research (HTracer)

In the early stages of research, a dynamic instruction tracer tool was developed called HTracer (Hertfordshire Tracer) [41] [45]. HTracer is capable of producing selective traces of the dynamic instruction stream on any platform for which the Linux operating system is available. This is achieved by close coupling with Linux kernel calls, and with a flexible system of specifying which kinds of instructions to trace. A more detailed description of the capabilities of HTracer can be seen in the eponymous publication contained in Appendix A.

Extensive branch traces were produced for the execution of various isolated benchmarks from both the EEMBC and the SPEC 2000 benchmark suites on the PowerPC architecture. These were analysed to discover the extent to which branches are biased to a particular direction, and which kinds of branches tend to account for a large proportion of the dynamic instruction stream.

The results of the trace analysis showed that a large proportion (around 70%) of the dynamic branch stream was made up of a single branch format/type: the offset-type branch. This is logical, when considered, because offset branches are used by relatively small, highly iterative loop constructs. These loops typically execute a large number of times and would therefore likely account for a large proportion of the branches in a given instruction stream. Nevertheless, this result is useful as it verifies the behaviour of a 'real' instruction set.

The results gathered from the instruction traces also showed that a high proportion of branches are very biased to one direction: in many programs, up to 70% of all dynamic branches encountered are more than 85% biased to one target direction [43]. Branches that are so greatly biased are very easy to assign a static prediction to, and it is possible that this static prediction could be as accurate as a dynamic prediction for these branches.

4.4.3 I-Cache Experimentation

Additional experimentation was carried out with respect to the i-cache and the dynamic branch predictor [32]. The Wattch power analysis simulator was modified to include two special additional bits in each line of the instruction cache. These bits represented saturation information for each branch in the i-cache. The bits were set based on the branch history stored in the directional predictor history tables, and were used to bypass a dynamic prediction for very biased branches. When a branch exceeded a certain level of bias to one direction, the bits would be set to avoid accessing the dynamic branch predictor the next time(s) the branch was encountered.

Although this could only be used to avoid accessing the direction part of the dynamic branch predictor, and not the branch target predictor, it did show that it is possible to save some power without significantly affecting branch accuracy. Introducing the additional bits saved around 1% of global processor power. Another clear downside of this particular approach is that it increases the size and complexity of the cache, which is already a sizeable and complex processor unit. More details of this implementation can be found in the publications appendix.

4.5 Summary

This chapter has introduced and discussed various methods that have been previously used to avoid the need for dynamic prediction.

The first experiment described in this chapter revealed that, in fact, a dynamic branch predictor is actually a power-saving device in most contexts. A dynamic branch predictor saves global processor power by reducing both the number of misspeculated instructions executed and the number of redundant cycles following a misprediction. This is significant as it shows that any power saving modifications to a branch prediction paradigm must not come at the cost of branch prediction accuracy; misprediction is expensive in terms of both performance and power.

Delay region scheduling is an interesting concept as it, in theory, completely removes the need for a dynamic predictor. However, the reality is that it is very difficult to schedule into the delay region effectively. Local delay region scheduling is the most beneficial, but it is extremely difficult to find candidate instructions for, particularly in contemporary optimised assembly code. Global delay region scheduling has the advantage of making it easier to find candidate instructions, but introduces static speculation and code expansion.

Guarded instruction execution is an interesting concept, but suffers from similar scheduling problems to the global delay region, and also applies pressure on the register file for the various guards required in complex control sequences. Hardware multithreading is an inappropriate method of resolving the branch problem for embedded processors as it is both energy-inefficient and inaccurate.

Preliminary research has shown that an ideal solution might be to use a combination of the static methods discussed in this chapter in conjunction with an existing dynamic prediction paradigm. The static methods should be used on highly biased branches, or those which can be scheduled with no speculative cost. Such an algorithm could complement existing dynamic predictors by reducing the number of dynamic accesses, and thus reducing global power consumption over a program's execution. A formulation of this ideal was conceived which combined local delay region scheduling and profiling data (in the form of hint-bits). This 'combined algorithm' is discussed in great detail in the next chapter, and leads into an in-depth investigation.

Chapter 5

The Combined Approach

The combined approach makes use of two traditional processes, with some novel modifications, and uses them in conjunction with the architecture's existing dynamic branch predictor. The impetus behind this approach is to represent runtime information statically in a branch instruction. This can then be used by simple hardware to bypass the need to access the dynamic branch predictor for as many branches as possible. The most important goal of this approach is to have as little impact on the overall prediction accuracy of a program as possible. Hence, the algorithm should not increase the execution time of a program, or cost additional power through mispredictions.

5.1 Local Delay Region Scheduling

In contrast to scheduling into the delay region from a target/fallthrough basic block, a locally scheduled delay region takes branch-independent instructions from the same basic block that precede the branch in the static code. A branch-independent instruction is any instruction whose result is not directly or indirectly depended upon by the branch to calculate its own behaviour. Moving an instruction that is not branch independent into the delay region would affect program semantics [80] [40].

Deciding which instructions can be moved into the delay region locally is straightforward. Starting with the first instruction above the branch in the static stream (the branch instruction's predecessor), examine the target register operand. If this target register is not used as an operand in the computation of the branch instruction, then the instruction can be safely moved into the delay region. This process continues with the next instruction up from the branch in the static stream, with the difference that this time the scheduler must decide whether the target of the instruction is read by any of the other instructions below it (which are in turn used to compute the branch). This means the operands and target must be checked for data hazards [33]; a simplified sketch of this scan is given below.

Figure 5.1: An example of local delayed branch scheduling

Figure 5.1 shows how the two instructions from basic block 'L1' have been moved into the delay region of the branch and will always be executed (and usefully committed) in the stages of the pipeline behind the branch. The algorithm used is explained in more detail in the next chapter, which discusses its implementation. In fact, the way in which the algorithm is used in the implementation greatly simplifies the dependency checking, so it is deliberately given only a brief discussion here.

Local delay region scheduling is an excellent method of utilising the delay region where possible; it is always a win, with no penalty, and it completely avoids the use of a branch predictor for the given branch. This is true because we can be certain that the instructions moved into the delay region will always need to be executed; by the time the processor needs to fetch instructions from a target path, the target of the branch has already been calculated. Scheduling in this way avoids any speculation at all for the given branch and requires no control logic to recover from instructions that were scheduled from the wrong path of a branch.

The clear disadvantage with local delay region scheduling is that it cannot always be used. There are two causes of this:

1. In well optimised code, it is difficult to find branch independent instructions in the same basic block that can be moved into the delay region.

2. In deeply pipelined processors, the delay region can be vast (e.g. 8 clock cycles before branch resolution, which means finding 8 instructions within the same basic block to fill the delay region; this is extremely hard). For a branch where it has been chosen to use the local delay region, any slots which cannot be filled with useful instructions must be padded with NOPs. This is wasteful.
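The C sketch below illustrates the backward dependency scan described above. It is a simplified model, not the implementation used in this project: instructions are reduced to one destination register and a fixed set of source registers, and all names are hypothetical.

    #define MAX_SRCS 3

    /* Simplified instruction record: one destination register and up to
     * MAX_SRCS source registers (-1 marks an unused slot).              */
    struct insn {
        int dest;
        int src[MAX_SRCS];
    };

    static int reads_reg(const struct insn *i, int reg)
    {
        for (int k = 0; k < MAX_SRCS; k++)
            if (i->src[k] == reg)
                return 1;
        return 0;
    }

    /* Scan upwards from the instruction immediately above the branch and
     * return the index of the first instruction that may be moved into
     * the delay region, or -1 if none exists. 'block' holds the basic
     * block's instructions and block[n] is the branch itself. A full
     * scheduler would repeat this to fill several slots, and would also
     * reject a candidate if any instruction between it and the branch
     * overwrites one of the candidate's source registers.               */
    static int find_delay_candidate(const struct insn *block, int n)
    {
        for (int i = n - 1; i >= 0; i--) {
            int dependent = 0;

            /* the candidate's result must not feed the branch, nor any
             * instruction between the candidate and the branch          */
            for (int j = i + 1; j <= n && !dependent; j++)
                dependent = reads_reg(&block[j], block[i].dest);

            if (!dependent)
                return i;   /* safe: move block[i] into the delay region */
        }
        return -1;          /* no candidate: pad the slots with NOPs */
    }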

However, the locally delayed branch is still very useful for unconditional absolute branch instructions, particularly in embedded processors. This is because an unconditional absolute branch instruction is not dependent on any preceding instructions when calculating either its target or direction. This means that the number of delay slots that can be scheduled is limited only by the size of the basic block from which instructions are being scheduled. Furthermore, in an embedded processor, the number of branch delay slots is much lower than in desktop machines; typically it is between 2 and 5 instructions (depending on the architecture).

Although the local delay region is no longer a useful method on its own, it will prove profitable when used in combination with the methods proposed in the following sections [46].

The Delay Region in Superscalar Processors

The delay region in superscalar processors, which use dynamic scheduling algorithms, can be of variable size. This means that it is possible there is no fixed number of delay region slots to schedule into. However, this can be overcome by using profiling data. Each branch instruction exists within a static context; during execution there is a fixed number of entrants to a given basic block. The profiling data can be used to determine the minimum number of delay slots available for a given branch, and also the maximum, over an adequate dataset. The delay region is scheduled with instructions to fill the minimum available delay slots, and then NOP instructions are inserted up to the maximum number of slots. This ensures program semantics are always maintained and allows the use of the local delay region in a superscalar environment. However, this process does rely on a dataset 'touching' each area of a program sufficiently. Typically, particularly in shallow pipelines, the minimum and maximum numbers of delay slots available for a given branch are extremely close (if not the same). In scalar processors, which currently constitute the majority of embedded processors, no such problem exists.

5.2 Profiling

The central aim of the combined algorithm is to associate an accurate static prediction with as many branches as possible, so as to reduce accesses to the dynamic branch predictor (in order to save energy). This could also be attempted through static analysis of the assembly code of a program; it is often clear that branches in loops will commonly be taken and internal break points not-taken. However, such a method does not take into account either the behaviour or accuracy of a coexisting dynamic predictor.

A more reliable method is to observe the behaviour of a given program, at the assembly/machine level, while it is undergoing execution with a sample dataset or datasets [41]. This means that the behaviour of each branch instruction can be recorded in the form of a specially adapted program trace which logs a detailed history of selected instructions during execution. Any relevant information about a branch can be extracted and used to form static predictions where possible. This is the essence of profiling in the context of static branch prediction.

Figure 5.2: The basic structure of profiling

The number of datasets that any given program is profiled with will affect the accuracy of the profiling results. The more diverse the datasets used, the more widely applicable the results will be, but care must be taken to select only datasets that represent the target execution domain.

Profiling derives its key abilities from the fact that a program will behave similarly across datasets in the same domain. This means that, for instance, profiling a JPEG compression program with one image dataset will yield results that can predict the behaviour of the program with another image dataset. This is particularly useful for branch instructions as they commonly only behave differently across datasets at a few key moments during the program's execution.

5.2.1 Assigning a Static Branch Behaviour

The key limitation of the traditional use of profiling to form static branch predictions is that it can only be used to show the bias of any given branch instruction. While this is useful for assigning a static branch prediction in a processor without a dynamic predictor, it has a degrading effect on overall prediction accuracy when used in conjunction with a dynamic predictor, and in turn actually increases power consumption [82].

This limitation arises when one considers how profiled data about the directional history of a branch instruction could be used to assign a static prediction. In a static-only prediction model, a branch instruction is assigned a prediction to instruct the processor to assume the path of profiled directional bias; it does not matter if the bias is only 1% (51%) towards a specific direction, as this will still be the best prediction to make.

However, consider the situation where there exists a dynamic branch predictor in the target processor, and we are trying to reduce accesses to this without reducing overall branch prediction accuracy. It needs to be decided at what bias level a branch is assigned a static prediction, but this is difficult since it is unknown whether the dynamic predictor could achieve a higher accuracy than a statically assigned prediction for a given branch. Given the general accuracy of dynamic predictors, it is clear that a fixed static prediction bias assignment level is likely to be either:

1. Set so high that no branches are removed from dynamic prediction, or

2. Set low enough to decrease dynamic prediction accesses, but then often removing branches that are better left to dynamic prediction, thus increasing power consumption through the added delay of mispredictions

5.2.2 Adaptive Branch Bias Measurement (ABBM)

The solution to this problem requires additional information to be recorded during profiling execution. Instead of just storing the directional history of a branch instruction, the profiler should also log the associated predicted direction of the dynamic predictor for the given branch. Using the full directional and dynamic prediction trace of a program, a static prediction assigner can build up an individual profile for each branch which shows the dynamic prediction accuracy for a given branch versus its bias. This information can then be used to assign a static branch prediction to only those branches whose dynamic prediction accuracy would not be impaired by a static prediction: that is, when a given branch's bias is greater than or equal to its dynamic prediction accuracy. It follows as a corollary that the more accurate the dynamic predictor in use, the fewer branches will be removable with a static prediction.
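A minimal sketch of the ABBM decision in C follows. The per-branch record and all names are hypothetical; the rule itself (hint only when bias is greater than or equal to the measured dynamic accuracy) is exactly the one described above.

    /* Per-branch record accumulated from the profiling trace, which logs
     * both the actual direction and the dynamic predictor's prediction
     * for every dynamic execution of the branch.                         */
    struct branch_profile {
        unsigned long executions;     /* dynamic occurrences            */
        unsigned long taken;          /* times the branch was taken     */
        unsigned long predicted_ok;   /* times the dynamic predictor
                                         was correct for this branch    */
    };

    typedef enum { HINT_DYNAMIC, HINT_TAKEN, HINT_NOT_TAKEN } abbm_hint_t;

    /* ABBM: assign a static hint only when the branch's bias is at least
     * as good as the dynamic predictor's measured accuracy for it.       */
    static abbm_hint_t abbm_assign(const struct branch_profile *p)
    {
        if (p->executions == 0)
            return HINT_DYNAMIC;      /* never seen: leave to the predictor */

        double taken_ratio  = (double)p->taken / p->executions;
        double bias         = taken_ratio > 0.5 ? taken_ratio
                                                : 1.0 - taken_ratio;
        double dyn_accuracy = (double)p->predicted_ok / p->executions;

        if (bias >= dyn_accuracy)
            return taken_ratio > 0.5 ? HINT_TAKEN : HINT_NOT_TAKEN;

        return HINT_DYNAMIC;          /* the predictor does better here */
    }

Note how the fixed threshold of earlier schemes is replaced by a per-branch comparison: a 99%-accurate dynamic prediction protects a 90%-biased branch from being hinted, while a weakly predicted but strongly biased branch is removed from the predictor's workload.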

5.3 The Combined Algorithm

The combined approach is a method designed to make use of the strengths of both the local delay region and the novel Adaptive Branch Bias Measurement (ABBM) through profiling. When these two methods are combined, it is expected that the number of dynamic branch predictor accesses can be radically reduced due to the principles of execution locality. The information that needs to be conveyed in a given static branch instruction, in order to avoid accessing a dynamic predictor, is as follows:

1. Statically predict taken. Do not access, or update, the dynamic predictor for this branch.

2. Statically predict not-taken. Do not access, or update, the dynamic predictor for this branch.

3. Use the locally scheduled delay region. Do not access, or update, the dynamic predictor for this branch.

4. Use the dynamic predictor (and update it).

It is important to remember that two accesses need to be removed for a given branch: the access to furnish a prediction and the access to update the branch predictor. The representation of this information requires two bits in a branch instruction (these bits are henceforth referred to as hint-bits) and is discussed in further detail in the next subsection.

Algorithm 1 shows a more precise definition of how the combination of the two techniques should work. The algorithm relies on setting the profiled hint-bits only if the branch is as biased as, or more biased than, its dynamic prediction accuracy. This method is novel because, with an adequate dataset or datasets, it should have minimal impact on the prediction accuracy for the given branch instruction, and thus not increase the execution time of the program as a whole (something which would likely result in an increase in power consumption).

    Input: All Assembly Files of Programs
    Output: Appropriately Hinted Assembly Files

    foreach Program do
        foreach Assembly File do
            foreach Branch Instruction do
                Initially, set hint-bits to "Use the dynamic predictor
                for this branch"
                if Branch == Unconditional Branch then
                    Set hint-bits to use the local delay region and move
                    the two instructions preceding the branch into the
                    delay region (if possible)
                else
                    if Branch's Profiled Bias ≥ Dynamic Branch Predictor's
                    Accuracy for this Branch then
                        Set hint-bits to predict the profiled bias
                    end
                end
            end
        end
    end

Algorithm 1: Combined Dynamic Branch Prediction Reduction Algorithm

Figure 5.3 shows how Algorithm 1 would work in the familiar setting of compiling a program, by introducing additional stages into the GNU Compiler Collection tool chain.

5.4 Hardware Implementation

In order to save energy/power in a particular processor architecture, the hardware needs to include some logic to take advantage of the four hint cases described previously. Fortunately, these hardware modifications are relatively simple, but there are some subtle considerations required in order to gain the maximum benefit from the combined algorithm. The following section discusses instruction set modifications in further detail to clarify those points that have already been discussed.

Figure 5.3: Block model of the profiling and hinting regime

5.4.1 Instruction Set Modifications

The simplest way to represent the four behaviours required by the combined algorithm is to use hint-bits. Hint-bits are additional bits contained within an instruction, or instructions, in a given instruction set. In the case of avoiding accesses to a dynamic branch predictor, there is a decision to be made about whether to include the behavioural hint-bits in all instructions or just in branch instructions. Many modern processors already predecode instructions to determine whether to access the dynamic predictor unit (they only access the dynamic predictor on a branch instruction); in which case, hint information need only be included with the branch instructions themselves. Including hint information in branch instructions alone is much more acceptable, since branch instructions typically have at least two redundant bits. However, it could easily be the case that a designer may wish to use the hint-bits themselves as branch predecode information, and thus implement them throughout the entire instruction set. Some modern embedded instruction sets [51] already include hint-bits in branch instructions, but they are currently only used as a fallback and no compiler/hardware makes use of them as an energy saving method. The model proposed in this study includes hint-bits only in the branch instructions, as almost all energy efficient architectures already check whether an instruction is a branch by using predecode logic and an appropriately designed instruction format.

The most important point for consideration in the instruction set, and for compilers, occurs when the hint-bits are required to represent an 'assume taken' branch. As described in Chapter 2, the energy cost of accessing a dynamic branch predictor is split between the direction predictor (pattern history tables) and the target address predictor (branch target address buffer). In order to make energy saving as effective as possible, accesses to both parts of the dynamic predictor must be avoided whenever possible. The target address of a branch instruction is not known (at the earliest) until the instruction is decoded. This makes avoiding the use of the branch target address buffer difficult without simply replicating branch instruction decoding in the instruction fetch stage of the pipeline. In an energy sensitive environment, this is counterproductive.

Fortunately, there does exist a simpler way to work around this problem. Most instruction sets, particularly RISC variants (the most common type in an energy sensitive environment [51] [7]), overwhelmingly use a particular branch instruction format. Although a large variety of branch instructions are usually available, the majority of branch instructions actually used are offset-type branches that jump a short distance in program memory (the principle of spatial locality). These branches typically share the same instruction format. When a given program is monitored dynamically, the skew toward offset branches becomes even more pronounced. Table 5.1 shows the branch instructions of the PISA instruction set used by HWattch (the simulator used for the experimentation). PISA is closely related to MIPS, but is designed to give results similar to other RISC instruction sets [16]. Table 5.1 also shows the static and dynamic occurrence of each branch, its associated instruction format and the possible method from the combined algorithm that can be used with it.

Table 5.1: Static and dynamic occurrence of each PISA branch instruction across the whole EEMBC benchmark suite, with its format and applicable method

    Branch        Static        Dynamic       Branch Format
    Instruction   Occurrence    Occurrence    (Applicable Method)
    j             10.21%        17.31%        Unconditional Abs. (Local Delay Region)
    jal           33.95%         3.58%        Unconditional Abs. (Local Delay Region)
    jr            15.54%         3.55%        Register Jump Format (Not Hinted)
    jalr           2.32%         0.04%        Register Jump Format (Not Hinted)
    beq           18.18%        20.23%        Offset Format (Bias Profiling)
    bne           16.46%        50.09%        Offset Format (Bias Profiling)
    blez           1.52%         2.58%        Offset Format (Bias Profiling)
    bgtz           0.27%         1.04%        Offset Format (Bias Profiling)
    bltz           0.48%         0.39%        Offset Format (Bias Profiling)
    bgez           1.06%         1.19%        Offset Format (Bias Profiling)

It can be seen that the vast majority (75%) of dynamic branch instructions are of the offset format. It is as a result of this that the decision was made to use the profiled 'assume taken' case only with this branch format. This means that the required hardware alterations (described next) can be extremely simple and require little additional decoding; with a single branch format being used, the branch target information will always be in the same bit positions within the instruction. While this seems limiting, we can also see from Table 5.1 that another major class of branch format can be covered by the local delay region. When the local delay region is used with the unconditional absolute branch instructions, no target information is needed. In further support of using the 'assume taken' case with only a single branch format, it can be said that, even if an instruction set does not have a particular dynamic skew towards offset branches, the types of branch instructions used can easily be coerced to this end by the use of readily available compiler techniques, such as those used to generate position independent code for secure library reuse.

5.4.2 Hardware Modifications The use of only one branch instruction format for ‘assume taken’ branches means that the additional hardware logic is very simple. The modifications are imple- mented in two pipeline stages: instruction fetch and instruction execution. With respect to the dynamic branch predictor, this means in the prediction furnishing stage and the predictor update stage.

Instruction Fetch / Furnish Prediction

The instruction fetch stage has the more logically complex modifications of the two stages that must be altered. Out of the four statically assigned behaviours possible in a branch instruction, only two new meta-behaviours essentially need to be implemented:

1. Assume the fall-through path and do not access the predictor (local delay region or predict not-taken)

2. Use a specific bit-position offset from the current program counter and follow that target.

Alternatively, the hint-bits are both set to zero and no action is performed (the dynamic branch predictor is accessed). Figure 5.4 shows a logical representation of how these changes would be implemented in hardware. It is difficult to state the exact hardware that would be required to implement the static predictions because it is highly implementation specific and depends on the existing predecode logic. However, the additional logic required is very limited. At most, a binary adder, a small number of logic gates and some interconnects are required. Compared to the existing control logic, the addition of these components is almost insignificant; nevertheless, these costs are accounted for in the simulations carried out in the next chapter.
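The following C sketch models the fetch-stage behaviour just described. The two-bit encoding is an assumption for illustration (the thesis fixes the four behaviours, not their bit values), and the offset arithmetic assumes a MIPS/PISA-style word-aligned offset branch.

    #include <stdint.h>

    /* One possible encoding of the four hint behaviours (assumed). */
    enum hint_bits {
        HINT_USE_DYNAMIC   = 0,   /* access and update the predictor         */
        HINT_STATIC_TAKEN  = 1,   /* assume taken; predictor left untouched  */
        HINT_STATIC_NTAKEN = 2,   /* assume not-taken; predictor untouched   */
        HINT_DELAY_REGION  = 3    /* local delay region; predictor untouched */
    };

    /* Fetch-stage next-PC selection for a hinted branch. Returns 1 when
     * the hint resolves the fetch, 0 when the dynamic predictor must be
     * consulted as before. The offset comes straight from the fixed bit
     * positions of the single offset branch format, so only an adder is
     * needed in hardware.                                                 */
    static int hinted_next_pc(uint32_t pc, unsigned hint,
                              int32_t sign_extended_offset, uint32_t *next_pc)
    {
        switch (hint) {
        case HINT_STATIC_TAKEN:
            *next_pc = pc + 4 + ((uint32_t)sign_extended_offset << 2);
            return 1;
        case HINT_STATIC_NTAKEN:
        case HINT_DELAY_REGION:
            *next_pc = pc + 4;    /* fall-through path; predictor idle */
            return 1;
        default:
            return 0;             /* HINT_USE_DYNAMIC */
        }
    }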

Figure 5.4: Hardware modifications required in the instruction fetch stage

Instruction Execution / Predictor Update

The new logic required in the execution/update pipeline stage is much simpler than that required in the instruction fetch stage. There are only two modifications required (represented by the hint-bits described previously), and these can be easily incorporated into the existing control logic:

1. Do not update the dynamic predictor if the hint-bits are set.

2. Do not detect target misprediction for branch instructions that are hinted to use the local delay region.

The branch predictor should not be updated because, if the hint-bits are set, the updated data in the branch predictor will never be used to furnish a relevant prediction. Such an update would waste as much energy as furnishing a prediction. The processor should ignore any prediction checking for branch instructions where the hint-bits are set to use the local delay region. In such a case, the compiler should have already scheduled the correct number of instructions into the delay slots and the calculated target of the branch instruction can be used to fetch the next instruction. In other words, the processor should not flush any instructions behind the branch in the pipeline. These two cases can be easily incorporated into the existing hardware by using just a few gates to modify the behaviour of the current logic.
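Correspondingly, the update-stage change amounts to little more than a guard around the existing update and recovery logic. A sketch follows, reusing the encoding assumed in the fetch-stage example; the recovery and update routines are stubs for the processor's existing mechanisms.

    enum { UPD_HINT_USE_DYNAMIC = 0, UPD_HINT_DELAY_REGION = 3 };

    static void recover_pipeline(void)  { /* existing flush/recovery logic */ }
    static void update_predictors(void) { /* existing PHT/BTB update logic */ }

    /* Execute/commit stage behaviour for a branch. */
    static void commit_branch(unsigned hint, int mispredicted)
    {
        if (hint != UPD_HINT_USE_DYNAMIC) {
            /* Hinted branch: never update the predictor (modification 1). */
            if (hint == UPD_HINT_DELAY_REGION)
                return;              /* modification 2: no misprediction
                                        check; the delay slots always commit */
            if (mispredicted)
                recover_pipeline();  /* a wrong static hint reuses the
                                        existing recovery mechanism          */
            return;
        }
        update_predictors();         /* existing behaviour */
        if (mispredicted)
            recover_pipeline();
    }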

Static Branch Misprediction

It is important to note that, despite the hinted predictions for branches being static, they will still occasionally be incorrect. This does not require any special additional control logic or hardware structures (such as a reorder buffer) because there will already be a coexisting dynamic prediction system and associated recovery mechanism. Any hinted misprediction will be handled in the same way as a hardware dynamic misprediction. The existing recovery mechanism will resolve any inconsistencies in the pipeline.

5.5 Summary

The combined approach unifies two key methods for reducing the number of accesses to a dynamic branch predictor: local delay region scheduling and profiling with an adaptive bias measurement. The selective hinting algorithm will schedule into the local delay region for unconditional absolute branches, and will assign static predictions to branches where profiling indicates this will not affect that branch's prediction accuracy during execution. The hinting information is reflected in two hint-bits in every branch instruction. The hardware modifications are very simple and create negligible additional energy dissipation.

The following chapters move on to evaluate the combined algorithm in a set of carefully designed experiments. The next chapter introduces the simulation tools used and created for the experiments, and discusses the specifics of how the combined algorithm is implemented in the simulation environment.

Chapter 6

Simulation Tools

6.1 Introduction

This chapter describes, in detail, the experimental method, tools and benchmarks used to generate the results for the combined algorithm, shown in the next chapter. The main tools developed are 'HWattch' (and the associated profiling and compilation tools), the Hatfield Assembly Code Analyser (HACA) and a bespoke build system for the Embedded Microprocessor Benchmark Consortium (EEMBC) suite.

6.2 Simulator (HWattch)

In order to model the effectiveness of the combined algorithm presented in the previous chapter, it was necessary to choose a system of processor simulation. The two main approaches for simulating such a processor model are 'execution-driven' and 'trace-driven'.

A trace-driven simulator relies on a tracing tool to generate a dynamic instruction trace of a program on a given architecture. The program trace is then stored and used as an 'unrolled execution' of the program, which can be parsed by a much simpler simulator to generate different results from a fixed semantic behaviour. The main advantages of trace-driven simulation are simulator simplicity and speed of simulation; a trace-driven simulator does not need to actually execute a program, but rather just parse a fixed program trace.

In contrast, an execution-driven simulator is a program which models an actual architecture and executes code in a simulated environment. A program is compiled to the machine code of the simulator, which may or may not be the same as a 'real' architecture, and then the simulator proceeds to execute the code. The advantage of this kind of simulation is that the code execution can be modelled at various levels of detail, possibly all the way down to the gate level if the design dictates this.

For a project such as the evaluation of the combined algorithm, a trace-driven simulation might seem ideal, as most of the detail required would appear to be contained within the dynamic branch predictor logic. However, when measuring energy efficiency and power consumption, a more holistic view of global energy consumption is required to understand any processor-wide effects that emerge after modification of the dynamic behaviour. This can be achieved with a trace-driven simulator, but the level of detail required would yield only a small simplicity improvement over an execution-driven simulator, and would also make runtime behaviour modification more difficult.

Fortunately, a suitable simulation tool-chain already exists that is widely used and recognised as a reliable source for experimental results. SimpleScalar [16] is an execution-driven simulation toolkit that models various levels of detail in different simulator executables. However, SimpleScalar does not model the energy/power consumption of a processor during execution. In 2000, a fork of the out-of-order simulator of SimpleScalar was modified to include power simulation [14] data for the entirety of the processor's components. This simulator is called Wattch, and its relationship to SimpleScalar is shown in Figure 6.1. The power model was shown to be highly accurate and comparable to detailed low-level power simulation tools [14]. Wattch, as with SimpleScalar, is built in a modular fashion, with each logical section of the processor broken down into a separate module. This design means that all components of the processor simulated by Wattch, from the branch predictor to the issue logic, are fully parameterised. This facilitates the tailoring of individual baseline models.

Figure 6.1: The Wattch simulator in relation to SimpleScalar

The Wattch simulator is suited to modelling the architecture necessary to analyse the combined algorithm with the EEMBC benchmarks. For this reason, Wattch was chosen as the tool to generate the results for this project. The architecture model of Wattch, and the required modifications, are explained in the following subsections. The software resulting from the modifications described in this chapter, and many other usability changes, is termed HWattch; this term will be used after this chapter.

6.2.1 Architecture Model

The SimpleScalar simulation toolkit provides several simulators that model various levels of detail and different functional units. The largest of these, called 'sim-outorder', is the union of all of the other simulators and provides the most detailed results. The Wattch simulator only implements 'sim-outorder', presumably because this turned out to be the most used simulator and centralises all of the features of the toolkit in a single executable; this made it easier to insert the power model into the simulator.

'Sim-outorder' implements a full superscalar processor that uses a variant of the Tomasulo algorithm with reservation stations to schedule instructions dynamically. However, these features can be disabled and the simulator forced to behave like an in-order processor.

Pipeline

Figure 6.2 shows the general structure of the pipeline simulated by Wattch.

Figure 6.2: The Wattch simulator pipeline

The pipeline contains, when in scalar mode, two delay slots. When running in out-of-order mode the delay region is variable. The branch predictor is used in the Fetch and Execute stages of the pipeline, first to furnish a prediction and then to update the predictor. The cache hierarchy is broken into two levels and bifurcated into separate data and instruction cache models. The register file consists of 32 general purpose registers.

Power Model

The SimpleScalar simulator 'sim-outorder' is divided into relatively fine-grained events, which proved very useful to the team that designed the power model in Wattch [14]. The Wattch power model takes parameters for each component which specify the size (in transistors/control logic) being used in the particular simulation. From this, using a standard template for that particular unit verified against industrial gate models, Wattch creates a static power cost and a dynamic access power cost. The static power cost is a constant, per-cycle drain which is simply accumulated over the total number of cycles. The dynamic power cost models the power consumed during each transition event and is accumulated by summing these power values at the appropriate places in the SimpleScalar event methods.

The results generated from this are relative power values rather than absolute measurements of any particular implementation. That is to say, there are atomic power values for principal events, such as gate dissipation and wire capacitance, which represent the smallest operations; every other event in the processor is a scaled-up version of these atomic interactions. The results are then consistent, because the power model is consistent through the scaling relationships between operations, and can be used to compare the difference between two architectures, compiler techniques and so on.

Additionally, Wattch models conditional clocking (clock gating). This is a power saving technique, described in Chapter Two, that disables the clock signal for idle processor units. For an idle unit, this can potentially have a linear power-saving effect. The clock gating models available are:

• No conditional clocking

• Simple conditional clocking (zero power dissipation with zero accesses)

• Ideal aggressive conditional clocking (linear power dissipation with fractional accesses)

• Non-ideal aggressive conditional clocking (15% power dissipation with zero accesses)

The non-ideal model is used in the results presented in the next chapter in order to give a more realistic, indeed possibly pessimistic, result. A more detailed specification and validation of the Wattch power model can be found in the initial Wattch publication [14].
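As an illustration of how these four regimes differ, the C fragment below scales a unit's per-cycle power according to each model. It follows only the descriptions above (zero power when idle for the simple model, linear scaling with the fraction of ports used, and a 15% idle floor for the non-ideal model); the names are illustrative and are not taken from the Wattch source.

/* Per-cycle unit power under the four clock gating regimes (sketch). */
typedef enum { CG_NONE, CG_SIMPLE, CG_AGGRESSIVE, CG_NONIDEAL } cg_style_t;

static double unit_cycle_power(cg_style_t style, double max_power,
                               int accesses, int ports)
{
    double frac = (ports > 0) ? (double)accesses / (double)ports : 0.0;

    switch (style) {
    case CG_NONE:       return max_power;               /* always clocked */
    case CG_SIMPLE:     return accesses ? max_power : 0.0;
    case CG_AGGRESSIVE: return max_power * frac;        /* ideal: linear  */
    case CG_NONIDEAL:   /* linear when active; idle units still dissipate
                           15% of their maximum power                     */
        return accesses ? max_power * frac : 0.15 * max_power;
    }
    return max_power;
}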

Parameters

The hardware specification of a Wattch simulation is highly configurable. The main configuration options are as follows:

IF Queue Size – The number of instructions that can be stored in the instruction fetch queue before decode/dispatch.

Prediction Latency – Although the Wattch architecture features two delay slots, an additional misprediction penalty can be added to generate results that correspond to much more deeply pipelined processors.

Front End Speed – A coefficient can be specified that Wattch will use to adjust the performance difference between the fetch stage and decode stage. For instance, one can allow x times as many instructions to be fetched as can be decoded.

Branch Prediction – The type and size of branch predictor to be used. This can include local prediction or global prediction (PAg & GAg), gshare and so on. The size is used to specify the number of entries in each table and history length in bits.

Return Address Stack – The size of the return address stack.

Branch Target Buffer – The number of entries in the branch target buffer and the set associativity.

Decode Width – The number of instructions decoded in one cycle.

Issue Width – The number of instructions issued in one cycle.

Out-of-order Execution – The processor can disable its superscalar attributes and behave as a standard scalar processor. This is only really useful when the issue width and functional units are also set to appropriate values.

Commit Width – The number of instructions committed in each cycle.

Caches – The size and speed of the caches (levels one and two) for instructions and data (separately). The associated cost of a cache miss can also be specified.

Memory Latency – The number of cycles required to access main memory.

Functional Units – The number of functional units. This includes integer and floating point ALUs, multipliers and dividers.

The configuration options listed above are specified for each of the baseline experimental architectures shown in a later section.

6.2.2 Architecture Modifications

The previous chapter described the combined scheduling and profiling algorithm that the next chapter uses to generate experimental results. In that chapter, several small hardware modifications were specified so that the processor hardware can exploit the hinting information contained within specific instructions.

As with most modern embedded processors, the Wattch architecture uses pre-decode logic to determine if a dynamic instruction is a branch. This is relatively straightforward to achieve with a well designed instruction set: a small number of bits at the start of the instruction can be used to check whether it is a control flow instruction. This pre-decode logic already ensures that only branch instructions need to access the dynamic branch predictor, and saves a significant amount of power for little cost.

Where this pre-decode logic exists, the hint-bits discussed in the previous chapter need only be included within branch instructions: once an instruction has been identified as a branch, the same logic can enable the previously described static prediction logic. Branch instructions invariably contain redundant bits that can be used for the two hint-bits required by the combined algorithm, and so this implementation is feasible. Since Wattch implements this pre-decode logic, the hint-bits are only required in branch instructions for this study, and it is suggested that this should be the case for almost all modern architectures. A sketch of such a pre-decode test is given below.
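As an illustration, a pre-decode test of this kind amounts to masking a few high-order bits of the fetched word. The mask and marker value below are placeholders and do not correspond to the real PISA opcode encoding.

/* Pre-decode: identify control flow instructions cheaply (sketch). */
#include <stdint.h>

#define CTRL_FLOW_MASK  0xFC000000u  /* hypothetical opcode bit field */
#define CTRL_FLOW_VALUE 0x10000000u  /* hypothetical branch marker    */

static int is_branch(uint32_t fetch_word)
{
    /* Only instructions passing this cheap test go on to touch the
       dynamic branch predictor (or, now, the hint-bit logic). */
    return (fetch_word & CTRL_FLOW_MASK) == CTRL_FLOW_VALUE;
}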

Logic Modifications

The modifications required are:

1. Fetch Stage: If the existing pre-decode logic recognises a branch instruction then the hint-bits must be checked to see if they are set (non-zero), and if so, appropriate actions must be taken. If they are zero then nothing should happen, and the branch predictor should be accessed as normal.

2. Execution Stage: If the hint-bits are set (non-zero) then the dynamic predictor must not be updated. If the branch is found to be mispredicted, but the delay region is in use, no instruction flush or recovery should occur (as the delay region instructions must be executed and there is no misspeculation).

The above logic will ensure that power is saved and that the appropriate be- haviours are modelled to generate results. The chosen configuration of the hint-bits, detailed below, was used to simplify logic in the instruction fetch stage.

• 0 0 → Access the dynamic predictor as normal

• 0 1 → Do not access the dynamic predictor. Predict not taken (leave the program counter alone)

• 1 0 → Do not access the dynamic predictor. Predict not taken (leave the program counter alone) and utilise the scheduled delay region

• 1 1 → Do not access the dynamic predictor. Predict taken by enabling an adder for a fixed mask of instruction bits (the target offset) against the current PC, then set the PC to this value. It is important to note that, as described in the previous chapter, only one branch format will be used to hint a taken branch. This ensures the simplicity of the logic required in the IF stage to create the target address.

From this it can be seen that, in both the instruction fetch and execution stage, a dynamic branch predictor access/update can be blocked immediately by simply checking to see if either of the hint-bits is set. The dynamic branch predictor unit logic could be bypassed using a single OR gate whose inputs are the two hint-bits. This approach can be used in both pipeline stages. See Figures 6.3 and 6.4.

Figure 6.3: A logical representation of the IF stage hint-bits showing '1,1'

If the hints are set to '1,1' then a predict taken branch has been encountered (shown in Figure 6.3). This requires that the adder logic is enabled in the IF stage to create the new program counter value: NEW PC = CURRENT PC + SIGN-EXTENDED OFFSET. If the hints are set to '1,0' then logic in the EXE stage must disable flushing of the pipeline (but still set the correct program counter value in the case of a taken branch).

Figure 6.4: A logical representation of the EXE stage hint-bits showing '1,0'

The descriptions for the newly introduced logic are deliberately generic. This is because any implementation of the hint-bit logic is entirely dependent on the design of the processor into which it is being placed. Since the aim of this project is to demonstrate the general applicability of the combined algorithm, the logic descriptions have been kept as general as possible. The actual location of the hint-bits, within an instruction format, is detailed in the PISA section of this chapter.
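Putting the encoding table and the two figures together, the fetch-stage behaviour can be sketched as the C function below. The bit positions, the +8 sequential step for 64-bit PISA instructions and the function names are assumptions for illustration; only the four encodings and the program counter arithmetic come from the descriptions above.

/* Fetch-stage handling of the two hint-bits (illustrative only). */
#include <stdint.h>

enum { HINT_NONE = 0, HINT_NOT_TAKEN = 1, HINT_DELAY = 2, HINT_TAKEN = 3 };

static uint32_t fetch_next_pc(uint32_t pc, uint64_t branch_insn,
                              int *use_dynamic_predictor)
{
    unsigned hints = (unsigned)(branch_insn >> 62);  /* two MSBs (assumed) */

    /* Either hint bit set bypasses the predictor: the single OR gate. */
    *use_dynamic_predictor = (hints == HINT_NONE);

    if (hints == HINT_TAKEN) {
        /* '1,1': enable the adder on the fixed offset field:
           NEW PC = CURRENT PC + SIGN-EXTENDED OFFSET. */
        int32_t offset = (int16_t)(branch_insn & 0xFFFFu);
        return pc + (uint32_t)offset;
    }

    /* '0,0' (predictor decides), '0,1' (hinted not taken) and '1,0'
       (not taken plus delay region) all fall through sequentially. */
    return pc + 8;  /* next sequential 64-bit PISA instruction (assumed) */
}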

Power Consumption of Introduced Logic

Given the aim of modelling the energy efficiency of the combined algorithm, it is prudent to make sure that any newly introduced logic is also included in the power model. From the previous subsection it can be seen that any new logic introduced into an implementation of the combined-algorithm hint-bits will be very limited. Nevertheless, the following additional logic was included in the power model (for ALL experiments):

• 1× 32-bit adder

• 1× 32-bit sign extender

• 1× 32-bit word-line

• 8× logic gates

• 20× short wire capacitance

These components were created by simply replicating the power cost of their identical counterparts contained elsewhere in the Wattch power model, and locating them in the appropriate places where the logic was introduced. The actual logic was not explicitly introduced, as Wattch is a behavioural model and as such is written in higher level terms than logic gates.

When taken in the context of the entire processor, the introduced logic is tiny, particularly as the bulk of it will not be switching during most processor cycles, but only on a hinted taken branch; even then, the dynamic power cost will be low.

6.2.3 Profiling Enhancement

To work correctly, the combined algorithm requires profiling information. As discussed in the previous chapter, the combined algorithm will hint a static branch direction prediction if the dynamic bias of the branch is greater than, or equal to, the dynamic branch predictor's accuracy for that branch. The combined algorithm itself is implemented in the Hatfield Assembly Code Analyser (HACA), but this algorithm requires certain raw data to be logged during a program's execution in the form of a reduced trace. The raw information required, for a given dynamic branch occurrence, is as follows (one possible record layout is sketched after the list):

1. The static identifier of the branch (used to map back the trace entry to a static assembly branch)

2. The dynamic prediction for the branch (taken or not-taken)

3. The actual dynamic behaviour of the branch (taken or not-taken)

4. Additionally, if a misprediction is detected, the number of instructions squashed. This can be used to determine the size of the delay region for a particular branch.
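One plausible in-memory layout for such a reduced-trace record is shown below; the field names and widths are hypothetical, chosen only to mirror the four items just listed.

/* A reduced-trace record for one dynamic branch occurrence (sketch). */
#include <stdint.h>

typedef struct {
    uint16_t branch_id;   /* static identifier, maps back to assembly    */
    uint8_t  predicted;   /* 'T' or 'N': the dynamic prediction          */
    uint8_t  actual;      /* 'T' or 'N': the actual behaviour            */
    uint32_t squashed;    /* instructions squashed on a misprediction,
                             0 otherwise; sizes the delay region         */
} trace_entry_t;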

Before the profiling stage, HACA (detailed in 6.3) gives each static assembly branch instruction a unique identifier. This is possible by virtue of PISA (also detailed in a later section) being a 64-bit instruction set with 16 redundant bits per instruction to facilitate instruction set research (Figure 6.5). The new profiling code in Wattch then decodes this unique identifier and uses it to create the profiling entry for each dynamic branch occurrence. All profiling is performed in the ‘Commit’ pipeline stage. Below is a very small excerpt from an example program trace:

UNIQUE ID | BEHAVIOUR
---------------------
  ...
  14695     T N
  22134     T T
  13264     N N
  ...

Here, in the first entry, the predicted behaviour was taken, but the actual behaviour was not-taken. HACA then uses this trace, from an entire program's execution, to build up the required statistics for each branch.

6.2.4 Instruction Set (PISA)

The Wattch simulator, in this investigation, uses the Portable Instruction Set Architecture (PISA). PISA is a relatively simple MIPS-like instruction set that extends many of the features of Patterson and Hennessy's DLX instruction set [84] [40]. The PISA instruction set is useful because it is essentially a 32-bit instruction set that is implemented in a 64-bit form. In other words, while PISA could be implemented in 32 bits, it was implemented as a 64-bit format so that it could include many redundant bits for use in instruction set research. These redundant bits can be easily set in the assembly code of a program by adding switches to the end of an instruction; the PISA assembler then sets the appropriate bits in the machine code. A researcher can then add code into the simulator that makes use of these redundant bits. This technique was used in the development of HWattch and is detailed in the following subsections.

Instruction Format

The 64-bit PISA instruction format is shown in Figure 6.5.

Figure 6.5: The PISA instruction format

Figure 6.5 shows three instruction formats. The bottom format is the r-type format and is the most common. The absolute format is used for branching to absolute addresses and replaces bits 0 to 26 with an absolute memory address. The immediate format replaces bits 0 to 15 with an immediate value. Each field shown has the following function:

rs – Source register one

rt – Source register two

rd – Destination register

ru – Used for shift instructions (shift amount)

opcode – Extended opcode section allowing for the introduction of new operations

immediate – In immediate type instructions, an immediate value is placed here for use by the instruction

absolute jump – Absolute jump instructions make use of a 26-bit value that is left shifted by two bits and combined with the most significant 4 bits of the current program counter, to form an effective memory address (a field-extraction sketch follows this list)

unused – A 16-bit unused space that allows for the introduction of new information in all instructions. These unused bits can be set using special syntax in the assembly code, which the PISA assembler then uses to set the specified bits in a given instruction.
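To make the field widths concrete, the fragment below extracts the immediate and absolute jump fields in C. Only the widths and the address formation rule (a 26-bit value shifted left twice and combined with the top four bits of the program counter) come from the description above; the exact bit positions within the 64-bit word are assumptions.

/* Field extraction for the PISA formats described above (sketch). */
#include <stdint.h>

/* The 16-bit immediate occupies the low bits of the immediate format. */
static uint32_t pisa_immediate(uint64_t insn)
{
    return (uint32_t)(insn & 0xFFFFu);
}

/* Effective target of an absolute jump: 26-bit field, shifted left
   twice, combined with the most significant 4 bits of the PC. */
static uint32_t pisa_abs_target(uint64_t insn, uint32_t pc)
{
    uint32_t field = (uint32_t)(insn & 0x03FFFFFFu);
    return (pc & 0xF0000000u) | (field << 2);
}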

Branch Instructions

The PISA instruction set is fully featured and contains instructions for performing all required actions in a compiled program. However, the implementation of the combined algorithm is concerned only with branch instructions. The branch instructions of the PISA instruction set are shown in Table 6.1.

Table 6.1: Static and dynamic branch occurrence for each PISA branch instruction across the whole EEMBC benchmark suite

Instruction   Branch Type          Static       Dynamic      Branch Format
                                   Occurrence   Occurrence
j             Unconditional Abs.   10.21%       17.31%       Absolute Format
jal           Unconditional Abs.   33.95%        3.58%       Absolute Format
jr            Unconditional Reg.   15.54%        3.55%       Register Format
jalr          Unconditional Reg.    2.32%        0.04%       Register Format
beq           Cond. Offset         18.18%       20.23%       Immediate Format
bne           Cond. Offset         16.46%       50.09%       Immediate Format
blez          Cond. Offset          1.52%        2.58%       Immediate Format
bgtz          Cond. Offset          0.27%        1.04%       Immediate Format
bltz          Cond. Offset          0.48%        0.39%       Immediate Format
bgez          Cond. Offset          1.06%        1.19%       Immediate Format

From this table it can be seen that the vast majority of dynamic branches are of the immediate format (around 75% of dynamic branches). This is somewhat intuitive, but is particularly useful for the combined algorithm, which will only predict 'taken' for a single branch type.

Unconditional absolute branches (around 21% of dynamic branches) perform no evaluation and branch immediately to the specified absolute target. Unconditional register branches (around 4% of dynamic branches) perform no evaluation and branch immediately to a value contained within a target register. Conditional offset branches (around 75% of dynamic branches) compare the values in two specified registers and then, depending on the relevant evaluation criteria, branch to the relative target offset specified within the instruction.

Hint-Bits

For the implementation of the combined algorithm in this section, two hint-bits were included as the two most significant bits in each branch instruction. This is illustrated in Figure 6.6.

Figure 6.6: The location of the two hint-bits within the branch instruction format

The hint-bits make use of two of the redundant bits included in the PISA instruction set. Although sufficient space existed in the opcodes of branches, it was decided to use the redundant bits to simplify the implementation in Wattch. The hint-bits are stored in a consistent bit position in all branch instructions (the two highest order bits). However, two consistent bits in the branch opcode could have been used instead.

Branch Labelling

HACA assigns each static branch instruction a unique identifier, referred to as a label, so that the dynamic branch trace can be easily related back to the static branches (for use when hinting). These static identifiers are stored in the unused bits of the PISA instruction format. The maximum number of branches that could be hinted (in an individual executable) in this study is thus 2^16.
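A sketch of how such a label might be packed into, and recovered from, the redundant bits follows. The field position is an assumption for illustration; the true layout is an implementation detail of the PISA tools.

/* Packing a 16-bit HACA label into the redundant bits (sketch). */
#include <stdint.h>

#define LABEL_SHIFT 32u                       /* assumed field position */
#define LABEL_MASK  (0xFFFFull << LABEL_SHIFT)

static uint64_t set_branch_label(uint64_t insn, uint16_t label)
{
    return (insn & ~LABEL_MASK) | ((uint64_t)label << LABEL_SHIFT);
}

static uint16_t get_branch_label(uint64_t insn)
{
    return (uint16_t)((insn >> LABEL_SHIFT) & 0xFFFFu);
}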

6.2.5 Compiler (Custom GCC)

Version 2.7.3 of the GNU Compiler Collection was distributed with the SimpleScalar project to compile for the PISA instruction set. Unfortunately, an up-to-date version of this was not available, and the distributed version required many modifications to compile and work on modern distributions of Linux. One criticism of the compiler version used is that it is quite old; stable versions of GCC are currently considered to be around 3.4, and GCC 4 has also been released. However, the benchmarks used for these experiments are all written in C only, and there have been few advances in C optimisation techniques in GCC since version 2.7.3, as most effort has been directed at increasing C++ support. This means the compiled benchmarks will still give a contemporary representation of optimised C code.

6.3 Scheduler and Static Prediction Assigner (HACA)

The Hatfield Assembly Code Analyser (HACA) is the central tool used to control the application of the combined algorithm to a code base (in this case EEMBC). The tool was written from scratch and implements various features. HACA interfaces with HWattch and interprets the extremely long dynamic branch traces to decide how to set the static prediction hints. HACA also schedules into the local delay region where appropriate, by implementing the algorithm described in Chapter 5.

6.3.1 Combined Algorithm: Practical Implementation

The most useful way to understand the features of HACA is to describe how it practically controls and applies the combined algorithm to a code base. The stages below are an approximate description of each stage in the actual application of the combined algorithm to the EEMBC suite. This can be compared and contrasted with the more formal description of the algorithm in the previous chapter.

1. Compile the benchmark to assembly code

2. Parse all of the assembly code and assign each static branch instruction a unique label

3. Assemble and link the assembly code into a PISA executable

4. Using HWattch, generate the ideal (free branch prediction – the same branch predictor without any power cost) results for this benchmark using the appropriate architecture configuration

5. Using HWattch, profile the benchmark (against a 'training' dataset) to generate the dynamic branch trace described in the previous section

59 6. Statically schedule into the local delay region for unconditional absolute branches where possible, and set the appropriate hint-bits

7. Parse the dynamic branch trace and assign a static prediction where the bias of a given static branch is greater than the prediction accuracy for that branch. Set the appropriate hint-bits, but ONLY if the branch is an immediate type branch (offset branch). The introduced hardware logic can only create a target address from this kind of branch format.

8. Assemble and link the scheduled and hinted assembly code into a new PISA executable

9. Using HWattch, execute the new benchmark (against a different 'testing' dataset) and produce results for the combined algorithm version of the executable

It is important to note the use of two datasets for producing the static predictions and then testing the related power/performance change. Two datasets were used in order to reduce the possibility of overfitting to a single dataset. However, it is interesting to note that the investigation revealed that biased branches, and difficult-to-predict branches, tend to perform almost exactly the same, on aggregate, over all datasets within the program's domain.

When actually carried out, the above steps were performed across the entire EEMBC suite at each stage, rather than on one benchmark at a time. The entire process takes around sixty minutes to complete on a 2.2 GHz AMD Athlon64 Dual Core Gentoo Linux system with 2 GB of RAM.

Assigning the Static Prediction

The static branch instruction bias levels are generated by creating a table of all of the static branch identifiers for that program, and then parsing the trace file for that program. As each entry in the trace file is parsed, the dynamic branch label is used to index the table of static branches and increment either the taken or not-taken counter. Additionally, the predicted behaviour for the given dynamic occurrence is compared to the actual behaviour; if the two behaviours differ then a misprediction has occurred, and the misprediction counter is incremented at the appropriate index position in the table. From this information, the bias and misprediction rate can be computed for each branch. For a given static branch:

Branch Bias = (Dynamic Taken Total / Total Dynamic Occurrences) if Dynamic Taken Total > Dynamic Not-Taken Total; otherwise (Dynamic Not-Taken Total / Total Dynamic Occurrences)

Misprediction Rate = Mispredictions Total / Total Dynamic Occurrences

Assignment Condition – If Branch Bias > (1 − Misprediction Rate), then set the hint-bits to reflect the biased direction (a branch can only be hinted taken if it is an immediate (offset) format branch)
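In C, the per-branch decision reduces to a few lines over the counters accumulated from the trace. The struct and function names are illustrative; the comparison itself is exactly the assignment condition above.

/* Deciding whether a branch receives a static hint (sketch). */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t taken, not_taken, mispredicted;
    bool     immediate_format;  /* only these may be hinted taken */
} branch_stats_t;

/* Returns -1 for no hint, 0 for hint not-taken, 1 for hint taken. */
static int choose_static_hint(const branch_stats_t *s)
{
    uint64_t total = s->taken + s->not_taken;
    if (total == 0)
        return -1;

    bool   taken_biased = s->taken > s->not_taken;
    double bias = (double)(taken_biased ? s->taken : s->not_taken)
                  / (double)total;
    double accuracy = 1.0 - (double)s->mispredicted / (double)total;

    if (bias <= accuracy)
        return -1;  /* the dynamic predictor already does at least as well */
    if (taken_biased && !s->immediate_format)
        return -1;  /* no target adder exists for this branch format       */
    return taken_biased ? 1 : 0;
}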

The assigning of a static prediction was only enabled for branch instructions not using the local delay region. From Table 6.1, in the previous section, this means that approximately 80% of the dynamic branch stream can be assigned a static prediction. However, as a static predict taken can only be applied to offset type branches, approximately 75% of dynamic branches are candidates for being assigned a static prediction of taken or not-taken.

Utilising the Local Delay Region

The previous chapter described the general way to schedule into the local delay region. This technique is only used for unconditional absolute branches. These types of branch instruction have no dependency on preceding instructions, as both the direction and the target are included in the instruction itself, making them ideal candidates for local delay region scheduling. Additionally, in modern well optimised code, it is difficult to find more than one branch independent instruction from the same basic block (except for unconditional absolute branches). Ignoring other branch formats for local delay region scheduling means the static prediction through profiling can coexist more simply in the implementation. Table 6.1, from the previous section, shows that unconditional absolute branches account for approximately 21% of the dynamic stream, and all are candidates for local delay region scheduling.

Local delay region scheduling works by filling the delay slots behind a branch instruction with branch independent instructions from the same basic block (moving instructions from before the branch into the delay region after it). This procedure takes place statically. When a delay region can be utilised, the hint-bits (discussed previously) are set to reflect this. In these experiments, the local delay region is only scheduled for unconditional absolute branches. This means that finding branch independent instructions is easy, as no instructions from the same basic block are used to compute the behaviour of the branch; filling the delay region simply requires selecting the required number of instructions and moving them into the delay region without changing their order (so that program semantics are unaffected).

A problem arises when deciding how many instructions to fill the delay region with. On a scalar processor the delay region is of a fixed size, so there is a predetermined number of slots available for instructions. However, superscalar processors use dynamic scheduling and thus have a variable delay region; traditionally, this has made utilising the delay region with any form of scheduling difficult.

The solution used to overcome the variable delay region size is profiling. The profiling information generated by HWattch includes the number of instructions squashed on a mispredicted branch. This can be used to build three useful values for each static branch: the minimum, mean and maximum number of delay region slots. The minimum calculated from profiling is accurate for code profiled with a dataset that 'touches' all areas of the code, but it leaves open the possibility that, under certain execution circumstances, a smaller minimum number of delay slots exists. This makes scheduling into the delay region difficult if semantics are to be maintained. Instead, a more sensible value is the theoretical minimum number of delay slots for the given architecture, determined by the number of pipeline stages before branch resolution and the instruction fetch/decode width.

To schedule into the local delay region on a superscalar processor, instructions must only be scheduled up to the minimum delay region size; if instructions were scheduled past this size, it is possible that they would not be executed under certain circumstances. However, when scheduling only up to the minimum number of delay slots, a problem still exists in stopping the processor from continuing to fetch instructions from the fall-through path after the delay region instructions (since the branch may be taken, and those instructions should then not be executed). A simple solution is to insert 'NOP' instructions up to the maximum size of the delay region. This means that there could be significant code expansion (but not necessarily wasted executed instructions). The ideal solution, if the architecture supports it, is to insert a 'halt' (or similar) instruction that stops the processor from continuing to fetch instructions; the correct program counter is then used after branch resolution, the correct sequence of instructions is fetched, and execution continues. This is the method used to generate the results in this dissertation. An alternative solution involves scheduling up to the minimum delay region size and using some additional hint-bits to mark instructions in the delay region as 'not to be flushed'. However, this requires additional hint-bits, and also hardware modification to the misprediction recovery logic; the requirement of additional hardware means that this technique has not been investigated further in this dissertation.

Finally, even with a superscalar solution, there remains the problem of wasted power/clock cycles when scheduling up to the minimum delay region size while the average delay region size is much bigger than the minimum: cycles would commonly be wasted, since only a small number of instructions would occupy a commonly much larger delay region. Since HWattch is, potentially, a superscalar architecture, the local delay region was scheduled up to the minimum number of slots, but only if the average size of the delay region (from profiling) was at most one instruction larger, as sketched below. This ensures that program semantics are maintained and that a minimum number of cycles are wasted. For low issue width superscalar processors, this method is very effective.
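The scheduling rule just described can be stated compactly in C. The names are illustrative; the policy itself (fill the theoretical minimum number of slots, but only when the profiled mean exceeds that minimum by at most one) is taken directly from the text.

/* Variable delay region scheduling decision (sketch). */
typedef struct {
    int min_slots;   /* theoretical minimum for the architecture   */
    int mean_slots;  /* mean squashed instructions, from profiling */
    int max_slots;   /* largest delay region observed in the trace */
} delay_profile_t;

static int delay_slots_to_schedule(const delay_profile_t *p)
{
    if (p->mean_slots > p->min_slots + 1)
        return 0;          /* too many cycles would go unfilled      */
    return p->min_slots;   /* fill these slots, then pad towards
                              max_slots with a halt (or NOPs) so the
                              fall-through path is never wrongly used */
}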

6.4 EEMBC

EEMBC, the Embedded Microprocessor Benchmark Consortium [64], provides a suite of benchmarks designed to model the activities that embedded processors in a variety of different devices spend their time executing. Energy-efficient branch prediction, although relevant to the high performance arena, is of the highest importance to high-end embedded processors. It is for this reason that the EEMBC suite was chosen as the evaluation benchmark for the combined algorithm. EEMBC is already used by several well known processor designers and chip manufacturers, including ARM, IBM, Freescale, MIPS and Transmeta [64]. The suite used to generate the results in this investigation is Version 1.1, released in 2004.

6.4.1 Sub-Suites and Benchmarks

EEMBC is divided into several sub-suites: automotive, consumer, networking, office and telecom. Each sub-suite is designed to simulate the characteristics of a different kind of application. Thirty-seven benchmarks were used in total for the results shown in this dissertation, and these are listed in full in Table 6.2. The four fields in Table 6.2 are:

Mnemonic – The short version of the name of the benchmark. These names are used extensively in the results presented in the next chapter.

Sub-Suite – The sub-suite that each benchmark is a member of.

Relative Size – The relative dynamic size, as a percentage, of each benchmark compared to the longest. In other words, the longest benchmark (rgbcmy) is 100% and every other benchmark is measured as a percentage of this. The size of the benchmark is the number of instructions executed at runtime until completion.

Description – A short description of the main task that the benchmark carries out during execution.

It can be seen that the largest benchmarks are within the consumer sub-suite, which is designed to simulate the load on consumer electronics. This is expected given the high bias toward graphical algorithms used here; these tend to be highly iterative and run for longer periods of time. Other notably lengthy benchmarks include the networking 'qos' benchmark, which uses a complex algorithm to measure the quality of service over a network hierarchy.

Although the benchmarks appear to be highly specialised, as with all benchmark suites they tend to represent an overall whole that is not dissimilar from other benchmarking systems such as SPEC [96]. However, using these benchmarks provides an interesting insight into how the combined algorithm performs in different target applications.

Table 6.2: The full EEMBC benchmark suite with descriptions

Mnemonic       Sub-Suite    Relative Size %   Description
a2time01       Automotive     0.10            Angle-To-Time Conversion
aifftr01       Automotive     3.32            Fast Fourier Transform
aifirf01       Automotive     0.12            Finite Impulse Response (FIR) Filter
aiifft01       Automotive     3.48            Inverse Fast Fourier Transform
basefp01       Automotive     0.11            Basic Floating-Point
bitmnp01       Automotive     0.50            Bit Manipulation
cacheb01       Automotive     0.19            Cache Buster
canrdr01       Automotive     0.97            Response to Remote Request
idctrn01       Automotive     0.42            Inverse Discrete Cosine Transform
iirflt01       Automotive     0.12            Low-Pass Filter (IIR) and DSP functions
matrix01       Automotive     2.92            Matrix Math
pntrch01       Automotive     0.27            Pointer Chasing
puwmod01       Automotive     1.51            Pulse-Width Modulation
rspeed01       Automotive     0.36            Road Speed Calculation
tblook01       Automotive     0.24            Table Lookup
ttsprk01       Automotive     0.69            Tooth-To-Spark
cjpeg          Consumer      13.17            JPEG Compression
djpeg          Consumer      89.26            JPEG Decompression
rgbcmy         Consumer     100.00            RGB to CMY
rgbhpg         Consumer      10.73            Grayscale Image Filter
rgbyiq         Consumer      83.00            RGB to YIQ
ip pktcheck    Network       10.70            IP Packet Check
ip reassembly  Network       13.42            IP Reassembly
nat            Network       14.31            Network Address Translation
ospfv2         Network        1.90            Open Shortest Path First
qos            Network       46.33            Quality of Service
routelookup    Network        5.07            Route Lookup
tcp            Network        0.24            TCP-BM
bezier01       Office         2.83            Bezier Curve Interpolation
dither01       Office        17.13            Floyd-Steinberg Grayscale Dithering
rotate01       Office         4.91            Bitmap Rotation
text01         Office        11.09            Text Parsing
autcor00       Telecom        0.35            Autocorrelation
conven00       Telecom        0.81            Convolutional Encoder
fbital00       Telecom        1.00            Bit Allocation
fft00          Telecom        0.28            FFT/IFFT
viterb00       Telecom        0.63            Viterbi Decoder

6.4.2 Bespoke Build System for the Combined Algorithm

EEMBC is distributed with its own cross-platform build system. However, this system does not allow the user to compile the benchmarks down to assembly code, and then from the assembly code into binary form. This is a problem because the combined algorithm, implemented in HACA, requires access to the assembly code of the benchmarks in order to modify them and apply the hinting/scheduling techniques; the assembly code then needs to be built into binary form with this hinting and scheduling information contained within it. A bespoke build system was developed for the EEMBC suite to this end. This was a non-trivial task that required an intimate understanding and interpretation of the existing Makefiles for EEMBC. Additionally, any internal shared source files within EEMBC were copied into each individual benchmark that used them, to enable non-conflicting hinting of shared source code.

6.5 Summary

This chapter has introduced the experimental toolkit that will be used to evaluate the effectiveness of the combined algorithm. The central tool is the parameterisable superscalar processor simulator 'HWattch'. HWattch has a five stage MIPS-like pipeline, but has a variable branch misprediction penalty (set via a runtime parameter). The HWattch simulator is essentially the Wattch power-aware processor simulator (itself a version of SimpleScalar), modified to include some simple control logic that interprets two new hint-bits contained within branch instructions. This logic can bypass the dynamic predictor for specified branches, and also produce a target address for a certain class of branches. There are two hint-bits inside each branch instruction. These hint-bits, when set by the compiler for a particular branch, allow the processor to use a locally scheduled delay region, to assume taken, to assume not taken or, finally, to use the dynamic branch predictor as normal.

The HACA tool is used to profile and set these hint-bits in the compiled assembly code for those branches where either the local delay region can be scheduled, or where a static prediction is possible (when the branch is more biased than its profiled dynamic prediction accuracy).

The EEMBC benchmark suite was chosen because the target application area is the embedded, and hence power-sensitive, market. The EEMBC suite represents the most relevant cross-section of benchmark activities likely to be carried out in such environments.

The next chapter uses all of the tools discussed in this chapter to test the combined algorithm against several baseline configurations of the HWattch simulator. These baselines are based on example embedded configurations, and the results are presented and discussed separately for each baseline model.

Chapter 7

Simulations and Results

7.1 Introduction

This chapter presents the main body of simulation results generated by using the combined algorithm to reduce dynamic branch predictor accesses in the EEMBC benchmarks. All of the simulation results were generated using the tools and methods described in the previous chapter. Several baseline models are discussed and described in detail before the results are presented in the following three sections. Each result set is followed by a brief discussion, and then the chapter closes with a final analysis of the results, and the questions arising from them.

7.2 The Baseline Models

When assessing the effectiveness of an alteration to a compilation process and processor architecture, it is necessary to establish some model of comparison. With the evaluation of the combined algorithm, the assessment criteria are: energy/power saving, change in execution time and the reduction in the number of accesses made to the dynamic predictor. However, any results generated are indicative only for the given baseline model. To give a more varied context to the combined algorithm, several baseline models have been chosen. These are based around a number of real embedded architectures and are described in detail on the following pages. Although the baseline models are based around real architectures, it must be noted that this is only achieved by the adjustment of the available parameters in HWattch. Not all aspects of the architectures can be completely represented here, but the baselines give an appropriate indication. Each baseline model was used to generate a separate set of results.

7.2.1 The Branch Predictor

The dynamic branch predictor type used in these experiments is a local predictor. Although the combined algorithm essentially treats the dynamic branch predictor as a black box, some discussion and justification of this predictor type is required. The efficacy of the directional hinting in the combined algorithm will likely be closely linked to the accuracy of the dynamic branch predictor it is used with; the less accurate the branch predictor, the greater the number of branches with a bias higher than the predictor's accuracy. The converse is also true. To achieve a suitably rigorous and suitably sceptical result, it is sensible to use the most accurate predictor available; a large local branch predictor will saturate at 97.1% correct, which is higher than other common predictor configurations.

The traditional argument against local prediction, despite its accuracy, is that it often requires large history tables to achieve this accuracy and also, originally, requires two sequential table lookups (see Chapter Two). The rebuttal to these arguments is that the additional size required for local prediction is no longer a particular concern, given the increased number of transistors available in modern processors. Sequential table lookups are undeniably slower, but this is not the problem it seems: a fast implementation can be achieved by using a separate bimodal counter array for each instruction fetched, so that the second array access can proceed in parallel with instruction fetch. Additionally, this is likely not even necessary, as the clock speeds of embedded processors are considerably lower than those of the high performance market (allowing plenty of time for two sequential lookups).

A salient alternative choice of dynamic branch predictor type would be a gshare predictor, which implements a variant of global prediction. Large gshare predictors are not quite as accurate as their local counterparts, but table lookups can easily be made in parallel. Another noteworthy point about a gshare predictor is that, because it is a global predictor, it may be expected that the combined algorithm, by removing certain predictor updates, could affect the accuracy of the dynamic predictor on the remaining branches by withholding required path information. This was not noticed to any significant extent in our previous simulations [44]. If a global prediction paradigm had been chosen, and experimentation showed prediction problems, the hardware could be modified to allow the updating of a history register. The history register is a very small part of the direction predictor, and updating it would have minimal impact on the efficacy of the combined algorithm's power saving, while removing the impact on the prediction accuracy of a global predictor.

7.2.2 Scalar Processor

The scalar processor baseline is designed to give HWattch the behaviours and costs associated with a common contemporary embedded processor. By far the most common embedded processor designs currently in use are variants of the ARM architecture. ARM processors are found in almost all varieties of electronic devices, particularly in consumer electronics such as PDAs, mobile telephones and media players. The configuration described here is shown in Table 7.1.

There is a large variety of ARM cores, and it is difficult to match exactly the architecture of any one in particular. However, the general functional units and the size of the caches can be modelled; this will give a strong indication of how effective the combined algorithm will be for such existing embedded architectures. The specific architectures on which this baseline is based are the ARM9E, ARM10E, ARM11 and embedded PowerPC implementations. All of these are scalar uniprocessors (with the exception of the ARM11, which can be used in a multicore configuration). The number of pipeline stages varies from three to eight, and all have relatively small caches compared to the performance market. The baseline in Table 7.1 most closely models the ARM11 architecture.

It should be noted that HWattch, being essentially Wattch/SimpleScalar, models an out-of-order issue superscalar architecture. The parameters specified here force that architecture to behave as in-order and single issue. In terms of simulated behaviour, this will behave exactly as an in-order processor would be expected to. However, the Wattch power metrics take into account certain structures whose simulated power consumption is not completely affected by the forced specification of an in-order processor. This means that the Wattch power model still models the power consumption of scheduling logic, even though this logic is never used since the processor is scalar. The effects of this are discussed in the results analysis.

7.2.3 Multiple Instruction Issue Processor

The vast majority of embedded processors are scalar by design. Superscalar processors would appear, in theory, to offer a more efficient utilisation of resources by issuing instructions into each kind of functional unit in parallel. However, the scheduling and instruction issue logic required to achieve this has traditionally consumed too much power to make superscalar processors a viable option.

Contemporary fabrication technologies have, as discussed in Chapter Two, reduced the feature size of the logic components used to construct CMOS circuits. With this, the power overhead of introducing new logic into a processor has decreased, and many new embedded architectures are proposing Multiple Instruction Issue (MII) techniques to achieve higher performance. Typically, these take the form of Very Long Instruction Word (VLIW) architectures, but some designs use dynamic scheduling techniques too. The HWattch simulator models a superscalar architecture, which means that this study utilises dynamic scheduling. Although results generated from this method will not be directly applicable to VLIW processors, they will still be useful, since the logical behaviour of the branch predictor in the instruction fetch stage remains the same. Proportionally, this will still give a useful indication of the throughput effects of the combined algorithm, and how much power can be saved globally with its inclusion.

Table 7.1: Scalar Processor Baseline Configuration

HWattch Parameter                        Value
Instruction Fetch Size                   1
Delay Slots/Mispredict Penalty           2
Branch Predictor                         2-Level Local - 1024 Table Entries
Branch Target Buffer Size                512 Entries, 4-Way Set Associative
Average Global Power Consumption of BP   9.9%
Decode Width                             1
Out-of-order Issue                       No
Commit Width                             1
L1 Data Cache Size                       32 Entry, each entry 2-Way Set Associative
L1 Instruction Cache Size                32 Entry, Direct Mapped
L1 Hit Latency                           1 Cycle
L2 Data Cache Size                       64 Entry, each entry 2-Way Set Associative
L2 Instruction Cache Size                64 Entry, each entry 2-Way Set Associative
L2 Latency                               6 Cycles
Memory Latency                           18 Cycles for first block, 2 thereafter
Integer ALUs                             1
Integer Multipliers/Dividers             1
Floating Point ALUs                      1
Floating Point Multipliers/Dividers      1
Clock Gating Regime                      Non-Ideal

It is important to note that the delay region in these baseline models is now variable. Consequently, the local delay region scheduler now employs the technique for scheduling a variable size delay region.

Two Instruction Issue

The two instruction issue baseline model is intended to represent the next generation of embedded high performance processors. It is a modified version of the scalar baseline model with a two instruction issue width and dynamic instruction scheduling onto an appropriate number of functional units. This architecture is shown in Table 7.2. Some current architectures, such as certain implementations of the ARM11, already have multiple functional units and a multiple instruction issue width; this baseline model can be considered similar to these architectures.

Sixteen Instruction Issue

It is possible that the combined algorithm could reduce the prediction accuracy of branches that are assigned a static 'hint' prediction. Although it is currently uncommon to see any embedded processors with an issue width of sixteen instructions, it is interesting to see how the combined algorithm changes the behaviour of such high issue widths. It is for this reason that the architecture shown in Table 7.3 is used to generate results.

7.3 Preamble To Results

The previous two chapters have discussed both the abstract concept of the proposed combined algorithm, and also the specific experimental implementation and method. During these chapters, a great deal of information has been covered. This section discusses the appropriate metrics used to display the experimental results, and brings forward some important details that should be borne in mind before examining the results.

7.3.1 Metrics

The combined scheduling, profiling and hinting algorithm could potentially affect: dynamic branch predictor accesses, dynamic prediction accuracy and power consumption. To facilitate the monitoring of this behaviour, the following metrics are used in the results:

Benchmark – The mnemonic for the particular benchmark.

Table 7.2: 2-Way Issue Processor Baseline Configuration

HWattch Parameter                        Value
Instruction Fetch Size                   2
Delay Region/Mispredict Penalty          Variable (Average = 2)
Branch Predictor                         2-Level Local - 1024 Table Entries
Branch Target Buffer Size                512 Entries, 4-Way Set Associative
Average Global Power Consumption of BP   8.8%
Decode Width                             2
Out-of-order Issue                       Yes
Commit Width                             2
Register Update Unit Size                8
Load/Store Queue Size                    4
L1 Data Cache Size                       32 Entry, each entry 4-Way Set Associative
L1 Instruction Cache Size                32 Entry, Direct Mapped
L1 Hit Latency                           1 Cycle
L2 Data Cache Size                       64 Entry, each entry 4-Way Set Associative
L2 Instruction Cache Size                64 Entry, each entry 4-Way Set Associative
L2 Latency                               6 Cycles
Memory Latency                           18 Cycles for first block, 2 thereafter
Integer ALUs                             2
Integer Multipliers/Dividers             1
Floating Point ALUs                      2
Floating Point Multipliers/Dividers      1
Clock Gating Regime                      Non-Ideal

Table 7.3: 16-Way Issue Processor Baseline Configuration

HWattch Parameter                        Value
Instruction Fetch Size                   16
Delay Region/Mispredict Penalty          Variable (Average = 8)
Branch Predictor                         2-Level Local - 1024 Table Entries
Branch Target Buffer Size                512 Entries, 4-Way Set Associative
Average Global Power Consumption of BP   6.7%
Decode Width                             16
Out-of-order Issue                       Yes
Commit Width                             16
Register Update Unit Size                16
Load/Store Queue Size                    16
L1 Data Cache Size                       32 Entry, each entry 4-Way Set Associative
L1 Instruction Cache Size                32 Entry, Direct Mapped
L1 Hit Latency                           1 Cycle
L2 Data Cache Size                       64 Entry, each entry 4-Way Set Associative
L2 Instruction Cache Size                64 Entry, each entry 4-Way Set Associative
L2 Latency                               6 Cycles
Memory Latency                           18 Cycles for first block, 2 thereafter
Integer ALUs                             8
Integer Multipliers/Dividers             8
Floating Point ALUs                      8
Floating Point Multipliers/Dividers      8
Clock Gating Regime                      Non-Ideal

Dynamic Branches – The percentage of the dynamic stream accounted for by branch instructions.

Hint Rate – The percentage of static branch instructions that are hinted to use either the local delay region or a static branch prediction.

Access Reduction – The percentage reduction in accesses to the dynamic branch predictor as a result of the hint-bits inserted into branch instructions by the combined algorithm. This metric is the central goal of the combined algorithm. It can be regarded as the central representative of the activity factor in the general equation for power saving. Reducing the number of accesses here will reduce the activity factor.

Stream Change – The percentage change in the number of instructions executed during a given benchmark's execution. This is affected by any change in prediction accuracy caused by the combined algorithm. The stream change percentage can be either positive (the number of instructions executed has increased) or negative (the number of instructions executed has decreased). A positive change indicates a relative reduction in branch prediction accuracy over the baseline; a negative change indicates the converse.

Cycles Change – The percentage change in the number of cycles during a given benchmark’s execution. The implication of this change is almost identical to the previous metric.

BP Power Saving – The percentage saving, in terms of power, within the branch prediction unit. It would be logical for a reduction in accesses, and thus switching activity, to reduce the amount of power consumed by the branch prediction unit.

Global Power Saving (Central Power Metric)

The main metric used to model the effects of the combined algorithm on power consumption, and energy efficiency, is 'Average Percentage Power Saving Per Committed Instruction'. This metric is labelled simply as 'Power Saving' in the results.

Deciding upon an appropriate metric for global power saving is difficult. It is easy to measure the power saved within the dynamic branch predictor unit, but this does not give an indication of how much power has been saved globally. Additionally, it does not take into account any power consumed by any change in the duration of program execution; if the combined algorithm decreases branch prediction accuracy, more instructions will be executed and global power consumption could increase. 'Average Percentage Power Saving Per Committed Instruction' is calculated as follows:

Power Saving = Total Processor Power Reduction / Number of Committed Instructions

where 'Total Processor Power' is the total power consumed by the processor over the benchmark's entire execution. This metric gives a direct representation of the global power saving resulting from the use of the combined algorithm. Additionally, it implicitly takes into account any change in branch prediction accuracy; if overall prediction accuracy is decreased, the power saving will decrease proportionally.

The power saving metric used here was chosen over the 'et2' metric because 'et2' is mostly useful for drawing an abstraction over the power consumption of an actual hardware implementation so that it can be compared to other implementations. This is not necessary in the case of HWattch, as the constituent power metrics are already completely abstract. Additionally, it is argued that 'et2' gives an unfair, quadratic bias to any change in execution time of a given hardware modification [46].

Free/Ideal Branch Prediction

In many of the graphs in the results section, a metric called 'Free/Ideal' branch prediction is shown. This metric represents the ceiling power saving that could be achieved with the combined algorithm: free branch prediction is the baseline dynamic predictor, but with zero power cost. It was debated whether to show perfect branch prediction with zero power consumption instead, but giving a value for perfect prediction would be too unrealistic for an algorithm that aims only to affect the power consumption of a branch predictor.

7.3.2 Calculation of Averages and ‘Weighted Averages’

Each section of results consists of a breakdown for each benchmark, together with the average values for each sub-suite. These averages are not a simple mean of all of the results; rather, they take into account the length, in executed instructions, of each benchmark within a sub-suite. This is achieved by totalling and then averaging the relevant values across the entire sub-suite. For instance, when calculating the average dynamic predictor access reduction for a sub-suite, the total number of dynamic accesses removed for that sub-suite is calculated, and then divided by the total number of accesses made for the sub-suite without the algorithm. The average therefore reflects the relative size, in execution time, of each benchmark. Hence, the term ‘weighted average’ is used to indicate that these average values are not a crude mean. The raw data from which all of the average values were calculated, along with an explanation of their presentation, is available in Appendix D.
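As a minimal sketch, assuming per-benchmark totals are available from the raw data in Appendix D, the weighted average access reduction for one sub-suite could be computed as follows (illustrative code, not the actual analysis scripts):

    /* Totals are accumulated across every benchmark in the sub-suite
     * before dividing, so benchmarks that execute for longer contribute
     * proportionally more to the average. */
    double weighted_access_reduction(const long long removed[],
                                     const long long baseline[],
                                     int n_benchmarks)
    {
        long long total_removed = 0, total_baseline = 0;
        for (int i = 0; i < n_benchmarks; i++) {
            total_removed  += removed[i];   /* accesses avoided */
            total_baseline += baseline[i];  /* accesses without the algorithm */
        }
        return 100.0 * (double)total_removed / (double)total_baseline;
    }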

7.3.3 Important Summary Notes

• All benchmarks shown in the following results were executed to completion.

• The combined algorithm can potentially operate on a total of around 82% of static branches. This is because the hint-bits are only set where the hardware logic can handle a statically predicted behaviour (as discussed in the previous chapter). Dynamically, the percentage of potentially affected branches is around 96% (with around 18% handled by local delay region scheduling).

• All benchmarks were profiled individually.

• All results are in percentages.

• Details of the breakdown of the use of the local delay region versus static hinting are not shown because, in the first two baseline models, the utilisation of the local delay region is almost consistently 100% of applicable branch types, and the average proportion of branches for which it is applicable remains almost constant at around 17%.

• All of the raw data from which the averages and percentages were calculated is available in Appendix D.

7.4 Scalar Processor Results

The results in this section demonstrate the effects of using the combined algorithm on the EEMBC suite and scalar processor baseline.

7.4.1 Benchmark Breakdown

Table 7.4 shows the resulting change in the described metrics for each benchmark on the scalar baseline processor after applying the combined algorithm to the entire EEMBC suite. The graph in Figure 7.1 shows the global power saving metric compared with free/ideal branch prediction. It can be seen from Table 7.4 that the most successful results, in terms of power saving, are in the benchmarks canrdr01, matrix01, pntrch01, puwmod01, rspeed01, ttsprk01, rgbcmy, ip pktcheck, ospfv2, routelookup and rotate01; global processor power savings are as high as 13.34%. These benchmarks are also generally associated with the highest dynamic predictor access reductions – as high as 85.78% – and with a significant decrease in the number of instructions executed and the number of cycles that the simulation lasted. Some benchmarks show a very slight increase in the number of instructions/cycles executed. The hint rate for the successful benchmarks does not correlate with the power saving efficacy. The ‘successful’ benchmarks are spread relatively evenly across all of the EEMBC sub-suites.

Table 7.4: Benchmark breakdown results for scalar baseline processor
All results are percentages

Benchmark      Dynamic   Hint   Access     Stream  Cycles  BP Power  Global
               Branches  Rate   Reduction  Change  Change  Saving    Saving
a2time01       19.45     27.03   9.43       1.42    1.58   12.66      1.32
aifftr01       12.59     28.08  61.26      -0.52   -0.17   58.49      3.36
aifirf01       15.79     26.53  23.15      -2.10   -1.77   25.13      2.06
aiifft01       12.61     26.42  57.76       4.80    4.30   56.32      3.37
basefp01       18.05     22.94  24.66      -0.04    0.02   28.75      2.94
bitmnp01       16.00     19.83  35.46       0.11   -0.07   34.52      2.52
cacheb01       14.22     18.18  56.26      -0.54   -0.18   55.46      3.42
canrdr01       22.79     19.41  69.27      -1.23   -0.51   66.11      9.85
idctrn01       12.02     16.25  37.72      -0.24   -0.29   39.79      3.46
iirflt01       16.47     37.08  15.38      -0.04   -0.05   17.98      1.86
matrix01       18.14     17.82  54.52      -8.52   -9.28   52.34      6.62
pntrch01       22.61     26.87  63.27      -0.80   -0.53   62.20      8.02
puwmod01       22.96     14.12  70.19      -1.26   -0.49   66.83     10.05
rspeed01       22.70     22.02  67.64      -4.02   -6.65   62.39     11.51
tblook01       21.43     30.11  20.03       0.64    1.11   23.00      2.51
ttsprk01       22.67     11.80  67.65      -1.10   -0.16   64.62      9.10
cjpeg          10.40     30.84  85.78       0.00   -0.12   87.75      5.93
djpeg          14.43     32.34  49.44       0.00   -0.04   52.20      4.65
rgbcmy         22.22     24.84  69.68       0.44    0.13   66.43      8.77
rgbhpg         12.78     20.00  74.78      -0.55   -0.21   72.23      4.74
rgbyiq         15.56     23.38  31.55       0.00    0.00   33.83      2.90
ip pktcheck    22.54      6.20  70.28       0.35   -0.06   67.48      9.38
ip reassembly  20.18     12.34  65.83       0.27   -0.04   62.21      5.62
nat            23.26     12.84  68.02       0.31    0.03   64.46      7.76
ospfv2         28.51      6.29  83.32       0.23    0.00   80.72     13.34
qos            20.92     12.67  35.29      -0.05   -0.01   36.29      3.93
routelookup    23.73      6.03  69.58      -1.05   -0.34   67.07     10.17
tcp            14.29     12.20  10.55      -0.77   -0.94    9.25      0.70
bezier01       10.95     17.93  57.49      -0.46   -0.17   53.42      2.46
dither01       16.12     18.54  28.93       0.00   -0.02   31.93      3.31
rotate01       25.69     44.29  66.32      -0.28   -1.28   63.49      9.75
text01         24.31     33.24  46.65      -0.40   -0.70   42.40      5.67
autcor00        9.21     20.28  65.69      -0.10   -0.12   68.34      4.79
conven00       16.21     18.75  25.45       0.00    0.00   25.87      2.07
fbital00       15.06     20.65  50.19       0.02    0.06   51.33      5.50
fft00          12.10     22.94  46.55       0.14    0.01   44.86      2.57
viterb00        9.49     18.24  49.45      -0.01   -0.01   58.35      3.55

Figure 7.1: Scalar baseline global power savings (%) compared with ideal (free) prediction

The degree to which the ideal power saving is achieved varies greatly between benchmarks, in proportion to the reduction in dynamic predictor accesses. Importantly, every benchmark registers a power saving.

7.4.2 Averages

Table 7.5 shows the weighted averages of the previous scalar results for each of the EEMBC sub-suites. The graph in Figure 7.2 shows the same scalar averages compared with free/ideal branch prediction.

Table 7.5: Average benchmark results for scalar baseline processor
All results are percentages

Sub-suite    Dynamic   Hint   Access     Stream  Cycles  Global
             Branches  Rate   Reduction  Change  Change  Saving
Automotive   15.88     22.08  59.86      -0.44   -0.28   5.04
Consumer     16.19     30.65  61.17       0.1    -0.01   5.93
Network      21.97     10.57  58.18       0.08   -0.04   6.53
Office       18.76     30.88  45.62      -0.22   -0.45   4.88
Telecom      12.11     20.23  47.84      -0.01   -0.02   4.07
Overall      17.20     18.25  59.51       0.06   -0.05   5.92

Figure 7.2: Scalar baseline average global power savings (%) compared with ideal (free) prediction

The sub-suite averages show that the combined algorithm is, overall, consistently effective in terms of power saving. The most positively affected sub-suite is ‘Network’; however, this is also the sub-suite that consumes the most branch prediction power. The sub-suite that comes closest to achieving its ideal is ‘Consumer’. Overall, the power savings were closely matched between the sub-suites, at around 6%, and missed the ideal by around 3.5%. In no case is there a power increase. The use of the local delay region is near 100% for those branches where it is applicable.

7.5 Two Instruction Issue Processor Results

The results in this section demonstrate the effects of using the combined algorithm on the EEMBC suite and two-way instruction issue processor baseline.

7.5.1 Benchmark Breakdown

Table 7.6 shows the resulting change in the described metrics for each benchmark on the two-way issue baseline processor after applying the combined algorithm to the entire EEMBC suite. The graph in Figure 7.3 shows the global power saving metric compared with free/ideal branch prediction. In Table 7.6 it can be seen that the most successful benchmarks are similar to those in the scalar baseline results. The best performing benchmarks, in terms of global power saving, are: canrdr01, matrix01, pntrch01, puwmod01, rspeed01, ttsprk01, rgbcmy, ip pktcheck, ospfv2, routelookup and rotate01. The highest power saving is lower this time, however, at 12.42%, and this pattern holds generally across the power saving values. Internal power saving in the branch predictor remains similar, but global power savings generally come much closer to the ideal values of free branch prediction. The hint rate of static branch instructions is almost exactly the same. An important shift in the results for this MII processor is that the change in the number of instructions/cycles executed is now more pronounced, and almost all benchmarks show a decrease in the number of instructions/cycles executed.

7.5.2 Averages

Table 7.7 shows the weighted averages of the previous two-way issue results for each of the EEMBC sub-suites. The graph in Figure 7.4 shows the same two-way issue averages compared with free/ideal branch prediction. The most salient fact from the average results is that the combined algorithm’s power saving is much closer to the ideal of free branch prediction on this baseline model: overall, there is only around a 2.5% difference between the actual saving of 6.25% and the ideal of 8.6%. The change in the number of instructions/cycles executed is more favourable on this baseline model, with all sub-suites showing fewer cycles during execution. The efficacy of the local delay region scheduling is not shown in the averages, but remains close to 100% for those branches where it is applicable.

Table 7.6: Benchmark breakdown results for two-way issue baseline processor
All results are percentages

Benchmark      Dynamic   Hint   Access     Stream  Cycles  BP Power  Global
               Branches  Rate   Reduction  Change  Change  Saving    Saving
a2time01       19.88     27.03   9.12       1.03    0.76   11.92      1.58
aifftr01       13.64     28.08  63.14       1.05    0.03   59.49      3.72
aifirf01       16.22     26.53  20.16      -0.12   -0.06   24.37      1.79
aiifft01       13.70     26.42  64.30       0.20   -0.95   60.22      3.73
basefp01       18.48     22.94  22.89       0.11    0.16   28.19      2.10
bitmnp01       16.38     19.83  38.13      -0.29   -0.86   35.88      2.68
cacheb01       15.31     18.18  52.45      -0.32   -0.93   53.45      3.88
canrdr01       23.92     19.41  64.02      -1.56   -4.89   62.30     10.82
idctrn01       12.27     16.25  36.86      -0.35   -0.60   39.24      4.10
iirflt01       16.74     37.08  15.21      -1.60   -1.44   17.67      1.64
matrix01       19.08     17.82  50.79      -0.01   -0.73   53.43      6.13
pntrch01       23.33     26.87  59.62       0.13   -0.68   60.35      7.40
puwmod01       24.09     14.12  64.44      -1.02   -3.70   63.07     10.44
rspeed01       23.78     22.02  59.77      -0.39   -1.69   59.41      8.66
tblook01       22.42     30.11  24.17      -0.61   -0.75   24.11      3.29
ttsprk01       23.79     11.80  61.79      -0.35   -1.72   61.24      8.97
cjpeg          11.05     31.09  86.30      -0.22   -0.77   87.97      5.60
djpeg          15.23     32.44  51.40      -0.34   -0.86   52.63      4.78
rgbcmy         23.22     24.84  72.87       0.62   -1.50   68.09      9.13
rgbhpg         13.79     20.00  69.23      -0.13   -0.87   69.38      4.97
rgbyiq         16.69     23.38  36.46      -0.49   -1.16   35.72      3.79
ip pktcheck    23.39      6.20  69.08       1.22   -0.82   64.11      8.50
ip reassembly  21.10     12.25  69.41       0.94   -1.03   64.03      6.34
nat            23.87     12.88  71.47       0.43   -1.67   66.19      8.29
ospfv2         28.41      6.29  83.69      -2.43   -4.22   80.37     12.42
qos            23.36     12.72  55.84      -0.02   -0.08   53.69      5.91
routelookup    24.78      5.93  59.32      -0.30   -1.70   58.53      8.66
tcp            14.78     12.13  11.04      -0.66   -0.89    9.30      0.92
bezier01       12.16     17.93  56.14      -0.19   -0.79   53.06      3.18
dither01       16.54     18.54  29.69      -0.16   -0.36   32.33      3.17
rotate01       27.08     44.29  70.28      -1.22   -3.77   63.84     11.04
text01         24.84     33.24  46.19      -1.20   -3.51   40.51      6.62
autcor00        9.36     20.28  47.17      -0.12   -0.21   49.05      3.08
conven00       16.83     18.75  28.47      -0.47   -0.79   27.28      2.51
fbital00       15.19     20.65  49.96      -0.14   -0.21   51.20      4.81
fft00          12.72     22.94  39.88      -0.02   -0.57   36.56      2.49
viterb00        9.52     18.24  50.61      -0.13   -0.77   58.62      3.19

Figure 7.3: 2-way issue baseline global power savings (%) compared with ideal (free) prediction

Table 7.7: Average benchmark results for two-way issue baseline processor
All results are percentages

Sub-suite    Dynamic   Hint   Access     Stream  Cycles  Global
             Branches  Rate   Reduction  Change  Change  Saving
Automotive   16.95     22.08  59.61       0.11   -1.14   5.28
Consumer     17.23     30.80  63.56      -0.01   -1.1    6.17
Network      23.32     10.56  65.07       0.34   -0.89   7.19
Office       19.61     30.88  47.07      -0.6    -1.84   5.37
Telecom      12.11     20.23  44.62      -0.16   -0.45   3.39
Overall      18.29     18.28  62.64       0.01   -1.1    6.25

Figure 7.4: 2-way issue baseline average power savings (%) compared with ideal (free) prediction

7.6 Sixteen Instruction Issue Processor Results

The results in this section demonstrate the effects of using the combined algorithm on the EEMBC suite and sixteen-way instruction issue processor baseline.

Position Independent Code

The combined algorithm uses local delay region scheduling for unconditional absolute branches. Unfortunately, in superscalar processors, this can only be used with very small issue widths. This is because of the difficulty associated with scheduling into a variable delay region; the delay region can only be filled up to the minimum number of slots. On a high issue-width processor, such as sixteen instructions, the minimum delay region size is markedly lower than the average delay region size. This means that using local delay region scheduling will cause a high number of wasted clock cycles and actually results in more power consumption than just using the dynamic predictor. Hence, local delay region scheduling must be disabled for the sixteen instruction issue width experimental model.

The branches whose dynamic predictor accesses are reduced by local delay region scheduling account for around 30% of the dynamic instruction stream in the EEMBC benchmarks. Disabling this part of the combined algorithm immediately reduces its effectiveness by this amount. However, since the branch types that use the local delay region are unconditional absolute branches, there is a way around this problem. GCC, and other modern compilers, allow for the compilation of position independent code [74]. Used for library creation, this compilation technique specifically removes unconditional absolute branches by replacing them with one or more PC-offset branches. Offset branches are exactly the type that can be fully utilised by the profiling and hinting techniques in the combined algorithm.

The simulation results for the sixteen instruction issue processor were generated by compilation with the GCC position independent code technique. Although this seems to alter the terms on which the simulations are carried out, only a small alteration occurs: the branches normally scheduled by the local delay region simply become available to the profiling and hinting techniques instead.
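As a sketch of where the difference arises, consider the fragment below; the rewriting of absolute branches into PC-offset sequences happens entirely inside the compiler. With a standard GCC driver the technique is requested with the -fPIC flag (whether the custom PISA port of GCC uses the same flag name is an assumption here):

    /* Built normally, the call below may be lowered to an unconditional
     * absolute branch; built as position independent code it is lowered
     * to one or more PC-offset branches instead, which the profiling and
     * hinting techniques can then cover. */
    extern void helper(void);

    void caller(void)
    {
        helper();
    }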

7.6.1 Benchmark Breakdown

Table 7.8 shows the resulting change in the described metrics for each benchmark on the sixteen-way issue baseline processor after applying the combined algorithm to the entire EEMBC suite. The graph in Figure 7.5 shows the global power saving metric compared with free/ideal branch prediction. The results for the sixteen instruction issue processor, in Table 7.8 and Figure 7.5, show a similar pattern, in principle, to the previous two baseline architectures.

Table 7.8: Benchmark breakdown results for sixteen-way issue baseline processor
All results are percentages

Benchmark      Dynamic   Hint   Access     Stream  Cycles  BP Power  Global
               Branches  Rate   Reduction  Change  Change  Saving    Saving
a2time01       22.14     28.38  10.85      -1.06   -1.20   11.71      1.45
aifftr01       20.19     27.59  39.55      -1.06   -2.59   41.86      3.61
aifirf01       18.62     26.53  17.26      -0.57    0.47   23.41      0.45
aiifft01       20.48     25.91  44.18      -1.68   -1.13   41.60      2.35
basefp01       20.60     22.94  20.65       0.18    0.15   26.95      1.32
bitmnp01       20.93     19.55  31.78      -1.59   -1.08   28.85      1.69
cacheb01       21.43     17.65  35.51      -0.77   -1.81   39.07      2.88
canrdr01       33.29     18.99  40.97      -2.26   -5.04   42.95      7.04
idctrn01       13.37     16.97  35.79      -0.59   -0.39   39.42      2.08
iirflt01       18.86     37.50  12.33      -1.37   -0.76   16.43      1.55
matrix01       20.49     17.82  50.87       0.00   -0.52   53.70      4.45
pntrch01       29.20     26.37  44.49      -1.49   -2.90   48.28      5.52
puwmod01       33.63     13.73  41.44      -2.25   -5.39   43.35      7.35
rspeed01       32.75     20.83  22.05       0.81   -2.10   25.17      2.33
tblook01       29.30     30.11  17.74       0.60   -0.91   21.44      1.27
ttsprk01       33.24     11.48  40.06      -2.12   -4.77   42.09      6.67
cjpeg          13.53     30.97  83.57      -1.14   -1.91   85.80      5.19
djpeg          21.25     32.39  39.25       0.25   -1.98   46.37      3.70
rgbcmy         32.47     24.20  47.08      -3.34   -2.15   44.24      4.48
rgbhpg         20.97     19.26  45.43      -1.04   -3.09   50.12      4.54
rgbyiq         24.99     24.03  44.20      -1.75   -3.50   48.33      5.27
ip pktcheck    31.93      6.01  26.92       0.68   -2.84   29.51      3.82
ip reassembly  29.52     12.34  49.07      -3.23   -3.20   44.78      4.62
nat            32.23     12.84  48.93      -4.66   -3.17   45.21      5.29
ospfv2         34.88      6.19  64.26      -2.13   -2.00   64.78      7.67
qos            30.28     12.72  63.98      -1.07   -5.14   66.89      9.46
routelookup    34.63      5.84  29.33       0.65   -2.33   33.18      4.07
tcp            16.64     12.13  10.02      -0.96   -1.29    6.83      1.14
bezier01       18.15     17.24  38.97      -1.24   -2.88   39.27      3.69
dither01       20.09     18.54  29.55       0.05   -0.72   32.38      2.54
rotate01       36.95     44.29  70.81      -4.39   -7.18   67.04     11.58
text01         30.66     33.24  37.99      -2.65   -7.04   33.10      7.73
autcor00        9.43     19.58  78.55      -0.37   -0.33   84.88      2.93
conven00       21.78     20.14  37.68       0.19   -0.98   38.44      2.53
fbital00       16.02     20.00  47.47      -0.15   -0.31   49.79      3.07
fft00          15.73     22.35  27.32      -0.83   -0.40   23.77      1.12
viterb00       10.36     18.24  49.24      -0.13   -0.42   59.18      2.38

Figure 7.5: 16-way issue baseline global power savings (%) compared with ideal (free) prediction

The most successful power savings are, again, in the benchmarks canrdr01, matrix01, pntrch01, puwmod01, rspeed01, ttsprk01, rgbcmy, ip pktcheck, ospfv2, routelookup and rotate01. The highest power saving provides an interesting result: rotate01 registers a power saving of 11.58%. This is actually greater than the ‘ideal’ power saving result (due largely to the decrease in the number of execution cycles). The static hint rate has remained approximately the same as in the previous baseline models (see the use of position independent code above). As a consequence, the dynamic branch predictor access reduction is relatively similar. An important point to note is the general trend for benchmarks to register yet more significant instruction/cycle execution reductions.

7.6.2 Averages

Table 7.9 shows the weighted averages of the previous sixteen-way issue results for each of the EEMBC sub-suites. The graph in Figure 7.6 shows the same averages compared with free/ideal branch prediction.

Table 7.9: Average benchmark results for sixteen-way issue baseline processor
All results are percentages

Sub-suite    Dynamic   Hint   Access     Stream  Cycles  Power
             Branches  Rate   Reduction  Change  Change  Saving
Automotive   23.64     21.97  41.06      -1.31   -2.39   3.92
Consumer     24.63     30.70  48.03      -1.68   -2.38   4.54
Network      31.27     10.52  49.09      -1.82   -3.87   6.63
Office       25.35     30.76  43.57      -1.61   -3.85   5.59
Telecom      14.24     20.10  45.86      -0.23   -0.45   2.51
Overall      25.68     18.20  47.77      -1.68   -2.74   4.96

The first impression from examining the average results, shown in Table 7.9 and Figure 7.6, is that the sub-suites have again come closer to achieving the ideal power saving metric. The overall stream/cycle change of -1.68%/-2.74% is very significant and comes as something of a surprise; it was thought possible that any negative effects on accuracy might be amplified by such a high issue-width processor. Additionally, these results show that the use of position independent code is a suitable alternative, where necessary, to the use of local delay region scheduling.

7.7 Overall Analysis

All three sets of results for the combined algorithm are very encouraging. They show, under the circumstances of the three baseline models, that the combined algorithm will always save power. The amount of power saved can vary greatly between benchmarks, but the overall averages for each sub-suite consistently show significant power savings.

Figure 7.6: 16-way issue baseline average power savings compared with ideal (free) prediction

The overall success seen in the average results can be attributed to the high power savings in the benchmarks that execute for a greater number of instructions (for instance in the networking sub-suite). Indeed, longer-executing benchmarks tend to have many tightly iterating loops that are ideal candidates for dynamic prediction removal by the combined algorithm.

One of the first general trends that becomes apparent after examining the results for all three baseline models is that a given dynamic predictor access reduction level does not guarantee the same overall power saving across different benchmarks. This is because, for the branches that are removed/hinted by the combined algorithm, the basic block size varies. The relative power saving will be higher in programs where the basic blocks are smaller; a smaller average basic block size means that branch instructions account for a higher percentage of the dynamic instruction stream. This can be seen best in benchmarks like ‘cjpeg’ and ‘djpeg’, which have a very high dynamic access reduction but a low percentage of the dynamic stream accounted for by branch instructions. The result is a lower overall power saving compared to ‘less successful’ benchmarks such as ‘ospfv2’.

A surprising, and very important, trend in the results was quite unexpected: most benchmarks register a decrease in the number of instructions and cycles executed. As well as counting towards the overall power savings shown in the results, this behaviour is also significant for the performance market. In terms of power saving, a decrease in the size of the dynamic stream is the more significant, as it means that fewer instructions actually utilise the control logic inside the processor. However, the change in the number of cycles is also significant, as it affects the actual time for which the benchmark executes (even when not executing instructions, a processor will still consume power). The reason behind the reduction in the number of instructions/cycles executed is that the combined algorithm actually increases overall prediction accuracy by providing more accurate predictions for some difficult-to-predict branches.

The scalar baseline model shows an overall power saving of 5.92%, while the number of instructions/cycles executed remained relatively unchanged. This is an encouraging result in a processor where the ceiling value of branch prediction power is around 10% of global processor power. The dynamic branch predictor in the scalar model accounts for the highest proportion of the processor’s global power when compared to the two-way or sixteen-way issue baseline models. This would seem to indicate that the highest overall power savings were to be made here. However, this is not the case: the two-way issue baseline model registers a slightly higher average global power saving at 6.25%.

Although the dynamic branch predictor accounts for a smaller proportion of global power consumption in the two-way issue processor (due to the rest of the processor being ‘larger’), and the hint rate has remained relatively unchanged, the increased power savings can be attributed to the combined algorithm’s static hinting of dynamically difficult-to-predict branches. In a multiple instruction issue processor, mispredictions are very costly. The combined algorithm ameliorates this problem, and as such shows an increased reduction in dynamic predictor accesses and in the dynamic instruction stream. This results in an increased overall power saving when compared to the scalar baseline model.

Finally, the sixteen-way instruction issue model shows the most surprising result set. Initially, it was suspected that the combined algorithm might cause a very slight reduction in overall prediction accuracy during a program’s execution. It was expected that on scalar, and small issue, pipelines any slight decrease in accuracy would be offset by the power savings of the reduction in dynamic predictor accesses. It was therefore possible that the sixteen-way instruction issue processor would exacerbate any decrease in prediction accuracy through its huge misprediction penalty. In actuality, the results show that the combined algorithm actually improves prediction accuracy and delivers almost a 3% reduction in execution time (cycles). This result is significant for the performance market irrespective of any power saving results. The overall power saving is 4.96%; while this saving is lower than on the other baseline models, it is because the branch predictor accounts for a much lower proportion of overall power consumption, as the processor logic is vaster and more complex. Additionally, the dynamic predictor access reduction is much lower on this architecture. This is because the unhinted branches account for more dynamic predictor accesses on the sixteen-way baseline as a consequence of the misprediction window (up to 3 branches may needlessly access the dynamic predictor on a misprediction).

7.7.1 Results Summary

The results show a significant power saving on all architectures. The savings range, on average for each baseline, from 4.96% to 6.25%. The algorithm was marginally most successful on the two-way issue baseline model.

The slightly surprising behaviour shown in the results is that the combined algorithm has key performance implications for MII processors, resulting in an execution speed-up of almost 3% on the sixteen-way baseline model. The questions remaining after this analysis are:

• How does the profiling/hinting regime, using adaptive bias measurement, compare to a fixed bias level or compiler hinting heuristics?

• Since certain branch addresses are permanently removed from dynamic predictor access during a program’s execution, does this decrease aliasing in the predictor? As a consequence, can set associativity in the power-hungry BTB be reduced?

• Are these results applicable to a ‘real’ instruction set?

The next chapter is dedicated to exploring the above questions by empirical investigation and comparison to existing technologies.

Chapter 8

Comparisons and Enhancements

There are various comparisons and enhancements that appear logical after the previous experimentation with the combined scheduling and hinting algorithm. This chapter explores the most prominent of these modifications using further experimental results. Although relatively short, this chapter is important as it demonstrates the possible further gains available from ‘tweaking’ hardware elements, and how the ABBM system compares with more traditional schemes.

8.1 Comparison of ABBM with Fixed Bias Level and Compiler Heuristics

One of the key novelties of the combined algorithm is the use of adaptive branch bias measurement, used to decide whether to insert a static behaviour for a branch based on the dynamic predictor’s accuracy. The use of this method, and of profiling in general, raises two significant questions that have hitherto remained unanswered:

1. How does using an adaptive branch bias measurement, with profiled dynamic branch predictor behaviour, compare to the use of a simple fixed bias level for each branch?

2. How does using an ABBM compare to ‘intelligent’ compiler heuristics that determine likely biased branches?

The answers to these questions are linked. Question 1 can be answered by simply ‘turning off’ the adaptive branch bias measurement and replacing it with a single, fixed bias level. The results for this will also provide a rough approximation of the answer to question 2, because a compiler that uses heuristics to determine biased branches will inevitably approximate to a fixed bias level; it can only, by virtue of the absence of execution information, determine whether a branch is ‘likely very biased’.

The fixed bias level used to generate the results shown in this section is 85%, and the baseline model used is the scalar processor. This means that if a dynamic branch tended to one direction for at least 85% of its dynamic occurrences then it was assigned a static prediction hint in that direction. This bias level was chosen as it was thought to be high enough to minimise interference with the accuracy of the dynamic predictor, but not so high that too few branches would be found for hinting; a higher bias level was found to be too fine-grained to apply a significant number of hints.
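A minimal sketch of the two decision rules being compared is given below; the 0.85 threshold follows the fixed level described above, while the function names and fraction-based interface are illustrative rather than taken from HACA:

    /* bias     : fraction of dynamic occurrences taking the branch's
     *            majority direction (from profiling)
     * accuracy : profiled dynamic-predictor accuracy for that branch */
    int fixed_bias_hint(double bias)
    {
        return bias >= 0.85;      /* hint every branch over the fixed level */
    }

    int abbm_hint(double bias, double accuracy)
    {
        return bias >= accuracy;  /* hint only where a static prediction
                                     does no worse than the predictor */
    }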

8.1.1 Results and Analysis

Table 8.1 shows the average sub-suite results generated using the fixed bias level of 85%.

Table 8.1: Average benchmark results for fixed bias hinting
All results are percentages

Sub-suite    Dynamic   Hint   Access     Stream  Cycles  Power
             Branches  Rate   Reduction  Change  Change  Saving
Automotive   23.63     22.05  46.87       2.98    2.14   0.51
Consumer     24.63     28.17  51.63       2.17    2.81   0.06
Network      31.27      9.99  45.18      -1.16   -2.92   1.12
Office       25.35     28.30  53.94       1.00    0.14   1.51
Telecom      14.14     20.23  57.96       2.16    1.69   1.55
Overall      25.67     17.32  50.30       1.58    2.51   1.04

These results show that using a fixed bias level to assign static hints, in conjunction with a dynamic predictor, is probably not a sensible idea. Although dynamic predictor accesses are reduced, on average by 50%, there is a significant associated penalty: overall, the average number of instructions executed increases by 1.58%, and the number of cycles executed increases by 2.51%. In terms of performance these figures are extremely significant, and detrimental. The overall effect on power is a saving of about 1%, which is not particularly significant.

The reason for this ill-effect is that the set of static branches hinted using a fixed bias level differs significantly in content from the set chosen by the adaptive method. All branches that lie over the bias level are assigned a hint; however, many of these branches may well be almost perfectly predicted by the dynamic predictor, and thus the fixed bias will decrease prediction accuracy for those branches. This argues for a very high fixed bias, but then fewer and fewer branches are available for hinting. Additionally, the ABBM has the advantage of hinting difficult-to-predict branches. Although the fixed bias will

hint some branches that are heavily biased and difficult to predict, it will not hint branches that are not heavily biased but would still benefit from a static prediction.

The effects observed in this experiment also suggest that a similar result is likely to arise from the use of static analysis of the assembly code by a compiler. This will depend on the heuristic used, but it is very difficult to determine, from static code alone, which branches would be suitable for a static prediction.

These results show an overall power saving of 1%, but it is possible that this is an artefact of the power metrics derived from the Wattch simulator. It is suspected that Wattch does not assign a sufficiently significant portion of the energy model to static power dissipation during inactivity. This possible artefact does not affect the validity of the main body of results, because there the number of cycles actually decreased (in fact, those results could be even better when this artefact is considered). Such a high increase in the number of cycles executed should make a global power saving very difficult, even with a 50% reduction in accesses to the branch predictor; this would suggest that the results shown in the previous chapter could be even stronger, given the decrease in the number of executed instructions/cycles. It is, however, possible that the number of instructions executed is more significant for power than the number of cycles and, as this increase was only 1.58%, the savings in branch predictor power outweighed it. The overall impression from these results is that using the novel ABBM is far more beneficial than a system that uses a fixed bias level (or equivalent).

8.2 Reducing Set Associativity in the Branch Target Buffer

The results from the previous chapter show that, in general, almost 20% of static branches are assigned a hint. This means that around one in five branch addresses no longer access the branch target buffer. Additionally, the branches assigned a static hint are generally highly iterative branches that account for a significant portion of accesses to the dynamic predictor and have a strong effect on the dynamic stream. If these branches are not accessing the BTB then the number of collisions between branch addresses may decrease, and thus it may be possible to reduce the set associativity of the BTB. The results in this section show the effect of reducing the set associativity of the BTB in the 2-way issue baseline model from four to two. The EEMBC suite was scheduled and hinted as normal using the combined algorithm.
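To make the change concrete, the sketch below shows a BTB lookup with the associativity as a parameter; the structure, sizes and index/tag split are illustrative, not the simulator’s actual configuration:

    #define BTB_SETS     128
    #define BTB_MAX_WAYS 4

    struct btb_entry { unsigned tag; unsigned target; int valid; };

    /* ways = 4 in the baseline configuration, 2 in the reduced one. */
    unsigned btb_lookup(struct btb_entry btb[BTB_SETS][BTB_MAX_WAYS],
                        int ways, unsigned pc, int *hit)
    {
        unsigned set = (pc >> 2) % BTB_SETS;  /* word-aligned PCs assumed */
        unsigned tag = pc >> 2;
        for (int w = 0; w < ways; w++) {
            if (btb[set][w].valid && btb[set][w].tag == tag) {
                *hit = 1;
                return btb[set][w].target;
            }
        }
        *hit = 0;  /* fewer ways means more conflict misses land here */
        return 0;
    }

With fewer ways, distinct branch addresses mapping to the same set are more likely to evict one another; the hope is that the addresses permanently removed by hinting free enough capacity to compensate.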

8.2.1 Results and Analysis

The resulting change after altering the set associativity of the BTB can be seen in two result metrics: the change in the number of instructions executed (was prediction accuracy affected?) and the resulting power saving. The effect on these two variables is observed centrally in the form of the change in the number of executed instructions. The results show the change from the standard application of the combined algorithm; that is, all results are measured against the baseline that has already had the combined algorithm applied. Figure 8.1 shows the change in the number of executed instructions. Figure 8.2 shows the change in global power saving (a negative power saving is a power consumption increase).

Figure 8.1: Average change in the dynamic instruction stream after resizing the BTB from four-way to two-way set associativity

The most important results to examine first are the changes in the number of instructions executed for each sub-suite. It can be seen that, generally speaking, the reduction in set associativity has had little impact on dynamic prediction accuracy, and thus on the number of instructions executed. However, there is one exception: the Telecom sub-suite, for which there is a major increase in the number of instructions executed. This is significant, as it shows that it is possible for a negative impact to occur as a result of the reduced accuracy of the target address predictor.

The results for global power saving reflect the changes in the instruction stream. Power savings can be seen, with the overall average being slightly under 0.4%. It is significant to note that for the Telecom sub-suite there is an increase in global power consumption as a result of decreasing the associativity of the BTB. Although these results show that, overall, additional power can be saved by altering the configuration of the BTB, they also demonstrate that doing so removes the ‘always-a-win’ nature of the combined algorithm and means that it is possible for the problems to outweigh the benefits. Whether making these alterations to the BTB will result in an additional power saving depends entirely on which static branches are assigned hints.

Figure 8.2: Additional power saving after resizing the BTB

It is likely that altering the hardware is not sensible, for two reasons: firstly, it is possible for a negative result to occur when the combined algorithm is used; secondly, once such alterations to the hardware are made, the use of the combined algorithm becomes mandatory for all software. One usual advantage of the combined algorithm is that it is relatively light in hardware implementation and does not necessarily have to be used. Making the hardware dependent on its use removes this advantage.

8.3 Summary

This chapter, although relatively small, has explored two significant questions arising from the main body of results shown in the previous chapter. The first question was concerned with whether or not the ABBM is better than a fixed-bias hint assignment method. The results show that, in terms of energy, the ABBM gave much more favourable results; in terms of performance, the use of a fixed bias level was clearly detrimental. It is likely that, in some architectures, this performance impact could dramatically increase global processor power consumption as well.

The second question addressed was whether the BTB could be reduced in size/associativity, since the combined algorithm significantly and permanently cuts the number of branch addresses entering the BTB. The results show that it is possible to reduce from four-way to two-way set associativity without having a significant effect on the number of instructions executed (which is central to not causing power incursions). However, the Telecom sub-suite registered a significant increase in the number of instructions executed. This means that a hardware designer needs to

decide whether the meagre additional power savings of reducing set associativity are worth the risk of increasing execution time (and also power consumption).

The next, and final, chapter discusses in depth the findings of the experimentation chapters and the investigation in general.

Chapter 9

Conclusion and Discussion

The previous two chapters have presented and discussed the results of using the combined algorithm on a simulated architecture and several baseline models. These results were then compared with the success of other methods and their applicability to real architectures. This chapter offers a summary of the key novelties presented in this work, the criticisms that could be levelled against it, and where the results lie in the broader research area. Finally, the conclusions drawn from these experiments are stated with the aim of leading into the future work.

9.1 Thesis Summary

This dissertation has proposed, and investigated, an approach to increasing the energy efficiency of a dynamic branch predictor in a modern embedded processor. The approach combines several traditional methods (local delay region scheduling, profiling and hint bits), but the combination is novel. The central aim of the combined algorithm is to reduce the dynamic branch predictor activity factor (by reducing the number of accesses to it) without increasing the execution time of the program by negatively affecting the dynamic branch predictor’s accuracy. This is achieved by bypassing the dynamic predictor for certain selected branch instructions. The local delay region can easily be used for unconditional absolute branches, since they will always be taken to a fixed target, and profiling can be used to generate static prediction data for offset branches (provided they share a single branch format).

The profiling mechanism uses an adaptive system of measurement to decide whether or not to assign a static prediction to a branch, by comparing its dynamic bias to the dynamic branch predictor’s accuracy for that branch. For branches more biased than their dynamic prediction accuracy, a static prediction is assigned. This ensures minimum impact on prediction accuracy, which is itself crucial for

global processor energy efficiency. In fact, an increase in prediction accuracy was measured.

The combined algorithm is supported by simple hardware modifications in the instruction fetch and execution pipeline stages, where the hint bits are used to prevent a predictor access from occurring. In the course of this research, results have shown that around 63% of dynamic branch predictor accesses can be avoided on accurate (large local [PAg]) branch predictors, with negligible impact on execution time. In many instances, execution time is even reduced by the static prediction of dynamically difficult-to-predict branches. On a high-powered embedded processor, around a 6.2% global power saving could be expected. This saving can be higher when the dynamic predictor is less accurate, as more branches will be hinted.

9.1.1 Key Novelties and Contributions

Although some of the methods used in this project already exist, there are several key novelties presented in this thesis:

Combining Techniques – Static and dynamic techniques can be further combined into a comprehensive system. Delay region scheduling has not previously been used in conjunction with a dynamic predictor in this way, and this combination is therefore a novel approach.

Adaptive Branch Bias Measurement – Assigning static branch predictions has, until now, been conducted using either compiler heuristics or profiling with a fixed bias level. In both situations, there is no guarantee as to how the assigned static prediction will affect the accuracy of that branch’s prediction. Measuring the branch bias adaptively ensures that a static prediction is only assigned for a branch where it will not adversely affect that branch’s prediction accuracy. This technique is a significant and novel contribution.

Power Saving Hint Bit Logic – Although the use of hint-bits is by no means new, using them to control the switching activity of a branch predictor has not been fully researched. When used in the proposed novel way, simple logic switches the datapath of the branch predictor control logic on or off using the information in two simple hint-bits. For hinted taken branches, the proposition is essentially to include a very simple ‘speculative’ decode of a hinted taken branch instruction in the instruction fetch pipeline stage.
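The decode itself can be very small. The sketch below assumes one possible two-bit encoding; the actual encoding belongs to the modified PISA instruction set and is not reproduced here:

    /* Assumed two-bit hint encoding (illustrative only). */
    enum hint {
        HINT_NONE      = 0,  /* no hint: access the dynamic predictor */
        HINT_TAKEN     = 1,  /* statically predicted taken: skip it   */
        HINT_NOT_TAKEN = 2,  /* statically predicted not taken        */
        HINT_DELAY     = 3   /* local delay region used: skip it      */
    };

    /* Gate the predictor datapath: only unhinted branches probe it. */
    int access_dynamic_predictor(enum hint h)
    {
        return h == HINT_NONE;
    }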

9.2 Generalisation

The previous two chapters presented results for the combined algorithm using various baseline configurations. However, these results do appear tied to those architectures.

A logical question from any processor designer is: “how much power can be saved on my architecture using this algorithm?” This section aims to generalise the power saving potential of the combined algorithm into an approximate equation. It should be noted, however, that the sheer number and complexity of the variables involved make it difficult to give a precise estimate, and the equation should be considered indicative only. Equation 9.1 shows the relationship between global power saving and the effectiveness of the algorithm.

G ≈ R × P (9.1)

Where:

G = Global Power Saving: the power saved across the entire processor by the application of the combined algorithm to the hardware and compilation process.

P = Branch predictor power consumption as a fraction of global power: the proportion of global processor power accounted for by the branch predictor unit. This is affected by both software and hardware. In the hardware, the relative size of the branch predictor compared to the rest of the processor is key; in the software, the average number of branch instructions in the dynamic stream affects how often the branch predictor is accessed. Smaller basic blocks mean more frequent branch predictor accesses.

R = Fractional reduction in dynamic branch predictor accesses as a result of the combined algorithm: the number of dynamic accesses avoided is decided by the number of dynamic branches that are assigned a static hint to use either the local delay region or an assumed direction. This is, in turn, decided by the number of branches/branch types available for hinting in a given instruction set, determined by the dynamic total of unconditional absolute branches and offset-type branches. The proportion of these branches that are assigned a hint depends upon how biased each branch is compared to the accuracy of the dynamic predictor used in the existing architecture.

The availability of suitable instructions in the instruction set of the experimental architecture used in this project was around 70% of the dynamic stream. From this, using an accurate local predictor, around 65% of all dynamic branch predictor accesses were avoided through scheduling/hinting. Depending on the accuracy of the dynamic branch predictor in use, this figure can vary between avoiding 60% and 70% of all dynamic branch predictor accesses; a less accurate predictor results in a higher proportion of instructions that can be assigned hints, as their bias is comparatively greater than the prediction accuracy in more cases.

Hence, if we assume that most instruction sets allow, on average, for 70% of all dynamic branches to be subjected to the use of the combined algorithm, and this

results in, on average, an overall reduction of 65% in dynamic predictor accesses, then:

G ≈ 0.65 × P (9.2)

If, for example, the dynamic branch predictor accounts for around 10% of global power consumption, then the combined algorithm could save around 6.5% of global processor power dissipation (Equation 9.2). This is both interesting and significant. The factors determining how effective the combined algorithm will be are highly variable and cannot easily be estimated; but as a rough guide, across the various baseline models tested, this equation gives an engineer a good idea of the potential power savings.
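As a worked sketch of Equation 9.1 (names are illustrative):

    /* G ≈ R × P: estimated fraction of global processor power saved. */
    double estimated_global_saving(double access_reduction_R,
                                   double predictor_power_fraction_P)
    {
        return access_reduction_R * predictor_power_fraction_P;
    }

    /* estimated_global_saving(0.65, 0.10) evaluates to 0.065, i.e. the
     * ~6.5% global saving of the example above. */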

9.3 Critique

The combined algorithm discussed and presented in this dissertation has several aspects, in both its hardware and software implementation, that invite discussion. This section aims to address these criticisms and counter them, either with an explanation of why the particular aspect is not necessarily the problem it appears to be, or by demonstrating how the algorithm can be modified to work around the particular problem.

9.3.1 Local Delay Region

Local delay region scheduling can attract a great deal of consternation from veteran compiler designers. This is generally because of the following criticisms:

1. With well optimised assembly code, it is difficult to find instructions in a basic block that do not affect the outcome of the branch direction.

2. To make matters worse, modern processors have many pipeline stages before branch resolution. This means that there are many instruction delay slots to fill.

3. Superscalar processors do not have a fixed delay region due to dynamic instruction scheduling, and as such the number of delay slots is dynamically context sensitive.

The counter argument to these criticisms lies in the way local delay region scheduling is used by the combined algorithm in the implementation specified. In this thesis, the local delay region is used only for certain branches; it is no longer burdened with being the sole method of delay region resolution. Furthermore, in this research, the local delay region is used only for unconditional absolute branches.

This type of branch is not dependent on any of the preceding instructions in the basic block. As such, it is possible to find local candidate instructions for scheduling; the number of candidate instructions, in this situation, is limited only by the size of the basic block being scheduled.

Superscalar processors do present a distinct problem for local delay region scheduling. However, this is not (always) the problem that it seems to be. Most embedded processors are not superscalar, and since the target of this algorithm is the embedded market, the algorithm is well suited. Furthermore, particularly in superscalar processors with a relatively small issue width, the dynamic scheduling algorithm can be configured such that there is always a minimum number of delay slots (and, in many processors, this is already the case). Most branches have a delay region which, through profiling, can be observed almost always to conform to the minimum delay region size. As a result, instructions can be scheduled into the delay region up to the minimum number of slots without a negative effect on performance. In order to maintain program semantics, the delay region, in this situation, must end with a HLT or be padded with NOPs in case the delay region in a particular context is larger than the minimum size, thereby stopping the processor from executing instructions along a speculated path.
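A sketch of this scheduling policy, with illustrative types (the real implementation lives in the HACA scheduler and is not reproduced here):

    struct insn;                   /* opaque instruction handle    */
    struct insn *make_nop(void);   /* hypothetical NOP constructor */

    /* Fill the delay region of an unconditional absolute branch up to
     * the guaranteed minimum number of slots, padding with NOPs when
     * too few independent local candidates exist. */
    int schedule_delay_region(struct insn *candidates[], int n_candidates,
                              struct insn *slots[], int min_slots)
    {
        int filled = 0;
        while (filled < min_slots && filled < n_candidates) {
            slots[filled] = candidates[filled];
            filled++;
        }
        while (filled < min_slots)
            slots[filled++] = make_nop();
        return filled;
    }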

9.3.2 Hint Bits

Introducing new bits into an instruction set/format may cause concern among architecture designers. The combined algorithm requires two additional bits in all branch instructions. Although there is relatively little opposition to using redundant bits in branch instructions, one typical criticism remains:

• The use of two hint bits in only branch instructions requires, and assumes, that predecode logic exists in the instruction fetch stage to discover branch instructions. How can this be justified?

The response is relatively straightforward: the combined algorithm, through its hint bits, is focused on reducing the power consumption of a dynamic branch predictor. In a power-sensitive architecture with a dynamic branch predictor, the first thing that should be done is to include predecode logic to disable branch predictor accesses for non-branch instructions; otherwise every fetched instruction performs a branch predictor access, which makes no sense for a power-sensitive architecture. In fact, the logic required to identify a branch instruction is not complicated in a well-designed instruction set, and the cost of introducing the ‘predecode logic’ is highly likely to be offset by the associated power savings.

9.3.3 Timing Issues

To make full use of the two hint-bits included in branch instructions, the branch predictor access must be performed in series with the i-cache access. This is because the hint bits are used to control whether or not to access the branch predictor, and they are not known until the i-cache access is complete. Some modern high performance architectures access the branch predictor in parallel with the i-cache to reduce the clock window in the instruction fetch pipeline stage. This is possible because no contents of the instruction being fetched are required to access the branch predictor (only the program counter value). These two timing scenarios are shown in Figure 9.1 below.

Figure 9.1: Series and parallel i-cache/branch predictor access. (1) and (2) represent the direction and target address predictors, respectively

The series access shown in Figure 9.1 is required by the combined algorithm. However, this restriction is not necessarily significant. A parallel access is only necessary on processors that derive their performance from very high clock rates; fortunately, most power-aware systems do not function at extremely high clock rates. This means that a series access is perfectly acceptable for most power-aware scenarios.

If a parallel branch predictor access is absolutely necessary then there is still a relatively simple modification that can be made to the hardware implementation: the inclusion of a very small direct-mapped cache, perhaps with as few as eight entries, indexed by the program counter and furnishing hint-bits from recently accessed branch instructions. This small cache would need to be updated at some stage with a branch’s hint bits. It can then be accessed quickly, in parallel with the i-cache, to decide whether to access the full branch predictor. This would, of course, reduce the effectiveness of the combined algorithm, but allows it to be introduced into timing-sensitive processors. The use of such a cache in this way can be seen as a statically initialised version of the Prediction Probe Detector; the use of the combined algorithm in conjunction with the Prediction Probe Detector is discussed in further detail in the Future Work section.
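A sketch of such a cache follows; the eight-entry size comes from the text, while the field widths and the index/tag split are assumptions:

    #define HINT_ENTRIES 8

    struct hint_line {
        unsigned tag;        /* remaining PC bits for hit detection */
        unsigned valid : 1;
        unsigned hint  : 2;  /* the branch's two hint bits          */
    };

    static struct hint_line hint_cache[HINT_ENTRIES];

    /* Probed in parallel with the i-cache.  On a hit the hint bits
     * decide whether the full predictor is accessed at all; on a miss
     * the predictor is accessed as normal. */
    int hint_cache_lookup(unsigned pc, unsigned *hint)
    {
        struct hint_line *l = &hint_cache[(pc >> 2) % HINT_ENTRIES];
        if (l->valid && l->tag == pc >> 5) {
            *hint = l->hint;
            return 1;
        }
        return 0;
    }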

9.3.4 Profiling Duration

A criticism often levelled at algorithms which make use of profiling is that profiling is generally a very slow process that consumes a large amount of time and resources. In the context of the combined algorithm, the question is: “how can this be justified?” In fact, the question is usually asked when the purpose of the combined algorithm has not been completely understood. The combined algorithm only uses profiling at one point in the production stage of a piece of software: the profiling data is used once, and the resulting hinted executable is then dispatched to multiple end users without the profiling being carried out again. Profiling does not need to be repeated because the combined algorithm ensures that only branches that are better served by a static prediction are assigned one; given broad profiling data sets, this ensures that the truly data-dependent branches are left to dynamic branch prediction.

A further argument against the significance of profiling duration is that the target audience of the combined algorithm is the embedded market, where the execution time of programs tends to be much lower and profiling is therefore far less time consuming. Additionally, it is not stipulated that the combined algorithm must be applied to a piece of code; rather, the processor supports it if the software producer decides (sensibly) to make use of it. Unprofiled/unscheduled code will simply be executed in the normal, dynamically predicted way.

9.3.5 Profiling on a ‘Real’ Architecture

The profiling results for the implementation of the combined algorithm presented in this dissertation were produced using a simulated architecture in a software processor simulator (HWattch). This has attracted some criticism, as the simulation of an entire processor is often considered slow. In fact, simulation is a very fast way to retrieve profiling information, particularly the kind of profiling information required by the combined algorithm. The combined algorithm needs only to record the following information during a program’s execution:

1. The direction taken by each static branch at each dynamic occurrence

2. The dynamic predictor’s predicted direction for each dynamic branch occurrence

102 Full Simulation Achieving the required profiling results in a full processor simulator is relatively straightforward. This is because, by definition, all elements of a simulator are available for inspection and monitoring. This makes obtaining the required branch prediction information simple. Mapping dynamic behaviours to static instructions was achieved in simulation by annotating each static assembly instruction with a unique identifier which was recorded with each dynamic branch direction result. This simplified the process of profile mapping and enabled reliable branch instruction annotating. The advantage of full simulation is the simplicity of the solution. The disad- vantage is that simulation makes it difficult to simulate a full system, for example a full operating system, and any programs which make use of a fully ‘live’ system.

Hardware Trace When profiling results are required under a realistic load, through direct interaction with the operating system or when the program makes use of real hardware, the only viable solution is to profile the program on a ‘real’ system/processor. This can be achieved with the use of a runtime tracer, such as HTracer, which can fully log the behaviour of an entire program by executing one instruction at a time and stall the traced program after each instruction has been executed. This process is straightforward and well documented. The difficulty is mapping the dy- namic program trace back to the original assembly code, since the program counter value does not necessarily indicate which instruction from a static assembly source file is being executed. To overcome this problem, an additional instruction should be inserted in the assembly code before each static branch instruction that needs to be traced dynamically. This instruction would store an absolute value in a register that indicates which part of the static source code is being executed. This is then interpreted post-trace and used to map profiling data to the static code. The key difficulty on a hard architecture is that the combined algorithm requires information about the behaviour of the dynamic prediction logic. This is unlikely to be available through the architecture dynamically. The simplest way to generate this part of the profiling data is to take the previous compacted branch trace of the program and parse it through a simple simulator that only models the branch predictor used in the hard architecture. This can then produce the desired branch predictor information for each dynamic branch in the unrolled program execution contained in the compacted trace file. Profiling on a real/hard architecture allows full use of the system in the profiled program and a full operating system. It can be used to obtain more useful results than perhaps are available in fully simulated environment.

9.3.6 Dependency on Datasets

The final significant point of contention over a system such as the combined algorithm is its use of ‘strong’ static hinting. That is to say, the combined algorithm uses hint bits, for certain branches, to reflect a static prediction that overrides all dynamic prediction for that branch. The argument is that such a system could significantly reduce the overall accuracy of branch prediction when the hinted program is executed with a different dataset. Such an argument has a valid basis, but can be seen to be unfounded in the case of the combined algorithm.

The first major point to consider is that all systems of profiling, in order to be valid, must be carried out with an adequate dataset: either a significantly large dataset or several different datasets. The combined algorithm uses this data to assign hints only for those branches where the static bias of a branch, across this broad profiling dataset, is greater than or equal to the dynamic predictor’s accuracy for that branch. This process essentially assigns static predictions only for heavily biased branches and difficult-to-predict-dynamically branches. These branches are specifically less dependent on datasets, and thus assigning them a static prediction will not, and does not, affect overall prediction accuracy. This is confirmed by the results presented in this dissertation: all benchmarks were profiled against two different datasets and then tested on a third dataset to generate the actual results, removing the possibility of over-fitting to a particular dataset. However, through experimentation, almost identical results can be obtained for this algorithm using a single dataset.

9.4 Related Work Comparison

An important point of discussion, now that the results are known for the combined algorithm, is the significance of this work compared to other related approaches to energy efficient branch prediction. The combined algorithm essentially treats the dynamic branch prediction logic as a black box; it does not interact with the internals of the dynamic branch predictor, but rather with its external behaviour. This is significant when considering related approaches. Many approaches to energy efficient branch prediction have focused on designing inherently efficient dynamic branch predictor units. However, these are not in direct competition with the combined algorithm, as the combined algorithm can be used with almost any scheme of dynamic branch prediction. The only significant piece of related work with which it is sensible to compare and contrast is the Prediction Probe Detector, which also attempts to reduce the number of dynamic branch predictor accesses.

9.4.1 Prediction Probe Detector

The PPD, as discussed in Chapter 3, introduces a small cache-like structure into the instruction fetch stage of a processor. This structure was initially intended simply to stop branch predictor accesses occurring for every fetched instruction in an environment where it was not possible to wait for an instruction to be fetched from the i-cache and then use predecode bits to prevent a branch predictor access. The PPD essentially behaved as a control flow instruction predictor; it would predict whether the next instruction to be fetched from the i-cache was a branch instruction, and hence whether the branch predictor should be accessed.

The PPD was then enhanced to predict whether the next instruction was a conditional branch or an unconditional branch. This allowed the control logic to decide whether a direction prediction or just a target prediction was required. This information was stored in two bits (one controlling the direction predictor, and the other controlling the target predictor) in a small cache where each line corresponds to a line in the i-cache. The PPD is then updated once the actual instruction has been decoded and its type is known. The PPD means that the branch predictor does not need to be accessed in every cycle, even in systems where the branch predictor logic has to be accessed in parallel with the i-cache. Finally, the PPD was further extended to use compiler hints which indicate whether a branch is biased to the extent that it is unchanging. This required an additional bit in the PPD to indicate this new possibility. These hints were relatively crude and only convey information for a small subset of branches that are completely unchanging.

Overall, in the experimental baseline architecture used in the PPD investigation, a global processor power saving of around 3% was observed. It was noted as a final idea of the PPD journal paper [82] (published after the start of this work) that much bigger power savings would be possible if some form of hints could be used to convey information about branch behaviour to the PPD. This gives reassurance to the methods used in this project, but is still very different from the combined algorithm.

In contrast to the PPD, the combined algorithm assumes that the branch predictor is accessed in series with the i-cache (possible in processors with a lower clock speed) and that predecode logic is included. These are not big assumptions, as discussed previously, but they do change the premise of the scheme used. In fact, if a parallel access of the branch predictor and i-cache is required, then an approach similar to that described in the future work (Section 9.5.3) of combining the PPD with the combined algorithm would be the most sensible.

All of this means that, in general terms, the power savings achieved by the combined algorithm are in addition to those achievable by the PPD. That is to say, most of the power savings achieved by the PPD are already assumed to have occurred in the results for the combined algorithm: the combined algorithm results assume that only branches access the branch predictor logic (the chief power saving achieved by the PPD). This means that, according to the results presented
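For illustration, the enhanced PPD's per-line state and its fetch/decode interaction can be sketched as follows. This is a minimal sketch under assumed parameters (a direct-mapped structure with one entry per i-cache line, and invented field names); the PPD papers [82] define the actual organisation.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define ICACHE_LINES 512   /* one PPD entry per i-cache line (assumed size) */
#define LINE_BYTES   32    /* assumed i-cache line size */

/* Per-line PPD state: one bit gates the direction predictor, the
   other gates the target predictor, as in the enhanced PPD. */
struct ppd_entry {
    bool probe_direction;   /* line is believed to hold a conditional branch */
    bool probe_target;      /* line is believed to hold a branch needing a target */
};

static struct ppd_entry ppd[ICACHE_LINES];

/* Consulted at fetch, in place of probing the predictors themselves. */
static struct ppd_entry ppd_lookup(uint32_t pc)
{
    return ppd[(pc / LINE_BYTES) % ICACHE_LINES];
}

/* Updated at decode, once the fetched instruction's type is known.
   Bits accumulate per line; a real PPD would clear them when the
   corresponding i-cache line is refilled. */
static void ppd_update(uint32_t pc, bool is_cond_branch, bool is_branch)
{
    struct ppd_entry *e = &ppd[(pc / LINE_BYTES) % ICACHE_LINES];
    e->probe_direction |= is_cond_branch;
    e->probe_target    |= is_branch;
}

int main(void)
{
    ppd_update(0x1000, true, true);           /* decode saw a conditional branch */
    struct ppd_entry e = ppd_lookup(0x1000);  /* next fetch of the same line */
    printf("direction=%d target=%d\n", e.probe_direction, e.probe_target);
    return 0;
}
```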

in this dissertation, significant additional power savings could be achieved by extending the PPD with the combined algorithm (where a parallel access is necessary – otherwise the PPD is of no benefit to the combined algorithm). The combined algorithm achieves the additional power savings over the PPD by the following methods:

1. The inclusion of full static predictions for branches, including the target address for the most common branch type format. This makes it possible not only to avoid furnishing predictions for many branches entirely, but also negates the need to update the predictor for those branches (apart from the history register in two-level predictors such as global predictors).

2. Adaptive Branch Bias Measurement through profiling maximises the number of branches that can be assigned a static hint by using a per-branch bias assignment system. This assigns a static prediction for all branches that are more biased to one direction than the accuracy of their dynamic prediction. This has the effect of removing both ‘very biased’ branches (though not just traditional ones) and difficult-to-predict branches.

3. Local delay region scheduling is used for unconditional absolute branches. This removes accesses for a large subset of dynamic branches.

4. Simple hardware support in the instruction fetch stage allows for very simple calculation of the target address for the most common branch type format (see the sketch after this list). This is significant as the branch target buffer (BTB) is a major consumer of power.
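To indicate how simple this calculation is, the following C sketch computes the taken-target of a MIPS/PISA-style I-format conditional branch directly from the fetched instruction word. The encoding assumed (a signed 16-bit word offset relative to the instruction after the branch) is that of the common offset-branch format; it is shown as an illustration, not as the exact HWattch implementation.

```c
#include <stdio.h>
#include <stdint.h>

/* Taken-target of a MIPS/PISA-style I-format branch, computed in fetch
   directly from the instruction word: the low 16 bits are a signed word
   offset from the instruction following the branch. */
static uint32_t branch_target(uint32_t pc, uint32_t instruction)
{
    int32_t offset = (int16_t)(instruction & 0xFFFFu);   /* sign-extend imm16 */
    return (pc + 4) + ((uint32_t)offset << 2);           /* words -> bytes */
}

int main(void)
{
    /* imm16 = 0xFFFE (-2 words): from pc 0x400018 the target is 0x400014. */
    printf("0x%08x\n", branch_target(0x400018, 0x1000FFFEu));
    return 0;
}
```

An adder and a sign-extender on the immediate field are all the fetch stage needs, which is why the BTB can be bypassed entirely for hinted-taken branches of this format.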

These key differences demonstrate that there are significant advantages to be gained over the existing PPD scheme, and that these advantages could work together with the PPD in a dynamic context.

9.5 Future Work

During the experimentation and design work conducted throughout this project, several ideas for future work have arisen. Some are very closely related to the research conducted surrounding the combined algorithm, but others are broader and more distant in context. This section takes a very brief look at several possible avenues for future research.

9.5.1 Maximising the Fetch Window of Wide Issue Processors

Superscalar processors fetch multiple instructions in each cycle in order to fill a multiple instruction fetch queue. In wider instruction issue processors, such as sixteen instruction issue, there is likely to be more than one branch instruction

fetched in each cycle (assuming that the fetch queue can be somewhat emptied each cycle – the number of functional units is also high). This is a problem because each time a branch instruction is encountered, a prediction must be formed by the dynamic predictor as to which direction the branch will take and, if the prediction is taken, what the target address will be. Branch prediction logic is complex and will generally use up the rest of the cycle time for a predicted taken branch. This means that, potentially, it can be difficult to fill a wide instruction fetch window when more than one branch is encountered per cycle.

The proposal for future work consists of measuring the effectiveness of the hinting scheme described in this dissertation at ameliorating the branch prediction bottleneck when filling a wide instruction fetch window. A dynamic branch prediction is unnecessary for any branch hinted by the combined algorithm, and the combined algorithm can hint between 60% and 75% of all dynamic branches. This potentially means that a significant reduction in the branch bottleneck could occur and that the instruction fetch window could be utilised significantly more successfully.

9.5.2 Hinting Libraries

Due to technical complexities and design decisions, the libraries used by all of the programs in the experimentation chapters of this report were unhinted. This means that all branches in libraries used dynamic prediction, whether or not they were biased. There is therefore a further source of potentially avoidable dynamic branch predictor accesses, and so a further source of power saving.

The difficulty is that libraries, or more specifically shared libraries, are shared in memory between one or more programs. Assigning a static prediction to branches in a shared library, using profiling information, is very difficult: the branches in the library can appear in any number of contexts within several different programs, where the static predictions may be highly inaccurate.

A possible solution is to statically link a program to the libraries that it uses (or a subset of them). This would remove the context problem of a library appearing in various different programs, making it possible to hint the library functions within the target program as normal. However, statically linking a program to its libraries will increase the size of the program (as the library code must now be contained within it) and the library will no longer be shared. An investigation would decide whether the overall increase in the size of programs was worth the extra potential power saving that could be obtained from hinting the newly hintable branches.

9.5.3 Combining with the Prediction Probe Detector

The combined algorithm cannot be used in a processor where the clock rate is so high that the branch predictor must be accessed in parallel with the i-cache using

only the program counter. This is because the combined algorithm relies on the availability of the hint bits contained within a fetched instruction; these hint bits are then used to decide whether the dynamic branch predictor should be accessed. An initial solution to this seems to be the use of some kind of very small direct-mapped, or even fully-associative, cache. This would be updated with the hint bits for branch instructions and could double as a method to avoid predicting for non-branch instructions.

The combination of the 'combined' algorithm with the PPD would provide results that would be interesting not only from the point of view of the combined algorithm in parallel access contexts, but also because it would extend the PPD with some of the hinting methods that were suggested as enhancements to it. In fact, the hinting methods used by the combined algorithm are far better than those suggested for the PPD. Such a combination would provide an analysis of an energy efficient branch prediction scheme suitable for high performance processors.

9.5.4 Hints and Context Switching

One area of branch prediction that has not been well examined is the effect of context switching, in a real operating system environment, upon branch prediction. When an operating system performs a context switch, one program is 'switched out' of running execution for another. This means that all of the information in the branch predictor is now irrelevant: any correct predictions from the dynamic branch predictor are now by chance, and the branch predictor must begin learning the branch behaviours again. It is possible that context switching has a highly significant effect upon dynamic branch prediction accuracy. The reason that this has not been well quantified is that almost all branch prediction experimentation and design is conducted using simulators. Generally, these simulators only execute one program and do not run an operating system; running an operating system in a simulated environment is difficult and very slow. Therefore, context switching is rarely tested fully.

An interesting proposal for future experimentation would be to fully quantify the effect of context switching (under different loads) on the accuracy of dynamic branch predictors. After this, experimental results of the use of the combined algorithm in a context switching environment should be compared. It seems both logical and probable that, under heavy context switching, the overall accuracy of branch predictions would be higher with the use of the static hints from the combined algorithm, as they do not have to be re-learned each time a context switch occurs and they suffer no resulting interference.

9.5.5 Profiling and Processor-Wide Power Saving

The thesis presented in this dissertation is largely focused on the power savings that can be achieved by the profiling of control flow instructions and the resulting reduction in accesses to the dynamic branch predictor logic. However, this gives motivation for the possible use of profiling and hinting to save power in other areas of processor design.

Two important factors in the power consumption of modern processors are the clock frequency and static power dissipation. The clock frequency has a linear effect on power consumption through the propagation of the clock signal throughout the control logic. This power drain is ameliorated by control logic that performs clock gating – the disconnecting of the clock signal from idle units. Deciding which units are idle is performed by the clock gating control logic. Static power dissipation can be reduced by the same gating logic, but by gating the entire power supply of units that do not need to constantly maintain a state. The key point about gating, of both varieties, is that the processor must decide dynamically which units or clocks should be gated. Although the introduction of this technology has saved significant power, it is not optimal; clock gating rarely performs at the theoretical best.

An interesting area of investigation would be to use profiling data to insert either hint-bits or hint-instructions which can be interpreted by the processor at runtime in order to gate units or clocks more efficiently. For instance, with only simple compile time analysis, the compiler can tell whether or not certain logic and arithmetic units will be used. This could be represented in a hint instruction which tells the processor to power down the appropriate units, and when to power them back up again. Profiling data could then also offer dynamic insight into which areas of the processor are likely to be used at a given period of execution. A hint instruction could be a new type of instruction introduced into the instruction set where one part of its field represents several different elements of the processor, and whether they should be powered up or down (1 or 0). Of course, the processor could work without any such hint information, but its presence could be used to enhance/override the dynamic behaviour. The result of these experiments could be a more efficient and less hardware intensive approach to general processor power savings.
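As a sketch of what such a hint might look like, the following C fragment encodes a hypothetical gating hint field in which each bit corresponds to a processor unit. The unit assignments and field width are invented for illustration; a real design would fix them as part of the instruction set.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical gating-hint field: one bit per processor unit, 1 = keep
   the unit powered/clocked, 0 = it may be gated. Unit assignments and
   field width are invented for illustration. */
enum unit_bit {
    UNIT_INT_ALU1 = 1u << 0,
    UNIT_INT_ALU2 = 1u << 1,
    UNIT_FP_ADD   = 1u << 2,
    UNIT_FP_MULT  = 1u << 3,
    UNIT_INT_MULT = 1u << 4,
};

/* Compiler side: a region that, by simple compile-time analysis, uses
   only the integer ALUs gets a hint with the FP units cleared. */
static uint32_t make_gating_hint(void)
{
    return UNIT_INT_ALU1 | UNIT_INT_ALU2;
}

/* Processor side: a unit may be gated when its bit is clear; dynamic
   gating logic could still override the hint. */
static int unit_may_be_gated(uint32_t hint, enum unit_bit unit)
{
    return (hint & unit) == 0;
}

int main(void)
{
    uint32_t hint = make_gating_hint();
    printf("FP multiplier may be gated: %d\n",
           unit_may_be_gated(hint, UNIT_FP_MULT));   /* prints 1 */
    return 0;
}
```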

9.6 Concluding Remarks

In conclusion it is appropriate to say that previous related work has not fully realised the possible power savings and energy efficiency that can be achieved when static scheduling/hinting and dynamic branch prediction work in synergy with one another using minor hardware support. When profiling is performed on a target program, with ABBM, a significant number of dynamic branch predictor accesses can be avoided, and the consequent power consumed to perform those accesses can

be saved. In addition to the power savings that are possible, it can also be seen that the system of adaptive branch bias measurement allows overall prediction accuracy to be improved. Adaptive bias measurement, through profiling, has implications for the performance market as well, particularly in architectures which already support directional hinting (for example the PowerPC architecture).

The central message of this thesis, that should remain in the mind of the reader, is that traditional techniques, namely delay region scheduling and profiling, can be used to achieve significant power savings in modern dynamic branch predictors. The addition of the novel ABBM, simple hardware and hint-bit configuration is crucial to the significant effect that the combined algorithm can have for high performance embedded processors.

Bibliography

[1] R. Adams, S. Gray and G. Steven. HARP: A Statically Scheduled Multiple-Instruction-Issue Architecture and its Compiler. In 2nd Euromicro Workshop on Parallel and Distributed Processing, p. 8. January 1994.

[2] R. Adams and G. Steven. A Parallel Pipelined Processor with Conditional Instruction Execution. Tech. Rep. 107, University of Hertfordshire, October 1990.

[3] R. G. Adams, G. B. Steven, S. M. Gray and G. J. Green. Code Compaction Algorithms for a Multiple Instruction Issue Processor. Tech. Rep. 127, University of Hertfordshire, January 1992.

[4] G. Albera and R. Bahar. Power and Performance Tradeoffs Using Various Cache Configurations. In Power Driven Micro-Architecture Workshop, pp. 64–69. ACM, New York, NY, USA, June 1998.

[5] AMD. Athlon 64 X2 Processor Manual, 2006. URL http://www.amd.com [navigate to manuals]

[6] S. Amos and M. James. Principles of Transistor Circuits. Butterworth-Heinemann, London, UK, 1999. ISBN 0-7506-4427-3.

[7] ARM. ARM11 Reference Manual, 2005. URL http://www.arm.com/manual/arm11 [May 2006]

[8] R. Arns. The other transistor: early history of the metal-oxide-semiconductor field-effect transistor. Engineering Science and Education Journal, pp. 223–240, 1998. ISSN 0963-7346.

[9] S. Asai and Y. Wada. Technology Challenges for Integration Near and Below 0.1um. IEEE, vol. 85(4):pp. 505–520, April 1997.

[10] R. Bahar and S. Manne. Power and Energy Reduction via Pipeline Balancing. In 28th Annual International Symposium on Computer Architecture, pp. 218–229. IEEE, Washington D.C., USA, June 2001.

[11] S. Breach, T. Vijaykumar and G. Sohi. The Anatomy of the Register File in a Multiscalar Processor. In 27th Annual International Symposium on Microarchitecture, pp. 181–190. ACM, New York, NY, USA, December 1994.

[12] D. Brooks. Power Aware Computing Notes. Tech. rep., Harvard University, USA, September 2004. URL http://www.eecs.harvard.edu/~dbrooks

[13] D. Brooks, P. Bose, S. Schuster, H. Jacobson and P. Kudva. Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors. IEEE Micro, November 2000.

[14] D. Brooks, V. Tiwari and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In 27th Annual International Symposium on Computer Architecture, pp. 83–94. IEEE, Washington D.C., USA, 2000. ISSN 0163-5964.

[15] J. Bunda, W. Athas and D. Fussel. Evaluating Power Implications of CMOS Microprocessor Design Decisions. In International Workshop on Low Power Design, pp. 147–152. ACM, New York, NY, USA, April 1994.

[16] D. Burger and T. M. Austin. The SimpleScalar Tool Set, Version 2.0. Tech. rep., University of Wisconsin-Madison, 1997.

[17] D. Burger and J. Goodman. Billion-Transistor Architectures. IEEE Computer, pp. 46–49, September 1997.

[18] M. Butler and Y. Patt. An Investigation of the Performance of Various Dynamic Scheduling Techniques. In 25th Annual International Symposium on Computer Architecture, pp. 1–9. IEEE, Washington D.C., USA, December 1992.

[19] B. Calder and D. Grunwald. Fast and Accurate Instruction Fetch and Branch Prediction. In 21st Annual International Symposium on Computer Architecture, pp. 2–11. IEEE, Washington D.C., USA, May 1994.

[20] B. Calder, D. Grunwald, D. Lindsay, J. Martin, M. Mozer and B. Zorn. Corpus-Based Static Branch Prediction. In SigPlan, pp. 79–92. ACM, New York, NY, USA, 1995.

[21] B. Calder, G. Reinman and D. Tullsen. Selective Value Prediction. In 26th Annual International Symposium on Computer Architecture, pp. 64–74. ACM, New York, NY, USA, 1999.

[22] A. Chandrakasan and R. Brodersen. Low Power Digital CMOS Design. Springer, Berlin, Germany, 1995. ISBN 079239576X.

[23] J. Chang and M. Pedram. Register Allocation and Binding for Low Power. In IEEE/ACM Design Automation Conference, pp. 29–35. ACM, New York, NY, USA, 1995.

[24] P. Chang, E. Hao and Y. Patt. Alternative Implementations of Hybrid Branch Predictors. In 28th Annual International Symposium on Microarchitecture, pp. 252–257. IEEE, Washington D.C., USA, December 1995.

[25] P. Chang, E. Hao, T. Yeh and Y. Patt. Branch Classification: A New Mechanism for Improving Branch Predictor Performance. International Journal of Parallel Programming, vol. 24(2):pp. 133–158, 1996.

[26] P. Y. Chang, M. Evers and Y. Patt. Improving Branch Prediction Accuracy by Reducing Pattern History Table Interference. In International Conference on Parallel Architectures and Compilation Techniques, pp. 48–57. IEEE, Washington D.C., USA, October 1996.

[27] Z. Chen, J. Shott, J. Burr and J. Plummer. CMOS Technology Scaling for Low Voltage Power Architectures. In IEEE Symposium on Low Power Electronics, pp. 56–57. IEEE, Cambridge, MA, USA, October 1994.

[28] R. Collins and G. Steven. An Explicitly Declared Branch Delay Mechanism for a Superscalar Architecture. Microprocessing and Microprogramming, vol. 40:pp. 677–680, 1994.

[29] T. Conte, K. Menezes, P. Mills and B. Patel. Optimization of Instruction Fetch Mechanisms for High Issue Rates. In 22nd Annual International Symposium on Computer Architecture, pp. 333–344. ACM, New York, NY, USA, 1995.

[30] D. Dobberpuhl. The Design of a High Performance Low Power Microprocessor. In International Symposium on Low Power Electronics and Design, pp. 11–16. IEEE, Washington D.C., USA, August 1996.

[31] C. Egan. Dynamic Branch Prediction In High Performance Super Scalar Processors. Ph.D. thesis, University of Hertfordshire, August 2000.

[32] C. Egan, M. Hicks, B. Christianson and P. Quick. Enhancing the I-Cache to Reduce the Power Consumption of Dynamic Branch Predictors. p. 31. IEEE Digital System Design, July 2005. ISBN 3-902457-09-0.

[33] C. Egan, F. Steven and G. Steven. Delayed Branches versus Dynamic Branch Prediction in High-Performance Superscalar Architecture. In Euromicro, p. 7. IEEE, September 1997.

[34] R. Evans. Energy Consumption Modeling and Optimization for SRAM’s. Ph.D. thesis, North Carolina State University, July 1993.

[35] J. A. Fisher and S. Freudenberger. Predicting Conditional Branch Directions from Previous Runs of a Program. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, pp. 85–95. October 1992.

[36] S. Ghiasi, J. Casmira and D. Grunwald. Using IPC Variation in Workload with Externally Specified Rates to Reduce Power Consumption. In Workshop on Complexity Effective Design, pp. 1–10. University of Colorado, June 2000.

[37] K. Ghose and M. Kamble. Reducing Power in Superscalar Processor Caches Using Subbanking, Multiple Line Buffers and Bit-Line Segmentation. In International Symposium on Low Power Electronics and Design, pp. 70–75. ACM, New York, NY, USA, August 1999.

[38] R. Gonzalez and M. Horowitz. Energy Dissipation in General Purpose Microprocessors. IEEE Journal of Solid-State Circuits, vol. 31(9), September 1996.

[39] D. Grunwald, A. Klauser, S. Manne and A. Pleszkun. Confidence Estimation for Speculation Control. In 25th Annual International Symposium on Computer Architecture, pp. 122–131. June 1998.

[40] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. The Morgan Kaufmann Series in Computer Architecture and Design. Morgan Kaufmann, 3 edn., 2004. ISBN 978-1558605961.

[41] M. Hicks, C. Egan, B. Christianson and P. Quick. HTracer: A Dynamic Instruction Stream Research Tool. p. 10. IEEE Digital System Design, July 2005. ISBN 3-902457-09-0.

[42] M. Hicks, C. Egan, B. Christianson and P. Quick. Reducing the Branch Power Cost in Embedded Processors Through Static Scheduling, Profiling and SuperBlock Formation. In Advances In Computer Systems Architecture, pp. 366–372. Springer, Berlin, Germany, September 2006. ISBN 3-540-40056-7.

[43] M. Hicks, C. Egan, B. Christianson and P. Quick. The Static Removability of Dynamic Branch Predictor Accesses in Embedded Programs Through the Use of Adaptive Bias Measurement and Local Delay Region Scheduling. Tech. rep., University of Hertfordshire, June 2007.

[44] M. Hicks, C. Egan, B. Christianson and P. Quick. Towards an Energy Efficient Branch Prediction Scheme Using Profiling, Adaptive Bias Measurement and Delay Region Scheduling. In Design and Technology of Integrated Systems. IEEE, September 2007.

[45] M. Hicks, C. Egan, B. Christianson, P. Quick and B. Dickerson. HTracer: A User Guide. Tech. rep., University of Hertfordshire, July 2005.

[46] M. Hicks, C. Egan, P. Quick and B. Christianson. An Introduction to Power Consumption Issues In Processor Design. Tech. rep., University of Hertfordshire, July 2005.

[47] R. B. Hilgendorf, G. J. Heim and W. Rosenstiel. Evaluation of branch-prediction methods on traces from commercial applications. IBM Journal of Research and Development, vol. 43(4), 1999.

[48] M. Horowitz, T. Indermaur and R. Gonzalez. Low Power Digital Design. In IEEE Symposium on Low Power Electronics, pp. 8–11. October 1994.

[49] P. Horowitz and W. Hill. The Art of Electronics. Cambridge University Press, 1989. ISBN 0-521-37095-7.

[50] Z. Hu, P. Juang, K. Skadron, D. Clark and M. Martonosi. Applying Decay Strategies to Branch Predictors for Leakage. In International Conference of Computer Design, pp. 442–445. IEEE, Washington D.C., USA, September 2002.

[51] IBM. PowerPC Instruction Set Manual. IBM, Austin, Texas, 2005. URL http://www.ibm.com (no persistent link)

[52] Intel. Intel Architecture Software Developer’s Manual, September 2004. URL http://www.intel.com/

[53] Intel. Intel Itanium Processor Manuals, 2007. URL http://www.intel.com/manuals

[54] C. Isci and M. Martonosi. Run-time Power Monitoring and Estimation in High-Performance Processors: Methodology and Experiences. In Micro-36, pp. 93–104. IEEE, Washington D.C., USA, December 2003.

[55] K. Itoh, K. Sasaki and Y. Nakagome. Trends in Low-Power RAM Circuit Technologies. In IEEE Symposium on Low Power Electronics, vol. 83, pp. 84–87. IEEE, Washington D.C., USA, April 1995.

[56] E. Jacobsen, E. Rotenberg and J. Smith. Assigning Confidence to Conditional Branch Predictions. In IEEE 29th International Symposium on Microarchitecture, 1996.

[57] C. Jesshope. Microthreading, a model for distributed instruction-level concurrency. Parallel Processing Letters, vol. 16(2):pp. 209–228, 2006.

[58] D. Jimenez, S. Keckler and C. Lin. The Impact of Delay on the Design of Branch Predictors. In 33rd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 67–77. ACM, New York, NY, USA, December 2000. ISBN 1-58113-196-8.

[59] J. Karlin, D. Stefanovic and S. Forrest. The Triton Branch Predictor. Tech. rep., University of Texas, Austin, Texas, October 2004.

[60] A. Klauser. Reducing Branch Misprediction Penalty through Multipath Execution. Ph.D. thesis, Dept. Computer Science, University of Colorado, 1999.

[61] U. Ko, P. Balsara and A. Nanda. Energy Optimization of Multi-Level Processor Cache Architectures. In International Symposium on Low Power Design, pp. 45–49. ACM, New York, NY, USA, 1995.

[62] C. Lee, J. K. Lee, T. Hwang and S. Tsai. Compiler optimization on VLIW instruction scheduling for low power. ACM Transactions on Design Automation of Electronic Systems, vol. 8(2):pp. 252–268, 2003.

[63] J. Lee and J. Smith. Branch Prediction Strategies and Branch Target Buffer Design. IEEE Computer, pp. 6–22, January 1984.

[64] M. Levy. The Embedded Microprocessor Benchmark Consortium. Online, 2005. URL http://www.eembc.org

[65] J. Lorch and A. Smith. Software Strategies for Portable Computer Energy Management. IEEE Personal Communications, pp. 60–73, June 1998.

[66] S. Mahlke, R. Hank, R. Bringmann, J. Gyllenhaal, D. Gallagher and W. Hwu. Characterizing the Impact of Predicated Execution on Branch Prediction. In 27th International Symposium on Microarchitecture, pp. 217–227. ACM, New York, NY, USA, December 1994.

[67] S. Manne, A. Klauser and D. Grunwald. Pipeline Gating: Speculation Control for Energy Reduction. In International Symposium on Computer Architecture, pp. 132–141. ACM, New York, NY, USA, June 1998.

[68] A. J. Martin, M. Nystrom and P. L. Penzes. ET2: A Metric for Time and Energy Efficiency of Computation. In Power Aware Computing, pp. 293–315. Kluwer Academic Publishers, Norwell, MA, USA, 2002. ISBN 0-306-46786-0.

[69] H. Mehta. Techniques for Low Energy Software. In Proceedings of the 1997 international symposium on Low power electronics and design, pp. 72–75. ACM, New York, NY, USA, August 1997. ISBN 0-89791-903-3.

[70] M. Monchiero, G. Palermo, M. Sami, C. Silvano, V. Zaccaria and R. Zafalon. Low-Power Branch Prediction Techniques for VLIW Architectures: a Compiler-Hints Based Approach. VLSI Integration, vol. 38(3):pp. 515–524, 2005. ISSN 0167-9260.

[71] G. Moore. Cramming More Components onto Integrated Circuits. Electronics, vol. 38(8), April 1965.

[72] F. Najm. A Survey of Power Estimation Techniques in VLSI Circuits. In IEEE Transactions in VLSI Systems, pp. 446–455. December 1994.

[73] S. Narendra and A. Chandrakasan. Taxonomy of Leakage: Sources, Impact and Solutions, pp. 1–19. Leakage in Nanometer CMOS Technologies. Springer-Verlag, 2006.

[74] GNU Organisation. GCC Position Independent Code. Electronic. URL http://gcc.gnu.org/onlinedocs/gcc-3.4.1/gccint/PIC.html

[75] P. Otellini. 2005 Intel Developers Conference. Presentation – Available Online, 2003. URL http://www.intel.com [search for conference title]

[76] G. Palermo, M. Sami, C. Silvano, V. Zaccaria and R. Zafalon. Branch Prediction Techniques for Low-Power VLIW Processors. In 13th ACM Great Lakes Symposium on VLSI, pp. 225–228. 2003. ISBN 1-58113-677-3.

[77] S. Pan, K. So and J. Rahmeh. Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation. In 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 76–84. ACM, New York, NY, USA, October 1992.

[78] P. Panda, N. Dutt and A. Nicolau. Architectural Exploration and Optimization of Local Memory and Embedded Systems. In International Symposium on System Synthesis, pp. 90–97. IEEE, Washington D.C., USA, September 1997.

[79] A. Parikh, M. Kandemir and N. Vijaykrishnan. Energy Aware Instruction Scheduling. In International Conference on High Performance Computing, pp. 335–344. Springer, London, UK, December 2000.

[80] A. Parikh, S. Kim, M. Kandemir, N. Vijaykrishnan and M. Irwin. Instruction Scheduling for Low Power. Journal of VLSI Signal Processing, vol. 37(1):pp. 129–149, 2004.

[81] D. Parikh, K. Skadron, Y. Zhang, M. Barcella and M. R. Stan. Power Issues Related to Branch Prediction. In IEEE High Performance Computer Architecture, pp. 233–244. February 2002.

[82] D. Parikh, K. Skadron, Y. Zhang and M. Stan. Power-Aware Branch Prediction: Characterization and Design. IEEE Transactions On Computers, vol. 53(2):pp. 168–186, February 2004.

[83] H. Patil and J. Emer. Combining Static and Dynamic Branch Prediction to Avoid Destructive Aliasing. In Sixth International Symposium on Computer Architecture, pp. 251–261. IEEE, Washington D.C., USA, January 2000.

[84] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware Software Interface. Morgan Kaufmann, Burlington, MA 01803, USA, third edn., 2005. ISBN 1558606041.

[85] C. Perleberg and J. Smith. Branch Target Buffer Design and Optimisation. IEEE Transactions on Computers, vol. 4:pp. 396–411, 1993.

[86] M. Predko. Digital Electronics Demystified. McGraw-Hill, New York, 2005.

[87] K. Roy and M. Johnson. Software Design for Low Power. In Low Power Design in Deep Sub-Micron Design, pp. 433–459. Kluwer Academic Press, October 1996.

[88] J. Russel and M. Jacome. Software Power Estimation and Optimization for High Performance, 32-bit Embedded Processor. In ICCD, p. 328. IEEE, Washington D.C., USA, October 1998.

[89] S. Sandeep. Process Tracing Using Ptrace. Linux Gazette, vol. 81, 2002.

[90] T. Sato. Evaluation of Architecture-Level Power Estimation for CMOS RISC Processors. In IEEE Symposium on Low Power Electronics, pp. 44–45. IEEE, Washington D.C., USA, October 1995.

[91] J. S. Seng and D. M. Tullsen. Exploring the Potential of Architecture-Level Power Optimizations. In PACS, pp. 132–147. Springer, Berlin, Germany, 2003.

[92] W. Shiue and C. Chakrabarti. Memory Exploration for Low Power Embedded Systems. In DAC, pp. 250–253. IEEE, Washington D.C., USA, 1999.

[93] K. Skadron. Characterizing and Removing Branch Mispredictions. Ph.D. thesis, Princeton University, June 1999.

[94] K. Skadron, D. Clark and M. Martonosi. Speculative Updates of Local and Global Branch History: A Quantitative Analysis. Journal of Instruction Level Parallelism, vol. 2:p. 33, January 2000.

[95] K. Skadron, M. Martonosi and D. Clark. A Taxonomy of Branch Mispredictions, And Alloyed Prediction as a Robust Solution to Wrong-History Mispredictions. In International Conference of Parallel Architectures and Compilation Techniques, pp. 199–206. IEEE, Washington D. C., USA, October 2000.

[96] SPEC. SPEC CPU2000 Benchmarks, 2000. URL http://www.spec.org/cpu/

[97] F. Steven. An Introduction to the Hatfield Super Scalar Scheduler. Tech. Rep. 316, University of Hertfordshire, June 1998.

[98] G. Steven, D. Christianson, R. Collins, R. Potter and F. Steven. A Superscalar Architecture to Exploit Instruction Level Parallelism. Microprocessors and Microsystems, vol. 20(7):pp. 391–400, 1997.

[99] C. Su and C. Tsui. Low Power Architecture Design and Compilation Techniques for High-Performance Processors. In IEEE COMCON, pp. 489–498. 1994.

[100] V. Tiwari, S. Malik and A. Wolfe. Compilation Techniques for Low Energy: An Overview. In IEEE Symposium on Low Power Electronics, pp. 38–39. IEEE, Washington D.C., USA, October 1994.

[101] V. Tiwari, S. Malik and A. Wolfe. A First Step Towards Software Power Minimization. In IEEE Transactions in VLSI Systems, pp. 437–455. De- cember 1994.

[102] V. Tiwari, S. Malik and A. Wolfe. Power Analysis of Embedded Software: A First Step Towards Software Power Minimization. In IEEE Transactions in VLSI Systems, vol. 2, pp. 437–445. December 1994.

[103] V. Tiwari, S. Malik and A. Wolfe. Instruction Level Power Analysis and Optimization of Software. Journal of VLSI Signal Processing Systems, vol. 13(2), 1996.

[104] M. Toburen, T. Conte and M. Reilly. Instruction Scheduling for Low Power Dissipation in High Performance Architectures. In Power Driven Micro-Architecture Workshop in Conjunction with ISCA, p. 10. ACM, New York, NY, USA, June 1998.

[105] D. M. Tullsen. Complete Guide to Semiconductor Devices. Wiley-IEEE Press, 2 edn., 1996.

[106] Various. The Linux Kernel Archives. Online, 2005. URL http://www.kernel.org

[107] Various. Linux Operating System Manual Pages, 2005.

[108] Various. Tux.org Discussion Lists. Online, 2005. URL http://www.tux.org/forums [May 2006]

[109] L. Vintan. Towards a High Performance Neural Branch Predictor. In The International Joint Conference on Neural Networks, vol. 2, pp. 868–867. IEEE, Washington D. C., USA, 1999.

[110] L. Vintan, A. Gellert, A. Florea, M. Oancea and C. Egan. Understanding Prediction Limits Through Unbiased Branches. In Advances In Computer Systems Architecture, vol. 4186 of Lecture Notes In Computer Science, pp. 480–487. Springer-Verlag, September 2006. ISSN 0302-9743.

[111] S. Wilton and N. Jouppi. Cacti: An Enhanced Cache Access and Cycle Time Model. IEEE Journal of Solid-State Circuits, vol. 31(5):pp. 677–688, 1996.

[112] T. Yeh and Y. Patt. Two-Level Adaptive Training Branch Prediction. In 24th Annual International Symposium on Microarchitecture, pp. 51–61. ACM, New York, NY, USA, November 1991.

[113] Y. Zhang, X. Hu and D. Chen. Global Register Allocation for Minimizing Energy Consumption. In International Symposium on Low Power Electronics and Design, pp. 100–102. IEEE, Washington D.C., USA, August 1999.

[114] Z. Zhu and X. Zhang. Access-Mode Restrictions for Low-Power Cache Design. IEEE Micro, vol. 22(2):pp. 58–71, March-April 2002.

[115] V. Zyuban. Inherently Lower-Power High-Performance SuperScalar Architectures. Ph.D. thesis, University of Notre Dame, Indiana, March 2000.

[116] V. Zyuban, P. G. Emma, D. Brooks, V. Srinivasan, M. Gschwind, P. Bose and P. N. Strenski. Integrated Analysis of Power and Performance of Pipelined Microprocessors. IEEE Transactions on Computers, vol. 53(8), August 2004.

[117] V. Zyuban and P. Kogge. Optimization of High-Performance Superscalar Architectures for Energy Efficiency. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design, pp. 84–89. ACM, New York, NY, USA, 2000. ISBN 1-58113-190-9.


Glossary

Vdd Voltage Drain, 7
Vss Voltage Source, 7

ABBM Adaptive Branch Bias Measurement. The process of using profiling data to compare the bias of a branch to its prediction accuracy by a dynamic predictor, 40
Activity Factor The factor in power consumption determined by the activity of the circuit, 11

Branch Prediction The methods utilised to predict the outcome of a control flow instruction (a branch), 13

CMOS Complementary Metal-Oxide-Semiconductor. A technology process used to create most modern processors from MOSFETs, 7
Combined Algorithm An approach which aims to represent dynamic branch behaviour statically using the ABBM and use local delay region scheduling in order to reduce the number of accesses made to a dynamic branch predictor, 36
CPU Central Processing Unit. The main processor inside a computer that is used to execute programs, 7

Delay The amount of time a particular operation or event takes to complete, 12
Dynamic Dissipation A loss of power in a circuit that occurs during switching events, 7

Frequency Scaling The process of reducing the clock frequency of a processor during periods of lower activity, 12

Gating The process of disconnecting either the clock signal or power source from an idle unit, 10

Hint-Bits Used to pass information from the compiler to the processor in the form of additional bits in an instruction, 42

Local Delay Region Scheduling The process of moving branch independent instructions into the n-slot delay region of a branch, 36

Nanometre Denotes 10^-9 and also a technology feature size when qualified with a number, 7

Profiling The process of monitoring a program during exe- cution in order to produce statistics, 38

Static Dissipation A constant loss of power in a circuit that occurs irrespective of switching activity, 7

Transistor Threshold The voltage level required on the gate terminal of a transistor in order to allow a current to flow from Vdd to Vss, 7

VLSI Very Large Scale Integration. The term commonly used to describe integrated circuits consisting of thousands or more transistors, 7

Appendix A: Published Papers

Towards an Energy Efficient Branch Prediction Scheme Using Profiling, Adaptive Bias Measurement and Delay Region Scheduling

Michael Hicks, Colin Egan, Bruce Christianson, Patrick Quick
Compiler Technology and Computer Architecture Group (CTCA)
University of Hertfordshire, College Lane, Hatfield, AL10 9AB, UK
E-Mail: [email protected], [email protected]

Abstract—Dynamic branch predictors account for between a mispredicted branch will severely impact on performance. At 10% and 40% of a processor’s dynamic power consumption. the same time, power consumption must be kept to a minimum This power cost is proportional to the number of accesses made to ensure device usage longevity. to that dynamic predictor during a program’s execution. In this paper we propose the combined use of local delay region There is an equally valid counter argument that considers scheduling and profiling with an original adaptive branch bias the power cost of a highly accurate dynamic predictor is offset measurement. The adaptive branch bias measurement takes note by the effects of its accuracy [2]. Even though a dynamic of the dynamic predictor’s accuracy for a given branch and predictor uses a great deal of power, the increased prediction decides whether or not to assign a static prediction for that accuracy and improved processor performance it provides branch. The static prediction and local delay region scheduling information is represented as two hint bits in branch instructions. results in power saving by the reduced number of branch We show that, with the combined use of these two methods, the mispredictions, negating the necessity of stalling the processor. number of dynamic branch predictor accesses/updates can be This is because considerable power is consumed in a branch reduced by up to 62%. The associated average power saving is misprediction by the execution of instructions that cannot be very encouraging; for the example high-performance embedded allowed to be committed and recovery to some safe state. architecture n average global processor power saving of 6.22% is achieved. In this paper, we present an approach for reducing the number of accesses and updates made to a dynamic branch Keywords: Branch Prediction, Power Consumption, Biased predictor that combines scheduling the local delay region with Branches, Profiling. profiling using an adaptive bias measurement. In the latter half of this paper, encouraging experimental results are presented. I.INTRODUCTION The results detail the extent to which the number of dynamic The latency associated with branch instructions can be branch predictor accesses can be reduced, and also the amount overcome by various means; these include branch predic- of power that can be saved in an example architecture. tion (dynamic and static), hardware multithreading, delayed branches and branch removal by aggressive static instruction II.PREDICTING BIASED BRANCHES scheduling. Currently, state-of-the-art processors tend to use dynamic branch prediction, but the use of dynamic predictors In many cases the direction of a branch tends to be biased can consume large amounts of the silicon space and they can to either the taken path or to the not-taken path and therefore also consume large amounts of the power budget. In current demonstrates a skewed distribution, which is alternatively processors, a branch predictor can consume between 10% and referred to as bimodal. Using profiling, the frequency of 40% of the overall CPU power budget [1]. The power cost is executed branch paths can be determined and then used as directly proportional to the number of accesses made to the a basis for predicting future runs of the program [4] [5]. In dynamic predictor [2] and the power cost of modern dynamic profile based prediction methodologies a static prediction-bit, branch predictors is comparable to that of a cache. 
In high or hint bit, is usually incorporated into the branch instruction performance architectures such costs may be acceptable, but format. This bit is used by the compiler to specify the we argue that such profligate use of silicon area is unlikely prediction direction based on the branch’s bias (taken or not- to be cost effective in low-power applications and will be an taken). Since they use the observed dynamic behaviour of unnecessary drain on power. The effective use of silicon space branches, profile based branch prediction schemes differ from and low power consumption is crucial for the embedded pro- other static branch prediction schemes that use compile time cessor market. With the increasing pipeline depth of embedded heuristics. Also unlike dynamic branch prediction schemes, processors, the accuracy of branch prediction is now becoming static profiling does not require large amounts of hardware an important factor for embedded processor performance and support. In this paper we use statically profiled branch prediction in ecution with a sample dataset [4] [5]. This means each branch conjunction with a dynamic predictor. The idea of statically instruction can be monitored in the form of a program trace profiled branch prediction is to avoid accessing the dynamic by a detailed history of selected instructions and any relevant predictor whenever possible, thereby saving power. However, information extracted and used to form profiled static pre- the profiled static prediction must be accurate to ensure that dictions where possible. A profiler is any application/system there is no impact on dynamic prediction accuracy, since which can produce such data by observing a running program. inaccurate predictions are expensive in terms of both perfor- The number of datasets that any given program is profiled with mance and power [6] [2] [7]. In the static code the number will affect the likely ‘real’ accuracy of the profiling results. of biased branches appears to be small, but during program Using a diverse range of datasets means the results will be execution biased branches tend to be executed repeatedly and more widely applicable. are therefore executed frequently [8]. Dynamic prediction is The advantage of profiling over heuristic static code analysis unnecessary for such biased branches as the direction of the is that a more precise boundary, or bias percentage, can be set branch can be statically profiled. This approach will reduce for what constitutes a branch as heavily biased, and hence the number of accesses to the dynamic predictor and therefore could be removed from dynamic prediction. Profiled traces save power. permit the exact bias of a branch instruction to be known Using some form of hint bits, a profiled static prediction resulting in higher prediction accuracies. can be used either to bypass the dynamic predictor or be used as a fallback when no prediction information is available B. Adaptive Bias Measurement dynamically. To save power in embedded processors it is Profiling will be out performed by dynamic prediction for desirable to remove dynamic predictions whenever possible. many branches unless the bias level is set so high that only The drawback to removing biased branches from dynamic extremely biased branches are removed and therefore profiling prediction using traditional compiler loop analysis is that any should be used with due caution. 
Consequently, we only static prediction reflected by this method will almost always only assign a profiled prediction to a branch where avoiding be less accurate than a dynamic prediction. dynamic prediction has no significant negative impact on that This drawback is intertwined with the definition of a biased branch’s dynamic prediction accuracy. branch. Previous approaches have used a fixed bias level [9], When profiling each branch in a program’s execution, an or, in effect, no particular bias level at all; a branch is simply ideal profiler records the directional history for each branch, marked as “likely to be taken” or “unlikely to be taken”. Scant and also the prediction history [10] [5]. From this record regard is given to how this will reconcile with the behaviour or trace, we compute whether a branch’s bias is equal to, of the dynamic predictor in which it will be executing, and or greater than its associated prediction accuracy from the often the dynamic predictor will be more accurate [10]. Con- dynamic predictor. This computation is key to the results we sequently, branch removal in this way impacts on performance present in this paper. Assuming a program is profiled against and increases power consumption. an adequately varied data set, we show that these branches can In our approach we take such problems into account and we safely be removed from dynamic prediction through the use of propose the use of the dynamic bias measurement technique profiled hint bits. This approach to bias measurement also has with profiling and local delay region scheduling. In local a beneficial side-effect that it removes a significant number delay region scheduling the compiler schedules instructions of difficult-to-predict branches. Furthermore, a branch with a from the same basic-block into the delay region following a very low prediction accuracy, but a higher bias will be caught branch instruction. These instructions, are those that would be by this method. Difficult-to-predict branches have a significant executed irrespective of branch outcome and such that program impact on both performance and power consumption [6] [8]. semantics remain unaffected. IV. LOCAL DELAY REGION III.ADAPTIVE BIAS MEASUREMENT THROUGH Local delay region scheduling is the process of scheduling PROFILING branch independent instructions from before the branch in the Removing branches statically has traditionally been used same basic-block into the delay slots to be executed by the either as an alternative to branch prediction, or, when used processor after the branch. A branch independent instruction in conjunction with a dynamic predictor, it has been used as is any instruction whose result is not directly, or indirectly, a fallback mechanism. The limitation of profiled static bias depended upon by the branch to compute its own behaviour. prediction from compiler branch heuristic analysis is accuracy. The locally scheduled delay region, for a given branch, is An improvement over simple heuristic static code analysis is executed irrespective of the branch direction outcome, and to profile the compiled program at runtime to monitor how it removes the need to predict for any branch where it can be behaves. used. Figure 1 demonstrates the process. It is not beneficial to use local delay region scheduling in A. Profiling well optimised code and, where the delay region is very large Profiling, in this case, refers to the observation of a given such as in deeply pipelined processors. 
This study is focused program, at the assembly/machine level, while undergoing ex- on the embedded market where the number of delay slots tend and hence avoid accessing the Branch Target-address Cache (BTC) for predicted taken branches; this is significant for power aware designs. In the case that the hint bits provide an incorrect prediction, the existing dynamic branch prediction logic is used to recover in the same way as a dynamic misprediction (case 4). Fig. 1. Local delayed branch scheduling The hardware modifications are shown in Figure 2. The block below the I-Cache represents a fetched example instruc- tion (in this case a hinted taken branch). The simple decoder is to be low. Local delay region scheduling can be useful by a very small piece of hardware required to decode the two hint careful use and leaving other branches untouched for dynamic bits from an instruction into the relevant processor signals. The prediction. Local delay region scheduling works particularly label for the local delay region is simply to indicate that this well for an unconditional absolute branch that has a fixed hint must be used later (in the exe pipeline stage). Additionally, target address. Consequently, we propose the use of the local outside of this diagram, during execution pipeline stage, the delay region in conjunction with adaptive bias measurement. two hint bits must also be examined to ensure that no update V. H/W IMPLEMENTATION occurs to the branch predictor which is also very important for power aware processors. The hardware modifications required to convey information about static predictions, and therefore avoid the dynamic predictor for a given branch, are relatively simple. Many modern processors already predecode instructions to determine whether to access the dynamic predictor unit; in which case, hint information need only be included with the branch instructions themselves. Some modern embedded instruction sets [11] already include hint bits in branch instruc- tions, although they are only used as a fallback. We incorporate two hint bits into the branch instruction format, where the two hint bits provide profiled branch behaviour information. To our knowledge no compiler makes use of the two hint bits as we describe in this paper, which represent the following branch behaviour: 1) Statically predict taken. Do not access, or update, the Fig. 2. A simplified block diagram representation of the hardware modi- dynamic predictor for this branch. fications required in the Instruction Fetch stage for the hardware simplicity approach, in order to implement the universal hint bits 2) Statically predict not-taken. Do not access, or update, the dynamic predictor for this branch. 3) Use the locally scheduled delay region. Do not access, VI.SIMULATIONS or update, the dynamic predictor for this branch. This section details the process of implementation and 4) Use the dynamic predictor. testing of our dynamic branch prediction reduction methods, By default all instructions would be set to case 4. The and the associated hardware modifications. algorithm that describes how these hint bits are set is explained in the following section. A. EEMBC and Wattch In case 1, we have hinted that the branch should always We use the Electronic Embedded Microprocessor Bench- be assumed taken. This means that no access is required to mark Consortium (EEMBC) benchmarks [12]. EEMBC was the direction prediction logic in the branch predictor unit. 
chosen instead of the SPEC benchmarks as they represent a However, the processor does not know until the decode stage more appropriate target for this type of algorithm. what the target address will be. Rather than any complexities EEMBC [12] is a benchmark suite consisting of around of computation, this is largely down to there being several dif- forty separate benchmarks that are divided into five sections, ferent formats for branch instructions, and thus the position of or subsuites: Automotive, Consumer, Networking, Office and the target bits in the instruction is not known. Our simulations Telecom. Each subsuite represents a further specialisation have shown us that this is not the problem it seems at first: towards a particular behaviour characterisation. Table I shows over 75% of dynamic branches found to be dynamically biased a simple breakdown of the five subsuites in EEMBC. Every fall into a single instruction format - typically a single kind of benchmark in the suite was executed to completion to generate offset-branch (another 18% are absolute jump instructions, but the results shown in the next subsection. for these we use local delay region scheduling). This means We use a modified version of Wattch [13], which itself is that with minimal hardware modification, simple logic can be a variation of the SimpleScalar processor simulator. Wattch introduced to produce the target address for case 1 branches, uses the Portable Instruction Set [Architecture] (PISA), which TABLE I THE FIVE EEMBC SUBSUITESWITHALISTOFSOMEOFTHETYPESOF BENCHMARKSCONTAINEDWITHINEACHSUITE

SubSuite Benchmarks (selection of largest) Automotive Angle-To-Time Conversion, Fast Fourier Transform, Matrix Math... Consumer JPEG Compression/Decompression, RGB to CMYK, Grayscale image filter... Networking IP Reassembly, Network Address Translation, Route Lookup... Fig. 3. Block model of the profiling and hinting regime Office Bezier Curve Interpolation, Floyd-Stein Grayscale Dithering, Bitmap Rotation... Telecom Autocorrelation, Convolutional Encoder, Viterbi Decoder... produced using a different ‘test’ dataset to ensure that no bias to a particular dataset was represented. is is a variant of the Mips instruction set. The Wattch pipeline C. Simulation Results has seven stages, and two branch delay slots before branch The results presented and discussed in this section were resolution. Local delay region scheduling was used to remove produced using the processor configuration shown in Table II. dynamic predictor accesses for the unconditional absolute- Since the first part of these results is primarily observing jump instruction formats, and our profiled adaptive bias mea- branch instruction accuracy, the most important variable to surement was used to remove dynamic predictor accesses for note is branch predictor used. The GShare predictor [14] appropriately biased offset branch format instructions. The was chosen for it’s accuracy, size and general applicability. static prediction/delay region usage information was repre- The other specifications of the system were selected as a sented using two redundant bits in the branch instructions representation of a modest high-performance embedded CPU of the PISA instruction set. The required simulated hardware for use in applications such as PDAs and mobile telephones. modifications were minimal and were configured as described TABLE II in the previous section. BASELINE CONFIGURATION USED TO GENERATE THE RESULTS SHOWN IN B. Algorithm THISSECTION The algorithm used to configure the two hint bits in each HWattch Parameter Value branch instruction was implemented as shown in Algorithm 1. Issue/Decode Width 2 Instructions Delay Slots 3 Branch Predictor GShare Input: All Assembly Files of Programs Direction Predictor Table Entries 1024 Output: Appropriately Hinted Assembly Files BP History Width 8Bits foreach Program do BTB Sets/Associativity 512 Entries/4 Way Data Cache Size 1024 Sets/64bit Block Size/4Way foreach Assembly File do Instruction Cache Size 512 Sets/32bit Block Size foreach Branch Instruction do Clock Gating Regime Modelled Aggressive conditional clocking Initially, set hint bits to “Use the dynamic (non-ideal) 15% power dissipation with zero accesses predictor for this branch” Compiler for EEMBC gcc -O2 if Branch == Unconditional Branch then Set hint bits to use local delay region and Table III shows the occurrence of different branch instruc- move two instructions preceding branch tions across the execution of the entire EEMBC benchmark into delay region (if possible) else suite. The static occurrence refers to the proportion of branches if Branch’s Profiled Bias ≥ Dynamic accounted for by a particular branch type in the static as- Branch Predictor’s Accuracy for this sembly code. Dynamic occurrence refers to the proportion of Branch then branches accounted for dynamically by a particular branch Set hint bits to Predict Profiled Bias type. Additionally, the third column shows whether this branch end type can be removed from dynamic prediction by either of end the techniques proposed. 
C. Simulation Results

The results presented and discussed in this section were produced using the processor configuration shown in Table II. Since the first part of these results is primarily observing branch instruction accuracy, the most important variable to note is the branch predictor used. The GShare predictor [14] was chosen for its accuracy, size and general applicability. The other specifications of the system were selected as a representation of a modest high-performance embedded CPU for use in applications such as PDAs and mobile telephones.

TABLE II
BASELINE CONFIGURATION USED TO GENERATE THE RESULTS SHOWN IN THIS SECTION

HWattch Parameter                 | Value
Issue/Decode Width                | 2 Instructions
Delay Slots                       | 3
Branch Predictor                  | GShare
Direction Predictor Table Entries | 1024
BP History Width                  | 8 Bits
BTB Sets/Associativity            | 512 Entries/4 Way
Data Cache Size                   | 1024 Sets/64bit Block Size/4 Way
Instruction Cache Size            | 512 Sets/32bit Block Size
Clock Gating Regime Modelled      | Aggressive conditional clocking (non-ideal); 15% power dissipation with zero accesses
Compiler for EEMBC                | gcc -O2

Table III shows the occurrence of different branch instructions across the execution of the entire EEMBC benchmark suite. The static occurrence refers to the proportion of branches accounted for by a particular branch type in the static assembly code. Dynamic occurrence refers to the proportion of branches accounted for dynamically by a particular branch type. Additionally, the third column shows whether this branch type can be removed from dynamic prediction by either of the techniques proposed. From this table we can see that the majority of dynamic branches are potential candidates for removal by either technique. Why a particular method can be used, or not, is explained in the previous section.

TABLE III
STATIC AND DYNAMIC BRANCH OCCURRENCE FOR EACH PISA BRANCH TYPE, AND WHICH DYNAMIC ACCESS REMOVAL METHOD CAN BE USED

Branch Instruction | Static Occurrence | Dynamic Occurrence | Applicable Method
j                  | 10.21%            | 17.31%             | Local Delay Region
jal                | 33.95%            | 3.58%              | Local Delay Region
jr                 | 15.54%            | 3.55%              | None Used
jalr               | 2.32%             | 0.04%              | None Used
beq                | 18.18%            | 20.23%             | Bias Profiling
bne                | 16.46%            | 50.09%             | Bias Profiling
blez               | 1.52%             | 2.58%              | Bias Profiling
bgtz               | 0.27%             | 1.04%              | Bias Profiling
bltz               | 0.48%             | 0.39%              | Bias Profiling
bgez               | 1.06%             | 1.19%              | Bias Profiling

1) Dynamic Predictor Access Reduction: Table IV shows the success of using local delay scheduling and adaptive bias measurement to remove accesses to the branch predictor. All values shown in Table IV are averages for each subsuite. Averages were used due to the vast number of benchmarks in each suite, and also because of the high behavioural similarities of each suite. The average results were calculated by taking the total across a whole subsuite and using it as the divisor for the sum of the measured variable across the whole subsuite. For instance, 'Static Hint Rate' was calculated by summing all statically hinted branches across an entire subsuite, and dividing by the entire subsuite sum of branches – not by unweighted averaging. Figure 4 represents the same results in graphical form with the addition of an overall average.

Fig. 4. The average percentage access reduction to the dynamic branch predictor for each EEMBC Subsuite

TABLE IV
RESULTS OF LOCAL DELAY REGION AND ADAPTIVE BIAS HINTING FOR EACH EEMBC SUBSUITE

SubSuite   | Dynamic Branch Rate | Static Hint Rate | Dynamic Access Reduction | Dynamic Stream Change
Automotive | 17.57%              | 22.10%           | 56.42%                   | 0.92%
Consumer   | 17.22%              | 30.80%           | 63.53%                   | 0.49%
Networking | 24.56%              | 10.57%           | 62.50%                   | 0.48%
Office     | 19.57%              | 30.76%           | 46.82%                   | 0.1%
Telecom    | 12.37%              | 20.49%           | 51.71%                   | -0.03%
Average    | 18.59%              | 18.29%           | 62.01%                   | 0.48%

Each column in Table IV demonstrates the following:

• Dynamic Branch Rate – The percentage of instructions in the dynamic stream that were branch instructions
• Static Hint Rate – The percentage of static branches that were hinted
• Dynamic Access Reduction – The resulting reduction in accesses to the dynamic branch predictor (both the direction predictor and BTB)
• Dynamic Stream Change – The change in size (number of instructions executed) of the dynamic instruction stream as a result of the static hinting

These results demonstrate the effectiveness of the combination of local delay region scheduling and adaptive bias measurement for removing the need to dynamically predict for many branches. The overall average of 62% dynamic branch prediction reduction is extremely promising. Unsurprisingly, the Consumer subsuite performed most successfully with the algorithm. This is because of the highly cyclic nature of many of the algorithms included in this subsuite: JPEG compression and decompression for instance.

Most importantly, we can see that our dynamic branch prediction reduction algorithm has no significant detrimental effects on the performance of the program: Table IV shows that the size of the dynamic instruction stream was not significantly expanded with additional instructions from increased misprediction. In fact, in many individual cases, the number of instructions executed was reduced. This is likely accounted for by the removal of poorly dynamically predicted branches; a branch with a bias greater than its accuracy is automatically removed from dynamic prediction.

Although the average results in Table IV are representative, there were some intrasubsuite exceptions. Notably these were: the Angle-To-Time Conversion benchmark in Automotive and the Viterbi Decoder benchmark in Telecom. These had dynamic prediction removal percentages of 11% and 35%, respectively.
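As a concrete illustration of the weighted averaging described above, the following sketch computes a subsuite 'Static Hint Rate' by summing counts across all benchmarks before dividing; the per-benchmark counts are invented purely for illustration.

# Each entry is (statically hinted branches, total static branches) for one
# benchmark in a subsuite; the numbers are made up for illustration.
consumer = [(1200, 4100), (310, 900), (5400, 17000)]

hinted = sum(h for h, _ in consumer)
total = sum(t for _, t in consumer)
weighted_rate = hinted / total                    # one rate for the whole subsuite

# Contrast with naively averaging the per-benchmark rates:
unweighted = sum(h / t for h, t in consumer) / len(consumer)
print(f"weighted: {weighted_rate:.2%}, unweighted: {unweighted:.2%}")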
2) Subsequent Power Saving: After simulating the number of dynamic branch predictor accesses that could be removed for the predictor used in our example architecture we then used our variant of Wattch to model the amount of power that can be saved. The dynamic branch predictor accounts for between 10% and 15% of global power dissipation in the example architecture when all branches use dynamic prediction. Demonstrating how much power can be saved is indicative only for the example architecture used. The amount of power saved in the branch predictor itself is generally proportional to the dynamic access reduction as a result of the application of our algorithm. However, the power saved in the branch predictor gives no indication of the global power saved over the whole processor, and also does not take into account any additional delay incurred by the use of the access reduction algorithm. Providing global processor power results is thus useful, but the results depend on the relative size of the rest of the processor compared to the dynamic branch predictor unit.

The average global processor power savings per committed instruction, for the architecture in Table II, are shown in Table V. The results per committed instruction were calculated by dividing the total global power consumed during a program's execution by the number of committed instructions. The power saving per committed instruction implicitly takes into account any change in the size/delay of the instruction stream as the number of committed instructions remains the same for all test executions of the benchmarks; an increase of the number of instructions executed after the application of the reduction algorithm would scale the power saved in the branch predictor when calculated for the committed instructions even though branch predictor accesses may have been reduced.

TABLE V
AVERAGE POWER SAVING PER COMMITTED INSTRUCTION FOR NON-IDEAL CLOCK GATING REGIME AND *IDEAL CLOCK GATING REGIME

EEMBC Subsuite | Average Power Saving | Best/Worst
Automotive     | 5.43% (*12.38%)      | 14.87% / 10.15%
Consumer       | 6.17% (*12.69%)      | 10.66% / 9.73%
Networking     | 6.84% (*14.53%)      | 14.73% / 10.60%
Office         | 5.66% (*13.56%)      | 11.38% / 10.07%
Telecom        | 4.10% (*10.07%)      | 11.14% / 9.21%
Overall        | 6.22% (*13.47%)      | N/A

The standard power saving results in Table V are for the non-ideal clock gating regime described in the processor specification in Table II. However, we have additionally included the power saving results for a more ideal clock gating algorithm with close to zero dissipation on zero accesses. These additional results are denoted with an asterisk (*).

The amount of power saved, for the non-ideal clock gating regime, averaged at 6.22% global power saving across the EEMBC benchmark suites. This result is significant and very encouraging; in architectures where the branch predictor is relatively more expensive (in terms of power) this figure will be higher.

Figure 5 shows the standard, non-ideal results, but also includes the average power saving per instruction as if no accesses were made to the dynamic branch predictor (Free BP – whilst still maintaining the same prediction accuracy). This allows a comparison between the success of the power saving and the absolute ceiling value possible.

Fig. 5. Average power saving per committed instruction

Although Figure 5 shows that the algorithm is not 'perfect', we must remember that not all branch prediction accesses are removed and as such it will not be possible to be ideal without impacting heavily on performance, and thus power.

VII. CONCLUSIONS AND FUTURE WORK

Dynamic branch predictors cannot be removed from processors while high performance and low power consumption are issues [6]. However, the results in this paper have shown that, in an embedded context, the number of accesses that need to be made throughout a program's execution can be dramatically reduced: in this example architecture by 62%. While previous attempts at power saving have focused on the introduction of hardware units to monitor dynamic behaviour, our approach can achieve similar levels of access reduction, but without the need to significantly modify hardware. The number of dynamic predictor accesses that can be removed with this approach is highly dependent on the accuracy of the dynamic predictor being used. In this paper we used a very accurate dynamic predictor, but a higher reduction can be achieved in architectures with a less accurate dynamic branch predictor. Additionally, this approach is applicable to both SuperScalar and VLIW processors.

When considering the predecode logic that already exists in most processors, the hardware modifications are minor and easy to accommodate. The time required to simulate, profile, assign static predictions and generate results was under 90 minutes for the entire EEMBC suite on a standard modern desktop machine, and this process is required only once before a program's distribution. For these small costs, an average power saving of at least 6% is highly attractive for processors in embedded devices that account for a large proportion of the whole device's limited energy budget.

REFERENCES

[1] Parikh, D., Skadron, K., Zhang, Y., Barcella, M., Stan, M.R.: Power issues related to branch prediction, IEEE HPCA (2002)
[2] Egan, C., Hicks, M., Christianson, B., Quick, P.: Enhancing the i-cache to reduce the power consumption of dynamic branch predictors, IEEE Digital System Design (July 2005)
[3] Seng, J.S., Tullsen, D.M.: Exploring the potential of architecture-level power optimizations, PACS (2003)
[4] Fisher, J.A., Freudenberger, S.: Predicting conditional branch directions from previous runs of a program. In: Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, Boston (October 1992)
[5] Hicks, M., Egan, C., Christianson, B., Quick, P.: HTracer: A dynamic instruction stream research tool, IEEE Digital System Design (July 2005)
[6] Parikh, D., Skadron, K., Zhang, Y., Stan, M.: Power aware branch prediction: Characterization and design. IEEE Transactions On Computers 53(2) (February 2004)
[7] Martin, A.J., Nystrom, M., Penzes, P.L.: Et2: A metric for time and energy efficiency of computation (2003)
[8] Vintan, L., Gellert, A., Florea, A., Oancea, M., Egan, C.: Understanding prediction limits through unbiased branches. In: Advances In Computer Systems Architecture. Volume 4186 of Lecture Notes In Computer Science, Springer-Verlag (September 2006) 480-487
[9] Jacobsen, E., Rotenberg, E., Smith, J.: Assigning confidence to conditional branch predictions, IEEE 29th International Symposium on Microarchitecture (1996)
[10] Hicks, M., Egan, C., Christianson, B., Quick, P.: Reducing the branch power cost in embedded processors through static scheduling, profiling and superblock formation. In: Advances In Computer Systems Architecture (September 2006)
[11] IBM: PowerPC Instruction Set Manual (2005)
[12] Levy, M.: The embedded microprocessor benchmark consortium. Online (2005)
[13] Brooks, D., Tiwari, V., Martonosi, M.: Wattch: a framework for architectural-level power analysis and optimizations, 27th annual international symposium on Computer architecture (2000)
[14] Egan, C.: Dynamic Branch Prediction In High Performance Super Scalar Processors. PhD thesis, University of Hertfordshire (August 2000)

Reducing the Branch Power Cost In Embedded Processors Through Static Scheduling, Profiling and SuperBlock Formation

Michael Hicks, Colin Egan, Bruce Christianson, Patrick Quick

Compiler Technology and Computer Architecture Group (CTCA) University of Hertfordshire, College Lane, Hatfield, AL10 9AB, UK [email protected]

Abstract. Dynamic branch predictor logic alone accounts for approximately 10% of total processor power dissipation. Recent research indicates that the power cost of a large dynamic branch predictor is offset by the power savings created by its increased accuracy. We describe a method of reducing dynamic predictor power dissipation without degrading prediction accuracy by using a combination of local delay region scheduling and run time profiling of branches. Feedback into the static code is achieved with hint bits and avoids the need for dynamic prediction for some individual branches. This method requires only minimal hardware modifications and coexists with a dynamic predictor.

1 Introduction

Accurate branch prediction is extremely important in modern pipelined and MII microprocessors [10] [2]. Branch prediction reduces the amount of time spent executing a program by forecasting the likely direction of branch assembly instructions. Mispredicting a branch direction wastes both time and power, by executing instructions in the pipeline which will not be committed. Research [8] [3] has shown that, even with their increased power cost, modern larger predictors actually save global power by the effects of their increased accuracy. This means that any attempt to reduce the power consumption of a dynamic predictor must not come at the cost of decreased accuracy; a holistic attitude to processor power consumption must be employed [7] [9]. In this paper we explore the use of delay region scheduling, branch profiling and hint bits (in conjunction with a dynamic predictor) in order to reduce the branch power cost for mobile devices, without reducing accuracy.

2 Branch Delay Region Scheduling

The branch delay region is the period of processor cycles following a branch instruction in the processor pipeline, before branch resolution occurs. Instructions can fill this gap either speculatively, using branch prediction, or by the use of scheduling. The examples in this section use a 5 stage MIPS pipeline with 2 delay slots.

2.1 Local Delayed Branch

In contrast to scheduling into the delay region from a target/fallthrough path of a branch, a locally scheduled delay region consists of branch independent instructions that precede the branch (see Figure 1). A branch independent instruction is any instruction whose result is not directly or indirectly depended upon by the branch to calculate its own behaviour.

Fig. 1. An example of local delayed branch scheduling.

Deciding which instructions can be moved into the delay region locally is straightforward. Starting with the instruction from the bottom of the given basic block in the static stream, above the branch, examine the target register operand. If this target register is NOT used as an operand in the computation of the branch instruction then it can be safely moved into the delay region. This process continues with the next instruction up from the branch in the static stream, with the difference that this time the scheduler must decide whether the target of the instruction is used by any of the other instructions below it (which are in turn used to compute the branch). Local Delay Region Scheduling is an excellent method for utilising the delay region where possible; it is always a win and completely avoids the use of a branch predictor for the given branch. The clear disadvantage with local delay region scheduling is that it cannot always be used. There are two situations that result in this: well optimised code and deeply pipelined processors (where the delay region is very large). It is our position that, as part of the combined approach described in this paper, the local delay region is profitable.
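A minimal sketch of this selection process is given below, assuming instructions are modelled only as sets of target and source registers; a real scheduler must also respect memory dependences and side effects.

def schedule_local_delay(block, branch_sources, num_slots=2):
    """Select up to num_slots branch independent instructions, scanning
    upwards from just above the branch. block is a list of
    (targets, sources) register-set pairs; branch_sources holds the
    registers the branch itself reads."""
    live = set(branch_sources)   # registers the branch (transitively) depends on
    chosen = []
    for targets, sources in reversed(block):
        if live & targets:
            # This instruction feeds the branch computation: it must stay,
            # and whatever it reads joins the dependence set.
            live |= sources
        else:
            chosen.append((targets, sources))
            if len(chosen) == num_slots:
                break
    return chosen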

3 Profiling

Suppose that we wish to associate a reliable static prediction with as many branches as possible, so as to reduce accesses to the dynamic branch predictor of a processor at runtime (in order to save power). This can be achieved to a reasonable degree through static analysis of the assembly code of a program; it is often clear that branches in loops will commonly be taken and internal break points not-taken.

Fig. 2. The profiler is supplied with parameters for the program and the traces/statistics to be logged

A more reliable method is to observe the behaviour of a given program while undergoing execution with a sample dataset [4]. Each branch instruction can be monitored in the form of a program trace and any relevant information extracted and used to form static predictions where possible. A profiler is any application/system which can produce such data by observing a running program (see Figure 2). The following two sections examine the possibility of removing certain classes of branch from dynamic prediction by the use of run-time profiling.

3.1 Biased Branches

One class of branches that can be removed from dynamic prediction, without impacting on accuracy, are highly biased branches. A biased branch is a branch which is commonly taken or not taken, many times in succession before possibly changing direction briefly. The branch has a bias to one behaviour. These kinds of branches can, in many cases, be seen to waste energy in the predictor since their predicted behaviour will be almost constantly the same [5] [8]. The principles of spatial and temporal locality intuitively tell us that biased branches account for a large proportion of the dynamic instruction stream. Identifying these branches in the static code and flagging them with an accurate static prediction would enable them to be executed without accessing the dynamic predictor. The profiler needs to read the static assembly code and log, for each branch instruction during profiling, whether it was taken or not taken at each occurrence.
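A sketch of how such logging might be reduced to a per-branch bias figure, assuming the profiler emits a simple list of (branch address, taken) events:

from collections import defaultdict

def measure_bias(trace):
    taken = defaultdict(int)
    seen = defaultdict(int)
    for pc, was_taken in trace:          # one event per dynamic occurrence
        seen[pc] += 1
        taken[pc] += int(was_taken)
    bias = {}
    for pc in seen:
        t = taken[pc] / seen[pc]
        # Bias is the strength of the dominant direction, whichever it is.
        bias[pc] = (max(t, 1.0 - t), "taken" if t >= 0.5 else "not-taken")
    return bias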

3.2 Difficult to Predict Branches (Anti Prediction)

Another class of branch instructions that would be useful to remove from dynamic branch predictor accesses are difficult to predict branches. In any static program there are branches which are difficult to predict and which are inherently data driven. When a prediction for a given branch is nearly always likely to be wrong, there is little point in consuming power to produce a prediction for it since a number of stalls will likely be incurred anyway [5] [8] [6]. Using profiling, it is possible to locate these branches at runtime using different data sets and by monitoring every branch. The accuracy of each dynamic prediction is required rather than just a given branch's behaviour. For every branch, the profiler needs to compare the predicted behaviour of the branch with the actual behaviour. In the case of those branch instructions where accuracy of the dynamic predictor is consistently poor, it is beneficial to flag the static branch as difficult to predict and avoid accessing the branch predictor at all, letting the processor assume the fallthrough path. Accordingly, filling the delay region with NOP instructions wastes significantly less power than executing instructions that are unlikely to be committed.
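Under the assumption that the profiler also logs the dynamic predictor's outcome for each branch occurrence, flagging these difficult branches might look like the following sketch:

from collections import defaultdict

def find_difficult_branches(trace, threshold=0.5, min_occurrences=100):
    correct = defaultdict(int)
    seen = defaultdict(int)
    for pc, predicted_taken, actually_taken in trace:
        seen[pc] += 1
        correct[pc] += int(predicted_taken == actually_taken)
    # Consistently poor accuracy -> flag as difficult (anti prediction).
    return {pc for pc in seen
            if seen[pc] >= min_occurrences
            and correct[pc] / seen[pc] < threshold}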

4 Combined Approach Using Hint Bits

The main goal of the profiling techniques discussed previously can only be realised if there is a way of storing the results in the static code of a program, which can then be used dynamically by the processor to avoid accessing the branch prediction hardware [3].

Fig. 3. Block diagram of the proposed scheduling and hinting algorithm. The dotted box indicates the new stages introduced by the algorithm into the creation of an executable program

The combined approach works as follows:

1. Compile the program, using GCC for instance, into assembly code.
2. The Scheduler parses the assembly code and decides for which branch instructions the local delay region can be used (see section 2.1).
3. The Profiler assembles a temporary version of the program and executes it using the specified data set(s). The behaviour of each branch instruction is logged (see section 3).
4. The output from the profiling stage is used to annotate the delay scheduled assembly code.
5. Finally, the resulting annotated assembly code is compiled and linked to form the new executable.
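A hypothetical driver for this five step flow is sketched below; the command names scheduler, profiler and annotate are illustrative stand-ins, not the actual tools used by the authors.

import subprocess

def build_hinted_executable(source, dataset):
    run = lambda *cmd: subprocess.run(cmd, check=True)
    run("gcc", "-S", "-O2", source, "-o", "prog.s")        # 1. compile to assembly
    run("scheduler", "prog.s", "-o", "prog.sched.s")       # 2. local delay scheduling
    run("profiler", "prog.sched.s", "--data", dataset,
        "-o", "profile.log")                               # 3. profile with sample data
    run("annotate", "prog.sched.s", "profile.log",
        "-o", "prog.hinted.s")                             # 4. apply hint annotations
    run("gcc", "prog.hinted.s", "-o", "prog")              # 5. assemble and link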

The exact number of branches that can be eliminated from runtime predictor access in the target program depends upon the tuning of the profiler and the number of branches where the local delay region can be used.

4.1 Hint Bits

So far we have described a process of annotating branch instructions in the static assembly code to reflect the use of the local delay region and of the profiling results. The way this is represented in the assembly/machine code is by using an existing method known as hint bits (though now with the new function of power saving). The four mutually exclusive behaviour hints in our algorithm which need to be stored are:

1. Access the branch predictor for this instruction.
2. or Assume this branch is taken (don't access dynamic predictor logic).
3. or Assume this branch is not taken (don't access dynamic predictor logic).
4. or Use this branch's local delay region (don't access dynamic predictor logic).

The implementation of this method requires two additional bits in an instruction. Whether these bits are located in all of the instruction set or just branches is discussed in the following section. Another salient point is that the information in a statically predicted taken branch replaces only the dynamic direction predictor in full; the target of the assumed taken branch is still required. Accessing the Branch Target Buffer is costly, in terms of power, and must be avoided.

Most embedded architectures are Reduced Instruction Set Computers [8]. Part of the benefit of this is the simplicity of the instruction format. Since most embedded systems are executing relatively small programs, many of the frequently iterating loops (the highly biased branches, covered by the case 2 hint) will be PC relative branches. This means that the target address for a majority of these branches will be contained within a fixed position inside the format. This does not require that the instruction undergo any complex predecoding, only that it is offset from the current PC value to provide the target address. Branch instructions that have been marked by the profiler as having a heavy bias towards a taken path, but which do not fall into the PC relative fixed target position category have to be ignored and left for dynamic prediction. The general 'hinting' algorithm:

1. Initially, set the hint bits of all instructions to: assume not taken (and do not access predictor).
2. Set hint bits to reflect use of the local delay region where the scheduler has used this method.
3. From profiling results, set hint bits to reflect taken biased branches where possible.
4. All remaining branch instructions have their hint bits set to use the dynamic predictor.
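One way to picture the two-bit encoding is sketched below; the concrete bit values and their position in the word are assumptions, since the text fixes only that there are four mutually exclusive hints held in two redundant instruction bits.

HINT_USE_PREDICTOR    = 0b00  # access the dynamic branch predictor
HINT_ASSUME_TAKEN     = 0b01  # statically predicted taken, no predictor access
HINT_ASSUME_NOT_TAKEN = 0b10  # statically predicted not taken, no predictor access
HINT_USE_DELAY_REGION = 0b11  # local delay region used, no predictor access

HINT_SHIFT = 30  # hypothetical position of the two redundant bits

def set_hint(instruction_word, hint):
    """Clear the two hint bits and write the new hint value."""
    return (instruction_word & ~(0b11 << HINT_SHIFT)) | (hint << HINT_SHIFT)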

4.2 Hardware Requirements/Modifications

The two possible implementation strategies are:

Hardware Simplicity: Annotate every instruction with two hint bits. This is easy to implement in hardware and introduces little additional control logic. All non branch instructions will also be eliminated from branch predictor accesses. The disadvantages of this method are that it requires that the processor's clock frequency is low enough to permit an I-Cache access and branch predictor access in series in one cycle and that there are enough redundant bits in all instructions.

Hardware Complexity: Annotate only branch instructions with hint bits and use a hardware mechanism similar to a Prediction Probe Detector [8] to interpret hint bits. This has minimal effect on the instruction set. It also means there is no restriction to series access of the I-Cache then branch predictor. The main disadvantage is the newly introduced PPD and the need for instructions to pass through the pipeline once before the PPD will restrict predictor access.

Fig. 4. Diagram of required hardware modifications. The block below the I-Cache represents a fetched example instruction (in this case a hinted taken branch).

The hardware simplicity model offers the greatest power savings and is particularly applicable for the embedded market where the clock frequency is generally relatively low, thus a series access is possible. It is for these reasons that we use the hardware simplicity model. In order to save additional power, some minor modifications must be made to the Execution stage to stop the statically predicted instructions from expending power writing back their results to the predictor (since their results will never be used!). It can be seen that after a given program has had its hint bits set, all of the branches assigned static predictions (of taken or not taken) have now essentially formed superblocks, with branch resolution acting as a possible exit point from the newly formed superblock. When a hint bit prediction proves to be incorrect, it simply acts as a new source of a branch misprediction; it is left for the existing dynamic predictor logic to resolve.
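The fetch-stage behaviour implied by the hardware simplicity model can be summarised in the following sketch, reusing the assumed encoding from the earlier block; this is an illustration, not the actual hardware logic.

HINT_USE_PREDICTOR, HINT_ASSUME_TAKEN, HINT_ASSUME_NOT_TAKEN, HINT_USE_DELAY_REGION = 0b00, 0b01, 0b10, 0b11
HINT_SHIFT = 30  # hypothetical bit position, as before

def next_fetch_pc(pc, instr_word, branch_offset, predictor):
    hint = (instr_word >> HINT_SHIFT) & 0b11
    if hint == HINT_ASSUME_TAKEN:
        # PC-relative target computed directly from fixed-position offset bits;
        # neither the direction predictor nor the BTB is accessed or updated.
        return pc + 4 + (branch_offset << 2)
    if hint in (HINT_ASSUME_NOT_TAKEN, HINT_USE_DELAY_REGION):
        return pc + 4                    # assume fallthrough; no predictor access
    return predictor.predict(pc)         # remaining branches use the dynamic predictor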

5 Conclusion and Future Work

Branch predictors in modern processors are vital for performance. Their accuracy is also a great source of power saving, through the reduction of energy spent on misspeculation [8]. However, branch predictors themselves are often comparable to the size of a small cache and dissipate a non trivial amount of power. The work outlined in this paper will help reduce the amount of power dissipated by the predictor hardware itself, whilst not significantly affecting the prediction accuracy. We have begun implementing these modifications in the Wattch [1] power analysis framework (based on the SimpleScalar processor simulator). To test the effectiveness of the modifications and algorithm, we have chosen to use the EEMBC benchmark suite, which provides a range of task characterisations for embedded processors. Future investigation includes the possibility of dynamically modifying the hinted predictions contained within instructions to reflect newly dynamically discovered biased branches.

References

1. David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. 27th annual international symposium on Computer architecture, 2000.
2. Colin Egan. Dynamic Branch Prediction In High Performance Super Scalar Processors. PhD thesis, University of Hertfordshire, August 2000.
3. Colin Egan, Michael Hicks, Bruce Christianson, and Patrick Quick. Enhancing the I-Cache to Reduce the Power Consumption of Dynamic Branch Predictors. IEEE Digital System Design, July 2005.
4. Michael Hicks, Colin Egan, Bruce Christianson, and Patrick Quick. HTracer: A Dynamic Instruction Stream Research Tool. IEEE Digital System Design, July 2005.
5. Erik Jacobsen, Erik Rotenberg, and J.E. Smith. Assigning Confidence to Conditional Branch Predictions. IEEE 29th International Symposium on Microarchitecture, 1996.
6. J. Karlin, D. Stefanovic, and S. Forrest. The Triton Branch Predictor, October 2004.
7. Alain J. Martin, Mika Nystrom, and Paul L. Penzes. ET2: A Metric for Time and Energy Efficiency of Computation. 2003.
8. D. Parikh, K. Skadron, Y. Zhang, and M. Stan. Power Aware Branch Prediction: Characterization and Design. IEEE Transactions On Computers, 53(2), February 2004.
9. Dharmesh Parikh, Kevin Skadron, Yan Zhang, Marco Barcella, and Mircea R. Stan. Power Issues Related to Branch Prediction. IEEE HPCA, 2002.
10. David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware Software Interface. Morgan Kaufmann, second edition, 1998.

HTracer: A Dynamic Instruction Stream Research Tool

Michael Hicks, Colin Egan, Bruce Christianson, Patrick Quick

Compiler Technology and Computer Architecture Group (CTCA) University of Hertfordshire Hatfield, Herts, U.K. AL10 9AB [email protected]

Modern processor design research hinges as much upon understanding the nature of the dynamic instruction stream as it does on the function and performance of the underlying hardware. It is with this in mind, along with the absence of a suitable existing utility, that we have created the Hatfield Tracer (HTracer) – a cross-platform tool to aid future research by extracting information about the dynamic instruction stream of real world applications/benchmarks on real world machines.

A tracer, or trace tool, allows the user to view information and 'trace' the execution of a program or of particular instructions within that program. HTracer will monitor the execution of a specified program and dump the state of the processor either after every instruction or after those that have been specified for tracing. This enables the user to produce either a full trace of the entire dynamic instruction stream for a given program or for specific instructions of interest. The way this information is specified is such that one may request that all branch instructions, for instance, are traced and reported as a single item or separately as specific branch types. Along with this information, HTracer can also save information about the register file/instruction operands at each instruction occurrence. One way in which this is useful is that when a full trace has been conducted the results can be used as input to simulators (to investigate for instance branch prediction or caching) and avoid further (often lengthy) runtime evaluation of the traced program.

All of this is achieved by single step execution of the process being traced, which allows HTracer to interrupt the instruction stream and examine the state of the processor after each instruction step. After each step HTracer will examine the instruction being executed and compare it to the specified list of flexible masks; if a match occurs then the mnemonic specified with the mask is recorded along with the Instruction Pointer. At the very least the results from a trace will include the Instruction Pointer and the Instruction Mnemonic match at the corresponding occurrence in the dynamic stream, but if requested it can include almost any state information about the processor. This functionality allows for the gathering of detailed information on multiple program executions with varied input data – with which we can monitor the effect of many research areas such as compiler technologies, schedulers etc on the dynamic stream and the appearance of certain instructions.

Something which is key to the usefulness of HTracer is its ability to function on many different architectures – in fact any architecture for which a modern version of the GNU/Linux Kernel is available, thus allowing for the monitoring of the performance variation of a given modification across different architectures. These architectures include, but are not limited to: AMD64, PPC32/64, ARM Chips, Sparc32/64. For instance we could trace the dynamic instruction stream of some compiled embedded benchmarks (usage characterisations) from different compilers and on different platforms and then use the outputs (discussed below) to test the effects of the compilers and/or modifications on the dynamic stream.

Figure 1: Logical block diagram of the HTracer tool.
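To make the mask mechanism concrete, the sketch below shows one plausible shape for mask matching; the mask format and the instruction encodings are our assumptions, since the actual HTracer input format is not reproduced in this summary.

# Hypothetical HTracer-style mask matching: each mask pairs a mnemonic with
# an (opcode mask, opcode value) test applied to the raw instruction word.
MASKS = [
    ("branch", 0xFC000000, 0x10000000),  # made-up encodings for illustration
    ("jump",   0xFC000000, 0x08000000),
]

def match_instruction(instruction_word, ip, log):
    for mnemonic, mask, value in MASKS:
        if instruction_word & mask == value:
            # Record at least the Instruction Pointer and matched mnemonic.
            log.append((ip, mnemonic))
            break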

Figure 1 shows in simple form how HTracer can be used: we supply it with a program to trace, the parameters to that program, an input file expressing which instructions to trace (if not all) and finally details of what state information (such as the target of branches, register file and so on) we wish to be saved at the appearance of each of these instructions in the stream.

At the completion of a trace run we are furnished with a file containing all of the results for the requested trace. These can either be hand analysed (since we can turn on tagged output – improving the readability of the file) or we can use the simply formatted results as the input to another tool. These can include such things as branch prediction simulators (as a list of verified branch addresses and computed targets), trace driven architecture simulators or perhaps instruction based power usage estimation tools. Again a key advantage here is that we can then easily input traced data from various different architectures into a single simulator – this can be achieved since, as a by-product of the instruction-to-mnemonic trace mapping, all instructions can be recorded in the same general form across different architectures. Another output which can be obtained from HTracer is statistical information about instruction occurrence in the dynamic stream. Given the modular library design of HTracer, it is possible to either generate the results in the form of a file (as previously discussed) or as a stream which can be used dynamically as the input to other tools, allowing live trace input and opcode translation.

At the University of Hertfordshire HTracer is actively being used to generate source data for all of the above simulators (most notably in my own research project, investigating the effects of instruction scheduling and branch prediction on embedded processor power consumption) and as such is under constant development. The source code for HTracer can be made available to other parties upon request to the author.

Enhancing the I-cache to Reduce the Power Consumption of Dynamic Branch Predictors

Colin Egan, Michael Hicks, Bruce Christianson and Patrick Quick

Compiler Technology and Computer Architecture Group (CTCA), University of Hertfordshire, Hatfield, Hertfordshire, U.K. AL10 9AB [email protected]

Branch prediction is an important area of research for High Performance Processors [1]. With ever increasing depth of pipeline and an increase in the number of instructions issued in each clock cycle, the penalty associated with a mispredicted branch impacts increasingly severely on processor performance [2]. Over recent years branch prediction research has focussed on increasing prediction accuracy with scant regard for cost and power [2, 3]. For example, Patt argues that between 256K and 1024K bytes of the silicon budget should be devoted to branch prediction [4, 5], but battery life expectancy in embedded processor systems (such as mobile phones) and laptops is an increasingly important usage factor. In current processors a branch predictor can consume more than 10% of the power [6]. In high performance architectures this may be an acceptable cost, but in embedded processors and laptops such large power consumption is a major factor limiting the quality and quantity of branch prediction that can be attempted [7, 8].

There are various means to overcome the problems associated with branch instructions; these include branch prediction (dynamic and static), hardware multithreading, delayed branches and branch removal by static instruction scheduling. Currently state of the art processors tend to use dynamic predictors, but such predictors can consume large amounts of the silicon and the power budget. With the development of two-level predictors in the 1990s by Yale Patt's group [9] and by Pan, So and Rahmeh [10], researchers have reported very high prediction accuracy, but this success is only achieved by providing very large arrays of prediction counters or PHTs (Pattern History Tables). We argue that such profligate use of silicon area is unlikely to be cost effective in low-power applications and will be an unnecessary drain on power. The effective use of silicon space and low power consumption is crucial for the embedded processor market. With the increasing pipeline depth of embedded processors, the accuracy of branch prediction is now becoming an important factor for embedded processor performance and a mispredicted branch will severely impact on performance. At the same time, power consumption must be kept to a minimum to ensure device longevity.

Following the principle of locality, our proposal is to enhance each I-cache block with two additional bits. The first bit would indicate whether the instruction is a branch or not and the second bit would indicate if the branch was to be predicted taken or not taken. Since an I-cache is used to store previously executed instructions that are likely to be executed again in the near future, this branch identification information would be available after the first execution. Furthermore, in dynamic predictors (two-level and a traditional BTC) the most significant bit of the appropriate two-bit up/down saturating counter is used to furnish a prediction during the IF stage of the pipeline, and that counter is updated when the branch outcome is known at resolution. Consequently, a prediction is known well in advance of the next time that particular branch instruction is executed and its prediction bit could also be stored in the I-cache. Both bits would have to be set for the branch target address cache (BTC) to be accessed, in which case the BTC would be used to simply furnish a pre-computed target address. In the case of a branch predicted as not-taken, or the first time a branch has been encountered, or any other instruction, the branch predictor and the BTC would not be accessed, thereby reducing the power consumption. This means that power would only be consumed accessing a target address in the case of a predicted taken branch. We do not consider it appropriate to store the pre-computed branch target address in the I-cache as this would considerably add to the cost of the I-cache and increase its hardware complexity. As usual, power would be required at branch resolution to update the appropriate saturating counter. A small additional amount of power would also be required to update our enhanced fields in the I-cache.

It is our intention to use the well known SimpleScalar Tool set [11] in conjunction with the Wattch Power analysis tool [12] to quantify, evaluate and validate our ideas. We intend to undertake a comparative study using the branch predictors provided by the SimpleScalar tool set. We will quantify the prediction accuracies of these predictors and the amount of power they consume. We will then enhance the I-cache as described and re-evaluate the power consumption of the same predictors. We expect to see a realistic power saving with our enhanced I-cache and expect to be able to discuss our results at conference.

We also consider that a further power saving enhancement would be the addition of a third, taken-saturated, field in the I-cache. The justification behind this is that many branches are in fact in tight loops and are therefore repeatedly taken before loop exit. This means that the associated two-bit up/down field will be saturated and the field value and the prediction remain the same. So long as the branch continues to be taken, the addition of the saturating bit in the I-cache would mean there would be no need to update the prediction field at branch resolution, thereby saving more power by removing unnecessary accesses to the branch predictor. Now the BTC is only accessed in the case of a taken prediction and the predictor will not be accessed in the case of a taken saturated counter field. In all cases the BTC and the predictor must remain powered, because removing power would mean the BTC and predictor would require reinitialising each time they are accessed.
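The per-fetch decision implied by the proposal can be summarised in the following sketch; the field names on the cache line are illustrative assumptions.

# Sketch of the proposed I-cache enhancement: each cached instruction carries
# an is_branch bit and a predicted_taken bit (field names are illustrative).
def fetch_power_decision(cache_line):
    if cache_line.is_branch and cache_line.predicted_taken:
        # Both bits set: the BTC is accessed, purely to furnish the
        # pre-computed target address for the predicted-taken branch.
        return "access BTC for target"
    # Predicted not-taken, first encounter, or not a branch at all:
    # neither the direction predictor nor the BTC is accessed.
    return "fallthrough, no predictor/BTC access"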

References

[1] J. L. Hennessy and D. A. Patterson. "Computer Architecture: A Quantitative Approach". Third Edition, Elsevier/Morgan Kaufmann, San Mateo, California, 2002.
[2] C. Egan, G. B. Steven and L. N. Vintan. "Cached Two-Level Adaptive Branch Predictors with Multiple Stages". In Trends in Network and Pervasive Computing - ARCS 2002 (LNCS 2299), Springer-Verlag, 2002, pp. 179-191.
[3] C. Egan. "Dynamic Branch Prediction in High Performance Superscalar Processors". PhD thesis, University of Hertfordshire, U.K., 2000.
[4] D. C. Burger and J. R. Goodman. "Billion-Transistor Architectures". IEEE Computer, September 1997, pp. 46-49.
[5] Y. N. Patt, S. J. Patel, M. Evers, D. H. Friendly and J. Stark. "One Billion Transistors, One Uniprocessor, One Chip". Computer, 1997, pp. 51-57.
[6] D. Parikh, K. Skadron, Y. Zhang, M. Barcella and M. R. Stan. "Power issues related to branch prediction". Proc. Int. Conf. on High-Performance Computer Architecture, IEEE, 2002, pp. 233-242.
[7] S. W. Chung and S. B. Park. "A Low Power Branch Predictor to Selectively Access the BTB". In the Ninth Asia-Pacific Computer Systems Architecture Conference (ACSAC04), Beijing, China, September 2004.
[8] W. Shi, T. Zhang and S. Pande. "Static Techniques to Improve Power Efficiency of Branch Predictors". In the Ninth Asia-Pacific Computer Systems Architecture Conference (ACSAC04), Beijing, China, September 2004.
[9] T.-Y. Yeh and Y. N. Patt. "Two-level adaptive training branch prediction". In Proceedings of the 24th Annual International Symposium on Microarchitecture, November 1991, pp. 51-61.
[10] S.-T. Pan, K. So, and J. T. Rahmeh. "Improving the accuracy of dynamic branch prediction using branch correlation". In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992, pp. 76-84.
[11] D. C. Burger and T. M. Austin. "The SimpleScalar tool set, version 2.0". Computer Architecture News, June 1997, 25(3):13-25.
[12] D. Brooks, V. Tiwari, M. Martonosi. "Wattch: a framework for architectural-level power analysis and optimizations". ACM SIGARCH Computer Architecture News, May 2000, v.28 n.2, pp. 83-94.

Appendix B: Technical Reports

An Introduction to Power Consumption Issues in Processor Design

M.A. Hicks, C. Egan

February 2005

Abstract

The field of power aware computer engineering has become increasingly prominent in recent years, following both the rapid take up of mobile devices and the increasingly large and complex design of processors. This introductory paper begins by examining the impetus behind present research and then proceeds with simple background theory of power dissipation in current electronic circuit technology. Following the definition of simple antiquated ways of reducing power dissipation, more contemporary higher level metrics used to quantify power efficiency (Et²) are presented, together with a brief summary of simulation tools used to retrieve power usage information. To finish, a summary of current approaches and questions is presented.

1 Introduction

We're projecting by 2010 there will be more than 2.5 billion wireless handheld devices capable of providing the communications functions combined with the processing power of today's high-performance PCs. – Paul Otellini, Intel

The first question one might ask is "Why be power efficient?". Power consumption/efficiency is an increasingly important issue in the field of computer engineering. We can say this because power consumption/dissipation affects so many factors in real world implementations:

Battery Life - The most obvious restriction on mobile devices is the length of time we can use that device for and the performance we can expect. Power efficient designs can extend this but, as shown later, careful consideration must be taken in assessing the relative advantages of a design.

Thermal Issues - Power dissipation results in heat. Excessive heat dissipation will affect the design of relevant cooling systems (and thus device packaging/size), reliability and precise timing. One of the most important issues on the desktop machine is packaging size and cost. On large scale servers packaging size will represent a volumetric limit on rack capacity.

Large Scale Power Consumption - This aspect often seems irrelevant when examining the desktop and small scale market but when one looks at massive arrays of machines the resulting power consumption can be quite significant.

Given the prediction quoted from Intel one can see the extreme importance of power efficient designs in the coming decades. In fact, power efficiency is a likely linchpin for new high speed processor designs. The basic method behind reducing power consumption is to:

1. Find out where power is being dissipated most in the system.

2. Apply a suitable metric to this.

3. Try out lots of different (relevant) approaches to reduce this!

Given this approach, the following sections will attempt to offer a brief introduction to some of the principles of the area.

2 Background Theory

This section offers some low level background theory explaining how power is dissipated in silicon transistors. If you aren't interested in understanding the details of how power is dissipated you can probably skip this section. However, the contents of this section may help you understand subsequent sections more clearly [8].

2.1 Some Definitions

Before we further examine power dissipation it is important that some definitions are known to the reader, particularly since many readers from a computing background may not be familiar with electronics. A basic understanding of science is assumed. Every electronic circuit has the following properties: (assume t = observation time)

Charge (Q) - The charge at a particular point in a circuit is defined as the number of electrons passing that point in the given time period. Charge is measured in Coulombs. The charge, Q, at a given point is said to be one Coulomb if approximately 6.24 × 10^18 electrons pass that point during the observation period. Q = It

Current (I) - The current at a given point in a circuit is defined as the charge per second, i.e. the number of electrons passing that point every second. This is in contrast to charge, which is a measure of electrons, not their flow rate. We measure current in Amperes. I = Q/t

Electrical Energy (W) - This is the "energy" stored in an electrical form. As with all energy we measure electrical energy in Joules.

Voltage (V) - Voltage can be used to express one of two things. EMF specifies the Electro Motive Force of a supply source. PD (Potential Difference) specifies the 'difference' between two points in a circuit. EMF is the Potential Difference between the two terminals of the supply source. Voltage is defined as the number of Joules per Coulomb, or rather, the amount of energy contained by a unit of charge. We measure Voltage in Volts. V = W/Q and (more commonly) V = IR

Resistance (R) - The resistance of a component, or group of components, is defined as its opposition to the flow of current and thus a measure of its conductivity. We can determine the value of resistance for a component (measured in Ohms) by examining the PD (voltage) across the component and the flow of current. R = V/I

Power (P) - Power is defined as the rate of change of energy from one form to another. We therefore obtain this value by multiplying the Joules per Coulomb (V) by the Coulombs per Second (I). This value is the most important for matters of power consumption. Power is measured in Watts. P = VI

Capacitance (C) - Also very important in creating power dissipation is Capacitance. Measured in Farads, and most often used in conjunction with the Capacitor, every conductive material has capacitance. C = Q/V. Energy stored in a capacitor: W = (1/2)CV². Time taken to charge a capacitor: T = Q_total/I
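A short worked example ties these definitions together; the numeric values (a 1 pF node, a 1.2 V supply and a 50 µA charging current) are assumed purely for illustration.

C = 1.0e-12           # Farads (assumed load capacitance)
V = 1.2               # Volts  (assumed supply voltage)

Q = C * V             # charge needed on the node:      C = Q/V  =>  Q = CV
W = 0.5 * C * V**2    # energy stored in the capacitor: W = (1/2)CV^2
I = 50e-6             # Amperes (assumed charging current)
T = Q / I             # time to deliver the charge:     T = Q_total/I
P = V * I             # instantaneous supply power:     P = VI

print(Q, W, T, P)     # 1.2e-12 C, 7.2e-13 J, 2.4e-8 s, 6.0e-5 W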

2.2 CMOS, Transistors and Logic Gates

Underneath the collection of logic gates (the level at which most computer scientists understand circuits) we can see vast interconnections of transistors – the components from which logic gates (amongst other devices) are made. It is also the easiest place to observe how power is lost in a processor. Modern microprocessors are implemented with Complementary Metal-Oxide-Semiconductor technology (CMOS). It is both a style of logic design and the set of industrial processes which are used in the implementation of these logic designs. The word complementary is used with reference to the use of pairs of transistors that 'complement' one another. Essentially, since both transistors are never conducting at the same time, when one transistor is active the output is connected to the supply voltage and when the other is active the output is connected to the ground (see Figure 2). CMOS currently allows the most dense transistor networks on a single chip.

CMOS chips use a combination of P-type (Positive) and N-type (Negative) MOSFETs¹. In Figure 1 we can see the comparison between the function of the p-type and n-type MOSFETs. The binary digits represent input and output voltages of low or high. Vdd represents a supply voltage and Vss represents a common ground. The switch diagram next to each transistor type shows the relation between its input and output. From this we can see that an n-type is in saturation (conducts between Vdd and Vss) when its input voltage, Vin, is above the transistor's switching threshold. A p-type, conversely, is in saturation when its Vin is below the switching threshold. The concept of the switching threshold is important for device timings since in high performance processors the time a transistor takes to reach this threshold is significant.

Figure 1: The two principal MOSFETs used in CMOS chips.

Figure 2 shows how we can arrange n-type and p-type transistors to create a logic gate – in this case a NOT Gate (inverter). Now for some more in-depth information:

Capacitance at Vin (input) will be the gates of the n-type and p-type transistors and the metal interconnect.

Capacitance at Vout (output) is created by the fanout (number of connections) to other gates and metal interconnects.

Propagation Delay is the time that the gate takes to charge the output to the correct switching threshold. This is key in configuring device timings.

¹ Metal Oxide Semiconductor Field Effect Transistor


Figure 2: A NOT Gate constructed from p-type and n-type MOSFETs.

Switching Energy is defined as W = QV .

Load Capacitance Increases Propagation Delay – A higher fanout and interconnection will increase the amount of time taken to charge the output to the correct threshold.

For reference Figures 3 and 4 show two other configurations of CMOS logic, used here to create memory latches.

Figure 3: A latch constructed with two inverters. Holds its value as long as power is supplied and actively drives its output. However, it is fairly big since it requires 5 transistors. Commonly used for SRAM cells.


Figure 4: A charge based latch here constructed using one transistor and capacitor. Very small but charge ‘leaks’ off capacitor and reads can hence be destructive. Commonly used for DRAM cells.

3 Power Basics

Now that the low level details have been covered we can start to draw more general abstract information about power consumption in a processor. In a broad sense there are two types of power dissipation sources in a processor:

1. Dynamic Power – also called switching power.

2. Static Power – also called leakage power. Caused by transistor inefficiencies, this will result in a steady per cycle energy cost.

Dynamic power dissipation is the leading cause of power loss in a processor, however static power loss is becoming increasingly important given the increasing number of transistors on an IC.

3.1 Dynamic Power Dissipation

Within dynamic power dissipation there are two causes for energy loss:

Capacitive Power Loss – Caused by charging/discharging of transistor output at transitions from 0 → 1 and 1 → 0.

Short Circuit Power Loss – Caused by brief short-circuit current during transitions. This problem is due to the finite slope of input signals to transistors, the scope of which is beyond this document.

We can formalise dynamic power loss with equation 1 (the meaning of each term is shown below) [2].

Power ∼ (1/2)CV²fA    (1)

C → Capacitance. Function of wire length and transistor size. Measured in Farads.

V → Supply voltage. Measured in Volts.

f → Clock Frequency. Measured in Hertz.

A → Activity factor. How often, on average, do wires switch?

3.2 Static Power Dissipation

Static power dissipation due to leakage currents has two main dependency factors:

1. Gate oxide thickness.

2. Stacking and Input Pattern.

Since lowering leakage currents is a job for electrical engineers it will not be discussed further in this report, however one thing that static power dissipation does make clear is that making a smaller design does not only use less silicon space but it will also result in lower power dissipation.

4 Simple Power Reduction Methods

This section discusses some simple methods used by manufacturers and designers to lower the two main types of power dissipation.

4.1 Lowering Dynamic Power Dissipation

Given equation 1 for dynamic power dissipation, the many general techniques used to reduce it include [9]:

Lower Vdd (V) – Given the V² term of the dynamic power consumption equation we can see that just lowering the supply voltage will have a quadratic effect. This has been a tactic in the design of integrated circuits by many chip manufacturers. However lowering the supply voltage also has a roughly linear degrading effect on performance.

Lower Capacitance at Gate Output (C) – Given the energy loss due to the output capacitance (a function of the fanout of other gates) it can be said that lowering this value will help decrease capacitance power dissipation, C.

Lower Clock Frequency (f) – This will have a linear effect on power dissipation, however it will also affect the performance of the processor since the frequency dictates the number of cycles per second.

Reduce Activity Factor (A) – This is the key area in which processor designers can reduce the power consumption of their architecture designs. This is a function of the signal transitions against the clock rate. Clock gating idle units and reducing how much each unit is used will help reduce this; the sketch following this list illustrates the relative effect of each term in equation 1.
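The sketch below evaluates equation 1 with assumed parameter values, confirming the quadratic dependence on V against the merely linear dependence on f and A.

def dynamic_power(C, V, f, A):
    return 0.5 * C * V**2 * f * A        # equation (1)

base = dynamic_power(C=1e-9, V=1.2, f=1e9, A=0.1)     # assumed parameters
print(dynamic_power(1e-9, 0.6, 1e9, 0.1) / base)      # V halved -> 0.25x power
print(dynamic_power(1e-9, 1.2, 5e8, 0.1) / base)      # f halved -> 0.5x power
print(dynamic_power(1e-9, 1.2, 1e9, 0.05) / base)     # A halved -> 0.5x power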

4.2 Lowering Static Power Dissipation

Many of the resulting solutions here are only useful for electrical engineers since they are extremely low level – from a design perspective limiting the size of architectures is the best we can do to avoid this form of dissipation, along with completely powering down functional units.

Use Fewer, Smaller Transistors – Stack transistors where possible to minimise contacts with Vdd.

Reverse Body Bias (dynamically adjust switching threshold) – Provides a low leakage sleep mode but maintains state (XScale).

Vdd Gating – Cut the supply or ground to a circuit. This will provide a zero dissipation sleep mode but all of the stored data will be lost. There are also overheads to consider when switching a unit on or off.

5 Effective Metrics

Something fundamental to analysing power dissipation in a system and assessing how effective a given modification is are metrics. This section briefly discusses the metrics used and the various pitfalls which one must be wary of, starting with some more simple metric definitions and finishing with a contemporary voltage independent metric. In each case the power is the power used to perform a given process and the time is the period of time this process takes to complete.

5.1 Power Consumption

• Determines battery life.
• Sets packaging size.
• P = VI
• Measured in Watts.

5.2 Energy Efficiency

• The rate of energy consumption.
• Energy = Power × Delay (E = PT).
• Measured in Joules.
• A low energy value can be used to compare energy per operation at the same clock frequency (lower E is better).

5.3 Voltage Independent Metric

Power Delay Product – Specifies the average amount of energy consumed per switching event. PDP = P_average × t

Energy Delay Product – This takes into account that increased delay can be compensated for by lower energy transfer per operation. EDP = PDP × t

However, given the quadratic effect of voltage on power efficiency and its metrics, we really need a way of measuring pure efficiency irrespective of voltage. This enables one to design a modification that should be as effective when implemented in any design, no matter what voltage the processor is running at. The metric proposed by a research group at the California Institute of Technology [6], and its explanation, is as follows:

Bu z (2) → ↑ Bd z (3) → ↓ The above production rules define the transition of a logic gate or operation z from one to zero. Given the above transitions let Ez and Ez represent the en- ↑ ↓ ergy spent during each transition, respectively. This energy is dissipated during the charging and discharging of the capacitor associated with z (remember the associ- ated capacitance of z can just be the fanout and interconnect to other gates). The 2 associated energy used to charge a capacitor to voltage V is CzV . As discussed 1 2 earlier the energy stored in a capacitor is 2 CV . This means that the other half is dissipated as heat in the interconnect. Therefor energy loss during charging is:

$E_{z\uparrow} = \frac{C_z V^2}{2}$ (4)

given V as the supply voltage. When the capacitor is discharged this energy is lost in the interconnect, such that $E_{z\downarrow} = E_{z\uparrow}$.

The time $t_{z\uparrow}$ taken to charge the capacitor is the ratio of the final charge on the capacitor to the rate of charge transfer (current):

$t_{z\uparrow} = \frac{Q_z}{i_z}$ (5)

where $Q_z = C_z V$. It is slightly harder to define the value of $i_z$, since transistor current is difficult to analyse. It becomes easier if it is assumed that the transistor in question is operating above threshold (not in velocity saturation): the current will then be either the linear or the saturation current. These are well defined and shown below:

$I_l = k(2(V_{gs} - V_t)V_{ds} - V_{ds}^2)$ (6)

$I_s = k(V_{gs} - V_t)^2$ (7)

Now, if we assume that the voltages above vary proportionally with the supply voltage V, then both the saturation and linear currents depend quadratically on V. Thus the current for both $i_{z\uparrow}$ and $i_{z\downarrow}$ is of the form $K_z V^2$. Delay can now be formalised as:

$t_{z\uparrow} = \frac{C_z}{K_{z\uparrow} V}$ (8)

If this equation for delay is combined with the equation for energy loss, it can be seen that the term $Et^2$ is independent of any voltage term [6]. Since this is just a further extension of the Energy Delay Product, it is often referred to as the ED2P metric, or EDDP.
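Spelling out the substitution that establishes this (a step left implicit above): with $E \propto C_z V^2$ and $t \propto C_z / (K_z V)$,

$Et^2 \propto (C_z V^2) \times \left(\frac{C_z}{K_z V}\right)^2 = \frac{C_z^3}{K_z^2}$

in which every voltage term cancels.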

$ED2P = Et^2$ (9)

With this, the processor designer can balance both the energy and the time taken to perform a process, with increased time being compensated for by decreased energy readings, and so on. When comparing two designs, a lower value of EDDP is the more desirable when considering power efficiency.

It is also worth noting that $\frac{MIPS^3}{P}$ is inversely proportional to the $Et^2$ metric. For an explanation of this see Appendix 1, where it is also explained how we can thus use $P \times CPI_{Average}^3$ as a voltage-independent metric for comparing energy efficiency.
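As a brief illustration of how the metrics rank designs differently (the figures below are invented for the example, not measurements):

/* Hypothetical comparison of two designs using the metrics above.
 * Lower values are better for both EDP and ED2P. */
#include <stdio.h>

int main(void)
{
    /* Design A: faster but hungrier; Design B: slower but frugal. */
    double p_a = 10.0, t_a = 1.0;   /* Watts, seconds */
    double p_b = 6.0,  t_b = 1.2;

    double e_a = p_a * t_a, e_b = p_b * t_b;   /* E = P * t (Joules) */
    printf("EDP:  A=%.2f  B=%.2f\n", e_a * t_a, e_b * t_b);             /* 10.00 vs 8.64  */
    printf("ED2P: A=%.2f  B=%.2f\n", e_a * t_a * t_a, e_b * t_b * t_b); /* 10.00 vs 10.37 */
    return 0;
}

Here design B wins on EDP but loses on ED2P: the extra factor of delay in ED2P penalises its longer runtime.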

6 Analysis and Tools

When designing processors/circuits for power efficiency there are various tools that can be used [10], each one implemented to monitor power at a different level of abstraction. Discussed briefly below are some of the more popular (and effective) of these tools [1].

6.1 Circuit Level Models (HSpice)

Modelling at this level involves creating simulations that are as similar as possible to the actual implementation of a circuit in silicon. One of the most popular (and famous) tools used for this purpose is HSpice. HSpice models diffusion, gate and wiring capacitance. These analogue simulations are performed by:

• Large detailed device models created from empirical measurement.
• Solving large systems of analogue electrical equations.

The result of this complexity is an often accurate (power dissipation can be estimated to within a few percent) but extremely slow simulation. This makes HSpice ideal for testing implementations of 10-100 thousand transistors, but impractical for anything larger. The tool Powermill is similar but slightly less accurate; the reduced accuracy comes with the benefit of around a 10x increase in speed.

6.2 Logic Level Models

Logic level models obtain switching information for each signal and logic transition in a circuit. Modelling behaviour at this level is generally achieved using hardware description languages such as VHDL and Verilog with their associated tools. From these behavioural models a simulator can generate capacitance estimates by:

• Estimating the gate size of the specified behaviour model, similar to actual synthesis.
• Creating wire load estimates based on this.

The resulting switching and capacitance information provides dynamic power estimates only. These results will generally be less accurate than circuit level models but much more scalable to very large designs.

6.3 Architecture Level Models

The most commonly used level for computer architecture design has two approaches in its simulation:

1. Event Based Modelling – Interface some high level power models with cycle based simulations (a sketch of this skeleton follows below).

2. Instruction Based – Power estimates created based on instruction usage.

Both of these types of simulation can be implemented using empirical circuit design measurements as estimates, or using general capacitance models. Since $Power \sim \frac{1}{2} C V^2 f A$, the two approaches estimate $CV^2f$ in different ways:

Capacitance Models – Estimate $CV^2f$ using analytical models. This requires estimating 'wire' length and the size of transistors. The popular tools used to do this include Wattch [3] [4], PowerAnalyzer and Tempest.

Empirical Models – Estimate $CV^2f$ using measurements taken from actual implementations, adapted to model the current circuit. Internal corporate tools are often used for this; the two main publicly available tools are PowerTimer and AccuPower.
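To make the event based approach concrete, the sketch below shows its skeleton: the cycle simulator counts accesses to each modelled unit, and total energy is the sum of per-access energies obtained offline from capacitance or empirical models. This is an illustration only; the unit names, energy values and clock frequency are invented and do not correspond to Wattch's actual tables.

#include <stdio.h>

enum unit { ICACHE, DCACHE, BPRED, ALU, N_UNITS };

/* Per-access energies in Joules -- invented for illustration. */
static const double energy_per_access[N_UNITS] = { 0.40e-9, 0.45e-9, 0.15e-9, 0.25e-9 };
static unsigned long accesses[N_UNITS];

static void record_access(enum unit u) { accesses[u]++; }

int main(void)
{
    /* Stand-in for the cycle-based simulator: every modelled event
     * simply bumps a counter for its unit. */
    for (int cycle = 0; cycle < 1000; cycle++) {
        record_access(ICACHE);
        record_access(BPRED);
        record_access(ALU);
        if (cycle % 3 == 0) record_access(DCACHE);
    }

    double joules = 0.0;
    for (int u = 0; u < N_UNITS; u++)
        joules += accesses[u] * energy_per_access[u];

    double seconds = 1000.0 / 2e9;   /* cycles / assumed 2 GHz clock */
    printf("estimated average power: %g W\n", joules / seconds);
    return 0;
}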

For large scale designs the general accuracy of architecture level simulation is sufficient and indeed is the only realistic option for general high level comparison of designs.

7 Summary And Current Issues

The current approach to increasing the energy efficiency of processor designs is, in a general sense, an amalgam of the previous sections. Initially the engineer needs to understand the causes of power dissipation, make some intuitive early stage decisions and create several candidate designs. With these designs, the engineer then runs simulations to decide where power is being dissipated most in each approach and applies a suitable metric – the most common and useful of which is generally an $Et^2$ metric. Once this dissipation has been quantified it is possible to try out different implementations that may reduce it.

The most potent method of reducing power consumption at the architecture level is to focus on reducing the activity factor of the different units in the processor by means of intelligent algorithms. This is largely due to the extensive research already carried out into reducing gate level power dissipation.

Another area worth considering outside of processor design, which can have great effect, is power aware operating system algorithms. Optimisations at the application level will often yield the greatest power efficiency; they are, however, highly dependent on user input and are application specific. Intelligent algorithms at the operating system level therefore have the potential to greatly decrease power dissipation by reducing the activity factor to a minimum. Research in this area is still young.

8 References

[1] D. Brooks. Power Aware Computing Notes. Technical report, Harvard University, USA, September 2004.

[2] D. Brooks, P. Bose, S. Schuster, H. Jacobson, and P. Kudva. Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors. IEEE Micro, November 2000.

[3] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000.

[4] Doug Burger and Todd M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical report, University of Wisconsin-Madison, 1997.

[5] Canturk Isci and Margaret Martonosi. Run-time Power Monitoring and Estimation in High-Performance Processors: Methodology and Experiences. In MICRO-36, December 2003.

[6] Alain J. Martin, Mika Nyström, and Paul I. Pénzes. ET²: A Metric for Time and Energy Efficiency of Computation. 2003.

[7] Dharmesh Parikh, Kevin Skadron, Yan Zhang, Marco Barcella, and Mircea R. Stan. Power Issues Related to Branch Prediction. IEEE HPCA, 2002.

[8] Myke Predko. Digital Electronics Demystified. McGraw-Hill, New York, 2005.

[9] John S. Seng and Dean M. Tullsen. Exploring the Potential of Architecture- Level Power Optimizations. PACS, 2003.

[10] Victor Zyuban, David Brooks, Viji Srinivasan, Michael Gschwind, Pradip Bose, Philip N. Strenski, and Philip G. Emma. Integrated Analysis of Power and Performance of Pipelined Microprocessors. IEEE Transactions on Computers, 53(8), August 2004.

[11] V. Zyuban and P. Kogge. Optimization of High-Performance Superscalar Architectures for Energy Efficiency. ACM, 2000.

[12] Victor V. Zyuban. Inherently Lower-Power High-Performance SuperScalar Architectures. PhD thesis, University of Notre Dame, Indiana, March 2000.

9 Appendix 1

Relation of $\frac{MIPS^3}{P}$ to $Et^2$:

$\frac{MIPS^3}{P} \propto \frac{1}{Et^2}$ (10)

$Et^2 = (P \times t) \times t^2 = P t^3$ (11)

where P = power in Watts, MIPS = millions of instructions per second, t = time in seconds for the process to complete and E = energy used for the process.

$t = InstructionCount \times CPI_{Average} \times t_{percycle}$ (12)

$MIPS = \frac{InstructionCount}{t} = \frac{InstructionCount}{InstructionCount \times CPI_{Average} \times t_c} = \frac{1}{CPI_{Average} \times t_c}$ (13)

Since $t_c$ is fixed by the frequency of the processor, it does not need to be considered at the architecture level when designing energy efficient processes and is a constant in these equations. The instruction count is inversely proportional to the CPI, so it too can be removed from the equation for $Et^2$. This leaves us with the following more obvious relationship:

$\frac{MIPS^3}{P} \propto \frac{(1/CPI_{Average})^3}{P} = \frac{1}{P \times CPI_{Average}^3}$ (14)

This shows that what we are really measuring is the average power usage of a process in relation to its CPI cubed. The following statement should appear more intuitive (perhaps) to the reader:

$EnergyEfficiency \sim P \times CPI_{Average}^3$ (15)
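To connect this metric to the data reported in Appendix D, the fragment below computes $P \times CPI_{Average}^3$ from the kind of per-benchmark values listed there (average power, committed instructions, cycles). The function is a sketch only; the sample figures are taken from the a2time01 row of the Raw Scalar Data table (PWROLD = 15.44, CYCLES = 141831, INSCOMMIT = 41111).

#include <stdio.h>

static double p_cpi3(double avg_power, double cycles, double committed)
{
    double cpi = cycles / committed;    /* CPI_Average */
    return avg_power * cpi * cpi * cpi; /* P * CPI^3   */
}

int main(void)
{
    printf("%f\n", p_cpi3(15.44, 141831.0, 41111.0)); /* approximately 634 */
    return 0;
}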

HTracer V0.5: A User Guide

M.A. Hicks, C. Egan

July 2005

1 Introduction

This document details the usage of the Hatfield Tracer (HTracer) from a user's perspective. HTracer is a cross-platform dynamic instruction stream research tool that can produce detailed traces of the dynamic instruction stream of almost any compiled binary program, and requires no special compilation procedure for the programs that are to be traced. After a trace is complete, relevant execution statistics are also supplied. In essence, one simply supplies HTracer with the binary program to be traced; using a number of switches and flexible masks, almost any required information can be retrieved pertaining to the behaviour of that program at runtime.

This document is structured as an explanation of what HTracer can do, followed by an explanation of how this can be achieved. Before this, some basic concepts and requirements are detailed so that anyone unfamiliar with a tracer can understand the document's content. Some example output is also supplied to show what can be expected from a trace run with certain parameters. It is recommended that this document be read from start to finish; although the document is divided into sections, they are not completely discrete and each contains information useful to using HTracer.

2 What Is A Tracer?

Generally, most computer scientists write their software in a high level language, such as C/C++ or Pascal. At this level it is easy to understand and imagine the thread and behaviour of the program were it to be executed by some theoretical machine. In reality, however, this high level language is compiled to a much lower level format. This format is known as machine code, since it is the binary format of 'simple' atomic instructions executed by the CPU of a computer. At this low level the behaviour becomes much more complex than one had imagined at the higher level, with seemingly new patterns and behaviour appearing in the dynamic execution of code.

It is with this in mind that HTracer was created: to allow the monitoring of the dynamic instruction stream in a useful and flexible manner. Hence it can be said that a tracer, and in particular HTracer, will intercept the real time execution of machine code and produce data records detailing the sequence of program execution.
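On Linux, this kind of interception is typically built on the ptrace facility (see reference [2] at the end of this guide). The fragment below is a generic illustration of the fork/trace/single-step loop such a tracer uses; it is a sketch only, not HTracer's actual source, and the x86-64 register name (regs.rip) is an assumption about the host architecture.

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {                          /* the 'wrapper' child  */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* ask to be traced     */
        execv(argv[1], &argv[1]);
        perror("execv");
        _exit(1);
    }
    int status;
    waitpid(child, &status, 0);                /* child stops at exec  */
    while (!WIFEXITED(status)) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        /* ...fetch the instruction at the PC, compare it against the
         * masks and dump state on a match (omitted here)... */
        printf("%llx\n", (unsigned long long)regs.rip);
        ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
        waitpid(child, &status, 0);            /* wait for next stop   */
    }
    return 0;
}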

3 Requirements

Due to its low level nature, HTracer is implemented at the kernel level of the operating system. As such, HTracer will function on almost any Unix-like operating system, but it has been designed with Linux in particular in mind. Given the availability of Linux on a number of different architectures, HTracer can be used to produce execution traces on many processor architectures including, but not limited to: x86, AMD64, PowerPC and SPARC. This cross-platform ability is one of the key advantages and interest points of HTracer, since it allows the comparison of static to dynamic code behaviour across different architectures. For the current version of HTracer (0.5), the recommended system requirements are as follows:

• A machine with a working installation of Linux.
• Kernel version >= 2.6.
• Since tracing is by its very nature a slow process, a fast CPU is always handy!
• A working understanding of Linux and the command line is assumed.
• A copy of the tarball archive "htracer.tar.bz".

Assuming one has met these requirements, the next section details how HTracer can be installed on a machine.

4 Installation

Firstly the tarball archive needs to be decompressed. At a command prompt, change to the directory of the tarball and run the following command:

$ tar xfj htracer.tar.bz

This will decompress all of the source code into a directory called HTracer. Change into that directory before proceeding. There will be a file present here called "README", which contains release notes and technical details for the current release. The next step is to compile HTracer to run on the current machine. This can be achieved with the following commands:

$ make

This starts the build process. If this fails, it is likely that some configuration options need to be changed from their default values. Execute the following command to open the configuration file:

$ make configure

Read through the comments next to each option and make the appropriate changes for the current architecture. This process is detailed in the "README" file. When this has been completed, run the first compilation step (make) again. To install HTracer globally (recommended), execute the following make command:

$ make install

The command “htracer” should now work at the command prompt on the cur- rent machine. HTracer has been successfully installed.

5 Usage

Figure 1: Logical block diagram of the HTracer tool.

HTracer permits the masking of almost any information that is required during program execution. That is to say, HTracer can log various state information for, perhaps, only branch instructions or arithmetic instructions. This is particularly useful since full traces of programs are often not required (and take considerably more time and disk space). To achieve this functionality, HTracer employs a system of masks and execution parameters. These are all invoked in one command, at the point of starting the trace, and are explained in the following two subsections. The overall structure of the command to execute a trace is:

$ htracer [-fvsirome] program "program-parameters" output.file instruction.mask

5.1 Parameters

Here is an explanation of what each parameter in a trace execution means:

-fvsirome – The switch parameters to the tracer, which specify how it should behave and what sort of information should be logged. These switches are explained shortly. Any number of the switches can be used.

program – The relative or absolute location of the executable to be traced, e.g. /bin/ls.

program-parameters – A space separated list of parameters with which to run the traced program. These MUST be encapsulated by speech marks, e.g. "-lh /home/mike".

output.file – The file where all of the tracer's output should be dumped. This option is not used when the "-o" option is specified. Output is sent here if the instruction at the current PC matches a mask OR if the full trace option is specified.

instruction.mask – The name of the file containing the instruction masks and groups (if the option "-m" was given to the tracer). See the Masks subsection for further information on masking.

The following list describes the meaning and function of each of the switch parameters mentioned in the previous list:

-f Run a full trace. Information for every instruction will be dumped into the output file. If the use of a mask is specified it will still be used; however, the tracer will not ignore instructions that are not specified in the mask.

-v Verbose. Provides some extra information about the progress of the tracing. On some architectures this includes information about the number of clock cycles that the tracer ran for.

-s Outputs some statistics at the end of the trace. The statistics include information about the occurrence of each instruction group specified in the mask.

-i Log the actual instruction fetched during each step of the trace. This is useful if one needs to examine any matched instructions and perform any post trace analysis.

-r Log ALL of the user registers at each matched instruction. This option is currently (V0.5) platform dependent and will not function correctly on all architectures. Using this option also significantly slows down the execution of a program trace.

-o Instead of using an output file, send all of the dumped information to stdout. Useful if the output needs to be redirected somewhere else, perhaps directly into another program.

-m Specifies the use of a mask file. See the next subsection.

-e A line is written at the top of the tracer output which explains the format of the tracer log.

In the example output section, the relevant command used to produce the example output is shown, to illustrate how trace options affect trace output.

5.2 Masks

In order to monitor a program, and in order to trace it accurately, HTracer single steps each instruction through the entire processor pipeline. This means that every instruction is moved through the whole pipeline, one at a time, and then the state of the processor is checked. This must be done since it would be difficult to trace a program while multiple instructions were being executed at any given time. As a direct result, program traces produced by HTracer take significantly longer to run than simply executing the program natively.

The penalty of single stepping instructions cannot be reduced. However, another key contributor to trace production time is the amount of information being written to disk at each step. For most full trace producing applications this means a full trace and data being dumped every cycle, which is slow. To ameliorate this problem, and to produce more meaningful labelled output, selective instruction masks have been implemented. These masks match certain instruction appearance formats which are completely specified by the user.

All instruction masks are grouped into user specified groups; these groups are really just a way of collecting together similar instructions, to be reported under the same name (if this is desired). For instance it may be useful to have all load instructions reported under one term (given the variety of these present on the x86 architecture). As well as appearing under these groups in the trace output, this is also how program statistics will be shown at trace completion.

Figure 1 shows how HTracer spawns a 'wrapper' child process and executes this on the CPU, whilst monitoring it at each step. It is at each step that the instruction masks are compared with the instruction currently being executed. If a match occurs, the requested trace data (see the parameters subsection) is stored in the specified file. If no match occurs, then the instruction and processor state is ignored. This will become more apparent after reading the rest of this document. The format of a masking file is as follows:

# Any line beginning with the hash symbol is ignored.
% <group name> <1|0: dump next pc>
> <hex bit appearance> <hex bit mask>
> <hex bit appearance> <hex bit mask>
......

% <group name> <1|0: dump next pc>
> <hex bit appearance> <hex bit mask>
......

A group consists of one or more instruction mask pairs. The name of the group is completely arbitrary, but a good choice is a name that reflects the common theme of all of the instructions contained in the group – for instance "branch" for a group containing branch instructions. There can be any number of instruction groups. A 1 or 0 after the group name specifies whether the tracer should log the next value of the Program Counter after a match with one of the masks occurs; this is useful for logging branch targets. The functions of the mask pairs are:

Hex Bit Appearance – This specifies how certain parts of the instruction should appear. In essence, the actual instruction appearance is written here, for the bits that are to be monitored (see next item).

Hex Bit Mask – This hex sequence specifies which parts of the previous hex appearance should actually be compared with the currently executing instruction (similar to a subnet mask, for instance). An example mask, to catch x86 branch instructions, would look something like this:

# x86 Instruction format tracer mask.
# Currently matches only branch instructions.
# Remember little endian byte ordering.

% Branch 1
> 00000070 000000f0
> 000000e3 000000ff
> 000000eb 000000ff
> 000000ea 000000ff
> 000000e9 000000ff
> 000000ff 000000ff
> 0000000f 000000ff

Taking the first mask pair here as an example, only instructions where bits 4-8 equal 7 would cause a match and a consequential state dump to the output file. This is because 00000070 says that bits 4-8 should equal 7. The rest of the bits show as zero because they are ignored by the actual mask applied, which means the tracer only examines bits 4-8 for this mask. Given this explanation it should be clear how the other mask pairs will match. All of these masks will report instructions under the term "Branch". If they needed to be separated in the tracer output then they could simply be specified under separate groups in the mask file (one mask pair per group). The "1" after the group name "Branch" means that the effective target of each of the logged branch instructions will also be stored. This is demonstrated in the next section.
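The matching rule itself is simple enough to state as code. The following fragment is an illustrative sketch (not HTracer's source) of the test applied at each step: an instruction matches a pair when the bits selected by the hex bit mask equal the corresponding bits of the hex bit appearance.

#include <stdint.h>
#include <stdio.h>

/* True when the instruction agrees with the appearance on every
 * bit position selected by the mask. */
static int matches(uint32_t insn, uint32_t appearance, uint32_t mask)
{
    return (insn & mask) == (appearance & mask);
}

int main(void)
{
    /* The first pair from the example file: appearance 00000070, mask 000000f0. */
    printf("%d\n", matches(0x00000073, 0x00000070, 0x000000f0)); /* 1: selected nibble is 0x7 */
    printf("%d\n", matches(0x00000083, 0x00000070, 0x000000f0)); /* 0: selected nibble is 0x8 */
    return 0;
}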

6 Tracer Output

During the execution of a program trace, HTracer will store the state of the processor, as requested, to either the specified output file or the standard output. Regardless of which option is chosen, the output will always appear in the same format, as described here. It is worth remembering, from the parameters section, that a line can be written to the start of the tracer output which specifies the layout of the columns in the output file (so that one doesn't need to remember them). The tracer output file is arranged into columns and rows. Each row specifies one state of the CPU and thus one instruction execution. Each column specifies a value of something within that state. Here is a description of the format of the columns in the tracer output:

{ PC GROUP (INSTRUCTION)? (REGISTERS)? (NEXTPC)? }

PC The program counter value of the instruction being executed in hexadecimal. By default it is the relative program address i.e. 0 is the base address in memory of the program being traced.

GROUP The name of the instruction group which this instruction matched. N.B. if a full trace is run, then all instructions will appear here, regardless of whether or not they match a group mask. In this case, instructions which did not match a mask will appear in the "Generic Instruction" group.

INSTRUCTION If requested at trace execution, this is the actual instruction being executed, in hexadecimal. This is extremely useful for performing post trace analysis and further decoding any traced instructions.

REGISTERS If requested at trace execution, the whole general purpose register file. This is output as a comma delimited hexadecimal list of the value contained in each of the general purpose registers. These values are useful for more complex post trace analysis.

NEXTPC If specified in the mask file for a particular instruction group (see the masking section), then the next PC value will be stored here.

Given the previous specification, here is a snippet of tracer output, created using the example mask in the previous section and the parameters "-sime". The full command executed to produce this output:

htracer -sime ./jumper "100000" traceroutput.txt x86branchmask.txt

....
0006eabd Branch fe35840f 0006eac3
00064200 Branch 0038a3ff 000b65c0
000b65dc Branch 009e860f 000b65e2
000b65f0 Branch 4616b60f 000b65f3
000b65f7 Branch 4101b60f 000b65fa
000b6600 Branch 0fc0940f 000b6603
000b6603 Branch f655b60f 000b6607
000b660a Branch 09c2950f 000b660d
000b6611 Branch 009b850f 000b66b2
000b66b2 Branch f645b60f 000b66b6
000b66b6 Branch f755b60f 000b66ba
0006eadb Branch 958b1075 0006eaed
....

The first column shows the PC, the second is the instruction group, the third is the actual instruction and the final column is the value of the next PC. After the trace has completed, if requested at execution, some statistics about the trace will be displayed. In the case of this example, the statistics were as follows:

------ Program Statistics ------
./jumper
--------------------------------
Ins Class ----> Number of Instructions | %

Generic Class        789495 | 78.20%
Branch               220057 | 21.80%

Ins Total: 1009552
--------------------------------

The format of the statistics is largely dependent on the mask file used. One entry is shown per instruction group specified in the mask file, showing how many times that group was matched, both as a value and as a percentage of the total instructions. It is worth noting at this point that the output file generated by HTracer for any realistic program or benchmark is likely to be very large; for an intensive benchmark trace, an output file can be gigabytes in size. There must be enough free space on the target storage disk to accommodate the entire trace.
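Statistics like those above can also be recomputed from a saved trace as part of post trace analysis. The following sketch is not part of HTracer; the fixed-size limits are arbitrary and group names are assumed to be single tokens. It tallies the GROUP column of a trace supplied on standard input, so it could be fed directly from a trace via the standard output option (-o).

#include <stdio.h>
#include <string.h>

#define MAX_GROUPS 64

int main(void)
{
    char names[MAX_GROUPS][64];
    unsigned long counts[MAX_GROUPS] = {0};
    int ngroups = 0;
    unsigned long total = 0;
    char pc[32], group[64], rest[512];

    /* Each row: PC GROUP [INSTRUCTION] [REGISTERS] [NEXTPC]. */
    while (scanf("%31s %63s", pc, group) == 2) {
        if (fgets(rest, sizeof rest, stdin) == NULL)
            rest[0] = '\0';                       /* last line: nothing left */
        int i;
        for (i = 0; i < ngroups; i++)
            if (strcmp(names[i], group) == 0) break;
        if (i == ngroups) {
            if (ngroups == MAX_GROUPS) continue;  /* too many groups: skip */
            strcpy(names[ngroups++], group);
        }
        counts[i]++;
        total++;
    }
    for (int i = 0; i < ngroups && total > 0; i++)
        printf("%-16s %10lu | %5.2f%%\n",
               names[i], counts[i], 100.0 * counts[i] / total);
    return 0;
}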

7 Known Issues

There are a number of known issues/problems with HTracer which have yet to be addressed.

1. Library initialisation code is traced. At the start of a program’s execution, all of the libraries used by the program are initialised and mapped in memory. Currently (V0.5) HTracer will trace this code. This may not be desirable. A work around for this in the meantime is to create a simple blank program that uses all of these libraries, trace it using HTracer, and thus work out how many instructions at the beginning of the trace can be ignored. Alternatively (and easier), build the program to be traced so that it is statically linked, thus removing any need for library initialisation.

2. The logging of the entire register file is currently (V0.5) not fully implemented on all architectures.

3. Tracer output to a file cannot be split across several different files/disks. A work around for this is to send the trace output to the standard output and pipe it to one of many standard *nix programs which can reroute the output to several different files.

8 References

The following sources were used, to varying degrees, during the construction of HTracer.


[1] Intel. Intel Architecture Software Developer’s Manual, September 2004.

[2] S.Sandeep. Process Tracing Using Ptrace. Linux Gazette, 81, 2002.

[3] Various. Linux Operating System Manual Pages, 2005.

[4] Various. The Linux Kernel Archives. Online, 2005.

[5] Various. Tux.org Discussion Lists. Online, 2005.

Appendix C: Additional Background

1 Electronics and Modern Transistors

Computer scientists are used to reasoning about abstract models of processor behaviour, but these are typically centred around performance [1] [2]. In order to better understand power dissipation [3] it is important that the underlying principles are set out. The following subsections explain transistor level, and hence gate level, power dissipation in modern nanometre circuits in sufficient detail to understand this work. A more in-depth discussion is supplied in Appendix A, "An Introduction to Power Consumption Issues in Processor Design" [4].

1.1 Basic Electricity Nomenclature

All electronic circuits function by manipulating the flow of electricity. As a result, various terms will be used regularly in this chapter and throughout this dissertation. Although these terms are elementary, it is important that they are included for clarity [5] [6]. Figure 1 shows a very simple circuit diagram illustrating the schematic voltage drop across a lamp in a circuit. V represents the voltage drop, and A represents the current flow. This aids in the explanation of the terms shown below.

Figure 1: A very simple circuit

Electron – A negatively charged subatomic particle that is capable of carrying electrical energy.

Electricity – In the most general sense, a flow of electrical charge in a material. Specifically, in electronic circuits, it is the flow of electrons through a conductor from a negative to a positive terminal. Different materials have varying numbers of free electrons, which affect the rate of flow.

Electrical Energy (W) – An abstract measure of the potential energy stored or provided by a circuit. As with all energy, electrical energy is measured in Joules.

Charge (Q) – In an electronic circuit, charge is a measure of the number of charge carriers in a given area, or passing a particular point over a given time period. Measured in Coulombs. A charge of 1 Coulomb corresponds to approximately 6.3 × 10^18 electrons.

Current (I) – The current at a given position in a circuit is a measure of the charge passing through that point per second. Measured in Amperes. 1 Ampere is equal to 1 Coulomb of charge per second.

Voltage (V) – The amount of electrical energy stored per unit charge. Measured in Volts. 1 Volt is equivalent to 1 Joule per Coulomb. Voltage is also used to quantify how much electrical energy is lost or gained across a particular component or section of circuit.

Resistance (R) – A measure of the opposition to the flow of current in a particular material or section of circuit. Measured in Ohms. The more conductive a material is, the lower the resistance.

Capacitance (C) – Conductive materials have the capacity to store electrical charge (and thus energy). Capacitance is a measure of this capacity for a component, and is measured in Farads.

Power (P) – A measure of the electrical energy transferred by a given current per unit time. Measured in Watts. Equal to the voltage multiplied by the current flowing.

Power Source – A power source is capable of supplying a finite amount of electrical energy to a circuit. Typically, when examining processor energy efficiency, this power source will be a battery. The rate at which the power source is drained sets its lifetime.

Power Dissipation – The power consumed by a circuit in an inefficient manner; or rather, power consumed which is not supporting the logical behaviour of the circuit. A completely efficient circuit would dissipate no power at all.

1.2 Transistors, Logic Gates & CMOS

Central Processing Units (CPUs) are usually modelled at the architecture level using structural diagrams and, if more detail is required, with logic gates. Modelling circuits at these levels of abstraction makes their behaviour easier to understand, but also detracts from the physical transistors which are supporting this behaviour. While an individual transistor may appear very efficient, collectively they result in significant power dissipation; a modern processor, such as an AMD Dual Core Athlon 64 [7], contains as many as 233.2 million transistors. The following subsections explain how this supporting structure of transistors results in power dissipation and other negative effects [5].

1.2.1 MOSFETs

Metal Oxide Semiconductor Field Effect Transistors (MOSFETs) are the most commonly used type of transistor in modern VLSI circuits. MOSFETs are constructed in silicon using a combination of n-type and p-type material (discussed under doping). The effect of an electric field on these types of semiconductor material is capable of manipulating the flow of current, and their combination can thus behave like a switch.

Semiconductor A solid material whose electrical conductivity can be varied, either permanently or dynamically. Silicon is the most prominent and widely used example of such a material.

Doping A process where impurities are intentionally introduced into a semiconductor material in order to change its electrical properties. Highly doped material is often shown using 'N+' or 'P+' – the + denoting that the material is highly doped. A 'normally' doped semiconductor, as described below, will have a majority charge carrier, but will still contain a minority of the opposite charge carriers. This is important to the behaviour of transistors.

n-type Semiconductor material that has been doped using a substance that increases the number of free negative-charge carriers. This substance is commonly referred to as 'donor material' as it 'donates' electrons to the material into which it is introduced.

p-type Semiconductor material that has been doped using a substance that increases the number of free positive-charge carriers.

Figure 2 shows a cross section schematic of how n-type and p-type semiconductor material can be combined into an NPN transistor. The three letter initialism represents the doping structure of the transistor. The source and drain terminals are connected to two areas of highly doped n-type material (N+). The main body of the transistor consists of p-type material (P). The gate terminal is separated from the transistor by an oxide such that only the gate's electric field may affect the body.

Figure 2: An NPN transistor, where P shows p-type material and N+ shows highly doped n-type material

When a positive voltage is applied to the gate terminal, a process referred to as tunnelling takes place. The field effect of the positive voltage at the gate terminal causes the majority positive-charge carriers in the p-type base to be repelled, while the negative-charge carriers are attracted to form a 'tunnel' between the two highly doped n-type junctions (L). The result is that there is now a negative-charge carrying area between the source and drain junctions; a current can flow between the drain and source. The voltage required to allow current to flow in the transistor is known as the 'threshold', and is set by the designed chemical properties of the doped semiconductor. A PNP transistor can be constructed by reversing the doping types shown in Figure 2. This type of transistor functions when the gate terminal is connected to a negative terminal voltage, by creating a tunnel of positive charge carriers which allows a current to flow between the drain and source.

1.2.2 CMOS

Complementary Metal-Oxide-Semiconductor (CMOS) is a technology process used to create most modern processors from MOSFETs. In CMOS, logic gates are constructed using a 'complementary' arrangement of NPN and PNP MOSFETs (see Figure 3). In a typical CMOS gate, the NPN MOSFET part controls the connection of the output to the ground (Vss) of the circuit; conversely, the PNP MOSFET part controls the connection of the output to the high voltage (Vdd).

Figure 4 shows how a CMOS Not-Gate is constructed from NPN and PNP transistors. Its complementary structure using two transistors ensures that the output is always swiftly drained or charged to the correct level, by the effect of both transistors always driving their output to opposite levels (when connected to the same input); when the NPN is 'on' the PNP is 'off'. This means that, for a given CMOS circuit, there is no set of inputs that will cause a continuous current from source to drain, and it also results in faster switching times.

Figure 3: The CMOS logical representation of NPN and PNP transistors with the associated switch behaviour

Figure 4: An example CMOS Not-Gate
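The complementary behaviour described above can be summarised in a small behavioural model. The following fragment is an illustrative sketch only, abstracting the transistors of Figure 4 to ideal switches: for either input value exactly one transistor conducts, so the output is always actively driven to the opposite logic level and there is never a conducting path from the supply to ground.

#include <stdio.h>

int main(void)
{
    for (int in = 0; in <= 1; in++) {
        int pmos_on = (in == 0);   /* PNP/PMOS: conducts when the gate is low  */
        int nmos_on = (in == 1);   /* NPN/NMOS: conducts when the gate is high */
        /* Exactly one switch is closed, so the output is driven either to
         * the supply (logic 1) or to ground (logic 0) - never both. */
        int out = pmos_on ? 1 : 0;
        printf("in=%d  pmos=%d  nmos=%d  out=%d\n", in, pmos_on, nmos_on, out);
    }
    return 0;
}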

References

[1] Egan, C.: Dynamic Branch Prediction In High Performance Super Scalar Processors. PhD thesis, University of Hertfordshire (August 2000)

[2] Parikh, D., Skadron, K., Zhang, Y., Barcella, M., Stan, M.R.: Power issues related to branch prediction. In: IEEE High Performance Computer Architecture (February 2002) 233–244

[3] Gonzalez, R., Horowitz, M.: Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits 31(9) (September 1996)

[4] Hicks, M., Egan, C., Quick, P., Christianson, B.: An introduction to power consumption issues in processor design. Technical report, University of Hertfordshire (July 2005)

[5] Amos, S., James, M.: Principles of Transistor Circuits. Butterworth-Heinemann (1999)

[6] Horowitz, P., Hill, W.: The Art of Electronics. Cambridge University Press (1989)

[7] AMD: Athlon 64 X2 Processor Manual. (2006)

Appendix D: Raw Results Data

The results on the following pages are the raw data for the major experiments conducted and presented in this dissertation. They are presented here in order to reinforce the veracity of the average results shown at the end of each experimental section or chapter. There are five main result sets:

Raw Scalar Data – The raw data collected for the experiments conducted against the scalar processor baseline

Raw 2Way Issue Data – The raw data collected for the experiments conducted against the two-way issue processor baseline

Raw 16Way Issue Data – The raw data collected for the experiments conducted against the sixteen-way issue processor baseline

Raw Fixed Bias Data – The raw data collected for the experiment conducted to examine the use of a fixed bias hinting level

Raw BTB Resize Data – The raw data collected for the experiment conducted to examine the potential savings achievable by resizing the BTB

Each of the columns in the five results sections presents one of the following values (by column title):

Benchmark – The name of the EEMBC benchmark

SBTOTAL – The total number of static branch instructions in the benchmark

SBHINTED – The number of static branches for which a hint was applied by the combined algorithm

DINSTOT – The total number of dynamic instructions executed

DINSNEW – The total number of dynamic instructions executed after the appli- cation of the combined algorithm

DBPOLD – The number of accesses made to the dynamic branch predictor during the benchmark’s execution

DBPNEW – The number of accesses made to the dynamic branch predictor during the benchmark’s execution after the application of the combined algorithm

PWROLD – The total average power used by the entire processor per executed instruction

PWRIDEAL – The total average power used by the entire processor per executed instruction if the branch predictor consumes no power

PWRNEW – The total average power used by the entire processor per executed instruction after the application of the combined algorithm

INSCOMMIT – The total number of instructions committed during the execution of the benchmark

INSCOMMITN – The total number of instructions committed during the execu- tion of the benchmark after the application of the combined algorithm

B_PWR_OLD – The average power per executed instruction consumed by the dynamic branch predictor

B_PWR_IDE – The average power per executed instruction consumed by the dynamic predictor under ideal circumstances (i.e. zero)

B_PWR_NEW – The average power per executed instruction consumed by the dynamic branch predictor after the application of the combined algorithm

CYCLES – The number of cycles for which the benchmark executed

CYCLESN – The number of cycles for which the benchmark executed after the application of the combined algorithm

Raw Scalar Data

Benchmark SBTOTAL SBHINTED DINSTOT DINSNEW DBPOLD DBPNEW
a2time01 222 60 41549 42139 8081 7319
aifftr01 203 57 4914484 4889046 618784 239710
aifirf01 196 52 69899 68434 11038 8483
aiifft01 193 51 4716593 4942761 594585 251154
basefp01 170 39 57676 57653 10412 7844
bitmnp01 353 70 346365 346744 55428 35771
cacheb01 187 34 224145 222942 31876 13942
canrdr01 237 46 997677 985405 227395 69888
idctrn01 277 45 402930 401975 48438 30168
iirflt01 240 89 61575 61553 10141 8581
matrix01 303 54 2560154 2341943 464366 211186
pntrch01 201 54 231860 229998 52427 19257
puwmod01 255 36 1590409 1570416 365212 108887
rspeed01 168 37 357576 343202 81178 26273
tblook01 186 56 102900 103557 22054 17637
ttsprk01 305 36 677040 669566 153507 49653
cjpeg 1605 495 64372423 64373440 6693144 951694
djpeg 1871 605 88392767 88390600 12757158 6450502
rgbcmy 157 39 107249021 107722658 23831899 7226389
rgbhpg 135 27 24060459 23928550 3074971 775380
rgbyiq 154 36 56309529 56308682 8762870 5997811
ip_pktcheck 1032 64 11540506 11581303 2601518 773116
ip_reassembly 1159 143 14067731 14105818 2838471 969862
nat 2197 282 13893181 13935740 3232061 1033761
ospfv2 1018 64 2889122 2895624 823721 137424
qos 1957 248 24732446 24720064 5173931 3347996
routelookup 1011 61 5075891 5022486 1204430 366426
tcp 1402 171 135502 134465 19368 17325
bezier01 145 26 4395358 4375263 481488 204679
dither01 151 28 10802113 10802102 1741737 1237889
rotate01 219 97 4101269 4089950 1053705 354865
text01 340 113 6176652 6152184 1501590 801101
autcor00 143 29 802917 802114 73933 25367
conven00 144 27 484479 484467 78542 58550
fbital00 155 32 964054 964242 145226 72338
fft00 170 39 310996 311422 37637 20116
viterb00 159 29 954995 954934 90636 45818

Raw Scalar Data (continued)

PWROLD PWRIDEAL PWRNEW INSCOMMIT INSCOMMITN B_PWR_OLD

15.44 14.3 15.25 41111 41743 0.22
10.72 10.04 10.41 4814678 4814680 0.26
14.06 13.13 13.78 69310 67900 0.21
10.68 10.02 10.31 4618154 4834126 0.26
14.72 13.59 14.29 57200 57200 0.23
11.31 10.39 11.01 340230 340276 0.3
12.14 11.44 11.79 219942 219942 0.23
10.15 8.61 9.26 958477 958477 0.53
11.17 10.36 10.78 401673 400825 0.27
14.57 13.58 14.32 60947 61014 0.2
10.15 8.85 9.48 2554423 2337796 0.48
11.66 10.25 10.79 226803 226363 0.43
10.05 8.61 9.15 1526642 1526642 0.54
10.6 9.16 9.48 344117 333981 0.49
12.3 10.93 11.99 99968 100598 0.36
10.23 8.83 9.42 649657 650289 0.51
10.92 10.21 10.27 64220166 64218532 0.22
10.19 9.42 9.71 87016324 87015471 0.33
9.9 8.56 8.99 103180826 103178575 0.54
9.59 9 9.19 23619293 23620126 0.31
9.33 8.59 9.06 54873160 54872316 0.39
10.21 8.8 9.22 11149713 11150956 0.53
10.72 9.63 10.09 13696229 13697048 0.43
10.43 9.04 9.59 13465421 13464651 0.5
10.95 9.22 9.47 2821506 2821541 0.59
11.9 10.57 11.44 24689707 24689762 0.41
10.18 8.69 9.25 4865802 4865526 0.56
13.84 12.95 13.75 134703 133746 0.21
10.64 10.11 10.42 4330053 4329246 0.2
10.02 8.99 9.69 10629212 10629166 0.4
11.45 10.06 10.37 4005387 4006235 0.42
11.28 9.79 10.68 6056185 6055422 0.48
10.32 9.7 9.83 800604 799751 0.22
10.17 9.32 9.96 478189 478180 0.34
9.99 8.93 9.44 961588 962151 0.41
10.72 10.03 10.43 307089 307144 0.24
10.07 9.47 9.71 950867 950867 0.21

Raw Scalar Data (continued)

B_PWR_IDE B_PWR_NEW CYCLES CYCLESN

0 0.19 141831 144075
0 0.11 9617038 9600434
0 0.16 201838 198265
0 0.11 9274505 9673491
0 0.16 178560 178600
0 0.2 711215 710705
0 0.1 558050 557053
0 0.18 1709088 1700424
0 0.16 789792 787493
0 0.16 194085 193993
0 0.23 4320228 3919284
0 0.16 478823 476291
0 0.18 2681154 2668109
0 0.18 656327 612656
0 0.28 240665 243343
0 0.18 1188788 1186913
0 0.03 127887158 127737436
0 0.16 158710195 158650852
0 0.18 176082615 176305902
0 0.09 39686269 39601305
0 0.26 90013199 90011228
0 0.17 19779465 19768542
0 0.16 26272257 26262977
0 0.18 25345957 25353866
0 0.11 5222716 5222589
0 0.26 50863537 50856328
0 0.18 8594914 8565831
0 0.19 371947 368460
0 0.09 9797398 9780669
0 0.27 18119440 18115709
0 0.15 9199002 9081073
0 0.28 12119141 12034603
0 0.07 1514188 1512356
0 0.25 848972 848934
0 0.2 1599963 1600948
0 0.13 613064 613154
0 0.09 1672164 1671940

Raw 2Way Issue Data

Benchmark SBTOTAL SBHINTED DINSTOT DINSNEW DBPOLD DBPNEW
a2time01 222 60 44504 44962 8849 8042
aifftr01 203 57 5205583 5260432 709833 261611
aifirf01 196 52 72944 72854 11833 9447
aiifft01 193 51 5009191 5019180 686495 245058
basefp01 170 39 60652 60720 11208 8643
bitmnp01 353 70 370420 369364 60674 37541
cacheb01 187 34 239070 238298 36590 17397
canrdr01 237 46 1122125 1104622 268432 96581
idctrn01 277 45 409551 408112 50249 31729
iirflt01 240 89 65331 64285 10938 9274
matrix01 303 54 2371357 2371196 452385 222633
pntrch01 201 54 248200 248526 57897 23381
puwmod01 255 36 1790167 1771864 431212 153336
rspeed01 168 37 401152 399597 95399 38381
tblook01 186 56 114006 113313 25555 19378
ttsprk01 305 36 761984 759302 181275 69273
cjpeg 1605 499 66469032 66321206 7343149 1005976
djpeg 1871 607 93190794 92873152 14192112 6897377
rgbcmy 157 39 119966124 120706098 27856479 7556304
rgbhpg 135 27 25493493 25461427 3515694 1081716
rgbyiq 154 36 60507041 60212497 10097001 6415242
ip_pktcheck 1032 64 12797880 12953787 2994032 925694
ip_reassembly 1159 142 15355667 15499895 3239834 991001
nat 2197 283 15297396 15363066 3651347 1041810
ospfv2 1018 64 3194773 3117231 907792 148083
qos 1957 249 26490998 26486538 6187306 2732128
routelookup 1011 60 5738990 5721649 1422391 578652
tcp 1402 170 140520 139592 20764 18472
bezier01 145 26 4698619 4689751 571529 250672
dither01 151 28 11556774 11538718 1911041 1343649
rotate01 219 97 4694874 4637594 1271293 377792
text01 340 113 6778734 6697199 1683939 906053
autcor00 143 29 1112918 1111536 104138 55014
conven00 144 27 506602 504220 85285 61002
fbital00 155 32 975966 974604 148204 74160
fft00 170 39 409454 409385 52102 31325
viterb00 159 29 991754 990498 94455 46651

Raw 2Way Issue Data (continued)

PWROLD PWRIDEAL PWRNEW INSCOMMIT INSCOMMITN B_PWR_OLD

15.54 14.59 15.37 41111 41743 0.45
11.02 10.39 10.59 4814733 4855993 0.67
14.39 13.58 14.14 68519 68473 0.46
10.98 10.36 10.55 4618141 4616553 0.67
14.92 14 14.59 57251 57249 0.47
11.63 10.87 11.33 340175 339391 0.68
11.94 11.26 11.51 219942 219942 0.65
9.68 8.36 8.67 958477 948313 1.21
11.69 10.98 11.22 401710 400715 0.63
14.76 13.97 14.61 61001 60389 0.42
9.62 8.55 9.03 2337752 2337697 1.33
11.59 10.41 10.78 225082 226308 1
9.58 8.36 8.61 1526642 1516506 1.23
10.1 8.9 9.26 344117 344117 1.11
11.9 10.77 11.58 99968 99966 0.79
9.78 8.59 8.94 649657 650289 1.16
10.84 10.22 10.25 64219226 64220119 0.6
10.41 9.7 9.95 87016368 87017396 0.82
9.54 8.39 8.62 103180019 103180817 1.24
9.57 9.01 9.11 23620414 23620160 0.78
9.35 8.65 9.04 54873058 54872393 0.9
9.96 8.75 9.07 11150829 11236010 1.24
11.01 10.05 10.28 13696200 13782323 0.99
10.23 9.05 9.34 13465586 13463832 1.14
10.93 9.4 9.52 2907406 2821385 1.43
13.33 12.06 12.54 24689703 24689714 0.91
9.84 8.57 9.02 4866420 4865521 1.27
14.37 13.65 14.27 133700 133168 0.49
10.01 9.5 9.71 4329290 4327551 0.56
9.81 8.93 9.51 10630031 10629150 0.99
9.76 8.54 8.79 4006281 4006281 1.27
11.04 9.81 10.43 6056241 6055465 1.13
10.82 10.29 10.49 1098257 1097120 0.48
10.2 9.45 9.97 478143 477235 0.8
10.33 9.39 9.84 962087 961353 0.9
10.92 10.31 10.65 388473 388507 0.57
10.1 9.61 9.79 950920 950922 0.53

Raw 2Way Issue Data (continued)

B_PWR_IDE B_PWR_NEW CYCLES CYCLESN

0 0.39 71376 71915
0 0.27 4041740 4043002
0 0.35 95840 95784
0 0.27 3880251 3843471
0 0.34 89186 89331
0 0.44 329698 326860
0 0.3 211971 209992
0 0.46 823887 783612
0 0.38 342448 340381
0 0.34 96451 95062
0 0.62 1500413 1489456
0 0.39 218762 217285
0 0.46 1292703 1244917
0 0.45 318157 312766
0 0.6 119171 118281
0 0.45 577815 567870
0 0.07 49730502 49349504
0 0.39 67477369 66895964
0 0.4 83840665 82584280
0 0.24 17237911 17087620
0 0.58 41926883 41442500
0 0.45 9224348 9148293
0 0.36 12388318 12260740
0 0.39 11899119 11700035
0 0.28 2304865 2207585
0 0.42 26467601 26446974
0 0.53 4137572 4067355
0 0.44 165978 164500
0 0.26 3943089 3911980
0 0.67 7577970 7550564
0 0.46 3259254 3136415
0 0.67 5424935 5234339
0 0.24 965463 963419
0 0.58 376917 373948
0 0.44 732553 731000
0 0.36 352025 350015
0 0.22 686459 681172

Raw 16Way Issue Data

Benchmark SBTOTAL SBHINTED DINSTOT DINSNEW DBPOLD DBPNEW
a2time01 222 63 51950 51398 11502 10254
aifftr01 203 56 6078136 6013854 1227218 741816
aifirf01 196 52 82162 81690 15302 12661
aiifft01 193 50 5874591 5775718 1203039 671495
basefp01 170 39 68481 68601 14110 11196
bitmnp01 353 69 429717 422865 89937 61351
cacheb01 187 33 279277 277123 59855 38598
canrdr01 237 45 1467631 1434412 488506 288365
idctrn01 277 47 426948 424409 57104 36668
iirflt01 240 90 74280 73263 14007 12280
matrix01 303 54 2466673 2466651 505432 248314
pntrch01 201 53 296619 292201 86611 48075
puwmod01 255 35 2345199 2292475 788757 461872
rspeed01 168 35 521206 525444 170707 133059
tblook01 186 56 143196 144060 41959 34516
ttsprk01 305 35 1004257 982956 333788 200061
cjpeg 1605 497 69674771 68877679 9423886 1548797
djpeg 1871 606 107151269 107413955 22766481 13830107
rgbcmy 157 38 155392686 150210104 50451231 26696394
rgbhpg 135 26 29866460 29556962 6263206 3417625
rgbyiq 154 37 73870147 72580408 18457045 10298440
ip_pktcheck 1032 62 16184279 16294369 5167679 3776602
ip_reassembly 1159 143 19150870 18532531 5652637 2878826
nat 2197 282 19807927 18884723 6384975 3260560
ospfv2 1018 63 3712293 3633207 1294763 462790
qos 1957 249 33067877 32715306 10012875 3606394
routelookup 1011 59 7592685 7642162 2629466 1858342
tcp 1402 170 153034 151569 25471 22918
bezier01 145 25 5310723 5244846 964004 588293
dither01 151 28 13021238 13027801 2615539 1842582
rotate01 219 97 5467524 5227597 2020320 589700
text01 340 113 8147847 7931551 2498038 1548998
autcor00 143 28 673230 670708 63472 13612
conven00 144 29 623569 624730 135826 84649
fbital00 155 31 996704 995188 159707 83901
fft00 170 38 543715 539226 85544 62174
viterb00 159 29 1049414 1048032 108734 55190

Raw 16Way Issue Data (continued)

PWROLD PWRIDEAL PWRNEW INSCOMMIT INSCOMMITN B_PWR_OLD

27.29 26.48 27.18 41093 41093 0.65
13.72 13.07 13.36 4814542 4812697 1.5
23.23 22.54 23.24 68519 68473 0.71
13.6 12.95 13.51 4616073 4618082 1.51
25.89 25.09 25.51 57251 57251 0.69
15.82 15.09 15.76 340285 339345 1.22
15.81 15.11 15.48 219942 219942 1.28
12.79 11.67 12.17 958477 958477 2.33
15.68 15.03 15.42 401655 400825 1.12
25.3 24.62 25.28 60947 61014 0.61
11.95 10.93 11.42 2337645 2337807 2.75
15.73 14.59 15.03 226258 225450 1.85
12.61 11.36 11.96 1526010 1526010 2.39
13.7 12.58 13.28 344117 344117 2.08
17.68 16.64 17.35 99968 99966 1.36
13.08 11.99 12.48 649657 649657 2.23
13.58 12.98 13.02 64219272 64220175 1.32
13.08 12.36 12.57 87015476 87016370 1.75
12.35 11.26 12.2 103180254 103179978 2.45
12.7 12.09 12.25 23621032 23620261 1.56
12.37 11.64 11.92 54872316 54870855 1.78
12.85 11.71 12.29 11149935 11167586 2.41
14.33 13.36 14.12 13696275 13695441 2.11
13.46 12.36 13.35 13483125 13464733 2.21
13.4 12.07 12.64 2822279 2821558 2.69
18.19 17.02 16.65 24689714 24689705 1.63
12.71 11.54 12.11 4865641 4866420 2.47
19.91 19.22 19.88 133174 133220 0.93
14.02 13.47 13.67 4329235 4328560 1.11
13 12.18 12.67 10630029 10630031 1.79
12.64 11.46 11.69 4006281 4006281 2.5
14.03 12.93 13.29 6055623 6055388 2.11
15.51 15.05 15.09 655777 654562 0.74
13.86 13.18 13.46 478134 477295 1.4
14.82 14.02 14.38 962149 961353 1.32
14.88 14.3 14.84 477936 477963 1.07
13.25 12.8 12.95 950920 950858 1.01

Raw 16Way Issue Data (continued)

B_PWR_IDE B_PWR_NEW CYCLES CYCLESN

0 0.57 55261 54596
0 0.87 2417727 2355222
0 0.55 68733 69054
0 0.88 2326835 2300476
0 0.5 66869 66972
0 0.87 225774 223328
0 0.78 138162 135655
0 1.33 589700 559954
0 0.68 202737 201952
0 0.51 72203 71651
0 1.27 765675 761706
0 0.96 143834 139660
0 1.35 923119 873365
0 1.56 230276 225447
0 1.07 88633 87825
0 1.29 417399 397498
0 0.19 25576017 25088122
0 0.94 39925075 39134499
0 1.36 58508341 57250211
0 0.78 11584552 11226897
0 0.92 29324101 28297600
0 1.7 6315263 6135972
0 1.17 7742677 7494768
0 1.21 8283728 8021543
0 0.95 1450531 1421548
0 0.54 19604771 18596665
0 1.65 2981327 2911761
0 0.87 94207 92995
0 0.67 2594533 2519751
0 1.21 4976926 4941027
0 0.82 2077134 1928066
0 1.41 3501841 3255286
0 0.11 361618 360437
0 0.86 270047 267394
0 0.66 512098 510503
0 0.81 265065 263993
0 0.41 382439 380818

Raw Fixed Bias Data

Benchmark SBTOTAL SBHINTED DINSTOT DINSNEW DBPOLD DBPNEW
a2time01 222 58 51987 51624 11502 10395
aifftr01 203 59 6073211 6245699 1225405 684607
aifirf01 196 51 82162 81906 15302 12592
aiifft01 193 52 5875297 6049734 1203191 675325
basefp01 170 37 68481 68403 14110 11327
bitmnp01 353 75 429717 435002 89937 60639
cacheb01 187 32 279277 284874 59855 36837
canrdr01 237 41 1467631 1514003 488506 269682
idctrn01 277 71 426874 439042 57093 19883
iirflt01 240 75 72572 71456 13420 11834
matrix01 303 55 2466752 2548626 505438 128599
pntrch01 201 54 296688 303483 86647 44328
puwmod01 255 34 2345977 2438691 788898 434596
rspeed01 168 34 521206 527393 170707 94866
tblook01 186 53 143196 146852 41959 27618
ttsprk01 305 34 1004257 1042464 333788 189430
cjpeg 1605 452 69676078 69530480 9424234 1964536
djpeg 1871 563 107152476 108802371 22766756 9783290
rgbcmy 157 33 155393345 161372796 50451354 27629126
rgbhpg 135 25 29865255 30415643 6262914 3138967
rgbyiq 154 32 73871313 75318239 18457381 9418021
ip_pktcheck 1032 60 16185208 16784198 5167891 2742547
ip_reassembly 1159 132 19151811 19532505 5652789 2984282
nat 2197 278 19785458 20122156 6381108 3316697
ospfv2 1018 68 3712302 3835241 1294770 465370
qos 1957 235 33067860 30214607 10012858 6169910
routelookup 1011 59 7592663 7846541 2629433 1381707
tcp 1402 145 153213 152608 25531 23149
bezier01 145 24 5310737 5390270 964001 563258
dither01 151 30 13020067 13312745 2615212 988126
rotate01 219 96 5466390 5385007 2020010 550375
text01 340 92 8146411 8173931 2497653 1627628
autcor00 143 27 742308 740316 72412 22673
conven00 144 27 623582 633740 135826 76292
fbital00 155 33 995511 1062465 159395 20890
fft00 170 39 619933 632805 93407 65893
viterb00 159 30 1049406 1048519 108743 53815

Raw Fixed Bias Data (continued)

PWROLD PWRIDEAL PWRNEW INSCOMMIT INSCOMMITN B_PWR_OLD

27.27 26.46 27.13 41111 41111 0.64
13.68 13.03 13.3 4815140 4814568 1.5
23.22 22.49 23.16 68519 68473 0.71
13.6 12.95 13.22 4616717 4618061 1.51
25.89 25.09 25.56 57251 57249 0.69
15.82 15.09 15.52 340285 339345 1.22
15.81 15.11 15.44 219942 219942 1.28
12.79 11.67 12.04 958477 948313 2.33
15.67 15.02 15.23 401545 400715 1.12
25.41 24.74 25.4 59546 59605 0.61
11.95 10.93 11.44 2337755 2337755 2.75
15.74 14.59 14.95 226320 225404 1.85
12.62 11.49 12.06 1526642 1526642 2.39
13.7 12.01 12.57 344117 333981 2.08
17.68 16.64 17.17 99968 99966 1.36
13.08 11.98 12.56 649657 650289 2.23
13.58 12.98 13 64220175 64219454 1.32
13.08 12.36 12.51 87016370 87015583 1.75
12.35 11.26 11.82 103180817 103180817 2.45
12.7 12.09 12.26 23619888 23620361 1.56
12.37 11.64 11.94 54873167 54872549 1.78
12.85 11.71 12.17 11150784 11150743 2.41
14.33 13.36 13.69 13697107 13695700 2.11
13.45 12.36 12.88 13463657 13464637 2.21
13.4 12.07 12.33 2822277 2821506 2.69
18.19 17.02 17.1 24689661 24689760 1.63
12.71 11.54 12.06 4865555 4866420 2.47
19.89 19.22 19.85 133174 133168 0.93
14.02 13.47 13.67 4329290 4328565 1.11
13 12.18 12.43 10629212 10629349 1.79
12.64 11.46 11.75 4005502 4006226 2.5
14.03 12.93 13.43 6054506 6056131 2.11
15.31 14.84 14.89 719886 718022 0.77
13.87 13.18 13.43 478189 478079 1.4
14.83 14.02 13.77 961298 961312 1.32
14.67 14.09 14.41 551613 551702 1.07
13.25 12.8 12.93 950865 950867 1.01

Raw Fixed Bias Data (continued)

B_PWR_IDE B_PWR_NEW CYCLES CYCLESN

0 0.57 55221 54680
0 0.7 2414915 2463884
0 0.54 68733 68970
0 0.71 2327194 2377281
0 0.51 66869 66921
0 0.77 225774 227014
0 0.68 138162 139435
0 1.08 589700 585212
0 0.3 202509 212410
0 0.51 70859 70226
0 0.56 765865 829970
0 0.78 143961 145386
0 1.07 924014 943364
0 1.04 230276 215954
0 0.79 88633 89539
0 1.04 417399 426484
0 0.21 25576614 25293250
0 0.58 39925577 39948190
0 1.09 58508620 59990200
0 0.63 11583505 11609890
0 0.74 29324550 29418012
0 1.05 6316034 6441633
0 0.98 7743195 7693337
0 0.99 8270567 8277239
0 0.74 1450556 1474343
0 1.09 19604744 16693717
0 1.06 2981862 3036705
0 0.85 94244 93633
0 0.59 2594646 2590469
0 0.53 4976620 5118839
0 0.68 2076138 2008814
0 1.39 3501202 3412484
0 0.19 392452 390952
0 0.73 270105 271033
0 0.11 511736 539756
0 0.72 297403 303331
0 0.4 382376 380374

Raw BTB Resize Data

Benchmark SBTOTAL SBHINTED DINSTOT DINSNEW DBPOLD DBPNEW
a2time01 222 63 44962 44952 8042 8044
aifftr01 203 57 5260432 5192266 261611 304767
aifirf01 196 52 72854 72860 9447 9436
aiifft01 193 51 5019180 5019940 245058 245211
basefp01 170 39 60720 60736 8643 8638
bitmnp01 353 70 369364 369332 37541 37537
cacheb01 187 34 238298 238260 17397 17364
canrdr01 237 46 1104622 1115618 96581 98645
idctrn01 277 46 408112 427313 31729 41136
iirflt01 240 92 64285 63302 9274 9038
matrix01 303 54 2371196 2371292 222633 222642
pntrch01 201 54 248526 247538 23381 23186
puwmod01 255 36 1771864 1771852 153336 153333
rspeed01 168 37 399597 399598 38381 38372
tblook01 186 56 113313 114011 19378 19494
ttsprk01 305 36 759302 759298 69273 69260
cjpeg 1605 500 66321206 66321566 1005976 1005963
djpeg 1871 607 92873152 92872321 6897377 6897209
rgbcmy 157 39 120706098 120705181 7556304 7556111
rgbhpg 135 27 25461427 25462324 1081716 1081926
rgbyiq 154 36 60212497 60213307 6415242 6415423
ip_pktcheck 1032 64 12953787 12980122 925694 928461
ip_reassembly 1159 142 15499895 15526260 991001 993764
nat 2197 283 15363066 15364002 1041810 1041975
ospfv2 1018 64 3117231 3117379 148083 148109
qos 1957 249 26486538 26487781 2732128 2732123
routelookup 1011 60 5721649 5721546 578652 578636
tcp 1402 170 139592 139582 18472 18455
bezier01 145 26 4689751 4689763 250672 250667
dither01 151 28 11538718 11538879 1343649 1343639
rotate01 219 97 4637594 4637534 377792 377772
text01 340 113 6697199 6696832 906053 906026
autcor00 143 28 1111536 1245699 55014 69312
conven00 144 27 504220 504278 61002 61001
fbital00 155 32 974604 974558 74160 74147
fft00 170 39 409385 602340 31325 51892
viterb00 159 29 990498 990450 46651 46639

Raw BTB Resize Data (continued)

PWROLD PWRIDEAL PWRNEW INSCOMMIT INSCOMMITN B_PWR_OLD

15.37 14.56 15.28 41743 41725 0.39
10.59 10.41 10.62 4855993 4812097 0.27
14.14 13.57 14.08 68473 68473 0.35
10.55 10.35 10.51 4616553 4617358 0.27
14.59 13.98 14.51 57249 57249 0.34
11.33 10.86 11.27 339391 339331 0.44
11.51 11.26 11.49 219942 219942 0.3
8.67 8.35 8.74 948313 958477 0.46
11.22 10.96 11.24 400715 408178 0.38
14.61 14.04 14.6 60389 59605 0.34
9.03 8.55 8.99 2337697 2337807 0.62
10.78 10.4 10.74 226308 225358 0.39
8.61 8.35 8.57 1516506 1516506 0.46
9.26 8.61 9.22 344117 344117 0.45
11.58 10.75 11.52 99966 100598 0.6
8.94 8.58 8.89 650289 650289 0.45
10.25 10.22 10.25 64220119 64220217 0.07
9.95 9.7 9.92 87017396 87016315 0.39
8.62 8.38 8.58 103180817 103179973 0.4
9.11 9.01 9.09 23620160 23620940 0.24
9.04 8.65 9 54872393 54873114 0.58
9.07 8.75 9.03 11236010 11261933 0.45
10.28 10.04 10.24 13782323 13808246 0.36
9.34 9.04 9.29 13463832 13464767 0.39
9.52 9.4 9.49 2821385 2821546 0.28
12.54 12.05 12.48 24689714 24689760 0.42
9.02 8.56 8.96 4865521 4865457 0.53
14.27 13.64 14.21 133168 133168 0.44
9.71 9.48 9.67 4327551 4327551 0.26
9.51 8.93 9.46 10629150 10629304 0.67
8.79 8.53 8.76 4006281 4006217 0.46
10.43 9.8 10.36 6055465 6055040 0.67
10.49 10.25 10.46 1097120 1227511 0.24
9.97 9.45 9.93 477235 477290 0.58
9.84 9.39 9.79 961353 961298 0.44
10.65 10.12 10.51 388507 576932 0.36
9.79 9.61 9.78 950922 950858 0.22

Raw BTB Resize Data (continued)

B_PWR_IDE B_PWR_NEW CYCLES CYCLESN

0 0.36 71915 72033
0 0.27 4043002 3999929
0 0.32 95784 95834
0 0.24 3843471 3844466
0 0.31 89331 89405
0 0.4 326860 326718
0 0.27 209992 210035
0 0.41 783612 805464
0 0.41 340381 363007
0 0.31 95062 94153
0 0.56 1489456 1489842
0 0.36 217285 216537
0 0.42 1244917 1244916
0 0.41 312766 312843
0 0.54 118281 119422
0 0.41 567870 567927
0 0.07 49349504 49350358
0 0.35 66895964 66895749
0 0.36 82584280 82583827
0 0.22 17087620 17088202
0 0.53 41442500 41443168
0 0.41 9148293 9169065
0 0.32 12260740 12281235
0 0.35 11700035 11701031
0 0.26 2207585 2207309
0 0.38 26446974 26449816
0 0.48 4067355 4067375
0 0.4 164500 164637
0 0.24 3911980 3912047
0 0.61 7550564 7550833
0 0.42 3136415 3136278
0 0.61 5234339 5233534
0 0.25 963419 1076183
0 0.53 373948 374065
0 0.4 731000 730982
0 0.4 350015 495221
0 0.2 681172 681155
