Energy Efficient Branch Prediction

Michael Andrew Hicks

A thesis submitted in partial fulfilment of the requirements of the University of Hertfordshire for the degree of Doctor of Philosophy

December 2007

To my family and friends.

Contents

1 Introduction
    1.1 Thesis Statement
    1.2 Motivation and Energy Efficiency
    1.3 Branch Prediction
    1.4 Contributions
    1.5 Dissertation Structure
2 Energy Efficiency in Modern Processor Design
    2.1 Transistor Level Power Dissipation
        2.1.1 Static Dissipation
        2.1.2 Dynamic Dissipation
        2.1.3 Energy Efficiency Metrics
    2.2 Transistor Level Energy Efficiency Techniques
        2.2.1 Clock Gating and Vdd Gating
        2.2.2 Technology Scaling
        2.2.3 Voltage Scaling
        2.2.4 Logic Optimisation
    2.3 Architecture & Software Level Efficiency Techniques
        2.3.1 Activity Factor Reduction
        2.3.2 Delay Reduction
        2.3.3 Low Power Scheduling
        2.3.4 Frequency Scaling
    2.4 Branch Prediction
        2.4.1 The Branch Problem
        2.4.2 Dynamic and Static Prediction
        2.4.3 Dynamic Predictors
        2.4.4 Power Consumption
    2.5 Summary
3 Related Techniques
    3.1 The Prediction Probe Detector (Hardware)
        3.1.1 Implementation
        3.1.2 Pipeline Gating
    3.2 Software Based Approaches
        3.2.1 Hinting and Hint Instructions
    3.3 Analysis and Summary
4 Initial Investigation and Preliminary Research
    4.1 Research Question Focus
    4.2 Static Methods to Avoid Dynamic Branch Prediction
        4.2.1 Delay Region Scheduling
        4.2.2 Static Prediction and Instruction Hints
        4.2.3 Guarded Execution
    4.3 Hardware Multithreading
    4.4 Initial Experiments
        4.4.1 Removing Dynamic Branch Predictors
        4.4.2 Instruction Stream Research (HTracer)
        4.4.3 I-Cache Experimentation
    4.5 Summary
5 The Combined Approach
    5.1 Local Delay Region Scheduling
    5.2 Profiling
        5.2.1 Assigning a Static Branch Behaviour
        5.2.2 Adaptive Branch Bias Measurement (ABBM)
    5.3 The Combined Algorithm
    5.4 Hardware Implementation
        5.4.1 Instruction Set Modifications
        5.4.2 Hardware Modifications
    5.5 Summary
6 Simulation Tools
    6.1 Introduction
    6.2 Simulator (HWattch)
        6.2.1 Architecture Model
        6.2.2 Architecture Modifications
        6.2.3 Profiling Enhancement
        6.2.4 Instruction Set (PISA)
        6.2.5 Compiler (Custom GCC)
    6.3 Scheduler and Static Prediction Assigner (HACA)
        6.3.1 Combined Algorithm: Practical Implementation
    6.4 EEMBC
        6.4.1 Sub-Suites and Benchmarks
        6.4.2 Bespoke Build System for the Combined Algorithm
    6.5 Summary
7 Simulations and Results
    7.1 Introduction
    7.2 The Baseline Models
        7.2.1 The Branch Predictor
        7.2.2 Scalar Processor
        7.2.3 Multiple Instruction Issue Processor
    7.3 Preamble To Results
        7.3.1 Metrics
        7.3.2 Calculation of Averages and ‘Weighted Averages’
        7.3.3 Important Summary Notes
    7.4 Scalar Processor Results
        7.4.1 Benchmark Breakdown
        7.4.2 Averages
    7.5 Two Instruction Issue Processor Results
        7.5.1 Benchmark Breakdown
        7.5.2 Averages
    7.6 Sixteen Instruction Issue Processor Results
        7.6.1 Benchmark Breakdown
        7.6.2 Averages
    7.7 Overall Analysis
        7.7.1 Results Summary
8 Comparisons and Enhancements
    8.1 Comparison of ABBM with Fixed Bias Level and Compiler Heuristics
        8.1.1 Results and Analysis
    8.2 Reducing Set Associativity in the Branch Target Buffer
        8.2.1 Results and Analysis
    8.3 Summary
9 Conclusion and Discussion
    9.1 Thesis Summary
        9.1.1 Key Novelties and Contributions
    9.2 Generalisation
    9.3 Critique
        9.3.1 Local Delay Region
        9.3.2 Hint Bits
        9.3.3 Timing Issues
        9.3.4 Profiling Duration
        9.3.5 Profiling on a ‘Real’ Architecture
        9.3.6 Dependency on Datasets
    9.4 Related Work Comparison
        9.4.1 Prediction Probe Detector
    9.5 Future Work
        9.5.1 Maximising the Fetch Window of Wide Issue Processors
        9.5.2 Hinting Libraries
        9.5.3 Combining with the Prediction Probe Detector
        9.5.4 Hints and Context Switching
        9.5.5 Profiling and Processor-Wide Power Saving
    9.6 Concluding Remarks
Bibliography
Glossary
Appendix A: Published Papers
    i Towards an Energy Efficient Branch Prediction Scheme
    ii Reducing the Branch Power Cost In Embedded Processors
    iii HTracer: A Dynamic Instruction Stream Research Tool
    iv Enhancing the I-cache to Reduce the Power Consumption
Appendix B: Technical Reports
    i An Introduction to Power Consumption Issues in Processor Design
    ii HTracer V0.5: A User Guide
Appendix C: Additional Background
Appendix D: Raw Data

List of Figures

2.1 An example five stage processor pipeline
2.2 An example of a modern dynamic predictor architecture
3.1 The Prediction Probe Detector
5.1 An example of local delayed branch scheduling
5.2 The basic structure of profiling
5.3 Block model of the profiling and hinting regime
5.4 Hardware modifications required in the instruction fetch stage
6.1 The Wattch simulator in relation to SimpleScalar
6.2 The Wattch simulator pipeline
6.3 A logical representation of the IF stage hint-bits showing ‘1,1’
6.4 A logical representation of the EXE stage hint-bits showing ‘1,0’
6.5 The PISA instruction format
6.6 The location of the two hint-bits within the branch instruction format
7.1 Scalar baseline global power savings (%) compared with ideal (free) prediction
7.2 Scalar baseline average global power savings (%) compared with ideal (free) prediction
7.3 2-way issue baseline global power savings (%) compared with ideal (free) prediction
7.4 2-way issue baseline average power savings (%) compared with ideal (free) prediction
7.5 16-way issue baseline global power savings (%) compared with ideal (free) prediction
7.6 16-way issue baseline average power savings compared with ideal (free) prediction
8.1 Average Change in the dynamic instruction stream after resizing the BTB from four-way to two-way set-associativity
8.2 Additional power saving after resizing the BTB
9.1 Series and parallel i-cache/branch predictor access. (1) and (2) represent the direction and target address predictors, respectively

List of Tables

5.1 Static and dynamic branch occurrence for each PISA branch, and its occurrence across the whole EEMBC benchmark suite
6.1 Static and dynamic branch occurrence for each PISA branch, and its occurrence across the whole EEMBC benchmark suite
6.2 The full EEMBC benchmark suite with descriptions
7.1 Scalar Processor Baseline Configuration
7.2 2-Way Issue Processor Baseline Configuration
7.3 16-Way Issue Processor Baseline Configuration
7.4 Benchmark breakdown results for scalar baseline processor
7.5 Average benchmark results for scalar baseline processor
7.6 Benchmark breakdown results for two-way issue baseline processor
7.7 Average benchmark results for two-way issue baseline processor
7.8 Benchmark breakdown results for sixteen-way issue baseline processor

