UC San Diego Electronic Theses and Dissertations

Title Architectural and Software Optimizations for Next-Generation Heterogeneous Low-Power Mobile Application Processors

Permalink https://escholarship.org/uc/item/96m459f8

Author Bournoutian, Garo

Publication Date 2014

Peer reviewed|Thesis/dissertation

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Architectural and Software Optimizations for Next-Generation Heterogeneous Low-Power Mobile Application Processors

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Computer Science (Computer Engineering)

by

Garo Bournoutian

Committee in charge:

Professor Alex Orailoglu, Chair
Professor Chung-Kuan Cheng
Professor Sadik Esener
Professor William Griswold
Professor Ryan Kastner

2014

Copyright
Garo Bournoutian, 2014
All rights reserved.

The dissertation of Garo Bournoutian is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California, San Diego

2014

DEDICATION

To my parents, for their unwavering love, support, and encouragement.

EPIGRAPH

Strive not to be a success, but rather to be of value. —Albert Einstein

TABLE OF CONTENTS

Signature Page ...... iii

Dedication ...... iv

Epigraph ...... v

Table of Contents ...... vi

List of Figures ...... ix

List of Tables ...... xi

Acknowledgements ...... xii

Vita ...... xiv

Abstract of the Dissertation ...... xvi

Chapter 1 Introduction ...... 1 1.1 Heterogeneous Processor Topology ...... 3 1.2 Distributed Software Ecosystem ...... 6 1.3 Vision of Dissertation ...... 7 1.3.1 Enhancing Compiler-Device Interface ...... 8 1.3.2 Application-Specific Code Optimizations ...... 11 1.3.3 Dynamically Tuning Microarchitecture ...... 13 1.4 Dissertation Summary ...... 15

Chapter 2 Related Work ...... 17 2.1 Mobile Ecosystem ...... 17 2.1.1 Software Power Estimation Techniques ...... 17 2.1.2 Usage Pattern Learning Techniques ...... 18 2.1.3 Communication Power Saving Techniques ...... 19 2.2 Cache Performance and Power Optimization ...... 21 2.2.1 Cache Optimization Techniques ...... 21 2.2.2 Cache Power Optimization Techniques ...... 22 2.2.3 Cache Coherence Optimization Techniques ...... 25 2.3 Branch Prediction Optimization ...... 27 2.3.1 Predictor Optimization Techniques ...... 27 2.3.2 Target Buffer Optimization Techniques ...... 29

Chapter 3 Eliminating Control Flow Overheads ...... 31 3.1 Motivation ...... 32 3.2 Methodology ...... 37 3.2.1 Extraction of Source Code Metadata ...... 39 3.2.2 Pruning Unused Classes and Methods ...... 41 3.2.3 Optimizing Dynamic Dispatch Sites ...... 41 3.2.4 On-Device Application Optimization Flow ...... 43 3.3 Experimental Results ...... 44 3.4 Conclusions ...... 48

Chapter 4 Resolving Memory Stalls in In-Order Processors ...... 50 4.1 Motivation ...... 52 4.2 Methodology ...... 54 4.2.1 Compile-Time Transformations ...... 55 4.2.2 Run-Time Run-Ahead Mechanism ...... 59 4.3 Experimental Results ...... 63 4.4 Conclusions ...... 69

Chapter 5 Reducing Branch Target Buffer Power ...... 71 5.1 Motivation ...... 73 5.2 Methodology ...... 76 5.2.1 Extracting Source Code Metadata ...... 78 5.2.2 On-Device Analysis and Optimization ...... 79 5.2.3 Generation of ACBTB Tables ...... 79 5.2.4 Handling Mobile Software Polymorphism ...... 81 5.3 Experimental Results ...... 83 5.4 Conclusions ...... 88

Chapter 6 Reducing Memory Subsystem Power ...... 90 6.1 Motivation ...... 91 6.2 Methodology ...... 95 6.2.1 Compiler-Driven Threshold Selection ...... 98 6.2.2 Cache Expansion Mechanism ...... 103 6.2.3 Cache Contraction Mechanism ...... 105 6.3 Experimental Results ...... 108 6.4 Conclusions ...... 119

Chapter 7 Reducing Cache Coherence Power ...... 122 7.1 Motivation ...... 124 7.2 Methodology ...... 127 7.2.1 Modified Read-Miss Snooping Behavior ...... 129 7.2.2 Transitioning Between Policies ...... 130 7.3 Experimental Results ...... 131

7.4 Conclusions ...... 137

Chapter 8 Reducing Instruction Pipeline Power ...... 140 8.1 Motivation ...... 141 8.2 Implementation ...... 144 8.2.1 Hardware Architecture ...... 144 8.2.2 Software-Driven Reset Thresholds ...... 147 8.3 Experimental Results ...... 148 8.4 Conclusions ...... 151

Chapter 9 Conclusions ...... 153

Bibliography ...... 158

LIST OF FIGURES

Figure 1.1: Architectural Overview of Heterogeneous Processor Topology ... 4 Figure 1.2: Example DVFS Curves for Different Architectures ...... 5 Figure 1.3: On-Device Optimization Framework ...... 9 Figure 1.4: Extended Compiler-Processor Interface ...... 10

Figure 3.1: Example Objective-C Application ...... 34 Figure 3.2: High-Level On-Device Optimization Framework ...... 38 Figure 3.3: Example Class Hierarchy and Method Partial Ordering ...... 40 Figure 3.4: Reduction in Polymorphic Sites ...... 45 Figure 3.5: Executed Polymorphic Calls ...... 46 Figure 3.6: Ratio of Control Flow Instructions ...... 47

Figure 4.1: Average Distribution of Memory Instructions ...... 51 Figure 4.2: GSM Example Basic Block ...... 53 Figure 4.3: Overview of Compiler and Hardware Interaction ...... 54 Figure 4.4: Compile-Time Basic Block Reordering Algorithm ...... 56 Figure 4.5: Excerpt from FFT algorithm with DFG and Reordering ...... 57 Figure 4.6: Basic Block Memory Independence Annotation (memTable) . . . . 59 Figure 4.7: Average Number of Load Instructions per Basic Block ...... 60 Figure 4.8: Average and 90th Percentile Number of Independent Instructions per Basic Block ...... 61 Figure 4.9: Load Instruction Buffer (loadBuf) ...... 62 Figure 4.10: Run-Time Independent Instruction Execution Behavioral Algorithm 64 Figure 4.11: Success of Filling Stalls With Independent Instructions ...... 65 Figure 4.12: L1 Cache Miss Rates ...... 66 Figure 4.13: Overall Run-Time Execution Improvement ...... 68 Figure 4.14: Overall Energy Consumption Contribution ...... 69

Figure 5.1: Typical Branch Prediction Architecture ...... 74 Figure 5.2: High-Level On-Device Optimization Framework ...... 77 Figure 5.3: Application Customizable BTB (ACBTB) Architecture ...... 80 Figure 5.4: Overview of Compiler and Hardware Interaction ...... 80 Figure 5.5: Proposed BTB + ACBTB Architecture ...... 82 Figure 5.6: Total Branch Target Prediction Subsystem Power Improvement . . 85 Figure 5.7: Percent of Execution Cycles Where Standard BTB is Turned Off . . 86 Figure 5.8: Target Address Miss-Cycles Reduction ...... 87

Figure 6.1: L2 Cache Access and Miss Rate Distribution for h264dec ..... 92 Figure 6.2: High-Level Architectural Overview of Proposed Design ...... 95 Figure 6.3: High-Level Cache Implementation ...... 96 Figure 6.4: Cache Behavior State Diagram ...... 97 Figure 6.5: Overview of Compiler and Hardware Interaction ...... 99

Figure 6.6: Example Gated Primary Set Redirecting Queries to Secondary Location ...... 106 Figure 6.7: State Transition Diagram for Pair of Cache Sets ...... 107 Figure 6.8: L2 Cache Miss Rate Impact ...... 111 Figure 6.9: L2 Cache Access and Miss Rate Distribution for h264dec After Using Application-Specific Approach ...... 113 Figure 6.10: Global Application Execution Time Impact ...... 115 Figure 6.11: Total Cache Subsystem Power Improvement ...... 118

Figure 7.1: Quad-Core ...... 123 Figure 7.2: (a) MESI Write-Back State Diagram; (b) Simplified Write-Through State Diagram; (c) Listing of Valid States Between Pair of Caches . 125 Figure 7.3: Proposed Cache Architecture ...... 128 Figure 7.4: Reductions in L1 Read Miss Snooping ...... 133 Figure 7.5: Increases in L1 Write Miss Snooping ...... 135 Figure 7.6: Global Execution Time Overhead ...... 135 Figure 7.7: Total Cache Subsystem Power Improvement ...... 137

Figure 8.1: Typical Mobile Out-of-Order Processor Pipeline ...... 142 Figure 8.2: SPEC Integer Benchmark Instruction Mix ...... 143 Figure 8.3: SPEC Floating-Point Benchmark Instruction Mix ...... 143 Figure 8.4: Proposed Adaptive Pipeline Architecture ...... 144 Figure 8.5: Overview of Compiler and Hardware Interaction ...... 147 Figure 8.6: Integer Benchmarks Pipeline Power Reduction ...... 150 Figure 8.7: Floating-Point Benchmarks Pipeline Power Reduction ...... 151

Figure 9.1: Summary of High-Level Software and Architectural Optimization Framework ...... 156

LIST OF TABLES

Table 3.1: Relative Time per Dynamic Dispatch Method Call ...... 36 Table 3.2: Description of Benchmark Apps ...... 44 Table 3.3: Total Application Size Increase Compared to Original Binary Code and Resource Files ...... 48

Table 4.1: Description of Benchmarks ...... 65 Table 4.2: Baseline Execution Run-Time Inefficiency ...... 67 Table 4.3: Overall Execution Run-Time Improvement Over Baseline ...... 67

Table 5.1: Description of Benchmark Apps ...... 84

Table 6.1: Te_on Mapping Conditions ...... 101 Table 6.2: Tc_on Mapping Conditions ...... 102 Table 6.3: Description of Benchmarks ...... 108 Table 6.4: 128KB L2 Compiler-Generated Application-Specific Threshold Values ...... 109 Table 6.5: 256KB L2 Compiler-Generated Application-Specific Threshold Values ...... 109 Table 6.6: 512KB L2 Compiler-Generated Application-Specific Threshold Values ...... 110 Table 6.7: Hand-Picked Global Threshold Values ...... 110 Table 6.8: L2 Cache Miss Rates ...... 112 Table 6.9: Miss Rate Improvement Using LFSR Instead of MSB ...... 114 Table 6.10: Total Cache Subsystem Power Improvement ...... 117

Table 7.1: Listing of Abbreviations ...... 126 Table 7.2: Configuration Threshold Values ...... 132 Table 7.3: Description of Benchmarks ...... 132 Table 7.4: Average Read Miss Snooping Reductions and Write Miss Snooping Increases ...... 134 Table 7.5: Total Cache Power Improvement ...... 137

Table 8.1: Hardware Configuration Parameters ...... 148 Table 8.2: Description of Benchmarks ...... 149

ACKNOWLEDGEMENTS

First, and foremost, I would like to thank my parents for believing in me, instilling in me the importance of education, and encouraging me to pursue higher goals. Many thanks to my sisters Lisa and Marie for their support throughout this arduous process. Additionally, I would like to thank Yanka and Steve for their continuous words of support and wisdom over the years. I would like to also thank Professor Alex Orailoglu for his support as the chair of my committee. Through all of the research we have done and many long nights, his guidance has proved to be invaluable. Chapter 3, in part, is a reprint of the material as it appears in G. Bournoutian and A. Orailoglu, “On-Device Objective-C Application Optimization Framework for High-Performance Mobile Processors”, Proceedings of the 2014 IEEE Design, Automation, and Test in Europe Conference and Exhibition (DATE), 2014. The dissertation author was the primary investigator and author of this paper. Chapter 4, in part, is a reprint of the material as it appears in G. Bournoutian and A. Orailoglu, “Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions”, Proceedings of the 2009 ACM International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2009; and in G. Bournoutian and A. Orailoglu, “Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions”, Journal on Design Automation for Embedded Systems (DAFES), 2010. The dissertation author was the primary investigator and author of these papers. Chapter 5, in part, has been submitted for publication as it may appear in G. Bournoutian and A. Orailoglu, “Mobile Ecosystem Driven Customized Low-Power Control Microarchitecture”, Proceedings of the 2014 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2014. The dissertation author was the primary investigator and author of this paper. Chapter 6, in part, is a reprint of the material as it appears in G. Bournoutian and A. Orailoglu, “Dynamic, non-linear cache architecture for power-sensitive mobile processors”, Proceedings of the 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2010; and in

G. Bournoutian and A. Orailoglu, “Application-Aware Adaptive Cache Architecture for Power-Sensitive Mobile Processors”, ACM Transactions on Embedded Computing (ACM TECS), 2013. The dissertation author was the primary investigator and author of these papers. Chapter 7, in part, is a reprint of the material as it appears in G. Bournoutian and A. Orailoglu, “Dynamic, multi-core cache coherence architecture for power-sensitive mobile processors”, Proceedings of the 2011 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2011. The dissertation author was the primary investigator and author of this paper.

VITA

2005 Bachelor of Science in Cognitive Science, University of California, San Diego
2005 Bachelor of Science in Computer Engineering, University of California, San Diego
2006–2012 Teaching Assistant, Computer Science and Engineering, University of California, San Diego
2006–2014 Staff Engineer, Qualcomm Inc., San Diego, California
2008 Master of Science in Computer Science, University of California, San Diego
2011 Instructor, Computer Science and Engineering, University of California, San Diego
2013 Instructor, Computer Science and Engineering, University of California, San Diego
2014 Doctor of Philosophy in Computer Engineering, University of California, San Diego

PUBLICATIONS

G. Bournoutian and A. Orailoglu, “On-Device Objective-C Application Optimization Framework for High-Performance Mobile Processors”, Proceedings of the 2014 IEEE Design, Automation, and Test in Europe Conference and Exhibition (DATE), pp. 85:1–85:6, 2014.

G. Bournoutian and A. Orailoglu, “Application-Aware Adaptive Cache Architecture for Power-Sensitive Mobile Processors”, ACM Transactions on Embedded Computing (ACM TECS), pp. 41:1–41:26, 2013.

G. Bournoutian and A. Orailoglu, “Dynamic Transient Fault Detection and Recovery for Embedded Processor ”, Proceedings of the 2012 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), pp. 43–52, 2012.

G. Bournoutian and A. Orailoglu, “Dynamic, multi-core cache coherence architecture for power-sensitive mobile processors”, Proceedings of the 2011 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), pp. 89–98, 2011.

G. Bournoutian and A. Orailoglu, “Dynamic, non-linear cache architecture for power-sensitive mobile processors”, Proceedings of the 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), pp. 187–194, 2010.

G. Bournoutian and A. Orailoglu, “Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions”, Journal on Design Automation for Embedded Systems (DAFES), pp. 309–326, 2010.

G. Bournoutian and A. Orailoglu, “Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions”, Proceedings of the 2009 ACM International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pp. 117–126, 2009.

G. Bournoutian and A. Orailoglu, “Miss reduction in embedded processors through dynamic, power-friendly cache design”, Proceedings of the 45th Annual ACM/IEEE Design Automation Conference (DAC), pp. 304–309, 2008.

ABSTRACT OF THE DISSERTATION

Architectural and Software Optimizations for Next-Generation Heterogeneous Low-Power Mobile Application Processors

by

Garo Bournoutian

Doctor of Philosophy in Computer Science (Computer Engineering)

University of California, San Diego, 2014

Professor Alex Orailoglu, Chair

State-of-the-art smartphones and tablets have evolved to the level of having feature-rich applications comparable to those of interactive desktop programs, providing high-quality visual and auditory experiences. Furthermore, mobile processors are becoming increasingly complex in order to respond to this more diverse and demanding application base. Many mobile processors, such as the Qualcomm Snapdragon 800, have begun to include features such as multi-level data caches, complex branch prediction, and multi-core architectures. The high-performance mobile processor domain is unique in a number of ways. The mobile software ecosystem provides a central repository of robust applications that rely on device-specific framework libraries. These devices contain numerous sensors,

such as accelerometers, GPS, and proximity detectors. They are always-on and always-connected, continuously communicating and updating information in the background, while also being used for periods of intensive computational tasks like playing video games or providing interactive navigation. The peak performance that is demanded of these devices rivals that of a high-performance desktop, while most of the time a much lower level of performance is required. Given this, heterogeneous processor topologies have been introduced to handle these large swings in performance demands. Additionally, these devices need to be compact and able to easily be carried on a person, so challenges exist in terms of area and heat dissipation. Given this, many of the microarchitectural hardware structures found in these mobile devices are often smaller or less complex than their desktop equivalents. This thesis develops a novel three-pronged optimization framework. First, the compiler-device interface is enhanced to allow more high-level application information to be relayed onto the device and underlying microarchitecture. Second, application-specific information is gleaned and used to optimize program execution. Lastly, the microarchitecture itself is augmented to dynamically detect and respond to changes in program execution patterns. The high-level goal of these three approaches is to extend the continuum of the heterogeneous processor topology and provide additional granularity to help deliver the necessary performance for the least amount of power during execution. The proposed optimization framework is shown to improve a broad range of structures, including branch prediction, instruction and data caches, and instruction pipelines.

Chapter 1

Introduction

The prevalence and versatility of mobile processors have grown significantly over the last few years. Indeed, over 1 billion smartphones were shipped in the year 2013 alone, representing a growth of 38.4% from 2012 [1]. At the current rate, mobile processors are becoming increasingly ubiquitous throughout our society, resulting in a diverse range of applications that will be expected to run on these devices. This growth has been propelled by the mobile processor revolution that occurred approximately seven years ago. Before this revolution, mobile processors were designed as specialized ASICs using relatively inexpensive hardware with low-cost control and data paths. Applications consisted of very simple, highly-regular firmware code that was static and known a priori. These simple applications were tightly coupled with the hardware and often hand-optimized to ensure desired performance. The mobile processor revolution was born out of a desire to enable more robust applications and reduce application development time. Thus, the hardware architecture changed to be a generalized SoC instruction-set processor with complex, deeply pipelined hardware. This enabled the execution of complex applications written in high-level programming languages, where the application itself is generalized and abstracted from circuit-level optimizations. State-of-the-art smartphones and tablets have evolved to the level of having feature-rich applications comparable to those of interactive desktop programs, providing high-quality visual and auditory experiences. Furthermore, mobile processors are becoming increasingly complex in order to respond to this more diverse and demanding

application base. Many mobile processors have begun to include features such as multi-level data caches, complex branch prediction, and multi-core architectures, such as the Qualcomm Snapdragon 800 [2], Samsung Exynos 5 Octa [3], and ARM Cortex-A15 [4]. It is important to also emphasize the unique usage model embodied by mobile processors. These devices are expected to be always-on, with continuous data access for phone calls, texts, e-mails, internet browsing, news, music, video, TV, and games. In addition, remote data synchronization, secure off-device data storage, and instant application deployment all promote this cloud-style of computing. Furthermore, these devices need to be ultra-portable, being carried unobtrusively on a person and requiring infrequent power access to recharge. With the constraints embodied by mobile processors, one typically is concerned with high performance, power efficiency, better execution determinism, and minimized area. Unfortunately, these characteristics are often adversarial, and focusing on improving one often results in worsening the others. For example, in order to increase performance, one adds a more complex cache hierarchy to exploit data locality, but introduces larger power consumption, more access time indeterminism, and increased area. However, if an application is highly regular and contains an abundance of both spatial and temporal data locality, then the advantages in performance of adding a small cache greatly outweigh the drawbacks. On the other hand, if an application is very complex and irregular, it may require a much larger cache in order to obtain a performance benefit. As one can see, the cost-benefit tradeoff between these characteristics can be highly dependent on an individual application’s usage patterns. For the mobile processor domain, where power and area efficiency are paramount, power-hungry microarchitectural structures such as caches, branch prediction, and pipeline paths are typically more constrained and tend to be smaller and less complex than their desktop counterparts. Indeed, most multicore mobile processors consume about 1–8 Watts, much less than the 130+ Watts for typical desktop multicores [5]. Furthermore, as semiconductor technology feature size continues to be reduced beyond 28nm, static leakage power in these large hardware structures rapidly increases. Indeed, static leakage power can be as much as 30–40% of the total power consumed by the processor [6, 7]. Thus, while many of these microarchitectural structures can significantly improve performance during periods of computationally-intensive workloads, it is paramount to also attenuate the power and area overheads during times when fewer computational resources are required.

1.1 Heterogeneous Processor Topology

As mentioned in the prior section, one of the major challenges facing mobile processor architectures is how to provide sufficient hardware resources during peak demand, yet also scale back and save power during idle periods. One standard approach is to employ dynamic voltage and frequency scaling (DVFS) to dynamically scale the CPU supply voltage and provide “just-enough” circuit speed to the current workload while reducing energy dissipation [8, 9]. Unfortunately, the high-performance mobile processor domain suffers from much larger workload variability due to highly dynamic and sporadic usage patterns. Periods of high processing intensity tasks, such as initial web page rendering and game physics calculation, alternate with typically longer periods of low processing intensity tasks such as scrolling or reading a web page, waiting for user input in a game, and lighter weight tasks like texting, e-mail, and audio. Given this very wide range of variability, simply scaling the voltage and frequency of the processor is insufficient. Given this, the industry has begun to shift to incorporating heterogeneous processors within the SoC. These heterogeneous processors are architecturally different in terms of performance, power, and area characteristics, yet they share a common instruction set. This allows the same application to seamlessly be executed on any of the processor cores, similar to a typical homogeneous multicore SoC. ARM proposed the first mobile heterogeneous processor topology with their big.LITTLE architecture, incorporating their high-performance Cortex-A15 “big” processors with low-power Cortex-A7 “LITTLE” processors [10]. As described in the proposal, a different ethos is taken in each of these processors. The Cortex-A15 will sacrifice energy efficiency for higher performance, while the Cortex-A7 will sacrifice performance for higher energy efficiency. The high-level goal is to provide a continuum of computational resources that can handle the wide range of variability that occurs in the mobile domain. Other industry leaders have also recently begun development of these heterogeneous SoC topologies. Broadcom’s Renesas Mobile MP6530 pairs a dual-core Cortex-A15 cluster with a dual-core Cortex-A7 cluster. The Exynos 5 Octa SoC from Samsung combines a quad-core Cortex-A15 cluster with a quad-core Cortex-A7 cluster. Qualcomm’s flagship Snapdragon 810 SoC combines a quad-core Cortex-A57 cluster with a quad-core Cortex-A53 cluster.
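
The energy argument behind DVFS can be made concrete with the standard first-order CMOS power model (a textbook approximation included here purely for illustration; it is not a formula taken from this dissertation):

    P_{total} \approx \alpha \, C_{eff} \, V_{dd}^{2} \, f \; + \; V_{dd} \, I_{leak}

The first term is dynamic switching power and the second is static leakage. Because the highest sustainable clock frequency falls roughly in proportion to the supply voltage, lowering the operating point reduces dynamic energy per operation approximately quadratically, but it also stretches execution time, during which leakage continues to accumulate. This is why scaling a single core across its full voltage-frequency range eventually hits diminishing returns, and why pairing architecturally different cores, each efficient over a different region of the curve, is attractive.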

Figure 1.1: Architectural Overview of Heterogeneous Processor Topology

A generalized architectural view of the heterogeneous processor topology is provided in Figure 1.1. For simplicity, two processor types (“Big” and “Little”) are shown, each having a quad-core cluster configuration. Obviously there could be more than two types of processor cores, and the quantity of each need not be symmetric. Additionally, a programmable interrupt controller enables interrupts to be migrated between any cores, and a cache coherent memory interface provides full coherency between the different processor clusters. As mentioned earlier, the purpose of selecting two different processor microarchitectures is to deliver different performance and power DVFS curves.

Figure 1.2: Example DVFS Curves for Different Architectures

Figure 1.2 shows a generalized representation of this multiparametric space. The “Little” cores are optimized for low power, and thus are ideal for low- to mid-range workloads (shown in green). The “Big” cores deliver heartier performance at the cost of higher power dissipation, and are appropriate for high-end workloads (shown in orange). Given the bursty nature of mobile processors, where there can be periods of intensive CPU utilization, followed by long periods of lower-intensity computations, the benefits of this heterogeneous topology become evident. When a given task necessitates high computational resources, it can be migrated onto one or more “Big” cores in order to deliver the desired performance, and the “Little” cores can be disabled. On the other hand, if the system is idle or processing simpler tasks, the “Big” cores can be turned off to save power while the “Little” cores provide the necessary computations. This topology can be even more flexible, where each core can be enabled or disabled independent of the state of the rest of the cores. In this manner, it is possible to reduce the entire SoC down to just a single “Little” core for idle periods, or enable every “Big” and “Little” core to provide maximum computational horsepower. It is the job of the operating system to appropriately schedule and migrate tasks onto these processor cores. The algorithm employed for scheduling and migrating tasks is outside the scope of this thesis.
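
Although the scheduling and migration policy itself is outside the scope of this work, a minimal sketch helps illustrate the kind of decision the topology enables. The C++ fragment below is purely hypothetical: the utilization metric, thresholds, and names are invented for illustration and do not correspond to any particular operating system’s scheduler.

    #include <cstdio>

    // Illustrative hysteresis-based cluster selection for a big.LITTLE-style
    // topology. All thresholds and names are hypothetical; real schedulers
    // also weigh migration cost, thermal headroom, and per-core DVFS state.
    enum class Cluster { Little, Big };

    struct TaskStats {
        double avg_utilization;  // smoothed CPU demand of the task, 0.0 to 1.0
        Cluster current;         // cluster the task currently runs on
    };

    Cluster select_cluster(const TaskStats& t) {
        constexpr double kUpThreshold   = 0.80;  // sustained heavy demand
        constexpr double kDownThreshold = 0.30;  // sustained light demand
        // Hysteresis: migrate only when the change is likely to outlast the
        // cost of moving the task's state between clusters.
        if (t.current == Cluster::Little && t.avg_utilization > kUpThreshold)
            return Cluster::Big;
        if (t.current == Cluster::Big && t.avg_utilization < kDownThreshold)
            return Cluster::Little;
        return t.current;
    }

    int main() {
        TaskStats web_render{0.92, Cluster::Little};  // bursty, heavy phase
        TaskStats email_sync{0.12, Cluster::Big};     // light background work
        std::printf("web_render -> %s\n",
                    select_cluster(web_render) == Cluster::Big ? "Big" : "Little");
        std::printf("email_sync -> %s\n",
                    select_cluster(email_sync) == Cluster::Big ? "Big" : "Little");
        return 0;
    }

Clusters or individual cores left unselected by such a policy can then be power-gated, which is exactly the coarse-grained knob that the finer-grained, per-structure adaptations developed in this dissertation complement.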

1.2 Distributed Software Ecosystem

From the application perspective, complexity is also growing within the mobile domain. A decade ago, applications consisted of a few pre-bundled, hand-optimized programs specifically designed for a given device. Today, there is a vast quantity of applications available within mobile marketplaces; Apple’s App Store, for example, had over 775,000 applications available for download at the start of 2013 [11]. Furthermore, mobile applications are now written in high-level, object-oriented programming languages such as Objective-C or Java. These applications tend to be highly interactive, providing visual, auditory, and haptic feedback, as well as leveraging inputs from numerous sensors such as touch, location, near-field communication, and even eye tracking. To simplify the creation of these interactive applications, mobile operating systems provide foundation libraries to manipulate the device’s peripherals and provide common user-interface frameworks. Given this, much of the instruction code being executed will be coming from these shared libraries and helper routines, in addition to the application’s own code. It has also been observed that smartphone applications suffer from increased code size and sparseness [12], and encounter higher branch mispredictions and instruction cache misses due to the lack of instruction locality. It is not uncommon for an application’s control flow to frequently jump around the instruction space as various shared libraries and helper routines are called during execution. These observations further exacerbate the demands placed on microarchitectural hardware structures, and can lead to reduced performance and additional power consumption. To make matters worse, there is a large diversity of hardware and operating systems a given application may run on. Often a single instance of the application is compiled and uploaded to a centralized marketplace, and then is able to be downloaded and run on many different devices and operating system versions. When the application is loaded by the operating system, dynamic linking and variable offset tables are used to hook into the device-specific library code. As one can see, attempting to optimize these applications centrally before consumers download them onto a given phone is hindered by the variability and diversity of the final execution platform and operating system environment.

1.3 Vision of Dissertation

As one can see, the high-performance mobile processor domain is unique in a number of ways. The distributed software ecosystem provides robust applications that heavily rely on locally distributed framework libraries. These devices contain numerous sensors, such as accelerometers, gyroscopes, magnetometers, GPS, and proximity and ambient light detectors. They are always-on and always-connected, continuously communicating and updating information in the background, while also being used for periods of intensive computational tasks like playing video games, providing interactive navigation, and recording videos. The peak performance that is demanded of these devices rivals that of a high-performance desktop, while most of the time a much lower level of performance is required. Given this, heterogeneous processor topologies have been introduced to handle these large swings in performance demands. Additionally, these devices need to be compact and able to easily be carried on a person, so challenges exist in terms of area and heat dissipation. Indeed, unlike a standard desktop or laptop computer, these devices have no cooling fans and must rely on passive cooling alone. Given this, many of the microarchitectural hardware structures found in these mobile devices are often smaller or less complex than their desktop equivalents. In order to deliver additional performance and power efficiency for these mobile devices, a novel approach needs to be taken. A three-pronged approach is proposed. First, the compiler-device interface will be enhanced to allow more high-level application information to be relayed onto the device and underlying microarchitecture. Second, application-specific information will be gleaned and used to optimize program execution. Lastly, the microarchitecture itself will be augmented to dynamically detect and respond to changes in program execution patterns. The high-level goal of these three approaches is to extend the continuum of the heterogeneous processor topology and provide additional granularity to help deliver the necessary performance for the least amount of power during execution. The following subsections will describe each of these three approaches in more detail.

1.3.1 Enhancing Compiler-Device Interface

One of the main challenges in the mobile software ecosystem relates to the distributed nature of the application code. Mobile applications heavily rely on foundation libraries provided by the operating system. These foundation libraries provide common user-interface frameworks, such as unified graphical user interfaces (GUIs) and notification schemes. Additionally, these libraries provide the interface to the device’s peripherals and sensors. The primary purpose of these libraries is to greatly simplify developing new applications, allowing developers to program in high-level languages and call common routines to handle interactivity with the device. However, these foundation libraries can differ between device models and operating system versions. While the Application Binary Interface (ABI) remains consistent, the underlying code within the foundation libraries can greatly differ across individual devices. Often a single instance of the application is compiled and uploaded to a centralized marketplace (e.g. Apple AppStore or Google Play), and then is able to be downloaded and run on many different devices and operating system versions. When the application is loaded by the operating system, dynamic linking and variable offset tables are used to hook into the device-specific foundation library code. It is evident that the entire program that will be executed is only known once the application is dynamically linked with the device-specific libraries. Unfortunately, at that point in time, the level of application information is limited to the level of binary code. While many compiler-based optimizations can be done directly on binary code, such as peephole and local basic block analysis, the higher level interprocedural global optimizations require control-flow and data-flow analysis across basic blocks, which becomes exceedingly difficult at the binary code level. One needs to have more global information about the program, including object and type hierarchies that are known when the compiler is building the intermediate representation Abstract Syntax Tree (AST). In order to overcome this challenge, the compiler is augmented to capture any required global relationship information. When applications are compiled using the mobile ecosystem toolchain, they are also analyzed in order to extract required high-level application metadata. This metadata is included in the application bundle that is uploaded onto the marketplace. Similarly, each new operating system release that is compiled will also analyze and annotate high-level metadata about its code, bundling this metadata in the OS system updates. Once an application is downloaded onto a given device, an on-device code analysis and transformation process will occur. Using the metadata from both the OS and the application, a complete view of the program is available allowing interprocedural global optimizations that otherwise would not be possible. A high-level overview of this on-device optimization framework is shown in Figure 1.3.

Figure 1.3: On-Device Optimization Framework
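
As a rough illustration of what such bundled metadata might contain, the C++ records below sketch one plausible shape: a class-hierarchy entry per compiled class and a record per dynamic dispatch site. The field names are hypothetical; the actual metadata captured by the framework is detailed in Chapter 3.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical high-level metadata emitted alongside the compiled binary,
    // both for the application bundle and for each OS foundation-library release.
    struct ClassRecord {
        std::string name;                  // class defined in this module
        std::string superclass;            // edge in the class hierarchy
        std::vector<std::string> methods;  // methods the class actually defines
    };

    struct DispatchSiteRecord {
        std::string static_receiver_type;  // declared type at the call site
        std::string selector;              // method being invoked dynamically
        std::uint64_t binary_offset;       // location of the call in the binary
    };

    struct ModuleMetadata {
        std::vector<ClassRecord>        classes;
        std::vector<DispatchSiteRecord> dispatch_sites;
    };

At install time, the on-device pass merges the application’s records with those shipped in the operating system update, reconstructing the whole-program class hierarchy and call-site view that the interprocedural optimizations described next require.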

Similar to enhancing the information conveyed from the compiled application to the device, one can further promulgate information from the software layer to the hardware layer, strengthening the interaction between the compiler and the underlying hardware microarchitecture. Inherently, there are certain algorithmic features that can more easily be ascertained during static compile-time analysis, such as examining the global program data flow and understanding data interdependencies. This information would be prohibitively costly and complex to generate on the fly during run-time. On the other hand, there are certain behavioral aspects of the application that can only be ascertained during run-time, such as branch taken/not taken resolution and cache hit/miss behavior. These events cannot statically be known a priori, and thus require real-time system information. Given this trade-off between information available at compile-time versus run-time analysis, it becomes evident that some mixture of the two will be essential. The compiler will help glean crucial information regarding our application that would otherwise be too costly to obtain at run-time, and pass this information to the underlying hardware microarchitecture to help the hardware make intelligent decisions in response to run-time events. A high-level overview of this enhanced compiler-processor interface is shown in Figure 1.4.

Figure 1.4: Extended Compiler-Processor Interface

The overarching goal in both of these approaches is to bridge the information gap that is created when an application is compiled into a binary. Specifically within the mobile processor ecosystem, the distributed nature of the compilation process hinders interprocedural optimizations that would otherwise have occurred. It is essential to capture a limited amount of metadata that will then enable such global optimizations. Furthermore, this top-down global view of an application can help direct and inform decisions made by the underlying hardware. Thus, extending the software-hardware interface from just pure instruction code to also include some of this high-level metadata within hardware tables will allow more intelligent responses to dynamic events during program execution.

1.3.2 Application-Specific Code Optimizations

Now that a complete view of the entire program code is available using the framework described in the prior section, application-specific code optimizations can be done. Higher-level interprocedural optimizations such as reduction of polymorphic call complexity or even full static resolution of polymorphic calls are now possible. One can further improve such interprocedural optimizations by combining other standard optimization techniques like code inlining and dead code elimination to help remove possible polymorphic object creation or definition points. While mobile applications heavily rely on foundation libraries that are dynamically linked when the application is executed, oftentimes the vast majority of the code available within the foundation library may be unused by a particular application. In order to do aggressive interprocedural optimization, removing such dead code from the foundation library is necessary. Unfortunately, as one would expect, the foundation library is shared amongst all of the applications on the device, with each application possibly requiring only a small subset of components defined in the library. Therefore simply removing code from the shared library based on the requirements of a single application is out of the question, since this would potentially break other applications. Instead, it is proposed to statically link those portions of the foundation library that are required by a given application into that application’s binary. By statically linking the entire pool of code needed by an application into its own isolated binary, the compiler is now free to enact dead code elimination and code inlining without causing adverse side effects to other applications. Furthermore, the mobile ecosystem has a large diversity of physical hardware devices. Because of this, some methods the application may call will be stubbed out for a particular model of hardware. For example, if the application calls a method to query the compass, but the specific device being utilized does not physically have a magnetometer, the call will simply return without executing the actual sensor processing instructions. This is another example of potentially dead code that an application may contain. Additionally, it is important to observe that an application that is uploaded to the central marketplace can be downloaded onto multiple different devices. For newer devices, the application’s code for compass querying is useful, whereas if the application is downloaded onto an older device then the same code may no longer have any functionality. Using the given device’s attributes can help guide the optimization process by identifying additional dead code specific to the current hardware. This is also shown in Figure 1.3 as the Device Attributes input into the on-device optimization. Other application-specific code optimization techniques can target the heterogeneous processor topology. While the heterogeneous processors used in state-of-the-art smartphones and tablets share a common instruction set architecture (ISA) to allow the same program to run seamlessly on any of the different types of processors, the underlying architecture of each processor type can be very different. For example, it is common for the smaller, more energy-efficient processor to be an in-order processor, while the heavy-duty processor is a speculative, out-of-order machine. These out-of-order processors leverage dynamic scheduling and can mitigate unforeseen data dependencies, such as a data cache miss. On the other hand, in-order processors must stall the pipeline when a load instruction encounters a cache miss, since there is no dynamic dependency detection logic to know if the next pipeline instruction depends on the result of the missed load or not. This can obviously degrade the processor performance, since the processor may idle for several cycles before the cache miss is resolved and computation resumes. Given this observation of the limitation of in-order processors, application-specific code optimization can be done on the binary to rearrange instructions to help provide a buffer between memory load instructions and instructions that consume those loaded values. From the perspective of the out-of-order processor, these optimizations may have little benefit (since the hardware can dynamically rearrange the instructions as it needs). But, from the in-order processor perspective, this type of code transformation can provide immense benefits. Yet, there is still the challenge of efficiently conveying to the hardware pipeline what these static dependencies are. Thus, not only are application-specific code transformations necessary, but also some metadata needs to be conveyed to the microarchitecture in order to enable dynamically responding to these situations when they occur.
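
To make the first family of transformations concrete (static resolution of dynamic dispatch combined with dead code elimination), the following C++ sketch is illustrative only: the dissertation targets Objective-C and Java-style dispatch, and the class names here are invented. The key point is that once whole-program analysis, after pruning unused classes from the statically linked binary, proves that only one implementation can be reached, the indirect call through a dispatch table can be replaced with a direct, inlinable call.

    #include <cstdio>

    // Hypothetical hierarchy: assume whole-program analysis has shown that
    // TileRenderer is the only Renderer subclass ever instantiated once dead
    // classes have been pruned from the application's statically linked code.
    struct Renderer {
        virtual void draw() const { std::puts("generic draw"); }
        virtual ~Renderer() = default;
    };

    struct TileRenderer final : Renderer {
        void draw() const override { std::puts("tile draw"); }
    };

    void frame_before(const Renderer& r) {
        r.draw();   // dynamic dispatch: indirect call through a vtable
    }

    void frame_after(const TileRenderer& r) {
        r.draw();   // statically resolved: direct call the compiler can inline
    }

    int main() {
        TileRenderer r;
        frame_before(r);
        frame_after(r);
        return 0;
    }

The second transformation, separating loads from their consumers, can be sketched at the source level as well. The arithmetic below is made up; only the dependence structure matters. Chapter 4 describes how the reordering is actually performed on basic blocks of the binary and how the guaranteed-independence information is conveyed to the in-order pipeline.

    // An in-order pipeline stalls at the first use of a value whose load missed
    // in the cache. Placing guaranteed-independent work between the load and
    // its first consumer lets that work proceed while the miss is serviced.

    int before(const int* a, int x, int y) {
        int v = a[0];      // load; a miss stalls the very next instruction
        int s = v + 1;     // immediate consumer of the loaded value
        int t = x * y;     // independent work, but it only runs after the stall
        return s + t;
    }

    int after(const int* a, int x, int y) {
        int v = a[0];      // load issued as early as possible
        int t = x * y;     // independent instructions fill the miss latency
        int s = v + 1;     // consumer now sits farther from the load
        return s + t;
    }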

1.3.3 Dynamically Tuning Microarchitecture

There is a large variability in workload intensity across an application’s execution flow. There may be periods of extreme floating-point computations or intensive bursts of random memory loads. As the program progresses, certain microarchitectural structures may become inundated at a given time, while at other times those same structures sit idle. The heterogeneous processor topology handles the more coarse-grained workload intensity differentials of an application, scheduling or migrating the application to an appropriate processing core. Since migrating tasks across this topology incurs a cost, this action is only employed when a large, sustained change in workload intensity occurs. Obviously, there are still fine-grained perturbations in workload intensity that can occur, and in order to further optimize the system, it is paramount for the microarchitecture to dynamically tune itself in response to these more fine-grained changes. Application processors (i.e. CPUs) within mobile devices are one of the top two consumers of battery life (the other large consumer being the display) [13]. Thus, it is important to focus on the microarchitectural structures within these processors that have the largest impacts in terms of performance and power. These structures typically include components like instruction and data caches, branch predictors, branch target buffers, and instruction pipelines. For instance, caches often account for 30–60% of the total processor area and 20–50% of the processor’s power consumption [14]. Similarly, the branch prediction hardware can consume as much as 10% of the total processor power, with the BTB accounting for 88% of this power consumption [15]. Yet, even with such large and costly microarchitectural structures, poor performance of these structures can further waste power. For example, the two primary culprits that can cause a majority of workloads to exhibit low energy efficiency are L2 cache misses and branch mispredictions [16]. Thus, it is critical to not only find ways to conserve power wherever possible, but also to ensure this effort does not amount to a pyrrhic victory. The energy costs incurred by a poorly functioning cache or branch predictor will rapidly eclipse the power savings one was initially targeting. The general goal of dynamic or adaptive microarchitecture design is to accurately identify the transition points during execution where the program’s behavior changes in some manner. For example, consider a contrived program that spends 1M instructions doing pure ALU operations without any memory or control flow. If the underlying hardware were aware of this situation, it could safely shut down both the data cache and branch predictor with no negative impact to performance. The challenge of adaptively tuning the microarchitecture is to efficiently extract the necessary knowledge at run time to make intelligent decisions. As mentioned in Section 1.3.1, certain pieces of information about a program are easier to observe statically during compile time, such as global program data flow and data interdependencies. By investing additional effort during compilation, useful information can be gleaned and provided to the microarchitecture. This information can then be used to decide when and how to adapt the hardware during execution to reduce power without sacrificing performance. In this manner, dynamic optimizations can be guided by application-specific information.

Furthermore, even in the absence of compile time information about the program, efficient hardware detection structures can help identify run time patterns. These run time patterns can be used to help adjust the microarchitecture to save power. For example, oftentimes it is difficult for the compiler to statically determine multicore shared memory relationships and access patterns. Yet, run-time hardware can be designed in such a manner as to detect patterns in cache coherence activity, and determine when individual cache lines will benefit from localized L1 storage with cache coherency or should instead be forced into L2 storage to mitigate coherence costs. In these cases, the hardware not only needs to intelligently detect emergent patterns, but also determine when the benefits of making the change outweigh the costs. Since these decisions are happening at run time, the complexity in terms of time, storage, and power must be kept minimal. Adaptively tuning the microarchitecture can result in significant power savings with minimal impact to performance. The challenge of these approaches lies in the efficient detection of the transition points where an adaptation will help, and also to ensure the overheads of such adaptation are ameliorated across the overall program execution. There needs to be a mechanism to enable lower-power configurations when appropriate, as well as to restore full functionality at the point when it is needed again.
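
Both directions of this adaptation can be sketched briefly. First, compiler-guided tuning: the fragment below imagines a per-region hint emitted at compile time and a trivial gating decision derived from it. The record layout and thresholds are hypothetical; the concrete mechanisms and threshold-selection algorithms appear in Chapters 5, 6, and 8.

    #include <cstdint>

    // Hypothetical compiler-supplied phase descriptor for a region of code.
    struct PhaseHint {
        std::uint32_t instruction_count;    // approximate length of the region
        std::uint16_t loads_per_kinst;      // static density of memory accesses
        std::uint16_t branches_per_kinst;   // static density of control flow
    };

    struct GatingDecision {
        bool drowse_data_cache_ways;        // put part of the data cache to sleep
        bool disable_branch_target_buffer;  // stop querying the BTB each cycle
    };

    GatingDecision decide(const PhaseHint& h) {
        // Illustrative thresholds: only long regions with almost no memory or
        // control flow justify powering structures down and back up again.
        const bool long_region = h.instruction_count > 100000;
        return GatingDecision{
            long_region && h.loads_per_kinst < 2,
            long_region && h.branches_per_kinst < 2,
        };
    }

Second, pure run-time detection: the sketch below shows a per-cache-line saturating counter of the sort such hardware could maintain, forcing heavily write-contended lines out of the private L1s and into the shared L2 to avoid repeated snoop and invalidation traffic. Again, the counter width and thresholds are invented; the actual coherence mechanism is the subject of Chapter 7.

    #include <cstdint>

    // Illustrative per-line contention detector (not the protocol of Chapter 7).
    struct LineState {
        std::uint8_t contention = 0;   // small saturating counter
        bool force_shared_l2 = false;  // current placement policy for the line
    };

    void on_remote_invalidation(LineState& s) {
        if (s.contention < 7) ++s.contention;              // saturate at 7
        if (s.contention >= 6) s.force_shared_l2 = true;   // hypothetical threshold
    }

    void on_quiet_local_reuse(LineState& s) {
        if (s.contention > 0) --s.contention;
        if (s.contention == 0) s.force_shared_l2 = false;  // restore L1 caching
    }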

1.4 Dissertation Summary

This thesis addresses the necessity for an effective and efficient optimization framework for next-generation heterogeneous mobile application processors. These high-end devices demand exceptional performance, yet must also be extremely power efficient to be commercially viable. These opposing forces necessitate innovative solutions to enable peak performance on demand, while ensuring a fastidious attention to power efficiency. While high-level heterogeneous topologies can help overcome large swings in performance characteristics, there is still a need to provide more fine-grained optimizations within each processor domain. The proposed optimization framework is shown to improve a broad range of microarchitectural structures, including branch prediction, instruction and data caches, and instruction pipelines. The methodologies presented in this thesis fall into three general types of optimization approaches. The first optimization approach, described in Section 1.3.2, is to leverage static code analysis and compiler techniques in order to transform the application into a more efficient instruction sequence while maintaining equivalent functionality. Using such static techniques, inefficient code can be optimized to help reduce power consumption and improve performance. This thesis demonstrates a novel mobile ecosystem where on-device static code optimization can combine knowledge of user applications with operating system foundation libraries, enabling a complete view of all program code. Using this global program knowledge, interprocedural optimizations to streamline control flow such as statically resolving dynamic dispatch sites can be accomplished. Furthermore, instruction ordering can be adjusted to help ameliorate run-time data hazards, such as pipeline stalling due to cache misses. The second optimization approach is to enhance the hardware microarchitecture to dynamically detect and adapt to changes while executing the application, as described in Section 1.3.3. Using this paradigm, this thesis presents novel architectures that exemplify how minimal hardware structures can be used to identify usage patterns during execution and reconfigure the microarchitecture in order to save power. A cache architecture that allows fine-grained dynamic expansion and contraction of storage space is proposed in order to help improve performance while eliminating leakage power. An innovative hardware-based cache coherence protocol is presented, which is able to dynamically determine whether to use localized storage and coherence, or to force immediate synchronization into shared memory to reduce power dissipation overheads. Similarly, a reconfigurable instruction pipeline is proposed which enables dynamically shrinking the pipeline width based on functional unit demand, helping obviate power from idle pipeline paths. The third optimization approach is a melding of the two aforementioned approaches. As proposed in Section 1.3.1, high-level static information garnered during compile time is conveyed into the hardware microarchitecture to help more intelligently guide dynamic run time adaptation. An innovative solution to memory-based pipeline stalls is demonstrated by extracting compile time information regarding data dependencies for individual load instructions, and conveying this information to the hardware in order to provide knowledge of available non-dependent instructions that can continue to be executed. Another proposal demonstrating the benefits of combining static information with hardware execution relates to eliminating costly accesses to the branch target buffer (BTB) every cycle. Instead, an efficient hardware structure is used to provide branch target addresses only during cycles that require such information, greatly reducing power dissipation. Chapter 2 provides an overview of the related work in the area of mobile processors and various existing microarchitectural optimization techniques. Chapter 3 presents a framework for eliminating control flow overheads stemming from polymorphic dynamic dispatch. A novel approach to efficiently eliminating memory-based pipeline stalls in in-order processors is presented in Chapter 4. Chapter 5 demonstrates a framework for statically analyzing mobile application control flow and adapting the hardware architecture to eliminate BTB power consumption. A comprehensive framework for enabling fine-grained, dynamic reconfiguration of the cache subsystem to improve power and performance is presented in Chapter 6. Chapter 7 investigates the cache coherence energy costs of a multicore SoC, and provides a methodology to dynamically detect and adjust the coherence policy in a fine-grained manner. A power optimization methodology to dynamically reconfigure the instruction pipeline of high-performance mobile processors is demonstrated in Chapter 8. Lastly, conclusions are drawn in Chapter 9.

Chapter 2

Related Work

2.1 Mobile Ecosystem

As mentioned in Chapter 1, modern smartphones and tablets inhabit a unique mobile ecosystem. These devices are used throughout the entire day to perform a multitude of tasks, including e-mail, web browsing, gaming, document editing, and watching videos. Mobile devices also include multiple communication systems and a diverse variety of sensors that can be used by applications to provide a robust user experience. The following subsections will present existing work that focuses specifically on the mobile ecosystem. Section 2.1.1 describes techniques for estimating application power usage to help guide software developers to design applications in a more power-efficient manner. Section 2.1.2 provides a summary of learning techniques to identify patterns in how these mobile devices are used throughout the day. Lastly, Section 2.1.3 details various techniques that help reduce the power dissipation of mobile communication systems.

2.1.1 Software Power Estimation Techniques

One of the challenges of developing software for mobile processors is that the development environment often just simulates or emulates the application’s behavior on a normal desktop PC. It is difficult as an application writer to design software that minimizes power consumption. Typically multiple layers of abstractions and middleware sit between an application and the hardware, which hinders accurate power consumption

predictions and design space exploration. Coarse-grained analysis can be done by downloading the application onto the device, but there are no detailed tools to help provide feedback to the software developer on ways to improve their code in terms of power. The SPOT development tool [17] automates power consumption estimation code generation and can help simplify analysis of software. Key power-consuming components of the system are modeled, including sensors (GPS, acceleration, orientation) and network bandwidth utilization. The authors demonstrate their model-driven engineering approach provides a high degree of accuracy which can aid software development design decisions. Their findings show that sensor sample rates impact long-term power consumption, and certain sensors like GPS are very large consumers that should be used sparingly. PowerTutor [18] is another online power estimation tool to help develop power efficient software. Online power models are generated for most smartphone components, including CPU, LCD, GPS, audio, Wifi, and cellular communication. The authors note that the power characteristics of these components can vary across different device models, and training is necessary to develop accurate models of a given device. Once these models are generated, the PowerTutor tool can help determine the impact of software design changes on power consumption. Similar to the work in [17], the authors again find the most common culprit for poor power efficiency stems from improper use of hardware devices or sensors. If an application needlessly enables or frequently accesses sensors like the GPS, a large amount of power will be consumed.
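
Tools of this kind generally fit a utilization-weighted, component-level power model of roughly the following form (a generic formulation given here for illustration; the published SPOT and PowerTutor models differ in their details and are trained per device):

    P_{device}(t) \;\approx\; \beta_{0} \; + \sum_{c \,\in\, components} \beta_{c} \, u_{c}(t)

where u_c(t) is the observable utilization or state of component c (for example, CPU frequency residency, screen brightness, GPS on-time, or bytes transmitted) and the coefficients beta_c are regressed against measured battery drain on the target device. This is also why the reported accuracy depends on per-device training, and why sensor duty cycles tend to dominate the resulting estimates.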

2.1.2 Usage Pattern Learning Techniques

Another research field that aims to optimize mobile processors involves profiling and learning a user’s usage patterns on the device. As one would expect, often a user will use the device in a similar manner from day to day. Perhaps the user launches a news application every morning, and after about 10 minutes she then launches a horoscope application. If one can efficiently identify such patterns, applications can be preloaded in the background to help improve launch times and can be terminated faster with the knowledge that the user most likely is done using an application for that period of time. The authors in [19] conducted a detailed investigation of a large number of real users, collecting trace information on how each device was used, including energy usage of applications. The main finding was a large diversity amongst the users, with numbers of interactions per day varying from 10 to 200 and average amount of data received ranging from 1 to 1000 MB. This finding implies that learning or adaptation based on each individual user can be beneficial to reducing power and more accurately estimating battery lifetime. Additionally, such learning mechanisms can help schedule background tasks and identify high-priority applications. As there is a cost to garnering such usage patterns, most of the data is coarse-grained and focuses on large-scale patterns throughout the day, not fine-grained actions within individual applications. In a similar vein, the authors in [13] focused on understanding user activity patterns in order to determine power optimization opportunities. They developed a logger application that can be installed by users, which will collect traces of real-world activity. The findings in this study were similar to [19], in that the energy consumption widely varied depending upon each individual user. Additionally, they confirmed the well-known mantra that the screen and CPU are the two largest power consuming components of modern mobile devices. Leveraging the user activity patterns, they also proposed a methodology to reduce power in both the screen and CPU by incrementally reducing screen brightness and CPU frequency, respectively. They relied on the principle of change blindness, a cognitive term for humans being less aware of small, subtle changes in their environment. These learning techniques are important to understanding how these devices are used in the real world. As mentioned, most of the usage patterns are coarse-grained due to the nature of needing to collect data over a long period of time and identify patterns. This coarse-grained data can then be utilized to help guide system-level decisions within the smartphone to help improve battery life.

2.1.3 Communication Power Saving Techniques

While the focus of this thesis will be on improving the CPU performance and power of mobile devices, the power consumption of radio communication within mobile devices is also important to consider. In cases where communication dominates over computation, the power dissipation from the radios (GSM, Wifi, and/or Bluetooth) consumes a significant amount of battery life [20]. For completeness, a brief summary of work related to power saving techniques targeting radio communications of mobile devices is provided.

CoolSpots [21] aims to explore policies that enable a mobile device to switch between different communication interfaces, such as Wifi and Bluetooth. These different communication interfaces exhibit diverse radio power characteristics and different ranges, and the goal is to save power by intelligently deciding when to utilize each communication interface. The Wifi protocol has a relatively higher power consumption compared to Bluetooth, but the Bluetooth protocol range is much smaller. Thus, if one is in a location outside of the Bluetooth range, the Wifi interface is the only choice. On the other hand, if one is in a CoolSpot which has both Wifi and Bluetooth coverage, the system can seamlessly transition to using Bluetooth to communicate and shut down the more expensive Wifi interface. This same logic can be generalized for additional radio interfaces, allowing automatic transitioning between available interfaces to the more power-efficient connection.

A subsequent paper investigating the tradeoff between 3G, GSM, and Wifi is presented in [22]. The authors provide an in-depth study of the energy consumption characteristics of these three protocols. They identify that the cellular protocols (3G and GSM) exhibit high tail energy, which occurs after the completion of a typical data transfer while the radio remains in a high-power state. They developed the TailEnder protocol, which delays and aggregates non-critical communications into larger scheduled transfers to ameliorate the cumulative energy consumed while meeting user-defined deadlines. Similarly, web browsing can leverage these savings by aggressively prefetching several times more data while the communication channel is still in a high-power state, both saving power and improving user response time.

The authors in [23] explore how to reduce the power consumption of Voice-over-Wifi (VoWifi) communication. The appeal of using VoWifi is the ubiquitous presence of wireless LAN access points, as well as Wifi being an unlicensed radio spectrum, reducing costs. However, the Wifi protocol is costly in terms of power usage for a mobile phone due to the interface's scanning behavior. The authors propose adaptively increasing the scanning period when not connected to a network or in standby mode. This adaptive elongation of the interval length between scans is a function of the length of time the phone has been idle, the residual battery life, and a user-set priority.

2.2 Cache Performance and Power Optimization

Caches play an important role in bridging the performance gap between the processor pipeline and memory storage. They rely on principles of locality to store a small subset of data closer to the processor in order to reduce the number of cycles needed to load or store that data. Unfortunately, caches are also one of the largest consumers of power and area within the processor die. Given this, mobile processors tend to employ relatively smaller and less complex caches compared to their desktop counterparts. Furthermore, lower-power mobile processor architectures often utilize in-order pipelines which stall when memory loads cannot be serviced by the cache. The following subsections describe a number of techniques to help improve cache performance and power. Section 2.2.1 describes techniques that can be used to help mitigate pipeline stalls caused by cache misses. Section 2.2.2 presents techniques that can be used to reduce the power consumed by caches. Lastly, Section 2.2.3 provides a summary of work that aims to reduce power dissipation due to cache coherence in multicore systems.

2.2.1 Cache Pipeline Stall Optimization Techniques

Various techniques have been proposed and used in the microarchitecture community to attack the problem of memory stalls. In general, superscalar, out-of-order processors inherently can mitigate memory stalls by leveraging the reorder buffer (ROB) to dynamically allow multiple instructions to issue whenever resources are available and data dependencies have been met [24, 25]. However, as mentioned in [26], even though in-order execution suffers from a 46% performance gap when compared to out-of-order, an out-of-order architecture is not always ideal for an embedded system since it is prohibitively costly in terms of power, complexity, and area. Indeed, as expounded in Chapter 1, current state-of-the-art mobile processors employ a heterogeneous processor topology, leveraging smaller, in-order processor cores to deliver the necessary low-power characteristics.

As an alternative to costly out-of-order processors, [27] proposes having the compiler provide reordering by explicitly placing instructions into per-functional-unit delay queues, which in turn still operate in an in-order fashion. While this is less complex than a full out-of-order processor, it still requires a rather large area overhead for the delay queues and does not allow for fine-grained, dynamic adjustment based on runtime cache hit/miss behavior.

Proactively prefetching data into the data cache is another well-established approach to reducing the performance impact of memory stalls. Software prefetching methods [28, 29, 30, 31] rely on the compiler to statically insert prefetch instructions before the actual load instruction to help mitigate potential memory stalls (a brief illustration follows at the end of this subsection). Unfortunately, these instructions may cause code bloat and increase register pressure. Furthermore, purely software approaches cannot leverage runtime information relating to cache hit/miss behavior, and thus could result in wasted cycles from unnecessary prefetching. Alternatively, hardware prefetching methods [32, 33, 34, 35] utilize access patterns to predict cache misses and inject the necessary prefetch logic in the hardware. Unfortunately, as with all prediction schemes, these techniques heavily rely on memory access patterns and react poorly to applications with large or irregular memory accesses. Incorrect prediction can lead to cache pollution and a significant performance penalty.

The authors in [36] propose using a combination of compiler and hardware support to prioritize instructions that are needed to keep the pipeline moving, and when none are available, allow the buffered low-priority instructions to execute. While this proposed scheme is said to be for an in-order embedded processor, it unfortunately still fundamentally relies upon a reorder buffer (ROB) and run-time register renaming, which are the largest contributors to area and power consumption in an out-of-order system. In fact, as exemplified by [37], the 56-entry instruction issue queue in the HP PA-8000 utilizes 20% of the die area, which is impractical for the smaller, low-power cores targeted in our heterogeneous processor topology.
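To make the software prefetching approach described above concrete, the following sketch shows the general shape of compiler-inserted prefetching; it is an illustration rather than a reproduction of any specific cited scheme. The __builtin_prefetch intrinsic is the GCC/Clang spelling of a prefetch instruction, and the prefetch distance of 16 elements is an arbitrary assumption that would normally be tuned to the target's memory latency.

    #include <stddef.h>

    /* Illustrative software prefetching: request a[i + DIST] while summing a[i].
     * DIST is a tuning assumption; too small hides no latency, too large risks
     * evicting the line before it is used (cache pollution). */
    #define DIST 16

    long sum_with_prefetch(const long *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
            sum += a[i];
        }
        return sum;
    }

The extra instructions and the sensitivity of DIST to the actual memory latency mirror the drawbacks noted above: code bloat, register pressure, and wasted work when runtime cache behavior differs from what the compiler assumed.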

2.2.2 Cache Power Optimization Techniques

As the high-performance mobile processor domain continues to enable more complex and memory-intensive applications, these systems now have multi-level cache hierarchies that begin to resemble their desktop counterparts. Furthermore, as semiconductor technology feature size continues to be reduced beyond 28nm, static leakage power within large caches begins to dominate. Given the variability of usage patterns and application complexity on the smartphone, a one-size-fits-all approach to cache memory either wastes power or provides poor performance. Numerous techniques have been proposed to address the typical memory access challenges involved.

Common structural techniques rely on segmenting the word- or bit-lines to reduce latency and dynamic power. Subbanking [38] divides the data arrays into smaller sub-groups, and only the "bank" that contains the desired data is accessed, avoiding wasted bit-line pre-charging dissipation. The Multiple-Divided Module (MDM) Cache [39] consists of small, stand-alone cache modules. Only the required cache module is accessed, reducing latency and dynamic power. Unfortunately, these techniques do not address the significantly increasing static leakage power.

Phased Caches [40] first access the tag arrays and then the data arrays. Only on a hit is the data way accessed, resulting in less data way access energy at the expense of longer access time. Similar to the aforementioned structural techniques, leakage power is not addressed, and in fact often becomes worse due to the increase in access time.

Filter Caches [41] are able to reduce both dynamic and static power consumption by effectively shrinking the sizes of the L1 and L2 caches, but suffer from a significant decrease in performance. This large amount of performance degradation is typically unacceptable in modern mobile processors. Similarly, on-demand Selective Cache Ways [42], which dynamically shut down parts of the cache according to application demand, suffer from sharp performance degradation when aggressively applied. Likewise, Speculative Way Activation [43, 44] attempts to predict the way where the required data may be located. If the prediction is correct, the cache access latency and dynamic power consumption become similar to those of a direct-mapped cache equivalent. If the prediction is incorrect, the cache is accessed a second time to retrieve the desired data. This results in some additional latency, as the predicted way must be generated before data address generation can occur, and the significant static leakage power is still not addressed.

Configurable Caches [45] also focus on reducing both dynamic and static power consumption by employing way concatenation and way shutdown to tune the entire cache's size and associativity. The proposal relies on a multi-phase on-line heuristic for exploring a search space of approximately 18,000 possible cache configurations, taking into account the separate L1 instruction and data caches, as well as a unified L2 cache with way management. While the results appear encouraging, the power and area overhead of this on-chip dynamic tuning heuristic is not included and would likely countervail much of the cache power improvement. Furthermore, the results utilized a small 64KB L2 cache and assumed only 10% of the total cache drain stemmed from static leakage. Most mobile smartphones utilize larger L2 caches, and the static leakage typically dominates the total cache power due to nanometer technology scaling [6]. Given that the proposed way shutdown mechanism only allows reducing the number of ways down to 1 (not completely disabling the cache index), there can still be substantial static leakage when cache indices are underutilized.

Smart Caches [46] also employ adaptive techniques to reduce both dynamic and static power consumption. While the results are positive, the approach specifically relies on having an out-of-order processor to hide the impact of increases in cache miss rates due to disabling sections of the cache. Given the heterogeneous topology used in today's devices, a solution that also works for the smaller, in-order processors is necessary.

There are also several techniques that specifically target leakage power within caches. The Gated-Vdd technique [47] allows SRAM cells to be turned off by gating the supply voltage away from the cell, effectively removing leakage but also losing all the state within that cell. The Data Retention Gated-Ground (DRG) Cache [48] reduces the leakage power significantly while still retaining the data within the memory cells in standby mode. On the other hand, there is a significant increase to the word-line delay of 60% or higher, depending on the feature size. Similarly, the Drowsy Cache [49] provides a low-power "drowsy" mode, where data is retained but not readable, and a normal mode. Leakage power is significantly reduced when in drowsy mode, but there is a cost and delay to switching between drowsy and normal mode. Since the cache is primarily present to mitigate the performance implications of long memory latencies, it is important to avoid causing significant degradation in performance just to recuperate leakage power. There is an important balance that must be struck, where the combined dynamic and leakage power are reduced while not significantly degrading performance (since a longer run-time will ultimately lead to still more dynamic and leakage power).
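As a generic illustration of the way shutdown idea underlying several of the techniques above (Selective Cache Ways, Configurable Caches), and not the exact mechanism of any cited design, the toy model below gates individual ways with an enable mask: disabled ways are simply never probed on a lookup, trading capacity (and thus hit rate) for dynamic and leakage power. All sizes and field widths are arbitrary assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy model of way shutdown: a per-cache mask marks which ways are powered;
     * lookups probe only enabled ways. */
    #define NUM_SETS 256
    #define NUM_WAYS 4

    struct cache_model {
        uint32_t tags[NUM_SETS][NUM_WAYS];
        bool     valid[NUM_SETS][NUM_WAYS];
        uint8_t  way_enable_mask;             /* bit w set => way w is powered */
    };

    static bool lookup(struct cache_model *c, uint32_t addr)
    {
        uint32_t set = (addr >> 6) & (NUM_SETS - 1);   /* 64-byte lines assumed */
        uint32_t tag = addr >> 14;                     /* remaining upper bits  */
        for (int w = 0; w < NUM_WAYS; w++) {
            if (!(c->way_enable_mask & (1u << w)))
                continue;                              /* gated way: never probed */
            if (c->valid[set][w] && c->tags[set][w] == tag)
                return true;                           /* hit in an enabled way */
        }
        return false;                                  /* miss (or data in a disabled way) */
    }

In hardware the mask would be driven by whatever tuning policy the design employs; this C model only illustrates the lookup-side effect of disabling ways.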

2.2.3 Cache Coherence Optimization Techniques

A little over three years ago, the first dual-core smartphone hit the market (the LG Optimus 2X using NVIDIA's Tegra 2 SoC). Since that time, smartphones have struggled with how to efficiently enable shared cache memory between cores. Early implementations mandated software-controlled cache coherence or forced shared memory never to be stored in the L1 data cache, since these mobile processors were highly constrained and required extra care to minimize power consumption in order to extend device life. Even today, full hardware-based cache coherence is quite costly and used sparingly. Numerous researchers have proposed ways in which to mitigate such cache coherence overhead.

A proposal using virtually tagged caches was presented in [50], where much of the TLB overhead from coherence could be avoided by using virtual addresses in the tag instead of physical addresses. Unfortunately, most systems will opt to have virtually indexed, physically tagged primary caches to avoid complexities arising from aliasing and homonyms.

Serial snooping techniques have been proposed to alleviate overhead from read misses, wherein the assumption of data locality among the various processor cores is exploited [51]. In general, these approaches aim to reduce the overhead of snooping all other cores in parallel, instead doing it in a serial fashion starting with the closest cores and moving outward. One important shortcoming of such an approach is that latency can increase as the snooping proceeds iteratively, unless the data resides in the closest neighboring cores. For example, the authors in [51] reported an average increase in latency of 6.25%. Given the importance of low memory latency and its impact on execution speed for mobile smartphones, this approach is often unacceptable.

The Jetty approach essentially adds a small directory-like structure between the shared memory structure and each replicated cache [52]. This structure indicates whether a copy may exist within a particular local cache, helping to avoid tag accesses if the item is guaranteed not to exist in the cache. Jetty was proposed for a system where both L1 and L2 were localized, wherein the cost associated with L2 lookups was far greater than the smaller Jetty structure. Unfortunately, most mobile processors localize only the smaller L1 caches and leave L2 as a shared resource. In this hierarchy, the overhead of Jetty would be comparable to just doing the regular L1 lookup, resulting in minimal benefit.

The RegionScout approach attempts to dynamically detect and filter out private regions of the memory space from performing the snoop-based coherence, reducing read-induced tag lookups and write-induced invalidation broadcasts [53]. As the authors write, this implementation is good for coarse-grained sharing, as reasonable power reductions only occur when the region size is rather large. The benefits of this approach are muted if the system has more fine-grained sharing or if the shared locations are not contiguous. Given the complexity and variability of applications that can be run on mobile processors, the assumption of coarse-grained, contiguous data sharing cannot be guaranteed.

An application-driven solution was presented in [54], where the compiler or software developer would statically determine the shared regions of memory, allowing the remaining regions to be placed in a non-shared mode. Again, this solution utilizes rather coarse-grained regions and also relies on simple application algorithms for sharing.
With more complex, desktop-like applications expected to run on modern mobile processors, the issues of dynamic memory and pointer access are introduced, rendering the compile-time approach insufficient.

Moreover, since the cache is primarily present to mitigate the performance implications of long memory latencies, it is important to avoid causing significant degradation in performance just to recuperate power. Since a longer run-time will ultimately lead to still more power usage in the overall system, there is an important balance that must be struck, where the cache coherence power overhead is reduced while not significantly degrading performance.

2.3 Branch Prediction Optimization

Branch prediction also plays an important role in processor performance. As mobile processors continue to employ deeper pipelines, the impact of mispredicting changes in control flow becomes more pronounced. Furthermore, mobile applications now commonly employ object-oriented polymorphism to increase developer expressiveness and flexibility when writing applications. However, polymorphism must eventually be resolved at run time, which increases the burden on branch prediction hardware due to dynamic dispatch. Similar to caches, branch prediction hardware is also costly in terms of power and area. The following subsections describe a number of techniques to help improve branch prediction performance and power. Section 2.3.1 describes techniques that can help optimize control flow in order to improve performance and power in branch direction prediction. Section 2.3.2 presents techniques that focus on the branch target buffer, improving both the performance and power involved in predicting the target address of control flow changes.

2.3.1 Predictor Optimization Techniques

The domain of native-execution mobile processors introduces unique challenges related to branch prediction, in particular due to dynamic dispatch. Since mobile applications are highly interactive and frequently utilize polymorphism both within the application's local source code and externally into common foundation library code, optimizing branch resolution becomes more challenging.

Profile-guided approaches are commonly used to enable dynamic dispatch code optimization. The Call Chain Profile (k-CCP) model was presented in [55]. Using this model, the predictability of receiver class distributions was demonstrated, showing that such distributions are strongly peaked and relatively stable across program inputs. The authors also proposed in-lining run-time tests for the dominant class in order to help optimize dynamic dispatch for the strongly peaked calls. The Festival approach attempts to estimate the frequency of virtual method calls by applying neural network machine learning algorithms across a set of known programs and workloads to glean relationships between static code structures and actual run-time behaviors [56]. By identifying the highly recurring calls, one can enable targeted method lookup caching to help ameliorate the most problematic locations. The drawback to these profile-guided approaches is the necessity of having accurate profiling workloads. Given that mobile applications are almost always user-interface driven and highly interactive, obtaining representative workloads of dynamic execution is non-trivial.

Class Hierarchy Analysis (CHA) was proposed in [57], where for each method the set of classes for which that method is the appropriate target is determined (the applies-to set). When a polymorphic method call is encountered, the applies-to sets are checked to see if there is any overlap. If not, then the polymorphic site can be replaced by a static call. This approach requires knowledge of all class hierarchies, as well as the methods defined within each class. Similarly, the SmallEiffel compiler leverages type inference to remove polymorphic call sites and instead replace them with static bindings when possible [58]. Essentially, the entire program is analyzed and a set of all possible concrete types is identified. All living methods are duplicated and customized based on the concrete type of the target. Thus, if only a single matching method exists for a given call, it is replaced by a regular static call to the target method. Unfortunately, these approaches require full application code analysis. Since mobile applications are centrally compiled without complete knowledge of all device-specific operating system library code, this approach cannot guarantee correctness. Furthermore, once the application is compiled into binary code, extraction of class hierarchy information becomes exceedingly difficult.

Another purely software-based optimization to improve branch prediction is described in [59], where basic blocks are reordered to minimize the worst-case execution time (WCET) of the program. Integer-linear programming (ILP) is used to explicitly model the worst-case execution path (WCEP) and the impact on branch prediction, including both branch penalty costs and the number of executed unconditional jumps. On-the-fly generation of remote device-specific code using centralized system notes is presented in [60]. The Fiji VM does aggressive de-virtualization and inlining to mitigate method call overhead [61].
These approaches typically rely on running within a Virtual Machine (VM) to enable such JIT optimizations.

Another aspect of branch predictor optimization is to reduce power dissipation when possible. The authors in [62, 63] propose dynamically disabling components of hybrid direction predictors using a profile-based approach. For high-performance processors that employ large hybrid predictors, there are typically various prediction tables that are queried and combined into a majority vote to determine the predicted branch direction. They demonstrate that profiling can be used to identify sections of code where the vote from a specific prediction table is likely to be more accurate than the majority vote. In these cases, the other prediction tables can be gated to conserve energy until they are needed again.

2.3.2 Target Buffer Optimization Techniques

The Branch Target Buffer (BTB) is also an important area for optimization, since it is accessed every execution cycle to determine whether the fetched instruction is a control flow instruction and, if the branch is predicted as taken, which target address to continue fetching from. A large amount of power is dissipated by the BTB, since it is physically similar to a cache and is accessed every cycle.

The Application Customizable Branch Target Buffer (ACBTB) was proposed in [64], where static code analysis is used to determine the number of instructions between a given control instruction and the next control instruction (both for the taken and not-taken paths). A small hardware table is constructed that holds the target address and these distances to the next control instruction for the taken and not-taken paths, as well as the index into the ACBTB for that next control instruction. Thus, when a branch is encountered, the system will know exactly how many instructions will be executed before the next branch. In this interval, no further lookups of the ACBTB take place while non-control instructions are being executed. Once the interval completes, the ACBTB is directly indexed based on the index value of the prior branch, avoiding the overheads of the cache-like lookup for a standard BTB. However, in order for this proposal to work, all control instructions must be statically known so the ACBTB table can be fully populated. The presence of the mobile ecosystem and the large amount of dynamic dispatch and indirect branching hinders statically obtaining a complete control-flow graph.

The authors in [62, 63] propose an approach to dynamically reduce the number of sets of the BTB using instructions that communicate with special control registers. Using a profile-based approach to identify sections of code where the BTB will likely not benefit from having its full storage capacity, instructions are inserted that reduce the BTB size by a power of 2, saving power by gating the unused portions of the BTB. When more BTB storage is required, similar instructions are inserted which restore the sets of the BTB that were disabled.

The Shifted-Index BTB proposed in [65] aims to reduce unnecessary accesses to the BTB by more intelligently indexing into the structure. A standard BTB is indexed using the lowest bits of the PC, so each consecutive instruction would access a different BTB set. The authors propose to instead index using the same number of bits, but shifted a few bits to the left. In this manner, the likelihood of two sequential BTB accesses mapping to the same set increases (a minimal sketch of the two indexing functions appears at the end of this subsection). Using this characteristic of BTB indexing, a set buffer is created that stores the last accessed BTB set as well as the index for that set. When looking up subsequent instructions, if the index is the same as what is currently in the set buffer, the set buffer is queried to get the target address instead of the full BTB. In this manner, a large portion of accesses to the BTB can be filtered out to save power.

A similar approach to filtering unnecessary BTB accesses is presented in [66]. By delaying the BTB lookup one cycle, the authors propose using the branch direction predictor to skip BTB accesses when the direction is predicted not taken. If the branch is not going to be taken, there is no need to determine the target address. To help counteract the performance lost due to delaying the BTB target address lookup by one cycle, a small filter BTB is introduced that is accessed in parallel with the direction prediction and is used instead of the BTB if the PC address is found within it. In this manner, they claim that accesses to the original BTB are very rare, greatly reducing the dynamic power consumption while incurring a small performance penalty.
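The sketch below contrasts the two indexing functions described for the Shifted-Index BTB; the concrete parameters (4-byte instructions, 512 sets, a shift of 3) are assumptions for illustration rather than the values used in [65].

    #include <stdint.h>

    /* With 4-byte instructions and standard indexing, consecutive PCs land in
     * consecutive sets, so the previously read set is rarely reusable. Shifting
     * the index field left by SHIFT bits makes runs of 2^SHIFT sequential
     * instructions share one set, which a small set buffer can then service
     * instead of the full BTB. */
    #define BTB_SETS 512
    #define SHIFT    3               /* 8 sequential instructions per set (assumed) */

    static uint32_t btb_index_standard(uint32_t pc)
    {
        return (pc >> 2) & (BTB_SETS - 1);            /* drop byte offset only  */
    }

    static uint32_t btb_index_shifted(uint32_t pc)
    {
        return (pc >> (2 + SHIFT)) & (BTB_SETS - 1);  /* neighbors share a set  */
    }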

Chapter 3

Eliminating Control Flow Overheads

One powerful feature enabled by using a high-level, object-oriented language is dynamic dispatch. Dynamic dispatch allows for robust and extensible application coding, allowing multiple classes to contain different implementations of the same method, with the actual method selection occurring dynamically based on the particular run-time instance it is called upon. Unfortunately, the presence of dynamic dispatch can introduce significant overheads during method calls, which can directly impact execution time [56]. It has also been observed that smartphone applications suffer from increased code size and sparseness [12], and encounter higher branch mispredictions and instruction cache misses due to lack of instruction locality. Specifically within the mobile processor ecosystem, dynamic dispatch exacts a heavy toll due to the numerous control flow changes that occur from the large quantity of polymorphic method call sites.

To make matters worse, there is a large diversity of hardware and operating systems a given application may run on. Often a single instance of the application is compiled and uploaded to the marketplace, and then is able to be downloaded and run on many different devices and operating system versions. When the application is loaded by the operating system, dynamic linking and variable offset tables are used to hook into the device-specific library code. Furthermore, some methods the application may call will be stubbed out for a particular model of hardware. For example, if the application calls a method to query the compass, but the specific device being utilized does not physically have a compass, the call will simply return without executing the actual sensor processing instructions. As one can see, attempting to optimize these polymorphic method calls centrally, before consumers download the application onto a given phone, is hindered by the variability and diversity of the final execution platform.

In this chapter, I propose a novel approach to deal with the unique characteristics of mobile processors and provide a methodology to tailor a given application and its associated device-specific shared library codebase using on-device post-compilation code optimization and transformation. While an application is still compiled normally into binary code at the time it is added to the application marketplace, this proposal leverages the aforementioned principle of enhancing the compiler-processor interface to embed class hierarchy metadata in the static deliverable. Once the application is downloaded onto a specific device, a second pass of code optimization is then performed, combining the application's embedded metadata with the local device's framework metadata to identify polymorphic sites that can be replaced with static method calls. This innovative approach widens the information channel of the application binary and enables otherwise impossible holistic on-device optimizations. The result will be an optimized application that encounters fewer instruction cache misses and branch mispredictions, which in turn helps alleviate overall processor power consumption, leading to longer battery life.

3.1 Motivation

As mentioned earlier, object-oriented languages provide dynamic dispatch in order to allow robust and extensible code development. The goal of polymorphism is to provide a common interface to objects of different types, and have the compiler and run-time environment determine the correct implementation when a method is called on an object. This allows the programmer to focus more on the higher-level aspects of the application without needing to dwell on individual object implementation details.

The three major mobile operating systems leverage standard object-oriented languages. The Apple iOS environment uses Objective-C, the Microsoft Windows Phone environment uses C++, and the Google Android environment uses Java. All three of these programming languages provide some form of polymorphism, relying on run-time identification to resolve method calls. However, some of these languages provide additional features to allow an application developer more expressiveness and flexibility when writing generic programs. In particular, both Objective-C and Java provide run-time support for reflection, allowing an object to be queried about its own properties and enabling type introspection. On the other hand, C++ lacks such run-time support for reflection. Another difference between these languages is that Objective-C provides run-time dynamic typing, while C++ only supports compile-time generics for type inference. Java is statically typed and lacks type inference altogether.

In general, Objective-C is mainly geared toward run-time decisions, while C++ is mainly geared toward compile-time decisions. Java is in the middle of this spectrum, relying on a mixture of compile-time and run-time decisions to enable polymorphism. As one would expect, deferring more of the decision-making about dynamic dispatch to run time increases the flexibility of the programming language. The application developer can write code at a much higher level of abstraction and rely on the compiler to inject the necessary code to resolve complex polymorphism at run time, taking into account dynamic typing and reflection. However, pushing more of this logic to run time means more instructions need to be executed, consuming precious computational resources. Thus, the overhead from dynamic dispatch can be significant.

In order to illustrate the overhead of dynamic dispatch, a simple Objective-C application is shown in Figure 3.1. This application defines three classes, a parent class (Animal) that extends from the foundation base class NSObject, as well as two child classes (Cow and Pig) that both extend from Animal and override the default init instance method defined in NSObject. Objective-C uses the syntax of [obj method:argument] in order to send a message (identified by the selector method) to the receiver obj. Due to dynamic dispatch, the resolution of the method selector to the underlying C method pointer implementation occurs at runtime. For the sake of simplicity, assume that any methods whose implementation is not defined herein will send no subsequent messages. With this assumption, this simple application will result in the sending of 11 messages when executed, as shown in the code comments. Unlike regular C function calls, where control flow unconditionally moves to the target function label, Objective-C message sending is quite complicated. Indeed, during runtime resolution of a message, the current class and all superclass metadata is searched to see if the selector exists within that class's method list.

@implementation Animal : NSObject
-(void) makeSound {
    /* Display string... */
}
@end

@implementation Cow : Animal
-(id) init {
    self = [super init];                                          /* 1 */
    self->sound = @"Moo";
    return self;
}
@end

@implementation Pig : Animal
-(id) init {
    self = [super init];                                          /* 1 */
    self->sound = @"Oink";
    return self;
}
@end

int main(int argc, const char * argv[]) {
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];   /* 2 */
    Cow *myCow = [[Cow alloc] init];                              /* 2 */
    [myCow makeSound];                                            /* 1 */
    Pig *myPig = [[Pig alloc] init];                              /* 2 */
    [myPig makeSound];                                            /* 1 */
    [pool release];                                               /* 1 */
    return 0;
}

Figure 3.1: Example Objective-C Application

This process is very time-consuming and involves complex memory traversals. The source code for the iOS Objective-C method lookup is provided in [67].

In order to avoid repeatedly traversing frequently selected methods, a software-based method cache exists for each class to quickly map implementations for a given selector. The vast majority of lookups are able to be resolved by this software cache. Yet, even if a message lookup hits the method cache, the overhead is still substantially more than a pure C function call. Examining the cache lookup portion of the iOS _objc_msgSend routine [68] shows that even if the cache lookup results in an immediate cache hit (on the first index of the cache), 21 instructions are executed, 3 of which are conditional branches. Referring back to the example application in Figure 3.1, each of the 11 message calls will necessitate executing at minimum 21 instructions plus conditional branching. The key observation is that this behavior can have tangible consequences on the overall processor performance. Primarily, the number of additional instructions executed during each method call contributes to increased execution time.

It is important to note that these overheads of dynamic dispatch are not exclusive to Objective-C. As mentioned earlier, all three object-oriented languages commonly used in mobile devices employ dynamic dispatch and have some form of run-time overhead. While Objective-C may have the highest overhead due to the additional language features resolved at run time, even more constrained languages like C++ have some costs. Looking at C++ virtual function calls run on a regular desktop computer, prior research has shown the median amount of time spent in dispatch code to be 5.2%, rising to 13.7% when all functions are made virtual [69]. In general, any polymorphic, late-binding programming language that is run natively on a mobile processor will be susceptible¹. As mobile applications rely more heavily on dynamic dispatch, due to shared foundation libraries and helper routines, mitigating dispatch code overhead becomes critical.

In order to get a current perspective on the implementation overheads of both C++ virtual function calls and Objective-C message sends, a basic benchmark of these calls within the Apple Cocoa framework was conducted. Each type of method call was executed 10 billion times, and the overall run-time was captured. The normalized execution time overhead for each type of method call is shown in Table 3.1. The measurements were taken on a regular desktop computer using an Intel Core i5 as well as on an iPhone 4S. It is interesting to note that on a desktop computer, the time overhead of C++ virtual function calls is comparable to caching the implementation pointer of an Objective-C method and using that to repeatedly send messages (IMP-cached message send). Yet, when run on a mobile processor, the overhead of even IMP-cached message sends is quite apparent. Given the architectural differences between the Intel processor and the iPhone's ARM-based SoC, it appears the overhead of dynamic dispatch is more salient in the latter. Recall that mobile processors are more constrained in terms of power and area, often employing simpler branch prediction and cache hardware. Large amounts of dynamic dispatch will have a significantly higher impact on these mobile processors, overwhelming the predictive structures and degrading instruction throughput, as seen in Table 3.1.

¹ Java is not run natively, but instead leverages a virtual machine (JVM). Various JIT optimizations are possible within the virtual machine and are described in Chapter 2.

Table 3.1: Relative Time per Dynamic Dispatch Method Call

  Type of Method Call          Intel Core i5   iPhone 4S
  IMP-cached message send      1.00X           21.55X
  C++ virtual function call    1.99X           3.96X
  Objective-C message send     12.57X          69.46X

As mentioned, dynamic dispatch has repercussions in terms of instruction cache and branch predictor performance. With regard to the instruction cache, spatial locality is paramount for ideal performance. As the control flow of the application changes, spatial locality is degraded. Every polymorphic call necessitates a jump to the dispatch instructions, wherein a cache lookup may occur, entailing more control flow changes. This behavior can manifest in terms of increased L1 instruction cache misses.

Regarding branch prediction, the increase in conditional branches to deal with method cache lookups can impact overall performance. Prediction structures, such as the global history register, may become polluted with these dispatch-related conditional branches. Furthermore, most dynamic dispatch approaches rely on indirect (register-based) branching, wherein target address prediction becomes an issue. According to [70], despite specialized hardware to predict indirect jump targets, 41% of mispredictions come from indirect jumps. To exacerbate matters, those results are based on a desktop processor with specialized indirect jump prediction hardware. Most mobile processors, being area and power constrained, often employ simplified branch prediction hardware.

It becomes apparent that dynamic dispatch not only increases execution time due to executing more instructions, but also increases entropy in many of the processor's hardware predictive structures like caches and branch predictors. If possible, one would like to replace as many polymorphic sites as possible with direct function calls in order to ameliorate these overheads.
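To make the dispatch cost discussed in this section concrete, the following C sketch models a selector-keyed method cache probe of the general kind performed by _objc_msgSend; it is an illustrative simplification, not Apple's implementation, and all names are hypothetical. Even the best-case hit path involves dependent loads, data-dependent conditional branches, and an indirect call through the returned implementation pointer, which is precisely the behavior that pollutes branch history and defeats target prediction.

    #include <stddef.h>
    #include <stdint.h>

    typedef void (*imp_t)(void);          /* method implementation pointer */

    struct bucket { uintptr_t sel; imp_t imp; };

    struct cls {
        struct cls    *super;             /* superclass, searched on a cache miss */
        struct bucket *cache;             /* open-addressed, power-of-two sized   */
        uint32_t       cache_mask;
    };

    /* Stub for the slow path: in a real runtime this walks the class's method
     * lists and superclasses, then refills the cache. */
    static imp_t slow_lookup(struct cls *c, uintptr_t sel)
    {
        (void)c; (void)sel;
        return NULL;
    }

    static imp_t dispatch(struct cls *c, uintptr_t sel)
    {
        uint32_t i = (uint32_t)sel & c->cache_mask;   /* hash selector into cache */
        for (;;) {
            struct bucket *b = &c->cache[i];
            if (b->sel == sel)                        /* conditional branch       */
                return b->imp;                        /* hit: indirect call next  */
            if (b->sel == 0)                          /* conditional branch       */
                return slow_lookup(c, sel);           /* miss: hierarchy search   */
            i = (i + 1) & c->cache_mask;              /* probe next bucket        */
        }
    }

A direct call, by contrast, is a single unconditional transfer whose target is known at link time, which is exactly what the optimization in the next section aims to recover.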

3.2 Methodology

One of the primary challenges to optimizing dynamically dispatched applications on mobile processors is the vast diversity of hardware, OS versions, and consumer applications. Mobile applications are developed in a loosely general-purpose fashion and the compiled code is uploaded to a central marketplace (e.g. the Apple AppStore). This application is then downloaded onto numerous devices, each of which may vary in terms of hardware functionality and foundation library routines. This one-to-many relationship is ideal from a scalability and validation perspective, but has drawbacks in terms of possible compile-time optimizations. Since most mobile ecosystems are based on highly flexible frameworks and foundation libraries, with much of that flexibility built into the framework base classes in the form of dynamically dispatched messages, oftentimes only a limited portion of the framework code is used by a particular application.

Many compile-time solutions already exist to help reduce the overhead of dynamic dispatch, but they require the ability to view the entire program space during compilation in order to make correct decisions [57, 58]. Unfortunately, applications that are downloaded onto the smartphone are pre-compiled binaries that have already been flattened. Identifying object-oriented class hierarchy information at that point is extremely difficult, as the compiler-optimized assembly code will just consist of message sending using register values. To overcome this limitation, high-level class hierarchy and method information needs to be propagated onto the target device to enable a second level of optimization that takes into account the actual foundation code present on the device.

Furthermore, mobile processors often possess a large number of sensors and gadgets that can vary from device to device. For example, some devices may only have a GPS and accelerometer, while others also contain a compass, thermometer, barometer, etc. As the foundation libraries and applications often contain code to handle a wide range of sensors, for those devices that lack a given sensor the associated code is unnecessary. Pruning away such dead code reduces the quantity of polymorphic sites; removing unneeded method implementations reduces the possible targets for sending messages.

Figure 3.2: High-Level On-Device Optimization Framework

In order to accomplish this goal of improving mobile application performance, a novel on-device application code optimization framework is proposed. When applications are compiled using the mobile ecosystem toolchain, they are also analyzed in order to extract high-level application metadata, including class hierarchies and method implementations. This metadata is included in the application bundle that is uploaded onto the marketplace. Similarly, each new operating system release that is compiled is also analyzed to annotate high-level class hierarchy and method implementation attributes, with this information included in the OS system updates. Once an application is downloaded onto a given device, an on-device code analysis and transformation process will occur. Using the metadata from both the OS and the application, a complete object-oriented hierarchy view is available. Optimizations and code transformations are applied, also taking into account device-specific attributes such as available peripherals and sensors. The result of this optimization will be a new application binary that can then be run on the device as would normally occur. A high-level overview of this framework is shown in Figure 3.2.

The key idea of this approach is the decomposition of the application's optimization process across multiple points in time. While many software optimizations can be achieved during regular compilation, the ability to do interprocedural optimization on mobile applications is problematic. The contents of the operating system foundation libraries are not known for certain, so full application visibility is impaired during application compilation. This chapter will show how a specific type of interprocedural optimization, dynamic dispatch resolution, can be accomplished by extracting partial class hierarchy information from multiple sources, and then employing a post-process code transformation on the application binary when all the necessary pieces of information become available. By widening the information channel from just pure binary instructions to also include class and method hierarchy metadata, powerful post-compilation code transformations can be performed. The following subsections provide a detailed explanation of the steps involved in this framework.

3.2.1 Extraction of Source Code Metadata

The first step of this framework is to capture high-level information related to the object-oriented classes and methods that would otherwise be unavailable by the time the application arrives on the mobile device. When a developer creates an application and compiles it for inclusion in the mobile marketplace, the critical information related to class hierarchies and message calls is captured. Similar to [57], the goal is to glean information about the possible receiver classes of each method being compiled.

Figure 3.3(a) shows a simple class hierarchy. Consider the case where the method y defined in class D sends the following message: [self x]. This would result in dynamic dispatch to determine which implementation of x to call based on the instance of self that exists at runtime. For example, the program may have instantiated class E, which inherits method y from class D, so the receiver of the message sent to self would be of type E at runtime and would need to determine which implementation of the method x to select. However, one may notice that no subclass of class D contains an overriding implementation of method x, and that the only possible implementation of method x in class D or any of its subclasses is the one defined in class B. Thus, the dynamic dispatch code of [self x] can be replaced with a direct method call to B::x.

Figure 3.3: Example Class Hierarchy and Method Partial Ordering

At the time the application is compiled, information on the complete class hierarchy is not present, due to much of the functionality coming from foundation code which can vary across devices. So, in order to allow this class hierarchy analysis to occur at a later time, the application compiler toolchain is augmented to extract the known class hierarchy information and store it as part of the application deliverable. In a similar vein, the same class hierarchy information is captured for the vendor-provided OS foundation library classes and stored for later reference. Once the application is downloaded onto the phone, the combined metadata from the application and OS will provide the complete view of all class hierarchies.

An additional stage of analysis is done for the foundation libraries to identify code that is needed for hardware-specific features. For example, there may be specific classes and methods that exist to interact with an ambient temperature sensor only present on some hardware devices. If the device lacks this sensor, the corresponding foundation code that interacts with and processes the sensor data is never used. By identifying these classes and methods, device-specific dead code can be eliminated when the OS is installed on the target device. The effort to annotate the foundation library source code with information about the physical properties required for the code's operation can be done centrally by the vendor and does not require any special effort from application developers.
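As an illustration of the kind of metadata record this step could emit per class, the structure below captures the class name, its superclass, the methods it defines, and a bitmask of required hardware features. The field names and layout are hypothetical and are not the exact format used by the framework.

    #include <stdint.h>

    /* Hypothetical layout for the class-hierarchy metadata bundled with an
     * application or OS library. */
    struct method_meta {
        const char *selector;        /* e.g. "init", "makeSound"                   */
        uint32_t    impl_offset;     /* offset of the implementation in the binary */
    };

    struct class_meta {
        const char *name;            /* e.g. "Cow"                                 */
        const char *superclass;      /* e.g. "Animal"; NULL for root classes       */
        uint32_t    required_hw;     /* bitmask of sensors/peripherals needed      */
        uint32_t    num_methods;
        const struct method_meta *methods;  /* methods defined here (not inherited) */
    };

One such table would accompany the application binary and another the installed foundation libraries; merging the two on the device yields the complete hierarchy view described above.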

3.2.2 Pruning Unused Classes and Methods

Once the particular application and foundation library are both present on a given physical device, pruning of unused classes and methods can be performed. As mentioned earlier, mobile foundation libraries are highly flexible and provide a myriad of framework base classes that may not all be utilized by an application. In order to reduce the sparseness of instruction code, unneeded library instruction code can be pruned away. Instead of having the application dynamically load the entire set of foundation libraries, only those libraries that are required by the application can be statically linked into the application binary. While the resulting binary will be relatively larger than the original application code, it is important to note that instruction code as a whole is quite small compared to other application data such as bitmaps, sound files, and databases, so the impact on overall storage space is trivial. On the other hand, the benefit is a much more compact instruction space that will reduce sparseness.

A straightforward analysis of the application's class hierarchy will provide a list of all possible classes that can be instantiated. This list is then used to identify any corresponding class definitions in the foundation library. These foundation classes are marked to be preserved. Then, a depth-first search is conducted on each preserved class to identify any required parent classes also defined in the foundation library that should be preserved. Lastly, the final set of preserved foundation classes that are uniquely required for the given application is statically linked into the application binary using hex-editing and address relocation where necessary. Device-specific dead code in the foundation libraries is also removed whenever a new OS version is installed. A simple hex post-processing of the foundation code can remove unused classes/entry points for nonexistent hardware features.
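A minimal sketch of the preservation pass is shown below, assuming a per-class metadata record like the one sketched in Section 3.2.1 and a hypothetical find_class helper; it marks every class the application can instantiate, walks superclass chains so parents are retained, and skips classes gated on hardware the device lacks. The actual implementation operates on the binary via hex-editing and relocation, as described above; this sketch only illustrates the marking logic.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    struct class_meta {                  /* minimal form of the record from 3.2.1 */
        const char *name;
        const char *superclass;          /* NULL for root classes                 */
        uint32_t    required_hw;         /* bitmask of hardware this class needs  */
    };

    enum { MAX_CLASSES = 4096 };
    extern struct class_meta classes[MAX_CLASSES];   /* merged app + OS metadata */
    extern int num_classes;

    static bool preserved[MAX_CLASSES];

    static int find_class(const char *name)
    {
        for (int i = 0; i < num_classes; i++)
            if (strcmp(classes[i].name, name) == 0)
                return i;
        return -1;
    }

    /* Mark a class and all of its ancestors as preserved, unless it depends on
     * hardware this device does not have. Everything left unmarked is pruned. */
    static void preserve(int idx, uint32_t device_hw_mask)
    {
        while (idx >= 0 && !preserved[idx]) {
            if ((classes[idx].required_hw & ~device_hw_mask) != 0)
                return;                              /* device lacks a required sensor */
            preserved[idx] = true;
            idx = classes[idx].superclass ? find_class(classes[idx].superclass) : -1;
        }
    }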

3.2.3 Optimizing Dynamic Dispatch Sites

To improve mobile processor application performance, as many polymorphic sites as possible should be removed. Many approaches exist to identify and replace resolvable dynamic message sending calls with static procedure calls. This proposal relies on the same algorithm as defined in [57], in particular because that algorithm supports dynamically typed languages, such as Objective-C. The challenge for mobile processors is that applications are developed separately from the target device and foundation libraries, and require the additional metadata described earlier to enable post-processing of the compiled assembly code with visibility across the entire program space.

First, those classes that define new method implementations, versus those that simply inherit the implementation from their parent, are identified. A partial order of all methods is constructed, where one method M1 is less than another method M2 iff M1 overrides M2. Figure 3.3(b) shows the partial ordering corresponding to the example class hierarchy presented in Figure 3.3(a). An initial applies-to set is computed for each method containing the defining class and all its subclasses. Then, a top-down traversal of the partial method ordering is conducted, where each of the immediately overriding methods' applies-to sets are subtracted from the current method's applies-to set. Thus, all methods will have an applies-to set that lists the set of classes for which that method is the appropriate target.

The application binary, including the statically linked foundation libraries, is now analyzed using a Perl script to try to identify polymorphic site optimizations. When a polymorphic call is encountered in the binary (i.e. an opcode calling the _objc_msgSend routine is detected by the Perl script), the metadata related to that call site is queried to get information on the receiver and selector. The class set that is inferred for the receiver is tested to see if any overlap exists in all the matching method applies-to sets. If no overlap exists, then the polymorphic site can be replaced by a static call, since only one possible method implementation exists for that polymorphic site. In other words, the binary is hex-edited to replace the call to the _objc_msgSend routine with a call opcode targeting the static address of the identified method implementation.

Furthermore, in the event that the static method implementation is very short (e.g. less than 15 instructions), the static call can be replaced by inlining the target method's instructions. While normal optimizing compilers often employ code inlining for short routines, since the original code had dynamic dispatch the compiler would have been unable to resolve which routine to inline. Now that the dynamic dispatch site has been optimized using the additional information related to the foundation libraries and the application's class hierarchies, a second pass for identifying inlining opportunities can be employed.
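The core of the applies-to computation and the devirtualization test can be sketched as follows, assuming single inheritance and, purely for brevity, class sets small enough to fit in a 64-bit mask; subclasses_of is an assumed helper that returns a class together with all of its subclasses. This is an illustrative restatement of the algorithm from [57] as it is used here, not the Perl implementation itself.

    #include <stdint.h>

    typedef uint64_t class_set;          /* bit i set => class with id i is in the set */

    /* One concrete implementation of a given selector. */
    struct impl {
        int       defining_class;        /* id of the class that defines it             */
        int       overrides;             /* index of the impl this one overrides, or -1 */
        class_set applies_to;            /* computed below                              */
    };

    extern class_set subclasses_of(int cls);   /* assumed helper: class + its subclasses */

    static void compute_applies_to(struct impl *impls, int n)
    {
        /* Start from "defining class plus all subclasses"... */
        for (int i = 0; i < n; i++)
            impls[i].applies_to = subclasses_of(impls[i].defining_class);
        /* ...then subtract each immediately overriding implementation's classes,
         * leaving exactly the classes for which this impl is the dispatch target. */
        for (int i = 0; i < n; i++)
            if (impls[i].overrides >= 0)
                impls[impls[i].overrides].applies_to &=
                    ~subclasses_of(impls[i].defining_class);
    }

    /* Returns the index of the unique possible target for a call site whose
     * receiver may be any class in `receiver`, or -1 if the site must stay dynamic. */
    static int devirtualize(const struct impl *impls, int n, class_set receiver)
    {
        int candidate = -1;
        for (int i = 0; i < n; i++) {
            if (impls[i].applies_to & receiver) {
                if (candidate >= 0)
                    return -1;           /* two or more targets possible */
                candidate = i;
            }
        }
        return candidate;
    }

When devirtualize identifies a unique target, the corresponding _objc_msgSend call in the binary can be rewritten as a direct call (or inlined if the target is short enough), exactly as described above.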

3.2.4 On-Device Application Optimization Flow

In the preceding subsections, a number of code transformations were described to optimize an application that is downloaded onto a mobile smartphone. This section describes the flow that is proposed in order to seamlessly handle the transformation once an application is downloaded onto the phone.

First, a user will select an application and choose to download it onto the device. Once the application is downloaded, it can immediately be executed and will run as-is. This allows users who are in a hurry to start using the application without incurring any additional wait times. The newly downloaded application pointer will be added to a to-be-optimized queue file.

In order to avoid unnecessarily impacting regular performance and battery life, the code transformation process will wait to be executed until the device enters an idle period. These periods occur between when the screen is turned off and when the application processor goes into a low-power suspend/sleep mode [20]. During these idle periods, the processor is fully awake without any applications running on it. The unused computational resources of the processor can be leveraged during these windows of time to do the post-processing of the application binary. Essentially, a very low priority operating system process will wait for the processor to enter the idle state and then begin or continue processing any applications in the to-be-optimized queue. In this fashion, the performance and power overheads of on-device code optimization can be ameliorated. Once the optimization process for a given application completes, all pointers and shortcuts for launching that application will be redirected to the optimized version of the code. Thus, the next time the application is launched, the optimized version will be executed.

Another important consideration that must be accounted for relates to new versions of the application or operating system. If a new version of the application is downloaded onto the device, any previously optimized versions of that application are discarded, and new invocations of the application will go to the newly downloaded version until a new optimization process can occur on it. Similarly, if a new version of the operating system or foundation libraries is pushed onto the device, all existing optimized application files will be flushed in order to ensure a new analysis taking into account the updated foundation library code. In this fashion, the locally-stored optimized application files act like a cache for a given application; if a valid entry exists in the optimized cache, it will be used instead of the original application binary.
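The on-device flow can be summarized by the sketch below; every function name is a hypothetical placeholder for the corresponding platform service (idle detection, the to-be-optimized queue, and launcher redirection), and the work is deliberately done in small increments so the daemon can yield the moment the user wakes the device.

    #include <stdbool.h>
    #include <stddef.h>

    struct app { const char *bundle_path; };

    /* Hypothetical platform hooks, not real OS APIs. */
    extern void        wait_for_idle(void);              /* blocks until a screen-off idle window */
    extern bool        device_is_idle(void);             /* still idle and not yet suspended?     */
    extern struct app *dequeue_pending(void);            /* next app awaiting optimization        */
    extern void        requeue_front(struct app *a);     /* resume this app in a later window     */
    extern bool        optimize_step(struct app *a);     /* bounded chunk of work; true if done   */
    extern void        redirect_launcher(struct app *a); /* point shortcuts at optimized binary   */

    /* Low-priority daemon: drains the to-be-optimized queue only during idle
     * windows, yielding as soon as the user wakes the device. */
    void optimization_daemon(void)
    {
        for (;;) {
            wait_for_idle();
            struct app *a = dequeue_pending();
            if (a == NULL)
                continue;

            bool done = false;
            while (device_is_idle() && !done)
                done = optimize_step(a);

            if (done)
                redirect_launcher(a);    /* future launches use the optimized binary */
            else
                requeue_front(a);        /* finish during the next idle period       */
        }
    }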

3.3 Experimental Results

Table 3.2: Description of Benchmark Apps

  Benchmark        Description
  Bubbsy           Graphical world-based game (uses Cocos2D)
  Canabalt         Popular run and jump game
  DOOM Classic     3D first person shooter
  Gorillas         Turn-based angle shooter (uses Cocos2D)
  iLabyrinth       Puzzle navigation game (uses Cocos2D)
  Molecules        3D molecule modeling and manipulation
  Wikipedia        Online encyclopedia reader
  Wolfenstein 3D   3D first person shooter

In order to assess the benefits of this proposed framework, an actual commercial mobile processor is utilized. The target mobile device chosen is an iPhone 4S (dual ARM Cortex-A9 800MHz processor, 512MB DRAM) running the iOS 5.1.1 operating system (Darwin Kernel 11.0.0). A set of eight open-source, interactive Objective-C iOS applications is used to benchmark the optimizations. A listing of these benchmarks and their respective descriptions is provided in Table 3.2.

Each application is compiled using Xcode IDE version 4.1.1. One copy of the compiled application is the baseline that does not leverage the proposed dynamic dispatch optimization framework, and another version contains the optimized binary code. Figure 3.4 shows a comparison of the quantity of polymorphic call sites between the two versions of a given application's binary code. A large majority of dynamic dispatch calls were able to be replaced by static method calls, resulting in a geometric mean reduction of 76.8% in polymorphic sites across all benchmarks.

In order to evaluate the corresponding performance improvement of reducing dynamic dispatch sites, both the baseline and optimized code need to be executed on the physical device. This is necessary, as the vast majority of mobile applications are highly interactive and require touchscreen stimuli in order to realistically function. The selected benchmarks are no exception to this characteristic. By running the application directly on the phone and interacting with the application, the dynamic control flow will closely resemble typical usage.

Figure 3.4: Reduction in Polymorphic Sites

An additional benefit of running the application code directly on the smartphone is that the correctness of the code transformations is validated, since they are executed directly on the processor.

The GNU gcov test coverage tool is leveraged to instrument executed code and to capture the dynamic execution frequencies of each line of source, as well as how many call and branch instructions occurred. In particular, both the Generate Test Coverage Files and Instrument Program Flow settings are enabled in the Xcode compiler, and the GCOV_PREFIX and GCOV_PREFIX_STRIP environment variables are set to redirect the generated output profiles for all source files into a central path. As a result, when the application is executed on the phone, a corresponding .gcda coverage file is created for every source file. These coverage files are then downloaded from the phone and analyzed to determine the percentage of dynamic instructions that are control flow related.

To generate the dynamic instruction profiles for the benchmarks, both the baseline and optimized versions of the applications are run natively on the phone for a period of 5 minutes of active and constant usage. As one interactively uses the application, the exact user input and resulting control flow will vary between every single run. In order to help reduce outliers and noise, this process of running the application for 5 minutes is repeated 10 times to generate an arithmetic mean for each application.

After collecting and analyzing all the gcov coverage files, the dynamic instruction count for polymorphic call sites is ascertained and shown in Figure 3.5. The geometric mean reduction of dynamically executed polymorphic calls across benchmarks is approximately 73.1%, which is quite significant.

Figure 3.5: Executed Polymorphic Calls

As one can recall from earlier in this chapter, every call to _objc_msgSend incurs at minimum 3 conditional branches (assuming the implementation is found in the first cache entry that is queried). If the method implementation requires further cache searching or a full class hierarchy search, the number of control flow operations becomes significant. As polymorphic sites are replaced by static calls, the number of control flow operations decreases. Furthermore, in the cases where static method inlining can be employed, all control flow is essentially obviated. Figure 3.6 shows the ratio of control flow instructions that are dynamically executed at runtime compared to other instructions, such as arithmetic or memory. The baseline proportion of control flow instructions has a geometric mean of 20.6%, whereas in the optimized case the mean is 17.0%, illustrating the reduction in overall control flow instructions.
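To make the source of these savings concrete, the sketch below contrasts a dynamic dispatch site, modeled as a simplified method-cache probe followed by an indirect call, with its devirtualized replacement. This is a schematic C illustration only; the structures and names are hypothetical and greatly simplified relative to the actual Objective-C runtime.

    /* Schematic model of a polymorphic call site versus its devirtualized
     * replacement. All names are hypothetical; the real runtime's dispatch
     * path is considerably more involved. */
    #include <stddef.h>

    typedef void (*imp_t)(void *self);

    struct cache_entry { const char *sel; imp_t imp; };
    struct clazz {
        struct clazz *super;
        struct cache_entry cache[8];
    };

    /* Dynamic dispatch: every probe of the method cache is a conditional
     * branch, and the eventual call is an indirect branch. */
    static void dynamic_send(struct clazz *cls, void *self, const char *sel) {
        for (struct clazz *c = cls; c != NULL; c = c->super) {   /* branch */
            for (size_t i = 0; i < 8; i++) {                     /* branch */
                if (c->cache[i].sel == sel) {                    /* branch */
                    c->cache[i].imp(self);                       /* indirect call */
                    return;
                }
            }
        }
    }

    /* Devirtualized form produced by the on-device optimization: a single
     * direct call, which inlining can remove entirely. */
    static void MyClass_doWork(void *self) { (void)self; /* method body */ }
    static void static_send(void *self) { MyClass_doWork(self); }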

Figure 3.6: Ratio of Control Flow Instructions

The benefits of this reduction in the ratio of control flow instructions are threefold. First, fewer instructions in general need to be executed to handle polymorphic site resolution, allowing the entire program to progress more quickly through the processor pipeline. Second, less branching alleviates pressure on the branch predictor hardware. As conditional branches are eliminated, their respective misprediction penalties are eliminated, and undue pollution of global history metrics is minimized. Lastly, as control flow changes become less prominent, spatial locality is improved, which results in better instruction cache performance. Although there is currently no accessible method to extract low-level processor performance metrics from the smartphone, such as branch misprediction or instruction cache miss rates, replacing polymorphic calls with static method invocations can be expected to improve these metrics on average.

The code size overhead of creating the optimized version of an application is small. Applications usually include many other resource files in addition to the binary code, such as images, sound files, and configuration files. Applications typically consume 5–30MB of storage space on the device, of which the contribution of binary code is rather low. Nevertheless, since this proposal requires adding metadata used for post-processing and generating a transformed version of the binary code (while still preserving the original binary code), there is an increase to application size. Table 3.3 shows the increase in total application size compared to the original binary code and resource files (the additional cost of the metadata and the transformed version of the code). Since most high-performance smartphones can store 16GB or more, this modest increase in size is acceptable.

Table 3.3: Total Application Size Increase Compared to Original Binary Code and Resource Files

Benchmark        Size Increase
Bubbsy           10.3%
Canabalt         1.9%
DOOM Classic     6.7%
Gorillas         5.5%
iLabyrinth       15.4%
Molecules        8.0%
Wikipedia        10.1%
Wolfenstein 3D   17.5%
Geometric Mean   7.9%

Another important consideration for mobile processors is power. As shown, the proposed framework reduces the number of control flow instructions that are executed, which has direct implications for the power dissipated by the branch predictor hardware. Branch predictors can contribute 10% or more of a processor's total power dissipation [71]. Reducing the number of conditional branches results in fewer branch predictor queries, saving power.

3.4 Conclusions

While much work has gone into mitigating dynamic dispatch overheads in general-purpose systems, the mobile processor domain presents unique challenges. In particular, given that mobile applications are compiled apart from the specific target device and foundation libraries, commonplace optimization techniques for dynamic dispatch sites are stymied, since a large portion of the application's code comes from shared libraries.

A novel framework for enabling on-device application optimization has been presented. Critical class and method hierarchy information is extracted during application compilation and conveyed along with the application binary when downloaded onto a device. The application is post-processed on the device, leveraging the complete knowledge of the underlying hardware features and OS foundation libraries that are present on the device. A device-specific optimized version of the application is created which greatly reduces the number of dynamic dispatch instructions. The benefits of such an approach have been demonstrated on real-world interactive mobile applications running on a commercial processor.

This framework for eliminating control flow overheads leverages two of the high-level approaches defined earlier in this dissertation in Chapter 1. First, the compiler-device interface is enhanced in order to extract and convey critical metadata onto the device (see Section 1.3.1). Information about the object-oriented class and method hierarchy is combined from the downloaded application and the device-specific operating system foundation libraries in order to create a complete view of the program. Second, application-specific code optimizations are performed (see Section 1.3.2). Using interprocedural optimizations to replace dynamic dispatch sites with static calls completely eliminates a large number of dynamically executed instructions, greatly improving instruction throughput. Furthermore, the simplification of the control flow also reduces overheads on the branch predictor and instruction cache, improving prediction accuracy.

Chapter 3, in part, is a reprint of the material as it appears in G. Bournoutian and A. Orailoglu, "On-Device Objective-C Application Optimization Framework for High-Performance Mobile Processors", Proceedings of the 2014 IEEE Design, Automation, and Test in Europe Conference and Exhibition (DATE), 2014. The dissertation author was the primary investigator and author of this paper.

Chapter 4

Resolving Memory Stalls in In-Order Processors

In mobile systems, where power and area efficiency are paramount, smaller, less-associative caches are chosen. Earlier researchers realized that these caches are more predisposed to thrashing, and proposed solutions such as the victim cache [32] or dynamically associative caches [72, 73, 74] to improve cache hit rates. Yet, even with these various solutions to reduce cache misses, today's more aggressive and highly irregular applications inevitably will still suffer from some cache misses, slowing overall application performance.

Furthermore, the mobile heterogeneous processor topology typically utilizes smaller, in-order processor architectures as the choice for "Little" cores, in order to reduce area and power consumption. Because of this in-order behavior, when an instruction stalls the pipeline, subsequent instructions are not permitted to execute, even if those instructions are independent and ready for execution. A cache miss is an example of such a stalling instruction, and the processor must wait until the memory reference is resolved before allowing the pipeline to begin issuing and executing instructions again. Each of these memory stalls can result in hundreds of wasted processor cycles in the event that the cache miss needs to read from main memory [75]. As shown in Figure 4.1, memory-based instructions account for approximately 35% of the total number of operations executed on average in applications. Minimizing the impact of cache miss stalls can significantly improve overall processor performance and execution time.


Figure 4.1: Average Distribution of Memory Instructions

This chapter proposes a solution that utilizes both the compiler and a nominal amount of specialized hardware to extract guaranteed-independent instructions and allow them to continue execution during memory stalls, which in turn reduces overall execution time. In this manner, the delay induced by memory stalls can be effectively hidden, allowing the processor to continue safely executing independent instructions, similar to an out-of-order processor. Fundamentally, this approach extends the communication link between the compiler and the processor architecture by transferring a small amount of application information directly to the microarchitecture without modifying the existing instruction set. Thus, by combining global application information known at compile-time with run-time cache behavior, one is able to intelligently keep feeding the pipeline with instructions known to be guaranteed independent of the memory operation, which can therefore be safely executed without any data hazards. The implementation of this architecture will be shown, and experimental data taken over a general sample of complex, real-world applications will demonstrate the benefits of such an approach. The simulation results show a significant improvement in execution throughput of 11% by hiding approximately 64% of cache miss stalls, while having a minimal impact on area overhead and power consumption.

4.1 Motivation

A typical data-processing algorithm consists of data elements (usually part of an array or matrix) being manipulated within some looping construct. These data elements each effectively map to a predetermined row in the data cache. Unfortunately, different data elements may map to the same row due to the inherent design of caches. In this case, the data elements are said to be in "conflict". This is typically not a large concern if the conflicting data elements are accessed in disjoint algorithmic hot-spots, but if they happen to exist within the same hot-spot, each time one is brought into the cache, the other will be evicted, and this thrashing will continue for the entire hot-spot.

Given complex and data-intensive applications, the probability of multiple cache lines being active within a hot-spot, as well as the probability of those cache lines mapping to the same cache set, increases dramatically. As mentioned, much prior work has already gone into minimizing and avoiding cache conflicts and thrashing. Nevertheless, as applications continue to become more complex, the working set of data may exceed the capacity of the cache or lack localized regularity, both of which will degrade cache performance and lead to an increased miss rate. Additionally, since mobile embedded processors are constrained to frugal hardware budgets and compact form factors, they often lack the complex hardware needed to allow out-of-order execution (e.g. reorder buffer, register renaming, and multiple-issue). Thus, when these cache misses do occur, the pipeline must stall and wait until the missed cache line is resolved before continuing execution. Given the current trajectory of the application complexity expected to run on these mobile embedded systems, such cache misses and the resulting memory pipeline stalls will become more prevalent and detrimental to execution speed.

To illustrate this point, Figure 4.2 shows an excerpt of assembly code from a GSM full-rate speech transcoder compression algorithm, which could be found on a typical GSM cellular phone. This basic block contains a single load instruction, and based on profiling, it was found that this particular instruction encountered a cache miss 4.07% of the time. If one were to assume that the cache miss penalty is 10 cycles and that all instructions take one cycle, the impact of cache misses would amount to an execution cycle increase of 5.09% for this particular basic block. Since this particular basic block was executed 178,009 times, this would result in an additional 9,061 cycles, which translates to about 0.011ms of additional execution time overhead (assuming an 800MHz clock rate). Obviously, this example looks at just a single basic block; a given application would have many more basic blocks, each of which could suffer a similar fate and manifest into a much larger increase in execution time for the entire application.

01: addiu $18, $29, 16
02: addiu $19, $0, 320
03: addiu $17, $29, 336
04: lui   $20, 4096
05: addiu $20, $20, 604
    # Following load has a 4.07% miss rate
06: lw    $2, -32064($28)
07: addiu $4, $29, 16
08: jalr  $31, $2

Figure 4.2: GSM Example Basic Block

The goal of this proposal is to allow the compiler to analyze the basic block above, noting that the only instruction dependent on the memory load is the jump-and-link (jalr). Thus, the compiler can safely rearrange the code above to place the memory load at the beginning and leave the jump-and-link at the end, effectively positioning six instructions between the memory load and the next dependent instruction. This information is then provided to a specialized hardware structure when the application is loaded for execution. This hardware structure will then be able to detect whether a cache miss occurred, and instead of blindly stalling the pipeline as before, it can safely know that six instructions can execute while the memory subsystem is resolving the cache miss. Thus, in this example, the execution cycle increase becomes 2.04% instead of 5.09%, effectively hiding about 60% of the cache miss penalty.
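Both overhead figures follow directly from the stated assumptions of a 10-cycle miss penalty and eight single-cycle instructions in the block:

    baseline overhead  = miss rate × penalty / block length = 0.0407 × 10 / 8 ≈ 5.09%
    optimized overhead = 0.0407 × (10 − 6) / 8 ≈ 2.04%

since the six interposed independent instructions cover six of the ten miss-penalty cycles, hiding 60% of the penalty.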

4.2 Methodology

In order to address the issue of cache-miss-induced pipeline stalls within an in-order embedded processor, the interaction between the compiler and the underlying hardware microarchitecture will be strengthened. Inherently, there are certain algorithmic features that can more easily be ascertained during static compile-time analysis, such as examining the global program data flow and understanding data interdependencies. This information would be prohibitively costly and complex to generate on the fly during run-time. On the other hand, there are certain behavioral aspects of the application that can only be ascertained during run-time, such as actual cache hit/miss behavior and when a cache miss is resolved. These events cannot be known statically a priori, and thus require real-time system information. Given this trade-off between information available at compile-time versus run-time analysis, it becomes evident that some mixture of the two is essential. The compiler will help glean crucial information regarding the application that would otherwise be too costly to obtain at run-time, and pass this information to the underlying hardware microarchitecture to help the hardware make intelligent decisions in response to run-time events. A high-level overview of this architecture was presented in Section 1.3.1 and is shown in Figure 4.3.

Figure 4.3: Overview of Compiler and Hardware Interaction

The proposed solution for cache miss pipeline stalls is composed of two parts: a compile-time mechanism, which analyzes and reorders the instructions within basic blocks, and a run-time instruction execution mechanism that allows independent instructions to continue running during cache misses. Each part is described in detail in the following sections.

4.2.1 Compile-Time Transformations

The first aspect of our methodology is to have the compiler statically analyze the application and identify those instructions that can safely be executed during a memory stall. Intuitively, these are the instructions that are not data dependent on a memory load and will not cause a hazard if executed prior to the completion of the memory load. Once these instructions are found, they will be aggregated and placed between the load instruction and any subsequent dependent instructions, which enables them to execute during a cache miss.

In order to accomplish this analysis, this proposal leverages the data flow graph (DFG) normally used during compilation. The DFG provides an algorithmic representation of the data dependencies between a number of operations. To simplify and bound the analysis, this proposal operates at a basic block granularity, where a basic block is simply a segment of code which has only one entry point and one exit point (i.e. no control flow changes exist within the block of code). Each basic block is represented with a DFG G = (V, E), where each v ∈ V is an instruction in the basic block. There exists a directed edge e = (vi, vj) ∈ E if operation vi is data dependent on operation vj within the basic block. We propose an algorithm that makes use of the DFG to analyze the instruction dependency characteristics and rearrange the instructions in a manner which places memory loads earlier while pushing instructions dependent on those memory loads later within the basic block. The algorithm to reorder the instructions within a basic block is provided in Figure 4.4. This algorithm must fulfill the following three constraints in order to maintain proper program execution correctness and behavior:

• True data dependencies must be maintained
• The order of store operations must be maintained
• The order of load operations with respect to store operations must be maintained

In particular, it is important to note that the constraints listed above with regard to memory store ordering could be relaxed if the effective destination address is statically guaranteed not to conflict with other memory load/store instructions. For simplicity, we conservatively assume that this cannot be guaranteed.

ReorderAlgorithm {
    unvisitedSet.initialize( basicBlock.getAllLoadInstrns() );
    visitedStack.clear();
    topPtr = basicBlock.firstInstrnPosition();

    // Optionally remove any anti-dependencies via static register renaming
    // (using a basic register liveness analysis, employing use-define chains)
    // or immediate value adjustment

    while( unvisitedSet.hasElements() ) {
        // Select and remove the load instruction reference which depends on the
        // least number of prior instructions from the unvisitedSet
        currLoadInstrn = unvisitedSet.removeLoadWithLeastNumOfParentInstrnDeps();

        // Bubble the selected load instruction and those instructions it depends
        // on as far towards the topPtr position without violating constraints
        newLoadPosition = BubbleInstrnWithParentsTowardPtr(currLoadInstrn, topPtr);

        topPtr = newLoadPosition + 1;
        visitedStack.push(currLoadInstrn);
    }

    bottomPtr = basicBlock.lastNonControlInstrnPosition();

    while( visitedStack.hasElements() && topPtr != bottomPtr ) {
        currLoadInstrn = visitedStack.pop();
        while( currLoadInstrn.hasDependentInstrns() && topPtr != bottomPtr ) {
            // Select the instruction within the topPtr and bottomPtr bounds which
            // most directly depends on the selected load instruction reference
            myInstrn = SelectInstrnWithinBoundsThatMostDirectlyDependsOnLoadInstrn(
                currLoadInstrn, topPtr, bottomPtr );

            // Bubble the selected instruction and those instructions that depend on
            // it as far towards the bottomPtr position without violating constraints
            newPosition = BubbleInstrnWithChildrenTowardPtr(myInstrn, bottomPtr);

            bottomPtr = newPosition - 1;
        }
    }
}

Figure 4.4: Compile-Time Basic Block Reordering Algorithm

However, our algorithm can easily be extended to support this additional compile-time analysis, which would increase the degrees of freedom available for reordering. After this algorithm completes, the basic blocks will have the following characteristics:

• Memory load instructions are placed as early as possible
• Instructions that depend on memory loads are placed as late as possible
• Instructions that depend on earlier memory loads occur before instructions that depend on later memory loads
• Memory-independent instructions that can be moved are placed in between the memory loads and the instructions that depend on those loads

Figure 4.5: Excerpt from FFT algorithm with DFG and Reordering

To illustrate this point, Figure 4.5(a) provides an example of a basic block from the Fast Fourier Transform algorithm, along with the corresponding DFG in Figure 4.5(b). As one can see, there are two load instructions in this example (instructions 03 and 06). Running the aforementioned reordering algorithm will move these two load instructions as early as possible, as shown in Figure 4.5(c). Furthermore, the two instructions that depend on these memory loads (instructions 04 and 07) will be moved as late as possible. This rearrangement results in instructions that are independent of memory loads being placed in between the memory loads and the instructions that depend on those memory loads. Observe that instructions 01, 02, 09, and 13 are located between the memory loads and the memory-dependent instructions. This strategic placement of memory-independent instructions is the first step to allowing pipeline execution to continue during a cache miss. Furthermore, instruction 04 is located before instruction 07. The importance of this second observation will be described shortly.

An additional observation relates to instruction 12. This instruction causes an anti-dependency (write-after-read) on instruction 11 (shown by the dotted line). In the example above, we did not eliminate this dependency. Static register renaming could remove this dependency, but at the cost of an additional instruction and added register pressure. However, an alternative in this instance is to note that instruction 11 [s.s $f0, 0($5)] could be changed to [s.s $f0, -4($5)] and then instruction 12 could be moved above it. This more aggressive anti-dependency analysis can help identify additional opportunities that assist in finding instructions that are memory-independent.

Now that the basic block instruction order has been primed, the proposal selectively annotates those basic blocks which have the greatest potential impact on run-time performance (i.e. the hot-spots). The data structure to be created is shown in Figure 4.6. Essentially, one needs to capture the instruction address (PC) of the first load instruction of the basic block and the number of independent instructions as counted from that initial load instruction. Furthermore, we can capture the increase in independent instructions as each load instruction in the basic block is resolved (up to nine in this example, including the initial load instruction). The first entry in Figure 4.6 corresponds to the instructions shown in Figure 4.5(c) as an example of this data structure. As one can see, when the first load instruction (03) is encountered, we know that we can safely execute six additional instructions were a cache miss to occur. Furthermore, once instruction 03 completes, we can execute one more instruction, and when instruction 06 completes, we can execute six more instructions.

This compile-time data structure must then be stored and conveyed to the underlying hardware when the application is executed. To accomplish this, the data structure is encoded into the application binary's text segment at a special reserved address range. When a program binary is loaded onto the processor, this special address range will be read and the corresponding hardware data structure populated with this information.

Figure 4.6: Basic Block Memory Independence Annotation (memTable)

An overview of this concept was shown earlier in Figure 4.3. As one can see, the interaction between the compiler and the underlying hardware microarchitecture is augmented to allow this new information regarding memory dependencies to be efficiently conveyed. The hardware mechanism that detects and enables this independent instruction execution is described in the next section.

4.2.2 Run-Time Run-Ahead Mechanism

Given that the compile-time analysis described in the previous section has reordered the instructions and annotated the application's program binary, a hardware mechanism is needed to make use of this information. Intuitively, the goal of this mechanism is to leverage the annotated information provided at compile-time, along with real-time information on cache misses and memory stall completion, to allow the pipeline to keep executing when it would otherwise have needed to stall. To accomplish this behavior, a nominal amount of additional hardware is needed. It is important to note that the magnitude of this new hardware is far smaller than a typical out-of-order processor's reorder buffer (ROB).

The first structure we will need is the memory access table (memTable) that will hold the information stored in the application binary. This is essentially what is shown in Figure 4.6. Namely, this structure will hold the PC, the initial independent instruction count, and nine additional increments for a given basic block’s memory operations. A total entry width of 64-bits is chosen for this purpose. This choice allows for 32-bits for the PC address, 5-bits for the initial instruction count, and nine increment slots, each having 3-bits. This corresponds well to the characteristics of hot-spot basic blocks found in real-world benchmark applications that were analyzed.
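For reference, the 64-bit entry layout just described can be sketched as a packed C structure; the field names are illustrative, and actual hardware would realize this as raw storage rather than a compiler-defined struct.

    /* Sketch of one 64-bit memTable entry: 32 + 5 + 27 = 64 bits. */
    #include <stdint.h>

    struct memtable_entry {
        uint64_t pc       : 32;  /* address of the basic block's first load      */
        uint64_t init_cnt : 5;   /* independent instructions following that load */
        uint64_t incr     : 27;  /* nine 3-bit increments, one per further load  */
    };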

Figure 4.7: Average Number of Load Instructions per Basic Block

In particular, Figure 4.7 provides the average number of load instructions present per basic block for our various benchmarks. The average number of load instructions across the aggregate of these benchmarks is approximately 8.75, and hence hardware support for up to nine load instructions was chosen. Similarly, Figure 4.8 provides the average number of independent instructions found within basic blocks following their first load instruction, as well as the 90th percentile¹ number of independent instructions. Given that the independent instruction 90th percentile value across the benchmarks is approximately 26.33 instructions, 5-bits (i.e. up to 31 instructions) were allotted for the initial independent instruction count. Additionally, the number of entries in this table can be varied, but based on our analysis of these benchmark applications, 256 entries were chosen to allow this optimization to be applied to up to 256 basic blocks within a given application.

¹ 90% of the basic blocks analyzed could have all of their independent instructions contained within this value.

Figure 4.8: Average and 90th Percentile Number of Independent Instructions per Basic Block

An additional feature of our implementation is that we allow subsequent load instructions to be "executed" during the stall of a previous load. As is typical with most embedded processors, the memory subsystem can only service one memory instruction at a given time. Thus, it is not possible to execute a load while a prior load is still waiting on a cache miss (i.e. this would be a structural hazard). Nevertheless, instead of always assuming that a load may stall, which would require constraining the number of independent instructions found at compile-time to halt at the first memory instruction, we chose to take an innovative hardware approach, which is described shortly. With respect to memory stores, we assume the use of a write-back buffer which can be serviced even while the memory subsystem is resolving a memory load, as is common practice in modern embedded processors.

The essential issue is that there may be a cache miss on a given load instruction at run-time, but then again, there may not be. Instead of always assuming the worst case, as is necessary at compile-time, we can leverage this run-time behavior to overcome the issue. We propose the addition of a small circular load instruction buffer (loadBuf). The purpose of this buffer is to accumulate the effective address, destination register, and memory-specific opcode for a load instruction when the memory subsystem is occupied. For example, if one were to have two back-to-back load instructions, the first instruction could hit or could miss. If it hits, the next instruction can execute normally. On the other hand, if the first instruction misses, we allow the second instruction to be executed in the pipeline and captured in this new load buffer. Immediately after the first instruction completes, the pipeline will inject the buffered memory signals into the memory subsystem and begin the second load. The structure of this load buffer is shown in Figure 4.9. We calculated the entry width to be 41-bits, to allow for the typical 32-bit address, 5-bit register number (either integer or floating-point), and 4-bit opcode (which can handle up to 15 types of memory load instructions). We chose eight entries to correspond to the nine additional increment positions in the memTable (since one of those positions is for the initial load instruction). Since this buffer is circular and has eight entries, we also need a 3-bit pointer and 3 bits to keep track of how full the buffer is. Thus, this structure accounts for a total storage increase of 334 bits. This structure can be integrated into the memory subsystem to easily capture subsequent load instructions coming into the subsystem.
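A corresponding sketch of the loadBuf organization, using the widths given above, is shown below; again the field names are illustrative only.

    /* Sketch of the circular load buffer: 8 x 41 bits of entries plus a
     * 3-bit pointer and 3 bits of occupancy state, 334 bits in total. */
    #include <stdint.h>

    struct loadbuf_entry {
        uint64_t eff_addr : 32;  /* effective address of the buffered load          */
        uint64_t dest_reg : 5;   /* destination register (integer or floating-point) */
        uint64_t opcode   : 4;   /* which of up to 15 memory load instruction types  */
    };                           /* 41 bits per entry                                */

    struct loadbuf {
        struct loadbuf_entry entry[8];
        uint8_t head      : 3;   /* circular pointer                                 */
        uint8_t occupancy : 3;   /* how full the buffer is                           */
    };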

Figure 4.9: Load Instruction Buffer (loadBuf)

Lastly, we need a register (counterReg) to keep track of the number of instructions executed and an increment register (incrReg) to store the additional increment values that may apply in the basic block. The counterReg lets us know if we have exhausted the number of guaranteed independent instructions and thus must stall if the memory access is not complete. This counterReg is 7-bits, which allows the initial 5-bit count plus the possibility of nine additional 3-bit increases without overflow. The incrReg is a shift register that shifts out its upper 3-bits and adds them to the counterReg whenever a memory access completes.

The basic idea at run-time is to compare the PC at the memory stage with the entries in the memTable. If there is a match, the corresponding count number and additional increments stored in the memTable are loaded into the counterReg and the incrReg, respectively. If the memory access is a miss, execution continues and the counterReg is decremented until it becomes zero. At that point, the pipeline must be stalled. If the cache miss resolves before the counterReg expires, the incrReg's upper 3-bits are shifted out and added to the counterReg. This increase to the counterReg indicates the number of additional instructions that can now be executed since the previously-stalled memory reference completed. Additional logic exists to account for buffered loads. The full explanation of the run-time behavioral algorithm is provided in Figure 4.10.

An important observation is that this implementation will never result in worse cycle throughput than the original, non-optimized architecture. If one were to assume 100% cache hit rates, this implementation behaves identically to the baseline architecture with regard to instruction throughput. Only in the event of cache misses does this implementation allow additional instructions to be executed in otherwise dead cycles during the memory stall. Furthermore, it is necessary to utilize a run-time control for the algorithm, since having multi-level caches in the architecture can lead to dynamic miss penalties that cannot be assumed or anticipated at compile-time alone.
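The register bookkeeping itself is compact enough to capture in a few lines of C; the model below only illustrates the shift-and-add behavior described above (Figure 4.10 gives the complete run-time algorithm), and the slot ordering within incrReg is an assumption of this sketch.

    /* Illustrative software model of counterReg and incrReg. */
    #include <stdint.h>

    static uint8_t  counterReg;   /* 7 bits used: guaranteed-independent instructions left */
    static uint32_t incrReg;      /* 27 bits used: nine 3-bit increments, first slot in the
                                     most-significant 3 bits of the field                   */

    /* Memory stage: the PC matched a memTable entry. */
    static void on_memtable_hit(uint8_t init_cnt, uint32_t increments) {
        counterReg = init_cnt;
        incrReg    = increments;
    }

    /* A pending memory access has completed: shift out the top 3-bit slot. */
    static void on_load_complete(void) {
        counterReg += (incrReg >> 24) & 0x7;
        incrReg     = (incrReg << 3) & 0x07FFFFFF;   /* keep 27 bits */
    }

    /* Each instruction issued while a miss is outstanding consumes one count;
     * a zero count means the pipeline must stall. */
    static int may_issue_during_miss(void) {
        if (counterReg == 0) return 0;   /* stall */
        counterReg--;
        return 1;
    }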

4.3 Experimental Results

In order to assess the benefit from this proposed architectural design, we utilized the SimpleScalar toolset [76]. We chose a representative mobile configuration, having a 256-set, direct-mapped L1 data cache with a 32-byte line size. This processor model utilized a typical in-order simulation engine, with dedicated L1 caches for data and instruction memory, backed by a unified L2 cache.

In memory (M) stage:

if( PC matches an entry in memTable ) {
    load memTable entry's count into counterReg
    load memTable entry's increment into incrReg
}

if( cacheMissSignal ) {
    if( counterReg > 0 ) {
        if( M stage instruction was load ) {
            latch memory instruction information inside subsystem
                (e.g. effective address and dest. reg)
            kill memory instruction (e.g. inject nop as result to WB stage)
        }
        if( EX stage instruction was a load ) {
            if( loadBuf is full ) {
                stall pipeline
            } else {
                push instruction to loadBuf and inject nop
            }
        }
    } else {
        stall pipeline (same behavior as default)
    }
} else {
    if( loadInst just completed -- stall or no stall ) {
        if( M stage instr != instruction that just completed ) {
            create a bubble in M stage (e.g. only shift M & WB stages in pipeline)
            place memory data and dest reg as input to pipeline regs
        }
        take upper 3-bits of incrReg and add to counterReg
        left shift incrReg by 3-bits
    }
    if( loadBuf not empty ) {
        create a bubble in EX stage
        insert next load instruction as input to M stage
    }
}

if( counterReg > 0 ) { decrement counterReg }

Figure 4.10: Run-Time Independent Instruction Execution Behavioral Algorithm

Twelve representative benchmark programs from the SPEC CPU2000 suite [77], the MediaBench video suite [78], and the MiBench telecomm suite [79] are used. A listing of these benchmarks and their respective descriptions is provided in Table 4.1.

Table 4.1: Description of Benchmarks

Benchmark   Description
bzip2       Memory-based compression algorithm
crafty      High-performance chess playing game
crc32       32-bit CRC framing hash checksum
fft         Discrete fast Fourier transform algorithm
gcc         GNU C compiler
gsm         Telecomm speech transcoder compression algorithm
gzip        LZ77 compression algorithm
h264dec     H.264 block-oriented video decoder
mpeg4enc    MPEG-4 discrete cosine transform video encoder
parser      Dictionary-based word processing
twolf       CAD place-and-route simulation
vpr         FPGA circuit placement and routing

Figure 4.11: Success of Filling Stalls With Independent Instructions

Figure 4.11 shows how well our algorithm was able to identify independent instructions and use them to fill in the delays caused by memory stalls. As one can see, a large portion of the stalled cycles can be utilized with our implementation. On average, 64.02% of the dead stall cycles could be recouped to run useful instructions, which is rather substantial.

Given this success in mitigating wasted stall cycles, it is also important to observe the resulting overall application performance impact. To begin with, it is important to look at the baseline cache miss rates. Since the goal of this proposed architecture is to mitigate cache-miss-induced pipeline stalls, the actual miss rates encountered will directly impact the potential global benefits we hope to achieve. Simply absolving the penalty from a cache miss may not yield any tangible improvement for the overall application if there are only a negligible number of cache misses to optimize. Fortunately, a large number of cache misses still occur within complex, data-intensive applications. Figure 4.12 provides the L1 cache miss rates observed for the various benchmarks we examined, having an average miss rate of 6.48%. Using this information and the success in filling stalled cycles described earlier, we can now compute the actual overall execution improvement for the application.

Figure 4.12: L1 Cache Miss Rates

Table 4.2 provides the total number of execution cycles, the number of dead execution cycles from cache miss stalls, and the percentage of execution overhead due to memory stalls for the baseline architecture. Table 4.3 provides the total number of execution cycles and the number of dead cycles after our architecture allowed independent instructions to execute during memory stalls, along with the associated percentage of execution overhead due to memory stalls and the percentage improvement over the baseline in terms of overall run-time execution. In addition, Figure 4.13 graphically shows the percentage improvement in run-time execution behavior using our implementation. The average execution time improvement across the benchmarks for our proposed implementation is 11.27%.

Table 4.2: Baseline Execution Run-Time Inefficiency

Benchmark   Total Cycles [10^7]   Dead Cycles [10^7]   Inefficiency [%]
bzip2       23,265.90             3,708.52             15.94
crafty      3,992.23              1,270.59             31.83
crc32       55.15                 1.81                 3.29
fft         67.30                 4.34                 6.45
gcc         255.01                53.31                20.91
gsm         65.32                 1.66                 2.54
gzip        12,562.51             1,670.36             13.30
h264dec     1,717.58              388.47               22.62
mpeg4enc    1,161.36              107.22               9.23
parser      1,750.59              407.27               23.26
twolf       1,916.39              596.40               31.12
vpr         1,567.28              496.37               31.67

Table 4.3: Overall Execution Run-Time Improvement Over Baseline (Proposed Design)

Benchmark   Total Cycles [10^7]   Dead Cycles [10^7]   Inefficiency [%]   Improvement [%]
bzip2       20,360.27             817.21               4.01               12.43
crafty      3,040.86              316.45               10.41              23.91
crc32       54.19                 0.91                 1.67               1.65
fft         63.75                 0.80                 1.26               5.26
gcc         219.46                17.83                8.12               13.92
gsm         64.49                 0.81                 1.25               1.30
gzip        11,293.27             399.37               3.54               10.12
h264dec     1,478.33              146.97               9.94               14.07
mpeg4enc    1,091.96              39.45                3.61               5.83
parser      1,507.21              163.41               10.84              13.93
twolf       1,715.91              395.36               23.04              10.50
vpr         1,217.55              146.29               12.02              22.34

In terms of area impact, our proposed augmentations to the hardware are quite minimal. Assuming a 45nm process technology, the additional storage elements and associated logic account for an area impact of approximately 0.0211mm2.

Figure 4.13: Overall Run-Time Execution Improvement

Comparing this to the ARM1156T2(F)-S processor, whose die size measures 1.25mm2 with the same cache configuration used in simulation, one can see that the area overhead is nominal; the additional hardware equates to an area increase of about 1.69%.

Since only a small amount of additional hardware is used, the power efficiency impact of the proposed technique is also quite small. Overall, the proposed structures account for approximately 2KB of additional storage elements, along with some necessary routing signals and muxing. Since our implementation does not rely on creating a central register/operand storage area, we do not suffer the routing and muxing overhead that grows exponentially when dealing with reorder buffers and instruction queues. Furthermore, indexing into our hardware structure only occurs during load instructions, as opposed to every single instruction. Thus, when compared to the complexity of a fully dynamic instruction reordering processor, our implementation is far more efficient. According to [80], a typical out-of-order processor's ROB, register renaming table, and instruction queue on average consume about 27%, 13.5%, and 26% of the total energy usage of the processor, respectively. This combines for a total energy consumption of about 67% of the total processor, a figure that is not feasible in the power-conscious mobile embedded domain. On the other hand, the energy consumption of our proposal is about 0.6% of the total energy footprint for a comparable in-order processor, using a CACTI [81] approximation as shown in Figure 4.14. Therefore, our proposed technique can deliver most of the same benefits with regard to resolving memory stalls, but at a fraction of the energy consumption and area utilization of an out-of-order processor.

Figure 4.14: Overall Energy Consumption Contribution

4.4 Conclusions

A novel architecture has been presented for leveraging static compile-time analysis information and hiding the impact of memory stalls by configuring the hardware to allow independent instructions to continue executing during a memory stall at run-time. As shown, memory instructions account for a large proportion of the instructions within an application. Although much work has gone into mitigating cache misses, processors still encounter a substantial amount of overhead when misses do occur. The "Little" processors in the mobile heterogeneous topology opt not to use out-of-order execution, since it is prohibitively expensive in terms of area and power utilization. In particular, these simpler in-order processors are prone to memory stalls stemming from data cache misses. By leveraging both compile-time and run-time information, an architecture capable of delivering most of the benefits of out-of-order execution with regard to memory pipeline stalls is proposed, but at far lower cost. The key principle is the strengthening of the interaction between the compiler and the underlying hardware microarchitecture, allowing information not readily available during run-time (e.g. global data interdependencies) to be gathered and conveyed to the hardware by the compiler.

The achievement of these goals has been confirmed by extensive experimental results. A significant reduction in the number of dead cycles due to memory stalls has been demonstrated by a representative set of simulation results. By mitigating a large majority of these wasted pipeline cycles during cache misses, we are able to realize substantial improvements in overall program execution time. The proposed technique has significant implications for mobile processors, such as smartphones and tablets, as it yields improvements to execution time while minimally affecting power consumption. Thus, we can deliver faster performance without negatively affecting battery life.

Within the framework of real-world heterogeneous mobile processor topologies, it is essential for the smaller, in-order "Little" processor cluster to provide ultra-low power while maintaining respectable performance. The proposed architecture fulfills these requirements and enables these in-order processors to continue to mature and handle exceedingly complex and aggressive applications, while maintaining battery life.

Chapter 4, in part, is a reprint of the material as it appears in G. Bournoutian and A. Orailoglu, "Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions", Proceedings of the 2009 ACM International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2009; and in G. Bournoutian and A. Orailoglu, "Reducing impact of cache miss stalls in embedded systems by extracting guaranteed independent instructions", Journal on Design Automation for Embedded Systems (DAFES), 2010. The dissertation author was the primary investigator and author of these papers.

Chapter 5

Reducing Branch Target Buffer Power

The prevalence and versatility of mobile processors have grown significantly over the last few years. It is common for these mobile processors to have deep instruction pipelines with complex branch prediction, as is evident in current flagship SoCs such as the Apple A7, Samsung Exynos 5 Octa, and Qualcomm Snapdragon 810. For example, the ARM Cortex-A15 processor, which is commonly incorporated in high-performance smartphones and tablets, employs an out-of-order architecture with a pipeline length of between 15 and 24 stages [4]. As can be expected, robust branch prediction is necessary in order to ensure control flow changes do not degrade performance by stalling the pipeline. It is the job of the branch prediction hardware to provide an early prediction on whether a control flow instruction is taken or not, and if it is taken, to also provide the target address from which subsequent instructions should be fetched. A direction predictor is used to determine whether a control instruction is taken or not, and a branch target buffer (BTB) is used to identify control instructions and the target address to which each of them can redirect the control flow. However, such prediction hardware comes at a price in terms of power dissipation. The branch prediction hardware can consume as much as 10% of the total processor power, with the BTB accounting for 88% of this power consumption [15].

From the application perspective, complexity is also growing within the mobile domain. A decade ago, mobile phone applications consisted of a few pre-bundled, hand-optimized programs specifically designed for a given device. Today, there is a vast quantity of applications available within mobile marketplaces; Apple's App Store, for example, had over 1 million applications available for download at the start of 2014 [82]. Furthermore, mobile applications are now written in high-level, object-oriented programming languages such as Objective-C, C++, or Java. These applications tend to be highly interactive, providing visual, auditory, and haptic feedback, as well as leveraging inputs from numerous sensors such as touch, location, near-field communication, and even eye tracking. To simplify the creation of these interactive applications, mobile operating systems provide foundation libraries to manipulate the device's peripherals and provide common user-interface frameworks. Given this, much of the instruction code that will be executing within the mobile processor will be coming from these shared libraries and helper routines, in addition to the application's own code.

One powerful feature enabled by using a high-level, object-oriented language is dynamic dispatch. Dynamic dispatch enables robust and extensible application coding, allowing multiple classes to contain different implementations of the same method, with actual method selection occurring dynamically based on the particular run-time instance it is called upon. Unfortunately, the presence of dynamic dispatch can introduce significant overheads during method calls, which can directly impact execution time [56]. Furthermore, since these applications rely heavily on shared libraries and helper routines to handle common actions and styles, the overall quantity of method calls is higher than in typical desktop-based programs, further exacerbating the overhead of dynamic dispatch. It has also been observed that smartphone and tablet applications suffer from increased code size and sparseness [12], which often leads to higher levels of branching due to lack of instruction locality. As this chapter will demonstrate, dynamic dispatch exacts a heavy toll due to the numerous control flow changes that arise from the large quantity of polymorphic method call sites.

To make matters worse, there is a large diversity of hardware and operating systems a given application may run on. Often a single instance of the application is compiled and uploaded to the marketplace, and is then able to be downloaded and run on many different devices and operating system versions. When the application is loaded by the operating system, dynamic linking and variable offset tables are used to hook into the device-specific library code. Furthermore, some methods the application may call will be stubbed out for a particular model of hardware. For example, if the application calls a method to query the compass direction, but the specific device being utilized does not physically have a compass, then the call will simply return without executing the actual sensor processing instructions. As one can see, attempting to optimize these polymorphic method calls centrally before consumers download the application onto a given phone is hindered by the variability and diversity of the final execution platform.

This chapter proposes a novel approach to deal with the unique characteristics of mobile processors and provide a methodology to enable significant reductions in BTB power consumption. Using a mobile ecosystem driven approach, analysis and customization of a given application and its shared libraries is performed to enable application-specific microarchitecture optimizations.
While an application is still compiled generically at the time it is added to the mobile application marketplace, critical information about the control flow structure of the application is also garnered and encapsulated. Once the application is downloaded onto a device, additional analysis is performed incorporating information about the device-specific hardware, operating system version, and foundation libraries. By enabling a comprehensive analysis of the entire program code, static control flow information is extracted and provided to the hardware microarchitecture in order to enable a power-efficient hybrid branch target buffer, greatly reducing power dissipation while maintaining or even improving performance. The implementation of this methodology will be presented, and experimental data taken over a general sample of real-world smartphone applications is provided to show the benefits of such an approach.

5.1 Motivation

It is common knowledge that control-altering instructions such as branches, jumps, and function calls can severely degrade the performance of instruction pipelines. Fundamentally, the throughput benefit of pipelining hinges on the ability to keep feeding new instructions into the pipeline. However, the nature of a control-altering instruction is to inform the pipeline to begin fetching instructions from a new target location. The ability to fully resolve the control-altering instruction and determine from where to begin fetching typically occurs late in the pipeline, degrading the pipeline's throughput. In order to surmount this limitation, branch prediction hardware is commonly employed.

The goal of branch prediction is to provide an early decision on whether a control-altering event will occur and what the new target instruction address will be. All branch prediction hardware essentially provides two pieces of information: the branch direction and the target address. The direction is often predicted using correlation between previous branch outcomes, including both local and global pattern history. On the other hand, the target address is typically provided by a Branch Target Buffer (BTB). A typical branch prediction architecture is shown in Figure 5.1.

Figure 5.1: Typical Branch Prediction Architecture

The BTB is a cache-like structure which is indexed using the same program counter (PC) that is used to fetch the current instruction [83]. A hit in the BTB indicates that the instruction to be executed is a control-altering instruction. The BTB entry also includes the destination address and a type field that describes the kind of control-altering instruction (branch, jump, function call, unconditional/conditional, etc.). For conditional branches, the branch direction predictor is consulted to determine whether or not the condition is true. If the condition is met or the instruction type is unconditional, the target address provided by the BTB is used as the next PC address.

As one can see, the BTB helps identify branch instructions before the instruction is even decoded. This early branch identification capability becomes even more paramount as the number of fetch pipeline stages grows. Without the BTB's early fetch-time identification, even a correctly predicted branch would need to stall the pipeline for as many cycles as there are fetch stages, to allow the decoding and identification of control-altering instructions. Given this, the BTB has to be queried during every instruction's first fetch cycle. Since the mapping between the PC and the branch instruction needs to be precise, the lower bits of the PC are used to index into the BTB and the upper bits are used as a tag during lookup. In this respect, the BTB architecture entirely resembles a standard cache architecture, where both the tag and data elements are looked up in order to find the correct data or indicate a miss. Thus, for every instruction executed within the processor, a large amount of power is dissipated by having to query the BTB to determine whether the instruction is control-altering and what its target address is. As mentioned earlier, the branch prediction hardware can consume as much as 10% of the total processor power, and the BTB in particular accounts for 88% of this power consumption [15].

To greatly mitigate such power overhead from the BTB, an Application Customizable Branch Target Buffer (ACBTB) architecture was proposed [64]. In this architecture, static code analysis is used to determine the number of instructions between a given control instruction and the next control instruction (both for the taken and not-taken paths). A small hardware table is constructed that holds the target address and the distances to the next control instruction for the taken and not-taken paths, as well as the index into the ACBTB for that next control instruction. Thus, when a branch is encountered, the system knows exactly how many instructions will be executed before the next branch. In this interval, no further lookups of the ACBTB take place while non-control instructions are being executed. Once the interval completes, the ACBTB is directly indexed based on the index value of the prior branch, avoiding the overheads of the cache-like lookup of a standard BTB.

However, two major limitations exist in the ACBTB proposal. First, all control instructions must be statically known so the ACBTB table can be fully populated, yet the presence of the distributed mobile software ecosystem and the large amount of dynamic dispatch and indirect branching precludes the static retrieval of the complete control-flow graph. Second, it has been observed that mobile applications can commonly contain tens of thousands of control instructions, and storing all of these instructions within a fixed-size hardware structure like the ACBTB would have a prohibitive area cost. The confluence of these two limitations prevents the standard ACBTB architecture from being utilized in high-performance mobile processors.

Unlike regular function calls, where control flow unconditionally moves to the target function label, polymorphic method calls using dynamic dispatch are quite complicated. Indeed, during runtime resolution of such polymorphic calls, the current class and all superclass metadata are searched to see if the desired method exists within that class's method list. Once the target method is identified at runtime, indirect branching is used to branch to a register that contains the target method's entry address. Furthermore, a large number of the functions called by a mobile application will be defined in the device-specific foundation libraries. Since mobile applications are compiled separately from these libraries and uploaded to a centralized marketplace, static control flow analysis to build the ACBTB is precluded. It becomes apparent that a novel solution that specifically tackles the unique characteristics of the mobile ecosystem is necessary. Ideally, we wish to resolve the quagmire caused by the distributed software environment and inherent polymorphism in today's smartphones and tablets, enabling us to employ the ACBTB solution to save power.
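To make the per-fetch cost that motivates this chapter concrete, the following C sketch models the conventional, cache-like BTB probe described above; the table size and field widths are illustrative, not those of any particular processor.

    /* Sketch of a conventional direct-mapped BTB lookup, performed on
     * every instruction fetch. Sizes and field names are illustrative. */
    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_SETS 512            /* illustrative; must be a power of two */

    struct btb_entry {
        uint32_t tag;               /* upper PC bits                          */
        uint32_t target;            /* predicted destination address          */
        uint8_t  type;              /* branch / jump / call, (un)conditional  */
        bool     valid;
    };

    static struct btb_entry btb[BTB_SETS];

    static bool btb_lookup(uint32_t pc, uint32_t *target, uint8_t *type) {
        uint32_t index = (pc >> 2) & (BTB_SETS - 1);  /* low PC bits (word-aligned) */
        uint32_t tag   = pc >> 11;                    /* remaining upper bits       */
        if (btb[index].valid && btb[index].tag == tag) {
            *target = btb[index].target;   /* control-altering: redirect fetch */
            *type   = btb[index].type;
            return true;
        }
        return false;                      /* not control-altering: fall through to PC + 4 */
    }

Both the tag and data fields are read on every fetch in this conventional organization, which is precisely the per-instruction energy an ACBTB-style design avoids by consulting its table only when a control-altering instruction is reached.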

5.2 Methodology

One of the primary challenges to statically analyzing application control flow on mobile processors is the vast diversity of hardware, operating system versions, and consumer applications. Mobile applications are developed in a loosely general-purpose fashion and the compiled code is uploaded into a central marketplace (e.g. the Apple App Store or Windows Phone Store). This application is then downloaded onto numerous devices, each of which may vary in terms of hardware functionality and foundation library routines. This one-to-many relationship is ideal from a scalability and validation perspective, but has drawbacks in terms of what compile-time analysis and optimization can be performed.

Figure 5.2: High-Level On-Device Optimization Framework

In order to accomplish the goal of reducing mobile processor power consumption due to BTB accesses, a novel on-device branch target buffer power optimization framework is proposed. When applications are compiled using the mobile ecosystem toolchain, they are also analyzed in order to extract high-level application metadata, including control flow, class hierarchies, and method implementations, as described in Chapter 3 and [84]. This metadata is included in the application bundle that is uploaded onto the marketplace. Similarly, each new operating system release that is compiled will also analyze and annotate high-level control flow, class hierarchy, and method implementation attributes, including the information in the OS system updates. Once an application is downloaded onto a given device, an on-device code analysis and optional transformation process occurs. Using the metadata from both the OS and the application, a complete control flow graph and object-oriented hierarchy view is available. Optimizations and code transformations can be applied, taking into account device-specific attributes such as available peripherals and sensors. The result of this optimization process is a new application binary that can then be run on the device as would normally occur. A high-level overview of this framework is shown in Figure 5.2.

Once this on-device static analysis and optional code transformation is complete, the ACBTB architecture described in [64] can be enabled. By analyzing the control flow graph of the entire program, the target addresses of all statically resolved branches will be known. Additionally, metadata can be added to the virtual function tables generated by the compiler that are used in dynamic dispatch. Using this metadata, all indirect branches caused by polymorphism can also be resolved in a just-in-time fashion. The following subsections provide a detailed explanation of the steps involved in this framework.

5.2.1 Extracting Source Code Metadata

The first step of this framework is to capture high-level information related to the program control flow and object-oriented hierarchies that would otherwise be unavailable by the time the application arrives on the mobile device. When a developer creates an application and compiles it for inclusion in the mobile marketplace, the critical information related to all control flow changes, class hierarchies, and message calls is captured.

At the time the application is compiled, information on the complete control flow graph and class hierarchy is not present, since much of the functionality comes from foundation code. So, in order to allow control flow and class hierarchy analysis to occur at a later time, the application compiler toolchain is augmented to extract the known control flow and class hierarchy information and store it as part of the application deliverable. In a similar vein, the same metadata is captured for the vendor-provided operating system foundation library classes and stored for later reference. Once the application is downloaded onto the phone or tablet, the combined metadata from the application and operating system will provide the complete view of all class hierarchies. This process was described in Chapter 3, which provides additional details on extracting metadata and encapsulating it within the application binary.

5.2.2 On-Device Analysis and Optimization

Once the particular application and foundation library are both present on a given physical device, full-program interprocedural static analysis and optimizations can be performed. As described in Chapter 3, unused library code is pruned away and only those library routines used by the application are statically linked into the new application binary. While the resulting binary will be somewhat larger than the original application code, instruction code as a whole is quite small compared to other application data such as bitmaps, sound files, and databases, so the impact on overall storage space is trivial. On the other hand, the benefit is a much more compact instruction space that will reduce sparseness and enable efficient utilization of the ACBTB hardware structure.

The goal of this stage is to generate a static control flow graph of the entire program. For every statically resolvable control-flow instruction (branch, jump, function call, etc.), we can determine both the taken target instruction address and the not-taken target instruction address. For those control-flow instructions that rely on an indirect target address (typically due to polymorphism), only the not-taken target is known a priori. For the taken case, the destination address will only be known with certainty at run time when the branch is executed. Thus, a run-time mechanism will be required to help resolve these cases of indeterminism in our proposed architecture.
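One possible encoding of the per-branch information produced by this stage is sketched below in C; the struct and field names are assumptions made for illustration rather than the exact metadata format used by the framework.

#include <stdbool.h>
#include <stdint.h>

/* One record per statically identified control-altering instruction. */
typedef struct cfg_branch {
    uint32_t pc;                /* address of the branch/jump/call itself   */
    uint32_t not_taken_target;  /* fall-through address, always known       */
    uint32_t taken_target;      /* known statically unless the branch is    */
                                /* indirect (e.g. dynamic dispatch)         */
    bool     taken_is_indirect; /* taken target resolved only at run time   */
} cfg_branch_t;

Records of this form are sufficient both to build the ACBTB tables described next and to flag the indirect branches that must be handled by the run-time mechanism.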

5.2.3 Generation of ACBTB Tables

In order to reduce power dissipation from the BTB, the ACBTB architecture [64] is leveraged, shown in Figure 5.3. The ACBTB table contains all the relevant application information regarding control-flow structure. This table is directly indexed and contains an entry for each control-altering instruction. The NT_D field specifies the number of instructions in the not-taken branch path to be executed before a subsequent control-altering instruction is encountered. The NT_I field specifies the index of the ACBTB entry that contains the information about the subsequent control-altering instruction that will next appear in the not-taken path. Similarly, the T_D and T_I fields represent the distance and index of the subsequent control-altering instruction in the taken branch path.

Figure 5.3: Application Customizable BTB (ACBTB) Architecture

The TA field specifies the target address that should be used to update the PC if the branch is taken. Lastly, the Type field contains information about the control-altering instruction which helps facilitate the branch prediction process.

After a branch is executed, the IND index register is updated to store the value of the NT_I or T_I field, depending on whether the branch was not taken or taken, respectively. The IND index register will be used to index into the ACBTB when the next branch instruction is encountered. Similarly, the CNT counter register is loaded with the value of the NT_D or T_D field based on the branch direction. In subsequent execution cycles, the only activity performed by the architecture is to decrement the value of CNT with each fetched instruction. To support multiple-issue, superscalar processors, the value by which CNT is decremented will correspond to the number of instructions fetched during each cycle. Once the CNT counter reaches zero, a control-altering instruction is about to be fetched, which triggers an access to the ACBTB using the IND index register to read the attributes for that control-altering instruction.
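The following C sketch models the IND/CNT bookkeeping described above in software. It is a behavioral illustration of the mechanism only (the entry layout and function names are assumed for this sketch), not a description of the actual hardware implementation.

#include <stdbool.h>
#include <stdint.h>

/* One ACBTB entry, mirroring the fields described in the text. */
typedef struct acbtb_entry {
    uint16_t nt_d, nt_i;   /* distance/index along the not-taken path */
    uint16_t t_d,  t_i;    /* distance/index along the taken path     */
    uint32_t ta;           /* target address if the branch is taken   */
    uint8_t  type;         /* branch kind, used by the predictor      */
} acbtb_entry_t;

static acbtb_entry_t acbtb[1024];
static uint16_t ind;       /* IND index register   */
static int32_t  cnt;       /* CNT counter register */

/* Called once per fetch group; 'fetched' is the number of instructions
 * fetched this cycle (supporting superscalar fetch). Returns true when
 * the next instruction to fetch is a control-altering instruction. */
bool acbtb_advance(unsigned fetched)
{
    cnt -= (int32_t)fetched;
    return cnt <= 0;
}

/* Called when the pending control-altering instruction resolves. Updates
 * IND and CNT from the current entry and reports the predicted target
 * (zero is returned for the not-taken, fall-through case). */
uint32_t acbtb_resolve(bool taken)
{
    const acbtb_entry_t *e = &acbtb[ind];
    ind = taken ? e->t_i : e->nt_i;   /* an index of zero marks a hotspot exit */
    cnt = taken ? e->t_d : e->nt_d;
    return taken ? e->ta : 0;
}

In hardware, acbtb_advance corresponds to the per-cycle counter decrement, which is the only activity required while no branch is pending.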

Figure 5.4: Overview of Compiler and Hardware Interaction

Once the software analysis and optimization process described in the previous section is completed, the metadata describing the ACBTB table structure will be provided to the underlying hardware microarchitecture, as shown in Figure 5.4. When the application is loaded by the operating system, the metadata will also be populated into a memory location that the hardware can read from to populate the ACBTB table. Since the ACBTB contains a limited amount of storage capacity for describing control-altering instructions, not every branch instruction can be stored concurrently. In order to account for this, a simple hotspot analysis will identify partitions of the program flow where all control-altering instructions within that partition can fit within the ACBTB. Multiple instances of the ACBTB memory will be conveyed in the metadata, and instructions will be added at the entry points of each hotspot to load the ACBTB table from memory. To identify when the control flow reaches an exit point within an ACBTB hotspot, the current ACBTB entry will have the corresponding NT_I or T_I field set to the special index value of zero.

Figure 5.5 shows the proposed hybrid architecture, with the ACBTB added to the prediction path. When the ACBTB is enabled within a hotspot, it will provide both identification of control-altering instructions and the target address for taken branches. During this time, the ACBTB will cause the standard BTB to be clock- and voltage-gated, completely disabling the structure and saving both dynamic and leakage power. When the control flow exits an ACBTB-enabled hotspot, the ACBTB will disable itself and re-enable the standard BTB structure. Once the control flow encounters another hotspot, the ACBTB table will be populated from memory and re-enabled. One additional optimization is to detect if the same ACBTB hotspot is re-entered without any intervening hotspot overwriting the ACBTB table, in which case the loading from memory can be skipped since the data is already present.

5.2.4 Handling Mobile Software Polymorphism

The optimality of the ACBTB architecture heavily depends on the amount of statically known control flow. Unfortunately, mobile applications often contain a large amount of polymorphism that makes determining the actual destination of a branch at compile time challenging. In order to increase the number of statically known control flow instructions and thus better leverage the ACBTB, an optimization phase can be executed to replace polymorphic call sites with static function calls whenever possible.

Figure 5.5: Proposed BTB + ACBTB Architecture

This optimization process is described in detail in Chapter 3, and is able to replace an average of 77% of dynamic dispatch sites with purely static method calls. As an added benefit, such optimization also reduces the number of instructions that must be executed to make a function call, which improves performance.

The remaining polymorphic calls after this optimization will rely on indirect (or register-based) control flow, where the taken target address is dynamically determined and not known at compile time. In order to handle such cases, the IND and CNT registers need to be dynamically set at the point when the target destination is determined. The challenge is to determine the appropriate values of IND and CNT for all possible destinations. Fortunately, the run-time environment present in mobile devices can help overcome this challenge. It is typical for the mobile operating system to provide a centralized library routine to handle polymorphic dynamic dispatch. When the high-level language calls a method in a polymorphic fashion, the compiler will generate the corresponding call to the library routine, which will resolve the polymorphic method call and then branch to that method by performing an indirect branch on a register populated with the method's address.

For example, the iOS _objc_msgSend routine provides a centralized location where all polymorphic method calls are resolved [68]. This code will identify the appropriate method implementation, populate the ip intra-procedure-call scratch register with the method's address, and then execute an unconditional register-based branch on that register (bx ip). Using this observation, we can modify the compiler logic that generates the virtual function address tables used during polymorphism to also include the ACBTB T_I and T_D values for each method defined. The T_D will simply be the number of instructions between the start of the method and the first control-altering instruction, while the T_I will be the index of that control-altering instruction within the ACBTB. Both of these are known statically for each individual method. Using this information, the operating system library code for polymorphic calls can be modified to enable the ACBTB. Before each indirect branch instruction (bx), when the target ip register is read from the virtual function tables, the corresponding T_I and T_D values will also be loaded into the IND and CNT registers, respectively.
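A minimal C sketch of this virtual-table augmentation is shown below, assuming a simplified slot layout and software-visible hooks for writing the IND and CNT registers. The real mechanism lives inside the vendor dispatch routine and uses the actual register interface, so every name here is illustrative.

#include <stdint.h>

/* Each virtual-table slot carries the method address plus the ACBTB
 * priming values computed at compile time for that method. */
typedef struct vtable_slot {
    void   (*imp)(void);   /* method entry point (loaded into ip)            */
    uint16_t t_i;          /* ACBTB index of the method's first branch       */
    uint16_t t_d;          /* instructions from method entry to that branch  */
} vtable_slot_t;

/* Assumed hooks that write the architectural IND and CNT registers. */
extern void acbtb_set_ind(uint16_t ind);
extern void acbtb_set_cnt(uint16_t cnt);

/* Simplified model of the dispatch tail: prime the ACBTB state, then
 * perform the indirect call (the hardware equivalent of 'bx ip'). */
static inline void dispatch(const vtable_slot_t *slot)
{
    acbtb_set_ind(slot->t_i);
    acbtb_set_cnt(slot->t_d);
    slot->imp();
}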

5.3 Experimental Results

In order to assess the benefits of this proposed framework, an actual commercial mobile processor is utilized. The target mobile device chosen is an iPhone 4S (dual ARM Cortex-A9 800MHz processor, 512MB DRAM) running the iOS 5.1.1 operating system (Darwin Kernel 11.0.0). A set of eight open-source, interactive iOS applications are used to benchmark the optimizations. A listing of these benchmarks and their respective descriptions is provided in Table 5.1. Each application is compiled using version 4.1.1 of Apple's Xcode integrated development environment. Three versions of each compiled application are produced. The first is the baseline, which does not leverage the proposed framework. The second contains the annotated binary code leveraging the ACBTB architecture without any software optimizations. The third is based on first optimizing the software to replace polymorphic call sites with static function calls, and then annotating the binary code to use the ACBTB architecture.

Table 5.1: Description of Benchmark Apps

Benchmark        Description
Bubbsy           Graphical world-based game (uses Cocos2D)
Canabalt         Popular run and jump game
DOOM Classic     3D first person shooter
Gorillas         Turn-based angle shooter (uses Cocos2D)
iLabyrinth       Puzzle navigation game (uses Cocos2D)
Molecules        3D molecule modeling and manipulation
Wikipedia        Online encyclopedia reader
Wolfenstein 3D   3D first person shooter

In order to evaluate the corresponding BTB power improvement for this proposal, all three versions of the application need to be executed on the physical device. This is necessary, as the vast majority of mobile applications are highly interactive and require touchscreen stimuli in order to function realistically. The selected benchmarks are no exception to this characteristic. By running the application directly on the phone and interacting with the application, the dynamic control flow will closely represent typical usage. An additional benefit of running the application code directly on the smartphone is that the correctness of any code transformations is validated, since they are executed directly on the processor.

The GNU gcov test coverage tool is leveraged to instrument executed code and to capture the dynamic execution frequencies of each line of source, as well as how many call and branch instructions occurred. In particular, both the Generate Test Coverage Files and Instrument Program Flow settings are enabled in the Xcode compiler, and the GCOV_PREFIX and GCOV_PREFIX_STRIP environment variables are set to redirect the generated output profiles for all source files into a central path. As a result, when the application is executed on the phone, a corresponding .gcda coverage file is created for every source file. These coverage files are then downloaded from the phone and analyzed within an environment that simulates the instruction trace accessing the BTB and/or ACBTB.

To generate the dynamic instruction profiles for the benchmarks, all three versions of the applications are run natively on the phone for a period of 5 minutes of active and constant usage. As we are interactively using the application, the exact user input and resulting control flow will vary across runs. In order to help reduce outliers and noise, this process of running the application for 5 minutes is repeated 10 times to generate an average for each application.

After collecting and analyzing all the gcov coverage files, a simulation of BTB and ACBTB accesses is conducted, keeping track of access counts, BTB miss rates, and time spent using the power-efficient ACBTB instead of the standard BTB. For the standard BTB, we used a 1024-set, 2-way set-associative configuration. For the ACBTB, a 1024-entry configuration was chosen. We used CACTI [81] and eCACTI [85] to estimate both the dynamic and static power consumption for the standard BTB as well as the ACBTB structures, assuming a 32nm feature size and a leakage temperature of 85°C. Furthermore, we take into account the additional power incurred by the extra logic proposed by the ACBTB architecture, including the control logic. Using this information, we are able to determine the total branch target prediction power improvement across the entire run-time for each of the aforementioned benchmarks, shown in Figure 5.6.
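The power comparison itself amounts to combining the simulated access and cycle counts with the per-access and per-cycle (leakage) energies reported by CACTI/eCACTI. The C sketch below shows one way this bookkeeping could be organized; the structures, function names, and parameters are assumptions for illustration, and the actual energy figures come from the CACTI/eCACTI runs rather than from this code.

#include <stdint.h>

/* Per-structure energy characteristics taken from CACTI/eCACTI output. */
typedef struct energy_model {
    double dynamic_per_access; /* Joules per lookup                      */
    double leakage_per_cycle;  /* Joules per cycle while powered on      */
} energy_model_t;

/* Activity counts produced by replaying the gcov-derived branch trace. */
typedef struct activity {
    uint64_t accesses;         /* lookups performed in this structure    */
    uint64_t powered_cycles;   /* cycles the structure was not gated     */
} activity_t;

static double structure_energy(const energy_model_t *m, const activity_t *a)
{
    return m->dynamic_per_access * (double)a->accesses +
           m->leakage_per_cycle  * (double)a->powered_cycles;
}

/* Relative improvement of the hybrid BTB+ACBTB scheme over the baseline. */
double power_improvement(const energy_model_t *btb,   const activity_t *btb_base,
                         const activity_t *btb_hybrid,
                         const energy_model_t *acbtb, const activity_t *acbtb_hybrid)
{
    double baseline = structure_energy(btb, btb_base);
    double hybrid   = structure_energy(btb, btb_hybrid) +
                      structure_energy(acbtb, acbtb_hybrid);
    return 1.0 - hybrid / baseline;
}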

Figure 5.6: Total Branch Target Prediction Subsystem Power Improvement

As one can see, our proposal without the polymorphic code optimization is able to eliminate a significant amount of the power dissipation due to branch target address prediction. The geometric mean reduction in BTB power usage is 90.2%. Our more aggressive proposal, leveraging polymorphic call optimization in conjunction with the ACBTB, achieves an even greater average reduction of 93.7%. It is important to note that this large power savings is obtained without any loss of performance. In fact, when polymorphic call optimizations are employed, performance will often improve, as fewer instructions need to be executed in order to resolve dynamic dispatch calls.

Figure 5.7: Percent of Execution Cycles Where Standard BTB is Turned Off

Another analysis that can be done is to observe how much of the time the standard BTB is still utilized during program execution. Due to the limited size of the ACBTB, whenever the program is not within an ACBTB-enabled hotspot we have to fall back to the original BTB hardware. Figure 5.7 shows the percentage of executed cycles where the standard BTB was turned off (voltage gated). In the case of using the ACBTB without code optimization, on average 95.9% of the time the standard BTB was turned off and did not need to be queried for every single fetched instruction. When polymorphic code optimization was also used, the amount of time the more power-efficient ACBTB was active increased to an impressive 99.1%. As one can see, leveraging our on-device code analysis and optimization framework facilitates even better results from the ACBTB architecture.

Figure 5.8: Target Address Miss-Cycles Reduction

Lastly, one of the additional benefits of utilizing the ACBTB instead of the BTB is that the ACBTB has no conflicts and will always result in a hit (since all branches within the ACBTB are statically known). While standard BTB miss rates are not large, they can still degrade performance by limiting instruction pipeline throughput. By leveraging the ACBTB in program hotspots, the standard BTB encounters only sporadic accesses while outside of hotspots. As a result, the standard BTB storage is less saturated and will likely have fewer conflict misses. Figure 5.8 demonstrates the reduction in target address lookup misses. As one would expect, when the ACBTB is leveraged, the number of branch miss cycles is reduced by an average of 26.4%. When polymorphic method call optimization is used with the ACBTB, the reduction in target address miss cycles becomes 49.0%.

5.4 Conclusions

High-performance mobile processors, such as those used in modern smartphones and tablets, are leveraging increasingly deep and complex pipelines. In order to ameliorate the impact of control flow changes on the pipeline, branch prediction hardware is also becoming standard in these devices. Unfortunately, branch prediction accounts for up to 10% of the power consumed by the processor, and the BTB in particular is the largest consumer. Efforts to utilize static optimization techniques are hindered by the mobile ecosystem. Mobile applications continue to leverage high-level, object-oriented programming languages, relying on polymorphism and dynamic dispatch. Furthermore, mobile operating systems continue to add robust foundation libraries in order to abstract and simplify the creation of complex UI-driven interactive applications. These aspects impede many compile-time optimization techniques.

A novel framework for enabling on-device application analysis and optimization has been presented. While complete information is not present during initial compilation, critical information related to control flow and class method hierarchies is extracted during initial application compilation and conveyed along with the application binary when downloaded onto a device. A post-processing of the application binary can then take place on the device, leveraging the complete knowledge of the entire program space. In this fashion, static optimization techniques like the ACBTB architecture can now be enabled on these mobile devices. The benefits of such an approach have been demonstrated using numerous highly-interactive mobile applications. The proposed framework provides noteworthy reductions in BTB power consumption and has significant implications for mobile processors, especially high-performance, power-sensitive smartphones.

This framework for reducing branch target buffer power consumption leverages all three of the high-level approaches defined earlier in this dissertation in Chapter 1. First, the compiler-device interface is enhanced in order to extract and convey critical metadata onto the device (see Section 1.3.1). Information about the control flow and class hierarchy is combined from the downloaded application and the device-specific operating system foundation libraries in order to create a complete view of the program. Second, the statically-captured control flow metadata is used to dynamically tune the microarchitecture (see Section 1.3.3). A hybrid BTB architecture is proposed, which leverages the control flow metadata and dynamically adapts the branch target address prediction hardware in order to greatly reduce power consumption. Lastly, application-specific code optimizations are performed in order to further improve the hybrid BTB performance (see Section 1.3.2). Using interprocedural optimizations to replace dynamic dispatch sites with static calls increases the amount of statically-known control flow, allowing the hybrid BTB architecture to spend more time in the low-power configuration, yielding even better power reduction results.

Chapter 5, in part, has been submitted for publication as it may appear in G. Bournoutian and A. Orailoglu, "Mobile Ecosystem Driven Customized Low-Power Control Microarchitecture", Proceedings of the 2014 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2014. The dissertation author was the primary investigator and author of this paper.

Chapter 6 Reducing Memory Subsystem Power

In mobile systems, where power and area efficiency are paramount, smaller, less-associative caches are chosen. Earlier researchers realized that these caches are more predisposed to thrashing, and proposed solutions such as the victim cache [32] or dynamically-associative caches [72, 73, 74] to improve cache hit rates. While these approaches can mitigate cache thrashing and result in improved execution speed and some dynamic power reduction, the aggressive, all-day usage model of mobile processors demands much larger reductions in overall processor power to help improve the battery life of these devices. Furthermore, static leakage power is becoming increasingly predominant as feature size continues to be reduced to 28nm and lower, and can be as much as 30-40% of the total power consumed by the processor [6, 7]. Since caches typically account for 30-60% of the total processor area and 20-50% of the processor's power consumption, they are an ideal candidate for improvement to help reduce overall processor power [14].

In this chapter, a novel approach to dealing with cache power in mobile processors is proposed. While the processor normally operates in a baseline configuration, wherein battery life may suffer in order to deliver full execution speed and responsiveness, a user may, on the other hand, desire to sacrifice a negligible amount of that execution speed to help prolong the overall life of the mobile device. The fundamental goal of our approach is that this sacrifice of speed be disproportionately small relative to the amount of power saved, allowing large gains in cache power savings without significantly disrupting execution performance. In particular, it is observed that the L2 (and optional L3) caches exhibit unbalanced access patterns, wherein large portions of the cache may be unused at a given time, while other portions may be more heavily used and subject to thrashing. (While this proposal could also be applied to the L1 cache, its relatively small size and high access rates would greatly diminish possible power improvements; thus, for the purposes of this proposal, the focus will be on the larger L2 cache, which is present in most modern smartphones.) By adapting dynamically to the application's run-time behavior, one can intelligently make modifications to the cache to appreciably reduce static power while minimally affecting miss rates and associated dynamic power consumption. We leverage application-specific information to help tailor the configuration of the hardware to best serve the given application's memory needs. We show the implementation of this architecture and provide experimental data taken over a general sample of complex, real-world applications to show the benefits of such an approach. The simulation results show improvement in overall processor power consumption of approximately 13% to 29%, while incurring a minimal increase in execution time and area.

6.1 Motivation

A typical data-processing algorithm consists of data elements (usually part of an array or matrix) being manipulated within some looping construct. These data elements each effectively map to a predetermined row in the data cache. Unfortunately, different data elements may map to the same row due to the inherent design of caches. In this case, the data elements are said to be in "conflict". This is typically not a large concern if the conflicting data elements are accessed in disjoint algorithmic hot-spots, but if they happen to exist within the same hot-spot, each time one is brought into the cache, the other will be evicted, and this thrashing will continue for the entire hot-spot.

Given complex and data-intensive applications, the probability of multiple cache lines being active within a hot-spot, as well as the probability of those cache lines mapping to the same cache set, increases dramatically. As mentioned, much prior work has already gone into minimizing and avoiding cache conflicts and thrashing in order to improve overall performance and reduce dynamic power usage within the cache subsystem. While we may encounter cache conflicts and thrashing in certain areas within the cache, the rest of the cache may have ample capacity or remain idle for large periods of time.

As mentioned, static leakage power within these caches is also becoming increasingly prohibitive. In particular, the larger secondary (L2) cache contributes the majority of the leakage power compared to the primary (L1) cache. One would like to eliminate as much leakage power as possible, without significantly degrading the normal performance of the memory subsystem. Fortunately, the L2 cache exhibits unbalanced access patterns, since most memory accesses are serviced by the L1 cache. Thus, typically only a sporadic number of memory accesses cascade into the L2 cache. Because of this observation, much of the L2 cache is idle for extended periods of time. We can take advantage of this inactivity and temporarily shut down those idle cache sets to eliminate the associated leakage power. As the application progresses, the locations within the cache that are active or idle may change, and thus the architecture must also take this into account.

Figure 6.1: L2 Cache Access and Miss Rate Distribution for h264dec

To illustrate this observation, Figure 6.1 provides a sorted distribution of L2 cache accesses per cache set for the h264dec video decoder benchmark. As one can see, the number of times a given cache set is accessed varies across a large range. Some cache sets are heavily accessed, while some are very rarely accessed. Furthermore, the figure also provides the miss rate contribution for each cache set. One important correlation is that those cache sets that are rarely accessed also contribute the least to the aggregate miss rate (since they are accessed less frequently).

The goal of this chapter is the dynamic identification of those cache sets that are highly utilized and prone to thrashing, as well as those cache sets that are rarely utilized and are contributing needlessly to leakage power, and to adjust the cache accordingly to help conserve power while also preserving, or even improving, performance. In order to achieve this goal, an efficient hybridization of information from both compile-time and run-time analysis is essential. The compiler can help glean crucial generalized application characteristics that would otherwise be too costly to obtain at run-time. The hardware can then utilize this application-specific information and react more intelligently.

Using the aforementioned observation regarding unequal access patterns within the power-hungry L2 cache, it becomes evident that certain portions of the cache are over-utilized, while other portions are under-utilized. This imbalance can degrade both performance and power, and should be rectified. Ideally, one would like to automatically increase the storage available for crowded locations of the cache, as these locations are more prone to thrashing during heavy utilization within a temporally short period of time. Similarly, when locations are idle over a period of time, they should automatically be disabled to reduce leakage power. One challenge is at what resolution to track and respond to these access patterns. For simplicity, large regions of the cache can be grouped together and treated en masse. This reduces the complexity necessary for the hardware detection and correction logic, but imposes a coarse-grained reaction to cache behavior. If a situation arose where one portion of a cache group was heavily utilized, while another portion of the same logical group was highly underutilized, the system would not be able to satisfy both constraints. On the other hand, using a fine-grained hardware approach would allow greater flexibility in the event of neighboring locations within the cache having widely divergent access patterns, yet would also increase the complexity of the hardware detection and correction logic.

Another challenge is that the activity within the various locations of the cache changes throughout the program's lifetime. At one point in time, a given location may be highly utilized, but then at some later interval, it could be idle. Obviously, the hardware can dynamically detect these changes in activity level, but current levels may not necessarily foretell future patterns. All that is needed is some way to prime the hardware about such future patterns.

Indeed, application-specific information can be leveraged to inform the hardware about probable memory access patterns. In this manner, the hardware can intelligently make decisions based on general application characteristics, in addition to real-time access patterns. To be frugal, coarse-grained application-specific information can be used to tailor the hardware's behavior. Instead of leveraging detailed instruction or basic-block characteristics, a more global application signature is proposed. Conceptually, this will greatly simplify the hardware storage and complexity; there is no need to query the Program Counter to determine which instruction or basic block the processor is currently executing. Instead, generalized application attributes are mined that reflect the memory access patterns and frequencies for the overall program.
This more coarse-grained resolution may not be able to react as precisely as the fine-grained approach. On the other hand, a drastically smaller data footprint is required when compared to a per-instruction or basic-block approach.

In order to extract this application-specific information, profiling is leveraged. The target application is executed and profiled using one or more representative program inputs. The real-time access patterns are observed and extracted, generating a signature reflective of the application that was run. An alternative approach to profiling would be to use pure static analysis to garner application-specific information. While static analysis can formally generate a similar signature that is more immune to individual program inputs, it suffers when pointers and dynamic memory are used within the application. Such memory accesses are commonly indeterminate statically and can only be resolved at run-time. Thus, profiling would be more adept at extracting accurate application characteristics in these cases.

Using these concepts, an architecture is proposed that will leverage application-specific information to help guide a dynamically reconfigurable cache that is capable of detecting and responding to imbalances in access frequencies that would degrade either performance or power.

Figure 6.2: High-Level Architectural Overview of Proposed Design

6.2 Methodology

The proposed solution consists of three elements: a compiler-driven application memory analysis, a run-time cache access behavior monitoring mechanism, and a dynamic cache expansion and contraction mechanism. A high-level architectural overview is shown in Figure 6.2. Initially, an application is profiled off-line to garner memory characteristics and create a small application signature that will be embedded within the application binary. This data will convey coarse-grained information about the memory access patterns seen during profiling. Next, while the application is running on the processor the cache will maintain run-time statistics, keeping track of access frequency rates for each of the cache sets. Lastly, using the application-specific information in conjunction with the real-time access statistics, the cache will detect whether a modification is merited for any given cache set. If a cache set would benefit from having increased capacity, the cache will adapt by providing more storage locations. On the other hand, if the cache set could effectively be shrunk and merged into another set, the cache will adapt accordingly and disable the unneeded portions to reduce leakage power.

A pseudo-associative cache design is employed wherein cache accesses to a given cache set can be redirected into a secondary location within the same cache [86]. When additional capacity is required, this secondary set can also be accessed in order to expand the available storage locations for a particular address index. In contrast, if the current capacity is more than sufficient and it is determined that we can maintain the same performance without as much storage for the given location, it is folded into its secondary location. This action effectively halves the amount of storage available, since it now shares the same storage as the secondary location. The proposed architecture will dynamically detect these situations during run-time and make the appropriate adjustments to the cache to effectively use the available storage and also avoid wasteful leakage power for cells that remain idle. The next section will describe in detail how this high-level architecture is implemented.

Figure 6.3: High-Level Cache Implementation

The proposed solution enables two complementary types of behavior to occur per cache set: expansion and contraction. A set-associative cache architecture is assumed throughout this chapter, wherein each cache set (as opposed to each individual line) will be annotated with a small number of additional bits to keep track of run-time behavior. In this manner, the storage and logic overhead is independent of the level of associativity, and indeed this same architecture would also work for a direct-mapped cache. Figure 6.3 provides a high-level view of the architectural additions. In particular, a small shift register is added to each cache set, and will be used to measure the access frequency of that set. Upon accessing the cache set, the Frequency Shift Register (FSR) is left shifted and fed a least-significant-bit (LSB) value of 1. Upon the decay signal, determined by a global countdown register (DECR), the FSR is right shifted and fed a most-significant-bit (MSB) value of 0. In this manner, the FSR will saturate with all 1's if highly accessed, saturate with all 0's if rarely accessed, or otherwise possess the property of having a continuous run of 1's of some length L starting from the LSB. This structuring of the FSR will minimize bit-flipping transitions (avoiding needless dynamic power consumption), and will greatly reduce the complexity of comparing the value in the FSR with a given threshold value. Additionally, the FSR values are initialized to all 0's upon reset or flush.

The architecture also has four global threshold registers: Te_on (Expansion On), Te_off (Expansion Off), Tc_off (Contraction Off), and Tc_on (Contraction On). These threshold registers are the same size as the FSRs, and will contain a single 1 in a specific bit position to indicate the threshold. Thus, the comparison of a threshold with the FSR is simply a combinational AND fed into an OR-reduction (i.e. if any bit is a 1, then the result is 1, else 0). If the FSR has met or exceeded a given threshold, it can quickly and efficiently be detected, and the cache can then make the appropriate changes to either enable or disable expansion or contraction, based on the particular threshold value(s) met.
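A behavioral C model of the per-set FSR update and the threshold test is given below, assuming the 8-bit FSR used later in this chapter; it is a software sketch of the described hardware behavior, not RTL, and the function names are illustrative.

#include <stdbool.h>
#include <stdint.h>

typedef uint8_t fsr_t;   /* N = 8: one 8-bit FSR per cache set */

/* On a cache-set access: shift left and feed a 1 into the LSB. */
static inline fsr_t fsr_on_access(fsr_t fsr)
{
    return (fsr_t)((fsr << 1) | 0x01u);
}

/* On the decay signal (DECR reaching zero): shift right, feed a 0 MSB. */
static inline fsr_t fsr_on_decay(fsr_t fsr)
{
    return (fsr_t)(fsr >> 1);
}

/* Threshold registers hold a single 1 bit; "FSR has met or exceeded the
 * threshold" is a bitwise AND followed by an OR-reduction (nonzero test). */
static inline bool fsr_meets(fsr_t fsr, fsr_t threshold)
{
    return (fsr & threshold) != 0;
}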

Figure 6.4: Cache Behavior State Diagram

Figure 6.4 provides the basic state diagram of this cache behavior. The threshold registers are constrained in the following manner, in order to avoid ambiguity and deadlock:

Te_on > Te_off ≥ Tc_off > Tc_on (6.1)

Thus, the minimum size of the threshold registers (and also the FSR) is 3 bits, and we will denote this size by N.

To implement this adaptive cache architecture, one will need three extra bits for every cache set (one for the Expand Bit, one for the MRU Bit, and one additional tag bit to differentiate values from primary versus secondary locations). The decay countdown register (DECR) is 64 bits, plus a 64-bit reset value register that is loaded into the DECR upon reaching 0. Furthermore, there is the addition of the N bits used for the FSR on each cache set, and the N bits for each of the four threshold registers. Since we use an 8-bit FSR (N = 8) and have 1024 sets in our implementation, this results in an additional 1428 bytes (≈1.39KB) added to our L2 cache: 1024 bytes of FSR state, 384 bytes for the three per-set status bits, 4 bytes for the four threshold registers, and 16 bytes for the two 64-bit decay registers. This additional storage overhead is negligible compared to the actual cache size of 256KB. The details of the compiler-driven determination of threshold values, as well as the expansion and contraction mechanisms, are described in the following three subsections.

6.2.1 Compiler-Driven Threshold Selection

Our initial work [73] assumed fixed, pre-defined threshold values, yet not all applications exhibit the same memory access behaviors. By using application-specific information, we enable a more fine-grained capability for matching the cache behavior to a given application, yielding improved power and performance benefits compared to a system-wide fixed value approach. To accomplish this, we need to strengthen the interaction between the compiler and the underlying hardware microarchitecture. Inherently, there are certain global application characteristics that can more easily be ascertained during compile-time analysis and profiling, such as average cache miss rates and time between cache accesses. This information would be prohibitively costly and complex to generate on the fly during run-time. On the other hand, there are certain behavioral aspects of the application that can only be ascertained during run-time, such as actual cache hit/miss behavior based on user input or real-time application data streams. These events cannot be known statically a priori or guaranteed with profiling alone. Given this trade-off between information available at compile-time versus run-time analysis, it becomes evident that some mixture of the two will be essential. The compiler will help glean crucial generalized application characteristics that would otherwise be too costly to obtain at run-time, embedding this information in the application binary. Then, upon the operating system loading the application into the processor, the loader can convey these values to the underlying hardware microarchitecture by initializing the four threshold registers and the decay counter reset register to appropriate values for the given application. An overview of this architecture was presented in Section 1.3.1 and is shown in Figure 6.5.

Figure 6.5: Overview of Compiler and Hardware Interaction

To accomplish this analysis, we will leverage off-line application profiling using a customized implementation of Valgrind [87] in order to extract global information regarding cache access behavior for the given application. We are interested in seeing how the memory for the application behaves: how frequently we access cache sets, how often we encounter a cache miss, and how often we access the same cache set temporally. Using such information, we can create a signature for the program that represents the generalized memory behavior of the application. In particular, we will capture the following pieces of global cache information:

• Miss rate (MR)
• Access rate (AR): the percentage of total instructions executed that accessed the cache
• Temporal delay mean (TDµ): the mean number of cycles between accesses to a given cache set
• Temporal delay standard deviation (TDσ)
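These four quantities form the per-application signature embedded in the binary; one possible C representation is sketched below (the struct and field names are illustrative assumptions).

/* Profile-derived memory signature embedded in the application binary. */
typedef struct app_signature {
    double mr;       /* L2 miss rate (MR)                                  */
    double ar;       /* access rate (AR): fraction of executed             */
                     /* instructions that access the L2 cache              */
    double td_mean;  /* mean cycles between accesses to a set (TDµ)        */
    double td_sigma; /* standard deviation of that delay (TDσ)             */
} app_signature_t;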

Using the aforementioned profiling information, we can calculate values for the threshold registers. Given that we have N bits for the threshold registers, and the constraint embodied by Equation (6.1), Te_on can contain a 1 only in the left-most N − 2 positions. Thus, we have N − 2 possible values for Te_on. Our goal is to correlate Te_on with the miss rate (MR), where larger miss rates warrant a lower (right-shifted) threshold. Thus, we will be more predisposed to enable expansion in this situation, even if access frequencies are not very high. To determine an appropriate Te_on value, we will map the N − 2 possible values onto exponentially decreasing ranges of miss rates. As most applications typically have lower L2 cache miss rates, this exponential arrangement allows for more fine-grained resolution at the lower miss rates, while still handling the occurrence of a higher miss rate. Since N will typically be small to conserve area and power, there is only a modest number of possible threshold representations. To provide a continuum of threshold values that handles a diverse range of miss rates and corresponds to given access frequencies and decay periods, an exponential distribution is needed.

We will need to specify a maximum miss rate (MRmax) to begin this mapping.

Based on MRmax, we will determine the logarithmic normalization coefficient:

MRλ = 2^(N−2) / MRmax    (6.2)

Then, using MRλ, we can map the profiled MR into a corresponding Te_on:

Te_on = 1 << (N − ⌊log2[(MR)(MRλ)]⌋),   if MR ≥ 2/MRλ
Te_on = 1 << (N − 1),                    if MR < 2/MRλ        (6.3)

Based on profiling the memory patterns of the various application benchmarks, we found that a setting of N = 8 provided sufficient resolution within our FSR to capture run-time cache access behavior and allow for the differentiation of eight unique levels of utilization. Additionally, we define MRmax = 0.40, as a 40% miss rate is a reasonable upper-bound; were an L2 cache to encounter more than a 40% miss rate, it would be indicative of a more systemic problem necessitating a larger cache size or greater associativity. Given this, Table 6.1 demonstrates the mapping logic for translating the profiled MR into a Te_on value.

Table 6.1: Te_on Mapping Conditions

MR Range                Te_on Value
0.400 ≤ MR              00000100
0.200 ≤ MR < 0.400      00001000
0.100 ≤ MR < 0.200      00010000
0.050 ≤ MR < 0.100      00100000
0.025 ≤ MR < 0.050      01000000
MR < 0.025              10000000

While the calculations above effectively incorporate observed miss rate behavior, the actual impact of cache accesses relative to the application as a whole is not reflected. To embody the performance impact of cache misses, we also need to factor in the cache access rate (AR). Even with a large miss rate, if the access rate is quite low then the impact of those cache misses across the entire application is muted. Thus, we propose a further modification of Te_on in the following manner:

Te_on = Te_on >> 2,   if AR ≥ 10%
Te_on = Te_on >> 1,   if 5% ≤ AR < 10%
Te_on = Te_on,        if AR < 5%

Lastly, to ensure the satisfaction of Equation (6.1), if Te_on gets shifted too far right (e.g. if the 1 is located in one of the two rightmost bit positions), it will be saturated to its lowest allowable value (0x4):

Te_on = 0x4,   if Te_on < 0x4

Calculation of the contraction enable threshold (Tc_on) is done in a similar fashion. Our goal in this case is to correlate Tc_on with the observed average temporal access delay (TDµ). If the delay between cache accesses is large, we want to set Tc_on to a higher (left-shifted) threshold. Intuitively, if the application as a whole exhibits very long periods of delay between cache accesses, we want to contract as much as possible to exploit the benefits in leakage power reduction. If a small group of cache sets is being frequently accessed (but not enough to skew the overall TDµ), their individual FSRs will allow the expansion to occur if necessary. To determine an appropriate Tc_on value, we will map the N − 2 possible values onto exponentially increasing ranges of temporal delays. We will need to specify a maximum temporal delay (TDmax) to begin this mapping. Based on TDmax, we will determine the logarithmic normalization coefficient:

TDλ = 2^(N−2) / TDmax    (6.4)

Then, using TDλ, we can map the profiled TDµ into a corresponding Tc_on:

Tc_on = 1 << (⌊log2[(TDµ)(TDλ)]⌋ − 1),   if TDµ ≥ 2/TDλ
Tc_on = 1,                                if TDµ < 2/TDλ        (6.5)

For the purposes of this proposal, we specify TDmax = 1.00e4 based on profiling the average temporal access delays (TDµ) across all the benchmarks. Given this, Table 6.2 demonstrates the mapping logic for translating the profiled TDµ into a Tc_on value.

Table 6.2: Tc_on Mapping Conditions

TDµ Range                   Tc_on Value
TDµ < 6.25e2                00000001
6.25e2 ≤ TDµ < 1.25e3       00000010
1.25e3 ≤ TDµ < 2.50e3       00000100
2.50e3 ≤ TDµ < 5.00e3       00001000
5.00e3 ≤ TDµ < 1.00e4       00010000
1.00e4 ≤ TDµ                00100000

Lastly, we need to ensure that the resulting values for Te_on and Tc_on still satisfy Equation (6.1). It is possible that the threshold values may violate Equation (6.1) by Te_on not being two or more positions to the left of Tc_on. If that is the case, one of the two thresholds must be adjusted accordingly. We chose to modify Te_on, to bias toward power reduction versus execution speed. This is embodied in Equation (6.6):

Te_on = Tc_on << 2 (6.6)

With regard to the disabling thresholds, Te_off and Tc_off, they are calculated as follows:

Te_off = 1 << (log2[Te_on] − ⌈(log2[Te_on] − log2[Tc_on] − 1) / 3⌉)    (6.7)

Tc_off = 1 << (log2[Tc_on] + ⌈(log2[Te_on] − log2[Tc_on] − 1) / 3⌉)    (6.8)

Finally, to compute the desired interval for the decay countdown register (DECR), we leverage the information regarding the mean and standard deviation of the temporal delay. Intuitively, one would like the decay signal to occur often enough to help lower the FSR and allow idle cache sets to be contracted to reduce leakage power. On the other hand, the decay signal should not occur too quickly and cause unnecessary transitions between the contracted and non-contracted states. By utilizing the standard deviation of the temporal access delay, we can derive a decay interval large enough to allow average-length accesses sufficient time to re-access a cache set and counteract the decay signal if they are still in use. Assuming a Gaussian distribution, it is anticipated that the vast majority of cache sets will be accessed within the average temporal delay plus the standard deviation. Thus, if we configure the decay to occur at this interval rate, we should be able to cause cache sets being used less than this rate to begin to decay, while the other sets will not. This concept is represented in Equation (6.9) below, which defines the value for the DECRreset register used to reset the DECR countdown register upon reaching zero.

DECRreset = TDµ + TDσ (6.9)
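To tie this subsection together, the following C sketch computes the five register values from a profiled signature by following Equations (6.2) through (6.9). It is a host-side, compile-time calculation; the constant and helper names, as well as the explicit cap on Tc_on at its largest legal value (matching the last row of Table 6.2), are assumptions made for this illustration.

#include <math.h>
#include <stdint.h>

#define N        8        /* FSR / threshold register width (bits)      */
#define MR_MAX   0.40     /* assumed maximum L2 miss rate               */
#define TD_MAX   1.0e4    /* assumed maximum mean temporal delay        */

typedef struct thresholds {
    uint8_t  te_on, te_off, tc_off, tc_on;
    uint64_t decr_reset;
} thresholds_t;

/* Position of the single 1 bit in a threshold value. */
static int bitpos(uint32_t x) { int b = 0; while (x >>= 1) b++; return b; }

/* Derive the five register values from the profiled signature,
 * following Equations (6.2) through (6.9). */
thresholds_t compute_thresholds(double mr, double ar,
                                double td_mean, double td_sigma)
{
    thresholds_t t;
    double mr_lambda = (double)(1u << (N - 2)) / MR_MAX;   /* Eq. 6.2 */
    double td_lambda = (double)(1u << (N - 2)) / TD_MAX;   /* Eq. 6.4 */

    /* Expansion-enable threshold from the miss rate (Eq. 6.3). */
    uint32_t te_on = (mr >= 2.0 / mr_lambda)
        ? 1u << (N - (int)floor(log2(mr * mr_lambda)))
        : 1u << (N - 1);

    /* Scale by the access rate, then saturate at 0x4. */
    if (ar >= 0.10)      te_on >>= 2;
    else if (ar >= 0.05) te_on >>= 1;
    if (te_on < 0x4) te_on = 0x4;

    /* Contraction-enable threshold from the mean delay (Eq. 6.5),
     * capped at its largest value per Table 6.2. */
    uint32_t tc_on = (td_mean >= 2.0 / td_lambda)
        ? 1u << ((int)floor(log2(td_mean * td_lambda)) - 1)
        : 1u;
    if (tc_on > (1u << (N - 3))) tc_on = 1u << (N - 3);

    /* Keep Te_on at least two bit positions above Tc_on (Eq. 6.6). */
    if (te_on < (tc_on << 2)) te_on = tc_on << 2;

    /* Disable thresholds split the remaining gap (Eqs. 6.7 and 6.8). */
    int step = (int)ceil((bitpos(te_on) - bitpos(tc_on) - 1) / 3.0);
    t.te_on  = (uint8_t)te_on;
    t.tc_on  = (uint8_t)tc_on;
    t.te_off = (uint8_t)(1u << (bitpos(te_on) - step));
    t.tc_off = (uint8_t)(1u << (bitpos(tc_on) + step));

    /* Decay interval: mean plus one standard deviation (Eq. 6.9). */
    t.decr_reset = (uint64_t)(td_mean + td_sigma);
    return t;
}

As a sanity check, feeding this routine the profiled statistics that produce the Table 6.1 and 6.2 mappings reproduces register combinations of the form shown later in Tables 6.4 through 6.6 (e.g. Te_on two or more bit positions above Tc_on, with Te_off and Tc_off splitting the gap).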

While the emphasis of this proposal is on profiling a single application in order to generate the corresponding threshold and countdown register values, it is important to briefly mention that the next generation of mobile processors is beginning to employ multicore architectures with shared L2 caches. Given this, there may now be more than one application running and concurrently accessing the L2 cache. The threshold and countdown values should take into account the class of applications that are expected to run concurrently on the system. The same profiling technique described above can be performed on each application. Assuming all applications are prioritized equally, a simple average of the individual threshold and countdown register values can be computed to determine a common set of values for all applications. On the other hand, if certain applications are more critical, the average can be weighted as needed. Yet, while having accurate threshold and countdown values helps optimize the performance and power savings, it is important to note that the system will still dynamically adjust to the actual run-time access patterns and adapt the cache configuration to best reflect the actual needs of the applications currently executing on the processor.

6.2.2 Cache Expansion Mechanism

This mechanism allows for the expansion of a particular cache set into a second cache set, similar to the architecture described in [72]. The rationale for this expansion is that the cache set in question is being heavily utilized in a temporally short period of time, and thus may be prone to thrashing. This is an important difference from [72], as this proposal uses access frequencies over time instead of evicted cache lines to determine whether to expand. Expansion is detected by the current set's FSR meeting the Te_on threshold, which causes the Expand Bit to be turned on. On a cache miss, if this bit is enabled, a secondary set within the same cache is accessed on the next cycle, similar to a pseudo-associative cache, as shown in Figure 6.3. This secondary set is determined by a fixed mapping function based on the cache size. We investigated using an LFSR (linear feedback shift register) mapping function, as well as a basic MSB toggling function. For the LFSR approach, we chose a 10-bit LFSR with the maximal-length polynomial x^10 + x^7 + 1. To determine our complement set, the current cache index is loaded into the LFSR and the next value computed represents the complement index. The only limitation with using an LFSR is that the all-zero state maps back onto itself, essentially excluding that one particular cache set from being expanded or contracted (the impact of this limitation is minimal, as it only affects 1 cache set out of 1024). For the MSB approach, we simply XOR the MSB of the index into the cache. The benefit of the LFSR approach is that the cycle length can be quite large, avoiding the situation that occurs in the basic MSB approach where the primary set maps to the secondary set and vice versa (i.e. a cycle length of 2). On the other hand, the MSB approach is economical in terms of power and area, as only a single gate is required and it is a purely combinational circuit.

If the data is found within the secondary set, one effectively has a cache hit, but with a one-cycle penalty. If, after looking into the secondary set, the data is still not located, a full cache miss occurs and the next memory device in the hierarchy (L3 or main memory) is accessed with the associated cycle penalties. The key principle is that we only enable the secondary lookup on a per-set basis, as always enabling this secondary lookup is wasteful and can lead to worse performance when cache sets are lightly used.

Thus, we effectively can double the size of the given set without needing to double the hardware. This works in principle due to the typical locality of caches. Since our complement set is chosen to be bidirectionally distant from the current set, the probability of both the primary and complement set being active at the same temporal moment in an application is quite small. Even if this unlikely event does occur and we pollute the currently active cache set that happens to be our complement, the impact is negligible. Later in the program's progression, if the complement set is then used, it will function normally and evict the older data from the prior expansion out of its space.

Additionally, there is an MRU Bit on each of the sets within the cache. This bit is enabled whenever a cache hit occurs on the secondary cache set; it is disabled when a cache hit occurs on the primary set. The MRU Bit is also used on subsequent cache lookups to determine whether to initially do a lookup on the primary cache set or the secondary cache set, effectively encoding a most-recently-used paradigm. If the last hit occurred in the secondary set, the next access has a higher probability to also occur in the secondary set, and vice versa. In addition, we implemented our caches to use an LRU replacement policy. When a set has its Expand Bit on, the LRU set (primary or secondary) is used in the event a replacement is necessary. Furthermore, for associative caches, after determining which set to use, the normal LRU rules for which way to use within that set apply. This set-wise LRU is inherently supported by the MRU Bit, since there are only two possible locations. It is important to note that once a set has its Expand Bit enabled, it remains enabled until the set's FSR falls below the Te_off threshold. Furthermore, this expansion need not be symmetrical in the MSB approach. Even though set-A may have its Expand Bit turned on and be utilizing set-A and its complement set-B, set-B can still be acting as a normal non-expanded set. Only if primary accesses to set-B also trigger the expansion threshold will set-B's Expand Bit also be enabled to allow expansion into set-A.
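The two complement-set mapping functions discussed above can be modeled in a few lines of C, as sketched below for a 1024-set cache (10-bit index); the LFSR uses the maximal-length polynomial x^10 + x^7 + 1 and, as noted, maps the all-zero index to itself. The function names are illustrative.

#include <stdint.h>

#define INDEX_BITS 10                       /* 1024 cache sets        */
#define INDEX_MASK ((1u << INDEX_BITS) - 1u)

/* LFSR mapping: load the set index, step once using taps 10 and 7
 * (polynomial x^10 + x^7 + 1); the next state is the complement index. */
uint32_t complement_lfsr(uint32_t index)
{
    uint32_t bit = ((index >> 9) ^ (index >> 6)) & 1u;   /* taps 10, 7  */
    return ((index << 1) | bit) & INDEX_MASK;            /* 0 maps to 0 */
}

/* MSB mapping: toggle the most significant index bit, giving
 * complement pairs with a cycle length of two. */
uint32_t complement_msb(uint32_t index)
{
    return index ^ (1u << (INDEX_BITS - 1));
}

Whichever mapping is chosen, the same function is used both for expansion (the secondary set that absorbs overflow) and for contraction (the set to which a gated set redirects its references).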

6.2.3 Cache Contraction Mechanism

The contraction mechanism allows for the gating of idle cache sets, helping to eliminate any associated leakage power. The rationale for this contraction is that the cache set in question is being rarely utilized over a temporal period, and thus can safely be turned off. This is detected by the current set's FSR falling below the Tc_on threshold, which causes the Gate Bit to be turned on. As mentioned earlier, a decay signal, determined by a clock divider, is used to periodically lower the FSR. If no accesses occur on that set, over time the decay signal will continue to lower the FSR value. On the other hand, if accesses begin to occur, they will help raise the FSR value to counteract the decay. The Gate Bit is the same bit as the MRU Bit, but when the cache set is not "expanded", it is instead interpreted as gating the cache set and forcing all references to be redirected to the secondary set. This can safely be done, since expansion and contraction are mutually exclusive states. The secondary set is calculated in the same fashion as in the expansion state, using either the LFSR or MSB approach described earlier.

Figure 6.6: Example Gated Primary Set Redirecting Queries to Secondary Location

Thus, we effectively shrink the size of the given set during times of infrequent use, helping eliminate leakage power. When a cache set has the Gate Bit enabled, the supply voltage is cut off from the memory cells, in a similar fashion to [47]. This results in a reduction of 97% of the memory cell's leakage power, with only an 8% read performance overhead. Essentially all state for the cache set other than the FSR, Expand Bit, and Gate Bit is powered off. Furthermore, if the cache set contained any dirty lines, these would be flushed serially in the background prior to the supply voltage being gated. As shown in Figure 6.6, any references to the gated primary set will be redirected to the secondary set. Since these locations are rarely used, it is statistically acceptable to not retain data that may currently be live within the given cache set. If the data is indeed needed at a future time, the typical cache miss penalty will be incurred. Given this, there is effectively a trade-off between aggressively gating sets and the associated negative impact on performance and power resulting from having to query the next level of the memory hierarchy. Appropriate values for the Tc_on and Tc_off thresholds will help balance this trade-off.

If a gated cache set begins to be accessed more frequently and its FSR meets the Tc_off threshold, then the Gate Bit will be disabled and the set will no longer be gated. Additionally, if a set enters expansion mode, it will force the complement set to exit contraction mode. Furthermore, if a set is in contraction mode, the complement set will not be allowed to enter contraction mode as well, to avoid deadlock. In other words, when a cache set enters contraction mode, the complement set is forced into normal mode to ensure a memory location will exist to handle that cache address space. A state transition diagram for a complement pair of cache sets is provided in Figure 6.7. The letters E, N, and C indicate whether a cache set is in expansion, normal, or contraction mode, respectively. There are three disallowed states (CC, EC, and CE), shown in red with dashed transition lines in the figure. When one of these transitions is encountered, the state transition is corrected as shown by the large arrows.
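The pairwise rules above can be captured in a small C routine that corrects a disallowed (CC, EC, or CE) combination after a mode change, mirroring Figure 6.7. The enum and function names are illustrative, and the correction chosen for the CE case follows the stated rule that a contracting set's complement is forced back to normal mode; the actual hardware may resolve it differently.

typedef enum { MODE_CONTRACT, MODE_NORMAL, MODE_EXPAND } set_mode_t;

/* Re-validate a complement pair after 'changed' has just switched mode.
 * Disallowed pairs are CC, EC, and CE; one member of the pair is forced
 * back to normal mode so that an addressable set always remains. */
void correct_pair(set_mode_t *changed, set_mode_t *partner)
{
    if (*changed == MODE_EXPAND && *partner == MODE_CONTRACT) {
        *partner = MODE_NORMAL;   /* expansion forces the complement out of contraction */
    } else if (*changed == MODE_CONTRACT) {
        if (*partner == MODE_CONTRACT)
            *changed = MODE_NORMAL;   /* cannot contract while the complement is gated */
        else if (*partner == MODE_EXPAND)
            *partner = MODE_NORMAL;   /* a gated set's complement must be in normal mode */
    }
}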

Figure 6.7: State Transition Diagram for Pair of Cache Sets

While gating a single cache set does help reduce some leakage power, it is the accumulation of a large number of gated sets that begins to yield tangible results. As described earlier, cache accesses within the L2 cache are quite sporadic and unevenly distributed. As will be shown in the next section, there are quite a number of times during execution where many of the cache sets can safely be gated off without significantly impacting performance.

6.3 Experimental Results

In order to assess the benefit from this proposed architectural design, we utilized the SimpleScalar toolset [76] using the default in-order PISA simulation engine and an 8-bit Frequency Shift Register (FSR). We chose a representative mobile smartphone system configuration, having separate 32KB L1 data and instruction caches (direct-mapped, 32-byte line size) and a unified L2 cache (4-way set-associative, 64-byte line size). We chose to use a set-associative L2 cache similar to current mobile processors; a direct-mapped L2 cache could also be used with our proposal. We chose three representative L2 cache sizes (128KB, 256KB, and 512KB), which are similar to what is currently found in smartphone processors such as the Qualcomm Halcyon MSM7230 (256KB L2), Qualcomm Blackbird MSM8660 (512KB L2), or Apple A4 (512KB L2).

Table 6.3: Description of Benchmarks

  Benchmark    Description
  bzip2        Memory-based compression algorithm
  crafty       High-performance chess playing game
  crc32        32-bit CRC framing hash checksum
  eon          Probabilistic ray tracer visualization
  fft          Discrete fast Fourier transform algorithm
  gap          Group theory interpreter
  gcc          GNU C compiler
  gsm          Telecomm speech transcoder compression algorithm
  gzip         LZ77 compression algorithm
  h264dec      H.264 block-oriented video decoder
  mcf          Vehicle scheduling combinatorial optimization
  mpeg4enc     MPEG-4 discrete cosine transform video encoder
  parser       Dictionary-based word processing
  perlbmk      Perl programming language interpreter
  twolf        CAD place-and-route simulation
  vortex       Object-oriented database transactions
  vpr          FPGA circuit placement and routing

Table 6.4: 128KB L2 Compiler-Generated Application-Specific Threshold Values

  Benchmark    Te_on      Te_off     Tc_off     Tc_on      DECR
  bzip2        00100000   00010000   00000100   00000010   2.01e3
  crafty       00010000   00001000   00000010   00000001   1.33e2
  crc32        00001000   00000100   00000010   00000001   4.21e2
  eon          01000000   00100000   00100000   00010000   6.14e3
  fft          00010000   00001000   00000100   00000010   9.04e2
  gap          00001000   00000100   00000010   00000001   6.24e2
  gcc          00010000   00001000   00000010   00000001   2.49e2
  gsm          00001000   00000100   00000010   00000001   2.73e2
  gzip         00100000   00010000   00000100   00000010   9.47e2
  h264dec      00001000   00000100   00000100   00000010   8.23e2
  mcf          00010000   00001000   00001000   00000100   2.81e3
  mpeg4enc     00010000   00001000   00001000   00000100   1.22e3
  parser       00001000   00000100   00000100   00000010   7.96e2
  perlbmk      00010000   00001000   00001000   00000100   1.59e3
  twolf        00001000   00000100   00000010   00000001   2.34e2
  vortex       00100000   00010000   00010000   00001000   4.09e3
  vpr          00001000   00000100   00000010   00000001   4.78e2

Table 6.5: 256KB L2 Compiler-Generated Application-Specific Threshold Values

  Benchmark    Te_on      Te_off     Tc_off     Tc_on      DECR
  bzip2        00100000   00010000   00000100   00000010   2.37e3
  crafty       00100000   00001000   00000100   00000001   1.91e2
  crc32        00001000   00000100   00000010   00000001   5.55e2
  eon          00100000   00010000   00010000   00001000   6.50e3
  fft          00010000   00001000   00001000   00000100   1.62e3
  gap          00010000   00001000   00000100   00000010   7.38e2
  gcc          01000000   00010000   00000100   00000001   3.05e2
  gsm          00001000   00000100   00000010   00000001   2.88e2
  gzip         10000000   00100000   00010000   00000100   2.42e3
  h264dec      00010000   00001000   00000100   00000010   1.32e3
  mcf          00010000   00001000   00001000   00000100   4.19e3
  mpeg4enc     00100000   00010000   00001000   00000100   2.83e3
  parser       00010000   00001000   00000100   00000010   1.02e3
  perlbmk      00010000   00001000   00001000   00000100   2.34e3
  twolf        00001000   00000100   00000010   00000001   3.66e2
  vortex       00100000   00010000   00010000   00001000   5.11e3
  vpr          00001000   00000100   00000010   00000001   6.02e2

Table 6.6: 512KB L2 Compiler-Generated Application-Specific Threshold Values

  Benchmark    Te_on      Te_off     Tc_off     Tc_on      DECR
  bzip2        00100000   00010000   00001000   00000100   2.83e3
  crafty       00100000   00001000   00000100   00000001   1.45e2
  crc32        00010000   00001000   00000010   00000001   5.99e2
  eon          01000000   00100000   00100000   00010000   6.37e3
  fft          00100000   00010000   00010000   00001000   4.60e3
  gap          00100000   00010000   00001000   00000100   9.65e2
  gcc          01000000   00010000   00000100   00000001   3.30e2
  gsm          00010000   00001000   00000100   00000010   7.15e2
  gzip         10000000   01000000   00100000   00010000   7.62e3
  h264dec      00010000   00001000   00001000   00000100   2.43e3
  mcf          01000000   00100000   00100000   00010000   7.83e3
  mpeg4enc     01000000   00100000   00100000   00010000   5.37e3
  parser       00100000   00010000   00010000   00001000   3.12e3
  perlbmk      01000000   00100000   00100000   00010000   8.53e3
  twolf        00010000   00001000   00000010   00000001   5.61e2
  vortex       01000000   00100000   00100000   00010000   7.44e3
  vpr          00001000   00000100   00000100   00000010   6.34e2

Table 6.7: Hand-Picked Global Threshold Values

  Benchmark    Te_on      Te_off     Tc_off     Tc_on      DECR
  ALL          10000000   00100000   00010000   00000100   2.00e3

The complete SPEC CINT2000 benchmark suite [77] is used, as well as five representative mobile smartphone applications from the MediaBench video suite [78] and the MiBench telecomm suite [79]. A listing of these benchmarks and their respective descriptions is provided in Table 6.3. The benchmarks' training inputs are used during profiling, while reference inputs are utilized during simulation, and all benchmarks are run to completion. The threshold and decay register values for each benchmark, as determined by the compile-time application-specific profiling, are shown in Tables 6.4, 6.5, and 6.6 (for 128KB, 256KB, and 512KB L2 cache sizes, respectively). For comparison, a set of hand-selected threshold and decay register values (based on previous work) for a global, pure-hardware approach is shown in Table 6.7. In the global approach, the same values are shared across all benchmarks without regard to individual application characteristics. As will be shown, using application-specific information to tailor the cache behavior will yield better results than an across-the-board value for the entire system.

Table 6.8 provides the miss rate comparison between the baseline system and the proposed adaptive cache architecture, showing both the global threshold and application-specific threshold results. The MSB mapping function is utilized for the Expansion and Contraction mechanisms. As one can see, certain benchmarks result in slightly higher miss rates due to the more aggressive use of the Contraction mechanism. For example, the 256KB gzip benchmark incurs a 5.44% degradation in L2 cache performance. But, while that may seem like a large amount, the actual miss rate is quite low (1.55%) and thus more sensitive to any deviations. Figure 6.8 graphically shows the miss rate impact across the benchmarks for the three cache configurations, compared against the baseline miss rate (100%). As expected, the application-specific approach is able to maintain or improve the miss rates since the expansion mechanism will help alleviate cache conflicts for saturated sets, while only shrinking sets that are infrequently used. Furthermore, one can see that, on average, the application-specific approach (App) outperforms the global threshold (Glb) in terms of reducing miss rates from the baseline. Intuitively, this is also expected, since the cache behavior is tailored to more accurately respond to a particular application's memory characteristics.

Figure 6.8: L2 Cache Miss Rate Impact

Table 6.8: L2 Cache Miss Rates

                                       Baseline    Global Threshold        Application-Specific
  Benchmark   Accesses   Cache Size    Miss Rate   Miss Rate    Change     Miss Rate    Change
  bzip2       4.29e9     128KB         0.2035      0.2015        0.98%     0.2015        0.98%
                         256KB         0.1699      0.1710       -0.65%     0.1710       -0.65%
                         512KB         0.1094      0.1101       -0.64%     0.1101       -0.64%
  crafty      3.42e9     128KB         0.0413      0.0420       -1.69%     0.0332       19.61%
                         256KB         0.0038      0.0040       -5.26%     0.0034       10.53%
                         512KB         0.0037      0.0038       -2.70%     0.0034        8.11%
  crc32       4.26e7     128KB         0.1919      0.1922       -0.16%     0.1785        6.98%
                         256KB         0.1750      0.1797       -2.69%     0.1419       18.91%
                         512KB         0.0944      0.0945       -0.11%     0.0899        4.78%
  eon         7.45e9     128KB         0.1166      0.1109        4.89%     0.1035       11.23%
                         256KB         0.0973      0.0991       -1.85%     0.0959        1.44%
                         512KB         0.0344      0.0351       -2.03%     0.0349       -1.45%
  fft         2.48e7     128KB         0.1591      0.1567        1.51%     0.1527        4.02%
                         256KB         0.1007      0.1049       -4.17%     0.0988        1.89%
                         512KB         0.0619      0.0620       -0.16%     0.0611        1.29%
  gap         4.94e10    128KB         0.1415      0.1423       -0.57%     0.1371        3.11%
                         256KB         0.0814      0.0818       -0.49%     0.0788        3.19%
                         512KB         0.0438      0.0437        0.23%     0.0419        4.34%
  gcc         1.72e8     128KB         0.0665      0.0679       -2.11%     0.0552       16.99%
                         256KB         0.0244      0.0252       -3.28%     0.0228        6.56%
                         512KB         0.0167      0.0169       -1.38%     0.0163        2.22%
  gsm         6.00e7     128KB         0.1673      0.1711       -2.27%     0.1629        2.63%
                         256KB         0.1222      0.1231       -0.74%     0.1227       -0.41%
                         512KB         0.0959      0.0971       -1.25%     0.0958        0.10%
  gzip        3.65e9     128KB         0.0844      0.0859       -1.78%     0.0855       -1.30%
                         256KB         0.0147      0.0159       -8.16%     0.0155       -5.44%
                         512KB         0.0113      0.0119       -5.31%     0.0118       -4.42%
  h264dec     6.69e8     128KB         0.1099      0.1121       -2.00%     0.0902       17.93%
                         256KB         0.0973      0.0991       -1.85%     0.0972        0.09%
                         512KB         0.0792      0.0795       -0.40%     0.0793       -0.15%
  mcf         1.42e9     128KB         0.1593      0.1566        1.69%     0.1389       12.81%
                         256KB         0.0992      0.1005       -1.31%     0.0997       -0.50%
                         512KB         0.0616      0.0622       -0.97%     0.0620       -0.65%
  mpeg4enc    4.24e8     128KB         0.1385      0.1369        1.16%     0.1354        2.24%
                         256KB         0.0917      0.0946       -3.16%     0.0933       -1.74%
                         512KB         0.0483      0.0492       -1.80%     0.0497       -2.83%
  parser      7.65e8     128KB         0.1104      0.1112       -0.72%     0.0883       20.03%
                         256KB         0.0814      0.0822       -0.98%     0.0820       -0.74%
                         512KB         0.0688      0.0695       -1.02%     0.0691       -0.44%
  perlbmk     7.74e8     128KB         0.1453      0.1422        2.13%     0.1374        5.44%
                         256KB         0.1067      0.1071       -0.37%     0.1065        0.19%
                         512KB         0.0601      0.0622       -3.49%     0.0599        0.32%
  twolf       1.10e9     128KB         0.1211      0.1294       -6.85%     0.1131        6.61%
                         256KB         0.1148      0.1148       -0.01%     0.1125        2.00%
                         512KB         0.0748      0.0749       -0.13%     0.0746        0.27%
  vortex      3.34e10    128KB         0.1554      0.1503        3.28%     0.1434        7.72%
                         256KB         0.1349      0.1351       -0.15%     0.1289        4.45%
                         512KB         0.0866      0.0870       -0.46%     0.0865        0.08%
  vpr         6.92e8     128KB         0.1606      0.1675       -4.30%     0.1615       -0.56%
                         256KB         0.1425      0.1439       -0.98%     0.1436       -0.77%
                         512KB         0.1091      0.1101       -0.92%     0.1102       -1.01%
  Average     6.34e9     128KB         0.1337      0.1339       -0.40%     0.1246        8.03%
                         256KB         0.0975      0.0989       -2.12%     0.0950        2.29%
                         512KB         0.0624      0.0629       -1.33%     0.0621        0.58%

As a point of comparison, Figure 6.9 provides the sorted distribution of L2 accesses per cache set for the h264dec video decoder benchmark after implementing the application-specific approach. This can be compared directly to Figure 6.1, which showed data before applying the proposed cache architecture. As one can see, the distribution of cache set access rates in Figure 6.9 has become relatively more balanced compared to the original distribution, especially for the cache sets that were originally very heavily accessed (shown on the right-hand side of Figure 6.1). Additionally, the expansion mechanism was able to help attenuate the misses that occurred due to over-utilized cache sets encountering conflicts; the miss rate per cache set with the proposed architecture essentially has a flat trend and is centered just below 0.002%, whereas the original trend increased with the access frequency and got closer to 0.003% for the more highly-accessed sets.

Figure 6.9: L2 Cache Access and Miss Rate Distribution for h264dec After Using Application-Specific Approach

As mentioned earlier, in addition to the straightforward MSB-toggling mapping function, the usage of an LFSR (linear feedback shift register) function was also investigated. Whereas the MSB approach is constrained to a cycle length of 2 (i.e. the primary set maps to the secondary set, and vice versa), the benefit of an LFSR mapping function is that the cycle length can grow to be quite large. The hypothesis was to see if increasing the cycle length using an LFSR would result in a noticeable improvement to the resulting cache miss rates compared to the simple MSB approach. Table 6.9 provides 256KB cache miss rate improvements over MSB when using an LFSR with a cycle period of 1023 (i.e. a 10-bit maximal length sequence). As one can see, the improvements are extremely minuscule, being less than a thousandth of a single percent, and in some instances even worse than the MSB approach. Given this and the associated timing and logic overhead to implement the LFSR mapping function, the MSB implementation appears to be more efficient as it only requires a single-bit XOR gate and is purely combinational. Thus, the MSB mapping function is chosen for this proposal.

Table 6.9: Miss Rate Improvement Using LFSR Instead of MSB

  Benchmark    Improvement
  bzip2         6.23e-4%
  crafty        5.48e-4%
  crc32         9.62e-4%
  eon          -6.12e-5%
  fft           1.42e-5%
  gap           3.91e-5%
  gcc          -2.27e-5%
  gsm           2.01e-4%
  gzip         -5.99e-5%
  h264dec       2.12e-5%
  mcf           1.27e-4%
  mpeg4enc      3.36e-5%
  parser        1.32e-4%
  perlbmk       3.01e-5%
  twolf        -7.19e-4%
  vortex        2.94e-4%
  vpr           1.81e-5%
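For concreteness, the two candidate secondary-set mapping functions can be contrasted with the small C sketch below. The 10-bit index width and the particular tap positions (stages 10 and 7, which give the 1023-entry maximal-length sequence mentioned above) are assumptions for illustration; the dissertation does not specify the exact LFSR polynomial that was evaluated.

    #include <stdint.h>

    #define INDEX_BITS 10u                       /* e.g. 1024 L2 sets (assumed) */
    #define INDEX_MASK ((1u << INDEX_BITS) - 1u)

    /* MSB-toggle mapping: one XOR on the top index bit, cycle length 2. */
    static inline unsigned secondary_msb(unsigned index)
    {
        return index ^ (1u << (INDEX_BITS - 1));
    }

    /* 10-bit maximal-length Fibonacci LFSR step (taps at stages 10 and 7,
     * period 1023): each step maps the current index to the next index in
     * the sequence, giving a much longer cycle at the cost of extra logic.
     * Index 0 is a fixed point and would need to be special-cased. */
    static inline unsigned secondary_lfsr(unsigned index)
    {
        unsigned feedback = ((index >> 9) ^ (index >> 6)) & 1u;
        return ((index << 1) | feedback) & INDEX_MASK;
    }

The single-bit XOR of secondary_msb is purely combinational and adds essentially no delay, which is the timing argument made above for preferring the MSB mapping.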

Luckily, the slight increases seen in L2 cache miss rates for some of the benchmarks are attenuated across the entire application execution time. Since L2 accesses are typically orders of magnitude less than the cycle count for the entire application, modest increases in L2 miss rates should not devastate our run-time performance. Figure 6.10 shows the impact of the proposed adaptive architecture on the overall execution run-time. As one can see, performance is essentially maintained equivalent to the baseline (100%). Some benchmarks do incur a nominal increase in execution time, but the overhead is typically trivial. For example, the worst-case execution time benchmark (twolf using 128KB L2 with global threshold) had a slowdown of 0.05%. It is interesting to note that the four relatively large increases all occur in global threshold cases. Again, when application-specific information is utilized, the general performance of the system improves. As such, we do see an average minor improvement in run time for the application-specific thresholds, which corresponds as expected with the average miss rate reductions discussed earlier. In the best case (parser using 128KB with application-specific threshold), the run time is improved by 0.13%.

Figure 6.10: Global Application Execution Time Impact

With regard to power efficiency, the primary ambition for these mobile smartphones, positive results were observed. Since only a nominal amount of additional hardware was used, the impact of the proposed technique is quite minimal. Overall, the structures proposed account for only 1428 bytes (≈1.39KB) of additional storage elements, along with some necessary routing signals and muxing. Furthermore, the overhead of initially downloading and configuring the threshold and decay registers at the start of an application is negligible. The operating system essentially performs two register writes: one 32-bit write packing the four threshold registers (each 8 bits wide) and one 64-bit write for the DECR_reset register. As a register write takes only 1–3 processor cycles, the overhead of setting up the registers is greatly outweighed by the lifetime of the application's execution (being in billions of cycles) and thus is imperceptible in terms of performance or power overhead.

CACTI [81] and eCACTI [85] are used to estimate both the dynamic and static power consumption for our caches, assuming a conservative 65nm feature size and leakage temperature of 85°C. Furthermore, the additional power incurred while doing secondary L2 cache look-ups (on those "expanded" cache sets) is taken into account, as well as the power associated with L2 cache misses and the additional logic proposed by this architecture, including the added metadata storage and access, control logic, and flushing of any dirty cache lines on contraction. Using this information, we are able to determine the approximate total cache subsystem power consumption across the entire run-time for each of the aforementioned benchmarks.

Table 6.10 shows the percentage of power improvement attained by our proposed adaptive cache architecture for the complete execution of each benchmark, and a graphical representation is shown in Figure 6.11. As one can see, we consistently achieve significant reductions in the overall power utilization across all benchmarks. Furthermore, as one would expect, the application-specific results are generally better than our global, across-the-board threshold. On average, our application-aware adaptive cache architecture provides power reductions of 12.96%, 20.25%, and 28.52% for the 128KB, 256KB, and 512KB cache sizes, respectively. Evidently, as the L2 cache size increases, the amount of static power that can be ameliorated by gating idle cache sets becomes more prominent.

It is also interesting to note that there are a few instances where the global threshold outperforms its application-specific equivalent, as can be seen in crafty, gcc, and twolf. In all of these cases, the compiler-generated thresholds for the contraction mechanism (Tc_off and Tc_on) were less aggressive than those of the hand-selected global threshold. This is caused by the profiled temporal delay mean (TDµ) being pulled lower by a group of cache sets that are very frequently accessed, although there are still many other cache sets that are idle for long periods of time. In all of these particular instances, the temporal delay standard deviation (TDσ) was relatively larger than for the other benchmarks, indicating that there were a number of outliers skewing the results. Since our proposal relies on coarse application-wide information to simplify the necessary hardware data structures and complexity of the system, it can be susceptible to such biasing of the overall application memory characteristics.
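As a concrete illustration of the start-up cost discussed above, the operating-system-side register initialization might look like the following C sketch. The memory-mapped register addresses and the packing order are assumptions made purely for illustration; the dissertation specifies only that the four 8-bit thresholds fit in one 32-bit write and DECR_reset in one 64-bit write.

    #include <stdint.h>

    /* Illustrative memory-mapped register addresses (assumptions). */
    #define CACHE_THRESH_REG  ((volatile uint32_t *)0x4000A000u)
    #define CACHE_DECR_REG    ((volatile uint64_t *)0x4000A008u)

    /* Pack the four one-hot 8-bit thresholds into a single 32-bit write,
     * then program the decay-counter reload value: two writes in total. */
    void program_cache_thresholds(uint8_t te_on, uint8_t te_off,
                                  uint8_t tc_off, uint8_t tc_on,
                                  uint64_t decr_reset)
    {
        uint32_t packed = ((uint32_t)te_on  << 24) |
                          ((uint32_t)te_off << 16) |
                          ((uint32_t)tc_off <<  8) |
                           (uint32_t)tc_on;
        *CACHE_THRESH_REG = packed;      /* one 32-bit register write */
        *CACHE_DECR_REG   = decr_reset;  /* one 64-bit register write */
    }

    /* Example: the compiler-selected values for gzip with a 128KB L2
     * (Table 6.4) would be programmed as
     *     program_cache_thresholds(0x20, 0x10, 0x04, 0x02, 947);        */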

Table 6.10: Total Cache Subsystem Power Improvement

                              Global Threshold Improvement        Application-Specific Improvement
  Benchmark    Cache Size     Dynamic    Static     Total         Dynamic    Static     Total
  bzip2        128KB            2.53%    29.99%     16.17%          2.22%    28.09%     15.06%
               256KB            2.36%    31.29%     20.76%          1.96%    32.81%     21.58%
               512KB            2.80%    35.88%     27.36%          2.34%    36.21%     27.48%
  crafty       128KB            5.78%    33.15%     13.38%         10.59%    15.30%     11.89%
               256KB            6.99%    39.11%     19.04%          5.62%    21.61%     11.62%
               512KB            5.18%    35.45%     17.07%          4.06%    38.27%     17.49%
  crc32        128KB            4.17%    21.11%      7.86%          7.52%    21.53%     10.56%
               256KB            3.69%    31.95%     12.49%         13.05%    39.64%     21.33%
               512KB            5.67%    24.56%     13.78%          5.27%    42.18%     21.13%
  eon          128KB            3.18%    30.72%     10.89%          6.87%    22.75%     11.31%
               256KB           -0.23%    34.55%     12.77%          1.16%    40.25%     15.77%
               512KB            0.27%    31.54%     15.07%          0.47%    48.12%     23.02%
  fft          128KB            1.38%    30.22%     12.08%          5.35%    26.66%     13.26%
               256KB           -0.70%    33.35%     17.07%          4.84%    42.35%     24.41%
               512KB            0.69%    34.91%     21.69%          4.38%    46.56%     30.27%
  gap          128KB           -0.31%    31.55%      7.81%          1.95%    25.41%      7.92%
               256KB           -0.11%    24.46%      9.35%          1.14%    38.09%     15.37%
               512KB            0.08%    39.49%     18.24%          0.63%    45.10%     21.12%
  gcc          128KB           14.09%    14.06%     14.08%         10.08%    14.45%     11.41%
               256KB           17.58%    16.00%     16.91%          5.37%    25.43%     13.85%
               512KB           14.54%    24.69%     19.27%          3.84%    39.51%     20.48%
  gsm          128KB            0.31%    34.55%      7.85%          4.81%    35.81%     11.64%
               256KB            1.96%    48.09%     17.14%          3.68%    60.17%     22.28%
               512KB            1.81%    45.10%     19.24%          3.64%    62.10%     27.17%
  gzip         128KB            5.85%    28.15%     16.28%          3.56%    41.44%     21.28%
               256KB            8.57%    40.19%     28.50%          5.30%    51.39%     34.35%
               512KB            7.42%    50.54%     36.55%          4.57%    59.09%     41.41%
  h264dec      128KB            3.26%    26.58%     11.55%         11.75%    21.12%     15.08%
               256KB            4.07%    36.88%     19.22%          4.10%    43.02%     22.06%
               512KB            4.32%    51.55%     29.91%          3.79%    59.11%     33.76%
  mcf          128KB            1.17%    34.20%     16.15%          7.58%    28.29%     16.97%
               256KB           -0.20%    46.22%     27.93%          0.12%    47.23%     28.66%
               512KB            0.09%    51.87%     35.72%          0.18%    52.01%     35.84%
  mpeg4enc     128KB            2.45%    38.90%     16.30%          4.51%    37.21%     16.94%
               256KB            1.40%    42.32%     22.72%          3.78%    51.06%     28.42%
               512KB            2.16%    49.88%     31.53%          3.97%    55.56%     35.72%
  parser       128KB            4.42%    36.85%     15.32%         11.96%    21.87%     15.29%
               256KB            5.34%    35.77%     19.13%          2.70%    49.08%     23.71%
               512KB            4.91%    49.22%     28.10%          2.56%    60.40%     32.83%
  perlbmk      128KB            1.91%    28.57%     13.96%          7.00%    27.51%     16.28%
               256KB            0.65%    43.33%     25.58%          4.74%    45.22%     28.38%
               512KB            0.19%    49.18%     33.09%          4.69%    58.10%     40.56%
  twolf        128KB           -0.36%    34.29%      8.72%          7.22%    14.34%      9.09%
               256KB            3.56%    40.49%     16.38%          4.80%    23.39%     11.25%
               512KB            3.48%    55.63%     26.13%          3.99%    49.12%     23.59%
  vortex       128KB            2.44%    25.68%      8.89%          5.09%    23.54%     10.21%
               256KB            0.37%    33.09%     12.80%          2.34%    30.02%     12.85%
               512KB            0.34%    48.48%     23.24%          0.32%    56.04%     26.83%
  vpr          128KB            2.38%    14.56%      5.62%          2.35%    16.21%      6.03%
               256KB            5.20%    13.59%      8.31%          2.80%    17.93%      8.40%
               512KB            5.40%    41.69%     22.24%          2.87%    53.02%     26.15%
  Average      128KB            3.21%    29.01%     11.93%          6.50%    24.80%     12.96%
               256KB            3.56%    34.75%     18.01%          3.97%    38.75%     20.25%
               512KB            3.49%    42.33%     24.60%          3.03%    50.62%     28.52%

Figure 6.11: Total Cache Subsystem Power Improvement

In these rare cases, the benefits are slightly subdued, but the overall power improvement is still much better than the baseline.

As one can see, the trade-off between performance and power in this implementation is quite skewed; we lost only a marginal amount of performance (or in many cases even gained some performance), while gaining substantial overall power reductions. Obviously, there is a limit: if enough cache sets are gated, the performance will quickly degrade to a point where power consumption is also severely impacted (due to having to access subsequent, larger memory structures and extending the overall run-time). The key is to balance the gating of inactive cache sets, while enabling those more highly-used sets to be able to store their information. Based on our results, the compile-time profiling and generation of threshold and decay register values in an application-specific fashion is a success. Using the information gleaned during off-line profiling and embedded in the application binary allows our architecture to be more responsive to the nuances of the memory subsystem behavior of a given application.

Lastly, to put these results into perspective, a generalized estimate of the cache subsystem's power footprint within the overall processor was performed. Using a substantially modified version of Wattch [88], a power estimation of the experimental processor architecture described earlier was conducted. An instruction breakdown of 50% integer, 5% floating-point, 20% load, 15% store, and 10% control was assumed. Cache miss rates were also assumed to be 2%, 4%, and 10% for the L1 instruction cache, L1 data cache, and unified L2 cache, respectively. The percentage of the overall processor power due to the cache subsystem was estimated to be 31.91%, 43.65%, and 51.28% for the 128KB, 256KB, and 512KB L2 cache configurations, respectively.

An important aspect of cache architecture design is the timing delay involved with cache hits. The cache subsystem is typically on the critical path, and there is often very little timing slack to allow for additional complex logic calculations in the sequential access path. For example, it would be impractical to do two sequential cache lookups in a single cycle. Instead, to achieve the behavior described earlier for the expansion mechanism, the pseudo-associative secondary lookup requires an additional cycle. In essence, this secondary access behaves like an initial cache miss, stalling the pipeline for one cycle to re-access the cache in another location. The updating of a cache set's Frequency Shift Register (FSR), Expand Bit, and MRU/Gate Bit all occur in parallel with a cache access and do not impact the critical path. These modifications will be completed within the current access cycle, and will then impact cache behavior on future accesses to that given cache set. The only impact to the critical path of our cache design is the accessing of the MRU/Gate Bit to determine whether to access the primary or secondary cache location. Given the MSB-toggling mapping function we utilized, the overhead is effectively just an additional one-bit XOR gate on the most-significant bit of the cache index, mollifying any concerns of possible timing violations.

6.4 Conclusions

As shown, caches contribute a significant amount of the overall processor power budget, both in terms of dynamic and leakage power. Although much work has gone into mitigating cache power consumption, mobile processors still suffer from large caches that are necessary to bridge the growing memory latency gap. Mobile processors, being far more constrained in terms of power consumption and area, embody a unique usage model that demands continuous access while having a limited battery life. Ultra-low-power architectural solutions are required to help meet these consumer demands. We have presented a novel architecture for leveraging compile-time profiling information and significantly reducing overall cache power consumption in high-performance mobile processors. The key principle is the strengthening of the interaction between the compiler and the underlying hardware microarchitecture, allowing information not readily available during run-time (e.g. global average miss rates, access rates, and temporal access delays) to be gathered and conveyed to the hardware by the compiler. By leveraging application-specific information and tuning the cache to adaptively respond to run-time memory behavior, a significant decrease in power usage can be achieved with little sacrifice in terms of speed or area. This has been demonstrated by using a representative set of simulation benchmarks. The proposed technique has significant implications for mobile processors, especially high-performance, power-sensitive devices such as smartphones, as it significantly reduces power consumption with minimal degradation in run-time performance. End users can still enjoy their high-performance mobile device, while also enjoying longer battery life.

This framework for reducing memory subsystem power leverages two of the high-level approaches defined in Chapter 1 of this dissertation. First, the compiler-device interface is enhanced in order to extract and convey critical metadata onto the device (see Section 1.3.1). Compile-time profiling is employed to extract application-specific information about memory access patterns. This metadata is provided to the hardware microarchitecture to help guide the dynamic tuning of the hardware. Second, dynamic adaptation and tuning of the hardware is enabled (see Section 1.3.3). Using a small amount of hardware structures, run-time monitoring of cache access patterns is enabled, allowing individual cache sets to automatically expand or contract as a function of current demand. The result of this approach enables the hardware to dynamically react to the changing application behavior and ensure that only the necessary hardware resources are enabled, reducing power consumption without degrading performance.

Chapter 6, in part, is a reprint of the material as it appears in G. Bournoutian and A. Orailoglu, "Dynamic, non-linear cache architecture for power-sensitive mobile processors", Proceedings of the 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2010; and in G. Bournoutian and A. Orailoglu, "Application-Aware Adaptive Cache Architecture for Power-Sensitive Mobile Processors", ACM Transactions on Embedded Computing Systems (ACM TECS), 2013. The dissertation author was the primary investigator and author of these papers.

Chapter 7

Reducing Cache Coherence Power

Shared-memory Multi-Core System-on-Chip (MCSoC) systems are becoming prevalent in the mobile smartphone market, as they provide benefits such as low communication latency, a well-understood programming model, and a decentralized topology well-suited for multiple cores within a processing cluster. Since all the cores typically share a common bus for their memory accesses, the available bandwidth can become overwhelmed and cause performance degradation. To alleviate this problem, L1 caches are typically localized within each core, saving bus bandwidth and minimizing memory contention. An example of a typical quad-core cluster cache architecture is shown in Figure 7.1.

This replication of L1 caches introduces the possibility of data corruption if data is being shared among cores, since modifications of locally-stored data in one core may leave other cores with stale versions in their caches which should no longer be considered valid. To rectify this issue, bus-snooping cache coherency protocols are typically employed, since they are well suited for embedded MCSoC platforms using a shared high-speed memory bus. The broadcast nature of the memory requests on the shared bus enables all local cache controllers to snoop memory transactions from the other cores. Invalidation and write-back signals are used to ensure coherence between all local caches and the lower memory hierarchy (the L2 cache in this case). Unfortunately, as one encounters increased cache accesses and related cache misses due to the running of more complex and irregular application sets, the system's cache coherence power overhead also increases.


Figure 7.1: Quad-Core Cache Hierarchy

Prior research has shown that bus-snooping cache lookups can amount to 40% of the total power consumed by the cache subsystem in a multi-core processor [50]. Since caches typically account for 30-60% of the total processor area and 20-50% of the processor's power consumption, the overhead from cache snooping is rather significant across the system as a whole. Therefore, the additional power consumption inherent in bus-snooping is often prohibitive and is typically avoided in most mobile processors currently on the market. Instead, these power-sensitive mobile processors rely on purely write-through local caches or software-based coherence. Unfortunately, these approaches have their own drawbacks. Leveraging write-through local caches increases the bus contention and dynamic power when large quantities of data are modified, since every L1 cache modification requires updating the L2 cache as well. Software-based coherence provides coarse-grain enforcement, and its efficiency varies greatly based on the software's coding. Furthermore, to function accurately, it needs to predict four conditions: (i) whether two memory references are to the same location; (ii) whether two memory references are executed on different cores; (iii) whether a conditional write will actually be executed; and (iv) when a write will be executed relative to a sequence of reads. Given the increased complexity and variability of applications that are expected to run on these devices, factors including polymorphism, conditional branches, pointers, dynamic memory, interrupts, and OS context will all undermine the effectiveness of a static approach. As a result, software approaches will insert invalidation instructions at any point where there may be a possible need. This can lead to numerous unnecessary cache misses and instruction count increases. In contrast, by using a hardware-based approach, one can respond in a fine-grained manner to any given application's run-time behavior.

In this chapter, a novel approach to dealing with the unique constraints of mobile processors while enabling a dynamic, power-efficient hardware cache coherency protocol is proposed. While the processor normally operates in a baseline configuration, wherein battery life may be sacrificed in order to deliver full execution speed and responsiveness, a user may instead desire to sacrifice a small amount of that execution speed in order to prolong the overall life of the mobile device. The key to our approach is that this sacrifice of speed is negligible at first compared to the amount of power saved, allowing a continuum of large gains in power savings without significantly disrupting execution performance. When we need to go into a battery-saving mode, the cache coherence can be dynamically adjusted to reduce power while minimally impacting performance. We show the implementation of this architecture and provide experimental data taken over a general sample of complex, real-world multi-core applications to show the benefits of such an approach. The simulation results show a significant improvement in overall cache subsystem power consumption of approximately 15%, while incurring a negligible increase in execution time and area.

7.1 Motivation

The typical cache coherency protocol commonly employed in MCSoC systems is MESI, consisting of four states per cache line [89]. The Modified (M) state implies that the data is the only copy within the localized caches and that the data is newer than the lower memory hierarchy. The Exclusive (E) state is similar to the M state, except that the data is clean (e.g. the same value as stored in the lower memory level). The Shared (S) state conveys that the data may also be in other local caches, and also that the data is clean. Lastly, the Invalid (I) state means no valid data is stored in that cache line.

Figure 7.2: (a) MESI Write-Back State Diagram; (b) Simplified Write-Through State Diagram; (c) Listing of Valid States Between Pair of Caches

Figure 7.2(a) shows the state transition diagram for the MESI protocol. Explanations of the various abbreviations used are listed in Table 7.1. Each cache line is annotated with a 2-bit representation of these four states. Data reads can be satisfied from any state except I. Upon a read miss, the cache line will enter either the E or S state, depending on the value present on the Shared Signal (SS), which indicates whether at least one other cache also has the same data present.

Table 7.1: Listing of Abbreviations

  Abbreviation   Description
  RM             Read Miss in local cache
  RH             Read Hit in local cache
  WM             Write Miss in local cache
  WH             Write Hit in local cache
  SBR            Snoop Bus Read from remote cache
  SBI            Snoop Bus Invalidate from remote cache
  (WB)           Write Back dirty data
  (BI)           Send Broadcast Invalidate to all caches
  (SS)           Shared Signal value currently on bus

Data writes may only take place when in the M or E states, and only data in the M state requires a write-back operation when transitioned to another state or evicted. All valid states must snoop for any bus invalidate transactions, and upon matching, the state will transition to I. Additionally, all valid states must snoop any bus read requests for matches in order to assert the Shared Signal on the bus, as well as to transition to the S state if currently in the M or E states.

Given this, if a read miss occurs on the bus, the snoop controllers for all local caches must probe whether the requested data matches a valid entry and whether that entry is in an exclusive state (M or E), in order to respond accordingly and maintain coherence. Alternatively, a write miss or invalidation broadcast on the bus requires all other cache nodes to invalidate their corresponding entries for that given data address. This probing to identify whether an address sent on the shared bus is present in the local cache entails almost a full cache lookup; the tag arrays of the cache structure need to be accessed in order to fully resolve whether the address is a match. Furthermore, the snooping logic occurs even if the data location is not being shared among other processor cores, leading to much wasted effort if the current application is primarily private-memory based and does very few shared reads or writes. These snooping requests are the major contributing factor to the excessive power consumption of hardware cache coherency protocols.

Furthermore, it was observed that the local L1 caches exhibit unbalanced access patterns, wherein portions of the cache may be infrequently used at a given time, while other portions may be more heavily used and involved in multi-core data sharing. By responding dynamically to the application's run-time behavior, one can intelligently make modifications to the cache coherency protocol to appreciably reduce bus-snooping power overhead while minimally affecting execution performance. Indeed, by enabling a power-efficient implementation of hardware cache coherence, the execution performance will significantly improve compared to software-based or non-cacheable approaches. The goal of this chapter is the dynamic identification of those cache lines which are highly accessed in a manner that would benefit from a full write-back MESI coherency protocol, versus those that can be placed in a simplified write-through configuration to help eliminate unnecessary read-based snooping overhead, thus conserving power while preserving most of the performance quality.

7.2 Methodology

The proposed solution enables two complementary types of write behavior to occur per cache line: write-back and write-through. The benefits of a write-back policy become evident when there are numerous updates to a given local cache line. By avoiding writing each update back to the L2 cache, reductions in bus contention and L2 dynamic power are achieved. Unfortunately, cache coherence is essential for a write-back policy, as data in the shared L2 can become stale relative to newer data in a local cache. On the other hand, a fully write-through policy ensures coherence since copies in a given local cache will always match what is backed by the shared L2. However, a write-through policy suffers if there are many write operations, causing wasteful updates which congest the shared bus and waste power.

The proposal is to allow a fine-grained, hybrid interaction between write-back and write-through functionality on a per-line basis, using a dynamically adaptive architecture similar to that described in Chapter 6 and [73, 74]. When a given cache line is heavily modified in a short period of time, a write-back policy will deliver better performance. Conversely, if a given cache line is very infrequently updated, a write-through policy will deliver lower power by obviating much of the snoop-read overhead, while also not causing excessive bus contention due to write misses. Furthermore, if that line frequently transitions from the M to the S state, a write-through policy is also desirable, as this behavior would be indicative of a producer-consumer algorithm where one core writes data that another core then reads in. In this case, the benefits from write-back with regard to multiple updates are muted, and a write-through setup will deliver equivalent performance while using less power.

To implement this solution, each cache line will be annotated with a small number of additional bits to keep track of the new state information. Figure 7.3 provides a high-level view of the architectural additions. In particular, a small shift register is added to each cache line and will be used to measure the write frequency of that line. Upon accessing the cache line with a write operation, the Frequency Shift Register (FSR) is left shifted and fed a least-significant-bit (LSB) value of 1. Upon a write-back to L2 or upon the decay signal determined by a global countdown register (DECR), the FSR is right shifted and fed a most-significant-bit (MSB) value of 0. In this manner, the FSR will saturate with all 1's if the line is frequently modified, saturate with all 0's if rarely modified or written back often, or otherwise possess the property of having a continuous run of 1's of some length L starting from the LSB. This structuring of the FSR will minimize bit-flipping transitions (avoiding needless dynamic power consumption), and will greatly reduce the complexity of comparing the value in the FSR with a given threshold value. Additionally, the FSR values are initialized to all 0's upon reset or flush.
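A behavioral sketch of the per-line FSR update just described is given below in C. The struct and function names are illustrative assumptions; the shift directions follow the description above.

    #include <stdint.h>

    #define FSR_BITS 8u        /* N = 8 in the evaluated configuration */

    typedef struct {
        uint8_t fsr;           /* write-frequency shift register                  */
        uint8_t wb_enable;     /* 1 = write-back (full MESI), 0 = write-through   */
        uint8_t mesi;          /* 2-bit MESI state: M=11, E=10, S=01, I=00        */
    } line_meta_t;

    /* A write to the line: left-shift in a 1 (saturates toward all 1s). */
    static void fsr_on_write(line_meta_t *m)
    {
        m->fsr = (uint8_t)((m->fsr << 1) | 1u);
    }

    /* A write-back to L2, or the periodic decay signal derived from DECR:
     * right-shift in a 0 (decays toward all 0s). Either way the FSR keeps
     * the property of being a contiguous run of 1s starting at the LSB. */
    static void fsr_on_decay_or_writeback(line_meta_t *m)
    {
        m->fsr >>= 1;
    }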

Figure 7.3: Proposed Cache Architecture

Furthermore, a WB Enable Bit will be added to each cache line, which will determine if the line is operating in write-back (1) or write-through (0) mode. When the line is placed in write-through mode, the MESI protocol is reduced to just two states, as shown in Figure 7.2(b): Shared and Invalid. Given that the MESI states correspond to bit values of 11, 10, 01, and 00, respectively, this write-through mode is achieved simply by doing an AND between the high-order bit of the MESI metadata and the WB Enable Bit. Figure 7.2(c) shows all the possible valid states of any pair of local caches, including both in write-back (MESI/MESI), mixed (MESI/SI), and both in write-through (SI/SI).
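The two-state reduction can be expressed as the single AND gate mentioned above; a minimal illustrative helper (the function name is an assumption) is:

    #include <stdint.h>

    /* MESI encoded as M=0b11, E=0b10, S=0b01, I=0b00. With WB Enable = 0,
     * the high-order state bit is masked off, so only the S and I encodings
     * are ever visible to the coherence logic while in write-through mode. */
    static uint8_t effective_mesi(uint8_t mesi, uint8_t wb_enable)
    {
        uint8_t hi = (uint8_t)((mesi >> 1) & wb_enable & 1u);
        return (uint8_t)((hi << 1) | (mesi & 1u));
    }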

The architecture also has two global threshold registers: TWB_on (Write-Back On) and TWB_off (Write-Back Off). These threshold registers are the same size as the FSR, and will contain a single 1 in a specific bit position to indicate the threshold. Thus, the comparison of a threshold with the FSR is simply a combinational AND fed into an OR-reduction (i.e. if any bit is a 1, then the result is 1, else 0). If the FSR has met or exceeded a given threshold, it can quickly and efficiently be detected, and the cache can then make the appropriate changes to either enable or disable write-back functionality based on the particular threshold value met. To avoid ambiguity and deadlock, the threshold registers are constrained to ensure TWB_on > TWB_off. Thus, the minimum size of the threshold registers (and also the FSR) is 2 bits, and we will denote this size with N.

When the FSR reaches the TWB_on threshold (due to a temporally heavy period of cache writes), the WB Enable Bit will be set to 1, enabling write-back functionality along with the full MESI coherency protocol. When the FSR falls below TWB_off, the WB Enable Bit is set to 0, switching the line into write-through mode. In this write-through mode, the coherence protocol will be degraded in order to glean power reductions. The cache line will still snoop for write-miss invalidations as before when in the write-through S state, but the snooping for bus reads will be modified as described in the following section.
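The threshold test itself reduces to a bitwise AND followed by an OR-reduction, as described above, and the two thresholds give the policy hysteresis. A self-contained C sketch (function names are illustrative; the example threshold values are the one-hot encodings of Config 2 in Table 7.2):

    #include <stdint.h>
    #include <stdbool.h>

    /* One-hot thresholds, e.g. Config 2: T_WB_on = 0x10, T_WB_off = 0x04. */
    static bool meets(uint8_t fsr, uint8_t one_hot_threshold)
    {
        return (fsr & one_hot_threshold) != 0;   /* AND then OR-reduce */
    }

    /* Re-evaluate the per-line policy after the FSR has been updated. */
    static uint8_t update_wb_enable(uint8_t fsr, uint8_t wb_enable,
                                    uint8_t t_wb_on, uint8_t t_wb_off)
    {
        if (meets(fsr, t_wb_on))
            return 1;            /* heavy write traffic: full MESI write-back */
        if (!meets(fsr, t_wb_off))
            return 0;            /* quiet line: simplified write-through mode */
        return wb_enable;        /* between thresholds: keep current policy   */
    }

Because the FSR is always a contiguous run of 1s from the LSB, a single AND/OR-reduce against a one-hot threshold is sufficient; no magnitude comparator is required.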

7.2.1 Modified Read-Miss Snooping Behavior

Normally the MESI protocol for a cache in the S state will snoop all read misses and do a complete tag comparison to determine if the request matches, in order to assert the Shared Signal on the bus. In contrast, during write-through mode the cache will always assert the Shared Signal for the given cache line regardless of whether the tags match. Thus, in write-through mode the costly tag comparisons during read-miss snooping are completely eliminated. In other words, during write-through mode the Shared Signal is simply equal to the cache line's state value (1 if S, 0 if I).

Of course, this optimization can cause false positives if the requesting cache is in write-back mode. Since a write-through cache always asserts the Shared Signal even if it potentially does not have the matching data, a corresponding cache in write-back mode will always be forced to resolve a read miss by entering the S state instead of possibly the E state. It is important to observe that the E state is an optimization of the S state to reduce the bus invalidation overhead upon modification. Therefore, even if false positives occur, correctness will still be maintained. The only drawback is that if the write-back cache were to modify the data, since it was in the S state instead of the E state, it would need to issue a broadcast invalidate signal. An important observation is that this drawback only occurs on the first modification of a write-back line (the S to M transition). None of the subsequent writes will incur this penalty, since the line will then be in the exclusive M state.

An additional optimization is possible during a read miss in a write-through cache. The Shared Signal will be pre-set to 1 by the write-through cache during the bus read transaction, since it does not need to differentiate whether any other caches also have a valid copy of the data (e.g. since there is no E state). During this bus read, any caches in write-back mode can leverage this if they are in the S state and bypass the tag lookup if the Shared Signal is already high. The reason this works is that if a MESI cache is in the S state, it will remain in the S state on a snooped bus read. Normally, the cache would also need to do the tag comparison to determine whether to assert the Shared Signal or not, but since the cache that originated the miss is in write-through mode, it will ignore the result of the Shared Signal. Thus, we can further reduce the overhead of read-miss snooping in the write-back caches as well in this situation.
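The modified bus-read snoop path can be summarized in the behavioral C sketch below, contrasting the write-through short-circuit with the normal MESI tag probe. The names, the abstraction of the write-back of dirty data, and the single-line view are simplifications for illustration only.

    #include <stdint.h>
    #include <stdbool.h>

    /* MESI encodings: M=0b11, E=0b10, S=0b01, I=0b00. */
    enum { MESI_I = 0, MESI_S = 1, MESI_E = 2, MESI_M = 3 };

    typedef struct {
        uint8_t  mesi;
        uint8_t  wb_enable;    /* 0 = write-through mode */
        uint32_t tag;
    } snoop_line_t;

    /* Snoop handler for a bus read issued by a remote core. The return
     * value is what this cache drives onto the Shared Signal (SS). */
    bool snoop_bus_read(snoop_line_t *line, uint32_t req_tag,
                        bool shared_signal_already_high)
    {
        if (!line->wb_enable) {
            /* Write-through mode: assert SS whenever the line is valid,
             * with no tag comparison at all. False positives are safe;
             * they merely force the requester into S instead of E. */
            return line->mesi != MESI_I;
        }

        /* Write-back (MESI) mode. If the requester is itself write-through,
         * it pre-asserts SS and ignores the result, so an S-state line can
         * skip the tag probe entirely and simply stay in S. */
        if (line->mesi == MESI_S && shared_signal_already_high)
            return true;

        if (line->mesi != MESI_I && line->tag == req_tag) {
            /* An M-state line would write back its dirty data here
             * before sharing (not modeled in this sketch). */
            if (line->mesi == MESI_M || line->mesi == MESI_E)
                line->mesi = MESI_S;
            return true;       /* assert the Shared Signal */
        }
        return false;
    }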

7.2.2 Transitioning Between Policies

It is important to demonstrate that individual cache lines moving between the write-through and write-back policies will still maintain overall correctness and coherence. As described earlier, Figure 7.2(c) presents all the possible valid states of any pair of local caches. When transitioning from write-back to write-through mode, the behavior is the same as if a snoop bus read occurred. If the cache line is in the M state, a write-back is forced before transitioning into the S state. If the line is in the E state, the transition to S occurs without further ado. When transitioning from write-through to write-back mode, there are no special state changes that need to occur (e.g. the line will remain in the S or I state). As described earlier, the write-through mode will propagate any modifications to the L2 cache, maintaining coherence by still causing a write miss on the shared bus to invalidate any copies that may exist in other local write-back or write-through caches. Since the write-through mode is essentially a simplified subset of the write-back MESI protocol (e.g. using only the S and I states), the coherence and correctness of the overall system remain intact even while corresponding cache lines in different cores move independently between the write-through and write-back modes.

The rationale for this approach is leveraging the uneven, temporal nature of cache line accesses. Some cache lines will feature many closely occurring write modifications, greatly benefiting from a write-back cache. Other cache lines may just be idle and unused in a given hot spot, and being in write-through mode can avoid wasting power conducting full read-miss snoops every time another core accesses the same index. Certain shared-memory multi-core applications will behave in a simple producer-consumer handshake fashion, wherein there are no performance benefits to using write-back on those cache lines, while other lines may incur numerous writes before being read by a remote core. The key principle is that we respond dynamically in a fine-grained, temporal fashion to exploit the strengths of the different cache write and coherency policies, while still maintaining overall performance.
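The policy transitions themselves reduce to a small amount of control logic; the C sketch below illustrates them, reusing the snoop_line_t and MESI encodings from the sketch in Section 7.2.1. Abstracting the write-back operation as a callback is an assumption made purely for illustration.

    /* Switch a line from write-back (MESI) to write-through mode.
     * This behaves like a snooped bus read: an M line writes back and
     * drops to S, an E line drops to S, and S or I lines are unchanged. */
    void enter_write_through(snoop_line_t *line,
                             void (*write_back_to_l2)(const snoop_line_t *))
    {
        if (line->mesi == MESI_M)
            write_back_to_l2(line);
        if (line->mesi == MESI_M || line->mesi == MESI_E)
            line->mesi = MESI_S;
        line->wb_enable = 0;
    }

    /* Switching back to write-back mode needs no state change: the line
     * simply resumes the full MESI protocol from its current S or I state. */
    void enter_write_back(snoop_line_t *line)
    {
        line->wb_enable = 1;
    }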

7.3 Experimental Results

In order to assess the benefit from this proposed architectural design, we utilized the M5 multiprocessor simulator [90], implementing a quad-core system similar to what is shown in Figure 7.1. We chose a representative advanced mobile smartphone system configuration, having per-core 32KB L1 data and instruction caches (1024-entry, direct-mapped with a 32-byte line size), and a shared 1MB L2 cache.

In addition to the baseline, classic write-back MESI cache architecture, we provide results for six different system configuration levels to dynamically transition lines between write-through and write-back mode, each increasingly more aggressive in favor of write-through mode. The associated threshold values for each configuration are provided in Table 7.2.

Table 7.2: Configuration Threshold Values

  Configuration   TWB_on     TWB_off
  Config 0        00000010   00000001
  Config 1        00001000   00000001
  Config 2        00010000   00000100
  Config 3        00100000   00001000
  Config 4        10000000   00010000
  Config 5        10000000   01000000

Nine representative shared-memory benchmark applications from the SPLASH-2 [91] and PARSEC [92] suites are used. A listing of these benchmarks and their respective descriptions is provided in Table 7.3.

Table 7.3: Description of Benchmarks

  Benchmark      Description
  ferret         Content-based similarity search
  fft            Discrete fast Fourier transform
  fluidanimate   Incompressible fluid animation
  lu             Dense matrix triangular factorization
  radiosity      Equilibrium distribution of light
  radix          Integer radix sort
  raytrace       Three-dimensional scene rendering
  vips           Image affine transformation
  x264           H.264 video encoder

In terms of storage overhead, the dynamically reconfigurable cache architecture would require the addition of (N + 1) bits to capture the Frequency Shift Register (FSR) and WB Enable Bit on each cache line, as well as N bits for each of the two global threshold registers. Additionally, a decay countdown register (DECR) of 8 bits would be needed to generate a decay signal every 256 clock cycles. We chose an 8-bit FSR (N = 8), based on our observation that the 28 possible threshold pair combinations captured by 8 bits allow for a diverse span of ranges while not imposing prohibitive overhead. Given this, and the 1024 lines in our implementation, this results in an additional 1155 bytes (≈1.13KB) added to each L1 data cache; this is a small amount of storage overhead compared to the actual cache size of 32KB of data plus 2.38KB of tag and MESI metadata. Additionally, this implementation only needs to enhance the L1 data caches, as the instruction caches are typically read-only and would be implemented as write-through to avoid coherence overhead. Thus, only minimal modifications of the cache subsystem are necessary to leverage this implementation.
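The storage figures quoted above can be checked directly; the tag estimate assumes a 32-bit physical address, which is an assumption of this worked check rather than a figure stated in the text.

    Added metadata per L1 data cache:
      1024 lines x (N + 1) = 1024 x 9 bits = 9216 bits = 1152 bytes
      2 global thresholds  x 8 bits        =   16 bits =    2 bytes
      DECR countdown register              =    8 bits =    1 byte
      total                                            = 1155 bytes (~1.13 KB)

    Baseline metadata, for comparison:
      32-byte lines, 1024 sets, 32-bit address -> 32 - 5 - 10 = 17 tag bits
      (17 tag + 2 MESI) x 1024 lines = 19456 bits = 2432 bytes (~2.38 KB)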

Figure 7.4: Reductions in L1 Read Miss Snooping

Figure 7.4 shows the reductions to L1 read miss snooping by transitioning infrequently used cache lines into the simplified write-through mode. As cache lines are transitioned into this mode, they no longer need to process the costly tag lookups during a bus read event. Furthermore, any complement caches in write-back mode that have data in the S state also avoid unnecessary tag lookups on read misses from the write-through cache. As one can see, Config 0, which is least aggressive in terms of entering write-through mode, still results in appreciable reductions to read miss snooping. As we move to more aggressive configurations, such as Config 5, one can see the majority of benchmarks begin to enjoy reductions in snoop tag lookup overhead near 90%. The two exceptions, ferret and radix, have very high rates of sharing and data exchange, and thus many of their cache lines do not ever fall below the TWB_off threshold. The average reduction in read miss snooping overhead for the six configurations is summarized in Table 7.4.

Table 7.4: Average Read Miss Snooping Reductions and Write Miss Snooping Increases

  Configuration   Read Reduction   Write Increase
  Config 0        38.50%           1.67%
  Config 1        57.16%           2.13%
  Config 2        67.69%           2.45%
  Config 3        75.71%           3.57%
  Config 4        87.26%           5.39%
  Config 5        90.06%           7.30%

While the reductions to read miss snooping will be the major contributing factor to reducing the overall cache power utilization, we must also take into account the increase in write miss snooping that will inherently occur as more cache lines are placed in write-through operation. Write-through mode fundamentally causes more writes to be placed on the shared bus, resulting in more frequent broadcast invalidations (write misses). All local cache lines in the S state will then need to snoop to ensure they invalidate their copy of the data if present. Fortunately, an important attribute of this proposed architecture is to only allow the FSR to decrement when write operations are infrequent, or when there are many write-backs. In this fashion, we enter the write-through mode only when a large quantity of write operations is statistically improbable, lessening the chance of attenuating our gains in power reduction with additional overhead in write miss snooping. Additionally, applications are typically dominated by more read operations than writes, which will further help tip the scales in our favor.

To demonstrate these observations, Figure 7.5 shows the increases in L1 write miss snooping. As one can see, by moving from a less aggressive write-through threshold (Config 0) to a more aggressive one (Config 5), the amount of write-miss-induced snooping grows rapidly. The average increase in write miss snooping for the six configurations is shown in Table 7.4. One can begin to see how forcing write-through mode aggressively can start to diminish the returns in power savings from avoiding read miss snooping.

Figure 7.5: Increases in L1 Write Miss Snooping

Furthermore, the additional write misses incur the power overhead of having to write the data back into the L2 cache, as well as introducing increased pressure on the shared bus. Both of these factors need to be taken into account when considering the overall cache subsystem power usage.

Figure 7.6: Global Execution Time Overhead

Figure 7.6 provides the execution time across each benchmark's lifetime. The impact of additional bus contention due to increased write miss transactions is taken into account, and the impact to the overall run-time can be observed. As one can see, on average there is negligible run-time overhead when employing Config 0, Config 1, or Config 2, but with the more aggressive Config 4 and Config 5 there is a small overhead of approximately 0.32% and 0.55%, respectively. The highest overhead was observed in the x264 benchmark, having an increase of 0.96%. Luckily, the write-through operation itself does not introduce a timing cost for the processor, as it occurs seamlessly in the background.

With regard to overall cache power efficiency, our primary ambition for these cutting-edge mobile smartphones, we were able to observe excellent results. Since we only use a nominal amount of additional hardware, the impact of the proposed technique is quite minimal. Overall, the structures proposed account for only 1155 bytes (≈1.13KB) of additional storage elements, along with some necessary routing signals and muxing. We used CACTI [81] and eCACTI [85] to estimate the overall power consumption for the entire cache hierarchy, including the proposed modifications. Furthermore, we take into account the additional power incurred while doing extra L2 cache writes (due to write-through operation), as well as the power associated with the prolonged run-time due to bus contention. Using this information, we are able to determine the approximate total cache subsystem power consumption across the entire run-time for each of the aforementioned benchmarks.

Figure 7.7 shows the total cache subsystem power improvement across all six configurations for the complete execution of each benchmark. A tabular listing is provided in Table 7.5. As one can see, there is a continuum across the different configurations. As one progresses from Config 0 to Config 3, on average the benefits of power optimization increase. This is due to the increase in read miss snoop minimization. However, there is a point of diminishing returns as the biasing in favor of write-through mode increases in aggressiveness. Continuing on to Config 4 and Config 5, one can see the power benefits begin to wane. This is due to the increase in L2 and bus activity from the more frequent write-through traffic caused by the aggressive policy. Also, it can be observed that some applications are particularly susceptible to poor power performance with the more aggressive configurations. In particular, fft and radix suffer the most, mainly because these algorithms have close to the same number of writes as reads. Nevertheless, all six configurations still provide an average overall improvement to power utilization, and Config 3 appears to do the best across all benchmarks. The key is to find the ideal balance between the overhead of write miss snooping and the reductions in read miss snooping.

Figure 7.7: Total Cache Subsystem Power Improvement

Table 7.5: Total Cache Power Improvement

  Benchmark      Config 0   Config 1   Config 2   Config 3   Config 4   Config 5
  ferret          16.41%     20.84%     23.05%     25.67%     27.59%     27.11%
  fft              6.16%      0.98%      1.52%     -2.39%     -2.02%     -6.31%
  fluidanimate    13.69%     22.22%     26.11%     29.58%     28.82%     28.51%
  lu               4.78%     10.45%     11.25%     11.52%     11.10%      8.55%
  radiosity        7.36%      8.85%     13.71%     14.78%     15.25%     14.56%
  radix           -0.19%      0.36%     -0.36%     -1.89%    -11.30%    -29.12%
  raytrace         8.60%     15.07%     16.99%     16.56%     18.24%     17.15%
  vips            12.95%     20.39%     24.44%     27.19%     26.50%     25.23%
  x264             7.72%     12.29%     12.97%     14.05%     18.31%     16.95%
  Average          8.61%     12.38%     14.41%     15.01%     14.72%     11.41%

7.4 Conclusions

As shown, caches contribute a significant amount of the overall processor power budget. Although much work has gone into mitigating cache power consumption, mobile processors still suffer from large caches that are necessary to bridge the growing memory latency gap. Moreover, these smartphones are beginning to fully leverage the multi-core topology with shared memory accesses, and cache coherence will further exacerbate the power overhead of caches. Mobile smartphones and tablets, being far more constrained in terms of power consumption and area, embody a unique usage model that demands continuous access while having a limited battery life. Ultra-low-power architectural solutions are required to help meet these consumer demands.

A novel architecture for significantly reducing overall cache power consumption in high-performance, multi-core mobile processors has been presented. By using varying configurations, a notable increase in memory performance can be achieved by enabling complete hardware MESI cache coherency when plugged in or fully charged, while a significant decrease in power usage can be achieved when the user switches the device into a low-power mode, wherein much of the read miss snooping overhead can be obviated with little impact to performance. This has been demonstrated by using an extensive and representative set of shared-memory simulation benchmarks, and six example configurations. The proposed technique has significant implications for mobile processors, especially high-performance, power-sensitive devices such as next-generation smartphones and tablets. It significantly reduces power consumption while minimally degrading run-time performance, which in turn will permit full hardware cache coherence to be utilized in future mobile processors without the associated power overhead cost.

This framework for reducing cache coherence power leverages the approach of dynamic adaptation and tuning of the hardware defined earlier in this dissertation in Section 1.3.3. Similar to the approach demonstrated in Chapter 6, hardware structures are introduced to provide run-time monitoring of cache coherence patterns, allowing individual cache lines to disable coherence when doing so would result in power savings with little impact to performance. The result of this approach enables the hardware to dynamically react to the changing application behavior and ensure that only the necessary hardware resources are enabled, reducing power consumption without degrading performance.

Chapter 7, in part, is a reprint of the material as it appears in G. Bournoutian and A. Orailoglu, “Dynamic, multi-core cache coherence architecture for power-sensitive mobile processors”, Proceedings of the 2011 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2011. The dissertation author was the primary investigator and author of this paper.

Chapter 8

Reducing Instruction Pipeline Power

As mentioned in Chapter 1, high-performance mobile processors are beginning to employ heterogeneous processor topologies in order to provide a continuum of computational resources that can handle the wide range of variability that occurs in the mobile domain. These heterogeneous processor topologies typically utilize a cluster of “Little” processor cores optimized for low power, as well as a cluster of “Big” processor cores targeting higher performance at the cost of higher power dissipation (as shown in Figure 1.1). An example of such a topology is the ARM big.LITTLE architecture, which incorporates high-performance Cortex-A15 “big” processors with low-power Cortex-A7 “LITTLE” processors [10].

In general, the “Little” processor leverages an in-order architecture with a simple pipeline. For example, the ARM Cortex-A7 has a pipeline length of 8–10 stages. On the other hand, the “Big” processor usually employs a fully speculative, out-of-order architecture with a deeper pipeline. For comparison, the ARM Cortex-A15 supports register renaming and has a pipeline length of 15–24 stages. Furthermore, to improve ILP the Cortex-A15 provides multiple, parallel pipeline paths for various instruction types. These microarchitectural differences are one of the main reasons for the large increase in energy consumption compared to the “Little” processor.

In this chapter, I propose leveraging the same principles expounded throughout this dissertation to help reduce power by dynamically adapting the instruction pipeline in the more power-hungry out-of-order processors. Leveraging the unique characteristics of high-performance mobile processor architectures, a fine-grained adaptive hardware control mechanism is developed. The microarchitecture will automatically shut down individual pipeline paths during periods of reduced utilization. Upon subsequent demand, these pipeline paths can be re-enabled in a manner that avoids any performance penalties. Furthermore, the aggressiveness of shutting down pipeline paths will be guided by an application-specific code analysis. The result will be a significant reduction of wasted power from idle pipeline paths without any negative impacts to performance.

8.1 Motivation

The larger, high-performance cores in the heterogeneous processor topology typically consist of an out-of-order architecture with a relatively deep pipeline. The incentive of having an out-of-order processor is to allow execution around data hazards in order to improve performance. The effectiveness of such an architecture is often limited by how far it can “look ahead” by placing decoded instructions into an issue queue used to identify those instructions whose dependencies are completely resolved. Unfortunately, in mobile processors the issue queue is often limited by cycle-time constraints to roughly 8 entries, which severely limits the effectiveness of the architecture.

In order to overcome this limitation, it is common practice to employ multiple, smaller issue queues. The execution is broken down into multiple clusters defined by instruction type, each with its own issue queue. Instructions will be dispatched to the appropriate issue queue, and all issue queues are then scanned in parallel to identify instructions ready for execution. A typical mobile out-of-order processor pipeline is shown in Figure 8.1, which is similar to that used in the ARM Cortex-A15 and Qualcomm Krait processors. As one can see, after instructions are decoded, they will be dispatched into an issue queue for the appropriate instruction type. Once the instruction’s dependencies are resolved, it will be issued and executed. Each pipeline has its own separate issue queue and can issue independently from the other pipelines. Certain instruction types can have more than one pipeline in order to increase parallelism by ameliorating structural hazards. For example, in the architecture shown in Figure 8.1, the Integer and Floating-Point instruction types are provided two parallel pipelines.

Figure 8.1: Typical Mobile Out-of-Order Processor Pipeline

Under ideal circumstances, the instruction mix of an application will be well-balanced and essentially match the physically available pipeline functional units. In this case, the pipeline will deliver a substantial amount of ILP and hardware resources will be well utilized. Unfortunately, it is rare for an application to follow this ideal. For example, one may have an application that completely avoids using any floating-point instructions. Yet, the physical hardware for two entire floating-point pipelines is present. During the course of executing this application, a large portion of the hardware will be idle and wasting precious battery life. Even while nothing is executing within these pipelines, substantial power is still consumed by the issue queue logic checking for valid entries to issue, and the functional units themselves will consume some power as they idle.

To demonstrate the possible skewed distribution of instruction types, the instruction mix of all SPEC CPU2000 [77] integer and floating-point benchmarks is shown in Figure 8.2 and Figure 8.3, respectively. As one would expect, the integer benchmarks very rarely make use of any floating-point operations, and it would be beneficial to have some mechanism to disable the floating-point pipelines to conserve power. Furthermore, for those instruction types with multiple parallel pipelines, a given application may be unable to actually exploit any ILP benefit of this additional hardware unless a sufficiently large number of those instructions is present within the execution window.

Given these observations, it becomes clear that an adaptive approach is necessary to help tune the microarchitecture in order to conserve power. Each individual instruction pipeline should be monitored to determine if there is a sufficient demand for keeping it enabled. If these structures remain idle, an automated mechanism should exist to shut down the issue queue and execution logic in order to avoid both dynamic and static power dissipation.

Figure 8.2: SPEC Integer Benchmark Instruction Mix

Figure 8.3: SPEC Floating-Point Benchmark Instruction Mix

8.2 Implementation

In order to conserve power, one would like to dynamically adapt the hardware pipeline to best match the computational needs of a specific application. This application-specific tailoring of the hardware microarchitecture needs to be fine-grained and able to efficiently respond to changes in the application’s execution patterns. Furthermore, it is essential that any proposed additional hardware logic itself be frugal so as not to countermand the reductions in power we are trying to achieve. The following subsections describe the hardware architecture to enable a fine-grained, adaptive instruction pipeline as well as a software-driven approach to determining the aggressiveness of the hardware’s adaptation.

8.2.1 Hardware Architecture

The proposed hardware architecture to enable a fine-grained, adaptive instruction pipeline is shown in Figure 8.4. This figure illustrates the various hardware mechanisms that will be added to each individual instruction pipeline. As shown in Figure 8.1, there are typically 8 separate instruction pipelines, each of which will follow this same approach.

Figure 8.4: Proposed Adaptive Pipeline Architecture

The first new hardware structure is a Pipeline Usage Register. The purpose of this register is to indicate the temporal utilization of a given pipeline path. This register will initially be populated with a non-zero value, and over time that value will decrease towards zero if the pipeline is idle. Once this register becomes zero, a Disable Signal will be asserted high, which will disable the issue queue, issue stage, execution stage, and writeback stage of the pipeline. These disabled hardware structures will have their supply voltage gated as described in [47], obviating both dynamic switching power and any static leakage power.

The Pipeline Usage Register is implemented as a shift register (similar in fashion to the FSR described earlier in Chapter 6). Upon receiving the R-Shift Trigger signal, the Pipeline Usage Register is right shifted and fed a most-significant-bit (MSB) value of 0. The R-Shift Trigger signal is generated by a Frequency Divider that takes the pipeline clock and divides it by 8. In this fashion, the R-Shift Trigger signal will occur every 8 pipeline cycles. This mechanism determines that a given pipeline has been idle for a period of time and causes the Pipeline Usage Register to become zero, in turn shutting down the pipeline hardware structures.

The reciprocal logic that marks a pipeline as actively in use is controlled by the Reset Trigger signal. This reset signal will be asserted during the decode stage one cycle before the instruction is dispatched into an issue queue. Upon the Reset Trigger signal going high, the Pipeline Usage Register will be reset to the value specified in the Reset Value Register. This logic not only ensures that an active pipeline is not turned off (by keeping the Pipeline Usage Register non-zero), but also re-enables a disabled pipeline without causing any performance penalty. The decode stage typically takes multiple cycles (7 in our example architecture), since the instruction is not only decoded, but other logic like register renaming is performed as well. Given this, the instruction type (i.e., integer, floating-point, branch, etc.) is determined rather early in the decode stage (based on the instruction opcode), and this information can be conveyed to the Pipeline Usage Register while register renaming is being done. Based on this observation, the Reset Trigger signal can be generated one cycle prior to the instruction being dispatched into an issue queue. This allows a possibly disabled issue queue to be re-enabled before a dispatched instruction is sent to it.

The Reset Value Register will possess the property of having a continuous run of 1’s of some length L starting from the LSB. The longer the length L, the longer the pipeline must remain idle before it triggers an automatic shut down. This structuring of the Reset Value Register minimizes bit-flipping transitions within the Pipeline Usage Register (avoiding needless dynamic power consumption): since the Pipeline Usage Register is repeatedly right-shifted, each right-shift incurs only a single bit-flip.
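The following behavioral sketch models the interaction of the Pipeline Usage Register, the divide-by-8 R-Shift Trigger, and the Reset Trigger described above; it is illustrative only, and the struct and function names are assumptions rather than part of the actual hardware design.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Behavioral sketch of the per-pipeline monitoring logic described above.
 * The 8-bit width, divide-by-8 interval, and example reset value follow the
 * text; struct and function names are illustrative only. */

#define SHIFT_INTERVAL 8            /* R-Shift Trigger fires every 8 pipeline cycles */

struct pipeline_monitor {
    uint8_t  usage;                 /* Pipeline Usage Register (shift register)      */
    uint8_t  reset_value;           /* Reset Value Register, e.g. 0x03, 0x1F, 0xFF   */
    unsigned divider;               /* frequency-divider counter                     */
};

/* Disable Signal: asserted when the usage register is all zeros (in hardware,
 * a NOR across its bits rather than a comparator). */
static bool disable_signal(const struct pipeline_monitor *m) {
    return m->usage == 0;
}

/* Called once per pipeline clock.  'dispatching' models the Reset Trigger,
 * raised one cycle before an instruction of this type enters the issue queue. */
static void monitor_tick(struct pipeline_monitor *m, bool dispatching) {
    if (dispatching) {
        m->usage = m->reset_value;  /* reload and (re-)enable the pipeline */
        return;
    }
    if (++m->divider == SHIFT_INTERVAL) {
        m->divider = 0;
        m->usage >>= 1;             /* right shift, MSB filled with 0 */
    }
}

int main(void) {
    /* Idle floating-point pipeline with the most aggressive reset value. */
    struct pipeline_monitor fp = { .usage = 0x03, .reset_value = 0x03, .divider = 0 };
    bool reported = false;
    for (unsigned cycle = 0; cycle < 40; ++cycle) {
        monitor_tick(&fp, false);   /* no FP instructions dispatched */
        if (disable_signal(&fp) && !reported) {
            printf("cycle %u: pipeline gated off after two idle intervals\n", cycle);
            reported = true;
        }
    }
    return 0;
}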

A further power-reducing optimization is the circuitry used to determine when the Pipeline Usage Register is zero. Instead of using a relatively expensive comparator circuit, all the bits within the Pipeline Usage Register can simply be fed into a NOR gate, whose output becomes 1 only when all the input bits are 0.

Lastly, we can further optimize the instruction types with multiple pipelines (e.g., Integer and Floating-Point). In these cases, there may not be sufficient instructions to merit having two parallel pipeline paths. In order to gracefully account for this situation, the dispatch logic is slightly updated. Instead of randomly selecting one of the issue queues to dispatch the instruction to, the multiple issue queues are prioritized. Only when the first issue queue is full will the instruction be dispatched into the next queue. In this fashion, if one issue queue is able to support the instruction stream without becoming inundated, the second issue queue will remain idle, causing it to turn off. If the instruction demand increases and spills over the first issue queue, then the second issue queue can service the instructions as before. Performance remains unaffected by this change. Rather, this prioritization of issue queues essentially defragments and compresses the instructions into one queue before needing to expand into another, helping the second pipeline remain idle and be turned off to conserve power.

For the purposes of this proposal, the Pipeline Usage Register and Reset Value Register were chosen to be 8 bits. Given that the clock divider interval for right-shifting the Pipeline Usage Register is set to 8 pipeline cycles, the minimum safe value for the Reset Value Register is 00000011. This is because at least two right-shift intervals need to occur before the pipeline is shut down, ensuring any active executions have sufficient time to complete and exit the pipeline. Given that the longest pipeline in our design is the Floating-Point pipeline, taking up to 12 cycles once leaving the issue queue, having two intervals of 8 pipeline cycles ensures that the last instruction in that pipeline has completed before the pipeline is disabled. This greatly simplifies the shutdown logic, since no querying within the execution stages needs to occur before the pipeline is turned off. In the case of pipeline stalls, the same stalling mechanism also prevents the clock divider from advancing, ensuring consistency in the timing.
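A minimal sketch of the prioritized dispatch policy for instruction types with two parallel pipelines, as described above, is shown next; the queue representation and names are illustrative assumptions, with only the 8-entry issue queue size taken from Table 8.1.

#include <stdbool.h>
#include <stdio.h>

/* Sketch of the prioritized dispatch policy for an instruction type with two
 * parallel pipelines: the second issue queue is used only when the first is
 * full, so under light load it stays idle and can be gated off. */

#define IQ_ENTRIES 8                /* issue-queue entries per pipeline (Table 8.1) */

struct issue_queue {
    int occupancy;
};

static bool iq_full(const struct issue_queue *q) {
    return q->occupancy >= IQ_ENTRIES;
}

/* Returns the index of the queue the instruction was dispatched to,
 * or -1 on a dispatch stall when both queues are full. */
static int dispatch(struct issue_queue q[2]) {
    for (int i = 0; i < 2; ++i) {   /* always prefer the lower-numbered queue */
        if (!iq_full(&q[i])) {
            q[i].occupancy++;
            return i;
        }
    }
    return -1;
}

int main(void) {
    struct issue_queue intq[2] = { {0}, {0} };
    for (int n = 0; n < 10; ++n) {  /* no instructions issue in this toy example */
        int target = dispatch(intq);
        printf("instruction %d -> %s\n", n,
               target == 0 ? "IQ0" : target == 1 ? "IQ1 (spill)" : "stall");
    }
    return 0;
}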

8.2.2 Software-Driven Reset Thresholds

The prior section described the microarchitectural design to enable the adaptive instruction pipeline. However, instead of arbitrarily selecting the value for the Reset Value Register and hard-coding it across all the pipelines and even across different applications, one would like to make a more intelligent selection. Looking back at the instruction mixes shown for the benchmarks in Figure 8.2 and Figure 8.3, a logical extension would be to leverage this information to help guide the selection of the reset threshold value. The general observation is that if the quantity of a particular instruction type is quite low, there is a higher probability that the pipeline for that instruction type will be idle. Furthermore, when the rare instruction type does occur, it will most likely be sporadic, and shutting down the pipeline sooner rather than later can help maximize power savings. Thus, it is proposed to set the Reset Value Register to the following values based on the relative percentage of the instruction type (a minimal sketch of this mapping follows the list):

• If the instruction type is < 5%, Reset Value Register = 00000011
• Else if the instruction type is < 20%, Reset Value Register = 00011111
• Else, Reset Value Register = 11111111
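The sketch below illustrates this threshold mapping, assuming the instruction-type share is expressed as a fraction of the total static instruction count; only the 5% and 20% cut-offs and the three register encodings come from the list above, and the function name is hypothetical.

#include <stdint.h>
#include <stdio.h>

/* Maps an instruction type's share of the static instruction mix to a Reset
 * Value Register encoding, per the 5% / 20% thresholds listed above. */
static uint8_t reset_value_for_mix(double fraction) {
    if (fraction < 0.05) return 0x03;   /* 00000011: rare type, gate aggressively     */
    if (fraction < 0.20) return 0x1F;   /* 00011111: moderate use                     */
    return 0xFF;                        /* 11111111: common type, gate conservatively */
}

int main(void) {
    printf("FP  mix  2%%  -> 0x%02X\n", (unsigned)reset_value_for_mix(0.02));
    printf("LD  mix 18%%  -> 0x%02X\n", (unsigned)reset_value_for_mix(0.18));
    printf("INT mix 55%%  -> 0x%02X\n", (unsigned)reset_value_for_mix(0.55));
    return 0;
}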

Figure 8.5: Overview of Compiler and Hardware Interaction

In order to determine the instruction type distribution of an application, including all foundation libraries, on-device code analysis is employed. A simple post-processing script can read the binary code and identify the opcodes for each instruction type. This information will then be passed to the underlying hardware microarchitecture to populate the values for the various Reset Value Registers associated with each instruction type. A high-level overview of this architecture was presented in Section 1.3.1 and is shown in Figure 8.5.
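A hedged sketch of such a post-processing pass is shown below; it assumes a dumped text section with a fixed-width 32-bit encoding and uses a placeholder opcode classifier, since the actual decode rules depend on the target ISA.

#include <stdint.h>
#include <stdio.h>

/* Sketch of the on-device post-processing pass: walk a dumped text section,
 * classify each opcode, and report the per-type instruction mix.  A fixed
 * 32-bit encoding is assumed, and classify() is a placeholder; a real pass
 * would decode the opcode fields of the target ISA. */

enum itype { IT_INT, IT_MULT, IT_FP, IT_BRANCH, IT_LDST, IT_OTHER, IT_COUNT };

static enum itype classify(uint32_t insn) {
    switch ((insn >> 26) & 0x3F) {      /* hypothetical major-opcode field */
        case 0x00: return IT_INT;
        case 0x01: return IT_MULT;
        case 0x02: return IT_FP;
        case 0x03: return IT_BRANCH;
        case 0x04: return IT_LDST;
        default:   return IT_OTHER;
    }
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <text-section.bin>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }
    unsigned long counts[IT_COUNT] = { 0 }, total = 0;
    uint32_t insn;
    while (fread(&insn, sizeof insn, 1, f) == 1) {
        counts[classify(insn)]++;
        total++;
    }
    fclose(f);

    static const char *names[IT_COUNT] =
        { "INT", "MULT", "FP", "BRANCH", "LD/ST", "OTHER" };
    for (int t = 0; t < IT_COUNT; ++t)
        printf("%-7s %6.2f%%\n", names[t],
               total ? 100.0 * counts[t] / total : 0.0);
    return 0;
}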

8.3 Experimental Results

In order to assess the benefit from this proposed architectural design, we utilized the SimpleScalar framework [76]. The stock code initially utilized a basic register update unit (RUU) structure, combining the reorder buffer (ROB) and reservation stations, and provided no register renaming. In order to match the target architecture and fully exploit possible parallelism and instruction throughput, the default sim-outorder simulator was modified to implement a fully speculative Tomasulo architecture [93], including register renaming and decentralized issue queues. Furthermore, the simulator is augmented with the adaptive instruction pipeline logic proposed in this chapter and also incorporates a heavily customized version of the Wattch power analysis framework [88]. Table 8.1 summarizes the system configuration parameters, reflecting a typical high-performance mobile processor.

Table 8.1: Hardware Configuration Parameters

Fetch Stages                   5
Decode Stages                  7
Issue Stages                   1
Execution Stages, INT          1
Execution Stages, MULT         4
Execution Stages, FP ADD/SUB   2
Execution Stages, FP MUL       6
Execution Stages, FP DIV       10
Execution Stages, BRANCH       1
Execution Stages, LD/ST        4
Writeback Stages               1
Issue Queue Entries            8
Instruction L1 cache           64 KB, 4-way set-associative
Data L1 cache                  64 KB, 4-way set-associative
Unified L2 cache               2 MB, 8-way set-associative
Number of Pipelines            2 INT, 1 MULT, 2 FP, 1 BRANCH, 1 LD, 1 ST

The complete SPEC CPU2000 benchmark suite [77] is used, providing 12 integer and 14 floating-point real-world applications. A listing of these benchmarks and their respective descriptions is provided in Table 8.2. The benchmarks are cross-compiled for the PISA instruction set using the highest level of optimization available for the language-specific compiler. The reference inputs are used for each benchmark, with each benchmark executed in its entirety from start to finish.

Table 8.2: Description of Benchmarks

Integer benchmarks:
bzip2      Memory-based compression algorithm
crafty     High-performance chess playing game
eon        Probabilistic ray tracer visualization
gap        Group theory interpreter
gcc        GNU C compiler
gzip       LZ77 compression algorithm
mcf        Vehicle scheduling combinatorial optimization
parser     Dictionary-based word processing
perlbmk    Perl programming language interpreter
twolf      CAD place-and-route simulation
vortex     Object-oriented database transactions
vpr        FPGA circuit placement and routing

Floating-point benchmarks:
ammp       Computational chemistry
applu      Parabolic partial differential equations
apsi       Meteorology pollutant distribution
art        Image recognition / neural networks
equake     Seismic wave propagation simulation
facerec    Face recognition image processing
fma3d      Finite-element crash simulation
galgel     Computational fluid dynamics
lucas      Number theory / primality testing
mesa       3D graphics library
mgrid      Multi-grid 3D potential field solver
sixtrack   High energy nuclear physics accelerator
swim       Shallow water modeling
wupwise    Physics / quantum chromodynamics

In order to assess the proposed architecture, the complete pipeline power utilization is analyzed, including both dynamic and static power. The additional power overhead incurred for enabling the adaptive pipeline, such as from registers, control logic, and transitioning pipelines off and on, is also incorporated into the results. Figure 8.6 shows the power reduction across the pipeline logic observed for the integer benchmarks. The average power savings for the integer benchmarks is 27%. As one can see, the proposed architecture is able to eliminate a substantial amount of power consumption by adaptively disabling portions of the pipeline when they are not actively needed. In particular, most of the integer benchmarks rarely, if ever, use the floating-point pipeline, which can account for approximately 20% of the energy in the pipeline logic. Furthermore, these power savings come at no cost to processor performance. Instruction throughput is not degraded using this proposal, since anytime disabled pipeline resources are needed they will be immediately re-enabled.

Figure 8.6: Integer Benchmarks Pipeline Power Reduction

Similarly, the power savings for the floating-point benchmarks are shown in Figure 8.7. The average power savings is approximately 19%, which is still noteworthy. As one would expect, the benefit is slightly less than in the integer case, since the floating-point pipeline needs to be enabled much more often. However, the adaptive logic is still able to identify excess hardware resources that occur throughout the execution lifetime and intelligently turn them off to save power.

Figure 8.7: Floating-Point Benchmarks Pipeline Power Reduction

8.4 Conclusions

High-performance mobile processors typically employ more power-hungry out-of-order processors with deep pipelines in order to meet peak demands. However, not all applications possess the exact same instruction mix, leading to uneven physical resource allocations within the processor pipeline. Given that power is a critical concern for mobile devices, conserving power without negatively impacting performance is a top priority.

A novel adaptive instruction pipeline architecture for mobile processors has been presented. This adaptive architecture leverages the unique characteristics of high-performance processor microarchitecture design to propose a frugal, dynamically adaptive mechanism that enables fine-grained pipeline gating. Based on run-time utilization, idle pipelines and associated issue queues are turned off to reduce dynamic and leakage power. The proposed microarchitecture automatically shuts down individual pipeline paths during periods of reduced utilization. Upon subsequent demand, these pipeline paths are preemptively re-enabled to completely avoid any performance penalties. Application-specific code analysis is also leveraged to guide the aggressiveness of the pipeline gating. The results demonstrate that a substantial amount of power savings can be achieved without any impact to performance.

This framework for dynamically adapting the pipeline architecture leverages two of the high-level approaches defined earlier in this dissertation in Chapter 1. First, the compiler-device interface is enhanced in order to extract and convey critical metadata onto the device (see Section 1.3.1). High-level metadata about the instruction type mixture is extracted based on the downloaded application and the device-specific operating system foundation libraries in order to create a complete view of the program. This metadata is provided to the underlying hardware microarchitecture to help guide run-time adaptation. Second, dynamic adaptation and tuning of the hardware is enabled (see Section 1.3.3). Using a small amount of hardware structures, run-time monitoring of pipeline path utilization is enabled, allowing idle pipeline paths to automatically be disabled in order to conserve power. The result of this approach enables the hardware to dynamically react to changing application behavior and ensure that only the necessary hardware resources are enabled, reducing power consumption without degrading performance.

Chapter 9

Conclusions

As demonstrated throughout this dissertation, the high-performance mobile processor domain is unique in a number of ways. The distributed software ecosystem provides robust applications that heavily rely on locally distributed framework libraries. These devices contain numerous sensors, such as accelerometers, gyroscopes, magnetometers, GPS, and proximity and ambient light detectors. They are always on and always connected, continuously communicating and updating information in the background, while also being used for periods of intensive computational tasks like playing video games, providing interactive navigation, and recording videos. The peak performance that is demanded of these devices rivals that of a high-performance desktop, while most of the time a much lower level of performance is required. Given this, heterogeneous processor topologies have been introduced to handle these large swings in performance demands. Additionally, these devices need to be compact and easily carried on a person, so challenges exist in terms of area and heat dissipation. Indeed, unlike a standard desktop or laptop computer, these devices have no cooling fans and must rely on passive cooling alone. Consequently, many of the microarchitectural hardware structures found in these mobile devices are often smaller or less complex than their desktop equivalents, yet still consume an appreciable amount of power.

This dissertation addresses the necessity for an effective and efficient optimization framework for next-generation heterogeneous mobile application processors. These high-end devices demand exceptional performance, yet must also be extremely power efficient to be commercially viable. These opposing forces necessitate innovative solutions that enable peak performance on demand, while ensuring a fastidious attention to power efficiency.

While high-level heterogeneous topologies can help overcome large swings in performance characteristics, there is still a need to provide more fine-grained optimizations within each processor domain. A generalized, three-pronged optimization framework has been presented and is able to provide improvements across a broad range of power-hungry microarchitectural structures, including branch prediction, instruction and data caches, and instruction pipelines. The high-level goal of these three approaches is to extend the continuum of the heterogeneous processor topology and provide additional granularity to help deliver the necessary performance for the least amount of power during execution.

The first optimization approach, described in Section 1.3.2, leveraged static code analysis and compiler techniques in order to transform the application into a more efficient instruction sequence while maintaining equivalent functionality. Using such static techniques, inefficient code is optimized in order to help reduce power consumption and improve performance. A novel mobile ecosystem has been demonstrated, where on-device static code optimization can combine knowledge of user applications with operating system foundation libraries, enabling a complete view of all program code. Using this global program knowledge, interprocedural optimizations that streamline control flow, such as statically resolving dynamic dispatch sites, are accomplished. Furthermore, instruction ordering is adjusted to help ameliorate run-time data hazards, such as pipeline stalling due to cache misses.

The second optimization approach enhances the hardware microarchitecture to dynamically detect and adapt to changes while executing the application, as described in Section 1.3.3. Using this paradigm, a novel architecture is presented that exemplifies how minimal hardware structures can be used to identify usage patterns during execution and reconfigure the microarchitecture in order to save power. These dynamic adaptation techniques can help extend the continuum of the heterogeneous processor topologies that are unique to mobile devices, allowing the hardware to dynamically react to fine-grained changes in application behavior and ensure that only the necessary hardware resources are enabled. A cache architecture that allows fine-grained dynamic expansion and contraction of storage space is proposed in order to help improve performance while eliminating leakage power. An innovative hardware-based cache coherence protocol is presented, which is able to dynamically determine whether to use localized storage and coherence, or to force immediate synchronization into shared memory to reduce power dissipation overheads. Similarly, a reconfigurable instruction pipeline is proposed which enables dynamically shrinking the pipeline width based on functional unit demand, helping obviate power from idle pipeline paths.

The third optimization approach involves melding the two aforementioned approaches. As proposed in Section 1.3.1, high-level static information garnered during compile time is conveyed into the hardware microarchitecture to help more intelligently guide dynamic run-time adaptation.
An innovative solution to memory-based pipeline stalls is demonstrated by extracting compile-time information regarding data dependencies for individual load instructions, and conveying this information to the hardware in order to provide knowledge of available non-dependent instructions that can continue to be executed. Another proposal demonstrated the benefits of combining static information with hardware execution in order to eliminate costly accesses to the branch target buffer (BTB) every cycle.

The proposed methodologies are complementary and can be combined together in order to deliver significant power reductions without sacrificing performance. The experimental results based on real-world hardware architectures and representative benchmark suites evince the effectiveness of the proposed methods.

Figure 9.1 shows a summary of the high-level software and architectural optimization framework presented in this dissertation. First, the unique characteristics of the mobile ecosystem are taken into account. As illustrated in Chapter 1, one of the main challenges in the mobile software ecosystem relates to the distributed nature of the application code. Applications developed for smartphones and tablets heavily rely on device-specific foundation libraries. While the Application Binary Interface (ABI) remains consistent, the underlying code within the foundation libraries can greatly differ across individual devices. A novel approach is developed that preserves high-level application information and conveys this information onto the device to be combined with device-specific foundation libraries. With this new unified program view, code transformation and optimization techniques are employed to provide targeted improvements to many facets of the mobile microarchitecture.

Figure 9.1: Summary of High-Level Software and Architectural Optimization Framework

Static code re-ordering and polymorphic call transformations deliver improvements to the execution pipeline, caches, and branch prediction hardware. Furthermore, the mobile ecosystem analysis provides the ability to extract high-level static application-specific information. Information such as memory access patterns, control flow graphs, and data dependencies of memory instructions can now easily be gleaned. This statically extracted metadata is conveyed to the underlying hardware microarchitecture in order to intelligently guide dynamic adaptation of the microarchitecture based on run-time events. For example, the caches and branch prediction hardware will use this high-level application-specific information to guide their dynamic reconfiguration. This unified software and architectural optimization framework is able to address the unique characteristics of high-performance mobile devices and provide application-specific customizations to improve performance and reduce power consumption.

As the ubiquity of mobile smartphones and tablets continues to increase, the ability to continually deliver more computational horsepower using the least amount of battery life becomes a major differentiating factor for the mobile SoC industry. Consumers demand devices that are able to provide an exceptional user experience while maintaining battery longevity. In order to remain competitive in this evolving marketplace, it is paramount to provide targeted, fine-grained optimizations that accentuate both performance and power improvements. As the demands for higher performance and lower power continue to rise, the benefits of the proposed methodologies are only expected to increase further. By identifying and combining both microarchitectural and software optimizations, performance can be improved while power dissipation is significantly ameliorated, leading to an improved competitive position in the marketplace.

Bibliography

[1] International Data Corporation, “Worldwide smartphone shipments top one billion units for the first time, according to IDC,” Jan. 2014. [Online]. Available: https://www.idc.com/getdoc.jsp?containerId=prUS24645514

[2] Qualcomm Technologies, Inc., “Qualcomm Snapdragon 800 processors,” Apr. 2014. [Online]. Available: http://www.qualcomm.com/snapdragon/processors/800

[3] Samsung Electronics Co., Ltd., “Samsung Exynos 5 Octa,” Apr. 2014. [Online]. Available: http://www.samsung.com/global/business/semiconductor/minisite/Exynos/products5octa_5420.html

[4] ARM Holdings, “ARM Cortex-A15,” Apr. 2014. [Online]. Available: http://www.arm.com/products/processors/cortex-a/cortex-a15.php

[5] G. Blake, R. Dreslinski, and T. Mudge, “A survey of multicore processors,” IEEE Signal Processing Magazine, vol. 26, no. 6, pp. 26–37, 2009.

[6] S. Rodriguez and B. Jacob, “Energy/power breakdown of pipelined nanometer caches (90nm/65nm/45nm/32nm),” in Proceedings of the 2006 International Symposium on Low Power Electronics and Design, 2006, pp. 25–30.

[7] ITRS, “Semiconductor Industry Association. International Technology Roadmap for Semiconductors, 2009,” 2009. [Online]. Available: http://www.itrs.net/

[8] M. Horowitz, T. Indermaur, and R. Gonzalez, “Low-power digital design,” in Proceedings of the 1994 IEEE Symposium on Low Power Electronics, 1994, pp. 8–11.

[9] R. Gonzalez, B. Gordon, and M. Horowitz, “Supply and threshold voltage scaling for low power CMOS,” IEEE Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1210–1216, 1997.

[10] P. Greenhalgh, “big.LITTLE processing with ARM Cortex-A15 & Cortex-A7,” May 2013. [Online]. Available: http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf


[11] Apple Inc., “App store press release,” Jan. 2013. [Online]. Available: http://www.apple.com/pr/library/2013/01/07App-Store-Tops-40-Billion-Downloads-with-Almost-Half-in-2012.html

[12] A. Gutierrez, R. Dreslinski, T. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver, “Full-system analysis and characterization of interactive smartphone applications,” in Proceedings of the 2011 IEEE International Symposium on Workload Characterization, 2011, pp. 81–90.

[13] A. Shye, B. Scholbrock, and G. Memik, “Into the wild: Studying real user activity patterns to guide power optimizations for mobile architectures,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 168–178.

[14] A. Malik, B. Moyer, and D. Cermak, “A low power unified cache architecture providing power and performance flexibility,” in Proceedings of the 2000 International Symposium on Low Power Electronics and Design, 2000, pp. 241–243.

[15] D. Parikh, K. Skadron, Y. Zhang, and M. Stan, “Power-aware branch prediction: characterization and design,” IEEE Transactions on Computers, vol. 53, no. 2, pp. 168–186, 2004.

[16] W. Bircher, M. Valluri, J. Law, and L. John, “Runtime identification of microprocessor energy saving opportunities,” in Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005, pp. 275–280.

[17] C. Thompson, D. Schmidt, H. Turner, and J. White, “Analyzing mobile application software power consumption via model-driven engineering,” in Proceedings of the 1st International Conference on Pervasive and Embedded Computing and Communication Systems, 2011, pp. 101–113.

[18] L. Zhang, B. Tiwana, Z. Qian, Z. Wang, R. Dick, Z. Mao, and L. Yang, “Accurate online power estimation and automatic battery behavior based power model generation for smartphones,” in Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2010, pp. 105–114.

[19] H. Falaki, R. Mahajan, S. Kandula, D. Lymberopoulos, R. Govindan, and D. Estrin, “Diversity in smartphone usage,” in Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, 2010, pp. 179–194.

[20] A. Carroll and G. Heiser, “An analysis of power consumption in a smartphone,” in Proceedings of the 2010 USENIX Annual Technical Conference, 2010, pp. 21–34.

[21] T. Pering, Y. Agarwal, R. Gupta, and R. Want, “CoolSpots: Reducing the power consumption of wireless mobile devices with multiple radio interfaces,” in Proceedings of the 4th International Conference on Mobile Systems, Applications and Services, 2006, pp. 220–232.

[22] N. Balasubramanian, A. Balasubramanian, and A. Venkataramani, “Energy consumption in mobile phones: A measurement study and implications for network applications,” in Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement Conference, 2009, pp. 280–293.

[23] A. Gupta and P. Mohapatra, “Energy consumption and conservation in WiFi based phones: A measurement-based study,” in Proceedings of the 4th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks, 2007, pp. 122–131.

[24] R. Tomasulo, “An efficient algorithm for exploiting multiple arithmetic units,” IBM Journal of Research and Development, vol. 11, no. 1, pp. 25–33, 1967.

[25] J. Smith and A. Pleszkun, “Implementation of precise interrupts in pipelined processors,” IEEE Transactions on Computers, vol. 37, no. 5, pp. 562–573, 1988.

[26] S. Hily and A. Seznec, “Out-of-order execution may not be cost-effective on processors featuring simultaneous multithreading,” in Proceedings of the 5th International Symposium on High Performance Computer Architecture. Washington, DC, USA: IEEE Computer Society, 1999, pp. 64–67.

[27] J. Grossman, “Cheap out-of-order execution using delayed issue,” in ICCD ’00: Proceedings of the 2000 IEEE International Conference on Computer Design, 2000, pp. 549–551.

[28] D. Callahan, K. Kennedy, and A. Porterfield, “Software prefetching,” in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. New York, NY, USA: ACM, 1991, pp. 40–52.

[29] A. Klaiber and H. Levy, “An architecture for software-controlled data prefetching,” SIGARCH Computer Architecture News, vol. 19, no. 3, pp. 43–53, 1991.

[30] T. Mowry, M. Lam, and A. Gupta, “Design and evaluation of a compiler algorithm for prefetching,” in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. New York, NY, USA: ACM, 1992, pp. 62–73.

[31] A. Badawy, A. Aggarwal, D. Yeung, and C. Tseng, “Evaluating the impact of memory system performance on software prefetching and locality optimizations,” in Proceedings of the 15th International Conference on Supercomputing. New York, NY, USA: ACM, 2001, pp. 486–500.

[32] N. Jouppi, “Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers,” SIGARCH Computer Architecture News, vol. 18, no. 2SI, pp. 364–373, 1990.

[33] J. Baer and T. Chen, “An effective on-chip preloading scheme to reduce data access penalty,” in Proceedings of the 1991 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 1991, pp. 176–186.

[34] J. Fu, J. Patel, and B. Janssens, “Stride directed prefetching in scalar processors,” in Proceedings of the 25th Annual International Symposium on Microarchitecture. Los Alamitos, CA, USA: IEEE Computer Society Press, 1992, pp. 102–110.

[35] D. Joseph and D. Grunwald, “Prefetching using Markov predictors,” in Proceedings of the 24th Annual International Symposium on Computer Architecture. New York, NY, USA: ACM, 1997, pp. 252–263.

[36] S. Park, A. Shrivastava, and Y. Paek, “Hiding cache miss penalty using priority-based execution for embedded processors,” in Proceedings of the 2008 IEEE Design, Automation, and Test in Europe Conference and Exhibition, 2008, pp. 1190–1195.

[37] K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, “The case for a single-chip multiprocessor,” SIGOPS Operating Systems Review, vol. 30, no. 5, pp. 2–11, 1996.

[38] K. Ghose and M. Kamble, “Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation,” in Proceedings of the 1999 International Symposium on Low Power Electronics and Design, 1999, pp. 70–75.

[39] U. Ko, P. Balsara, and A. Nanda, “Energy optimization of multi-level processor cache architectures,” in Proceedings of the 1995 International Symposium on Low Power Design, 1995, pp. 45–49.

[40] A. Hasegawa, I. Kawasaki, K. Yamada, S. Yoshioka, S. Kawasaki, and P. Biswas, “SH3: High code density, low power,” IEEE Micro, vol. 15, no. 6, pp. 11–19, 1995.

[41] J. Kin, M. Gupta, and W. Mangione-Smith, “The filter cache: an energy efficient memory structure,” in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 1997, pp. 184–193.

[42] D. Albonesi, “Selective cache ways: on-demand cache resource allocation,” in Proceedings of the 32nd Annual International Symposium on Microarchitecture, 1999, pp. 248–259.

[43] B. Calder, D. Grunwald, and J. Emer, “Predictive sequential associative cache,” in Proceedings of the Second International Symposium on High-Performance Computer Architecture, 1996, pp. 244–253.

[44] K. Inoue, T. Ishihara, and K. Murakami, “Way-predicting set-associative cache for high performance and low energy consumption,” in Proceedings of the 1999 International Symposium on Low Power Electronics and Design, 1999, pp. 273–275.

[45] A. Gordon-Ross, F. Vahid, and N. Dutt, “Fast configurable-cache tuning with a unified second-level cache,” IEEE Transactions on Very Large Scale Integration Systems, vol. 17, no. 1, pp. 80–91, 2009.

[46] K. Sundararajan, T. Jones, and N. Topham, “Smart cache: A self adaptive cache architecture for energy efficiency,” in Proceedings of the 2011 International Conference on Embedded Computer Systems (SAMOS), 2011, pp. 41–50.

[47] M. Powell, S. Yang, B. Falsafi, K. Roy, and T. Vijaykumar, “Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories,” in Proceedings of the 2000 International Symposium on Low Power Electronics and Design, 2000, pp. 90–95.

[48] A. Agarwal, H. Li, and K. Roy, “DRG-cache: a data retention gated-ground cache for low power,” in Proceedings of the 39th Design Automation Conference, 2002, pp. 473–478.

[49] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge, “Drowsy caches: simple techniques for reducing leakage power,” SIGARCH Computer Architecture News, vol. 30, no. 2, pp. 148–157, 2002.

[50] M. Ekman, P. Stenström, and F. Dahlgren, “TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors,” in Proceedings of the 2002 International Symposium on Low Power Electronics and Design, New York, NY, 2002, pp. 243–246.

[51] C. Saldanha and M. Lipasti, “Power efficient cache coherence,” in Proceedings of the 2001 Workshop on Memory Performance Issues, 2001.

[52] A. Moshovos, G. Memik, B. Falsafi, and A. Choudhary, “JETTY: Filtering snoops for reduced energy consumption in SMP servers,” in Proceedings of the 7th International Symposium on High-Performance Computer Architecture, Los Alamitos, CA, 2001, pp. 85–96.

[53] A. Moshovos, “RegionScout: Exploiting coarse grain sharing in snoop-based coherence,” in Proceedings of the 32nd Annual International Symposium on Computer Architecture, Washington, DC, 2005, pp. 234–245.

[54] A. Dash and P. Petrov, “Energy-efficient cache coherence for embedded multiprocessor systems through application-driven snoop filtering,” in Proceedings of the 9th EUROMICRO Conference on Digital System Design, Washington, DC, 2006, pp. 79–82.

[55] D. Grove, J. Dean, C. Garrett, and C. Chambers, “Profile-guided receiver class prediction,” in Proceedings of the Tenth Annual Conference on Object-oriented Programming Systems, Languages, and Applications, 1995, pp. 108–123.

[56] C. Zhang, H. Xu, S. Zhang, J. Zhao, and Y. Chen, “Frequency estimation of virtual call targets for object-oriented programs,” in Proceedings of the 25th European Conference on Object-oriented Programming, 2011, pp. 510–532.

[57] J. Dean, D. Grove, and C. Chambers, “Optimization of object-oriented programs using static class hierarchy analysis,” in Proceedings of the 9th European Conference on Object-Oriented Programming, 1995, pp. 77–101.

[58] O. Zendra, D. Colnet, and S. Collin, “Efficient dynamic dispatch without virtual function tables: the SmallEiffel compiler,” in Proceedings of the 12th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, 1997, pp. 125–141.

[59] S. Plazar, J. Kleinsorge, H. Falk, and P. Marwedel, “WCET-driven branch prediction aware code positioning,” in Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2011, pp. 165–174.

[60] R. Teodorescu and R. Pandey, “Using JIT compilation and configurable runtime systems for efficient deployment of Java programs on ubiquitous devices,” in Proceedings of the 2001 Conference on Ubiquitous Computing, 2001, pp. 76–95.

[61] F. Pizlo, L. Ziarek, E. Blanton, P. Maj, and J. Vitek, “High-level programming of embedded hard real-time devices,” in Proceedings of the 5th European Conference on Computer Systems, 2010, pp. 69–82.

[62] M. Huang, D. Chaver, L. Piñuel, M. Prieto, and F. Tirado, “Customizing the branch predictor to reduce complexity and energy consumption,” IEEE Micro, vol. 23, no. 5, pp. 12–25, 2003.

[63] D. Chaver, L. Piñuel, M. Prieto, F. Tirado, and M. Huang, “Branch prediction on demand: An energy-efficient solution,” in Proceedings of the 2003 International Symposium on Low Power Electronics and Design, 2003, pp. 390–395.

[64] P. Petrov and A. Orailoglu, “Low-power branch target buffer for application-specific embedded processors,” IEE Proceedings on Computers and Digital Techniques, vol. 152, no. 4, pp. 482–488, 2005.

[65] N. Levison and S. Weiss, “Low power branch prediction for embedded application processors,” in Proceedings of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design, 2010, pp. 67–72.

[66] S. Wang, J. Hu, and S. Ziavras, “BTB access filtering: A low energy and high performance design,” in Proceedings of the 2008 IEEE Computer Society Annual Symposium on VLSI, 2008, pp. 81–86.

[67] Apple Inc., “objc class source code,” Feb. 2010. [Online]. Available: http://www.opensource.apple.com/source/objc4/objc4-437/runtime/objc-class.m

[68] ——, “objc_msgSend source code,” Feb. 2010. [Online]. Available: http://www.opensource.apple.com/source/objc4/objc4-437/runtime/Messengers.subproj/objc-msg-arm.s

[69] K. Driesen and U. Hölzle, “The direct cost of virtual function calls in C++,” in Proceedings of the 11th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, 1996, pp. 306–323.

[70] J. Joao, O. Mutlu, H. Kim, R. Agarwal, and Y. Patt, “Improving the performance of object-oriented languages with dynamic predication of indirect jumps,” in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008, pp. 80–90.

[71] D. Parikh, K. Skadron, Y. Zhang, M. Barcella, and M. Stan, “Power issues related to branch prediction,” in Proceedings of the Eighth International Symposium on High-Performance Computer Architecture, 2002, pp. 233–244.

[72] G. Bournoutian and A. Orailoglu, “Miss reduction in embedded processors through dynamic, power-friendly cache design,” in Proceedings of the 45th Annual Design Automation Conference, 2008, pp. 304–309.

[73] ——, “Dynamic, non-linear cache architecture for power-sensitive mobile processors,” in Proceedings of the 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2010, pp. 187–194.

[74] ——, “Application-aware adaptive cache architecture for power-sensitive mobile processors,” ACM Transactions on Embedded Computer Systems, vol. 13, no. 3, pp. 41:1–41:26, 2013.

[75] E. Sprangle and D. Carmean, “Increasing processor performance by implementing deeper pipelines,” SIGARCH Computer Architecture News, vol. 30, no. 2, pp. 25–34, 2002.

[76] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An infrastructure for computer system modeling,” Computer, vol. 35, no. 2, pp. 59–67, 2002.

[77] SPEC, “SPEC CPU2000 Benchmarks,” 2000. [Online]. Available: http://www.spec.org/cpu/

[78] C. Lee, M. Potkonjak, and W. Mangione-Smith, “MediaBench: a tool for evaluating and synthesizing multimedia and communications systems,” in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 1997, pp. 330–335.

[79] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown, “MiBench: A free, commercially representative embedded benchmark suite,” in Proceedings of the 2001 IEEE International Workshop on Workload Characterization, 2001, pp. 3–14.

[80] D. Folegnani and A. González, “Energy-effective issue logic,” SIGARCH Computer Architecture News, vol. 29, no. 2, pp. 230–239, 2001.

[81] S. Wilton and N. Jouppi, “CACTI: An enhanced cache access and cycle time model,” IEEE Journal of Solid-State Circuits, vol. 31, no. 5, pp. 677–688, 1996.

[82] Apple Inc., “App store press release,” Jan. 2014. [Online]. Available: http://www.apple.com/pr/library/2014/01/07App-Store-Sales-Top-10-Billion-in-2013.html

[83] C. Perleberg and A. Smith, “Branch target buffer design and optimization,” IEEE Transactions on Computers, vol. 42, no. 4, pp. 396–412, 1993.

[84] G. Bournoutian and A. Orailoglu, “On-device Objective-C application optimization framework for high-performance mobile processors,” in Proceedings of the 2014 IEEE Design, Automation, and Test in Europe Conference and Exhibition, 2014, pp. 85:1–85:6.

[85] M. Mamidipaka and N. Dutt, “eCACTI: An enhanced power estimation model for on-chip caches,” University of California, Irvine Center for Embedded Computer Systems Technical Report TR-04-28, 2004.

[86] A. Agarwal and S. Pudar, “Column-associative caches: A technique for reducing the miss rate of direct-mapped caches,” in Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993, pp. 179–190.

[87] N. Nethercote and J. Seward, “Valgrind: a framework for heavyweight dynamic binary instrumentation,” in Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2007, pp. 89–100.

[88] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: a framework for architectural-level power analysis and optimizations,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000, pp. 83–94.

[89] M. Papamarcos and J. Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” in Proceedings of the 11th Annual International Symposium on Computer Architecture, New York, NY, 1984, pp. 348–354.

[90] N. Binkert, R. Dreslinski, L. Hsu, K. Lim, A. Saidi, and S. Reinhardt, “The M5 simulator: modeling networked systems,” IEEE Micro, vol. 26, no. 4, pp. 52–60, 2006.

[91] J. Singh, W. Weber, and A. Gupta, “SPLASH: Stanford parallel applications for shared-memory,” SIGARCH Computer Architecture News, vol. 20, no. 1, pp. 5–44, 1992.

[92] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: characterization and architectural implications,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, New York, NY, 2008, pp. 72–81.

[93] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Fifth Edition, 2011.