Configurable Energy-Efficient Co-Processors to Scale The
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF CALIFORNIA, SAN DIEGO Configurable Energy-efficient Co-processors to Scale the Utilization Wall A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science by Ganesh Venkatesh Committee in charge: Professor Steven Swanson, Co-Chair Professor Michael Taylor, Co-Chair Professor Pamela Cosman Professor Rajesh Gupta Professor Dean Tullsen 2011 Copyright Ganesh Venkatesh, 2011 All rights reserved. The dissertation of Ganesh Venkatesh is approved, and it is acceptable in quality and form for publication on microfilm and electronically: Co-Chair Co-Chair University of California, San Diego 2011 iii DEDICATION To my dear parents and my loving wife. iv EPIGRAPH The lurking suspicion that something could be simplified is the world's richest source of rewarding challenges. |Edsger Dijkstra v TABLE OF CONTENTS Signature Page.................................. iii Dedication..................................... iv Epigraph.....................................v Table of Contents................................. vi List of Figures.................................. ix List of Tables................................... xi Acknowledgements................................ xii Vita and Publications.............................. xv Abstract of the Dissertation........................... xvii Chapter 1 Introduction............................1 1.1 Utilization Wall.......................2 1.2 Specialization for converting transistors into performance4 1.3 Patchable Conservation Cores: Energy efficient circuits with processor-like lifetimes................5 1.4 Utilizing Conservation Cores to Design Mobile Applica- tion Processors.......................6 1.5 Quasi-ASICs: Trading Area for Energy by Exploiting Similarity across Irregular Codes..............7 1.6 Organization.........................8 Chapter 2 Arsenal: Baseline Architecture and Tool Chain.........9 2.1 Arsenal: Massively Heterogeneous Multiprocessors.......................9 2.1.1 Hardware Organization............... 10 2.1.2 Arsenal Design Flow................ 11 2.1.3 Execution Model.................. 11 2.2 System Overview...................... 12 2.2.1 Specialized Processor Hardware Design...... 13 2.2.2 The CPU/Specialized-processor Interface..... 15 2.2.3 The Runtime System................ 15 2.3 Methodology......................... 15 2.3.1 Toolchain...................... 17 2.3.2 Simulation infrastructure.............. 17 vi 2.3.3 Synthesis....................... 18 2.3.4 Power measurements................ 18 Chapter 3 Patchable Conservation Cores: Energy-efficient circuits with processor-like lifetimes...................... 20 3.1 Case for Reconfigurability in Application Specific Circuits....................... 21 3.2 Reconfigurability support in Conservation Cores............................. 24 3.3 Patching Algorithm..................... 25 3.3.1 Basic block mapping................ 27 3.3.2 Control flow mapping................ 28 3.3.3 Register mapping.................. 29 3.3.4 Patch generation.................. 30 3.3.5 Patched execution example............. 31 3.4 Methodology......................... 31 3.4.1 Generation of Patchable Conservation Core Hardware...................... 32 3.4.2 Generating Configuration Patch.......... 32 3.5 Results............................ 33 3.5.1 Energy savings................... 33 3.5.2 Longevity...................... 37 3.6 Patching Overhead Analysis and Optimizations..... 38 3.6.1 Cost of Reconfigurability.............. 38 3.6.2 Optimizations to Reduce the Patching Overheads 40 3.7 Backward Patching..................... 43 3.7.1 Methodology.................... 47 3.7.2 Results........................ 47 3.8 Conclusion.......................... 50 Chapter 4 Utilizing Conservation Cores to Design Mobile Application Processors............................. 53 4.1 Applicability of Conservation Cores to Android..... 54 4.1.1 Android Platform Analysis............. 54 4.1.2 Case for using c-cores to optimize mobile applica- tion processors................... 56 4.2 Android Software Stack................... 56 4.3 Designing Conservation Cores for Android........ 58 4.3.1 Profiling the Android System........... 58 4.3.2 Characterizing the hotspots in the Android system 59 4.4 Results............................ 67 4.4.1 Area requirement.................. 67 4.4.2 Energy-Area Tradeoff................ 69 vii 4.5 Conclusion.......................... 71 Chapter 5 Quasi-ASICs: Trading Area for Energy by Exploiting Similar- ity across Irregular Codes.................... 73 5.1 Motivation.......................... 76 5.2 Quasi-ASIC Design Flow.................. 82 5.2.1 Dependence Graph Generation........... 83 5.2.2 Mining for Similar Code Patterns......... 83 5.2.3 Merging Program Dependence Graphs with simi- lar code structure.................. 84 5.2.4 Qasic Generation.................. 87 5.2.5 Modifying Application Code to utilize Qasics.. 90 5.3 Qasic-selection heuristic.................. 91 5.4 Methodology......................... 92 5.4.1 Designing Qasics for the target application set. 94 5.4.2 QASIC Hardware Design.............. 94 5.5 Results............................ 97 5.5.1 Evaluating the Qasic-selection Heuristic..... 97 5.5.2 Evaluating Qasic's Area and Energy Efficiency. 98 5.6 Conclusion.......................... 104 Chapter 6 Related Work........................... 106 6.1 Heterogeneous Architectures................ 106 6.2 Automatically-designed Specialized Cores......... 110 Chapter 7 Summary............................. 113 Bibliography................................... 115 viii LIST OF FIGURES Figure 2.1: The high-level structure of an Arsenal system.......... 10 Figure 2.2: The high-level design flow of an Arsenal system......... 11 Figure 2.3: The high-level structure of the baseline architecture....... 12 Figure 2.4: Specialized Core Design Example................. 14 Figure 2.5: The C-to-hardware toolchain................... 16 Figure 3.1: Observed changes in the source code across application versions 23 Figure 3.2: An example showing handling of changes in control flow across versions............................... 26 Figure 3.3: Basic block matching example................... 27 Figure 3.4: Patching Algorithm Toolchain................... 30 Figure 3.5: Conservation Core versus MIPS Processor............ 35 Figure 3.6: System-level Efficiency of a c-core-enabled Tile Architecture.. 36 Figure 3.7: Conservation core effectiveness over time............. 37 Figure 3.8: Area and power breakdown for patchable c-cores........ 39 Figure 3.9: Impact of patching optimizations on the area and energy re- quirements of patchable c-cores.................. 44 Figure 3.10: Conservation Core design with improved backward compatibil- ity support............................. 45 Figure 3.11: The Backward Compatible Conservation Core Toolchain.... 46 Figure 3.12: Improvements in the backward compatibility of Conservation Cores................................ 48 Figure 3.13: Effectiveness of Backward Compatible Conservation Core over time................................. 49 Figure 4.1: Android Software Stack...................... 55 Figure 4.2: Static Instruction Count vs Dynamic Instruction Count Coverage 60 Figure 4.3: Basic block count vs Application execution coverage...... 61 Figure 4.4: Breakdown of Application Runtime across Android Software Stack 62 Figure 4.5: Dynamic Execution Coverage vs Static Instructions Converted into Conservation cores...................... 66 Figure 4.6: Android Execution Coverage vs Conservation core Area.... 68 Figure 4.7: Android Execution Coverage vs Conservation core Area.... 70 Figure 5.1: Qasic's ability to trade off between area and energy efficiency. 75 Figure 5.2: Similar code patterns present across the hotspots of a Sat Solver tool................................. 77 Figure 5.3: Similarity available across hotspots of diverse application set (Table 5.1)............................. 79 Figure 5.4: Quantifying Similarity Present Within and Across Application Domains............................... 80 ix Figure 5.5: Qasic Design Flow........................ 82 Figure 5.6: Qasic Example.......................... 85 Figure 5.7: Expression Merging........................ 88 Figure 5.8: Inferred Dependence Example................... 88 Figure 5.9: Greedy Clustering Algorithm for Designing Qasics....... 92 Figure 5.10: Qasic Toolchain.......................... 93 Figure 5.11: Micro-Benchmark Set: Eight simple loops............ 95 Figure 5.12: Coverage vs Qasic count..................... 95 Figure 5.13: Qasic quality vs. Qasic Count................. 96 Figure 5.14: Impact of generalization on Qasic Area and Energy Efficiency for the Micro-benchmarks..................... 96 Figure 5.15: Impact of generalization on the area and energy Efficiency of Qasics targeting commonly used data structures........ 99 Figure 5.16: Scalability of Qasic's approach.................. 100 Figure 5.17: Impact of generalization on Qasic Area and Energy Efficiency for the Benchmark Set....................... 101 Figure 5.18: Impact of generalization on Qasic Performance for our Bench- mark Set.............................. 101 Figure 5.19: Energy efficiency of a Qasic-enabled system........... 103 x LIST OF TABLES Table 1.1: The Utilization Wall........................3 Table 3.1: Conservation core statistics..................... 34 Table 3.2: Conservation core details...................... 41 Table 3.3: Area costs of patchability...................... 41 Table 5.1: A Diverse