UC Irvine Electronic Theses and Dissertations

Title Software Assists to On-chip Memory Hierarchy of Manycore Embedded Systems

Permalink https://escholarship.org/uc/item/8bx2m2hw

Author Namaki Shoushtari, Abdolmajid

Publication Date 2018

License https://creativecommons.org/licenses/by/4.0/ (CC BY 4.0)

Peer reviewed|Thesis/dissertation

UNIVERSITY OF CALIFORNIA, IRVINE

Software Assists to On-chip Memory Hierarchy of Manycore Embedded Systems

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in Computer Science

by

Majid Namaki Shoushtari

Dissertation Committee: Professor Nikil Dutt, Chair Professor Alex Nicolau Professor Elaheh Bozorgzadeh

2018

Portion of Chapter 2 © 2017 IEEE
Portion of Chapter 2 © 2018 ACM
Portion of Chapter 3 © 2015 IEEE
Portion of Chapter 3 © 2017 IEEE
All other materials © 2018 Majid Namaki Shoushtari

DEDICATION

In memory of my dear aunt Shahnaz, and my beloved grandmothers.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
CURRICULUM VITAE
ABSTRACT OF THE DISSERTATION

1 Introduction
  1.1 Technology Implications
    1.1.1 Emerging Manycore Architectures
    1.1.2 Memory Subsystem of Manycores
  1.2 Workload Implications
    1.2.1 Data-intensive Workloads and Variation in their Memory Requirements
    1.2.2 Error-Resilient Workloads
  1.3 Thesis Contributions and Organization

2 Software-Programmable On-chip Memory Hierarchy
  2.1 Introduction
  2.2 Prior Work on SPMs in Multicores and Manycores
  2.3 Target System Architecture Template
    2.3.1 SPM Programming Model
    2.3.2 Architecture Model
    2.3.3 Execution Model
    2.3.4 Coherence Issues
    2.3.5 OS/Runtime Support
  2.4 Motivational Example
  2.5 ShaVe-ICE Details
    2.5.1 SPM Allocator
    2.5.2 Allocation Policies
      2.5.2.1 Non-preemptive Allocation Heuristics
        2.5.2.1.1 Local Allocator (LA)
        2.5.2.1.2 Random Remote Allocator (RRA)
        2.5.2.1.3 Closest Neighborhood Allocator (CNA)
      2.5.2.2 Preemptive Allocation Heuristics
        2.5.2.2.1 Closest Neighborhood with Guaranteed Local Share Allocator (CNGLSA)
    2.5.3 Hardware Support
      2.5.3.1 Distributed Memory Management Units
      2.5.3.2 Network Protocol
    2.5.4 Data Movements in ShaVe-ICE
      2.5.4.1 Allocations/Deallocations
      2.5.4.2 Accesses
      2.5.4.3 Data Migration
  2.6 SPM Sharing Evaluations
    2.6.1 Experimental Setup
    2.6.2 Remote vs. Off-chip Access Latencies
    2.6.3 Memory Microbenchmark
    2.6.4 Performance Comparisons: ShaVe-ICE vs. Software Cache
    2.6.5 Performance Comparisons: ShaVe-ICE Policies
      2.6.5.1 Scenario 1: Core Underutilization
      2.6.5.2 Scenario 2: Variation in Memory Working Set Size
      2.6.5.3 Scenario 3: Reducing Network Traffic
      2.6.5.4 Scenario 4: Guaranteeing SPM Share for Locally-pinned Thread
    2.6.6 Energy Comparisons: ShaVe-ICE Policies
    2.6.7 Experiments with Real Workload Mixes
  2.7 Discussion on Overheads
  2.8 Cache+SPM and Shared Data Support Evaluations
    2.8.1 Experiment 1: Coherence Overhead Due to False Sharing
    2.8.2 Experiment 2: Coherence Overhead Due to Shared Data
    2.8.3 Experiment 3: Dynamic Partitioning of Local Memory
    2.8.4 Experiment 4: Sharing SPMs Between Cores

3 Approximate On-chip Memories
  3.1 Introduction
  3.2 Prior Work on Memory Approximation
    3.2.1 Taxonomy of Prior Research on Approximate Memory Management
    3.2.2 Overview of Prior Research and Practices
  3.3 Partially-Forgetful Memories
  3.4 Relaxed Cache
    3.4.1 Introduction
    3.4.2 Motivation
    3.4.3 Hardware Support
      3.4.3.1 Tuning Relaxed Ways Based on Architectural Knobs
      3.4.3.2 Defect Map Generation and Storage
      3.4.3.3 Non-Criticality Table
      3.4.3.4 Making Cache Controller Aware of Data’s Tag
    3.4.4 Software Support
      3.4.4.1 Programmer-Driven Application Modifications
        3.4.4.1.1 Data Criticality Declaration
        3.4.4.1.2 Cache Configuration
      3.4.4.2 Runtime System
    3.4.5 Evaluations
      3.4.5.1 Experimental Setup
      3.4.5.2 Benchmarks
      3.4.5.3 Experimental Results
        3.4.5.3.1 Leakage Energy Savings
        3.4.5.3.2 Fidelity Analysis
        3.4.5.3.3 Performance Analysis
  3.5 QuARK Cache
    3.5.1 Introduction
    3.5.2 STT-MRAM Reliability/Energy Trade-off Knobs
      3.5.2.1 STT-MRAM Basics
      3.5.2.2 Reliability-Energy Knobs
    3.5.3 The QuARK Approach
      3.5.3.1 Software Support
      3.5.3.2 Hardware Support
        3.5.3.2.1 QuARK Cache Approximation Table
        3.5.3.2.2 QuARK Cache Controller
        3.5.3.2.3 Support for Cache Fillings and Write-backs
    3.5.4 Evaluations
      3.5.4.1 Experimental Setup
      3.5.4.2 Benchmarks
      3.5.4.3 Experimental Results
  3.6 Write-Skip SPM
    3.6.1 Introduction
    3.6.2 The Write-Skip Approach
      3.6.2.1 Read-Before-Write
      3.6.2.2 Approximate Equality
    3.6.3 Evaluations
      3.6.3.1 Approximate Value Locality
      3.6.3.2 Output Fidelities
      3.6.3.3 Energy Consumption in On-chip Memory
  3.7 Controlled Memory Approximation
    3.7.1 Introduction
    3.7.2 Problem Modeling
      3.7.2.1 Application Class
      3.7.2.2 Monitoring Quality at Runtime
      3.7.2.3 Memory Approximation Knob(s)
    3.7.3 Quality Control with Feedback Control Theory
    3.7.4 Case Study: Video Edge Detection
      3.7.4.1 Application Description and Error Metric
      3.7.4.2 Memory Approximation Knob
    3.7.5 System Identification
      3.7.5.1 Controller
      3.7.5.2 Fault Injection Mechanism
      3.7.5.3 Input Dependency
      3.7.5.4 QoS Tracking
      3.7.5.5 Comparison

4 Conclusions and Future Directions
  4.1 Technical Contributions
  4.2 Future Directions

Bibliography

LIST OF FIGURES

1.1  High-level overview of the proposed software-assisted memory hierarchy

2.1  Number of memory accesses measured periodically for the execution of the wikisort and fft benchmarks: utilization of memory resources changes within a thread and between threads
2.2  Software-assisted memory hierarchy vs. hardware-managed memory hierarchy. Relative size of gray components shows their contribution in managing memory hierarchy
2.3  Number of CPU cycles that a thread has to be stalled for an SPM API to be handled by the OS/HW. SPM page size is assumed to be 64 bytes
2.4  A code snippet using SPM APIs
2.5  A 4X4 example of target manycore architecture with software-assisted memory hierarchy
2.6  SPM manager
2.7  ShaVe-ICE allocator is invoked when a thread Ti mapped on core Cj calls any of the SPM APIs. It virtualizes and ultimately shares the entire distributed SPM space between all the concurrently running threads. Dedicated hardware assist enables remote allocations and remote memory accesses
2.8  An example showing how sharing the entire SPM space improves the overall performance of a 2X2 system with 3 threads running on it. The left-hand side column shows the state of SPM allocations when allocations are limited to the local SPM only; the right-hand side column shows the same scenario when remote allocations are also possible in addition to local allocations. (a)&(a'): only thread T1 is running, (b)&(b'): thread T2 starts execution after 20K cycles while T1 is still running, (c)&(c'): thread T3 starts execution after 25K cycles while both T1 and T2 are still running. Average and maximum execution times among all three threads are improved by about 2X and 2.9X respectively
2.9  Hop-distance-based search for empty SPM space in Closest Neighborhood allocation policy
2.10 SATT entry
2.11 Average access latency for various hop distances
2.12 Overall flow of mem-ubench execution: each phase (prologue, body, epilogue) is either memory or compute intensive to a configurable degree. Operations are performed on three data arrays of configurable size with both regular (strided) and randomized access patterns. The number of executions of each phase as well as the entire 3-phase loop are configurable
2.13 ShaVe-ICE vs. software cache – Case A: thread's working-set is smaller than software cache and SPM, Case B: thread's working-set is larger than software cache and SPM. Lower values are better
2.14 Miss rate of various mem-ubench configurations for different available SPM capacities. Miss rate does not have a linear relationship with SPM capacity since not all the data is of the same importance
2.15 Scenario 1: Average execution time among all threads in an 8x8 system - RRA normalized to LA. Core underutilization provides the opportunity for RRA to allocate more SPM space for the threads that have large working set size
2.16 Scenario 2: Average execution time among all threads in an 8x8 system - RRA normalized to LA: Threads with small working set size provide the opportunity for RRA to utilize the SPM space unused by the locally-pinned thread in order to allocate more SPM space for the threads that have large working set size
2.17 Scenario 3: Average execution time among all threads in an 8x8 system - CNA normalized to RRA: CNA allocates remote pages as close as possible to the owner core resulting in reduced network traffic, faster access latency and ultimately better overall performance compared to RRA
2.18 Scenario 4: Difference in the local hit ratio - CNGLSA-CNA and average execution time among all threads in an 8x8 system - CNGLSA normalized to CNA: CNGLSA returns the remotely allocated SPM space to the locally-pinned thread should that thread need it. In some cases, this results in improved local hit ratio and decreased average memory access latency compared to CNA
2.19 Dynamic energy in memory subsystem and network: CNA normalized to LA
2.20 Dynamic energy in memory subsystem and network: CNGLSA normalized to CNA
2.21 Workload mixes with various core utilizations in a 4X4 platform
2.22 Execution time comparison for various workload mixes as shown in Figure 2.21: CNGLSA normalized to LA. Lower values are better
2.23 Runtime overhead of SPM allocator for different combinations of allocation size and page size: local policy (simplest policy)
2.24 Runtime overhead of SPM allocator for different combinations of allocation size and page size: closest neighborhood with local reservation policy (most complex policy)
2.25 Comparing software-assisted memory hierarchy with cache-based hierarchy implementing MESI protocol when false data sharing exists
2.26 Comparing software-assisted memory hierarchy with cache-based hierarchy implementing MESI protocol when true data sharing exists
2.27 Comparison of execution time with different configurations of a fixed size local memory: cache only, SPM and cache, SPM only. Lower is better
2.28 Comparison of energy consumption with different configurations of a fixed size local memory: cache only, SPM and cache, SPM only. Lower is better
2.29 Comparison of execution time and energy consumption when remote SPMs are available for allocation. Legends are shown as: (local cache, local SPM, remote SPM at hop distance 1). Lower is better

3.1  Exploring performance-energy-fidelity space for Image Smoothing benchmark by adjusting Relaxed Cache controlling knobs (leakage energies and execution times are normalized to a baseline that uses 700 mV for SRAM array supply voltage)
3.2  Trade-off between cache capacity, VDD, AFB and bit-cell leakage power (cache block size=64 bytes, technology=45nm)
3.3  High-level diagram showing HW/SW components of Relaxed Cache and their interactions
3.4  A sample 4-way cache with VDD=580mV and AFB=4
3.5  Encoding and decoding defect map info in Relaxed Cache
3.6  Abstracting Relaxed Cache knobs up to metrics familiar to a software programmer (i.e., performance, fidelity, and energy consumption). Note that (VDD = High, AFB = Mild) and (VDD = High, AFB = Aggressive) combinations are sub-optimal, hence not applicable
3.7  A sample code showing programmer's data criticality declarations and cache configurations for Relaxed Cache
3.8  SRAM BER for 45nm using models and data from [149]
3.9  Leakage energy savings for a 4-way L1 cache with 3 relaxed ways (energy savings are normalized to a baseline cache that uses 700mV)
3.10 Fidelity results for (a) Scale and (b) Image-Smoothing benchmarks
3.11 Fidelity results for Edge-Detection benchmark (VDD=480mV)
3.12 FPS-PSNR trade-offs with and without Relaxed Cache scheme (AFB=4)
3.13 STT-MRAM knobs for reliability-energy trade-off in 1MB cache. (a) Write Pulse Current Reduction (WPCR), and (b) Read Pulse Current Reduction (RPCR)
3.14 A pseudo-code example showing how QuARK Cache APIs can be used in a face detection application
3.15 Integrating QuARK Cache into the architecture. Required changes are highlighted in gray
3.16 Distribution of overall and approximation read and write accesses in L2 cache for the selected benchmarks
3.17 QuARK Cache evaluation results: (a) Distribution of accesses in mixed-reliability workloads, (b) Energy savings (normalized to fully-protected STT-MRAM L2 cache), (c) Average relative error for blackscholes benchmark, (d) PSNR for Scale and Image Smoothing benchmarks, and (e) Mean pixel difference for Corner Detection, Edge Detection and Sobel benchmarks
3.18 Percentage of approximately-equal writes in image-smoothing
3.19 Percentage of approximately-equal writes in sobel
3.20 Percentage of approximately-equal writes in k-means
3.21 Output fidelity for image-smoothing
3.22 Output fidelity for sobel
3.23 Output fidelity for k-means
3.24 Energy consumption in on-chip memory for image-smoothing normalized to the baseline where none of the write operations are skipped
3.25 Energy consumption in on-chip memory for sobel normalized to the baseline where none of the write operations are skipped
3.26 Energy consumption in on-chip memory for k-means normalized to the baseline where none of the write operations are skipped
3.27 Open-loop knob settings (prior works) vs. closed-loop quality control (this work)
3.28 Closed loop approach (this work) for tuning memory approximation knob(s)
3.29 Predicted Model vs. Measured Output
3.30 Variation of the quality of the edge detection in various video scenes when the bit error rate is constant
3.31 Quality tracking results. Red curve shows the acceptable error and the blue curve shows the error achieved by the controller
3.32 Comparing PI controller with manual step-wise re-calibration similar to [16]
3.33 Canny edge detection applied to different images with various write bit error rates resulting in different quality metrics

LIST OF TABLES

1.1  ITRS mobile devices trend [71]

2.1  ShaVe-ICE APIs
2.2  Details of threads in the motivational example of Figure 2.8
2.3  SPM-related network messages in ShaVe-ICE
2.4  List of benchmarks for ShaVe-ICE experiments
2.5  List of microbenchmarks

3.1  Prior research in approximate memory management classified based on abstraction levels involved
3.2  Prior research in approximate memory management classified based on approximation objective
3.3  Prior research in approximate memory management classified based on the memory component
3.4  Prior research in approximate memory management classified based on the memory technology
3.5  Prior research in approximate memory management classified based on the approximation strategy
3.6  A sample criticality table for Relaxed Cache
3.7  gem5 settings for Relaxed Cache experiments
3.8  Relation between PSNR and perceptual quality in image processing domain
3.9  QuARK Cache APIs
3.10 gem5 settings for QuARK Cache experiments
3.11 Accuracy-energy transducer map for 1MB QuARK Cache-enabled STT-MRAM cache. Energy consumptions are reported for a 64-byte cache line
3.12 List of approximate applications for QuARK Cache experiments
3.13 List of workload mixes for QuARK Cache experiments
3.14 List of approximate applications for Write-Skip experiments

ACKNOWLEDGMENTS

I would like to express my utmost gratitude to my adviser, Professor Nikil Dutt. His mentorship has helped me grow as a person and learn the social skills necessary to succeed as an individual in society. I thank him for having faith in me through my initial struggles, and for standing by me in my times of need.

I would like to thank the rest of my dissertation committee members Professor Alex Nicolau and Professor Eli Bozorgzadeh for their time, support and invaluable advice.

I thank my friends and colleagues at UCI, who made a great impact on my life. In particular, I thank Dr Hossein Tajik, Dr Abbas Banaiyan, Bryan Donyanavard, Hamid Nejatollahi, Sajjad Taheri, Dr Jurngyu Park, Dr Amir Rahmani, Tiago Muck, Roger Hsieh, Kasra Moazzami, Dr Luis (Danny) Bathen, Santanu Sarma, Hirak Kashyap, Amir Mahdi Hosseini Monazzah, Maral Amir, Zhi Chen and Dr Nga Dang for their friendship, feedback, guidance, and collaborations.

I thank Dr Abbas Rahimi for mentoring me at the initial stages of my PhD.

I must thank my dear friends Amirali Ghofrani, Farshad Yazdi, Morteza Kayyalha, Aida Ebrahimi, Mehrdad Biglarbegian, and Somayeh Sadeghi for their support and the good memories we have created together.

I would like to thank UC Irvine graduate division, NSF Variability Expedition (Grant Number CCF-1029783) and the school of ICS for providing funding opportunities during my PhD program.

I am also very grateful to Professor Alex Nicolau who provided me with many teaching opportunities and mentored me to be a better instructor.

I thank Melanie Sanders, Holly Byrnes, Kris Bolcer, Grace Wu and Melanie Kilian for making ICS and CECS such enjoyable places to work and for all their help and advice on so many different subjects.

I also thank ACM and IEEE for permission to include parts of Chapters 2 and 3 of my dissertation, which were originally published in IEEE Embedded Systems Letters, ACM Transactions on Embedded Computing Systems, and the ACM/IEEE International Symposium on Low Power Electronics and Design.

I am deeply thankful to my family for their love, continued support, and sacrifices; especially my brother Omid for setting high bars for educational and intellectual achievement.

CURRICULUM VITAE

Majid Namaki Shoushtari

EDUCATION
Doctor of Philosophy in Computer Science, 2018
University of California, Irvine; Irvine, CA
Master of Science in Computer Engineering, 2012
University of Tehran; Tehran, Iran
Bachelor of Science in Computer Engineering, 2009
University of Tehran; Tehran, Iran

RESEARCH EXPERIENCE
Graduate Student Researcher, 2012–2017
University of California, Irvine; Irvine, California
Graduate Research Assistant, 2009–2012
University of Tehran; Tehran, Iran

TEACHING EXPERIENCE
Teaching Assistant, 2014–2018
University of California, Irvine; Irvine, CA
Teaching Assistant, 2009–2012
University of Tehran; Tehran, Iran

WORK EXPERIENCE
Software Engineering Intern, June 2016 – Sept. 2016
NVIDIA Corp.; Santa Clara, CA
Software Engineering Intern, June 2015 – Sept. 2015
Rambus Inc.; Sunnyvale, CA
Engineering Intern, June 2008 – Sept. 2008
Kiatel; Karaj, Iran

REFEREED JOURNAL PUBLICATIONS
Exploiting Partially-Forgetful Memories for Approximate Computing, IEEE Embedded Systems Letters, 2015
Automatic Management of Software Programmable Memories in Many-core Architectures, IET Computers & Digital Techniques, 2016
SAM: Software-Assisted Memory Hierarchy for Scalable Manycore Embedded Systems, IEEE Embedded Systems Letters, 2017
ShaVe-ICE: Sharing Distributed Virtualized SPMs in Many-Core Embedded Systems, ACM Transactions on Embedded Computing Systems, 2018

REFEREED CONFERENCE PUBLICATIONS
ARGO: Aging-aware GPGPU Register File Allocation, International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2013
Multi-layer Memory Resiliency, Design Automation Conference (DAC), 2014
Cross-layer Virtual/physical Sensing and Actuation for Resilient Heterogeneous Many-core SoCs, Asia and South Pacific Design Automation Conference (ASP-DAC), 2016
QuARK: Quality-configurable Approximate STT-MRAM Cache by Fine-grained Tuning of Reliability-Energy Knobs, International Symposium on Low Power Electronics and Design (ISLPED), 2017

TECHNICAL REPORTS
A Survey of Techniques for Approximate Memory Management, UC Irvine, Center for Embedded and Cyber-physical Systems, CECS TR 17-03, Sept. 2017
SPM-vSharE: Memory Management in SPM-based Many-core Embedded Systems, UC Irvine, Center for Embedded and Cyber-physical Systems, CECS TR 16-08, Nov. 2016
Relaxing Manufacturing Guard-bands in Memories for Energy Saving, UC Irvine, Center for Embedded and Cyber-physical Systems, CECS TR 14-04, Aug. 2014

SOFTWARE
gem5-spm (https://github.com/duttresearchgroup/gem5-spm): SPM support for the gem5 architectural simulator.

pin-memapprox (https://github.com/mjshoushtari/pin-memapprox): A PIN-based fault injector tool to simulate annotated approximate programs.

ABSTRACT OF THE DISSERTATION

Software Assists to On-chip Memory Hierarchy of Manycore Embedded Systems

By

Majid Namaki Shoushtari

Doctor of Philosophy in Computer Science

University of California, Irvine, 2018

Professor Nikil Dutt, Chair

The growing computing demands of emerging application domains such as Recognition/Mining/Synthesis (RMS), visual computing, wearable devices, and the Internet of Things (IoT) have driven the move towards manycore architectures to better manage tradeoffs among performance, energy efficiency, and reliability.

The memory hierarchy of manycore architectures has a major impact on their overall performance, energy efficiency, and reliability. We identify three major problems that make traditional memory hierarchies unattractive for manycore architectures and their data-intensive workloads: (1) they are power hungry and not a good fit for manycores in the face of dark silicon, (2) they are not adaptable to the workload's requirements and memory behavior, and (3) they are not scalable due to coherence overheads.

This thesis argues that many of these inefficiencies are the result of software-agnostic hardware-managed memory hierarchies. Application semantics and behavior captured in software can be exploited to more efficiently manage the memory hierarchy. This thesis exploits some of this information and proposes a number of techniques to mitigate the aforementioned inefficiencies in two broad contexts: (1) explicit management of hybrid cache-SPM memory hierarchies, and (2) exploiting approximate computing for energy efficiency.

We first present the required hardware and software support for a software-assisted memory hierarchy that is composed of distributed memories which can be partitioned between caches and software-programmable memories (SPMs) at runtime. This memory hierarchy supports local and remote allocations and data movements between SPM and cache and also between two physical SPMs. The distributed SPM space is shared between a mix of threads where each thread explicitly requests SPM space throughout its execution. The runtime component of this hierarchy shares the entire distributed SPM space between contending threads based on an allocation policy. Unlike traditional memory hierarchies, we incorporate no coherence logic in this hierarchy. The program explicitly allocates the shared data on the distributed SPM space. For all threads of that program, the accesses to shared data are forwarded to the same physical copy.

Next, we augment caches and SPMs in this hierarchy with approximation support in order to improve the energy efficiency of the memory subsystem when running approximate programs. We present approximation techniques for major building blocks of our hybrid cache-SPM memory hierarchy. We introduce Relaxed Cache as an approximate private L1 SRAM cache where the quality, capacity, and energy consumption of this cache are controlled through two architectural knobs (i.e., voltage and the number of acceptable faulty bits per cache block). We then present QuARK Cache, an approximate shared L2 STT-MRAM cache. The read and write current amplitudes provide two knobs to make a tradeoff between the accuracy of memory operations and the dynamic energy consumption. We then introduce Write-Skip, a technique that skips write operations in STT-MRAM data SPMs if the previous value and the new value are approximately equal. Finally, we discuss a quality-configurable memory approximation strategy using formal control theory that adjusts the level of approximation at runtime depending on the desired quality for the program's output.

We implemented all software and hardware components of the proposed software-assisted memory hierarchy in the gem5 architectural simulator. Our simulations on a mix of RMS and

microbenchmarks show that our proposed techniques achieve better performance, energy, and scalability for manycore systems over traditional hardware-managed memory hierarchies.

Chapter 1

Introduction

1.1 Technology Implications

1.1.1 Emerging Manycore Architectures

Up to the early 2000s, the principal approach to improving the performance of processors was simply to scale down the manufacturing technology, add more transistors, and increase the clock frequency. This was fueled by Moore's law and Dennard scaling. The trend continued from the 1980s to the 2000s, until processors started hitting physical limitations in terms of their clock frequency and how effectively they could be cooled while still operating correctly.

Essentially, three primary factors led to the design of multicore processors:

1. The memory wall: the increasing gap between processor and memory speeds. This, in effect, pushes for cache sizes to be larger in order to mask the latency of memory. This helps only to the extent that memory bandwidth is not the bottleneck in performance.

2. The instruction-level parallelism (ILP) wall: the increasing difficulty of finding enough

parallelism in a single instruction stream to keep a high-performance single-core processor busy.

3. The power wall: the trend of consuming exponentially increasing power with each further increase in operating frequency. The power wall poses manufacturing, system design, and deployment problems that have not been justified in the face of the diminished performance gains caused by the memory wall and the ILP wall.

Multicores usually use a bus-based communication infrastructure, which may not scale beyond dozens of cores. As opposed to multicores, a manycore architecture consists of a network-like interconnect and hundreds or even thousands of processing cores with their own local memories (SRAM/non-volatile caches and scratchpad memories). At the heart of these systems is the network-like communication infrastructure, which is designed to scale well beyond dozens of cores. Multicore processors are usually designed to efficiently run both parallel and serial code, and therefore place more emphasis on high single-thread performance (e.g., devoting more silicon to out-of-order execution, deeper pipelines, more superscalar execution units, and larger, more general caches) and on shared memory. Manycore processors are distinct from multicore processors in that they are optimized from the outset for a higher degree of explicit parallelism and for higher throughput (or lower power consumption), at the expense of latency and lower single-thread performance.

The manycore revolution is also driven by demand from new software such as media-rich applications (e.g., streaming content from the cloud), whose increasingly complex software stacks require new ways to improve system performance.

Examples of such manycore architectures are Intel SCC [100], Kalray MPPA-256 [44], Tilera TILE64 [27], Adapteva Epiphany [4], Kalray MPPA2-256 [75], IBM Blue Gene/Q [65].

Although manycore architectures have not been deployed yet for widespread use, one clear

trend exists in the mobile computing domain. In recent years, mobile devices, notably smartphones, have shown a significant expansion of computing capabilities. ITRS [71] predicts that the number of cores for Application Processors (AP) will moderately increase, by 4x, over the time horizon shown in Table 1.1. On the other hand, the number of cores in Graphics Processing Units (GPU) will increase by 50x over the same horizon. This increase in GPU processing capacity is necessary to keep up with the growing number of megapixels offered by displays. It is expected that the traffic between the AP and memory will have to correspondingly increase by more than 2x.

Table 1.1: ITRS mobile devices trend [71]

Year                                              2015   2017   2019   2021   2023   2025
Number of AP cores                                   4      9     18     18     28     36
Number of GPU cores                                  6     19     49     69    141    247
Max frequency of any component in system (GHz)     2.7    2.9    3.2    3.4    3.7    4.0
Number of megapixels in display                     2.1    2.1    3.7    8.8    8.8   33.2
Bandwidth between AP and main memory (Gb/s)        25.6   34.8   52.6   57.3   61.9   61.9

Manycore technology has been viewed as a way to improve performance at the processor level, but its profound implications for the energy efficiency and reliability of future embedded systems with 100s or 1000s of cores have not been studied in depth. Facing the challenges brought by dark silicon [50], it is more important than ever to deploy energy-efficient mechanisms to continue supporting the growing high-performance requirements of future embedded systems.

1.1.2 Memory Subsystem of Manycores

The advent of manycore computing platforms exacerbates the classical processor-memory performance bottleneck. As we scale to manycore systems, it becomes increasingly challenging to scale traditional cache-based memory hierarchies [125]. One important reason is that the overhead of coherence logic increases rapidly with the number of cores. As we scale the number of cores in a cache-coherent system, the "cost" in "time and memory" grows to a point beyond which the additional cores are not useful in a single parallel program; this is called the coherency wall. Some processors have already tried to alleviate this problem by removing coherence hardware either partially or completely, e.g., Intel SCC [100] and Kalray MPPA-256 [44]. In these architectures, coherence, whenever needed by the application or system, must be implemented in software; caching itself, without coherence, is still implemented in hardware. A further challenge for hardware-implemented caches in scaling to manycore architectures is that hardware caching becomes power-hungry due to the complexity of the caching logic.

An alternative mechanism is to deploy software caching for smart data management, using the raw memories in the processor. Here the data movement between the close-to-processor memory and the main memory has to be done explicitly in software, typically through the use of Direct Memory Access (DMA) instructions. We refer to the raw memories in such processors as Software Programmable Memories (SPM). IBM Cell [69], Tilera TILE64 [27], and Adapteva Epiphany [4] use SPMs in their on-chip memory hierarchies.
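As a rough illustration of this kind of explicit, software-managed staging, the C sketch below tiles a large array through a small on-chip buffer. It is only a sketch under assumed primitives: dma_copy() stands in for whatever DMA interface a given platform exposes (here it is defined as memcpy so the fragment is self-contained), and the buffer is simply imagined to reside in the local SPM.

    #include <string.h>

    #define TILE 256                          /* elements staged on-chip at a time */

    /* Stand-in for a platform DMA primitive; a real system would program the DMA engine. */
    #define dma_copy(dst, src, bytes) memcpy((dst), (src), (bytes))

    static int spm_buf[TILE];                 /* imagine this buffer lives in the local SPM */

    /* Process a large array resident in main memory, one tile at a time.
       n is assumed to be a multiple of TILE to keep the sketch short. */
    void process(int *dram_data, int n)
    {
        for (int base = 0; base < n; base += TILE) {
            dma_copy(spm_buf, &dram_data[base], TILE * sizeof(int));   /* main memory -> SPM */
            for (int i = 0; i < TILE; i++)
                spm_buf[i] = 2 * spm_buf[i] + 1;                       /* compute on the fast local copy */
            dma_copy(&dram_data[base], spm_buf, TILE * sizeof(int));   /* SPM -> main memory */
        }
    }

The point is only that software, not a hardware cache, decides what is resident on-chip and when it moves.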

SPMs offer many advantages over caches. When application designers have a deep understanding of the data requirements of their applications, as is common in embedded systems, SPMs allow developers to exploit application semantics effectively to achieve efficient execution. Beyond this, SPMs offer several further advantages. The first is power efficiency, by eliminating the hardware overhead of traditional caching. The second is predictability, a critical factor for real-time systems. Third, there is potential for performance improvement by orchestrating the management of data transfers explicitly in software.

The need for more energy-efficient memories and denser memory space to accommodate emerging data-intensive applications in the embedded domain has led designers to consider Non-Volatile Memories (NVMs) as alternatives to SRAM for on-chip memories. Typically, NVMs (e.g., Spin-Transfer Torque (STT) memories and Phase-Change Memory (PCM)) offer high densities, low leakage power, and read latencies and dynamic read power comparable to traditional embedded memories (SRAM/eDRAM). One major drawback across NVMs is the expensive write operation (high latency and dynamic energy per access) and wearout over time.

One of the potential benefits of using SPMs is the ability to explicitly manage data accesses for thermal and wearout constraints, particularly for NVMs.

1.2 Workload Implications

1.2.1 Data-intensive Workloads and Variation in their Memory Requirements

One of the most critical challenges for today's and future data-intensive and big-data problems (ranging from economics and business activities to public administration, and from national security to many scientific research areas) is data storage and analysis. The primary goal is to increase the understanding of processes by extracting the highly useful values hidden in huge volumes of data. The growth in data size has already surpassed the capabilities of today's computing architectures, which suffer from limited bandwidth due to communication and memory-access bottlenecks.

The data-centric nature of several emerging media-rich applications in the embedded domain creates demand for denser memories. Memories are likely to dominate energy as well as reliability concerns for computing systems [112].

While memory resources are becoming more vital to embedded computing platforms, because of the nature of the emerging applications running on them, their importance could vary

over time and, at any given time, between concurrently running applications. Tajik et al. [143, 142] have shown the variation in memory requirements between concurrently running threads and within a thread during its course of execution.

1.2.2 Error-Resilient Workloads

Inherent application resilience is the property of an application to produce acceptable outputs despite some of its underlying computations being incorrect or approximate. It is prevalent in a broad spectrum of applications such as digital signal processing, image, audio, and video processing, graphics, wireless communications, web search, and data analytics. Emerging application domains such as Recognition, Mining and Synthesis (RMS) [49], which are expected to drive future computing platforms, also exhibit this property in abundance. The inherent resilience of these applications can be attributed to several factors: (1) significant redundancy is present in large, real-world data sets that they process, (2) they employ computation patterns (such as statistical aggregation and iterative refinement) that intrinsically attenuate or correct errors due to approximations, and (3) a range of outputs are equivalent (i.e., no unique golden output exists), or small deviations in the output cannot be perceived by users.

The distributed memory subsystem is one of the fundamental performance and energy bottlenecks in emerging manycore systems and is likely to dominate energy as well as reliability concerns for those systems. This error resilience can be exploited to build more efficient computing systems, more specifically to improve the energy efficiency of the memory subsystem in emerging manycore architectures.

1.3 Thesis Contributions and Organization

Any strategy for memory management of manycores should address a number of challenges: (1) adaptation to workloads with varying memory requirements (in terms of working-set size, access pattern, and reliability of accesses), (2) energy efficiency, and (3) scalable coherence management.

This thesis takes the stand that many of these challenges cannot be addressed by today's software-agnostic, hardware-managed memory hierarchies. We argue that these challenges could be overcome by using more sophisticated memory hierarchy management techniques that receive some form of software assist (e.g., information about a program's semantics, memory access patterns, memory phases, etc.). Figure 1.1 shows an overview of our contributions in this thesis towards achieving this software-assisted memory hierarchy.

Chapter 2 presents a software-assisted memory hierarchy for manycore embedded systems along with the details of its hardware and software support. This hierarchy is composed of distributed memories that can be partitioned between caches and SPMs at runtime. The distributed SPM space is shared between a mix of unknown threads where each thread explicitly requests SPM space for its most accessed data objects throughout its execution. The runtime component of this hierarchy shares the entire SPM space between contending threads based on their requirements and a choice of allocation policy. It also decides how local memories are partitioned between SPM and cache. Unlike traditional memory hierarchies, we incorporate no coherence logic in this hierarchy. The program explicitly allocates shared data in the distributed SPM space. For all threads of that program, accesses to shared data are forwarded to the same physical copy. This memory hierarchy supports local and remote allocations and data movements between SPM and cache and also between two physical SPMs.

Figure 1.1: High-level overview of the proposed software-assisted memory hierarchy (programs pass through approximation and SPM compiler/profiling passes, use a programming interface to the runtime approximation and memory managers in the operating system, and run on a manycore architecture whose cores contain caches, SPMs, and MMUs).

Chapter 3 augments the caches and SPMs in this hierarchy with approximation support in order to improve the energy efficiency of the memory subsystem. We present approximation techniques for the major components of the memory hierarchy introduced in Chapter 2. We introduce Relaxed Cache, an approximate private L1 SRAM cache whose quality, capacity, and energy consumption are controlled through two architectural knobs. We then present QuARK Cache, an approximate shared L2 STT-MRAM cache; its read and write current amplitudes provide two knobs to trade off the accuracy of memory operations against dynamic energy consumption. We then introduce Write-Skip for STT-MRAM data SPMs. Finally, we discuss a strategy to control the level of memory approximation at runtime depending on the desired output quality.

In Chapter 4 we conclude this thesis and discuss future directions for this research.

Chapter 2

Software-Programmable On-chip Memory Hierarchy

2.1 Introduction

Future embedded systems are expected to use 10s to 100s of simple processing cores, forming Manycore Embedded System (MES) platforms that hold the promise of increasing performance through parallel execution.

However, these systems cannot rely on traditional approaches for system integration. Most notably: (1) The bus-based communication architecture is not a scalable solution for these systems – adopting Network-on-Chip (NoC) communication allows the system to reach a high level of parallelism and scalability. (2) The traditional cache-based memory hierarchy imposes a huge inefficiency in power consumption, due to the complexity of caching logic, and also in performance, due to the network traffic generated by coherence protocols.

The traditional hardware-managed memory hierarchy needs to be revisited to: (1) address

shortcomings (e.g., coherence overhead) of current techniques when cores are replicated beyond certain numbers, and (2) adapt the memory subsystem to a diverse workload, with variable memory requirements, running on a MES.

Traditionally, cache-coherent memory hierarchies have been the default choice for the memory subsystem of embedded systems, mainly because they hide the memory hierarchy from software and don’t require any software intervention. Unfortunately, the overhead of purely hardware-managed memory hierarchy grows very fast as the number of cores increases. This overhead manifests itself in the exacerbated power consumption of the memory subsystem and the cache coherence logic and also the coherence traffic in the NoC. On the other hand, hardware is generally oblivious to the variable requirements of individual threads in a workload. Tuning allocation of memory resources to threads in order to improve the overall performance of the entire workload proves challenging for purely hardware-managed hierarchies.

Software-Programmable Memories (SPM, also known as scratchpad memories) are a promising alternative to hardware-managed caches. For equivalent data capacity, SPMs are around 30% smaller, slightly faster, yet consume about 30% less power than cache [22]. Additionally, no coherence management is required due to their software-programmable nature. However, SPMs require explicit software management. Researchers have previously proposed methods to ease this burden [115, 95, 137, 20, 78, 17].

Most previous efforts have assumed a single thread running on a core with a private SPM in isolation of other threads running concurrently or entering the system at a later time, or they assumed the workload mix is known ahead of time. The shortcomings are twofold: (1) They do not utilize memory resources that are local to idle cores – these idle memory resources present an opportunity for more efficient use of on-chip memory to boost the overall system efficiency. (2) The variation in the memory intensity level as well as working set size between concurrently running threads in emerging diverse workloads is neglected – this

variation causes on-chip memory resources to be more valuable for some threads than for others. The utilization of data memory can vary not only between threads in a workload, but also within a single thread over the course of its execution. Figure 2.1 illustrates the variation in temporal memory access patterns both between and within two different benchmarks.

Figure 2.1: Number of memory accesses measured periodically for the execution of the (a) wikisort and (b) fft benchmarks: utilization of memory resources changes within a thread and between threads.

Here we outline an alternative hierarchy organization for MES that employs a Software-Assisted Memory (SAM) hierarchy composed of both cache and SPM, guided by static analysis as well as holistic runtime support. SAM alleviates the aforementioned drawbacks of cache-only hierarchies. Figure 2.2 compares the SAM hierarchy with a purely hardware-managed hierarchy. The SAM hierarchy requires less hardware management and relies on the software components of a system (i.e., application, compiler, and operating system) to manage memory resources, allowing it to be more adaptable to the workload's needs and to consume less power.

We first propose ShaVe-ICE, system software and hardware support for the management of distributed SPMs in manycore embedded systems. ShaVe-ICE enables the system to efficiently share the SPM resources distributed across the chip. Sharing the SPM resources improves the average access latency for all the concurrently running threads, and hence improves the overall

performance of the system. By sharing the physically distributed SPM space between all the running threads through virtualization, memory resources can be utilized more efficiently.


Figure 2.2: Software-assisted memory hierarchy vs. hardware-managed memory hierarchy. Relative size of gray components shows their contribution in managing the memory hierarchy.

Unlike previous approaches, in this work we allow all threads to opportunistically exploit the entire physical memory space from all distributed SPMs for unpredictable workloads. ShaVe-ICE supports dynamic SPM sharing for workloads in which threads may enter and exit at any time and contend for on-chip memory resources.

It is worth mentioning that ShaVe-ICE does not focus on analyzing the access pattern of data objects in order to find the most profitable data objects to bring on-chip. That is an orthogonal problem which researchers have looked into extensively [43]. Instead, we focus on sharing a distributed SPM space between a mix of unknown threads where each thread explicitly requests SPM space for its most accessed data objects throughout its execution.

We define the necessary software/hardware components for sharing distributed SPMs under contention from unpredictable workloads, and build the foundation towards SPM-based manycore embedded systems. To the best of our knowledge, ShaVe-ICE is the first scheme

for sharing distributed SPMs in manycore embedded systems when an unpredictable mix of workloads is executing.

We then extend ShaVe-ICE to support a hybrid hierarchy composed of both distributed SPMs and caches and show how shared data is managed in such a hierarchy. Through sample microbenchmarks, we present the opportunities this hierarchy creates to reduce coherence protocol overhead, lower the energy consumed in the memory subsystem, and adapt the memory configuration to the workload's requirements to reduce execution time.

2.2 Prior Work on SPMs in Multicores and Manycores

Extensive research exists on SPM management1. Initially, researchers focused on static approaches to manage the SPM for a single thread running on a single-core architecture. Static SPM management techniques identify the part of the overall data set (e.g., the most frequently used data) that maximally improves the runtime performance of the application upon placement in SPM [116, 135, 15, 113, 147, 136]. However, data placement is fixed throughout the application's execution, meaning that the locations of data cannot be changed at runtime. Alternatively, dynamic approaches allow the movement of data at runtime, enabling the system to swap in more frequently accessed data and swap out less-used data over time to achieve greater performance optimization [47, 89, 76].

Other works have considered a scenario in which multiple tasks are simultaneously running on a single-core architecture and sharing a single physical SPM. [57] proposes a compile-time analysis approach to support concurrent execution of tasks sharing the same SPM resource and assumes all working sets are known at compile time. [55] provides a high-level programming interface for SPM and DMA which can be used by the programmer for heap management. At runtime, a dynamic memory manager responds to memory space requests and maps data

1A more comprehensive survey can be found in [133].

to the physical SPM as long as there is space.

More recently, the advent of integrating multiple processor cores on a single chip has resulted in a shift of focus for SPM management in the direction of multicore processors. In the realm of multicore SPM management, [78, 20, 95, 18, 19] all propose various compiler-based dynamic data management techniques for a memory hierarchy that incorporates SPM transparently without requiring explicit memory management.

Additionally, approaches have been proposed for managing SPM data in multicore systems with multiple tasks sharing SPM space. [137, 40] both propose compile time static analysis techniques for SPM allocation for a fixed set of tasks. [137] specifies an ILP formulation that integrates task scheduling with SPM partitioning and allocation at compile time to produce both a schedule and static data allocation that optimize performance by profiling the set of tasks. [40] generates SPM allocation of arrays using static analysis at compile time for a fixed set of tasks executing on a MPSoC sharing distributed SPMs in order to both improve performance and minimize energy consumption in embedded systems. Similarly, [98] defines an OpenMP extension and compiler optimization to allocate parts of data arrays to distributed SPM for parallel programs executing on an MPSoC. They improve performance by locating data near the processing elements that access it most frequently. [138] proposes algorithms that use profiling information to produce SPM mappings in order to minimize the worst case response time of a multitasking workload sharing SPM. These approaches all define SPM data allocation and mapping prior to execution, and, whether static or dynamic, therefore cannot handle diverse and unpredictable workloads.

Runtime adaptive approaches typically incorporate compile-time information with runtime observations to make allocation decisions at runtime. [39] profiles a fixed set of multimedia applications with varying inputs and passes this information to a runtime routine. The runtime routine monitors the application behavior and attempts to match it to one of the known profiles, and maps data to SPM accordingly to reduce energy consumption. [45] defines

a framework that allows programmers to guide runtime decisions for allocating heap data to SPM in order to reduce energy consumption. [26, 25, 23, 24] require programmer guidance for allocation, but also specify a virtualization layer that abstracts the explicit SPM management from the programmer and supports over-subscription. All of these adaptive approaches make allocation decisions at runtime for multicore architectures with multiple tasks sharing and contending for SPM space.

Recent efforts have considered a hybrid hierarchy in which every core has a private SPM as well as a private cache. Their focus was to distinguish between data better suited for cache versus SPM. [7] and [81] both propose hybrid memory hierarchies (i.e., caches + SPM) that support globally addressable and coherent address spaces. Alvarez et al. [7] addressed the issue of keeping the separate SPM and cache memories within a core coherent. Chakraborty et al. [36] proposed a configurable hybrid local memory that can be dynamically partitioned between a cache and an SPM based on the varying requirements of a thread, with the idea that a varying workload may sometimes benefit from different SPM or cache capacities.

In ShaVe-ICE, we enable an unpredictable mix of threads contending for SPM space to transparently allocate and access SPMs physically distributed throughout a chip. Through virtualization, our solution supports unlimited SPM allocation sizes, and makes global allocation decisions to benefit the performance of the overall workload. The access and allocation mechanisms and policies are scalable for manycore systems.

2.3 Target System Architecture Template

We target an architecture in which a number of threads are running concurrently on a manycore system explicitly requesting SPM allocations/deallocations. Here we describe the

SPM programming model, the architecture model, and the execution model of such an architecture.

2.3.1 SPM Programming Model

With an SPM-based memory hierarchy, the programmer and/or the compiler have explicit control over data movements. This enables them to prefetch a frequently used data object earlier than its actual access. SPM-based systems usually rely on explicit allocation/deallocation requests made by threads using specified Application Programming Interfaces (APIs). A set of APIs, registered with the kernel, is used by the programmer and/or the compiler to program SPMs. The API calls we define are listed in Table 2.1:

SPM_ALLOC() and SPM_DEALLOC() APIs can be used explicitly in the source code to request moving parts of a thread's virtual address space between on-chip SPMs and the main memory through Direct Memory Access (DMA) transfers.

SPM_ALLOC() receives the base virtual address (BaseVA) and the size in bytes (Size) as arguments, along with a flag (AllocationMode) which determines the allocation mode. There are two possibilities for the AllocationMode flag: COPY and UNINITIALIZE. When the flag is set to COPY, the allocated data is copied from main memory to SPM, whereas if the flag is set to UNINITIALIZE, the allocation simply reserves SPM space without initializing the allocated memory. UNINITIALIZE allocations eliminate unnecessary DMA transfers when the data object values are not yet initialized in memory, so the current values do not need to be copied from main memory. isSharedData declares whether the data belongs to one thread or is shared data.

Table 2.1: ShaVe-ICE APIs

Method        Parameter         Type    Note
SPM_ALLOC     BaseVA            uint64  Base virtual address of memory region
              Size              uint    Size of allocation
              AllocationMode    uint    Type of allocation
              isSharedData      bool    Flag to determine if this is shared data
              Metadata          uint    Other encoded hints for runtime manager
SPM_DEALLOC   BaseVA            uint64  Base virtual address of memory region
              Size              uint    Size of deallocation
              DeallocationMode  uint    Type of deallocation
              isSharedData      bool    Flag to determine if this is shared data
              Metadata          uint    Other encoded hints for runtime manager
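Read as C declarations, the two calls of Table 2.1 might look roughly as follows. This is an illustrative sketch only: the integer types are mapped onto stdint/unsigned types here, return types are not specified in the text and are assumed void, and, as the snippet in Figure 2.4 suggests, the trailing arguments may be optional in practice.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative prototypes matching the parameter lists of Table 2.1. */
    void SPM_ALLOC(uint64_t BaseVA,             /* base virtual address of memory region   */
                   unsigned Size,               /* size of allocation in bytes             */
                   unsigned AllocationMode,     /* COPY or UNINITIALIZE                    */
                   bool     isSharedData,       /* true if the region holds shared data    */
                   unsigned Metadata);          /* other encoded hints for runtime manager */

    void SPM_DEALLOC(uint64_t BaseVA,           /* base virtual address of memory region   */
                     unsigned Size,             /* size of deallocation in bytes           */
                     unsigned DeallocationMode, /* WRITE_BACK or DISCARD                   */
                     bool     isSharedData,     /* true if the region holds shared data    */
                     unsigned Metadata);        /* other encoded hints for runtime manager */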

Similarly, SPM_DEALLOC() receives the base virtual address (BaseVA) and the size in bytes as arguments, along with a flag (DeallocationMode) which determines the deallocation mode. There are two possibilities for the DeallocationMode flag: when this flag is set to WRITE_BACK, the data is written back to main memory from the SPM; if the flag is set to DISCARD, the deallocation happens without updating the values in main memory. The DISCARD flag is useful when the data is not needed anymore, preventing an unnecessary memory write-back. A common case is to discard a piece of heap data in SPM that is going to be freed immediately and therefore does not need to be written back to main memory.
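That common case, temporary heap data that never needs to reach main memory again, might look like the following sketch (three-argument calls as in Figure 2.4; the buffer name and the size N are arbitrary placeholders):

    int *tmp = (int *) malloc(N * sizeof(int));
    SPM_ALLOC(tmp, N * sizeof(int), UNINITIALIZED);   /* reserve SPM space, no DMA copy-in  */
    /* ... produce and consume tmp entirely on-chip ... */
    SPM_DEALLOC(tmp, N * sizeof(int), DISCARD);       /* release SPM space, skip write-back */
    free(tmp);                                        /* the heap object is dead anyway     */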

Figure 2.3 shows the number of cycles that allocations and deallocations of different sizes take with different modes. While the overhead of allocation and deallocation with COPY and WRITE_BACK grows substantially as the size of the allocation grows, UNINITIALIZE and DISCARD keep this overhead low and nearly constant. Note that allocations with COPY can take more cycles than deallocations with WRITE_BACK because some of the allocations have to wait additional cycles for data migration.

These method calls can be inserted manually by a domain-expert programmer, automatically by a compiler pass, or by a combination of both. Finding the best candidates for SPM mapping is an orthogonal issue that previous researchers have looked into and is not the focus of this work. Currently, in a preprocessing step, we analyze the source code automatically to identify heap allocation and deallocation sites and insert an SPM_ALLOC and SPM_DEALLOC for each pair, accordingly. We also add SPM_ALLOC and SPM_DEALLOC calls to handle each function's stack. Anytime a function is called, an SPM_ALLOC is responsible for bringing that function's

stack on-chip, and right before returning from that function, an SPM_DEALLOC call removes it from the SPM. These SPM annotations are then pruned by profiling the program's memory accesses and removing annotations for rarely accessed data objects.

Figure 2.3: Number of CPU cycles that a thread has to be stalled for an SPM API to be handled by the OS/HW. SPM page size is assumed to be 64 bytes.
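As a concrete (and purely hypothetical) illustration of the stack handling described above, the function below shows where the preprocessing step would insert the calls; the function, its locals, and the sizes are invented for the example, and the three-argument call form of Figure 2.4 is assumed.

    #define WINDOW 64                                     /* illustrative size            */

    int filter(const int *in, int *out, int n)
    {
        int window[WINDOW];                               /* per-call stack data          */
        SPM_ALLOC(window, sizeof(window), UNINITIALIZED); /* inserted at function entry   */

        for (int i = 0; i < n; i++) {
            window[i % WINDOW] = in[i];                   /* hot accesses now hit the SPM */
            out[i] = window[i % WINDOW] / 2;
        }

        SPM_DEALLOC(window, sizeof(window), DISCARD);     /* inserted before every return */
        return 0;
    }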

The code snippet in Figure 2.4 shows how these APIs can be used within the source code of an example thread.

Calling these APIs transfers control to the operating system’s SPM allocator. The ShaVe-ICE allocator decides whether or not to grant the request and broadcasts messages throughout the platform accordingly. These API calls are blocking, meaning that control does not return

...
int *buffer = (int*) malloc(BUF_LENGTH * sizeof(int));
SPM_ALLOC(buffer, BUF_LENGTH * sizeof(int), UNINITIALIZED);
...
for (i = 0; i < BUF_LENGTH; i++) {
    ...
}
...
SPM_DEALLOC(buffer, BUF_LENGTH * sizeof(int), DISCARD);
free(buffer);
...

Figure 2.4: A code snippet using SPM APIs.

to the thread until the data transfers are complete.

2.3.2 Architecture Model

We target manycore embedded systems in which cores are connected via a mesh-like NoC as shown in Figure 2.5.

Each core consists of a simple in-order processor (CPU), a local memory (LM) that can be partitioned between an L1 data cache and an SPM, a Memory Management Unit (MMU), a Direct Memory Access (DMA) engine, an L1 instruction cache, an optional L2 cache for both data and instructions, and a network interface that manages NoC traffic for inter-core communication. The on-chip data memory is distributed evenly among all cores throughout the system.

In this architecture, the SPM address space is disjoint from the main memory address space, creating another layer in the memory hierarchy. All or a subset of distributed SPM space is accessible to each core. To enable that, each core’s MMU is extended to include an SPM Address Translation Table (SATT). SATT stores all of the virtual page to physical SPM address mappings for the pages that belong to that core.

Figure 2.5: A 4X4 example of the target manycore architecture with software-assisted memory hierarchy.


To avoid the problems of cache-coherent architectures, there is no cache coherence logic in this architecture, and caches are therefore used to store private data only. On the other hand, the entire on-chip SPM space is accessible from any core. Any data that needs to be shared between multiple threads has to be explicitly copied to the SPM space by the software.

The local memory of each core is initially partitioned into equally sized cache and SPM halves. [36] has shown how a hybrid local memory can be advantageous compared to a cache-only hierarchy in terms of execution time and/or energy consumption. NVIDIA has used this approach in its Fermi GPGPUs [3]. We assume there are hardware mechanisms to change this balance at runtime. These hardware mechanisms have to be triggered by the runtime SPM manager, which we discuss later in this section.

2.3.3 Execution Model

Many modern embedded platforms support the concurrent execution of multiple independent applications on shared resources where applications start and stop at any time. We assume a number of processes, possibly each one spawning a number of threads, executing on this platform. Because of the abundance of cores, each thread is mapped to a core and stays there until completion so there is no context switching on a core.

In manycore systems, coherency is a major bottleneck for scalability. If a piece of data is duplicated on different cores, the consistency of those copies must be guaranteed by a coherence protocol: if any thread updates one of these copies, all other copies must be invalidated, which is a very costly operation [157].

To avoid coherence issues, we only keep one copy of data on-chip and all threads access the same physical copy.

Note that the support for mutual exclusion is still there. Every thread still needs to acquire

a lock in order to enter its critical section. The only difference is that, unlike cache coherence protocols that try to keep multiple copies coherent, here we only keep one copy.

Consider a multithreaded application with virtual address space shared between threads. Once a thread requests a part of the virtual address space to be mapped on-chip, accesses from all other threads to that region of virtual address space will be forwarded to the single copy of that data in the on-chip SPM.

For multithreaded applications, we assume that the programmer explicitly creates threads, for example using the POSIX threads library, and that synchronizations are done explicitly in software. Using the SPM APIs, the programmer informs the runtime system about the data objects that are going to be shared between multiple worker threads before spawning them.
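A minimal sketch of this convention, assuming POSIX threads and the three-argument call form used in Figure 2.4 (the buffer size, worker function, and placement of the shared-data hint are illustrative): the boss thread maps the shared buffer on-chip before dispatching the workers, so every worker access is forwarded to the single on-chip copy.

#include <pthread.h>
#include <stdlib.h>

#define N_WORKERS  4
#define SHARED_LEN 1024

static int *shared_buf;                      /* shared by all worker threads     */

static void *worker(void *arg) {
    long id = (long)arg;
    for (int i = (int)id; i < SHARED_LEN; i += N_WORKERS)
        shared_buf[i] += 1;                  /* forwarded to the single SPM copy */
    return NULL;
}

int main(void) {
    pthread_t tid[N_WORKERS];

    shared_buf = malloc(SHARED_LEN * sizeof(int));
    for (int i = 0; i < SHARED_LEN; i++)
        shared_buf[i] = 0;
    /* The boss thread declares the shared object to the runtime *before*
     * spawning workers; in the full API of Table 2.1 the isSharedData
     * flag would be set here.                                            */
    SPM_ALLOC(shared_buf, SHARED_LEN * sizeof(int), COPY);

    for (long i = 0; i < N_WORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < N_WORKERS; i++)
        pthread_join(tid[i], NULL);

    SPM_DEALLOC(shared_buf, SHARED_LEN * sizeof(int), WRITE_BACK);
    free(shared_buf);
    return 0;
}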

2.3.4 Coherence Issues

Since the SATT is accessed in parallel with the Translation Lookaside Buffer (TLB) for every memory request, there is never any ambiguity about whether the data should be accessed from the SPM or the cache. Therefore, unlike [6], there is no coherence issue between the SPM and the cache within each core. To avoid coherence issues when data is shared between multiple threads, we only keep one copy of each datum, with all threads accessing the same physical copy. This requires the SATT mappings for shared data to be replicated on all of the cores on which the worker threads are mapped, so that all accesses to shared data are forwarded to the unique copy in SPM.

During a clone or fork system call, SATT mappings for shared data are copied as part of the process execution state along with other necessary data. Worker threads can allocate their own private data on SPM as well, but the assumption is that all shared data objects are allocated on SPM by the boss thread before worker threads are dispatched. Future work could relax this restriction by allowing cores to exchange SATT mappings.

2.3.5 OS/Runtime Support

The SPM manager, as part of the operating system's memory manager, is the core software component of an architecture with software-programmable memories. Any SPM API call from any thread invokes the operating system, which forwards the call to the SPM manager to handle the request. A resulting research problem is to define intelligent policies that efficiently manage memory resources based on the current needs of the workload.

Figure 2.6 shows the organization of the SPM manager. The SPM manager receives the API calls and, optionally, some memory-related hardware counters, and makes decisions about SPM data allocations, deallocations, and migrations, as well as the partitioning of each core's LM into cache and SPM. These actions are carried out by sending messages to the MMUs distributed across the platform.

Repartitioning the local memory of each core requires the SPM manager to detect memory phase changes. Tajik et al. [142] have shown that memory phases can be detected at runtime using simple hardware counters. Determining the appropriate cache and SPM sizes for each phase at runtime remains an open problem to be addressed in future work.

Static analysis during compilation can be used to classify data for cache/SPM mappings and could help the SPM manager decide at runtime how the local memories should be partitioned [35].


Figure 2.6: SPM manager.



2.4 Motivational Example

In many scenarios, the amount of data requested by a thread to be allocated on-chip is larger than the capacity of the SPM local to the core that the thread is running on. Therefore, some portion of the data must remain off-chip. In such cases, borrowing physical SPM space from neighboring SPMs could boost the overall performance.

Figure 2.7 shows an example of a platform with four cores, when three threads (T1, T2, and T3) with various memory working set sizes are running on cores C1, C4, and C2 respectively. Previous approaches would restrict each thread to use only the physical SPM local to the core it executes on. However, by sharing the SPM space, T2’s data can be placed not only on C4’s SPM but also on C3’s SPM, which is an idle core, and on C2’s SPM, where T3 has a small working set size and does not need its entire local SPM. At the same time, T1 can also benefit from sharing by using part of C3’s SPM that is not used by T2. The end result is the improvement in the overall performance of these three threads.

Figure 2.8 shows a more detailed motivating scenario using a synthetic example where three threads T1, T2, and T3 sequentially enter a system with four cores (C1, C2, C3, and C4), where every core has a local data SPM. The left-hand panel shows the state of the system when the local-only allocator is used, while the right-hand panel shows the same state when a local/remote allocator is used. In both panels, time progresses from left to right. Details of each thread are shown in Table 2.2.

We consider hit latency to be 1 cycle for local hits, 5 cycles for remote hits to account for the network latency, and miss latency to be 100 cycles to account for off-chip memory accesses.

Figure 2.7: The ShaVe-ICE allocator is invoked when a thread Ti mapped on core Cj calls any of the SPM APIs. It virtualizes and ultimately shares the entire distributed SPM space between all concurrently running threads. Dedicated hardware assists enable remote allocations and remote memory accesses.

T1 enters the system at cycle 0 and is scheduled to run on C1. This thread requests 12KB of SPM space; however, with the local-only allocator, only 8KB of SPM can be given to this thread, even though C2, C3, and C4 are all idle (Figure 2.8.a). If we allow remote allocation on neighboring cores, all 12KB can be granted to T1 (Figure 2.8.a').

Table 2.2: Details of threads in the motivational example of Figure 2.8

Thread  Core  Entrance Cycle  Working Set Size  #Accesses  #Execution Cycles (all accesses hit, 1-cycle access latency)
T1      C1    0               12KB              18000      60K
T2      C4    20K             20KB              24000      65K
T3      C2    45K             6KB               3000       40K


Figure 2.8: An example showing how sharing the entire SPM space improves the overall performance of a 2X2 system with 3 threads running on it. The left-hand column shows the state of SPM allocations when allocations are limited to the local SPM only; the right-hand column shows the same scenario when remote allocations are also possible in addition to local allocations. (a)&(a'): only thread T1 is running, (b)&(b'): thread T2 starts execution after 20K cycles while T1 is still running, (c)&(c'): thread T3 starts execution after 25K cycles while both T1 and T2 are still running. Average and maximum execution times among all three threads are improved by about 2X and 2.9X respectively.

T2 gets scheduled on core C4 after 20 kilocycles. This thread needs 20KB of SPM space. While with the local-only allocator a maximum of 8KB can be given to this thread (Figure 2.8.b), remote allocation allows us to accommodate all 20KB by borrowing space from the neighboring idle cores, namely C3 and C2 (Figure 2.8.b').

Finally, T3 gets scheduled on core C2 after 45 kilocycles. This thread needs only 6KB of SPM space, which can be granted with the local-only allocator (Figure 2.8.c). However, as we saw in the previous step, with a local/remote allocator T2 has occupied the entire SPM belonging to C2. Therefore, as shown in Figure 2.8.c', half of the SPM space on C2 is made available for T3 by evicting some of the pages that were previously allocated to T2.

To summarize, when we are limited to local SPM allocations (left panel of Figure 2.8), all three threads complete their execution after 229 kilocycles, and on average each thread takes about 123 kilocycles to finish, 2.2X more than the ideal case.

But when we virtualize the entire available SPM space and enable all threads to use it (right panel of Figure 2.8), all three threads complete their execution after 100 kilocycles. In this case, each thread takes about 60 kilocycles on average, which is 2X faster than the previous case. The reduction in off-chip memory traffic also results in energy savings.

This simple scenario shows how virtualizing the entire SPM space and allowing threads to compete for memory resources distributed across a chip can help improve the overall performance of threads running on manycore systems with SPMs.

2.5 ShaVe-ICE Details

ShaVe-ICE is an operating system implementation that supports virtualization and sharing of on-chip SPMs, together with the required architectural support. The software consists of the SPM allocator (Sections 2.5.1 and 2.5.2), and the hardware support includes custom memory management units along with a network protocol (Section 2.5.3).

2.5.1 SPM Allocator

The SPM allocator, as part of the operating system's memory manager, is the core of the ShaVe-ICE scheme. An SPM API call from any thread invokes the operating system, and the call is forwarded to ShaVe-ICE's SPM allocator to make a decision about the request. Currently, we assume all API calls are handled by a centralized SPM manager that runs on the same core where the request was made. Allocation is done at the granularity of a page. The allocator aligns the boundaries of an allocation request and also makes sure that a page is not allocated more than once. The allocator, based on its policy (discussed in Sec. 2.5.2), determines the set of actions required and subsequently broadcasts the proper SPM management messages throughout the platform to accommodate the request. The requests are served by the centralized manager in a first-come, first-served (FCFS) order, and therefore there is no chance for erroneous behavior and/or deadlock in decision making.
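As a concrete illustration of the alignment step, the following sketch assumes the 64-byte pages used elsewhere in this chapter; the helper names are ours, not the allocator's actual interface. The request's base address is rounded down and its end rounded up to page boundaries, and pages that are already mapped are skipped so that no page is allocated twice.

#include <stdint.h>
#include <stdbool.h>

#define SPM_PAGE_SIZE 64u   /* bytes; matches the page size assumed in Figure 2.3 */

/* Hypothetical query into the allocator's bookkeeping state. */
bool spm_page_is_mapped(uint64_t vpage);

/* Align a request to page boundaries and count the pages that still
 * need to be allocated (already-mapped pages are skipped).           */
static unsigned pages_to_allocate(uint64_t base_va, uint64_t size) {
    uint64_t first = base_va / SPM_PAGE_SIZE;                              /* round down */
    uint64_t last  = (base_va + size + SPM_PAGE_SIZE - 1) / SPM_PAGE_SIZE; /* round up   */
    unsigned needed = 0;
    for (uint64_t vp = first; vp < last; vp++)
        if (!spm_page_is_mapped(vp))
            needed++;
    return needed;
}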

2.5.2 Allocation Policies

ShaVe-ICE’s allocation policies have to balance the memory needs of locally-pinned threads, while allowing for sharing of SPM resources among cores. We present both non-preemptive as well as preemptive heuristics. All policies work in a best effort manner. Their goal is to reduce the average execution time of all threads that are running, possibly degrading some for the benefit of the whole mix. They attempt to map the maximum number of requested pages to on-chip SPMs. Anything that can not be mapped to on-chip SPMs due to capacity limitation, stays in the main memory.

Policy 1: Local Allocator (LA)
Input: a new allocation request for P pages from the thread running on core[i]
Output: number of mapped pages M
1  R = P;
2  T = maximum number of contiguous free pages on SPM[i];
3  while R > 0 and T > 0 do
4      allocate min(R, T) pages on SPM[i];
5      R -= min(R, T);
6      T = maximum number of contiguous free pages on SPM[i];
7  end
8  M = P - R;

2.5.2.1 Non-preemptive Allocation Heuristics

In non-preemptive allocators, once the data has been allocated on-chip, it stays there until the thread explicitly calls SPM_DEALLOC() for that piece of data.

2.5.2.1.1 Local Allocator (LA)

With the Local allocator, a thread's allocation request is granted only if there is space available on the local SPM. The pseudo-code of this policy is shown in Policy 1. In each iteration (lines 3...7), the maximum number of free contiguous physical pages on the SPM is found and allocated. Depending on the current allocation status of the local SPM, the pages requested by the thread could be allocated contiguously or dispersed throughout the physical SPM. At the end, the number of pages that were successfully mapped is returned.

The Local allocator implements the simplest allocation policy and does not need any hardware support for remote accesses. Since the state-of-the-art techniques for SPM management are limited to the physical SPM local to each core, we will use this policy as the baseline.

Policy 2: Random Remote Allocator (RRA)
Input: a new allocation request for P pages from the thread running on core[i]
Output: number of mapped pages M
1  R = P - LocalAllocator(P, i);
2  RL = list of cores;
3  while R > 0 and RL is not exhausted do
4      j = randomly pick a core from RL;
5      R -= LocalAllocator(R, j);
6  end
7  M = P - R;

2.5.2.1.2 Random Remote Allocator (RRA) The Random Remote allocator grants a thread's allocation request if there is space available anywhere on-chip, and assigns data to random SPMs. This allocator implements the simplest allocation policy for remote SPM allocation. We intentionally implemented this policy to evaluate the pure advantage of sharing, disregarding the importance of data placement. The pseudo-code of this policy is shown in Policy 2. It initially behaves like the local allocator, trying to allocate as many pages as possible on the local SPM. If there are more pages to allocate, in each iteration (lines 3...6), a core is chosen randomly and the maximum number of free pages on its SPM is allocated for the thread's request.

2.5.2.1.3 Closest Neighborhood Allocator (CNA) With the Closest Neighborhood allocator, the nearest available SPM slots are allocated for any allocation request. The pseudo-code of this policy is shown in Policy 3.

The order in which the neighboring SPMs are considered for data allocation (lines 2...6) is based on the hop distance from the core that issued the allocation request. Only cores that are within a radius of T hops from the home core are considered. Figure 2.9 shows one example where the white core in the middle issues an allocation request. First, the allocator searches for free physical pages in the local SPM (hop distance = 0). The next candidates are the SPMs within one hop of the core requesting allocation (labeled 1 through 4). If more SPM pages

Policy 3: Closest Neighborhood Allocator (CNA)
Input: a new allocation request for P pages from the thread running on core[i], hop-threshold = T
Output: number of mapped pages M
1  R = P - LocalAllocator(P, i);
2  DL = list of cores within T radius of core[i], sorted based on distance from that core;
3  while R > 0 and DL is not exhausted do
4      j = next element in DL;
5      R -= LocalAllocator(R, j);
6  end
7  M = P - R;

are needed, the allocator will search the SPMs within hop distance two (labeled 5 through 12). This process continues until all pages are allocated or all candidates are exhausted.

21 14  7 15 22
13  6  2  8 16
 5  1  0  3  9
20 12  4 10 17
24 19 11 18 23

Figure 2.9: Hop-distance-based search for empty SPM space in Closest Neighborhood allocation policy.
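The distance ordering behind CNA and Figure 2.9 can be sketched as follows, assuming hop distance is the Manhattan distance between core coordinates on the mesh; the data structures and the 5x5 dimension are illustrative.

#include <stdlib.h>

#define MESH_DIM 5                       /* e.g., the 5x5 neighborhood of Figure 2.9 */

static int hop_distance(int core_a, int core_b) {
    int ax = core_a % MESH_DIM, ay = core_a / MESH_DIM;
    int bx = core_b % MESH_DIM, by = core_b / MESH_DIM;
    return abs(ax - bx) + abs(ay - by);  /* Manhattan distance on the mesh */
}

static int g_home;                       /* core issuing the allocation request */

static int by_distance(const void *a, const void *b) {
    return hop_distance(*(const int *)a, g_home) -
           hop_distance(*(const int *)b, g_home);
}

/* Build the candidate list DL of Policy 3: all cores within T hops of
 * 'home', sorted by hop distance (closest first). Returns its length. */
static int build_candidates(int home, int T, int *out) {
    int n = 0;
    g_home = home;
    for (int c = 0; c < MESH_DIM * MESH_DIM; c++)
        if (hop_distance(c, home) <= T)
            out[n++] = c;
    qsort(out, n, sizeof(int), by_distance);
    return n;
}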

2.5.2.2 Preemptive Allocation Heuristics

Preemptive allocation heuristics may allow for more efficient sharing of memory resources, but could affect previously allocated data by (1) migrating it to a different place on-chip or (2) evicting it to main memory.

2.5.2.2.1 Closest Neighborhood with Guaranteed Local Share Allocator (CNGLSA) This heuristic is a modified version of the Closest Neighborhood allocator that finds the nearest available SPM slot for any allocation request. However, it also guarantees that a predefined percentage of each SPM is allocated for requests from the thread running on that SPM's local core, should that thread need it. The pseudo-code of this policy is shown in Policy 4. Note that this policy does not reserve the SPM ahead of time; rather, it frees up the required SPM space by relocating the data from there to a different SPM or to off-chip memory.

With this policy, if a new thread is assigned to an idle core whose SPM was already allocated to another thread, all or a portion of that SPM will be relocated/deallocated in order to be used by the new thread. Currently, we support two ways of determining the page(s) that should be relocated (line 4): (1) the least recently accessed page, (2) the least frequently used page. The assumption is that there are hardware counters that can keep track of these statistics at runtime. Once the page is selected, we attempt to relocate that page to a free SPM nearest to its owner. If the entire on-chip SPM space within a certain hop distance is

Policy 4: Closest Neighborhood with Guaranteed Local Share Allocator (CNGLSA)
Input: a new allocation request for P pages from the thread running on core[i], hop-threshold = T, guaranteed local share = X%
Output: number of mapped pages M
1   R = P - LocalAllocator(P, i);
2   S = local SPM share;
3   while R > 0 and S < X% do
4       ... (select a guest page on SPM[i] — least recently accessed or least frequently used —
        relocate it to the nearest free SPM or evict it to main memory, allocate the freed
        page to the request, and update R and S) ...
11  if R > 0 then
12      R -= ClosestNeighborhoodAllocator(R, i);
13  end
14  M = P - R;

utilized, we evict that page to main memory.
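The victim-selection step described above can be sketched as follows, with hypothetical per-page counters; whether the least-recently-accessed or least-frequently-used criterion is applied would be a configuration choice of the allocator.

#include <stdint.h>

struct spm_page_stats {
    int      owner_core;     /* core whose thread allocated this page     */
    uint64_t last_access;    /* cycle of the most recent access (for LRU) */
    uint64_t access_count;   /* number of accesses so far (for LFU)       */
    int      valid;
};

enum victim_policy { VICTIM_LRU, VICTIM_LFU };

/* Pick a guest page on SPM[local_core] to relocate so the locally-pinned
 * thread can reclaim its guaranteed share. Returns -1 if no guest page.  */
static int pick_victim(const struct spm_page_stats *pages, int n_pages,
                       int local_core, enum victim_policy pol) {
    int best = -1;
    for (int i = 0; i < n_pages; i++) {
        if (!pages[i].valid || pages[i].owner_core == local_core)
            continue;                          /* only guest pages are candidates */
        if (best < 0 ||
            (pol == VICTIM_LRU && pages[i].last_access  < pages[best].last_access) ||
            (pol == VICTIM_LFU && pages[i].access_count < pages[best].access_count))
            best = i;
    }
    return best;
}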

2.5.3 Hardware Support

Many of the previous allocation policies require some hardware support to manage the distributed memory mappings as well as to handle remote accesses via the NoC. This subsection explains all the required hardware assists.

2.5.3.1 Distributed Memory Management Units

MMUs are core on-chip components that enable all memory-related communication. An MMU sends and receives all memory-related network messages to and from its local SPM. MMUs are distributed throughout the chip, one per core. Every memory request from the CPU is handled by the MMU. Each MMU has its own SPM Address Translation Table (SATT), which holds all the virtual-to-physical address translations for the thread on that core. Each entry of the SATT (Figure 2.10) holds five kinds of information: (1) the virtual page address, (2) the ID of the core hosting that page, (3) the physical address on the host SPM, (4) a bit flagging whether the entry is valid, and (5) a flag bit indicating whether the page in that row holds shared or private data. If no mapping exists in the SATT for the memory location requested by the CPU, the data cannot be found on-chip; the address translation is then done through the Translation Lookaside Buffer (TLB) and the request is forwarded to off-chip memory.

valid | virtual page | host core ID | physical SPM address | shared data

Figure 2.10: SATT entry
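In software terms, the SATT entry of Figure 2.10 can be pictured as the structure below; the field widths are our guesses (the text does not fix them), and only the field roles follow the figure.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative software view of one SATT entry (Figure 2.10). */
struct satt_entry {
    uint64_t virtual_page;   /* virtual page address                                   */
    uint16_t host_core_id;   /* ID of the core whose physical SPM hosts the page       */
    uint32_t spm_address;    /* physical address of the page within the host SPM       */
    bool     valid;          /* entry holds a live mapping                              */
    bool     shared;         /* page holds shared data (mapping replicated) or private  */
};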

2.5.3.2 Network Protocol

Table 2.3 lists all the network messages required to implement the SPM-related communication between MMUs via the NoC. There are six kinds of request messages and five kinds of response messages. SPMReq READ and SPMReq WRITE are used for remote memory read and write accesses, which are answered with SPMResp DATA and SPMResp WRITE ACK respectively. SPMReq ALLOC and SPMReq DEALLOC are used for allocating and deallocating a virtual page and are both acknowledged with SPMResp GOV ACK. In certain scenarios, allocating a page for a thread requires relocating a page from another thread. In such cases, the core that has to relocate its page sends an SPMReq RELOCATION READ to the current host of that page. When the data is read from the SPM, the current host forwards the data with an SPMReq RELOCATION WRITE to the future host of that page while sending back an SPMResp RELOCATION HALFWAY to the owner core. The owner can then signal the core that wants to allocate a new page that the space has been freed. The future host node receives the request and, after copying the data into its SPM, sends an SPMResp RELOCATION DONE to the owner core so it can resume its execution.

Table 2.3: SPM-related network messages in ShaVe-ICE

SPM Request Messages:
  SPMReq READ               Request for a data read
  SPMReq WRITE              Request for a data write
  SPMReq ALLOC              Request for allocating a page on a physical SPM
  SPMReq DEALLOC            Request for deallocating a page from a physical SPM
  SPMReq RELOCATION READ    Request for the read part of an on-chip page migration
  SPMReq RELOCATION WRITE   Request for the write part of an on-chip page migration

SPM Response Messages:
  SPMResp DATA                  Data for read requests
  SPMResp WRITE ACK             Acknowledgment for write requests
  SPMResp GOV ACK               Acknowledgment for ALLOC/DEALLOC requests
  SPMResp RELOCATION HALFWAY    Used to notify an MMU that data has been relocated from specific SPM slots and those slots are now free
  SPMResp RELOCATION DONE       Used to signal the end of a relocation request
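For reference, the message types of Table 2.3 map naturally onto an enumeration such as the following; the exact encoding on the NoC is not specified in the text, so this is purely illustrative.

/* Illustrative encoding of the ShaVe-ICE NoC message types (Table 2.3). */
enum spm_msg_type {
    /* requests */
    SPMReq_READ,                /* remote data read                              */
    SPMReq_WRITE,               /* remote data write                             */
    SPMReq_ALLOC,               /* allocate a page on a physical SPM             */
    SPMReq_DEALLOC,             /* deallocate a page from a physical SPM         */
    SPMReq_RELOCATION_READ,     /* read half of an on-chip page migration        */
    SPMReq_RELOCATION_WRITE,    /* write half of an on-chip page migration       */
    /* responses */
    SPMResp_DATA,               /* data for read requests                        */
    SPMResp_WRITE_ACK,          /* acknowledgment for write requests             */
    SPMResp_GOV_ACK,            /* acknowledgment for ALLOC/DEALLOC requests     */
    SPMResp_RELOCATION_HALFWAY, /* source slots freed; owner may proceed         */
    SPMResp_RELOCATION_DONE     /* relocation complete; owner may resume         */
};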

2.5.4 Data Movements in ShaVe-ICE

Three types of events encompass all possible scenarios for managing SPM data in ShaVe-ICE: allocation/deallocation, access, and migration.

2.5.4.1 Allocations/Deallocations

Allocations and deallocations are initiated via explicit API calls in the application. An allocation request with its related information (address, size) is sent to the SPM allocator, which holds the state of all on-chip memory and can determine whether or not the request will be granted. If the SPM allocator can grant the allocation request, it sends one or more messages back to the requesting core. Every message contains the allocation information: the number of pages granted, the core hosting the allocated SPM space, the base physical address on the target SPM, etc. The requesting node updates its SATT entries for the virtual page(s) and communicates with the host core, and finally the host core's MMU initiates DMA transfers to fulfill the request.

Deallocation follows a similar flow: upon receiving the deallocation request, the SPM allocator notifies the requesting node so it can invalidate the associated SATT entry, and also notifies the host node so that it can invalidate the SPM space and initiate a write-back to main memory.

2.5.4.2 Accesses

Every memory read/write request is initially forwarded to the local MMU. The MMU issues a SATT lookup in parallel with the TLB. Since the SATT stores all the SPM mappings for its host core, this lookup yields the location of the object. If the lookup is a miss, the data is stored in the cache or main memory. If the lookup is a hit, the data exists somewhere on the distributed physical SPM – either the local SPM or one of the remote SPMs.

If the page containing the requested data exists on the local SPM, the MMU provides the physical SPM address and forwards the request to the local SPM, which returns or updates the data.

If the page is on a remote SPM, the MMU initiates a request (SPMReq READ or SPMReq WRITE) to the remote MMU on the node that holds the data, while the requesting CPU stalls. Upon receiving the request from the NoC, the remote MMU fetches or updates the data and returns a message (SPMResp DATA or SPMResp WRITE ACK) to the requester. Finally, the requesting MMU notifies its local core upon receiving the response and execution continues.
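Putting the access path together, the per-request decision made by the MMU can be sketched as follows; the function and type names are ours, and the remote path is shown simply as a blocking exchange of the SPMReq/SPMResp messages of Table 2.3.

#include <stdint.h>
#include <stdbool.h>

struct satt_hit { bool hit; int host_core; uint32_t spm_addr; };

/* Hypothetical lookups/transports provided by the MMU hardware model. */
struct satt_hit satt_lookup(uint64_t vaddr);
uint64_t local_spm_access(uint32_t spm_addr, bool is_write, uint64_t wdata);
uint64_t remote_spm_access(int host_core, uint32_t spm_addr, bool is_write,
                           uint64_t wdata);          /* SPMReq_READ/WRITE over the NoC */
uint64_t cache_or_dram_access(uint64_t vaddr, bool is_write, uint64_t wdata);

/* One load/store as seen by the MMU: the SATT is probed in parallel with the
 * TLB; a SATT hit steers the access to the (local or remote) SPM, a miss
 * falls back to the cache hierarchy / off-chip memory.                       */
uint64_t mmu_access(int my_core, uint64_t vaddr, bool is_write, uint64_t wdata) {
    struct satt_hit h = satt_lookup(vaddr);
    if (!h.hit)
        return cache_or_dram_access(vaddr, is_write, wdata);
    if (h.host_core == my_core)
        return local_spm_access(h.spm_addr, is_write, wdata);
    return remote_spm_access(h.host_core, h.spm_addr, is_write, wdata);
}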

2.5.4.3 Data Migration

Under certain conditions, we may need to relocate a page from one SPM to another SPM or evict it to main memory. This is usually needed in order to free space for a thread that has higher priority to use a physical SPM, in the case of preemptive allocation policies. Migrations are initiated by the allocator and carried out by network messages (SPMReq RELOCATION READ, SPMReq RELOCATION WRITE, SPMResp RELOCATION HALFWAY, SPMResp RELOCATION DONE) exchanged between the affected MMUs, namely those of the owner of the page, the current host of the page, and the future host of the page. During migration, the thread whose data is being migrated is stalled to make the execution safe.

2.6 SPM Sharing Evaluations

We conducted experiments to evaluate the ability of ShaVe-ICE to improve the overall performance and reduce the power consumption of the memory subsystem of an SPM-based platform by sharing the distributed SPM space.

2.6.1 Experimental Setup

We prototyped the ShaVe-ICE scheme in the gem5 architectural simulator [29]. We extended gem5 by adding the SPM, MMU, and SATT components; defining a new network protocol for SPM-related communication; and adding support for the SPM API calls through pseudo-instructions. We ran our experiments in the syscall emulation (SE) mode of gem5, using in-order Alpha cores and mesh NoCs built on gem5's simple network to model the contention delay. Every SPM API call invokes the ShaVe-ICE manager implemented inside the simulator to emulate the behavior of a real operating system.

In the following experiments, all threads begin their execution at the same time and we set the hop distance threshold to 8.

We used Noxim [34] to estimate the energy consumed within the NoC and CACTI [111] to estimate the SPM access energy. DRAM access energy is estimated using the results from [117].

2.6.2 Remote vs. Off-chip Access Latencies

While on-chip accesses tend to be faster than off-chip accesses, if the hop distance between the home core and the host core becomes too long, which is quite possible in larger platforms, the network delay could overcome the benefit of remote allocations. In the most extreme cases,


Figure 2.11: Average access latency for various hop distances.

an on-chip memory access may become more costly than an off-chip access. The exact hop distance that results in degraded performance depends on the implementation of the network as well as the latency of off-chip DRAM. We conducted experiments to measure the pure effect of network delay on remote accesses in an 8X8 mesh NoC. For that, we run a thread on a corner of the mesh (i.e., core (0,0)) accessing a 4KB stack array within a loop performing some basic arithmetic operations on the elements of that array. All other cores are idle. In 15 different runs, we allocate this 4KB array at each of the 15 possible hop distances from the home core (i.e., 0 to 14). In another run, we allocate the array in off-chip DRAM.

Figure 2.11 shows the average memory access latency of this thread for different hop distances. The dashed red line shows the average access latency when the data is accessed from off-chip DRAM. On average, every hop adds 4 cycles to the latency. After 13 hops, the latency experienced by the thread to access the on-chip SPM exceeds the latency of off-chip DRAM. This analysis assumes there is no other network traffic generated by other cores, so a more conservative threshold has to be selected. In our experiments we set the hop distance threshold to 10.²

²This analysis was done just to highlight the pure effect of the network on the latency of remote accesses. All the comparisons in Section 2.6 are done with network traffic being generated by multiple threads running concurrently.
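A rough first-order model is consistent with these measurements. Taking the 2-cycle local SPM hit reported in Section 2.6.4 as the base cost (an assumption, since the remote path may add fixed MMU overhead) and the measured ~4 cycles per hop:

t_remote(h) ≈ t_local + 4·h ≈ 2 + 4·h cycles
t_remote(13) ≈ 54 cycles,   t_remote(14) ≈ 58 cycles

which places the crossover with the measured off-chip average between 13 and 14 hops, matching Figure 2.11; once contention from other cores is added, the practical threshold must sit well below this point.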

2.6.3 Memory Microbenchmark

Since our goal is to evaluate the efficiency of different allocation policies, we generated synthetic threads using an in-house configurable memory microbenchmark (mem-ubench) generator designed to stress the memory resources with different intensities (Figure 2.12). This microbenchmark has three phases (prologue, body, epilogue) that are repeatedly executed and can be independently adjusted in order to make them more compute-bound or memory-bound. The prologue and epilogue consist mainly of regular and random memory operations, respectively, while the body consists mainly of compute operations. Each function operates on three arrays: two globally allocated arrays and a third local to the function. This way, we can exercise many diverse scenarios using the coarse-grained intensity setting in combination with the finer-grained duration settings.

The list of configuration parameters for mem-ubench is:

• length: This parameter controls the length of execution.
• working-set-size: This parameter sets the amount of data that the benchmark allocates and accesses.
• mem/comp-intensity: This parameter defines the ratio of memory operations to compute operations for all phases.
• prologue-duration: This parameter controls the relative duration of the first phase of execution.
• body-duration: This parameter controls the relative duration of the second phase of execution.
• epilogue-duration: This parameter controls the relative duration of the third phase of execution.
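As an illustration of these parameters, one memory-bound configuration with a 20KB working set might be described as in the sketch below; the struct layout and the concrete values are hypothetical, and only the parameter names follow the list above.

/* Hypothetical description of one mem-ubench configuration. */
struct ubench_config {
    unsigned length;             /* total number of 3-phase iterations        */
    unsigned working_set_size;   /* bytes touched across the three arrays     */
    float    mem_comp_intensity; /* ratio of memory ops to compute ops        */
    unsigned prologue_duration;  /* relative length of phase 1 (regular refs) */
    unsigned body_duration;      /* relative length of phase 2 (compute)      */
    unsigned epilogue_duration;  /* relative length of phase 3 (random refs)  */
};

static const struct ubench_config memory_bound_cfg = {
    .length             = 1000,
    .working_set_size   = 20 * 1024,
    .mem_comp_intensity = 0.28f,   /* ~28% of dynamic instructions are memory ops */
    .prologue_duration  = 4,
    .body_duration      = 1,
    .epilogue_duration  = 4,
};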

In this early investigation of SPM-based manycore systems, our goal was to explore the key factors that could affect the performance and power of such systems. Towards this end, we


Figure 2.12: Overall flow of mem-ubench execution: each phase (prologue, body, epilogue) is either memory or compute intensive to a configurable degree. Operations are performed on three data arrays of configurable size with both regular (strided) and randomized access patterns. The number of executions of each phase as well as the entire 3-phase loop are configurable.

use only combinations of the memory microbenchmark that stimulate specific behaviors, to understand these key factors and relate them to the overall behavior of the system. Therefore, we avoided using all combinations of configurations, which could result in randomly mixing different kinds of microbenchmarks. In particular, we developed configurations to evaluate the following policies: (1) Local Only, (2) Random Remote, (3) Closest Neighborhood, (4) Closest Neighborhood with Local Reservation. In each policy comparison, we highlight a specific opportunity that is exploited to improve performance/power. For each policy comparison we combined specific configuration(s) of the microbenchmark to exacerbate the advantage of the superior policy, which allowed us to isolate the sources of improvement so as to more clearly explain them. The following two subsections present the results of our performance

and energy evaluations.

2.6.4 Performance Comparisons: ShaVe-ICE vs. Software Cache

In the first set of experiments, we compare ShaVe-ICE with a software cache. A software cache consists of a simple memory array (similar to an SPM) and a software system that automatically manages that memory so that it behaves similarly to a hardware-managed cache. Instead of using specially-designed hardware for cache management, a software cache uses general-purpose instructions.

Here, we compare the performance of a system using ShaVe-ICE with a similar system that uses software caching. The software cache has the same size as the SPMs in ShaVe-ICE and implements a direct-mapped cache. The hit and miss latencies of this cache are 11 and 430 cycles, respectively. These latencies account for the execution times of the lookup and miss-handling routines and are extracted from [118]. An SPM access that hits the local SPM takes two cycles: one cycle to translate the virtual address to the physical SPM address and one cycle to access the SRAM array.

We consider two cases. In both cases, we configure a 5X5 mesh-based manycore platform with a utilization of 50%. Case A) The size of the SPM and software cache is larger than each thread's working set; therefore ShaVe-ICE allocates every memory object on the local SPMs and the software cache has a very low miss rate (Figure 2.13). The address translation for hit accesses in a software cache has to be done in software, as opposed to the address translation table (SATT in Figure 2.5) used in ShaVe-ICE. Because of that, although most accesses hit in both cases, the threads' performance with ShaVe-ICE is significantly better than with the software cache. Case B) The working set size of some threads is larger than the local SPM and the software cache (miss rate ≈ 15%), while other threads' working sets are smaller than the local SPM and the software cache (Figure 2.13).
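The gap between the two designs can be estimated directly from the latencies above (back-of-the-envelope numbers for the software-cache side only):

AMAT_swcache ≈ hit_time + miss_rate × miss_penalty
Case A (miss rate ≈ 0):    AMAT_swcache ≈ 11 cycles,  vs. 2 cycles for a local SPM hit
Case B (miss rate ≈ 15%):  AMAT_swcache ≈ 11 + 0.15 × 430 ≈ 75 cycles

On the ShaVe-ICE side, case B additionally pays remote and off-chip latencies for the data that does not fit locally, which is why the relative advantage is larger in case A.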

Figure 2.13 shows the improvements in the average and maximum execution time among all running threads for both cases. Larger improvements are observed in case A than in case B because in case A ShaVe-ICE allocates all data objects on the local SPM, and therefore hit accesses are faster.

2.6.5 Performance Comparisons: ShaVe-ICE Policies

In the second set of experiments, we evaluate the various ShaVe-ICE policies with a workload composed of different mem-ubench configurations. In all of the following experiments, we consider an 8X8 mesh-based manycore with equally distributed physical SPMs. To explore the design space, we vary the local SPM size from 4KB to 16KB and the core utilization from 25% to 100%. We consider four different scenarios to show the benefits of the ShaVe-ICE policies under various conditions. Scenario 1 evaluates ShaVe-ICE in the face of core underutilization. Scenario 2 evaluates ShaVe-ICE when threads have different working set sizes. Scenario 3 explores the effect of reducing network traffic for remote memory accesses. Scenario 4 explores guaranteeing a share of each SPM for its locally-pinned thread, which improves performance in some cases.


Figure 2.13: ShaVe-ICE vs. software cache – Case A: thread’s working-set is smaller than software cache and SPM, Case B: thread’s working-set is larger than software cache and SPM. Lower values are better.


Figure 2.14: Miss rate of various mem-ubench configurations for different available SPM capacities. Miss rate does not have a linear relationship with SPM capacity since not all the data is of the same importance.

For these evaluations, we use three configurations of mem-ubench with different working set sizes. Figure 2.14 shows the miss rate of each of these configurations based on the available SPM capacity. In all three configurations, around 28% of executed instructions are memory operations. These three configurations are chosen to exercise different memory workloads by combining various working set sizes and different sensitivities to the available SPM size. Note that the improvement in miss rate does not have a linear relationship with the SPM size.

For all core utilizations, thread mappings are done in a way that distributes the workload evenly across the platform. This also holds true when a mix of two types of mem-ubench threads is used to compose the workload.

For the following studies, we discuss the interesting trends observed in the results. We note, of course, that due to the dynamism in multiple dimensions (e.g., the type of mem-ubench threads, the importance of data objects, the arrival time of requests, etc.), it is difficult to identify the cause of specific trends across all cases because of their complex interactions.

For instance, the dynamism is complicated by different memory objects being accessed with different levels of importance (i.e., number of accesses to each data object). In our mem-ubench, the order of importance of the data objects is: stack-data > global-heap-data > local-heap-data. However, the fact that stack-data is more important than the other data objects does not mean that if the SPM space is limited, only that piece of data will be allocated. This is because allocation requests are served in a best-effort manner (on a first-come-first-served basis) without knowledge of future demands. In this work, we have no support for prioritizing more important data over less frequently accessed data. As stated in our future work, we are working on light-weight monitoring of accesses through hardware counters to estimate the importance of data objects. The fact that not all data is of the same importance is also corroborated by Figure 2.14: the miss rate drop with additional SPM capacity is not constant as the available SPM space increases. Due to the varying arrival times of the allocation/deallocation requests and the system-level status of allocations, one policy may allocate more important data when the SPM size is smaller and/or the utilization is higher, while the other policy may allocate less important data with a larger SPM size and/or lower utilization.

2.6.5.1 Scenario 1: Core Underutilization

The first scenario we examine is when the number of threads running on the platform is smaller than the number of cores (i.e., core underutilization). In this case not all CPUs are active; some cores are therefore unoccupied, and traditional SPM management techniques leave the idle cores' SPM space unused. For this experiment, we replicate identical config 2 mem-ubench threads to generate the workload.

Figure 2.15 shows the average execution time of all threads in the workload when sharing is enabled using the simplest remote allocator (i.e., RRA), normalized to the same metric when threads can only use their local SPM (i.e., LA). This ratio is reported for three different core utilizations from 25% to 75%. For SPM sizes of 16KB, improvements are minimal to none because 16KB can fit most of config 2's working set according to Figure 2.14. In all cases, when SPMs are sized smaller, significant improvement in execution time is observed. This is due to SPM sharing – when SPMs are smaller than a thread's working set, sharing of remote SPMs prevents off-chip memory accesses that are otherwise required for allocation


Figure 2.15: Scenario 1: Average execution time among all threads in an 8x8 system - RRA normalized to LA. Core underutilization provides the opportunity to RRA to allocate more SPM space for the threads that have large working set size.

policies that only allow threads to use their local SPMs.

The highest improvements are observed for 8KB SPMs because the miss rate drops significantly when the available capacity becomes larger than 8KB. This is justified by the sharp decline in the miss rate for config 2 when moving from 8KB to 16KB of SPM space. The maximum improvement happens when the SPM is 8KB and the utilization is 50%. This is because at this utilization, with the local allocator, 32 threads have most of their working set off-chip. Therefore, there is high traffic on the off-chip memory side, further increasing the off-chip access latency that each thread experiences. With the random remote allocator, all threads get most of their working set allocated on-chip, reducing the number of off-chip accesses, which in turn reduces the off-chip access latency as well. At 75% core utilization, not all threads get their working sets allocated on-chip, so the improvement is smaller.

On the other hand, when sharing is enabled with 4KB SPMs, at lower utilizations (e.g., 25%) a lot of nearby SPM space is available to compensate for the shortage of local SPM space. The improvements are expected to be slightly lower than when SPMs are 8KB. With 8KB SPMs, the miss rate of the config 2 mem-ubench drops significantly as soon as it gets remote SPM space, and the remote SPMs are within hop distance one, so the execution time improves significantly. With 4KB SPMs, many remote SPMs are required before the miss rate drops significantly; in this case more accesses are remote, with higher average latency. As expected, lower utilizations generally result in larger improvements because more remote SPM space is available for threads.

2.6.5.2 Scenario 2: Variation in Memory Working Set Size

Next, we examine a scenario in which, unlike scenario 1, there is variation in the memory working set sizes of the threads concurrently running on the platform. As a result, some cores do not utilize their entire SPM space and can therefore lend SPM space to other cores in need, even if the core utilization is high. We examine this scenario under different core utilizations. For this experiment, we use a mix of config 1 and config 3 mem-ubench threads, where config 3 has a significantly larger working set size than config 1. However, both types of threads have the same memory utilization (i.e., the percentage of dynamic instructions that are memory accesses), equal to 28%.

Figure 2.16 shows the average execution time of different threads in the workload when sharing is enabled using the simplest remote allocator (i.e., RRA) normalized to the same metric when threads can only use their local SPM (i.e., LA).


Figure 2.16: Scenario 2: Average execution time among all threads in an 8x8 system - RRA normalized to LA: Threads with small working set size provide the opportunity to RRA to utilize the SPM space unused by the locally-pinned thread in order to allocate more SPM space for the threads that have large working set size.

The trend of the execution time ratios with 16KB SPMs is noteworthy. Improvements increase at higher utilizations. With 16KB, all config 1 threads have enough space, but config 3 threads are close to their tipping point. On the other hand, with the local allocator, at higher utilizations significant memory traffic is generated, which degrades the performance of config 3 threads dramatically because off-chip accesses take a long time to complete. The random remote allocator allocates more space for config 3 threads, alleviating both issues.

The execution time reductions even when all cores are utilized show the benefit of SPM sharing for workloads with diverse memory requirements. This is due to the fact that although half of the threads require more SPM space than what is locally available, the other half do not fully occupy their local SPMs and can lend them to more demanding threads.

2.6.5.3 Scenario 3: Reducing Network Traffic

The third set of experiments is devised to show the effect of hop distance on memory latency when borrowing space from remote cores. Like scenario 2, we use a mix of config 1 and config 3 mem-ubench threads for this experiment.

Figure 2.17 shows the average execution time of the workload for the neighborhood allocator normalized to the random remote allocator. Improvements are minimal for 4KB SPMs since many of them are already occupied by their locally-pinned threads, especially when core utilization is higher than 25%. At the low core utilization of 25% – because there is more available SPM space in the pool of free SPMs – the execution time is reduced more significantly when the remote allocations are done as close as possible to the owner core compared to when they are done at random hop distances.

The simple neighborhood allocator performs better than the random remote allocator in all cases except one. The execution time becomes higher at the highest utilization and the smallest SPM size because the config 3 threads have a higher probability of


Figure 2.17: Scenario 3: Average execution time among all threads in an 8x8 system - CNA normalized to RRA: CNA allocates remote pages as close as possible to the owner core resulting in reduced network traffic, faster access latency and ultimately better overall performance compared to RRA.

occupying the SPM space that otherwise would have belonged to config 1 threads. As shown in Figure 2.14, config 1 threads can benefit more from 4KB of SPM than config 3 threads can. Because the two types of threads are evenly distributed, every config 3 thread has four config 1 threads within a hop distance of 1. Therefore, with the closest neighborhood allocator, a config 3 thread will always fill the SPM of one of its config 1 neighbors should they have free space, whereas with the random remote allocator there is only a 50% chance that a config 3 thread fills a config 1 thread's local SPM.

2.6.5.4 Scenario 4: Guaranteeing SPM Share for Locally-pinned Thread

The neighborhood allocator looks for available SPM space at the shortest hop distance. With this policy, it is possible that a core occupies the entire SPM that is local to another core while that core itself needs it. In the fourth set of experiments, we evaluate whether guaranteeing a portion of each SPM for its local thread helps improve the overall performance. Again, we use a mix of config 1 and config 3 mem-ubench threads for this experiment. The local SPM share is set to 100%.

The top part of Figure 2.18 shows the difference in average local hit ratio of different threads


Figure 2.18: Scenario 4: Difference in the local hit ratio (CNGLSA - CNA) and average execution time among all threads in an 8x8 system (CNGLSA normalized to CNA): CNGLSA returns remotely allocated SPM space to the locally-pinned thread should that thread need it. In some cases, this results in an improved local hit ratio and decreased average memory access latency compared to CNA.

running on a system with the neighborhood allocator with and without guaranteed local share. The improvement in local hit rate as utilization increases indicates that this improvement is due to reduced remote SPM accesses. Because the threads are intentionally mapped as sparsely as possible, at lower utilizations there are very few conflicts between neighboring cores competing for SPM space, and therefore local hit rates are nearly identical with both policies. However, this increase in local hit rate does not always turn into execution time reduction. Sometimes, the overhead associated with on-chip data relocation might offset the benefit of more local hits if the data that is being allocated locally is not accessed frequently enough.

The lower part of Figure 2.18 shows the corresponding normalized execution times; as can be seen, only in two cases is a noticeable execution time reduction gained. The large improvement when the SPM size is 8KB at higher utilization is due to config 1 threads. At


Figure 2.19: Dynamic energy in memory subsystem and network: CNA normalized to LA.


Figure 2.20: Dynamic energy in memory subsystem and network: CNGLSA normalized to CNA.

that SPM size and with the neighborhood allocator, the neighboring config 3 thread(s) that need more than 8KB of SPM space immediately fill up the entire platform, leaving nothing for the config 1 threads. But when the neighborhood allocator with guaranteed local share is used, config 1 threads get all of the 8KB of SPM space they need, which results in a dramatic reduction of their execution time.

2.6.6 Energy Comparisons: ShaVe-ICE Policies

Access to off-chip memory is an order of magnitude more costly than on-chip accesses in terms of energy consumption. Although ShaVe-ICE primarily targets improving the overall performance, it can also help reduce the total energy consumed by the memory subsystem.

Figure 2.19 shows the total dynamic energy consumed by SPM accesses, off-chip DRAM accesses, and the NoC when the distributed SPMs are shared using the closest neighborhood policy, normalized to the same metric when allocations are limited to local SPMs only. As expected, the energy savings trend is similar to the performance improvement, as both are primarily a result of reducing off-chip memory accesses.

A similar comparison is made in Figure 2.20 between the neighborhood allocator with and without the use of a guaranteed local share. Similar to the performance comparisons in Figure 2.18, energy savings can be obtained only when the benefit of accessing the data locally is higher than the cost of migration.

2.6.7 Experiments with Real Workload Mixes

In the last set of experiments, we configure a 4x4 system and run a number of real benchmarks on the platform under different core utilizations. Every core’s SPM is 32 kilobytes. The benchmarks and their respective sources are listed in Table 2.4.

In order to annotate the source code of these benchmarks with the ShaVe-ICE APIs: (1) We run the benchmark on gem5 to collect memory traces. These memory traces are generated independently of the memory hierarchy. In the same run, we record the cycle at which each function was called and when it returned. We also collect the virtual address boundaries of the various data

Table 2.4: List of benchmarks for ShaVe-ICE experiments

Benchmark               Acronym  Source
fast fourier transform  FFT      [63]
huffman encoding        HUF      [1]
wikisort                WS       [1]
nbody                   NB       [2]
susan-cornering         SU-C     [63]
susan-edging            SU-E     [63]
susan-smoothing         SU-S     [63]

(a) core-util = 2/16   (b) core-util = 4/16   (c) core-util = 6/16   (d) core-util = 8/16   (e) core-util = 10/16   (f) core-util = 12/16

Figure 2.21: Workload mixes with various core utilizations in a 4X4 platform.

objects (function stacks, heap objects, etc.). (2) We feed the trace, along with this information, into our memory trace analyzer, which determines which data object (tagged with its corresponding function) each access in the memory trace maps to. The report generated by the analyzer is finally used to add annotations for a subset of the most frequently accessed data objects at the proper places in the source code.
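A sketch of the analysis in step (2), assuming a simple trace format of one virtual address per line and the table of object address ranges collected in step (1); the names, file format, and address ranges are illustrative, not those of our actual analyzer.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

struct data_object {
    const char *name;        /* e.g., a function's stack or a heap buffer */
    uint64_t    lo, hi;      /* virtual address range [lo, hi)            */
    uint64_t    accesses;    /* filled in by the analysis                 */
};

/* Count how many trace accesses fall into each known object's range;
 * the most frequently accessed objects are the ones later annotated
 * with SPM_ALLOC / SPM_DEALLOC in the source.                          */
static void count_accesses(FILE *trace, struct data_object *objs, int n_objs) {
    uint64_t addr;
    while (fscanf(trace, "%" SCNx64, &addr) == 1)
        for (int i = 0; i < n_objs; i++)
            if (addr >= objs[i].lo && addr < objs[i].hi) {
                objs[i].accesses++;
                break;
            }
}

int main(void) {
    struct data_object objs[] = {
        { "hypothetical.stack", 0x7fff0000, 0x7fff4000, 0 },
        { "hypothetical.heap",  0x00600000, 0x00608000, 0 },
    };
    FILE *trace = fopen("mem_trace.txt", "r");
    if (!trace) return 1;
    count_accesses(trace, objs, 2);
    fclose(trace);
    for (int i = 0; i < 2; i++)
        printf("%s: %" PRIu64 " accesses\n", objs[i].name, objs[i].accesses);
    return 0;
}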

We evaluate this system under different core utilizations and various working set sizes, and with two allocators: local, and closest neighborhood with guaranteed local share. Figure 2.21 shows all the mixes of benchmarks running at the same time. We start with only two benchmarks running, and in each step we add two more benchmarks to the workload mix until the core utilization reaches 75% (i.e., 12/16).

Figure 2.22 shows the execution time of the workload with the closest neighborhood with guaranteed local share allocator, normalized to the execution time with the local allocator, which is our baseline. It reports four different values: (1) the normalized longest execution time among all threads, (2) the normalized average execution time among all threads, (3) the normalized execution time of the thread benefiting the most, and (4) the normalized execution time of the thread benefiting the least.

The longest-running thread in the first five mixes ((a) through (e)) is HUF, whose execution time is reduced to around 50% in all cases. In mix (f), another thread of the HUF benchmark with a larger input/working set size is the longest-running thread, and its execution time is reduced to 43.5%. The second HUF thread benefits more because of its larger working set size and the availability of remote SPM capacity once all other threads have finished execution.

The average execution time is reduced to 43%–92%, depending on the core utilization and the working set size demands. Generally, lower core utilization means more performance improvement, but it also depends on the mix of threads that are running at the same time. As shown in Figure 2.22, going from mix (e) to mix (f) reduces the average execution time due to the extra advantage that the second HUF thread receives, as discussed earlier.

The highest improvement bars report the best performance improvement experienced by any of the threads in the mix. Similarly, the lowest improvement bars show the least benefit experienced by a thread. Except for mix (a), in all other cases there is at least one thread whose execution time is not reduced by using the closest neighborhood with guaranteed local share allocator. That is because there are threads whose working set size is smaller than each core's local SPM. In fact, in mixes (c) through (f) there is a thread whose execution time is increased by 0.4%. Although the local SPM share was set to 100% for this experiment, there are times when a thread has to wait for a guest thread's pages to be relocated, which incurs


Figure 2.22: Execution time comparison for various workload mixes as shown in Figure 2.21: CNGLSA normalized to LA. Lower values are better.

some small overhead.

2.7 Discussion on Overheads

Not all components of ShaVe-ICE are unique to our solution. Any SPM-based platform requires an API and a mechanism for address translation to enable allocation and access of local SPMs. However, additional hardware and software components are required in ShaVe-ICE to enable sharing of remote SPMs.

The software component that incurs additional overhead in ShaVe-ICE is the SPM allocator. All SPM-based platforms require a mechanism for allocation and a policy to resolve contention (e.g., sharing or over-subscription of the local SPM). For an allocator that does not share the entire SPM space, if we assume the SPM page size is P and the allocation size is S, the worst-case time complexity is O(P*S); since the SPM is not shared, the complexity does not depend on the number of cores. For allocators that share the entire SPM space between threads, if the number of remote SPMs accessible by the core requesting the allocation is N, the worst-case time complexity is O(P*S*N). An accurate analysis of the runtime overhead requires implementing the policies inside an operating system, which is the topic of our future work. To estimate an upper bound on these execution times, we measured the time spent in the allocation and deallocation routines inside gem5 every time a thread calls an SPM API. Figures 2.23 and 2.24 show the runtime of these routines for different page sizes and allocation sizes.
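
For concreteness, the sketch below shows the shape of a simple page-granularity, local-only allocation routine over a free list kept as a bitmap; it is a minimal illustration under assumed sizes, not the ShaVe-ICE implementation.

```c
#include <stdint.h>

#define SPM_PAGES 32                    /* e.g., 8 KB SPM with 256 B pages (illustrative) */

struct spm_state {
    uint8_t  used[SPM_PAGES];           /* 1 if the physical SPM page is occupied         */
    uint64_t vpage[SPM_PAGES];          /* virtual page currently mapped to each SPM page */
};

/* First-fit allocation of `npages` (not necessarily contiguous) SPM pages.
 * Returns the number of pages actually placed; the caller falls back to
 * off-chip memory (or a remote SPM, in the sharing policies) for the rest. */
static int spm_alloc_pages(struct spm_state *s, const uint64_t *vpages, int npages)
{
    int placed = 0;
    for (int i = 0; i < npages; i++) {
        for (int p = 0; p < SPM_PAGES; p++) {
            if (!s->used[p]) {
                s->used[p]  = 1;
                s->vpage[p] = vpages[i];  /* record mapping for the ATT update */
                placed++;
                break;
            }
        }
    }
    return placed;
}
```

The sharing policies repeat this scan over the candidate remote SPMs (ordered by hop distance), which is where the additional factor of N in the worst-case complexity comes from.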

The SPM allocator must additionally maintain system-wide SPM state, which includes, for each on-chip SPM page, the unique virtual page currently occupying it, as well as a list of free pages.

Hardware overhead associated with ShaVe-ICE manifests in the form of address translation and network communication. Similar to the allocator algorithm, the translation storage required in SPM systems depends on the number of pages per SPM: each core must maintain address translations for its local SPM. Supporting sharing means that each core must now additionally store translation information for the remote SPM pages it has access to. To mitigate this storage overhead, we can increase the page size to reduce the number of ATT entries, or limit the total SPM space a core can allocate to a subset of the entire physical SPM space.
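
As a rough, purely illustrative sizing of this trade-off (all numbers below are assumptions, not our platform's parameters):

```c
/* Illustrative ATT sizing:
 * 8 KB local SPM with 256 B pages -> 8192/256 = 32 local ATT entries per core.
 * If a core may map pages anywhere in a 16-core, 128 KB aggregate SPM space,
 * the worst case grows to 131072/256 = 512 entries per core; doubling the
 * page size to 512 B, or capping how much remote SPM a core may allocate,
 * brings this back down. */
enum {
    PAGE_BYTES        = 256,
    LOCAL_SPM_BYTES   = 8 * 1024,
    TOTAL_SPM_BYTES   = 16 * 8 * 1024,
    LOCAL_ATT_ENTRIES = LOCAL_SPM_BYTES / PAGE_BYTES,   /* 32  */
    MAX_ATT_ENTRIES   = TOTAL_SPM_BYTES / PAGE_BYTES    /* 512 */
};
```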

As illustrated in Section 2.6, the additional on-chip communication required by ShaVe-ICE yields an overall improvement in average access latency as well as system energy consumption. Although more transactions are required in the NoC to allocate, access, and update local ATT entries, the reduction in off-chip accesses mitigates this penalty. However, as the size of the platform grows to 10s or 100s of cores, the latency penalties incurred by excessive hop distances between cores and remote SPMs could become comparable to off-chip accesses. Therefore, there needs to be a platform-specific threshold for the maximum hop distance when an allocation policy searches for available space during a remote SPM allocation.
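
For illustration only, a cut-off of this kind inside the remote-allocation search could look as follows; the threshold value and helper names are assumptions, not part of ShaVe-ICE:

```c
#include <stdlib.h>

/* Skip remote SPM candidates whose round-trip cost would rival an
 * off-chip access. MAX_ALLOC_HOPS is a platform-specific, assumed value. */
#define MAX_ALLOC_HOPS 3

static int manhattan_hops(int x0, int y0, int x1, int y1)
{
    return abs(x0 - x1) + abs(y0 - y1);
}

static int candidate_ok(int req_x, int req_y, int spm_x, int spm_y)
{
    return manhattan_hops(req_x, req_y, spm_x, spm_y) <= MAX_ALLOC_HOPS;
}
```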

Figure 2.23: Runtime overhead (in cycles) of the SPM allocator for different combinations of allocation size (1K-8K) and page size (64-512) under the local policy (simplest policy): (a) allocation routine, (b) deallocation routine.

Figure 2.24: Runtime overhead (in cycles) of the SPM allocator for different combinations of allocation size (1K-8K) and page size (64-512) under the closest neighborhood with local reservation policy (most complex policy): (a) allocation routine, (b) deallocation routine.

2.8 Cache+SPM and Shared Data Support Evaluations

In this section we present experiments that compare the hybrid memory hierarchy with a cache-based hierarchy implementing a cache coherence protocol (MESI), and show opportunities for energy saving and/or execution time improvement. We ran our experiments in the system emulation mode of gem5 using in-order ARM cores configured as a 4x4 mesh NoC with one level of on-chip memory. We present preliminary results using a number of synthetic microbenchmarks (Table 2.5) – each designed to exercise a different memory behavior – to highlight these opportunities.

Table 2.5: List of microbenchmarks

Benchmark          Parameters       Description
counter_inc        num-threads      An array of counters is incremented by a number of threads. Every thread increments its own counter, so there is no true data sharing.
histogram          num-threads      Computes the histogram of a large array of data. Each worker thread gets a portion of the data array and updates the histogram for that chunk.
regular_access     num-iterations   Generates regular (strided) accesses to an array of integer data.
irregular_access   num-iterations   Generates irregular (random) accesses to an array of integer data.


2.8.1 Experiment 1: Coherence Overhead Due to False Sharing

The first experiment compares both hierarchies when false sharing exists. False sharing happens when threads unwittingly impact each other's performance while modifying independent variables that share the same cache line. In this experiment we use the counter_inc microbenchmark (designed to stress false sharing), varying the number of threads from 2 to 10. With the cache-based memory hierarchy, counters share the same cache line, causing false sharing every time a thread accesses its counter. With the software-assisted hierarchy, we allocate the array holding the counters in the SPM space and all threads perform remote accesses to the same physical copy. For this experiment, we simply allocate the single copy on the SPM of the core spawning the threads; data placement could be further optimized in future work.
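
The essence of counter_inc is sketched below as a minimal pthreads program; the thread and iteration counts are illustrative.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    1000000

/* Adjacent 4-byte counters: with a cache-based hierarchy several counters
 * fall into one 64-byte line, so independent updates cause false sharing.
 * With the software-assisted hierarchy this array is instead placed on one SPM. */
static volatile unsigned counters[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id]++;                 /* each thread touches only its own slot */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter[0] = %u\n", counters[0]);
    return 0;
}
```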

Figure 2.25 shows the execution time and network traffic of the SAM hierarchy normalized to those of the cache-based hierarchy. As the number of threads increases, the cache coherence overhead becomes more pronounced; indeed, Figure 2.25 shows larger improvements at higher thread counts when using a software-assisted memory hierarchy that manages shared data through SPMs.

Figure 2.25: Comparing the software-assisted memory hierarchy with the cache-based hierarchy implementing the MESI protocol when false data sharing exists: average execution time and total size of NoC messages, normalized to the cache-based hierarchy, for 2 to 10 threads.

2.8.2 Experiment 2: Coherence Overhead Due to Shared Data

In the second experiment, we evaluate how keeping a single copy of shared data on-chip and forwarding all accesses to that copy can improve the average performance of threads, compared to using a cache coherence protocol to guarantee data correctness. In this experiment we use the histogram microbenchmark (designed to generate accesses to shared data). While the data array is read by all threads, it is never written by any of them, so it causes no coherence issues. The shared data object that has to be kept coherent across threads is the array holding the histogram computed thus far. Mutex objects are also shared between threads. We allocate these two data objects in the SPM space and let the cache handle the other objects. Figure 2.26 shows the execution time and network traffic of the SAM hierarchy normalized to those of the cache-based hierarchy. Similar to the previous experiment, the improvements are more significant when more worker threads are spawned.
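
A stripped-down sketch of the sharing pattern in the histogram microbenchmark is shown below; whether threads merge per-chunk local histograms or update the shared copy directly is our own simplification, and the sizes are illustrative. In the SAM configuration, hist and the mutex are the two objects placed on the SPM.

```c
#include <pthread.h>

#define NWORKERS 8
#define BINS     256
#define CHUNK    (1 << 20)

static unsigned char    data[NWORKERS * CHUNK];   /* read-only input, no coherence issue */
static unsigned long    hist[BINS];               /* true shared object, kept on SPM     */
static pthread_mutex_t  lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    unsigned long local[BINS] = {0};
    for (long i = id * CHUNK; i < (id + 1) * CHUNK; i++)
        local[data[i]]++;                         /* bin this thread's chunk             */
    pthread_mutex_lock(&lock);                    /* merge into the single shared copy   */
    for (int b = 0; b < BINS; b++)
        hist[b] += local[b];
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```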

Figure 2.26: Comparing the software-assisted memory hierarchy with the cache-based hierarchy implementing the MESI protocol when true data sharing exists: average execution time and total size of NoC messages, normalized to the cache-based hierarchy, for 2 to 10 threads.

2.8.3 Experiment 3: Dynamic Partitioning of Local Memory

The third experiment explores opportunities to improve performance and/or energy consumption if we are able to dynamically partition the local memory into SPM and cache. For this experiment, we use the regular_access and irregular_access microbenchmarks, which generate accesses to a large array of data. We allocate all or only a portion of the data array on the SPM; other memory accesses, to the unallocated part of the data array or to other scalar data objects in the program, are left to be handled by the cache and main memory. The local data memory size is assumed to be 32-KB, which can be partitioned in three different modes: 1) 32-KB cache, 2) 32-KB SPM, and 3) 16-KB cache / 16-KB SPM. SPM and cache access latencies are assumed to be equal (2 cycles). For different runs, we set the size of the data array in the microbenchmarks to 12-KB (small), 48-KB (medium), and 64-KB (large).

Figure 2.27: Comparison of execution time with different configurations of a fixed-size local memory: cache only (32-KB $), SPM and cache (16-KB $ & 16-KB SPM), SPM only (32-KB SPM). Lower is better.

Figure 2.28: Comparison of energy consumption with different configurations of a fixed-size local memory: cache only (32-KB $), SPM and cache (16-KB $ & 16-KB SPM), SPM only (32-KB SPM). Lower is better.

For energy consumption, we consider the total dynamic energy consumed in the caches, SPMs, NoC, and DRAM. Figures 2.27 and 2.28 show the execution time and energy consumption of each configuration normalized to the cache-only configuration. Depending on the access pattern and the size of the working set relative to the local memory, a different partitioning yields the best energy consumption, execution time, or both in each case. A runtime manager with this knowledge, obtained through static analysis as well as runtime monitoring, can tune the partitioning to improve the desired objectives.

2.8.4 Experiment 4: Sharing SPMs Between Cores

In Experiment 3, we assumed each core can allocate memory only on its local SPM. If we virtualize the distributed SPM space and let threads perform remote allocations, a better performance/energy trade-off may be achievable when cores are underutilized or a thread does not need its entire local memory. We redo Experiment 3 for the irregular_access microbenchmark when the working set is larger than the local memory (I-medium and I-large), this time also allowing remote SPM allocations of 16-KB and 32-KB on neighboring cores.

Figure 2.29: Comparison of execution time and energy consumption when remote SPMs are available for allocation. Configurations are shown as (local cache, local SPM, remote SPM at hop distance 1) in KB: (32,0,0), (32,0,16), (32,0,32), (16,16,16), (16,16,32), (0,32,16), (0,32,32). Lower is better.

Figure 2.29 shows the execution time and energy for different allocation scenarios normalized to the cache-only hierarchy. Comparing Figure 2.28 and Figure 2.29 shows the opportunity to both improve performance and save energy in the memory hierarchy, enabled by sharing SPMs, which reduces off-chip accesses.

Chapter 3

Approximate On-chip Data Storage

3.1 Introduction

Approximate computing leverages the intrinsic resilience of applications to inexactness in their computations, to achieve a desirable trade-off between efficiency (i.e., performance, energy or both) and acceptable quality of results. The need for approximate computing is driven by two factors: a fundamental shift in the nature of computing workloads, and the need for new sources of efficiency.

Many important application domains such as computer vision, machine learning, signal processing, web search, augmented reality, analytics, etc. can inherently tolerate inaccurate computation [38]. Today’s systems waste time, energy, and complexity to provide uniformly pristine operation for applications that do not require it.

In embedded systems, where battery life or other resources are constrained, many applications can live with inexactness. This is especially true where the input is analog. For example, speech recognition applications turn analog input signals into a sentence, and navigation

software turns maps and location estimates from a GPS into driving directions. This is also true where the output is analog, such as in audio, image, and video processing, since human perception is inherently inaccurate.

Motivated by the benefits available via relaxing stringent correctness and precision requirements, researchers have proposed a variety of approximate computing strategies in both software and hardware. Approximation strategies in software include: dynamically selecting between different implementations of a specification that provide varying accuracy-energy-performance trade-offs (Dynamic Knobs [68], Green Framework [16], Levels [83], ViRUS [150], Variable Accuracy Algorithms [10]), skipping some of the loop iterations (Loop Perforation [134][107]), eliminating synchronization in multi-threaded programs (Dubstep [108], Approx. Data Structures [124]), unsound parallelization of sequential programs (QuickStep [105]), adjusting floating point precision (Precimonious [126]), and randomizing deterministic programs ([106, 160]). Various approximation techniques have also been proposed at the hardware level, including: underdesigned functional units that occasionally produce wrong results (approximate adders [61, 62, 74, 101], approximate floating point units [5, 31], approximate multipliers [82, 91, 90, 66]), separate processing units with different levels of reliability (ERSA [87]), and memory approximation (Relaxed Cache [132], QuARK [109], Flikker [92], Approx. Storage [130])1. EnerJ [129] and ACCEPT [128] are two compiler frameworks that provide the required language constructs and a library of approximation strategies for the developer to write and analyze approximate programs. Truffle [51] presents the architecture support required to execute applications written with a language like EnerJ. Although a significant amount of research on approximate computing has been presented in recent years, performing approximation in a controlled manner has remained an open problem. Most of the previous work has attempted to show that it is possible to operate at different design points in the accuracy-energy-performance space. Very recently there have been a few works (Green [16], Rumba [79], Proactive Control [139]) on online checking of the results of approximate computations.

1A complete overview of related work on approximate memories is presented in Section 3.2.


Previous work has shown that designing such systems in a general-purpose computing environment requires a holistic view of all layers, from algorithms, programming models, system software, and hardware down to the transistor level, rather than a focus on a single layer.

In this chapter, we explore approximation strategies for on-chip data storage of modern embedded systems.

3.2 Prior Work on Memory Approximation

3.2.1 Taxonomy of Prior Research on Approximate Memory Management

The majority of work on approximate computing focuses on computation. However, the idea of quality-energy-performance trade-offs lends itself to storage as well. This includes both transient data stored on-chip or in the main memory, and persistent data stored in secondary storage. Here, we present a taxonomy of research on approximate memory management. Accordingly, we classify techniques along five axes:

1. Memory Technology

2. Memory Component

3. Approximation Objectives

4. Approximation Strategy

5. Abstraction Layers Involved

Depending on the specific memory hierarchy component that is subject to approximate storage, different layers of the system abstraction come into play. Table 3.1 depicts the abstraction layers covered by recent efforts. It can be seen that many of the previous works leverage multiple layers of abstraction. While the majority of approaches target reducing energy consumption as the main objective, some attempt to increase performance and/or the lifetime of the memory substrate (Table 3.2). Approximations can be applied to different memory components (Table 3.3) such as L1/L2 caches [132][114, 131][103][104][86][56][127][80][102][72], SPMs [122], main memory [92][119][52][121][156][155], and secondary storage [153][130][59][73], in a single-layer or cross-layer fashion.

Table 3.1: Prior research in approximate memory management classified based on abstraction levels involved

Abstraction layers considered: Application/Algorithm, Compiler/ISA, OS/Runtime, Architecture, Device.
Related work: Shoushtari et al. - 2015 [132]; Monazzah et al. - 2017 [109]; Lee et al. - 2006 [86]; Liu et al. - 2011 [92]; Sampson et al. - 2013 [130]; Oboril et al. - 2016 [114, 131]; Ranjan et al. - 2015 [122]; Raha et al. - 2017 [119]; Thwaites et al. - 2014 [145, 154]; Miguel et al. - 2015 [103]; Rinard et al. - 2013 [124]; Tian et al. - 2015 [146]; Miguel et al. - 2014 [104]; Ganapathy et al. - 2015 [56]; Cho et al. - 2014 [41]; Fang et al. - 2012 [52]; Sampaio et al. - 2015 [127]; Guo et al. - 2016 [59]; Jevdjic et al. - 2017 [73]; Xu et al. - 2015 [153]; Kislal et al. - 2016 [80]; Arumugam et al. - 2015 [13]; Chen et al. - 2016 [37]; Miguel et al. - 2016 [102]; Ranjan et al. - 2017 [121]; Zhang et al. - 2017 [155]; Jain et al. - 2016 [72]; Zhao et al. - 2017 [156]

These techniques have three main objectives: making a trade-off between the quality of the generated output and (1) energy consumption [109][122][132][92][119][114, 131][103][86][104][52][56][41][146][127][153][102][121][155][156], (2) performance of the memory subsystem [145, 154][130][114, 131][104][56][124][59][73][153][80][13][37][102][121][155][72], and (3) lifetime of the memory substrate [52][130].

These objectives are often dependent on the memory technology used (Table 3.4). Different memory technologies have different limitations; for example, SRAM and (e)DRAM consume high leakage and refresh power, respectively, while NVMs have high write energy/latency and/or low write endurance. To reduce the energy consumption of these memories and improve the lifetime of NVMs, approximate storage techniques sacrifice data accuracy by reducing the supply voltage in SRAM [132][56] and the refresh rate in (e)DRAM [92][119][41][121][155], and by relaxing or skipping read/write operations in NVMs [153][52][130][122][109][114, 131][127][59][73][156] (Table 3.4).

Table 3.2: Prior research in approximate memory management classified based on approximation objective

Goal       Related Work
Energy     [109] [122] [132] [92] [119] [114, 131] [103] [86] [104] [52] [56] [41] [146] [127] [153] [102] [121] [155] [156]
Perf.      [145, 154] [130] [114, 131] [104] [56] [124] [59] [73] [153] [80] [13] [37] [102] [121] [155] [72]
Lifetime   [52] [130]

Table 3.3: Prior research in approximate memory management classified based on the memory component

System Component    Prior Research
Cache               Shoushtari et al. - 2015 [132]; Oboril et al. - 2016 [114, 131]; Miguel et al. - 2015 [103]; Miguel et al. - 2014 [104]; Lee et al. - 2006 [86]; Ganapathy et al. - 2015 [56]; Sampaio et al. - 2015 [127]; Kislal et al. - 2016 [80]; Miguel et al. - 2016 [102]; Jain et al. - 2016 [72]
SPM                 Ranjan et al. - 2015 [122]
Main Memory         Liu et al. - 2011 [92]; Raha et al. - 2017 [119]; Fang et al. - 2012 [52]; Ranjan et al. - 2017 [121]; Zhang et al. - 2017 [155]; Zhao et al. - 2017 [156]
Secondary Storage   Xu et al. - 2015 [153]; Sampson et al. - 2013 [130]; Guo et al. - 2016 [59]; Jevdjic et al. - 2017 [73]

Table 3.4: Prior research in approximate memory management classified based on the memory technology

Memory Technology   Prior Research
NVM                 Xu et al. - 2015 [153]; Fang et al. - 2012 [52]; Sampson et al. - 2013 [130]; Ranjan et al. - 2015 [122]; Monazzah et al. - 2017 [109]; Oboril et al. - 2016 [114, 131]; Sampaio et al. - 2015 [127]; Guo et al. - 2016 [59]; Jevdjic et al. - 2017 [73]; Zhao et al. - 2017 [156]
DRAM/eDRAM          Liu et al. - 2011 [92]; Raha et al. - 2017 [119]; Cho et al. - 2014 [41]; Ranjan et al. - 2017 [121]; Zhang et al. - 2017 [155]
SRAM                Shoushtari et al. - 2015 [132]; Ganapathy et al. - 2015 [56]

Table 3.5: Prior research in approximate memory management classified based on the approximation strategy

Approximation Strategy                Prior Research
Precision scaling                     Miguel et al. - 2015 [103]; Tian et al. - 2015 [146]; Ranjan et al. - 2017 [121]; Jain et al. - 2016 [72]; Zhao et al. - 2017 [156]
Load value approximation              Thwaites et al. - 2014 [145, 154]; Miguel et al. - 2014 [104]; Miguel et al. - 2016 [102]
Memory access skipping                Thwaites et al. - 2014 [145, 154]; Fang et al. - 2012 [52]; Kislal et al. - 2016 [80]
Using faulty or unprotected memory    Oboril et al. - 2016 [114, 131]; Sampson et al. - 2013 [130]; Lee et al. - 2006 [86]; Ganapathy et al. - 2015 [56]; Sampaio et al. - 2015 [127]; Guo et al. - 2016 [59]; Jevdjic et al. - 2017 [73]; Xu et al. - 2015 [153]
SRAM voltage scaling / DRAM refresh rate reduction / NVM current amplitude or duration reduction    Monazzah et al. - 2017 [109]; Shoushtari et al. - 2015 [132]; Liu et al. - 2011 [92]; Raha et al. - 2017 [119]; Ranjan et al. - 2015 [122]; Sampson et al. - 2013 [130]; Cho et al. - 2014 [41]; Zhang et al. - 2017 [155]

These objectives are achieved by a number of general strategies (Table 3.5), namely: (1) precision scaling [103][146][121][72][156], (2) approximating load operations [145, 154][104][102], (3) skipping store operations [145, 154][52][80], (4) using a faulty or unprotected memory substrate [114, 131][130][86][56][127][59][73][153], and (5) tweaking technology-dependent reliability-energy knobs [109][132][92][119][122][130][41][155].

3.2.2 Overview of Prior Research and Practices

This section overviews related work on approximate memory management.

• Approximate Spintronic SPM: Ranjan et al. [122] explore the energy-quality trade-off in STT-MRAM memories. They study three types of approximations: (1) approximations through incorrect read decisions, (2) approximations through read disturbs, and (3) approximations through incomplete writes. They design a quality-configurable memory array in which data can be stored to varying levels of accuracy based on application requirements. For that, they employ additional peripheral circuitry which allows adjusting the degree of these approximation mechanisms. They integrate the quality-configurable array as a scratchpad in the memory hierarchy of a programmable processor and expose it to software by introducing quality-aware load/store instructions within the ISA. They also develop an auto-tuning framework that utilizes gradient descent search to determine the quality fields of the load/store instructions in a given application program so as to minimize energy for a desired application output quality.

• Approximation-Aware Multi-Level Cell STT-RAM Cache Architecture: Sampaio et al. [127] propose a partially-protected cache architecture based on MLC STT-RAM. They consider an n-way set-associative cache where the last way stores Error Correction Codes (ECCs) for error protection and is implemented with SLC cells. Thus, at every read and write operation, these ECCs must be accessed to serve as input for the error correction unit. They realize approximate storage by avoiding error protection for applications' non-critical data. In this case, the last cache way is enabled for data storage, dynamically increasing the associativity of the cache, which results in improved performance.

• Relaxing Non-volatility of STT-MRAM Caches: The authors of [114, 131] also propose approximate STT-MRAM, but unlike [109] their knob is the non-volatility property of STT-MRAM. Reducing the thermal stability factor decreases the access latency, and since the current required to switch the bit-cell content is applied for a shorter duration, the write energy becomes significantly better. To protect critical data, they propose either to keep multiple copies of its content, as in dual or triple modular redundancy, or to use ECC, which protects the data by adding check bits. In order to choose the optimal thermal stability factor for a given application, they calculate the impact of various thermal stability factors on failure rates, write latency, and write energy at the device level. These values are then projected to the architecture level to estimate per-access metrics. Finally, at the system level, the application output is evaluated using the Signal-to-Noise Ratio (SNR) metric.

• Selective Data Protection: Lee et al. [86] propose selective data protection for mitigating failures caused by soft errors in multimedia embedded applications when power consumption is a concern. They extend the concept of a horizontally partitioned cache by incorporating two caches at the same level of the memory hierarchy. One of the caches is protected against soft errors with Single-Error Correction and Double-Error Detection (SECDED), and the other is left unprotected. Each memory address is mapped exclusively to one of these caches. They partition the application's data objects into failure-critical and failure-non-critical, mapping the first category to the protected cache and the second to the unprotected cache. Their approach minimizes the protection overheads while achieving a failure rate close to that of an architecture with a single fully protected cache.

• Flikker: Flikker [92] allows parts of a DRAM module to be refreshed at a much lower rate than the standard refresh rate, allowing retention errors to happen but saving significant DRAM refresh power. The programmer identifies critical and noncritical data. During execution, data objects are allocated to different parts of the DRAM: the critical data are mapped to regions that are refreshed at the regular rate, while the noncritical data are mapped to regions that are refreshed at a lower rate.

• Quality-Configurable Approximate DRAM: Raha et al. [119] propose the notion of a quality-aware approximate DRAM and develop a novel data allocation scheme for it. The core idea is that at sub-optimal refresh rates, DRAM physical pages can be split into a number of quality bins based on the characteristics of the errors seen in each page. Approximate data can then be allocated to pages belonging to the bins in decreasing order of quality, ensuring that the least erroneous pages are always used first. The location and nature of the bit errors are obtained through extensive error characterization of off-the-shelf DRAM ICs at various refresh rates. These errors correlate well with the eventual application-level output quality and hence are used to guide the allocation of application data to DRAM pages based on the output quality specification. The proposed mechanism is inherently quality-configurable since it has the provision of increasing the refresh rate as needed, which increases the number of pages in higher quality bins (with fewer errors), leading to better quality. Compared to [92], their work requires only a single refresh interval for the entire DRAM; it is therefore simpler to implement and results in a better energy-quality trade-off.

• DrMP: Mixed Precision-aware DRAM: Zhang et al. [155] propose DrMP to exploit process variations within and across DRAM rows to store data with mixed precision. They describe three variants of the approach: DrMP-A, DrMP-P, and DrMP-U.

(1) DrMP-A targets high-performance approximate computing: it reduces restore time and maps the important data bits of different data types to fast, reliable row segments for improved performance. (2) DrMP-P pairs memory rows together to reduce the average restore time for precise computing. (3) DrMP-U combines DrMP-A and DrMP-P to support both approximate and precise computing.

• Approximate Storage in MLC NVM: Sampson et al. [130] introduce two techniques enabling applications to store data approximately in PCM memories in order to improve their lifetime, density, and performance. Increasing the number of levels improves the density of storage; however, multi-level cells (MLC) are slower to write due to the need for tightly controlled iterative programming. Their first technique reduces the number of programming pulses used to write to an MLC. This allows errors in multi-level cells but improves performance and energy efficiency; it can also be used to improve the storage density for a fixed power budget or performance target. Their second technique extends memory lifetime by mapping approximate data onto blocks that have exhausted their hardware error correction resources. In order to reduce the effect of erroneous bits on the final result, higher priority is given to the correction of higher-order bits compared to lower-order bits. Their evaluations show that approximate writes in MLC PCM are faster than precise writes, and that using faulty blocks improves the endurance of the memory module with bounded quality loss.

• SoftPCM: Fang et al. [52] propose SoftPCM, which reduces the number of writes to a PCM-based main memory by taking advantage of the error-tolerance of video applications. SoftPCM compares the new data with the old data, and when the new data to be written are approximately the same as the existing stored data, it terminates the write operation and takes the old data as the new data. SoftPCM provides significant energy savings and lifetime extensions for PCM memories with a negligible reduction in video quality.

• SoftFlash: Xu and Huang [153] leverage the error-tolerance capability of data-centric applications to reduce ECC overhead in flash-based Solid-State Drives (SSDs). They design a framework for monitoring and estimating soft-error rates of Flash SSDs at runtime. They also carry out extensive fault-injection experiments on a wide range of applications, including multimedia and scientific computation, to understand the requirements and characteristics of data-level error tolerance. These two studies show that their target applications can tolerate a much higher error rate than that targeted by Flash SSDs. Depending on the difference between the error rate of the SSD obtained using their monitoring framework and the error rate that the application can tolerate, SoftFlash dynamically lowers the ECC protection or avoids using ECC altogether. They show that their approach provides significant performance and energy efficiency gains with acceptable quality.

• Synchronization-Free Approximate Data Structures: Rinard [124] presents approximate data structures with construction algorithms that execute without synchronization. The data races present in these algorithms may cause them to drop inserted or appended elements. Nevertheless, the algorithms (1) do not crash and (2) may produce a data structure that is accurate enough for its users to use successfully. Rinard advocates an approach in which the approximate data structures are composed of basic tree and array building blocks with associated synchronization-free construction algorithms. These data structures have been engineered to execute successfully in parallel contexts despite the presence of data races. Rinard shows that using these data structures results in significant savings in execution time.

• Approximate Memory Compression: Ranjan et al. [121] propose a technique to reduce off-chip memory traffic and energy by leveraging the intrinsic resilience of emerging workloads such as machine learning and data analytics. Their objective is to reduce memory traffic rather than the overall memory footprint. In order to realize approximate memory compression, they enhance the memory controller to be aware of memory regions that contain approximation-resilient data, and propose an application programming interface to expose approximation-tolerant memory regions to the memory controller. The memory controller then transparently compresses/decompresses the data written to/read from these regions. To provide control over approximations, the quality-aware memory controller conforms to a specified error constraint for each approximate memory region. They also incorporate a runtime framework to dynamically modulate the error constraints for each region.

• ApproxMA: Tian et al. [146] present ApproxMA, a technique for dynamic precision scaling of off-chip accesses. They apply their technique to a mixed-model-based clustering problem, which is inherently error-resilient and requires a large amount of data to be accessed from off-chip memory. ApproxMA comprises a runtime precision controller and a memory access controller. For data accesses, the runtime precision controller first generates the customized bit-width according to runtime quality requirements, and the memory access controller then loads the scaled data from off-chip for computation. The runtime precision controller is based on the observation that, in a clustering algorithm, a functional error happens only when a sample is assigned to a wrong cluster. Hence, precision can be lowered as long as the relative distances between clusters and samples remain in the correct order, so that no functional error happens; from the point of view of error resilience, an appropriate amount of functional errors is acceptable. The iterative nature of the clustering algorithm continuously corrects the mixture models and re-labels all samples using higher precision in later iterations. By analyzing a subset of the data and/or intermediate computational results, the runtime precision controller calculates precision constraints (when a functional error will occur) and the error resilience capability (how many functional errors are tolerable), and then decides the required bit-width for the current data access. Since the use of approximation can lead to fluctuation in cluster membership, on detecting such a situation their technique increases the precision of the data. To enable loading only a certain number of most significant bits from off-chip memory, the memory controller reorganizes the data: bits of the same significance in different words are combined to form new words and stored in off-chip memory. They show that, compared to fully precise off-chip accesses, their technique saves significant energy with a negligible loss in accuracy.
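
As a rough illustration of this reorganization (our own simplification, not ApproxMA's actual controller logic), packing bits of equal significance from a group of 32 words into per-significance words lets the memory system fetch only the most significant bit-planes.

```c
#include <stdint.h>

/* Transpose a group of 32 32-bit words into 32 "bit-plane" words:
 * planes[b] collects bit b of every word, so loading only the top-k
 * planes retrieves the k most significant bits of each element. */
static void pack_bit_planes(const uint32_t words[32], uint32_t planes[32])
{
    for (int b = 0; b < 32; b++) {
        uint32_t p = 0;
        for (int w = 0; w < 32; w++)
            p |= ((words[w] >> b) & 1u) << w;
        planes[b] = p;
    }
}

/* Reconstruct approximations from only the top `k` bit-planes;
 * the remaining low-order bits are treated as zero. */
static void unpack_top_planes(const uint32_t planes[32], int k, uint32_t words[32])
{
    for (int w = 0; w < 32; w++) {
        uint32_t v = 0;
        for (int b = 32 - k; b < 32; b++)
            v |= ((planes[b] >> w) & 1u) << b;
        words[w] = v;
    }
}
```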

• Sorting Algorithms on Approximate Memory: Chen et al. [37] study sorting on a hybrid storage system with both precise storage and approximate storage. Unlike many other works, the sorting algorithm cannot accept any error; this work is an example of precise computing on approximate storage. Their approximate storage is an approximate MLC configuration that reduces the guardband to provide better write performance and energy efficiency. They propose an approx-refine execution mechanism to improve the performance of sorting on the hybrid storage system while producing precise results: a lightweight refinement stage transforms the nearly sorted output into a totally sorted output. Their optimization gains performance by offloading the sorting operation to approximate storage, followed by an efficient refinement that resolves the unsortedness in the output of the approximate storage.

• Inexact Memory Aware Algorithm Co-design: Arumugam et al. [13] introduce inexact-memory-aware algorithm design and present several relevant algorithms for sorting and string matching.

• Mitigating the Impact of Faults in Unreliable Memories: Ganapathy et al. [56] present a technique for minimizing the magnitude of errors when using unreliable memories. In contrast to ECC techniques that correct errors, the proposed approach minimizes the magnitude of the error caused by a faulty cell, quantified in terms of a suitable metric. This is ensured by placing bits of lower significance into the faulty cells: on each write, their approach circularly shifts the data such that the least significant bits are stored in the faulty cells. By controlling the granularity of the shuffling, the proposed technique enables trading off quality for power, area, and timing overhead. Compared to error correction codes, their approach improves latency, power, and area. Moreover, their technique allows tolerating a limited number of faults, which reduces the manufacturing cost compared to the conventional zero-failure yield constraint.
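
A minimal sketch of the core idea follows (our own simplification: a single known faulty bit per 32-bit word and no shuffling-granularity control): each word is rotated so that its least significant bit is the one stored in the faulty position.

```c
#include <stdint.h>

/* Rotate data so that bit 0 (least significant) is stored in the faulty
 * physical bit position `fpos`; a flip there then corrupts only the LSB. */
static uint32_t encode_word(uint32_t data, unsigned fpos)
{
    fpos &= 31u;
    return (data << fpos) | (data >> ((32u - fpos) & 31u));
}

/* Undo the rotation on a read. */
static uint32_t decode_word(uint32_t stored, unsigned fpos)
{
    fpos &= 31u;
    return (stored >> fpos) | (stored << ((32u - fpos) & 31u));
}
```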

• Cache-Aware Approximate Computing for Decision Tree Learning: Kislal et al. [80] exploit the flexibility inherent in decision-tree-learning-based applications regarding performance and accuracy trade-offs and propose a framework to improve performance with negligible accuracy loss. This framework employs a data access skipping module that skips costly cache accesses according to the aggressiveness of the strategy specified by the user, and a heuristic that predicts skipped data accesses to keep accuracy losses at a minimum. They make two important observations: (1) the inaccuracy resulting from skipping data accesses depends more on the number of data accesses skipped than on which specific data accesses are skipped; (2) in contrast, the performance benefit achieved through data access skipping depends strongly on which specific data accesses are skipped.

• Concise Loads and Stores: Jain et al. [72] aim at reducing the pressure on the memory subsystem by enabling concise storage – a storage paradigm where data elements are stripped of their marginal bits, removing the movement and storage costs associated with those bits in the memory subsystem. They propose an asymmetric compute-memory extension (ACME) for conventional architectures. In ACME, data can be treated asymmetrically: computation is done on conventional 32-bit IEEE 754 single-precision values, while data is stripped of its marginal bits before being used in the memory hierarchy. ACME includes a simple ISA extension that can be leveraged by the programmer and compiler, adding two new instruction classes – load-concise and store-concise – to operate on concise data and perform conversions between the concise and single-precision formats.

• High-Density Image Storage Using Approximate Memory Cells: Guo et al. [59] target storing images in approximate solid-state memories with multi-level cell (MLC) capability. They make the observation that changes to different bits in a compressed image file lead to different kinds of distortion in the output image. Hence, to preserve quality in the decoded image, different bits should be subject to different levels of approximation. The key idea is to determine the relative importance of the encoded bits created by the encoding algorithm, and store them in separate regions of approximate storage, each of which is tuned to match the error tolerance of the bits it stores. Their analysis shows that: (1) lower-frequency coefficients, often higher in value, are the most important coefficients for image quality; (2) control bits affect output quality more than run-length bits, and run-length bits affect output quality significantly more than refinement bits. Based on this analysis, they store different parts of the image file in different parts of the storage with the appropriate level of approximation.

• VideoApp: Jevdjic et al. [73] propose VideoApp, a methodology to compute bit-level reliability requirements for encoded videos by tracking visual and metadata dependencies within encoded bitstreams. They add an analysis framework to the video encoder as a post-processing step, which computes the importance of each bit with respect to the visual damage that flipping that bit would cause. The encoded videos are then partitioned into multiple streams, where each stream is stored with a different level of error correction depending on its reliability requirements.

• Approximate Image Storage with MLC STT-MRAM Main Memory: Zhao et al. [156] target energy-efficient data movement in image processing applications by leveraging 2-bit MLC STT-MRAM technology. 2-bit MLC STT-MRAM cells have a unique property: writing to the two bits (a soft bit and a hard bit) can be performed with asymmetric write currents. The soft bit has a higher resistance than the hard bit and requires a smaller switching current. Their mechanism skips the writes of neighboring pixels that are sufficiently similar. For example, when 4 neighboring pixels have similar values, only the code of one pixel is stored, into the soft bits of each MLC STT-MRAM cell with a small write current, when the image is initially written or updated in memory. The other three memory locations are reserved for the other three pixels but are not written; when they are read by the application, the values are copied from the first pixel. This consumes less energy than writing all four precise pixel values, since similar values in the soft and hard bits require only a single write operation, as opposed to the two required for writing precise values. They report that significant energy savings are possible with their technique for various image processing functionalities.

• eDRAM-based Tiered-Reliability Memory: Cho et al. [41] propose a technique for saving refresh energy in eDRAM-based frame buffers for video applications. The human visual system is known to be more sensitive to a change in the higher-order bits of a pixel value than in the lower-order bits. Their technique divides the memory array into different segments, each of which can be refreshed at a different period. The basic idea is that the higher-order bits of a pixel are allocated to the more reliable segments of the memory. Based on the application characteristics and user preference, the number of segments and their refresh rates can be changed. Their technique saves significant refresh power without incurring a loss in the visual perception quality of the video.

• Doppelgänger Cache: Miguel et al. [103] introduce the Doppelgänger cache, which takes advantage of the approximate similarity of data blocks stored in the last-level cache (LLC). They identify cache blocks that are approximately similar and associate their tags with a single entry in the data array. Multiple blocks share a single data entry, which serves as an acceptable approximation of their values; this reduces the footprint of the data that the LLC needs to store. To determine whether a similar cache block exists prior to inserting an incoming block, the Doppelgänger cache computes a map for each block using a hash function over its data; the hash function is chosen such that similar blocks generate the same map. They show that their cache design uses the available capacity more efficiently, which results in substantial area and energy savings.
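
As a simplified illustration of such a similarity map (the actual Doppelgänger hash differs; this only sketches the idea), one can quantize each word of a block and hash the quantized values, so blocks differing only in low-order bits collide on the same map.

```c
#include <stdint.h>
#include <stddef.h>

/* Compute an approximate-similarity "map" for a 64-byte block by hashing
 * only the top 8 bits of each 32-bit word; blocks whose words differ only
 * in the discarded low-order bits produce the same map. */
static uint32_t block_map(const uint32_t block[16])
{
    uint32_t h = 2166136261u;                 /* FNV-1a style accumulator */
    for (size_t i = 0; i < 16; i++) {
        uint32_t q = block[i] >> 24;          /* coarse quantization      */
        h = (h ^ q) * 16777619u;
    }
    return h;
}
```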

• Bunker Cache: Miguel et al. [102] propose the Bunker Cache, a design that exploits an application's spatio-value similarity to reduce both off-chip memory accesses and cache storage requirements. Spatio-value similarity refers to their observation that data elements that are approximately similar in value tend to be stored at regular intervals in memory. The Bunker Cache works by storing similar data blocks in the same location in the cache, since substituting one for the other still yields acceptable application quality. Because the blocks that exhibit spatio-value similarity are usually stored at spatially regular intervals (or strides), the implementation of the Bunker Cache is contained entirely within the cache index function. The Bunker Cache is thus an approximate computing technique that can achieve efficiency gains with mostly commodity hardware.

• LVA: Miguel et al. [104] present a load value approximator, a hardware mechanism that estimates memory values. By approximating the load value on a cache miss, the processor can proceed immediately without waiting for the cache response. Traditional load-value predictors fetch the predicted block on every cache miss to confirm the correctness of the prediction; the LVA only fetches the approximated blocks occasionally, just to train the approximator. Unlike traditional load-value predictors, which perform rollbacks upon mispredictions, rollbacks are not required with this technique since the applications can tolerate errors. They show that, with negligible degradation in output quality, their technique provides significant speedup and energy savings.

• RFVP: The authors of [145, 154] present an approximate computing approach to deal with the fact that the performance of modern GPUs is significantly limited by the available off-chip bandwidth. RFVP relies on the programmer annotating the code to identify the memory accesses that are not critical. Of these, the load accesses that account for a large fraction of misses are selected. The impact of approximating each of these loads on quality is measured, and a subset of loads that lead to a smaller quality degradation is finally selected for approximation. When these loads miss in the cache, the requested values are predicted, and the prediction is not checked against the actual value later on, to avoid pipeline flush overheads. The drop rate can be used as a knob to balance quality degradation and performance improvement. Since GPUs execute in SIMD mode, load accesses happen simultaneously for multiple concurrent threads. To avoid the overhead of having a separate value predictor for each thread, they leverage the value similarity across accesses in adjacent threads to design a multi-value predictor that has only two parallel specialized predictors, one for threads 0 to 15 and another for threads 16 to 31. The use of strategies specific to the GPU architecture, such as the multi-value predictor, distinguishes their technique from that of Miguel et al. [104]. They show that their technique improves performance and energy efficiency with bounded QoR loss on both GPUs and CPUs.

3.3 Partially-Forgetful Memories

This chapter proposes a class of on-chip memories for approximate computing called Partially-Forgetful Memories (PFMs). Potential advantages of deploying PFMs include: (1) better energy efficiency for memory accesses, (2) reduced average memory access latency, and (3) increased lifetime of the memory subsystem.

The following challenges need to be addressed to effectively exploit PFMs for approximate computing:

1. Data Partitioning: Data structures can be defined as critical (i.e., any corruption leads to catastrophic failure or significant quality degradation) or non-critical (i.e., quality degradation resulting from corruption may be acceptable). This concept has been used before in [92]. A simple partitioning of data – into critical and non-critical – has been shown to improve DRAM energy efficiency [92]. Further categorization of the non-critical data can lead to even more energy efficiency. For example: (i) integer data can be partitioned based on the magnitude of tolerable error, and (ii) floating point data can be partitioned based on a limit on precision. These categorizations can be reflected in the program by annotating different data structures. One approach [129][33] defines type qualifiers and dedicated assembly-level store and load instructions for this purpose. We present another approach for PFMs that uses dynamic declarations which are enforced by a runtime system, as shown by an example in Section 3.4.

2. Data Mapping: The data partitioning annotations from the previous step should guide the mapping of each data category to an appropriate part of memory, based on its reliability characteristics. Depending on the type of memory, this mapping can be done by the hardware, the application/compiler, or the operating system. For caches, the cache controller decides where to map a new incoming cache block. In the case of software-controlled memories (e.g., scratchpads), the compiler aggregates data with the same reliability requirement into tagged groups, and then a runtime system (having knowledge of the underlying hardware) does the actual mapping based on this tagging. For main memories, the operating system should do the mapping during logical-to-physical mapping.

3. Controlling Exposed Error Rate to Program: Most of the previous attempts at utilizing relaxed-reliability memories [92][130] lack the ability to dynamically adjust the error rate exposed to the program. Taassori et al. [141] have shown how the DRAM timing margin can be dynamically changed to make a precision-performance trade-off. We believe this capability is necessary because different applications have different levels of tolerance to errors. Even within an application, one execution phase may have more tolerance to errors (i.e., less criticality) than another phase. Therefore, it is important to enable adjustment of the guardbanding knobs based on the characteristics of the application. Examples of these knobs are: the DRAM refresh rate, the SRAM voltage, the STT-RAM retention time, the I_RESET current in PCM, etc.

4. Adaptation to Phasic Behavior of Applications: There are three main reasons to change memory guardbanding knobs during runtime: (1) Depending on the (set of) applications that are executed, we might tolerate different levels of error, (2) We may need to change the ratio of reliable to unreliable memory capacity to prevent performance degradations for specific situations (e.g., where most of the working set objects should be mapped to reliable parts of memory), and (3) The programmer can save power by disabling idle parts of the memory for computation-bound application phases (this works even for applications that do not tolerate any errors).

In this chapter, we present exemplars of PFMs for different memory components and technologies. Section 3.4 presents Relaxed Cache for L1 SRAM data caches, Section 3.5 presents QuARK Cache for L2 STT-MRAM caches, and Section 3.6 presents Write-Skip for data STT-MRAM SPMs. Finally, in Section 3.7 we discuss the idea of using formal control theory to control the quality of PFMs.

3.4 Relaxed Cache

3.4.1 Introduction

As the semiconductor industry continues to push the limits of sub-micron technology, the ITRS expects hardware variations to continue increasing over the next decade [71]. The memory subsystem is one of the largest components in today’s computing systems, a main contributor to the overall power consumption of the system, and therefore one of the most vulnerable components to the effects of variations. Device manufacturers have partially masked the presence of variability by guardbanding mechanisms [60].

Guardbanding leads to over-design with less than optimal power and/or lifetime for different memory technologies. In SRAM-based on-chip memories, the desire to operate at low voltages necessitates the need for guardbanding because of significant threshold voltage variations in those regimes. This guardbanding is usually a combination of using: (1) higher supply voltage levels; (2) larger transistors; (3) complex logic for error detection/correction, and (4) spare cells. Here, we focus mainly on the first technique.

There is a large body of work on fault-tolerant voltage-scalable SRAM cache designs that attempt to use the best combination of the aforementioned guardbanding techniques to minimize power overhead while satisfying a yield threshold [9, 151, 152, 8, 21]. All of these approaches are application-agnostic and do not adapt the guardbanding to the requirements of the application.

Findings of [148] show that masking all variabilities in 90nm SRAM requires an increase in the data retention voltage from a nominal value of 0.35V to 0.7V. Reddi et al. [123] show that as the feature size becomes smaller, more guardbanding is required: while 20% voltage guardbanding can improve the error rate at 45nm by an order of magnitude, it can improve the error rate of a 16nm SRAM cell by only 3X.

We propose the idea of removing or reducing the typical guardbanding for memories, more specifically for SRAM memories. Reducing guardbanding causes some faults to appear in the memory. This class of memories can be used in systems running approximate applications that can tolerate some level of error.

We present Relaxed Cache, which exploits the application's behavior for adaptive relaxation of guardbands in cache memories to save energy. Unlike previous efforts on memories for approximate computing [130, 92], here this relaxation is done in a disciplined manner. Using language extensions, Relaxed Cache provides two knobs to the software programmer, which (s)he uses to control the amount of guardbanding during different phases of execution. These knobs are: (1) the SRAM array voltage (VDD), and (2) the number of acceptable faulty bits in a cache block (AFB). The application developer also tags the approximate data objects. Later, based on this tagging, the cache controller knows whether a piece of data is critical and should be protected, or whether relaxed caching is acceptable, and performs block replacement accordingly. The knob adjustment, along with this data tagging, enables the programmer to guide the system to dynamically alternate between different cache operating points during runtime, seeking the optimal point in each phase of execution. This optimality can be defined based on various metrics including energy, performance, output fidelity, or a combination thereof.
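
The intended usage pattern might look like the sketch below; relaxed_cache_set_knobs() and tag_noncritical() are hypothetical stand-ins for the language extensions and runtime interface described later in this section, and the data split is illustrative.

```c
#include <stddef.h>

/* Hypothetical runtime hooks standing in for the Relaxed Cache extensions. */
extern void relaxed_cache_set_knobs(unsigned vdd_mv, unsigned afb);
extern void tag_noncritical(void *addr, size_t size);   /* data may land in relaxed ways */

#define W 512
#define H 512
static unsigned char img[W * H];       /* pixel data: errors are tolerable    */
static int           hdr[16];          /* header/metadata: must stay precise  */

void smooth_image(void)
{
    tag_noncritical(img, sizeof(img)); /* hdr stays untagged -> protected ways */
    relaxed_cache_set_knobs(520, 2);   /* e.g., VDD = 520 mV, AFB = 2          */
    /* ... error-tolerant smoothing phase over img[] ... */
    relaxed_cache_set_knobs(600, 0);   /* restore stricter operation afterwards */
}
```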

VDD                   600 mV    560 mV        520 mV    480 mV    440 mV
AFB                   0         Very Low (1)  Low (2)   Mild (3)  High (4)
PSNR                  inf       37 dB         32 dB     28 dB     26 dB
Norm. Leakage Energy  0.78      0.71          0.64      0.58      0.53
Norm. Exec. Time      1.00008   1.0001        1.0001    1.0003    1.03

Figure 3.1: Exploring the performance-energy-fidelity space for the Image Smoothing benchmark by adjusting the Relaxed Cache controlling knobs (leakage energies and execution times are normalized to a baseline that uses 700 mV for the SRAM array supply voltage).

3.4.2 Motivation

Here, using the Susan/Image-Smoothing benchmark from MiBench [63], we show how these knobs can be used to save leakage energy in the underlying cache with minimal impact on the performance and output fidelity of the application. This benchmark is commonly used for smoothing images in order to remove specks of dust and artifacts from scanning. Figure 3.1 shows the output of smoothing the Lena test image with different levels of AFB and VDD. Figure 3.1 also shows the normalized leakage energy and execution time w.r.t. a baseline system that uses a cache with a 700 mV supply voltage. This example shows that, depending on the acceptable output quality for this benchmark, the programmer can achieve up to 47% leakage energy savings by adjusting the controlling knobs at the application level.

Before detailing the Relaxed Cache approach, we present the rationale behind the performance-energy-fidelity trade-off offered by adjusting the controlling knobs in caches. To increase memory density, memory bit-cells are typically scaled to reduce their area. High-density SRAM bit-cells use the smallest devices in a technology, making SRAMs more vulnerable to manufacturing variations. On the other hand, static power, dominated by sub-threshold leakage current, is the primary contributor to power in memories and has an exponential dependence on supply voltage [53]. Reducing the supply voltage to save power makes the SRAM even more vulnerable to variations, especially at 65nm and below, which results in an exponential increase in the probability of cell failure [149].

84 100% 35 90% 30 80% 70% 25 60% 20 50%

40% ~63% 15 30% 10 20% 5

10% Cache(%) Capacity 0% 0 0.7 0.68 0.66 0.64 0.62 0.6 0.58 0.56 0.54 0.52 0.5 0.48 0.46 0.44 (mW) Power Leakage Cell VDD (v) AFB=0 AFB=1 AFB=2 AFB=3 AFB=4 Leakage Power

Figure 3.2: Trade-off between cache capacity, VDD, AFB and bit-cell leakage power (cache block size=64-byte, technology=45nm). vulnerable to manufacturing variations. On the other hand, static power, dominated by sub-threshold leakage current, is the primary contributor of power in memories and has an exponential dependence on supply voltage [53]. Reducing supply voltage for saving power makes the SRAM even more vulnerable to variations, especially in 65nm and below, which results in an exponential increase in the probability of cell failure [149].

Figure 3.2 shows that the capacity of a cache under the traditional assumption of 100% fault-free behavior (i.e., AFB=0) drops exponentially as we lower the voltage, which can dramatically degrade execution performance due to too many misses. This prevents cache manufacturers from setting the Vdd-min below a certain threshold. But if we consciously allow the SRAM to violate data integrity by increasing AFB to a value greater than zero (as we did in the example shown in Figure 3.1), we can effectively lower this threshold depending on the level of error tolerable by the application. Therefore, it is advantageous to let the application determine the (VDD, AFB) operating point of the cache. Even when no error is tolerable, the programmer can exploit the application's phasic behavior, e.g., lowering the cache voltage for an application phase that does not require a large cache capacity. Note that moving from one voltage level to a lower one, even with the same AFB level, helps the system save energy.

[Figure 3.3 shows the programmer's critical/non-critical data declarations and (VDD, AFB) configurations flowing through the runtime system (criticality table with table updater and table seeker) to the cache controller, which uses a criticality-aware replacement policy, a defect map, and protected/relaxed ways to manage critical and non-critical cache blocks.]

Figure 3.3: High-level diagram showing HW/SW components of Relaxed Cache and their interactions.

Relaxed Cache lets the application programmer adjust the cache guardbanding knobs and thereby adapt the cache's reliability and capacity to the application's demands. Accordingly, the application programmer needs to identify critical memory objects as well as the major criticality and computational phases of the application.

Figure 3.3 shows the high-level overview of the Relaxed Cache scheme. Our scheme requires small modifications to both the traditional hardware and software components of a system (shaded components of Figure 3.3). In the following, we describe the changes required for hardware and software support.

3.4.3 Hardware Support

Here, we present the hardware support for Relaxed Cache that requires the following minor changes to traditional caches (shaded components in the hardware blocks of Figure 3.3):

1. Tuning Relaxed Ways based on the architectural knob settings (AFB and VDD).

2. Modifying the Defect Map to store special defect information required by Relaxed Cache.

3. Adding a piece of on-chip memory that keeps the Criticality Table and comparators to search it.

4. Making the Cache Controller aware of critical vs. non-critical blocks.

We describe each in more detail below.

3.4.3.1 Tuning Relaxed Ways Based on Architectural Knobs

Consider a cache that has N ways. A small, fixed number (e.g., log N) of these ways are protected by using a combination of high voltage and error correction codes that assure their fault-free storage (Protected Ways). The remaining cache ways in a cache set work with a lower, dynamically adjustable voltage.

The operation of the SRAM array for the relaxed ways is controlled by two architectural knobs: (1) the supply voltage (VDD), and (2) the number of acceptable faulty bits (AFB) per cache block. Every block with more than AFB faults is disabled. The definition of AFB allows us to relax the guardbands for a majority of the cache ways while controlling the level of error that is exposed to the program. The combination of AFB and VDD also determines the active portion of the cache, and hence can be tuned to the required performance.

According to these definitions, we can have four types of cache blocks for each (VDD, AFB) setting:

• Protected Block (PB): All blocks in the protected ways.

[Figure 3.4 depicts one set of a 4-way cache: ways 1-3 are low-power relaxed ways at a configurable low VDD and way 4 is a high-power protected way at a fixed high VDD; each block is labeled with its number of faulty bits and classified as a Protected (PB), Clean (CB), Relaxed (RB), or Disabled (DB) block.]

Figure 3.4: A Sample 4-way Cache with VDD=580mV and AFB=4.

• Clean Block (CB): All fault-free blocks in the relaxed ways.

• Relaxed Block (RB): All blocks in the relaxed ways that have at least one but no more than AFB number of faults.

• Disabled Block (DB): All blocks in the relaxed ways that have more than AFB number of faults.

Figure 3.4 illustrates these block types using a sample 4-way cache with AFB=4.
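To make the block-type definitions concrete, the following sketch classifies a block given its way type, its fault count at the current VDD, and the AFB setting; the function name and types are illustrative, not part of the Relaxed Cache hardware.

/* Illustrative sketch: classify a cache block under a given (VDD, AFB) setting.
 * faulty_bits is the block's fault count at the current VDD (from the defect map). */
typedef enum { PROTECTED_BLOCK, CLEAN_BLOCK, RELAXED_BLOCK, DISABLED_BLOCK } block_type_t;

block_type_t classify_block(int in_protected_way, unsigned faulty_bits, unsigned afb)
{
    if (in_protected_way)   return PROTECTED_BLOCK;  /* PB: always fault-free */
    if (faulty_bits == 0)   return CLEAN_BLOCK;      /* CB: fault-free relaxed block */
    if (faulty_bits <= afb) return RELAXED_BLOCK;    /* RB: usable with at most AFB faults */
    return DISABLED_BLOCK;                           /* DB: too many faults, disabled */
}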

3.4.3.2 Defect Map Generation and Storage

Disabling the blocks that have more than AFB faults requires a defect map for bookkeeping, as in previous fault-tolerant voltage-scalable caches (e.g., [21]). To populate the cache defect maps, we can use any standard Built-In Self-Test (BIST) routine that can detect memory faults (e.g., March Tests [64]). If BIST is done once at test time, then non-volatile (NV) storage is needed to store the defect map for the duration of the device lifetime. If it is done at every power-up of the chip, then no extra NV storage is necessary.

The defect map bookkeeping requires a few additional bits in the tag array of each cache block. However, if the number of AFB and VDD levels is limited, this overhead is negligible. For instance, assume that we have four levels of AFB and three levels of VDD; in this case the defect map overhead is about 1.5% for a cache with a block size of 64 bytes. Assume that our VDD levels are V0 > V1 > V2 and that our AFB levels are A0 < A1 < A2 < A3. Figure 3.5(a) shows the result of running a march test on the four blocks of a set in a 4-way cache at the three VDD levels; the digit inside each block is the number of faulty bits for that block at that VDD. Note that blocks belonging to a protected way are fault-free at all VDDs. The fault inclusion property [58] states that a bit that fails at some supply voltage will also fail at all lower voltages; therefore, the number of faulty bits increases monotonically as the voltage decreases. Using this property, for each AFB=Ai we encode the lowest voltage level that results in at most Ai faults (Figure 3.5(b)). In this manner, each 64-byte block uses 8 bits to capture its defect status across all available (VDD, AFB) pairs, resulting in a 1.5% area overhead relative to a traditional cache.

Figure 3.5: Encoding and decoding defect map information in Relaxed Cache. (a) Output of a march test at voltage levels V0, V1, V2 (number of faulty bits per block of a set). (b) Encoded defect map metadata: a 2-bit code per block for each AFB level A0-A3. (c) Decoded metadata: code 00 means only V0 incurs at most AFB faults; 01 means V0 and V1; 10 means V0, V1, and V2; 11 means none.
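As a minimal sketch of this encoding (assuming the four AFB levels A0-A3 and three relaxed voltage levels V0 > V1 > V2 described above), the 2-bit code stored per AFB level is the index of the lowest voltage at which the block has at most Ai faults, with 11 meaning no relaxed voltage qualifies; the helper below is illustrative only.

/* Illustrative sketch: encode the per-block defect map metadata for Relaxed Cache.
 * faults[v] is the number of faulty bits measured by the march test at voltage index v,
 * where v = 0 is the highest relaxed voltage (V0) and NUM_VDD-1 the lowest (V2).
 * For each AFB level Ai, the 2-bit code is the index of the lowest voltage with at most
 * afb_levels[i] faults (by fault inclusion, all higher voltages also qualify); 3 (binary 11)
 * means the block is disabled at every relaxed VDD. With 4 AFB levels this is 8 bits per
 * 64-byte block, i.e. roughly 1.5% overhead. */
#define NUM_VDD 3
#define NUM_AFB 4

unsigned char encode_defect_map(const unsigned faults[NUM_VDD],
                                const unsigned afb_levels[NUM_AFB])
{
    unsigned char meta = 0;
    for (int i = 0; i < NUM_AFB; i++) {
        unsigned code = 3;                        /* 11: no relaxed voltage qualifies */
        for (int v = NUM_VDD - 1; v >= 0; v--) {  /* search from the lowest voltage upward */
            if (faults[v] <= afb_levels[i]) { code = (unsigned)v; break; }
        }
        meta |= (unsigned char)(code << (2 * i));
    }
    return meta;
}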

3.4.3.3 Non-Criticality Table

A dedicated piece of on-chip memory is used to store the virtual address ranges of the application's data that can be approximated (a sample is shown in Table 3.6). Before a fetched block replaces an evicted block in the cache, this table is searched by hardware comparators to check whether the block's address is contained in any of the address ranges in the table, and the block is tagged accordingly.

The number of entries in the table is a design choice, and our experiments show that approximately 30 entries are sufficient. Thus the storage overhead is 30 * (32+32+1) bits ≈ 4 cache blocks, which is less than 1% of cache capacity.

Table 3.6: A sample criticality table for Relaxed Cache

Valid   Start         End
1       0x100ef310    0x100ef79c
1       0x100ec940    0x100efdcc
0       ...           ...
0       ...           ...
0       ...           ...

Hardware comparators check whether the block's address is contained in one of the table entries. While the data is being fetched from the L2 cache or main memory, these comparators sequentially search for the address inside the table: one comparator checks whether the block's start address is greater than Start, and the other checks whether the block's end address is less than End. The outputs of both comparators are ANDed to determine whether the block is found in the table. The number of comparators is also a design choice; to keep the performance overhead of tagging low, more than one pair can be used. Note that since the access latency of the L2 cache is on the order of 10 cycles, for a 30-entry table the delay of the table search can be masked using 3 pairs of comparators, effectively making the performance overhead zero. These comparators are activated only during cache misses (<10% of accesses). Our power analysis using Synopsys Design Compiler shows that the total (dynamic and leakage) power consumption of these comparators is about 1% of the total leakage power in baseline mode. This means any leakage saving of more than 1% effectively compensates for this power overhead.
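The following sketch captures the comparator check described above in software form; the table layout mirrors Table 3.6, and the function and field names are illustrative rather than the actual hardware.

/* Illustrative sketch of the non-criticality table search: an incoming block is tagged
 * non-critical only if its whole address range [blk_start, blk_end] lies inside a valid entry. */
#include <stdint.h>

#define TABLE_ENTRIES 30

struct approx_entry { int valid; uint32_t start; uint32_t end; };

int is_noncritical_block(const struct approx_entry table[TABLE_ENTRIES],
                         uint32_t blk_start, uint32_t blk_end)
{
    for (int i = 0; i < TABLE_ENTRIES; i++) {
        /* One comparator checks blk_start against Start, the other blk_end against End;
         * the results are ANDed (sequentially here, with a few comparator pairs in hardware). */
        if (table[i].valid && blk_start >= table[i].start && blk_end <= table[i].end)
            return 1;   /* tag block as non-critical */
    }
    return 0;           /* default: critical */
}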

3.4.3.4 Making Cache Controller Aware of Block’s Tag

Relaxed Cache's replacement policy needs to discriminate between critical and non-critical blocks. Whenever a miss occurs and the missed data block is tagged as non-critical, the replacement policy selects a victim block from the relaxed ways of the corresponding cache set. A critical block, however, is always allocated in a protected way. Note that for very aggressively-scaled voltage levels, it is possible that all of the blocks in the relaxed ways become disabled at the specified AFB. In this case, the replacement policy uses a block in one of the protected ways for bringing a non-critical block into the cache. This allows us to operate at very low voltage levels and trade cache capacity for power savings for applications that are not memory-bound in certain phases of their execution. Because the block replacement logic is not in the critical path for hit read/write accesses, this minor modification does not affect the cache access latency.
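A minimal software sketch of this criticality-aware victim selection is given below, assuming a set with NUM_WAYS ways whose last PROTECTED_WAYS ways are protected and whose disabled relaxed blocks are marked from the defect map; all names and constants are illustrative.

/* Illustrative sketch: pick a victim way for an incoming block in Relaxed Cache.
 * Critical blocks always go to a protected way. Non-critical blocks prefer a relaxed way;
 * if every relaxed way in the set is disabled at the current (VDD, AFB), fall back to a
 * protected way so execution can continue at very low voltages. */
#include <stdlib.h>

#define NUM_WAYS       4
#define PROTECTED_WAYS 1   /* ways [NUM_WAYS-PROTECTED_WAYS, NUM_WAYS) are protected */

int select_victim_way(int incoming_is_critical, const int way_disabled[NUM_WAYS])
{
    if (!incoming_is_critical) {
        /* Random replacement over the enabled relaxed ways (the policy used in our setup). */
        int candidates[NUM_WAYS], n = 0;
        for (int w = 0; w < NUM_WAYS - PROTECTED_WAYS; w++)
            if (!way_disabled[w]) candidates[n++] = w;
        if (n > 0) return candidates[rand() % n];
    }
    /* Critical block, or all relaxed blocks disabled: use a protected way. */
    return (NUM_WAYS - PROTECTED_WAYS) + rand() % PROTECTED_WAYS;
}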

3.4.4 Software Support

We now describe changes to software components in Figure 3.3:

3.4.4.1 Programmer-Driven Application Modifications

3.4.4.1.1 Data Criticality Declaration In order to utilize relaxed ways, the software programmer should identify non-critical data structures in the program to be mapped to those relaxed ways. We believe a software programmer can easily make this distinction since (s)he is aware of the functionality and

semantics of different parts of the application and related data structures. Frameworks like Rely [33] and ACCEPT [128] could also assist the programmer in identifying non-critical data objects in a program.

The memory footprint of an application has four segments: Code, Global, Stack, and Heap. The code segment usually cannot tolerate any errors. Global, stack, and heap can contain both critical and non-critical data. In our experiments we found most of the non-critical data to be allocated on the heap; however, our implementation supports non-critical data in all memory segments.

The constructs ADD_APPROX(Start-VA, End-VA) and REMOVE_APPROX(Start-VA, End-VA) are used to declare and undeclare data non-criticality dynamically within the code. Start-VA and End-VA are the virtual address boundaries of the target region. The additional capability of undeclaring a data region becomes important in two cases: (1) when an error-bounding procedure should be performed on a non-critical data object before passing it to the next stage of the program in order to reduce the propagated error magnitude; here, it is usually desirable to stop introducing any further errors into the data after error bounding; and (2) when a critical procedure must be performed on a piece of non-critical data (e.g., computing a checksum of pixels at the end of image compression).

3.4.4.1.2 Cache Configuration The intuition behind letting the programmer configure the cache settings is a key insight in today's energy management techniques: systems and applications respond to phases. Accordingly, our scheme leverages the programmer's knowledge for managing energy consumption. The programmer, with in-depth knowledge of the software, is in the best position to identify the different phases of the application and, based on that, manage the settings of the underlying hardware. On the other hand, a programmer can respond to the system's operational mode by devising provisions that adapt the execution to these modes [42]. For example, a programmer can decide to render a high-fidelity image when the user's smartphone is fully charged, and only a low-fidelity image when the battery level of the phone falls below 25%.

Figure 3.6: Abstracting Relaxed Cache knobs up to metrics familiar to a software programmer (i.e., performance, fidelity, and energy consumption). The VDD levels (High, Medium, Low) and AFB levels (None, Mild, Aggressive) map to qualitative energy (E), fidelity (F), and performance (P) trade-offs. Note that the (VDD = High, AFB = Mild) and (VDD = High, AFB = Aggressive) combinations are sub-optimal, hence not applicable.

To utilize the capabilities offered by Relaxed Cache, the software programmer can annotate code regions with cache configuration hints. These hints are then used by the system to dynamically control the VDD and AFB knobs, adjusting the guardbanding to the current phase of execution. The programmer, with in-depth knowledge of the software, is in the best position to identify application phases with different criticality and/or memory insensitivity and, based on that, guide the adaptive guardbanding in hardware.

To accommodate programmers with different levels of hardware familiarity, we can abstract the Relaxed Cache knobs as shown in Figure 3.6. An embedded system programmer, with a fairly good understanding of hardware details, can explicitly set Relaxed Cache's knobs to fully exploit its capabilities. On the other hand, for a traditional software developer (with little knowledge of hardware), we can abstract the hardware knob settings into discretized "levels", allowing the programmer to qualitatively set the knobs based on their effects (e.g., a low-power setting or a high-fidelity setting), as shown in Figure 3.6.

#include "Approximations.h"  // Enables approximation annotations

int main(int argc, char **argv) {

    // Capturing Raw Image on Camera
    // Compress Raw Image

    /*-------------- Image Scaling --------------*/
    CONFIG_CACHE(MEDIUMVDD, MILDAFB);

    int *src, *dest;
    // Pointers must refer to valid buffers before their ranges are declared non-critical
    src  = read_image(in_filename, &sw, &sh, &src_dim);
    dest = allocate_transform_image(scale_factor, sw, sh, &dw, &dh, &dest_dim);

#ifdef APPROX_SRC
    ADD_APPROX((uint64_t)src, (uint64_t)(src + num_elmnts));
#endif

#ifdef APPROX_DEST
    ADD_APPROX((uint64_t)dest, (uint64_t)(dest + num_elmnts));
#endif

    scale(scale_factor, src, sw, sh, dest, dw, dh);

#ifdef APPROX_SRC
    REMOVE_APPROX((uint64_t)src, (uint64_t)(src + num_elmnts));
#endif

#ifdef APPROX_DEST
    REMOVE_APPROX((uint64_t)dest, (uint64_t)(dest + num_elmnts));
#endif

    free(src);
    /*--------------------------------------------*/

    // Other Image Transformations/Editing

}

Figure 3.7: A sample code showing programmer’s data criticality declarations and cache configurations for Relaxed Cache.

Language extensions then let the programmer use CONFIG_CACHE(VDD-LEVEL, AFB-LEVEL) to configure the cache knobs. Since this reconfiguration incurs a penalty of several cycles, it should be done only at certain phase transitions of an application.

Figure 3.7 shows an image resizer code fragment that uses declarations and cache configuration

to save energy. In this example, two large regions of data that can tolerate some level of

errors are the original image’s data structure (pointed to by src), and the data structure for storing up-scaled image (pointed to by dest); the programmer has declared both regions as candidates for relaxed caching. After returning from the scale() function, and before moving to the next program phase, (s)he has undeclared both regions.

3.4.4.2 Runtime System

We assume a runtime system uses the programmer's declarations to keep and update, for each application, a table of virtual addresses that contain non-critical data objects. During a miss, while the data is fetched from the lower level of memory (L2 data cache or main memory), the comparators in the table seeker sequentially search for the block's address inside the table; the table seeker then tags the block as critical or non-critical. This tagging is used by the cache controller's replacement mechanism to find an appropriate cache way for replacement. Note that, in order for a block to be tagged as non-critical, the entire data block must hold non-critical data; a data block with mixed critical/non-critical data is still tagged as critical. Further optimization is possible through criticality-aware data placement during compilation to avoid such mixed-criticality blocks, which is outside the scope of this work.

3.4.5 Evaluations

3.4.5.1 Experimental Setup

We modified the cache architecture in the gem5 framework [29] to implement our scheme in detail and used gem5’s pseudo-instruction capability for implementing the required language extensions. The common gem5 parameters used in all our simulations are summarized in Table 3.7. We used the Random block replacement policy that is commonly used in embedded

processors [11, 12] due to its minimal hardware and power cost.

[Figure 3.8 plots SRAM bit error rate (BER) against VDD (mV) from 100 to 1,000 mV.]

Figure 3.8: SRAM BER for 45nm using models and data from [149]

Table 3.7: gem5 settings for Relaxed Cache experiments

Parameter         Value              Parameter             Value
ISA               Alpha              Replacement Policy    Random
CPU Model         Detailed (OoO)     Cache Block Size      64B
No. Cores         1                  L1$ Size, Assoc.      32KB, 4-way
Cache Config.     L1 (Split), L2     L2$ Size, Assoc.      256KB, 8-way
Simulation Mode   Syscall Emul.      Main Memory           LPDDR3, 512MB

We injected errors into different cache blocks randomly, with a uniform distribution over bit positions. Our SRAM bit error rates (BERs) are shown in Figure 3.8. These were computed by [58] using the data and models from [149], which performed a detailed analysis of SRAM noise margins and error rates for a commercial 45nm technology. Using these BERs, we found the distribution of the number of faults at each data array voltage, thus allowing us to compute the expected cache capacity at each AFB level.

Cache power consumption is estimated based on the model in [58]. CACTI 6.5 [111] is modified to extract static power for the baseline and Relaxed Cache. Note that baseline cache does not have any fault tolerance and does not support voltage scalability.

3.4.5.2 Benchmarks

We selected a number of multimedia benchmarks to examine the energy savings enabled by Relaxed Cache.

1) Scale: Scale is an image resizer application. We use Peak-Signal-to-Noise Ratio (PSNR) as the fidelity metric for this application. Table 3.8 shows the relation between PSNR and perceptual quality.

Table 3.8: Relation between PSNR and perceptual quality in image processing domain

PSNR                    Perceptual Quality
< 28.0                  Low
28.0 ≤ PSNR < 30.0      Acceptable
30.0 ≤ PSNR < 33.0      Good
≥ 33.0                  Excellent

2) Susan: Susan is an image recognition package from MiBench [63]. We used the Image-Smoothing and Edge-Detection kernels from this package. PSNR is used as the fidelity metric for Image-Smoothing. For Edge-Detection, the accuracy metric is defined as the fraction of detections that are true positives rather than false positives. A value of 0.8 or more is usually considered acceptable.

3) x264: This application is an H.264 video encoder [28]. PSNR of 32dB can provide satisfactory video quality for many types of videos.

3.4.5.3 Experimental Results

3.4.5.3.1 Leakage Energy Savings Figure 3.9 summarizes the achievable leakage energy savings for different Relaxed Cache knob settings (VDD, AFB). The baseline cache is assumed to use 700mV for supply voltage. We assume power-gating for disabled blocks. Consequently, if we keep voltage constant

and increase AFB (effectively reclaiming faulty memory blocks), the leakage energy saving decreases. Also note that for voltages ≥ 540mV, increasing AFB does not reduce energy savings. This is due to the low number of faulty blocks at those voltages. This analysis shows that we can decrease leakage energy by up to 74% by adjusting the Relaxed Cache knob settings.

[Figure 3.9 plots leakage energy saving (%) versus VDD (mV) from 440 to 680 mV for AFB=0 through AFB=5.]

Figure 3.9: Leakage energy savings for a 4-Way L1 cache with 3 relaxed ways (energy savings are normalized to a baseline cache that uses 700mV).

3.4.5.3.2 Fidelity Analysis Figure 3.10a shows the PSNR difference between the images up-scaled using the Scale benchmark on a system with the energy-efficient Relaxed Cache and on a system that uses a guardbanded baseline cache.

The PSNR values for the Image-Smoothing kernel of the Susan package are reported in Figure 3.10b. As with the Scale benchmark, many of the (VDD, AFB) settings result in an acceptable output. However, it is interesting to note that on average the PSNR values are larger than the ones reported for Scale. This shows that Image-Smoothing is more error-tolerant, and therefore the programmer can use a more aggressive policy for that kernel. This observation confirms that different applications have different levels of error tolerance, which can be exploited for energy saving.

[Figure 3.10 plots PSNR (dB) versus VDD (mV) for AFB=1 through AFB=4: (a) Scale, (b) Image-Smoothing.]

Figure 3.10: Fidelity results for (a) Scale and (b) Image-Smoothing benchmarks.

Susan/Edge-Detection is the most error-tolerant application that can produce good results even at high error rates. Figure 3.11 shows the result of Susan/Edge-Detection on the Lena test image with different AFBs while VDD is set to 480mV. We can see that even AFB=6 (which reclaims 99% of the Relaxed Cache faulty blocks at that voltage) can result in an acceptable output.

3.4.5.3.3 Performance Analysis

[Figure 3.11 shows the Edge-Detection output on the Lena test image for AFB=0 through AFB=7 at VDD=480mV, with accuracies of 1.0000, 0.9975, 0.9867, 0.9544, 0.9159, 0.8577, 0.8317, and 0.7871, respectively.]

Figure 3.11: Fidelity results for Edge-Detection benchmark (VDD=480mV).

Operating at low voltages effectively disables cache regions affected by errors. This reduced cache capacity has only a minor impact on the performance of programs with small working set sizes: the degradation in total execution time is bounded by 3% for the Scale and Edge-Detection benchmarks and 2% for the Image-Smoothing benchmark over all (VDD, AFB) settings in our experiments. However, for applications with large working set sizes, the increased cache miss rate due to reduced cache capacity considerably degrades performance. Relaxed Cache's AFB knob, though, allows us to reclaim part of the lost cache capacity, thereby increasing the performance of the application. This is particularly useful for application instances where a partial, less-accurate result generated within a deadline is more valuable than a late-but-perfect result. Relaxed Cache provides the means for a programmer to explore this trade-off space, deciding when higher performance is desirable and when higher output fidelity is preferable.

For instance, Figure 3.12 shows the adverse effect of reducing the voltage for the H.264 encoder: x’s correspond to PSNR loss and circles mark FPS loss. Due to the reduced cache

capacity, a normal cache (Baseline, in blue) would see a 38% decrease in FPS, while maintaining the PSNR quality. On the other hand, Relaxed Cache (in red) trades off PSNR quality for less degradation in FPS (up to 9.5%). However, in all the experiments the PSNR of the degraded-quality video is still above the 32dB threshold, yielding acceptable quality for the user.

3.5 QuARK Cache

3.5.1 Introduction

Deploying SRAM memories in embedded systems is increasingly challenging in advanced VLSI technologies, in particular at sub-45nm feature sizes [120], due to high leakage power, reliability issues, and low density, making SRAM less appealing for modern embedded systems. These limitations have led to the development of alternative memory technologies with different energy-performance-reliability characteristics, such as Spin Transfer Torque Magnetic RAM (STT-MRAM). STT-MRAM offers a high-density, high-speed, non-volatile choice of random access memory; however, it suffers from a critical reliability issue, called stochastic switching [46], which, if properly handled, can make STT-MRAM a promising replacement for SRAM in many applications.

[Figure 3.12 plots PSNR loss (dB) and FPS loss (%) versus VDD (mV) for the baseline cache and Relaxed Cache (RC).]

Figure 3.12: FPS-PSNR trade-offs with and without Relaxed Cache scheme (AFB=4).

Stochastic switching affects both read and write operations. It affects the read operation by causing the read disturbance phenomenon, defined as unintentionally changing the stored value in an STT-MRAM cell while reading the cell. Similarly, a write operation may fail to change the value of the cell correctly, resulting in a write failure.

Conventional approaches that address these sources of unreliability to preserve data integrity use a conservatively high current for write operations and incorporate ECC to recover from potential errors [140, 77]. Both solutions significantly exacerbate the energy consumption of STT-MRAMs, counteracting some of the advantages of STT-MRAM over SRAM.

We present QuARK Cache, a hardware/software approach for trading reliability of STT-MRAM caches for energy savings in the on-chip memory hierarchy of systems running approximate applications. QuARK Cache utilizes fine-grained actuation knobs to efficiently control reliability-energy trade-offs for individual accesses of concurrently running applications. The STT-MRAM approximation knobs considered in QuARK Cache are read and write current amplitudes.

Compared to Relaxed Cache (Section 3.4) for SRAM caches that uses cache-way-level knobs, QuARK presents a more fine-grained actuation capability enabling it to offer the following advantages: (1) the knob actuations do not affect any other cache block unlike the actuation in Relaxed Cache which requires flushing all the affected blocks, (2) multiple applications with different degrees of reliability can share the same cache without affecting each other’s guaranteed level of reliability, and (3) the reliability level requested for a piece of data can be changed at runtime, if the quality of output is not satisfactory.

3.5.2 STT-MRAM Reliability/Energy Trade-off Knobs

In this section, we first review the structure of a STT-MRAM cell and the mechanisms of reading from and writing to it. Then, we discuss the available circuit-level knobs in STT-MRAMs which can be used to trade accuracy for energy.

3.5.2.1 STT-MRAM Basics

The standard STT-MRAM cell (1T-1J) includes a Magnetic Tunnel Junction (MTJ) and an access transistor. The MTJ consists of an oxide barrier layer sandwiched between two ferromagnetic layers. One of the ferromagnetic layers has a fixed magnetic field direction (the reference layer), while the magnetic field direction of the other layer can be changed (the free layer). The relative magnetic field direction of these layers yields different resistances, which are used to store values: an STT-MRAM cell stores a value based on the resistance of its MTJ. If the magnetic field directions of the two ferromagnetic layers are parallel, the MTJ exhibits a low resistance; otherwise, it exhibits a high resistance. Read and write operations in an STT-MRAM cell are performed by applying either a small current to sense the MTJ resistance with a sense amplifier, or a larger current to change the resistance (i.e., write a new value) in the MTJ.

To read the value of an STT-MRAM cell, a small current is applied to measure the resistance of the MTJ with a sense amplifier. Similarly, during a write operation, a current is applied to the MTJ to change the magnetic field direction of the free layer in order to write a value into the cell. The direction of the applied write current is determined by the required magnetic field direction in the free layer.

3.5.2.2 Reliability-Energy Knobs

As noted previously, STT-MRAM switching is a stochastic operation. Therefore, read and write operations can be performed at different levels of reliability and consequently energy consumption [110, 88].

Write Operation: During a write operation, depending on the amount and duration of the applied write current, the direction of the magnetic field of the free layer in MTJ may not change, which can result in a write failure. The write failure probability can be calculated by Equation (3.1)[97]:

\[ P_{wf}(t_w) = \exp\!\left(-t_w \times \frac{2\,\mu_B\, p\,(I_w - I_{C0})}{\left(c + \ln\!\left(\frac{\pi^2 \Delta}{4}\right)\right)\times\left(e\, m\,(1 + p^2)\right)}\right) \tag{3.1} \]

where Δ is the thermal stability factor, I_C0 is the critical MTJ switching current at 0 K, c is the Euler constant, e is the magnitude of the electron charge, m denotes the magnetic momentum of the free layer, p is the tunneling spin polarization, µ_B is the Bohr magneton, I_w is the write current, and t_w is the write pulse width. I_w is one of the effective circuit-level knobs available to control the reliability-energy trade-off during a write operation. Generally, a lower I_w decreases the write energy, but it also amplifies the probability of a write failure.

Based on the data reported in [122], we modeled a 1MB STT-MRAM cache in NVSim [48]. Figure 3.13(a) shows the write energy vs. write error rate of this cache at different write

currents. We see that modulating I_w leads to savings in write operation energy, e.g., a 2X improvement for an error probability of 9 × 10^-4.

Read Operation: During a read operation, depending on the amount and duration of the applied read current, an unintentional magnetic field direction flip in the free layer of MTJ may happen due to stochastic switching which can result in read disturbance. The read disturbance probability can be calculated using Equation (3.2)[144]:

[Figure 3.13 plots (a) write error rate versus write energy (nJ), annotated with write pulse current reduction (WPCR) percentages, and (b) read error rate versus read energy (pJ), annotated with read pulse current reduction (RPCR) percentages, with the Ideal point marked in each panel.]

Figure 3.13: STT-MRAM knobs for reliability-energy trade-off in 1MB cache. (a) Write Pulse Current Reduction (WPCR), and (b) Read Pulse Current Reduction (RPCR).

\[ P_{rd} = 1 - \exp\!\left(-\frac{t_r}{\tau} \times \exp\!\left(-\frac{\Delta\,(I_{C0} - I_r)}{I_{C0}}\right)\right) \tag{3.2} \]

where τ is the attempt period, I_r is the read current, and t_r is the period of the read pulse. Unlike the write operation, both the read energy consumption and the read disturbance probability are decreased by reducing I_r. However, another constraint prevents us from applying a very small read current: lowering the read current amplitude below a pre-defined sense amplifier threshold leads to a different type of read error in STT-MRAM, called a decision failure. Accordingly, I_r is one of the effective circuit-level knobs to control the reliability-energy trade-off during a read operation. Figure 3.13(b) depicts the reliability-energy trade-off for the read operation of the aforementioned 1MB STT-MRAM cache. We see that decreasing the read pulse current results in energy savings for the read operation. For example, modulating I_r leads to over a 2.2X improvement in read energy consumption for an error probability of 9 × 10^-5.
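To make the role of these knobs concrete, the snippet below evaluates Equations (3.1) and (3.2) numerically; the parameter values are hypothetical placeholders for illustration, not the device parameters used in our experiments.

/* Illustrative sketch: evaluate the write-failure (Eq. 3.1) and read-disturbance (Eq. 3.2)
 * probabilities as a function of the write/read current knobs. All parameter values are
 * hypothetical placeholders; real values depend on the MTJ technology. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double PI = 3.141592653589793;

    /* Device parameters (placeholders) */
    double delta = 60.0;       /* thermal stability factor */
    double I_C0  = 100e-6;     /* critical MTJ switching current at 0 K (A) */
    double c     = 0.5772;     /* Euler constant */
    double e     = 1.602e-19;  /* magnitude of the electron charge (C) */
    double m     = 1.0e-17;    /* magnetic momentum of the free layer (placeholder) */
    double p     = 0.6;        /* tunneling spin polarization */
    double mu_B  = 9.274e-24;  /* Bohr magneton (J/T) */
    double tau   = 1e-9;       /* attempt period (s) */

    double I_w = 150e-6, t_w = 10e-9;   /* write-current knob and write pulse width */
    double I_r = 30e-6,  t_r = 2e-9;    /* read-current knob and read pulse width */

    /* Equation (3.1): write failure probability */
    double P_wf = exp(-t_w * (2.0 * mu_B * p * (I_w - I_C0)) /
                      ((c + log(PI * PI * delta / 4.0)) * (e * m * (1.0 + p * p))));

    /* Equation (3.2): read disturbance probability */
    double P_rd = 1.0 - exp(-(t_r / tau) * exp(-delta * (I_C0 - I_r) / I_C0));

    printf("P_wf = %.3e, P_rd = %.3e\n", P_wf, P_rd);
    return 0;
}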

3.5.3 The QuARK Approach

QuARK Cache enables a system with STT-MRAM caches running approximate computing applications to trade accuracy of storage in on-chip memory for energy savings. It captures the

information about non-critical data objects from software and adjusts the reliability-energy knobs accordingly. This section presents the details of the software and hardware support for QuARK Cache.

3.5.3.1 Software Support

The software support required for QuARK Cache is very similar to the one described for Relaxed Cache (Section 3.4.4). QuARK Cache provides two API calls to the programmer:

ADD_APPROX and REMOVE_APPROX. Table 3.9 shows their formats. BaseVA is the base address of the non-critical data object, Size is the size of the non-critical data object, and ReliabilityLevel is the required reliability guarantee for that data object. Note that ADD_APPROX receives an additional parameter (i.e., ReliabilityLevel) compared to Relaxed Cache.

The QuARK Cache APIs communicate with the QuARK Cache hardware support (introduced in Section 3.5.3.2) to pass the information about non-critical data objects and their acceptable reliability level. There are two possible approaches for integrating these two APIs into the system: 1) The ISA of the processor can be modified to support the QuARK Cache APIs. With this approach, the processor has to be modified to be able to directly communicate with the QuARK Cache hardware support. 2) The QuARK Cache APIs can be implemented within a special runtime library. With this approach, hardware components of QuARK Cache become memory-mapped interfaces. The runtime library uses normal read/write instructions

Table 3.9: QuARK Cache APIs

Method           Parameter          Note
ADD_APPROX       BaseVA             Base virtual address of approx. memory region
                 Size               Size of the approx. memory region
                 ReliabilityLevel   Required reliability guarantee
REMOVE_APPROX    BaseVA             Base virtual address of approx. memory region
                 Size               Size of the approx. memory region

unsigned *image;
int reliability_level = RL0;   // default: fully protected
...
image = (unsigned int *) malloc(WIDTH * HEIGHT * sizeof(unsigned));
if (approximation_enabled) {
    int lighting = get_environment_lighting();   // 0-100
    if (lighting < 30)       reliability_level = RL0;  // poor lighting: protect data
    else if (lighting < 70)  reliability_level = RL1;  // normal lighting: relax storage
    else                     reliability_level = RL0;  // very bright: protect data
}
ADD_APPROX(image, WIDTH * HEIGHT * sizeof(unsigned), reliability_level);
...
load_image(image);
face_detection(image);
...
REMOVE_APPROX(image, WIDTH * HEIGHT * sizeof(unsigned));
...

Figure 3.14: A pseudo-code example showing how QuARK Cache APIs can be used in a face detection application.

to transfer the information provided by the APIs to the hardware.

Figure 3.14 shows an example of how the QuARK Cache APIs can be used in the source code of a hypothetical face-detection application. According to [54], environmental lighting affects the quality of a face detection algorithm. Thus, by provisioning a reasonable output quality for the worst-case environmental lighting (when the lighting is either very low or very high), we can trade quality for energy saving in cases where the lighting is in the normal range (30∼70 in this example).

3.5.3.2 Hardware Support

Figure 3.15 shows how the required hardware support for QuARK Cache can be integrated into the architecture. Deploying QuARK Cache for the L1 private and L2 shared caches of the memory hierarchy in a multicore architecture requires adding two types of modules: (1) the QuARK Cache approximation table and (2) QuARK Cache controllers. Additionally, the

[Figure 3.15 shows a multicore (Core 1 ... Core n) where each core's TLB is paired with a QuARK table; QuARK controllers sit next to the L1 data cache controller and the shared L2 cache controller, and the interconnect carries the reliability level (R) along with the physical address (PA) and data. Numbered arrows 1-4 mark the write-back flow described in Section 3.5.3.2.3.]

Figure 3.15: Integrating QuARK Cache into the architecture. Required changes are highlighted in gray.

interconnect should be modified to carry the reliability level information. These modifications are highlighted in gray in Figure 3.15.

3.5.3.2.1 QuARK Cache Approximation Table The QuARK Cache approximation table is responsible for storing the information provided by the QuARK Cache APIs. Each core includes a private approximation table. Anytime an

ADD_APPROX is called in the thread running on that core, its parameters (i.e., BaseVA, Size, and ReliabilityLevel) are saved in one of the rows of this table. Similarly, REMOVE_APPROX removes all or part of the information stored in a row.

This table is placed next to the Translation Look-aside Buffer (TLB) of each core. For every memory access, and during the TLB lookup, the QuARK Cache approximation table is also searched. If the virtual address of the memory access falls within the boundaries of one of the

rows in the table, the reliability of that access is set according to the reliability level stored in the corresponding row. For accesses that do not hit any row in the table, the reliability level is set to the maximum possible reliability. QuARK Cache controls the STT-MRAM knobs at cache-block granularity. The virtual address ranges provided by the APIs should be aligned to cache block boundaries, because cache blocks that contain a mix of critical and non-critical data should not be approximated; an address alignment module embedded in the approximation table performs the required alignment. The addresses provided by the QuARK Cache APIs and stored in the QuARK Cache approximation table are Virtual Addresses (VAs); however, in most architectures (e.g., ARM), caches work with Physical Addresses (PAs).
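A software sketch of this per-access lookup is given below; the block-granularity alignment shrinks a declared range inward so that partially covered cache blocks remain fully protected. The structures and constants are illustrative, not the actual table hardware.

/* Illustrative sketch: per-access reliability lookup in the QuARK approximation table. */
#include <stdint.h>

#define BLOCK_SIZE 64
#define TABLE_ROWS 32
enum { RL0 = 0, RL1, RL2, RL3 };   /* RL0: fully protected ... RL3: least reliable */

struct approx_row { int valid; uint64_t start; uint64_t end; int rl; };

/* Align a declared [base, base+size) region inward to whole cache blocks, so that blocks
 * mixing critical and non-critical data are never approximated. */
void align_region(uint64_t base, uint64_t size, uint64_t *start, uint64_t *end)
{
    *start = (base + BLOCK_SIZE - 1) & ~(uint64_t)(BLOCK_SIZE - 1);   /* round start up */
    *end   = (base + size) & ~(uint64_t)(BLOCK_SIZE - 1);             /* round end down */
}

/* Reliability level for a memory access at virtual address va: the level stored in the
 * matching row, or RL0 (maximum reliability) when no row matches. */
int lookup_reliability(const struct approx_row table[TABLE_ROWS], uint64_t va)
{
    for (int i = 0; i < TABLE_ROWS; i++)
        if (table[i].valid && va >= table[i].start && va < table[i].end)
            return table[i].rl;
    return RL0;
}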

3.5.3.2.2 QuARK Cache Controller The reliability level for each access is set by looking up the QuARK Cache approximation table. The memory request, now augmented with a reliability level, is then passed to the QuARK Cache controller(s). The QuARK Cache controller sits next to the cache controller and is responsible for setting the STT-MRAM reliability-energy trade-off knobs (i.e., read current and write current) for each access. The feasibility of designing efficient peripheral circuits for changing the current (voltage) level at runtime is evaluated in [70, 85, 122]. The QuARK Cache controller receives the reliability level set by the approximation table and selects the minimum current setting that satisfies the required level of reliability for that access. One QuARK Cache controller is needed for each private L1 cache and one for each shared L2 cache.

3.5.3.2.3 Support for Cache Fillings and Write-backs Special consideration might be needed for interactions between L1 and L2 caches anytime there is an L1 data cache miss.

• L1 Cache Filling: Anytime a cache miss occurs in the L1 data cache, a cache fill is required.

The memory access that resulted in the cache miss is already augmented with a reliability level, since the approximation table was searched during the TLB lookup. The QuARK Cache controller will use this reliability level to fill the L1 cache line when the request is served by the L2 cache.

• L2 Cache Filling: A cache fill in L2 is required whenever an L1 cache request misses in the L2 cache. Similar to the previous case, the reliability level of this request is already provided by the approximation table. The QuARK Cache controller of the L2 will use this reliability level while filling the L2 cache line from upper-level memories.

• Write-backs from L1 to L2: Sometimes the L1 cache controller decides to evict a dirty cache line from the L1 cache. This happens either when a new block should replace a dirty one, or when the last modified version of the block in the L1 cache should be written to upper-level memories. For write-backs from the L1 data cache to the L2 cache, QuARK Cache follows a different flow. Following Figure 3.15, the detailed steps to write back a dirty block from the L1 data cache to the L2 cache are as follows (a software sketch of this flow is given after the list):

1) The QuARK Cache controller of the L1 data cache sends the PA of the dirty block to the QuARK Cache table.

2) The QuARK Cache table generates VA queries for all of its entries and checks the TLB's PA output to determine whether the PA of the dirty block falls within the address interval of any entry. The QuARK Cache table then determines the reliability level of this write-back request.

3) The QuARK Cache controller of the L1 data cache reads the data of the dirty block with the provided reliability level and passes the data and reliability level to the L2 cache.

4) The QuARK Cache controller of the L2 cache adjusts the current level and performs the L1 data write-back operation.
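A condensed software view of these four steps is sketched below, assuming hypothetical helper functions standing in for the reverse PA-to-reliability lookup and the current-adjusted cache accesses.

/* Illustrative sketch of the L1-to-L2 write-back flow in QuARK Cache. */
#include <stdint.h>

typedef struct { uint8_t bytes[64]; } cache_block_t;   /* placeholder block payload */

/* Hypothetical helpers standing in for the hardware steps above. */
int           quark_table_reverse_lookup(uint64_t pa);                                /* steps 1-2 */
cache_block_t l1_read_block(uint64_t pa, int reliability_level);                      /* step 3 */
void          l2_write_block(uint64_t pa, cache_block_t data, int reliability_level); /* step 4 */

void quark_writeback_l1_to_l2(uint64_t dirty_pa)
{
    /* Steps 1-2: the approximation table, with help from the TLB, maps the dirty block's
     * physical address back to a reliability level (RL0 if no entry matches). */
    int rl = quark_table_reverse_lookup(dirty_pa);

    /* Step 3: the L1 QuARK controller reads the dirty block at that reliability level. */
    cache_block_t data = l1_read_block(dirty_pa, rl);

    /* Step 4: the L2 QuARK controller adjusts the write current and performs the write-back. */
    l2_write_block(dirty_pa, data, rl);
}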

• Write-backs from L2 to upper-level memory: Write-backs from the L2 cache are issued due

to the same reasons mentioned for L1 write-back requests. For all write-back operations from the L2 cache, QuARK Cache uses the highest reliability level for reading the victim blocks from the L2 cache and writing them to upper-level memory.

3.5.4 Evaluations

In the following, we show experiments with a mix of approximate applications to evaluate the QuARK Cache’s capabilities in saving energy in the on-chip memory hierarchy under different levels of accuracy demanded by applications.

3.5.4.1 Experimental Setup

In order to implement our scheme in detail, we modified the cache architecture in the gem5 framework [29] for a multicore architecture. In this work, to assess the efficiency of QuARK Cache in enhancing the energy characteristics of multicore architectures, we enable QuARK Cache for the L2 cache as an exemplar. We used gem5's pseudo-instructions to implement QuARK Cache's language extensions. The common gem5 parameters used in all our simulations are summarized in Table 3.10.

We randomly injected faults into different cache blocks during read and write operations, with uniform distribution over different bit positions. We used bit error rate (BER) and access current data from [122] and modeled a 1MB STT-MRAM cache in NVSim [48] to extract

Table 3.10: gem5 settings for QuARK Cache experiments

Parameter             Value                                         Parameter           Value
ISA                   ARMv7-A                                       L1$ Size, Assoc.    64KB, 4
No. of Cores          4                                             L2$ Size, Assoc.    1MB, 16
Cache Configuration   L1 (Private), L2 (Shared, QuARK-enabled)      Cache Block Size    64B

111 Table 3.11: Accuracy-energy transducer map for 1MB QuARK Cache-enabled STT-MRAM cache. Energy consumptions are reported for a 64-byte cache line.

Accuracy Level   Read Current   Read Error Rate   Read Energy   Write Current   Write Error Rate   Write Energy
L0 (Baseline)    49µA           protected         0.146nJ       653µA           protected          10.755nJ
L1               30µA           7 × 10^-6         0.080nJ       529µA           5 × 10^-5          7.059nJ
L2               26µA           2 × 10^-5         0.071nJ       503µA           1 × 10^-4          6.386nJ
L3               24µA           9 × 10^-5         0.066nJ       461µA           9 × 10^-4          5.378nJ

Table 3.12: List of approximate applications for QuARK Cache experiments.

Benchmark          Domain               Quality Metric
Corner Detection   Computer Vision      Mean Pixel Difference
Edge Detection     Computer Vision      Mean Pixel Difference
Image Smoothing    Computer Vision      PSNR
Blackscholes       Financial Analysis   Average Relative Error
Image Scale        Multimedia           PSNR
Sobel Filter       Computer Vision      Mean Pixel Difference

energy consumption. The details of error rates and energy consumptions of the modeled cache can be seen in Table 3.11.

3.5.4.2 Benchmarks

Table 3.12 lists the RMS applications we use in our evaluations and their quality metrics.

We annotated each application by inserting ADD_APPROX and REMOVE_APPROX in the source code for non-critical data objects.

Figure 3.16 shows the distribution of overall and approximate read/write operations in the L2 cache for the benchmarks listed in Table 3.12 when they run on a single-core configuration. As can be seen, on average, 80% of the accesses to the L2 cache in these benchmarks can be done in approximate mode, showing the significant energy saving potential in such applications. Furthermore, about 45% of these accesses are read requests and 35% are write requests.

To evaluate the efficiency of QuARK Cache, we consider the effects of multi-programming

112 Table 3.13: List of workload mixes for QuARK Cache experiments.

Workload Mixes   Benchmarks                                                  RLs
Comb1            (Corner Detection, Sobel, Image Smoothing, Blackscholes)    (RL1, RL3, RL2, RL3)
Comb2            (Scale, Blackscholes, Sobel, Image Smoothing)               (RL2, RL2, RL1, RL3)
Comb3            (Sobel, Corner Detection, Scale, Edge Detection)            (RL2, RL3, RL2, RL1)
Comb4            (Blackscholes, Scale, Edge Detection, Image Smoothing)      (RL3, RL1, RL2, RL2)

[Figure 3.16 is a bar chart showing, for Corner Detection, Edge Detection, Image Smoothing, Blackscholes, Scale, Sobel, and their average, the read ratio, write ratio, approximate reads over total accesses, and approximate writes over total accesses.]

Figure 3.16: Distribution of overall and approximation read and write accesses in L2 cache for the selected benchmarks.

in a shared L2 cache equipped with QuARK Cache. To this end, we run the different mixes of benchmarks listed in Table 3.13 in two modes: fixed-reliability mode and mixed-reliability mode. In fixed-reliability mode, we run each workload mix at four reliability levels (RLs): RL0 (fully-protected), RL1, RL2, and RL3 (least-reliable). In mixed-reliability mode, we consider a different reliability level for each benchmark in each combination. Our empirically-chosen criterion for setting the RL of each application is based on user-level QoS desirability. For instance, as shown in the motivational example, RL3 provides an acceptable QoS for Image Smoothing, while for Edge Detection RL2 is more desirable. Note that these RLs can be chosen by the user/designer according to any other policy. The workload mixes along with their reliability levels are listed in Table 3.13.

3.5.4.3 Experimental Results

Figure 3.17 shows a summary of our experiments. We evaluate QuARK Cache in a multicore platform from these perspectives: 1) flexibility in running different applications with different

113 reliability levels, 2) energy savings, and 3) delivered quality.

Figure 3.17(a) depicts the average reliability-level distribution of accesses to the shared L2 cache equipped with QuARK Cache. To evaluate the flexibility of QuARK Cache in serving L2 cache accesses at these various required reliability levels, compared to approaches that use coarse-grained actuation knobs, we conduct an experiment in which we divide the execution time of the workloads into 10ms epochs. Unlike QuARK Cache, which uses fine-grained actuation knobs and assigns the required reliability level to each access on demand, the coarse-grained approaches must decide the reliability level of the cache ways (memory banks) either statically or, in the best case, dynamically at the start of each epoch.

Thus, in each epoch, the coarse-grained approaches must set the reliability level of each cache way based on the most vulnerable data that will be served by that way. Alternatively, both static and dynamic coarse-grained approaches could modify the cache replacement policy and keep vulnerable data out of susceptible ways, at the cost of a significant performance penalty. According to our experimental results, for all workloads in Figure 3.17(a), all 16 ways of the simulated shared L2 cache contain at least one vulnerable block (accessed in RL0 mode) in every epoch. Thus, in all epochs, coarse-grained approaches must pay extra energy overhead to maintain the integrity of vulnerable blocks in the cache ways (even for the blocks that contain imprecise data), whereas the fine-grained actuation knobs in QuARK Cache not only preserve the integrity of vulnerable data, but also provide the opportunity for energy saving in the blocks that contain imprecise data.

Considering the energy consumption reported in Table 3.11, we find that with the observed distribution of approximate read and write operations in the selected benchmarks, and the significant difference between L2 cache read and write energy consumption, the write energy saving offered by QuARK Cache dominates the gain achieved by approximate read operations. Thus, to keep a reasonable trade-off between output quality and energy consumption, we decided to use only the write operation knob in the experiments that follow.

It should be noted that the QuARK Cache read operation knob would still be useful for read-intensive approximate benchmarks or for future STT-MRAM technologies that reduce the gap between read and write energy consumption.

From Figure 3.17(b), we see that QuARK Cache delivers up to 40% energy saving for Comb3 at level 3 of the fixed-reliability mode. For the mixed-reliability mode, it can be observed that the reliability level assigned to each benchmark, as well as the combination of benchmarks in each workload, have a major impact on the energy saving of the L2 cache equipped with QuARK Cache. For example, in the mixed-reliability mode, changing the reliability level of Blackscholes from level 3 to 2 and replacing Corner Detection with Scale are the two most important contributors to the lower energy saving of Comb2 compared with Comb1.

To compare the quality of the benchmarks, we used the quality metrics introduced in Table 3.12. Figure 3.17(c) depicts the average relative error for Blackscholes across all experiments. We see that using level 1 or 2 for the reliability of the L2 cache keeps the quality degradation below 5%. However, the high energy saving delivered by level 3 may not be acceptable for some applications due to the high degree of quality degradation (23%). Figure 3.17(d) shows the average PSNR used as the quality metric for Image Smoothing and Scale. We see that Scale is more susceptible to approximate computing: the average PSNR of Scale outputs at level 3 is below the 30dB threshold, which may not be acceptable for some applications (level 1 or 2 can still be selected for them), whereas the average PSNR of Image Smoothing at all reliability levels is above the 30dB threshold. Figure 3.17(e) shows the mean pixel difference used as the quality metric for Corner Detection, Edge Detection, and the Sobel filter. Among these three applications, on average, Corner Detection is the most susceptible to approximate cache storage; however, its output quality is still degraded by less than 10% in RL3 execution mode.

Figure 3.17: QuARK Cache evaluation results: (a) Distribution of accesses in mixed-reliability workloads, (b) Energy savings (normalized to fully-protected STT-MRAM L2 cache), (c) Average relative error for the Blackscholes benchmark, (d) PSNR for Scale and Image Smoothing benchmarks, and (e) Mean pixel difference for Corner Detection, Edge Detection and Sobel benchmarks.

3.6 Write-Skip SPM

3.6.1 Introduction

STT-MRAM offers a high-density, high-speed, non-volatile choice of random access memory. Despite all these features, high write access energy is still a challenge for its widespread use.

We propose the idea of skipping expensive write operations in spintronic memories when the new data is approximately equal to the old data, thereby trading off accuracy in return for energy savings and performance improvements. Skipping some of the write operations is possible for a certain class of applications in the approximate computing domain.

Read-before-write schemes have previously been proposed to remove redundant write operations in phase change memories (PCM) [158] and STT-MRAM memories [30], to enhance bit-cell endurance and reduce energy consumption, respectively. In this work we extend this idea by relaxing the requirement of strict equality, instead using approximate equality to decide when an unnecessary write operation can be avoided.

116 3.6.2 The Write-Skip Approach

3.6.2.1 Read-Before-Write

Write-Skip reuses the existing read circuitry to read the current data of the memory at a specific address before writing new data. In general, the read time of STT-MRAM is comparable to that of SRAM and is significantly shorter than the write time of STT-MRAM [159]. Consequently, there is only a small timing penalty when a read operation is performed before a write. The read access energy is also about two orders of magnitude smaller than the write energy [159].

3.6.2.2 Approximate Equality

Approximate equality can be defined for primitive integer and floating-point data types with

different bit widths (e.g. float, double, int, unsigned int, long int, short int, etc).

To check the approximate equality of two n-bit integers, it suffices to exclusive-OR the upper n-k bits of the two values, where k is the number of lower bits dropped when verifying equality. Two numbers are approximately equal if all the XOR gates for the individual bits produce 0. k is a knob controllable by the software for trading accuracy for energy savings. Higher values of k result in potentially more write operations being skipped, hence higher energy savings, but at the cost of quality degradation.

For floating-point data, we use the fact that adjacent floats have adjacent integer representations. That means that by dropping the lower k bits of the binary representations of two n-bit floating-point numbers and XORing the upper n-k bits, we can verify their approximate equality.
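The check itself is a mask-and-compare; the sketch below shows one way it could look in software for 32-bit integers and floats (reinterpreted as their integer bit patterns), with k as the number of dropped low-order bits. The SPM accessors are hypothetical placeholders.

/* Illustrative sketch of the Write-Skip approximate-equality check. */
#include <stdint.h>
#include <string.h>

uint32_t spm_read(uint32_t addr);                /* hypothetical SPM accessors */
void     spm_write(uint32_t addr, uint32_t val);

/* Two n-bit values are approximately equal if their upper n-k bits match, i.e. the XOR of
 * the two values has no bit set at position k or above. */
int approx_equal_u32(uint32_t old_val, uint32_t new_val, unsigned k)
{
    uint32_t mask = (k >= 32) ? 0 : (0xFFFFFFFFu << k);   /* keep only the upper 32-k bits */
    return ((old_val ^ new_val) & mask) == 0;
}

/* Floats are compared through their integer bit patterns, exploiting the fact that adjacent
 * floats have adjacent integer representations. */
int approx_equal_f32(float old_val, float new_val, unsigned k)
{
    uint32_t a, b;
    memcpy(&a, &old_val, sizeof a);
    memcpy(&b, &new_val, sizeof b);
    return approx_equal_u32(a, b, k);
}

/* Write-Skip: read the old value first (read-before-write) and skip the expensive write
 * if it is approximately equal to the new value. Returns 1 if the write was skipped. */
int write_skip_store_u32(uint32_t addr, uint32_t new_val, unsigned k)
{
    uint32_t old_val = spm_read(addr);
    if (approx_equal_u32(old_val, new_val, k))
        return 1;
    spm_write(addr, new_val);
    return 0;
}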

117 3.6.3 Evaluations

We evaluate Write-Skip by integrating it into the software-assisted memory (SAM) hierarchy introduced in Chapter 2. The SAM hierarchy uses a distributed on-chip memory composed of both cache and software-programmable memory (SPM). In this work we augment the SPMs with Write-Skip and map all the approximate data to the SPM. We extend the APIs provided by the SAM hierarchy to pass two kinds of information from software down to the on-chip memory hierarchy: (1) the data type, and (2) the number of bits discarded when checking the approximate equality of two consecutive write operations.

Table 3.14 lists the approximate applications we use in our evaluations and their quality

metrics. We annotated each application by inserting add_approx and remove_approx in the source code for non-critical data objects.

Table 3.14: List of approximate applications for Write-Skip experiments

Benchmark         Domain             Quality Metric
Image-Smoothing   Multimedia         PSNR
Sobel Filter      Computer Vision    Mean Pixel Difference
K-means           Machine Learning   Average Distance from Cluster Centers

3.6.3.1 Approximate Value Locality

Figures 3.18, 3.19 and 3.20 show the percentage of write operations, for image-smoothing, sobel, and k-means respectively, that are approximately equal to the previous write to the same memory location. The ratios are shown with respect to both total write operations and just the write operations to the SPM that holds the approximate data. Both image-smoothing and sobel have some exactly-equal consecutive writes – exploited by the approach in [30] – and dropping a number of bits during the equality check increases this ratio even more. K-means' write accesses have less value locality: at least 14 bits of the float data objects in k-means must be dropped during the equality check in order to see a significant percentage of approximately-equal

118 writes.


Figure 3.18: Percentage of approximately-equal writes in image-smoothing.


Figure 3.19: Percentage of approximately-equal writes in sobel.


Figure 3.20: Percentage of approximately-equal writes in k-means.

3.6.3.2 Output Fidelities

Figures 3.21, 3.22 and 3.23 show the quality of the generated outputs compared to the golden output produced when none of the write operations are skipped. The average PSNR of the output generated by image-smoothing in all cases where 1 to 6 bits are dropped during the equality check is higher than the 30dB acceptable threshold. The worst-case mean pixel difference between the approximate output and the golden output for sobel is less than 0.015. For k-means, the output generated when discarding the lower 16 bits during the equality check only increases the average Euclidean distance of all data points from the cluster centers by about 6%. These preliminary results exhibit the opportunity to skip up to 34% of write operations for certain applications and still produce outputs with acceptable quality.


Figure 3.21: Output fidelity for image-smoothing.


Figure 3.22: Output fidelity for sobel.

120 1.1

1.05

1 Normalized Avg.

Euclidean Distance 0.95 from Cluster Centers. 0 2 4 6 8 10 12 14 16 #bits dropped for equality check

Figure 3.23: Output fidelity for k-means.

3.6.3.3 Energy Consumption in On-chip Memory

Figures 3.24, 3.25 and 3.26 show the energy consumption in both the SPM and cache of the SAM hierarchy, for image-smoothing, sobel and k-means respectively, normalized to the baseline where none of the write operations are skipped. The Write-Skip approach is applied only to the SPM, and the energy consumed in the cache for read and write operations is not affected by our approach; however, the energy comparisons are done for the total energy consumed in both SPM and cache. Write-Skip performs a read before every write and therefore imposes some energy overhead. However, since read operations are much cheaper than write operations in terms of energy consumption, the overhead is offset by the savings in most cases. In our experiments, only for k-means, and only when the number of bits discarded during the equality check is between 1 and 10, is the overhead not counterbalanced; even then, the worst-case overhead is less than 0.8%. Our preliminary experiments demonstrate that Write-Skip enables a useful trade-off in spintronic memories: the energy cost of write accesses can be reduced by up to 30% when the application can tolerate approximate storage of its data.


Figure 3.24: Energy consumption in on-chip memory for image-smoothing normalized to the baseline where none of the write operations are skipped.


Figure 3.25: Energy consumption in on-chip memory for sobel normalized to the baseline where none of the write operations are skipped.


Figure 3.26: Energy consumption in on-chip memory for k-means normalized to the baseline where none of the write operations are skipped.


Figure 3.27: Open-loop knob settings (prior works) vs. closed-loop quality control (this work).

3.7 Controlled Memory Approximation

3.7.1 Introduction

Previous work on approximate memories has focused mainly on open-loop, profile-guided approaches that set approximation knobs statically [132, 92, 109]: based on some design-time analysis, the system designer or the programmer sets the reliability knob(s) to appropriate values (Figure 3.27(a)).

Our focus here is the quality control problem for systems with a quality-configurable memory running approximate programs, a problem that most of the literature on approximate memories has not addressed. Instead of putting the burden of setting the approximation knob(s) on the system designer or the programmer, we aim to control quality automatically by means of a feedback controller in the middleware layer that finds the optimal knob values at runtime (Figure 3.27(b)), depending on the quality requested for the application. With our approach, the programmer, instead of setting the knobs directly, only needs to declare the desired quality. This not only eases the programmer's task but also makes the approximation portable across different platforms and memory technologies.

Profile-guided open-loop approaches also suffer from two major disadvantages: (1) They make approximation decisions based on average or worst-case input behavior. These techniques rely on training with inputs that are representative of real-world inputs, which may be difficult to achieve in practice; Laurenzano et al. [84] have shown that the accuracy of approximate programs depends heavily on program input. (2) It is difficult to model an under-designed memory in order to measure output accuracy at different settings. Temporal faults that are variability-induced, temperature-induced, etc., cannot be modeled easily. Unlike software approximation strategies, which are easier to evaluate, hardware approximation requires more rigorous runtime tuning; a perfect fixed mapping from knob setting(s) to quality metric is infeasible to obtain through profiling.

In this work, we target streaming applications, in which the application is given a sequence of inputs and the results from processing previous inputs can be used to adjust the knobs for successive inputs, thanks to the correlation between inputs as well as the temporal behavior of the memory errors. Formal control theory lends itself well to this kind of application, where the history and trend of previous inputs can be used to make predictions for future inputs; in other words, where the system dynamics can be formulated using difference or differential equations.

The closest work to ours is the Green compiler [16]. Green targets tuning of software approximation and constructs a feedback mechanism for calibrating the knob settings; however, it does not propose a systematic strategy for recalibrating them. The authors of [139] propose a proactive control strategy for non-streaming programs that uses machine learning to learn complex cost and error models offline and utilizes these models at runtime to determine knob settings. These models are neither lightweight nor easy to build, and they do not consider the temporal similarity of the inputs in streaming applications.

We believe that, despite the randomness of the errors that memory approximation introduces into the execution of the program, the system's deterministic dynamics can be captured in such a way that a formal control-theoretic technique can effectively control the quality of the program even in the presence of stochastic (non-deterministic) behavior.

Our solution to this problem employs a controller to construct a closed-loop feedback control mechanism. This controller monitors the error (i.e., the difference between the target output quality and the measured output quality) and applies a correction accordingly.

Our contributions are as follows:

• We propose a strategy to control the quality of programs with approximate memory and remove the burden of manually setting the approximation knobs from the programmer.

• We show that a black-box system identification technique is able to capture the deterministic dynamics of the system and build a model from it.

• We present a prototype of our control strategy for an edge-detection program.

3.7.2 Problem Modeling

3.7.2.1 Application Class

We define a streaming application as one that operates over a long sequence of input data items. The data items are fed to the application from some external source, and each data item is processed for a limited time before being discarded. Most streaming applications are built around signal-processing functions applied to the input data set, with little if any control flow between them. Examples include communication protocols, audio and video processing, cryptographic kernels, and network processing.

We target streaming applications where a stable computation pattern is applied to consecutive inputs. Usually these applications are either stateless, meaning that the inputs are processed completely independently and have no effect on each other, or the inputs have only a limited effect on each other. In the context of approximate computing, this means that the quality degradation of the kth input either depends purely on the hardware errors that occurred while processing that input or is only slightly affected by the quality degradation of the previous inputs (i.e., the 1st through the (k-1)th).

3.7.2.2 Monitoring Quality at Runtime

In order to construct a closed-loop feedback mechanism and adjust the memory approximation knob(s) at runtime, the system needs to occasionally sample the current quality of the output under the current knob setting(s).

The quality measurement method is application dependent, and normally the programmer provides a software routine that measures quality at runtime. In many cases, this measurement requires computing both the precise and the approximate versions of the output for comparison. Previous works have used the same method for monitoring quality [16, 14].

3.7.2.3 Memory Approximation Knob(s)

Depending on the approximation strategy used and the memory technology, there are always one or more knobs that tune the degree of approximate storage. For example, in SRAM memories, supply voltage is the main knob; refresh rate is the error-energy knob for DRAMs; and for non-volatile memories (NVMs) such as STT-MRAM, read and write current amplitudes are two key knobs. If the approximation strategy is to skip some of the memory accesses – often stores – or to predict the value of load operations, the percentage of such operations can serve as the knob whose tuning adjusts the final quality of the output.

The implicit assumption is that a more aggressive knob setting makes the memory less reliable but at the same time lowers the energy consumption, while a more conservative knob setting enhances the reliability of memory operations at the cost of higher energy consumption. The ultimate goal is to choose the most aggressive knob setting that still satisfies the quality goal, so as to gain the highest possible energy savings.

3.7.3 Quality Control with Feedback Control Theory

In control theory, a controlled system is represented as a feedback control loop as in Fig. 3.28(a). At epoch k, the controller reads the measured output y(k) of a system in state x(k), compares it against the target value y_ref, and, based on the difference (the error e(k)), generates the control input u(k) to actuate the system and reduce the error.

In our case, the system is composed of the quality-configurable memory together with the software running on top of it. The control input to the system is the value of a knob, and the measured output of the system is the quality of the generated output.

The closed-loop approach for tuning knobs is shown in Fig. 3.28(b). The application runs on a processor with a quality-configurable memory. At a pre-determined frequency, the quality monitor routine measures the current quality of the output, which is compared against the quality goal. A positive difference means there is still room to relax the reliability requirements of the memory, so the controller sets a more aggressive knob setting; a negative difference means that the quality has degraded more than intended, so the controller sets a more conservative knob setting.
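
A minimal sketch of one iteration of this loop is shown below; the hooks measure_quality(), set_knob(), and pi_update() are placeholder names of ours (standing in for the programmer-provided quality monitor, the memory substrate's knob interface, and the controller of Section 3.7.5.1), not APIs from our implementation.

// One control epoch of the closed loop in Figure 3.28(b) (illustrative only).
double measure_quality();          // programmer-provided quality monitor routine
void   set_knob(double setting);   // actuates the memory reliability knob
double pi_update(double error);    // controller update, e.g., the PI law of Eq. (3.3)

void control_epoch(double quality_goal) {
    const double quality = measure_quality();        // sample current output quality
    const double error   = quality_goal - quality;   // > 0: room to relax reliability further
    set_knob(pi_update(error));                      // apply the corrected knob setting
}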

To formally control such a system, we need a model of the system dynamics that describes the relation between the control input(s) (i.e., the knob(s)) and the measured output (i.e., the quality of the program's output) as a function of the time epoch.

Figure 3.28: Closed-loop approach (this work) for tuning memory approximation knob(s): (a) a generic closed-loop control system; (b) the closed-loop quality control structure, with the quality monitor, controller, and memory substrate around the application.

A common practice for extracting the dynamic model of a complex system is System Identification Theory [93, 94], where we experimentally collect input-output data and use black-box identification methods to isolate the deterministic and stochastic components of the system. In this way, we can build a model entirely from real data for arbitrarily complex systems. In our design, we follow this approach and show that quality configuration (i.e., tracking) in memory-oriented approximate computing can be modeled as a formal quality control problem, which can then be addressed using classical off-the-shelf controllers.

3.7.4 Case Study: Video Edge Detection

3.7.4.1 Application Description and Error Metric

We use Canny video edge detection [32] as our case study. Edge detection is the process of identifying sharp changes in image brightness, which is important because it reveals physical changes in the imaged objects. For video processing, edge detection is often conducted independently on a frame-by-frame basis. The temporal similarity between adjacent frames of a scene allows the controller to adjust the quality based on the system's history from previous frames.

We use miss-classification error as our quality metric, that is, the ratio of the number of pixels mistakenly classified as edge/non-edge to the total number of pixels in the frame.
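
Computed directly, the metric amounts to the following sketch (our own illustration; the edge maps are assumed to be boolean labels of equal size):

#include <vector>

// Miss-classification error: fraction of pixels whose edge/non-edge label
// differs between the approximate output and the golden (error-free) output.
double misclassification_error(const std::vector<bool>& golden,
                               const std::vector<bool>& approx) {
    size_t mismatched = 0;
    for (size_t i = 0; i < golden.size(); ++i)
        mismatched += (golden[i] != approx[i]);
    return double(mismatched) / double(golden.size());
}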

3.7.4.2 Memory Approximation Knob

For simplicity, we assume that all memory accesses hit the memory substrate whose error rate we actuate. This could translate into adjusting knobs for an L1 data cache whose hit rate for approximate data is very close to 100%. We choose write reliability as the tuning knob, actuated through the write current amplitude, which previous work has shown to be an appropriate approximation knob for spin-transfer torque memories [109, 122, 130]; there is a direct relationship between the amount of current applied for a write operation and the write error probability.

3.7.5 System Identification

In order to design a controller for the system, we first need to model the system. The first step is to generate test waveforms from training applications for system identification. A test waveform contains a series of samples of the controller inputs and outputs for a training application, and it should exercise as many input permutations as possible. The system dynamics are often exercised by applying a staircase waveform to the control input (e.g., the write bit error rate); such a staircase stimulates the system's behavior in response to various levels of the control input. In our work, we change the write bit error rate from 10E-7 to 10E-3 in steps of 10E-7.
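
The excitation itself is simple to generate; the sketch below (the function name, parameters, and the per-frame holding scheme are our own assumptions) produces one bit-error-rate sample per processed frame, stepping from the lowest to the highest level:

#include <vector>

// Staircase excitation for system identification: hold each write bit error
// rate level for `frames_per_step` frames, then step it up by `step`.
std::vector<double> staircase_ber(double lo, double hi, double step, unsigned frames_per_step) {
    std::vector<double> waveform;
    for (double ber = lo; ber <= hi; ber += step)
        for (unsigned f = 0; f < frames_per_step; ++f)
            waveform.push_back(ber);
    return waveform;
}

// Example: staircase_ber(1e-7, 1e-3, 1e-7, 1) yields one BER level per frame.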

In this method, training sets use varying frequencies (e.g., a set of out-of-phase staircase signals for the control input) in order to isolate the deterministic and stochastic aspects of the system. The resulting model is then evaluated on its ability to predict the expected data from the identified system; abnormal behavior from the model flags that a controller designed from it might be inaccurate.

We use MATLAB's System Identification Toolbox for this process [99]. Figure 3.29 shows the result of system identification for Canny when a series of 2400 frames is used as input; the predicted model closely fits the measured data.

Figure 3.29: Predicted Model vs. Measured Output.

3.7.5.1 Controller

According to the assumptions made in Subsections 3.7.4.1 and 3.7.4.2, our system is a simple single-input-single-output (SISO) control system, with the write bit error rate as the control input and the edge-detection miss-classification rate as the measured output. (While this case study uses a SISO controller, a multiple-input-multiple-output (MIMO) controller is also possible for systems that expose more than one output quality metric, memories that provide more than one knob per memory component (e.g., STT-MRAM read and write current amplitudes), or cases where the controller tunes multiple individual memory components in the hierarchy.) We use a proportional-integral (PI) controller to control this system. The proportional term refers to the fact that the controller output is proportional to the amplitude of the error signal, while the integral term indicates that the controller output is proportional to the integral of all past errors [67]. The PI control law has the form:

u(k) = u(k - 1) + (K_P + K_I) e(k) - K_P e(k - 1)    (3.3)

where K_P and K_I denote the coefficients of the proportional and integral terms, respectively. Controller design is a mature field with many tools that provide off-the-shelf controllers; we use the MATLAB PID Tuner to design and deploy our controllers.

It is important to note that although a derivative control law helps add predictability to the controller, stochastic variations in the system output may make the controller inaccurate. This issue is more severe in computer systems, which commonly have a significant stochastic component; therefore, for computer systems, PI controllers are preferred over proportional-integral-derivative (PID) controllers [67]. PI control benefits from both integral control (zero steady-state error) and proportional control (fast transient response). In most computer systems, a first-order PI controller provides rapid response and is sufficiently accurate [67].
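
A direct transcription of Eq. (3.3) into code is straightforward (a sketch only; the coefficient values would come from a tuning tool such as the MATLAB PID Tuner):

// Incremental PI law of Eq. (3.3): u(k) = u(k-1) + (K_P + K_I) e(k) - K_P e(k-1).
struct PIController {
    double Kp = 0.0;       // proportional coefficient K_P
    double Ki = 0.0;       // integral coefficient K_I
    double prev_u = 0.0;   // u(k-1), the previous knob setting
    double prev_e = 0.0;   // e(k-1), the previous error

    double update(double e) {                                  // e(k) = target - measured
        const double u = prev_u + (Kp + Ki) * e - Kp * prev_e;
        prev_u = u;
        prev_e = e;
        return u;                                              // u(k), the new knob setting
    }
};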

3.7.5.2 Fault Injection Mechanism

To simulate the behavior of a system with approximate memory, we developed a PIN-based [96] memory fault injector. This fault injector is able to inject memory faults into read and/or write operations according to a given error rate. In order to inject faults only into the non-critical data objects of the program, the source code is annotated with add_approx() and remove_approx() methods that declare the addresses of those data objects. These methods are called at appropriate places in the program and are intercepted by the fault injector, which records the addresses in a table. During execution, it instruments all memory accesses; if the virtual address of an access falls into any of the recorded address ranges, it attempts to inject a fault into the part of the data referenced by that access.
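
The sketch below illustrates this bookkeeping (the signatures and the per-bit flipping granularity are our own assumptions, not the actual tool's interface):

#include <cstdint>
#include <cstdlib>
#include <vector>

// Address ranges declared approximate by the instrumented program.
struct ApproxRange { uintptr_t begin, end; };
static std::vector<ApproxRange> approx_table;

void add_approx(void* addr, size_t bytes) {
    approx_table.push_back({reinterpret_cast<uintptr_t>(addr),
                            reinterpret_cast<uintptr_t>(addr) + bytes});
}

void remove_approx(void* addr) {
    const uintptr_t a = reinterpret_cast<uintptr_t>(addr);
    for (size_t i = 0; i < approx_table.size(); ++i)
        if (approx_table[i].begin == a) { approx_table.erase(approx_table.begin() + i); return; }
}

// Called on every instrumented memory access: if the access falls inside a
// declared range, flip each bit of the referenced data with probability `ber`.
void maybe_inject_fault(void* addr, size_t bytes, double ber) {
    const uintptr_t a = reinterpret_cast<uintptr_t>(addr);
    for (const ApproxRange& r : approx_table) {
        if (a >= r.begin && a + bytes <= r.end) {
            uint8_t* data = static_cast<uint8_t*>(addr);
            for (size_t bit = 0; bit < 8 * bytes; ++bit)
                if (static_cast<double>(std::rand()) / RAND_MAX < ber)
                    data[bit / 8] ^= static_cast<uint8_t>(1u << (bit % 8));
            return;
        }
    }
}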

3.7.5.3 Input Dependency

Figure 3.30 shows how the quality of edge detection varies across a video composed of multiple scenes when the write bit error rate is held at a constant value, as an open-loop approach would do. This variation confirms that the output quality of a program running on a platform with approximate memory depends on the input.

3.7.5.4 QoS Tracking

Figure 3.31 shows a sample of how the feedback loop works in practice for different video sequences. The red curve shows the quality goal and the blue curve shows the achieved quality when the PI controller manages the quality by adjusting the knob.

3.7.5.5 Comparison

Figure 3.32 compares the performance of a PI controller in tracking the target quality with that of a manual calibration scheme. The manual scheme measures the difference between the desired quality and the current quality: if the difference is within ±10%, it does not change the knob; otherwise, it moves the knob in one direction in fixed fine-grained steps until the quality returns to the acceptable region.
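
For reference, the manual scheme amounts to something like the following sketch (our own illustration; here measured and goal are the miss-classification error, and the knob is the write bit error rate, so decreasing it is the conservative direction):

// Manual step-wise recalibration: leave the knob alone while the measured
// error stays within ±10% of the goal; otherwise nudge it by one fixed step.
double manual_recalibrate(double knob, double measured, double goal, double step) {
    const double relative_diff = (measured - goal) / goal;
    if (relative_diff > 0.10)  return knob - step;   // too much error: be more conservative
    if (relative_diff < -0.10) return knob + step;   // error well below goal: be more aggressive
    return knob;                                     // within the acceptable band
}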

Figure 3.30: Variation of the quality of the edge detection in various video scenes when the bit error rate is constant. (a) BER = 1E-3; (b) BER = 1E-4; (c) BER = 1E-5; x-axis: frame number.


Figure 3.31: Quality tracking results. The red curve shows the acceptable error and the blue curve shows the error achieved by the controller. (a) news, t = 1; (b) news, t = 5; (c) soccer, t = 1; (d) soccer, t = 5; (e) marymax, t = 1; (f) marymax, t = 5; x-axis: frame number.

Figure 3.32: Comparing the PI controller with manual step-wise re-calibration similar to [16]. (a) PI controller; (b) manual calibration; x-axis: frame number.


Figure 3.33: Canny edge detection applied to different images with various write bit error rates resulting in different quality metrics.

Chapter 4

Conclusions and Future Directions

Technology limitations and application demands have driven the emergence of manycore embedded systems. The memory hierarchy of manycore architectures has a major impact on their overall performance, energy efficiency and reliability. We identified three major issues with traditional memory hierarchies that make them unappealing for manycore architectures and their data-intensive workloads: (1) they are power hungry and a poor fit for manycores in the face of dark silicon, (2) they do not adapt to the workload's requirements and memory behavior, and (3) they are not scalable due to coherence overheads.

This dissertation showed that many of these challenges can be addressed through different types of software assists. Application semantics and behavior captured in software can be exploited to manage the memory hierarchy more efficiently.

4.1 Technical Contributions

The novel contributions of this thesis are:

• SAM: Hybrid Memory Hierarchy for MES with Software-Managed Coherence

• ShaVe-ICE: Software & Architectural Support for Sharing Distributed SPMs in MES

• Relaxed Cache: Relaxed Guardbanding for Process-Variation-Affected SRAM Caches

• QuARK Cache: Fine-grained Tuning of Reliability-Energy Knobs in Shared STT-MRAM Caches

• Write-Skip SPM: Skipping Write Operations in STT-MRAM SPMs

• CTRL-MEM-APRX: Controlled Approximation in Quality-Configurable Memories

4.2 Future Directions

There are many future possibilities to extend the software-assisted memory hierarchy proposed in this dissertation:

• The ShaVe-ICE runtime memory manager, along with its various allocation policies, needs to be integrated into a real operating system (e.g., Linux) as a kernel module to explore the overheads.

• The SPM allocation policies introduced in Section 2.5 show promising advantages over the baseline strategy, which limits allocation to the SPM local to each core. However, we need more intelligent policies that take into account other factors, such as the functional/power characteristics of heterogeneous memory resources and the phasic behavior of applications.

• The current version of the ShaVe-ICE allocator works on a first-come-first-served basis. To make the allocation fairer and further improve overall data placement, we need to keep a backlog of unserved allocation requests and judiciously choose which of them, or which new requests, to grant when space becomes available. In doing so, the allocation policies should have a measure of relative importance in order to prioritize the allocation of different data objects when there are spillovers.

• Although the current version of the software-programmable hierarchy introduced in Chapter 2 relies on explicit annotation by the programmer or the compiler, ideally we would like to remove this burden to some extent: monitor the memory behavior of the application at runtime, proactively decide which data to bring on-chip and whether to place them in the cache, the local SPM, or a remote SPM, and also decide how to partition the available memory space between cache and SPM depending on the workload's memory behavior.

• For shared data, we considered simply allocating the single copy on the SPM of the core spawning the threads. However, data placement could be further optimized in future work. Worker threads can allocate their own private data on SPM as well, but the assumption is that all shared data objects are allocated on SPM by the boss thread before worker threads are dispatched. Future work could relax this restriction by allowing cores to exchange SATT mappings.

• There is a growing need for mechanisms that adjust memory approximation knobs automatically and coordinate their effect on the final quality of the output. Our approach in Section 3.7 is an initial attempt towards this end. More work needs to be done to build better system models and to design multiple-input-multiple-output (MIMO) controllers that coordinate different knobs across the layers of the memory hierarchy as well as the system stack.

Bibliography

[1] Coyotebench. https://github.com/Microsoft/test-suite/tree/master/SingleSource/Benchmarks/CoyoteBench. Accessed: 2016-10-31. (Cited on page 51).

[2] The Computer Language Benchmarks Game. http://benchmarksgame.alioth.debian.org/. Accessed: 2016-10-31. (Cited on page 51).

[3] NVIDIA's Next Generation CUDA Compute Architecture: Fermi. Technical report, NVIDIA, 2009. (Cited on page 21).

[4] Adapteva. Epiphany Architecture Reference, 2014. Rev 14.03.11. (Cited on pages2 and4).

[5] C. Alvarez, J. Corbal, and M. Valero. Fuzzy Memoization for Floating-Point Multimedia Applications. IEEE Transactions on Computers (TC), 2005. (Cited on page 63).

[6] L. Alvarez, L. Vilanova, M. Gonzalez, X. Martorell, N. Navarro, and E. Ayguade. Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories. IEEE Transactions on Computers (TC), 2015. (Cited on page 22).

[7] L. Alvarez, L. Vilanova, M. Moreto, M. Casas, M. Gonzàlez, X. Martorell, N. Navarro, E. Ayguadé, and M. Valero. Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2015. (Cited on page 16).

[8] A. Ansari, S. Feng, S. Gupta, and S. Mahlke. Archipelago: A Polymorphic Cache Design for Enabling Robust Near-threshold Operation. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2011. (Cited on page 82).

[9] A. Ansari, S. Gupta, S. Feng, and S. Mahlke. ZerehCache: Armoring Cache Architectures in High Defect Density Technologies. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009. (Cited on page 82).

[10] J. Ansel, Y. L. Wong, C. Chan, M. Olszewski, A. Edelman, and S. Amarasinghe. Language and Compiler Support for Auto-tuning Variable-accuracy Algorithms. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2011. (Cited on page 63).

[11] ARM. Cortex-A5 processor manual. http://www.arm.com. (Cited on page 96).

[12] ARM. Cortex-R4 processor manual. http://www.arm.com. (Cited on page 96).

[13] G. P. Arumugam, P. Srikanthan, J. Augustine, K. Palem, E. Upfal, A. Bhargava, Parishkrati, and S. Yenugula. Novel Inexact Memory Aware Algorithm Co-design for Energy Efficient Computation: Algorithmic Principles. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015. (Cited on pages 65, 66, and 75).

[14] Aurangzeb and R. Eigenmann. Harnessing Parallelism in Multicore Systems to Expedite and Improve Function Approximation. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2016. (Cited on page 126).

[15] O. Avissar, R. Barua, and D. Stewart. Heterogeneous Memory Management for Embedded Systems. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2001. (Cited on page 14).

[16] W. Baek and T. M. Chilimbi. Green: A Framework for Supporting Energy-conscious Programming Using Controlled Approximation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2010. (Cited on pages x, 63, 124, 126, and 136).

[17] K. Bai, J. Lu, A. Shrivastava, and B. Holton. CMSM: An Efficient and Effective Code Management for Software Managed Multicores. In Proceedings of the IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2013. (Cited on page 11).

[18] K. Bai and A. Shrivastava. Heap Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proceedings of the IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2010. (Cited on page 15).

[19] K. Bai and A. Shrivastava. Automatic and Efficient Heap Data Management for Limited Local Memory Multicore Architectures. In Proceedings of the International Conference on Design Automation and Test in Europe (DATE), 2013. (Cited on page 15).

[20] K. Bai, A. Shrivastava, and S. Kudchadker. Stack Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2011. (Cited on pages 11 and 15).

[21] A. BanaiyanMofrad, H. Homayoun, and N. Dutt. FFT-cache: A Flexible Fault-tolerant Cache Architecture for Ultra Low Voltage Operation. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), 2011. (Cited on pages 82 and 88).

[22] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems. In Proceedings of the International Symposium on Hardware/Software Codesign (CODES), 2002. (Cited on page 11).

[23] L. Bathen and N. Dutt. HaVOC: A hybrid Memory-aware Virtualization Layer for On-chip Distributed ScratchPad and Non-Volatile Memories. In Proceedings of the Design Automation Conference (DAC), 2012. (Cited on page 16).

[24] L. A. D. Bathen and N. D. Dutt. SPMCloud: Towards the Single-Chip Embedded ScratchPad Memory-Based Storage Cloud. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2014. (Cited on page 16).

[25] L. A. D. Bathen, N. D. Dutt, A. Nicolau, and P. Gupta. VaMV: Variability-aware Memory Virtualization. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2012. (Cited on page 16).

[26] L. A. D. Bathen, N. D. Dutt, D. Shin, and S.-S. Lim. SPMVisor: Dynamic Scratchpad Memory Virtualization for Secure, Low Power, and High Performance Distributed On-chip Memories. In Proceedings of the IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2011. (Cited on page 16).

[27] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C. C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. TILE64 - Processor: A 64-Core SoC with Mesh Interconnect. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2008. (Cited on pages2 and4).

[28] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008. (Cited on page 97).

[29] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. ACM SIGARCH Computer Architecture News, 2011. (Cited on pages 37, 95, and 111).

[30] R. Bishnoi, F. Oboril, M. Ebrahimi, and M. B. Tahoori. Avoiding unnecessary write operations in STT-MRAM for low power implementation. In Proceedings of the International Symposium on Quality Electronic Design (ISQED), 2014. (Cited on pages 116 and 118).

[31] V. Camus, J. Schlachter, C. Enz, M. Gautschi, and F. K. Gurkaynak. Approximate 32-bit Floating-point Unit Design with 53% Power-area Product Reduction. In Proceedings of the European Solid-State Circuits Conference (ESSCIRC), 2016. (Cited on page 63).

[32] J. Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1986. (Cited on page 129).

[33] M. Carbin, S. Misailovic, and M. C. Rinard. Verifying Quantitative Reliability for Programs That Execute on Unreliable Hardware. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA), 2013. (Cited on pages 80 and 92).

[34] V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti. Noxim: An Open, Extensible and Cycle-accurate Network On-chip Simulator. In Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015. (Cited on page 37).

[35] P. Chakraborty and P. R. Panda. SPM-Sieve: A Framework for Assisting Data Partitioning in Scratch Pad Memory Based Systems. In Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES), 2013. (Cited on page 24).

[36] P. Chakraborty, P. R. Panda, and S. Sen. Partitioning and Data Mapping in Reconfigurable Cache and Scratchpad Memory–Based Architectures. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2016. (Cited on pages 16 and 21).

[37] S. Chen, S. Jiang, B. He, and X. Tang. A Study of Sorting Algorithms on Approximate Memory. In Proceedings of the International Conference on Management of Data (SIGMOD), 2016. (Cited on pages 65, 66, and 74).

[38] V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan. Analysis and Characterization of Inherent Application Resilience for Approximate Computing. In Proceedings of the ACM/EDAC/IEEE Design Automation Conference (DAC), 2013. (Cited on page 62).

[39] D. Cho, S. Pasricha, I. Issenin, N. Dutt, M. Ahn, and Y. Paek. Adaptive Scratch Pad Memory Management for Dynamic Behavior of Multimedia Applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2009. (Cited on page 15).

[40] D. Cho, S. Pasricha, I. Issenin, N. Dutt, Y. Paek, and S. Ko. Compiler Driven Data Layout Optimization for Regular/Irregular Array Access Patterns. In Proceedings of the ACM SIGPLAN-SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2008. (Cited on page 15).

[41] K. Cho, Y. Lee, Y. H. Oh, G. c. Hwang, and J. W. Lee. eDRAM-based Tiered-Reliability Memory with Applications to Low-power Frame Buffers. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2014. (Cited on pages 65, 66, 67, 68, and 77).

[42] M. Cohen, H. S. Zhu, E. E. Senem, and Y. D. Liu. Energy Types. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA), 2012. (Cited on page 92).

[43] J. Cong, H. Huang, C. Liu, and Y. Zou. A Reuse-aware Prefetching Scheme for Scratchpad Memory. In Proceedings of the Design Automation Conference (DAC), 2011. (Cited on page 13).

[44] B. D. de Dinechin, P. G. de Massas, G. Lager, C. Léger, B. Orgogozo, J. Reybert, and T. Strudel. A Distributed Run-Time Environment for the Kalray MPPA-256 Integrated Manycore Processor. Procedia Computer Science, 2013. (Cited on pages 2 and 4).

[45] N. Deng, W. Ji, J. Li, and Q. Zuo. A Semi-automatic Scratchpad Memory Management Framework for CMP. In Proceedings of the International Conference on Advanced Parallel Processing Technologies (APPT), 2011. (Cited on page 15).

[46] T. Devolder, J. Hayakawa, K. Ito, H. Takahashi, S. Ikeda, P. Crozat, N. Zerounian, J.-V. Kim, C. Chappert, and H. Ohno. Single-Shot Time-Resolved Measurements of Nanosecond-Scale Spin-Transfer Induced Switching: Stochastic Versus Deterministic Aspects. Physical Review Letters, 2008. (Cited on page 101).

[47] A. Dominguez, S. Udayakumaran, and R. Barua. Heap Data Allocation to Scratch-pad Memory in Embedded Systems. Journal of Embedded Computing, 2005. (Cited on page 14).

[48] X. Dong et al. NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2012. (Cited on pages 104 and 111).

[49] P. Dubey. Recognition, Mining and Synthesis Moves Computers to the Era of Tera. Intel Technology Journal, 2005. (Cited on page6).

[50] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark Silicon and the End of Multicore Scaling. In Proceedings of the Annual International Symposium on Computer Architecture (ISCA), 2011. (Cited on page3).

[51] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Architecture Support for Disciplined Approximate Programming. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012. (Cited on page 63).

[52] Y. Fang, H. Li, and X. Li. SoftPCM: Enhancing Energy Efficiency and Lifetime of Phase Change Memory in Video Applications via Approximate Write. In Proceedings of the IEEE Asian Test Symposium (ATS), 2012. (Cited on pages 65, 66, 67, 68, and 72).

[53] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy Caches: Simple Techniques for Reducing Leakage Power. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2002. (Cited on page 85).

[54] Forczmański et al. An Algorithm of Face Recognition Under Difficult Lighting Conditions. Electrical Review, 2012. (Cited on page 107).

[55] P. Francesco, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J. M. Mendias. An Integrated Hardware/Software Approach for Run-time Scratchpad Management. In Proceedings of the Design Automation Conference (DAC), 2004. (Cited on page 14).

[56] S. Ganapathy, G. Karakonstantis, A. Teman, and A. Burg. Mitigating the Impact of Faults in Unreliable Memories for Error-resilient Applications. In Proceedings of the IEEE/ACM Design Automation Conference (DAC), 2015. (Cited on pages 65, 66, 67, 68, and 75).

[57] L. Gauthier, T. Ishihara, H. Takase, H. Tomiyama, and H. Takada. Minimizing Inter-task Interferences in Scratch-pad Memory Usage for Reducing the Energy Consumption of Multi-task Systems. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), 2010. (Cited on page 14).

[58] M. Gottscho, A. BanaiyanMofrad, N. Dutt, A. Nicolau, and P. Gupta. Power / Capacity Scaling: Energy Savings With Simple Fault-Tolerant Caches. In Proceedings of the Design Automation Conference (DAC), 2014. (Cited on pages 89 and 96).

[59] Q. Guo, K. Strauss, L. Ceze, and H. S. Malvar. High-Density Image Storage Using Approximate Memory Cells. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016. (Cited on pages 65, 66, 67, 68, and 76).

[60] P. Gupta, Y. Agarwal, L. Dolecek, N. Dutt, R. K. Gupta, R. Kumar, S. Mitra, A. Nicolau, T. S. Rosing, M. B. Srivastava, S. Swanson, and D. Sylvester. Underdesigned and Opportunistic Computing in Presence of Hardware Variability. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2013. (Cited on page 82).

[61] V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy. IMPACT: IMPrecise Adders for Low-power Approximate Computing. In IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2011. (Cited on page 63).

[62] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy. Low-Power Digital Signal Processing Using Approximate Adders. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2013. (Cited on page 63).

[63] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proceedings of the IEEE International Workshop on Workload Characterization (WWC), 2001. (Cited on pages 51, 84, and 97).

[64] S. Hamdioui, A. J. van de Goor, and M. Rodgers. March SS: A Test for All Static Simple RAM Faults. In Proceedings of the IEEE International Workshop on Memory Technology, Design and Testing (MTDT), 2002. (Cited on page 88).

[65] R. Haring, M. Ohmacht, T. Fox, M. Gschwind, D. Satterfield, K. Sugavanam, P. Coteus, P. Heidelberger, M. Blumrich, R. Wisniewski, A. Gara, G. Chiu, P. Boyle, N. Christ, and C. Kim. The IBM Blue Gene/Q Compute Chip. IEEE Micro, 2012. (Cited on page 2).

[66] S. Hashemi, R. I. Bahar, and S. Reda. DRUM: A Dynamic Range Unbiased Multiplier for Approximate Applications. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2015. (Cited on page 63).

[67] J. L. Hellerstein, Y. Diao, S. Parekh, and D. M. Tilbury. Feedback Control of Computing Systems. John Wiley & Sons, 2004. (Cited on page 132).

[68] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard. Dynamic Knobs for Responsive Power-aware Computing. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011. (Cited on page 63).

[69] IBM. The Cell Project, 2005. (Cited on page4).

[70] F. Ishihara, F. Sheikh, and B. Nikolić. Level Conversion for Dual-supply Systems. IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 2004. (Cited on page 109).

[71] ITRS. http://www.itrs.net, 2015. (Cited on pages xi,3, and 82).

[72] A. Jain, P. Hill, S. C. Lin, M. Khan, M. E. Haque, M. A. Laurenzano, S. Mahlke, L. Tang, and J. Mars. Concise Loads and Stores: The Case for an Asymmetric Compute-memory Architecture for Approximation. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016. (Cited on pages 65, 66, 67, 68, and 76).

[73] D. Jevdjic, K. Strauss, L. Ceze, and H. S. Malvar. Approximate Storage of Compressed and Encrypted Videos. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017. (Cited on pages 65, 66, 67, 68, and 76).

[74] A. B. Kahng and S. Kang. Accuracy-configurable Adder for Approximate Arithmetic Designs. In Proceedings of the Annual Design Automation Conference (DAC), 2012. (Cited on page 63).

[75] Kalray. MPPA2-256 (Bostan), 2015. (Cited on page2).

[76] M. T. Kandemir, J. Ramanujam, M. J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic Management of Scratch-Pad Memory Space. In Proceedings of the Design Automation Conference (DAC), 2001. (Cited on page 14).

[77] W. Kang, W. Zhao, J. O. Klein, Y. Zhang, C. Chappert, and D. Ravelosona. High Reliability Sensing Circuit for Deep Submicron Spin Transfer Torque Magnetic Random Access Memory. Electronics Letters, 2013. (Cited on page 102).

[78] A. Kannan, A. Shrivastava, A. Pabalkar, and J.-e. Lee. A Software Solution for Dynamic Stack Management on Scratch Pad Memory. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), 2009. (Cited on pages 11 and 15).

[79] D. S. Khudia, B. Zamirai, M. Samadi, and S. Mahlke. Rumba: An Online Quality Management System for Approximate Computing. In Proceedings of the International Symposium on Computer Architecture (ISCA), ISCA, 2015. (Cited on page 63).

[80] O. Kislal, M. T. Kandemir, and J. Kotra. Cache-Aware Approximate Computing for Decision Tree Learning. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2016. (Cited on pages 65, 66, 67, 68, and 75).

[81] R. Komuravelli, M. D. Sinclair, J. Alsop, M. Huzaifa, M. Kotsifakou, P. Srivastava, S. V. Adve, and V. S. Adve. Stash: Have Your Scratchpad and Cache It Too. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2015. (Cited on page 16).

[82] P. Kulkarni, P. Gupta, and M. Ercegovac. Trading Accuracy for Power with an Underdesigned Multiplier Architecture. In Proceedings of the International Conference on VLSI Design (VLSID), 2011. (Cited on page 63).

[83] A. Lachenmann, P. J. Marr´on,D. Minder, and K. Rothermel. Meeting Lifetime Goals with Energy Levels. In Proceedings of the International Conference on Embedded Networked Sensor Systems (SenSys), 2007. (Cited on page 63).

[84] M. A. Laurenzano, P. Hill, M. Samadi, S. Mahlke, J. Mars, and L. Tang. Input Responsiveness: Using Canary Inputs to Dynamically Steer Approximation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2016. (Cited on page 124).

[85] D. Lee, S. K. Gupta, and K. Roy. High-performance Low-energy STT MRAM Based on Balanced Write Scheme. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2012. (Cited on page 109).

[86] K. Lee, A. Shrivastava, I. Issenin, N. Dutt, and N. Venkatasubramanian. Mitigating Soft Error Failures for Multimedia Applications by Selective Data Protection. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2006. (Cited on pages 65, 66, 67, 68, and 70).

[87] L. Leem, H. Cho, J. Bau, Q. A. Jacobson, and S. Mitra. ERSA: Error Resilient System Architecture for Probabilistic Applications. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2010. (Cited on page 63).

[88] H. Li, X. Wang, Z. L. Ong, W. F. Wong, Y. Zhang, P. Wang, and Y. Chen. Performance, Power, and Reliability Tradeoffs of STT-RAM Cell Subject to Architecture-Level Requirement. IEEE Transactions on Magnetics, 2011. (Cited on page 104).

[89] L. Li, L. Gao, and J. Xue. Memory Coloring: A Compiler Approach for Scratchpad Memory Management. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2005. (Cited on page 14).

[90] C. H. Lin and I. C. Lin. High Accuracy Approximate Multiplier with Error Correction. In Proceedings of the IEEE International Conference on Computer Design (ICCD), 2013. (Cited on page 63).

[91] C. Liu, J. Han, and F. Lombardi. A Low-power, High-performance Approximate Multiplier with Configurable Partial Error Recovery. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE), 2014. (Cited on page 63).

[92] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn. Flikker: Saving DRAM Refresh-power Through Critical Data Partitioning. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011. (Cited on pages 63, 65, 66, 67, 68, 70, 71, 80, 81, 83, and 123).

[93] L. Ljung. System Identification: Theory for the User. Prentice-Hall, Inc., 1999. (Cited on page 128).

[94] L. Ljung. Black-box Models from Input-output Measurements. In Proceedings of the IEEE Instrumentation and Measurement Technology Conference (IMTC), 2001. (Cited on page 128).

[95] J. Lu, K. Bai, and A. Shrivastava. SSDM: Smart Stack Data Management for Software Managed Multicores (SMMs). In Proceedings of the Design Automation Conference (DAC), 2013. (Cited on pages 11 and 15).

[96] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2005. (Cited on page 132).

[97] M. Marins de Castro, R. C. Sousa, S. Bandiera, C. Ducruet, A. Chavent, S. Auffret, C. Papusoi, I. L. Prejbeanu, C. Portemont, L. Vila, U. Ebels, B. Rodmacq, and B. Dieny. Precessional Spin-transfer Switching in a Magnetic Tunnel Junction with a Synthetic Antiferromagnetic Perpendicular Polarizer. Journal of Applied Physics, 2012. (Cited on page 104).

[98] A. Marongiu and L. Benini. An OpenMP Compiler for Efficient Use of Distributed Scratchpad Memory in MPSoCs. IEEE Transactions on Computers (TC), 2012. (Cited on page 15).

[99] MathWorks. System Identification Toolbox. https://www.mathworks.com/products/sysid.html. (Cited on page 130).

[100] T. G. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe. The 48-core SCC Processor: The Programmer's View. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010. (Cited on pages 2 and 4).

[101] J. Miao, K. He, A. Gerstlauer, and M. Orshansky. Modeling and Synthesis of Quality-energy Optimal Approximate Adders. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2012. (Cited on page 63).

[102] J. S. Miguel, J. Albericio, N. E. Jerger, and A. Jaleel. The Bunker Cache for Spatio-value Approximation. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016. (Cited on pages 65, 66, 67, 68, and 78).

[103] J. S. Miguel, J. Albericio, A. Moshovos, and N. E. Jerger. Doppelgänger: A Cache for Approximate Computing. In Proceedings of the International Symposium on Microarchitecture (MICRO), 2015. (Cited on pages 65, 66, 67, 68, and 78).

[104] J. S. Miguel, M. Badr, and N. E. Jerger. Load Value Approximation. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014. (Cited on pages 65, 66, 67, 68, 78, and 79).

[105] S. Misailovic, D. Kim, and M. Rinard. Parallelizing Sequential Programs with Statistical Accuracy Tests. ACM Transactions on Embedded Computing Systems (TECS), 2013. (Cited on page 63).

[106] S. Misailovic, D. M. Roy, and M. C. Rinard. Probabilistically Accurate Program Transformations. In Proceedings of the International Conference on Static Analysis (SAS), 2011. (Cited on page 63).

[107] S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard. Quality of Service Profiling. In Proceedings of the ACM/IEEE International Conference on Software Engineering (ICSE), 2010. (Cited on page 63).

[108] S. Misailovic, S. Sidiroglou, and M. C. Rinard. Dancing with Uncertainty. In Proceedings of the ACM Workshop on Relaxing Synchronization for Multicore and Manycore Scalability (RACES), 2012. (Cited on page 63).

[109] A.-M. Monazzah, M. Shoushtari, A. Rahmani, and N. Dutt. QuARK: Quality-configurable Approximate STT-MRAM Cache by Fine-grained Tuning of Reliability-Energy Knobs. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2017. (Cited on pages 63, 65, 66, 67, 68, 69, 123, and 129).

[110] K. Munira, W. H. Butler, and A. W. Ghosh. A Quasi-Analytical Model for Energy-Delay-Reliability Tradeoff Studies During Write Operations in a Perpendicular STT-RAM Cell. IEEE Transactions on Electron Devices, 2012. (Cited on page 104).

[111] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. CACTI 6.5: A Tool to Model Large Caches. Technical report, HP Laboratories, 2009. (Cited on pages 37 and 96).

[112] S. R. Nassif, N. Mehta, and Y. Cao. A Resilience Roadmap. In Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE), 2010. (Cited on page 5).

[113] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocation for Embedded Systems with a Compile-time-unknown Scratch-pad Size. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2005. (Cited on page 14).

[114] F. Oboril, A. Shirvanian, and M. Tahoori. Fault Tolerant Approximate Computing using Emerging Non-volatile Spintronic Memories. In Proceedings of the IEEE VLSI Test Symposium (VTS), 2016. (Cited on pages 65, 66, 67, 68, and 69).

[115] A. Pabalkar, A. Shrivastava, A. Kannan, and J. Lee. SDRM: Simultaneous Determination of Regions and Function-to-region Mapping for Scratchpad Memories. In Proceedings of the International Conference on High Performance Computing (HiPC), 2008. (Cited on page 11).

[116] P. R. Panda, N. D. Dutt, and A. Nicolau. Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications. In Proceedings of the European Conference on Design and Test (EDTC), 1997. (Cited on page 14).

[117] J. Park. Memory Optimizations of Embedded Applications for Energy Efficiency. PhD thesis, Dept. Elect. Eng., Stanford, 2011. (Cited on page 37).

[118] C. Pinto and L. Benini. A Highly Efficient, Thread-safe Software Cache Implementation for Tightly-coupled Multicore Clusters. In Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2013. (Cited on page 41).

[119] A. Raha, S. Sutar, H. Jayakumar, and V. Raghunathan. Quality Configurable Approximate DRAM. IEEE Transactions on Computers (TC), 2017. (Cited on pages 65, 66, 67, 68, and 70).

[120] B. Raj, A. K. Saxena, and S. Dasgupta. Nanoscale FinFET Based SRAM Cell Design: Analysis of Performance Metric, Process Variation, Underlapped FinFET, and Temperature Effect. IEEE Circuits and Systems Magazine (CAS), 2011. (Cited on page 101).

[121] A. Ranjan, A. Raha, V. Raghunathan, and A. Raghunathan. Approximate Memory Compression for Energy-efficiency. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2017. (Cited on pages 65, 66, 67, 68, and 73).

[122] A. Ranjan, S. Venkataramani, X. Fong, K. Roy, and A. Raghunathan. Approximate Storage for Energy Efficient Spintronic Memories. In Proceedings of the IEEE/ACM Design Automation Conference (DAC), 2015. (Cited on pages 65, 66, 67, 68, 69, 104, 109, 111, and 129).

[123] V. J. Reddi, D. Z. Pan, S. R. Nassif, and K. A. Bowman. Robust and Resilient Designs from the Bottom-up: Technology, CAD, Circuit, and System Issues. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), 2012. (Cited on page 83).

[124] M. Rinard. Parallel Synchronization-Free Approximate Data Structure Construction. In Proceedings of the 5th USENIX Workshop on Hot Topics in Parallelism, 2013. (Cited on pages 63, 65, 66, and 73).

[125] A. Ros, M. E. Acacio, and J. M. Garcia. Cache Coherence Protocols for Many-core CMPs. In Parallel and Distributed Computing. InTech, 2010. (Cited on page3).

[126] C. Rubio-González, C. Nguyen, H. D. Nguyen, J. Demmel, W. Kahan, K. Sen, D. H. Bailey, C. Iancu, and D. Hough. Precimonious: Tuning Assistant for Floating-point Precision. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2013. (Cited on page 63).

[127] F. Sampaio, M. Shafique, B. Zatt, S. Bampi, and J. Henkel. Approximation-aware Multi-Level Cells STT-RAM Cache Architecture. In International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2015. (Cited on pages 65, 66, 67, 68, and 69).

[128] A. Sampson, A. Baixo, B. Ransford, T. Moreau, J. Yip, L. Ceze, and M. Oskin. ACCEPT: A Programmer-Guided Compiler Framework for Practical Approximate Computing. Technical Report UW-CSE-15-01-01, University of Washington, 2015. (Cited on pages 63 and 92).

[129] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman. EnerJ: Approximate Data Types for Safe and General Low-power Computation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2011. (Cited on pages 63 and 80).

[130] A. Sampson, J. Nelson, K. Strauss, and L. Ceze. Approximate Storage in Solid-state Memories. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), 2013. (Cited on pages 63, 65, 66, 67, 68, 71, 81, 83, and 129).

[131] N. Sayed, F. Oboril, A. Shirvanian, R. Bishnoi, and M. B. Tahoori. Exploiting STT-MRAM for Approximate Computing. In Proceedings of the IEEE European Test Symposium (ETS), 2017. (Cited on pages 65, 66, 67, 68, and 69).

[132] M. Shoushtari, A. BanaiyanMofrad, and N. Dutt. Exploiting Partially-Forgetful Memories for Approximate Computing. IEEE Embedded Systems Letters (ESL), 2015. (Cited on pages 63, 65, 66, 67, 68, and 123).

[133] A. Shrivastava, N. Dutt, J. Cai, M. Shoushtari, B. Donyanavard, and H. Tajik. Automatic Management of Software Programmable Memories in Many-core Architectures. IET Computers Digital Techniques, 2016. (Cited on page 14).

[134] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard. Managing Performance vs. Accuracy Trade-offs with Loop Perforation. In Proceedings of the ACM SIGSOFT Symposium and the European Conference on Foundations of Software Engineering, 2011. (Cited on page 63).

[135] J. Sjödin and C. von Platen. Storage Allocation for Embedded Processors. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2001. (Cited on page 14).

[136] S. Steinke, L. Wehmeyer, B. Lee, and P. Marwedel. Assigning Program and Data Objects to Scratchpad for Energy Reduction. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), 2002. (Cited on page 14).

[137] V. Suhendra, C. Raghavan, and T. Mitra. Integrated Scratchpad Memory Optimization and Task Scheduling for MPSoC Architectures. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2006. (Cited on pages 11 and 15).

[138] V. Suhendra, A. Roychoudhury, and T. Mitra. Scratchpad Allocation for Concurrent Embedded Software. ACM Transactions on Programming Languages and Systems (TOPLAS), 2010. (Cited on page 15).

[139] X. Sui, A. Lenharth, D. S. Fussell, and K. Pingali. Proactive Control of Approximate Programs. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016. (Cited on pages 63 and 124).

[140] H. Sun, C. Liu, N. Zheng, T. Min, and T. Zhang. Design Techniques to Improve the Device Write Margin for MRAM-based Cache Memory. In Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI), 2011. (Cited on page 102).

[141] M. Taassori, N. Chatterjee, A. Shafiee, and R. Balasubramonian. Exploring a Brink-of-Failure Memory Controller to Design an Approximate Memory System. In Proceedings of the Workshop on Approximate Computing Across the System Stack (WACAS), 2014. (Cited on page 81).

[142] H. Tajik, B. Donyanavard, and N. Dutt. On Detecting and Using Memory Phases in Multimedia Systems. In Proceedings of the ACM/IEEE Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia), 2016. (Cited on pages 6 and 23).

[143] H. Tajik, B. Donyanavard, N. Dutt, J. Jahn, and J. Henkel. SPMPool: Runtime SPM Management for Memory-Intensive Applications in Embedded Many-Cores. ACM Transactions on Embedded Computing Systems (TECS), 2016. (Cited on page 6).

[144] R. Takemura, T. Kawahara, K. Miura, H. Yamamoto, J. Hayakawa, N. Matsuzaki, K. Ono, M. Yamanouchi, K. Ito, H. Takahashi, S. Ikeda, H. Hasegawa, H. Matsuoka, and H. Ohno. A 32-Mb SPRAM With 2T1R Memory Cell, Localized Bi-Directional Write Driver and ‘1’/‘0’ Dual-Array Equalized Reference Scheme. IEEE Journal of Solid-State Circuits (JSSC), 2010. (Cited on page 104).

[145] B. Thwaites, G. Pekhimenko, H. Esmaeilzadeh, A. Yazdanbakhsh, O. Mutlu, J. Park, G. Mururu, and T. Mowry. Rollback-free Value Prediction with Approximate Loads. In Proceedings of the International Conference on Parallel Architectures and Compilation (PACT), 2014. (Cited on pages 65, 66, 68, and 79).

[146] Y. Tian, Q. Zhang, T. Wang, F. Yuan, and Q. Xu. ApproxMA: Approximate Memory Access for Dynamic Precision Scaling. In Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI), 2015. (Cited on pages 65, 66, 68, and 73).

[147] M. Verma, S. Steinke, and P. Marwedel. Data Partitioning for Maximal Scratchpad Usage. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), 2003. (Cited on page 14).

[148] J. Wang and B. H. Calhoun. Standby Supply Voltage Minimization for Reliable Nanoscale SRAMs, chapter 6. InTech, 2010. (Cited on page 82).

[149] J. Wang and B. H. Calhoun. Minimum Supply Voltage and Yield Estimation for Large SRAMs Under Parametric Variations. IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 2011. (Cited on pages ix, 85, and 96).

[150] L. Wanner and M. Srivastava. ViRUS: Virtual Function Replacement Under Stress. In Proceedings of the USENIX Conference on Power-Aware Computing and Systems, 2014. (Cited on page 63).

[151] C. Wilkerson, A. R. Alameldeen, Z. Chishti, W. Wu, D. Somasekhar, and S.-L. Lu. Reducing Cache Power with Low-cost, Multi-bit Error-correcting Codes. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2010. (Cited on page 82).

[152] C. Wilkerson, H. Gao, A. R. Alameldeen, Z. Chishti, M. Khellah, and S.-L. Lu. Trading off Cache Capacity for Reliability to Enable Low Voltage Operation. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2008. (Cited on page 82).

[153] X. Xu and H. H. Huang. Exploring Data-Level Error Tolerance in High-Performance Solid-State Drives. IEEE Transactions on Reliability, 2015. (Cited on pages 65, 66, 67, 68, and 72).

[154] A. Yazdanbakhsh, G. Pekhimenko, B. Thwaites, H. Esmaeilzadeh, O. Mutlu, and T. C. Mowry. RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads. ACM Transactions on Architecture and Code Optimization (TACO), 2016. (Cited on pages 65, 66, 68, and 79).

[155] X. Zhang, Y. Zhang, B. Childers, and J. Yang. DrMP: Mixed Precision-aware DRAM for High Performance Approximate and Precise Computing. In Proceedings of the International Conference on Parallel Architectures and Compilation (PACT), 2017. (Cited on pages 65, 66, 67, 68, and 71).

[156] H. Zhao, L. Xue, P. Chi, and J. Zhao. Approximate Image Storage with Multi-level Cell STT-MRAM Main Memory. In Proceedings of the International Conference On Computer Aided Design (ICCAD), 2017. (Cited on pages 65, 66, 67, 68, and 77).

[157] Q. Zhao, D. Koh, S. Raza, D. Bruening, W.-F. Wong, and S. Amarasinghe. Dynamic Cache Contention Detection in Multi-threaded Applications. In Proceedings of the ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), 2011. (Cited on page 21).

[158] P. Zhou, B. Zhao, J. Yang, and Y. Zhang. A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2009. (Cited on page 116).

[159] P. Zhou, B. Zhao, J. Yang, and Y. Zhang. Energy Reduction for STT-RAM Using Early Write Termination. In Proceedings of the International Conference On Computer Aided Design (ICCAD), 2009. (Cited on page 117).

[160] Z. A. Zhu, S. Misailovic, J. A. Kelner, and M. Rinard. Randomized Accuracy-aware Program Transformations for Efficient Approximate Computations. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2012. (Cited on page 63).
