Using Tracing to Enhance Data Cache Performance in CPUs
Using Tracing to Enhance Data Cache Performance in CPUs

The creation of a Trace-Assisted Cache to increase cache hits and decrease runtime

Jonathan Paul Rainer

A thesis presented for the degree of
Doctor of Philosophy (PhD)

Computer Science
University of York

September 2020

Abstract

The processor-memory gap is widening every year, with no prospect of reprieve. More and more latency is added to program runtimes as memory cannot satisfy the demands of CPUs quickly enough. In the past this has been alleviated through caches of increasing complexity, or through techniques like prefetching, to give the illusion of faster memory. However, these techniques have drawbacks: they are reactive or rely on incomplete information. In general, this leads to large amounts of latency in programs due to processor stalls. It is our contention that, by tracing a program’s data accesses and feeding this information back to the cache, overall program runtime can be reduced. This is achieved through a new piece of hardware called a Trace-Assisted Cache (TAC), which uses traces to gain foreknowledge of the memory requests the processor is likely to make, allowing them to be actioned before the processor requests the data and thereby overlapping memory accesses with computation. Comparing the TAC against a standard CPU without a cache, we see improvements in runtime of up to 65%. However, we see degraded performance of around 8% on average when compared to Set-Associative and Direct-Mapped caches, because the improvements are swamped by high overheads and synchronisation times between components. We also see that benchmarks that balance computation and memory instructions, and that keep their data well spread out in memory, fare better under the TAC than other benchmarks on the same hardware.
Overall this demonstrates that, whilst there is potential to reduce runtime by increasing the agency of the cache through Trace Assistance, a highly efficient implementation is required to be competitive; otherwise any potential gains are negated by the increase in overheads.

Contents

Abstract  i
Contents  ii
List of Tables  vi
List of Figures  vi
List of Listings  viii
Research Data  ix
Acknowledgements  xi
Declaration  xii

Part I  Introduction  1

1  Introduction  3
    1.1  Motivation  3
        1.1.1  Caches & Memory Hierarchies  5
        1.1.2  Predicting Dynamic Behaviour  6
        1.1.3  Tracing & Trace Assisted Caching  7
    1.2  Thesis Aims  8
    1.3  Thesis Structure  9

Part II  Background  11

2  Literature Review  13
    2.1  Introduction  13
    2.2  Cache Intrinsic Techniques  15
        2.2.1  Cache Replacement Policy  17
        2.2.2  Augmenting Cache Architectures  36
        2.2.3  Summary  44
    2.3  Cache Extrinsic Techniques  46
        2.3.1  Prefetching  46
        2.3.2  Scheduling  51
        2.3.3  Program Transformation & Data Layout  53
        2.3.4  Summary  54
    2.4  Incorporating Tracing to Reduce Latency  54
        2.4.1  Tracing as a Control Loop  55
        2.4.2  Tracing for In-Silicon Debugging  56
    2.5  Review Summary  57
        2.5.1  Potential for the Application of Tracing  58

3  Trace Assisted Caching  59
    3.1  Motivation  59
        3.1.1  Defining the Key Problems  60
        3.1.2  Exploring the Design Space  63
        3.1.3  Tracing  66
    3.2  High Level Design  66
        3.2.1  Trace Recorder  67
        3.2.2  Intelligent Cache & Memory System  70
    3.3  Justification of Success  74
        3.3.1  Protections against Performance Degradation  74

Part III  Experiments  79

4  Implementing the Platform  81
    4.1  Pre-existing Components  81
    4.2  Trace Recorder (Gouram)  83
        4.2.1  A Note on Names  83
    4.3  Trace Assisted Cache (Enokida)  84
        4.3.1  Trace Repository  84
        4.3.2  Trace Assisted Cache (Enokida)  93
    4.4  Experimental Hardware  107
        4.4.1  Expectations of the Platform  107
        4.4.2  Implementing the Platform  110
    4.5  Summary  118

5  Experiments & Results  121
    5.1  Experimental Setup  121
        5.1.1  Use of the Experimental Approach  122
        5.1.2  Process for Each Experiment  124
        5.1.3  Specific Experimental Concerns  125
    5.2  Results  127
    5.3  Exploring Cache Metrics  131
        5.3.1  Standard Caches  131
        5.3.2  Trace-Assisted Caches  135
        5.3.3  Summary  141

Part IV  Analysis & Conclusion  143

6  Analysis  145
    6.1  Causes of Lack of Improvement  145
        6.1.1  Lack of Capacity to Improve  145
        6.1.2  Gap Between Memory Instructions  146
        6.1.3  Types of Misses  151
        6.1.4  Overheads Incurred  151
    6.2  Benchmark by Benchmark Analysis  155
        6.2.1  janne_complex  155
        6.2.2  fac  157
        6.2.3  fibcall  160
        6.2.4  duff  162
        6.2.5  insertsort  165
        6.2.6  fft1  167
    6.3  Resolving Problems of the Implementation  168
        6.3.1  Move Towards an OoO Architecture  168
        6.3.2  Reducing Overheads  172
    6.4  Applicability of Results  173
        6.4.1  Dependence on Processor Configuration  173
        6.4.2  Dependence on ISA Choice  174
        6.4.3  Dependence on the Size of the Cache  175
        6.4.4  Dependence on Choice of Benchmark  176
    6.5  Programs That Benefit from Trace Assisted Caching  177
    6.6  Summary  178

7  Conclusion & Further Work  179
    7.1  Answering the Research Questions  179
    7.2  Contributions  180
    7.3  Future Work  181
        7.3.1  Applications to High Performance Computing  181
        7.3.2  Improving the Fidelity and Stored Size of Captured Traces  182
        7.3.3  Quantifying the Link Between Slack and Effectiveness  183
        7.3.4  Expanding the TAC to Other Processors  183

Part V  Appendices  185

A  Trace Recorder (Gouram) Implementation  187
    A.1  RI5CY Memory Protocol  187
    A.2  The IF Module  190
        A.2.1  Instruction Fetch State Machine  190
        A.2.2  Branch Decision State Machine  192
        A.2.3  The Output State Machine  194
    A.3  Examples  195
        A.3.1  Simple Load Example  195
        A.3.2  Complex Branching Example  199
    A.4  The EX Module  204
        A.4.1  The Main State Machine  205
        A.4.2  Example  207

B  Trace Recorder (Gouram) Implementation  211

C  Calculation of Trace-Assisted Cache Overheads  213
    C.1  Cache Hit (No Preemptive Action)  215
    C.2  Cache Hit (Following a Preemptive Hit)  215
    C.3  Cache Hit (Following a Preemptive Miss)  216
    C.4  Cache Hit (Following a Preemptive Miss & Writeback)  217
    C.5  Cache Miss (No Preemptive Action)  217
    C.6  Cache Miss (With Writeback)  217
    C.7  Summary & Final Table  218

Acronyms  219
Bibliography  223

List of Tables

2.1  Potential Runtime Reductions Across Cache Policies  34
4.1  List of Names  84
4.2  Hardware Utilisation - Direct-Mapped Standard Cache  114
4.3  Hardware Utilisation - Set-Associative Standard Cache  115
4.4  Hardware Utilisation - Direct-Mapped Trace-Assisted Cache (TAC)  116
4.5  Hardware Utilisation - Set-Associative TAC  117
4.6  Hardware Utilisation - VexRiscv  119
5.1  Improvement/Degradation of Runtime Between Hardware Variants  132
5.2  Cache Behaviour - Standard Direct-Mapped  133
5.3  Cache Behaviour - Standard Set-Associative  134
5.4  Cache Behaviour - Direct-Mapped TAC - Hits  137
5.5  Cache Behaviour - Direct-Mapped TAC - Misses and Writebacks  138
5.6  Cache Behaviour - Set-Associative TAC - Hits  139
5.7  Cache Behaviour - Set-Associative TAC - Misses and Writebacks  140
6.1  Memory Operations per Cache Operation Table  151
6.2  Length of Time Taken For Operations in a Standard Cache  153
6.3  Length of Time Taken For Operations in a TAC  155
B.1  Runtime Data - All Hardware Variants  212
C.1  Length of Time Taken For Operations in a TAC - Reproduction  218

List of Figures

1.1  Processor Memory Performance Gap Graph  4
1.2  Memory Hierarchy  5
2.1  Memory Hierarchy - Simple Embedded System  13
2.2  Memory Hierarchy - Complex Desktop  14
2.3  Literature Review Structure  16
2.4  Flooding Example  22
2.5  Graph of Miss Ratios Across Cache Replacement Policies  33
2.6  Extended Miss Rate Comparison for Cache Intrinsic Techniques  45
2.7  The Limitations of Static Analysis  55
3.1  Pipeline Diagram - No Extra Information  62
3.2  Pipeline Diagram - With Complete Information  62
3.3  Basic Processor Architecture  67
3.4  Basic Architecture with Trace Recorder  68
3.5  Program Fragment with Trace  69
3.6  Illustration of Preemptive Scenarios  72
3.7  Illustration of a Block on Preemptive Actions  73
3.8  Illustration of a Sub-optimal Preemptive Scenario  76
3.9  Illustration of Worst-Case Preemptive Scenario  76
4.1  The Basic Implemented Architecture  …