Using Tracing to Enhance Data Cache Performance in CPUs

Total pages: 16 · File type: PDF · Size: 1,020 KB

Using Tracing to Enhance Data Cache Performance in CPUs
The creation of a Trace-Assisted Cache to increase cache hits and decrease runtime

Jonathan Paul Rainer
A thesis presented for the degree of Doctor of Philosophy (PhD)
Computer Science, University of York
September 2020

Abstract

The processor-memory gap is widening every year with no prospect of reprieve. More and more latency is added to program runtimes because memory cannot satisfy the demands of CPUs quickly enough. In the past this has been alleviated through caches of increasing complexity, or through techniques like prefetching, to give the illusion of faster memory. However, these techniques have drawbacks because they are reactive or rely on incomplete information; in general this leads to large amounts of latency in programs due to processor stalls. It is our contention that by tracing a program's data accesses and feeding this information back to the cache, overall program runtime can be reduced. This is achieved through a new piece of hardware called a Trace-Assisted Cache (TAC), which uses traces to gain foreknowledge of the memory requests the processor is likely to make, allowing them to be actioned before the processor requests the data and so overlapping memory accesses with computation. Comparing the TAC against a standard CPU without a cache, we see improvements in runtime of up to 65%. However, we see degraded performance of around 8% on average when compared to set-associative and direct-mapped caches, because the improvements are swamped by high overheads and synchronisation times between components. We also see that benchmarks with a balance of computation and memory instructions, and with data kept well spread out in memory, fare better under the TAC than other benchmarks on the same hardware. Overall this demonstrates that, whilst there is potential to reduce runtime by increasing the agency of the cache through trace assistance, a highly efficient implementation is required to be competitive; otherwise any potential gains are negated by the increase in overheads.

Contents

Abstract i
Contents ii
List of Tables vi
List of Figures vi
List of Listings viii
Research Data ix
Acknowledgements xi
Declaration xii

I Introduction 1

1 Introduction 3
  1.1 Motivation 3
    1.1.1 Caches & Memory Hierarchies 5
    1.1.2 Predicting Dynamic Behaviour 6
    1.1.3 Tracing & Trace Assisted Caching 7
  1.2 Thesis Aims 8
  1.3 Thesis Structure 9

II Background 11

2 Literature Review 13
  2.1 Introduction 13
  2.2 Cache Intrinsic Techniques 15
    2.2.1 Cache Replacement Policy 17
    2.2.2 Augmenting Cache Architectures 36
    2.2.3 Summary 44
  2.3 Cache Extrinsic Techniques 46
    2.3.1 Prefetching 46
    2.3.2 Scheduling 51
    2.3.3 Program Transformation & Data Layout 53
    2.3.4 Summary 54
  2.4 Incorporating Tracing to Reduce Latency 54
    2.4.1 Tracing as a Control Loop 55
    2.4.2 Tracing for In-Silicon Debugging 56
  2.5 Review Summary 57
    2.5.1 Potential for the Application of Tracing 58

3 Trace Assisted Caching 59
  3.1 Motivation 59
    3.1.1 Defining the Key Problems 60
    3.1.2 Exploring the Design Space 63
    3.1.3 Tracing 66
  3.2 High Level Design 66
    3.2.1 Trace Recorder 67
    3.2.2 Intelligent Cache & Memory System 70
  3.3 Justification of Success 74
    3.3.1 Protections against Performance Degradation 74

III Experiments 79

4 Implementing the Platform 81
  4.1 Pre-existing Components 81
  4.2 Trace Recorder (Gouram) 83
    4.2.1 A Note on Names 83
  4.3 Trace Assisted Cache (Enokida) 84
    4.3.1 Trace Repository 84
    4.3.2 Trace Assisted Cache (Enokida) 93
  4.4 Experimental Hardware 107
    4.4.1 Expectations of the Platform 107
    4.4.2 Implementing the Platform 110
  4.5 Summary 118

5 Experiments & Results 121
  5.1 Experimental Setup 121
    5.1.1 Use of the Experimental Approach 122
    5.1.2 Process for Each Experiment 124
    5.1.3 Specific Experimental Concerns 125
  5.2 Results 127
  5.3 Exploring Cache Metrics 131
    5.3.1 Standard Caches 131
    5.3.2 Trace-Assisted Caches 135
    5.3.3 Summary 141

IV Analysis & Conclusion 143

6 Analysis 145
  6.1 Causes of Lack of Improvement 145
    6.1.1 Lack of Capacity to Improve 145
    6.1.2 Gap Between Memory Instructions 146
    6.1.3 Types of Misses 151
    6.1.4 Overheads Incurred 151
  6.2 Benchmark by Benchmark Analysis 155
    6.2.1 janne_complex 155
    6.2.2 fac 157
    6.2.3 fibcall 160
    6.2.4 duff 162
    6.2.5 insertsort 165
    6.2.6 fft1 167
  6.3 Resolving Problems of the Implementation 168
    6.3.1 Move Towards an OoO Architecture 168
    6.3.2 Reducing Overheads 172
  6.4 Applicability of Results 173
    6.4.1 Dependence on Processor Configuration 173
    6.4.2 Dependence on ISA Choice 174
    6.4.3 Dependence on the Size of the Cache 175
    6.4.4 Dependence on Choice of Benchmark 176
  6.5 Programs That Benefit from Trace Assisted Caching 177
  6.6 Summary 178

7 Conclusion & Further Work 179
  7.1 Answering the Research Questions 179
  7.2 Contributions 180
  7.3 Future Work 181
    7.3.1 Applications to High Performance Computing 181
    7.3.2 Improving the Fidelity and Stored Size of Captured Traces 182
    7.3.3 Quantifying the Link Between Slack and Effectiveness 183
    7.3.4 Expanding the TAC to Other Processors 183

V Appendices 185

A Trace Recorder (Gouram) Implementation 187
  A.1 RI5CY Memory Protocol 187
  A.2 The IF Module 190
    A.2.1 Instruction Fetch State Machine 190
    A.2.2 Branch Decision State Machine 192
    A.2.3 The Output State Machine 194
  A.3 Examples 195
    A.3.1 Simple Load Example 195
    A.3.2 Complex Branching Example 199
  A.4 The EX Module 204
    A.4.1 The Main State Machine 205
    A.4.2 Example 207

B Trace Recorder (Gouram) Implementation 211

C Calculation of Trace-Assisted Cache Overheads 213
  C.1 Cache Hit (No Preemptive Action) 215
  C.2 Cache Hit (Following a Preemptive Hit) 215
  C.3 Cache Hit (Following a Preemptive Miss) 216
  C.4 Cache Hit (Following a Preemptive Miss & Writeback) 217
  C.5 Cache Miss (No Preemptive Action) 217
  C.6 Cache Miss (With Writeback) 217
  C.7 Summary & Final Table 218

Acronyms 219
Bibliography 223

List of Tables

2.1 Potential Runtime Reductions Across Cache Policies 34
4.1 List of Names 84
4.2 Hardware Utilisation - Direct-Mapped Standard Cache 114
4.3 Hardware Utilisation - Set-Associative Standard Cache 115
4.4 Hardware Utilisation - Direct-Mapped Trace-Assisted Cache (TAC) 116
4.5 Hardware Utilisation - Set-Associative TAC 117
4.6 Hardware Utilisation - VexRiscv 119
5.1 Improvement/Degradation of Runtime Between Hardware Variants 132
5.2 Cache Behaviour - Standard Direct-Mapped 133
5.3 Cache Behaviour - Standard Set-Associative 134
5.4 Cache Behaviour - Direct-Mapped TAC - Hits 137
5.5 Cache Behaviour - Direct-Mapped TAC - Misses and Writebacks 138
5.6 Cache Behaviour - Set-Associative TAC - Hits 139
5.7 Cache Behaviour - Set-Associative TAC - Misses and Writebacks 140
6.1 Memory Operations per Cache Operation Table 151
6.2 Length of Time Taken For Operations in a Standard Cache 153
6.3 Length of Time Taken For Operations in a TAC 155
B.1 Runtime Data - All Hardware Variants 212
C.1 Length of Time Taken For Operations in a TAC - Reproduction 218

List of Figures

1.1 Processor Memory Performance Gap Graph 4
1.2 Memory Hierarchy 5
2.1 Memory Hierarchy - Simple Embedded System 13
2.2 Memory Hierarchy - Complex Desktop 14
2.3 Literature Review Structure 16
2.4 Flooding Example 22
2.5 Graph of Miss Ratios Across Cache Replacement Policies 33
2.6 Extended Miss Rate Comparison for Cache Intrinsic Techniques 45
2.7 The Limitations of Static Analysis 55
3.1 Pipeline Diagram - No Extra Information 62
3.2 Pipeline Diagram - With Complete Information 62
3.3 Basic Processor Architecture 67
3.4 Basic Architecture with Trace Recorder 68
3.5 Program Fragment with Trace 69
3.6 Illustration of Preemptive Scenarios 72
3.7 Illustration of a Block on Preemptive Actions 73
3.8 Illustration of a Sub-optimal Preemptive Scenario 76
3.9 Illustration of Worst-Case Preemptive Scenario 76
4.1 The Basic Implemented Architecture …
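The mechanism summarised in the abstract, a cache that acts on recorded trace information before the processor issues the corresponding requests, can be illustrated with a minimal software sketch. The Python model below is purely illustrative and is not the thesis's hardware implementation; the cache geometry, the lookahead parameter, and all names are assumptions made only for this example.

# Illustrative model only: the parameters and names below are invented for
# this sketch and do not correspond to the actual Trace-Assisted Cache RTL.

LINE_SIZE = 4   # words per cache line (assumed)
NUM_LINES = 8   # direct-mapped cache with 8 lines (assumed)

def line_index(addr):
    # Direct-mapped placement: which cache slot a word address maps to.
    return (addr // LINE_SIZE) % NUM_LINES

def line_tag(addr):
    return addr // LINE_SIZE

def run(trace, lookahead):
    # Replay a trace of word addresses and count hits.
    # lookahead = 0 models a purely reactive cache (fill on miss only);
    # lookahead > 0 models trace assistance: while the current access is
    # serviced, the next `lookahead` trace entries are preloaded.
    cache = [None] * NUM_LINES   # one tag per line, None = empty
    hits = 0
    for i, addr in enumerate(trace):
        if cache[line_index(addr)] == line_tag(addr):
            hits += 1
        else:
            cache[line_index(addr)] = line_tag(addr)   # fill on demand
        for future in trace[i + 1 : i + 1 + lookahead]:
            cache[line_index(future)] = line_tag(future)   # preemptive fill
    return hits

if __name__ == "__main__":
    # A streaming trace that touches a new line on every access: a reactive
    # cache misses every time, whereas foreknowledge of the trace lets the
    # next line be fetched while the current access completes.
    trace = list(range(0, 128, LINE_SIZE))
    print("reactive hits:      ", run(trace, lookahead=0))   # 0 of 32
    print("trace-assisted hits:", run(trace, lookahead=1))   # 31 of 32

In the sketch the preloads happen inline for simplicity; in the design described by the abstract the equivalent fills are performed by the cache hardware while the processor executes non-memory instructions, which is where the claimed overlap of memory accesses and computation comes from.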