Architectural Support for Efficient On-Chip Parallel Execution
UC San Diego Electronic Theses and Dissertations

Title: Architectural Support for Efficient On-chip Parallel Execution
Author: Jeffery Alan Brown
Publication Date: 2010
Permalink: https://escholarship.org/uc/item/0r1593xh
Peer reviewed | Thesis/dissertation

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Architectural Support for Efficient On-chip Parallel Execution

A dissertation submitted in partial satisfaction of the requirements
for the degree Doctor of Philosophy in Computer Science

by

Jeffery Alan Brown

Committee in charge:

    Professor Dean Tullsen, Chair
    Professor Brad Calder
    Professor Sadik Esener
    Professor Tim Sherwood
    Professor Steven Swanson

2010

Copyright Jeffery Alan Brown, 2010. All rights reserved.

The dissertation of Jeffery Alan Brown is approved, and it is
acceptable in quality and form for publication on microfilm and
electronically:

    Chair

University of California, San Diego, 2010

DEDICATION

To Mom.

EPIGRAPH

Do what you think is interesting, do something that you think is fun
and worthwhile, because otherwise you won't do it well anyway.
    —Brian Kernighan

Numerical examples are good for your soul.
    —T. C. Hu

TABLE OF CONTENTS

Signature Page
Dedication
Epigraph
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita and Publications
Abstract of the Dissertation

Chapter 1  Introduction
    1.1  Complications from Parallelism
    1.2  Memory Latency & Instruction Scheduling
    1.3  Cache Coherence on a CMP Landscape
    1.4  Thread Migration
        1.4.1  Explicit Thread State: Registers
        1.4.2  Implicit State: Working Set Migration

Chapter 2  Experimental Methodology & Metrics
    2.1  Execution-driven Simulation
        2.1.1  SMTSIM
        2.1.2  RSIM
    2.2  Metrics
        2.2.1  Weighted Speedup
        2.2.2  Normalized Weighted Speedup
        2.2.3  Interval Weighted Speedup
        2.2.4  Interval IPC & Post-migrate Speedup

Chapter 3  Handling Long-Latency Loads on Simultaneous Multithreading Processors
    3.1  Introduction
    3.2  The Impact of Long-latency Loads
    3.3  Related Work
    3.4  Methodology
    3.5  Metrics
    3.6  Detecting and Handling Long-latency Loads
    3.7  Alternate Flush Mechanisms
    3.8  Response Time Experiments
    3.9  Generality of the Load Problem
    3.10 Summary

Chapter 4  Coherence Protocol Design for Chip Multiprocessors
    4.1  Introduction
    4.2  Related Work
    4.3  A CMP Architecture with Directory-based Coherence
        4.3.1  Architecture
        4.3.2  Baseline Coherence Protocol
    4.4  Accelerating Coherence via Proximity Awareness
    4.5  Methodology
    4.6  Analysis and Results
    4.7  Summary

Chapter 5  The Shared-Thread Multiprocessor
    5.1  Introduction
    5.2  Related Work
    5.3  The Baseline Multi-threaded Multi-core Architecture
        5.3.1  Chip multiprocessor
        5.3.2  Simultaneous-Multithreaded cores
        5.3.3  Long-latency memory operations
    5.4  Shared-Thread Storage: Mechanisms & Policies
        5.4.1  Inactive-thread store
        5.4.2  Shared-thread control unit
        5.4.3  Thread-transfer support
        5.4.4  Scaling the Shared-Thread Multiprocessor
        5.4.5  Thread control policies — Hiding long latencies
        5.4.6  Thread control policies — Rapid rebalancing
    5.5  Methodology
        5.5.1  Simulator configuration
        5.5.2  Workloads
        5.5.3  Metrics
    5.6  Results and Analysis
        5.6.1  Potential gains from memory stalls
        5.6.2  Rapid migration to cover memory latency
        5.6.3  Rapid migration for improved scheduling
    5.7  Summary

Chapter 6  Fast Thread Migration via Working Set Prediction
    6.1  Introduction
    6.2  Related Work
    6.3  Baseline Multi-core Architecture
    6.4  Motivation: Performance Cost of Migration
    6.5  Architectural Support for Working Set Migration
        6.5.1  Memory logger
        6.5.2  Summary generator
        6.5.3  Summary-driven prefetcher
    6.6  Methodology
        6.6.1  Simulator configuration
        6.6.2  Workloads
        6.6.3  Metrics
    6.7  Analysis and Results
        6.7.1  Bulk cache transfer
        6.7.2  Limits of prefetching
        6.7.3  I-stream prefetching
        6.7.4  D-stream prefetching
        6.7.5  Combined prefetchers
        6.7.6  Allowing previous-instance cache re-use
        6.7.7  Impact on other threads
        6.7.8  Adding a shared last-level cache
        6.7.9  Simple hardware prefetchers
    6.8  Summary

Chapter 7  Conclusion
    7.1  Memory Latency in Multithreaded Processors
    7.2  Cache Coherence for CMPs
    7.3  Multithreading Among Cores
        7.3.1  Registers: Thread Migration & Scheduling
        7.3.2  Memory: Working Set Prediction & Migration
    7.4  Final Remarks

Bibliography

LIST OF FIGURES

Figure 1.1: Example chip-level parallel architecture
Figure 2.1: Interval weighted speedup compression function
Figure 3.1: Performance impact of long-latency loads
Figure 3.2: Throughput benefit of a simple flushing mechanism
Figure 3.3: Comparison of long-load detection mechanisms
Figure 3.4: Performance of flush-point selection techniques
Figure 3.5: Performance of alternative flush mechanisms
Figure 3.6: Mean response times in an open system
Figure 3.7: Impact on alternate SMT fetch policies
Figure 3.8: Performance with different instruction queue sizes
Figure 4.1: Baseline chip multiprocessor
Figure 4.2: A traditional multiprocessor
Figure 4.3: Proximity-aware coherence
Figure 4.4: Potential benefit from proximity-aware coherence
Figure 4.5: Reduction in L2 miss-service latency
Figure 4.6: Reply-network utilization
Figure 4.7: Speedup from proximity-aware coherence
Figure 5.1: The Shared-Thread Multiprocessor
Figure 5.2: Idle time from memory stalls in fully-occupied cores
Figure 5.3: Performance of stall-covering schemes
Figure 5.4: Performance of dynamic schedulers
Figure 6.1: Baseline multi-core processor
Figure 6.2: Performance cost of migration
Figure 6.3: Memory logger overview
Figure 6.4: Summary-driven prefetcher
Figure 6.5: Impact of bulk cache transfers
Figure 6.6: The limits of a future-oracle prefetcher
Figure 6.7: Comparison of instruction stream prefetchers
Figure 6.8: Comparison of data stream prefetchers
Figure 6.9: Speedup from combined instruction and data prefetchers
Figure 6.10: Speedup when allowing cache reuse

LIST OF TABLES

Table 3.1: Single-threaded benchmarks
Table 3.2: Multi-threaded workloads
Table 3.3: Processor configuration
Table 4.1: Architecture details
Table 4.2: Workloads
Table 5.1: Architecture details
Table 5.2: Component benchmarks
Table 5.3: Composite workloads
Table 6.1: Activity record fields
Table 6.2: Baseline processor parameters
Table 6.3: Prefetcher activity and accuracy

ACKNOWLEDGEMENTS

I thank my advisor, Dean Tullsen, for making this dissertation possible; for setting an example of professional and personal integrity for us all to aspire toward; and especially for standing by me during the tough years – when experiments weren't working – displaying unwavering confidence at times when I was