
EXPLORING, DEFINING, AND EXPLOITING

RECENT STORE VALUE LOCALITY

by

Kevin M. Lepak

A dissertation submitted in partial fulfillment of

the requirements for the degree of

Doctor of Philosophy

(Electrical Engineering)

at the

UNIVERSITY OF WISCONSIN-MADISON

2003

Abstract

This thesis is motivated by the growing differential between main memory and core performance. Increased integration, enabled by Moore’s law, has provided a substantial compound improvement in core performance. Integration has benefitted main memory latency less significantly, leading to an expanding memory-gap. Furthermore, in multiprocessors, increasing integration has allowed enlarging on-chip cache structures to continue reducing capacity and conflict misses; however, communication misses still remain, limiting performance of multithreaded workloads. Locality in both temporal and spatial dimensions has been exploited historically by computer architects to improve memory system performance. Recently, a new locality dimension has emerged unveiling additional potential for performance improvement. Value locality describes a program behavior phenomenon in which values recur in programs.

Many researchers have examined value locality as a means to improve memory system performance. However, most research has focused on predicting load values, as it is believed that loads are latency critical. In contrast, conventional wisdom says stores are not latency critical and need only be buffered and forwarded for acceptable performance. In this thesis, we show that stores should be examined as a means of improving memory performance for both uniprocessors and multiprocessors and that stores exhibit significant value locality. For example, approximately 40% of stores are update silent; they write the same value which already exists at the memory location, thus contributing no change in system state.

We show numerous methods of exploiting store value locality to increase performance. In uniprocessors, we detail improvements in core efficiency; in multiprocessors, significant reductions in communication between processors. We focus predominantly on multiprocessors, making a fundamental contribution in redefining multiprocessor sharing to consider two dimensions of store value locality. Furthermore, we describe both speculative and non-speculative methods which achieve substantial performance benefit by exploiting store value locality in both scientific and commercial workloads. Many of our proposals can be integrated into existing microprocessor designs with coherence protocol changes, while others rely on existing coherence mechanisms to reap tangible benefit. We perform a detailed performance evaluation, using full-system, execution-driven simulation to show the merits of different designs.

Acknowledgements

I would like to thank my parents, Ken Lepak and Carol Lepak for their constant support throughout my educational career. I owe my strong work-ethic and drive to their influence and I am truly grateful for that. I thank my mom especially for her willingness to encourage my interest in technology and engineering at a very young age. I also thank my siblings, Kurt, Ann, and Julie, for their support and honor the memory of Kristopher.

My advisor, Prof. Mikko Lipasti, has influenced me in numerous ways. He taught me about academic engineering research and how to ask the right questions. I also thank him for his academic and financial support throughout my academic career. In addition, I am truly a better person having worked with Mikko. I learned through example how to be true to yourself, those important to you, and your beliefs, even under exceptional circumstances. I additionally thank the members of my thesis committee, Prof. Jim Smith, Prof. Mark Hill, Prof. David Wood, and Prof. Guri Sohi for their feedback on my thesis research, this thesis document, and stimulating discussions throughout my graduate tenure. Additionally, I thank Prof. Jim Goodman and Prof. Ras Bodik who assisted with my preliminary proposal and illuminated avenues to be explored in this thesis.

On a collaborative level, I thank Jason Cantin for stimulating coherence validation discussion and PHARMsim visualization tools, Ravi Rajwar for detailed explanations regarding SLE, and all members of the PHARM research group for their technical assistance and contribution to shared research infrastructure during my tenure. I especially thank Trey Cain for his substantial PHARMsim development effort, and Ilhyun Kim for many stimulating discussions. I also thank Ilhyun and Gordie Bell for cooperating with me on some research related to this thesis. I thank Jay Pickett, Mitch Alsup, and Ben Sander from Advanced Micro Devices who taught me much about real-life microprocessor design and architecture design/evaluation during my long internship between finishing my M.S. and returning to Wisconsin to finish my Ph.D.

On a personal level, I am lucky to have worked with Trey, Gordie, and Ilhyun. As Mikko’s first students, we have been through a lot together and at times it feels like we have accomplished herculean tasks. I thank Brian Mestan, Tim Snyder, Chris Belmas, and Greg Zagorski for always being true friends to whom I could relate my feelings. I thank Milo Martin for encouraging me to attend graduate school in the first place and for our interactions throughout my graduate career. I especially thank Katie Hubbard who has seen my metamorphosis throughout graduate school, was there through some of the most turbulent times, and reminded me who I really was during some of the toughest moments.

I thank Greg, Tim, Razvan Cheveresan, and Matt Ramsay for providing welcome diversions from the rigors of graduate school. I thank all the members of MAUL ultimate frisbee for accepting me onto the field of honor twice a week for the last five years, providing much needed exercise and stress relief, although I never could throw . . . at all.

Finally, I thank the Department of Electrical Engineering staff for their assistance with administrative tasks throughout my graduate career. I especially thank Bruce Orchard, the network administrator, who was always willing to let us do whatever we needed to get the job done—and help us put the pieces back together when things went horribly awry.

Table of Contents

Chapter 1: Introduction
    1.1 Performance Impact of Communication Misses in Multiprocessor Systems
    1.2 Exploiting Value Locality to Improve Memory System Performance
    1.3 Update Silent Stores
    1.4 Temporally Silent Stores
    1.5 Thesis Overview and Summary of Contributions
Chapter 2: Exploiting Update Silence in Uniprocessors
    2.1 Motivation and Background
    2.2 A Simple Microarchitectural Method to Exploit Update Silence
        2.2.1 Naive Update Silent Store Suppression
    2.3 Advanced Methods For Detecting Update Silence
        2.3.1 Read Port Stealing
        2.3.2 Load/Store Queue
            2.3.2.1 Temporal Locality in the LSQ
            2.3.2.2 Spatial Locality in the LSQ
        2.3.3 ECC Update Silent Store Suppression
            2.3.3.1 ECC in Modern Microarchitectures
            2.3.3.2 L1 Data Cache with ECC
            2.3.3.3 Write-Through L1 Cache with ECC L2
            2.3.3.4 Duplication of L1 Data Cache
        2.3.4 Simulation Parameters and Machine Model
    2.4 Exploiting Update Silence for Performance Benefit
        2.4.1 Writeback Hierarchies
            2.4.1.1 Exploiting Update Silent Stores to Reduce Writebacks
            2.4.1.2 Suppressing Critical Update Silent Stores for Maximal Writeback Reduction
            2.4.1.3 Performance of Simple Suppression Techniques
        2.4.2 Write-Through Hierarchies
            2.4.2.1 Read Port Stealing Performance Benefits
            2.4.2.2 Load Store Queue Suppression Performance Benefit
            2.4.2.3 Increasing Write Through Bandwidth with Update Silent Store Suppression
            2.4.2.4 Discussion of Aggressive Update Silent Store Suppression Mechanisms
    2.5 Exploiting Temporal Silence To Improve Core Performance
    2.6 Related Work
Chapter 3: Multiprocessor Sharing Considering Store Value Locality
    3.1 Motivation and Background
    3.2 Update Silent Sharing
    3.3 Temporal Silent Sharing
    3.4 Understanding the Difference Between PTS, USS, and TSS
    3.5 Related Work
    3.6 Simulation Environment for Exploration Studies
Chapter 4: Update Silence in Multiprocessors
    4.1 Motivation and Background
    4.2 Potential for Data Transfer Elimination
    4.3 Address Traffic Considerations
    4.4 Critical Update Silent Stores
    4.5 Memory Consistency and Correctness Implications
        4.5.1 Memory Consistency Considerations
        4.5.2 Existing ISA Correctness Considerations
            4.5.2.1 PowerPC
            4.5.2.2 Other ISAs
    4.6 Related Work
Chapter 5: Temporal Silence in Multiprocessors
    5.1 Motivation and Background
    5.2 Potential for Data Transfer Elimination
    5.3 System-Level Considerations for Effectively Exploiting Temporal Silence
    5.4 Communicating Temporal Silence
        5.4.1 WC TSS
        5.4.2 The MESTI Coherence Protocol
            5.4.2.1 Temporal Silence Captured
            5.4.2.2 Writeback Elimination
            5.4.2.3 MESTI and TSS Compared
        5.4.3 Speculative Lock Elision (SLE)
        5.4.4 Temporal Silent Sharing (TSS)
    5.5 Detecting Temporal Silence
        5.5.1 In the Processor Core
        5.5.2 Limited Stale Storage Throughout the Memory Hierarchy
        5.5.3 Temporally Silent Pair Distance—The Key to Efficient Stale Storage
        5.5.4 Taking Advantage of Inclusive Memory Hierarchies
        5.5.5 Adding Explicit Stale Storage Designed To Detect Temporal Silence
        5.5.6 Eliminating Unnecessary Comparisons with Existing Value Summaries
    5.6 Critical Temporal Silence
    5.7 Efficiently Communicating Temporal Silence
        5.7.1 Reducing Address Traffic by Delaying Validates
        5.7.2 Predictive Snoop-Aware Validate
        5.7.3 Memory Address-Based Prediction
        5.7.4 Exploiting MESTI to Ease Handling of Update Silent Store Misses
    5.8 Memory Consistency and Correctness Implications
        5.8.1 Memory Consistency Considerations
        5.8.2 Existing ISA Correctness Considerations
        5.8.3 Correctly Implementing the MESTI Protocol
    5.9 Extending Temporal Silence Exploitation to Directory-based Systems
        5.9.1 Correctness Considerations
        5.9.2 Feasibility of Performance Optimization Techniques for MESTI
    5.10 Program Behavior
        5.10.1 TSS Contributed by Explicit Atomic Operations
        5.10.2 Why Temporal Silence Isn’t Only Due to Locks
        5.10.3 Intermediate and Temporally Silent Store Value Distributions
        5.10.4 Temporally Silent Program Behavior
    5.11 Characterization Data for Larger Multiprocessors
    5.12 Related Work
        5.12.1 Temporal Silence Exploitation in Multiprocessors/Multithreaded Architectures
        5.12.2 Sharing/Coherence Prediction in Multiprocessors
        5.12.3 Update Coherence Protocols
        5.12.4 Writeback Optimization and Memory System Optimization
Chapter 6: Multiprocessor Performance Evaluation
    6.1 Motivation and Background
    6.2 PHARMsim Overview and Simulation Parameters
        6.2.1 The PHARMsim Environment and Its Heritage
        6.2.2 Simulation Parameters for Performance Studies
    6.3 Simulator Verification and Simulation Methodology
        6.3.1 Simulator Verification
        6.3.2 Operating/Compilation Environment
        6.3.3 Simulation Methodology
    6.4 Application Benchmarks: Update Silence
    6.5 Understanding Performance Potential Through Microbenchmarking
    6.6 Application Benchmarks: Temporal Silence
        6.6.1 Basic MESTI Implementation
        6.6.2 Enhanced MESTI Implementation
        6.6.3 Comparison with Load Value Prediction (LVP)
        6.6.4 Combining Enhanced MESTI and Load Value Prediction (LVP)
        6.6.5 Comparison with Speculative Lock Elision
    6.7 Summary of Detailed Performance Evaluation
Chapter 7: Conclusion
    7.1 Contributions and Summary of Results
        7.1.1 Store Value Locality in Uniprocessors
        7.1.2 Update Silence in Multiprocessors
        7.1.3 Temporal Silence in Multiprocessors
    7.2 Future Research Directions
References
Appendix A: MOESTI Protocol
    A.1 Verification of the MOESTI Protocol and Implementation Considerations
    A.2 Detailed Description of the MOESTI Protocol Used in PHARMsim

List of Figures

FIGURE 1-1: The Growing Gap Between Processor Performance and Memory Latency
FIGURE 1-2: Basic Cache-Coherent Shared-Memory System Operation
FIGURE 2-1: Standard Store Verify/Suppression Pipeline Diagram
FIGURE 2-2: Read Port Stealing Pipeline Diagram
FIGURE 2-3: Block Level LSQ Cache Design
FIGURE 2-4: ECC Store Verify Pipeline Diagram
FIGURE 2-5: Illustration of L1 Data Cache ECC-Word Generation
FIGURE 2-6: Illustration of L1 Data Cache ECC-Word Generation with Suppression
FIGURE 2-7: Load and Store Datapaths for Redundant Data ECC
FIGURE 2-8: Load and Store Datapaths for Redundant Data ECC Through Duplication
FIGURE 2-9: Performance Comparison of Update Silent Store Suppression Techniques
FIGURE 2-10: Performance Comparison of Update Silent Store Suppression Techniques
FIGURE 2-11: Performance of Update Silent Store Suppression with Varying Write Buffers
FIGURE 2-12: Performance Improvement of Read Port Stealing
FIGURE 2-13: Dynamic Stores Verified Using Only Available Cache Read Ports
FIGURE 2-14: Performance of LSQ Suppression
FIGURE 2-15: Temporal LSQ Suppression Through WAR, WAW, and RAW Dependences
FIGURE 2-16: Sources of LSQ Store Verify Data
FIGURE 2-17: Percent Reduction of L1 to L2 Traffic with Advanced Suppression Techniques
FIGURE 2-18: Performance Comparison for Narrowing L1 to L2 Interfaces
FIGURE 2-19: Performance Sensitivity to Increased Write Buffering
FIGURE 3-1: Extending the Lifetime of a Shared Cache Line
FIGURE 3-2: Venn Diagram of Communication Miss Classification
FIGURE 3-3: Example Presentation of Reduction in Communication with USS and TSS
FIGURE 4-1: Percentage of Cache Misses for Different Definitions of Sharing
FIGURE 4-2: Percentage of Writebacks Removed By Suppressing Update Silent Stores
FIGURE 4-3: Multiprocessor Invalidation Reduction by Exploiting Update Silent Stores
FIGURE 4-4: Percentage of Update Silent Store Misses
FIGURE 4-5: Criticality of Store Misses
FIGURE 4-6: Dynamic Memory Footprint Contributing to Critical Update Silence
FIGURE 4-7: Illustration of Transitive RAW Edge in the Constraint Graph
FIGURE 4-8: Conditions for Guaranteed Detection of an Update Silent Store
FIGURE 4-9: A Code Sequence No Longer Providing Mutual Exclusion
FIGURE 4-10: A Code Sequence Which Can Detect Update Silent Store Suppression
FIGURE 5-1: Percentage of Cache Misses for Different Definitions of Sharing
FIGURE 5-2: Percentage of Communication Misses for Different Definitions of Sharing
FIGURE 5-3: Illustration of Timeliness Aspects in Temporal Silence
FIGURE 5-4: Globally Visible Cache Line Versions Required for Exploiting Useful TSS
FIGURE 5-5: Percentage of Communication Misses for WC TSS
FIGURE 5-6: State Machine for the MESTI Protocol
FIGURE 5-7: Percentage of Communication Misses for MESTI
FIGURE 5-8: Writebacks Removed by Exploiting Temporal Silence with MESTI
FIGURE 5-9: Effect of Limiting Stale Storage on Exploiting Temporal Silence
FIGURE 5-10: Dynamic Program Distances Between Useful Temporally Silent Pairs
FIGURE 5-11: Communication Misses with Different Inclusive Cache Hierarchies
FIGURE 5-12: L1-D Cache to L2 Transactions for Different Cache Configurations
FIGURE 5-13: Communication Miss Opportunity Lost Due to L1-D Cache Writebacks
FIGURE 5-14: Illustration of an Efficient Stale Storage Mechanism
FIGURE 5-15: Communication Misses for Different Combinations of Stale Storage
FIGURE 5-16: L1-D Cache Modified to Exploit ECC Information
FIGURE 5-17: L1-D Cache to L2 Transactions Exploiting ECC Value Summaries
FIGURE 5-18: Dynamic Memory Footprint Contributing to Critical Temporal Silence
FIGURE 5-19: Best Case Address Traffic Exploiting Temporal Silence
FIGURE 5-20: Temporally Silent Write to Fetch/Next Write Histograms
FIGURE 5-21: State Machine for Enhanced MESTI Protocol
FIGURE 5-22: Predictive Snoop-Aware Validate Characterization Data
FIGURE 5-23: Address-Based Useful Validate Predictor
FIGURE 5-24: Address-Based Useful Validate Predictor Characterization Data
FIGURE 5-25: Characterization Data for Different Predictor Sizes
FIGURE 5-26: Contribution of Atomic Operations Under Different Definitions of Sharing
FIGURE 5-27: Contribution of Atomic Operations Under Different Definitions of Sharing
FIGURE 5-28: Effect of Limiting Stale Storage on Exploiting Temporal Silence
FIGURE 5-29: Percentage of Communication Misses for MESTI
FIGURE 5-30: Cumulative Value Distributions for Useful TSS
FIGURE 5-31: Percentage of Cache Misses for Different Definitions of Sharing
FIGURE 5-32: Percentage of Communication Misses for Different Definitions
FIGURE 6-1: Block Diagram of the PHARMsim Verification Environment
FIGURE 6-2: Performance of Exploiting Update Silent Stores in Multiprocessors
FIGURE 6-3: Address Transactions Observed Exploiting Update Silent Stores
FIGURE 6-4: Understanding Temporally Silent Program Performance
FIGURE 6-5: Microbenchmark Pseudo-Code
FIGURE 6-6: Microbenchmark Execution Time for Varying Values of χ and ε
FIGURE 6-7: Microbenchmark Execution Time for Varying Values of θ and ε
FIGURE 6-8: Performance of MESTI Protocol Exploiting Temporal Silence
FIGURE 6-9: Address Transactions Observed with MESTI Protocol
FIGURE 6-10: Performance of Enhanced MESTI Protocol
FIGURE 6-11: Address Transactions Observed with Enhanced MESTI Protocol
FIGURE 6-12: Performance of Load Value Prediction (LVP)
FIGURE 6-13: Address Transactions Observed with Load Value Prediction (LVP)
FIGURE 6-14: Performance of Enhanced MESTI + LVP
FIGURE 6-15: Address Transactions Observed with Enhanced MESTI + LVP
FIGURE 6-16: Performance of Speculative Lock Elision (SLE)
FIGURE 6-17: Address Transactions Observed with SLE

List of Tables

TABLE 1-1: Dynamic Update Silent Stores for Uniprocessor and Multiprocessor Benchmarks
TABLE 1-2: Percentage of Temporally Silent Stores for 4-Processor Workloads
TABLE 2-1: Data Word Sizes, ECC-Check Bits, and Overhead for ECC Encoding
TABLE 2-2: Base Simulator Configurations for Write-Back and Write-Through Machines
TABLE 2-3: Writeback Reduction Between L1-D Cache and L2
TABLE 3-1: Example of Temporal Silent Sharing In a Multiprocessor System
TABLE 3-2: Understanding the Difference Between PTS, USS, and TSS
TABLE 3-3: Simulator Configuration for Characterization Studies
TABLE 3-4: Benchmark Characteristics and Description
TABLE 4-1: Illustration of Critical Update Silence in Multiprocessors
TABLE 4-2: Illustrating the Classification of Store Misses
TABLE 5-1: Code Example for TSS Indicating Multiple Intermediate Versions
TABLE 5-2: Code Example for MESTI
TABLE 5-3: Code Example for TSS Indicating Multiple Intermediate Versions
TABLE 5-4: Illustration of Critical Temporal Silence in Multiprocessors
TABLE 5-5: Address Traffic Increase and Last Write Statistics
TABLE 5-6: Illustration of Lost Opportunity with Predictive-Snoop Aware Validate
TABLE 5-7: Functions Actively Participating in Temporal Silence
TABLE 6-1: Simulated Machine Parameters
TABLE 6-2: Basic Application Benchmark Characteristics

Chapter 1

Introduction

Computers are pervasive in everyday life. We are continually surrounded by computer and microprocessor systems; from the computer which sits on the desktop in places of business, to the server which powers e-commerce, to the multitude of embedded microprocessors we encounter every day inside of household appliances, automobiles, and entertainment devices. Computer architects have exploited greater transistor budgets (afforded by semiconductor manufacturing improvements, i.e., “Moore’s Law” [84]) to continually improve performance and enable tighter levels of integration within computer systems.

However, this trend of increased integration has benefitted different parts of the modern computer system at different rates. Due to increased demand (from users) for both greater processor power and memory capacity, a disparity has developed between the rate at which microprocessor cores can perform computation and the rate at which memory systems can supply data to the core. There are many reasons for this trend, some related to fundamental physical phenomena (making a memory both large and fast is difficult or impossible), others to economics within the semiconductor industry (a desire to commoditize memory and achieve economies of scale). This trend has been quantified by Hennessy and Patterson [46] and has been termed the “memory-gap”. We reproduce their data here in Figure 1-1.

FIGURE 1-1. The Growing Gap Between Processor Performance and Memory Latency. The figure shows the compound growth in processor core (“CPU”) performance and reduction in main memory latency from 1980 through 2000. Note the logarithmic scale of the y-axis. Taken from Hennessy and Patterson [46].

It is obvious from the figure that increasing integration has benefitted microprocessor core performance at a greater compound rate than it has benefitted main memory latency. This growing gap between processor and memory speeds indicates that memory system performance will only become more important in future processors due to Amdahl’s Law [8], motivating the research of this thesis in improving memory system performance. Historically, architects have focused on improving the latency of memory loads (reads) as opposed to stores (writes), because loads are more prominent as a fraction of dynamic instructions executed and because stores are traditionally believed not to be on the critical path of execution, as they can be buffered and forwarded.
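For reference, Amdahl’s Law can be stated as follows; this is the standard formulation in our own notation, not an equation reproduced from [8]. If an enhancement applies to a fraction f of execution time and speeds that fraction up by a factor s, the overall speedup is

    \mathrm{Speedup}_{\mathrm{overall}} = \frac{1}{(1 - f) + f/s}

As core performance improves (large s applied to the computation fraction), the unimproved memory-stall fraction (1 - f) increasingly dominates total execution time, which is precisely why the memory-gap grows in importance.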

Indeed, loads constitute the dominant fraction of memory operations. For a PowerPC-like RISC instruction set architecture, memory operations contribute approximately 40% of dynamic instructions, with 25% being loads and 15% being stores [4]; we have observed similar proportions in our workloads. Although store operations contribute a smaller percentage of memory operations in general, handling each dynamic store is difficult. Within a processor core, complicated load-store queue structures are required to handle memory aliases, first level caches must be specially designed to handle the destructive nature of stores, and true dependences through memory, i.e., store-to-load forwarding, can cause load operations to be delayed. We will further discuss these issues in Section 2.1; obviously, store performance is important in optimizing memory systems even though stores contribute a smaller fraction of dynamic memory operations.

1.1 Performance Impact of Communication Misses in Multiprocessor Systems

Most small/medium-scale multiprocessor systems are cache-coherent, shared-memory systems which provide the programming abstraction of a single, coherent memory image shared among all processors in the system. This abstraction is useful because it allows all data in memory to be shared among processors automatically without explicit messages from the programmer [31].

Creating a high-performance memory system for such machines is complicated for many reasons, relating to both performance and correctness. A fundamental correctness consideration in shared-memory systems is maintaining a coherent shared memory image among processors. Stores performed by a processor must become visible to other processors in the system through some mechanism; in multiprocessor systems with caching, store visibility is a key aspect of what we commonly call the cache coherence problem.

FIGURE 1-2. Basic Cache-Coherent Shared-Memory System Operation. The figure indicates an initial state where Address A is cached by both CPU 1 and CPU 2 (A). In order to propagate new values to other processors and maintain a coherent view of memory, stores send a request on the coherent interconnect invalidating remote cached copies of Address A (B).

We illustrate the problem and a potential solution in Figure 1-2. In part (A) of the figure, we show a two-processor system with a single cache for each processor. Both caches have a copy of the data stored at Address A in their local cache to enable low-latency access to the data. However, if each processor has its own copy of the data stored at Address A, how does the system ensure that a write to that location is made visible to other processors? A commonly implemented solution is illustrated in part (B) of the figure. When a store operation is to be performed (by CPU 1), it sends a request out to the other processors in the system (CPU 2) indicating that they should invalidate their copy of Address A. Once the invalidation is carried out, CPU 1 is the only processor with a copy of Address A, hence it is safe to store to it. Another processor observes this new write if it later accesses Address A; it experiences a cache miss, sends a request on the coherent interconnect, and obtains the updated data from CPU 1. Cache misses induced by sharing of data between processors are variously called communication, coherence, or dirty misses in the literature [31]. Such misses are considered fundamental in multi-threaded program execution because true sharing of data is occurring between processors (for a detailed classification of different types of cache misses in uniprocessors, see Hill’s thesis [48]; a discussion of communication misses is available in Culler and Singh [31]).
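To make the sequence in Figure 1-2 concrete, the sketch below is our own minimal model in C, not code from the thesis; the MSI-like state names and helper functions are illustrative assumptions, and the memory-update path on dirty sharing is simplified. It simulates two CPUs caching Address A, the invalidation a store performs, and the communication miss that follows.

    #include <stdio.h>

    typedef enum { INVALID, SHARED, MODIFIED } state_t;
    typedef struct { state_t state; int value; } cache_line_t;

    static cache_line_t cache[2];   /* one line per CPU, all for Address A */
    static int memory_A = 0;        /* backing store for Address A         */

    static void store(int cpu, int value) {
        int other = 1 - cpu;
        if (cache[cpu].state != MODIFIED) {
            /* Broadcast an invalidate so the writer holds the only copy. */
            if (cache[other].state == MODIFIED)
                memory_A = cache[other].value;   /* write back dirty data */
            cache[other].state = INVALID;
            cache[cpu].state = MODIFIED;
        }
        cache[cpu].value = value;
    }

    static int load(int cpu) {
        int other = 1 - cpu;
        if (cache[cpu].state == INVALID) {       /* communication miss    */
            int v = (cache[other].state == MODIFIED) ? cache[other].value
                                                     : memory_A;
            if (cache[other].state == MODIFIED)
                cache[other].state = SHARED;     /* owner supplies data   */
            cache[cpu].state = SHARED;
            cache[cpu].value = v;
        }
        return cache[cpu].value;
    }

    int main(void) {
        load(0); load(1);   /* both CPUs cache Address A, as in part (A)  */
        store(0, 42);       /* CPU 1 invalidates CPU 2's copy, part (B)   */
        printf("CPU 2 reads %d after a communication miss\n", load(1));
        return 0;
    }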

Researchers have reported that about one-half of all off-chip memory accesses are due to communication misses in current generation systems [9] [75]. As technology trends continue to increase the number of processors in a single system, the probability of communication between processors increases. Furthermore, these same technology trends also allow increasing the cacheable memory space within the system, eliminating many capacity/conflict misses. Therefore, we expect communication misses to maintain or grow their share of impact on memory system performance in the future. Obviously, eliminating such misses will improve system performance.

Note that communication misses exist because of the need to communicate new values through memory; these new values are created by store operations. Therefore, it should be apparent that store operations, even though they contribute a small fraction of dynamic instructions, are costly in terms of their impact on microprocessor core design and their detrimental effect on system performance in multiprocessor systems. Traditionally, architects have exploited locality in the temporal and spatial directions for memory references in order to improve performance of costly memory operations [46].

1.2 Exploiting Value Locality to Improve Memory System Performance

Recently, a new locality dimension has emerged which we can exploit as computer architects. Value locality has been widely studied since the seminal work was presented by Lipasti [73] [72] [81]. Value locality describes the program behavior phenomenon that values tend to recur throughout program execution or are trivially predictable. Prior to this seminal work, architects had primarily focused on improving dataflow between dependent operations to improve execution performance, without regard for the values actually being communicated through the dataflow. However, it is the values themselves which are required to perform computations; therefore, by exploiting locality of values we can potentially improve performance, even beyond the previous dataflow-limit rate of execution [73].

Prior studies examining value locality of memory operations are principally concerned with predicting load values, which can be thought of as accelerating value consumption. Surprisingly, value locality has been explored very little for store operations, which is, in effect, memory value production. Accelerating production of values also accelerates consumption, but viewing the problem from this different angle leads to novel methods to improve performance and observations of program behavior.

In this thesis, we explore the value locality of stores. If this locality exists, intuitively we may be able to exploit it to simplify and accelerate memory communication. Understanding the value behavior of stores—the only way true, persistent program output can be communicated—may also illuminate parts of the fundamental computation performed in machines.

Throughout the next few sections, we briefly introduce the dimensions of store value locality explored in this thesis.

1.3 Update Silent Stores

Table 1-1: Dynamic Update Silent Stores for Uniprocessor and Multiprocessor Benchmarks.

Benchmark        Description                                      Update Silent Stores (PPC/SS)
go               SPEC95 game                                      38%/27%
m88ksim          SPEC95 simulator                                 68%/62%
gcc              SPEC95 compiler                                  53%/46%
compress         SPEC95 compression                               42%/39%
li               SPEC95 lisp interpreter                          34%/20%
ijpeg            SPEC95 image compression                         43%/33%
perl             SPEC95 language interpreter                      49%/36%
vortex           SPEC95 object database                           64%/55%
tomcatv          SPEC95 vectorized mesh generation                47%/33%
swim             SPEC95 shallow water equations                   34%/26%
mgrid            SPEC95 3D potential field                        23%/7%
applu            SPEC95 partial differential equations            37%/35%
apsi             SPEC95 weather prediction                        21%/25%
fpppp            SPEC95 Gaussian quantum chemistry                15%/15%
wave5            SPEC95 Maxwell's equations                       25%/22%
gzip             SPEC2000 file compression                        28%/8%
vpr              SPEC2000 FPGA circuit placement and routing      44%/35%
mcf              SPEC2000 combinatorial optimization              65%/64%
parser           SPEC2000 word processing                         52%/35%
bzip2            SPEC2000 file compression                        8%/21%
barnes 4proc     SPLASH-2 N-body simulation [107]                 39%
ocean 4proc      SPLASH-2 Ocean simulation                        22%
radiosity 4proc  SPLASH-2 Light Interaction simulation            42%
raytrace 4proc   SPLASH-2 Raytrace with teapot input set          38%
tpc-h 4proc      decision support (TPC-H query 12) [106]          73%
tpc-w 4proc      TPC-W shopping mix [18]                          68%
specjbb 4proc    SPECJBB2000 Java benchmark [18]                  35%
specweb 4proc    SPECWEB99 benchmark [18]                         37%

The first type of store value locality (SVL) we explore in this thesis is update silent stores. We call a store update silent if it writes the same value which already exists at the memory location of interest (and hence creates no update/change in system state).

Table 1-1 indicates the percentage of dynamic update silent stores for a sample of uniprocessor and multiprocessor benchmarks; we show data for all benchmarks which have been set up and run in our simulation environments. Two ISAs (SimpleScalar PISA [14] and PowerPC [56]) are shown for the uniprocessor benchmarks, while the multiprocessor benchmarks are studied under PowerPC only, as we do not have multiprocessor versions of these benchmarks for SimpleScalar.¹ The key point is that significant store value locality exists. For the programs in the table, a range of 7% to 73% of all dynamic stores are update silent, across single-threaded and multi-threaded programs and ISAs. This result is surprising—on average, approximately 40% of memory writes contribute no new information to the state of the system. This thesis will explore causes for this phenomenon, important ISA and program considerations, and methods of exploiting update silence for improved performance in both uniprocessor (Chapter 2) and multiprocessor (Chapter 3, Chapter 4) systems. In previous work, we referred to update silent stores as simply silent stores; however, to differentiate between update silent stores and other types of store value locality which we illuminated subsequently (to be introduced in the next sections), we encourage the usage of the term update silent store throughout future literature to eliminate ambiguity. The term silent store (with no modifier) should be interpreted to mean any form of store value locality. For brevity, the term silent store may be used without modification as long as the store value locality of interest is unambiguous or all types of store value locality are implied.

1.4 Temporally Silent Stores

Update silent stores do not change the system state. A natural extension is to consider stores which do change the state (and so are not update silent as described previously), but do in fact change the state back to some previous value. We call stores writing such values temporally silent stores. In Table 1-2 we show the percentage of stores which are temporally silent² across the 4-processor benchmarks explored in depth in this thesis.

1. We discuss differences observed between SimpleScalar and PowerPC throughout Chapter 2, and specifically in Section 2.6.
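To make the distinction between the two forms of store value locality concrete, the following sketch is our own illustration in C (the lock-like access pattern is an assumed example, anticipating the characterization in Chapter 5); it exercises both behaviors on a single memory location.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t mem = 0;            /* location initially holds 0          */
        uint32_t prev;

        /* Update silent store: writes the value already present. */
        prev = mem;
        mem  = 0;
        printf("update silent: %d\n", mem == prev);

        /* Temporally silent pair: an intermediate store changes the value,
         * and a later store reverts it to the earlier, previously visible
         * version -- the pattern of, e.g., a lock acquire and release.    */
        uint32_t old_version = mem;
        mem = 1;                     /* intermediate store (e.g., acquire)  */
        mem = 0;                     /* temporally silent store (release)   */
        printf("temporally silent: %d\n", mem == old_version);
        return 0;
    }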

Although we observe that the percentage of dynamic temporally silent stores is significantly less than update silent stores, we find that exploiting these temporally silent stores in multiprocessor systems can greatly reduce communication misses. On average, 40% of communication misses in our commercial workloads can be removed by exploiting temporal silence. We redefine multiprocessor sharing to correctly consider temporal silence in Chapter 3. Furthermore, we explore the phenomenon of temporal silence in detail through program-level characterization and also design methods to eliminate communication misses caused by temporally silent stores in Chapter 5.

2. The definition of temporal silence used throughout this thesis is more restrictive than described here. For now, we only seek to indicate the general idea; details are given in Section 5.1.

Table 1-2: Percentage of Temporally Silent Stores for 4-Processor Workloads. Instructions are measured excluding the operating system idle loop. Temporally silent stores are measured with respect to the previous globally visible version of the cache line (MESTI exploitable) as described in Lepak et al. [70].

Benchmark   Instr.   Stores   US Stores    TS Stores     Description
barnes      1.76B    304M     117M (39%)   3.4M (1.1%)   SPLASH-2 N-body simulation (8K particles)
ocean       561M     36M      8.0M (22%)   1.8M (5.4%)   SPLASH-2 Ocean simulation (258x258)
radiosity   2.43B    326M     138M (42%)   1.1M (0.34%)  SPLASH-2 Light Interaction application (-room -ae 5000.0 -en 0.050 -bf 0.10)
raytrace    414M     45M      17M (38%)    1.2M (2.6%)   SPLASH-2 Raytracing application (teapot)
specweb     4.60B    624M     233M (37%)   23M (3.7%)    Commercial Web-Serving application
specjbb     3.58B    431M     152M (35%)   22M (5.1%)    Commercial Server-Side Java application
tpc-h       5.05B    551M     401M (73%)   19M (3.4%)    Commercial decision support (query 12)
tpc-w       1.41B    340M     231M (68%)   9.2M (2.7%)   Commercial Web-Based OLTP application

1.5 Thesis Overview and Summary of Contributions

Our examination of store value locality begins with a detailed exploration of update silent stores in uniprocessor systems—exploring architectural performance improvement possibilities, efficient methods of detecting them, and a brief summary of characterization/program analysis studies which provide insight (Chapter 2). We will then explore how multiprocessor sharing can be redefined to consider store value locality. These studies indicate potential to improve multiprocessor system performance by eliminating communication misses and address traffic (Chapter 3). With terminology in place, we will then discuss exploiting update silent stores and temporally silent stores in multiprocessors in detail (Chapter 4, Chapter 5), providing extensive characterization and performance projections to guide the methods chosen to exploit store value locality. Having characterized store value locality in detail, we present an evaluation in snoop-based multiprocessors with full-system, execution-driven simulation (Chapter 6). Finally, we summarize and conclude (Chapter 7).

This thesis constitutes a detailed examination of store value locality, exploiting it in uniprocessors and multiprocessor systems using both scientific and commercial workloads. It contributes the following to the state of the art:

• Explores, in depth, the locality of memory value production (in contrast to extensively explored memory value consumption), providing an additional perspective from which to view memory value patterns.

• Demonstrates that significant store value locality exists, and from which viewpoints it can be most effectively exploited (as described briefly in Section 1.3 and Section 1.4), contributing a new direction for value locality-related research.

• Illustrates novel microarchitectural methods for detecting update silent stores and integrating these methods into processor cores for uniprocessor and multiprocessor performance enhancement.

• Redefines multiprocessor sharing to account for store value locality, illuminating substantial opportunity to eliminate communication misses, providing a new class of avoidable sharing misses previously unexplored.

• Illustrates novel microarchitectural methods for detecting temporal silence and integrating these methods (speculatively and non-speculatively) into both processor cores and system architectures at varying levels of difficulty to substantially reduce communication misses.

• Contributes an understanding of critical update/temporally silent stores and how these interact with cache-coherent multiprocessor systems within the microprocessor core, in the coherence interconnect, and the data interconnect.

• Characterizes store value locality in multithreaded programs for both scientific and commercial workloads, allowing a program-level understanding of why store value locality exists. These studies illuminate new aspects of communication patterns and the nature of computation performed in programs.

Chapter 2

Exploiting Update Silence in Uniprocessors

In Chapter 1 we introduced the concepts of update silent stores and temporally silent stores. In Section 1.3, we showed that the fraction of dynamic update silent stores is substantial across ISAs, compilation technologies, and varying types of workloads. Obviously, update silence pervades program execution and we should explore its utility to improve performance. In this chapter, we focus on exploiting update silence in a uniprocessor system to improve memory performance. We discuss two distinct types of memory hierarchies (write-back and write-through) and how update silence can be effectively exploited in each type of hierarchy. We also explore core load/store handling modifications that exploit update silence. We then give a brief survey of techniques for exploiting temporal silence in uniprocessors and mechanisms of exploiting it for further core memory hierarchy improvements. We conclude with a summary of the results and a survey of related work.

2.1 Motivation and Background

Modern uniprocessor load/store handling units are normally complex CAM (content-addressable memory), priority-associative structures which preserve uniprocessor memory order. These structures allow common optimizations such as store-to-load forwarding, hoisting of loads past unresolved stores, detection of consistency model violations, and other in-core memory performance enhancements. Often, the appropriate trade-off to ease complexity of these structures is to limit the number of in-flight store operations [25, 30, 55, 83]. Eliminating stores which communicate no meaningful change in program state may simplify these structures, and even improve dataflow performance in machines which detect and stall on true store-to-load dependences by allowing loads to advance around stores which are either predicted to be, or are verified to be, update silent [111]. Reducing the number of cache write ports is also possible and is desirable because special consideration is necessary to support multiple writes per clock cycle. Generally, handling multiple writes requires explicit multi-porting of the SRAM or multi-banking, which complicates either the array design or scheduling in the processor core for bank-conflict resolution. Very often, the trade-off chosen in implementations is to constrain the number of stores which can be performed on each processor cycle. Furthermore, due to first level cache designs emphasizing minimum load latency, stores require special handling or multiple trips down the data cache pipeline to determine cache presence before a write can actually be performed into the array. Again, removing unnecessary stores may ease array complexity and reduce memory system occupancy. Finally, eliminating store operations can reduce cache line writebacks/write-throughs and decrease buffering requirements and bandwidth normally used to handle these events.

FIGURE 2-1. Standard Store Verify/Suppression Pipeline Diagram. A store operation is converted into a set of load, compare, and conditional-store operations to facilitate update silent store suppression.

2.2 A Simple Microarchitectural Method to Exploit Update Silence

In these uniprocessor studies, we have used a heavily modified version of SimpleScalar 3.0 [14] to study efficient methods of detecting and exploiting update silent stores (detailed machine configuration is given in Section 2.3.4). Before presenting performance data, we describe how to detect update silent stores.

2.2.1 Naive Update Silent Store Suppression

We call removal of update silent stores from the dynamic instruction stream update silent store suppression. Update silent store suppression describes the overall process of eliminating the write side-effects of an update silent store; store verification refers to the subtask of detecting that a store is update silent. A simple way to suppress update silent stores is to convert each dynamic store operation into three operations—a load, compare, and conditional store depending on the outcome of the comparison. This method achieves the subtask of store verification by explicitly performing a load operation to verify the store is update silent; suppression is realized by avoiding the actual write operation at commit for update silent stores. A simplified pipeline diagram of this approach is shown in Figure 2-1.
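A minimal sketch of the naive scheme follows; this is our own C rendering, not the simulator’s implementation. Each store becomes a load, a compare, and a conditional store, with the write skipped at commit when the comparison matches.

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if the store was suppressed as update silent. */
    bool store_verify(uint32_t *addr, uint32_t new_value)
    {
        uint32_t old_value = *addr;     /* 1. load: explicit store verify    */
        if (old_value == new_value)     /* 2. compare against store operand  */
            return true;                /*    update silent: suppress write  */
        *addr = new_value;              /* 3. conditional store at commit    */
        return false;
    }

    int main(void)
    {
        uint32_t mem = 7;
        return store_verify(&mem, 7) ? 0 : 1;  /* silent: write suppressed */
    }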

Of course, a major drawback of this simple approach is that it converts every dynamic store into multiple operations. Minimally, two operations are required (load and compare) when the store is update silent; up to three (load, compare, and store) when the store is not update silent. This has the obvious effect of increasing cache port pressure and power consumption when a store is not update silent (because of two cache accesses), but can improve performance when a store is update silent as discussed in Section 2.1 for microarchitectures which have a high cost for stores. We explore the relative cost of store operations for different architectures, with special consideration for soft-error tolerance, in detail in Section 2.3.

2.3 Advanced Methods For Detecting Update Silence

We can enable update silent store suppression more practically by reducing the cost of store verification. We explore many such methods in this section. All of these methods can be implemented at a substantially reduced cost as compared to naive store verifies.

2.3.1 Read Port Stealing

FIGURE 2-2. Read Port Stealing Pipeline Diagram. Read port stealing performs a load and compare only if a cache port is idle.

It is well known that programs are non-uniform in the usage of system resources. Therefore, in many cases, some available idle resources can be used for other purposes. We propose an additional use of idle resources; namely, exploiting free cache read ports to implement store verifies. This mechanism is a simple extension of the standard store verify explained in Section 2.2.1. Since stores must commit in order, it is possible that due to a pipeline stall a store can wait in the LSQ for a long period of time before it completes. If a load port becomes free while the store is waiting to commit, the load port can be used to perform a store verify operation. Because the resources used for verification are idle and available, these store verifies are low-cost. If a load port never becomes available before the store is ready to commit, the pipeline does not attempt to suppress the store and handles it as if it is non-silent.

Relative to naive store verifies, this method has the benefit of not delaying execution of load operations due to resource conflicts. However, it can create additional instruction scheduling difficulties because the policy for issuing a store verify is dependent on resource usage and not just program order or another static scheduling policy. In machines with unified load/store schedulers, such contention for finite resources must already be handled; however, with separate load/store schedulers, coordinating access to the memory read ports to ensure collision avoidance between loads and store verifies may prove more difficult. We assume a single load/store scheduler throughout this thesis. We note that both memory scheduler types are common throughout commercial processors, indicating that assuming a single load/store scheduler is realistic. The read port stealing technique is shown in Figure 2-2.
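The following sketch (our own behavioral model; the structure and helper names are assumptions, not the thesis’s hardware) shows the policy: a store waiting to commit is verified only when a cache read port would otherwise go idle, and an unverified store is simply treated as non-silent at commit.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t cache_value;  /* value currently cached at the address   */
        uint32_t store_value;  /* value the waiting store wants to write  */
        bool     verified;     /* a stolen-port verify has been performed */
        bool     silent;       /* verify outcome, valid only if verified  */
    } pending_store_t;

    /* Called once per cycle with the count of read ports loads left idle;
     * each idle port verifies one store waiting to commit in the LSQ.   */
    void steal_read_ports(pending_store_t *lsq, int n_stores, int idle_ports)
    {
        for (int i = 0; i < n_stores && idle_ports > 0; i++) {
            if (!lsq[i].verified) {
                lsq[i].silent   = (lsq[i].cache_value == lsq[i].store_value);
                lsq[i].verified = true;
                idle_ports--;                 /* one stolen port consumed */
            }
        }
    }

    /* At commit: suppress only stores already verified silent; unverified
     * stores are handled as non-silent (no verify attempted at commit). */
    bool commit_suppresses(const pending_store_t *s)
    {
        return s->verified && s->silent;
    }

    int main(void)
    {
        pending_store_t lsq[2] = { {7, 7, false, false},
                                   {3, 5, false, false} };
        steal_read_ports(lsq, 2, 1);   /* only one idle port this cycle */
        printf("store 0 suppressed: %d\n", commit_suppresses(&lsq[0]));
        return 0;
    }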

2.3.2 Load/Store Queue

In order to obtain high performance, many processors implement aggressive memory systems which require load/store queues (LSQs) to perform store-to-load forwarding, monitor speculative load operations which may be violations of the architected consistency model, and other in-core memory optimizations. Since the LSQ provides a window for temporally local references in program order and further enables elaborate memory disambiguation and value forwarding mechanisms to honor program order memory dependences for aggressive processor designs, we can exploit the hardware within the LSQ to obtain reduced cost store verification as well.

2.3.2.1 Temporal Locality in the LSQ

Store-to-load forwarding between uncommitted loads and stores is a commonly implemented optimization in modern microprocessors. If store forwarding is implemented, we can extend it to suppress later stores to the same address as an earlier store in the LSQ (WAW dependence through memory). We may be able to do so without using a cache read port in some architectures (by serializing store verification’s access to the LSQ and the memory hierarchy), hence making the suppression low-cost.

In a similar fashion, we can also suppress stores to memory addresses for which an outstanding load exists in the LSQ. This is possible because the cache access for the load will be performed, obtaining the data value for the store verify. In some sense, we can consider the store verify for the store to be “piggy-backed” on the explicit load operation to the same memory address (WAR dependence). Note that this optimization is also possible for loads which occur later in program order, which generally would have their load value forwarded from the previous store we’re trying to suppress. This is possible because the usage of the cache port is usually scheduled before it is known whether the value will be forwarded from an earlier store in the LSQ [25], [105]. Therefore, since the load has been scheduled for cache access anyway, the load can still be performed at no cost. Hence, the store verify is again low-cost in the case of a RAW memory dependence.
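A compact sketch of the idea follows (our own illustration; the LSQ entry layout is an assumption). A store searches same-address LSQ entries whose value is already known: an earlier store (WAW) or a load (WAR/RAW) supplies the data needed for the verify, so no extra cache read is required.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef enum { LSQ_LOAD, LSQ_STORE } op_t;

    typedef struct {
        op_t     op;
        uint64_t addr;
        uint32_t value;       /* load result, or value a store will write */
        bool     value_valid;
    } lsq_entry_t;

    /* Try to verify a store against existing LSQ entries; returns true and
     * sets *silent if a same-address entry supplies the memory value.     */
    bool lsq_verify(const lsq_entry_t *lsq, int n, uint64_t addr,
                    uint32_t store_value, bool *silent)
    {
        for (int i = n - 1; i >= 0; i--) {      /* youngest to oldest */
            if (lsq[i].addr == addr && lsq[i].value_valid) {
                /* WAW: an earlier store's value is what memory will hold;
                 * WAR/RAW: a load to the address read the memory value.  */
                *silent = (lsq[i].value == store_value);
                return true;
            }
        }
        return false;  /* no match: fall back to another verify mechanism */
    }

    int main(void)
    {
        lsq_entry_t lsq[2] = {
            { LSQ_LOAD,  0x1000, 5, true },     /* earlier load read 5 */
            { LSQ_STORE, 0x2000, 9, true },
        };
        bool silent;
        if (lsq_verify(lsq, 2, 0x1000, 5, &silent))
            printf("store to 0x1000 verified, silent=%d\n", silent);
        return 0;
    }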

2.3.2.2 Spatial Locality in the LSQ

In a similar fashion, we can expand the scope of suppressible stores within the LSQ to addresses that inhabit the same cache line. Given that L1 data caches are on-chip, obtaining wide access to these caches is relatively easy. Therefore, each memory operation can read an entire cache line on any reference because of the high bandwidth available from the L1 cache. Assuming that a memory access reads the entire line from the cache into an LSQ cache (shown in Figure 2-3), the spatially local data can be used to perform additional suppression.

FIGURE 2-3. Block Level LSQ Cache Design. The temporal and spatial LSQ suppression operations, data allocation, and store forwarding are illustrated for memory operations.

In the case of a WAW dependence, a previous store to the line reads the line into the LSQ cache, and all subsequent stores to that line can be verified from the LSQ cache. In the case of WAR and RAW dependences, a similar process occurs—the load operation allocates the line in the LSQ cache, and stores to the same line are verified from it. We will show in Section 2.4.2.2 that a small LSQ cache is especially effective in the case of WAW dependences.

The LSQ cache is similar to the write cache proposed by Jouppi [52], except it contains entire cache lines as opposed to 8-byte quantities and it buffers both load-allocated and store-allocated lines. Also note that since issuing stores is generally not as time critical as issuing loads (because the stores can be buffered at commit), the lookup in the LSQ cache and the access to the memory system (shown in Figure 2-3) can be serialized to avoid unnecessary usage of the data cache port. We can also exploit read port stealing (Section 2.3.1) and only read data for stores into the LSQ cache if a memory read port is available. With read port stealing, a separate valid bit for both the LSQ cache line data and for the entries in the LSQ themselves (shown in Figure 2-3) is needed because a store may fail to acquire a free read port, leaving the data invalid. When an access allocates a line into the LSQ cache, stores already present in the LSQ can be verified with the newly allocated data, but this may add complexity to the LSQ and LSQ cache for additional data paths. We will discuss this further in Section 2.4.2.2.
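The sketch below (our own simplified model; sizes, names, and the little-endian byte extraction are assumptions) captures the two operations just described: any load or store that reads the cache allocates its whole line into the LSQ cache, and a later store to the same line is verified from the buffered copy without touching the data cache port.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LINE_BYTES   64
    #define LSQC_ENTRIES 4

    typedef struct {
        uint64_t tag;               /* line address                      */
        bool     data_valid;        /* false if no read port was stolen  */
        uint8_t  data[LINE_BYTES];  /* full line captured on allocation  */
    } lsqc_line_t;

    static lsqc_line_t lsqc[LSQC_ENTRIES];

    /* On any load/store that read the cache, capture its whole line. */
    void lsqc_allocate(int slot, uint64_t line_addr, const uint8_t *line)
    {
        lsqc[slot].tag = line_addr;
        memcpy(lsqc[slot].data, line, LINE_BYTES);
        lsqc[slot].data_valid = true;
    }

    /* Verify a later 32-bit store to the same line (little-endian host). */
    bool lsqc_verify(uint64_t addr, uint32_t value, bool *silent)
    {
        uint64_t line_addr = addr & ~(uint64_t)(LINE_BYTES - 1);
        for (int i = 0; i < LSQC_ENTRIES; i++) {
            if (lsqc[i].data_valid && lsqc[i].tag == line_addr) {
                uint32_t cur;
                memcpy(&cur, &lsqc[i].data[addr & (LINE_BYTES - 1)], 4);
                *silent = (cur == value);
                return true;
            }
        }
        return false;   /* line not buffered: cannot verify from LSQ cache */
    }

    int main(void)
    {
        uint8_t line[LINE_BYTES] = {0};
        line[8] = 42;
        lsqc_allocate(0, 0x1000, line);        /* a load brought in the line */
        bool silent;
        if (lsqc_verify(0x1008, 42, &silent))  /* later store, same line     */
            printf("verified from LSQ cache, silent=%d\n", silent);
        return 0;
    }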

If LSQ cache entries are a logical extension of the existing LSQ,¹ explicit tags and dirty bits for LSQ cache entries are unneeded because all necessary address and dirty value forwarding is already available in the LSQ entries for store-forwarding. In the case of a weakly ordered memory consistency model, it is sufficient for correctness to flush the LSQ cache on memory barriers (this is most likely very effective because of the small LSQ cache size) and avoid snooping it for invalidates. In more strict consistency models, snooping the LSQ may already be required to detect consistency model violations, so snooping the LSQ cache as well adds no additional complexity [110].

1. i.e., entries in the LSQ cache are operated in lock-step with the entries in the LSQ such that when an entry leaves the LSQ its LSQ cache line is also deallocated.

The benefits of store verification using the LSQ relative to standard store verifies are apparent. No additional cache access is required for the load portion of the store verify.

2.3.3 ECC Update Silent Store Suppression

We now explore methods of update silent store verification and store suppression with specific consideration of mechanisms for handling soft errors, which are becoming more prevalent as processor cores target implementation technologies with smaller feature sizes. We begin with a brief review of ECC as it relates to store verification, and then describe how update silent store suppression can both be easily implemented, and provide performance benefit, in different ECC architectures.

2.3.3.1 ECC in Modern Microarchitectures

With soft errors in modern microprocessors becoming a larger concern as we move to deeper sub-micron fabrication technologies and higher reliability systems [100, 28, 79, 101, 112, 38], microprocessor designers are protecting the areas of a chip which are most densely packed with transistors, e.g., caches, memories, etc., against random alpha-particles and other causes of soft errors. Error checking and correcting (ECC) codes are a very common method for protection against soft errors. ECC using various encoding schemes (we focus on the SEC-DED variety of Hamming-based codes [11, 97], but the comments made here apply more generally) requires some number of data bits and check bits to enable the correction of errors. No simple closed-form function expresses the number of check bits required for a given number of data bits; however, we show the number of check bits required for common data sizes and ECC-word sizes in Table 2-1.

Table 2-1: Data Word Sizes, ECC-Check Bits, and Overhead for ECC Encoding.

Data-word Size (bits)   ECC Check Size (bits)   ECC-word Size (bits)   ECC Check Bit Overhead
8                       5                       13                     62.5%
16                      6                       22                     37.5%
32                      7                       39                     21.9%
64                      8                       72                     12.5%
128                     9                       137                    7.0%
256                     10                      266                    3.9%

There is an obvious trade-off between the granularity on which ECC is kept (data-word size) and the overhead of the check bits. In the case of 13 bit ECC-words (8 data bits), there is a 62.5% increase in storage space as overhead for ECC. For progressively larger ECC-words, the overhead is reduced—down to 3.9% in the case of 266 bit ECC-words (256 data bits). However, this lower overhead does not come without penalty. Only a single bit error can be corrected and a double bit error detected within the entire ECC-word. Of course, as ECC-word size increases, the probability of multiple errors within a word increases, so ECC is less effective for larger words and a design compromise must be reached. In general, fairly large ECC-word sizes are chosen to minimize overhead and obtain acceptable error coverage. In many modern microprocessors and system busses, 64 bit data-word ECC or larger is used for ease of implementation and because of the configuration of memory systems [26, 28]. As a point of reference, the Alpha 21264 [55] and the PowerPC RS64-III [13] implement L1 data cache ECC on quadword (64 bit) data quantities.
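The check-bit counts in Table 2-1 follow directly from the Hamming bound: a SEC code needs the smallest c satisfying 2^c >= d + c + 1, and SEC-DED adds one more parity bit for double-error detection. The short program below (our own, written only to reproduce the table) computes each row.

    #include <stdio.h>

    /* Smallest c with 2^c >= d + c + 1, plus one parity bit for DED. */
    static int secded_check_bits(int data_bits)
    {
        int c = 1;
        while ((1 << c) < data_bits + c + 1)
            c++;                   /* smallest c satisfying Hamming bound */
        return c + 1;              /* +1 overall parity bit for DED       */
    }

    int main(void)
    {
        for (int d = 8; d <= 256; d *= 2) {
            int c = secded_check_bits(d);
            printf("%3d data bits: %2d check bits, %3d-bit ECC word, "
                   "%.1f%% overhead\n", d, c, d + c, 100.0 * c / d);
        }
        return 0;
    }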

The check bits for a data-word are generated when a value is stored into the cache and compared when the value is later read (more detail in Section 2.3.3.2). In order to generate correct check bits, all bits in the ECC-word must be available as input to the ECC generation logic. Therefore, if a write operation is either improperly aligned on ECC-word boundaries or is a sub-ECC-word sized write, the rest of the original ECC-word stored at the location must be fetched, the changes (from the current write) merged, the new check bits calculated, and the new ECC-word stored.

FIGURE 2-4. ECC Store Verify Pipeline Diagram. The ECC store verify occurs at commit.

We can see that in many cases the store operation into an ECC-protected cache really consists of four operations: read original ECC-word, store merge, ECC check bit generation, and new ECC-word store. This realization illuminates the possibility of a new type of low cost update silent store suppression. Since the original ECC-word is read anyway, a comparison of the new store value to the original value can be performed, allowing suppression of update silent stores.
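The following sketch is our own behavioral model of folding the verify into that four-step sequence for a 32-bit store into a 64-bit ECC word; the check-bit function is a toy stand-in for a real SEC-DED XOR tree, and the error-correction path (during which suppression must not occur, as discussed below) is omitted.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t array[16];   /* stand-in for one cache data bank */
    static uint8_t  check[16];   /* per-word ECC check bits          */

    /* Toy check-bit function standing in for the real SEC-DED XOR tree. */
    static uint8_t ecc_generate(uint64_t d)
    {
        uint8_t c = 0;
        while (d) { c ^= (uint8_t)(d & 0xFF); d >>= 8; }
        return c;
    }

    /* 32-bit store into a 64-bit ECC word; returns true if suppressed. */
    static bool ecc_store(int index, int upper_half, uint32_t data)
    {
        uint64_t old_word = array[index];           /* 1. read old ECC word */
        int      shift    = upper_half ? 32 : 0;
        uint64_t merged   = (old_word & ~(0xFFFFFFFFULL << shift))
                          | ((uint64_t)data << shift);  /* 2. store merge   */
        if (merged == old_word)   /* compare overlaps check-bit generation */
            return true;          /* update silent: skip steps 3 and 4     */
        check[index] = ecc_generate(merged);    /* 3. generate check bits  */
        array[index] = merged;                  /* 4. write new ECC word   */
        return false;
    }

    int main(void)
    {
        ecc_store(0, 0, 42);
        printf("silent re-store suppressed: %d\n", ecc_store(0, 0, 42));
        return 0;
    }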

In comparison to naive store verifies (Section 2.2.1), we can see that store verifies carried out in ECC logic require no explicit load operation, but rather can simply be performed at commit, as illustrated in Figure 2-4. The drawbacks of this approach are that a store is suppressed relatively late in the pipeline (at commit instead of during the execute stage), so it may not reduce pressure on write buffers; it cannot be removed early from the LSQ; and finally that it cannot capture ECC-word-aligned stores.

Now that we have reviewed ECC-coding and the opportunity it presents for reduced cost store verification, we show how such a method can be implemented in modern microarchitectures with two case studies. We also discuss additional methods for providing ECC other than explicit coding protection on the L1 data cache, and update silent store suppression in these ECC architectures.

2.3.3.2 L1 Data Cache with ECC

FIGURE 2-5. Illustration of L1 Data Cache ECC-Word Generation. ECC-word generation for a sub-ECC-word store is shown.

One method of achieving soft error protection is to implement ECC-coding directly in the L1 data cache, as is done in the Alpha 21264 [55] and the PowerPC RS64-III [13]. The 21264 and PowerPC RS64-III use 64-bit ECC data words. As shown in Section 2.3.3.1, this provides error coverage for a relatively low space overhead of approximately 12%. As also outlined in that section, low cost store verification is trivially implementable as part of ECC check bit generation for sub-word writes.

In order to illustrate the argument made in Section 2.3.3.1, Figure 2-5 shows a datapath with which ECC may be implemented on a sub-ECC-word size store operation in an Alpha-like system. We use 72-bit ECC words in the figure because the Alpha uses this word size (and also a slightly modified coding scheme with additional desirable error detection properties) [26].

FIGURE 2-6. Illustration of L1 Data Cache ECC-Word Generation with Suppression. ECC-word generation for a sub-ECC-word store with ECC update silent store suppression is shown.

Implementation will be slightly different to handle smaller bit width stores, but for ease of illustration, only a 32-bit store is shown. We see the four major operations as discussed in Section 2.3.3.1: read the original quadword from the data cache, merge the store data into the input side of the ECC Data Register, generate ECC check bits, and store the quadword and ECC bits. Note that if ECC-word generation takes multiple cycles (as one might expect for what is essentially a read-modify-write sequence), we must maintain atomicity of the sequence either through design of the write buffer feeding the ECC logic, or in the logic itself. We have ignored this detail to simplify the diagram.

In Figure 2-6, we show the implementation of update silent store suppression in the same ECC logic structure as shown in Figure 2-5. We can see that the changes to the datapath are relatively simple: the addition of an extra multiplexor and a comparator. Figure 2-6 also illustrates that we cannot perform update silent store suppression if an ECC error is encountered on the read of the data value from memory. This is because the corrected value is obtained from the ECC correction logic and therefore must be written back to the memory system. The logic implements the same four steps as described previously. However, the store merge, ECC check bit generation, and new ECC-word store operations may be aborted if it is determined that the store is update silent and there is no ECC error. The abort operation can be as simple as not re-acquiring the cache port for the write of the (update silent) ECC-word from the ECC Data Register.
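To make the dataflow of Figure 2-6 concrete, the following C sketch models the commit-time store path as we have described it. The interface functions (read_ecc_word, ecc_error, ecc_correct, gen_check_bits, write_ecc_word) are hypothetical stand-ins for the cache and ECC logic, not any real design's API:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical interfaces standing in for the cache and ECC logic. */
    extern uint64_t read_ecc_word(uint64_t addr);
    extern bool     ecc_error(uint64_t addr);            /* error detected on the read? */
    extern uint64_t ecc_correct(uint64_t addr, uint64_t word);
    extern uint8_t  gen_check_bits(uint64_t word);
    extern void     write_ecc_word(uint64_t addr, uint64_t word, uint8_t check);

    /* Commit-time handling of a sub-ECC-word store, with update silent
     * store suppression folded into the existing read-merge-generate-write
     * sequence. byte_mask selects the bytes written by the store. */
    void ecc_store_commit(uint64_t addr, uint64_t store_data, uint64_t byte_mask)
    {
        uint64_t old = read_ecc_word(addr);              /* step 1: read original word */
        bool err = ecc_error(addr);
        if (err)
            old = ecc_correct(addr, old);                /* corrected value must reach memory */

        uint64_t merged = (old & ~byte_mask) | (store_data & byte_mask); /* step 2: merge */

        /* Suppression check: overlaps the XOR trees of check-bit generation,
         * so it adds no critical-path delay. Suppress only when no ECC error
         * was seen, since a corrected value must be written back. */
        if (!err && merged == old)
            return;                                      /* abort: skip steps 3 and 4 */

        uint8_t check = gen_check_bits(merged);          /* step 3: new check bits */
        write_ecc_word(addr, merged, check);             /* step 4: store new ECC-word */
    }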

The most important aspect of Figure 2-6 is the point at which the update silent store comparison can be performed. From the datapath shown, we can see that the comparison can be performed in parallel with the ECC correction logic and check bit generation. In general, ECC correction and generation logic consists of trees of exclusive-or gates [97] which have delay on the same order as the 32-bit comparison for suppression. Therefore, low-cost update silent store suppression for sub-ECC-word stores can be implemented in an ECC-protected L1 data cache for the cost of only a few extra gates, which should not increase the ECC logic's critical path.

2.3.3.3 Write-Through L1 Cache with ECC L2

Implementing ECC protection directly is not the only way to combat soft errors in the L1 data cache. In fact, adding ECC protection to the L1 directly can contribute negatively to cycle time because the ECC correction logic is now added to the critical path on load operations to assure usage of corrected values from the cache. Speculation can be used in order to move the ECC check/correction logic off the critical path by assuming that all load values are correct and recovering if the ECC logic reports an error. Of course, this adds control complexity to trigger the recovery [26].

An alternative is to use an L1 cache with simple parity protection and a write-through policy backed up with an ECC L2 cache. The L1 parity protection has a few advantages when compared with ECC in the L1. First, parity can easily be kept on a byte basis with the same overhead as the 72-bit ECC-word in the 21264 (in both cases the overhead is approximately 12%). With byte parity in the L1, there are no merging issues with store operations because the smallest atom for memory operations is a byte—therefore stores into the L1 do not require a read-modify-write sequence. The parity for each byte can be calculated very early in the pipeline when the store value is known and can simply be written into the cache. The single bit of parity for each byte provides single error detection on the byte level, as opposed to double error detection over 64 data bits as provided by 64-bit SEC-DED. If an error is detected in the L1 data cache via parity, the correct value is fetched from the ECC L2 cache.
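For reference, the per-byte parity computation is a simple XOR reduction (a tree of XOR gates in hardware); a minimal C sketch:

    #include <stdint.h>

    /* Even parity over one byte via XOR folding; returns 1 if the byte has
     * an odd number of set bits. One parity bit per byte yields the same
     * 12.5% overhead as 8 check bits over a 64-bit ECC word. */
    static inline uint8_t byte_parity(uint8_t b)
    {
        b ^= b >> 4;
        b ^= b >> 2;
        b ^= b >> 1;
        return b & 1;
    }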

A major caveat of this approach is the additional bus traffic generated by implementing a store-through L1 cache [52]. This traffic can be reduced with techniques like aggressive write combining and other buffering techniques, but special care must be taken to handle the extra L1 to L2 bandwidth requirements. Weaker consistency models allow greater freedom for store combining than stricter models.

FIGURE 2-7. Load and Store Datapaths for Redundant Data ECC. Parity protection in the L1-D cache with write-through and ECC in the L2 enables ECC in the L1-D cache through data redundancy.

In the case of a store-through L1 cache, we can imagine that update silent store suppression might achieve a noticeable performance benefit by eliminating store-through traffic across the L1-L2 interface, because it eliminates writes. We can use the methods explained previously (read port stealing and LSQ locality) to perform low-cost store suppression in such a hierarchy. Performance results for such a configuration are given in Section 2.4.2. It is depicted graphically in Figure 2-7.

2.3.3.4 Duplication of L1 Data Cache

We can also obtain single error detection and correction capability in the L1 cache by duplicating it and protecting both copies with parity. If a parity error is encountered on the read of any byte, the correct byte is fetched from the other copy of the cache to recover from the error. This scheme avoids a read-modify-write sequence for sub-word stores. It also provides effectively double the read-port bandwidth into the L1 data cache because each copy of the data cache can be accessed with loads to arbitrary addresses.

However, this scheme is not without its flaws. First, it has a high overhead of 100% compared to a cache with only parity. Second, it does not allow easy scaling of store bandwidth because both copies must be consistent, requiring stores to write both copies.
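A sketch of the access paths implied by this duplication scheme follows; cache_read, cache_write, and parity_ok are hypothetical placeholders for the per-copy array interfaces, not any real design's API:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-copy array interfaces. */
    extern uint8_t cache_read(int copy, uint64_t addr);
    extern void    cache_write(int copy, uint64_t addr, uint8_t v);
    extern bool    parity_ok(int copy, uint64_t addr);

    /* Loads may use either copy, doubling read-port bandwidth; a parity
     * error on one copy is repaired from the other. */
    uint8_t dup_load_byte(int copy, uint64_t addr)
    {
        if (!parity_ok(copy, addr)) {
            uint8_t good = cache_read(copy ^ 1, addr);  /* recover from the other copy */
            cache_write(copy, addr, good);              /* repair the faulty copy */
            return good;
        }
        return cache_read(copy, addr);
    }

    /* Stores must keep both copies consistent, so store bandwidth does not
     * scale: every store costs a write into each copy. */
    void dup_store_byte(uint64_t addr, uint8_t v)
    {
        cache_write(0, addr, v);
        cache_write(1, addr, v);
    }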

Reduced cost update silent store suppression may still provide performance benefit in this cache structure because it is biased toward more read ports than write ports. We have argued previously that update silent store suppression is likely beneficial in any architecture which has a higher relative microarchitectural cost for stores than loads. We present simulation results for such an architecture in Section 2.4.2. It is depicted graphically in Figure 2-8.

FIGURE 2-8. Load and Store Datapaths for Redundant Data ECC Through Duplication.

2.3.4 Simulation Parameters and Machine Model

We have indicated throughout the previous sections that many opportunities exist to improve performance by exploiting update silent stores in different memory hierarchies. In order to illustrate the performance potential, we present data from each major class of memory hierarchy (writeback L1-L2 and write-through L1-L2). Both hierarchies are prevalent in historical as well as present commercial processor designs, motivating presentation of both. Current examples of write-back designs include the Pentium Pro [25] and Alpha 21264 [26]; write-through designs include the Pentium4 [30], UltraSparc III [63], and Power4 [105].

Table 2-2: Base Simulator Configurations for Write-Back and Write-Through Machines.

  Attribute                   Value (Write-Back)              Value (Write-Through)
  Fetch/Decode/Issue/Commit   8/8/8/8                         8/8/8/8
  Pipeline Depth              5                               5
  BTB/RAS                     512 sets, 4-way/8-entry         512 sets, 4-way/8-entry
  Branch Predictor            64K G-Share, 16-bit history     64K G-Share, 16-bit history
  RUU/LSQ                     64-entry/32-entry               64-entry/32-entry
  Integer Resources           6 ALUs (1), 2 mul/div (3/12)    6 ALUs (1), 2 mul/div (3/12)
  Memory Resources            Multiple*                       2 load/store
  Floating Point Resources    2 add/sub (4), 2 mul/div (4)    2 add/sub (4), 2 mul/div (4)
  L1-I Cache                  64KB, 2-way, 64B lines (2)      64KB, 2-way, 64B lines (2)
  L1-D Cache                  64KB, 4-way, 32B lines (2)      64KB, 4-way, 32B lines (2)
  L2 Cache (unified)          512KB, 8-way, 64B lines (8)     512KB, 8-way, 64B lines (8)
  Memory/TLB                  50 cycles/256 entry I/D         50 cycles/256 entry I/D
  L1-L2 Bandwidth             16B/cycle                       Multiple*
  Write Buffers               2 (8 byte wide)                 2 (Multiple*)

The machine configuration used for both hierarchies is indicated in Table 2-2. All binaries are compiled for the SimpleScalar [14] target architecture using gcc at -O3. We evaluate the SPEC-95 integer benchmarks and the subset of the SPEC-2000 integer benchmarks set up for our simulation environment, using reduced inputs (SPEC-95) and test inputs (SPEC-2000). The subtests presented from SPEC-2000 are chosen based on availability of binaries and compatibility with SimpleScalar 3.0b PISA, our simulation environment for uniprocessor studies. In Section 1.3 we indicated that update silence pervades program execution (across multiple benchmarks, ISAs, and operating environments), suggesting that results from this simulation environment are indicative of the opportunity available in other environments as well.

The simulation environment is largely based on SimpleScalar 3.0b [14], with substantial modification to improve modeling of contention and bandwidth within the memory system. For example, we have added the ability to model finite store buffers and writeback buffers. We have also added an accurate L1-L2 interface model which allows varying the bandwidth over the interface, writeback/write-through modeling, write combining, etc. We prioritize demand traffic, i.e. load/store misses, over writebacks/write-throughs for all interfaces. Since this thesis is primarily focused on memory system research, such modifications are required. Furthermore, although we use the same core parameters throughout multiple experiments, to provide additional insight we vary different memory system parameters throughout the results presented. Parameters that should be considered variable are indicated with a (*) in Table 2-2. Finally, we assume a weakly ordered memory system in our uniprocessor studies (we explore consistency issues in detail in Chapter 4).

When presenting data on verified update silent stores, we count an update silent store as verified if it begins verification before reaching the head of the RUU. However, we have observed empirically that waiting for all stores to verify before committing them does not yield the best performance; store misses awaiting verification can stall retirement within the RUU, limiting exposed ILP, whereas the base machine would retire these stores into a write buffer. Therefore, we attempt to verify all stores, but if a store reaches the head of the RUU before verification has completed, it is assumed non-update silent and enters a write buffer. In this case, the store is performed as normal, regardless of whether it is update silent.2 Furthermore, we assume that verified update silent stores can release their LSQ entry early, which implies a collapsing LSQ structure. Finally, the standard load/store disambiguation policy from SimpleScalar 3.0b is assumed; it does not allow loads to issue around update silent stores with unknown addresses, as has been proposed and evaluated by Yoaz et al. [111]. Therefore, no benefit is realized from accelerating true memory dependences in this fashion, which would likely improve the results presented here.

2. It may be possible to suppress such update silent stores in the write buffer or at some other level of the memory hierarchy. Doing so may require transporting the store value throughout levels in the memory hierarchy, a change over traditional store-in (writeback) hierarchies. We discuss write buffer suppression in Section 2.4.1.
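The commit policy just described can be summarized in a few lines of C; the types and function names below are illustrative, not our simulator's actual code:

    #include <stdbool.h>

    /* Verification state of a store in the RUU/LSQ. */
    typedef enum { UNVERIFIED, VERIFYING, VERIFIED_SILENT, VERIFIED_NOT_SILENT } vstate_t;

    struct store_entry { vstate_t vstate; /* ... address, data, etc. ... */ };

    extern void release_lsq_entry(struct store_entry *s);
    extern void enqueue_write_buffer(struct store_entry *s);

    /* Called when a store reaches the head of the RUU. A store verified
     * silent is suppressed (verified silent stores may also release their
     * LSQ entry even earlier); a store still awaiting verification is
     * assumed non-update silent and retires into a write buffer rather
     * than stalling retirement. */
    void commit_store(struct store_entry *s)
    {
        if (s->vstate == VERIFIED_SILENT) {
            release_lsq_entry(s);     /* suppressed: no write buffer entry needed */
            return;
        }
        enqueue_write_buffer(s);      /* performed as a normal store */
        release_lsq_entry(s);
    }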

2.4 Exploiting Update Silence for Performance Benefit

We show results for two types of memory hierarchies in this section, a writeback L1-L2 cache hierarchy and a write-through L1-L2 cache hierarchy, since both design styles are common in industry. Each hierarchy has different trade-offs at both the core and system level, which we discuss briefly in the respective sections. For additional insight, the reader is encouraged to refer to Hennessy and Patterson [46] and Culler and Singh [31]. This thesis does not advocate a particular design, but rather evaluates update silent stores in both hierarchies.

2.4.1 Writeback Hierarchies

Eliminating update silent stores in a writeback hierarchy may reduce pressure on cache write-ports and write-buffering structures throughout the hierarchy, as discussed in Section 2.1. Furthermore, update silent store suppression converts update silent stores into load operations. Since suppression implies performing fewer stores, fewer dirty cache lines will exist within the memory hierarchy, and writeback traffic will be reduced.

2.4.1.1 Exploiting Update Silent Stores to Reduce Writebacks

We show update silent store suppression's ability to eliminate writebacks between the L1-D cache and the L2 for our simulated machine (with 2 general memory ports) in Table 2-3. The columns indicate the writeback rate in the baseline execution and the writeback reductions observed by suppressing L1-D cache hits, L1+L2 cache hits, and all memory references when all stores are verified before leaving the core. We explore verification to varying levels within the hierarchy because verification to higher levels incurs additional latency, potentially impacting processor core performance. As discussed in Section 2.3.4, for the performance results presented, not all stores need be verified before leaving the core to obtain the best possible IPC. The observed writeback reduction for the configuration used in performance simulations is presented in the final column.

Table 2-3: Writeback Reduction Between L1-D Cache and L2. The columns indicate the writeback rate and the writebacks eliminated by suppressing L1, L1+L2, and L1+L2+Mem store hits when waiting for all store verifies to complete. The final column indicates writeback reduction in the simulated machine configuration (which does not wait for all store verifies to complete).

  Benchmark   Baseline WBs    L1 Hit        L1+L2 Hit     L1+L2+Mem Hit   Performance
              /1K Instr.      Suppress WB   Suppress WB   Suppress WB     Simulation
                              Reduction     Reduction     Reduction       Configuration
  go          0.40            1.3%          9.8%          13.5%           4.2%
  m88ksim     0.27            1.3%          2.1%          57.3%           27.4%
  gcc         0.77            7.8%          10.3%         14.3%           7.0%
  compress    10              4.7%          57.9%         60.0%           46.5%
  li          0.02            1.2%          1.2%          39.7%           19.0%
  ijpeg       0.68            0.0%          2.4%          12.9%           6.2%
  perl        0.24            6.8%          8.4%          12.5%           9.7%
  vortex      5.8             0.7%          17.8%         81.0%           33.2%
  gzip        5.8             0.0%          0.3%          0.3%            0.3%
  vpr         0.31            0.1%          0.1%          0.1%            0.1%
  mcf         22              8.4%          13.7%         91.8%           42.6%
  parser      4.8             5.2%          34.3%         43.0%           5.6%
  bzip        1.8             0.0%          1.2%          3.1%            1.2%

We see from Table 2-3 that suppression can yield a significant reduction in writebacks if all stores are verified before they are allowed to leave the core (L1+L2+Mem column). We see reductions ranging from 0.1% up to 91.8%, with an average reduction of 33%. Furthermore, we note that in most cases store misses to memory must be verified to capture most of the possible benefit. This indicates that many writebacks eliminated through update silent store suppression are in fact due to lines allocated by store misses, as opposed to load misses. The key to understanding this observation is the behavior of the dirty bit: lines allocated by load misses will not set this bit until a non-update silent store occurs to the cache line in the L1-D cache (since all L1-D hits are verified across configurations), whereas unverified store misses will always set this bit, regardless of update silence. Since the removable writebacks increase substantially when verifying stores which hit higher levels of the hierarchy, verifying the store misses must be what reduces the observed writebacks.
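The dirty-bit behavior driving this observation can be stated compactly; the following C sketch assumes the configuration described above (all L1-D store hits verified, store misses unverified) and uses illustrative names:

    #include <stdbool.h>

    struct cache_line { bool dirty; /* ... tag, data ... */ };

    /* Dirty-bit update for a store, assuming all L1-D store hits are
     * verified while store misses are not. A line allocated by a load
     * miss stays clean until a non-update-silent store hits it; an
     * unverified store miss marks its newly allocated line dirty
     * regardless of update silence, which is why verifying store misses
     * is needed to remove the remaining writebacks. */
    void store_to_line(struct cache_line *l, bool hit, bool verified_silent)
    {
        if (hit && verified_silent)
            return;          /* suppressed: dirty bit unchanged */
        l->dirty = true;     /* non-silent hit, or any unverified store miss */
    }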

In the configuration used for performance simulation (where store verification need not complete; a store that reaches the head of the RUU simply retires into a write buffer as if it were non-update silent), we generally observe writeback reductions between those where we verify L1+L2 hits and those where we verify L1+L2+Memory hits. This indicates that such a policy still captures significant potential for writeback reduction without sacrificing IPC. Furthermore, note that stores could also be suppressed within the write buffer itself to achieve the best possible writeback reduction, at the cost of slightly higher write buffer occupancy for non-update silent stores. However, since a store occupies the write buffer until its data can be delivered into the memory system anyway, the additional occupancy penalty should be extremely low in the case of store misses.3 Furthermore, depending on the specific design of the memory hierarchy, it may be possible to verify store misses at levels of the memory hierarchy other than the L1, further mitigating additional write buffer occupancy.

3. Some care must be taken in design to minimize additional coherence permission transitions throughout the memory hierarchy for this statement to hold true. We will discuss this further in the context of multiprocessor systems exploiting store update silence (Chapter 4) and also in the context of Critical Update Silent Stores (Section 2.4.1.2).

2.4.1.2 Suppressing Critical Update Silent Stores for Maximal Writeback Reduction

We have mentioned many possible benefits of suppressing update silent stores in uniprocessors throughout the previous sections, but so far we have only indicated update silent store suppression's ability to reduce writebacks. If we are solely interested in eliminating writebacks, not all update silent stores need be suppressed in order to obtain the maximum possible writeback reduction. Furthermore, naive suppression can have an impact on core performance as well as system performance when we consider the coherence state transitions which take place throughout the memory hierarchy as a function of the memory accesses performed by the core.

In order to formalize these statements, we define the lifetime of a cache line as the time between each allocation and replacement of the cache line from a given level in the memory hierarchy, as was previously done in work by Wood et al. [108]. If we are solely interested in eliminating writebacks through update silent store suppression, it is only necessary to suppress the critical update silent stores, defined as follows:

Definition. A critical update silent store is a specific dynamic update silent store that, if not suppressed, will cause a cache line to be marked as dirty and hence cause a writeback.

This definition applies for each distinct cache line lifetime. To see why the number of critical update silent stores is less than or equal to the total number of update silent stores, consider the following: each cache line lifetime may have zero to n critical update silent stores. Trivially, if there are no stores to the line, there are no critical update silent stores either. Similarly, if there is even one non-update silent store, there are no critical update silent stores (because the single non-update silent store will mark the line as dirty). However, if there are one or more update silent stores to the line and no non-update silent stores, that set of update silent stores is defined as critical, since failing to suppress any one of them would result in a writeback.4

4. We discuss temporal silence's relation to writebacks and critical silence in Chapter 5.
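This classification is easy to state in code. A C sketch over a completed cache line lifetime, with illustrative bookkeeping, follows:

    /* Per-lifetime store counts, gathered between allocation and the end
     * of the lifetime (replacement or, in multiprocessors, invalidation). */
    struct lifetime {
        int silent_stores;       /* update silent stores this lifetime */
        int nonsilent_stores;    /* stores that change the line's value */
    };

    /* Number of critical update silent stores in the lifetime: either all
     * of its update silent stores are critical (no non-silent store
     * exists, so failing to suppress any one of them causes a writeback)
     * or none are (a non-silent store dirties the line regardless). */
    int critical_silent_stores(const struct lifetime *lt)
    {
        return (lt->nonsilent_stores > 0) ? 0 : lt->silent_stores;
    }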

Suppressing non-critical update silent stores yields no reduction in writebacks because the cache line will be marked dirty by at least one non-update silent store during the cache line lifetime, by definition. Furthermore, suppressing non-critical update silent store misses can actually degrade performance because we incur the load and compare overhead of a store verify without any compensating reduction in writebacks. This problem is worse in multiprocessors: a non-critical update silent store is replaced with a store verify (read), but a subsequent non-update silent store to the cache line may require the line to be upgraded from shared to modified state (necessitating an upgrade transaction at the system level). If the non-critical update silent store had not been suppressed, the line would have been brought into the cache with a read-with-intent-to-modify (or read-exclusive) transaction, thus eliminating the later upgrade.

We discuss this issue extensively and provide characterizing data in the context of multiprocessor systems in Section 4.4. We introduce the concept here because it is straightforward to understand in the case of writebacks, and this is how we described the phenomenon originally [66] [10]. Furthermore, this observation is relevant in uniprocessors as well, as we now describe.

In most cases, the additional upgrade transaction will not exist in systems which implement the exclusive or "E" state known from the MESI protocol [31], because a uniprocessor will obtain most read data in Exclusive state. However, in systems which do not implement the full set of MESI states at every level in the memory hierarchy, e.g., Power4 [105], the notion of criticality still applies, even in uniprocessors. It is entirely possible to implement only MSI at the L1 cache (to maintain circuit speed of the tags, reduce area and complexity, etc.) and the full set of MESI states only at the coherence point, e.g., the L2 in the examples given. In such a scenario, the upgrade caused by non-critical update silent store suppression can still take place between the L1 and L2. A simple solution to avoid this caveat is to mirror the entire set of stable coherence states at the L1. We detail other possible solutions in Section 4.4.

2.4.1.3 Performance of Simple Suppression Techniques

In Figure 2-9 and Figure 2-10, we show the performance of three different techniques of update silent store suppression described throughout previous sections. The bars, from left to right, show performance for the baseline case (no update silent store suppression), naive update silent store suppression (Section 2.2.1), read port stealing (Section 2.3.1), and perfect update silent store suppression. Perfect suppression utilizes an oracle predictor of update silence and only suppresses the update silent stores—other stores have no verification action carried out during their execution. Therefore, the perfect case eliminates all overheads associated with store verification for non-update silent stores.

FIGURE 2-9. Performance Comparison of Update Silent Store Suppression Techniques. The bars indicate the performance of the baseline machine, naive update silent store suppression, read port stealing update silent store suppression, and perfect update silent store suppression for a writeback cache hierarchy with 4 load ports and 1 store port.

FIGURE 2-10. Performance Comparison of Update Silent Store Suppression Techniques. The bars indicate the performance of the baseline machine, naive update silent store suppression, read port stealing update silent store suppression, and perfect update silent store suppression for a writeback cache hierarchy with 2 memory ports.

Figure 2-9 shows results for a machine configuration with four memory read ports and a single memory write port. As explained previously, such a configuration may be desirable for many reasons; this is a two-wide version of the configuration used in Power4 [105], illustrated in Figure 2-8. Given this memory system design, we expect loads to be substantially less expensive than stores, thus providing potential to benefit from update silent store suppression. Indeed, the figure shows a harmonic mean speedup of 17% for naive update silent store suppression over the baseline, with the improvements ranging from 0% in vpr to 76% in mcf. As described in Section 2.1, the benefit can come primarily from three factors in this machine configuration: reduced write port/buffer contention, reduced LSQ contention, and reduced writebacks. For the three benchmarks exhibiting greater than 10% IPC improvement (compress, vortex, and mcf), we find that compress benefits from reduced write buffer and LSQ contention and vortex from reduced LSQ contention. Mcf is an interesting case, as it achieves most of its performance benefit from a second-order prefetching effect; suppressing update silent stores effectively issues prefetches for stores at the issue stage instead of initiating a store's access to the memory hierarchy at commit. When exclusive prefetching at issue for store operations is enabled, the improvement in mcf is a more modest 6%.

In comparing the three different suppression scenarios, we observe little practical difference for this machine configuration; all results are within 0.1% of one another in terms of IPC. This occurs because store verification interferes little with either non-update silent store commit or load issue, so negative interference is low. Furthermore, we re-emphasize that the percentage of update silent stores is substantial in general (about 40% on average), implying suppression is very often beneficial.

In order to determine the impact of update silent store suppression when the potential for negative interference from store verification is greater, we also present results for a machine with two memory ports (which can be used for either loads or stores, with no address restrictions) in Figure 2-10. We see results qualitatively similar to those of Figure 2-9. In this case, the harmonic mean speedup of update silent store suppression is 13%, smaller due to negative interference from store verification and also because each dynamic store operation is less expensive relative to a load. Therefore, eliminating stores provides less memory port contention benefit.

We also observe that read port stealing is the most effective suppression mechanism in this scenario, improving over perfect suppression by 0.3% in harmonic mean. Furthermore, with this machine configuration, li actually shows a 2.5% slowdown due to update silent store suppression, owing to its low fraction of update silent stores (20%, as shown in Table 1-1). However, the read port stealing mechanism reduces the slowdown to less than 1.0%, indicating this mechanism is an effective way to reduce negative interference when relatively few dynamic stores are update silent.

Now that we have determined that read port stealing is the most effective mechanism of update silent store suppression for writeback hierarchies to minimize negative interference, we examine performance sensitivity with respect to another key memory system parameter: the number of write buffers. As we have stated, eliminating write buffer contention is a potential benefit of update silent store suppression. We explore sensitivity to the number of write buffers implemented outside the processor core in Figure 2-11.

FIGURE 2-11. Performance of Update Silent Store Suppression with Varying Write Buffers. The stacked bars indicate the performance of the baseline machine and read port stealing update silent store suppression for increasing numbers of write buffers (4, 8, and 16) for a writeback memory hierarchy with 2 memory ports.

We observe that performance increases for both the baseline machine and one implementing read port stealing update silent store suppression given additional write buffers outside the RUU. More importantly, we observe that the substantial speedups observed previously in compress, vortex, and mcf are greatly reduced. Harmonic mean IPC improvement is reduced from 17% with two write buffers (Figure 2-10) to 4% with 16 write buffers. This result is not unexpected: as the baseline machine's store-handling bandwidth improves, we expect less benefit from suppression. However, performance improvement can still be obtained, even with a well-resourced memory hierarchy.

We have indicated throughout this section that a majority of the performance improvement from update silent store suppression comes from reducing contention for store-handling structures within the processor core and memory system ports. However, in Table 2-3 we indicated that writebacks can be reduced substantially as well by suppressing update silent stores.

Writeback hierarchies can achieve good write performance because of the natural write combining that occurs to lines modified within the L1-D cache before the dirtied lines are written back to other levels in the memory hierarchy. Therefore, most of the performance benefit in writeback hierarchies from update silent store suppression comes from reduced write port contention, write buffer contention, and removed writebacks.5 For these workloads and machine configurations, we see very little impact (less than 0.1%) on IPC due to writeback handling between the L1-D and L2 caches as long as there is at least one writeback buffer. Performance improvement from eliminating writebacks may be more substantial in different classes of workloads where writebacks contribute a substantial portion of memory traffic, as described by Lee et al. [65]. These results (the insensitivity of these workloads to writeback traffic) are corroborated in that work.

5. In some cases, e.g. mcf, a secondary prefetching effect can be considerable, as store verification loads prefetch store data into the L1-D cache before a store attempts to commit, reducing write buffer occupancy for the store substantially, as opposed to simply trying to cover the entire store's latency within the write buffer.

Finally, we point out that results with a performance model which exploits ECC update silent store suppression (Section 2.3.3) in addition to read port stealing are quantitatively similar, and are omitted for the sake of brevity. However, performing read port stealing to suppress most update silent stores in the core and ECC update silent store suppression to eliminate the remaining update silent stores can achieve the core benefits of read port stealing and also eliminate all writebacks presented in the L1+L2+Mem column of Table 2-3.

2.4.2 Write-Through Hierarchies

In write-through hierarchies, the natural ability of the L1-D cache to combine references is purposely eliminated. There are many architectural reasons why this might be desirable: to avoid explicit ECC coding protection of the L1-D cache (as explained in Section 2.3.3), to allow the integer and floating point datapaths to be physically located away from one another (as is done in Pentium4 [30] and Itanium2 [80]) while still providing a coherent memory image for the floating point core, and to ease the coherence requirements on the L1-D cache in multiprocessor systems by providing an up-to-date copy of each memory location at the L2 level. However, such an architecture obviously has a much higher cost per store operation than a writeback hierarchy because of the lost combining opportunity. Therefore, most write-through hierarchies provide aggressive, combining write buffers to reduce the bandwidth required on the L1-L2 interface to handle write-through traffic.6

6. Memory consistency considerations for when write-buffering/combining is legal are not considered in these results. We assume a weakly ordered memory system in our uniprocessor results.

Since write-through hierarchies create additional costs for store operations when compared to writeback hierarchies, we evaluate our most aggressive store verification/suppression mechanisms in this context. We expect that these techniques might also provide benefit in writeback hierarchies, but do not rigorously evaluate them in that context for the sake of brevity.

2.4.2.1 Read Port Stealing Performance Benefits

FIGURE 2-12. Performance Improvement of Read Port Stealing. The performance of read port stealing update silent store suppression is compared against baseline performance without update silent store suppression.

Figure 2-12 shows the performance improvement of read port stealing over the baseline performance with no suppression. We see improvements ranging from a low of 0% in li and vpr to a high of 56% in mcf. The harmonic mean across all benchmarks shows a 10.3% improvement.

It is worthwhile to note that we do not see a performance decrease in any benchmark. This occurs because we are only using cache read ports available after all other ready loads and stores have had a chance to issue/commit. The performance benefit comes primarily from three factors: a) a reduction of bandwidth required between the L1 and L2 caches by eliminating store traffic on the interface, b) reduced pressure on write buffers, and c) reduced contention for entries in the LSQ. In some cases (especially noteworthy in mcf) a secondary prefetching effect from store verifies contributes substantial speedup.

Store verifies essentially issue prefetches for stores at issue, allowing for overlap of store latency within the RUU before stores retire. Without this prefetching effect, multiple store misses easily overwhelm the limited store buffers in the machine, causing the RUU to stall. We verified this result by adding exclusive prefetching at store issue to our base model. Compared against this baseline, mcf achieves a non-trivial, but dramatically smaller, performance improvement of 9%.
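The port-stealing discipline that makes these verifies free can be sketched as a per-cycle scheduling loop; the queue and port-accounting interfaces are illustrative assumptions, not our simulator's actual code:

    #include <stdbool.h>

    extern int  issue_ready_loads(int ports_available);  /* returns ports consumed */
    extern bool pending_store_verify(void);
    extern void issue_store_verify(void);

    /* Per-cycle data cache read-port scheduling: demand loads and stores
     * issue first; store verifies consume only whatever read ports are
     * left over, so verification never displaces critical memory
     * operations. */
    void schedule_read_ports(int read_ports)
    {
        int used = issue_ready_loads(read_ports);
        while (used < read_ports && pending_store_verify()) {
            issue_store_verify();    /* "steal" an otherwise idle read port */
            used++;
        }
    }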

It is also interesting to note how few store suppression opportunities we miss by only using available cache read ports as opposed to trying to suppress all stores. In Figure 2-13 we show the percentage of store operations we are able to store verify for free using read port stealing.

FIGURE 2-13. Dynamic Stores Verified Using Only Available Cache Read Ports.

We can see that in all cases, we are able to verify over 83% of store operations using available cache read ports, with an average of 89%. This indicates that we are achieving almost all available benefit from suppression that uses the standard store verify, but without impacting performance of critical load and store operations.

2.4.2.2 Load Store Queue Suppression Performance Benefit

FIGURE 2-14. Performance of LSQ Suppression. The stacked bars indicate the performance of the baseline system (without suppression), same address (temporal) LSQ suppression, and same cache line (spatial) LSQ suppression, respectively.

FIGURE 2-15. Temporal LSQ Suppression Through WAR, WAW, and RAW Dependences.

In Figure 2-14, we show the performance improvement of temporal and spatial LSQ suppression over the baseline performance with no suppression (as discussed in Section 2.3.2.1 and Section 2.3.2.2, respectively). The stacked bars show the contribution of each mechanism to overall performance. For temporal LSQ suppression, we see improvements in IPC ranging from a low of 0% in gzip and mcf to a high of 3% in vortex, with overall performance improved by 0.6% as indicated by the harmonic mean over all benchmarks. When we add spatial LSQ suppression, we see total improvements over the baseline from a low of 0% in gzip to a high of 56% in mcf, with the harmonic mean improving by 11.3%.

When examining temporal suppression, it is interesting to note that most of the stores are suppressed by preceding or subsequent load operations (the RAW and WAR dependences discussed in Section 2.3.2.1), as opposed to previous store operations (WAW dependences), as illustrated in Figure 2-15. In most benchmarks (except compress, ijpeg, vpr, and mcf), temporal LSQ suppression captures over 25% of all update silent stores within the dynamic program execution. Some possible explanations for this are provided by Bell et al. [10] and in Moshovos's thesis [85], and could include program model considerations like stack frame usage. In the results presented in Figure 2-15, each dynamic update silent store is counted at most once (it is present in only one section of the stacked bars), with the following priority counting on multiple aliases: previous load (WAR), previous store (WAW), subsequent load (RAW).
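A sketch of the temporal alias check implied here, performed before a store consumes any cache port; lsq_find_same_addr_value is a hypothetical stand-in for the LSQ CAM search, not a real interface:

    #include <stdbool.h>
    #include <stdint.h>

    struct lsq_store { uint64_t addr, data; bool verified, silent; };

    /* Hypothetical LSQ search: finds a same-address alias (an earlier
     * load or store, or a later load that has already read memory) and
     * returns the memory value it implies for this address. */
    extern bool lsq_find_same_addr_value(uint64_t addr, uint64_t *value);

    /* Temporal LSQ suppression: verify a store for free against a WAR,
     * WAW, or RAW alias already resident in the LSQ; fall back to read
     * port stealing if no alias exists. */
    bool try_lsq_suppress(struct lsq_store *s)
    {
        uint64_t mem_value;
        if (!lsq_find_same_addr_value(s->addr, &mem_value))
            return false;             /* no alias: use a stolen read port instead */
        s->verified = true;
        s->silent = (mem_value == s->data);
        return true;
    }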

FIGURE 2-16. Sources of LSQ Store Verify Data. The percentage of update silent stores verified using data provided by same address aliases (temporal locality), previous load to line, previous store to line, subsequent store to line (spatial locality), and read port stealing is indicated.

In the case of spatial LSQ suppression, the same statement regarding counting of suppressible stores holds (a dynamic update silent store is only counted once). However, the priority of counting changes slightly due to simulator implementation issues. In this case, the counting precedence is: WAR, WAW, cache line previous load, cache line previous store, RAW, cache line subsequent load, and read port stealing. We show the results of this method of counting in Figure 2-16 (results from all same address suppression methods are combined in the Same Address bar for readability, and the subsequent line load section is removed because it did not contribute meaningfully). Note that the total percentage of update silent stores captured by this mechanism is greater than the results presented in Figure 2-13 (read port stealing) because stores verified from data in the LSQ cache do not consume a cache port. Therefore, a port tends to be free more often for additional read port stealing store verifies.7

7. In machines which rely on speculative scheduling [12], this serialization of lookups in the LSQ cache may not in fact save a cache port, as the port usage will be scheduled before the hit status in the LSQ cache is known. However, serializing the lookup may still be beneficial in terms of power (saving access to the cache arrays), similar to Yang and Gupta [109]. Furthermore, accurate update silence predictors have been developed [57, 111], which may still allow saving a memory port.

We see that the percentage of same address store verifies decreases relative to Figure 2-15, mainly due to counting precedence. Also, substantial previous line store verifies are observed, indicating that the LSQ cache proposed in Section 2.3.2.2 is useful. These results also indicate, due to the small fraction of subsequent line verifies, that verification from a line allocated by a subsequent access to previous stores in the LSQ is unnecessary for suppression purposes, potentially saving some complexity in the LSQ cache.

Finally, we see that in all benchmarks (except compress and mcf), over 40% of all update silent stores are captured by exploiting locality in the LSQ. Read port stealing for LSQ cache line allocation brings the total percentage of update silent stores captured to over 90% (except for ijpeg).

In comparing temporal to spatial LSQ suppression, we see only two benchmarks that benefit from temporal suppression (perl gains 1.5% and vortex 3.3%). It is not until spatial LSQ suppression is applied that we see noticeable improvements in instruction throughput. This occurs because the overall percentage of update silent stores detected by the spatial scheme (including free read port suppression) is much higher.

2.4.2.3 Increasing Write Through Bandwidth with Update Silent Store Suppression

FIGURE 2-17. Percent Reduction of L1 to L2 Traffic with Advanced Suppression Techniques. The bars and bullets (tip of the triangle) indicate the percentage of write through traffic reduction and the percentage of total dynamic stores removed, respectively.

Given that advanced techniques can suppress many update silent stores, it is interesting to examine what kind of trade-offs an architect can make with this type of memory system to obtain sufficient throughput between the L1 and L2 caches. We can use the brute force method and implement a fully-pipelined, write-combining, cache-line-width interface between L1 and L2 (as used in all results presented so far), which can induce significant circuit design complexity. Or, we can exploit update silent store suppression to obtain effective throughput over the L1 to L2 interface with less physical throughput. In order to illustrate this, we present Figure 2-17, which shows the store-through traffic reduction over the L1 to L2 interface as well as the percentage of dynamic stores removed by update silent store suppression. We see an average traffic reduction of 15% across all benchmarks, and up to 45% in m88ksim. Since this interface is wide (32B) and fast (single cycle pipelined), it is reasonable to assume that this traffic reduction would lead to a savings in chip power.

Note that, as we would expect, the percentage of write through traffic reduction closely mirrors the percentage of successfully suppressed stores. In the case of vortex and mcf, the traffic reduction is slightly greater than the percentage of removed stores, which we determined is due to a second-order increase in write combining efficiency; because suppressed stores do not allocate a write buffer, more buffers are available for combining non-update silent stores. The percentage of removed stores is lower than the overall percentage of update silent stores (and also the percentages of suppressed stores presented previously) because we do not wait for store verifies to complete before committing stores (explained in Section 2.3.4). In further experiments not detailed here, we found that although traffic decreased when waiting for stores that hit in the L1 to finish verifying, overall instruction throughput was lower because commit of some stores is stalled in that case. There is a potential performance vs. power consumption trade-off here that could be exploited in power-aware designs.

FIGURE 2-18. Performance Comparison for Narrowing L1 to L2 Interfaces. The stacked bars indicate the performance obtained with suppression as compared to baseline performance for 32B, 16B, and 8B wide L1 to L2 interfaces, respectively.

In order to determine how effective this bandwidth reduction is on instruction throughput, we present Figure 2-18, which shows the performance across all benchmarks with varying width interfaces between the L1 and L2, with and without update silent store suppression in its most aggressive form (spatial LSQ suppression with read port stealing).

We keep the L1 cache line size at 32B in all simulations, but illustrate the performance of 32B, 16B, and 8B wide interfaces between the L1 and L2. In each case, the physical bandwidth of the L1 to L2 interface is progressively lowered; in the case of 16B and 8B widths, more transactions across the interface are required for a cache line transfer (two and four transactions for 16B and 8B, respectively). However, the write combining width is changed to match the physical interface width so that flushing a write buffer takes only a single cycle.

If we compare update silent store suppression with an interface width of 8B to no suppression with an interface width of 32B, we see that suppression with the 75% lower physical bandwidth interface provides greater effective bandwidth (as evidenced by IPC) than the higher physical bandwidth interface without suppression (the only exceptions are go and gzip; in these benchmarks, the percentages of update silent stores are low, 27% and 16% respectively, leading us to expect less benefit from suppression). In fact, as evidenced by the harmonic mean, the low physical bandwidth interface with update silent store suppression actually provides 9% greater effective bandwidth on average than the fastest physical interface modeled. Therefore, we can potentially trade the implementation of suppression for physical bandwidth. Of course, as also shown, update silent store suppression still provides benefit no matter what physical bandwidth is available. Note that even though the actual reduction in physical bandwidth for the narrower interfaces (50% and 75% for 16B and 8B wide interfaces, respectively) is larger than the percent reductions shown in Figure 2-17, suppression also decreases pressure on other hardware structures, such as write buffers, so the performance improvement is not solely due to the reduced L2 bandwidth.

We also observe in Figure 2-18 that the performance degradation from the widest (32B) to the narrowest (8B) interface is lower with update silent store suppression than for the baseline system with no suppression (40% lower according to the harmonic mean). This occurs because suppression is relatively more effective as the write-combining width narrows. With respect only to physical interface bandwidth, combining and suppression are equivalent: we can either save a transaction over the L1 to L2 interface by combining with a previous store or by suppressing the store. However, there is some overlap between the methods—some stores that are suppressed could also have been combined, and vice-versa, as can be seen in Figure 2-17 in the difference between removed dynamic stores and reduced write through traffic. Of course, the combining width directly affects the number of stores that can be combined, but does not directly affect the number of suppressed update silent stores. Therefore, suppression will capture some stores that can no longer be combined (but can still be suppressed at any combining width), so the relative benefit of suppression increases as the combining width decreases along with the width of the L1 to L2 interface.

As in Section 2.4.1.3, we explore the performance sensitivity to available resources within the memory hierarchy, namely the number of write buffers. Intuitively, we expect that as the microarchitecture's ability to handle store bandwidth demand improves, less performance benefit will be obtained from update silent store suppression. Figure 2-19 shows the performance of the baseline machine and one implementing update silent store suppression for the 8B-wide L1-L2 interface for increased numbers of write buffers (4, 8, and 16).

FIGURE 2-19. Performance Sensitivity to Increased Write Buffering. The stacked bars indicate the performance obtained with suppression as compared to baseline performance for an 8B wide L1 to L2 interface, with 4, 8, and 16 write buffers, respectively.

Once again, we see that the performance of both the baseline machine and one implementing suppression improves as write buffers are added. As also observed in that section, the relative performance benefit of suppression decreases as write buffers are added, reducing the harmonic mean performance increase from 16% for two write buffers (Figure 2-18) to 5% for 16 write buffers. Again, this is not unexpected. However, even with a well-resourced memory system, performance improvement is still achieved. Detailed explanations for each benchmark showing substantial performance improvement (compress, vortex, and mcf) have been given throughout this and previous sections.

2.4.2.4 Discussion of Aggressive Update Silent Store Suppression Mechanisms

Comparing the performance results for the three update silent store suppression methods simulated for our machine model, we see that read port stealing and aggressive LSQ suppression exploiting both temporal and spatial locality in the LSQ provide nearly equivalent performance, with harmonic mean speedups of 10% and 11%, respectively. This occurs because both methods capture greater than 83% of all update silent stores, and close to 90% on average, so both methods are suitable for achieving IPC benefit. However, aggressive LSQ suppression reduces the number of store verifies issued to the memory system by 50%, on average (comparing the read port stealing percentages from Figure 2-13 and Figure 2-16). Therefore, in a machine model with relatively fewer memory ports, aggressive LSQ suppression may have greater benefit because of reduced port contention. Reducing data cache accesses may also reduce overall power consumption.

Temporal LSQ suppression by itself provides only modest speedup in these benchmarks, less than 1%, because of the low percentage of update silent stores captured (31% on average) and the corresponding 9% average reduction in total committed dynamic stores. Therefore, while temporal LSQ suppression has the benefit of never stealing a cache read port, in our machine, solely implementing this mechanism does not seem worthwhile.

Finally, we note that the performance benefit from update silent store suppression is strongly correlated to the relative cost of load and store operations within a given memory hierarchy. As discussed at the beginning of Section 2.4, the only fundamental ILP improvement from exploiting update silence within the core is by removing true store-to-load dependences on update silent stores. Therefore, for core designs with substantial write handling bandwidth at all levels of the memory hierarchy which do not relax true store-to-load dependences, we expect relatively less benefit from suppression. If update silent stores are exploited to remove true store-to-load dependences, we expect the benefit to be proportional to the fraction of such dependences. Other authors have shown strong variability of such dependences with instruction set architecture (mostly correlated to the number of architected registers) and also instruction window size [111, 85]; therefore, potential benefit from this optimization will be strongly correlated to these aspects.

As we have also discussed at length throughout Section 2.1 and Section 2.3.3, implementing multiple cache write ports for wide-issue processors, or providing sufficient store bandwidth in the presence of ECC for future deep-submicron technologies, may make stores more costly. Therefore, naively assuming store bandwidth can be scaled may be perilous.

In any event, we have shown that for many types of memory systems much write traffic can be eliminated; actual performance benefit achieved is of course related to specifics of the memory system design.

2.5 Exploiting Temporal Silence To Improve Core Performance

Experimental results (not presented here for brevity) indicate that there is little benefit to exploiting temporal silence, introduced in Section 1.4, for uniprocessor speedup in ways similar to those presented throughout this chapter. The principal explanations for this include:

• Reducing cache writebacks for most traditional applications does not seem to affect final core performance measurably as long as sufficient buffering is provided (there are some counter-examples for certain streaming workloads, e.g. Lee et al. [65]).

• We have not devised a method to exploit temporal silence at any significant program distance for cache port utilization reduction, due to the requirements of precise exceptions.

• We have observed that the dynamic window of stores we must examine to capture significant temporal silence opportunities in write-through hierarchies requires relatively large write buffers for single-threaded applications. Enlarging these structures sufficiently to capture temporal silence also improves their ability (without exploiting store value locality) to eliminate write-throughs. Therefore, exploiting temporal silence to further eliminate write-throughs provides little gain over update silent store suppression.8

8. This is not necessarily true in multithreaded applications where we can exploit temporal silence to eliminate communication misses (see Chapter 5). However, requirements for maintaining memory consistency preclude simple write buffer enlargement from being effective (see Chapter 3) without employing speculation.

However, efficiently detecting temporal silence in the processor core and on-chip memory system is important for multiprocessor systems, as we will show in Chapter 5. Therefore, we defer discussion of efficient methods of detecting and exploiting temporal silence until that time.

2.6 Related Work

In related work with collaborators, we have characterized update silent stores further by examining the correlation of store update silence to memory region (stack vs. heap), memory hierarchy, i.e. dynamic program distance, past address and value behavior for specific static stores, update silent store value distributions, dynamic execution frequency, the fraction of static stores contributing dynamic update silent stores, and compiler optimization level [10]. We have also examined source code (for SPEC-95), attempting to gain insight into why update silent stores exist. We have also indicated that update silent stores add a new dimension to classically "dead" stores, although silence and deadness are orthogonal [19].

The results presented in this joint work indicate that there appears to be no single or simple way to eliminate the vast majority of update silent stores; i.e., update silent stores are not simply due to a specific programming language artifact or a single common code construct, and do not seem to be easily handled with compiler techniques. Empirical evidence presented in this work (by studying compiler optimization level) indicates that update silence seems to be strongly correlated to program behavior and not compiler artifacts; in fact, as compilers optimize more aggressively, the percentage of update silent stores tends to increase (we have also observed this across compilation environments when moving from a slightly outdated version of gcc used to generate SimpleScalar binaries to a production, heavily optimizing compiler, xlc for AIX, as shown in Table 1-1).

Given dynamic runtime information, there are a few strong correlations observed for store update silence—e.g., nearly 50% of dynamic update silent stores write the value zero; static stores which write the same value as the last time they were executed are more likely to be update silent than those which write a different value, etc. However, eliminating the vast majority of update silent stores still appears to be non-trivial. These studies also did not examine multi-threaded workloads, which provide additional opportunities for exploiting store value locality.

Calder et al. [19] discussed the concept of an update silent store in terms of hoisting loads and dependent instructions above loop bodies. They proposed extending the memory disambiguation buffer (MDB) to not only check for store aliases but also whether the location's value had changed. Molina et al. [82] proposed exploiting redundant stores for eliminating cache writebacks. Yoaz et al. [111] examine exploiting store update silence to advance later program loads around unresolved stores which are highly likely to be update silent, thus avoiding store to load forwarding aliases which will not in fact affect a correct value being delivered to the load operation.

Update silent stores have also been explored as a means to eliminate cross-thread dependences in thread-level speculation (TLS) systems [103, 24]. Results in these works showed that a substantial number of cross-thread memory dependence violations, signaled solely because a location was written, can be eliminated by suppressing update silent stores. In related work, others have proposed exploiting update silent stores in Slipstream processors [104] for similar reasons.

We have mentioned at various points throughout this chapter the potential power and energy reductions which may be realized by eliminating update silent stores. However, we are principally focused on performance in this thesis. Yang and Gupta have described a mechanism to remove "redundant load and store" memory operations to decrease cache power [109]. They classify a store as redundant if it is update silent; however, the mechanisms devised to take advantage of store update silence differ from our methods of update silent store suppression, as they are principally interested in minimizing data cache accesses.

Finally, we have discussed many optimizations possible through exploiting both temporal and spatial locality inside the LSQ to enable low-cost suppression of update silent stores.

Moshovos has studied this extensively in his thesis, illuminating reliable methods of memory dependence prediction (predicting dynamic memory references to the same location) and also anti-dependence prediction (predicting dynamic memory references to different locations) [85].

Chapter 3

Multiprocessor Sharing Considering Store Value Locality

In Chapter 1 we introduced update silent stores and temporally silent stores. We further described the basic mechanisms of invalidation-based cache coherence and discussed communication misses in Section 1.1. Communication misses occur in cache coherent systems in order to propagate changes to shared-memory state to other processing elements in the system. Of course, these changes to shared-memory state are a direct consequence of store operations. In this thesis, we are illuminating the fact that substantial store value locality exists; this unearths the potential to eliminate communication misses by exploiting the value locality of store operations. For the rest of this thesis, we focus on this potential application of store value locality. Although we predominantly focus on broadcast-protocol, shared-memory multiprocessors, many of the results presented here may be applicable in other multithreaded programming environments and also implicitly threaded environments. We will discuss this related work, as well as opportunities for follow-on research, throughout Chapter 4, Chapter 5, and Chapter 6.

3.1 Motivation and Background

There is widespread agreement that communication misses are one of the most pressing performance limiters in shared-memory multiprocessor systems running commercial workloads. For example, both Barroso et al. [9] and Martin et al. [75] report that about one-half of all off-chip memory references are communication misses; i.e. the references are satisfied from dirty lines in remote processor caches. Communication misses are caused by remote writes to shared cache lines; in single-writer or invalidate protocols a write requires all remote copies of a shared line to be invalidated [31]. Subsequent references to those remote copies lead to misses that must be satisfied from the writer's cache.

Two current trends are likely to exacerbate this problem: systems incorporating an increasing number of processors raise the probability that a remote write to a shared line will occur, and larger, more aggressive local cache hierarchies eliminate most capacity and conflict misses but cannot reduce communication misses.

We discussed the concept of a dynamic cache line lifetime in the context of writebacks in uniprocessors in Section 2.4.1.2; the lifetime begins when a cache line is allocated and ends when the line is replaced (and written back if dirty) from the cache. In the context of multiprocessor systems, an additional event can also cause the lifetime of the cache line to end, namely an invalidation request from another processor in the system. In order to reduce the number of communication misses, we strive to extend the lifetime of a cache line throughout processors in the system. Intuitively, if we can extend the lifetime by either reducing the number of, or judiciously applying, coherence events against cache hierarchies throughout the system, we should be able to improve performance. This is reasonable because we are improving the likelihood that memory references can be serviced with data already resident within the local cache hierarchy, thus avoiding a transfer from memory or the most recent writer's cache (so called "modified interventions" or "dirty misses"). We depict existing methods of extending cache line lifetime related to multiprocessor coherence in Figure 3-1.

Prior work has shown that many communication misses can be avoided by detecting false sharing [35]. As illustrated in Figure 3-1, this approach attempts to extend the lifetime of a shared copy of a cache line by monitoring remote writes more closely, and extending the lifetime of the cache line whenever the remote write changes an unaccessed word in the line. Such misses due to cache line granularity (and not intrinsic sharing within multithreaded program execution) were first described by Goodman et al. [43].

FIGURE 3-1. Extending the Lifetime of a Shared Cache Line. Lifetime is extended by avoiding writes to other parts of the line (false sharing), ignoring update silent writes (update silent sharing), ignoring temporally silent atomic write pairs, and reinstating a line upon reversion due to a temporally silent write pair that is not atomic. (The figure contrasts, against a baseline invalidate protocol, the lifetime extensions of (a) false sharing [Dubois, ISCA 93], (b) update silent sharing, (c) temporal silent sharing with an atomic temporally silent pair, and (d) ideal temporal silent sharing with a non-atomic pair; the lifetime is the period during which a local read experiences a cache hit.)

Dubois et al. have rigorously classified misses in the context of infinite-cache, invalidation-based, multiprocessor systems [35]. We reintroduce relevant definitions from that work here so that we can naturally extend them to consider the value locality of stores.

The relevant definitions are as follows:

Cold Miss: The first miss to a given block by a processor.

Essential Miss: A cold miss is an essential miss. Also, if during the lifetime of a block, the processor accesses (load or store) a value defined by another processor since the last essential miss to that block, it is an essential miss.

Pure True Sharing miss (PTS): An essential miss that is not cold.

Pure False Sharing miss (PFS): A non-essential miss.

Essential misses constitute all misses which bring in a truly shared word either directly, or as a side effect (for example, when a truly shared value is brought in as the non-critical word in a cache refill). Note that the use of the word "value" in the above definition means value in the invalidation sense only, i.e., a store instruction has occurred to that address. It is not implying anything about the data value at that address.

In general, Dubois contributed the insight that merely tracking the address that invalidates a cache block or only comparing the address that causes a miss to previously written locations in that block is not sufficient. To be more precise, we must examine all previous invalidations of a block and the side-effects of loading a cache line (including future accesses to data within the cache block) to be sure that PTS and PFS misses are correctly attributed.
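For concreteness, the sketch below implements a simplified, address-only version of this classification for infinite caches; unlike the published algorithm, it classifies at miss time only rather than retroactively inspecting the remainder of each lifetime, so it should be read as an illustration of the definitions, not a faithful reimplementation of [35].

```python
# Simplified, address-only sketch of the miss classification (illustration of
# the definitions only; Dubois et al.'s algorithm also inspects the rest of
# each cache line lifetime to retroactively mark misses essential).

def classify(trace, n_procs):
    """trace: list of (proc, op, block, word), with op in {'LD', 'ST'}."""
    valid = set()    # (proc, block): block currently valid in proc's cache
    touched = set()  # (proc, block): ever cached by proc (cold-miss detection)
    pending = {}     # (proc, block) -> words written by others since the
                     # processor's last essential miss to the block
    labels = []
    for proc, op, block, word in trace:
        key = (proc, block)
        if key in valid:
            labels.append("hit")
        elif key not in touched:
            labels.append("cold")              # cold misses are essential
            pending[key] = set()
        elif word in pending.get(key, set()):
            labels.append("PTS")               # essential miss that is not cold
            pending[key] = set()
        else:
            labels.append("PFS")               # non-essential miss
        valid.add(key)
        touched.add(key)
        if op == "ST":                         # invalidate all remote copies
            for p in range(n_procs):
                if p != proc:
                    valid.discard((p, block))
                    pending.setdefault((p, block), set()).add(word)
    return labels

# P1 re-misses on a word P0 redefined: hit, then an essential (PTS) miss.
t = [(0, "ST", 1, 0), (1, "LD", 1, 0), (0, "ST", 1, 0), (1, "LD", 1, 0)]
print(classify(t, 2))  # ['cold', 'cold', 'hit', 'PTS']
```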

3.2 Update Silent Sharing

As you might expect, studies (detailed in Chapter 4) have determined that update silent stores can be exploited to further extend cache line lifetimes. This can be easily understood when considering the proposed methods for update silent store suppression (Section 2.2), which essentially convert update silent stores into load operations—which do not require exclusive ownership. This is depicted graphically as Figure 3-1(b); the lifetime is extended until a non-update silent store occurs to the cache line.
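As a behavioral illustration of this conversion, the sketch below models a store-verify against a toy cache: an update silent store completes as a load and requests no ownership. The cache and bus objects are stand-ins we assume for exposition, not the hardware of Section 2.2.

```python
# Behavioral sketch of store-verify suppression (toy model, not the actual
# hardware mechanism): an update silent store behaves as a load, so it needs
# neither exclusive ownership nor an invalidation of remote copies.

def commit_store(cache, addr, value, bus):
    line = cache.get(addr)
    if line is not None and line["data"] == value:
        # Load-and-compare matched: the store is update silent and is dropped.
        return "suppressed"
    if line is None or line["state"] not in ("M", "E"):
        bus.append(("ReadX", addr))        # must obtain write permission
    cache[addr] = {"data": value, "state": "M"}
    return "performed"

# A line holding 7 in shared state: storing 7 is suppressed with no bus traffic.
cache = {0x40: {"data": 7, "state": "S"}}
bus = []
print(commit_store(cache, 0x40, 7, bus), bus)  # suppressed []
print(commit_store(cache, 0x40, 9, bus), bus)  # performed [('ReadX', 64)]
```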

To rigorously define update silent sharing, we keep the same definitions as Dubois with extensions covering the value locality of stores. Intuitively, we extend the definition of essential miss to exclude those stores which are update silent, i.e., those that do not change the machine state because they are attempting to store the value that was previously available at that location in the system memory hierarchy. Rigorously, we propose the following, modified, definition of an essential miss (our changes are in italics):

Essential Miss: A cold miss is an essential miss. Also, if during the lifetime of a block, the processor accesses (load or store) an address which has had a different data value defined by another processor since the last essential miss to that block, it is an essential miss.

While the wording of this definition is almost the same as the one proposed by Dubois, we have made a slight change to make clear that we are interested in the data value at a memory location. The other definitions remain accurate with no modification.

Note that within the definition of Update Silent Sharing (USS), the possibility for both true sharing and false sharing misses still remains. Therefore, when presenting classification of sharing misses considering USS, miss reductions over the baseline case, i.e. Dubois' classification, are represented as a reduction in overall misses for ease of presentation. The reduction in each type of miss can therefore be determined by directly comparing the reduction in true sharing/false sharing misses between the baseline and USS.

In previously published work related to this thesis, i.e. [10, 68, 66, 69], we had defined this as Update False Sharing. However, since the term "false sharing" implies unnecessary communication due to cache block granularity as originally coined by Goodman et al. [43], we use the term Update Silent Sharing throughout this thesis and encourage this usage throughout future literature. Communication eliminated by exploiting store value locality is due to properties of the data itself, not its spatial layout.

3.3 Temporal Silent Sharing

We introduced temporally silent stores in Section 1.4. To recap briefly, temporal silence describes a program-behavior phenomenon in which the net effect of two or more writes changes a register or memory location to an intermediate value, but subsequently reverts that location to a previous value of interest. Since we focus on the value locality of store operations in this thesis, we are concerned with temporal silence of memory writes, or temporally silent stores.

Strictly speaking, we consider a store temporally silent if it writes a value to a memory location which has ever existed there previously. With this definition, consider a pathological case in which the same byte in memory is incremented repeatedly. After 256 writes of the location, all subsequent stores to the location become temporally silent, just with respect to a different previous value. Obviously such a definition does not have much practical value. Intuitively, we expect that recent previous values are more amenable to capture with fixed storage. Therefore, we consider a more restricted definition of temporal silence in this thesis, which we call recent temporal silence. In general, the recent modifier is meant to imply that only values which have existed in the near past are candidates for exhibiting temporal silence, with recent being defined by the desired application.

In a multiprocessor system, arguably the most interesting temporally silent stores are the ones that cause a memory location to revert to a value that has been previously observed by a remote processor. In this context, we consider a store recently temporally silent if it writes a value matching a stale value in a remote processor cache that is currently in invalid state or data previously written back to memory, implying that the candidate set of values exhibiting recent temporal silence is bounded by the number of processors in the system. This worst-case number of values occurs when each processor has previously observed different values prior to invalidation and memory is not up-to-date. These recent temporally silent stores change the location back to a value recently observed by another processor or memory. We can imagine this value can be communicated at low cost. It will become clear shortly, when rigorously redefining multiprocessor sharing, why we choose such a definition. Note that there is nothing in our subsequent definitions which precludes multiple (distinct) intermediate values or temporally silent values for a given memory location; bounding the values is simply a matter of practicality. We use this definition of recent temporal silence throughout this thesis; we refer to it as simply temporal silence, and stores writing such values as simply temporally silent stores, for brevity.

As we have discussed, temporal silence has an important difference over update silence; namely, an intermediate value which is non-update silent. Therefore, to succinctly describe specific dynamic stores of interest, we define a temporally silent store pair to consist of two parts: a non-update silent store to the memory location, called the intermediate value store, which visibly changes the system state from a previously observed value; and a non-update silent store, called the temporally silent store, which then reverts the state back to a previously observed value.1 Note that there may be additional intervening stores to the same address; we do not consider these part of a temporally silent store pair. This definition is similar to the "silent store-pairs" exploited in Rajwar et al.'s work on Speculative Lock Elision [93, 90], except their definition stipulates that only atomic temporally silent pairs (Figure 3-1(c), to be discussed shortly) can be "silent store-pairs".
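To make the pair structure concrete, the sketch below tracks, per address, the values other parties may still hold stale copies of, and labels each local store as update silent, intermediate, or temporally silent; it is a behavioral model under our own assumptions, not a proposed hardware structure.

```python
# Sketch: detecting temporally silent stores against a bounded set of values
# recently observed by other processors (the stale copies discussed above).
# The bookkeeping here is a behavioral model, not a hardware design.

class TemporalSilenceTracker:
    def __init__(self):
        self.observed = {}   # addr -> values remote parties may hold stale
        self.current = {}    # addr -> current value

    def remote_read(self, addr):
        # A remote processor caches the current value; it becomes a candidate
        # reversion target once we overwrite it.
        self.observed.setdefault(addr, set()).add(self.current.get(addr, 0))

    def local_store(self, addr, value):
        old = self.current.get(addr, 0)
        self.current[addr] = value
        if value in self.observed.get(addr, set()) and value != old:
            return "temporally-silent"    # reverts to a remotely observed value
        if value == old:
            return "update-silent"
        return "intermediate"             # visibly new system state

# A lock variable: acquire is the intermediate value store, release reverts.
t = TemporalSilenceTracker()
t.local_store(0x40, 0)           # lock starts free
t.remote_read(0x40)              # another CPU observes 0
print(t.local_store(0x40, 1))    # 'intermediate' (lock acquire)
print(t.local_store(0x40, 0))    # 'temporally-silent' (lock release)
```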

A common example of temporally silent store pairs in multiprocessors exists with lock variables, which assume an intermediate value when acquired and become temporally silent when released. Such an example is shown in Table 3-1. Note that the load at T5 is returning the value zero, as was previously seen by CPU 0 upon its last access to address A (at T1); thus the Read/Miss at T5 is not contributing any new information about the state of the system to CPU 0's execution. This is so because of the temporally silent store occurring at T4 on CPU 1. Note that no single dynamic store operation in the example is update silent, but the intermediate value store (T2) and the temporally silent store (T4) together yield a temporally silent store pair. Each part of the store pair is labeled in the example.

1. The term "pair" is a slight misnomer because the intermediate value store can consist of multiple dynamic store operations and multiple intermediate values, but this is consistent with Rajwar et al.'s terminology.

Table 3-1: Example of Temporal Silent Sharing in a Multiprocessor System. Assume sequential consistency and that Time indicates the order observed in the running system. The columns indicate instructions performed by each CPU as well as the transactions observed in a traditional multiprocessor system in response to these requests. The Intermediate Value Store (T2) and the Temporally Silent Store (T4) are indicated in the original with double outlines.

            CPU 0                     CPU 1
Time   Instruction  Cmd/Txn      Instruction  Cmd/Txn
T0     ST [A], 0    ReadX/Miss
T1     LD [A]       Hit          LD [A]       Read/Miss
T2                               ST [A], 1    Invalidate
T3                               ST [A], 2    Hit
T4                               ST [A], 0    Hit
T5     LD [A]       Read/Miss

Having introduced the definition of temporally silent stores used throughout this thesis, we can return to a discussion of Figure 3-1. We can exploit temporal silence to further increase cache line lifetime. Two scenarios of interest are indicated in Figure 3-1(c) and Figure 3-1(d). In scenario (c), the cache line is written with an intermediate value store, but a temporally silent store follows and reverts the line to a previous value. Memory consistency rules allow the stores to be collapsed into a single atomic event that is now effectively update silent, since the memory location contains the same value before and after the store pair has executed. Therefore, we refer to this temporally silent store pair as an atomic temporally silent store pair. Atomic temporally silent store pairs can be created by exploiting freedoms available within weak memory models [3] or through employing speculation and the coherence protocol to detect mis-speculation conditions [93]. We will discuss such approaches at length in Chapter 5. When such methods prove ineffective or undesirable, we must also consider the final scenario indicated in Figure 3-1(d).

In scenario (d), the cache line is written with an intermediate value and consistency rules force this store to be ordered; hence there is a time window when the previous copy of the cache line is invalid. Later, a temporally silent store occurs causing the current copy to match a previous copy, extending the cache line lifetime further.

We can rigorously define multiprocessor sharing with consideration of temporally silent stores, in a manner similar to Section 3.2. We accomplish this by further refining the definition of essential misses to capture the case where cache line values revert to the value previously observed by another processor in the system (the changes are in italics):

Essential Miss: A cold miss is an essential miss. Also, if during the lifetime of a block, the processor accesses (load or store) a different value than the last value observed by that processor for that block since the last essential miss to that block, it is an essential miss.

In our new definition of essential misses, we establish that the net effect of all writes to the location of interest since a processor's last observation of the location must constitute new system state for the miss to be essential. Note that this definition makes no statement about how many distinct processors have written to a specific word (with intermediate value or temporally silent stores) or other places within the cache line—it only requires that the value of interest be the same as the last value observed for a given processor. The distinction will become clear as we describe different methods of capturing temporal silence in Chapter 5. Also note that our new definition of temporal silent sharing (TSS) is a superset of USS.
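The following sketch restates the value-aware test in executable form: a miss is essential under TSS only if the values the processor now observes differ from the last values it observed for the block. The data structures are assumed for illustration.

```python
# Sketch of the value-aware essential-miss test above (illustrative model).
# Under TSS, a miss is non-essential when the block has reverted to the last
# values this processor observed, regardless of intervening writes.

def essential_under_tss(last_observed, proc, block, current_values):
    """last_observed[(proc, block)] holds the word values the processor last
    saw for the block; current_values is the block's present contents."""
    prev = last_observed.get((proc, block))
    if prev is None:
        return True                  # cold miss: essential by definition
    return prev != current_values    # reverted block -> non-essential miss

# A lock word last seen as 0: observing 1 is essential; observing a reverted
# 0 (after a temporally silent store pair) is not.
last = {(0, 0x40): (0,)}
print(essential_under_tss(last, 0, 0x40, (1,)))  # True
print(essential_under_tss(last, 0, 0x40, (0,)))  # False
```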

3.4 Understanding the Difference Between PTS, USS, and TSS

The definitions of essential miss presented in Section 3.1, Section 3.2, and Section 3.3, which are the cornerstone of redefining multiprocessor sharing to consider store value locality, are very similar. In order to make clear the difference between the classifications, we present Table 3-2, which contains similar transactions to those shown in Table 3-1, except we have added an additional case of USS.

Table 3-2: Understanding the Difference Between PTS, USS, and TSS.

            CPU 0                     CPU 1
Time   Instruction  Cmd/Txn      Instruction  Cmd/Txn
T0     ST [A], 0    ReadX/Miss
T1     LD [A]       Hit          LD [A]       Read/Miss
T2                               ST [A], 0    Invalidate
T3     LD [A]       Read/Miss
T4                               ST [A], 1    Invalidate
T5                               ST [A], 2    Hit
T6                               ST [A], 0    Hit
T7     LD [A]       Read/Miss

Annotations in the original table: no new value defined (T2) implies the miss is removed under USS/TSS; a new value defined (T4) implies true sharing under USS; no new value observed (by T7) implies the miss is removed under TSS.

Under Dubois' classification, the miss at T3 is considered essential, and thus is PTS. However, since the store at T2 is update silent, under the same classification this miss (if observed during execution) would be considered non-essential, and therefore false sharing, under both USS and TSS. Similarly, since the store at T4 creates a new value (is not update silent), it implies an essential miss under USS occurs at T7. However, since the store at T4 is simply part of a temporally silent store pair, the miss at T7 is non-essential under TSS. Therefore, if it is observed during execution, it would be considered false sharing under TSS using the classification scheme of Dubois et al.

However, as stated previously, it is undesirable to call misses eliminated when exploiting store value locality "false sharing" because of the spatial layout connotation associated with this term. How do we rectify this dilemma? We can resolve the issue simply if we realize an implicit assumption in the classification scheme, namely, that only misses which are observed during execution are in fact classified. Incidentally, this is one of the simplifications used in the basic classification presented by Dubois et al. [35] which allows neglecting capacity and conflict misses. They simply assume that no such misses can occur during execution (by using infinite caches).

Therefore, when considering sharing misses under USS and TSS, we avoid the need to call PTS misses eliminated under USS and TSS false sharing by simply saying that the misses are eliminated from the execution. This can be easily understood for the USS example (T3) by envisioning the store suppression procedure for the store at T2; if a load and compare is issued for the store, no invalidate will appear in the system, and the miss is eliminated. Although we have not described an exact mechanism for eliminating TSS misses (we will in Chapter 5), a similar argument can be made for the miss at T7.

Rigorously, we would say that the miss at T3 is a USS/TSS miss and that the miss at T7 is a TSS miss. This implies that such misses present in the execution could be eliminated by designing a system that eliminates update silent stores and temporally silent stores, respectively. With this understanding, any misses still present under USS or TSS cannot be eliminated by exploiting these types of store value locality, and can then be classified as either true sharing or false sharing as the classification scheme dictates.

Although this description might seem unduly complicated, the distinction, removal, and proper classification of misses is straight-forward in practice. We simply design systems which can detect update silent stores and temporally silent stores and eliminate the misses caused by these events. We can then directly compare the number of true/false sharing misses eliminated between the baseline case and the system which removes update silent stores/temporally silent stores to determine the fraction of baseline true/false sharing misses which are USS/TSS. The alternative is to significantly complicate Dubois' classification scheme to enumerate all sub-cases, in our opinion, sacrificing clarity. To make the overlap between classification schemes clear, we present a Venn Diagram in Figure 3-2.

FIGURE 3-2. Venn Diagram of Communication Miss Classification. The USS and TSS ovals represent misses present in the baseline execution (without exploiting store value locality) which can be removed when exploiting update silent stores and temporally silent stores.
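The comparison procedure just described amounts to differencing per-category miss counts between the baseline run and the suppressing run, as in this small sketch; the counts shown are hypothetical placeholders, not measured data.

```python
# Sketch of the comparison described above: the fraction of baseline true/
# false sharing misses that are USS (or TSS) is the per-category difference
# between the baseline run and the run with suppression. Counts are
# hypothetical placeholders, not measured results.

def eliminated_by_suppression(baseline, suppressed):
    return {cat: baseline[cat] - suppressed.get(cat, 0) for cat in baseline}

base = {"true_sharing": 1000, "false_sharing": 400}  # baseline execution
uss = {"true_sharing": 850, "false_sharing": 320}    # silence suppressed
print(eliminated_by_suppression(base, uss))
# {'true_sharing': 150, 'false_sharing': 80} misses are USS
```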

Given the previous discussion, we present data on communication misses removed with USS and TSS using the format shown in Figure 3-3. For now, we are only interested in how the data should be interpreted; the workloads this data was collected from are explained in detail in Section 3.6 and throughout subsequent chapters.

The left-most bars indicate reduction in communication misses for each scheme (for our four commercial workloads, presented in detail in subsequent chapters) with miss fractions normalized to the baseline case not exploiting any store value locality. The reduction in true sharing and false sharing from each simulation can be easily computed as the difference in height of each component between configurations. The right-most bars show that the reductions indicated would technically be classified using Dubois et al.'s scheme and would be considered non-essential misses. The two graphs are exactly equivalent—we choose the left-most presentation throughout this thesis, with the right-most representation implicit. Note that the right-most representation may split the USS-Elim and TSS-Elim components into true sharing and false sharing eliminated over the baseline—this fraction is easily computed as described previously.

FIGURE 3-3. Example Presentation of Reduction in Communication with USS and TSS. The left-most bars indicate reduction in communication misses for each scheme with the communication miss fractions normalized to the Baseline case. The right-most bars indicate that the removed misses are technically considered USS and TSS misses, and therefore would be counted as "false sharing" misses using Dubois et al.'s classification scheme. For clarity, we present all results within this thesis in the manner of the left-most bars, with the right-most representation being equivalent and implicit.

3.5 Related Work

Dubois et al. [35] provided the most accurate definition of false sharing in multiprocessors based simply on address information. Along with their definition of false sharing, they also proposed hardware support to detect false sharing, allowing a remote processor to continue using a cache line that had been invalidated (without fetching the data from the previous owner) as long as the data it consumed had never been written in the aggregate by other processors. However, we assert that the key idea behind this scheme is trying to extend the usable lifetime of a cache line as long as possible (as illustrated in Figure 3-1).

We would like to do this to avoid communication misses which are either due to artifacts of real systems, i.e. large coherence units, or intrinsic communication (true sharing). Dubois's method focuses only on the former. Many other prior studies [36] [23] [34] also strive to extend cache line lifetimes by exploiting the freedoms granted by relaxed consistency models, by delaying sending and processing of invalidation messages or similar techniques.

However, this thesis differs substantially from these studies because it examines the nature of the communicated values themselves. The exploitation of store value locality and the usage of these other methods are largely orthogonal and we will treat them as such. However, particularly relevant interactions with store value locality will be detailed.

Speculative Lock Elision [93], proposed by Rajwar and Goodman, elides "silent store-pairs" which are very similar to our temporally silent store pairs. They use the elision and atomicity guaranteed through the coherence protocol as enablers to execute non-overlapping critical sections concurrently and also achieve the benefit of eliminating lock transfers and therefore cache misses. We have classified this method in Figure 3-1(c). SLE eliminates many TSS misses for the code idioms they target. Key differences between this thesis and their work are that we propose methods for exploiting temporal silence without requiring recovery and machine roll-back support which can be implemented outside the processor core, provide a rigorous definition (in terms of sharing) of all misses which can be eliminated by exploiting store value locality, and also perform detailed evaluation with commercial workloads. We will return to a discussion of SLE and this thesis work in Chapter 5 and Chapter 6.2

2. The SLE work as well as this thesis research both took place while Ravi Rajwar and I were students at the University of Wisconsin. In my opinion, SLE and the foundation of this thesis were invented contemporaneously and independently.

Follow-on work related to exploiting common synchronization primitives, co-published with foundational work for this thesis, includes Transactional Lock Removal [94] and Martinez et al. [77].

3.6 Simulation Environment for Exploration Studies

Throughout Chapter 4 and Chapter 5 we perform characterization studies and performance estimation from exploiting store value locality in multiprocessor systems. Given the complication in evaluating multithreaded workloads, particularly commercial workloads (described in detail by Alameldeen et al. [7] as well as our own work [67]), we perform detailed characterization studies using a full-system simulation environment with an in-order processor model including memory system timing based on SimOS-PPC [56], and later evaluate the most promising techniques from these initial studies in a full-system, execution driven, fully integrated timing simulation environment (Chapter 6). Starting with this simpler environment allows characterizing the intrinsic program behavior free of multithreaded workload non-determinism effects, as well as the ability to simulate longer intervals of execution because of the difference in execution throughput of the simulators. We choose metrics most closely related to the phenomenon of interest, e.g. miss rate, data traffic, address traffic, etc., to provide first-order indications of the possible benefits.

Unless stated otherwise, the basic machine configuration used throughout these characterization studies is shown in Table 3-3.

Note that the processor core used for these studies is similar to a single-threaded IBM RS64-III (Pulsar) [13] used in IBM S85 server systems. The perfect L1 cache CPI of 1.0 is very close to the measured CPI on real hardware for a similar system for these workloads [18]. We simulate four benchmarks from the SPLASH-2 [107] suite for entire runs of reduced input sets and three, three-second snapshots of commercial workloads described by Cain et al. [18] as well as an end-to-end run of TPC-H query 12 [106].

Table 3-3: Simulator Configuration for Characterization Studies. This configuration is used throughout Chapter 4 and Chapter 5 unless indicated otherwise.

Attribute                                  Value
Perfect Core IPC (L1 cache hits)           1.0
L1-I and L1-D Caches                       128KB each, 2-way, 64B lines (1)
L2 Cache (unified)                         16MB, 4-way, 64B lines (7)
Address Network                            5 cycles occupancy, 45 cycles round-trip latency
Data Network (Cache to Cache Transfers)    70 cycles latency, 1 cycle occupancy (cross-bar)
Memory (DRAM)                              140 cycles latency, 1 cycle occupancy
Operating System                           AIX v4.3.1

Table 3-4: Benchmark Characteristics and Description. Instructions are measured excluding the operating system idle loop.

Program        Instr.  Loads  Stores  Description
barnes-4p      1.76B   410M   304M    SPLASH-2 N-body simulation (8K particles)
barnes-16p     1.89B   417M   304M    SPLASH-2 N-body simulation (8K particles)
ocean-4p       561M    177M   36M     SPLASH-2 Ocean simulation (258x258 ocean)
ocean-16p      1.41B   405M   81M     SPLASH-2 Ocean simulation (514x514 ocean)
radiosity-4p   2.43B   620M   326M    SPLASH-2 Light Interaction application (-room -ae 5000.0 -en 0.050 -bf 0.10)
radiosity-16p  2.53B   579M   307M    SPLASH-2 Light Interaction application (-room -ae 5000.0 -en 0.050 -bf 0.10)
raytrace-4p    414M    122M   45M     SPLASH-2 Raytracing application (teapot)
raytrace-16p   697M    228M   66M     SPLASH-2 Raytracing application (car)
specweb-4p     4.60B   1.13B  624M    Commercial Web-Serving application
specjbb-4p     3.58B   713M   431M    Commercial Server-Side Java application
specjbb-16p    723M    148M   88M     Commercial Server-Side Java application
tpc-h-4p       5.05B   454M   551M    Commercial decision support (TPC-H query 12, 2x)
tpc-h-16p      2.51B   533M   203M    Commercial decision support (TPC-H query 12, 1x)
tpc-w-4p       1.41B   336M   340M    Commercial Web-Based OLTP application (shopping)

Table 3-4 provides a description of the benchmarks used. The SPLASH-2 applications were compiled with gcc version 2.95.2 for AIX at -O4 with alignment optimizations at 64 byte boundaries to mitigate false sharing. The commercial applications run various commercial databases, JVMs, etc. [18]. The input parameters for the SPLASH-2 programs as well as basic characteristics are also shown in Table 3-4.

Note that benchmarks across different numbers of processors are not directly comparable. In most cases the input sets have been changed/scaled to execute sufficient instructions per processor to avoid cold-start effects in the case of the scientific workloads (which are end-to-end runs); the commercial workloads can be executing a completely different set and number of transactions both due to non-determinism effects [7] [67] and similar scaling issues. Specjbb-16p runs for a fixed transaction count, and both tpc-h runs are end-to-end; the other commercial workloads are snapshots of steady state execution.

When measuring characterization data, i.e. cache miss rates, etc., snapshots of steady state execution are sufficient—we are basically performing a trace based analysis which is directly comparable between machine configurations. However, snapshots of steady-state execution complicate performance analysis of multi-threaded workloads (as detailed by Alameldeen et al. [7] and in our own work [67]). We introduce a method for performance analysis in this environment in Chapter 6, detailed in our own work [67].

Chapter 4

Update Silence in Multiprocessors

In our examination of store value locality in multiprocessors, we focus first on exploiting update silent stores. The problem of efficiently detecting and exploiting them in uniprocessors was covered in Chapter 2. Detection mechanisms discussed there are extensible in a relatively straight-forward way to multiprocessors. However, the data traffic performance potential as well as other multiprocessor implementation and system level issues will be discussed in this chapter. We use the multiprocessor benchmarks from Table 3-4 and the simulation environment described in Section 3.6, unless stated otherwise, for all characterization data.

4.1 Motivation and Background

We have shown in Table 1-1 that a significant fraction of dynamic store operations are update silent. Furthermore, throughout Chapter 2 we saw many scenarios under which we can exploit update silent stores for performance benefit in uniprocessor systems. However, additional benefit from update silent stores might be obtained in cache-coherent multiprocessor systems. In single-writer, invalidation-based cache coherence schemes, performing a dynamic store requires an exclusive copy of the cache line before the write can be performed. Similarly, stores are responsible for system-visible writes in update-based cache coherence protocols. Since many dynamic stores are update silent, we can eliminate many coherence events and data transfers related to stores in multiprocessor systems. Given the observation by industrial practitioners and academic researchers alike that communication-induced cache misses are a pressing performance limiter in modern systems, especially running commercial workloads, eliminating coherence and communication traffic will likely prove fruitful. We have discussed these issues at length in Section 3.1, as well as the basics of invalidation-based cache coherence in Section 1.1. Furthermore, previous to the work detailed in this thesis, all true-sharing communication misses were considered inherent and fundamental in multi-threaded program execution regardless of cache capacity. Data in this section will show that a non-trivial fraction of these true-sharing communication misses are easily eliminated through suppressing update silent stores.

4.2 Potential for Data Transfer Elimination

It should be obvious given the discussion in Section 3.1 that we can exploit update silent sharing (USS) to eliminate some fraction of communication misses. In order to explore this hypothesis, we implemented the measurement algorithm of Dubois et al. [35] with a MESI protocol [31], exercising it with a combination of commercial and scientific workloads for PowerPC under three different scenarios:

• The baseline scenario corresponds to Dubois' definition of sharing, classifying misses as either cold, false sharing, or true sharing throughout the benchmark's execution.

• The second scenario (USS-Hit) implements update silent store suppression, which effectively converts update silent stores into loads (Section 2.2). Note that from a multiprocessor cache perspective, a suppressed update silent store requires neither exclusive ownership of the cache line (as in an invalidate-based cache protocol) nor remote propagation of the updated store value (as in an update-based coherence protocol) because the value at the location of interest has not in fact changed. In this scenario, only cache hits are suppressed because of the potential effect suppressing misses has on coherence transactions (discussed in detail in Section 4.3 and Section 4.4). Therefore, we call this scenario USS-Hit.

• The third scenario (USS-P) measures the potential of USS with perfect knowledge of store update silence for both cache hits and misses. Cache hits are handled as described in the second scenario (USS-Hit). For cache-missing stores, if the store is known to be update silent, a Read (sometimes called a Get [31]) transaction is performed to obtain the data with sufficient permission to perform only the load portion of a store-verify, knowing the store will be suppressed. If the store is known to be non-update silent, a Read-Exclusive (ReadX, sometimes called a read-with-intent-to-modify or GetX [31]) is performed to obtain the data with write permission immediately, avoiding an additional coherence transaction to upgrade the line. Note that this scenario indicates the maximal data-traffic benefit of eliminating update silent stores from the observed execution; a sketch of this miss-handling policy follows the list. In subsequent sections we normally refer to this scenario simply as USS (instead of USS-P) for brevity, as it indicates the best possible data traffic performance of exploiting update silent stores.
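As referenced in the third bullet, a minimal sketch of the USS-P store-miss policy follows; the oracle is modeled simply by comparing against the current memory value, and the bus is a plain list, both assumptions for illustration.

```python
# Sketch of the oracle USS-P store-miss policy (illustrative): a store miss
# known to be update silent issues a Read, leaving remote copies shared; a
# non-silent miss issues a ReadX to get write permission in one transaction.

def uss_p_store_miss(addr, store_value, current_value, bus):
    if store_value == current_value:     # oracle: the store is update silent
        bus.append(("Read", addr))       # fetch only for the store-verify load
        return "suppressed"
    bus.append(("ReadX", addr))          # fetch with immediate write permission
    return "performed"

bus = []
uss_p_store_miss(0x80, 0, 0, bus)  # update silent miss -> Read
uss_p_store_miss(0x80, 1, 0, bus)  # non-silent miss   -> ReadX
print(bus)                         # [('Read', 128), ('ReadX', 128)]
```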

Results of these simulations are illustrated in Figure 4-1, with the three scenarios described previously for infinite caches and a 4MB 8-way associative cache. Focusing first on the communication miss components, we see a measurable reduction in both true and false sharing in ocean, radiosity, specweb, and tpc-w. In barnes, raytrace, specjbb, and tpc-h there are still many dynamic update silent stores (40%, 38%, 35%, and 72%, respectively, shown in Table 1-1), but suppressing them does not lead to a tangible communication miss reduction. This can be due to non-update silent stores to other locations either truly or falsely shared within a cache line, or because the update silent stores are occurring to unshared locations. For these benchmarks, we also examined the effect of reducing cache line size and the relative fraction of avoidable communication misses stays constant—indicating that for these workloads the phenomenon is likely the latter.

FIGURE 4-1. Percentage of Cache Misses for Different Definitions of Sharing. The data is normalized to the Baseline case for 64B cache lines and infinite caches (unified instructions/data). The contribution of additional capacity/conflict misses for an 8-way associative 4MB cache is also shown.

It is also interesting to examine the effectiveness of update silent store removal for finite caches, with data for a 4MB, 8-way set associative cache also in Figure 4-1. As expected, the overall miss rate reductions are not as large as the reductions in sharing misses. This is due to two effects: 1) finite cache capacity necessarily increases the number of misses, some of which are also true or false sharing communication through memory; 2) the increased working set brought about by reduced invalidations in the system (affecting replacement of lines that would've previously been marked invalid in remote caches but are no longer because of update silent store suppression). This second effect is most readily observed in tpc-w, where we see a measurable reduction in communication misses, but we see no appreciable reduction in overall miss rate with a finite cache. The only explanation is that communication misses have been replaced by capacity/conflict misses. However, the infinite cache results indicate that the amount of communication can be reduced by exploiting update silent stores. Therefore, the working set has either been increased by update silent store suppression or the replacement policy is non-optimal when update silent stores are suppressed.

In ocean and specweb we see measurable reductions in overall miss rate for finite caches, indicating that exploiting update silent stores for data traffic reduction is worthwhile. Furthermore, overall data traffic is additionally reduced in all benchmarks because suppressing update silent stores also eliminates writebacks. The writeback reduction rate for each configuration is shown in Figure 4-2. The writeback rate per 1K instructions is also shown to indicate the writeback behavior intrinsic within the benchmark for a 4MB cache.

FIGURE 4-2. Percentage of Writebacks Removed By Suppressing Update Silent Stores. The data is normalized to the Baseline case for 64B cache lines with an 8-way associative, 4MB cache. Writeback reductions for both USS-Hit and USS-P silent store squashing are indicated.

We can see that substantial writeback reduction is only obtained in tpc-h with USS-Hit update silent store suppression. However, when applying USS-P update silent store suppression, we see substantial writeback reduction in barnes, raytrace, tpc-h and tpc-w. This indicates that a substantial fraction of store misses are update silent, and therefore, suppressing update silent store misses is important for eliminating writebacks. This can be easily accomplished using either of two methods: 1) issuing a ReadX transaction on the miss, comparing the written value against the value returned by the memory system and conditionally installing the line in either exclusive (E) or modified (M) state, or 2) issuing a Read transaction on the miss, comparing the written value against the value returned by the memory system, and conditionally sending a ReadX to upgrade to modified (M) state when a store miss is non-update silent.1
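The two methods can be sketched as follows; the fetch callback standing in for the memory system and the returned line dictionaries are assumptions for illustration only.

```python
# Sketches of the two store-miss handling methods just described; both verify
# update silence against the data returned by the memory system. The cache
# line and bus representations are illustrative stand-ins.

def method1_readx_conditional_install(addr, store_value, bus, fetch):
    bus.append(("ReadX", addr))
    data = fetch(addr)
    if data == store_value:                       # update silent miss:
        return {"data": data, "state": "E"}       # install clean (E), so the
    return {"data": store_value, "state": "M"}    # line never needs writeback

def method2_read_conditional_upgrade(addr, store_value, bus, fetch):
    bus.append(("Read", addr))
    data = fetch(addr)
    if data == store_value:                       # update silent miss: keep a
        return {"data": data, "state": "S"}       # shared copy, store suppressed
    bus.append(("ReadX", addr))                   # non-silent: upgrade to M
    return {"data": store_value, "state": "M"}

memory = {0x40: 5}
bus = []
print(method1_readx_conditional_install(0x40, 5, bus, memory.get))
# {'data': 5, 'state': 'E'}; one ReadX on the bus, no future writeback
```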

The writeback rate in barnes, raytrace, and tpc-h is low—less than 1 writeback every 20K instructions—implying that relatively little data traffic might be saved by eliminating writebacks. However, when comparing with Figure 4-1, we see that although the writeback rate is low, in tpc-h and tpc-w we actually save approximately two times as much data traffic from writeback removal with USS-P compared to sharing miss reduction. This occurs because the fraction of writebacks removed is much larger as compared to the fraction of sharing misses removed for USS-P.2 Therefore, non-trivial data traffic reduction can be achieved exploiting update silent stores for both communication and writeback reduction.

1. Each of these approaches has a different cost/benefit in terms of sharing miss reduction, as well as store commit latency. We defer discussion of handling store misses in the context of update silent stores until Section 4.3 and Section 4.4.

2. The fraction of writebacks removed does not include "implicit writebacks" required in MESI on dirty misses for Read transactions. We only calculate this reduction as explicit modified (M) lines replaced from the cache.

4.3 Address Traffic Considerations

We shift our focus to the effect update silent store exploitation has on address traffic in a MESI-based multiprocessor system. In Figure 4-3 we indicate the incoming invalidate rate (both hit and miss) for the three cases outlined in Section 4.2 for a 4MB 8-way associative cache. The stacked bars show the rate at which invalidates (including those triggered by both store clean hits and store misses) hit in a remote cache, miss in a remote cache, and miss in a remote cache if E state is not implemented (hence invisible E->M upgrades are not possible, and an explicit upgrade request from S->M is required).

FIGURE 4-3. Multiprocessor Invalidation Reduction by Exploiting Update Silent Stores. Received invalidates which hit a remote cache or miss a remote cache for Baseline, USS-Hit, and USS-P are shown. Results for a MESI protocol and additional invalidates which would occur in an MSI protocol are also indicated for a 4MB, 8-way associative cache.

For all benchmarks, we see a measurable reduction in invalidate traffic even for simple USS-Hit store suppression. Note that in general, we observe greater reductions in invalidates which miss remote caches, indicating that silent store suppression is most effective at eliminating so-called “useless” invalidates. Since address bus bandwidth is a precious commodity in snoop-based SMPs, this is a desirable property.

Note that the reduction in required address bandwidth is greatest in USS-P. However, as mentioned in Section 4.2, USS-P requires oracle knowledge of store update silence at the time of a store miss so either a Read or ReadX can be issued to fetch the cache line of interest. If Reads are issued for all store misses, all non-update silent store misses will necessitate two address transactions: 1) the initial Read; 2) an Upgrade (because the store is not update silent). This is obviously undesirable. If ReadXs are issued for all store misses, all update silent store misses will unnecessarily invalidate remote copies of the data, reducing the data sharing benefits of update silent store suppression (illustrated in Section 4.2). Obviously, there is a trade-off between eliminating sharing misses (issuing Reads followed by Upgrades for non-update silent store misses) versus reducing address traffic and store commit latency (always issuing ReadXs for store misses).

To examine the trade-off between these two differing approaches, we explore store update silence for misses where the data is not already present in the cache, i.e. true store misses, as opposed to store-clean misses, which only lack exclusive permission to write a line whose data is already cache-resident. The percentage of such update silent store misses is indicated in Figure 4-4 for both a finite (4MB, 8-way associative) and infinite cache.

FIGURE 4-4. Percentage of Update Silent Store Misses. The percentage of store misses without data (true store misses, not upgrades for solely exclusive ownership) which are contributed by update silent stores for both a 4MB, 8-way associative cache and infinite cache are shown.

We can see from the figure that the fraction of update silent store misses is largely dependent on benchmark (ranging from 5% in specjbb to almost 60% in tpc-w) and no definite trend emerges based on different cache sizes across benchmarks.3 However, in all cases (except tpc-w) less than 40% of store misses are update silent, indicating that handling store misses by a Read followed by an Upgrade for non-update silent store misses is most likely not the best choice; this policy will substantially increase address transactions in the system as the majority of store misses will now require two transactions instead of one.4 Furthermore, handling store misses in this way will likely slow the processor core's ability to commit store operations because of the additional latency for the Read/Upgrade as compared to the ReadX in the base case. We will revisit the impact of store commit throughput in Chapter 5 when discussing temporal silence and also in our performance evaluation in Chapter 6.

3. This can be due to various factors, including truly shared data communicated through memory due to finite cache capacity.

4. Note, however, that the actual address transaction increase will be much smaller than the fraction of non-update silent store misses due to other transactions observed in the system such as load misses, writebacks, cache line operations like the PowerPC data-cache-block-zero which occurs frequently in our workloads, etc.

As described, handling store misses with a Read/Upgrade pair seems undesirable. However, as shown in Figure 4-1, simply issuing a ReadX for store misses foregoes the opportunity to eliminate USS misses. Therefore, we must work a little harder to capture the available opportunity. Options at our disposal include:

• Predict which stores are likely to be update silent. To provide a more targeted prediction, we may focus only on those stores which normally contribute to communication misses instead of all stores. Program structure-based predictors for update silence may be useful; cache information indicating tag-match but non-writable status for the cache lines of interest may also aid in training.

• Provide a distributed mechanism for deciding whether a store miss is likely to be update silent. We can imagine sending the store miss data (potentially in a compressed form) along with a special Conditional ReadX transaction. When the request arrives at the current owner of the cache line, the owner can compare the store data, or some other attribute, with the current data value to determine the likelihood of store update silence. Based on this outcome, the current owner provides (through an appropriate snoop response) a shared copy of the cache line to the requestor if the store is likely update silent, or an exclusive copy otherwise (a sketch of this owner-side decision follows the list).5

• A final option is to provide a facility for issuing the ReadX for the store miss, but allow the requestor to re-install the cache line into remote processors if update silence is determined and exploiting update silence will likely lead to a remote miss being prevented. We call such silence useful silence; it is useful in the sense that exploiting it eliminates a dynamic instance of sharing. We describe a protocol enhancement which enables this mechanism of exploiting useful update silence when discussing protocol support for temporal silence (Chapter 5). Therefore, we will return to a discussion of handling update silent store misses using this mechanism after describing the protocol enhancement and related material, i.e. Section 5.7.4.

5. The utility of this technique is dependent on the interconnect model assumed for the system. In bus-based, snooping, multiprocessors such a technique may be envisioned; in directory systems or systems without appropriate snoop-response support, this technique might not be useful or might require special modification.
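As referenced in the second option above, the owner-side decision for the hypothetical Conditional ReadX transaction might look like the sketch below; the transaction name comes from the text, while the response encoding and line representation are our own illustrative assumptions.

```python
# Sketch of the owner-side comparison for the Conditional ReadX option above.
# The snoop-response encoding and line representation are illustrative.

def snoop_conditional_readx(owner_line, requestor_store_value):
    if owner_line["data"] == requestor_store_value:
        owner_line["state"] = "S"                 # likely update silent: both
        return ("shared", owner_line["data"])     # caches keep shared copies
    owner_line["state"] = "I"                     # likely non-silent: surrender
    return ("exclusive", owner_line["data"])      # the line exclusively

owner = {"data": 0, "state": "M"}
print(snoop_conditional_readx(owner, 0))  # ('shared', 0): remote miss avoided
print(owner["state"])                     # 'S'
```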

Each solution may be applicable, depending on system-level constraints. We detail the method used in our performance studies, and justification for it, in Section 5.7.4. We present execution-time improvement through exploiting update silent stores in multiprocessors in Section 6.4.

4.4 Critical Update Silent Stores

We have shown that update silent stores can be exploited to eliminate a substantial fraction of both data and address traffic. However, we also explained in Section 4.2 and Section 4.3 potential complications in handling both store hits and store misses to minimize system-address traffic, leading to two different policy studies (USS-Hit and USS-P).

We introduced the concept of critical update silent stores in uniprocessors in Section 2.4.1.2, indicating that it is sufficient to suppress only the critical update silent stores to achieve maximal writeback reduction. As a brief reminder, we call an update silent store critical if suppressing it will lead to a reduction in writebacks; not all update silent stores are necessarily critical due to possible interactions with non-update silent, spatially and temporally local, stores during a cache line lifetime. As indicated in that discussion, the notion of criticality has expanded meaning, as well as additional relevance, in multiprocessor systems, beyond the simple difference described for USS-Hit and USS-P in Section 4.3.

Assume for the moment that we are only interested in suppressing update silent stores to reduce data traffic and address traffic between processors, as discussed in this chapter, and not the processor core benefits described in Chapter 2. In this coherence-only context, the natural store atom to consider is no longer a single dynamic store, but rather an entire cache line. Considering this, naively suppressing all update silent stores may be wasteful due to many factors. For example, suppressing update silent stores to a cache line which has already entered modified state is wasteful because the line is already exclusive in the current cache and is also dirty and will need to be written back; no data or address traffic reduction will result. Therefore, in the context of coherence transactions, this is a spatial locality aspect of efficient update silent store suppression. Furthermore, a temporal aspect should be considered as well. To illustrate this temporal aspect, let's explore address transactions in the MESI protocol when suppressing update silent stores with the oracle USS-P predictor for store misses. Practically, this means we issue either a Read or ReadX request on each store miss with perfect knowledge of whether the store causing the miss is update silent.

With this understanding, consider the following scenario: a processor encounters an update silent store miss, consults the oracle predictor, and issues a Read since the oracle informed it of store update silence a priori. Let's assume remote snoop responses indicate the line should be installed in shared state. If a non-update silent store is later performed to the cache line, it will require an upgrade to obtain write permission. Notice that two coherence transactions were required (one Read for the miss and an additional upgrade), in contrast to basic handling where the update silent store miss would have led to a ReadX, eliminating the additional upgrade. Note that the Read/Upgrade pair is not always detrimental to system performance; if suppressing the update silent store miss extended the cache line lifetime in a single remote cache sufficiently to eliminate a remote miss, an improvement in address and data traffic is achieved. In fact, this is the motivation for suppressing update silent stores in multiprocessors.

However, if a remote miss was not eliminated by suppressing the update silent store miss, less overall traffic would have resulted if the cache line was requested with a ReadX even though the initial store miss was update silent, avoiding the Read/Upgrade pair. The detrimental effect of the Read/Upgrade pair on system performance can be two-fold: first, additional address traffic is created, increasing queuing delays for other coherence transactions for all processors. Second, this additional upgrade causes commit of the non-update silent store to be delayed because it waits for write permission. This delay may be partially hidden through write buffering or exclusive prefetching but nevertheless, some of the upgrade latency may be exposed to program execution. Therefore, optimally exploiting update silent stores is not only a question of update silent store miss prediction, but critical update silent store miss prediction.

To illustrate the point more directly, we present Table 4-1. The execution with USS-P update silent store suppression is shown in the top part of the table, while the baseline execution without update silent store suppression is shown in the bottom. Notice the additional upgrade transaction at T2 in the USS-P scenario, indicated with the double outline in the original. Since suppressing the update silent store miss at T1 (issuing the Read) did not lead to an eliminated sharing miss on another processor, issuing a single ReadX at T1 would've both reduced system address traffic and also reduced the store commit latency at T2.

Table 4-1: Illustration of Critical Update Silence in Multiprocessors. An oracle update silent store miss predictor will cause a Read (instead of a ReadX) for the store at T1. However, this line is later upgraded for a non-update silent store. Since the suppressed update silent store does not eliminate a remote miss, suppressing it is anti-critical. The baseline execution is shown in the bottom half of the example.

            CPU 0                     CPU 1
Time   Instruction  Cmd/Txn      Instruction  Cmd/Txn
T0     LD [A] (0)   Read
T1                               ST [A], 0    Read
T2                               ST [A], 1    Upgrade
T3     LD [A] (1)   Read/Miss

Baseline Execution Without Update Silent Store Suppression
T0     LD [A] (0)   Read
T1                               ST [A], 0    ReadX
T2                               ST [A], 1
T3     LD [A] (1)   Read/Miss

With this understanding, we can define terms to describe the different scenarios of interest. In general, we use the term critical to indicate that suppressing this update silent store will prevent a communication miss, anti-critical to indicate that suppressing an update silent store in this scenario will cause an additional upgrade transaction not present in the baseline, and non-critical to indicate that suppressing an update silent store will have no effect because no additional upgrades over the baseline case are required. More rigorously, we define the following types of store misses considering update silent stores (we assume infinite caches for ease of discussion):

Critical Update Silent Store Miss: An update silent store miss which, if suppressed, will prevent at least one dynamic instance of USS, i.e. will prevent a sharing miss, is a critical update silent store miss.

Such misses are critical because suppressing them leads to sharing miss reduction.

Anti-Critical Store Miss: A non-update silent store miss which is not system-cold (system-cold meaning the first reference to a cache line by any processor in the system) is anti-critical. Furthermore, an update silent store miss to data which is not accessed by at least one other processor before being written non-update silently is also anti-critical.

Such misses are anti-critical because they are not update silent and at least one other processor has a copy of the data. If no other processor has a copy of the data (system-cold), the cache line would be installed in exclusive state and is therefore not anti-critical.

Non-Critical Store Miss: An update silent store miss to data which is not accessed by another processor and also never written non-update silently is non-critical. Furthermore, an update silent store miss which is later accessed with only cold misses from other processors in the system before being written non-update silently is non-critical. Finally, any system-cold store miss, i.e. the first reference to a cache line by any processor in the system, is non-critical.

These update silent store misses are non-critical because suppressing them leads neither to reduced communication nor to a subsequent upgrade. System-cold misses are non-critical because they will be installed in exclusive state since they are, by definition of system-cold, not cached by any other processor.
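As a first-order illustration, these definitions can be phrased as a forward scan over a globally ordered, per-cache-line event trace. The following C sketch is a simplification: it folds the cold/system-cold distinctions into a single would_hit_if_line_survived flag, and the event fields are illustrative rather than taken from our simulator:

    #include <stdbool.h>

    typedef struct {
        int  cpu;                        /* processor issuing the access        */
        bool is_store;
        bool update_silent;              /* store writes the value already held */
        bool would_hit_if_line_survived; /* access is not cold: it would hit if */
                                         /* an earlier copy were kept valid     */
    } event_t;

    typedef enum { CRITICAL, ANTI_CRITICAL, NON_CRITICAL } miss_class_t;

    /* Classify the update silent store miss at index i by scanning the
     * globally ordered event list for the same cache line. */
    miss_class_t classify(const event_t *ev, int n, int i)
    {
        for (int j = i + 1; j < n; j++) {
            if (ev[j].cpu != ev[i].cpu) {
                if (ev[j].would_hit_if_line_survived)
                    return CRITICAL;  /* suppression eliminates this sharing miss */
                continue;             /* remote cold miss: neutral */
            }
            if (ev[j].is_store && !ev[j].update_silent)
                return ANTI_CRITICAL; /* non-silent write first: Read/Upgrade pair */
        }
        return NON_CRITICAL;          /* neither a saved sharing miss nor an
                                         extra upgrade over the baseline */
    }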

We have previously discussed only store misses, i.e. cases where no copy of the data is present before the store is performed. In the case of store clean hits, i.e. a copy of the data is present but in a non-exclusive state, simplifications to the above definitions can be made.

Anti-critical and non-critical have no distinction because store update silence can be verified using the local, non-exclusive, copy; if the store is update silent, our non-exclusive copy is maintained; if it is non-update silent, an upgrade is performed. In the case where a subsequent non-update silent store writes the line without preventing a sharing miss, the upgrade has simply been deferred, but an extra address transaction is never created. Therefore, we refer to such update silent stores as non-critical update silent stores. However, critical update silent stores still exist for store clean hits, which we formally define as:

Critical Update Silent Store: An update silent store which, if suppressed, will prevent at least one dynamic instance of USS, i.e. will prevent a sharing miss, is a critical update silent store.

It is critical because suppressing it leads to sharing miss reduction. Note that stores which create critical update silent store misses are also critical update silent stores.

Table 4-2: Illustrating the Classification of Store Misses. All true store misses are indicated with (m). Update silent stores are shown with (*). Address transactions observed in a machine performing oracle USS-P update silent store suppression are shown. Assume [A] contains the value zero and misses all caches to begin. The store misses marked (m) are non-critical, critical, and anti-critical from top to bottom.

    Time | CPU 0 Instruction | Cmd/Txn | CPU 1 Instruction | Cmd/Txn | CPU 2 Instruction | Cmd/Txn
    T0   | LD [A] (0)        | Read    |                   |         |                   |
    T1   |                   |         | ST [A], 0* (m)    | Read    |                   |
    T2   |                   |         |                   |         | LD [A] (0)        | Read
    T3   |                   |         | ST [A], 1         | Upgrade |                   |
    T4   | ST [A], 1* (m)    | Read    |                   |         |                   |
    T5   |                   |         | LD [A] (1)        |         |                   |
    T6   |                   |         | ST [A], 0         | Upgrade |                   |
    T7   | ST [A], 0* (m)    | Read    |                   |         |                   |
    T8   | ST [A], 1         | Upgrade |                   |         |                   |
    T9   |                   |         |                   |         | LD [A] (1)        | Read

We illustrate each type of miss in Table 4-2. The table shows address transactions observed in a machine performing USS-P update silent store suppression which issues Read transactions for update silent store misses. Assume [A] misses all caches and contains the value zero at time T0. Store misses are marked (m) in the table.

In this example, all store misses happen to be update silent store misses. The store miss at

T1 is not critical; although it is accessed by another processor at T2, since the access at T2 is a cold miss, suppressing the update silent store at T1 achieves no sharing miss reduction. Since a non-update silent store has not occurred between T1 and T2, the store miss at

T1 is also not anti-critical. Therefore, it is non-critical, i.e. issuing either a Read or ReadX at T1 would have been inconsequential.6 The store miss at T4 is critical; suppressing this update silent store directly avoids a sharing miss at T5. The store miss at T7 is anti-critical; we observe the Read/Upgrade pair at T7 and T8 which is characteristic of anti-critical misses.

FIGURE 4-5. Criticality of Store Misses. The percentage of store misses without data (true store misses, not upgrades for solely exclusive ownership) which are anti-critical, non-critical, and critical are indicated. The large sub-fraction of anti-critical due to non-update silent store misses is indicated separately.

We show the fraction of critical, anti-critical, and non-critical store misses using the classification scheme described for infinite caches in Figure 4-5. We see that the majority of anti-critical store misses are non-update silent store misses. However, these are trivially anti-critical as they are non-update silent. We are principally interested in examining the critical, non-critical, and anti-critical behavior of update silent store misses (top three sections of the stacked bars) as this illuminates whether considering memory behavior beyond the update silent store miss event itself is worthwhile. We focus on the critical and anti-critical categories because, by definition, the non-critical category can be serviced with either a Read or a ReadX without negative impact, and because the non-critical component is the dominant component of update silent store misses for only a single benchmark (tpc-w).

6. We are assuming a broadcast-based, snooping protocol. In a directory protocol, miss latency may be reduced by suppressing this update silent store to convert the cold miss at T2 into a 2-hop instead of 3-hop miss [31]. Therefore, this miss is still critical in some sense in a directory-based system. However, cold misses are the only case where this scenario arises, so we neglect it for the sake of brevity.

No definite trend exists across benchmarks in the fraction of update silent store misses which are critical versus anti-critical. In ocean and tpc-h, there are significantly more critical store misses; in radiosity and tpc-w, the converse is true. The remaining benchmarks show a nearly equal distribution between the two types. Due to the substantial population of anti-critical update silent store misses in some cases, criticality should be considered, beyond simple update silence prediction, to minimize address transactions in the system. Also, due to the substantial population of critical update silent store misses, policies which do not suppress update silent store misses will likely sacrifice significant potential for communication reduction. This is evidenced in previous results comparing the simple USS-Hit policy against USS-P (Figure 4-1).

In order to gain further insight into the program behavior, and also the predictability of criticality, we can explore the memory footprint of stores exhibiting critical behavior. Figure 4-6 shows the cumulative memory footprint of locations exhibiting USS-avoidable misses and therefore critical update silence. Scientific workloads are shown in the top part of the graph, commercial workloads on the bottom. In the scientific workloads, we observe that approximately a 60KB working set of memory locations contributes over 80% of all USS-avoidable misses, with barnes and raytrace exhibiting the smallest working sets; less than 6KB is needed to capture 80% of USS-avoidable misses in these workloads. The commercial workloads exhibit a larger working set in general (nearly 1MB is required in tpc-w to reach 80% of opportunity captured, for example) and also more variation between workloads. This is reasonable given the observation by many of increased memory footprints of commercial workloads [96, 75, 9].

FIGURE 4-6. Dynamic Memory Footprint Contributing to Critical Update Silence. The cumulative distribution of memory addresses (measured at cache line granularity) contributing USS-avoidable misses is indicated. The scientific workloads are shown in the top graph, commercial workloads in the bottom graph. Notice the log scale on the x-axis for both graphs.

We also observe that the dynamic footprint of memory locations exhibiting critical update silence is proportional to the fraction of sharing misses which are eliminated under USS; in comparing Figure 4-1 with Figure 4-6, we observe that the three benchmarks benefiting the most from USS-P update silent store suppression (ocean, specweb, tpc-w) also have the largest footprint of critically update silent memory locations. This indicates that most benchmarks reaping greater benefit under USS do so through sharing a greater number of memory locations update silently, as opposed to more frequently sharing the same locations update silently. Therefore, we must consider this large working set requirement when designing predictors to enable efficient exploitation of USS.

We describe a critical update silence prediction mechanism and handling of update silent store misses in Section 5.7.4, after detailing a protocol enhancement to exploit temporal silence. Update silent store misses can be handled seamlessly in our new coherence protocol, allowing both criticality prediction (as discussed in this section) as well as achieving baseline store commit latency for non-update silent store misses (as discussed in Section 4.3).

4.5 Memory Consistency and Correctness Implications

When discussing any memory optimization concerning visibility of memory values, it is important to consider the memory consistency model. In addition, memory operations may contribute changes to system state beyond the values they explicitly deliver into the memory hierarchy, through implicit page table updates or other side-effects. In this section, we explore both the consistency and correctness issues relating to update silent store suppression.

4.5.1 Memory Consistency Considerations

We argue that exploiting update silent stores does not materially complicate multiprocessor system design because, in common consistency models [3], the store verify can be performed at an equivalent point with respect to consistency constraints as the store. When the store verify is performed under these constraints, the store can safely be suppressed because it communicates no change in system state. However, in memory models which do not treat loads and stores equivalently, e.g. processor consistency and total store order [3], the hardware must ensure that the store verify (a load) adheres to store ordering rules.

We reason about the correctness of update silent store suppression for sequential consistency [61] using the constraint graph approach which is discussed by Landin et al.

[62]. At a high level, the constraint graph labels memory operations with directed edges indicating the observed order between memory references to the same address and also indicates other memory model constraints, such as program order, through additional directed edges. An execution can be shown to be correct if the constraint graph is acyclic.

Proofs for weaker models follow similarly, and we will discuss them briefly after describing the solution for sequential consistency.

Sequential consistency (SC) states that all memory operations performed by all processors must appear to be interleaved into a single total order. If a total order of all memory operations cannot be constructed, the memory system does not provide SC. If we assume a memory system which guarantees SC as the baseline, the following constraint edges exist concerning stores in the memory constraint graph:

WAW: Write-After-Write

WAR: Write-After-Read

RAW: Read-After-Write
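Checking an execution for SC then amounts to a cycle test over these directed edges: the execution is valid if and only if no cycle exists. A minimal C sketch, assuming an adjacency-list representation purely for illustration:

    #include <stdbool.h>

    enum { WHITE, GRAY, BLACK };  /* unvisited / on DFS path / done */

    /* Returns true if a cycle is reachable from node v.  A cycle means the
     * memory operations cannot be placed in a single total order, i.e. the
     * execution violates SC. */
    bool cycle_from(int v, int *const *adj, const int *deg, int *color)
    {
        color[v] = GRAY;
        for (int k = 0; k < deg[v]; k++) {
            int w = adj[v][k];
            if (color[w] == GRAY)                     /* back edge found */
                return true;
            if (color[w] == WHITE && cycle_from(w, adj, deg, color))
                return true;
        }
        color[v] = BLACK;
        return false;
    }

The check is run from every unvisited node; the graph is acyclic, and the execution SC, exactly when no call returns true.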

To prove that suppressing update silent stores by the process of store verification is correct under SC, we must prove that the execution provides the appearance of a total order of all memory operations. We start by taking an existing constraint graph, assumed to be correct for SC without considering update silent stores, with all constraint edges described as above (WAW, WAR, RAW). Then, we describe how existing constraint graph edges are converted in the presence of update silent store suppression. Finally, we argue correctness using two approaches: showing how constraint edges changed with update silent store suppression still impose the same execution, or how changed edges which no longer impose the same execution are not required since they contribute no change to system state.

To perform the edge conversions for update silent store suppression, we essentially convert a store into a load; this is the process of store verification, as described in Chapter 2. To make the discussion brief, we assume the reader is familiar with the constraint graph representation [62]. An example constraint graph is shown in Figure 4-7. We describe the scenario of interest in the figure in detail later; however, examining the figure now, to make the constraint graph and edge conversions more concrete, may be helpful.

Constraint graph edge conversions for each edge type are:

WAW: This edge may be converted into a RAW with update silent store suppression in the case where the source store contributing to the edge is non-update silent but the target is update silent. Since RAW and WAW impose the same constraints on execution under SC, the execution is unchanged. Similarly, if the source store is update silent, but the target is non-update silent, this edge may be converted into a WAR edge, which again imposes the same constraints on execution under SC, and the execution is unchanged. In the case where both contributing stores are update silent, this edge may be converted into a RAR, which we discuss separately.

WAR, RAW: These edges may be converted into RAR edges with update silent store suppression. Since RAR edges do not impose execution constraints under SC, this implies a different execution is possible. However, to prove suppressing update silent stores is still correct, we need only prove that the same execution will still be observed. We discuss such RAR edges next.

WAW, WAR, RAW Conversion to RAR: Guaranteeing that these constraint graph edge conversions will create the same execution, and hence preserve SC, is a bit trickier.

However, once we realize that suppressed update silent stores inherit a RAW edge from the previous non-update silent store to the memory location of interest, the conversion becomes simple. This edge is not shown in the baseline constraint graph because it was transitively reduced. This means it was present in the baseline execution but it was not required in the constraint graph to describe the execution; this transitive RAW edge was essentially hidden by a subsequent WAW, WAR, or RAW edge. This transitive edge dictates that the update silent store must be suppressed with the previous non-update silent value. An example of RAR edge conversion is shown in Figure 4-7, with original constraint graph edges for the observed execution shown in solid lines, and the transitive

RAW edge introduced by update silent store suppression indicated with the dashed line.

Edges converted from WAW into RAW and RAR by update silent store suppression, as well as the two update silent stores, are also indicated.

As stated, RAR edge conversion may remove previous edges constraining execution from the constraint graph. However, in previously published work on validating and verifying SC, it was shown that RAR edges can be safely ignored [27, 42]. This is intuitive: only instructions which create values observable through memory, i.e. non-update silent stores, need strict ordering requirements with respect to other memory operations. A subtlety arises with update silent stores, of course, since update silence can only be properly determined with respect to the previous non-update silent store in the global order. This is precisely the WAW edge converted into a RAW edge, as described above. We have already described the correctness constraint in this case.

FIGURE 4-7. Illustration of Transitive RAW Edge in the Constraint Graph. The original constraint graph is shown in solid lines, along with the original constraint graph edge types. Update silent store suppression converts the two WAW edges into RAW and RAR edges. The exposed transitive edge revealed due to update silent store suppression is the dashed line.
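Taken together, the conversions above reduce to a small mapping. The following C sketch takes an edge type plus flags indicating whether the edge's source and/or target operation is a suppressed update silent store (for a WAR edge the source is a load, and for a RAW edge the target is a load, so only one flag is meaningful in those cases); the encoding is illustrative only:

    #include <stdbool.h>

    typedef enum { RAW, WAR, WAW, RAR } edge_t;

    /* src_silent/tgt_silent: the source/target memory operation of the edge
     * is an update silent store turned into a load by suppression. */
    edge_t convert_edge(edge_t e, bool src_silent, bool tgt_silent)
    {
        switch (e) {
        case WAW:
            if (src_silent && tgt_silent) return RAR; /* handled via the    */
                                                      /* transitive RAW edge */
            if (tgt_silent)               return RAW; /* same SC constraints */
            if (src_silent)               return WAR; /* same SC constraints */
            return WAW;
        case WAR:
            return tgt_silent ? RAR : WAR;  /* the write (target) suppressed */
        case RAW:
            return src_silent ? RAR : RAW;  /* the write (source) suppressed */
        default:
            return e;
        }
    }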

Guaranteeing that all store verifies inherit this transitive RAW edge can be accomplished through various means. Fortunately, this problem has already been extensively studied in the context of high-performance out-of-order processors which implement SC.

As a simple example, consider the speculative load hoisting approach used in the MIPS

R10000 [110]. The same mechanism can be used for speculative store verifies to ensure verification against the previous non-update silent value. Additional methods, as well as a detailed discussion, are available in Gharachorloo's thesis [40].

The conditions for other common memory models (processor consistency, total store order, and weak order [3]) are similar. Constructing the constraint graph for these models is discussed by Cain et al. [16] [17]. The key requirement for correctly suppressing update silent stores is to ensure that the store verify (load) adheres to store ordering constraints with respect to the consistency model. Under memory models which handle loads and stores with similar constraints (sequential consistency, weak order), using existing unmodified hardware and simply issuing a load in place of the update silent store may be viable, as is done by Kim and Lipasti [57]. However, in processor consistency and total store order, loads and stores exhibit different ordering semantics, implying that a special type of load operation (with store ordering requirements but which does not obtain an exclusive copy of a cache line) may be required.

4.5.2 Existing ISA Correctness Considerations

We have argued in the previous section that exploiting update silent stores does not materially complicate multiprocessor system design because, in common consistency models [3], the store verify can be performed at an equivalent point with respect to consistency constraints as the store.

However, in practice, this statement only holds for stores which have no side-effects other than communicating new values. Stores can have side-effects besides delivering a new value into the memory location being written. For example, most architectures define certain memory references to be either cacheable or non-cacheable for various reasons, e.g. I/O space memory ranges, explicit I/O writes. Any non-cacheable memory location should avoid update silent store suppression to maintain correct operation. Such non-cacheable references are typically indicated either through page permissions, range checking, or explicit instruction encoding, thus avoiding store verification or suppression for such writes is straightforward.

Additional complications arise for cacheable memory locations, and we strive to illuminate such difficulties in this section. For example, in architectures which rely on load-locked and store-conditional atomic primitives, any store operation can have the side-effect of clearing a reservation in another processor (regardless of silence), thus impacting execution on the other processor. Furthermore, each dynamic store can have side-effects through the page table; most architectures specify some type of reference bits and/or dirty bits for each page to optimize page replacement policies. Suppressing update silent stores may cause a change in behavior if these side-effects are not accounted for properly at either the architecture or implementation level, potentially impacting program correctness.

In this section, we outline correctness issues known to us for many common ISAs in use as of the writing of this thesis. This list should not be considered exhaustive, but rather an indication of complications that may arise. We ignore the compatibility argument made in other work [49], which is an important practical consideration in optimizing memory system design for industrial practitioners. We assume that allowing changed execution semantics is acceptable, provided the execution still adheres to the architectural definition.

We note that, in general, most existing ISAs are ambiguous where the subject of value locality is concerned, since they were designed before value locality was discovered in 1996 [73, 81]. In most cases, if we interpret “store” or “store-like” statements in the architecture to only imply changing the architected value, no substantial complication exists. Therefore, many of the scenarios described subsequently can be unambiguously resolved by a simple clarification to the ISA. However, we still explore various scenarios of interest to illuminate the possible issues.

4.5.2.1 PowerPC

We are most familiar with the PowerPC architecture, as that is the architecture used throughout the simulation environments of this thesis. PowerPC suffers from both potential caveats mentioned previously; thus suppressing update silent stores may be observable through both load-locked (lwarx/ldarx) and store-conditional (stwcx/stdcx) atomic primitives, as well as through the page table reference and change (R/C) bits.

Focusing first on the page table, the reference bit is not a problem since store verification will still set this bit, if not set already, with the implementation we have described.

According to the definition of PowerPC [78], setting this bit speculatively is permissible for store verifies carried out before a store is non-speculative. Regarding the change bit, this cannot be set speculatively with a store verification beyond an unresolved branch.

However, if the change bit is already set, a speculative store verification presents no problem. Therefore, according to the architectural definition, speculative store verifies can be carried out with regular load operations provided the change bit is already set, or non-speculatively provided the store verify sets the change bit.
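Stated as a predicate, the rule reduces to the following; a simplification with illustrative names, not an architecturally complete condition:

    #include <stdbool.h>

    /* May a store verify for this store be issued right now?  Per the rule
     * above: speculative verifies are permissible once the change (C) bit is
     * already set; otherwise the verify must wait until the store is
     * non-speculative, at which point setting C is architecturally allowed. */
    bool store_verify_allowed(bool change_bit_set, bool store_nonspeculative)
    {
        return change_bit_set || store_nonspeculative;
    }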

Load-locked (lwarx/ldarx) and store-conditional (stwcx/stdcx) present a more complicated scenario. In the following discussion we use lwarx to mean either lwarx or ldarx, and stwcx to mean either stwcx or stdcx, for brevity. According to the architectural definition, any store to the reservation granule appearing between the lwarx/stwcx pair must clear the reservation, causing the stwcx to fail. If an update silent store occurs to the reservation granule, the reservation may not be cleared, thus causing a change in execution. Although the scenario is contrived, and we imagine no rational programming construct under which this may occur, it is possible to write code which behaves unexpectedly when update silent stores are suppressed. The conditions which must be met are represented graphically in Figure 4-8.

In order for an update silent store to be guaranteed to affect execution of lwarx/stwcx pairs, two conditions must hold: the update silent store must be constrained through program dependences to appear logically between the lwarx/stwcx, and the code must behave differently based solely on the success/failure of the stwcx and not the data value observed at the location of interest. The first condition ensures a simple logical reordering of memory references does not allow the update silent store to fall outside the lwarx/stwcx pair; the second condition ensures that the only architecturally visible aspect of the update silent store (its side-effect on the reservation) is exploited to change execution. Since PowerPC allows arbitrary memory references between lwarx/stwcx pairs,7 it is indeed possible to create the described scenario. We illustrate two examples through PowerPC-like pseudo-code shown in Figure 4-9 and Figure 4-10. The first example shows code intended to implement mutual exclusion which works in some, but not all, cases when update silent store suppression is implemented. The second example shows how to detect whether update silent stores are being suppressed by exploiting the behavior of lwarx/stwcx.

FIGURE 4-8. Conditions for Guaranteed Detection of an Update Silent Store. The update silent store is detected through its impact on load-locked/store-conditional behavior. The figure shows an update silent store to address [A] which can be detected through the success of the store-conditional performed by CPU 0 (through the value of the condition register cr0). The solid lines indicate causality constraint edges implemented by programming constructs surrounding the update silent store, which causes it to appear logically at (*) on CPU 0 (indicated by the dashed line). Once suppressing the update silent store affects execution of the stwcx, unexpected execution can occur.

             CPU 0                              CPU 1
    Logical  lwarx [A] (0)
    Time     (*)                                store [A], 0*
             stwcx. [A]
             if (cr0) { /*stwcx succeeds*/ }

7. Certain reference patterns are not advised and may lead to livelock if the size of the reservation granule is not considered [78].

FIGURE 4-9. A Code Sequence No Longer Providing Mutual Exclusion. With update silent store suppression, the sequence shown may no longer provide mutual exclusion. The update silent stores which can be observed and affect program execution are shown (*). The same high-level constructs illustrated in Figure 4-7 are present.

    CPU 0:                             CPU 1:
    store [C], 0;                      store [D], 0;
    sync;                              sync;
    flag=1;                            flag=1;
    lwarx [C] (0)                      lwarx [D] (0)
    while (flag) {                     while (flag) {
      store [A], 1                       store [B], 1
      sync;                              sync;
      if ([B] == 0){                     if ([A] == 0){
        flag=0;                            flag=0;
        store [D], 0 (*);                  store [C], 0 (*);
      } else                             } else
        store [A], 0;                      store [B], 0;
      sync;                              sync;
    }                                  }
    stwcx. [C] /*sets cr0*/            stwcx. [D] /*sets cr0*/
    store [A], 0;                      store [B], 0;
    sync;                              sync;
    if (cr0){ /*stwcx succeeds*/       if (cr0){ /*stwcx succeeds*/
      store [F], 1;                      store [E], 1;
      sync;                              sync;
    }                                  }
    if ([E] == 0){                     if ([F] == 0){
      /* critical section */             /* critical section */
    }                                  }

Figure 4-9 shows synchronization code which may fail with update silent store suppression.8 First, we argue that the construct shown provides mutual exclusion for the critical section labeled without update silent store suppression. The only case of interest is when CPU 0 and CPU 1 enter the sections protected by the lwarx/stwcx pairs at approximately the same time. Dekker's algorithm [98] guarantees that only the store to [D] by CPU 0 or the store to [C] by CPU 1 will be performed between the lwarx/stwcx pair. Let's assume without loss of generality that CPU 1 wins and its store to [C] occurs. In the case without update silent store suppression, the indicated edge will be a WAW, requiring an upgrade, clearing the reservation set by CPU 0; this condition is observed through the condition register cr0, which is set to indicate success or failure of the stwcx. Therefore, CPU 0 will not enter the critical section (because its stwcx will fail), while CPU 1 will enter the critical section (because its stwcx will succeed), thus guaranteeing mutual exclusion.

8. The example is complicated and contrived. I apologize for that, but I couldn't think of a simpler one...

However, in the case of update silent store suppression, the update silent store edge is converted from a WAW into a RAW as described in Section 4.5.1. Therefore, an exclusive copy is not required, and the reservation set by CPU 0 (observed through condition register cr0) is not cleared. The stwcx on both CPU 0 and CPU 1 will succeed, allowing both to enter the critical section, which is not the intent.

FIGURE 4-10. A Code Sequence Which Can Detect Update Silent Store Suppression. The pseudo-code follows the example shown previously in Figure 4-8. Note that the barrier() must be implemented using regular loads and stores and cannot use lwarx/stwcx. Update silent stores are shown with (*).

    CPU 0:                           CPU 1:
    store [A], 0                     store [B], 0
    barrier();                       barrier();
    lwarx [B] (0)                    lwarx [A] (0)
    barrier();                       barrier();
    store [A], 0 (*)                 store [B], 0 (*)
    barrier();                       barrier();
    stwcx [B] /*sets cr0*/           stwcx [A] /*sets cr0*/
    if (cr0) { /*stwcx succeeds*/    if (cr0) { /*stwcx succeeds*/
      /*update silent store            /*update silent store
        suppression*/                    suppression*/
    } else {                         } else {
      /*no update silent store         /*no update silent store
        suppression*/                    suppression*/
    }                                }

Figure 4-10 re-illustrates the construct which can be used to verify whether update silent store suppression is being performed, based on side-effectual communication through the reservation granule: the remote update silent store, when suppressed, generates no invalidation and therefore leaves the local reservation intact, so the stwcx succeeds exactly when suppression has occurred. The working of this example has already been explained in conjunction with Figure 4-8. Note that the barrier is any standard barrier-type synchronization [31] which must be implemented without using lwarx/stwcx and should not be to the same reservation granule as [A] or [B]. The additional barrier after each initializing store serves to guarantee that both stores marked (*) are indeed update silent.

A final issue arises with store-conditionals (stwcx/stdcx) in PowerPC, again related to the reservation granule. If performing speculative update silent store suppression (as described in Section 4.5.1, where store verification is performed before a store reaches commit), we must guarantee that all stores (update silent or otherwise) clear any reservations set on the current processor if they are to the same reservation granule. We cannot clear the reservation speculatively when the store verify is issued because of potential livelock; the reservation register must be handled in program order. A simple solution is to check all stores (suppressed update silent and non-update silent) against the reservation register at commit.
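A sketch of this commit-time check follows; the structure and field names are illustrative rather than taken from any implementation:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        uint64_t granule;            /* address of the reserved granule */
    } reservation_t;

    /* Called in program order as each store commits.  Suppressed update
     * silent stores take this path too: silence exempts a store from
     * obtaining write permission, not from clearing the local reservation. */
    void commit_store(reservation_t *res, uint64_t addr, uint64_t granule_mask)
    {
        if (res->valid && (addr & granule_mask) == res->granule)
            res->valid = false;
    }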

4.5.2.2 Other ISAs

Other common ISAs may exhibit characteristics similar to those described in detail for PowerPC. Again, the fundamental consideration is side-effects of memory references and architectural observability of these side-effects. We emphasize that, even for PowerPC, most of the complication arises due to ambiguity of architectural definitions with respect to value locality. For example, while we showed that the code example of Figure 4-9 may not function as intended with update silent store suppression, a simple solution is to make such code illegal by disallowing architected side-effectual communication. Communication through the page table can be handled similarly; simply specifying that a change bit is not guaranteed to be set unless the value of the page has in fact changed is an intuitive and straightforward solution.

For other load-locked/store-conditional architectures, e.g. MIPS [110] and Alpha

[26], it may not be possible to detect update silent store suppression through side-effects as we have illustrated. Alpha and MIPS do not recommend placing memory operations between load-locked/store-conditional pairs (the behavior of such constructs is either undefined or unpredictable), thus the problem indicated for PowerPC is eliminated.

In architectures which use other hardware synchronization constructs, e.g. test and set, compare and swap, fetch and add, the issue of side-effects is non-existent. Since the only visibility of memory operations is through their update to actual architected memory state, suppressing update silent stores should not lead to incorrect operation. Such architectures include x86 [25], AMD64/x86-64 [32], IA-64 [51], SPARC [63], and PA-RISC

[47]. Finally, we consider the case of unaligned and multi-word memory reference instructions. In PowerPC and x86, atomicity of such references is normally not guaranteed, thus making update silent store suppression correct; in other architectures, special consideration may be needed for multi-memory-word instructions.

4.6 Related Work

As far as we are aware, there has not been any published work exploring update silent stores, or similar concepts, in multiprocessor systems for the purpose of improving multithreaded program performance prior to this thesis research. Subsequent to the foundational research of this thesis, update silent stores have been explored by other researchers as a means to eliminate cross-thread dependences in both explicit threading and implicit threading environments. An example of explicit threading is thread-level speculation (TLS) systems. Results in work by Steffan et al. [103] and Cintra et al. [24] showed that a substantial number of cross-thread memory dependence violations, signaled solely because a location was written, can be safely eliminated by suppressing update silent stores in such systems. An example of implicit threading is Slipstream processors. Update silent stores have been utilized in this environment to achieve similar benefit [104]. Other explicit and implicit threading architectures (a non-exhaustive list includes Multiscalar

[39], dynamic multithreading or DMT [5], and master-slave speculative parallelization

(MSSP) [113]) may also benefit from exploiting store update silence.

We have also discussed critical update silence and its interaction with coherence events to minimize system-level address communication. There has been substantial work on predicting and understanding multiprocessor sharing patterns which may be leveraged for update silence and critical update silence prediction (Section 4.4) [64] [59] [60] [86]

[53] [74]. Many of these works study scientific workloads exclusively or assume directory-based coherence as a key enabler (Martin et al. [74] is a notable exception). Given our significant effort in commercial workloads and focus on snoop-based multiprocessors, we contribute significant validation and further engineering effort to these works. We discuss predictive coherence in the context of temporal silence in Section 5.7.

Chapter 5

Temporal Silence in Multiprocessors

We discussed update silent stores in multiprocessors in the previous chapter. In this chapter, we explore additional store value locality in multiprocessor systems to further reduce unnecessary value communication. Specifically, we now consider a form of temporal store value locality, which we call temporal silence or temporally silent stores. We defined the notion of temporal silence and temporally silent sharing (TSS) in Chapter 3; we will re-introduce it shortly. We show the ability of TSS to eliminate unnecessary communication in multiprocessor systems and devise efficient methods to detect and communicate the occurrence of TSS in this chapter. We use the multiprocessor benchmarks from Table 3-4 and the simulation environment described in Section 3.6, unless stated otherwise, for all characterization data. We remind the reader that we discuss most mechanisms for capturing temporal silence from the perspective of snoop-based multiprocessor systems for the sake of brevity and clarity; comments on extending the proposals to directory-based schemes can be found in Section 5.8.3 and Section 5.9.

5.1 Motivation and Background

Industrial practitioners and academic researchers alike have observed that communication-induced cache misses are a pressing performance limiter in modern systems, especially those running commercial workloads. We have discussed these issues at length in Section 3.1, as well as the basics of invalidation-based cache coherence in Section 1.1.

Furthermore, we have shown in Table 1-1 that a significant fraction of dynamic store operations are update silent, and throughout Chapter 4 that update silent stores can be exploited to reduce both address and data traffic in multiprocessors. In this chapter, we explore temporally silent stores as a means to further improve communication performance in multiprocessor systems. To recap briefly (explained in detail in Section 3.3), temporal silence describes a program behavior in which we change a memory location to some intermediate value and a subsequent store reverts the location to an old value of interest. If this old value was previously present within the shared memory image, more specifically, in remote processor caches, we can imagine utilizing this to improve both communication latency and bandwidth within a multiprocessor system. We refer to the dynamic store writing the intermediate value as the intermediate value store and the dynamic store writing the old value of interest as the temporally silent store. The two stores together create a temporally silent store pair.1

We rigorously re-defined multiprocessor sharing to consider temporally silent stores, leading to the definition of temporal silent sharing (TSS), in Section 3.3. Practically speaking, TSS can be thought of in this way: when a communication miss occurs, we determine whether the value returned by the memory system exactly matches the last value observed for the memory location by this processor before it was invalidated. If the values match, TSS has occurred. Throughout the rest of this chapter, when comparing the ability of temporal silence and update silence to eliminate communication misses, we model suppressing all update silent stores, i.e. USS-P (Section 4.2). We refer to this simply as USS in subsequent discussions.
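Operationally, this check is a simple line-sized value comparison. A minimal C sketch, assuming the last-observed copy of the line is retained somewhere at invalidation time (as the mechanisms described later in this chapter arrange):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64

    /* On a communication miss: TSS holds iff every byte of the returned
     * line matches the last version this processor observed before the
     * line was invalidated, i.e. the miss delivered no new values. */
    bool miss_is_tss(const uint8_t returned[LINE_BYTES],
                     const uint8_t last_observed[LINE_BYTES],
                     bool last_observed_valid)
    {
        return last_observed_valid &&
               memcmp(returned, last_observed, LINE_BYTES) == 0;
    }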

5.2 Potential for Data Transfer Elimination

1. Our discussion here describes the intermediate value store and temporally silent store as single dynamic stores of interest, presumed to be to the same memory location. In reality, we usually consider multiple locations, i.e. entire cache lines, which exhibit temporal silence; therefore, the intermediate value store and temporally silent store may not in fact be to the same location due to spatial/temporal interleaving of intermediate value stores and temporally silent stores.

FIGURE 5-1. Percentage of Cache Misses for Different Definitions of Sharing. The data is normalized to the Baseline case for 64B cache lines and infinite caches. The contribution of additional capacity/conflict misses for progressively smaller, 8-way associative, finite caches of 16MB, 8MB, and 4MB is also indicated.

In Figure 5-1 we present infinite and finite cache miss rates for TSS as compared to USS, and also Dubois' definition [35], labeled "Baseline". The stacked bars (normalized to baseline, infinite cache) indicate the contribution of cold, true sharing, false sharing, and capacity/conflict misses for finite caches of decreasing sizes (16MB, 8MB, and

4MB). Focusing first on the infinite cache results, we see that TSS can reduce the overall miss rate for infinite caches by up to 33% and 27% over the baseline and USS, respectively (in specweb). The harmonic mean reduction across the scientific benchmarks

(SPLASH-2) is 15% and 12%, respectively, and for the commercial workloads, 25% and

21%.

In the case of finite caches, the relative reduction in overall miss rate is smaller because of two effects: 1) Additional misses created by pure capacity and conflict misses;

2) Creation of additional capacity and conflict misses due to fewer invalidated lines, and therefore a larger working set (as explained for USS in Section 4.2). However, for all finite caches, we see the reduction in misses tracking the infinite cache reductions in absolute number of misses prevented. The only notable exception is in tpc-w for 4MB caches, where the reduction in overall miss rate drops from over 20% to less than 6% (normalized to misses in the baseline, infinite cache case).

We can see in Figure 5-1 that the contribution of cold misses is substantial for many of the workloads, in large part due to the limited amount of system time we can practically simulate; we expect the relative fraction of cold misses to be substantially smaller in a real system. Because exploiting TSS in broadcast-based SMPs will not benefit cold misses, we focus solely on communication misses in subsequent exploration in this thesis.2 To more clearly illustrate the impact of USS and TSS on communication misses, we break this component out separately in Figure 5-2.

FIGURE 5-2. Percentage of Communication Misses for Different Definitions of Sharing. The data is normalized to the Baseline case for 64B cache lines and infinite caches.

Examining the communication miss component only in Figure 5-2, we see up to

45% reduction normalized to baseline and 35% reduction normalized to USS (in specweb). We see harmonic mean reductions in the scientific benchmarks of 24% and

19%, respectively; 42% and 32% for the commercial workloads. Most of the improvement is in true sharing, although some false sharing is also eliminated. Interestingly, TSS provides substantial benefit over USS, particularly in the commercial workloads. We will explore possible explanations for this throughout subsequent sections.

2. In directory-based systems, cold miss latency can be reduced for data which has been modified by another processor, converting 3-hop cold misses into 2-hop cold misses.

Finally, it is worthwhile to note the reduction in communication misses possible by exploiting store value locality for differing cache line sizes. Although we do not present detailed experimental data for brevity, we have observed that USS and TSS eliminate a smaller fraction of communication misses for increasing cache line sizes. This result is intuitive; as cache line size increases, the likelihood that an entire cache line exhibits silence for all written locations (leading to either USS or TSS avoidable misses) decreases.

However, non-trivial opportunity for reducing communication misses can still be achieved for longer cache lines. Appropriate choice of cache line size is a function of many factors, and has been discussed in detail elsewhere [31].

5.3 System-Level Considerations for Effectively Exploiting Temporal Silence

The potential to reduce communication by exploiting temporal silence is large.

However, exploiting temporal silence is inherently more difficult than update silence. To utilize temporal silence we must consider both dynamic stores (intermediate value store and temporally silent store) which comprise the temporally silent store pair; for update silence we need only consider a single dynamic store. This concept is explained in detail in Section 3.1 on page 57 and illustrated graphically in Figure 3-1. Because of the intermediate value, we decompose the problem of efficiently exploiting temporal silence into two subproblems: detection of temporal silence and communication of temporal silence.

Detection is determining that a temporally silent store has occurred; communication is informing remote processors of its occurrence so they can avoid communication misses.

In order to detect temporal silence we must provide facility for storing both the intermediate value and the temporally silent value for a given memory location; the intermediate value is needed to maintain coherent memory state, the temporally silent value is needed to detect reversion. Therefore, efficient detection implies minimizing storage for previous versions of the cache lines which are candidates for reversion. Efficient communication implies system transactions should only occur when misses can be eliminated, thus reducing communication overhead. In order to optimize detection and communication we may utilize the time which occurs between the temporally silent store and a remote access to the cache line. To make this idea more concrete, we present Figure 5-3.

FIGURE 5-3. Illustration of Timeliness Aspects in Temporal Silence. When eliminating communication misses, we can utilize the time between the temporally silent store (TS Store) and a remote access to the written data to improve detection and communication efficiency. In part (a) of the figure, we indicate this time can be used to perform the two essential aspects, detection and communication. In part (b), we show that achieving system-level benefit makes maximizing the time used for communication desirable. The dashed line indicates we desire quick detection, maximizing communication time.

In Figure 5-3(a), we show that the interval between the temporally silent store and a remote access may be used for both detection of temporal silence and communication of its occurrence to other processors. To effectively remove communication miss latency experienced by remote processors, we must ensure that they are informed of temporal silence before a remote access; otherwise potential benefit may be lost. Therefore, it is desirable to inform as quickly as possible since we do not know when a remote access will occur; Figure 5-3(b) depicts this goal. When considering different detection and communication schemes throughout subsequent sections, we must consider this timeline and the effect each detection and communication scheme has on it. Note that some of the schemes we explore may be able to perform either of these aspects (detection or communication) implicitly, that is, without an explicit comparison or transaction. If either aspect can be performed implicitly, it need not be considered as part of the timeline. As an example, consider update silent stores; with the USS-Hit policy shown in Section 4.2 on page 76, communication is performed implicitly because the upgrade normally present is simply avoided. Therefore, the timeline collapses into the single aspect of detecting that the store is update silent.3

In addition to the timeliness aspect, another key aspect to consider in efficient detection is the number of system-visible versions of a cache line which occur between

TSS eliminated communication misses. Multiple versions can occur because of the definition of TSS, which only stipulates that the value returned by the memory system for a reference be the same as the last value the processor of interest observed. Obviously, this value can be unique for each processor in the system. We show a contrived example of such a scenario in Table 5-1.

Table 5-1: Code Example for TSS Indicating Multiple Intermediate Versions. Coherence transactions for a core with USS store suppression are shown (LD [A] at T0 returns 0). The LD misses at T6, T9, and T12 can be eliminated with ideal TSS, but multiple system-visible versions must be tracked for all TSS misses to be successfully avoided.

    Time | CPU 0 Instruction | Cmd/Txn    | CPU 1 Instruction | Cmd/Txn    | CPU 2 Instruction | Cmd/Txn
    T0   | LD [A] (0)        | Read/Miss  | LD [A] (0)        | Read/Miss  | LD [A] (0)        | Read/Miss
    T1   |                   |            | ST [A], 1         | Invalidate |                   |
    T2   |                   |            | MEMBAR            |            | MEMBAR            |
    T3   |                   |            |                   |            | LD [A] (1)        | Read/Miss
    T4   |                   |            | ST [A], 0         | Invalidate |                   |
    T5   | MEMBAR            |            | MEMBAR            |            |                   |
    T6   | LD [A] (0)        | Read/Miss  |                   |            |                   |
    T7   | ST [A], 1         | Invalidate |                   |            |                   |
    T8   | MEMBAR            |            |                   |            | MEMBAR            |
    T9   |                   |            |                   |            | LD [A] (1)        | Read/Miss
    T10  |                   |            |                   |            | ST [A], 0         | Invalidate
    T11  |                   |            | MEMBAR            |            | MEMBAR            |
    T12  |                   |            | LD [A] (0)        | Read/Miss  |                   |

3. We will take advantage of viewing update silent stores as a special case of temporally silent stores, and their interaction with this timeline, again in Section 5.7.4.

In this example, the Read/Misses at T6, T9, and T12 are all TSS, as each returns the value previously observed for address A by CPU 0, 2, and 1, respectively. However, we see that at least two versions of the value at address A must be tracked in order for all misses to be prevented (both 0 and 1), as the live ranges [21] for these uses of address A overlap.

FIGURE 5-4. Globally Visible Cache Line Versions Required for Exploiting Useful TSS. The percentage of TSS avoidable misses captured with increasing number of globally visible versions is indicated. A globally visible version is created on any remote request to a modified (M) line. Globally visible version distance zero constitutes USS, and is not shown for clarity.

To obtain a first-order understanding of how many versions are actually required in practice, we perform a simple study. We count, for each occurrence of TSS which prevents a remote miss, how many globally visible versions of the cache line have been transferred among processors in the system between the original invalidation of the cache line and the

TSS avoidable miss. Each remote request to a modified (M) line creates another globally visible version.4 The results of this study are shown in Figure 5-4.
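The bookkeeping for this study can be sketched as follows; the per-line structure shown is illustrative rather than the simulator's actual implementation:

    #include <stdint.h>

    #define NCPUS 4

    typedef struct {
        uint64_t global_version;           /* versions made system-visible    */
        uint64_t version_at_inval[NCPUS];  /* snapshot taken when each CPU's  */
                                           /* copy of the line is invalidated */
    } line_track_t;

    /* Any remote request to a modified (M) line creates a new version. */
    void on_remote_request_to_m(line_track_t *t) { t->global_version++; }

    void on_invalidate(line_track_t *t, int cpu)
    {
        t->version_at_inval[cpu] = t->global_version;
    }

    /* Reported when a miss by 'cpu' is found to be TSS-avoidable. */
    uint64_t tss_version_distance(const line_track_t *t, int cpu)
    {
        return t->global_version - t->version_at_inval[cpu];
    }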

We can see from the figure that a version distance of one is sufficient for the four-processor system we study, capturing over 90% of the opportunity in all cases (except ocean and raytrace). Therefore, when exploring methods of efficient detection and communication throughout subsequent sections, we focus on capturing only a distance of one. We refer to this single candidate for reversion as the previous version, with the understanding that it is in fact the previous system visible version. When appropriate, we discuss the ability of different methods to capture additional previous versions.

4. Note that this definition does not exactly correspond to the live ranges we discussed previously in our example because it does not consider multiple versions which have the same value contained within the cache line. However, this metric does provide a meaningful measure of the distance in system-visible versions which must be captured. We use this metric instead of live ranges because it corresponds more directly to implementation within a coherence protocol framework; this point will become more clear when we discuss coherence protocol enhancements in Section 5.4.2.

5.4 Communicating Temporal Silence

Having illuminated that there is significant potential to exploit temporal silence in multiprocessor systems to eliminate communication misses (Figure 5-2), and having illustrated system-level aspects to consider in exploiting temporal silence, we first turn to the subproblem of communication. We explore communication first because it is likely to be the most difficult aspect to implement; correctly reasoning about and validating memory ordering and coherence protocols is widely regarded as a difficult problem [75] [1]. Once we understand the trade-offs of each communication mechanism, we can then design efficient detection suited to each. This flow is sensible since detection is conceptually straightforward; we need only provide adequate intermediate value and temporally silent value storage.

Throughout subsequent sections we discuss both speculative and non-speculative mechanisms which enable communication of temporal silence. Since these terms are widely used with varying interpretation in computer architecture, we feel a description of their intended meaning in this context is appropriate. We use the term speculative to refer to any mechanism which fundamentally requires the ability to recover a previous architected machine state to extract benefit from temporal silence. We use the term non-speculative to refer to any mechanism which does not require this ability. Note that both

We provide additional discussion of each mechanism throughout the following sections.

5.4.1 WC TSS

Many proposals, e.g. Dubois et al. [34], exploit the freedoms granted under weak memory models to optimize shared memory performance through either delaying processing of remote invalidations or delaying sending of invalidations. These proposals expose the architected behavior of the consistency model to the coherence protocol, allowing relaxed read/write timing. Similarly, we can utilize weak consistency models to exploit some cases of TSS. Weak consistency models, in their definition, allow some store pairs to be collapsed into an atomic unit non-speculatively (Figure 3-1). In PowerPC weak ordering, temporally silent pairs without a memory barrier, i.e. sync instruction, between them can be collapsed into an atomic unit, avoiding an invalidation and preventing some TSS misses. We call our scheme which eliminates these misses weak consistency TSS (WC

TSS). However, if a memory barrier divides the temporally silent pair, the pair cannot be collapsed by exploiting weak ordering semantics and a more aggressive method must be explored.

In Figure 5-5 we show the reduction in communication misses possible using an infinite write buffer and ordering intermediate values only at memory barriers, which indicates the best possible performance for WC TSS for the observed execution. The format of the figure is the same as Figure 5-2, except we have added the category "WC TSS" to indicate temporal silence captured by exploiting weak consistency. We see relatively small reductions in communication misses beyond USS (a maximum of 3% in specjbb), indicating that simply exploiting weak memory models is not sufficient. Considering the timeline presented in Section 5.3, this technique utilizes implicit communication; therefore the timeline collapses to simple detection. As long as temporal silence is detected before a resource or memory ordering constraint occurs, the temporally silent pair is communicated by simply removing the invalidation request.

FIGURE 5-5. Percentage of Communication Misses for WC TSS. The data is normalized to the Baseline case for 64B cache lines, including the reduction in communication misses under both USS and TSS, as well as exploiting weak memory ordering semantics to collapse atomic temporally silent pairs.

5.4.2 The MESTI Coherence Protocol

In machines with consistency models that require all stores to be ordered, WC TSS will be impossible without speculation [93]. Additionally, we saw in the previous section that exploiting weak memory models does not capture a significant fraction of the benefit possible with TSS, indicating that many temporally silent pairs cross memory barriers. As an example, consider lock variables. Locks are splendid candidates to exhibit temporal silence since they revert to the unheld value when released, but must be protected by memory barriers in order to implement their essential function.

We propose augmenting the coherence protocol as a non-speculative means to exploit temporal silence in cases where the intermediate value store must be ordered. In Figure 5-6, we show the additional coherence support we propose. We call our proposed protocol MESTI, which adds the temporally invalid state T. This state is entered upon receipt of an invalidate from another processor in the system if the data was previously valid (M, E, or S states). If another bus transaction occurs to a line in T state, it transitions to I state, unless the transaction is a validate, in which case it transitions to S state. Entering T state allows remote processors to save the previous version of a cache line so it can be reverted to later. The validate transaction allows a remote processor to re-install a cache line in remote caches when temporal silence is detected. In Figure 5-6, we indicate the PrWrs in boldface and italic text when a previous version of the cache line is saved; this enables a subsequent validate transaction if the line reverts to this version.5

FIGURE 5-6. State Machine for the MESTI Protocol. We augment MESI (using the notation from Culler and Singh [31]) with temporal silence support by adding a temporally invalid (T) state. Val denotes the validate transaction used to communicate the occurrence of temporal silence against the previous value for a cache line. The italicized and boldface PrWrs indicate where a previous version is saved.
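The T-state behavior just described can be summarized with a small snoop-side transition function. The following C sketch is illustrative only: it omits the processor-side (PrRd/PrWr) transitions, writeback actions, and the race handling discussed below, and the names are ours rather than taken from any implementation:

    typedef enum { ST_M, ST_E, ST_S, ST_T, ST_I } mesti_state_t;
    typedef enum { SNOOP_INVALIDATE, SNOOP_VALIDATE, SNOOP_OTHER } snoop_t;

    mesti_state_t mesti_snoop(mesti_state_t cur, snoop_t txn)
    {
        switch (cur) {
        case ST_M: case ST_E: case ST_S:
            if (txn == SNOOP_INVALIDATE)
                return ST_T;   /* keep the stale data as the previous version */
            break;
        case ST_T:
            if (txn == SNOOP_VALIDATE)
                return ST_S;   /* line reverted remotely: re-install, no data */
            return ST_I;       /* any other transaction: behave like invalid */
        default:
            break;
        }
        return cur;            /* remaining cases follow baseline MESI (omitted) */
    }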

A validate effectively places a cache line back into remote caches with only an address transaction, as opposed to sending new data, as is done in update protocols. However, address traffic may not equal that of an update protocol, because a validate is triggered only when the final temporally silent store occurs to the cache line; multiple intervening stores not part of the temporally silent pair, or other temporally silent stores to different locations within the cache line, do not cause validates.

We provide discussion of validation of the MESTI protocol in our performance simulation environment in Section A.1, as many of the issues uncovered in implementation are specific to PowerPC and the base coherence implementation used. A detailed description of the protocol used in the performance simulator can be found in Section A.2. Qualitatively, the protocol change we propose is inherently simpler than many other protocol changes because T state is used for performance optimization and is handled similarly to I state during race conditions to ensure correctness. The principal difficulty in designing a correct implementation of MESTI is ensuring that a validate does not become associated with incorrect T-state cache lines. There are many methods to ensure this; we will discuss the necessary conditions for failure of the protocol as well as widely-applicable solutions in detail in Section 5.8.3.

5.4.2.1 Temporal Silence Captured

In Figure 5-7, we show the reduction in communication misses possible with MESTI. We see harmonic mean reductions in the scientific benchmarks of 21% and 15% over the baseline and USS cases, and 40% and 32% for the commercial workloads.

5. We assume a write-allocate cache to save the previous globally visible version for the I to M and T to M transitions. Note that the version saved here is the data arriving from the system before the line is modified. Furthermore, this diagram assumes an “implicit writeback” upon any modified intervention, i.e. any remote request to a modified line also writes the modified data back to memory. This is standard handling in the Illinois MESI protocol [31].

FIGURE 5-7. Percentage of Communication Misses for MESTI. The data is normalized to the Baseline case for 64B cache lines, including the reduction in communication misses under USS, WC TSS, and TSS for comparison purposes.

Note that MESTI is within 4% of the TSS limit in communication misses for all workloads except raytrace (6%). Also note that the relative ability of MESTI to approach the TSS limit is slightly larger in the commercial workloads versus the scientific workloads, indicating that available opportunities for capturing temporal silence are different between workload types. We will explore this comparison between workload types more extensively in Section 5.10.

5.4.2.2 Writeback Elimination

FIGURE 5-8. Writebacks Removed by Exploiting Temporal Silence with MESTI. The data is normalized to the Baseline case for 64B cache lines with an 8-way associative, 4MB cache. Results for USS are also shown for comparison.

In addition to communication misses, we can also exploit TSS to eliminate cache writebacks, as shown for USS in Section 4.2. Since MESTI only captures cases of TSS which correspond to the previous version of a cache line, we can also directly eliminate writebacks under this protocol.6 We show writebacks eliminated under MESTI for a 4MB, 8-way associative cache in Figure 5-8. We see that in most cases, the explicit writeback reduction possible through MESTI is modest beyond USS. However, these results only indicate writeback reduction due to explicit modified line cast outs. As discussed in Section 5.4.2, any dirty miss under MESI creates an implicit writeback. Therefore, communication misses eliminated with MESTI will also reduce implicit writebacks. In MOESI protocols, used for performance simulation in Chapter 6, this implicit writeback is not needed due to the O (owned) state.

5.4.2.3 MESTI and TSS Compared

Table 5-2: Code Example for MESTI. Coherence transactions for a core with USS store suppression and weak ordering are shown (LD [A] at T0 returns 0). The LD miss at T5 can be eliminated with MESTI.

Time   CPU 0 Instruction   CPU 0 Cmd/Txn   CPU 1 Instruction   CPU 1 Cmd/Txn
T0     LD [A] (0)          Read/Miss       LD [A] (0)          Read/Miss
T1     ST [A], 1           Invalidate
T2     MEMBAR
T3     ST [A], 0           Validate
T4     MEMBAR                              MEMBAR
T5                                         LD [A] (0)          *Read/Miss

To illustrate why MESTI is incapable of exploiting all cases of TSS, we show two sample load/store sequences, one that can be exploited (in Table 5-2) and one that cannot (shown previously in Table 5-1, duplicated here in Table 5-3). In Table 5-2, the Read/Miss at T5 can be eliminated because the data it requires is validated at T3. However, Table 5-3 shows a more complicated scenario in which the load misses at T6, T9, and T12 are all TSS because the values read match the value seen previously by each CPU, but are not captured with MESTI. In short, MESTI is not able to exploit all cases of TSS because it only allows reversion to the immediately previous version of a cache line. Depending on sharing patterns, multiple versions which differ between CPUs may be required, as was also shown in Figure 5-4. However, the relatively simple, non-speculative protocol seems promising.

6. Eliminating writebacks for general TSS is not straightforward because of multiple intermediate values and temporally silent values, as discussed in Section 5.3.

Table 5-3: Code Example for TSS Indicating Multiple Intermediate Versions. Coherence transactions for a core with USS store suppression are shown (LD [A] at T0 returns 0). The LD misses at T6, T9, and T12 can be eliminated with ideal TSS, but multiple versions must be tracked for all TSS misses to be successfully avoided.

Time   CPU 0 Instr / Cmd-Txn     CPU 1 Instr / Cmd-Txn     CPU 2 Instr / Cmd-Txn
T0     LD [A] (0)  Read/Miss     LD [A] (0)  Read/Miss     LD [A] (0)  Read/Miss
T1     ST [A], 1   Invalidate
T2     MEMBAR                    MEMBAR
T3                               LD [A] (1)  Read/Miss
T4     ST [A], 0   Invalidate
T5     MEMBAR                                              MEMBAR
T6                                                         LD [A] (0)  Read/Miss
T7                                                         ST [A], 1   Invalidate
T8                               MEMBAR                    MEMBAR
T9                               LD [A] (1)  Read/Miss
T10                                                        ST [A], 0   Invalidate
T11    MEMBAR                                              MEMBAR
T12    LD [A] (0)  Read/Miss

The results shown in Figure 5-7 assume enough stale storage throughout the processor core and memory hierarchy to detect all cases of TSS which MESTI can exploit. Principally, this means augmenting traditional cache structures with stale storage of 64B per cache line, matching the cache line length. This implies a doubling of the data storage capacity of the cache. Also, as soon as a cache line has become temporally silent due to a temporally silent store, we broadcast a validate transaction to all other processors in the system. We will explore more efficient methods of implementing stale storage for a single previous version as needed by MESTI in Section 5.5. We will discuss delayed validate broadcasts and other address traffic considerations in Section 5.7.

Considering the timeline presented in Section 5.3, MESTI assumes explicit communication: an invalidate for the intermediate value plus a validate when temporal silence occurs. It also assumes explicit detection against the system-visible value. Therefore, we must consider both aspects in designing a system using MESTI. As mentioned, we will discuss both aspects in detail in Section 5.7 and Section 5.5, respectively.

5.4.3 Speculative Lock Elision (SLE)

As explained in Section 3.1, the key problems in exploiting TSS are maintaining correct system operation in the presence of the intermediate value while also preventing communication misses when temporal silence occurs. WC TSS creates atomic temporally silent pairs by exploiting relaxed memory consistency constraints; MESTI allows both the intermediate value and temporally silent store to be visible through augmented coherence support. Note that both methods are completely non-speculative in nature and may also be implemented entirely outside the processor core.

Another method to create atomic temporally silent pairs is through speculation. Rajwar and Goodman propose Speculative Lock Elision (SLE) [93], which exploits atomic “silent store-pairs” to provide a mechanism for correctly executing critical sections with non-conflicting data accesses in parallel. SLE also avoids the lock transfer, which eliminates remote misses on the lock variable. This approach enables creation of atomic temporally silent pairs regardless of the architected consistency model [93].

Since this approach is speculative in nature, all standard barriers to speculation apply, i.e. non-cacheable references and I/O accesses. However, changes in program flow, either due to microarchitectural state, e.g. branch predictions, or exceptions/interrupts, do not require the speculation to fail.7 Also, this approach requires sufficient buffering and roll-back support within the processor core for speculative operations and speculative memory writes. Any lack of resources within the processor or memory system may cause the elision process to fail. Various approaches for providing adequate resources for speculation are discussed by Rajwar [93][90]. In comparison with WC TSS, creating atomic temporally silent pairs through speculation creates additional complications within the memory system to enable performing multiple memory writes as an atomic unit.

Because SLE allows non-conflicting critical sections to be executed concurrently, we cannot study its performance, strictly speaking, with trace-based analysis. The basic problem stems from SLE’s implicit reordering of non-conflicting critical section executions. Given a fixed trace, the executions cannot be reordered; they are constrained by the trace. If multiple processors attempt to enter the same critical section, the base simulator creating the trace serializes their executions within the critical section regardless of conflicts. We emphasize that this is not a problem with SLE; it is a difficulty in using trace-based analysis to quantify its performance potential. Therefore, we refrain from attempting any kind of miss-rate analysis for SLE in this section and discuss detailed performance evaluation in Section 6.6.5. The inaccuracy of trace-based analysis occurs in the case of contended critical sections, where SLE can expose additional cases of TSS not reflected in our trace-based analysis. We also note that MESTI by its nature cannot capture and exploit such cases, which is a fundamental difference between the two approaches. However, in the case of uncontended critical sections, SLE and MESTI can capture the same set of TSS avoidable misses for all idioms detected by SLE; MESTI may exploit additional cases of TSS which cannot be collapsed into an atomic region or which avoid SLE’s idiom detection.

7. However, in practice, interrupt and exception conditions will likely cause a sufficient number of instructions to occur between the “silent store-pair” such that elision will fail due to resource constraints. This is acceptable, as the elision can always be aborted.

In addition, SLE and MESTI are fundamentally different in their approaches to exploiting temporal silence. Since SLE relies on speculatively creating atomic regions, it must know when to start speculating, i.e., which store is the intermediate value store. Naively, any store is a candidate to begin the elision process. Attempting to elide every store has the principal shortcoming of leading to many false positives for elision candidates. A simple way to reduce the candidate set of intermediate value stores is to detect certain idioms [93] as high-confidence for elision. In that work, they treated only store-conditional operations as candidates for elision; more specifically, a pattern of load-locked/store-conditional operations to the same memory address, followed by a subsequent store. This works well for the studies presented there, since only user code is simulated for scientific benchmarks from the SPLASH-2 suite [107], where all synchronization code is supplied by the user. Since store-conditionals were only used for lock acquires, this essentially provides perfect instrumentation for their targeted references around critical sections. We have observed, in full-system simulation environments under AIX/PowerPC, that the load-locked/store-conditional part of the idiom which can begin speculation is much more prevalent. It is used in additional contexts which are normally not good candidates for elision: clearing the reservation on context switches, implementing atomic list insertion/deletion, performing lock releases in addition to acquires, etc. Therefore, a stricter idiom or instrumentation may be required in practice to obtain good performance.
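
A minimal sketch of this style of idiom detection follows; it is our own illustration of the load-locked/store-conditional pattern described above, not the detector used in [93], and the structure and function names are hypothetical.

    #include <cstdint>
    #include <unordered_map>

    // Pre-acquire values of candidate lock addresses, captured at the larx.
    std::unordered_map<uint64_t, uint64_t> pre_acquire_value;

    struct Reservation { uint64_t addr; bool valid; } resv = {0, false};

    void on_load_locked(uint64_t addr, uint64_t observed_value) {
        resv = { addr, true };
        // A later store writing observed_value back to addr would complete
        // the temporally silent pair (the lock release restoring "unheld").
        pre_acquire_value[addr] = observed_value;
    }

    // A store-conditional to the reserved address matches the acquire
    // idiom: this store is the elision candidate (speculation may begin).
    bool stcx_is_elision_candidate(uint64_t addr) {
        return resv.valid && resv.addr == addr;
    }

As noted above, on a full system this simple filter fires in many contexts besides lock acquires, which is precisely why a stricter idiom may be needed in practice.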

Another possible method to begin the elision process, not based on a particular instruction idiom and not explored here, is to consider any write which modifies a cache line from the previous version as a candidate to begin the elision process. Practically, this is any pure store miss or store upgrade. However, such an approach is likely much more difficult to integrate into methods of speculation recovery support, since this attribute is not determined until execution of the store; most existing recovery mechanisms rely on program order to ease implementation, e.g. dispatch/insert of the store as is assumed by Rajwar et al. [93]. For this approach, determining the precise recovery state may prove difficult for the register checkpoint approach, leaving only buffering within the core as potentially feasible.

In contrast, MESTI is a purely non-speculative, reactive approach; it knows which value is the temporally silent value and simply detects when reversion to this value occurs. MESTI can be completely implemented outside the processor core with no speculation support. MESTI does not rely on detecting any particular idiom in order to facilitate eliminating communication misses. However, when MESTI and SLE are both eliminating the same communication misses, SLE can do so at lower cost in terms of address traffic. MESTI requires explicit communication through upgrade/validate pairs to eliminate communication misses. SLE utilizes implicit communication by allowing caches to continue using previous copies of the cache line.

As a final note of comparison for SLE and MESTI, we consider the stability of both methods in the face of data alignment and aggressive coherence implementations. Researchers have proposed collocating lock variables and critical section data to avoid multiple cache misses to shared data: one miss for the lock itself and others to the data within the critical section itself [31]. This may be beneficial for uncontended, fine-grained locking scenarios when acquiring a specific lock is a reliable predictor of data which will be accessed as part of the critical section. If collocated data is written, the elision process will fail, as writes to variables other than the lock variable itself are not elided and require exclusive ownership. Under MESTI, the protocol need not fail by design, as it is agnostic to collocation issues. If collocated data exhibits temporal silence, MESTI can still achieve performance benefit; whether collocated data exhibits temporal silence is algorithm-specific.

Furthermore, aggressive coherence implementations which speculatively prefetch exclusive ownership of cache blocks to reduce store commit latency for strong memory models, e.g. [25], [30], [33], or reduce read/upgrade pairs for migratory data, e.g. Kaxiras et al. [53], as well as software exclusive prefetches, may interact with both mechanisms. Under both SLE and MESTI, useless exclusive prefetches to the lock variable will cause both algorithms to restart or fail. However, exclusive prefetches to data within the temporally silent pair create additional complication for SLE. Since SLE uses the coherence protocol to detect atomicity violations, any exclusive prefetch which hits data between the temporally silent pair will be considered an atomicity violation. Therefore, it will trigger a restart or failure, leading to performance degradation. This effect should become more pronounced as the temporally silent pair distance (program distance between intermediate value and temporally silent store) lengthens, practically limiting the temporally silent pair distance SLE can exploit. In contrast, because MESTI does not rely on speculatively creating atomic regions using the coherence protocol, its performance will be less affected by such artifacts of modern implementations and other aggressive coherence schemes. We show in Section 5.5.3 that temporally silent pair distance can be substantial, indicating that providing robust mechanisms immune to other coherence activity to exploit long-distance temporally silent pairs is worthwhile. Finally, note that the same phenomenon can occur if the amount of false sharing to data within the speculative critical section is substantial due to data alignment issues.

We show a detailed performance comparison of SLE and MESTI in Section 6.5 and Section 6.6.5. We also describe our implementation of speculation support for both register and memory buffering in that section after the detailed simulation environment for performance studies is discussed.

Considering the timeline presented in Section 5.3, SLE assumes implicit communication, but relies on explicit detection of temporal silence to determine when the elision is successful; the elided intermediate value store must be paired with a temporally silent store to be successful. This is accomplished by Rajwar et al. [93] through idiom detection and explicit storage for elision candidates as well as both speculative execution and speculative write retirement support. Therefore, detection is low cost and timely, as long as proper idioms are used.

5.4.4 Temporal Silent Sharing (TSS)

Although we showed in Figure 5-7 that MESTI can capture many TSS misses, other workloads may be less amenable. We have shown an example of code which exhibits TSS but is not exploitable via MESTI in Table 5-3. Furthermore, changing the coherence protocol, as required for MESTI, or adding SLE support to the processor core may be undesirable.

Seminal research by Lipasti and Shen [73] introduced value locality and value prediction to the academic community and promoted load value prediction (LVP) as a potential application of value locality. Since this thesis builds on the foundation of value locality, it is fitting that we can return to this original work to provide another avenue to exploit TSS. Until now, we have focused on the value locality of the stores themselves, i.e., memory value production; however, the definition of TSS leads to another simple speculative method to potentially capture the benefit.

In Section 5.1, we discussed that TSS occurs when the value returned from the memory system on a miss is the same as the previous value observed by that processor. Given the large fraction of TSS communication misses, using values from tag-match invalid cache lines may prove fruitful. This approach has the desirable property of capturing all TSS misses, additionally all TSS false sharing misses, and a subset of TSS true sharing misses.8 However, this method has to address the challenges of implementing correct load value prediction in multiprocessors [76] and necessitates actual data transfer for the predictions to be verified. The previously described methods avoid data transfer when TSS has occurred. This implies data traffic for this method will be higher than WC TSS, MESTI, or SLE. We discuss our method of mis-speculation recovery and performance of this approach in Section 6.6.3. Since detection and communication of temporal silence are implicit from the producer’s perspective with this approach, the timeline presented in Section 5.3 does not apply.

8. The subset of interest includes those cache lines which are truly shared, but there is sufficient time between a falsely shared demand miss and the eventual access which causes true sharing. The demand miss will be value verified to be correct and will have prefetched for true sharing.
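
As a concrete illustration of this stale-value prediction, consider the following sketch; the structure and function names are our own, and the verification and recovery machinery is omitted.

    #include <cstdint>

    struct CacheLine {
        uint64_t tag;
        bool     valid;
        uint8_t  data[64];  // stale contents survive invalidation
    };

    // On a miss, a tag-matching but invalid line holds the last value this
    // processor observed; under TSS the returning data will often be
    // identical, so the stale bytes can serve as a load value prediction.
    const uint8_t* predict_from_stale(const CacheLine& way, uint64_t tag) {
        if (!way.valid && way.tag == tag)
            return way.data;
        return nullptr;  // no prediction: proceed non-speculatively
    }

    // When the fill returns, the predicted bytes must be compared against
    // the actual data; a mismatch squashes dependent instructions, subject
    // to the multiprocessor value prediction correctness rules of [76].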

5.5 Detecting Temporal Silence

Now that we have described methods at our disposal to communicate temporal silence, we turn our focus to detection. As described in Section 5.3, our goal is to detect temporal silence as quickly as possible. However, slight delay in detection may reduce its cost. It is also important to note that any detection technique, in addition to being space efficient, must be capable of capturing the dynamic program distance between the temporally silent pair (as described in Section 3.1); otherwise, temporal silence will not be observed and some opportunity to improve communication performance may be lost. We explore low cost detection techniques considering both aspects throughout this section.

5.5.1 In the Processor Core

If we assume a processor that implements update silent store suppression, we can augment the load/store queue (LSQ) to keep the load data for a verified store in the LSQ. This is similar to the LSQ cache presented in Section 2.3.2. Then detecting temporal silence simply involves checking any new store data values against the previous value for that location, i.e. the oldest loaded value for that address in the LSQ. Any stores which become temporally silent in the queue can be dropped, while all other stores are performed as usual.
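
The check described above amounts to the following sketch (the entry layout and names are our own invention; forwarding, ordering, and partial-word overlap are ignored for brevity).

    #include <cstdint>
    #include <deque>

    struct LsqEntry {
        uint64_t addr;
        uint64_t value;     // loaded value, or the value a store will write
        bool     is_store;
    };

    std::deque<LsqEntry> lsq;  // program order: oldest entry at the front

    // A queued store whose data equals the oldest loaded value for its
    // address has become temporally silent and can simply be dropped.
    bool store_is_temporally_silent(const LsqEntry& st) {
        for (const auto& e : lsq)
            if (!e.is_store && e.addr == st.addr)   // first hit = oldest load
                return e.value == st.value;
        return false;  // no prior loaded value available in the queue
    }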

Store buffers or write caches [83][63] can also be modified to perform a similar function. If the store buffers are write-allocate, e.g. UltraSparc-III [63], update silent store suppression need not be performed to detect temporal silence. Instead, explicit stale storage can record the memory value when the store buffer is allocated, before the store is actually performed. New stores are combined into the store buffer, as normal, and temporal silence can be detected by comparing the store buffer data against the stale version saved when the store buffer was allocated. If these match, the cumulative effect of all writes has resulted in temporal silence. Methods described throughout Section 5.4 can then be used to communicate the fact that temporal silence has occurred to other processors in the system. Note that these methods can detect temporal silence with very little delay, leaving maximal time for communication to remote processors (as described in Section 5.3). However, detecting temporal silence solely within the processor core or small write buffers or write caches can only capture temporally silent pairs with a short dynamic program distance between them. We will show in Section 5.5.3 that under many circumstances, particularly in commercial workloads, a long program distance can exist between useful temporally silent pairs. Thus, we focus on methods implementable outside the processor core to effectively capture this distance. Many of the techniques described subsequently are also applicable to microarchitectures with either write buffers or write caches.
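
The write-allocate variant reduces to a single comparison of the merged buffer contents against a stale image captured at allocation, roughly as sketched below (field names are ours; this is a minimal illustration, not the UltraSparc-III design).

    #include <cstdint>
    #include <cstring>

    struct StoreBufferEntry {
        uint64_t line_addr;
        uint8_t  data[64];   // initialized from stale[] at allocation, then
                             // overwritten as stores combine into the buffer
        uint8_t  stale[64];  // memory image captured when entry is allocated
    };

    // The cumulative effect of all merged writes is temporally silent iff
    // the combined data matches the stale image saved at allocation
    // (unwritten bytes compare equal by construction).
    bool entry_temporally_silent(const StoreBufferEntry& e) {
        return std::memcmp(e.data, e.stale, sizeof(e.data)) == 0;
    }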

5.5.2 Limited Stale Storage Throughout the Memory Hierarchy

FIGURE 5-9. Effect of Limiting Stale Storage on Exploiting Temporal Silence. The reduction in communication misses normalized to the Baseline case (see Figure 5-7) for 64B cache lines and infinite caches for MESTI and TSS with limited stale storage (16B/cache line, 8B/cache line) is shown.

Stale storage of an entire cache line allows the best possible performance of MESTI, and is what we have shown in all results presented so far. However, this implies a doubling of cache storage in order to completely capture all cases of temporal silence. In Figure 5-9, we show the effect of limiting stale storage to a subset of an entire cache line—either two separate 8 byte blocks (16B case) or one 8 byte block (8B case)—for both MESTI and TSS.9 We see that for all benchmarks, limiting stale storage to 16B (1/4 of a cache line) achieves nearly the ideal reduction in communication misses versus full cache line stale storage. The only exceptions are in tpc-h and tpc-w, where the difference is 5-10%. When stale storage is limited to 8B (1/8 of a cache line), barnes and specweb also show a non-trivial lost opportunity of 8% and 5%, respectively. However, in all cases, the reduction in sharing misses is substantial even with limited stale storage. Since most of the reduction with TSS is in true sharing misses, this implies that for sharing patterns which exhibit temporal silence, relatively few locations within the cache line are participating in the sharing. This observation can be exploited in many of the mechanisms we describe subsequently to limit stale storage per cache line, reducing the cost of detection substantially.

9. Limiting stale storage for general TSS does not correspond directly to an implementation in some sense (consider the example in Table 5-1). However, we still present characterizing data for this case for completeness.
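
One way to realize the limited 16B case is to track which sub-blocks of the previous version are retained, as in the sketch below (our own layout; replacement of tracked sub-blocks and interaction with the coherence protocol are omitted).

    #include <cstdint>

    // Limited stale storage for one 64B line: two 8B stale sub-blocks plus
    // their offsets, instead of a full 64B mirror.
    struct LimitedStale {
        uint64_t stale_block[2];  // previous values of two 8B sub-blocks
        uint8_t  offset[2];       // which of the eight sub-blocks each holds
        uint8_t  in_use;          // entries currently allocated (0..2)
        bool     overflowed;      // a third distinct sub-block was written:
                                  // temporal silence no longer detectable
    };

Temporal silence remains detectable only while every written sub-block falls within the tracked offsets, which the data above suggests is the common case for sharing patterns that exhibit temporal silence.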

We also found that for 16MB caches, using nearly equivalent overall storage (a 16MB conventional cache vs. a 14MB, 7-way associative cache with ~1.8MB stale storage organized as 8B/cache line), the overall miss rate of the 14MB cache with either MESTI or TSS was better than that of a conventional 16MB cache with USS store suppression, further indicating that exploiting temporal silence may be an effective way to eliminate cache misses and improve performance. Judiciously limiting stale storage does not directly affect the timeline for efficient exploitation presented in Section 5.3; it only affects which TSS misses can be detected. We will further reduce the cost of detection throughout subsequent sections.

5.5.3 Temporally Silent Pair Distance—The Key to Efficient Stale Storage

FIGURE 5-10. Dynamic Program Distances Between Useful Temporally Silent Pairs. The figure shows the cumulative distribution of distance (in dynamic instructions and dynamic stores) between the intermediate value store and the temporally silent store in a silent pair for cases in which MESTI prevents a remote miss.

Along with strictly limiting the amount of spatial locality tracked to detect temporal silence (Section 5.5.2), we can exploit the dynamic program distance between the intermediate value store and temporally silent store to further reduce the cost of detection.

We call this the temporally silent pair distance. To first order, this distance determines the difficulty of detecting TSS with finite buffering, since long-distance temporally silent pairs require a deeper memory for tracking the original value.

In Figure 5-10, we show the cumulative distribution of store pair distance measured in terms of both dynamic non-update silent stores and instructions executed for useful temporally silent pairs. We call a pair useful if it can be exploited to prevent a communication miss under MESTI. Focusing first on dynamic store distance, in the scientific workloads we see that over 80% of useful silent pairs can be captured within a distance of 64 dynamic stores. However, in the commercial workloads this same distance will only capture 50% of useful silent pairs (only 25% in tpc-h), with the 80% level not passed until distances of 8K, 32K, and 64K in specweb, tpc-h, and tpc-w, respectively. Examining dynamic instruction distance, the trends are similar, but we observe greater separation between benchmarks. In the scientific workloads, a short distance of 32 instructions captures almost 70% of opportunity in radiosity and raytrace, but the same level is not reached for barnes and ocean until distances of 128 and 2K, respectively. In the commercial workloads, tpc-w reaches 55% of opportunity within distance 64, but for specjbb, specweb, and tpc-h at least 80% of opportunity is not reached until distances of 1K, 16K, and 32K instructions or more, respectively. In summary, for many cases the temporally silent pair distance can be substantial, especially in commercial workloads. Again, this implies that in-core techniques with limited buffering are unlikely to be effective; thus general mechanisms which can detect temporal silence occurring at these substantial temporally silent pair distances are worthwhile.

However, in most cases, capturing distances on the order of 100K instructions/stores can achieve most of the available opportunity. Therefore, stale storage may not be needed throughout the entire memory hierarchy when this fact is considered; large L1 caches, along with the natural combining of references that such structures provide, may prove effective to capture such distances. We discuss such a scheme, which can detect temporal silence at essentially no storage cost, in the next section.

5.5.4 Taking Advantage of Inclusive Memory Hierarchies

One efficient method of detecting temporal silence exploits the natural behavior of inclusive cache hierarchies. Consider a writeback L1-D cache with an inclusive L2. When the processor writes a cache line, the line is brought into the L1-D and is written while the L2 updates its tag array to indicate the line is modified in the L1-D. Note that the L2 data is actually a stale version of the cache line, and it can be used for detecting temporal silence without any explicit stale value storage.10

10. Special consideration must be made for temporally silent pairs which cross an L1-D writeback event; since the writeback destroys the previous copy of the cache line (with the dirty L1-D data), the L2 data can no longer be used for detecting temporal silence. This is properly accounted for in the simulations presented in this section.

However, a difficulty arises when a comparison against the stale value is required to determine whether temporal silence has occurred—naively, on every store! Obviously such a scenario is undesirable, as the natural write-combining ability the writeback L1-D cache was providing is sacrificed; each dynamic store will require a comparison against the L2 data. The comparison need not be cache line width if sub-block dirty bits are employed, where each bit indicates whether its sub-block has a different value than the L2 data. However, a better solution is desirable.

One solution is allowing write combining at the L1-D cache for some period of time, reducing the number of comparisons required and also elongating the detection time as discussed in Section 5.3. We explore two different possibilities in our writeback hierarchy, one where detection is delayed until dirty lines are written back from the L1-D to the L2, and another where the comparison is performed when a dirty line reaches the not-most-recently-used (NMRU) class in a set associative cache. The NMRU case can be considered similar to the eager writeback proposed for a different purpose by Lee et al. [65] and can have similar benefits on the L1-L2 interface.
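
Because the L2 copy of a line that is dirty in the L1-D is exactly the stale version, detection reduces to a verify-read of the L2 followed by a line compare, as sketched below (the function name is ours).

    #include <cstddef>
    #include <cstring>

    // Zero-space detection in an inclusive writeback hierarchy: compare the
    // dirty L1-D line against the stale copy held by the inclusive L2.
    bool temporally_silent_vs_l2(const unsigned char* l1_dirty_line,
                                 const unsigned char* l2_stale_line,
                                 std::size_t line_bytes) {
        return std::memcmp(l1_dirty_line, l2_stale_line, line_bytes) == 0;
    }

    // The two policies in the text decide *when* to run this check:
    //   WB:   at L1-D writeback time, converting the writeback into a
    //         verify-read (and a validate broadcast if it returns true).
    //   NMRU: eagerly, when a dirty line leaves the MRU class [65].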

FIGURE 5-11. Communication Misses with Different Inclusive Cache Hierarchies. The fraction of communication misses observed with MESTI for 8KB, 32KB, and 128KB (4-way associative) L1-D caches is shown. Temporal silence is detected when the cache line reaches the NMRU class and when the line is written back (WB) to the L2 cache. The percentage of communication misses with perfect MESTI is indicated numerically and with the solid lines for each benchmark; schemes approaching the solid line are better.

In Figure 5-11 we show the reduction in communication misses for the two previously described cases (NMRU and writeback). Three different L1-D cache sizes are shown (8KB, 32KB, and 128KB—all 4-way associative) with infinite L2 caches. Results for perfect MESTI are indicated with the dark lines across each benchmark. These results indicate the timeliness of each approach, as discussed in Section 5.3. Focusing first on the NMRU results, we see that this policy foregoes little opportunity for a small L1-D cache; in the 8KB case, raytrace is the only exception. As cache size increases, more opportunity is sacrificed (comparing the first three bars from the left for each benchmark). This is intuitive; as cache size increases, fewer temporally silent stores are detected in a timely fashion because the delay for a cache line to become NMRU increases. Comparing the NMRU results to the writeback (WB) results shows a similar, although more pronounced, trend.

Behavior for each benchmark varies greatly; tpc-w shows essentially no sensitivity to cache size or policy (NMRU vs. WB), while the other commercial workloads can sacrifice significant opportunity for a 32KB, NMRU policy, e.g. specjbb. This result can be interpreted in different ways. In the case of small L1-D caches, it may be possible to exploit temporal silence using the MESTI protocol by only making changes to the L1-L2 interface and the L2 cache of existing designs (based on the WB results), thus avoiding core redesign and achieving tangible benefit. However, for medium or large L1-D caches, slight modification to the L1-D cache is most likely necessary to avoid losing most of the opportunity. Note also that many factors contribute to the behavior observed for each benchmark: the temporally silent pair distance (which cannot cross an L1-D writeback event), L1-D cache working set size for the benchmark, and also the benchmark’s sensitivity to timeliness of detecting temporal silence. We characterize each benchmark’s sensitivity to timeliness in Section 5.10.

Finally, note that the WB results roughly correspond to the NMRU results for their respective 4x larger cache; this is also intuitive as the MRU class of a 32KB, 4-way associative cache is approximately 8KB. Thus, despite slightly different index functions, the NMRU class of the 32KB cache is equivalent to writebacks from an 8KB cache.

FIGURE 5-12. L1-D Cache to L2 Transactions for Different Cache Configurations. The fraction of L1-D cache to L2 transactions due to Read misses, Writebacks, and additional read requests for detecting temporal silence (Verify Read) are indicated for 8KB, 32KB, and 128KB (4-way associative) caches for NMRU and WB. The Writeback category includes dirty replacements and modified interventions. Graphs are normalized to the baseline case for that cache configuration, with the L1-L2 event rate shown numerically.

Exploiting inclusion between the L1 and L2 caches to obtain zero-space detection of temporal silence is not without a cost, namely additional transactions across the L1-L2 interface. In Figure 5-12, we show the breakdown of L1-L2 transactions for each scenario described previously. Results for each configuration are normalized to the baseline transaction rate for the particular cache hierarchy to make the scale of the graph readable. Note that the Writeback category includes both dirty castouts and modified interventions. Therefore, Verify Read can be less than Writeback since modified interventions are not checked for temporal silence. Furthermore, the NMRU results for Writeback are solely due to modified interventions, because NMRU Verify Reads are considered eager writebacks, thus cleaning the cache line [65].

Examining writeback temporal silence detection, we observe that in most cases the contribution of additional transactions for verification is less than 30%. This additional pressure on the L1-L2 interface might be accommodated in existing designs without significant difficulty. The number of writeback buffers may need to be increased as the occupancy for each is longer; the data must be held in the buffer for both the verify-read and the writeback operation if temporal silence is not detected.

Examining NMRU temporal silence detection, we observe that in many cases, additional traffic across the L1-L2 interface is substantial—up to an additional 490% in tpc-h for a 32KB cache. However, this benchmark is most likely not the worst case since the baseline L1-L2 transaction rate is low compared to other commercial benchmarks. Examining specweb, which has relatively high baseline L1-L2 bandwidth requirements for 8KB and 32KB caches, we observe increased bandwidth demands of 60% and 100%, respectively. Although these increases are much smaller than detecting temporal silence on every non-update silent store, i.e. a write-through hierarchy with no write combining, the extra bandwidth demand is substantial. Therefore, to obtain both timely detection (the NMRU results from Figure 5-11) and zero-space overhead by exploiting inclusive hierarchies, we need a better method for filtering which requests should be checked for temporal silence. Various prediction mechanisms might be useful here; instead, we propose a very simple, non-speculative, and very effective method for filtering verification. We will discuss it in Section 5.5.6.

FIGURE 5-13. Communication Miss Opportunity Lost Due to L1-D Cache Writebacks. The fraction of communication misses observed with MESTI for 8KB, 32KB, and 128KB (4-way associative) L1-D caches is shown. Temporal silence is detected immediately, but only if an L1-D cache writeback has not occurred between the temporally silent pair. The percentage of communication misses with perfect MESTI is also indicated; schemes approaching the MESTI bar are better.

We have noted at various points throughout this section that temporally silent pairs which cross L1-D writeback events cannot be captured with this zero-space overhead detection mechanism; dirty L1-D castouts overwrite the previous value in the L2 cache, and thus we can no longer use the L2 cache data for detecting temporal silence. All data presented previously considers this, but also delays detection. To indicate communication miss opportunity lost solely due to this effect, we show observed communication misses for different cache hierarchies for immediate temporal silence detection in Figure 5-13. We can see from the figure that capturing temporal silence across L1-D cache writebacks for the configurations studied (8KB, 32KB, 128KB, 4-way associative) is unnecessary for the scientific workloads. For the commercial workloads, most opportunity is still captured even with a small 8KB L1-D cache; however, specweb and tpc-w show non-trivial opportunity lost (6% and 8% of communication misses, respectively). This result agrees with Figure 5-10, as both specweb and tpc-w show a non-trivial fraction of long-distance temporally silent pairs. However, tpc-h does not show the same behavior, indicating that although the silent pair distance can be significant in this benchmark, the working set of lines touched within the store pair is small. This conclusion is further strengthened by the results of Figure 5-1, which shows tpc-h having the fewest capacity misses for the finite cache configurations measured.

5.5.5 Adding Explicit Stale Storage Designed To Detect Temporal Silence

Utilizing inclusive hierarchies as the only storage for detecting temporal silence may not be fruitful, or even possible, under some circumstances. For example, we saw in the previous section that commercial workloads which exhibit large data working sets may have temporally silent store pairs longer than the L1-D lifetime for small caches, and restricting detection to be a function of L1-D cache configuration may lead to performance anomalies on other workloads. Furthermore, this technique will not work for write-through hierarchies because each store must be reflected in the L2 when it is performed.11 Therefore, we would like to devise a more general method of adding explicit storage for detecting temporal silence across L1-D cache writebacks. We discuss the proposed technique in the context of a writeback hierarchy, but it is extensible in a straightforward way for write-through hierarchies. We assume that global coherence state is maintained at the L2 cache.

11. Combining write buffers and write caches can reduce the actual number of store-throughs required in these types of hierarchies, but a similar argument applies in these cases as well. A discussion of trade-offs between hierarchy types considering error detection and correction techniques has already been presented in Section 2.3.3.

FIGURE 5-14. Illustration of an Efficient Stale Storage Mechanism. Capturing temporally silent pairs across L1-D cache writebacks requires additional storage, which can be managed as shown.

A block diagram of the proposed mechanism is shown in Figure 5-14. The basic principle of operation is simple: whenever the L1-D cache displaces a dirty line, the previous version of the line is saved as the candidate for future temporal silence detection. This is done in the stale storage shown in the figure. However, obtaining the correct stale value on the first writeback requires reading data from the L2; the L1-D writeback data cannot simply be captured since it is already dirty—the intermediate value is already stored within the cache line. In order to rectify this, the writeback can simply be converted into a read-write operation, doubling the required L1-L2 interface bandwidth. Another option, shown in the figure, is to add an explicit L1-Mirror which captures the stale value when the cache line is initially filled into the L1-D cache by either a load or store miss. If the cache line is subsequently modified and written back, the data from the L1-Mirror is copied into the stale storage and the dirty data is written into the L2, avoiding the L2 read-write sequence.

Note that the stale data storage is only required for dirty L1-D castouts which have not exhibited temporal silence within their lifetime; thus its capacity for capturing temporal silence is greater than simply enlarging the L1-D cache and exploiting inclusive hierarchies as described in Section 5.5.4.

A slight complication arises in the design of the L1-Mirror and its interaction with the stale storage and L2 cache. Namely, the L1-Mirror should know when to capture incoming data from the L2 on a fill or use stale data as the candidate for temporal silence. Sometimes the incoming data from the L2 is the correct stale value, corresponding to the labeled arcs in Figure 5-6 which indicate when candidates for reversion should be saved; other times it is not, because of a previous writeback of an intermediate value. Fortunately, this can be rectified at the L2 level; when the fill occurs, the L2 simply informs the L1 whether the line had been previously written back, i.e. it is in M state at the L2, or whether the fill corresponds to a correct stale version (the labeled arcs in Figure 5-6). With this knowledge, the L1-Mirror either captures the L2 fill data or copies the data from the stale storage and uses it for detection. The datapaths between the L1-D cache and L2 cache behave as normal—the most up-to-date copy of the data either resides in the L1-D cache or in the L2 cache at all times, thus external snoops need not access the stale storage or L1-Mirror to service interventions.12
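
The fill-time decision just described reduces to a single multiplexer choice, sketched below (the function name is ours; a minimal illustration, assuming the L2 reports whether its copy is modified).

    #include <cstddef>
    #include <cstring>

    // On an L1-D fill, choose the reversion candidate for the L1-Mirror.
    void on_l1_fill(bool l2_copy_was_modified,      // line in M state at L2?
                    const unsigned char* l2_fill_data,
                    const unsigned char* stale_storage_data,
                    unsigned char* l1_mirror, std::size_t line_bytes) {
        // If an intermediate value was already written back, the L2 data is
        // not a valid stale version; use the copy the stale storage captured
        // at that earlier writeback instead.
        std::memcpy(l1_mirror,
                    l2_copy_was_modified ? stale_storage_data : l2_fill_data,
                    line_bytes);
    }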

Note also that this structure allows detection of temporal silence without an L1-L2 read transaction for each store as the simple method of Section 5.5.4 employed. Each store can simultaneously write into the L1-D cache and also compare its value against the L1-Mirror to detect temporal silence with perfect timeliness, achieving the maximum reduction in communication. Temporal silence for an entire cache line is indicated by the NOR of all sub-block dirty bits, as shown in Figure 5-14.13 Note that update silence may not be detected in general by exploiting the L1-Mirror, as stores which are silent with respect to the L1-Mirror may not be silent with respect to the L1 because of previous L1 writebacks. However, if we are only interested in exploiting update silent stores for communication miss reduction (and not for core performance enhancement as discussed in Chapter 2), the L1-Mirror may be used for this purpose; we describe such an approach later in Section 5.7.4.

12. The stale storage or L1-Mirror may or may not need to be snooped on remote invalidates or L2 castouts depending on the detailed protocol design.
13. The sub-block dirty bits must be kept at the smallest write atom (a byte in most architectures) for this technique to be valid.
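
Per-byte dirty bits against the L1-Mirror make the whole-line check trivial, as the sketch below illustrates (our own naming; with 64B lines and byte-granularity write atoms, one 64-bit vector per line suffices).

    #include <cstdint>

    struct MirrorDirtyBits {
        // One bit per byte of the 64B line: set if the L1-D byte currently
        // differs from the corresponding L1-Mirror byte (footnote 13: bits
        // must be kept at the smallest write atom, a byte).
        uint64_t differs;
    };

    // The line is temporally silent exactly when no byte differs, i.e. the
    // NOR of all dirty bits is asserted.
    bool line_temporally_silent(const MirrorDirtyBits& m) {
        return m.differs == 0;
    }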

FIGURE 5-15. Communication Misses for Different Combinations of Stale Storage. Communication misses for an 8KB (4-way associative) cache exploiting only inclusive hierarchies for temporal silence detection is compared to the same 8KB cache augmented with either 32KB or 128KB of stale storage. Results for MESTI with full stale storage are also indicated. The cost of the L1-Mirror is an additional 8KB in all cases.

Figure 5-15 shows the ability of this structure to capture useful temporally silent pairs under MESTI for a small 8KB, 4-way, L1-D cache for stale storage capacities of 32KB and 128KB. We see that both stale storage capacities are effective at capturing useful temporally silent pairs across all benchmarks. This is not surprising, as an inclusive hierarchy with a 128KB L1-D cache (Section 5.5.4) was also able to capture the vast majority of the potential. However, explicit stale storage of 32KB in combination with an 8KB L1-D cache performs better than a 128KB L1-D cache for all benchmarks. As explained previously, this is possible because only dirty L1-D castouts need be tracked in the stale storage. Furthermore, this technique creates no additional delay in detection (Section 5.3) via the L1-Mirror; thus perfect timeliness is achieved. However, this timeliness comes at the cost of each store comparing its value against the L1-Mirror immediately. Write combining techniques, such as those described in Section 5.5.4, can reduce the number of comparisons with some timeliness cost. However, note that comparisons are only performed against the L1-Mirror, and not the Stale Storage; thus the L1-Mirror can be constructed to have similar access time to the L1-D cache since the configurations are identical. The Stale Storage can be slower as it is only written on L1-D cache writebacks, read on L1-D cache fills, and data can always be replaced from it with no correctness issue because correct coherent copies of the cache line are always either in the L1-D cache or the L2.

This method of temporal silence detection has extremely low space cost (the size of the L1-Mirror plus Stale Storage). We stated in Section 5.5.2 that naively limiting storage throughout the entire memory hierarchy, reserving 1.8MB of stale storage with a 14MB L2, was more effective at reducing misses than a 16MB cache without temporal silence exploitation. The configuration shown in this section (8KB L1-Mirror and 32KB Stale Storage) has added only 40KB/16MB = 0.25%, and can achieve approximately a 40% reduction in communication misses for commercial workloads. Obviously, adding such storage is more beneficial than adding 40KB of additional L2 cache space. For large L1-D caches, much space may be replicated unnecessarily in the L1-Mirror; in these cases simply utilizing inclusive hierarchies (Section 5.5.4), or foregoing the L1-Mirror, may be more desirable.

5.5.6 Eliminating Unnecessary Comparisons with Existing Value Summaries

In Section 5.5.4 and Section 5.5.5 we discussed utilizing inclusive hierarchies and explicit storage for low space-cost temporal silence detection. However, these mechanisms required either a large increase in L1-L2 transactions (Section 5.5.4) or many explicit comparisons against dedicated storage (Section 5.5.5) to enable low-space overhead.

Many architectures now include explicit ECC protection on all levels of the memory hierarchy due to the increasing presence of soft errors, as discussed at length in Section 2.3.3. ECC provides a mechanism through which single-bit errors in the ECC-word can be corrected and double-bit errors detected if implemented with SEC-DED codes [11, 97]. A key observation is that ECC provides a value summary of data stored within the ECC-word—an efficient hash of the value. This value hash can further reduce the cost of detecting temporal silence.

For both of the architectures described in Section 5.5.4 and Section 5.5.5, the number of comparisons can be reduced by simply gating a full comparison against the stale value using the ECC bits. Since a given data value is always represented with the same ECC bits for common mapping functions, for data values to match, the ECC hashes must be identical. Therefore, a full data value comparison for detection is only required when the ECC hash of the current value matches that of the stale value. Note that it is also possible to use a subset of the L2 ECC-bits for comparison against the L1 ECC-bits to limit additional storage overhead.

FIGURE 5-16. L1-D Cache Modified to Exploit ECC Information. The figure shows data paths exercised on each store operation. A full-comparison is only scheduled if the ECC Summary indicates that ECC hash values match for all sub-banks between the L1-D cache and the L2.

A diagram for inclusive, write-back hierarchies is shown in Figure 5-16. A similar architecture is possible for explicit stale storage. We assume the cache sub-bank size matches the ECC-word size for ease of illustration. On each committed dynamic store operation, the value is written into the data array and the ECC bits are updated. Simultaneously, the L2 ECC value is read and compared against the new ECC hash; the status of this comparison is stored in the ECC Summary, a single bit for each sub-bank/ECC word. If all ECC-words match for the entire cache line, as recorded in the ECC Summary, a full comparison of the L1-D cache data and the L2 data (stale value) is scheduled, and is performed as described in Section 5.5.4. The other write combining techniques described in that section can be used in concert with the mechanism described.
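
The gating logic amounts to maintaining one match bit per sub-bank and testing them all, roughly as follows (constants and names are ours, assuming 64B lines with 8B ECC words).

    #include <cstdint>

    constexpr int kSubBanks = 8;  // 64B line, 8B (64-bit) ECC data words

    struct EccSummary { uint8_t match_bits; };  // one bit per sub-bank

    // On each committed store, record whether the new L1 ECC hash matches
    // the stale L2 ECC hash for the written sub-bank.
    void on_store(EccSummary& s, int sub_bank,
                  uint8_t l1_ecc_new, uint8_t l2_ecc_stale) {
        const uint8_t bit = uint8_t(1u << sub_bank);
        if (l1_ecc_new == l2_ecc_stale) s.match_bits |= bit;
        else                            s.match_bits &= uint8_t(~bit);
    }

    bool schedule_full_compare(const EccSummary& s) {
        // All ECC words match: the values *may* be identical (ECC is only
        // a hash), so a full line comparison against the L2 is worthwhile.
        return s.match_bits == (1u << kSubBanks) - 1;
    }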

The architecture described may not affect critical cache timing paths for the following reasons: no modification beyond a standard ECC-protected L1-D cache is required for the data and ECC arrays; the L2 ECC values are only read on each dynamic store, are written only on L1-D cache fills (with the stale version ECC value), and can be tracked at sub-bank granularity; and even though the ECC Summary must be updated from any sub-bank on every cycle, updates to it can be pipelined and delayed as it is not used for correctness, but only performance optimization.14 Finally, we assume the same ECC mapping functions are used at both the L1 and L2 level; it may be possible to use different mapping functions or ECC-word sizes for each level provided the chosen mapping functions provide a subset of ECC bits which can be compared between levels.

14. In microarchitectures that achieve multi-ported L1-D caches through multi-banking as opposed to explicit multi-porting, the ECC Summary must be multi-ported as well and cannot be banked for most common bank mappings. However, for architectures that can commit multiple stores per cycle, the cache tags must already be multiported for low-order bank interleaving, so complexity for these two structures will be similar.

FIGURE 5-17. L1-D Cache to L2 Transactions Exploiting ECC Value Summaries. The fraction of L1-D cache to L2 transactions due to Read misses, Writebacks, and additional read requests for detecting temporal silence using either 8-bit or 4-bit ECC filtering functions are indicated for 8KB, 32KB, and 128KB (4-way associative) caches. Results for detecting temporal silence when the cache line reaches the NMRU class and for immediate detection are shown. Graphs are normalized to the baseline case for that cache configuration, with the L1-L2 event rate shown numerically.

We show the ability of this approach to reduce explicit comparisons between the L1-D cache and L2 for inclusive hierarchies in Figure 5-17 for both 8-bit ECC hashes (64-bit ECC data words) and a 4-bit subset of the same ECC hash. Results for comparisons performed when a cache line reaches the NMRU class within the cache (Figure 5-12) and immediate comparisons are indicated for 8KB, 32KB, and 128KB L1-D caches. Comparing the NMRU results from Figure 5-17 to Figure 5-12, where no value summary information is used, we see a dramatic reduction in utilized L1-L2 bandwidth across all benchmarks. All cases (except tpc-h) show less than 15% additional bandwidth for temporal silence detection. Turning to the immediate detection results, for small L1-D caches (8KB and 32KB), all increases in L1-L2 traffic are less than 20% (except for tpc-h). Note that the increase is linear in the base L1-L2 transaction rate for immediate detection, as the number of dynamic stores which cause ECC hash function matches is identical across cache configurations; therefore, the larger contribution to traffic comes from the lower miss rate of the baseline cache hierarchy. Finally, we note that additional comparisons when using only 4 bits of the ECC hash are minimal (except barnes 128KB), indicating that using a subset of the L2 ECC for comparison may be worthwhile to limit storage overhead. The storage overhead for each scheme against a conventional ECC-protected cache can be computed simply as:

(L2 ECC bits + 1 ECC Summary bit) / (ECC word size + ECC bits)

For the caches indicated, the storage overhead is (8 + 1) / (64 + 8) = 12% and (4 + 1) / (64 + 8) = 7%, respectively.

Overall, we see that using ECC value summaries significantly reduces the bandwidth required for detection over the L1-L2 interface in most cases. In one case, tpc-h, the additional bandwidth can still be substantial; however, the base transaction rate is low, indicating that the extra bandwidth demand may be acceptable. Furthermore, using ECC summaries can practically enable instant detection in a write-back hierarchy by limiting extra traffic on the L1-L2 interface. Results for sharing miss reduction with both methods have been presented previously and are equivalent to the NMRU and MESTI results in Figure 5-11.

5.6 Critical Temporal Silence

We have shown throughout previous sections that temporal silence can be exploited to effectively eliminate communication misses. However, exploiting temporal silence with explicit communication can have an effect on address transactions observed in the system. We discussed similar concepts in Section 4.4 in the context of critical update silent stores in multiprocessors. Similar concepts apply with temporal silence; therefore we use the same terms defined in Section 4.4. As a brief recap, we term a specific dynamic temporally silent store critical if exploiting it leads to a reduction in communication misses, anti-critical if modifying its behavior with respect to the baseline case leads to a net address transaction increase, and non-critical if modifying its behavior has no impact.

As a case study, let’s consider our MESTI protocol (Section 5.4.2), which relies on explicit communication through the added T state and validate transactions. Each validate has the potential to eliminate multiple remote cache misses; thus MESTI has the potential to reduce overall address traffic. However, each validate also requires the validating processor to forego exclusive access to the cache line; thus, if a subsequent non-update silent store occurs to the line, an upgrade is required. If a remote miss was not eliminated by the validate, less overall address traffic would have resulted by simply maintaining exclusive ownership instead of broadcasting the validate. To make this scenario more concrete, we present Table 5-4.

Table 5-4: Illustration of Critical Temporal Silence in Multiprocessors. The Validate/Upgrade pair at T2 and T3 creates additional address transactions as compared to the baseline case because the Validate relinquishes exclusive ownership of the cache line, requiring an Upgrade to obtain write permission. The baseline execution is shown in the bottom half of the example.

Execution with MESTI:
Time   CPU 0 Instruction   CPU 0 Cmd/Txn   CPU 1 Instruction   CPU 1 Cmd/Txn
T0     LD [A] (0)          Read            LD [A] (0)          Read
T1     ST [A], 1           Upgrade
T2     ST [A], 0           Validate
T3     ST [A], 2           Upgrade
T4                                         LD [A] (2)          Read

Baseline Execution Without MESTI Protocol Enhancement:
T0     LD [A] (0)          Read            LD [A] (0)          Read
T1     ST [A], 1           Upgrade
T2     ST [A], 0
T3     ST [A], 2
T4                                         LD [A] (2)          Read

The execution observed with MESTI is shown in the top half of the table; for comparison purposes, the baseline execution is shown in the bottom. Notice the additional validate and upgrade transactions at T2 and T3. These occur because the temporally silent store at T2 is not the last write [64], i.e. the cache line is re-written before a remote access occurs. In the terminology used throughout Chapter 5, we call the validate at T2 useless because it does not prevent a remote miss. Furthermore, it is anti-critical because this write is not a last write; changing system behavior to exploit it is harmful because it creates additional address transactions. As discussed in detail in Section 4.4 for update silent stores, anti-critical temporally silent stores can have a two-fold impact on system performance: additional address traffic is created, increasing queuing delays for other coherence transactions, and the validate/upgrade pair can cause commit of the store at T3 to be delayed waiting for write permission.

As in Section 4.4, we can define a temporally silent store as critical, anti-critical, or non-critical. In discussions that follow, we group anti-critical and non-critical under the term useless, as both classes indicate transactions which do not lead to sharing miss reduction. Anti-critical temporally silent stores can have both of the negative performance impacts described; non-critical stores will only have the former, by definition. Note that anti-critical temporally silent stores always lead to the validate/upgrade pair illustrated in Table 5-4. We will explore the prevalence of anti-critical temporally silent stores by measuring the percentage of temporally silent stores which are not last writes, as well as discuss the impact these anti-critical temporally silent stores have on address traffic in Section 5.7.
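To make the classification concrete, the following minimal C sketch (all names hypothetical, not from our simulation infrastructure) labels a dynamic temporally silent store from two post-hoc observations: whether a validate broadcast for it avoided at least one remote miss, and whether it was the last write to the line before a remote access. It follows the simplification above that anti-critical stores are exactly those which are not last writes.

    /* Hypothetical post-hoc classifier for a dynamic temporally silent
     * store, following the definitions in this section. */
    enum ts_class { TS_CRITICAL, TS_ANTI_CRITICAL, TS_NON_CRITICAL };

    enum ts_class classify_ts_store(int avoided_remote_miss, int is_last_write)
    {
        if (avoided_remote_miss)
            return TS_CRITICAL;      /* exploiting it removed a communication miss */
        if (!is_last_write)
            return TS_ANTI_CRITICAL; /* useless validate followed by an upgrade */
        return TS_NON_CRITICAL;      /* no miss removed, but no validate/upgrade pair */
    }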

In order to gain further insight into the program behavior and also the predictability of critical temporal silence, we can explore the dynamic memory footprint of stores exhibiting critical temporally silent behavior. Understanding this working-set size enables design of space-efficient prediction mechanisms. The working set for both scientific and commercial workloads is shown in Figure 5-18.

FIGURE 5-18. Dynamic Memory Footprint Contributing to Critical Temporal Silence. The cumulative distribution of memory addresses contributing TSS-avoidable misses is indicated. The scientific workloads are shown in the top graph, commercial workloads in the bottom graph. Notice the log scale on the x-axis for both graphs.

In the scientific workloads, we observe that approximately a 2KB working set of memory locations contributes over 80% of all TSS avoidable misses in barnes, ocean, and raytrace; with radiosity a 20KB working set is required. The commercial workloads exhibit a larger working set in general (nearly 1MB is required in tpc-w to reach 80% of opportunity captured, for example) and there is more variation between workloads. This is reasonable given the observation by many researchers [96, 75, 9] of increased memory footprints in commercial workloads.

In contrast to USS (Section 4.4), we also observe that the dynamic footprint of memory locations exhibiting critical temporal silence is not strictly proportional to the fraction of sharing misses which are eliminated under TSS; in comparing Figure 5-5 on page 117 with Figure 5-18, we observe that no strong correlation exists between working set size and TSS avoidable communication misses. This indicates that benchmarks reaping greater benefit under TSS may do so either by sharing a large number of cache lines temporally silently, e.g. tpc-w, or a small number frequently, e.g. specjbb. Therefore, mechanisms attempting to capture temporally silent behavior must consider both large working sets and high-frequency sharing to be most effective.

5.7 Efficiently Communicating Temporal Silence

As described in the previous sections, exploiting temporal silence with explicit communication using the MESTI protocol has many advantages in implementation and can capture most TSS avoidable misses. However, as discussed in Section 5.6, naive MESTI implementation may have an adverse effect on address transactions. We explore this effect in this section by studying efficient communication mechanisms.

FIGURE 5-19. Best Case Address Traffic Exploiting Temporal Silence. Address traffic increases for USS, MESTI, and TSS are shown for an oracle predictor of critical update silence and critical temporal silence.

In Figure 5-19, we present the best possible performance of MESTI and TSS with respect to observed address traffic in the system. The stacked bars indicate the percentage of address traffic in the system, normalized to the baseline, due to data requests (Read/ReadX), additional requests due to temporal silence exploitation (Validate), and finally ownership (Upgrade) requests. In the figure, we assume an oracle predictor for useful validates; a validate is only broadcast if at least one remote processor is able to avoid a data transaction due to it. For all benchmarks, overall address traffic remains essentially constant or decreases slightly. This result is reasonable, as a single validate can avoid multiple demand data transactions. Note further that upgrade requests remain essentially constant across configurations, with validates simply replacing Read/ReadX transactions. Since a validate can only re-install a cache line in remote caches in shared state, this indicates that validates are eliminating remote read misses as opposed to write misses; otherwise removed Read/ReadX transactions would have a corresponding increase in upgrades for MESTI and TSS.15

Table 5-5: Address Traffic Increase and Last Write Statistics. Results are shown for MESTI with infinite, 16MB, 8MB, and 4MB (8-way associative) caches. Last Write Accuracy is measured for an infinite cache.

            Naive Validate                   Snoop-Aware Validate             Last Write
Benchmark   Inf.    16MB    8MB     4MB      Inf.    16MB    8MB     4MB      Accuracy
barnes      15.3%   15.3%   15.3%   15.3%    15.3%   12.6%   12.6%   12.6%    5.80%
ocean       45.3%   31.5%   30.0%   17.6%    38.7%   24.0%   22.2%   9.01%    15.4%
radiosity   52.4%   35.1%   32.3%   31.0%    43.8%   27.7%   25.5%   24.3%    14.3%
raytrace    70.2%   32.8%   32.8%   31.5%    57.9%   20.0%   20.0%   19.2%    24.9%
specjbb     107.7%  53.9%   49.2%   44.9%    97.3%   40.6%   36.5%   32.0%    20.0%
specweb     92.2%   52.7%   47.7%   41.2%    88.8%   45.4%   39.2%   32.0%    24.1%
tpc-h       239%    223%    218%    214%     225%    219%    214%    209%     11.3%
tpc-w       80.6%   55.8%   39.9%   25.6%    77.6%   52.2%   34.7%   17.9%    26.0%

Examining a more realistic scenario, in Table 5-5 we show the measured increase in address traffic over Baseline when broadcasting a validate at each temporal silence detection (Naive Validate) for differing cache configurations. In most cases the address traffic increases considerably over the ideal case in Figure 5-19, motivating attempts at reduction. A simple way to reduce the traffic is to collect snoop responses to ReadX/Upgrade transactions indicating whether or not the cache line was present in a remote cache at the time. If the responses indicate it was not present in any remote cache, it is certain that any validate for this line is useless. We call this policy Snoop-Aware Validate, with its performance indicated in the table.16 The reduction in address traffic due to this simple optimization is non-trivial in ocean, radiosity, raytrace, and specjbb in the case of infinite caches.

15. Address transactions for TSS reflect a hypothetical result, since we have not described a mechanism to exploit all TSS misses through address transaction broadcasts.
16. Figure 5-6 does not show support for this optimization to reduce clutter.

In the case of finite caches, we see that the address traffic increases for Naive Validate are significantly less than the infinite case, primarily due to two factors: 1) an absolute increase in baseline address traffic due to added capacity/conflict misses; 2) fewer Validates, because lines which have been replaced from the cache between the intermediate value store and the temporally silent store do not lead to a Validate, or prevent a remote miss, because stale storage is not added in memory. More noteworthy is the relative effectiveness of the Snoop-Aware Validate policy in the case of finite caches. All benchmarks except tpc-h show a non-trivial reduction in additional address transactions due to Snoop-Aware Validate; in all workloads except tpc-h (all cases), barnes, and tpc-w (for the 16MB case) the reduction is at least 7%.
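As a minimal illustration of the Snoop-Aware Validate policy, the following C sketch (structure and function names are hypothetical) records the shared snoop response observed on a line's most recent ReadX/Upgrade and consults it before broadcasting a validate:

    /* Hypothetical per-line bookkeeping for Snoop-Aware Validate. On each
     * ReadX/Upgrade we record whether any remote cache held a valid copy;
     * a later validate is broadcast only if one did. */
    struct line_state {
        int remote_copy_at_upgrade; /* shared response from last ReadX/Upgrade */
    };

    void on_readx_or_upgrade(struct line_state *l, int shared_response)
    {
        l->remote_copy_at_upgrade = shared_response;
    }

    int should_broadcast_validate(const struct line_state *l)
    {
        /* If no remote cache held the line at the intermediate value store,
         * no remote cache can be in T state, so any validate is useless. */
        return l->remote_copy_at_upgrade;
    }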

The substantial increase in address traffic can be understood by examining how often a temporally silent write is the last write to the cache line of interest, where the last write is the final write to the cache line before it is requested by another processor [53]. We have discussed this concept and its impact on address transactions in exploiting temporal silence in Section 5.6. From the table, we see that the last write accuracy of temporal silence is always less than 26%. Therefore, many anti-critical temporally silent stores occur in these workloads, leading to what we call address thrashing (many unnecessary validate/upgrade pairs). The negative performance impact of address thrashing may be substantial if not actively prevented, largely offsetting the reduction in communication misses via MESTI.

5.7.1 Reducing Address Traffic by Delaying Validates

FIGURE 5-20. Temporally Silent Write to Fetch/Next Write Histograms. The monotonically increasing curves indicate the cumulative distance (in cycles) from a temporally silent write to a subsequent non-update silent write for useless cases of temporal silence. The monotonically decreasing curves indicate the cumulative distance (in cycles) from a temporally silent store to a subsequent MESTI avoidable miss.

One way of reducing address thrashing is to place any outbound validate into a queue with fixed delay. The validate remains in the queue until the delay time expires and the validate is broadcast, until a demand transaction occurs for the line and temporal silence is communicated to the requestor, or until a non-update silent write occurs to the cache line and the validate is dropped. Delaying validates in this manner filters validates when a non-update silent store is detected to the cache line in the queue, but timeliness of the validates will also be affected, as discussed in Section 5.3. Figure 5-20 characterizes the effectiveness of this delay queue approach. For each benchmark, the monotonically increasing curves indicate the distance, in processor cycles, from the final temporally silent store to the cache line to a subsequent (non-update silent) write for cases in which the temporally silent write is not the last write. This cumulative distribution indicates how many useless validates can be avoided by delaying them by a fixed number of cycles. The monotonically decreasing curves indicate the cumulative distribution of the distance, in processor cycles, from the final temporally silent store to a cache line to a subsequent MESTI avoidable miss to the cache line. This cumulative distribution indicates how long a validate can be delayed and still avoid a communication miss.
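The queue logic just described can be sketched in C as follows (a simplified illustration under assumed names such as broadcast_validate; it omits the early-forwarding path for remote demand transactions described above):

    extern void broadcast_validate(unsigned long line_addr); /* assumed provided */

    /* Hypothetical validate delay queue entry: a validate waits for a fixed
     * delay and is broadcast on timeout, or dropped if a non-update silent
     * store re-dirties the line first. */
    struct pending_validate {
        unsigned long addr;          /* cache line address */
        unsigned long release_cycle; /* enqueue cycle + fixed delay */
    };

    void delay_queue_tick(struct pending_validate *q, int *count, unsigned long now)
    {
        for (int i = 0; i < *count; ) {
            if (now >= q[i].release_cycle) {
                broadcast_validate(q[i].addr); /* delay expired: send it */
                q[i] = q[--(*count)];
            } else {
                i++;
            }
        }
    }

    void on_local_nonsilent_store(struct pending_validate *q, int *count,
                                  unsigned long addr)
    {
        for (int i = 0; i < *count; i++)
            if (q[i].addr == addr) {       /* line re-dirtied: drop the validate */
                q[i] = q[--(*count)];
                return;
            }
    }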

Ideally, the overall distribution would be bi-modal, i.e. the not last write distance would be short and the MESTI avoidable miss distance would be long. In this case, we could simply build a delay queue long enough to capture most of the not last write distance, and trade relatively few useful validates for this, while still allowing useful validates plenty of time to propagate to remote processors. For the scientific workloads, the figure indicates a short queue (2^7 cycles) can eliminate approximately 15% of address thrashing, with approximately 5% of temporal silence opportunity lost in ocean, radiosity, and raytrace. However, the distribution is not sufficiently bimodal for this approach to eliminate the majority of address thrashing without sacrificing most of the opportunity. For the commercial workloads, a short queue (2^7 cycles) removes 30%-35% of address thrashing in specjbb and tpc-w with less than 1% of opportunity lost. The distribution is sufficiently bimodal in specweb and tpc-w (at 2^13 cycles) and tpc-h (at 2^10 cycles) to allow at least 60% of thrashing to be removed with lost opportunities of only 25%, 5%, and 20%, respectively.

Finally, we note that the MESTI avoidable miss distribution indicates that neglecting address bus contention in the characterization results presented throughout Chapter 5 does not affect timeliness of useful validates significantly. The vast majority of useful validates, if sent immediately, will have sufficient time to reach a remote processor. We discuss correctness issues in delay queue implementation in Section 5.8.3 and Section A.1.

5.7.2 Predictive Snoop-Aware Validate

We can also turn to predictive methods to avoid useless validates and address thrashing. Many works have explored coherence prediction and have determined that many events, such as repetitive block request patterns or migratory data, can be reliably predicted [64] [59] [60] [86] [53] [74]. We explore a simple prediction mechanism, which relies on an augmented form of the snoop-aware validate policy discussed in Section 5.7.

Recall that snoop-aware validate simply collects snoop responses at each ReadX/Upgrade to determine whether any remote node had a valid copy of the cache line at the time of the request; if not, any subsequent validate due to a temporally silent store to the cache line on the owning processor is aborted. This mechanism eliminates many useless validates without sacrificing any opportunity; if no remote processor had a valid copy of the cache line at the intermediate value store, it is not possible for the cache line to be in T state, thus any validate is useless. The enabling mechanism for this optimization is the shared snoop response, which is used in machines implementing E state. In this context we have overloaded its use slightly, as it is only used for Read transactions in conventional MESI. This snoop-response mechanism enables a very simple form of distributed decision-making among processors in the system in response to coherence requests, since coherence actions taken by the requestor are affected by events in remote processors; in this case, valid data in remote caches.

Although assertion of the shared line is a fixed response in conventional protocols, it is easy to imagine using this distributed communication mechanism to implement predictive coherence mechanisms. For example, a remote processor may not assert the shared signal in response to a remote Read request, even if it has a valid copy of the data. However, for correct protocol operation to result, it will need to treat the Read request as a ReadX, and invalidate its copy of the cache line, since the requestor may obtain the data in E state. In cases where the remote processor suspects migratory sharing may result, it can use this mechanism to communicate that fact to a remote processor implicitly.

We explore a predictive coherence mechanism for MESTI implemented using this mechanism which constitutes a near zero-state predictor.17 We add a single stable state to MESTI, called Validate_Shared, which is entered upon receipt of a validate in T state. Therefore, it is semantically equivalent to S state for local requests; the only modification is that any local request transitions a Validate_Shared cache line to S state. Upon receipt of an external ReadX/Upgrade transaction, the behavior is as specified in MESTI, except the shared signal is not asserted; all other states/transactions behave as normal. We call this response the useful snoop response because assertion of the shared line on an intermediate value store indicates a previous validate was useful and that it prevented a remote miss. The MESTI state machine for this enhancement is shown in Figure 5-21.

This simple behavior change allows a distributed prediction mechanism for useful validates. If a remote processor has not accessed the location since the previous validate, and therefore it is in the Validate_Shared state, it will not send a shared response upon subsequent intermediate value stores. Elimination of the shared response indicates to the requestor that no remote copies were present. Hence it should abort future validates, as done in the Snoop-Aware Validate policy. If a remote processor has accessed the location, it will have transitioned to the Shared state, and will indicate a shared response, implying to the requestor that future validates may indeed be useful. Such a predictor allows detection of simple sharing scenarios in which a validate is never useful, but a remote cache maintains a valid copy of the data. This may occur during process migration; a process is moved for load-balancing purposes from CPU A to CPU B, however, its working set is entirely resident on CPU A. For large caches, much of the working set may stay cache resident on CPU A, thus temporally silent writes occurring to process-local data on CPU B may lead to address thrashing even with the simple Snoop-Aware Validate policy. This simple mechanism can avoid this, and similar, scenarios.

17. We call this predictor "near zero-state" because it is created by adding a single stable state to the L2 coherence state machine, which may have a complexity cost, but most likely will not have any additional bit cost in the state machine.

FIGURE 5-21. State Machine for Enhanced MESTI Protocol. We augment MESTI (Figure 5-6 on page 118) using the notation from Culler and Singh [31] to include support for the useful snoop response. Bus events which assert the shared line are indicated.
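A minimal C sketch of the remote-side behavior may help; state and function names here are hypothetical, and only the transitions relevant to the useful snoop response are shown (the full protocol is given in Figure 5-21):

    /* Hypothetical remote-side handling for the near zero-state predictor.
     * VALIDATE_SHARED behaves like Shared for local requests but withholds
     * the shared snoop response on external ReadX/Upgrade, implicitly
     * telling the requestor its previous validate went unused. */
    enum coh_state { INVALID, TEMP_INVALID, SHARED, VALIDATE_SHARED };

    void on_remote_validate(enum coh_state *s)
    {
        if (*s == TEMP_INVALID)
            *s = VALIDATE_SHARED;  /* re-installed by a validate, not yet re-used */
    }

    void on_local_access(enum coh_state *s)
    {
        if (*s == VALIDATE_SHARED)
            *s = SHARED;           /* local use: shared responses resume */
    }

    /* Shared-line response for an external ReadX/Upgrade; other valid
     * states (E, M, O) are elided for brevity. */
    int shared_response_on_external_write(enum coh_state *s)
    {
        int assert_shared = (*s == SHARED);  /* suppressed in VALIDATE_SHARED */
        if (*s != INVALID)
            *s = TEMP_INVALID;     /* retain stale data for reversion detection */
        return assert_shared;
    }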

FIGURE 5-22. Predictive Snoop-Aware Validate Characterization Data. Results for MESTI (Snoop-Aware Validate policy) and the Predictive Snoop-Aware Validate policy are shown. Communication misses compared with the baseline are indicated with solid bars and address traffic increases are indicated with a line on top of each bar. Smaller values for the solid bars and lines on each bar indicate better performance.

Figure 5-22 shows communication misses captured and additional address traffic present in the system with this simple enhancement. Results are normalized to the baseline and infinite caches, with MESTI results for comparison. In the scientific workloads, we observe significant reductions in address traffic with little sharing reduction opportunity sacrificed. The only exception is raytrace, which sacrifices 30% of sharing miss opportunity (11% of all communication misses); however, additional address traffic is decreased 63%, which indicates a decent trade-off. In the commercial workloads, we observe a substantial address traffic reduction; all workloads show less than a 17% increase over the baseline, with the additional traffic reduced by 87% on average as compared to basic MESTI. Most noteworthy, the previously observed address traffic increase in tpc-h (220%) has been reduced to less than 15%, with only 37% of the opportunity for communication miss reduction lost. However, substantial sharing reduction is lost in all commercial workloads, 37% on average. Therefore, while this optimization is effective, the simple prediction mechanism employed does not effectively capture all TSS sharing patterns. For example, a simple pattern which cannot be exploited for sharing miss reduction with this scheme is shown in Table 5-6.

Table 5-6: Illustration of Lost Opportunity with Predictive Snoop-Aware Validate. The top of the table illustrates behavior with MESTI including snoop-aware validate, the bottom illustrates behavior with predictive snoop-aware validate. Predictive snoop-aware validate disables the shared snoop-response at T3, aborting the validate at T4, and thus leading to the Read (miss) at T5 which is captured with basic MESTI.

MESTI with Snoop-Aware Validate Policy
Time  CPU 0 Instruction  Cmd/Txn   State     Snoop-Resp.  CPU 1 Cmd/Txn    State
T0    LD [A] (0)         Read      Shared                 Read             Shared
T1    ST [A], 1          Upgrade   Modified  Shared
T2    ST [A], 0          Validate                         Remote Validate  Shared
T3    ST [A], 1          Upgrade   Modified  Shared
T4    ST [A], 0          Validate                         Remote Validate  Shared
T5                                                        LD [A] (0) hit   Shared

Predictive Snoop-Aware Validate Policy
Time  CPU 0 Instruction  Cmd/Txn   State     Snoop-Resp.  CPU 1 Cmd/Txn    State
T0    LD [A] (0)         Read      Shared                 Read             Shared
T1    ST [A], 1          Upgrade   Modified  Shared
T2    ST [A], 0          Validate                         Remote Validate  Validate_Shared
T3    ST [A], 1          Upgrade   Modified
T4    ST [A], 0
T5                                                        LD [A] (0) Read  Shared

Behavior under MESTI is shown in the top half of the table, predictive snoop-aware validate in the bottom. The key point is behavior at T3; in MESTI the shared snoop-response indicates a remote processor can benefit from temporal silence if detected, and thus leads to a validate at T4. In predictive snoop-aware validate, the addition of the Validate_Shared state disables the shared snoop-response at T3, thus no validate is broadcast at T4, leading to the Read (miss) for CPU 1 at T5. In short, this simple predictor only captures TSS sharing patterns where each validate is useful; a single useless validate will cause any subsequent temporal silence detection to avoid validation until a remote miss is observed. A potential method to eliminate this drawback is to add hysteresis at remote nodes; we do not explore this for the sake of brevity.

5.7.3 Memory Address-Based Prediction

Since the simple predictor described previously is not sufficient to capture more complicated sharing patterns, we also explore a memory address-based prediction mechanism which attempts to determine the temporal silence and sharing patterns a given cache line exhibits. A block diagram of the predictor as well as a state diagram governing transitions is shown in Figure 5-23.

FIGURE 5-23. Address-Based Useful Validate Predictor. The predictor observes L1-data cache events and also system snoop requests/responses to determine when a validate broadcast in MESTI might prevent a remote cache miss. The system block diagram is shown in (A), while (B) indicates the Mealy state machine used to determine when the confidence counters for each cache line are updated. Transitions labeled (+) indicate confidence increments, (-) indicate confidence decrements, and (*) indicates the transition where the confidence value is read to determine system validate broadcast. The dashed transition and confidence decrement is needed for the Reupgrade policy discussed for predictors between the L1-D and L2 caches observing a filtered reference stream (Section 6.6.2).

The predictor observes temporal silence detection (TS Detect) and store requests from the L1-D cache and also external requests/responses from other processors in the system. On each L1-D detection of temporal silence, the predictor indexes into an array of confidence counters [102] to determine whether a validate should be sent to other processors in the system. If the confidence is above some threshold, a system validate is broadcast and the cache transitions to shared state; if below the threshold, a system validate is not broadcast and the cache returns to exclusive state. This check is indicated by (*) in Figure 5-23.
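A minimal sketch of this broadcast check, with hypothetical names and a simple direct-mapped table of saturating counters (the actual indexing, tagging, and sizing are discussed later in this section):

    /* Hypothetical confidence table consulted on each temporal silence
     * detection; this is the (*) check in Figure 5-23. */
    #define PRED_ENTRIES   4096
    #define CONF_THRESHOLD 4

    static unsigned char confidence[PRED_ENTRIES];

    /* Returns nonzero if a system validate should be broadcast. */
    int check_validate_confidence(unsigned long line_addr)
    {
        unsigned idx = (line_addr >> 6) % PRED_ENTRIES; /* 64B lines assumed */
        return confidence[idx] >= CONF_THRESHOLD;
    }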

The state machine governing confidence counter updates is shown in Figure 5-23(B). At a high level, the state machine attempts to predict critical vs. anti-critical temporal silence. It does this by transitioning to the TS Detected state whenever temporal silence occurs. If an external request occurs for the cache line while it is temporally silent, this indicates useful temporal silence, and thus a confidence increment, as indicated with the External Req arc. When a subsequent non-update silent store occurs in the TS Detected state, the useful snoop response trains the predictor and detects anti-critical temporal silence; we discuss this in detail shortly.

The TS External Request state allows multiple confidence increments depending on the degree of sharing of the temporally silent cache line. Each external request to cache lines which have reverted causes a separate counter update. When another (non-update silent) store occurs, the state machine again waits for the occurrence of temporal silence.

A potential pitfall when designing this predictor is continually updating it once remote misses are being eliminated with validates sent to other processors; once misses are not observed (because of action taken to prevent them), information about the usefulness of a validate may be lost. Put another way, it is easy to imagine training a sharing-predictor when misses to cache lines of interest are observed, but training a sharing-predictor once the misses are being eliminated is more difficult. In a straight-forward manner, the misses can be periodically re-introduced (by not sending validates for lines which exhibit temporal silence) to be sure remote processors are still benefiting from the validates, similar to the way hybrid update/invalidate protocols such as Dragon exit the update mode [31]. However, we have already discussed the useful snoop-response in Section 5.7.2, which remote processors can send in response to intermediate value store upgrade requests. We show the use of this response to differentiate between useful and useless cases of temporal silence in the TS Detected to L2 Upgrade Request to Start transitions. Note that a separate state is needed for L2 Upgrade Request because the useful snoop response is generated significantly after the store arc; the coherence agent must collect all snoop responses to determine usefulness. Making each intermediate value store visible (Section 3.1) makes such continuous training possible.
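The training behavior can be summarized with the following C sketch of the Mealy machine in Figure 5-23(B); names and the exact increment/decrement placement are our simplified reading of the figure, not a definitive implementation:

    /* Hypothetical confidence-training state machine. External requests to
     * temporally silent lines increment confidence (useful temporal
     * silence); the useful snoop response collected after an intermediate
     * value store's upgrade distinguishes useful from anti-critical cases. */
    enum pred_state { START, TS_DETECTED, TS_EXT_REQUEST, L2_UPGRADE_REQ };

    struct pred_entry { enum pred_state state; unsigned char conf; };

    static void inc_conf(struct pred_entry *e, unsigned amt, unsigned max)
    { e->conf = (e->conf + amt > max) ? max : e->conf + amt; }

    static void dec_conf(struct pred_entry *e, unsigned amt)
    { e->conf = (e->conf < amt) ? 0 : e->conf - amt; }

    void on_ts_detect(struct pred_entry *e) { e->state = TS_DETECTED; }

    void on_external_request(struct pred_entry *e, unsigned amt, unsigned max)
    {
        if (e->state == TS_DETECTED || e->state == TS_EXT_REQUEST) {
            inc_conf(e, amt, max);        /* (+): the reverted line was requested */
            e->state = TS_EXT_REQUEST;
        }
    }

    void on_nonsilent_store(struct pred_entry *e)
    {
        /* From TS Detected, wait for the upgrade's snoop responses. */
        e->state = (e->state == TS_DETECTED) ? L2_UPGRADE_REQ : START;
    }

    void on_upgrade_responses(struct pred_entry *e, int useful_response,
                              unsigned amt, unsigned max)
    {
        if (e->state == L2_UPGRADE_REQ) {
            if (useful_response)
                inc_conf(e, amt, max);    /* previous validate was useful */
            else
                dec_conf(e, amt);         /* (-): anti-critical temporal silence */
            e->state = START;
        }
    }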

The predictor and confidence counters can be implemented as a separate structure which covers a smaller working set than the L2 cache because prediction resources are only required for cache lines exhibiting temporal silence. However, implementing a separate tagged predictor will have significant area overhead for the tags alone, since each predictor entry is only 2 bits for the state machine in Figure 5-23 plus confidence counters.18 Two methods can significantly reduce this overhead: making the predictor tag-less, at the cost of significant aliasing in a direct-mapped predictor, or combining predictor entries with the existing L2 tag array. In many 2-level cache implementations, confidence counter access and predictor update can be performed only on L1-L2 cache transactions. For example, this is possible with an MSI L1-D cache and MOESI L2 cache, which we use for performance studies in Chapter 6. Integrating these structures with pre-existing L2 cache tags is desirable to reduce area overhead. Further, accessing and training only on L1-L2 transactions greatly reduces the prediction bandwidth required since the reference stream observed is filtered by the L1 cache.

18. Next-state and output logic can be shared among all entries and therefore we neglect it in area estimation.

In this configuration, the storage overhead for the predictor is a function of L2 cache size and mapping. As an example, for a 16MB, 8-way associative, 64-byte line cache with a 40-bit physical address, the overhead is 5 bits (2 bits for predictor state plus 3 bits for confidence counters) over approximately 25 bits of existing L2 tag storage (~6 baseline state bits plus 19 tag bits), i.e. a 20% increase in L2 cache tag storage to implement the predictor. Since cache tags are only a small fraction of the overall silicon area (the vast majority is the data array), the predictor storage overhead is likely less than 5%.
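As a quick check of the tag width assumed above (16MB capacity, 8-way associativity, 64B lines, 40-bit physical addresses):

\[
\text{sets} = \frac{16\,\text{MB}}{64\,\text{B} \times 8\ \text{ways}} = 32768 = 2^{15},
\qquad
\text{tag bits} = 40 - \underbrace{15}_{\text{index}} - \underbrace{6}_{\text{offset}} = 19
\]
\[
\text{tag storage increase} = \frac{2 + 3}{\sim 6 + 19} = \frac{5}{25} = 20\%
\]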

Furthermore, note that this prediction mechanism can be implemented entirely outside the processor core because it only observes physical memory addresses and standard coherence transactions; no PC or additional correlation information is used. The only event beyond standard coherence transactions required for the predictor is temporal silence detection. Slightly augmenting the L1-D cache to indicate temporal silence with perfect timeliness is advantageous. However, the predictor can be completely implemented at the L2 cache by simply waiting for L1-D cache writebacks (as discussed in Section 5.5.4), thereby avoiding L1-D cache modification.

The predictor's initial (cold miss) confidence, confidence threshold, and confidence change values for various events were determined experimentally. We used a differential tuning algorithm which traded off observed address traffic increases against TSS misses eliminated for different predictor configurations. Studies not detailed here determined that broadcasting validates for predictor misses was detrimental in general; for large predictors these are largely cold misses. We show results for three of the most promising configurations (initial confidence: 3, confidence threshold: 4, increment: 2, decrement: 1, and saturation value: 7, i.e. 3-4-2-1-7; 3-4-1-1-7; and also 3-4-1-2-7) in sharing misses avoided and address traffic increase over the baseline case for infinite caches in Figure 5-24.

FIGURE 5-24. Address-Based Useful Validate Predictor Characterization Data. Results for MESTI (Snoop-Aware Validate policy), Predictive Snoop-Aware Validate (Section 5.7.2), and three different predictor configurations are indicated. Communication misses compared with the baseline are indicated with solid bars and address traffic increases are indicated with a line on top of each bar. Reduced communication misses along with lower address traffic indicates a desirable configuration.

We observe that the predictor reduces anti-critical validates significantly as compared with simple snoop-aware validate, and measurably over predictive snoop-aware validate. The predictor achieves measurable communication miss improvements while also reducing address traffic over predictive snoop-aware validate, thus eliminating the vast majority of additional address traffic in the scientific workloads while sacrificing little opportunity; however, in the commercial workloads the performance is more variable, although better than predictive snoop-aware validate in all cases. The predictor is least effective in specweb; although it decreases address traffic substantially, it cannot do so without losing substantial opportunity for sharing reduction. Figure 5-18 shows that the working set of cache lines exhibiting temporal silence is not large in specweb, indicating that this predictor cannot reliably distinguish between useful and useless temporally silent stores for the patterns exhibited in this workload. More elaborate prediction methods could improve this result; however, we emphasize this prediction mechanism can be implemented entirely outside the processor core, as no program structure (PC values) or other commonly used correlation information, e.g. branch histories, is utilized. Therefore, as described previously (Section 5.4.2 and throughout Section 5.5), a system can effectively implement MESTI for communication miss reduction with essentially no changes to the processor core itself, although additional benefit may be realized with such modifications.

This simple mechanism with low predictor storage overhead eliminates most anti-critical validates while still reducing communication misses substantially in commercial workloads. Reductions in communication misses of 32%, 30%, and 25% on average for 3-4-2-1-7, 3-4-1-1-7, and 3-4-1-2-7, respectively, are observed. It achieves near-perfect results for the scientific workloads studied (except raytrace). We explore detailed performance results in Section 6.6.2.

FIGURE 5-25. Characterization Data for Different Predictor Sizes. Results for MESTI (Snoop-Aware Validate policy) and four different predictor sizes are indicated. Communication misses compared with the baseline are indicated with solid bars and address traffic increases are indicated with a line on top of each bar. Note that the capacity indicated for each predictor is its effective cacheable area, not the number of bits used in the predictor.

Although we have argued that significant silicon area for the predictor can be saved by combining it with the L2 cache tags, we show performance for our 3-4-2-1-7 predictor with varying cacheable space in Figure 5-25. Results for predictors capturing 128KB, 1MB, 8MB, and 32MB of L2 cache accesses are shown. Each predictor is 8-way associative. Actual bit storage for each predictor can be calculated as illustrated previously; e.g., for a 40-bit physical address and 64B cache lines, the 128KB-reach predictor requires 5 bits (state machine/confidence) + 26 bits (tag) = 31 bits, or ~4B per cache line of reach. This implies the 128KB-reach predictor requires actual storage of ~8KB. Implementing a tag-less predictor can reduce the storage overhead substantially at the cost of increased aliasing.
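A quick check of these figures under the stated geometry (128KB reach, 8-way associative, 64B lines, 40-bit physical addresses):

\[
\text{sets} = \frac{128\,\text{KB}}{64\,\text{B} \times 8\ \text{ways}} = 256 = 2^{8},
\qquad
\text{tag bits} = 40 - 8 - 6 = 26
\]
\[
\text{storage} = \frac{128\,\text{KB}}{64\,\text{B}} \times (5 + 26)\ \text{bits}
= 2048 \times 31\ \text{bits} \approx 7.75\,\text{KB} \approx 8\,\text{KB}
\]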

In the scientific workloads, the smallest predictor is sufficient to capture all opportunity. This correlates well with data presented in Figure 5-18, indicating a small working set of TSS cache lines in these workloads. Results for the commercial workloads correlate similarly; however, in tpc-w we notice that even a large, 8MB-reach predictor sacrifices substantial opportunity even though the working set of TSS lines is less than 1MB (Figure 5-18). A similar trend is true in specweb and tpc-h; a larger predictor than the working set illustrated in Figure 5-18 is required. This occurs because predictor entries are allocated for each temporal silence detection, regardless of sharing patterns exhibited for the cache line. Therefore, extra predictor capacity is required to filter out cache lines which exhibit temporal silence but do not actively participate in sharing.

Finally, note that the smallest predictors, although they have the lowest coverage of sharing misses eliminated, tend to have the least address traffic increase as compared to the baseline case. This indicates address-based prediction is more accurate for small, but active, data working sets as compared with larger, less-active working sets.

5.7.4 Exploiting MESTI to Ease Handling of Update Silent Store Misses

At this point, we revisit a question we deferred from Section 4.3, namely, providing an efficient mechanism for handling update silent store misses. As explained there, we desire a mechanism that allows sending a ReadX transaction for all store misses, since the vast majority of store misses are not update silent, but can still exploit USS when store misses are update silent to prevent remote communication misses.

This problem is efficiently solved by simply treating store misses as candidates for exhibiting temporal silence, where the intermediate value store and the temporally silent store are the same dynamic store. In the case of a store miss, a ReadX transaction is sent, and if the store is update silent, it is immediately followed with a Validate transaction. In this way, store commit in the core is not stalled in the common case of non-update silent store misses, with validates in the case of update silent store misses allowing remote processors to eliminate communication misses. This scenario is equivalent to USS-P (assuming instantaneous address transactions) in terms of communication misses eliminated (presented in Section 4.2). Unnecessary address traffic is exactly equivalent to anti-critical update silent store misses (presented in Section 4.4).
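A minimal C sketch of this store-miss path follows; the function and memory-access helpers are hypothetical stand-ins, not the interface of our simulator:

    extern void send_readx(unsigned long line_addr);     /* assumed provided */
    extern void send_validate(unsigned long line_addr);
    extern unsigned long read_word(unsigned long addr);
    extern void write_word(unsigned long addr, unsigned long value);

    /* Hypothetical store-miss handling: always obtain write permission with
     * ReadX so store commit is never stalled, then follow with a Validate
     * if the store turns out to be update silent. */
    void handle_store_miss(unsigned long line_addr, unsigned long word_addr,
                           unsigned long new_value)
    {
        send_readx(line_addr);                 /* line data + exclusive ownership */
        unsigned long old_value = read_word(word_addr);
        write_word(word_addr, new_value);
        if (old_value == new_value)
            send_validate(line_addr);          /* update silent miss: remote T-state
                                                  copies may revert to Shared */
    }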

Furthermore, the prediction mechanisms discussed in Section 5.7.2 and Section 5.7.3 can also be used unmodified. If an update silent store miss is encountered, and a validate is broadcast, the predictors monitor for a subsequent non-update silent store. The upgrade caused by this non-update silent store relies on the useful snoop response for training purposes. This allows elimination of future anti-critical update silent store misses to the same cache line. We use this mechanism (a ReadX/Validate pair) for handling update silent store misses in our performance studies in Chapter 6.

5.8 Memory Consistency and Correctness Implications

We discussed consistency and correctness in the context of update silent stores in Section 4.5. In this section, we discuss these issues for temporally silent stores in WC TSS (Section 5.4.1) and the MESTI protocol (Section 5.4.2). Correctness under other methods of exploiting TSS, i.e. SLE (Section 5.4.3) and load value prediction (Section 5.4.4), has been discussed in other works [90] [76].

5.8.1 Memory Consistency Considerations

WC TSS is obviously an extension of delayed consistency, and inherits the same requirements as the proposals on that subject [34]. As a brief summary, WC TSS only allows collapsing temporally silent pairs that need not be made visible to remote processors; in weak consistency, a simple condition for correctness is that only temporally silent pairs not separated by a memory barrier may be collapsed. Once a memory barrier is reached, all current values should be made visible, and any opportunity for collapsing temporally silent pairs across the barrier must be forgone. This can be accomplished simply by allowing all current store values in write buffering structures to be propagated into the memory hierarchy.
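A minimal sketch of the barrier rule, assuming a hypothetical write-buffer structure (not the actual WC TSS hardware): on a memory barrier, every buffered value is propagated, so no temporally silent pair spanning the barrier can be collapsed.

    extern void propagate_to_memory(unsigned long addr, unsigned long value);

    struct wb_entry { unsigned long addr, value; int valid; };

    /* A memory barrier makes all buffered store values visible, forgoing
     * any collapsing of temporally silent pairs across the barrier. */
    void on_memory_barrier(struct wb_entry *wb, int n)
    {
        for (int i = 0; i < n; i++)
            if (wb[i].valid) {
                propagate_to_memory(wb[i].addr, wb[i].value);
                wb[i].valid = 0;
            }
    }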

MESTI (as compared with update silent store suppression in Section 4.5.1) is significantly less complicated to prove. In fact, no detailed argument is required. The key observation is that all stores present in the original execution are still visible; the temporally invalid state is considered equivalent to the invalid state. Therefore, as long as the MESTI protocol is correctly implemented and properly maintains coherence, no impact is observed regarding consistency.

5.8.2 Existing ISA Correctness Considerations

In Section 4.5.2 we discussed correctness considerations for update silent store suppression, which included side-effects through the reservation register in architectures utilizing load-locked/store-conditional synchronization primitives and also other side-effects, e.g. page table references. Again, since MESTI makes both stores of the temporally silent pair visible, and neither of these stores is update silent, these problems are easily addressed.

As an example, we reconsider the issue with reservation granule side-effects in PowerPC introduced in Section 4.5.2 on page 98. Suppose, without loss of generality, CPU 1 performs a temporally silent pair of writes while CPU 0 is performing the detection algorithm outlined for update silent stores in Figure 4-8 on page 101, i.e. the temporally silent pair corresponds to the update silent store in the figure. Any temporally silent pair performed between the synchronizing constructs is observable, since the intermediate value store behaves exactly as it would in the baseline implementation, clearing CPU 0's reservation. If the intermediate value store or temporally silent store falls outside the synchronizing construct, the intermediate value or the temporally silent value becomes explicitly visible through a ReadX/Upgrade, clearing the reservation and maintaining correct operation.

5.8.3 Correctly Implementing the MESTI Protocol

The statements throughout this section have assumed a correct MESTI coherence protocol when discussing both consistency and architectural observability. However, correctly implementing the MESTI protocol in practice requires special consideration. We provide additional comments on correctly implementing MESTI in our detailed simulation infrastructure (PHARMsim [15]) in Section A.1.

The MESTI protocol is trivially correct in the case where no validates are broadcast because it is functionally equivalent to MESI. The only concern with correctly implementing MESTI is ensuring that validates observed by remote processors in temporally invalid (T) state correspond to the temporally silent value on the broadcasting processor. This means a validate may only be observed by remote processors if the T state corresponds to the intermediate value written by the validating processor. Furthermore, in any subtle race condition, it is always correct to eliminate or ignore a validate, and correct operation with forward progress will still result, regressing to the MESI protocol.

Practically speaking, correct correspondence can be assured through snooping any pending validate, e.g. in the delay queue described in Section 5.7.1 or other outbound coherence queues, for internal and external coherence conflicts. Once a validate has left the processor, correct correspondence can be assured through canceling any validate not associated with the current owner when the validate reaches the coherence ordering point. In snoop-based protocols, which we have assumed previously throughout this chapter, this is the address network. We discuss directory-based schemes separately in Section 5.9. Other more complex system topologies (hierarchical snooping or ring interconnects) may present additional complication, which we do not explore for the sake of brevity.

Another complication, in practice, is maintaining ownership of a line which has left Modified state because it has been validated, but which has not yet successfully transferred ownership of the cache line back to memory to source the data for subsequent requests. This is essentially the writeback race problem in MESI, which is well understood [31]; the same solutions can be applied here. The protocol used for performance evaluation is MOESI, where this race can be simply handled by transitioning to the Owned state when a validate is broadcast. In Owned state the validating processor is still responsible for providing data to other snoop requests for validated lines.

Finally, correct temporal silence detection for the MESTI protocol must differentiate between valid and invalid stale data. In the case of write, no-allocate operations where the stale data is not acquired before the write, data present in the cache may not be used for temporal silence detection; otherwise incorrect operation may result. Examples are the PowerPC dcbz (data cache block zero), Alpha write-hint, and similar cache control instructions which only obtain exclusive coherence permission without a data transfer. These cases can be avoided by adding a Modified_NoValidate state indicating that the current data is modified but is not a candidate for validation.

5.9 Extending Temporal Silence Exploitation to Directory-based Systems

We have described numerous communication mechanisms and argued about their correctness and suitability to snoop-based multiprocessor systems throughout this chapter. In this section, we offer insight on extending the proposed mechanisms and prediction techniques to directory-based systems (many examples of such systems are illustrated by Culler and Singh [31]). We do not offer rigorous performance evaluation or correctness arguments for such systems, because we have only validated our proposals in the PHARMsim environment, which implements a snoop-based interconnect (see Section 6.2); rather, we illuminate some fundamental issues which may present complication in common directory-based implementations.

5.9.1 Correctness Considerations

Considering the four temporal silence communication methods presented throughout previous sections: WC TSS (Section 5.4.1), MESTI (Section 5.4.2), SLE (Section 5.4.3), and LVP (Section 5.4.4), we reiterate that correctness requirements for WC TSS, SLE, and LVP for directory-based coherence schemes have been explained elsewhere, i.e. [34], [90], [76]. Very simply, since all three mechanisms rely on implicit communication, existing coherence mechanisms can guarantee correctness. MESTI is an interesting case, and we discuss its correct implementation in directory-based systems next.

We have already described the problem of correct correspondence between validate transactions and temporally invalid (T) states, which is fundamental to ensuring proper implementation of MESTI in any system, in Section 5.8.3. In directory-based systems, correct correspondence can be assured by verifying the proper pairing between a validate and an owning processor at the coherence point, which is the home node, and canceling any improper validate. However, this only assures correct correspondence on the path from the owning processor to the home node. Once proper validates leave the home node, correspondence must still be maintained from the home node to all targets of the validate. Ordered interconnects from the home node to other processors can assure correspondence; assuring correspondence in unordered interconnects either requires acknowledging validates (in effect, making the interconnect appear ordered) or more elaborate schemes.

Besides ordering requirements in the interconnect, we note that T state should be implemented at the directory itself (and not only in remote processor nodes) to ensure correctness. As discussed in Section 5.8.3, the MESTI protocol is trivially correct in the case of no validate broadcasts. However, MESTI relies on observation of other address transactions when in T state to help ensure correct correspondence on remote processors. If T state is not reflected at the directory, multiple address transactions may need to be broadcast on each remote processor request to ensure remote caches properly enter the invalid state and subsequently ignore non-correspondent validates. A simpler solution is to augment the sharing list to also reflect T state. With this support, it may be possible to send only a single invalidate, and no additional address transactions, to processors in valid states and maintain correct operation. Although the single invalidate will cause remote processors to enter T state (potentially creating a correspondence hazard when subsequent address transactions are filtered by the directory), since T and I states are equivalent from the perspective of local processor requests, correct operation can still be maintained. The only requirement is for the directory to ensure no subsequent validate is sent to non-corresponding T state lines.

5.9.2 Feasibility of Performance Optimization Techniques for MESTI

We discussed various methods of improving performance of the basic MESTI protocol: snoop-aware validate (Section 5.7), the validate delay queue (Section 5.7.1), predictive snoop-aware validate (Section 5.7.2), and memory address-based temporal silence prediction (Section 5.7.3). We now briefly discuss the fundamental aspects of snoop-based systems which each might exploit and possible extensions to directory protocols. We point out that the validate delay queue is a processor-local optimization; hence we expect no interaction with the coherent interconnect type in its implementation and do not discuss it further here.

The snoop-aware validate policy relies on the same mechanisms used to implement the exclusive (E) state. Directory-based systems which can implement E state should also be able to implement the snoop-aware validate policy. Fundamentally, the requirement to enable snoop-aware validate is knowledge of the emptiness/non-emptiness of the sharing list upon an upgrade/invalidate message. Principally, this means the directory must provide a response to the upgrade request with sharing information; since the directory normally provides some type of response to indicate ordering, augmenting this response with sharing information may be possible. However, in some protocols, responses are generated before the directory is consulted; therefore this optimization may not be possible. For example, in the Alpha GS-320 [41] the response can be generated when the request has been accepted by the directory but has not yet interrogated the directory contents to obtain sharing information.

Additionally, predictive snoop-aware validate and memory address-based temporal silence prediction rely on the useful snoop response (Section 5.7.2) to enable a distributed communication mechanism indicating the usefulness of validates at remote processors. Fundamentally, obtaining information about usefulness of validates at remote processors requires the processors themselves to respond to each invalidate/upgrade request. In cases where only the directory responds to the requestor, this information may not be available. However, a potential solution is to have each processor which accesses a Validate_Shared block send a message to the directory indicating validate usefulness. This message need not delay local access to the Validate_Shared block (it can still behave as a cache hit) since the message is only used to indicate validate usefulness. Also, we expect additional training messages to be minimal, equivalent to transactions observed in the baseline case without validate messages. Essentially, this mechanism implies each MESTI-avoided miss still creates a transaction to the directory (so address transactions are equivalent to the baseline case); however, data transactions and associated latency can still be avoided. Finally, we can implement predictors which do not rely on the useful snoop response to aid training. We have discussed this in Section 5.7.2.

5.10 Program Behavior

We have described many aspects of temporally silent program behavior directly and indirectly throughout previous sections. For example, Figure 5-10 indicated that temporally silent pair distance can be substantial in commercial workloads; Figure 5-18 showed the dynamic memory footprint of cache lines exhibiting TSS can be widely variable between workloads; and Figure 5-20 illustrated the distance in processor cycles between useless temporally silent writes and the subsequent write, as well as temporally silent last-write to remote access distance, to show the effect of delaying temporal silence detection. In this section we perform additional examination of the TSS phenomenon, not focused on deriving efficient mechanisms to capture it, but for the sake of TSS itself. We explore the contribution to TSS misses from cache lines accessed with explicit atomic memory operations, value distributions for intermediate value and temporally silent stores, and conclude with an examination of operating system, library, and user code exhibiting TSS.

5.10.1 TSS Contributed by Explicit Atomic Operations

FIGURE 5-26. Contribution of Atomic Operations Under Different Definitions of Sharing. The figure indicates the contributions of cache lines written with at least one stwcx/stdcx (store-conditional) operation to sharing misses for Baseline, USS, and TSS. The data is normalized to sharing misses in the Baseline case.

Lock variables are great candidates for exhibiting temporal silence since they revert to their unheld value when released. Indeed, Speculative Lock Elision exploits "silent store-pairs" (which we refer to as atomic temporally silent store pairs) to elide transfers and execute critical sections concurrently [93]. To further understand program behavior exhibiting TSS, we examined the percentage of sharing misses contributed by memory locations written with explicit ISA atomic operations (store conditionals for PowerPC) and those written with data operations.

In Figure 5-26 we show the contribution to sharing misses by atomic operations for Baseline, USS, and TSS. We determine explicit atomic operations in a very liberal manner: if any location within a cache line has been written with a PowerPC stwcx/stdcx (store-conditional) instruction, subsequent misses to that cache line are denoted as related to an explicit atomic operation. In reality, there may also be data operations within the same line. Load-linked/store-conditional instructions can implement various atomic primitives, including locks, atomic updates, compare-and-swap, atomic list insertion/deletion, and others [31]. Determining which atomic primitive is implemented is non-trivial without instrumenting the binary at compile time. Since most temporally silent pairs occur within the AIX kernel and within the various libraries which make up our commercial workloads, we are unable to further classify atomic primitives; we present additional data on classification in Section 5.10.4.

Examining the figure, we see that a large fraction of true sharing misses are due to atomic primitives across all benchmarks in the Baseline case, ranging from 9% in barnes to 43% in tpc-w. We also observe that over 75% of TSS avoidable misses are from atomic primitives, except in specjbb, where the fraction contributed by atomic primitives is 45%. This data indicates we may be able to leverage explicit atomic operations to more efficiently exploit temporal silence. However, in specjbb, a majority of TSS avoidable misses are not due to atomic primitives, indicating that a general mechanism, one not focused on just these constructs, is desirable. Note that all mechanisms presented throughout previous sections do not differentiate atomic primitives from data operations and are examples of general mechanisms.

5.10.2 Why Temporal Silence Isn’t Only Due to Locks

We have described general mechanisms for achieving benefit from TSS in this thesis, not focused on any particular programming idiom. However, designing such general mechanisms may be misguided if they are unnecessary; as discussed in Section 5.4.3, Speculative Lock Elision (SLE) exploits temporal silence of lock variables and eliminates many TSS misses. If all temporal silence is only due to lock variables, mechanisms targeting such idioms may lead to substantially simplified designs. We have provided much characterizing data throughout this chapter which sheds light on this point; we now examine a few key pieces of it to reveal that all TSS is not simply due to locks.

FIGURE 5-27. Contribution of Atomic Operations Under Different Definitions of Sharing. The figure indicates the contributions of cache lines written with at least one stwcx/stdcx (store-conditional) operation to sharing misses for Baseline, USS, and TSS. The data is normalized to sharing misses in the Baseline case.

In Figure 5-27, we reproduce the data presented in Figure 5-26 indicating the fraction of communication misses due to detectable synchronization references. As discussed in Section 5.10.1, non-trivial fractions of data communication misses can be eliminated under TSS across all benchmarks. However, we explicitly note specjbb in Figure 5-27; 55% of communication misses in this workload are due to data references, not detectable synchronization primitives. We also note that many constructs can be implemented with load-locked and store-conditional instructions; it may be reasonable to assume that most are atomic primitives of some sort, but they may not all be locks. As discussed in Section 5.10.1, in PowerPC load-locked and store-conditional instructions can implement many constructs, e.g. atomic list insertion.

FIGURE 5-28. Effect of Limiting Stale Storage on Exploiting Temporal Silence. The reduction in communication misses normalized to the Baseline case (see Figure 5-7) for 64B cache lines and infinite caches for MESTI and TSS with limited stale storage (16B/cache line, 8B/cache line) is shown.

In Figure 5-28, we reproduce Figure 5-9 on page 131, indicating TSS captured with varying amounts of stale storage per cache line. In general, programming guides encourage allocating lock variables and data within critical sections to different cache lines to achieve improved critical section performance in the case of contention [31]; the same allocation is encouraged in the PowerPC architecture [78]. When this allocation is followed, if all TSS is due to locks, we expect no sensitivity of TSS captured to varying stale storage per cache line; as long as sufficient stale storage is provided for a single lock variable (maximally 8 bytes in PowerPC), all TSS should be captured. As shown in Figure 5-28, a substantial fraction of TSS in both tpc-h and tpc-w requires more than 8 bytes of stale storage per cache line, implying substantial TSS outside lock variables themselves.

FIGURE 5-29. Percentage of Communication Misses for MESTI. The data is normalized to the Baseline case for 64B cache lines, including the reduction in communication misses under USS, WC TSS, and TSS for comparison purposes.

In Figure 5-29, we reproduce data from Figure 5-7 on page 120, indicating communication miss reduction for various non-speculative methods of capturing TSS (WC TSS, Section 5.4.1, and MESTI, Section 5.4.2). As discussed in Section 5.4.1, for a lock to implement its intended function under PowerPC, a memory barrier must occur between the lock acquire and release operations creating the temporally silent pair. WC TSS fundamentally cannot capture such temporally silent pairs due to the memory barrier. Figure 5-29 shows that 3% to 9% of TSS misses can be captured by WC TSS, again indicating non-trivial TSS outside of lock variables.

We provide additional discussion of temporally silent program behavior through function-level examination in Section 5.10.4. However, we have shown through the previous characterization that only targeting lock variables, or other synchronization primitives, is misguided; a robust mechanism for capturing TSS, such as those presented in this thesis, should be used to capture all available opportunity. Furthermore, we expect that a general mechanism, focused on exploiting TSS in-and-of-itself (and not due to a particular programming idiom), will provide performance stability across diverse application classes and programming constructs.

5.10.3 Intermediate and Temporally Silent Store Value Distributions

FIGURE 5-30. Cumulative Value Distributions for Useful TSS. The figure indicates the cumulative distribution of observed intermediate values and temporally silent values for TSS avoidable misses, with scientific workloads shown on the left, commercial workloads on the right. Only values for integer stores are shown; contributions from floating point stores were also measured and were negligible. Results for barnes, radiosity, and raytrace do not differ materially and are represented by the “other_int” and “other_ts” categories in the scientific workload graph.

The values observed for both intermediate values and temporally silent values of memory words may reveal insight into temporal silence. In Figure 5-30, we show the cumulative distribution of temporally silent values and intermediate values for TSS avoidable misses. Across both the scientific and commercial applications, the graph indicates over 75% of temporally silent values are zero, which is not surprising given similar results were observed for update silent stores [10]. However, in the commercial workloads, an observable fraction (over 5% in tpc-w and specweb) of temporally silent values are non-null pointers19, a point which we will return to shortly.

Examining the intermediate value distributions for the scientific applications, we see (with the exception of ocean) that the predominant intermediate value is integer one, which is caused by user-level spin-locks and other flag values. In ocean, the largest contribution comes from values in the range 4K-8K, which is unexpected. We examined this benchmark further and found these values are actually the thread IDs used by AIX for the concurrent threads of ocean. These intermediate values still appear to be lock-related, with AIX using the thread ID to indicate which thread is holding the lock to thread-safe memory allocation routines and other operating system structures protected with the simple_lock(), simple_unlock(), disable_lock(), and unlock_enable() kernel routines.

19. We approximated pointer values by storing all virtual addresses touched by each process and assuming any observed value which matched a previously observed virtual address was in fact a pointer.

These are higher-level locks (not simple spin-locks) provided by the AIX kernel [29]. In the commercial applications, the intermediate value distributions show strong contributions throughout the range, many of which match thread IDs of running processes (in the range 512-64K). This enlarged contribution from high-level locks in the commercial applications is reasonable; under a highly concurrent commercial application load we expect fewer simple spin-locks because they can reduce system throughput by wasting processor cycles spinning. More noteworthy is a particularly strong contribution from intermediate pointer values. The contribution of these values is over 40% in specweb and 20% in specjbb. Many of these revert to null (and therefore are counted under temporally silent value zero), but some also revert to non-null values, indicating that temporally silent pairs also occur in what are likely shared data structures.
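The pointer classification of footnote 19 can be approximated with very simple bookkeeping. A toy C sketch follows; the hash function, table size, and the assumption that address zero is never touched are ours, and the actual analysis was driven by the simulator's memory-reference stream.

    #include <stdint.h>
    #include <stdbool.h>

    #define BUCKETS (1u << 20)
    static uint64_t touched[BUCKETS];  /* open-addressed set; 0 = empty slot */

    static uint64_t slot(uint64_t v) {
        return (v * 0x9e3779b97f4a7c15ull) >> 44;  /* top 20 bits of a mix */
    }

    /* record every virtual address the process touches */
    void note_touched(uint64_t vaddr) {
        uint64_t h = slot(vaddr);
        while (touched[h] != 0 && touched[h] != vaddr)
            h = (h + 1) % BUCKETS;  /* toy: assumes the table never fills */
        touched[h] = vaddr;
    }

    /* a stored value matching a previously touched address is counted
     * as a pointer */
    bool looks_like_pointer(uint64_t value) {
        uint64_t h = slot(value);
        while (touched[h] != 0) {
            if (touched[h] == value)
                return true;
            h = (h + 1) % BUCKETS;
        }
        return false;
    }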

5.10.4 Temporally Silent Program Behavior

Table 5-7: Functions Actively Participating in Temporal Silence. The table indicates the percentage of dynamic intermediate value and reversion stores contributed within the specified functions in the benchmarks indicated for cases of useful TSS. (K) Denotes kernel functions, (L) denotes library functions, and (U) denotes user code. (*) Includes contributions from multiple functions listed individually within the table as well, so for ocean this column adds to more than 100%.

Benchmark | % TSS Misses | % Dynamic TSS Stores | Function | Comments
specjbb | 40.0% | 13.9% | check_lock (K) | Compare and swap with import fence (lock acquire) primitive
specjbb | 36.5% | 38.5% | libjava.a (L) | Java Runtime Environment
specjbb | 22.4% | 8.9% | libpthreads.a (L) | Thread management
specjbb | 17.7% | 8.5% | clear_lock (K) | Atomic write with export fence (lock release) primitive
specjbb | 16.9% | 18.5% | libjitc.a (L) | Java Runtime Environment (JIT)
specjbb | 0.0% | 0.0% | all user code (U) | All user-level application code
specweb | 40.2% | 20.9% | rsimple_lock (K) | Recursive simple lock acquire
specweb | 20.1% | 3.8% | simple_lock (K) | Non-recursive simple lock acquire
specweb | 20.0% | 4.1% | simple_unlock (K) | Simple lock and recursive simple lock release
specweb | 19.4% | 1.9% | p_inspte_p64 (K) | Process creation/deletion, page table entry insert
specweb | 14.7% | 1.2% | v_inspft (K) | AIX kernel
specweb | 13.7% | 1.5% | v_lookup (K) | AIX kernel
specweb | 13.2% | 1.3% | delall_pte_p64 (K) | Process creation/deletion, page table entry delete
specweb | 11.9% | 2.6% | invtlb_ppc (K) | Process creation/deletion TLB manipulation/invalidation
specweb | 11.8% | 1.0% | v_delpft (K) | AIX kernel
specweb | 6.2% | 0.9% | px_rename_p64 (K) | AIX kernel
specweb | 5.9% | 1.0% | rsimple_unlock (K) | Recursive simple lock release
specweb | 4.8% | 0.6% | v_scoreboard (K) | AIX kernel, includes v_descoreboard
specweb | 3.0% | 0.4% | insque/remque (K) | Shared queue management
specweb | 0.3% | 0.1% | all user code (U) | All user-level application code
ocean | 70.6% | (*)53.7% | all kernel locks (K) | All kernel level locks/releases, not within application code
ocean | 46.8% | 14.9% | simple_lock (K) | Kernel lock acquires, not within application code
ocean | 42.6% | 13.2% | simple_unlock (K) | Kernel lock releases, not within application code
ocean | 15.2% | 14.0% | rsimple_lock (K) | Kernel lock acquires, not within application code
ocean | 11.5% | 1.9% | unlock_enable (K) | Kernel lock releases, not within application code
ocean | 9.0% | 1.4% | test_and_set (K) | Other atomic primitives, not called directly by application code
ocean | 7.7% | 1.9% | cs (K) | Atomic compare and swap (application code lock acquire)
ocean | 7.6% | 1.8% | user mode store (U) | Application code lock release
ocean | 7.0% | 4.9% | state_save_point (K) | AIX kernel
ocean | 5.6% | 3.4% | v_lookup (K) | AIX kernel
ocean | 5.5% | 1.6% | v_inspft (K) | AIX kernel
ocean | 5.3% | 3.1% | p_inspte_p64 (K) | AIX kernel
ocean | 4.5% | 3.5% | call_dispatch (K) | Kernel thread dispatching/scheduling
ocean | 4.4% | 1.7% | set_curthread (K) | Kernel thread dispatching/scheduling
ocean | 4.2% | 1.0% | other user code (U) | All user-level application code (not lock releases)
barnes | 80.9% | 6.8% | cs (K) | Atomic compare and swap (application code lock acquire)
barnes | 80.0% | 6.9% | user mode store (U) | Application code lock release
barnes | 9.8% | 4.3% | Kernel Locks (K) | simple_lock(), simple_unlock(), disable_lock(), unlock_enable(); not within application code
barnes | 4.7% | 3.0% | Other kernel (K) | Process creation/deletion and thread management
barnes | 5.7% | 78.6% | other user code (U) | All user-level application code (not lock releases)

In the previous sections, we illustrated contribution to TSS through explicit atomic primitives and gained additional insight by examining values of intermediate and temporally silent stores. We conclude our examination of program attributes by examining contribution to TSS at the source/function level. We perform the study by monitoring static instructions contributing to each dynamic instance of TSS and tabulating the contribution from each function to both dynamic temporally silent stores as well as dynamic TSS misses.

In Table 5-7, we show the contribution of different functions within user-level code, shared library code, and kernel code. The “% TSS Misses” column indicates the percentage of TSS avoidable misses the given function participates in by contributing an intermediate value store or temporally silent store. Percentages within this column may add to greater than 100% because multiple functions may participate within a single TSS avoidable miss; for example, if a separate function is used for lock acquire and release, a single miss due to this temporally silent pair will be counted in this column for both the acquire function and the release function.

The “% Dynamic TSS Stores” column indicates the percentage of all dynamic stores (either intermediate value stores or temporally silent stores) contributing to TSS avoidable misses that occur within the given function. Because this column is normalized to the total number of dynamic stores contributing to TSS avoidable misses, it adds up to 100% (or less). The total falls short of 100% when functions are considered individually because it is only practical to present data for a subset of functions. A sketch of the bookkeeping behind these two metrics follows.
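This is a minimal sketch of the tabulation (names are illustrative; the real analysis is driven by the simulator). It shows why “% TSS Misses” can exceed 100% while “% Dynamic TSS Stores” cannot.

    #define NFUNCS 4096

    struct fstats {
        long misses;      /* numerator of "% TSS Misses" */
        long tss_stores;  /* numerator of "% Dynamic TSS Stores" */
    } stats[NFUNCS];

    long total_tss_misses;  /* each avoidable miss counted once */
    long total_tss_stores;  /* each contributing store counted once */

    /* Called once per (miss, participating function) pair, with functions
     * de-duplicated within a miss. Because several functions can
     * participate in one miss, this column can sum past 100%. */
    void credit_miss(int func) { stats[func].misses++; }

    /* Called once per intermediate-value or reversion store; each store
     * belongs to exactly one function, so this column sums to <= 100%. */
    void credit_store(int func) { stats[func].tss_stores++; total_tss_stores++; }

    double pct_misses(int f) { return 100.0 * stats[f].misses / total_tss_misses; }
    double pct_stores(int f) { return 100.0 * stats[f].tss_stores / total_tss_stores; }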

Because we do not have source code for the shared libraries, commercial workload application code, or the AIX v4.3.1 kernel, it is difficult to discern exactly which program semantics are causing temporal silence to occur. However, when possible, we provide function names to allow reasonable conjectures to be made.

Focusing first on the scientific workloads, it is interesting that most locking-related TSS avoidable misses in ocean are not within the application itself, but rather in kernel support routines. The contribution of application spin-lock acquires/releases is less than 8%. In barnes (representative of radiosity and raytrace), the contribution of user-level spin locks is large (over 80%). However, substantial contributions within the AIX kernel (14.5%) are still noted. Furthermore, user-level code not directly related to atomic primitives contributes less than 5% and 6% in each benchmark. This data indicates that studying temporal silence without considering operating system and library code, even for scientific workloads, may ignore substantial opportunity.

In the commercial workloads, most TSS avoidable misses in specjbb occur within the Java runtime environment or JRE (libjava.a and libjitc.a). As shown in Figure 5-26, it appears that non-trivial sharing misses related to atomic as well as data operations occur within the JRE. In specweb, many temporally silent stores occur within process management, TLB, and page table-related code. This seems reasonable given the structure of this implementation of specweb, which spawns a new perl process for each incoming web request, causing frequent process creation and destruction [18]. Results for tpc-w were qualitatively similar to specjbb and specweb and are omitted for brevity.

We also note that the two metrics presented (% TSS Misses and % Dynamic TSS Stores) do not always correlate strongly with one another. In some cases, many dynamic stores contribute to relatively few TSS avoided misses, while in other cases the converse is true (most notably barnes in “other user code” and “cs”, respectively). Therefore, if our goal is eliminating communication misses, it may not be prudent to develop mechanisms which target each dynamic temporally silent store with equal effort and assume that a corresponding reduction in TSS avoidable misses will occur.

Finally, note that even for TSS avoidable misses which can be determined to be locks with high likelihood (see Table 5-7, Figure 5-26, Figure 5-30), the value distributions in Figure 5-30 indicate that the transfers avoided predominantly communicate that the lock is free. This implies that eliminating these misses improves synchronization performance, as we will examine in detail in Chapter 6.

5.11 Characterization Data for Larger Multiprocessors

We have presented detailed characterization data throughout Chapter 4 and Chapter 5 for 4-processor systems to limit simulation time and also make the number of runs tractable. However, it is important to understand whether the fundamental phenomena studied throughout these sections (USS and TSS) are prevalent in larger multiprocessors as well. Therefore, we present both Figure 5-31 and Figure 5-32 to indicate the prevalence of USS and TSS for a 16-processor system; the configuration is described in detail in Section 3.6 on page 71, and is similar to the 4-processor studies carried out previously.20

FIGURE 5-31. Percentage of Cache Misses for Different Definitions of Sharing. The data is normalized to the baseline for 64B cache lines and infinite caches for 16-processor benchmarks.

FIGURE 5-32. Percentage of Communication Misses for Different Definitions. The data is normalized to the baseline for 64B cache lines for 16-processor benchmarks.

Figure 5-31 shows the percentage of cache misses (infinite caches, 64B lines) for the Baseline, USS, MESTI, and TSS configurations. As compared to Figure 5-1 on page 109, we observe a relatively larger fraction of cold misses in the 16-processor case, which is expected. In order to keep simulation time tractable, we must run fewer instructions per processor for larger multiprocessors, implying a more substantial contribution from startup effects. Tpc-h is an exception, as the fraction of cold misses is actually reduced; additional communication misses introduced by the additional parallel processing elements counteract the startup effects, as evidenced by the substantially increased miss rate per instruction over the 4-processor case (Figure 5-1 on page 109). Figure 5-32 shows only the communication miss component. We observe that a non-trivial fraction of communication misses can be eliminated both under USS and TSS, similar to the previous 4-processor data. However, we point out that MESTI is relatively less effective at approaching the TSS limit as compared to Figure 5-7 on page 120 for the 4-processor studies. This is intuitive; as illustrated in Section 5.4.2, MESTI only allows capturing a single previous version for TSS detection and reversion. As processors are added, the likelihood of remote accesses while a memory location holds its intermediate value increases, and thus opportunity may be lost.

20. The most notable difference is the absence of the specweb and tpc-w benchmarks; scalability issues in the benchmarks, largely caused by a slightly antiquated AIX kernel unable to run more modern Java Virtual Machines, make 16-processor results uninteresting.

However, we emphasize that USS, MESTI, and TSS can still eliminate a substantial fraction of communication misses in 16-processor systems.

5.12 Related Work

5.12.1 Temporal Silence Exploitation in Multiprocessors/Multithreaded Architectures

We have discussed work from Rajwar and Goodman on Speculative Lock Elision (SLE) at length in Section 5.4.3. Follow-on work exploiting “silent store-pairs” is Transactional Lock Removal (TLR), which can re-order conflicting critical sections in the memory system to avoid many restarts present under SLE. Both are explained in detail in Rajwar’s thesis [90]. A related proposal which builds upon thread-level speculation (TLS) hardware to implement speculative synchronization is presented by Martinez et al. [77]. However, this work does not exploit temporally silent stores but relies on instrumented binaries.

Other TLS studies have explored/exploited temporal silence, but under a different guise; proposals by Steffan et al. [103] and Cintra et al. [24] have shown the utility of both update silent stores and load value prediction to eliminate inter-thread dependences. Since both of these schemes use a last value predictor, they are effectively exploiting temporal silence on any successful load value prediction which eliminates an inter-thread dependence.

Finally, Dynamic Multithreading (DMT) exploits temporal silence of callee-saved registers to reduce inter-thread register dependence violations across function-call-spawned implicit threads [5].

5.12.2 Sharing/Coherence Prediction in Multiprocessors

We have devised coherence prediction mechanisms tailored to the MESTI protocol in Section 5.7. A plethora of related work exists in the realm of coherence prediction. Our proposals bear the greatest similarity to the work of Kaxiras et al. [53] and Martin et al. [74] in that they are targeted toward snoop-based multiprocessor systems. Kaxiras et al. focus on scientific workloads and advocate instruction-based prediction, while Martin et al. explore commercial workloads and illustrate the advantages of memory address-based prediction in such workloads. As we have pointed out extensively in this thesis, and as other researchers and practitioners have noted, sharing patterns, and therefore prediction mechanisms, for each workload class can vary significantly; therefore this thesis contributes substantially new material in this area by exploring both workload types simultaneously and devising mechanisms robust across workload types. Our temporal silence predictors are most similar to the migratory sharing predictor proposed by Kaxiras et al. [53], since they attempt to detect subsequent writes after a useless validate, and to broadcast-if-shared by Martin et al. [74], since they attempt to detect actively shared data for validation. Martin et al. [74] as well as Nilsson et al. [87], [88] are the only prior works of which we are aware that study commercial workloads.

In the context of directory-based multiprocessors, last-touch predictors as proposed by Lai et al. [60] and dynamic self-invalidation as proposed by Lebeck et al. [64] bear the greatest similarity to our work. Both proposals predict the last write to a cache block in order to relinquish ownership early and reduce 3-hop misses; a useful validate predictor attempts to achieve the same goal with the added benefit of also eliminating data transfers. Additional related material can be found in other works [59] [86] [2] [45].

5.12.3 Update Coherence Protocols

The MESTI protocol bears similarity to update protocols which attempt to optimize multiprocessor performance by broadcasting writes to widely shared data. Updates can decrease latency experienced by remote processors for shared data, at the cost of additional address and data traffic. MESTI avoids additional data traffic by exploiting store value locality and requires less address bandwidth because only temporally silent data is validated, exploiting write combining; many hybrid protocols, e.g. Dragon and derivatives [31] [44] [20], can perform similarly, but do not exploit store value locality.

MESTI also inherits a drawback common to update protocols; namely, difficulty in providing write atomicity. Fortunately, because MESTI makes all intermediate values visible, the drawback is less apparent in MESTI; as long as the interconnection network can reliably determine whether a validate is correctly associated with the owning cache, write atomicity can be maintained (explained in detail in Section 5.8.3). More elaborate schemes can be used in update protocols as well; however, more often relaxed memory consistency constraints are exploited to circumvent this issue. As explained previously, MESTI has its heritage in invalidation-based protocols, which have been shown to be more desirable in commercial systems for many reasons. A discussion of trade-offs between these two major protocol types can be found in Culler and Singh [31].

5.12.4 Writeback Optimization and Memory System Optimization

Exploiting inclusion between the L1 and L2 by comparing the data value held in the L1-D cache against the L2 data is similar to eager writeback proposed by Lee et al. [65]; using different LRU status to gate the comparison is similar to their eager-writeback-queue. Using cache reference timing (similar to LRU status) to predict liveness of a cache block has been explored extensively in uniprocessors to optimize memory system performance by Hu et al. [50] through timekeeping within the memory system. Proposals to reduce cache power which determine the working-set size by tracking data referenced within a given time period, e.g. Kaxiras et al. [54], exploit similar behavior.

Chapter 6

Multiprocessor Performance Evaluation

Throughout previous chapters, we have presented characterization data on address traffic, data traffic, and cache miss rates to illustrate the potential benefit of exploiting store value locality in multiprocessor systems. In this chapter, we perform a detailed performance evaluation of the best schemes determined through our previous analysis in a full-system, execution-driven, integrated function/timing simulation environment. This simulation environment, PHARMsim, has been described elsewhere [15].

6.1 Motivation and Background

Industrial and academic practitioners alike have encouraged evaluation of architectural ideas with detailed performance simulators and quantitative approaches (most notably Hennessy and Patterson [46]). Furthermore, errors in simulation accuracy can exceed 100% [71] when the operating system is neglected, even for SPEC benchmarks; therefore, full-system simulation is necessary to ensure all relevant machine and program interactions are captured. Many academic projects (most notably, SimOS [99]) have extolled the virtues of full-system simulation for all architectural studies. Despite this, most academic studies of single-threaded, as well as multi-threaded scientific, workloads have relied on user-code or user/library-code-only simulation [14] [92] [89].

Since commercial workloads spend a large fraction of execution time in the operating system, our studies have always focused on a full-system simulation environment [9, 95, 58, 6, 7]. In previous sections, we used a simplified processor model to make simulation time and design space exploration tractable.

6.2 PHARMsim Overview and Simulation Parameters

To further prove the merit of this research, we now integrate the most promising ideas into a detailed, fully-integrated, full-system, execution-driven simulation environment. PHARMsim has been described in detail elsewhere [15]; however, we provide a brief overview of the simulator in this section.

6.2.1 The PHARMsim Environment and Its Heritage

PHARMsim is a PowerPC-based simulation infrastructure which uses large parts of the SimOS-PPC and SimpleMP simulators. SimOS is a complete machine simulation environment consisting of simulators for the major components of a computer system (CPUs, memory hierarchy, disks, console, ethernet) [99]. We use a version of SimOS which simulates PowerPC-based computer systems running the AIX 4.3 operating system [56].

SimpleMP is a detailed execution-driven multiprocessor simulator that simulates out-of-order processor cores, including branch prediction, speculative execution, and a cache coherent memory system [92] using a Sun Gigaplane-XB-like coherence protocol [22]. Integrating the SimpleMP simulator into SimOS required significant changes to SimpleMP in order to accurately support the PowerPC architecture; we discuss the major modifications briefly for the interested reader.

SimpleMP was missing much of the functionality necessary to support system-level code, in both the processor core and memory system. We augmented SimpleMP with support for all of the instructions (system-mode and user-mode) in the PowerPC instruction-set architecture. For some of the relatively complex PowerPC instructions, e.g. load/store string instructions, we use an instruction-cracking scheme similar to that used in the POWER4 processor which translates a PowerPC instruction into several simpler RISC-like operations [105]. We also augment the processor core with support for precise interrupt handling and PowerPC context-synchronizing instructions, e.g., isync, rfi.

The SimpleMP memory system required major changes in order to support unaligned memory references, PowerPC address translation, and the set of PowerPC cache management instructions. To handle unaligned memory references, as allowed in the PowerPC architecture, the processor core splits each unaligned memory reference that crosses a cache block boundary into two smaller aligned references which are then individually issued to the SimpleMP memory system, as sketched below.
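The splitting step is straightforward; the following C sketch illustrates it (issue_memref is a hypothetical stand-in for the simulator's internal memory-system interface):

    #include <stdint.h>

    #define BLOCK 64  /* cache block size in bytes */

    void issue_memref(uint64_t addr, unsigned len);  /* memory system hook */

    /* Split a reference that crosses a block boundary into two aligned
     * sub-references, each contained within a single cache block. */
    void issue_possibly_unaligned(uint64_t addr, unsigned len) {
        uint64_t first_block = addr & ~(uint64_t)(BLOCK - 1);
        uint64_t last_block  = (addr + len - 1) & ~(uint64_t)(BLOCK - 1);
        if (first_block == last_block) {
            issue_memref(addr, len);                 /* no boundary crossing */
        } else {
            unsigned first_len = (unsigned)(first_block + BLOCK - addr);
            issue_memref(addr, first_len);           /* up to the boundary */
            issue_memref(first_block + BLOCK, len - first_len);
        }
    }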

In order to accurately model PowerPC virtual memory hardware, we were forced to implement a PowerPC memory management unit (MMU) from scratch, including a translation lookaside buffer (TLB), TLB refill mechanism, and reference and change bit setting hardware. On a TLB miss, we simulate a hardware TLB miss handler which walks the page table by issuing memory references to the simulated memory hierarchy. In the event of a memory management exception, e.g., page fault or protection exception, the MMU signals the processor, which traps to the appropriate OS exception handler. The MMU also maintains and updates a page’s reference and dirty bits by issuing single-byte stores to the simulated memory hierarchy when a page whose reference or change bit is not set is first referenced or written.

The PowerPC architecture includes many cache management instructions, e.g. data cache block invalidate, data cache block zero, etc., which are used in both system and user-level code. Implementing each of these instructions required significant changes to the SimpleMP coherence protocol.
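As one example of why protocol changes were needed, data cache block zero (dcbz) writes an entire block without reading its old contents. A simplified C sketch follows; the protocol hooks named here are hypothetical illustrations, not the actual SimpleMP interfaces.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK 64

    /* Hypothetical protocol hooks */
    void obtain_exclusive_no_data(uint64_t block_addr); /* ReadX without data */
    uint8_t *cache_line_ptr(uint64_t block_addr);

    /* Simplified dcbz: the whole block becomes zero, so exclusive
     * ownership can be obtained without fetching the stale data from
     * memory or a remote cache; remote copies are simply invalidated. */
    void dcbz(uint64_t addr) {
        uint64_t block = addr & ~(uint64_t)(BLOCK - 1);
        obtain_exclusive_no_data(block);
        memset(cache_line_ptr(block), 0, BLOCK); /* install zeroed, modified */
    }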

We also augment the SimpleMP memory system to support coherent I/O. Both SimpleMP [92] and SimpleScalar [14] perform I/O magically by proxying system calls and instantaneously updating a cache’s contents to reflect the new memory contents. Obviously, this mechanism does not accurately model how I/O is performed in real systems. To accurately model coherent I/O, we added support to SimOS and SimpleMP for I/O controllers to initiate DMA transfers into memory and invalidate the corresponding blocks in each processor’s caches.

6.2.2 Simulation Parameters for Performance Studies

We simulate 4-processor and 8-processor shared-memory, snoop-based multiprocessor systems in our performance results. The coherence protocol is modeled after the Gigaplane-XB [22], and inclusive, write-back L0, L1, and L2 cache hierarchies are simulated. An MSI protocol is used in the L0/L1 caches, MOESI in the L2 caches, and OSI at the coherence point/L2 snoop tags, i.e. the DTAGs [22]. Contention is modeled at all levels in the memory system, as are finite microprocessor core resources. All user, library, and kernel code is directly executed by the performance model, including I/O via a coherent DMA agent. Application benchmarks are presented for 4-processor systems; to provide an additional data point, micro-benchmarks are presented for 8-processor systems. The microprocessor core is similar to SimpleMP/SimpleScalar 3.0 [92] [14] with an additional translation stage added for instruction cracking of complex PowerPC instructions. Microprocessor core resources are configured identically for all simulations; L2 cache sizes for 8-processor systems are smaller than for 4-processor systems due to simulation host machine memory constraints; address network bandwidth is also scaled for different processor counts. L2 caches are pre-loaded from a system checkpoint on simulator startup for most workload configurations to mitigate cold-start effects; TLBs, L0, and L1 caches are empty at the start of each run.

Precise resource configurations are given in Table 6-1. Note that we assume perfect temporal silence detection for MESTI, which implies stale storage for all cache lines present in the L2; the L2 cache tags are used for useful temporal silence prediction. We have shown throughout Chapter 5 that efficient stale storage mechanisms and L2-cache tags can be utilized as a very low-cost way to achieve essentially equivalent performance to this configuration.

Table 6-1: Simulated Machine Parameters. Functional unit latencies are shown in parentheses. L0-cache latencies indicate single-cycle address generation plus a cycle for data delivery (1+1), leading to a 1-cycle load-to-use penalty. L1 and L2 cache latencies are additive (L0-miss, L1-hit latency is 1+1+4=6 cycles; L0-miss, L1-miss, L2-hit latency is 1+1+4+12=18 cycles).

Attribute | Value
Fetch/Xlate/Decode/Issue/Commit Width | 8/8/8/8/8
Pipeline Depth | 6 stages
BTB/Branch-Predictor/RAS | 8K sets, 4-way / 8K combining (bimodal and GShare) / 32 entry
RUU/LSQ | 256 entry / 128 entry (micro-ops)
Integer | ALUs: 8 simple (1), 2 mul/div (3/12); Memory: 4 LD/ST
Floating Point | ALUs: 3 add/sub (4/4), 3 mul/div/fmac (4/4/4)
L0-Caches | I$: 64KB, 1-way, 64B lines (1+1); D$: 64KB, 1-way, 64B lines (1+1)
L1-Caches | I$: 512KB, 8-way, 64B lines (4); D$: 512KB, 8-way, 64B lines (4)
L2-Cache | Separate request/response queues for L1-I/L1-D caches; 64B-wide L1 interface, 1 cycle occupancy/txn per queue; Unified: 16MB, 8-way, 64B lines (12) (4-processor); Unified: 8MB, 16-way, 64B lines (12) (8-processor)
Memory/Cache-to-Cache | Minimum latency: 400 cycles; 50 cycles occupancy/txn, crossbar
Address Network | Minimum latency: 70 cycles; 20 cycles occupancy/txn, bus (4-processor); 10 cycles occupancy/txn, bus (8-processor)
TLB | Hardware page table walker, 1-level, 2K sets, 2-way, 4K pages
Memory Model | Sequential Consistency, similar to MIPS R10K [110], [40]
SLE | In-core buffering (RUU/LSQ hold speculative state; speculative critical section size is maximally 0.5 * RUU/LSQ size)
MESTI | Full cache-line stale storage with instant temporal silence detection; L2-cache tags used for useful validate prediction

6.3 Simulator Verification and Simulation Methodology

6.3.1 Simulator Verification

Implementing a full-system, fully-integrated timing/function simulator such as PHARMsim is an incredibly complex task. Furthermore, verification of both functional correctness and coherent/consistent memory order additionally complicates baseline simulator design as well as the modeling of performance enhancement techniques; however, the correctness requirement helps keep us honest as researchers. All subtle timing races and implementation aspects must be considered, and properly handled, in order for the simulator to run an application correctly and therefore generate meaningful performance results.

FIGURE 6-1. Block Diagram of the PHARMsim Verification Environment. Memory order and other execution-related events, e.g. interrupts, are communicated to an intermediary which determines instruction execution sequencing according to consistency model constraints. Processor sequencing information is passed to the SimOS-PPC functional simulator, with correctness verified with instruction-by-instruction checking of modifications to architected state. Execution semantics are not passed from PHARMsim to SimOS-PPC.

A complete description of simulator verification and the design of mechanisms enabling verification in PHARMsim is beyond the scope of this thesis. At a high level, we enable instruction-by-instruction checking of modifications to architected state by mirroring execution in PHARMsim with a semantically unmodified SimOS-PPC simulator. Any mismatches in execution are indicated as errors and therefore can be traced and corrected. A block diagram of the verification environment is shown in Figure 6-1.

Observed memory order, the memory consistency model, and other appropriate execution information is passed to an intermediary which verifies the consistency model has been observed. The intermediary provides instruction sequencing information to the SimOS-PPC functional simulator, and functional correctness is verified with instruction-by-instruction checking. Execution semantic information, e.g. observed load values, is not communicated between PHARMsim and SimOS-PPC; observed events in PHARMsim only indicate, through the intermediary, in what order processors in the SimOS-PPC model should run instructions. Therefore, since PHARMsim executions are verified with a semantically unmodified SimOS-PPC functional model, we are assured that PHARMsim executions follow correct PowerPC semantics. Random perturbations in event queues, memory system queues, and core/memory execution between processors on each simulated machine cycle assure fair access to shared resources and exercise simulator robustness. All ordering requirements for transactions in multiple queues between the processor and memory interconnect are always maintained.

6.3.2 Operating/Compilation Environment

PHARMsim inherits its operating environment from SimOS-PPC [56]. Therefore, it utilizes a slightly modified AIX v4.3.1 kernel which is part of the SimOS-PPC distribution as well as appropriate library code for scientific and commercial workloads. Unmodified libraries and other supporting code required for commercial workloads have been installed and can be run unmodified, provided the code is supported by AIX v4.3.1. Since PHARMsim/SimOS-PPC is a full-system simulator, it can run binaries compiled natively on 32-bit PowerPC hardware with AIX v4.3.1. We have successfully run binaries compiled with both gcc v2.95.2 as well as IBM VisualAge C/C++ v5. All scientific workloads are compiled with gcc -O4 and appropriate locking/barrier macros for AIX/PowerPC. Additional information regarding workload setup and environment can be found in Table 3-4 on page 72.

6.3.3 Simulation Methodology

Recent work has shown that performance evaluation of multithreaded programs is a complex task due to interaction between the architecture, thread scheduling within the application, and interplay with the operating system. Subtle interactions between the hardware and application program can dramatically impact simulation results. A simple example is differences in lock acquisition order brought about by slight changes in execution timing; in one simulation processor 1 may acquire the lock, in another simulation processor 2 may acquire the lock, causing processor 1 to spin. The operating system can make a different scheduling decision based on which processors are holding locks, and an entirely different set of threads may be scheduled. A detailed discussion of these issues can be found in work by Alameldeen et al. [7] and also in our own work [67].

This complication has been called spatial variability [7] or, more generically, the non-determinism problem [67].1 We use two simulation techniques in this thesis to handle spatial variability. The first method we use, advocated by Alameldeen et al. [7], performs multiple simulations from the same starting checkpoint with random perturbations and measures execution time for a fixed amount of work; either end-to-end application runs or fixed transaction counts. We refer to this as statistical simulation. The other method is a recently proposed and evaluated technique which systematically eliminates spatial variability through deterministic re-execution and quantifies the sacrifice in simulator precision caused by enforcing the same execution. We refer to this method as deterministic simulation [67].

1. We ignore a discussion of temporal variability [7] for the sake of brevity and because the issues of spatial and temporal variability are mostly orthogonal. It will become clear shortly why we focus only on spatial variability.

Implementing a deterministic multiprocessor simulator is non-trivial and therefore not explained in detail here. Conceptually, deterministic simulation records a single simulation (the control) and then all subsequent architectural evaluations (the experiments) are compared against this single execution. Since the committed instruction stream across all processors in deterministic simulation is identical to the control, we can rely on conventional uniprocessor performance metrics, such as IPC, to measure performance. Of course, we must perturb the system (sacrifice some precision) in order to assure it recreates the control execution (to buy back accuracy). We achieve this by appropriately delaying selected memory operations or instructions at interrupt arrival boundaries to ensure that the same shared-memory order and interrupt behavior is maintained. We measure how much delay was artificially inserted to ensure deterministic execution, and report this sacrifice in precision as determinism-delay. Advantages of this approach include achieving comparability with only a single simulation at each data point, reducing simulation cycles, immunity to cold-start and end-effects [7], and the ability to use conventional performance metrics. However, enforcing the control execution can sacrifice much precision under some scenarios [67] and should be thoughtfully considered in environments where either relaxed execution semantics or optimizations expected to significantly impact the simulated instruction stream are used. We refer interested readers to other work [67] for a detailed discussion of the advantages and caveats of deterministic simulation as compared to statistical simulation and the details of our implementation and metrics. In the studies shown in subsequent sections for application benchmarks, we indicate statistical simulations by labeling the benchmark with an “S” on the performance graph, “D” for deterministic simulations. Baseline IPCs for application benchmarks and other program characteristics are shown in Table 6-2.
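For intuition only, the following C fragment sketches the determinism-delay idea under strong simplifying assumptions (a single global order of shared-memory operations and one stall cycle per retry); our actual mechanism and metrics are described in [67].

    #include <stdint.h>

    typedef struct { int cpu; uint64_t addr; } mem_op;

    extern mem_op control_order[];   /* shared-memory order recorded from
                                        the control run */
    static long next_seq;            /* next operation allowed to proceed */
    static long determinism_delay;   /* precision sacrificed, in cycles */

    /* Called when 'cpu' wants to perform a shared-memory operation this
     * cycle; returns nonzero if the operation may issue now. */
    int may_issue(int cpu, uint64_t addr) {
        mem_op *turn = &control_order[next_seq];
        if (turn->cpu == cpu && turn->addr == addr) {
            next_seq++;              /* matches the control order */
            return 1;
        }
        determinism_delay++;         /* stall one cycle and retry later */
        return 0;
    }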

Table 6-2: Basic Application Benchmark Characteristics. Instructions are measured excluding the operating system idle loop. Temporally silent stores are measured with respect to the previous version of the cache line, i.e. those which can be exploited with MESTI. IPC is cumulative across all processors.

Program | Instr. | Micro-Ops | Loads | Stores | US Stores | TS Stores | IPC
barnes-4p | 1.80B | 2.29B | 506M | 608M | 117M | 714K | 6.73
ocean-4p | 859M | 984M | 250M | 75M | 9.1M | 1.67M | 5.31
radiosity-4p | 2.39B | 3.26B | 947M | 647M | 137M | 2.84M | 7.00
raytrace-4p | 418M | 567M | 124M | 94M | 18M | 1.04M | 3.61
specweb-4p | 3.0B | 4.63B | 1.24B | 1.17B | 270M | 13.6M | 3.86
specjbb-4p | 2.0B | 2.73B | 600M | 506M | 94M | 20.7M | 2.41
tpc-h-4p | 1.61B | 3.18B | 1.02B | 842M | 220M | 38.2M | 1.39

6.4 Application Benchmarks: Update Silence

As described in Chapter 4, two principal benefits can be achieved by exploiting update silence in multiprocessors: reduced communication, and associated execution latency, by eliminating USS misses; and reduced address traffic due to fewer upgrade requests. As further discussed in Section 5.7.4, we use the MESTI protocol to handle pure update silent store misses under two different configurations: one where only update silence of the miss is considered and not criticality, i.e. a validate is broadcast for all update silent store misses which cannot be filtered using the simple snoop-aware validate policy; and another where the address-based prediction mechanism developed for temporal silence (Section 5.7.3) is applied to predict critical and anti-critical update silent store misses.

FIGURE 6-2. Performance of Exploiting Update Silent Stores in Multiprocessors. (4-processor) Performance improvement normalized to the baseline case for USS-Hit with validate on update silent store misses (Section 5.7.4, USS-Hit-VM) and including a useful validate predictor (Section 5.7.3, USS-Hit-34117) is indicated. Determinism-delay and 90% confidence intervals for benchmarks simulated deterministically or statistically are also shown.

Performance results for 4-processor benchmarks are shown in Figure 6-2. Note that ocean is present twice in the results; it is simulated both deterministically and non-deterministically. We will return to this point shortly. Examining the performance results, we observe measurable speedups in ocean and specweb, and substantial speedup in tpc-h.

These results correlate well with the characterization data presented in Section 4.2 on miss-rate reductions possible with USS, as only ocean and specweb showed significant miss rate reductions coupled with a relatively high cache miss rate. Tpc-h showed little potential in those studies; changing processor models has made this benchmark more sensitive to USS sharing patterns. Note also that no performance degradation is observed in any benchmark, indicating that both the USS-Hit-VM and USS-Hit-34117 policies are effective at minimizing performance impact due to anti-critical update silent store misses (Section 4.4).

Most performance improvement comes from eliminating communication misses between processors in all benchmarks; additional gains accrue from reduced shared address interconnect contention due to eliminated useless invalidate broadcasts (those which do not hit in any remote cache, Section 4.3). Additional benefit is achieved by eliminating transactions within the local processor between the L0/L1 data caches and the L2 for both writebacks and coherence permission changes.2 Note that since on-chip bandwidth is plentiful in our machine model, performance improvement from these effects comes substantially from improved latency and not reduced bandwidth requirements; in bandwidth-constrained designs we expect performance may be additionally improved.

Writeback reductions are similar to those presented in Section 4.2.

We include performance results for ocean simulated both deterministically and statistically; this simple case study indicates the pros and cons of deterministic simulation. Focusing on the USS-Hit-VM results, in the deterministic simulation we observe negligible performance improvement exploiting USS; however, determinism-delay increases substantially (approximately 2%). Simulating statistically indicates most of the extra determinism-delay induced (called intrinsic determinism-delay [67]) is speedup lost by artificially enforcing the same control and experiment executions. Statistical simulation relaxes this requirement, at the expense of additional simulation bandwidth, and additional speedup is obtained. We assume a similar conclusion holds for specweb, where intrinsic determinism-delay increases by approximately 3%; however, we cannot verify this assertion because our configuration of this workload does not run a fixed number of transactions.

FIGURE 6-3. Address Transactions Observed Exploiting Update Silent Stores. (4-processor) Results normalized to the baseline case for USS-Hit with validate on update silent store misses (Section 5.7.4, USS-Hit-VM) and including a useful validate predictor (Section 5.7.3, USS-Hit-34117) are indicated. Address transaction rate per committed instruction is also shown.

Figure 6-3 indicates observed address transactions for each configuration. The benchmarks indicating substantial potential for USS to eliminate communication, coupled with substantial transaction rates, i.e. ocean, specweb, and tpc-h, are also those where the greatest performance improvement is achieved. The behavior of ocean simulated statistically is noteworthy; we observe that the critical update silence predictor is eliminating the vast majority of validates, sacrificing substantial speedup opportunity indicated for the USS-Hit-VM configuration in Figure 6-2. Characterization data in Section 4.4 indicated that most update silent store misses in ocean are indeed critical; thus the conservatism of the predictor is hampering performance potential.

2. As discussed in Section 6.2.2, the L0-I, L1-I, L0-D, and L1-D caches only implement the MSI protocol, with full MOESI implemented at the L2.

6.5 Understanding Performance Potential Through Microbenchmarking

Microbenchmarks are a useful way to provide intuition into the performance results of more complicated programs. Since we use the same mechanisms, i.e. the MESTI protocol, to exploit both update silence and temporal silence (see Section 5.7.4), we only present a microbenchmark targeted toward temporal silence. Furthermore, since speculative lock elision (SLE) is an important piece of related work to temporal silence, we explore the performance potential of MESTI, and compare it to SLE, with this microbenchmark; application benchmarks will be presented in subsequent sections. Of course, a microbenchmark, or even a suite of microbenchmarks, cannot capture all interesting interactions between machines and workloads, but the example we present provides valuable intuition for comparing the two approaches. We focus on the idiom of critical sections and spin-locks as that is the idiom targeted by SLE; it is also an intuitively obvious candidate for exhibiting temporal silence in many programs.

As discussed in Section 5.4.3, SLE can achieve performance benefit from two major sources: additional exposed concurrency by removing artificial dependence on the lock variable (enabling concurrent execution of non-conflicting critical sections) and eliminating misses on the lock variable itself (in the case of non-concurrent execution of critical sections). MESTI, on the other hand, cannot remove artificial dependence on the lock variable; thus it can only achieve benefit from eliminating misses on the lock variable itself. However, MESTI can achieve additional benefit not possible with SLE; namely, temporal silence occurring in contexts outside the idiom SLE attempts to exploit. To our knowledge, SLE has only been targeted toward synchronization primitives. However, empirical data from the commercial benchmarks (Section 5.10.1, principally specjbb, and exhibited to a lesser extent by the other benchmarks) indicates temporal silence occurs outside synchronization primitives.

Therefore, we measure performance sensitivity of a base machine, one implementing SLE, and another implementing MESTI, to three parameters:

χ: Probability of conflicting accesses performed within a single critical section.

θ: Probability of temporal silence through idioms other than synchronization.

δ: Degree of parallelism inherent in the benchmark itself.

χ is a program attribute, describing the probability of conflicting data accesses within the same critical section, intended to indicate successful partitioning of the parallelized problem. θ is also a program attribute, indicating the degree to which temporal silence occurs outside idioms exploitable by SLE.

δ is a more complicated parameter, as it encapsulates both program attributes and machine attributes. The program attribute portion of δ is characterized by the portion of the program which is inherently serial (α), a pure parallel portion with no potential for data conflicts (β), and the degree of explicit parallelization of the algorithm itself (ε). Examples of α are serial initialization phases or mutually-exclusive region accesses, i.e. critical sections; examples of β are purely thread-local computations, e.g. local data structure manipulation; an example of ε is locking granularity. The machine attribute of δ, which we call π, indicates the ability of the underlying hardware to exploit the program attributes α, β, and ε and translate them into improved program performance, i.e. the number of parallel processing elements. In order to clarify the meaning of each parameter, we present partial pseudo-code for our microbenchmark in Figure 6-4 with each component labeled.

    while (number of iterations not completed) {
        cs_num = random critical section to access;   /* ε */
        while (!acquire_lock(cs_num))
            ; /* wait to acquire lock */              /* α */
        read/write_shared_data(cs_num);               /* α */
        release_lock(cs_num);                         /* α */
        perform_local_computation();                  /* β */
        iterations++;                                 /* β */
    }

FIGURE 6-4. Understanding Temporally Silent Program Performance.

Our microbenchmark consists of each thread performing a fixed number of accesses to multiple critical sections, each critical section protecting a set of shared counters. The action within the critical section leads to a single increment of a shared counter (similar to microbenchmarks presented in Rajwar’s thesis [90] illustrating the performance potential of SLE). The application code part of the critical section is part of serial execution (α) and local computation is completely parallel (β). The number of critical sections which can be accessed on each iteration indicates the degree of explicit parallelism inherent in the benchmark (ε).

The machine attribute π indicates the number of processors performing the main computation loop. It should be intuitively obvious that parallel execution of this application is determined by the interplay of all sub-parameters of δ; the π portion simply determines how many critical sections we can attempt to run in parallel at a given time. Hence, varying any sub-parameter of δ can effectively indicate performance scaling in any of the other parameters. To keep the number of simulations tractable, we fix α, β, and π and only vary ε as an effective and simple way to indicate performance sensitivity to δ. Practically speaking, this means the size of each critical section is fixed, the time outside each critical section is fixed, and the number of processors is fixed, with the only variable being the number of critical sections protecting the shared counters. With this understanding, we present complete pseudo-code for our microbenchmark in Figure 6-5. The meaning of each parameter and the additional functions added over Figure 6-4 are indicated in the caption.

Note that the parameters indicating the likelihood of conflicting data accesses in the critical section (χ) and the occurrence of temporal silence (θ) should be considered probabilities on each iteration. When θ is non-zero, the probability of data conflicts to shared counters is reduced by θ so that only a single conflict (to either shared counters or temporally silent data, but not both) occurs on each iteration, preserving the intended meaning of χ.

Performance results for this microbenchmark for varying χ, ε, and θ are shown in Figure 6-6 for varying values of χ in each graph with θ=0%, and in Figure 6-7 for varying values of θ in each graph with χ=θ. Each thread runs 20,000 iterations of the microbenchmark code; 160,000 iterations total across processors.

    /* initialize shared data structures, fork π threads */
    while (number of iterations not completed) {
        cs_num = random() % ε;                           /* ε */
        while (!acquire_lock(cs_num))
            ; /* wait to acquire lock */                 /* α */
        read/write_temporally_silent_data(cs_num, θ);    /* α */
        counter_conflict = (χ − θ);                      /* α */
        read/write_counter_data(cs_num, counter_conflict);
        release_lock(cs_num);
        random_nonuniform_backoff();                     /* β */
        iterations++;                                    /* β */
    }
    /* join all threads */
    master_thread_compute_all_counters_total();          /* α */

FIGURE 6-5. Microbenchmark Pseudo-Code. A read/write data conflict on temporally silent data occurs with probability θ (random, uniformly distributed); a data conflict within the critical section to shared counters occurs with probability (χ−θ) (χ random, uniformly distributed). Random, non-uniform backoff is used outside the critical section to ensure fairness (as discussed in Culler and Singh [31] for simple spin-locks). Empirical evidence (not detailed here) indicates random non-uniform backoff is only required for ε=1, but we use the same backoff parameters across all values of ε. All data structures are appropriately padded to eliminate conflicting accesses to the lock variable, temporally silent data, and shared counters due to coherence granularity. Dynamic critical section size is small enough to fit entirely within in-core buffering for SLE.

FIGURE 6-6. Microbenchmark Execution Time for Varying Values of χ and ε.

Examining Figure 6-6, for χ = 0, we observe that SLE achieves near-constant performance irrespective of ε. This is expected; in the absence of data conflicts SLE enables all processors to execute critical sections concurrently, regardless of the explicit parallelism specified by the programmer. MESTI performs similarly to the base case for small ε. This occurs because MESTI cannot allow concurrent execution of critical sections; all intermediate value and temporally silent stores must be visible. However, we observe that as ε is increased, MESTI continually improves its performance over the baseline, approaching the performance of SLE. Asymptotically, we expect MESTI and SLE to deliver the same performance3 for increasing ε because, in the absence of races for the same critical section, both schemes can successfully eliminate the observed latency for the lock acquire.

This indicates that, for well-tuned applications without data conflicts and with relatively uncontended critical sections, SLE and MESTI can deliver the same performance as long as the temporally silent pairs captured by both schemes are identical.

Examining Figure 6-6 for χ = 1.5%, a low but finite conflict probability, similar trends are observed but are less pronounced. An interesting observation is the reduced performance increase delivered by SLE as compared to the baseline case; with this low conflict probability, the loss in performance is not due to SLE restarts. Rather, because data within the critical section is actually shared, even though SLE can speculate through nearly all critical sections concurrently, actual data sharing slows execution due to cache-to-cache transfers of data within the critical section. MESTI suffers from the same limitation, but again, performance asymptotically approaches SLE.

3. This assumes the impact of additional address traffic in MESTI due to upgrade/validate pairs does not materially impact performance, i.e. a non-bandwidth-constrained environment, only latency constrained.

Similar trends are observed for larger χ (12.5% and 100%), with SLE performance consistently degrading toward baseline performance. For ε = 1 and ε = 2, we observe degradation over baseline performance due to excessive SLE restarts using a restart threshold of 1, as done in Rajwar’s thesis [90].

Note that for ε = 1, baseline performance is actually better than SLE and MESTI under some scenarios, which may be unexpected. This occurs because of interaction between the microbenchmark and the coherence protocol with relatively small β. As explained in Culler and Singh [31], spin locks can be unfair when β is small, as compared to the speed of the address network, because a processor with a valid copy of the data, i.e. the previous lock holder, can repeatedly acquire the single lock successfully, disallowing all other processors access to the critical section. When this occurs, all data protected within the critical section tends to become cache-resident on the processor repeatedly acquiring the lock, allowing it to perform many iterations quickly, as compared to the fair case where data transfers within the critical section are more likely to be cache-to-cache. Non-uniform random backoff improves fairness, at the expense of reducing sensitivity to communication latency, i.e. through an increase of β; since we are interested in accentuating communication performance, we choose a relatively small backoff value (a sketch of the lock and backoff structure follows). SLE and MESTI tend not to exhibit the same unfairness because of their different interactions with the coherence protocol.
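For concreteness, a C11 sketch of the lock and backoff structure follows; the backoff constants are hypothetical, and on PowerPC the test-and-set compiles to a lwarx/stwcx. sequence. The flag must be initialized with ATOMIC_FLAG_INIT before use.

    #include <stdatomic.h>
    #include <stdlib.h>

    void acquire_lock(atomic_flag *lock) {
        while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
            ;  /* spin; each failed attempt may miss to a remote cache */
    }

    void release_lock(atomic_flag *lock) {
        atomic_flag_clear_explicit(lock, memory_order_release);
    }

    /* Randomized, non-uniform delay between critical sections (part of β).
     * It keeps the previous holder from re-acquiring a still-cache-resident
     * lock before any remote requester can win it. */
    void random_nonuniform_backoff(void) {
        volatile int i;
        int spins = 64 + rand() % 256;  /* illustrative constants */
        for (i = 0; i < spins; i++)
            ;  /* burn roughly 'spins' loop iterations */
    }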

In the previous discussion, we never observed a case in which MESTI can actually outperform SLE. Indeed, if all temporal silence occurred via idioms detectable with SLE, provided sufficient machine buffering and the absence of data conflicts, we always expect SLE to outperform MESTI because it avoids making intermediate values and temporally silent values visible through the coherence protocol. However, proper idiom detection may be difficult for some programs, as we will describe in Section 6.6.5; furthermore, empirical evidence indicates significant temporal silence can occur outside of memory locations touched with architected synchronization primitives (Section 5.10.1). Therefore, we present performance sensitivity of SLE and MESTI with a varying degree of temporal silence occurring within the critical section which cannot be detected or exploited via SLE. The results are shown in Figure 6-7.

FIGURE 6-7. Microbenchmark Execution Time for Varying Values of θ and ε.

Similar trends (as a function of ε) are observed as in Figure 6-6. Note that SLE performance in all cases is similar to Figure 6-6 for equivalent χ values; with SLE the temporally silent data conflicts are not detected and thus behave similarly to shared counter updates. As temporally silent data conflicts (θ) increase, MESTI can exploit both temporal silence of the lock variable as well as data within the critical section. Its performance continues improving compared to SLE, essentially equalling or surpassing it for ε = 16 and ε = 32 for θ = 25%, and surpassing it substantially at all data points for θ = 100%.

Note that a similar performance differential can develop between SLE and MESTI not only due to temporal silence outside SLE’s idiom, as indicated with θ, but also in cases where temporal silence occurs beyond the reach of speculation due to insufficient buffering, barriers to speculation, or temporal silence occurring across multiple critical sections.

As shown in Section 5.5.3, temporally silent pair distance can be substantial, especially in commercial workloads, implying greater difficulty in collapsing temporally silent pairs atomically. Also, as discussed in Section 5.4.3, longer critical sections have higher data conflict probability due to actual conflicts and also perceived conflicts through the coherence protocol, due either to false sharing or exclusive prefetching schemes. As critical section length increases, conflict probability increases, lost opportunity due to speculative execution increases, and restart penalty may also increase, tending to degrade SLE performance. MESTI provides better algorithmic stability in the face of such issues. We will return to this discussion when examining application benchmarks.

6.6 Application Benchmarks: Temporal Silence

As illustrated in Section 5.2, exploiting temporal silence with the MESTI protocol provides an additional avenue for eliminating remote communication misses. We now present performance data for our implementation of the MESTI protocol in PHARMsim and compare its performance with load value prediction as well as SLE.

6.6.1 Basic MESTI Implementation

We present performance data for the basic MESTI protocol with Snoop-Aware Validate policy (Section 5.7) in Figure 6-8. Recall that this policy broadcasts a validate at each occurrence of temporal silence to cache lines which were ever shared between processors in the system.
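For concreteness, the following sketch shows the decision made at store commit under this policy; the data structures and helper function are hypothetical and do not correspond to PHARMsim source.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the Snoop-Aware Validate policy; field and
 * function names are illustrative, not simulator code. */

#define LINE_BYTES 64

typedef struct {
    uint8_t data[LINE_BYTES];   /* current (dirty) line contents      */
    uint8_t stale[LINE_BYTES];  /* version remote processors last saw */
    int     dirty;              /* modified since the stale version   */
    int     ever_shared;        /* another processor ever cached it   */
} cache_line_t;

extern void broadcast_validate(uint64_t line_addr); /* address-only txn */

void on_store_commit(cache_line_t *line, uint64_t line_addr)
{
    if (!line->dirty || !line->ever_shared)
        return; /* snoop-aware: never-shared lines get no validate */

    if (memcmp(line->data, line->stale, LINE_BYTES) == 0) {
        /* Temporal silence: the line has reverted to the version
         * remote processors last observed, so their stale copies can
         * be revalidated without any data transfer. */
        broadcast_validate(line_addr);
    }
}
```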

FIGURE 6-8. Performance of MESTI Protocol Exploiting Temporal Silence. (4-processor) Performance improvement normalized to the baseline case for the basic MESTI protocol exploiting update silence and temporal silence is indicated. Determinism-delay and standard deviation for benchmarks simulated deterministically/statistically are also shown.

We see significant speedups in statistically simulated ocean and tpc-h. Slowdowns are reported in all other workloads, with those measured for deterministically simulated ocean, specjbb, and specweb being significant. In ocean, it is reasonable to conclude that most of the reported slowdown is actually due to intrinsic determinism-delay; a similar conclusion is also likely, but as explained in Section 6.4, unverifiable, for specweb. Because determinism-delay in specjbb for both simulations is substantial, the result is largely inconclusive.

As discussed in Section 5.7, a potential performance issue with the basic MESTI protocol is the large percentage of anti-critical temporally silent stores. Since many temporally silent writes are not last writes, many validate/upgrade pairs may occur without preventing any remote sharing misses. This can degrade performance due to both latency induced by coherence transitions on the local processor and also additional coherence traffic in the system. We indicate the magnitude of this effect in our detailed simulation environment in Figure 6-9.

FIGURE 6-9. Address Transactions Observed with MESTI Protocol. (4-processor) Results normalized to baseline case for USS-Hit-VM (Section 6.4), the basic MESTI protocol, and the ideal TSS case are indicated. Address transaction rate per committed instruction is also shown. Note that validate and upgrade transactions cannot be presented for TSS since it is only a hypothetical result indicating maximal data transfer reduction.

Mirroring results presented in Section 5.7, we observe that the simple Snoop-Aware Validate policy increases address transactions over the baseline case significantly, ranging from 5% to 59%. Most noteworthy is the contribution of useless validates, which is determined by comparing the height of the Read/ReadX plus Validate components for MESTI to the Read/ReadX component of the baseline.4 For all benchmarks, we observe a substantial contribution to address transactions from useless validates. Furthermore, the substantial increase in upgrade transactions over the baseline case further indicates many of these useless validates are also anti-critical. However, in many cases (see Figure 6-8), the reduction in communication misses offsets these effects. In our machine model, address bandwidth is relatively plentiful, therefore most performance degradation comes from additional memory latency observed by the processor core for additional coherence state transitions; in more bandwidth-constrained environments the negative performance impact will likely be more substantial.

4. This comparison does not precisely determine the number of useless validates because a single validate can prevent multiple communication misses, as discussed in Section 5.7. However, it serves as a useful first-order approximation. We call a validate useless if broadcasting it does not prevent a remote miss.

As an orthogonal issue, we can use this data to provide an indication of whether our initial studies conducted in Chapter 5 under a different machine model reliably indicated the behavior of MESTI in the detailed simulation environment. Examining the data traffic reductions in Figure 6-9 (the Read/ReadX component), we observe similar reductions in data traffic as those presented in Figure 5-1 on page 109 and Figure 5-7 on page 120 for TSS and MESTI, respectively. This indicates other studies performed in Chapter 5 are likely applicable to modern, out-of-order processor cores as well.

6.6.2 Enhanced MESTI Implementation

As discussed in Section 5.7 and briefly revisited in the previous section, a substantial fraction of useless and anti-critical validates can be eliminated by using critical silence prediction. Using prediction, confidence in the usefulness of validates can be achieved, thus minimizing excess validates at the cost of sacrificing some data transfer elimination possible with the basic MESTI implementation. We discussed a detailed design for our critical silence predictor in Section 5.7.3. Initial studies in that section indicated that the 3-4-1-1-7 configuration provided a good trade-off, eliminating useless validates without sacrificing substantial opportunity; thus we only explore this configuration in detail within the performance model. Our enhanced MESTI implementation includes the 3-4-1-1-7 critical silence predictor and a 32-cycle delay queue (Section 5.7.1). We have also added an additional enhancement of the internal coherence protocol between the L1 and L2 caches; when validate broadcasts are removed by the critical silence predictor, exclusive access permission is returned to the L1 cache, thus improving store commit latency. Performance results for this enhanced MESTI protocol are shown in Figure 6-10.
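As an illustration, the following sketch shows one way the delay queue and predictor could gate validate broadcasts; the queue depth, interfaces, and names are invented for exposition (the 3-4-1-1-7 parameters configure the predictor of Section 5.7.3 and are not modeled here), so this is a sketch under stated assumptions rather than the simulator’s implementation.

```c
#include <stdint.h>

/* Hypothetical sketch of the enhanced MESTI validate path: detected
 * temporal silence enters a fixed-delay queue, and the critical
 * silence predictor decides whether a validate is actually sent. */

#define VALIDATE_DELAY 32 /* cycles (Section 5.7.1) */
#define QUEUE_DEPTH    8  /* illustrative choice    */

typedef struct {
    uint64_t line_addr;
    uint64_t ready_cycle;
    int      valid;
} pending_validate_t;

static pending_validate_t queue[QUEUE_DEPTH];

extern int  predict_critical(uint64_t line_addr); /* critical silence predictor */
extern void broadcast_validate(uint64_t line_addr);

void on_temporal_silence(uint64_t line_addr, uint64_t now)
{
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        if (!queue[i].valid) {
            queue[i] = (pending_validate_t){ line_addr,
                                             now + VALIDATE_DELAY, 1 };
            return;
        }
    }
    /* queue full: drop the validate (lost opportunity, not an error) */
}

void on_local_store(uint64_t line_addr)
{
    /* The line was re-dirtied before the delay expired: the pending
     * validate was anti-critical, so squash it. */
    for (int i = 0; i < QUEUE_DEPTH; i++)
        if (queue[i].valid && queue[i].line_addr == line_addr)
            queue[i].valid = 0;
}

void each_cycle(uint64_t now)
{
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        if (queue[i].valid && now >= queue[i].ready_cycle) {
            queue[i].valid = 0;
            /* Only lines predicted critically silent are broadcast;
             * for the rest, exclusive permission stays in the L1. */
            if (predict_critical(queue[i].line_addr))
                broadcast_validate(queue[i].line_addr);
        }
    }
}
```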


FIGURE 6-10. Performance of Enhanced MESTI Protocol. (4-processor) Performance improvement normalized to the baseline case for an enhanced MESTI protocol exploiting critical silence through coherence prediction is indicated. Determinism-delay and 90% confidence intervals for benchmarks simulated deterministically or statistically are also shown.

Performance improvements over the basic MESTI implementation are observed for all benchmarks except ocean simulated statistically, indicating the benefit of eliminating useless validates outweighs the lost opportunity in communication reduction. In the case of specweb and specjbb, we observe substantial intrinsic determinism-stall, implying again that substantial performance potential may be lost due to deterministic simulation, as discussed in previous sections. However, measurable speedup is still indicated for both benchmarks. In the case of ocean, most lost performance opportunity occurs due to sacrificed benefit from update silent stores, as opposed to temporally silent stores, due to conservatism of the predictor; we discussed this in detail in Section 6.4. For this particular benchmark, a hybrid scheme bypassing the predictor for update silent store misses but using it for temporal silence might gain back the lost performance potential. However, we also observed in Section 6.4 that tpc-h benefits substantially from critical update silent store miss prediction; therefore, such a configuration may not prove beneficial in general.

We do not explore such a hybrid scheme for the sake of brevity.

Figure 6-11 shows address transactions observed with critical silence prediction, compared to the baseline case and also ideal data transactions required with perfect TSS. Comparing to the basic MESTI protocol (Figure 6-9), we observe a substantial reduction in useless validates; all benchmarks show less than a 5% increase against the baseline case. This data also shows, by examining the Read/ReadX component, the reduction in communication miss opportunity incurred with the prediction mechanism. These results closely mirror results from characterization studies in predictor design presented in Section 5.7.3. For example, focusing on the commercial workloads, we observed previously that specweb was least amenable to critical silence prediction because it sacrificed the most opportunity for communication reduction; the same result is indicated in Figure 6-11. Similar conclusions hold for other benchmarks, again showing that previous characterization studies are closely mirrored within the detailed simulation environment in spite of significant machine model differences.

FIGURE 6-11. Address Transactions Observed with Enhanced MESTI Protocol. (4-processor) Results normalized to baseline case for MESTI-TSPred-34117 (Section 5.7.3) and the ideal TSS case are indicated. Address transaction rate per committed instruction is also shown. Note that validate and upgrade transactions cannot be presented for TSS since it is only a hypothetical result indicating maximal data transfer reduction.

6.6.3 Comparison with Load Value Prediction (LVP)

In Section 5.4.4, we discussed using load value prediction (LVP) with tag-match invalid cache lines as an avenue for exploiting TSS. This mechanism improves upon MESTI and SLE in that it can capture all TSS misses, all TSS false sharing misses, and a subset of TSS true sharing misses, as explained in Section 5.4.4. However, a caveat of this approach is its inability to avoid data transfers for data exhibiting temporal silence; since tag-match invalid data is used as value predictions, data must be transferred in order for predictions to be verified. Therefore, we expect no data traffic reductions as compared to the baseline scheme for this approach. In contrast, MESTI and SLE can exploit many cases of TSS and also avoid data transfer. This is achieved via explicit validates in MESTI and atomic temporally silent store pair elision in SLE.

Additionally, because LVP exploits TSS at the consumer of the data, as opposed to the producer for MESTI and SLE, the latency incurred to verify the prediction is partially exposed to the consumer. In the case of successful LVP, if no additional instruction-level parallelism (ILP) or memory-level parallelism (MLP) is exposed through correct early value delivery, we expect no performance benefit to result; the machine will simply stall waiting for data transfer to verify the prediction within the speculative execution window, similar to the base case where the load is a cache miss. In the case of unsuccessful LVP, i.e. value misprediction, two performance penalties can result: additional wrong-path memory references leading to cache pollution and increased address traffic, and execution penalties due to mispredicted speculative state recovery. In the base case without value prediction, these penalties are not present because the execution window will stall waiting for data to be delivered by the memory system before propagating it further in execution.

We implement machine-squash recovery for value misprediction to assure coherence and the load/store queue snooping approach described by Martin et al. [76] to assure memory consistency in machines implementing value prediction. Furthermore, the machine can exploit tag-match invalid cache lines which hit in any level of on-chip cache: L0-D, L1-D, or L2. In the case of lower level misses, e.g. L0-D and L1-D miss with L2 hit, data transfer of invalid data is modeled over all interfaces. An additional transfer occurs to the lowest level, L0-D, through all upper levels when valid data arrives from the memory system. This effect increases on-chip memory traffic even in the case of successful LVP, but would likely be required in real implementations. Performance data for this approach is presented in Figure 6-12. Note that all scientific workloads are simulated statistically in these results, in contrast to previous sections, since substantial determinism-stall was observed for LVP. Specweb and specjbb are still simulated deterministically because they do not run for fixed transaction counts.
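The following C sketch illustrates the mechanism at a high level; the types and helper functions are hypothetical, and the real implementation operates within the out-of-order core rather than as simple function calls.

```c
#include <stdint.h>

/* Hypothetical sketch of load value prediction with tag-match invalid
 * cache lines; types and helpers are illustrative, not simulator code. */

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } state_t;

typedef struct {
    uint64_t tag;
    state_t  state;
    uint8_t  data[64]; /* stale contents survive invalidation */
} line_t;

extern line_t  *lookup(uint64_t addr);           /* tag match, any state */
extern uint64_t read_word(const line_t *l, uint64_t addr);
extern void     issue_read(uint64_t addr);       /* fetch + verify       */
extern uint64_t stall_for_data(uint64_t addr);
extern void     machine_squash(void);            /* flush and refetch    */

/* At issue: a load that hits a tag-match invalid line consumes the
 * stale data speculatively while a Read is launched to verify it. */
uint64_t load_issue(uint64_t addr)
{
    line_t *l = lookup(addr);
    if (l != NULL && l->state == INVALID) {
        issue_read(addr);          /* data must still transfer */
        return read_word(l, addr); /* speculative value        */
    }
    /* ordinary hit or miss path */
    return (l != NULL) ? read_word(l, addr) : stall_for_data(addr);
}

/* When the verifying data returns: a mismatch squashes the machine,
 * which is why misprediction adds recovery and wrong-path penalties. */
void verify_lvp(uint64_t predicted, uint64_t actual)
{
    if (predicted != actual)
        machine_squash();
}
```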

FIGURE 6-12. Performance of Load Value Prediction (LVP). (4-processor) Performance improvement normalized to the baseline case for load value prediction with invalid cache line data (LVP) and the enhanced MESTI protocol exploiting critical silence through coherence prediction is indicated. Determinism-delay and 90% confidence intervals for benchmarks simulated deterministically or statistically are also shown.

We observe measurable performance improvements in ocean, radiosity, and tpc-h. In ocean and radiosity, performance slightly surpasses our enhanced MESTI implementation; however, tpc-h shows a large performance gap between LVP and enhanced MESTI. In the other workloads, slight performance improvements are indicated, none of which exceeds the performance of enhanced MESTI. As discussed, although the candidate set of references which can be exploited by LVP is larger than that for MESTI, miss latency may still be partially exposed. Even with an aggressive machine configuration (256 entry RUU, see Table 6-1), the long miss latency cannot be entirely hidden. Note that simply increasing the instruction window further may not improve performance; other cache misses (not to tag-match invalid lines, or tag-match invalid but unsuccessful LVP), other execution-serializing events (pipeline draining conditions), or memory consistency-related squashes may also cause execution to stall before additional ILP/MLP can be exposed. Other speculation conditions, such as branch misprediction, may not cause execution to stall because they can be correctly resolved with tag-match invalid data as long as the value provided is correct.

As described, a potential performance caveat with LVP is misspeculation recovery. We use a conservative, machine-squash approach common in many microarchitectures, as that is the only mechanism available in the current performance simulator. Some modern microarchitectures, e.g. Pentium4 [37], implement selective recovery mechanisms, which may reduce the performance penalty due to LVP-related recovery.5 In experiments not detailed here for brevity, we explored an unimplementable, near-perfect LVP mechanism which virtually eliminated LVP-related squashes. Performance results of this study did not differ materially from those presented here, indicating that machine-squash recovery was not materially degrading LVP’s performance.

We now examine the impact of LVP on memory-system traffic. Figure 6-13 indicates address transactions observed with LVP as well as baseline, enhanced MESTI, and ideal TSS results for comparison. In all cases except tpc-h, we observe an increase in both data requests (Read/ReadX) and address-only (Upgrade) transactions for LVP as compared to the baseline. As discussed, we expect traffic for LVP to be similar to the baseline case; the additional transactions are most likely caused by additional wrong execution-path effects. The performance results indicate that these additional transactions are being successfully translated into performance within the processor core through improving ILP and MLP. The slight decrease in transactions for tpc-h is unexpected, but can easily be accounted for via improvement in other second-order factors relating to speculative execution, since performance increases in this benchmark.

5. Actually, the selective recovery mechanisms in Pentium4 and similar designs do not directly support value speculation, as they are designed for speculative scheduling-related recovery [12] and assume fixed-latency misspeculation events, e.g. cache hit/miss speculation. Implementing general selective recovery mechanisms which support arbitrary-latency misspeculation events is an orthogonal issue to this thesis.

FIGURE 6-13. Address Transactions Observed with Load Value Prediction (LVP). (4-processor) Results normalized to baseline case for load value prediction with invalid cache line data (LVP), MESTI-TSPred-34117 (Section 5.7.3), and the ideal TSS case are indicated. Address transaction rate per committed instruction is also shown. Note that validate and upgrade transactions cannot be presented for TSS since it is only a hypothetical result indicating maximal data transfer reduction.

6.6.4 Combining Enhanced MESTI and Load Value Prediction (LVP)

We showed in Section 6.6.3 that while the candidate set of sharing events which LVP can capture is the greatest of all methods we discuss (TSS misses, false sharing misses, and some true sharing misses), exposing miss latency to the processor core leads to reduced performance as compared to the non-speculative MESTI protocol. In this section, we explore performance of the enhanced MESTI protocol (Section 6.6.2) combined with LVP (Section 6.6.3), referring to this as MESTI+LVP. Since the sets of sharing events captured by the two methods are complementary in many cases, we expect a performance increase when combining the two methods over each used individually. We will verify this intuition in short order.

Figure 6-14 shows the performance of MESTI+LVP, as well as baseline, LVP, and enhanced MESTI performance for comparison. Note that all scientific workloads are simulated statistically in these results, as in Section 6.6.3, since substantial determinism-stall was observed for LVP. Specweb and specjbb are still simulated deterministically because they do not run for fixed transaction counts.

FIGURE 6-14. Performance of Enhanced MESTI + LVP. (4-processor) Performance improvement normalized to the baseline case for load value prediction with invalid cache line data (LVP), the enhanced MESTI protocol, and both methods combined is indicated. Determinism-delay and 90% confidence intervals for benchmarks simulated deterministically or statistically are also shown.

We observe performance increases for MESTI+LVP in all cases over either LVP or enhanced MESTI in isolation, as described previously. In ocean and tpc-h, the total speedup over the baseline is substantial (16%). In all other cases, non-negligible performance increases from 2-5% are shown.

More interestingly, in all benchmarks the performance improvement from MESTI and LVP in isolation is almost completely additive. For example, in tpc-h, speedup from LVP alone is 3.5% and enhanced MESTI alone is 11.5%; MESTI+LVP speedup is approximately 15%. This indicates that MESTI and LVP are achieving performance benefit from a disjoint set of sharing misses, although there is substantial overlap in the candidate set of sharing misses they can capture (TSS misses). We have shown previously that MESTI eliminates most TSS misses in isolation, leading to the conclusion that LVP is not achieving substantial performance benefit from these misses. Rather, LVP is utilizing both false sharing and true sharing misses and successfully exposing additional ILP/MLP for only these references, a slightly unexpected phenomenon.

A possible explanation lies in the nature of some TSS sharing patterns. Consider the case of lock variables. For a well-tuned program, locks are uncontended; however, miss latency is incurred on each lock transfer. If we assume the common test-and-set type lock, this latency comes from two events: the initial Read access to determine whether the lock is held and a subsequent Upgrade to perform lock acquisition. MESTI can completely eliminate the Read latency through the validate transaction, thus accelerating the lock transfer.

However, LVP cannot improve lock transfer performance, even in the case of successful LVP. As soon as tag-match invalid data is accessed, the processor returns the data to the core but also issues a Read to verify the prediction. The processor uses the tag-match invalid data to ensure branch predictions/etc. are resolved indicating the lock should be acquired; however, since the LVP has not been verified, the acquiring store never reaches the commit stage, and the Upgrade is not performed. Once LVP is verified, the load commits and then the store commits, leading to the Upgrade. However, it should be apparent that no latency associated with the lock was hidden.6 Note further that resolving branch predictions/etc. early with LVP is likely not effective in exposing additional ILP/MLP; in the case of uncontended locks, the branch predictor will successfully predict the lock-acquire program path.

However, since not all TSS is due to lock variables (Section 5.10.4), this explanation is insufficient to cover all scenarios. Due to the complex interactions taking place within the processor core with LVP, providing additional insight is difficult.

FIGURE 6-15. Address Transactions Observed with Enhanced MESTI + LVP. (4-processor) Results normalized to baseline case for LVP, MESTI-TSPred-34117 (Section 5.7.3), MESTI+LVP, and the ideal TSS case are indicated. Address transaction rate per committed instruction is also shown. Note that validate and upgrade transactions cannot be presented for TSS since it is only a hypothetical result indicating maximal data transfer reduction.

Figure 6-15 indicates address transactions observed with MESTI+LVP as well as baseline, LVP, enhanced MESTI, and ideal TSS results for comparison. No clear trend emerges from the data; since enhanced MESTI tends to decrease data transactions (i.e. Read/ReadX) at the expense of additional address-only transactions and LVP tends to increase data transactions, these opposing forces lead to no clear result. In most cases, MESTI+LVP has fewer data transactions as compared to LVP, with address-only transactions, i.e. Upgrade and Validate, being nearly equal to enhanced MESTI. Note that even with both methods combined, address traffic does not increase more than 2-3% over the base case.

6. Exclusive prefetching within the core for the store will likely be ineffective at improving this; since the Read transaction is launched as soon as tag-match invalid data is accessed, the cache line will be in a pending coherence state and any other requests, e.g. an exclusive prefetch, will be NACKed until data is returned. Store write-fault predictors allowing LVP verification to also prefetch exclusive permission, or elaborate coherence mechanisms allowing the Upgrade before data is returned, might be useful; we do not explore such mechanisms for brevity.

6.6.5 Comparison with Speculative Lock Elision

In Section 5.4.3 we discussed Speculative Lock Elision (SLE) as an additional method to exploit temporal silence. A detailed discussion of the ability of SLE and MESTI to capture temporal silence based on different program attributes can be found in our microbenchmark studies in Section 6.5.

We implement in-core buffering to form speculative atomic regions; up to half of the ROB can be used for region creation to avoid sacrificing too much ILP when a proper idiom for SLE cannot be found. This is a change from the performance results presented by Rajwar et al. [93, 90], where explicit store buffers and register checkpoints are used to create atomic regions. A detailed discussion of trade-offs between the two approaches can be found in Rajwar’s thesis [90]. Practically speaking, as long as critical sections can fit within the speculative buffering provided, any performance difference observed is due to second-order effects and not the SLE algorithm itself.

Furthermore, our implementation does not allow nested critical sections, as in the original SLE work [93]; in Rajwar’s thesis [90] SLE was augmented to allow properly nested critical sections. Communication with the author of these works indicated that allowing nesting was not particularly beneficial for SLE [91]. Therefore, since handling nested critical sections with in-core buffering presents non-trivial engineering complication, we did not implement it. We have shown in Section 6.5 that our SLE implementation faithfully captures the critical section idiom used in our microbenchmarks; the same idiom is used in our scientific workloads in the macros for the SPLASH-2 suite [107].
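To make the detection task concrete, the following simplified sketch shows how an elision candidate might be recognized from the load-locked/store-conditional idiom and matched with its temporally silent release; the names are illustrative, and the sketch omits the region buffering, conflict detection, and restart handling of our actual implementation.

```c
#include <stdint.h>

/* Illustrative, simplified sketch of elision-candidate detection. */

typedef struct {
    uint64_t lock_addr;
    uint64_t free_value; /* value loaded by the load-locked */
    int      active;
} elision_t;

/* A load-locked/store-conditional pair writing a new value looks like
 * a lock acquire.  The idiom is imprecise: the AIX kernel uses the
 * same sequence for constructs that are not locks (Section 5.4.3),
 * which is the source of ocean's false positives. */
void on_store_conditional(elision_t *e, uint64_t addr,
                          uint64_t loaded, uint64_t stored)
{
    if (stored != loaded) {
        e->lock_addr  = addr;
        e->free_value = loaded;
        e->active     = 1;
        /* elide the store; begin a speculative atomic region */
    }
}

/* A later ordinary store restoring the originally loaded value looks
 * like the matching release; the two stores form a temporally silent
 * pair, and both are elided when the region commits. */
int on_store(elision_t *e, uint64_t addr, uint64_t value)
{
    if (e->active && addr == e->lock_addr && value == e->free_value) {
        e->active = 0;
        return 1; /* commit the elided region */
    }
    return 0;
}
```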

Finally, barriers to speculation in PHARMsim include: instruction and data TLB misses (speculative fills are not supported), improper alignment between loads and stores within the LSQ (the datapath cannot forward data between improperly aligned memory references), context synchronizing instructions (such as PowerPC isync), and reads of un-renamed registers (such as the machine status register). Some of these, e.g. load and store alignment and context synchronization, are discussed in Rajwar’s thesis [90], but their impact is not quantified. In general, none of these limitations has proven consequential in practice; we note exceptions when necessary.

We show performance results for SLE compared to our enhanced MESTI implementation in Figure 6-16. All workloads are simulated statistically to enable concurrent critical section execution and performance measurement free from determinism-related effects. We do not present results for commercial workloads due to significant engineering challenges encountered in SLE implementation within PHARMsim for these workloads.

FIGURE 6-16. Performance of Speculative Lock Elision (SLE). (4-processor) Performance improvement normalized to the baseline case for speculative lock elision (SLE) and the enhanced MESTI protocol is shown. 90% confidence intervals for benchmarks simulated statistically are also shown.

We see that SLE achieves performance benefit in barnes, radiosity, and raytrace, with the speedup in raytrace being substantial. Comparing SLE performance to enhanced MESTI, barnes and radiosity show essentially equivalent performance; SLE and MESTI effectively remove communication latency for temporally silent sharing patterns in these workloads. This indicates that critical sections are relatively uncontended in these workloads. Raytrace shows significant performance improvement for SLE over enhanced MESTI, indicating additional exposed concurrency through removing artificial dependences on the lock variable. This is discussed in detail in Section 6.5 with microbenchmark studies. Ocean shows significant performance degradation over both the baseline configuration and enhanced MESTI. Detailed examination of ocean revealed that excessive restarts and critical section conflicts are not the cause of slowdown, although they are a contributing factor. Rather, imprecision of the elision idiom is the primary culprit.

As discussed at length in Section 5.4.3, the load-locked/store-conditional pair which signals elision candidates may occur for many programming constructs in addition to spin-locks. In the case of barnes, radiosity, and raytrace, previous studies (Section 5.10.4) indicated most temporal silence occurs within application code, which conforms to the SLE idiom. However, ocean was a notable exception, exhibiting most temporal silence within the AIX kernel. Load-locked/store-conditional pairs occur frequently within the kernel for many purposes and therefore are not always amenable to elision.

Imprecision of the idiom leads to many false positives for elision candidates, causing speculative execution which does not lead to successful elision and therefore slows execution. Although our implementation attempts to recover from imprecise idioms as quickly as possible without performing a machine squash, and is highly successful at doing so in ocean, stalling memory operations within the processor core to attempt atomic region creation inherently slows program execution. Improving the idiom or adding explicit instrumentation for lock acquires may mitigate this problem; we do not study such improvements in this thesis. In contrast, we note that MESTI performance is robust in the face of such issues, showing significant speedup in ocean.

In Figure 6-17, we show observed coherence transactions for the baseline, SLE, and enhanced MESTI. In all cases where SLE achieves performance improvement (barnes, radiosity, and raytrace), coherence transactions are successfully reduced, most substantially in raytrace. In ocean, a slight increase is observed, indicating both SLE restarts and exclusive prefetches needed for atomic region creation may be slightly degrading performance [90]. However, as discussed, coherence interference is not the primary factor in the slowdown observed for ocean in Figure 6-16. Finally, note that although coherence transactions can be substantially reduced with SLE, a significant performance benefit does not come from this alone due to the low transaction rate in the baseline simulation; coherence bandwidth is sufficiently plentiful in our machine configuration for these workloads.

FIGURE 6-17. Address Transactions Observed with SLE. (4-processor) Results normalized to baseline case for speculative lock elision (SLE) and MESTI-TSPred-34117 (Section 5.7.3) are indicated. Address transaction rate per committed instruction is also shown.

6.7 Summary of Detailed Performance Evaluation

We introduced the PHARMsim simulation environment, which allows simulation of both scientific and commercial workloads in a full-system, execution-driven environment with a detailed out-of-order processor model.

We used PHARMsim to present performance data for promising methods of exploiting store value locality, showing that exploiting update silent sharing (USS) can enable tangible performance benefit using the MESTI protocol. Substantial additional performance was accrued by eliminating temporal silent sharing (TSS) through the MESTI protocol. Adding simple methods of coherence prediction improved performance in both scientific and commercial workloads, showing the robustness of the MESTI protocol and the coherence prediction mechanism.

We also compared the non-speculative MESTI protocol against two speculative methods for taking advantage of TSS: speculative lock elision (SLE) and load value prediction with tag-match invalid cache lines (LVP). We compared the MESTI protocol with SLE in a detailed microbenchmark study, showing the performance potential of MESTI and SLE under ideal conditions, describing important program and machine parameters affecting performance of both approaches. Application benchmarks showed that all three approaches (MESTI, SLE, and LVP) can achieve tangible performance improvement, with detailed discussion of the strengths and weaknesses of each approach in different environments.

Chapter 7

Conclusion

This thesis shows that significant store value locality exists in programs, spanning single-threaded and multi-threaded programming paradigms, instruction set architectures, and compilation environments. Store value locality pervades program execution, and we can utilize it to improve uniprocessor and multiprocessor performance. We summarize contributions of the thesis and key results in Section 7.1 and discuss future research directions in Section 7.2.

7.1 Contributions and Summary of Results

This thesis demonstrates that significant store value locality exists and can be utilized to improve memory system performance in uniprocessor and multiprocessor systems. We define silent stores as dynamic memory writes which exhibit exploitable store value locality. In this regard, we make two core contributions. We describe update silent stores, dynamic memory writes which contribute no change to system state because they write the value already existing at the memory location. We then expand the notion of store silence to temporally silent stores, which revert the system state to a value observed previously.

7.1.1 Store Value Locality in Uniprocessors

Update silent stores can improve memory system performance in uniprocessors by increasing value transfer bandwidth and thereby reducing latency within the microprocessor core. We show that core efficiency can be improved and both on-chip and off-chip memory traffic can be effectively reduced via update silent store suppression, i.e. avoiding memory writes for update silent stores. Core efficiency is improved by reducing pressure on microarchitectural structures, such as write buffers and cache ports. In write-through memory hierarchies, write-through memory traffic is reduced substantially; in writeback memory hierarchies, many dirty castouts can be eliminated. We describe efficient methods of update silent store suppression, exploiting behavior of modern microprocessor cores and also ECC logic structure to enable suppression at very low cost.
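In its simplest form, suppression converts a store into a load, a comparison, and a conditional write; the following minimal functional sketch uses illustrative names and ignores the pipelined and ECC-based variants described above.

```c
#include <stdint.h>
#include <string.h>

/* Minimal functional sketch of update silent store suppression:
 * read the existing bytes, compare, and write only on a mismatch.
 * Real implementations fold the read into an existing cache access
 * or reuse ECC logic for the comparison. */
int store_with_suppression(uint8_t *memory, uint64_t addr,
                           const uint8_t *value, size_t len)
{
    if (memcmp(&memory[addr], value, len) == 0)
        return 1; /* update silent: no write, no dirty bit, no castout */

    memcpy(&memory[addr], value, len); /* ordinary store completes */
    return 0;
}
```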

We observed that the majority of performance improvement in microprocessor cores comes from reducing contention on existing structures for write handling (cache write ports, write buffers, etc.). Some fundamental ILP improvement is possible by eliminating true store-to-load dependences through memory on update silent stores; however, this performance benefit is not studied in detail in this thesis as it was first proposed by others [111]. For core designs with substantial write handling bandwidth at all levels within the memory hierarchy, we expect relatively little benefit. If update silent stores are exploited to remove true store-to-load dependences, we expect benefit to be proportional to the fraction of such dependences. Prior authors have shown strong variability of such dependences with instruction set architecture (mostly correlated to the number of architected registers) and also instruction window size [111, 85]; therefore, potential benefit from this optimization will be strongly correlated to these aspects.

7.1.2 Update Silence in Multiprocessors

We make a fundamental contribution in defining update silent sharing (USS) to describe communication misses which are unnecessary due to update silent stores. We show that simple mechanisms considering USS can non-trivially reduce address and data transactions in multiprocessors through eliminating communication misses, writebacks, and invalidation messages. We show that naive update silent store suppression can be detrimental in multiprocessors when naively handling pure store misses with a Read transaction (to determine update silence) followed by an Upgrade for those which are not update silent. We define critical update silence to rigorously describe the phenomenon, and show simple mechanisms which enable effective critical update silence prediction.
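As an illustration of the trade-off, the following sketch contrasts naive and predicted handling of a pure store miss; the predictor interface and transaction helpers are hypothetical names introduced for exposition.

```c
#include <stdint.h>

/* Hypothetical sketch of store-miss handling with a critical update
 * silence predictor.  Naive suppression always issues a Read to test
 * for silence and pays a second (Upgrade) transaction when the store
 * turns out not to be silent; prediction restores the single ReadX
 * for stores predicted non-silent. */

extern int  predict_update_silent(uint64_t addr);
extern void issue_read(uint64_t addr);  /* data, no write permission  */
extern void issue_readx(uint64_t addr); /* data + exclusive ownership */

void on_pure_store_miss(uint64_t addr)
{
    if (predict_update_silent(addr)) {
        issue_read(addr);
        /* if the store is not silent after all, an Upgrade follows,
         * costing a second address transaction */
    } else {
        issue_readx(addr); /* single transaction, as in the baseline */
    }
}
```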

Detailed performance studies with execution-driven, full-system simulation show USS enables tangible performance benefit in many programs. However, the set of communication misses which can be eliminated with USS is not substantial across all workloads we study, although many update silent stores (greater than 40% of dynamic stores) occur in virtually all workloads. This indicates that even though update silence pervades program execution, its participation in sharing misses (USS) is more algorithm-specific.

For those programs which benefit from exploiting USS, we perform a thorough investigation of implementation issues in modern architectures, effectively reducing our research to practice. We show, using the MESTI protocol, that tangible performance improvement can be obtained in workloads exhibiting USS while maintaining or slightly improving performance in other workloads. Of course, since one of the goals of USS is to eliminate communication misses and their associated latency to improve performance, the performance benefit is directly proportional to remote communication latency. Therefore, in systems where a large differential exists between local and remote memory accesses, we expect greater benefit. In SMPs, the differential between local and remote accesses is large, and is expected to continue growing due to the memory-gap [46]; in future systems (such as CMPs) the differential may be reduced. However, the fundamental contribution of eliminating unnecessary communication in multiprocessors exists across all topologies.

7.1.3 Temporal Silence in Multiprocessors

We make a fundamental contribution in defining temporal silent sharing (TSS) to describe communication misses which return a value exactly matching the value previously observed by remote processors. These misses communicate no information affecting execution at remote processors since the memory value has reverted to the value last observed by the remote processor. We show that approximately 40% of communication misses in commercial workloads are TSS. We provide extensive characterization data to aid in designing efficient, general mechanisms to exploit TSS not focused on any particular programming idiom. Additional characterization reveals numerous aspects of temporally silent program behavior, indicating that a substantial fraction of TSS occurs outside detectable synchronization primitives. This indicates that general mechanisms not focused on any particular programming idiom, such as spin-locks or critical sections, can be beneficial.

We show that a general, non-speculative coherence protocol enhancement (MESTI) can capture a majority of TSS. We describe efficient engineering solutions to both detect and communicate TSS in current memory hierarchies and coherent interconnects. We introduce a speculative method which can also capture significant TSS: load value prediction with tag-match invalid cache lines (LVP). We compare these approaches to another speculative approach, Speculative Lock Elision (SLE). We show that all three approaches, MESTI, LVP, and SLE, can achieve tangible performance benefit, with a detailed discussion of the benefits and drawbacks of each.

These detailed studies lead to a few key findings. First, we observe that detecting temporal silence at the source, i.e. the processor creating the temporally silent pair, is extremely beneficial since it can eliminate all remote miss latency for TSS cache lines, e.g. with MESTI and SLE. Approaches which rely on speculation at the consumer, e.g. LVP, still expose substantial miss latency; this exposed latency can severely impact realized performance benefit even if a larger set of references can be captured.

Second, we observe that for well-tuned parallel programs with little conflict for critical sections, the MESTI protocol and SLE can achieve similar performance provided all TSS is due only to lock variables reliably indicated by SLE’s idioms. If substantial TSS occurs outside idioms captured via SLE, MESTI can achieve greater performance since it does not target a specific programming idiom. For parallel programs with conflicting accesses to critical sections but few data conflicts, SLE can expose additional concurrency and also eliminate more communication misses, leading to improved performance over MESTI. When comparing SLE and MESTI at a theoretical level, we observe that MESTI can only improve communication performance for a given memory reference pattern or program; SLE can eliminate communication specified in the program by removing unnecessary synchronization and exposing additional concurrency. In practice, we show many engineering reasons why either technique may be more or less desirable due to program/machine interactions and mechanisms used to implement either approach.

Third, we show that features of existing memory hierarchies, e.g. inclusive caches, ECC, etc., can be exploited to enable low-cost, general detection of temporal silence for the MESTI protocol. In many cases, additional storage for stale memory versions can be obtained at only a slight bandwidth cost between the first and second levels in the memory hierarchy; since these two levels are, and will continue to be, on-chip for high performance processor designs, the slight bandwidth increase to reduce extensive chip-to-chip communication may be a worthwhile trade-off.

Finally, we show that communicating useful temporal silence (only cases which prevent remote misses) is important in optimizing system performance when implementing the MESTI protocol. Simple coherence prediction structures which only observe physical addresses and require limited storage in the second level tag array can enable tangible performance benefit and virtually eliminate any performance degradation in machines implementing MESTI.

As for update silent stores in multiprocessors (Section 7.1.2), we propose temporal silence as an avenue to eliminate communication latency exposed to remote processors. The performance benefit obtained by exploiting TSS is therefore directly related to the relative cost of remote communication within the system. However, the fundamental contribution of eliminating unnecessary communication in multiprocessors exists across all topologies.

7.2 Future Research Directions

We have shown that two simple types of store value locality, update silent stores and temporally silent stores, can substantially reduce unnecessary memory traffic. Researching additional dimensions of store value locality, utilizing the significant effort expended in other value prediction schemes, may further reduce memory traffic and unnecessary communication. Additional system topologies (such as directory-based coherence protocols, distributed shared memories, or multi-chip CMPs) can also be explored to determine new engineering trade-offs. Finally, any system which relies on conservative memory ordering and disambiguation techniques to maintain correctness may be improved by considering store value locality. We have given examples in thread-level speculation (TLS) systems as an illustration in Section 4.6.

On a broader level, a key result of this thesis is that significant computational energy is being expended to produce what amount to essentially useless memory values. We have already contributed much insight into this program behavior with this thesis and related work. Propagating this uselessness further up the dataflow, into operations contributing to these memory writes, may allow elimination or de-prioritization of computation to further improve performance delivered by uniprocessor and multiprocessor systems.

Beyond improving implementations through prioritization of computation or elimination of useless communication dynamically, we should also examine algorithms and dataflow specified in programs. The fact that approximately 40% of dynamic stores are update silent indicates a substantial inefficiency in instruction streams. This inefficiency may be rectified by programmers (simply through greater awareness and attention to algorithm design) or through improved, value-sensitive compilation techniques.

References

[1] D. Abts, D. J. Lilja, and S. Scott. Toward complexity-effective verification: A case study of the Cray SV2 cache coherence protocol, 2000.
[2] M. E. Acacio, J. Gonzalez, J. M. Garcia, and J. Duato. Use of prediction for accelerating upgrade misses in cc-NUMA multiprocessors. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, September 2002.
[3] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, December 1996.
[4] T. Agarwala and J. Cocke. High performance reduced instruction set processors. IBM Thomas J. Watson Research Center Technical Report #558845, March 1987.
[5] H. Akkary and M. A. Driscoll. A dynamic multithreading processor. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 226–236, Dallas, TX, USA, 30 November–2 December 1998. ACM Press.
[6] A. R. Alameldeen, M. M. K. Martin, C. J. Mauer, K. E. Moore, M. Xu, D. J. Sorin, M. D. Hill, and D. A. Wood. Simulating a $2M commercial server on a $2K PC. IEEE Computer, 36(2):50–57, 2003.
[7] A. R. Alameldeen and D. A. Wood. Variability in architectural simulations of multithreaded workloads. In Proceedings of the 9th Annual International Symposium on High Performance Computer Architecture, 2003.
[8] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. AFIPS Proc. of the SJCC, 31:483–485, 1967.
[9] L. A. Barroso, K. Gharachorloo, and F. E. Bugnion. Memory system characterization of commercial workloads. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 3–14, June 1998.
[10] G. B. Bell, K. M. Lepak, and M. H. Lipasti. A characterization of silent stores. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 133–142, Philadelphia, PA, October 2000.
[11] R. E. Blahut. Theory and Practice of Error Control Codes. Addison-Wesley Publishing Company, 1983.
[12] E. Borch, S. Manne, J. Emer, and E. Tune. Loose loops sink chips. In Proc. of the 8th IEEE Symp. on High-Performance Computer Architecture (HPCA-8), 2002.
[13] J. Borkenhagen and S. Storino. 5th Generation 64-bit PowerPC-Compatible Commercial Processor Design. IBM Whitepaper available from http://www.rs6000.ibm.com, 1999.
[14] D. C. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical report, University of Wisconsin Computer Sciences, 1997.
[15] H. W. Cain, K. M. Lepak, B. A. Schwartz, and M. H. Lipasti. Precise and accurate processor simulation. Proceedings of Computer Architecture Evaluation using Commercial Workloads (CAECW-02), February 2002.
[16] H. W. Cain and M. H. Lipasti. Verifying sequential consistency using vector clocks. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 153–154. ACM Press, 2002.
[17] H. W. Cain and M. H. Lipasti. Constraint graph analysis of multithreaded programs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, September 2003.
[18] H. W. Cain, R. Rajwar, M. Marden, and M. H. Lipasti. An architectural characterization of Java TPC-W. In Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, pages 229–240, Monterrey, Mexico, January 2001.
[19] B. Calder, P. Feller, and A. Eustace. Value profiling. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, December 1997.
[20] B. Catanzaro. Multiprocessor system architectures: A technical survey of multiprocessor/multithreaded systems using SPARC, multi-level bus architectures and Solaris (SunOS), 1997.
[21] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins, and P. W. Markstein. Register allocation via coloring. Computer Languages, 6:47–57, 1981.
[22] A. Charlesworth, A. Phelps, R. Williams, and G. Gilbert. Gigaplane-XB: Extending the Ultra Enterprise family. In Proceedings of the International Symposium on High Performance Interconnects V, August 1997.
[23] Y. Chen and M. Dubois. Cache protocol with partial block invalidation. In Proceedings of the 7th Int. Parallel Processing Symposium. IEEE Computer Society Press, 1993.
[24] M. Cintra and J. Torrellas. Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors. In Proc. of the 8th IEEE Symp. on High-Performance Computer Architecture (HPCA-8), 2002.
[25] R. P. Colwell and R. Steck. A 0.6um BiCMOS processor with Dynamic Execution. In Proceedings of ISSCC, 1995.
[26] Compaq Computer Corporation. Alpha 21264 Hardware Reference Manual DS-0027A-TE. http://www1.support.compaq.com/alpha-tools/documentation/current/chip-docs.html, 2000.
[27] A. Condon and A. J. Hu. Automatable verification of sequential consistency. In ACM Symposium on Parallel Algorithms and Architectures, pages 113–121, 2001.
[28] IBM Corporation. Fault tolerance decision in DRAM applications. Application Note, http://www.chips.ibm.com/products/memory/fault/fault.html, July 1997.
[29] IBM Corporation. AIX v4.3 online documentation. http://ncsp.upenn.edu/aix4.3html/, 2002.
[30] Intel Corporation. Intel Processor Optimization Reference Manual. Intel Corporation, Santa Clara, CA, 2000.
[31] D. E. Culler and J. P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1999.
[32] Advanced Micro Devices. AMD x86-64 architecture. Available from: http://www.x86-64.org.
[33] K. Diefendorff. K7 challenges Intel. Microprocessor Report, 12(7), October 1998.
[34] M. Dubois, L. Barroso, J. C. Wang, and Y. S. Chen. Delayed consistency and its effects on the miss rate of parallel programs. In Proceedings of Supercomputing ’91. ACM Press, 1991.
[35] M. Dubois, J. Skeppstedt, L. Ricciulli, K. Ramamurthy, and P. Stenström. The detection and elimination of useless misses in multiprocessors. In 20th Annual International Symposium on Computer Architecture, May 1993.
[36] M. Dubois, J. Skeppstedt, and P. Stenström. Essential misses and data traffic in coherence protocols. Journal of Parallel and Distributed Computing, 29(2):108–125, 1995.
[37] D. T. Marr et al. Hyper-Threading technology architecture and microarchitecture. Intel Technology Journal, 6(1), 2002.
[38] J. Ziegler et al. IBM experiments in soft fails in computer electronics. IBM Journal of Research and Development, January 1996.
[39] M. Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin-Madison, 1993.
[40] K. Gharachorloo. Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Stanford University, 1995.
[41] K. Gharachorloo, M. Sharma, S. Steely, and S. Van Doren. Architecture and design of AlphaServer GS320, 2000.
[42] P. B. Gibbons and E. Korach. Testing shared memories. SIAM Journal on Computing, 26(4), 1997.
[43] J. R. Goodman and P. J. Woest. The Wisconsin Multicube: A new large-scale cache coherent multiprocessor. In Proceedings of the 15th Annual International Symposium on Computer Architecture, June 1988.

[44] H. Grahn and P. Stenström. Evaluation of a competitive-update cache coherence protocol with migratory data detection. Journal of Parallel and Distributed Computing, 39(2):168–180, 1996.
[45] A. Gupta and W. D. Weber. Cache invalidation patterns in shared-memory multiprocessors. IEEE Transactions on Computers, 41(7):794–810, July 1992.
[46] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, 2nd Ed. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1996.
[47] Hewlett Packard Corporation. PA-RISC 1.1 Architecture and Instruction Set Reference Manual. Third edition, 1994.
[48] M. D. Hill. Aspects of cache memory and instruction buffer performance. PhD thesis, The University of California at Berkeley, 1987.
[49] M. D. Hill. Multiprocessors should support simple memory consistency models. IEEE Computer, 31(8), August 1998.
[50] Z. Hu and S. Kaxiras. Timekeeping in the memory system: Predicting and optimizing memory behavior. In Proceedings of the 29th International Symposium on Computer Architecture, June 2002.
[51] Intel Corporation. IA-64 Application Developer’s Architecture Guide, 1999.
[52] N. Jouppi. Cache write policies and performance. In Proceedings of the 20th International Symposium on Computer Architecture, San Diego, CA, 1993.
[53] S. Kaxiras and J. R. Goodman. Improving CC-NUMA performance using instruction-based prediction. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, Orlando, January 1999.
[54] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th International Symposium on Computer Architecture, June 2001.
[55] J. Keller. The 21264: A superscalar Alpha microprocessor with out-of-order execution. In Proceedings of the Microprocessor Forum, October 1996.
[56] T. Keller, A. M. Maynard, R. Simpson, and P. Bohrer. SimOS-PPC full system simulator. http://www.cs.utexas.edu/users/cart/simOS.
[57] I. Kim and M. H. Lipasti. Implementing optimizations at decode time. In Proceedings of the 29th International Symposium on Computer Architecture, pages 221–232, Anchorage, Alaska, May 2002.
[58] S. Kunkel, B. Armstrong, and P. Vitale. System optimization for OLTP workloads. IEEE Micro, May/June 1999.
[59] A. Lai and B. Falsafi. Memory sharing predictor: The key to a speculative coherent DSM. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 172–183, 1999.
[60] A. Lai and B. Falsafi. Selective, accurate, and timely self-invalidation using last-touch prediction. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 139–148, Vancouver, British Columbia, June 12–14, 2000.
[61] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9), 1979.
[62] A. Landin, E. Hagersten, and S. Haridi. Race-free interconnection networks and multiprocessor consistency. In Proceedings of the 18th International Symposium on Computer Architecture (ISCA), 1991.
[63] G. Lauterbach and T. Horel. UltraSPARC-III: designing third generation 64-bit performance. IEEE Micro, 19(3):56–66, 1999.
[64] A. R. Lebeck and D. A. Wood. Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 48–59, 1995.
[65] H. Lee, G. Tyson, and M. Farrens. Eager writeback – a technique for improving bandwidth utilization. In Proceedings of the 33rd ACM/IEEE International Symposium on Microarchitecture, Monterey, CA, November 2000.
[66] K. M. Lepak, G. B. Bell, and M. H. Lipasti. Silent stores and store value locality. IEEE Transactions on Computers, 50(11), November 2001.
[67] K. M. Lepak, H. W. Cain, and M. H. Lipasti. Redeeming IPC as a performance metric for multithreaded programs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, September 2003.
[68] K. M. Lepak and M. H. Lipasti. On the value locality of store instructions. In Proceedings of the 27th International Symposium on Computer Architecture, pages 182–191, Vancouver, B.C., Canada, June 2000.
[69] K. M. Lepak and M. H. Lipasti. Silent stores for free. In Proceedings of the 33rd ACM/IEEE International Symposium on Microarchitecture, pages 22–31, Monterey, CA, November 2000.
[70] K. M. Lepak and M. H. Lipasti. Temporally silent stores. In Proceedings of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 30–41, October 2002.
[71] J. A. Lewis, B. Black, and M. H. Lipasti. Avoiding initialization misses to the heap. In Proceedings of the 29th International Symposium on Computer Architecture, pages 183–194, Anchorage, Alaska, May 2002.
[72] M. H. Lipasti and J. P. Shen. Exceeding the dataflow limit via value prediction. In Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, December 1996.
[73] M. H. Lipasti, C. B. Wilkerson, and J. P. Shen. Value locality and load value prediction. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), October 1996.
[74] M. M. K. Martin, P. Harper, D. Sorin, M. Hill, and D. Wood. Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors. In Proceedings of the 30th International Symposium on Computer Architecture, pages 206–217, San Diego, CA, U.S.A., June 2003.
[75] M. M. K. Martin, D. J. Sorin, A. Ailamaki, A. R. Alameldeen, R. M. Dickson, C. J. Mauer, K. E. Moore, M. Plakal, M. D. Hill, and D. A. Wood. Timestamp snooping: An approach for extending SMPs. ACM SIGPLAN Notices, 35(11):25–36, November 2000.
[76] M. M. K. Martin, D. J. Sorin, H. W. Cain, M. D. Hill, and M. H. Lipasti. Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing. In Proceedings of MICRO-34, December 2001.
[77] J. F. Martinez and J. Torrellas. Speculative synchronization: Applying thread-level speculation to explicitly parallel applications. In Proceedings of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 2002.
[78] C. May, E. Silha, R. Simpson, and H. Warren. The PowerPC Architecture. Morgan Kaufmann Publishers, Inc., 1994.
[79] T. May and M. Woods. Alpha-particle-induced soft errors in dynamic memories. IEEE Transactions on Electronic Devices, 1979.
[80] C. McNairy and D. Soltis. Itanium 2 processor microarchitecture. IEEE Micro, 23(2):44–55, 2003.
[81] A. Mendelson and F. Gabbay. Speculative execution based on value prediction. Technical report, Technion, 1997. (http://www-ee.technion.ac.il/%7efredg).
[82] C. Molina, A. Gonzalez, and J. Tubella. Reducing memory traffic via redundant store instructions. In Proc. of Int. Conf. on High Perf. Computing and Networking, pages 1246–1249, April 1999.
[83] C. Moore. POWER4 system microarchitecture. In Proceedings of the Microprocessor Forum, October 2000.
[84] G. E. Moore. Cramming more components onto integrated circuits. Electronics, April 1965.
[85] A. Moshovos. Memory Dependence Prediction. PhD thesis, University of Wisconsin, 1998.
[86] S. S. Mukherjee and M. D. Hill. Using prediction to accelerate coherence protocols. In Proceedings of the 25th International Symposium on Computer Architecture, pages 179–190, Barcelona, Spain, June 1998.
[87] J. Nilsson and F. Dahlgren. Improving performance of load-store sequences for transaction processing workloads on multiprocessors. In Proceedings of the International Conference on Parallel Processing, September 1999.
[88] J. Nilsson and F. Dahlgren. Reducing ownership overhead for load-store sequences in cache-coherent multiprocessors. In Proceedings of the 2000 International Parallel and Distributed Processing Symposium, May 2000.
[89] V. Pai, P. Ranganathan, and S. Adve. RSIM: A simulator for shared-memory multiprocessor and uniprocessor systems that exploit ILP. In Proceedings of the 3rd Workshop on Computer Architecture Education, 1997.
[90] R. Rajwar. Speculation-Based Techniques for Transactional Lock-Free Execution of Lock-Based Programs. PhD thesis, University of Wisconsin, 2002.
[91] R. Rajwar. Personal communication, August 2003.
[92] R. Rajwar and J. R. Goodman. SimpleMP multiprocessor simulator. Personal communication, 2000.
[93] R. Rajwar and J. R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. In Proceedings of the 34th ACM/IEEE International Symposium on Microarchitecture, pages 294–305, December 2001.
[94] R. Rajwar and J. R. Goodman. Transactional lock-free execution of lock-based programs. In Proceedings of the 10th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 2002.
[95] A. Ramirez, L. A. Barroso, K. Gharachorloo, R. Cohn, J.-L. Larriba-Pey, P. G. Lowney, and M. Valero. Code layout optimizations for transaction processing workloads. In Proceedings of the 28th International Symposium on Computer Architecture, June 2001.
[96] P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. A. Barroso. Performance of database workloads on shared-memory systems with out-of-order processors. In Proceedings of the 8th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October 1998.
[97] T. R. N. Rao and E. Fujiwara. Error-Control Coding for Computer Systems. Prentice Hall, Englewood Cliffs, NJ, 1989.
[98] M. Raynal. Algorithms for Mutual Exclusion. The MIT Press, 1986.
[99] M. Rosenblum. SimOS full system simulator. http://simos.stanford.edu.
[100] E. Rotenberg. AR-SMT: A microarchitectural approach to fault tolerance in microprocessors. In Proceedings of the 29th Fault-Tolerant Computing Symposium, June 1999.
[101] P. Rubinfeld. Managing problems at high speed. IEEE Computer, January 1998.
[102] J. E. Smith. A study of branch prediction techniques. In Proceedings of the 8th Annual Symposium on Computer Architecture, pages 135–147, June 1981.
[103] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. Improving value communication for thread-level speculation. In Proc. of the 8th IEEE Symp. on High-Performance Computer Architecture (HPCA-8), 2002.
[104] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream processors: Improving both performance and fault tolerance. In Architectural Support for Programming Languages and Operating Systems, pages 257–268, 2000.
[105] J. M. Tendler, J. S. Dodson, J. S. Fields, H. Le, and B. Sinharoy. POWER4 system microarchitecture. http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/.html, November 2001.
[106] Transaction Processing Performance Council.
TPC benchmarks. http://www.tpc.org. [107] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 pro- grams: Characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995. [108] D. A. Wood, M. D. Hill, and R. E. Kessler. A model for estimating trace-sample miss 244 ratios. In Proceedings of the 1991 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, 1991. [109] J. Yang and R. Gupta. Energy-efficient load and store reuse. In IEEE/ACM Interna- tional Symposium on Low Power Electronics and Design (ISLPED), 2001. [110] K. C. Yeager. The MIPS R10000 superscalar microprocessor. IEEE Micro, April 1996. [111] A. Yoaz, R. Ronen, R. S. Chappell, and Y. Almog. Silence is golden? In Proceedings of Work-in-Progress Workshop in conjunction with HPCA-7, January 2001. [112] J. Ziegler. Terrestrial cosmic rays. IBM Journal of Research and Development, Jan- uary 1996. [113] C. Zilles and G. Sohi. Master/Slave Speculative Parallelization. In Proc. of the 35th International Symposium on Microarchitecture (Micro-35), November 2002. Appendix A. 245

MOESTI Protocol

A.1 Verification of the MOESTI Protocol and Implementation Considerations

We have discussed general implementation issues in the MESTI protocol in Section 5.8.3. Protocols used in commercial implementations can be significantly more complicated. For example, our detailed simulation environment, PHARMsim, implements a variation of the Gigaplane-XB protocol, with three major state machines (in the L1 cache, L2 cache, and DTAG) collaborating to ensure correct operation. We have discussed this previously in Section 6.2.2. In this section, we discuss the major differences between the MOESTI protocol as actually implemented and the earlier MESTI descriptions, detail the handling of update silent stores within the protocol, and conclude with a brief discussion of correctness verification.

A.1.1 MOESTI Correctness Considerations

The largest single difference between MOESTI and MESTI is the addition of the Owned state. In our protocol, Owned implies that a cache line is dirty with respect to memory but potentially shared by other processors in the system. Therefore, if an Owned line is cast out, it must be written back to memory; however, Owned does not imply that the cache line is cached only by the owning processor.
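As a summary of these stable-state properties, the following minimal C sketch may be helpful; the identifiers are ours for illustration and do not appear in the PHARMsim source.

#include <stdbool.h>
#include <stdio.h>

/* MOESTI stable states (T is the temporally invalid state). */
typedef enum {
    MODIFIED,   /* dirty with respect to memory; sole cached copy */
    OWNED,      /* dirty with respect to memory; possibly shared */
    EXCLUSIVE,  /* clean; sole cached copy */
    SHARED,     /* clean; possibly shared */
    T_INVALID,  /* temporally invalid */
    INVALID
} stable_state_t;

/* Modified and Owned lines are dirty, so a castout from either
   state requires a writeback to main memory. */
static bool writeback_on_castout(stable_state_t s)
{
    return s == MODIFIED || s == OWNED;
}

int main(void)
{
    printf("Owned castout writes back: %d\n", writeback_on_castout(OWNED));
    return 0;
}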

As discussed in Section 5.8.3, in basic MESTI any validate transaction should transfer ownership for the cache block back to main memory to ensure correct protocol operation. As long as the protocol is correctly implemented, no data transfer is required for this implicit writeback because validated data matches the contents of memory. In MOESTI, memory data can be stale with respect to validated data because of the Owned state. Therefore, any time a validate is broadcast to the system (and thus the validating processor desires to downgrade from Modified state), the line must return to the Owned state to ensure correct protocol operation, implying that any validated cache line may also lead to a main memory writeback. This is not strictly required if the implementation tracks the state of the cache line prior to entering Modified (Exclusive or Shared) and conditionally returns to the corresponding clean state; however, in a Gigaplane-XB-like implementation this optimization is not possible. The reason is subtle for readers not intimately familiar with Gigaplane-XB. At a high level, ownership, i.e., data sourcing responsibility to remote requestors, is determined at the DTAG when snoop-requests and snoop-responses are generated; once a processor has claimed ownership of a cache block and asserted the appropriate snoop-response, it must supply data to the requestor. Therefore, to handle the race case in which an external request for validated data is in flight between the DTAG (where the validating processor has already claimed ownership) and the L2, the L2 must return to the Owned state (instead of E or S) to ensure that it will correctly source the data to the external requestor.
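A minimal C sketch of this downgrade rule follows; it corresponds to the 5 -> 7 ("validate_bcast") arc in the L2 state machine listed in Section A.2, but the identifiers are ours.

#include <assert.h>
#include <stdio.h>

typedef enum { L2_MODIFIED, L2_OWNED, L2_EXCLUSIVE, L2_SHARED } l2_state_t;

/* A validating L2 must retain data sourcing responsibility for external
   requests already routed to it by the DTAG, so it always downgrades to
   Owned, never to a clean state, even if the line was Exclusive or
   Shared before entering Modified. */
static l2_state_t state_after_validate_broadcast(l2_state_t current)
{
    assert(current == L2_MODIFIED); /* only Modified lines broadcast validates */
    (void)current;
    return L2_OWNED;
}

int main(void)
{
    printf("downgrades to Owned: %d\n",
           state_after_validate_broadcast(L2_MODIFIED) == L2_OWNED);
    return 0;
}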

Another subtlety arises in correctly assuring that validates are associated with the proper temporally invalid (T) state and validating processor. We have described this in Section 5.8.3 as assuring correct correspondence between T state and validating processors. In our Gigaplane-XB-like protocol, correct correspondence is assured at the DTAG by an owning processor (in Owned state at the DTAG) asserting the ignore snoop-response to any validate which it did not broadcast. This causes no forward progress or multiple bus-driver issues: forward progress is ensured because eliminating validates in MESTI is always correct (Section 5.8.3), and single assertion of ignore is assured because only one processor can be in the Owned state at the DTAG by protocol design. Elsewhere within the coherence hierarchy, e.g., the delay queue discussed in Section 6.6.2, correct correspondence is assured by snooping request/response queues appropriately between levels in the memory hierarchy for external requests.
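A minimal C sketch of this DTAG correspondence check follows (names ours); it corresponds to the assert_ignore annotation on the validate_other arc of the dtag_owned state in the DTAG listing of Section A.2.

#include <stdbool.h>
#include <stdio.h>

typedef enum { DTAG_INVALID, DTAG_T_INVALID, DTAG_SHARED, DTAG_OWNED } dtag_state_t;

/* The owning processor's DTAG asserts the ignore snoop-response for any
   validate it did not itself broadcast. Because at most one processor
   can be Owned at the DTAG, at most one bus driver ever asserts ignore. */
static bool assert_ignore_for_validate(dtag_state_t s, bool validate_is_ours)
{
    return s == DTAG_OWNED && !validate_is_ours;
}

int main(void)
{
    printf("foreign validate squashed: %d\n",
           assert_ignore_for_validate(DTAG_OWNED, false));
    return 0;
}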

A.1.2 Update Silent Store Handling in PHARMsim and MOESTI

In Section 5.7.4 we discussed a general mechanism, with many desirable properties, for handling update silent store clean hits and pure store misses using the MESTI protocol. We implement this mechanism in PHARMsim. Furthermore, we extend it slightly to reduce coherence transaction latency in the case of pure store misses (store misses which hit an invalid state, so that no data is available to verify update silence). We assume that the data for the store causing the miss can be transferred to the L2 cache, or is available as part of the L2 miss status holding register (MSHR), to enable comparison with incoming coherence network data when it reaches the L2. This comparison allows the L2 to determine immediately whether the store is update silent, leading to two possible outcomes. In the case of update silence, the L2 installs the cache line in Owned state, adds a validate to the outbound coherence queue, and responds to the L1 cache, informing it of update silence and of the required installation in Shared state when the data reaches the L1. In the case of non-update silence, the L2 installs the cache line in Modified state and responds to the L1 cache with the data, leading to installation in Modified state; this is baseline coherence operation.
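The comparison at the L2 can be sketched as follows in C; the MSHR fields and line size shown are assumptions for illustration, not the PHARMsim data structures.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define LINE_BYTES 64 /* assumed cache line size */

/* An L2 MSHR that also buffers the data of the store that missed. */
typedef struct {
    unsigned offset;        /* byte offset of the store within the line */
    unsigned size;          /* store width in bytes */
    unsigned char data[16]; /* buffered store data */
} mshr_t;

typedef enum { INSTALL_MODIFIED, INSTALL_OWNED_AND_QUEUE_VALIDATE } l2_action_t;

/* When the line returns from the coherence network, compare the buffered
   store data against the corresponding bytes of the incoming line; a
   match means the pure store miss is update silent. */
static l2_action_t on_miss_data_return(const mshr_t *m,
                                       const unsigned char line[LINE_BYTES])
{
    bool silent = memcmp(line + m->offset, m->data, m->size) == 0;
    return silent ? INSTALL_OWNED_AND_QUEUE_VALIDATE : INSTALL_MODIFIED;
}

int main(void)
{
    unsigned char line[LINE_BYTES] = { [8] = 0x2a };
    mshr_t m = { .offset = 8, .size = 1, .data = { 0x2a } };
    printf("update silent: %d\n",
           on_miss_data_return(&m, line) == INSTALL_OWNED_AND_QUEUE_VALIDATE);
    return 0;
}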

If update silence is not verified at the L2 level, the latency for validate broadcast is increased by two queuing delays incurred through the inbound and outbound queues between the L1 and L2 caches. It also leads to additional transitions at the L2 level for each update silent store miss, increasing occupancy and potentially power consumption.

However, either placement of the update silence check is correct and implementable.

A.1.3 MOESTI Correctness Verification

The MOESTI protocol was verified by implementing it in PHARMsim and utilizing the verification infrastructure described in Section 6.3.1. Extensive application benchmark and microbenchmark testing, with instruction-by-instruction verification against a semantically unmodified SimOS-PPC, has given us high confidence in the correctness of our implementation.

As discussed in Section 4.5.2, a potential correctness issue arises under PowerPC for update silent store suppression in the case of update silent store clean hits (an update silent store hits Shared or Owned state, and thus no upgrade is generated) because of side-effects on the reservation register used to implement load-locked/store-conditional operations. We showed there that suppression can be observed in PowerPC under this scenario and that a deviation in program execution can result under contrived circumstances. Our verification infrastructure was augmented to detect when true sharing on the reservation address occurred and update silent store suppression to the same address caused a deviation in execution; we observed no execution errors in any benchmarks or test programs.
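The following toy C model illustrates the scenario; it captures only the reservation side-effect described above, not full PowerPC load-locked/store-conditional semantics, and all names are ours.

#include <stdbool.h>
#include <stdio.h>

static int value_at_A = 1;         /* memory location A */
static bool p0_reservation = true; /* P0 holds a reservation on A */

/* Remote processor P1 stores v to A. Normally the resulting upgrade or
   invalidation clears P0's reservation; if the store is update silent
   (v equals the current value) and is suppressed, no coherence event is
   generated and the reservation survives. */
static void p1_store(int v, bool suppress_update_silent)
{
    if (suppress_update_silent && v == value_at_A)
        return; /* suppressed: no upgrade, reservation untouched */
    p0_reservation = false;
    value_at_A = v;
}

int main(void)
{
    p1_store(1, true); /* update silent store by P1, suppressed */
    /* P0's store-conditional now succeeds although another processor
       stored to A after P0's load-locked: the contrived deviation. */
    printf("P0 stwcx. %s\n", p0_reservation ? "succeeds" : "fails");
    return 0;
}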

A.2 Detailed Description of the MOESTI Protocol Used in PHARMsim

In Section A.1 we described the MOESTI implementation issues inside our performance simulator, PHARMsim, and the high-level changes to the baseline MOESI protocol which must be considered. What follows in this section are detailed protocol state machine descriptions (states and transitions only; internal events have been removed for clarity) for both the DTAG state machine and the L2 state machine. The MOESTI DTAG implements four states: Owned, Shared, T_invalid, and Invalid. The base PHARMsim MOESI DTAG implements three states: Owned, Shared, and Invalid. The addition of T_invalid at the DTAG is necessary to enable proper snoop-filtering for the L2 state machine; simply routing all transactions from Invalid state to the L2 would require additional state/transition pairs in the L2 state machine, and no external snoops could be filtered while maintaining correct MESTI operation.

The MOESTI L2 state machine implements eight stable states: Modified, Modified_Clean, Owned, Exclusive, Shared, Validate_Shared, T_Invalid, and Invalid, plus an additional 39 transient states. The base PHARMsim MOESI L2 state machine implements six stable states: Modified, Modified_Clean, Owned, Exclusive, Shared, and Invalid, plus an additional 37 transient states. The added stable states (Validate_Shared and T_Invalid) are the most relevant difference, and we have described their utility in detail in Chapter 5 along with general correctness considerations for implementing MESTI/MOESTI. We have not made a substantial effort to reduce the number of transient states in our simulation environment. The state machine is coded in C; the descriptions that follow have been extracted from the protocol C code using an automated extraction tool. The descriptions are formatted for visualization with DOT/DOTTY, part of the AT&T graphviz suite available from http://www.research.att.com/sw/tools/graphviz/ as of this writing.1 The representation describes a directed graph, with nodes representing stable and transient states and arcs representing transitions.

1. Previously, we had included graphical output from dotty. However, since the number of pages was excessive (and not particularly useful in our opinion), we have chosen to express each state machine in the DOT format. The user can input the descriptions into dotty for visualization, if desired.

digraph "State transition diagram for l2_dtag" {
  page="8.5,11"
  margin="1.25"
  size="25.5,18"
  rotate=90
  ratio=auto
  node [shape=circle,style=filled,fontname=Helvetica,fontsize=12,height=.1];
  edge [fontname=Helvetica,fontsize=12];
  0 [label="dtag_invalid"];
  1 [label="dtag_t_invalid"];
  2 [label="dtag_shared"];
  3 [label="dtag_owned"];
  0 -> 0 [label="validate_self"];
  1 -> 1 [label="validate_self"];
  0 -> 0 [label="validate_other"];
  1 -> 2 [label="validate_other"];
  0 -> 3 [label="rd_self_not_shared"];
  0 -> 0 [label="ignore"];
  1 -> 1 [label="ignore"];
  2 -> 2 [label="ignore"];
  3 -> 3 [label="ignore"];
  0 -> 2 [label="rd_self_shared"];
  1 -> 3 [label="rd_self"];
  0 -> 2 [label="rd_inst_self"];
  1 -> 2 [label="rd_inst_self"];
  0 -> 3 [label="rfo_self"];
  1 -> 3 [label="rfo_self"];
  0 -> 3 [label="upg_self"];
  1 -> 3 [label="upg_self"];
  0 -> 0 [label="upg_stc_self \l- assert_ignore"];
  1 -> 0 [label="upg_stc_self \l- assert_ignore"];
  0 -> 0 [label="wb_self \l-assert_ignore"];
  1 -> 0 [label="wb_self \l-assert_ignore"];
  0 -> 0 [label="rd_other"];
  0 -> 0 [label="rd_inst_other"];
  0 -> 0 [label="wb_other"];
  0 -> 0 [label="rfo_other"];
  0 -> 0 [label="upg_other"];
  0 -> 0 [label="upg_stc_other"];
  0 -> 0 [label="dcb_zero_other"];
  1 -> 0 [label="rd_other"];
  1 -> 0 [label="rd_inst_other"];
  1 -> 0 [label="rfo_other"];
  1 -> 0 [label="upg_other"];
  1 -> 0 [label="upg_stc_other"];
  1 -> 0 [label="dcb_zero_other"];
  1 -> 1 [label="wb_other"];
  0 -> 3 [label="dcb_zero_self"];
  1 -> 3 [label="dcb_zero_self"];
  0 -> 0 [label="icb_inval_self"];
  1 -> 1 [label="icb_inval_self"];
  0 -> 0 [label="dcb_flush_self"];
  0 -> 0 [label="dcb_inval_self"];
  1 -> 0 [label="dcb_flush_self"];
  1 -> 0 [label="dcb_inval_self"];
  0 -> 0 [label="dcb_zero_other"];
  0 -> 0 [label="dcb_flush_other"];
  0 -> 0 [label="dcb_inval_other"];
  1 -> 0 [label="dcb_flush_other"];
  1 -> 0 [label="dcb_inval_other"];
  0 -> 0 [label="icb_inval_other"];
  1 -> 0 [label="icb_inval_other"];
  2 -> 2 [label="validate_self"];
  2 -> 2 [label="validate_other"];
  2 -> 2 [label="rd_self"];
  2 -> 3 [label="rfo_self"];
  2 -> 3 [label="upg_self"];
  2 -> 3 [label="upg_stc_self"];
  2 -> 2 [label="wb_self"];
  2 -> 2 [label="rd_other"];
  2 -> 2 [label="rd_inst_other"];
  2 -> 1 [label="rfo_other"];
  2 -> 1 [label="upg_other"];
  2 -> 1 [label="upg_stc_other"];
  2 -> 2 [label="wb_other"];
  2 -> 3 [label="dcb_zero_self"];
  2 -> 0 [label="dcb_flush_self"];
  2 -> 0 [label="dcb_inval_self"];
  2 -> 2 [label="icb_inval_self"];
  2 -> 1 [label="dcb_zero_other"];
  2 -> 1 [label="dcb_flush_other"];
  2 -> 1 [label="dcb_inval_other"];
  2 -> 2 [label="icb_inval_other"];
  3 -> 3 [label="validate_self"];
  3 -> 3 [label="validate_other \l- assert_ignore"];
  3 -> 3 [label="rd_self"];
  3 -> 3 [label="rfo_self"];
  3 -> 3 [label="upg_self"];
  3 -> 3 [label="upg_stc_self"];
  3 -> 0 [label="wb_self"];
  3 -> 3 [label="rd_inst_other"];
  3 -> 3 [label="rd_other"];
  3 -> 1 [label="rfo_other"];
  3 -> 1 [label="upg_other"];
  3 -> 1 [label="upg_stc_other"];
  3 -> 3 [label="wb_other"];
  3 -> 3 [label="dcb_zero_self"];
  3 -> 0 [label="dcb_flush_self"];
  3 -> 0 [label="dcb_inval_self"];
  3 -> 3 [label="icb_inval_self"];
  3 -> 1 [label="dcb_zero_other"];
  3 -> 1 [label="dcb_inval_other"];
  3 -> 3 [label="dcb_flush_other"];
  3 -> 3 [label="icb_inval_other"];
}

digraph "State transition diagram for l2_with_MOESTI" {
  page="8.5,11"
  margin="1.25"
  size="24,34"
  #size="6,8.5"
  ratio=auto
  rotate=90
  node [shape=circle,style=filled,fontname=Helvetica,fontsize=11,height=.1];
  edge [fontname=Helvetica,fontsize=11];
  0 [label="t_invalid"];
  1 [label="invalid"];
  2 [label="exclusive"];
  3 [label="shared"];
  4 [label="val_shared"];
  5 [label="modified"];
  6 [label="modified_clean"];
  7 [label="owned"];
  8 [label="pn_i\l_none_e"];
  9 [label="pn_i_e\l_d_na"];
  10 [label="pn_i_e\l_nd_a"];
  11 [label="pn_i_e_a\l_xdcbf_nd"];
  12 [label="pn_i_e_a\l_xrd_nd"];
  13 [label="pn_i_e_a\l_xrfo_xrd_nd"];
  14 [label="pn_i_e_a\l_xrfo_nd"];
  15 [label="pn_i\l_none_s"];
  16 [label="pn_i_s\l_d_na"];
  17 [label="pn_i_s\l_nd_a"];
  18 [label="pn_i_s_a\l_xdcbf_nd"];
  19 [label="pn_i_s_a\l_xrfo_nd"];
  20 [label="pn_i\l_none_m"];
  21 [label="pn_s_m\l_xrfo_none"];
  22 [label="pn_o_m\l_xrfo_none"];
  23 [label="pn_i_m\l_d_na"];
  24 [label="pn_i_m\l_nd_a"];
  25 [label="pn_i_m_a\l_xrfo_nd"];
  26 [label="pn_i_m_a\l_xrd_nd"];
  27 [label="pn_i_m_a\l_xrfo_xrd_nd"];
  28 [label="pn_s\l_none_m"];
  29 [label="pn_s_m\l_none_stc"];
  30 [label="pn_s_i\l_none_stc"];
  31 [label="pn_o\l_none_m"];
  32 [label="pn_o_m\l_none_stc"];
  33 [label="pn_m\l_none_i"];
  34 [label="pn_m\l_dcbf_i"];
  35 [label="pn_m_i\l_xrfo_dcbf"];
  36 [label="pn_m_i\l_xrfo_none"];
  37 [label="pn_o\l_none_i"];
  38 [label="pn_m\l_flush_o"];
  39 [label="pn_m\l_flush_i"];
  40 [label="pn_e\l_none_i"];
  41 [label="pn_i\l_icbi_i"];
  42 [label="pn_s\l_icbi_s"];
  43 [label="pn_m\l_icbi_m"];
  44 [label="pn_mc\l_icbi_mc"];
  45 [label="pn_e\l_icbi_e"];
  46 [label="pn_o\l_icbi_o"];
  0 -> 4 [label="validate_other"];
  0 -> 8 [label="prefetch"];
  0 -> 8 [label="rd"];
  0 -> 20 [label="rfo"];
  0 -> 20 [label="prefetch_x"];
  0 -> 20 [label="upg"];
  0 -> 1 [label="upg_stc"];
  0 -> 15 [label="rd_inst"];
  0 -> 1 [label="rd_other"];
  0 -> 1 [label="rd_inst_other"];
  0 -> 0 [label="wb_other"];
  0 -> 1 [label="rfo_other"];
  0 -> 1 [label="upg_other"];
  0 -> 1 [label="upg_stc_other"];
  0 -> 1 [label="l1_rsp_rd_other"];
  0 -> 1 [label="l1_rsp_rfo_other"];
  0 -> 1 [label="l1_rsp_upg_other"];
  0 -> 1 [label="l1_rsp_evict"];
  0 -> 20 [label="dcb_zero"];
  0 -> 20 [label="dcb_flush"];
  0 -> 41 [label="icb_inval"];
  0 -> 20 [label="dcb_inval"];
  0 -> 1 [label="icb_inval_other"];
  0 -> 1 [label="icb_inval_self"];
  0 -> 1 [label="dcb_zero_other"];
  0 -> 1 [label="dcb_flush_other"];
  0 -> 1 [label="dcb_inval_other"];
  0 -> 1 [label="dcb_inval_self"];
  1 -> 1 [label="l1_wb"];
  1 -> 1 [label="validate_self"];
  1 -> 8 [label="prefetch"];
  1 -> 8 [label="rd"];
  1 -> 20 [label="rfo"];
  1 -> 20 [label="prefetch_x"];
  1 -> 20 [label="upg"];
  1 -> 1 [label="upg_stc"];
  1 -> 15 [label="rd_inst"];
  1 -> 1 [label="rd_other"];
  1 -> 1 [label="rd_inst_other"];
  1 -> 1 [label="wb_other"];
  1 -> 1 [label="rfo_other"];
  1 -> 1 [label="upg_other"];
  1 -> 1 [label="upg_stc_other"];
  1 -> 1 [label="l1_rsp_rd_other"];
  1 -> 1 [label="l1_rsp_rfo_other"];
  1 -> 1 [label="l1_rsp_upg_other"];
  1 -> 1 [label="l1_rsp_evict"];
  1 -> 20 [label="dcb_zero"];
  1 -> 20 [label="dcb_flush"];
  1 -> 41 [label="icb_inval"];
  1 -> 20 [label="dcb_inval"];
  1 -> 1 [label="icb_inval_other"];
  1 -> 1 [label="icb_inval_self"];
  1 -> 1 [label="dcb_zero_other"];
  1 -> 1 [label="dcb_flush_other"];
  1 -> 1 [label="dcb_inval_other"];
  1 -> 1 [label="dcb_inval_self"];
  2 -> 2 [label="l1_wb"];
  2 -> 2 [label="rd"];
  2 -> 5 [label="rfo"];
  2 -> 2 [label="prefetch"];
  2 -> 2 [label="prefetch_x"];
  2 -> 5 [label="upg"];
  2 -> 5 [label="upg_stc"];
  2 -> 2 [label="rd_inst"];
  2 -> 40 [label="repl"];
  2 -> 7 [label="rd_inst_other"];
  2 -> 7 [label="rd_other"];
  2 -> 0 [label="rfo_other"];
  2 -> 5 [label="dcb_zero"];
  2 -> 37 [label="dcb_flush"];
  2 -> 45 [label="icb_inval"];
  2 -> 37 [label="dcb_inval"];
  2 -> 0 [label="dcb_zero_other"];
  2 -> 0 [label="dcb_flush_other"];
  2 -> 0 [label="dcb_inval_other"];
  2 -> 2 [label="icb_inval_other"];
  3 -> 3 [label="l1_wb"];
  4 -> 4 [label="l1_wb"];
  3 -> 3 [label="rd"];
  4 -> 3 [label="rd"];
  3 -> 3 [label="prefetch"];
  4 -> 3 [label="prefetch"];
  3 -> 28 [label="rfo_non_silent"];
  3 -> 3 [label="rfo_silent"];
  3 -> 3 [label="prefetch_x_silent"];
  3 -> 28 [label="prefetch_x_non_silent"];
  4 -> 3 [label="rfo_silent"];
  3 -> 28 [label="rfo_non_silent"];
  4 -> 3 [label="prefetch_x_silent"];
  3 -> 28 [label="prefetch_x_non_silent"];
  3 -> 28 [label="upg_non_silent"];
  3 -> 3 [label="upg_silent"];
  4 -> 28 [label="upg_non_silent"];
  4 -> 3 [label="upg_silent"];
  3 -> 29 [label="upg_stc"];
  4 -> 29 [label="upg_stc"];
  3 -> 3 [label="rd_inst"];
  4 -> 3 [label="rd_inst"];
  3 -> 1 [label="repl"];
  4 -> 1 [label="repl"];
  3 -> 3 [label="rd_other"];
  4 -> 3 [label="rd_other"];
  3 -> 0 [label="rfo_other"];
  4 -> 0 [label="rfo_other"];
  3 -> 0 [label="upg_other"];
  4 -> 0 [label="upg_other"];
  3 -> 0 [label="upg_stc_other"];
  4 -> 0 [label="upg_stc_other"];
  3 -> 0 [label="dcb_zero_other"];
  3 -> 0 [label="dcb_flush_other"];
  3 -> 0 [label="dcb_inval_other"];
  4 -> 0 [label="dcb_zero_other"];
  4 -> 0 [label="dcb_flush_other"];
  4 -> 0 [label="dcb_inval_other"];
  3 -> 3 [label="icb_inval_other"];
  4 -> 4 [label="icb_inval_other"];
  3 -> 20 [label="dcb_zero"];
  4 -> 20 [label="dcb_zero"];
  3 -> 20 [label="dcb_flush"];
  4 -> 20 [label="dcb_flush"];
  3 -> 42 [label="icb_inval"];
  4 -> 42 [label="icb_inval"];
  3 -> 20 [label="dcb_inval"];
  4 -> 20 [label="dcb_inval"];
  5 -> 39 [label="repl"];
  5 -> 6 [label="l1_wb"];
  5 -> 6 [label="l1_wb_nots"];
  5 -> 5 [label="validate_nobcast_reupgrade"];
  5 -> 6 [label="validate_nobcast_noreupgrade"];
  5 -> 7 [label="validate_bcast"];
  5 -> 38 [label="rd_inst_other"];
  5 -> 38 [label="rd_other"];
  5 -> 39 [label="rfo_other"];
  5 -> 39 [label="upg_other"];
  5 -> 43 [label="dcb_zero"];
  5 -> 43 [label="dcb_flush"];
  5 -> 43 [label="icb_inval"];
  5 -> 37 [label="dcb_inval"];
  5 -> 0 [label="dcb_zero_other"];
  5 -> 39 [label="dcb_flush_other"];
  5 -> 5 [label="icb_inval_other"];
  5 -> 5 [label="icb_inval_self"];
  5 -> 1 [label="dcb_inval_other"];
  5 -> 38 [label="rd_inst"];
  6 -> 6 [label="l1_wb"];
  6 -> 6 [label="rd_inst"];
  6 -> 6 [label="rd"];
  6 -> 6 [label="prefetch"];
  6 -> 6 [label="prefetch_x"];
  6 -> 5 [label="rfo"];
  6 -> 5 [label="upg"];
  6 -> 5 [label="upg_stc"];
  6 -> 33 [label="repl"];
  6 -> 7 [label="rd_inst_other"];
  6 -> 7 [label="rd_other"];
  6 -> 0 [label="rfo_other"];
  6 -> 0 [label="upg_other"];
  6 -> 5 [label="dcb_zero"];
  6 -> 34 [label="dcb_flush"];
  6 -> 44 [label="icb_inval"];
  6 -> 37 [label="dcb_inval"];
  6 -> 0 [label="dcb_zero_other"];
  6 -> 33 [label="dcb_flush_other"];
  6 -> 6 [label="icb_inval_other"];
  6 -> 0 [label="dcb_inval_other"];
  7 -> 7 [label="l1_wb"];
  7 -> 7 [label="rd"];
  7 -> 7 [label="rd_inst"];
  7 -> 7 [label="prefetch"];
  7 -> 7 [label="rfo"];
  7 -> 7 [label="prefetch_x"];
  7 -> 7 [label="upg"];
  7 -> 7 [label="upg_stc"];
  7 -> 37 [label="repl"];
  7 -> 7 [label="rd_inst_other"];
  7 -> 7 [label="rd_other"];
  7 -> 0 [label="rfo_other"];
  7 -> 0 [label="upg_other"];
  7 -> 0 [label="upg_stc_other"];
  7 -> 7 [label="dcb_zero"];
  7 -> 34 [label="dcb_flush"];
  7 -> 46 [label="icb_inval"];
  7 -> 37 [label="dcb_inval"];
  7 -> 0 [label="dcb_zero_other"];
  7 -> 37 [label="dcb_flush_other"];
  7 -> 7 [label="icb_inval_other"];
  7 -> 0 [label="dcb_inval_other"];
  8 -> 8 [label="rfo_other"];
  8 -> 8 [label="upg_other"];
  8 -> 8 [label="upg_stc_other"];
  8 -> 8 [label="dcb_zero_other"];
  8 -> 8 [label="rd_other"];
  8 -> 8 [label="rd_inst_other"];
  8 -> 8 [label="wb_other"];
  8 -> 8 [label="dcb_flush_other"];
  8 -> 8 [label="icb_inval_other"];
  8 -> 8 [label="dcb_inval_other"];
  8 -> 10 [label="rd_self"];
  8 -> 9 [label="dat"];
  9 -> 2 [label="rd_self"];
  10 -> 12 [label="rd_other"];
  10 -> 12 [label="rd_inst_other"];
  10 -> 14 [label="rfo_other"];
  10 -> 2 [label="dat"];
  10 -> 11 [label="dcb_flush_other"];
  10 -> 11 [label="dcb_zero_other"];
  10 -> 11 [label="dcb_inval_other"];
  11 -> 11 [label="dcb_flush_other"];
  11 -> 1 [label="dat"];
  12 -> 12 [label="rd_other"];
  12 -> 13 [label="rfo_other"];
  12 -> 13 [label="upg_other"];
  12 -> 7 [label="dat"];
  13 -> 1 [label="dat"];
  13 -> 13 [label="rd_other"];
  13 -> 13 [label="rd_inst_other"];
  13 -> 13 [label="rfo_other"];
  13 -> 13 [label="upg_other"];
  13 -> 13 [label="upg_stc_other"];
  13 -> 13 [label="icb_inval_other"];
  13 -> 13 [label="dcb_flush_other"];
  13 -> 13 [label="dcb_zero_other"];
  13 -> 13 [label="dcb_inval_other"];
  14 -> 1 [label="dat"];
  14 -> 14 [label="rd_other"];
  14 -> 14 [label="rd_inst_other"];
  14 -> 14 [label="rfo_other"];
  14 -> 14 [label="upg_other"];
  14 -> 14 [label="upg_stc_other"];
  14 -> 14 [label="icb_inval_other"];
  14 -> 14 [label="dcb_flush_other"];
  14 -> 14 [label="dcb_zero_other"];
  14 -> 14 [label="dcb_inval_other"];
  15 -> 15 [label="rfo_other"];
  15 -> 15 [label="upg_other"];
  15 -> 15 [label="upg_stc_other"];
  15 -> 15 [label="dcb_zero_other"];
  15 -> 15 [label="rd_other"];
  15 -> 15 [label="rd_inst_other"];
  15 -> 15 [label="wb_other"];
  15 -> 15 [label="dcb_flush_other"];
  15 -> 15 [label="dcb_inval_other"];
  15 -> 15 [label="icb_inval_other"];
  15 -> 17 [label="rd_self"];
  15 -> 17 [label="rd_inst_self"];
  15 -> 16 [label="dat"];
  16 -> 3 [label="rd_self"];
  16 -> 16 [label="icb_inval_other"];
  17 -> 17 [label="rd_other"];
  17 -> 19 [label="rfo_other"];
  17 -> 19 [label="upg_other"];
  17 -> 19 [label="upg_stc_other"];
  17 -> 18 [label="dcb_flush_other"];
  17 -> 18 [label="dcb_zero_other"];
  17 -> 18 [label="dcb_inval_other"];
  17 -> 3 [label="dat"];
  17 -> 17 [label="icb_inval_other"];
  18 -> 1 [label="dat"];
  19 -> 1 [label="dat"];
  19 -> 19 [label="rfo_other"];
  19 -> 19 [label="upg_other"];
  19 -> 19 [label="upg_stc_other"];
  19 -> 19 [label="dcb_zero_other"];
  19 -> 19 [label="rd_other"];
  19 -> 19 [label="rd_inst_other"];
  19 -> 19 [label="wb_other"];
  19 -> 19 [label="dcb_flush_other"];
  19 -> 19 [label="icb_inval_other"];
  19 -> 19 [label="dcb_inval_other"];
  20 -> 24 [label="rfo_self"];
  21 -> 24 [label="upg_self"];
  22 -> 24 [label="upg_self"];
  20 -> 24 [label="upg_self"];
  21 -> 5 [label="dcb_zero_self"];
  22 -> 5 [label="dcb_zero_self"];
  20 -> 5 [label="dcb_zero_self"];
  20 -> 1 [label="dcb_flush_self"];
  20 -> 20 [label="icb_inval_self"];
  20 -> 1 [label="dcb_inval_self"];
  20 -> 23 [label="dat"];
  20 -> 20 [label="rfo_other"];
  20 -> 20 [label="upg_other"];
  20 -> 20 [label="upg_stc_other"];
  20 -> 20 [label="dcb_zero_other"];
  20 -> 20 [label="rd_other"];
  22 -> 22 [label="rd_other"];
  20 -> 20 [label="rd_inst_other"];
  20 -> 20 [label="wb_other"];
  20 -> 20 [label="dcb_flush_other"];
  20 -> 20 [label="icb_inval_other"];
  20 -> 20 [label="dcb_inval_other"];
  23 -> 5 [label="rfo_self"];
  23 -> 5 [label="upg_self"];
  24 -> 26 [label="rd_inst_other"];
  24 -> 26 [label="rd_other"];
  24 -> 25 [label="rfo_other"];
  24 -> 25 [label="upg_other"];
  24 -> 7 [label="dat"];
  25 -> 25 [label="dat"];
  25 -> 1 [label="l1_rsp_rfo_other"];
  25 -> 25 [label="rfo_other"];
  25 -> 25 [label="upg_other"];
  25 -> 25 [label="upg_stc_other"];
  25 -> 25 [label="dcb_zero_other"];
  25 -> 25 [label="rd_other"];
  25 -> 25 [label="rd_inst_other"];
  25 -> 25 [label="wb_other"];
  25 -> 25 [label="dcb_flush_other"];
  25 -> 25 [label="icb_inval_other"];
  25 -> 25 [label="dcb_inval_other"];
  26 -> 26 [label="dat"];
  26 -> 26 [label="rd_other"];
  26 -> 27 [label="rfo_other"];
  26 -> 27 [label="upg_other"];
  26 -> 7 [label="l1_rsp_rd_other"];
  27 -> 27 [label="dat"];
  27 -> 1 [label="l1_rsp_rfo_other"];
  27 -> 1 [label="l1_rsp_rd_other"];
  27 -> 27 [label="rfo_other"];
  27 -> 27 [label="upg_other"];
  27 -> 27 [label="upg_stc_other"];
  27 -> 27 [label="dcb_zero_other"];
  27 -> 27 [label="rd_other"];
  27 -> 27 [label="rd_inst_other"];
  27 -> 27 [label="wb_other"];
  27 -> 27 [label="dcb_flush_other"];
  27 -> 27 [label="icb_inval_other"];
  27 -> 27 [label="dcb_inval_other"];
  21 -> 5 [label="upg_self"];
  28 -> 5 [label="upg_self"];
  21 -> 21 [label="rd_other"];
  28 -> 28 [label="rd_other"];
  21 -> 21 [label="rfo_other"];
  28 -> 21 [label="rfo_other"];
  21 -> 21 [label="upg_other"];
  28 -> 21 [label="upg_other"];
  21 -> 21 [label="upg_stc_other"];
  28 -> 21 [label="upg_stc_other"];
  29 -> 5 [label="upg_stc_self"];
  29 -> 30 [label="upg_stc_other"];
  29 -> 30 [label="rfo_other"];
  29 -> 30 [label="upg_other"];
  30 -> 1 [label="upg_stc_self"];
  30 -> 30 [label="rfo_other"];
  30 -> 30 [label="upg_other"];
  30 -> 30 [label="upg_stc_other"];
  30 -> 30 [label="dcb_zero_other"];
  30 -> 30 [label="rd_other"];
  30 -> 30 [label="rd_inst_other"];
  30 -> 30 [label="wb_other"];
  30 -> 30 [label="dcb_flush_other"];
  30 -> 30 [label="icb_inval_other"];
  30 -> 30 [label="dcb_inval_other"];
  22 -> 5 [label="upg_self"];
  31 -> 5 [label="upg_self"];
  22 -> 22 [label="rd_other"];
  31 -> 31 [label="rd_other"];
  22 -> 22 [label="rfo_other"];
  31 -> 22 [label="rfo_other"];
  22 -> 22 [label="upg_other"];
  31 -> 22 [label="upg_other"];
  22 -> 22 [label="upg_stc_other"];
  31 -> 22 [label="upg_stc_other"];
  22 -> 5 [label="dcb_zero_self"];
  31 -> 5 [label="dcb_zero_self"];
  32 -> 5 [label="upg_stc_self"];
  32 -> 32 [label="rd_other"];
  32 -> 30 [label="upg_stc_other"];
  32 -> 30 [label="upg_other"];
  32 -> 30 [label="rfo_other"];
  33 -> 33 [label="l1_wb"];
  33 -> 33 [label="dcb_flush_other"];
  33 -> 33 [label="dcb_inval_other"];
  33 -> 33 [label="icb_inval_other"];
  33 -> 37 [label="rd_other"];
  33 -> 36 [label="rfo_other"];
  33 -> 36 [label="dcb_zero_other"];
  33 -> 1 [label="wb_self"];
  34 -> 34 [label="l1_wb"];
  34 -> 1 [label="wb_self"];
  34 -> 35 [label="rfo_other"];
  34 -> 34 [label="rd_other"];
  34 -> 34 [label="rd_inst_other"];
  35 -> 35 [label="l1_wb"];
  35 -> 1 [label="wb_self"];
  36 -> 1 [label="wb_self"];
  37 -> 37 [label="l1_wb"];
  37 -> 1 [label="wb_self"];
  37 -> 37 [label="icb_inval_self"];
  37 -> 1 [label="dcb_flush_self"];
  37 -> 1 [label="dcb_inval_self"];
  37 -> 37 [label="rd_other"];
  37 -> 36 [label="rfo_other"];
  37 -> 36 [label="upg_other"];
  37 -> 36 [label="upg_stc_other"];
  38 -> 7 [label="validate"];
  38 -> 7 [label="l1_rsp_rd_other"];
  38 -> 7 [label="l1_wb"];
  38 -> 7 [label="dcb_flush"];
  39 -> 6 [label="validate"];
  39 -> 33 [label="l1_rsp_evict"];
  39 -> 6 [label="l1_rsp_rfo_other"];
  39 -> 6 [label="l1_rsp_upg_other"];
  39 -> 6 [label="l1_wb"];
  39 -> 6 [label="dcb_flush"];
  40 -> 40 [label="l1_wb"];
  40 -> 1 [label="e_victimizer_ordered"];
  40 -> 40 [label="rd_other"];
  40 -> 40 [label="rfo_other"];
  41 -> 1 [label="icb_inval_self"];
  42 -> 3 [label="icb_inval_self"];
  43 -> 5 [label="icb_inval_self"];
  44 -> 6 [label="icb_inval_self"];
  45 -> 2 [label="icb_inval_self"];
  46 -> 7 [label="icb_inval_self"];
}
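For reference, assuming the first description above is saved as l2_dtag.dot, a command line such as dot -Tps l2_dtag.dot -o l2_dtag.ps should render the DTAG state machine to PostScript, and dotty l2_dtag.dot should display it interactively; exact invocations may vary with the graphviz version.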