Software Coherence in Multiprocessor Memory Systems

William Joseph Bolosky
Technical Report 456
May 1993

[NASA-CR-194696] [N94-21232] Software Coherence in Multiprocessor Memory Systems. Ph.D. Thesis (Rochester Univ.)

UNIVERSITY OF ROCHESTER
COMPUTER SCIENCE

Software Coherence in Multiprocessor Memory Systems

by
William Joseph Bolosky

Submitted in Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY

Supervised by Professor Michael L. Scott
Department of Computer Science
College of Arts and Science
University of Rochester
Rochester, New York
1993

To R. R. Camp III

Curriculum Vitae

William J. Bolosky was born in on . He attended California State College in California, Pennsylvania from 1977 through 1983. He completed a Bachelor's degree with University Honors in Mathematics at Carnegie-Mellon University in 1986. After working as a research staff member with Carnegie-Mellon's Mach project, he began graduate studies at the University of Rochester in the fall of 1987, studying Computer Science under Professor Michael L. Scott. In 1989, he received a Master of Science in Computer Science from the University of Rochester. In 1992, he accepted a position as a Researcher with the Microsoft Corporation in Redmond, WA. He received a Sproull Fellowship for graduate studies at the University of Rochester in 1987, and a DARPA/NASA Fellowship in Parallel Processing in 1991.

Acknowledgments

While my name is the only one listed as the author of this document, it is hardly the case that all the work described herein is mine alone. Rather, the bulk of this dissertation is derived from work published jointly with others. Chief among them is my advisor, Michael Scott; without his help none of this could have happened. During the ACE project, Bob Fitzgerald was a constant source of inspiration and thought-provoking criticism. Rob Fowler and Alan Cox were very helpful in refining the work.
Tom LeBlanc has been helpful throughout my tenure at Rochester as a sounding board for my ideas, and Charles Merriam was there when he was needed. Alexander Brinkman provided an extremely helpful, very thorough reading of the dissertation prior to my defense. Cezary Dubnicki and Jack Veenstra provided insight into my traces and the applications from which they were generated, as well as the general topic of shared memory multiprocessor memory systems. Prior to my arrival at Rochester (and subsequent to my departure), Rick Rashid served both as a mentor and an inspiration to me.

The work was supported by a University of Rochester Sproull Fellowship, two IBM Summer Internships and a DARPA/NASA Fellowship for Parallel Processing. Furthermore, the ACE hardware used for much of this work was generously lent to the University of Rochester by IBM, through the good graces of Armando Garcia, Bob Fitzgerald and Colin Harrison.

Abstract

Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This dissertation explores this issue both by implementing a software-maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces.
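The coherence requirement described above (a value written by one processor must be seen by any other processor that subsequently reads the same address) can be illustrated with a minimal invalidation-based sketch. This is a hypothetical toy whose class and method names are invented for illustration; it is not the Mach-based NUMA manager implemented in this dissertation:

```python
# Toy page-granularity software coherence via invalidation (hypothetical
# sketch). A write makes the writer the sole holder of the page; a read
# by a non-holder counts as a remote fetch and replicates the page locally.

class SoftwareCoherence:
    def __init__(self, num_procs):
        self.num_procs = num_procs
        self.holders = {}        # page -> set of processors with a valid copy
        self.contents = {}       # page -> current value (stand-in for data)
        self.remote_fetches = 0  # cost metric: reads that had to go remote

    def write(self, proc, page, value):
        # Invalidate all other copies; the writer holds the page exclusively.
        self.contents[page] = value
        self.holders[page] = {proc}

    def read(self, proc, page):
        copies = self.holders.setdefault(page, set())
        if proc not in copies:
            # Page fault: fetch the page and replicate it locally.
            self.remote_fetches += 1
            copies.add(proc)
        return self.contents.get(page)

mem = SoftwareCoherence(num_procs=2)
mem.write(0, page=7, value=42)
assert mem.read(1, 7) == 42      # coherence: processor 1 sees the new value
assert mem.remote_fetches == 1   # its first read faulted and fetched remotely
assert mem.read(1, 7) == 42      # now local: no additional fetch
assert mem.remote_fetches == 1
mem.write(0, 7, 99)              # invalidates processor 1's replica
assert mem.read(1, 7) == 99
assert mem.remote_fetches == 2
```

Under a scheme like this the hardware never tracks sharing; page faults and traps do the work, which is why a small page size and a fast trap mechanism matter so much to its performance.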
It finds that in properly built systems, software-maintained coherence can perform comparably to, or even better than, hardware-maintained coherence. The architectural features necessary for software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding.

Table of Contents

Curriculum Vitae
Acknowledgments
Abstract
List of Tables
List of Figures

1 Introduction
  1.1 Data Locality and the Coherence Problem
    1.1.1 Methods of Implementing Coherence
  1.2 Outline of the Argument Supporting the Thesis
  1.3 Related Work
    1.3.1 Coherent Caching
    1.3.2 NUMA Systems
    1.3.3 Distributed Shared Memory
    1.3.4 Multiprocessor Tracing Techniques
    1.3.5 Weak Coherence

2 Application Styles, Tracing and the Application Set
  2.1 Application Programming Styles
    2.1.1 C-Threads
    2.1.2 EPEX
    2.1.3 Presto
  2.2 Applications
    2.2.1 EPEX applications
    2.2.2 gauss
    2.2.3 chip
    2.2.4 bsort and kmerge
    2.2.5 plytrace
    2.2.6 sorbyr and sorbyc
    2.2.7 matmult
    2.2.8 mp3d
    2.2.9 cholesky
    2.2.10 water
    2.2.11 p-gauss
    2.2.12 p-qsort
    2.2.13 p-matmult
    2.2.14 p-life
  2.3 The ACE Tracer
    2.3.1 Kernel Modifications to Support Tracing
    2.3.2 The trace-saver Program

3 Implementation of a Simple Kernel-Based NUMA System
  3.1 Implementing Software Coherence on the ACE
    3.1.1 The IBM ACE Multiprocessor Workstation
    3.1.2 The Mach Virtual Memory System
    3.1.3 The ACE pmap layer
      The NUMA Manager
      A Simple NUMA Policy That Limits Page Movement
      Changes to the Mach pmap Interface to Support NUMA
  3.2 Performance Results
    3.2.1 Evaluating Page Placement
    3.2.2 The Application Programs
    3.2.3 Results
    3.2.4 Page Placement Overhead
  3.3 Discussion
    3.3.1 The Two-Level NUMA Model
    3.3.2 Making Placement Decisions
    3.3.3 Mach as a Base for NUMA Systems
  3.4 Whence from Here?

4 A Model of Program Execution in Shared Memory Systems
  4.1 A Model of Memory System Behavior
    4.1.1 Machines
    4.1.2 Traces
    4.1.3 Placements and Policies
    4.1.4 Cost
    4.1.5 Optimality
  4.2 Computing Optimal NUMA Placements
    4.2.1 Computing Optimality Without Replication
    4.2.2 Incorporating Replication
  4.3 Validation of the Trace Analysis Technique
  4.4 Discussion

5 NUMA Policies and Their Relation to Memory Architecture
  5.1 Implementable (Non-Optimal) Policies
  5.2 Experiments
    5.2.1 Performance of the Various Policies
      The Importance of the NUMA Problem
      The Success of Simple Policies
      The Importance of Programming Style
      The Impact of Memory Architecture
  5.3 Variation of Architectural Parameters
    5.3.1 Case Study: Successive Over-Relaxation
  5.4 Summary

6 Comparative Performance of Coherently Cached, NUMA, and DSM Architectures on Shared Memory Programs
  6.1 Machine Models
    6.1.1 The Machine Models
    6.1.2 Computing Cost Numbers for the Machines
  6.2 Experimental Results
    6.2.1 Base Machine Model Results
    6.2.2 Comparing the Machine Models Using the Same Block Size in All
    6.2.3 Varying the Block Size
    6.2.4 The Effect of 0 6
  6.3 The Effect of Reducing the Page Size on TLB Miss Rates
    6.3.1 Virtually Tagged Caches
    6.3.2 TLB Design
    6.3.3 Results
  6.4 Conclusions

7 False Sharing and its Effect on Shared Memory Performance
  7.1 Definitions of False Sharing
    7.1.1 The One-Word Block Definition
    7.1.2 The Interval Definition
    7.1.3 Heuristic Interval Selection
    7.1.4 Full Duration False Sharing
    7.1.5 The Hand Tuning Method
    7.1.6 The Cost Component Method
      Example Traces Showing Fragmentation, Grouping and False Sharing
  7.2 Estimating False Sharing

8 Conclusions
  8.1 Future Work
    8.1.1 Reducing False Sharing
    8.1.2 Variable Sized Blocks
    8.1.3 Increasing the Detail of the Model
    8.1.4 More Complicated Machines
    8.1.5 Other Uses of Optimal Analysis

Bibliography

Appendix A Results for All Applications

List of Tables

2.1 Trace Sizes
3.1 Measured user times in seconds and computed model parameters
3.2 Total system time for runs on 7 processors
4.1 Percentage optimal performance change due to local and gap perturbations
6.1 Machine types considered
6.2 Formulas for computing model parameters
6.3 Parameter values for the base machine models
6.4 Performance of other architectures relative to CC+
6.5 MCPR uncorrected and corrected for additional TLB misses

List of Figures

3.1 ACE memory architecture
3.2 ACE pmap layer
3.3 NUMA manager actions to maintain coherence
4.1 Algorithm for computing optimal cost without replication
4.2 Function to compute the cost of a read-run, no global memory
4.3 Optimal policy computation, no global memory
4.4 Unmodified vs. local perturbations for CC model
5.1 MCPR for ACE hardware parameters
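The offline optimal analysis named in the abstract, and developed without replication in Chapter 4, can be sketched as a small dynamic program: given the full trace of references to one page, choose retroactively at each reference whether the page stays where it is or migrates, so that total cost is minimized. The function below is a hypothetical illustration with made-up cost values; it is not the dissertation's algorithm or its measured machine parameters:

```python
# Toy offline-optimal placement cost for a single page, no replication
# (hypothetical sketch; cost values are invented for illustration).
# refs: the trace, as a list of processor ids that touch the page, in order.
# cost[p] = cheapest achievable cost so far with the page resident at p.

def optimal_page_cost(refs, num_procs, local=1, remote=4, move=8):
    cost = [0] * num_procs           # the page may start anywhere for free
    for proc in refs:
        cheapest = min(cost)
        new_cost = []
        for p in range(num_procs):
            # Either the page was already at p, or we migrate it to p now.
            c = min(cost[p], cheapest + move)
            # Then pay for this reference: local if proc == p, else remote.
            c += local if p == proc else remote
            new_cost.append(c)
        cost = new_cost
    return min(cost)

assert optimal_page_cost([0, 0, 0], num_procs=2) == 3            # all local
assert optimal_page_cost([0, 1], num_procs=2) == 5               # one remote ref
assert optimal_page_cost([0, 0, 0, 1, 1, 1], num_procs=2) == 14  # migrate once: 3 + 8 + 3
```

Because the whole trace is known in advance, a computation of this kind yields a lower bound that no online policy can expect to match, which is what makes it a useful yardstick for the implementable policies studied later.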