MEMORY CONSISTENCY MODELS FOR SHARED-MEMORY MULTIPROCESSORS
Kourosh Gharachorloo
Technical Report: CSL-TR-95-685
December 1995
This research has been supported by DARPA contract N00039-91-C-0138. The author also acknowledges support from Digital Equipment Corporation.

Copyright 1996 by Kourosh Gharachorloo. All Rights Reserved.
Computer Systems Laboratory
Departments of Electrical Engineering and Computer Science
Stanford University
Gates Building A-408
Stanford, CA 94305-9040
[email protected]
Abstract
The memory consistency model for a shared-memory multiprocessor specifies the behavior of memory with respect to read and write operations from multiple processors. As such, the memory model influences many aspects of system design, including the design of programming languages, compilers, and the underlying hardware. Relaxed models that impose fewer memory ordering constraints offer the potential for higher performance by allowing hardware and software to overlap and reorder memory operations. However, fewer ordering guarantees can compromise programmability and portability. Many of the previously proposed models either fail to provide reasonable programming semantics or are biased toward programming ease at the cost of performance. Furthermore, the lack of consensus on an acceptable model hinders software portability across different systems.
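To make the reordering hazard concrete, consider a minimal producer-consumer sketch; the code below is illustrative only (the variable names and the use of pthreads are not from the thesis, which depicts a similar interaction in Figure 1.1). Under sequential consistency the consumer always observes the payload once it sees the flag; under a model that relaxes the write-to-write or read-to-read program order, it may observe a stale payload unless explicit memory barriers are inserted.

    #include <pthread.h>
    #include <stdio.h>

    /* Illustrative sketch: volatile keeps the compiler from caching the
       flag in a register, but imposes no ordering on the hardware.  On a
       sufficiently relaxed memory model, the consumer may print 0 instead
       of 42 unless barriers separate the two stores and the two loads. */
    volatile int data = 0;
    volatile int flag = 0;

    void *producer(void *arg) {
        data = 42;            /* write the payload                  */
        flag = 1;             /* then signal that the data is ready */
        return NULL;
    }

    void *consumer(void *arg) {
        while (flag == 0)
            ;                 /* spin until the flag is observed    */
        printf("data = %d\n", data);  /* 42 under SC; possibly stale
                                         under a relaxed model       */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

The programmability question the dissertation addresses is precisely when such idioms remain correct, and what information the programmer must supply so that an implementation may still reorder aggressively elsewhere.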
This dissertation focuses on providing a balanced solution that directly addresses the trade-off between programming ease and performance. To address programmability, we propose an alternative method for specifying memory behavior that presents a higher-level abstraction to the programmer. We show that with only a few types of information supplied by the programmer, an implementation can exploit the full range of optimizations enabled by previous models. Furthermore, the same information enables automatic and efficient portability across a wide range of implementations.
To expose the optimizations enabled by a model, we have developed a formal framework for specifying the low-level ordering constraints that must be enforced by an implementation. Based on these specifications, we present a wide range of architecture and compiler implementation techniques for efficiently supporting a given model. Finally, we evaluate the performance benefits of exploiting relaxed models based on detailed simulations of realistic parallel applications. Our results show that the optimizations enabled by relaxed models are extremely effective in hiding virtually the full latency of writes in architectures with blocking reads (i.e., processor stalls on reads), with gains as high as 80%. Architectures with non-blocking reads can further exploit relaxed models to hide a substantial fraction of the read latency as well, leading to a larger overall performance benefit. Furthermore, these optimizations complement gains from other latency hiding techniques such as prefetching and multiple contexts.
We believe that the combined benefits in hardware and software will make relaxed models universal in future multiprocessors, as is already evidenced by their adoption in several commercial systems.
Key Words and Phrases: shared-memory multiprocessors, memory consistency models, latency hiding techniques, latency tolerating techniques, relaxed memory models, sequential consistency, release consistency

Acknowledgements
Many thanks go to my advisors John Hennessy and Anoop Gupta for their continued guidance, support, and encouragement. John Hennessy's vision and enthusiasm have served as an inspiration since my early graduate days at Stanford. He has been a great source of insight and an excellent sounding board for ideas. His leadership was instrumental in the conception of the DASH project which provided a great infrastructure for multiprocessor research at Stanford. Anoop Gupta encouraged me to pursue my early ideas on memory consistency models. I am grateful for the tremendous time, energy, and wisdom that he invested in steering my research. He has been an excellent role model through his dedication to quality research.

I also thank James Plummer for graciously serving as my orals committee chairman and my third reader.

I was fortunate to be among great friends and colleagues at Stanford. In particular, I would like to thank the other members of the DASH project, especially Jim Laudon, Dan Lenoski, and Wolf-Dietrich Weber, for making the DASH project an exciting experience. I thank the following people for providing the base simulation tools for my studies: Steve Goldschmidt for TangoLite, Todd Mowry for Dixie, and Mike Johnson for the dynamic scheduled processor simulator. Charles Orgish, Laura Schrager, and Thoi Nguyen at Stanford, and Annie Warren and Jason Wold at Digital, were instrumental in supporting the computing environment. I also thank Margaret Rowland and Darlene Hadding for their administrative support.

I thank Vivek Sarkar for serving as a mentor during my first year at Stanford. Rohit Chandra, Dan Scales, and Ravi Soundararajan get special thanks for their help in proofreading the final version of the thesis. Finally, I am grateful to my friends and office mates, Paul Calder, Rohit Chandra, Jaswinder Pal Singh, and Mike Smith, who made my time at Stanford most enjoyable.

My research on memory consistency models has been enriched through collaborations with Phil Gibbons and Sarita Adve. I have also enjoyed working with Andreas Nowatzyk on the Sparc V9 RMO model, and with Jim Horning, Jim Saxe, and Yuan Yu on the Digital Alpha memory model. I thank Digital Equipment Corporation, the Western Research Laboratory, and especially Joel Bartlett and Richard Swan, for giving me the freedom to continue my work in this area after joining Digital. The work presented in this thesis represents a substantial extension to my earlier work at Stanford.

I would like to thank my friends and relatives, Ali, Farima, Hadi, Hooman, Illah, Rohit, Sapideh, Shahin, Shahrzad, Shervin, Siamak, Siavosh, and Sina, who have made these past years so enjoyable. Finally, I thank my family, my parents and brother, for their immeasurable love, encouragement, and support of my education. Most importantly, I thank my wife, Nazhin, who has been the source of happiness in my life.
To my parents Nezhat and Vali and my wife Nazhin
Contents
Acknowledgements

1 Introduction
  1.1 The Problem
  1.2 Our Approach
    1.2.1 Programming Ease and Portability
    1.2.2 Implementation Issues
    1.2.3 Performance Evaluation
  1.3 Contributions
  1.4 Organization

2 Background
  2.1 What is a Memory Consistency Model?
    2.1.1 Interface between Programmer and System
    2.1.2 Terminology and Assumptions
    2.1.3 Sequential Consistency
    2.1.4 Examples of Sequentially Consistent Executions
    2.1.5 Relating Memory Behavior Based on Possible Outcomes
  2.2 Impact of Architecture and Compiler Optimizations
    2.2.1 Architecture Optimizations
    2.2.2 Compiler Optimizations
  2.3 Implications of Sequential Consistency
    2.3.1 Sufficient Conditions for Maintaining Sequential Consistency
    2.3.2 Using Program-Specific Information
    2.3.3 Other Aggressive Implementations of Sequential Consistency
  2.4 Alternative Memory Consistency Models
    2.4.1 Overview of Relaxed Memory Consistency Models
    2.4.2 Framework for Representing Different Models
    2.4.3 Relaxing the Write to Read Program Order
    2.4.4 Relaxing the Write to Write Program Order
    2.4.5 Relaxing the Read to Read and Read to Write Program Order
    2.4.6 Impact of Relaxed Models on Compiler Optimizations
    2.4.7 Relationship among the Models
    2.4.8 Some Shortcomings of Relaxed Models
  2.5 How to Evaluate a Memory Model?
    2.5.1 Identifying the Target Environment
    2.5.2 Programming Ease and Performance
    2.5.3 Enhancing Programming Ease
  2.6 Related Concepts
  2.7 Summary

3 Approach for Programming Simplicity
  3.1 Overview of Programmer-Centric Models
  3.2 A Hierarchy of Programmer-Centric Models
    3.2.1 Properly-Labeled Model: Level One (PL1)
    3.2.2 Properly-Labeled Model: Level Two (PL2)
    3.2.3 Properly-Labeled Model: Level Three (PL3)
    3.2.4 Relationship among the Properly-Labeled Models
  3.3 Relating Programmer-Centric and System-Centric Models
  3.4 Benefits of Using Properly-Labeled Models
  3.5 How to Obtain Information about Memory Operations
    3.5.1 Who Provides the Information
    3.5.2 Mechanisms for Conveying Operation Labels
  3.6 Programs with Unsynchronized Memory Operations
    3.6.1 Why Programmers Use Unsynchronized Operations
    3.6.2 Trade-offs in Properly Labeling Programs with Unsynchronized Operations
    3.6.3 Summary for Programs with Unsynchronized Operations
  3.7 Possible Extensions to Properly-Labeled Models
    3.7.1 Requiring Alternate Information from the Programmer
    3.7.2 Choosing a Different Base Model
    3.7.3 Other Possible Extensions
  3.8 Related Work
    3.8.1 Relation to Past Work on Properly-Labeled Programs
    3.8.2 Comparison with the Data-Race-Free Models
    3.8.3 Other Related Work on Programmer-Centric Models
    3.8.4 Related Work on Programming Environments
  3.9 Summary

4 Specification of System Requirements
  4.1 Framework for Specifying System Requirements
    4.1.1 Terminology and Assumptions for Specifying System Requirements
    4.1.2 Simple Abstraction for Memory Operations
    4.1.3 A More General Abstraction for Memory Operations
  4.2 Supporting Properly-Labeled Programs
    4.2.1 Sufficient Requirements for PL1
    4.2.2 Sufficient Requirements for PL2
    4.2.3 Sufficient Requirements for PL3
  4.3 Expressing System-Centric Models
  4.4 Porting Programs Among Various Specifications
    4.4.1 Porting Sequentially Consistent Programs to System-Centric Models
    4.4.2 Porting Programs Among System-Centric Models
    4.4.3 Porting Properly-Labeled Programs to System-Centric Models
  4.5 Extensions to Our Abstraction and Specification Framework
  4.6 Related Work
    4.6.1 Relationship to other Shared-Memory Abstractions
    4.6.2 Related Work on Memory Model Specification
    4.6.3 Related Work on Sufficient Conditions for Programmer-Centric Models
    4.6.4 Work on Verifying Specifications
  4.7 Summary

5 Implementation Techniques
  5.1 Cache Coherence
    5.1.1 Features of Cache Coherence
    5.1.2 Abstraction for Cache Coherence Protocols
  5.2 Mechanisms for Exploiting Relaxed Models
    5.2.1 Processor
    5.2.2 Read and Write Buffers
    5.2.3 Caches and Intermediate Buffers
    5.2.4 External Interface
    5.2.5 Network and Memory System
    5.2.6 Summary on Exploiting Relaxed Models
  5.3 Maintaining the Correct Ordering Among Operations
    5.3.1 Relating Abstract Events in the Specification to Actual Events in an Implementation
    5.3.2 Correctness Issues for Cache Coherence Protocols
    5.3.3 Supporting the Initiation and Uniprocessor Dependence Conditions
    5.3.4 Interaction between Value, Coherence, Initiation, and Uniprocessor Dependence Conditions
    5.3.5 Supporting the Multiprocessor Dependence Chains
    5.3.6 Supporting the Reach Condition
    5.3.7 Supporting Atomic Read-Modify-Write Operations
    5.3.8 Comparing Implementations of System-Centric and Programmer-Centric Models
    5.3.9 Summary on Maintaining Correct Order
  5.4 More Aggressive Mechanisms for Supporting Multiprocessor Dependence Chains
    5.4.1 Early Acknowledgement of Invalidation and Update Requests
    5.4.2 Simple Automatic Hardware-Prefetching
    5.4.3 Exploiting the Roll-Back Mechanism in Dynamically-Scheduled Processors
    5.4.4 Combining Speculative Reads with Hardware Prefetching for Writes
    5.4.5 Other Related Work on Aggressively Supporting Multiprocessor Dependence Chains
  5.5 Restricted Interconnection Networks
    5.5.1 Broadcast Bus
    5.5.2 Hierarchies of Buses and Hybrid Designs
    5.5.3 Rings
    5.5.4 Related Work on Restricted Interconnection Networks
  5.6 Systems with Software-Based Coherence
  5.7 Interaction with Thread Placement and Migration
    5.7.1 Thread Migration
    5.7.2 Thread Placement
  5.8 Interaction with Other Latency Hiding Techniques
    5.8.1 Prefetching
    5.8.2 Multiple Contexts
    5.8.3 Synergy Among the Techniques
  5.9 Supporting Other Types of Events
  5.10 Implications for Compilers
    5.10.1 Problems with Current Compilers
    5.10.2 Memory Model Assumptions for the Source Program and the Target Architecture
    5.10.3 Reasoning about Compiler Optimizations
    5.10.4 Determining Safe Compiler Optimizations
    5.10.5 Summary of Compiler Issues
  5.11 Summary

6 Performance Evaluation
  6.1 Overview
  6.2 Architectures with Blocking Reads
    6.2.1 Experimental Framework
    6.2.2 Experimental Results
    6.2.3 Effect of Varying Architectural Assumptions
    6.2.4 Summary of Blocking Read Results
  6.3 Interaction with Other Latency Hiding Techniques
    6.3.1 Interaction with Prefetching
    6.3.2 Interaction with Multiple Contexts
    6.3.3 Summary of Other Latency Hiding Techniques
  6.4 Architectures with Non-Blocking Reads
    6.4.1 Experimental Framework
    6.4.2 Experimental Results
    6.4.3 Discussion of Non-Blocking Read Results
    6.4.4 Summary of Non-Blocking Read Results
  6.5 Related Work
    6.5.1 Related Work on Blocking Reads
    6.5.2 Related Work on Interaction with Other Techniques
    6.5.3 Related Work on Non-Blocking Reads
    6.5.4 Areas for Further Investigation
  6.6 Summary

7 Conclusions
  7.1 Thesis Summary
  7.2 Future Directions

A Alternative Definition for Ordering Chain

B General Definition for Synchronization Loop Constructs

C Subtleties in the PL3 Model
  C.1 Illustrative Example for Loop Read and Loop Write
  C.2 Simplification of the PL3 Model

D Detecting Incorrect Labels and Violations of Sequential Consistency
  D.1 Detecting Incorrect Labels
  D.2 Detecting Violations of Sequential Consistency
  D.3 Summary of Detection Techniques

E Alternative Definition for Return Value of Reads

F Reach Relation

G Aggressive Form of the Uniprocessor Correctness Condition

H Aggressive Form of the Termination Condition for Writes
  H.1 Relaxation of the Termination Condition
  H.2 Implications of the Relaxed Termination Condition on Implementations

I Aggressive Specifications for Various System-Centric Models
  I.1 Aggressive Specification of the Models
  I.2 Reach Relation for System-Centric Models
  I.3 Aggressive Uniprocessor Correctness Condition for System-Centric Models

J Extensions to Our Abstraction and Specification Framework
  J.1 Assumptions about the Result of an Execution
  J.2 External Devices
  J.3 Other Event Types
  J.4 Incorporating Various Events into Abstraction and Specification
  J.5 A More Realistic Notion for Result
  J.6 Summary on Extensions to Framework

K Subtle Issues in Implementing Cache Coherence Protocols
  K.1 Dealing with Protocol Deadlock
  K.2 Examples of Transient or Corner Cases
  K.3 Serializing Simultaneous Operations
  K.4 Cross Checking between Incoming and Outgoing Messages
  K.5 Importance of Point-to-Point Orders

L Benefits of Speculative Execution

M Supporting Value Condition and Coherence Requirement with Updates

N Subtle Implementation Issues for Load-Locked and Store-Conditional Instructions

O Early Acknowledgement of Invalidation and Update Requests

P Implementation of Speculative Execution for Reads
  P.1 Example Implementation of Speculative Execution
  P.2 Illustrative Example for Speculative Reads

Q Implementation Issues for a More General Set of Events
  Q.1 Instruction Fetch
  Q.2 Multiple Granularity Operations
  Q.3 I/O Operations
  Q.4 Other Miscellaneous Events
  Q.5 Summary of Ordering Other Event Types

Bibliography
List of Tables

3.1 Sufficient program order and atomicity conditions for the PL models.
3.2 Unnecessary program order and atomicity conditions for the PL models.
4.1 Sufficient mappings for achieving sequential consistency.
4.2 Sufficient mappings for extended versions of models.
4.3 Sufficient mappings for porting PL programs to system-centric models.
4.4 Sufficient mappings for porting PL programs to system-centric models.
4.5 Sufficient mappings for porting PL programs to system-centric models.
4.6 Porting PL programs to extended versions of some system-centric models.
5.1 Messages exchanged within the processor cache hierarchy. [wt] and [wb] mark messages particular to write through or write back caches, respectively. (data) and (data*) mark messages that carry data, where (data*) is a subset of a cache line.
5.2 How various models inherently convey ordering information.
6.1 Latency for various memory system operations in processor clocks. Numbers are based on 33MHz processors.
6.2 Description of benchmark applications.
6.3 General statistics on the applications. Numbers are aggregated for 16 processors.
6.4 Statistics on shared data references and synchronization operations, aggregated for all 16 processors. Numbers in parentheses are rates given as references per thousand instructions.
6.5 Statistics on branch behavior.
List of Figures

1.1 Example producer-consumer interaction.
2.1 Various interfaces between programmer and system.
2.2 Conceptual representation of sequential consistency (SC).
2.3 Sample programs to illustrate sequential consistency.
2.4 Examples illustrating instructions with multiple memory operations.
2.5 Examples illustrating the need for synchronization under SC.
2.6 A typical scalable shared-memory architecture.
2.7 Reordering of operations arising from distribution of memory and network resources.
2.8 Non-atomic behavior of writes due to caching.
2.9 Non-atomic behavior of writes to the same location.
2.10 Program segments before and after register allocation.
2.11 Representation for the sequential consistency (SC) model.
2.12 The IBM-370 memory model.
2.13 The total store ordering (TSO) memory model.
2.14 Example program segments for the TSO model.
2.15 The processor consistency (PC) model.
2.16 The partial store ordering (PSO) model.
2.17 The weak ordering (WO) model.
2.18 Comparing the WO and RC models.
2.19 The release consistency (RC) models.
2.20 The Alpha memory model.
2.21 The Relaxed Memory Order (RMO) model.
2.22 An approximate representation for the PowerPC model.
2.23 Example program segments for the PowerPC model.
2.24 Relationship among models according to the "stricter" relation.
2.25 Trade-off between programming ease and performance.
2.26 Effect of enhancing programming ease.
3.1 Exploiting information about memory operations.
3.2 Example of program order and conflict order.
3.3 Categorization of read and write operations for PL1.
3.4 Program segments with competing operations.
3.5 Possible reordering and overlap for PL1 programs.
3.6 Categorization of read and write operations for PL2.
3.7 Program segment from a branch-and-bound algorithm.
3.8 Possible reordering and overlap for PL2 programs.
3.9 Categorization of read and write operations for PL3.
3.10 Example program segments: (a) critical section, (b) barrier.
3.11 Example program segments with loop read and write operations.
3.12 Possible reordering and overlap for PL3 programs.
3.13 Example with no explicit synchronization.
4.1 Simple model for shared memory.
4.2 Conditions for SC using simple abstraction of shared memory.
4.3 General model for shared memory.
4.4 Conservative conditions for SC using general abstraction of shared memory.
4.5 Example producer-consumer interaction.
4.6 Scheurich and Dubois' conditions for SC.
4.7 Aggressive conditions for SC.
4.8 Examples illustrating the need for the initiation and termination conditions.
4.9 Conservative conditions for PL1.
4.10 Examples to illustrate the need for the reach condition.
4.11 Examples to illustrate the aggressiveness of the reach relation.
4.12 Example to illustrate optimizations with potentially non-terminating loops.
4.13 Unsafe optimizations with potentially non-terminating loops.
4.14 Sufficient conditions for PL1.
4.15 Sufficient conditions for PL2.
4.16 Sufficient conditions for PL3.
4.17 Original conditions for RCsc.
4.18 Equivalent aggressive conditions for RCsc.
4.19 Example to illustrate the behavior of the RCsc specifications.
4.20 Aggressive conditions for RCpc.
4.21 Relationship among models (arrow points to stricter model).
4.22 Examples to illustrate porting SC programs to system-centric models.
5.1 Cache hierarchy and buffer organization.
5.2 Typical architecture for distributed shared memory.
5.3 Example of out-of-order instruction issue.
5.4 Example of buffer deadlock.
5.5 Alternative buffer organizations between caches.
5.6 Conditions for SC with reference to relevant implementation sections.
5.7 Typical scalable shared memory architecture.
5.8 Generic conditions for a cache coherence protocol.
5.9 Updates without enforcing the coherence requirement.
5.10 Example showing interaction between the various conditions.
5.11 Simultaneous write operations.
5.12 Multiprocessor dependence chains in the SC specification.
5.13 Multiprocessor dependence chains in the PL1 specification.
5.14 Examples for the three categories of multiprocessor dependence chains.
5.15 Protocol options for a write operation.
5.16 Subtle interaction caused by eager exclusive replies.
5.17 Difficulty in aggressively supporting RMO.
5.18 Merging writes assuming the PC model.
5.19 Semantics of read-modify-write operations.
5.20 Example illustrating early invalidation acknowledgements.
5.21 Order among commit and completion events.
5.22 Multiprocessor dependence chain with a R $\xrightarrow{co}$ W conflict order.