A Simulation Framework for Evaluating Location Consistency Based Cache

LC-SIM: A SIMULATION FRAMEWORK FOR EVALUATING LOCATION CONSISTENCY BASED CACHE PROTOCOLS by Pouya Fotouhi A thesis submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering Spring 2017 c 2017 Pouya Fotouhi All Rights Reserved LC-SIM: A SIMULATION FRAMEWORK FOR EVALUATING LOCATION CONSISTENCY BASED CACHE PROTOCOLS by Pouya Fotouhi Approved: Guang R. Gao,Ph.D. Professor in charge of thesis on behalf of the Advisory Committee Approved: Kenneth E. Barner, Ph.D. Chair of the Department of Electrical and Computer Engineering Approved: Babatunde A. Ogunnaike, Ph.D. Dean of the College of Engineering Approved: Ann L. Ardis, Ph.D. Senior Vice Provost for Graduate and Professional Education ACKNOWLEDGMENTS I would like to thank Professor Gao for giving me the opportunity of joining CAPSL and multi-dimensional learning experience. With special thanks to Dr. StéphaneZuckerman for guiding me step by step over the research, and my colleague Jose Monsalve Diaz for deep discussions and his technical help. Very special thanks to my wife Elnaz , and also my parents for their support and love. iii TABLE OF CONTENTS LIST OF FIGURES ::::::::::::::::::::::::::::::: vi ABSTRACT ::::::::::::::::::::::::::::::::::: ix Chapter 1 INTRODUCTION :::::::::::::::::::::::::::::: 1 2 BACKGROUND ::::::::::::::::::::::::::::::: 4 2.1 An Introduction to Memory Consistency Models :::::::::::: 5 2.1.1 Uniform Memory Consistency Models :::::::::::::: 6 2.1.1.1 Sequential Consistency ::::::::::::::::: 7 2.1.1.2 Total Store Ordering :::::::::::::::::: 8 2.1.1.3 Coherence :::::::::::::::::::::::: 10 2.1.2 Hybrid Memory Consistency Models :::::::::::::: 10 2.1.2.1 Weak Ordering (Weak Consistency) :::::::::: 11 2.1.2.2 Release Consistency :::::::::::::::::: 11 2.1.2.3 Location Consistency :::::::::::::::::: 12 2.2 Cache Coherence ::::::::::::::::::::::::::::: 14 2.2.1 MSI :::::::::::::::::::::::::::::::: 16 2.2.2 MESI ::::::::::::::::::::::::::::::: 17 2.2.3 MOESI :::::::::::::::::::::::::::::: 18 3 THE PROBLEM WITH COHERENCE :::::::::::::::: 22 iv 4 LC-CACHE: TOWARD A SCALABLE CACHE PROTOCOL FOR EXASCALE SYSTEMS ::::::::::::::::::::::: 23 4.1 LC-Cache Protocol :::::::::::::::::::::::::::: 23 4.2 Weaknesses of LC Cache ::::::::::::::::::::::::: 29 5 LCCSIM: A SIMULATION FRAMEWORK TO COMPARE CACHE PROTOCOLS ::::::::::::::::::::::::::: 32 5.1 Implementation details :::::::::::::::::::::::::: 32 5.2 Simulator features ::::::::::::::::::::::::::::: 33 5.3 Limitations of LCCSim :::::::::::::::::::::::::: 39 6 EXPERIMENTAL RESULTS ::::::::::::::::::::::: 41 6.1 Private Accesses :::::::::::::::::::::::::::::: 42 6.2 Shared accesses - Non-synchronizing ::::::::::::::::::: 45 6.3 Shared accesses - Synchronizing ::::::::::::::::::::: 49 6.4 Combined accesses :::::::::::::::::::::::::::: 52 7 RELATED WORK ::::::::::::::::::::::::::::: 55 8 CONCLUSION AND FUTURE WORK :::::::::::::::: 57 REFERENCES :::::::::::::::::::::::::::::::::: 58 v LIST OF FIGURES 2.1 Motivating example for memory consistency models ::::::::: 5 2.2 One possible execution trace of the example presented in fig 2.1 :: 6 2.3 A simple example for uniform consistency models. The question is to see if a model allows both r1 and r2 to be equal to 0. :::::::: 8 2.4 One possible order of accesses in a system using TSO :::::::: 9 2.5 A valid trace of accesses under coherence ::::::::::::::: 10 2.6 State diagram for MSI cache coherence protocol ::::::::::: 17 2.7 Operation of MSI protocol ::::::::::::::::::::::: 17 2.8 State diagram for MESI cache coherence protocol :::::::::: 19 2.9 Operation of MESI protocol :::::::::::::::::::::: 19 2.10 State diagram for MOESI cache coherence protocol ::::::::: 20 2.11 Operation of MOESI protocol ::::::::::::::::::::: 21 4.1 State transitions - LC cache protocol ::::::::::::::::: 24 4.2 Atomic increments on shared location x ::::::::::::::: 26 4.3 Single producer with multiple asynchronous consumers ::::::: 27 4.4 An example to highlight the difference between LC and RC :::: 29 4.5 State transitions and values after each instruction in figure 4.4. LP 1 = 1(D) means the value of location L in processor P1 is equal to 1 and is in state D. LMM refers to the value of location L in main memory. ::::::::::::::::::::::::::::::::: 31 vi 5.1 Transitions in MSI protocol :::::::::::::::::::::: 34 5.2 Transitions in MESI protocol ::::::::::::::::::::: 35 5.3 Transitions in MOESI protocol :::::::::::::::::::: 36 5.4 Transitions in MOESI protocol - Cont'd ::::::::::::::: 37 5.5 Control messages in LCCSim ::::::::::::::::::::: 39 6.1 Experiment details ::::::::::::::::::::::::::: 41 6.2 Private accesses - Total access latency over number of cores :::: 42 6.3 Private accesses - Traffic on-chip over number of cores ::::::: 43 6.4 Private accesses - Total access latency over ratio of write operations 43 6.5 Private accesses - Traffic on-chip over ratio of write operations ::: 43 6.6 Private accesses - Total access latency over number of locations :: 44 6.7 Private accesses - Traffic on-chip over number of locations ::::: 44 6.8 Non-synchronizing shared accesses - Total access latency over number of cores ::::::::::::::::::::::::::::::::: 45 6.9 Non-synchronizing shared accesses - Traffic on-chip over number of cores ::::::::::::::::::::::::::::::::::: 46 6.10 Non-synchronizing shared accesses - Total access latency over ratio of write operations ::::::::::::::::::::::::::::: 47 6.11 Non-synchronizing shared accesses - Traffic on-chip over ratio of write operations :::::::::::::::::::::::::::::::: 47 6.12 Non-synchronizing shared accesses - Total access latency over number of locations ::::::::::::::::::::::::::::::: 48 6.13 Non-synchronizing shared accesses - Traffic on-chip over number of locations ::::::::::::::::::::::::::::::::: 48 vii 6.14 Non-synchronizing shared accesses - Total access latency for 256 locations ::::::::::::::::::::::::::::::::: 49 6.15 Synchronizing shared accesses - Total access latency over number of cores ::::::::::::::::::::::::::::::::::: 49 6.16 Synchronizing shared accesses - Traffic on-chip over number of cores 50 6.17 Synchronizing shared accesses - Total access latency over ratio of write operations ::::::::::::::::::::::::::::: 50 6.18 Synchronizing shared accesses - Traffic on-chip over ratio of write operations :::::::::::::::::::::::::::::::: 51 6.19 Synchronizing shared accesses - Total access latency over number of locations ::::::::::::::::::::::::::::::::: 51 6.20 Synchronizing shared accesses - Traffic on-chip over number of locations ::::::::::::::::::::::::::::::::: 52 6.21 Combined accesses - Total access latency over number of cores ::: 53 6.22 Combined accesses - Traffic on-chip over number of cores :::::: 53 6.23 Combined accesses - Total access latency over ratio of write operations :::::::::::::::::::::::::::::::: 54 6.24 Combined accesses - Traffic on-chip over ratio of write operations : 54 viii ABSTRACT New high-performance processors tend to shift from multi to many cores. More- over, shared memory has turned to dominant paradigm for mainstream multicore processors. As memory wall issue loomed over architecture design, most modern computer systems have several layers in their memory hierarchy. Among many, caches has be- come everlasting components of memory hierarchies as they significantly reduce access time by taking the advantage of locality. Major processor vendors usually rely on cache coherence, and implement a vari- ant of MESI, e.g., MOESI for AMD, to help reduce inter-chip traffic on the fast in- terconnection network. Supposedly, maintaining coherence should help with keeping parallel and concurrent programmers happy, all the while providing them with a well- known cache behavior for shared memory. This thesis challenge the assumption that Coherence is well-suited for large-scale many core processors. Seeking an alternative for coherence, LC cache protocol is extensively investigated. LC-Cache is a cache protocol weaker than Coherence, but which preserves causality. It relies on the Location Consistency (LC) model. The basic philosophy behind LC is to maintain a unique view of memory only if there is a reason to. Other ordinary memory accesses may be observed in any order by the other processors of the computer system. The motivation to stand against cache coherence, relies on underestimated limitations implied on system design by coherence. Observations presented in this thesis, demonstrates that coherence eliminates the possibility of having a directory based protocol in practice since size of such directory grows linearly with number of cores. In addition, coherence adds implicit latency in many cases to the protocol. ix This thesis presents LCCSim, a simulation framework to compare cache protocol based on location consistency against cache coherence protocols. A comparative analysis between the MESI and MOESI coherence protocols is provided, and pit them against LC-Cache. Both MESI and MOESI consistently generate more on-chip traffic compared to LC cache since transitions in LC cache are done locally. However, LC cache degrades total latency of accesses as it does not take the advantage of cache to cache forwarding. Additionally, LC cache cannot be considered a true implementation based on LC since it does not behave according to the memory model. The following summarizes the contributions of this thesis: • Detailed specification of LC cache protocol, covering the missing aspects in the original paper. • A simulation framework to compare cache protocols

A Simulation Framework for Evaluating Location Consistency Based Cache

Directed Test Generation for Validation of Cache Coherence Protocols Yangdi Lyu, Xiaoke Qin, Mingsong Chen, Member, IEEE and Prabhat Mishra, Senior Member, IEEE

Page 1 Cache Coherence in More Detail

Tightly-Coupled and Fault-Tolerant Communication in Parallel Systems

Verification of Hierarchical Cache Coherence Protocols for Future Processors

Study and Performance Analysis of Cache-Coherence Protocols in Shared-Memory Multiprocessors

Models for Addressing Cache Coherence Problems 9 INTERNATIONAL JOURNAL of NATURAL and APPLIED SCIENCES (IJNAS), VOL

A Primer on Memory Consistency and CACHE COHERENCE CONSISTENCY on MEMORY a PRIMER and Cache Coherence Consistency and Daniel J

Cache Coherence Protocols

Cache Coherence and Mesi Protocol