UNIVERSITY OF CALIFORNIA RIVERSIDE

Imposing Minimal Memory Ordering on Multiprocessors

A Dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

by

Changhui Lin

August 2013

Dissertation Committee:

Dr. Rajiv Gupta, Chairperson
Dr. Gianfranco Ciardo
Dr. Iulian Neamtiu
Dr. Philip Brisk

Copyright by
Changhui Lin
2013

The Dissertation of Changhui Lin is approved:

Committee Chairperson

University of California, Riverside

Acknowledgments

I owe my gratitude to all those people who have made this dissertation possible and because

of whom my graduation experience has been one that I will cherish forever.

Foremost, I would like to express my sincere gratitude to my advisor, Dr. Rajiv Gupta, for leading me to today's achievements. His perpetual enthusiasm in research,

meticulous scholarship, extensive knowledge and insightful vision have influenced me immensely in the course of performing research. More importantly, he deeply cares about his

students. I really appreciate his belief in my potential, which allows me to explore problems

extensively. From him, I have learned innumerable lessons and insights on the working of

academic research in general, which have helped me come a long way and will always guide

me in the future. Thanks, Dr. Gupta!

I would like to thank my other dissertation committee members, Dr. Gianfranco

Ciardo, Dr. Iulian Neamtiu and Dr. Philip Brisk for taking their time to help me improve

this dissertation.

I would like to express my gratitude to all the members of my research group

including Vijay Nagarajan, Dennis Jeffrey, Chen Tian, Min Feng, Kishore Kumar Pusukuri,

Yan Wang, and Sai Charan Koduru for helping me in many ways during these years. I would

also like to extend a general thank you to all teachers I have had throughout my life. I am

where I am today because of them.

Finally, I would like to thank my family, particularly my father Dianyi Lin, my

mother Youfang Zhang, and my wife Ying He, for their unconditional support throughout

my life and career.

To my wife and parents, for their endless love and support.

ABSTRACT OF THE DISSERTATION

Imposing Minimal Memory Ordering on Multiprocessors

by

Changhui Lin

Doctor of Philosophy, Graduate Program in Computer Science
University of California, Riverside, August 2013
Dr. Rajiv Gupta, Chairperson

Shared memory has been widely adopted as the primary system level programming abstraction on modern multiprocessor systems due to its ease of programming. To assist programmers in understanding program behaviors with respect to read and write operations originating from multiple processors, many memory consistency models have been proposed.

The sequential consistency (SC) memory model is the simplest and most intuitive model, but its strict memory ordering requirements can restrict many hardware and compiler optimizations that are possible in uniprocessors. For higher performance, manufacturers typically choose to support relaxed consistency models. In these models, memory fence instructions are provided to selectively override the default relaxed memory access ordering where strict ordering must be enforced for program correctness. However, memory fences are costly because they cause the processor to stall.

Although the underlying memory models or memory fence instructions require a set of memory orderings to be enforced, these orderings are not always necessary at runtime. If the reordering of two memory operations executed on one processor is not observed by any other processor, then the reordering is safe, as the program will behave the same as if they were not reordered. Thus we observe that it is possible to relax some of the memory orderings that are required in general but are unnecessary in the current execution. The goal of this

dissertation is to dynamically identify the necessary memory orderings and thus relax the

rest of the required memory orderings. The challenge of achieving this goal is to efficiently

identify the necessary memory orderings.

This dissertation explores programming, compiler, and hardware support for eliminating unnecessary memory orderings to improve program performance. First, conflict

ordering is proposed for implementing SC efficiently, where SC is enforced by explicitly ordering only conflicting memory operations in the global memory order, instead of all memory

operations. The approach achieves SC performance comparable to RMO (relaxed memory

order), without aggressive post-retirement speculation. Next, three approaches (i.e., scoped

fence, conditional fence, and address-aware fence) are proposed to relax memory orderings imposed by memory fences, representing different levels of support from programmer,

compiler and hardware. These approaches make memory fences lightweight to use.

The programming directed approach, scoped fence (S-Fence), introduces the concept of fence scope which constrains the effect of fences in programs. S-Fence enables

programmers to specify the scope of fences using a customizable fence statement. The

scope information is then encoded into binaries and conveyed to hardware. At runtime,

hardware utilizes the scope information to determine whether a fence needs to stall due to

uncompleted memory accesses in the scope. S-Fence bridges the gap between programmers’

intention and hardware execution with respect to the memory ordering enforced by fences.

The compiler directed approach, conditional fence (C-Fence), utilizes compiler information to dynamically decide if there is a need to stall at each fence. The compiler

helps to identify fence associates and incorporate this information in a C-Fence instruction. At runtime, to decide whether a fence can be dynamically eliminated, the hardware

uses the fence associate information to check if there is any associate that is executing. It

significantly reduces fence overhead, while only requiring lightweight hardware support.

The hardware directed approach, address-aware fence, utilizes hardware support to collect memory access information from other processors to form a watchlist for each fence instance. The completion of a memory access following the fence is allowed if its memory address is not contained in the watchlist, appearing as if the fence does not take effect; otherwise, the memory access is delayed to ensure correctness. Address-aware fence is implemented in the microarchitecture without instruction set support and is transparent to programmers. It has the highest precision, eliminating nearly all possible unnecessary memory orderings due to fences.

Table of Contents

List of Figures

List of Tables

1 Introduction
  1.1 Dissertation Overview
    1.1.1 Sequential consistency
    1.1.2 Fence instructions
  1.2 Dissertation Organization

2 Efficient Sequential Consistency via Conflict Ordering
  2.1 Background
  2.2 SC via Conflict Ordering
    2.2.1 Conflict ordering
    2.2.2 Proof of correctness
  2.3 Basic Hardware Design
    2.3.1 Basic conflict ordering
    2.3.2 Handling distributed directories
    2.3.3 Correctness
  2.4 Enhanced Hardware Design
    2.4.1 Limitations of basic conflict ordering
    2.4.2 Enhanced conflict ordering
    2.4.3 Hardware summary
  2.5 Experimental Evaluation
    2.5.1 Implementation
    2.5.2 Execution time overhead
    2.5.3 Sensitivity study
    2.5.4 Characterization of conflict ordering
    2.5.5 Bandwidth increase
    2.5.6 HW resources utilized
  2.6 Summary

3 Scoped Fence
  3.1 Motivation
  3.2 Scoped Fence
    3.2.1 Semantics of S-Fence
    3.2.2 Scope of a fence
  3.3 Class Scope
    3.3.1 Implementation design
    3.3.2 An example
  3.4 Set Scope
    3.4.1 Implementation design
  3.5 Using S-Fence
  3.6 Experimental Evaluation
    3.6.1 Lock-free algorithms
    3.6.2 Performance on full applications
    3.6.3 Class scope vs. set scope
    3.6.4 Sensitivity study
  3.7 Summary

4 Conditional Fence
  4.1 Fence Order
    4.1.1 The condition for enforcing fence orders
    4.1.2 Associate fences
  4.2 Conditional Fence
    4.2.1 Interprocessor delays to ensure fence orders
    4.2.2 Empirical study
    4.2.3 Conditional Fence (C-Fence)
  4.3 C-Fence Hardware: Centralized Active Table
    4.3.1 Idealized hardware
    4.3.2 Actual hardware
  4.4 C-Fence Hardware: Distributed Active Table
    4.4.1 Scalability analysis of C-Fence
    4.4.2 Design of distributed active table
  4.5 Experimental Evaluation
    4.5.1 Implementation
    4.5.2 Benchmark characteristics
    4.5.3 Execution time overhead: Conventional fence vs C-Fence
    4.5.4 Sensitivity study
    4.5.5 Impact of distributed active table
    4.5.6 HW resources utilized by C-Fence
  4.6 Summary

5 Address-aware Fence
  5.1 Motivation
  5.2 Address-aware Fence
  5.3 Hardware Design
    5.3.1 Operations on address-aware fence
    5.3.2 Hardware summary
  5.4 Experimental Evaluation
    5.4.1 Performance
    5.4.2 Space and traffic
  5.5 Summary

6 Related Work
  6.1 Hardware Support
    6.1.1 Speculative techniques
    6.1.2 Non-speculative techniques
  6.2 Compiler Support
    6.2.1 Fence insertion for enforcing sequential consistency
    6.2.2 Compiler optimizations consistent with memory models
  6.3 Programming Support
  6.4 Debugging and Verification
  6.5 Other Works
    6.5.1 Fence instructions in commercial architectures
    6.5.2 Optimizing lock implementations
    6.5.3 Formalization of memory models for commercial architectures

7 Conclusions
  7.1 Contributions
  7.2 Future Directions

Bibliography

List of Figures

1.1 Effect of reordering in the Dekker's algorithm.
1.2 Ordering memory operations with fence instructions.
1.3 Venn diagram of memory orderings.
1.4 Execution overhead of fences.
1.5 Necessary fences.
1.6 A motivating example for conflict ordering.
1.7 A common data access pattern.
1.8 Scenario in which memory ordering is not necessary.
1.9 Unnecessary fence instances at runtime.

2.1 Conflict ordering ensures SC.
2.2 predp(b1) = {b1} ∪ predp(b0) ∪ predp(a1).
2.3 Basic conflict ordering: Example.
2.4 Correctness of conflict ordering implementation: m2 cannot be a read in each of the 3 scenarios.
2.5 (a) and (b): Limitations of basic conflict ordering; (c) and (d): Enhanced conflict ordering.
2.6 Hardware support: conflict ordering.
2.7 Conventional SC normalized to RMO.
2.8 Conflict ordering normalized to RMO.
2.9 Performance for bus-based implementation.
2.10 Varying the latency of network.
2.11 Varying number of bits of write-list.
2.12 Breakdown of checks against write-lists.
2.13 Reduced accesses to directories.

3.1 Simplified Chase-Lev work-stealing queue [27].
3.2 Example – parallel spanning tree algorithm.
3.3 Customized fence statements.
3.4 Semantics of class scope.
3.5 An example of class scope.
3.6 Hardware support for class scope.
3.7 Micro-operations on fs start and fs end.
3.8 Setting fence scope bits.
3.9 Comparison between traditional fence and S-Fence.
3.10 Simplified Dekker algorithm.
3.11 Impact of workload.
3.12 Normalized execution time (T – traditional fence; S – S-Fence; T+ – traditional fence with in-window speculation; S+ – S-Fence with in-window speculation).
3.13 Performance on pst (parallel spanning tree).
3.14 Performance comparison between class scope and set scope (C.S. – class scope; S.S. – set scope).
3.15 Performance for varying memory access latencies.
3.16 Performance with different ROB sizes.

4.1 (a) Violation of fence order; (b) Violation of program order.
4.2 Associate fences.
4.3 Interprocessor delay is able to ensure fence orders.
4.4 (a) If Fence1 has already finished stalling by the time Fence2 is issued, then there is no need for Fence2 to stall. (b) In fact, there is no need for even Fence1 to stall, if all memory instructions before it complete by the time Fence2 is issued. (c) No need for either Fence1 or Fence3 to stall as they are not associates, even if they are executed concurrently.
4.5 The semantics of C-Fence.
4.6 (a) Example with 2 C-Fences; (b) Example with 3 C-Fences.
4.7 Scalability of C-Fence.
4.8 Distribution of fences.
4.9 (a) C-Fence + Conventional fence. (b) HW support. (c) Action upon issue of a C-Fence.
4.10 Coherence of active-table.
4.11 Hit rates of the instruction buffer.
4.12 Breakdown of runtime.
4.13 Percentages of request types.
4.14 Normalized time of processing a request.
4.15 Overview of distributed active table.
4.16 Distributed active table – one sub-table.
4.17 The operations of active tables.
4.18 Execution time overhead of ensuring SC: Conventional fence vs C-Fence.
4.19 Varying the number of frequent fences.
4.20 Varying the size of active table.
4.21 Varying the latency of accessing active-table.
4.22 Performance comparison for 4, 8, 16 and 32 processors.
4.23 Communication time reduction.
4.24 Speedups of distributed active table.

5.1 (a) Cycle detection; (b) Address-aware fence with watchlist.
5.2 States of address-aware fence.
5.3 An example of address-aware fence.
5.4 Collecting and clearing watchlists.
5.5 Overview of the architecture.
5.6 Execution time (T – traditional fence; A – address-aware fence).
5.7 Scalability (Tn represents traditional fence with n processors and An represents address-aware fence with n processors).

7.1 Implementing fence using RMW.

List of Tables

1.1 Trade-offs of solutions for eliminating fence overhead.

2.1 Architectural parameters.
2.2 Benchmarks.
2.3 Bandwidth increase.

3.1 The extension of ISA for class scope.
3.2 The extension of ISA for set scope.
3.3 Architectural parameters.
3.4 Benchmark description.

4.1 Study: A significant percentage of fence instances need not stall.
4.2 Architectural parameters.
4.3 Benchmark characteristics.

5.1 Comparison of existing approaches and address-aware fence (∗Compiler support required for identifying fence associates).
5.2 Fields in active buffer.
5.3 Fields in watchlist buffer.
5.4 Benchmark description.
5.5 Effectiveness of address-aware fence.
5.6 Characterization of space and traffic.

Chapter 1

Introduction

Multiprocessors are becoming ubiquitous in all computing domains, from mobile devices to datacenter servers as they are able to deliver high performance via parallelism.

With this development, the importance of parallel programming continues to grow as the onus of writing parallel programs that deliver increased performance is on the programmers.

It is a well recognized fact that writing and debugging parallel programs is not an easy task. Consequently, there has been significant research on developing programming models, memory models, and tools for making programmers’ job easier. To simplify programming for multiprocessors, shared memory is widely adopted as the primary system level programming abstraction. In uniprocessors, as long as dependences are respected, even when accesses to different memory locations are reordered, the program behaves the same as if no reordering was performed. However, in multiprocessors, when accesses to different memory locations in shared memory are effectively reordered, the program can exhibit behaviors that are different from programmers’ expectation.

    P0                           P1
    1  flag0 = 1                 4  flag1 = 1
    2  if (flag1 == 0)           5  if (flag0 == 0)
    3    critical section        6    critical section

Figure 1.1: Effect of reordering in the Dekker’s algorithm.

Consider the code segment in Fig. 1.1, which is a simplified version of Dekker's algorithm [34, 4]. This algorithm is designed to allow only one processor to enter the critical section at a time, using two shared memory variables flag0 and flag1. The two flag variables indicate an intention on the part of each processor to enter the critical section. Initially,

flag0 and flag1 are 0. When P0 attempts to enter the critical section, it first updates flag0 to 1 and checks the value of flag1. If flag1 is 0, under the most intuitive interpretation, P1 has not tried to enter the critical section, and it is safe for P0 to enter. However, this reasoning is valid only if the machine that executes the program guarantees that accesses to flag0 and flag1 are performed in the order specified by the program. If, on the other hand, the machine allows memory operations to be reordered, it is possible for P0 to first read flag1 and then update flag0. Likewise, P1 may first read flag0 and then update flag1. As a result, both reads may potentially return the value of 0, allowing both P0 and P1 to enter the critical section simultaneously. Clearly, this is not what the programmer intended.

Therefore, many memory consistency models have been proposed to constrain memory behaviors with respect to read and write operations originating from multiple processors [4]. Each model specifies a contract between programmer and system (compiler and hardware), i.e., the program's behavior will be consistent with the programmer's expectation if the programmer follows the rules of the underlying memory model. Of the various memory

consistency models, the sequential consistency (SC) model [64] is the simplest and most intuitive model. It requires that all memory operations of all processors appear to execute one at a time, and that the operations of a single processor appear to execute in the order described by the program. Moreover, to improve performance, many relaxed memory consistency models [5, 37, 47] have also been proposed that allow different subsets of memory accesses to be reordered. Each of these models offers a trade-off between programmability

and performance. Manufacturers (e.g., Intel, IBM, Sun, ARM, etc.) typically choose to support relaxed consistency models, such as total store order (TSO), relaxed memory order

(RMO), release consistency (RC), etc. [4].

Under SC, the Dekker algorithm shown in Fig. 1.1 will behave correctly, as SC guarantees that, in P0 (P1), flag1 (flag0) is read only after flag0 (flag1) has been updated.

However, under relaxed memory models, the above orderings are not guaranteed. To enforce the orderings that are critical to the correctness of programs, the fence instruction, also known as a memory fence or memory barrier, is provided to constrain memory reordering. A

fence instruction guarantees that all memory accesses prior to it have completed, before its

following memory accesses can be performed. Thus, programmers can insert fence instruc-

tions between the accesses to flag0 and flag1 to prevent reordering under relaxed memory

models, as shown in Fig. 1.2 (Line 2 and Line 6).

    P0                           P1
    1  flag0 = 1                 5  flag1 = 1
    2  FENCE                     6  FENCE
    3  if (flag1 == 0)           7  if (flag0 == 0)
    4    critical section        8    critical section

Figure 1.2: Ordering memory operations with fence instructions.
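For concreteness, the same fence placement can be expressed in C11. The sketch below is not taken from the dissertation; it simply mirrors Fig. 1.2 using C11 atomics, with the FENCE of Line 2 rendered as a sequentially consistent atomic_thread_fence.

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_int flag0 = 0, flag1 = 0;      /* both flags initially 0 */

    /* Executed by P0; P1 runs the symmetric code with the flags swapped. */
    bool try_enter_p0(void) {
        atomic_store_explicit(&flag0, 1, memory_order_relaxed);      /* Line 1 */
        atomic_thread_fence(memory_order_seq_cst);                   /* Line 2: FENCE */
        if (atomic_load_explicit(&flag1, memory_order_relaxed) == 0) /* Line 3 */
            return true;               /* safe to enter the critical section */
        return false;
    }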

As we can see, memory consistency models and fence instructions impose constraints on memory ordering. The problem is that faithfully implementing these ordering constraints disallows many possible optimizations and sacrifices performance.

(Sequential consistency) Stronger memory consistency models (e.g., SC) impose stricter constraints on memory ordering. SC matches the programmers' expectation that a parallel program behaves as an interleaving of the memory accesses from its constituent threads. This makes it easiest for programmers to understand and reason about their programs. But SC imposes strict memory ordering on programs, which restricts many compiler optimizations (e.g., load reordering, store reordering, caching values in registers) [75, 22] and hardware techniques (e.g., the write buffer) that are possible in uniprocessors, and hence hurts program performance. Indeed, most of the processor families only support relaxed memory models, which impose fewer constraints on memory ordering. However, relaxed memory models require programmers' expertise to enforce the correctness of programs, as data sharing among threads is often subtle and complex, placing heavy burdens on programmers. This sacrifices programmability for performance.

(Fence instructions) Typically, fence instructions are substantially slower than regular instructions. The fence semantics requires ordering of memory operations before and

after the fence, resulting in the stalling of the processor, which is costly [63, 102]. Under

relaxed memory models, excessive use of fences will incur high overhead. In commercial

applications, frequent thread synchronizations can result in significant ordering delays due

to fence instructions [107]. For example, in [41], Frigo et al. observe that in an Intel multiprocessor, Cilk-5's THE protocol spends half of its time executing a memory fence. As we can see, memory ordering imposed by fence instructions can also hurt program performance.

Although the underlying memory models or fence instructions require that a set of memory orderings be enforced, these orderings are not always necessary at runtime. If the reordering of two memory operations executed on one processor is not observed by any other processor, then the reordering is safe, as the program will behave the same as if they were not reordered. Thus we observe that it is possible to relax some of the memory orderings that

are in general required but are unnecessary in the current execution. In Fig. 1.3, all possible memory orderings at runtime are represented by ALL, and a subset of them (REQUIRED) are required orderings. However, only a smaller subset of them (NECESSARY) really need to be enforced in a given execution, as enforcing them leads to the same program behavior as if all memory orderings in REQUIRED are enforced. Therefore, we are able to relax the memory orderings in the set (REQUIRED - NECESSARY). By doing so, more memory operations can be reordered, reducing stalls and increasing performance. In fact, a vast majority of memory orderings are unnecessary dynamically [49, 63, 102], which provides a good opportunity to improve program performance.

Figure 1.3: Venn diagram of memory orderings: NECESSARY ⊆ REQUIRED ⊆ ALL.

Fences can be utilized to enforce sequential consistency (SC) [64] for programs

running on machines only supporting relaxed memory consistency models. The naive way

to achieve this is by inserting a fence before each memory operation. Fig. 1.4 shows the overhead due to fences for a group of benchmark programs taken from the SPLASH-2 [108] benchmark suite. The execution time is broken down into two parts: the stall time due to fences (Fence stalls) and the rest of the execution time (Execution). As we can see, these programs experience significant overhead due to fences (about 20% on average).

Figure 1.4: Execution overhead of fences (normalized execution time, broken down into Fence stalls and Execution).

Figure 1.5: Necessary fences.

    Benchmark    Total    Necessary
    barnes       83M      206
    fmm          5M       223
    ocean        9M       383
    radiosity    30M      90
    raytrace     44M      239
    water-ns     9M       79
    water-sp     7M       88
    AVG.         28.8M    218

Although static analysis is able to reduce the number of inserted fences [95, 66, 39], the overhead incurred by fences still accounts for a significant part of the overall execution time.

However, most fence instances are unnecessary dynamically. To show this, we studied the total number of fence instances encountered during execution and the number of necessary fence instances. We say a fence instance F is necessary if, at runtime, memory operations across the fence F conflict with memory operations across a fence F′ in another processor.

This is because, if the memory locations accessed by memory operations across the fence

F are not simultaneously accessed by other processors, then the fence is not necessary.

The results are shown in Fig. 1.5. Column 2 shows the total number of dynamic fence instances encountered in each program. Although they account for only a small part of all instructions (∼1%), they induce a relatively large execution time overhead.

Column 3 shows the number of necessary fences. We can see that only very few fence instances are actually needed to order the memory operations across them. This is because

(1) the execution of the program often already conforms to the effect that fences strive to enforce; and (2) since fences are inserted conservatively, many dynamic fence instances are unnecessary. Therefore, there are a great many opportunities for eliminating unnecessary memory orderings.
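The criterion used above can be pictured with the following sketch. It is an illustrative simplification (the per-instance address sets and "active" flag are assumed bookkeeping), not the instrumentation actually used for the study.

    #include <stdbool.h>

    typedef struct addr_set addr_set;               /* abstract set of addresses   */
    extern bool sets_conflict(const addr_set *a,    /* share an address, at least  */
                              const addr_set *b);   /* one of the accesses a write */

    typedef struct {
        int       proc;            /* processor that executed this fence instance */
        addr_set *before, *after;  /* addresses accessed across the fence         */
        bool      active;          /* instance still has in-flight accesses       */
    } fence_instance;

    /* f is counted as necessary if a concurrently active instance g on another
       processor accesses conflicting locations on the opposite side of its fence. */
    bool is_necessary(const fence_instance *f, const fence_instance *g) {
        return f->proc != g->proc && f->active && g->active &&
               (sets_conflict(f->before, g->after) ||
                sets_conflict(f->after,  g->before));
    }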

The goal of this dissertation is to dynamically identify the necessary memory orderings and thus relax the rest of the required memory orderings. The challenge of achieving this goal is to efficiently identify the necessary memory orderings. There has been a group of hardware techniques that aim at eliminating unnecessary stalls due to memory ordering requirements [46, 87, 49, 48, 25, 16]. Although these techniques are able to achieve good performance, the hardware complexity associated with the aggressive speculation they employ can hinder widespread adoption [3]. Many compiler techniques have also been proposed to compute a set of memory access pairs whose ordering automatically guarantees SC [39, 55, 59, 60, 66, 79, 80, 95, 98]. To ensure that these memory access pairs are not reordered, memory fences are inserted by the compiler. The main issue with these compiler techniques is performance overhead: necessary memory orderings must be identified conservatively, and thus the inserted fences can significantly slow down the program. Besides, combined HW/SW techniques [102, 63] have also been proposed to reduce memory ordering overhead. However, they are only applicable to specific applications. This dissertation is aimed at identifying minimal memory orderings without the above limitations, by exploring programming, compiler, and hardware support for eliminating unnecessary memory orderings to improve program performance.

1.1 Dissertation Overview

This dissertation presents the following four approaches to efficient SC implementation and reducing fence overhead. The first approach is a hardware solution for implementing SC efficiently. The remaining three approaches focus on reducing the overhead of fence instructions. They represent different levels of support from programmer, compiler, and hardware.

1.1.1 Sequential consistency

SC requires that memory operations appear to complete in program order. To ensure this, prior SC implementations force memory operations to explicitly complete in

program order. However, is program ordering necessary for ensuring SC? Let us consider

the example in Fig. 1.6, which shows memory operations a1, a2, b1 and b2 from processors A and B. (a1, b2) and (a2, b1) are two pairs of conflicting accesses. Furthermore, let us assume that while memory operations b1 and b2 have completed, a1 and a2 are yet to complete.

Can a2 complete before a1 in an execution without breaking SC? Prior SC implementations

forbid this and force a2 to wait until a1 completes, either explicitly or using speculation.

We observe that a1 appears to complete before a2, as long as a2 is made to complete after

b1, with which a2 conflicts. In other words, a2 can complete before a1 without breaking SC, as long as we ensure that a2 waits for b1 to complete. Indeed, if b1 and b2 have completed before a1 and a2, the execution order b1 → b2 → a2 → a1 is equivalent to the original SC order b1 → b2 → a1 → a2 as the values read in the two executions are identical.

This dissertation presents conflict ordering for implementing SC efficiently, where

SC is enforced by explicitly ordering only conflicting memory operations.

Figure 1.6: A motivating example for conflict ordering.

Conflict ordering allows a memory operation to complete, as long as all conflicting memory operations prior to it in the global memory order have completed, instead of all memory operations. The approach can achieve SC performance comparable to RMO (relaxed memory order), without aggressive post-retirement speculation.

1.1.2 Fence instructions

This dissertation also explores compiler, programming, and hardware support for reducing memory orderings due to fences. The proposed techniques are programming, compiler, and hardware directed approaches to help hardware eliminate unnecessary memory orderings, respectively. Table 1.1 compares the proposed approaches in terms of their programming support, compiler complexity, hardware complexity, and expected precision to identify unnecessary memory orderings.

                          Programming   Compiler     Hardware     Precision
                          Support       Complexity   Complexity
    Programming Directed  yes           medium       low          medium
    Compiler Directed     none          high         medium       medium
    Hardware Directed     none          none         high         high

Table 1.1: Trade-offs of solutions for eliminating fence overhead.

Programming Directed Approach

Traditional fence instructions order memory operations without being aware of the programmer's intention. In practice, programmers typically use fence instructions to ensure the ordering of specific memory operations, while others do not have to be ordered. One common application of such ordering demands can be found in the design of concurrent lock-free algorithms/data structures, which are widely used in multithreaded programming.

In these algorithms, fence instructions and atomic instructions are required to ensure correctness when they are executed under relaxed memory models. Programs using concurrent algorithms usually exhibit the pattern shown in Fig. 1.7. Such programs repeatedly access shared data controlled by concurrent algorithms and then process the accessed data. The fences in the concurrent algorithms are only supposed to guarantee the correct concurrent accesses to shared data, without being aware of how the accessed data is processed afterwards. However, due to their semantics, traditional fences also order memory operations which belong to the code that processes the data. That is, if long-latency memory accesses are encountered during processing of data, the fences in the concurrent algorithms have to wait for them to complete, incurring unnecessary stalling at fences. To prevent this, we need mechanisms to differentiate memory operations that must be ordered by a fence from the rest of the memory operations.

Figure 1.7: A common data access pattern: a "Control Data Access" step performed by a concurrent algorithm, followed by a "Process Data" step.
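As a concrete instance of this pattern, consider a worker that repeatedly obtains an item from a concurrent work-stealing queue and then processes it. The sketch below is illustrative only: wsq_t, task_t, wsq_steal(), and process() are assumed names, not code from the dissertation.

    typedef struct wsq  wsq_t;    /* concurrent queue, assumed defined elsewhere */
    typedef struct task task_t;

    extern task_t *wsq_steal(wsq_t *q);  /* "Control Data Access": issues a FENCE
                                            internally to order its queue accesses */
    extern void    process(task_t *t);   /* "Process Data": unrelated to the queue */

    void worker(wsq_t *q) {
        task_t *t;
        while ((t = wsq_steal(q)) != NULL) {
            /* Long-latency stores issued here may still be pending when the
               next wsq_steal() reaches its fence; a traditional fence must
               wait for them even though they are irrelevant to the queue. */
            process(t);
        }
    }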

This dissertation presents the concept of fence scope, which constrains the effect of fences in programs. We call such a fence a scoped fence, or S-Fence for short. S-Fence

only orders memory operations in its scope, without being aware of memory operations

beyond the scope. S-Fence enables programmers to specify the scope of fences using a customizable fence statement. Such scope information is encoded into binaries and conveyed to hardware. At runtime, hardware utilizes the scope information to determine whether a fence needs to stall due to uncompleted memory operations in the scope. S-Fence bridges the gap between programmers’ intention and hardware execution with respect to the memory ordering enforced by fences.
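Chapter 3 defines the actual S-Fence statement and its ISA encoding; the sketch below only illustrates the intent, using a hypothetical scoped_fence() macro inside a simplified steal operation. Here the macro is merely a stand-in full fence so the sketch compiles; a real S-Fence would stall only on uncompleted accesses within its scope (the queue fields), not on accesses issued by the caller's data-processing code.

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct task task_t;
    typedef struct {
        atomic_long top, bottom;
        task_t     *buffer[1024];          /* fixed size for illustration */
    } wsq_t;

    /* Hypothetical S-Fence: stand-in implementation only. */
    #define scoped_fence() atomic_thread_fence(memory_order_seq_cst)

    task_t *wsq_steal(wsq_t *q) {
        long t = atomic_load(&q->top);
        scoped_fence();                    /* order the read of top before bottom */
        long b = atomic_load(&q->bottom);
        if (t >= b)
            return NULL;                   /* queue appears empty */
        task_t *task = q->buffer[t % 1024];
        /* ... CAS on q->top to claim the task, omitted ... */
        return task;
    }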

S-Fence involves the programmer in helping hardware dynamically eliminate part of the memory orderings in (REQUIRED - NECESSARY) (Fig. 1.3). The programmer can easily specify the fence scope using the provided programming support; the compiler encodes the scope information in the binary, which is straightforward; and the modification to hardware is simple, without requiring inter-processor communication. S-Fence does not order memory operations beyond the scope; however, it still has to order all memory operations in the scope, where some memory orderings may be unnecessary.

Compiler Directed Approach

We make the observation that the ordering of memory accesses that memory fences strive to enforce is often already enforced in the normal course of program execution; this obviates the need for memory fences most of the time. For instance, consider the scenario shown in Fig. 1.8, in which the two requests for the critical section are staggered in time. In particular, by the time processor P1 tries to enter the critical section, the update to flag0 by processor P0 has already completed.

    P0                               P1
    1  flag0 = 1
    2  FENCE
    3  if (flag1 == 0)
    4    critical section
       /* update of flag0 completes */
                                     5  flag1 = 1
                                     6  FENCE
                                     7  if (flag0 == 0)
                                     8    critical section
    (time flows downward)

Figure 1.8: Scenario in which memory ordering is not necessary.

Clearly, under this scenario, memory accesses to flag0

(Line 1) and flag1 (Line 3) need not be ordered by the fence. Let us call the fences in Line 2 and Line 6 associates, as they are used to order the same pair of conflicting accesses to flag0 and flag1. Intuitively, for each fence, as long as its associates are far away from it at runtime, we can safely execute past the fence. Furthermore, the study conducted on

SPLASH-2 [108] programs with compiler-inserted memory fences shows that this is indeed the common case; on account of the conflicting accesses typically being staggered, only 8% of the executed instances of the memory fences are really needed.

This dissertation presents the conditional fence (C-Fence) mechanism, which utilizes compiler information to dynamically decide if there is a need to stall at each fence. The compiler helps to identify fence associates and incorporate this information in a C-Fence instruction. At runtime, to decide whether a fence can be dynamically eliminated, the hardware uses the compiler information to check if there is any associate that is executing.

While a traditional fence imposes intraprocessor delays between the memory operations before and after the fence, a C-Fence imposes interprocessor delays between fence associates, which incurs much less overhead.
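The runtime decision can be pictured with the following sketch; the active-table structure and function names are illustrative assumptions, not the hardware design described in Chapter 4.

    #include <stdbool.h>

    #define MAX_STATIC_FENCES 64
    /* One flag per static fence: set while an instance of that fence has not
       yet completed the memory operations preceding it. */
    static bool active[MAX_STATIC_FENCES];

    /* An issued C-Fence f carries the IDs of its compiler-identified
       associates; it needs to stall only if some associate is active. */
    bool cfence_must_stall(int f, const int *associates, int n) {
        active[f] = true;                      /* publish this fence instance  */
        for (int i = 0; i < n; i++)
            if (active[associates[i]])
                return true;                   /* associate in flight: stall   */
        return false;                          /* no associate active: proceed */
    }

    void cfence_complete(int f) {              /* called once the pre-fence    */
        active[f] = false;                     /* accesses have all completed  */
    }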

C-Fence utilizes compiler information to help hardware dynamically eliminate a subset of the memory orderings in (REQUIRED - NECESSARY) (Fig. 1.3). It does not require programming effort, and the hardware cost is lightweight. C-Fence works as long as the compiler can provide fence associate information. On the other hand, since C-Fence relies on the compiler to statically identify associate information, it is conservative, and thus it only eliminates a subset of the unnecessary memory orderings.

Hardware Directed Approach

C-Fence and S-Fence are only able to eliminate a subset of unnecessary memory orderings due to fences. Fig. 1.9 shows several scenarios where unnecessary fences are not eliminated by either C-Fence or S-Fence. The example in Fig. 1.9(a) contains a conditional branch. Fence instructions are inserted to ensure that memory accesses to shared variables x and y are ordered properly. However, if at runtime the condition in Proc 1 evaluates to false, and hence a1 is not executed, then the shared variable x is not accessed concurrently by multiple processors. As a result, the fence instance is not necessary to effect the order of memory accesses of x and y. In Fig. 1.9(b), there are two pointers p and q pointing to some

field of a shared object (e.g., a hash table [86]) respectively – since they may point to the same field, fence instructions are inserted. However, at runtime, if they point to different

fields of the shared object, i.e., a1 and b2 do not access the same field concurrently, then the fence instances are not necessary. Finally, in Fig. 1.9(c), FENCE2 orders both b1 → b2 and b1 → b3. However, if z is not accessed in any other processor concurrently, b2 need not be delayed and hence can be reordered across the fence. Thus, even though the instance of FENCE2 is necessary, b2 need not be delayed. Moreover, C-Fence and S-Fence require

13 programming/compiler support to help hardware detect unnecessary memory orderings.

However, when there is no source code available, they cannot be applied, and performing interprocedural alias analysis in the compiler is complex and conservative.

Figure 1.9: Unnecessary fence instances at runtime.

This dissertation presents the address-aware fence, a hardware mechanism to eliminate stalling at unnecessary fence executions without resorting to speculation. Unlike a traditional fence, which is processor-centric, an address-aware fence collects memory

access information from other processors to decide whether it should effect the stalling of

following memory accesses. In particular, an address-aware fence instance has an associated

watchlist, which contains memory addresses that should not be accessed by the memory accesses following the fence. The completion of a memory access following the fence is allowed if its memory address is not contained in the watchlist, appearing as if the fence does not take effect. Otherwise, the memory access is delayed to ensure correctness. Therefore, this approach effectively reduces the overhead of fences whenever possible.
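The per-instance check can be summarized as follows; the watchlist representation and names are assumptions for illustration (Chapter 5 presents the actual microarchitectural structures).

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t addr_t;

    #define WATCHLIST_MAX 32
    typedef struct {
        addr_t entries[WATCHLIST_MAX];  /* addresses collected from other
                                           processors' pending accesses */
        int    count;
    } watchlist_t;

    /* A memory access following the fence may complete only if its address
       is not in the fence's watchlist; otherwise it is delayed. */
    bool may_complete_past_fence(const watchlist_t *wl, addr_t a) {
        for (int i = 0; i < wl->count; i++)
            if (wl->entries[i] == a)
                return false;           /* conflicting address: delay     */
        return true;                    /* fence behaves as a no-op for a */
    }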

Address-aware fence eliminates memory orderings in (REQUIRED - NECESSARY)

(Fig. 1.3) without help from the compiler or the programmer. It is implemented in the microarchitecture without instruction set support and is transparent to programmers. Fence instructions in the executable are identified at runtime and processed by the hardware.

14 Address-aware fence has the highest precision, eliminating nearly all possible unnecessary memory orderings due to fences.

1.2 Dissertation Organization

The rest of this dissertation is organized as follows. Chapter 2 describes conflict ordering, a hardware approach for efficient SC implementation. Chapter 3, Chapter 4 and

Chapter 5 present scoped fence, conditional fence, and address-aware fence, respectively.

Related work is given in Chapter 6. Chapter 7 summarizes the contributions of this dissertation and identifies directions for future work.

Chapter 2

Efficient Sequential Consistency via Conflict Ordering

Among various memory consistency models, the sequential consistency (SC) model, in which memory operations appear to take place in the order specified by the program, is the most intuitive to programmers. Indeed, most works on semantics and software checking that strive to make concurrent programming easier assume SC [94]; several debugging tools for parallel programs, e.g. RaceFuzzer [93], also assume SC. In spite of the advantages of

the SC model, processor designers typically choose to support relaxed consistency models;

none of the Intel, AMD, ARM, SPARC, or PowerPC processor families choose to

support SC. This is because SC requires reads and writes to be ordered in program order,

which can cause significant performance overhead. Indeed, SC requires the enforcement of

all four possible program orderings: r → r, r → w, w → w, w → r, where r denotes a read

operation and w denotes a write operation. Prior hardware SC implementations enforce

SC by forcing memory operations to explicitly complete in program order; such implementations are either unable to hide some kinds of stalls due to memory ordering requirements or require aggressive hardware speculation.

This chapter presents conflict ordering for implementing SC efficiently, where SC

is enforced by explicitly ordering only conflicting memory operations. Conflict ordering

allows a memory operation to complete, as long as all conflicting memory operations prior

to it in the global memory order have completed, instead of all memory operations. The

approach can achieve SC performance comparable to RMO (relaxed memory order), without

post-retirement speculation.

2.1 Background

In program ordering based SC implementations, the hardware directly ensures all

four possible program ordering constraints [16, 46, 48, 49, 87]; they are based on the seminal

work by Scheurich and Dubois [92], in which they state the sufficient conditions for SC for

general systems with caches and interconnection networks.

(Naive) A naive way to enforce program ordering between a pair of memory operations is to delay the second until the first fully completes. However, this can result in a

significant number of memory ordering stalls; for instance, if the first access is a miss, then

the second access has to wait until the miss is serviced.

(In-window speculation) Out-of-order processing capabilities of a modern processor

can be leveraged to reduce some of these stalls. This is based on the observation that

memory operations can be freely reordered as long as the reordering is not observed by

other processors. Instead of waiting for an earlier memory operation to complete, the

processor can use hardware prefetching and speculation to execute memory operations out

of order, while still completing in order [46, 109]. However, such reordering is within the instruction window, where the instruction window refers to the set of instructions that are in-flight. If the processor receives an external coherence (or replacement) request for a memory operation that has executed out of order, the processor's recovery mechanism is triggered to redo computation starting from that memory operation. Nonetheless, enforcing the w → r

(and w → w) order necessitates that the write-buffer is drained before a subsequent memory operation can be completed. Thus, a high latency write can still cause significant program ordering stalls, which in-window speculation is unable to hide. The experiments with the

SPLASH-2 programs show that programs spend more than 20% of their execution time on average waiting for the write buffer to be drained. For this work, we assume in-window speculation support as part of baseline implementations of SC as well as fences in RMO.

(Post-retirement speculation) Since in-window speculation is not sufficiently able to reduce program ordering stalls, researchers have proposed more aggressive speculative memory reordering techniques [16, 25, 26, 42, 48, 49, 50, 87, 107]. The key idea is to speculatively retire instructions neglecting program ordering constraints while maintaining the state of speculatively-retired instructions separately. One way to do this is to maintain the state of speculatively-retired instructions at a fine granularity, which enables precise recovery from misspeculations [48, 49, 87]. This obviates the need for a load to wait until the write buffer is drained; the load is instead retired speculatively. More recently, researchers have proposed chunk

based techniques [16, 25, 26, 42, 50, 107] which again use aggressive speculation to efficiently

enforce SC at the granularity of coarse-grained chunks of instructions instead of individual

instructions. While the above approaches show much promise, hardware complexity associated with aggressive speculation, being contrary to the design philosophy of multi-cores

consisting of simple energy-efficient cores [1, 2], can hinder widespread adoption. In this chapter, we seek to show that a lightweight SC implementation is possible without sacrificing performance.

2.2 SC via Conflict Ordering

This section first informally describes the approach of enforcing SC using conflict ordering, and then formally proves that conflict ordering correctly enforces SC.

2.2.1 Conflict ordering

SC requires memory operations of all processors to appear to perform in some global memory order, such that memory operations of each processor appear in this global memory order in the order specified by the program [64]. Prior SC implementations ensure this by enforcing program ordering, i.e., by explicitly completing memory operations in program order. More specifically, if m1 and m2 are two memory operations with m1 preceding m2 in the program order, prior SC implementations ensure m1 → m2 by allowing m2 to complete only after m1 completes (either explicitly or using speculation); in effect, m2 is made to wait until m1 and all of m1's predecessors in the global memory order have completed. In other words, if pred(m1) refers to m1 and its predecessors in the global memory order, m2 is allowed to complete only when all memory operations from pred(m1) have completed.

While enforcing program ordering is a sufficient condition for ensuring SC [92], it is not a necessary condition [43, 95]. This chapter presents conflict ordering, a novel approach to SC, in which SC is enforced by explicitly ordering only conflicting memory operations.

Conflict ordering allows a memory operation to complete, as long as all conflicting memory

operations prior to it in the global memory order have completed. More specifically, let m1 and m2 be consecutive memory operations; furthermore let pred(m1) refer to memory operations that include m1 and those before m1 in the global memory order. Conflict ordering allows m2 to safely complete as long as those memory operations from pred(m1), which conflict with m2, have completed. It is worth noting that the above condition generally evaluates to true, since in well-written parallel programs conflicting accesses are relatively rare and staggered in time [49]. For example, in the scenario shown in Fig. 1.6, a1 appears to complete before a2, as long as a2 is made to complete after b1, with which a2 conflicts.

In other words, a2 can complete before a1 without breaking SC, as long as we ensure that

a2 waits for b1 to complete.
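Stated operationally, the completion rule can be sketched as below; pred(), conflicts(), and completed() stand for the abstract sets and predicates formalized in Section 2.2.2, not for hardware structures.

    #include <stdbool.h>

    typedef struct mem_op mem_op;                 /* abstract memory operation */
    typedef struct { mem_op **ops; int n; } op_set;

    extern op_set pred(const mem_op *m);          /* m and its predecessors in
                                                     the global memory order   */
    extern bool   conflicts(const mem_op *a, const mem_op *b);
    extern bool   completed(const mem_op *m);

    /* m2 may complete once every operation in pred(m1) that conflicts with
       it has completed, even if m1 itself is still pending. */
    bool may_complete(const mem_op *m2, const mem_op *m1) {
        op_set p = pred(m1);
        for (int i = 0; i < p.n; i++)
            if (conflicts(p.ops[i], m2) && !completed(p.ops[i]))
                return false;
        return true;
    }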

2.2.2 Proof of correctness

We prove that conflict ordering enforces SC using the formalism of Shasha and

Snir [95]. For the following discussion, we assume an invalidation based cache coherence

protocol for processors with private caches. A read operation is said to complete when the

returned value is bound and cannot be updated by other writes; a write operation is said

to complete when the write invalidates all cached copies, and the generated invalidates are

acknowledged [46]. Furthermore, we assume that the coherence protocol serializes writes to

the same location and also ensures that the value of a write not be returned by a read until

the write completes – in other words, we assume that the coherence protocol ensures write

atomicity [4].

Definition 1. The program order P is a local (per-processor) total order which specifies

the order in which the memory operations appear in the program. That is, m1Pm2 iff the memory operation m1 occurs before m2 in the program.

20 Definition 2. The conflict relation C is a symmetric relation on M (all memory opera- tions) that relates two memory operations (of which one is a write) which access the same address.

Definition 3. An execution E (or execution order or conflict order) is an orientation of

C. If m1Cm2, then either m1Em2 or m2Em1 holds.

Definition 4. An execution E is said to be atomic, iff E is a proper orientation of C, that is, iff E is acyclic.

Definition 5. An execution E is said to be sequentially consistent, iff E is consistent with the program order P, that is, iff P ∪ E has no cycles.

Definition 6. The global memory order G is the transitive closure of the program order and the execution order, G = (P ∪ E)+.

Remark. In the scenario shown in Fig. 1.6, b1 appears before a1 in the global memory order, since b1 appears before b2 in program order, and b2 is ordered before a1 in the execution order. That is, b1Ga1 since b1Pb2 and b2Ea1.

Definition 7. The function pred(m) returns a set of memory operations that appear before m in the global memory order G, including m. That is, pred(m) = {m} ∪ {m′ : m′Gm}.

Definition 8. An SC implementation constrains the execution by enforcing certain condi- tions under which a memory operation can be completed. An SC implementation is said to be correct if it is guaranteed to generate an execution that is sequentially consistent.

Definition 9. Conflict ordering is the proposed SC implementation in which a memory operation m2 (whose immediate predecessor in program order is m1) is allowed to complete

21 iff those memory operations from pred(m1) which conflict with m2 have completed. That

is, m2 is allowed to complete iff {m ∈ pred(m1): mCm2} have already completed.

Figure 2.1: Conflict ordering ensures SC.

Lemma 1. Any execution E (Definition 3) is atomic.

Proof. Let w and r with a subscript be a write operation and a read operation. Write-

serialization ensures that w1Ew2 iff w1 completes before w2; furthermore, w1Ew2 and

w2Ew3 result in w1Ew3. Write-atomicity also ensures that w1Er2 iff w1 completes before r2. Since reads are atomic, r1Ew2 iff r1 completes before w2. Furthermore, w1Er2 and

r2Ew3 result in w1Ew3. Thus, E is acyclic; therefore, from Definition 4, E is atomic.

Theorem 1. Conflict ordering is correct.

Proof. We need to prove that any execution E generated by conflict ordering is sequentially consistent. That is, to prove that the graph P ∪ E is acyclic (from Definition 5). Let

us attempt a proof by contradiction, and assume that there is a cycle in P ∪ E. Since E

is acyclic (from Lemma 1), the assumed cycle should contain at least one program order

edge. Without loss of generality, let us assume that the program order edge that contains a

cycle is m1Pm2 as shown in Fig. 2.1. To complete the cycle, there must be a conflict order

22 edge m2Em, where m ∈ pred(m1). Conflict ordering, however, ensures that those memory

operations from pred(m1), which conflict with m2 would have completed before m2 (from

Definition 9). Since m ∈ pred(m1), m would have completed before m2. Thus, the conflict order edge m2Em is not possible, which results in a contradiction. Thus, conflict ordering is correct.

2.3 Basic Hardware Design

This section describes the hardware design that incorporates conflict ordering to enforce SC efficiently. For the basic design, all memory operations will have to check for conflicts before they can complete. In the next section, an enhanced design will be described, where we can determine phases in program execution where memory operations can complete without checking for conflicts.

(System Model) For the following discussion we assume a tiled chip multiprocessor,

with each tile consisting of a processor, a local L1 cache and a single bank of the shared

L2 cache. We assume that the local L1 caches are kept coherent using a directory based

cache coherence protocol, with the directory distributed over the L2 cache. We assume

that addresses are distributed across the directory at a page granularity using the first

touch policy. We assume a directory protocol in which the requester notifies the directory

on transaction completion, so each coherence transaction can have a maximum of four

steps. Thus, each coherence transaction remains active in the directory until the time at

which it receives the notification of transaction completion. Furthermore, we assume that

the cache coherence protocol provides write atomicity. We assume each processor core to

be a dynamically scheduled ILP processor with a reorder buffer (ROB) which supports

in-window speculation. All instructions, except memory writes, are made to complete in program order, as and when they retire from the ROB. Writes, on the other hand, are made to retire into the write-buffer, so that the processor need not wait for them to complete.

Finally, it is worth noting that conflict ordering is also applicable to bus based coherence protocols; indeed, we report experimental results for both bus based and directory based protocols.

2.3.1 Basic conflict ordering

We next describe the conflict ordering implementation which allows memory operations (both reads and writes) to complete past pending writes, while ensuring SC; it is worth noting, however, that the system model does not allow memory operations to complete past pending reads. Recall that conflict ordering allows a memory operation m2,

whose immediate predecessor is m1, to complete as long as m2 does not conflict with those

memory operations from pred(m1) which are pending completion – let us call such pending operations predp(m1). Thus, for m2 to safely complete, we need to be sure that m2 does not conflict with any of the memory operations from predp(m1). Alternatively, if addrp(m1) refers to the set of addresses accessed by memory operations from predp(m1), we can safely

allow m2 to complete if its address is not contained in addrp(m1). The challenge is to

determine addrp(m1) as quickly as possible – and in particular without waiting for m1 to

complete. Next, we show how we compute addrp(m1) for a write-miss, cache-hit, and a

read-miss respectively.

(Write-misses) Our key idea for determining addrp(m1) for a write-miss m1 is to simply

get this information from the directory. We show that those memory operations from

24 predp(m1), if any, would be present in the directory (as pending memory operations that are currently being serviced in the directory), provided write-misses such as m1 are issued to the directory, before subsequent memory operations are allowed to complete. In other words, if addr-list refers to the set of addresses of the cache misses being serviced in the directory, addrp(m1) would surely be contained in addr-list. In addition to this, since the system model does not allow memory operations to complete past pending reads, we are able to show that addrp(m1) would surely be contained in write-list – where write-list refers to the set of memory addresses of the write-misses being serviced in the directory.

Consequently, when the write-miss m1 is issued to the directory, the directory replies back

with the write-list, containing the addresses of the write-misses which are currently being serviced by the directory (to minimize network traffic we safely approximate the write-list by using a bloom filter). A subsequent memory operation m2 is allowed to complete, only if

m2 does not conflict with any of the writes from the write-list. If m2 does conflict, then the local cache block that m2 maps to is invalidated, and the memory operation m2 and its following instructions are replayed when necessary. During replay, m2 would turn out to be a cache miss and hence the miss request would be sent to the directory. This would ensure that m2 is ordered after its conflicting write that was pending in the directory. It is worth noting that memory operations that follow a pending write now need to wait only until the write is issued to the directory and a reply is received back from the directory, as opposed to waiting for the write-miss to complete. While the former takes as much time as the round trip latency to the directory, the latter can take a significantly longer time – since it may involve the time taken to invalidate all shared copies, and also access memory if it is a miss.
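Because the write-list travels over the network, it can be summarized with a Bloom filter. The following sketch shows the idea with an assumed 64-bit filter and two simple hash functions; the actual width and hashing are implementation parameters, not specified here.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t addr_t;
    typedef struct { uint64_t bits; } write_list_t;   /* 64-bit Bloom filter */

    static unsigned hash1(addr_t a) { return (unsigned)(a >> 6)  % 64; }
    static unsigned hash2(addr_t a) { return (unsigned)(a >> 13) % 64; }

    static void wl_insert(write_list_t *wl, addr_t a) {
        wl->bits |= (1ULL << hash1(a)) | (1ULL << hash2(a));
    }

    /* A false positive only triggers an unnecessary invalidate-and-replay of
       the completing access; it never lets a truly conflicting access slip
       past a pending write. */
    static bool wl_may_conflict(const write_list_t *wl, addr_t a) {
        uint64_t need = (1ULL << hash1(a)) | (1ULL << hash2(a));
        return (wl->bits & need) == need;
    }

The one-sided error of the filter matches the safety argument: extra reported conflicts only cost performance through replays, never SC violations.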

25 Figure 2.2: predp(b1) = {b1} ∪ predp(b0) ∪ predp(a1).

(Cache-hits) Fig. 2.2 shows how predp(b1) relates to predp(b0), where b0 is the immediate

predecessor of b1 in the program order, and a1 is the immediate predecessor of b1 in the conflict order. As we can see, predp(b1)= {b1} ∪ predp(b0) ∪ predp(a1); this is because the global memory order is the union of the program order and the conflict order. If, however, b1 is a cache hit, then its immediate predecessor in the conflict order must have completed

and hence cannot be pending; if it were pending, b1 would become a cache-miss. Thus

predp(b1) = {b1} ∪ predp(b0). Since memory operations that follow a cache-hit are allowed

to complete only after the cache hit completes, predp(b1) remains the same as predp(b0).

Consequently, addrp(b1) remains the same as addrp(b0) and thus, the write-list for a cache-hit need not be computed afresh.

(Read-misses) One important consequence of completing memory operations past pending

writes is that, when a read-miss completes, there might be writes before it in the global

memory order that are still pending. As shown in Fig. 2.3(a), if b2 is allowed to complete

before b1, b1 might still be pending when a1 completes. Now, conflict ordering mandates

that memory operations that follow a1 can complete, only when they do not conflict with

predp(a1). To enable this check, read-misses are also made to consult the directory and

determine the write-list. Accordingly, when read-miss a1 consults the directory, the directory fetches the write-list and replies to the processor. Having obtained the write-list, a2, which does

not conflict with memory operations from the write-list, is allowed to complete. Memory operation a3, since it conflicts, is replayed, obtaining its value from b1 via the directory.

Figure 2.3: Basic conflict ordering: Example.

2.3.2 Handling distributed directories

To avoid a single point of contention, directories are typically distributed across the tiles, with the directory placement policy deciding the mapping from the address to the

corresponding home directory; we assume the first touch directory placement policy. With a distributed directory, a write-miss (or a read-miss) m1 is issued to its home directory and can only obtain the list of pending writes from that directory. This means that a memory operation m2 that follows m1 can use the write-list to check for conflicts only if m2 maps to the same home directory as m1. If m2 does not map to the same home node as m1, then m2 will have to be conservatively assumed to be conflicting. Consequently m2 will have to go to its own home directory to ensure that it does not conflict. If it does not conflict, then m2 can simply commit like a cache hit; otherwise, it is treated like a miss and replayed. When

m2 goes to its home directory to check for conflicts, we take this opportunity to fetch the write-list from m2’s home directory, so that future accesses to the same node can check for conflicts locally. To accommodate this, the local processor has multiple write-list registers which are tagged by the tile id.
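The lookup against these tagged registers can be sketched as follows; the type and function names are ours, the number of registers (6) is taken from the evaluated configuration, and an exact set again stands in for the compressed write-list.

#include <array>
#include <cstdint>
#include <unordered_set>

using BlockAddr = uint64_t;

struct WriteListRegister {
    bool valid = false;
    int  home_tile = -1;                       // directory tile the list was fetched from
    std::unordered_set<BlockAddr> write_list;  // stands in for the Bloom-filter write-list
};

// A memory operation can be checked locally only if some register holds the
// write-list of its home tile; otherwise it must consult its home directory
// (and is also sent there if the local check reports a possible conflict).
bool needs_directory_check(int home_tile, BlockAddr block,
                           const std::array<WriteListRegister, 6>& regs) {
    for (const auto& r : regs)
        if (r.valid && r.home_tile == home_tile)
            return r.write_list.count(block) != 0;
    return true;  // no write-list cached for this home tile
}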

The scenario shown in Fig. 2.3(b) illustrates this. Let us assume that the variable

X maps to Node 1, while variables Z, W and Y map to Node 2. Note that the processor also has two write-list registers. When a1 is issued to its directory, it returns the pending write-misses from Node 1. Consequently, this is put into one of the write-list register which is tagged with the id of Node 1. The contents of the other write-registers are invalidated as they might contain out-of-date data. When a2 tries to commit, the tags of the write-list registers are checked to see if any of the write-list registers is tagged with the id of Node 2, which is the home node for a2. Since none of the write-list registers have the contents of

Node 2, a2 is sent to its home directory (Node 2) to check for conflicts. After ensuring that a2 does not conflict, the pending stores of the home directory (from Node 2) are returned and inserted into the write-list register. This allows us to commit a3 which maps to the same node. Likewise, when a4 tries to commit, it is found to be conflicting and hence replayed.

Although the distributed directory could potentially reduce the number of memory operations that can complete past pending writes, the experiments show that, due to locality, most of the nearby accesses tend to map to the same home node, which allows us to complete most memory operations past pending writes.

2.3.3 Correctness

The correctness of the conflict ordering implementation hinges on the fact that when a cache-miss m1 is issued to the directory, the write-list returned by the directory is correct; that is, the write-list should include addrp(m1), the addresses of all pending memory operations that occur before m1 in the global memory order. We prove this formally next.

Figure 2.4: Correctness of conflict ordering implementation: m2 cannot be a read in each of the 3 scenarios.

Lemma 2. When a cache-miss m1 is issued to the directory, all pending memory operations prior to it in the global memory order must be present in the directory. That is, when m1 is issued to the directory, {m : mGm1} must be present in the directory.

Proof. (1) m1 is a write-miss. When m1 is issued to the directory, conflict ordering ensures

that all pending memory operations prior to m1 in the program order have been issued to the directory. (2) m1 is a read-miss. m1 is issued to the directory to fetch the write-list only when it is retired, and all memory operations prior to it have in turn been issued. Thus, whether m1 is a write-miss or a read-miss, when m1 is issued to the directory, {m : mPm1}

must be present in the directory. Likewise, when m1 is issued to the directory all memory

operations prior to m1 in the conflict order, {m : mEm1} must have been issued to the

directory. Thus, when m1 is issued to the directory, {m : mGm1} is in the directory.

Lemma 3. When a read-miss m1 is issued to the directory (to obtain the write-list), none

of the pending memory operations prior to it in the global memory order are read-misses.

Proof. Let us attempt a proof by contradiction and assume that such a pending read-miss

exists. Let m2 be the assumed read-miss such that m2 ∈ predp(m1). Now the read-miss m2

cannot occur before m1 in the program order, as all such read-misses would have retired

and hence cannot be pending. m2 cannot occur before m1 in the conflict order, as two reads

do not conflict with each other. Thus, the only possibility is as shown in Fig. 2.4(a), where

there is a write-miss m3 such that m2Pm3 and m3Em1 (without loss of generality). This

again, however, is impossible since the write m3 will be issued only after all reads before it

(including m2) complete.

Lemma 4. When a write-miss m1 is issued to the directory, all pending read-misses prior

to it in the global memory order, conflict with m1.

Proof. As shown in Fig. 2.4(b), it is possible that there exists m3, a read-miss, such that

m3Gm1 since m3Em1. There cannot be, however, any read-miss m2 such that m2Gm1

with m2 not conflicting with m1. Proof is by contradiction, along similar lines to Lemma 3

(m2 in Fig. 2.4 (b) and Fig. 2.4 (c) cannot be reads).

Theorem 2. When a cache-miss m1 is issued to the directory, the write-list returned will

contain addrp(m1), the addresses of all pending memory operations that occur before m1 in

the global memory order.

Proof. When a cache-miss m1 is issued to the directory, the addresses of the memory operations serviced in the directory (addr-list) will contain addrp(m1) (from Lemma 2). All pending read-misses which occur before m1 in the global memory order, if any, would conflict with m1 and hence access the same address as m1 (from Lemma 3 and Lemma 4).

Thus, the write-list is guaranteed to contain addrp(m1).

2.4 Enhanced Hardware Design

This section describes the enhanced conflict ordering design, in which we identify phases in program execution where memory operations can complete without checking for conflicts. But first, we discuss the limitations of basic conflict ordering, to motivate the enhanced design.

Figure 2.5: (a) and (b): Limitations of basic conflict ordering; (c) and (d): Enhanced conflict ordering.

2.4.1 Limitations of basic conflict ordering

We illustrate the limitations of basic conflict ordering with the examples shown in Fig. 2.5(a) and (b). As we can see, in Fig. 2.5(a), the write-miss a1 estimates predp(a1) by consulting the directory and obtaining the write-list. All memory operations that follow the pending write-miss a1 need to be checked with the write-list for a conflict, before they can complete. It is important to note that even after a1 completes, memory operations

following a1 will still have to be checked for conflicts. This is because b1, which precedes

a1 in the global memory order, could still be pending when a1 completes. In other words,

the completion of the store a1 is no longer an indicator of all memory operations belonging to pred(a1) completing. Thus, in basic conflict ordering, all memory operations that follow a write miss (or a read miss) need to be continually checked for conflicts before they can safely complete.

Another limitation stems from the fact that a read-miss (or a write-miss) conservatively estimates its predecessors in the global memory order by accessing the directory. Let us consider the scenario shown in Fig. 2.5(b), in which the read-miss a1 estimates predp(a1) from the directory. However, if b2 and a1 are sufficiently staggered in time, which is common [102], then memory operations belonging to pred(b2) (including b1) might have already completed by the time a1 is issued. In such a case, there is no need for a1 to compute its write-list; indeed, memory operations such as a2 can be safely completed past a1.

2.4.2 Enhanced conflict ordering

(HW support and operation) The approach is to keep track of the (local) memory operations that have retired from the ROB, but whose predecessors in the global memory order are still pending. We call such memory operations active and we track them in a per-processor augmented write buffer (AWB). The AWB is like a normal write-buffer in that it buffers write-misses; unlike a conventional write-buffer, however, it also buffers the addresses of other memory operations (including read-misses, read-hits and write-hits) that are active. Therefore, an empty AWB indicates an executing phase in which all preceding memory operations in the global memory order have completed; this allows

us to complete succeeding memory operations without checking for conflicts. To reduce the space used by the AWB, consecutive cache hits to the same block are merged.

We now explain how we keep track of active memory operations in the AWB.

A write-miss that retires from the ROB is marked active by inserting it into the AWB,

as usual. The write-miss, however, is not necessarily removed from the AWB (i.e. marked

inactive) when it completes; we will shortly explain the conditions under which a write-miss

is removed from the AWB.

A memory operation which retires from the ROB while the AWB is non-empty is

active by definition. Consequently, a cache hit which retires while the AWB is non-empty is

marked active by inserting its cache block address into the AWB; it is subsequently removed

when the memory operation becomes inactive – i.e., when all prior entries in the AWB have

been removed.

A cache miss which retires while the AWB is non-empty, like a cache hit, is marked

active by inserting its cache block address into the AWB. However, unlike a cache hit, its

block address is not necessarily removed when all prior entries in the AWB have been re-

moved. Indeed, a cache miss remains active, and hence its cache block address remains

buffered in the AWB, until its predecessors in the conflict order become inactive. Accord-

ingly, even if a write-miss completes, it remains buffered in the AWB until all the cache

blocks that it invalidates turn inactive. Likewise, a read-miss that completes is inserted

into the AWB, if it gets its value from a cache block that is marked active. Here, a cache

block is said to be active if the corresponding cache block address is buffered in the AWB,

and inactive otherwise.

Thus, to precisely keep track of cache misses that are active, we need to somehow infer whether their predecessors in the conflict order are active. We implement this by tagging coherence messages as active or inactive. When a processor receives a coherence request for a cache block, we check the AWB to see if it is marked active. If it is marked active, the processor responds to this coherence request with a coherence response that is tagged active. This would enable the recipient (cache miss) to infer that the coherence response is from an active memory operation. At the same time, when a processor receives a coherence request to an active cache block, it is buffered; when the cache block eventually becomes inactive, we again respond to the buffered coherence request with a coherence response that is tagged inactive. This would enable the recipient (cache miss) to infer that the response is from a block that has transitioned to inactive. A cache miss, when it receives a coherence response, is allowed to complete irrespective of whether it receives an active or an inactive coherence response; however, it remains active (and hence buffered in the AWB) until it gets an inactive response. Finally, by not allowing an active cache block to be replaced, we ensure that if a cache miss has predecessors in the conflict order, it will surely be exposed via the tagged coherence messages.
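The responder-side behavior just described can be summarized by the following sketch; Responder, on_request and on_block_inactive are illustrative names, and the real mechanism lives in the coherence controller rather than in C++ objects.

#include <cstdint>
#include <deque>
#include <unordered_set>

using BlockAddr = uint64_t;

struct CoherenceRequest  { BlockAddr block; int requester; };
struct CoherenceResponse { BlockAddr block; int dest; bool active; };

struct Responder {
    std::unordered_set<BlockAddr> active_blocks;  // blocks currently buffered in the AWB
    std::deque<CoherenceRequest>  req_buffer;     // requests to active blocks, answered again later

    // Tag the response with the block's active state; a request to an active
    // block is also buffered so it can be re-answered (inactive) later.
    CoherenceResponse on_request(const CoherenceRequest& req) {
        bool active = active_blocks.count(req.block) != 0;
        if (active) req_buffer.push_back(req);
        return {req.block, req.requester, active};
    }

    // When a block turns inactive, re-answer any buffered requests for it.
    std::deque<CoherenceResponse> on_block_inactive(BlockAddr block) {
        active_blocks.erase(block);
        std::deque<CoherenceResponse> out;
        for (auto it = req_buffer.begin(); it != req_buffer.end();) {
            if (it->block == block) {
                out.push_back({block, it->requester, /*active=*/false});
                it = req_buffer.erase(it);
            } else {
                ++it;
            }
        }
        return out;
    }
};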

(Write-miss: Example) We now explain the operation with the scenario shown in Fig. 2.5(c), which addresses the limitation shown in Fig. 2.5(a). First, the write-miss b1 is issued from the ROB of processor B and is inserted into the AWB (step 1), after which the read-hit b2 is made to complete past it. Since the AWB is non-empty when b2 completes, b2 is marked active by inserting the cache block address X into the AWB (step 2). Then, write-miss a1 to the same address X is issued in processor A, inserted into the AWB, and issued to the directory

(step 3). The directory then services this write-miss request, sending an invalidation request

for cache block address X to processor B. Since cache block address X is active in processor

B (cache block address X is buffered in processor B’s AWB), processor B sends an active

invalidation acknowledgement back to processor A (step 4); at the same time, processor

B buffers the invalidation request so that it would be able to respond again when cache

block address X eventually becomes inactive. When processor A receives the invalidation

acknowledgement, a1 completes (step 5), and so it sends a completion acknowledgement to

the directory. It is worth noting that a1 has completed at this point, but is still active and

hence is still buffered in processor A’s AWB. Let us assume that b1 completes at step 6;

furthermore let us assume that at this point all the predecessors of b1 in the global memory

order have also completed – in other words, b1 has become inactive. Since b1 has become

inactive, cache block address Y is removed from the AWB. This, in turn, causes b2 (cache

block address X) to be removed from the AWB, because all those entries that preceded

b2 (including b1) have been removed from the AWB. Since cache block address X has

become inactive, processor B again responds to the buffered invalidation request by sending

an inactive invalidation acknowledgement to processor A. When processor A receives this

inactive acknowledgement, the recipient a1 is made inactive and hence removed from the

AWB (step 7). This will allow succeeding memory operations like a2 to safely complete

without checking for conflicts (step 8).

(Read-miss: Example) Fig. 2.5(d) addresses the limitation shown in Fig. 2.5(b). First, the

write-miss b1 is issued and made active by inserting it into the AWB (step 1). Then, the

write-hit b2 which completes past it is made active by inserting address X into the AWB

(step 2). Let us assume that b1 then completes, and also becomes inactive (step 3). This

causes b1 to be removed from the AWB, which in turn causes b2 (cache block address X) to

be removed. When the read-miss a1 is issued to the directory (step 4), it sends a data value

request for address X to processor B. Since the cached block address X is inactive, processor

B responds with an inactive data value reply to processor A (step 5). Upon receiving this

value reply, a1 completes. Furthermore, since the reply is inactive, it indicates that all the

predecessors of a1 in the conflict order have completed. Thus, the write-list is not computed

and a1 is not inserted into the AWB. Indeed when a2 is issued (step 8), assuming the AWB

is empty, it can safely be completed.

(Misses to uncached blocks) When a cache-miss is issued to the directory and it

is found to be uncached in each of the other processors, this indicates that the particular

cache block is inactive in each of the other processors. This is because an active block is

not allowed to be replaced from the local cache. This allows us to handle a miss to an

uncached block similar to a cache hit, in that, we do not need to compute a new write-list

for such misses. Indeed, if there are no other pending entries in the AWB, then memory

operations that come after a miss to an uncached block can be completed without checking

for conflicts. It is worth noting that misses to local variables, which account for a significant

percentage of misses, would be uncached in other processors. This optimization would allow

us to freely complete memory operations past such local misses.

(Avoiding AWB snoops) In the above design, all coherence requests and replacement

requests must snoop the AWB first to see if the corresponding cache block is marked active.

To avoid this, we associate an active bit with every cache block; the active bit is set whenever

the corresponding cache block address is inserted into the AWB and reset whenever the

cache block address is not contained in the AWB.

(Deadlock-freedom) Deadlock-freedom is guaranteed because memory operations that have been marked active will eventually become inactive. It is worth recalling that a memory operation becomes inactive when all its predecessors in the global memory order in turn become inactive. In theory, since conflict ordering guarantees SC, there cannot be any cycles in the global memory order, which ensures deadlock-freedom. The fact that an active cache block is not allowed to be replaced, however, can potentially cause the following subtle deadlock scenario. In this scenario, an earlier write-miss is not allowed to bring its cache block into its local cache, since all the cache blocks in its set have been marked active by later memory operations which have completed. In such a scenario, the later memory operations that have completed are marked active, and are waiting for the earlier write-miss to turn inactive; the earlier write-miss, however, cannot complete (and become inactive), since it is waiting for the later memory operations to turn inactive. We avoid such a scenario by forcing a write-miss to invalidate the cache block chosen for replacement and set its active bit, before the write-miss is issued to the directory. This will ensure that later memory operations which complete before the write-miss will not be able to use the same cache block used by the write-miss, as this cache block has been marked active by the write-miss. Thus, the deadlock is prevented.

2.4.3 Hardware summary

(Hardware support) Fig. 2.6 summarizes the hardware support required by conflict ordering. First, each processor core is associated with a set of write-list registers. Second, each processor core is associated with an augmented write buffer (AWB), which replaces a conventional write-buffer. Like a conventional write-buffer, each entry of the AWB is tagged

by the cache block address; unlike a conventional write buffer, however, it also buffers read-misses, read-hits and write-hits. To distinguish write-misses from other memory operations, each entry of the AWB is associated with a write-miss bit. If the current entry corresponds to a write-miss, i.e., the write-miss bit is set, the entry additionally points to the data that is written by the current write-miss. Each entry is also associated with a conflict-active bit, which is set if any of its predecessors which conflict with it are active. Third, an active bit is added to each local cache block, which indicates whether that particular cache block is active. Lastly, conflict ordering requires minor extensions to the cache coherence subsystem.

Figure 2.6: Hardware support: conflict ordering.

Each processor is associated with a req-buffer which is used to buffer coherence requests to active cache blocks. Each coherence response message (data value reply and invalidation acknowledgement) is tagged with an active bit, to indicate whether it is an active response or an inactive response. While additional coherence messages are exchanged as part of conflict ordering, it

is worth noting that the coherence protocol and its associated transactions for maintaining coherence are left unchanged. We summarize the additional transactions next.
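Before turning to those transactions, the per-core structures listed above can be rendered in plain C++ as follows; the field and type names are ours and the container sizes come from the evaluated configuration, so this is a sketch rather than the actual hardware description.

#include <bitset>
#include <cstdint>
#include <vector>

using BlockAddr = uint64_t;

struct AWBEntry {
    BlockAddr   block;            // cache block address tagging the entry
    bool        write_miss;       // set if the entry is a write-miss
    bool        conflict_active;  // set while a conflicting predecessor is still active
    const void* store_data;       // data written by the write-miss (null otherwise)
};

struct ReqBufferEntry {
    BlockAddr block;              // requested (active) cache block
    int       requester;          // processor id to answer again later
};

struct WriteListReg {
    bool             valid = false;
    int              home_tile = -1;  // tile id tag
    std::bitset<160> filter;          // Bloom-filter-compressed write-list
};

struct CoreConflictOrderingState {
    std::vector<AWBEntry>       awb;             // augmented write buffer (50 entries)
    std::vector<ReqBufferEntry> req_buffer;      // buffered coherence requests (8 entries)
    std::vector<WriteListReg>   write_list_regs; // tagged write-list registers (6)
    // each L1 cache block additionally carries one active bit (not shown here)
};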

(Write-miss actions) The actions performed when the write-miss retires from the

ROB are as follows (a pseudocode sketch of the retire-time steps follows the list):

• Insert into the AWB. The write-miss is first inserted into the AWB as follows: the inserted

entry is tagged with the cache block address of the write-miss; since the entry corresponds

to a write-miss, the write-miss bit is set to 1 and the entry points to the data written by

the write-miss; the conflict-active bit is set to 1 since its predecessor in the conflict order

may be active.

• Invalidate replacement victim. Once the write-miss is inserted into the AWB, the cache

block chosen for replacement is invalidated and its active bit is set to 1 – it is worth

recalling that this is done to prevent the deadlock scenario discussed earlier.

• Issue request to home-directory. Now, the write-miss is issued to its home directory. If the

corresponding cache block is found to be uncached in any of the other processors, the

directory replies with an empty write-list. If the cache block is indeed cached in some

other processor, the directory replies with the list of pending write-misses being serviced

in the directory. To minimize network traffic, before sending the write-list, a bloom filter

is used to compress the list of addresses.

• Process response. Once the processor receives the write-list, if it indeed receives a non-

empty write-list, it updates the local write-list register. When the write-miss receives

all of its invalidation acknowledgements, it completes. When the write-miss completes,

it sends a completion message to the directory, so that the directory knows about its

completion. When a write-miss receives all of its acknowledgements and each of the

acknowledgements are tagged inactive, its corresponding conflict-active bit (in the AWB)

is reset to 0. When the conflict-active bit is reset, the write-miss checks the AWB to see

if there are any entries prior to it. If there are none, it implies that the write-miss has

become inactive and can be removed from the AWB.

• Remove write-miss and other inactive entries from the AWB. Before removing the write-

miss entry, we first identify those memory operations which follow the original write-miss

that have also become inactive. To identify such inactive memory operations, the AWB

is scanned sequentially starting from the original entry, selecting all those entries whose

conflict-active bit is reset to 0; the scanning is stopped when an entry whose conflict-

active bit is set to 1 is encountered. All such selected entries are removed from the AWB,

and the active bits of their respective cache blocks are reset to 0.
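The retire-time portion of the sequence above (the first three bullets) can be condensed into the following sketch; the hook functions are placeholders for cache and interconnect actions and are not part of the actual design.

#include <cstdint>
#include <vector>

using BlockAddr = uint64_t;

struct AwbEntry { BlockAddr block; bool write_miss; bool conflict_active; };

struct Hooks {  // placeholders standing in for cache and network actions
    void invalidate_and_mark_active(BlockAddr /*victim*/) {}
    void issue_to_home_directory(BlockAddr /*block*/) {}
};

// Steps taken when a write-miss to 'block' retires from the ROB: insert it into
// the AWB, invalidate (and mark active) the replacement victim to avoid the
// deadlock scenario discussed earlier, then issue the miss to its home directory,
// which replies with a (possibly empty) write-list.
inline void on_write_miss_retire(std::vector<AwbEntry>& awb, Hooks& hw,
                                 BlockAddr block, BlockAddr victim) {
    awb.push_back({block, /*write_miss=*/true, /*conflict_active=*/true});
    hw.invalidate_and_mark_active(victim);
    hw.issue_to_home_directory(block);
}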

(Cache-hit actions) The actions performed when a cache-hit reaches the head of the ROB are as follows:

• AWB empty. The AWB is first checked to see if it is empty. If the AWB is empty, the

cache-hit completes without checking for conflicts.

• Check for conflicts. If the AWB is not empty, the write-list registers are checked to see if

the cache-hit conflicts with it. If the cache-hit conflicts (or if no write-list register caches the directory entries of the cache-hit’s home node), the cache-hit is treated like a

miss and is issued to its home directory.

• No conflicts. If the cache-hit does not conflict, it is allowed to safely complete. Since the

AWB is non-empty, the cache-hit is marked active by inserting it into the AWB with

the write-miss bit set to 0, the conflict-active bit set to 0 (since it is a cache-hit, its

predecessors in the conflict order must have completed), and the active bit of the cache

block is set to 1.

(Read-miss actions) The actions performed when a read-miss reaches the head of the ROB are as follows:

• AWB empty. The AWB is first checked to see if it is empty. If the AWB is empty,

the read-miss completes without checking for conflicts. Then, the data value reply that

the read-miss received as a coherence response is examined. If it is tagged active, it is

inserted into the AWB with the write-miss bit set to 0, the conflict-active bit set to 1

(since it obtained an active response), and the active bit of the cache block is set to 1.

Later, when the read-miss receives its inactive response, the read-miss entry is removed

along with subsequent inactive entries, as discussed earlier.

• Check for conflicts. If the AWB is not empty, the write-list registers are checked to see

if the read-miss conflicts with it. If the read-miss conflicts (or if no write-list register caches the directory entries of the read-miss’ home node), the cache-miss is re-issued

to its home directory.

• No conflicts. If the read-miss does not conflict, the read-miss is allowed to safely complete.

Since the AWB is non-empty, the read-miss is marked active by inserting into the AWB

with the write-miss bit set to 0. The conflict-active bit is set to 0 or 1 depending on

whether the read-miss received an inactive or an active data value response, respectively.

If it received an active response, then later, when the read-miss receives the inactive response,

the read-miss entry is removed along with subsequent inactive entries, as discussed earlier.

(Coherence and replacement requests) When a coherence request (invalidate or data value request) is received for a cache block that is marked active/inactive, the coherence response is in turn tagged active/inactive. Additionally, if the coherence request is for an active cache block, the coherence request is buffered in the req-buffer. When the cache block is eventually reset to inactive, the corresponding entry is removed from the req-buffer and the processor again responds to the buffered request – but now with an inactive response. When a coherence request is received for an active cache block and the req-buffer is full, the coherence response is delayed until a space in the req-buffer frees up. As discussed earlier, this cannot cause a deadlock, since active cache blocks will eventually turn inactive. Finally, a cache block that is marked active is not allowed to be replaced. When a replacement request is received, and all cache blocks in the set are marked active, the response is delayed until one of the blocks in the set turns inactive – again, without the risk of a deadlock.

2.5 Experimental Evaluation

We performed experiments with several goals in mind. First and foremost, we want to evaluate the benefit of ensuring SC via conflict ordering in comparison with the baseline SC and RMO implementations. We then study the effect of varying the values of the parameters in the HW implementation on performance. Since the performance of conflict ordering is dependent on how fast requests can get back replies from directories, we study the sensitivity towards the network latency. We also vary the size of write-lists to evaluate their effects on performance. We then study the characterization of conflict ordering to see how it reduces the overhead of using the in-window speculation technique.

Furthermore, we measure the additional network bandwidth that is used up and, finally, we also measure the on-chip hardware resources that conflict ordering utilizes. However, before we present the results of the evaluation, we briefly describe the implementation.

2.5.1 Implementation

Processor                   8, 16 and 32 core CMP, out of order
ROB size                    176
L1 Cache                    private, 32 KB, 4 way, 2 cycle latency
L2 Cache                    shared, 8 MB, 8 way, 9 cycle latency
Memory                      300 cycle latency
Coherence                   directory based invalidate
# of AWB entries            50 per core
# of req-buffer entries     8 per core
# of write-list registers   6 per core
write-list size             160 bits
Interconnect                2D torus (2×4 for 8-core, 4×4 for 16-core, 4×8 for 32-core), 5 cycle hop latency

Table 2.1: Architectural parameters.

Benchmark    Inputs
barnes       16K particles
fmm          16K particles
ocean        258×258
radiosity    batch
raytrace     car
water-ns     512 molecules
water-sp     512 molecules
cholesky     tk15.O
fft          64K points
lu           512×512
radix        1M integers

Table 2.2: Benchmarks.

We implemented conflict ordering using the SESC simulator [88], targeting the MIPS architecture. The simulator is a cycle-accurate, execution-driven multi-core simulator with

detailed models for the processor and the memory systems. To implement conflict ordering, we added the associated control logic to the simulator. We considered conflict ordering in the context of a CMP with local caches kept coherent using a distributed directory based protocol. The architectural parameters for the implementation are presented in Table 2.1.

The default architectural parameters were used in all experiments unless explicitly stated otherwise. We measured performance with 8, 16 and 32 processors in Section 2.5.2, and for other studies, we assumed 32 processors. We used SPLASH-2 [108], a standard suite of multithreaded benchmarks, for the evaluation. We could not get the program volrend to compile using the compiler infrastructure that targets the simulator and hence we omitted it. We used the input data sets described in Table 2.2 and ran the benchmarks to completion.

(Baseline SC implementation) The SC baseline, referred to as conventional SC in the experiments below, uses in-window speculation support. It is an aggressive implementation that uses hardware prefetching and support for speculation in modern processors to speculatively reorder memory operations while guaranteeing SC ordering using replay. However, such speculation is within the instruction window – where the instruction window refers to the set of instructions that are in-flight. The implementation we use is similar to ones used as SC baselines in proposals such as [16, 49, 107].

(Baseline RMO implementation) The baseline RMO implementation allows memory operations to be freely reordered, enforcing memory ordering only when fences are encountered.

Even when fences are encountered, in-window speculation support (as used in the SC baseline) is used to mitigate the delays. The RMO implementation is similar to ones used in recent works such as [16, 49, 107].

2.5.2 Execution time overhead


Figure 2.7: Conventional SC normalized to RMO.


Figure 2.8: Conflict ordering normalized to RMO.

We measure the execution time overhead of ensuring SC via conflict ordering and

compare it with the corresponding overhead for conventional SC. We conducted this ex-

periment for 8, 16 and 32 processors using the default hardware implementation presented

in Table 2.1. Fig. 2.7 and Fig. 2.8 show the execution time overheads for conventional SC

and conflict ordering, respectively. The execution time overheads are normalized to the

performance achieved using RMO. As we can see, most benchmark programs experience

significant slowdown for conventional SC (more than 20% overhead on average for all numbers of processors). In particular, radix has the highest overhead. This is because radix has a relatively high store-miss rate which forces the following loads to wait longer before they can be retired. As we can see, with conflict ordering the overhead of ensuring SC is significantly reduced. On average, the overhead is just 2.0% for 8 processors, 2.2% for

16 processors and 2.3% for 32 processors, and thus the performance of conflict ordering is comparable to RMO. With conflict ordering, loads and stores do not need to wait until outstanding stores complete; they can mostly retire as soon as the pending stores get replies back from directories. Since the time to get replies from directories is significantly less than the time for outstanding stores to complete, loads do not end up causing stalls and stores can also be performed out of order. This explains why the performance with conflict ordering is significantly better than conventional SC. Furthermore, it is worth noting that, with different numbers of processors, the performance does not vary significantly. Therefore, conflict ordering appears scalable.


Figure 2.9: Performance for bus-based implementation.

(Bus-based implementation) In addition to the distributed directory-based implementation, we also evaluated a bus-based implementation to find out if conflict ordering is applicable to current multi-core processors which are predominantly bus-based. For the bus-based implementation, we just implement basic conflict ordering; we maintain a centralized on-chip structure called the write-list-buffer (WLB), which records the addresses of the write-misses which are currently pending. When a write-miss retires into a local write-buffer it is

sent to the WLB; upon receiving the write-miss, the WLB inserts its address into the WLB

and replies back with the write-list – containing all the addresses that are currently present

in the WLB except the addresses from the source processor. Similar to the directory-based

implementation, a bloom filter is used to compress the addresses. Memory operations which

attempt to complete past the write-miss, check the received write-list to decide whether they

can be completed safely, without violating SC. We conducted experiments for bus-based im-

plementation with 4 and 8 processors, and set the round-trip latency for accessing WLB

as 5 cycles. Fig. 2.9 shows the execution time normalized to RMO. As we can see, the

execution time of conflict ordering is close to RMO for both 4 and 8 processors, with the

overhead less than 2% on average. This shows that conflict ordering is also applicable to

small scale multi-core processors that use a bus for coherence.

2.5.3 Sensitivity study

(Sensitivity towards network latency) The performance of conflict ordering hinges on

the time for a miss to get replies from the directory. It is desirable that conflict ordering is

reasonably tolerant to the network latency. In the experiments, we used 2D torus network

and varied the latency of each hop with values of 3 cycles, 5 cycles, and 8 cycles, as shown

in Fig. 2.10. The performance is measured with 32 processors. As we can see, the overhead

does not vary significantly when the latency is increased from 3 cycles to 5 cycles. Even

when the latency is increased to 8 cycles, the average overhead increases only slightly, to 3.5%. This shows that conflict ordering is reasonably tolerant to the network latency. In the default design, we choose a 5 cycle hop latency, assuming a 2 cycle wire delay between routers and a 3 cycle delay per pipelined router.

Figure 2.10: Varying the latency of the network.


Figure 2.11: Varying number of bits of write-list.

(Sensitivity towards size of write-list) Recall that, to minimize network bandwidth, the addresses within replies to cache miss requests are compressed to form a write-list using

a bloom filter. While a smaller size is beneficial as far as saving network bandwidth is

concerned, it could also result in false positives. A false positive can lead us to falsely

conclude that a load or store conflicts with a pending store, causing the load to re-execute

or the store to stall. In this experiment, we evaluated the minimum size of the write-list that

does not result in performance loss. We varied the number of bits of the write-list with values of

128, 160, and 192 to evaluate the performance and corresponding false positive rates with 32 processors. In the implementation, we used a bloom filter with 4 hash functions. Fig. 2.11 shows the results, where lines represent false positive rates and bars represent execution time overheads. As the number of bits increases from 128 to 192, the false positive rate decreases (on average, 8.01%, 2.42%, and 0.69% respectively), and the execution overhead also decreases (on average, 2.97%, 2.33% and 2.29% respectively). A size of 160 bits performs slightly better than 128 bits and very close to 192 bits. Therefore, we choose 160 bits (20 bytes) as the size of the write-list in the implementation.
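For illustration, a 160-bit filter with four hash functions similar to the one evaluated here could be built as follows; the particular hash mixing is ours and need not match the hash functions used in the actual implementation.

#include <bitset>
#include <cstdint>
#include <functional>

using BlockAddr = uint64_t;

class WriteListFilter {
    static constexpr std::size_t kBits   = 160;  // write-list size chosen above
    static constexpr int         kHashes = 4;    // number of hash functions
    std::bitset<kBits> bits;

    static std::size_t index(BlockAddr a, int i) {
        // seeded mix of the block address; illustrative only
        return std::hash<std::uint64_t>{}(a ^ (0x9e3779b97f4a7c15ULL * (i + 1))) % kBits;
    }

public:
    void insert(BlockAddr a) {
        for (int i = 0; i < kHashes; ++i) bits.set(index(a, i));
    }

    // May report true for an address never inserted (a false positive), which
    // conservatively causes a replay or stall, but it never misses a pending write.
    bool mayContain(BlockAddr a) const {
        for (int i = 0; i < kHashes; ++i)
            if (!bits.test(index(a, i))) return false;
        return true;
    }
};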

2.5.4 Characterization of conflict ordering

In this experiment, we wanted to examine how conflict ordering reduces the overhead of using conventional SC. Recall that, in the conventional SC implementation, loads cannot be retired if there are pending stores and stores cannot be performed out of order, which leads to the gap between the performance of SC and RMO. On the other hand, using conflict ordering, we can safely retire loads and complete stores past pending stores by checking the write-list. Hence, performance hinges on how often loads and stores can be reordered with their prior pending stores.

Fig. 2.12 shows the breakdown of all checks against write-lists. We categorize these checks into three types: checks against empty write-lists, checks against non-empty write-lists that find no conflict, and checks against non-empty write-lists that find a conflict. When a request to the directory finds that the targeted block is not cached, it indicates that no active memory operation has accessed the block and the requesting

processor can get a reply with an empty write-list. As we can see, most checks are against empty write-lists (around 90% on average). For the checks against non-empty write-lists, there is almost no conflict. Hence, conflict ordering allows almost all memory operations to be reordered, achieving performance comparable to RMO.

Figure 2.12: Breakdown of checks against write-lists.


Figure 2.13: Reduced accesses to directories.

Fig. 2.13 shows the reduced directory accesses using enhanced conflict ordering, compared to basic conflict ordering. The performance of conflict ordering also hinges on the frequency of directory accesses. Hence, it is important to have fewer directory accesses. As we can see, using enhanced conflict ordering, on average we issue only about 6% of the directory accesses of basic conflict ordering. This is because, for most benchmark programs, the AWB is empty most of the time. For some benchmarks, such as ocean, the access frequency does not

reduce as much as for other benchmarks. However, the access frequency for these benchmarks in basic conflict ordering is already low. Therefore, their overhead is still low.

2.5.5 Bandwidth increase

Benchmark    Bandwidth increase (%)
barnes       2.49
fmm          0.98
ocean        2.64
radiosity    8.36
raytrace     1.21
water-ns     2.08
water-sp     2.21
cholesky     0.24
fft          1.16
lu           1.96
radix        1.71

Table 2.3: Bandwidth increase.

In this experiment, we measure the bandwidth increase due to write-lists that need

to be transferred and extra traffic introduced by conflict ordering for 32 processors. Table

2.3 shows the bandwidth increase compared to RMO. As we can see, for most benchmarks,

the bandwidth increase is less than 3% (2.27% on average). radiosity has relatively higher

bandwidth increase. This is because its write miss rate is relatively higher and requests to

directories usually get non-empty write-lists, resulting in write-lists taking up a relatively

high proportion of bandwidth.

2.5.6 HW resources utilized

Recall that the additional HW resources utilized by conflict ordering are AWB,

req-buffers, write-list registers, and the active bits added to each cache block. Each AWB

entry contains the cache block address, the conflict-active and the write-miss bits; we do not count the storage required for data written by the write-miss, since it is already present in conventional write buffers. Since we use 50 AWB entries per core, each of size 5 bytes, the total size per processor core amounts to 250 bytes for the AWB. Since we use 8 entries for the req-buffer, each of size 6 bytes (for storing the cache block address and the processor id), the total size per processor core amounts to 48 bytes. With 6 write-list registers per core, each of size 20 bytes, the total size per processor core amounts to 120 bytes. In addition, we also require active bits to be added to each cache block in the L1 cache. This amounts to an additional 64 bytes of on-chip storage per processor core. Thus the total additional on-chip storage amounts to 482 bytes per core. In addition to this, we require the hardware resources needed for in-window speculation, which are also required by the SC and RMO baselines. Thus, the additional hardware resources utilized for conflict ordering are nominal.
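Summing the per-core components listed above reproduces the quoted total:

50 × 5 bytes (AWB) + 8 × 6 bytes (req-buffer) + 6 × 20 bytes (write-list registers) + 64 bytes (L1 active bits) = 250 + 48 + 120 + 64 = 482 bytes per core.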

2.6 Summary

Should hardware enforce SC? Researchers have examined this question for over 30 years, with no definite answer yet. While there has been no clear consensus on whether hardware should support SC [3, 52], it is important to note, however, that the benefits of supporting SC are widely acknowledged [3]. Indeed, critics of hardware-enforced SC question whether the costs of supporting SC justify its benefits, since all prior SC implementations need to employ aggressive speculation and its associated complexity for supporting SC.

This chapter demonstrates that the benefits of SC can indeed be realized using nominal hardware resources. While prior SC implementations guarantee SC by explicitly

completing memory operations within a processor in program order, we guarantee SC by completing conflicting memory operations, within and across processors, in an order that is consistent with the program order. More specifically, we identify those shared memory dependencies whose ordering is critical for the maintenance of SC and intentionally order them. This allows us to non-speculatively complete memory operations past pending writes and thus reduce the number of stalls due to memory ordering. Experiments with the SPLASH-2 suite showed that SC can be achieved efficiently, incurring only 2.3% additional overhead compared to RMO.

Chapter 3

Scoped Fence

This chapter resorts to programming support to help hardware dynamically eliminate a subset of unnecessary memory orderings, and proposes the scoped fence (S-Fence). Fence instructions used by programmers are usually only intended to order memory accesses within a limited scope. Based on this observation, the concept of fence scope is introduced to define the scope within which a fence enforces the order of memory accesses. S-Fence enables pro-

grammers to express their ordering demands by specifying the scope of fences when they

only want to order part of the memory accesses. At runtime, hardware uses the scope infor-

mation conveyed by programmers to execute fence instructions in a manner that imposes

fewer memory ordering constraints than a traditional fence, and hence improves program

performance. In particular, this chapter provides two kinds of fence scope (i.e., class scope

and set scope), and describes their programming, compiler and hardware support. S-Fence

only makes changes locally in each processor core, without adding inter-processor commu-

nication for multiprocessors, and thereby S-Fence does not have scalability issues.

3.1 Motivation

Fig. 1.7 in Chapter 1 shows a pattern of using concurrent algorithms. The fences used in concurrent algorithms are only supposed to guarantee the correct concurrent accesses to shared data, without being aware of how the accessed data is processed afterwards. The following example illustrates this pattern in detail.

1  void put(TASK task){
2    tail = TAIL;
3    wsq[tail] = task;
4    FENCE //store-store
5    TAIL = tail + 1;
6  }

7  TASK take(){
8    tail = TAIL - 1;
9    TAIL = tail;
10   FENCE //store-load
11   head = HEAD;
12   if (tail < head){
13     TAIL = head;
14     return EMPTY;
15   }
16   task = wsq[tail];
17   if (tail > head)
18     return task;
19   TAIL = head + 1;
20   if (!CAS(&HEAD,
21       head, head+1))
22     return EMPTY;
23   TAIL = tail + 1;
24   return task;
25 }

26 TASK steal(){
27   head = HEAD;
28   tail = TAIL;
29   if (head ≥ tail)
30     return EMPTY;
31   task = wsq[head];
32   if (!CAS(&HEAD,
33       head, head+1))
34     return ABORT;
35   return task;
36 }

Figure 3.1: Simplified Chase-Lev work-stealing queue [27].

Fig. 3.1 shows a simplified C-like pseudo code of Chase-Lev work-stealing queue

[27]. Work-stealing is a popular method for balancing load in parallel programs. Chase-Lev

work-stealing queue implements a lock-free dequeue using a growable cyclic array, which

has three operations: put, take and steal as shown in Fig. 3.1. In the code, HEAD and

TAIL are two global shared variables which record the head and tail indices of the valid

tasks in the cyclic array wsq. The owner thread can put and take a task on the tail of

55 the queue, while other thief threads can steal a task on the head of the queue. Under

sequential consistency, the algorithm will execute correctly, complying with the semantics,

i.e., each inserted task is eventually extracted exactly once, either by the owner thread or

other thief threads. However, under relaxed memory models, to guarantee the correctness

of the algorithm, fences have to be inserted to enforce the ordering of some memory accesses

[69, 61]. Under TSO, a store-load fence in Line 10 is required to guarantee that no task is

fetched by two threads; while under PSO, one more store-store fence in Line 4 is required

to guarantee steal does not return a phantom task [69]. In addition, there is need for two

compare-and-swap instructions: at Line 20 and Line 32.

(a)
1 task = wsq.take();
2 for (each neighbor task' of task)
3   if (task' is not processed) {
4     process(task');
5     wsq.put(task');
6   }

(b)
1  [expansion of wsq.take()]
   8  tail = TAIL - 1;
   9  TAIL = tail;
   10 FENCE
   11 head = HEAD;
   ......

2  [expansion of process(task')]
   color[task'] = label;
   parent[task'] = task;

3  [expansion of wsq.put(task')]
   2 tail = TAIL;
   3 wsq[tail] = task';
   4 FENCE
   5 TAIL = tail + 1;

Figure 3.2: Example – parallel spanning tree algorithm.

Let us consider an application of work-stealing queue – parallel spanning tree algorithm, an important building block for many graph algorithms [11, 78]. The parallel spanning tree algorithm is discussed in [11], which uses work-stealing queue to ensure load balancing because of the irregular nature of graph applications. Here, we focus on how the work-stealing queue is used. Fig. 3.2(a) shows the core operations of the algorithm that include calls to work-stealing queue functions take and put. First a task is extracted

from the work-stealing queue (Line 1), then each unprocessed neighbor task’ of task is

processed (Line 4), and task’ is put into the work-stealing queue (Line 5). Let us further

expand these operations as shown in Fig. 3.2(b) (three blocks in (b) correspond to three

operations in (a)). We can see that there are two fences that are executed. Consider the

fence residing in put (Line 4). The traditional fence semantics will require all its preceding

memory accesses to complete before the following accesses can execute, including those in

the blocks 1 , 2 and 3 . The problem here is that, since graph applications usually do not

exhibit data locality, accessing neighbors of a node may incur long latency cache misses.

Thus, stores to arrays color and parent in 2 can be long latency memory accesses, which

leads to a long stall time for the fence in 3 , even though accesses in Line 2 and 3 can

complete quickly. Moreover, these operations are inside a loop, which imposes a significant

impact on the whole application performance. However, such stalls are not necessary –

the application runs correctly even if the fence does not wait for memory accesses in 2

to complete. This is because: (1) fences in the work-stealing queue algorithm are only

supposed to order memory accesses inside the algorithm (e.g., in function put, the fence

in Line 4 orders the stores in Lines 3 and 5) – the implementers guarantee the correctness

of the algorithm without being aware of memory accesses beyond the algorithm; and (2)

the parallel spanning tree algorithm does not rely on the fences inside the work-stealing

queue algorithm – the users call the functions put and take, but they ensure the correctness

of their applications on their own (e.g., Line 3 in (a) tests whether a task is processed).

In fact, the correctness of parallel spanning tree algorithm is provable with some ordering

requirements under relaxed consistency models [11], which we do not discuss here.

The above example shows how people program and ensure correctness of their programs. The semantics of the traditional fence is too restrictive in that it orders all memory accesses without differentiating them, thus causing unnecessary stalls. If a fence could differentiate the memory accesses it has to order from the other accesses, there will be opportunities to eliminate stalls while still enforcing program correctness. Consider the memory accesses in 3 in Fig. 3.2(b). Since the queue is only occasionally accessed by thief threads, the array wsq and shared variables HEAD and TAIL often reside in the processor’s cache, as long as they are not kicked out by conflicting cache lines. This indicates that memory accesses in block 3 can often complete quickly. If the fence in Line 4 only needs to order data accesses related to the work-stealing queue, without waiting for accesses in 2 to complete, the stall time due to the fence can be greatly reduced. The same also applies to the fence in 1. Thus, we introduce the concept of scope for fences.

3.2 Scoped Fence

This section introduces scoped fence, S-Fence for short, that constrains the effect of a fence to a limited scope. The semantics of S-Fence is first defined, followed by the scope of a fence and how programmers can specify the scope.

3.2.1 Semantics of S-Fence

S-Fence can be considered as a refinement of traditional fence as it imposes more

accurate constraints on memory ordering. Recall that, the semantics of traditional fence

requires that all memory accesses preceding the fence must complete before the memory

accesses following the fence are issued. However, S-Fence further limits the scope of the

fence. We adopt the following definition of S-Fence throughout the rest of this chapter.

S-Fence: An S-Fence imposes ordering between memory accesses in such a way that when an

S-Fence is executed by a processor, all previous memory accesses in the scope of the

fence are guaranteed to have completed before any memory access that follows the

S-Fence in the program is issued.

In other words, if a memory access prior to the fence is not in the scope of the

fence, the fence does not need to wait for it to complete. Although S-Fence can also be con-

sidered as a finer form of fence, it is different from other finer fences in current commercial

architectures [4], such as mfence, lfence, and sfence in Intel IA-32 and customizable MEM-

BAR instruction in SPARC V9. The existing finer fences explore the ordering of previous

load/store operations with respect to future load/store operations, while S-Fence explores the ordering of a subset of memory accesses that are in the scope of a fence.

3.2.2 Scope of a fence

The scope of a fence defines the context in which memory accesses should be ordered by the fence. There are various scoping rules for variables. For example, function scope is a commonly-used scope, where a locally defined variable is only valid within the function; block scope is a finer-grained scope, where a variable is made local to a block of statements. Programming languages also offer various constructs for controlling scope.

Object-oriented languages, e.g., C++ and Java, use class to group data and functions that manipulate the data.

We will make use of class in object-oriented programming languages to illustrate the concept of fence scoping. However, in this work, we do not target any specific language, but focus on exploring benefits of fence scoping. We would like to provide means for fence

scoping that capture its main characteristics and are easy for programmers to understand and use. Without loss of generality, we offer two types of fence scoping, i.e., class scope and set scope. Programming support is also provided to allow programmers to specify the fence scope they want to use, as shown in Fig. 3.3. There are three fence statements customized with parameters, which define the scopes. The specified scope information will be utilized by the compiler and conveyed to the hardware. The first statement has no parameter. It simply represents a traditional fence, which has a global scope. In the following sections, we focus on class scope and set scope, as well as corresponding programming, compiler, and hardware support.

1. S-FENCE                            [global scope]
2. S-FENCE[class]                     [class scope]
3. S-FENCE[set, {var1, var2, ...}]    [set scope]

Figure 3.3: Customized fence statements.

(Memory consistency models) Note that, the concept of S-Fence does not assume

a specific memory model. Fences are still put into programs according to the underlying

memory models. The difference with S-Fence is that it further allows programmers to specify the scope

of each fence, and such information is conveyed to hardware to order memory operations

more accurately. Therefore, fence scoping is orthogonal to memory models, although in

evaluation we consider RMO memory model.

3.3 Class Scope

The fence statement S-FENCE[class] is used to specify that the fence has a class

scope. The intuition of class scope is that, since function members of the class operate on

data members of the class, fences in function members only have to order memory accesses

to the data inside the class – they do not have to order those outside the class. In other words, class scope contains all memory accesses to the data members of the class; if the class has a data member of another class (say A), then class scope also contains memory accesses to the data members of class A and so on recursively.

More formally, Fig. 3.4 shows the semantics of fences with class scope. The

semantics only focuses on the memory operations and fence operations. Let us denote the

set of all memory operations by MemOp, and the set of all method members by F. For each f ∈ F , C(f) denotes the class which defines this method f. Moreover, Seq(F ) denotes the

set of all finite sequences over F, s · t denotes the concatenation of two sequences s and t, and we also refer to the set of all distinct elements in a sequence s. The semantics is presented

in an operational style with a set of inference rules. We use the following semantic domains.

• F Seq ∈ Seq(F ), which is used for recording nested method invocation.

• Scope ∈ Class ↦ P(MemOp). Each class forms a scope for the fences used in the

class, and the class is associated with a set of memory operations that have to be

ordered by these fences.

• pc ∈ PC, which is the program counter. next(pc) is used to denote the instruction

following pc.

The formulation in Fig. 3.4 focuses on the effects in processors, but omits the

effects in memory subsystem, which depends on the underlying memory models. These

inference rules are applied to a single process. They define the state transition from ⟨FSeq × Scope × pc⟩ to ⟨FSeq′ × Scope′ × pc′⟩. Components not updated in the rules are assumed

to be unchanged.

Figure 3.4: Semantics of class scope.

The first two rules [ScopeEnt] and [ScopeEx] show the operations at the entrance and exit of a method containing fences. The rule [MemOp] shows that, when a memory operation mop is encountered, it is added to its corresponding scopes, which may include multiple nested scopes. We omit the rules for removing memory operations from scopes when they are completed, as this is done by the memory subsystem, which can implement different memory models. The rule [Fence] shows that a fence can complete only when all memory operations in the corresponding scope have completed, indicated by

Scope(C(f))=∅.
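Since Fig. 3.4 itself is not reproduced here, the following is one plausible way to write the [MemOp] and [Fence] rules given the semantic domains above; it is a sketch rather than the exact figure, and the names inst(pc) (the instruction at pc) and set(FSeq) are assumptions of this rendering.

\[
\textsc{[MemOp]}\quad
\frac{\mathit{inst}(pc) = mop \qquad mop \in MemOp}
     {\langle FSeq,\; Scope,\; pc\rangle \;\longrightarrow\;
      \langle FSeq,\; Scope\big[\,C(g) \mapsto Scope(C(g)) \cup \{mop\}\ \text{ for all } g \in set(FSeq)\,\big],\; next(pc)\rangle}
\]

\[
\textsc{[Fence]}\quad
\frac{\mathit{inst}(pc) = \text{S-FENCE[class] in method } f \qquad Scope(C(f)) = \emptyset}
     {\langle FSeq,\; Scope,\; pc\rangle \;\longrightarrow\; \langle FSeq,\; Scope,\; next(pc)\rangle}
\]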

Fig. 3.5 shows an example of class scope. Suppose fences at Lines 6 and 16 have class scope. Consider the memory accesses to m1 and m2 in the class A, n1 and n2 in the class B. The fence at Line 16 will order the accesses to n1 and n2, as they are in class B; while the fence at Line 6 will order all four memory accesses, as accesses to m1 and m2 are in the class A and n1 and n2 are data members of class B accessed by b.funcB() (Line 5).

 1  class A {
 2    B b;
 3    int m1, m2;
 4    void funcA1() {
 5      b.funcB();
 6      FENCE
 7      m1 = ...
 8    }
 9    void funcA2()
10    { m2 = ... }
11  }

12  class B {
13    int n1, n2;
14    void funcB() {
15      n1 = ...
16      FENCE
17      n2 = ...
18    }
19  }

Figure 3.5: An example of class scope.

Recall the algorithm of the Chase-Lev work-stealing queue in Fig. 3.1, and assume those operations are implemented in a class. To order only the data accesses related to the work-stealing queue, we can apply class scope to the fences by specifying them as S-FENCE[class], which forces the fences to order only memory accesses inside the class. Hence, for the parallel spanning tree algorithm in Fig. 3.2, since the memory accesses in segment 2 are outside the scope of the fence in Line 4, the fence does not have to wait for the accesses in segment 2 to complete.
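To make the intended usage concrete, the C++ sketch below shows where a programmer might place class-scoped fences in a much-simplified owner/thief deque. It is an illustration only, not the code of Fig. 3.1: the queue logic is deliberately reduced, and S_FENCE_CLASS is a hypothetical macro standing in for the S-FENCE[class] statement, mapped here to a full fence so the sketch compiles on ordinary hardware.

#include <atomic>

// Hypothetical stand-in for S-FENCE[class]; on S-Fence hardware this would be
// a class-scoped fence, here it conservatively falls back to a full fence.
#define S_FENCE_CLASS() std::atomic_thread_fence(std::memory_order_seq_cst)

class WorkStealingQueue {
    static const int CAP = 1024;
    long tasks[CAP];
    std::atomic<long> head{0};   // consumed by thieves
    std::atomic<long> tail{0};   // owned by the worker thread
public:
    void push(long task) {
        long t = tail.load(std::memory_order_relaxed);
        tasks[t % CAP] = task;
        S_FENCE_CLASS();         // order the task write before publishing TAIL;
                                 // only accesses to this class's members matter
        tail.store(t + 1, std::memory_order_relaxed);
    }
    bool steal(long *out) {
        long h = head.load(std::memory_order_relaxed);
        S_FENCE_CLASS();         // order the HEAD read before the TAIL read
        long t = tail.load(std::memory_order_relaxed);
        if (h >= t) return false;
        *out = tasks[h % CAP];
        return head.compare_exchange_strong(h, h + 1);
    }
};

Because the fences are class scoped, a long-latency store performed by the caller before invoking push() (for example, a store to the color array in the spanning tree code of Fig. 3.2) does not delay the fences inside the class.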

3.3.1 Implementation design

Let us say a hardware implementation for S-Fence is consistent with the semantics of S-Fence if it guarantees that any execution in the hardware does not violate its semantics.

Obviously, the naive implementation is to treat S-Fence as a full fence, stalling the pipeline if any memory access prior to the fence has not completed. However, to take advantage of S-Fence, the hardware should be able to flag whether a memory access is in the scope of a given fence. Hence, the main hardware support for S-Fence is in the form of additional bits, called fence scope bits, that are associated with each entry of the reorder buffer (ROB) and store buffer.

Compiler support. To convey the scope information to hardware, the compiler has to incorporate it into binaries. We assume that the compiler does not reorder memory accesses across any fence. For class scope, we only need the extension of the Instruction Set Architecture (ISA) shown in Table 3.1.

New fence inst.     class-fence
Supporting inst.    fs start, fs end

Table 3.1: The extension of ISA for class scope.

First, we use a new instruction class-fence to represent a fence with class scope.

Second, for class scope, we have to convey to hardware: (1) whether a memory access is in a class scope; and (2) which scope a memory access belongs to. To do this, we assign a unique ID, called cid, to a class if it contains class-scope fences in any of its function members. In the generated binary, the cid is incorporated into the function members of the class. In particular, we introduce two instructions, fs start (start of a fence scope) and fs end (end of a fence scope), with the cid as their operand, to embrace each function. For each public function, we insert fs start at the entry of the function, and insert fs end at each exit. Note that there might be multiple exits for a function. At runtime, these instructions behave as nop operations other than informing the ROB to set bits properly. For the remainder of the program, no extra work is done by the compiler; for example, fences outside such classes are compiled as traditional fences.
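As an illustration of what this instrumentation amounts to, the sketch below shows class B of Fig. 3.5 with the inserted instructions made visible at source level. fs_start, fs_end, and class_fence are hypothetical intrinsics standing in for the new instructions, and the cid value 1 is an arbitrary choice for this sketch.

#include <cstdio>

// Hypothetical intrinsics standing in for the new instructions; on real
// S-Fence hardware each would be a single instruction.
static inline void fs_start(int cid) { (void)cid; /* marks entry of a fence scope */ }
static inline void fs_end(int cid)   { (void)cid; /* marks exit of a fence scope  */ }
static inline void class_fence()     { /* S-FENCE[class] */ }

struct B {
    static const int cid = 1;   // unique class ID assigned by the compiler
    int n1 = 0, n2 = 0;
    void funcB() {              // public member of a class containing class-scoped fences
        fs_start(cid);          // inserted at the entry of the function
        n1 = 1;
        class_fence();          // the FENCE at Line 16, now a class-fence
        n2 = 2;
        fs_end(cid);            // inserted at every exit of the function
    }
};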

(Figure: a reorder buffer and store buffer whose entries are extended with fence scope bits (FSB), an auxiliary mapping table from cid to FSB entry, and a fence scope stack (FSS).)

Figure 3.6: Hardware support for class scope.

Hardware support. Fig. 3.6 shows the hardware support for class scope in an out-of-order processor core with a ROB and a store buffer. All instructions are retired from the head of the ROB in program order. At the head of the ROB, loads are retired when they complete, while stores are retired to the store buffer as soon as the value and destination address are available. To support class scope, each ROB and store buffer entry is extended with fence scope bits (FSB), as shown in Fig. 3.6, to flag whether a memory operation is in the

scope of some fence. Besides, we use an auxiliary mapping table to maintain the mapping

from cid to FSB entry, and a fence scope stack (FSS) to handle nested scopes properly.

The key step at runtime is to properly set the bits in FSB and check if a fence has to stall

the processor when it is encountered.

Setting fence scope bits. Each entry in FSB represents a distinct scope. Memory

accesses that belong to the same scope of a fence set the same entry of FSB. Each set bit is

cleared when the corresponding memory access has completed. Note that, at a given point

in execution, the number of active fence scopes can be quite large; thus possibly exceeding

the limited number of entries allowed by FSB. We must deal with this situation in the

hardware implementation.

The compiler-inserted instructions fs start and fs end are utilized to flag memory accesses in the class scope of a fence. Fig. 3.7 shows the micro-operations on fs start and fs end. (1) When the processor issues a fs start, it indicates the start of a scope, and the operand is the cid of the scope. The mapping table is first looked up to see if an FSB entry

fs_start cid:
    // operations in mapping table
    if (cid recorded in mapping table)
        current_entry = map[cid];
    else
        current_entry = a new entry available in FSB;
        map[cid] = current_entry;
    // operations in FSS
    FSS.push(current_entry);

fs_end cid:
    // operations in FSS
    FSS.pop();

Figure 3.7: Micro-operations on fs start and fs end.

has been assigned to this scope. If not, a new available entry is used to flag this scope,

and the mapping information is added to the mapping table. Moreover, FSS is updated by

pushing the current FSB entry. Note that, FSS records the nested active scopes, where the

outermost scope is at the bottom of the stack while the innermost scope is on the top of

the stack. Hence, the current scope in which instructions are being decoded is on the top

of the stack. The entries recorded in FSS determine which FSB entries have to be flagged.

When FSS is not empty, a newly issued memory operation sets all FSB entries that are

contained in FSS. By doing this, when an inner scope is flagged for an instruction, all of

its outer scopes are also flagged. This simplifies the next step, fence issue (recall that a class-fence also has to order memory accesses in its inner scopes), as well as the removal of mapping information when an entry is no longer used for a scope. FSS does not change as long as the processor does not encounter a fs start or fs end, and hence the processor

continues flagging memory accesses in the same FSB entries. (2) When the processor issues

a fs end, it indicates the end of current decoded scope, and the top of FSS is popped. (3) As

for the mapping table, the mapping information is maintained as long as the corresponding scope is active. When the bits in the same entry of all FSBs have been cleared, the processor looks up the mapping table and invalidates the mapping information for that FSB entry.
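The following C++ model ties these pieces together. It is a behavioral sketch of the mechanism just described, not the hardware design itself; the names and the four-entry FSB size are assumptions. fs_start/fs_end maintain the mapping table and FSS, decoded memory operations set the FSB bits of every scope currently on FSS, and a class-fence checks the FSB entry of the innermost scope.

#include <cstdint>
#include <map>
#include <vector>

struct ClassScopeModel {
    static const int FSB_ENTRIES = 4;       // FSB columns available in hardware
    std::map<int, int> mapping;             // mapping table: cid -> FSB entry
    std::vector<int> fss;                   // fence scope stack of FSB entries
    std::vector<uint8_t> inflight;          // FSB word per in-flight memory op
    int next_entry = 0;

    void fs_start(int cid) {
        int entry;
        auto it = mapping.find(cid);
        if (it != mapping.end()) entry = it->second;
        else {
            entry = next_entry++ % FSB_ENTRIES;   // share an entry when FSB runs out
            mapping[cid] = entry;
        }
        fss.push_back(entry);                     // push the current scope
    }
    void fs_end() { if (!fss.empty()) fss.pop_back(); }

    // Decode a memory operation: flag it in every scope currently on FSS.
    int decode_mem_op() {
        uint8_t bits = 0;
        for (int e : fss) bits |= (uint8_t)(1u << e);
        inflight.push_back(bits);
        return (int)inflight.size() - 1;
    }
    // Completion clears the operation's fence scope bits.
    void complete_mem_op(int idx) { inflight[idx] = 0; }

    // A class-fence may issue only if no in-flight access is flagged in the
    // FSB entry of the current (innermost) scope.
    bool class_fence_can_issue() const {
        if (fss.empty()) return true;
        uint8_t mask = (uint8_t)(1u << fss.back());
        for (uint8_t b : inflight) if (b & mask) return false;
        return true;
    }
};

A real design would also invalidate mapping-table entries once all bits of an FSB entry are clear, as described above; the model omits that cleanup for brevity.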

(Figure: FSB columns 0-3 for class scope and for set scope; memory instructions I0-I7 are decoded with fs_start a before I0, fs_start b before I2, fs_end b before I5, and fs_end a before I7, and the right side tracks the mapping table (MT) and FSS after each event, with the outer scope mapped to Entry 0 and the inner scope to Entry 1.)

Figure 3.8: Setting fence scope bits.

Fig. 3.8 illustrates how we flag memory accesses for class scope. Here, we only show FSBs for ROB – each row corresponds to the FSB of a ROB entry. For simplicity, we only show memory operations, which are decoded in program order and allocated in the entries of ROB. Suppose there is a fs start before Instructions 0 and 2, and a fs end before

Instructions 5 and 7. Recall that fs start and fs end should appear in pairs at runtime, and each pair embraces the instructions in its scope. Hence, we have two scopes here and they are nested, the inner one and the outer one. In the figure, the right side shows how the states of mapping table (MT) and FSS change as instructions are decoded. The line with arrow crosses the entries for current scope, and the highlighted entries in Columns 0-1 will be set to flag the memory accesses that are in the class scope of a fence. Initially, no memory access is flagged, and both the mapping table and FSS are empty. Since a fs start with cid a is encountered before Instruction 0, the processor starts to use Entry 0 to flag the

following memory accesses. The mapping a → 0 is added to the mapping table, and Entry

0 is pushed to FSS. Before Instruction 2, another fs start with cid b is encountered. The processor uses a new entry (Entry 1) for the inner scope, and the mapping table and FSS are also updated accordingly. Now, FSS contains two entries (0 and 1). For the following memory accesses, both Entry 0 and 1 are set, as Entry 1 represents the current scope and

Entry 0 represents the outer scope. Since there is a fs end before Instruction 5, the top of

FSS is popped, with only Entry 0 remaining in FSS. However, the mapping table remains

the same, as a mapping is only removed when all memory accesses in the corresponding

entry have completed. Likewise, fs end before Instruction 7 indicates the end of the outer scope. FSS is emptied, and hence no memory access is flagged afterwards.

(Handling excessive scopes) There can be multiple simultaneously active fence scopes. Moreover, at a given point in execution, the number of active fence scopes may be too large for the FSB to assign a different entry to each scope, i.e., the FSB does not have enough entries. If the number of active scopes does exceed the number of FSB entries, then for each newly encountered scope, we simply choose one specific FSB entry to flag memory accesses. The mapping table and FSS are updated in the same way. The difference is that, in the mapping table, multiple fence scopes can now be mapped to the same entry. Such an implementation is still consistent with the semantics of S-Fence, as it only places stricter constraints on memory ordering due to fences. However, it is unlikely that a program will involve too many simultaneously active fence scopes. Thus, we only need to maintain a small number of FSB entries in the hardware, and this has almost no effect on program performance.

In some rare cases, the mapping table or FSS can be full; that is, we encounter a fs start instruction and there is no space in the mapping table or FSS to add a new entry. To handle this, we maintain a counter that indicates how many additional scopes are encountered after the mapping table or FSS becomes full. The counter increases by 1 with fs start, and decreases by 1 with fs end. During the period when the counter is not zero, each

encountered fence will behave as a traditional fence, which orders all memory operations.

After the counter becomes zero, the processor switches back to its normal state.
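A sketch of this fallback, under the same modeling assumptions as before: while the counter is non-zero, every fence is treated as a traditional (full) fence.

struct OverflowFallback {
    int extra_scopes = 0;                  // scopes seen after the MT/FSS became full

    void on_fs_start(bool structures_full) { if (structures_full) ++extra_scopes; }
    void on_fs_end()                        { if (extra_scopes > 0) --extra_scopes; }

    // While any unmapped scope is still open, fences order all memory operations.
    bool fence_must_behave_traditionally() const { return extra_scopes != 0; }
};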

(Handling branch prediction) Branch prediction is widely employed in today's pipelined processors to improve performance. A branch misprediction requires the ROB to discard all instructions following the branch instruction, and those instructions are fetched and

executed again. This process may affect the information recorded in FSS. For example,

there is a branch between a pair of fs start and fs end. The issue of fs start will push an

entry to FSS. Then, the predicted branch leads to the issue of fs end, and hence the entry in

FSS is popped. Later on, the branch prediction is found to be incorrect, and the following instructions are fetched and executed again. In this case, the processor will issue another fs end which is also paired with the previous fs start. However, the entry in FSS has been popped because of the previous fs end, which results in a problem in FSS. To solve this, we

maintain a shadow copy of FSS, namely FSS′. FSS′ supports the same operations as FSS. The difference is that fs start and fs end only trigger operations on FSS′ if there is no unconfirmed branch prediction prior to them. Hence, FSS′ maintains information that is not affected by branch prediction. When there is a branch misprediction, we copy the content of FSS′ back to FSS and resume execution as usual.
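The shadow stack can be modeled in a few lines, again as an illustration with assumed names rather than the hardware itself: FSS′ is updated only by fs_start/fs_end instructions that are past all unresolved branches, and is copied back into FSS on a misprediction.

#include <vector>

struct FssWithShadow {
    std::vector<int> fss;          // speculative fence scope stack
    std::vector<int> fss_shadow;   // FSS', unaffected by unresolved branches

    void on_fs_start(int fsb_entry, bool past_unresolved_branch) {
        fss.push_back(fsb_entry);
        if (!past_unresolved_branch) fss_shadow.push_back(fsb_entry);
    }
    void on_fs_end(bool past_unresolved_branch) {
        if (!fss.empty()) fss.pop_back();
        if (!past_unresolved_branch && !fss_shadow.empty()) fss_shadow.pop_back();
    }
    // On a misprediction the speculative updates are discarded.
    void on_branch_mispredict() { fss = fss_shadow; }
};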

Issuing fence. After the FSB bits have been set properly, it becomes straightforward to determine whether a fence can be issued. When a class-fence is encountered, the top of FSS indicates which FSB entry is flagging the current scope. The processor checks this entry of all FSBs to determine if the fence is allowed to issue. If none of these bits is set, the fence is allowed to issue and so are the following instructions; otherwise, the fence is stalled until all of these bits are cleared.

3.3.2 An example

(Figure: (a) the decoded memory instructions 0-4, namely St A, St X, FENCE, Ld Y, St B, with their fence scope bits; St X and Ld Y carry the inner-scope bit, and St A and Ld Y are cache misses. (b) Timelines of ROB and store buffer contents: with a traditional fence, issue stalls until the store buffer is drained; with the scoped fence, the fence is issued as soon as St X completes.)

Figure 3.9: Comparison between traditional fence and S-Fence.

Fig. 3.9 depicts an example to illustrate the performance advantages of S-Fence.

Fig. 3.9(a) shows the instructions the processor decodes, where only memory accesses

are displayed. Instructions 1 and 3 are in the inner scope as indicated in FSBs, and the

fence is a class-fence in the inner scope which only orders Instructions 1 and 3. Moreover,

Instructions 0 and 3 are long latency cache misses. Fig. 3.9(b) shows a timeline for executing instructions, in terms of the states of ROB and store buffer. The upper half is for traditional fence, while the lower half is for S-Fence. Initially, St A and St X are retired to the store

buffer. St X is a cache hit, so it completes earlier than St A. The subsequent instruction is a

fence. With traditional fence, the fence cannot be issued as St A has not completed. Once

the store buffer is drained, the fence is issued and so are the following instructions Ld Y

and St B. Then Ld Y takes some cycles to load data from memory as it is a cache miss. On

the other hand, with S-Fence, since the fence is a scoped fence, it can be issued as soon as

St X completes. In this case, Ld Y can be issued and it starts to load data earlier, without

waiting for the store buffer to be drained. The total execution time is therefore reduced.

3.4 Set Scope

Class scope constrains a fence so that its scope is limited to the class in which the fence is used.

Furthermore, a fence may only intend to order some specific memory accesses. Hence, we

also provide a way to specify the fence scope more accurately. The fence statement S-

FENCE[set, {var1, var2, ...}] is used to specify that the fence has a set scope, and it only

needs to order memory accesses to a certain set of variables {var1, var2, ...}.

P0                          P1
 7  m0 = ...                13  m1 = ...
 8                          14
 9  flag0 = 1               15  flag1 = 1
10  FENCE                   16  FENCE
11  if (flag1 == 0)         17  if (flag0 == 0)
12    critical section      18    critical section

Figure 3.10: Simplified Dekker algorithm.

For example, Fig. 3.10 shows Dekker’s algorithm [34], which is designed to allow only one processor to enter the critical section at a time. The purpose of fences (Lines 10 and 16) is to order the accesses to flag0 and flag1. However, traditional fences will also

order other memory accesses. In particular, in P0, if there is a long latency memory access to m0 (Line 7) before the store to flag0 (Line 9), the fence will stall its following memory accesses until the store to m0 completes, even when the store to flag0 completes quickly as

a cache hit. However, the reordering of the access to m0 across the fence does not violate the programmer's intention, i.e., the exclusive access to the critical section. Hence, we can apply set scope to the fences by specifying them as S-FENCE[set, {flag0, flag1}], which forces the fences to only order the memory accesses to flag0 and flag1. In this case, even if the store to m0 (Line 7) is a long latency memory access, the fence (Line 10) does not have to wait for the store to complete. As long as the store to flag0 (Line 9) has been completed, the fence will allow the following memory accesses to proceed.
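The annotation for this example could look like the following C++ sketch. It is an illustration only: S_FENCE_SET is a hypothetical macro standing in for S-FENCE[set, {flag0, flag1}] and falls back to a full fence so the code compiles on stock hardware, and P1 would be symmetric.

#include <atomic>

// Hypothetical stand-in for S-FENCE[set, {flag0, flag1}].
#define S_FENCE_SET(...) std::atomic_thread_fence(std::memory_order_seq_cst)

std::atomic<int> flag0{0}, flag1{0};
int m0 = 0, m1 = 0;

void p0_enter_critical() {
    m0 = 42;                                     // possibly a long-latency store (Line 7)
    flag0.store(1, std::memory_order_relaxed);   // Line 9
    S_FENCE_SET(flag0, flag1);                   // only accesses to flag0/flag1 are ordered
    if (flag1.load(std::memory_order_relaxed) == 0) {
        /* critical section (P1 is symmetric) */
    }
}

With set scope, the fence can allow later accesses to proceed as soon as the store to flag0 completes, even if the store to m0 is still outstanding.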

3.4.1 Implementation design

Set scope requires identifying, at runtime, the memory accesses that have to be ordered. Similar to class scope, this can be easily implemented with compiler and hardware support.

New fence inst.     set-fence
Supporting inst.    inst. flagging memory operations

Table 3.2: The extension of ISA for set scope.

Compiler support. For set scope, we only need the ISA extension shown in Table 3.2.

We use a new instruction set-fence to represent a fence with set scope. Besides, the ISA is extended to allow the compiler to flag memory accesses to the variables in the set scope. At runtime, when a processor core decodes a memory instruction which is flagged, it sets a scope bit of the allocated ROB entry.

Hardware support. Since memory accesses in the set scope of fences have been flagged using the extended ISA, it is straightforward to set FSB bits for these memory accesses when they are decoded and issued. For simplicity, in the current design, we do not differentiate

memory accesses in set scopes of different fences. Hence, we use a specific FSB entry (e.g., the last entry as shown in Fig. 3.8) to flag if the memory access is in the set scope. By doing this, when the processor encounters a set-fence, it checks the last entry of all FSBs to determine whether it can be issued.
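In terms of the earlier software model, set scope needs only one extra rule, sketched below under the same assumptions: compiler-flagged accesses set a dedicated FSB entry (the last one), and a set-fence checks just that entry.

#include <cstdint>
#include <vector>

// Extension of the earlier class-scope model for set scope (illustrative only).
struct SetScopeModel {
    static const int FSB_ENTRIES = 4;
    static const int SET_ENTRY = FSB_ENTRIES - 1;   // last FSB entry reserved for set scope
    std::vector<uint8_t> inflight;                  // FSB word per in-flight memory op

    // Decode a memory operation; 'flagged' is true if the compiler marked it
    // as accessing one of the variables in some fence's set scope.
    int decode_mem_op(bool flagged) {
        inflight.push_back(flagged ? (uint8_t)(1u << SET_ENTRY) : 0);
        return (int)inflight.size() - 1;
    }
    void complete_mem_op(int idx) { inflight[idx] = 0; }

    // A set-fence issues only when no flagged access is still in flight.
    bool set_fence_can_issue() const {
        for (uint8_t b : inflight) if (b & (1u << SET_ENTRY)) return false;
        return true;
    }
};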

3.5 Using S-Fence

S-Fence is easy for programmers to use. It does not require any additional effort from programmers to reason about their programs. The only extra work for programmers is to specify the constraints on fences, which will be conveyed to the compiler and hardware.

Class scope is consistent with the principle behind the concept of a class in object-oriented programming. A class is used to encapsulate the data and the functions that manipulate the data, keeping them safe from outside interference and misuse. If one implements an algorithm encapsulated in a class and only wants to use fences to order memory accesses inside the class

(e.g., the example in Fig. 3.1), then one can simply use fences with class scope. Moreover, if one only wants to order memory accesses to a certain set of variables (e.g., the example in

Fig. 3.10), then one can choose to use fences with set scope providing the set of variables.

In other cases, if one does not want to specify the scope of the fence, one can simply use a traditional fence.

On the other hand, when users call functions written by others (e.g., libraries implementing concurrent lock-free algorithms), the correctness of the users’ programs should not rely on the order provided by the callees. The users should guarantee the correctness of their parts of the code themselves. In fact, this is also a good programming style, following the principle of modularity in object-oriented programming.

(Class scope vs. set scope) With set scope, a fence can have a narrower scope

compared with class scope. For example, in the work stealing queue (Fig. 3.1), we can

either use class scope, or set scope with parameters of shared variables (e.g., HEAD, TAIL, etc). There are trade-offs between using class scope and using set scope. (1) Compiler.

With class scope, the compiler only needs to insert fs start and fs end to embrace the functions; with set scope, the compiler has to analyze the program to identify the memory accesses to the specified variables, which involves alias analysis. (2) Hardware. Class scope has higher hardware complexity than set scope. Class scope has to set fence scope bits according to the inserted fs start and fs end, and handle nested scopes properly. However, set scope can

set fence scope bits easily according to the flagged memory operations. (3) Performance.

Since set scope is more accurate on what memory operations to order, it may have better performance than class scope. We will compare the performance in the evaluation.

3.6 Experimental Evaluation

The goals of the experimental evaluation are: (1) to assess the performance of S-

Fence compared to traditional fences; (2) to understand how applications can benefit from

S-Fence, and what characteristics can affect the performance of S-Fence; (3) to study the effect of varying the values of the parameters in the hardware implementation.

Processor           8 core CMP, out-of-order
ROB size            128
L1 Cache            private 32 KB, 4 way, 2-cycle latency
L2 Cache            shared 1 MB, 8 way, 10-cycle latency
Memory              300-cycle latency
# of FSB entries    4
# of FSS entries    4

Table 3.3: Architectural parameters.

(Simulation) We implemented S-Fence in the simulator SESC [88] targeting the

MIPS architecture. The simulator is a cycle-accurate, execution-driven multicore simulator with detailed models for the processor and memory systems. We implemented S-Fence by adding FSB, FSS, FSS′ and the associated control logic to the simulator. Currently, scopes

for fences in each benchmark program are manually identified, and the scope information is

fed to the simulator for runtime usage. Table 3.3 shows the default architectural parameters

used in all experiments unless explicitly stated otherwise.

Benchmarks    Type    Description
dekker        set     Dekker algorithm [34]
wsq           class   Work-stealing queue [27]
msn           class   Non-blocking Queue [77]
harris        class   Harris's set [51]
barnes        set     Barnes-Hut n-body [108]
radiosity     set     Diffuse radiosity method [108]
pst           class   Parallel spanning tree [11]
ptc           class   Parallel transitive closure [40]

Table 3.4: Benchmark description.

(Benchmarks) We evaluate the technique using benchmarks in Table 3.4. In these

benchmark programs, fences and atomic compare-and-swap (CAS) instructions are utilized

to implement lock-free algorithms. There are two groups of benchmark programs. The

first group consists of several lock-free algorithms, i.e., dekker, wsq, msn and harris. We use these applications to study how program characteristics can affect the performance of S-Fence. Since these lock-free data structures are not closed programs, we constructed

harnesses to use them to assess the performance of S-Fence. The second group consists of

several full applications. We use them to evaluate how they can benefit from S-Fence, and how architecture parameters affect the performance. pst and ptc are parallel spanning tree

algorithm [11] and parallel transitive closure algorithm [40] using work-stealing queue [27]. barnes and radiosity are from SPLASH-2 [108], and they are inserted with fences to enforce sequential consistency [95].

3.6.1 Lock-free algorithms

In the lock-free algorithms, fences and atomic instructions are used to ensure correctness when they are executed under relaxed memory models. Fences are inserted as suggested in [20, 69, 61]. We use these applications to obtain a preliminary understanding of the performance of S-Fence. Moreover, the workload between fences may affect the benefit of S-Fence. We developed the harness programs to control the workload. We varied the

workload of each task in the applications, from low workload to more expensive computa-

tions, to evaluate the performance of S-Fence. We only measured the execution time of the

parallel sections in the programs. Fig. 3.11 shows the speedups of S-Fence over traditional fence, where the x axis represents different amounts of computations, from low to high.

(Plot: speedup of S-Fence over traditional fence versus workload for dekker, wsq, msn, and harris.)

Figure 3.11: Impact of workload.

As we can see, S-Fence achieves improvement for all applications, with peak

speedups ranging from 1.13x to 1.34x. Moreover, for each application, its speedup varies

with the workload. The trend is first increasing before reaching the peak speedup and decreasing afterwards. This is because, with a low workload, the fence costs relatively more time to order the memory accesses in the scope, in which case S-Fence does not fully manifest its advantage over a traditional fence. As the workload increases, it will reach a point at which S-Fence can manifest its advantage the most, and hence the speedup reaches

the peak value. When the workload increases further, the time cost by the workload will

gradually dominate the overall running time of the program, and the stalls due to fences

gradually become insignificant. Hence, the speedup becomes smaller. From the figure, we

can also observe that different benchmarks reach their peak speedups at different workloads. One reason for this is that they have different amounts of computation in the scope, and hence they need different amounts of workload to reach the peak value. In particular,

dekker reaches the peak value with low workload. Therefore, the speedup of S-Fence over

traditional fence depends on the relative cost of the workload. However, S-Fence always

performs better than traditional fence.

(Plot: normalized execution time, split into Others and Fence Stalls, for pst, ptc, barnes, and radiosity under the T, S, T+, and S+ configurations.)

Figure 3.12: Normalized execution time (T – traditional fence; S – S-Fence; T+ – traditional fence with in-window speculation; S+ – S-Fence with in-window speculation).

3.6.2 Performance on full applications

We now evaluate the performance of S-Fence on several full applications. Fig. 3.12 shows the normalized execution time of applications with traditional fence and S-Fence, with and without in-window speculation [46]. Each bar consists of two parts, the stalling time due to fences Fence stalls and the rest of the execution time Others. All execution time is normalized to the total execution time with traditional fence (the lower, the better). Let us first see their execution time without in-window speculation.

(pst and ptc) We use pst (parallel spanning tree [11]) and ptc (parallel transitive closure [40]) to evaluate S-Fence with class scope. These two applications are both graph applications and use the work-stealing queue to achieve load balancing because of the irregular nature of graph applications. We use S-Fence with class scope in the work-stealing queue implementation, and evaluate their performance. Note that using S-Fence in these applications does not violate the applications' correctness.

As we can see from Fig. 3.12, in the case of pst, the traditional fences used in the work-stealing queue incur stalls accounting for more than 50% of the overall execution time. Using S-Fence reduces fence stalls by 12.9% and achieves a 1.11x speedup in the overall execution time. We can see that S-Fence does not reduce as many stalls as in barnes and radiosity. This is because, in addition to the fences used in the work-stealing queue implementation, another fence is required between the stores to arrays color and parent

(segment 2 in Fig. 3.2) under relaxed consistency models. Since S-Fence does not optimize this fence, it is a full fence outside the work-stealing queue implementation. The existence of this full fence limits the optimization space for S-Fence. In the case of ptc, we can see that fence stalls only occupy a small percentage of overall execution time, as workload between

fences is relatively large. However, S-Fence is still able to reduce around half of the fence stalls in ptc, and achieves a 4.3% improvement in the overall execution time.

(barnes and radiosity) We use barnes and radiosity to evaluate S-Fence with set

scope. Programs running on machines only supporting relaxed consistency models can be

inserted with fences to enforce sequential consistency [95]. This can be done by compilers

to identify memory pairs which have to be ordered based on delay set analysis [95]. Hence,

the inserted fences are used to order some specific memory accesses, but not all of them. So

S-Fence with set scope can be utilized here during compilation, flagging memory operations

that have to be ordered with delay set analysis.

As we can see from Fig. 3.12, in the case of barnes and radiosity with traditional

fence, fence stalls account for a significant portion of the total execution time (38.8% and

34.5% respectively). However, S-Fence is able to eliminate 40%-50% fence stalls, and hence

reduce the overall execution time by 19.5% and 15.8%. This is because, memory accesses

to private or read-only data account for a significant portion of all memory accesses [97],

and such memory accesses will not be flagged by S-Fence, as they are not involved in

any conflicting accesses in the delay set analysis [95]. Hence, S-Fence only flags a part of

memory accesses, and orders them at runtime. In particular, those long latency private

memory accesses will not be flagged, and hence they are not ordered by S-Fence. This will

help hide long latency memory accesses.

(In-window speculation) In-window speculation [46], where speculation on reorder-

ing is employed in instruction window, can be used to reduce some of fence stalls. To

incorporate in-window speculation into S-Fence, a fence now can be issued speculatively,

but before it can be retired from ROB, it has to check the FSBs of store buffer. Fig. 3.12

also shows the performance when in-window speculation is employed. As we can see, with in-window speculation, fence stalls are reduced significantly for both traditional fence and

S-Fence. However, S-Fence still achieves performance improvement over traditional fence.

Figure 3.13: Performance on pst (parallel spanning tree).

(Input size vs. performance) To better understand the behaviors of S-Fence in

pst, we varied the input graph with the number of nodes from 10^4 to 10^7, and measured

the improvement of S-Fence. Fig. 3.13 shows the results, where the line represents the

percentage of reduced fence stalls by using S-Fence and the bars show the execution time

speedup. As we can see, S-Fence reduces stalls by 12.9% to 15.3% compared to traditional

fence; and pst achieves 1.07x to 1.11x speedup in the overall execution time. It is worth

noting that, the speedup decreases as the size of graph increases, although the percentage

of reduced fence stalls increases slightly. This is because, as the size of graph increases,

the existence of the full fence mentioned above results in more stalls, which account for

larger portion of the overall execution time. Therefore, we see the decrease on the overall

performance speedup using S-Fence.

3.6.3 Class scope vs. set scope

(Plot: normalized execution time, split into Others and Fence Stalls, for msn, harris, pst, and ptc with class scope (C.S.) and set scope (S.S.).)

Figure 3.14: Performance comparison between class scope and set scope (C.S. – class scope; S.S. – set scope).

We now compare the performance of class scope and set scope. msn, harris, pst, and ptc are used for this experiment. They used class scope in the previous evaluation, but it is also possible to use set scope by only flagging the shared variables that have to be ordered. Fig. 3.14 shows the results. For all benchmarks, performance with set scope is slightly better than that with class scope, as set scope orders fewer memory accesses. However, the difference between them is not significant, because set scope does not reduce fence stalls much further. Since class scope is easier to use, programmers can choose it instead of set scope without significant performance loss.

3.6.4 Sensitivity study

This section studies how the architecture parameters affect the performance of

S-Fence, including memory access latency and reorder buffer size.

(Plot: normalized execution time, split into Others and Fence Stalls, for pst, ptc, barnes, and radiosity with 200-, 300-, and 500-cycle memory latencies.)

Figure 3.15: Performance for varying memory access latencies.

(Memory access latency) A fence stalls because some memory access prior to it has not completed. Long latency memory accesses impose long stalls on fences. In particular, a cache miss takes as much time as the round-trip latency to memory, incurring a long stall at the following fence. To study the impact of memory access latency on S-Fence, we varied the memory access latency with values of 200 cycles, 300 cycles, and

500 cycles. Fig. 3.15 shows the execution time normalized to the total execution time with traditional fence (the lower, the better). Each cluster shows the results for each benchmark, including traditional fence and S-Fence with different memory access latencies (200T and

200S represent the execution time with 200-cycle latency for traditional fence and S-Fence respectively, and so on). As we can see, for barnes and radiosity, the improvement of S-Fence increases as the latency increases. In particular, a larger latency results in a larger portion of fence stalls, and S-Fence is able to reduce 40%-50% of the fence stalls. However, we see a different trend for pst. As the latency increases, we do not see an increase in improvement, and the fence stalls account for a smaller portion of the overall execution time. One reason for this is that the full fence in pst outside the work-stealing queue incurs more stalls as the latency increases, and the benefit of S-Fence is offset by such stalls.

(Plot: normalized execution time, split into Others and Fence Stalls, for pst, ptc, barnes, and radiosity with 64-, 128-, and 256-entry ROBs.)

Figure 3.16: Performance with different ROB sizes.

(Reorder buffer size) The reorder buffer (ROB) enables out-of-order instruction execution. S-Fence only stops issuing new instructions into the ROB when some memory access in the scope prior to the fence has not completed. When an S-Fence does not have to stall, a larger ROB allows more instructions following the fence to be issued into the ROB.

This would increase the improvement of using S-Fence. In Fig. 3.16, we show the impact of

ROB size on the performance of S-Fence, where we varied the ROB size with values of 64,

128 and 256. We present the results as in Fig. 3.15 (64T and 64S represent the execution time with 64 entries of ROB for traditional fence and S-Fence respectively, and so on). As

we can see, for barnes, S-Fence achieves better performance when the ROB size increases

from 64 to 256. This is because S-Fence used in barnes benefits from larger ROB size by

allowing more instructions to be issued to ROB when a S-Fence does not have to stall. On

the other hand, in the case of radiosity, pst and ptc, the performance of S-Fence remains

stable with different ROB sizes. This is because a smaller ROB size already exposes the

critical path in these applications, and hence larger ROB size does not result in more overlap

of instruction execution. In fact, with 256 entries of ROB, the average number of used ROB

entries is less than 80 for radiosity, pst and ptc, which indicates they do not benefit from a

larger ROB.

3.7 Summary

In this chapter, we proposed the concept of fence scope, and a new fence instruction,

scoped fence (S-Fence), which is constrained in its scope. S-Fence expresses programmers’

intention in their programs, and conveys such information to the hardware to reduce memory

ordering requirements. S-Fence can easily be incorporated into current popular object-oriented programming languages, and the hardware support is lightweight. Experiments conducted on a group of lock-free algorithms and a group of full applications show that S-Fence achieves peak speedups ranging from 1.13x to 1.34x for the lock-free algorithms, and obtains speedups from 1.04x to 1.23x for the full applications.

Chapter 4

Conditional Fence

This chapter resorts to compiler support to help the hardware dynamically eliminate a subset of unnecessary memory orderings, and proposes the conditional fence (C-Fence). The compiler helps to identify associate fences, which order the same pair of conflicting memory accesses. Then, at runtime, we can safely execute past a fence as long as its associates are far away from this fence. The fact that fence associates are typically staggered at runtime allows most fence executions to be eliminated, resulting in significant performance improvement. For the hardware design, a centralized on-chip structure is first introduced, which is simple and satisfactory for a lower number of processors; then a distributed on-chip structure is designed for a higher number of processors to ameliorate the contention in the centralized structure.

4.1 Fence Order

While most processors do not enforce strict memory ordering automatically, they provide support in the form of a fence instruction (i.e., memory fence), which forces a strict ordering between memory accesses that precede it and those following it. A traditional fence enforces this by stalling the processor's execution till all memory accesses encountered before the fence have completed. However, the above approach is more restrictive than what is really required. The purpose of a fence is to prevent memory accesses from being reordered and observed by other processors. In other words, if the reordering is not observed by other processors, the reordering is allowed and hence stalling at the fence is not necessary. Let us introduce the following definitions.

Definition 10. The memory order enforced by a fence is called fence order.

Definition 11. A fence execution is said to be correct if the execution appears to maintain fence orders.

Obviously, the traditional fence execution is correct. However, it is inefficient due to unnecessary stalls. Therefore, to improve the performance, we can relax the execution of a fence such that the execution is correct, maintaining an illusion that fence orders are guaranteed. Next, we provide a sufficient condition for a correct fence execution.

Figure 4.1: (a) Violation of fence order; (b) Violation of program order.

4.1.1 The condition for enforcing fence orders

Consider Fig. 4.1(a), where there are two fences in two processors respectively.

A1, A2, B1, and B2 represent blocks of instructions separated by fences. Furthermore, let us assume a1, a2, b1, and b2 are memory accesses in A1, A2, B1, and B2 respectively.

Due to the fence, any memory access in A1 is ordered before any memory access in A2,

represented by the solid line with arrow; and the same is the case for B1 and B2. Thus,

the fence orders a1 → a2 and b1 → b2 are enforced by the fences. A correct fence execution

should make these fence orders appear to be enforced. In [95], Shasha and Snir have

shown the condition for enforcing program orders in the context of sequential consistency (SC) enforcement. Extending that condition, we can obtain the condition for enforcing fence orders, since the set of all fence orders is a subset of the set of all program orders.

Recall that sequential consistency requires that all memory accesses appear to

take place in the program order which is specified by the program, i.e., all program orders

should be enforced. Shasha and Snir [95] have shown that an execution does not violate

sequential consistency iff program orders (P) and conflict orders (E) do not form any cycle

(i.e., no cycle in P ∪ E). Here, a conflict order is an execution order of conflicting memory

accesses [95]. Two memory accesses are said to conflict if they can potentially access the

same memory location and at least one of them is a write. We use C to denote conflict

relations, thereby E is an orientation of C. Fig. 4.1(b) shows such a cycle a1 → a2 →

b1 → b2 → a1, where a1 conflicts with b2 and a2 conflicts with b1. The execution sequence

a2 → b1 → b2 → a1 will violate sequential consistency, because no sequentially consistent

execution can generate the same result, where both x and y read their old values before

stores. Intuitively, the cycle indicates that the reordering of (a1,a2) is observed by (b1, b2)

in the other processor. However, if any one of the conflict orders (i.e., a2 → b1 and b2 → a1) is in the opposite direction, the cycle will be broken and there is no sequential consistency violation.

In the case of fence orders in Fig. 4.1(a), we have to enforce all fence orders (F),

analogous to the program orders for SC in Fig. 4.1(b). In particular, the set of all fence orders is a subset of the set of all program orders (F ⊆ P). Accordingly, we can state the following condition for

enforcing fence orders.

Corollary 1. Fence orders are enforced iff fence orders (F) and conflict orders (E) do not form any cycle, i.e., no cycle in F ∪ E.

This is because, without cycles, F ∪ E can be extended to a total order [31], which

indicates all fence orders F are enforced. Therefore, in Fig. 4.1(a), to enforce fence orders,

we have to prevent such cycles as a1 → a2 → b1 → b2 → a1, assuming (a1, b2) and (b1,

a2) are conflict relations. Note that, even if there is another memory access b1′ after b1 in

B1, and (a1, b1′) and (a2, b1) are conflict relations, there is no violation of fence orders

involving a1,a2, b1, and b1′. This is because there is no fence order between b1 and b1′ and

hence there is no cycle formed by fence orders and conflict orders.

4.1.2 Associate fences

The traditional fence execution breaks such cycles by enforcing delays between

memory operations in the same processor. For example, in Fig. 4.1(a), if the issuing of

second instructions in the pairs (a1,a2) and (b1, b2) is delayed until the first completes, fence

orders will be enforced. The execution a2 → b1 → b2 → a1, which violates fence orders, is

no longer possible, because now a2 can not complete before a1.

Figure 4.2: Associate fences.

Definition 12. We call fences associate fences, or simply associates, if they

appear in the same cycle in F∪E, ensuring delays between memory accesses involved in the

cycle.

As we can see from Fig. 4.2(a), Fence1 and Fence2, which order conflicting accesses to x and y, are associates; similarly, Fence3 and Fence4 are associates. Moreover, a cycle can involve more than two processors [95], in which case more than two fences are associates. Fig. 4.2(b) shows this case, where Fence1, Fence2 and Fence3 are associates.

4.2 Conditional Fence

This section describes conditional fence for reducing fence overhead. We first motivate the approach by showing that fences introduced statically may be superfluous dynamically, then describe an empirical study which shows that most fence executions are indeed superfluous. Conditional fence is then proposed by taking advantage of this property.

4.2.1 Interprocessor delays to ensure fence orders

While fences in the program are used to ensure required memory orderings, it might be possible to remove the fences dynamically and still maintain an illusion that all required memory orderings are guaranteed. More specifically, we observe that if two fences are staggered during the course of program execution, they may not need to stall.

Proc A            Proc B
a1: St x          b1: St y
Fence 1           Fence 2
a2: Ld y          b2: Ld x

(Conflict relations: (a1, b2) and (a2, b1); fence orders a1 → a2 and b1 → b2. Enforcing either the conflict order a1 → b2 or b1 → a2 makes F ∪ E acyclic.)

Figure 4.3: Interprocessor delay is able to ensure fence orders.

Let us consider the cycle shown in Fig. 4.1(a). According to the previous definition, Fence1 and Fence2 are associates. Moreover, let us denote the delay relation by D, in which uDv indicates that u must complete before v is issued. Fence1, by enforcing the delay a1Da2, and Fence2, by enforcing the delay b1Db2, are able to break the cycle in F ∪ E. While each of the above two intraprocessor delays is needed to enforce fence orders, we observe

that fence orders can be alternately ensured with just one interprocessor delay. More

specifically, we observe that if either b2 is issued after a1 completes, or if a2 is issued after b1 completes, fence orders are ensured. In other words, either a1Db2 or b1Da2 is sufficient to guarantee fence orders. To see why let us consider Fig. 4.3 that shows the execution ordering resulting from a1Db2. With this ordering, we can now see that F∪E becomes

acyclic and thus fence orders are ensured. Likewise, b1Da2 makes the graph acyclic. Thus,

for every cycle, even if one of these interprocessor delays is ensured, then there is no need

to stall in either fence. It is easy to prove the correctness by observing that no cycles will exist when the interprocessor delays are enforced.

(Figure: timelines for three scenarios. In (a) and (b), Proc A executes a1: X=, Fence 1, a2: Y= and Proc B executes b1: =Y, Fence 2, b2: =X; in (c), Proc B instead executes b3: =P, Fence 3, b4: =Q.)

Figure 4.4: (a) If Fence1 has already finished stalling by the time Fence2 is issued, then there is no need for Fence2 to stall. (b) In fact, there is no need for even Fence1 to stall, if all memory instructions before it complete by the time Fence2 is issued. (c) No need for either Fence1 or Fence3 to stall as they are not associates, even if they are executed concurrently.

We now describe execution scenarios in which one of these interprocessor delays is naturally ensured, which obviates the need for stalling at fences. Fig. 4.4(a) illustrates the scenario in which Fence1 has finished stalling by the time Fence2 is issued (indicated by a dashed arrow). This ensures that the write to X would have been completed by the time Fence2 is issued and hence instruction b2 is guaranteed to read the value of X from a1. As we already saw, this interprocessor delay (b2 being executed after a1) is sufficient to ensure fence orders. As long as b2 is executed after a1, it does not really matter if b1 and b2 are reordered. In other words, there is no need for Fence2 to stall. Furthermore, note that even Fence1 needn't have stalled, provided we can guarantee that all memory operations before Fence1 (including a1) complete before the issue of b2. This is illustrated in Fig. 4.4(b), which shows that all memory operations prior to Fence1 have completed (at time t2) before Fence2 is issued (at time t3), ensuring that b2 is executed after a1 completes. Thus, for

any two concurrently executing fence instructions, if all memory instructions prior to the earlier fence complete before the later fence is issued, then there is no need for either fence to stall. In other words, an interprocessor delay between two fences obviates the need for

either fence to stall.

An interprocessor delay is only necessary for fences that are associates. As shown

in Fig. 4.4(c), while Fence1 is enforcing the ordering of variables X and Y, Fence3 is

enforcing the ordering of variables P and Q. Thus, even if the above memory accesses are

reordered, there is no risk of fence order violation; consequently, in this scenario, the two

fences need not stall. Even if two concurrently executing fences are not staggered, there is

no need for the fences to stall, if they are not associates.

4.2.2 Empirical study

(Fences for ensuring SC) Fences can be utilized to enforce sequential consistency

(SC) for programs running on machines only supporting relaxed memory consistency mod-

els. This is based on the technique called delay set analysis, developed by Shasha and

Snir [95], which finds a minimal set of execution orderings that must be enforced to guar-

antee SC, and inserts memory fences to prevent violation. Various compiler techniques

[35, 39, 55, 66, 98] have been proposed to minimize the number of fences required to en-

force SC. For example, Lee and Padua [66] developed a compiler technique that reduces

the number of fence instructions for a given delay set, by exploiting the properties of fence

and synchronization operations. Later, Fang et al. [39] also developed and implemented

several fence insertion and optimization algorithms in their Pensieve compiler project. In

spite of the above optimizations, the program can experience significant slowdown due to

the addition of memory fences, with some programs being slowed down by a factor of 2 to

3 because of the insertion of fences [35, 39, 55].

Program     # of fences   # of fences that needn't stall   % fences that needn't stall
barnes      128222492     98053631                         76.48
fmm         1997976       1976759                          98.51
ocean       22881352      17720437                         77.45
radiosity   57072817      52513870                         92.02
raytrace    91049516      84955849                         93.31
water-ns    39738011      38383903                         96.59
water-sp    41763291      40569645                         97.15
cholesky    659910        644208                           99.88
fft         214568        214326                           99.88
lu          58318507      44048176                         75.54
radix       751375        619139                           83.41

Table 4.1: Study: A significant percentage of fence instances need not stall.

(Study) We have seen examples of why fences may not be necessary (i.e., they

need not stall) dynamically. In this study, we want to check empirically how often the

fence instances are not required to stall. SPLASH-2 benchmark programs are used for this

study, where fences are inserted to ensure SC using Shasha and Snir’s delay set analysis. As

shown in Table 4.1, for each benchmark program, the first column shows the total (dynamic)

number of fences encountered; the second column shows the total number of fences which

were not required to stall, since the interprocessor delay was already ensured during the

course of execution; the third column shows the percentage of dynamic fence instances that

were not required to stall. As we can see, about 92% of the total fences executed do not need

to stall. This motivates the proposed technique for taking advantage of this observation

and reducing the time spent stalling at fences.

4.2.3 Conditional Fence (C-Fence)

As opposed to the conventional memory fence that enforces fence orders through intraprocessor delays between memory operations, a C-Fence ensures fence orders through interprocessor delays between associate fences. This gives the C-Fence mechanism an opportunity to exploit interprocessor delays that manifest in the normal course of execution and to conditionally stall only when required to. C-Fence provides ISA support to let the

compiler convey information about associate fences to the hardware. Using this informa-

tion, the C-Fence ensures that there is a delay between two concurrently executing associate

fences. In other words, it ensures that all the memory operations prior to the earlier fence

complete before the later fence is issued.

Execution of C-Fence i:
  (i)   Mark C-Fence i as active;
  (ii)  while (any associate is active) stall;
  (iii) Continue executing instructions;
        ...
  (iv)  Mark C-Fence i inactive when all memory instructions prior to it have completed.

Figure 4.5: The semantics of C-Fence.

(Semantics of C-Fence) A fence instruction (either conventional or C-Fence) is said to be active as long as memory operations prior to it have not yet fully completed. However, unlike a conventional fence which necessarily stalls the processor while it is active, a C-Fence can allow instructions past it to execute even while it is active. More specifically, when a

C-Fence is issued, the hardware checks if any of its associates are active; if none of its associates are active, the C-Fence does not stall and allows instructions following it to be

executed. If however, some of its associates are active when a C-Fence is issued, the C-Fence stalls until none of its associates are active anymore (Fig. 4.5).
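A small C++ model of this behavior is given below; it is an illustration of the semantics in Fig. 4.5 with assumed names, and it abstracts away the memory system entirely.

#include <set>

struct CFenceState {
    std::set<int> active;                       // fence-ids of currently active fences

    // A newly issued C-Fence may proceed only while none of its associates is active.
    bool can_proceed(const std::set<int>& associates) const {
        for (int a : associates)
            if (active.count(a)) return false;  // an associate is still active -> stall
        return true;
    }
    void mark_active(int fence_id)   { active.insert(fence_id); }
    // Called once all memory operations prior to the fence have completed.
    void mark_inactive(int fence_id) { active.erase(fence_id); }
};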

(Figure: (a) Proc A executes X=, C-Fence 1, Y= and Proc B executes =X, C-Fence 2, =Y; C-Fence 2 stalls from t2 until C-Fence 1 becomes inactive at t3. (b) Procs A, B, and C issue associate C-Fences 1, 2, and 3; C-Fence 2 and C-Fence 3 stall while an associate remains active and proceed at t4 and t5 respectively.)

Figure 4.6: (a) Example with 2 C-Fences; (b) Example with 3 C-Fences.

Fig. 4.6(a) shows an example to illustrate its semantics with 2 C-Fences. At time t1, when C-Fence 1 is issued, none of its associates are active and so it does not stall,

allowing instructions following it to be executed. At time t2, when C-Fence 2 is issued, its

associate C-Fence 1 is still active and so C-Fence 2 is stalled. C-Fence 1 ceases to be active

at time t3, at which time all memory operations prior to it have been completed; this allows

C-Fence 2 to stop stalling and allows processor 2 to continue execution past the fence.

Now let us consider another example with 3 C-Fences that are associates of each

other, as shown in Fig. 4.6(b). As before, C-Fence 1 does not stall when it is issued at time

t1. At time t2, both C-Fence 2 and C-Fence 3 are made to stall as their associate C-Fence

1 is still active. At time t3 although C-Fence 1 ceases to be active, the fences C-Fence 2

and C-Fence 3 which are still active, continue to stall. At time t4, C-Fence 3 ceases to be

active, which allows C-Fence 2 to stop stalling and allows processor 2 to continue execution

past C-Fence 2. At time t5, C-Fence 2 ceases to be active, which allows C-Fence 3 to stop

stalling and allows processor 3 to continue execution past C-Fence 3. It is important to

note from this example that although the two fences C-Fence 2 and C-Fence 3 are waiting

for each other to become inactive, there is no risk of a deadlock. This is because, while they are waiting for each other (stalling), each of the processors still continues to process memory instructions before the fence, which allows each of them to become inactive at some point in the future.

(Figure: four processors issuing associate C-Fences 1-4; in (a) only C-Fence 2 and C-Fence 3 overlap in time, while in (b) all four C-Fences execute concurrently.)

Figure 4.7: Scalability of C-Fence.

(Scalability of C-Fence) The C-Fence mechanism exploits interprocessor delays that manifest between fence instructions during the normal course of execution. In other words, it takes advantage of the fact that fence instructions are staggered in execution, as the study showed. However, with an increasing number of processors, it becomes more likely that some of the fences will be executed concurrently. As shown in Fig. 4.7(a), C-Fence 2 and C-Fence 3 are executed concurrently, which necessitates that each of the fences stall while the other is active. Thus the performance gains of the C-Fence mechanism are likely to decrease with an increasing number of processors. However, it is expected to perform better than the conventional fence because it is very unlikely that all C-Fences execute concurrently, in which case each of the C-Fences would be required to stall, as shown in Fig. 4.7(b). In fact, the experiments confirm that even with 16 processors the C-Fence mechanism performs significantly better than the conventional fence.

4.3 C-Fence Hardware: Centralized Active Table

In this section, we discuss how the C-Fence mechanism is implemented in hardware with a centralized active table. We first describe the idealized hardware implementation, and then describe the actual hardware implementation, which tries to mimic the idealized implementation while using far fewer hardware resources.

4.3.1 Idealized hardware

The key step in the C-Fence mechanism is to check if the C-Fence needs to stall the processor when it is issued. To implement this, we have a global table that maintains information about currently active fences, called the active-table. We also have a mechanism to let the compiler convey information about associate fences to the hardware. Once this is conveyed, when a C-Fence is issued, the active-table is consulted to check if the fence's associates are currently active; if so, the processor is stalled. We now explain what information is stored in the active-table, and what exactly is conveyed to the processor through the C-Fence instruction to facilitate the check. Instead of maintaining the identity of the currently active fences in the active-table, we maintain the identities of the associates of the

currently active fences. This way, when a fence is issued we can easily check if its associates

are active. To this end, each static fence is assigned a fence-id, which is a number from

1 to N, where N is the total number of static fences. Then each fence is also given an

associate-id, which is an N bit string; bit i of the associate-id is set to 1 if the fence is an associate of the ith fence. The fence-id and the associate-id are conveyed by the compiler as operands of the C-Fence instruction. When a C-Fence instruction is issued, its associate-id is stored in the active-table. Then using its fence-id i, the hardware checks the ith bit of all

the associate-ids in the active table. If none of the bits are a 1, then the processor continues to issue instructions past the fence without stalling; otherwise, the processor is made to stall. While the processor stalls, it periodically checks the ith bit of all the associate-ids in the active table, and it proceeds when none of the bits are a 1. Finally, when the fence becomes inactive, it is removed from the active table.
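The idealized active-table can be modeled directly with bit vectors, as in the sketch below; the names, the use of std::bitset, and the treatment of fence-ids as 0-based bit positions are assumptions of this illustration, and N would be the total number of static fences.

#include <bitset>
#include <cstddef>
#include <vector>

template <std::size_t N>                         // N = total number of static fences
struct ActiveTable {
    std::vector<std::bitset<N>> entries;         // associate-ids of currently active fences

    // A C-Fence with the given fence-id may proceed only if no active fence
    // lists it as an associate (i.e., its bit is clear in every stored entry).
    bool may_proceed(std::size_t fence_id) const {
        for (const auto& assoc : entries)
            if (assoc.test(fence_id)) return false;   // some active fence is an associate
        return true;
    }
    // On issue of a C-Fence, record its associate-id and remember its slot.
    std::size_t add(const std::bitset<N>& associate_id) {
        entries.push_back(associate_id);
        return entries.size() - 1;
    }
    // Once the fence becomes inactive, clear its entry (keeps other slots stable).
    void remove(std::size_t slot) { entries[slot].reset(); }
};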

4.3.2 Actual hardware

The hardware described above is idealized in the following two respects. First, the number of static fences is not bounded; in fact, the number of static fences was as high as

1101 in the experiments. Clearly, we cannot have that many bits either in the active-table or as an instruction operand. Second, we implicitly assumed that the active-table has an unbounded number of entries; since each processor can have multiple fences that are active, it is important to take care of the scenario in which the active-table is full. We deal with this issue by taking advantage of the fact that although there can be an unbounded number of static fences, a small (bounded) number of fences typically accounts for the bulk of the dynamic execution count. In fact, the study shows that just 50 static fences account for 90% of the dynamic execution count, as shown in Fig. 4.8.

Thus, in the actual hardware implementation we implement the C-Fence mechanism for the frequent-fences and the conventional-fence mechanism for the rest. Similarly, when a C-Fence is issued and the active-table is full, we make the fence behave like a conventional fence.

Figure 4.8: Distribution of fences (% dynamic execution count vs. number of static fences ordered by frequency).

(C-Fence + Conventional Fence) The general idea is to have the frequent-fences behave like C-Fences and the rest behave like conventional fences. More specifically, when a C-Fence is issued, it will stall while any of its associates are active and then proceed.

When a conventional fence is issued it will stall until all memory operations prior to it have completed. However, a complication arises if a nonfrequent-fence f (which is supposed to behave like a conventional fence) has a frequent-fence as one of its associates. When such a fence f is issued, it is not enough for it to stall until all memory operations prior to it have completed. It should also stall while any of its associate frequent-fences are still active. To see why, let us consider the example shown in Fig. 4.9(a), which shows a frequent-fence

C-Fence A first issued in processor A. Since none of its associates are active, it proceeds without stalling. While C-Fence A is yet to complete, nonfrequent-fence Fence B is issued in processor B. It is worth noting that if Fence B is merely made to stall until all its memory operations have completed, there is a possibility of the execution order (a2, b1, b2, a1) manifesting. This clearly is a violation of fence orders. To prevent this situation, Fence B has to stall while C-Fence A is active.

Figure 4.9: (a) C-Fence + Conventional fence. (b) HW support. (c) Action upon issue of a C-Fence.

(Compiler and ISA Support) The compiler first performs delay set analysis to determine the fence insertion points. Once the fences are identified, a profile-based approach is used to determine the N most frequent fences; in the experiments we found that a value of 50 was sufficient to cover most of the executed instances. Once the frequent-fences are identified, the compiler assigns fence-ids to the frequent-fences. Every nonfrequent-fence is assigned a fence-id of 0. For every fence, either frequent or nonfrequent, its associates are identified and then its associate-id is assigned. Fig. 4.9(b) describes the instruction format for the newly added C-Fence instruction. The first 6 bits are used for the opcode; the next two bits are the control bits. The first control bit specifies whether the current fence is a frequent-fence or a nonfrequent-fence. The next control bit specifies whether the current fence has an associate-id; note that the current fence has an associate-id if it is an associate of some frequent-fence. The next 50 bits specify the associate-id, while the final 6 bits specify the fence-id of a frequent-fence. Finally, it is worth noting that the value of N is fixed since it is part of the hardware design; in the experiments

we found that a value of 50 is sufficient to cover most of the executed instances and we could

indeed pack the C-Fence instruction within 64 bits. However, if workloads do necessitate a

larger value of N, the C-Fence instruction needs to be redesigned; one possibility is to use

register operands to specify the associate-id.
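The following sketch shows one way the 64-bit C-Fence instruction word described above could be packed and unpacked in software. The field widths (6-bit opcode, two control bits, 50-bit associate-id, 6-bit fence-id) follow the text, but the exact bit positions and the names are assumptions made for illustration only.

#include <cstdint>

// Hypothetical packing of the C-Fence instruction word (64 bits total):
// [63:58] opcode, [57] frequent-fence?, [56] has associate-id?,
// [55:6]  50-bit associate-id, [5:0] fence-id of a frequent fence (0 otherwise).
struct CFenceFields {
    uint8_t  opcode;        // 6 bits
    bool     isFrequent;    // control bit 1
    bool     hasAssociates; // control bit 2
    uint64_t associateId;   // 50 bits, bit i => associate of frequent fence i
    uint8_t  fenceId;       // 6 bits
};

uint64_t encode(const CFenceFields& f) {
    return (uint64_t(f.opcode & 0x3F) << 58) |
           (uint64_t(f.isFrequent)    << 57) |
           (uint64_t(f.hasAssociates) << 56) |
           ((f.associateId & ((uint64_t(1) << 50) - 1)) << 6) |
           uint64_t(f.fenceId & 0x3F);
}

CFenceFields decode(uint64_t word) {
    return { uint8_t((word >> 58) & 0x3F),
             bool((word >> 57) & 1),
             bool((word >> 56) & 1),
             (word >> 6) & ((uint64_t(1) << 50) - 1),
             uint8_t(word & 0x3F) };
}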

(C-Fence: Operation) To implement the C-Fence instruction, we use a HW struc-

ture known as active-table, which maintains information about currently active fences. Each

entry in the active table has a valid bit, an associate-id and a fence-id, with the associate-id and fence-id each 50 bits wide,

as shown in Fig. 4.9(b). Note that 50 bits are used to represent fence-id in the active table,

where bit i is set to 1 if its fence-id is i. We shall later see why this expansion is helpful.

To explain the working of the C-Fence instruction, let us consider the issuing of C-Fence

instructions shown in Fig. 4.9(a). Let us suppose that the first is a frequent-fence (A) with

a fence-id of 5, which is an associate of a nonfrequent-fence (B). When C-Fence A is issued,

its fence-id and associate-id are first inserted into the active-table. It is worth noting that

its associate-id is a 0 since it is not an associate of any of the frequent-fences. While writing

the fence-id to active-table, it is decoded into a 50 bit value (since its id is 5, the 5th bit is

set to 1 and the rest are set to 0). Then the active table is checked to verify if any of its

associates are currently active; to perform this check the 5th bit (since its fence-id is 5) of

all the associates are examined. Since this is the first fence that is issued, none of them will

be a 1 and so C-Fence A does not stall. When Fence B is issued, its fence-id and associate-id

are first written to the active-table. Since this is a nonfrequent-fence, its fence-id is a 0.

Since it is an associate of frequent-fence A (with fence-id 5), the 5th bit of its associate-id

is set to 1. Even though this is a normal fence instruction (nonfrequent-fence), we cannot

simply stall, since it is an associate of a frequent-fence. More specifically, we need to check

if its associate A is active; to do this, the 5th bit of each fence-id in the active-table is exam-

ined. This also explains why the fence-ids were expanded to 50 bits, as this would enable

this check to be performed efficiently. Consequently, this check will return true, since fence

A is active; thus Fence B is made to stall until this check returns false. In addition to this stall, it also has to stall until all of its local memory operations prior to the fence have completed, since it is a nonfrequent-fence. The actions performed upon the issue of each C-Fence instruction are generalized and illustrated in Fig. 4.9(c).

Figure 4.10: Coherence of active-table (Proc A issues X =, C-Fence 1, = Y; Proc B issues Y =, C-Fence 2, = X; each processor performs a write W1/W2 to the active-table followed by a read R1/R2).

(Active Table) The active-table is a shared structure that is common to all processors within the chip, and in that respect it is similar to a shared on-chip cache, but much smaller. In the experiments we found that a 20 entry active-table was sufficient. Each entry in the active table has a valid bit, an associate-id and a fence-id. There are three operations that can be performed on the active table. First, when a C-Fence is issued, the information about the C-Fence is written to the active table. This is performed by searching the active-table for an invalid entry and inserting the fence information there. If there are no invalid entries, the table is full and we deal with this situation by treating the fence like a conventional fence. The second operation on the active table is a read to check if the associates of the issued C-Fence are currently active. Finally, when a C-Fence becomes inactive, its entry is removed from the active table by clearing the valid bit. Since the active-table is a shared structure it

is essential that we provide a coherent view of the active table to all processors. More specifically, we must ensure that two associate C-Fences that are issued concurrently are not allowed to proceed simultaneously. This scenario is illustrated in Fig. 4.10 with fences

C-Fence 1 and C-Fence 2. As we can see, the two C-Fences are issued at the same time in two processors. This will result in a write followed by a read to the active-table from each of the processors as shown in Fig. 4.10. To guarantee a consistent view of the active-table, we should prevent the processors from reordering the reads and writes to the active-table.

This is enforced by ensuring that requests to access the active-table from each processor are processed in order. It is important to note that it is not necessary to enforce atomicity of the write and read to the active table. This allows us to provide multiple ports to access the active-table for the purpose of efficiency.
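A software model of the three active-table operations described above might look as follows; the 20-entry size and 50 frequent fences match the configuration used in the experiments, while the structure and names are illustrative assumptions rather than the actual hardware interface.

#include <array>
#include <bitset>
#include <optional>

constexpr int kFrequent = 50;   // number of frequent fences (fence-ids 1..50)
constexpr int kEntries  = 20;   // active-table entries found sufficient

struct Entry {
    bool valid = false;
    std::bitset<kFrequent + 1> fenceId;      // one-hot expansion of the fence-id
    std::bitset<kFrequent + 1> associateId;  // bit i => associate of fence i
};

struct ActiveTable {
    std::array<Entry, kEntries> entries;

    // (1) Write: returns the slot used, or nullopt if the table is full, in which
    // case the issuing fence falls back to conventional-fence behavior.
    std::optional<int> insert(int fenceId, std::bitset<kFrequent + 1> assoc) {
        for (int i = 0; i < kEntries; ++i)
            if (!entries[i].valid) {
                entries[i].valid = true;
                entries[i].fenceId.reset();
                if (fenceId != 0) entries[i].fenceId.set(fenceId);
                entries[i].associateId = assoc;
                return i;
            }
        return std::nullopt;                 // table full
    }

    // (2) Read: are any associates of fence `fenceId` currently active?
    bool associatesActive(int fenceId) const {
        for (const Entry& e : entries)
            if (e.valid && e.associateId.test(fenceId)) return true;
        return false;
    }

    // Read used by a nonfrequent-fence: is frequent fence i itself active?
    bool fenceActive(int i) const {
        for (const Entry& e : entries)
            if (e.valid && e.fenceId.test(i)) return true;
        return false;
    }

    // (3) Remove: clear the valid bit when the fence becomes inactive.
    void remove(int slot) { entries[slot].valid = false; }
};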

(C-Fence: Implementation in Processor Pipeline) Before discussing the implementation of the C-Fence in the processor pipeline, let us briefly examine how the conventional memory fence is implemented. The conventional fence instruction, when it is issued, stalls the issue of future memory instructions until memory instructions issued earlier complete.

In other words, once the fence instruction is issued, future memory instructions are stalled until (a) the memory operations that are pending within the LSQ (load/store queue) are processed and (b) the contents of the write buffer are flushed. In contrast, upon the issue of the C-Fence instruction, the processor sends a request to the active table to check whether the fence's associates are currently present there. The processor starts issuing future memory instructions upon the reception of a negative response, which signals the absence of associates in the active-table. The presence of associates in the active-table (positive response), however, causes the processor to repeatedly resend requests to the active-table,

until a negative response is received. Thus the benefit of the C-Fence over the conventional fence is realized when a negative response is received before tasks (a) and (b) complete. It is thus important that the processor can receive a response from the active-table as soon as possible. To enable this, we maintain a small buffer containing pairs of instruction address and corresponding fence-id of decoded C-Fence instructions. The buffer behaves as an LRU cache. This enables us to send a request to the active-table as soon as the fence instruction is fetched (if the instruction address is present in the buffer), instead of waiting until it is decoded. To decide the number of entries necessary in the buffer, we conducted experiments with 5 entries and 10 entries. The results are shown in Fig. 4.11.

As we can see, both 5 entries and 10 entries result in high hit rates (94.22% and 95.16% on average, respectively). Hence, we consider 5 entries to be sufficient: with a 94.22% hit rate, 94.22% of fences can send their lookup request before decoding, because a hit in the buffer already identifies the fetched instruction as a fence. A 5-entry buffer is also very small, requiring only 45 bytes of storage.
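A minimal sketch of such a buffer, modeled as a small LRU cache from instruction address to fence-id, is shown below; the class name and interface are hypothetical and only illustrate the behavior described above.

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Tiny LRU buffer mapping a C-Fence instruction's address to its fence-id,
// so the active-table lookup can be launched at fetch time on a hit.
class FencePCBuffer {
    using List = std::list<std::pair<uint64_t, int>>;   // front = most recently used
    std::size_t capacity_;
    List lru_;
    std::unordered_map<uint64_t, List::iterator> map_;

public:
    explicit FencePCBuffer(std::size_t capacity = 5) : capacity_(capacity) {}

    // Returns the fence-id if the fetched PC is a known C-Fence, else -1.
    int lookup(uint64_t pc) {
        auto it = map_.find(pc);
        if (it == map_.end()) return -1;
        lru_.splice(lru_.begin(), lru_, it->second);     // refresh recency
        return it->second->second;
    }

    // Called after decode so future fetches of this PC hit in the buffer.
    void insert(uint64_t pc, int fenceId) {
        if (lookup(pc) != -1) return;                    // already cached
        if (lru_.size() == capacity_) {                  // evict least recently used
            map_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(pc, fenceId);
        map_[pc] = lru_.begin();
    }
};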

Figure 4.11: Hit rates of the instruction buffer (5 entries vs. 10 entries) across the SPLASH-2 benchmarks.

(Discussion) For performance reasons, different architectures that use relaxed memory models provide various types of fence instructions that enforce different memory orders [4, 39]. For example, Intel IA-32 provides three types of fence instructions, i.e., mfence, lfence and sfence. Similarly, C-Fences can also be subdivided into different types corresponding to these conventional finer-grain fences. During execution, such finer-grain C-Fences also consult the active table to determine whether they can be issued. If not, they only order the specific memory instructions that the corresponding conventional fences would order. Hence, better performance can be achieved, as finer-grain fences are inherently cheaper.

4.4 C-Fence Hardware: Distributed Active Table

While the centralized active table is simple and satisfactory for a small number of processors, its centralized nature makes it a bottleneck as the number of processors grows. In this section, we first study the scalability of C-Fence with the centralized active table, and then propose a distributed design, the distributed active table, to ameliorate the request contention arising from the increasing number of processors.

4.4.1 Scalability analysis of C-Fence

Figure 4.12: Breakdown of runtime into execution, stall, and communication time for 4, 8, 16 and 32 processors across the SPLASH-2 benchmarks.

For the C-Fence mechanism, the runtime of each processor can be broken down into three categories: execution time, when the processor is busy running the program or stalls for reasons unrelated to fences; stall time, when the processor stalls because of some executing associates; and communication time, when the processor is waiting for a response from the active table. Fig. 4.12 shows the study results on the SPLASH-2 benchmarks, measured and averaged across all processors (4, 8, 16 and 32). As we can see, for most programs, the time spent on communication and stall accounts for a significant portion of the total runtime. This time constitutes the overhead of enforcing fence orders. Let us consider the communication portion. As the number of processors increases, communication accounts for a greater portion of the total runtime.

In particular, the communication portion is very small for 4 and 8 processors. However, it increases significantly when the number of processors increases to 16 and 32. For some programs (e.g., barnes, raytrace and lu), communication can account for around 15% of the total runtime.

The reason for the increasing communication time is that, with more processors, more requests are sent to the active table concurrently. The centralized structure can only process these requests in order, which causes the communication time to increase. Moreover, as the number of processors increases, a fence has a greater chance of stalling due to its executing associates. In this case, the processor needs to check the active table periodically, repeatedly sending the same request to the active table. This further aggravates the contention problem.

To motivate the new design of the active table, let us first analyze the requests sent to it. A fence instance may be issued after a single request or after multiple requests to the active table. We categorize these requests into two types: the first-request, which is sent for a fence instance for the first time, and the subsequent-request (i.e., periodic request), which is sent while a fence instance remains stalled because previous requests found active associates.

Hence, a fence instance has either only a first-request, or a first-request followed by subsequent-requests. In the latter case, the fence has some executing associates, so it keeps sending requests and checking until there is no executing associate. However, these repetitive subsequent-requests contend with the first-requests and hence increase the average cost of all fences.

Figure 4.13: Percentages of request types (first-request vs. subsequent-request) for 4, 8, 16 and 32 processors. Figure 4.14: Time of processing a request, normalized to the time required by a request without waiting, for 4, 8, 16 and 32 processors.

Fig. 4.13 shows the percentages of the two request types. As we can see, the percentage of subsequent-requests increases as the number of processors increases. The percentage can be as high as 80% for 32 processors (e.g., barnes, raytrace and fft). These repetitive requests place a great burden on the active table, resulting in a longer time for the active table to process a request. Fig. 4.14 shows the average time of processing a request for different numbers of processors, normalized to the time required by a request without waiting. We can see that, as the number of processors increases, a request requires more time to be processed, as it is delayed by requests ahead of it. For 32 processors, the slowdowns can be higher than 8x for barnes and raytrace. This increases the communication time. Hence, the centralized active table restricts the

scalability of the C-Fence mechanism. This observation motivates us to design a distributed active table, which can process subsequent-requests locally and ameliorate the contention present in the centralized design.

4.4.2 Design of distributed active table

In this section, we introduce the distributed active table, which distributes the table into

multiple sub-tables, each of which handles a subset of fences. Each sub-table is logically

associated with a processor, and the information of fences issued by the processor is stored

in its corresponding sub-table. All sub-tables are connected by an on-chip network. Fig.

4.15 provides an overview of the distributed active table.

Figure 4.15: Overview of the distributed active table (per-processor sub-tables connected by an interconnection network). Figure 4.16: Distributed active table, one sub-table (fields: Valid bit, Fence-ID, Associate-ID, # of fences to wait, Waiting processors).

Fig. 4.16 shows the detailed design of each sub-table. Each sub-table contains several entries, each of which consists of the five fields listed below; a struct-level sketch of one entry follows the list.

1. Valid bit (1 bit), which indicates whether the entry is available.

2. Fence-ID (50 bits), the ID of the frequent fence. Note that fence-id is represented

using 50 bits, where bit i is set to 1 if its fence-id is i.

3. Associate-ID (50 bits), the associate information of the fence. Bit i is set to 1 if Fence

i is one of its associates.

4. Waiting-Number (8 bits), indicating how many active associates the fence still needs to wait for. After the processor has determined this number, it only checks this field locally in its corresponding sub-table.

5. Waiting-Processors. The number of bits equals the number of processors. This

field indicates which processors are waiting for the fence. Bit i is set to 1 if processor

i is waiting for this fence to complete.
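Putting the five fields together, one sub-table entry could be modeled as the following struct; the constants for the number of frequent fences and processors are assumptions chosen to match the design described above.

#include <bitset>
#include <cstdint>

constexpr int kFrequent   = 50;  // number of frequent fences (assumption matching the text)
constexpr int kProcessors = 8;   // number of processors; sized per design

// One entry of a per-processor sub-table in the distributed active table,
// mirroring the five fields listed above.
struct SubTableEntry {
    bool valid = false;                           // (1) Valid bit
    std::bitset<kFrequent + 1> fenceId;           // (2) Fence-ID, one-hot over the frequent fences
    std::bitset<kFrequent + 1> associateId;       // (3) Associate-ID, bit i => associate of fence i
    uint8_t waitingNumber = 0;                    // (4) Waiting-Number
    std::bitset<kProcessors> waitingProcessors;   // (5) Waiting-Processors, bit i => processor i waits
};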

Figure 4.17: The operations of active tables: (a) Centralized Active Table; (b) Distributed Active Table (steps (1) through (4) discussed below).

(Operations) Fig. 4.17 shows the operations of both the centralized and the distributed active table. To see the differences between them, let us first recall the operations of the centralized active table, shown in Fig. 4.17(a). When a C-Fence instruction is issued, its fence-id and associate-id are stored in the active table (1). When a processor is about to issue a C-Fence, it sends a request to consult the active table using the fence-id

(2). The active table processes the request by checking the associate-ids in the table, and then responds to the processor, indicating whether the C-Fence can be issued (3). If it can, the processor issues the C-Fence. Otherwise, the processor stalls and keeps consulting the active table until it can issue the C-Fence. Therefore, if the processor needs to wait for its

executing associates in other processors to complete, it periodically checks the active table, repeating step (2).

In the distributed active table, the goal is to reduce the contention due to periodic requests to the active table from waiting fences. The operations are shown in Fig. 4.17(b) and described as follows; a pseudocode sketch of the sub-table's handling of these requests is given after the list.

• When a C-Fence instruction is issued, its information (i.e., fence-id and associate-id)

is stored in its corresponding sub-table, indicating it is active now (1).

• When a processor is about to issue a C-Fence, it puts the fence-id and associate-id

in its sub-table, and then consults other sub-tables by sending a message, including

fence-id, associate-id, and the processor number (2). Other sub-tables will then give

responses to the consulting processor.

• When a sub-table receives a request from another processor, it uses the incoming

fence-id i to check how many associates of the requesting fence are active. To do this,

the sub-table checks the ith bits of all the associate-ids of active fences, counting how

many bits have been set to 1. After that, the sub-table responds with the number to

the requesting processor. Meanwhile, for each associate in the sub-table, the corre-

sponding bit of the Waiting-Processors is set to indicate that the requesting processor

is waiting for this fence to complete. This is step (3).

• After the requesting processor has received all other sub-tables’ responses about the

number of active associates, it stores the total number in the field Waiting-Number.

Then, the processor periodically checks this field, repeating step (4). This is to detect

when the field becomes 0, which indicates the processor can issue this fence because

there is no active associate at that time.

• For each active fence, when it has completed, the Valid bit is set to 0. Meanwhile,

according to the Waiting-Processors field, the processor sends a completion message to

those waiting processors. When a waiting processor receives such a message, it decrements its Waiting-Number by 1.
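To summarize the sub-table side of these operations, the following sketch shows how a sub-table might serve a first-request (step (3)) and how completion is reported back to waiting processors; message delivery and the interconnect are abstracted away, the entry layout repeats the struct sketched earlier for self-containment, and all names are illustrative.

#include <array>
#include <bitset>
#include <cstdint>

constexpr int kFrequent   = 50;   // frequent fences tracked in hardware (assumption)
constexpr int kProcessors = 8;    // example core count (assumption)
constexpr int kEntries    = 10;   // entries per sub-table (as in the evaluation)

struct SubTableEntry {
    bool valid = false;
    std::bitset<kFrequent + 1> fenceId;           // one-hot fence-id
    std::bitset<kFrequent + 1> associateId;       // bit i => associate of fence i
    uint8_t waitingNumber = 0;                    // active associates still to complete
    std::bitset<kProcessors> waitingProcessors;   // processors waiting on this fence
};

struct SubTable {
    std::array<SubTableEntry, kEntries> entries;

    // Step (3): serve a first-request carrying the issuing fence's fence-id and the
    // requesting processor's number. Count the local active fences that list the
    // requester's fence as an associate, and remember that the requester is waiting
    // for each of them to complete.
    int serveFirstRequest(int fenceId, int requester) {
        int activeAssociates = 0;
        for (SubTableEntry& e : entries) {
            if (e.valid && e.associateId.test(fenceId)) {
                ++activeAssociates;
                e.waitingProcessors.set(requester);
            }
        }
        return activeAssociates;   // summed by the requester into its Waiting-Number
    }

    // When a local fence completes, clear its entry and report which processors must
    // be sent a completion message; each of them then decrements the Waiting-Number
    // of its stalled fence, so step (4) becomes a purely local check.
    std::bitset<kProcessors> complete(int slot) {
        std::bitset<kProcessors> waiters = entries[slot].waitingProcessors;
        entries[slot].valid = false;
        return waiters;
    }
};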

(Comparison) The key difference between the centralized and distributed active tables is that, when a processor needs to periodically consult the active table for a stalling fence, the request is always sent to the global table in the centralized design (step (2) in Fig.

4.17(a)), while in the distributed design the request is only sent to the local corresponding sub-table (step (4) in Fig. 4.17(b)). In the distributed design, a fence instance's first-request and its subsequent-requests are processed in different ways. Although the first-request is still sent to other sub-tables, for the subsequent-requests the processor only checks the field Waiting-Number locally in its corresponding sub-table. This is beneficial when subsequent-requests account for a large fraction of all requests. Furthermore, a first-request consults the

field Associate-ID, while a subsequent-request consults the field Waiting-Number. This indicates that, when a sub-table has these two types of requests at the same time, they can be processed in parallel, which also ameliorates the contention.

4.5 Experimental Evaluation

We performed experimental evaluation with several goals in mind. First and foremost, we wanted to evaluate the benefit of using the C-Fence mechanism in comparison with the conventional fence mechanism, as far as ensuring fence orders is concerned. Second, we wanted to evaluate how close the performance of the actual HW implementation was

with respect to the idealized HW. On a related note, we also wanted to study the effect of varying the values of the parameters in the HW implementation, such as the active-table size, the number of frequent fences, etc. Having determined suitable values of these parameters, we also wanted to evaluate the hardware resources that the C-Fence mechanism utilizes.

Finally, we also wanted to study how the number of processors affects the performance of

C-Fence and how the distributed active table ameliorates the contention. Before we present the results of the evaluation, we briefly describe the implementation.

4.5.1 Implementation

Processor                    2 processors, out of order, 2-issue
ROB size                     104
L1 Cache                     32 KB, 4 way, 2 cycle latency
L2 Cache                     shared, 1 MB, 8 way, 9 cycle latency
Memory                       300 cycle latency
Coherence                    Bus based invalidate
# of active table entries    20
Active-table latency         5 cycles
# of frequent fences         50

Table 4.2: Architectural parameters.

We implemented the C-Fence mechanism in the SESC [88] simulator, targeting the

MIPS architecture. The simulator is a cycle accurate multicore simulator which also sim-

ulates primary functions of the OS including memory allocation and TLB handling. To

implement ISA changes, we used an unused opcode of the MIPS instruction set to imple-

ment the newly added C-Fence instruction. We then modified the decoder of the simulator

to decode the C-Fence instruction and implemented its semantics by adding the active-table

and the associated control logic to the simulator. The architectural parameters for the im-

plementation are presented in Table 4.2. The default architectural parameters were used in

all experiments unless explicitly stated otherwise. Note that the term processor used here

refers to a logical processor. Our experiments assume a CMP with multiple cores. In the

implementation, we did not model the network bandwidth, but focused on the contention

in the active table. For the core counts used in our experiments, we believe the network bandwidth is not a bottleneck, since fence requests are relatively infrequent.

4.5.2 Benchmark characteristics

We used SPLASH-2 [108], a standard suite of multithreaded benchmarks, for

the evaluation. We could not get the program volrend to compile using the compiler infras-

tructure that targets the simulator and hence we omitted volrend from the experiments.

We used the input data sets prescribed in [108] and ran the benchmarks to completion. For

each benchmark program, fences are inserted to enforce SC using Shasha and Snir’s delay

set analysis. Fence associates are also identified during analysis. However, since it is hard

to perform interprocedural alias analysis for this set of C programs, as they extensively use

pointers, we used dynamic analysis to find the conflicting accesses as in [35].

Table 4.3 lists the characteristics of the benchmarks. As we can see, the number of

static fences varies across the benchmark programs, from 25 fences for fft to 1101 fences for

ocean. Since the number of static fences added can be significant (as high as 1101), we cannot store all associate information in hardware; this motivates the technique of applying the

C-Fence mechanism for just the most frequent fences. We also measured the average number

of associates for each fence. As we can see, the average number of associates per fence is

around 10, a small fraction of the total number of fences. This provides evidence of why the

associate information can be crucial for performance; since a given fence has relatively few associates, it is likely that two fences that are executing concurrently will not be associates, and each of them can escape stalling. We then measured the number of dynamic instances of fences encountered during execution. As we can see, the dynamic number of fences can be quite large, which explains why fences can cause significant performance overhead.

Benchmark    # of static fences   Avg # of associates   # of dynamic fences (x1000)
barnes       202                  6.65                  128222
fmm          259                  7.00                  1997
ocean        1101                 9.59                  22881
radiosity    632                  19.61                 57072
raytrace     301                  9.78                  91049
water-ns     204                  5.70                  39738
water-sp     208                  5.53                  41763
cholesky     388                  11.19                 659
fft          25                   4.16                  214
lu           63                   4.38                  58318
radix        66                   3.76                  751

Table 4.3: Benchmark characteristics.

4.5.3 Execution time overhead: Conventional fence vs C-Fence

Figure 4.18: Execution time overhead of ensuring SC (normalized): Conventional fence vs C-Fence.

In this section, we measure the execution time overhead of ensuring SC with C-

Fence mechanism and compare it with the corresponding overhead for conventional fence.

For this experiment, we used the actual hardware implementation with 50 frequent fences and 20 active-table entries. Fig. 4.18 shows the execution time overheads for ensuring SC normalized to the performance achieved using release consistency. As we can see, programs can experience significant slowdown with a conventional fence, with as much as a 2.54-fold execution time overhead (for lu). On average, ensuring SC with a conventional fence causes a 1.43-fold execution time overhead. With the C-Fence, this overhead is reduced significantly to a 1.12-fold execution time overhead. In fact, for all the programs (except lu, radiosity, raytrace and barnes) SC can be achieved with C-Fence with less than 5% overhead.

Since the C-Fence mechanism is able to capitalize on the natural interprocessor delays that

manifest in program execution, most fences can proceed without stalling.

4.5.4 Sensitivity study

Figure 4.19: Varying the number of frequent fences (30, 40, 50, 60 most frequent vs. ideal): normalized execution time overhead.

(Sensitivity towards number of frequent fences) Recall that the HW implementation applies the C-Fence mechanism for the N most frequent fences and a conventional fence

mechanism for the rest of the fences. In this experiment, we study the performance by varying N; we vary the value from 30 to 60 in increments of 10. We compare the performance

with the ideal scheme in which the C-Fence mechanism is applied for all fences. As we can see from Fig. 4.19, the performance achieved even with 30 frequent fences is as good as the ideal performance for most benchmarks. The only exceptions are radiosity and raytrace, for which the ideal scheme is markedly better. For these benchmarks, the dynamic execution counts are more evenly distributed across static fences, and because of this a small number of static fences is not able to cover most of the dynamic instances. On average, the performance of the various schemes is as follows: with 30, 40, 50 and 60 frequent fences the respective slowdowns are 1.14x, 1.13x, 1.12x and 1.116x. The performance achieved with the idealized HW corresponds to 1.11x. Thus we observe that with 50 frequent fences, we are able to perform close to the idealized HW implementation.

Figure 4.20: Varying the size of the active table (10 entries, 20 entries, unlimited entries): normalized execution time overhead.

(Sensitivity towards number of active-table entries) The number of active-table en-

tries can potentially influence the performance. This is because the smaller the number of active-table entries, the greater the chance that the active-table will be full. Recall that if the active-

table is full, then an issued fence cannot utilize the advantage of the C-Fence mechanism

and has to behave like a conventional fence. In this experiment, we wanted to estimate the smallest number of active-table entries that achieves performance close to the ideal. For C-Fence with the centralized active table, we varied the number of active-table entries between 10 and 20 and compared it with an idealized HW implementation with unlimited active-table entries. As we can see from Fig. 4.20, even with

10 entries in the active table, we can achieve the idealized HW performance in all but 2 benchmarks (water-sp and water-ns). With 20 entries we are able to achieve the idealized

HW performance across all benchmarks. In addition, we conducted the same experiment for C-Fence with the distributed active table, and found that 10 entries per sub-table are enough to perform close to the idealized HW.

Figure 4.21: Varying the latency of accessing the active-table (2, 5 and 8 cycles): normalized execution time overhead.

(Sensitivity towards active-table latency) The processor can issue memory instruc-

tions past a C-Fence only when it receives a negative response from the active-table. Thus the

round-trip latency of accessing the active-table is crucial to the performance. We varied the

latency with values of 2 cycles, 5 cycles and 8 cycles as shown in Fig. 4.21. We observed

that the performance stays practically the same as the latency is increased from 2 cycles to

5 cycles. Recall that we send the request to the active-table even as the C-Fence instruction

is fetched; the small buffer (a 5-entry buffer was used) containing instruction addresses of decoded C-Fence instructions allowed us to do this. However, for an 8 cycle latency, the performance of C-Fence decreases slightly. The active-table is a shared structure similar to the L2 cache, whose latency is 9 cycles. However, the size of the active-table, which is

252 bytes, is much smaller than the shared L2 cache, which is 1 MB. Furthermore, the active-table is a simpler structure than the L2 cache; for instance, the L2 tag lookup, which takes around 3 cycles, is not required. Hence, we think a 5 cycle latency for the active-table is reasonable.

Figure 4.22: Performance comparison of the Conventional, Centralized and Distributed designs for 4, 8, 16 and 32 processors (normalized execution time overhead).

(Sensitivity towards number of processors) Here, we wanted to study the perfor-

mance variation as the number of processors is varied. Fig. 4.22 shows the execution time

overhead of the conventional fence mechanism and the C-Fence mechanism (with both centralized and distributed active table) as the number of processors is varied across 4, 8, 16 and

32 processors. For the conventional mechanism, we can see that in general the performance remains almost the same as the number of processors is varied. For the C-Fence mechanism, although programs such as lu show a decrease in execution time as the number of processors

is increased, in general, the execution time overhead of C-Fence increases slightly as the

number of processors is increased. As we can see from the average values, with centralized

active table, the execution time overhead increases from 1.12x for 2 processors, to 1.15x for

4 processors, to 1.2x for 8 processors, to 1.23x for 16 processors and 1.29x for 32 processors.

This is because, as the number of processors increases, there is greater chance of associates

executing concurrently, which in turn requires each individual fence to stall. In addition, the

contention in the centralized active table also slows down the performance. The distributed

active table is introduced to ameliorate this contention. With the distributed design, we can

further improve the performance of C-Fence mechanism (1.15x, 1.19x, 1.21x and 1.23x for

4, 8, 16 and 32 processors respectively). We can observe that the execution time increases

are modest. Consequently, even for 32 processors the C-Fence mechanism is able to reduce

the execution time overhead significantly compared to the conventional fence (from 1.38x

to 1.23x). Thus, while the relative performance gain reduces as the number of processors

increases, the C-Fence mechanism still performs significantly better than conventional fence.

4.5.5 Impact of distributed active table

Figure 4.23: Communication time reduction for 16 and 32 processors. Figure 4.24: Speedups of the distributed active table over the centralized active table for 16 and 32 processors.

Since communication time increases significantly as the number of processors grows, we redesigned the active table to be distributed, with the intention of reducing the contention. Fig. 4.23 shows the reduction of communication time for 16 and

32 processors. Since the communication portion for 4 and 8 processors is relatively small,

their reduction rates are of little interest. As we can see, the communication time is reduced

substantially for most programs. In particular, barnes, raytrace and water-ns have reduction rates greater than 60% for 32 processors, as they experience more contention (Fig. 4.14) in the centralized active table and the distributed design greatly reduces that contention. In contrast, the reduction rates for fmm, fft and radix are relatively smaller, because their requests experience less contention in the centralized active table. On average, the reduction rate is 38.7% for 16 processors and 46.9% for 32 processors. This reduction translates into performance improvement. Fig. 4.24 shows the speedups of C-Fence with the distributed active table over C-Fence with the centralized active table. On average, the performance increases by 2.5% for 16 processors and by 5.4% for 32 processors. In particular, for 32 processors, the performance of barnes and raytrace increases the most (10.9% and 9.3% respectively), as they have a greater portion of communication time and this time is greatly reduced by the distributed active table. Thus, C-Fence with the distributed active table can ameliorate the contention in the centralized structure and hence improve performance.

4.5.6 HW resources utilized by C-Fence

In this section, we want to estimate the amount of HW resources that the C-

Fence mechanism utilizes. Recall that the main HW resource that C-Fence utilizes is the active-table. For the centralized design, in the previous experiments we found that with a 20 entry active-table, we are able to perform as well as the idealized HW. We also found that with 50 frequent fences, we are able to achieve close to the ideal performance. Recall that each entry in the active-table consists of three fields: a valid bit, a fence-id and an associate-id. For 50 frequent fences, an entry would correspond to 1 + 50 × 2 = 101 bits.

Hence, 20 entries amount to 252 bytes. In addition, we use a 5 entry buffer (requiring 45

bytes) which caches the fence instruction addresses and corresponding fence-ids to speedup

the requests to the active-table. Therefore, we would require an on-chip storage of less than

300 bytes. Similarly, for the distributed design, each sub-table requires 10 entries according

to the experiments. Each entry requires 1 + 50 × 2 + 8 + 8 = 117 bits based on the design

described in Section 4.4.2 (assuming 8 processors) and hence each sub-table requires 147

bytes. With a 5 entry buffer, we only require less than 200 bytes of additional on-chip

storage per processor core.
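The storage figures quoted above follow directly from the field widths; as a sanity check, the small calculation below reproduces them (the fractional byte counts correspond to the approximate values of 252 and 147 bytes quoted in the text).

#include <cstdio>

int main() {
    // Centralized active-table entry: valid bit + 50-bit fence-id + 50-bit associate-id.
    const int centralEntryBits = 1 + 50 * 2;                       // = 101 bits
    const double centralTableBytes = 20 * centralEntryBits / 8.0;  // ~252 bytes for 20 entries
    const int pcBufferBytes = 45;                                  // 5-entry instruction buffer

    // Distributed sub-table entry adds an 8-bit Waiting-Number and, assuming
    // 8 processors, an 8-bit Waiting-Processors field.
    const int subEntryBits = 1 + 50 * 2 + 8 + 8;                   // = 117 bits
    const double subTableBytes = 10 * subEntryBits / 8.0;          // ~147 bytes per core

    std::printf("centralized: %.1f B table + %d B buffer\n", centralTableBytes, pcBufferBytes);
    std::printf("distributed: %.1f B per sub-table (+ %d B buffer)\n", subTableBytes, pcBufferBytes);
    return 0;
}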

4.6 Summary

Fence instructions can significantly degrade program performance. This chapter

proposes a novel fence mechanism known as C-Fence that can ensure fence orders at a

significantly lower performance overhead. While a conventional fence enforces intraprocessor

delays to prevent memory reordering, the C-Fence enforces interprocessor delays to ensure

fence orders. We observe that most of the interprocessor delays are already ensured during the normal course of execution. The C-Fence mechanism takes advantage of this by conditionally imposing an interprocessor delay only when required. The C-Fence mechanism requires modest hardware resources, less than 300 bytes of additional on-chip storage. Our cycle-accurate simulation results show that C-Fence can be used to significantly reduce the overhead involved in ensuring SC. More specifically, for a dual core processor running SPLASH-2 programs, we found that the C-Fence mechanism can reduce the performance overhead of ensuring SC from 43% to 12%.

Chapter 5

Address-aware Fence

This chapter introduces a hardware mechanism, the address-aware fence, to dynamically eliminate unnecessary memory orderings without aggressive speculation and without help from the compiler or programmer. Unlike a traditional fence, which is processor-centric

[102], an address-aware fence collects memory access information from other processors to form a watchlist associated with the fence instance. The fence becomes porous, allowing memory accesses whose addresses are not contained in the watchlist to complete before the fence. The key challenge is how to collect the watchlist efficiently and clear it properly. To do this, the implementation of the address-aware fence leverages the directory-based cache coherence protocol, piggybacking the required information on coherence transactions. It is implemented in the microarchitecture without instruction set support and is transparent to programmers. Compared with the scoped fence (Chapter 3) and the conditional fence (Chapter 4), the address-aware fence is broadly applicable and has the highest precision, eliminating nearly all unnecessary memory orderings due to fences.

5.1 Motivation

In Chapter 4, we have illustrated that a fence execution is correct if the execution appears to maintain fence orders, where reordering memory operations across fences is

allowed if the reordering is not observed by other processors. Section 4.1.1 gives a sufficient

condition for enforcing fence orders – fence orders (F) and conflict orders (E) do not form

any cycle, i.e., no cycle in F ∪ E. C-Fence exploits fence associates provided by the compiler to break such cycles, by ensuring fence associates do not execute concurrently. However, since

C-Fence relies on the compiler to statically identify associate information, it is conservative, and thus some unnecessary fence executions are not eliminated. Several such scenarios have been shown in Fig. 1.9. In particular, they illustrate unnecessary fence executions due to conditional branches, dynamic data structures, etc. C-Fence is not able to eliminate the fence executions in those scenarios, as the fences involved are associates identified statically and conservatively. Moreover, C-Fence requires compiler support to identify fence associates.

When there is no source code available, this support cannot be provided; moreover, performing interprocedural alias analysis in the compiler is complex and conservative.

                       Applicability
Techniques             Dekker-like   Lock-free algo.   Lock   Improving SC
CMO [102]              No            No                Yes    No
l-mfence [63]          Yes           No                No     No
C-Fence                Yes∗          Yes∗              Yes∗   Yes∗
Address-aware fence    Yes           Yes               Yes    Yes

Table 5.1: Comparison of existing approaches and address-aware fence (∗Compiler support required for identifying fence associates).

There are also two existing techniques proposed to reduce fence overhead non-speculatively by exploiting program execution in other processors, i.e., conditional memory ordering (CMO) [102] and location-based memory fences (l-mfence) [63]. Table 5.1 compares the applicability of these non-speculative techniques in terms of the kinds of applications they can handle. In [102], Praun et al. observed that memory ordering instructions used on acquire and release of a lock are often unnecessary. They proposed a combined

HW/SW mechanism conditional memory ordering (CMO) to omit unnecessary memory ordering instructions. CMO only focuses on memory orderings associated with lock acquire and lock release, but cannot handle the other cases. Ladan-Mozes et al. [63] proposed location-based memory fences to reduce fence overhead. A new instruction (l-mfence) is

introduced, but it is limited to Dekker-like algorithms. Moreover, although C-Fence can be

applied in all situations in Table 5.1, it requires compiler support to provide fence associates.

Our non-speculative solution, the address-aware fence, is superior to the above non-

speculative techniques in two respects. It is more effective as it is able to exploit all the

optimization opportunities shown in Fig. 1.9. It is also broadly applicable as it can be

applied in all situations in Table 5.1.

5.2 Address-aware Fence

Address-aware fence exploits memory operations being performed in other proces-

sors to eliminate the above restrictions and optimize the fence executions. It only orders

memory operations involving certain memory addresses that must be ordered to maintain

fence orders; hence the name address-aware fence. At runtime, this is achieved by de-

tecting and avoiding cycles formed by fence orders and conflict orders. If no cycles can be

formed, fences do not stall the memory operations following them; otherwise, fences will function so as to maintain fence orders.

Address-aware fence is implemented in the microarchitecture without introducing any new instructions and is thus completely transparent to the programmer. Fence instructions are introduced, as usual, either by the compiler or by the programmer. Fence instructions in the executable are identified at runtime and processed by the hardware.

Address-aware fence handles all encountered fences without being aware of how they are used, and hence naturally handles all cases listed in Table 5.1.

Figure 5.1: (a) Cycle detection; (b) Address-aware fence with watchlist.

This section presents the high-level algorithm of the address-aware fence, while the next section describes the detailed hardware design. The key problem is how to detect possible cycles at runtime. During execution, fence orders are easily known, and hence we have to detect the conflict orders that can form a cycle along with the fence orders. Fig.

5.1(a) shows how cycles can be detected. In Proc 1, suppose all instructions in A1 have

completed except some pending memory operations, and FENCE1 does not stall the following

instructions in A2 initially. Hence, some memory operations are reordered across FENCE1 at runtime. At this point, in Proc 2, there is a memory operation in B1 which detects a conflicting memory operation in A2 (in the next section we will show how the detection is performed relying on cache coherence transactions). This forms a conflict order c1 from A2 to B1 as shown in the figure ( 1 ). This event triggers the following: every memory

operation after FENCE2 has to detect whether there is any pending memory operation prior

to FENCE1 that conflicts with it, forming the conflict order c2 ( 2 ). Memory operations in

B2 that do not conflict with pending memory operations in A1 can complete without being

stalled by FENCE2 even when there is any pending memory operation in B1. However, if

there does exist a potential conflict order c2 from B2 to A1, FENCE2 will delay the involved

memory operation in B2 to break this potential conflict order. Moreover, if c1 does not

exist, the detection of c2 is unnecessary, as no cycle will be formed without c1. Detecting

c2 is only triggered when there is a possible c1.

The cycle detection discussed above does not consider how to relate c1 and c2. In

particular, we should know where to check for conflicts, e.g., memory operations in B2 should check for conflicts with pending memory operations in A1. Furthermore, detecting c2 requires every memory operation in B2 to check for conflicts in A1, which resides on a different processor, and thus this is inefficient. Fig. 5.1(b) shows the approach to address this problem. If a conflict order c1 is detected ( 1 ), memory addresses of all pending memory operations before FENCE1 are collected to form a watchlist, which is associated with FENCE2 ( 2 ). Now, the memory operations after FENCE2 only need to check the local watchlist to detect a conflict, i.e., to detect c2 ( 3 ). If the address of a memory operation is not contained in the watchlist, it indicates there is no conflict to form c2, which further indicates there is no cycle detected

to violate fence orders. Hence, those memory operations whose addresses are not contained in the watchlist can still complete without being stalled by the fence even when there are pending memory operations from before the fence.

Figure 5.2: States of address-aware fence (collecting to watching when the watchlist is non-empty, collecting to complete when it is empty, watching to complete when the watchlist is cleared).

(Life cycle of a fence instance) The execution of a fence instruction proceeds ac-

cording to the state transition graph in Fig. 5.2. There are three states: collecting state, watching state, and complete state. When the fence is issued, it starts out in the collecting state, where it waits for its watchlist. The memory operations following the fence cannot yet complete, as at this time the watchlist that is needed to detect cycles is not available.

After the fence has collected all necessary addresses to form its watchlist, it switches to the next state: the fence switches to the watching state if the watchlist is non-empty; otherwise it switches directly to the complete state. In the watching state, the memory operations following the fence must check the watchlist before completion. The watchlist is cleared when the corresponding pending memory operations have completed and the fence switches to the complete state, in which the fence has completed execution. If a fence switches directly from the collecting to the complete state, it causes no stalls; thus, it can be viewed as being dynamically eliminated.
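An illustrative encoding of this life cycle as a small state machine is sketched below; the enum and method names are our own and merely restate the transitions of Fig. 5.2 and the completion rule described above.

#include <cstdint>

// States of an address-aware fence instance (Fig. 5.2).
enum class FenceState : uint8_t { Collecting, Watching, Complete };

struct FenceInstance {
    FenceState state = FenceState::Collecting;

    // Called when the collected watchlist arrives.
    void onWatchlistCollected(bool watchlistEmpty) {
        state = watchlistEmpty ? FenceState::Complete    // dynamically eliminated
                               : FenceState::Watching;   // later accesses must check it
    }

    // Called when the pending stores named in the watchlist have all completed.
    void onWatchlistCleared() {
        if (state == FenceState::Watching) state = FenceState::Complete;
    }

    // A load/store after the fence may complete early only if the fence does not
    // stall it: either the fence is complete, or it is watching and the access's
    // address is not contained in the watchlist.
    bool allowsEarlyCompletion(bool addressInWatchlist) const {
        if (state == FenceState::Collecting) return false;   // watchlist not yet available
        if (state == FenceState::Complete)   return true;
        return !addressInWatchlist;                           // watching state
    }
};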

Figure 5.3: An example of address-aware fence ((a)-(c) show Proc 1 issuing a1: St x, a2: St y, FENCE1, a3: Ld z, and Proc 2 issuing b1: St u, b2: St z, FENCE2, b3: Ld v, b4: Ld y, where x, y, z, u, v are different variables).

(An Example) Fig. 5.3 shows an example to illustrate the execution of address-aware fences. In Fig. 5.3(a), both a1 and a2 are long latency store misses in Proc 1. At time t0, FENCE1 gets an empty watchlist. Hence, a3 can complete even though stores a1 and a2 are still pending. In Proc 2, b1 is also a long latency store miss, and b2 detects a

conflict with a3 at time t1. Hence, the memory addresses of pending stores prior to FENCE1 are sent to Proc 2 as FENCE2’s watchlist – {addr(x),addr(y)}. After that, b3 can complete

even when b1 is pending, as its address is not contained in FENCE2’s watchlist. On the

other hand, b4 cannot complete – it is delayed until a1 and a2 have completed and FENCE2’s

watchlist is cleared, as shown in Fig. 5.3(b). However, if there is no access to the variable

z in Proc 2, as shown in Fig. 5.3(c) (i.e., no instruction b2), then there is no conflict order

c1 and FENCE2’s watchlist is empty. In this case, both b3 and b4 can complete even though

b1 is pending. As we can see, address-aware fence only stalls necessary memory accesses

to maintain a correct fence execution (delaying b4 in this example), which reduces stalling

overhead due to fences.

5.3 Hardware Design

In this section, we describe the hardware design for address-aware fence. The

challenge is to detect and avoid cycles efficiently. In the following discussion, we present

the detailed hardware design in the context of a scalable CMP with a distributed directory-based invalidation cache coherence protocol. Each processor has a local L1 cache, a bank of L2 cache, and a portion of the directory. Each coherence transaction is kept in the directory until the directory receives notification of the transaction's completion. We assume each processor core to be a dynamically scheduled ILP processor with a reorder buffer (ROB). All instructions retire from the ROB in program order. At the head of the ROB, loads can retire when they complete, while stores can retire as soon as the value and destination address are available through the store buffering technique [45], which allows stores to retire from the head of the ROB even before they complete. We will see later that we use an augmented buffer which incorporates the function of the store buffer, in which stores are allowed to complete out of order. Store buffering relaxes Store-Load and Store-Store orders, but the traditional fence requires the store buffer to be drained before executing the following memory accesses.

Moreover, the processors also support in-window speculation [46], which guaran-

tees Load-Load order by speculative load execution (a speculative load in ROB is squashed

if its loaded data is invalidated or replaced before it retires). Besides, Load-Store order

is naturally guaranteed by ROB. Therefore, memory operations cannot complete past a

pending load, and hence the pending memory operations that can be bypassed can only be

pending stores.

5.3.1 Operations on address-aware fence

Each processor functions as usual when there is no fence executing. However, when

a fence is issued, the processor initiates the process of handling the fence using address-

aware fence mechanism. The key operations are collecting and clearing the watchlist of a fence,

which are introduced in this section. After the watchlist has been collected, the fence can retire when it reaches the ROB head. If any load/store following the fence tries to retire

while there are still pending stores from before the fence, the processor will check whether

the address of the load/store is contained in the watchlist.

Figure 5.4: Collecting and clearing watchlists.

Collecting watchlist

When a fence is issued, the processor starts to collect the watchlist for the fence, which is now in the collecting state as shown in Fig. 5.2. As described in Section 5.2, a watchlist consists of the memory addresses of a set of pending memory operations, which (as discussed above) are all pending stores. It is important to obtain the set of pending stores quickly, as memory operations following the fence cannot retire until the watchlist is obtained. To do this, we utilize the directory, where we are able to find all pending stores being serviced by the directory. In the following discussion, for simplicity, we assume a centralized directory; later we describe the modifications needed for a distributed directory. When a fence is issued, the processor sends a request to the directory to fetch the addresses (block addresses) of all pending stores in other processors. The directory compresses the addresses into a watchlist using a Bloom filter and sends it back to the requesting processor. Then

the requesting processor records the returned watchlist locally for conflict checking. In this way, we conservatively assume there is a conflict order c1 (Fig. 5.1(b) 1 ), and obtain the

watchlist by fetching all pending stores (Fig. 5.1(b) 2 ) from the directory.
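As a rough illustration of this step, the sketch below models the directory compressing the block addresses of pending stores into a Bloom-filter watchlist, together with the membership test the processor would later perform; the filter size and hash functions are arbitrary assumptions made for illustration, not part of the actual design.

#include <bitset>
#include <cstdint>
#include <vector>

// Illustrative Bloom-filter watchlist over cache-block addresses.
struct Watchlist {
    static constexpr int kBits = 256;          // filter size (assumption)
    std::bitset<kBits> bits;

    static uint32_t h1(uint64_t block) { return uint32_t((block * 0x9E3779B97F4A7C15ULL) >> 56) % kBits; }
    static uint32_t h2(uint64_t block) { return uint32_t((block * 0xC2B2AE3D27D4EB4FULL) >> 56) % kBits; }

    void add(uint64_t blockAddr) {             // directory side: record one pending store
        bits.set(h1(blockAddr));
        bits.set(h2(blockAddr));
    }
    bool mayContain(uint64_t blockAddr) const { // processor side: conservative membership test
        return bits.test(h1(blockAddr)) && bits.test(h2(blockAddr));
    }
    bool empty() const { return bits.none(); }
};

// Directory side: build the watchlist from the pending stores of all processors
// other than the requester, as described above.
Watchlist buildWatchlist(const std::vector<uint64_t>& pendingStoreBlocks) {
    Watchlist wl;
    for (uint64_t b : pendingStoreBlocks) wl.add(b);
    return wl;
}

Note that the Bloom filter may report false positives, which can only cause a memory access to be stalled unnecessarily, never allow an ordering violation.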

The processor starts to collect the watchlist as soon as a fence is issued, because

we would like to obtain the watchlist as early as possible, so that the fence does not stall the

pipeline. However, the collected watchlist may become stale before it is used for checking

conflict. Let us take Fig. 5.4(a) as an example for the following discussion. Suppose a1 is a

store miss, and a2 completes past a1; then b1 completes after a2. If we do not take care of

b2, the execution order a2 → b1 → b2 → a1 will form a cycle and violate the fence order. In

Fig. 5.4(b), let us assume FENCE2 is issued at time t1, when no pending store is present in

the directory (i.e., cache miss a1 has not reached the directory). So the watchlist obtained

for FENCE2 will be empty. If we use this empty watchlist, we cannot avoid the execution

order a2 → b1 → b2 → a1, as the empty watchlist will allow b2 to complete before a1. This

is because the watchlist becomes stale and it does not contain x accessed by the pending

store a1.

To address this problem, we observe that the watchlist only needs to be updated

when there is a cache miss before the fence. To see why, let us recall cycle detection in Fig.

5.1(a). A cache miss in B1 would indicate that there is a potential conflict order like c1, so

we need to avoid conflict order c2. The stale watchlist may not contain all pending stores in

A1, as new pending stores may be generated after the stale watchlist was obtained. Hence,

we have to update the watchlist. On the other hand, a cache hit does not create a conflict

order like c1, so there is no need to update the watchlist. Therefore, if there is any memory

operation before the fence that is a cache miss (load/store miss) in the local cache, the

cache miss transaction sent to the directory will also require the directory to reply with an updated watchlist. Note that the fence can only retire after it has received the above reply from the directory. For the case in Fig. 5.4(b), if a2 completes past a1, a1 will be present in the directory if it has not completed. Thus, at time t2, b1 is found to be a miss, so the transaction sent to the directory will obtain an updated watchlist which includes x. Now the updated watchlist containing x will stall b2, enforcing the order a1 → b2 and avoiding

fence order violation.

(Unnecessary update) Although a cache miss may create the conflict order c1 as

shown in Fig. 5.1(a), c1 does not really exist if the directory indicates the target block is

not cached in other processors. In this case, the watchlist does not need to be updated.

In particular, cache misses to local variables will fall into this case, which reduces much

unnecessary traffic.
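The "unnecessary update" decision can be pictured as in the sketch below: a pre-fence miss only carries back a refreshed watchlist when the directory sees the block cached in some other processor. DirectoryState, MissReply, and the sharer bookkeeping are hypothetical names introduced purely for this illustration and are not the dissertation's actual protocol messages.

```cpp
#include <bitset>
#include <cstdint>
#include <optional>
#include <unordered_map>

using Watchlist  = std::bitset<160>;
using SharerMask = std::bitset<64>;   // which processors currently cache a block (assumed limit)

struct MissReply {
    // ...the usual coherence reply fields (data, ownership) would go here...
    std::optional<Watchlist> updated_watchlist;   // piggybacked only when an update is needed
};

struct DirectoryState {
    std::unordered_map<std::uint64_t, SharerMask> sharers;  // block address -> sharer set
    Watchlist pending_store_filter;                          // current compressed pending stores

    // Handle a cache miss from processor `proc` that precedes a fence in that processor.
    // A conflict order like c1 in Fig. 5.1(a) is only possible if another processor
    // caches the block, so misses to purely local data skip the watchlist update.
    MissReply handle_pre_fence_miss(int proc, std::uint64_t block_addr) const {
        MissReply reply;
        auto it = sharers.find(block_addr);
        SharerMask others = (it != sharers.end()) ? it->second : SharerMask{};
        others.reset(proc);                      // ignore the requester itself
        if (others.any())
            reply.updated_watchlist = pending_store_filter;
        return reply;
    }
};
```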

Clearing watchlist

A fence will be in watching state when it has retired, and memory operations after

the fence have to check the watchlist to bypass it. But the fence cannot simply complete

even when all pending stores before it have completed, because memory operations after

the fence may still have to check the watchlist to avoid cycles. Let us consider the example

in Fig. 5.4(b) again, where a2 completes past a1 and FENCE2 has obtained the watchlist

containing x. But this watchlist cannot be cleared even when b1 has completed, as b2 still has to check the watchlist to avoid the cycle by ordering a1 → b2. The watchlist can be cleared only when a1 has completed. That is, the watchlist can be cleared and hence

the fence can complete only when the pending stores whose addresses are contained in the

watchlist have all completed. If we make each completed store notify the processors that may contain its address, there will be complicated communication of tracking information between processors, complicating the hardware design.

To address this problem, we will delay b1 until a1 completes. Hence, we introduce

a buffer for each processor, called the active buffer. We consider a memory operation or a fence

as active if (1) it has retired; and (2) the memory operation is a pending store or there

is a pending store before the memory operation/fence. In other words, after retirement,

a memory operation or a fence becomes inactive when there is no pending store before it.

The active buffer records the addresses of all active memory operations, as well as active fences, in the order they are retired. Now, each external coherence transaction will also check the active buffer. If there is a conflict in the active buffer (i.e., the target memory address is found in the active buffer) and there is a fence prior to the conflicting entry, then this coherence transaction is delayed until the target address has been removed from the buffer (i.e., the corresponding memory operation is no longer active). Let us recall

Fig. 5.1(b). By using the active buffer, if there is a conflict order c1 from A2 to B1, the completion of B1 will indicate the completion of A1; otherwise, B1 cannot complete, as coherence transactions sent to Proc 1 will be delayed. Thus, the watchlist can be safely cleared as soon as B1 has completed, because it indicates all pending stores in A1 have completed. This simplifies the implementation for clearing the watchlist, as the processor can decide when to clear the watchlist based on local information. Note that, to guarantee that any potential conflict is detected, the cache block whose address is in the active buffer is not allowed to be evicted from the local cache.
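A behavioral sketch of this delay rule is given below. ActiveBuffer, ActiveEntry, and the method names are invented for illustration; the model captures only the ordering and deletion rules described above, not timing, capacity, or the merging of same-block entries.

```cpp
#include <cstdint>
#include <deque>

// Behavioral model of the per-core active buffer: retired memory operations and
// fences that still have a pending store at or before them, in retirement order.
struct ActiveEntry {
    enum class Kind { Load, Store, Fence } kind;
    std::uint64_t block_addr = 0;   // cache-block address; unused for fences
    bool done = false;              // loads and fences are treated as always done
};

class ActiveBuffer {
public:
    void on_retire(const ActiveEntry& e) { entries_.push_back(e); }

    // Checked by every external coherence transaction: delay it if it hits an entry
    // that has a fence before it, so a remote conflicting access cannot complete
    // until the operations ordered before that fence are no longer active.
    bool must_delay(std::uint64_t block_addr) const {
        bool fence_seen = false;
        for (const ActiveEntry& e : entries_) {
            if (e.kind == ActiveEntry::Kind::Fence) { fence_seen = true; continue; }
            if (fence_seen && e.block_addr == block_addr) return true;
        }
        return false;
    }

    // Deletion rule: after the pending store at the head completes (and its `done`
    // flag has been set), entries up to the next still-pending store become inactive.
    void drain_inactive_head() {
        while (!entries_.empty()) {
            const ActiveEntry& head = entries_.front();
            if (head.kind == ActiveEntry::Kind::Store && !head.done) break;
            entries_.pop_front();
        }
    }

private:
    std::deque<ActiveEntry> entries_;
};
```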

We illustrate the algorithm in Fig. 5.4(c). Suppose a1 is a cache miss and a2 has

completed past a1 (so a1 will be present in the directory if it has not completed). Now

both a1 and a2 are active, so their addresses are recorded in the active buffer; and FENCE1

is recorded as well. In Proc 2, b1 is then executed, which will be a miss. It obtains the

watchlist containing x from the directory and also sends a coherence transaction for y to

Proc 1. Since y is in the active buffer and there is a fence prior to y, the transaction is

delayed, and hence the watchlist is not cleared. If b2 tries to retire, it has to check the

watchlist and hence it stalls. When a1 has completed, the active buffer will be empty, and

the delayed coherence transaction for y from Proc 2 can be satisfied. Thus, the completion

of b1 indicates memory operations prior to FENCE1 have completed; otherwise, a2 is active

and hence b1 cannot complete. Now the watchlist containing x can be safely cleared as soon

as memory operations prior to FENCE2 have completed, without the risk of forming cycles.

(Deadlock freedom) In the above example, it seems possible that all four instruc-

tions a1, a2, b1, and b2 are active, and b1 is delayed by a2 and a1 is delayed by b2, which

forms a deadlock. However, we show that it is not possible that all four instructions are

active at the same time. Since a1 and b1 are misses (otherwise a2 → b1 and b2 → a1 do not exist), they are sent to the directory. Suppose a1 reaches the directory first; then the watchlist for FENCE2 will contain x accessed by a1. This watchlist stalls b2, which will not be able to retire and become active. Thus, the above scenario of deadlock is not possible.

Distributed directory

To make address-aware fence scalable, we now consider the implementation with a distributed directory. With a centralized directory, a cache miss transaction that needs to update the watchlist is sent to the directory and obtains all pending stores in other processors. However, when the directory is distributed, a cache miss transaction is only sent to its home directory, and only obtains the pending stores in that directory. Hence, the memory operations after the fence can use the watchlist for checking only if they are mapped to the same home directory where the watchlist is collected.

To accommodate a distributed directory, we maintain a buffer called the watchlist buffer, which has several entries recording watchlists collected from different home directories.

Each entry in the buffer is also tagged with the ID of the tile where the home directory resides. When a cache miss brings back a watchlist, it is recorded in an entry of the buffer, tagged with the corresponding tile ID. Meanwhile, other entries are invalidated as they might contain stale watchlists. A memory operation trying to retire past the fence

first checks whether there is a valid entry with the tile ID to which the memory access is mapped. If yes, the memory operation checks against the watchlist. Otherwise, it is forced to fetch the watchlist from its own home directory before it can check for conflicts before retirement. The fetched watchlist is also recorded, so that future memory operations mapped to the same home directory can check for conflicts locally. Thanks to spatial locality, most nearby memory accesses tend to be mapped to the same home directory, which allows most memory operations to check watchlists quickly and retire past fences. We use the distributed directory in the evaluation.

(Handling multiple fences) In the above discussion, we only consider one executing fence. However, it is also possible that multiple fences are executing in the pipeline. Since a watchlist has to be removed when the corresponding fence can complete, we have to know which watchlist is associated with which fence. To do this, each issued fence in the

pipeline is associated with a unique tag, and the watchlist collected for the fence will also be associated with the same tag. When a fence completes, only the watchlists with the same tag as the fence are removed. Hence, the watchlists for uncompleted fences remain recorded in the buffer. In particular, the watchlist for a fence can be replaced by the watchlist for a later fence if they are collected from the same home directory, because the latter is the updated watchlist from that directory.

5.3.2 Hardware summary


Figure 5.5: Overview of the architecture.

Fig. 5.5 shows the overview of hardware support for address-aware fence. For simplicity, only the main logic blocks related to address-aware fence are presented. Each processor core is augmented with two new logic blocks to implement the technique: the active buffer and watchlist buffer, as discussed in the previous section. If a retired memory access or fence is

active, it is added to the active buffer (1). Each external cache transaction has to check the active buffer, and is delayed if there is a hit (2). Each watchlist is recorded in the watchlist buffer when it is received from the directory (3). Each memory access has to check the watchlist buffer to see if it can retire (4). Although we leverage cache coherence

to transfer metadata, the coherence protocol and its transactions are left unchanged. Next,

we summarize the operations in the active buffer and watchlist buffer.

(Active buffer) The active buffer records the addresses of all active memory accesses and active fences. It also incorporates the function of the store buffer. Each entry of the buffer consists of the following fields: valid, type, done, and data. Descriptions of the fields are shown in Table 5.2. The data field depends on the type. For a store, it includes the destination memory address and a pointer to its data; for a load, it only includes the memory address; and for a fence, it includes a unique tag used to mark its watchlist.

Fields    Description
valid     whether it is a valid entry
type      load, store or fence
done      whether the operation has completed; loads and fences are always complete
data      depending on the type

Table 5.2: Fields in active buffer.

The following are the operations for the active buffer. (1) Add. Active memory

accesses and active fences are added to the active buffer in the order they retired. However,

two consecutive loads or two consecutive stores can be merged if their destination addresses

are mapped to the same cache block, reducing the size of the active buffer. This is because the

conflict detection is based on the block address. If the buffer is full, the retiring memory

access or fence is delayed until there is an entry available. (2) Delete. If a memory access

or fence is no longer active, it is removed. That is, at the head of the active buffer, if a pending store has completed, entries up to the next pending store are invalidated. (3) Conflict detection.

Each external coherence transaction has to detect conflict in the active buffer. If there is a conflict in the active buffer and there is an active fence prior to this conflicting entry, the transaction is delayed. It is important to make conflict detection efficient. We use a counting bloom filter to hash memory addresses in the active buffer. Conflict detection first

checks the counting bloom filter before checking entries in the active buffer, which greatly reduces useless checks as conflicts are rare.
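This pre-check can be modeled with a small counting bloom filter, as sketched below. The slot count, counter width, and hash positions are arbitrary choices made for illustration and do not correspond to the actual hardware parameters.

```cpp
#include <array>
#include <cstdint>
#include <functional>

// Minimal counting bloom filter used as a pre-check: if the filter reports "no
// match", the external transaction cannot conflict with the active buffer and the
// per-entry walk is skipped entirely.
class CountingBloomFilter {
public:
    void insert(std::uint64_t block_addr) {
        for (std::size_t i : positions(block_addr)) ++counters_[i];
    }
    void remove(std::uint64_t block_addr) {        // called when an entry becomes inactive
        for (std::size_t i : positions(block_addr)) --counters_[i];
    }
    bool may_contain(std::uint64_t block_addr) const {
        for (std::size_t i : positions(block_addr))
            if (counters_[i] == 0) return false;   // definitely not present
        return true;                               // possibly present (may be a false positive)
    }

private:
    static constexpr std::size_t kSlots = 256;     // illustrative size
    std::array<std::size_t, 2> positions(std::uint64_t a) const {
        return { std::hash<std::uint64_t>{}(a) % kSlots,
                 std::hash<std::uint64_t>{}(a ^ 0x9e3779b97f4a7c15ULL) % kSlots };
    }
    std::array<std::uint16_t, kSlots> counters_{};
};
```

Only when may_contain() reports a possible match would the active buffer entries themselves be walked, which keeps the common, conflict-free case cheap.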

(Watchlist buffer) The watchlist buffer records all collected watchlists. Each watchlist is a bit vector which contains memory addresses compressed using a bloom filter. This is to minimize the network bandwidth, because watchlists are exchanged along with cache transactions. Each entry consists of the following fields: valid, fence tag, tile ID, and watchlist. Descriptions of each field are shown in Table 5.3.

Fields       Description
valid        whether it is a valid entry
fence tag    which fence the watchlist is for
tile ID      which directory the watchlist is collected from
watchlist    bit vector of pending stores

Table 5.3: Fields in watchlist buffer.

The following are the operations for the watchlist buffer. (1) Add. Each watchlist is recorded when it is replied from the directory, with the corresponding fence tag and tile ID. (2) Delete. When all memory operations before a fence have completed, the entries with the same fence tag are invalidated. Moreover, when a cache miss before the fence brings back an updated watchlist, other entries are invalidated as they might contain stale watchlists. (3) Retirement check. Each retiring memory operation checks against the watchlist with the tile ID to which its memory address is mapped. If it does not find such a watchlist, it is first forced to fetch the watchlist from its home directory. If the block address of the memory operation is not contained in the watchlist, it can retire even if there is a pending store prior to the fence. Otherwise, the retirement of the memory access may result in a violation of fence order, and hence it is delayed. It is worth noting that the watchlist buffer will eventually become empty if the processor is stalled, which indicates that forward progress is guaranteed.
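To make these three operations concrete, the following sketch models one possible organization of the watchlist buffer. It is an illustration under stated assumptions only: entries are indexed by home-tile ID (one per tile, as in the evaluation), the 160-bit width matches the evaluation, and the hash positions stand in for whatever hash functions the directory actually uses; none of the identifiers correspond to real hardware signals.

```cpp
#include <bitset>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using Watchlist = std::bitset<160>;   // width taken from the evaluation in Section 5.4.2

struct WatchlistEntry {
    bool      valid = false;
    unsigned  fence_tag = 0;   // tag of the executing fence this watchlist belongs to
    Watchlist watchlist;       // compressed addresses of pending stores
};

class WatchlistBuffer {
public:
    explicit WatchlistBuffer(std::size_t num_tiles) : entries_(num_tiles) {}

    // (1) Add: record a watchlist replied from home directory `tile_id`; a reply
    // triggered by a pre-fence cache miss invalidates the other (possibly stale) entries.
    void add(unsigned fence_tag, int tile_id, const Watchlist& wl, bool from_pre_fence_miss) {
        if (from_pre_fence_miss)
            for (WatchlistEntry& e : entries_) e.valid = false;
        entries_[tile_id] = WatchlistEntry{true, fence_tag, wl};
    }

    // (2) Delete: when all memory operations before the tagged fence have completed.
    void on_fence_complete(unsigned fence_tag) {
        for (WatchlistEntry& e : entries_)
            if (e.valid && e.fence_tag == fence_tag) e.valid = false;
    }

    // (3) Retirement check: nullopt means the watchlist for this home tile is missing
    // and must be fetched first; true means the access may retire past the fence.
    std::optional<bool> may_retire(int home_tile, std::uint64_t block_addr) const {
        const WatchlistEntry& e = entries_[home_tile];
        if (!e.valid) return std::nullopt;
        auto [h1, h2] = positions(block_addr);
        bool possible_conflict = e.watchlist.test(h1) && e.watchlist.test(h2);
        return !possible_conflict;   // stall only if the block address may be a pending store
    }

private:
    // Assumed to match the hash functions the directory used to build the filter.
    static std::pair<std::size_t, std::size_t> positions(std::uint64_t a) {
        return { std::hash<std::uint64_t>{}(a) % 160,
                 std::hash<std::uint64_t>{}(a ^ 0x9e3779b97f4a7c15ULL) % 160 };
    }

    std::vector<WatchlistEntry> entries_;   // one entry per home-directory tile
};
```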

5.4 Experimental Evaluation

The goals of the evaluation are: (1) to understand why address-aware fence performs better; (2) to assess the performance of address-aware fence compared to traditional fence; and (3) to assess the space and traffic overhead.

(Simulation) We have developed a hardware simulation infrastructure using the

Pin tool [71] that simulates a directory-based shared memory multiprocessor system. Each

processor has a private 4-way 32KB L1 cache and all processors share an L2 cache (16-way

1MB/core). All L1 caches are kept coherent using a directory-based MESI protocol. All

cores are connected via a mesh network, which has a link latency of 2 cycles and router

latency of 3 cycles. Each instruction takes 1 cycle to execute, and it takes 2, 10, and 300

cycles to access the L1 cache, L2 cache, and main memory, respectively.

Benchmarks    Description
dekker        Dekker algorithm [34]
lamport       Lamport Queue [65]
msn           Non-blocking Queue [77]
wsq           Chase-Lev’s Work Stealing Queue [27]
bst           Binary search tree
SPLASH-2      8 programs from SPLASH-2 [108]

Table 5.4: Benchmark description.

(Benchmarks) We evaluate the technique using benchmarks shown in Table 5.4.

There are two groups, concurrent lock-free algorithms and SPLASH-2 benchmark pro-

grams. In the first group, concurrent lock-free algorithms are implemented using fences

and atomic compare-and-swap (CAS) instructions. Dekker algorithm (dekker) [34] is a

classic solution to mutual exclusion problems using only shared variables for communi-

cation. Lamport Queue (lamport) [65] is a single-producer and single-consumer queue.

140 Non-blocking concurrent queue (msn) is a multiple-producer and multiple-consumer queue.

Chase-Lev work-stealing queue (wsq) [27] is a lock-free work-stealing deque implemented with a growable cyclic array. bst is a concurrent search structure implemented using atomic

CAS instructions. Since these lock-free data structures are not closed programs, we constructed harnesses to use them to assess the performance of address-aware fences. The second group of benchmarks is from SPLASH-2 [108]. In these benchmarks, fences are inserted to enforce sequential consistency. We identified the fence insertion points based on

Shasha and Snir’s delay set analysis, where we employed dynamic analysis to find conflicting accesses as in [35]. We also use these benchmarks to compare address-aware fence and

C-Fence (Chapter 4).

5.4.1 Performance

In this section, we would like to understand how address-aware fences improve fence performance and evaluate the performance overhead induced by address-aware fences.

Effectiveness of address-aware fence

                       Address-aware fences               C-Fence
Benchmarks    #Fences    #Check    #in W.L.    #Conf.
barnes        83M        47M       206         21M
fmm           5M         2M        223         1M
ocean         9M         15M       383         1M
radiosity     30M        5M        90          4M
raytrace      44M        34M       239         5M
volrend       39M        49M       436         3M
water-ns      9M         1M        79          656K
water-sp      7M         2M        88          395K
AVG.          28.8M      19.4M     218         4.9M

Table 5.5: Effectiveness of address-aware fence.

In this evaluation, we use SPLASH-2 benchmark programs where fences have been inserted for enforcing sequential consistency. All programs are run with 8 threads. Table 5.5 characterizes address-aware fences in terms of how often fences have to take effect. Column

2 in the table shows the number of dynamic fence instructions executed in each program.

Although they only account for a small part of all instructions (∼1%), they induce relatively

larger execution time overhead, as shown later in Fig. 5.6. With address-aware fence, when

there is any active fence, memory accesses have to check with watchlists before retirement.

Column 3 shows the number of such memory accesses. Column 4 shows the number of

memory accesses whose block addresses are found to be present in the watchlists. We can

see that, compared with the number in Column 3, very few memory accesses are stalled by

fences. In fact, the number of memory accesses that need to stall (Column 4) is negligible compared with the total number of fences (Column 2). For example, volrend has the largest number of detected conflicts, but this is still very small compared to the total number of fences – 436 vs.

39 million. On average, only 218 memory accesses are found in watchlists; thus, nearly all fences are prevented from taking effect.

(Comparison with C-Fence) We also implemented and studied the C-Fence mechanism, which is able to dynamically eliminate a fence as long as none of its associate fences (provided by the compiler) are executing concurrently in other processors. We measured the number of

fences which detect executing associates and have to stall. Column 5 shows the number of fences that have to stall when C-Fence is used. On average, around 15%

of C-Fences have to stall. In contrast, address-aware fences require very few stalls (Column

4 vs. Column 5). This is because address-aware fence is able to exploit more optimization

opportunities dynamically, e.g., scenarios in Fig. 1.9, which C-Fence is not able to optimize.

Execution time


Figure 5.6: Execution time (T – traditional fence; A – address-aware fence).

We measured the execution time of programs with traditional fences and address-

aware fences, respectively. Fig. 5.6 shows the results, which are normalized to the execution

time achieved using traditional fences. The execution time is broken down into two parts:

the stall time due to fences (Fence stalls) and the rest of the execution time (Execution).

For the first group (concurrent lock-free algorithms), with traditional fences, we can see that

fence stalls account for 8%-29% of the total execution time. Actually, the fence overhead

can be larger depending on the programs using them. Concurrent algorithms have to

guarantee correct data accesses by multiple threads, and fences are used to guarantee this

goal under relaxed consistency models. However, if data is not frequently accessed by

multiple threads concurrently, address-aware fences are able to utilize this opportunity to

reduce the fence overhead. In the second group (SPLASH-2 programs), fences are inserted to

ensure sequential consistency. Similarly, they experience significant overhead due to fences

(about 10% on average). However, since fences are inserted conservatively, many dynamic

fence instances are unnecessary, some of which were illustrated in Fig. 1.9. With address-

aware fence, only very few fences have to take effect and delay the memory accesses that

follow them. On average, address-aware fence improves the performance of all benchmark programs by 12.2%. More importantly, we can see that address-aware fence only induces negligible execution time overhead – a fence can retire as long as it has obtained its watchlist, which can be fetched from the directory efficiently. On the other hand, a traditional fence has to stall the pipeline until all outstanding stores have completed, which takes a much longer time to access the memory or invalidate shared copies.


Figure 5.7: Scalability (Tn represents traditional fence with n processors and An represents address-aware fence with n processors).

(Scalability) We vary the number of processors among 4, 8, and 16 and measure the execution time of benchmark programs from SPLASH-2. The results are shown in

Fig. 5.7, where data are presented in the same way as in Fig. 5.6. For each benchmark program, execution times with traditional fence and address-aware fence are reported for 4, 8, and 16 processors. We can see that, with different numbers of processors, the execution time overhead induced by address-aware fence is not affected significantly. Therefore, the implementation appears scalable. In fact, the implementation leverages the directory-based cache coherence protocol, piggybacking required information on coherence transactions. Without introducing centralized structures, the implementation maintains the level of scalability delivered by the directory-based cache coherence.

5.4.2 Space and traffic

                  Active buffer       #Pend.    #Delay       Traffic
Benchmarks    %Emp.     %<64          stores    /1K inst.    %Inc.
barnes        77.9      98.7          2.2       0.0          13.0
fmm           85.5      99.4          2.8       0.0          18.6
ocean         59.6      84.1          8.7       0.0          9.7
radiosity     78.1      99.5          3.6       0.0          17.4
raytrace      82.5      98.9          2.3       0.0          22.1
volrend       75.2      98.7          4.5       0.0          13.9
water-ns      86.4      99.8          4.1       0.0          8.9
water-sp      89.0      99.9          5.2       0.0          12.5
AVG.          79.2      97.5          4.2       0.0          14.7

Table 5.6: Characterization of space and traffic.

To support address-aware fence, we introduced the active buffer and watchlist buffer to record information and leveraged cache coherence to avoid fence order violations. Table 5.6 characterizes the space and traffic induced by address-aware fence. We conducted the experiments using SPLASH-2 benchmark programs with 8 threads.

Recall that the active buffer stores the addresses of active memory accesses and fences, and two consecutive loads or two consecutive stores are merged when they have the same block address. Column 2 shows that the active buffer is frequently empty – 79.2% of the time on average. When it is empty, memory accesses at the head of the ROB can retire immediately, as there is no active fence prior to them. To size the active buffer, we tracked the number of active memory accesses during execution, and found that this number is less than 64 most of the time (97.5% on average, as shown in Column 3). ocean has more than

64 active accesses most frequently (15.9%) among all programs, but the other benchmark

programs almost never exceed 64 active accesses. Thus, we used 64 entries in the imple-

mentation. Besides, for the watchlist buffer, we used a number of entries equal to the number of processors, with each entry recording the watchlist obtained from the corresponding processor.

A watchlist is obtained from the directory, consisting of the addresses of pending

stores being serviced in the directory. Column 4 shows the average number of block ad-

dresses compressed in the directory when a processor requests a watchlist. As we can

see, the average number is 4.2, which is small. In the implementation, we compressed the

addresses into a watchlist of 160 bits, which is enough to obtain a very low false positive rate.

Column 5 shows the number of external cache coherence transactions that are delayed due

to hits in the active buffer. We can see that this number is 0.0 per 1K instructions for all programs, so it has little effect on performance. Column 6 shows the traffic increase for each program.

Address-aware fence induces additional traffic to transfer watchlists between processors and

the directory, and this is the main source of the additional traffic. On average, the traffic

increases by 14.7%, which is modest.

(Hardware Cost Summary) In the highest performing configuration, address-aware

fence only adds active buffer and watchlist buffer to each core. The active buffer has 64

entries, each of which has about 8 bytes; the watchlist buffer has 8 entries (for 8 processors),

each of which has about 20 bytes. The above two buffers amount to a total of 672 bytes.

We expect the extra power consumption of the technique to be small compared to the state-

of-the-art aggressive speculative techniques, as the area overhead of the proposed technique

is small and it does not have to maintain speculative states or require rollback associated

with misspeculations.

5.5 Summary

This chapter presents a hardware solution, address-aware fence, to reduce the overhead due to fence instructions without speculation. Address-aware fence is implemented in the microarchitecture without instruction set support and is transparent to programmers. Address-aware fences only enforce memory orderings that are necessary to maintain the effect that the traditional fences strive to enforce, while other fences are dynamically eliminated. The experiments conducted on a group of concurrent lock-free algorithms and

SPLASH-2 benchmarks show that address-aware fences eliminate nearly all the overhead associated with traditional fences and achieve an average performance improvement of 12.2% on all benchmark programs.

Chapter 6

Related Work

This chapter discusses the research work related to enforcing and reducing memory ordering. It is categorized into hardware, compiler, programming language, and verification and debugging support, with some other work covered in the last section.

6.1 Hardware Support

6.1.1 Speculative techniques

Conventional hardware design can be augmented to accommodate the requirement

of sequential consistency (SC), where speculation is typically used to achieve high perfor-

mance despite ensuring SC. The key observation is that, for most memory accesses, the

execution would still follow SC, even if they are reordered. A small fraction of memory

accesses that require the delay for correctness are efficiently detected and corrected by reis-

suing the accesses to the memory system. Thus, the common case is handled with maximum

speed without violating SC.

148 Gharachorloo et al. [46] propose two techniques to enhance the performance of

SC – hardware-controlled non-binding prefetch and speculative execution for load accesses.

Prefetching makes it more likely that a memory access will find its data in cache; and spec-

ulation allows the processor to proceed with load accesses that would otherwise be delayed

due to earlier pending accesses. However, such reordering is within the instruction window,

which still requires the write buffer to be drained before the subsequent memory operations can

complete. The MIPS R10000 processor supports SC and includes this technique [109]. Ran-

ganathan et al. [87] propose speculative retirement, where loads are speculatively retired

while there is an outstanding store, and stores are not allowed to get reordered with respect

to each other. The above two techniques allow only loads to bypass pending loads and

stores speculatively; stores are not allowed to bypass other memory accesses.

In [49], Gniady et al. propose SC++, which allows both load and store to specula-

tively bypass each other. By supplementing the reorder buffer with the Speculative History

Queue (SHiQ), SC++ maintains the speculative states of memory accesses. This enables

retiring instructions to proceed, which otherwise would need to be stalled due to the long

access latency of a store. This work shows that SC implementations can perform as well

as RC implementations if the hardware provides enough support for speculation. Further-

more, since SHiQ may need to be very large to tolerate long latencies, in [48], they propose

SC++lite, which places the SHiQ in the memory hierarchy, providing a scalable path for

speculative SC systems across a wide range of applications and system latencies.

The above works track speculative state at a per-instruction or per-store granu-

larity, which requires that the history buffers grow proportionally to the duration of spec-

ulation. Another set of techniques [16, 25, 107] propose the idea of enforcing consistency at

149 the granularity of coarse-grained chunks of instructions rather than individual instructions.

These approaches execute programs in continuous speculative chunks, buffering register state via checkpoints and buffering speculative memory state in the L1 cache. At the end of a chunk, they send their write set to other processors, which use this write set to detect violations. If there is a violation, the recovery mechanism invalidates speculatively-written lines in L1 and refetches data from lower levels. [50, 26] first introduce this concept in transactional memory, although they do not target SC. Ceze et al. [25] propose BulkSC, which enforces SC at chunk granularity. A chunk that is going to commit sends its signatures to the arbiters and other processors to determine whether the chunk can be committed. Ahn et al. [6] propose a compiler algorithm called BulkCompiler to drive the group-formation operation and adapt code transformations to existing chunk-based implementations of SC.

They select chunk boundaries so that they can maximize the potential for compiler optimization and minimize the chance of reexecution. Blundell et al. [16] propose InvisiFence, which requires neither fine-grained buffers to hold speculative state nor global arbitration for commit.

Speculative techniques can achieve good SC performance comparable to release consistency (RC). Although the above techniques are designed for efficient SC implementation, they can also be adjusted to reduce fence overhead. However, the main issue with these speculative techniques is that they employ extensive post-retirement speculation, which requires high hardware complexity. On the contrary, this dissertation aims at good performance without aggressive hardware speculation.

6.1.2 Non-speculative techniques

There are also works on reducing memory ordering overhead non-speculatively. In

[102], Praun et al. propose conditional memory ordering (CMO) based on the observation that memory ordering instructions used on acquire and release of a lock are often unnecessary. The goal of CMO is to optimize and reduce the cost of the acquire-release memory synchronization protocol using a purely runtime technique. Ladan-Mozes et al. [63] propose location-based memory fences (l-mfence) to reduce fence overhead. It is a lightweight solution, but it is limited to Dekker-like algorithms. Address-aware fence presented in this dissertation is not limited to specific applications. It handles all encountered fences without being aware of how they are used. Concurrently with this work, Duan et al. propose WeeFence [36], which shares similarities with address-aware fence. They use dedicated on-chip structures to record addresses of pending stores, while address-aware fence utilizes the cache coherence directory to obtain pending store information.

Non-speculative techniques have also been proposed to enforce SC efficiently. Singh et al. [97] propose to identify thread-local and shared read-only data, and enforce SC by only ordering the accesses to remaining data. With assistance from static compiler analysis and hardware memory management, a processor can easily determine these safe accesses; and an additional unordered store buffer is employed to allow later memory accesses to proceed without a memory ordering related stall. In contrast, conflict ordering described in the dissertation is a pure hardware solution, which checks if there is any conflict with pending accesses in remote cores before committing a memory operation from the ROB, resulting in non-speculative completion.

6.2 Compiler Support

6.2.1 Fence insertion for enforcing sequential consistency

Programs running on machines supporting relaxed consistency models can be transformed into ones in which SC is enforced. Shasha and Snir, in their seminal work [95], observe that not all pairs of memory operations need to be ordered for SC; only those pairs which conflict with others run the risk of SC violation and consequently need to be ordered.

To ensure memory operation pairs are not reordered, memory fences are inserted. They then propose delay set analysis, a compiler based algorithm for minimizing the number of fences inserted for ensuring SC. This work leads to other compiler based algorithms that implement various fence insertion and optimization algorithms.

Although [95] provides some of the foundational ideas for enforcing SC from a compiler perspective, it is not designed as a practical static analysis. Midkiff and Padua [80] extend Shasha and Snir’s characterization to work for programs with branches, aliases, and array accesses, but they do not provide a polynomial-time algorithm for performing the analysis. Krishnamurthy and Yelick [60, 59] provide some early implementation work on cycle detection. They show that computing the minimal delay set for an arbitrary parallel program is an NP-hard problem for MIMD programs and propose a polynomial-time algorithm for analyzing SPMD programs. Chen et al. [29] substantially improve both the speed and the accuracy of the SPMD cycle detection algorithm described in [60]. By utilizing the concept of strongly connected components, they improve the running time of the analysis asymptotically from O(n³) to O(n²).

Some compiler techniques aim to find the delay set that is sufficient to enforce SC, and minimize the number of inserted fences. Lee and Padua [66] develop a compiler tech-

nique that reduces the number of fence instructions for a given delay set, by exploiting the properties of fence and synchronization operations. Later, Fang et al. [39] also develop and implement several fence insertion and optimization algorithms in their Pensieve compiler project. [55] propose schemes for concurrent languages that employ simple synchronization and little aliasing, e.g., Titanium. However, their techniques are hard to extend to C/C++ programs because the extensive use of pointers in such programs seriously complicates the analysis. Sura et al. [98] describe co-operating escape, thread structure, and delay set analysis to enforce SC. In [67], Lee et al. use a perfect escape analysis to provide the best bound on the overhead incurred for an SC memory model. In addition, [78] takes a different approach to reducing the number of fences. It exploits the relaxed semantics of the work-stealing algorithm – tasks are allowed to be executed multiple times for some applications – and avoids some fences in the algorithm.

While the above works target reducing the number of fences, this dissertation aims to make fences cheap by eliminating unnecessary fence stalls, and hence it would help achieve

SC more efficiently.

6.2.2 Compiler optimizations consistent with memory models

Compiler optimizations should also be consistent with the required memory models. Transformations that can have the effect of reordering memory operations should be carefully examined to avoid violations. [104] analyzed the validity of several common program transformations in multithreaded Java, as defined by the Java Memory Model (JMM)

[73]. It identified several valid and invalid transformations with respect to the Java memory model. In [105], Ševčík et al. describe a concurrency extension to a C-like programming

language that provides end-to-end TSO semantics. They prove that a set of optimizations are TSO-preserving, which provides an end-to-end guarantee when the generated binaries are executed on hardware. Later, they present CompCertTSO in [106], a verified compiler

that generates x86 assembly code whose behavior is permitted by the source language TSO semantics.

[103] gives a rigorous study of a group of transformations that involve both reordering

and elimination of memory accesses, and proves that any composition of these transforma-

tions is sound with respect to the DRF guarantee. [100] describes further work on verified

fence elimination. [12, 89] prove the correctness of the proposed compilation scheme from

C/C++11 concurrency primitives to POWER/ARM machines. Moreover, [75, 22] also show

that a large class of optimizations crucial for performance are either already SC-preserving

or can be modified to preserve SC while retaining much of their effectiveness.

These compiler optimizations cannot prevent hardware from reordering, and hence

they are orthogonal to the techniques proposed in this dissertation. They are complementary

in improving program performance.

6.3 Programming Support

Mainstream programming languages like C++ and Java use variants of the data-

race-free memory model known as DRF0 [4, 5], which guarantee SC as long as the program

is free of data races. Some works, such as [13, 15], give formal semantic descriptions for

these programming languages. However, if the programs do have data races then these

models provide much weaker semantics [73, 17]. To redress this, there have been works

that employ dynamic race detection [38] in order to stop execution when semantics become

undefined. Since dynamic data race detection can be slow, recent works [70, 74, 96] propose

154 raising an exception upon encountering an SC violation as this can be done more efficiently.

They also require compiler support to partition a program into regions, and many valid compiler optimizations can be performed within a region but not across regions. During ex- ecution, hardware guarantees that there is no SC violation by detecting conflicting accesses in concurrently executing regions; otherwise, an exception will be raised. These efforts in programming language are rooted in the observation that good programming discipline requires programmers to protect shared accesses with appropriate synchronization.

On the contrary, the techniques in this dissertation allow data races, as data races can be intentional and harmless, such as in lock-free data structures. In particular, conflict ordering, address-aware fence, and conditional fence rely on the detection of data races which are overlapping in time and intertwined in a manner that can form a cycle.

6.4 Debugging and Verification

Most prior SC violation detection schemes have used data races as proxies for SC violation [44, 16, 25, 49, 70, 74], which is highly imprecise. In [35], Duan et al. use a race detector to construct a graph of races dynamically, and then traverse the graph off-line to

find potential SC violations. However, it inherently suffers from the same limitations as data race detection techniques. False negatives resulting from data race detection could lead to undetected SC violations, and false positives would report SC violations that never occur.

In [23], Burnim et al. develop a tool to execute programs with a biased random scheduler to create with high probability a predicted sequential consistency violation. Recently, [81,

85] propose hardware schemes to precisely detect SC violations in machines implementing relaxed models. When a violation is about to occur, an exception is triggered, providing information to debug the SC violation. The above works focus on debugging, so they target precise detection; however, this dissertation aims to avoid unwanted ordering violations, and targets execution efficiency.

There are also works in the verification community [19, 18, 20, 21, 28, 53, 61, 62, 69] to verify and enforce SC. Verification tools developed in [20, 21] aim at inserting fence instructions accurately. These tools take a concurrent program and a relaxed memory consistency model, e.g., TSO [21], as inputs, then enumerate all possible execution patterns and simulate them according to the memory consistency model. Fences can be inserted according to the executions that lead to non-SC results. While promising, these techniques are not designed for on-the-fly SC enforcement with negligible overhead, or for reducing fence overhead. Moreover, [30, 32, 76] are works that can verify whether memory system hardware is correctly implemented, which is different from the goal of this dissertation.

6.5 Other Works

6.5.1 Fence instructions in commercial architectures

Commercial architectures provide various fence instructions [4]. Some of them also enforce ordering of a subset of memory operations. In Intel IA-32, there are three types of fence instructions, i.e., mfence, lfence and sfence. While mfence enforces all memory orders, lfence only enforces orders between memory load instructions and sfence only enforces orders between memory store instructions. Moreover, in SPARC V9, the MEMBAR instruction can be customized to enforce different memory orders (i.e., with mask values indicating #LoadLoad, #StoreLoad, #LoadStore, and #StoreStore); in PowerPC, there are

lightweight fence lwsync and heavyweight fence sync, where sync is a full fence, while lwsync guarantees all other orders except RAW; in the Alpha model, there are memory barrier (MB) and write memory barrier (WMB). Although this dissertation also provides fences for enforcing ordering of a subset of memory operations, its improvements will be orthogonal to the above fences. While the above fences explore the ordering of a combination of previous load and store operations with respect to future load and store operations, this dissertation explores the ordering of a certain set of memory accesses whose scopes (scoped fence) or addresses (address-aware fence) are considered.

6.5.2 Optimizing lock implementations

Thin locks [10] are an influential work on lock optimization for Java. They are developed based on the observation that sequential programs often needlessly use locks indirectly by calling thread-safe libraries. By overlaying a CAS lock on top of Java’s more costly monitor mechanism, thin locks avoid all monitor accesses for a single-threaded program, and also guarantee correctness if additional threads are introduced. There are also other refinements [56, 83, 82] of thin locks. The basic idea is to allow a particular thread to reserve a lock, and hence acquisitions of the lock by the reserving thread can be performed efficiently. [101] propose a fast biased lock, which simplifies and generalizes prior implementations of biased locks. The above work focuses on reducing the frequency of atomic RMW operations for implementing locks, while this dissertation aims to reduce the memory ordering overhead induced by fence instructions. Speculative lock elision (SLE)

[86] eliminates unnecessary serialization of threads due to critical sections to achieve high performance in multithreaded programs, using speculation. This is because, dynamically,

these critical sections could have safely executed concurrently without locks. Besides, lock elision with transactional memory [99, 33] also uses speculative techniques to dynamically eliminate lock operations. In contrast, this dissertation does not focus only on lock operations but also on fences, which are also used for lock implementation, and the proposed techniques do not resort to aggressive speculation.

6.5.3 Formalization of memory models for commercial architectures

Commercial architectures with relaxed memory models are often described in ambiguous informal prose, leading to widespread confusion. In [94], Sewell et al. describe how the recent Intel and AMD memory model specifications do not match the actual behaviors observed in real machines. Thus, it is necessary to give formal descriptions for these architectures. [91] formalizes the Intel and AMD architecture specifications of the time, but those later turned out to be unsound with respect to actual hardware. Then, the x86-TSO model is described in [94, 84], which is sound with respect to actual processor behavior, matches the current vendor intentions, and is a good model to program above. For POWER/ARM

machines, Alglave et al. [7, 8] give preliminary axiomatic models. Later, in [90], Sarkar et al. provide an abstract-machine model for POWER that can accurately capture the architectural intent and observable processor behavior for a wide range of subtle examples. Moreover, [72] describes an axiomatic model that is provably equivalent to the operational model in [90]. These formalization works help us better understand the behaviors of commercial architectures.

Chapter 7

Conclusions

7.1 Contributions

This dissertation makes contributions in eliminating unnecessary memory ordering on multiprocessors. It presents techniques to implement sequential consistency (SC) efficiently and reduce fence instruction overhead. The techniques are based on the observation that a subset of the memory orderings that are in general required may be unnecessary during the current execution. This dissertation explores programming, compiler, and hardware support to efficiently detect such unnecessary memory orderings and eliminate them to improve performance. In particular, this dissertation makes the following contributions.

I. Efficient hardware SC implementation without aggressive speculation. It is attractive to have a system in which programmers can treat the combined compiler and hardware as an SC system and still transparently profit from the performance advantage of the relaxed memory models, achieving both programming efficiency and high performance. Marino et al. [75] have recently shown that it is possible for the compiler to preserve SC efficiently. However, it

is questioned whether the costs of enforcing SC in hardware justify its benefits, as prior SC proposals need to employ aggressive speculation, which has high hardware complexity. This dissertation presents conflict ordering for efficient hardware-enforced SC, by detecting and preventing cycles in memory access orders across threads which may violate SC. This work demonstrates that the benefits of SC can indeed be realized using lightweight hardware resources, without requiring post-retirement speculation.

II. Making fence instructions cheap. Fence instructions are used by the compiler and hardware to prevent the reordering of memory accesses. However, they are expensive in today’s processors, as enforcing memory ordering is expensive. Consequently, programmers strive for ways of avoiding excessive use of fences, or replacing heavyweight ones with lightweight ones. Unfortunately, this can lead to complicated code or, sometimes, even bugs. This dissertation explores various ways to reduce fence overhead, making them cheap to use.

Programmers and compilers can use fences conservatively without worrying about their overhead. In particular, in high-level languages such as C++ or Java, atomic or volatile is used to declare some shared variables. Their semantics implies the same effect as fence instructions.

Lightweight fences would be able to implement accesses to such variables efficiently.

III. Exploring programming, compiler, and hardware support for eliminating unnecessary memory ordering. Memory ordering requirements in memory models influence many aspects of system design, including the design of programming languages, compilers, and the underlying hardware. All these aspects work together to ensure memory ordering requirements. This dissertation discusses the trade-offs between them to design a system that can detect and eliminate unnecessary memory ordering. With programming and compiler support, a de-

sign with simple hardware support is possible; on the other hand, hardware alone must be more finely designed, but it also leads to higher precision in detecting unnecessary memory ordering.

IV. Programming directed fence. Scoped fence (S-Fence) introduces the concept of fence scope, and enables programmers to express their ordering demands. S-Fence bridges the gap between programmers’ intention and hardware execution with respect to the memory ordering enforced by fences. The idea of fence scoping is consistent with the principle of encapsulation and modularity of object-oriented programming languages. This makes it easy to incorporate fence scoping in current popular object-oriented programming languages. Moreover,

S-Fence is easy for programmers to use and requires simple modifications to the hardware.

In particular, S-Fence only makes changes locally in each processor core, without adding inter-processor communication, and hence scalability is not an issue for S-Fence.

V. Compiler directed fence. Conditional fence (C-Fence) is proposed to utilize compiler information to dynamically decide if there is a need to stall at each fence. It does not require programming effort, and the hardware cost is lightweight. C-Fence can be used to enforce SC efficiently. Compared to previous software based approaches, C-Fence significantly reduces the slowdown; compared to previous hardware based approaches, C-Fence does not require any speculation.

VI. Hardware directed fence. Address-aware fence is proposed to speed up fence execution, without help from the compiler or programmer. It is implemented in the microarchitecture without instruction set support and is transparent to programmers. Fence instructions in the executable are identified at runtime and processed by the hardware. Address-aware

fence is broadly applicable, and has the highest precision, eliminating nearly all possible unnecessary memory orderings due to fences.

7.2 Future Directions

I. Memory ordering in distributed systems. In distributed systems, software shared virtual memory [54, 68, 24, 57] is usually implemented to provide a virtual address space that is shared among all processors. Application programs can use the system as a physically shared memory machine. There are two design choices that greatly influence the implementation of a shared virtual memory: the granularity of the memory units and the strategy for maintaining coherence. The memory unit can be a page or another fine-grained unit, which influences the communication frequency between distributed memories. Furthermore, the strategy for maintaining coherence involves deciding when an update should be seen by remote processors. These two factors greatly influence program performance in a distributed system, as the communication cost between processors is nontrivial due to the lack of physically shared memory. A strong memory consistency model (e.g., SC) can be expensive, leading to poor performance. Therefore, in distributed systems, some relaxed models are usually implemented, e.g., lazy release consistency [58], entry consistency [14], etc. As a result, it would be interesting to explore ways to improve the performance of stronger memory models in distributed systems. The ideas described in this dissertation to eliminate unnecessary memory ordering can be considered in the context of distributed systems, although there exist challenging problems to tackle. For example, in multicore processors, we can obtain information from the directory (closer to the processor than memory) to assist reordering decisions quickly; however, in distributed systems, there is no such structure. It is also possible

to consider application characteristics to design application-specific protocols to improve performance.

II. Memory ordering in manycore processors. In manycore processors, a large number of cores are used to perform massively parallel computing. Researchers from Intel have announced that the architecture of their recent 48-core processor, called Single-Chip Cloud Computer

(SCC) [1], can scale to 1,000 cores. In this kind of architecture, each core is designed as an in-order processor core for power efficiency. Besides, it is based on a message passing architecture and does not provide any hardware cache coherence mechanism; hence software shared virtual memory is also considered in this case. With such a massively parallel architecture, it is necessary to revisit the techniques described in this dissertation. For example, an in-order processor core is different from an out-of-order core. Besides, with a large number of cores on a chip, the traffic due to communication becomes significant, which should be considered when applying the proposed techniques.

III. Implementing Fences using RMWs. The design of concurrent lock-free applications requires comprehensive consideration of algorithmic concerns and architectural issues. In particular, enforcing w → r order is one common synchronization pattern [9]. For example, in Fig. 1.2, the purpose of the fences (Lines 2 and 6) is to enforce the w → r order involving

flag0 and flag1. However, when Dekker’s algorithm is used inside an application, the fences also order other memory accesses. Since we already know the fences are used to enforce the w → r order involving flag0 and flag1, it is possible to use an atomic RMW instruction and a lightweight fence to implement the function of a heavyweight fence, imposing fewer constraints on memory ordering.

(a) fence                  (b) RMW
1  Wr(flag0)               1  RMW(flag0)
2  FENCE                   2  FENCE(load-load)
3  Rd(flag1)               3  Rd(flag1)

Figure 7.1: Implementing fence using RMW.

Fig. 7.1 shows how we can enforce w → r order using an atomic RMW instruction.

Fig. 7.1(a) is the original implementation using a fence, where the processor first writes to flag0 and then reads from flag1. A fence is inserted between them to enforce the w → r order. Instead, as shown in Fig. 7.1(b), we can use an RMW instruction to write to flag0,

followed by a lightweight fence instruction only enforcing the load-load order. Let us see

why we can enforce the w → r order by using an RMW instruction. An atomic RMW instruction

can be considered as a read immediately followed by a write, i.e., RMW(flag0) in Line 1

can be considered as Rd(flag0);Wr(flag0). Since there is a load-load fence in Line 2, it will

order Rd(flag0) (read part of RMW) and Rd(flag1) (Line 3). Moreover, Rd(flag0);Wr(flag0)

is atomic, so Wr(flag0) (write part of RMW) will also be ordered before Rd(flag1) (Line 3);

otherwise, the atomicity will be violated. Therefore, the write to flag0 and the read from

flag1 are ordered. The benefit of using an RMW instruction (plus a load-load fence) instead

of a heavyweight fence is that writes other than the target write before the load-load fence

are not required to be drained from the write buffer.
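As a purely illustrative rendering, the fragment below transliterates Fig. 7.1 into C++11 atomics. The function names traditional() and rmw_based() are hypothetical, and mapping FENCE(load-load) onto an acquire fence is only meant to mirror the hardware-level argument above; it is not a claim that the two functions are equivalent under the C++ memory model, where the end-to-end guarantee would additionally depend on how the target architecture implements the atomic RMW.

```cpp
#include <atomic>

std::atomic<int> flag0{0}, flag1{0};

// Fig. 7.1(a): a heavyweight full fence orders the write to flag0 before the read of flag1.
int traditional() {
    flag0.store(1, std::memory_order_relaxed);              // Wr(flag0)
    std::atomic_thread_fence(std::memory_order_seq_cst);    // FENCE
    return flag1.load(std::memory_order_relaxed);           // Rd(flag1)
}

// Fig. 7.1(b): an atomic read-modify-write of flag0 followed by a fence that only needs
// to order loads. The RMW's read part is ordered before Rd(flag1) by the load-load fence,
// and atomicity then keeps its write part before Rd(flag1) as well.
int rmw_based() {
    flag0.exchange(1, std::memory_order_relaxed);            // RMW(flag0)
    std::atomic_thread_fence(std::memory_order_acquire);     // stand-in for FENCE(load-load)
    return flag1.load(std::memory_order_relaxed);            // Rd(flag1)
}
```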

Bibliography

[1] Intel Single-Chip Cloud Computer. http://techresearch.intel.com/articles/Tera-Scale/1826.htm.

[2] Tilera Tile-Gx processor. http://www.tilera.com/products/processors.

[3] S. V. Adve and H.-J. Boehm. Memory models: a case for rethinking parallel languages and hardware. Commun. ACM, 53(8):90–101, 2010.

[4] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29:66–76, 1995.

[5] S. V. Adve and M. D. Hill. Weak ordering - a new definition. In Proceedings of the 17th annual international symposium on Computer Architecture, ISCA ’90, pages 2–14, 1990.

[6] W. Ahn, S. Qi, M. Nicolaides, J. Torrellas, J.-W. Lee, X. Fang, S. Midkiff, and D. Wong. BulkCompiler: high-performance sequential consistency through cooperative compiler and hardware support. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’09, pages 133–144, 2009.

[7] J. Alglave, A. Fox, S. Ishtiaq, M. O. Myreen, S. Sarkar, P. Sewell, and F. Z. Nardelli. The semantics of Power and ARM multiprocessor machine code. In Proceedings of the 4th workshop on Declarative aspects of multicore programming, DAMP ’09, pages 13–24, 2008.

[8] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Fences in weak memory models. In Proceedings of the 22nd international conference on Computer Aided Verification, CAV’10, pages 258–272, 2010.

[9] H. Attiya, R. Guerraoui, D. Hendler, P. Kuznetsov, M. M. Michael, and M. Vechev. Laws of order: expensive synchronization in concurrent algorithms cannot be eliminated. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL ’11, pages 487–498, 2011.

[10] D. F. Bacon, R. Konuru, C. Murthy, and M. Serrano. Thin locks: featherweight synchronization for Java. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, PLDI ’98, pages 258–268, 1998.

[11] D. A. Bader and G. Cong. A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). J. Parallel Distrib. Comput., 65(9):994–1006, Sept. 2005.

[12] M. Batty, K. Memarian, S. Owens, S. Sarkar, and P. Sewell. Clarifying and compiling C/C++ concurrency: from C++11 to POWER. In Proceedings of the 39th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL ’12, pages 509–520, 2012.

[13] M. Batty, S. Owens, S. Sarkar, P. Sewell, and T. Weber. Mathematizing C++ concurrency. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL ’11, pages 55–66, 2011.

[14] B. N. Bershad and M. J. Zekauskas. Midway: Shared memory parallel programming with entry consistency for distributed memory multiprocessors. Technical report, 1991.

[15] J. C. Blanchette, T. Weber, M. Batty, S. Owens, and S. Sarkar. Nitpicking C++ concurrency. In Proceedings of the 13th international ACM SIGPLAN symposium on Principles and practices of declarative programming, PPDP ’11, pages 113–124, 2011.

[16] C. Blundell, M. M. Martin, and T. F. Wenisch. InvisiFence: performance-transparent memory ordering in conventional multiprocessors. In Proceedings of the 36th annual international symposium on Computer architecture, ISCA ’09, pages 233–244, 2009.

[17] H.-J. Boehm and S. V. Adve. Foundations of the C++ concurrency memory model. In Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’08, pages 68–78, 2008.

[18] T. Braun, A. Condon, A. J. Hu, K. S. Juse, M. Laza, M. Leslie, and R. Sharma. Proving sequential consistency by model checking. Technical report, Vancouver, BC, Canada, 2001.

[19] S. Burckhardt, R. Alur, and M. M. K. Martin. Bounded model checking of concurrent data types on relaxed memory models: A case study. In CAV, volume 4144 of LNCS, pages 489–502. Springer, 2006.

[20] S. Burckhardt, R. Alur, and M. M. K. Martin. CheckFence: checking consistency of concurrent data types on relaxed memory models. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’07, pages 12–21, 2007.

[21] S. Burckhardt and M. Musuvathi. Effective program verification for relaxed memory models. In Proceedings of the 20th international conference on Computer Aided Verification, CAV ’08, pages 107–120, 2008.

[22] S. Burckhardt, M. Musuvathi, and V. Singh. Verifying local transformations on relaxed memory models. In Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction, CC’10/ETAPS’10, pages 104–123, 2010.

[23] J. Burnim, K. Sen, and C. Stergiou. Testing concurrent programs on relaxed memory models. In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA ’11, pages 122–132, 2011.

[24] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proceedings of the thirteenth ACM symposium on Operating systems principles, SOSP ’91, pages 152–164, 1991.

[25] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas. BulkSC: bulk enforcement of sequential consistency. In Proceedings of the 34th annual international symposium on Computer architecture, ISCA ’07, pages 278–289, 2007.

[26] H. Chafi, J. Casper, B. D. Carlstrom, A. McDonald, C. C. Minh, W. Baek, C. Kozyrakis, and K. Olukotun. A scalable, non-blocking approach to transactional memory. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, HPCA ’07, pages 97–108, 2007.

[27] D. Chase and Y. Lev. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA ’05, pages 21–28, 2005.

[28] P. Chatterjee, H. Sivaraj, and G. Gopalakrishnan. Shared memory consistency protocol verification against weak memory models: Refinement via model-checking. In Computer Aided Verification, Lecture Notes in Computer Science. Springer-Verlag, 2002.

[29] W.-Y. Chen, A. Krishnamurthy, and K. Yelick. Polynomial-time algorithms for enforcing sequential consistency for SPMD programs with arrays. Technical Report UCB/CSD-03-1272, EECS Department, University of California, Berkeley, Sep 2003.

[30] Y. Chen, Y. Lv, W. Hu, T. Chen, H. Shen, P. Wang, and H. Pan. Fast complete memory consistency verification. In HPCA ’09, pages 381–392, 2009.

[31] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, New York, 2nd edition, 2001.

[32] A. DeOrio, I. Wagner, and V. Bertacco. Dacota: Post-silicon validation of the memory subsystem in multi-core designs. In HPCA ’09, pages 405–416, 2009.

[33] D. Dice, Y. Lev, M. Moir, and D. Nussbaum. Early experience with a commercial hardware transactional memory implementation. In Proceedings of the 14th international conference on Architectural support for programming languages and operating systems, ASPLOS ’09, pages 157–168, 2009.

[34] E. W. Dijkstra. Cooperating sequential processes. The origin of concurrent programming: from semaphores to remote procedure calls, pages 65–138, 2002.

[35] Y. Duan, X. Feng, L. Wang, C. Zhang, and P.-C. Yew. Detecting and eliminating potential violations of sequential consistency for concurrent C/C++ programs. In Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’09, pages 25–34, 2009.

[36] Y. Duan, A. Muzahid, and J. Torrellas. WeeFence: toward making fences free in TSO. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, pages 213–224, 2013.

[37] M. Dubois, C. Scheurich, and F. Briggs. Memory access buffering in multiprocessors. In 25 years of the international symposia on Computer architecture (selected papers), ISCA ’98, pages 320–328, 1998.

[38] T. Elmas, S. Qadeer, and S. Tasiran. Goldilocks: a race and transaction-aware Java runtime. In Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’07, pages 245–255, 2007.

[39] X. Fang, J. Lee, and S. P. Midkiff. Automatic fence insertion for shared memory multiprocessing. In Proceedings of the 17th annual international conference on Supercomputing, ICS ’03, pages 285–294, 2003.

[40] I. Foster. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.

[41] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, PLDI ’98, pages 212–223, 1998.

[42] M. Galluzzi, E. Vallejo, A. Cristal, F. Vallejo, R. Beivide, P. Stenström, J. E. Smith, and M. Valero. Implicit transactional memory in kilo-instruction multiprocessors. In Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture, ACSAC’07, pages 339–353, 2007.

[43] K. Gharachorloo, S. V. Adve, A. Gupta, J. L. Hennessy, and M. D. Hill. Specifying system requirements for memory consistency models. Technical Report CSL-TR-93-594, Stanford University, 1993.

[44] K. Gharachorloo and P. B. Gibbons. Detecting violations of sequential consistency. In Proceedings of the third annual ACM symposium on Parallel algorithms and architectures, SPAA ’91, pages 316–326, 1991.

[45] K. Gharachorloo, A. Gupta, and J. Hennessy. Performance evaluation of memory consistency models for shared-memory multiprocessors. In Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, ASPLOS ’91, pages 245–257, 1991.

[46] K. Gharachorloo, A. Gupta, and J. Hennessy. Two techniques to enhance the performance of memory consistency models. In Proceedings of the 1991 International Conference on Parallel Processing, pages 355–364, 1991.

[47] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th annual international symposium on Computer Architecture, ISCA ’90, pages 15–26, 1990.

[48] C. Gniady and B. Falsafi. Speculative sequential consistency with little custom storage. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, PACT ’02, pages 179–188, 2002.

[49] C. Gniady, B. Falsafi, and T. N. Vijaykumar. Is SC + ILP = RC? In Proceedings of the 26th annual international symposium on Computer architecture, ISCA ’99, pages 162–171, 1999.

[50] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional memory coherence and consistency. In Proceedings of the 31st annual international symposium on Computer architecture, ISCA ’04, pages 102–113, 2004.

[51] T. L. Harris. A pragmatic implementation of non-blocking linked-lists. In Proceedings of the 15th International Conference on Distributed Computing, DISC ’01, pages 300–314, 2001.

[52] M. D. Hill. Multiprocessors should support simple memory-consistency models. Computer, 31(8):28–34, 1998.

[53] T. Q. Huynh and A. Roychoudhury. Memory model sensitive bytecode verification. Form. Methods Syst. Des., 31(3):281–305, Dec. 2007.

[54] L. Iftode and J. P. Singh. Shared virtual memory: Progress and challenges. In Proceedings of the IEEE, pages 498–507, 1999.

[55] A. Kamil, J. Su, and K. Yelick. Making sequential consistency practical in Titanium. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, SC ’05, page 15, 2005.

[56] K. Kawachiya, A. Koseki, and T. Onodera. Lock reservation: Java locks can mostly do without atomic operations. In Proceedings of the 17th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA ’02, pages 130–141, 2002.

[57] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: distributed shared memory on standard workstations and operating systems. In Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference, WTEC’94, pages 10–10, 1994.

[58] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th annual international symposium on Computer architecture, ISCA ’92, pages 13–21, 1992.

[59] A. Krishnamurthy and K. Yelick. Optimizing parallel programs with explicit synchronization. In Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation, PLDI ’95, pages 196–204, 1995.

[60] A. Krishnamurthy and K. Yelick. Analyses and optimizations for shared address space programs. J. Parallel Distrib. Comput., 38(2):130–144, Nov. 1996.

[61] M. Kuperstein, M. Vechev, and E. Yahav. Automatic inference of memory fences. In Proceedings of the 2010 Conference on Formal Methods in Computer-Aided Design, FMCAD ’10, pages 111–120, 2010.

[62] M. Kuperstein, M. Vechev, and E. Yahav. Partial-coherence abstractions for relaxed memory models. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI ’11, pages 187–198, 2011.

[63] E. Ladan-Mozes, I.-T. A. Lee, and D. Vyukov. Location-based memory fences. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA ’11, pages 75–84, 2011.

[64] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691, 1979.

[65] L. Lamport. Specifying concurrent program modules. ACM Trans. Program. Lang. Syst., 5(2):190–222, Apr. 1983.

[66] J. Lee and D. A. Padua. Hiding relaxed memory consistency with a compiler. IEEE Trans. Comput., 50(8):824–833, 2001.

[67] K. Lee, X. Fang, and S. P. Midkiff. Practical escape analyses: how good are they? In Proceedings of the 3rd international conference on Virtual execution environments, VEE ’07, pages 180–190, 2007.

[68] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst., 7(4):321–359, Nov. 1989.

[69] F. Liu, N. Nedev, N. Prisadnikov, M. Vechev, and E. Yahav. Dynamic synthesis for relaxed memory models. In Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation, PLDI ’12, pages 429–440, 2012.

[70] B. Lucia, L. Ceze, K. Strauss, S. Qadeer, and H.-J. Boehm. Conflict exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA ’10, pages 210–221, 2010.

[71] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’05, pages 190–200, 2005.

[72] S. Mador-Haim, L. Maranget, S. Sarkar, K. Memarian, J. Alglave, S. Owens, R. Alur, M. M. K. Martin, P. Sewell, and D. Williams. An axiomatic memory model for POWER multiprocessors. In Proceedings of the 24th international conference on Computer Aided Verification, CAV’12, pages 495–512, 2012.

[73] J. Manson, W. Pugh, and S. V. Adve. The Java memory model. In Proceedings of the 32nd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL ’05, pages 378–391, 2005.

[74] D. Marino, A. Singh, T. Millstein, M. Musuvathi, and S. Narayanasamy. DRFx: a simple and efficient memory model for concurrent programming languages. In Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’10, pages 351–362, 2010.

[75] D. Marino, A. Singh, T. Millstein, M. Musuvathi, and S. Narayanasamy. A case for an SC-preserving compiler. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI ’11, pages 199–210, 2011.

[76] A. Meixner and D. J. Sorin. Dynamic verification of memory consistency in cache-coherent multithreaded computer architectures. In Proceedings of the International Conference on Dependable Systems and Networks, pages 73–82, 2006.

[77] M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, PODC ’96, pages 267–275, 1996.

[78] M. M. Michael, M. T. Vechev, and V. A. Saraswat. Idempotent work stealing. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’09, pages 45–54, 2009.

[79] S. P. Midkiff. Dependence analysis in parallel loops with i±k subscripts. In LCPC, pages 331–345, 1995.

[80] S. P. Midkiff and D. A. Padua. Issues in the optimization of parallel programs. In Proceedings of the 1990 International Conference on Parallel Processing, Volume 2: Software, pages 105–113, Urbana-Champaign, IL, USA, 1990.

[81] A. Muzahid, S. Qi, and J. Torrellas. Vulcan: Hardware support for detecting sequential consistency violations dynamically. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’12, pages 363–375, 2012.

[82] T. Ogasawara, H. Komatsu, and T. Nakatani. TO-Lock: Removing lock overhead using the owners’ temporal locality. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT ’04, pages 255–266, 2004.

[83] T. Onodera, K. Kawachiya, and A. Koseki. Lock reservation for Java reconsidered. In ECOOP ’04, pages 559–583, 2004.

[84] S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-TSO. In Proceedings of the 22nd International Conference on Theorem Proving in Higher Order Logics, TPHOLs ’09, pages 391–407, 2009.

[85] X. Qian, J. Torrellas, B. Sahelices, and D. Qian. Volition: scalable and precise sequential consistency violation detection. In Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems, ASPLOS ’13, pages 535–548, 2013.

[86] R. Rajwar and J. R. Goodman. Speculative lock elision: enabling highly concurrent multithreaded execution. In Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, MICRO ’01, pages 294–305, 2001.

[87] P. Ranganathan, V. S. Pai, and S. V. Adve. Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models. In Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, SPAA ’97, pages 199–210, 1997.

[88] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S. Sarangi, P. Sack, K. Strauss, and P. Montesinos. SESC simulator, January 2005. http://sesc.sourceforge.net.

[89] S. Sarkar, K. Memarian, S. Owens, M. Batty, P. Sewell, L. Maranget, J. Alglave, and D. Williams. Synchronising C/C++ and POWER. In Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation, PLDI ’12, pages 311–322, 2012.

[90] S. Sarkar, P. Sewell, J. Alglave, L. Maranget, and D. Williams. Understanding POWER multiprocessors. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI ’11, pages 175–186, 2011.

[91] S. Sarkar, P. Sewell, F. Z. Nardelli, S. Owens, T. Ridge, T. Braibant, M. O. Myreen, and J. Alglave. The semantics of x86-CC multiprocessor machine code. In Proceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL ’09, pages 379–391, 2009.

[92] C. Scheurich and M. Dubois. Correct memory operation of cache-based multiprocessors. In Proceedings of the 14th annual international symposium on Computer architecture, ISCA ’87, pages 234–243, 1987.

[93] K. Sen. Race directed random testing of concurrent programs. In Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’08, pages 11–21, 2008.

[94] P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen. x86-TSO: a rigorous and usable programmer’s model for x86 multiprocessors. Commun. ACM, 53(7):89–97, July 2010.

[95] D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988.

[96] A. Singh, D. Marino, S. Narayanasamy, T. Millstein, and M. Musuvathi. Efficient processor support for DRFx, a memory model with exceptions. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, ASPLOS ’11, pages 53–66, 2011.

[97] A. Singh, S. Narayanasamy, D. Marino, T. Millstein, and M. Musuvathi. End-to-end sequential consistency. In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA ’12, pages 524–535, 2012.

[98] Z. Sura, X. Fang, C.-L. Wong, S. P. Midkiff, J. Lee, and D. Padua. Compiler techniques for high performance sequentially consistent Java programs. In Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’05, pages 2–13, 2005.

[99] T. Usui, R. Behrends, J. Evans, and Y. Smaragdakis. Adaptive locks: Combining transactions and locks for efficient concurrency. In Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT ’09, pages 3–14, 2009.

[100] V. Vafeiadis and F. Z. Nardelli. Verifying fence elimination optimisations. In Proceedings of the 18th international conference on Static analysis, SAS’11, pages 146–162, 2011.

[101] N. Vasudevan, K. S. Namjoshi, and S. A. Edwards. Simple and fast biased locks. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT ’10, pages 65–74, 2010.

[102] C. von Praun, H. W. Cain, J.-D. Choi, and K. D. Ryu. Conditional memory ordering. In Proceedings of the 33rd annual international symposium on Computer Architecture, ISCA ’06, pages 41–52, 2006.

[103] J. Ševčík. Safe optimisations for shared-memory concurrent programs. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI ’11, pages 306–316, 2011.

[104] J. Ševčík and D. Aspinall. On validity of program transformations in the Java memory model. In Proceedings of the 22nd European conference on Object-Oriented Programming, ECOOP ’08, pages 27–51, 2008.

[105] J. Ševčík, V. Vafeiadis, F. Zappa Nardelli, S. Jagannathan, and P. Sewell. Relaxed-memory concurrency and verified compilation. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL ’11, pages 43–54, 2011.

[106] J. Ševčík, V. Vafeiadis, F. Zappa Nardelli, S. Jagannathan, and P. Sewell. CompCertTSO: A verified compiler for relaxed-memory concurrency. J. ACM, 60(3):22:1–22:50, June 2013.

[107] T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Mechanisms for store-wait-free multiprocessors. In Proceedings of the 34th annual international symposium on Computer architecture, ISCA ’07, pages 266–277, 2007.

[108] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. In Proceedings of the 22nd annual international symposium on Computer architecture, ISCA ’95, pages 24–36, 1995.

[109] K. C. Yeager. The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2):28–40, 1996.
