Compiler Optimisations and Relaxed Memory Consistency Models Robin Morisset

Compiler optimisations and relaxed memory consistency models Robin Morisset To cite this version: Robin Morisset. Compiler optimisations and relaxed memory consistency models. Other [cs.OH]. Université Paris sciences et lettres, 2017. English. NNT : 2017PSLEE050. tel-01823521 HAL Id: tel-01823521 https://tel.archives-ouvertes.fr/tel-01823521 Submitted on 26 Jun 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. THÈSE DE DOCTORAT de l’Université de recherche Paris Sciences Lettres PSL Research University Préparée à l’École normale supérieure Compiler Optimisations and Relaxed Memory Consistency Models ED 386 sciences mathématiques de paris centre Spécialité informatique composition du jury : M. Andrew Appel Princeton University Rapporteur (non membre du jury) M. Doug Lea State University of New York Rapporteur M. Mark Batty University of Kent Examinateur Robin Morisset M. Derek Dreyer Version du 12/07/2017 Max Planck Institute for Software Systems Examinateur M. Marc Pouzet École normale supérieure Dirigée par Francesco Zappa Nardelli Président du jury M. Dmitry Vyukov Google Examinateur M. Francesco Zappa Nardelli Inria Directeur de thèse Acknowledgments First, I would like to thank Francesco for being the best advisor I could have hoped for, leaving me great freedom in what to research while always being available when I needed advice. I hope to have learned some of his riguor and pedagogy through these years. I am also grateful to all the members of my jury for their participation, and in particular to Andrew and Doug for having reviewed very thoroughly this thesis in such a short time. I was supported throughout this work by a fellowship from Google for which I am thankful. Google, and J.F. Bastien in particular also offered me the wonderful opportunity to apply my research in the Clang compiler through a 3 month internship that I greatly appreciated. Throughout the rest of my thesis I frequently interacted with the other members of the Parkas team, who made it a very welcoming place to work. I would especially like to thank Guillaume Baudart, Adrien Guatto and Nhat Minh Lê for the many coffee breaks we shared discussing everything and anything, and for the sleepless nights I spent writing papers with Adrien and Nhat. Obviously, research is frequently a group effort, and I would like to thank now all those I collaborated with over the past few years. Obviously this includes Francesco, Adrien, and Nhat, but also Thibaut Balabonski, Pankaj Pawan, and Viktor Vafeiadis. I also benefited from more informal research collaborations and discussions, in particular with Peter Sewell’s team in Cambridge: Mark Batty, Kayvan Memarian, Jean Pichon, Peter Sewell and others made each of my trips to Cambridge remarkably productive and enjoyable. I am also deeply thankful to Jean-Christophe Filliâtre. He not only made me discover research on compilers through his course in 2009 and his recommendation for my first internship; he also made me a TA in 2015 for his compilation course. This teaching work was a pleasure, with wonderful students, and gave me something to look forward to every week, even when nothing was working on the research side. I would also like to thank my family, for believing in me even when I didn’t and supporting me all these years. I wouldn’t be here without you. Last but not least, I would like to thank all of my friends, who made these years bearable and often even fun. 2 Contents 1 Why study compiler optimisations of shared memory programs? 5 2 A primer on memory models 9 2.1 Examples of non-sequentially consistent behaviors . .9 2.2 Hardware memory models . 10 2.2.1 The x86 architecture . 10 2.2.2 The Power and ARM architectures . 13 2.2.3 Other architectures . 17 2.3 Language memory models . 17 2.3.1 Compiler optimisations in the Power model? . 18 2.3.2 Data-Race Freedom (DRF) . 20 2.3.3 The C11/C++11 model . 20 3 Issues with the C11 model and possible improvements 28 3.0.1 Consume accesses . 28 3.0.2 SC fences . 29 3.1 Sanity properties . 29 3.1.1 Weakness of the SCReads axiom . 29 3.1.2 Relaxed atomics allow causality cycles . 30 3.2 Further simplifications and strengthenings of the model . 34 3.2.1 Simplifying the coherency axioms . 34 3.2.2 The definition of release sequences . 36 3.2.3 Intra-thread synchronisation is needlessly banned . 36 4 Correct optimizations in the C11 model 38 4.1 Optimising accesses around atomics . 39 4.1.1 Eliminations of actions . 39 4.1.2 Reorderings of actions . 43 4.1.3 Introductions of actions . 44 4.2 Optimizing atomic accesses . 44 4.2.1 Basic Metatheory of the Corrected C11 Models . 44 4.2.2 Verifying Instruction Reorderings . 47 4.2.3 Counter-examples for unsound reorderings . 50 4.2.4 Verifying Instruction Eliminations . 57 5 Fuzzy-testing GCC and Clang for compiler concurrency bugs 61 5.1 Compiler concurrency bugs . 61 5.2 Compiler Testing . 64 5.3 Impact on compiler development . 69 3 6 LibKPN: A scheduler for Kahn Process Networks using C11 atomics 72 6.1 Scheduling of Kahn Process Networks . 72 6.1.1 Graphs of dependencies between task activations . 73 6.1.2 Scheduling: why naive approaches fail . 73 6.2 Algorithms for passive waiting . 75 6.2.1 Algorithm F: double-checking in suspend ............... 75 6.2.2 Algorithm S: removing fences on the critical path . 78 6.2.3 Algorithm H: keeping the best of algorithms F and S . 78 6.2.4 Going beyond KPNs: adding select in algorithm H . 79 6.2.5 Implementation details . 81 6.3 Proofs of correctness . 85 6.3.1 Outline of the difficulties and methods . 85 6.3.2 Absence of races . 86 6.3.3 Guarantee of progress . 90 6.4 Remarks on C11 programming . 93 6.5 Applications and experimental evaluation . 94 6.5.1 Micro-benchmarks . 94 6.5.2 Dataflow benchmarks . 95 6.5.3 Linear algebra benchmarks . 97 7 A fence elimination algorithm for compiler backends 102 7.1 A PRE-inspired fence elimination algorithm . 103 7.1.1 Leveraging PRE . 105 7.1.2 The algorithm on a running example . 106 7.1.3 Corner cases . 109 7.1.4 Proof of correctness . 110 7.2 Implementation and extensions to other ISAs . 111 7.3 Experimental evaluation . 113 8 Perspectives 115 8.1 Related work . 116 8.1.1 About optimisations in a concurrent context . 116 8.1.2 About fuzzy-testing of compilers . 116 8.1.3 About LibKPN . 117 8.1.4 About fence elimination . 118 8.2 Future work . 119 8.2.1 About the C11 model . 119 8.2.2 About the fence elimination algorithm . 119 8.2.3 About LibKPN . 121 A Proof of Theorem 4 131 4 Chapter 1 Why study compiler optimisations of shared memory programs? Multicore computers have become ubiquitous, in servers, personal computers and even phones. The lowest abstraction for programming them is the shared-memory, multi- threaded program, in which several sequences of instructions (threads) are executed con- currently, accessing the same data. When these programs are compiled, it is usually done piecemeal: the compiler only see fragments of a thread at any time, unaware of the ex- istence of other threads. This disconnect can lead compilers to break the expectations of the programmers regarding such multi-threaded programs. Consider for example the following program which shows two threads executing side-by-side: Initially, x = y = 0 Thread 1 Thread 2 x:= 1; if (x > 0) { if (y > 0) { x := 0; print x; y := 1; } } It has two global variables, x and y which are initialized at 0, and they are shared by the two threads. We assume both threads are started at the same time. What values can the first thread print? At first glance, 0 is the only value that can be printed: if the conditional in the left thread is taken, it means the store of 1 to y has happened, and so the store of 0 to x has happened too. Since it is also in a conditional, it must have happened after the store of 1 to x. So the value of x is 0 when it is printed. But if the first thread is compiled by an optimising compiler, the compiler will execute the constant propagation optimisation: since there is no visible access to x between its store and its load, the compiler will assume the value has not changed and will replace print x by print 1, introducing a new behaviour. Is this a bug in the compiler? No, the problem was our assumption that all threads have the same view of memory at all times, and that executing a multi-threaded program is equivalent to repeatedly picking one thread and executing it for one step. These assumptions are called the Sequential Consistency model (SC) and were first formal- ized in [Lam79]. Constant propagation is not the only optimisation that can break the SC model. Common Sub-expression Elimination (CSE), Loop Invariant Code Motion (LICM), Global Value Numbering (GVN) and Partial Redundancy Elimination (PRE) 5 are some other examples.

Compiler Optimisations and Relaxed Memory Consistency Models Robin Morisset

Memory Consistency

Memory Consistency: in a Distributed Outline for Lecture 20 Memory System, References to Memory in Remote Processors Do Not Take Place I

Towards Shared Memory Consistency Models for Gpus

On the Coexistence of Shared-Memory and Message-Passing in The

Consistency Models • Data-Centric Consistency Models • Client-Centric Consistency Models

Challenges for the Message Passing Interface in the Petaflops Era

An Overview on Cyclops-64 Architecture - a Status Report on the Programming Model and Software Infrastructure

Constraint Graph Analysis of Multithreaded Programs

Memory Consistency Models

The Openmp Memory Model

Optimization of Memory Management on Distributed Machine Viet Hai Ha

Dag-Consistent Distributed Shared Memory