Dynamic Synthesis for Relaxed Memory Models

Feng Liu (Princeton University), Nayden Nedev (Princeton University), Nedyalko Prisadnikov (Sofia University), Martin Vechev (ETH Zürich), Eran Yahav ∗ (Technion)

∗ Deloro Fellow

Abstract

Modern architectures implement relaxed memory models which may reorder memory operations or execute them non-atomically. Special instructions called memory fences are provided, allowing control of this behavior.

To implement a concurrent algorithm for a modern architecture, the programmer is forced to manually reason about subtle relaxed behaviors and figure out ways to control these behaviors by adding fences to the program. Not only is this process time consuming and error-prone, but it has to be repeated every time the implementation is ported to a different architecture.

In this paper, we present the first scalable framework for handling real-world concurrent algorithms running on relaxed architectures. Given a concurrent C program, a safety specification, and a description of the memory model, our framework tests the program on the memory model to expose violations of the specification, and synthesizes a set of necessary ordering constraints that prevent these violations. The ordering constraints are then realized as additional fences in the program.

We implemented our approach in a tool called DFENCE based on LLVM and used it to infer fences in a number of concurrent algorithms. Using DFENCE, we perform the first in-depth study of the interaction between fences in real-world concurrent C programs, correctness criteria such as sequential consistency and linearizability, and memory models such as TSO and PSO, yielding many interesting observations. We believe that this is the first tool that can handle programs at the scale and complexity of a lock-free memory allocator.

Categories and Subject Descriptors D.1.3 [Concurrent Programming]; D.2.4 [Program Verification]

General Terms Algorithms, Verification

Keywords Concurrency, Synthesis, Relaxed Memory Models, Weak Memory Models

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'12, June 11–16, 2012, Beijing, China.
Copyright © 2012 ACM 978-1-4503-1205-9/12/06. $10.00

1. Introduction

Modern architectures use relaxed memory models in which memory operations may be reordered and executed non-atomically [2]. These models enable improved hardware performance but pose a burden on the programmer, forcing her to understand the effect that the memory model has on her implementation. To allow programmer control over those executions, processors provide special memory fence instructions.

As multicore processors increasingly dominate the market, highly-concurrent algorithms become critical components of many systems [28]. Highly-concurrent algorithms are notoriously hard to get right [22] and often rely on subtle ordering of events, which may be violated under relaxed memory models [14, Ch.7].

Placing Memory Fences Manually reasoning where to place fences in a concurrent program running on a relaxed architecture is a challenging task. Using too many fences (over-fencing) hinders performance, while missing necessary fences (under-fencing) permits illegal executions. Manually balancing between over- and under-fencing is very difficult, time-consuming and error-prone [4, 14]. Furthermore, the process of placing fences is repeated whenever the algorithm changes, and whenever it is ported to a different architecture.

Our goal is to automate the task of fence placement and free the programmer to focus on the algorithmic details of her work. Automatic fence placement is important for expert designers of concurrent algorithms, as it lets the designer quickly prototype with different algorithmic choices. Automatic fence placement is also important for any programmer trying to implement a concurrent algorithm as published in the literature. Because fence placement depends on the specific architecture, concurrent algorithms are usually published without detailed fence placements for different architectures. This presents a nontrivial challenge to any programmer trying to implement an algorithm from the literature on relaxed architectures.

Dynamic Synthesis Existing approaches to automatic fence inference are either overly conservative [26], resulting in over-fencing, or have severely limited scalability [17, 18]. The main idea in this paper is to break the scalability barrier of static approaches by performing the synthesis based on dynamic executions. To identify illegal executions, we introduce a flush-delaying demonic scheduler that is effective in exposing illegal executions under relaxed memory models.

Given a program P, a specification S, and a description of the memory model M, guided execution (using the flush-based scheduler) of P under M can be used to identify a set of illegal executions (violating S). The tool will then automatically synthesize a program P′ that avoids the observed illegal executions (under M), but still permits as many legal executions as possible.

Evaluation Enabled by our tool, we perform the first in-depth study of the subtle interaction between: i) required fences in a number of real-world concurrent C algorithms (including a lock-free memory allocator), ii) correctness criteria such as linearizability and (operation-level) sequential consistency [14], and iii) relaxed memory models such as TSO and PSO.

Main Contributions The main contributions of this paper are:

• A novel framework for dynamic synthesis of synchronization under relaxed memory models. The framework breaks the scalability barrier of static synthesis approaches by performing the synthesis based on dynamic executions.
• A flush-delaying demonic scheduler which delays the effect of

     1 int take() {
     2   while (true) {
     3     t = T - 1;
     4     T = t;
     5     fence(st-ld); //F1
     6     h = H;
     7     if (t < h) {
     8       T = h;
     9       return EMPTY;
    10     }
    11     task = items[t];
    12     if (t > h)
    13       return task;
    14     T = h + 1;
    15     if(!cas(&H,h,h+1))
    16       continue;
    17     return task;
    18   }
    19 }

     1 void put(int task) {
     2   t = T;
     3   items[t] = task;
     4   fence(st-st); //F2
     5   T = t + 1;
     6   fence(st-st); //F3
     7 }

     1 int steal() {
     2   while (true) {
     3     h = H;
     4     t = T;
     5     if (h >= t)
     6       return EMPTY;
     7     task = items[h];
     8     if(!cas32(&H,h,h+1))
     9       continue;
    10     return task;
    11   }
    12 }

Figure 1: Simplified version of the Chase-Lev work-stealing queue. Here, store-load fence F1 prevents the non-SC scenario of Fig. 2a under TSO; F1 combined with store-store fence F2 prevents the non-SC scenario of Fig. 2b under PSO. F1, F2 and store-store fence F3 prevent the linearizability violation of Fig. 2c under PSO.

Motivating Example Fig. 1 shows pseudo-code of the Chase-Lev work-stealing queue [7] (note that our implementation handles the complete C code). A work-stealing queue is a special kind of double-ended queue that provides three operations: put, take, and steal. A single owner thread can put and take an item from the tail of the queue, and multiple thief threads can steal an item from the head of the queue.

In the implementation of Fig. 1, H and T are global shared variables storing the head and tail indices of the valid section of the array items in the queue. The operations put and take operate on one end of the array, and steal operates on the other end.

The put operation takes a task as a parameter, and adds it to the tail of the queue by storing it in items[t] and incrementing the tail index T. The take operation uses optimistic synchronization, repeatedly trying to remove an item from the tail of the queue, potentially restarting if the tail and the head refer to the same item. take works by first decrementing the tail index, and comparing the new value with the head index. There are three possible cases:

• The new tail index is smaller than the head index, meaning that the queue is empty. In this case, the original value of the tail index is restored and take returns EMPTY (line 9).
• The new tail index is larger than the head index. take then uses it to read and return the item from the array.
• The new tail index equals the head index, in which case the only item in the queue may potentially be stolen by a concurrent steal operation. A compare-and-swap (CAS) instruction is used to check whether the head has changed since we read it into h. If the head has not changed, then there is no concurrent steal, and the single item in the queue can be returned, while the head is updated to the new value h+1. If the value of head has changed, take restarts by going to the next loop iteration.

Similarly, the implementation of steal first reads the head and tail indices of the array. If the head index is larger than or equal to the tail index, either the queue is empty or the only item is taken by the owner, and the thief returns EMPTY. Otherwise, it can read the item pointed to by the head. In case no other thieves are stealing the same element, a CAS instruction is used to update the head index atomically. If the CAS succeeds, the item can be returned; otherwise, the steal function needs to retry.

Correctness Criteria and Terminology The implementation of Fig.
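To make the put, take, and steal operations of Fig. 1 concrete, the following is a minimal C11 sketch of the simplified queue. It is an illustration under stated assumptions, not the code analyzed by DFENCE: the EMPTY sentinel, the fixed CAPACITY, and the mapping of fence(st-ld) to a sequentially consistent fence and of fence(st-st) to a release fence are all choices made here. The C11 fences are at least as strong as the store-load and store-store fences of the figure, and the sketch has only been reasoned about for single-threaded use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define EMPTY (-1)     /* hypothetical sentinel; tasks are assumed non-negative */
#define CAPACITY 64    /* hypothetical fixed size; no wrap-around or resizing */

static atomic_int H;                /* head index, advanced by CAS */
static atomic_int T;                /* tail index, written only by the owner */
static atomic_int items[CAPACITY];  /* task slots */

/* Owner only: append a task at the tail. */
void put(int task) {
    int t = atomic_load_explicit(&T, memory_order_relaxed);
    assert(t < CAPACITY);           /* the sketch does not grow the array */
    atomic_store_explicit(&items[t], task, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* stands in for F2 */
    atomic_store_explicit(&T, t + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* stands in for F3 */
}

/* Owner only: remove a task from the tail, or return EMPTY. */
int take(void) {
    while (true) {
        int t = atomic_load_explicit(&T, memory_order_relaxed) - 1;
        atomic_store_explicit(&T, t, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* stands in for F1 */
        int h = atomic_load_explicit(&H, memory_order_relaxed);
        if (t < h) {                /* case 1: queue empty, restore the tail */
            atomic_store_explicit(&T, h, memory_order_relaxed);
            return EMPTY;
        }
        int task = atomic_load_explicit(&items[t], memory_order_relaxed);
        if (t > h)                  /* case 2: more than one item, no race */
            return task;
        /* case 3: last item, race with thieves through a CAS on H */
        atomic_store_explicit(&T, h + 1, memory_order_relaxed);
        int expected = h;
        if (!atomic_compare_exchange_strong(&H, &expected, h + 1))
            continue;               /* a thief won; retry and observe empty */
        return task;
    }
}

/* Any thief: remove a task from the head, or return EMPTY. */
int steal(void) {
    while (true) {
        int h = atomic_load_explicit(&H, memory_order_relaxed);
        int t = atomic_load_explicit(&T, memory_order_relaxed);
        if (h >= t)
            return EMPTY;
        int task = atomic_load_explicit(&items[h], memory_order_relaxed);
        int expected = h;
        if (!atomic_compare_exchange_strong(&H, &expected, h + 1))
            continue;               /* head moved; retry */
        return task;
    }
}
```

In a single-threaded run, putting 10, 20, 30 and then calling steal, take, take exercises all three cases of take: steal removes 10 from the head, the first take removes 30 (tail above head), and the second take removes 20 through the CAS path (tail equal to head). A portable concurrent C11 version would likely need stronger ordering on the loads in steal, which the TSO and PSO models discussed in the text do not reorder.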
