Foundations of the C++ Concurrency Memory Model

Hans-J. Boehm, HP Laboratories ([email protected])
Sarita V. Adve, University of Illinois at Urbana-Champaign ([email protected])

Abstract

Currently multi-threaded C or C++ programs combine a single-threaded programming language with a separate threads library. This is not entirely sound [7]. We describe an effort, currently nearing completion, to address these issues by explicitly providing semantics for threads in the next revision of the C++ standard. Our approach is similar to that recently followed by Java [25], in that, at least for a well-defined and interesting subset of the language, we give sequentially consistent semantics to programs that do not contain data races. Nonetheless, a number of our decisions are often surprising even to those familiar with the Java effort:

• We (mostly) insist on sequential consistency for race-free programs, in spite of implementation issues that came to light after the Java work.

• We give no semantics to programs with data races. There are no benign C++ data races.

• We use weaker semantics for trylock than existing languages or libraries, allowing us to promise sequential consistency with an intuitive race definition, even for programs with trylock.

This paper describes the simple model we would like to be able to provide for C++ threads programmers, and explains how this, together with some practical, but often under-appreciated, implementation constraints, drives us towards the above decisions.

    Initially X=Y=0
    T1:            T2:
    r1=X           r2=Y
    if (r1==1)     if (r2==1)
      Y=1            X=1

    Is outcome r1=r2=1 allowed?

Figure 1. Without a precise definition, it is unclear if this example is data-race-free. The outcome shown can occur in an execution where threads T1 and T2 speculate that the values of X and Y respectively are 1, then each thread writes 1, validating the other's speculation. Such an execution has a data race on X and Y, but most programmers would not envisage such executions when assessing whether the program is data-race-free.

Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features—Concurrent Programming Structures; D.1.3 [Programming Techniques]: Concurrent Programming—Parallel Programming; B.3.2 [Memory Structures]: Design Styles—Shared Memory

General Terms languages, standardization, reliability

Keywords Memory consistency, memory model, sequential consistency, C++, trylock, data race

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PLDI'08, June 7–13, 2008, Tucson, Arizona, USA. Copyright © 2008 ACM 978-1-59593-860-2/08/06...$5.00

1. Introduction

As technological constraints increasingly limit the performance of individual processor cores, the computer industry is relying increasingly on larger core counts to provide future performance improvements. In many domains, it is expected that any substantial future performance gains will require explicitly parallel applications.

The most established way to write parallel applications in standard programming languages is as a collection of threads sharing an address space. In fact, a large fraction of existing desktop applications already use threads, though typically more to handle multiple event streams than parallel processors. The majority of such applications are written in either C or C++, using threads provided by the operating system. We will use Pthreads [21] as the canonical example. The situation on Microsoft platforms is similar.

Both the C and C++ languages are currently defined as single-threaded languages, without reference to threads. Correspondingly, compilers themselves are largely thread-unaware, and generate code as for a single-threaded application. In the absence of further restrictions, this allows compilers to perform transformations, such as reordering two assignments to independent variables, that do not preserve the meaning of multithreaded programs [29, 2, 25].

The OS thread libraries attempt to informally address this problem by prohibiting concurrent accesses to normal variables, or data races. To prevent such concurrent accesses, thread libraries provide a collection of synchronization primitives, such as pthread_mutex_lock(), which can be used to restrict shared variable access to one thread at a time. Implementations are then prohibited from reordering synchronization operations with respect to ordinary memory operations in a given thread. This is enforced in a compiler by treating synchronization operations as opaque and potentially modifying any shared location. It is typically assumed that ordinary memory operations between synchronization operations may be freely reordered, subject only to single-thread constraints.

Unfortunately, informally describing multithreaded semantics as part of a threads library in the above way is insufficient, for several reasons, as described in detail by Boehm [7]. Briefly, the key reasons are as follows.

• The informal specifications provided by current libraries are ambiguous at best; e.g., what is a data race, and what precisely are the semantics of a program without a data race? Figure 1


shows an example [1] where the answers to these questions are unclear from the current specification.

• Without precise semantics, it is hard to reason about when a compiler transformation will violate these semantics. For example, Boehm [7] shows that reasonable (common) interpretations of the current specification do not preclude a class of compiler transformations that effectively introduce new writes to potentially shared variables. Examples include Figure 2 and speculative register promotion. As a result, conventional compilers can, and do, on rare occasions, introduce data races, producing completely unexpected (and unintended) results without violating the letter of the specification.

• There are many situations in which the overhead of preventing concurrent access to shared data is simply too high. Vendors, or sometimes applications themselves, generally provide facilities for atomic (indivisible, safe for unprotected concurrent use) data access without the use of locks. For example, gcc provides a family of __sync intrinsics, Microsoft provides Interlocked operations, and the Linux kernel defines its own atomic operations (which have been regularly (ab)used in user-level code). These solutions have not been satisfactory, both because they are not portable, and because their interactions with other shared memory variables are generally not carefully or clearly specified.

    struct s { char a; char b; } x;

    Thread 1:    Thread 2:
    x.a = 1;     x.b = 1;

    Thread 1 is not equivalent to:

    struct s tmp = x;
    tmp.a = 1;
    x = tmp;

Figure 2. Compiler transformations must be aware of threads. Many compilers perform transformations similar to the one above when a is declared as a bit-field. The transformation may be visible to client code, since the update to b by thread 2 may be overwritten by the store to the complete structure x.

In order to address the above issues, an effort was begun several years ago to properly define the semantics of multi-threaded C++ programs, specifically the memory model, in the next revision of the C++ language standard. Although this standard is not expected to be completely approved until 2009 or 2010, the core changes to support threads and the memory model described here were recently voted into the C++ working paper, and are now also under discussion in the C committee. This effort has been proceeding in parallel with, and cooperatively with, a similar effort for Microsoft's native compilation platforms [31].

The key result of the above effort is a memory model for multithreaded C++ programs that supports a simple programming model. This paper describes that model and several new fundamental issues that had to be resolved along the way.

1.1 State-of-the-art for Memory Models

The memory model, or memory consistency model, specifies the values that a shared variable read in a multithreaded program is allowed to return. The memory model clearly affects programmability. It also affects performance and portability by constraining the transformations that any part of the system may perform. In practice, any part of the system (hardware or software) that transforms the program must specify a memory model, and the models at the different system levels must be compatible. For example, the C++ model constrains the transformations allowed by a C++ compiler. The memory model for the hardware that will run the produced binary must not allow results that would be illegal for the C++ model applied to the original program.

There has been extensive work in memory models over the last two decades. Sequential consistency, defined by Lamport [24], is considered to be the most intuitive model. It ensures that memory operations appear to occur in a single total order (i.e., atomically); further, within this total order, the memory operations of a given thread appear in the program order for that thread.

Unfortunately, sequential consistency restricts many common compiler and hardware optimizations [2]. There has been significant academic progress in compiler algorithms to determine when a transformation is safe for sequential consistency [29, 30, 23] and in hardware that speculatively performs traditional optimizations without violating sequential consistency [19, 28, 33, 15]. However, current commercial compilers and most current commercial hardware do not preserve sequential consistency.

To overcome the performance limitations of sequential consistency, hardware vendors and researchers have proposed several relaxed memory models [2]. These models allow various hardware optimizations, but most are specified at a low level and are generally difficult to reason with for high-level language programmers. They also limit certain compiler optimizations [25].

An alternative approach, called the data-race-free [3, 1] or properly labeled [18, 17] models, has been proposed to achieve both the simple programmability of sequential consistency and the implementation flexibility of the relaxed models. This approach is based on the observation that good programming practice dictates that programs be correctly synchronized, or data-race-free. These models formalize correct programs as those that do not contain data races in any sequentially consistent execution. They guarantee sequential consistency to such data-race-free programs, and do not provide any guarantees whatsoever to others. Thus, the approach combines a simple programming model (sequential consistency) and high performance (by guaranteeing sequential consistency only to well-written programs). Different data-race-free models successively refine the notion of a race to provide increasing flexibility, but require increasing amounts of information from the programmer. The data-race-free-0 model uses the simplest definition of a data race; i.e., two concurrent conflicting accesses (formalized later).

Among high-level programming languages, Ada 83 [32] was perhaps the first to adopt the approach of requiring synchronized accesses and leaving semantics undefined otherwise. As mentioned above, Pthreads follows a similar approach, but neither Ada nor Pthreads formalized it sufficiently. Recently, the Java memory model underwent a major revision [25]. For programmers, for all practical purposes, the new Java model is data-race-free-0. However, the safety guarantees of Java preclude leaving the semantics of programs with data races undefined. Much of the Java effort therefore focused on defining these semantics in a way that preserves maximum implementation flexibility without violating the safety and security guarantees of Java. Nevertheless, the Java model does preclude some compiler optimizations, and the full model is quite complex [25]. Since C++ is not type safe, neither the restrictions nor the complexity of the full Java model appear justified for C++.

1.2 The C++ Model and Contributions of this Paper

The model chosen for C++ is an adaptation of data-race-free-0; i.e., it guarantees sequential consistency for programs without data races and has undefined semantics in the presence of a data race. Given the recent work on the Java model described above, the data-race-free-0 model may seem an obvious choice. However, when this process began, there were several factors that prevented using data-race-free-0, some of which were exposed after conclusion of


the Java work and are also relevant to that work. This paper provides an understanding of those factors, shows how they were resolved to make the data-race-free-0 model acceptable to C++ developers and most hardware vendors, and provides the first published description of the C++ model in the full context of prior work.¹ Specifically, there are three main issues we discuss in this paper:

(1) Sequentially consistent atomics: The data-race-free models require that all operations that are not ordinary data operations (e.g., synchronization operations, atomics in C++, volatiles in Java²) appear sequentially consistent. The recent advent of multicore systems exposed important hardware optimizations that appear to conflict with this requirement, initially resulting in most hardware vendors and many software developers opposing it. We describe the conflict and show that unfettered exploitation of such hardware optimizations leads to models that are too hard to formalize and/or use. These arguments were largely responsible in convincing hardware vendors (e.g., AMD and Intel) to develop specifications that are now consistent with the use of sequentially consistent atomics.

(2) Trylock and impact on the definition of a data race: Synchronization primitives such as trylock can be used in non-intuitive ways that previously would have required a more complex definition of data races and/or overly strict fences. We show a simple way to get around this problem.

(3) Semantics for data races: We do not provide any semantics for programs with data races. This was not uncontroversial; we explain the arguments for this position in Section 5.

We believe that sequentially consistent atomics provide a programming model that is both easier to use and to describe than the alternatives, and is the model we should strive for. However, it became clear during the standardization process that exclusive support for sequentially consistent atomics was not viable, largely for two reasons:

• It is sufficiently expensive to implement on some existing processors that an "experts-only" alternative for performance-critical code was felt to be necessary. It is unclear to us whether this need will persist indefinitely.

• Existing code is often written in a way that assumes weak memory ordering semantics and relies on the programmer to explicitly provide the required platform-dependent hardware instructions to enforce the necessary ordering. The Linux kernel is a good example of this. It is often easier to migrate such code if we provide primitives closer to the semantics assumed.

2. The C++ Model Without Low-Level Atomics

We try to specify memory models in this paper sufficiently precisely that, if we had a completely formal sequential semantics, this could be translated into a completely formal description of the multi-threaded semantics. This section assumes only a single flavor of atomics; i.e., the default, sequentially consistent atomics in the standard.

Memory operations are viewed as operating on abstract memory locations. Each scalar value occupies a separate memory location, except that contiguous sequences of bit-fields inside the same innermost struct or class declaration are viewed as a single location.³ Thus the fields in the struct s of Figure 2 are separately updateable locations, unless we change both field declarations to bit-fields.

The remainder of the C++ standard was modified to define a sequenced-before relation on memory operations performed by a single thread [14]. This is analogous to the program order relation in Java and other work on memory models. Unlike prior work, this is only a partial order per thread, reflecting undefined argument evaluation order.

Define a memory action to consist of:

1. The type of action; i.e., lock, unlock, atomic load, atomic store, atomic read-modify-write, load, or store. All but the last two are customarily referred to as synchronization operations, since they are used to communicate between threads. The last two are referred to as data operations.

2. A label identifying the corresponding program point.

3. The values read and written.

Bit-field updates can be modeled as a load of the sequence of contiguous bit-fields, followed by a store to the entire sequence.

Define a thread execution to be a set of memory actions, together with a partial order corresponding to the sequenced-before ordering.

Define a sequentially consistent execution of a program to be a set of thread executions, together with a total order <T on all the memory actions, which satisfies the constraints:

1. Each thread execution is internally consistent, in that it corresponds to a correct sequential execution of that thread, given the values read from memory, and respects the ordering of operations implied by the sequenced-before relation.

2. <T is consistent with the sequenced-before orders; i.e., if a is sequenced before b, then a <T b.


    T1:          T2:
    x = 42;      while (trylock(l) == success)
    lock(l);       unlock(l);
                 assert(x == 42);

Based on a conventional interpretation of trylock(), in a sequentially consistent execution, the assertion in T2 cannot be executed (i.e., appear in the


that this requirement was excessive and unnecessarily restricted performance.

    Initially X=Y=0
    T1:    T2:    T3:      T4:
    X=1    Y=1    r1=X     r3=Y
                  fence    fence
                  r2=Y     r4=X

    r1=1, r2=0, r3=1, r4=0 violates write atomicity

Figure 4. Write-atomicity may be too expensive for Independent-Reads-Independent-Writes (IRIW).

In principle, the performance impact of these requirements should be restricted only to synchronization operations. Unfortunately, most current processor instruction sets do not directly distinguish synchronization operations (Itanium is a notable exception). The most common way for compilers to convey memory ordering requirements to hardware is through fence or memory barrier instructions.

Fences were designed primarily to convey the sequenced-order (program order) requirement. For example, the strongest fences ensure that all memory operations sequenced before the fence will appear to execute before all memory operations sequenced after the fence. Other variants impose ordering between different subsets of memory operations; e.g., a Store|Load fence orders previous stores before later loads. Some processors ensure certain orderings by default, removing the need for fences in those cases; e.g., AMD64 and Intel 64 implicitly provide Load|Load and Load|Store fence semantics after each load, and Store|Store fence semantics after each store.

When this work began, many specifications of fences were ambiguous about their impact on the hardware write-atomicity requirement described in Section 2.1 (notable exceptions are Alpha and Sun). Some processor vendors argued that full write-atomicity was too expensive, and software developers argued that weak versions of write-atomicity would suffice. Section 4.1 discusses the cost of enforcing write-atomicity in hardware and why it may be unnecessary for some programs. Section 4.2 then systematically shows how meaningful relaxations of write-atomicity result in unintended consequences for other programs. As a result, despite many attempts, it was difficult to formalize semantics for synchronization operations that were meaningfully weaker than sequential consistency for hardware and provided a simple enough interface for most programmers.

In this section, all variables in all examples are atomics. (The syntax does not reflect the current C++ working paper.)

4.1 Cost of Enforcing Write-Atomicity

Consider the example in Figure 4, referred to as the Independent-Reads-Independent-Writes (IRIW) example. The fences ensure that the reads will execute in program order, but this does not guarantee sequential consistency if the writes execute non-atomically. For example, if the writes to X and Y propagate in different orders to threads T3 and T4, the outcome in the figure, violating sequential consistency, can occur (T3 sees the new value of X and the old value of Y, and vice versa for T4).

Systems with ownership-based invalidation protocols and single-core/single-threaded processors can avoid the non-sequentially consistent outcome in a straightforward way. Consider a typical such system employing a directory-based cache coherence protocol. If T1 does not already have ownership of X, then it must request it from the directory. The directory then sends invalidations for all cached copies of X, and either it or T1 collects acknowledgements for these invalidations. To see the new value of X, T3 must first go to the directory, which will forward the request to T1. To ensure writes appear atomic, the system simply needs to ensure that all copies of X are invalidated (i.e., all acknowledgements received) before T3 gets the updated copy of X, and analogously, that all outstanding copies of Y are invalidated before T4 gets an updated copy of Y. Now it is no longer possible for both T3 and T4 to read the old values of Y and X respectively.

The key to ensuring sequential consistency for the above example is that a read is not allowed to return a new value for the accessed location until all older copies of that location are invalidated. This requirement can be somewhat relaxed further for processors that require explicit fences for ordering reads – the read can return a new value as long as a subsequent fence waits for all old copies of that location to be invalidated. Following [2], we refer to this as the “read-others’-write-early” restriction.

Note that, as described above, the read-others’-write-early restriction does not require any global serialization point for all operations to all locations (e.g., a bus). It simply requires a serialization point for reads and writes to the same location, which is usually provided by the cache coherence protocol (the need for which is widely accepted). Further, this per-location serialization is technically only required for atomic operations. Unfortunately, the use of fences for memory ordering obfuscates which memory accesses are the atomic ones; therefore, without a mechanism to distinguish individual writes as atomic, our system must preserve write atomicity for all writes.

The advent of multicore and simultaneous multithreading (SMT), where threads may share a data cache or store queues, has previously unexplored implications for the read-others’-write-early restriction. These architectures offer tempting performance optimization opportunities that can violate sequential consistency for the IRIW example as follows. Suppose T1 and T3 share an L1 writethrough data cache. Suppose T3’s read of X occurs shortly after T1’s write to X. It is tempting to return the new value of X to T3, even if T1’s ownership request for X has not yet made its way through the lower levels of the cache hierarchy. If T2 and T4 share a cache, an analogous situation can occur for their write and read of Y – T4 reads the new value of Y even before T2’s ownership request reaches the rest of the memory system. It is now possible that both T3 and T4 read the old values of Y and X respectively (from their caches), violating sequential consistency.

Thus, while previously the read-others’-write-early restriction appeared to have (acceptable) implications only for the main memory system and the cache coherence protocol, new SMT and multicore architectures move this restriction all the way to the first-level cache, and even within the processor core (if there are shared store queues). At the same time, most programmers agree that the IRIW code does not represent a useful programming idiom, and imposing restrictions to provide sequential consistency for it appears excessive and unnecessary.

For this reason, some existing machines do not provide efficient methods for ensuring sequential consistency for IRIW, and there was initial reluctance to adopting sequentially consistent semantics for atomics.

4.2 Unintended Consequences of Relaxing Write-Atomicity

We next show examples where relaxing the read-others’-write-early requirement in the ways described for IRIW can give unacceptable results.

4.2.1 Write-to-Read Causality (WRC) Must be Upheld

Figure 5 illustrates a simple causal effect. Thread T1 writes a new value of X; T2 reads it, executes a fence, and writes a new value of Y. T3 reads the new Y, executes a fence, and then reads X. Sequential consistency requires that T3 return the new value for


X. We call this example Write-to-Read Causality (WRC), because a write of a new value by one thread, followed by a read of the same value by another thread, is used to establish a causal connection between the threads. Most programmers agree that this causality should be respected, and that the violation of sequential consistency shown in Figure 5 should not be allowed.

    Initially X=Y=0
    T1:    T2:      T3:
    X=1    r1=X     r2=Y
           fence    fence
           Y=1      r3=X

    r1=1, r2=1, r3=0 violates write atomicity

Figure 5. Write-to-Read Causality (WRC) must be respected.

Now consider applying the IRIW optimization to the WRC example. Suppose T1 and T2 share an L1 writethrough cache and T3 is on a separate system node. Suppose T2 is allowed to read T1’s update to X early. Then it may be possible for T3 to receive the invalidation for Y before that for X, resulting in T3 reading the new value of Y and the old value of X.

One solution that preserves sequential consistency for WRC, while retaining much of the impact of the IRIW optimization, is to ensure that writes separated by a fence from a given system node are seen in the same order by all nodes. This solution continues to allow reads within the node to return a new value early. We next show that this solution unfortunately does not suffice for other programs.

4.2.2 Read-to-Write Causality (RWC)

Figure 6 illustrates an example similar to WRC, except that the causal relationship between T2 and T3 is established by T2’s read of an old value followed by T3’s write of a new value to Y. The IRIW optimization can violate sequential consistency for this example similarly to WRC. Unfortunately, the WRC solution (ordering writes separated by fences from the same node) does not work in this case, since each node (T1/T2 or T3) contains at most one write. Unlike the WRC example, some programmers may find this violation of read-to-write causality acceptable; however, it is no longer as clear-cut as the IRIW code.

    Initially X=Y=0
    T1:    T2:      T3:
    X=1    r1=X     Y=1
           fence    fence
           r2=Y     r3=X

    r1=1, r2=0, r3=0 violates write atomicity

Figure 6. Should Read-to-Write Causality (RWC) be respected?

4.2.3 Subtle Interplay Between Coherence and Causality

Our final example illustrates how IRIW-style optimizations can interfere in a non-intuitive way with a subtle interplay between cache coherence and write-to-read causality, referred to as CC. Cache coherence is a widely accepted property that guarantees that writes to the same location are seen in the same order by all threads. There is wide consensus that this property must be respected (for the synchronization operations) to write meaningful code. There is also consensus that write-to-read causality must be respected. Figure 7 shows that IRIW-style optimizations make it difficult to compose cache coherence and simple write-to-read causality interactions.

    Initially X=Y=0
    T1:    T2:      T3:      T4:
    X=1    r1=X     Y=1      r3=X
           fence    fence    fence
           r2=Y     X=2      r4=X

    r1=1, r2=0, r3=2, r4=1 violates write atomicity

Figure 7. CC: Subtle interplay between cache coherence and write-to-read causality.

In Figure 7, assume again that T1 and T2 are on the same node with a shared writethrough L1 cache. Assume T3 and T4 are on separate nodes. As with RWC, T2 reads T1’s new value (1) of X early, executes a fence, and reads an old value for Y. Now assume T3 writes the new value of Y, executes a fence, and then also updates X to 2. After all of T3’s operations complete in the memory system, T4 reads X (returns 2) and executes a fence. At this point, assume T1’s write permission request for X goes through the memory hierarchy and all its invalidations are done. Then, when T4 reads X for the second time, it gets 1.

Thus, the IRIW optimization can result in the violation of sequential consistency shown in Figure 7. As with RWC, the WRC solution does not apply, because no node has more than one write. No other simple solutions that would retain the advantage of the IRIW optimization are apparent.

A memory model that would allow this violation of sequential consistency seems difficult to formulate and reason with, because such a model would not be able to easily compose cache coherence and write-to-read causality. Specifically, in the above example, T4 establishes that X=2 was serialized in the memory hierarchy before X=1. Cache coherence therefore allows reasoning that T2’s read of 1 for X must have occurred after T3’s write of 2 for X. Using a write-to-read causality argument then leads to the inference that T3’s write of Y must have occurred before T2’s read of Y, which should therefore return 1 and not 0.

Thus, the IRIW optimization and the consequent result in Figure 7 preclude inferences from a simple and intuitive composition of cache coherence and write-to-read causality.

4.3 Implications for Current Processors

The above examples show that a departure from sequential consistency for synchronization operations can lead to subtle, non-intuitive behaviors that appear difficult to formalize in an intuitive fashion. We therefore retained sequential consistency semantics for default atomics and synchronization operations.

Influenced by our work, the new AMD64 [4] and Intel 64 [22] memory ordering specifications now provide a clear way to guarantee sequentially consistent semantics, although these specifications require atomic writes to be mapped to atomic xchg instructions (which are read-modify-write instructions). Hardware now need only ensure that xchg writes execute atomically. Additionally, with these specifications, xchg also implicitly ensures the semantics of a Store|Load fence, removing the need for an explicit fence after an atomic store (such a fence would otherwise be required for the sequenced-before requirement).

While it may seem cumbersome and inefficient to convert atomic stores to read-modify-writes, the following observations make this a reasonable compromise: (1) it is better to pay a penalty on stores than on loads, since the former are less frequent, and (2) the Store|Load fence replaced by the read-modify-write is as expensive on many processors today.

There are three other approaches that could be used to achieve sequential consistency for atomics without converting all atomic stores to read-modify-writes. First, the processor ISA could provide a simple mechanism to distinguish atomic stores


(e.g., Itanium’s st.rel) instead of having to use read-modify-writes. Hardware could then ensure that these stores execute atomically. Second, write-atomicity could simply be provided for all writes, as with Sun’s total store order (TSO) and the Alpha memory models. The mechanisms to implement this are similar to those for the read-modify-write and st.rel type solutions, except that they need to be invoked for all writes. Third, many processors today use speculative approaches to reorder memory operations beyond that allowed by the memory model; e.g., AMD’s and Intel’s processors speculatively reorder loads even though the memory model disallows it [19]. A similar approach could be used to allow a read to return a value early from a shared cache. So far, the literature on such optimizations has focused on the benefits of speculatively relaxing the sequenced-order requirement; we are not aware of prior work on understanding the impact of speculatively relaxing write-atomicity.

To the best of our knowledge, most existing machines today provide write-atomicity by default. A notable exception is some PowerPC machines, which require a particularly expensive form of fence instruction after atomic loads to achieve sequentially consistent results for the RWC and CC examples.

5. Semantics of Data Races

For languages like Java, it is critical to define the semantics of all programs, including those with data races. Java must support the execution of untrusted “sandboxed” code. Clearly such code can introduce data races, and the language must guarantee that at least basic security properties are not violated, even in the presence of such races. Hence the Java memory model [25] is careful to give reasonable semantics to programs with data races, even at the cost of significant complexity in the specification.

For C++, there is no such issue. Initially, there was still some concern that we should limit the allowable behavior for programs with races. However, in the end, we decided to leave the semantics of such programs completely undefined. In the current working paper for the C++ standard, in spite of discussions such as [26], there are no benign data races.

The basic arguments for undefined data race semantics in C++ are:

1. Although generally under-appreciated, it is effectively the status quo. Pthreads states [21] “Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it.” As we mention in the introduction, Ada earlier took the same approach. The intent behind win32 threads appears to have been similar.

2. Since the C++ working paper provides low-level atomics with very weak, and hence cheaply implementable, ordering properties, there is little to be gained by allowing races, other than allowing code to be obfuscated. We effectively require only that such races be annotated by the programmer. Since the result is usually exceedingly subtle, we believe this should be required by any reasonable coding standard in any case.

3. Giving Java-like semantics to data races may greatly increase the cost of some C++ constructs. It would presumably require that we not expose uninitialized virtual function tables, even in the event of a race, since those could otherwise result in a wild branch. This in turn often requires fences on object construction. In Java, this is arguably less major, since object construction is always associated with memory allocation, which typically already carries some cost. This does not apply to C++.

4. Current compiler optimizations often assume that objects do not change unless there is an intervening assignment through a potential alias. Violating such a built-in assumption can cause very complicated effects that will be very hard to explain to a programmer, or to delimit in the standard. We believe this assumption is sufficiently ingrained in current optimizers that it would be very difficult to effectively remove it.

As an example of the last phenomenon, consider a relatively simple example, which does not include any synchronization code. Assume x is a shared global and everything else is local:

    unsigned i = x;
    if (i < 2) {
      foo: ...
      switch (i) {
        case 0: ...; break;
        case 1: ...; break;
        default: ...;
      }
    }

Assume that the code at label foo is fairly complex and forces i to be spilled, and that the switch implementation uses a branch table. (In reality, the second assumption might require a larger switch statement.) The compiler now performs the following reasonable optimizations:

• It notices that i and x (in its view of the world) contain the same values. Hence when i is spilled, there is no need to store it; the value can just be reloaded from x.

• It notices, e.g. by value range analysis as in [20], that the switch expression is either 0 or 1, and hence eliminates the bounds check on the branch table access and the branch to the default case.

Now consider the case in which, unbeknownst to the compiler, there is actually a race on x, and its value changes to 5 during the execution of the code labeled by foo. The results are:

1. When i is reloaded for the evaluation of the switch expression, it gets the value 5 instead of its original value.
2. The branch table is accessed with an out-of-bounds index of 5, resulting in a garbage branch target.
3. We take a wild branch, to arbitrary code.

The result would only be slightly less surprising if the switch tested x instead of i. It seems difficult to explain either behavior in a language standard in any way other than to leave it completely undefined.5

    5 It is of course not impossible for compilers to avoid this kind of behavior. Java compilers must avoid it, and C++ compilers rarely exhibit it in practice. But our experience has been that compiler writers are disinclined to promise it will not happen, particularly since current Posix and C or C++ standards clearly allow the fundamental assumption that ordinary variables do not change asynchronously.

Finally, even after the large effort to define semantics for data races in Java, recent work by David Aspinall and Jaroslav Sevcik has uncovered some bugs in that specification [5]. Thus we decided to leave the semantics of races undefined, in spite of some potential negative effects on debuggability.

5.1 Compiler-introduced Data Races

Since we expect the next C++ standard to completely prohibit data races at the source level, it will not be possible to write a


portable C++-to-C++ source translator that introduces data races. In particular, source-to-source optimizers may never introduce a data race.

This requirement disallows some important optimizations; specifically, it prevents such an optimizer from introducing speculative loads as well as speculative stores. Optimizations such as Partial Redundancy Elimination commonly introduce speculative loads. For example, the requirement prevents a potentially shared but loop-invariant variable x from being loaded into a register once outside the loop, if there is any possibility that it might not actually be referenced in the loop. The requirement also prevents a load from being scheduled before it is clear that the value is actually needed, though a prefetch of such a value would generally still be correct.

It is important to note that this restriction applies only if the target language of the optimizer prohibits data races. We are not aware of any hardware architectures that do so. The above optimizations introduce potentially racing loads of values whose results are not used. Machine architectures allow those without altering the meaning of the remaining code. Hence compilers that target a conventional hardware architecture may continue to insert speculative loads.

Such optimizers may generally still not insert speculative stores, since those could overwrite a legitimate store in another thread, thus changing the semantics of the program even when the original program contained no race.

If a compiler were to translate to an architecture that gives undefined semantics to races at the machine level, then the compiler would be prohibited from introducing speculative loads as well. This arises, for example, if the target is a virtual machine that itself detects races, as in [16]. And that is as it should be. Any introduced speculative loads could race with a store in another thread, thus resulting in the diagnosis of a race where the original program had none.6 By the same reasoning, a C++-to-C++ translator should not introduce races, since it is not aware of the final target.

6. C++ Model Supporting Low-Level Atomics

As mentioned earlier, enforcing sequential consistency is expensive on some platforms, and there are some frequently used idioms for which sequential consistency is not required. For example, it is common to use counters that are frequently incremented by multiple threads, but are only read after all threads complete. Our semantics, since they guarantee sequential consistency, require that all memory updates performed prior to a counter update become visible after any later counter update in another thread. This property typically requires the addition of fences, which in some cases are substantially more expensive than the rest of the operation.

Similarly, in the absence of context information, sequentially consistent atomic stores often require two fences, one before and one after the store. The first ensures that memory operations sequenced before the store are visible to any thread that sees the result of the store. The second, often more expensive, fence ensures that the store is not reordered with respect to a later atomic load. The latter property is required when implementing Dekker’s algorithm, for example, but is often unnecessary.

lowing close to optimal implementations of these idioms by very careful programmers.

As presented so far, our memory model inherently does not support non-sequentially-consistent synchronization operations. Assume that incr increments a variable without enforcing any sort of visibility ordering. (Alpha, ARM, and PowerPC, for example, would allow a significantly cheaper implementation of incr than its sequentially consistent counterpart.) Now consider the following, where x and y are atomic, z is not, and all are initially zero:

    Thread 1:       Thread 2:
    incr(x);        incr(y);
    r1 = y;         r2 = x;
    if (r1 == 0)    if (r2 == 0)
      z = 1;          z = 2;

In a sequentially consistent execution, r1 and r2 cannot both be zero, and hence there is no data race in a sequentially consistent execution. However, any implementation that takes advantage of the lack of ordering between the atomic operations will allow both values to be zero, and hence potentially encounter a data race. Hence this should be given undefined semantics by all the arguments in the preceding section. But this requires that we define a notion of data race in terms of the actual language semantics, not in terms of a sequentially consistent execution.

Here we give an alternate memory model specification that allows extension to low-level atomics. This description is a more mathematical formulation of the C++ working paper (see [27, 8] or primarily section 1.10 of [14]). For the time being, we continue to omit low-level atomics. Thus the model presented here is intended to be equivalent to the earlier one (as demonstrated in the following sections). The last section of this paper then describes briefly how low-level atomics are actually incorporated into the C++ working paper. This version parallels the simpler parts of the Java memory model [25] much more closely than our original version, though there are some differences.

Define a program execution to be:

1. A set of thread executions.

2. A mapping W from atomic loads and atomic read-modify-write operations to atomic stores and atomic read-modify-write operations to the same location, and from lock acquisitions to lock releases of the same lock. W is intended to map each read-like operation to the corresponding write whose value it observes.

3. An irreflexive total order <S of atomic operations, intended to reflect the global order in which they are executed.7

We define an atomic load, atomic read-modify-write, or lock acquisition to be an acquire operation on the read locations. Conversely, an atomic store, atomic read-modify-write, or lock release is a release operation on the affected locations. We define a memory action A to synchronize with a memory action B if B is an acquire operation on a location, A is a release operation on the same location, and W(B) = A.

We define happens-before (


• If a happens before b and b happens before c, then a happens before c.

operation (in


In either case, the resulting execution (of P plus the operation introducing the race) is a prefix of a sequentially consistent execution; if N was a write, it could not have been seen by any of the reads in P, since those reads were ordered before N in

operation, and hence does not contribute to the synchronizes-with ordering. For read-modify-write operations, the programmer can specify whether the operation acts as an acquire operation, a release


From the hardware implementor’s perspective, aside from now-standard features such as compare-and-swap and similar operations, we need a modest-cost facility that allows us to implement sequentially consistent atomics, and particularly write atomicity. We do not need write atomicity for all store operations; but we do need it for atomic operations. Ideally, sequentially consistent atomics should be implementable with very small overhead for load operations.

Acknowledgments

This effort has benefitted greatly from contributions by many others, including especially Herb Sutter, Doug Lea, Paul McKenney, Bratin Saha, Jeremy Manson, and Bill Pugh. We clearly would not have been successful at generating standards-appropriate text without the help of Clark Nelson and Lawrence Crowl, our coauthors for [13] and [27]. Lawrence Crowl is also responsible for most of the detailed design of the atomics interface. The anonymous reviewers provided a number of useful suggestions. Mark Hill gave comments on a previous version of the paper.

This work has made it this far through the standards process only with the help of processor vendors, including AMD, ARM, IBM, and Intel, who were helpful both in pointing out potential issues and in cooperating with the standardization process, even when significant effort was required.

References

[1] S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, University of Wisconsin-Madison, 1993.

[2] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, 1996.

[3] S. V. Adve and M. D. Hill. Weak ordering—A new definition. In Proc. 17th Intl. Symp. on Computer Architecture, pages 2–14, 1990.

[4] AMD Corp. AMD64 Architecture Programmer’s Manual - Volume 2: System Programming, July 2007.

[5] D. Aspinall and J. Sevcik. Java memory model examples: Good, bad, and ugly. In VAMP07 Proceedings, http://www.cs.ru.nl/~chaack/VAMP07/, 2007.

[6] H. Boehm and N. Maclaren. Should volatile acquire atomicity and thread visibility semantics? C++ standards committee paper WG21/N2016=J16/06-0086, http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2016.html, April 2006.

[7] H.-J. Boehm. Threads cannot be implemented as a library. In Proc. Conf. on Programming Language Design and Implementation, 2005.

[8] H.-J. Boehm. A les

[13] H.-J. Boehm and L. Crowl. C++ atomic types and operations. C++ standards committee paper WG21/N2427=J16/07-0297, http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2427.htm, October 2007.

[14] C++ Standards Committee, Pete Becker, ed. Working Draft, Standard for Programming Language C++. C++ standards committee paper WG21/N2461=J16/07-0331, http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2461.pdf, October 2007.

[15] L. Ceze et al. BulkSC: Bulk enforcement of sequential consistency. In Proc. Intl. Symp. on Computer Architecture, 2007.

[16] T. Elmas, S. Qadeer, and S. Tasiran. A race and transaction-aware Java runtime. In Proc. Conf. on Programming Language Design and Implementation, pages 245–255, 2007.

[17] K. Gharachorloo. Memory Consistency Models for Shared Memory Multiprocessors. PhD thesis, Stanford University, 1995.

[18] K. Gharachorloo et al. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proc. 17th Intl. Symp. on Computer Architecture, pages 15–26, 1990.

[19] K. Gharachorloo, A. Gupta, and J. Hennessy. Two techniques to enhance the performance of memory consistency models. In Proc. Intl. Conf. on Parallel Processing, pages I355–I364, 1991.

[20] W. H. Harrison. Compiler analysis of the value ranges for variables. IEEE Trans. Software Engineering, 3(3), May 1977.

[21] IEEE and The Open Group. IEEE Standard 1003.1-2001. IEEE, 2001.

[22] Intel Corp. Intel 64 Architecture Memory Ordering White Paper, August 2007. http://www.intel.com/products/processor/manuals/318147.pdf.

[23] A. Kamil, J. Su, and K. Yelick. Making sequential consistency practical in Titanium. In Proceedings of the 2005 ACM/IEEE SC’05 Conference, November 2005.

[24] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690–691, 1979.

[25] J. Manson, W. Pugh, and S. Adve. The Java memory model. In Proc. Symp. on Principles of Programming Languages, 2005.

[26] S. Narayanasamy et al. Automatically classifying benign and harmful data races using replay analysis. In Proc. Conf. on Programming Language Design and Implementation, pages 22–31, 2007.

[27] C. Nelson and H.-J. Boehm. Concurrency memory model (final revision). C++ standards committee paper WG21/N2429=J16/07-0299, http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2429.htm, October 2007.

[28] P. Ranganathan, V. S. Pai, and S. V. Adve. Using speculative retirement and larger instruction windows to narrow the performance