
practice

DOI: 10.1145/2076450.2076465

Article development led by queue.acm.org

Data races are evil.

BY HANS-J. BOEHM AND SARITA V. ADVE

You Don't Know Jack About Shared Variables or Memory Models

A GOOGLE SEARCH for "threads are evil" generates 18,000 hits, but threads—evil or not—are ubiquitous. Almost all of the processes running on a modern Windows PC use them. Threads are typically how programmers get machines with multiple cores to work together to solve problems faster. And often they are what allow user interfaces to remain responsive while the application performs a background calculation.

Threads are multiple programs running at the same time but sharing variables. Typically, every thread can access all of the application's memory. Shared variables are either the core strength of threads or the root of their evil, depending on your perspective. They allow threads to communicate easily and quickly, but they also make it possible for threads to get in each other's way.

Although shared variables are at the core of most programs, even experts are often confused about the rules for using them. Consider the following simple example.

To implement a function incr that increments a counter x, your first attempt might be

void incr()
{
    x++;
}

Many would immediately object that this isn't guaranteed to produce the correct answer when called by multiple threads. The statement x++ is equivalent to x=x+1, which amounts to three steps: getting the value of x; adding one; and writing the result back to x. In the unlikely case that two threads coincidentally perform these in lockstep, they will both read the same value, both add one to it, and then both write the same value, incrementing x by only one instead of two. A call to incr() does not behave atomically; it is visible to the user that it is composed of different steps. (Atomicity means different things to different communities; our use is called isolation by database folks.)

We might address the problem by using a mutex, which can be locked by only one thread at a time:

void incr()
{
    mtx.lock();
    x++;
    mtx.unlock();
}

In Java, this might look like

void incr()
{
    synchronized(mtx) {
        x++;
    }
}

or perhaps just

synchronized void incr()
{
    x++;
}

Those would all work correctly, but mutex calls can be slow, so the result may run slower than desired.
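For concreteness, here is a minimal, self-contained C++11 sketch of the mutex version; the globals, the thread harness, and the iteration count are our assumptions, since the article's fragments leave them implicit:

#include <iostream>
#include <mutex>
#include <thread>

long x = 0;        // the shared counter
std::mutex mtx;    // guards every access to x

void incr()
{
    mtx.lock();
    x++;
    mtx.unlock();
}

int main()
{
    // Two threads, each incrementing the counter 1,000,000 times.
    auto work = []{ for (int i = 0; i < 1000000; i++) incr(); };
    std::thread blue(work), red(work);
    blue.join();
    red.join();
    std::cout << x << "\n";  // always prints 2000000
    return 0;
}

(Idiomatic C++11 would use std::lock_guard<std::mutex> rather than paired lock/unlock calls; we keep the explicit calls to match the fragment above.)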


What if we are concerned only about getting an approximate count? What if we just leave off the mutex, and settle for some inaccuracy? What could go wrong?

To begin with, we observed that some actual code incrementing such a counter in two threads without a mutex routinely missed about half the counts, probably a result of unfortunate timing caused by communication between the processors' caches. It could be worse. A thread could do nothing but call incr() once, loading the value zero from x at the beginning, get suspended for a long time, and then write back one just before the program terminates. This would result in a final count of one, no matter what the other threads did.

Those are the cases that are less surprising and easier to explain. The final count can also be too high. Consider a case in which the count is bigger than a machine word. To avoid dealing with binary numbers, assume we have a decimal machine in which each word holds three digits, and the counter x can hold six digits. The compiler translates x++ to something like

tmp_hi = x_hi;
tmp_lo = x_lo;
(tmp_hi, tmp_lo)++;
x_hi = tmp_hi;
x_lo = tmp_lo;

where tmp_lo and tmp_hi are machine registers, and the increment operation in the middle would really involve several machine instructions. Now assume that x is 999 (x_hi = 0, and x_lo = 999), and two threads, a blue and a red one, each increment x as shown in Figure 1 (remember that each thread has its own copy of the machine registers tmp_hi and tmp_lo). The blue thread runs almost to completion; then the red thread runs all at once to completion; finally the blue thread runs its last step. The result is that we incremented 999 twice to get 2000. This is difficult to explain to a programmer who doesn't understand precisely how the code is being compiled.

Figure 1. Two interleaved multi-word increments.

Blue thread:                                   Red thread:
tmp_hi = x_hi;
tmp_lo = x_lo;
(tmp_hi, tmp_lo)++;  // tmp_hi = 1, tmp_lo = 0
x_hi = tmp_hi;       // x_hi = 1, x_lo = 999, x = 1999
                                               x++;  // red runs all steps
                                                     // x_hi = 2, x_lo = 0, x = 2000
x_lo = tmp_lo;       // x_hi = 2, x_lo = 0

The fundamental problem is that multiple threads were accessing x at the same time, without proper locking or other synchronization to make sure that one occurred after the other. This situation is called a data race—which really is evil! We will get back to avoiding data races without locks later.

Another Racy Example

We have only begun to see the problems caused by data races. Here is an example commonly tried in real code. One thread initializes a piece of data (say, x) and sets a flag (call it done) when it finishes. Any thread that later reads x first waits for the done flag, as in Figure 2. What could possibly go wrong?

Figure 2. Waiting on a flag.

Blue Thread:         Other Threads:
x = ...;             while (!done) {}
done = true;         ... = x;

This code may work reliably with a "dumb" compiler, but any "clever" optimizing compiler is likely to break it. When the compiler sees the loop, it is likely to observe that done is not modified in the loop (that is, it is "loop-invariant"). Thus, it gets to assume that done does not change in the loop.

Of course, this assumption isn't actually correct for our example, but the compiler gets to make it anyway, for two reasons: compilers were traditionally designed to compile sequential, not multithreaded, code; and because, as we will see, even modern multithreaded languages continue to allow this, for good reason.

Thus, the loop is likely to be transformed to

tmp = done;
while (!tmp) {}

or maybe even

tmp = done;
if (!tmp) while (true) {}

In either case, if done is not already set when a red thread starts, the red thread is guaranteed to enter an infinite loop.
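Written out as a complete program, the pattern of Figure 2 might look like the following C++11 sketch. This is deliberately the broken version: x and done are plain globals, so the accesses to done (and to x) form data races, and the compiler and hardware are allowed to break the code in the ways described here. The names and the harness are our assumptions:

#include <iostream>
#include <thread>

int x = 0;
bool done = false;   // NOT a synchronization variable: this is the bug

void blue_thread()
{
    x = 42;          // initialize the data...
    done = true;     // ...then set the flag
}

void other_thread()
{
    while (!done) {}          // may spin forever if the load is hoisted
    std::cout << x << "\n";   // may print 0 instead of 42
}

int main()
{
    std::thread blue(blue_thread), other(other_thread);
    blue.join();
    other.join();
    return 0;
}

(We return to the fix, declaring done as a synchronization variable, later in the article.)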

Assume we have a "dumb" compiler that does not perform such transformations and compiles the code exactly as written. Depending on the hardware, this code can still fail.

The problem this time is that the hardware may optimize the blue thread. Nearly all processor architectures allow stores to memory to be saved in a buffer visible only to that processor core before writing them to memory visible to other processor cores.2 Some, such as the ARM chip that is probably in your smartphone, allow the stores to become visible to other processor cores in a different order. On such a processor the blue thread's write to done may become visible to the red thread, running on another core, before the blue thread's write to x. The red thread may see done set to true, and the loop may terminate, before the blue thread's write to x becomes visible. Thus, when the red thread accesses x, it may still get the uninitialized value.

Unlike the original problem of reading done once outside the loop, this problem will occur infrequently, and may well be missed during testing.

Again, the core problem here is that although the done flag is intended to prevent simultaneous accesses to x, it can itself be simultaneously accessed by both threads. And data races are evil!

Bits and Bytes

So far, we have talked only about data races in which two threads access exactly the same variable, or object field, at the same time. That has not always been the only concern. According to some older standards, when you declare two small fields b1 and b2 next to each other, for example, then updating b1 could be implemented with the following steps:

1. Load the machine word containing both b1 and b2 into a machine register.
2. Update the b1 piece in the machine register.
3. Store the register back to the location from which it was loaded.

Unfortunately, if another thread updates b2 just before the last step, then that update is overwritten by the last step and effectively lost. If both fields were initially zero, and one thread executed b1 = 1 while the other executed b2 = 1, b2 could still be zero when they both finished. Although the original program was well behaved and had no data races, the compiler added an implicit update to b2 that created a data race.

This kind of data-race insertion has been clearly disallowed in Java for a long time. The recently published C++11 and C11 standards also disallow it. We know of no Java implementations with such problems, nor do modern C and C++ compilers generally exhibit precisely this problem. Unfortunately, many do introduce data races under certain obscure, unlikely, and unpredictable conditions. This problem will disappear as C++11 and C11 become widely supported.

For C and C++, the story for bit-fields is slightly more complicated. We'll discuss that more later; the sketch below gives a preview.
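Consider this hypothetical C++ sketch of our own. Fields b1 and b2 are adjacent bit fields, which normally share a single memory location, so concurrent updates to them conflict:

struct Flags {
    unsigned b1 : 4;   // contiguous bit fields normally
    unsigned b2 : 4;   // share a single memory location
};

Flags f = {0, 0};

// Updating f.b1 is typically compiled as: load the word holding both
// fields, modify the b1 bits in a register, store the whole word back.
void blue() { f.b1 = 1; }
void red()  { f.b2 = 1; }

// If blue's store lands just after red's, it rewrites red's update to
// b2 with the stale bits it loaded, so f.b2 can end up 0 again. Under
// C++11 the two assignments conflict, and running them concurrently is
// a data race; a mutex (or separating the fields) is required.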
The Real Rules Are…

The simplest view of threads, and the one we started with, is that a multithreaded program is executed by interleaving steps from each thread. Logically the computer executes a step from one thread, then picks another thread, or possibly the same one, executes its next step, and so on. This is a sequentially consistent execution.

As already shown, real machines and compilers sometimes result in non-sequentially consistent executions: for example, when the assignment to a variable and a done flag are made visible to other threads out of order. Sequential consistency, however, is critical in understanding the behavior of real shared variables, for two reasons:

• Essentially all modern languages (Java, C++11, C11) do in fact promise sequential consistency for programs without data races. This guarantee is normally violated by a few low-level language features—notably, Java's lazySet() and C++11 and C11's explicit memory_order... specifications—which are easy to avoid (with the possible exception of OpenMP's atomic directive) and which we'll mostly ignore here. Most programmers will also want to ignore these features.

• So far we have been a bit imprecise about what constitutes a data race. Since this has now become a critical part of our programming rules, we can make it more precise as follows: two memory operations conflict if they access the same memory location and at least one of the accesses is a write. For our purposes, a memory location is a unit of memory that is separately updatable. Normally every scalar (unstructured) variable or field occupies its own memory location; each can be independently updated. Contiguous sequences of C or C++ bit fields, however, normally share a single location; updating one potentially interferes with the others.

Two conflicting data operations form a data race if they are from different threads and can be executed "at the same time." But when is this possible? Clearly that depends on how shared variables behave, which we're trying to define.


We break this circularity by considering only sequentially consistent executions: two conflicting operations in a sequentially consistent execution execute "at the same time" if one appears immediately after the other in that execution's interleaving. Now we can say that a program is data-race-free if none of its sequentially consistent executions has a data race.

Here we have defined a data race in terms of data operations, explicitly to exclude synchronization operations such as locking and unlocking a mutex. Two operations on the same mutex do not introduce a data race if they appear next to each other in the interleaving. Indeed, they could not usefully control simultaneous data accesses if concurrent accesses to the mutexes were disallowed.

Thus, the basic programming model is:

• Write code such that data races are impossible, assuming that the implementation follows sequential consistency rules.

• The implementation then guarantees sequential consistency for such code (assuming that the low-level features previously mentioned are avoided).

This is very different from promising full sequential consistency for all programs; our earlier examples are not guaranteed to work as expected, since they all have data races. Nonetheless, when writing a program, there is no need to think explicitly about compiler or hardware memory reordering; we can still reason entirely in terms of sequential consistency, as long as we follow the rules and avoid data races.

This has some consequences that often surprise programmers. Consider the program in Figure 3, where x and y are initially false. When reasoning about whether this has a data race, we observe that there is no sequentially consistent execution (that is, no interleaving of thread steps) in which either assignment is executed. Thus, there are no pairs of conflicting operations, and hence certainly no data races.

Figure 3. Is there a data race if initially x = y = false?

Blue Thread:          Red Thread:
if (x) y = true;      if (y) x = true;

Work at a Higher Level

So far, our programming model still has us thinking of interleaving thread execution at the memory-access or instruction level. Data races are defined in terms of accesses to memory locations, and sequential consistency is defined in terms of interleaving indivisible steps, which are effectively machine instructions. This is an entirely new complication. A programmer writing sequential code does not need to know about the granularity of machine instructions and whether memory is accessed a byte or a word at a time.

Fortunately, once we insist on data-race-free programs, this issue disappears. A very useful side effect of our model is that a thread's synchronization-free regions appear indivisible, or atomic. Thus, although our model is defined in terms of memory locations and individual steps, there is really no way to tell what those steps and memory locations are without introducing data races.

More generally, data-race-free programs always behave as though they were interleaved only at synchronization operations, such as mutex lock/unlock operations. If this were not the case, synchronization-free code sections from different threads would appear to interleave as in figures 4 and 5.

In the first case (Figure 4), no such interleaved code sections contain conflicting operations, and each section effectively operates on its own separate set of memory locations. The instruction interleaving is entirely equivalent to one in which these code sections execute one after the other as shown in the figure, with the only visible interleaving at synchronization operations (not shown).

Figure 4. Conflict-free interleaving is not observable.

One interleaving:     Equivalent serial execution:
r1 = x;               r1 = x;
r2 = y;               v = r1;
v = r1;               z = 2;
w = r2;               r2 = y;
z = 2;                w = r2;

In the second case (Figure 5), two code sections contain conflicting operations on the same memory location. In this case there is an alternate interleaving in which the conflicting operations appear next to each other, and a data race is effectively exhibited, as shown. Thus, this cannot happen for data-race-free programs.

Figure 5. Interleaving with conflict implies a data race.

One interleaving:     Alternate interleaving:
r1 = x;               r1 = x;
r2 = y;               v = r1;
v = r1;               r2 = y;   // adjacent conflicting operations:
w = r2;               y = 2;    //   a data race
y = 2;                w = r2;

This means that, for a data-race-free program, any section of code containing no synchronization operations behaves as though it executes atomically (that is, all at once), without being affected by other threads and without another thread being able to see any variable values occurring in the middle of that code section.
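To make that atomicity concrete, here is a small C++11 sketch of our own (the account balances and their mutex are illustrative assumptions, not the article's). Because every access to the two balances happens with the mutex held, the program is data-race-free, and no other thread can ever observe the intermediate state in which the money has left one account but not yet arrived in the other:

#include <mutex>

long checking = 100, savings = 0;
std::mutex accounts_mtx;   // guards both balances

void transfer(long amount)
{
    std::lock_guard<std::mutex> guard(accounts_mtx);
    checking -= amount;    // intermediate state: money "in flight"...
    savings += amount;     // ...but no other thread can see it
}

long total()
{
    std::lock_guard<std::mutex> guard(accounts_mtx);
    return checking + savings;   // always observes 100
}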

Thus, insisting on data-race-free programs has some pleasant consequences:

• We no longer care whether memory is updated a byte or a word at a time. Properly written code can't tell, any more than it could for sequential code.

• Library calls that do not use internal synchronization behave as if they execute in a single step. The intermediate states cannot be seen by another thread. Thus, such libraries can continue to specify only the overall effect of making a call, not which intermediate values might be taken by variables. Of course, that is what we have been doing all along, but it really makes sense only with data-race freedom.

• Reasoning about multithreaded programs is still hard, but without data races, it's not as hard as people often claim. In particular, we don't have to care about all possible ways of interleaving threads' instructions. At most, we care about the interleavings of synchronization-free regions.

Of course, all of these properties require that the program be data-race-free. Today, detecting and avoiding data-race bugs can be far from easy. Later we discuss recent progress toward making it easier.

In particular, to ensure data-race-freedom, it suffices to ensure that synchronization-free code sections that run at the same time neither write, nor read and write, the same variables. Thus, we can prune a significant number of instruction-level interleavings that need to be explored for this purpose.

Libraries can be (and generally are) designed to cleanly partition the responsibility for avoiding data races between library and client code. In the client code, we reason about data races at the level of logical objects, not memory locations. When deciding whether it is safe to call two library routines simultaneously, we need to make sure only that they don't both access the same object, or, if they do, that neither access modifies the object. It is the library's responsibility to make sure that accesses to logically distinct objects do not introduce a data race as a result of unprotected accesses to some internal hidden memory locations. Similarly, it is the library's responsibility to make sure that reading an object doesn't introduce an internal write to the object that can create a data race.

With the data-race-free approach, library-implemented container data types can behave as built-in integers or pointers; the programmer does not need to be concerned with what goes on inside. As long as two different threads don't access the same container at the same time, or they are both read accesses, the implementation remains hidden. The sketch below illustrates the client-side rule.
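Here is a brief C++11 illustration; the containers, mutex, and functions are our assumptions, and the thread-safety convention shown (concurrent reads are fine, a concurrent read and write are not) is the one the C++11 standard library adopts:

#include <cstddef>
#include <mutex>
#include <vector>

std::vector<int> a, b;
std::mutex b_mtx;   // guards b wherever a reader and a writer can overlap

// Safe with no locks: the two threads operate on distinct objects.
void blue() { a.push_back(1); }
void red()  { b.push_back(2); }

// Also safe with no locks: concurrent *reads* of the same object,
// e.g., two threads calling a.size() and a.empty() at the same time.

// A read and a write of the same object conflict, so both sides lock:
std::size_t reader() { std::lock_guard<std::mutex> g(b_mtx); return b.size(); }
void writer()        { std::lock_guard<std::mutex> g(b_mtx); b.push_back(3); }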
But What if Locks are Too Slow?

The most common way to avoid data races is to use mutexes to ensure mutual exclusion between code sections accessing the same variable. In certain contexts, other synchronization mechanisms, such as OpenMP's barriers, are more appropriate. Experience has shown, however, that such mechanisms are insufficient in a few cases. Mutexes don't work well with signal or interrupt handlers, and they often involve significant overhead, even if they have started to get faster on recent processors.

Unfortunately, many environments, such as Posix threads, have not provided any real alternatives—so people cheat. Pthreads code commonly contains data races, which are typically claimed to be "benign." Some of these are outright bugs, in that the code, as currently compiled, will fail with small probability. The rest often risk getting "miscompiled" by compilers that either outright assume there are no data races4 and are hence misled by bad assumptions, or that just produce some of the surprising effects previously discussed.

To escape this dilemma, most modern programming languages provide a way to declare synchronization variables. These behave as ordinary variables, but since accesses to them are considered to be synchronization operations, not data operations, synchronization variables can be safely accessed from multiple threads without creating a data race. In Java, a volatile int is an integer that can be accessed concurrently from multiple threads. In C++11, you would write atomic<int> instead (volatile means something subtly different in C or C++).
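As a sketch (declarations ours), the counter from the opening example can then be written with a synchronization variable and no mutex at all; as noted later in the article, incrementing an atomic variable with the ++ operator is a single indivisible operation:

#include <atomic>

std::atomic<int> x(0);   // a synchronization variable

void incr()
{
    x++;   // one indivisible operation: no lock, and no data race
}

(The corresponding Java version would need java.util.concurrent.atomic.AtomicInteger rather than a volatile int, since x++ on a Java volatile still performs three separate steps.)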


Compilers treat synchronization variables specially, so our basic programming model is preserved. If there are no data races, threads still behave as though they execute in an interleaved fashion. Accessing a synchronization variable is a synchronization operation, however; code sequences extending across such accesses no longer appear indivisible.

Synchronization variables are sometimes the right tool for very simple shared data, such as the done flag in Figure 2. The only data race there is on the done flag, so simply declaring it as a synchronization variable fixes the problem.
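Applied to the broken sketch shown earlier, the fix is a one-line change in C++11 (the surrounding declarations remain our assumptions):

#include <atomic>

int x = 0;
std::atomic<bool> done(false);   // now a synchronization variable

void blue_thread()
{
    x = 42;          // ordinary data write...
    done = true;     // ...published by a synchronization operation
}

void other_thread()
{
    while (!done) {}   // guaranteed to eventually see the flag
    int r = x;         // guaranteed to read 42, not 0
    (void)r;
}

The compiler may no longer hoist the load of done out of the loop, and the hardware may no longer make the write to done visible to other cores before the write to x.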
Remember, however, that synchronization variables are difficult to use for complex data structures, since there is no easy way to make multiple updates to a data structure in one atomic operation. Synchronization variables are not replacements for mutexes.

In cases such as that shown in Figure 2, synchronization variables often avoid most of the locking overhead. Since they can still be too expensive, both C++11 and Java provide some explicit experts-only mechanisms that allow you to relax the interleaving-based model, as mentioned before. Unlike programming with data races, it is possible to write correct code that uses these mechanisms, but our experience is that few people actually get this right. Our hope is that future hardware will reduce the need for them—and hardware is already getting better at this.

Real Languages

Most real languages fit our basic model. C++11 and C11 provide exactly this model. Data races have "undefined behavior"; they are errors in the same sense as an out-of-bounds array access. This is often referred to as catch-fire semantics for data races (though we do not know of any cases in which machines have actually caught fire as the result of a data race).

Although catch-fire semantics are sometimes still controversial, they are hardly new. The Ada 83 and 1995 Posix thread specifications are less precise, but took basically the same position.

C++11 and C11 provide synchronization variables as atomic<t> and _Atomic(t), respectively. In addition to reading and writing these variables, they support some simple indivisible compound operations; for example, incrementing a synchronization (atomic) variable with the "++" operator is an indivisible operation.

The situation for managed languages is more complex, mostly because of the security requirements they add to support untrusted code. Java fully supports our programming model, but it also, with only limited success, attempts to provide some guarantees for programs with data races. Although data races are not officially errors, it is now clear that we cannot precisely define what programs with data races actually mean.8 Data races remain evil.

Toward a Future Without Evil?

We have discussed how the absence of data races leads to a simple programming model supported by common languages. There simply does not appear to be any other reasonable alternative.1 Unfortunately, one sticky problem remains: guaranteeing data-race-freedom is still difficult. Large programs almost always contain bugs, and often those bugs are data races. Today's popular languages do not provide any usable semantics for such programs, making debugging difficult.

Looking forward, it is imperative that we develop automated techniques that detect or eliminate data races. Indeed, there is significant recent progress on several fronts: dynamic precise detection of data races;5,6 hardware support to raise an exception on a data race;7 and language-based annotations to eliminate data races from programs by design.3 These techniques guarantee that the considered execution or program has no data race (allowing the use of the simple model), but they still require more research to be commercially viable. Commercial products that detect data races have begun to appear (for example, Intel Inspector), and although they do not guarantee data-race-freedom, they are a big step in the right direction. We are optimistic that one way or another, we will (we must!) conquer evil (data races) in the near future.

Related articles on queue.acm.org

Trials and Tribulations of Debugging Concurrency
Kang Su Gatlin
http://queue.acm.org/detail.cfm?id=1035623

Scalable Parallel Programming with CUDA
John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron
http://queue.acm.org/detail.cfm?id=1365500

Building Systems to Be Shared, Securely
Poul-Henning Kamp and Robert Watson
http://queue.acm.org/detail.cfm?id=1017001

References
For a more complete set of background references, please see reference 1.

1. Adve, S.V. and Boehm, H.-J. Memory models: A case for rethinking parallel languages and hardware. Commun. ACM 53, 8 (Aug. 2010), 90–101.
2. Adve, S.V. and Gharachorloo, K. Shared memory consistency models: A tutorial. IEEE Computer 29, 12 (1996), 66–76.
3. Bocchino, R., et al. A type and effect system for deterministic parallel Java. In Proceedings of the International Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2009.
4. Boehm, H.-J. How to miscompile programs with "benign" data races. Hot Topics in Parallelism (HotPar), 2011.
5. Elmas, T., Qadeer, S. and Tasiran, S. Goldilocks: A race-aware Java runtime. Commun. ACM 53, 11 (Nov. 2010), 85–92.
6. Flanagan, C. and Freund, S. FastTrack: Efficient and precise dynamic race detection. Commun. ACM 53, 11 (Nov. 2010), 93–101.
7. Lucia, B., Ceze, L., Strauss, K., Qadeer, S. and Boehm, H.-J. Conflict exceptions: Providing simple concurrent language semantics with precise hardware exceptions. In Proceedings of the 2010 International Symposium on Computer Architecture.
8. Sevcik, J. and Aspinall, D. On validity of program transformations in the Java memory model. In European Conference on Object-Oriented Programming, 2008, 27–51.

Hans-J. Boehm is a research manager at Hewlett-Packard Labs. He is probably best known as the primary author of a commonly used garbage collection library. Experiences with threads in that project eventually led him to initiate the effort to properly define threads and shared variables in C++11.

Sarita V. Adve is a professor in the department of computer science at the University of Illinois at Urbana-Champaign. Her research interests are in computer architecture and systems, parallel computing, and power- and reliability-aware systems. She co-developed the memory models for the C++ and Java programming languages, based on her early work on data-race-free models.

© 2012 ACM 0001-0782/12/02 $10.00
