
practice

DOI: 10.1145/2076450.2076465

Article development led by queue.acm.org

Data races are evil.

BY HANS-J. BOEHM AND SARITA V. ADVE

You Don't Know Jack About Shared Variables or Memory Models

A GOOGLE SEARCH for "threads are evil" generates 18,000 hits, but threads—evil or not—are ubiquitous. Almost all of the processes running on a modern Windows PC use them. Threads are typically how programmers get machines with multiple cores to work together to solve problems faster. And often they are what allow user interfaces to remain responsive while the application performs a background calculation.

Threads are multiple programs running at the same time but sharing variables. Typically, every thread can access all of the application's memory. Shared variables are either the core strength of threads or the root of their evil, depending on your perspective. They allow threads to communicate easily and quickly, but they also make it possible for threads to get in each other's way.

Although shared variables are at the core of most programs, even experts are often confused about the rules for using them. Consider the following simple example.

To implement a function incr that increments a counter x, your first attempt might be

void incr()
{
    x++;
}

Many would immediately object that this isn't guaranteed to produce the correct answer when called by multiple threads. The statement x++ is equivalent to x=x+1, which amounts to three steps: getting the value of x; adding one; and writing the result back to x. In the unlikely case that two threads coincidentally perform these in lockstep, they will both read the same value, both add one to it, and then both write the same value, incrementing x by only one instead of two. A call to incr() does not behave atomically; it is visible to the user that it is composed of different steps. (Atomicity means different things to different communities; our use is called isolation by database folks.)

We might address the problem by using a mutex, which can be locked by only one thread at a time:

void incr()
{
    mtx.lock();
    x++;
    mtx.unlock();
}

In Java, this might look like

void incr()
{
    synchronized(mtx) {
        x++;
    }
}

or perhaps just

synchronized void incr()
{
    x++;
}

Those would all work correctly, but mutex calls can be slow, so the result may run slower than desired.
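For concreteness, here is a minimal, self-contained C++11 sketch of the mutex version; the globals, the thread harness, and the iteration count are our assumptions, since the article's fragments leave them implicit:

#include <iostream>
#include <mutex>
#include <thread>

long x = 0;        // the shared counter
std::mutex mtx;    // guards every access to x

void incr()
{
    mtx.lock();
    x++;
    mtx.unlock();
}

int main()
{
    // Two threads, each incrementing the counter 1,000,000 times.
    auto work = []{ for (int i = 0; i < 1000000; i++) incr(); };
    std::thread blue(work), red(work);
    blue.join();
    red.join();
    std::cout << x << "\n";  // always prints 2000000
    return 0;
}

(Idiomatic C++11 would use std::lock_guard<std::mutex> rather than paired lock/unlock calls; we keep the explicit calls to match the fragment above.)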


What if we are concerned only about getting an approximate count? What if we just leave off the mutex, and settle for some inaccuracy? What could go wrong?

To begin with, we observed that some actual code incrementing such a counter in two threads without a mutex routinely missed about half the counts, probably a result of unfortunate timing caused by communication between the processors' caches. It could be worse. A thread could do nothing but call incr() once, loading the value zero from x at the beginning, get suspended for a long time, and then write back one just before the program terminates. This would result in a final count of one, no matter what the other threads did.

Those are the cases that are less surprising and easier to explain. The final count can also be too high. Consider a case in which the count is bigger than a machine word. To avoid dealing with binary numbers, assume we have a decimal machine in which each word holds three digits, and the counter x can hold six digits. The compiler translates x++ to something like

tmp_hi = x_hi;
tmp_lo = x_lo;
(tmp_hi, tmp_lo)++;
x_hi = tmp_hi;
x_lo = tmp_lo;

where tmp_lo and tmp_hi are machine registers, and the increment operation in the middle would really involve several machine instructions. Now assume that x is 999 (x_hi = 0, and x_lo = 999), and two threads, a blue and a red one, each increment x as shown in Figure 1 (remember that each thread has its own copy of the machine registers tmp_hi and tmp_lo). The blue thread runs almost to completion; then the red thread runs all at once to completion; finally the blue thread runs its last step. The result is that we incremented 999 twice to get 2000. This is difficult to explain to a programmer who doesn't understand precisely how the code is being compiled.

Figure 1. Two interleaved multi-word increments.

Blue thread:                                   Red thread:
tmp_hi = x_hi;
tmp_lo = x_lo;
(tmp_hi, tmp_lo)++;  // tmp_hi = 1, tmp_lo = 0
x_hi = tmp_hi;       // x_hi = 1, x_lo = 999, x = 1999
                                               x++;  // red runs all steps
                                                     // x_hi = 2, x_lo = 0, x = 2000
x_lo = tmp_lo;       // x_hi = 2, x_lo = 0

The fundamental problem is that multiple threads were accessing x at the same time, without proper locking or other synchronization to make sure that one occurred after the other. This situation is called a data race—which really is evil! We will get back to avoiding data races without locks later.

Another Racy Example

We have only begun to see the problems caused by data races. Here is an example commonly tried in real code. One thread initializes a piece of data (say, x) and sets a flag (call it done) when it finishes. Any thread that later reads x first waits for the done flag, as in Figure 2. What could possibly go wrong?

Figure 2. Waiting on a flag.

Blue Thread:         Other Threads:
x = ...;             while (!done) {}
done = true;         ... = x;

This code may work reliably with a "dumb" compiler, but any "clever" optimizing compiler is likely to break it. When the compiler sees the loop, it is likely to observe that done is not modified in the loop (that is, it is "loop-invariant"). Thus, it gets to assume that done does not change in the loop.

Of course, this assumption isn't actually correct for our example, but the compiler gets to make it anyway, for two reasons: compilers were traditionally designed to compile sequential, not multithreaded, code; and because, as we will see, even modern multithreaded languages continue to allow this, for good reason.

Thus, the loop is likely to be transformed to

tmp = done;
while (!tmp) {}

or maybe even

tmp = done;
if (!tmp) while (true) {}

In either case, if done is not already set when a red thread starts, the red thread is guaranteed to enter an infinite loop.
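Written out as a complete program, the pattern of Figure 2 might look like the following C++11 sketch. This is deliberately the broken version: x and done are plain globals, so the accesses to done (and to x) form data races, and the compiler and hardware are allowed to break the code in the ways described here. The names and the harness are our assumptions:

#include <iostream>
#include <thread>

int x = 0;
bool done = false;   // NOT a synchronization variable: this is the bug

void blue_thread()
{
    x = 42;          // initialize the data...
    done = true;     // ...then set the flag
}

void other_thread()
{
    while (!done) {}          // may spin forever if the load is hoisted
    std::cout << x << "\n";   // may print 0 instead of 42
}

int main()
{
    std::thread blue(blue_thread), other(other_thread);
    blue.join();
    other.join();
    return 0;
}

(We return to the fix, declaring done as a synchronization variable, later in the article.)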

Assume we have a "dumb" compiler that does not perform such transformations and compiles the code exactly as written. Depending on the hardware, this code can still fail.

The problem this time is that the hardware may optimize the blue thread. Nearly all processor architectures allow stores to memory to be saved in a buffer visible only to that processor core before writing them to memory visible to other processor cores.2 Some, such as the ARM chip that is probably in your smartphone, allow the stores to become visible to other processor cores in a different order. On such a processor the blue thread's write to done may become visible to the red thread, running on another core, before the blue thread's write to x. The red thread may see done set to true, and the loop may terminate, before the blue thread's write to x becomes visible. Thus, when the red thread accesses x, it may still get the uninitialized value.

Unlike the original problem of reading done once outside the loop, this problem will occur infrequently, and may well be missed during testing.

Again, the core problem here is that although the done flag is intended to prevent simultaneous accesses to x, it can itself be simultaneously accessed by both threads. And data races are evil!

Bits and Bytes

So far, we have talked only about data races in which two threads access exactly the same variable, or object field, at the same time. That has not always been the only concern. According to some older standards, when you declare two small fields b1 and b2 next to each other, for example, then updating b1 could be implemented with the following steps:

1. Load the machine word containing both b1 and b2 into a machine register.
2. Update the b1 piece in the machine register.
3. Store the register back to the location from which it was loaded.

Unfortunately, if another thread updates b2 just before the last step, then that update is overwritten by the last step and effectively lost. If both fields were initially zero, and one thread executed b1 = 1 while the other executed b2 = 1, b2 could still be zero when they both finished. Although the original program was well behaved and had no data races, the compiler added an implicit update to b2 that created a data race.

This kind of data-race insertion has been clearly disallowed in Java for a long time. The recently published C++11 and C11 standards also disallow it. We know of no Java implementations with such problems, nor do modern C and C++ compilers generally exhibit precisely this problem. Unfortunately, many do introduce data races under certain obscure, unlikely, and unpredictable conditions. This problem will disappear as C++11 and C11 become widely supported.

For C and C++, the story for bit-fields is slightly more complicated. We'll discuss that more later; the sketch below gives a preview.
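Consider this hypothetical C++ sketch of our own. Fields b1 and b2 are adjacent bit fields, which normally share a single memory location, so concurrent updates to them conflict:

struct Flags {
    unsigned b1 : 4;   // contiguous bit fields normally
    unsigned b2 : 4;   // share a single memory location
};

Flags f = {0, 0};

// Updating f.b1 is typically compiled as: load the word holding both
// fields, modify the b1 bits in a register, store the whole word back.
void blue() { f.b1 = 1; }
void red()  { f.b2 = 1; }

// If blue's store lands just after red's, it rewrites red's update to
// b2 with the stale bits it loaded, so f.b2 can end up 0 again. Under
// C++11 the two assignments conflict, and running them concurrently is
// a data race; a mutex (or separating the fields) is required.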
The Real Rules Are…

The simplest view of threads, and the one we started with, is that a multithreaded program is executed by interleaving steps from each thread. Logically the computer executes a step from one thread, then picks another thread, or possibly the same one, executes its next step, and so on. This is a sequentially consistent execution.

As already shown, real machines and compilers sometimes result in non-sequentially consistent executions: for example, when the assignment to a variable and a done flag are made visible to other threads out of order. Sequential consistency, however, is critical in understanding the behavior of real shared variables, for two reasons:

• Essentially all modern languages (Java, C++11, C11) do in fact promise sequential consistency for programs without data races. This guarantee is normally violated by a few low-level language features—notably, Java's lazySet() and C++11 and C11's explicit memory_order... specifications—which are easy to avoid (with the possible exception of OpenMP's atomic directive) and which we'll mostly ignore here. Most programmers will also want to ignore these features.

• So far we have been a bit imprecise about what constitutes a data race. Since this has now become a critical part of our programming rules, we can make it more precise as follows: two memory operations conflict if they access the same memory location and at least one of the accesses is a write. For our purposes, a memory location is a unit of memory that is separately updatable. Normally every scalar (unstructured) variable or field occupies its own memory location; each can be independently updated. Contiguous sequences of C or C++ bit fields, however, normally share a single location; updating one potentially interferes with the others.

Two conflicting data operations form a data race if they are from different threads and can be executed "at the same time." But when is this possible? Clearly that depends on how shared variables behave, which we're trying to define.


We break this circularity by considering only sequentially consistent executions: two conflicting operations in a sequentially consistent execution execute "at the same time" if one appears immediately after the other in that execution's interleaving. Now we can say that a program is data-race-free if none of its sequentially consistent executions has a data race.

Here we have defined a data race in terms of data operations, explicitly to exclude synchronization operations such as locking and unlocking a mutex. Two operations on the same mutex do not introduce a data race if they appear next to each other in the interleaving. Indeed, they could not usefully control simultaneous data accesses if concurrent accesses to the mutexes were disallowed.

Thus, the basic programming model is:

• Write code such that data races are impossible, assuming that the implementation follows sequential consistency rules.

• The implementation then guarantees sequential consistency for such code (assuming that the low-level features previously mentioned are avoided).

This is very different from promising full sequential consistency for all programs; our earlier examples are not guaranteed to work as expected, since they all have data races. Nonetheless, when writing a program, there is no need to think explicitly about compiler or hardware memory reordering; we can still reason entirely in terms of sequential consistency, as long as we follow the rules and avoid data races.

This has some consequences that often surprise programmers. Consider the program in Figure 3, where x and y are initially false. When reasoning about whether this has a data race, we observe that there is no sequentially consistent execution (that is, no interleaving of thread steps) in which either assignment is executed. Thus, there are no pairs of conflicting operations, and hence certainly no data races.

Figure 3. Is there a data race if initially x = y = false?

Blue Thread:          Red Thread:
if (x) y = true;      if (y) x = true;

Work at a Higher Level

So far, our programming model still has us thinking of interleaving thread execution at the memory-access or instruction level. Data races are defined in terms of accesses to memory locations, and sequential consistency is defined in terms of interleaving indivisible steps, which are effectively machine instructions. This is an entirely new complication. A programmer writing sequential code does not need to know about the granularity of machine instructions and whether memory is accessed a byte or a word at a time.

Fortunately, once we insist on data-race-free programs, this issue disappears. A very useful side effect of our model is that a thread's synchronization-free regions appear indivisible, or atomic. Thus, although our model is defined in terms of memory locations and individual steps, there is really no way to tell what those steps and memory locations are without introducing data races.

More generally, data-race-free programs always behave as though they were interleaved only at synchronization operations, such as mutex lock/unlock operations. If this were not the case, synchronization-free code sections from different threads would appear to interleave as in figures 4 and 5.

In the first case (Figure 4), no such interleaved code sections contain conflicting operations, and each section effectively operates on its own separate set of memory locations. The instruction interleaving is entirely equivalent to one in which these code sections execute one after the other as shown in the figure, with the only visible interleaving at synchronization operations (not shown).

Figure 4. Conflict-free interleaving is not observable.

One interleaving:     Equivalent serial execution:
r1 = x;               r1 = x;
r2 = y;               v = r1;
v = r1;               z = 2;
w = r2;               r2 = y;
z = 2;                w = r2;

In the second case (Figure 5), two code sections contain conflicting operations on the same memory location. In this case there is an alternate interleaving in which the conflicting operations appear next to each other, and a data race is effectively exhibited, as shown. Thus, this cannot happen for data-race-free programs.

Figure 5. Interleaving with conflict implies a data race.

One interleaving:     Alternate interleaving:
r1 = x;               r1 = x;
r2 = y;               v = r1;
v = r1;               r2 = y;   // adjacent conflicting operations:
w = r2;               y = 2;    //   a data race
y = 2;                w = r2;

This means that, for a data-race-free program, any section of code containing no synchronization operations behaves as though it executes atomically (that is, all at once), without being affected by other threads and without another thread being able to see any variable values occurring in the middle of that code section.
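To make that atomicity concrete, here is a small C++11 sketch of our own (the account balances and their mutex are illustrative assumptions, not the article's). Because every access to the two balances happens with the mutex held, the program is data-race-free, and no other thread can ever observe the intermediate state in which the money has left one account but not yet arrived in the other:

#include <mutex>

long checking = 100, savings = 0;
std::mutex accounts_mtx;   // guards both balances

void transfer(long amount)
{
    std::lock_guard<std::mutex> guard(accounts_mtx);
    checking -= amount;    // intermediate state: money "in flight"...
    savings += amount;     // ...but no other thread can see it
}

long total()
{
    std::lock_guard<std::mutex> guard(accounts_mtx);
    return checking + savings;   // always observes 100
}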

Thus, insisting on data-race-free programs has some pleasant consequences:

• We no longer care whether memory is updated a byte or a word at a time. Properly written code can't tell, any more than it could for sequential code.

• Library calls that do not use internal synchronization behave as if they execute in a single step. The intermediate states cannot be seen by another thread. Thus, such libraries can continue to specify only the overall effect of making a call, not which intermediate values might be taken by variables. Of course, that is what we have been doing all along, but it really makes sense only with data-race freedom.

• Reasoning about multithreaded programs is still hard, but without data races, it's not as hard as people often claim. In particular, we don't have to care about all possible ways of interleaving threads' instructions. At most, we care about the interleavings of synchronization-free regions.

Of course, all of these properties require that the program be data-race-free. Today, detecting and avoiding data-race bugs can be far from easy. Later we discuss recent progress toward making it easier.

In particular, to ensure data-race-freedom, it suffices to ensure that synchronization-free code sections that run at the same time neither write, nor read and write, the same variables. Thus, we can prune a significant number of instruction-level interleavings that need to be explored for this purpose.

Libraries can be (and generally are) designed to cleanly partition the responsibility for avoiding data races between library and client code. In the client code, we reason about data races at the level of logical objects, not memory locations. When deciding whether it is safe to call two library routines simultaneously, we need to make sure only that they don't both access the same object, or, if they do, that neither access modifies the object. It is the library's responsibility to make sure that accesses to logically distinct objects do not introduce a data race as a result of unprotected accesses to some internal hidden memory locations. Similarly, it is the library's responsibility to make sure that reading an object doesn't introduce an internal write to the object that can create a data race.

With the data-race-free approach, library-implemented container data types can behave as built-in integers or pointers; the programmer does not need to be concerned with what goes on inside. As long as two different threads don't access the same container at the same time, or they are both read accesses, the implementation remains hidden. The sketch below illustrates the client-side rule.
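Here is a brief C++11 illustration; the containers, mutex, and functions are our assumptions, and the thread-safety convention shown (concurrent reads are fine, a concurrent read and write are not) is the one the C++11 standard library adopts:

#include <cstddef>
#include <mutex>
#include <vector>

std::vector<int> a, b;
std::mutex b_mtx;   // guards b wherever a reader and a writer can overlap

// Safe with no locks: the two threads operate on distinct objects.
void blue() { a.push_back(1); }
void red()  { b.push_back(2); }

// Also safe with no locks: concurrent *reads* of the same object,
// e.g., two threads calling a.size() and a.empty() at the same time.

// A read and a write of the same object conflict, so both sides lock:
std::size_t reader() { std::lock_guard<std::mutex> g(b_mtx); return b.size(); }
void writer()        { std::lock_guard<std::mutex> g(b_mtx); b.push_back(3); }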
But What if Locks are Too Slow?

The most common way to avoid data races is to use mutexes to ensure mutual exclusion between code sections accessing the same variable. In certain contexts, other synchronization mechanisms, such as OpenMP's barriers, are more appropriate. Experience has shown, however, that such mechanisms are insufficient in a few cases. Mutexes don't work well with signal or interrupt handlers, and they often involve significant overhead, even if they have started to get faster on recent processors.

Unfortunately, many environments, such as Posix threads, have not provided any real alternatives—so people cheat. Pthreads code commonly contains data races, which are typically claimed to be "benign." Some of these are outright bugs, in that the code, as currently compiled, will fail with small probability. The rest often risk getting "miscompiled" by compilers that either outright assume there are no data races4 and are hence misled by bad assumptions, or that just produce some of the surprising effects previously discussed.

To escape this dilemma, most modern programming languages provide a way to declare synchronization variables. These behave as ordinary variables, but since accesses to them are considered to be synchronization operations, not data operations, synchronization variables can be safely accessed from multiple threads without creating a data race. In Java, a volatile int is an integer that can be accessed concurrently from multiple threads. In C++11, you would write atomic<int> instead (volatile means something subtly different in C or C++).
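As a sketch (declarations ours), the counter from the opening example can then be written with a synchronization variable and no mutex at all; as noted later in the article, incrementing an atomic variable with the ++ operator is a single indivisible operation:

#include <atomic>

std::atomic<int> x(0);   // a synchronization variable

void incr()
{
    x++;   // one indivisible operation: no lock, and no data race
}

(The corresponding Java version would need java.util.concurrent.atomic.AtomicInteger rather than a volatile int, since x++ on a Java volatile still performs three separate steps.)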


Compilers treat synchronization variables specially, so our basic programming model is preserved. If there are no data races, threads still behave as though they execute in an interleaved fashion. Accessing a synchronization variable is a synchronization operation, however; code sequences extending across such accesses no longer appear indivisible.

Synchronization variables are sometimes the right tool for very simple shared data, such as the done flag in Figure 2. The only data race there is on the done flag, so simply declaring it as a synchronization variable fixes the problem.
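Applied to the broken sketch shown earlier, the fix is a one-line change in C++11 (the surrounding declarations remain our assumptions):

#include <atomic>

int x = 0;
std::atomic<bool> done(false);   // now a synchronization variable

void blue_thread()
{
    x = 42;          // ordinary data write...
    done = true;     // ...published by a synchronization operation
}

void other_thread()
{
    while (!done) {}   // guaranteed to eventually see the flag
    int r = x;         // guaranteed to read 42, not 0
    (void)r;
}

The compiler may no longer hoist the load of done out of the loop, and the hardware may no longer make the write to done visible to other cores before the write to x.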
Remember, however, that synchronization variables are difficult to use for complex data structures, since there is no easy way to make multiple updates to a data structure in one atomic operation. Synchronization variables are not replacements for mutexes.

In cases such as that shown in Figure 2, synchronization variables often avoid most of the locking overhead. Since they can still be too expensive, both C++11 and Java provide some explicit experts-only mechanisms that allow you to relax the interleaving-based model, as mentioned before. Unlike programming with data races, it is possible to write correct code that uses these mechanisms, but our experience is that few people actually get this right. Our hope is that future hardware will reduce the need for them—and hardware is already getting better at this.

Real Languages

Most real languages fit our basic model. C++11 and C11 provide exactly this model. Data races have "undefined behavior"; they are errors in the same sense as an out-of-bounds array access. This is often referred to as catch-fire semantics for data races (though we do not know of any cases in which machines have actually caught fire as the result of a data race).

Although catch-fire semantics are sometimes still controversial, they are hardly new. The Ada 83 and 1995 Posix thread specifications are less precise, but took basically the same position.

C++11 and C11 provide synchronization variables as atomic<t> and _Atomic(t), respectively. In addition to reading and writing these variables, they support some simple indivisible compound operations; for example, incrementing a synchronization (atomic) variable with the "++" operator is an indivisible operation.

The situation for managed languages is more complex, mostly because of the security requirements they add to support untrusted code. Java fully supports our programming model, but it also, with only limited success, attempts to provide some guarantees for programs with data races. Although data races are not officially errors, it is now clear that we cannot precisely define what programs with data races actually mean.8 Data races remain evil.

Toward a Future Without Evil?

We have discussed how the absence of data races leads to a simple programming model supported by common languages. There simply does not appear to be any other reasonable alternative.1 Unfortunately, one sticky problem remains: guaranteeing data-race-freedom is still difficult. Large programs almost always contain bugs, and often those bugs are data races. Today's popular languages do not provide any usable semantics for such programs, making debugging difficult.

Looking forward, it is imperative that we develop automated techniques that detect or eliminate data races. Indeed, there is significant recent progress on several fronts: dynamic precise detection of data races;5,6 hardware support to raise an exception on a data race;7 and language-based annotations to eliminate data races from programs by design.3 These techniques guarantee that the considered execution or program has no data race (allowing the use of the simple model), but they still require more research to be commercially viable. Commercial products that detect data races have begun to appear (for example, Intel Inspector), and although they do not guarantee data-race-freedom, they are a big step in the right direction. We are optimistic that one way or another, we will (we must!) conquer evil (data races) in the near future.

Related articles on queue.acm.org

Trials and Tribulations of Debugging Concurrency
Kang Su Gatlin
http://queue.acm.org/detail.cfm?id=1035623

Scalable Parallel Programming with CUDA
John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron
http://queue.acm.org/detail.cfm?id=1365500

Building Systems to Be Shared, Securely
Poul-Henning Kamp and Robert Watson
http://queue.acm.org/detail.cfm?id=1017001

References
For a more complete set of background references, please see reference 1.

1. Adve, S.V. and Boehm, H.-J. Memory models: A case for rethinking parallel languages and hardware. Commun. ACM 53, 8 (Aug. 2010), 90–101.
2. Adve, S.V. and Gharachorloo, K. Shared memory consistency models: A tutorial. IEEE Computer 29, 12 (1996), 66–76.
3. Bocchino, R., et al. A type and effect system for deterministic parallel Java. In Proceedings of the International Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2009.
4. Boehm, H.-J. How to miscompile programs with "benign" data races. Hot Topics in Parallelism (HotPar), 2011.
5. Elmas, T., Qadeer, S. and Tasiran, S. Goldilocks: A race-aware Java runtime. Commun. ACM 53, 11 (Nov. 2010), 85–92.
6. Flanagan, C. and Freund, S. FastTrack: Efficient and precise dynamic race detection. Commun. ACM 53, 11 (Nov. 2010), 93–101.
7. Lucia, B., Ceze, L., Strauss, K., Qadeer, S. and Boehm, H.-J. Conflict exceptions: Providing simple concurrent language semantics with precise hardware exceptions. In Proceedings of the 2010 International Symposium on Computer Architecture.
8. Sevcik, J. and Aspinall, D. On validity of program transformations in the Java memory model. In European Conference on Object-Oriented Programming, 2008, 27–51.

Hans-J. Boehm is a research manager at Hewlett-Packard Labs. He is probably best known as the primary author of a commonly used garbage collection library. Experiences with threads in that project eventually led him to initiate the effort to properly define threads and shared variables in C++11.

Sarita V. Adve is a professor in the department of computer science at the University of Illinois at Urbana-Champaign. Her research interests are in computer architecture and systems, parallel computing, and power- and reliability-aware systems. She co-developed the memory models for the C++ and Java programming languages, based on her early work on data-race-free models.

© 2012 ACM 0001-0782/12/02 $10.00
