review articles

doi:10.1145/1787234.1787255

Solving the memory model problem will require an ambitious and cross-disciplinary research direction.

by Sarita V. Adve and Hans-J. Boehm

Memory Models: A Case for Rethinking Parallel Languages and Hardware

Most parallel programs today are written using threads and shared variables. Although there is no consensus on parallel programming models, there are a number of reasons why threads remain popular. Threads were already widely supported by mainstream operating systems well before the dominance of multicore, largely because they are also useful for other purposes. Direct hardware support for shared-memory potentially provides a performance advantage; for example, by implicitly sharing read-mostly data without the space overhead of complete replication. The ability to pass memory references among threads makes it easier to share complex data structures. Finally, shared-memory makes it far easier to selectively parallelize application hot spots without complete redesign of data structures.

90 communications of the acm | august 2010 | vol. 53 | no. 8

The memory model, or memory consistency model, is at the heart of the concurrency semantics of a shared-memory program or system. It defines the set of values that a read in a program is allowed to return, thereby defining the basic semantics of shared variables. It answers questions such as: Is there enough synchronization to ensure a thread's write will occur before another's read? Can two threads write to adjacent fields in a memory location at the same time? Must the final value of a location always be one of those written to it?

The memory model defines an interface between a program and any hardware or software that may transform that program (for example, the compiler, the virtual machine, or any dynamic optimizer). It is not possible to meaningfully reason about either a program (written in a high-level, byte code, assembly, or machine language) or any part of the language implementation (including hardware) without an unambiguous memory model.

A complex memory model makes parallel programs difficult to write, and parallel programming difficult to teach. An overly constraining one may limit hardware and compiler optimization, severely reducing performance. Since it is an interface property, the memory model decision has a long-lasting impact, affecting portability and maintainability of programs. Thus, a hardware architecture committed to a strong memory model cannot later forsake it for a weaker model without breaking binary compatibility, and a new compiler release with a weaker memory model may require rewriting source code. Finally, memory-model-related decisions for a single component must consider implications for the rest of the system. A processor vendor cannot guarantee a strong hardware model if the memory system designer provides a weaker model; a strong hardware model is not very useful to programmers using languages and compilers that provide only a weak guarantee.

key insights

- Memory models, which describe the semantics of shared variables, are crucial to both correct multithreaded applications and the entire underlying implementation stack. It is difficult to teach multithreaded programming without clarity on memory models.

- After much prior confusion, major programming languages are converging on a model that guarantees simple interleaving-based semantics for "data-race-free" programs, and most hardware vendors have committed to support this model.

- This process has exposed fundamental shortcomings in our languages and a hardware-software mismatch. Semantics for programs that contain data races seem fundamentally difficult, but are necessary for concurrency safety and debuggability. We call upon software and hardware communities to develop languages and systems that enforce data-race-freedom, and co-designed hardware that exploits and supports such semantics.

Nonetheless, the central role of the memory model has often been downplayed. This is partly because formally specifying a model that balances all desirable properties of programmability, performance, and portability has proven surprisingly complex. At the same time, informal, machine-specific descriptions proved mostly adequate in an era where parallel programming was the domain of experts and achieving the highest possible performance trumped programmability or portability arguments.

In the late 1980s and 1990s, the area received attention primarily in the hardware community, which explored many approaches, with little consensus.2 Commercial hardware memory model descriptions varied greatly in precision, including cases of complete omission of the topic and some reflecting vendors' reluctance to make commitments with unclear future implications. Although the memory model affects the meaning of every load instruction in every multithreaded application, it is still sometimes relegated to the "systems programming" section of the architecture manual.

Part of the challenge for hardware architects was the lack of clear memory models at the programming language level. It was unclear what programmers expected hardware to do. Although hardware researchers proposed approaches to bridge this gap,3 widespread adoption required consensus from the software community. Before 2000, there were a few programming environments that addressed the issue with relative clarity,40 but the most widely used environments had unclear and questionable specifications.9,32 Even when specifications were relatively clear, they were often violated to obtain sufficient performance,9 tended to be misunderstood even by experts, and were difficult to teach.

Since 2000, we have been involved in efforts to cleanly specify programming-language-level memory models, first for Java and then C++, with efforts now under way to adopt similar models for C and other languages. In the process, we had to address issues created by hardware that had evolved without the benefit of a clear programming model. This often made it difficult to reconcile the need for a simple and usable programming model with that for adequate performance on existing hardware.

Today, these languages and most hardware vendors have published (or plan to publish) compatible memory model specifications. Although this convergence is a dramatic improvement over the past, it has exposed fundamental shortcomings in our parallel languages and their interplay with hardware. After decades of research, it is still unacceptably difficult to describe what value a load can return without compromising modern safety guarantees or implementation methodologies. To us, this experience has made it clear that solving the memory model problem will require a significantly new and cross-disciplinary research direction for parallel computing languages, hardware, and environments as a whole.

This article discusses the path that led to the current convergence in memory models, the fundamental shortcomings it exposed, and the implications for future research. The central role of the memory model in parallel computing makes this article relevant to many computer science subdisciplines, including algorithms, applications, languages, compilers, formal methods, software engineering, virtual machines, runtime systems, and hardware. For practitioners and educators, we provide a succinct summary of the state of the art of this often-ignored and poorly understood topic. For researchers, we outline an ambitious, cross-disciplinary agenda toward resolving a fundamental problem in parallel computing today—what value can a shared variable have and how to implement it?

Sequential Consistency

A natural view of the execution of a multithreaded program operating on shared variables is as follows. Each step in the execution consists of choosing one of the threads to execute, and then performing the next step in that thread's execution (as dictated by the thread's program text, or program order). This process is repeated until the program as a whole terminates. Effectively, the execution can be viewed as taking all the steps executed by each thread, and interleaving them in some way. Whenever an object (that is, variable, field, or array element) is accessed, the last value stored to the object by this interleaved sequence is retrieved.

For example, consider Figure 1, which gives the core of Dekker's mutual exclusion algorithm. The program can be executed by interleaving the steps from the two threads in many ways. Formally, each of these interleavings is a total order over all the steps performed by all the threads, consistent with the program order of each thread. Each access to a shared variable "sees" the last prior value stored to that variable in the interleaving.

Figure 1. Core of Dekker's algorithm. Can r1 = r2 = 0?

Initially X=Y=0

Red Thread      Blue Thread
X = 1;          Y = 1;
r1 = Y;         r2 = X;

Figure 2 gives three possible executions that together illustrate all possible final values of the non-shared variables r1 and r2. Although many other interleavings are also possible, it is not possible that both r1 and r2 are 0 at the end of an execution; any execution must start with the first statement of one of the two threads, and the variable assigned there will later be read as one.

Figure 2. Some executions for Figure 1.

Execution 1     Execution 2     Execution 3
X = 1;          Y = 1;          X = 1;
r1 = Y;         r2 = X;         Y = 1;
Y = 1;          X = 1;          r1 = Y;
r2 = X;         r1 = Y;         r2 = X;
// r1 == 0      // r1 == 1      // r1 == 1
// r2 == 1      // r2 == 0      // r2 == 1

Following Lamport,26 an execution that can be understood as such an interleaving is referred to as sequentially consistent. Sequential consistency gives us the simplest possible meaning for shared variables, but suffers from several related flaws.

First, sequential consistency can be expensive to implement. For Figure 1, a compiler might, for example, reorder the two independent assignments in the red thread, since scheduling loads early tends to hide the load latency. In addition, modern processors almost always use a store buffer to avoid waiting for stores to complete, also effectively reordering instructions in each thread. Both the compiler and hardware optimization make an outcome of r1 == 0 and r2 == 0 possible, and hence may result in a non-sequentially consistent execution. Overall, reordering any pair of accesses, reading values from write buffers, register promotion, common subexpression elimination, redundant read elimination, and many other hardware and compiler optimizations commonly used in uniprocessors can potentially violate sequential consistency.2

There is some work on compiler analysis to determine when such transformations are unsafe (for example, Shasha and Snir37). Compilers, however, often have little information about sharing between threads, making it expensive to forego the optimizations, since we would have to forego them everywhere. There is also much work on speculatively performing these optimizations in hardware, with rollback on detection of an actual sequential consistency violation (for example, Ceze et al.14 and Gharachorloo et al.21). However, these ideas are tied to specific implementation techniques (for example, aggressive speculation support), and vendors have generally been unwilling to commit to those for the long term (especially, given non-sequentially consistent compilers). Thus, most hardware and compilers today do not provide sequential consistency.

Second, while sequential consistency may seem to be the simplest model, it is not sufficiently simple and a much less useful programming model than commonly imagined. For example, it only makes sense to reason about interleaving steps if we know what those steps are. In this case, they are typically individual memory accesses, a very low-level notion. Consider two threads concurrently assigning values of 100,000 and 60,000 to the shared variable X on a machine that accesses memory 16 bits at a time. The final value of X in a "sequentially consistent" execution may be 125,536 if the assignment of 60,000 occurred between the bottom and top half of the assignment of 100,000. At a somewhat higher level, this implies the meaning of even simple library operations depends on the granularity at which the library carries out those operations.

More generally, programmers do not reason about correctness of parallel code in terms of interleavings of individual memory accesses, and sequential consistency does not prevent common sources of concurrency bugs arising from simultaneous access to the same shared data (for example, data races). Even with sequential consistency, such simultaneous accesses can remain dangerous, and should be avoided, or at least explicitly highlighted. Relying on sequential consistency without such highlighting both obscures the code, and greatly complicates the implementation's job.

Data-Race-Free

We can avoid both of the problems mentioned here by observing that:

- The problematic transformations (for example, reordering accesses to unrelated variables in Figure 1) never change the meaning of single-threaded programs, but do affect multithreaded programs (for example, by allowing both r1 and r2 to be 0 in Figure 1).

- These transformations are detectable only by code that allows two threads to access the same data simultaneously in conflicting ways; for example, one thread writes the data and another reads it.

Programming languages generally already provide synchronization mechanisms, such as locks, or possibly transactional memory, for limiting simultaneous access to variables by different threads. If we require that these be used correctly, and guarantee sequential consistency only if no undesirable concurrent accesses are present, we avoid the above issues.

We can make this more precise as follows. We assume the language allows distinguishing between synchronization and ordinary (non-synchronization or data) operations (see below). We say that two memory operations conflict if they access the same memory location (for example, variable or array element), and at least one is a write.

We say that a program (on a particular input) allows a data race if it has a sequentially consistent execution (that is, a program-ordered interleaving of the operations of the individual threads) in which two conflicting ordinary operations execute "simultaneously." For our purposes, two operations execute "simultaneously" if they occur next to each other in the interleaving and correspond to different threads. Since these operations occur adjacently in the interleaving, we know that they could equally well have occurred in the opposite order; there are no intervening operations to enforce the order.

To ensure that two conflicting ordinary operations do not happen simultaneously, they must be ordered by intervening synchronization operations. For example, one thread must release a lock after accessing a shared variable, and the other thread must acquire the lock before its access. Thus, it is also possible to define data races as conflicting accesses not ordered by synchronization, as is done in Java. These definitions are essentially equivalent.1,12

A program that does not allow a data race is said to be data-race-free. The data-race-free model guarantees sequential consistency only for data-race-free programs.1,3 For programs that allow data races, the model does not provide any guarantees.

The restriction on data races is not onerous. In addition to locks for avoiding data races, modern programming languages generally also provide a mechanism, such as Java's volatile variables, for declaring that certain variables or fields are to be used for synchronization between threads. Conflicting accesses to such variables may occur simultaneously—since they are explicitly identified as synchronization (vs. ordinary), they do not create a data race.

To write Figure 1 correctly under data-race-free, we need simply identify the shared variables X and Y as synchronization variables. This would require the implementation to do whatever is necessary to ensure sequential consistency, in spite of those simultaneous accesses. It would also obligate the implementation to ensure that these synchronization accesses are performed indivisibly; if a 32-bit integer is used for synchronization purposes, it should not be visibly accessed as two 16-bit halves.

This "sequential consistency for data-race-free programs" approach alleviates the problems discussed with pure sequential consistency. Most important hardware and compiler optimizations continue to be allowed for ordinary accesses—care must be taken primarily at the explicitly identified (infrequent) synchronization accesses since these are the only ones through which such optimizations and granularity considerations affect program outcome. Further, synchronization-free sections of the code appear to execute atomically and the requirement to explicitly identify concurrent accesses makes it easier for humans and compilers to understand the code. (For more detail, see our technical report.11)

Data-race-free does not give the implementation a blanket license to perform single-threaded program optimizations. In particular, optimizations that amount to copying a shared variable to itself, such as introducing the assignment x = x, where x might not otherwise have been written, generally remain illegal. These are commonly performed in certain contexts,9 but should not be.

Although data-race-free was formally proposed in 1990,3 it did not see widespread adoption as a formal model in industry until recently. We next describe the evolution of industry models to a convergent path centered around data-race-free, the emergent shortcomings of data-race-free, and their implications for the future.

Industry Practice and Evolution

Hardware memory models. Most hardware supports relaxed models that are weaker than sequential consistency. These models take an implementation- or performance-centric view, where the desirable hardware optimizations drive the model specification.1,2,20 Typical driving optimizations relax the program order requirement of sequential consistency. For example, Sparc's TSO guarantees that a thread's memory accesses will become visible to other threads in program order, except for the case of a write followed by a read. Such models additionally provide fence instructions to enable programmers to explicitly impose orderings that are otherwise not guaranteed; for example, TSO programmers may insert a fence between a thread's write and read to ensure the execution preserves that order.

Such a program-orderings + fences style of specification is simple, but many subtleties make it inadequate.1,2 First, this style implies that a write is an atomic or indivisible operation that becomes visible to all threads at once. As Figure 3 illustrates, however, hardware may make writes visible to different threads at different times through write buffers and shared caches. Incorporating such optimizations increases the complexity of the memory model specification. Thus, the full TSO specification, which incorporates one of the simplest atomicity optimizations, is much more involved than the simple description here. PowerPC implements more aggressive forms of the optimization, with a specification that is complex and difficult to interpret even for experts. The documentation from both AMD and Intel was ambiguous on this issue; recent updates now clarify the intent, but remain informal.

Figure 3. Hardware may not execute atomic or indivisible writes.

Assume a fence imposes program order. Assume core 3's and core 4's caches have X and Y. The two writes generate invalidations for these caches. These could reach the caches in a different order, giving the result shown and a deduction that X's update occurs both before and after Y's.

Initially X = Y = 0

Core 1      Core 2      Core 3      Core 4
X = 1;      Y = 1;      r1 = X;     r3 = Y;
                        fence;      fence;
                        r2 = Y;     r4 = X;

Can r1 = 1, r2 = 0, r3 = 1, r4 = 0, violating write atomicity?

Second, in well-written software, a thread usually relies on synchronization interactions to reason about the ordering or visibility of memory accesses on other threads. Thus, it is usually overkill to require that two program-ordered accesses always become visible to all threads in the same order or a write appears atomic to all threads regardless of the synchronization among the threads. Instead, it is sufficient to preserve ordering and atomicity only among mutually synchronizing threads. Some hardware implementations attempt to exploit this insight, albeit often through ad hoc techniques, thereby further complicating the memory model.

Third, modern processors perform various forms of speculation (for example, on branches and addresses) which can result in subtle and complex interactions with data and control dependences, as illustrated in Figure 4.1,29
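The data-race-free fix for Figure 1 described earlier can be made concrete in C++. The following sketch is ours, not the article's: it declares X and Y as synchronization variables via std::atomic, whose default sequentially consistent ordering obliges the compiler and hardware to insert whatever fences are needed, so the r1 == 0 && r2 == 0 outcome is forbidden.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// One trial of Figure 1 (Dekker's core), with X and Y identified as
// synchronization variables. The default std::memory_order_seq_cst
// prevents the store->load reordering that a store buffer or compiler
// could otherwise perform on plain ints.
std::pair<int, int> dekker_trial() {
    std::atomic<int> X{0}, Y{0};
    int r1 = -1, r2 = -1;
    std::thread red([&] {
        X.store(1);      // X = 1;
        r1 = Y.load();   // r1 = Y;
    });
    std::thread blue([&] {
        Y.store(1);      // Y = 1;
        r2 = X.load();   // r2 = X;
    });
    red.join();
    blue.join();
    return {r1, r2};     // r1 == 0 && r2 == 0 cannot occur
}
```

Had X and Y been plain ints, the program would contain a data race and the data-race-free model would promise nothing; with the atomics, every run must correspond to one of the interleavings of Figure 2.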


Incorporating these considerations in a The Java memory model. Java pro- precise way adds another source of com- vided first-class support for threads plexity to program-order + fence style with a chapter specifically devoted to specifications. As we discuss later, pre- its memory model. Pugh showed that cise formalization of data and control this model was difficult to interpret dependences is a fundamental obstacle Hardware and badly broken—common compiler to providing clean high-level memory memory model optimizations were prohibited and in model specifications today. many cases the model gave ambiguous In summary, hardware memory specifications or unexpected behavior.32 In 2000, Sun model specifications have often been have often been appointed an expert group to revise the incomplete, excessively complex, and/ model through the Java community or ambiguous enough to be misinter- incomplete, process.33 The effort was coordinated preted even by experts. Further, since through an open mailing list that at- hardware models have largely been excessively tracted a variety of participants, repre- driven by hardware optimizations, they complex, and/or senting hardware and software and re- have often not been well-matched to searchers and practitioners. software requirements, resulting in in- ambiguous enough It was quickly decided that the Java correct code or unnecessary loss in per- to be misinterpreted memory model must provide sequen- formance, as discussed later. tial consistency for data-race-free pro- High-level language memory mod- even by experts. grams, where volatile accesses (and els. Ada was perhaps the first widely locks from synchronized methods used high-level programming lan- and monitors) were deemed synchro- guage to provide first-class support for nization. shared-memory parallel programming. However, data-race-free is inad- Although Ada’s approach to thread syn- equate for Java. 
Since Java is meant to chronization was initially quite differ- be a safe and secure language, it cannot ent from both that of the earlier Mesa allow arbitrary behavior for data races. design and most later language designs, Specifically, Java must support untrust- it was remarkably advanced in its treat- ed code running as part of a trusted ap- ment of memory semantics.40 It used a plication and hence must limit damage style similar to data-race-free, requiring done by a data race in the untrusted legal programs to be well-synchronized; code. Unfortunately, the notions of safe- however, it did not fully formalize the ty, security, and “limited damage” in a notion of well-synchronized and left un- multithreaded context were not clearly certain the behavior of such programs. defined. The challenge with defining Subsequently, until the introduc- the Java model was to formalize these tion of Java, mainstream programming notions in a way that minimally affected languages did not provide first-class system flexibility. support for threads, and shared-mem- Figure 4(b) illustrates these issues. ory programming was mostly enabled The program has a data race and is bug- through libraries and APIs such as Posix gy. However, Java cannot allow its reads threads and OpenMP. Previous work de- to return values out-of-thin-air (for ex- scribes why the approach of an add-on ample, 42) since this could clearly com- threads library is not entirely satisfac- promise safety and security. It would, tory.9 Without a real definition of the for example, make it impossible to programming language in the context guarantee that similar untrusted code of threads, it is unclear what compiler cannot return a password that it should transformations are legal, and hence not have access to. Such a scenario ap- what the programmer is allowed to as- pears to violate any reasonable cau- sume. 
Nevertheless, the Posix threads sality expectation and no current pro- specification indicates a model similar cessor produces it. Nevertheless, the to data-race-free, although there are sev- memory model must formally prohibit eral inconsistent aspects, with widely such behavior so that future specula- varying interpretations even among ex- tive processors also avoid it. perts participating in standards com- Prohibiting such causality violations mittee discussions. The OpenMP mod- in a way that does not also prohibit el is also unclear and largely based on other desired optimizations turned out a flush instruction that is analogous to to be surprisingly difficult. Figure 5 il- fence instructions in hardware models, lustrates an example that also appears with related shortcomings. to violate causality, but is allowed by the

august 2010 | vol. 53 | no. 8 | communications of the acm 95 review articles

Figure 4. Subtleties with (a) control and (b) data dependences. behaved executions, and the to-be com- mitted write is not dependent on a read that returns its value from a data race. It is feasible for core 1 to speculate that its read of X will see 1 and speculatively write Y. Core 2 similarly writes X. Both reads now return 1, creating a “self-fulfilling” speculation or a “causality These conditions ensure that a future loop.” Within a single core, no control dependences are violated since the speculation appears correct; data race will never be used to justify a however, most programmers will not expect such an outcome (the code is in fact data-race-free since speculative write that could then later no sequentially consistent execution contains a data race). Part (b) shows an analogous causal loop justify that future data race. with data dependences. Core 1 may speculate X is 42 (for example, using value prediction based on previous store values) and (speculatively) write 42 into Y. Core 2 reads this and writes 42 into X, thereby A key reason for the complexity in proving the speculation right and creating a causal loop that generates a value (42) out-of-thin-air. the Java model is that it is not opera- Fortunately, no processor today behaves this way, but the memory model specification needs to reflect tional—an access in the future can de- this property. termine whether the current access is Initially X=Y=0 Initially X=Y=0 legal. Further, many possible future Core 1 Core 2 Core 1 Core 2 executions must be examined to deter- r1 = X; r2 = Y; r1 = X; r2 = Y; mine this legality. The choice of future if (r1 == 1) if (Y == 1) Y = r1; X = r2; (well-behaved) executions also gives Y = 1; X = 1; some surprising results. In particular, Is r1 = r2 = 1 allowed? Is r1 = r2 = 42 allowed? as discussed in Manson29 if the code of (a) (b) one thread is “inlined” in (concatenated with) another thread, then the inlined code can produce more behaviors than the original. 
Thus, thread inlining is generally illegal under the Java model Figure 5. Redundant read elimination must be allowed. (even if there are no synchronization- and deadlock-related considerations). For thread 1, the compiler could eliminate the Initially X=Y=0 In practice, the prohibited optimiza- redundant read of X, replacing r2=X with r2=r1. This allows deducing that r1==r2 is always Original Code tions are difficult to implement and this true, making the write of Y unconditional. Then Thread 1 Thread 2 is not a significant performance limi- the compiler may move the write to before r1 = X; r3 = Y; tation. The behavior, however, is non- the read of X since no dependence is violated. r2 = X; X = r3; intuitive, with other implications—it Sequential consistency would allow both the if (r1 == r2) reads of X and Y to return 1 in the new but not occurs because some data races in the Y = 1; the original code. This outcome for the original original code may no longer be data rac- code appears to violate causality since it seems After compiler transformation es in the inlined code. This means that to require a self-justifying speculative write of Y. Thread 1 Thread 2 It must, however, be allowed if compilers are to when determining whether to commit a Y = 1; r3 = Y; perform the common optimization of redundant write early, a read in a well-behaved ex- r1 = X; X = r3; read elimination. r2 = r1; ecution has more choices to return val- if (true); ues than before (since there are fewer Is r1 = r2 = r3 = 1 allowed? data races), resulting in new behaviors. More generally, increasing synchro- nization in the Java model can actually result in new behaviors, even though common compiler optimization of re- cuted (the execution where both reads more synchronization conventionally dundant read elimination.29 After many of X return 0). For Figure 4(b), there is constrains possible executions. 
Recent- proposals and five years of spirited de- no sequentially consistent execution ly, it has been shown that, for similar bate, the current model was approved where Y=42 could occur. This notion of reasons, adding seemingly irrelevant as the best compromise. This model whether a speculative write could occur reads or removing redundant reads allows the outcome of Figure 5, but not in some well-behaved execution is the sometimes can also add new behaviors, that of Figure 4(b). Unfortunately, this basis of causality in the Java model, and and that all these properties have more model is very complex, was known to the definition of well-behaved is the key serious implications than previously have some surprising behaviors, and source of complexity. thought.36 In particular, some optimiza- has recently been shown to have a bug. The Java model tests for the legal- tions that were intended to be allowed We provide intuition for the model be- ity of an execution by “committing” by the Java model are in fact prohibited low and refer the reader to Manson et one or more of its memory accesses at by the current specification. al.26 for a full description. a time—legality requires all accesses to It is unclear if current hardware or Common to both Figure 4(b) and commit (in any order). Committing a JVMs implement the problematic op- Figure 5 are writes that are executed write early (before its turn in program timizations noted here and therefore earlier than they would be with sequen- order) requires it to occur in a well- violate the current Java model. Cer- tial consistency. The examples differ behaved execution where (informally) tainly the current specification is much in that for the speculative write in the the already committed accesses have improved over the original. 
Regardless, latter (Y=1), there is some sequentially similar synchronization and data race the situation is still far from satisfac- consistent execution where it is exe- relationships in all previously used well- tory. First, clearly, the current specifi-

Regardless, the situation is still far from satisfactory. First, clearly, the current specification does not meet its desired intent of having certain common optimizing transformations preserve program meaning. Second, its inherent complexity and the new observations make it difficult to prove the correctness of any real system. Third, the specification methodology is inherently fragile—small changes usually result in hard-to-detect unintended consequences.

The Java model was largely guided by an emergent set of test cases,33 based on informal code transformations that were or were not deemed desirable. While it may be possible to fix the Java model, it seems undesirable that our specification of multithreaded program behavior would rest on such a complex and fragile foundation. Instead, the section entitled "Implications for Languages" advocates a fundamental rethinking of our approach.

The C++ memory model. The situation in C++ was significantly different from Java. The language itself provided no support for threads. Nonetheless, they were already in widespread use, typically with the addition of a library-based threads implementation, such as pthreads22 or the corresponding Microsoft Windows facilities. Unfortunately, the relevant specifications, for example, the combination of the C or C++ standard with the Posix standard, left significant uncertainties about the rules governing shared variables.9 This made it unclear to compiler writers precisely what they needed to implement, resulted in very occasional failures for which it was difficult to assign blame to any specific piece of the implementation, and, most importantly, made it difficult to teach parallel programming, since even the experts were unable to agree on some of the basic rules, such as whether Figure 4(a) constitutes a data race. (Correct answer: No.)

Motivated by these observations, we began an effort in 2005 to develop a proper memory model for C++. The resulting effort eventually expanded to include the definition of atomic (synchronization, analogous to Java volatile) operations, and the threads API itself. It is part of the current Committee Draft24 for the next C++ revision. The next C standard is expected to contain a very similar memory model, with very similar atomic operations.

This development took place in the face of increasing doubt that a Java-like memory model relying on sequential consistency for data-race-free programs was efficiently implementable on mainstream architectures, at least given the specifications available at the time. Largely as a result, much of the early discussion focused on the tension between the following two observations, both of which we still believe to be correct given existing hardware:

• A programming language model weaker than data-race-free is probably unusable by a large fraction of the programming community. Earlier work10 points out, for example, that even core thread library implementors often get confused when it comes to dealing explicitly with memory ordering issues. Substantial effort was invested in attempts to develop weaker, but comparably simple and usable, models. We do not feel these were successful.

• On some architectures, notably on some PowerPC implementations, data-race-free involves substantial implementation cost. (In light of modern (2009) specifications, the cost on others, notably x86, is modest, and limited largely to atomic (C++) or volatile (Java) store operations.)

This resulted in a compromise memory model that supports data-race-free for nearly all of the language. However, atomic data types also provide low-level operations with explicit memory ordering constraints that blatantly violate sequential consistency, even in the absence of data races. The low-level operations are easily identified and can be easily avoided by non-expert programmers. (They require an explicit memory_order_ argument.) But they do give expert programmers a way to write very carefully crafted, but portable, synchronization code that approaches the performance of assembly code.

Since C++ does not support sandboxed code execution, the C++ draft standard can and does leave the semantics of a program with data races completely undefined, effectively making it erroneous to write such programs. As we point out in Boehm and Adve,12 this has a number of (mostly) performance-related advantages, and better reflects existing compiler implementations.

In addition to the issues raised in the "Lessons Learned" section, it should be noted that this really only pushes Java's issues with causality into a much smaller and darker corner of the specification; exactly the same issues arise if we rewrite Figure 4(b) with C++ atomic variables and use low-level memory_order_relaxed operations. Our current solution to this problem is simpler, but as inelegant as the Java one. Unlike Java, it affects a small number of fairly esoteric library calls, not all memory accesses.

As with the Java model, we feel that although this solution involves compromises, it is an important step forward. It clearly establishes data-race-free as the core guarantee that every programmer should understand. It defines precisely what constitutes a data race. It finally resolves simple questions such as: If x.a and x.b are assigned simultaneously, is that a data race? (No, unless they are part of the same contiguous sequence of bit-fields.) By doing so, it clearly identifies shortcomings of existing compilers that we can now begin to remedy.

Reconciling language and hardware models. Throughout this process, it repeatedly became clear that current hardware models and supporting fence instructions are often at best a marginal match for programming language memory models, particularly in the presence of Java volatile fields or C++ atomic objects. It is always possible to implement such synchronization variables by mapping each one to a lock, and acquiring and releasing the corresponding lock around all accesses. However, this typically adds an overhead of hundreds of cycles to each access, particularly since the lock accesses are likely to result in coherence cache misses, even when only read accesses are involved.

Volatile and atomic variables are typically used to avoid locks for exactly these reasons. A typical use is a flag that indicates a read-only data structure has been lazily initialized. Since the initialization has to happen only once, nearly all accesses simply read the atomic/volatile flag and avoid lock acquisitions. Acquiring a lock to access the flag defeats the purpose.

On hardware that relaxes write atomicity (see Figure 3), however, it is often unclear that more efficient mappings (than the use of locks) are possible; even the fully fenced implementation may not be sequentially consistent.


Even on other hardware, there are apparent mismatches, most probably caused by the lack of a well-understood programming language model when the hardware was designed. On x86, it is almost sufficient to map synchronization loads and stores directly to ordinary load and store instructions. The hardware provides sufficient guarantees to ensure that ordinary memory operations are not visibly reordered with synchronization operations. However, it fails to prevent reordering of a synchronization store followed by a synchronization load; thus this implementation does not prevent the incorrect outcome for Figure 1.

This may be addressed by translating a synchronization store to an ordinary store instruction followed by an expensive fence. The sole purpose of this fence is to prevent reordering of the synchronization store with a subsequent synchronization load. In practice, such a synchronization load is unlikely to follow closely enough (Dekker's algorithm is not commonly used) to really constrain the hardware. But the only available fence instruction constrains all memory reordering around it, including that involving ordinary data accesses, and thus overly constrains the hardware. A better solution would involve distinguishing between two flavors of loads and stores (ordinary and synchronization), roughly along the lines of Itanium's ld.acq and st.rel.23 This, however, requires a change to the instruction set architecture, usually a difficult proposition.

We suspect the current situation makes the fence instruction more expensive than necessary, in turn motivating additional language-level complexity such as C++ low-level atomics or lazySet() in Java.

Lessons Learned
Data-race-free provides a simple and consistent model for threads and shared variables. We are convinced it is the best model today to target during initial software development. Unfortunately, its lack of any guarantees in the presence of data races and its mismatch with current hardware imply three significant weaknesses:

Debugging. Accidental introduction of a data race results in "undefined behavior," which often takes the form of surprising results later during program execution, possibly long after the data race has resulted in corrupted data. Although the usual immediate result of a data race is that an unexpected, and perhaps incomplete, value is read, or that an inconsistent value is written, we point out in prior work12 that other results, such as wild branches, are also possible as a result of compiler optimizations that mistakenly assume the absence of data races. Since such races are difficult to reproduce, the root cause of such misbehavior is often difficult to identify, and such bugs may easily take weeks to track down. Many tools to aid such debugging (for example, CHESS30 and RaceFuzzer35) also assume sequential consistency, somewhat limiting their utility.

Synchronization variable performance on current hardware. As discussed, ensuring sequential consistency in the presence of Java volatile or C++ atomic on current hardware can be expensive. As a result, both C++, and to a lesser extent Java, have had to provide less expensive alternatives that greatly complicate the model for experts trying to use them.

Untrusted code. There is no way to ensure data-race-freedom in untrusted code. Thus, this model is insufficient for languages like Java.

An unequivocal lesson from our experiences is that for programs with data races, it is very difficult to define semantics that are easy to understand and yet retain desired system flexibility. While the Java memory model came a long way, its complexity, and the subsequent discoveries of its surprising behaviors, are far from satisfying. Unfortunately, we know of no alternative specification that is sufficiently simple to be considered practical. Second, rules to weaken the data-race-free guarantee to better match current hardware, as through C++ low-level atomics, are also more complex than we would like.

The only clear path to improvement here seems to be to eliminate the need for going beyond the data-race-free guarantee by:
• Eliminating the performance motivations for going beyond it, and
• Ensuring that data races are never actually executed at runtime, thus both avoiding the need to specify their behavior and greatly simplifying or eliminating the debugging issues associated with data races.

Unfortunately, these both take us to active research areas, with no clear off-the-shelf solutions. We discuss some possible approaches here.

Implications for Languages
In spite of the dramatic convergence in the debate on memory models, the state of the art imposes a difficult choice: a language that supposedly has strong safety and security properties, but no clear definition of what value a shared-memory read may return (the Java case), versus a language with clear semantics, but that requires abandoning security properties promised by languages such as Java (the C++ case). Unfortunately, modern software needs to be both parallel and secure, and requiring a choice between the two should not be acceptable.

A pessimistic view would be to abandon shared-memory altogether. However, the intrinsic advantages of a global address space are, at least anecdotally, supported by the widespread use of threads despite the inherent challenges. We believe the fault lies not in the global address space paradigm, but in the use of undisciplined or "wild shared-memory," permitted by current systems.

Data-race-free was a first attempt to formalize a shared-memory discipline via a memory model. It proved inadequate because the responsibility for following this discipline was left to the programmer. Further, data-race-free by itself is, arguably, insufficient as a discipline for writing correct, easily debuggable, and maintainable shared-memory code; for example, it does not completely eliminate atomicity violations or non-deterministic behavior.

Moving forward, we believe a critical research agenda to enable "parallelism for the masses" is to develop and promote disciplined shared-memory models that:
• are simple enough to be easily teachable to undergraduates; that is, minimally provide sequential consistency to programs that obey the required discipline;
• enable the enforcement of the discipline; that is, violations of the discipline should not have undefined or horrendously complex semantics, but should

be caught and returned back to the programmer as illegal;
• are general-purpose enough to express important parallel algorithms and patterns; and
• enable high and scalable performance.

Many previous programmer-productivity-driven efforts have sought to raise the level of abstraction with threads; for example, Cilk,19 TBB,25 OpenMP,39 the recent HPCS languages,28 other high-level libraries, frameworks, and APIs such as java.util.concurrent and the C++ boost libraries, as well as more domain-specific ones. While these solutions go a long way toward easing the pain of orchestrating parallelism, our memory-models-driven argument is deeper—we argue that, at least so far, it is not possible to provide reasonable semantics for a language that allows data races, an arguably more fundamental problem. In fact, all of these examples either provide unclear models or suffer from the same limitations as C++/Java. These approaches, therefore, do not meet our enforcement requirement. Similarly, transactional memory provides a high-level mechanism for atomicity, but the memory model in the presence of non-transactional code faces the same issues as described here.38

At the heart of our agenda of disciplined models are the questions: What is the appropriate discipline? How to enforce it?

A near-term transition path is to continue with data-race-free and focus research on its enforcement. The ideal solution is for the language to eliminate data races by design (for example, Boyapati13); however, our semantics difficulties are avoided even with dynamic techniques (for example, Elmas et al.,17 Flanagan and Freund,18 or Lucia et al.27) that replace all data races with exceptions. (There are other dynamic data race detection techniques, primarily for debugging, but they do not guarantee complete accuracy, as required here.)

A longer-term direction concerns both the appropriate discipline and its enforcement. A fundamental challenge in debugging, testing, and reasoning about threaded programs arises from their inherent non-determinism—an execution may exhibit one of many possible interleavings of its memory accesses. In contrast, many applications written for performance have deterministic outcomes and can be expressed with deterministic algorithms. Writing such programs using a deterministic environment allows reasoning with sequential semantics (a memory model much simpler than sequential consistency with threads).

A valuable discipline, therefore, is to provide a guarantee of determinism by default; when non-determinism is inherently required, it should be requested explicitly and should not interfere with the deterministic guarantees for the remaining program.7 There is much prior work in deterministic data-parallel, functional, and actor languages. Our focus is on general-purpose efforts that continue use of widespread programming practices; for example, global address space, imperative languages, object-oriented programming, and complex, pointer-based data structures.

Language-based approaches with such goals include Jade34 and the recent Deterministic Parallel Java (DPJ).8 In particular, DPJ proposes a region-based type and effect system for deterministic-by-default semantics—"regions" name disjoint partitions of the heap, and per-method effect annotations summarize which regions are read and written by each method. Coupled with a disciplined parallel control structure, the compiler can easily use the effect summaries to ensure that there are no unordered conflicting accesses and the program is deterministic. Recent results show that DPJ is applicable to a range of applications and complex data structures and provides performance comparable to threads code.8

There has also been much recent progress in runtime methods for determinism.4,5,16,31

Both language and runtime approaches have pros and cons and still require research before mainstream adoption. A language-based approach must establish that it is expressive enough and does not incur undue programmer burden. For the former, the new techniques are promising, but the jury is still out. For the latter, DPJ is attempting to alleviate the burden by using a familiar base language (currently Java) and providing semiautomatic tools to infer the required programmer annotations.41


Further, language annotations such as DPJ's read/write effect summaries are valuable documentation in their own right—they promote lifetime benefits for modularity and maintainability, arguably compensating for the upfront programmer effort. Finally, a static approach benefits from no overhead or surprises at runtime.

In contrast, the purely runtime approaches impose less burden on the programmer, but a disadvantage is that the overheads in some cases may still be too high. Further, a runtime approach inherently does not provide the guarantees of a static approach before shipping and is susceptible to surprises in the field.

We are optimistic that the recent approaches have opened up many promising new avenues for disciplined shared-memory that can overcome the problems described here. It is likely that a final solution will consist of a judicious combination of language and runtime features, and will derive from a rich line of future research.

Implications for Hardware
As discussed earlier, current hardware memory models are an imperfect match for even current software (data-race-free) memory models. ISA changes to identify individual loads and stores as synchronization can alleviate some short-term problems. An established ISA, however, is difficult to change, especially when existing code works mostly adequately and there is not enough experience to document the benefits of the change.

Academic researchers have taken an alternate path that uses complex mechanisms (for example, Blundell et al.6) to speculatively remove the constraints imposed by fences, rolling back the speculation when it is detected that the constraints were actually needed. While these techniques have been shown to work well, they come at an implementation cost and do not directly confront the root of the problem of mismatched hardware/software views of concurrency semantics.

Taking a longer-term perspective, we believe a more fundamental solution to the problem will emerge with a co-designed approach, where future multicore hardware research evolves in concert with the software models research discussed in "Implications for Languages." The current state of hardware technology makes this a particularly opportune time to embark on such an agenda. Power and complexity constraints have led industry to bet that future single-chip performance increases will largely come from increasing numbers of cores. Today's hardware cache-coherent multicore designs, however, are optimized for few cores—power-efficient performance scaling to several hundreds or a thousand cores without consideration of software requirements will be difficult.

We view this challenge as an opportunity not only to resolve the problems discussed in this article but, in doing so, to build more effective hardware and software. First, we believe that hardware that takes advantage of the emerging disciplined software programming models is likely to be more efficient than a software-oblivious approach. This observation already underlies the work on relaxed hardware consistency models—we hope the difference this time around will be that the software and hardware models will evolve together rather than as retrofits for each other, providing more effective solutions. Second, hardware research to support the emerging disciplined software models is also likely to be critical. Hardware support can be used for efficient enforcement of the required discipline when static approaches fall short; for example, through directly detecting violations of the discipline and/or through effective strategies to sandbox untrusted code.

Along these lines, we have recently begun the DeNovo hardware project at Illinois15 in concert with DPJ. We are exploiting DPJ-like region and effect annotations to design more power- and complexity-efficient, software-driven communication and coherence protocols and task scheduling mechanisms. We also plan to provide hardware and runtime support to deal with cases where DPJ's static information and analysis might fall short. As such co-designed models emerge, ultimately we expect them to drive the future hardware-software interface, including the ISA.

Conclusion
This article gives a perspective based on work collectively spanning approximately 30 years. We have been repeatedly surprised at how difficult it is to formalize the seemingly simple and fundamental property of "what value a read should return in a multithreaded program." Sequential consistency for data-race-free programs appears to be the best we can do at present, but it is insufficient. The inability to define reasonable semantics for programs with data races is not just a theoretical shortcoming, but a fundamental hole in the foundation of our languages and systems. It is well accepted that most shipped software has bugs, and it is likely that much commercial multithreaded software has data races. Debugging tools and safe languages that seek to sandbox untrusted code must deal with such races, and must be given semantics that reasonable computer science graduates and developers can understand.

We believe it is time to rethink how we design our languages and systems. Minimally, the system, and preferably the language, must enforce the absence of data races. A longer-term, potentially more rewarding strategy is to rethink higher-level disciplines that make it much easier to write parallel programs and that can be enforced by our languages and systems. We also believe some of the messiness of memory models today could have been averted with closer cooperation between hardware and software. As we move toward more disciplined programming models, there is also a new opportunity for a hardware/software co-designed approach that rethinks the hardware/software interface and the hardware implementations of all concurrency mechanisms. These views embody a rich research agenda that will need the involvement of many computer science sub-disciplines, including languages, compilers, formal methods, software engineering, algorithms, runtime systems, and hardware.

Acknowledgments
This article is deeply influenced by a collective 30 years of collaborations and discussions with more colleagues than can be named here. We would particularly like to acknowledge the contributions of Mark Hill for co-developing the data-race-free approach and other foundational work; Kourosh Gharachorloo for hardware models; Jeremy Manson and Bill Pugh for the Java model; Lawrence Crowl, Paul McKenney, Clark Nelson, and Herb Sutter for the C++ model; and Vikram Adve, Rob Bocchino, and Marc Snir for ongoing work on disciplined programming models and their enforcement. We thank Doug Lea for continuous encouragement to push the envelope. Finally, we thank Vikram Adve, Rob Bocchino, Nick Carter, Lawrence Crowl, Mark Hill, Doug Lea, Jeremy Manson, Paul McKenney, Bratin Saha, and Rob Schreiber for comments on earlier drafts.

Sarita Adve is currently funded by Intel and Microsoft through the Illinois Universal Parallel Computing Research Center.

References
1. Adve, S.V. Designing Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. thesis, University of Wisconsin-Madison, 1993.
2. Adve, S.V. and Gharachorloo, K. Shared memory consistency models: A tutorial. IEEE Computer 29, 12 (1996), 66–76.
3. Adve, S.V. and Hill, M.D. Weak ordering—A new definition. In Proceedings of the 17th Intl. Symp. on Computer Architecture, 1990, 2–14.
4. Allen, M.D., Sridharan, S. and Sohi, G.S. Serialization sets: A dynamic dependence-based parallel execution model. In Proceedings of the Symp. on Principles and Practice of Parallel Programming, 2009.
5. Berger, E.D., Yang, T., Liu, T. and Novark, G. Grace: Safe multithreaded programming for C/C++. In Proceedings of the Intl. Conf. on Object-Oriented Programming, Systems, Languages, and Applications, 2009.
6. Blundell, C., Martin, M.M.K. and Wenisch, T. InvisiFence: Performance-transparent memory ordering in conventional multiprocessors. In Proceedings of the Intl. Symp. on Computer Architecture, 2009.
7. Bocchino, R. et al. Parallel programming must be deterministic by default. In Proceedings of the 1st Workshop on Hot Topics in Parallelism, 2009.
8. Bocchino, R. et al. A type and effect system for Deterministic Parallel Java. In Proceedings of the Intl. Conf. on Object-Oriented Programming, Systems, Languages, and Applications, 2009.
9. Boehm, H.-J. Threads cannot be implemented as a library. In Proceedings of the Conf. on Programming Language Design and Implementation, 2005.
10. Boehm, H.-J. Reordering constraints for pthread-style locks. In Proceedings of the 12th Symp. on Principles and Practice of Parallel Programming, 2007, 173–182.
11. Boehm, H.-J. Threads basics. July 2009; http://www.hpl.hp.com/personal/Hans_Boehm/threadsintro.html.
12. Boehm, H.-J. and Adve, S.V. Foundations of the C++ concurrency memory model. In Proceedings of the Conf. on Programming Language Design and Implementation, 2008, 68–78.
13. Boyapati, C., Lee, R. and Rinard, M. Ownership types for safe programming: Preventing data races and deadlocks. In Proceedings of the Intl. Conf. on Object-Oriented Programming, Systems, Languages, and Applications, 2002.
14. Ceze, L. et al. BulkSC: Bulk enforcement of sequential consistency. In Proceedings of the Intl. Symp. on Computer Architecture, 2007.
15. Choi, B. et al. DeNovo: Rethinking hardware for disciplined parallelism. In Proceedings of the 2nd Workshop on Hot Topics in Parallelism, June 2010.
16. Devietti, J. et al. DMP: Deterministic shared memory multiprocessing. In Proceedings of the Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (Mar. 2009), 85–96.
17. Elmas, T., Qadeer, S. and Tasiran, S. Goldilocks: A race and transaction-aware Java runtime. In Proceedings of the ACM Conference on Programming Language Design and Implementation, 2007, 245–255.
18. Flanagan, C. and Freund, S. FastTrack: Efficient and precise dynamic race detection. In Proceedings of the Conf. on Programming Language Design and Implementation, 2009.
19. Frigo, M., Leiserson, C.E. and Randall, K.H. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM Conference on Programming Language Design and Implementation, 1998, 212–223.
20. Gharachorloo, K. Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. thesis, Stanford University, Stanford, CA, 1996.
21. Gharachorloo, K., Gupta, A. and Hennessy, J. Two techniques to enhance the performance of memory consistency models. In Proceedings of the Intl. Conf. on Parallel Processing, 1991, I355–I364.
22. IEEE and The Open Group. IEEE Standard 1003.1-2001, 2001.
23. Intel. Intel Itanium Architecture: Software Developer's Manual, Jan. 2006.
24. ISO/IEC JTC1/SC22/WG21. ISO/IEC 14882, Programming Languages—C++ (final committee draft), 2010; http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf.
25. Reinders, J. Intel Threading Building Blocks: Outfitting C++ for Multi-core Parallelism. O'Reilly, 2007.
26. Lamport, L. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers C-28, 9 (1979), 690–691.
27. Lucia, B. et al. Conflict exceptions: Simplifying concurrent language semantics with precise hardware exceptions for data-races. In Proceedings of the Intl. Symp. on Computer Architecture, 2010.
28. Lusk, E. and Yelick, K. Languages for high-productivity computing: The DARPA HPCS language project. Parallel Processing Letters 17, 1 (2007), 89–102.
29. Manson, J., Pugh, W. and Adve, S.V. The Java memory model. In Proceedings of the Symp. on Principles of Programming Languages, 2005.
30. Musuvathi, M. and Qadeer, S. Iterative context bounding for systematic testing of multithreaded programs. In Proceedings of the ACM Conference on Programming Language Design and Implementation, 2007, 446–455.
31. Olszewski, M., Ansel, J. and Amarasinghe, S. Kendo: Efficient deterministic multithreading in software. In Proceedings of the Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Mar. 2009.
32. Pugh, W. The Java memory model is fatally flawed. Concurrency: Practice and Experience 12, 6 (2000), 445–455.
33. Pugh, W. and the JSR 133 Expert Group. The Java memory model (July 2009); http://www.cs.umd.edu/~pugh/java/memoryModel/.
34. Rinard, M.C. and Lam, M.S. The design, implementation, and evaluation of Jade. ACM Transactions on Programming Languages and Systems 20, 3 (1998), 483–545.
35. Sen, K. Race directed random testing of concurrent programs. In Proceedings of the Conf. on Programming Language Design and Implementation, 2008.
36. Sevcik, J. and Aspinall, D. On validity of program transformations in the Java memory model. In Proceedings of the European Conference on Object-Oriented Programming, 2008, 27–51.
37. Shasha, D. and Snir, M. Efficient and correct execution of parallel programs that share memory. ACM Transactions on Programming Languages and Systems 10, 2 (Apr. 1988), 282–312.
38. Shpeisman, T. et al. Enforcing isolation and ordering in STM. In Proceedings of the ACM Conference on Programming Language Design and Implementation, 2007.
39. The OpenMP ARB. OpenMP application programming interface: Version 3.0, May 2008; http://www.openmp.org/mp-documents/spec30.pdf.
40. United States Department of Defense. Reference Manual for the Ada Programming Language: ANSI/MIL-STD-1815A-1983. Springer, 1983.
41. Vakilian, M. et al. Inferring method effect summaries for nested heap regions. In Proceedings of the 24th Intl. Conf. on Automated Software Engineering, 2009.

Sarita V. Adve ([email protected]) is a professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign.

Hans-J. Boehm ([email protected]) is a member of Hewlett-Packard's Exascale Computing Lab, Palo Alto, CA.

© 2010 ACM 0001-0782/10/0800 $10.00
