

DMP: Deterministic Shared-Memory Multiprocessing

Shared-memory multicore and multiprocessor systems are nondeterministic, which frustrates debugging and complicates testing of multithreaded code, impeding parallel programming's widespread adoption. The authors propose fully deterministic shared-memory multiprocessing that not only enhances debugging by offering repeatability by default, but also improves the quality of testing and the deployment of production code. They show that determinism can be provided with little performance cost on future hardware.

Joseph Devietti, Brandon Lucia, Luis Ceze, and Mark Oskin
University of Washington

Developing multithreaded software has proven to be much more difficult than writing single-threaded code. One major challenge is that current multicore processors execute multithreaded code nondeterministically: given the same input, threads can interleave their memory and I/O operations differently in each execution.

Nondeterminism in multithreaded execution arises from small perturbations in the execution environment—for example, other processes executing simultaneously, differences in the operating system's resource allocation, the state of caches, TLBs, buses, and other microarchitectural structures. Nondeterminism complicates software development significantly. Defective software might execute correctly hundreds of times before a subtle synchronization bug appears—and when it does, developers typically can't reproduce it during debugging. Because a program can behave differently each time it's run with the same input, assessing test coverage is difficult. For that reason, much of the research on testing parallel programs is devoted to dealing with nondeterminism.

Although researchers have made significant efforts to address nondeterminism, they've mostly focused on deterministically replaying multithreaded execution based on a previously generated log.1-5 (See the "Related work on dealing with nondeterminism in multithreaded software" sidebar for more information.) In contrast, we make the case for fully deterministic shared-memory multiprocessing (DMP). We show that, with hardware support, we can execute arbitrary shared-memory parallel programs deterministically, with scant performance penalty. For a discussion of determinism's benefits, see the "How determinism benefits multithreaded software development" sidebar.

In this article, we describe a multiprocessor architecture for determinism. This work has led to several follow-on projects; for example, we recently published work on pure software approaches to determinism.6

Published by the IEEE Computer Society. 0272-1732/10/$26.00 © 2010 IEEE

Related work on dealing with nondeterminism in multithreaded software

Past work on dealing with nondeterminism in shared-memory multiprocessors has focused primarily on deterministically replaying an execution for debugging. The idea is to record a log of the ordering of events that occurred in a parallel execution and later replay the execution on the basis of the log. Software-based systems1-3 typically suffer from high overhead or limitations on the types of events recorded—for example, only synchronization events, which are insufficient to reproduce the outcomes of races. Hardware-based record and replay mechanisms4-6 address these issues by enabling the recording of low-level events more efficiently, allowing faster and more accurate record and replay.

Most follow-up research has been on further reducing deterministic replay's log size and performance impact. Strata7 and FDR28 exploit redundancy in the memory race log. Recent advances are ReRun9 and DeLorean,10 which aim to reduce log size and hardware complexity. ReRun is a hardware memory race recording mechanism that records periods of the execution without memory communication (the same phenomenon that DMP-ShTab leverages), requiring little hardware state and producing a small race log. DeLorean executes instructions as blocks (similar to quanta) and only needs to record the commit order of blocks of instructions. Additionally, DeLorean uses predefined commit ordering (albeit with block size logging in special cases) to further reduce the memory ordering log. In contrast, DMP makes memory communication deterministic by construction, eliminating the need for logging.

Similarly to DMP, Kendo,11 which was developed concurrently with our work, provides log-free deterministic execution. Kendo leverages an application's data-race freedom to provide determinism efficiently and purely in software via a deterministic synchronization library. In contrast, DMP uses hardware support to provide determinism for arbitrary programs, including those with data races.

In this work, we explore enforcing determinism at the execution level, providing generality but requiring architecture modifications. A body of related work proposes a different trade-off: moving to new, deterministic languages such as StreamIt,12 Jade,13 and DPJ,14 while continuing to run on commodity hardware.

References
1. J.-D. Choi and H. Srinivasan, "Deterministic Replay of Java Multithreaded Applications," Proc. SIGMETRICS Symp. Parallel and Distributed Tools, ACM Press, 1998, pp. 48-59.
2. T.J. LeBlanc and J.M. Mellor-Crummey, "Debugging Parallel Programs with Instant Replay," IEEE Trans. Computers, vol. 36, no. 4, 1987, pp. 471-482.
3. M. Ronsse and K. De Bosschere, "RecPlay: A Fully Integrated Practical Record/Replay System," ACM Trans. Computer Systems, vol. 17, no. 2, 1999, pp. 133-152.
4. D.F. Bacon and S.C. Goldstein, "Hardware-Assisted Replay of Multiprocessor Programs," Proc. 1991 ACM/ONR Workshop Parallel and Distributed Debugging, ACM Press, 1991, pp. 194-206.
5. S. Narayanasamy, G. Pokam, and B. Calder, "BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging," Proc. 32nd Int'l Symp. Computer Architecture (ISCA 05), IEEE CS Press, 2005, pp. 284-295.
6. M. Xu, R. Bodik, and M.D. Hill, "A 'Flight Data Recorder' for Enabling Full-System Multiprocessor Deterministic Replay," Proc. 30th Int'l Symp. Computer Architecture (ISCA 03), IEEE CS Press, 2003, pp. 122-135.
7. S. Narayanasamy, C. Pereira, and B. Calder, "Recording Shared Memory Dependencies Using Strata," Proc. 12th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 06), ACM Press, 2006, pp. 229-240.
8. M. Xu, M. Hill, and R. Bodik, "A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording," Proc. 12th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 06), ACM Press, 2006, pp. 49-60.
9. D.R. Hower and M.D. Hill, "Rerun: Exploiting Episodes for Lightweight Memory Race Recording," Proc. 35th Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 265-276.
10. P. Montesinos, L. Ceze, and J. Torrellas, "DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently," Proc. 35th Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 289-300.
11. M. Olszewski, J. Ansel, and S. Amarasinghe, "Kendo: Efficient Deterministic Multithreading in Software," Proc. 14th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 09), ACM Press, 2009, pp. 97-108.
12. W. Thies, M. Karczmarek, and S.P. Amarasinghe, "StreamIt: A Language for Streaming Applications," Proc. 11th Int'l Conf. Compiler Construction, LNCS 2304, Springer, 2002, pp. 179-196.
13. M.C. Rinard and M.S. Lam, "The Design, Implementation, and Evaluation of Jade," ACM Trans. Programming Languages and Systems, vol. 20, no. 3, 1998, pp. 483-545.
14. R.L. Bocchino Jr. et al., "A Type and Effect System for Deterministic Parallel Java," Proc. 24th ACM SIGPLAN Conf. Object Oriented Programming Systems Languages and Applications (OOPSLA 09), 2009, pp. 97-116.

Defining deterministic parallel execution

We define a DMP system as a computer system that

- executes multiple threads that communicate via shared memory, and
- produces the same program output if given the same program input.

This definition implies that a parallel program running on a DMP system is as deterministic as a single-threaded program.
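This output-equality property can be phrased operationally. The sketch below is our illustration, not the authors' mechanism: the function names are invented, and a real DMP system enforces the property by construction in hardware rather than checking it after the fact.

```python
def is_deterministic(run_fn, program_input, trials=5):
    """Return True if run_fn yields identical output on every trial with
    the same input -- the property a DMP system guarantees even when
    run_fn is multithreaded."""
    reference = run_fn(program_input)
    return all(run_fn(program_input) == reference for _ in range(1, trials))

# A single-threaded computation passes trivially:
assert is_deterministic(sorted, [3, 1, 2])
```

On a conventional multiprocessor, a racy multithreaded `run_fn` can fail this check; on a DMP system it cannot.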

JANUARY/FEBRUARY 2010

TOP PICKS

How determinism benefits multithreaded software development

Deterministic execution could potentially improve the entire multithreaded software development process (see Figure A). In a development environment, determinism allows for repeatable debugging, which in turn enables reversible debugging for multithreaded programs. Determinism also acts as a powerful lever for thread analysis tools (such as race detectors) because any properties verified for an execution with a given input will continue to hold for all future executions of that input. Determinism simplifies testing by eliminating the need to stress test a program by running it repeatedly with the same inputs to expose different interleavings; instead, thread interleavings can be controlled in a more principled fashion by modifying inputs. However, the greatest rewards accrue when parallel programs always execute deterministically in the field. Deployed determinism has the potential to make testing more valuable, as execution in the field will better resemble in-house testing, and to allow for easier reproduction of bugs from the field back in a debugging environment. Ultimately, determinism will enable the deployment of more reliable parallel software.

[Figure A depicts the development cycle. Test: tested inputs behave identically in production, so there is no need to stress test, and testing results are reproducible. Deploy: more robust production code. Debug: reverse debugging is possible, thread analysis tools are more effective, and production bugs can be reproduced in-house.]

Figure A. The benefits of determinism throughout the development cycle for multithreaded software.

The most direct way to guarantee deterministic behavior is to preserve the same global interleaving of instructions in every execution of a parallel program. However, this interleaving has several aspects that are irrelevant for ensuring deterministic behavior. It's not important which particular global interleaving is chosen, as long as it's always the same. Also, if two instructions don't communicate, the system can swap their execution order with no observable effect on program behavior. The key to deterministic execution is that all communication between threads is precisely the same for every execution.

To guarantee deterministic interthread communication, each dynamic instance of a load (consumer) must read data produced by the same dynamic instance of another store (producer). The producer and consumer need not be in the same thread, so this communication happens via shared memory. Interestingly, multiple global interleavings can lead to the same communication between instructions—that is, they are communication-equivalent interleavings. Any communication-equivalent interleaving will produce the same program behavior. To guarantee deterministic behavior, then, we must carefully control only the behavior of load and store operations that cause communication between threads. This insight is the key to efficient deterministic execution.

Enforcing Deterministic Execution

In this section, we describe how to build a deterministic multiprocessor system, focusing on the key mechanisms. We begin with a basic naïve approach and then refine this

simple technique into progressively more efficient organizations. We discuss specific implementations later.

Basic idea: DMP-Serial

As we noted earlier, making multiprocessors deterministic depends on ensuring that the communication between threads is deterministic. The easiest way to accomplish this is to take a nondeterministic parallel execution (see Figure 1a) and allow only one processor at a time to access memory deterministically. We call this the deterministic serialization of a parallel execution (see Figure 1b).

Figure 1. Deterministic serialization of memory operations: nondeterministic parallel execution (a), fine-grain deterministic serialized execution (b), coarse-grain deterministic serialized execution (c), and overlapping communication-free execution periods (d). Dots are memory operations, and dashed arrows are happens-before synchronization.

The simplest way to implement such serialization is to have each processor obtain a deterministic token before performing a memory operation and, when the memory operation is complete, pass the token to the next processor deterministically. A processor blocks whenever it needs to access memory but doesn't have the token.

Waiting for the token at every memory operation will cause significant performance degradation for two reasons. First, the processor must wait for and pass the deterministic token. Second, the serialization removes the performance benefits of parallel execution.

We can mitigate the synchronization overhead of token passing by synchronizing at a coarser granularity, letting each processor execute a finite, deterministic number of instructions, or a quantum, before passing the token to the next processor (Figure 1c). We call such a system DMP-Serial. The process of dividing the execution into quanta is called quantum building; the simplest way to build a quantum is to break execution into fixed instruction counts (for example, 10,000).

Notably, DMP doesn't interfere with application-level synchronization (for example, DMP introduces no deadlocks). DMP offers the same memory access semantics as traditional nondeterministic systems. The extra synchronization a DMP system imposes resides below application-level synchronization, and the two don't interact.

Recovering parallelism

Reducing serialization's impact requires enabling parallel execution while preserving determinism. We propose two techniques to recover parallelism from communication-free periods of an execution (Figure 1d). The first technique leverages an invalidation-based coherence protocol to identify when interthread communication is about to occur. The second technique uses speculation to allow parallel execution of quanta from different processors, re-executing quanta when determinism might have been violated.

Leveraging the cache coherence protocol: DMP-ShTab. With DMP-ShTab, we improve deterministic parallel execution's performance by serializing only while threads communicate. We divide each quantum into two parts: a communication-free prefix that executes in parallel with other quanta, and a suffix that, from the first point of communication onward, executes serially. The serial suffix's execution is deterministic because each thread runs serially in an order established by the deterministic token, just as in DMP-Serial. The transition from parallel execution to serial execution is deterministic because it occurs only when all threads are blocked—each thread will block either at its first point of interthread communication or, if it doesn't communicate with other threads, at the end of its current quantum. Thus, each thread blocks during each of its quanta (though possibly not until the end), and each thread blocks at a deterministic point within
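DMP-Serial's quantum-granularity token passing (Figure 1c) can be modeled as a round-robin scheduler over per-thread operation lists. This is our illustrative software sketch, not the hardware mechanism; the quantum size and the operation encoding are invented for the example.

```python
from collections import deque

def dmp_serial(thread_ops, quantum=3):
    """Deterministically serialize per-thread operation lists: the thread
    holding the token executes up to `quantum` operations, then passes
    the token round-robin. The interleaving depends only on the input."""
    queues = [deque(ops) for ops in thread_ops]
    order, holder = [], 0
    while any(queues):
        for _ in range(quantum):              # one quantum per token hold
            if not queues[holder]:
                break
            order.append((holder, queues[holder].popleft()))
        holder = (holder + 1) % len(queues)   # deterministic token passing
    return order

ops = [['a1', 'a2', 'a3', 'a4'], ['b1', 'b2']]
assert dmp_serial(ops) == dmp_serial(ops)     # same input, same interleaving
```

Every run yields `[(0, 'a1'), (0, 'a2'), (0, 'a3'), (1, 'b1'), (1, 'b2'), (0, 'a4')]`: thread 0's leftover work simply executes on its next token hold.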



[Figure 2 shows processors P0 and P1 executing quanta with a parallel communication-free prefix and a serial suffix, ordered by deterministic token passing; Ld: load, St: store. In (a), location A is initially private to P0; in (b), A is initially shared.]

Figure 2. Recovering parallelism by overlapping communication-free execution: remote processor P1 deterministically reads another thread's private data (a), and remote processor P0 deterministically writes to shared data (b).

each quantum because communication is detected deterministically (we'll describe this later).

Interthread communication occurs when a thread writes to shared (or nonprivate) pieces of data. In this case, the system must guarantee that all threads observe such writes at a deterministic point in their execution. Figure 2 illustrates how DMP-ShTab enforces this. There are two important cases: reading data held private by a remote processor, and writing to shared data (privatizing it).

Figure 2a shows the first case: when quantum 2 attempts to read data that a remote processor P0 holds private, it must first wait for the deterministic token and for all other threads to be blocked waiting for the deterministic token. In this example, the read can't execute until quantum 1 finishes executing. This guarantees that quantum 2 always gets the same data, because quantum 1 might still write to location A before it finishes executing.

Figure 2b shows the second case: when quantum 1, which already holds the deterministic token, attempts to write to a piece of shared data, it must also wait for all other threads to be blocked waiting for the deterministic token. In this example, the store can't execute until quantum 2 finishes executing. This is necessary to guarantee that all processors observe the change in location A's state (from shared to privately held by a remote processor) at a deterministic point in their execution. When each thread reaches the end of a quantum, it waits to receive the token before starting its next quantum, which periodically (and deterministically) allows a thread that's waiting for all other threads to be blocked to make progress.

Leveraging support for transactional memory: DMP-TM and DMP-TMFwd. Executing quanta atomically, in isolation, and in a deterministic total order is equivalent to deterministic serialization. To see why, consider a quantum, executed atomically and in isolation, to be a single instruction in the deterministic total order (which is the same as DMP-Serial). Leveraging transactional memory7,8 can make quanta appear to execute atomically and in isolation. Coupled with a deterministic commit order, this makes execution functionally equivalent to deterministic serialization, while recovering parallelism. Additionally, we need mechanisms to form quanta deterministically and to enforce a deterministic commit order. As Figure 3a shows, speculation lets a quantum run concurrently with other quanta in the system, as long as

[Figure 3 shows quanta on P0 and P1 executing as atomic memory transactions under deterministic token passing and a deterministic commit order; Ld: load, St: store. In (a), a quantum with a conflicting access is squashed and re-executed; in (b), an uncommitted value of A flows from quantum 1 to quantum 2.]

Figure 3. Recovering parallelism by executing quanta as memory transactions (a). Avoiding unnecessary squashes with uncommitted data forwarding (b).
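The conflict-and-squash discipline of Figure 3a can be modeled with a deliberately simplified abstraction of ours: each quantum is reduced to a read set and a write set, and a squashed quantum's re-execution is assumed to succeed immediately.

```python
def commit_round(quanta):
    """One round of DMP-TM: quanta ran speculatively in parallel from the
    same initial state; they commit in a fixed deterministic order, and
    any quantum whose reads overlap a write committed earlier in the
    round is squashed and re-executed.

    quanta: list of (read_set, write_set) in deterministic total order.
    Returns the action log."""
    committed_writes = set()
    log = []
    for qid, (reads, writes) in enumerate(quanta, 1):
        if reads & committed_writes:
            log.append(('squash', qid))   # speculation lost: re-execute
        log.append(('commit', qid))       # commit (after re-exec if squashed)
        committed_writes |= writes
    return log

# Quantum 2 read A, which quantum 1 (earlier in the order) wrote:
log = commit_round([({'B'}, {'A'}), ({'A'}, {'C'})])
assert log == [('commit', 1), ('squash', 2), ('commit', 2)]
```

Because the commit order is predefined, the outcome of a round depends only on the quanta's read and write sets, never on which quantum physically finished first.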

there are no overlapping memory accesses that would violate the original deterministic serialization of memory operations. In case of conflict, the quantum gets squashed and re-executed later in the deterministic total order (see quantum 2 in Figure 3a). Notably, the deterministic total order of quantum commits is a key component in guaranteeing deterministic behavior. We call this system DMP-TM.

Another interesting effect of predefined commit ordering is that it lets us use memory renaming to avoid squashes on write-after-write and write-after-read conflicts. For example, in Figure 3a, if quanta 3 and 4 execute concurrently, the store to A in quantum 3 need not squash quantum 4, despite their write-after-write conflict. Having a deterministic commit order also lets us selectively relax isolation, further improving performance by letting quanta forward uncommitted (or speculative) data. This could save a large number of squashes in applications that have more interthread communication. To avoid spurious squashes, we let a quantum fetch speculative data from another uncommitted quantum earlier in the deterministic order. This is shown in Figure 3b, where quantum 2 fetches an uncommitted version of A from quantum 1. Recall that, without forwarding support, quantum 2 would have been squashed. To guarantee correctness, if a quantum that provided data to other quanta is squashed, all subsequent quanta must also be squashed, because they might have directly or indirectly consumed incorrect data. We call a DMP system that leverages support for transactional memory with forwarding DMP-TMFwd.

Building quanta more intelligently

We devised several heuristics to modify our simplistic quantum building algorithm for higher performance. For example, we can end a quantum when an unlock operation occurs. The rationale is that when a thread releases a lock, other threads might be waiting for that lock, so the deterministic token should be sent forward as early as possible to let a waiting thread progress. We discuss other heuristics based on data-sharing patterns in the full conference paper.9

Implementing DMP

DMP-ShTab's sharing table data structure keeps track of the sharing state of data in memory. Our hardware implementation of the sharing table leverages the cache line state maintained by a MESI cache coherence protocol. The local processor considers a line in an "exclusive" or "modified" state to be private, so its owner thread can freely read or write it without holding the deterministic token.
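A minimal software model of the sharing table's token rules, under simplifying assumptions of ours: a line is either shared or privately held by one thread (standing in for MESI exclusive/modified), a store privatizes a line, and both token acquisition and the all-other-threads-blocked condition are abstracted away.

```python
class SharingTable:
    """Illustrative sketch of DMP-ShTab's sharing table, not the hardware."""
    def __init__(self):
        self.state = {}                    # line -> 'shared' | owner thread id

    def needs_token(self, thread, op, line):
        """Must this access serialize on the deterministic token?"""
        state = self.state.get(line, 'shared')
        if state == thread:                # private to this thread:
            return False                   # free to read or write
        if state == 'shared' and op == 'ld':
            return False                   # reads of shared lines are token-free
        return True                        # remote-private, or write to shared

    def update(self, thread, op, line):
        if op == 'st':
            self.state[line] = thread      # a store privatizes the line

tab = SharingTable()
tab.update(0, 'st', 'A')                   # A becomes private to thread 0
assert not tab.needs_token(0, 'st', 'A')   # owner accesses freely
assert tab.needs_token(1, 'ld', 'A')       # remote read must serialize
assert not tab.needs_token(1, 'ld', 'B')   # shared read is token-free
```

The `needs_token` cases mirror the rules in this section: private lines are free for their owner, shared lines are free to read, and everything else falls into a quantum's serial suffix.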



The same applies for a read operation on a line in shared state. Conversely, a thread must acquire the deterministic token before writing to a line in shared state—and moreover, all other threads must be at a deterministic point in their execution (that is, blocked). The memory controller keeps and manages the state of the entries in the sharing table corresponding to lines that aren't cached by any processor—much like a directory in directory-based cache coherence. However, we don't require directory-based coherence per se. When the system services cache misses, this state is transferred. Nevertheless, directory-based systems can simplify DMP-ShTab's implementation even further.

Both DMP-TM and DMP-TMFwd require, on top of standard transactional memory support, a small amount of machinery to enforce a deterministic transaction commit order by allowing a transaction to commit only when the processor receives the deterministic token. DMP-TMFwd additionally requires more elaborate transactional memory support to allow speculative data to flow from uncommitted quanta earlier in the deterministic order. We accomplish this by making the coherence protocol aware of the data version of quanta, much like in versioning protocols used in thread-level speculation systems.10 One interesting aspect of DMP-TM is that if we make a transaction overflow event deterministic, we can use it as a quantum boundary, making a bounded transactional memory implementation perfectly suitable for a DMP-TM system. Making transaction overflow deterministic requires that updates to the speculative state of cache lines happen strictly as a function of memory instruction retirement (that is, updates from speculative instructions aren't permitted). In addition, it requires all nonspeculative lines to be displaced before an overflow is triggered (that is, the state of nonspeculative lines can't affect the overflow decision).

Experimental setup

We assess the performance trade-offs of the different hardware implementations of DMP systems with a simulator written using Pin.11 The model includes the effects of serialized execution, quantum building, memory conflicts, speculative execution squashes, and buffering for a single outstanding transaction per processor. (Our simulations fully serialize quanta execution, affecting how the system executes the program. This accurately models quanta serialization's effects on application behavior.) To reduce simulation time, our model assumes that all the different DMP modes have the same IPC (including squashed instructions). This reasonable assumption lets us compare performance between different DMP schemes using our infrastructure. Even if execution behavior is deterministic, performance might not be deterministic. Therefore, we run our simulator multiple times, averaging the results and providing error bars showing the 90 percent confidence interval for the mean. The comparison baseline (nondeterministic parallel execution) also runs on our simulator.

For workloads, we use the Splash2 and Parsec benchmark suites12,13 and run the benchmarks to completion. We exclude some benchmarks due to infrastructure problems, such as out-of-memory errors and lack of 64-bit compatibility.

Evaluation

Figure 4 shows DMP's performance compared to the nondeterministic, parallel baseline. We ran each benchmark with 16 threads, and a heuristic-based quantum building strategy producing 1,000-instruction quanta while leveraging synchronization events and sharing information. As you might expect, DMP-Serial exhibits slowdown that's nearly linear with the number of threads. The degradation can be sublinear because DMP affects only the parallel portion of an application's execution. DMP-ShTab has 38 percent overhead on average with 16 threads, and in the few cases where DMP-ShTab has larger overheads (for example, the lu-nc benchmark; see Figure 4), DMP-TM provides much better performance. For an additional hardware complexity cost, DMP-TMFwd, with an average overhead of only 21 percent, provides a consistent performance improvement over DMP-TM. The overhead for TM-based schemes is flat for most benchmarks, suggesting that a TM-based system would be ideal for larger DMP systems. Thus, with the

[Figure 4 plots runtime normalized to nondeterministic parallel execution (y-axis from 1.0 to 3.0; bars exceeding the axis are labeled with their values) for barnes, cholesky, fft(P), fmm, lu-nc, ocean-c(P), radix(P), volrend, water-sp, and the Splash gmean, plus blacksch, bodytr, fluid, strmcl, and the Parsec gmean, with one bar each for DMP-TMFwd, DMP-TM, DMP-ShTab, and DMP-Serial.]

Figure 4. Runtime overheads with 16 threads. (P) indicates page-level conflict detection; otherwise, the conflict detection is at line granularity.
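The error bars in Figure 4 come from repeated simulator runs. One way to compute a 90 percent confidence interval is sketched below; the article doesn't specify its exact method, so this uses a normal approximation (z = 1.645) as an illustrative assumption.

```python
import statistics

def mean_with_ci90(samples):
    """Mean and 90 percent confidence half-width over repeated runs,
    via a normal approximation (z = 1.645) of the standard error."""
    m = statistics.mean(samples)
    if len(samples) < 2:
        return m, 0.0                      # no spread estimate from one run
    sem = statistics.stdev(samples) / len(samples) ** 0.5
    return m, 1.645 * sem

# Five simulated runtimes for one benchmark (invented numbers):
mean, halfwidth = mean_with_ci90([1.02, 0.98, 1.05, 1.00, 0.95])
```

For small sample counts a Student's t multiplier would be more conservative than the fixed z value used here.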

right hardware support, deterministic execution's performance can compete with nondeterministic parallel execution.

Full-system implications

Beyond the correctness and performance considerations we've presented, we must also consider the DMP architecture's full-system implications.

Implementation trade-offs

Our evaluation showed that using speculation pays off in terms of performance. However, speculation potentially wastes energy, requires complex hardware, and has implications in system design, because some code (such as I/O and parts of an operating system) can't execute speculatively. Fortunately, DMP-TM, DMP-ShTab, and DMP-Serial can coexist in the same system. One easy way to coexist is to switch modes at a deterministic boundary in the program (such as the edge of a quantum).

Supporting debugging instrumentation

To enable a debugging environment in a DMP system, we need a way of allowing the user to instrument code for debugging while preserving the original execution's interleaving. To accomplish this, implementations must support a mechanism that lets a compiler mark code as being inserted for instrumentation purposes only; such code will not affect quantum building, and thus preserves the original behavior.

Dealing with nondeterminism from the operating system and I/O

A DMP system hides most sources of nondeterminism in today's systems, allowing many multithreaded programs to run deterministically. Slight modifications to existing operating system interfaces can handle some remaining sources of nondeterminism. For example, parallel programs that use the operating system to communicate between threads must do so deterministically, either by executing operating system code deterministically or by detecting communication and synchronization via the kernel and providing it within the application itself. Our simulations use the latter approach. Another fixable source of nondeterminism is operating system APIs that allow unnecessarily nondeterministic outcomes. For example, the return value of a read system call might vary from run to run even if the file being read doesn't change. A solution for read is to always return the maximum amount of data requested until end-of-file.

Ultimately, however, the real world is simply nondeterministic. Programs interact with remote systems and users, which are nondeterministic and can affect thread



interleaving. That might seem to imply that there's no hope for building a deterministic multiprocessor system, but that's not true. When multithreaded code synchronizes, that's an opportunity to be deterministic from that point forward because threads are in a known state. Once the system is deterministic and the interaction with the external world is considered part of the input, it's much easier to write, debug, and deploy reliable multithreaded software.

Support for deployment

Deterministic systems shouldn't just be used for development, but for deployment as well, so that systems in the field will behave like systems used for testing. The reason is twofold. First, developers will be more confident that their programs will work correctly when deployed. Second, if the program does crash in the field, deterministic execution provides a meaningful way to collect and replay crash history data. Supporting deterministic execution across different physical machines places additional constraints on the implementation. Quanta must be built the same across all systems. This means machine-specific effects (such as micro-op count, or a full cache set for bounded TM-based implementations) can't be used to end quanta. Furthermore, passing the deterministic token across processors must be the same for all systems. This suggests that DMP hardware should provide the core mechanisms and leave the quanta building and control up to software.

Perhaps contrary to popular belief, a shared-memory multiprocessor system can execute programs deterministically with little performance cost. We believe that deterministic multiprocessor systems are a valuable goal; besides yielding several interesting research questions, they abstract away several difficulties in writing, debugging, and deploying parallel code. MICRO

Acknowledgments

We thank Karin Strauss, Dan Grossman, Susan Eggers, Martha Kim, Andrew Putnam, Tom Bergan, and Owen Anderson from the University of Washington for their feedback on this article. Finally, we thank Jim Larus and Shaz Qadeer from Microsoft Research for very helpful discussions. Other members of our group involved in determinism-related projects are Owen Anderson, Tom Bergan, Nick Hunt, Steve Gribble, and Dan Grossman.

References

1. D.F. Bacon and S.C. Goldstein, "Hardware-Assisted Replay of Multiprocessor Programs," Proc. 1991 ACM/ONR Workshop Parallel and Distributed Debugging, ACM Press, 1991, pp. 194-206.
2. P. Montesinos, L. Ceze, and J. Torrellas, "DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently," Proc. 35th Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 289-300.
3. S. Narayanasamy, C. Pereira, and B. Calder, "Recording Shared Memory Dependencies Using Strata," Proc. 12th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 06), ACM Press, 2006, pp. 229-240.
4. M. Ronsse and K. De Bosschere, "RecPlay: A Fully Integrated Practical Record/Replay System," ACM Trans. Computer Systems, vol. 17, no. 2, 1999, pp. 133-152.
5. M. Xu, R. Bodik, and M.D. Hill, "A 'Flight Data Recorder' for Enabling Full-System Multiprocessor Deterministic Replay," Proc. 30th Int'l Symp. Computer Architecture (ISCA 03), IEEE CS Press, 2003, pp. 122-135.
6. T. Bergan et al., "CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution," to be presented at 15th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2010, www.cs.washington.edu/homes/tbergan/papers/asplos10-coredet.pdf.
7. L. Hammond et al., "Transactional Memory Coherence and Consistency," Proc. 31st Int'l Symp. Computer Architecture (ISCA 04), IEEE CS Press, 2004, pp. 102-114.
8. M. Herlihy and J.E.B. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," Proc. 20th Int'l Symp. Computer Architecture (ISCA 93), IEEE CS Press, 1993, pp. 289-300.
9. J. Devietti et al., "DMP: Deterministic Shared Memory Multiprocessing," Proc. 14th Int'l

Conf. Architectural Support for Programming from the University of Illinois at Urbana- Languages and Operating Systems (ASPLOS Champaign. He is a member of the ACM 09), ACM Press, 2009, pp. 85-96. and the IEEE. 10. S. Gopal et al., ‘‘Speculative Ver- Mark Oskin is an associate professor in the sioning Cache,’’ Proc. 4th Int’l Symp. High- and Engineering Depart- Performance Computer Architecture (HPCA ment at the University of Washington. He is 98), IEEE CS Press, 1998, pp. 195-205. currently on leave at PetraVM, a start-up 11. C.K. Luk et al., ‘‘Pin: Building Customized commercializing many of the ideas behind Program Analysis Tools with Dynamic Instru- DMP. His research interests include parallel mentation,’’ Proc. 2005 ACM SIGPLAN computing, communication networks, and Conf. Programming Language Design silicon manufacturing. Oskin has a PhD in and Implementation, ACM Press, 2005, computer science from the University of pp. 190-200. California at Davis. 12. S. Woo et al., ‘‘The SPLASH-2 Programs: Characterization and Methodological Con- Direct questions and comments to siderations,’’ Proc. 22nd Int’l Symp. Com- Joseph Devietti, Computer Science & puter Architecture (ISCA 95), IEEE CS Engineering, Univ. of Washington, Box Press, 1995, pp. 24-36. 352350, Seattle, WA 98195-2350; devietti@ 13. C. Bienia et al., ‘‘The PARSEC Benchmark cs.washington.edu. Suite: Characterization and Architectural Implications,’’ Proc. 17th Int’l Conf. Parallel Architectures and Compilation Techniques, ACM Press, 2008, pp. 72-81.

Joseph Devietti is a PhD candidate in the Computer Science and Engineering De- partment at the University of Washington. His research interests include parallel- programming models and the intersection of architecture and programming languages. Devietti has an MS in computer science and engineering from the University of Washington. He is a member of the ACM.

Brandon Lucia is a PhD candidate in the Computer Science and Engineering Depart- ment at the University of Washington. His research interests include hardware support for dynamic analysis of concurrent programs, with a focus on debugging and increasing software reliability. Lucia has an MS in computer science and engineering from the University of Washington. He is a member of the ACM. Luis Ceze is an assistant professor in the Computer Science and Engineering Depart- ment at the University of Washington. His research interests include computer architec- ture, programming languages, and compilers and operating systems to improve program- mability and reliability of multiprocessors. He is a cofounder of PetraVM, a start-up company formed around the DMP technol- ogy. Ceze has a PhD in computer science ......
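The quantum-building policy argued for in the "Support for deployment" section — machine-independent quantum boundaries plus a deterministic token that serializes quanta identically on every system — can be illustrated with a toy software sketch. This is not the DMP hardware design; it is a minimal Python analogy in which a quantum ends after a fixed logical-operation count (never after a machine-specific event), and a token passes round-robin among threads. All names and the quantum size are illustrative assumptions.

```python
import threading

# Illustrative sketch only: quantum boundaries are defined by a deterministic
# logical-operation count (QUANTUM_SIZE), never by machine-specific effects
# such as micro-op counts or cache-set overflow, so every machine ends quanta
# at the same points. A deterministic token serializes quantum execution.
QUANTUM_SIZE = 3   # same on every machine, by construction
NUM_THREADS = 2

token = 0                   # which thread may execute its next quantum
cv = threading.Condition()
log = []                    # the observed interleaving (for the demo)

def run_thread(tid, ops):
    """Execute `ops` in quanta of QUANTUM_SIZE, gated by the token."""
    global token
    executed = 0
    while executed < len(ops):
        with cv:
            while token != tid:         # wait for the deterministic token
                cv.wait()
            # Run exactly one quantum: QUANTUM_SIZE logical operations.
            for op in ops[executed:executed + QUANTUM_SIZE]:
                log.append((tid, op))   # stands in for a shared-memory op
            executed += QUANTUM_SIZE
            token = (token + 1) % NUM_THREADS   # pass token round-robin
            cv.notify_all()

# Equal work per thread keeps the round-robin rotation simple; a real
# system must also handle threads that finish early or block.
threads = [threading.Thread(target=run_thread, args=(i, list(range(6))))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(log)  # identical on every run, regardless of OS scheduling
```

Because the token, not the OS scheduler, decides who runs next, the logged interleaving is the same on every execution and every machine — the property the article requires for deploying deterministic systems in the field.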

JANUARY/FEBRUARY 2010 49