
A Generalized Reduction Construct for Deterministic OpenMP

Amittai Aviram and Bryan Ford
{amittai.aviram,bryan.ford}@yale.edu

January 28, 2012

Abstract

In parallel programming, a reduction is an operation that combines values across threads into a single result, and can be designed and implemented so as to enforce determinism, not only on the result, but also on the intermediate values and evaluation sequence. These features make the reduction an attractive feature for a language-based approach to deterministic parallelism, especially one that, like OpenMP, provides annotations to parallelize legacy serial code. Hence reductions are a planned feature of the Deterministic OpenMP (DOMP) project. To enable DOMP to help programmers eliminate nondeterministic code wherever possible, we propose a generalized reduction that supports arbitrary data types and user-defined operations—a generalization rich in challenges for language design and implementation.

1 Introduction

Deterministic OpenMP (DOMP) is a recently proposed approach to the problem of achieving efficient, flexible, user-friendly deterministic parallelism. It brings together previous developments from the two areas of programming languages and systems in order to offer a combination of convenience to the programmer and strict enforcement of race-freedom and determinism guarantees.

On the language side, DOMP is based on the well-known OpenMP specification [10, 22], a set of annotations and function calls that can be added to legacy serial code in languages such as C, C++, or Fortran, enabling the programmer to parallelize it conveniently and incrementally. OpenMP offers a range of constructs sufficiently broad to provide reasonable expressiveness in parallelizing code.

On the systems side, DOMP draws from the ideas of Workspace Consistency and working-copies determinism [4], a memory and programming model that eliminates race conditions by handling shared data, roughly speaking, as versioning systems handle shared documents—providing an isolated working copy for each thread, merging these copies at synchronization points, detecting any conflicting writes at merge time, and treating them as errors.

The OpenMP standard supports a reduction feature, which makes it possible to aggregate computed data across threads—for instance, to get the sum or product of individual values computed severally by concurrent threads. The reduction is an attractive feature for DOMP. Although it may look like a purposeful race condition, because each thread seems to update the reduction variable at the same time, it is inherently deterministic: a sum or product will always be the same on every run with the same input, regardless of scheduling accidents, so long as the individual summands or factors are the same.

Reductions fit in well with DOMP's goals. But, in order to be compatible with working-copies determinism, DOMP's API must exclude certain other features of OpenMP because they are naturally nondeterministic—low-level race management constructs such as atomic, critical, and flush. Research suggests, however, that programmers often use these features only in order to implement, by hand, higher-level idioms that are, themselves, inherently deterministic, but for which OpenMP offers no ready-made constructs. One such case is a more general form of reduction. OpenMP only supports reductions for scalar value types, such as int and double, and for a restricted set of simple combining operations, such as addition (for sums) and multiplication (for products). If DOMP offered a generalized reduction for arbitrary types, including those accessible only by indirection, and for user-defined binary operations, then, we believe, programmers would be able to avoid many instances of unsafe, low-level synchronization abstractions.

A generalized, deterministic reduction construct comes with its own new questions and challenges, however, which the rest of this paper considers. Are there any constraints at all that DOMP must place upon the user-defined operations that a reduction can support? What should the API look like and what are possible tradeoffs in its semantics? What should the fixed evaluation order be? How might it be implemented efficiently so as to take advantage of parallelism while safeguarding determinism? What sort of threading model is best suited for this purpose? What are the other major implementation challenges and how might we address them?

Section 2 reviews the background to these questions, and related work, in greater detail; Section 3 considers design-related challenges, such as the definition of the API; Section 4 does the same with implementation challenges; Section 5 reports on the project's current state; and Section 6 concludes.

2 Background

Research has established the desirability of achieving deterministic parallelism at a reasonable cost to performance [5, 11, 21]. Various teams have proposed distinct solution approaches, of which Deterministic OpenMP [3] is a recent one whose motivation we review here. We then take up the importance of reductions in general and within DOMP in particular.

2.1 Deterministic OpenMP

Among the earliest efforts to provide deterministic parallelism, language-based approaches are promising, but may apply only to special purposes, or may entail new and unfamiliar languages, requiring extensive rewriting of legacy code [1, 2, 6, 12, 17, 19, 23, 24]. In order to support legacy code, other teams have developed deterministic replay systems, but these are generally more practical for development and debugging than for production systems [14, 16, 20, 25]. Deterministic scheduling systems also support legacy code [7, 8, 9, 15], but only allow race conditions to be reproduced (and perhaps silently executed) without eliminating them [3, 4, 5].

Working-copies determinism also supports legacy code in languages such as C, but it eliminates race conditions by catching them and treating them as runtime errors. Working-copies determinism is based on the Workspace Consistency memory model [4], and underlies the design of the Determinator operating system [5]. Workspace consistency requires that the programming language, implementation, and runtime system all remain naturally deterministic, meaning that interprocess or inter-thread synchronization events must occur in a manner that the program alone specifies, wholly immune to the scheduler or any hardware timing effects. Accordingly, a compatible program eschews such naturally nondeterministic synchronization abstractions as mutex locks and condition variables, relying instead on deterministic constructs such as fork, join, and barrier. The execution environment then enforces race freedom by providing each concurrent thread with its own private logical copy of shared state at the fork, and merging the changes written by these threads into the emerging parent thread at join or barrier, checking for conflicting writes in the process and treating any such conflicts as errors. While the Determinator project demonstrated the efficient implementation of working-copies determinism, the programming model, constrained to this small set of synchronization abstractions, suffered from a lack of expressiveness and flexibility.

A deterministic version of OpenMP may remedy this shortcoming. The OpenMP standard defines a set of constructs, optional clauses, and routines that a programmer can add to a program written in a legacy language such as C, C++, or Fortran, in order to parallelize it easily and incrementally. The constructs annotate structured blocks. The fork-join synchronization pattern and the orientation toward structured blocks make for a convenient fit between working-copies determinism and OpenMP. For a deterministic version truly compatible with working-copies determinism, we must exclude the relatively few naturally nondeterministic constructs, such as atomic, critical, and flush, and provide an implementation that applies working-copies semantics to the remaining features. This is the Deterministic OpenMP (DOMP) project [3].

While DOMP promises greater expressiveness and ease of adaptation of legacy code for working-copies determinism, its exclusion of nondeterministic OpenMP features might seem to limit its usefulness in adapting real-world code. The research supporting DOMP, however, suggests that many instances of naturally nondeterministic synchronization abstractions actually arise from the need to implement higher-level idioms that are, themselves, naturally deterministic in principle, but for which the available language constructs do not offer direct support. With two such features, generalized reductions and pipelines, most uses of nondeterministic OpenMP constructs can be eliminated. The first of these is the concern of this paper.

2.2 Generalized Reductions

A reduction, also called a fold in functional programming, is a kind of higher-order operation whose special features make it a potentially useful language construct for deterministic parallelism. Given a data structure, such as a list z whose elements have type T
and a binary "combining" function f : (T × T) → T, a reduction r : (f × [T]) → T applies f recursively to the elements of z, first to the first pair, and then to the result and the next element, etc., to arrive at the final result z_f:

    z_f = r(z) = f(f(. . . f(f(z_0, z_1), z_2), . . .), z_{n-1})

Parallel threads might compute the respective values of the list elements z_0, . . . , z_{n-1} concurrently, while r itself, expressing a series of dependencies, can be controlled by careful synchronization. Moreover, if the combining operation is both associative and commutative, the end result z_f remains the same regardless of the order of intermediate evaluation steps. Hence, so long as the intermediate results remain opaque and inaccessible, the entire reduction operation is effectively deterministic, even in an implementation that does nothing special to enforce or guarantee determinism.

This is the case with OpenMP, whose parallelizing constructs annotate otherwise serial structured blocks in a C, C++, or Fortran program. The optional reduction clause included in such an annotation specifies an operation and one or more variables, such that the compiled code will apply the reduction to the thread-local instances of that variable:

    omp_set_num_threads(4);
    int x = 0;
    #pragma omp parallel reduction(+:x)
    { x += omp_get_thread_num() + 1; }
    // x == 1 + 2 + 3 + 4 == 10

Semantically, reduction indicates that each thread in the ensuing block gets its own private copy of each reduction variable, initialized to the operation's identity value, such as 0 for +. At the end of the block, OpenMP combines all these private values using the specified operator, and combines the result with that variable's initial value from before the parallel block. OpenMP makes no guarantees about the evaluation order and thus treats intermediate values as undefined or indeterminate—but the final result will always be the same.

A truly deterministic parallel framework could, however, fix—and document—a predetermined evaluation order, so that intermediate states would be reproducible. Such a prescribed order could, itself, preserve some parallelism, e.g., combining values in a binary tree pattern, each rank of which executes in parallel. This would make sense for reduction semantics in DOMP.

OpenMP only supports reductions on scalar value types and a small set of simple, associative and commutative, arithmetic or logical operations: +, -, *, &, |, &&, ||, ^. (The operator - counts as commutative because the initial value for the reduction is 0; the initial value given in the program is added at the end.) These constraints make implementation fairly straightforward; but what if we want more for DOMP? For instance, vector addition, which is commutative and associative, seems but a small step away; but, without support for pointers and aggregate types, we must implement the equivalent using low-level, naturally nondeterministic and therefore error-prone, synchronization abstractions, such as the atomic or critical construct. Furthermore, if we follow a fixed evaluation order and make it known to the programmer, we can even relax the constraints on operations, and thus extend the usefulness of reductions. For instance, matrix multiplication is associative but not commutative. If the evaluation order is not only known but reasonably intuitive with respect to the written program, the programmer can put the factor matrices in the right places and enjoy the benefits of deterministic parallelism. Such a design follows the "principle of least surprise": although the evaluation order is fixed independently of the operation and data, the programmer can work with it so as to get the correct results—reliably and reproducibly.

3 Design Challenges

Defining an appropriate reduction feature for DOMP raises several issues in language design and its implications for possible implementation. In particular, we consider (a) the constraints on allowable combination functions, (b) the choice of a suitably general API, and (c) possible evaluation orders.

3.1 Constraints on Functions

In saying that DOMP reductions should support "arbitrary" user-defined combination functions, we assume some unspoken constraints. For instance, the definition of reduction (Section 2.2) stipulates that the function's two arguments and return value all have the same type. In addition, the OpenMP specification's description of the semantics of reductions requires that the combination function have an identity element of the same type. This is necessary for OpenMP's way of avoiding the duplication of the initial value to work: each thread's private copy of the reduction variable must be initialized to the identity value, and the original initial value must be brought into the computation at the end. There are other ways to avoid duplication—for instance, by applying the inverse of the combination function whenever the master joins any other thread in the team [13], an approach which generalizes beyond OpenMP's team threading model to arbitrary forking and joining patterns. In that case, however, every function must have an inverse, and it is generally easier to find an identity element than an inverse function. Multiplication of square matrices has a known identity element, but an inverse may not exist; if it does, finding it may not be trivial.

As noted, OpenMP requires that the combination function be both commutative and associative, but we propose relaxing this constraint to allow functions that are not commutative, such as matrix multiplication, so long as the programmer knows the evaluation order in advance. Must the function be associative? If not, we might get incorrect results when parallelizing intermediate evaluation steps. Fortunately, this constraint is not hard to satisfy in most ordinary programs.

3.2 API

Since OpenMP and DOMP work with C-style languages, supporting arbitrary and not just scalar value types means having to deal with pointers. To keep the API simple, then, we propose that the combination function have the most general of signatures:

    void f(void * cum, const void * new);

A call f(c, n) changes the value of c as a side effect to reflect the application of f on c and n. While this specification accommodates composite types and keeps the rules straightforward, it means that simple scalars would have to be passed effectively by reference rather than by value, slightly complicating the code.

Since the user may define any combination function within these constraints, he or she must supply the appropriate identity element, as well as the reduction variable. Moreover, DOMP's working-copies determinism model requires that the runtime make thread-local copies of the identity element, and, for each thread, re-point the reduction variable to point to this local, or scratch, copy. (This is equivalent to OpenMP's initializing thread-local copies of the reduction variable to the identity value.) The scratch object serves as an accumulator in the merge process. At the end of the evaluation sequence, the master passes the reduction variable's original object and the scratch object, containing the penultimate cumulative result, to the combination function; stores the final result in the original object's location; and re-points the reduction variable again to point there. And, since the master must make these local scratch copies of an arbitrary identity object, it is most practical to store the identity object in contiguous memory, and inform the master of the object's size.

The reduction call itself, then, takes the form

    omp_reduction(void (*f)(void *, void *),
                  void ** var, void ** identity,
                  size_t size);

The master saves the reduction variable pointer's address, the address of the original reduction variable object, the scratch object's address, and f in a reduction_var data structure for use during evaluation.

3.3 Evaluation Order

A good fixed evaluation order should try to satisfy two different demands: (a) intuitive clarity for the programmer, and (b) opportunities to parallelize the evaluation as much as possible. For the first, it makes sense to have the order of evaluation follow the order of thread IDs, provided that the programmer can know or discover this. For instance, given a number of sections within a sections construct less than or equal to the number of available threads, DOMP could guarantee that the first section went to thread 0 (the master), the second to thread 1, etc., and that the runtime's evaluation order would be equivalent to that of the sequential execution of the sections by a single thread.

The simplest way of going about this is to have the master, at the end of a parallel block, aggregate the result sequentially. Using the symbol ⊕ to represent the combination function, we then would have

    z_f = (. . . ((z_0 ⊕ z_1) ⊕ z_2) ⊕ . . . ) ⊕ z_{n-1}

In fact, GCC's libgomp implementation serializes in this manner, but without any fixed order at all, having each thread use low-level atomic instructions to update the original reduction variable after it has finished computing on its private copy. But this approach, even if made deterministic, foregoes parallelism entirely. Using the function's associativity, we can equivalently evaluate in pairs, and then again in pairs of results, etc., following a binary tree. For 8 threads, for instance,

    z_f = (((z_0 ⊕ z_1) ⊕ (z_2 ⊕ z_3)) ⊕ ((z_4 ⊕ z_5) ⊕ (z_6 ⊕ z_7)))

Note that this evaluation order does not move the elements from their original order and therefore does not imply commutativity.

As it happens, this binary tree coincides nicely with an efficient implementation of the merge phase in working-copies determinism.

[Figure 1 shows two binary merge trees over thread IDs, rooted at thread 0: (a) for threads 0-6 and (b) for threads 0-7.]

Figure 1: DOMP merge scheme for (a) 7 and (b) 8 threads. Efficiently parallel evaluation of reductions can coincide with merging.

In this scheme, every even-numbered thread merges another thread into itself at least once. Threads whose IDs are multiples of higher powers of two repeat this procedure, and the master, whose ID is 0, merges into itself, in order, every thread whose ID is a power of two. Figure 1 represents this scheme for 7 and 8 threads, respectively. Every rank of the tree corresponds to parallel execution. For reductions, each merge operation includes an application of the combination function f for the two versions of reduction variable v:

    f(v_self, v_other)

At the end, the master applies

    f(v_init, v_self)

This scheme preserves the program's expressed order.

4 Implementation Challenges

The DOMP project in general, and the DOMP reduction in particular, present a number of challenges for implementation. Whereas conventional OpenMP implementations such as GCC's libgomp use an underlying shared-process threading model based on pthreads, this may not be ideally suited to our needs. Whether the implementation uses lightweight threads or processes, moreover, the performance may benefit from maintenance of a thread pool even after the first parallel block is finished, for use in later parallel blocks. Finally, at the end of each parallel block, at merge time, the thread merging another thread's data into its own (while checking for conflicting writes) would ordinarily always spot a conflict on the reduction variable, unless the latter is given special treatment. We now consider each of these issues.

4.1 Threads or Processes

Without working-copies determinism, and if the allowable combining functions and their respective identity values are narrowly constrained as they are in standard OpenMP, it is practical for the compiler to create thread-local "copies" of each reduction variable by simply allocating space in each thread's stack and initializing it to the known default value. One of the compiler's transformation steps could turn this code

    int x = 42;
    #pragma omp parallel reduction(+:x)
    { x += 3; }

into something like this:

    struct data_t { int x; } data;
    int x = 42;
    data.x = x;
    for (int i = 0; i < num_threads; ++i)
        pthread_create(&threads[i], &attr, &f0, &data);
    /* ... */

    void * f0(void * arg) {
        struct data_t * data = arg;
        int x = 0;
        x += 3;
        atomic_op(data->x += x);
    }

But for working-copies determinism, each of DOMP's threads needs to have thread-private copies of all shared state, including objects to which pointer variables refer. Since the latter makes object-by-object copying impossible, the compiler or runtime would have to copy entire memory regions (data segments) for each thread, along with thread-specific address translations. A runtime virtual machine might do the address translation with reasonable efficiency [18]. Alternatively, and perhaps more simply, the implementation could spawn new processes using Unix fork, which would allow each thread to inherit its own private copy of shared state automatically.

This approach shifts the complexity away from copying and toward interprocess communication and synchronization at merge time—and afterwards, to broadcast updates to shared state, if we maintain a thread pool for the next parallel block (discussed below).

4.2 Thread Pools

Once the master process has spawned enough child processes to make a team, it seems clearly advantageous to keep the team around after the end of a parallel block, in order to avoid spawning it anew on a later parallel block. The allocation and initialization of reduction variable and identity objects can occur in any serial execution context, including one between two parallel blocks. In this case, the master must update processes waiting in the thread pool with the changed state before they restart parallel execution. These updates include, not only the objects themselves, but also the values of the variables pointing to them. We can accomplish this with reasonable efficiency by copying the relevant virtual memory pages from the master's address space to a scratch file, and then having the team threads copy those pages from the scratch file to their own address spaces, asynchronously and in parallel, before resuming execution. DOMP uses the same mechanism to update waiting pooled threads about dynamic memory allocation or other state changes between parallel blocks. A common way of catching all such changes in state is to write-protect the master's data segments between parallel blocks, trap on each write, and record the page of each trap. Since the relevant portion of the stack is usually small (spanning two pages at most), the master can include the stack in updates by default, thus avoiding at least some traps.

4.3 Merging around the Reduction

At the heart of the merge process in DOMP's implementation of working-copies determinism is a loop that goes over each byte (or other-size chunk, depending on the merge granularity) of the thread's own data (self), the other thread's corresponding data (other), and a pristine reference copy (ref), tucked away at the start of the parallel block and representing that earlier state of the data:

    for (; self != end; ++self, ++other, ++ref) {
        if (*other != *ref) {
            if (*self != *ref)
                race_exception();
            else
                *self = *other;
        }
    }

When the loop reaches the address of the identity object to which the reduction variable points, it will surely detect a race condition—not what we intend! To avoid this problem, each merging thread resolves all reductions with its paired thread before entering the merge loop. Then, when the merge loop detects a conflict, it first checks the conflicting address against the list of scratch objects. If a conflicting address falls within the range of a scratch object, the merge loop skips forward to that object's end. With the globally visible list of reduction_var structures sorted on scratch object addresses, the merging thread can evaluate, check, and skip with reasonable efficiency.

5 Current State

We currently have a complete implementation of the core features of DOMP, including generalized reductions, and have successfully tested cumulative matrix multiplication. To implement DOMP, we leveraged GCC's parsing mechanism by altering only its OpenMP support library libgomp, replacing the relevant GNU implementations with our own DOMP implementations of libgomp's internal API. We plan soon to demonstrate and evaluate the replacement of unsafe low-level synchronization constructs with DOMP reductions in the NPB benchmarks.

6 Conclusion

A generalized reduction is an attractive feature for Deterministic OpenMP, one that furthers its goal of combined convenience and strict, race-free deterministic parallelism. Although the design and implementation challenges are not trivial, we have seen possible ways to overcome them, so as to help fulfill the promise of DOMP and working-copies determinism.

References

[1] W. B. Ackerman. Data flow languages. Computer, 15(2):15–25, Feb 1982.

[2] Zachary Anderson, David Gay, Rob Ennals, and Eric Brewer. SharC: checking data sharing strategies for multithreaded C. In PLDI '08: Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, pages 149–158, New York, NY, USA, 2008. ACM.

[3] Amittai Aviram and Bryan Ford. Deterministic OpenMP for race-free parallelism. In 3rd HotPar, May 2011.

[4] Amittai Aviram, Bryan Ford, and Yu Zhang. Workspace Consistency: A programming model for shared memory parallelism. In 2nd WoDet, March 2011.

[5] Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford. Determinator: OS support for efficient deterministic parallelism. In 9th OSDI, October 2010.

[6] David F. Bacon, Robert E. Strom, and Ashis Tarafdar. Guava: a dialect of Java without data races. In OOPSLA '00, pages 382–400, 2000.

[7] Tom Bergan et al. CoreDet: A compiler and runtime system for deterministic multithreaded execution. In 15th ASPLOS, March 2010.

[8] Tom Bergan et al. Deterministic process groups in dOS. In 9th OSDI, October 2010.

[9] Emery D. Berger et al. Grace: Safe multithreaded programming for C/C++. In OOPSLA, October 2009.

[10] OpenMP Architecture Review Board. OpenMP application program interface version 2.5, May 2005. http://www.openmp.org/mp-documents/spec25.pdf.

[11] Robert L. Bocchino et al. Parallel programming must be deterministic by default. In HotPar. USENIX, March 2009.

[12] Robert L. Bocchino et al. A type and effect system for deterministic parallel Java. In OOPSLA, October 2009.

[13] Sebastian Burckhardt, Alexandro Baldassin, and Daan Leijen. Concurrent programming with revisions and isolation types. In OOPSLA 2010, pages 691–707, 2010.

[14] Jong-Deok Choi and Harini Srinivasan. Deterministic replay of Java multithreaded applications. In SPDT '98: Proceedings of the SIGMETRICS symposium on Parallel and distributed tools, pages 48–59, New York, NY, USA, 1998. ACM.

[15] Joseph Devietti et al. DMP: Deterministic shared memory multiprocessing. In 14th ASPLOS, March 2009.

[16] George W. Dunlap et al. Execution replay for multiprocessor virtual machines. In VEE, March 2008.

[17] Stephen A. Edwards and Olivier Tardieu. SHIM: A deterministic model for heterogeneous embedded systems. Transactions on VLSI Systems, 14(8):854–867, August 2006.

[18] Bryan Ford and Russ Cox. Vx32: Lightweight user-level sandboxing on the x86. In USENIX, June 2008.

[19] P. Hudak. Para-functional programming. Computer, 19(8):60–70, August 1986.

[20] Thomas J. LeBlanc and John M. Mellor-Crummey. Debugging parallel programs with instant replay. IEEE Transactions on Computers, C-36(4):471–482, April 1987.

[21] E. A. Lee. The problem with threads. Computer, 39(5):33–42, May 2006.

[22] OpenMP Architecture Review Board. OpenMP application program interface version 3.0, May 2008.

[23] Martin C. Rinard and Monica S. Lam. The design, implementation, and evaluation of Jade. ACM Trans. Program. Lang. Syst., 20(3):483–545, 1998.

[24] Paul Roe. Parallel Programming using Functional Languages. PhD thesis, University of Glasgow, Feb 1991. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.7763&rep=rep1&type=pdf.

[25] Min Xu, Rastislav Bodik, and Mark D. Hill. A "flight data recorder" for enabling full-system multiprocessor deterministic replay. In ISCA '03, pages 122–135, June 2003.
