A Generalized Streaming Model for Concurrent Computing
Yibing Wang (author contact: [email protected])

arXiv:1012.1641v1 [cs.PL] 7 Dec 2010. First draft: Jan. 31, 2010; this draft: Dec. 07, 2010.

Abstract: Multicore parallel programming has some very difficult problems, such as deadlocks during synchronization and race conditions brought by concurrency. Added to the difficulty is the lack of a simple, well-accepted computing model for multicore architectures, which makes it hard to develop powerful programming environments and debugging tools. To tackle these challenges, we promote a generalized stream computing model, inspired by previous research on stream computing, that unifies parallelization strategies for programming language design, compiler design and operating system design. Our model provides a high-level abstraction for designing language constructs that convey concepts of concurrent operations, for organizing a program's runtime layout for parallel execution, and for scheduling concurrent instruction blocks through runtimes and/or operating systems. In this paper, we give a high-level description of the proposed model: we define the foundation of the model, show its simplicity through algebraic/computational operation analysis, illustrate a programming framework enabled by the model, and demonstrate its potential through powerful design options for programming languages, compilers and operating systems.

1 Introduction

1.1 Motivation

When searching for a general approach to the challenges [2] faced by multicore programming, particularly deadlocks during synchronization and race conditions [47] introduced by concurrency, which render solutions built on conventional multithreading models [13, 52, 68] unappealing to programmers, we found that there was a need for a unified computing model from which computer professionals at the programming language design level, the compiler design level, and the operating system design level could all benefit from each other's work; a generalization of the stream computing model [12] seemed to have the potential to serve as such a design model for future generation computing. Some reasons are explained as follows.

Information processing on modern computers has never followed a monotonous path, but the mainstream has come along a relatively unified way due to the Turing machine abstraction in theory [26] and the von Neumann model in design [3]. The adoption of multicore architectures, however, brings the industry into a new period of chaos at various computing levels, ranging from programming language design to operating system design. Not only are programmers often frustrated with primitive solutions or the lack of ideal software tools (e.g. programming languages and debuggers), but chip makers are also facing challenges in delivering new designs [34, 48]; meanwhile, tool builders are constrained by the multithreading model, as in the cases of OpenMP [52] and the C++0x thread library extension [68], or by imperative languages, as in the cases of CUDA [49, 51] and OpenCL [31]. To break this impasse, in which researchers and developers at different computing levels have uncertain expectations of each other, a well-accepted computing model is needed first, i.e. a model that can guide designs at all of these levels.

PRAM [19] has often been used as a conceptual parallel computing model for demonstrating synchronized executions such as concurrent-read concurrent-write and concurrent-read exclusive-write. We argue that PRAM captures certain essences of concurrent computing, such as parallelism (1), but not other substances, such as precedence constraints in operation; and the very nature of its dependence on shared memory limits PRAM's power in design.

(1) In this paper, concurrent and parallel are interchangeable in most cases, though we tend to regard concurrent as having a broader meaning in computing than parallel does. The authors of [16] think parallel systems are mostly synchronous, whereas concurrent systems also include distributed cases that are mostly asynchronous.
Therefore, if PRAM is only a generalization (or parallelization) of the RAM model [26] (a Turing machine equivalent) for abstract analysis, what is the parallel-form generalization of the von Neumann model and its alternatives for design? Conventional threads and transactional memory [11, 24, 35] are not the answer; they are implementation details with little extension to the von Neumann model. Many people agree that threads are low-level programming utilities and should remain so; transactional memory may be neither computationally productive nor energy-efficient when transactions are large in size or long in time [24].

Stream computing [12, 17, 33, 44] has been around for more than a decade. Perhaps due to their humble origin in the computer industry (i.e. graphics processors, not their noble originators), technologies based on stream computing were often used only as performance accelerators. While surveying the landscape of multicore parallel computing, we realized that the application of the stream computing model could go beyond even GPGPU computing, and that a generalization of the model could be applied to general purpose software design and programming as well. Note that our pursuit of a general design model was not to find a killer solution (i.e. the implementation of a certain method) for all applications, but to find a design principle for handling some of the most difficult yet fundamental problems, such as parallelization and nondeterminism.

To serve our purpose, we recommend separating computing models for abstract analysis from computing models for generalized design. In [57], Snyder expressed a similar idea but from a different angle; he emphasized that a useful parallel machine model should be able to capture key features that have an impact on performance, which to us is a design (model) issue. Based on such understanding, plus other people's publications [10, 11, 20, 25, 35, 36, 37, 40] and our own experiences (for example, the author wrote his first multithreaded server in 1997; part of that early experience was published in [64]), we propose a generalized stream computing model (hereinafter known as GeneSC, pronounced "g-e-n-e-s-i-s") as a guideline for multicore computing. Our expectations of the model can be summarized as follows.

• It will help with defining language constructs for expressing concurrency structures in programs and algorithms, while leaving parallelization and synchronization controls to compilers, runtimes and/or operating systems.

• It will bring parallelizing compilers out of the confinement of sequential execution constructs such as loop nests.

• It will bring in operating system designs for tractable, reliable and concurrent execution of instruction sets.

• It will support backward compatible programming schemes, be they structured, object-oriented, or functional, and the re-use of existing (modified when necessary) library routines.

• It can be reduced to the sequential case for debugging purposes or when used on single-core architectures.

Furthermore, we regard algorithm design as an inseparable part of achieving high performance in multicore parallel computing. We believe that, due to different algorithmic thinking, solutions to an application problem may show different concurrency structures, which have different levels of difficulty when mapped to the underlying runtime environment and hardware. By supplying an expressive computing model, we give programmers the power they need to define the least convoluted concurrency structures in their solutions. We use the following example to further justify our motivation.

1.2 Example

A naive solution to the N-body problem [5, 66] has a computational complexity of O(N²). Figure 1 shows such an algorithm, where gforce() is a function that calculates the gravitational forces from the other N − 1 objects on the jth object, following the Newtonian laws of physics. To parallelize this algorithm, we often rely on loop nest parallelization, by hand or through compilers [1, 18].

    /* calculate gravitational forces in time
       sequence from zero to tmax */
    for (ti = 0; ti < tmax; ti++)
    {
        for (j = 0; j < N; j++)
        {
            f = gforce(j);
            vnew[j] = v[j] + f * dt;
            xnew[j] = x[j] + vnew[j] * dt;
        }
        for (j = 0; j < N; j++)
        {
            v[j] = vnew[j];
            x[j] = xnew[j];
        }
    }

Figure 1: The N-body algorithm of complexity O(N²).
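As an illustration of such loop nest parallelization (a sketch of ours, not part of the original figure), the two inner j-loops can be annotated with OpenMP [52] pragmas. The function name nbody_step and its explicit parameter list are assumptions added here to make the fragment self-contained, and gforce() is assumed to read the shared position state as in Figure 1.

    /* A minimal sketch of hand parallelization of the Figure 1 loops with
       OpenMP; each iteration of j is independent, so the iterations can be
       distributed across threads. Compile with, e.g., -fopenmp. */
    #include <omp.h>

    extern double gforce(int j);   /* force on object j from the other N-1 objects */

    void nbody_step(int N, double dt, double v[], double x[],
                    double vnew[], double xnew[])
    {
        #pragma omp parallel for
        for (int j = 0; j < N; j++) {
            double f = gforce(j);
            vnew[j] = v[j] + f * dt;           /* update velocity */
            xnew[j] = x[j] + vnew[j] * dt;     /* update position */
        }
        #pragma omp parallel for
        for (int j = 0; j < N; j++) {          /* commit the new state */
            v[j] = vnew[j];
            x[j] = xnew[j];
        }
    }

Such annotations recover the data parallelism of the inner loops, but they remain tied to sequential loop constructs; the expectation stated above is that the model itself, rather than hand-placed pragmas, should expose the concurrency structure to compilers and runtimes.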
Instead of doing O(N²) direct force integrations, an improved design [5] recursively divides the space into smaller cubic cells until each cubic cell contains no more than one of the N objects, then builds a tree with the cubic cells that contain more than one object as the intermediate nodes and the cells with exactly one object as the leaves. Gravitational forces among the N objects are then calculated in a way where far-away objects are approximated by their aggregate masses placed at their geometric centers. Figure 2 shows the stream-like skeleton of such an algorithm, of complexity O(N log N), where each inner-loop function contains data parallelism.

    for (ti = 0; ti < tmax; ti++)
    {
        space_subdivision();
        tree_construction();
        mass_center_calc();
        approximate_force();
        position_update();
    }

Figure 2: An improved N-body algorithm of complexity O(N log N).
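To make the space subdivision and tree construction stages concrete, the sketch below shows one possible recursive insertion routine for such a tree of cubic cells. The type and function names (Body, Cell, insert_body, octant) are hypothetical, not taken from the paper, and the accumulation of masses for the intermediate nodes is omitted.

    /* A hedged sketch of the space_subdivision()/tree_construction() idea:
       cells are split recursively until every cell holds at most one body.
       Assumes all bodies have distinct positions. */
    #include <stdlib.h>

    typedef struct Body { double pos[3]; double mass; } Body;

    typedef struct Cell {
        struct Cell *child[8];   /* eight sub-cubes; NULL where unused      */
        Body *body;              /* the single body held by a leaf cell     */
        int split;               /* nonzero once the cell has been divided  */
        double center[3];        /* center of the cube                      */
        double half;             /* half of the cube's edge length          */
    } Cell;

    /* Index of the sub-cube (octant) of c that contains position p. */
    static int octant(const Cell *c, const double p[3])
    {
        return (p[0] > c->center[0] ? 1 : 0)
             | (p[1] > c->center[1] ? 2 : 0)
             | (p[2] > c->center[2] ? 4 : 0);
    }

    static Cell *make_child(const Cell *c, int k)
    {
        Cell *ch = calloc(1, sizeof *ch);
        ch->half = c->half / 2.0;
        for (int d = 0; d < 3; d++)
            ch->center[d] = c->center[d] + (((k >> d) & 1) ? ch->half : -ch->half);
        return ch;
    }

    /* Insert one body into the tree rooted at c, subdividing occupied leaves. */
    void insert_body(Cell *c, Body *b)
    {
        if (!c->split && c->body == NULL) {    /* empty leaf: keep the body here */
            c->body = b;
            return;
        }
        if (!c->split) {                       /* occupied leaf: push old body down */
            Body *old = c->body;
            c->body = NULL;
            c->split = 1;
            int k = octant(c, old->pos);
            c->child[k] = make_child(c, k);
            insert_body(c->child[k], old);
        }
        int k = octant(c, b->pos);             /* descend into the right octant */
        if (c->child[k] == NULL)
            c->child[k] = make_child(c, k);
        insert_body(c->child[k], b);
    }

A sketch like this covers only the first two stages of the Figure 2 skeleton; the remaining stages traverse the finished tree, each over the whole set of bodies or cells, which is where the per-stage data parallelism mentioned above comes from.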
Deadlocks and race conditions are notorious because they bring indeterminacy. Here determinism refers to bounded program behaviors; non-determinism refers to program behaviors that are different (but well-defined) from run to run even for the same given inputs; and indeterminacy refers to unbounded, i.e. unpredictable, program behaviors. Synchronous operations without randomized functions normally yield deterministic results; asynchronous operations may yield either deterministic results (when no ordering is needed) or non-deterministic results (when ordering is required but, by mistake, not enforced). Shared memory can add further complexity, e.g. when supposedly deterministic operations show non-deterministic or even indeterminate behaviors, or when non-deterministic operations show indeterminate behaviors, due to race conditions.
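To illustrate the last point with a small example of our own (pthreads is used here only for concreteness; it is not part of the paper's text): two threads incrementing a shared counter without any enforced ordering constitute a race condition, and a computation that should be deterministic becomes effectively indeterminate.

    /* A minimal sketch of a race condition on shared memory. The final value
       of `counter` should be 2 * ITERS, but the unsynchronized read-modify-write
       operations of the two threads interleave, so the printed total can differ
       from run to run. Compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 1000000

    static long counter = 0;              /* shared, unprotected state */

    static void *work(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++)
            counter++;                    /* racy increment */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected %d)\n", counter, 2 * ITERS);
        return 0;
    }

Enforcing the required ordering, e.g. by protecting the increment with a mutex or using an atomic operation, restores determinism; in the terms above, the unsynchronized version is an asynchronous operation whose needed ordering is, by mistake, not enforced.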