
arXiv:1012.1641v1 [cs.PL] 7 Dec 2010

A Generalized Streaming Model For Concurrent Computing*

Yibing Wang†

* First draft: Jan. 31, 2010; this draft: Dec. 07, 2010.
† Author contact: [email protected].

Abstract - Multicore parallel programming has some very difficult problems, such as deadlocks during synchronizations and race conditions brought by concurrency. Added to the difficulty is the lack of a well-accepted computing model for multicore architectures; because of that it is hard to develop simple, powerful programming environments and debugging tools. To tackle the challenges, we promote a generalized stream computing model that unifies previous researches on parallelization strategies for programming language design, compiler design, stream system design and operating system design. Our model provides a high-level abstraction in designing language constructs to convey concepts of concurrent operations, in organizing a program's runtime layout for parallel execution, and in running concurrent instruction blocks through runtime and/or operating systems. In this paper, we give a description of the proposed model: we define the foundation of the model, show its computational simplicity through algebraic/operation analysis, illustrate a programming framework enabled by the model, and demonstrate its potential through powerful design options for programming languages and operating systems.

1 Introduction

1.1 Motivation

When searching for a general approach [2] to the challenges faced by multicore programming, particularly the synchronizations and race conditions [47] introduced by concurrency, we found solutions built on conventional multithreading models [13, 52, 68] unappealing to programmers; there was a need for a unified computing model on which programming language design, compiler design, and operating system design professionals could benefit from each others' works. The stream computing model [12] seemed to have the potential to serve as such a design model, and a generalization of the model seemed to serve future generation computing. Some reasons are explained as follows.

The new multicore designs, however, bring chaos into the industry, ranging from adoption at various computing levels to programming language designs and operating system designs. Not only are programmers often frustrated with primitive solutions or the lack of ideal tools (e.g. debuggers), but chip makers are also facing challenges in delivering new designs [34, 48]; tools' builders are constrained by the multithreading model, such as in the cases of the C++-0x library extension [68], or by imperative languages, such as OpenMP [52] and in the cases of OpenCL [31] and CUDA [49, 51]. To change the impasse situation where researchers and developers at different computing levels have uncertain expectations of each other, a well-accepted computing model is needed first, i.e. we need a computing model that can guide designs at different computing levels.
PRAM [19] has often been used as a conceptual parallel computing model in demonstrating synchronized executions such as concurrent read concurrent write and concurrent read exclusive write. We argue that PRAM captures certain essences in concurrent computing1, such as precedence constraints in operation, but the very nature of its dependence on shared memory limits PRAM's power in parallelization design. Therefore, PRAM is only a generalization of the RAM (or Turing machine equivalent) model [26] for parallelization. What then is the parallel-formed generalization of the von Neumann model? Many people regard transactional memory [11, 24, 35] and its alternatives as the answer; they are not: such extensions are implementation details with little generalization of the von Neumann model. Moreover, conventional threads and transactional memory have their own problems:

• most agree that threads are low-level programming utilities and should remain so;

• transactional memory may not be computationally productive nor energy-efficient when transactions are large in size or long in time [24].

Stream computing [12, 17, 33, 44] has been around for more than a decade. Perhaps due to their humble origin (i.e. graphics processors; not their noble originators) in the computer industry, technologies based on stream computing were often used only as performance accelerators. While surveying the landscape of multicore parallel computing, we realized that, maybe, the application of the stream computing model should go beyond even GPGPU computing, and a generalization of the model could be applied to general purpose software designs and programming as well. Note our pursuit of a general design model was not to find a killer solution (i.e. the implementation of a certain method) for all applications, but to find a design principle in handling some of the most difficult yet fundamental problems, such as parallelization and nondeterminism.

To serve our purpose, we recommend the separation of computing models for abstract analysis from computing models for generalized design.

1 Information processing on modern architectures has never followed a monotonous path, but the main stream has come along a relatively unified way due to the abstraction of the Turing machine model [26] and the von Neumann model [3]. In this paper, concurrent and parallel are interchangeable in most cases, though we tend to regard concurrent as having a broader meaning than parallel, since concurrent does also include distributed cases. The authors of [16] think parallel systems are mostly synchronous, whereas concurrent systems are mostly asynchronous.
In [57], Snyder expressed a similar idea but from a different angle; he emphasized that a useful parallel machine model should be able to capture key features that have impact on performance, which to us is a design (model) issue. Based on such understanding, plus other people's publications [10, 11, 20, 25, 35, 36, 37, 40] and our own experiences, we propose a generalized stream computing (hereinafter known as GeneSC, pronounced "g-e-n-e-s-i-s") model as a design model for concurrent computing. Among other properties:

• its kernels may be imperative or functional, and the re-use of existing (modified when necessary) library routines is straightforward;

• it can be reduced to the sequential case for debugging purposes or when used for single core architectures.

Furthermore, we regard algorithm design as an inseparable part of achieving high performance in multicore computing. We believe that, due to different algorithmic thinking, solutions to an application problem may show different concurrency structures, which have different levels of difficulty when mapped to the underlying runtime environment and hardware. By supplying an expressive computing model, we give programmers the needed power in defining the least convoluted concurrency structures in their solutions. We use the following example to further justify our motivation.

1.2 Example

A naive solution to the N-body problem [5, 66] has the complexity of O(N^2). Figure 1 shows such an algorithm, where gforce() is a function for calculating gravitational forces from the other N-1 objects to the jth object, following the Newtonian laws of physics. To parallelize this algorithm, we often rely on loop nest parallelization by hand or through compilers [1, 18].

1 /* calculate gravitational forces in time
2    sequence from zero to tmax */
3 for (ti = 0; ti < tmax; ti++)
4 {
5   for (j = 0; j < N; j++)
6     gforce(j);
7 }

Figure 1: A naive N-body algorithm of complexity O(N^2).
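For instance, the inner loop of Figure 1 is a natural target for hand or compiler parallelization; the following C sketch uses an OpenMP pragma, and the data layout, gforce() body and constants are illustrative assumptions rather than the paper's own code.

/* Sketch: OpenMP parallelization of the naive O(N^2) N-body loop of
 * Figure 1.  The body data layout, gforce() and the constants N and
 * TMAX are illustrative assumptions, not taken from the paper.       */
#include <math.h>

#define N    1024
#define TMAX 10

static double px[N], py[N], m[N];   /* positions and masses */
static double fx[N], fy[N];         /* accumulated forces    */

static void gforce(int j)           /* force on body j from the other N-1 bodies */
{
    fx[j] = fy[j] = 0.0;
    for (int i = 0; i < N; i++) {
        if (i == j) continue;
        double dx = px[i] - px[j], dy = py[i] - py[j];
        double r2 = dx * dx + dy * dy + 1e-9;
        double s  = m[i] / (r2 * sqrt(r2));   /* gravitational constant omitted */
        fx[j] += s * dx;
        fy[j] += s * dy;
    }
}

void nbody_naive(void)
{
    for (int ti = 0; ti < TMAX; ti++) {
        /* bodies are independent within one time step, so the inner loop
         * is a data-parallel candidate for hand or compiler parallelization */
        #pragma omp parallel for schedule(static)
        for (int j = 0; j < N; j++)
            gforce(j);
        /* per-step position update omitted for brevity */
    }
}

Each time step still runs sequentially in this sketch, because step ti+1 depends on the positions produced by step ti; only the per-body work is spread across cores.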

In improved algorithms based on hierarchical trees [5], distant objects are approximated by their conglomeration masses at their geometry centers. Figure 2 shows the stream-like skeleton of such an algorithm with complexity O(N log N), where each inner-loop function consists of data parallelism.

1 for (ti = 0; ti < tmax; ti++)
2 {
3   space_subdivision();
4   tree_construction();
5   mass_center_calc();
6   approximate_force();
7   position_update();
8 }

Figure 2: An improved N-body algorithm of complexity O(N log N).

The N-body problem is interesting in three ways. First, it is a large computation problem that must be solved on parallel computers. Second, the inter-node communication (i.e. synchronization) overhead in executions of each parallelized inner function is not negligible. Third, it has the potential to show how languages (i.e. the expression of reasoning) may affect algorithm designs.

Further improved algorithms such as [4, 65] aim to reduce communication overhead. Multicore processors may reduce the overhead even more through shared memory. Besides hardware support, a suitable computing model is needed. The GeneSC model that we propose has considered such applications as well as many others.

1.3 Organization

The rest of the paper explains what this generalized stream computing is and what the enabling technologies are. Section 2 first highlights a few important concepts in the problem domain of interest and then gives a definition of the GeneSC model, followed by Section 3 on algebraic/computational operation analyses, Section 4 on a programming framework enabled by our model, and Section 5 on design considerations including three crucial additions that depart from traditional practices. Section 6 introduces one potential application. Section 7 discusses related works. Section 8 gives the concluding remarks.

2 Definition

We use synchronism and asynchronism for discussing operations' timing or ordering characteristics. We use determinism and non-determinism for discussing bounded program behaviors, with non-determinism referring to program behaviors that are different (but well-defined) from run to run even for the same given inputs; and indeterminacy for unbounded, i.e. unpredictable, program behaviors. Synchronous operations without randomized functions normally yield deterministic results; asynchronous operations may yield either deterministic (when no ordering is needed) or non-deterministic (when ordering is required but, by mistake, not enforced) results. Concurrency could add more complexities, e.g. when supposedly deterministic operations show non-deterministic or even indeterminacy behaviors, or when non-deterministic operations show indeterminacy behaviors, due to race conditions.

Deadlocks and race conditions are notorious because they bring indeterminacy. General races [47] can be defined as concurrent accesses to shared data with at least one access being a write operation. General races may be acceptable if they just cause non-determinism, but are unacceptable if they cause indeterminacy. Data races are special cases of general races; data races are program bugs, e.g. missing shared locks in concurrent accesses to shared data.

Applications show parallelism in two ways: data parallelism and task parallelism (we consider pipelined concurrency as a subtype of task parallelism). Performance in data parallelism cases relies on data processing speed, whereas performance in task parallelism cases relies on overall throughput.

To make a parallel program work correctly, we prevent indeterminacy; to make the behavior of a parallel program tractable, we deal with non-determinism.

In our GeneSC model, programming, compilation and runtime designs adopt a concept that views computing as applying well-defined functions to flows of data sets, which we call stream data. At the core of the model is an entity, E = (k, d, r), that encapsulates an execution unit consisting of three elements:

• k: a close-form function

• d: well-defined input/output data

• r: relations with other entities

The close-form function is in fact a non-trivial private algorithm and is often called a (mathematical) kernel. The input/output data defines a function's interactions with the outside, without exposing intermediate computation results. Inter-stream-entity relations offer timing/ordering dependence.
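A minimal C sketch of how the triple E = (k, d, r) could be represented is given below; the type and field names are our own illustrative choices, not an API defined by the paper.

/* Hypothetical C rendering of the triple E = (k, d, r). */
typedef enum { REL_HARD, REL_SOFT } rel_kind;

typedef struct stream_entity stream_entity;

typedef struct {
    stream_entity *peer;      /* related entity                  */
    rel_kind       kind;      /* must-precede vs. should-precede */
} relation;

struct stream_entity {
    /* k: close-form kernel function; it sees only its own input/output */
    void (*kernel)(const void *in, void *out);

    /* d: well-defined input/output stream data (opaque to other entities) */
    const void *in;
    void       *out;

    /* r: timing/ordering relations with other entities */
    relation *before;  int n_before;   /* entities this one must/should precede */
    relation *after;   int n_after;    /* entities this one must/should follow  */
};

Note that nothing in this descriptor says how the kernel is executed; scheduling and communication remain the runtime's business, in keeping with the separation the model asks for.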

This model is different from conventional multithreading models in at least two ways: First, stream entities are functional units, whereas threads are execution units. Second, explicit concurrency structures defined by stream entity syntax and semantics allow optimization in compilation and scheduling at runtime, whereas threads prescribe execution structures without rich timing/ordering knowledge.

Some people might suggest that a stream entity resembles a neuron in artificial neuron networks, except that the former normally has a linear function whereas the latter a non-linear function. We will not diverge too far along that direction.

The proposed streaming model preserves asynchronism inherited from the problem domain, but reduces non-determinism to the minimum through explicit concurrency structures in a program4. By encapsulating internal computation states, isolating external states, and separating computation from communication, stream entities help reduce data races. Meanwhile, inter-stream-entity relations help reduce general races.

3 Operations

According to its definition, a stream entity contains a miniature Turing machine whose input symbols are from a subset or a segment of all input symbols to a larger Turing machine, and whose finite control is defined by the stream entity's kernel function. But the overall program defined by stream entities operates as a finite automaton (FA), where each state in the FA has a miniature Turing machine inside. Inter-stream-entity relations define precedence constraints among the FA states but do not define connection (i.e. transition) policies, thus the program operates as an FA with ε-moves, shown in Figure 3, where the ε-moves simply mean undefined transition functions, not important to stream entities3.

Figure 3: An abstract operation model of a program represented by a finite automaton with ε-moves in the proposed streaming model nomenclature.

The GeneSC model has all the advantages, such as concise foundations and simple semantics, of simple operational models without being weak in program clarity, which is guaranteed by detailed designs and implementations of individual stream entities. For example, addition operations in the form of parallel executions of stream entities are supported; concatenation operations are allowed as long as inter-stream-entity relations, if any, are preserved; also allowed are composition operations. With these operations, building large-scale software systems and reusing library components are safe and tractable. The composition capability is even more crucial for hierarchical and scalable computing, which will be discussed in Sections 5.2 and 5.3.

4 Programming

Figure 4 shows a programming framework based on the GeneSC model. At the top level, programmers focus on coding concurrency structures defined by stream entities, which relate to each other according to precedence constraints, if any, that exist in high-level control flows and data flows. Each stream entity has a kernel function, which may consist of none, one or more than one lower-level stream entities. The lowest level kernel functions will call routines in conventional (e.g. non-threaded but thread-safe) programming libraries.

Figure 4: A programming framework based on the GeneSC model.

3 While reading Snyder's paper [57] after the conception of the GeneSC model, the author could instantly recognize the similarity between his ideal of an operational stream computing model and Snyder's considerations for the candidate type architecture (CTA) abstraction for MIMD parallel machines. As another note, from our point of view, one major difference between multicore and manycore resembles the difference between shared-memory systems and distributed memory (i.e. networking) systems, where inter-processor communication mechanisms set them apart. Our GeneSC model supports both systems.

4 This model provides the benefit of functional programming in compositional operations but is also intended to yield better performance.
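The layering of Figure 4 can be pictured with a small C sketch; the filter example, its names and its sizes are invented for illustration and are not part of the model.

/* Sketch of the layering in Figure 4: a top-level entity whose kernel is
 * composed of lower-level kernels, with the lowest level calling an
 * ordinary, non-threaded but thread-safe library routine.               */
#include <stddef.h>

/* lowest level: plain library routine, no threads created here */
static void blur_row(const float *in, float *out, size_t w)
{
    for (size_t i = 1; i + 1 < w; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    out[0]     = in[0];
    out[w - 1] = in[w - 1];
}

/* lower-level stream entity kernel: wraps the library routine */
static void kernel_blur(const void *in, void *out)
{
    blur_row((const float *)in, (float *)out, 256);
}

/* top-level kernel: composed of lower-level kernels applied in order;
 * the composition itself carries no thread or synchronization code     */
static void kernel_filter(const void *in, void *out)
{
    float tmp[256];
    kernel_blur(in, tmp);
    kernel_blur(tmp, out);   /* second pass re-uses the same lower-level entity */
}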

In the above figure, hierarchical structures among stream entities embody compositional operations. Dependence constraints among stream entities decide their runtime execution orders; if there is no dependence constraint between two stream entities, there is no mandatory execution order between them. For example, because stream entity No.4 is independent of others, it may be executed in parallel with stream entity No.1, No.2 or No.3. Programmers see no thread or any low-level parallel execution controls, which are left to compilers, runtime and operating systems to handle. Programmers do have the responsibility to write kernel functions and conventional library routines, if there are no existing ones for reuse.

5 Designs

The GeneSC model provides a highly abstracted, mostly tractable view of concurrent operations. Here are some examples as design considerations that support the programming framework discussed in Section 4.

5.1 Language

Stream entity is the foundation of conveying concurrency structures in a program. Though a similar concept may already exist in some programming languages, we decide to make it explicit. A stream entity's API may have the following contents in a new programming-language construct:

{ relations :
  {
    before: {(function_k, [hard/soft]), ...}
    after:  {(function_i, [hard/soft]), ...}
  }
  input:  x
  output: y
  kernel: function_j
}

Inter-stream-entity relations are defined by two sets. Those sets can be empty. Elements in the relation sets are kernel function and relation constraint pairs. A "hard" relation constraint means "this" entity must finish before related entities begin, or must begin after related entities finish; a "soft" constraint means "this" entity should finish before related entities begin, or "this" entity should begin after related entities finish. Compositional relations are generated automatically by compilers and are kept in an internal data structure for stream entities.

One major difference between a stream entity and a conventional task in the form of a thread is that the former makes composition operations easier and safe, whereas the latter harder and unsafe. One major difference between a stream entity and a class object is that the former performs an overall well-defined, close-form function independent of context, whereas the latter does not perform such a function without context, except in the case of a function object or functor [56]. Different from a functor, a stream entity's internal states are not important.

A stream entity does not imply how its kernel function is implemented, thus does not enforce an imperative language implementation targeting CPU or GPU, or a functional language implementation, or something else. Building stream entities on existing, low-level library routines is possible provided that the library routines do not create POSIX-like threads. In the proposed streaming model, user programs are not encouraged to create threads; library routines should not create threads at all.

The introduction of the stream-entity construct in a programming language is the first crucial addition required by our streaming model. In this model, kernel functions define computations while inter-stream-entity relations define precedence constraints in concurrent executions. Inter-stream-entity communications are purposely left out because they are considered as implementation details; both distributed memory methods (e.g. message passing) and shared memory methods are allowed.

At program control level, two more constructs may be desirable: one for data parallelism and one for task parallelism. In the data parallelism case, on the one hand, concurrent executions have a flat structure in terms of precedence constraint; some stream entities may have multiple instances at runtime, when each instance takes a portion of a large stream data set. On the other hand, data partition algorithms are often non-trivial. Therefore, a language construct, such as map in pMatlab [60], will be useful for conveying the data partition information. Loops had often been used to do the job, but expressing a parallel concept in sequential executions reduces program clarity.

In the task parallelism case, sometimes concurrent tasks are divided into groups for completely different functionalities, and have limited synchronization points; sometimes increased security requirements, such as those of a modern web browser [6], demand greater isolation among the tasks. Thus, a task parallelism construct may be needed to define such parallel structures to allow special runtime scheduling and layout treatments.
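As a purely hypothetical instance of the construct above, the tree_construction stage of Figure 2 might be declared as follows; the hard constraints and the stream-data names body_set and oct_tree are illustrative choices, not prescribed by the model.

{ relations :
  {
    before: {(mass_center_calc, hard)}
    after:  {(space_subdivision, hard)}
  }
  input:  body_set
  output: oct_tree
  kernel: tree_construction
}

A compiler can derive the remaining orderings of Figure 2 from similar declarations and record them in the hypergraph segment discussed in Section 5.2.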

The research community has been trying to invent new language constructs for the needs of multithreaded parallel computing, such as in the cases of Cilk/Cilk++ [9, 15] and Galois [32]. The former emphasized the idea of separating designs for computation from concerns for runtime load-balancing and scheduling; the latter emphasized auto-parallelization of sequential programs with new language abstractions. Our language designs align well with those interests, but are not handicapped by the reliance on a POSIX-like multithreading model.

5.2 Compiler

The introduction of new language constructs such as that for the stream entity has two major impacts on compiler designs.

First, increased power in program analysis and parallelization. A stream entity defines a scope that can be bigger than loop nests, and the scope is finite, so it is superior to the intractable number of procedures that hinders previous practices in interprocedural analyses [1, 21]. In the GeneSC model, a control construct for task parallelism as mentioned in Section 5.1 can guide code generation; a control construct for data parallelism can specify the number of concurrent streams, if programmers provide that information.

Second, added to compiler generated code, and thus to a program's virtual address space layout (see Figure 5), is a new segment for storing hypergraphs of inter-stream-entity relations, where the hypergraph vertices are stream entities, and the edges are sets of data-flow- or control-flow-connected stream entities; the size of an edge can be one. The hypergraph information is the second crucial addition enabled by the proposed stream computing model.

Figure 5: Program virtual address space layout (a), with hypothetical internal structures (b), and drawing of a hypergraph (c).

A hypergraph vertex, i.e. a stream entity, is designed as a non-distributive execution unit, thus compilers (and later runtime schedulers) are allowed to optimize computations through restructuring (e.g. flattening) the initial hierarchical hypergraphs derived from user programs.

Architectural designs [63, 67] have been proposed to provide a more homogeneous programming environment on asymmetric or heterogeneous multicore. Such works make mapping stream entities to asymmetric multicore through compilers practically achievable. Depending on what the underlying processor architecture is, just-in-time (JIT) compilation may be employed to further optimize a program at runtime.

5.3 Operating System

Program speedup technologies that take advantage of parallelism, such as instruction-level pipelines, data-level SIMD, and task-level hyperthreading [22, 28], all have hardware assistants. However, in the multicore case, the up-scaled, parallel processing resource is exposed directly to programmers. Specifically, parallel task creation, mapping and scheduling to multiple cores become programmers' responsibility, with or without operating system involvement. Such an arguable architectural defect or design trade-off may be regarded as the chief culprit for current difficulties in parallel programming on multicore processors. Architectural support for multi-thread (-fiber, -shred, -strand, etc.) programming, like those in [23, 63], would be helpful. Unfortunately, to our best knowledge, no such production CPU exists. To solve that problem, we propose software emulated, e.g. operating-system initiated, task pipelines where a task is a stream entity plus stream data. Such superscalar pipelines are the third crucial addition enabled by the proposed streaming model. To illustrate the idea, we use the following shared memory model for more detailed discussions.

First, change how the kernel creates and destroys a program's runtime image. For example, when the exec system calls are invoked, a kernel thread, named micro scheduler to be distinguished from the process level scheduler, parses a program's hypergraph information, and decides the number of worker threads that form the superscalar pipelines. Work-load information and power management instructions may be available to the micro scheduler. Ideally, it will take that information into consideration when creating worker threads. To reduce the context switching overhead introduced by process level scheduling5, the micro scheduler dynamically changes the number of worker threads.

5 Context switching is a multiprogramming strategy on single core. In multicore systems, the strategy may need enhancement or extension.
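The sizing decision made by the micro scheduler can be sketched as follows; the hypergraph summary, the core and power inputs, and the policy itself are simplifying assumptions for illustration, and the real logic would live in the operating system kernel at exec() time rather than in user space.

/* Illustrative micro-scheduler sizing decision (Section 5.3, first step). */
#include <stddef.h>

typedef struct {
    size_t n_entities;
    size_t max_width;   /* largest set of entities with no mutual precedence
                           constraint, i.e. the widest stage of the hypergraph */
} hypergraph;

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Decide how many worker threads should back the superscalar task pipelines. */
size_t decide_workers(const hypergraph *hg, size_t online_cores,
                      size_t power_budget_cores)
{
    size_t cap = min_sz(online_cores, power_budget_cores);
    if (hg->max_width == 0)
        return 1;                       /* purely sequential program       */
    return min_sz(hg->max_width, cap);  /* never exceed the useful width   */
}

The same routine could be re-evaluated whenever load or power conditions change, which is one way to realize the dynamic adjustment of worker threads described above.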

Second, schedule a process' concurrent instruction blocks through work-stealing [9]. Although the micro scheduler speculates inter-stream-entity relations and dynamically controls the number of superscalar pipelines (i.e. worker threads), stream entities are scheduled through work-stealing. Parsed and maybe further optimized hypergraph information is saved as process context for reuse. Since different operating systems have very different thread models [42, 45], we do not enforce how worker threads are implemented. Figure 6 shows a hypothetical case where the application process has three worker threads, assisted at runtime by a micro scheduler.

Figure 6: A hypothetical case of operating system kernel initiated threads for stream computing.

Third, monitor virtual memory accesses and manage worker thread synchronizations based on an application's hypergraphs. Virtual memory access control is enforced through shadowing, e.g. locking a memory address so that only one worker thread can write to the address at a specific time; or coloring, i.e. making certain addresses accessible by only one of the worker threads at any time. When race conditions happen, block the contender thread(s) or issue a signal. One immediate question is: how can such fine-grained memory coloring be possible, if modern virtual memory systems are page based? We are considering a dynamic version of memory overlays [39] on top of virtual memory. Memory overlays provide task-level program isolation.

Finally, let the operating system kernel clean up the worker threads at process exit, and output tractable runtime states in case of errors. For example, when system signals make a process abort, hypergraph information is dumped to a core file. A snapshot of currently running stream entities is useful but may not be possible in practice.

We've noticed a recent publication [59] on architecture design built on a concept called chunk execution, which was defined as a set of sequential instructions executed as one operation unit with processor hardware assistance. Although it is still early for us to assess the hardware design itself, the chunk execution idea reaffirmed for us the value of our task-level pipelines for multicore.

6 Application

The modern web browser, a familiar but non-trivial application, is used to show the advantages of stream computing.

Besides plain text HTML contents, current and future web browsers are expected to present to end users dynamic and rich media such as video, 3D images, virtual worlds, computer games, semantic web information, etc. Those complex contents require that part, if not all, of the computations be done on user computers or smart hand held devices. For each different content format, the browser may run a virtual machine, e.g. a language interpreter, to process the content. There is also the need to isolate content handling tasks inside the browser for privacy and security reasons [6, 54].

Figure 7 shows a simplified pipeline of the rendering engine in a browser. Each of the three functional units, namely parsing, synthesis, and rendering, has multiple sub-units of different capabilities, and the sub-units may be composed of different algorithms in implementation.

Figure 7: Components in a simplified pipeline of the rendering engine in a modern web browser.

In this case, parallel programming using POSIX threads or OpenMP APIs would be hard. We see at least three problems with the user or compiler-initiated multithreading model: (1) Maintaining a scalable number of threads while optimizing parallelization is difficult. (2) Thread creation and destroy overheads may be significant, where threads are temporary objects. (3) Thread isolation is not guaranteed.

By using the proposed streaming model, computations are defined by stream entities. Each component in Figure 7 can be built with stream entities through addition, concatenation, and composition operations. At runtime, the micro scheduler defined in Section 5.3 keeps a dynamically scalable number of worker threads in a browser, and schedules stream entities as execution units. Isolation is enforced through memory coloring in the browser's virtual address space.
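To make the composition concrete, the sketch below chains the three Figure 7 stages as stream-entity kernels; the stage bodies and the page/dom/frame data types are placeholders invented for this illustration, not the browser's actual interfaces.

/* Toy sketch of concatenating the Figure 7 stages: parse -> synthesize ->
 * render, each stage a kernel that only sees its own input/output data.  */
typedef struct { const char *html; }  page_src;
typedef struct { int node_count; }    dom_tree;
typedef struct { int layer_count; }   frame;

static void parse_stage(const page_src *in, dom_tree *out)
{   /* parsing sub-units (HTML, XML, image decoding, ...) would run here */
    out->node_count = in->html ? 1 : 0;
}

static void synthesize_stage(const dom_tree *in, frame *out)
{   /* DOM binding, layout, script execution, ... */
    out->layer_count = in->node_count;
}

static void render_stage(const frame *in)
{   /* submit layers to the display; no data flows further */
    (void)in;
}

void render_page(const page_src *src)
{
    dom_tree dom;
    frame    fr;
    /* hard relations: parse before synthesize before render */
    parse_stage(src, &dom);
    synthesize_stage(&dom, &fr);
    render_stage(&fr);
}

A GeneSC runtime could overlap these stages across successive pages while preserving each page's parse-before-synthesize-before-render ordering, which is exactly the pipelined task parallelism the worker threads are meant to exploit.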

7 tion operations. At runtime, the micro scheduler In his thesis on StreamIt [58], Thies summarized defined in Section 5.3 keeps a dynamically scalable languages from Kahn’s Process Networks to syn- number of worker threads in a browser, and sched- chronous dataflow [38] and CSP (Communicating ules stream entities as execution units. Isolation is Sequential Processes) [27]. Such languages have a enforced through memory coloring in the browser’s common feature that is the pipe-lining of concurrent virtual address space. task-executions. StreamIt also experimented on the idea of stream graph that allows compiler optimiza- tions. Our main reservation with such languages is 7 Related Works their weak power in modeling data parallelism. Jade [55] represents an important experiment on All directly related works have been mentioned in parallel programming language design. Jade pre- the previous sections. What we discuss here are serves the serial semantics of a program, implicitly publications that may provide alternative solutions exploits task parallelism, and moves data closer to to various aspects of the problems we try to solve, processors. However, if it was designed for a smooth or are noteworthy references to earlier works. We transition from sequential programming to parallel skip publications on virtualization, threads schedul- programming, Jade’s dependence on its type system ing and thread models, because they are highly spe- and explicit object-access control still exposes low- cialized topics concerning implementations. level synchronizations to programmers. While ar- A school of analytical models, represented by the guably such low-level controls provide programming most recent Multi-BSP [62], provides theoretical flexibility, not all the details are essential to algo- abstractions that bridge designs rithm design. On the other hand, we might consider with multicore architectures. One common trait of that Jade lacks the power for defining concurrent such models is their emphasize on portable perfor- execution structures in a larger program-scope. mance. However, such motivation may not ensure Another parallel programming language that has engineering success, as there are more imminent con- limited adoption in the academia is UPC [61]. UPC cerns, such as indeterminacy, synchronization over- is another multithreading language that extends C. head, and scalability, that highlight the weaknesses The two most noteworthy features of UPC include of current parallel programming tools. In the Multi- the partitioned global address space model and the BSP case, the model should be useful in analyzing memory consistence model, where the latter is im- a given parallel computing system in some multi- plemented using explicit access controls to shared core architectures, but the model does not provide objects. Again, the weakness of UPC might be its enough engineering guidance in designing such a sys- explicit synchronizations exposed to programmers. tem. It is not surprising that the Multi-BSP model’s Chapel [14] is a complex parallel programming hierarchically nested component structures resemble language that supports global level data and control the stream entity hypergraphs in our model, but our abstractions, architecture-aware locality mapping, stream entities have richer engineering implications. object-oriented programming and generic program- Someone has suggested that the GeneSC model ming. 
What impressed us most are the language’s that we promote was a rediscovery of Kahn’s Process global views on data and control, and its ability to Networks [29, 30]. But that’s a misunderstanding, if map the global views onto data/control affinity ab- one can ignore the superficial resemblance in our use stractions called locales. Though we do not have ex- of certain terminologies. The simple language Kahn perience with Chapel programming, by studying its defined and discussed is both synchronous and deter- ancestor ZPL [57], we see one motivation that em- ministic, due to the language’s primitives phasizes both performance and portability on var- such as “wait” and “send”, which build the language’s ious parallel architectures. For that purpose, ZPL FIFO queuing model for communication. The Kahn and Chapel may strike for a balance between ab- model suits concurrent tasks, but has little emphasis straction and expressiveness. The emphasis of our on high performance computation. We argue that a GeneSC model, however, is (high-level) concurrency model should convey intrin- structures in problems’ solutions. Therefore, we can sic data and task precedence constraints but should rely on extensions to existing languages to achieve not sacrifice asynchronism when it is possible. The concurrency structure abstraction, and delegate per- flexibility provided by asynchronism is essential for formance concerns (e.g. the mapping to parallel ar- achieving high-performance through out-of-order ex- chitectures) to languages and compilers that imple- ecution, thus is needed by a broader range of appli- ment the kernel functions encapsulated by a stream cations. entity.

8 We’ve mentioned Cilk++ earlier but consider that structured computing. In that system, a component it deserves more attention. First, Cilk++ is inter- encapsulates data and computing, which are defined esting because of its theoretical foundation and its as services to be produced by one component and internal handling of both task-parallelism and data- consumed by another. The emphasis on relations parallelism. Cilk++ has a provable threading model among components is like ours, except that com- that can be analyzed using DAG (Directed Acyclic munications among related components use explicit Graph). Conceptually, a Cilk++ strand is like a message passing. stream entity of our model, though Cilk++ does Finally, the GeneSC model is for concurrent not have an explicit construct for strand. Second, computing at instruction-block level. Hardware- Cilk++ uses only two language constructs, namely dependent considerations, such as supports to many- cilk_spawn and cilk_for, to define task paral- core and heterogeneous multicore, will rely on imple- lelism and data parallelism at language level, though mentations, particularly those at runtime and oper- Cilk++ language constructs have scope restrictions ating system levels, such as those in [7, 41, 50]. that make C++/Cilk++ mixed-language program- ming cumbersome [15]. Finally, Cilk++ has re- stricted but innovative debugging model and hyper- 8 Concluding Remarks objects support. In a survey [43] on parallel computing languages In this paper, we discuss a generalized stream com- for multicore architectures, McCool compared and puting model, known as the GeneSC model, for contrasted existing programming language mod- concurrent computing. This model is not about els and computing paradigms. One shared inter- yet another programming technique or a program- est is about stream computing, particularly the ming paradigm, but an abstract design guidance for “SPMD stream processing model”, implemented on parallel programming languages, compilers, runtime the RapidMind platform [46]. Three of the plat- and/or operating systems for multicore. We high- form’s major contributions may be summarized as light the benefits of the model through operational follows. First, at language level, RapidMind adopts analysis, and demonstrate its power through an en- a type-system approach inlining with Jade, CUDA abled programming framework and example design and OpenCL; but the type system is much simpler considerations that support the framework. Specifi- and exposes no synchronization nor access control cally, the GeneSC model brings in a non-distributive details. Second, at runtime level, RapidMind sched- execution unit called stream entity, and uses it as ules and load-balances the executions of concurrent the foundation for designs at programming language “program objects” through backend (runtime) sup- level, compiler level, and operating system level. Be- ports, which identify and parallelize the program ob- sides stream entity, two other crucial additions un- jects. Noticeably, a program object has similar pur- like traditional practices are also introduced: a new pose as our stream entity but lacks of precedence hypergraph section in a program’s runtime layout, constraints. Third, the RapidMind platform sup- and operating system kernel initiated threads that ports heterogeneous multicores. One needs to be form stream-entity/stream-data pipelines to multi- reminded, however, that while it does not relying core processors. 
The significance of the GeneSC on conventional threads, RapidMind’s implementa- model is that, by harnessing concurrency informa- tion of the SPMD stream processing model limits the tion expressed in algorithms and by applying the in- platform to applications that show data parallelism. formation to runtime scheduling, the model provides In [53], the authors proposed a close-to-metal sub- tractable, concurrent operations; and may serve as strate called “lithe”, which uses a primitive named a natural extension of the von Neumann model for “hart” and its context to carry out application-level parallel computing architectures. threads in multicore processors. Lithe also has an Our GeneSC model facilitates a high-level abstrac- interface layer that allows processor allocation and tion of concurrency in computing, yet does not incur thread scheduling through runtime systems. The a deep learning curve because the kernel of a stream motivation was to control hardware resources sub- entity is the encapsulation of a single-/multiple- scription and to support parallel library composi- procedure function that can be implemented in ex- tions. As an alternative OS-level thread implemen- isting sequential languages. By delegating many tation, Lithe has been used to run existing multi- of the tough issues (e.g. synchronization and task threading libraries such as OpenMP. scheduling) in concurrent computing to platform Bläser in [8] reported a component-based language software such as compilers and operating systems, and operating system that supported concurrent and programmers can focus more on analyzing their ap-

Acknowledgments. The author learned of the existence of hypergraphs through professor Vitaly Voloshin of the Troy State University, USA. An anonymous reader of the first draft of this paper brought up Kahn's works, which led to the comparison between Kahn's model and ours. The author once questioned professors Peter J. Denning and Jack B. Dennis about their discussion of determinate computation in an article published in CACM, Vol. 53, No. 6; the two professors' kind replies further clarified this author's use of certain terminologies.

References

[1] R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures: A Dependence-Based Approach, Morgan Kaufmann Publishers, 2002.

[2] K. Asanovic, R. Bodik, B. C. Catanzaro, et al., The Landscape of Parallel Computing Research: A View from Berkeley, Technical Report No. UCB/EECS-2006-183, Dec. 2006.

[3] J. Backus, Can Programming Be Liberated From the von Neumann Style? A Functional Style and Its Algebra of Programs, CACM, Vol. 21, No. 8, 1978, pages 613-641.

[4] S. Bhatt, M. Chen, C. Lin, and P. Liu, Abstractions for Parallel N-body Simulations, In Proceedings of the Scalable High Performance Computing Conference, IEEE Press, April 1992, pages 38-45.

[5] J. Barnes and P. Hut, A Hierarchical O(N log N) Force-Calculation Algorithm, Nature, Vol. 324, No. 4, December 1986, pages 446-449.

[6] A. Barth, C. Jackson, C. Reis, and Google Chrome Team, The Security Architecture of the Chromium Browser, http://seclab.stanford.edu/websec/chromium, 2008.

[7] A. Baumann, P. Barham, P. Dagand, et al., The Multikernel: A New OS Architecture for Scalable Multicore Systems, In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ACM Press, Oct. 2009.

[8] L. Bläser, A High-Performance Operating System for Structured Concurrent Programs, In Proceedings of the 4th Workshop on Programming Languages and Operating Systems, Oct. 2007.

[9] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, et al., Cilk: An Efficient Multithreaded Runtime System, Technical Report TM-548, Massachusetts Institute of Technology, 1996.

[10] H. Boehm, Threads Cannot Be Implemented As a Library, In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM Press, Jun. 2005, pages 261-268.

[11] H. Boehm, Transactional Memory Should Be an Implementation Technique, Not a Programming Interface, In Proceedings of the First USENIX Workshop on Hot Topics in Parallelism (HotPar'09), March 2009.

[12] I. Buck, Stream Computing on Graphics Hardware, Ph.D. thesis, Stanford University, 2006.

[13] D. R. Butenhof, Programming with POSIX Threads, Addison-Wesley Publishing Company, 1997.

[14] Chapel Language Specification 0.782, Cray Inc., 2009.

[15] Cilk Arts, Cilk++ Programmer's Guide, Revision 58, Cilk Arts, Inc., March 16, 2009.

[16] R. Cleaveland, S. A. Smolka, et al., Strategic Directions in Concurrency Research, ACM Computing Surveys, Vol. 28, No. 4, December 1996, pages 607-625.

[17] W. Dally, P. Hanrahan, M. Erez, et al., Merrimac: Supercomputing with Streams, In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, ACM Press, Nov. 2003.

[18] R. Gerber, The Software Optimization Cookbook: High-Performance Recipes for the Intel Architecture, Intel Press, 2002.

[19] R. Greenlaw, H. J. Hoover, and W. L. Ruzzo, Limits to Parallel Computation: P-Completeness Theory, Oxford University Press, 1995.

[20] J. Gummaraju, M. Erez, J. Coburn, et al., Architecture Support for the Stream Execution Model on General-Purpose Processors, In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), IEEE Computer Society Press, Sept. 2007, pages 3-12.

[21] M. W. Hall, S. P. Amarasinghe, B. R. Murphy, et al., Interprocedural Parallelization Analysis in SUIF, ACM Transactions on Programming Languages and Systems, Vol. 27, No. 4 (July), 2005, pages 662-731.

[22] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, Morgan Kaufmann Publishers, 2006.

[23] R. A. Hankins, G. N. Chinya, J. D. Collins, et al., Multiple Instruction Stream Processor, In Proceedings of the 33rd International Symposium on Computer Architecture (ISCA06), IEEE Computer Society Press, 2006.

[24] M. Herlihy and J. E. B. Moss, Transactional Memory: Architectural Support for Lock-Free Data Structures, In Proceedings of the 20th Annual International Symposium on Computer Architecture, IEEE Computer Society Press, 1993, pages 289-300.

[25] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming, Morgan Kaufmann Publishers, 2008.

[26] J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley Publishing Company, 1979.

[27] C. A. R. Hoare, Communicating Sequential Processes, Prentice Hall International, 1985. Also freely available on the Internet since 2004.

[28] Intel 64 and IA-32 Architectures Optimization Reference Manual, Order No. 248966-014, Intel Corporation, Nov. 2006.

[29] G. Kahn, The Semantics of a Simple Language for Parallel Programming, Information Processing 74, North-Holland Publishing Co., 1974, pages 471-475.

[30] G. Kahn and D. B. MacQueen, Coroutines and Networks of Parallel Processes, Information Processing 77, North-Holland Publishing Co., 1977, pages 993-998.

[31] Khronos Group, OpenCL Parallel Computing for Heterogeneous Devices, http://www.khronos.org/opencl, 2009.

[32] M. Kulkarni, K. Pingali, B. Walter, et al., Optimistic Parallelism Requires Abstractions, CACM, Vol. 52, No. 9, 2009, pages 89-97.

[33] F. Labonte, P. Mattson, W. Thies, and I. Buck, The Stream Virtual Machine, In Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (PACT'04), IEEE Computer Society Press, 2004.

[34] Larrabee Microarchitecture, http://www.intel.com/technology/visual/microarch.htm, Jan. 2010.

[35] J. Larus and C. Kozyrakis, Transactional Memory: Is TM the Answer for Improving Parallel Programming?, CACM, Vol. 51, No. 7, 2008, pages 80-88.

[36] E. A. Lee, The Problem with Threads, IEEE Computer, Vol. 39, No. 5, May 2006, pages 33-42.

[37] E. A. Lee, Computing Needs Time, CACM, Vol. 52, No. 5, May 2009, pages 70-79.

[38] E. A. Lee and D. G. Messerschmitt, Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing, IEEE Transactions on Computers, Vol. 36, No. 1, 1987, pages 24-35.

[39] J. R. Levine, Linkers & Loaders, Morgan Kaufmann Publishers, 2000.

[40] S. Liao, Z. Du, G. Wu, and G. Lueh, Data and Computation Transformations for Brook Streaming Applications on Multiprocessors, In Proceedings of the International Symposium on Code Generation and Optimization (CGO'06), IEEE Computer Society Press, March 2006.

[41] R. Liu, K. Klues, S. Bird, et al., Tessellation: Space-Time Partitioning in a Manycore Client OS, In Proceedings of the First USENIX Workshop on Hot Topics in Parallelism (HotPar'09), March 2009.

[42] R. Love, Linux Kernel Development, 2nd Edition, Novell Press, 2005.

[43] M. D. McCool, Scalable Programming Models for Massively Multicore Processors, Proceedings of the IEEE, Vol. 96, No. 5, May 2008.

[44] P. S. McCormick, J. Inman, J. P. Ahrens, et al., Scout: A Hardware-Accelerated System for Quantitatively Driven Visualization and Analysis, In Proceedings of the Conference on Visualization '04, ACM Press, 2004.

[45] R. McDougall and J. Mauro, Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture, 2nd Edition, Sun Microsystems Press, 2006.

[46] M. Monteyne, RapidMind Multi-Core Development Platform, RapidMind White Paper, RapidMind, Feb. 2008.

[47] R. H. B. Netzer and B. P. Miller, What Are Race Conditions? Some Issues and Formalizations, ACM Letters on Programming Languages and Systems, Vol. 1, No. 1, March 1992, pages 74-88.

[48] News on Larrabee's delay again, http://arstechnica.com/hardware/news/2009/12/intels-larrabee-gpu-put-on-ice-more-news-to-come-in-2010.ars, Dec. 04, 2009.

[49] J. Nickolls, I. Buck, M. Garland and K. Skadron, Scalable Parallel Programming with CUDA, ACM Queue, Vol. 6, No. 2, 2008, pages 40-53.

[50] E. B. Nightingale, O. Hodson, R. McIlroy, et al., Helios: Heterogeneous Multiprocessing with Satellite Kernels, In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ACM Press, Oct. 2009.

[51] NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, NVIDIA Corporation, Nov. 2007.

[52] OpenMP Application Program Interface, Version 3.0, http://openmp.org, May 2008.

[53] H. Pan, B. Hindman, and K. Asanović, Composing Parallel Software Efficiently with Lithe, In Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'10), June 2010.

[54] C. Reis and S. D. Gribble, Isolating Web Programs in Modern Browser Architectures, In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys'09), ACM Press, 2009, pages 219-231.

[55] M. C. Rinard and M. S. Lam, The Design, Implementation, and Evaluation of Jade, ACM Transactions on Programming Languages and Systems, Vol. 20, No. 3, May 1998.

[56] B. Stroustrup, The C++ Programming Language, Special Edition, Addison-Wesley Publishing Company, 2001.

[57] L. Snyder, The Design and Development of ZPL, In Proceedings of the Third ACM SIGPLAN Conference on History of Programming Languages, ACM Press, 2007.

[58] W. Thies, Language and Compiler Support for Stream Programs, PhD Thesis, Massachusetts Institute of Technology, 2009.

[59] J. Torrellas, L. Ceze, J. Tuck, et al., The Bulk Multicore Architecture for Improved Programmability, CACM, Vol. 52, No. 12, 2009, pages 58-65.

[60] N. Travinin and J. Kepner, pMatlab Parallel Matlab Library, Lincoln Lab, MIT, 2006.

[61] UPC Consortium, UPC Language Specifications, Version 1.2, May 31, 2005.

[62] L. G. Valiant, A Bridging Model for Multi-Core Computing, Journal of Computer and System Sciences, Vol. 77, Issue 1, Jan. 2011, pages 154-166.

[63] P. H. Wang, J. D. Collins, G. N. Chinya, et al., EXOCHI: Architecture and Programming Environment for A Heterogeneous Multi-core Multithreaded System, In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'07), ACM Press, June 2007.

[64] Y. Wang, R. M. Hyatt, and B. R. Bryant, Architectural Considerations with Distributed Computing, In Proceedings of the 2nd International Conference on Enterprise Information Systems (ICEIS 2000), July 2000, pages 535-536.

[65] M. S. Warren and J. K. Salmon, Astrophysical N-body Simulation Using Hierarchical Tree Data Structures, In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, IEEE Press, 1992, pages 570-576.

[66] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall, Inc., 1999.

[67] H. Wong, A. Bracy, E. Schuchman, et al., Pangaea: A Tightly-Coupled IA32 Heterogeneous Chip Multiprocessor, In Proceedings of Parallel Architectures and Compilation Techniques (PACT 08), 2008.

[68] Working Draft, Standard for Programming Language C++, Document No. N2798, ISO/IEC, Oct. 2008.
