Dataflow Architectures and Multithreading

Ben Lee, Oregon State University
A.R. Hurson, Pennsylvania State University

August 1994 | 0018-9162/94 $4.00 © 1994 IEEE

The dataflow model of execution offers attractive properties for parallel processing. First, it is asynchronous: Because it bases instruction execution on operand availability, synchronization of parallel activities is implicit in the dataflow model. Second, it is self-scheduling: Except for data dependencies in the program, dataflow instructions do not constrain sequencing; hence, the dataflow graph representation of a program exposes all forms of parallelism, eliminating the need to explicitly manage parallel execution. For high-speed computations, the advantage of the dataflow approach over the control-flow method stems from the inherent parallelism embedded at the instruction level. This allows efficient exploitation of fine-grain parallelism in application programs.

Contrary to initial expectations, implementing dataflow computers has presented a monumental challenge. Now, however, multithreading offers a viable alternative for building hybrid architectures that exploit parallelism.

Due to its simplicity and elegance in describing parallelism and data dependencies, the dataflow execution model has been the subject of many research efforts. Since the early 1970s, a number of hardware prototypes have been built and evaluated, and different designs and compiling techniques have been simulated. The experience gained from these efforts has led to progressive development in dataflow computing. However, a direct implementation of computers based on the dataflow model has been found to be an arduous task.

Studies from past dataflow projects revealed some inefficiencies in dataflow computing. For example, compared to its control-flow counterpart, the dataflow model's fine-grained approach to parallelism incurs more overhead in execution. The overhead involved in detecting enabled instructions and constructing result tokens generally results in poor performance in applications with low degrees of parallelism. Another problem with the dataflow model is its inefficiency in handling data structures (for example, arrays of data). The execution of an instruction involves consuming tokens at the input arcs and generating result tokens at the output arcs. Since tokens represent scalar values, the representation of data structures (collections of many tokens) poses serious problems.

In spite of these shortcomings, we're seeing renewed interest in dataflow computing. This revival is facilitated by a lack of developments in the conventional parallel-processing arena, as well as by changes in the actual implementation of the dataflow model. One important development, the emergence of an efficient mechanism to detect enabled nodes, replaces the expensive and complex process of matching tags used in past projects. Another change is the convergence of the control-flow and dataflow models of execution. This shift in viewpoint allows incorporation of conventional control-flow execution into the dataflow approach, thus alleviating the inefficiencies associated with the pure-dataflow method. Finally, some researchers suggest supporting the dataflow concept with appropriate compiling technology and program representation instead of specific hardware. This view allows implementation on existing control-flow processors. These developments, coupled with experimental evidence of the dataflow approach's success in exposing substantial parallelism in application programs, have motivated new research in dataflow computing. However, issues such as program partitioning and scheduling and resource requirements must still be resolved before the dataflow model can provide the necessary computing power to meet today's and future demands.

Past dataflow architectures

Three classic dataflow machines were developed at MIT and the University of Manchester: the Static Dataflow Machine, the Tagged-Token Dataflow Architecture (TTDA), and the Manchester Machine. These machines, including the problems encountered in their design and their major shortcomings, provided the foundation that has inspired many current dataflow projects.

Static versus dynamic architectures. In the abstract dataflow model, data values are carried by tokens. These tokens travel along the arcs connecting various instructions in the program graph. The arcs are assumed to be FIFO queues of unbounded capacity. However, a direct implementation of this model is impossible. Instead, the dataflow execution model has traditionally been classified as either static or dynamic.

Static. The static dataflow model was proposed by Dennis and his research group at MIT. Figure 1 shows the general organization of their Static Dataflow Machine. The activity store contains instruction templates that represent the nodes in a dataflow graph. Each instruction template contains an operation code, slots for the operands, and destination addresses (Figure 2). To determine the availability of the operands, slots contain presence bits. The update unit is responsible for detecting instructions that are available to execute. When this condition is verified, the unit sends the address of the enabled instruction to the fetch unit via the instruction queue. The fetch unit fetches and sends a complete operation packet containing the corresponding opcode, data, and destination list to one of the operation units and clears the presence bits. The operation unit performs the operation, forms result tokens, and sends them to the update unit. The update unit stores each result in the appropriate operand slot and checks the presence bits to determine whether the activity is enabled.

Dynamic. The dynamic dataflow model was proposed by Arvind at MIT and by Gurd and Watson at the University of Manchester. Figure 3 shows the general organization of the dynamic dataflow model. Tokens are received by the matching unit, which is a memory containing a pool of waiting tokens. The unit's basic operation brings together tokens with identical tags. If a match exists, the corresponding token is extracted from the matching unit, and the matched token set is passed to the fetch unit. If no match is found, the token is stored in the matching unit to await a partner. In the fetch unit, the tags of the token pair uniquely identify an instruction to be fetched from the program memory. Figure 4 shows a typical instruction format for the dynamic dataflow model. It consists of an operation code, a literal/constant field, and destination fields. The instruction and the token pair form the enabled instruction, which is sent to the processing unit. The processing unit executes the enabled instructions and produces result tokens to be sent to the matching unit via the token queue.

Figure 1. The basic organization of the static dataflow model.
Figure 2. An instruction template for the static dataflow model.
Figure 3. The general organization of the dynamic dataflow model.
Figure 4. An instruction format for the dynamic dataflow model.

Definitions

Dataflow execution model. In the dataflow execution model, an instruction execution is triggered by the availability of data rather than by a program counter.

Static dataflow execution model. The static approach allows at most one instance of a node to be enabled for firing. A dataflow actor can be executed only when all of the tokens are available on its input arcs and no tokens exist on any of its output arcs.

Dynamic dataflow execution model. The dynamic approach permits activation of several instances of a node at the same time during runtime. To distinguish between instances of a node, a tag associated with each token identifies the context in which a particular token was generated. An actor is considered executable when its input arcs contain a set of tokens with identical tags.

Under the direct matching scheme, any computation is completely described by a pointer to an instruction (IP) and a pointer to an activation frame (FP). A typical instruction pointed to by an IP specifies an opcode, an offset in the activation frame where the match will take place, and one or more displacements that define the destination instructions that will receive the result token(s). The actual matching process is achieved by checking the disposition of the slot in the frame. If the slot is empty, the value of the token is written in the slot and its presence bit is set to indicate that the slot is full. If the slot is already full, the value is extracted, leaving the slot empty, and the corresponding instruction is executed.

Thread. Within the scope of a dataflow environment, a thread is a sequence of statically ordered instructions such that once the first instruction in the thread is executed, the remaining instructions execute without interruption (synchronization occurs only at the beginning of the thread). Multithreading implies the interleaving of these threads. Interleaving can be done in many ways, that is, on every cycle, on remote reads, and so on.

Centralized or distributed. Dataflow architectures can also be classified as centralized or distributed, based on the organization of their instruction memories. The Static Dataflow Machine and the Manchester Machine both have centralized memory organizations. MIT's dynamic dataflow organization is a multiprocessor system in which the instruction memory is distributed among the processing elements. The choice between centralized or distributed organization has a direct effect on program allocation.

Comparison. The major advantage of the static dataflow model is its simplified mechanism for detecting enabled nodes. Presence bits (or a counter) determine the availability of all required operands. However, the static dataflow model has a performance drawback when dealing with iterative constructs and reentrancy. The attractiveness of the dataflow concept stems from the possibility of concurrent execution of all independent nodes if sufficient resources are available, but reentrant graphs require strict enforcement of the static firing rule to avoid nondeterminate behavior. (The static firing rule states that a node is enabled for firing when a token exists on each of its input arcs and no token exists on its output arc.) To guard against nondeterminacy, extra arcs carry acknowledge signals from consuming nodes to producing nodes. These acknowledge signals ensure that no arc will contain more than one token.

The acknowledge scheme can transform reentrant code into an equivalent graph that allows pipelined execution of consecutive iterations, but this transformation increases the number of arcs and tokens. More important, it exploits only a limited amount of parallelism, since the execution of consecutive iterations can never fully overlap, even if no loop-carried dependencies exist. Although this inefficiency can be alleviated in the case of loops by providing multiple copies of the program graph, the static dataflow model lacks general support for programming constructs essential to any modern programming environment (for example, procedure calls and recursion).

The major advantage of the dynamic dataflow model is the higher performance it obtains by allowing multiple tokens on an arc. For example, a loop can be dynamically unfolded at runtime by creating multiple instances of the loop body and allowing concurrent execution of the instances. For this reason, current dataflow research efforts indicate a trend toward adopting the dynamic dataflow model. However, as we'll see in the next section, its implementation presents a number of difficult problems.

Lessons learned. Despite the dynamic dataflow model's potential for large-scale parallel computer systems, earlier experiences have identified a number of shortcomings:

• the overhead involved in matching tokens is heavy,
• resource allocation is a complicated process,
• the dataflow instruction cycle is inefficient, and
• handling data structures is not trivial.

Detection of matching tokens is one of the most important aspects of the dynamic dataflow computation model. Previous experiences have shown that performance depends directly on the rate at which the matching mechanism processes tokens. To facilitate matching while considering the cost and the availability of a large-capacity associative memory, Arvind proposed a pseudoassociative matching mechanism that typically requires several memory accesses, but this increase in memory accesses severely degrades the performance and the efficiency of the underlying dataflow machines.

A more subtle problem with token matching is the complexity of allocating resources (memory cells). A failure to find a match implicitly allocates memory within the matching hardware. In other words, mapping a code-block to a processor places an unspecified commitment on the processor's matching unit. If this resource becomes overcommitted, the program may deadlock. In addition, due to hardware complexity and cost, one cannot assume this resource is so plentiful it can be wasted. (See Culler and Arvind for a good discussion of the resource requirements issue.)

A more general criticism leveled at dataflow computing is instruction cycle inefficiency. A typical dataflow instruction cycle involves (1) detecting enabled nodes, (2) determining the operation to be performed, (3) computing the results, and (4) generating and communicating result tokens to appropriate target nodes. Even though the characteristics of this instruction cycle may be consistent with the pure-dataflow model, it incurs more overhead than its control-flow counterpart. For example, matching tokens is more complex than simply incrementing a program counter. The generation and communication of result tokens imposes inordinate overhead compared with simply writing the result to a memory location or a register. These inefficiencies in the pure-dataflow model tend to degrade performance under a low degree of parallelism.

Another formidable problem is the management of data structures. The dataflow functionality principle implies that all operations are side-effect free; that is, when a scalar operation is performed, new tokens are generated after the input tokens have been consumed. However, the absence of side effects implies that if tokens are allowed to carry vectors, arrays, or other complex structures, an operation on a structure element must result in an entirely new structure. Although this solution is theoretically acceptable, in practice it creates excessive overhead at the level of system performance. A number of schemes have been proposed in the literature, but the problem of efficiently representing and manipulating data structures remains a difficult challenge.

Table 1. Architectural features of current dataflow systems.

Category: Pure-dataflow
General characteristics: Implements the traditional dataflow instruction cycle. Direct matching of tokens.
  Monsoon [1, 2]:
  • Direct matching of tokens using rendezvous slots in the frame memory (ETS model).
  • Associates three temporary registers with each thread of computation.
  • Sequential scheduling implemented by recirculating continuations using a direct recirculation path (instructions are annotated with special marks to indicate that the successor instruction is IP+1).
  • Multiple threads supported by FORK and implicit JOIN.
  Epsilon-2 [3]:
  • A separate match memory maintains match counts for the rendezvous slots, and each operand is stored separately in the frame memory.
  • Repeat unit is used to reduce the overhead of copying tokens and to represent a thread of computation (macroactor) as a linked list.
  • Use of a register file to temporarily store values within a thread.

Category: Macro-dataflow
General characteristics: Integration of a token-based dataflow circular pipeline and an advanced control pipeline. Direct matching of tokens.
  EM-4 [4]:
  • Use of macroactors based on the strongly connected arc model, and the execution of macroactors using an advanced control pipeline.
  • Use of registers to reduce the instruction cycle and the communication overhead of transferring tokens within a macroactor.
  • Special thread functions, FORK and NULL, to spawn and synchronize multiple threads.

Category: Hybrid
General characteristics: Based on a conventional control-flow processor, i.e., sequential scheduling is implicit. Tokens do not carry data, only continuations. Provides limited token-matching capability through special synchronization primitives (i.e., JOIN).
  P-RISC [5]:
  • Support for multithreading using a token queue and circulating continuations. Thread switching can occur on every cycle or when a thread dies due to LOADs or JOINs.
  • RISC-based architecture.
  *T [6]:
  • Overhead is reduced by off-loading the burden of message handling and synchronization to separate coprocessors.
  • Message handlers implement interprocessor communication.
  TAM (Threaded Abstract Machine) [7]:
  • Places all synchronization, scheduling, and storage management responsibility under compiler control, e.g., exposes the token queue (i.e., continuation vector) for scheduling threads.
  • Has the compiler produce specialized message handlers as inlets to each code-block.
  • Can use both conventional and dataflow compiling technologies.

1. D.E. Culler and G.M. Papadopoulos, "The Explicit Token Store," J. Parallel and Distributed Computing, Vol. 10, 1990, pp. 289-308.
2. G.M. Papadopoulos and K.R. Traub, "Multithreading: A Revisionist View of Dataflow Architectures," Proc. 18th Annual Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2146, 1991, pp. 342-351.
3. V.G. Grafe and J.E. Hoch, "The Epsilon-2 Multiprocessor System," J. Parallel and Distributed Computing, Vol. 10, 1990, pp. 309-318.
4. S. Sakai et al., "An Architecture of a Dataflow Single-Chip Processor," Proc. 16th Annual Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 1948 (microfiche only), 1989, pp. 46-53.
5. R.S. Nikhil and Arvind, "Can Dataflow Subsume von Neumann Computing?" Proc. 16th Annual Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 1948 (microfiche only), 1989, pp. 262-272.
6. R.S. Nikhil, G.M. Papadopoulos, and Arvind, "*T: A Multithreaded Architecture," Proc. 19th Annual Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2941 (microfiche only), 1992, pp. 156-167.
7. D.E. Culler et al., "Fine-Grain Parallelism with Minimal Hardware Support: A Compiler-Controlled Threaded Abstract Machine," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 1991.

Figure 5. Organization of a pure-dataflow processing element.
Figure 6. Organization of a hybrid processing element.

Recent architectural developments

Table 1 lists the features of six machines that represent the current trend in dataflow computing: MIT's Monsoon, Sandia National Labs' Epsilon-2, Electrotechnical Labs' EM-4, MIT's P-RISC, MIT's *T, and UC Berkeley's TAM (Threaded Abstract Machine). Each proposal provides a different perspective on how parallel processing based on the dataflow concept can be realized; however, they share the goal of alleviating the inefficiencies associated with the dataflow model of computation. From these efforts, a number of innovative solutions have emerged.

Three categories. The dataflow machines currently advanced in the literature can be classified into three categories: pure-dataflow, macro-dataflow, and hybrid.

Figure 5 shows a typical processing element (PE) based on the pure-dataflow organization. It consists of an execution pipeline connected by a token queue. (The processing unit contains an ALU and a target address calculation unit for computing the destination address(es). It may also contain a set of registers to temporarily store operands between instructions.) The pure-dataflow organization is a slight modification of an architecture that implements the traditional dataflow instruction cycle. The major differences between the current organizations and the classic dynamic architectures are (1) the reversal of the instruction fetch unit and the matching unit and (2) the introduction of frames to represent contexts. These changes are mainly due to the implementation of a new matching scheme. Monsoon and Epsilon-2 are examples of machines based on this organization.

The hybrid organization departs more radically from the classic dynamic architectures. Tokens carry only tags, and the architecture is based on conventional control-flow sequencing (see Figure 6). Therefore, architectures based on this organization can be viewed as von Neumann machines that have been extended to support fine-grained dataflow capability. Moreover, unlike the pure-dataflow organization, where token matching is implicit in the architecture, machines based on the hybrid organization provide a limited token-matching capability through special synchronization primitives. P-RISC, *T, and TAM can be categorized as hybrid organizations.

The macro-dataflow organization, shown in Figure 7, is a compromise between the other two approaches. It uses a token-based circular pipeline and an advanced control pipeline (a look-ahead control that implements instruction prefetching and token prematching to reduce idle times caused by unsuccessful matches). The basic idea is to shift to a coarser grain of parallelism by incorporating control-flow sequencing into the dataflow approach. EM-4 is an example of a macro-dataflow organization.

Token matching. One of the most important developments to emerge from current dataflow proposals is a novel and simplified process of matching tags: direct matching. The idea is to eliminate the expensive and complex process of associative search used in previous dynamic dataflow architectures. In a direct matching scheme, storage (called an activation frame) is dynamically allocated for all the tokens generated by a code-block. The actual location used within a code-block is determined at compile time; however, the allocation of activation frames is determined during runtime. For example, "unfolding" a loop body is achieved by allocating an activation frame for each loop iteration. The matching tokens generated within an iteration have a unique slot in the activation frame in which they converge.

Figure 7. Organization of a macro-dataflow processing element.
Figure 8. Explicit-token-store representation of a dataflow program execution.

The actual matching process simply checks the disposition of the slot in the frame memory (see Figure 8). If the slot is empty, the token's value is written into it and its presence bit is set; if the slot is already full, the partner value is extracted and the corresponding instruction is executed. The result token(s) generated from the operation is communicated to the destination instruction(s) by updating the IP according to the displacement(s) encoded in the instruction. For example, execution of an ADD operation produces two result tokens, one for each destination displacement encoded in the instruction.

In a direct matching scheme, any computation is completely described by a pointer to an instruction (IP) and a pointer to an activation frame (FP); this pair of pointers is called a continuation. The EM-4 uses the unique location in the operand segment (called an entry point) to match tokens and then to fetch the corresponding instruction word. In the Epsilon-2 dataflow multiprocessor, a separate storage (match memory) maintains the rendezvous slots used for matching, with each operand stored separately in the frame memory.
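The direct matching scheme described above reduces token matching to a single presence-bit test on a frame slot. The following sketch illustrates that test in software; the function name and frame representation are invented for illustration, not taken from any of the machines discussed.

```python
# Sketch of direct matching: a token addressed by <FP, IP> checks one
# compile-time-fixed slot in its activation frame. Illustrative only.

EMPTY = object()

def direct_match(frame, offset, value):
    """Return the matched operand pair, or None if the token must wait."""
    if frame[offset] is EMPTY:        # slot empty: write value, set presence
        frame[offset] = value
        return None
    partner = frame[offset]           # slot full: extract, leave slot empty,
    frame[offset] = EMPTY             # and the instruction becomes enabled
    return (partner, value)

frame = [EMPTY] * 4                   # activation frame allocated at runtime
assert direct_match(frame, 2, 10) is None      # first token waits in slot 2
assert direct_match(frame, 2, 32) == (10, 32)  # second token enables the node
assert frame[2] is EMPTY                       # slot cleared after the match
```

Compared with the associative search of the classic tagged-token machines, the match here costs one memory read and, at most, one write.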

32 COMPUTER ample of such an architecture dataflow organization).For the (Table 1, reference 5). P- sake of presentation, we side- RISC is based on a conven- step discussion of the first ap- Operand segment tional RISC-type architec- proach, since sequential ture, which is extended to FP scheduling is implicit in hybrid support fine-grain dataflow architectures,and focus on the capability. To synchronize second. two threads of computations, Control-flow sequencing can JOIN x be incorporated into the a instruction toggles offset the contents of the frame lo- dataflow model in a couple cation FP + x. If the frame lo- ways. The first method is a sim- cation FP + x is empty, no ple recirculate scheduling continuation is produced, and paradigm. Since a continuation the thread dies. If the frame is completely described by a location is full, it produces a pair of pointers , continuation . where IP represents a pointer Therefore, a JOIN instruction to the current instruction, the implements the direct match- successor instruction is simply ing scheme and provides a Figure 9. The EM-4’sdirect matching scheme. . In addition to this general mechanism for syn- simple manipulation, the hard- chronizing two threads of ware must support the immedi- computations. Note that the ate reinsertionof continuations JOIN operation can be generalizedto sup- ming medium to encode imperative op- into the execution pipeline. This is port n-way synchronization;that is, the erations essential for execution of oper- achieved by using a direct recirculation frame location x is initialized to n - 1and ating system functions (for example, re- path (see Figure 5) to bypass the token different JOIN operationsdecrement it un- source management).4 In addition, the queue. til it reaches zero. 
JOIN instructions are instruction cycle can be further reduced One potential problem with the recir- used in *T and TAM to provide explicit by using a register file to temporarily culation method is that successor contin- synchronization. store operands in a grain, which elimi- uations are not generated until the end nates the overhead involved in con- of the pipeline (“Form token unit” in Fig- Convergence of dataflow and von structing and transferring result tokens ure 5). Therefore, execution of the next Neumann models. Dataflow architec- within a grain. instruction in a computationalthread will tures based on the original model pro- experience a delay equal to the number vide well-integratedsynchronization at a of stages in the pipe. This means that the very basic level - the instruction level. total number of cycles required to exe- The combined dataflowlvon Neumann cute a single thread of computation in a model groups instructions into larger The combined model k-stage pipeline will be k times the num- grains so that instructionswithin a grain ber of instructions in the thread. On the can be scheduled in a control-flow fash- groups instructions other hand, the recirculation method al- ion and the grains themselves in a into larger grains lows interleaving up to k independent dataflow fashion. This convergence com- for control-flow threads in the pipeline, effectively mask- bines the power of the dataflow model ing long and unpredictable latency due for exposing parallelism with the execu- scheduling of instruc- to remote loads. tion efficiency of the control-flow model. tions but dataflow The second technique for implement- Although the spectrum of dataflowlvon scheduling of grains. ing sequential scheduling uses the Neumann hybrid is very broad, two key macroactor concept.This scheme groups features supportingthis shift are sequen- the nodes in a dataflow graph into tial scheduling and use of registers to macroactors. 
The nodes within a temporarily buffer the results between There are contrasting views on merg- macroactor are executed sequentially; instructions. ing the two conceptually different execu- the macroactors themselves are sched- tion models. The first is to extend existing uled according to the dataflow model. Sequential scheduling. Exploiting a conventionalmultiprocessors to provide The EM-4 dataflow multiprocessor im- coarser grain of parallelism (compared dataflow capability (hybrid organiza- plements macroactors based on the with instruction-levelparallelism) allows tion). The idea here is to avoid a radical strongly connected arc model (Table 1, use of a simple control-flow sequencing departure from existing programming reference 4). This model categorizesthe within the grain. This is in recognition of methodology and architecturein favor of arcs of a dataflow graph as normal or the fact that data-driven sequencing is a smoother transition that provides incre- strongly connected. A subgraph whose unnecessarily general and such flexible mental improvement as well as software nodes are connected by strongly con- instruction scheduling comes at a cost of compatibility with conventionalmachines. nected arcs is called a strongly connected overhead required to match tokens. The second approach is to incorporate block (SCB). An SCB is enabled (fired) Moreover, the self-scheduling paradigm control-flow sequencing into existing when all its input tokens are available. fails to provide an adequate program- dataflow architectures (pure- or macro- Once an SCB fires, all the nodes in the

August 1994 33 Token in I Execution Dimline

Figure 10. Linked-list rep- resentation of a dataflow graph using the repeat unit (d and r represent the repeat offsets).

block are executed exclusively by means the number of active CDs can be large, $. of the advanced control pipeline. only threads occupying the execution Token out Epsilon-2 uses a unique architectural pipeline are associated with T registers. feature - the repeat unit - to imple- As long as a thread does not die, its in- ment sequential scheduling (Table 1, ref- structions can freely use its registers. Figure 11. Use of a register file. erence 3). The unit generates repeat to- Once a thread dies, its registers may be kens, which efficiently implement data committed to a new thread entering the fanouts in dataflow graphs and signifi- pipeline; therefore, the register values cantly reduce the overhead of copying to- are not necessarily preserved across ecute without interruption? A thread de- kens. The unit represents a thread of grain boundaries. All current architec- fines a basic unit of work consistent with computation as a linked list and uses reg- tures use registers but, unlike Monsoon, the dataflow model, and current dataflow isters to buffer the results between in- do not fix the number of registers asso- projects are adopting multithreading as structions (see Figure 10). To generate a ciated with each computational grain. a viable method for combining the fea- repeat token, it adds the repeat offset en- Therefore, proper use of resources re- tures of the dataflow and von Neumann coded in the instruction word to the cur- quires a compile-time analysis. execution models. rent token’s instruction pointer. Thus, the repeat unit can prioritize instruction ex- Supporting multiple threads. Basically, ecution within a grain. Multithreading hybrid dataflow architectures can be viewed as von Neumann machines ex- Register use. Inefficient communica- Architectures based on the dataflow tended to support fine-grained interleav- tion of tokens among nodes was a major model offer the advantage of instruction- ing of multiple threads. 
As an illustration problem in past dataflow architectures. level context switching. Since each datum of how this is accomplished, consider P- The execution model requires combin- carries context-identifying information RISC, which is strongly influenced by Ian- ing result values with target addresses to in a continuation or tag, context switching nucci’s dataflowlvon Neumann hybrid ar- form result tokens for communication to can occur on a per-instruction basis. chitecture? P-RISC is a RISC-based successor nodes. Sequential scheduling Thus, these architectures tolerate long, architecture in the sense that, except for avoids this overhead whenever locality unpredictable latency resulting from split loadktore instructions, instructions are can be exploited, registers can be used to transactions. (This refers to message- frame-to-frame (that is, register-to-regis- temporarily buffer results. based communication due to remote ter) operations that operate within the PE The general method of incorporating loads. A split transaction involves a read (Figure 6). The local memory contains in- registers in the dataflow execution request from processor A to processor B structions and frames. Notice that the P- pipeline is illustrated in Figure 11. A set containing the address of the location to RISC executes three address instructions of registers is associated with each com- be read and a return address, followed by on data stored in the frames; hence, “to- putational grain, and the instructions in a response from processor B to processor kens” carry only continuations. The in- the grain are allowed to refer to these A containing the requested value.) struction-fetch and operand-fetch units registers as operands. 
For example, Combining instruction-level context fetch appropriate instructions and Monsoon employs three temporary reg- switching with sequential scheduling operands, respectively, pointed to by the isters (called T registers) with each com- leads to another perspective on dataflow continuation. The functional unit per- putational thread. The T registers, com- architectures - multithreading. In the forms general RISC operations, and the bined with a continuation and a value, context of multithreading, a thread is a operand store is responsible for storing re- are called computation descriptors sequence of statically ordered instruc- sults in the appropriate slots of the frame. (CDs). A CD completely describes the tions where once the first instruction is For the most part, P-RISC executes in- state of a computational thread. Since executed, the remaining instructions ex- structions in the same manner as a con-

34 COMPUTER FORK labell I A’ FP.IP+l \FP. labell Remote Thread 1 i Thread2 memory Synchronization Data request processor coprocessor F P.label2 JFP.label2 7 t label2: JOINX Local memory

Figure 12. Application of FORK and JOINconstructs. Figure 13. Organization of a *Tprocessor node.

ventional RISC. An arithmeticllogicin- spectively. For example, the general form from the token queue every pipeline cycle. struction generates a continuation that is of a synchronizing memory read, such as *T, a successor to P-RISC, provides simply .However, unlike ar- I-structure,is x, d, where an address similar extensionsto the conventionalin- chitectures based on the pure-dataflow a in frame location FP + x is used to load struction set (Table 1, reference 6). How- organization, 1P represents a program data onto frame location FP + d. Execut- ever, to improve overall performance, the counter in the conventional sense and is ing a LOAD instruction generates an out- thread-execution and the message-han- incremented in the instruction-fetch going message (via the loadlstore unit) dling responsibilities are distributed stage. For a JUMP x instruction, the con- of the form among three asynchronous processors. tinuation is simply . To provide The general organization of *T is shown fine-grained dataflow capability, the in- in Figure 13. A *T node consists of the struction set is extended with two special data processor, the remote-memory re- instructions - FORK and JOIN - used to where a represents the location of the quest coprocessor, and the synchroniza- spawn and synchronize independent value to be read and d is the offset rela- tion coprocessor, which all share a local threads. These are simple operations ex- tive to FP. This causes the current thread memory. The data processor executes ecuted within the normal processor to die. Therefore, a new continuation is threads by extractingcontinuations from pipeline, not calls. A extracted from the token queue, and the the token queue using the NEXT instruc- FORK instruction is a combination of a new thread will be initiated. On its return tion. It is optimized to provide excellent JUMP and a fall-through to the next in- trip from the I-structure memory, an in- single-threadperformance. The remote- struction. 
Executing a FORK label has two coming START message (via the start unit) memory request coprocessorhandles in- effects. First, the current thread continues writes the value at location FP + d and coming remote LOADISTORErequests. For to the next instruction by generating a continuesexecution of the thread. A pro- example, a LOAD request is processed by continuation of the form . cedure is invoked in a similar fashion a message handler, which (1) accesses the Second, a new thread is created with the when the caller writes the argumentsinto local memory and (2) sends a START mes- same continuation as the current contin- the callee’s activationframe and initiates sage as a direct response without dis- uation except that the “IPis replaced by the threads. This is achieved by the in- turbing the data processor. On the other “label” (that is, ). JOIN,on the struction hand, the synchronization coprocessor other hand, is an explicit synchronization handles returning LOAD responses and primitive that provides the limited direct- START,dv, &PIP, dd JOIN operations. Its responsibility is to (1) matching capability used in other continually queue messages from the net- dataflow machines. FORK and JOIN opera- which reads a value v from FP + dv, a con- work interface; (2) complete the unfin- tions are illustrated in Figure 12. tinuation from FP+dFPIP, and ished remote LOADS by placing message In addition to FORK and JOIN, a special an offset d from FP + dd. Executing this values in destination locations and, if START message initiates new threads and instruction sends a START message to an necessary, performing JOIN operations; implements interprocessor communica- appropriate processing element and ini- and (3) place continuations in the token tion. The message of the form tiates a thread. queue for later pickup and execution by P-RISC supportsmultithreading in one the data processor. of two ways. 
In the first method, as long as In contrast to the two hybrid projects a thread does not die due to LOADS or discussed thus far, TAM provides a con- writes the value in the location FP + d JOINS,it is executed using the von Neu- ceptually different implementation of the and initiates a thread describedby FP.IP. mann scheduling IP + 1. When a thread dataflow execution model and multi- START messages are generated by LOAD dies, a context is performed by ex- threading (Table 1, reference 7). In and STORE instructions used to implement tracting a token from the token queue, TAM, the executionmodel for fine-grain I-structure reads and procedure calls, re- The second method is to extract a token interleaving of multiple threads is sup-

August 1994 35 Figure 14. A Threaded Ab- FORK labdl stract Machine (TAM) activa- tion.

FP.label2, vi / FP.label2, vr

...... FP.label1,labell VI)( .labell, vr

[FP+offset]: ported by an appropriate compilation thread scheduling is explicit and under FP.labell+l , FP.label2, strategy and program representation, not compiler control. vl+M Vl+M by elaborate hardware. In other words, In contrast to P-RISC, *T, and TAM, rather than viewing the execution model Monsoon’s multithreading capability is for fine-grain parallelism as a property based on the pure-dataflow model in the Figure 15. Application of FORK and sense that tokens not only schedule in- of the machine, all synchronization, JOIN constructs and equivalent dataflow scheduling, and storage management is structions but also carry data. Monsoon actor in Monsoon. explicit and under compiler control. incorporates sequential scheduling, but Figure 14 shows an example of a TAM with a shift in viewpoint about how the activation. Whenever a code-block is in- architecture works. It uses presence bits voked, an activation frame is allocated. to synchronize predecessorlsuccessor in- as thread-spawning and thread-synchro- The frame provides storage for local vari- structions through rendezvous points in nizing primitives, these operations can ables, synchronization counters, and a the activation frame. To implement the also be recognized as instructions in the continuation vector that contains ad- recirculate scheduling paradigm, instruc- dataflow execution model. For example, dresses of enabled threads within the tions that depend on such scheduling are a FORK instruction is similar to but less called code-block. Thus, when a frame is annotated with a special mark to indicate general than a COPY operation used in the scheduled, threads are executed from its that the successor instruction is IP + 1. pure-dataflow model. JOIN operations are continuation vector, and the last thread This allows each instruction within a implemented through the direct match- schedules the next frame. computation thread to enter the execu- ing scheme. 
For example, consider the TAM supports the usual FORK opera- tion pipeline every k cycles, where k is dataflow actor shown in Figure 15, which tions that cause additional threads to be the number of stages in the pipe. The ac- receives two input tokens, evaluates the scheduled for execution. A thread can be tual thread interleaving is accomplished sum, and generates two output tokens. synchronized using SYNC (same as JOIN) by extracting a token from the token The operation of this actor can be real- operations that decrement the entry queue and inserting it into the execution ized by a combination of instructions of count for the thread. A conditional flow pipeline every clock cycle. Thus, up to k the form of execution is supported by a SWITCH op- active threads can be interleaved through eration that forks one of two threads the execution pipeline. labell[FP + offset]: based on a Boolean input value. A STOP New threads and synchronizing ADD vl, vr II FORK laben, (same as NEXT) operation terminates the threads are introduced by primitives sim- current thread and causes initiation of an- ilar to those used in hybrid machines - where II represents the combination of an other thread. TAM also supports inter- FORK and JOIN. For example, a FORK la- ADD and a FORK. In a pure-dataflow or- frame messages, which arise in passing bell generates two continuations, and

36 COMPUTER achieved by having thread library func- switching occurs when a main memory al.6 proposed a partitioning scheme based tions that allow parallelism to be ex- access is required (due to a miss). on dual graphs. A dual graph is a directed pressed explicitly? For example, a master As a consequence,the thread granularity graph with three types of arcs: data, con- thread spawns and terminates a slave in these models tends to be coarse, trol, and dependence. A data arc repre- thread using functions FORK and NULL, re- thereby limiting the amount of paral- sents the data dependency between pro- spectively. The general format for the lelism that can be exposed. On the other ducer and consumer nodes. A control arc FORK is given as hand, non-strict functional languagesfor represents the scheduling order between dataflow architectures, such as Id, com- two nodes, and a dependence arc speci- FORK(PE, func, n, argl, . . . , argn), plicate partitioning due to feedback de- fies long latency operation from message pendenciesthat may only be resolved dy- handlers (that is, inlets and outlets) send- where PE specifies the processing ele- namically. These situationsarise because ingheceiving messages across code-block ment where the thread was created, and of the possibility of functionsor arbitrary boundaries. n represents the number of arguments in expressions returning results before all The actual partitioning uses only the the thread. This function causes the fol- operands are computed (for example, I- control and dependence edges. First, the lowing operations: (1) allocate a new structure semantics). Therefore, a more nodes are grouped as dependence sets operand segment on the processing ele- restrictive constraint is placed on parti- that guarantee a safe partition with no ment and link it to the template segment tioning programs written in non-strict cyclic dependencies.A safe partition has specified by the token’s address portion, languages. 
the following characteristics6:(1) no out- (2) write the arguments into the operand Iannucci’ has outlined several impor- put of the partition needs to be produced segment, (3) send the NULL routine’s ad- tant issues to consider in partitioning pro- before all inputs are available; (2) when dress as a return address for the newly the inputs to the partition are available, created thread, and (4) continue the ex- all the nodes in the partition are exe- ecution of the current thread. Once exe- cuted; and (3) no arc connects a node in cution completes, the new thread termi- the partition to an input node of the same nates by executing a NULL function. Thread definitions partition. The partitions are merged into Using the FORK and NULL functions, a vary according to larger partitions based on rules that gen- master thread can distribute the work erate safe partitions. Once the general over a number of slave threads to com- language characteris- partitioning has been completed, a num- pute partial results. The final result is col- tics and context- ber of optimizationsare performed in an lected when each slave thread sends its attempt to reduce the synchronization partial result to the master thread. Mul- switching criteria. cost. Finally, the output of the partitioner tithreading is achieved by switching is a set of threads wherein the nodes in among threads whenever a remote LOAD each thread are executed sequentially operation occurs. However, since each grams: First, a partitioning method and the synchronization requirement is thread is an SCB (that is, an uninterrupt- should maximize the exploitable paral- determined statically and only occurs at able sequence of instructions),interleav- lelism. In other words, the attempt to the beginning of a thread. ing of threads on each cycle is not al- aggregateinstructions does not imply re- lowed. stricting or limiting parallelism. 
Instruc- tions that can be grouped into a thread Future prospects Partitioning programs to threads. An should be the parts of a code where little important issue in multithreading is the or no exploitable parallelism exists. Sec- To predict whether dataflow comput- partitioning of programs to multiple se- ond, the longer the thread length, the ing will have a legitimateplace in a world quentialthreads. A thread defines the ba- longer the interval between context dominated by massively parallel von sic unit of work for scheduling - and . This also increases the locality Neumann machines, consider the suit- thus a computation’s granularity. Since for better utilization of the processor’s ability of current commercial parallel ma- each thread has an associated cost, it di- resources. Third, any arc (that is, data de- chines for fine-grain parallel program- rectly affects the amount of overhead re- pendency) crossing thread boundaries ming. Studies on TAM show that it is quired for synchronization and context implies dynamic synchronization. Since possible to implement the dataflow exe- switching. Therefore, the main goal in synchronization operations introduce cution model on conventional architec- partitioning is maximizing parallelism hardware overhead and/or increase pro- tures and obtain reasonable performance while minimizing the overhead required cessor cycles, they should be minimized. (Table 1, reference 7). This has been to support the threads. Finally, the self-scheduling paradigm of demonstrated by compiling Id90 pro- A number of proposals based on the program graphs implies that execution grams to TAM and translatingthem, first control-flow model use multithreadingas ordering cannot be independent of pro- to TLO,the TAM assembly language, and a means of tolerating high-latency mem- gram inputs. 
In other words, this dynamic finally to native for a vari- ory operations, but thread definitions ordering behavior in the code must be ety of platforms, mainly CM-5.1° How- vary according to language characteris- understood and considered as a con- ever, TAM’S translation of dataflow tics and context-switching criteria. For straint on partitioning. graph program representationto control- example, the multiple-context schemes A number of thread partitioning algo- flow execution also shows a basic mis- used in Weber and Gupta9obtain threads rithms convert dataflow graph represen- match between the requirementsfor fine- by subdividing a parallel loop into a num- tation of programs into threads based on grain parallelism and the underlying ber of sequential processes, and context the criteria outlined above. Schauser et architecture. Fine-grain parallel pro-

August 1994 37 gramming models dynamically create - for example, by off-loading message TAM alleviates this problem to some ex- parallel threads of control that execute handling to coprocessors, as in *T -may tent by relegating the responsibilitiesof on data structuresdistributed among pro- reduce this even more. scheduling and storage management to cessors. Therefore, efficient support for With the advent of multithreading,fu- the compiler.For example, continuation synchronization, communication, and dy- ture dataflow machines will no doubt vectors that hold active threads are im- namic scheduling becomes crucial to adopt a hybrid flavor, with emphasis on plemented as stacks, and all frames hold- overall performance. improving the execution efficiency of ing enabled threads are linked in a ready One major problem of supportingfine- multiple threads of computations. Ex- queue. However, both hardware and grain parallelism on commercial parallel periments on TAM have already shown software methods discussed are based on machines is communication overhead. In how implementation of the dataflow ex- a naive, local scheduling discipline with recent years, the communicationperfor- ecution model can be approached as a no global strategy. Therefore, appropri- mance of commercial parallel machines compilation issue. These studies also in- ate means of directing scheduling based has improved significantly." 
One of the dicate that considerable improvement is on some global-level understanding of first commercial message-passing multi- possible through hardware support, program execution will be crucial to the computers, 's iPSC, incurs a com- which was the original goal of dataflow success of future dataflow architectures.12 munication overhead of several millisec- computer designers- to build a special- Another related problem is the alloca- onds, but current parallel machines, such ized architecture that allows direct - tion of frames and data structures.When- as KSRl, Paragon, and CM-5, incur a ping of dataflow graph programs onto the ever a function is invoked, a frame must be one-way communication overhead of just hardware. Therefore,the next generation allocated; therefore, how frames are allo- 25 to 86 microseconds." Despite this im- of dataflow machines will rely less on cated among processors is a major issue. provement, the communication overhead very specialized processors and empha- Two extreme examples are a random al- in these machines is too high to efficiently size incorporating general mechanisms location on any processor or local alloca- support dataflow execution. This can be tion on the processor invoking the func- attributed to the lack of integrationof the tion. The proper selection of an allocation I network interface as part of hardware strategy will greatly affect the balance of functionality.That is, individual proces- the computationalload. The distribution sors offer high execution performanceon of data structuresamong the processors is sequential streams of computations, but The Of closely linked to the allocation of frames. 
communication and synchronization threading &Den& For example, an allocation policy that dis- among processors have substantial over- U1 tributes a-large data structure among the head.l03l1 on rapid support of processors may experience a large num- In comparison, processors in current context switching- ber of remote messages, which will require dataflow machines-communicateby exe- more threads to mask the delays. On the cuting message handlers that directly other hand, allocating data structures lo- move data in and out of preallocated stor- cally may cause one processor to serve a age. Message handlers are short threads to support fine-grainparallelism into ex- large number of accesses, limiting the en- that messages entirely in user isting sequential processors. The major tire system's performance. Therefore,the mode, with no transfer of control to the challenge, however, will be to strike a bal- computation and data must be studied in operating system. For example, in Mon- ance between hardware complexity and unison to develop an effective allocation soon, message handlers are supported by performance. methodology. hardware: A sender node can format and Despite recent advances toward de- send a message in exactly one cycle, and veloping effective architecturesthat sup- a receiver node can process an incoming port fine-grain parallelism and tolerate he eventual success of dataflow message by storing its value and per- latency, some challengesremain. One of computers will depend on their forming a JOIN operation in one or two cy- these challenges is dynamic scheduling. T programmability. Traditionally, cles. In hybrid-class architectures, mes- The success of multithreading depends they've been programmed in languages sage handlers are implemented either by on rapid support of context switching. 
such as Id and SISAL (Streams and It- specialized hardware, as in P-RISC and This is possible only if threads are resi- erations in a Single Assignment Lan- *T, or through software (for example,in- dent at fast but small memories (that is, at guage)* that use functional semantics. terrupt or polling) as part of the processor the top level of the storage hierarchy), These languages reveal high levels of pipeline execution. The most recent hy- which limits the number of active threads concurrencyand translate onto dataflow brid-class prototype under development and thus the amount of latency that can machines and conventional parallel ma- at MIT and Motorola, inspired by the con- be tolerated. All the architectures de- chines via TAM. However, because their ceptual *T design at MIT, uses processor- scribed -Monsoon, Epsilon-2, EM-4, P- syntax and semantics differ from the im- integrated networking that directly inte- RISC, and *T - rely on a simple perative counterparts such as Fortran grates communication into the MC88110 dataflow scheduling strategy based on and , they have been slow to gain ac- superscalarRISC ." Pro- hardware token queues. The generality ceptance in the programming commu- cessor-integrated networking efficiently of the dataflow scheduling makes it diffi- nity. An alternative is to explore the use implements the message-passing mecha- cult to execute a logically related set of of established imperative languages to nism. It provides a low communication threads through the processor pipeline, program dataflow machines. However, overhead of 100 nanoseconds, and over- thereby removing any opportunity to uti- the difficulty will be analyzing data de- lapping computation with communication lize registers across thread boundaries. pendencies and extracting parallelism

38 COMPUTER from source code that contains side ef- 11. G.M. Papadopouloset al., “*T: Integrated He is a member of ACM and the IEEE Com- Building Blocks for ,” puter Society. fects. Therefore, more research is still Proc. Supercomputing 93, IEEE CS Press, needed to develop for con- Los Alamitos, Calif., Order No. 4340, ventional languages that can produce 1993, pp. 624-635. parallel code comparable to that of par- 12. D.E. Culler, K.E. Schauser, and T. von allel functional languages. H Eicken, “Two Fundamental Limits on Dataflow ,” Proc. IFIP Working Group 10.3 (Concurrent Systems) Working Conf: on Architecmre and Com- References pilation Techniques for Fine- and Medium- Grain Parallelism,Jan. 1993. 1. Arvind and D.E. Culler, “Dataflow Ar- chitectures,’’ Ann. Review in Computer A.R. Hurson is on the Science, Vol. 1,1986,pp. 225-253. faculty at Pennsylvania State University. His research for the past 12 years has been directed 2. J.-L. Gaudiot and L. Bic, Advanced Top- toward the design and analysis of general as ics in Dataflow Computing, Prentice Hall, well as special-purpose computers. He has Englewood Cliffs, N.J., 1991. published over 120 technical papers and has served as guest coeditorof special issues of the 3. Arvind, D.E. Culler, and K. Ekanadham, IEEE Proceedings, the Journal of Parallel and “The Price of Fine-Grain Asynchronous Distributed Computing, and the Journal of In- Parallelism: An Analysis of Dataflow tegrated Computer-Aided Engineering. He is Methods,” Proc. CONPAR88, Sept. 1988, the coauthor of Parallel Architectures for Data pp. 541-555. Ben Lee is an assistant professor in the De- and Knowledge Base Systems (IEEE CS Press, partment of Electrical and Computer Engi- to be published in October 1994),a member of 4. D.E. Culler and Arvind, “Resource Re- neering at Oregon State University. His in- the IEEE Computer Society Press Editorial quirements of Dataflow Programs,”Proc. 
terests include computer architectures, Board, and a speaker for the Distinguished 15th Ann. Int’l Symp. Computer Architec- parallel and distributed systems, program al- Visitors Program. ture, IEEE CS Press, Los Alamitos,Calif., location, and dataflow computing. Lee re- Order No. 861 (microfiche only), 1988,pp. ceived the BEEE degree from the State Uni- Readers can contact Lee at Oregon State 141-150. versity of New York at Stony Brook in 1984 University, Dept. of Electrical and Computer and the PhD degree in computer engineering Engineering, Corvallis, OR 97331-3211. His from Pennsylvania State University in 1991. address is [email protected]. 5. B. Lee, A.R. Hurson, and B. Shirazi, “A Hybrid Scheme for Processing Data Struc- tures in a Dataflow Environment,” IEEE Trans. Parallel and Distributed Systems, Vol. 3, No. 1, Jan. 1992, pp. 83-96. A quarterly magazine published by the IEEE Computer Society 6. K.E. Schauser et al., “Compiler-Con- trolled Multithreading for Lenient Parallel Languages,”Proc. Fifth ACM Conf: Func- Subscribe before August 31. tional Programming Languagesand Com- puter Architecture,ACM, New York, 1991, Receive the final 1994 issues for just $12. pp. 50-72. 0 FALL 1994 - High-Performance Fortran 7. R.A. Iannucci,“Towards a Dataflow/von WINTER 1994 - Real-Time Computing Neumann Hybrid Architecture,” Proc. 15th Ann. Int’l Symp. Computer Architec- ture, IEEE CS Press, Los Alamitos, Calif., YES, SIGN ME UP! As a member of the Computer Society or another IEEE Society, I qualify for the Order No. 861 (microficheonly), 1988,pp. member rate of $12 for a half-year subscription (two issues). 131-140. Society IEEE Membership No. Since IEEE subscriptions are annualized, orders received from September 1994 through Februaly 8. M. Sat0 et al., “Thread-Based Program- 1994 will be entered as full-year subscriptions for the 1995 calendar year. Pay the full-year rate of $24 ming for EM-4 Hybrid Dataflow Ma- for four quarterly issues. 
For membership information,see pages 16A-B. chine,” Proc. 19th Ann. Int’l Symp. Com- puter Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2941 (mi- FULL SICNATURE Dam crofiche only), 1992,pp. 146-155.

9. W.D. Weber and A. Gupta, “Exploring NAME the Benefits of Multiple Hardware Con- texts in a Multiprocessor Architecture: STREETADDRESS Preliminary Results,” Proc. 16th Ann. Int’l Symp, Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 1948 (microfiche only), 1989, pp. 273-280. STATdCOWRi’ ZIPPOSTALLODE 10. E. Spertus et al., “Evaluation of Mecha- nisms for Fine-Grained Parallel Programs Payment enclosed Residents of CA, DC,Canada, and Belgium, add applicable tax. in the J-Machine and the CM-5,” Proc. 0 Charge to 0 Visa 0 MasterCard U American Express 20th Ann. Int’l Symp. Computer Architec- ture, IEEE CS Press, Los Alamitos, Calif., CWRGE-URDNUMUER EXPIRATIONDATE Order No. 3811 (microfiche only), 1993...... MUNIH YE4R MAILro CIRCLUTIONDEFT, 10662 LOSVAQUEROS CIRCLE, PO Bux3014, Lob AUMIIOS,LA 90720-1264 August 1994