ASYNCHRONY IN PARALLEL COMPUTING:

FROM DATAFLOW TO MULTITHREADING

JURIJ ŠILC†, BORUT ROBIČ‡, AND THEO UNGERER§

Abstract. The paper presents an overview of the models, architectures, and research projects that are based on asynchronous instruction execution. It starts with pure dataflow computing models and presents a historical development of several ideas (i.e., single-token-per-arc dataflow, tagged-token dataflow, explicit token store, threaded dataflow, large-grain dataflow, RISC dataflow, cycle-by-cycle interleaved multithreading, block interleaved multithreading, simultaneous multithreading) that resulted in modern multithreaded superscalar processors. The paper shows that unification of von Neumann and dataflow models is possible and preferred to treating them as two unrelated, orthogonal computing paradigms. Today's dataflow research incorporates more explicit notions of state into the architecture, and von Neumann models use many dataflow techniques to improve the latency hiding aspects of modern multithreaded systems.

Key words. parallel architectures, data-driven computing, multithreaded computing, static dataflow, tagged-token dataflow, threaded dataflow, cycle-by-cycle interleaving, block interleaving, multithreaded superscalar

AMS subject classifications. 68-02, 68M05, 68Q10

1. Introduction. There are many problems that require enormous computational capacity to solve, and therefore present opportunities for high-performance parallel computing. There are also a number of well-known hardware organization techniques for exploiting parallelism. In this paper we give an overview of the computers where parallelism is achieved by executing multiple asynchronous threads of instructions concurrently.

For sequential computing the principal model is the well-known von Neumann model, which consists of a sequential process running in a linear address space. The amount of concurrency available is relatively small [228]. Computers having more than one processor may be categorized in two types, SIMD or MIMD [82], and allow several multiprocessor von Neumann program execution models, such as shared memory with actions coordinated by synchronizing operations, e.g., P and V commands [75], or message passing with actions coordinated by message passing facilities [118, 135], e.g., explicit operating system calls. A MIMD computer in principle will have a different program running on every processor. This makes for an extremely complex programming environment. A frequent program execution model for MIMD computers is the single-program multiple data (SPMD) model [62], where the same program is run on all processors, but the execution may follow different paths through the program according to the processor's identity.

In multiprocessor systems, two main issues must be addressed: memory latency, which is the time that elapses between issuing a memory request and receiving the corresponding response, and synchronization, which is the need to enforce the ordering of instruction executions according to their data dependencies.

∗Received by the editors January 14, 1997; accepted for publication (in revised form) September 9, 1997. This work was partially supported by the Ministry of Science and Technology of the Republic of Slovenia under grants J2-7648 and J2-8697.

†Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, Slovenia ([email protected]).

‡Faculty of Computer and Information Science, University of Ljubljana, Tržaška cesta 25, SI-1000 Ljubljana, Slovenia, and Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, Slovenia ([email protected]).

§Department of Computer Design and Fault Tolerance, University of Karlsruhe, P.O. Box 6980, D-76128 Karlsruhe, Germany ([email protected]).

The two issues cannot be properly resolved in a von Neumann context, since connecting von Neumann processors into a very high speed general purpose computer also brings bottleneck problems [20].

As an alternative, the dataflow model was introduced as a radically new model capable of properly satisfying the two needs [8, 17, 66, 88]. Dataflow models use dataflow program graphs to represent the flow of data and control. Synchronization is achieved by a requirement that two or more specified events occur before an instruction is eligible for execution [191]. For example, in the graph model [65] an instruction is enabled by the presence of simple tokens (i.e., data packets) on the arcs of copies of the dataflow graph, whereas the unraveling interpreter model (U-interpreter) [19] enables instructions by matching tags that identify the graph instance to which a token belongs. In [158] a detailed implementation of an unraveling dataflow interpreter of the Irvine dataflow language Id is proposed. Dataflow models have also guided the design of several multithreaded computers [70].

Unfortunately, direct implementation of computers based on a dataflow model has been found to be an arduous task. For this reason, the impact of the convergence of the dataflow and control-flow models was investigated [16, 125, 127, 141, 169]. In particular, Nikhil and Arvind posed the following question: What can a von Neumann processor borrow from dataflow to become more suitable for a multiprocessor? and offered the answer in terms of the P-RISC (Parallel RISC) programming model [164]. The model is based on a simple, RISC-like instruction set extended with three instructions that give a von Neumann processor a fine-grained dataflow capability [16]. It uses explicit fork and join commands. Based on the P-RISC programming model, a multithreaded Id programming language was implemented [162, 163]. Another model designed to support nested functional languages using sequential threads of instructions is the Multilisp model [114, 115, 116, 146]. Multilisp is an extension of the Lisp dialect Scheme with additional operators and semantics to deal with parallel execution. The principal language extension provided by Multilisp is future(x). Upon executing future(x), an immediate undetermined value is returned. The computation of x occurs in parallel and the result replaces undetermined when complete. Of course, any use of the result would block the parent process until the computation is finished.
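Multilisp itself extends Scheme, but the behaviour of future is easy to mimic; the following sketch uses Python's concurrent.futures as a rough analogy (our illustration, not Multilisp syntax; the function name is hypothetical): the submit call returns immediately, the computation proceeds in parallel, and the first use of the result blocks the parent.

```python
# Rough analogy to Multilisp's future(x): submit() returns at once,
# x is computed in parallel, and touching the result blocks the parent.
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(n):
    time.sleep(1)                     # stands in for a long computation of x
    return n * n

with ThreadPoolExecutor() as pool:
    f = pool.submit(slow_square, 7)   # like (future X): returns immediately
    print("parent continues immediately")
    print(f.result())                 # first *use* of the value blocks: 49
```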

The incorporation of conventional control-flow execution into the dataflow approach resulted in the multithreaded model, which is one of the most promising and exciting avenues for the exploitation of parallelism [10, 43, 87, 128, 151, 159].

2. The day before yesterday: Pure dataflow. The fundamental principles of dataflow were developed by Jack Dennis [65] in the early 1970s. The dataflow model [8, 68, 93, 191] avoids the two features of the von Neumann model, the program counter and the global updatable store, which become bottlenecks in exploiting parallelism [27]. The computational rule, also known as the firing rule of the dataflow model, specifies the condition for the execution of an instruction. The instruction firing rule, common to all dataflow systems, is as follows: An instruction is said to be executable when all the input operands that are necessary for its execution are available to it. The instruction for which this condition is satisfied is said to be fired. The effect of firing an instruction is the consumption of its input values and generation of output values. Due to the above rule the model is asynchronous. It is also self-scheduling, since instruction sequencing is constrained only by data dependencies. Thus, the flow of control is the same as the flow of data among various instructions. As a result, a dataflow program can be represented as a directed graph consisting of named nodes, which represent instructions, and arcs, which represent data dependencies among instructions [64, 138] (Fig. 2.1 (a),(b)). Data values propagate along the arcs in the form of data packets, called tokens. The two important characteristics of dataflow graphs are functionality and composability. Functionality means that evaluation of a graph is equivalent to evaluation of a mathematical function on the same input values. Composability means that graphs can be combined to form new graphs.
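To make the firing rule concrete, the following minimal sketch (ours, not taken from any of the machines surveyed; all names are illustrative) interprets a dataflow graph in Python: a node fires as soon as all of its operands have arrived, so execution order is constrained only by data dependencies.

```python
# Minimal dataflow interpreter: a node fires when all operands are present.
import operator

class Node:
    def __init__(self, op, arity, dests):
        self.op = op            # the operation to apply
        self.arity = arity      # number of input operands
        self.operands = {}      # operand slot -> arrived token value
        self.dests = dests      # list of (target node name, operand slot)

def run(graph, initial_tokens):
    # Token queue: (node name, operand slot, value) triples "on the arcs".
    tokens = list(initial_tokens)
    results = {}
    while tokens:
        name, slot, value = tokens.pop(0)
        node = graph[name]
        node.operands[slot] = value
        if len(node.operands) == node.arity:        # firing rule satisfied
            result = node.op(*(node.operands[i] for i in range(node.arity)))
            node.operands.clear()                   # consume input values
            if not node.dests:
                results[name] = result
            for dest, dslot in node.dests:          # emit result tokens
                tokens.append((dest, dslot, result))
    return results

# Dataflow graph for a * (b + c), as in Fig. 2.1: the '+' node feeds '*'.
graph = {
    "add": Node(operator.add, 2, [("mul", 1)]),
    "mul": Node(operator.mul, 2, []),
}
print(run(graph, [("mul", 0, 2), ("add", 0, 3), ("add", 1, 4)]))  # {'mul': 14}
```

Running it on the graph for a × (b + c) yields 14 for a = 2, b = 3, c = 4, regardless of the order in which the initial tokens are listed.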

In a dataflow architecture the program execution is in terms of receiving, processing and sending out tokens containing some data and a tag. Dependencies between data are translated into tag matching and transformation, while processing occurs when a set of matched tokens arrives at the execution unit. The instruction, which has to be fetched from the instruction store according to the tag information, contains information about what to do with the data and how to transform the tags. The matching and execution unit are connected by an asynchronous pipeline, with queues added to smooth out load variations [136]. Some form of associative memory is required to support token matching. It can be a real memory with associative access, a simulated memory based on hashing, or a direct matched memory. Each solution has its proponents but none is absolutely suitable.

Due to its elegance and simplicity, the pure dataflow model has been the subject of many research efforts. Since the early 1970s, a number of dataflow computer prototypes have been built and evaluated, and different designs and compiling techniques have been simulated [17, 66, 88, 151, 168, 185, 196, 205, 208, 216, 219]. Clearly, an architecture supporting the execution of dataflow graphs should support the flow of data. Depending on the way of handling the data, several types of dataflow architectures emerged in the past [15]: single-token-per-arc dataflow [72], tagged-token dataflow [23, 229], and explicit token store [171]. We describe them in the following subsections.

2.1. Single-token-per-arc dataflow. The single-token-per-arc (or static) architecture was first proposed by Dennis [72] (Fig. 2.1 (c)). At the machine level, a dataflow graph is represented as a collection of activity templates, each containing the operation code of the represented instruction, operand slots for holding operand values, and destination address fields, referring to the operand slots in subsequent activity templates that need to receive the result value. The static dataflow approach allows at most one token to reside on any one arc. This is accomplished by extending the basic firing rule as follows [67]: A node is enabled as soon as tokens are present on its input arcs and there is no token on any of its output arcs. To implement the restriction of at most one token per arc, acknowledge signals, traveling along additional arcs from consuming to producing nodes, are used as additional tokens. Thus, the firing rule can be changed to its original form.
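As a small illustration of the extended rule (our own sketch; the node and arc encodings are hypothetical), the enabling predicate and its acknowledge-arc reformulation can be written as:

```python
# Extended static firing rule: a node may fire only when all of its inputs
# are present AND none of its output arcs still holds a token.
def enabled(node, arcs):
    """arcs maps an arc id to the number of tokens currently on it."""
    inputs_ready = all(arcs[a] > 0 for a in node["inputs"])
    outputs_clear = all(arcs[a] == 0 for a in node["outputs"])
    return inputs_ready and outputs_clear

# With acknowledge arcs the rule regains its original form: the consumer
# sends an acknowledge token back, and the producer simply waits for tokens
# on *all* of its input arcs, data arcs and acknowledge arcs alike.
def enabled_with_acks(node, arcs):
    return all(arcs[a] > 0 for a in node["inputs"] + node["ack_inputs"])

n = {"inputs": ["x"], "outputs": ["y"], "ack_inputs": ["y_ack"]}
print(enabled(n, {"x": 1, "y": 0, "y_ack": 0}))            # True
print(enabled_with_acks(n, {"x": 1, "y": 0, "y_ack": 1}))  # True
```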

The major advantage of the single-token-per-arc dataflow model is its simplified mechanism for detecting enabled nodes. Unfortunately, this model of dataflow has a number of serious drawbacks [15]. Since consecutive iterations of a loop can only partially overlap in time, only a pipelining effect can be achieved and thus a limited amount of parallelism can be exploited. Another undesirable effect is that token traffic is doubled. There is also a lack of support for programming constructs essential to any modern programming language. Despite these shortcomings, several machines were constructed and the most important are listed below. They made a tremendous impact on the field of computer architecture by introducing radically new ways of thinking about computers. They also served as a basis for subsequent dataflow computers.

MIT Static Dataflow Architecture: This machine was designed at the Massachusetts Institute of Technology (MIT) (Cambridge, MA, USA) [66, 67, 69, 72] as a direct implementation of the single-token-per-arc dataflow model. It consisted of five major subsystems: a memory section, a processing section, an arbitration network, a control network, and a distribution network. All communication between subsystems was by asynchronous packet transmission over the channels. A prototype was built consisting of eight processing elements (emulated by microprogrammable microprocessors) and an equidistant packet network using 2 × 2 routing elements; it was used primarily as an engineering model [71].

LAU System: Researchers at CERT (Toulouse, France) carried out the first research effort to go through the entire process of graph, language, and machine design. Their system, called LAU [49, 50, 175, 211], was built in 1979. Langage à assignation unique (LAU) is a French acronym for single-assignment computation. The LAU system used static graphs and had its own dataflow language based on single-assignment.

TI's Distributed Data Processor: The Distributed Data Processor (DDP) [51, 52] was a system designed by Texas Instruments (USA) aimed to investigate the potential of dataflow as the basis of a high-performance computer. The DDP was not commercially exploited, but a follow-on design with Ada as the programming language [102] had application in military systems.

DDM1 (Utah Data Driven Machine): DDM1 [63] was a computing element of a recursively structured dataflow architecture designed at the University of Utah (Salt Lake City, UT, USA) and was completed at Burroughs Interactive Research Center (La Jolla, CA, USA). The machine organization was based on the concept of recursion. It was tree structured, with each computing element connected to a superior element above and up to eight inferior elements below. Each processing element consisted of an agenda queue, which contained firable instructions, an atomic memory providing the program memory, and an atomic processor.

NEC Image Pipelined Processor: NEC (Japan) has developed the first dataflow VLSI microprocessor chip, the µPD7281 [47, 133, 134, 213, 214]. Its architecture was a pipeline with several blocks, or working areas, organized in a loop. A link table and function table stored object code, while the data memory was used for temporary storage of the tokens to be matched. The address generator and flow controller matched two tokens, which were temporarily stored in the queue before they were processed by the processing unit.

Hughes Dataflow Multiprocessor: The Hughes Dataflow Multiprocessor (HDFM) project began in 1981 at Hughes Aircraft Co. (USA), prompted by the need for high performance, reliable, and easily programmable processors for embedded systems. The HDFM [42, 89, 225, 226] consisted of many, relatively simple, identical processing elements connected by a global packet-switching network. The interconnection (3-D bussed cube) network was integrated with the processing element for ease of expansion and minimization of VLSI chip types. The communications network was designed to be reliable, with automatic retry on garbled messages, distributed arbitration, alternate path packet routing, and a failed processing element translation table to allow rapid swap-in and use of spare processing elements. The communication network was able to physically accommodate up to an 8 × 8 × 8 configuration, or 512 processing elements. Candidates proposed for the HDFM network were evaluated in [81]: a 3-D bussed cube network, a multi-stage network, a hypercube network, a mesh network, and a ring network. The processor engine was a three-stage pipelined processor with three operations overlapped: instruction fetch and dataflow firing rule check, instruction execution, and result and destination address combining to form a packet.

2.2. Tagged-token dataflow. The performance of a dataflow machine significantly increases when loop iterations and subprogram invocations can proceed in parallel. To achieve this, each iteration or subprogram invocation should be able to execute as a separate instance of a reentrant subgraph. However, this replication is only conceptual. In the implementation, only one copy of any dataflow graph is actually kept in memory. Each token includes a tag, consisting of the address of the instruction for which the particular data value is destined, and other information defining the computational context in which that value is used. This context is called the value's color. Each arc can be viewed as a bag that may contain an arbitrary number of tokens with different tags. Now, the firing rule is: A node is enabled as soon as tokens with identical tags are present at each of its input arcs. Dataflow architectures that use this method are referred to as tagged-token (or dynamic) dataflow architectures (Fig. 2.1 (d)). The model was proposed by Watson and Gurd at the University of Manchester (England) [229], and by Arvind and Nikhil at MIT [23].
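A minimal sketch of such tag matching (ours; the tuple encoding of tokens and the port names are hypothetical) keeps a waiting store keyed by the full tag, so a dyadic instruction is enabled exactly when both tokens of the same color have arrived:

```python
# Tagged-token matching for dyadic instructions: the waiting store is keyed
# by the full tag (destination instruction, color). The first token of a
# pair waits; the arrival of its partner enables the instruction.
waiting = {}   # (instruction, color) -> (port, value) of the earlier token

def match(token):
    """Return (instruction, color, left, right) when a pair completes."""
    instruction, color, port, value = token
    key = (instruction, color)
    if key not in waiting:
        waiting[key] = (port, value)      # first token: wait for a partner
        return None
    other_port, other_value = waiting.pop(key)
    pair = {port: value, other_port: other_value}
    return (instruction, color, pair["l"], pair["r"])

print(match(("add", ("iter", 3), "l", 10)))   # None -- waits
print(match(("add", ("iter", 4), "r", 5)))    # None -- different color
print(match(("add", ("iter", 3), "r", 2)))    # ('add', ('iter', 3), 10, 2)
```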

The major advantage of the tagged-token dataflow model is the higher performance it obtains by allowing multiple tokens on an arc. One of the main problems of the tagged-token dataflow model was the efficient implementation of the unit that collects tokens with matching colors. For reasons of performance, an associative memory would be ideal. Unfortunately, it would not be cost-effective, since the amount of memory needed to store tokens waiting for a match tends to be very large. Therefore, all existing machines use some form of hashing techniques, which are typically not fast enough compared with associative memory. In what follows, we list the most important tagged-token dataflow projects.

Manchester Dataflow Computer: The researchers at the University of Manchester focused on the construction of a prototype dataflow computer [28, 29, 38, 44, 45, 92, 107, 108, 109, 110, 111, 112, 129, 184, 218, 229, 230]. The program graph in this machine was static, with token labels to keep different procedure calls separate. The machine was first implemented as a simulated program in 1977/78, and as the hardware prototype in 1981. It consisted of a single processing element and a single structure storage module connected together via a simple 2 × 2 switch. The processing element had a pipelined internal structure [59], with tokens passing through a queue module, a matching unit [60] and an instruction fetch unit before being processed by one of 20 function units in a parallel array that provided the computational power. The function units were microcoded, and different instructions took quite different times to execute. To overcome the restriction in the Manchester machine that only two successor instructions could be specified in one instruction, the TUPlicate operator (iterative instruction) was introduced [36, 37]. Using the TUPlicate operator reduced the size of code and led to a significant reduction in execution time, especially for large programs. The Extended Manchester Dataflow Computer (EXMAN) from the Indian Institute of Science (Bangalore, India) [174] incorporated three major extensions to the basic machine: a multiple matching units scheme, an efficient implementation of the array data structure, and a facility to concurrently execute reentrant routines. To allow all storage functions to be performed concurrently, a prototype parallel structure store was also developed [139].

MIT Tagged-Token Dataflow Machine: This project originated at the University of California at Irvine (CA, USA) and continued at MIT. The Irvine dataflow machine [21, 98, 99] implemented a version of the U-interpreter [97] and array handling mechanisms to support I-structure concepts [20, 23, 24, 157]. An I-structure may be viewed as a repository for data values obeying the single-assignment principle. That is, each element of the I-structure may be written only once but it may be read any number of times. The machine was proposed to consist of multiple clusters interconnected by a token ring. Each cluster (four processing elements [22]) shared a local memory through a local bus and a memory controller. The MIT machine [17, 90] was a modified Irvine machine, but still based on the Id language. Instead of using a token ring, an N × N packet switch network for interprocessor communications was used.

SIGMA-1: The SIGMA-1 [121, 122, 193, 232] system is a supercomputer for large-scale numerical computation and has been operational since early 1988 at the Electrotechnical Laboratory (Tsukuba, Japan). It consists of 128 processing elements and 128 structure elements interconnected by 32 local networks (10 × 10 crossbar packet switches) and one global two-stage Omega network. Sixteen maintenance processors are also connected with the structure elements and with a host computer for I/O operations, system monitoring, performance measurements, and maintenance operations [120].

PATTSY (Processor Array Tagged-Token System): PATTSY [160] was an experimental system developed at the University of Queensland (Brisbane, Australia), which supported the dynamic model of dataflow execution. PATTSY had a host computer that provided the user-interface to the machine and accepted user programs which it converted into dataflow graphs. These graphs were mapped onto the processing elements (PEs), but the actual scheduling of operations was carried out at runtime. An 18-processor prototype of the machine was operational. It used an IBM-PC as the host and the PEs were built from Intel 8085 based single board microcomputers.

NTT's Dataflow Processor Array System: The dataflow processor array system [212] from Nippon Telephone and Telegraph (Japan) was a dynamic tagged-token design intended for large scientific calculations. A hardware experimental system (Eddy) [13], consisting of 4 × 4 processing elements, was built and used to test some applications.

Q-p (One-Chip Data-Driven Processor): The Q-p [25, 144, 166] is specifically designed to be a one-chip functional element which is easily programmable to form various dedicated processing functions. Particular design considerations were taken to achieve high flow-rate data-stream processing capabilities. In the Q-p, a novel bi-directional elastic pipeline processing concept was introduced to implement token matching. The Q-p was developed jointly by Osaka University, Sharp, Matsushita Electric Ind., Sanyo Electric, and Mitsubishi Electric (Japan).

DDDP (Distributed Data Driven Processor): The DDDP [140] from OKI Electric Ind. (Japan) had a centralized tag manager and performed token matching using a hardware hashing mechanism similar to that of the Manchester Dataflow Computer. A prototype consisting of four processing elements and one structure store connected by a ring bus achieved 0.7 MIPS.

SDFA (Stateless Data-Flow Architecture): SDFA was designed at the University of Manchester and inspired by the Manchester Dataflow Computer. As its name implies, the SDFA system [204, 205] had no notion of state. There were no structure stores, and only extract-wait functionality was provided in the matching stores. All the instructions in the instruction set were simple and based on RISC principles. There were no iterative or vector style instructions producing more than two tokens per execution.

CSIRAC II: The origins of the CSIRAC II dataflow computer date from 1978 [78]. It was built at the Swinburne Institute of Technology (Hawthorn, Australia). The architecture is unusual in that the temporal order of tokens with the same color on the same graph arc is maintained [76, 77].

PIM-D (Parallel Inference Machine): The PIM-D was proposed to be one of the candidates for a parallel inference machine in the Fifth Generation Computer System and was a joint venture of ICOT and OKI Electric Ind. (Japan) [130, 131, 132]. This machine was constructed from multiple processing elements and multiple structure memories interconnected by a hierarchical network and exploited three types of parallelism: OR parallelism, AND parallelism, and parallelism in unification.

2.3. Explicit token store. One of the main problems of tagged-token dataflow architectures is the implementation of an efficient token matching. To eliminate the need for associative memory searches, the concept of an explicitly addressed token store has been proposed [171]. The basic idea is to allocate a separate memory frame for every active loop iteration and subprogram invocation. Each frame slot is used to hold an operand for a particular activity. Since access to frame slots is through offsets relative to a frame pointer, no associative search is necessary (Fig. 2.1 (e)). To make this concept practical, the number of concurrently active loop iterations must be controlled. Hence, the constraint condition of k-bounded loops was proposed in [18, 53], allowing the number of concurrently active loop iterations to be bounded by a constant.
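A minimal sketch of direct matching under these assumptions (ours; the frame pointers and offsets are illustrative) shows why no associative search is needed: the tag selects a frame slot directly, guarded by a presence bit.

```python
# Explicit-token-store (direct) matching: a token's tag names a frame and
# an instruction whose operand offset selects a frame slot, so matching is
# a plain memory access guarded by a presence bit.
store = {}       # (frame pointer, operand offset) -> parked operand value
present = set()  # presence bits: slots whose first operand has arrived

def ets_match(frame_ptr, offset, value):
    """Return both operands when the second token arrives, else None."""
    slot = (frame_ptr, offset)
    if slot not in present:
        present.add(slot)                # first token: set presence bit,
        store[slot] = value              # park the value in the frame slot
        return None
    present.discard(slot)                # second token: reset presence bit,
    return (store.pop(slot), value)      # retrieve partner and fire

print(ets_match(0x100, 3, 10))   # None    -- first operand parked
print(ets_match(0x200, 3, 99))   # None    -- different frame (iteration)
print(ets_match(0x100, 3, 2))    # (10, 2) -- instruction can fire
```

This is essentially the check that Monsoon, described below, performs in its pipeline stages three and four.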

The explicit token store principle was developed in the Monsoon project, but is applied in most dataflow architectures developed more recently (see Subsection 3.2 below), e.g., as so-called direct matching in the EM-4 and Epsilon-2.

Monsoon: The Monsoon dataflow multiprocessor was built jointly by MIT and Motorola (USA) [55, 172, 192, 217]. Monsoon dataflow processor nodes are coupled by a packet-communication network with each other and with I-structure storage units. The main objective of the Monsoon dataflow processor architecture was to alleviate the waiting/matching problem by using an explicitly addressed token store. This is achieved by a dataflow processor using an eight-stage pipeline. The first pipeline stage is the instruction fetch stage, which is arranged prior to the token matching, in contrast to dynamic dataflow processors with associative matching units. The new arrangement is necessary, since the operand fields in an instruction denote the offset in the memory frame that itself is addressed by the tag of a token. The explicit token address is computed by the composition of frame address and operand offset. This is done in the second stage, which is the first of three pipeline stages that perform the token matching. In the third stage a presence bit store access is performed to find out if the first token of a dyadic operation has already arrived. If that is not the case, the presence bit is set and the current token is stored in the frame slot of the frame store in the fourth stage. Otherwise the presence bit is reset and the operand is retrieved from the frame store in the fourth pipeline stage. The next three stages are execution stages, in the course of which the next tag is also computed concurrently. The eighth stage forms one or two new tokens that are sent to the network, stored in a user token queue, a system token queue, or directly recirculated to the instruction fetch stage of the pipeline (see next section). Several small Monsoon systems have been built and installed at universities and research institutes.

Fig. 2.1. Computing a × (b + c) with: (a) control flow; (b) dataflow. Pure dataflow basic execution pipeline: (c) single-token-per-arc dataflow; (d) tagged-token dataflow; (e) explicit token store dataflow.

3. Yesterday: Combining control-flow and dataflow. Pure dataflow computers based on the single-token-per-arc or tagged-token dataflow model usually perform quite poorly with sequential code. This is due to the fact that an instruction of the same thread can only be issued to the dataflow pipeline after the completion of its predecessor instruction. In the case of an 8-stage dataflow pipeline, an instruction of the same thread can be issued at most every eight cycles. If the load is low, for instance for a single sequential thread, the utilization of the dataflow processor drops to one eighth of its maximum performance. A further drawback is the overhead associated with token matching. Before a dyadic instruction is issued to the execution stage, two result tokens are necessary. The first token is stored in the waiting-matching store, thereby introducing a bubble in the execution stages of the dataflow processor pipeline. Only when the second token arrives is the instruction issued. For example, the pipeline bubbles sum up to 28.75% when executing a Traveling Salesman program on the Monsoon machine [172].

Since a context switch occurs in fine-grain dataflow after each instruction execution, no use of registers is possible. Use of registers could optimize the access time to data values, avoid pipeline bubbles caused by dyadic instructions, and reduce the total number of tokens during program execution.

One solution to these problems is combining the dataflow and control-flow mechanisms. The symbiosis between dataflow and von Neumann architectures is represented by a number of research projects developing von Neumann/dataflow hybrids [41, 127]. The spectrum of such hybrids is very broad, ranging from simple extensions of a von Neumann processor with a few additional instructions to specialized dataflow systems attempting to reduce overhead by increasing the execution grain size and employing various scheduling, allocation, and resource management techniques developed for von Neumann computers. These developments illustrate that dataflow and von Neumann computers do not necessarily represent two entirely disjoint worlds but rather the two extreme ends of a spectrum of possible computer systems (Fig. 3.1 (a),(b),(c)).

After early hybrid dataflow attempts, several techniques for combining control-flow and dataflow have emerged: threaded dataflow [173], large-grain dataflow [127], RISC dataflow [164], and dataflow with complex machine operations [80]. We describe them in the next five subsections (for a comparison of the techniques see [31]).

3.1. Early hybrid dataflow. The first attempts toward hybrid dataflow computing were made already in the early 1980s, in quite diverse directions. Some of them are listed below.

JUMBO (Newcastle Data-Flow Computer): The JUMBO [220] was built at the University of Newcastle upon Tyne (England) to study the integration of dataflow and control-flow computation. It has a packet communication organization with token matching. There are three principal units (matching unit, memory unit, processing unit) interconnected into a ring by FIFO buffers.

MADAME Dataflow Machine: The MADAME architecture from the Jožef Stefan Institute (Ljubljana, Slovenia) was suitable for the execution of acyclic dataflow subgraphs and supported synchronous dataflow computing [197, 198, 199, 200]. The advantage of such computing over pure dataflow is that more efficient runtime code can be generated, because instructions can be scheduled at compile-time rather than at runtime.

PDF (Piecewise Dataflow Architecture): The PDF architecture addressed Lawrence Livermore National Laboratory's (Livermore, CA, USA) needs for supercomputing power. This architecture [177, 178] blended the strengths found in SIMD, MIMD, and dataflow architectures. The PDF contained a SIMD processing unit and a scalar processing unit that managed a group of processors (MIMD). Programs that ran on the PDF were broken into basic blocks and scheduled for execution using dataflow techniques. Two levels of scheduling divided responsibility between the compiler and hardware. The time-consuming analysis of when blocks can overlap was done at compile-time, before program execution began. Individual instruction scheduling was done in hardware.

MUSE (Multiple Evaluator): The MUSE machine [40] from the University of Nottingham (England) was a structured architecture supporting both serial and parallel processing which allowed the abstract structure of a program to be mapped onto the machine in a logical way. The MUSE machine had many features of dataflow architecture, but in its detailed operation it steered a middle course between a pure dataflow machine, a tree machine, and the graph reduction approach.

RAMPS (Real Time Acquisition and Processing System): RAMPS [30] was a parallel processing system developed at EIP Inc. (USA) for high performance data acquisition and processing applications. RAMPS used dataflow at a macro level, while tasks were executed using a sequential model of computation.

DTN Dataflow Computer: The DTN Dataflow Computer [227], developed by the Dutch company Dataflow Technology Nederland, is a high-performance computer well suited for applications containing relatively small computing-intensive parts with enough parallelism to execute efficiently on the dataflow engine. The DTN Dataflow Computer contains a standard general purpose host, a graphical subsystem (four microprogrammable systolic arrays), and a dataflow engine. The dataflow engine consists of eight identical modules connected by a token routing network. Each module contains four NEC Image Pipelined Processors, an interface chip, and image memory.

ETCA Functional Dataflow Architecture: A functional dataflow architecture has been developed at ETCA (Arcueil, France) and is dedicated to real-time image processing [190]. Two types of data-driven processing elements, dedicated respectively to low- and mid-level processing, are integrated in a regular 3-D array. Its design relies on close integration of dataflow principles and functional programming. For the execution of low-level functions, a custom dataflow processor with six bi-directional I/O ports was developed [176].

3.2. Threaded dataflow. In a dataflow program, each subgraph that exhibits a low degree of parallelism can be identified within the dataflow graph and transformed into a sequential thread. By the term threaded dataflow we understand a technique where the dataflow principle is modified so that instructions of a sequential instruction stream can be processed in succeeding machine cycles. A thread of instructions is issued consecutively by the matching unit without matching further tokens except for the first instruction of the thread. Threaded dataflow covers the repeat-on-input technique in the Epsilon-1 and Epsilon-2 processors, the strongly connected arc model of EM-4, and the direct recycling of tokens in Monsoon. Data passed between instructions of the same thread is stored in registers instead of being written back to memory. These registers may be referenced by any succeeding instruction in the thread. Thereby single-thread performance is improved. The total number of tokens needed to schedule the instructions of a program is reduced, thus saving hardware resources. Pipeline bubbles are avoided for dyadic instructions within a thread. Two threaded dataflow execution techniques can be distinguished: direct token recycling, and consecutive execution of the instructions of a single thread. The first technique, used by the Monsoon dataflow computer, allows a particular thread to occupy only a single slot in the 8-stage pipeline, which implies that at least 8 threads must be active for a full pipeline utilization to be achieved. This cycle-by-cycle instruction interleaving of threads is used in a similar fashion by some multithreaded von Neumann computers (Section 4). To optimize single-thread performance, the Epsilon-processors and the EM-4 execute instructions from a thread consecutively. The circular pipeline of fine-grain dataflow is retained. However, the matching unit has to be enhanced with a mechanism that, after firing the first instruction of a thread, delays the matching of further tokens in favor of issuing all instructions of the thread consecutively. By this mechanism, cycling of tokens through the pipeline for the activation of the next instruction is suppressed. A minimal sketch of this issue discipline is given below, followed by the three highly influential projects.
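The sketch below (ours; the register names and thread encoding are hypothetical) illustrates the consecutive-execution variant: after matching enables the first instruction, the rest of the thread issues without further matching, passing intermediate values in registers.

```python
# Consecutive execution in threaded dataflow: only the *first* instruction
# of a thread is enabled by token matching; the remaining instructions then
# issue in successive cycles, passing values in registers, not tokens.
regs = {}

def run_thread(thread):
    """thread: list of (dest_reg, op, srcs); srcs are register names or literals."""
    for dest, op, srcs in thread:
        vals = [regs[s] if isinstance(s, str) else s for s in srcs]
        regs[dest] = op(*vals)       # no further matching, no extra tokens

# Once matching delivers the thread's input tokens into r0 and r1 ...
regs.update({"r0": 3, "r1": 4})
run_thread([
    ("r2", lambda a, b: a + b, ["r0", "r1"]),   # r2 = r0 + r1
    ("r3", lambda a, b: a * b, ["r2", 2]),      # r3 = r2 * 2
])
print(regs["r3"])   # 14
```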

Monsoon: The Monsoon dataflow processor can be viewed as a cycle-by-cycle interleaving multithreaded computer [173] due to its ability of direct token recycling. By use of this technique, a successor token is directly fed back into the eight-stage Monsoon pipeline, bypassing the token store. Another instruction of the same thread is executed every eighth processor cycle. Monsoon allows the use of registers (eight register sets are provided) to store intermediate results within a thread, thereby leaving the pure dataflow execution model.

Epsilon-2: The Epsilon-2 machine [100, 101], developed at Sandia National Laboratories (Livermore, CA, USA), supports a fully dynamic memory model, allowing single cycle context switches and dynamic parallelization. The system is built around a module consisting of a processor and a structure unit, connected via a 4 × 4 crossbar to each other, an I/O port, and the global interconnect. The structure unit is used for storing data structures such as arrays, lists, and I-structures. The Epsilon-2 processor retains the high performance features of the Epsilon-1 prototype, including direct matching, pipelined processing, and a local feedback path. The ability to execute sequential code as a grain provides RISC-like execution efficiency.

EM-4 and EM-X: In the EM-4 project [35, 142, 181, 182, 183, 186, 187, 188, 189, 206] at the Electrotechnical Laboratory (Tsukuba, Japan), the essential elements of a dynamic dataflow architecture using frame storage for local variables are incorporated into a single chip processor. In this design, a strongly connected subgraph of a function body is implemented as a thread that uses registers for intermediate results. The EM-4 was designed for 1024 nodes. Since May 1990, the EM-4 prototype with 80 nodes (EMC-R gate-array processor) has been operational. In 1993 the EM-X [143], an upgrade of the EM-4, was designed to support latency reduction by fusing the communication pipeline with the execution pipeline, latency hiding via multithreading, and runtime latency minimization for remote memory access.

3.3. Large-grain dataflow. This technique, also referred to as coarse-grain dataflow, advocates activating macro dataflow actors by the dataflow principle while executing the represented sequences of instructions by the von Neumann principle [34, 61, 80, 127]. Large-grain dataflow machines typically decouple the matching stage (sometimes called synchronization stage, etc.) from the execution stage by use of FIFO buffers. Pipeline bubbles are avoided by the decoupling. Off-the-shelf microprocessors can be used for the execution stage. Most of the more recent dataflow architectures fall into this category and are listed below. Note that they are often called multithreaded machines by their authors. A minimal sketch of the decoupling is given below.
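The following sketch (ours; the actor and queue structures are illustrative) shows the decoupling idea: a synchronization stage applies the dataflow rule at the grain level, and a FIFO ready queue feeds an execution stage that runs each macro actor sequentially.

```python
# Large-grain decoupling: a synchronization stage enables whole macro
# actors by counting arrived inputs; a FIFO ready queue feeds an execution
# stage that runs each actor as ordinary sequential code.
from collections import deque

ready = deque()                 # FIFO buffer between the two stages

class MacroActor:
    def __init__(self, body, n_inputs):
        self.body, self.missing, self.inputs = body, n_inputs, []

def sync_stage(actor, value):
    actor.inputs.append(value)
    actor.missing -= 1
    if actor.missing == 0:      # dataflow rule, applied at the grain level
        ready.append(actor)

def exec_stage():
    while ready:                # von Neumann execution inside the grain
        actor = ready.popleft()
        actor.body(*actor.inputs)

a = MacroActor(lambda x, y: print("grain ran:", x * y), n_inputs=2)
sync_stage(a, 6); sync_stage(a, 7)
exec_stage()                    # grain ran: 42
```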

TAM (Threaded Abstract Machine): The TAM [54, 56] from the University of California (Berkeley, CA, USA) is an execution model for fine-grain interleaving of multiple threads, that is supported by an appropriate compilation strategy and program representation instead of elaborate hardware. TAM's key features are placing all synchronization, scheduling, and storage management under explicit compiler control.

*T: The *T (pronounced start) [165] is a direct descendant of dataflow architectures, especially of the Monsoon, and unifies them with von Neumann architectures. *T has a scalable computer architecture designed to support a broad variety of parallel programming styles, including those which use multithreading based on non-blocking threads. A *T node consists of the data processor executing threads, the remote-memory request coprocessor for incoming remote load/store requests, and the synchronization coprocessor for handling returning load responses and join operations, which all share local memory. This hardware is coupled with a high performance network having a fat-tree topology with high cross-section bandwidth. A modified version of the *T [165] was implemented in cooperation by MIT and Motorola, based on a Motorola 88110 superscalar microprocessor, called 88110MP [32, 33]. Despite its name, the StarT-ng [14, 46] bears no resemblance to its predecessor *T and to dataflow machines. It defines a rather conventional multiprocessor with PowerPC 620 processors augmented with special hardware for message-passing and shared memory access.

ADARC (Associative Dataflow Architecture): In the Associative Dataflow Architecture (ADARC) [209, 210, 233] the processing units are connected via an associative communication network. The processors are equipped with private memories that contain instruction sequences generated at compile-time. The retrieval of executable instructions is replaced by the retrieval of input operands for the current instructions from the network. The structure of the network enables full parallel access to all previously generated results by all processors. A processor executes its current instruction or instruction sequence as soon as all requested input operands have been received [209]. In [233] it is shown that ADARC is well-suited to implement neural network models. ADARC was developed at J. W. Goethe University (Frankfurt, Germany).

Pebbles: The Pebbles architecture [180] from Colorado State University (Fort Collins, CO, USA) is a large-grain dataflow architecture with a decoupling of the synchronization unit and the execution unit within the processing elements. The processing elements are coupled via a high-speed network. The local memory of each node consists of an instruction memory, which is read by the execution unit, and a data memory or frame store, which is accessed by the synchronization unit. A ready queue contains the continuations representing those threads that are ready to execute. The frame store is designed as a storage hierarchy where a frame cache holds the frames of threads that will be executed soon. The execution unit is a 4-way superscalar microprocessor.

From Argument-Fetch Dataflow Processor to MTA and EARTH: The EARTH (Efficient Architecture of Running Threads) [155], developed at McGill University and Concordia University (Montréal, Canada), is based on the MTA (Multithreaded Architecture) [86, 126] and dates back to the Argument-Fetch Dataflow Processor. An MTA node consists of an execution unit that may be an off-the-shelf RISC microprocessor and a synchronization unit to support dataflow-like thread synchronization. The synchronization unit determines which threads are ready to be executed. The execution unit and synchronization unit share the processor's local memory, which is cached. Accessing data in a remote processor requires explicit request and send messages. The synchronization unit and execution unit communicate via FIFO queues: a ready queue containing ready thread identifiers links the synchronization unit with the execution unit, and an event queue holding local and remote synchronization signals connects the execution unit with the synchronization unit, but also receives signals from the network. A register use cache keeps track of which register set is assigned to which function activation [126]. MTA and EARTH rely on non-blocking threads. The EARTH architecture [161] is implemented on top of the experimental, but rather conventional, MANNA multiprocessor [91].

3.4. RISC dataflow. Another stimulus for dataflow/von Neumann hybrids was the development of RISC dataflow architectures, notably P-RISC [164], which allow the execution of existing software written for conventional processors. Using such a machine as a bridge between existing systems and new dataflow supercomputers made the transition from imperative von Neumann languages to dataflow languages easier for the programmer.

P-RISC (Parallel RISC): The basic philosophy underlying the development of the P-RISC architecture [164] can be characterized as follows: use a RISC-like instruction set, change the architecture to support multithreaded computation, add fork and join instructions to manage multiple threads, implement all global storage as I-structure storage, and implement the load/store instructions to execute in split-phase mode (Fig. 3.1 (d),(e)). The architecture based on the above principles was developed at MIT. It consists of a collection of PEs and Heap Memory Elements, interconnected through a packet-switching communication network. A minimal sketch of fork and join is given below.
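The sketch below (ours; it uses Python threads as stand-ins for P-RISC's lightweight continuations, so it models the mechanism, not the cost) shows fork starting a second thread and join letting only the last arrival continue:

```python
# P-RISC-style fork and join: fork starts an extra thread of control, and
# join lets only the last of n arriving threads continue past it. A counter
# in a frame slot would play the role of the Join object in hardware.
import threading

class Join:
    def __init__(self, n):
        self.n, self.lock = n, threading.Lock()
    def arrive(self):
        with self.lock:
            self.n -= 1
            return self.n == 0     # True only for the last arrival

def worker(name, join, cont):
    print(name, "done")
    if join.arrive():              # the join instruction
        cont()                     # sole survivor continues past the join

j = Join(2)
cont = lambda: print("continuation after join")
for name in ("t1", "t2"):          # fork: two concurrent threads
    threading.Thread(target=worker, args=(name, j, cont)).start()
```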

3.5. Use of complex machine operations. Another technique to reduce the instruction level synchronization overhead is the use of complex machine instructions, for instance vector instructions. These instructions can be implemented by pipeline techniques as in vector computers. Structured data is referenced in blocks rather than element-wise, and can be supplied in a burst mode. This deviates from the I-structure scheme, where each data element within a complex data structure is fetched individually from a structure store. A further advantage of complex machine operations is the ability to exploit parallelism at the subinstruction level. Therefore the machine has to partition the complex machine operation into suboperations that can be executed in parallel. The use of a complex machine operation may spare several nested loops. The use of a FIFO buffer allows the machine to decouple the firing stage and the execution stage, to bridge the different execution times within a mixed stream of simple and complex instructions issued to the execution stage. As a major difference to conventional dataflow architectures, tokens do not carry data, except for the values "true" or "false". Data is only moved and transformed within the execution stage. This technique is used in the Decoupled Graph/Computation Architecture, the Stollmann Dataflow Machine, and the ASTOR architecture. These architectures combine complex machine instructions with large-grain dataflow, described above. The structure-flow technique proposed for the SIGMA-1 enhances an otherwise fine-grain dataflow computer by structure load and structure store instructions, which move for instance whole vectors from or to the structure store. The arithmetic operations are executed by the cyclic pipeline within a single PE. A more detailed description is given below.

Decoupled Graph/Computation Architecture: In this architecture, developed at the University of Southern California (Los Angeles, CA, USA), the token matching and token formatting and routing are reduced to a single graph operation called determine executability [80]. The decoupled graph/computation model separates the graph portion of the program from the computational portion. The two basic units of the decoupled model (the computational unit and the graph unit) operate in an asynchronous manner. The graph unit is responsible for determining executability by updating the dataflow graph, while the computation unit performs all the computational operations (fetch and execute).

Stollmann Dataflow Machine: The Stollmann dataflow machine [95, 96] from Stollmann GmbH (Hamburg, Germany) is a coarse-grain dataflow architecture directed towards database applications. The dataflow mechanism is emulated on a shared-memory multiprocessor. The query tree of a relational query language such as the database query language SQL is viewed as a dataflow graph. Complex database query instructions are implemented as coarse-grain dataflow instructions and micro-coded as a traditional sequential program running on the hardware.

Fig. 3.1. Spectrum of scheduling models: (a) fully ordered control flow; (b) partially ordered dataflow; (c) partially ordered graph and fully ordered grains (e.g., Epsilon-2 hybrid). RISCifying dataflow (P-RISC dataflow/von Neumann hybrid): (d) conceptual; (e) encoding of graph.

ASTOR (Augsburg Structure-Oriented Architecture): The ASTOR architecture [223, 224, 234, 235] from the University of Augsburg (Germany) can be viewed as a dataflow architecture that utilizes task level parallelism by the architectural structure of a distributed memory multiprocessor, instruction level parallelism by a token-passing computation scheme, and subinstruction level parallelism by SIMD evaluation of complex machine instructions. Sequential threads of data instructions are compiled to dataflow macro actors and executed consecutively using registers. A dependency construct describes the partial order in the execution of instructions. It can be visualized by a dependency graph. The nodes in a dependency graph represent control constructs or data instructions; the directed arcs denote control dependencies between the nodes. Tokens are propagated along the arcs of the dependency graph. To distinguish different activations of a dependency graph, a tag is assigned to each token. The firing rule of dynamic dataflow is applied: a node is enabled as soon as tokens with identical tags are present on all its input arcs. However, in the ASTOR architecture tokens do not carry data.

4. Today: Multithreading. In a single-threaded architecture the computation moves forward one step at a time through a sequence of states, each step corresponding to the execution of one enabled instruction. The state of a single-threaded machine consists of the memory state (program memory, data memory, stack) and the processor state, which consists of the activity specifier (program counter, stack pointer) and the register context (a set of register contents). The activity specifier and the register context make up what is also called the context of a thread. Today most architectures are single-threaded architectures.

According to [70], a multithreaded architecture differs from the single-threaded architecture in that there may be several enabled instructions from different threads which all are candidates for execution (Fig. 4.1). Similar to the single-threaded machine, the state of the multithreaded machine consists of the memory state and the processor state; the latter, however, consists of a collection of activity specifiers and a collection of register contexts. A thread is a sequentially ordered block of instructions with a grain-size greater than one (to distinguish multithreaded architectures from fine-grained dataflow architectures).

Fig. 4.1. Multithreading: when a thread issues a remote-load request, the processor starts executing another thread.

Another notion is the distinction between blocking and non-blocking threads. A non-blocking thread is formed such that its evaluation proceeds without blocking the processor pipeline (for instance by remote memory accesses, cache misses or synchronization waits). Evaluation of a non-blocking thread starts as soon as all input operands are available, which is usually detected by some kind of dataflow principle. Thread switching is controlled by the compiler, harnessing the idea of rescheduling rather than blocking when waiting for data. Access to remote data is organized in a split-phase manner by one thread sending the access request to memory and another thread activating when its data is available (see the sketch after this paragraph). Thus a program is compiled into many very small threads activating each other when data become available. The same hardware mechanisms may also be used to synchronize interprocess communications to awaiting threads, thereby alleviating operating system overhead [159]. In contrast, a blocking thread might be blocked during execution by remote memory accesses, cache misses or synchronization needs. The waiting time, during which the pipeline is blocked, is lost when using a von Neumann processor, but can be bridged by a fast thread switch in a multithreaded processor. The thread is resumed when the reason for blocking is removed. Use of non-blocking threads typically leads to many small threads that are appropriate for execution by a hybrid dataflow computer or by a multithreaded architecture that is closely related to hybrid dataflow. Blocking threads may be normal threads or whole processes of a multithreaded UNIX-based operating system.
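A minimal sketch of such split-phase access (ours; the queue-based "network" and "memory module" are stand-ins) shows one thread issuing the request and terminating while a second thread is activated by the response, so the processor never blocks:

```python
# Split-phase remote access with non-blocking threads: the requesting
# thread sends the request and runs to completion; the response token
# later enables a second thread that consumes the value.
from collections import deque

ready, network = deque(), deque()

def request_thread():
    network.append(("load", 0x40, consumer_thread))  # send request, then end

def consumer_thread(value):
    print("loaded:", value)

def memory_module():
    op, addr, cont = network.popleft()
    ready.append((cont, 0xBEEF))     # response activates the waiting thread

ready.append((lambda _: request_thread(), None))
while ready or network:
    if ready:
        thread, arg = ready.popleft()
        thread(arg)                  # run a ready thread to completion
    if network:
        memory_module()              # meanwhile the memory answers requests
```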

Notice that we exclude in this section architectures that are designed for the execution of non-blocking threads. Although these architectures are often called multithreaded, we categorized them as threaded dataflow (Subsection 3.2) or large-grain dataflow (Subsection 3.3), since a dataflow principle is applied to start execution of non-blocking threads. Thus, multithreaded architectures in the more narrow sense applied here stem from the modification of RISC or even of superscalar RISC processors. They are able to bridge waiting times that arise in the execution of blocking threads by fast thread switching. This ability is in contrast to RISC processors or today's superscalar processors, which use busy waiting or a time-consuming, operating system based thread switch. One characteristic of such multithreaded architectures is the use of multiple register sets and a mechanism that dynamically triggers the thread switch. Thread switch overhead must be very low, from zero to only a few cycles.

Current microprocessors utilize instruction-level parallelism by a deep processor pipeline and by the superscalar instruction issue technique. A superscalar processor is able to dispatch multiple instructions each clock cycle from a conventional linear instruction stream [73]. The DEC Alpha 21164, IBM & Motorola PowerPC 604 and 620, MIPS R10000, Sun UltraSparc and HP PA-8000 issue up to four instructions per cycle from a single thread. VLSI technology will allow future generations of microprocessors to exploit instruction-level parallelism up to 8 instructions per cycle, or more.

However, the instruction-level parallelism found in a conventional instruction stream is limited. Recent studies showed the limits of processor utilization even of today's superscalar microprocessors. Using the SPEC92 benchmark suite, the PowerPC 620 showed an execution of 0.96 to 1.77 instructions per cycle [74], and even an 8-issue Alpha processor will fail to sustain 1.5 instructions per cycle [222].

The solution is the additional utilization of more coarse-grained parallelism. The main approaches are the multiprocessor chip and the multithreaded processor. The multiprocessor chip integrates two or more complete processors on a single chip. Therefore every unit of a processor is duplicated and used independently of its copies on the chip. In contrast, the multithreaded processor stores multiple contexts in different register sets on the chip. The functional units are multiplexed between the threads that are loaded in the register sets. Depending on the specific multithreaded processor design, only a single instruction pipeline is used, or a single dispatch unit issues instructions from different instruction buffers simultaneously. Because of the multiple register sets, context switching is very fast. Multithreaded processors tolerate memory latencies by overlapping the long-latency operations of one thread with the execution of other threads, in contrast to the multiprocessor chip approach. While the multiprocessor chip is easier to implement, use of multithreading in addition to a wide issue bandwidth is a promising approach.

Research on multithreaded architectures has been motivated by two concerns: tolerating memory latency and bridging synchronization waits by rapid context switches. Three different approaches to multithreaded architectures are distinguished: cycle-by-cycle interleaving [117, 173, 202, 215], block interleaving [2, 3, 106], and simultaneous multithreading [221, 222].

4.1. Cycle-by-cycle interleaving. In the cycle-by-cycle interleaving model the processor switches to a different thread after each instruction. In principle, an instruction of the same thread is fed into the pipeline only after the completion of the execution of the previous instruction. There must be at least as many register sets (loaded threads) available on the processor as the number of pipeline stages. Since cycle-by-cycle interleaving eliminates control and data dependencies between instructions in the pipeline, pipeline conflicts cannot arise and the processor pipeline can be easily built without the necessity of complex forwarding paths. This leads to a very simple and therefore potentially very fast pipeline; no hardware interlocking is necessary. The latency is tolerated by not scheduling a thread until its reference to a remote memory has completed. This model requires a large number of threads and complex hardware to support them. Interleaving the instructions from many threads also limits the processing power accessible to a single thread, thereby degrading the single-thread performance. The use of the dependence look-ahead technique in HEP, Horizon and Tera gives some help. A minimal sketch of the interleaving discipline is given below, followed by some machines that use cycle-by-cycle interleaving.
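A minimal sketch of the issue discipline (ours; the instruction strings and thread ids are illustrative):

```python
# Cycle-by-cycle interleaving: each cycle the processor issues one
# instruction from a different register set, round-robin, so successive
# issue slots always come from different threads.
from itertools import cycle

threads = {                      # register-set id -> instruction stream
    0: iter(["load r1", "add r2", "store r2"]),
    1: iter(["mul r5", "sub r6"]),
    2: iter(["load r9"]),
}
active = cycle(list(threads))

for clock in range(9):           # one issue slot per clock cycle
    tid = next(active)
    instr = next(threads[tid], "(stall: thread finished)")
    print(f"cycle {clock}: thread {tid}: {instr}")
```

In a real machine a finished thread would be unloaded and its register set refilled with another thread; the sketch simply shows that no two consecutive issue slots belong to the same thread.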

HEP (Heterogeneous Element Processor): The HEP system was a MIMD shared-memory multiprocessor system developed by Denelcor Inc. (Denver, CO, USA) between 1978 and 1985 [137, 145, 201, 202, 203], and was a pioneering example of a multithreaded machine. Spatial switching occurred between two queues of processes; one of these controlled program memory, register memory, and the functional memory, while the other controlled data memory. The main processor pipeline had eight stages, matching the number of processor cycles necessary to fetch a data item from memory into a register. Consequently, eight threads were in execution concurrently within a single HEP processor. In contrast to all other cycle-by-cycle interleaving processors, all threads within a HEP processor shared the same register set. Multiple processors and data memories were interconnected via a pipelined switch, and any register memory or data memory location could be used to synchronize two processes on a producer-consumer basis by a full/empty bit synchronization on a data memory word.
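The producer-consumer semantics of such a full/empty bit can be sketched as follows. This is an illustrative software analogue using Python threads, not HEP hardware.

```python
# Sketch of full/empty-bit synchronization on a memory word: reads block
# until the word is full and then set it empty; writes block until it is
# empty and then set it full.
import threading

class FEWord:
    def __init__(self):
        self._cond = threading.Condition()
        self._value, self._is_full = None, False

    def write(self, value):               # producer side
        with self._cond:
            while self._is_full:
                self._cond.wait()         # wait until consumed
            self._value, self._is_full = value, True
            self._cond.notify_all()

    def read(self):                       # consumer side
        with self._cond:
            while not self._is_full:
                self._cond.wait()         # wait until produced
            self._is_full = False
            self._cond.notify_all()
            return self._value

word = FEWord()
threading.Thread(target=lambda: word.write(42)).start()
print(word.read())                        # prints 42: produced, then consumed
```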

MASA (Multilisp Architecture for Symbolic Applications): The MASA was a multithreaded processor architecture for parallel symbolic computation with various features intended for effective Multilisp program execution [85, 117]. MASA featured a tagged architecture, multiple contexts, fast trap handling, and a synchronization bit in every memory word. Its principal novelty was the use of multiple contexts both to support interleaved execution from separate instruction streams and to speed up procedure calls and trap handling in the same manner as register windows.

Horizon: This paper architecture was based on the HEP but extended to a massively parallel MIMD computer. The machine was designed for up to 256 processing elements and up to 512 memory modules in a 16 × 16 × 6 node internal network. Like HEP it employed a global address space, and memory-based synchronization through the use of full/empty bits at each location [94, 148, 215]. Each processor supported 128 independent instruction streams by 128 register sets, with context switches occurring at every clock cycle.

Tera MTA (Multi-Threaded Architecture): The Tera MTA [9, 11, 12] is based on the Horizon architecture and is currently under construction by Tera Computer Company (Seattle, WA, USA); after many delays, introduction to market is scheduled for Spring 1997. The machine is a vast multiprocessor with multithreaded processor nodes arranged in a 3D mesh of pipelined packet-switch nodes. In the case of the maximum configuration of 4096 nodes, arranged in a 16 × 16 × 16 mesh, there are up to 256 processors, 512 memory units, 256 I/O cache units, 256 I/O processors, and 2816 switching nodes. Each processor is a 64-bit custom chip with up to 128 simultaneous threads in execution. It alternates between ready threads, using a deep pipeline. Inter-instruction dependencies are explicitly encoded by the compiler. Each thread has a complete set of registers. Memory units have 4-bit tags on each word, for full/empty and trap bits [12]. The Tera MTA exploits parallelism at all levels, from fine-grained instruction-level parallelism within a single processor to parallel programming across processors, to multiprogramming among several applications simultaneously. Consequently, processor scheduling occurs at many levels, and managing these levels poses unique and challenging scheduling concerns [11].

SB-PRAM and HPP: The SB-PRAM [1, 26] or HPP (High Performance PRAM) [83, 84] is a MIMD parallel computer with a shared address space and uniform memory access time due to its motivation: building a multiprocessor that is as close as possible to the theoretical machine model CRCW-PRAM. Processor and memory modules are connected by a butterfly network. Network latency is hidden by pipelining several so-called virtual processors on one physical processor node in cycle-by-cycle interleaving mode. A first prototype SB-PRAM [26] is running with four processors; a second prototype HPP is under construction. In the SB-PRAM, 32 virtual processors are scheduled round-robin for every instruction [83]. The project is running at the University of Saarland (Saarbrücken, Germany).

4.2. Block interleaving. The block-interleaving approach, exemplified by the MIT Sparcle processor, executes a single thread until it reaches a long-latency operation, such as a remote cache miss or a failed synchronization, at which point it switches to another context. The Rhamma processor switches contexts whenever a load, store, or synchronization operation is discovered.

In the block interleaving model a thread is executed for many cycles before context switching. Context switches are used only to hide long memory latencies, since small pipeline delays are hidden by proper ordering of instructions performed by the optimizing compiler. Since multithreading is not used to hide pipeline delays, fewer total threads are needed and a single thread can execute at full speed until the next context switch. This may also simplify the hardware. There are three variations of this model.

The switch-on-load version switches only on instructions that load data from shared memory, while storing data in shared memory does not cause context switching, since local memory loads and other instructions all complete quickly and can be scheduled by the compiler. However, context switches sometimes occur sooner than needed: if a compiler ordered instructions so that a load from shared memory was issued several cycles before the value is used, the context switch should not have to occur until the actual use of the value. This strategy is implemented in the switch-on-use model [154]. Here, a valid bit is added to each register. The bit is cleared when the load from shared memory to the corresponding register is issued, and set when the result returns from the network. A thread context switches if it needs a value from a register whose valid bit is still cleared. A benefit of this model is that several load instructions can be grouped together, thus prefetching several operands of an instruction. Instead of using valid bits, an explicit context switch instruction can be added between the group of loads and their subsequent uses. This model, which is called explicit-switch [56, 165], is simpler to implement and requires only one additional instruction.
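A minimal sketch of the switch-on-use policy follows; the register count, the 20-cycle latency, and the Context/SwitchOnUse classes are illustrative assumptions.

```python
# Sketch of switch-on-use: a shared-memory load clears the destination
# register's valid bit; the thread keeps running and only switches when
# it tries to USE a register whose valid bit is still clear.
class Context:
    def __init__(self, tid):
        self.tid, self.regs, self.valid = tid, [0] * 8, [True] * 8

class SwitchOnUse:
    def __init__(self, contexts, mem_latency=20):
        self.contexts, self.latency = contexts, mem_latency
        self.pending = []                 # (ready_cycle, context, reg, value)

    def load_shared(self, ctx, reg, value, now):
        ctx.valid[reg] = False            # clear valid bit, do NOT switch yet
        self.pending.append((now + self.latency, ctx, reg, value))

    def use(self, ctx, reg, now):
        self.drain(now)
        if not ctx.valid[reg]:            # value still in flight: switch
            print(f"cycle {now}: thread {ctx.tid} switches on use of r{reg}")
            return None
        return ctx.regs[reg]

    def drain(self, now):                 # results returning from the network
        for t, ctx, reg, val in [p for p in self.pending if p[0] <= now]:
            ctx.regs[reg], ctx.valid[reg] = val, True
            self.pending.remove((t, ctx, reg, val))

cpu = SwitchOnUse([Context(0), Context(1)])
cpu.load_shared(cpu.contexts[0], 3, 99, now=0)
cpu.use(cpu.contexts[0], 3, now=5)          # too early: context switch
print(cpu.use(cpu.contexts[0], 3, now=25))  # after the latency: prints 99
```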

The multithreading-with-caching approach adds caches to the block interleaving model. Only those loads that miss in the cache have long latencies and cause context switches. By filtering out many of the remote references, caching reduces the number of threads needed: in [39] it is shown that for most applications just two or three threads per processor are sufficient. There are three variations of the cache-extended block interleaving model. The switch-on-miss model [5] context switches if a load from shared memory misses in the cache. Such a context switch is not detected immediately, however, so a number of subsequent instructions have already entered the CPU pipeline and thus wasted CPU time. The switch-on-use-miss model [106] context switches when an instruction tries to use the still missing value from a shared load that missed in the cache. The conditional-switch model combines the benefits of the grouping of the explicit-switch model with caching: the explicit switch instruction is ignored if all load instructions in the preceding group hit the cache; otherwise, the context switch is performed. In what follows, we describe five representatives of the block interleaving model of multithreading.
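Before the machine descriptions, the conditional-switch rule can be sketched in a few lines; the cache contents and addresses below are made up for illustration.

```python
# Sketch of conditional-switch: an explicit switch instruction placed
# after a group of loads is ignored when every load in the group hit
# the cache, and triggers a context switch otherwise.
def run_group(cache, addresses):
    """Issue a group of shared-memory loads, then the explicit switch."""
    missed = [a for a in addresses if a not in cache]   # cache lookups
    for a in missed:
        cache.add(a)                  # the line will be filled by the miss
    if missed:                        # conditional switch: at least one miss
        return ("switch", missed)     # yield the processor to another thread
    return ("continue", [])           # all hits: switch instruction is a no-op

cache = {0x10, 0x18}
print(run_group(cache, [0x10, 0x18]))   # ('continue', []): no switch
print(run_group(cache, [0x20, 0x10]))   # ('switch', [32]): miss forces switch
```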

CHoPP (Columbia Homogeneous Parallel Processor): The CHoPP [154] was a shared memory MIMD with up to 16 powerful computing nodes. High sequential performance is due to issuing multiple instructions on each clock cycle, zero-delay branch instructions, and fast execution of individual instructions. Each node can support up to 64 threads.

Sparcle Processor in the Alewife System: The MIT Alewife multiprocessor [3, 4, 5, 6, 7, 147] is based on the multithreaded Sparcle processor, which is derived from a Sparc RISC processor. The eight overlapping register windows of a Sparc processor are organized in four independent non-overlapping thread contexts, each using two windows (one as a register set, the other as a context for trap and message handlers). Thread switching is triggered by external hardware when a remote memory access is detected. Emptying the pipeline of the instructions of the thread that caused the context switch, together with the organizational software overhead, sums up to a context switching penalty of 14 processor cycles. The Alewife multiprocessor has been operational since May 1994 [4]. It uses a low-dimensional direct interconnection network. Despite its distributed-memory architecture, Alewife allows efficient shared-memory programming through a multilayered approach to locality management. Communication latency and network bandwidth requirements are reduced by a directory-based cache-coherence scheme referred to as LimitLESS directories. Latencies still occur, although communication locality is enhanced by runtime and compile-time partitioning and placement of data and processes.

MSparc: An approach similar to the Sparcle processor is taken at the University of Oldenburg (Germany) with the MSparc processor [156, 179]. MSparc supports up to four contexts on chip and is compatible with standard Sparc processors. Switching is supported by hardware and can be achieved within one processor cycle. The multithreading policy is block interleaving with the switch-on-cache-miss policy, as in the Sparcle processor.

J-Machine: The MIT Jellybean Machine [58, 167] is so called because it is to be built entirely of a large number of jellybean components. The initial version uses an 8 × 8 × 16 cube network, with possibilities of expanding to 64K nodes. The jellybeans are message-driven processor (MDP) chips, each of which has a 36-bit processor, a 4K word memory, and a router with communication ports for 6 directions. External memory of up to 1M words can be added per processor. The MDP creates a task for each arriving message. In the prototype, each MDP chip has 4 external memory chips that provide 256K memory words. However, access is through a 12-bit data bus, and with an error correcting cycle, the access time is four memory cycles per word. Each communication port has a 9-bit data channel. The routers provide support for automatic routing from source to destination. The latency of a transfer is 2 microseconds across the prototype machine, assuming no blocking. When a message arrives, a task is created automatically to handle it in 1 microsecond. Thus, it is possible to implement a shared memory model using message passing, in which a message provides a fetch address and an automatic task sends a reply with the desired data.
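This remote-fetch idiom can be sketched as follows; the Node class, the queues, and the two-node setup are illustrative assumptions rather than MDP details.

```python
# Sketch of shared memory built on message passing in the J-Machine style:
# an arriving 'fetch' message automatically creates a task that reads the
# word and sends a 'reply' back to the requesting node.
from queue import Queue

class Node:
    def __init__(self, nid, memory):
        self.nid, self.memory, self.inbox = nid, memory, Queue()

    def handle_one(self, nodes):
        kind, payload = self.inbox.get()     # message arrival creates a task
        if kind == "fetch":
            addr, requester = payload
            nodes[requester].inbox.put(("reply", (addr, self.memory[addr])))
        elif kind == "reply":
            addr, value = payload
            print(f"node {self.nid}: M[{addr:#x}] = {value}")

nodes = [Node(0, {}), Node(1, {0x40: 7})]
nodes[1].inbox.put(("fetch", (0x40, 0)))     # node 0 asks node 1 for a word
nodes[1].handle_one(nodes)                   # fetch task runs on node 1
nodes[0].handle_one(nodes)                   # reply task delivers the value
```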

Rhamma: The multithreaded Rhamma processor [103, 104] from the University of Karlsruhe (Germany) uses a fast context switch to bridge latencies caused by memory accesses or by synchronization operations. Load/store, synchronization, and execution operations of different threads are executed simultaneously by specialized functional units within the processor. The units are coupled by FIFO buffers and access different register sets. Each unit stops the execution of a thread when it recognizes an instruction intended for another unit. To perform a context switch, the unit passes the thread tag to the FIFO buffer of the unit that is appropriate for the execution of the instruction; then it resumes processing with another thread from its own FIFO buffer. The Rhamma processor is most similar to the Sparcle. However, the execution unit of the Rhamma processor switches the context whenever it comes across a load, store, or synchronization instruction, and the load/store unit switches whenever it meets an execution or synchronization instruction (switch-on-load policy). In contrast to Sparcle, the context switch happens in an early stage of the pipeline, thus decreasing the context switching time. On the other hand, the overall performance of the Rhamma processor suffers from the higher rate of context switches unless the context switch time is very small. Software simulations showed the need for extremely fast context switching. Therefore a VHDL representation of Rhamma with optimizations towards fast context switching was developed, simulated, and synthesized. Specific implementation techniques reduce switching costs to zero or at most one processor cycle. These techniques use a context switch buffer, which is a table containing the addresses of instructions that already yielded a context switch.
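The context switch buffer behaves much like a small lookup table keyed by instruction address. The sketch below, with an assumed 64-entry capacity and a crude replacement policy, illustrates the idea.

```python
# Sketch of a context switch buffer: a table of instruction addresses that
# caused a switch before, so the fetch stage can switch threads at (almost)
# zero cost the next time the same address comes up.
class ContextSwitchBuffer:
    def __init__(self, size=64):
        self.size, self.table = size, set()

    def lookup(self, pc):
        return pc in self.table          # hit: switch already at fetch

    def record(self, pc):                # a load/store caused a late switch
        if len(self.table) >= self.size:
            self.table.pop()             # crude replacement policy
        self.table.add(pc)

csb = ContextSwitchBuffer()
pc = 0x1004                              # address of a load instruction
if not csb.lookup(pc):                   # first encounter: the switch is
    csb.record(pc)                       #   detected late in the pipeline
print(csb.lookup(pc))                    # True: next time, switch at fetch
```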

4.3. Simultaneous multithreading or multithreaded superscalar. The simultaneous multithreading or multithreaded superscalar approach [152, 194, 195, 221, 222] combines a wide-issue superscalar instruction dispatch with the multiple-context approach by providing several register sets on the processor and issuing instructions from several instruction queues simultaneously. Therefore, the issue slots of a wide-issue processor can be filled by operations of several threads. Latencies occurring in the execution of single threads are bridged by issuing operations of the remaining threads loaded on the processor. In principle, the full issue bandwidth can be utilized. Project-dependent details are briefly discussed below.
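Before the individual projects, a minimal sketch of the principle: each cycle, the issue slots are filled from several per-thread instruction queues. The 8-wide issue, the round-robin slot filling, and the toy instruction names are illustrative assumptions.

```python
# Sketch of simultaneous multithreading issue: up to ISSUE_WIDTH
# instructions per cycle are taken from several per-thread queues, so one
# thread's latency does not leave issue slots empty.
from collections import deque

ISSUE_WIDTH = 8

def issue_cycle(queues):
    """queues: per-thread deques of ready instructions."""
    slots, tid = [], 0
    while len(slots) < ISSUE_WIDTH and any(queues):
        q = queues[tid % len(queues)]
        if q:
            slots.append((tid % len(queues), q.popleft()))
        tid += 1
    return slots

threads = [deque(["add", "mul"]), deque(["ld"]), deque(["sub", "st", "add"])]
print(issue_cycle(threads))   # mixes instructions of all three threads
```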

Media Research Laboratory Processor: The multithreaded processor of the Media Research Laboratory of Matsushita Electric Ind. (Japan) [123, 124] is the first approach to simultaneous multithreading. Instructions of different threads are issued simultaneously to multiple functional units. Simulation results on a parallel ray-tracing application showed that, using 8 threads, a speed-up of 3.22 in the case of one load/store unit and of 5.79 in the case of two load/store units can be achieved over a conventional RISC processor. However, caches and TLBs were not simulated, nor was a branch prediction mechanism.

Simultaneous Multithreading: The simultaneous multithreading approach from the University of Washington (Seattle, WA, USA) [79] surveys enhancements of the Alpha 21164 processor [222] and of a hypothetical out-of-order issue superscalar microprocessor that resembles the R10000 and PA-8000 [221]. Simulations were conducted to evaluate processor configurations of an up to 8-threaded and 8-issue superscalar. This maximum configuration showed a throughput of 6.64 instructions per cycle due to multithreading, using the SPEC92 benchmark suite and assuming a processor with 32 functional units (among them multiple load/store units) [222]. The second approach [221] evaluated more realistic processor configurations and reached a throughput of 5.4 instructions per cycle for the 8-threaded and 8-issue superscalar case. Implementation issues and solutions to register file access and instruction scheduling were also presented.

Multithreaded Superscalar (Sigmund & Ungerer): While the simultaneous multithreading approach [222] surveys enhancements of the Alpha 21164 processor, the multithreaded superscalar processor approach from the University of Karlsruhe [194, 195] is based on a simplified PowerPC 604 processor. Both approaches, however, are similar in their instruction issuing policy. The multithreaded superscalar processor implements the six-stage instruction pipeline of the PowerPC 604 processor (fetch, decode, dispatch, execute, complete, and write-back) [207] and extends it to employ multithreading. The processor uses various kinds of modern techniques, e.g. separate code and data caches, a branch target address cache, static branch prediction, in-order dispatch, independent execution units with reservation stations, rename registers, out-of-order execution, and in-order completion. However, the instruction set is simplified, using an extended DLX [119] instruction set instead. A software simulation evaluated various configurations of the multithreaded superscalar processor. While the single-threaded 8-issue superscalar processor only reached a throughput of about 1.14 instructions per cycle, the 8-threaded 8-issue superscalar processor executed 4.19 instructions per cycle (the load/store frequency in the workload sets the theoretical maximum to 5 instructions per cycle). Increasing the superscalar issue bandwidth from 4 to 8 yields only a marginal gain in instruction throughput, a result that is nearly independent of the number of threads in the multithreaded processor. A multiprocessor chip approach with 8 single-threaded scalar processors reaches 6.07 instructions per cycle. Using the same number of threads, the multiprocessor chip reaches a higher throughput than the multithreaded superscalar approach. However, if the chip costs are taken into consideration, a 4-threaded 4-issue superscalar processor outperforms a multiprocessor chip built from single-threaded processors by a factor of 1.8 in the performance/cost relation [194].

Multithreaded Superscalar (Bagherzadeh et al.): This multithreaded superscalar processor approach [105, 153] combines out-of-order execution within an instruction stream with the simultaneous execution of instructions of different instruction streams. A particular superscalar processor called the Superscalar Digital Signal Processor (SDSP) is enhanced to run multiple threads. The enhancements are directed by the aim of minimal modification to the superscalar base processor. Therefore most resources on the chip are shared by the threads, for instance the register file, reorder buffer, instruction window, store buffer, and renaming hardware. Based on simulations, a performance gain of 20-55% due to multithreading was achieved across a range of benchmarks. The project runs at the University of California at Irvine.

5. Coming: Micro dataflow and nanothreading. The latest generation of microprocessors, as exemplified by the Intel PentiumPro [48], MIPS R10000 [231], and HP PA-8000 [149], displays an out-of-order dynamic execution that is referred to as local dataflow or micro dataflow by microprocessor researchers.

In the first paper on the PentiumPro, the instruction pipeline is described as follows [48]: "The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (micro-ops), or series of micro-ops, and these micro-ops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order." After decoding, the instructions or micro-ops are placed in an instruction window of pending instructions and in a reorder buffer that saves the program order and execution states of the instructions (instruction window and reorder buffer may coincide). State-of-the-art microprocessors typically provide 32 (R10000), 40 (PentiumPro), or 56 (PA-8000) instruction slots in the instruction window or reorder buffer. Instructions are ready to be executed as soon as all operands are available. A 4-issue superscalar processor dispatches up to 4 instructions per cycle to the execution units, provided that resource conflicts do not occur. Dispatch and execution determine the out-of-order section of a microprocessor. After execution, instructions are retired in program order.
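The dispatch rule (an instruction may issue as soon as its operands are ready, up to the issue width) can be sketched as follows; the window contents and the idealized zero-latency result forwarding are illustrative assumptions.

```python
# Sketch of micro dataflow in an instruction window: an instruction is
# dispatched as soon as all of its operands are ready, up to 4 per cycle,
# regardless of program order (retirement is ignored here).
def dispatch(window, ready_regs, width=4):
    issued = []
    for instr in list(window):                     # slots in program order
        if len(issued) == width:
            break                                  # issue bandwidth exhausted
        if all(src in ready_regs for src in instr["src"]):
            issued.append(instr["op"])
            ready_regs.add(instr["dst"])           # result ready (idealized)
            window.remove(instr)
    return issued

window = [
    {"op": "load r1",      "src": [],           "dst": "r1"},
    {"op": "add r2,r1,r1", "src": ["r1"],       "dst": "r2"},
    {"op": "mul r3,r4,r5", "src": ["r4", "r5"], "dst": "r3"},  # independent
]
ready = {"r4", "r5"}
print(dispatch(window, ready))   # all three issue: operands decide the order
```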

Comparing dataflow computers with such superscalar microprocessors shows several similarities and differences, which are briefly discussed below.

While a single thread of control in modern microprocessors often does not incorporate enough fine-grained parallelism to feed the multiple functional units of today's microprocessors, dataflow resolves any threads of control into separate instructions that are ready to execute as soon as all required operands are available. Thus, the fine-grained parallelism potentially utilized by a dataflow computer is far larger than the parallelism available to microprocessors.

Data and control dependencies potentially cause pipeline hazards in microprocessors, which are handled by complex forwarding logic. Due to the continuous context switches in fine-grain dataflow computers and in cycle-by-cycle interleaving machines, pipeline hazards are avoided, with the disadvantage of a poor single-thread performance.

Antidependencies and output dependencies are removed by register renaming, which maps the architectural registers to the physical registers of the microprocessor. Thereby the microprocessor internally generates an instruction stream that satisfies the single-assignment rule of dataflow. Modern microprocessors remove antidependencies and output dependencies on the fly and avoid the high memory requirements, the often awkward solutions for data structure storage and manipulation, and for loop control caused by the single-assignment rule in dataflow computers.
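Renaming can be sketched as a simple table walk; the eight architectural registers and the toy instruction format below are illustrative assumptions.

```python
# Sketch of register renaming: each write to an architectural register
# gets a fresh physical register, so the renamed stream obeys the
# dataflow single-assignment rule (no anti- or output dependences).
def rename(instructions, num_arch=8):
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch)}
    next_phys, renamed = num_arch, []
    for dst, srcs in instructions:
        srcs = [mapping[s] for s in srcs]     # read the current mappings
        mapping[dst] = f"p{next_phys}"        # fresh register per write
        next_phys += 1
        renamed.append((mapping[dst], srcs))
    return renamed

# r1 is written twice (an output dependence) and read in between
# (an antidependence); after renaming both hazards are gone.
prog = [("r1", ["r2"]), ("r3", ["r1"]), ("r1", ["r4"])]
print(rename(prog))   # [('p8', ['p2']), ('p9', ['p8']), ('p10', ['p4'])]
```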

The main difference between the dependence graphs of dataflow and the code sequence in the instruction window of a microprocessor is branch prediction and speculative execution. The accuracy of branch prediction is surprisingly high: more than 95% is reported for single SPECmark programs. However, rerolling execution in the case of a wrongly predicted path is costly in processor cycles, especially in deeply pipelined microprocessors. The idea of branch prediction and speculative execution has never been evaluated in the dataflow environment.


Due to the single thread of control, a high degree of data and instruction locality is present in the machine code of a microprocessor. The locality allows the processor to employ a storage hierarchy that keeps the instructions and data potentially executed in the next cycles close to the executing processor. Due to the lack of locality in a dataflow graph, a storage hierarchy is difficult to apply in dataflow computers.

The matching of executable instructions in the instruction window of microprocessors is restricted to a part of the instruction sequence. Because of the serial program order, the instructions in this window are likely to be executable soon, so the matching hardware can be restricted to a small number of instruction slots. In dataflow computers the number of tokens waiting for a match can be very high, and a large waiting-matching store is required. Due to the lack of locality, the likelihood of the arrival of a matching token is difficult to estimate, so caching of tokens to be matched soon is difficult in dataflow.

An unsolved problem in today's microprocessors is the memory latency caused by cache misses. As reported for the SGI Origin 2000 distributed shared memory system [150], these latencies are 11 processor cycles for a primary-level cache miss, 60 cycles for a secondary-level cache miss, and up to 180 cycles for a remote memory access. Expressed in missed instruction slots, the latencies must be multiplied by the superscalar degree; a 60-cycle secondary-level cache miss on a 4-issue processor, for example, costs up to 240 issue slots. Only a small part of the memory latency can be removed or hidden in modern microprocessors, even when techniques like out-of-order execution, write buffers, cache preload hardware, lockup-free caches, and a pipelined system bus are employed. So microprocessors often idle and are unable to exploit the high degree of internal parallelism provided by a wide superscalar approach. The rapid context switching of dataflow and multithreaded architectures shows a superior way out: idling is avoided by switching execution to another context.

An 8-issue or even higher superscalar dispatch will be possible in the next generation of microprocessors. Finding enough fine-grain parallelism to fully exploit the processor will be the main problem. One solution is to enlarge the instruction window to several hundred instruction slots, with hopefully more simultaneously executable instructions present. There are two drawbacks to this approach. First, given that all instructions stem from a single instruction stream and that on average every seventh instruction is a branch instruction, most of the instructions in the window will be speculatively assigned, with a very deep speculation level (today's depth normally is four at maximum). Thereby most of the instruction execution will be speculative. The principal problem here arises from the single instruction stream that feeds the instruction window. Second, if the instruction window is enlarged, the updating of the instruction states in the slots and the matching of executable instructions lead to more complex hardware logic in the dispatch stage of the pipeline, thus limiting the increase in cycle rate that is essential for the next generations of microprocessors. Solutions are the decoupling of the instruction window with respect to different instruction classes as in the PA-8000, the partitioning of the dispatch stage into several pipeline stages, and alternative instruction window organizations. One alternative instruction window organization is the multiple-FIFO-based organization in the dependence-based microprocessor [170]. Only the instructions at the heads of a number of FIFO buffers can be dispatched to the execution units in the next cycle. The total parallelism in the instruction window is restricted in favour of a less costly dispatch that does not slow down the processor cycle rate. Thereby the potential fine-grained parallelism is limited, a technique somewhat similar to the threaded dataflow approaches described above.
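A sketch of such FIFO-based steering follows; the steering heuristic and the four-FIFO configuration are simplified assumptions in the spirit of [170], not its exact mechanism.

```python
# Sketch of a multiple-FIFO instruction window: an instruction is steered
# to a FIFO that holds one of its producers (forming a dependence chain),
# and only the FIFO heads compete for dispatch each cycle.
from collections import deque

def steer(instructions, num_fifos=4):
    fifos = [deque() for _ in range(num_fifos)]
    producer = {}                              # register -> fifo index
    for dst, srcs in instructions:
        # prefer a FIFO that already holds a producer of this instruction
        target = next((producer[s] for s in srcs
                       if s in producer and fifos[producer[s]]), None)
        if target is None:                     # else the shortest FIFO
            target = min(range(num_fifos), key=lambda i: len(fifos[i]))
        fifos[target].append((dst, srcs))
        producer[dst] = target
    return fifos

prog = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]),   # one dependence chain
        ("r4", []), ("r5", ["r4"])]                   # an independent chain
for i, f in enumerate(steer(prog)):
    print(i, list(f))   # the two chains land in separate FIFOs
```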

It might be interesting to look, with respect to alternative instruction window organizations, at dataflow matching store implementations and at dataflow solutions like threaded dataflow, as exemplified by the repeat-on-input technique in the Epsilon-2, the strongly-connected-arcs model of the EM-4, or the associative switching network in the ADARC. For example, the repeat-on-input strategy dispatches very small compiler-generated code sequences serially in an otherwise fine-grained dataflow computer. Transferred to the local dataflow in an instruction window, a dispatch string might be used, where a series of data-dependent instructions is generated by the compiler and dispatched serially after the dispatch of the leading instruction. However, the high number of speculative instructions in the instruction window remains.

Another new idea is the nanothreading technique of DanSoft (Košice, Slovak Republic) [113]. Nanothreading dismisses full multithreading in favour of a nanothread that executes in the same register set as the main thread. The DanSoft nanothread requires only a 9-bit program counter and some simple control logic, and it resides in the same page as the main thread. The nanothread might be used to fill the instruction dispatch slots of a wide superscalar approach, as in simultaneous multithreading, but without the need for several register sets.

But the main problem of today's superscalar microprocessors may not be found in the instruction window organization. It is the necessity of instruction completion/retirement due to the serial semantics of the instruction stream. For example, if a load instruction causes a secondary-level cache miss, the whole reorder buffer may soon be clogged by succeeding instructions (succeeding in sequential program order) that are already executed. Because the sequential program order must be restored by the completion stage, these instructions cannot be retired and removed from the reorder buffer, even if they are independent of the load instruction that caused the cache miss. The completion may be the main obstacle to further performance increases in microprocessors.

In principle, an algorithm defines a partial ordering of instructions due to control and true data dependences. The total ordering in an instruction stream for today's microprocessors stems from sequential von Neumann languages. Why should
- a programmer design a partially ordered algorithm, and
- code the algorithm in total ordering because of the use of a sequential von Neumann language,
- the compiler regenerate the partial order in a dependence graph, and
- generate a reordered, optimized sequential machine code,
- the microprocessor dynamically regenerate the partial order in its out-of-order section and execute due to a micro dataflow principle,
- and then reestablish the unnatural serial program order in the completion stage?

Ideally, an algorithm should be coded in an appropriate higher-order language (e.g. dataflow-like languages might be appropriate). Next, the compiler should generate machine code that still reflects the parallelism and not an unnecessary serialization. Here, a dataflow graph viewed as a machine language might show the right direction. A parallelizing compiler may generate this kind of machine code even from a program written in a sequential von Neumann language. The compiler could use compiler optimization and coding to simplify the dynamic analysis and dispatch out of the instruction window. The processor dismisses the serial reordering in the completion stage in favour of an only partial reordering: the completion unit retires instructions not in a single serial order but in two or more series, as in the simultaneous multithreaded processors. Clogging of the reorder buffer is avoided, since clogging of a thread does not restrict retirement of instructions of another thread.
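Retirement in two or more series can be sketched as one reorder buffer per thread; the buffer contents below are illustrative assumptions.

```python
# Sketch of retirement in two series: each thread retires its own
# instructions in order, so a cache miss that clogs one thread's part of
# the reorder buffer does not stall retirement of the other thread.
from collections import deque

def retire(robs, width=4):
    """robs: per-thread reorder buffers of (instr, done) in program order."""
    retired = []
    for tid, rob in enumerate(robs):
        while rob and rob[0][1] and len(retired) < width:
            retired.append((tid, rob.popleft()[0]))   # in order per thread
    return retired

rob0 = deque([("load (miss)", False), ("add", True)])   # thread 0 is clogged
rob1 = deque([("mul", True), ("sub", True)])            # thread 1 is not
print(retire([rob0, rob1]))   # [(1, 'mul'), (1, 'sub')]: thread 1 retires
```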

6. Conclusion. Conceptually, the research in parallel computing shows a clear unification of two paradigms, i.e. von Neumann and dataflow computing. One of the key realizations made by the dataflow community is that pure dataflow and pure control-flow are not orthogonal to each other but are at two ends of a continuum. Hence, the trend in dataflow research was to incorporate explicit notions of state into the architecture. As a result, many of the pure dataflow ideas were replaced by abstractions that are more conventional. At the same time, von Neumann computing incorporated many of the dataflow techniques. Based on these and other merging ideas, the multithreaded model of computing emerged within the continuum, with several theoretically and commercially fruitful advantages.

Historically, commercial parallel machines have demonstrated innovative organizational structures, often tied to a particular programming model, as architects sought to obtain the ultimate in performance out of any given technology. The period up to 1985 was dominated by advancements in bit-level parallelism (8-bit microprocessors were replaced by 16-bit, 32-bit, etc.). Doubling the width of the data path reduces the number of cycles required to perform an operation. The period from the mid-80s to the mid-90s was dominated by advancements in instruction-level parallelism, where the basic steps in instruction processing could be performed in a single cycle, most of the time. The RISC approach was straightforward: pipeline the stages of instruction processing so that an instruction is executed almost every cycle, on average. In order to utilize the parallelism in the hardware, it became necessary to issue multiple instructions in each cycle, called superscalar execution, as well as to pipeline the processing of each instruction. Although superscalar execution will continue to be very important, barring some unforeseen breakthrough in instruction-level parallelism, the leap to thread-level parallelism is increasingly compelling as chips increase in capacity. Introducing thread-level parallelism on a single processor reduces the cost of the transition from one processor to multiple processors, thus making small-scale SMPs more attractive on a broad scale. In addition, it establishes an architectural direction that may enable much greater latency tolerance in the long term. To conclude, there are strong indications that multithreading will be utilized in future processor generations to hide the latency of local memory accesses [57].

Acknowledgements. The authors are grateful to the referees of this paper for their meticulous review and many helpful comments.

REFERENCES

[1] F. Abolhassan, R. Drefenstedt, J. Keller, W.J. Paul, and D. Scheerer, On the physical design of PRAMs, Computer Journal, 36 (1993), pp. 756–762.
[2] A. Agarwal, Performance tradeoffs in multithreaded processors, IEEE Trans. Parallel and Distr. Syst., 3 (1992), pp. 525–539.
[3] A. Agarwal, J. Babb, D. Chaiken, G. D'Souza, K. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, G. Maa, and K. Mackenzie, Sparcle: A multithreaded VLSI processor for parallel computing, Lect. Notes Comput. Sc., 748, Springer-Verlag, Berlin, 1993, pp. 395.
[4] A. Agarwal, R. Bianchini, D. Chaiken, K.L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung, The MIT Alewife machine: Architecture and performance, in Proc. 22nd ISCA, June 1995, pp. 2–13.
[5] A. Agarwal, G. D'Souza, K. Johnson, D. Kranz, J. Kubiatowicz, K. Kurihara, B.-H. Lim, G. Maa, D. Nussbaum, M. Parkin, and D. Yeung, The MIT Alewife machine: A large-scale distributed-memory multiprocessor, in Proc. Workshop on Scalable Shared Memory Multiprocessors, 1991.
[6] A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D'Souza, and M. Parkin, Sparcle: An evolutionary processor design for large-scale multiprocessors, IEEE Micro, 13 (June 1993), pp. 48–61.
[7] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz, APRIL: A processor architecture for multiprocessing, in Proc. 17th ISCA, June 1990, pp. 104–114.
[8] T. Agerwala and Arvind, Data flow systems, IEEE Computer, 15 (Feb. 1982), pp. 10–13.
[9] G. Alverson, R. Alverson, D. Callahan, B. Koblenz, A. Porterfield, and B. Smith, Exploiting heterogeneous parallelism on a multithreaded multiprocessor, in Proc. 1992 Intl. Conf. Supercomputing, 1992, pp. 188–197.
[10] G. Alverson, R. Alverson, D. Callahan, B. Koblenz, A. Porterfield, and B. Smith, Integrated support for heterogeneous parallelism, in Multithreaded computer architecture: A summary of the state of the art, R.A. Iannucci, G.R. Gao, R. Halstead, and B. Smith, eds., Kluwer Academic, 1994.

[11] G. Alverson, S. Kahan, R. Korry, and C. McCann, Scheduling on the Tera MTA, Lect. Notes Comput. Sc., 949, Springer-Verlag, Berlin, 1995, pp. 19.
[12] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, The Tera computer system, in Proc. 1990 Intl. Conf. Supercomputing, June 1990, pp. 1–6.
[13] M. Amamiya, N. Takahashi, T. Naruse, and M. Yoshida, A dataflow processor array system for solving partial differential equations, in Proc. Intl. Symp. Applied Math. and Infor. Sci., March 1982.
[14] B.S. Ang, Arvind, and D. Chiou, StarT the next generation: Integrating global caches and dataflow architecture, in Advanced Topics in Dataflow Computing and Multithreading, G.R. Gao, L. Bic, and J.-L. Gaudiot, eds., IEEE Computer Society Press, 1995, pp. 19–54.
[15] Arvind, L. Bic, and T. Ungerer, Evolution of dataflow computers, in Advanced topics in data-flow computing, J.-L. Gaudiot and L. Bic, eds., Prentice Hall, 1991, pp. 3–33.
[16] Arvind and S.A. Brobst, The evolution of dataflow architectures from static dataflow to P-RISC, Intl. J. High Speed Computing, 5 (1993), pp. 125–153.
[17] Arvind and D.E. Culler, Dataflow architectures, Ann. Review in Comput. Sci., 1 (1986), pp. 225–253.
[18] Arvind and D.E. Culler, Managing resources in a parallel machine, in Fifth Generation Computer Architectures, Elsevier Science Publishers, 1986, pp. 103–121.
[19] Arvind, K.P. Gostelow, and W. Plouffe, An asynchronous programming language and computing machine, Tech. Report 114a, Univ. California at Irvine, Dept. Information and Comput. Sci., Dec. 1978.
[20] Arvind and R.A. Iannucci, A critique of multiprocessing von Neumann style, in Proc. 10th ISCA, June 1983, pp. 426–436.
[21] Arvind and V. Kathail, A multiple processor dataflow machine that supports generalized procedures, in Proc. 8th ISCA, May 1981, pp. 291–302.
[22] Arvind, V. Kathail, and K. Pingali, A processing element for a large multiprocessor dataflow machine, in Proc. Intl. Conf. Circ. Comput., Oct. 1980.
[23] Arvind and R.S. Nikhil, Executing a program on the MIT tagged-token dataflow architecture, Lect. Notes Comput. Sc., 259, Springer-Verlag, Berlin, 1987, pp. 1–29.
[24] Arvind and R.S. Nikhil, Executing a program on the MIT tagged-token dataflow architecture, IEEE Trans. Computers, C-39 (1990), pp. 300–318.
[25] K. Asada, H. Terada, S. Matsumoto, S. Miyata, H. Asano, H. Miura, M. Shimizu, S. Komori, T. Fukuhara, and K. Shima, Hardware structure of a one-chip data-driven processor: Q-p, in Proc. 1987 ICPP, Aug. 1987, pp. 327–329.
[26] P. Bach, M. Braun, A. Formella, J. Friedrich, T. Grün, and C. Lichtenau, Building the 4 processor SB-PRAM prototype, in Proc. 30th Hawaii Intl. Conf. Syst. Sci., Jan. 1997, vol. 5, pp. 14–23.
[27] J. Backus, Can programming be liberated from the von Neumann style? A functional style and its algebra of programs, Comm. ACM, 21 (1978), pp. 613–641.
[28] P.M.C.C. Barahona and J.R. Gurd, Simulated performance of the Manchester multi-ring dataflow machine, in Proc. 2nd ICPC, Sep. 1985, pp. 419–424.
[29] P.M.C.C. Barahona and J.R. Gurd, Processor allocation in a multi-ring dataflow machine, J. Parall. Distr. Comput., 3 (1986), pp. 67–85.
[30] S. Barkhordarian, RAMPS: A realtime structured small-scale data flow system for parallel processing, in Proc. 1987 ICPP, Aug. 1987, pp. 610–613.
[31] M. Beck, T. Ungerer, and E. Zehender, Classification and performance evaluation of hybrid dataflow techniques with respect to matrix multiplication, in Proc. GI/ITG Workshop PARS'93, April 1993, pp. 118–126.
[32] M.J. Beckerle, An overview of the START (*T) computer system, Tech. Report MCRC-TR-28, Motorola Technical Report, July 1992.

[33] M.J. Beckerle, Overview of the START (*T) multithreaded computer, in Proc. COMPCON'93, 1993, pp. 148–156.
[34] L. Bic, A process-oriented model for efficient execution of dataflow programs, J. Parall. Distr. Comput., 8 (1990), pp. 42–51.
[35] L. Bic and M. Al-Mouhamed, The EM-4 under implicit parallelism, in Proc. 1993 Intl. Conf. Supercomputing, July 1993, pp. 21–26.
[36] A.P.W. Böhm, J.R. Gurd, and Y.M. Teo, The effect of iterative instructions in dataflow computers, in Proc. 1989 ICPP, Aug. 1989, pp. 201–208.
[37] A.P.W. Böhm and J. Sargeant, Code optimization for tagged-token dataflow machines, IEEE Trans. Computers, C-38 (1989), pp. 4–14.
[38] A.P.W. Böhm and Y.M. Teo, Resource management in a multi-ring dataflow machine, in Proc. CONPAR88, Sep. 1988, pp. B 191–200.
[39] R.F. Boothe, Evaluation of multithreading and caching in large shared memory parallel computers, Tech. Report UCB/CSD-93-766, Univ. California Berkeley, Comput. Sci. Division, July 1993.
[40] D.F. Brailsford and R.J. Duckworth, The MUSE machine: An architecture for structured data flow computation, New Generation Computing, 3 (1985), pp. 181–195.
[41] R. Buehrer and K. Ekanadham, Incorporating data flow ideas into von Neumann processors for parallel processing, IEEE Trans. Computers, C-36 (1987), pp. 1515–1522.
[42] M.L. Campbell, Static allocation for a dataflow multiprocessor, in Proc. 1985 ICPP, Aug. 1985, pp. 511–517.
[43] W.W. Carlson and J.A.B. Fortes, On the performance of combined data flow and control flow systems: Experiments using two iterative algorithms, J. Parall. Distr. Comput., 5 (1988), pp. 359–382.
[44] A.J. Catto and J.R. Gurd, Resource management in dataflow, in Proc. Conf. Functional Programming Lang. Comput. Arch., Oct. 1981, pp. 77–84.
[45] A.J. Catto, J.R. Gurd, and C.C. Kirkham, Non-deterministic dataflow programming, in Proc. 6th ACM European Conf. Comput. Arch., April 1981, pp. 435–444.
[46] D. Chiou, B.S. Ang, R. Greiner, Arvind, J.C. Hoe, M.J. Beckerle, J.E. Hicks, and A. Boughton, StarT-ng: Delivering seamless parallel computing, Lect. Notes Comput. Sc., 966, Springer-Verlag, Berlin, 1995, pp. 101–116.
[47] Y.M. Chong, Data flow chip optimizes image processing, Comput. Design, Oct. 1984, pp. 97–103.
[48] R.P. Colwell and R.L. Steck, A 0.6-μm BiCMOS processor with dynamic execution, in Proc. Intl. Solid State Circuits Conf., Feb. 1995.
[49] D. Comte and N. Hifdi, LAU multiprocessor: Microfunctional description and technological choice, in Proc. 1st Eur. Conf. Parallel and Distr. Proc., Feb. 1979, pp. 8–15.
[50] D. Comte, N. Hifdi, and J.C. Syre, The data driven LAU multiprocessor system: Results and perspectives, in Proc. World Comput. Congress IFIP'80, Oct. 1980, pp. 175–180.
[51] M. Cornish, The TI dataflow architecture: The power of concurrency for avionics, in Proc. 3rd Conf. Digital Avionics Syst., Nov. 1979, pp. 19–25.
[52] M. Cornish, D.W. Hogan, and J.C. Jensen, The Texas Instruments distributed data processor, in Proc. Louisiana Comput. Exposition, March 1979, pp. 189–193.
[53] D.E. Culler, Managing parallelism and resources in scientific dataflow programs, Ph.D. thesis, MIT, Cambridge, MA, June 1989.
[54] D.E. Culler, S. Goldstein, K.E. Schauser, and T. von Eicken, TAM: A compiler controlled threaded abstract machine, J. Parall. Distr. Comput., 18 (1993), pp. 347–370.
[55] D.E. Culler and G.M. Papadopoulos, The explicit token store, J. Parall. Distr. Comput., 10 (1990), pp. 289–308.
[56] D.E. Culler, A. Sah, K.E. Schauser, T. von Eicken, and J. Wawrzynek, Fine-grain parallelism with minimal hardware support: A compiler-controlled Threaded Abstract Machine, in Proc. 4th Intl. Conf. Arch. Support for Programming Lang. Operating Systems, April 1991, pp. 164–175.
[57] D.E. Culler, J.P. Singh, and A. Gupta, Parallel Computer Architecture, Morgan Kaufmann Publishers, 1998 (to appear).
[58] W.J. Dally, J. Fiske, J. Keen, R. Lethin, M. Noakes, P. Nuth, R. Davison, and G. Fyler, The message-driven processor: A multicomputer processing node with efficient mechanisms, IEEE Micro, 12 (April 1992), pp. 23–39.
[59] J.D.G. Da Silva and J.V. Woods, Design of a processing subsystem for the Manchester data flow computer, Proc. IEE, 128 (1981), pp. 218–224.
[60] J.D.G. Da Silva and J.V. Woods, A pseudoassociative matching storage using hardware hashing, Proc. IEE, 130 (1984), pp. 19–25.

[61] K. Dai and W.K. Giloi, A basic architecture supporting LGDF computation, in Proc. 1990 Intl. Conf. Supercomputing, 1990, pp. 23–33.
[62] F. Darema, D.A. George, V.A. Norton, and G.F. Pfister, A single-program-multiple-data computational model for EPEX/FORTRAN, Parallel Comput., 7 (1988), pp. 11–24.
[63] A.L. Davis, The architecture and system method of DDM1: A recursively structured data driven machine, in Proc. 5th ISCA, April 1978, pp. 210–215.
[64] A.L. Davis and R.M. Keller, Data flow program graphs, IEEE Computer, 15 (Feb. 1982), pp. 26–41.
[65] J.B. Dennis, First version of a data-flow procedure language, Lect. Notes Comput. Sc., 19, Springer-Verlag, Berlin, 1974, pp. 362–376.
[66] J.B. Dennis, The varieties of data flow computers, in Proc. 1st Intl. Conf. Distr. Comput. Syst., Oct. 1979, pp. 430–439.
[67] J.B. Dennis, Data flow supercomputers, IEEE Computer, 13 (Nov. 1980), pp. 48–56.
[68] J.B. Dennis, Dataflow computation: A case study, in Computer architecture: Concepts and systems, V.M. Milutinović, ed., North-Holland, 1988, pp. 354–404.
[69] J.B. Dennis, G.A. Boughton, and C.K.C. Leung, Building blocks for data flow prototypes, in Proc. 7th ISCA, May 1980, pp. 1–8.
[70] J.B. Dennis and G.R. Gao, Multithreaded architectures: Principles, projects, and issues, in Multithreaded computer architecture: A summary of the state of the art, R.A. Iannucci, G.R. Gao, R. Halstead, and B. Smith, eds., Kluwer Academic, 1994, pp. 1–74.
[71] J.B. Dennis, W.Y.P. Lim, and W.B. Ackerman, The MIT data flow engineering model, in Proc. 9th World Comput. Congress IFIP'83, Sep. 1983, pp. 553–560.
[72] J.B. Dennis and D.P. Misunas, A preliminary architecture for a basic data-flow processor, in Proc. 2nd ISCA, Jan. 1975, pp. 126–132.
[73] K. Diefendorff and M. Allen, Organization of the Motorola 88110 superscalar RISC microprocessor, IEEE Micro, 12 (April 1992), pp. 40–63.
[74] T.A. Diep, C. Nelson, and J.P. Shen, Performance evaluation of the PowerPC 620 microprocessor, in Proc. 22nd ISCA, June 1995, pp. 163–174.
[75] E.W. Dijkstra, Co-operating sequential processes, in Programming Languages, F. Genuys, ed., Academic Press, 1968, pp. 43–112.
[76] G.K. Egan, The CSIRAC II dataflow computer: Token and node definitions, Tech. Report 31-009, Swinburne Institute of Technology, 1990.
[77] G.K. Egan, N.J. Webb, and A.P.W. Böhm, Some architectural features of the CSIRAC II data-flow computer, in Advanced topics in data-flow computing, J.-L. Gaudiot and L. Bic, eds., Prentice Hall, 1991, pp. 333.
[78] G.K. Egan and S. Wathanasin, Data-driven computation: Its suitability in control systems, in Proc. Science Research Council's Workshop on Distr. Comput., July 1978.
[79] S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm, and D.M. Tullsen, Simultaneous multithreading: A foundation for next-generation processors, IEEE Micro, 17 (Oct. 1997).
[80] P. Evripidou and J.-L. Gaudiot, The USC decoupled multilevel data-flow execution model, in Advanced topics in data-flow computing, J.-L. Gaudiot and L. Bic, eds., Prentice Hall, 1991, pp. 347–379.
[81] M.R. Exum and J.-L. Gaudiot, Network design and allocation consideration in the Hughes data-flow machine, Parallel Comput., 13 (1990), pp. 17–34.
[82] M. Flynn, Some computer organizations and their effectiveness, IEEE Trans. Computers, C-21 (1972), pp. 948–960.
[83] A. Formella, J. Keller, and T. Walle, HPP: A high performance PRAM, Lect. Notes Comput. Sc., 1123, Springer-Verlag, Berlin, 1996, pp. 425–434.
[84] A. Formella, J. Keller, and T. Walle, HPP: A high performance PRAM, Tech. Report H 124, 02/96, University of Saarbruecken, 1996.

[85] T. Fujita, A multithreaded processor architecture for parallel symbolic computation, Tech. Report MIT/LCS/TM-338, CS Lab., MIT, Cambridge, MA, 1987.
[86] G.R. Gao, An efficient hybrid dataflow architecture model, J. Parall. Distr. Comput., 19 (1993), pp. 293–306.
[87] G.R. Gao, J.-L. Gaudiot, and L. Bic, Dataflow and multithreaded architectures: Guest editor's introduction, J. Parall. Distr. Comput., 18 (1993), pp. 271–272.
[88] J.-L. Gaudiot and L. Bic, Advanced topics in dataflow computing, Prentice Hall, 1991.
[89] J.-L. Gaudiot, R.W. Vedder, G.K. Tucker, D. Finn, and M.L. Campbell, A distributed VLSI architecture for efficient signal and data processing, IEEE Trans. Computers, C-34 (1985), pp. 1072–1087.
[90] D. Ghosal and L.N. Bhuyan, Performance analysis of the MIT tagged token dataflow architecture, in Proc. 1987 ICPP, Aug. 1987, pp. 680–683.
[91] W.K. Giloi, U. Brüning, and W. Schröder-Preikschat, MANNA: Prototype of a distributed memory architecture with maximized sustained performance, in Proc. 5th Euromicro Workshop on Parallel and Distributed Processing, Jan. 1996.
[92] J.R.W. Glauert, J.R. Gurd, and C.C. Kirkham, Evolution of dataflow architecture, in Proc. IFIP WG 10.3 Workshop on Hardware Supported Implementation on Concurrent Lang. in Distr. Systems, March 1984, pp. 1–18.
[93] J.R.W. Glauert, J.R. Gurd, C.C. Kirkham, and I. Watson, The dataflow approach, in Distributed Computing, F.B. Chambers, D.A. Duce, and G.P. Jones, eds., Academic Press, 1984, pp. 153.
[94] R.R. Glenn and D.V. Pryor, Instrumentation for a massively parallel MIMD application, in Proc. Intl. Conf. Measurement and Modeling of Comput. Syst., May 1991, pp. 208–209.
[95] E. Glück-Hiltrop, The Stollman dataflow machine, in Proc. CONPAR88, Sep. 1988.
[96] E. Glück-Hiltrop, M. Ramlow, and U. Schürfeld, The Stollman dataflow machine, Lect. Notes Comput. Sc., 365, Springer-Verlag, Berlin, 1989, pp. 433–457.
[97] K.P. Gostelow and Arvind, The U-interpreter, IEEE Computer, 15 (Feb. 1982), pp. 42–49.
[98] K.P. Gostelow and R.E. Thomas, A view of dataflow, in Proc. National Comput. Conf., June 1979, pp. 629–636.
[99] K.P. Gostelow and R.E. Thomas, Performance of a simulated dataflow computer, IEEE Trans. Computers, C-29 (1980), pp. 905–919.
[100] V.G. Grafe and J.E. Hoch, The Epsilon-2 multiprocessor system, J. Parall. Distr. Comput., 10 (1990), pp. 309–318.
[101] V.G. Grafe, J.E. Hoch, G.S. Davidson, V.P. Holmes, D.M. Davenport, and K.M. Steele, The Epsilon project, in Advanced topics in data-flow computing, J.-L. Gaudiot and L. Bic, eds., Prentice Hall, 1991, pp. 175–205.
[102] J.D. Grimm, J.A. Eggert, and G.W. Karcher, Distributed signal processing using data flow techniques, in Proc. 17th Hawaii Intl. Conf. Syst. Sci., Jan. 1984, pp. 29–38.
[103] W. Gruenewald and T. Ungerer, Towards extremely fast context switching in a block-multithreaded processor, in Proc. 22nd Euromicro Conf., Sep. 1996, pp. 592–599.
[104] W. Gruenewald and T. Ungerer, A multithreaded processor designed for distributed shared memory systems, in Proc. Intl. Conf. Advances in Parallel and Distr. Comput., March 1997.
[105] M. Gulati and N. Bagherzadeh, Performance study of a multithreaded superscalar microprocessor, in Proc. 2nd Intl. Symp. High-Performance Comput. Arch., Feb. 1996, pp. 291–301.
[106] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber, Comparative evaluation of latency reducing and tolerating techniques, in Proc. 18th ISCA, May 1991, pp. 254–263.
[107] J.R. Gurd, The Manchester dataflow machine, Future Generation Computer Systems, 1 (1985), pp. 201–212.
[108] J.R. Gurd, C.C. Kirkham, and I. Watson, The Manchester prototype dataflow computer, Comm. ACM, 28 (1985), no. 1, pp. 34–52.
[109] J.R. Gurd and I. Watson, A multilayered data flow computer architecture, in Proc. 1977 ICPP, Aug. 1977, pp. 94.
[110] J.R. Gurd and I. Watson, Data driven system for high speed computing, Part I: Structuring software for parallel execution, Computer Design, 19 (1980), pp. 91–96.
[111] J.R. Gurd and I. Watson, Data driven system for high speed computing, Part II: Hardware design, Computer Design, 19 (1980), pp. 97–106.
[112] J.R. Gurd and I. Watson, Preliminary evaluation of a prototype dataflow computer, in Proc. 9th World Comput. Congress IFIP'83, Sep. 1983, pp. 545–551.

[113] L. Gwennap, DanSoft develops VLIW design, Microprocessor Report, 11 (Feb. 17, 1997), pp. 18–22.
[114] R.H. Halstead, Jr., Implementation of Multilisp: Lisp on a multiprocessor, in Proc. ACM Symp. Lisp and Functional Programming, Aug. 1984, pp. 9–17.
[115] R.H. Halstead, Jr., Multilisp: A language for concurrent symbolic computation, ACM Trans. Programming Lang. Syst., 7 (1985), pp. 501–538.
[116] R.H. Halstead, Jr., Parallel computing using Multilisp, in Parallel computation and computers for artificial intelligence, J. Kowalik, ed., Kluwer Academic, 1988, pp. 21–49.
[117] R.H. Halstead, Jr. and T. Fujita, MASA: A multithreaded processor architecture for parallel symbolic computing, in Proc. 15th ISCA, May 1988, pp. 443–451.
[118] R. Hempel, A.J.G. Hey, O. McBryan, and D.W. Walker, Message passing interfaces (special issue), Parallel Comput., 20 (1994), pp. 417–673.
[119] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1996.
[120] K. Hiraki, K. Nishida, S. Sekiguchi, and T. Shimada, Maintenance architecture and LSI implementation of a dataflow computer with a large number of processors, in Proc. 1986 ICPP, Aug. 1986, pp. 584–591.
[121] K. Hiraki, S. Sekiguchi, and T. Shimada, Status report of SIGMA-1: A dataflow supercomputer, in Advanced topics in data-flow computing, J.-L. Gaudiot and L. Bic, eds., Prentice Hall, 1991, pp. 207–223.
[122] K. Hiraki, T. Shimada, and K. Nishida, A hardware design of the SIGMA-1, a data flow computer for scientific computations, in Proc. 1984 ICPP, Aug. 1984, pp. 524–531.
[123] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa, An elementary processor architecture with simultaneous instruction issuing from multiple threads, in Proc. 19th ISCA, May 1992, pp. 136–145.
[124] H. Hirata, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa, A multithreaded processor architecture with simultaneous instruction issuing, Supercomputer, 49 (1992), pp. 23–39.
[125] J.L.A. Hughes, Implementing control-flow structures in dataflow programs, in Proc. COMPCON Spring 82, Feb. 1982, pp. 87–90.
[126] H.H.J. Hum, K.B. Theobald, and G.R. Gao, Building multithreaded architectures with off-the-shelf microprocessors, in Proc. IPPS 94, April 1994, pp. 288–294.
[127] R.A. Iannucci, Toward a dataflow/von Neumann hybrid architecture, in Proc. 15th ISCA, May 1988, pp. 131–140.
[128] R.A. Iannucci, G.R. Gao, R. Halstead, and B. Smith, Multithreaded computer architecture: A summary of the state of the art, Kluwer Academic, 1994.
[129] Y. Inagami and J.F. Foley, The specification of a new Manchester dataflow machine, in Proc. 1989 Intl. Conf. Supercomputing, June 1989, pp. 371–380.
[130] N. Ito, M. Kishi, E. Kuno, and K. Rokusawa, The data-flow based parallel inference machine to support two basic languages in KL1, in Proc. IFIP TC-10 Working Conf. Fifth Generation Comput. Arch., July 1985, pp. 123–145.
[131] N. Ito and K. Masuda, Parallel inference machine based on the data flow model, in Proc. Intl. Workshop High Level Comput. Arch., May 1984.
[132] N. Ito, M. Sato, E. Kuno, and K. Rokusawa, The architecture and preliminary evaluation results of the experimental parallel inference machine PIM-D, in Proc. 13th ISCA, June 1986, pp. 149–156.
[133] M. Iwashita and T. Temma, Experiments on a data flow machine, in Proc. IFIP Conf., 1987, pp. 235–245.
[134] T. Jeffery, The PD7281 processor, Byte, Nov. 1985, pp. 237–246.
[135] I. Jerebic, B. Slivnik, and R. Trobec, for programming on torus-connected multiprocessors, in Proc. PACTA'92, Sep. 1992, pp. 386–395.
[136] C. Jesshope, Scalable parallel computers, EURO-PAR Conf., Stockholm, Sweden, Aug. 1995, Tutorial.
[137] H.F. Jordan, Performance measurements on HEP: A pipelined MIMD computer, in Proc. 10th ISCA, June 1983, pp. 207–212.
[138] K.M. Kavi, B.P. Buckles, and U.N. Bhat, A formal definition of data flow graph model, IEEE Trans. Computers, C-35 (1986), pp. 940–948.
[139] K. Kawakami and J.R. Gurd, A scalable dataflow structure store, in Proc. 13th ISCA, June 1986, pp. 243–250.
[140] M. Kishi, H. Yasuhara, and Y. Kawamura, DDDP: A distributed data driven processor, in Proc. 10th ISCA, June 1983, pp. 236–242.
[141] D. Klappholz, Y. Liao, D.J. Wang, and A. Omondi, Toward a hybrid data-flow/control-flow MIMD architecture, in Proc. 5th Intl. Conf. Distr. Comput. Syst., May 1985, pp. 10–15.
[142] Y. Kodama, Y. Koumura, M. Sato, H. Sakane, S. Sakai, and Y. Yamaguchi, EMC-Y: Parallel processing element optimizing communication and computation, in Proc. 1993 Intl. Conf. Supercomputing, July 1993, pp. 167–174.
[143] Y. Kodama, H. Sakane, M. Sato, H. Yamana, S. Sakai, and Y. Yamaguchi, The EMC-X parallel computer: Architecture and basic performance, in Proc. 22nd ISCA, June 1995, pp. 14–23.
[144] S. Komori, K. Shima, S. Miyata, and H. Terada, Parallel processing with large grain data flow techniques, IEEE Micro, 9 (June 1989), pp. 45–59.
[145] J.S. Kowalik, Parallel MIMD computation: The HEP supercomputer and its applications, MIT Press, 1985.

[146] D.A. Kranz, R.H. Halstead, and E. Mohr, Mul-T: A high-performanceparal lel Lisp,ACM

ASYNCHRONY IN PARALLEL COMPUTING 31

SIGPLAN Notices, 24 1989, no. 7, pp. 8190.

[147] J. Kubiatowicz and A. Agarwal, Anatomy of a message in the Alewife multiprocessor,in

Pro c. 1993 Intl. Conf. Sup ercomputing, July 1993, pp. 195206.

[148] J.T. Kuehn and B.J. Smith, The Horizon supercomputing system: architecture and software,

in Pro c. Sup ercomputing '88, Nov. 1988, pp. 2834.

[149] A. Kumar, The HP PA-8000 RISC CPU, IEEE Micro, 17 March/April 1997, pp. 2732.

[150] J. Laudon and D. Lenoski, The SGI Origin: A ccNUMA highly scalable , in Pro c.

24th ISCA, June 1997, pp. 241-251.

[151] B. Lee and A.R. Hurson, Dataow architectures and multithreading, IEEE Computer, 27

Aug. 1994, pp. 2739.

[152] J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, R.L. Stamm, and D.M. Tullsen, Converting

thread-level paral lelism to instruction-level paral lelism via simultaneous multithreading,

ACM Trans. Comput. Systems, 15 1997.

[153] M. Loikkanen and N. Bagherzadeh, A ne-grain multithreading superscalar architecture,

in Pro c. PACT'96, Oct. 1996, pp. 163168.

[154] T.E. Mankovic, V. Popescu, and H. Sullivan, CHoPP priciples of operations, in Pro c. 2nd

Intl. Sup ercomputer Conf., May 1987, pp. 210.

[155] O.C. Maquelin, H.H.J. Hum, and G.R. Gao, Costs and benefits of multithreading with off-the-shelf processors, Lect. Notes Comp. Sc., 966, Springer-Verlag, Berlin, 1995, pp. 117-128.

[156] A. Mikschl and W. Damm, Msparc: A multithreaded Sparc, Lect. Notes Comp. Sc., 1123, Springer-Verlag, Berlin, 1996, pp. 461-469.

[157] J. Milewski, Data driven control in I-structures, in Proc. 2nd Intl. Conf. Parallel Computing, Sep. 1985, pp. 377-382.

[158] J. Milewski and I. Mori, Efficient loop handling in a stream-oriented unraveling dataflow interpreter, Comput. Artif. Intell., 9 (1990), pp. 545-570.

[159] S.W. Moore, Multithreaded processor design, Kluwer Academic Publishers, Boston, 1996.

[160] V.L. Narasimhan and T. Downs, features of a dynamic dataflow array processing system PATTSY, in Proc. 3rd Annual Parallel Processing Symp., March 1989, pp. 722-740.

[161] S.S. Nemawarkar and G.R. Gao, Measurement and modeling of EARTH-MANNA multithreaded architecture, in Proc. 4th Intl. Workshop MASCOTS'96, Feb. 1996, pp. 109-114.

[162] R.S. Nikhil, The parallel programming language Id and its compilation for parallel machines, Intl. J. High Speed Computing, 5 (1993), pp. 171-223.

[163] R.S. Nikhil, A multithreaded implementation of Id using P-RISC graphs, Lect. Notes Comput. Sc., 768, Springer-Verlag, Berlin, 1994, pp. 390-405.

[164] R.S. Nikhil and Arvind, Can dataflow subsume von Neumann computing?, in Proc. 16th ISCA, May 1989, pp. 262-272.

[165] R.S. Nikhil, G.M. Papadopoulos, and Arvind, *T: A multithreaded massively parallel architecture, in Proc. 19th ISCA, May 1992, pp. 156-167.

[166] H. Nishikawa, H. Terada, K. Komatsu, S. Yoshida, T. Okamoto, Y. Tsuji, S. Takamura, T. Tokura, Y. Nishikawa, S. Hara, and M. Meichi, Architecture of a one-chip data-driven processor: Q-p, in Proc. 1987 ICPP, Aug. 1987, pp. 319-326.

[167] M.D. Noakes, D.A. Wallach, and W.J. Dally, The J-Machine multicomputer: An architectural evaluation, in Proc. 20th ISCA, May 1993, pp. 224-236.

[168] M. Ojsteršek, V. Žumer, and P. Kokol, Data flow computer models, in Proc. CompEuro '87, May 1987, pp. 884-885.

[169] A. Omondi and D. Klappholz, Combining control driven and data driven computation for high processing rate, in Proc. Intl. Computing Symp., March 1985.

[170] S. Palacharla, N.P. Jouppi, and J.E. Smith, Complexity-effective superscalar processors, in Proc. 24th ISCA, June 1997, pp. 206-218.

[171] G.M. Papadopoulos, Implementation of a general-purpose dataflow multiprocessor, Tech. Report TR-432, MIT Laboratory for Computer Science, Cambridge, MA, Aug. 1988.

[172] G.M. Papadopoulos and D.E. Culler, Monsoon: An explicit token-store architecture, in Proc. 17th ISCA, June 1990, pp. 82-91.

[173] G.M. Papadopoulos and K.R. Traub, Multithreading: A revisionist view of dataflow architectures, in Proc. 18th ISCA, May 1991, pp. 342-351.

[174] L.M. Patnaik, R. Govindarajan, and N.S. Ramadoss, Design and performance evaluation of EXMAN: EXtended MANchester data flow computer, IEEE Trans. Computers, C-35 (1986), pp. 229-244.

[175] A. Plas, D. Comte, O. Gelly, and J.C. Syre, LAU system architecture: A parallel data driven processor based on single assignment, in Proc. 1976 ICPP, Aug. 1976, pp. 293-302.


[176] G.M. Quénot and B. Zavidovique, A data-flow processor for real-time low-level image processing, in Proc. IEEE Custom Integrated Circuits Conf., May 1991, pp. 1241-1244.

[177] J.E. Requa, The piecewise data flow architecture: Control flow and register management, in Proc. 10th ISCA, June 1983, pp. 84-89.

[178] J.E. Requa and J.R. McGraw, The piecewise data flow architecture: Architectural concept, IEEE Trans. Computers, C-32 (1983), pp. 425-438.

[179] J. Risau, A. Mikschl, and W. Damm, A RISC approach to weak , Lect. Notes Comp. Sc., 1123, Springer-Verlag, Berlin, 1996, pp. 453-456.

[180] L. Roh and W.A. Najjar, Design of a storage hierarchy in multithreaded architectures, in Proc. 28th MICRO, 1995, pp. 271-278.

[181] S. Sakai, Synchronization and design for a multithreaded massively parallel computer, in Advanced Topics in Dataflow Computing and Multithreading, G.R. Gao, L. Bic, and J.-L. Gaudiot, eds., IEEE Computer Society Press, 1995, pp. 55-74.

[182] S. Sakai, K. Okamoto, H. Matsuoka, H. Hirono, Y. Kodama, and M. Sato, Super-threading: Architectural and software mechanisms for optimizing parallel computation, in Proc. 1993 Intl. Conf. Supercomputing, July 1993, pp. 251-260.

[183] S. Sakai, Y. Yamaguchi, K. Hiraki, Y. Kodama, and T. Yuba, An architecture of a dataflow single chip processor, in Proc. 16th ISCA, May 1989, pp. 46-53.

[184] J. Sargeant and C.C. Kirkham, Stored data structures on the Manchester dataflow machines, in Proc. 13th ISCA, June 1986, pp. 235-242.

[185] A.V.S. Sastry, L.M. Patnaik, and J. Šilc, Dataflow architectures for , Electrotechnical Review, 55 (1988), pp. 9-19.

[186] M. Sato, Y. Kodama, S. Sakai, and Y. Yamaguchi, EM-C: Programming with explicit parallelism and locality for the EM-4 multiprocessor, in Proc. PACT'94, Aug. 1994, pp. 3-14.

[187] M. Sato, Y. Kodama, S. Sakai, and Y. Yamaguchi, Experiences with executing shared memory programs using fine-grained communication and multithreading in EM-4, in Proc. IPPS '94, April 1994, pp. 630-636.

[188] M. Sato, Y. Kodama, S. Sakai, Y. Yamaguchi, and Y. Koumura, Thread-based programming for the EM-4 hybrid dataflow machine, in Proc. 19th ISCA, May 1992, pp. 146-155.

[189] M. Sato, Y. Kodama, S. Sakai, Y. Yamaguchi, and S. Sekiguchi, Distributed data structure in thread-based programming for a highly parallel dataflow machine EM-4, in Advanced Topics in Dataflow Computing and Multithreading, G.R. Gao, L. Bic, and J.-L. Gaudiot, eds., IEEE Computer Society Press, 1995, pp. 131-142.

[190] J. Sérot, G.M. Quénot, and B. Zavidovique, A functional data-flow architecture dedicated to real-time image processing, IFIP Trans., A-23 (1993), pp. 129-140.

[191] J.A. Sharp, Data flow computing, Ellis Horwood Ltd. Publishers, 1985.

[192] A. Shaw, Arvind, and R.P. Johnson, Performance tuning scientific codes for dataflow execution, in Proc. PACT'96, Oct. 1996, pp. 198-207.

[193] T. Shimada, K. Hiraki, K. Nishida, and S. Sekiguchi, Evaluation of a prototype data flow processor of the SIGMA-1 for scientific computations, in Proc. 13th ISCA, June 1986, pp. 226-234.

[194] U. Sigmund and T. Ungerer, Evaluating a multithreaded superscalar microprocessor versus a multiprocessor chip, in Proc. 4th Parallel Syst. Algorithms Workshop, April 1996.

[195] U. Sigmund and T. Ungerer, Identifying bottlenecks in a multithreaded superscalar multiprocessor, Lect. Notes Comp. Sc., 1123, Springer-Verlag, Berlin, 1996, pp. 797-800.

[196] J. Šilc and B. Robič, The review of some data flow computer architectures, Informatica, 11 (1987), pp. 61-66.

[197] J. Šilc and B. Robič, Efficient dataflow architecture for specialized computations, in Proc. 12th World Congress on Scientific Computation, July 1988, pp. 4.681-4.684.

[198] J. Šilc and B. Robič, Synchronous dataflow-based architecture, Microproc. Microprog., 27 (1989), pp. 315-322.

[199] J. Šilc and B. Robič, MADAME: Macro-dataflow machine, in Proc. MELECON '91, May 1991, pp. 985-988.

[200] J. Šilc, B. Robič, and L.M. Patnaik, Performance evaluation of an extended static dataflow architecture, Comput. Artif. Intell., 9 (1990), pp. 43-60.

[201] B.J. Smith, A pipelined, shared resource MIMD computer, in Proc. 1978 ICPP, Aug. 1978, pp. 6-8.

[202] B.J. Smith, Architecture and applications of the HEP multiprocessor computer system, SPIE Real-Time Signal Processing IV, 298 (1981), pp. 241-248.

[203] B.J. Smith, The architecture of HEP, in Parallel MIMD computation: HEP supercomputer and its applications, J.S. Kowalik, ed., MIT Press, 1985, pp. 41-55.

[204] D.F. Snelling, The design and analysis of a Stateless Data-Flow Architecture, Tech. Report UMCS-93-7-2, University of Manchester, Department of Computer Science, 1993.

[205] D.F. Snelling and G.K. Egan, A comparative study of data-flow architectures, Tech. Report UMCS-94-4-3, University of Manchester, Department of Computer Science, 1994.

[206] A. Sohn, C. Kim, and M. Sato, Multithreading with the EM-4 distributed-memory multiprocessor, in Proc. PACT'95, June 1995, pp. 27-36.

[207] S.P. Song, M. Denman, and J. Chang, The PowerPC 604 RISC microprocessor, IEEE Micro, 14 (Oct. 1994), pp. 8-17.

[208] V.P. Srini, An architectural comparison of dataflow systems, IEEE Computer, 19 (March 1986), pp. 68-88.

[209] J. Strohschneider, B. Klauer, and K. Waldschmidt, An associative communication network for fine and large grain dataflow, in Proc. Euromicro Workshop on Parallel and Distr. Processing, 1995, pp. 324-331.

[210] J. Strohschneider, B. Klauer, S. Zickenheimer, and K. Waldschmidt, Adarc: A fine grain dataflow architecture with associative communication network, in Proc. 20th Euromicro Conf., Sep. 1994, pp. 445-450.

[211] J.C. Syre, D. Comte, and N. Hifdi, Pipelining, parallelism and asynchronism in the LAU system, in Proc. 1977 ICPP, Aug. 1977, pp. 87-92.

[212] N. Takahashi and M. Amamiya, A data flow processor array system: Design and analysis, in Proc. 10th ISCA, June 1983, pp. 243-250.

[213] T. Temma, Dataflow processor for image processing, Intl. Journal of Mini and Microcomputers, 5 (1980), no. 3, pp. 52-56.

[214] T. Temma, M. Iwashita, K. Matsumoto, H. Kurokawa, and T. Nukiyama, Data flow processor chip for image processing, IEEE Trans. Elec. Dev., ED-32 (1985), pp. 1784-1791.

[215] M. Thistle and B.J. Smith, A processor architecture for Horizon, in Proc. Supercomputing '88, Nov. 1988, pp. 35-41.

[216] S.A. Thoreson, A.N. Long, and J.R. Kerns, Performance of three dataflow computers, in Proc. 14th Ann. Comput. Sci. Conf., Feb. 1986, pp. 93-99.

[217] K.R. Traub, G.M. Papadopoulos, M.J. Beckerle, J.E. Hicks, and J. Young, Overview of the Monsoon project, in Proc. 1991 Intl. Conf. Comput. Design, 1991, pp. 150-155.

[218] P.C. Treleaven, Principal components of a data flow computer, in Proc. 1978 Euromicro Symp., Oct. 1978, pp. 366-374.

[219] P.C. Treleaven, D.R. Brownbridge, and R.P. Hopkins, Data-driven and demand-driven computer architectures, Computing Surveys, 14 (1982), pp. 93-143.

[220] P.C. Treleaven, R.P. Hopkins, and P.W. Rautenbach, Combining data flow and control flow computing, Computer Journal, 25 (1982), pp. 207-217.

[221] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, and R.L. Stamm, Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor, in Proc. 23rd ISCA, May 1996, pp. 191-202.

[222] D.M. Tullsen, S.J. Eggers, and H.M. Levy, Simultaneous multithreading: Maximizing on-chip parallelism, in Proc. 22nd ISCA, June 1995, pp. 392-403.

[223] T. Ungerer and E. Zehendner, A multi-level parallelism architecture, Tech. Report 230, Univ. Augsburg, Institute for Mathematics, 1991.

[224] T. Ungerer and E. Zehendner, Threads and subinstruction level parallelism in a data flow architecture, Lect. Notes Comput. Sc., 634, Springer-Verlag, Berlin, 1992, pp. 731-736.

[225] R. Vedder, M. Campbell, and G. Tucker, The Hughes data flow multiprocessor, in Proc. 5th Intl. Conf. Distr. Comput. Syst., May 1985, pp. 2-9.

[226] R. Vedder and D. Finn, The Hughes data flow multiprocessor: Architecture for efficient signal and data processing, in Proc. 12th ISCA, June 1985, pp. 324-332.

[227] A.H. Veen and R. van den Born, The RC compiler for the DTN dataflow computer, J. Parall. Distr. Comput., 10 (1990), pp. 319-322.

[228] D.W. Wall, Limits of instruction-level parallelism, in Proc. 4th Intl. Conf. Arch. Support for Programming Languages and Operating Syst., April 1991, pp. 176-188.

[229] I. Watson and J.R. Gurd, A prototype data flow computer with token labelling, in Proc. National Comput. Conf., June 1979, pp. 623-628.

[230] I. Watson and J.R. Gurd, A practical data flow computer, IEEE Computer, 15 (Feb. 1982), pp. 51-57.

[231] K.C. Yeager, The MIPS R10000 superscalar microprocessor, IEEE Micro, 16 (April 1996), pp. 28-40.

[232] T. Yuba, T. Shimada, K. Hiraki, and H. Kashiwagi, SIGMA-1: A dataflow computer for scientific computations, Comput. Physics Comm., 37 (1985), pp. 141-148.

[233] A.C. Yuceturk, B. Klauer, S. Zickenheimer, R. Moore, and K. Waldschmidt, Mapping of neural networks onto dataflow graphs, in Proc. 22nd Euromicro Conf., Sep. 1996, pp. 51-57.

[234] E. Zehendner and T. Ungerer, The ASTOR architecture, in Proc. 7th Intl. Conf. Distr. Comput. Syst., Sep. 1987, pp. 424-430.

[235] E. Zehendner and T. Ungerer, A large-grain data flow architecture utilizing multiple levels of parallelism, in Proc. CompEuro '92, May 1992, pp. 23-28.