
Scalability of Scheduled Dataflow (SDF) with Register Contexts

Joseph M. Arul, Fu Jen Catholic University, Taiwan, and Krishna M. Kavi, University of North Texas, USA

Abstract

Our new architecture, known as the Scheduled DataFlow (SDF) system, deviates from the current trend of building complex hardware to exploit Instruction Level Parallelism (ILP) by exploring a simpler, yet powerful execution paradigm that is based on dataflow, multithreading and the decoupling of memory accesses from execution. A program is partitioned into non-blocking threads. In addition, all memory accesses are decoupled from the thread's execution. Data is pre-loaded into the thread's context (registers), and all results are post-stored after the completion of the thread's execution. Even though multithreading and decoupling are possible with control-flow architectures, the non-blocking and functional nature of the SDF system makes it easier to coordinate the memory accesses and execution of a thread. In this paper we show some recent improvements to the SDF implementation, whereby threads exchange data directly in register contexts, thus eliminating the need for creating thread frames. It is now possible to explore the scalability of our architecture's performance when more register contexts are included on the chip.

Keywords: Scheduled Dataflow Architecture, Superscalar, Superspeculative, Multithreaded architectures.

1. Introduction

In today's computer industry, the von Neumann or control-flow architecture, which dates back to 1946, still dominates the programming model. In order to overcome the performance limitations of this model, modern architectures rely on instruction-level parallelism [1], executing multiple instructions every cycle, often in an order other than that specified by the program. Alternatively, multiple independent instructions can be packed into a wide instruction word (VLIW) so that the component instructions can be executed simultaneously. In order to increase the number of instructions that can be issued (either statically or dynamically), speculative execution is often employed [2]. The single-threaded programming model limits the instruction-level parallelism that can be exploited in modern processors. Smith, in [3], promotes building general-purpose microarchitectures composed of small, simple, interconnected processors (or execution engines) running at very high clock frequencies. Smith advocates a shift from instruction-level parallelism to instruction-level distributed processing, with more emphasis on inter-instruction communication along with dynamic optimization and a tight interaction between hardware and low-level software [3].

Dataflow architecture [4, 5, 6] is an alternative to the von Neumann model. However, previous attempts to develop practical systems based on the dataflow model have failed for numerous reasons [15]. Hybrid models that combine the two alternatives have also been explored. Our SDF architecture can be viewed as a new hybrid approach. SDF also presents an alternative to instruction-level parallelism, albeit different from Smith's instruction-level distributed processing.

The memory hierarchies present yet another challenge in the design of high-performance processors, since multiple levels of the hierarchy may need to be traversed to obtain a data value. Decoupling of memory accesses from execution can alleviate the memory-CPU performance gap [7]. The main feature of this decoupling is to separate the operand accesses from execution. We believe that to fully benefit from decoupling we must employ multithreading with multiple register contexts, such that several threads are ready to execute.
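The pre-load/execute/post-store discipline can be made concrete with a short sketch. The following C fragment is our illustration only (not SDF assembly, and every name in it is invented for this example): all memory reads precede a compute phase that touches only local "registers", and all memory writes follow it.

    /* A minimal sketch (not SDF code) of the pre-load / execute /
     * post-store discipline: all memory reads happen before the compute
     * phase and all memory writes happen after it, so the compute phase
     * itself never waits on memory. */
    #include <stdio.h>

    static double frame[8];          /* stand-in for a thread's frame memory */

    static void run_thread(void) {
        /* pre-load: move operands from (frame) memory into "registers" */
        double a = frame[0];
        double b = frame[1];

        /* execute: pure computation on registers only, no memory access */
        double r = a * b + a;

        /* post-store: write results back to the consumer's frame */
        frame[2] = r;
    }

    int main(void) {
        frame[0] = 3.0; frame[1] = 4.0;   /* producer stores the inputs */
        run_thread();
        printf("result = %g\n", frame[2]);
        return 0;
    }

Under SDF the two memory phases run on a separate processor from the compute phase, so a stalled load never occupies the execution pipeline.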
Our SDF combines decoupling with non-blocking multithreading. This architecture differs from other multithreaded architectures by using non-blocking threads, and the instructions of a thread obey dataflow (or functional) properties. The deviation of the decoupled Scheduled DataFlow (SDF) system from "pure" dataflow is a deviation from the data-driven (or token-driven) execution that is traditionally used for its implementation. Section 2 will present the background and related research. In Section 3 we describe our new model of architecture and briefly summarize our threaded code generation. Section 4 describes the comparison of SDF with superscalar and VLIW architectures. Concluding remarks and future work are presented in the last section.

2. Background and Related Research

Even though dataflow architecture provided the natural elegance of eliminating anti- and output-dependencies, it performed poorly on sequential code. In an eight-stage pipeline machine such as Monsoon, an instruction of the same thread can only be issued to the dataflow pipeline after the completion of its predecessor instruction. Besides, the token matching (waiting-matching store) introduced more bubbles or stalls in the execution stage(s) of the dataflow machines. In order to overcome these drawbacks, researchers explored hybrids of dataflow/control-flow models along with multithreaded execution. In such models several tokens within a dataflow graph are grouped together as a thread to be executed sequentially under its own private program counter control, while the activation and synchronization of threads are data-driven. Such hybrid architectures [8, 9, 11, 12] deviate from the original model, in that the instructions fetch data from memory or registers instead of having instructions deposit operands (tokens) in the "operand receivers" of successor instructions. The architecture presented in this paper, SDF, is one such hybrid architecture, designed to overcome the limitations of the pure dataflow model as well as those described in Section 1 (that pertain to control-flow models). Its features include the decoupling of memory accesses from the execution pipeline, non-blocking multithreading, the dataflow program paradigm, and the scheduling of instructions in a control-flow-like manner.

Simultaneous Multithreading (SMT) [10] allows multiple independent threads to issue instructions each cycle to a superscalar processor's functional units. SMT attempts to combine thread-level parallelism and instruction-level parallelism when running multiple applications on the same processor. Thus, all the available threads compete for and share all of the superscalar's resources every cycle. Each application may exhibit different amounts of parallelism. The choice of implementing Instruction Level Parallelism (ILP) or Thread Level Parallelism (TLP) depends on the application and data. When per-thread ILP is limited, TLP can achieve more parallelism. By combining ILP and TLP, SMT claims to achieve greater throughput and significant program speedups. SMT research also studies how TLP stresses other hardware structures such as the memory system, branch prediction and cache misses. SMT has the advantage of flexible usage of TLP and ILP, fast synchronization, and a shared L1 cache over the functional units. Although SDF does not rely on ILP, we contend that the decoupling and non-blocking models lead to fine-grained threads, yielding effectively similar performance gains as when ILP within coarse-grained threads is exploited.

3. Scheduled Dataflow (SDF) Execution Model and Code Generation

In this section we briefly describe the Scheduled Dataflow architecture model and its code generation. For a conventional multiprocessor system, the program is explicitly partitioned into processes based on programming constructs such as loops, procedures, etc. In order to run a program on SDF, the program must explicitly be partitioned into threads of a granularity sufficient to balance TLP with the overheads of creating threads. A compiler must generate a number of code blocks, where each code block consists of several instructions from one or more threads that can be executed to completion on the Synchronization Processor (SP) or the Execution Processor (EP) without interruption. The Synchronization Count (SC) is the number of inputs needed by a thread before it can be scheduled for execution. A frame is allocated when a thread is created. All data needed for the thread, including the SC, is stored in the frame memory. When a thread receives the needed inputs, the pre-load code of the thread moves the data from the frame memory into the register context allocated to the thread. The post-store code of a thread stores data from the thread's registers into the frame memories of awaiting threads. The SP is responsible for the pre-load and post-store portions of the code. Once a thread's registers are loaded with input data, it executes on the EP without blocking or accessing memory.
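As a rough sketch of the Synchronization Count protocol just described (our C approximation, not the hardware mechanism; the Thread type and post_store helper are invented for illustration): each store into a waiting thread's frame decrements its SC, and the thread becomes eligible for scheduling when the SC reaches zero.

    /* A minimal sketch (assumptions ours) of the SC protocol: each store
     * into a waiting thread's frame decrements its synchronization count,
     * and the thread is enabled when the count reaches zero. */
    #include <stdio.h>

    typedef struct {
        int    sc;        /* inputs still outstanding */
        double frame[8];  /* frame memory for this thread */
    } Thread;

    /* a producer's "post-store": deposit one input, decrement the SC */
    static void post_store(Thread *t, int slot, double value) {
        t->frame[slot] = value;
        if (--t->sc == 0)
            printf("thread enabled: all inputs present\n");
    }

    int main(void) {
        Thread loop = { .sc = 2 };     /* the loop thread needs two inputs */
        post_store(&loop, 0, 0.0);     /* store i */
        post_store(&loop, 1, 50.0);    /* store N: SC hits 0, thread enabled */
        return 0;
    }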
In this paper we present a new optimization to the above sequence of activities. When a thread is created, we will first check to see if the thread can be allocated a register context directly (instead of allocating a frame memory). It will then be possible for the thread to receive inputs directly in its registers (instead of first receiving them in its frame memory and then moving the data into registers during the pre-load portion of the code). However, if no register context is available on thread creation, we will allocate a frame as in the original SDF implementation; in this event, the thread receives its inputs in its frame memory. Our goal is to explore how the performance of SDF changes with the number of register contexts. This provides multiple scalability options within the SDF architecture: the architecture can be scaled using multiple SPs and EPs, as well as with multiple register sets.

To understand the difference between the use of frame memory (using FALLOC on thread creation) and registers (using RALLOC on thread creation), consider the following code examples. In Figure 1, each of the threads consists of pre-load, execute and post-store sections. Thread one (main) stores the data for thread two (loop) in the frame allocated when FALLOC RR4, R10 is executed. The pointer to the frame memory is returned in R10; R4 contains the address of the first instruction of the loop thread, while R5 contains the synchronization count for loop (note that the notation RRn refers to an even-odd pair of registers n and n+1). The STORE instructions of the main thread reflect the passing of data to the loop thread. Since SDF is a non-blocking multithreaded architecture, it is necessary for the main thread to create an additional thread (see FALLOC RR4, R11) that waits for the results from the loop thread.

    //Thread One
    code main
            LOAD    RFP|2, R2
            PUTR1   midmain
            FORKEP  R1
            STOP
    midmain:
            PUTR1   loop
            MOVE    R1, R4
            PUTR1   5
            MOVE    R1, R5
            FALLOC  RR4, R10        ; loop thread allocated with frame memory
            PUTR1   result
            MOVE    R1, R4
            PUTR1   1
            MOVE    R1, R5
            FALLOC  RR4, R11        ; thread to receive the result
            PUTR1   finmain
            FORKSP  R1
            STOP
    finmain:
            STORE   R11, R10|2      ; store the return pointer
            PUTR1   1
            STORE   R1, R10|3       ; store the SC for the return
            STORE   R0, R10|4       ; store i value
            STORE   R2, R10|5       ; store N value
            STORE   R0, R10|6       ; store the result as 0
            STOP

    // Thread two (loop)
    code loop
            LOAD    RFP|2, R2       ; Load return ptr
            LOAD    RFP|3, R3       ; Load return SC
            LOAD    RFP|4, R4       ; Load i value
            LOAD    RFP|5, R5       ; Load N value
            LOAD    RFP|6, R6       ; Load result = 0
            PUTR1   midloop
            FORKEP  R1
            STOP
    midloop: [if i < N, then call another loop thread and increase the i value]
            PUTR1   finloop
            FORKSP  R1
            STOP
    finloop: [depending on the if condition, return the result to the calling thread]

    //Thread three
    code result
            [The result is read and displayed]

Figure 1. Usage of FALLOC and STORE instructions.

In the second code example, shown in Figure 2, we try to allocate a register context when a thread is created, using RALLOC. Thus the instruction RALLOC RR4, R10 in Figure 2 allocates a register set for the loop thread (the register set number is returned in R10). Now the main thread stores data for the loop thread directly in the allocated registers, as shown by the RSTORE instructions in Figure 2. Here we have eliminated the pre-load code of the loop thread. In this example, since we are creating only 2 threads (and at any given time there are at most 3 register sets needed), we will assume RALLOC will be successful in allocating a register set. However, in realistic programs, we need to fall back to creating threads using FALLOC when RALLOC fails to allocate register sets. This will increase the code size; however, it permits the exploration of the scalability of SDF as more register contexts are allocated.

    //Thread One
    code main
            LOAD    RFP|2, R2
            PUTR1   midmain
            FORKEP  R1
            STOP
    midmain:
            PUTR1   midloop
            MOVE    R1, R4
            PUTR1   5
            MOVE    R1, R5
            RALLOC  RR4, R10        ; loop thread allocated with a register set
            PUTR1   midresult
            MOVE    R1, R4
            PUTR1   1
            MOVE    R1, R5
            RALLOC  RR4, R11        ; thread to receive the result
            PUTR1   finmain
            FORKSP  R1
            STOP
    finmain:
            RSTORE  R11, R10|2      ; store the return ptr
            PUTR1   1
            RSTORE  R1, R10|3       ; store the SC for the return
            RSTORE  R0, R10|4       ; store i value
            RSTORE  R2, R10|5       ; store N value
            RSTORE  R0, R10|6       ; store the result as 0
            FFREE
            STOP

    //Thread two (loop)
            [The pre-load portion of the code is eliminated]
    midloop: [if i < N, then call another loop thread and increase the i value]

Figure 2. Usage of RALLOC and RSTORE instructions.
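The allocation policy just described (try RALLOC first, fall back to FALLOC) can be sketched as follows. This C fragment is our own hypothetical model, not the SDF hardware interface; NUM_REG_SETS and the function names are invented for the example.

    /* Sketch (ours) of the allocation policy: try to give a new thread a
     * register context (RALLOC); if none is free, fall back to a frame in
     * memory (FALLOC), as in the original SDF implementation. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_REG_SETS 12

    static bool reg_set_busy[NUM_REG_SETS];

    /* returns a register-set number, or -1 to signal "use a frame instead" */
    static int ralloc(void) {
        for (int i = 0; i < NUM_REG_SETS; i++)
            if (!reg_set_busy[i]) { reg_set_busy[i] = true; return i; }
        return -1;
    }

    static void create_thread(const char *name) {
        int rs = ralloc();
        if (rs >= 0)
            printf("%s: register set %d (inputs go directly to registers)\n",
                   name, rs);
        else
            printf("%s: no register set free, allocating a frame "
                   "(pre-load needed)\n", name);
    }

    int main(void) {
        for (int t = 0; t < 14; t++) {
            char name[16];
            snprintf(name, sizeof name, "thread%d", t);
            create_thread(name);
        }
        return 0;
    }

With NUM_REG_SETS set to 12, the first 12 creations get register sets and the remaining ones fall back to frames, mirroring the fallback behavior discussed in Section 4.2.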


3.1 Execution Pipeline

As can be seen from Figure 3, the execution pipeline consists of four pipeline stages: instruction fetch, decode, execute and write-back. The instruction fetch unit behaves like a traditional fetch unit, relying on a program counter to fetch the next instruction.

Figure 3. General Organization of the Execution Pipeline. [Figure omitted: pre-loaded threads supply a PC to the four-stage pipeline, which reads and writes a register context drawn from the register sets.]

The decode and register-fetch unit obtains the pair of registers that contains the two source operands for the instruction. The execute unit executes the instruction and sends the results to the write-back unit along with the destination register numbers. The write-back unit writes (up to) two values to the register file. In the SDF architecture, a pair of registers is viewed as the source operands for an instruction; data is stored in either the left or right half of a register pair by a previous instruction. As can be seen, the execution pipeline behaves more like a conventional pipeline (e.g., MIPS) while retaining the primary dataflow properties (single assignment, and the flow of data from instruction to instruction). This eliminates the need for complex hardware for detecting write-after-read (WAR) and write-after-write (WAW) dependencies and for register renaming, as well as unnecessary thread context switches on cache misses.

3.2 Synchronization Pipeline

As can be seen from Figure 4 below, the synchronization pipeline consists of six stages: instruction fetch, decode, effective address, memory access, execute and write-back. As mentioned earlier, the synchronization pipeline handles pre-load and post-store instructions. The instruction fetch unit retrieves an instruction belonging to the current thread using the Program Counter (PC). The decode unit decodes the instruction and fetches register operands (using a register set). The effective address unit computes addresses for LOAD and STORE instructions. LOAD and STORE instructions only reference the frame memories of threads, using a Frame Pointer (FP) and an offset into the frame, both of which are contained in registers. The memory access unit completes LOAD and STORE instructions. Pursuant to a post-store, the synchronization counts of the awaiting threads are updated, and a thread whose count reaches zero becomes enabled for execution.

Figure 4. General Organization of the Synchronization Pipeline. [Figure omitted: enabled threads supply a PC to a six-stage pipeline (instruction fetch unit, decode unit, effective address unit, memory access unit, execute unit, write-back unit) backed by instruction and data caches; post-store threads return their register contexts to the register sets.]

4. Comparisons of SDF with Superscalar Architecture

Our previous papers (e.g., [14, 15]) presented some comparisons of SDF with a MIPS-like DLX, superscalar and VLIW architectures. In this paper, we present some additional data along these lines, but also emphasize the scalability of SDF as register sets are added. Our contention is that SDF offers a viable alternative to superscalar and VLIW models. The SDF architecture requires less complex hardware than superscalars (since SDF performs no dynamic instruction scheduling), and requires no extraordinary techniques to utilize all instruction slots as needed in VLIW systems. And we will show that SDF scales better than superscalars and VLIW as more functional units and more register sets are added. In our experiments we set the number of functional units to be the same in the SDF, superscalar and VLIW systems. For the superscalar, both in-order and out-of-order instruction issue are utilized.

Table 1. SDF versus superscalar for the matrix multiply program.

  Data Size   SDF (Cycles)   SS-IO (Cycles)   SS-OO (Cycles)   SDF/SS-IO Speedup   SDF/SS-OO Speedup
  50*50          1,720,885        1,968,235        1,174,250          1.1438              0.6824
  100*100       13,318,705       15,434,835        9,170,850          1.1589              0.6886
  150*150       44,453,530       51,811,474       30,747,479          1.1656              0.6917

For the superscalar system, both the instruction fetch and decode width and the instruction issue width are set to eight. The Register Update Unit (RUU) and Load/Store Queue (LSQ) are set to 32. The SDF system does not perform any dynamic instruction scheduling, eliminating complex hardware (e.g., scoreboards or reservation stations). At present, the SDF system uses no branch prediction. It is also important to note that the Simplescalar tool set (http://www.cs.wisc.edu/~mscalar/simplescalar.html) performs extensive optimizations and dynamic instruction scheduling. Several benchmarks were used to collect data and compare the performance of the SDF and superscalar architectures. The matrix multiplication program will be used to explain in detail the performance of this architecture and its characteristics. The data shown in Table 1 is obtained when 10 concurrent threads are spawned in SDF.

As seen in the table, the SDF system outperforms the in-order superscalar system (SS-IO) for all data sizes. Since in this experiment we used only 2 functional units (1 SP and 1 EP), SDF cannot cope with the thread-level parallelism available in the program. However, the compiler for the superscalar architecture was able to exploit the ILP available in the application to achieve higher out-of-order instruction issue rates (columns labeled SS-OO in Tables 1-3). In the next experiment we will show that as more functional units are added, and if the application exhibits thread-level parallelism, SDF outperforms superscalar systems.
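As a check on how the speedup columns are computed: they are the superscalar cycle counts divided by the SDF cycle counts, so values above 1 favor SDF. For the 50*50 case, 1,968,235 / 1,720,885 is approximately 1.14 against SS-IO, while 1,174,250 / 1,720,885 is approximately 0.68 against SS-OO.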

Table 2. SDF versus superscalar for matrix multiplication of different data sizes (all values in cycles).

                    Superscalar     SDF          Superscalar     SDF          Superscalar     SDF          Superscalar     SDF
  Data size         2INT/1FP ALU    2SP/1EP      2INT/2FP ALU    2SP/2EP      3INT/2FP ALU    3SP/2EP      3INT/3FP ALU    3SP/3EP
  50*50      IO       1,890,104      1,504,297     1,890,104        860,782     1,867,200        756,707     1,867,200        574,242
             OO         712,396                      712,396                      706,877                      706,877
  100*100    IO      14,824,104     11,843,442    14,824,104      6,660,012    14,633,700      5,941,602    14,633,700      4,402,772
             OO       5,532,202                    5,532,202                    5,511,587                    5,511,587
  150*150    IO      49,763,150     39,762,487    49,763,150     22,227,742    49,110,246     19,924,912    49,110,246     14,819,482
             OO      18,514,510                   18,514,510                   18,468,811                   18,468,409
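Reading Table 2 directly: with 3 SPs and 3 EPs, SDF completes the 50*50 multiply in 574,242 cycles versus 706,877 cycles for the out-of-order superscalar with 3 INT and 3 FP ALUs (706,877 / 574,242 is approximately 1.23), and the same crossover holds at 100*100 (5,511,587 / 4,402,772 is approximately 1.25). With fewer functional units, the out-of-order superscalar is still ahead.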

Tables 2 and 3 show the comparison of SDF with the superscalar system as more functional units are added. The superscalar does not scale well with an increased number of functional units; its scalability is limited by the instruction fetch/decode window size and the RUU size. It should be remembered that the instruction issue hardware is the most complex (and power hungry) component of modern architectures [18]. The SDF system relies primarily on thread-level parallelism and the decoupling of memory accesses from execution.

As more SPs and EPs are added (correspondingly, more integer and floating-point functional units in the superscalar), the SDF system outperforms the superscalar architecture, even when compared to the complex out-of-order scheduling used by the superscalar system (SS-OO). The SDF system's performance overtakes that of the out-of-order superscalar architecture with 3 SPs and 3 EPs (corresponding to 3 INT and 3 FP ALUs in the superscalar system). In fact, adding more functional units in the superscalar system requires more complex hardware for finding instructions to schedule dynamically, and requires larger instruction issue/decode widths as well as more renaming registers. In the SDF system, adding more SPs and more EPs does not complicate the system, since there is no dynamic instruction scheduling involved. Table 3 clearly shows the scalability of SDF as compared to the superscalar system (where no performance gains can be seen as more functional units are added).

Table 3. SDF versus superscalar for matrix multiplication with more than 6 functional units (all values in cycles).

                    Superscalar     SDF          Superscalar     SDF          Superscalar     SDF          Superscalar     SDF
  Data size         4INT/3FP ALU    4SP/3EP      4INT/4FP ALU    4SP/4EP      5INT/4FP ALU    5SP/4EP      5INT/5FP ALU    5SP/5EP
  50*50      IO       1,867,200        507,197     1,867,200        430,957     1,867,200        381,247     1,867,200        345,027
             OO         680,321                      680,321                      680,321                      680,321
  100*100    IO      14,633,700      3,970,682    14,633,700      3,330,992    14,633,700      2,982,702    14,633,700      2,665,472
             OO       5,306,381                    5,306,381                    5,306,380                    5,306,380
  150*150    IO      49,110,246     13,308,457    49,110,246     11,115,592    49,110,246      9,990,607    49,110,246      8,894,002
             OO      17,782,453                   17,782,453                   17,782,453                   17,782,453

4.1. Comparison of SDF with VLIW

This section presents the performance of the SDF system as compared with VLIW architectures, as facilitated by the Texas Instruments TMS320C6000 VLIW processor simulator tool-set (TMS320C6000 DSP Code Composer Studio by Texas Instruments, http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.html), which includes an optimizing compiler and profiling tool, and by the Trimaran infrastructure. For these two systems (VLIW and SDF), instruction execution and memory access cycles are set to match. The SDF system utilizes 8 functional units (4 SPs and 4 EPs) as compared to the 8-wide VLIW architecture. The SDF system is also compared with the Trimaran simulator using default configurations and optimizations (a total of 9 functional units, a maximum loop unrolling of 32, and several other complex optimizations). Table 4 presents the data for matrix multiplication. The TMS 'C6000 does not perform well because its optimized version relies on the unrolling of only 5 iterations (unlike Trimaran, which uses 32 iterations). The SDF system achieves better performance than the TMS 'C6000 because it relies on thread-level parallelism. The SDF system used 10 active threads for the data shown in Table 4.

Table 4. Comparing SDF with VLIW (Matrix Multiplication).


  Data Size   SDF (cycles)   Trimaran (cycles)   TMS 'C6000 (cycles)   SDF/Trimaran   SDF/TMS 'C6000
  50*50           430,957           331,910             1,033,698      1.29841523       0.416908033
  100*100       3,330,992         2,323,760            16,199,926      1.43344924       0.205617729
  150*150      11,115,592         4,959,204            86,942,144      2.24140648       0.127850447
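Note that the last two columns are cycle ratios rather than speedups: SDF/Trimaran = 430,957 / 331,910 is approximately 1.298 for the 50*50 case, meaning SDF takes about 30% more cycles than the aggressively unrolled Trimaran code, while SDF/TMS 'C6000 = 430,957 / 1,033,698 is approximately 0.417, meaning SDF needs fewer than half the cycles of the 'C6000.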

4.2. Scalability of SDF with Register Contexts

As indicated in Section 3 (Figures 1 and 2), the original SDF (the data shown thus far in Tables 1-4) used frame memories for storing the data of threads. This requires "pre-loading" of data from the frame memory into the register context of a thread (after the thread's synchronization requirements are satisfied and a register context is allocated to the enabled thread). Instead, we could allocate register contexts for threads on creation, so that data can be directly stored in the registers of the thread, thus eliminating the memory accesses for storing and loading data. This is possible if sufficient register contexts are available to support the thread-level parallelism exhibited by the application. When sufficient register sets are not available, we fall back to the use of frame memories. This causes some overhead in first trying to allocate register sets and then allocating frames; the code size will also increase.

The data in Table 5 shows the scalability of SDF with more register sets. The data needs very careful examination. We varied the data size, the number of functional units, the TLP (by spawning different numbers of threads) and the number of register sets.

Table 5. Limited register set usage and the thread allocation of matrix multiplication (cycles, with the number of register sets used in parentheses).

  N=50       1 Thread                   2 Threads                   5 Threads
  1SP 1EP    4,985,034 (3 Reg. Sets)    2,381,565 (12 Reg. Sets)    2,330,723 (70 Reg. Sets)
  2SP 2EP    4,985,911 ( " )            1,246,765 ( " )             1,166,734 ( " )
  3SP 3EP    4,985,011 ( " )              916,006 ( " )               778,910 ( " )
  4SP 4EP    4,985,011 ( " )              810,961 ( " )               585,357 ( " )

  N=100      1 Thread                    2 Threads                    5 Threads
  1SP 1EP    38,929,984 (3 Reg. Sets)    18,581,732 (12 Reg. Sets)    18,369,157 (70 Reg. Sets)
  2SP 2EP    38,929,961 ( " )             9,582,703 ( " )              9,186,942 ( " )
  3SP 3EP    38,929,961 ( " )             7,069,808 ( " )              6,126,821 ( " )
  4SP 4EP    38,929,961 ( " )             6,164,461 ( " )              4,597,262 ( " )

  N=150      1 Thread                     2 Threads                    5 Threads
  1SP 1EP    130,334,934 (3 Reg. Sets)    62,194,407 (12 Reg. Sets)    61,689,117 (64 Reg. Sets)
  2SP 2EP                                 31,862,481 ( " )             30,847,669 ( " )
  3SP 3EP                                 23,473,169 ( " )             20,568,612 ( " )
  4SP 4EP                                 20,445,811 ( " )             15,430,183 ( " )

When only one thread is spawned (second column), no performance improvements can be observed by increasing the number of functional units (i.e., SPs and EPs). Also, no improvements in performance can be gained by adding register sets, since the application required only 3 register sets. When we increased the TLP by spawning 2 threads for each of the 3 loops of matrix multiplication (third column in Table 5), performance improvements can be observed with additional functional units. Additional register sets (beyond 12) will not increase the performance, since at this TLP only 12 register sets are needed. This column of data also shows the scalability of SDF with increased TLP and an increased number of functional units. The last column further demonstrates the scalability of SDF. Here we spawned 5 threads at each loop. This level of TLP requires a maximum of 127 register sets; the data in the fourth column of Table 5 is for 64 register sets. It should be noted that the increased number of register sets (from 3 in the second column, to 12 in the third column, to 64 in the fourth column) shows performance gains only with increased TLP and an increased number of functional units. We plan to further investigate the performance impacts of increasing TLP, functional units and register sets.
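For reference, the kind of thread-level decomposition being varied here can be sketched in conventional C with POSIX threads (our illustration only; the SDF compiler generates non-blocking SDF threads, not pthreads): the rows of the result matrix are divided among T worker threads, so raising T raises the TLP exactly as in the columns of Table 5.

    /* Sketch (ours, not the SDF code generator's output) of partitioning
     * the matrix-multiply iteration space among T worker threads: each
     * worker owns a contiguous band of rows, so no output is shared. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 50
    #define T 5                       /* threads spawned for the loop nest */

    static double A[N][N], B[N][N], C[N][N];

    typedef struct { int lo, hi; } Band;

    static void *worker(void *arg) {
        Band *b = arg;
        for (int i = b->lo; i < b->hi; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

        pthread_t tid[T];
        Band band[T];
        for (int t = 0; t < T; t++) {
            band[t].lo =  t      * N / T;    /* first row of this band */
            band[t].hi = (t + 1) * N / T;    /* one past the last row  */
            pthread_create(&tid[t], NULL, worker, &band[t]);
        }
        for (int t = 0; t < T; t++)
            pthread_join(tid[t], NULL);

        printf("C[0][0] = %g (expect %g)\n", C[0][0], 2.0 * N);
        return 0;
    }

Compile with -lpthread; with T = 5 this mirrors the five-threads-per-loop column of Table 5 in spirit, though SDF threads are non-blocking and far finer-grained.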
5. Conclusion and Future Work

The data comparing the SDF system with multiple configurations of superscalar and VLIW processors shows that it is possible to build a simple (no out-of-order scheduling), cost-effective machine when compared to the complex hardware needed in modern superscalar and VLIW processors. Simple hardware, together with multithreading, decoupling and the dataflow paradigm, can achieve this reduction. The data shows that it is possible to build an architecture with non-blocking threads and the decoupling of execution from memory accesses. Since the execution unit uses only registers, it proceeds with no bubbles or stalls in the pipeline. The hardware used in SDF can be much simpler: the SDF system uses no dynamic instruction scheduling, so the hardware for dynamic instruction scheduling, such as scoreboards and reservation stations, is not required. These hardware savings can be used to allocate more register sets in order to take advantage of the scalability of SDF as demonstrated in this paper. The research also demonstrates the scalability of SDF as more functional units are added to meet the TLP available in the application. There are only 4 pipeline stages in the EP and 6 in the SP. The shorter pipelines can also be beneficial in the case of branch mispredictions. These pipelines can also be subdivided to increase the clock speed, as is done in modern Pentium processors. There are several projects for implementing multiple processors on the same chip (CMP) as an effective use of the extra chip area, to take advantage of inter- and intra-thread instruction-level parallelism. In the SDF system, the register sets are allocated to threads, and the threads are scheduled on different units at different times; such scheduling will be complex in a conventional on-chip multiprocessor. Further experiments can be performed with multiple processing elements, each with multiple SPs and EPs. Since each thread carries its own state (i.e., its continuation), threads from multiple applications (or user processes) can be mixed freely on SDF. This would be very complex to achieve in conventional multiprocessor systems, CMPs and SMTs.

Currently a compiler for this architecture is under development. In the future, several benchmarks such as SPEC-2000 can be used to show the improvements in the cycle count of the overall architecture and its performance.

References

[1] Wall, D. W., "Limits on instruction-level parallelism," Proc. of the 4th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-4), April 1991, pp. 176-188.
[2] Sato, T., "Quantitative evaluation of pipelining and decoupling a dynamic instruction scheduling mechanism," Journal of Systems Architecture, Vol. 46, No. 13, Nov. 2000, pp. 1231-1252.
[3] Smith, J. E., "Instruction-level distributed processing," IEEE Computer, Vol. 34, No. 4, April 2001, pp. 59-65.
[4] Kavi, K. M., and Shirazi, B., "Dataflow architecture: Are dataflow computers commercially viable?," IEEE Potentials, Oct. 1992, pp. 27-30.
[5] Papadopoulos, G. M., and Culler, D. E., "Monsoon: An explicit token-store architecture," Proc. of the 17th Intl. Symposium on Computer Architecture (ISCA-17), May 1990, pp. 82-91.
[6] Papadopoulos, G. M., "Implementation of a general purpose dataflow multiprocessor," Tech. Report TR-432, Laboratory for Computer Science, MIT, Cambridge, MA, Aug. 1988.
[7] Smith, J. E., "Decoupled access/execute computer architectures," Proc. of the 9th Annual Symposium on Computer Architecture, May 1982, pp. 112-119.
[8] Arvind and Nikhil, R. S., "Executing a program on the MIT Tagged-Token Dataflow Architecture," IEEE Transactions on Computers, Vol. 39, No. 3, 1990, pp. 300-318.
[9] Gao, G. R., "An efficient hybrid dataflow architecture model," Journal of Parallel and Distributed Computing, Vol. 19, 1993, pp. 279-307.
[10] Tullsen, D. M., Eggers, S. J., Levy, H. M., and Lo, J. L., "Simultaneous multithreading: Maximizing on-chip parallelism," Proc. of the 22nd Annual Intl. Symposium on Computer Architecture (ISCA-22), June 1995, pp. 392-403.
[11] Culler, D. E., Goldstein, S. C., Schauser, K. E., and von Eicken, T., "TAM - A compiler controlled Threaded Abstract Machine," Journal of Parallel and Distributed Computing, Vol. 18, 1993, pp. 347-370.
[12] Ang, B. S., Arvind, and Chiou, D., "StarT the next generation: Integrating global caches and dataflow architecture," Technical Report 354, Laboratory for Computer Science, MIT, Cambridge, MA, 1995.
[13] Grünewald, W., and Ungerer, T., "Towards extremely fast context switching in a block-multithreaded processor," Proc. of the 22nd Euromicro Conf., Sept. 1996, pp. 592-599.
[14] Kavi, K. M., Arul, J., and Giorgi, R., "Execution and cache performance of the scheduled dataflow architecture," Journal of Universal Computer Science, Special Issue on Multithreaded Processors and Chip Multiprocessors, Vol. 6, No. 10, Oct. 2000, pp. 948-968.
[15] Kavi, K. M., Kim, H. S., Arul, J., and Hurson, A. R., "A decoupled scheduled dataflow multithreaded architecture," Proc. of the International Symposium on Parallel Architectures, Algorithms and Networks (I-SPAN '99), June 1999, pp. 138-143.
[16] Lo, J. L., Eggers, S. J., Emer, J. S., Levy, H. M., Stamm, R. L., and Tullsen, D. M., "Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading," ACM Trans. on Computer Systems, Aug. 1997, pp. 332-354.
[17] Mitchell, N., Carter, L., Ferrante, J., and Tullsen, D., "ILP vs TLP on SMT," Proc. of Supercomputing, Nov. 1999.
[18] Watson, I., and Gurd, J. R., "A prototype dataflow computer with token labelling," Proc. of the National Computer Conference, AFIPS Proceedings, Vol. 48, 1979, pp. 623-628.

Acknowledgement. Some of the original SDF work was supported in part by a grant from the US National Science Foundation, #CCR-9796310. The authors wish to acknowledge the contributions of Roberto Giorgi and Hyong-Shik Kim.