Arxiv:1801.05178V1 [Cs.AR] 16 Jan 2018 Single-Instruction Multiple Threads (SIMT) Model

Inter-thread Communication in Multithreaded, Reconfigurable Coarse-grain Arrays Dani Voitsechov1 and Yoav Etsion1;2 Electrical Engineering1 Computer Science2 Technion - Israel Institute of Technology {dani,yetsion}@tce.technion.ac.il ABSTRACT the order of scheduling of the threads within a CTA is un- Traditional von Neumann GPGPUs only allow threads to known, a synchronization barrier must be invoked before communicate through memory on a group-to-group basis. consumer threads can read the values written to the shared In this model, a group of producer threads writes interme- memory by their respective producer threads. diate values to memory, which are read by a group of con- Seeking an alternative to the von Neumann GPGPU model, sumer threads after a barrier synchronization. To alleviate both the research community and industry began exploring the memory bandwidth imposed by this method of commu- dataflow-based systolic and coarse-grained reconfigurable ar- nication, GPGPUs provide a small scratchpad memory that chitectures (CGRA) [1–5]. As part of this push, Voitsechov prevents intermediate values from overloading DRAM band- and Etsion introduced the massively multithreaded CGRA width. (MT-CGRA) architecture [6,7], which maps the compute In this paper we introduce direct inter-thread communica- graph of CUDA kernels to a CGRA and uses the dynamic tions for massively multithreaded CGRAs, where intermedi- dataflow execution model to run multiple CUDA threads. ate values are communicated directly through the compute The MT-CGRA architecture leverages the direct connectiv- fabric on a point-to-point basis. This method avoids the ity between functional units, provided by the CGRA fab- need to write values to memory, eliminates the need for a ric, to eliminate multiple von Neumann bottlenecks includ- dedicated scratchpad, and avoids workgroup-global barriers. ing the register file and instruction control. This model thus The paper introduces the programming model (CUDA) and lifts restrictions imposed by register file bandwidth and can execution model extensions, as well as the hardware prim- utilize all functional units in the grid concurrently. MT- itives that facilitate the communication. Our simulations CGRA has been shown to dramatically outperform von Neu- of Rodinia benchmarks running on the new system show mann GPGPUs while consuming substantially less power. that direct inter-thread communication provides an average Still, the original MT-CGRA model employed shared mem- speedup of 4.5× (13.5× max) and reduces system power by ory and synchronization barriers for inter-thread communi- an average of 7× (33× max), when compared to an equiva- cation, and incurred their power and performance overheads. lent Nvidia GPGPU. In this paper we present dMT-CGRA, an extension to MT- CGRA that supports direct inter-thread communication through the CGRA fabric. By extending the programming model, the 1. INTRODUCTION execution model, and the underlying hardware, the new architecture forgoes the shared memory/scratchpad and global Conventional von Neumann GPGPUs employ the data-parallel synchronization operations. arXiv:1801.05178v1 [cs.AR] 16 Jan 2018 single-instruction multiple threads (SIMT) model. But pure The dMT-CGRA architecture relies on the following compo- data parallelism can only go so far, and the majority of data nents: We extend the CUDA programming model with two parallel workloads require some form of thread collabora- primitives that enable programmers to express direct inter- tion through inter-thread communication. Common GPGPU thread dependencies. The primitives let programmers state programming models such as CUDA and OpenCL control that thread N requires a value generated by thread N − k, for the massive parallelism available in the workload by group- any arbitrary thread index N and a scalar k. ing threads into cooperative thread arrays (CTAs; or work- The direct dependencies expressed by the programmer are groups). Threads in a CTA share a coherent memory that is mapped by the compiler to temporal links in the kernel’s used for inter-thread communication. dataflow graph. The temporal links express dependencies This model has two major limitations. The first limitation between concurrently-executing instances of the dataflow graph, is that communication is mediated by a shared memory re- each representing a different thread. gion. As a result, the shared memory region, typically im- Finally, we introduce two new functional units to the CGRA plemented using a hardware scratchpad, must support high that redirect dataflow tokens between graph instances (threads) communication bandwidth and is therefore energy costly. such that the dataflow firing rule is preserved. The second limitation is the synchronization model. Since 1 The remainder of this paper is organized as follows. Sec- thread_code(thread_t tid) { tion2 describes the motivation for direct inter-thread com- // common: not next to margin munication on an MT-CGRA and explains the rationale for i f (!is_margin(tid - 1) && the proposed design. Section3 then presents the dMT-CGRA !is_margin(tid + 1) ) { execution model and the proposed programming model ex- result[tid] = globalImage[tid-1] * kernel[0] + globalImage[tid] * kernel[1] tensions, and Section4 presents the dMT-CGRA architec- + globalImage[tid+1] * kernel[2]; ture. We present our evaluation in Section5 and discuss } related work in Section6. Finally, we conclude with Sec- // corner: next to left margin e l s e i f (is_margin(tid - 1)) { tion7. result[tid] = globalImage[tid-1] * kernel[0] + globalImage[tid] * kernel[1]; } 2. INTER-THREAD COMMUNICATION IN // corner: next to right margin e l s e i f (is_margin(tid - 1)) { A MULTITHREADED CGRA result[tid] = globalImage[tid] * kernel[1]; Modern massively multithreaded processors, namely GPG- + globalImage[tid+1] * kernel[2]; } PUs, employ many von-Neumann processing units to deliver } massive concurrency, and use shared memory (a scratchpad) as the primary mean for inter-thread communication. This (a) Spatial convolution using only global memory design imposes two major limitations: thread_code() { 1. The frequency of inter-thread communication barrages // map the thread to1D space(CUDA − s t y l e) shared memory with intermediate results, resulting in tid = threadIdx.x; high bandwidth requirements from the dedicated scratch- // load image into shared memory pad and dramatically increases its power consumption. sharedImage[tid] = globalImage[tid]; 2. The asynchronous nature of memory decouples com- // pad the margins with zeros munication from synchronization and forces program- i f (is_margin(tid)) mers to use explicit synchronization primitives such as pad_margin(sharedImage, tid); barriers, which impede concurrency by forcing threads // block until all threads finish the load phase to wait until all others have reached the synchroniza- barrier (): //e.g. CUDA syncthreads tion point. // execute the convolution;(kernel The dataflow computing model, on the other hand, offers //(preloaded in shared memory) more flexible inter-thread communication primitives. The result[tid] = dataflow model couples direct communication of interme- sharedImage[tid-1] * kernel[0] + sharedImage[tid ] * kernel[1] diate values between functional units with the dataflow fir- + sharedImage[tid+1] * kernel[2]; ing rule to synchronize computations. We argue that the } massively multithreaded CGRA (MT-CGRA) [6,7] design, (b) Spatial convolution on a GPGPU using shared memory which employs the dataflow model, can be extended to support inter-thread in the proposed direct MT-CGRA (dMT- thread_code() { CGRA) architecture. Specifically, inter-thread communica- // map the thread to1D space(CUDA − s t y l e) tion on the dMT-CGRA architecture is implemented as fol- tid = threadIdx.x; lows. Whenever an instruction in thread A sends a data token // load one elemnt from global memory to an instruction in thread B, the latter will not execute (i.e., mem_elem = globalImage[tid]; fire) until the data token from thread A has arrived. This sim- // tag the value of the variable to be sent, ple use of the dataflow firing rule ensures that thread B will // in case the variable gets rewritten. wait for thread A. The dataflow model therefore addresses tagValue<mem_elem >(); the two limitation of the von Neumann model: // wait for tokens from threads tid+1 and tid −1 1. The CGRA fabric, with its internal buffers, enables lt_elem = fromThreadOrConst<mem_elem, / ∗ t i d ∗ / -1 ,0 >(); most communicated values to propagate directly to their rt_elem = fromThreadOrConst<mem_elem, / ∗ t i d ∗ / +1 ,0 >(); destination, thereby avoiding costly communication with // execute the convolution the shared memory scratchpad. Only a small fraction result[tid] = lt_elem * kernel[0] + mem_elem * kernel[1] of tokens that cannot be buffered in the fabric are spilled + rt_elem * kernel[2]; to memory. } 2. By coupling communication and synchronization us- (c) Spatial convolution on a MT-CGRA using thread cooperation ing message-passing extensions to SIMT programming models, dMT-CGRA implicitly synchronizes point-to- Figure 1: Implementation of a separable convolution [8]) point data delivery without costly barriers. using various inter-thread data sharing models. For brevity, The remainder of this section argues for the coupling of com- we focuses on 1D-convolutions, which are the main and it- munication and synchronization, and discusses why typical erative component in the algorithm. programs can be satisfied by the internal CGRA buffering.

Arxiv:1801.05178V1 [Cs.AR] 16 Jan 2018 Single-Instruction Multiple Threads (SIMT) Model

Computer Architecture: Dataflow (Part I)

A Dataflow Architecture for Beamforming Operations

Design Space Exploration of Stream-Based Dataflow Architectures

Object-Oriented Development for Reconfigurable Architectures

Dataflow Computing Models, Languages, and Machines for Intelligence Computations

Hyperflow: a Heterogeneous Dataflow Architecture

Comparative Analysis of Dataflow Engines and Conventional Cpus in Data-Intensive Applications

Neuflow: a Runtime Reconfigurable Dataflow Processor for Vision

Cognitive Dimensions Usability Assessment of Textual and Visual VHDL Environments

High Level Synthesis with a Dataflow Architectural Template

Compiling for EDGE Architectures

Reconfigurable Dataflow Graphs for Processing-In-Memory