Received May 31, 2020, accepted July 13, 2020, date of publication July 27, 2020, date of current version August 20, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.3012084

A Survey on Coarse-Grained Reconfigurable Architectures From a Performance Perspective

ARTUR PODOBAS 1,2, KENTARO SANO1, AND SATOSHI MATSUOKA1,3
1RIKEN Center for Computational Science, Kobe 650-0047, Japan
2Department of Computer Science, KTH Royal Institute of Technology, 114 28 Stockholm, Sweden
3Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Tokyo 152-8550, Japan
Corresponding author: Artur Podobas ([email protected])
This work was supported by the New Energy and Industrial Technology Development Organization (NEDO).

ABSTRACT With the end of both Dennard’s scaling and Moore’s law, computer users and researchers are aggressively exploring alternative forms of computing in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with a particular focus on the premise behind the different CGRAs and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.

INDEX TERMS Coarse-grained reconfigurable architectures, CGRA, FPGA, computing trends, reconfigurable systems, high-performance computing, post-Moore.

The associate editor coordinating the review of this manuscript and approving it for publication was Nitin Nitin.

I. INTRODUCTION
With the end of Dennard's scaling [1] and the looming threat that even Moore's law [2] is about to end [3], computing is perhaps facing its most challenging moments. Today, computer researchers and practitioners are aggressively pursuing and exploring alternative forms of computing in order to try to fill the void that an end of Moore's law would leave behind. There is already a plethora of emerging and intrusive technologies that promise to overcome the limits of technology scaling, such as quantum or neuromorphic computing [4], [5]. However, not all post-Moore architectures are intrusive, and some merely require us to step away from the comforts that the von Neumann architecture offers. Among the more salient of these technologies are reconfigurable architectures [6].

Reconfigurable architectures are systems that attempt to retain some of the silicon plasticity that an ASIC solution usually throws away. These systems – at least conceptually – allow the silicon to be malleable and its functionality dynamically configurable. A reconfigurable system can, for example, mimic a processor architecture for some time (e.g., a RISC-V core [7]), and then be changed to mimic an LTE baseband station [8]. This property of reconfigurability is highly sought after, since it can mitigate the end of Moore's law to some extent – we do not need more transistors, we just need to spatially configure the silicon to match the computation in time.

Recently, a particular branch of reconfigurable architecture – the Field-Programmable Gate Array (FPGA) [9] – has experienced a surge of renewed interest for use in High-Performance Computing (HPC), and recent research has shown performance or power benefits for multiple applications [10]–[14]. At the same time, many of the limitations that FPGAs have, such as slow configuration times, long compilation times, and (comparably) low clock frequencies, remain unsolved. These limitations have been recognized for decades (e.g., [15]–[17]), and have driven forth a different branch of reconfigurable architecture: the Coarse-Grained Reconfigurable Architecture (CGRA).

CGRAs trade away some of the flexibility that FPGAs have in order to solve these limitations. A CGRA can operate at higher

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ VOLUME 8, 2020 146719 A. Podobas et al.: Survey on CGRAs From a Performance Perspective

clock frequencies, can provide higher theoretical compute performance, can drastically reduce compilation times, and – perhaps most importantly – can reduce reconfiguration time substantially. While CGRAs have traditionally been used in embedded systems (particularly for media processing), lately they, too, are being considered for HPC. Even traditional FPGA vendors such as Xilinx [18] and Intel [19] are creating and/or investigating how to coarsen their existing reconfigurable architectures to complement other forms of computing.

In this paper, we survey the literature on CGRAs, summarizing the different architectures and systems that have been introduced over time. We complement surveys written by our peers by focusing on understanding the trends in performance that CGRAs have been experiencing, providing insights into where the community is moving, and any potential gaps in knowledge that can/should be filled. The contributions of our work are as follows:
• A survey over three decades of Coarse-Grained Reconfigurable Architectures, summarizing existing architecture types and properties,
• A quantitative analysis over performance metrics of CGRA architectures as reported in their respective papers, and
• An analysis of trends and observations regarding CGRAs, with discussion.

The remainder of the paper is organized in the following way. Section II introduces the motivation behind CGRAs, as well as their generic design, for the unfamiliar reader. Section III positions this survey against existing surveys on the topic. Section IV quantitatively summarizes each architecture that we reviewed, describing key characteristics and the premise behind each respective architecture. Section V analyzes the reviewed architectures from different perspectives (Sections VI, VII, and VIII), which we finally discuss at the end of the paper in Section IX.

II. INTRODUCTION TO CGRAs
Before summarizing the now three decades of Coarse-Grained Reconfigurable Architecture (CGRA) research, we start by describing the main aspirations and motivations behind them. To do so, we need to look at the predecessor of the CGRA: the Field-Programmable Gate Array (FPGA).

FPGAs are devices that were developed to reduce the cost of simulating and developing Application-Specific Integrated Circuits (ASICs). Because any bug/fault that was left undiscovered post ASIC tape-out would incur a (potentially) significant economic loss, FPGAs were (and still are) crucial to digital design. In order for FPGAs to mimic any digital design, they are made to have a large degree of fine-grained reconfigurability. This fine-grained reconfigurability was achieved by building FPGAs to contain a large amount of on-chip SRAM cells called Look-Up Tables (LUTs) [20].¹ Each LUT was interfaced by a few input wires (usually 4-6) and produced an output (and its complement) as a function of the SRAM content and its inputs. Hence, depending on the sought-after functionality to be simulated, LUTs could be configured and – through a highly reconfigurable interconnect – connected to each other, finally yielding the expected design. The design would naturally run at between one and two orders of magnitude lower frequency than the final ASIC (for example, there is a 37× reduction in frequency when running an Intel Atom on an FPGA [21]), but would nevertheless be an invaluable prototyping tool.

¹While most FPGAs are based on SRAM LUTs, it is worth mentioning that alternatives exist, such as those (for example) built on antifuse technology.

By the early 1990s, FPGAs had already found other uses (aside from digital development) within the telecommunication, military, and automobile industries—the FPGA was seen as a compute device in its own right, and there were some aspirations to use it for general-purpose computing and not only in the niche market of prototyping digital designs. Despite this, several limitations of FPGAs were quickly identified that prohibited coverage of a wide range of applications. For example, unlike software compilation tools that take minutes to compile applications, the FPGA Electronic Design Automation (EDA) flow took significantly longer, often requiring hours or even days of compilation time. Similarly, if the expected application could not fit a single device, the long reconfiguration overhead (the time it takes to program the FPGA) demotivated time-sharing or context-switching of its resources. Another limitation was that some important arithmetic operators did not map well to the FPGA; for example, a single integer multiplication could often consume a large fraction of the FPGA resources. Finally, FPGAs were relatively slow, running at a low clock frequency. Many of these challenges and limitations of applying FPGAs to general-purpose computing still hold to this day.

Early reconfigurable computing pioneers looked at the limitations of FPGAs and considered what would happen if one were to increase the granularity at which the device was programmed. By increasing the granularity, larger and more specialized units could be built, which would increase the performance (clock frequency) of the device. Also, since larger units require less configuration state, reconfiguring the device would be significantly faster, allowing fine-grained time-sharing (multiple contexts) of the device. Finally, by coarsening the units of reconfiguration, one could include those units that map poorly onto FPGAs in the fabric (e.g., multipliers), making better use of the silicon and increasing the generality of the device. These new devices would later be called Coarse-Grained Reconfigurable Architectures (CGRAs).

An example of what a CGRA looks like from the architecture perspective is shown in Figure 1. In Figure 1:a we see a mesh of reconfigurable cells (RCs) or processing elements (PEs) – the smallest unit of reconfiguration that performs work – and it is through this mesh that a user (or compiler) decides how data flows through the system.

There are multiple ways of bringing data in/out to/from the fabric. One common way is to map the device in the


memory of a host processor (memory-mapped) and have the host processor orchestrate the execution. A different way is to include (generic) address generators (AGs) that can be configured to access external memory using some pattern (often corresponding to the nested loops of the application) and push the loaded data through the array. A third option is to have the reconfigurable cells do both the computation and the address generation.

FIGURE 1. Illustration of a simple CGRA, showing the mesh topology (a), the internal architecture of the Reconfigurable Cell, RC (b), and an example of the configuration register (c). Although several variations exist, the illustrated structure is the predominantly used system in CGRA research.

Figure 1:b illustrates the internals of an RC element, which include an ALU (integer and/or floating-point capable), two multiplexers (MUXs), and a local SRAM used for storage. The two multiplexers decide which of the external inputs to operate on. The inputs are usually the output of adjacent RCs, the local SRAM, a constant, or a previous output (e.g., for accumulations). The output of the ALU is similarly connected to adjacent RCs, the local SRAM, or back to one of the MUXes. The operation of the RC is governed by a configuration register, briefly shown in Figure 1:c. For simplicity, we show a single register that holds the state; however, in many architectures, each RC can hold multiple configurations that are cycled through over the application's lifetime. Each of the configurations can, for example, hold the computation for a particular basic block (where live-in/out variables are stored in SRAM) or for discrete kernels.

Figure 1 illustrates what a majority of today's CGRAs look like, but at the same time, there are multiple variations. For example, early CGRAs often included fine-grained reconfigurable elements (Look-Up Tables, LUTs) inside the fabric. While the mesh topology is by far the most commonly used, some works chose a ring or linear-array topology. Finally, the flow-control of data in the network can be of varying complexity (e.g., token or tagged-token). We describe many of these in our summary in the sections that follow.

III. SCOPE OF THE PRESENT SURVEY
Since their inception in the early 1990s, CGRAs have been the subject of a plethora of research on their architecture, compilation strategies, mapping, and so forth. At the same time, surveys have closely monitored how CGRA technologies have evolved through time, and we can today enjoy solid and condensed material on the subject. Surveys have covered most aspects of CGRA computing, including commercial CGRA adoption [22], architectures [23], [24], tools and frameworks [25], and taxonomy/classification [26], [27].

The work in the present paper assumes a different position to survey the field of CGRAs. Our paper complements the existing literature by attempting to summarize and condense the performance trends of CGRA architectures, and to position these against architectures such as Graphics Processing Units (GPUs) (which is what most systems use as accelerators) in order to understand what gaps future high-performance CGRAs should strive to fill. To the best of our knowledge, this is the first survey providing such a comprehensive perspective within the field of CGRAs.
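To make the generic template of Figure 1 concrete before surveying individual designs, the behavior of a small mesh of RCs – two input MUXes feeding an ALU, a registered output, and a configuration register with multiple cycled contexts – can be sketched in a few lines of Python. This is an illustrative model only; all names (`RC`, `Config`, the two-phase `step`) are our own and do not correspond to any surveyed architecture:

```python
# Illustrative, simplified model of the generic CGRA of Figure 1
# (our own naming; not taken from any surveyed architecture).
import operator
from dataclasses import dataclass

@dataclass
class Config:
    """One entry of an RC's configuration register (cf. Figure 1:c)."""
    op: object        # ALU operation (any two-operand callable)
    src_a: object     # neighbor coordinate, or "sram"/"self"/"const"
    src_b: object
    const: int = 0    # immediate value, used when a source is "const"

class RC:
    """Reconfigurable cell: ALU + two input MUXes + local SRAM (Figure 1:b)."""
    def __init__(self, configs):
        self.configs = configs  # multiple contexts, cycled every step
        self.ctx = 0            # index of the active configuration
        self.out = 0            # registered ALU output
        self.sram = 0           # local scratchpad (modeled as one word)

    def _mux(self, grid, src, cfg):
        if src == "sram":  return self.sram
        if src == "self":  return self.out   # feedback, e.g. accumulation
        if src == "const": return cfg.const
        return grid[src].out                 # output of an adjacent RC

    def step(self, grid):
        cfg = self.configs[self.ctx]
        result = cfg.op(self._mux(grid, cfg.src_a, cfg),
                        self._mux(grid, cfg.src_b, cfg))
        self.ctx = (self.ctx + 1) % len(self.configs)  # cycle contexts
        return result

# A 1x2 "mesh": RC (0,0) emits the constant 2, RC (0,1) accumulates it.
grid = {
    (0, 0): RC([Config(operator.add, "const", "const", const=1)]),
    (0, 1): RC([Config(operator.add, (0, 0), "self")]),
}
for cycle in range(3):
    results = {pos: rc.step(grid) for pos, rc in grid.items()}  # compute phase
    for pos, value in results.items():
        grid[pos].out = value                                   # commit phase
```

The two-phase update (compute all cells, then commit) models the synchronous, registered outputs of the mesh: the producer's first output only becomes visible to its neighbor from the second cycle onward, so after three cycles RC (0,1) has accumulated its neighbor's output twice.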


IV. OVERVIEW OF CGRA ARCHITECTURE RESEARCH
A. EARLY PIONEERING CGRAs
Some early CGRAs were not much coarser than their respective FPGAs. For example, the Garp [17] infrastructure (shown in Figure 2:a) was reconfigurable at a 2-bit (rather than the FPGA's 1-bit) granularity. Here, each reconfigurable unit could connect to neighbors in both the horizontal (used for carry-outs) and vertical direction, as well as to dedicated bus lines for interfacing memory. By using several reconfigurable units along the horizontal axis, users could implement arithmetic operations of varying sizes (e.g., 18-bit additions). The arithmetic units created along the horizontal direction could have their outputs connected along the vertical direction, creating a computational data-path. An external processor (the MIPS-based [28] TinyRISC in Garp's case) could then orchestrate the execution of this data-path. Using CGRAs as co-accelerators in this way was (and still is) a common way of leveraging them. The Garp project spanned several years and included the development of a C compiler [29].

FIGURE 2. Three well-known early CGRA architectures that represent different approaches to the concept, where (a) Garp represents RCs with fine granularity (1-2 bits), (b) MorphoSys used the structure commonly found in modern CGRAs, and (c) RaPiD adopted a linear array of heterogeneous units that are connected through a shared segmented bus.

CHESS [30] – unlike Garp – operated at a reconfigurability width of 4 bits. CHESS, as the name implies, laid out its individual reconfigurable elements in a fairly uniform mesh, where elements of routing and elements of compute alternate across the mesh. Here, each reconfigurable compute element had access to all eight of its neighbors. Unlike Garp, whose reconfigurable elements were built around look-up tables (as in FPGAs), CHESS used ALU-like structures with fixed functionalities that a user could choose (or configure) to use. Another interesting feature was that the compute elements could be reconfigured to act as (limited) on-chip scratchpads. D-Fabrix [31] was based on CHESS and was taped out as a commercial product.

Raw [16], [32], [33] took a different approach to CGRAs. Rather than keeping the reconfigurable tiles minimalistic, it instead chose to make them software-programmable. Whereas Garp was based on LUTs, and CHESS was based on a single 64-bit configuration register per RC, Raw's RCs had a fully dedicated instruction memory and a highly dynamic network-on-chip, along with the necessary hardware to support it. In fact, the Raw architecture is very similar to the modern many-core architecture, albeit lacking shared-memory support such as coherency. The Raw project spanned several years, had a mature software infrastructure, and prototype chips were taped out in 2004 [34]. It was also the precursor to the modern many-core architecture Tilera [35], which was partially built on the outcome of Raw.

The REMARC [15], [36] architecture was an early – at the time, quite coarse – architecture that operated on a 16-bit data-path. It was quite similar to modern CGRAs, since the reprogrammable elements all included an ALU and a small register file, and were directly connected to their neighbors in a mesh-like topology. Configuring the CGRA was done by programming the instruction RAM that was local to each tile with some particular functionality, where a global program counter (called the nano-PC) synchronously orchestrated (or sequenced) the execution. Global communication wires ran across the horizontal and vertical axes, allowing elements to communicate with external resources. As with Garp – but unlike Raw – the REMARC architecture was designed to work as a co-processor.

Another early but influential CGRA was the MATRIX [37] architecture, which (similar to REMARC) revolved around ALUs as the main reconfigurable compute resource, but was slightly more fine-grained than REMARC due to choosing an 8-bit (contra REMARC's 16-bit) data-path. Despite their name, the functionality of the ALUs was actually more similar to that of an FPGA, where a NOR-plane could be programmed to the desired functionality (similar to a Programmable Logic Array, PLA), but they also included native support for pattern matching. MATRIX, for its time, had a remarkably advanced network topology, where compute elements could directly communicate with neighbors within a two-square Manhattan distance. Additionally, the network


included by-pass layers for remote compute elements to communicate. The network also supported computing on the data that was routed, including both shift and reduction operations.

MorphoSys [38], [39] (shown in Figure 2:b) was similar to the REMARC architecture, both in structure and granularity (16-bit), and also in the type of applications it targeted (media applications). MorphoSys was designed to act as a co-processor, and had the (today) well-known structure of CGRAs, which included an ALU, a small register file, an output shifter (to assist fixed-point arithmetic), and two larger multiplexers driven by the outputs of neighbors. The compute elements are arranged hierarchically in two layers: the first is a local quadrant where elements have access to all other compute elements along the vertical and horizontal axes, and the second layer is four quadrants composed into a mesh. Unlike previous CGRAs, MorphoSys had a dedicated multiplier inside the ALUs. A CGRA based on MorphoSys was also realized in silicon nearly seven years after its inception [40].

While most of the CGRAs described so far used a mesh topology of interconnection (with some connectivity), other topologies have been considered. RaPiD [41], [42] (shown in Figure 2:c) was a CGRA that arranged its reconfigurable processing elements in a single dimension. Here, each processing element was composed of a number of primitive blocks, such as ALUs, multipliers, scratchpads, or registers. These primitive blocks were connected to each other through a number of local, segmented, tri-stated bus lines that could be configured to form a data-path – a so-called linear array. These processing elements could themselves be chained together to form the final CGRA. Interestingly, RaPiD could be partially reconfigured during execution in what the authors called ''virtual execution''. RaPiD itself did not access data; instead, a number of generic address-pattern generators interfaced external memory and streamed the data through the compute fabric.

The KressArray [43]–[45] was one of the earliest CGRA designs to be created, and the project spanned nearly a decade with multiple versions and variants of the architecture. It featured a hierarchical topology, where the lowest tier was composed of a mesh of processing elements. The processing elements interfaced with neighbors and also included predication signals (to map if-then-else primitives). Generic address generators supported the CGRA fabric by continuously streaming data to the architecture.

Chimaera [46] was a co-processor conceptually similar to Garp, with an array of reconfigurable processing elements operating at quite a fine granularity (similar to modern FPGAs) that could be reconfigured to perform a particular operation. It was closely coupled to the host processor, to the point where the register file was (in part) shadowed and shared. Mapping applications to the architecture was assisted by a ''simple'' C compiler, and the authors demonstrated performance on the Mediabench benchmarks [47] and the Honeywell ACS suite [48].

PipeRench [49] applied a novel network topology that was a hybrid between a mesh and a linear array. Here, a large number of linear arrays were layered, where each layer sent data uni-directionally to the next layer. Several future CGRAs would adopt this kind of structure, including data-flow machines (e.g., Tartan) and loop-accelerators (e.g., FPCA). The layers themselves in PipeRench were fairly fine-grained and comparable to Garp, as they contained reconfigurable Look-Up Tables rather than fixed-function ALUs. PipeRench introduced a virtualization technique that treated each separate layer as a discrete accelerator, where a partial reconfiguration traveled alongside its associated data, reconfiguring the next layer according to its functionality – a technique that was new at the time. PipeRench was also later implemented in silicon [50].

The DReAM [51] architecture was explicitly designed to target the (then) next-generation 3G networks, arguing that CGRAs are well suited for the upcoming standard with respect to software-defined radio and the flexibility to hot-fix bugs (through patches) and firmware. The system has a hierarchy of configuration managers and a mesh of simple, ALU-based RCs operating on 16-bit operands, with limited support for complex operations such as multiplications (since operations were realized through Look-Up Tables).

So far, all architectures reviewed computed using integer arithmetic. Imagine [52] was among the early architectures that included hardware floating-point arithmetic units. The architecture itself was similar to RaPiD—it was a linear array, where each processing element had a number of resources (scratchpads, ALUs, etc.) all connected using a global bus. Similar to RaPiD, the processing elements were passive, and external drivers were responsible for streaming data through the processing elements. The Imagine architecture had a prototype realized six years after its seminal paper [53].

B. MODERN COARSE-GRAINED RECONFIGURABLE ARCHITECTURES
Most modern CGRA architectures' lineage can be traced back to those described in the previous section, and a majority of these architectures follow the generic template described there. However, while the overall template remains similar, many recent architectures specialize towards a certain niche use (low power, deep learning, GPU-like programmability, etc.).

The ADRES CGRA system [54], [55] (Figure 3:a) has been a remarkably successful architecture template for embedded architectures, and is still widely used. ADRES is a template-based architecture, and while the most common example arranges RCs in a mesh, users are capable of defining arbitrary connectivity. Inside each element, we find an ALU of varying capability and a register file, alongside the multiplexers configured to bring in data from neighbors. The first row in the mesh, however, is unique, as an optional processor can extend its pipeline to support interfacing that very first row in a Very Long Instruction Word [56]


(VLIW) fashion. ADRES, by design, is thus heterogeneous. ADRES comes with a compiler called DRESC [57], which can handle the freedom that ADRES allows with respect to arbitrary connectivity. ADRES as an architecture has been (and still is) a popular platform for CGRA research, such as when exploring multi-threaded CGRA support [58], topologies [59], asynchronous further-than-neighbor communication (e.g., HyCUBE [60]), or CGRA design frameworks/generators (e.g., CGRA-ME [61], [62]). Furthermore, ADRES has been taped out on silicon, for example in the Samsung Reconfigurable Processor (SRP) and the follow-up UL-SRP [63] architecture.

FIGURE 3. The (a) ADRES architecture was a CGRA template architecture that was also later commercialized (by, among others, Samsung); unique to ADRES was that the first row of RCs extended the backend pipeline of the VLIW processor that orchestrated the execution. (b) The TRIPS architecture was among the first to replace the traditional super-scalar processor pipeline with a relatively large CGRA-like mesh in order to exploit more parallelism. (c) The Plasticine architecture is a recent CGRA architecture that focuses on parallel patterns through specialized pattern address generators (for both external and internal storage).

The Dynamically Reconfigurable ALU Array (DRAA) [64] was a generic CGRA template proposed in 2003 to encourage compilation research on CGRA architectures. Architecture-wise, DRAA allowed changing many of the parameters that define a CGRA, such as the data-path width, the interconnect, the size of the register file, etc. Preceding both DySER and ADRES, DRAA as a template has been used, e.g., to study various aspects of CGRAs [65].

TRIPS/EDGE [66], [67] (Figure 3:b) was a long-running, influential project that attempted to move away from the traditional approach of exploiting instruction-level parallelism in modern processors. The premise behind TRIPS was that, as technology reduced the size of transistors, wire delays and paths would dominate latency, and that it would be hard to continue scaling the communication in superscalar processors [68]. Instead, by tightly coupling functional units in (for example) a mesh, direct neighbor communication could easily be scaled. In effect, TRIPS/EDGE replaced the traditional superscalar Out-of-Order pipeline with a large CGRA array: single instructions were no longer scheduled; instead, a new compiler [69], [70] was developed that scheduled entire blocks (essentially CGRA configurations) temporally on the processor, allowing up to 16 instructions to be executed at a single time (and many more in flight). The TRIPS architecture was taped out on silicon [71], [72], and – despite being discontinued – represented a milestone of true high-performance computing with CGRAs. An interesting observation, albeit not necessarily related to CGRAs, is that the EDGE ISA has recently received renewed interest as an alternative to express large amounts of ILP in FPGA soft processors [73].

The DySER [74] architecture integrates a CGRA into the backend of a processor's pipeline to complement (unlike TRIPS, which replaces) the functionality of the traditional (super-)scalar pipeline, and has been integrated into the OpenSPARC [75] platform [76]. The key premise behind DySER is that there are many local hot regions in program code, and higher performance can be obtained by specializing in accelerating these inside the CPU. DySER was evaluated using both a simulator (M5 [77]) and an FPGA implementation on well-known benchmarks (PARSEC [78] and SPECint) and compared with both CPU and GPU approaches, showing between 1.5×-15× improvements over SSE and comparable flexibility and performance to GPUs. Recently (2016 onwards), DySER has been the focus of the FPGA-overlay scene (see Section IV-G). Other similar work to DySER that integrates CGRA-like
can handle the freedom that ADRES allows with respect Other similar work to DySER that integrates CGRA-like


structures into processing cores with various goals includes CReAMS/HARTMP [79], [80] (which applies dynamic binary translation) and CGRA-sharing [81] (conceptually similar to what the AMD Bulldozer architecture [82] and UltraSPARC T1/T2 did with their floating-point units).

AMIDAR [83] is another exciting long-running project that (amongst others) uses a CGRA to accelerate performance-critical sections. The AMIDAR CGRA extends the traditional CGRA PE architecture with a direct interface to memory (through DMA). There is support for multiple contexts and hardware support for branching (through dedicated condition-boxes operating on predication signals), which also allows speculation. The AMIDAR CGRA has been implemented and verified on an FPGA platform, and early results show that it can reach over 1 GHz of clock frequency when mapped to a 45 nm technology.

The Versat CGRA [84] is a co-processor system that primarily targets different transforms (IDCT, FFT, etc.) and filters (FIR, IIR, etc.); the authors argue that for these kernels there is no need for unnecessarily large CGRAs, and that minimal, power-efficient CGRAs are preferred. The Versat CGRA is a linear array, where a number of functional units share a data-bus whose output connects back to the inputs of the functional units. It is programmed using a C/C++ dialect. Versat was evaluated against an ARM Cortex-A9, and experienced between a 1.7× (IIR) and 17.7× (FFT) performance increase over said processor.

The MORA [85] architecture was a platform for CGRA-related research. MORA targeted media processing, and hence provided an 8-bit architecture with processing elements covering the most commonly used operations. MORA itself was similar to the earlier MATRIX, with a simple 2D mesh structure with neighbor communication. Each processing element had a scratchpad (256 bytes large). MORA was programmable using a domain-specific language developed on top of C++ [86].

CGRA Express [87] is yet another architecture that follows the concept of being a mesh with simple, ALU-like structures. The premise and motivation for the work is that most existing CGRA applications are optimized for maximal graph coverage rather than sequential speed. The hypothesis is that – depending on the operators each PE is configured to use – they can exploit the resulting positive clock slack of the operators and cascade (fuse) more operations per clock cycle than blindly registering the intermediate output. This, in turn, allows them to execute more operations per cycle (or to reduce the frequency) with little performance loss. In their architecture, they add an extra bypass network that can be configured to not be pipelined. They show both power and performance benefits on multimedia benchmarks with and without their approach. The work can be conceptually seen as the opposite of what modern FPGAs (e.g., Stratix 10) do with HyperFlex [88], but for CGRAs.

The polymorphic pipeline array (PPA) [89] performed an interesting pilot study that drove the parameters of its CGRA: the authors simulated a large number of benchmarks scheduled on a hypothetical (infinite) CGRA, with a focus on modulo-scheduling and loop unrolling. They revealed that even with infinitely large CGRAs, performance will be bound as a function of the instruction-level parallelism in the loops and the limitations of modulo-scheduling, and they argue that there is a definitive need to include other forms of parallelism to scale on CGRAs. While the PEs themselves follow a standard layout, they propose an interesting technique that allows multiple (unique) kernels to be executed concurrently on the CGRA, where the kernels communicate with each other through DMA. Kernels can also be resized to fully exploit the CGRA array.

The premise behind SIMD RA [90] is similar to that of PPA: CGRAs rely too much on instruction-level parallelism, and opportunities from other forms of parallelism are lost. SIMD RA focuses on embedding support to modularize the CGRA array into multiple discrete, controllable regions that (may) operate in SIMD fashion. The authors found that using SIMD not only yielded better performance, but was also more area-efficient compared to using software pipelining only.

SmartCell [91] was a CGRA that aspired to be low-power and provide high performance, supporting both SIMD- and MIMD-type parallelism. The architecture was effectively a 2D mesh, but with the mesh divided into 2 × 2 quadrants of processing elements. These 2 × 2 islands shared a reconfigurable router, and inter-quadrant communication was limited to the connectivity of these routers. The processing elements themselves were fairly standard and contained instruction memory whose instruction (configuration) was set either per processing element (MIMD) or sequenced globally (SIMD).

The Butter [92] architecture was an early effort at providing a CGRA template with subword compute capabilities, meaning that while the data-path width was 32-bit, it could be segmented into carrying multiple values and operating on them in SIMD fashion. The size was fixed to 4 × 8 PEs (changed in the later SCREMA architecture, see Section IV-G), it had multiple contexts, and the CGRA is programmed using C functions. The Butter architecture was synthesized towards both ASICs and FPGAs, while its sequels (CREMA and SCREMA) were more FPGA-focused (described later in Section IV-G).

BilRC [93] is a heterogeneous mesh composed of three different blocks: generic ALU blocks, multiplication/shifter nodes, and memory blocks, following the (by now) traditional recipe of a CGRA. Unique to BilRC is that the architecture explicitly exposes the triggering of instructions, allowing the programmer and/or application fine-grained control over the amount of parallelism and over when instructions are executed.

The lack of floating-point support in CGRAs has also been a driving force for research. FloRA [94] is a 16-bit, IEEE-754 floating-point-capable CGRA. The architecture itself is composed of 64 RCs, and each RC is fairly standard and does not include a dedicated floating-point core itself; instead, two RCs can be combined to enable single-precision (32-bit) floating-point support, where

VOLUME 8, 2020 146725 A. Podobas et al.: Survey on CGRAs From a Performance Perspective

mantissa and exponent-computation is distributed among the system.

Feng et al. [95] introduce a floating-point capable architecture specifically designed for radar signal processing. Despite the familiar mesh-based interconnection, the design deviates from the traditional approach since their processing elements are fairly diverse and heterogeneous. The CGRA itself was taped out on silicon and could reach up to 70 GFLOP/s floating-point performance.

The recent TRANSPIRE architecture [96] is a small 2 × 4 CGRA that targets near-sensor Internet of Things (IoT) use cases. Realizing the importance of variable-precision floating-point computations, the TRANSPIRE architecture supports both IEEE-754 single-precision computations and a recently proposed 8-bit format called binary8 [97]. When configured in binary8, the architecture is capable of leveraging SIMD parallelism to increase performance and energy efficiency. The CGRA was empirically shown to perform up to ten times better than a RISC-V [98] processor with a comparable silicon footprint, and could reach up to 224 GOPs/Watt of binary8 performance and 156 GOPs/Watt when employing full 32-bit precision for selected near-sensor kernels.

A PRET-driven (Precision Timed) CGRA aimed towards predictable real-time processing was developed by Siqueira and Kreutz [99]. Interestingly, the CGRA has support for threads, which is a concept used more in high-performance designs rather than in real-time designs. The architecture is similar to an SMT (Simultaneous Multi-Threading) architecture, where each processing element has a duplicate number of resources (primarily the register files) that are unique to each thread. Aside from having deterministic timing inside the CGRA, the authors also implemented predictable external memory access, required for real-time systems.

The recent Sparse Processing Unit (SPU) [100] architecture aspires to provide a CGRA for general-purpose computing. The main novelty is that SPU extends existing CGRA designs with support for two types of computational patterns: what they call ''stream-joins'' (e.g., the inner-product of a sparse vector multiplication) and alias-free scatter/gather (regular loops with indirection). This is achieved by extending the typical CGRA with options to conditionally consume input tokens (re-use values), reset accumulators, or conditionally discard output tokens. Address generation units (linear and non-linear) reside inside on-chip SRAM controllers. The SPU targets general-purpose workloads with some favor towards deep learning applications.

The premise of Softbrain [101] is to combine both vector-level (for regular, efficient memory-access) and data-flow (for instruction-level parallelism) computation in CGRAs to reach high performance and power-efficiency. The architecture consists of a number of stream-engines (essentially address generators in prior work) and the CGRA substrate itself. The input to the CGRA substrate is a number of vector ports (512-bit memory interfaces), the on-chip scratchpad, or the local output feedback, and a -controller that orchestrates the execution of the pair.

The Chameleon [102] CGRA was an early commercial CGRA that focused on competing with early DSPs and FPGAs. Here the CGRA is layered, where they call each layer a slice. Each slice has three tiles, where each tile has a number of scratchpad memories that interface with eight reconfigurable processing elements. The idea is to load the local scratchpad with data, configure the processing elements associated with the scratchpad with some functionality, and then pipe the data onto other slices. The Chameleon was implemented in a 0.25 µm process running at a 125 MHz clock frequency. The architecture itself operates on a 32-bit data-path width, but can be configured to divide the data stream into two 16-bit or four 8-bit streams as well.

SiLago (Silicon Large Grain Objects) [103] is a methodology for creating CGRA-based platforms. The premise behind the method is to use reconfigurable CGRA processing elements (based on DRRA [104]) as building blocks for future systems in order to reduce production cost with little impact on performance (compared to hand-made ASICs). Platforms based on SiLago and DRRA are (among others) specialized for deep learning [105], brain-simulation computing [106], and genome identification [107]. The Q100 [108] is similar in concept but specializes in database computing, and provides tiles for computing on data-flow streams that users can assemble larger systems from.

C. LARGER CGRAs
Most CGRA systems (e.g., ADRES, TRIPS, DySER) limit the size of the array to less than 64 processing elements, and only a few of the so-far mentioned CGRAs are larger than that (PipeRench had 256 PEs, Garp had 768, but they are relatively fine-grained). Likely the limited size of these CGRAs was due to their application domain, which mostly involved image-, audio-, or telecommunication applications. However, in recent years, even larger, more powerful, CGRA-based systems have emerged, many of which explicitly target High-Performance Computing.

The eXtreme Processing Platform [109] (XPP) was a CGRA that focused on multiple levels of parallelism, including that of pipeline processing, data-flow computing, and task-level execution. XPP's interconnection was deep and complex, consisting of multiple levels of various functionality. At the lowest tier, small processing elements containing a scratchpad, an ALU, and an associated configuration manager reside in mesh-like connectivity called a cluster. These clusters themselves are connected through -boxes running along their vertical and horizontal axes. Each tier had a configuration manager that is responsible for the functionality of that layer (and below), allowing fine-grained control and partitioning of the functionality of the system. XPP was token-driven, and execution of an operation occurs only when data is present at its inputs.

The High-Performance Reconfigurable Processor [110] (HiPReP) is an on-going CGRA research platform capable


of floating-point computations. HiPReP has dedicated floating-point circuitry (unlike, e.g., FloRA). Processing elements are organized in a mesh with a heterogeneous (in terms of bandwidth) interconnect, and include address generation units for driving data through the device. HiPReP explicitly targets high-performance computing.

WaveScalar [111]–[113] was an exciting architecture that focused on whole-application mapping onto a token-driven, data-flow, CGRA-like architecture. Most CGRA systems limit the size of the CGRA fabric to the point where basic blocks (instructions without branches) have to be split temporally across the fabric. WaveScalar takes a different approach: do not limit the CGRA fabric; instead, create a large fabric and remove all materialization of control-flow in the application by converting it to data-dependencies, enabling the application to map to the architecture. The WaveScalar architecture is larger, with over 500 RCs, and – as with its rival TRIPS – was evaluated on a subset of the SPECint2000, SPECfp2000, and SPEC2000 benchmarks.

Tartan [114] was another hierarchical architecture. It was a mix between a traditional mesh-based topology (e.g., ADRES) and a layered linear array (e.g., PipeRench). The system was fine-grained and operated on an 8-bit data-path with primarily addition and logical operations (no native multiplication). However, the architecture was not easily reconfigured, and expected the full application (or the majority of it) to be fully placed onto the architecture to prevent any form of context-switching. Tartan itself represents the extreme case where the majority of silicon is spent on simple compute elements, and was evaluated using SPECint [115].

The EGRA [116] architecture was similar in topology to Tartan, and had a mesh where each node consists of layered programmable ALUs. Both Tartan and EGRA were evaluated primarily using software simulators (SimpleScalar [117], and an in-house simulator for Tartan).

While most CGRAs target a classic SISD architectural model – with a few exceptions for SIMD – the SGMF [118], [119] architecture instead focuses on the SIMT [120] (Single-Instruction Multiple-Thread) model that is commonly found in GPUs and programmed using languages such as CUDA [121] or OpenCL [122]. The SGMF is claimed to be similar in area to the Fermi [123] or Kepler architecture. The architecture itself is a mesh-like CGRA with thread-tagged token flow-control and with support for synchronization. Although there are limits on what CUDA constructs can be mapped (e.g., atomic operations are not supported), the architecture itself is shown through simulation to be a viable and competitive alternative to existing GPUs.

REMUS [124] is a relatively large – for embedded SoC standards – CGRA with 512 processing elements that are driven by two ARM processors. REMUS is fairly standard in terms of layout and uses a layered mesh, although with extra temporary registers capable of holding state along the horizontal lines of the mesh, increasing opportunities for routing and more aggressive pipelining (to increase operating frequency).

The Dynamically Reconfigurable Processor [125] (DRP) was a CGRA that targeted stream-based computation. The processing elements were divided into eight tiles, each containing 8 × 8 processing elements, where each processing element had an ALU, input selectors (multiplexers), and an instruction memory (for multiple contexts). A number of scratchpad memories sat at the fringes of each tile and were used to store streamed data. The operation of the processing elements was controlled by an instruction pointer that was governed by hierarchical sequencers (one per tile and one global). The sequencer – effectively a programmable FSM – dictated which context was being executed, and could (re)act on signals from the tiles themselves. DRP was commercially taped out in the DRP-1 prototype by NEC.

The commercial DAPDNA-2 [126] processor produced by IPFlex contained up to 376 32-bit processing elements, organized as 8 × 8 PE quadrants in a mesh. The architecture was heterogeneous, with discrete tiles supporting ALU operations, scratchpads, programmable delay lines, simple address generators (counters), and I/O buffers. The processing elements contained both multiplication and arithmetic units, and also supported optional pre-processing of inputs through rotation/masking units. The tiles were interconnected using horizontal and vertical busses that ran in-between and through the mesh, and crossing the quadrants could only be done at border tile locations. Performance of selected applications (FIR, FFTs, image processing) showed two orders of magnitude better performance over the then state-of-the-art Pentium 4 processor.

The 167-processor architecture [127] was architecturally similar to both a CGRA and a conventional multi-core processor, and we include it here since the processing elements are simple, and communication between them is only performed using direct (yet dynamically configured) connectivity (and not through shared-memory, as done in multi-core). The main focus of this work is to reduce power consumption through a series of advanced low-level optimizations (DVFS, clock generation and distribution, GALS [128], etc.). They show performance of up to 196.7 GMACs/Watt when fabricated in 65 nm technology. Other similar architectures, based on programmable cores with limited connectivity, were the IMAPCAR [129]/IMAP-CE CGRA [130] from NEC, aimed towards image recognition in automobiles.

The RHyMe/REDEFINE [131], [132] architecture is a CGRA targeting High-Performance Computing (HPC) kernels. It follows a fairly typical CGRA design, where processing elements are connected in a torus network. The premise of their work is that there is a need to exploit multiple levels of parallelism (instruction-, loop-, and task-level parallelism), albeit the current implementation focuses primarily on instruction-level parallelism through modulo-scheduling. RHyMe/REDEFINE supports floating-point computations.

Plasticine [133] (Figure 3:c) is a recent, large CGRA that focuses on parallel patterns. At the highest abstraction layer, it is built of a mesh of units. There are two types of units:


compute and memory units, both of which are programmable with patterns. Inside the compute units, we find the raw functional units (the ALUs) as well as programmable state for controlling them. The compute units are built with both SISD- and SIMD-type parallelism in mind, and vector operations map natively to these units. Similarly, inside the memory units, we find a small set of ALUs coupled with programmable logic to interface the SRAM local to the units. The mesh itself interfaces external memory through a set of address generators and coalescing units. More importantly, Plasticine targets floating-point intensive applications, which is also shown in their evaluation (only three out of 13 applications are integer-only). Plasticine is programmable using Spatial [134] – a custom language based on patterns for data-flow computing.

The Riken High-Performance CGRA [135] (RHP-CGRA) is a recent architecture template that targets the exploration of CGRAs for use in HPC environments. Several architectural parameters can be modified (e.g., data-path width, size in RCs, functionality, SIMD width), and the system supports co-integration with third-party memory simulators (DRAMSim3 [136]), allowing future memory technology to be evaluated with CGRAs. The system is token-based, supports sub-word parallelism, and leverages address generators to bring data in and out. An example design (9 × 9 mesh with 7 × 9 RCs) of the system was evaluated with several different memory technologies (DDR3, GDDR6, and HBM2) and was capable of reaching between 44.5 GOP/s (111.4 OP/cycle) and 177.2 GOP/s (234 OP/cycle) on a variety of benchmarks.

Recently, the Cerebras Wafer Scale Engine [137] has been created explicitly for high-end deep learning training. Little information is publicly available, but the architecture seems to contain both software-programmable tiles and specialized tiles for tensor computations, which could make it the single largest CGRA to date, with a size of over 46,225 mm².

D. LINEAR-ARRAYS AND LOOP-ACCELERATORS
VEAL [138] was a linear array that explicitly targets the acceleration of small, compute-intensive loop-bodies. Similar to the before-mentioned PPA, the authors behind VEAL performed a rigorous empirical evaluation of the benchmarks they target, and demonstrate that one of the main limitations to the performance of mapping said benchmarks to CGRA fabrics is not the number of resources, but actually the amount of instruction-level parallelism extracted by modulo-scheduling. The VEAL linear array was fed by a number of custom address generators, which broadly correspond to the induction variables of the loops that were executed. An interesting observation is that VEAL was among the few CGRA works that use double-precision arithmetic. Another loop-accelerator similar to VEAL was the FPCA [139].

The BERET [140] architecture was yet another linear array that was designed to accelerate hot regions of code. One of BERET's main contributions was the identification of a small set of graphs that the processing elements should cover (called SEBs); the set was empirically extracted from the benchmarks and has since then been used in other studies (e.g., SEED [141], which is similar but improved in concept).

E. DEEP LEARNING CGRAs
Deep learning [142], in particular the computationally regular Convolutional Neural Networks (CNNs), has lately become a target for specialized CGRAs. Here the focus is to limit the generality and reconfigurability of the traditional CGRA to fit the computational patterns of CNNs, and instead spend the gained logic on supporting specialized operations for the intended deep learning workloads (such as compression, multicasting, etc.). Furthermore, these architectures often honor smaller (or mixed) number representations, since deep learning often is amenable to lower-precision calculations [143].

The DT-CGRA [144], [145] architecture follows a CGRA design with relatively coarse processing elements that include up to three multiply-accumulate instructions. The processing elements also include programmable delay lines to more easily map temporally close data. Data inside the RCs is synchronized through tokens. Support for multiple common deep learning access patterns (with stride, type, etc.) is facilitated through custom stream-buffer units that are programmable in a VLIW fashion, and that generate accesses to external memory.

The Sparse CNN (SCNN) [146] is a deep learning architecture that primarily targets CNNs and can exploit sparseness in both activations and kernel weights. The architecture itself is composed of a mesh of RCs, where each element also includes a 4 × 4 multiplier array and a bank of accumulation registers. These RCs are driven and orchestrated by a layer sequencer, which fetches and broadcasts (compressed) weights and activations. SCNN supports inter-PE parallelism in the form of spatial blocking/tiling, where each block is artificially enlarged with a halo region, which is exchanged between adjacent tiles at the end of the computation. They also implement a dense version (DCNN) of the architecture in order to measure the area overhead and the power- and performance-gains of including sparsity in the accelerator.

Liang et al. [147] introduce a CGRA accelerator that targets reinforcement learning. The processing elements themselves are fairly static, with support for addition, multiplication, or a fusion of both. Additionally, a number of different activation functions (ReLU, sigmoid, and tanh) can be selected using the configuration register, and data can be temporarily stored in a local scratchpad. Unlike most current CGRAs, which place address generators in discrete units outside the RCs, Liang et al.'s RCs include address generators inside. Global communication lines allow the user to control the reinforcement-training experience of the system.

The Eyeriss deep learning inference engine [148], [149] follows a CGRA design methodology as well, albeit with more focus on re-configuring the network access patterns rather than the compute (which mostly is based on multiply-accumulate operations). The CGRA itself is a mesh with a variety of options for point-to-point and broadcast operations, highly suitable for deep learning convolution patterns.


Additionally, the platform supports compression of data and exploits sparseness of intermediate activations to increase observed bandwidth. The Eyeriss architecture, depending on the type of neural network used, can utilize nearly 100% of the CGRA resources when inferring AlexNet.

One of the most recent (and perhaps radical) changes to FPGAs is coming in the form of support for deep learning CGRAs. The Xilinx Versal [150], [151] series allocates a large part of the silicon to a mesh-like CGRA structure of programmable, neighbor-communicating processing elements. The elements themselves are fairly general-purpose, but are marketed as targeting deep learning and telecommunication applications. To remedy the eventuality that the AI engine might be missing crucial parts of deep learning functionality that has yet to come, the AI engine can directly interface the remaining parts of the reconfigurable (FPGA) silicon, which is in the form of the fine-grained reconfigurable cells that Xilinx is known for. The system itself is an attempt to combine the best of both the fine-grained and coarse-grained reconfigurable worlds.

F. LOW-POWER CGRAs
CGRAs have also been shown to be competitive in terms of power consumption, particularly when compared to existing (low-power) processors and DSP engines. The CGRAs in this domain follow the same concepts as earlier CGRA designs, but focus on both technology and architecture improvements to reduce the static and/or dynamic power of the fabric. These CGRAs tend to focus on reducing the frequency and voltage as much as possible. Since the dynamic power consumption of a system is a function of both frequency and voltage (P_dynamic = C · V² · f_clk), reducing the frequency can have a dramatic effect on power consumption. Several CGRAs in this area operate on near-MHz levels, and some even remove the clock altogether.

The Cool Mega-Array [152], [153] (CMA-1 and CMA-2) architecture builds on the following two premises: (i) the clock (clock-tree, flip-flops, state, etc.) is the culprit behind much of the consumed power on a modern chip, and (ii) applications have adequate parallelism to freely trade silicon for performance where needed. The CMA-1 was a typical CGRA mesh architecture, but created without a single clock. The architecture focuses on stream-computing, where a processor presents inputs to the CGRA that – in due time – are computed using the clock-less fabric. The architecture (and its follow-up, CMA-2) is power-efficient, and experiments on taped-out versions showed that the leakage power of the chip could be as low as 1 mW. The CMA architecture manages to reach up to 89.28 GOPS/Watt using a 24-bit data-path. The CMA architecture is still being researched, and recent work has focused on improving performance (through variable-latency pipelines in VPCMA [154]) or further reducing power consumption through body-biasing.

The SYSCORE [155] architecture is another similar architecture that focuses on low power consumption, but leverages dynamic scaling of both voltage and frequency (DVFS) for power-benefits, and uses a fixed-point (and not floating-point) number representation. As with CMA-1/2, it is a 24-bit data-path with a standard mesh-like arrangement of CGRA-tiles.

The i-DPs [156] architecture is an embedded biosignal-processing CGRA that allows multiple configurations to share the CGRA substrate through reconfiguration. The main contribution of i-DPs is an extension of the configuration controller, which can manage RC allocation to different processors. As with most CGRAs in the biosignal domain, the i-DPs runs at a low 2 MHz clock frequency, and shows both runtime and energy-efficiency gains when running ECG applications.

Lopes et al. [157] evaluated a standard mesh-like CGRA for use in real-time biosignal processing. The CGRA they constructed had the additional benefit of being able to power-down sections of the CGRA when unused to extend battery life. Another biomedical CGRA was introduced by Duch et al. [158], and uses a mesh-like composition and a 1 MHz clock-frequency to accelerate electrocardiogram (ECG) analysis kernels.

The Samsung UL-SRP [63] was designed for biomedical applications. The UL-SRP is based on ADRES, and features a hybrid high-power/high-performance mode as well as a low-power/low-performance mode, which covers the different needs and scenarios.

The PULP [159] cluster system features a 16-RC mesh to improve performance and energy-consumption for near-sensor data analytics. The CGRA (called IPA [160]) is standard in its design and adopts most concepts that we have described so far, with the RCs connected in a torus fashion. The RCs are capable of 32-bit operations and feature a discrete power-controller (implementing clock-gating) for reducing energy usage when idle, and the array is capable of running at 100 MHz targeting various image processing kernels.

A different – yet equally interesting – form of power reduction is to extend the CGRA RCs to support approximate computing. X-CGRA [161] is one such example, showing that adopting approximate computing in kernels that are resilient to errors (e.g., image manipulation) can drastically reduce power consumption by up to 3.21× compared to exact methods, with as little as 4% loss in quality.

G. OVERLAYS: CGRAs ON-TOP OF FPGAs
Some of the original incentives for incepting CGRAs were to build faster – still reconfigurable – hardware accelerators, citing FPGAs as inadequate with respect to programmability, performance, and reconfiguration overhead. While FPGA vendors did indeed remedy parts of these problems (even including floating-point DSPs [162] in the fabric), problems with compilation times and reprogramming overhead still persist to this day. By the early 2010s, several research groups had started to experiment with encapsulating the typical fine-grained resources on FPGAs within CGRA-like structures in the hope of including some of their benefits; these architectures came to be known as FPGA overlays.
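The P_dynamic = C · V² · f_clk relation quoted in the low-power discussion above explains why these designs chase low frequencies and voltages so aggressively: the voltage term enters quadratically, so lowering the clock and riding the voltage down with it compounds. The sketch below uses purely illustrative capacitance, voltage, and frequency values (none are taken from the surveyed chips).

```python
def dynamic_power(c_eff, v_dd, f_clk):
    """Dynamic power P = C * V^2 * f for effective switched capacitance
    c_eff (farads), supply voltage v_dd (volts), and clock f_clk (hertz)."""
    return c_eff * v_dd ** 2 * f_clk

# Scaling the clock down 4x (200 MHz -> 50 MHz) while the reduced timing
# pressure lets the supply drop from 1.0 V to 0.6 V cuts dynamic power by
# roughly 11x rather than the 4x that frequency scaling alone would give.
base = dynamic_power(1e-9, 1.0, 200e6)   # 0.2 W
slow = dynamic_power(1e-9, 0.6, 50e6)    # 0.018 W
print(round(base / slow, 1))  # 11.1
```

This compounding is also why near-MHz (or clock-less) operation, DVFS, and body-biasing recur throughout the low-power CGRAs surveyed in this subsection.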


Overlays remedy two large performance and usability problems with FPGAs: compilation times and reconfiguration overheads. It is well-known that compiling a design towards FPGAs is a time-consuming task. Modern FPGAs are larger, with many unique characteristics, making it non-trivial to place and route on them. Furthermore, as FPGAs grow, so too does the memory footprint of the synthesis tools, and it is not uncommon for tools compiling against new devices (e.g., Intel Stratix 10) to consume a large portion of the system memory, effectively restricting the number of parallel compilations possible on a machine. By coarsening and reducing the number of reconfigurable units, compilation times can be significantly reduced (e.g., by nearly three orders of magnitude [163]). Furthermore, FPGAs – once programmed – are often expected to run for a long time, since context switching incurs a relatively large (seconds-long) overhead. Coarsening the units decreases the size of the configurable state, and multiple contexts can be (and are) stored within the same fabric.

QUKU [164]–[166] (Figure 4:a) was one of the earliest CGRA-like overlays on FPGAs, dating back to 2006, and it was actively researched well into 2013. It features a mesh of processing elements, each capable of addition or multiplication, with nearest-neighbor communication and a token-like network. QUKU – as a prototype – was demonstrated on FIR filters, Sobel, and Laplacian operations, showing improvements over a similar software implementation on the soft-core MicroBlaze [167] processor or custom circuitry.

Not strictly a CGRA, ZUMA [168] (Figure 4:b) was an early effort to virtualize the fine-grained resources of an FPGA using a ''virtual FPGA'', for reasons of portability, compatibility, and FPGA-like reconfigurability inside of FPGA designs. Similar to a real FPGA, ZUMA discretized the FPGA into logic clusters that contain a crossbar and K-input Look-Up Tables with an optional flip-flop capturing the output. Each cluster was connected to a switch box that can be programmed to route the data around. The area cost of a virtual FPGA could be as low as only 40% more than the barebone FPGA, demonstrating its benefits. Other (even earlier) work was the Firm-core virtual FPGA [169], and some more recent efforts include the vFPGA [170].

FIGURE 4. (a) The QUKU overlay was among the earliest CGRAs running on an FPGA, and adopted a simple scheme where RCs connect to adjacent tiles and (an optional) address generator for reading/writing external memory data. (b) The ZUMA architecture – while not technically a CGRA – was an FPGA-on-FPGA overlay whose RCs mimicked those of the actual FPGA. (c) The Xilinx DSP block is versatile and controllable without reconfiguration, and has been a popular choice as a basis for CGRA-overlays.

Intermediate fabrics (IFs) [163] coarsen the FPGA logic by creating a mesh of computational elements of varying sizes, such as, for example, multipliers and square-root functions, where small connectivity boxes (routers) govern the traffic throughout the data-path. IFs were evaluated on image processing (stencil) kernels, and overall showed an average 17% drop in clock frequency against a gain of 700× in compilation time over using the FPGA alone.

The CREMA [171] overlay architecture was a follow-up of the Butter architecture (described in section IV-B), and targeted both integer and floating-point arithmetic on FPGAs. It was based around a parametrizable VHDL template and supports changing architectural details such as data width and the number of external routing inputs per PE, albeit the size of the CGRA itself was fixed at 4 × 8 PEs. CREMA can hold multiple contexts and was evaluated on a matrix-vector multiplication kernel, demonstrating performance improvements of up to four times over the Butter accelerator.

The AVATAR [172] CGRA shared many similarities with CREMA but specialized in Fast Fourier Transforms (FFTs) and also had a different layout (4 × 16 vs the 4 × 8 in CREMA). The SCREMA [173] CGRA was an extension to CREMA allowing the size of the CGRA to be parametrized, and focused on accelerating matrix-vector multiplications.

The MIN overlay architecture [174] approaches the CGRA design differently; it uses a one-dimensional strip of processing elements whose outputs are connected to each other through an all-to-all interconnect. Hence, data-flow graphs


are cut and fitted onto the linear array, and different parts of the graph are scheduled in time on the array and the interconnect. Different combinations and compositions of the processing elements were evaluated, and the clock frequency, for the most part, ran at 100 MHz, competitive with soft processor cores at the time. Other, arguably less configurable, overlays follow a similar one-dimensional strip design, such as the VectorBlox MXP Matrix Processor [175]. The FPCA loop-accelerator described earlier was also prototyped on FPGAs. The READY [176] architecture extends the linear array concept further by also having multiple threads running on the overlay.

The Shared-ALU (SALU) based architecture [177] targets FPGA reconfigurability on two levels: the first level is that a system architect designs the core overlay using a set of FIFO banks, shared ALU clusters, and state machines. The second (more abstract) level allows dynamic reconfiguration by changing the microcode in the state machines to control how data flows through the array and what operations the shared ALUs use. SALU is (unlike many others) token-driven. The system was evaluated on a Stratix III platform, reaching 204 MHz and demonstrating an FFT use case.

An example of a layered CGRA overlay for FPGAs is the VDR architecture [178]. Here, computational resources are laid out in a one-dimensional strip where each strip is fully connected to downstream units. Links are unidirectional, and a synchronization protocol guides data throughout the data-path. The VDR architecture runs at a clock frequency of 172 MHz, and was shown to be between 3 and 8 times faster compared to the NiosII processor [179] (a well-known soft-core used in FPGA design). Another architecture similar to VDR is the RALU [180].

A flurry of innovative overlays has been introduced since 2012, all centered around the Digital Signal Processing (DSP, Figure 4:c) blocks of modern FPGAs. The DSP blocks were originally included to allow the use of expensive operations that do not necessarily map well to FPGAs (e.g., multipliers). Since then, DSP blocks have continuously evolved to include more diverse (various-size multiplication, accumulation, etc.) or more complex (e.g., single-precision floating-point arithmetic [162]) functionality. Some of the vendors (e.g., Xilinx) directly expose the interface to control the different functionality of the DSP blocks to the FPGA fabric, and it was not long before the idea to build CGRA architectures around said DSP blocks emerged. reMORPH [181] was one of the early architectures to adopt this style of reasoning. Several different architectures have been explored around the concept of DSPs, including various topologies (e.g., trees [182]) or adaptations of existing architectures (e.g., DySER using DSP blocks [183]). The strengths of these architectures lie in their near-native performance, where small overlays built around DSPs can run at 390 MHz (or higher).

QuickDough [184], [185] is a design framework for using CGRA overlays on FPGAs, specifically targeting them to assist the CPU in accelerating compute-intensive program code. The overlay itself follows the standard layout with a mesh of processing elements, each containing a small instruction memory that sequences the ALU within the processing element. The mesh can interface external memory by enqueuing requests to an address unit. Unique to the architecture is that the two parts (the address unit and the PE mesh) run at two distinctly different frequencies.

Most FPGA overlays presented so far focused exclusively on integer computation. The Mesh-of-FUs [186] was an exception that targeted both integer and floating-point computation. The architecture was similar to other mesh-based approaches, but the work demonstrated high (at the time) performance capabilities of FPGAs also for floating-point operations, reaching nearly 20 GFLOP/s on a Stratix IV [187] device. Using floating-point processing elements seems to incur a 33% area overhead, yielding a smaller CGRA mesh, and also an (arguably negligible) 13% reduction in clock frequency.

A different overlay architecture that targets floating-point operations was the TILT array [188], [189]. The TILT array architecture was conceptually very similar to the MIN overlay. A linear array of processing elements was arranged to communicate with an all-to-all crossbar, which saves the state into an on-chip RAM and relays information to the computation in the next cycle. The authors illustrated the benefits of TILT over High-Level Synthesis (OpenCL) with both comparable performance and improved productivity, reaching an operating frequency of up to 387 MHz on a Stratix V [190]. The URUK [191] architecture took a different approach to how the ALUs inside the overlay should be implemented. Rather than having a fixed function, URUK leveraged partial reconfiguration [192], changing the RCs' functionality throughout time.

Finally, tools for automatically creating CGRA overlays for FPGAs have emerged, such as the Rapid Overlay Builder [193] and CGRA-ME [61], which simplify generation (and in the case of CGRA-ME also compilation) of applications and overlays.

An interesting observation is that out of the 27 CGRA overlay architectures that we surveyed herein, 17 chose Xilinx FPGAs as the target platform while 10 focused on Intel (then Altera) FPGAs, with some studies using both. There seems to be a favoring of Xilinx architectures, which we believe is due to the more dynamic control that Xilinx offers in their DSP blocks compared to Intel. On the other side, Intel DSPs have (starting from Arria 10 onwards) hardened support for IEEE-754 single-precision floating-point operations, encouraging research on floating-point capable FPGA overlays.

V. CGRA TRENDS AND CHARACTERISTICS
A. METHOD AND MATERIALS
For all previously surveyed and summarized work, we collected several metrics associated with each study. These were:

1) Year of publication,


2) Size of the CGRA array in terms of unique RCs,
3) Data-path width of the CGRA (e.g., MATRIX operates on 4-bit while RaPiD operates on 16-bit),
4) Clock frequency of operation (fmax) in MHz as reported in the study,
5) Power consumption in Watt. For studies that empirically measured this metric, we collected the (benchmark, power) tuple. Otherwise, we used what is reported in the study (often the post place-and-route power estimation),
6) Technology (in nm) of the architecture when either taped out on silicon, or the standard cell library used with the EDA tools,
7) Area (mm2) of the fully synthesized chip as reported in the study. In some cases, we had to manually calculate it based on the individual RC size or based on the gates used (after verification with authors). For FPGAs, we used the chip (BGA) package size and assumed a chip-to-die ratio of 7:1, as has been reported in [194].
8) Peak performance, including peak operations per second (OP/s), peak multiply-accumulates per second (MAC/s), and peak floating-point operations per second (FLOPS), as reported in the paper. We differentiate between integer MAC/s and OP/s because some architectures (e.g., EGRA) do not balance them, leading to a large theoretical OP/s but not a proportionally large MAC/s.
9) Obtained performance out of the theoretical peak (%). We used what the authors reported. For those cases where authors did not report obtained performance (e.g., only reported absolute time), we derived this metric manually where applicable (otherwise we do not include it), such as for example when the authors report both the input dimension and the execution time (in seconds or cycles) of known applications such as (non-Strassen) matrix-multiplication, FIR-filters, matrix-vector multiplication, etc.

For items 8-9, we ignored studies that only reported relative performance improvements, as it is hard to reason around the performance of a baseline unless explicitly stated. All metrics included have either been directly reported in the seminal publication, have been verified by the authors, or we were confident in our understanding of the architecture to derive them ourselves. We positioned and related our obtained CGRA characteristics against those of modern GPUs. We used NVIDIA GPUs as references, with data collected from [195] and integer performance calculated using methods described in [196].

VI. OVERALL ARCHITECTURAL TRENDS
Figure 5 shows an overview of how CGRAs have changed over time with respect to various metrics. The total number of RCs, as a function of the respective publication year, is shown in Figure 5:a. We see that a majority of CGRAs are quite small (median: 64 RCs) and even smaller for FPGA-based CGRAs (median: 25 RCs). This is in line with the reasoning that most CGRAs focus on small kernels in the embedded application domain, honoring ILP rather than other forms of parallelism (e.g., thread- or task-level). There are several exceptions to this, such as Garp, which was an early CGRA that used 1/2-bit reconfigurable data-paths and thus needed a large number of RCs to implement various functionality. The other exception is Tartan, where the authors' largest evaluated version is up to 25,600 RCs, making it likely the largest CGRA ever simulated; this awe-inspiring size was reached by severely restricting the functionality of the RCs (e.g., there is no multiplication support). Thirdly, the Plasticine architecture can have up to 6208 RCs of varying sorts. Figure 5:b shows the transistor scaling of CGRAs and Nvidia GPUs. As expected, the transistor dimensions have continuously shrunk, as predicted by Moore. Note, however, that both FPGAs and GPUs are (on average) one transistor generation ahead of CGRAs, likely due to most CGRAs being developed by academia and thus restricted to those standard cell libraries available at the time (which usually are not the most recent).

Figure 5:c shows the area of the CGRAs as reported either by the ASIC synthesis tools, estimation by authors, or by the final taped-out chip. We also include the full die sizes of the FPGAs (that FPGA-based CGRAs have access to), and we position these against the die-size of modern Nvidia GPUs. We can see that the trend of CGRA research is – as with the size of CGRAs – to favor smaller CGRAs, and the median size of the CGRAs is around 13 mm2. Compared to GPUs, which have monotonically increased their size over time, CGRAs have almost done the inverse, and decreased in size. There are two major exceptions: the first is the Imagine architecture, which reported an amazing size of 1000 mm2 (later 144 mm2 in the follow-up paper six years later) – larger than any CGRA or GPU reported to this day. The other larger architecture is the CUDA-programmable SGMF at 800 mm2.

Figure 5:d shows how the reported power consumption of ASIC-based CGRAs has changed over time, and is compared to the Thermal Design Power (TDP) reported for Nvidia GPUs. The CGRAs are experiencing an on-average exponential decrease in power consumption, which is likely due to smaller standard cell libraries coupled with small CGRA size (Figure 5:a,c,d). On the other hand, new generations of Nvidia GPUs tend to draw more power than older generations. The most power-consuming CGRA, out of those reporting, is the Plasticine architecture, consuming a maximum of 49 Watt, followed by Raw at 18.2 Watt and IPA at 11.26 Watt. On the opposite side, we find architectures that target bio-signal processing, where power efficiency is critical. Examples include the 167-core processor (60 MHz version) at 99 mW, SYSCORE at 66.3 mW, CMA-1 at 11.2 mW, and CCSOTB2 at 3.45 mW.

Figure 5:e shows how the clock-frequency has changed throughout time for both ASIC and FPGA-based CGRAs, as well as Nvidia GPUs. The GPUs have increased their clock frequency, starting off well around where CGRAs


FIGURE 5. Architectural trends of CGRAs (both ASIC and FPGAs), showing how different metrics have changed as a function of time, where possible positioned against that of Nvidia GPUs (dotted lines are fit to data to show trend).

operated in the early 2000s to the high-frequency devices we see today. The CGRAs, as one would expect (following the trend of previous graphs), took a different path and instead focused on low power (and thus low frequency). The average CGRA will run around 200 MHz, with the frequency slowly increasing as a function of better and smaller standard cell libraries. There are but few CGRAs that operate at high frequency, and these include the loop-accelerator SEED at 2 GHz, the GPU-like SGMF at 1.4 GHz, the 167-core processor (high-frequency version) at 1.17 GHz, and both Plasticine and Wavescalar at 1 GHz. At the opposite edge, we find CGRAs with very low frequency, such as ULP-SRP at (as low as) 7 MHz and the Bio-CGRA [158] at 1 MHz. The FPGA-based CGRAs have less of an opportunity to tune for frequency, as they are often bound by limitations in the fabric itself; however, it is interesting to see that the operating frequency of FPGA-based CGRAs is rivaling most of the ASIC CGRAs.

Figure 5:f shows the data-path width that CGRA research tends to adopt. Most architectures adopt either a 16-bit (30%) or 32-bit (53%) data-path width; those targeting a 16-bit data-path are usually more tailored towards a specific application, such as telecommunication or deep-learning, while those that target 32-bit (or beyond) are


FIGURE 6. Integer and floating-point characteristics of CGRAs as they appear in the scientific literature, and compared with historical data of Nvidia GPUs for 32-bit integer and single-precision performance.
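Several of the per-architecture numbers behind these graphs are derived rather than directly reported (item 9 in Section V-A): when a study reports only the input dimension and execution time of a known kernel, the obtained fraction of peak is computed manually. A minimal sketch of that derivation for a (non-Strassen) matrix-multiplication, using hypothetical numbers rather than values from any surveyed CGRA, could look as follows:

```python
def fraction_of_peak(n, seconds, peak_ops_per_s):
    """Derive obtained performance for an n x n x n matrix-multiplication.

    A non-Strassen matrix-multiplication performs n**3 multiply-add pairs,
    i.e. 2 * n**3 arithmetic operations in total.
    """
    achieved_ops_per_s = 2 * n**3 / seconds       # OP/s actually delivered
    return achieved_ops_per_s / peak_ops_per_s    # fraction of theoretical peak

# Hypothetical example: a 1024^3 matmul finishing in 0.5 s
# on an architecture with a 10 GOP/s theoretical peak.
frac = fraction_of_peak(1024, 0.5, 10e9)
print(f"{100 * frac:.1f}% of peak")  # prints: 42.9% of peak
```

The factor 2·n³ counts one multiplication and one addition per inner-product step; the same pattern applies to FIR filters or matrix-vector products with their respective operation counts.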

more general-purpose. A few (11.5%) target 8-bit architectures, but often have support for chaining 8-bit operations for 16-bit use. MATRIX and Garp target very fine-grained reconfigurability, with 4-bit and 1-/2-bit, respectively. Despite this, we can expect future architectures to include more support for low or hybrid precision, since it is a reliable way of obtaining more performance while mitigating memory-boundness for applications that permit it.

Figure 5:g shows the power consumption of CGRAs and GPUs as a function of their respective die sizes. This graph complements the graph in Figure 5:d to show that the low power consumption of CGRAs is mainly because they are small, with (out of those CGRAs that report both power and area) only Plasticine coming closer to the trend of GPUs.

Finally, Figure 5:h shows how the individual RC area has changed throughout time, and we see that the size of RCs has been following the technology scaling, and continuously decreased in size. However, when normalizing the CGRAs' manufacturing technology to that of 16 nm, we actually noticed a different trend, where the area of individual RCs is increasing, due to incorporating more complex elements (such as wider data-paths, more complex arithmetic units, etc.).

VII. INTEGER AND FLOATING-POINT PERFORMANCE ANALYSIS
Figure 6 overviews data associated with the pure performance of the CGRAs, often positioned against that of Nvidia GPUs.

Figure 6:a-f shows the obtained integer performance. Here we distinguish between operations and MAC-based operations in order to reveal architectures that are starved of multipliers. For example, the Tartan CGRA can execute a large number of operations per unit time, but has no support for multiplications, leading to a very low comparable multiply-add performance. Figure 6:a-b show the Garp and MATRIX architectures as the sole candidates for low-precision arithmetic, and that while both of these have comparably high performance (for their time), their multiplication (MAC) performance is lacking (in Garp, the overhead was 32× compared to an addition). Figure 6:c shows 8-bit integer performance, which has recently been of interest to the deep-learning inference community, and where the next-generation Xilinx Versal architecture will be capable of reaching thousands of GOP/s of 8-bit integer performance. Figure 6:d shows 16-bit integer performance, showing a continuous growth over the years. Note how the Tartan architecture claims to reach similar performance levels to the upcoming Xilinx Versal CGRA, despite being more than a


decade old. Figure 6:e is a special case, since only a few CGRAs (e.g. Cool Mega-Array 1 and SYSCORE) support them. Despite their low visible performance, these devices are actually very power-efficient (see the next section for discussion). Finally, 32-bit integer performance is shown in Figure 6:f, where we also included Nvidia GPU integer performance for comparison. We see that CGRAs have historically been comparable in performance to that of Nvidia GPUs, and even FPGAs are becoming an effective way of obtaining integer performance through CGRAs.

Figure 6:g shows the peak floating-point performance that CGRAs reported over the years. The small number of floating-point capable CGRAs prohibits us from drawing any reasonable trend-line, unlike the one for GPUs that grows exponentially with years (together with the die-area, see Figure 5:c). However, those CGRAs that do include floating-point units can compete with the performance of modern GPUs – sometimes even outperform them. For example, the Plasticine architecture is capable of delivering 12.3 TFLOP/s of performance, rivaling GPUs from that generation, and the earlier Redefine and SGMF architectures could deliver 300 and 840 GFLOP/s respectively. Even earlier, the WaveScalar architecture was capable of 128 GFLOP/s, which was well ahead of GPUs at that time. At a lower performance, we find architectures such as FloRA (600 MFLOP/s) and the loop-accelerator VEAL (5.4 GFLOP/s). Finally, the Xilinx Versal AI engines will come in different configurations, three shown in the graph (at 3 TFLOP/s, 5 TFLOP/s, and 8 TFLOP/s).

Figure 6:h shows the distribution over the number of CGRAs that support floating-point versus those that support integer computations. Floating-point support is clearly underrepresented, and only one in four architectures supports floating-point arithmetic.

VIII. PERFORMANCE USAGE ANALYSIS
Figure 7:a shows the number of instructions-per-cycle (IPC) that applications/benchmarks experienced when executing on different CGRAs. We see that a majority of CGRAs operate in a fairly low-performance domain, primarily due to their size, and most execute around 12 IPC (median). There are corner cases, such as the Rhyme-Redefine architecture, which aims to explore CGRAs in High-Performance Computing, reaching 300+ IPC on selected workloads, or the Deep-Learning SDT-CGRA architecture on inference, reaching 172 IPC. Similarly, Eyeriss is capable of occupying 100% of its resources when inferring AlexNET, yielding an astounding 700+ IPC. Most FPGA-based CGRAs also execute less than 100 IPC; this is primarily because the size of most FPGA CGRAs is rather small (see previous Figure 5:a).

Figure 7:b shows the performance that applications experience when running on different CGRA architectures as a function of topology size, where we group CGRA architectures into three groups: small (<16), medium (16-64), and large (>64). As is expected, we see that the performance and obtained IPC grow as the architectures become larger, where applications running on small-sized CGRAs commonly experience (median values) 7.96 IPC, 13.8 IPC on medium-sized CGRAs, and 58.4 IPC on large ones, with outliers being capable of reaching much more. A complementary graph is seen in Figure 7:c, where we see the obtained performance as a fraction of the raw peak performance. The graph reveals that reaching peak performance becomes harder as the CGRA architecture grows in size, where small architectures commonly reach 25% and more of the peak performance while large architectures reach 18.2%, and medium-sized architectures reach 23.4%. Most of these architectures, as we will discuss later, rely primarily on software pipelining as their prime source of parallelism, which might not necessarily be able to fill architectures to their maximum.

Figure 7:d shows compute densities and how they have evolved throughout CGRA research history. For comparison, we included Nvidia GPU performance for 32-bit integer and single-precision floating-point. Overall, the compute densities of CGRAs are in fact comparable to GPUs, and often even surpass them. For example, most CGRAs that support 32-bit integer operations consistently pack more compute per mm2 than GPUs. This also holds true for some CGRAs with floating-point support, such as Plasticine. Overall, it makes sense that CGRAs pack more compute per unit area, since from an architectural perspective, a CGRA trades much of the silicon used for orchestration (instruction memory, caches, control-planes, etc.) for more compute. The compute densities correlate with the transistor size, and the more recent architectures have a clearer advantage. Note that CGRAs are likely to have an even higher benefit than what the graph shows, since their transistor generation usually lags one generation behind that of commercial GPUs.

Figure 7:e shows how many operations can be performed per unit power (OPs/Watt) given the architecture. This could arguably be the most important metric, as it serves in part to remedy dark silicon and is also a well-known metric for how power-efficient an architecture/system is (e.g., Green500 [197]). Compared to GPUs, a majority of CGRAs can execute more operations – integer and floating-point alike – per unit power. Some architectures, such as the Plasticine, can be up to two orders of magnitude more power-efficient than GPUs, which offsets any criticism that using TDP as a proxy for GPU power consumption may incur. Similar to the reasoning on compute densities, technology scaling impacts power consumption as well, and in reality it is likely that the power efficiency is even better than what the graph illustrates.

Figure 7:f-g shows how the area- and power-densities of compute change with respect to frequency. As expected, we see that the CGRAs operate largely in a different region than GPUs – they have high power efficiency while operating at a lower frequency.

The Nvidia V100 is one of the largest commercial accelerators readily available today, featuring an 815 mm2 die built on 12 nm technology. The peak performance of the V100 reaches 15.67 TFLOP/s and 13.43 TOP/s (32-bit). We scaled


FIGURE 7. Performance with respect to time, power, and area of the CGRAs as reported in the scientific literature, compared to that of Nvidia GPUs. The last figure (h) projects the performance of various CGRAs assuming scaling similar to that experienced by the GPUs.

historical CGRAs that reported both area and performance to that of the Nvidia V100 by scaling up the respective chip areas up to the V100's 815 mm2 and also increasing the performance of the chip as a function of the increased transistor densities (obtained by moving to 12 nm). The transistor densities we used are the same as those that Nvidia GPUs experienced and can be seen in Figure 8. Albeit this analysis does have several drawbacks, such as power- and thermal-effects (subject to dark silicon [198]) being unaccounted for and frequency being un-scaled (we assumed original CGRA frequencies), it does provide a perspective on the question: what if CGRAs were on the same playing field as GPUs? The results are seen in Figure 7:h. We see that most CGRA architectures would still be below the performance of a V100 (primarily due to the lower clock frequency). However, those architectures whose target indeed is to provide high performance, such as HyCube, WaveScalar, SGMF, CGRA-ME, and even the older KressArray, show comparable performance to the V100 for either 32-bit integer, floating-point, or both. The highest obtained level is demonstrated by Plasticine, which – if given an 815 mm2 design built on 12 nm technology – could reach hundreds of TFLOP/s of performance. While this extrapolation is indeed very limited and ignores many properties, it aims to show that CGRAs have the architectural capability


of competing with modern GPU designs, assuming we can fully utilize these (potentially over-provisioned) computing resources.

FIGURE 8. Transistor-density scaling as experienced by Nvidia GPU chips, which we used to scale existing CGRAs from various times in history.

IX. DISCUSSION AND CONCLUSION
As we saw in the previous section, a vast majority of CGRAs are fairly small and run at a (comparably) low frequency. This, in turn, leads to very power-efficient designs, allowing CGRAs to be placed into embedded devices such as mobile phones or wearables and operate for hours. This power efficiency, with respect to the performance they provide, allows CGRAs to compete with (and possibly outperform) GPUs, which in turn could lead to a partial remedy for dark silicon. Based on the analysis herein, we argue that CGRAs should be considered a serious competitor to GPUs, particularly in a future post-Moore era when power-efficiency becomes more important.

However, in order to reap the better compute densities and better power efficiency that CGRAs offer, larger architectures must be more thoroughly researched. Larger CGRAs, particularly those aimed towards aiding or accelerating general-purpose computing, will be challenging to keep occupied. As we saw in the final graph, CGRAs scaled to the level of an Nvidia V100 will potentially have a peak performance consisting of hundreds of teraflops, but the main question is: will we be able to map and fully utilize all those computing resources for anything but the most trivial kernels?

Several authors have already pointed out that in order to harvest larger CGRAs, we need to complement current ways of extracting instruction parallelism (primarily modulo scheduling) with other forms of concurrency. While modern CGRAs (e.g. Plasticine, SIMD-RA) do exploit SIMD-level parallelism, there will, without doubt, be the need for further research: support for programming models such as CUDA (SGMF moves towards this direction), multi-threading, or even multi-tasking (e.g. OpenMP [199]) should be more aggressively pursued from both an architectural and a programmability viewpoint. For example, the recently added task dependencies in frameworks such as OpenMP and OmpSs [200] match very well with clustered CGRAs that have islands of both compute and scratchpad, where the dependencies would dictate how data would flow on these CGRAs (exploiting both inter- and intra-task level parallelism and data locality).

Another limitation of existing architectures is the application domains they accelerate. A large majority of CGRAs target embedded applications such as filters, stencils, decoders, etc. Studies that integrate the CGRA into the back-end of a processor (e.g., TRIPS, DySER) tend to have a more diverse set of benchmarks available, and those studies (e.g., Tartan, SEED, SGMF) that rely only on simulation (without hardware being developed) have the richest set of application support. Despite this, CGRAs suffer from a similar problem that current FPGAs struggle with: we limit our studies to small, simple kernels, rather than studying the impact of these architectures on more complex applications. To give a concrete example, there is no reconfigurable architecture that has seriously considered many of the proxy applications that drive HPC system procurement, such as for example the RIKEN [201] or ECP benchmark suites [202]. For FPGAs and High-Level Synthesis, this might make sense, since there is always the danger that these large kernels might not fit onto a single FPGA; CGRAs, however, can store multiple contexts and kernels with little overhead in switching between them, opening up possibilities for executing whole applications as well as opportunities to exploit inter-kernel temporal and spatial data locality.

A different challenge with the present (and similar future) surveys lies in the amount of reporting by the different studies. For example, studies that apply a simulation methodology often have a broader benchmark coverage, but fail to report hardware details (e.g., area or RTL information). At the same time, many CGRAs that were actually implemented in hardware (or RTL) do report area and power consumption, but limit the benchmark selection and information. This leads to gaps in the graphs, where a high-performance CGRA candidate is represented in one graph (e.g., peak performance), but is absent from another graph (e.g., area), in turn limiting our analysis. This could most clearly be seen in those graphs that use derived metrics, such as performance per power (OPs/W). Similarly, many papers often report relative performance improvements, rather than absolute numbers, leading to difficulty in reasoning around performance across a wide range of CGRAs. We would recommend authors of CGRA papers to, as far as possible, include details on all above listed aspects.

Finally, an important limitation of this study is the tool-agnostic view that we took. Clearly, different CGRAs come with tools of varying maturity, and we must


acknowledge that even if certain CGRAs could theoretically operate very well, their performance would be a function of how well the tools (e.g., compiler) managed to schedule an application onto them – the tool could even influence the performance more than the CGRA hardware itself. The maturity in tooling infrastructure could also very well explain why some CGRAs were more successful than others. Future work would be advised to (try to) factor out the quality of tools in similar studies to provide an even more accurate view of different CGRAs.

Overall, this survey has shown that there is plenty of room for CGRA research to grow and to continue to be an active research subject for use in future architectures, particularly striving to design high-performance CGRAs that aim at niche or general-purpose computation at scale. As transistor dimensions stop shrinking and Moore's law no longer allows us the architectural freedom of carelessly spending silicon, reconfigurable architectures such as CGRAs might excel at providing performance in a post-Moore era.

ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers who have helped improve this survey paper. This article is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).

REFERENCES
[1] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, pp. 256–268, Oct. 1974.
[2] G. E. Moore, "Cramming more components onto integrated circuits," Proc. IEEE, vol. 86, no. 1, pp. 82–85, Jan. 1998.
[3] T. N. Theis and H.-S. P. Wong, "The end of Moore's law: A new beginning for information technology," IEEE Comput. Sci. Eng., vol. 19, no. 2, pp. 41–50, Oct. 2017.
[4] J. S. Vetter, E. P. DeBenedictis, and T. M. Conte, "Architectures for the post-Moore era," IEEE Micro, vol. 37, no. 4, pp. 6–8, Dec. 2017.
[5] C. D. Schuman, J. D. Birdwell, M. Dean, J. Plank, and G. Rose, "Neuromorphic computing: A post-Moore's law complementary architecture," Oak Ridge National Lab., Oak Ridge, USA, Tech. Rep., 2016. [Online]. Available: https://neuromorphic.eecs.utk.edu/publications/2016-11-14-neuromorphic-computing-a-post-moores-law-complementary-architecture/
[6] R. Tessier and W. Burleson, "Reconfigurable computing for digital signal processing: A survey," J. VLSI Signal Process., vol. 28, nos. 1–2, pp. 7–27, May 2001.
[7] J. Gray, "GRVI phalanx: A massively parallel RISC-V FPGA accelerator accelerator," in Proc. IEEE 24th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), May 2016, pp. 17–20.
[8] G. Wang, B. Yin, K. Amiri, Y. Sun, M. Wu, and J. R. Cavallaro, "FPGA prototyping of a high data rate LTE uplink baseband receiver," in Proc. 43rd Asilomar Conf. Signals, Syst. Comput., 2009, pp. 248–252.
[9] I. Kuon, R. Tessier, and J. Rose, "FPGA architecture: Survey and challenges," Found. Trends Electron. Des. Autom., vol. 2, no. 2, pp. 135–253, 2007.
[10] C. Yang, T. Geng, T. Wang, C. Lin, J. Sheng, V. Sachdeva, W. Sherman, and M. Herbordt, "Molecular dynamics range-limited force evaluation optimized for FPGAs," in Proc. IEEE 30th Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2019, pp. 263–271.
[11] A. Podobas, H. R. Zohouri, N. Maruyama, and S. Matsuoka, "Evaluating high-level design strategies on FPGAs for high-performance computing," in Proc. 27th Int. Conf. Field Program. Log.
[12] H. R. Zohouri, N. Maruyama, A. Smith, M. Matsuda, and S. Matsuoka, "Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2016, pp. 409–420.
[13] H. R. Zohouri, A. Podobas, and S. Matsuoka, "Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 153–162.
[14] A. Podobas and S. Matsuoka, "Designing and accelerating spiking neural networks using OpenCL for FPGAs," in Proc. Int. Conf. Field Program. Technol. (ICFPT), Dec. 2017, pp. 255–258.
[15] T. Miyamori and K. Olukotun, "REMARC: Reconfigurable multimedia array coprocessor," IEICE Trans. Inf. Syst., vol. 82, no. 2, pp. 389–397, 1999.
[16] A. Agarwal, S. Amarasinghe, R. Barua, M. Frank, W. Lee, V. Sarkar, D. Srikrishna, and M. Taylor, "The raw compiler project," in Proc. 2nd SUIF Compiler Workshop, 1997, pp. 1–7.
[17] J. R. Hauser and J. Wawrzynek, "Garp: A MIPS processor with a reconfigurable coprocessor," in Proc. 5th Annu. IEEE Symp. Field-Program. Custom Comput. Mach., 1997, pp. 12–21.
[18] S. Ahmad, S. Subramanian, V. Boppana, S. Lakka, F.-H. Ho, T. Knopp, J. Noguera, G. Singh, and R. Wittig, "Xilinx first 7nm device: Versal AI core (VC1902)," in Proc. IEEE Hot Chips Symp. (HCS), Aug. 2019, pp. 1–28.
[19] K. E. Fleming, K. D. Glossop, and S. C. Steely, "Apparatus, methods, and systems with a configurable spatial accelerator," U.S. Patent 10 445 250, Oct. 15, 2019.
[20] S. Brown, "FPGA architectural research: A survey," IEEE Des. Test Comput., vol. 13, no. 4, pp. 9–15, Oct. 1996.
[21] P. H. Wang, S. Steibl, H. Wang, J. D. Collins, C. T. Weaver, B. Kuttanna, S. Salamian, G. N. Chinya, E. Schuchman, O. Schilling, and T. Doil, "Intel atom processor core made FPGA-synthesizable," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, 2009, pp. 209–218.
[22] H. Amano, "A survey on dynamically reconfigurable processors," IEICE Trans. Commun., vol. E89-B, no. 12, pp. 3179–3187, Dec. 2006.
[23] V. Tehre and R. Kshirsagar, "Survey on coarse grained reconfigurable architectures," Int. J. Comput. Appl., vol. 48, no. 16, pp. 1–7, Jun. 2012.
[24] R. Hartenstein, "A decade of reconfigurable computing: A visionary retrospective," in Proc. Design, Autom. Test Europe Conf. Exhib., 2001, pp. 642–649.
[25] G. Theodoridis, D. Soudris, and S. Vassiliadis, "A survey of coarse-grain reconfigurable architectures and CAD tools," in Fine- and Coarse-Grain Reconfigurable Computing. Dordrecht, The Netherlands: Springer, 2007, pp. 89–149. [Online]. Available: https://link.springer.com/chapter/10.1007/978-1-4020-6505-7_2
[26] M. Wijtvliet, L. Waeijen, and H. Corporaal, "Coarse grained reconfigurable architectures in the past 25 years: Overview and classification," in Proc. Int. Conf. Embedded Comput. Syst., Jul. 2016, pp. 235–244.
[27] L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei, "A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications," ACM Comput. Surv., vol. 52, no. 6, p. 118, 2019.
[28] J. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, T. Gross, F. Baskett, and J. Gill, "MIPS: A microprocessor architecture," ACM SIGMICRO Newslett., vol. 13, no. 4, pp. 17–22, 1982.
[29] T. J. Callahan, J. R. Hauser, and J. Wawrzynek, "The Garp architecture and C compiler," Computer, vol. 33, no. 4, pp. 62–69, Apr. 2000.
[30] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, and B. Hutchings, "A reconfigurable arithmetic array for multimedia applications," in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, 1999, pp. 135–143.
[31] T. Stansfield, "Using multiplexers for control and data in D-fabrix," in Proc. Int. Conf. Field Program. Logic Appl., 2003, pp. 416–425.
[32] E. Waingold, M. Taylor, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, S. Devabhaktuni, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal, "Baring it all to software: The raw machine," IEEE Comput., vol. 30, no. 9, pp. 86–93, 1997, doi: 10.1109/2.612254.
[33] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The raw microprocessor: A computational fabric for software circuits
Appl. (FPL), and general-purpose programs,’’ IEEE Micro, vol. 22, no. 2, pp. 25–35, Sep. 2017, pp. 1–4. Mar. 2002.

146738 VOLUME 8, 2020 A. Podobas et al.: Survey on CGRAs From a Performance Perspective

[34] M. B. Taylor, ‘‘Evaluation of the raw microprocessor:An exposed- [55] B. Mei, A. Lambrechts, D. Verkest, J. Mignolet, and R. Lauwereins, wire-delay architecture for ILP and streams,’’ ACM SIGARCH Comput. ‘‘Architecture exploration for a reconfigurable architecture template,’’ Archit., vol. 32, no. 2, p. 2, 2004. IEEE Design Test Comput., vol. 22, no. 2, pp. 90–101, Feb. 2005. [35] S. Bell, ‘‘TILE64-processor: A 64-core SoC with mesh interconnect,’’ [56] J. A. Fisher, ‘‘Very long instruction word architectures and the in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, ELI-512,’’ in Proc. 10th Annu. Int. Symp. Comput. Archit., 1983, Dec. 2008, pp. 88–598. pp. 140–150. [36] T. Miyamori and U. Olukotun, ‘‘A quantitative analysis of reconfig- [57] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, urable for multimedia applications,’’ in Proc. IEEE Symp. ‘‘DRESC: A retargetable compiler for coarse-grained reconfigurable FPGAs Custom Comput. Mach., 1998, pp. 2–11. architectures,’’ in Proc. IEEE Int. Conf. Field-Program. Technol. (FPT), [37] E. Mirsky, ‘‘MATRIX: A reconfigurable computing architecture with 2002, pp. 166–173. configurable instruction distribution and deployable resources,’’ in Proc. [58] K. Wu, A. Kanstein, J. Madsen, and M. Berekovic, ‘‘MT-ADRES: FCCM, vol. 96, 1996, pp. 17–19. Multithreading on coarse-grained reconfigurable architecture,’’ in Proc. [38] G. Lu, H. Singh, M.-H. Lee, N. Bagherzadeh, F. Kurdahi, and Int. Workshop Appl. Reconfigurable Comput., 2007, pp. 26–38. M. E. Filho, ‘‘The morphoSys parallel reconfigurable system,’’ in Proc. [59] F. Bouwens, M. Berekovic, A. Kanstein, and G. Gaydadjiev, ‘‘Architec- Eur. Conf. Parallel Process., 1999, pp. 727–734. tural exploration of the ADRES coarse-grained reconfigurable array,’’ [39] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and in Proc. Int. Workshop Appl. Reconfigurable Comput., 2007, pp. 1–13. E. M. C. 
Filho, ‘‘MorphoSys: An integrated reconfigurable system for [60] M. Karunaratne, A. K. Mohite, T. Mitra, and L.-S. Peh, ‘‘HyCUBE: data-parallel and computation-intensive applications,’’ IEEE Trans. A CGRA with reconfigurable single-cycle multi-hop interconnect,’’ in Comput., vol. 49, no. 5, pp. 465–481, May 2000. Proc. 54th Annu. Des. Autom. Conf., Jun. 2017, pp. 1–6. [40] M. Jo, D. Lee, and K. Choi, ‘‘Chip implementation of a coarse-grained [61] S. A. Chin, N. Sakamoto, A. Rui, J. Zhao, J. H. Kim, Y.Hara-Azumi, and reconfigurable architecture supporting floating-point operations,’’ in J. Anderson, ‘‘CGRA-ME: A unified framework for CGRA modelling Proc. Int. SoC Des. Conf., vol. 3, 2008, p. 29. and exploration,’’ in Proc. IEEE 28th Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP) [41] C. Ebeling, D. C. Cronquist, and P. Franklin, ‘‘RaPiD—Reconfigurable , Jul. 2017, pp. 184–189. pipelined ,’’ in Proc. Int. Workshop Field Program. Logic Appl., [62] M. J. P. Walker and J. H. Anderson, ‘‘Generic connectivity-based CGRA 1996, pp. 126–135. mapping via integer linear programming,’’ in Proc. IEEE 27th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), Apr. 2019, [42] C. Ebeling, D. C. Cronquist, P. Franklin, J. Secosky, and S. G. Berg, pp. 65–73. ‘‘Mapping applications to the RaPiD configurable architecture,’’ in Proc. 5th Annu. IEEE Symp. Field-Program. Custom Comput. Mach., [63] C. Kim, M. Chung, Y.Cho, M. Konijnenburg, S. Ryu, and J. Kim, ‘‘ULP- 1997, pp. 106–115. SRP: Ultra low power samsung reconfigurable processor for biomed- ical applications,’’ in Proc. Int. Conf. Field-Programmable Technol., [43] R. W. Hartenstein, R. Kress, and H. Reinig, ‘‘A new FPGA architecture Dec. 2012, pp. 329–334. for word-oriented ,’’ in Proc. Int. Workshop Field Program. [64] J.-e. Lee, K. Choi, and N. D. Dutt, ‘‘Compilation approach for coarse- Logic Appl. 1994, pp. 144–155. grained reconfigurable architectures,’’ IEEE Des. Test. Comput., vol. 20, [44] R. W. 
Hartenstein, R. Kress, and H. Reinig, ‘‘A reconfigurable data- no. 1, pp. 26–33, Jan. 2003. driven ALU for xputers,’’ in Proc. IEEE Workshop FPGAs for Custom [65] J.-e. Lee, K. Choi, and N. D. Dutt, ‘‘Evaluating memory architectures Comput. Mach., 1994, pp. 139–146. for media applications on coarse-grained reconfigurable architectures,’’ [45] R. W. Hartenstein, M. Herz, T. Hoffmann, and U. Nageldinger, ‘‘Using in Proc. IEEE Int. Conf. Appl.-Specific Syst., Archit., Processors., 2003, the Kress array for reconfigurable computing,’’ in Proc. Configurable pp. 172–182. Comput., Oct. 1998, pp. 150–161. [66] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, [46] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, ‘‘CHIMAERA: C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, and W. Yoder, ‘‘Scaling A high-performance architecture with a tightly-coupled reconfigurable to the end of silicon with EDGE architectures,’’ Computer, vol. 37, no. 7, functional unit,’’ ACM SIGARCH Comput. Archit., vol. 28, no. 2, pp. 44–55, Jul. 2004. pp. 225–235, 2000. [67] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, [47] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, ‘‘MediaBench: N. Ranganathan, D. Burger, S. W. Keckler, R. G. McDonald, and A tool for evaluating and synthesizing multimedia and communica- C. R. Moore, ‘‘TRIPS: A polymorphous architecture for exploiting tions systems,’’ in Proc. 30th Annu. Int. Symp. Microarchitecture, 1997, ILP, TLP, and DLP,’’ ACM Trans. Archit. Code Optim., vol. 1, no. 1, pp. 330–335. pp. 62–93, 2004. [48] S. Kumar, L. Pires, S. Ponnuswamy, C. Nanavati, J. Golusky, M. Vojta, [68] K. Sankaralingam, V. Ajay Singh, S. W. Keckler, and D. Burger, S. Wadi, D. Pandalai, and H. Spaanenberg, ‘‘A benchmark suite for eval- ‘‘Routed inter-ALU networks for ILP and performance,’’ in uating configurable computing systems–status, reflections, and future Proc. 21st Int. Conf. Comput. Des., 2003, pp. 170–177. directions,’’ in Proc. ACM/SIGDA 8th Int. 
Symp. Field Program. Gate [69] A. Smith, J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, Arrays, 2000, pp. 126–134. K. S. McKinle, and J. Burrill, ‘‘Compiling for EDGE architectures,’’ in [49] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and Proc. Int. Symp. Code Gener. Optim., 2006, p. 11. R. R. Taylor, ‘‘PipeRench: A reconfigurable architecture and compiler,’’ [70] B. Yoder, ‘‘Software infrastructure and tools for the TRIPS proto- Computer, vol. 33, no. 4, pp. 70–77, Apr. 2000. type,’’ in Proc. Workshop on Modeling, Benchmarking Simulation, 2007, [50] H. Schmit, D. Whelihan, A. Tsai, M. Moe, B. Levine, and R. R. Taylor, pp. 1–7. ‘‘PipeRench: A virtualized programmable datapath in 0.18 micron [71] R. McDonald, D. Burger, and S. Keckler, ‘‘The design and implemen- technology,’’ in Proc. IEEE Custom Integr. Circuits Conf., Oct. 2002, tation of the TRIPS prototype chip,’’ in Proc. IEEE Hot Chips Symp. pp. 63–66. (HCS), Aug. 2005, pp. 1–24. [51] J. Becker, M. Glesner, A. Alsolaim, and J. A. Starzyk, ‘‘Fast communi- [72] M. Gebhart, ‘‘An evaluation of the TRIPS computer system,’’ cation mechanisms in coarse-grained dynamically reconfigurable array ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 1–12, architectures,’’ in Proc. PDPTA 2000, pp. 1–7. 2009. [52] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, [73] J. Gray and A. Smith, ‘‘Towards an area-efficient implementation of P. R. Mattson, and J. D. Owens, ‘‘A bandwidth-efficient architecture for a high ILP EDGE soft processor,’’ 2018, arXiv:1803.06617. [Online]. media processing,’’ in Proc. 31st Annu. ACM/IEEE Int. Symp. Microar- Available: http://arxiv.org/abs/1803.06617 chitecture, 1998, pp. 3–13. [74] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, [53] J. Ho Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das, ‘‘Eval- K. Sankaralingam, and C. Kim, ‘‘DySER: Unifying functionality and uating the imagine stream architecture,’’ in Proc. 31st Annu. 
Int. Symp. parallelism specialization for energy-efficient computing,’’ IEEE Micro, Comput. Archit., 2002, pp. 14–25. vol. 32, no. 5, pp. 38–51, Sep. 2012. [54] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, [75] I. Parulkar, A. Wood, J. C. Hoe, B. Falsafi, S. V. Adve, J. Torrellas, ‘‘ADRES: An architecture with tightly coupled VLIW processor and and S. Mitra, ‘‘OpenSPARC: An open platform for hardware reliability coarse-grained reconfigurable matrix,’’ in Proc. Int. Conf. Field Pro- experimentation,’’ in Proc. 4th Workshop Silicon Errors Logic-Syst. gram. Logic Appl. 2003, pp. 61–70. Effects 2008, pp. 1–6.

VOLUME 8, 2020 146739 A. Podobas et al.: Survey on CGRAs From a Performance Perspective

[76] J. Benson, R. Cofell, C. Frericks, C.-H. Ho, V. Govindaraju, [97] S. Mach, D. Rossi, G. Tagliavini, A. Marongiu, and L. Benini, T. Nowatzki, and K. Sankaralingam, ‘‘Design, integration and imple- ‘‘A transprecision floating-point architecture for energy-efficient mentation of the DySER hardware accelerator into OpenSPARC,’’ in embedded computing,’’ in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Proc. IEEE Int. Symp. High-Perform. Comp Archit., Feb. 2012, pp. 1–12. 2018, pp. 1–5. [77] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and [98] K. Asanovi and D. A. Patterson, ‘‘Instruction sets should be free: S. K. Reinhardt, ‘‘The m5 simulator: Modeling networked systems,’’ The case For RISC-V,’’ EECS Dept., Univ. California, Berkeley, CA, IEEE Micro, vol. 26, no. 4, pp. 52–60, Jul. 2006. USA, Tech. Rep. UCB/EECS-2014-146, Aug. 2014. [Online]. Avail- [78] C. Bienia, S. Kumar, J. P. Singh, and K. Li, ‘‘The PARSEC benchmark able: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014- suite: Characterization and architectural implications,’’ in Proc. 17th Int. 146.html Conf. Parallel Archit. Compilation Techn., 2008, pp. 72–81. [99] H. Siqueira and M. Kreutz, ‘‘A coarse-grained reconfigurable architec- [79] J. D. Souza, L. Carro, M. B. Rutzig, and A. C. S. Beck, ‘‘Towards a ture for a PRET machine,’’ in Proc. VIII Brazilian Symp. Comput. Syst. dynamic and reconfigurable multicore heterogeneous system,’’ in Proc. Eng. (SBESC), Nov. 2018, pp. 237–242. Brazilian Symp. Comput. Syst. Eng., Nov. 2014, pp. 73–78. [100] V. Dadu, J. Weng, S. Liu, and T. Nowatzki, ‘‘Towards general pur- [80] J. D. Souza, L. Carro, M. Beck Rutzig, and A. C. S. Beck, ‘‘A recon- pose acceleration by exploiting common data-dependence forms,’’ in figurable heterogeneous multicore with a homogeneous ISA,’’ in Proc. Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, Oct. 2019, Des., Autom. Test Eur. Conf. Exhib. (DATE), 2016, pp. 1598–1603. pp. 924–939. [81] F. 
Carlos Junior, I. Silva, and R. Jacobi, ‘‘A partially shared thin recon- [101] T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, figurable array for multicore processor,’’ in Proc. Anais Estendidos ‘‘Stream-dataflow acceleration,’’ in Proc. 44th Annu. Int. Symp. Comput. do Simpósio Brasileiro de Engenharia de Sistemas Computacionais Archit., Jun. 2017, pp. 416–429. (SBESC), Nov. 2019, pp. 113–118. [102] B. Salefski and L. Caglar, ‘‘Re-configurable computing in wireless,’’ in [82] M. Butler, L. Barnes, D. D. Sarma, and B. Gelinas, ‘‘Bulldozer: An Proc. 38th Conf. Design Autom., 2001, pp. 178–183. approach to multithreaded compute performance,’’ IEEE Micro, vol. 31, [103] A. Hemani, N. Farahini, S. M. Jafri, H. Sohofi, S. Li, and K. Paul, no. 2, pp. 6–15, Mar. 2011. ‘‘The siLago solution: Architecture and design methods for a het- [83] D. L. Wolf, L. J. Jung, T. Ruschke, C. Li, and C. Hochberger, ‘‘AMIDAR erogeneous dark silicon aware coarse grain reconfigurable fabric,’’ project: Lessons learned in 15 years of researching adaptive proces- in The Dark Side Silicon. Springer, 2017, pp. 47–94. [Online]. sors,’’ in Proc. 13th Int. Symp. Reconfigurable Commun. Syst. Chip Available: https://link.springer.com/book/10.1007/978-3-319-31596-6, (ReCoSoC), Jul. 2018, pp. 1–8. doi: 10.1007/978-3-319-31596-6. [84] J. D. Lopes and J. T. de Sousa, ‘‘Versat, a minimal coarse-grain reconfig- [104] M. A. Shami, ‘‘Dynamically reconfigurable resource array,’’ Ph.D. dis- urable array,’’ in Proc. 4th Workshop Silicon Errors Logic-Syst. Effects sertation, Dept. Elect. Energy Eng., KTH, Stockholm, Sweden, 2012. 2016, pp. 174–187. [105] M. Martina, A. Hemani, and G. Baccelli, ‘‘Design of a coarse grain [85] S. R. Chalamalasetti, S. Purohit, M. Margala, and W. Vanderbauwhede, reconfigurable array for neural networks,’’ M.S. thesis, Dept. Elec- ‘‘MORA–An architecture and programming model for a resource effi- tron. Telecommun., Politecnico Di Torino, Turin, Italy, 2019. [Online]. 
cient coarse grained reconfigurable processor,’’ in Proc. NASA/ESA Available: http://webthesis.biblio.polito.it/id/eprint/12501 Conf. Adapt. Hardw. Syst., Jul. 2009, pp. 389–396. [106] D. Stathis, C. Sudarshan, Y. Yang, M. Jung, S. A. M. H. Jafri, C. Weis, [86] W. Vanderbauwhede, M. Margala, S. R. Chalamalasetti, and S. Purohit, A. Hemani, A. Lansner, and N. Wehn, ‘‘EBrainII: A 3 kW realtime ‘‘A C + +−embedded domain-specific language for programming the custom 3D DRAM integrated ASIC implementation of a biologically MORA soft processor array,’’ in Proc. 21st IEEE Int. Conf. Appl.- plausible model of a human scale cortex,’’ 2019, arXiv:1911.00889. Specific Syst., Archit. Processors, Jul. 2010, pp. 141–148. [Online]. Available: http://arxiv.org/abs/1911.00889 [87] Y. Park, H. Park, and S. Mahlke, ‘‘CGRA express: Accelerating exe- [107] Y. Yang, D. Stathis, P. Sharma, K. Paul, A. Hemani, M. Grabherr, and cution using dynamic operation fusion,’’ in Proc. Int. Conf. Compil., R. Ahmad, ‘‘RiBoSOM: Rapid bacterial genome identification using Archit., Synth. Embedded Syst., 2009, pp. 271–280. self-organizing map implemented on the synchoros SiLago platform,’’ [88] D. Lewis, G. Chiu, J. Chromczak, D. Galloway, B. Gamsa, in Proc. 18th Int. Conf. Embedded Comput. Syst. Archit., Modeling, V. Manohararajah, I. Milton, T. Vanderhoek, and J. Van Dyken, ‘‘The Simulation, 2018, pp. 105–114. Stratix 10 highly pipelined FPGA architecture,’’ in Proc. ACM/SIGDA [108] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, ‘‘Q100: The Int. Symp. Field-Program. Gate Arrays, 2016, pp. 159–168. architecture and design of a database processing unit,’’ ACM SIGARCH [89] H. Park, Y.Park, and S. Mahlke, ‘‘Polymorphic pipeline array: A flexible Comput. Archit. News, vol. 42, no. 1, pp. 255–268, Apr. 2014. multicore accelerator with virtualized execution for mobile multimedia [109] V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and applications,’’ in Proc. 42nd Annu. IEEE/ACM Int. 
Symp. Microarchi- M. Weinhardt, ‘‘PACT XPP—A Self-reconfigurable data processing tecture, 2009, pp. 370–380. architecture,’’ J. Supercomput., vol. 26, no. 2, pp. 167–184, 2003. [90] Y. Kim, J. Lee, J. Lee, T. X. Mai, I. Heo, and Y. Paek, ‘‘Exploit- [110] P. S. Kasgen, M. Weinhardt, and C. Hochberger, ‘‘A coarse-grained ing both pipelining and with SIMD reconfigurable reconfigurable array for high-performance computing applications,’’ in architecture,’’ in Proc. Int. Symp. Appl. Reconfigurable Comput. 2012, Proc. Int. Conf. ReConFigurable Comput., Dec. 2018, pp. 1–4. pp. 40–52. [111] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, ‘‘WaveScalar,’’ [91] C. Liang and X. Huang, ‘‘SmartCell: An energy efficient coarse-grained in Proc. 22nd Digit. Avionics Syst. Conf. Process., 2003, pp. 291–302. reconfigurable architecture for stream-based applications,’’ EURASIP [112] A. Putnam, S. Swanson, M. Mercaldi, K. Michelson, A. Petersen, J. Embedded Syst., vol. 2009, no. 1, Dec. 2009, Art. no. 518659. A. Schwerin, M. Oskin, and S. Eggers, ‘‘The microarchitecture of a [92] C. Brunelli, F. Garzia, and J. Nurmi, ‘‘A coarse-grain reconfigurable pipelined wave : An RTL-based study,’’ Dept. Comput. architecture for multimedia applications featuring subword computation Sci. Eng., Univ. Washington, Seattle, WA, USA, Tech. Rep. 2004-11-02, capabilities,’’ J. Real-Time Image Process., vol. 3, nos. 1–2, pp. 21–32, 2005. Mar. 2008. [113] S. Swanson, A. Schwerin, M. Mercaldi, A. Petersen, A. Putnam, [93] O. Atak and A. Atalar, ‘‘BilRC: An execution triggered coarse grained K. Michelson, M. Oskin, and S. J. Eggers, ‘‘The WaveScalar architec- reconfigurable architecture,’’ IEEE Trans. Very Large Scale Integr. ture,’’ ACM Trans. Comput. Syst., vol. 25, no. 2, pp. 1–54, May 2007. (VLSI) Syst., vol. 21, no. 7, pp. 1285–1298, Jul. 2013. [114] M. Mishra, T. J. Callahan, T. Chelcea, G. Venkataramani, [94] D. Lee, M. Jo, K. Han, and K. Choi, ‘‘FloRA: Coarse-grained recon- S. C. 
Goldstein, and M. Budiu, ‘‘Tartan: Evaluating spatial computation figurable architecture with floating-point operation capability,’’ in Proc. for whole program execution,’’ ACM SIGARCH Comput. Archit. News, Int. Conf. Field-Program. Technol., Dec. 2009, pp. 376–379. vol. 34, no. 5, pp. 163–174, Oct. 2006. [95] F. Feng, L. Li, K. Wang, F. Han, B. Zhang, and G. He, ‘‘Floating- [115] J. L. Henning, ‘‘SPEC CPU2006 benchmark descriptions,’’ ACM point operation based reconfigurable architecture for radar processing,’’ SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, Sep. 2006. IEICE Electron. Express, vol. 2016, Oct. 2016, Art. no. 20160893. [116] G. Ansaloni, P. Bonzini, and L. Pozzi, ‘‘EGRA: A coarse grained [96] R. Prasad, S. Das, K. Martin, G. Tagliavini, P. Coussy, L. Benini, and reconfigurable architectural template,’’ IEEE Trans. Very Large Scale D. Rossi, ‘‘TRANSPIRE: An energy-efficient TRANSprecision Integr. (VLSI) Syst., vol. 19, no. 6, pp. 1062–1074, Jun. 2011. floating-point Programmable archItectuRE,’’ in Proc. Des., Automat. [117] T. Austin, E. Larson, and D. Ernst, ‘‘SimpleScalar: An infrastructure for Test Eur. Conf., 2020, pp. 1–7. computer system modeling,’’ Computer, vol. 35, no. 2, pp. 59–67, 2002.

146740 VOLUME 8, 2020 A. Podobas et al.: Survey on CGRAs From a Performance Perspective

[118] D. Voitsechov and Y. Etsion, ‘‘Single-graph multiple flows: Energy [140] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August, ‘‘Bundled efficient design alternative for GPGPUs,’’ ACM SIGARCH Comput. execution of recurring traces for energy-efficient general purpose pro- Archit. News, vol. 42, no. 3. IEEE, 2014, pp. 205–216. cessing,’’ in Proc. 44th Annu. IEEE/ACM Int. Symp. Microarchitecture, [119] D. Voitsechov, O. Port, and Y. Etsion, ‘‘Inter-thread communication 2011, pp. 12–23. in multithreaded, reconfigurable coarse-grain arrays,’’ in Proc. 51st [141] T. Nowatzki, V. Gangadhar, and K. Sankaralingam, ‘‘Exploring the Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2018, potential of heterogeneous von neumann/dataflow execution models,’’ pp. 42–54. ACM SIGARCH Comput. Archit. News, vol. 43, no. 3S, pp. 298–310, [120] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, ‘‘: Jan. 2016. A unified graphics and computing architecture,’’ IEEE Micro, vol. 28, [142] Y. LeCun, Y. Bengio, and G. Hinton, ‘‘Deep learning,’’ Nature, vol. 521, no. 2, pp. 39–55, Mar. 2008. no. 7553, p. 436, 2015. [121] D. Kirk, ‘‘Nvidia CUDA software and GPU archi- [143] M. Courbariaux, Y. Bengio, and J.-P. David, ‘‘Training deep neural tecture,’’ in Proc. ISMM, vol. 7, 2007, pp. 103–104. networks with low precision multiplications,’’ 2014, arXiv:1412.7024. [122] A. Munshi, ‘‘The OpenCL specification,’’ in Proc. (HCS), Aug. 2009, [Online]. Available: http://arxiv.org/abs/1412.7024 pp. 1–314. [144] X. Fan, H. Li, W. Cao, and L. Wang, ‘‘DT-CGRA: Dual-track coarse- [123] C. M. Wittenbrink, E. Kilgariff, and A. Prabhu, ‘‘Fermi GF100 GPU grained reconfigurable architecture for stream applications,’’ in Proc. architecture,’’ IEEE Micro, vol. 31, no. 2, pp. 50–59, Mar. 2011. 26th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2016, pp. 1–9. [124] M. Zhu, L. Liu, S. Yin, Y. Wang, W. Wang, and S. Wei, ‘‘A reconfig- [145] X. Fan, D. Wu, W. Cao, W. Luk, and L. 
Wang, ‘‘ dual- urable multi-processor SoC for media applications,’’ in Proc. IEEE Int. track CGRA for object inference,’’ IEEE Trans. Very Large Scale Integr. Symp. Circuits Syst., May 2010, pp. 2011–2014. (VLSI) Syst., vol. 26, no. 6, pp. 1098–1111, Jun. 2018. [125] M. Suzuki, ‘‘Stream applications on the dynamically reconfigurable [146] A. Parashar, ‘‘SCNN: An accelerator for compressed-sparse convolu- processor,’’ in Proc. IEEE Int. Conf. Field-Program. Technol. Dec. 2004, tional neural networks,’’ SIGARCH Comput. Archit. News, vol. 45, no. 2, pp. 137–144. pp. 27–40, 2017. [126] T. Sato, ‘‘DAPDNA-2 a dynamically reconfigurable processor with 376 [147] M. Liang, M. Chen, Z. Wang, and J. Sun, ‘‘A CGRA based neural net- 32-bit processing elements,’’ in Proc. IEEE Hot Chips XVII Symp. work inference engine for deep reinforcement learning,’’ in Proc. IEEE (HCS), Aug. 2005, pp. 1–24. Asia Pacific Conf. Circuits Syst. (APCCAS), Oct. 2018, pp. 540–543. [127] D. N. Truong, W. H. Cheng, T. Mohsenin, Z. Yu, A. T. Jacobson, [148] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, ‘‘Eyeriss: An energy- G. Landge, M. J. Meeuwsen, C. Watnik, A. T. Tran, Z. Xiao, E. W. Work, efficient reconfigurable accelerator for deep convolutional neural net- J. W. Webb, P. V. Mejia, and B. M. Baas, ‘‘A 167-processor computa- works,’’ IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, tional platform in 65 nm CMOS,’’ IEEE J. Solid-State Circuits, vol. 44, Jan. 2017. no. 4, pp. 1130–1144, Apr. 2009. [149] Y.-H. Chen, T.-J. Yang, J. S. Emer, and V. Sze, ‘‘Eyeriss v2: A flexible [128] D. M. Chapiro, ‘‘Globally-asynchronous locally-synchronous systems,’’ accelerator for emerging deep neural networks on mobile devices,’’ Dept. Comp. Sci., Stanford Univ., Stanford, CA, USA, Tech. Rep. IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, STAN-CS-84-1026, 1984. Jun. 2019. [129] S. Kyo and S. Okazaki, ‘‘IMAPCAR: A 100 GOPS in-vehicle vision [150] B. Gaide, D. Gaitonde, C. 
Ravishankar, and T. Bauer, ‘‘Xilinx adap- processor based on 128 ring connected four-way VLIW process- tive compute acceleration platform: Versal–architecture,’’ in Proc. ing elements,’’ J. Signal Process. Syst., vol. 62, no. 1, pp. 5–16, ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2019, pp. 84–93. Jan. 2011. [151] K. Vissers, ‘‘Versal: The Xilinx adaptive compute acceleration platform [130] S. Kyo, T. Koga, and S. Okazaki, ‘‘IMAP-CE: A 51.2 GOPS video rate (ACAP),’’ in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, with 128 VLIW processing elements,’’ in Proc. Int. 2019, p. 83. Conf. Image Process., vol. 3, 2001, pp. 294–297. [131] S. Das, N. Sivanandan, K. T. Madhu, S. K. Nandy, and R. Narayan, [152] N. Ando, K. Masuyama, H. Okuhara, and H. Amano, ‘‘Variable pipeline ‘‘RHyMe: REDEFINE hyper cell multicore for accelerating structure for coarse grained reconfigurable array CMA,’’ in Proc. Int. HPC kernels,’’ in Proc. 29th Int. Conf. VLSI Des., Jan. 2016, Conf. Field-Program. Technol. (FPT), Dec. 2016, pp. 217–220. pp. 601–602. [153] N. Ozaki, Y. Yoshihiro, Y. Saito, D. Ikebuchi, M. Kimura, H. Amano, [132] K. T. Madhu, S. Das, N. S., S. K. Nandy, and R. Narayan, ‘‘Compiling H. Nakamura, K. Usami, M. Namiki, and M. Kondo, ‘‘Cool mega-array: HPC kernels for the REDEFINE CGRA,’’ in Proc. 17th Int. Conf. High A highly energy efficient reconfigurable accelerator,’’ in Proc. Int. Conf. Perform. Comput. Commun., Aug. 2015, pp. 405–410. Field-Program. Technol., Dec. 2011, pp. 1–8. [133] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, [154] T. Kojima, N. Ando, Y. Matshushita, H. Okuhara, N. A. V. Doan, and A. Pedram, C. Kozyrakis, and K. Olukotun, ‘‘Plasticine: A reconfig- H. Amano, ‘‘Real chip evaluation of a low power CGRA with optimized urable architecture for parallel paterns,’’ in Proc. 44th Annu. Int. Symp. application mapping,’’ in Proc. 9th Int. Symp. Highly-Efficient Accel. Comput. Archit., Jun. 2017, pp. 389–402. 
Reconfigurable Technol., Jun. 2018, p. 13. [134] D. Koeplinger, C. Kozyrakis, K. Olukotun, M. Feldman, R. Prabhakar, [155] K. Patel, S. McGettrick, and C. J. Bleakley, ‘‘SYSCORE: A coarse Y. Zhang, S. Hadjis, R. Fiszel, T. Zhao, L. Nardi, and A. Pedram, grained reconfigurable array architecture for low energy biosignal pro- ‘‘Spatial: A language and compiler for application accelerators,’’ in cessing,’’ in Proc. IEEE 19th Annu. Int. Symp. Field-Program. Custom Proc. 39th ACM SIGPLAN Conf. Program. Lang. Design Implement., Comput. Mach., May 2011, pp. 109–112. 2018, pp. 296–311. [156] L. Duch, S. Basu, M. Peon-Quiros, G. Ansaloni, L. Pozzi, and [135] A. Podobas, K. Sano, and S. Matsuoka, ‘‘A template-based framework D. Atienza, ‘‘I-DPs CGRA: An interleaved-datapaths reconfigurable for exploring coarse-grained reconfigurable architectures,’’ in Proc. accelerator for embedded bio-signal processing,’’ IEEE Embedded Syst. IEEE 31st Int. Conf. Appl.-Specific Syst., Architectures Processors Lett., vol. 11, no. 2, pp. 50–53, Jun. 2019. (ASAP), 2020, pp. 1–8. [157] J. Lopes, D. Sousa, and J. C. Ferreira, ‘‘Evaluation of CGRA architecture [136] S. Li, Z. Yang, D. Reddy, A. Srivastava, and B. Jacob, ‘‘DRAM- for real-time processing of biological signals on wearable devices,’’ in sim3: A cycle-accurate, thermal-capable DRAM simulator,’’ IEEE Proc. Int. Conf. ReConFigurable Comput., Dec. 2017, pp. 1–7. Comput. Archit. Lett., early access, Feb. 14, 2020, doi: 10.1109/ [158] L. Duch, S. Basu, R. Braojos, D. Atienza, G. Ansaloni, and L. Pozzi, LCA.2020.2973991. ‘‘A multi-core reconfigurable architecture for ultra-low power bio- [137] Cerebras Systems. (2019). Wafer-Scale Deep Learning. [Online]. Avail- signal analysis,’’ in Proc. IEEE Biomed. Circuits Syst. Conf. (BioCAS), able: https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie. Oct. 2016, pp. 416–419. v02.pdf [159] S. Das, K. J. M. Martin, P. Coussy, and D. Rossi, ‘‘A heterogeneous [138] N. Clark, A. Hormati, and S. 
Mahlke, ‘‘VEAL: Virtualized execution cluster with reconfigurable accelerator for energy efficient near-sensor accelerator for loops,’’ in Proc. Int. Symp. Comput. Archit., Jun. 2008, data analytics,’’ in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2018, pp. 389–400. pp. 1–5. [139] J. Cong, H. Huang, C. Ma, B. Xiao, and P. Zhou, ‘‘A fully pipelined and [160] S. Das, D. Rossi, K. J. M. Martin, P. Coussy, and L. Benini, dynamically composable architecture of CGRA,’’ in Proc. IEEE 22nd ‘‘A 142MOPS/mW integrated programmable array accelerator for smart Annu. Int. Symp. Field-Program. Custom Comput. Mach., May 2014, visual processing,’’ in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 9–16. May 2017, pp. 1–4.

VOLUME 8, 2020 146741 A. Podobas et al.: Survey on CGRAs From a Performance Perspective

ARTUR PODOBAS received the Ph.D. degree from the KTH Royal Institute of Technology, in 2015. From 2015 to 2016, he was a Postdoctoral Fellow with the Technical University of Denmark (DTU Compute). From 2016 to 2018, he was a Japan Society for the Promotion of Science (JSPS) Fellow with the Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology, Japan. He held a postdoctoral position with the Processor Research Team, RIKEN Center for Computational Science, Japan. He is currently a Researcher with the Department of Computational Science and Technology, KTH Royal Institute of Technology. His research interests include reconfigurable architectures (FPGAs and CGRAs), neuromorphic computing, parallel architectures, and high-level synthesis for high-performance computing (HPC) systems.


KENTARO SANO received the Ph.D. degree from the Graduate School of Information Sciences (GSIS), Tohoku University, in 2000. From 2000 to 2005, he was a Research Associate with Tohoku University, where he was an Associate Professor from 2005 to 2018. He was a Visiting Researcher with the Department of Computing, Imperial College London, in 2006, and with Maxeler Corporation, in 2007. Since 2017, he has been the Team Leader of the Processor Research Team, RIKEN Center for Computational Science (R-CCS). His research interests include FPGA-based high-performance reconfigurable computing systems, especially for scientific numerical simulations and machine learning; high-level synthesis and tools for reconfigurable custom computing machines; and system architectures for next-generation supercomputing based on the data-flow computing model.

SATOSHI MATSUOKA received the Ph.D. degree from The University of Tokyo, in 1993. He has been a Full Professor with the Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology, since 2000; the Director of the joint AIST-Tokyo Tech Real World Big Data Computing Open Innovation Laboratory (RWBC-OIL), since 2017; and the Director of the RIKEN Center for Computational Science (R-CCS), along with a Specially Appointed Professor duty at Tokyo Tech, since 2018. He is the leader of the TSUBAME series of supercomputers, which achieved the world's first place in power-efficient computing. His major supercomputing research projects span areas such as parallel algorithms and programming, resilience, green computing, and the convergence of big data/AI with HPC. He has written over 500 articles and chaired numerous ACM/IEEE conferences, including serving as the Program Chair of the ACM/IEEE Supercomputing Conference (SC), in 2013. A Fellow of the ACM and the European ISC, he has won many awards, including the JSPS Prize from the Japan Society for the Promotion of Science, in 2006, presented by His Highness Prince Akishino; the ACM Gordon Bell Prize, in 2011; the Commendation for Science and Technology by the Ministry of Education, Culture, Sports, Science and Technology, in 2012; the 2014 IEEE-CS Sidney Fernbach Memorial Award, one of the highest honors in the field of HPC; and, most recently, the ACM HPDC Achievement Award, in 2018.
