
A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms

Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, Tony Nowatzki
University of California, Los Angeles, California, US
{jian.weng, sihao, seanzw, vidushi.dadu, tjn}@cs.ucla.edu

Abstract—Dense linear algebra kernels are critical for wireless, and the oncoming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine-grain producer/consumer interaction with iteratively changing dependence distance, reuse rate, and memory access patterns. This causes high overhead both for multi-threading, due to fine-grain synchronization, and for vectorization, due to the non-rectangular iteration domains. CPUs, DSPs, and GPUs perform an order of magnitude below peak.

Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, then a spatial architecture could efficiently execute parallel code regions. To this end, we first extend the traditional dataflow model with first-class primitives for inductive dependences and memory access streams.

[Fig. 1: Percent of ideal performance on inductive workloads (svd, qr, cho, sol, fir) and non-inductive workloads (mm, fft). CPU: Xeon 4116 with Intel MKL; DSP: TI C6678 with TI DSPLIB; GPU: TITAN V with NVIDIA libraries.]

[Figure: (a) Solver Code, (b) Solver's Iteration Space (i: inner loop).]
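To make the inductive pattern concrete, here is a minimal C sketch of a triangular solver in the style of the figure above; the figure's exact listing was lost to extraction, so the loop bounds and names here are illustrative.

```c
#include <stdio.h>
#define N 8

/* Forward-substitution solver: the inner loop's trip count (N-j-1)
 * shrinks by one every outer iteration. This "inductive" behavior defeats
 * fixed-width vectorization and static dependence distances. */
void solver(float a[N][N], float b[N]) {
  for (int j = 0; j < N; ++j) {
    b[j] /= a[j][j];                 /* outer-loop (low-rate) region   */
    for (int i = j + 1; i < N; ++i)  /* inner region: N-j-1 iterations */
      b[i] -= a[j][i] * b[j];
  }
}

int main(void) {
  float a[N][N] = {{0}}, b[N];
  for (int i = 0; i < N; ++i) {      /* simple well-conditioned input  */
    b[i] = 1.0f;
    for (int j = 0; j <= i; ++j) a[j][i] = (i == j) ? 2.0f : 0.5f;
  }
  solver(a, b);
  for (int i = 0; i < N; ++i) printf("%g ", b[i]);
  printf("\n");
  return 0;
}
```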

I. INTRODUCTION

For vectorization, the trip counts of inductive loops will frequently not be divisible by the vector length, causing under-utilization. Frequent dependences between program regions due to inductive updates make thread-synchronization overhead too high.

Dataflow models and their corresponding architectures [5]–[7] are promising because they can express dependences with inductive patterns, enabling parallelism of regions even with fine-grain dependences. Spatial dataflow architectures [8]–[12] help make dataflow architectures practical and energy-efficient. However, we demonstrate quantitatively that these are still factors under peak performance due to either control overheads or serialization.

Goal and Approach: Our goal is to create the next-generation programmable DSP architecture, capable of both inductive and non-inductive parallelism, by leveraging principles of dataflow, spatial architectures, and specialization. We summarize the proposed design, REVEL (Reconfigurable Vector Lanes), and the novel aspects of this work (see Figure 3):

First, we create an inductive dataflow execution model, which extends dataflow execution with first-class primitives for inductive dependences and memory access streams. This specialization enables effective vectorization through implicit masking based on the relationship of stream length to vector length, and allows expression of inductive behavior at low control overhead.

Second, we develop a hybrid systolic/dataflow architecture, which can use efficient systolic execution for inner-loop regions that must execute at a high rate, and a more flexible tagged dataflow execution for other operations off the critical path. This enables parallelism across program regions at high utilization with low hardware overhead.

Third, we scale the design by using lanes, each having the hybrid architecture and a local scratchpad (SPAD). To control the operation and interaction of all lanes with low overhead, we develop the vector-stream control model, which enables a single core to control program regions with fine-grain dependences mapped across lanes. This is a generalization of the stream-dataflow ISA [13] to enable amortization of control through time (streams) and space (vector lanes).

[Fig. 3: Proposed Architecture and Three Novel Aspects. 1. Inductive Dataflow Model: inductive memory streams (e.g. A[0:n], B[0:n-i]) and dependences with produce|consume rates (e.g. 1|1, 1|n-i) reduce control overhead. 2. Hybrid Systolic-Dataflow Architecture: specializes for the exposed inductive parallelism. 3. Vector-Stream Control: the program is expressed as a series of streams, vectorized across lanes with broadcast and inter-lane XFER, for efficient scaling.]

Our contributions are:
• An inductive dataflow execution model, which expresses inductive memory and dependences as first-class primitives to reduce control overhead and make vectorization profitable.
• A novel hybrid systolic-dataflow architecture to specialize for the exposed inductive parallelism.
• The REVEL accelerator design and its novel vector-stream control model, which amortizes control in time and space across lanes.
• A full stack implementation, open-source compiler, simulator¹, and comparison against state-of-the-art DSPs, CPUs, GPUs, and spatial architectures.

Results: A single 1.25GHz REVEL unit can outperform a 2.1GHz OOO core running highly-optimized MKL code on DSP workloads by a mean of 9.6×, with an area-normalized advantage of 1089×. Compared to a DSP, REVEL achieves between 4.6×–37× lower latency. It is half the area of an equivalent set of ASICs and within 2× their average power.

Paper Organization: We characterize the workloads and challenges for existing architectures in Section II, and the promise and challenges for spatial architectures in Section III. We then develop inductive dataflow and the hybrid architecture in Section IV. Our accelerator, REVEL, and its vector-stream control model are described in Section V, and its compiler in Section VI. Finally, we cover methodology, evaluation, and additional related work in Sections VII, VIII, and IX.

II. WORKLOAD CHARACTERIZATION

In this section, we explain how the inductive nature of many linear algebra workloads creates parallelism that is difficult to exploit. We start with workload background, then discuss inductive workload characteristics and why they hamper traditional architectures.

A. Why these particular DSP workloads?

Figure 4 shows typical 4G/5G transmitter/receiver stages.

[Fig. 4: Typical 4G/5G Transmitter/Receiver Pipeline. Downlink: Channel Coding & Modulation (bit-level ops, not targeted) → Beam-Forming (general matrix mult.; matrix size depends on # of antennas & beams) → RE Mapping (control dominated, not targeted) → IFFT (FFT/IFFT size depends on cell bandwidth and sub-carrier spacing). Uplink: FFT → Demodulation & MIMO Eq. (DMRS channel est., noise var. est.) → Channel Decoding; SRS weight update and channel est. require matrix factorization/inversion (QR decomp./Cholesky/SVD) and filtering.]

Kernels we do not target: Channel coding and modulation involve mostly bit-level arithmetic. RE mapping is a short resource-allocation phase which is not computation intense.

¹Open-source release of simulator, compiler, and workload implementations: https://github.com/PolyArch/stream-specialization-stack


Kernels we target: The Beamforming stage involves mostly general matrix multiplication (GEMM); the remaining targeted kernels (FFT, FIR filtering, solver, and QR/Cholesky/SVD factorization) come from the equalization, filtering, and channel/weight estimation stages.

[Figure: (a) Cholesky Code.]
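Since the Cholesky listing itself was lost from the figure, here is a standard unblocked Cholesky in C to illustrate the structure discussed throughout the paper: a low-rate outer region (the sqrt), a column-scaling region, and a high-rate trailing update whose bounds shrink inductively with k. Names and region split are illustrative, not the paper's exact figure.

```c
#include <math.h>

/* Unblocked Cholesky (lower-triangular factor overwrites a). Every
 * region's trip count depends inductively on the outer index k. */
void cholesky(int n, float a[n][n]) {
  for (int k = 0; k < n; ++k) {
    a[k][k] = sqrtf(a[k][k]);            /* region 1: scalar sqrt        */
    for (int i = k + 1; i < n; ++i)      /* region 2: n-k-1 divides      */
      a[i][k] /= a[k][k];
    for (int i = k + 1; i < n; ++i)      /* region 3: triangular update  */
      for (int j = k + 1; j <= i; ++j)
        a[i][j] -= a[i][k] * a[j][k];
  }
}
```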

III. PROMISE AND CHALLENGES FOR SPATIAL ARCHITECTURES

Spatial architectures are promising for exploiting inductive parallelism. We describe here a taxonomy of spatial architectures, and use this to explain the unique challenges and limitations that motivate a novel dataflow execution model and spatial architecture.

[Fig. 7: Spatial Architecture Taxonomy with Example Program. Static scheduling maps computation spatially (CGRAs also temporally) with all timing pre-determined; dynamic scheduling uses flow control (ordered dataflow maintains firing order, tagged dataflow allows dynamic ordering). Relative PE area, least to most complex: "Systolic" 1×, "Ordered Dataflow" 2.1×, "CGRA" 2.6×, "Tagged Dataflow" 5.8×.]

TABLE I: Classifying Architectures within Spatial Taxonomy
Static scheduling, dedicated PEs ("Systolic"): Warp [23], FPCA [24], Softbrain [13], Tartan [10], Piperench [25].
Static scheduling, temporally shared PEs ("CGRA"): MorphoSys [26], Remarc [27], MATRIX [28], ADRES [29], WaveFlow [30].
Dynamic scheduling, dedicated PEs ("Ordered Dataflow"): DySER [11], Q100 [31], Plasticine [12].
Dynamic scheduling, temporally shared PEs ("Tagged Dataflow"): TRIPS [9], SGMF [32], Trig. Insts [33], WaveScalar [8], dMT-CGRA [34].

A. Spatial Architecture Taxonomy

We categorize spatial architectures along two dimensions. 1. Static or dynamic scheduling: is the timing of all instructions determined statically or dynamically? 2. Dedicated or shared PEs: is each PE dedicated to one instruction or shared among multiple instructions? Figure 7 depicts this taxonomy, along with how a simple computation graph is mapped to each. It also includes a relative PE area estimate². Table I gives examples of each category. We explain each quadrant³:
• "Systolic" architectures have dedicated PEs, and the compiler schedules events to allow perfect pipelining of computation. Our definition includes systolic designs with a reconfigurable network. They are the simplest and smallest because PEs lack flow control.
• "CGRAs" (Coarse Grain Reconfigurable Architectures) add temporal multiplexing capability, so regions can be larger. They keep the hardware simplicity of static schedules, but require an instruction memory and register file.
• "Ordered Dataflow" augments systolic with limited dataflow semantics: instruction firing is triggered by data arrival; however, it is simpler than tagged dataflow as the order of firing is predetermined.
• "Tagged Dataflow" architectures allow data-triggered execution at each PE, with the capability of reordering instructions. They are the most expensive due to tag matching of some form, but also the most flexible and tolerant to variable latencies.

B. Spatial Architecture Challenges

To understand their tradeoffs, we implement and evaluate the two extremes within the spatial architecture taxonomy: systolic and tagged dataflow (hence just "dataflow"). The dataflow design most resembles Triggered Instructions [33], and the systolic most resembles Softbrain [13]. The total execution resources match the TI DSP (8 cores, 16 MACs each, per-core SPAD). We develop and tune each algorithm for both architectures, and develop a cycle-level simulation framework and compiler (Section VII). Performance tradeoffs are in Figure 8, and are explained in the following paragraphs.

[Fig. 8: Performance relative to ideal for spatial architectures on inductive (svd, qr, cho, sol, fir) and non-inductive (mm, fft) workloads.]

Though spatial architectures are much faster than CPUs/GPUs/DSPs, they still do not achieve full throughput, and each type of architecture favors different workloads.

Challenge 1: Inductive Parallelism: Statically scheduled spatial architectures (systolic and CGRAs) cannot achieve parallelism across regions in inductive workloads, as inductive dependences do not have a static distance, which is required by static scheduling. This limitation also helps explain why the vast majority of CGRA compilers only attempt parallelization of inner loops [35]–[38]. The systolic architecture in Figure 8 also serializes each program region, which is another reason why it performs poorly on inductive workloads.

Ordered dataflow could parallelize across loops, as instruction firing is dynamic. However, as ordered dataflow only maps one instruction per PE, infrequently executed outer-loop program regions achieve quite low utilization.

Tagged dataflow architectures avoid the above challenges but suffer energy costs. They also do not fare well on non-inductive workloads with regular parallelism, as it is difficult for the compiler to create a perfectly pipelined schedule; even a single PE with contention from two instructions in the inner loop will halve the throughput.

Challenge 2: Inductive Control: A common issue is control overhead from inductive memory accesses and dependences. We demonstrate with a simple reuse and inductive reuse pattern shown in Figure 9, along with the dataflow program representation. Traditional spatial architectures require extra instructions to maintain effectively a finite state machine (FSM) to track the data reuse for a given number of iterations, and similarly to track the iteratively changing dependences between outer and inner loops, as shown by the def/use counts in Figure 9.

²Methodology in Section VII; 64-bit PE; shared PEs have 32 instruction slots and 8 register-file entries. Area does not include floating-point (FP) units, but even with FP multiply, dataflow is >4× the systolic area.
³We use these nicknames loosely: our use of "systolic" is broader than in the literature, while our use of "CGRA" refers only to traditional CGRAs.

[Fig. 9: (a) Reuse Dependence and (b) Inductive Update patterns, with their traditional dataflow representations (explicit def/use counting) and inductive dataflow representations.]
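To see why this costs instructions on a traditional spatial fabric, the sketch below contrasts a software model of the per-value counter FSM with a single inductive stream descriptor that encodes the whole pattern up front. The descriptor fields (an initial trip count plus a per-iteration "stretch") are our illustrative naming, not REVEL's exact encoding.

```c
#include <stdio.h>

typedef struct {      /* illustrative inductive stream descriptor  */
  int trips;          /* initial trip count (e.g. n)               */
  int stretch;        /* change in trip count per outer iteration  */
} ind_stream_t;

int main(void) {
  int n = 6;
  /* Traditional dataflow: an explicit compare/increment fires for every
   * reused value, occupying PE instruction slots each iteration. */
  int fsm_insts = 0;
  for (int j = 0; j < n; ++j)
    for (int cnt = 0; cnt < n - j; ++cnt)  /* FSM state in a register */
      ++fsm_insts;                         /* one ctrl inst per reuse */

  /* Inductive dataflow: one descriptor issued once describes the same
   * n, n-1, ..., 1 reuse sequence with no per-value instructions. */
  ind_stream_t dep = { .trips = n, .stretch = -1 };
  printf("ctrl insts: FSM=%d, stream descriptor=1 (trips=%d, stretch=%d)\n",
         fsm_insts, dep.trips, dep.stretch);
  return 0;
}
```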

IV. INDUCTIVE DATAFLOW AND HYBRID ARCHITECTURE

A. Inductive Dataflow Model

We develop here an execution model called inductive dataflow, which extends a traditional dataflow model with: 1. the capability of expressing inductive dependence patterns (as dependence streams), 2. the capability to express inductive memory patterns (as memory streams), and 3. semantics for stream interaction with vectorized computation.

[Figure: (a) Solver Code and (b) its Inductive Dataflow Representation; streams a[j,j], a[j,j+1:n], b[j], and b[j+1:n], with inductive accesses/dependences shown in blue.]

Each dependence between program regions is annotated with a production and consumption rate (production|consumption). Consumption > 1 indicates reuse of a value (e.g. data is reused multiple times within a subsequent loop). Production > 1 means that several iterations occur before producing a value (e.g. only the first array item has a dependent computation). We refer to these patterns of dependences as dependence streams; a dependence stream may, for example, describe the iteratively shrinking reuse between solver's outer and inner loops.
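A hedged sketch of how such a production|consumption edge might be encoded, using the figures' 1|n-j notation for solver's b[j] dependence; the struct and field names are ours, not the paper's interface.

```c
/* Illustrative encoding of a dependence stream edge: each value crossing
 * the edge is produced once per `prod` source firings and consumed
 * `cons` times at the destination. An inductive edge lets `cons` change
 * by `cons_stretch` each outer iteration (e.g. 1|n-j for solver's b[j]). */
typedef struct {
  int prod;          /* production rate (firings per value produced) */
  int cons;          /* initial consumption (reuse) count            */
  int cons_stretch;  /* per-outer-iteration change to `cons`         */
} dep_stream_t;

/* Solver with n=8: b[j] is produced once and reused by all n-j-1 inner
 * iterations, so consumption starts at 7 and shrinks by one per j. */
static const dep_stream_t solver_bj_dep = {
  .prod = 1, .cons = 7, .cons_stretch = -1,
};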


Inductive Dataflow Vectorization: Often it is useful to apply vectorization to the program region which requires the most work, usually the inner loop. To vectorize, a load stream may connect to multiple compute nodes, and the subsequent values of the stream are distributed round-robin to each (a write stream consumes from connected nodes similarly). Figure 12(a) shows solver with the inner loop vectorized. However, as with vector architectures, the iteration count of a loop (especially an inductive one) may not align with the vector width. The advantage of expressing inductive streams as a primitive is that we can assign meaning to the completion of the outer loop (here the j loop). Specifically, we add the semantics that nodes which have not received data are implicitly predicated off. This predication is carried to the output through dependent instructions; at the output, streams which write memory ignore instances with invalid predicates. Figure 12(b) shows the state of the stream over several cycles (with n=8). This approach enables stall-free pipelined execution of the vectorized region.

[Fig. 12: Implicit Vector Masking with the Solver Kernel; per-cycle vector-lane contents with n=8.]

B. Hybrid Systolic-Dataflow

With the execution model defined, we now develop an architecture which can leverage the additional semantics of inductive dataflow for highly-efficient execution.

Hybrid Rationale: As discussed in Section III, one of the main tradeoffs between systolic and dataflow architectures is the efficiency of simplified control in systolic versus the ability to exploit more general parallelism in dataflow. Our rationale begins with the observation that in any kernel, some of the concurrent program regions correspond to more total work than others, hence they execute more frequently. These regions are generally the inner-most loops (e.g. the matrix update region in Cholesky). Our insight is that it is only these inner-loop regions that make up the bulk of the work, and executing these on the efficient systolic fabric would attain the majority of the benefits. If we can provide a much smaller dataflow fabric for non-inner-loop regions, and a way for all regions to communicate, this would be low cost and high efficiency. Tagged dataflow is specifically attractive because it can both temporally multiplex instructions and be resilient to the timing variation caused by inductive behavior.

Hybrid Architecture Design: Our overall approach is to embed the dataflow architecture within the systolic architecture's network, so that functional units (FUs) within the dataflow component may be used for systolic execution. To enable communication between the regions, we use programmable ports which realize the inductive dependence semantics. Figure 13(a) shows the systolic and dataflow PEs, and how they are embedded into the overall design; Figure 13(b) shows an example program mapping.

[Fig. 13: (a) Hybrid systolic-dataflow tiles and overall architecture: systolic PEs (sPE) and dataflow PEs (dPE) in a circuit-switched mesh, with programmable input ports (P1, P2, P3), output ports (P4, P5, P6), stream control, XFER unit, and scratchpad; (b) an inductive dataflow mapping example, with the outer-loop region on dataflow PEs and the inner-loop region on systolic PEs.]

PEs and Integration: Dataflow PEs are based on Triggered Instructions [33], [40]. An instruction scheduler monitors the state of architecture predicates and input channels (with tags) to decide when to fire an instruction. A triggered instruction reads operands (registers or inputs from neighboring switches) into the FU for processing. Systolic PEs are much simpler; they fire their dedicated instruction whenever an input arrives, are configured with a single instruction, and only have a single accumulator register (rather than a register file). The compiler is responsible for ensuring all of a systolic PE's inputs arrive in exactly the same cycle.

All PEs are embedded into a circuit-switched mesh network with no flow control. To map program regions onto the systolic PEs, the compiler configures the switches to implement the dependences. Only 1|1 production|consumption edges may be mapped to the circuit-switched mesh. If a program region is mapped to the dataflow PEs, routers between the PEs are statically configured to route to each other, so that there is a communication path between the PEs. Dataflow PEs time-multiplex these links to communicate.

Execution of Concurrent Program Regions: The input and output ports buffer data until it is ready to be consumed by the PEs or written to scratchpad. Ports are implemented as programmable FIFOs that have fixed connections to locations within the mesh.
They may be associated with systolic program regions or dataflow regions. Systolic regions have the execution semantics that one instance of the computation begins when all of its inputs are available. To support multiple systolic program regions which fire independently, the readiness of each region's ports is tracked separately. On the other hand, ports associated with dataflow regions may immediately send data. In the example in Figure 13, the inner-loop region uses three systolic PEs and the outer-loop region uses both dataflow PEs.

Maintaining Dependences: The XFER unit works with the ports to implement inter-region dependences. It arbitrates access to data buses on the output ports and input ports. If a dependence is non-inductive, the stream controller within the XFER unit is simply configured with a stream to send data from an output to an input port (multiple dependence streams will contend for the bus).

For a dependence with reuse (consumption>1), a hardware-specialized FSM is configured within the input port to track the number of times the data should be reused before popping it. If a dependence discards some outputs (production>1), then the output port is configured to implement an FSM to track the number of times an output should be discarded. In the example, the inductive reuse dependence is routed from P6 to P2 by the XFER unit, then reused according to the FSM in P2.
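Below is a minimal behavioral sketch of the input-port reuse FSM just described (the consumption>1 case); the state layout and function names are our own illustration, not RTL from the paper.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {    /* illustrative input-port reuse FSM state          */
  int reuse_left;   /* reuses remaining for the value at the FIFO head  */
  int reuse_init;   /* reuse count configured for this outer iteration  */
  int stretch;      /* inductive change to the reuse count per value    */
} port_fsm_t;

/* Configure the FSM, e.g. for reuse counts n, n-1, ..., 1. */
void port_config(port_fsm_t *p, int init, int stretch) {
  p->reuse_init = init;
  p->reuse_left = init;
  p->stretch = stretch;
}

/* Called once per consuming firing. Returns true when the FIFO head has
 * been reused the configured number of times and should be popped. */
bool port_consume(port_fsm_t *p) {
  if (--p->reuse_left > 0)
    return false;                  /* keep the head value; reuse again   */
  p->reuse_init += p->stretch;     /* next value reused, e.g., once less */
  p->reuse_left = p->reuse_init;
  return true;                     /* pop the head from the port FIFO    */
}

int main(void) {                   /* pop pattern for init=3, stretch=-1 */
  port_fsm_t p;
  port_config(&p, 3, -1);
  for (int i = 0; i < 6; ++i)
    printf("%d", port_consume(&p)); /* prints 001011: reuse 3x, 2x, 1x   */
  printf("\n");
  return 0;
}
```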

Stream Predication for Vectorization: Ports also implement stream predication. The FSM at the port compares the remaining iterations with the port's vector length. If the number of iterations left is non-zero and less than the vector length, the stream control unit sends the data padded with zeroes for the unused vector lanes, along with additional meta-information which indicates that those vector lanes should be predicated off.

Memory Access: Finally, a stream controller in the scratchpad arbitrates streams' access into (or out of) the mesh, by writing (or reading) its own set of data buses. Inductive memory access is supported in the same way as dependences: the FSM in the input or output port is configured to reuse or discard according to the inductive pattern. Another benefit of the reuse support in the port is that it reduces scratchpad bandwidth for those accesses.

Overall, the systolic-dataflow architecture achieves both high-efficiency execution for the vast majority of the computation, and flexible parallelism where it is needed.

C. Applicability to Other Architectures

The inductive dataflow model opens an opportunity for specializing common control patterns, and also enables efficient vectorization of these patterns. It can be independently applied to any spatial architecture by introducing dependence stream and memory stream primitives. For example, if a traditional CGRA (static-scheduled/shared-PE) is extended with hardware and a software interface to specify an inductive memory stream primitive (and if it supports stream predication), vectorization of an inductive region can be achieved.

The principle behind the hybrid systolic-dataflow design, combining temporally-shared and dedicated PEs in one spatial architecture, is also broadly applicable. For example, Plasticine [12], a dynamic dedicated-PE spatial architecture, could be augmented with tagged dataflow PEs for low-rate computations, enabling higher overall hardware utilization.

V. REVEL ACCELERATOR

Thus far we have described a spatial architecture to execute inductive-dataflow programs, and an execution model which can take advantage of this specialized program representation. What remains is to develop an approach for scaling the design, as well as a specific ISA. We address these aspects with the REVEL accelerator proposed in this section. We first discuss flexible scaling with multiple lanes, then cover the vector-stream ISA which lets a single simple control core coordinate all lanes.

A. REVEL Architecture

Multi-lane Rationale: Scaling up the design requires consideration of how to coordinate parallel units with low overhead, while maintaining the ability to exploit inductive parallelism. A possible approach is a monolithic architecture with a large spatial array and a very wide scratchpad. However, the monolithic design has several drawbacks:
• Average routing distance increases, increasing the average latency of communication.
• The entire array must be reconfigured at one time, which reduces flexibility.
• Compilation times for the spatial architecture grow as routing/scheduling complexity increases.

We take the alternative approach of using multiple lanes, where each lane is comprised of the hybrid systolic-dataflow architecture and some control. Lanes can independently execute their own program regions, while also communicating using dependence streams or through shared scratchpad memory. Also, since each lane can be programmed separately, lanes can either work together or on independent tasks.

[Fig. 14: REVEL Accelerator Architecture; vector lanes (lanes 3–8 elided), each with a hybrid systolic/dataflow fabric, private SPAD, stream control, input/output ports, and XFER unit, plus a vector-stream control core with command queues and a shared SPAD on a shared data bus.]

REVEL Overview: The accelerator (Figure 14) is composed of a number of lanes, a low-power Von Neumann control core, and a shared scratchpad (which serves as the external memory interface). The control core constructs "stream commands" and ships these asynchronously to the lanes. Stream commands are buffered in local command queues until the hardware is ready, which ensures they execute in program order. Streams are issued to the XFER unit and scratchpad controller. At this point, execution of the spatial fabric in a lane is as described in the previous section. The XFER unit is extended to support communication between program regions mapped to consecutive lanes. Since dataflow PEs are larger, they are grouped on the right side of the spatial fabric to enable simpler physical design.

B. Vector-stream ISA and Control Model

An ideal control model would both rely on simple-to-reason-about sequential programming, and amortize control in time (streams) and space (lanes).
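The sketch below conveys the flavor of such stream commands: one command covers an entire inductive access pattern, and a lane mask amortizes it across lanes. All field and constant names (stream_cmd_t, LANES_ALL, A_BASE) are illustrative assumptions, not REVEL's actual ISA encoding.

```c
#include <stdint.h>

#define LANES_ALL 0xFFu   /* hypothetical "broadcast to all lanes" mask */
#define A_BASE    0x0000u /* hypothetical scratchpad base of array a    */

typedef struct {
  uint8_t  in_port;   /* destination input port in the lane's fabric   */
  uint32_t base;      /* scratchpad base address                       */
  int32_t  length;    /* initial inner-loop trip count                 */
  int32_t  stretch;   /* per-outer-iteration change to the trip count  */
  int32_t  repeats;   /* number of outer iterations covered            */
  uint8_t  lane_mask; /* which vector lanes receive this stream        */
} stream_cmd_t;

/* One command describes all of solver's inner-loop loads of a[j,j+1:n]
 * (n=8): 7 elements on the first outer iteration, one fewer each time. */
static const stream_cmd_t load_a_row = {
  .in_port = 2, .base = A_BASE, .length = 7, .stretch = -1,
  .repeats = 8, .lane_mask = LANES_ALL,
};
```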

[Figure: Vector-stream code example; an original program (for(j=0; ...)) lowered to computation graphs and stream commands, where each command (e.g. StoreStream: out port → local addr) carries pattern, source, and destination parameters.]

[Fig. 18: Compiler steps; C annotated with pragmas → Clang (pragma metadata) → modified LLVM IR → extract DFG → inject vectorized streams → spatial architecture compiler.]

[Fig. 17(a,b): Original Cholesky source annotated with #pragma config and #pragma parallel loop-carry, and the extracted computations & streams (e.g. stream a[k,k]); dependences mapped to host control are numbered to match the ports in (d).]


[Fig. 17(c,d): Mapping Cholesky to REVEL with pragma support; per-lane vector-stream commands (LoadStream, WriteStream, Xfer with inductive patterns, WriteBar, and a final wait for all lanes), which run on a centralized host, with the outer k loop parallelized across lanes, producing REVEL binaries: dataflow config + vector-stream code.]

VI. PROGRAMMING AND COMPILATION

Programming REVEL involves five basic responsibilities:
1) Dividing the work onto lanes and partitioning data.
2) Deciding which program regions execute concurrently.
3) Extracting memory/dependence streams; inserting barriers.
4) Decoupling the computation graph and vectorizing it.
5) Mapping computation onto spatial PEs and network.

We developed an LLVM/Clang-based compiler which relies on pragma hints for 1–2 and automates 3–5. Figure 18 shows the compilation pipeline, and Figure 17 shows Cholesky undergoing transformation from C with pragmas to REVEL abstractions. We next explain the compiler stack.

Pragma-based Compilation: The pragmas are inspired by OpenMP's tasks [41]. Each offloaded code region is similar to a light-weight thread: it has independent control flow and potentially input/output dependences. Figure 17(a) shows how Cholesky is annotated with the following pragmas. #pragma dataflow/systolic specifies whether a program region is offloaded to the systolic or dataflow architecture. It has two optional clauses: the unroll clause specifies the number of iterations to offload in one computation instance, determining resource use; the in/out/inout clause specifies data dependence distance patterns. #pragma parallel loop-carry is a hint to detect and enforce loop-carried dependences, by analyzing the inputs/outputs specified by in/out/inout clauses between adjacent iterations⁴.

Computation Graph and Vector-Stream Extraction: The first step is to extract the computation graph (using slicing [42]) and streams from offloaded program regions. A configuration instruction is inserted at the site of #pragma config, and is pointed to the spatial configuration bits after they are generated. Streams are extracted by determining the access pattern, for which we use LLVM scalar evolution analysis [43]. The encoded streams are injected at the site of #pragma stream. Dependence pattern metadata is used to see if some memory streams can be converted to dependence streams to save memory traffic. Scratchpad barriers are inserted to keep data consistent. Finally, empty loops (due to being completely offloaded) are removed.

Spatial Architecture Compiler: The spatial architecture compiler maps the computation graphs of all concurrent program regions to systolic and dataflow PE resources, and generates the configuration (a similar role as [35], [37], [44]–[46]). The graph is first "unrolled" (i.e. vectorized) by a degree determined by pragma metadata if specified; otherwise the most critical loop (estimated based on expected instruction count) is unrolled to fill the spatial fabric.

The responsibilities of the dataflow compiler are to map instructions to PEs, dependences to the network, and to make timing decisions. For the systolic regions, all operand timing paths must be equalized, and no resources may be shared between instructions. For the dataflow computation graphs, the goal is to minimize resource contention. Usually, instructions belonging to systolic/dataflow regions map to the corresponding PEs (systolic-PEs or dataflow-PEs); however, dataflow instructions may map to the systolic fabric if it has low utilization, and systolic instructions to the dataflow fabric to reduce latency or network congestion, provided that there are enough resources. To balance these objectives, we adapt a prior heuristic [47] to use simulated annealing, similar to Pathfinder [48]. All concurrent computation graphs are mapped simultaneously to find the best overall schedule.

⁴Support for this pragma is ongoing; some kernels require intervention.

VII. METHODOLOGY

REVEL Modeling: Table III shows REVEL hardware parameters. All blocks are modeled at a cycle level in a custom simulator, which is integrated with a cycle-level model of an inorder core in gem5 [49], [50], extended with a custom RISCV ISA for vector-stream control. To compare against state-of-the-art spatial architectures, we create a custom simulator for each type. We synthesized REVEL's prototype using Synopsys DC with a 28nm tech library. The design meets timing at 1.25GHz. An open-source implementation of the triggered-instructions fabric [40] was our reference for the temporal fabric. Results from synthesis are used to create an event-based area and power model.

TABLE III: REVEL Parameters
Control core: RISCV ISA [51], 5-stage, single-issue, 16kb d$, with instructions added for stream commands.
Lane spatial fabric: 64-bit circuit-switched mesh; dataflow PE width 4-way fixed-point / 2-way FP (FP latency 12 cyc., div/sqrt throughput 1/5); 32 instruction slots per shared PE.
Lane vector ports: 4-entry FIFOs.
Lane stream control: bandwidth for 8 concurrent streams; 8-entry command queue.
Lane private SPAD: 8Kb, single-bank (1R/1W port), 512-bit bandwidth.
REVEL: 8 lanes; shared SPAD 128Kb, single-bank (1R/1W port), 512-bit bandwidth; shared network: 512-bit data bus (SPAD+XFER), 512-bit inter-lane bus, 8-bit command-sync bus.

Workload Versions: We evaluate batch sizes 1 and 8, requiring different optimizations: for batch 1, REVEL spreads work across lanes (if possible), and for batch 8 each lane operates over one input. Table V shows data sizes and # of lanes used in batch 1.

[TABLE V: Workload parameters; data sizes ("small" and "large" sizes bolded) and # of lanes for SVD, QR, Cholesky, Solver, FIR, GEMM, and FFT.]

ASIC Analytical Models: The models are based on the optimized algorithms, and are limited only by the algorithmic critical path and throughput constraints, with FUs equivalent to REVEL's. These optimistic models (Table IV) only count FU and scratchpad area and power.

[TABLE IV: Ideal ASIC analytical models; per-workload cycle formulas in terms of the matrix dims m, n (except FIR, where n is the number of iterations and d is the filter size), vector width vec, and div/sqrt latency.]
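The per-kernel formulas in Table IV did not survive extraction, so the sketch below only captures the general shape the text describes: execution time bounded by the larger of the algorithmic critical path and the FU-throughput bound. This is a generic template of ours, not the paper's exact per-kernel expressions.

```c
/* Generic form of an ideal-ASIC analytical model: cycles are limited only
 * by algorithmic critical path and by FU throughput (FU counts equal to
 * REVEL's). Real per-kernel models specialize both terms to the kernel. */
double asic_cycles(double critical_path_cycles,
                   double total_ops,
                   double fu_ops_per_cycle) {
  double throughput_bound = total_ops / fu_ops_per_cycle;
  return critical_path_cycles > throughput_bound
             ? critical_path_cycles
             : throughput_bound;
}
```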
parallelism, the behind inductive benefits of of cialization sources the characterize GPUs. to and Second, Spatial, DSPs, CPUs, state-of-the-art over speedups • • • • Speedup over DSP Speedup over DSP o ml and small For 20. Figure in is 8 batch for Performance × the quantify to First goals. main four has evaluation Our 1 1 1 1 0 0 0 0 6F desmliles sn DSPLIB using (@1.25GHz) adders/multipliers, 16-FP DSP 6678 TI lnsaetesame. the are #lanes and FLOPs Spatial: peak GPU’s benchmark. gpu is our as library NVIDIA CUDA cuBLAS and cuFFT, (@1.2GHz) cuSOLVER, using V processor TITAN NVIDIA GPU: used) library. cores MKL (8 Intel (@2.1GHz) highly-optimized using 4116 processor Xeon OOO Intel Core: OOO 0 1 0 1 n 17 and > svd svd × 10 dataflow i.2:PromneTaefs(ac size=8) (batch Tradeoffs Performance 20: Fig. size=1) (batch Tradeoffs Performance 19: Fig. qr qr atrta aao n systolic. and dataflow than faster × × The cho cho higher o ml n ag aaszs EE s3.5 is REVEL sizes. data large and small for Small Small sol sol .Fsand FUs [33]. Insts. Triggered to similar is systolic peusoe S o ac r shown are 1 batch for DSP over Speedups hnREVEL. than fir fir II E VIII. mm mm eini iia oSfban[13], Softbrain to similar is design fft fft VALUATION o aresw opr designs compare we fairness For cpu cpu GM GM × -oeDP ahcr has core each DSP, 8-core svd svd gpu gpu peu,wt ema of geomean with speedup, qr qr systolic systolic cho cho × Large Large sol sol V0 graphics GV100 C66x n 8.1 and fir fir tagged tagged Conventional mm mm 3.4.0.0. × fft fft revel revel over GM GM × 24 issued temporal scr-b/w ctrl-ovhd config multi-issued stream-dpd scr-bar drain 23 Small Large 2 2 1.00 21 0.75 20 mkl 1 revel 1 0.50 2 − 1 mkl 2 revel 2 Speedup (Log) MKL enables multi- mkl 4 revel 4 2 − 2 threading from 0.25 mkl 8 revel 8 matrix size 128 3 2 − 0.00 16 32 48 64 80 96 Relative Exec. Time 112 128 144 160 176 192 208 224 240 256 systolic systolic systolic systolic systolic systolic systolic systolic systolic systolic systolic systolic systolic systolic Fig. 21: CPU vs REVEL Scaling 8-batch 8-batch 1-batch 8-batch 1-batch 8-batch 8-batch 1-batch 8-batch 1-batch 8-batch 8-batch 8-batch 1-batch 8-batch 1-batch 8-batch 8-batch 1-batch 8-batch 1-batch 8-batch svd qr cho sol fir mm fft svd qr cho sol fir mm fft

REVEL vs CPU Parallelism: Figure 21 shows the scaling of REVEL's performance against MKL's CPU version of Cholesky for different matrix sizes and thread counts. Observe that when multi-threading is first enabled in MKL (at matrix size 128), it actually hurts performance. This is because of the inherent fine-grain dependences, which REVEL supports natively. Inductive dataflow can parallelize much finer-grain dependences than CPU threading can.

[Fig. 21: CPU vs REVEL scaling; relative execution time across Cholesky matrix sizes 16–256 for mkl and revel with 1, 2, 4, and 8 threads/lanes; MKL enables multi-threading from matrix size 128.]

Benefits from Hardware/Software Mechanisms: To understand the sources of improvement, we evaluate four versions of REVEL with increasingly advanced features. We start with the systolic design, then add inductive dependences/streams, hybrid systolic/dataflow, and finally stream predication to enable efficient vectorization. Figure 22 shows the results.

Inductive memory and dependence streams improve all workloads by reducing control and increasing parallelism. Even FFT benefits, by using inductive reuse to reduce scratchpad bandwidth. QR and SVD have complex outer-loop regions, so they do not benefit as much until after adding hybrid systolic-dataflow, which enables more resource allocation for inner-loop regions. Solver is also accelerated by the heterogeneous fabric because it is latency sensitive, and collapsing less critical instructions can reduce latency. The vectorized workloads also receive large gains from stream predication, which reduces the overheads of vectorization. The vector-stream ISA and hybrid systolic-dataflow architecture together enable high performance.

[Fig. 22: Performance impact of each mechanism; relative speedup (log scale) of systolic, + inductive dep/stream, + hybrid systolic/dataflow, and + stream predication, across workloads and data sizes.]

Cycle-Level Bottlenecks: Figure 23 overviews REVEL's cycle-level behavior, normalized to systolic. To explain the categories: issued and multi-issued mean that one or multiple systolic regions fired, and temporal means only a temporal dataflow PE fired during that cycle. All other categories represent overhead, including drain of the dedicated fabric, scr-b/w and scr-bar for scratchpad bandwidth and synchronization, stream-dpd for waiting on dependences, and ctrl-ovhd for waiting on the control core.

The clearest trend is that our design reduces the control overhead dramatically. For some kernels, REVEL is able to execute multiple regions in the same cycle, especially for larger matrices. One outlier is FFT with small data: it requires multiple reconfigurations, each requiring the pipeline to drain. Exploiting inductive parallelism increases parallel work and reduces control, enabling better performance.

[Fig. 23: REVEL's cycle-level bottlenecks; cycles broken into issued, multi-issued, temporal, drain, scr-b/w, scr-bar, stream-dpd, ctrl-ovhd, and config, for 1-batch and 8-batch runs of each workload.]

Dataflow PE Allocation: Tagged dataflow PEs are helpful on inductive workloads, but expensive: a tagged-dataflow PE costs more than 5× the area of a systolic PE (16581µm² versus 2822µm²). Figure 24 shows REVEL's performance and area sensitivity to the number of dataflow PEs. SVD has the largest demand for dataflow PEs, so it is affected the most. The effects on the other workloads are negligible, so we choose 1 dataflow PE to minimize the area penalty.

[Fig. 24: Temporal region sensitivity; relative performance and relative area versus number of dataflow PEs (1, 2, 4, 6, 9) for svd, qr, cho., and sol.]

B. Area and Power Comparison

Breakdown: Table VI shows the power/area breakdown; the largest contribution (especially to power) comes from the FP units. REVEL is 1.93mm² and 1.63 Watts.

Comparing against CPU and DSP: Figure 25 shows relative performance/area normalized to the CPU after adjusting for technology. The DSP achieves a high performance/mm², and REVEL is able to achieve even higher performance with a moderate area overhead. REVEL has a 1089× performance/mm² advantage over the OoO core, and 7.3× over the DSP.

Comparing against ASIC: Table VII shows performance-normalized power and area overheads relative to the ASIC analytical models. REVEL is a mean of 2.0× the power; this is mostly due to the control logic (ports, buses, etc.) and the reconfigurable networks. It is 0.55× the area of the combined set of ASICs. This comparison is optimistic for the ASICs in that it assumes perfect pipelining and no control power. REVEL is on par with ASIC-level efficiency.

TABLE VI: Area and Power Breakdown (28nm)
Component                              area (mm²)   power (mW)
Compute Fabric: Dedicated Net. (24)      0.06         71.40
Compute Fabric: Temporal Net. (1)        0.02         14.81
Compute Fabric: Func. Units              0.07         74.04
Total Fabric                             0.13        160.25
Control (ports/XFER/stream ctrl)         0.03         62.92
SPAD (8KB)                               0.06          4.64
One Vector Lane                          0.22        207.90
Control Core                             0.04         19.91
REVEL (total)                            1.93       1663.3

[Fig. 25: Relative performance/mm² normalized to CPU; REVEL @1.25GHz, TI C6678 @1.25GHz, Intel OoO @2.10GHz, across svd, qr, cho., sol., fir, mm, and fft.]

TABLE VII: Power/Area overheads relative to ideal ASIC (iso-performance)
Workload      SVD   QR    Cho.  Sol.  FIR   MM    FFT   Mean
Power Ovhd.   2.8   2.0   1.9   1.6   2.0   1.9   1.9   2.0
Area Ovhd.    3.3   2.4   2.3   2.2   2.3   2.3   2.8   2.5/0.55
(Mean area overhead is 2.5× versus per-kernel ASICs, and 0.55× versus the combined set.)

IX. ADDITIONAL RELATED WORK

In this section, we discuss work other than that previously covered by the spatial architecture taxonomy in Section III-A.

Synchronous Dataflow Variants: The inductive production-to-consumption rates in our dataflow model are inspired by the static rates in synchronous dataflow [39] (SDF). SDF was developed as a specialization of existing dataflow models which could be statically scheduled. Cyclo-static dataflow [52] extends SDF with periodically changing rates, and heterochronous dataflow [53] extends SDF to enable an FSM to step through predefined rates. None of the above were applied to spatial architectures or handle inductive dependences.

StreamIt [54] is a language and runtime with somewhat similar semantics to vanilla SDF, and was evaluated on RAW [55], a (mostly) static/shared-PE spatial architecture.

Outer-loop Parallelism: Prabhakar et al. develop "nested-parallelism," which enables coupling of hardware resources with nested parallel patterns [56]. Inductive parallelism is a generalization of nested-parallelism, and we can achieve a higher utilization due to hybrid systolic-dataflow execution.

Some CGRA compilers target nested loops [57], [58], but only parallelize the epilogue and prologue of subsequent loop nests. Recent work has made progress in pipelining imperfect nests [59], but does not parallelize across multiple region instances. CGRA Express [60] allows a CGRA to use the first row of its PEs in VLIW mode during outer loops. Concurrent execution across inner and outer regions is not attained. None of the above handle inductive dependences.

Flexible Vectorization: Vector-threading techniques also marshal independent execution lanes for vectorized execution when useful [61]–[64]. The RISC-V vector extension supports a configurable vector length and implicit vector masking [65]. Vector length is limited by physical registers (REVEL's streams are arbitrary length), and inductive access is not supported, so the vector length would have to be reset on each iteration. These architectures are also not spatial, so they cannot exploit pipelined instruction parallelism.

Some spatial dataflow models use predicates for control [6], [66]. These do not use streams for vector predication. dMT-CGRA [34] adds inter-thread communication for a spatial-dataflow GPU [32], [67].

DSP Accelerators: Many application/domain-specific reconfigurable designs have targeted DSP algorithms. Fasthuber et al. [68] outline the basic approaches. One representative example is LAC [69], targeted at matrix factorization. Our architecture allows more general programmability.

Stream-based ISAs and Reuse: Many prior architectures have used memory-access stream primitives [13], [24], [31], [70]–[74]. To our knowledge, no prior work has incorporated inductive patterns into such streams.

X. CONCLUSION

This work identified that existing techniques (vector, multi-threading, and spatial) all experience challenges in achieving high performance on dense linear algebra workloads, largely due to inductive program behavior.

We found that it is possible to specialize for this behavior by encoding inductive properties into the hardware/software interface. This was the approach behind inductive dataflow, a model that allows the expression of parallelism within inductive program regions and across them. Our taxonomy of spatial architectures also makes it clear why a hybrid architecture, one that combines systolic fabrics for efficiency and dataflow fabrics for flexible parallelism, can achieve the best of both.

Finally, this work develops a scalable design, REVEL, by leveraging a vector-stream ISA that amortizes control in space and time. With a full stack implementation, our evaluation against four state-of-the-art designs demonstrates many factors of speedup and energy efficiency.

On one hand, our contribution is what we believe to be a superior design. What is perhaps more important is the principle of hardware/software specialization of complex but general control and memory patterns, useful for developing next-generation programmable accelerators.

XI. ACKNOWLEDGMENTS

We sincerely thank Naxin Zhang, Zhiguo Ge, Jing Tao, and Mian Lu for their thoughtful discussions and domain expertise.

12 This work was supported by an NSF CAREER award CCF- [21] P. Darwood, P. Alexander, and I. Oppermann, “Lmmse chip equalisation 1751400 and gift funding from HiSilicon. for 3gpp wcdma downlink receivers with channel coding,” in ICC 2001. IEEE International Conference on Communications. Conference Record (Cat. No.01CH37240), vol. 5, pp. 1421–1425 vol.5, 2001. REFERENCES [22] D. Tse and P. Viswanath in Fundamentals of Wireless Communication, New York, NY, USA: Cambridge University Press, 2005. [1] I. T. Union, “Minimum requirements related to technical performance [23] M. Annaratone, E. A. Arnould, T. Gross, H. T. Kung, M. S. Lam, for IMT-2020 radio interface(s),” 2017. O. Menzilcioglu, and J. A. Webb, “The Warp Computer: Architecture, [2] S. Chen and J. Zhao, “The requirements, challenges, and technologies Implementation, and Performance,” IEEE Transactions on Computers, for 5g of terrestrial mobile telecommunication,” IEEE communications vol. 36, pp. 1523–1538, December 1987. magazine, vol. 52, no. 5, pp. 36–43, 2014. [24] J. Cong, H. Huang, C. Ma, B. Xiao, and P. Zhou, “A fully pipelined and [3] H. Tullberg, P. Popovski, Z. Li, M. A. Uusitalo, A. Hoglund, O. Bulakci, dynamically composable architecture of cgra,” in Field-Programmable M. Fallgren, and J. F. Monserrat, “The metis 5g system concept: Meeting Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual Inter- the 5g requirements,” IEEE Communications magazine, vol. 54, no. 12, national Symposium on, pp. 9–16, IEEE, 2014. pp. 132–139, 2016. [25] S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. Taylor, [4] Y. C. Hu, M. Patel, D. Sabella, N. Sprecher, and V. Young, “Mobile edge and R. Laufer, “Piperench: a for streaming multimedia computing—a key technology towards 5g,” ETSI white paper, vol. 11, acceleration,” in , 1999. Proceedings of the 26th no. 11, pp. 1–16, 2015. International Symposium on, 1999. [5] M. Budiu, P. V. Artigas, and S. C. Goldstein, “Dataflow: A complement [26] H. Singh, M.-H. Lee, G. Lu, N. Bagherzadeh, F. J. Kurdahi, and E. M. C. to superscalar,” in ISPASS, 2005. Filho, “Morphosys: An integrated reconfigurable system for data-parallel [6] M. Budiu, G. Venkataramani, T. Chelcea, and S. C. Goldstein, “Spatial and computation-intensive applications,” IEEE Trans. Comput., vol. 49, Computation,” in ASPLOS XI. pp. 465–481, May 2000. [7] Arvind and R. S. Nikhil, “Executing a program on the MIT Tagged- [27] T. Miyamori and K. Olukotun, “Remarc: Reconfigurable multimedia Token Dataflow Architecture,” IEEE Transactions on Computers, array coprocessor,” IEICE Transactions on information and systems, vol. 39, no. 3, pp. 300–318, 1990. vol. 82, no. 2, pp. 389–397, 1999. [8] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin, “Wavescalar,” [28] E. Mirsky, A. DeHon, et al., “Matrix: a reconfigurable computing in Proceedings of the 36th Annual IEEE/ACM International Symposium architecture with configurable instruction distribution and deployable on Microarchitecture, MICRO 36, (Washington, DC, USA), pp. 291–, resources.,” in FCCM, vol. 96, pp. 17–19, 1996. IEEE Computer Society, 2003. [29] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, [9] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, “Adres: An architecture with tightly coupled vliw processor and coarse- C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, W. Yoder, and t. T. 
grained reconfigurable matrix,” in International Conference on Field Team, “Scaling to the end of silicon with edge architectures,” Computer, Programmable Logic and Applications, pp. 61–70, Springer, 2003. vol. 37, pp. 44–55, July 2004. [30] C. Nicol, “A coarse grain reconfigurable array (CGRA) for statically [10] M. Mishra, T. J. Callahan, T. Chelcea, G. Venkataramani, S. C. Gold- scheduled data flow computing,” WaveComputing WhitePaper, 2017. stein, and M. Budiu, “Tartan: Evaluating spatial computation for whole [31] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, program execution,” in Proceedings of the 12th International Conference “Q100: The architecture and design of a processing unit,” on Architectural Support for Programming Languages and Operating in Proceedings of the 19th International Conference on Architectural Systems, ASPLOS XII, (New York, NY, USA), pp. 163–174, ACM, Support for Programming Languages and Operating Systems, ASPLOS 2006. ’14, (New York, NY, USA), pp. 255–268, ACM, 2014. [11] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, [32] D. Voitsechov and Y. Etsion, “Single-graph multiple flows: Energy K. Sankaralingam, and C. Kim, “Dyser: Unifying functionality and efficient design alternative for gpgpus,” in Proceeding of the 41st parallelism specialization for energy-efficient computing,” IEEE Micro, Annual International Symposium on Computer Architecuture, ISCA ’14, vol. 32, pp. 38–51, Sept. 2012. (Piscataway, NJ, USA), pp. 205–216, IEEE Press, 2014. [12] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, [33] A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, A. Pedram, C. Kozyrakis, and K. Olukotun, “Plasticine: A reconfigurable V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess, architecture for parallel paterns,” in Proceedings of the 44th Annual S. Maresh, and J. Emer, “Triggered instructions: A control paradigm for International Symposium on Computer Architecture, ISCA ’17, (New spatially-programmed architectures,” in Proceedings of the 40th Annual York, NY, USA), pp. 389–402, ACM, 2017. International Symposium on Computer Architecture, ISCA ’13, (New [13] T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, York, NY, USA), pp. 142–153, ACM, 2013. “Stream-dataflow acceleration,” in Proceedings of the 44th Annual [34] D. Voitsechov and Y. Etsion, “Inter-thread communication in International Symposium on Computer Architecture, ISCA ’17, (New multithreaded, reconfigurable coarse-grain arrays,” arXiv preprint York, NY, USA), pp. 416–429, ACM, 2017. arXiv:1801.05178, 2018. [14] R. Mudumbai, G. Barriac, and U. Madhow, “On the feasibility of [35] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins, distributed beamforming in wireless networks,” IEEE Transactions on “Exploiting loop-level parallelism on coarse-grained reconfigurable ar- Wireless communications, vol. 6, no. 5, 2007. chitectures using modulo scheduling,” IEE Proceedings - Computers and [15] H. Johansson et al., “Polyphase decomposition of digital fractional-delay Digital Techniques, vol. 150, pp. 255–61–, Sept 2003. filters,” IEEE signal processing letters, vol. 22, no. 8, pp. 1021–1025, [36] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, 2015. “Exploiting loop-level parallelism on coarse-grained reconfigurable ar- [16] R. 
Zhao, “Wls design of centro-symmetric 2-d fir filters using matrix chitectures using modulo scheduling,” IEE Proceedings - Computers and iterative algorithm,” in 2015 IEEE International Conference on Digital Digital Techniques, vol. 150, pp. 255–, Sep. 2003. Signal Processing (DSP), pp. 34–38, July 2015. [37] H. Park, K. Fan, S. A. Mahlke, T. Oh, H. Kim, and H.-s. Kim, “Edge- [17] F. Mintzer, “On half-band, third-band, and nth-band fir filters and centric modulo scheduling for coarse-grained reconfigurable architec- their design,” IEEE Transactions on Acoustics, Speech, and Signal tures,” in Proceedings of the 17th international conference on Parallel Processing, vol. 30, pp. 734–738, Oct 1982. architectures and compilation techniques, PACT ’08, pp. 166–176, 2008. [18] M. Dendrinos, S. Bakamidis, and G. Carayannis, “Speech enhancement [38] M. Hamzeh, A. Shrivastava, and S. Vrudhula, “Regimap: Register- from noise: A regenerative approach,” Speech Communication, vol. 10, aware application mapping on coarse-grained reconfigurable architec- no. 1, pp. 45–57, 1991. tures (cgras),” in 2013 50th ACM/EDAC/IEEE Design Automation [19] D. Patel, M. Shabany, and P. G. Gulak, “A low-complexity high-speed Conference (DAC), pp. 1–10, May 2013. qr decomposition implementation for mimo receivers,” in 2009 IEEE [39] E. A. Lee and D. G. Messerschmitt, “Synchronous data flow,” Proceed- International Symposium on Circuits and Systems, pp. 33–36, May 2009. ings of the IEEE, vol. 7, pp. 1235–1245, September 1987. [20] P. Salmela, A. Happonen, T. Jarvinen, A. Burian, and J. Takala, “Dsp [40] T. J. Repetti, J. a. P. Cerqueira, M. A. Kim, and M. Seok, “Pipelining implementation of cholesky decomposition,” in Joint IST Workshop on a triggered processing element,” in Proceedings of the 50th Annual Mobile Future, 2006 and the Symposium on Trends in Communications. IEEE/ACM International Symposium on Microarchitecture, MICRO-50 SympoTIC ’06., pp. 6–9, June 2006. ’17, (New York, NY, USA), pp. 96–108, ACM, 2017.

13 [41] TheOpenMPTeam, “Openmp.” https://openmp.org, 1997-2019. [58] J. Lee, S. Seo, H. Lee, and H. U. Sim, “Flattening-based mapping [42] V. Govindaraju, T. Nowatzki, and K. Sankaralingam, “Breaking of imperfect loop nests for cgras?,” in 2014 International Conference shackles with an exposed flexible microarchitecture and the access on Hardware/Software Codesign and System Synthesis (CODES+ISSS), execute pdg,” in PACT, pp. 341–351, 2013. pp. 1–10, Oct 2014. [43] R. A. Van Engelen, “Efficient symbolic analysis for optimizing compil- Transactions on Parallel and Distributed Systems, vol. 27, pp. 3199– ers,” in International Conference on Compiler Construction, pp. 118– 3213, Nov 2016. 132, Springer, 2001. [60] Y. Park, H. Park, and S. Mahlke, “Cgra express: Accelerating exe- [44] M. Mercaldi, S. Swanson, A. Petersen, A. Putnam, A. Schwerin, cution using dynamic operation fusion,” in Proceedings of the 2009 M. Oskin, and S. J. Eggers, “Instruction scheduling for a tiled dataflow International Conference on Compilers, Architecture, and Synthesis for architecture,” in Proceedings of the 12th international conference on Ar- Embedded Systems, CASES ’09, (New York, NY, USA), pp. 271–280, chitectural support for programming languages and operating systems, ACM, 2009. ASPLOS XII, pp. 141–150, 2006. [61] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, [45] W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and and K. Asanovic,´ “Exploring the tradeoffs between programmability S. Amarasinghe, “Space-time scheduling of instruction-level parallelism and efficiency in data-parallel accelerators,” in Proceedings of the 38th on a raw machine,” in Proceedings of the Eighth International Con- Annual International Symposium on Computer Architecture, ISCA ’11, ference on Architectural Support for Programming Languages and (New York, NY, USA), pp. 129–140, ACM, 2011. Operating Systems, ASPLOS VIII, (New York, NY, USA), pp. 46–57, [62] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, ACM, 1998. and K. Asanovic, “The vector-thread architecture,” in Proceedings of the [46] T. Nowatzki, M. Sartin-Tarm, L. De Carli, K. Sankaralingam, C. Estan, 31st Annual International Symposium on Computer Architecture, ISCA and B. Robatmili, “A general constraint-centric scheduling framework ’04, (Washington, DC, USA), pp. 52–, IEEE Computer Society, 2004. for spatial architectures,” in Proceedings of the 34th ACM SIGPLAN [63] S. Rivoire, R. Schultz, T. Okuda, and C. Kozyrakis, “Vector lane Conference on Programming Language Design and Implementation, threading,” in 2006 International Conference on Parallel Processing PLDI ’13, (New York, NY, USA), pp. 495–506, ACM, 2013. (ICPP’06), pp. 55–64, Aug 2006. [47] T. Nowatzki, N. Ardalani, K. Sankaralingam, and J. Weng, “Hybrid optimization/heuristic instruction scheduling for programmable acceler- [64] J. Kim, S. Jiang, C. Torng, M. Wang, S. Srinath, B. Ilbeyi, K. Al-Hawaj, ator codesign,” in Proceedings of the 27th International Conference on and C. Batten, “Using intra-core loop-task accelerators to improve Parallel Architectures and Compilation Techniques, PACT ’18, (New the productivity and performance of task-based parallel programs,” in York, NY, USA), pp. 36:1–36:15, ACM, 2018. Proceedings of the 50th Annual IEEE/ACM International Symposium on [48] L. McMurchie and C. Ebeling, “Pathfinder: A negotiation-based Microarchitecture, MICRO-50 ’17, (New York, NY, USA), pp. 
759–773, performance-driven router for fpgas,” in Third International ACM Sym- ACM, 2017. posium on Field-Programmable Gate Arrays, pp. 111–117, Feb 1995. [65] R. Foundation, “Working draft of the proposed risc-v v vector exten- [49] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, sion.” https://github.com/riscv/riscv-v-spec. J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, [66] A. Smith, R. Nagarajan, K. Sankaralingam, R. McDonald, D. Burger, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” S. W. Keckler, and K. S. McKinley, “Dataflow predication,” in 2006 SIGARCH Comput. Archit. News, 2011. 39th Annual IEEE/ACM International Symposium on Microarchitecture [50] A. Roelke and M. R. Stan, “RISC5: Implementing the RISC-V ISA in (MICRO’06), pp. 89–102, IEEE, 2006. gem5,” in Workshop on Computer Architecture Research with RISC-V [67] D. Voitsechov and Y. Etsion, “Control flow coalescing on a hybrid (CARRV), 2017. dataflow/von neumann gpgpu,” in 2015 48th Annual IEEE/ACM Inter- [51] K. Asanovic´ and D. A. Patterson, “Instruction sets should be free: The national Symposium on Microarchitecture (MICRO), pp. 216–227, Dec case for risc-v,” EECS Department, University of California, Berkeley, 2015. Tech. Rep. UCB/EECS-2014-146, 2014. [68] R. Fasthuber, F. Catthoor, P. Raghavan, and F. Naessens, Energy-Efficient [52] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete, “Cyclo- Communication Processors: Design and Implementation for Emerging static data flow,” in 1995 International Conference on Acoustics, Speech, Wireless Systems. Springer Publishing Company, Incorporated, 2013. and Signal Processing, vol. 5, pp. 3255–3258 vol.5, May 1995. [69] A. Pedram, A. Gerstlauer, and R. van de Geijn, “Algorithm, architecture, [53] A. Girault, B. Lee, and E. A. Lee, “Hierarchical finite state machines and floating-point unit codesign of a matrix factorization accelerator,” with multiple concurrency models,” IEEE Transactions on computer- IEEE Transactions on Computers, no. 1, pp. 1–1, 2014. aided design of integrated circuits and systems, vol. 18, no. 6, pp. 742– [70] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas,´ 760, 1999. P. R. Mattson, and J. D. Owens, “A bandwidth-efficient architecture for [54] W. Thies, M. Karczmarek, and S. Amarasinghe, “Streamit: A language media processing,” in Proceedings of the 31st Annual ACM/IEEE Inter- for streaming applications,” in International Conference on Compiler national Symposium on Microarchitecture, MICRO 31, (Los Alamitos, Construction, pp. 179–196, Springer, 2002. CA, USA), pp. 3–13, IEEE Computer Society Press, 1998. [55] M. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, C. Leger, [71] S. Ciricescu, R. Essick, B. Lucas, P. May, K. Moat, J. Norris, A. A. Lamb, J. Wong, H. Hoffman, D. Z. Maze, and S. Amarasinghe, M. Schuette, and A. Saidi, “The reconfigurable streaming vector proces- “A Stream Compiler for Communication-Exposed Architectures,” in sor (rsvp),” in Proceedings of the 36th Annual IEEE/ACM International Proceedings of the 10th International Conference on Architectural Symposium on Microarchitecture, MICRO 36, (Washington, DC, USA), Support for Programming Languages and Operating Systems, pp. 291– pp. 141–, IEEE Computer Society, 2003. 303, October 2002. [56] R. Prabhakar, D. Koeplinger, K. J. Brown, H. Lee, C. De Sa, [72] G. Weisz and J. C. Hoe, “Coram++: Supporting data-structure-specific 25th International Confer- C. Kozyrakis, and K. 
Olukotun, “Generating configurable hardware memory interfaces for fpga computing,” in ence on Field Programmable Logic and Applications (FPL) from parallel patterns,” in Proceedings of the Twenty-First International , pp. 1–8, Conference on Architectural Support for Programming Languages and Sept 2015. Operating Systems, ASPLOS ’16, (New York, NY, USA), pp. 651–665, [73] T. Hussain, O. Palomar, O. Unsal, A. Cristal, E. Ayguad, and M. Valero, ACM, 2016. “Advanced pattern based for fpga based hpc ap- [57] S. Yin, D. Liu, Y. Peng, L. Liu, and S. Wei, “Improving nested plications,” in 2014 International Conference on High Performance loop pipelining on coarse-grained reconfigurable architectures,” IEEE Computing Simulation (HPCS), pp. 287–294, July 2014. Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, [74] Z. Wang and T. Nowatzki, “Stream-based memory access specialization no. 2, pp. 507–520, 2016. for general purpose processors,” in Proceedings of the 46th International [59] S. Yin, X. Lin, L. Liu, and S. Wei, “Exploiting parallelism of imper- Symposium on Computer Architecture, ISCA ’19, (New York, NY, USA), fect nested loops on coarse-grained reconfigurable architectures,” IEEE pp. 736–749, ACM, 2019.
