Compiler Vectorization for Coarse-Grained Reconfigurable Architectures

Geert Linders ([email protected])
Master's Thesis

Eindhoven University of Technology

August 13, 2019

Abstract

In recent years, there has been a demand for increasingly powerful embedded processing units which operate at an ever higher energy efficiency. However, traditional architectural platforms such as CPUs and FPGAs are not always able to support growing workloads on a tight energy budget. To mitigate this, an alternative paradigm of processing has been gaining traction: the Coarse-Grained Reconfigurable Architecture (CGRA). Configuration of the Functional Units inside a CGRA allows it to be structured in various ways, including general-purpose and application-specific layouts. To support large workloads on a CGRA with minimal energy, a CGRA can be configured as an SIMD processor, which allows for the execution of vectorized programs.

This work explores how vectorized programs from existing CPU programming models can be effectively mapped to a CGRA configuration and program schedule. Currently, a C compiler exists based on the LLVM compiler framework, which is able to compile scalar programs for Blocks: a CGRA in development at Eindhoven University of Technology which is well-suited for SIMD programs. To extend this compiler with vectorization, we make adjustments to the Blocks hardware, compiler model, and instruction set. Moreover, we implement procedures in the compiler and surrounding toolflow in order to add support for lowering vector-specific operations, particularly vector shuffles, for various different Blocks configurations.

Benchmarking results indicate that for a compiler-friendly Blocks configuration, significant speedups and energy reductions can be obtained by vectorization. A vector width of 8 can yield a speedup ranging from 2.3× to 7.6×, whilst energy usage is reduced by 11.1% to 75.8%. Finally, a number of key issues which prevent vectorized Blocks programs from achieving their full potential are identified.

Contents

1 Introduction
2 Related work
  2.1 Blocks
  2.2 Blocks compiler
    2.2.1 LLVM-based compiler
    2.2.2 Resource graph
  2.3 Vector programming models
    2.3.1 CPU-style
    2.3.2 GPU-style
  2.4 Common problems on Blocks
3 Compiler model
  3.1 LLVM vector operations
    3.1.1 build_vector
    3.1.2 shufflevector
    3.1.3 extractelement
    3.1.4 insertelement
  3.2 Hardware model
  3.3 Architecture expansion
  3.4 Scalar-vector boundary
    3.4.1 Scalar to vector
    3.4.2 Vector to scalar
    3.4.3 Vector to vector
  3.5 Shuffle patterns
  3.6 Pseudo-units
    3.6.1 Vector Shuffle Unit (SHF)
    3.6.2 Multiplexed Register File (RFM)
  3.7 Vector indices
4 Instruction set extensions
  4.1 ALU instructions
    4.1.1 ADDI: Add Immediate
    4.1.2 SUBI: Subtract Immediate
  4.2 LSU instructions
    4.2.1 SVX: Set Vector Index
    4.2.2 RVX: Reset Vector Index
    4.2.3 LVD: Load Vector Default Index
5 Compiler vector support
  5.1 Storage and routing
  5.2 Building vectors
    5.2.1 Base vector
    5.2.2 Remainder insertion
  5.3 Insert element
  5.4 Extract element
  5.5 Vector shuffling
    5.5.1 Hard-wired shuffle
    5.5.2 Vector Shuffle Unit
    5.5.3 Shuffle via memory
    5.5.4 Partial insertion
6 Evaluation
  6.1 Testing environment
  6.2 Runtime
  6.3 Area size
  6.4 Energy usage
7 Conclusion
  7.1 Summary
  7.2 Future work
Bibliography

Chapter 1

Introduction

As workloads and performance requirements increase, there has been an increasing demand for highly efficient embedded platforms. Embedded processing units continue to become more and more powerful, but this often comes at the cost of higher energy usage. To an increasing degree, there is a need for parallel processing in order to support growing workloads on a limited power budget; for instance, using a multi-core approach [2] or a vectorized one [6].

Nowadays, general-purpose Central Processing Units (CPUs) and Field-Programmable Gate Arrays (FPGAs) are commonly used in embedded platforms. A CPU has the advantage that it is highly flexible and available off-the-shelf, being able to run any sort of application out of the box. However, this generality introduces a great deal of computational overhead; for instance, instruction fetching/decoding and data transport. As such, in some cases a CPU is simply not efficient enough to support compute-intensive applications; instead, an adaptable micro-architecture is required [20]. On the other hand, an FPGA is such an adaptable micro-architecture: the FPGA itself is generic, but its configuration can be optimized for running a single specific application, which generally yields a much higher throughput. However, the high configurability and low path-width of an FPGA cause much energy to be lost due to the large number of switch-boxes and long wires [20][4]. As a result, neither CPUs nor FPGAs are an ideal solution when a high energy efficiency is desired.

To improve energy efficiency, CPUs and FPGAs are often paired with specialized computing units such as Graphics Processing Units (GPUs) and Digital Signal Processors (DSPs). In terms of granularity, these computing units fall somewhere in between; they are more fine-grained than a CPU, but more coarse-grained than an FPGA. In addition, Application-Specific Integrated Circuits (ASICs) are also frequently used when a higher energy efficiency is desired. However, these computing units are usually domain-specific and do not have the flexibility that a CPU or FPGA offers.

A new paradigm of processing unit has been proposed which potentially achieves a higher energy efficiency than both a CPU and an FPGA: the Coarse-Grained Reconfigurable Architecture (CGRA). The concept has been around since the 90s in various shapes and forms [4], but has especially gained traction in recent years [19].

Similar to an FPGA, a CGRA can be configured to be optimized for a specific application; however, a CGRA is configured at the Functional Unit level, whereas an FPGA is configured at the gate level. In other words, a CGRA has larger building blocks than an FPGA and thus requires fewer wires and switch-boxes, which reduces static energy dissipation. In this report we shall specifically deal with the CGRA known as Blocks, which is currently in development at Eindhoven University of Technology [20].

Due to the use of larger building blocks, code generation for a CGRA is more akin to that of a CPU than the high-level synthesis that is used for FPGAs. A compiler based on the popular LLVM architecture currently exists for the Blocks-CGRA, which supports compilation of programs to Blocks machine code. However, up until now the focus has been exclusively on scalar programs, with vectorized programs only being created through manual Parallel Assembly programming. Moreover, little research has been done into the mapping of vectorized programs to Blocks-CGRA designs. We wish to extend the compiler in order to add support for vectorized architectures and operations, similar to those available on contemporary processing units. This report will explore and evaluate the possibility of introducing support for vectorized programs on the Blocks-CGRA hardware and supporting vector operations in the compiler.

The remainder of this report is structured as follows. Chapter 2 gives a detailed introduction to the Blocks-CGRA and CGRAs in general, as well as the Blocks-CGRA compiler, and gives an overview of related work. Chapter 3 describes the compiler model, including vector-specific operations and how a vectorized Blocks-CGRA is represented in the compiler, as well as introducing pseudo-unit constructs and Blocks hardware extensions used to aid compilation. Chapter 4 provides the specifications of all newly implemented instructions in the Blocks instruction set. Chapter 5 details the changes made to the code lowering and scheduling processes in the Blocks compiler, and introduces procedures to lower high-level vector operations to low-level assembly code. Chapter 6 describes and evaluates the results of vectorized benchmarking applications that have been run on the modified Blocks toolflow. Finally, Chapter 7 gives a conclusion regarding our findings in this work.

Chapter 2

Related work

This chapter will discuss relevant work that has been done, regarding both CGRAs in general and the Blocks-CGRA specifically, as well as vectorization of computer programs. Section 2.1 serves as a brief introduction of Blocks, the CGRA in development at Eindhoven University of Technology. Section 2.2 introduces the Blocks compiler and describes recent developments in CGRA code generation. Section 2.3 discusses the various programming models of vectorization that are employed on other platforms, and how these pertain to the Blocks platform. Finally, Section 2.4 identifies the common issues that arise with vectorization on the Blocks-CGRA.

2.1 Blocks

A Coarse-Grained Reconfigurable Architecture (CGRA) is similar to an FPGA in that it is a reconfigurable processing element. However, unlike an FPGA, which is configured at the gate level, a CGRA is configured at the Functional Unit (FU) level, hence "coarse-grained".

The key difference which sets Blocks, the CGRA in development at Eindhoven University of Technology, apart from traditional CGRAs is that Blocks uses separate control-paths and data-paths, whereas traditional CGRAs only have a data-path [18]. This separation of control-path and data-path allows Instruction Fetch/Instruction Decode units to be treated as "first-class citizens", i.e., they are considered separate Functional Units and can be more-or-less arbitrarily connected to any other units. In a traditional CGRA, on the other hand, this Instruction Fetch/Instruction Decode functionality, as well as program memory, is built right into the Functional Units themselves. By having IF/ID as a separate unit, it becomes possible for a single IF/ID unit to drive multiple other FUs of the same type at once; i.e., Blocks can operate in a true Single Instruction, Multiple Data fashion, which is not possible on a traditional CGRA. Operating a CGRA in such an SIMD fashion has been shown to result in significant energy efficiency improvements [18].

A general overview of the hardware structure of Blocks is shown in Figure 2.1. The Functional Units available in the Blocks hardware, and their supported operations, include:

ID: Instruction Decoder. Sends instructions in sequential order to connected Functional Units, which they then execute. The program counter for the instructions is supplied by a connected ABU.

ABU: Accumulate-and-Branch Unit. Produces an automatically incrementing program counter and supports related operations, such as branching and jumping. Can also be configured to support accumulating on internal registers.

RF: Register File with 16 registers where data can be stored temporarily. Supports operations related to reading data from and writing data to its registers. This unit is optional, as data can also be stored in buffers between Functional Units, but in case more storage space is needed, Register File(s) can be added to the system.

LSU: Load-Store Unit, used for long-term storage. Supports operations related to reading from and writing to memory; both global memory and a local memory block can be accessed. Global memory is accessible to external devices, whereas local memory is private to the individual LSU. Loads and stores can be done explicitly by specifying the memory address on an input, or implicitly by configuring memory address and stride in internal control registers in advance; the latter also increments the memory address. Moreover, the LSU supports operations related to reading from and writing to said internal control registers.

ALU: Arithmetic Logic Unit. Supports general arithmetic and bitwise operations, including equals/greater/less-than checks. Also supports small bit-shifts by 1 bit or 4 bits; variable bit-shifts are not supported.

MUL: Multiplier Unit. Supports multiplication operations as well as extended bit-shifts (by 8 bits, 16 bits or 24 bits).

IU: Immediate Unit. The only purpose of this unit is to produce immediate values in the system, which are constant values that are available "immediately"; i.e., they do not need to be loaded from a register or from external memory. The immediate values are stored in the instruction memory for the Immediate Unit. As a result, it has a higher instruction width than the other units: 32 bits per "instruction" (immediate value) in the current base hardware configuration, versus 12 bits per instruction for other units.

In addition, most Functional Units support a general PASS operation, which simply forwards a value from an input port to an output port.

By supplying a hardware configuration file, formatted in Extensible Markup Language (XML), it is possible to specify which Functional Units are (in)active in the Blocks hardware, and how they are connected to one another. Next, a programmed application can be executed on the configured Blocks-CGRA. Blocks-CGRA programs are written in Parallel Assembly (PASM), which is similar to a regular uniprocessor assembly language source file (e.g. x86 or ARM), but consists of multiple columns, each pertaining to a single Instruction Decoder (ID) or Immediate Unit (IU) in the system. During execution of the program, each Instruction Decoder executes in lock-step. As a result, all Functional Units in the hardware are synchronized by cycle at all times.


Figure 2.1: Hardware structure of Blocks.

By specifying connections between Instruction Decoders and the other Functional Units, FU-specific instructions can be encoded. In particular, this approach allows for Blocks to be configured in various structures. For instance, Figure 2.2 shows a Blocks-CGRA configured as a VLIW processor, by connecting three Functional Units each to a unique Instruction Decoder. Most Blocks configurations look like this. This sort of configuration can also be built as a general-purpose processor. Furthermore, it is possible to connect a single Instruction Decoder to a set of Functional Units of the same type, as shown in Figure 2.3.

In the current Blocks hardware, a Functional Unit is structured generally as shown in Figure 2.4. A Functional Unit contains four input ports and two output ports on the data-path network, as well as one input port on the control-path network (for supplying instructions). Each output port contains a buffer which can be used as a storage location for a single 32-bit value. In the architectural design XML file, connections are specified from FU output buffers to FU input ports. It is possible for a single FU output to be connected to multiple different FU inputs; however, any one FU input can only be connected to at most one output. Instruction Decoders and Immediate Units differ slightly from other FUs; they have one input port on the data-path (for supplying the Program Counter) as well as one output port, which is on the control-path for Instruction Decoders and on the data-path for Immediate Units. Going forward, whenever we refer to "input" or "output" ports or buffers, we generally refer to the 4 inputs or 2 outputs (1 for IU) on the data-path.

An instruction is encoded in program memory, which is decoded by a connected Instruction Decoder and forwarded to the Functional Unit. An example is shown in Figure 2.4 for the mock instruction opcode outD, inB, inA, a binary operation.


Figure 2.2: Blocks configured as a VLIW processor.


Figure 2.3: Blocks configured as an SIMD processor.


Figure 2.4: Structure of a Blocks Functional Unit, connected to six other FUs.

The opcode part of the instruction specifies the operation to be performed in order to compute the result; the inA and inB parts specify which input ports are read as parameters for the operation; finally, the outD part specifies which destination output buffer the result of the operation is written to. This makes the Functional Units in Blocks similar to those seen in Transport-Triggered Architectures. Not all instructions follow this pattern exactly; for instance, the RF unit additionally takes register indices as parameters (encoded directly into the instruction).

For the current version of Blocks, the programming model assumes that all instructions take a single cycle, and as such, every FU in the system which is connected to the same program counter operates in lockstep. However, in some cases, such as when multiple LSU units are accessing global memory and these accesses cannot be coalesced (e.g. when two LSUs read adjacent 16-bit values, this can sometimes be coalesced into a single 32-bit read), an instruction may take more than one cycle to finish. In such a situation, all other units in the system are stalled for a number of cycles until all pending operations are finished, thus preventing any desynchronization between FUs and ensuring that they continue to operate in lockstep, in accordance with the programming model.

The main benefit of a CGRA is that it features larger building blocks than an FPGA; therefore, there is less interconnect and there are fewer switch-boxes and buses in the system. As a result, static energy dissipation is far lower compared to an FPGA. On the other hand, the CGRA does include some overhead from instruction decoding as well as Register File usage. However, as CGRA instructions do not operate on registers, but rather directly on input ports (similar to a TTA), it is possible to avoid usage of the Register File through explicit bypassing.

For instance, the output port of an ALU can be connected directly to the input port of an LSU, which allows the result of an ALU computation to be written directly into memory without touching the Register File.

A rather significant downside of a CGRA, however, is that it is difficult to program for. There are currently no tools to automatically synthesize a hardware configuration for a specific program. Moreover, programming in Parallel Assembly is significantly more difficult and error-prone than programming for a regular uniprocessor: not only does one need to be mindful of the parallelism across all Instruction Decoders, but the distributed nature of the FU output buffers also makes it hard to keep track of where data is stored at any given moment. As such, we would ideally want to be able to generate code for the Blocks-CGRA from a higher-level language, such as C, which is easier to program; this is where a compiler comes in.

2.2 Blocks compiler

Program execution for a CGRA platform is similar to that of a Transport-Triggered Architecture. A TTA program executes by transporting data via memory buses from/to Register Files and Functional Units. For Transport-Triggered Architectures, code generation methods are available [5]. As in a TTA, a CGRA program executes by transporting data from output buffers to Functional Units and Register Files. As such, methods used for TTA code generation are also partially applicable to a CGRA, and hence, to Blocks. However, in the Blocks-CGRA, there is a limit on the number of inputs to each Functional Unit, namely 4 inputs. As a result, when there are more than 4 units in the system that are executing the program, which is usually the case for non-trivial programs, it is not possible for the interconnect graph to be fully connected. For instance, it becomes impossible to connect every other unit to the Register File directly. As a result, data must often be routed through unrelated units on its way to the destination FU. This brings with it a host of problems, such as the possibility of deadlock while scheduling when no suitable path can be found for operands [14]. This makes the process of code generation for the Blocks-CGRA considerably more complicated than for a regular CPU.

No generic framework for CGRA and TTA code generation currently exists; compilers up until now have been architecture-specific, and this also applies to the Blocks compiler. There are recent efforts towards creating such a generic framework, but research on this is still in the early stages [14].

2.2.1 LLVM-based compiler

LLVM is a popular compiler framework consisting of a front-end, optimization layer, and back-end. An example compilation of a C program to x86 assembly is shown in Figure 2.5. The front-end parses a program written in some programming language, e.g. C, and converts it to LLVM Intermediate Representation (IR) format. Other front-ends exist for different programming languages, such as Fortran or Ada. The IR program is then fed through a common optimization layer, which can consist of many different optimization passes such as loop unrolling, dead code elimination, and so on. Each optimization pass transforms the IR program into a more optimized IR program.


Figure 2.5: Structure of the LLVM compiler architecture.

Finally, the optimized IR program passes through a back-end, which converts the program to the native format of the architecture, such as x86 assembly. The back-end can output this assembly code to various formats, for instance human-readable text formats, or to a binary which is directly executable on the target platform. The major benefit of this structure is that support for a new programming language can be added simply by adding a new front-end which parses that language. Similarly, common optimization passes added to the optimization layer can often be applied to all programming languages and platform targets that are already supported, as well as those that will be supported in the future. Finally, support for a new target platform can be added simply by adding a new back-end which converts the optimized LLVM IR program to the native format of the platform architecture.

An initial compiler for the Blocks platform was developed in 2016, implemented as a custom back-end for LLVM, along with some target-specific optimization passes [1]. This initial Blocks compiler supported compiling basic scalar C programs to Blocks Parallel Assembly (PASM) machine code, with some support for explicit bypassing and function calls. A custom scheduler is implemented, which uses an operation-based scheduling algorithm.

The initial compiler performs instruction scheduling and register allocation as separate passes. In early 2017, a combined instruction scheduling and register allocation algorithm was proposed [12]. This method models the Blocks configuration and program to be scheduled as a state transition model, which is then implemented in a Constraint Satisfaction Program. A Boolean satisfiability (SAT) solver is then used to find a schedule of minimal length. However, this algorithm is currently not yet integrated in the Blocks compiler.

Later in 2017, major improvements were made to reduce loop overheads in the Blocks compiler as well as in hand-crafted PASM programs [13, 15]. A Zero-Overhead Loop Accelerator (ZOLA) circuit was added inside the Accumulate-and-Branch Units in the Blocks hardware, which can be configured ahead of a loop in order to automatically handle loop conditions during the loop. In addition, the ZOLA circuit includes a loop buffer which stores instructions executed during the loop, cutting down on program memory accesses. Additional improvements to the compiler were made to reduce register accesses by extending the scope of the scheduler. Moreover, additional passes were added for implicit addressing and for software pipelining, the latter using a new swing-modulo scheduler. However, compiler integration of the ZOLA circuit is still in an early stage and is only functional for a small set of programs.

Recently, in early 2018, improvements were made in pipelining applications on multi-core Blocks configurations [17]. A FIFO channel was added to the Load-Store Units of the Blocks hardware, which allows communication and synchronization between "cores" in the Blocks configuration.
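To make the LLVM IR format discussed above concrete, the sketch below shows roughly what a front-end might emit for a trivial scalar C function (int add5(int x) { return x + 5; }). The function name and exact IR details are illustrative and not taken from the thesis; real output depends on the front-end version and optimization level.

```llvm
; Hypothetical LLVM IR for: int add5(int x) { return x + 5; }
define i32 @add5(i32 %x) {
entry:
  %result = add nsw i32 %x, 5   ; scalar 32-bit addition in the IR
  ret i32 %result
}
```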


Figure 2.6: The architecture in its simplest and most abstract form. An Immediate Unit connects to an ALU and LSU, which are connected to each other. An ABU manages the program counter for each unit, but it is not shown in this image.

2.2.2 Resource graph

In order to implement vector operations in the Blocks-CGRA compiler, said operations must also be integrated into the scheduler. As such, in this section we give a detailed overview of the Blocks scheduler and how it uses a resource graph to map code onto the architecture configuration; the vector operations that we shall implement will be mapped onto this same graph.

Compiling code for a CGRA is very different from most other target platforms supported by LLVM. Whereas most CPUs have only a single instruction memory, in the CGRA multiple instruction memories and issue slots are present. Some LLVM target backends for VLIW platforms do exist, but a CGRA differs from most VLIW platforms in that the architecture is parametric; specifying different architecture configurations can completely change the structure of the CGRA, for instance by adding more ALUs or LSUs. Therefore, the scheduler for the Blocks compiler can make few assumptions about the available (compute) resources and latencies present in the system, which complicates the scheduling process.

The scheduler which is currently implemented in the Blocks target backend for LLVM makes use of a resource graph. The resource graph is created by repeatedly instantiating a template model based on the loaded Blocks configuration once for every cycle, and adding connections between units that span different cycles. Instructions are then scheduled onto this two-dimensional resource graph, as opposed to most conventional processors, where instructions are scheduled into a one-dimensional list [1].

Suppose that we want to schedule a program that loads a value from memory at address 10, adds 5 to it, then writes it back to memory at address 20. Such a program requires at the minimum a Blocks configuration with an Immediate Unit, an ALU, and an LSU, as well as an ABU to generate a program counter. For the sake of simplicity, we shall leave out the ABU since no branching occurs in this program. This minimum Blocks configuration is shown in Figure 2.6. We make this architecture explicit by defining the connections between the units in terms of output buffers and input ports, as shown in Figure 2.7. This is the model that will be sent to the compiler.

The compiler parses the model from Figure 2.7 and generates a template model as shown in Figure 2.8. Nodes are generated for each Functional Unit as well as all of their output buffers.


Figure 2.7: A more detailed version of the architecture. Connections are now explicitly assigned from output buffers to input ports.


Figure 2.8: The architecture model as parsed by the compiler. Normal lines are assigned a latency of 1 cycle, whereas dashed lines have a latency of 0 cycles.


Figure 2.9: The example program: load from 10, add 5, store to 20; scheduled onto the resource graph. Normal lines have a latency of 1 cycle, whereas dashed lines have a latency of 0 cycles. The bolded nodes and lines represent the active nodes and connections which execute instructions and through which data is routed.

Connections are then generated between the Functional Unit nodes and their output buffers with a latency of 1 cycle, which represents the Functional Unit computing the result of some instruction and storing it in an output buffer (which takes 1 cycle). Self-connections are also added for the output buffers, which represent keeping the same value stored in the buffer for the next cycle. Finally, zero-latency connections are added between output buffers and Functional Units, representing the incoming values on the input ports. These connections are tagged with the input port numbers so that the proper PASM code can be generated at the end of the code generation process, but for the purposes of scheduling, the input port numbers do not matter. Moreover, each connection is given a weight; higher weights are assigned to connections that have a cycle latency.

Next, a resource graph is generated from this template model. The template model is duplicated for every cycle of the program; extra cycles are added as the schedule demands it. Any connections with a latency of 1 are altered so that the connection points to the target node in the next cycle, rather than in the same cycle.

Then, for each instruction to be scheduled, the scheduler first checks to see which Functional Units are capable of executing the instruction (e.g. ALUs for add operations, LSUs for load/store operations), and where the input data is currently stored.

It then traces every forward path through the system starting from the nodes where the input data is stored, incrementally raising the number of cycles ("rows" in the resource graph) that are searched, until one or more suitable Functional Units are found. The instruction is then scheduled to the Functional Unit node with the combined shortest path between it and each of the input data nodes, based on the weights assigned to each connection. As the connection weights are mostly based on latency, and the number of searched cycles is increased until a suitable FU is found, the scheduler is in this sense a greedy one, as instructions are mostly scheduled as soon as possible.

Once the executing Functional Unit is chosen, the input data is allocated to each node that it passes through via PASS operations, and the executing FU node is tagged with the opcode of the instruction. Each node has a capacity of 1, meaning that only one unit of data may pass through each node; as a result, an FU node which is executing an instruction cannot also be used to route data via PASS operations. The result of the instruction is output at the FU node that executed the instruction, rather than an explicit output port. The output port is not chosen until the result of that instruction is consumed by another FU, at which point the data is routed through one of the output ports to the consuming FU, in the process of scheduling the next instruction. If, however, the result of the instruction is not used by any other instructions, then an output port is chosen immediately, with priority given to output ports that are not connected to any other units.

The result of this is shown in Figure 2.9, which shows the example program being scheduled onto the resource graph. In cycle 0, the immediate 10 is generated; this immediate is used in cycle 1 to load from memory address 10. Next, in cycle 2, the immediate 5 is added to the value loaded from memory; this immediate was generated in cycle 1. The result is written back to memory in cycle 3, with the immediate address 20 being generated in cycle 2.

| imm    | lsu                 | alu                |
|--------|---------------------|--------------------|
| imm 10 | nop                 | nop                |
| imm 5  | lga WORD, out0, in0 | nop                |
| imm 20 | nop                 | add out0, in0, in1 |
| nopi   | sga WORD, in0, in1  | nop                |

Figure 2.10: Generated PASM code for the scheduled program.

Finally, once the whole program has been scheduled, the compiler translates the resource graph from Figure 2.9 into PASM by outputting an instruction for every FU node in the resource graph based on the opcode that the FU was tagged with, and by inspecting the input and output connections of the node (which were tagged with input/output port numbers earlier). The final PASM code is shown in Figure 2.10.
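For reference, the example program that was scheduled above (load from address 10, add 5, store to address 20) could be expressed at the LLVM IR level roughly as follows. The fixed addresses and the 32-bit integer type are taken from the running example; the function name and pointer casts are purely illustrative.

```llvm
; Sketch of the running example: mem[20] = mem[10] + 5
define void @example() {
entry:
  %src = inttoptr i32 10 to i32*   ; source address 10
  %dst = inttoptr i32 20 to i32*   ; destination address 20
  %val = load i32, i32* %src
  %sum = add i32 %val, 5
  store i32 %sum, i32* %dst
  ret void
}
```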

2.3 Vector programming models

We can generally categorize code generation for vectorized programs into two distinct programming models: a "CPU-style" and a "GPU-style". This section discusses the merits of both of these styles of vectorization as they pertain to the CGRA, specifically the Blocks platform.


Figure 2.11: Wide SIMD architecture from [16].

2.3.1 CPU-style

The (single-core) "CPU-style" of vectorization, in its simplest form, is implemented in a uniprocessor in a Single Instruction, Multiple Data fashion. The CPU-style is characterized by its use of low-level vector instructions with direct CPU support; for instance, Intel's Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) [9]. Such instructions are capable of addressing many registers in the Register File at once, or use specialized registers for vector operations. Another architecture that follows the CPU-style of vectorization is the wide SIMD architecture also in development at Eindhoven University of Technology [16], shown in Figure 2.11 and Figure 2.12. In this architecture, a number of vector Processing Elements run in lockstep, in parallel with a scalar Control Processor.

Usually, CPU-style vectorization involves tight loops performing simple operations on large sets of data; for instance, copying an array or applying a (convolution) filter. The SIMD nature of the vectorization complicates handling of edge cases or complex processing. Advanced compilers are capable of performing auto-vectorization, wherein the compiler automatically detects certain code structures and generates vectorized machine instructions for them. This auto-vectorization is based on an architecture-specific cost model, which gauges whether a given loop is worth vectorizing or not. However, the programmer can also use compiler directives at the source code level, such as #pragma clang loop in LLVM or #pragma GCC ivdep in GCC [7, 3], to provide hints to the compiler or to bypass the cost model and force a vectorization [11, 7]. The CPU-style of vectorization has the major benefit that it requires minimal effort on the part of the programmer whilst producing significant performance improvements [6].


Figure 2.12: Circular neighborhood communication network in the wide SIMD architecture from [16].

However, handling edge cases is somewhat more difficult, and this must usually be done separately.

In the Blocks-CGRA, the "CPU-style" of vectorization, or more generally, a Single-Instruction-Multiple-Data structure, can be realized by connecting a single Instruction Decoder unit to a range of Functional Units of identical type. Note that this feature is unique to the Blocks-CGRA and does not exist in other CGRA platforms [18]. Data is then routed into the vectorized units by specifying different connections for the inputs of each unit. In this fashion, for a set of vectorized Functional Units, the CGRA needs only fetch and decode a single instruction, using a single Instruction Decoder, for every attached vectorized Functional Unit. This setup is very similar to the wide SIMD architecture described in [16] and shown in Figure 2.11.

This has the effect of reducing energy costs, as redundant Instruction Fetch/Decode units are omitted. Furthermore, as multiple data units are being processed in parallel, this may also result in a speed increase that is inherent to vectorization. For instance, a CPU-style vectorized program with a vector width of 4 would use, in theory, only 25% of the Instruction Decoder energy of the scalar version, while processing 4 data units at a time in parallel. However, this comes at the cost of using 4 times as much energy for the vectorized Functional Units. Furthermore, Amdahl's law suggests that a 4× speedup is unlikely to be reached; not all parts of the program can be executed in parallel. As such, the Functional Units themselves, combined, are likely to use a bit more energy in a vectorized program than in a scalar program. Even so, reducing the Instruction Decoder energy usage would likely still save energy overall, as the Instruction Decoder and Immediate Unit are known to be some of the most energy-expensive units in the Blocks-CGRA, due to their increased instruction width (32 bits versus 12 bits). Moreover, the potential speed improvements are equally desirable.
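As an illustration of what CPU-style vectorization looks like at the compiler level, the sketch below shows the kind of LLVM IR an auto-vectorizer might produce for one iteration of an element-wise array addition with a vector width of 8. The value names and pointer types are illustrative and not taken from the thesis.

```llvm
; One vectorized iteration: c[i..i+7] = a[i..i+7] + b[i..i+7]
; %pa, %pb and %pc are assumed to point at the current 8-element chunk.
%va = load <8 x i32>, <8 x i32>* %pa, align 4
%vb = load <8 x i32>, <8 x i32>* %pb, align 4
%vc = add <8 x i32> %va, %vb
store <8 x i32> %vc, <8 x i32>* %pc, align 4
```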


Figure 2.13: A shuffle pattern < 3, 2, 0, 1 >, i.e. the value from index 3 (at ALU3) is moved to index 0 (at ALU0), etc., performed explicitly in wiring, using 4 vectorized ALUs.

As with a general-purpose CPU, on a CGRA this style of vectorization has the drawback that handling edge cases is nontrivial, since each vectorized unit receives the same instruction. It is currently not possible to skip instructions for certain units or make them otherwise behave differently from their neighbors, except by applying these differences through the input values; for instance, an edge case in a summation could be handled by supplying zero-values to those units that fall outside of the range of summation. The Blocks-CGRA currently has no means of temporarily disabling a Functional Unit outside of feeding it nop (no operation) instructions.

Furthermore, each vectorized unit generally only has local access to its own inputs and outputs, and cannot directly access the inputs of its neighbors. For instance, in the top of Figure 2.14, ALU0 cannot directly access input values v4, v5, v6 and v7, and ALU1 cannot directly access input values v0, v1, v2, v3. On the bottom, a bypass has been added from ALU1 to ALU0; now ALU0 can access v4, v5, v6 and v7 by passing them through ALU1, but it must give up one of its own input values; v3 can no longer be accessed. These two limitations make it difficult to execute vector shuffling operations, as each vectorized unit must perform the same PASS operation and cannot access arbitrary vector values. An arbitrary vector shuffling operation could still be performed by explicitly adding connections between the vectorized elements, as shown in Figure 2.13, but this comes at the cost of precious Functional Unit inputs. In the wide SIMD architecture, this is solved by adding a circular neighborhood network, as shown in Figure 2.12.

Another drawback for CPU-style vectorization on the Blocks-CGRA is that there is no way to directly load immediate vectors in constant time. This is useful for e.g. applying a convolution window or specifying different offsets for a set of Load-Store Units. By connecting vectorized units in a network, it is possible to load an immediate vector into a set of units using a single Immediate Unit, but this would have to be done one element at a time, and thus takes linear time. One could add a separate Immediate Unit per vector unit, but this introduces a lot of overhead. One way around this limitation is to prepare the constant vector(s) to be loaded in global memory prior to executing the program, though each vectorized Load-Store Unit will still need its own unique offset. Moreover, reading from global memory comes with its own performance implications.


Figure 2.14: Two vectorized ALU units in a Blocks-CGRA. Top: no bypass; bottom: bypass added.


Figure 2.15: Single Instruction, Multiple Thread architecture model.

2.3.2 GPU-style

The "GPU-style" of vectorization is implemented in specialized hardware, typically a GPU, usually in a Single Instruction, Multiple Thread fashion. It is characterized by the use of so-called kernels: specialized mini-programs written specifically by the programmer. The vector width or data size is then decided at runtime by configuring the kernel. The hardware consists of numerous threads, each containing their own Functional Units, registers and memory; threads are grouped into blocks with shared memory; finally, blocks are part of a grid with a global memory, which is accessible by the host computer. Figure 2.15 shows this architectural model. One example of GPU-style vectorization is CUDA, a programming model for NVIDIA GPUs [10].

As the vectorization is performed on specialized hardware, it supports many more features than a uniprocessor. A significant feature is divergence, which allows a set of threads to follow differing control flow paths [21]. This allows for if-then-else blocks to be executed inside a kernel, which can handle edge cases. Such control flow divergence is achieved by masking the instructions for the inactive control flow path(s), and using a specialized function (e.g. __syncthreads() in CUDA) to resynchronize threads if they do become desynchronized [10]. This allows for a much greater degree of flexibility than "CPU-style" vectorization. However, GPU-style vectorization requires that the programmer write a separate CPU control program and a GPU kernel, which requires considerably more effort from the programmer than simply adding #pragma statements in CPU-style vectorization, or indeed letting the compiler's auto-vectorization algorithms do their work.

In terms of the Blocks-CGRA, this style of vectorization could already be implemented using the current architecture, but would likely have limited benefits. As Blocks is not a specialized SIMT architecture, it lacks many of the features that make GPU-style vectorization practical, including latency hiding and multi-threading. Notably, Blocks does not support masking of control flow paths.

This means that any implementation of a GPU-style kernel, e.g. from CUDA or OpenCL, would inevitably require that each "core" feature its own set of Functional Units, each with their own Instruction Decoders and Program Counters. Using FIFO queues between cores, it is then possible to construct some form of synchronization, which allows for thread divergence [17]. At this point, one basically ends up with a general-purpose multi-core processor, which is considered out of scope for this project. While this could result in a lower execution time for parallelized programs, the costs of adding the extra units may outweigh the gains.

2.4 Common problems on Blocks

By examining vectorization as it is performed in current CPUs and GPUs and attempting to map this to a Blocks-CGRA, we can identify some common problems in Blocks that make the implementation of specialized vectorization support difficult.

One major issue, as discussed in Section 2.3.1, is that vectors on the Blocks-CGRA are disjoint; i.e., each FU that is part of a vector only has immediate access to its own part of the vector through its own inputs and outputs, and accessing those of neighbors requires extra wiring to be added, as seen in Figure 2.14. While this is not an issue for simple, straightforward programs such as an array copy, it quickly becomes a problem for more general programs such as a convolution filter, which requires accessing neighboring values. This problem is mainly caused by input saturation on the Functional Units in the Blocks-CGRA, which refers to the phenomenon that a Functional Unit has fewer input ports than we would ideally want in order to achieve direct connectivity with all other (relevant) units. This phenomenon is depicted in Figure 2.16, which shows a basic scalar Blocks configuration with a central Register File. No direct path exists from the LSU to the ALU, or from the MUL unit to the RF unit; in order for data to pass between these units, it must travel through unrelated units. In the case of vectorized units, as each unit can have only up to four different inputs, it is not possible for the vectorized units to be fully connected. To some degree, it is possible to work around this problem by adding bypass connections, but for more involved algorithms involving vector shuffles, some form of dedicated hardware for vector operations is needed.

A related issue is that of immediate vectors. In order to load an immediate vector into a set of units, one has a few options. One way is to prepare the constant vector(s) to be loaded, one element at a time, in global memory beforehand, and then load the vector in one go using parallel Load-Store Units. However, reading from global memory incurs a penalty, and the Load-Store Units still require unique offsets in the first place. Another option is to connect separate Immediate Units to each vectorized Functional Unit, which is expensive in terms of energy usage. Moreover, one could connect the vectorized units in a network, expose one (or more) of the units to a scalar input, and load the vector into the units slowly, one element at a time. This would take linear time, which can be quite slow when vector widths are large; ideally, we would want a method that scales a bit better in terms of performance and energy costs.

Finally, there is a distinct lack of support for vector operations in the current instruction set for Blocks, or indeed a lack of instructions that would be useful for vectorized applications at all.


Figure 2.16: A basic scalar Blocks configuration. The red nodes indicate a unit that is input saturated.

For instance, Blocks instructions have no conditional execution bit(s), and Blocks currently does not support masking, which prohibits the use of control flow divergence as in the GPU-style. Moreover, ALU instructions do not support immediate operands, which further drives input saturation, as the immediate operands now need to be routed through the network from some Immediate Unit.

As we shall find later, support for specialized vector operations such as shuffling, extraction and insertion on Blocks shall often come down to emulating these operations with conventional instructions in linear time. Moreover, implementing these operations will require specialized hardware in the sense that dedicated pseudo-units must be added to the program architecture.

Chapter 3

Compiler model

Currently, a Blocks compiler exists based on LLVM, which is only able to generate scalar code. In order to implement vector support, modifications must be made to the model used by the compiler, including to how the Blocks platform is represented in the compiler and to the procedures for lowering LLVM code to Blocks Parallel Assembly. This chapter focuses on the extensions made to the hardware model used by the compiler. The procedures for lowering code will be detailed in Chapter 5.

We shall first introduce the vector-specific operations that exist in LLVM, which must be translated to Parallel Assembly. We then describe the extensions made to the hardware model used by the compiler, and introduce a procedure to align the Blocks configuration better with the compiler model. Next, we specify how connections between scalar and vector parts of the architecture will be implemented, and we introduce the implementation of shuffle patterns into the hardware model. Following this, a number of pseudo-units are defined and introduced, with the goal of aiding code generation for the Blocks platform. Finally, we detail changes made to the Blocks hardware itself, specifically to the Load-Store Unit memory interface.

3.1 LLVM vector operations

As described in Section 2.2, adding support for a new target platform in the LLVM compiler architecture comes down to adding a target-specific back-end. The back-end is tasked with translating instructions encoded in the LLVM Intermediate Representation language [8] to the platform's machine code format. The LLVM IR language includes three unique vector-specific operations: shufflevector, extractelement and insertelement [8]. These operations are special in that they have no scalar equivalent; they can only be performed on vectors. In addition, adding support for vector data types requires that build_vector operations be supported. This is technically not part of the LLVM IR language, but is used internally in the compiler to represent vector constants and instantiations.

The current Blocks compiler supports translation of most of the scalar IR instructions.

However, in order to introduce full compiler support for vectorization, two goals must be realized: support for vector data types must be added to existing operations, and the three vector-specific operations must be translated to machine code, i.e. PASM.

3.1.1 build_vector

<result> = build_vector <ty> <s1>, <ty> <s2>, ..., <ty> <sn>    ; yields <n x ty>

A build_vector operation takes n scalar operands s1, s2, ..., sn, all having the same type ty, and returns a vector of type ty having length n. The operands are inserted at their respective indices, i.e. s1 goes into the first element, s2 goes into the second element, and so forth, with sn going into the last element. In the event that all operands are exactly the same, we refer to this as a splat, i.e. the single operand is "splatted" into all elements of the vector.
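There is no textual build_vector instruction at the IR level; it typically arises inside the compiler from vector constants or from chains of insertelement instructions. A sketch with illustrative values and names:

```llvm
; Both constant operands below may be represented internally as
; build_vector nodes during instruction selection; the second is a splat.
%a = add <4 x i32> %v, <i32 1, i32 2, i32 3, i32 4>
%b = add <4 x i32> %v, <i32 7, i32 7, i32 7, i32 7>
```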

3.1.2 shufflevector

<result> = shufflevector <n x ty> <v1>, <n x ty> <v2>, <m x i32> <mask>    ; yields <m x ty>

The shufflevector operation takes two vectors of type ty as operands, v1 and v2, each having the same length n. It also takes a mask vector of size m containing integers. This mask vector is a constant, i.e. known at compile time. The result of the operation is a vector of length m with elements taken from v1 and v2 according to the indices contained in the mask vector. For instance, a mask of {0, 1, 2, ..., n − 1} simply returns vector v1 as-is, whereas a mask of {0, 1, 2, ..., 2n − 1} returns the concatenation of v1 and v2. It is possible to supply vector v1 only and leave v2 as undef (undefined). Moreover, it is possible for the mask to contain undef source indices; if this is the case, then we simply do not care what data will end up in the result vector at the destination index.
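As an illustration (the value names are ours, not from the thesis):

```llvm
; Reverse a 4-element vector (single-input shuffle, second operand undef).
%rev = shufflevector <4 x i32> %v1, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>

; Concatenate two 4-element vectors into an 8-element result.
%cat = shufflevector <4 x i32> %v1, <4 x i32> %v2, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
```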

3.1.3 extractelement

<result> = extractelement <n x ty> <val>, <ty2> <idx>    ; yields ty

The extractelement operation takes a vector val having type ty and length n, as well as an index idx having type ty2, which can be any integer type. It returns the scalar element in the vector val at the index pointed to by idx. The idx index supplied to the extractelement instruction may be a variable, i.e. unknown at compile time.
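An illustrative use of extractelement (the value names are ours):

```llvm
; Extract element 2 using a constant index...
%x = extractelement <4 x i32> %v, i32 2
; ...or extract using a runtime index held in %i.
%y = extractelement <4 x i32> %v, i32 %i
```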

3.1.4 insertelement

<result> = insertelement <n x ty> <val>, ty <elt>, <ty2> <idx>    ; yields <n x ty>

The insertelement operation takes a vector val having type ty and length n, as well as an element elt having the same type as the vector elements; finally, it takes an index idx having type ty2, which can be any integer type. It returns a copy of the val vector with the element pointed to by idx replaced with the input element elt. Both the idx index and the elt element supplied to the insertelement instruction may be variables, i.e. unknown at compile time.
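An illustrative use of insertelement (the value names are ours):

```llvm
; Replace element 1 of %v with the scalar %s; the index could equally
; be a runtime variable such as i32 %i.
%w = insertelement <4 x i32> %v, i32 %s, i32 1
```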


Figure 3.1: Hardware configurations for Blocks nodes in scalar and vector fashion.


Figure 3.2: Vectorized Blocks nodes with ambiguous routing.

3.2 Hardware model

As the LLVM-based Blocks compiler is unable to produce a hardware configuration itself, one must be provided as input. The hardware configuration file, formatted in XML, specifies which Functional Units are used in the program and how they are connected to one another. In a scalar Blocks hardware configuration, each Functional Unit is connected to its own dedicated Instruction Decoder (ID), which is only used for that FU, as shown in Figure 3.1a. To specify a vectorized FU in a Blocks hardware configuration, a single ID unit can be connected to a series of FUs of the same type, as in Figure 3.1b. The result is that the same instruction is sent to all connected units, which then operate on a data set in tandem; i.e., Single Instruction, Multiple Data (SIMD).

In order for the compiler to properly deal with vector types, it must first be determined what the vector width of each node in the hardware configuration is. This could be observed by checking the connections between FUs; if several FUs of the same type are connected to one ID, then this could be considered a vector node. However, this may not always suffice when routing vector types through different nodes.


Figure 3.3: The revised Blocks toolflow, with the architecture expansion pass (shown in gray) added. Previously, the architecture XML was simply copied to the simulation step.

For instance, in Figure 3.2, the top and bottom nodes are vectorized nodes (with a single ID), which are connected to a set of FUs that each have their own ID. If we only look at the Instruction Decoder connections between nodes, then we would conclude that the nodes in the middle are not vectorized nodes. Even so, because the FU connections are lined up correctly, there is no problem with routing vector types through this (set of) nodes. As such, we would like to be able to set the vector width explicitly per node, as shown in Figure 3.1c. This allows the compiler to easily recognize, and make assumptions about, vectorized nodes in the Blocks configuration, without needing to closely examine the entire configuration file. However, for simulation and hardware synthesis purposes, we still require the full model of Figure 3.1b. Consequently, we introduce an additional architecture expansion pass in the Blocks toolflow.

3.3 Architecture expansion

The architecture expansion pass is an extra pass added in the Blocks toolflow, and is implemented as a separate tool from the LLVM-based compiler. The full toolflow with the architecture expansion pass added is shown in Figure 3.3. The pass serves two main functions:

1. Expand explicitly vectorized nodes into separate Functional Units with the proper connections;
2. Expand pseudo-unit constructs into native Functional Units supported by Blocks.

For function 1, any Functional Unit with an explicitly set vector width, as shown in Figure 3.1c, is expanded to a number of separate FUs equal to the vector width, as shown in Figure 3.1b. All connections to and from the explicitly vectorized FU are also re-routed such that they go to the respective expanded FUs. The intricacies of this re-routing are detailed in Section 3.4. Additionally, as stated in function 2, the architecture expansion pass allows us to define and use pseudo-unit constructs. This shall be further elaborated in Section 3.6.

3.4 Scalar-vector boundary

As the explicit vector width property in the architecture XML abstracts away the specific connections between individual FUs, rules must be defined for how these connections will be expanded during architecture expansion. In particular, we define the rules for the scalar-vector boundary, i.e. how scalar nodes connect to vector nodes and vice versa. In the following subsections, we shall refer to the producer and consumer nodes of the given connection type as A and B respectively, and the vector widths of A and B as N and M respectively.

3.4.1 Scalar to vector

When the output of a scalar node A is connected to the input of a vector node B, this is expanded to a broadcast connection; the output of A is connected to the inputs with the same input index as in the abstract node, for each of the individual nodes that B is expanded to. This type of connection is shown in Figure 3.4a. An alternative would be to simply connect the output of A to the first expanded node of B; however, as we shall see later, a broadcast operation is very useful in many situations when dealing with vectors.

3.4.2 Vector to scalar

When the output of a vector node A is connected to the input of a scalar node B, we must choose one of the individual nodes in the output of A that shall be connected to the input of B. Although an output can be connected to any number of inputs, one input can only be connected to one output. Therefore, we arbitrarily choose the first expanded node of A as the source for the scalar input of B, as shown in Figure 3.4b.

3.4.3 Vector to vector

When a vector node A is connected to another vector node B, we can identify three different situations. If N = M, then the individual expanded nodes are connected one-to-one, as shown in Figure 3.4c.

In the case that N < M, we repeat the output connections to the inputs. For instance, if N = 4 and M = 8, then the first four expanded nodes of B are connected to the four expanded nodes of A, and the fifth through eighth expanded nodes of B are also connected to the four expanded nodes of A in the same order. Note that this behavior is a more general case of the scalar-to-vector behavior as defined in Section 3.4.1, where N = 1; compare Figure 3.4a and Figure 3.4d.

Finally, if N > M, we take only the first M outputs in A and connect these to B one-to-one. Note that this behavior is a more general case of the vector-to-scalar behavior as defined in Section 3.4.2, where M = 1; compare Figure 3.4b and 3.4e.

In the general case, Equation 3.1 formally defines the expanded connections after architecture expansion, given some connection A.out[i] ↦ B.in[j] for i, j ∈ ℕ, i < 2, j < 4.

∀x ∈ ℕ, x < M : A[x mod N].out[i] ↦ B[x].in[j]    (3.1)
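As an illustration, the expansion rule of Equation 3.1 can be sketched as follows; the helper function below is hypothetical, not the actual architecture expansion tool.

```python
# Sketch of Equation 3.1: one abstract connection A.out[i] -> B.in[j] between
# nodes with vector widths N (producer) and M (consumer) expands to one
# connection per consumer element, wrapping the producer index modulo N.

def expand_connection(i, j, N, M):
    # Returns pairs ((producer_element, out_port), (consumer_element, in_port)).
    return [((x % N, i), (x, j)) for x in range(M)]

# Scalar-to-vector (N=1, M=4): a broadcast of A[0] to all elements of B.
print(expand_connection(0, 0, 1, 4))
# Vector-to-scalar (N=4, M=1): only A[0] is connected to B[0].
print(expand_connection(0, 0, 4, 1))
```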

Figure 3.4: Expanding scalar-vector boundary connections: (a) scalar-to-vector, (b) vector-to-scalar, (c) vector with N = M, (d) vector with N < M, (e) vector with N > M.

3.5 Shuffle patterns

On the Blocks platform, vectors are stored in a vectorized number of individual nodes equal to the vector width. Each of these individual nodes can technically also have its own individual connections. As such, the fastest way to produce a shuffled vector using a specific shuffle mask is to directly connect the output buffers of the individual elements of the input vector node(s) to the input ports of the individual elements of the destination vector node. Examples of this can be seen in Figure 2.13 for a shuffle using one input vector A and destination vector B, and in Figure 3.5 for a shuffle using two input vectors A and B and destination vector C.

However, by abstracting away the individual connections between vectorized nodes for the benefit of the compiler, we would lose the ability to represent this sort of hard-wired shuffle in the architecture configuration. The individual connections between vectorized nodes are added back in during the architecture expansion step, but this occurs separately from the compiler. Therefore, we add a mask property to FU input port definitions in the high-level architecture configuration XML, which allows us to explicitly define shuffle patterns in the connections between vectorized nodes.

In addition, we extend the functionality of the data source property for FU input ports. Normally, in the architecture configuration XML, each FU input port can specify a single other node as its source from which data will be received. We change the semantics of this property such that multiple other nodes can be specified as the source for an input port. This can then be combined with the added mask property to address (specific elements of) various data sources in order to produce a shuffled vector.

Such an explicitly shuffled connection is given a special status in the compiler back-end. Because the contents of the vector are altered (shuffled) whenever data passes through the connection, it is not suitable for standard operations like routing operands. Indeed, a shuffled connection will only be used in the context of vector shuffle operations, yet still takes up one of four inputs on each FU. Taking input saturation into consideration, the programmer must be mindful of where the hard-wired shuffle patterns are placed in the architecture configuration, as data cannot be routed through them normally.

In the example in Figure 3.5, the hard-wired shuffle pattern can now be represented as follows. For Functional Unit C, we set the source property of the input to both A and B, in that order. Next, we set the mask property of the input port to <1, 2, 3, 4>. The mask property specifies which elements in the source list are connected to the input ports of C. The indexing occurs on a concatenation of all the individual elements of the source list; in this case, {A0, A1, A2, A3, B0, B1, B2, B3}. Therefore, mask <1, 2, 3, 4> corresponds with reading from individual nodes {A1, A2, A3, B0}, as is shown in the figure. Note that this is equivalent to mask <5, 6, 7, 0> if the order of A and B in the source list were swapped.

If the mask property is not defined for some input port in the high-level architecture configuration XML, then it defaults to <0, 1, 2, 3, ..., N-1>, where N is the vector width of the input port (determined by the FU for which it is an input port). Should this produce indexes that exceed the length of the source list, then we wrap around the source list. For example, take a vector-to-vector connection A to B where |A| = 2 and |B| = 4, with no explicitly defined mask property. In this case, the mask property defaults to <0, 1, 2, 3>. However, the source list is {A0, A1}, which is only two elements. By wrapping around, the source list expands to {A0, A1, A0, A1}, for which mask <0, 1, 2, 3> is valid.

Note that this behavior is wholly consistent with the vector-to-vector connection rules specified in Section 3.4. In particular, the "wrap-around" essentially performs a modulo operation similarly to Equation 3.1 as defined in Section 3.4.3. Therefore, the added mask property and source list semantics are compatible with the scalar-vector boundary rules whilst adding support for explicitly wired shuffles.


Figure 3.5: A hard-wired shuffle on two inputs A and B, producing an output C, with mask <1, 2, 3, 4>. Note that this hard-wired shuffle can also be used for mask <5, 6, 7, 0> by swapping the positions of A and B.
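A sketch of how such a masked input port could be resolved to individual element connections during architecture expansion is shown below; the helper name is hypothetical, but the defaulting and wrap-around follow the rules just described.

```python
# Resolve a hard-wired shuffle input: `sources` is the ordered source list as
# (name, width) pairs and `mask` is the optional mask property of the port.
# Illustrative sketch, not the actual expansion tool.

def resolve_masked_input(sources, mask, dest_width):
    # Concatenate the individual elements of all sources, e.g. A0..A3, B0..B3.
    elements = [(name, k) for name, width in sources for k in range(width)]
    if mask is None:
        mask = list(range(dest_width))        # defaults to <0, 1, ..., N-1>
    # Wrap around the source list when an index exceeds its length.
    return [elements[m % len(elements)] for m in mask]

# Figure 3.5: sources A and B (width 4 each) with mask <1, 2, 3, 4>.
print(resolve_masked_input([("A", 4), ("B", 4)], [1, 2, 3, 4], 4))
# Default mask on a width-2 source feeding a width-4 port wraps around:
print(resolve_masked_input([("A", 2)], None, 4))   # A0, A1, A0, A1
```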

3.6 Pseudo-units

The separation of the Blocks hardware model into a high-level architecture and an expanded one allows for the use of pseudo-unit constructs. These are Blocks nodes that abstract some specialized functionality for the compiler, which can be constructed from FUs available natively in the Blocks-CGRA. Like explicitly vectorized nodes, these pseudo-unit nodes allow the compiler to recognize and make assumptions about the availability of certain higher-level behaviors and operations.

In this section we will detail two such pseudo-units which have been implemented in the compiler, namely the Vector Shuffle Unit (SHF) and Multiplexed Register File (RFM).

3.6.1 Vector Shuffle Unit (SHF)

A problem with handling vectors in vector-expanded Blocks architectures is that of vector locality. Each individual Functional Unit can only access its own input ports and output buffers, and accessing those of other elements in the vector requires explicit connections between those individual FUs. However, since each individual Functional Unit has only two output buffers and four input ports, as the vector width increases it quickly becomes impossible for the vector-expanded FUs to be fully connected. For instance, for an expanded node in a vector of 4 to have direct access to all other elements in the vector, three of the four available input ports must be used up, leaving only one input port for external data.

Furthermore, since vector nodes in the Blocks-CGRA (normally) share a single instruction decoder, it is not always possible to specify unique behavior for the individual FUs in the vector. The ALU supports a Conditional Move (CMOV) operation, but this would require that the vector index of every ALU be compared to a constant indicating the index of the ALU for which the behavior should deviate from the rest of the vector. This by itself uses up inputs, and also requires additional inputs to be used to make the vector connected.

These problems make it difficult to perform vector shuffles with arbitrary patterns. One option is to embed hard-wired shuffle patterns into the architecture, as seen in Figure 2.13. However, for this approach to work effectively, it must be known in advance which shuffle patterns will be used in the program to be compiled. Another option is to shuffle the vector via memory, by writing and reading it with particular load/store masks, which we shall consider further in Section 5.5.3.

We introduce the Vector Shuffle Unit (abbreviated as SHF), a pseudo-unit designed to mitigate these problems and allow for compiler support of any possible vector shuffle pattern. Pre-expansion, the Vector Shuffle Unit has a form as shown in Figure 3.6; it supports two external inputs and has one available output buffer. The expanded version of the Vector Shuffle Unit is shown in Figure 3.7 and consists of a network of ALU units connected in a circular fashion; the two remaining input ports in each FU are used to access the left and right neighboring elements, available through a reserved output port. This makes it very similar to the circular communication network in the wide SIMD architecture described in [16] and shown in Figure 2.12.


Figure 3.6: Vector Shuffle Unit pre-expansion.


Figure 3.7: Vector Shuffle Unit post-expansion.

Unlike the communication network in the wide SIMD architecture, however, each individual ALU in the Blocks circular network has its own associated Instruction Decoder unit; i.e., there are as many Instruction Decoders as there are ALUs.

The Vector Shuffle Unit solves the two problems outlined earlier by guaranteeing connectivity between all scalar nodes in the vector, and by the individual Instruction Decoders which allow for element-specific operations such as PASSes from arbitrary input ports. The compiler can then use these features to generate a series of instructions that produce a vector shuffle for any arbitrary shuffle pattern. As the Vector Shuffle Unit exposes two external input ports, it is suitable for the LLVM shufflevector IR operation, which also takes (up to) two vectors as input.
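To illustrate the kind of connectivity the expanded SHF provides, the sketch below models each element's view of the circular network and uses it to rotate a vector by one position, one of the simplest shuffles the unit can perform. This is only a conceptual model; the instruction sequences actually generated by the compiler are discussed in Chapter 5.

```python
# Illustrative model of the SHF's circular network for vector width n:
# element k can read its own external inputs as well as the outputs of its
# left neighbor (k-1) and right neighbor (k+1), modulo n.

def rotate_left_by_one(vec):
    n = len(vec)
    # Each element passes the value of its right neighbor to its output.
    return [vec[(k + 1) % n] for k in range(n)]

print(rotate_left_by_one([10, 20, 30, 40]))   # [20, 30, 40, 10]
```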

3.6.2 Multiplexed Register File (RFM)

When dealing with larger compiled programs on the Blocks-CGRA, especially vectorized programs, register pressure is an issue. Register spilling is a problem more difficult to solve on a CGRA than on a traditional processor, as operands are explicitly routed between Functional Units.


Figure 3.8: (Scalar) RFM pre-expansion.


Figure 3.9: (Scalar) RFM post-expansion.

Inserting any sort of spilling code would require that operands potentially be re-routed after scheduling, as the registers that need to spill to memory may have to pass through nodes which are already reserved for routing. Essentially, this invalidates the whole schedule. The current Blocks compiler has no support for register spilling or for multiple Register Files, so once the 16 available registers in the scalar Register File are used up, the program becomes unschedulable.

To mitigate this issue, we introduce another pseudo-unit, namely the Multiplexed Register File (RFM). Though the Blocks compiler has no support for multiple Register Files, we can bypass this by abstracting more than 16 registers into one single node. The RFM fulfills this goal by using an ALU with its configuration bit set to 0; this makes output port 0 of the ALU unbuffered, and therefore allows for PASS operations with zero-cycle overhead. As such, latencies for writing to and reading from the RFM are the same as with a regular Register File, and the RFM is designed to be functionally equivalent to a regular Register File, only with support for up to 64 registers rather than 16. The pre-expansion RFM is shown in Figure 3.8; for an RFM containing 4 RFs, which is the maximum, this expands to a form as shown in Figure 3.9.

During architecture expansion the individual RFs are generated. Each RF has its own individual Instruction Decoder; if the RFs were to share a single Instruction Decoder, it would not be possible to write to one register without also writing to three other registers in the process, as each RF would execute the same instruction, and there is no form of predication available in the RF instruction set architecture. The four input connections to the RFM are copied to all four RFs in the expanded RFM; this allows any of the total 64 registers to be written from any of the four RFM input ports at any time. Next, any connections from output port 0 of the abstract RFM are re-routed to use output port 0 of the first RF in the expanded RFM instead; this ensures that register r0 is always available on the lower output port, which mimics the same feature in a regular RF on the Blocks-CGRA. Moreover, output port 1 of every RF in the expanded RFM is routed to an additional ALU with an unbuffered output port 0; this ALU is tasked with forwarding the loaded register from the associated RF to its output port with zero-cycle delay. Any connections from output port 1 of the abstract RFM are then re-routed to use the (unbuffered) output port 0 of the ALU instead; this ensures that any of the 64 registers can be loaded without any delay, exactly as in a regular RF in the Blocks-CGRA.

In other words, the RFM has exactly the same features and semantics as a regular RF. Therefore, we can substitute an RFM for a regular RF in the compiler, which gives us access to 64 registers rather than 16. As the registers are still stored in a single (abstract) node, this is fully compatible with the Blocks scheduler, which supports only one RF location. The RFM supports anywhere between two and four internal RFs, allowing for 32, 48 or 64 registers, as needed. Compared to proper multiple-RF support in the scheduler, the additional hardware overhead is one additional ALU and associated Instruction Decoder, which perform the multiplexing needed when loading registers, as well as up to 5n − 4 extra wires, where n is the number of internal RFs used.

Currently, the Blocks compiler can only read the Blocks architecture configuration and convert it into a resource graph; it is not equipped to modify the configuration. As such, the compiler is unable to prune unused internal RFs in the RFM, or to replace the RFM with a normal RF altogether. Care must be taken by the programmer that the chosen RFM size does not result in wasted RFs.
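As a rough illustration of the multiplexing the RFM performs, the mapping from an abstract register number to an internal RF and a local register can be sketched as follows, assuming 16 registers per internal RF as in Figure 3.9; the helper name is hypothetical.

```python
# Sketch of how an RFM with `num_rfs` internal Register Files (2-4) maps an
# abstract register number onto (internal RF index, register within that RF).
# Illustrative only; the real wiring is fixed by the architecture expansion.

REGS_PER_RF = 16

def rfm_locate(reg, num_rfs):
    assert 0 <= reg < num_rfs * REGS_PER_RF, "register out of range"
    return reg // REGS_PER_RF, reg % REGS_PER_RF

print(rfm_locate(0, 4))    # (0, 0): r0 lives in RF0, always on output port 0
print(rfm_locate(37, 4))   # (2, 5): r37 is r5 of RF2, read via the mux ALU
```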

3.7 Vector indices

Due to how vectors on the Blocks-CGRA are stored and passed through the system, i.e. through a vector of Functional Units, a problem arises when dealing with vectors in shared memory. Traditionally, pointers are scalar types, e.g., a pointer to a struct points to the start of that struct in memory. Likewise, a pointer to a vector normally points to the first element in the vector. However, on the Blocks-CGRA, when loading a vector from, or storing a vector to shared memory, the memory operation will be performed by a series of vectorized Load-Store Units. As a result, each LSU in the vector must have a pointer to its own individual element in memory. This can be achieved by adding the index of each LSU in the vectorized set of LSUs to the base pointer as an offset. The result is that the scalar pointer is expanded to a vector of pointers, each pointing to an individual vector element in memory. This vector of pointers can then be passed to the vectorized LSUs to perform the memory operation correctly.

When writing a Blocks program manually in PASM, this expansion can simply be done once at the start of the program or before the main loop. The already expanded pointer can then be freely used throughout the program. However, LLVM follows the traditional notion that a pointer to some object in memory points to the start of that object. Therefore, a pointer in LLVM is always treated as a scalar. In other words, there is a pointer type mismatch between LLVM and the Blocks programming model.

To support vector memory operations on Blocks in LLVM, one option is to expand every memory operation such that the scalar pointer, which is used as an input parameter for the memory operation, is expanded to a vector of pointers, offset with each individual LSU's vector index. However, doing so for every memory operation would significantly hamper performance.

Instead, we propose supporting vector indices natively in hardware. We introduce an internal vector index register for Load-Store Units, which is transparently added as an offset when performing memory operations on shared memory. For some memory operation on shared memory, let O be the address operated on, and let A be the address that is specified. In the case of an explicit memory operation, A will be loaded from one of the LSU's input ports, whereas for an implicit memory operation, it is loaded from an internal LSU control register instead. We can then model the addressing behavior of the LSU (somewhat trivially) as follows.

O = A

That is, the address operated on (e.g. loaded from or written to) is exactly the address that was specified. We propose the following modification to the addressing behavior of the LSU, where V is the vector index of the individual LSU performing the memory operation, and D is the size of the data type; i.e., 1 for BYTE, 2 for HWORD and 4 for WORD.

O = A + V × D

With the revised behavior, it is now possible to transparently offset the memory address operated on by setting V, the vector index of the LSU. As a result, by setting V correctly in advance for each LSU in a vectorized set of LSUs, it is now possible to provide a pointer to the start of a vector in memory to the set of LSUs, and load or store the entirety of the vector from or to memory. This resolves the pointer type mismatch problem. Note that when V = 0, then O = A as before, preserving backwards compatibility with previous Blocks programs.

The data size D is embedded in LSU instructions and is therefore already available. Though a full multiplication is somewhat expensive to do in hardware, as the data size is always a power of 2, the multiplication can be synthesized as a small left-shift instead. Meanwhile, as the LSU already contains a full-adder for computing the next address following an implicit memory operation, this full-adder could be re-used for adding the vector offset.

We modify the LSU to add an additional internal register for the vector index and adjust the addressing behavior as specified above for shared memory. For local memory this addressing behavior is not needed, as the memory is unique to the individual LSU; writing vectors to or loading vectors from local memory already works correctly with the original addressing behavior, therefore we simply disable the addition for local memory.

As the vector index for each expanded LSU is known at the architecture expansion stage, we incorporate a step into the architecture expansion that sets the default value for the internal vector index register. For example, an LSU with vector width 4 is expanded to 4 separate LSUs, for which the vector indices are set to 0, 1, 2 and 3.

To support gather-scatter operations, we also introduce a new instruction to change the vector index at runtime, detailed in Section 4.2.1. By changing the vector indices before a load or store, the load or store can be gathered from or scattered to memory using an arbitrary pattern. We also introduce an instruction in Section 4.2.2 that resets the LSUs' vector indices back to their original value, as set during architecture expansion. Moreover, we also add functionality to the LSU to disable memory operations at runtime. The most significant bit of the vector index is used as a read/write enable bit; if it is set, then any reads from or writes to shared memory are skipped. This allows for the vector width to be changed at runtime, simply by disabling the unneeded LSUs, i.e. by setting their vector indices to a negative value.

In conclusion, with the addition of LSU index registers, the problem of pointer type mismatch between LLVM and the Blocks programming model is resolved whilst also adding hardware support for gather-scatter operations and dynamically sized vectors.
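The revised addressing behavior, including the masking via the sign bit of the vector index, can be summarized in a short sketch; this is an illustrative model, not the hardware description.

```python
# Sketch of the revised LSU addressing for shared memory: the effective
# address is offset by vector_index * data_size, and a negative vector index
# masks the access entirely.

def lsu_shared_access(address, vector_index, data_size):
    """Return the effective address, or None if the access is masked."""
    if vector_index < 0:               # MSB used as a read/write disable bit
        return None
    return address + vector_index * data_size   # synthesizable as a left-shift

# Four LSUs (indices 0..3) loading a WORD (4 bytes) vector starting at 0x100:
print([lsu_shared_access(0x100, v, 4) for v in range(4)])
# Disabling the last two lanes shrinks the effective vector width to 2:
print([lsu_shared_access(0x100, v, 4) for v in (0, 1, -1, -1)])
```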

Chapter 4

Instruction set extensions

To facilitate the use of common vector operations on Blocks, a number of extensions to the Blocks instruction set are proposed. Though the additional instructions introduced in this chapter are mostly intended to improve performance of vector operations, they also have use in scalar programs, as shall be outlined.

4.1 ALU instructions

4.1.1 ADDI: Add Immediate

The Add Immediate (ADDI) instruction adds the value from the specified input port and the specified immediate value together, and writes the resulting value to the specified output buffer. To fit the immediate value in the instruction encoding, only 4 bits are allocated to store it, which would allow a range of 0 through 15. However, since adding 0 is equivalent to a PASS operation, we can exclude 0 from the allowed range by also enabling the carry bit on the ALU during the addition, which shifts the range of immediates to 1 through 16.

Usage

ADDI out,imm,in    ; out := in + imm

Parameters

• out: Specifies the output buffer to which the result will be written. This parameter can be any of the Functional Unit's available outputs.
• imm: Specifies the immediate value to be added to the input value. The immediate value can be any integer in the range of 1 through 16.
• in: Specifies the input port from which the input value will be read. This parameter can be any of the Functional Unit's available inputs.

Rationale

The Add Immediate instruction can be used to add small values to an input, which provides key benefits in various situations. The main benefit of the ADDI instruction is to remove strain on the Immediate Unit(s) and free up inputs in the ALU.

• Vectorization: Perhaps the greatest benefit of the ADDI instruction comes into play when dealing with vectorized architectures, especially those with a Vector Shuffle Unit. A vector algorithm frequently makes use of a series of LSUs to read values, and each of these LSUs needs its own offset, which is usually quite small. A Vector Shuffle Unit can produce these offsets without the need for additional Immediate Units or computationally expensive insertelement and/or shufflevector operations. In the more general sense, a Vector Shuffle Unit can be used to produce any desired vector of (small) immediates. This is extremely helpful for vector algorithms that use per-element coefficients, such as convolution with a non-standard convolution window.

• Loop counter incrementing: In case an ALU is responsible for keeping track of a loop counter (e.g. by means of a self-edge from one of its outputs to one of its inputs), the ADDI instruction allows the ALU to increment the loop counter without needing to connect an extra input, from which the increment can be read. This frees up an input that can be used for other purposes. Loop counter incrementing can also be performed by the Zero-Overhead Loop Accelerator [13], but the ADDI instruction is still useful in the case that multiple counters need to be incremented across loop iterations, or need to be incremented conditionally (e.g. in a counting algorithm).

• Immediate generation: Combined with the Logic Shift Left 4-bit (SHLL4) instruction, the ADDI instruction allows for the generation of arbitrary immediates with only a single self-edge. This is performed by first resetting the output buffer to zero, which can be achieved with a XOR instruction on the same inputs. Next, the ALU alternates between ADDI and SHLL4 to produce 4 bits of the immediate value at a time, until the desired value is obtained. This procedure runs in, at most, max(1, 2n) cycles, with n = ⌊log₂(x)⌋, where x is the desired immediate value. The cycle count can be reduced by skipping any zero nibble in the desired value. It should be noted that, for many practical purposes, 4-bit or similarly small immediates may already suffice for many algorithms, particularly those that use small coefficients. For instance, image filters such as blur and edge detection frequently use small coefficients. Generating such immediates in an ALU reduces the strain on the Immediate Unit or even renders it unnecessary, which is beneficial as an Immediate Unit is rather costly in energy and area usage in the current Blocks hardware. This is because the program memory of an Immediate Unit has a width of 32 bits to store the full range of immediates, whereas other Functional Units have a program memory width of only 12 bits. In such situations, it takes only 2 cycles rather than 1 cycle to produce the desired immediate. Indeed, the first cycle (clearing the output) can usually be performed in advance, which removes the penalty on running time.

• Offset calculation: When coupled with a Load-Store Unit, the ADDI instruction can be used to simplify offset calculation. In situations where implicit addressing is not always ideal, e.g. when reading or writing structs with mixed data types, the ADDI instruction provides a quick way to make small, varying increments to the LSU address without the need for an Immediate Unit.

Emulation on prior Blocks hardware

An ADDI instruction could previously be emulated by attaching an Immediate Unit to an ALU. However, this uses up an additional input on the ALU, which the ADDI instruction avoids.
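As an aside, the immediate range trick described for ADDI above can be sketched as follows; the encoding shown (storing imm − 1 in the 4-bit field and relying on the carry bit for the extra +1) is an assumption used purely for illustration.

```python
# Sketch of the ADDI immediate range trick: a 4-bit field normally encodes
# 0..15, but adding with the carry bit set makes the effective immediate
# field + 1, i.e. 1..16, so the useless "add 0" case is not wasted.

def addi(value, imm_field):
    assert 0 <= imm_field <= 15, "only 4 bits are available in the encoding"
    return value + imm_field + 1      # carry bit contributes the extra +1

def encode_immediate(imm):
    assert 1 <= imm <= 16, "ADDI supports immediates 1 through 16"
    return imm - 1                    # hypothetical field encoding

print(addi(100, encode_immediate(16)))   # 116
```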

4.1.2 SUBI: Subtract Immediate

The Subtract Immediate (SUBI) instruction subtracts the specified immediate value from the value read from a specified input port, and writes the resulting value to the specified output buffer. Similarly to the ADDI instruction, the allocated space for the SUBI immediate is 4 bits, which allows for a range of 1 through 16 when making use of the borrow flag already present inside the ALU.

Usage

SUBI out,imm,in    ; out := in - imm

Parameters

• out: Specifies the output buffer to which the result will be written. This parameter can be any of the Functional Unit's available outputs.
• imm: Specifies the immediate value to be subtracted from the input value. The immediate value can be any integer in the range of 1 through 16.
• in: Specifies the input port from which the input value will be read. This parameter can be any of the Functional Unit's available inputs.

The imm parameter is intentionally chosen as the second parameter rather than the last parameter, as this is consistent with the current SUB instruction, where the second and third parameters represent the right-hand side and left-hand side of the operator respectively.

Rationale

The Subtract Immediate instruction can be used to subtract small values from an input. This is useful in similar situations to the Add Immediate instruction. As with the ADDI instruction, the SUBI instruction frees up an input in the ALU from an Immediate Unit.

• Vectorization: The SUBI instruction is not quite as useful on vectorized architectures as the ADDI instruction. However, the benefits listed above still extend to vectorized algorithms. In algorithms that use negative constants or coefficients, the SUBI instruction can be an effective way to generate small negative immediates.

• Loop counter decrementing and checking: In case an ALU is responsible for keeping track of a loop counter that decreases to 0, e.g. by means of a self-edge from one of its outputs to one of its inputs, the SUBI instruction allows the ALU to decrement the loop counter as well as check whether 0 is reached in one go. Since the SUBI instruction writes to an output buffer, this output buffer can be connected to the input of an Accumulate-and-Branch Unit, and used with a conditional branch instruction to jump back to the start of the loop. If the result of the SUBI instruction is 0, then the conditional branch is not taken and the loop terminates. In addition, this can be done without needing to add an extra input from which the decrement is read. This method of loop counter decrementing and checking allows for a powerful single-cycle loop construction for simple algorithms. Loop counter decrementing can also be performed by the Zero-Overhead Loop Accelerator [13], but the SUBI instruction is still useful in the case that multiple counters need to be decremented across loop iterations.

• Immediate generation: In a similar fashion to the ADDI instruction, the SUBI instruction can be used to generate small negative values more efficiently. This is potentially quite useful for algorithms that use small coefficients.

• Offset calculation: Similar to the ADDI instruction, the SUBI instruction can be used to make small adjustments to offsets for a Load-Store Unit. This could be used, e.g., when a list is being read and a rewind needs to be performed, for instance when performing a bubble sort.

Emulation on prior Blocks hardware

A SUBI instruction could previously be emulated by attaching an Immediate Unit to an ALU. However, this uses up an additional input of the ALU, which the SUBI instruction avoids. If only the ADDI instruction is available, a SUBI can also be emulated by first performing a NOT on the input, then an ADDI, and finally another NOT. However, this again requires an additional input (in this case, a self-edge), and runs slower than when an Immediate Unit can be used directly.

4.2 LSU instructions

4.2.1 SVX: Set Vector Index

The Set Vector Index (SVX) instruction changes the internal vector index register for the LSU, as specified in Section 3.7, at runtime. The value of the vector index is set to the value that is specified on the input port. If the new vector index is negative, then memory operations performed on this LSU are masked going forward, until the vector index is changed to a positive value.

Usage

SVX in    ; vector index := in

Parameters

• in: Specifies the input port from which the input value will be read. This parameter can be any of the Functional Unit's available inputs.

Rationale

The main use of the Set Vector Index instruction is to support gather-scatter operations. By changing the vector indices of the LSUs at runtime, a load from or write to memory can be gathered or scattered over any arbitrary pattern.

Emulation on prior Blocks hardware

LSU index registers did not previously exist on the Blocks hardware, so there was no equivalent for the Set Vector Index instruction. However, gather-scatter operations could be performed by manually adding offsets to a vector of pointers.

4.2.2 RVX: Reset Vector Index

The Reset Vector Index (RVX) instruction changes the internal vector index register for the LSU at runtime. The value of the vector index is reset to the value that was specified as the default value in the architecture XML, which is set during architecture expansion.

Usage

RVX    ; vector index := default vector index

Parameters

The Reset Vector Index instruction does not use any parameters.

Rationale

The Reset Vector Index instruction pairs with the Set Vector Index instruction to reset vector indices that were changed at runtime with the SVX instruction back to their original values. As the RVX instruction does not use any parameters at all, there is no need to perform operand routing or generate immediate vectors, which makes this a convenient method to undo changes to vector indices.

Emulation on prior Blocks hardware

LSU index registers did not previously exist on the Blocks hardware, so there was no equivalent for the Reset Vector Index instruction.

4.2.3 LVD: Load Vector Default Index

The Load Vector Default Index (LVD) instruction writes a value to the specified output buffer. The value that is written is the default value for the vector index of the LSU that was specified in the architecture XML, set during architecture expansion. Even if the Set Vector Index instruction was used to change the vector index at runtime, the Load Vector Default Index instruction will produce the original value of the vector index before any changes were made.

Usage

LVD out    ; out := default vector index

Parameters

• out: Specifies the output buffer to which the result will be written. This parameter can be any of the Functional Unit's available outputs.

Rationale

The Load Vector Default Index instruction has a limited, yet crucial use. As the architecture expansion pass initializes the default vector indices of an expanded LSU with vector width N to 0, 1, 2, 3, ..., N − 1, the LVD instruction can be used to immediately produce an index vector <0, 1, 2, 3, ..., N − 1> without any operand routing. This index vector is a crucial building block for several vector-specific operations, as it allows for element-specific behavior in a vector of ALUs via the EQ and CMOV instructions; this shall be further detailed in Chapter 5. By utilizing the fact that vector indices can be reset to exactly this index vector, we introduce the LVD instruction to efficiently produce this index vector at any time.

Emulation on prior Blocks hardware

An LVD instruction could previously be emulated by attaching a series of Immediate Units to the consuming units. However, this is quite costly in terms of hardware, and uses up an additional input on the consuming units. With the LVD instruction, existing LSU units and their connections can be used instead.

An alternative is to pre-load each LSU's index into local memory before the program is executed, so that the indices can be read with no stall cycles during execution of the program. However, this requires memory to be allocated and an address operand to be routed to the LSUs, which may not always be convenient or efficient to schedule.

Chapter 5

Compiler vector support

In order to actually implement support for vector data types and operations in the Blocks compiler back-end, a number of changes must be made to the code lowering (instruction selection) and scheduling stages. This chapter shall describe the changes made to these processes, and the procedures by which high-level LLVM vector operations are lowered to PASM code. We shall first detail how the storage and routing of vector data units is implemented in the code scheduling stage. This allows for the scheduling of all vector operations that have a scalar equivalent. Next, procedures are given for building arbitrary vectors, and for lowering each of the remaining vector-specific LLVM operations introduced in Section 3.1. This includes insertelement, extractelement and finally shufflevector. Various procedures are given for shufflevector in particular, which have varying requirements and benefits.

5.1 Storage and routing

On most traditional processors, the list of instructions to be executed is one-dimensional. Only one Instruction Fetch/Decode unit exists in the system, which reads from a singular Instruction Memory. However, on the Blocks-CGRA, the situation is different; the CGRA is reconfigurable, and any number of Instruction Fetch/Decode units may be present in the system. For an instruction to be scheduled, choices must be made regarding not only when the instruction will be scheduled, but also which Functional Unit will execute the instruction, and how the input parameters will be routed through the system to the Functional Unit which will execute the instruction.

In Section 2.2.2, it was described how the Blocks compiler converts the architecture configuration into a cycle-by-cycle resource graph, and schedules instructions onto the resource graph by routing input parameters through the system and selecting a suitable FU that is reachable. So far, this scheduler has only supported scalar architectures and data types.

In order to add support for vectorized programs, we can identify two routes we can take. One is to expand all vectorized operations in the program back to scalar operations and schedule these onto the resource graph as we would a regular scalar program. Following this, all instructions scheduled onto the individual vector element nodes would be combined back into a single instruction.

However, this option is not very feasible due to a number of factors. In order for the instructions to be recombined into a single instruction, the instructions for each vector element node must be exactly the same across every cycle: opcodes, as well as which input ports are being read from and which output ports are written to. This also goes for any routing operations for vector operands. This is because all vector element nodes make use of the same IF/ID unit and instruction memory. If any one instruction is off, then the schedule is invalid. Of course, it would be possible to mitigate this by giving each vector element node its own individual Instruction Decoder, but at that point we are back to Single Instruction, Single Data; i.e., we are not really working with vectors anymore. Moreover, by scalarizing the program it becomes unclear where a certain vector variable is stored at any given time, as it may have been only partially scheduled or the elements may be strewn about other units in the system. Additionally, by scalarizing the program we lose some vector-specific operations like shufflevector, whereas efficient execution of such operations is one area where vectorized programs can gain speedups over scalar programs.

Mainly because of the necessary synchronization across all vector element nodes, it makes more sense to consider the vector nodes as a whole. Therefore, in order to add support for vectorized programs, the first step is to introduce support for vector architectures and data types to the resource graph and scheduling algorithms.

The architecture expansion pass provides a key benefit to the compiler here. As described in Section 3.3, the architecture configuration that is fed to the compiler now has explicitly defined vector widths for each FU node, which are automatically expanded to a series of FUs for simulation and synthesis in a well-defined manner. As such, we can use this information in the compiler in order to know which nodes vector data types may be stored in and routed through.

We allow vector data types to be stored in FU nodes that have at least the vector width required. For instance, a vector having a width of 2 may be stored in a node that has a width of 2, or a width of 4, but not in a node that has a width of 1, as then only one of the two elements in the vector could be stored. Because the Blocks scheduler stores instruction output in the executing FU node rather than in an output buffer directly, this also allows operations that produce a vector as output to select a vectorized FU as the executing FU. We achieve this by adding a condition to the scheduler which checks that the width of an FU node is at least the width of the output vector of the instruction in order for the FU to be marked as suitable for executing the instruction.

For routing, we apply the same principle: data types may be routed through nodes that have at least the required width. In this case, note that a vector data type retains the same vector width regardless of the width of the node(s) that it passes through. We can apply the rules defined in Section 3.4 to verify that this does not result in problems. We route a vector in A to its destination C, both having the same vector width, i.e. |A| = |C|. The data is routed through node B, which may have a different vector width. By applying the vector-to-vector connection rules as specified in Section 3.4.3 for various vector widths for B, we obtain Figure 5.1.
When |B| = |A| as in Figure 5.1a, all data in A can pass through B to C safely. This is also the case when |B| > |A| as we see in Figure 5.1b. However, as seen in Figure 5.1c, when |B| < |A|, some elements from A are lost when routed through B. In this example, there are no paths from A2 and A3 to C2 and C3 respectively.

Figure 5.1: Routing a vector through different node sizes: (a) same width, (b) larger width, (c) smaller width.

We can therefore conclude that our method of allowing vector data types to be routed through any node with the same width or higher is sound. Routing through a node with a smaller width results in data loss, so we disallow this.

By applying these rules to the resource graph and scheduler, it now becomes possible to schedule most standard instructions which use vector(s) as input parameters and/or produce a vector as output, given that the instruction already had a supported scalar equivalent, and given that the instruction produces a result. For instance, the scalar add operation was previously supported in the scheduler, and a vectorized add operation is now similarly supported. In this case, the scheduler will route the input vectors through the system as defined in this section, and will select any FU node with at least the width of the output vector as the executing FU node, thereby scheduling the instruction onto the resource graph.
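The two width checks added to the scheduler can be summarized in a short sketch; the helper functions are hypothetical, and the real checks operate on the resource graph described in Section 2.2.2.

```python
# Sketch of the width rules added to the scheduler:
#  - an FU node may execute an instruction (and store its result) only if the
#    node is at least as wide as the instruction's output vector;
#  - a vector operand may be routed through a node only if the node is at
#    least as wide as the operand, since a narrower node would drop elements.

def can_execute(fu_width, output_width):
    return fu_width >= output_width

def can_route_through(node_width, operand_width):
    return node_width >= operand_width

print(can_route_through(4, 2))   # True: width-2 vector through a width-4 node
print(can_route_through(2, 4))   # False: elements A2/A3 would be lost
```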

5.2 Building vectors

One operation that is frequently required in generating code for vectorized programs is building arbitrary vectors; for instance, when declaring a constant vector, or casting a series of variables to a vector. LLVM can usually expand this operation to a series of insertelement instructions that insert the elements of the vector to be built one by one. However, for certain types of vectors we can improve performance by using a base vector obtained through other means. For example, filling a vector with copies of the same element can be done much faster than by using insertelement instructions, as will be described in this section.

When building arbitrary vectors, we can identify two steps: (a) producing a base vector, and (b) inserting the remaining elements which do not correspond with the base vector.

5.2.1 Base vector

To compute and produce the base vector, we first calculate three similarity counts on the result vector V. s_const, defined in Equation 5.1, is a count of how many elements in V are constant values. s_index, defined in Equation 5.2, counts how many elements in the vector match the index vector {0, 1, 2, ...}. Note that this also means the elements must be constants, therefore s_index ≤ s_const. Finally, s_splat, defined in Equation 5.3, computes the number of occurrences of the maximally occurring element in V.

s_const = |{i | V_i is a constant}|    (5.1)

s_index = |{i | V_i = i}|    (5.2)

s_splat = max_{x∈V} |{i | V_i = x}|    (5.3)

We then compare the three computed similarity counts to find the maximum. Based on the maximum similarity value, one of three methods is chosen to generate a base vector B. If two similarity counts are equal, the method with the higher priority is chosen. The three methods are as follows, ordered by priority from highest to lowest.
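Before detailing the three methods, the similarity counts and the priority-based selection can be sketched as follows; this is an illustration of Equations 5.1 through 5.3 and the selection rule, not the compiler code, and models constants as ints and variables as any other value.

```python
# Sketch of the base-vector selection: compute the three similarity counts on
# the desired result vector V and pick the method with the maximum count,
# with ties broken by priority (index > splat > immediate generation).

from collections import Counter

def choose_base_method(V):
    s_const = sum(1 for x in V if isinstance(x, int))
    s_index = sum(1 for i, x in enumerate(V) if isinstance(x, int) and x == i)
    s_splat = max(Counter(V).values())
    ranked = [(s_index, "index vector"),
              (s_splat, "splat vector"),
              (s_const, "immediate generation")]
    return max(ranked, key=lambda t: t[0])[1]   # max() keeps the first on ties

print(choose_base_method([0, 1, 2, "a"]))       # index vector
print(choose_base_method(["y", "y", "y", 3]))   # splat vector
print(choose_base_method([7, 1, 0, 9]))         # immediate generation
```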

Index vector

If s_index is the maximum similarity count, this means the vector V which we want to build is very similar to the index vector {0, 1, 2, ...}. Therefore, we use the index vector as the base vector, i.e. B = {0, 1, 2, ..., |V| − 1}. Note also that s_index ≤ s_const; therefore, s_index = s_const in this situation, but s_index takes priority.

The index vector is produced simply by taking the output of the LVD instruction on a vectorized LSU with a suitable vector width. As was defined in Section 4.2.3, the LVD instruction outputs the default value for the vector index of the LSU that executes it. Because the compiler uses the architecture configuration before the architecture expansion pass is run, the default value for each LSU's vector index (and thereby the result of the LVD instruction) matches the index of the individual scalar LSU in the vectorized LSUs. As such, we can safely make the assumption that the LVD instruction will output {0, 1, 2, ..., |V| − 1} for a vectorized LSU with width |V|, which matches our desired base vector B.

Splat vector

If s_splat is the maximum similarity count, this means that a single element y occurs many times in V. Here y, the maximally occurring element in V, is defined as y = arg max_{x∈V} |{i | V_i = x}|, which mirrors the definition of s_splat. We take as the base vector B = {y, y, y, ...} such that |B| = |V|. In other words, we take the vector with size |V| where every element is y. This is also commonly referred to as splatting y into a vector of size |V|, where the resulting vector is called the splat vector of y.

We can obtain the splat vector quite easily, simply by applying the scalar-to-vector rule as defined in Section 3.4.1; any connection of a scalar node to a vector node is expanded to a broadcast, where the value in the scalar node is connected to all individual elements of the vector node. In this sense, a splatting operation on a vectorized Blocks configuration is not an explicit operation, but rather an implicit one; it is simply a byproduct of routing scalar data to vectorized nodes. As a result, producing the base vector B is effectively a zero-cycle operation, as y will automatically be splatted as soon as it is routed to a vector node.

Immediate generation

Finally, if s_const is the maximum similarity count, this means V mainly consists of constants or contains more constants than occurrences of the maximally occurring element. As such, we construct a base vector B which contains only the constants in V, and has a 0 for all elements where the corresponding element in V is a variable. Formally, B is defined in Equation 5.4 as the result vector of applying as_constant(x) (Equation 5.5) to every element x ∈ V.

B = as_constant(V)    (5.4)

as_constant(x) = x if x is a constant, 0 otherwise    (5.5)

The resulting base vector B is a vector containing only constants. We generate this vector through a process of immediate generation on the SHF. As such, if there is no SHF node in the system, we disregard s_const and choose one of the other two methods.

We can use the SHLL4 (Logic Shift Left 4-bit) and XOR (Exclusive OR) instructions in the instruction set architecture for ALUs in the Blocks-CGRA, as well as the added ADDI (Add Immediate) and SUBI (Subtract Immediate) instructions, described in Section 4.1, in order to generate arbitrary immediate values on an ALU without the need for an Immediate Unit. We start by executing a self-XOR: a XOR instruction where the two inputs are set to the same port. XORing a value with itself always produces an output of 0, so the purpose of this self-XOR is to initialize the immediate to 0. Next, we use the ADDI or SUBI instruction to add or subtract a 4-bit immediate to the immediate being generated. Then, if the immediate is not yet done, we use the SHLL4 operation to left-shift the immediate by 4 bits. We can then repeat the process of adding 4-bit immediates and shifting the result to fill in the hexadecimal digits of the desired immediate one by one.

Suppose we would like to generate the immediate 4611, which is 0x1203 in hexadecimal. The chain of instructions for this immediate would be as follows:

1. XOR an arbitrary input port with itself to obtain the value 0x0.
2. ADDI 1 to the previous result to obtain the value 0x1.
3. SHLL4 left-shift the previous result to obtain 0x10.

4. ADDI 2 to obtain 0x12.
5. SHLL4 to obtain 0x120.
6. SHLL4 to obtain 0x1200. (We can skip the ADDI operation as adding 0 is equivalent to a PASS, which is not useful.)

47 7. ADDI 3 to obtain 0x1203.

Thus, we generate the immediate 4611 in 7 cycles. This, of course, is slower than simply using an Immediate Unit, where any 32-bit immediate can simply be generated in a single cycle.

However, if there is a SHF node in the system, then we can perform immediate generation for the entire vector in tandem. For example, if we want to generate a vector of 8 immediates where each immediate takes (at most) 7 cycles to generate, then we only need 7 cycles to generate the entire vector. This is possible because the SHF node is expanded to a set of ALUs which each have their own Instruction Decoder, making them a perfect fit for immediate generation. As such, if the immediates to be generated are fairly small, and the vector size is large, then immediate generation on the SHF can yield far better performance than producing the immediates on a scalar Immediate Unit and inserting them one by one; in the example situation, this would result in 7/8th of a cycle per immediate using a SHF, and 1 cycle per immediate using an Immediate Unit. Having a vector of Immediate Units in the system with the correct width would still produce a vector of immediates faster, as the entire vector could be output in a single cycle; however, Immediate Units are one of the more expensive units in the Blocks-CGRA in terms of energy usage due to having a much wider instruction memory, namely 32 bits versus the standard 12 bits, so using immediate generation on a SHF avoids the energy and area overhead of having a vectorized IU.
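The instruction chain for an arbitrary immediate can be sketched as a small generator. This mirrors the XOR/ADDI/SHLL4 procedure above, skipping zero nibbles; it is illustrative only and not the compiler's implementation, and the operands of the real instructions (self-edge input, output buffer) are abbreviated.

```python
# Sketch of nibble-by-nibble immediate generation on an ALU: clear the running
# value with a self-XOR, then alternate ADDI (one hexadecimal digit at a time)
# and SHLL4, skipping the ADDI for zero nibbles.

def immediate_chain(value):
    assert value >= 0
    ops = ["XOR (self)"]                      # initialize the value to 0
    nibbles = [int(d, 16) for d in format(value, "x")]
    for pos, nib in enumerate(nibbles):
        if nib != 0:
            ops.append(f"ADDI {nib}")         # add the next hexadecimal digit
        if pos != len(nibbles) - 1:
            ops.append("SHLL4")               # make room for the next digit
    return ops

chain = immediate_chain(0x1203)
print(len(chain), "cycles:", chain)           # 7 cycles, as in the example above
```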

5.2.2 Remainder insertion

Following creation of the base vector, it remains to insert any elements that are not present in the base vector. To achieve this, we generate a set of (element, index) pairs R_insert as defined in Equation 5.6. For each (element, index) pair (x, i) ∈ R_insert, we then append insertelement instructions where the input vector val is the base vector B for the first instruction, or the result of the previous instruction for subsequent inserts; the element to be inserted, elt, is x; and the index to insert into, idx, is i. Each insertelement instruction is lowered as described in Section 5.3.

R_insert = {(x, i) | x = V_i ∧ x ≠ B_i}    (5.6)
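A sketch of the remainder computation and the resulting chain of insertions is shown below; the helper is hypothetical, with constants as ints and variables modeled by any other value, as in the earlier sketch.

```python
# Sketch of Equation 5.6: collect the (element, index) pairs that differ from
# the base vector, then chain one insertelement per remaining element.

def remainder_insertion(V, B):
    r_insert = [(x, i) for i, x in enumerate(V) if x != B[i]]
    vec = list(B)
    for x, i in r_insert:                # one insertelement per pair
        vec[i] = x
    return r_insert, vec

V = [0, 1, "k", 3]                       # desired vector; "k" models a variable
B = [0, 1, 2, 3]                         # index base vector from the LVD output
print(remainder_insertion(V, B))         # ([('k', 2)], [0, 1, 'k', 3])
```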

5.3 Insert element

On the Blocks-CGRA, vector data types are stored in a vectorized set of Functional Units. However, this approach has the problem of vector locality, as described in Section 3.6.1. Because of this, it is not possible to read or write a vector element from/to an arbitrary index directly on the Blocks-CGRA. In order to implement the insertelement instruction, as described in Section 3.1.4, we take a different approach. We split up the insertion into two steps:

1. For every individual unit in a vectorized FU, check if the vector index of that unit matches the index of the element we want to insert.

48 2. Based on the result of the previous step, output either the element from the original vector (if the indices did not match), or the new element to be inserted (if the indices did match).

The only Functional Unit in the Blocks-CGRA that supports these operations is the ALU, so we schedule the insertelement instruction onto the vectorized ALU. For step 1, we use the EQ operation, with as its two parameters the splat vector of the index to insert into, and the index vector {0, 1, 2, ...}. This makes use of the fact that splatting is an implicit operation with no additional overhead aside from routing, as well as the fact that the LVD instruction can be used to produce the index vector at any time, as described in Section 5.2.1. The EQ operation in step 1 sets the ALU flag if and only if the insertion index matches the vector index of the FU.

For step 2, we use the CMOV instruction. As the false-parameter for the CMOV instruction, we use the input vector of the original insertelement instruction; as the true-parameter, we use the splat vector of the element to be inserted. This, again, uses the fact that splatting is implicit. The result of this operation is that each result element will be the element from the input vector if the indices did not match, or the new element to be inserted if the indices did match, thereby completing the insertion.

With the architecture expansion rules for scalar-to-vector connections as described in Section 3.4.1, splatting a vector is a zero-cycle operation; moreover, the LVD instruction always takes 1 cycle and does not require routing. As a result, the insertelement instruction is implemented in such a way that it always takes exactly 2 cycles (3 when including the LVD instruction, but this can be done ahead of time), regardless of the vector width.
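Per vector lane, the EQ/CMOV sequence behaves as in the following sketch; this is a model of the data flow, not the scheduled PASM, and the splat vectors arise implicitly from scalar-to-vector routing.

```python
# Lane-wise model of the two-step insertelement lowering: EQ compares the LVD
# index vector against the splatted insertion index and sets a flag, then
# CMOV selects the new element where the flag is set.

def lower_insertelement(val, elt, idx):
    index_vector = list(range(len(val)))            # produced by LVD
    flags = [lane == idx for lane in index_vector]  # EQ against splat of idx
    return [elt if f else old                       # CMOV: splat of elt vs. val
            for f, old in zip(flags, val)]

print(lower_insertelement([10, 20, 30, 40], 99, 2))   # [10, 20, 99, 40]
```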

5.4 Extract element

For extracting an element from a vector, we encounter the same problem as with insertion, in that it is not possible to read a vector element from an arbitrary index directly. To properly extract the element from the vector, it must be usable in scalar operations after the extractelement instruction terminates. From the architecture expansion rules for vector-to-scalar connections, as described in Section 3.4.2, it follows that for the extracted element to be usable in scalar instructions afterwards, the element must be moved to index 0 in the vector. We can therefore consider the extraction operation as a vector shuffle instead, and we rewrite the extractelement instruction to a shufflevector instruction that moves the element to be extracted to the first position in the vector.

Going by the vector-to-scalar connection rules again, we find that the contents of all vector elements that are not the first element are discarded when data is being routed from a vector node to a scalar node. This would imply that it would suffice to specify for the shufflevector instruction a mask of <x, undef, undef, ...>, where x is the index of the element we wish to extract. For this we would accept any shuffle that produces x as the first element, and undef elements are essentially treated as wildcards. The result of a shuffle with this mask would be that the first element of the result vector will be the element at index x in the input vector, and all other elements of the result vector will contain undefined data, which are then immediately discarded as the result vector is explicitly cast to a scalar value.

However, the extractelement instruction is somewhat special: it is an instruction that operates on a vector, but produces a scalar as the output. Because it operates on a vector, it must be scheduled onto a vectorized FU node; as a result, this also means that the (scalar) result will be stored in the vectorized FU output node. However, a problem arises when the result of the extractelement operation is then used as a splat vector for the input of a subsequent vector operation. In this situation, the output of the extractelement operation is routed from a vector node to another vector node, and there is no guarantee that it passes through a scalar node on the way. Thus, the scalar-to-vector connection rules from Section 3.4.1 do not apply. This has the result that the output of the extractelement operation is never splatted, and the vector which arrives for the next instruction may still contain undefined data.

Because the lowering of successor instructions may result in additional splat operations appearing in later stages of code generation, it is not feasible to simply check for the presence of a successor splat instruction when lowering the extractelement instruction. We resolve this issue by always choosing for the shufflevector instruction a mask of <x, x, ..., x>, which effectively produces an explicit splat operation, but still treats the output as a scalar data type. This way, the output of the vector shuffle will explicitly contain the extracted element from the input vector in every element of the storage node, and no problems will arise when a successor instruction expects as input the splat vector of the result of the extractelement operation.

5.5 Vector shuffling

The main problem with vector shuffling on the Blocks-CGRA is its inherent vector locality. In order to perform a vector shuffle, in most cases some form of vector inter-element connectivity must be added; this can be in the form of a hard-wired shuffle pattern or the Vector Shuffle Unit. We shall introduce four methods of performing a vector shuffle: (a) passing a vector through a hard-wired shuffle; (b) reordering individual elements by means of a Vector Shuffle Unit; (c) writing the vector to memory via masked store operations; (d) chaining partial shuffles and insertions. A global comparison of these methods is shown in Table 5.1; the specifics of each shuffling method will be elaborated in the following subsections.

5.5.1 Hard-wired shuffle

One method of performing vector shuffles is by routing data through a hard-wired shuffle. As described in Section 3.5, a hard-wired shuffle is created in the architecture configuration by explicitly defining a shuffle mask on a vectorized input port. This is usually the fastest method of executing a specific vector shuffle. We add support to the compiler for recognizing such explicitly defined shuffle masks, as well as any shuffle masks that can be derived by changing the order of the data sources. Upon encountering a shufflevector instruction with the specific shuffle pattern during code generation, the compiler then replaces it with a PASS instruction. The input vector(s) are then forced to be routed to the nodes that are defined in the source list of the input with the explicitly defined shuffle mask, and the PASS operation causes the shuffle to be executed.
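
Recognition of an explicitly defined pattern, including the swapped-sources case mentioned above, can be sketched as follows (an illustrative Python model; the compiler's actual matcher also has to handle undef elements, which are omitted here):

```python
# Check whether a shufflevector mask is covered by a hard-wired pattern, either
# directly or with the two data sources swapped (sketch only). Swapping the
# sources relabels indices 0..n-1 as n..2n-1 and vice versa.
def matches_hardwired(mask, pattern, n):
    if mask == pattern:
        return True
    swapped = [m - n if m >= n else m + n for m in mask]
    return swapped == pattern

# The pattern {1, 2, 3, 4} also covers the shuffle {5, 6, 7, 0} once the two
# sources are exchanged (compare the mirrored pattern in Figure 5.8).
assert matches_hardwired([5, 6, 7, 0], [1, 2, 3, 4], 4)
assert not matches_hardwired([2, 3, 4, 1], [1, 2, 3, 4], 4)
```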

Method              | Configuration requirements                      | Application-specific | Estimated performance                                         | Estimated area/energy footprint (per cycle)
Hard-wired shuffle  | Explicit shuffle pattern(s)                     | Yes                  | Excellent: O(1); small coefficient                            | Very low
Vector Shuffle Unit | Vector Shuffle Unit                             | No                   | Good: O(n); small coefficient                                 | High (n additional ALUs + IDs)
Shuffle via memory  | Wide memory register (optional)                 | No                   | OK: O(1) with wide memory, otherwise O(n); medium coefficient | Low to medium (extra memory or register)
Partial insertion   | Any shuffle pattern(s) with enough connectivity | No                   | Poor: O(n); large coefficient                                 | Very low

Table 5.1: Comparison of vector shuffling methods for vector width n.

This method of vector shuffling, while very efficient, has a glaring drawback: the programmer must have advance knowledge of which shuffle patterns will be used in their program. However, these vector shuffles are added automatically by LLVM in the optimization layer, and it is difficult to predict which shuffles will be present whilst writing the program in C; doing so with any accuracy requires some degree of insight into how the LLVM auto-vectorizer operates. Despite this, we still support this method because of the clear benefits, the fact that it requires no additional hardware, and the fact that falling back on alternative vector shuffling methods is also an option. Moreover, the programmer may be able to prematurely halt the code generation process, output the program in LLVM IR before the Blocks back-end layer is run, observe which shuffle patterns are present, and then add these patterns as explicit hard-wired shuffles in the architecture configuration XML in order to benefit from the hard-wired shuffle method.

A hard-wired shuffle is represented in the architecture configuration as an input port with a mask property and one or more source properties. In the Blocks-CGRA, an input port can be connected to at most one output buffer; during architecture expansion, the source list is used to connect the elements of a vectorized Functional Unit to the individual elements of the source. However, the compiler uses the architecture model before architecture expansion takes place; as such, we need to resolve the issue of having multiple sources on a single input port in a different way. We solve this problem by adding a dummy node whenever a hard-wired shuffle is added to the resource graph. The (up to) two sources from the source list are connected to this dummy node with a latency of 0; the dummy node itself is then connected to the input port for which the shuffle was defined, also with a latency of 0.

Figure 5.2: Inserting a dummy node in the compiler model for hard-wired shuffles. (a) Two sources on in3; (b) only one source on in3.

This dummy node is given a special FU type, so that no instructions can be scheduled onto it through normal means; additionally, we disallow any sort of automatic routing through this node. This avoids a situation where a data value is inadvertently shuffled because it was routed through the hard-wired shuffle node on its way to a consuming FU. For example, suppose that input port 3 of an FU vmul performs a shuffle using two source vectors, valu output 0 and vlsu output 0. The compiler model for this is shown in Figure 5.2a. However, this model has two nodes connected to the same input port, which is invalid. To fix this, we insert the dummy node as in Figure 5.2b.

During instruction selection, when a shufflevector instruction is encountered for which a hard-wired shuffle exists in the architecture configuration, we replace this instruction with a placeholder instruction named HWSHUF. The instruction then passes through to the scheduling stage, in which we apply special scheduling rules for the shuffle. We forego the regular process of combined shortest-path routing to select a processing FU, as detailed in Section 2.2.2. Instead, we forcibly route the input vectors to the two sources of the dummy node, and we do this in a cycle for which the FU associated with the hard-wired shuffle is not yet reserved for an operation. In the prior example from Figure 5.2, this means we route the input vectors to valu.0 and vlsu.0 respectively, for a cycle where vmul is free. Note that we do not route the vectors to the dummy node directly, as this may cause vector 1 to be routed through vlsu.0 if that path is shorter, or vector 2 through valu.0; this would produce an incorrect shuffle. The same goes if we were to route the vectors to vmul directly; they might be passed in through an input port other than in3. Once the input vectors have been routed to the correct location, we manually reserve a PASS instruction for vmul, using the dummy node as the PASS source, and mark the vmul node as the location of the output vector, from which it can be used as an operand for following instructions. This completes the scheduling for hard-wired shuffles.

5.5.2 Vector Shuffle Unit

The Vector Shuffle Unit (SHF) is designed with the express purpose of supporting any arbitrary shuffle pattern that may be encountered during code generation. As described in Section 3.6.1, the SHF unit consists of a number of ALUs equal to the vector width of the node, but unlike regular vector nodes, each individual ALU has its own dedicated Instruction Decoder. Moreover, two of the four input ports on each ALU are reserved for connections with the two neighboring elements. This allows data to pass arbitrarily through each element of the vector.

For the LLVM shufflevector instruction, the mask of the shuffle is constant and known at compile-time. We expand the shufflevector instruction to a list of parallelized PASS instructions, which route vector elements through the shuffle network towards their destination. Let A and B be the two input vectors for the shufflevector instruction (B may be undef), and let M be the shuffle mask. Furthermore, we denote undef elements as U, and declare another special element T as a placeholder for elements that have yet “to be loaded”. The first step for generating the shuffle is choosing a pivot index p. We choose p such that 0 ≤ p < |M| ∧ M[p] ≠ U ∧ ¬∃x : M[x] ≠ U ∧ x > p. In other words, p is the index of the rightmost element in M that is not undef.

The full procedure for generating this list of instructions is formalized in Algorithm 1, but a general description is given below. For simplicity's sake, we assume that indexing of all vectors and arrays “wraps around”, e.g. M[−1] means the last element in M; moreover, all numbers used are integers. As semantics for modulo operations on negative numbers can vary, we define this formally as M[i] = M[i mod |M|] for i ≥ 0, and M[i] = M[|M| − (|i| mod |M|)] for i < 0.

The algorithm SoftwareShuffle takes as input the two vectors A and B as well as the shuffle mask M, and produces a list I of Blocks instructions that the vector shuffle is expanded to. Each element in I is a tuple consisting of the opcode as the first element, and any instruction operands as subsequent elements. The instructions are generally executed in the order that they are added to I, though this is not guaranteed and LLVM may re-order them if it detects that instructions are independent from one another. To prevent this, we may add instructions themselves as an operand to another instruction; this indicates that the new instruction uses the result of a previous instruction, or simply that the previous instruction is a predecessor of the new instruction. For example, I[−1] as an operand means that the last instruction that was added is a predecessor of the new instruction, and must be executed before the new instruction can be executed. (If I is empty, then we ignore this operand I[−1].)

An example of a vector shuffle via the SHF unit using shuffle mask {0, 8, 1, 9, 2, 10, 3, 11} is shown in Figure 5.3. Using our algorithm, the shuffle is performed in 5 cycles. An alternative example is shown in Figure 5.4 using shuffle mask {9, 9, 1, 9, 6, 13, 5, 13}. This shuffle is performed in 7 cycles using our algorithm. However, it is possible to perform this shuffle in only 4 cycles by using a hand-written sequence of PASS operations, as shown in Figure 5.5. This shows that the SoftwareShuffle algorithm does not always produce an optimal PASS sequence for any given shuffle. However, we can create hand-written PASS sequences in case we encounter such “hard” shuffles, and use the SoftwareShuffle algorithm for any other shuffle pattern.

Figure 5.3: Example of vector shuffle via SHF unit using SoftwareShuffle algorithm, with shuffle mask {0, 8, 1, 9, 2, 10, 3, 11}. The elements are loaded over five cycles (11; 3 and 10; 2 and 9; 1 and 8; 0); following all loads, R5 does not need to be rotated further to match the shuffle mask.

In the worst case, the SoftwareShuffle algorithm does a full rotation of the vector, taking N cycles, followed by a rotation of the vector to match the shuffle pattern again, taking up to ⌊N/2⌋ cycles. Therefore, the worst-case performance of the SoftwareShuffle algorithm scales at a rate of O(n) with the vector size. In the best case, for instance a simple rotation over 1 element, the performance is constant regardless of the vector size, i.e. it scales as Θ(1).

Line 4–12: We initialize a pattern vector P0 with a copy of the shuffle pattern, M. Also, we initialize a result vector R with a copy of M, but replace every element that is not U by a T element. Moreover, we set a rotation offset o for P and R, such that, when M is right-rotated o + 1 times, the pivot element will line up with its vector index in A or B.

Line 13–14: We initialize the list of instructions, I, to an empty list. As an iteration counter as well as instruction counter, we initialize a variable t to 0.

Line 15–17: We repeat the while-loop until all T elements are removed from R; in other words, until all elements that were not undef have been loaded. At the start of every iteration we initialize a list of loads L to an empty list; this list will be filled with the PASS source for every ALU in the SHF unit.

Line 20–22: For every iteration, as the first step, we rotate the shuffle pattern P to the right by 1.

Line 23–35: For each element in the rotated shuffle pattern, we select a corresponding element for the new result vector Rt as well as a PASS source for every ALU, in the following order:

• If the element in the rotated shuffle pattern lines up with a right-neighboring element from the previous result vector Rt−1, then we load from the right-neighbor (input port 3). In other words, we propagate data to the left.

Figure 5.4: Example of vector shuffle via SHF unit using SoftwareShuffle algorithm, with shuffle mask {9, 9, 1, 9, 6, 13, 5, 13}. The loads take five cycles (9 and 13; 1 and 5; none; 9; 6); following all loads, R5 needs to be left-rotated twice to match the shuffle mask.

Figure 5.5: Example of vector shuffle via SHF unit using handwritten PASS sequence, with shuffle mask {9, 9, 1, 9, 6, 13, 5, 13}. The sequence loads 6, 9 and 13 in the first cycle, 1 in the second and 5 in the third, and completes the arrangement in the fourth cycle.

Algorithm 1 Generate vector shuffles on a SHF
 1: procedure SoftwareShuffle(A, B, M)
 2:     o ← (M[p] − p) mod |A| − 1                ▷ Initialize rotation offset
 3:     P0 ← ∅
 4:     R0 ← ∅
 5:     for i ← 0 to |M| − 1 do
 6:         P0[o + i] ← M[i]                      ▷ Initialize P with shuffle pattern
 7:         if M[i] = U then                      ▷ Initialize R with Us (undef) and Ts (todo)
 8:             R0[o + i] ← U
 9:         else
10:             R0[o + i] ← T
11:         end if
12:     end for
13:     t ← 0
14:     I ← ∅
15:     while T ∈ Rt do                           ▷ Repeat until all Ts are gone
16:         t ← t + 1
17:         Lt ← ∅
18:         o ← o + 1
19:         Pt ← ∅
20:         Rt ← ∅
21:         for i ← 0 to |M| − 1 do
22:             Pt[i] ← Pt−1[i − 1]               ▷ Right-rotate P by 1
23:             if Pt[i] = Rt−1[i + 1] then       ▷ Load from right-neighbor
24:                 Lt ← Lt ∪ {(i, in3)}
25:                 Rt[i] ← Rt−1[i + 1]
26:             else if Pt[i] = i then            ▷ Load from A
27:                 Lt ← Lt ∪ {(i, in0)}
28:                 Rt[i] ← i
29:             else if Pt[i] = |A| + i then      ▷ Load from B
30:                 Lt ← Lt ∪ {(i, in1)}
31:                 Rt[i] ← |A| + i
32:             else                              ▷ Load from left-neighbor
33:                 Lt ← Lt ∪ {(i, in2)}
34:                 Rt[i] ← Rt−1[i − 1]
35:             end if
36:         end for
37:         I ← I ++ {(PASS, Lt, I[−1])}          ▷ Add instruction
38:     end while

Algorithm 1 Generate vector shuffles on a SHF (continued)
39:     if o ≤ |M| − (o mod |M|) then             ▷ Check shortest rotation
40:         δ ← −1                                ▷ Left rotate
41:         s ← in3
42:     else
43:         δ ← 1                                 ▷ Right rotate
44:         s ← in2
45:     end if
46:     while Pt ≠ M do
47:         t ← t + 1
48:         Lt ← ∅
49:         for i ← 0 to |M| − 1 do
50:             Pt[i] ← Pt−1[i + δ]               ▷ Rotate P
51:             Lt ← Lt ∪ {(i, s)}
52:         end for
53:         I ← I ++ {(PASS, Lt)}                 ▷ Add instruction
54:     end while
55:     return I
56: end procedure

• If the element in the rotated shuffle pattern lines up with an element in A, then we load from A (input port 0). For instance, if the rotated shuffle pattern is <0, 5, 6, 7>, this means we can load the first element (0) from A in this iteration.

• Similarly, if the element in the rotated shuffle pattern lines up with an element in B, then we load from B (input port 1).

• Otherwise, we load from the left-neighbor (input port 2). In other words, we propagate data to the right.

Each PASS source is appended to L as a tuple of the vector index of the ALU that performs the PASS, and the input port from which data is PASSed.

Line 37: Once PASS sources for every element have been determined, a parallel PASS instruction is defined and appended to I, having as operand the list of PASS sources for that cycle.

Line 39–45: Once all T elements for the result vector R have been loaded from A and B, we check the current rotation offset of the shuffle mask P, and determine a rotation direction that would result in the shortest rotation of P back to match M.

Line 46–54: Finally, we generate more instructions that perform the rotation of P back to match M. In many cases, however, P already matches M at the end of the first while-loop.
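
The lane-level behaviour that these PASS instructions rely on can be modelled in a few lines. The sketch below is our own simplified simulator of the SHF network, assuming the port convention from the list above (in0 = A, in1 = B, in2 = left neighbor, in3 = right neighbor) and assuming that the neighbor links wrap around, as the rotation steps in Figures 5.3 to 5.5 suggest; it is not the compiler or hardware model. As a sanity check, a left rotation by one element takes two PASS cycles, which matches the two-cycle SHF shuffle reported for the Convolution benchmark in Chapter 6:

```python
# Simplified per-cycle model of the SHF lane network (illustrative only).
def shf_step(a, b, state, ports):
    n = len(a)
    out = []
    for i, port in enumerate(ports):
        if port == "in0":                  # load lane i of input vector A
            out.append(a[i])
        elif port == "in1":                # load lane i of input vector B
            out.append(b[i])
        elif port == "in2":                # take from left neighbor (propagate right)
            out.append(state[(i - 1) % n])
        else:                              # "in3": take from right neighbor (propagate left)
            out.append(state[(i + 1) % n])
    return out

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [8, 9, 10, 11, 12, 13, 14, 15]
state = shf_step(a, b, [None] * 8, ["in0"] * 8)  # cycle 1: every lane passes A
state = shf_step(a, b, state, ["in3"] * 8)       # cycle 2: every lane passes its right neighbor
assert state == [1, 2, 3, 4, 5, 6, 7, 0]         # left rotation by one element
```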

5.5.3 Shuffle via memory

If there are no hard-wired shuffle patterns or SHF unit available in the system, then there are still other methods available that allow us to shuffle vectors. Due to vector locality, a lack of either of these two elements prohibits moving data around in vectors directly. However, we can bypass this restriction by writing the vector to a memory location, manipulating the vector in memory, then loading it back out. For this we can use the SVX instruction as defined in Section 4.2.1. This instruction lets us change the vector indices of a vectorized LSU at runtime, as well as selectively disable reads/writes for specific LSUs. We can use this functionality to write partial vectors to memory and/or reorder the vector as it is being written to or read from memory.

The core idea of this method is to derive a store mask from the vector shuffle mask M. The shuffle mask M is a load mask; it specifies the locations in A and B, the vectors to be shuffled, from which elements are loaded for the output vector. For both A and B, we derive store masks SA and SB, which instead specify the locations in the output vector to which elements will be written. For any elements in A or B that do not appear in the output vector, we specify −1 as the location. The store mask SA is then committed to the vectorized LSU by means of the SVX instruction, and the associated input vector A is written to memory. As any unnecessary elements have an index of −1, this has the effect of masking out the memory store operation for those elements; thus, only the elements in A that ought to be present in the output vector are actually written to memory. We perform the same process for the other input vector (if applicable), setting vector indices and writing the vector to the very same memory address. This completes the shuffle of the output vector in memory. Next, we use the RVX instruction to reset the vector indices of the vectorized LSU, and read the memory back out from the same memory address. As the vector indices are now in order, the (shuffled) output vector will be read in-order from memory.

An alternative method, which we shall name “shuffle-load” (as opposed to “shuffle-store”, which has been described above), would be to store A and B in consecutive memory locations using the default store mask {0, 1, 2, ···}, then to use the SVX instruction in conjunction with the load mask (i.e. the mask from the shufflevector operation) and load the shuffled vector back from memory. When two vectors are being shuffled together, this saves one cycle (store, store, SVX, load, RVX for shuffle-load, versus SVX, store, SVX, store, RVX, load for shuffle-store). If a single vector is being shuffled, performance is the same. However, since both vectors are stored to memory in separate locations, rather than to the same location, the amount of memory required for shuffle-load is double that of shuffle-store. Moreover, the shuffle-store method maps a little more cleanly to LLVM's shufflevector operation, as the result value is produced in the last cycle of the sequence of instructions, whereas for shuffle-load the result value is produced in the second-to-last cycle; the extra RVX is needed to put the vector indices of the LSUs back in a correct state for normal operation, but the RVX instruction itself does not produce an output.
For these reasons, we use the shuffle-store method for this project, though it may be interesting to also look into using shuffle-load in future research. The full procedure for generating the list of instructions for shuffle-store is

formalized in Algorithm 2. The algorithm takes as input the two vectors A and B as well as the shuffle mask M. In addition, a memory address O is specified, which has been reserved/allocated for performing the shuffle in. As output, a list I of Blocks instructions is produced, to which the vector shuffle is expanded. Each element in I is a tuple consisting of the opcode as the first element, and any instruction parameters as subsequent elements.

Blocks currently uses a 32-bit DTL memory for global memory. However, vectors are frequently larger than 32 bits. As such, storing them to, or reading them from, memory would incur stall cycles in the LSUs; the entire system is stalled until the memory operation finishes. We circumvent this problem by adding a dedicated memory-mapped register to the system, with the same bus width as the vector(s) to be shuffled. Following adjustments made to the Blocks-CGRA's memory interface, the Blocks-CGRA is now able to perform the memory operation for the entire vector in one go, i.e. without incurring extra stall cycles for every individual element; the memory operation for the vector now has the same latency and stall cycles as a normal scalar memory operation.

Algorithm 2 Generate vector shuffle via memory
 1: procedure MemoryShuffle(A, B, M, O)
 2:     SA ← ∅
 3:     SB ← ∅
 4:     for i ← 0 to |A| − 1 do                   ▷ Initialize store mask for A
 5:         SA[i] ← −1
 6:     end for
 7:     for i ← 0 to |B| − 1 do                   ▷ Initialize store mask for B
 8:         SB[i] ← −1
 9:     end for
10:     for i ← 0 to |M| − 1 do                   ▷ Generate store mask for A
11:         if 0 ≤ M[i] < |A| then
12:             SA[M[i]] ← i
13:         end if
14:     end for
15:     for i ← 0 to |M| − 1 do                   ▷ Generate store mask for B
16:         if |A| ≤ M[i] < |A| + |B| then
17:             SB[M[i] − |A|] ← i
18:         end if
19:     end for
20:     I ← ∅                                     ▷ Create and add the instructions
21:     I ← I ++ {(SVX, SA)}
22:     I ← I ++ {(store, O, A, I[−1])}
23:     I ← I ++ {(SVX, SB, I[−1])}
24:     I ← I ++ {(store, O, B, I[−1])}
25:     I ← I ++ {(RVX, I[−1])}
26:     I ← I ++ {(load, O, I[−1])}
27:     return I
28: end procedure

Line 2–9: We initialize the store masks, SA and SB, to {−1, −1, −1, ...}.

Figure 5.6: Model of a shuffle pattern {1, 2, 3, 4} with all vector widths = 4.

Line 10–14: For each mask element in the load mask M, if the mask element points to an element in A, we set the output vector destination index for the element in the store mask SA.

Line 15–19: The same process is repeated for B and its store mask SB.

Line 20–26: Once the store masks SA and SB have been derived from M, we generate the list of instructions I that the vector shuffle will be expanded to.
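
To make the store-mask derivation concrete, the following is a small executable model of Algorithm 2's mask construction and of the resulting in-memory shuffle (a simplified Python sketch in which the wide memory register is a plain list; not the compiler implementation):

```python
# Derive the store masks SA and SB from the load mask M (mirrors Algorithm 2).
# An index of -1 in a store mask means "masked out: this LSU does not write".
def store_masks(mask, n_a, n_b):
    sa, sb = [-1] * n_a, [-1] * n_b
    for out_pos, src in enumerate(mask):
        if 0 <= src < n_a:
            sa[src] = out_pos            # element src of A lands at out_pos
        elif n_a <= src < n_a + n_b:
            sb[src - n_a] = out_pos      # element src-|A| of B lands at out_pos
    return sa, sb

def shuffle_store(a, b, mask):
    sa, sb = store_masks(mask, len(a), len(b))
    mem = [None] * len(mask)             # the wide memory register
    for i, pos in enumerate(sa):         # SVX(SA) followed by a masked store of A
        if pos >= 0:
            mem[pos] = a[i]
    for i, pos in enumerate(sb):         # SVX(SB) followed by a masked store of B
        if pos >= 0:
            mem[pos] = b[i]
    return mem                           # RVX, then an in-order load

a, b = [0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]
mask = [0, 8, 1, 9, 2, 10, 3, 11]
assert shuffle_store(a, b, mask) == [a[m] if m < 8 else b[m - 8] for m in mask]
```

The final assertion checks that reading the register back in order yields the same result as LLVM's shufflevector semantics for the same mask.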

5.5.4 Partial insertion

The previous method relies on the fact that there is a wide memory register available in the system in which a memory shuffle can be performed. If no such memory register exists, then a memory shuffle will be prohibitively slow. In the event that there is no hard-wired shuffle pattern available for the shuffle mask that we want to use, and there is also no SHF unit or wide memory register in the system, there is yet one more method we can use to perform a vector shuffle, so long as there is at least some hard-wired shuffle pattern in the system, even if it does not match our desired shuffle mask exactly.

This method is somewhat of an extension of the hard-wired shuffle method. It relies on producing more shuffle patterns by chaining together repeated applications of some other shuffle pattern. Upon parsing the architecture configuration, the compiler will generate derived shuffle patterns; these are shuffle patterns obtained by applying hard-wired shuffle patterns present in the architecture more than once.

For every shuffle pattern shuf, we record three properties in particular: shuf.maskIn1, shuf.maskIn2 and shuf.maskOut. These represent the shuffle masks of the two input vectors, and the shuffle mask that is produced for the result vector. Non-derived shuffles, i.e. hard-wired shuffles that directly exist in the system, have their maskIn1 and maskIn2 set to the trivial pattern {0, 1, 2, ... |A|−1} for the first vector or {|A|, |A|+1, |A|+2, ... |A|+|B|−1} for the second vector; the maskOut property matches the one that is defined as the mask in the architecture configuration. A visual representation of this model for a shuffle pattern <1, 2, 3, 4> is shown in Figure 5.6. For derived shuffles, at least one of maskIn1 or maskIn2 is set to a mask equal to the maskOut of the preceding shuffle, and so forth. For example: by applying the shuffle {1, 2, 3, 4} two times, where one of the inputs is set to the result of the previous iteration, we could obtain the shuffle patterns {2, 3, 4, 0}, {2, 3, 4, 4}, {1, 2, 3, 1}, {5, 6, 7, 1} or {2, 3, 4, 1}. This is shown in Figure 5.7.

Figure 5.7: Derived shuffle patterns from applying pattern {1, 2, 3, 4} twice: R2A = {2, 3, 4, 0}, R2B = {2, 3, 4, 4}, R2C = {1, 2, 3, 1}, R2D = {5, 6, 7, 1} and R2E = {2, 3, 4, 1}.

Figure 5.8: Mirrored versions of the shuffle pattern {1, 2, 3, 4}: applying the pattern with both inputs set to A yields {1, 2, 3, 0}; applying it with A and B swapped yields {5, 6, 7, 0}.

All five of these results could then be iterated on again, and potentially even combined with results from previous iterations, to obtain yet more shuffle patterns; potentially until no more shuffle patterns can be found. The compiler keeps a list of every shuffle pattern found this way; if a shuffle pattern is found more than once, the version that requires the least number of iterations is kept. We discard any shuffle patterns that produce an output mask where every element is |A| or higher, e.g. output mask {4, 5, 6, 7} for |A| = 4, as this represents a shufflevector operation where the first operand is undef, which is invalid in LLVM.

In addition to the basic shuffle {1, 2, 3, 4} on A and B, we can also obtain a shuffle {1, 2, 3, 0} by setting both inputs to A, or {5, 6, 7, 0} by swapping the positions of A and B, and then derive even more shuffle patterns from this. Note that setting both inputs to B will always produce an output mask where every element is |A| or higher. The two mirrored versions of our sample pattern {1, 2, 3, 4} are shown in Figure 5.8. In addition to this, the “trivial” shuffle {0, 1, 2, 3} is also added to the list, requiring 0 PASS operations.

Clearly, this search space blows up very quickly; a single hard-wired shuffle pattern can easily produce tens of thousands of derived shuffle patterns. As such, in practice, we must limit this search space in some way. For our test cases, we enforce a limit of N repeated applications of the same shuffle pattern, where N is the width of the result vector. Moreover, we only allow the repeated shuffle where both inputs are set to the result of the previous iteration (in Figure 5.7, this would be the shuffle that produces R2E). Of course, this is a trade-off; by searching more of the search space, more optimal shuffle patterns may be found, but for our benchmarks these limitations are able to find all the derived shuffle patterns we need without significantly slowing down the compilation process.

When a vector shuffle is requested using a pattern that is not in the list of (derived) shuffle patterns that have been found by the compiler, the vector is shuffled by performing several “partial” shuffles and individually inserting elements from those vectors into a base vector. The term “partial” shuffle stems from the fact that we only care about a subset of the elements in the vector, namely the elements that will be inserted into the base vector. The full procedure is shown in Algorithm 3, with a general description given below. We begin by calculating similarity values for the input vectors A and B by computing the number of elements in A and B that are in the same position as they would be in the shuffle vector. The input vector with the highest similarity is chosen as the base vector. We then form a remainder mask and determine which elements are left to insert. Next, partial shuffles on A and B are performed as necessary and the remaining elements are inserted, until the output vector is complete.
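
The derived patterns above (and those in Figures 5.7 and 5.8) follow from a simple composition rule over masks; the sketch below is our reconstruction of that rule, not the compiler's code: the output mask of a chained shuffle is obtained by indexing the concatenation of the two input masks with the hard-wired pattern.

```python
# Compose a hard-wired shuffle pattern s with two inputs whose contents are
# described by masks m1 and m2 over the original vectors A and B (sketch only).
def compose(s, m1, m2):
    combined = m1 + m2
    return [combined[i] for i in s]

A_MASK = [0, 1, 2, 3]          # trivial mask for A
B_MASK = [4, 5, 6, 7]          # trivial mask for B
S = [1, 2, 3, 4]               # the hard-wired shuffle pattern from the example

r1 = compose(S, A_MASK, B_MASK)                  # first application: {1, 2, 3, 4}
assert compose(S, r1, A_MASK) == [2, 3, 4, 0]    # R2A in Figure 5.7
assert compose(S, r1, r1) == [2, 3, 4, 1]        # R2E in Figure 5.7
```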

Figure 5.9: Used search space for hard-wired shuffle pattern {1, 2, 3, 4} (top left): R1 = {1, 2, 3, 4}, R2E = {2, 3, 4, 1}, R3EE = {3, 4, 1, 2} and R4EEE = {4, 1, 2, 3}.

As an example, suppose we would like to shuffle two vectors A and B, both having width 4, with mask {4, 3, 2, 1}, but only a shuffle pattern {1, 2, 3, 4} exists in the system. A = {0, 1, 2, 3} and B = {4, 5, 6, 7}; both share one element with the desired output vector {4, 3, 2, 1}: for A, the 2 is already in the right place, and for B, the 4 is in the right place. We choose A as the base vector, which leaves the 4, 3 and 1 to be inserted. Using the search space shown in Figure 5.9, we can perform this shuffle as follows.

• Take A = {0, 1, 2, 3} as base vector.

• Insert element 4 from B (trivial shuffle {0, 1, 2, 3} on B).

• Perform shuffle {2, 3, 4, 1} (R2E), and insert elements 1 and 3.

A line-by-line description of the algorithm PartialInsertionShuffle (Algorithm 3) follows:

Line 2–11: First, we compute similarity values sA and sB for A and B respectively. If an element in A or B is already in the correct place, or the corresponding element in the output vector is undef (U), then we increase similarity by 1.

Line 12–21: In this step, we choose the base vector V, and fill R with the indices present in the base vector (e.g. {0, 1, 2, 3, ...} for A); this is done mostly for convenience for the next step.

Line 22–28: We form the remainder mask R in this step. Each element in R is set to U (undef) if the element in M is also undef, or the element in the base vector is already in the correct place. Otherwise, we fill the remainder mask R with the mask index from M.

Algorithm 3 Generate vector shuffle by partial shuffles and inserts
 1: procedure PartialInsertionShuffle(A, B, M)
 2:     sA ← 0
 3:     sB ← 0
 4:     for i ← 0 to |M| − 1 do                   ▷ Compute similarity counts
 5:         if M[i] = U ∨ M[i] = i then           ▷ Similarity for A
 6:             sA ← sA + 1
 7:         end if
 8:         if M[i] = U ∨ M[i] = i + |A| then     ▷ Similarity for B
 9:             sB ← sB + 1
10:         end if
11:     end for
12:     R ← ∅
13:     for i ← 0 to |M| − 1 do                   ▷ Init. R with indices, choose base V
14:         if sA ≥ sB then
15:             V ← A                             ▷ Set V = A (only needed once)
16:             R[i] ← i                          ▷ Initialize R for A
17:         else
18:             V ← B                             ▷ Set V = B (only needed once)
19:             R[i] ← i + |A|                    ▷ Initialize R for B
20:         end if
21:     end for
22:     for i ← 0 to |M| − 1 do                   ▷ Form remainder mask R
23:         if M[i] = U ∨ M[i] = R[i] then
24:             R[i] ← U                          ▷ Already in V or not needed
25:         else
26:             R[i] ← M[i]                       ▷ Still needs to be read
27:         end if
28:     end for
29:     I ← ∅
30:     while ∃x ∈ R : x ≠ U do                   ▷ Repeat until all undef in R
31:         shuf ← find best shuffle(R)           ▷ Elaborated below
32:         T ← PassTree(I, shuf, A, B)           ▷ See Algorithm 4
33:         I ← I ++ {T}
34:         for i ← 0 to |M| − 1 do               ▷ Update R with new shuffle result
35:             if R[i] ≠ U ∧ R[i] = shuf.maskOut[i] then
36:                 V ← (insertelement, V, T, i)  ▷ Insert into previous V
37:                 I ← I ++ {V}                  ▷ Add insert to list of instructions
38:                 R[i] ← U                      ▷ Mark as done
39:             end if
40:         end for
41:     end while
42:     return I
43: end procedure

Line 29–33: In this while-loop, we finish the rest of the output vector. The process continues until all elements in the remainder mask are undef. In every iteration, we first find a (derived) shuffle pattern that best matches the remainder mask R, using the function find best shuffle(R).

Line 34–40: Finally, for the partially shuffled vector that was obtained from the previous step, we insert the elements that should be present in the output vector for the full shuffle. We then mark these elements as undef in the remainder mask R, indicating that we do not need these elements anymore. Once all indices in R have been replaced by U, the full shuffle is finished.
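
These first steps can be checked against the worked example above (an executable Python sketch, with U represented as None; not the compiler code):

```python
# Similarity counts and remainder mask for mask {4, 3, 2, 1} with |A| = |B| = 4.
U = None

def similarities(mask, n):
    s_a = sum(1 for i, m in enumerate(mask) if m is U or m == i)
    s_b = sum(1 for i, m in enumerate(mask) if m is U or m == i + n)
    return s_a, s_b

def remainder_mask(mask, n, base_is_a):
    base = [i if base_is_a else i + n for i in range(len(mask))]
    return [U if m is U or m == base[i] else m for i, m in enumerate(mask)]

mask = [4, 3, 2, 1]
s_a, s_b = similarities(mask, 4)
assert (s_a, s_b) == (1, 1)                                  # A and B each match one element
assert remainder_mask(mask, 4, s_a >= s_b) == [4, 3, U, 1]   # 4, 3 and 1 remain to be inserted
```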

The find best shuffle(R) function used in the PartialInsertionShuffle algorithm is a helper function which checks all (derived) shuffle patterns present in the system and, for each pattern, checks how many elements are not undef and match an element in R. If no matching shuffle can be found, the process terminates with an error; in this case, there is not enough connectivity present in the system to be able to shuffle the vector with pattern M, or the search space for derived shuffles was not large enough. Once the best shuffle is found, a tree of HWSHUF operations is generated to perform the shuffle, and these operations are also appended to the list of instructions.

The PartialInsertionShuffle algorithm makes use of a secondary algorithm, PassTree; this algorithm is shown in Algorithm 4. The PassTree algorithm recursively forms a tree of PASS operations (technically HWSHUF placeholders, which are lowered to PASSes in a later stage; see Section 5.5.1). This tree represents a tree of dependencies; each node in the tree represents a single application of a shuffle pattern, with its child nodes representing the inputs; the leaves of the tree represent a shuffle on only A and/or B. For each of the input masks of the hardware shuffle, the algorithm checks whether this mask matches the trivial mask of input vector A or B; if so, these are used as input operands directly. Otherwise, the function is called recursively to obtain the PASS tree for the input mask, which is then used as an operand for the current iteration. The instructions produced during this are added to the list of instructions I as normal, and will generally be executed in that order, though LLVM may re-order the list taking into account the dependencies of each operation. For example, the derived shuffle {2, 3, 4, 1} from the example produces the PASS-tree shown in Figure 5.10. LLVM additionally performs common sub-expression elimination on this tree, the result of which is also shown in Figure 5.10.
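
Continuing that example, the remainder mask {4, 3, U, 1} can be scored against the derived patterns of Figure 5.7. The sketch below is our paraphrase of the matching criterion described above; the compiler's actual scoring and tie-breaking may differ:

```python
# Score a derived pattern's output mask against the remainder mask: count the
# positions that are not undef and already hold the required element.
U = None

def score(mask_out, remainder):
    return sum(1 for r, m in zip(remainder, mask_out) if r is not U and r == m)

def find_best_shuffle(patterns, remainder):
    best = max(patterns, key=lambda p: score(p, remainder))
    if score(best, remainder) == 0:
        raise RuntimeError("not enough connectivity to realize this shuffle")
    return best

patterns = [[2, 3, 4, 0], [2, 3, 4, 4], [1, 2, 3, 1], [5, 6, 7, 1], [2, 3, 4, 1]]
assert find_best_shuffle(patterns, [4, 3, U, 1]) == [2, 3, 4, 1]   # R2E matches two positions
```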

Figure 5.10: PASS-tree for derived shuffle {2, 3, 4, 1}. Left: original tree as produced by PassTree algorithm; right: tree after LLVM performs common sub-expression elimination.

Algorithm 4 Generate PASS tree for chain shuffle
 1: procedure PassTree(I, shuf, A, B)
 2:     if shuf.maskIn1 = U then                  ▷ Check source of input 1
 3:         o1 ← U                                ▷ Input 1 not needed
 4:     else if ∀0 ≤ i < |shuf.maskIn1| : shuf.maskIn1[i] = i then
 5:         o1 ← A                                ▷ Input 1 is A
 6:     else if ∀0 ≤ i < |shuf.maskIn1| : shuf.maskIn1[i] = |A| + i then
 7:         o1 ← B                                ▷ Input 1 is B
 8:     else                                      ▷ Input is result of previous shuffle
 9:         o1 ← PassTree(I, find best shuffle(shuf.maskIn1), A, B)
10:         I ← I ++ {o1}                         ▷ Add previous shuffle to instructions
11:     end if
12:     if shuf.maskIn2 = U then                  ▷ Do the same for input 2
13:         o2 ← U
14:     else if shuf.maskIn1 = shuf.maskIn2 then
15:         o2 ← o1
16:     else if ∀0 ≤ i < |shuf.maskIn2| : shuf.maskIn2[i] = i then
17:         o2 ← A
18:     else if ∀0 ≤ i < |shuf.maskIn2| : shuf.maskIn2[i] = |A| + i then
19:         o2 ← B
20:     else
21:         o2 ← PassTree(I, find best shuffle(shuf.maskIn2), A, B)
22:         I ← I ++ {o2}
23:     end if
24:     return (HWSHUF, o1, o2, shuf.maskOut)     ▷ Create PASS instruction
25: end procedure

Chapter 6

Evaluation

To evaluate the results of our compiler vectorization efforts, we execute a number of benchmarks and measure the resulting runtime, area size and overall energy usage. This chapter shall describe the benchmarks that are run, how these measurements are performed, and evaluate and interpret the results. Section 6.1 describes the testing environment, including which tools have been used to perform compilation, simulation and measurements. This section also details the benchmark applications that have been run and their corresponding Blocks configurations. Section 6.2 gives the measurements related to runtimes of the benchmarks, and evaluates these results. Section 6.3 describes the results that have been obtained related to the used area size of the Blocks configurations, and evaluates these. Finally, in Section 6.4 we give the energy usage measurements and evaluate the results.

6.1 Testing environment

The current Blocks-CGRA compiler is based on LLVM version 4, which is somewhat outdated; at the time of writing, the most recently released version of LLVM is version 8. In particular, this LLVM version has poor support for variable vector widths. As a result, we arbitrarily choose a fixed vector width of 8 for all operations. We have implemented all Blocks-CGRA hardware extensions and Blocks compiler modifications as detailed in the preceding chapters. Moreover, we have implemented support for all vector operations that have scalar equivalents; the vector-specific operations shufflevector, insertelement and extractelement; as well as vector building, for a vector width of 8. All benchmarks are written as C programs which are converted to LLVM IR using Clang. However, the version of Clang that ships with LLVM 4 has a rather basic vectorizer, which has been greatly improved in later Clang versions. As such, wherever Clang from LLVM 4 is unable to auto-vectorize a benchmark, we instead use the Clang version from the latest LLVM release (LLVM 8) and manually convert its output back into a format compatible with LLVM 4. In the C program files, we use #pragma compiler directives to force vectorization on all loops. Following translation of the C benchmarks to LLVM IR, we compile these benchmarks to PASM format with our Blocks-CGRA compiler, which has been

implemented as an LLVM back-end. We then perform a logic synthesis on the Blocks configurations of each benchmark, translating the behavioral RTL description of the Blocks base architecture, included Functional Units and their connections to an optimized logic gates design. This is done using Cadence on a commercial 40 nm library, including memory macros. We then simulate our PASM-assembled benchmark on the synthesized Blocks-CGRA instance with a clock frequency of 100 MHz.

Currently, the Blocks-CGRA compiler has no support for function calls; furthermore, there is no stack or any form of spilling code available. Additionally, Instruction Memories in the current Blocks-CGRA are limited to 255 instructions. This, unfortunately, greatly limits the complexity of benchmarks that we can currently test with the compiler. We select 5 simple benchmarks, each chosen to test a different facet or archetype of vectorized programs. Each benchmark is run using 5 different configurations: 1 scalar and 4 vector. This gives a total of 25 compile-synthesize-simulate cycles that each produce results for total runtime, energy usage and area size. The tested benchmarks are as follows:

• MemoryCopy: Memory copy on a 10,000-byte array. This is generally the simplest form of vectorized program.

• Binarization: Binarization of a 100 × 100 pixel grayscale image (1 byte per pixel). This benchmark does a light amount of individual processing on each element.

• Sum: Summation of an 8,000-integer array. This tests vector reduction on a large data set, which also uses vector shuffles.

• Convolution: 2D convolution on a 100 × 100 pixel grayscale image (1 byte per pixel). The size of the convolution window is 2 × 2, chosen because Clang is unable to vectorize larger windows. The convolution is done in 2 stages: first in the horizontal direction, then in the vertical direction, with the intermediate results stored in a buffer in global memory. This benchmark requires accessing neighboring elements and thus makes heavy use of vector shuffles.

• MatVecMul: Matrix-vector multiplication of a 64 × 64 integer matrix and a 64-integer vector. This is an operation for which vectorized programs are very commonly used.

The 5 configurations used are listed in Table 6.1. These 5 configurations each use one of the FU layouts shown in Figure 6.1. We separate each vectorized configuration into two parts: a scalar part consisting only of scalar units (on the right), and a vector part consisting only of vector units with a width of 8 (on the left). For the scalar part of each vector configuration, we use a Multiplexed Register File (RFM) as defined in Section 3.6.2 as a substitute for a regular Register File. This RFM contains 2 internal RFs, for a total of 32 registers. The scalar configuration uses a single RF. All ALUs not part of the RFM use a buffered output port. A wide memory register or hard-wired shuffle patterns are added on top of the FU layout; hard-wired shuffle patterns are added on input ports that are unused in the base layout. Each vector configuration uses a different method of vector shuffling: Vector-SHF uses a Vector Shuffle Unit (SHF) to do the shuffle

Design         | Layout      | Wide memory | Hard-wired shuffles
Scalar         | Figure 6.1a | No          | No
Vector-SHF     | Figure 6.1b | No          | No
Vector-HWSHUF  | Figure 6.1c | No          | Application-specific
Vector-MemShuf | Figure 6.1c | Yes         | No
Vector-Chain   | Figure 6.1d | No          | Only <1, 2, 3, 4, 5, 6, 7, 8>

Table 6.1: Blocks-CGRA configurations used.

Figure 6.1: Layouts used for Blocks-CGRA configurations: (a) Scalar FU layout; (b) Vector FU layout (SHF); (c) Vector FU layout (HWSHUF/MemShuf); (d) Vector FU layout (chain). Vector-to-vector connections are denoted with thick arrows.

using the SoftwareShuffle algorithm from Section 5.5.2; Vector-HWSHUF uses application-specific hard-wired shuffle pattern(s) that directly match the masks of shufflevector operations in the IR; Vector-MemShuf performs a shuffle via writing to and reading from a wide memory register, using the algorithm from Section 5.5.3; finally, Vector-Chain uses the partial insertion algorithm from Section 5.5.4 with a generic hard-wired shuffle <1, 2, 3, 4, 5, 6, 7, 8>, which does not match any masks of shufflevector operations in the IR directly. It has been verified through manual inspection of the generated PASM code that the correct shuffle method is used for each configuration.

The vector FU layouts in Figure 6.1c and Figure 6.1d are very similar. However, note that the layout in Figure 6.1d has one extra connection going from the vLSU node to the vALU node. This connection was added because the compiler was unable to schedule some benchmarks onto the Figure 6.1c layout after hard-wired shuffles were added, due to routing conflicts when performing repeated shuffles for the partial insertion shuffle algorithm. On top of that, extra connections are added to unused input ports, which represent (application-specific) hard-wired shuffles. This extra vLSU-to-vALU connection is not present in the other layouts, as the vALU instead has an extra incoming connection from the SHF unit or from a hard-wired shuffle.

6.2 Runtime

For all 25 combinations of benchmark and Blocks configuration, we compile the benchmark C code to PASM and run this through a simulator on the behavioral RTL logic files for the Blocks-CGRA hardware. From this we obtain a cycle-count, which represents the runtime of the application. The collected runtimes and associated speedups, normalized to the scalar version of the benchmark, are shown in Table 6.2. Graphical representations of the runtimes for each benchmark are also shown in Figure 6.2 for the MemoryCopy benchmark, Figure 6.3 for the Binarization benchmark, Figure 6.4 for the Sum benchmark, Figure 6.5 for the Convolution benchmark, and Figure 6.6 for the MatVecMul benchmark. As the cycle-counts for each benchmark are not directly comparable, these are formatted as separate graphs. Finally, normalized speedup results are shown graphically in Figure 6.7 for the total cycle-count, and in Figure 6.8 for the non-stall cycle-count.

For almost all test cases, we observe that a significant speedup is achieved when we only count non-stall cycles. For the Convolution and MatVecMul benchmarks, this speedup mostly approaches the vector width, which is 8. For MemoryCopy, Binarization and Sum, the speedup even exceeds the vector width. In other words, by vectorizing the scalar program with a vector width of 8, the program is made more than 8 times faster. However, this can be explained by the fact that the scalar part of the Blocks configuration is essentially running double-duty when using the Scalar configuration. In the scalar version of the benchmark, the scalar Functional Units are used not only for computing the actual results but also for managing loop conditions. When the program is vectorized, the computation is off-loaded to the vector part of the architecture (i.e. the vectorized Functional Units), and the scalar units only need to manage loop conditions. Because of this, in addition to processing 8 times as many data elements during each loop iteration, the loop body can also be made smaller, resulting in a speedup that exceeds 8×.

Figure 6.2: Number of cycles runtime for MemoryCopy.

Two test cases where we see a notably lower speedup are the Convolution benchmark on the Vector-MemShuf and Vector-Chain configurations. We can explain this by the fact that the Convolution benchmark makes heavy use of vector shuffles inside the loop body. As noted in Table 5.1 (Section 5.5), the memory shuffle and partial insertion algorithms for vector shuffling were estimated to have OK to poor performance, whereas the hard-wired shuffle and Vector Shuffle Unit (SHF) methods were estimated to have good to excellent performance, based on the algorithm complexity and coefficients. This observation, then, is in line with our results that Vector-MemShuf and Vector-Chain have a worse performance for this benchmark than Vector-SHF and Vector-HWSHUF. We also find that the Vector-SHF and Vector-HWSHUF configurations produce the same cycle-count for this benchmark. As the shuffle pattern for Convolution is a simple left-rotation by 1 element, the shuffle lowers to 2 cycles using the SHF method and 1 cycle using the hard-wired shuffle method. However, routing the vector to the FU that will perform the shuffle also plays a role, and the Vector-HWSHUF configuration in this case incurs one extra cycle of routing the vector to the hard-wired shuffle compared to routing the vector to the SHF unit with the Vector-SHF configuration, which explains the identical cycle-counts in this case.

The MemoryCopy, Binarization and MatVecMul benchmarks do not use vector shuffles, and the Sum benchmark uses comparatively few vector shuffles; as a result, for these benchmarks the performance for all vector configurations is largely identical. The exception is the Vector-Chain configuration for the Binarization

Benchmark    | Configuration  | Non-stall cycles | Total cycles | Non-stall speedup | Total speedup
MemoryCopy   | Scalar         | 130017           | 170022       | 1.000×            | 1.000×
             | Vector-SHF     | 11271            | 23776        | 11.536×           | 7.151×
             | Vector-HWSHUF  | 11271            | 23776        | 11.536×           | 7.151×
             | Vector-MemShuf | 11271            | 23776        | 11.536×           | 7.151×
             | Vector-Chain   | 11271            | 23776        | 11.536×           | 7.151×
Binarization | Scalar         | 170017           | 210022       | 1.000×            | 1.000×
             | Vector-SHF     | 16271            | 28776        | 10.449×           | 7.299×
             | Vector-HWSHUF  | 16271            | 28776        | 10.449×           | 7.299×
             | Vector-MemShuf | 16271            | 28776        | 10.449×           | 7.299×
             | Vector-Chain   | 15021            | 27526        | 11.319×           | 7.630×
Sum          | Scalar         | 104019           | 120026       | 1.000×            | 1.000×
             | Vector-SHF     | 10042            | 30049        | 10.358×           | 3.632×
             | Vector-HWSHUF  | 10034            | 33041        | 10.367×           | 3.633×
             | Vector-MemShuf | 10067            | 33074        | 10.333×           | 3.629×
             | Vector-Chain   | 9092             | 32099        | 11.441×           | 3.739×
Convolution  | Scalar         | 366443           | 465256       | 1.000×            | 1.000×
             | Vector-SHF     | 54999            | 93447        | 6.663×            | 4.979×
             | Vector-HWSHUF  | 54999            | 93447        | 6.663×            | 4.979×
             | Vector-MemShuf | 92504            | 130952       | 3.961×            | 3.553×
             | Vector-Chain   | 142467           | 180915       | 2.572×            | 2.572×
MatVecMul    | Scalar         | 78415            | 102996       | 1.000×            | 1.000×
             | Vector-SHF     | 10192            | 45525        | 7.694×            | 2.262×
             | Vector-HWSHUF  | 10192            | 45525        | 7.694×            | 2.262×
             | Vector-MemShuf | 10192            | 45525        | 7.694×            | 2.262×
             | Vector-Chain   | 10192            | 45525        | 7.694×            | 2.262×

Table 6.2: Runtime and speedup results for all benchmarks.

Figure 6.3: Number of cycles runtime for Binarization.

Figure 6.4: Number of cycles runtime for Sum.

Figure 6.5: Number of cycles runtime for Convolution.

Figure 6.6: Number of cycles runtime for MatVecMul.

Figure 6.7: Total speedup for all benchmarks.

Figure 6.8: Non-stall speedup for all benchmarks.

and Sum benchmarks, where the runtime is lower compared to the other vector configurations. For the Binarization benchmark, the difference is 1250 cycles, whereas for the Sum benchmark, the difference is 975 cycles. The Binarization benchmark operates on a 100 × 100 = 10,000 pixel data set; divided by 8, this gives 1250 loop iterations. Meanwhile, the Sum benchmark operates on an 8000-integer data set; divided by 8, this gives 1000 loop iterations. Therefore, we can reason that on the Vector-Chain configuration, the loop body of the Binarization benchmark is 1 cycle shorter than on the other vector configurations. This also seems to be the case for the Sum benchmark, as the cycle difference, 975 cycles, approaches the number of loop iterations, 1000. Note that the Sum benchmark does, in fact, use vector shuffles, and the Vector-Chain configuration uses a slower vector shuffling method than the other vector configurations, which can explain the 25-cycle difference. The reason that these loop bodies are 1 cycle smaller for the Vector-Chain configuration can be explained by the extra connection from vLSU to vALU that is added in the Vector-Chain configuration compared to the other vector configurations (see Section 6.1). Both the Binarization and Sum benchmarks perform computations on data read from memory, and the added connection means that the data read from memory does not need to pass through the vRF node first, which indeed saves 1 cycle. As the MemoryCopy benchmark does not perform computations on the read data, there is no cycle saved here.

When we take the cycle-counts with stall cycles added in, the speedups are considerably lower. While for the MemoryCopy and Binarization benchmarks we still achieve a speedup that is close to the vector width, it is possible this is simply due to the fact that duties are now split between the scalar and vector parts of the Blocks design. The Convolution benchmark also performs respectably using the Vector-SHF and Vector-HWSHUF configurations, which feature fast vector shuffling methods. However, for the rest of the Sum, Convolution and MatVecMul test cases, the total speedup does not approach the vector width.

One reason for this becomes clear when we look at the number of stall cycles in each benchmark. As shown in Figure 6.2, Figure 6.3 and Figure 6.5, the number of stall cycles is significantly reduced for the vectorized versions of the MemoryCopy, Binarization and Convolution benchmarks. This can be explained by the fact that these particular benchmarks operate on a data set where each data element is a single byte. When four adjacent bytes are read from, or written to, memory on a 32-bit-aligned memory address, the Blocks-CGRA is able to coalesce this into a single memory access, which incurs fewer stall cycles than reading/writing the bytes one-by-one, as the scalar versions of the benchmarks do. This is because the DTL bus used for global memory on the Blocks-CGRA always causes at least one (stall) cycle of delay. Meanwhile, from Figure 6.4 and Figure 6.6, we observe that the number of stall cycles is significantly increased for the vectorized versions of the Sum and MatVecMul benchmarks. These two benchmarks operate on data sets where the data elements are 32-bit integers, and so the 8 simultaneous memory accesses cannot be coalesced in any way. As the Blocks-CGRA must perform arbitration when multiple non-coalescible memory accesses occur simultaneously, this potentially incurs more stall cycles than performing the memory accesses one-by-one, as the scalar versions of the benchmarks do.

Despite this, at the very least we do not see any performance regressions for any of our test cases, meaning the vectorized test cases do indeed run in less time than their scalar equivalents.

6.3 Area size

For all five Blocks configurations, we additionally measure the area usage. This is achieved by performing a logic synthesis on each Blocks configuration, which instantiates the Functional Units in the configuration and then translates the behavioral RTL of the Blocks hardware + Functional Units to logic gates. Following this, we measure the area usage. The area measurements are obtained from a logic synthesis and not from a full place-and-route; moreover, the Blocks measurement model still has some inaccuracies. Therefore, the obtained results are not accurate in terms of absolute numbers, but the results are still believed to be good enough for relative comparisons.

The obtained results are normalized based on the measured area size for the scalar configuration and are listed in Table 6.3. A graphical representation of the results is given in Figure 6.9. We list Vector-HWSHUF twice, as this configuration contains hard-wired shuffle patterns that are specific to each benchmark. As a result, we list the base area size for a Vector-HWSHUF configuration with no shuffle patterns at all, as well as the maximum area size measured across all five benchmarks with the application-specific hard-wired shuffle patterns added in. In addition to total area, we also give a rough estimation of how the area is divided over the components of the architecture, which is obtained from the synthesis results. The area is divided into categories as follows:

• App-specific: This represents the difference in area size between the base Vector-HWSHUF area and the maximum Vector-HWSHUF area measured (with application-specific shuffles added in).

• IF + ID + IMem + Imm: This category encompasses all Instruction Decoder units in the configuration, along with Instruction Fetch modules and associated Instruction Memory. As the Immediate Units in the Blocks-CGRA are special versions of Instruction Decoder units, we group them together in this category.

• FU + LMem + Arbiter: Represents all other Functional Units that are not Instruction Decoders or Immediate Units. For LSUs, we also include their grouped Local Memory. Moreover, we include the size of the general arbiter, as this is mostly dependent on the number of LSUs. This category, together with the previous category, forms the main user-configured portion of the Blocks-CGRA.

• GMem + Peripherals: This category represents the Global Memory present in the system as well as any memory-mapped peripherals (mainly the wide memory register used for Vector-MemShuf).

• Rest: This category represents all other area usage that does not fall neatly into any of the other categories, including the base hardware area for the Blocks-CGRA.

We find that the relative increase in total area size falls far below the roughly 8× that we would expect based on the vector width. This can be explained by

Configuration        | FU+LMem+Arb | IF+ID+IMem+Imm | GMem+Peri | Total
Scalar               | 1.000×      | 1.000×         | 1.000×    | 1.000×
Vector-SHF           | 9.315×      | 2.970×         | 1.000×    | 2.305×
Vector-HWSHUF (base) | 8.673×      | 1.842×         | 1.000×    | 2.069×
Vector-HWSHUF (max)  | 9.276×      | 1.842×         | 1.000×    | 2.143×
Vector-MemShuf       | 9.326×      | 1.841×         | 1.006×    | 2.157×
Vector-Chain         | 9.566×      | 1.841×         | 1.000×    | 2.180×

Table 6.3: Normalized total area size for all configurations.

Figure 6.9: Normalized area size for all configurations.

We find that the relative increase in total area size falls far below the roughly 8× that we would expect based on the vector width. This can be explained by the fact that while we add, for instance, 8 LSUs to the vectorized part of the configuration, these LSUs are all driven by a single Instruction Decoder. Indeed, we see an increase of 1.8× in ID-related area size for most configurations; for Vector-SHF, the increase is larger due to the addition of the Vector Shuffle Unit (SHF), which contains a number of extra ID units. This makes sense, as the vector portion of the configuration contains more or less the same nodes as the scalar portion, excluding the ABU and Immediate Unit, so an increase approaching 2× is to be expected. Meanwhile, the increase in area size for other Functional Units and related elements tends to lie around 9×, which also falls in line with the fact that the vectorized configurations contain the scalar nodes as well as vectorized versions of those nodes with width 8. Fluctuations of this increase across configurations can be explained by the fact that FU connections, including shuffle patterns, are grouped together with the corresponding FUs by the measurement software; this is best seen with the base and maximum Vector-HWSHUF configurations, which differ only in the number of hard-wired shuffle patterns present.

Another reason why we do not reach an 8× area size can be seen by looking at the area usage for the base hardware and global memory. On the Scalar configuration, the user-configurable portion of the CGRA (Functional Units and associated Instruction Decoders) takes up only 25.7% of the total area size, which is relatively small. When the configuration is vectorized, the area usage mainly increases for this part, whereas the base hardware and global memory area usage barely grow. This indicates that the base Blocks hardware and global memory carry a significant overhead when a scalar configuration is used, and this overhead is significantly reduced by using vectorized nodes.

We can additionally see the impact of the SHF unit by comparing the Vector-HWSHUF (base) result to the Vector-SHF result; these FU configurations are very similar, the main difference being that Vector-SHF additionally contains a SHF unit, which consists of 8 ALUs and 8 Instruction Decoders. We find that the ID-related area size is significantly higher, going from 1.8× to almost 3× due to the 8 extra ID units; the area for other FUs also grows somewhat. Looking at the total area size, we find that the SHF unit by itself increases total area usage by 11.4% compared to the Vector-HWSHUF (base) configuration.
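To make this concrete, a back-of-the-envelope estimate can be made. The split of the scalar configuration's 25.7% user-configurable area into a decoder-related share $f_{\mathrm{ID}}$ and an FU-related share $f_{\mathrm{FU}}$ is not reported separately, so a roughly even split is assumed here purely for illustration:

\[
A_{\mathrm{norm}}^{\mathrm{total}}
\approx f_{\mathrm{fixed}} \cdot 1 + f_{\mathrm{ID}} \cdot s_{\mathrm{ID}} + f_{\mathrm{FU}} \cdot s_{\mathrm{FU}}
\approx 0.743 \cdot 1 + 0.13 \cdot 3.0 + 0.13 \cdot 9.3 \approx 2.3 ,
\]

where $f$ denotes an area share of the scalar configuration and $s$ the measured per-category scaling factor of Vector-SHF. The outcome is close to the measured 2.305×, and repeating the exercise with the factors of the other configurations lands in the same 2.1× to 2.3× range.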

6.4 Energy usage

Finally, we measure the total energy usage for all 25 test cases by performing a full compile, logic synthesis, and simulation cycle. This yields a power value for each test case, which is multiplied by the total cycle count to obtain an energy usage data point. As with the area results, these energy results are not accurate in terms of absolute numbers, but are believed to be suitable for relative comparisons. The obtained results are normalized against the measured energy usage of the scalar version of each benchmark and are listed in Table 6.4. A graphical representation is given in Figure 6.10. We see significant energy reductions here, of more than 4× for the MemoryCopy and Binarization benchmarks.
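For reference, the way a power value and a cycle count combine into an energy data point can be written out. Assuming, for illustration, that all configurations are evaluated at the same clock period $T_{\mathrm{clk}}$, that period cancels out of the normalized comparison:

\[
E = \bar{P} \cdot N_{\mathrm{cycles}} \cdot T_{\mathrm{clk}},
\qquad
\frac{E_{\mathrm{vec}}}{E_{\mathrm{scalar}}} = \frac{\bar{P}_{\mathrm{vec}} \cdot N_{\mathrm{vec}}}{\bar{P}_{\mathrm{scalar}} \cdot N_{\mathrm{scalar}}} .
\]

This also makes explicit that a vectorized configuration may draw more power per cycle and still use less energy overall, as long as the cycle-count reduction outweighs the power increase.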

Benchmark      Scalar    Vector-SHF   Vector-HWSHUF   Vector-MemShuf   Vector-Chain
MemoryCopy     100.00%   30.51%       24.21%          25.17%           24.93%
Binarization   100.00%   30.68%       24.80%          25.63%           24.28%
Sum            100.00%   56.88%       46.34%          46.18%           44.67%
Convolution    100.00%   41.05%       33.15%          55.52%           68.19%
MatVecMul      100.00%   88.89%       69.54%          70.57%           70.75%

Table 6.4: Normalized energy usage for all benchmarks.


Figure 6.10: Normalized energy usage for all benchmarks.

The worst case is the MatVecMul benchmark on the Vector-SHF configuration, which uses only 11% less energy than the scalar version. However, this is still an overall reduction in energy usage, and the total runtime for this particular test case is reduced by a factor of 2.3× as well. The Vector-HWSHUF, Vector-MemShuf and Vector-Chain configurations perform similarly for most benchmarks, except for Convolution, where large differences can be seen; this can be attributed to the fact that the Convolution benchmark has a significantly longer runtime on these configurations.

These energy measurements are largely congruent with our runtime findings: the benchmarks that see the lowest energy reduction also saw the lowest speedup, and likewise, benchmarks with a high energy reduction also saw a high speedup. However, we find that the Vector-SHF configuration performs notably worse in terms of energy usage than all other vector configurations. For instance, compared to the Vector-HWSHUF configuration, which resulted in equal cycle counts in most test cases, energy usage for the Vector-SHF configuration increases by 23% to 28%. Given that the major difference between these configurations is the presence of a SHF unit, we can conclude that although the SHF method is one of the faster vector shuffling methods, using a SHF unit in a Blocks configuration is rather energy inefficient.

Chapter 7

Conclusion

7.1 Summary

In this work, we have introduced a host of modifications made to several stages of the Blocks-CGRA toolflow in order to add support for compiling vectorized C programs to Blocks code. We have identified various problems on the Blocks-CGRA that complicate the compilation of vectorized C programs, and extended the Blocks compiler model in order to circumvent these. Various additions to the Blocks hardware, memory interface and instruction set have also been implemented with the goal of improving the performance of vectorized programs. Moreover, the LLVM-based Blocks compiler has been extended with procedures to lower vectorized code onto the Blocks-CGRA, using various methods to perform vector shuffles based on the units that are available in the Blocks configuration. This supports the parametric design of the Blocks-CGRA, where the configuration instance can be tailored to the application in order to achieve a higher energy efficiency or throughput.

From the results obtained in Chapter 6, we find that both runtime and energy usage are significantly decreased for vectorized programs compared to their scalar equivalents. The speedups obtained for a vector width of 8 range from 2.3× to 7.6×; if we leave out stall cycles, the speedup can, in certain cases, even exceed the vector width. Moreover, we find energy reductions ranging from 11.1% in the worst case to as much as 75.8%. In the optimal case, a vectorized program was found to finish in 13% of its original runtime whilst using only 24.2% of the original energy. In the worst case, a vectorized program finished in 44% of its original runtime whilst using 88.9% of the original energy. No regressions in either runtime or energy usage were observed in any of the 20 vectorized test cases when compared to their scalar equivalents. At the same time, area size increased by up to 2.3× in the worst case and by 2.1× to 2.2× on average; a rather modest increase for a vector width of 8. This suggests that the hardware design of Blocks is very well tailored to vectorized applications, which is supported by the literature [18]. As we have found in this work, this also applies to high-level C programs that have been vectorized through a compiler rather than hand-written in Parallel Assembly.

A number of problems pertaining to vectorized programs on the Blocks-CGRA remain.

The main challenge in running vectorized programs on a CGRA is vector locality: each vectorized unit can only directly access its own element. This makes operations such as vector shuffling difficult to perform without specialized hardware extensions or hard-wired shuffle patterns embedded in the architecture configuration. Moreover, the stall cycles incurred when reading or writing vectors from or to global memory are a significant bottleneck when dealing with data element sizes that cannot be coalesced into a single memory access. Our findings indicate that getting around these two limitations could decrease the running times of vectorized programs much further.

7.2 Future work

In this section we shall describe some topics that may be interesting for future research.

• Partial insertion search space: Currently, the search space for our partial insertion algorithm (Section 5.5.4) is quite limited. The potential search space for derived shuffles is enormous, and this method of vector shuffling can be particularly useful for Blocks configurations with limited (room for) hard-wired shuffle patterns. Future work may look into a more efficient or effective navigation of this search space in order to find more convenient derived shuffles and bring down the cycle counts for vector shuffles performed using this method; a conceptual sketch of such a search is given after this list.

• Vector width variations: We have implemented our vector support in the Blocks-CGRA compiler only for a vector width of 8 at this time. The Blocks-CGRA compiler is currently based on LLVM version 4, but there are recent efforts towards updating it to a newer version of LLVM. Once this is completed, it may be worthwhile to look into adding support for more vector widths, as this could produce further insight into the runtime, area and energy usage trade-offs related to the choice of vector width.

• Configuration adjustments in the compiler: At this time, the Blocks-CGRA compiler is not equipped to make modifications to the Blocks configuration during the compilation process. Achieving this would likely require a significant restructuring of the Blocks scheduler. However, it may have its merits, as the compiler has unique insight into which units in the system are most or least valuable to the program being compiled. The compiler could even cull unnecessary Functional Units from the configuration if it detects that these are not useful. For instance, this would allow the Multiplexed Register File (RFM) to be automatically scaled to the number of units required by the system. Moreover, it would allow the compiler to automatically add hard-wired shuffle patterns to the configuration for the shufflevector operations used in the program.

• Stall cycle reduction: Stall cycles currently form a significant bottleneck for some vectorized applications, such as our Sum and MatVecMul classes of benchmarks. These stall cycles occur when vectors are accessed in global memory. In order to improve the performance of vectorized programs, it would be highly useful to reduce the number of stall cycles incurred during the execution of an application. One way to achieve this might be to move input data directly into the local memories of the vectorized LSUs, so that vectors can be loaded or written without producing stall cycles; this may be done with e.g. a Direct Memory Access (DMA) controller. Another way would be to introduce posted writes, which would reduce stall cycles at least for writes to memory.
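As a rough illustration of what navigating the derived-shuffle search space could look like, the C++ sketch below performs a breadth-first search over compositions of the shuffle patterns available in a configuration until it reproduces a requested target shuffle. This is a conceptual sketch under our own assumptions, not the partial insertion algorithm of Section 5.5.4; all names are illustrative, and a real implementation would also have to account for per-step cycle costs and resource constraints.

// Conceptual sketch: find a sequence of available (hard-wired) shuffle
// patterns whose composition realizes a requested target shuffle.
#include <cstdio>
#include <map>
#include <queue>
#include <vector>

using Pattern = std::vector<int>;  // pattern[i] = lane that feeds lane i

// Apply one more pattern on top of the composite reached so far:
// after the step, lane i holds the element that lane pattern[i] held before.
static Pattern compose(const Pattern& cur, const Pattern& pattern) {
  Pattern next(cur.size());
  for (std::size_t i = 0; i < cur.size(); ++i) next[i] = cur[pattern[i]];
  return next;
}

// Breadth-first search: returns the shortest sequence of indices into
// `available` whose composition equals `target`, or an empty vector if no
// sequence of at most maxDepth steps is found.
static std::vector<int> findDerivedShuffle(const std::vector<Pattern>& available,
                                           const Pattern& target, int maxDepth) {
  const std::size_t width = target.size();
  Pattern identity(width);
  for (std::size_t i = 0; i < width; ++i) identity[i] = static_cast<int>(i);

  std::map<Pattern, std::vector<int>> visited;  // composite -> pattern sequence
  visited[identity] = {};
  std::queue<Pattern> frontier;
  frontier.push(identity);

  while (!frontier.empty()) {
    Pattern cur = frontier.front();
    frontier.pop();
    std::vector<int> path = visited[cur];
    if (static_cast<int>(path.size()) >= maxDepth) continue;
    for (std::size_t p = 0; p < available.size(); ++p) {
      Pattern next = compose(cur, available[p]);
      if (visited.count(next) != 0) continue;  // already reached in fewer steps
      std::vector<int> nextPath = path;
      nextPath.push_back(static_cast<int>(p));
      if (next == target) return nextPath;
      visited[next] = nextPath;
      frontier.push(next);
    }
  }
  return {};  // not derivable within maxDepth steps
}

int main() {
  // Two hard-wired patterns for a 4-wide example: rotate-by-one and pairwise swap.
  std::vector<Pattern> available = {{1, 2, 3, 0}, {1, 0, 3, 2}};
  Pattern target = {2, 3, 0, 1};  // rotate-by-two, derivable as rotate-by-one twice
  std::vector<int> steps = findDerivedShuffle(available, target, 4);
  if (steps.empty()) {
    std::printf("no derived shuffle found\n");
  } else {
    for (int p : steps) std::printf("apply pattern %d\n", p);
  }
  return 0;
}

In practice, each candidate sequence would also need to be weighed against its cycle cost on the target configuration, and the breadth-first expansion would need tighter pruning than a simple depth limit.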

Bibliography

[1] M. Adriaansen, M. Wijtvliet, R. Jordans, L. Waeijen, and H. Corporaal. Code generation for reconfigurable explicit datapath architectures with LLVM. In 2016 Euromicro Conference on Digital System Design (DSD), pages 30–37, United States of America, August 2016. Institute of Electrical and Electronics Engineers (IEEE).

[2] C.H. van Berkel. Multi-core for mobile phones. In 2009 Design, Automation Test in Europe Conference Exhibition, pages 1260–1265, April 2009.

[3] Free Software Foundation. Loop-specific pragmas. In Using the GNU Compiler Collection (GCC), chapter 6.61.16. 8.1 edition, May 2018. https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Loop-Specific-Pragmas.html.

[4] R. Hartenstein. Coarse grain reconfigurable architecture (embedded tutorial). In Proceedings of the 2001 Asia and South Pacific Design Automation Conference, ASP-DAC ’01, pages 564–570, New York, NY, USA, 2001. ACM.

[5] J. Hoogerbrugge. Code generation for Transport Triggered Architectures. PhD thesis, Delft University of Technology, Delft, The Netherlands, February 1996.

[6] P. Larsson and E. Palmer. Image processing acceleration techniques using Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions. Intel, September 2009. https://software.intel.com/en-us/articles/image-processing-acceleration-techniques-using-intel-streaming--extensions-and-intel-advanced-vector-extensions/.

[7] LLVM Developer Group. Auto-vectorization in LLVM. In LLVM 4 documentation. 4.0.0 edition, March 2017. https://releases.llvm.org/4.0.0/docs/Vectorizers.html.

[8] LLVM Developer Group. LLVM language reference manual. In LLVM 4 documentation. 4.0.0 edition, March 2017. https://releases.llvm.org/4.0.0/docs/LangRef.html.

[9] C. Lomont. Introduction to Intel Advanced Vector Extensions. Intel, June 2011. https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions.

[10] NVIDIA Corporation. CUDA toolkit documentation, 9.2.88 edition, March 2018. https://docs.nvidia.com/cuda/.

[11] C. Piper. An introduction to vectorization with the Intel C++ Compiler. Intel, 2012.

[12] A. Tiemersma. Optimal instruction scheduling and register allocation for Coarse-Grained Reconfigurable Architectures. Master’s thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, March 2017.

[13] K. Vadivel. Energy efficient loop mapping techniques for Coarse-Grained Reconfigurable Architecture. Master’s thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, January 2017.

[14] K. Vadivel, R. Jordans, S. Stuijk, H. Corporaal, P. Jääskeläinen, and H. Kultala. Towards efficient code generation for exposed datapath architectures. In Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems, SCOPES 2019, pages 86–89, United States, May 2019. Association for Computing Machinery, Inc.

[15] K. Vadivel, M. Wijtvliet, R. Jordans, and H. Corporaal. Loop overhead reduction techniques for Coarse-Grained Reconfigurable Architectures. In 2017 Euromicro Conference on Digital System Design (DSD), pages 14–21, August 2017.

[16] L. Waeijen, D. She, H. Corporaal, and Y. He. SIMD made explicit. In 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pages 330–337, July 2013.

[17] S. Walstock. Pipelining streaming applications on a multi-core CGRA. Master’s thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, February 2018.

[18] M. Wijtvliet, J.A. Huisken, L.J.W. Waeijen, and H. Corporaal. Blocks: Redesigning coarse grained reconfigurable architectures for energy efficiency. In 29th International Conference on Field Programmable Logic and Applications, FPL 2019, September 2019 (to be published).

[19] M. Wijtvliet, L. Waeijen, and H. Corporaal. Coarse grained reconfigurable architectures in the past 25 years: overview and classification. In 2016 16th International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2016, pages 235–244, United States, January 2017. Institute of Electrical and Electronics Engineers.

[20] M. Wijtvliet, L.J.W. Waeijen, M. Adriaansen, and H. Corporaal. Reaching intrinsic compute efficiency requires adaptable micro-architectures. In 9th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2016), pages 1–7, January 2016.

[21] P. Xiang, Y. Yang, and H. Zhou. Warp-level divergence in GPUs: characterization, impact, and mitigation. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 284–295, February 2014.
