Compiler Vectorization for Coarse-Grained Reconfigurable Architectures

Geert Linders ([email protected])
Master's Thesis

Eindhoven University of Technology

August 13, 2019

Abstract

In recent years, there has been a demand for increasingly powerful embedded processing units which operate at an ever higher energy efficiency. However, traditional architectural platforms such as CPUs and FPGAs are not always able to support growing workloads on a tight energy budget. To mitigate this, an alternative paradigm of processing has been gaining traction: the Coarse-Grained Reconfigurable Architecture (CGRA). Configuration of the Functional Units inside a CGRA allows it to be structured in various ways, including general-purpose and application-specific layouts. To support large workloads on a CGRA with minimal energy, a CGRA can be configured as an SIMD processor, which allows for the execution of vectorized programs.

This work explores how vectorized programs from existing CPU programming models can be effectively mapped to a CGRA configuration and program schedule. Currently, a C compiler exists based on the LLVM compiler framework, which is able to compile scalar programs for Blocks: a CGRA in development at Eindhoven University of Technology which is well-suited for SIMD programs. To extend this compiler with vectorization, we make adjustments to the Blocks hardware, compiler model, and instruction set. Moreover, we implement procedures in the compiler and surrounding toolflow in order to add support for lowering vector-specific operations, particularly vector shuffles, for various different Blocks configurations.

Benchmarking results indicate that for a compiler-friendly Blocks configuration, significant speedups and energy reductions can be obtained by vectorization. A vector width of 8 can yield a speedup ranging from 2.3× to 7.6×, whilst energy usage is reduced by 11.1% to 75.8%. Finally, a number of key issues which prevent vectorized Blocks programs from achieving their full potential are identified.

Contents

1 Introduction
2 Related work
  2.1 Blocks
  2.2 Blocks compiler
    2.2.1 LLVM-based compiler
    2.2.2 Resource graph
  2.3 Vector programming models
    2.3.1 CPU-style
    2.3.2 GPU-style
  2.4 Common problems on Blocks
3 Compiler model
  3.1 LLVM vector operations
    3.1.1 build_vector
    3.1.2 shufflevector
    3.1.3 extractelement
    3.1.4 insertelement
  3.2 Hardware model
  3.3 Architecture expansion
  3.4 Scalar-vector boundary
    3.4.1 Scalar to vector
    3.4.2 Vector to scalar
    3.4.3 Vector to vector
  3.5 Shuffle patterns
  3.6 Pseudo-units
    3.6.1 Vector Shuffle Unit (SHF)
    3.6.2 Multiplexed Register File (RFM)
  3.7 Vector indices
4 Instruction set extensions
  4.1 ALU instructions
    4.1.1 ADDI: Add Immediate
    4.1.2 SUBI: Subtract Immediate
  4.2 LSU instructions
    4.2.1 SVX: Set Vector Index
    4.2.2 RVX: Reset Vector Index
    4.2.3 LVD: Load Vector Default Index
5 Compiler vector support
  5.1 Storage and routing
  5.2 Building vectors
    5.2.1 Base vector
    5.2.2 Remainder insertion
  5.3 Insert element
  5.4 Extract element
  5.5 Vector shuffling
    5.5.1 Hard-wired shuffle
    5.5.2 Vector Shuffle Unit
    5.5.3 Shuffle via memory
    5.5.4 Partial insertion
6 Evaluation
  6.1 Testing environment
  6.2 Runtime
  6.3 Area size
  6.4 Energy usage
7 Conclusion
  7.1 Summary
  7.2 Future work
Bibliography

Chapter 1

Introduction

As workloads and performance requirements increase, there has been an increasing demand for highly efficient embedded platforms. Embedded processing units continue to become more and more powerful, but this often comes at the cost of higher energy usage. To an increasing degree, there is a need for parallel processing in order to support growing workloads on a limited power budget; for instance, using a multi-core approach [2] or a vectorized one [6].

Nowadays, general-purpose Central Processing Units (CPUs) and Field-Programmable Gate Arrays (FPGAs) are commonly used in embedded platforms. A CPU has the advantage that it is highly flexible and available off-the-shelf, being able to run any sort of application out of the box. However, this generality introduces a great deal of computational overhead; for instance, instruction fetching/decoding and data transport. As such, in some cases a CPU is simply not efficient enough to support compute-intensive applications; instead, an adaptable micro-architecture is required [20]. On the other hand, an FPGA is such an adaptable micro-architecture: the FPGA itself is generic, but its configuration can be optimized for running a single specific application, which generally yields a much higher throughput. However, the high configurability and low path-width of an FPGA cause much energy to be lost due to the large number of switch-boxes and long wires [20][4]. As a result, neither CPUs nor FPGAs are an ideal solution when a high energy efficiency is desired.

To improve energy efficiency, CPUs and FPGAs are often paired with specialized computing units such as Graphics Processing Units (GPUs) and Digital Signal Processors (DSPs). In terms of granularity, these computing units fall somewhere in between; they are more fine-grained than a CPU, but more coarse-grained than an FPGA. In addition, Application-Specific Integrated Circuits (ASICs) are also frequently used when a higher energy efficiency is desired. However, these computing units are usually domain-specific and do not have the flexibility that a CPU or FPGA offers.

A new paradigm of processing unit has been proposed which potentially achieves a higher energy efficiency than both a CPU and an FPGA: the Coarse-Grained Reconfigurable Architecture (CGRA). The concept has been around since the 90s in various shapes and forms [4], but has especially gained traction in recent years [19].

Similar to an FPGA, a CGRA can be configured to be optimized for a specific application; however, a CGRA is configured at the Functional Unit level, whereas an FPGA is configured at the gate level. In other words, a CGRA has larger building blocks than an FPGA and thus requires fewer wires and switch-boxes, which reduces static energy dissipation. In this report we shall specifically deal with the CGRA known as Blocks, which is currently in development at Eindhoven University of Technology [20].

Due to the use of larger building blocks, code generation for a CGRA is more akin to that of a CPU than the high-level synthesis that is used for FPGAs. A compiler based on the popular LLVM architecture currently exists for the Blocks-CGRA, which supports compilation of programs to Blocks machine code. However, up until now the focus has been exclusively on scalar programs, with vectorized programs only being created through manual Parallel Assembly programming. Moreover, little research has been done into the mapping of vectorized programs to Blocks-CGRA designs. We wish to extend the compiler in order to add support for vectorized architectures and operations, similar to those available on contemporary processing units. This report will explore and evaluate the possibility of introducing support for vectorized programs on the Blocks-CGRA hardware and supporting vector operations in the compiler.

The remainder of this report is structured as follows. Chapter 2 gives a detailed introduction to the Blocks-CGRA and CGRAs in general, as well as the Blocks-CGRA compiler, and gives an overview of related work. Chapter 3 describes the compiler model, including vector-specific operations and how a vectorized Blocks-CGRA is represented in the compiler, as well as introducing pseudo-unit constructs and Blocks hardware extensions used to aid compilation. Chapter 4 provides the specifications of all newly implemented instructions in the Blocks instruction set. Chapter 5 details the changes made to the code lowering and scheduling processes in the Blocks compiler, and introduces procedures to lower high-level vector operations to low-level assembly code. Chapter 6 describes and evaluates the results of vectorized benchmarking applications that have been run on the modified Blocks toolflow. Finally, Chapter 7 gives a conclusion regarding our findings in this work.

Chapter 2

Related work

This chapter will discuss relevant work that has been done, regarding both CGRAs in general and the Blocks-CGRA specifically, as well as vectorization of computer programs. Section 2.1 serves as a brief introduction of Blocks, the CGRA in development at Eindhoven University of Technology. Section 2.2 introduces the Blocks compiler and describes recent developments in CGRA code generation. Section 2.3 discusses the various programming models of vectorization that are employed on other platforms, and how these pertain to the Blocks platform. Finally, Section 2.4 identifies the common issues that arise with vectorization on the Blocks-CGRA.

2.1 Blocks

A Coarse-Grained Reconfigurable Architecture (CGRA) is similar to an FPGA in that it is a reconfigurable processing element. However, unlike an FPGA, which is configured at the gate level, a CGRA is configured at the Functional Unit (FU) level, hence "coarse-grained".

The key difference which sets Blocks, the CGRA in development at Eindhoven University of Technology, apart from traditional CGRAs is that Blocks uses separate control-paths and data-paths, whereas traditional CGRAs only have a data-path [18]. This separation of control-path and data-path allows Instruction Fetch/Instruction Decode units to be treated as "first-class citizens", i.e., they are considered separate Functional Units and can be more-or-less arbitrarily connected to any other units. In a traditional CGRA, on the other hand, this Instruction Fetch/Instruction Decode functionality, as well as program memory, is built right into the Functional Units themselves. By having IF/ID as a separate unit, it becomes possible for a single IF/ID unit to drive multiple other FUs of the same type at once; i.e., Blocks can operate in a true Single Instruction, Multiple Data fashion, which is not possible on a traditional CGRA. Operating a CGRA in such an SIMD fashion has been shown to result in significant energy efficiency improvements [18].

A general overview of the hardware structure of Blocks is shown in Figure 2.1. The Functional Units available in the Blocks hardware, and their supported operations, include:

ID: Instruction Decoder. Sends instructions in sequential order to connected Functional Units, which they then execute. The program counter for the instructions is supplied by a connected ABU.

ABU: Accumulate-and-Branch Unit. Produces an automatically incrementing program counter and supports related operations, such as branching and jumping. Can also be configured to support accumulating on internal registers.

RF: Register File with 16 registers where data can be stored temporarily. Supports operations related to reading data from and writing data to its registers. This unit is optional, as data can also be stored in buffers between Functional Units, but in case more storage space is needed, Register File(s) can be added to the system.

LSU: Load-Store Unit, used for long-term storage. Supports operations related to reading from and writing to memory; both global memory and a local memory block can be accessed. Global memory is accessible to external devices, whereas local memory is private to the individual LSU. Loads and stores can be done explicitly by specifying the memory address on an input, or implicitly by configuring memory address and stride in internal control registers in advance; the latter also increments the memory address. Moreover, the LSU supports operations related to reading from and writing to said internal control registers.

ALU: Arithmetic Logic Unit. Supports general arithmetic and bitwise operations, including equals/greater/less-than checks. Also supports small bit-shifts by 1 bit or 4 bits; variable bit-shifts are not supported.

MUL: Multiplier Unit. Supports multiplication operations as well as extended bit-shifts (by 8 bits, 16 bits or 24 bits).

IU: Immediate Unit. The only purpose of this unit is to produce immediate values in the system, which are constant values that are available "immediately"; i.e., they do not need to be loaded from a register or from external memory. The immediate values are stored in the instruction memory for the Immediate Unit. As a result, it has a higher instruction width than the other units: 32 bits per "instruction" (immediate value) in the current base hardware configuration, versus 12 bits per instruction for other units.

In addition, most Functional Units support a general PASS operation, which simply forwards a value from an input port to an output port.

By supplying a hardware configuration file, formatted in Extensible Markup Language (XML), it is possible to specify which Functional Units are (in)active in the Blocks hardware, and how they are connected to one another. Next, a programmed application can be executed on the configured Blocks-CGRA. Blocks-CGRA programs are written in Parallel Assembly (PASM), which is similar to a regular uniprocessor assembly language source file (e.g. x86 or ARM), but consists of multiple columns, each pertaining to a single Instruction Decoder (ID) or Immediate Unit (IU) in the system. During execution of the program, each Instruction Decoder executes in lock-step. As a result, all Functional Units in the hardware are synchronized by cycle at all times.


Figure 2.1: Hardware structure of Blocks.

By specifying connections between Instruction Decoders and the other Functional Units, FU-specific instructions can be encoded. In particular, this approach allows for Blocks to be configured in various structures. For instance, Figure 2.2 shows a Blocks-CGRA configured as a VLIW processor, by connecting three Functional Units each to a unique Instruction Decoder. Most Blocks configurations look like this. This sort of configuration can also be built as a general-purpose processor. Furthermore, it is possible to connect a single Instruction Decoder to a set of Functional Units of the same type, as shown in Figure 2.3.

In the current Blocks hardware, a Functional Unit is structured generally as shown in Figure 2.4. A Functional Unit contains four input ports and two output ports on the data-path network, as well as one input port on the control-path network (for supplying instructions). Each output port contains a buffer which can be used as a storage location for a single 32-bit value. In the architectural design XML file, connections are specified from FU output buffers to FU input ports. It is possible for a single FU output to be connected to multiple different FU inputs; however, any one FU input can only be connected to at most one output. Instruction Decoders and Immediate Units differ slightly from other FUs; they have one input port on the data-path (for supplying the Program Counter) as well as one output port, which is on the control-path for Instruction Decoders and on the data-path for Immediate Units. Going forward, whenever we refer to "input" or "output" ports or buffers, we generally refer to the 4 inputs or 2 outputs (1 for IU) on the data-path.

An instruction is encoded in program memory, which is decoded by a connected Instruction Decoder and forwarded to the Functional Unit. An example is shown in Figure 2.4 for the mock instruction opcode outD, inB, inA, a binary operation.


Figure 2.2: Blocks configured as a VLIW processor.


Figure 2.3: Blocks configured as an SIMD processor.


Figure 2.4: Structure of a Blocks Functional Unit, connected to six other FUs.

The opcode part of the instruction specifies the operation to be performed in order to compute the result; the inA and inB parts specify which input ports are read as parameters for the operation; finally, the outD part specifies which destination output buffer the result of the operation is written to. This makes the Functional Units in Blocks similar to those seen in Transport-Triggered Architectures. Not all instructions follow this pattern exactly; for instance, the RF unit additionally takes register indices as parameters (encoded directly into the instruction).

For the current version of Blocks, the programming model assumes that all instructions take a single cycle, and as such, every FU in the system which is connected to the same program counter operates in lockstep. However, in some cases, such as when multiple LSU units are accessing global memory and these accesses cannot be coalesced (e.g. when two LSUs read adjacent 16-bit values, this can sometimes be coalesced into a single 32-bit read), an instruction may take more than one cycle to finish. In such a situation, all other units in the system are stalled for a number of cycles until all pending operations are finished, thus preventing any desynchronization between FUs and ensuring that they continue to operate in lockstep, in accordance with the programming model.

The main benefit of a CGRA is that it features larger building blocks than an FPGA; therefore, there is less interconnect and there are fewer switch-boxes and buses in the system. As a result, static energy dissipation is far lower compared to an FPGA. On the other hand, the CGRA does include some overhead from instruction decoding as well as Register File usage. However, as CGRA instructions do not operate on registers, but rather directly on input ports (similar to a TTA), it is possible to avoid usage of the Register File through explicit bypassing.

For instance, the output port of an ALU can be connected directly to the input port of an LSU, which allows the result of an ALU computation to be written directly into memory without touching the Register File.

A rather significant downside of a CGRA, however, is that it is difficult to program for. There are currently no tools to automatically synthesize a hardware configuration for a specific program. Moreover, programming in Parallel Assembly is significantly more difficult and error-prone than programming for a regular uniprocessor: not only does one need to be mindful of the parallelism across all Instruction Decoders, but the distributed nature of the FU output buffers also makes it hard to keep track of where data is stored at any given moment. As such, we would ideally want to be able to generate code for the Blocks-CGRA from a higher-level language, such as C, which is easier to program; this is where a compiler comes in.

2.2 Blocks compiler

Program execution for a CGRA platform is similar to that of a Transport-Triggered Architecture. A TTA program executes by transporting data via memory buses from/to Register Files and Functional Units. For Transport-Triggered Architectures, code generation methods are available [5]. As in a TTA, a CGRA program executes by transporting data from output buffers to Functional Units and Register Files. As such, methods used for TTA code generation are also partially applicable to a CGRA, and hence, to Blocks. However, in the Blocks-CGRA, there is a limit on the number of inputs to each Functional Unit, namely 4 inputs. As a result, when there are more than 4 units in the system that are executing the program, which is usually the case for non-trivial programs, it is not possible for the interconnect graph to be fully connected. For instance, it becomes impossible to connect every other unit to the Register File directly. As a result, data must often be routed through unrelated units on its way to the destination FU. This brings with it a host of problems, such as the possibility of deadlock while scheduling when no suitable path can be found for operands [14]. This makes the process of code generation for the Blocks-CGRA considerably more complicated than for a regular CPU.

No generic framework for CGRA and TTA code generation currently exists; compilers up until now have been architecture-specific, and this also applies to the Blocks compiler. There are recent efforts towards creating such a generic framework, but research on this is still in the early stages [14].

2.2.1 LLVM-based compiler

LLVM is a popular compiler framework consisting of a front-end, optimization layer, and back-end. An example compilation of a C program to x86 assembly is shown in Figure 2.5. The front-end parses a program written in some programming language, e.g. C, and converts it to LLVM Intermediate Representation (IR) format. Other front-ends exist for different programming languages, such as Fortran or Ada. The IR program is then fed through a common optimization layer, which can consist of many different optimization passes such as loop unrolling, dead code elimination, and so on. Each optimization pass transforms the IR program into a more optimized IR program.


Figure 2.5: Structure of the LLVM compiler architecture.

Finally, the optimized IR program passes through a back-end, which converts the program to the native format of the architecture, such as x86 assembly. The back-end can output this assembly code to various formats, for instance human-readable text formats, or to a binary which is directly executable on the target platform. The major benefit of this structure is that support for a new programming language can be added simply by adding a new front-end which parses that language. Similarly, common optimization passes added to the optimization layer can often be applied to all programming languages and platform targets that are already supported, as well as those that will be supported in the future. Finally, support for a new target platform can be added simply by adding a new back-end which converts the optimized LLVM IR program to the native format of the platform architecture.

An initial compiler for the Blocks platform was developed in 2016, implemented as a custom back-end for LLVM, along with some target-specific optimization passes [1]. This initial Blocks compiler supported compiling basic scalar C programs to Blocks Parallel Assembly (PASM) machine code, with some support for explicit bypassing and function calls. A custom scheduler is implemented, which uses an operation-based scheduling algorithm.

The initial compiler performs instruction scheduling and register allocation as separate passes. In early 2017, a combined instruction scheduling and register allocation algorithm was proposed [12]. This method models the Blocks configuration and program to be scheduled as a state transition model, which is then implemented in a Constraint Satisfaction Program. A Boolean satisfiability (SAT) solver is then used to find a schedule of minimal length. However, this algorithm is currently not yet integrated in the Blocks compiler.

Later in 2017, major improvements were made to reduce loop overheads in the Blocks compiler as well as in hand-crafted PASM programs [13, 15]. A Zero-Overhead Loop Accelerator (ZOLA) circuit was added inside the Accumulate-and-Branch Units in the Blocks hardware, which can be configured ahead of a loop in order to automatically handle loop conditions during the loop. In addition, the ZOLA circuit includes a loop buffer which stores instructions executed during the loop, cutting down on program memory accesses. Additional improvements to the compiler were made to reduce register accesses by extending the scope of the scheduler. Moreover, additional passes were added for implicit addressing and for software pipelining, the latter using a new swing-modulo scheduler. However, compiler integration of the ZOLA circuit is still in an early stage and is only functional for a small set of programs.

Recently, in early 2018, improvements were made in pipelining applications on multi-core Blocks configurations [17]. A FIFO channel was added to the Load-Store Units of the Blocks hardware, which allows communication and synchronization between "cores" in the Blocks configuration.
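To make the LLVM IR format discussed above concrete, the sketch below shows roughly what a front-end might emit for a trivial scalar C function (int add5(int x) { return x + 5; }). The function name and exact IR details are illustrative and not taken from the thesis; real output depends on the front-end version and optimization level.

```llvm
; Hypothetical LLVM IR for: int add5(int x) { return x + 5; }
define i32 @add5(i32 %x) {
entry:
  %result = add nsw i32 %x, 5   ; scalar 32-bit addition in the IR
  ret i32 %result
}
```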


Figure 2.6: The architecture in its simplest and most abstract form. An Immediate Unit connects to an ALU and LSU, which are connected to each other. An ABU manages the program counter for each unit, but it is not shown in this image.

2.2.2 Resource graph

In order to implement vector operations in the Blocks-CGRA compiler, said operations must also be integrated into the scheduler. As such, in this section we give a detailed overview of the Blocks scheduler and how it uses a resource graph to map code onto the architecture configuration; the vector operations that we shall implement will be mapped onto this same graph.

Compiling code for a CGRA is very different from most other target platforms supported by LLVM. Whereas most CPUs have only a single instruction memory, in the CGRA multiple instruction memories and issue slots are present. Some LLVM target backends for VLIW platforms do exist, but a CGRA differs from most VLIW platforms in that the architecture is parametric; specifying different architecture configurations can completely change the structure of the CGRA, for instance by adding more ALUs or LSUs. Therefore, the scheduler for the Blocks compiler can make few assumptions about the available (compute) resources and latencies present in the system, which complicates the scheduling process.

The scheduler which is currently implemented in the Blocks target backend for LLVM makes use of a resource graph. The resource graph is created by repeatedly instantiating a template model based on the loaded Blocks configuration once for every cycle, and adding connections between units that span different cycles. Instructions are then scheduled onto this two-dimensional resource graph, as opposed to most conventional processors, where instructions are scheduled into a one-dimensional list [1].

Suppose that we want to schedule a program that loads a value from memory at address 10, adds 5 to it, then writes it back to memory at address 20. Such a program requires at the minimum a Blocks configuration with an Immediate Unit, an ALU, and an LSU, as well as an ABU to generate a program counter. For the sake of simplicity, we shall leave out the ABU since no branching occurs in this program. This minimum Blocks configuration is shown in Figure 2.6. We make this architecture explicit by defining the connections between the units in terms of output buffers and input ports, as shown in Figure 2.7. This is the model that will be sent to the compiler.

The compiler parses the model from Figure 2.7 and generates a template model as shown in Figure 2.8. Nodes are generated for each Functional Unit as well as all of their output buffers.


Figure 2.7: A more detailed version of the architecture. Connections are now explicitly assigned from output buffers to input ports.


Figure 2.8: The architecture model as parsed by the compiler. Normal lines are assigned a latency of 1 cycle, whereas dashed lines have a latency of 0 cycles.


Figure 2.9: The example program: load from 10, add 5, store to 20; scheduled onto the resource graph. Normal lines have a latency of 1 cycle, whereas dashed lines have a latency of 0 cycles. The bolded nodes and lines represent the active nodes and connections which execute instructions and through which data is routed.

Connections are then generated between the Functional Unit nodes and their output buffers with a latency of 1 cycle, which represents the Functional Unit computing the result of some instruction and storing it in an output buffer (which takes 1 cycle). Self-connections are also added for the output buffers, which represent keeping the same value stored in the buffer for the next cycle. Finally, zero-latency connections are added between output buffers and Functional Units, representing the incoming values on the input ports. These connections are tagged with the input port numbers so that the proper PASM code can be generated at the end of the code generation process, but for the purposes of scheduling, the input port numbers do not matter. Moreover, each connection is given a weight; higher weights are assigned to connections that have a cycle latency.

Next, a resource graph is generated from this template model. The template model is duplicated for every cycle of the program; extra cycles are added as the schedule demands it. Any connections with a latency of 1 are altered so that the connection points to the target node in the next cycle, rather than in the same cycle.

Then, for each instruction to be scheduled, the scheduler first checks to see which Functional Units are capable of executing the instruction (e.g. ALUs for add operations, LSUs for load/store operations), and where the input data is currently stored.

It then traces every forward path through the system starting from the nodes where the input data is stored, incrementally raising the number of cycles ("rows" in the resource graph) that are searched, until one or more suitable Functional Units are found. The instruction is then scheduled to the Functional Unit node with the combined shortest path between it and each of the input data nodes, based on the weights assigned to each connection. As the connection weights are mostly based on latency, and the number of searched cycles is increased until a suitable FU is found, the scheduler is in this sense a greedy one, as instructions are mostly scheduled as soon as possible.

Once the executing Functional Unit is chosen, the input data is allocated to each node that it passes through via PASS operations, and the executing FU node is tagged with the opcode of the instruction. Each node has a capacity of 1, meaning that only one unit of data may pass through each node; as a result, an FU node which is executing an instruction cannot also be used to route data via PASS operations. The result of the instruction is output at the FU node that executed the instruction, rather than an explicit output port. The output port is not chosen until the result of that instruction is consumed by another FU, at which point the data is routed through one of the output ports to the consuming FU, in the process of scheduling the next instruction. If, however, the result of the instruction is not used by any other instructions, then an output port is chosen immediately, with priority given to output ports that are not connected to any other units.

The result of this is shown in Figure 2.9, which shows the example program being scheduled onto the resource graph. In cycle 0, the immediate 10 is generated; this immediate is used in cycle 1 to load from memory address 10. Next, in cycle 2, the immediate 5 is added to the value loaded from memory; this immediate was generated in cycle 1. The result is written back to memory in cycle 3, with the immediate address 20 being generated in cycle 2.

| imm    | lsu                 | alu                |
|--------|---------------------|--------------------|
| imm 10 | nop                 | nop                |
| imm 5  | lga WORD, out0, in0 | nop                |
| imm 20 | nop                 | add out0, in0, in1 |
| nopi   | sga WORD, in0, in1  | nop                |

Figure 2.10: Generated PASM code for the scheduled program.

Finally, once the whole program has been scheduled, the compiler translates the resource graph from Figure 2.9 into PASM by outputting an instruction for every FU node in the resource graph based on the opcode that the FU was tagged with, and by inspecting the input and output connections of the node (which were tagged with input/output port numbers earlier). The final PASM code is shown in Figure 2.10.
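For reference, the example program that was scheduled above (load from address 10, add 5, store to address 20) could be expressed at the LLVM IR level roughly as follows. The fixed addresses and the 32-bit integer type are taken from the running example; the function name and pointer casts are purely illustrative.

```llvm
; Sketch of the running example: mem[20] = mem[10] + 5
define void @example() {
entry:
  %src = inttoptr i32 10 to i32*   ; source address 10
  %dst = inttoptr i32 20 to i32*   ; destination address 20
  %val = load i32, i32* %src
  %sum = add i32 %val, 5
  store i32 %sum, i32* %dst
  ret void
}
```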

2.3 Vector programming models

We can generally categorize code generation for vectorized programs into two distinct programming models: a "CPU-style" and a "GPU-style". This section discusses the merits of both of these styles of vectorization as they pertain to the CGRA, specifically the Blocks platform.


Figure 2.11: Wide SIMD architecture from [16].

2.3.1 CPU-style

The (single-core) "CPU-style" of vectorization, in its simplest form, is implemented in a uniprocessor in a Single Instruction, Multiple Data fashion. The CPU-style is characterized by its use of low-level vector instructions with direct CPU support; for instance, Intel's Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) [9]. Such instructions are capable of addressing many registers in the Register File at once, or use specialized registers for vector operations. Another architecture that follows the CPU-style of vectorization is the wide SIMD architecture also in development at Eindhoven University of Technology [16], shown in Figure 2.11 and Figure 2.12. In this architecture, a number of vector Processing Elements run in lockstep, in parallel with a scalar Control Processor.

Usually, CPU-style vectorization involves tight loops performing simple operations on large sets of data; for instance, copying an array or applying a (convolution) filter. The SIMD nature of the vectorization complicates handling of edge cases or complex processing. Advanced compilers are capable of performing auto-vectorization, wherein the compiler automatically detects certain code structures and generates vectorized machine instructions for them. This auto-vectorization is based on an architecture-specific cost model, which gauges whether a given loop is worth vectorizing or not. However, the programmer can also use compiler directives at the source code level, such as #pragma clang loop in LLVM or #pragma GCC ivdep in GCC [7, 3], to provide hints to the compiler or to bypass the cost model and force a vectorization [11, 7]. The CPU-style of vectorization has the major benefit that it requires minimal effort on the part of the programmer whilst producing significant performance improvements [6].


Figure 2.12: Circular neighborhood communication network in the wide SIMD architecture from [16].

However, handling edge cases is somewhat more difficult, and this must usually be done separately.

In the Blocks-CGRA, the "CPU-style" of vectorization, or more generally, a Single-Instruction-Multiple-Data structure, can be realized by connecting a single Instruction Decoder unit to a range of Functional Units of identical type. Note that this feature is unique to the Blocks-CGRA and does not exist in other CGRA platforms [18]. Data is then routed into the vectorized units by specifying different connections for the inputs of each unit. In this fashion, for a set of vectorized Functional Units, the CGRA needs only fetch and decode a single instruction, using a single Instruction Decoder, for every attached vectorized Functional Unit. This setup is very similar to the wide SIMD architecture described in [16] and shown in Figure 2.11.

This has the effect of reducing energy costs, as redundant Instruction Fetch/Decode units are omitted. Furthermore, as multiple data units are being processed in parallel, this may also result in a speed increase that is inherent to vectorization. For instance, a CPU-style vectorized program with a vector width of 4 would use, in theory, only 25% of the Instruction Decoder energy of the scalar version, while processing 4 data units at a time in parallel. However, this comes at the cost of using 4 times as much energy for the vectorized Functional Units. Furthermore, Amdahl's law suggests that a 4× speedup is unlikely to be reached; not all parts of the program can be executed in parallel. As such, the Functional Units themselves, combined, are likely to use a bit more energy in a vectorized program than in a scalar program. Even so, reducing the Instruction Decoder energy usage would likely still save energy overall, as the Instruction Decoder and Immediate Unit are known to be some of the most energy-expensive units in the Blocks-CGRA, due to their increased instruction width (32 bits versus 12 bits). Moreover, the potential speed improvements are equally desirable.
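As an illustration of what CPU-style vectorization looks like at the compiler level, the sketch below shows the kind of LLVM IR an auto-vectorizer might produce for one iteration of an element-wise array addition with a vector width of 8. The value names and pointer types are illustrative and not taken from the thesis.

```llvm
; One vectorized iteration: c[i..i+7] = a[i..i+7] + b[i..i+7]
; %pa, %pb and %pc are assumed to point at the current 8-element chunk.
%va = load <8 x i32>, <8 x i32>* %pa, align 4
%vb = load <8 x i32>, <8 x i32>* %pb, align 4
%vc = add <8 x i32> %va, %vb
store <8 x i32> %vc, <8 x i32>* %pc, align 4
```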


Figure 2.13: A shuffle pattern < 3, 2, 0, 1 >, i.e. the value from index 3 (at ALU3) is moved to index 0 (at ALU0), etc., performed explicitly in wiring, using 4 vectorized ALUs.

As with a general-purpose CPU, on a CGRA this style of vectorization has the drawback that handling edge cases is nontrivial, since each vectorized unit receives the same instruction. It is currently not possible to skip instructions for certain units or make them otherwise behave differently from their neighbors, except by applying these differences through the input values; for instance, an edge case in a summation could be handled by supplying zero-values to those units that fall outside of the range of summation. The Blocks-CGRA currently has no means of temporarily disabling a Functional Unit outside of feeding it nop (no operation) instructions.

Furthermore, each vectorized unit generally only has local access to its own inputs and outputs, and cannot directly access the inputs of its neighbors. For instance, in the top of Figure 2.14, ALU0 cannot directly access input values v4, v5, v6 and v7, and ALU1 cannot directly access input values v0, v1, v2, v3. On the bottom, a bypass has been added from ALU1 to ALU0; now ALU0 can access v4, v5, v6 and v7 by passing them through ALU1, but it must give up one of its own input values; v3 can no longer be accessed. These two limitations make it difficult to execute vector shuffling operations, as each vectorized unit must perform the same PASS operation and cannot access arbitrary vector values. An arbitrary vector shuffling operation could still be performed by explicitly adding connections between the vectorized elements, as shown in Figure 2.13, but this comes at the cost of precious Functional Unit inputs. In the wide SIMD architecture, this is solved by adding a circular neighborhood network, as shown in Figure 2.12.

Another drawback for CPU-style vectorization on the Blocks-CGRA is that there is no way to directly load immediate vectors in constant time. This is useful for e.g. applying a convolution window or specifying different offsets for a set of Load-Store Units. By connecting vectorized units in a network, it is possible to load an immediate vector into a set of units using a single Immediate Unit, but this would have to be done one element at a time, and thus takes linear time. One could add a separate Immediate Unit per vector unit, but this introduces a lot of overhead. One way around this limitation is to prepare the constant vector(s) to be loaded in global memory prior to executing the program, though each vectorized Load-Store Unit will still need its own unique offset. Moreover, reading from global memory comes with its own performance implications.


Figure 2.14: Two vectorized ALU units in a Blocks-CGRA. Top: no bypass; bottom: bypass added.


Figure 2.15: Single Instruction, Multiple Thread architecture model.

2.3.2 GPU-style

The "GPU-style" of vectorization is implemented in specialized hardware, typically a GPU, usually in a Single Instruction, Multiple Thread fashion. It is characterized by the use of so-called kernels: specialized mini-programs written specifically by the programmer. The vector width or data size is then decided at runtime by configuring the kernel. The hardware consists of numerous threads, each containing their own Functional Units, registers and memory; threads are grouped into blocks with shared memory; finally, blocks are part of a grid with a global memory, which is accessible by the host computer. Figure 2.15 shows this architectural model. One example of GPU-style vectorization is CUDA, a programming model for NVIDIA GPUs [10].

As the vectorization is performed on specialized hardware, it supports many more features than a uniprocessor. A significant feature is divergence, which allows a set of threads to follow differing control flow paths [21]. This allows for if-then-else blocks to be executed inside a kernel, which can handle edge cases. Such control flow divergence is achieved by masking the instructions for the inactive control flow path(s), and using a specialized function (e.g. __syncthreads() in CUDA) to resynchronize threads if they do become desynchronized [10]. This allows for a much greater degree of flexibility than "CPU-style" vectorization. However, GPU-style vectorization requires that the programmer write a separate CPU control program and a GPU kernel, which requires considerably more effort from the programmer than simply adding #pragma statements in CPU-style vectorization, or indeed letting the compiler's auto-vectorization algorithms do their work.

In terms of the Blocks-CGRA, this style of vectorization could already be implemented using the current architecture, but would likely have limited benefits. As Blocks is not a specialized SIMT architecture, it lacks many of the features that make GPU-style vectorization practical, including latency hiding and multi-threading. Notably, Blocks does not support masking of control flow paths.

This means that any implementation of a GPU-style kernel, e.g. from CUDA or OpenCL, would inevitably require that each "core" feature its own set of Functional Units, each with their own Instruction Decoders and Program Counters. Using FIFO queues between cores, it is then possible to construct some form of synchronization, which allows for thread divergence [17]. At this point, one basically ends up with a general-purpose multi-core processor, which is considered out of scope for this project. While this could result in a lower execution time for parallelized programs, the costs of adding the extra units may outweigh the gains.

2.4 Common problems on Blocks

By examining vectorization as it is performed in current CPUs and GPUs and attempting to map this to a Blocks-CGRA, we can identify some common problems in Blocks that make the implementation of specialized vectorization support difficult.

One major issue, as discussed in Section 2.3.1, is that vectors on the Blocks-CGRA are disjoint; i.e., each FU that is part of a vector only has immediate access to its own part of the vector through its own inputs and outputs, and accessing those of neighbors requires extra wiring to be added, as seen in Figure 2.14. While this is not an issue for simple, straightforward programs such as an array copy, it quickly becomes a problem for more general programs such as a convolution filter, which requires accessing neighboring values. This problem is mainly caused by input saturation on the Functional Units in the Blocks-CGRA, which refers to the phenomenon that a Functional Unit has fewer input ports than we would ideally want in order to achieve direct connectivity with all other (relevant) units. This phenomenon is depicted in Figure 2.16, which shows a basic scalar Blocks configuration with a central Register File. No direct path exists from the LSU to the ALU, or from the MUL unit to the RF unit; in order for data to pass between these units, it must travel through unrelated units. In the case of vectorized units, as each unit can have only up to four different inputs, it is not possible for the vectorized units to be fully connected. To some degree, it is possible to work around this problem by adding bypass connections, but for more involved algorithms involving vector shuffles, some form of dedicated hardware for vector operations is needed.

A related issue is that of immediate vectors. In order to load an immediate vector into a set of units, one has a few options. One way is to prepare the constant vector(s) to be loaded, one element at a time, in global memory beforehand, and then load the vector in one go using parallel Load-Store Units. However, reading from global memory incurs a penalty, and the Load-Store Units still require unique offsets in the first place. Another option is to connect separate Immediate Units to each vectorized Functional Unit, which is expensive in terms of energy usage. Moreover, one could connect the vectorized units in a network, expose one (or more) of the units to a scalar input, and load the vector into the units slowly, one element at a time. This would take linear time, which can be quite slow when vector widths are large; ideally, we would want a method that scales a bit better in terms of performance and energy costs.

Finally, there is a distinct lack of support for vector operations in the current instruction set for Blocks, or indeed a lack of instructions that would be useful for vectorized applications at all.


Figure 2.16: A basic scalar Blocks configuration. The red nodes indicate a unit that is input saturated.

For instance, Blocks instructions have no conditional execution bit(s), and Blocks currently does not support masking, which prohibits the use of control flow divergence as in the GPU-style. Moreover, ALU instructions do not support immediate operands, which further drives input saturation, as the immediate operands now need to be routed through the network from some Immediate Unit.

As we shall find later, support for specialized vector operations such as shuffling, extraction and insertion on Blocks shall often come down to emulating these operations with conventional instructions in linear time. Moreover, implementing these operations will require specialized hardware in the sense that dedicated pseudo-units must be added to the program architecture.

Chapter 3

Compiler model

Currently, a Blocks compiler exists based on LLVM, which is only able to generate scalar code. In order to implement vector support, modifications must be made to the model used by the compiler, including to how the Blocks platform is represented in the compiler and to the procedures for lowering LLVM code to Blocks Parallel Assembly. This chapter focuses on the extensions made to the hardware model used by the compiler. The procedures for lowering code will be detailed in Chapter 5.

We shall first introduce the vector-specific operations that exist in LLVM, which must be translated to Parallel Assembly. We then describe the extensions made to the hardware model used by the compiler, and introduce a procedure to align the Blocks configuration better with the compiler model. Next, we specify how connections between scalar and vector parts of the architecture will be implemented, and we introduce the implementation of shuffle patterns into the hardware model. Following this, a number of pseudo-units are defined and introduced, with the goal of aiding code generation for the Blocks platform. Finally, we detail changes made to the Blocks hardware itself, specifically to the Load-Store Unit memory interface.

3.1 LLVM vector operations

As described in Section 2.2, adding support for a new target platform in the LLVM compiler architecture comes down to adding a target-specific back-end. The back-end is tasked with translating instructions encoded in the LLVM Intermediate Representation language [8] to the platform's machine code format. The LLVM IR language includes three unique vector-specific operations: shufflevector, extractelement and insertelement [8]. These operations are special in that they have no scalar equivalent; they can only be performed on vectors. In addition, adding support for vector data types requires that build_vector operations be supported. This is technically not part of the LLVM IR language, but is used internally in the compiler to represent vector constants and instantiations.

The current Blocks compiler supports translation of most of the scalar IR instructions.

However, in order to introduce full compiler support for vectorization, two goals must be realized: support for vector data types must be added to existing operations, and the three vector-specific operations must be translated to machine code, i.e. PASM.

3.1.1 build_vector

<result> = build_vector <ty> <s1>, <ty> <s2>, ..., <ty> <sn>    ; yields <n x ty>

A build_vector operation takes n scalar operands s1, s2, ..., sn, all having the same type ty, and returns a vector of type ty having length n. The operands are inserted at their respective indices, i.e. s1 goes into the first element, s2 goes into the second element, and so forth, with sn going into the last element. In the event that all operands are exactly the same, we refer to this as a splat, i.e. the single operand is "splatted" into all elements of the vector.
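There is no textual build_vector instruction at the IR level; it typically arises inside the compiler from vector constants or from chains of insertelement instructions. A sketch with illustrative values and names:

```llvm
; Both constant operands below may be represented internally as
; build_vector nodes during instruction selection; the second is a splat.
%a = add <4 x i32> %v, <i32 1, i32 2, i32 3, i32 4>
%b = add <4 x i32> %v, <i32 7, i32 7, i32 7, i32 7>
```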

3.1.2 shufflevector

<result> = shufflevector <n x ty> <v1>, <n x ty> <v2>, <m x i32> <mask>    ; yields <m x ty>

The shufflevector operation takes two vectors of type ty as operands, v1 and v2, each having the same length n. It also takes a mask vector of size m containing integers. This mask vector is a constant, i.e. known at compile time. The result of the operation is a vector of length m with elements taken from v1 and v2 according to the indices contained in the mask vector. For instance, a mask of {0, 1, 2, ..., n − 1} simply returns vector v1 as-is, whereas a mask of {0, 1, 2, ..., 2n − 1} returns the concatenation of v1 and v2. It is possible to supply vector v1 only and leave v2 as undef (undefined). Moreover, it is possible for the mask to contain undef source indices; if this is the case, then we simply do not care what data will end up in the result vector at the destination index.
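As an illustration (the value names are ours, not from the thesis):

```llvm
; Reverse a 4-element vector (single-input shuffle, second operand undef).
%rev = shufflevector <4 x i32> %v1, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>

; Concatenate two 4-element vectors into an 8-element result.
%cat = shufflevector <4 x i32> %v1, <4 x i32> %v2, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
```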

3.1.3 extractelement

<result> = extractelement <n x ty> <val>, <ty2> <idx>    ; yields ty

The extractelement operation takes a vector val having type ty and length n, as well as an index idx having type ty2, which can be any integer type. It returns the scalar element in the vector val at the index pointed to by idx. The idx index supplied to the extractelement instruction may be a variable, i.e. unknown at compile time.
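An illustrative use of extractelement (the value names are ours):

```llvm
; Extract element 2 using a constant index...
%x = extractelement <4 x i32> %v, i32 2
; ...or extract using a runtime index held in %i.
%y = extractelement <4 x i32> %v, i32 %i
```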

3.1.4 insertelement

<result> = insertelement <n x ty> <val>, ty <elt>, <ty2> <idx>    ; yields <n x ty>

The insertelement operation takes a vector val having type ty and length n, as well as an element elt having the same type as the vector elements; finally, it takes an index idx having type ty2, which can be any integer type. It returns a copy of the val vector with the element pointed to by idx replaced with the input element elt. Both the idx index and the elt element supplied to the insertelement instruction may be variables, i.e. unknown at compile time.
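An illustrative use of insertelement (the value names are ours):

```llvm
; Replace element 1 of %v with the scalar %s; the index could equally
; be a runtime variable such as i32 %i.
%w = insertelement <4 x i32> %v, i32 %s, i32 1
```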


Figure 3.1: Hardware configurations for Blocks nodes in scalar and vector fashion.


Figure 3.2: Vectorized Blocks nodes with ambiguous routing.

3.2 Hardware model

As the LLVM-based Blocks compiler is unable to produce a hardware configuration itself, one must be provided as input. The hardware configuration file, formatted in XML, specifies which Functional Units are used in the program and how they are connected to one another. In a scalar Blocks hardware configuration, each Functional Unit is connected to its own dedicated Instruction Decoder (ID), which is only used for that FU, as shown in Figure 3.1a. To specify a vectorized FU in a Blocks hardware configuration, a single ID unit can be connected to a series of FUs of the same type, as in Figure 3.1b. The result is that the same instruction is sent to all connected units, which then operate on a data set in tandem; i.e., Single Instruction, Multiple Data (SIMD).

In order for the compiler to properly deal with vector types, it must first be determined what the vector width of each node in the hardware configuration is. This could be observed by checking the connections between FUs; if several FUs of the same type are connected to one ID, then this could be considered a vector node. However, this may not always suffice when routing vector types through different nodes.


Figure 3.3: The revised Blocks toolflow, with the architecture expansion pass (shown in gray) added. Previously, the architecture XML was simply copied to the simulation step.

For instance, in Figure 3.2, the top and bottom nodes are vectorized nodes (with a single ID), which are connected to a set of FUs that each have their own ID. If we only look at the Instruction Decoder connections between nodes, then we would conclude that the nodes in the middle are not vectorized nodes. Even so, because the FU connections are lined up correctly, there is no problem with routing vector types through this (set of) nodes. As such, we would like to be able to set the vector width explicitly per node, as shown in Figure 3.1c. This allows the compiler to easily recognize, and make assumptions about, vectorized nodes in the Blocks configuration, without needing to closely examine the entire configuration file. However, for simulation and hardware synthesis purposes, we still require the full model of Figure 3.1b. Consequently, we introduce an additional architecture expansion pass in the Blocks toolflow.

3.3 Architecture expansion

The architecture expansion pass is an extra pass added in the Blocks toolflow, and is implemented as a separate tool from the LLVM-based compiler. The full toolflow with the architecture expansion pass added is shown in Figure 3.3. The pass serves two main functions:

1. Expand explicitly vectorized nodes into separate Functional Units with the proper connections;
2. Expand pseudo-unit constructs into native Functional Units supported by Blocks.

For function 1, any Functional Unit with an explicitly set vector width, as shown in Figure 3.1c, is expanded to a number of separate FUs equal to the vector width, as shown in Figure 3.1b. All connections to and from the explicitly vectorized FU are also re-routed such that they go to the respective expanded FUs. The intricacies of this re-routing are detailed in Section 3.4. Additionally, as stated in function 2, the architecture expansion pass allows us to define and use pseudo-unit constructs. This shall be further elaborated in Section 3.6.

3.4 Scalar-vector boundary

As the explicit vector width property in the architecture XML abstracts away the specific connections between individual FUs, rules must be defined for how these connections will be expanded during architecture expansion. In particular, we define the rules for the scalar-vector boundary, i.e. how scalar nodes connect to vector nodes and vice versa. In the following subsections, we shall refer to the producer and consumer nodes of the given connection type as A and B respectively, and the vector widths of A and B as N and M respectively.

3.4.1 Scalar to vector

When the output of a scalar node A is connected to the input of a vector node B, this is expanded to a broadcast connection; the output of A is connected to the inputs with the same input index as in the abstract node, for each of the individual nodes that B is expanded to. This type of connection is shown in Figure 3.4a. An alternative would be to simply connect the output of A to the first expanded node of B; however, as we shall see later, a broadcast operation is very useful in many situations when dealing with vectors.

3.4.2 Vector to scalar

When the output of a vector node A is connected to the input of a scalar node B, we must choose one of the individual nodes in the output of A that shall be connected to the input of B. Although an output can be connected to any number of inputs, one input can only be connected to one output. Therefore, we arbitrarily choose the first expanded node of A as the source for the scalar input of B, as shown in Figure 3.4b.

3.4.3 Vector to vector

When a vector node A is connected to another vector node B, we can identify three different situations. If N = M, then the individual expanded nodes are connected one-to-one, as shown in Figure 3.4c.

In the case that N < M, we repeat the output connections to the inputs. For instance, if N = 4 and M = 8, then the first four expanded nodes of B are connected to the four expanded nodes of A, and the fifth through eighth expanded nodes of B are also connected to the four expanded nodes of A in the same order. Note that this behavior is a more general case of the scalar-to-vector behavior as defined in Section 3.4.1, where N = 1; compare Figure 3.4a and Figure 3.4d.

Finally, if N > M, we take only the first M outputs in A and connect these to B one-to-one. Note that this behavior is a more general case of the vector-to-scalar behavior as defined in Section 3.4.2, where M = 1; compare Figure 3.4b and 3.4e.

In the general case, Equation 3.1 formally defines the expanded connections after architecture expansion, given some connection A.out[i] ↦ B.in[j] for i, j ∈ ℕ, i < 2, j < 4.

∀x ∈ ℕ, x < M : A[x mod N].out[i] ↦ B[x].in[j]    (3.1)
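As an illustration, the expansion rule of Equation 3.1 can be sketched as follows; the helper function below is hypothetical, not the actual architecture expansion tool.

```python
# Sketch of Equation 3.1: one abstract connection A.out[i] -> B.in[j] between
# nodes with vector widths N (producer) and M (consumer) expands to one
# connection per consumer element, wrapping the producer index modulo N.

def expand_connection(i, j, N, M):
    # Returns pairs ((producer_element, out_port), (consumer_element, in_port)).
    return [((x % N, i), (x, j)) for x in range(M)]

# Scalar-to-vector (N=1, M=4): a broadcast of A[0] to all elements of B.
print(expand_connection(0, 0, 1, 4))
# Vector-to-scalar (N=4, M=1): only A[0] is connected to B[0].
print(expand_connection(0, 0, 4, 1))
```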

Figure 3.4: Expanding scalar-vector boundary connections: (a) scalar-to-vector, (b) vector-to-scalar, (c) vector with N = M, (d) vector with N < M, (e) vector with N > M.

3.5 Shuffle patterns

On the Blocks platform, vectors are stored in a vectorized number of individual nodes equal to the vector width. Each of these individual nodes can technically also have its own individual connections. As such, the fastest way to produce a shuffled vector using a specific shuffle mask is to directly connect the output buffers of the individual elements of the input vector node(s) to the input ports of the individual elements of the destination vector node. Examples of this can be seen in Figure 2.13 for a shuffle using one input vector A and destination vector B, and in Figure 3.5 for a shuffle using two input vectors A and B and destination vector C.

However, by abstracting away the individual connections between vectorized nodes for the benefit of the compiler, we would lose the ability to represent this sort of hard-wired shuffle in the architecture configuration. The individual connections between vectorized nodes are added back in during the architecture expansion step, but this occurs separately from the compiler. Therefore, we add a mask property to FU input port definitions in the high-level architecture configuration XML, which allows us to explicitly define shuffle patterns in the connections between vectorized nodes.

In addition, we extend the functionality of the data source property for FU input ports. Normally, in the architecture configuration XML, each FU input port can specify a single other node as its source from which data will be received. We change the semantics of this property such that multiple other nodes can be specified as the source for an input port. This can then be combined with the added mask property to address (specific elements of) various data sources in order to produce a shuffled vector.

Such an explicitly shuffled connection is given a special status in the compiler back-end. Because the contents of the vector are altered (shuffled) whenever data passes through the connection, it is not suitable for standard operations like routing operands. Indeed, a shuffled connection will only be used in the context of vector shuffle operations, yet still takes up one of four inputs on each FU. Taking input saturation into consideration, the programmer must be mindful of where the hard-wired shuffle patterns are placed in the architecture configuration, as data cannot be routed through them normally.

In the example in Figure 3.5, the hard-wired shuffle pattern can now be represented as follows. For Functional Unit C, we set the source property of the input to both A and B, in that order. Next, we set the mask property of the input port to <1, 2, 3, 4>. The mask property specifies which elements in the source list are connected to the input ports of C. The indexing occurs on a concatenation of all the individual elements of the source list; in this case, {A0, A1, A2, A3, B0, B1, B2, B3}. Therefore, mask <1, 2, 3, 4> corresponds with reading from individual nodes {A1, A2, A3, B0}, as is shown in the figure. Note that this is equivalent to mask <5, 6, 7, 0> if the order of A and B in the source list were swapped.

If the mask property is not defined for some input port in the high-level architecture configuration XML, then it defaults to <0, 1, 2, 3, ..., N-1>, where N is the vector width of the input port (determined by the FU for which it is an input port). Should this produce indexes that exceed the length of the source list, then we wrap around the source list. For example, take a vector-to-vector connection A to B where |A| = 2 and |B| = 4, with no explicitly defined mask property. In this case, the mask property defaults to <0, 1, 2, 3>. However, the source list is {A0, A1}, which is only two elements. By wrapping around, the source list expands to {A0, A1, A0, A1}, for which mask <0, 1, 2, 3> is valid.

Note that this behavior is wholly consistent with the vector-to-vector connection rules specified in Section 3.4. In particular, the "wrap-around" essentially performs a modulo operation similarly to Equation 3.1 as defined in Section 3.4.3. Therefore, the added mask property and source list semantics are compatible with the scalar-vector boundary rules whilst adding support for explicitly wired shuffles.


Figure 3.5: A hard-wired shuffle on two inputs A and B, producing an output C, with mask <1, 2, 3, 4>. Note that this hard-wired shuffle can also be used for mask <5, 6, 7, 0> by swapping the positions of A and B.
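A sketch of how such a masked input port could be resolved to individual element connections during architecture expansion is shown below; the helper name is hypothetical, but the defaulting and wrap-around follow the rules just described.

```python
# Resolve a hard-wired shuffle input: `sources` is the ordered source list as
# (name, width) pairs and `mask` is the optional mask property of the port.
# Illustrative sketch, not the actual expansion tool.

def resolve_masked_input(sources, mask, dest_width):
    # Concatenate the individual elements of all sources, e.g. A0..A3, B0..B3.
    elements = [(name, k) for name, width in sources for k in range(width)]
    if mask is None:
        mask = list(range(dest_width))        # defaults to <0, 1, ..., N-1>
    # Wrap around the source list when an index exceeds its length.
    return [elements[m % len(elements)] for m in mask]

# Figure 3.5: sources A and B (width 4 each) with mask <1, 2, 3, 4>.
print(resolve_masked_input([("A", 4), ("B", 4)], [1, 2, 3, 4], 4))
# Default mask on a width-2 source feeding a width-4 port wraps around:
print(resolve_masked_input([("A", 2)], None, 4))   # A0, A1, A0, A1
```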

3.6 Pseudo-units

The separation of the Blocks hardware model into a high-level architecture and an expanded one allows for the use of pseudo-unit constructs. These are Blocks nodes that abstract some specialized functionality for the compiler, which can be constructed from FUs available natively in the Blocks-CGRA. Like explicitly vectorized nodes, these pseudo-unit nodes allow the compiler to recognize and make assumptions about the availability of certain higher-level behaviors and operations.

In this section we will detail two such pseudo-units which have been implemented in the compiler, namely the Vector Shuffle Unit (SHF) and Multiplexed Register File (RFM).

3.6.1 Vector Shuffle Unit (SHF)

A problem with handling vectors in vector-expanded Blocks architectures is that of vector locality. Each individual Functional Unit can only access its own input ports and output buffers, and accessing those of other elements in the vector requires explicit connections between those individual FUs. However, since each individual Functional Unit has only two output buffers and four input ports, as the vector width increases it quickly becomes impossible for the vector-expanded FUs to be fully connected. For instance, for an expanded node in a vector of 4 to have direct access to all other elements in the vector, three of the four available input ports must be used up, leaving only one input port for external data.

Furthermore, since vector nodes in the Blocks-CGRA (normally) share a single instruction decoder, it is not always possible to specify unique behavior for the individual FUs in the vector. The ALU supports a Conditional Move (CMOV) operation, but this would require that the vector index of every ALU be compared to a constant indicating the index of the ALU for which the behavior should deviate from the rest of the vector. This by itself uses up inputs, and also requires additional inputs to be used to make the vector connected.

These problems make it difficult to perform vector shuffles with arbitrary patterns. One option is to embed hard-wired shuffle patterns into the architecture, as seen in Figure 2.13. However, for this approach to work effectively, it must be known in advance which shuffle patterns will be used in the program to be compiled. Another option is to shuffle the vector via memory, by writing and reading it with particular load/store masks, which we shall consider further in Section 5.5.3.

We introduce the Vector Shuffle Unit (abbreviated as SHF), a pseudo-unit designed to mitigate these problems and allow for compiler support of any possible vector shuffle pattern. Pre-expansion, the Vector Shuffle Unit has a form as shown in Figure 3.6; it supports two external inputs and has one available output buffer. The expanded version of the Vector Shuffle Unit is shown in Figure 3.7 and consists of a network of ALU units connected in a circular fashion; the two remaining input ports in each FU are used to access the left and right neighboring elements, available through a reserved output port. This makes it very similar to the circular communication network in the wide SIMD architecture described in [16] and shown in Figure 2.12.


Figure 3.6: Vector Shuffle Unit pre-expansion.


Figure 3.7: Vector Shuffle Unit post-expansion.

Unlike the communication network in the wide SIMD architecture, however, each individual ALU in the Blocks circular network has its own associated Instruction Decoder unit; i.e., there are as many Instruction Decoders as there are ALUs.

The Vector Shuffle Unit solves the two problems outlined earlier by guaranteeing connectivity between all scalar nodes in the vector, and by the individual Instruction Decoders which allow for element-specific operations such as PASSes from arbitrary input ports. The compiler can then use these features to generate a series of instructions that produce a vector shuffle for any arbitrary shuffle pattern. As the Vector Shuffle Unit exposes two external input ports, it is suitable for the LLVM shufflevector IR operation, which also takes (up to) two vectors as input.
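To illustrate the kind of connectivity the expanded SHF provides, the sketch below models each element's view of the circular network and uses it to rotate a vector by one position, one of the simplest shuffles the unit can perform. This is only a conceptual model; the instruction sequences actually generated by the compiler are discussed in Chapter 5.

```python
# Illustrative model of the SHF's circular network for vector width n:
# element k can read its own external inputs as well as the outputs of its
# left neighbor (k-1) and right neighbor (k+1), modulo n.

def rotate_left_by_one(vec):
    n = len(vec)
    # Each element passes the value of its right neighbor to its output.
    return [vec[(k + 1) % n] for k in range(n)]

print(rotate_left_by_one([10, 20, 30, 40]))   # [20, 30, 40, 10]
```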

3.6.2 Multiplexed Register File (RFM)

When dealing with larger compiled programs on the Blocks-CGRA, especially vectorized programs, register pressure is an issue. Register spilling is a problem more difficult to solve on a CGRA than on a traditional processor, as operands are explicitly routed between Functional Units.


Figure 3.8: (Scalar) RFM pre-expansion.


Figure 3.9: (Scalar) RFM post-expansion.

Inserting any sort of spilling code would require that operands potentially be re-routed after scheduling, as the registers that need to spill to memory may have to pass through nodes which are already reserved for routing. Essentially, this invalidates the whole schedule. The current Blocks compiler has no support for register spilling or for multiple Register Files, so once the 16 available registers in the scalar Register File are used up, the program becomes unschedulable.

To mitigate this issue, we introduce another pseudo-unit, namely the Multiplexed Register File (RFM). Though the Blocks compiler has no support for multiple Register Files, we can bypass this by abstracting more than 16 registers into one single node. The RFM fulfills this goal by using an ALU with its configuration bit set to 0; this makes output port 0 of the ALU unbuffered, and therefore allows for PASS operations with zero-cycle overhead. As such, latencies for writing to and reading from the RFM are the same as with a regular Register File, and the RFM is designed to be functionally equivalent to a regular Register File, only with support for up to 64 registers rather than 16. The pre-expansion RFM is shown in Figure 3.8; for an RFM containing 4 RFs, which is the maximum, this expands to a form as shown in Figure 3.9.

During architecture expansion the individual RFs are generated. Each RF has its own individual Instruction Decoder; if the RFs were to share a single Instruction Decoder, it would not be possible to write to one register without also writing to three other registers in the process, as each RF would execute the same instruction, and there is no form of predication available in the RF instruction set architecture. The four input connections to the RFM are copied to all four RFs in the expanded RFM; this allows any of the total 64 registers to be written from any of the four RFM input ports at any time. Next, any connections from output port 0 of the abstract RFM are re-routed to use output port 0 of the first RF in the expanded RFM instead; this ensures that register r0 is always available on the lower output port, which mimics the same feature in a regular RF on the Blocks-CGRA. Moreover, output port 1 of every RF in the expanded RFM is routed to an additional ALU with an unbuffered output port 0; this ALU is tasked with forwarding the loaded register from the associated RF to its output port with zero-cycle delay. Any connections from output port 1 of the abstract RFM are then re-routed to use the (unbuffered) output port 0 of the ALU instead; this ensures that any of the 64 registers can be loaded without any delay, exactly as in a regular RF in the Blocks-CGRA.

In other words, the RFM has exactly the same features and semantics as a regular RF. Therefore, we can substitute an RFM for a regular RF in the compiler, which gives us access to 64 registers rather than 16. As the registers are still stored in a single (abstract) node, this is fully compatible with the Blocks scheduler, which supports only one RF location. The RFM supports anywhere between two and four internal RFs, allowing for 32, 48 or 64 registers, as needed. Compared to proper multiple-RF support in the scheduler, the additional hardware overhead is one additional ALU and associated Instruction Decoder, which perform the multiplexing needed when loading registers, as well as up to 5n − 4 extra wires, where n is the number of internal RFs used.

Currently, the Blocks compiler can only read the Blocks architecture configuration and convert it into a resource graph; it is not equipped to modify the configuration. As such, the compiler is unable to prune unused internal RFs in the RFM, or to replace the RFM with a normal RF altogether. Care must be taken by the programmer that the chosen RFM size does not result in wasted RFs.
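As a rough illustration of the multiplexing the RFM performs, the mapping from an abstract register number to an internal RF and a local register can be sketched as follows, assuming 16 registers per internal RF as in Figure 3.9; the helper name is hypothetical.

```python
# Sketch of how an RFM with `num_rfs` internal Register Files (2-4) maps an
# abstract register number onto (internal RF index, register within that RF).
# Illustrative only; the real wiring is fixed by the architecture expansion.

REGS_PER_RF = 16

def rfm_locate(reg, num_rfs):
    assert 0 <= reg < num_rfs * REGS_PER_RF, "register out of range"
    return reg // REGS_PER_RF, reg % REGS_PER_RF

print(rfm_locate(0, 4))    # (0, 0): r0 lives in RF0, always on output port 0
print(rfm_locate(37, 4))   # (2, 5): r37 is r5 of RF2, read via the mux ALU
```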

3.7 Vector indices

Due to how vectors on the Blocks-CGRA are stored and passed through the system, i.e. through a vector of Functional Units, a problem arises when dealing with vectors in shared memory. Traditionally, pointers are scalar types, e.g., a pointer to a struct points to the start of that struct in memory. Likewise, a pointer to a vector normally points to the first element in the vector. However, on the Blocks-CGRA, when loading a vector from, or storing a vector to shared memory, the memory operation will be performed by a series of vectorized Load-Store Units. As a result, each LSU in the vector must have a pointer to its own individual element in memory. This can be achieved by adding the index of each LSU in the vectorized set of LSUs to the base pointer as an offset. The result is that the scalar pointer is expanded to a vector of pointers, each pointing to an individual vector element in memory. This vector of pointers can then be passed to the vectorized LSUs to perform the memory operation correctly.

When writing a Blocks program manually in PASM, this expansion can simply be done once at the start of the program or before the main loop. The already expanded pointer can then be freely used throughout the program. However, LLVM follows the traditional notion that a pointer to some object in memory points to the start of that object. Therefore, a pointer in LLVM is always treated as a scalar. In other words, there is a pointer type mismatch between LLVM and the Blocks programming model.

To support vector memory operations on Blocks in LLVM, one option is to expand every memory operation such that the scalar pointer, which is used as an input parameter for the memory operation, is expanded to a vector of pointers, offset with each individual LSU's vector index. However, doing so for every memory operation would significantly hamper performance.

Instead, we propose supporting vector indices natively in hardware. We introduce an internal vector index register for Load-Store Units, which is transparently added as an offset when performing memory operations on shared memory. For some memory operation on shared memory, let O be the address operated on, and let A be the address that is specified. In the case of an explicit memory operation, A will be loaded from one of the LSU's input ports, whereas for an implicit memory operation, it is loaded from an internal LSU control register instead. We can then model the addressing behavior of the LSU (somewhat trivially) as follows.

O = A

That is, the address operated on (e.g. loaded from or written to) is exactly the address that was specified. We propose the following modification to the addressing behavior of the LSU, where V is the vector index of the individual LSU performing the memory operation, and D is the size of the data type; i.e., 1 for BYTE, 2 for HWORD and 4 for WORD.

O = A + V × D

With the revised behavior, it is now possible to transparently offset the memory address operated on by setting V, the vector index of the LSU. As a result, by setting V correctly in advance for each LSU in a vectorized set of LSUs, it is now possible to provide a pointer to the start of a vector in memory to the set of LSUs, and load or store the entirety of the vector from or to memory. This resolves the pointer type mismatch problem. Note that when V = 0, then O = A as before, preserving backwards compatibility with previous Blocks programs.

The data size D is embedded in LSU instructions and is therefore already available. Though a full multiplication is somewhat expensive to do in hardware, as the data size is always a power of 2, the multiplication can be synthesized as a small left-shift instead. Meanwhile, as the LSU already contains a full-adder for computing the next address following an implicit memory operation, this full-adder could be re-used for adding the vector offset.

We modify the LSU to add an additional internal register for the vector index and adjust the addressing behavior as specified above for shared memory. For local memory this addressing behavior is not needed, as the memory is unique to the individual LSU; writing vectors to or loading vectors from local memory already works correctly with the original addressing behavior, therefore we simply disable the addition for local memory.

As the vector index for each expanded LSU is known at the architecture expansion stage, we incorporate a step into the architecture expansion that sets the default value for the internal vector index register. For example, an LSU with vector width 4 is expanded to 4 separate LSUs, for which the vector indices are set to 0, 1, 2 and 3.

To support gather-scatter operations, we also introduce a new instruction to change the vector index at runtime, detailed in Section 4.2.1. By changing the vector indices before a load or store, the load or store can be gathered from or scattered to memory using an arbitrary pattern. We also introduce an instruction in Section 4.2.2 that resets the LSUs' vector indices back to their original value, as set during architecture expansion. Moreover, we also add functionality to the LSU to disable memory operations at runtime. The most significant bit of the vector index is used as a read/write enable bit; if it is set, then any reads from or writes to shared memory are skipped. This allows for the vector width to be changed at runtime, simply by disabling the unneeded LSUs, i.e. by setting their vector indices to a negative value.

In conclusion, with the addition of LSU index registers, the problem of pointer type mismatch between LLVM and the Blocks programming model is resolved whilst also adding hardware support for gather-scatter operations and dynamically sized vectors.
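The revised addressing behavior, including the masking via the sign bit of the vector index, can be summarized in a short sketch; this is an illustrative model, not the hardware description.

```python
# Sketch of the revised LSU addressing for shared memory: the effective
# address is offset by vector_index * data_size, and a negative vector index
# masks the access entirely.

def lsu_shared_access(address, vector_index, data_size):
    """Return the effective address, or None if the access is masked."""
    if vector_index < 0:               # MSB used as a read/write disable bit
        return None
    return address + vector_index * data_size   # synthesizable as a left-shift

# Four LSUs (indices 0..3) loading a WORD (4 bytes) vector starting at 0x100:
print([lsu_shared_access(0x100, v, 4) for v in range(4)])
# Disabling the last two lanes shrinks the effective vector width to 2:
print([lsu_shared_access(0x100, v, 4) for v in (0, 1, -1, -1)])
```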

Chapter 4

Instruction set extensions

To facilitate the use of common vector operations on Blocks, a number of extensions to the Blocks instruction set are proposed. Though the additional instructions introduced in this chapter are mostly intended to improve performance of vector operations, they also have use in scalar programs, as shall be outlined.

4.1 ALU instructions

4.1.1 ADDI: Add Immediate

The Add Immediate (ADDI) instruction adds the value from the specified input port and the specified immediate value together, and writes the resulting value to the specified output buffer. To fit the immediate value in the instruction encoding, only 4 bits are allocated to store it, which would allow a range of 0 through 15. However, since adding 0 is equivalent to a PASS operation, we can exclude 0 from the allowed range by also enabling the carry bit on the ALU during the addition, which shifts the range of immediates to 1 through 16.

Usage

ADDI out,imm,in    ; out := in + imm

Parameters

• out: Specifies the output buffer to which the result will be written. This parameter can be any of the Functional Unit's available outputs.
• imm: Specifies the immediate value to be added to the input value. The immediate value can be any integer in the range of 1 through 16.
• in: Specifies the input port from which the input value will be read. This parameter can be any of the Functional Unit's available inputs.

Rationale

The Add Immediate instruction can be used to add small values to an input, which provides key benefits in various situations. The main benefit of the ADDI instruction is to remove strain on the Immediate Unit(s) and free up inputs in the ALU.

• Vectorization: Perhaps the greatest benefit of the ADDI instruction comes into play when dealing with vectorized architectures, especially those with a Vector Shuffle Unit. A vector algorithm frequently makes use of a series of LSUs to read values, and each of these LSUs needs its own offset, which is usually quite small. A Vector Shuffle Unit can produce these offsets without the need for additional Immediate Units or computationally expensive insertelement and/or shufflevector operations. In the more general sense, a Vector Shuffle Unit can be used to produce any desired vector of (small) immediates. This is extremely helpful for vector algorithms that use per-element coefficients, such as convolution with a non-standard convolution window.

• Loop counter incrementing: In case an ALU is responsible for keeping track of a loop counter (e.g. by means of a self-edge from one of its outputs to one of its inputs), the ADDI instruction allows the ALU to increment the loop counter without needing to connect an extra input, from which the increment can be read. This frees up an input that can be used for other purposes. Loop counter incrementing can also be performed by the Zero-Overhead Loop Accelerator [13], but the ADDI instruction is still useful in the case that multiple counters need to be incremented across loop iterations, or need to be incremented conditionally (e.g. in a counting algorithm).

• Immediate generation: Combined with the Logic Shift Left 4-bit (SHLL4) instruction, the ADDI instruction allows for the generation of arbitrary immediates with only a single self-edge. This is performed by first resetting the output buffer to zero, which can be achieved with a XOR instruction on the same inputs. Next, the ALU alternates between ADDI and SHLL4 to produce 4 bits of the immediate value at a time, until the desired value is obtained. This procedure runs in, at most, max(1, 2n) cycles, with n = ⌊log₂(x)⌋, where x is the desired immediate value. The cycle count can be reduced by skipping any zero nibble in the desired value. It should be noted that, for many practical purposes, 4-bit or similarly small immediates may already suffice for many algorithms, particularly those that use small coefficients. For instance, image filters such as blur and edge detection frequently use small coefficients. Generating such immediates in an ALU reduces the strain on the Immediate Unit or even renders it unnecessary, which is beneficial as an Immediate Unit is rather costly in energy and area usage in the current Blocks hardware. This is because the program memory of an Immediate Unit has a width of 32 bits to store the full range of immediates, whereas other Functional Units have a program memory width of only 12 bits. In such situations, it takes only 2 cycles rather than 1 cycle to produce the desired immediate. Indeed, the first cycle (clearing the output) can usually be performed in advance, which removes the penalty on running time.

• Offset calculation: When coupled with a Load-Store Unit, the ADDI instruction can be used to simplify offset calculation. In situations where implicit addressing is not always ideal, e.g. when reading or writing structs with mixed data types, the ADDI instruction provides a quick way to make small, varying increments to the LSU address without the need for an Immediate Unit.

Emulation on prior Blocks hardware

An ADDI instruction could previously be emulated by attaching an Immediate Unit to an ALU. However, this uses up an additional input on the ALU, which the ADDI instruction avoids.
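As an aside, the immediate range trick described for ADDI above can be sketched as follows; the encoding shown (storing imm − 1 in the 4-bit field and relying on the carry bit for the extra +1) is an assumption used purely for illustration.

```python
# Sketch of the ADDI immediate range trick: a 4-bit field normally encodes
# 0..15, but adding with the carry bit set makes the effective immediate
# field + 1, i.e. 1..16, so the useless "add 0" case is not wasted.

def addi(value, imm_field):
    assert 0 <= imm_field <= 15, "only 4 bits are available in the encoding"
    return value + imm_field + 1      # carry bit contributes the extra +1

def encode_immediate(imm):
    assert 1 <= imm <= 16, "ADDI supports immediates 1 through 16"
    return imm - 1                    # hypothetical field encoding

print(addi(100, encode_immediate(16)))   # 116
```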

4.1.2 SUBI: Subtract Immediate

The Subtract Immediate (SUBI) instruction subtracts the specified immediate value from the value read from a specified input port, and writes the resulting value to the specified output buffer. Similarly to the ADDI instruction, the allocated space for the SUBI immediate is 4 bits, which allows for a range of 1 through 16 when making use of the borrow flag already present inside the ALU.

Usage

SUBI out,imm,in    ; out := in - imm

Parameters

• out: Specifies the output buffer to which the result will be written. This parameter can be any of the Functional Unit's available outputs.
• imm: Specifies the immediate value to be subtracted from the input value. The immediate value can be any integer in the range of 1 through 16.
• in: Specifies the input port from which the input value will be read. This parameter can be any of the Functional Unit's available inputs.

The imm parameter is intentionally chosen as the second parameter rather than the last parameter, as this is consistent with the current SUB instruction, where the second and third parameters represent the right-hand side and left-hand side of the operator respectively.

Rationale

The Subtract Immediate instruction can be used to subtract small values from an input. This is useful in similar situations to the Add Immediate instruction. As with the ADDI instruction, the SUBI instruction frees up an input in the ALU from an Immediate Unit.

• Vectorization: The SUBI instruction is not quite as useful on vectorized architectures as the ADDI instruction. However, the benefits listed above still extend to vectorized algorithms. In algorithms that use negative constants or coefficients, the SUBI instruction can be an effective way to generate small negative immediates.

• Loop counter decrementing and checking: In case an ALU is responsible for keeping track of a loop counter that decreases to 0, e.g. by means of a self-edge from one of its outputs to one of its inputs, the SUBI instruction allows the ALU to decrement the loop counter as well as check whether 0 is reached in one go. Since the SUBI instruction writes to an output buffer, this output buffer can be connected to the input of an Accumulate-and-Branch Unit, and used with a conditional branch instruction to jump back to the start of the loop. If the result of the SUBI instruction is 0, then the conditional branch is not taken and the loop terminates. In addition, this can be done without needing to add an extra input from which the decrement is read. This method of loop counter decrementing and checking allows for a powerful single-cycle loop construction for simple algorithms. Loop counter decrementing can also be performed by the Zero-Overhead Loop Accelerator [13], but the SUBI instruction is still useful in the case that multiple counters need to be decremented across loop iterations.

• Immediate generation: In a similar fashion to the ADDI instruction, the SUBI instruction can be used to generate small negative values more efficiently. This is potentially quite useful for algorithms that use small coefficients.

• Offset calculation: Similar to the ADDI instruction, the SUBI instruction can be used to make small adjustments to offsets for a Load-Store Unit. This could be used, e.g., when a list is being read and a rewind needs to be performed, for instance when performing a bubble sort.

Emulation on prior Blocks hardware

A SUBI instruction could previously be emulated by attaching an Immediate Unit to an ALU. However, this uses up an additional input of the ALU, which the SUBI instruction avoids. If only the ADDI instruction is available, a SUBI can also be emulated by first performing a NOT on the input, then an ADDI, and finally another NOT. However, this again requires an additional input (in this case, a self-edge), and runs slower than when an Immediate Unit can be used directly.

4.2 LSU instructions

4.2.1 SVX: Set Vector Index

The Set Vector Index (SVX) instruction changes the internal vector index register for the LSU, as specified in Section 3.7, at runtime. The value of the vector index is set to the value that is specified on the input port. If the new vector index is negative, then memory operations performed on this LSU are masked going forward, until the vector index is changed to a positive value.

Usage

SVX in    ; vector index := in

Parameters

• in: Specifies the input port from which the input value will be read. This parameter can be any of the Functional Unit's available inputs.

Rationale

The main use of the Set Vector Index instruction is to support gather-scatter operations. By changing the vector indices of the LSUs at runtime, a load from or write to memory can be gathered or scattered over any arbitrary pattern.

Emulation on prior Blocks hardware

LSU index registers did not previously exist on the Blocks hardware, so there was no equivalent for the Set Vector Index instruction. However, gather-scatter operations could be performed by manually adding offsets to a vector of pointers.

4.2.2 RVX: Reset Vector Index

The Reset Vector Index (RVX) instruction changes the internal vector index register for the LSU at runtime. The value of the vector index is reset to the value that was specified as the default value in the architecture XML, which is set during architecture expansion.

Usage

RVX    ; vector index := default vector index

Parameters

The Reset Vector Index instruction does not use any parameters.

Rationale

The Reset Vector Index instruction pairs with the Set Vector Index instruction to reset vector indices that were changed at runtime with the SVX instruction back to their original values. As the RVX instruction does not use any parameters at all, there is no need to perform operand routing or generate immediate vectors, which makes this a convenient method to undo changes to vector indices.

Emulation on prior Blocks hardware

LSU index registers did not previously exist on the Blocks hardware, so there was no equivalent for the Reset Vector Index instruction.

4.2.3 LVD: Load Vector Default Index

The Load Vector Default Index (LVD) instruction writes a value to the specified output buffer. The value that is written is the default value for the vector index of the LSU that was specified in the architecture XML, set during architecture expansion. Even if the Set Vector Index instruction was used to change the vector index at runtime, the Load Vector Default Index instruction will produce the original value of the vector index before any changes were made.

Usage

LVD out    ; out := default vector index

Parameters

• out: Specifies the output buffer to which the result will be written. This parameter can be any of the Functional Unit's available outputs.

Rationale

The Load Vector Default Index instruction has a limited, yet crucial use. As the architecture expansion pass initializes the default vector indices of an expanded LSU with vector width N to 0, 1, 2, 3, ..., N − 1, the LVD instruction can be used to immediately produce an index vector <0, 1, 2, 3, ..., N − 1> without any operand routing. This index vector is a crucial building block for several vector-specific operations, as it allows for element-specific behavior in a vector of ALUs via the EQ and CMOV instructions; this shall be further detailed in Chapter 5. By utilizing the fact that vector indices can be reset to exactly this index vector, we introduce the LVD instruction to efficiently produce this index vector at any time.

Emulation on prior Blocks hardware

An LVD instruction could previously be emulated by attaching a series of Immediate Units to the consuming units. However, this is quite costly in terms of hardware, and uses up an additional input on the consuming units. With the LVD instruction, existing LSU units and their connections can be used instead.

An alternative is to pre-load each LSU's index into local memory before the program is executed, so that the indices can be read with no stall cycles during execution of the program. However, this requires memory to be allocated and an address operand to be routed to the LSUs, which may not always be convenient or efficient to schedule.

Chapter 5

Compiler vector support

In order to actually implement support for vector data types and operations in the Blocks compiler back-end, a number of changes must be made to the code lowering (instruction selection) and scheduling stages. This chapter shall describe the changes made to these processes, and the procedures by which high-level LLVM vector operations are lowered to PASM code. We shall first detail how the storage and routing of vector data units is implemented in the code scheduling stage. This allows for the scheduling of all vector operations that have a scalar equivalent. Next, procedures are given for building arbitrary vectors, and for lowering each of the remaining vector-specific LLVM operations introduced in Section 3.1. This includes insertelement, extractelement and finally shufflevector. Various procedures are given for shufflevector in particular, which have varying requirements and benefits.

5.1 Storage and routing

On most traditional processors, the list of instructions to be executed is one-dimensional. Only one Instruction Fetch/Decode unit exists in the system, which reads from a singular Instruction Memory. However, on the Blocks-CGRA, the situation is different; the CGRA is reconfigurable, and any number of Instruction Fetch/Decode units may be present in the system. For an instruction to be scheduled, choices must be made regarding not only when the instruction will be scheduled, but also which Functional Unit will execute the instruction, and how the input parameters will be routed through the system to the Functional Unit which will execute the instruction.

In Section 2.2.2, it was described how the Blocks compiler converts the architecture configuration into a cycle-by-cycle resource graph, and schedules instructions onto the resource graph by routing input parameters through the system and selecting a suitable FU that is reachable. So far, this scheduler has only supported scalar architectures and data types.

In order to add support for vectorized programs, we can identify two routes we can take. One is to expand all vectorized operations in the program back to scalar operations and schedule these onto the resource graph as we would a regular scalar program. Following this, all instructions scheduled onto the individual vector element nodes would be combined back into a single instruction.

However, this option is not very feasible due to a number of factors. In order for the instructions to be recombined into a single instruction, the instructions for each vector element node must be exactly the same across every cycle: opcodes, as well as which input ports are being read from and which output ports are written to. This also goes for any routing operations for vector operands. This is because all vector element nodes make use of the same IF/ID unit and instruction memory. If any one instruction is off, then the schedule is invalid. Of course, it would be possible to mitigate this by giving each vector element node its own individual Instruction Decoder, but at that point we are back to Single Instruction, Single Data; i.e., we are not really working with vectors anymore. Moreover, by scalarizing the program it becomes unclear where a certain vector variable is stored at any given time, as it may have been only partially scheduled or the elements may be strewn about other units in the system. Additionally, by scalarizing the program we lose some vector-specific operations like shufflevector, whereas efficient execution of such operations is one area where vectorized programs can gain speedups over scalar programs.

Mainly because of the necessary synchronization across all vector element nodes, it makes more sense to consider the vector nodes as a whole. Therefore, in order to add support for vectorized programs, the first step is to introduce support for vector architectures and data types to the resource graph and scheduling algorithms.

The architecture expansion pass provides a key benefit to the compiler here. As described in Section 3.3, the architecture configuration that is fed to the compiler now has explicitly defined vector widths for each FU node, which are automatically expanded to a series of FUs for simulation and synthesis in a well-defined manner. As such, we can use this information in the compiler in order to know which nodes vector data types may be stored in and routed through.

We allow vector data types to be stored in FU nodes that have at least the vector width required. For instance, a vector having a width of 2 may be stored in a node that has a width of 2, or a width of 4, but not in a node that has a width of 1, as then only one of the two elements in the vector could be stored. Because the Blocks scheduler stores instruction output in the executing FU node rather than in an output buffer directly, this also allows operations that produce a vector as output to select a vectorized FU as the executing FU. We achieve this by adding a condition to the scheduler which checks that the width of an FU node is at least the width of the output vector of the instruction in order for the FU to be marked as suitable for executing the instruction.

For routing, we apply the same principle: data types may be routed through nodes that have at least the required width. In this case, note that a vector data type retains the same vector width regardless of the width of the node(s) that it passes through. We can apply the rules defined in Section 3.4 to verify that this does not result in problems. We route a vector in A to its destination C, both having the same vector width, i.e. |A| = |C|. The data is routed through node B, which may have a different vector width. By applying the vector-to-vector connection rules as specified in Section 3.4.3 for various vector widths for B, we obtain Figure 5.1.
When |B| = |A| as in Figure 5.1a, all data in A can pass through B to C safely. This is also the case when |B| > |A| as we see in Figure 5.1b. However, as seen in Figure 5.1c, when |B| < |A|, some elements from A are lost when routed through B. In this example, there are no paths from A2 and A3 to C2 and C3 respectively.

Figure 5.1: Routing a vector through different node sizes: (a) same width, (b) larger width, (c) smaller width.

We can therefore conclude that our method of allowing vector data types to be routed through any node with the same width or higher is sound. Routing through a node with a smaller width results in data loss, so we disallow this.

By applying these rules to the resource graph and scheduler, it now becomes possible to schedule most standard instructions which use vector(s) as input parameters and/or produce a vector as output, given that the instruction already had a supported scalar equivalent, and given that the instruction produces a result. For instance, the scalar add operation was previously supported in the scheduler, and a vectorized add operation is now similarly supported. In this case, the scheduler will route the input vectors through the system as defined in this section, and will select any FU node with at least the width of the output vector as the executing FU node, thereby scheduling the instruction onto the resource graph.
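The two width checks added to the scheduler can be summarized in a short sketch; the helper functions are hypothetical, and the real checks operate on the resource graph described in Section 2.2.2.

```python
# Sketch of the width rules added to the scheduler:
#  - an FU node may execute an instruction (and store its result) only if the
#    node is at least as wide as the instruction's output vector;
#  - a vector operand may be routed through a node only if the node is at
#    least as wide as the operand, since a narrower node would drop elements.

def can_execute(fu_width, output_width):
    return fu_width >= output_width

def can_route_through(node_width, operand_width):
    return node_width >= operand_width

print(can_route_through(4, 2))   # True: width-2 vector through a width-4 node
print(can_route_through(2, 4))   # False: elements A2/A3 would be lost
```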

5.2 Building vectors

One operation that is frequently required in generating code for vectorized programs is building arbitrary vectors; for instance, when declaring a constant vector, or casting a series of variables to a vector. LLVM can usually expand this operation to a series of insertelement instructions that insert the elements of the vector to be built one by one. However, for certain types of vectors we can improve performance by using a base vector obtained through other means. For example, filling a vector with copies of the same element can be done much faster than by using insertelement instructions, as will be described in this section.

When building arbitrary vectors, we can identify two steps: (a) producing a base vector, and (b) inserting the remaining elements which do not correspond with the base vector.

5.2.1 Base vector

To compute and produce the base vector, we first calculate three similarity counts on the result vector V. s_const, defined in Equation 5.1, is a count of how many elements in V are constant values. s_index, defined in Equation 5.2, counts how many elements in the vector match the index vector {0, 1, 2, ...}. Note that this also means the elements must be constants, therefore s_index ≤ s_const. Finally, s_splat, defined in Equation 5.3, computes the number of occurrences of the maximally occurring element in V.

s_const = |{i | V_i is a constant}|    (5.1)

s_index = |{i | V_i = i}|    (5.2)

s_splat = max_{x∈V} |{i | V_i = x}|    (5.3)

We then compare the three computed similarity counts to find the maximum. Based on the maximum similarity value, one of three methods is chosen to generate a base vector B. If two similarity counts are equal, the method with the higher priority is chosen. The three methods are as follows, ordered by priority from highest to lowest.
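Before detailing the three methods, the similarity counts and the priority-based selection can be sketched as follows; this is an illustration of Equations 5.1 through 5.3 and the selection rule, not the compiler code, and models constants as ints and variables as any other value.

```python
# Sketch of the base-vector selection: compute the three similarity counts on
# the desired result vector V and pick the method with the maximum count,
# with ties broken by priority (index > splat > immediate generation).

from collections import Counter

def choose_base_method(V):
    s_const = sum(1 for x in V if isinstance(x, int))
    s_index = sum(1 for i, x in enumerate(V) if isinstance(x, int) and x == i)
    s_splat = max(Counter(V).values())
    ranked = [(s_index, "index vector"),
              (s_splat, "splat vector"),
              (s_const, "immediate generation")]
    return max(ranked, key=lambda t: t[0])[1]   # max() keeps the first on ties

print(choose_base_method([0, 1, 2, "a"]))       # index vector
print(choose_base_method(["y", "y", "y", 3]))   # splat vector
print(choose_base_method([7, 1, 0, 9]))         # immediate generation
```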

Index vector

If s_index is the maximum similarity count, this means the vector V which we want to build is very similar to the index vector {0, 1, 2, ...}. Therefore, we use the index vector as the base vector, i.e. B = {0, 1, 2, ..., |V| − 1}. Note also that s_index ≤ s_const; therefore, s_index = s_const in this situation, but s_index takes priority.

The index vector is produced simply by taking the output of the LVD instruction on a vectorized LSU with a suitable vector width. As was defined in Section 4.2.3, the LVD instruction outputs the default value for the vector index of the LSU that executes it. Because the compiler uses the architecture configuration before the architecture expansion pass is run, the default value for each LSU's vector index (and thereby the result of the LVD instruction) matches the index of the individual scalar LSU in the vectorized LSUs. As such, we can safely make the assumption that the LVD instruction will output {0, 1, 2, ..., |V| − 1} for a vectorized LSU with width |V|, which matches our desired base vector B.

Splat vector

If s_splat is the maximum similarity count, this means that a single element y occurs many times in V. Here y, the maximally occurring element in V, is defined as y = arg max_{x∈V} |{i | V_i = x}|, which mirrors the definition of s_splat. We take as the base vector B = {y, y, y, ...} such that |B| = |V|. In other words, we take the vector with size |V| where every element is y. This is also commonly referred to as splatting y into a vector of size |V|, where the resulting vector is called the splat vector of y.

We can obtain the splat vector quite easily, simply by applying the scalar-to-vector rule as defined in Section 3.4.1; any connection of a scalar node to a vector node is expanded to a broadcast, where the value in the scalar node is connected to all individual elements of the vector node. In this sense, a splatting operation on a vectorized Blocks configuration is not an explicit operation, but rather an implicit one; it is simply a byproduct of routing scalar data to vectorized nodes. As a result, producing the base vector B is effectively a zero-cycle operation, as y will automatically be splatted as soon as it is routed to a vector node.

Immediate generation

Finally, if s_const is the maximum similarity count, this means V mainly consists of constants or contains more constants than occurrences of the maximally occurring element. As such, we construct a base vector B which contains only the constants in V, and has a 0 for all elements where the corresponding element in V is a variable. Formally, B is defined in Equation 5.4 as the result vector of applying as_constant(x) (Equation 5.5) to every element x ∈ V.

B = as_constant(V)    (5.4)

as_constant(x) = x if x is a constant, 0 otherwise    (5.5)

The resulting base vector B is a vector containing only constants. We generate this vector through a process of immediate generation on the SHF. As such, if there is no SHF node in the system, we disregard s_const and choose one of the other two methods.

We can use the SHLL4 (Logic Shift Left 4-bit) and XOR (Exclusive OR) instructions in the instruction set architecture for ALUs in the Blocks-CGRA, as well as the added ADDI (Add Immediate) and SUBI (Subtract Immediate) instructions, described in Section 4.1, in order to generate arbitrary immediate values on an ALU without the need for an Immediate Unit. We start by executing a self-XOR: a XOR instruction where the two inputs are set to the same port. XORing a value with itself always produces an output of 0, so the purpose of this self-XOR is to initialize the immediate to 0. Next, we use the ADDI or SUBI instruction to add or subtract a 4-bit immediate to the immediate being generated. Then, if the immediate is not yet done, we use the SHLL4 operation to left-shift the immediate by 4 bits. We can then repeat the process of adding 4-bit immediates and shifting the result to fill in the hexadecimal digits of the desired immediate one by one.

Suppose we would like to generate the immediate 4611, which is 0x1203 in hexadecimal. The chain of instructions for this immediate would be as follows:

1. XOR an arbitrary input port with itself to obtain the value 0x0.
2. ADDI 1 to the previous result to obtain the value 0x1.
3. SHLL4 left-shift the previous result to obtain 0x10.

4. ADDI 2 to obtain 0x12.
5. SHLL4 to obtain 0x120.
6. SHLL4 to obtain 0x1200. (We can skip the ADDI operation as adding 0 is equivalent to a PASS, which is not useful.)

47 7. ADDI 3 to obtain 0x1203.

Thus, we generate the immediate 4611 in 7 cycles. This, of course, is slower than simply using an Immediate Unit, where any 32-bit immediate can simply be generated in a single cycle.

However, if there is a SHF node in the system, then we can perform immediate generation for the entire vector in tandem. For example, if we want to generate a vector of 8 immediates where each immediate takes (at most) 7 cycles to generate, then we only need 7 cycles to generate the entire vector. This is possible because the SHF node is expanded to a set of ALUs which each have their own Instruction Decoder, making them a perfect fit for immediate generation. As such, if the immediates to be generated are fairly small, and the vector size is large, then immediate generation on the SHF can yield far better performance than producing the immediates on a scalar Immediate Unit and inserting them one by one; in the example situation, this would result in 7/8th of a cycle per immediate using a SHF, and 1 cycle per immediate using an Immediate Unit. Having a vector of Immediate Units in the system with the correct width would still produce a vector of immediates faster, as the entire vector could be output in a single cycle; however, Immediate Units are one of the more expensive units in the Blocks-CGRA in terms of energy usage due to having a much wider instruction memory, namely 32 bits versus the standard 12 bits, so using immediate generation on a SHF avoids the energy and area overhead of having a vectorized IU.
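The instruction chain for an arbitrary immediate can be sketched as a small generator. This mirrors the XOR/ADDI/SHLL4 procedure above, skipping zero nibbles; it is illustrative only and not the compiler's implementation, and the operands of the real instructions (self-edge input, output buffer) are abbreviated.

```python
# Sketch of nibble-by-nibble immediate generation on an ALU: clear the running
# value with a self-XOR, then alternate ADDI (one hexadecimal digit at a time)
# and SHLL4, skipping the ADDI for zero nibbles.

def immediate_chain(value):
    assert value >= 0
    ops = ["XOR (self)"]                      # initialize the value to 0
    nibbles = [int(d, 16) for d in format(value, "x")]
    for pos, nib in enumerate(nibbles):
        if nib != 0:
            ops.append(f"ADDI {nib}")         # add the next hexadecimal digit
        if pos != len(nibbles) - 1:
            ops.append("SHLL4")               # make room for the next digit
    return ops

chain = immediate_chain(0x1203)
print(len(chain), "cycles:", chain)           # 7 cycles, as in the example above
```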

5.2.2 Remainder insertion

Following creation of the base vector, it remains to insert any elements that are not present in the base vector. To achieve this, we generate a set of (element, index) pairs R_insert as defined in Equation 5.6. For each (element, index) pair (x, i) ∈ R_insert, we then append insertelement instructions where the input vector val is the base vector B for the first instruction, or the result of the previous instruction for subsequent inserts; the element to be inserted, elt, is x; and the index to insert into, idx, is i. Each insertelement instruction is lowered as described in Section 5.3.

R_insert = {(x, i) | x = V_i ∧ x ≠ B_i}    (5.6)
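A sketch of the remainder computation and the resulting chain of insertions is shown below; the helper is hypothetical, with constants as ints and variables modeled by any other value, as in the earlier sketch.

```python
# Sketch of Equation 5.6: collect the (element, index) pairs that differ from
# the base vector, then chain one insertelement per remaining element.

def remainder_insertion(V, B):
    r_insert = [(x, i) for i, x in enumerate(V) if x != B[i]]
    vec = list(B)
    for x, i in r_insert:                # one insertelement per pair
        vec[i] = x
    return r_insert, vec

V = [0, 1, "k", 3]                       # desired vector; "k" models a variable
B = [0, 1, 2, 3]                         # index base vector from the LVD output
print(remainder_insertion(V, B))         # ([('k', 2)], [0, 1, 'k', 3])
```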

5.3 Insert element

On the Blocks-CGRA, vector data types are stored in a vectorized set of Functional Units. However, this approach has the problem of vector locality, as described in Section 3.6.1. Because of this, it is not possible to read or write a vector element from/to an arbitrary index directly on the Blocks-CGRA. In order to implement the insertelement instruction, as described in Section 3.1.4, we take a different approach. We split up the insertion into two steps:

1. For every individual unit in a vectorized FU, check if the vector index of that unit matches the index of the element we want to insert.

48 2. Based on the result of the previous step, output either the element from the original vector (if the indices did not match), or the new element to be inserted (if the indices did match).

The only Functional Unit in the Blocks-CGRA that supports these operations is the ALU, so we schedule the insertelement instruction onto the vectorized ALU. For step 1, we use the EQ operation, with as its two parameters the splat vector of the index to insert into, and the index vector {0, 1, 2, ...}. This makes use of the fact that splatting is an implicit operation with no additional overhead aside from routing, as well as the fact that the LVD instruction can be used to produce the index vector at any time, as described in Section 5.2.1. The EQ operation in step 1 sets the ALU flag if and only if the insertion index matches the vector index of the FU.

For step 2, we use the CMOV instruction. As the false-parameter for the CMOV instruction, we use the input vector of the original insertelement instruction; as the true-parameter, we use the splat vector of the element to be inserted. This, again, uses the fact that splatting is implicit. The result of this operation is that each result element will be the element from the input vector if the indices did not match, or the new element to be inserted if the indices did match, thereby completing the insertion.

With the architecture expansion rules for scalar-to-vector connections as described in Section 3.4.1, splatting a vector is a zero-cycle operation; moreover, the LVD instruction always takes 1 cycle and does not require routing. As a result, the insertelement instruction is implemented in such a way that it always takes exactly 2 cycles (3 when including the LVD instruction, but this can be done ahead of time), regardless of the vector width.
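Per vector lane, the EQ/CMOV sequence behaves as in the following sketch; this is a model of the data flow, not the scheduled PASM, and the splat vectors arise implicitly from scalar-to-vector routing.

```python
# Lane-wise model of the two-step insertelement lowering: EQ compares the LVD
# index vector against the splatted insertion index and sets a flag, then
# CMOV selects the new element where the flag is set.

def lower_insertelement(val, elt, idx):
    index_vector = list(range(len(val)))            # produced by LVD
    flags = [lane == idx for lane in index_vector]  # EQ against splat of idx
    return [elt if f else old                       # CMOV: splat of elt vs. val
            for f, old in zip(flags, val)]

print(lower_insertelement([10, 20, 30, 40], 99, 2))   # [10, 20, 99, 40]
```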

5.4 Extract element

For extracting an element from a vector, we encounter the same problem as with insertion, in that it is not possible to read a vector element from an arbitrary index directly. To properly extract the element from the vector, it must be usable in scalar operations after the extractelement instruction terminates. From the architecture expansion rules for vector-to-scalar connections, as described in Section 3.4.2, it follows that for the extracted element to be usable in scalar instructions afterwards, the element must be moved to index 0 in the vector. We can therefore consider the extraction operation as a vector shuffle instead, and we rewrite the extractelement instruction to a shufflevector instruction that moves the element to be extracted to the first position in the vector.

Going by the vector-to-scalar connection rules again, we find that the contents of all vector elements that are not the first element are discarded when data is being routed from a vector node to a scalar node. This would imply that it would suffice to specify for the shufflevector instruction a mask of <x, undef, undef, ...>, where x is the index of the element we wish to extract. For this we would accept any shuffle that produces x as the first element, and undef elements are essentially treated as wildcards. The result of a shuffle with this mask would be that the first element of the result vector will be the element at index x in the input vector, and all other elements of the result vector will contain undefined data, which are then immediately discarded as the result vector is explicitly cast to a scalar value.

However, the extractelement instruction is somewhat special: it is an instruction that operates on a vector, but produces a scalar as the output. Because it operates on a vector, it must be scheduled onto a vectorized FU node; as a result, this also means that the (scalar) result will be stored in the vectorized FU output node. However, a problem arises when the result of the extractelement operation is then used as a splat vector for the input of a subsequent vector operation. In this situation, the output of the extractelement operation is routed from a vector node to another vector node, and there is no guarantee that it passes through a scalar node on the way. Thus, the scalar-to-vector connection rules from Section 3.4.1 do not apply. This has the result that the output of the extractelement operation is never splatted, and the vector which arrives for the next instruction may still contain undefined data.

Because the lowering of successor instructions may result in additional splat operations appearing in later stages of code generation, it is not feasible to simply check for the presence of a successor splat instruction when lowering the extractelement instruction. We resolve this issue by always choosing for the shufflevector instruction a mask of <x, x, ..., x>, which effectively produces an explicit splat operation, but still treats the output as a scalar data type. This way, the output of the vector shuffle will explicitly contain the extracted element from the input vector in every element of the storage node, and no problems will arise when a successor instruction expects as input the splat vector of the result of the extractelement operation.

5.5 Vector shuffling

The main problem with vector shuffling on the Blocks-CGRA is its inherent vector locality. In order to perform a vector shuffle, in most cases some form of vector inter-element connectivity must be added; this can be in the form of a hard-wired shuffle pattern or the Vector Shuffle Unit. We shall introduce four methods of performing a vector shuffle: (a) passing a vector through a hard-wired shuffle; (b) reordering individual elements by means of a Vector Shuffle Unit; (c) writing the vector to memory via masked store operations; (d) chaining partial shuffles and insertions. A global comparison of these methods is shown in Table 5.1; the specifics of each shuffling method will be elaborated in the following subsections.

5.5.1 Hard-wired shuffle

One method of performing vector shuffles is by routing data through a hard-wired shuffle. As described in Section 3.5, a hard-wired shuffle is created in the architecture configuration by explicitly defining a shuffle mask on a vectorized input port. This is usually the fastest method of executing a specific vector shuffle. We add support to the compiler for recognizing such explicitly defined shuffle masks, as well as any shuffle masks that can be derived by changing the order of the data sources. Upon encountering a shufflevector instruction with the specific shuffle pattern during code generation, the compiler then replaces it with a PASS instruction. The input vector(s) are then forced to be routed to the nodes that are defined in the source list of the input with the explicitly defined shuffle mask, and the PASS operation causes the shuffle to be executed.
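
Recognition of an explicitly defined pattern, including the swapped-sources case mentioned above, can be sketched as follows (an illustrative Python model; the compiler's actual matcher also has to handle undef elements, which are omitted here):

```python
# Check whether a shufflevector mask is covered by a hard-wired pattern, either
# directly or with the two data sources swapped (sketch only). Swapping the
# sources relabels indices 0..n-1 as n..2n-1 and vice versa.
def matches_hardwired(mask, pattern, n):
    if mask == pattern:
        return True
    swapped = [m - n if m >= n else m + n for m in mask]
    return swapped == pattern

# The pattern {1, 2, 3, 4} also covers the shuffle {5, 6, 7, 0} once the two
# sources are exchanged (compare the mirrored pattern in Figure 5.8).
assert matches_hardwired([5, 6, 7, 0], [1, 2, 3, 4], 4)
assert not matches_hardwired([2, 3, 4, 1], [1, 2, 3, 4], 4)
```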

Method              | Configuration requirements                      | Application-specific | Estimated performance                                         | Estimated area/energy footprint (per cycle)
Hard-wired shuffle  | Explicit shuffle pattern(s)                     | Yes                  | Excellent: O(1); small coefficient                            | Very low
Vector Shuffle Unit | Vector Shuffle Unit                             | No                   | Good: O(n); small coefficient                                 | High (n additional ALUs + IDs)
Shuffle via memory  | Wide memory register (optional)                 | No                   | OK: O(1) with wide memory, otherwise O(n); medium coefficient | Low to medium (extra memory or register)
Partial insertion   | Any shuffle pattern(s) with enough connectivity | No                   | Poor: O(n); large coefficient                                 | Very low

Table 5.1: Comparison of vector shuffling methods for vector width n.

This method of vector shuffling, while very efficient, has a glaring drawback: the programmer must have advance knowledge of which shuffle patterns will be used in their program. However, these vector shuffles are added automatically by LLVM in the optimization layer, and it is difficult to predict which shuffles will be present whilst writing the program in C; doing so with any accuracy requires some degree of insight into how the LLVM auto-vectorizer operates. Despite this, we still support this method because of the clear benefits, the fact that it requires no additional hardware, and the fact that falling back on alternative vector shuffling methods is also an option. Moreover, the programmer may be able to prematurely halt the code generation process, output the program in LLVM IR before the Blocks back-end layer is run, observe which shuffle patterns are present, and then add these patterns as explicit hard-wired shuffles in the architecture configuration XML in order to benefit from the hard-wired shuffle method.

A hard-wired shuffle is represented in the architecture configuration as an input port with a mask property and one or more source properties. In the Blocks-CGRA, an input port can be connected to at most one output buffer; during architecture expansion, the source list is used to connect the elements of a vectorized Functional Unit to the individual elements of the source. However, the compiler uses the architecture model before architecture expansion takes place; as such, we need to resolve the issue of having multiple sources on a single input port in a different way. We solve this problem by adding a dummy node whenever a hard-wired shuffle is added to the resource graph. The (up to) two sources from the source list are connected to this dummy node with a latency of 0; the dummy node itself is then connected to the input port for which the shuffle was defined, also with a latency of 0.

Figure 5.2: Inserting a dummy node in the compiler model for hard-wired shuffles. (a) Two sources on in3; (b) only one source on in3.

This dummy node is given a special FU type, so that no instructions can be scheduled onto it through normal means; additionally, we disallow any sort of automatic routing through this node. This avoids a situation where a data value is inadvertently shuffled because it was routed through the hard-wired shuffle node on its way to a consuming FU. For example, suppose that input port 3 of an FU vmul performs a shuffle using two source vectors, valu output 0 and vlsu output 0. The compiler model for this is shown in Figure 5.2a. However, this model has two nodes connected to the same input port, which is invalid. To fix this, we insert the dummy node as in Figure 5.2b.

During instruction selection, when a shufflevector instruction is encountered for which a hard-wired shuffle exists in the architecture configuration, we replace this instruction with a placeholder instruction named HWSHUF. The instruction then passes through to the scheduling stage, in which we apply special scheduling rules for the shuffle. We forego the regular process of combined shortest-path routing to select a processing FU, as detailed in Section 2.2.2. Instead, we forcibly route the input vectors to the two sources of the dummy node, and we do this in a cycle for which the FU associated with the hard-wired shuffle is not yet reserved for an operation. In the prior example from Figure 5.2, this means we route the input vectors to valu.0 and vlsu.0 respectively, for a cycle where vmul is free. Note that we do not route the vectors to the dummy node directly, as this may cause vector 1 to be routed through vlsu.0 if that path is shorter, or vector 2 through valu.0; this would produce an incorrect shuffle. The same goes if we were to route the vectors to vmul directly; they might be passed in through an input port other than in3. Once the input vectors have been routed to the correct location, we manually reserve a PASS instruction for vmul, using the dummy node as the PASS source, and mark the vmul node as the location of the output vector, from which it can be used as an operand for following instructions. This completes the scheduling for hard-wired shuffles.

5.5.2 Vector Shuffle Unit

The Vector Shuffle Unit (SHF) is designed with the express purpose of supporting any arbitrary shuffle pattern that may be encountered during code generation. As described in Section 3.6.1, the SHF unit consists of a number of ALUs equal to the vector width of the node, but unlike regular vector nodes, each individual ALU has its own dedicated Instruction Decoder. Moreover, two of the four input ports on each ALU are reserved for connections with the two neighboring elements. This allows data to pass arbitrarily through each element of the vector.

For the LLVM shufflevector instruction, the mask of the shuffle is constant and known at compile-time. We expand the shufflevector instruction to a list of parallelized PASS instructions, which route vector elements through the shuffle network towards their destination. Let A and B be the two input vectors for the shufflevector instruction (B may be undef), and let M be the shuffle mask. Furthermore, we denote undef elements as U, and declare another special element T as a placeholder for elements that have yet “to be loaded”. The first step for generating the shuffle is choosing a pivot index p. We choose p such that 0 ≤ p < |M| ∧ M[p] ≠ U ∧ ¬∃x : M[x] ≠ U ∧ x > p. In other words, p is the index of the rightmost element in M that is not undef.

The full procedure for generating this list of instructions is formalized in Algorithm 1, but a general description is given below. For simplicity's sake, we assume that indexing of all vectors and arrays “wraps around”, e.g. M[−1] means the last element in M; moreover, all numbers used are integers. As semantics for modulo operations on negative numbers can vary, we define this formally as M[i] = M[i mod |M|] for i ≥ 0, and M[i] = M[|M| − (|i| mod |M|)] for i < 0.

The algorithm SoftwareShuffle takes as input the two vectors A and B as well as the shuffle mask M, and produces a list I of Blocks instructions that the vector shuffle is expanded to. Each element in I is a tuple consisting of the opcode as the first element, and any instruction operands as subsequent elements. The instructions are generally executed in the order that they are added to I, though this is not guaranteed and LLVM may re-order them if it detects that instructions are independent from one another. To prevent this, we may add instructions themselves as an operand to another instruction; this indicates that the new instruction uses the result of a previous instruction, or simply that the previous instruction is a predecessor of the new instruction. For example, I[−1] as an operand means that the last instruction that was added is a predecessor of the new instruction, and must be executed before the new instruction can be executed. (If I is empty, then we ignore this operand I[−1].)

An example of a vector shuffle via the SHF unit using shuffle mask {0, 8, 1, 9, 2, 10, 3, 11} is shown in Figure 5.3. Using our algorithm, the shuffle is performed in 5 cycles. An alternative example is shown in Figure 5.4 using shuffle mask {9, 9, 1, 9, 6, 13, 5, 13}. This shuffle is performed in 7 cycles using our algorithm. However, it is possible to perform this shuffle in only 4 cycles by using a hand-written sequence of PASS operations, as shown in Figure 5.5. This shows that the SoftwareShuffle algorithm does not always produce an optimal PASS sequence for any given shuffle. However, we can create hand-written PASS sequences in case we encounter such “hard” shuffles, and use the SoftwareShuffle algorithm for any other shuffle pattern.

Figure 5.3: Example of vector shuffle via SHF unit using SoftwareShuffle algorithm, with shuffle mask {0, 8, 1, 9, 2, 10, 3, 11}. The elements are loaded over five cycles (11; 3 and 10; 2 and 9; 1 and 8; 0); following all loads, R5 does not need to be rotated further to match the shuffle mask.

In the worst case, the SoftwareShuffle algorithm does a full rotation of the vector, taking N cycles, followed by a rotation of the vector to match the shuffle pattern again, taking up to ⌊N/2⌋ cycles. Therefore, the worst-case performance of the SoftwareShuffle algorithm scales at a rate of O(n) with the vector size. In the best case, for instance a simple rotation over 1 element, the performance is constant regardless of the vector size, i.e. it scales as Θ(1).

Line 4–12: We initialize a pattern vector P0 with a copy of the shuffle pattern, M. Also, we initialize a result vector R with a copy of M, but replace every element that is not U by a T element. Moreover, we set a rotation offset o for P and R, such that, when M is right-rotated o + 1 times, the pivot element will line up with its vector index in A or B.

Line 13–14: We initialize the list of instructions, I, to an empty list. As an iteration counter as well as instruction counter, we initialize a variable t to 0.

Line 15–17: We repeat the while-loop until all T elements are removed from R; in other words, until all elements that were not undef have been loaded. At the start of every iteration we initialize a list of loads L to an empty list; this list will be filled with the PASS source for every ALU in the SHF unit.

Line 20–22: For every iteration, as the first step, we rotate the shuffle pattern P to the right by 1.

Line 23–35: For each element in the rotated shuffle pattern, we select a corresponding element for the new result vector Rt as well as a PASS source for every ALU, in the following order:

• If the element in the rotated shuffle pattern lines up with a right-neighboring element from the previous result vector Rt−1, then we load from the right-neighbor (input port 3). In other words, we propagate data to the left.

Figure 5.4: Example of vector shuffle via SHF unit using SoftwareShuffle algorithm, with shuffle mask {9, 9, 1, 9, 6, 13, 5, 13}. The loads take five cycles (9 and 13; 1 and 5; none; 9; 6); following all loads, R5 needs to be left-rotated twice to match the shuffle mask.

Figure 5.5: Example of vector shuffle via SHF unit using handwritten PASS sequence, with shuffle mask {9, 9, 1, 9, 6, 13, 5, 13}. The sequence loads 6, 9 and 13 in the first cycle, 1 in the second and 5 in the third, and completes the arrangement in the fourth cycle.

Algorithm 1 Generate vector shuffles on a SHF
 1: procedure SoftwareShuffle(A, B, M)
 2:     o ← (M[p] − p) mod |A| − 1                ▷ Initialize rotation offset
 3:     P0 ← ∅
 4:     R0 ← ∅
 5:     for i ← 0 to |M| − 1 do
 6:         P0[o + i] ← M[i]                      ▷ Initialize P with shuffle pattern
 7:         if M[i] = U then                      ▷ Initialize R with Us (undef) and Ts (todo)
 8:             R0[o + i] ← U
 9:         else
10:             R0[o + i] ← T
11:         end if
12:     end for
13:     t ← 0
14:     I ← ∅
15:     while T ∈ Rt do                           ▷ Repeat until all Ts are gone
16:         t ← t + 1
17:         Lt ← ∅
18:         o ← o + 1
19:         Pt ← ∅
20:         Rt ← ∅
21:         for i ← 0 to |M| − 1 do
22:             Pt[i] ← Pt−1[i − 1]               ▷ Right-rotate P by 1
23:             if Pt[i] = Rt−1[i + 1] then       ▷ Load from right-neighbor
24:                 Lt ← Lt ∪ {(i, in3)}
25:                 Rt[i] ← Rt−1[i + 1]
26:             else if Pt[i] = i then            ▷ Load from A
27:                 Lt ← Lt ∪ {(i, in0)}
28:                 Rt[i] ← i
29:             else if Pt[i] = |A| + i then      ▷ Load from B
30:                 Lt ← Lt ∪ {(i, in1)}
31:                 Rt[i] ← |A| + i
32:             else                              ▷ Load from left-neighbor
33:                 Lt ← Lt ∪ {(i, in2)}
34:                 Rt[i] ← Rt−1[i − 1]
35:             end if
36:         end for
37:         I ← I ++ {(PASS, Lt, I[−1])}          ▷ Add instruction
38:     end while

Algorithm 1 Generate vector shuffles on a SHF (continued)
39:     if o ≤ |M| − (o mod |M|) then             ▷ Check shortest rotation
40:         δ ← −1                                ▷ Left rotate
41:         s ← in3
42:     else
43:         δ ← 1                                 ▷ Right rotate
44:         s ← in2
45:     end if
46:     while Pt ≠ M do
47:         t ← t + 1
48:         Lt ← ∅
49:         for i ← 0 to |M| − 1 do
50:             Pt[i] ← Pt−1[i + δ]               ▷ Rotate P
51:             Lt ← Lt ∪ {(i, s)}
52:         end for
53:         I ← I ++ {(PASS, Lt)}                 ▷ Add instruction
54:     end while
55:     return I
56: end procedure

• If the element in the rotated shuffle pattern lines up with an element in A, then we load from A (input port 0). For instance, if the rotated shuffle pattern is <0, 5, 6, 7>, this means we can load the first element (0) from A in this iteration.

• Similarly, if the element in the rotated shuffle pattern lines up with an element in B, then we load from B (input port 1).

• Otherwise, we load from the left-neighbor (input port 2). In other words, we propagate data to the right.

Each PASS source is appended to L as a tuple of the vector index of the ALU that performs the PASS, and the input port from which data is PASSed.

Line 37: Once PASS sources for every element have been determined, a parallel PASS instruction is defined and appended to I, having as operand the list of PASS sources for that cycle.

Line 39–45: Once all T elements for the result vector R have been loaded from A and B, we check the current rotation offset of the shuffle mask P, and determine a rotation direction that would result in the shortest rotation of P back to match M.

Line 46–54: Finally, we generate more instructions that perform the rotation of P back to match M. In many cases, however, P already matches M at the end of the first while-loop.
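
The lane-level behaviour that these PASS instructions rely on can be modelled in a few lines. The sketch below is our own simplified simulator of the SHF network, assuming the port convention from the list above (in0 = A, in1 = B, in2 = left neighbor, in3 = right neighbor) and assuming that the neighbor links wrap around, as the rotation steps in Figures 5.3 to 5.5 suggest; it is not the compiler or hardware model. As a sanity check, a left rotation by one element takes two PASS cycles, which matches the two-cycle SHF shuffle reported for the Convolution benchmark in Chapter 6:

```python
# Simplified per-cycle model of the SHF lane network (illustrative only).
def shf_step(a, b, state, ports):
    n = len(a)
    out = []
    for i, port in enumerate(ports):
        if port == "in0":                  # load lane i of input vector A
            out.append(a[i])
        elif port == "in1":                # load lane i of input vector B
            out.append(b[i])
        elif port == "in2":                # take from left neighbor (propagate right)
            out.append(state[(i - 1) % n])
        else:                              # "in3": take from right neighbor (propagate left)
            out.append(state[(i + 1) % n])
    return out

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [8, 9, 10, 11, 12, 13, 14, 15]
state = shf_step(a, b, [None] * 8, ["in0"] * 8)  # cycle 1: every lane passes A
state = shf_step(a, b, state, ["in3"] * 8)       # cycle 2: every lane passes its right neighbor
assert state == [1, 2, 3, 4, 5, 6, 7, 0]         # left rotation by one element
```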

5.5.3 Shuffle via memory

If there are no hard-wired shuffle patterns or SHF unit available in the system, then there are still other methods available that allow us to shuffle vectors. Due to vector locality, a lack of either of these two elements prohibits moving data around in vectors directly. However, we can bypass this restriction by writing the vector to a memory location, manipulating the vector in memory, then loading it back out. For this we can use the SVX instruction as defined in Section 4.2.1. This instruction lets us change the vector indices of a vectorized LSU at runtime, as well as selectively disable reads/writes for specific LSUs. We can use this functionality to write partial vectors to memory and/or reorder the vector as it is being written to or read from memory.

The core idea of this method is to derive a store mask from the vector shuffle mask M. The shuffle mask M is a load mask; it specifies the locations in A and B, the vectors to be shuffled, from which elements are loaded for the output vector. For both A and B, we derive store masks SA and SB, which instead specify the locations in the output vector to which elements will be written. For any elements in A or B that do not appear in the output vector, we specify −1 as the location. The store mask SA is then committed to the vectorized LSU by means of the SVX instruction, and the associated input vector A is written to memory. As any unnecessary elements have an index of −1, this has the effect of masking out the memory store operation for those elements; thus, only the elements in A that ought to be present in the output vector are actually written to memory. We perform the same process for the other input vector (if applicable), setting vector indices and writing the vector to the very same memory address. This completes the shuffle of the output vector in memory. Next, we use the RVX instruction to reset the vector indices of the vectorized LSU, and read the memory back out from the same memory address. As the vector indices are now in order, the (shuffled) output vector will be read in-order from memory.

An alternative method, which we shall name “shuffle-load” (as opposed to “shuffle-store”, which has been described above), would be to store A and B in consecutive memory locations using the default store mask {0, 1, 2, ···}, then to use the SVX instruction in conjunction with the load mask (i.e. the mask from the shufflevector operation) and load the shuffled vector back from memory. When two vectors are being shuffled together, this saves one cycle (store, store, SVX, load, RVX for shuffle-load, versus SVX, store, SVX, store, RVX, load for shuffle-store). If a single vector is being shuffled, performance is the same. However, since both vectors are stored to memory in separate locations, rather than to the same location, the amount of memory required for shuffle-load is double that of shuffle-store. Moreover, the shuffle-store method maps a little more cleanly to LLVM's shufflevector operation, as the result value is produced in the last cycle of the sequence of instructions, whereas for shuffle-load the result value is produced in the second-to-last cycle; the extra RVX is needed to put the vector indices of the LSUs back in a correct state for normal operation, but the RVX instruction itself does not produce an output.
For these reasons, we use the shuffle-store method for this project, though it may be interesting to also look into using shuffle-load in future research. The full procedure for generating the list of instructions for shuffle-store is

formalized in Algorithm 2. The algorithm takes as input the two vectors A and B as well as the shuffle mask M. In addition, a memory address O is specified, which has been reserved/allocated for performing the shuffle in. As output, a list I of Blocks instructions is produced, to which the vector shuffle is expanded. Each element in I is a tuple consisting of the opcode as the first element, and any instruction parameters as subsequent elements.

Blocks currently uses a 32-bit DTL memory for global memory. However, vectors are frequently larger than 32 bits. As such, storing them to, or reading them from, memory would incur stall cycles in the LSUs; the entire system is stalled until the memory operation finishes. We circumvent this problem by adding a dedicated memory-mapped register to the system, with the same bus width as the vector(s) to be shuffled. Following adjustments made to the Blocks-CGRA's memory interface, the Blocks-CGRA is now able to perform the memory operation for the entire vector in one go, i.e. without incurring extra stall cycles for every individual element; the memory operation for the vector now has the same latency and stall cycles as a normal scalar memory operation.

Algorithm 2 Generate vector shuffle via memory
 1: procedure MemoryShuffle(A, B, M, O)
 2:     SA ← ∅
 3:     SB ← ∅
 4:     for i ← 0 to |A| − 1 do                   ▷ Initialize store mask for A
 5:         SA[i] ← −1
 6:     end for
 7:     for i ← 0 to |B| − 1 do                   ▷ Initialize store mask for B
 8:         SB[i] ← −1
 9:     end for
10:     for i ← 0 to |M| − 1 do                   ▷ Generate store mask for A
11:         if 0 ≤ M[i] < |A| then
12:             SA[M[i]] ← i
13:         end if
14:     end for
15:     for i ← 0 to |M| − 1 do                   ▷ Generate store mask for B
16:         if |A| ≤ M[i] < |A| + |B| then
17:             SB[M[i] − |A|] ← i
18:         end if
19:     end for
20:     I ← ∅                                     ▷ Create and add the instructions
21:     I ← I ++ {(SVX, SA)}
22:     I ← I ++ {(store, O, A, I[−1])}
23:     I ← I ++ {(SVX, SB, I[−1])}
24:     I ← I ++ {(store, O, B, I[−1])}
25:     I ← I ++ {(RVX, I[−1])}
26:     I ← I ++ {(load, O, I[−1])}
27:     return I
28: end procedure

Line 2–9: We initialize the store masks, SA and SB, to {−1, −1, −1, ...}.

Figure 5.6: Model of a shuffle pattern {1, 2, 3, 4} with all vector widths = 4.

Line 10–14: For each mask element in the load mask M, if the mask element points to an element in A, we set the output vector destination index for the element in the store mask SA.

Line 15–19: The same process is repeated for B and its store mask SB.

Line 20–26: Once the store masks SA and SB have been derived from M, we generate the list of instructions I that the vector shuffle will be expanded to.
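
To make the store-mask derivation concrete, the following is a small executable model of Algorithm 2's mask construction and of the resulting in-memory shuffle (a simplified Python sketch in which the wide memory register is a plain list; not the compiler implementation):

```python
# Derive the store masks SA and SB from the load mask M (mirrors Algorithm 2).
# An index of -1 in a store mask means "masked out: this LSU does not write".
def store_masks(mask, n_a, n_b):
    sa, sb = [-1] * n_a, [-1] * n_b
    for out_pos, src in enumerate(mask):
        if 0 <= src < n_a:
            sa[src] = out_pos            # element src of A lands at out_pos
        elif n_a <= src < n_a + n_b:
            sb[src - n_a] = out_pos      # element src-|A| of B lands at out_pos
    return sa, sb

def shuffle_store(a, b, mask):
    sa, sb = store_masks(mask, len(a), len(b))
    mem = [None] * len(mask)             # the wide memory register
    for i, pos in enumerate(sa):         # SVX(SA) followed by a masked store of A
        if pos >= 0:
            mem[pos] = a[i]
    for i, pos in enumerate(sb):         # SVX(SB) followed by a masked store of B
        if pos >= 0:
            mem[pos] = b[i]
    return mem                           # RVX, then an in-order load

a, b = [0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]
mask = [0, 8, 1, 9, 2, 10, 3, 11]
assert shuffle_store(a, b, mask) == [a[m] if m < 8 else b[m - 8] for m in mask]
```

The final assertion checks that reading the register back in order yields the same result as LLVM's shufflevector semantics for the same mask.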

5.5.4 Partial insertion

The previous method relies on the fact that there is a wide memory register available in the system in which a memory shuffle can be performed. If no such memory register exists, then a memory shuffle will be prohibitively slow. In the event that there is no hard-wired shuffle pattern available for the shuffle mask that we want to use, and there is also no SHF unit or wide memory register in the system, there is yet one more method we can use to perform a vector shuffle, so long as there is at least some hard-wired shuffle pattern in the system, even if it does not match our desired shuffle mask exactly.

This method is somewhat of an extension of the hard-wired shuffle method. It relies on producing more shuffle patterns by chaining together repeated applications of some other shuffle pattern. Upon parsing the architecture configuration, the compiler will generate derived shuffle patterns; these are shuffle patterns obtained by applying hard-wired shuffle patterns present in the architecture more than once.

For every shuffle pattern shuf, we record three properties in particular: shuf.maskIn1, shuf.maskIn2 and shuf.maskOut. These represent the shuffle masks of the two input vectors, and the shuffle mask that is produced for the result vector. Non-derived shuffles, i.e. hard-wired shuffles that directly exist in the system, have their maskIn1 and maskIn2 set to the trivial pattern {0, 1, 2, ... |A|−1} for the first vector or {|A|, |A|+1, |A|+2, ... |A|+|B|−1} for the second vector; the maskOut property matches the one that is defined as the mask in the architecture configuration. A visual representation of this model for a shuffle pattern <1, 2, 3, 4> is shown in Figure 5.6. For derived shuffles, at least one of maskIn1 or maskIn2 is set to a mask equal to the maskOut of the preceding shuffle, and so forth. For example: by applying the shuffle {1, 2, 3, 4} two times, where one of the inputs is set to the result of the previous iteration, we could obtain the shuffle patterns {2, 3, 4, 0}, {2, 3, 4, 4}, {1, 2, 3, 1}, {5, 6, 7, 1} or {2, 3, 4, 1}. This is shown in Figure 5.7.

Figure 5.7: Derived shuffle patterns from applying pattern {1, 2, 3, 4} twice: R2A = {2, 3, 4, 0}, R2B = {2, 3, 4, 4}, R2C = {1, 2, 3, 1}, R2D = {5, 6, 7, 1} and R2E = {2, 3, 4, 1}.

Figure 5.8: Mirrored versions of the shuffle pattern {1, 2, 3, 4}: applying the pattern with both inputs set to A yields {1, 2, 3, 0}; applying it with A and B swapped yields {5, 6, 7, 0}.

All five of these results could then be iterated on again, and potentially even combined with results from previous iterations, to obtain yet more shuffle patterns; potentially until no more shuffle patterns can be found. The compiler keeps a list of every shuffle pattern found this way; if a shuffle pattern is found more than once, the version that requires the least number of iterations is kept. We discard any shuffle patterns that produce an output mask where every element is |A| or higher, e.g. output mask {4, 5, 6, 7} for |A| = 4, as this represents a shufflevector operation where the first operand is undef, which is invalid in LLVM.

In addition to the basic shuffle {1, 2, 3, 4} on A and B, we can also obtain a shuffle {1, 2, 3, 0} by setting both inputs to A, or {5, 6, 7, 0} by swapping the positions of A and B, and then derive even more shuffle patterns from this. Note that setting both inputs to B will always produce an output mask where every element is |A| or higher. The two mirrored versions of our sample pattern {1, 2, 3, 4} are shown in Figure 5.8. In addition to this, the “trivial” shuffle {0, 1, 2, 3} is also added to the list, requiring 0 PASS operations.

Clearly, this search space blows up very quickly; a single hard-wired shuffle pattern can easily produce tens of thousands of derived shuffle patterns. As such, in practice, we must limit this search space in some way. For our test cases, we enforce a limit of N repeated applications of the same shuffle pattern, where N is the width of the result vector. Moreover, we only allow the repeated shuffle where both inputs are set to the result of the previous iteration (in Figure 5.7, this would be the shuffle that produces R2E). Of course, this is a trade-off; by searching more of the search space, more optimal shuffle patterns may be found, but for our benchmarks these limitations are able to find all the derived shuffle patterns we need without significantly slowing down the compilation process.

When a vector shuffle is requested using a pattern that is not in the list of (derived) shuffle patterns that have been found by the compiler, the vector is shuffled by performing several “partial” shuffles and individually inserting elements from those vectors into a base vector. The term “partial” shuffle stems from the fact that we only care about a subset of the elements in the vector, namely the elements that will be inserted into the base vector. The full procedure is shown in Algorithm 3, with a general description given below. We begin by calculating similarity values for the input vectors A and B by computing the number of elements in A and B that are in the same position as they would be in the shuffle vector. The input vector with the highest similarity is chosen as the base vector. We then form a remainder mask and determine which elements are left to insert. Next, partial shuffles on A and B are performed as necessary and the remaining elements are inserted, until the output vector is complete.
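
The derived patterns above (and those in Figures 5.7 and 5.8) follow from a simple composition rule over masks; the sketch below is our reconstruction of that rule, not the compiler's code: the output mask of a chained shuffle is obtained by indexing the concatenation of the two input masks with the hard-wired pattern.

```python
# Compose a hard-wired shuffle pattern s with two inputs whose contents are
# described by masks m1 and m2 over the original vectors A and B (sketch only).
def compose(s, m1, m2):
    combined = m1 + m2
    return [combined[i] for i in s]

A_MASK = [0, 1, 2, 3]          # trivial mask for A
B_MASK = [4, 5, 6, 7]          # trivial mask for B
S = [1, 2, 3, 4]               # the hard-wired shuffle pattern from the example

r1 = compose(S, A_MASK, B_MASK)                  # first application: {1, 2, 3, 4}
assert compose(S, r1, A_MASK) == [2, 3, 4, 0]    # R2A in Figure 5.7
assert compose(S, r1, r1) == [2, 3, 4, 1]        # R2E in Figure 5.7
```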

Figure 5.9: Used search space for hard-wired shuffle pattern {1, 2, 3, 4} (top left): R1 = {1, 2, 3, 4}, R2E = {2, 3, 4, 1}, R3EE = {3, 4, 1, 2} and R4EEE = {4, 1, 2, 3}.

As an example, suppose we would like to shuffle two vectors A and B, both having width 4, with mask {4, 3, 2, 1}, but only a shuffle pattern {1, 2, 3, 4} exists in the system. A = {0, 1, 2, 3} and B = {4, 5, 6, 7}; both share one element with the desired output vector {4, 3, 2, 1}: for A, the 2 is already in the right place, and for B, the 4 is in the right place. We choose A as the base vector, which leaves the 4, 3 and 1 to be inserted. Using the search space shown in Figure 5.9, we can perform this shuffle as follows.

• Take A = {0, 1, 2, 3} as base vector.

• Insert element 4 from B (trivial shuffle {0, 1, 2, 3} on B).

• Perform shuffle {2, 3, 4, 1} (R2E), and insert elements 1 and 3.

A line-by-line description of the algorithm PartialInsertionShuffle (Algorithm 3) follows:

Line 2–11: First, we compute similarity values sA and sB for A and B respectively. If an element in A or B is already in the correct place, or the corresponding element in the output vector is undef (U), then we increase similarity by 1.

Line 12–21: In this step, we choose the base vector V, and fill R with the indices present in the base vector (e.g. {0, 1, 2, 3, ...} for A); this is done mostly for convenience for the next step.

Line 22–28: We form the remainder mask R in this step. Each element in R is set to U (undef) if the element in M is also undef, or the element in the base vector is already in the correct place. Otherwise, we fill the remainder mask R with the mask index from M.

Algorithm 3 Generate vector shuffle by partial shuffles and inserts
 1: procedure PartialInsertionShuffle(A, B, M)
 2:     sA ← 0
 3:     sB ← 0
 4:     for i ← 0 to |M| − 1 do                   ▷ Compute similarity counts
 5:         if M[i] = U ∨ M[i] = i then           ▷ Similarity for A
 6:             sA ← sA + 1
 7:         end if
 8:         if M[i] = U ∨ M[i] = i + |A| then     ▷ Similarity for B
 9:             sB ← sB + 1
10:         end if
11:     end for
12:     R ← ∅
13:     for i ← 0 to |M| − 1 do                   ▷ Init. R with indices, choose base V
14:         if sA ≥ sB then
15:             V ← A                             ▷ Set V = A (only needed once)
16:             R[i] ← i                          ▷ Initialize R for A
17:         else
18:             V ← B                             ▷ Set V = B (only needed once)
19:             R[i] ← i + |A|                    ▷ Initialize R for B
20:         end if
21:     end for
22:     for i ← 0 to |M| − 1 do                   ▷ Form remainder mask R
23:         if M[i] = U ∨ M[i] = R[i] then
24:             R[i] ← U                          ▷ Already in V or not needed
25:         else
26:             R[i] ← M[i]                       ▷ Still needs to be read
27:         end if
28:     end for
29:     I ← ∅
30:     while ∃x ∈ R : x ≠ U do                   ▷ Repeat until all undef in R
31:         shuf ← find best shuffle(R)           ▷ Elaborated below
32:         T ← PassTree(I, shuf, A, B)           ▷ See Algorithm 4
33:         I ← I ++ {T}
34:         for i ← 0 to |M| − 1 do               ▷ Update R with new shuffle result
35:             if R[i] ≠ U ∧ R[i] = shuf.maskOut[i] then
36:                 V ← (insertelement, V, T, i)  ▷ Insert into previous V
37:                 I ← I ++ {V}                  ▷ Add insert to list of instructions
38:                 R[i] ← U                      ▷ Mark as done
39:             end if
40:         end for
41:     end while
42:     return I
43: end procedure

Line 29–33: In this while-loop, we finish the rest of the output vector. The process continues until all elements in the remainder mask are undef. In every iteration, we first find a (derived) shuffle pattern that best matches the remainder mask R, using the function find best shuffle(R).

Line 34–40: Finally, for the partially shuffled vector that was obtained from the previous step, we insert the elements that should be present in the output vector for the full shuffle. We then mark these elements as undef in the remainder mask R, indicating that we do not need these elements anymore. Once all indices in R have been replaced by U, the full shuffle is finished.
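
These first steps can be checked against the worked example above (an executable Python sketch, with U represented as None; not the compiler code):

```python
# Similarity counts and remainder mask for mask {4, 3, 2, 1} with |A| = |B| = 4.
U = None

def similarities(mask, n):
    s_a = sum(1 for i, m in enumerate(mask) if m is U or m == i)
    s_b = sum(1 for i, m in enumerate(mask) if m is U or m == i + n)
    return s_a, s_b

def remainder_mask(mask, n, base_is_a):
    base = [i if base_is_a else i + n for i in range(len(mask))]
    return [U if m is U or m == base[i] else m for i, m in enumerate(mask)]

mask = [4, 3, 2, 1]
s_a, s_b = similarities(mask, 4)
assert (s_a, s_b) == (1, 1)                                  # A and B each match one element
assert remainder_mask(mask, 4, s_a >= s_b) == [4, 3, U, 1]   # 4, 3 and 1 remain to be inserted
```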

The find best shuffle(R) function used in the PartialInsertionShuffle algorithm is a helper function which checks all (derived) shuffle patterns present in the system and, for each pattern, checks how many elements are not undef and match an element in R. If no matching shuffle can be found, the process terminates with an error; in this case, there is not enough connectivity present in the system to be able to shuffle the vector with pattern M, or the search space for derived shuffles was not large enough. Once the best shuffle is found, a tree of HWSHUF operations is generated to perform the shuffle, and these operations are also appended to the list of instructions.

The PartialInsertionShuffle algorithm makes use of a secondary algorithm, PassTree; this algorithm is shown in Algorithm 4. The PassTree algorithm recursively forms a tree of PASS operations (technically HWSHUF placeholders, which are lowered to PASSes in a later stage; see Section 5.5.1). This tree represents a tree of dependencies; each node in the tree represents a single application of a shuffle pattern, with its child nodes representing the inputs; the leaves of the tree represent a shuffle on only A and/or B. For each of the input masks of the hardware shuffle, the algorithm checks whether this mask matches the trivial mask of input vector A or B; if so, these are used as input operands directly. Otherwise, the function is called recursively to obtain the PASS tree for the input mask, which is then used as an operand for the current iteration. The instructions produced during this are added to the list of instructions I as normal, and will generally be executed in that order, though LLVM may re-order the list taking into account the dependencies of each operation. For example, the derived shuffle {2, 3, 4, 1} from the example produces the PASS-tree shown in Figure 5.10. LLVM additionally performs common sub-expression elimination on this tree, the result of which is also shown in Figure 5.10.
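
Continuing that example, the remainder mask {4, 3, U, 1} can be scored against the derived patterns of Figure 5.7. The sketch below is our paraphrase of the matching criterion described above; the compiler's actual scoring and tie-breaking may differ:

```python
# Score a derived pattern's output mask against the remainder mask: count the
# positions that are not undef and already hold the required element.
U = None

def score(mask_out, remainder):
    return sum(1 for r, m in zip(remainder, mask_out) if r is not U and r == m)

def find_best_shuffle(patterns, remainder):
    best = max(patterns, key=lambda p: score(p, remainder))
    if score(best, remainder) == 0:
        raise RuntimeError("not enough connectivity to realize this shuffle")
    return best

patterns = [[2, 3, 4, 0], [2, 3, 4, 4], [1, 2, 3, 1], [5, 6, 7, 1], [2, 3, 4, 1]]
assert find_best_shuffle(patterns, [4, 3, U, 1]) == [2, 3, 4, 1]   # R2E matches two positions
```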

Figure 5.10: PASS-tree for derived shuffle {2, 3, 4, 1}. Left: original tree as produced by PassTree algorithm; right: tree after LLVM performs common sub-expression elimination.

Algorithm 4 Generate PASS tree for chain shuffle
 1: procedure PassTree(I, shuf, A, B)
 2:     if shuf.maskIn1 = U then                  ▷ Check source of input 1
 3:         o1 ← U                                ▷ Input 1 not needed
 4:     else if ∀0 ≤ i < |shuf.maskIn1| : shuf.maskIn1[i] = i then
 5:         o1 ← A                                ▷ Input 1 is A
 6:     else if ∀0 ≤ i < |shuf.maskIn1| : shuf.maskIn1[i] = |A| + i then
 7:         o1 ← B                                ▷ Input 1 is B
 8:     else                                      ▷ Input is result of previous shuffle
 9:         o1 ← PassTree(I, find best shuffle(shuf.maskIn1), A, B)
10:         I ← I ++ {o1}                         ▷ Add previous shuffle to instructions
11:     end if
12:     if shuf.maskIn2 = U then                  ▷ Do the same for input 2
13:         o2 ← U
14:     else if shuf.maskIn1 = shuf.maskIn2 then
15:         o2 ← o1
16:     else if ∀0 ≤ i < |shuf.maskIn2| : shuf.maskIn2[i] = i then
17:         o2 ← A
18:     else if ∀0 ≤ i < |shuf.maskIn2| : shuf.maskIn2[i] = |A| + i then
19:         o2 ← B
20:     else
21:         o2 ← PassTree(I, find best shuffle(shuf.maskIn2), A, B)
22:         I ← I ++ {o2}
23:     end if
24:     return (HWSHUF, o1, o2, shuf.maskOut)     ▷ Create PASS instruction
25: end procedure

Chapter 6

Evaluation

To evaluate the results of our compiler vectorization efforts, we execute a number of benchmarks and measure the resulting runtime, area size and overall energy usage. This chapter shall describe the benchmarks that are run, how these measurements are performed, and evaluate and interpret the results. Section 6.1 describes the testing environment, including which tools have been used to perform compilation, simulation and measurements. This section also details the benchmark applications that have been run and their corresponding Blocks configurations. Section 6.2 gives the measurements related to runtimes of the benchmarks, and evaluates these results. Section 6.3 describes the results that have been obtained related to the used area size of the Blocks configurations, and evaluates these. Finally, in Section 6.4 we give the energy usage measurements and evaluate the results.

6.1 Testing environment

The current Blocks-CGRA compiler is based on LLVM version 4, which is somewhat outdated; at the time of writing, the most recently released version of LLVM is version 8. In particular, this LLVM version has poor support for variable vector widths. As a result, we arbitrarily choose a fixed vector width of 8 for all operations. We have implemented all Blocks-CGRA hardware extensions and Blocks compiler modifications as detailed in the preceding chapters. Moreover, we have implemented support for all vector operations that have scalar equivalents; the vector-specific operations shufflevector, insertelement and extractelement; as well as vector building, for a vector width of 8. All benchmarks are written as C programs which are converted to LLVM IR using Clang. However, the version of Clang that ships with LLVM 4 has a rather basic vectorizer, which has been greatly improved in later Clang versions. As such, wherever Clang from LLVM 4 is unable to auto-vectorize a benchmark, we instead use the Clang version from the latest LLVM release (LLVM 8) and manually convert its output back into a format compatible with LLVM 4. In the C program files, we use #pragma compiler directives to force vectorization on all loops. Following translation of the C benchmarks to LLVM IR, we compile these benchmarks to PASM format with our Blocks-CGRA compiler, which has been

implemented as an LLVM back-end. We then perform a logic synthesis on the Blocks configurations of each benchmark, translating the behavioral RTL description of the Blocks base architecture, included Functional Units and their connections to an optimized logic gates design. This is done using Cadence on a commercial 40 nm library, including memory macros. We then simulate our PASM-assembled benchmark on the synthesized Blocks-CGRA instance with a clock frequency of 100 MHz.

Currently, the Blocks-CGRA compiler has no support for function calls; furthermore, there is no stack or any form of spilling code available. Additionally, Instruction Memories in the current Blocks-CGRA are limited to 255 instructions. This, unfortunately, greatly limits the complexity of benchmarks that we can currently test with the compiler. We select 5 simple benchmarks, each chosen to test a different facet or archetype of vectorized programs. Each benchmark is run using 5 different configurations: 1 scalar and 4 vector. This gives a total of 25 compile-synthesize-simulate cycles that each produce results for total runtime, energy usage and area size. The tested benchmarks are as follows:

• MemoryCopy: Memory copy on a 10,000-byte array. This is generally the simplest form of vectorized program.

• Binarization: Binarization of a 100 × 100 pixel grayscale image (1 byte per pixel). This benchmark does a light amount of individual processing on each element.

• Sum: Summation of an 8,000-integer array. This tests vector reduction on a large data set, which also uses vector shuffles.

• Convolution: 2D convolution on a 100 × 100 pixel grayscale image (1 byte per pixel). The size of the convolution window is 2 × 2, chosen because Clang is unable to vectorize larger windows. The convolution is done in 2 stages: first in the horizontal direction, then in the vertical direction, with the intermediate results stored in a buffer in global memory. This benchmark requires accessing neighboring elements and thus makes heavy use of vector shuffles.

• MatVecMul: Matrix-vector multiplication of a 64 × 64 integer matrix and a 64-integer vector. This is an operation for which vectorized programs are very commonly used.

The 5 configurations used are listed in Table 6.1. These 5 configurations each use one of the FU layouts shown in Figure 6.1. We separate each vectorized configuration into two parts: a scalar part consisting only of scalar units (on the right), and a vector part consisting only of vector units with a width of 8 (on the left). For the scalar part of each vector configuration, we use a Multiplexed Register File (RFM) as defined in Section 3.6.2 as a substitute for a regular Register File. This RFM contains 2 internal RFs, for a total of 32 registers. The scalar configuration uses a single RF. All ALUs not part of the RFM use a buffered output port. A wide memory register or hard-wired shuffle patterns are added on top of the FU layout; hard-wired shuffle patterns are added on input ports that are unused in the base layout. Each vector configuration uses a different method of vector shuffling: Vector-SHF uses a Vector Shuffle Unit (SHF) to do the shuffle

Design         | Layout      | Wide memory | Hard-wired shuffles
Scalar         | Figure 6.1a | No          | No
Vector-SHF     | Figure 6.1b | No          | No
Vector-HWSHUF  | Figure 6.1c | No          | Application-specific
Vector-MemShuf | Figure 6.1c | Yes         | No
Vector-Chain   | Figure 6.1d | No          | Only <1, 2, 3, 4, 5, 6, 7, 8>

Table 6.1: Blocks-CGRA configurations used.

Figure 6.1: Layouts used for Blocks-CGRA configurations: (a) Scalar FU layout; (b) Vector FU layout (SHF); (c) Vector FU layout (HWSHUF/MemShuf); (d) Vector FU layout (chain). Vector-to-vector connections are denoted with thick arrows.

using the SoftwareShuffle algorithm from Section 5.5.2; Vector-HWSHUF uses application-specific hard-wired shuffle pattern(s) that directly match the masks of shufflevector operations in the IR; Vector-MemShuf performs a shuffle via writing to and reading from a wide memory register, using the algorithm from Section 5.5.3; finally, Vector-Chain uses the partial insertion algorithm from Section 5.5.4 with a generic hard-wired shuffle <1, 2, 3, 4, 5, 6, 7, 8>, which does not match any masks of shufflevector operations in the IR directly. It has been verified through manual inspection of the generated PASM code that the correct shuffle method is used for each configuration.

The vector FU layouts in Figure 6.1c and Figure 6.1d are very similar. However, note that the layout in Figure 6.1d has one extra connection going from the vLSU node to the vALU node. This connection was added because the compiler was unable to schedule some benchmarks onto the Figure 6.1c layout after hard-wired shuffles were added, due to routing conflicts when performing repeated shuffles for the partial insertion shuffle algorithm. On top of that, extra connections are added to unused input ports, which represent (application-specific) hard-wired shuffles. This extra vLSU-to-vALU connection is not present in the other layouts, as the vALU instead has an extra incoming connection from the SHF unit or from a hard-wired shuffle.

6.2 Runtime

For all 25 combinations of benchmark and Blocks configuration, we compile the benchmark C code to PASM and run this through a simulator on the behavioral RTL logic files for the Blocks-CGRA hardware. From this we obtain a cycle-count, which represents the runtime of the application. The collected runtimes and associated speedups, normalized to the scalar version of the benchmark, are shown in Table 6.2. Graphical representations of the runtimes for each benchmark are also shown in Figure 6.2 for the MemoryCopy benchmark, Figure 6.3 for the Binarization benchmark, Figure 6.4 for the Sum benchmark, Figure 6.5 for the Convolution benchmark, and Figure 6.6 for the MatVecMul benchmark. As the cycle-counts for each benchmark are not directly comparable, these are formatted as separate graphs. Finally, normalized speedup results are shown graphically in Figure 6.7 for the total cycle-count, and in Figure 6.8 for the non-stall cycle-count.

For almost all test cases, we observe that a significant speedup is achieved when we only count non-stall cycles. For the Convolution and MatVecMul benchmarks, this speedup mostly approaches the vector width, which is 8. For MemoryCopy, Binarization and Sum, the speedup even exceeds the vector width. In other words, by vectorizing the scalar program with a vector width of 8, the program is made more than 8 times faster. However, this can be explained by the fact that the scalar part of the Blocks configuration is essentially running double-duty when using the Scalar configuration. In the scalar version of the benchmark, the scalar Functional Units are used not only for computing the actual results but also for managing loop conditions. When the program is vectorized, the computation is off-loaded to the vector part of the architecture (i.e. the vectorized Functional Units), and the scalar units only need to manage loop conditions. Because of this, in addition to processing 8 times as many data elements during each loop iteration, the loop body can also be made smaller, resulting in a speedup that exceeds 8×.

Figure 6.2: Number of cycles runtime for MemoryCopy.

Two test cases where we see a notably lower speedup are the Convolution benchmark on the Vector-MemShuf and Vector-Chain configurations. We can explain this by the fact that the Convolution benchmark makes heavy use of vector shuffles inside the loop body. As noted in Table 5.1 (Section 5.5), the memory shuffle and partial insertion algorithms for vector shuffling were estimated to have OK to poor performance, whereas the hard-wired shuffle and Vector Shuffle Unit (SHF) methods were estimated to have good to excellent performance, based on the algorithm complexity and coefficients. This observation, then, is in line with our results that Vector-MemShuf and Vector-Chain have a worse performance for this benchmark than Vector-SHF and Vector-HWSHUF. We also find that the Vector-SHF and Vector-HWSHUF configurations produce the same cycle-count for this benchmark. As the shuffle pattern for Convolution is a simple left-rotation by 1 element, the shuffle lowers to 2 cycles using the SHF method and 1 cycle using the hard-wired shuffle method. However, routing the vector to the FU that will perform the shuffle also plays a role, and the Vector-HWSHUF configuration in this case incurs one extra cycle of routing the vector to the hard-wired shuffle compared to routing the vector to the SHF unit with the Vector-SHF configuration, which explains the identical cycle-counts in this case.

The MemoryCopy, Binarization and MatVecMul benchmarks do not use vector shuffles, and the Sum benchmark uses comparatively few vector shuffles; as a result, for these benchmarks the performance for all vector configurations is largely identical. The exception is the Vector-Chain configuration for the Binarization

Benchmark    | Configuration  | Non-stall cycles | Total cycles | Non-stall speedup | Total speedup
MemoryCopy   | Scalar         | 130017           | 170022       | 1.000×            | 1.000×
             | Vector-SHF     | 11271            | 23776        | 11.536×           | 7.151×
             | Vector-HWSHUF  | 11271            | 23776        | 11.536×           | 7.151×
             | Vector-MemShuf | 11271            | 23776        | 11.536×           | 7.151×
             | Vector-Chain   | 11271            | 23776        | 11.536×           | 7.151×
Binarization | Scalar         | 170017           | 210022       | 1.000×            | 1.000×
             | Vector-SHF     | 16271            | 28776        | 10.449×           | 7.299×
             | Vector-HWSHUF  | 16271            | 28776        | 10.449×           | 7.299×
             | Vector-MemShuf | 16271            | 28776        | 10.449×           | 7.299×
             | Vector-Chain   | 15021            | 27526        | 11.319×           | 7.630×
Sum          | Scalar         | 104019           | 120026       | 1.000×            | 1.000×
             | Vector-SHF     | 10042            | 30049        | 10.358×           | 3.632×
             | Vector-HWSHUF  | 10034            | 33041        | 10.367×           | 3.633×
             | Vector-MemShuf | 10067            | 33074        | 10.333×           | 3.629×
             | Vector-Chain   | 9092             | 32099        | 11.441×           | 3.739×
Convolution  | Scalar         | 366443           | 465256       | 1.000×            | 1.000×
             | Vector-SHF     | 54999            | 93447        | 6.663×            | 4.979×
             | Vector-HWSHUF  | 54999            | 93447        | 6.663×            | 4.979×
             | Vector-MemShuf | 92504            | 130952       | 3.961×            | 3.553×
             | Vector-Chain   | 142467           | 180915       | 2.572×            | 2.572×
MatVecMul    | Scalar         | 78415            | 102996       | 1.000×            | 1.000×
             | Vector-SHF     | 10192            | 45525        | 7.694×            | 2.262×
             | Vector-HWSHUF  | 10192            | 45525        | 7.694×            | 2.262×
             | Vector-MemShuf | 10192            | 45525        | 7.694×            | 2.262×
             | Vector-Chain   | 10192            | 45525        | 7.694×            | 2.262×

Table 6.2: Runtime and speedup results for all benchmarks.

Figure 6.3: Number of cycles runtime for Binarization.

Figure 6.4: Number of cycles runtime for Sum.

Figure 6.5: Number of cycles runtime for Convolution.

Figure 6.6: Number of cycles runtime for MatVecMul.

Figure 6.7: Total speedup for all benchmarks.

Figure 6.8: Non-stall speedup for all benchmarks.

and Sum benchmarks, where the runtime is lower compared to the other vector configurations. For the Binarization benchmark, the difference is 1250 cycles, whereas for the Sum benchmark, the difference is 975 cycles. The Binarization benchmark operates on a 100 × 100 = 10,000 pixel data set; divided by 8, this gives 1250 loop iterations. Meanwhile, the Sum benchmark operates on an 8000-integer data set; divided by 8, this gives 1000 loop iterations. Therefore, we can reason that on the Vector-Chain configuration, the loop body of the Binarization benchmark is 1 cycle shorter than on the other vector configurations. This also seems to be the case for the Sum benchmark, as the cycle difference, 975 cycles, approaches the number of loop iterations, 1000. Note that the Sum benchmark does, in fact, use vector shuffles, and the Vector-Chain configuration uses a slower vector shuffling method than the other vector configurations, which can explain the 25-cycle difference. The reason that these loop bodies are 1 cycle smaller for the Vector-Chain configuration can be explained by the extra connection from vLSU to vALU that is added in the Vector-Chain configuration compared to the other vector configurations (see Section 6.1). Both the Binarization and Sum benchmarks perform computations on data read from memory, and the added connection means that the data read from memory does not need to pass through the vRF node first, which indeed saves 1 cycle. As the MemoryCopy benchmark does not perform computations on the read data, there is no cycle saved here.

When we take the cycle-counts with stall cycles added in, the speedups are considerably lower. While for the MemoryCopy and Binarization benchmarks we still achieve a speedup that is close to the vector width, it is possible this is simply due to the fact that duties are now split between the scalar and vector parts of the Blocks design. The Convolution benchmark also performs respectably using the Vector-SHF and Vector-HWSHUF configurations, which feature fast vector shuffling methods. However, for the rest of the Sum, Convolution and MatVecMul test cases, the total speedup does not approach the vector width.

One reason for this becomes clear when we look at the number of stall cycles in each benchmark. As shown in Figure 6.2, Figure 6.3 and Figure 6.5, the number of stall cycles is significantly reduced for the vectorized versions of the MemoryCopy, Binarization and Convolution benchmarks. This can be explained by the fact that these particular benchmarks operate on a data set where each data element is a single byte. When four adjacent bytes are read from, or written to, memory on a 32-bit-aligned memory address, the Blocks-CGRA is able to coalesce this into a single memory access, which incurs fewer stall cycles than reading/writing the bytes one-by-one, as the scalar versions of the benchmarks do. This is because the DTL bus used for global memory on the Blocks-CGRA always causes at least one (stall) cycle of delay. Meanwhile, from Figure 6.4 and Figure 6.6, we observe that the number of stall cycles is significantly increased for the vectorized versions of the Sum and MatVecMul benchmarks. These two benchmarks operate on data sets where the data elements are 32-bit integers, and so the 8 simultaneous memory accesses cannot be coalesced in any way. As the Blocks-CGRA must perform arbitration when multiple non-coalescible memory accesses occur simultaneously, this potentially incurs more stall cycles than performing the memory accesses one-by-one, as the scalar versions of the benchmarks do.

Despite this, at the very least we do not see any performance regressions for any of our test cases, meaning the vectorized test cases do indeed run in less time than their scalar equivalents.

6.3 Area size

For all five Blocks configurations, we additionally measure the area usage. This is achieved by performing a logic synthesis on each Blocks configuration, which instantiates the Functional Units in the configuration and then translates the behavioral RTL of the Blocks hardware + Functional Units to logic gates. Following this, we measure the area usage. The area measurements are obtained from a logic synthesis and not from a full place-and-route; moreover, the Blocks measurement model still has some inaccuracies. Therefore, the obtained results are not accurate in terms of absolute numbers, but the results are still believed to be good enough for relative comparisons.

The obtained results are normalized based on the measured area size for the scalar configuration and are listed in Table 6.3. A graphical representation of the results is given in Figure 6.9. We list Vector-HWSHUF twice, as this configuration contains hard-wired shuffle patterns that are specific to each benchmark. As a result, we list the base area size for a Vector-HWSHUF configuration with no shuffle patterns at all, as well as the maximum area size measured across all five benchmarks with the application-specific hard-wired shuffle patterns added in. In addition to total area, we also give a rough estimation of how the area is divided over the components of the architecture, which is obtained from the synthesis results. The area is divided into categories as follows:

• App-specific: This represents the difference in area size between the base Vector-HWSHUF area and the maximum Vector-HWSHUF area measured (with application-specific shuffles added in).

• IF + ID + IMem + Imm: This category encompasses all Instruction Decoder units in the configuration, along with Instruction Fetch modules and associated Instruction Memory. As the Immediate Units in the Blocks-CGRA are special versions of Instruction Decoder units, we group them together in this category.

• FU + LMem + Arbiter: Represents all other Functional Units that are not Instruction Decoders or Immediate Units. For LSUs, we also include their grouped Local Memory. Moreover, we include the size of the general arbiter, as this is mostly dependent on the number of LSUs. This category, together with the previous category, forms the main user-configured portion of the Blocks-CGRA.

• GMem + Peripherals: This category represents the Global Memory present in the system as well as any memory-mapped peripherals (mainly the wide memory register used for Vector-MemShuf).

• Rest: This category represents all other area usage that does not fall neatly into any of the other categories, including the base hardware area for the Blocks-CGRA.

We find that the relative increase in total area size falls far below the roughly 8× that we would expect based on the vector width. This can be explained by

Configuration        | FU+LMem+Arb | IF+ID+IMem+Imm | GMem+Peri | Total
Scalar               | 1.000×      | 1.000×         | 1.000×    | 1.000×
Vector-SHF           | 9.315×      | 2.970×         | 1.000×    | 2.305×
Vector-HWSHUF (base) | 8.673×      | 1.842×         | 1.000×    | 2.069×
Vector-HWSHUF (max)  | 9.276×      | 1.842×         | 1.000×    | 2.143×
Vector-MemShuf       | 9.326×      | 1.841×         | 1.006×    | 2.157×
Vector-Chain         | 9.566×      | 1.841×         | 1.000×    | 2.180×

Table 6.3: Normalized total area size for all configurations.

Figure 6.9: Normalized area size for all configurations.

We find that the relative increase in total area size falls far below the roughly 8× that we would expect based on the vector width. This can be explained by the fact that while we add, for instance, 8 LSUs to the vectorized part of the configuration, these LSUs are all driven by a single Instruction Decoder. Indeed, we see an increase of 1.8× in ID-related area size for most configurations; for Vector-SHF, the increase is larger due to the addition of the Vector Shuffle Unit (SHF), which contains a number of extra ID units. This makes sense, as the vector portion of the configuration contains more or less the same nodes as the scalar portion, excluding the ABU and Immediate Unit, so an increase approaching 2× is to be expected. Meanwhile, the increase in area size for other Functional Units and related elements tends to lie around 9×, which also falls in line with the fact that the vectorized configurations contain the scalar nodes as well as vectorized versions of those nodes with width 8. Fluctuations of this increase across configurations can be explained by the fact that FU connections, including shuffle patterns, are grouped together with the corresponding FUs by the measurement software; this is best seen with the base and maximum Vector-HWSHUF configurations, which differ only in the number of hard-wired shuffle patterns present.

Another reason why we do not reach an 8× area size can be seen by looking at the area usage for the base hardware and global memory. On the Scalar configuration, the user-configurable portion of the CGRA (Functional Units and associated Instruction Decoders) takes up only 25.7% of the total area size, which is relatively small. When the configuration is vectorized, the area usage mainly increases for this part, whereas the base hardware and global memory area usage barely grow. This indicates that the base Blocks hardware and global memory carry a significant overhead when a scalar configuration is used, and this overhead is significantly reduced by using vectorized nodes.

We can additionally see the impact of the SHF unit by comparing the Vector-HWSHUF (base) result to the Vector-SHF result; these FU configurations are very similar, the main difference being that Vector-SHF additionally contains a SHF unit, which consists of 8 ALUs and 8 Instruction Decoders. We find that the ID-related area size is significantly higher, going from 1.8× to almost 3× due to the 8 extra ID units; the area for other FUs also grows somewhat. Looking at the total area size, we find that the SHF unit by itself increases total area usage by 11.4% compared to the Vector-HWSHUF (base) configuration.
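To make this concrete, a back-of-the-envelope estimate can be made. The split of the scalar configuration's 25.7% user-configurable area into a decoder-related share $f_{\mathrm{ID}}$ and an FU-related share $f_{\mathrm{FU}}$ is not reported separately, so a roughly even split is assumed here purely for illustration:

\[
A_{\mathrm{norm}}^{\mathrm{total}}
\approx f_{\mathrm{fixed}} \cdot 1 + f_{\mathrm{ID}} \cdot s_{\mathrm{ID}} + f_{\mathrm{FU}} \cdot s_{\mathrm{FU}}
\approx 0.743 \cdot 1 + 0.13 \cdot 3.0 + 0.13 \cdot 9.3 \approx 2.3 ,
\]

where $f$ denotes an area share of the scalar configuration and $s$ the measured per-category scaling factor of Vector-SHF. The outcome is close to the measured 2.305×, and repeating the exercise with the factors of the other configurations lands in the same 2.1× to 2.3× range.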

6.4 Energy usage

Finally, we measure the total energy usage for all 25 test cases by performing a full compile, logic synthesis, and simulation cycle. This yields a power value for each test case, which is multiplied by the total cycle count to obtain an energy usage data point. As with the area results, these energy results are not accurate in terms of absolute numbers, but are believed to be suitable for relative comparisons. The obtained results are normalized against the measured energy usage of the scalar version of each benchmark and are listed in Table 6.4. A graphical representation is given in Figure 6.10. We see significant energy reductions here, of more than 4× for the MemoryCopy and Binarization benchmarks.
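For reference, the way a power value and a cycle count combine into an energy data point can be written out. Assuming, for illustration, that all configurations are evaluated at the same clock period $T_{\mathrm{clk}}$, that period cancels out of the normalized comparison:

\[
E = \bar{P} \cdot N_{\mathrm{cycles}} \cdot T_{\mathrm{clk}},
\qquad
\frac{E_{\mathrm{vec}}}{E_{\mathrm{scalar}}} = \frac{\bar{P}_{\mathrm{vec}} \cdot N_{\mathrm{vec}}}{\bar{P}_{\mathrm{scalar}} \cdot N_{\mathrm{scalar}}} .
\]

This also makes explicit that a vectorized configuration may draw more power per cycle and still use less energy overall, as long as the cycle-count reduction outweighs the power increase.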

Benchmark      Scalar    Vector-SHF   Vector-HWSHUF   Vector-MemShuf   Vector-Chain
MemoryCopy     100.00%   30.51%       24.21%          25.17%           24.93%
Binarization   100.00%   30.68%       24.80%          25.63%           24.28%
Sum            100.00%   56.88%       46.34%          46.18%           44.67%
Convolution    100.00%   41.05%       33.15%          55.52%           68.19%
MatVecMul      100.00%   88.89%       69.54%          70.57%           70.75%

Table 6.4: Normalized energy usage for all benchmarks.


Figure 6.10: Normalized energy usage for all benchmarks.

The worst case is the MatVecMul benchmark on the Vector-SHF configuration, which uses only 11% less energy than the scalar version. However, this is still an overall reduction in energy usage, and the total runtime for this particular test case is reduced by a factor of 2.3× as well. The Vector-HWSHUF, Vector-MemShuf and Vector-Chain configurations perform similarly for most benchmarks, except for Convolution, where large differences can be seen; this can be attributed to the fact that the Convolution benchmark has a significantly longer runtime on these configurations.

These energy measurements are largely congruent with our runtime findings: the benchmarks that see the lowest energy reduction also saw the lowest speedup, and likewise, benchmarks with a high energy reduction also saw a high speedup. However, we find that the Vector-SHF configuration performs notably worse in terms of energy usage than all other vector configurations. For instance, compared to the Vector-HWSHUF configuration, which resulted in equal cycle counts in most test cases, energy usage for the Vector-SHF configuration increases by 23% to 28%. Given that the major difference between these configurations is the presence of a SHF unit, we can conclude that although the SHF method is one of the faster vector shuffling methods, using a SHF unit in a Blocks configuration is rather energy inefficient.

Chapter 7

Conclusion

7.1 Summary

In this work, we have introduced a host of modifications made to several stages of the Blocks-CGRA toolflow in order to add support for compiling vectorized C programs to Blocks code. We have identified various problems on the Blocks-CGRA that complicate the compilation of vectorized C programs, and extended the Blocks compiler model in order to circumvent these. Various additions to the Blocks hardware, memory interface and instruction set have also been implemented with the goal of improving the performance of vectorized programs. Moreover, the LLVM-based Blocks compiler has been extended with procedures to lower vectorized code onto the Blocks-CGRA, using various methods to perform vector shuffles based on the units that are available in the Blocks configuration. This supports the parametric design of the Blocks-CGRA, where the configuration instance can be tailored to the application in order to achieve a higher energy efficiency or throughput.

From the results obtained in Chapter 6, we find that both runtime and energy usage are significantly decreased for vectorized programs compared to their scalar equivalents. The speedups obtained for a vector width of 8 range from 2.3× to 7.6×; if we leave out stall cycles, the speedup can, in certain cases, even exceed the vector width. Moreover, we find energy reductions ranging from 11.1% in the worst case to as much as 75.8%. In the optimal case, a vectorized program was found to finish in 13% of its original runtime whilst using only 24.2% of the original energy. In the worst case, a vectorized program finished in 44% of its original runtime whilst using 88.9% of the original energy. No regressions in either runtime or energy usage were observed in any of the 20 vectorized test cases when compared to their scalar equivalents. At the same time, area size increased by up to 2.3× in the worst case and by 2.1× to 2.2× on average; a rather modest increase for a vector width of 8. This suggests that the hardware design of Blocks is very well tailored to vectorized applications, which is supported by the literature [18]. As we have found in this work, this also applies to high-level C programs that have been vectorized through a compiler rather than hand-written in Parallel Assembly.

A number of problems pertaining to vectorized programs on the Blocks-CGRA remain.

The main challenge in running vectorized programs on a CGRA is vector locality: each vectorized unit can only directly access its own element. This makes operations such as vector shuffling difficult to perform without specialized hardware extensions or hard-wired shuffle patterns embedded in the architecture configuration. Moreover, the stall cycles incurred when reading or writing vectors from or to global memory are a significant bottleneck when dealing with data element sizes that cannot be coalesced into a single memory access. Our findings indicate that getting around these two limitations could decrease the running times of vectorized programs much further.

7.2 Future work

In this section we shall describe some topics that may be interesting for future research.

• Partial insertion search space: Currently, the search space for our partial insertion algorithm (Section 5.5.4) is quite limited. The potential search space for derived shuffles is enormous, and this method of vector shuffling can be particularly useful for Blocks configurations with limited (room for) hard-wired shuffle patterns. Future work may look into a more efficient or effective navigation of this search space in order to find more convenient derived shuffles and bring down the cycle counts for vector shuffles performed using this method; a conceptual sketch of such a search is given after this list.

• Vector width variations: We have implemented our vector support in the Blocks-CGRA compiler only for a vector width of 8 at this time. The Blocks-CGRA compiler is currently based on LLVM version 4, but there are recent efforts towards updating it to a newer version of LLVM. Once this is completed, it may be worthwhile to look into adding support for more vector widths, as this could produce further insight into the runtime, area and energy usage trade-offs related to the choice of vector width.

• Configuration adjustments in the compiler: At this time, the Blocks-CGRA compiler is not equipped to make modifications to the Blocks configuration during the compilation process. Achieving this would likely require a significant restructuring of the Blocks scheduler. However, it may have its merits, as the compiler has unique insight into which units in the system are most or least valuable to the program being compiled. The compiler could even cull unnecessary Functional Units from the configuration if it detects that these are not useful. For instance, this would allow the Multiplexed Register File (RFM) to be automatically scaled to the number of units required by the system. Moreover, it would allow the compiler to automatically add hard-wired shuffle patterns to the configuration for the shufflevector operations used in the program.

• Stall cycle reduction: Stall cycles currently form a significant bottleneck for some vectorized applications, such as our Sum and MatVecMul classes of benchmarks. These stall cycles occur when vectors are accessed in global memory. In order to improve the performance of vectorized programs, it would be highly useful to reduce the number of stall cycles incurred during the execution of an application. One way to achieve this might be to move input data directly into the local memories of the vectorized LSUs, so that vectors can be loaded or written without producing stall cycles; this may be done with e.g. a Direct Memory Access (DMA) controller. Another way would be to introduce posted writes, which would reduce stall cycles at least for writes to memory.
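As a rough illustration of what navigating the derived-shuffle search space could look like, the C++ sketch below performs a breadth-first search over compositions of the shuffle patterns available in a configuration until it reproduces a requested target shuffle. This is a conceptual sketch under our own assumptions, not the partial insertion algorithm of Section 5.5.4; all names are illustrative, and a real implementation would also have to account for per-step cycle costs and resource constraints.

// Conceptual sketch: find a sequence of available (hard-wired) shuffle
// patterns whose composition realizes a requested target shuffle.
#include <cstdio>
#include <map>
#include <queue>
#include <vector>

using Pattern = std::vector<int>;  // pattern[i] = lane that feeds lane i

// Apply one more pattern on top of the composite reached so far:
// after the step, lane i holds the element that lane pattern[i] held before.
static Pattern compose(const Pattern& cur, const Pattern& pattern) {
  Pattern next(cur.size());
  for (std::size_t i = 0; i < cur.size(); ++i) next[i] = cur[pattern[i]];
  return next;
}

// Breadth-first search: returns the shortest sequence of indices into
// `available` whose composition equals `target`, or an empty vector if no
// sequence of at most maxDepth steps is found.
static std::vector<int> findDerivedShuffle(const std::vector<Pattern>& available,
                                           const Pattern& target, int maxDepth) {
  const std::size_t width = target.size();
  Pattern identity(width);
  for (std::size_t i = 0; i < width; ++i) identity[i] = static_cast<int>(i);

  std::map<Pattern, std::vector<int>> visited;  // composite -> pattern sequence
  visited[identity] = {};
  std::queue<Pattern> frontier;
  frontier.push(identity);

  while (!frontier.empty()) {
    Pattern cur = frontier.front();
    frontier.pop();
    std::vector<int> path = visited[cur];
    if (static_cast<int>(path.size()) >= maxDepth) continue;
    for (std::size_t p = 0; p < available.size(); ++p) {
      Pattern next = compose(cur, available[p]);
      if (visited.count(next) != 0) continue;  // already reached in fewer steps
      std::vector<int> nextPath = path;
      nextPath.push_back(static_cast<int>(p));
      if (next == target) return nextPath;
      visited[next] = nextPath;
      frontier.push(next);
    }
  }
  return {};  // not derivable within maxDepth steps
}

int main() {
  // Two hard-wired patterns for a 4-wide example: rotate-by-one and pairwise swap.
  std::vector<Pattern> available = {{1, 2, 3, 0}, {1, 0, 3, 2}};
  Pattern target = {2, 3, 0, 1};  // rotate-by-two, derivable as rotate-by-one twice
  std::vector<int> steps = findDerivedShuffle(available, target, 4);
  if (steps.empty()) {
    std::printf("no derived shuffle found\n");
  } else {
    for (int p : steps) std::printf("apply pattern %d\n", p);
  }
  return 0;
}

In practice, each candidate sequence would also need to be weighed against its cycle cost on the target configuration, and the breadth-first expansion would need tighter pruning than a simple depth limit.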

Bibliography

[1] M. Adriaansen, M. Wijtvliet, R. Jordans, L. Waeijen, and H. Corporaal. Code generation for reconfigurable explicit datapath architectures with LLVM. In 2016 Euromicro Conference on Digital System Design (DSD), pages 30–37, United States of America, August 2016. Institute of Electrical and Electronics Engineers (IEEE).

[2] C.H. van Berkel. Multi-core for mobile phones. In 2009 Design, Automation Test in Europe Conference Exhibition, pages 1260–1265, April 2009.

[3] Free Software Foundation. Loop-specific pragmas. In Using the GNU Compiler Collection (GCC), chapter 6.61.16. 8.1 edition, May 2018. https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Loop-Specific-Pragmas.html.

[4] R. Hartenstein. Coarse grain reconfigurable architecture (embedded tutorial). In Proceedings of the 2001 Asia and South Pacific Design Automation Conference, ASP-DAC ’01, pages 564–570, New York, NY, USA, 2001. ACM.

[5] J. Hoogerbrugge. Code generation for Transport Triggered Architectures. PhD thesis, Delft University of Technology, Delft, The Netherlands, February 1996.

[6] P. Larsson and E. Palmer. Image processing acceleration techniques using Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions. Intel, September 2009. https://software.intel.com/en-us/articles/image-processing-acceleration-techniques-using-intel-streaming--extensions-and-intel-advanced-vector-extensions/.

[7] LLVM Developer Group. Auto-vectorization in LLVM. In LLVM 4 documentation. 4.0.0 edition, March 2017. https://releases.llvm.org/4.0.0/docs/Vectorizers.html.

[8] LLVM Developer Group. LLVM language reference manual. In LLVM 4 documentation. 4.0.0 edition, March 2017. https://releases.llvm.org/4.0.0/docs/LangRef.html.

[9] C. Lomont. Introduction to Intel Advanced Vector Extensions. Intel, June 2011. https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions.

[10] NVIDIA Corporation. CUDA toolkit documentation, 9.2.88 edition, March 2018. https://docs.nvidia.com/cuda/.

[11] C. Piper. An introduction to vectorization with the Intel C++ Compiler. Intel, 2012.

[12] A. Tiemersma. Optimal instruction scheduling and register allocation for Coarse-Grained Reconfigurable Architectures. Master’s thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, March 2017.

[13] K. Vadivel. Energy efficient loop mapping techniques for Coarse-Grained Reconfigurable Architecture. Master’s thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, January 2017.

[14] K. Vadivel, R. Jordans, S. Stuijk, H. Corporaal, P. Jääskeläinen, and H. Kultala. Towards efficient code generation for exposed datapath architectures. In Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems, SCOPES 2019, pages 86–89, United States, May 2019. Association for Computing Machinery, Inc.

[15] K. Vadivel, M. Wijtvliet, R. Jordans, and H. Corporaal. Loop overhead reduction techniques for Coarse-Grained Reconfigurable Architectures. In 2017 Euromicro Conference on Digital System Design (DSD), pages 14–21, August 2017.

[16] L. Waeijen, D. She, H. Corporaal, and Y. He. SIMD made explicit. In 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pages 330–337, July 2013.

[17] S. Walstock. Pipelining streaming applications on a multi-core CGRA. Master’s thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, February 2018.

[18] M. Wijtvliet, J.A. Huisken, L.J.W. Waeijen, and H. Corporaal. Blocks: Redesigning coarse grained reconfigurable architectures for energy efficiency. In 29th International Conference on Field Programmable Logic and Applications, FPL 2019, September 2019 (to be published).

[19] M. Wijtvliet, L. Waeijen, and H. Corporaal. Coarse grained reconfigurable architectures in the past 25 years: overview and classification. In 2016 16th International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2016, pages 235–244, United States, January 2017. Institute of Electrical and Electronics Engineers.

[20] M. Wijtvliet, L.J.W. Waeijen, M. Adriaansen, and H. Corporaal. Reaching intrinsic compute efficiency requires adaptable micro-architectures. In 9th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2016), pages 1–7, January 2016.

[21] P. Xiang, Y. Yang, and H. Zhou. Warp-level divergence in GPUs: characterization, impact, and mitigation. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 284–295, February 2014.
