Towards Compilation of an Imperative Language for FPGAs

Baptiste Pauget (Département d'informatique, École Normale Supérieure, France), David J. Pearce (School of Engineering and Computer Science, Victoria University of Wellington, New Zealand), and Alex Potanin (School of Engineering and Computer Science, Victoria University of Wellington, New Zealand)

Abstract

Field-Programmable Gate Arrays (FPGAs) have been around since the early 1980s and have now achieved relatively widespread use. For example, FPGAs are routinely used for high-performance computing, financial applications, seismic modelling, DNA sequence alignment, software-defined networking and, occasionally, are even found in smartphones. And yet, despite their success, there still remains something of a gap between programming languages and circuit designs for an FPGA. We consider the compilation of an imperative programming language, Whiley, to VHDL for use on an FPGA. A key challenge lies in splitting an arbitrary function into a series of pipeline stages, as necessary to expose as much task parallelism as possible. To do this, we introduce a language construct which gives the programmer control over how the pipeline is constructed.

CCS Concepts • Hardware → Hardware description languages and compilation; Reconfigurable logic applications; • Software and its engineering → Imperative languages.

Keywords Field-Programmable Gate Arrays, Compilers, Hardware Description Languages

ACM Reference Format:
Baptiste Pauget, David J. Pearce, and Alex Potanin. 2018. Towards Compilation of an Imperative Language for FPGAs. In Proceedings of the 10th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages (VMIL '18), November 4, 2018, Boston, MA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3281287.3281291

1 Introduction

Field-Programmable Gate Arrays (FPGAs) can no longer be considered a "new" technology. Since their introduction in the early 1980s by Xilinx they have finally reached the mass market. FPGAs are used in a variety of arenas such as HPC acceleration [14], seismic modelling [31], encryption [1, 28], financial applications [33, 44], DNA sequence alignment [15, 35], software-defined networking [30, 41], and are even found in smartphones (e.g. the Samsung Galaxy S5 and iPhone 7). FPGAs provide a middle ground between Application-Specific Integrated Circuits (ASICs) and general-purpose CPUs. Compared with a CPU, an FPGA offers performance benefits from a gate-level abstraction [5] and can provide higher throughput with lower power usage than a CPU or GPU [43, 44]. Bacon et al. note the following [5]:

    "When it comes to power efficiency (performance per watt), however, both CPUs and GPUs significantly lag behind FPGAs"

When compared with traditional ASIC manufacturing processes, FPGAs offer greater flexibility as, in some sense, hardware becomes more like software. This is even more apparent today since modern FPGAs can be reconfigured (or partially reconfigured) in a matter of milliseconds.

Roughly speaking, an FPGA is made up of a large number of configurable logic blocks (CLBs). Each typically consists of some number of lookup tables (LUTs), flip-flops, and full-adders. Modern FPGAs, such as Xilinx's Virtex-7 series, are manufactured at the nanometer scale (e.g. 28nm), operate in the 500MHz frequency range and provide millions of logic blocks. They may also contain other components, such as block RAMs (BRAMs), Digital Signal Processing slices (DSPs), communication interfaces (e.g. Ethernet, PCI), and processor cores (e.g. ARM) [46]. The design process for an FPGA typically revolves around the use of Hardware Description Languages (HDLs) such as VHDL [24] or Verilog [25]. These languages offer relatively low-level abstractions and, as such, are ideally suited to thinking in terms of low-level building blocks such as gates, registers and multiplexers [5]. However, they offer few high-level abstractions and this makes developing for FPGAs costly and time consuming [12].

We are interested in compiling programs written in an imperative programming language, Whiley, for an FPGA [40]. Whiley is unusual in providing support for function specifications (i.e. pre-/post-conditions) which can be verified at compile-time. Whiley follows in a long lineage of languages, such as ESC/Java [19], Spec# [8], Dafny [32], Why3 [18], VeriFast [27], and SPARK/Ada [7]. Such languages, of course, are used to verify properties of software systems. From this, a natural question arises: can they verify properties of hardware systems? This provides the general motivation for our work (and, indeed, shares similarity with aspects of Deep-

Spec [2, 11]). However, it raises an immediate challenge: how do we compile imperative languages, such as Whiley, for FPGAs?

The Problem. Compiling high-level languages for an FPGA is, of course, not a new problem. Many so-called "C-to-gates" systems exist for compiling subsets of C for FPGAs, such as Stream-C [21], Handel-C [10], Autopilot [47] and more. Other approaches implement libraries or Domain Specific Languages (DSLs) for generating circuit designs from program code, such as ASC [36, 37], JHDL [9] and more. And yet, we find these systems remain unsatisfactory. Either they are too "gate-oriented" or overly restrictive in terms of supported language features. Part of the challenge is that it remains unclear what it even means to compile a program for an FPGA. Certainly, FPGAs are well suited for particular kinds of computation. Perhaps the most common paradigm is that of hardware acceleration, whereby a loop nest — or kernel — is converted into a data pipeline on the FPGA. Brodtkorb et al. make the following statement in this regard [12]:

    "A key element to obtaining high-performance on FPGAs is to use as many slices as possible for parallel computation. This can be achieved by pipelining the blocks, trading latency for throughput; by data-parallelism, where data-paths are replicated; or a combination of both"

This gives insight into the difference between programming for a CPU versus for an FPGA. Specifically, when programming for an FPGA we must utilise as much silicon as possible to maximise performance. Thus, we cannot view a function as an entity to execute sequentially; rather, we must view it as something to be broken up into a pipeline.

The Approach. Compiling programs written in a high-level language for an FPGA presents two primary challenges: (1) extracting latent data parallelism; (2) enabling task parallelism. In this paper, we focus on the latter. More specifically, how do we split an arbitrary function into a series of pipeline stages? This is important to ensure as much silicon as possible is utilised at any given moment. And yet, at the same time, there is no explicit information embedded in the program source to guide us. Therefore, our approach is to provide such information. The contributions of this paper are as follows:

• We present a novel approach to compiling imperative languages for an FPGA. This is based around a language primitive (stage) which allows fine-grained control over the generated pipeline.

• We have developed a prototype implementation which compiles programs in the Whiley language into VHDL which can then be synthesised for an FPGA.

2 Background

In this section, we provide some background on FPGAs and a short introduction to the Whiley language.

2.1 Field-Programmable Gate Arrays

An FPGA consists, roughly speaking, of an array of Configurable Logic Blocks surrounded by I/O pads and a programmable interconnect network which, typically, constitutes most of the area [17]. Figure 1 provides a simplistic illustration following a traditional "island-style" layout.

Figure 1. A simplistic architectural view of an FPGA.

Configurable Logic Blocks (CLBs). Roughly speaking, a CLB contains various components, normally including some lookup tables (LUTs), storage elements (e.g. D-type flip-flops), and other machinery such as carry logic, multiplexers, etc (the exact details differ between devices).

Figure 2 illustrates an example CLB made up from two three-bit lookup tables (LUT), one full adder (FA), one D-type flip-flop (FF) and three multiplexers (MUX). We can, for example, implement a four-bit LUT by combining the two LUTs together and bypassing the adder and the flip-flop.

Figure 2. A simplistic view of a CLB.
Or, we could implement a simple ripple-carry adder by chaining several CLBs together (i.e. by connecting carry out to successive carry in, etc), making the LUTs sensitive only to d1 and, again, bypassing the flip-flop. Finally, we could implement one bit of a register by again making the LUTs sensitive only to d1 and bypassing the adder.

Building Blocks. A CLB can implement any of the classical gates (e.g. nand-, nor-gates, etc) or act as a flip-flop, multiplexer or adder. As such, the basic building blocks of digital electronics are available. A four-bit ripple-carry adder can be constructed by chaining four CLBs together, connecting the carry out of each full adder to the carry in of the next:

[Diagram: four full adders (FA) chained via carries c0–c3, adding inputs a3 b3 … a0 b0 to produce outputs d3 … d0.]

This is about the simplest possible implementation of an adder and trades performance for a minimal number of gates. The time taken for the result to propagate through the adder once the inputs are set (latency) is proportional to the bit-width. In this case, it is four times the gate delay for an adder in our CLB implementation. Other, faster, implementations are possible, such as the carry-lookahead and carry-skip adders, which reduce the delay at the expense of more gates. The implementations of other arithmetic operators follow in roughly the same fashion and, likewise, different choices are available with different trade-offs. We note that multiplication is more expensive in terms of both gates and latency compared with an adder, and division even more so. In contrast, bitwise operators have constant latency determined by the gate delay across a CLB.
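To make this concrete, the following is a minimal VHDL sketch of such a four-bit ripple-carry adder. This is our own illustration (the entity and signal names are not taken from the paper's compiler output); it simply spells out the chained full-adder structure described above:

    library ieee;
    use ieee.std_logic_1164.all;

    -- A four-bit ripple-carry adder: each bit computes a full-adder
    -- sum and passes its carry to the next bit, mirroring the chained
    -- CLB structure described in the text.
    entity ripple_adder4 is
      port(a, b : in  std_logic_vector(3 downto 0);
           cin  : in  std_logic;
           d    : out std_logic_vector(3 downto 0);
           cout : out std_logic);
    end entity;

    architecture rtl of ripple_adder4 is
      signal c : std_logic_vector(4 downto 0);  -- carry chain c0..c4
    begin
      c(0) <= cin;
      stage : for i in 0 to 3 generate
        d(i)   <= a(i) xor b(i) xor c(i);                        -- sum bit
        c(i+1) <= (a(i) and b(i)) or (c(i) and (a(i) xor b(i))); -- carry out
      end generate;
      cout <= c(4);
    end architecture;

Note how the carry chain makes the worst-case delay proportional to the bit-width, exactly as discussed above.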

Aside from the basic arithmetic operators, the next most important building block is an N-bit register. A simple four-bit register feeds each of the inputs d3 … d0 through a D-type flip-flop (FF), driven by a clock signal clk, to produce the outputs q3 … q0. Unlike the arithmetic operators, a register is clocked. When the clock changes (usually on either a rising-edge or falling-edge signal), the current state of the inputs is fixed on the outputs, which hold this value until the next clock. Registers are important as they offer precise timing determined by the clock.
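In VHDL such a register is usually written as a clocked process. The following sketch (our own illustration, not the paper's generated code) captures the four-bit case:

    library ieee;
    use ieee.std_logic_1164.all;

    -- A four-bit register: the outputs q latch the inputs d on each
    -- rising clock edge and hold their value in between.
    entity reg4 is
      port(clk : in  std_logic;
           d   : in  std_logic_vector(3 downto 0);
           q   : out std_logic_vector(3 downto 0));
    end entity;

    architecture rtl of reg4 is
    begin
      process(clk)
      begin
        if rising_edge(clk) then
          q <= d;
        end if;
      end process;
    end architecture;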

Timing. In the construction of digital circuits, timing is, of course, critically important. Consider two example circuits: the first reads from two N-bit registers, passes their values through an N-bit adder, and writes the result into another N-bit register; the second has a more complex datapath, first adding the inputs and then multiplying the result by 3 before writing it to the output register. The width of the datapaths (i.e. the value of N) is not specifically important here (though it must be statically known). Assuming the input and output registers are clocked together, the maximum clock frequency is determined by the longest signal propagation delay between the input and output registers. Let's assume (as seems likely) that the gate delay for the second circuit is greater than for the first. Thus, the maximum frequency for clocking the second circuit is lower than for the first. Further, if both are part of some (larger) circuit and clocked together then, clearly, we must conservatively choose the lower clock rate.

Our circuits above help highlight the distinction between combinatorial (or time-independent) and sequential logic. In the former, the outputs are a real-time function of the inputs (subject to latency). In the latter, the outputs change only on clock signals, rather than in real-time. Thus, each of our circuits taken as a whole is sequential whilst their internals sandwiched between registers are combinatorial.

Register Balancing. The final piece of the jigsaw relates to the problem of combining sequential circuits with different cycle latencies (i.e. clock cycles from when an input is received to when the corresponding output is produced). For example, consider a high-level circuit in which two parallel paths run between the input and output registers, where the unspecified blocks between registers represent arbitrary chunks of combinatorial logic. Furthermore, we desire that each output produced corresponds to inputs which arrived at the same time. To ensure this happens, an additional register is added after the combinatorial block on the path with fewer registers, balancing the number of registers on each path. This simply ensures that values on one path remain in sync with their corresponding values on the other.

At this point, we have now covered the necessary background on digital circuit design required for the remainder. As such, we now turn our focus to the Whiley language.

2.2 Whiley

The Whiley programming language has been developed from the ground up to enable compile-time verification of programs [40]. The Whiley Compiler (WyC) attempts to ensure that every function in a program meets its specification. When it succeeds in this endeavour, we know that: 1) all function post-conditions are met (assuming their pre-conditions held on entry); 2) all invocations meet their respective function's pre-condition; 3) runtime errors such as divide-by-zero, out-of-bounds accesses and null-pointer dereferences are impossible. Note, however, such programs may still loop indefinitely and/or exhaust available resources (e.g. RAM).

Specifications. Whiley allows explicit pre- and post-conditions to be given for functions. For example, the following function accepts a positive integer and returns a natural number:

    function decrement(int x) -> (int y)
    // Parameter x must be greater than zero
    requires x > 0
    // Return must be greater or equal to zero
    ensures y >= 0:
        //
        return x - 1

Here, decrement() includes requires and ensures clauses which correspond (respectively) to its precondition and postcondition. In this context, y represents the return value and may be used only within the ensures clause. The Whiley compiler reasons by exploring control-flow paths. For example, it easily verifies the following:

    function max(int x, int y) -> (int z)
    // Must return either x or y
    ensures x == z || y == z
    // Return must be as large as x and y
    ensures x <= z && y <= z:
        //
        if x > y:
            return x
        else:
            return y

Here, multiple ensures clauses are given which are conjoined to form the function's postcondition. We find that allowing multiple ensures clauses helps readability and, likewise, multiple requires clauses are permitted as well.

Data Type Invariants. Invariants over data types can also be explicitly defined:

    // A natural number is an integer greater-than-or-equal-to zero
    type nat is (int n) where n >= 0
    // A positive number is an integer greater-than zero
    type pos is (int p) where p > 0

Here, the type declaration includes a where clause constraining the permitted values. The declared variable (e.g. n or p) represents an arbitrary value of the given type. Thus, nat defines the type of natural numbers. Likewise, pos gives the type of positive integers. Constrained types are helpful for ensuring specifications remain as readable as possible. For example, we can update decrement() as follows:

    function decrement(pos x) -> (nat n):
        //
        return x - 1

Loop Invariants. Whiley supports loop invariants as necessary for reasoning about loops. The following illustrates:

    function sum(int[] xs) -> (nat r)
    // Every item in xs is greater or equal to zero
    requires all { i in 0..|xs| | xs[i] >= 0 }:
        //
        int s = 0
        nat i = 0
        while i < |xs| where s >= 0:
            s = s + xs[i]
            i = i + 1
        return s

Here, a bounded quantifier enforces that sum() accepts an array of natural numbers (which could equally have been expressed as type nat[]). As expected, summing an array of natural numbers should yield a natural number (recall arithmetic does not overflow). The loop invariant helps the compiler generate a sufficiently powerful verification condition to statically verify that sum() meets its specification.

Flow Typing & Unions. An unusual feature of Whiley is the use of a flow typing system (see e.g. [38, 45]) coupled with union types (see e.g. [6, 26]). To illustrate, we consider null values. These have been a significant source of error in languages like Java and C#. The issue is that, in such languages, one can treat nullable references as though they are non-null references (Hoare calls this his billion dollar mistake [22]). Whilst many approaches have been proposed [16, 34], flow typing provides an elegant solution:

    // Return index of first occurrence of c in s, or null if none
    function index(string s, char c) -> int|null:
        ...

Here, index() returns the first index of a character in the string, or null if there is none. The type int|null is a union type, meaning it is either an int or null. Furthermore, to use the result, one must first check whether it is an int using the is operator (similar to instanceof in Java):

    ...
    int|null idx = index(...)
    if idx is int:
        ... // idx has type int
    else:
        ... // idx has type null

This first ensures idx is an int before using it in the true branch, and Whiley's flow type system automatically retypes idx on the true (resp. false) branch to have type int (resp. null). The use of union types here to manage null values is closely related to the use of option types in languages like Haskell, Scala and, more recently, Java 8.

3 Pipelining Computation

The inherent problem when compiling for an FPGA is to determine how and when to perform operations in parallel. In short, if one does not expose parallelism in some way then the benefits from using an FPGA are likely to be limited. Most existing high-level synthesis systems (e.g. Stream-C [21], Handel-C [10], etc) provide explicit constructs for specifying parallelism. On an FPGA (or indeed in any digital circuit) the concept of parallelism has two dimensions. We can think of these as roughly similar to the usual notions of data- and task-parallelism.

3.1 Data Parallelism

Data parallelism is a natural fit for FPGAs, which excel at performing multiple operations at the same time. For example:

    type vec is (i8[] vs) where |vs| == 4

    function add(vec xs, vec ys) -> (vec zs):
        i8 i = 0
        vec rs = [0,0,0,0]
        while i < 4:
            rs[i] = xs[i] + ys[i]
            i = i + 1
        return rs

Here, vec is defined as an array of four bytes, with add() as a function for adding two vecs. We can parallelise this loop in different ways. For example, we could employ a simple data-parallel implementation in which four adders operate side-by-side, adding the elements of xs and ys component-wise to produce rs in a single step. Although perhaps an optimal implementation, this relies on our compiler's ability to statically determine the number of loop iterations and, furthermore, that there are no real loop-carried dependencies.

3.2 Task Parallelism

Roughly speaking, the length of the longest path through a block of combinatorial logic determines the maximum clock frequency. This is because we cannot clock registers faster than signals can propagate through the combinatorial blocks. Large combinatorial blocks and slow clock rates mean relatively poor utilisation because, at any given moment, most of a block does not contribute to its output. The standard approach is to use pipelining, which has the added benefit of increasing the rate at which data items are produced. A rough pipeline consists of a series of stages between the data input and data output, clocked together. Here, each stage contains registers as necessary to hold state between clock cycles. Since the combinatorial blocks are smaller, the clock rate can be higher and, furthermore, since values are produced every clock we have better utilisation. Of course, the clock rate is still limited by the maximum path length through any particular block. Therefore, ideally, we desire roughly even-sized combinatorial blocks.

An important assumption made above is that each block requires exactly one clock cycle to complete. In fact, this need not be the case. For example, consider again our loop from §3.1 and suppose information about the loop bounds was not available. Then the add() function might be compiled into a pipeline whose middle stage is a loop block, flanked by data stages and carrying additional busy and done control signals. At this point, things are certainly more complex. Since the middle stage may take an unknown number of clock cycles, we need to be able to block stages before it. Likewise, downstream stages need to stall until data is available.¹

¹In fact, in our implementation we introduce queues to reduce the upstream impact of a long-running stage.
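A stage with unknown latency therefore needs some form of handshaking. Purely by way of illustration — the paper does not show its generated control logic, and the entity and port names here are our own assumptions — such a stage's interface might look as follows in VHDL:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Sketch of the interface for a variable-latency pipeline stage.
    -- 'start' tells the stage that new input data is available;
    -- 'done' signals that the output is valid; 'busy' stalls the
    -- upstream stages while the loop is still executing.
    entity loop_stage is
      port(clk      : in  std_logic;
           start    : in  std_logic;
           data_in  : in  std_logic_vector(7 downto 0);
           busy     : out std_logic;
           done     : out std_logic;
           data_out : out std_logic_vector(7 downto 0));
    end entity;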

Thus, the challenge for compiling to an FPGA is that we have one source-level representation of our program but multiple possible implementations. Whilst this is already true to some extent when targeting a normal CPU, the effect is more profound on an FPGA. For example, we might try to increase CLB utilisation (hence, throughput) by introducing additional pipeline stages, or by repositioning registers for a better balance between stages, etc. Since this is a daunting analysis for any compiler to perform, we prefer instead for the programmer to determine this.

Approach. We adopt a novel approach to compiling imperative code for an FPGA which gives fine-grained control over the placement of registers. This allows the programmer to tune the amount of pipelining employed to suit his/her needs. There are several facets:

• Instruction Parallelism. Our compiler automatically identifies instruction-level parallelism by translating functions into a data-flow graph representation (e.g. similar to that done elsewhere [3, 36]).

• Data Parallelism. Our compiler assumes that any data-parallelism present in the target function has already been extracted (e.g. by a previous compiler stage). That is, any loops with statically known bounds are assumed to have already been unrolled. This is a simplifying assumption as loop analysis is not our focus.

• Explicit Staging. An explicit statement is added to Whiley with which the programmer can signal the placement of registers. We refer to this as the stage statement (though, for simplicity, our prototype implementation overloads the existing skip statement).

• Register Balancing. Our compiler performs register balancing to ensure that, whatever the placement of stage statements, values propagate through the pipeline in sync (i.e. intermediate values generated from the same inputs are propagated together — recall §2.1).

• Pipeline Stalls. Our compiler supports the compilation of arbitrary loops. Such loops are assumed to terminate, though may have an unknown number of iterations. Since progress through the pipeline may stall waiting for a given loop to terminate, control signals are added (as above) to dictate when data is available, etc.

This approach provides a novel mechanism for controlling compilation for FPGAs. We note, however, that it does not allow control along all dimensions. For example, it provides no means of dictating what underlying implementation to use for a given operator. Likewise, instruction-level parallelism is identified automatically by our static analysis, and the programmer has limited control over this.

4 Implementation

We now detail our prototype implementation for compiling Whiley programs to VHDL for synthesis onto an FPGA. This utilises the general approach outlined in the previous section. The key novelty lies in the introduction of a new language construct (stage) for staging computation. The Whiley2VHDLCompiler is an open source plugin available from github.com/BaptP/Whiley2VHDLCompiler.

4.1 Architecture

Figure 3 provides an overview of the information flow through the Whiley Compiler. Whiley source files are converted into (binary) wyil files; in turn, these are converted into binary class files (for execution on the JVM) or, using the extension presented here, to VHDL files. The Whiley Intermediate Language (WyIL) is essentially a binary form of the Whiley source with additional information embedded (such as fully resolved names, etc). Generated VHDL files can be synthesised into bit files for uploading to an FPGA using, for example, the Xilinx ISE toolsuite. Likewise, they can be simulated using other third-party tools, such as GHDL.

[Diagram: Whiley source → Whiley Compiler (WyC) → Whiley Intermediate Language (WyIL) → either the Whiley-2-Java Compiler (WyJC) or the Whiley-2-VHDL Compiler.]

Figure 3. Illustrating the compilation pipeline.

4.2 Control Flow

To enable control over the manner in which arbitrary code is pipelined, we introduce a new language-level construct (stage). This simply identifies a cut point in the control-flow graph where registers should be inserted. We now present some examples to illustrate the main ideas. Consider the following function:

    function diff(int x, int y) -> (int z):
        if x > y:
            return x - y
        else:
            return y - x

Our compiler produces a combinatorial circuit for the above, taking the inputs x and y directly through comparison, subtraction and multiplexing logic to the output z. To generate a sequential circuit we need to introduce one or more pipeline stages, such as in the following:

    function diff(int x, int y) -> (int z):
        stage
        if x > y:
            return x - y
        else:
            return y - x

This now generates a sequential circuit containing (effectively) one pipeline stage: x and y are first latched into input registers, driven by a new clock signal clk, before flowing through the same combinatorial logic to z. This implementation has the advantage that we can now clock data values through the pipeline and, to accommodate this, a clock signal is required. An important issue is the handling of register balancing. For example, consider this variation on the above:

    function diff(int x, int y) -> (int z):
        stage
        if x > y:
            int tmp = x - y
            stage
            return tmp
        else:
            return y - x

This now generates a sequential circuit containing (effectively) two pipeline stages. The original Whiley source code dictated a register in the true branch only. However, our compiler automatically performs register balancing to ensure that intermediate values generated from the same inputs always arrive together.

Loops. Unfortunately, supporting general loops is much more involved (recall that loops which can be unrolled already have been). For example, consider this very simple loop:

    function count(int n) -> (int r):
        int i = 0
        //
        while i <= n:
            i = i + 1
        //
        return i

One of the key challenges is that we cannot know a priori how many iterations the loop will take. Furthermore, we cannot implement the loop using purely combinatorial circuits, as the potential for race conditions can lead to spurious or incorrect outputs. Instead, we translate this into a clocked circuit with inputs start and n and outputs done and r. This is more complex and, frankly, less elegant than our translation of conditional statements. Two registers have been introduced to maintain the values of i and n as the loop proceeds.² In addition to the expected inputs and outputs above, we also have a new input, start, and a new output, done. The latter is relatively straightforward. Since the loop may execute an unknown number of times, it must signal when it is complete and the output value r is available. For a similar reason we also require the start signal. The key here is that we need to know when to start the next loop. More specifically, when data is available from the previous stage in the pipeline. In addition, logic is required to maintain the busy status of the loop. The fundamental issue is that, whilst the loop is executing, no more values can be passed into the pipeline. That is, all input values are stalled until the loop is complete.

²Note, technically speaking, we could have avoided a register for n, but our compiler does not perform the necessary analysis to determine it is a loop constant.

Our loop implementation above is limited to executing one set of inputs at a time. In other words, we have lost the task parallelism present in our previous examples. This is somewhat undesirable, since it limits the benefits from using an FPGA. However, we can still benefit from pipelining by using the stage statement. For example, if we had two loops, one after the other, we could insert a stage between them.

Timing Analysis. In order to ensure registers are properly balanced, our implementation performs a rudimentary timing analysis. To do this, we simply enumerate all paths through the data-flow graph and, for each node, determine: the number of registers encountered on this path; and, the maximum number of registers encountered on any path. Using this information, we can identify where to insert registers for balancing.
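As an illustration of the kind of VHDL this compilation scheme might produce, consider the following hand-written sketch of the one-stage diff() pipeline. This is our own approximation — the entity name, eight-bit width and coding style are assumptions, not the compiler's actual output:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Sketch of diff() with one pipeline stage: x and y are latched
    -- into input registers on each clock, then flow through the
    -- combinatorial compare/subtract/select logic to z.
    entity diff_staged is
      port(clk  : in  std_logic;
           x, y : in  signed(7 downto 0);
           z    : out signed(7 downto 0));
    end entity;

    architecture rtl of diff_staged is
      signal rx, ry : signed(7 downto 0);  -- registers inserted by 'stage'
    begin
      process(clk)
      begin
        if rising_edge(clk) then
          rx <= x;
          ry <= y;
        end if;
      end process;

      -- combinatorial block after the stage boundary
      z <= rx - ry when rx > ry else ry - rx;
    end architecture;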

4.3 Data Representation

The compilation of Whiley data types is relatively straightforward, though some issues arise as they were not designed with the finite constraints of an FPGA in mind. In particular, all data types on an FPGA must have statically known bounds on the number of bits they occupy.

Integers. The representation of integer datatypes is relatively straightforward as these are directly supported in VHDL. One challenge is that integers in Whiley are unbounded which, although useful for verification, is something of a hindrance here. However, an arbitrary number of finite datatypes can be supported through the use of type invariants. The following illustrates:

    type i8 is (int x)
    where x >= -128 && x <= 127

This defines a signed integer within the range −128 … 127, which can be encoded in one byte using a two's-complement representation. We can specify arbitrary bounds here and rely on the compiler to determine the minimal representation (e.g. using bitwidth analysis [13, 20, 39, 42]).
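In VHDL terms, such a type maps naturally onto an eight-bit signed vector. A plausible rendering — our own hypothetical sketch, wrapped in a package so it stands alone — would be:

    library ieee;
    use ieee.numeric_std.all;

    package i8_pkg is
      -- A Whiley i8 (-128 <= x <= 127) occupies exactly eight bits in
      -- two's-complement form; tighter invariants would allow bitwidth
      -- analysis to choose a correspondingly narrower vector.
      subtype i8_t is signed(7 downto 0);
    end package;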

Records. These are relatively easy to represent since they have a finite number of fields and, hence, have a finite representation if their fields do. For example:

    type Point is { i8 x, i8 y }

This type can be encoded easily enough in two bytes. Furthermore, since records in Whiley have value semantics, assigning a variable of type Point to another simply results in a bitwise copy.

Arrays. These can also be represented provided they have statically known lengths. For example:

    type string is (i8[] s) where |s| < 10

This defines a string to be a sequence of fewer than ten bytes. To encode an array we require all elements have a finite representation and, additionally, we must determine the minimal representation for the length variable. Thus, we can encode a string like so:

[Layout: bits 0–3 hold the length; eight-bit elements follow at bit offsets 4, 12, 20, 28, and so on.]

Here, the minimal representation for the length is determined as a four-bit nibble. Thus, we begin to see the flexibility offered by an FPGA in terms of data representation. Furthermore, we begin to see the benefits of type invariants. Since the length is less than ten, we know that not every combination of bits is possible in the length variable. Thus, in principle, our compiler can exploit this knowledge to further optimise the design in a way that is difficult or impossible for e.g. a general-purpose VHDL compiler.

Unions. These represent something of a challenge as we must not only encode the payload itself, but also an appropriate tag. Consider this simple union:

    type ni8 is (i8|null)

This defines ni8 as a union of a byte and null. As such, we determine the worst-case payload size by examining each case and the number of bits required to distinguish cases. Thus, we can encode an ni8 as a one-bit tag followed by an eight-bit payload. Here, the maximum payload size is one byte since i8 has the largest bitwidth of any case. Furthermore, there are only two cases and, hence, only one bit is required for the tag.
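A direct way to express this tagged layout in VHDL is as a record holding the tag bit and the payload. Again, this is our own sketch of one plausible encoding rather than the compiler's actual output:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    package ni8_pkg is
      -- One-bit tag distinguishes the two cases of i8|null; the
      -- payload is only meaningful when the tag selects the i8 case.
      type ni8_t is record
        tag     : std_logic;          -- '0' => null, '1' => i8
        payload : signed(7 downto 0); -- worst-case payload (one byte)
      end record;
    end package;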
Others. Aside from the relatively standard data types discussed above, Whiley supports a number of other types which are not easily encoded. This includes references, function pointers, recursive types and open records. In addition, the primitive type any does not have a finite representation. The following illustrates a recursive type:

    type LinkedList is null | Link
    type Link is { LinkedList next, i8 data }

In principle this could be encoded onto an FPGA if its maximum depth was bounded. Unfortunately, there is no operator for this (unlike for arrays) and, hence, it is very difficult for the compiler to determine a bound. Another example is that of open records which, in some ways, are similar to interfaces in Java:

    type Item is { i8 kind, ... }

The "..." signals that this type is open (i.e. may include zero or more additional fields). Thus, Item represents any record type which has at least a field kind, and may have arbitrarily many additional fields.

4.4 Testing

To test our system, we compiled several small examples and uploaded them onto an FPGA. The FPGA used was a Papilio One 500K, which is built around a Xilinx Spartan 3, along with the LogicStart MegaWing (which provides buttons and LEDs, etc). See Figure 4.

Figure 4. The Papilio One 500K.

5 Related Work

We now consider related work and, we note, most approaches are either focused around very restrictive subsets of C (e.g. Stream-C [21], Handel-C [10], Autopilot [47], etc) or use libraries written in various languages (e.g. C++ [36], Java [9], Scala [4], etc) which, when executed, generate circuit designs.

The Liquid Metal project perhaps provides one of the most comparable efforts to this work [5]. Their primary motivation was that FPGAs can "offer extremely high performance with very low power compared to more general purpose designs", but that the skills currently required for programming FPGAs are beyond regular developers [23]. The Lime programming language was developed in an effort to bridge this gap [3]. Lime is a Java-based language with significant extensions focused on both GPU and FPGA accelerators. We find that there are a number of similarities between their extensions and the Whiley language. For example, Lime extends Java with immutable value types (comparable to records in Whiley), which the authors claim: "For compilation to hardware, Lime's value classes are essential since they can be freely moved across a chip as a chunk of bits" [3]. Likewise, Lime includes immutable array values (as does Whiley). Indeed, the authors argue "... the goals of Lime require a classification of methods into those that are pure functions" and, again, we can draw parallels with Whiley, which has first-class support for pure functions. The paradigm of computation promoted by Lime remains that of hardware acceleration. In particular, Lime provides a notion of tasks operating over data streams which can be composed and compiled into hardware pipelines. Thus, one cannot express general-purpose computation to be compiled on the hardware.

Another relevant work is that of Kou and Palsberg, who were interested in bridging the gap between FPGAs and the high-level abstractions offered by regular programming languages [29]. Their approach was to compile a regular object-oriented language into a C subset which could then be synthesised using Autopilot [47]. One of their key focuses was the fact that existing synthesis tools for C (such as Autopilot) exclude language features such as pointers and dynamic memory, etc. Nevertheless, their system supports recursion-free programs only and requires all memory requirements be known beforehand. They refer to the standard method of compiling object-oriented programs to C as a horizontal object layout (i.e. where classes become structs, etc). However, instead of this, they adopt a so-called vertical object layout which simplifies the handling of polymorphism.

Libraries as Generators. The ASC system of Mencer et al. follows a lineage of tools which provide libraries on top of standard programming languages [36, 37]. In this case, C++ is used to good effect to provide an embedded domain-specific language for generating FPGA accelerators. Specifically, the user implements their kernel in C++ using the primitives provided by the ASC library. They may also include additional client code which interacts with the kernel by reading/writing data streams. Then, when the compiled program is executed it generates a hardware netlist on-the-fly and uploads it to the FPGA. Alternatively, a gate-level simulator can be run for debugging and testing. The primitives provided by ASC are essentially generators for arithmetic operators and comparators which support a reasonable range of implementations (e.g. two's-complement versus sign-magnitude, fast-carry versus constant-time adders, etc). The user has complete control over which implementation is selected at any point, making this a relatively low-level solution. Regarding the generated datapath, the user can choose to optimise for throughput or latency. For example, optimising for throughput results in ASC inserting registers at almost every point (and, likewise, balancing each stage of the computation using FIFO buffers to ensure operands arrive together). In contrast, optimising for latency results in a purely combinatorial circuit. Thus, whilst ASC supports fine-grained control over operator implementations, it provides only coarse-grained control over the datapath itself. This contrasts with our approach, which offers fine-grained control over the datapath, but not operator implementations.

JHDL is another library for generating FPGA accelerators, this time based on Java [9]. Again, simulation and circuit generation are easily interchangeable. The rough idea here is that, when a circuit element is constructed, it can be regarded as having been physically placed on the hardware. For languages with destructors, the destruction of an object would then correspond to its removal from the hardware. Since Java doesn't support destructors per se, a similar effect was achieved in JHDL through a special-purpose delete() method. The library supports three fundamental base classes for use in implementing a circuit. The first is the Combinatorial class, which represents a circuit that simply propagates inputs to outputs. The second is the Synchronous class, which is clocked and produces new outputs on the clock. Finally, a Structural class is supported for combinations of combinatorial and synchronous circuits. An abstraction representing input/output ports for buffering data to/from the device is also provided. Thus, the Java control program writes bytes into the input port and reads results from the output port.

A similar approach to JHDL is taken in Chisel which, in this case, builds on Scala and generates Verilog [4]. One of the advantages of Scala over Java is that Chisel can make good use of operator overloading (like ASC does). Chisel also performs bitwidth analysis on the internal dataflow graph generated.

6 Conclusion

In this paper, we have considered the process of compiling a simple imperative language with specifications and type invariants for an FPGA. The benefit of type invariants is that they provide a powerful means to express different finite representations of data. Our implementation employs a language primitive (i.e. stage) for dividing up the datapath into stages.

This requires a simple timing analysis to ensure registers are balanced. In addition, it automatically exploits task-level parallelism by converting the input program into a dataflow graph. Unbounded loops present a challenge and further work is needed to investigate optimal solutions. In the future, we wish to exploit compile-time verification in Whiley to verify properties of the generated hardware systems.

Acknowledgements. Thanks to John Grant for helpful comments on an earlier draft of this paper.

References

[1] P. Anghelescu. 2016. FPGA implementation of programmable cellular automata encryption algorithm for network communications. Comput. Syst. Sci. Eng 31, 5 (2016).
[2] A. Appel, L. Beringer, A. Chlipala, B. Pierce, Z. Shao, S. Weirich, and S. Zdancewic. 2017. Position paper: the science of deep specification. Philosophical Transactions of the Royal Society of London (2017).
[3] J. Auerbach, D. Bacon, P. Cheng, and R. Rabbah. 2010. Lime: a Java-compatible and synthesizable language for heterogeneous architectures. In Proc. OOPSLA. 89–108.
[4] J. Bachrach, H. Vo, B. C. Richards, Y. Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanovic. 2012. Chisel: constructing hardware in a Scala embedded language. In Proc. DAC. ACM Press, 1216–1225.
[5] D. F. Bacon, R. M. Rabbah, and S. Shukla. 2013. FPGA programming for the masses. CACM 56, 4 (2013), 56–63.
[6] F. Barbanera and M. Dezani-Ciancaglini. 1991. Intersection and Union Types. 651–674.
[7] J. Barnes. 1997. High Integrity Ada: The SPARK Approach. Addison Wesley Longman, Inc., Reading.
[8] M. Barnett, M. Fähndrich, K. R. M. Leino, P. Müller, W. Schulte, and H. Venter. 2011. Specification and verification: the Spec# experience. CACM 54, 6 (2011), 81–91.
[9] P. Bellows and B. L. Hutchings. 1998. JHDL - An HDL for Reconfigurable Systems. In Proc. FCCM. IEEE Computer Society, 175–184.
[10] M. Bowen. [n. d.]. Handel-C: Language Reference Manual. Technical Report. Embedded Solutions Ltd.
[11] Thomas Braibant and Adam Chlipala. 2013. Formal Verification of Hardware Synthesis. In Proc. CAV, Vol. 8044. Springer, 213–228.
[12] A. Rigland Brodtkorb, C. Dyken, T. Runar Hagen, J. M. Hjelmervik, and O. Storaasli. 2010. State-of-the-art in heterogeneous computing. Scientific Programming 18, 1 (2010), 1–33.
[13] Mihai Budiu, Majd Sakr, Kip Walker, and Seth Copen Goldstein. 2000. BitValue Inference: Detecting and Exploiting Narrow Bitwidth Computations. In Proc. Euro-Par. Springer-Verlag, 969–979.
[14] R. Dimond, S. Racanière, and O. Pell. 2011. Accelerating Large-Scale HPC Applications Using FPGAs. In Proc. ARITH. IEEE, 191–192.
[15] S. Dydel and P. Bala. 2004. Large Scale Protein Sequence Alignment Using FPGA Reprogrammable Logic Devices. In Proc. FPL. 23–32.
[16] M. Fähndrich and K. R. M. Leino. 2003. Declaring and checking non-null types in an object-oriented language. In Proc. OOPSLA. ACM Press, 302–312.
[17] Umer Farooq, Zied Marrakchi, and Habib Mehrez. 2012. FPGA Architectures: An Overview. Springer New York, New York, NY, 7–48.
[18] J. Filliâtre and A. Paskevich. 2013. Why3 — Where Programs Meet Provers. In Proc. ESOP. 125–128.
[19] C. Flanagan, K. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and R. Stata. 2002. Extended Static Checking for Java. In Proc. PLDI. 234–245.
[20] A. Abdul Gaffar, O. Mencer, W. Luk, P. Y. K. Cheung, and N. Shirazi. 2002. Floating-point bitwidth analysis via automatic differentiation. In FPT. IEEE, 158–165.
[21] M. Gokhale, J. M. Stone, J. M. Arnold, and M. Kalinowski. 2000. Stream-Oriented FPGA Computing in the Streams-C High Level Language. In Proc. FCCM. IEEE, 49–58.
[22] Tony Hoare. 2009. Null References: The Billion Dollar Mistake, Presentation at QCon.
[23] S. Huang, A. Hormati, D. F. Bacon, and R. M. Rabbah. 2008. Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary. In Proc. ECOOP. Springer-Verlag, 76–103.
[24] 2008. 1076-2008 - IEEE Standard VHDL Language Reference Manual. IEEE.
[25] 2008. 1364-2005 - IEEE Standard for Verilog Hardware Description Language. IEEE.
[26] Atsushi Igarashi and Hideshi Nagira. 2007. Union Types for Object-Oriented Programming. JOT 6, 2 (2007).
[27] B. Jacobs, J. Smans, P. Philippaerts, F. Vogels, W. Penninckx, and F. Piessens. 2011. VeriFast: A Powerful, Sound, Predictable, Fast Verifier for C and Java. In Proc. NFM. Springer-Verlag, 41–55.
[28] A. A. Kamal and A. M. Youssef. 2009. An FPGA implementation of the NTRUEncrypt cryptosystem. In Proc. ICM. IEEE, 209–212.
[29] S. Kou and J. Palsberg. 2010. From OO to FPGA: fitting round objects into square hardware?. In Proc. OOPSLA. ACM Press, 109–124.
[30] D. Kreutz, F. M. V. Ramos, P. Jorge E. Veríssimo, C. Rothenberg, S. Azodolmolky, and S. Uhlig. 2015. Software-Defined Networking: A Comprehensive Survey. Proc. IEEE 103, 1 (2015), 14–76.
[31] J. Krueger, D. Donofrio, J. Shalf, M. Mohiyuddin, S. Williams, L. Oliker, and F.-J. Pfreund. 2011. Hardware/software co-design for energy-efficient seismic modeling. In Proc. SC. 73:1–73:12.
[32] K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Proc. LPAR. Springer-Verlag, 348–370.
[33] J. W. Lockwood, A. Gupte, N. Mehta, M. Blott, T. English, and K. A. Vissers. 2012. A Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT). In Proc. HOTI. IEEE, 9–16.
[34] C. Male, D. J. Pearce, A. Potanin, and C. Dymnikov. 2008. Java Bytecode Verification for @NonNull Types. In Proc. CC. 229–244.
[35] S. Masuno, T. Maruyama, Y. Yamaguchi, and A. Konagaya. 2007. An FPGA Implementation of Multiple Sequence Alignment Based on Carrillo-Lipman Method. In Proc. FPL. IEEE, 489–492.
[36] O. Mencer. 2006. ASC: a stream compiler for computing with FPGAs. IEEE Trans. on CAD of Integrated Circuits and Systems 25, 9 (2006), 1603–1617.
[37] O. Mencer, D. J. Pearce, L. W. Howes, and W. Luk. 2003. Design space exploration with A Stream Compiler. IEEE, 270–277.
[38] D. J. Pearce. 2013. Sound and Complete Flow Typing with Unions, Intersections and Negations. In Proc. VMCAI. 335–354.
[39] D. J. Pearce. 2015. Integer Range Analysis for Whiley on Embedded Systems. 26–33.
[40] D. J. Pearce and L. Groves. 2015. Designing a Verifying Compiler: Lessons Learned from Developing Whiley. SCP (2015), 191–220.
[41] S. Sezer, S. Scott-Hayward, P.-K. Chouhan, B. Fraser, D. Lake, J. Finnegan, N. Viljoen, M. Miller, and N. Rao. 2013. Are we ready for SDN? Implementation challenges for software-defined networks. IEEE Communications Magazine 51, 7 (2013), 36–43.
[42] M. Stephenson, J. Babb, and S. Amarasinghe. 2000. Bitwidth Analysis with Application to Silicon Compilation. In Proc. PLDI. 108–120.
[43] D. B. Thomas, L. W. Howes, and W. Luk. 2009. A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation. In Proc. FPGA. ACM Press, 63–72.
[44] X. Tian and K. Benkrid. 2010. High-Performance Quasi-Monte Carlo Financial Simulation: FPGA vs. GPP vs. GPU. ACM TRETS 3, 4 (2010), 26:1–26:22.
[45] S. Tobin-Hochstadt and M. Felleisen. 2010. Logical types for untyped languages. In Proc. ICFP. 117–128.
[46] Xilinx. 2012. Xilinx 7 Series FPGAs Product Brief. Technical Report.
[47] Z. Zhang, Y. Fan, W. Jiang, G. Han, C. Yang, and J. Cong. 2008. AutoPilot: A Platform-Based ESL Synthesis System. Springer-Verlag, 99–112.
