Empowering Parallel Computing with Field Programmable Gate Arrays
Erik H. D'HOLLANDER
Electronics and Information Systems Department, Ghent University, Belgium
E-mail: [email protected]

Parallel Computing: Technology Trends, I. Foster et al. (Eds.), © 2020 The authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0). doi:10.3233/APC200020

Abstract. After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array (FPGA), a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumann-like architectures opens the way to eliminate the instruction overhead and to optimize execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements.

Keywords. FPGAs, high-level synthesis, high-performance computing, design space exploration

1. Introduction

Parallel computing predominantly focuses on high-performance computing. It typically covers three major domains: architectures, algorithms and applications. These three components join forces to accelerate computations. Modern applications have an ever growing processing need. This is the case for many areas such as artificial intelligence, image processing, the Internet of Things and big data, not to mention cyber-physical systems, the topic of one of the Horizon 2020 calls by the European Commission [14]. Despite an exponential rise of computing power for CPUs in general, the performance growth rate has been dropping since 2003 [19]. The first reason is the end of Dennard scaling [9], essentially a heat problem caused by a continuously rising clock frequency. A second reason, related not to technology but to the type of application, is Amdahl's law [1], which is analyzed using parallel program models in section 5. The third is the end of Moore's law [26], because transistors cannot practically be made smaller than a few silicon atoms. This means that we have to find new ways to accelerate computations.
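As a reminder of why the application-side limit matters, the standard formulation of Amdahl's law is recalled below. This worked example is ours, not taken from the paper; p denotes the parallelizable fraction of the program and N the number of processing elements.

    % Amdahl's law: upper bound on the speedup with parallel fraction p on N processing elements
    \[
      S(N) = \frac{1}{(1-p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1-p}.
    \]
    % Example: p = 0.95 and N = 64 give S = 1/(0.05 + 0.95/64) \approx 15.4,
    % far below the ideal speedup of 64: the remaining 5% of serial code dominates.

Even a small serial fraction therefore caps the benefit of adding parallel hardware, which is why section 5 revisits this law with parallel program models.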
One of the accelerators prominently present in today's desktops and supercomputers is the graphics processing unit (GPU). A GPU is an excellent device for parallel number crunching; originally rooted in image processing, it has long since outgrown this application domain by an order of magnitude.

While GPUs are voracious number crunchers, they are not a one-size-fits-all architecture for every problem on earth. A GPU is best suited for massively parallel computations. It has a fixed architecture consisting of many parallel processing engines and a fast thread manager which is able to switch very quickly between threads waiting for data from memory.

A major alternative emerging as a rising star is the field programmable gate array. In contrast to the GPU, the FPGA does not have a fixed but rather a flexible, reconfigurable architecture. Metaphorically speaking, an FPGA behaves like a chameleon, because it is able to adapt itself to the type of algorithm. This compute engine makes it possible to build an ad hoc architecture which is a one-to-one mapping of the algorithm onto the hardware. There is no program and there are no instructions, just an interconnection of computing elements and control logic performing the task implied by the algorithm. This leads to the very regular layout of the field programmable gate array: it consists of logic elements, computing elements such as digital signal processors (DSPs), memory elements or RAM blocks, input-output elements at the border, clock signals to synchronize all the operations and, finally, a network of programmable interconnections used to configure the control and data paths of the design (Fig. 1).

Figure 1. FPGA architecture with basic logic elements, programmable interconnects, I/O pins, block RAM and digital signal processors.

The so-called programming of an FPGA involves mapping a logic design, including DSPs and their interconnections, onto the physical hardware of the FPGA. This is done in two steps. First, a high-level synthesis (HLS) compiler converts the program into a task graph, assigns the nodes of the graph to computing resources of the FPGA, maps the edges to control and data paths between the computing elements, and generates a low-level hardware description in VHDL or Verilog. In the second step, the hardware description is mapped, placed and routed onto the physically available resources of the FPGA. The result is stored in a binary file, called the bitstream, which is uploaded to the configuration memory of the FPGA. After configuration, i.e. setting the interconnections, the programmed FPGA can operate, for example, as a dataflow engine. A small HLS-style kernel illustrating this flow is sketched below.
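To make this two-step flow concrete, the sketch below shows the kind of C/C++ kernel an HLS compiler accepts. The example is ours and not taken from the paper; the pragma spelling follows the Xilinx Vivado/Vitis HLS convention, and the kernel name saxpy and the interface bundle names are illustrative assumptions.

    // Sketch of an HLS-style kernel computing y[i] = a * x[i] + y[i].
    // Assumption: Xilinx Vivado/Vitis HLS pragma syntax; all names are illustrative.
    extern "C" void saxpy(const float *x, float *y, float a, int n) {
        // Map the pointer arguments onto external (AXI) memory interfaces.
    #pragma HLS INTERFACE m_axi port=x bundle=gmem0
    #pragma HLS INTERFACE m_axi port=y bundle=gmem1
        for (int i = 0; i < n; i++) {
            // Request a pipelined loop: ideally one new iteration per clock cycle.
    #pragma HLS PIPELINE II=1
            y[i] = a * x[i] + y[i];
        }
    }

The HLS front end turns this loop into a data path of multipliers, adders and registers described in VHDL or Verilog; place-and-route then maps that description onto the FPGA fabric and produces the bitstream. Even for such a small kernel, the second step can take a long time, which is one of the pain points mentioned in the abstract.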
The rest of this paper is organized as follows. The origins of FPGAs are presented in section 2. The back end of an FPGA compiler is described in section 3, and section 4 introduces the front end of high-level synthesis compilers, including performance factors and design exploration. The differences between FPGA and GPU accelerators are addressed in section 5, focusing in particular on the typical programming styles and application areas of both accelerators when using the same programming language, OpenCL. Section 6 explores an integrated FPGA-CPU environment, including shared virtual memory, coherent caching and high bandwidth, as presented in the HARP-v2 platform. The power of this platform is illustrated by a guided image filter application. In section 7 some common misconceptions are addressed, and concluding remarks are given in section 8.

2. History

Field programmable gate arrays have come a long way. They found their origin in the 1960s, when researchers were looking for new computer architectures as an alternative to the stored program concept of von Neumann. Not that the stored program concept was a bad idea; on the contrary, it is an ingenious innovation to put instructions and data in the same memory and to reconfigure the processor during the instruction fetch and decode cycles. Why then look for new architectures? By configuring the control and data paths at each instruction, the von Neumann computer trades performance for flexibility.

The first person to challenge the von Neumann concept was Gerald Estrin from UCLA [12]. His idea was to extend the traditional computer, which he called a fixed architecture, with a variable part used to implement complex logic functions. He built a six-by-six motherboard to hold configurable computing and memory elements. The compute elements implemented functions much more complex than the simple instructions of the fixed architecture. The backside of the motherboard consisted of a wiring panel to interconnect the different components. Programming the "fixed-plus-variable" computer was quite a challenge: for example, the computation of the eigenvalues of a symmetric matrix is described in a publication of 20 pages and resulted in a speedup of 4 with respect to an existing IBM computer [13]. Although these results are modest, they show that implementing an algorithm directly in hardware is beneficial. The question remains how this can be done efficiently.

In 1975 Jack Dennis of MIT pioneered a new architecture which he called the dataflow processor [10]. The architecture consists of a large set of computing elements which can be arbitrarily interconnected and which fire when data is present on their inputs. Programming the dataflow processor consists of interconnecting the computing elements according to the dataflow graph of the algorithm and percolating the data from the top. There are two performance gains. First, since the program is not stored in a memory, the instruction fetch and instruction decode phases are eliminated, leading to a performance gain of 20 to 40%. Second, the parallelism is maximally exploited, because it is only limited by the dependencies in the algorithm and by the available computing elements. A drawback of the first dataflow architecture is that the computing resources may be underutilized when occupied by a single data stream. Therefore Arvind, a colleague of Jack Dennis, proposed the tagged dataflow architecture [2], in which several separate data streams are identified by tokens with different colors and tokenized data compete for the compute resources in the system.

Why did these new architectures not materialize? Obviously, the technology was lagging behind compared to the large-scale integration of today, and silicon compilers were still in their infancy. These problems have been largely overcome by the confluence and synergistic effects of hardware and software developments, together with major progress in the detection and management of parallelism in programs.