An Automated Flow to Generate Hardware Computing Nodes from C for an FPGA-Based MPI Computing Network
by

D.Y. Wang

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF APPLIED SCIENCE

DIVISION OF ENGINEERING SCIENCE
FACULTY OF APPLIED SCIENCE AND ENGINEERING
UNIVERSITY OF TORONTO

Supervisor: Paul Chow

April 2008


Abstract

Recently there have been initiatives from both industry and academia to explore the use of FPGA-based application-specific hardware acceleration in high-performance computing platforms, as traditional supercomputers based on clusters of generic CPUs fail to scale to meet the growing demands of computation-intensive applications due to limitations in power consumption and cost. Research has shown that a heterogeneous system built exclusively on FPGAs, using a combination of different types of computing nodes including embedded processors and application-specific hardware accelerators, is a scalable way to use FPGAs for high-performance computing. An example of such a system is the TMD [11], which also uses a message-passing network to connect the computing nodes. However, the difficulty of efficiently designing high-speed hardware modules from software descriptions is preventing FPGA-based systems from being widely adopted by software developers. In this project, an automated tool flow is proposed to fill this gap. The AUTO flow automatically generates, from a C program, a hardware computing node that can be used directly in the TMD system. As an example application, a Jacobi heat-equation solver is implemented in a TMD system where a soft processor is replaced by a hardware computing node generated using the AUTO flow. The AUTO-generated hardware module shows equivalent functionality and some improvement in performance over the soft processor. The AUTO flow demonstrates the feasibility of incorporating automatic hardware generation into the design flow of FPGA-based systems so that such systems can become more accessible to software developers.


Acknowledgment

I acknowledge Synfora and Xilinx for hardware, tools and technical support, and my supervisor, Professor Paul Chow, for his guidance, patience and insights, all of which were invaluable to the completion of this project. Thanks to Chris Madill and Arun Patel for their help in setting up the development environment, and to Manuel Saldaña for help with the MPE network and scripts and for patiently answering all my questions during the many unscheduled drop-by visits. Also many thanks to Henry Wong for discussions, suggestions and debugging tips, and to Ryan Fung for proofreading the final report. Finally, I would like to thank my mother for her love and support, as always.


Contents

1 Introduction
2 Related Work
  2.1 FPGA-Based Computing
  2.2 The TMD-MPI Approach
  2.3 Behavioral Synthesis
3 System Setup
  3.1 TMD Platform Architecture
  3.2 Design Flow
  3.3 C-to-HDL Using PICO
4 Implementation of the Tool Flow
  4.1 Flow Overview
  4.2 MPI Library Implementation
  4.3 Control Block
  4.4 Scripts
    4.4.1 Preprocessing Script
    4.4.2 Packaging Script
  4.5 Test Generation
  4.6 Limitations
    4.6.1 Floating-Point Support
    4.6.2 Looping Structure
    4.6.3 Pointer Support
    4.6.4 Division Support
    4.6.5 Performance Specification
    4.6.6 Hardware Debugging
    4.6.7 Exploitable Parallelism
5 The Heat-Equation Application
  5.1 Implementation
  5.2 Experiment Methodology
  5.3 Results
6 Conclusion and Future Work

Appendix
A Hardware Controller for PICO PPA
  A.1 Control FSM
  A.2 Stream Interface Translation
B Using PICO: Tips and Workarounds
  B.1 Stream Ordering
  B.2 Improving Performance

Bibliography


Glossary

This glossary lists the acronyms used in this report.

• CAD - Computer Aided Design
• CPE - Cycles Per Element
• DCM - Digital Clock Manager
• FSL - Fast Simplex Link. Xilinx's FIFO stream IP block.
• FSM - Finite State Machine
• HDL - Hardware Description Language
• HPC - High-Performance Computing
• IP - Internet Protocol
• IP - Intellectual Property
• MHS - Microprocessor Hardware Specification
• MSS - Microprocessor Software Specification
• MPI - Message Passing Interface
• MPE - Message Passing Engine. Provides MPI functionality to hardware accelerators in a TMD system.
• NetIf - Network Interface used in the TMD network
• PICO - Program-In Chip-Out. An algorithmic synthesis tool from Synfora, Inc.
• PPA - Pipeline of Processing Arrays. The top-level hardware block generated from a function by the PICO flow.
• TCAB - Tightly Coupled Accelerator Block. A hardware module generated by PICO from a C procedure that can be used as a black box when generating a higher-level hardware block.
• TMD - Originally the Toronto Molecular Dynamics machine; now refers to the exclusively FPGA-based HPC platform developed at the University of Toronto.
• VLSI - Very Large Scale Integrated Circuit
• XPS - Xilinx Platform Studio. Xilinx's embedded processor system design tool.
• XST - Xilinx Synthesis Technology. Xilinx's synthesis tool.
• XUP - Xilinx University Program


List of Figures

3.1 TMD platform architecture ([13])
3.2 Network configuration for different node types ([13])
3.3 TMD design flow ([13])
3.4 PICO design flow ([15], p. 5)
4.1 The AUTO tool flow
4.2 Stream operations required to implement MPI behaviour
4.3 TMD system testbed
5.1 A simple two-node TMD implementation of a Jacobi heat-equation solver
5.2 Main loop execution time per element with different iteration lengths
A.1 Design of the control block
A.2 State transition diagram of the control block FSM
A.3 The PICO stream interface
A.4 The FSL bus interface


List of Tables

4.1 Implemented families of MPI functions
5.1 Normalized computing power of the reference and test systems
A.1 I/O ports exported by the control block
A.2 Raw control ports on the PPA module


Chapter 1

Introduction

Much of today's scientific research relies heavily on numerical computations and demands high performance. Computational fluid dynamics, molecular simulation, finite-element structural analysis and financial trading algorithms are examples of computation-intensive applications that would not have been possible without advances in computing infrastructure. Since the 1960s, generations of supercomputers have been built to address the scientific community's growing need for more computing power. With the improved performance and availability of microprocessors, clusters of conventional CPUs connected by commercially available interconnects became the dominant architecture for modern supercomputers. As of November 2007, 409 of the top 500 supercomputers were cluster based [18].
However, as the computing throughput requirements of new applications continue to increase, supercomputers based on clusters of generic CPUs become increasingly limited by power budgets and escalating costs and cannot scale further to keep up with demand. As a result, specialized hardware accelerators have become popular. In recent years there has been significant development in both GPU-based and FPGA-based computing models. While GPUs have demonstrated remarkable performance improvements in highly data-parallel, stream-based applications [1], FPGAs, with the flexibility they offer, are good candidates for specialized hardware acceleration systems.

In order to leverage FPGAs in high-performance computing systems, hardware accelerators need to be built from software specifications. The primary challenge is that hardware design is intricate, and software developers typically do not have the expertise to design high-performance hardware. On the other hand, requiring both software and hardware designers to work on the same project is costly and inefficient. As a result, hardware acceleration has not been adopted more widely among software developers. A tool flow that allows software designers to easily harness the power of hardware acceleration is hence essential to make hardware acceleration feasible in non-high-end applications.

To address this need, we present in this project an automated tool flow, AUTO, that generates a hardware accelerator directly from a C program. This work builds on previous work on TMD, a scalable multi-FPGA high-performance computing system [11] that consists of a collection of computing nodes, where each node can be a soft processor or a hardware engine. The TMD uses the TMD-MPI message-passing programming model [12] for inter-node communication. The AUTO flow takes an MPI program written in C as input and produces a hardware computing node that can be used directly in the TMD system.

As a proof-of-concept prototype, the main objective of this project is to explore the possibility of algorithmic synthesis targeting an FPGA-based system, with a focus on the feasibility of an automated tool flow. A Jacobi heat-equation solver is implemented on TMD as an example application to demonstrate the functionality of the AUTO flow. With little designer intervention, we are able to automatically generate a functional hardware block that performs better than the soft processor node it replaces. Our eventual goal is to completely automate a design flow that generates a system of hardware accelerators from a parallel program, as opposed to a single hardware computing node at a time.

The rest of the report is organized as follows. Chapter 2 reviews existing research in FPGA-based computing and algorithmic synthesis, which provides context for our work. Chapter 3 describes the TMD platform, the TMD-MPI design flow and AUTO's role in it. Chapter 4 explains the implementation of the AUTO flow. The limitations of the implementation are also outlined.
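To make the kind of input accepted by the AUTO flow concrete, the sketch below shows a small MPI program in C for a one-dimensional Jacobi heat-equation relaxation, similar in spirit to the example application used in this project. It is illustrative only and is not the TMD implementation: the array size, iteration count and message tag are arbitrary placeholders, and it relies only on the plain send/receive style of message passing used throughout this report.

/*
 * Minimal sketch (not from the TMD implementation) of an MPI program of
 * the style the AUTO flow is meant to consume: a 1-D Jacobi relaxation
 * in which each rank updates its portion of a rod and exchanges boundary
 * values with its neighbours every iteration.  Sizes, the tag and the
 * iteration count are illustrative placeholders.
 */
#include <mpi.h>
#include <stdio.h>

#define N      64      /* points owned by each rank (illustrative)  */
#define ITERS  100     /* number of Jacobi iterations (illustrative) */

int main(int argc, char *argv[])
{
    float u[N + 2], unew[N + 2];   /* local points plus two ghost cells */
    int rank, size, i, it;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Fixed boundary temperatures at the two ends of the rod. */
    for (i = 0; i < N + 2; i++)
        u[i] = 0.0f;
    if (rank == 0)        u[0]     = 100.0f;
    if (rank == size - 1) u[N + 1] = 100.0f;

    for (it = 0; it < ITERS; it++) {
        /* Exchange ghost cells with the neighbouring ranks. */
        if (rank > 0) {
            MPI_Send(&u[1], 1, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&u[0], 1, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        if (rank < size - 1) {
            MPI_Recv(&u[N + 1], 1, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&u[N], 1, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD);
        }

        /* Jacobi update: each interior point becomes the average of its
         * two neighbours from the previous iteration. */
        for (i = 1; i <= N; i++)
            unew[i] = 0.5f * (u[i - 1] + u[i + 1]);
        for (i = 1; i <= N; i++)
            u[i] = unew[i];
    }

    if (rank == 0)
        printf("u[1] after %d iterations: %f\n", ITERS, u[1]);

    MPI_Finalize();
    return 0;
}

A program written in this style can first be compiled and run against any standard MPI library on a workstation; a C-to-hardware flow such as the one described in the following chapters then starts from the same source.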