
FPGA-Based Acceleration of Scientific Computing

High-Level Languages and Floating-Point Arithmetic for FPGA-Based CFD Simulations

Diego Sanchez-Roman, Gustavo Sutter, Sergio Lopez-Buedo, Ivan Gonzalez, Francisco J. Gomez-Arribas, and Javier Aracil, Universidad Autonoma de Madrid
Francisco Palacios, Stanford University

Editor's note:
Computational fluid dynamics is a classical problem in high-performance computing. In order to make use of an existing code base in this field, the use of high-level design tools is an imperative. The authors explore the use of the Impulse design tools for a Navier-Stokes implementation.
George A. Constantinides (Imperial College London) and Nicola Nicolici (McMaster University)

COMPUTATIONAL FLUID DYNAMICS (CFD) plays a key role in the design and optimization of many industrial applications. In the case of aeronautics, aircraft design has traditionally been based on costly and time-consuming wind tunnel tests. Computer-based flow simulations would enable much faster and less expensive tests, significantly reducing design costs and allowing for the exploration of new airfoil geometries. More importantly, CFD would also enable shape optimization, thus facilitating the development of safer, less polluting, and less fuel-consuming aircraft. Unfortunately, the huge computational costs of CFD prevent it from being a valid tool for the entire design process. CFD is currently used only at some design steps, and wind tunnel tests are still essential.

These huge costs come from the Navier-Stokes equations that govern the air flow motion. The Navier-Stokes equations derive from the physical laws of mass, momentum, and energy conservation, and they cannot be solved analytically except in specific cases, so their solutions must be approximated numerically. The drawback with CFD simulation in aeronautical design is that, in many situations, flows develop two physical phenomena: shock waves and turbulence. Such cases require using a fine discretization of the space to obtain accurate results, so the time required to compute the solutions becomes prohibitive, even in the best high-performance computing (HPC) clusters. Because of this, researchers have expended considerable effort to accelerate the execution of these algorithms. The approaches developed include parallelization on clusters,1 GPU computing,2-4 and FPGA solutions.5,6

FPGA-based high-performance reconfigurable computing (HPRC) is a promising technology for application acceleration because it creates a synergy between system-level parallelism and lower-level hardware parallelism. Unfortunately, reconfigurable solutions typically require skilled engineers to write hardware description language (HDL) code, and development time and testing usually far exceed those of software solutions. Fortunately, this drawback is being reduced thanks to the advent of hardware compilers from high-level languages (HLLs), typically C or C dialects.7,8 Although HLLs reduce development time, manual optimizations are still needed to improve the implementation, and this limits the reduction in design effort and cost. Additionally,


newer FPGA families provide better support for floating-point arithmetic, so it's no longer necessary to go through the painful step of porting the algorithm to fixed-point arithmetic. Finally, the advent of a new generation of acceleration-oriented FPGA platforms, such as in-socket accelerators, has made HPRC far more efficient and simpler to use.

In this article, we show how to employ these new methodologies and tools to significantly improve the performance of complex applications such as aeronautical CFD simulations, while reducing the energy required to perform those computations. We have obtained a 22x speedup and one order of magnitude in energy savings for a 2D Navier-Stokes solver using an approach based on the XtremeData XD2000i In-Socket Accelerator,9 along with single-precision floating-point arithmetic and Impulse Accelerated Technologies' Impulse C software (http://www.impulseaccelerated.com), which generates VHDL or Verilog from standard ANSI-C code.

Navier-Stokes solver
We've used an in-house Navier-Stokes solver written in C++ using single-precision floating-point arithmetic, which implements a vertex-centered finite-volume method (FVM) applying the MUSCL (Monotone Upstream-Centered Schemes for Conservation Laws) scheme. The program's input is an unstructured mesh, in which the space is discretized in finite volumes. These finite volumes are triangles or rectangles in the 2D case, and hexahedrons, pyramids, tetrahedrons, or wedges in the 3D case. Figure 1 represents a 2D discretization of a National Advisory Committee for Aeronautics (NACA) 4412 airfoil consisting of triangles. The density, momentum, and energy are associated with each node (we will call them conservative values, since they follow conservation laws). These magnitudes are computed in successive time integrations until a convergence criterion is reached.

Figure 1. 2D discretization of a National Advisory Committee for Aeronautics (NACA) 4412 airfoil. The space is discretized in finite volumes, shown as 2D triangles, where the laws of physics are imposed.

In the first preprocessing stage, the program reads the mesh and computes the control volumes. It also determines the edge connectivity, calculates the normal vectors (to the edges), and initializes the conservative values for each node. After this preprocessing stage, the actual integration loop begins. Figure 2 shows the algorithm flow. The arrows between the boxes represent the dependency between the subroutines, and the numbering reflects the order of the sequential execution in the software solver. The new conservative values for each node are updated in the time-integration routine as a function of the residuals computed in the space integration and the time step. Because of the numerical scheme implemented, each node uses data from two neighborhood layers to update its new conservative values.
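To make the structure of this update concrete, the following minimal sketch shows the kind of explicit step the time-integration routine performs on each node's conservative values. It is illustrative only: the names (Node, NVARS, time_integration) and the simplified residual handling are our assumptions, not the production solver code.

```c
/* Minimal sketch of an explicit vertex-centered FVM update (illustrative
 * names, not the production solver): each node's conservative values are
 * advanced using the residuals accumulated during space integration and
 * the time step, then the residuals are cleared for the next iteration. */
#include <stddef.h>

#define NVARS 4  /* 2D compressible flow: density, x/y momentum, energy */

typedef struct {
    float u[NVARS];         /* conservative values at this node */
    float residual[NVARS];  /* residuals from the space integration */
    float volume;           /* area of the control volume around the node */
} Node;

static void time_integration(Node *nodes, size_t n_nodes, float dt)
{
    for (size_t i = 0; i < n_nodes; ++i) {
        for (int k = 0; k < NVARS; ++k) {
            nodes[i].u[k] -= (dt / nodes[i].volume) * nodes[i].residual[k];
            nodes[i].residual[k] = 0.0f;  /* reset for the next iteration */
        }
    }
}
```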



Hardware platform
The development system is a workstation with a dual Xeon motherboard populated with one Intel Xeon L5408 quad-core processor and one XtremeData XD2000i In-Socket Accelerator. The XD2000i, installed in the second processor socket, uses the motherboard's existing CPU infrastructure to create a full-featured environment for FPGA coprocessing. The high-bandwidth, low-latency front-side bus (FSB) link between the coprocessor module and the Intel processor enables tightly coupled FPGA acceleration of x86 applications previously impossible with legacy PCI-bus-based solutions.9

The XD2000i In-Socket Accelerator features three Stratix III FPGAs (http://www.altera.com). One of these FPGAs (a Stratix III SL150) serves as a bridge to the FSB, whereas the other two (Stratix III SE260 devices with 255,000 logic elements, 768 embedded multipliers, and 15 Mbits of internal memory) are available to implement the user logic. These two application FPGAs are connected through two unidirectional 64-bit buses running at 400 megatransfers per second (MT/s). In addition, the XD2000i module includes two QDRII+ SRAM banks, one for each user FPGA.9 The accelerator system typically works at 100 MHz.

[Figure 2: Time step calculation (1); space integration (2), comprising inviscid residuals calculation with an upwind scheme (2.1), gradients calculation (2.1.1), primitive variables calculation (2.2), primitive gradients calculation (2.2.1), viscous residuals calculation (2.2.1.1), and boundary residuals calculation (2.3); time integration (3); and stress calculation (4), comprising inviscid stress (4.1) and viscous stress (4.2). The figure marks the routines implemented in hardware.]
Figure 2. Algorithm flow for the Navier-Stokes solver that we used.

Hardware development methodology using HLLs
Scientific-computing algorithms are mainly written with floating-point arithmetic in C, C++, or Fortran. However, the traditional approach for FPGA design is based on fixed-point arithmetic and HDLs such as VHDL or Verilog. Using this traditional approach, algorithm acceleration in an HPRC system requires a huge effort: analyzing the dynamic range of variables to transform floating- to fixed-point arithmetic, manually scheduling arithmetic operations, creating finite-state automata to control the execution of operations, and so forth. Additionally, validation and debugging are especially difficult, making this methodology suitable only for very stable codes. This is a serious drawback for algorithms that are continually evolving because of new scientific knowledge, as with aeronautical CFD applications. Therefore, we use an alternative methodology based on Impulse C and floating-point arithmetic.

Synthesis framework creation
The first step was to create a C synthesis framework optimized for the accelerator platform used. We chose Impulse C because it is the only tool that supports our XD2000i module, since Mitrion-C has been discontinued.8 To the best of our knowledge, there are only two other HLLs that support floating-point arithmetic out of the box: AutoESL and ROCCC (Riverside Optimizing Compiler for Configurable Computing). However, neither of them provides support for acceleration modules. Another advantage of Impulse C is that it lets users easily change the cores used for arithmetic operations.

We found that the floating-point cores originally provided by Impulse C and Altera are deeply pipelined, featuring a high clock frequency but at the cost of increased latency and area. However, the XD2000i module is aimed at 100-MHz designs, so it made no sense to use such deeply pipelined cores. Therefore, we developed our own single-precision floating-point library based on IEEE Std. 754-2008 (IEEE Standard for Floating-Point Arithmetic) binary32, targeting this clock frequency at the lowest possible latency.
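For reference, the binary32 format that the custom library targets packs a sign bit, an 8-bit biased exponent, and a 23-bit fraction into a 32-bit word. The sketch below is a software model of that field decomposition, written by us for illustration; the actual low-latency cores implement the corresponding datapaths in VHDL, and the names (Binary32Fields, unpack_binary32) are not part of the library.

```c
/* Software model (illustration only) of the IEEE 754 binary32 fields that
 * the custom floating-point cores operate on in hardware. */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* 1 bit                                  */
    uint32_t exponent;  /* 8 bits, biased by 127                  */
    uint32_t fraction;  /* 23 bits; normal numbers add a hidden 1 */
} Binary32Fields;

static Binary32Fields unpack_binary32(float x)
{
    uint32_t bits;
    Binary32Fields f;

    memcpy(&bits, &x, sizeof bits);   /* bit-exact reinterpretation */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```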


Table 1. Latency and area of floating-point operations.

Operation          Impulse C                      Altera                         Our floating-point library
                   Latency   LUTs    Registers    Latency   LUTs    Registers    Latency   LUTs    Registers
                   (cycles)                       (cycles)                       (cycles)
Add or subtract       11      554       632           7      548       308           3      435       139
Multiply              11      148       283           5      155       136           2      107        72
Divide                15    2,127       710           6    1,482       758           6    1,320       392
Divide (DSP)*         15      244       491           6      197       260          NA       NA        NA
Square root           NA       NA        NA          16      473       521           5      401       198

* Multipliers in digital-signal processor (DSP) blocks may be used to perform division. Impulse C division uses 14 DSP blocks, whereas Altera division may use either 16 DSP blocks or none. Our division does not use multipliers; therefore, there is no version with DSP blocks.

Table 1 shows the area and latency comparison between Impulse C, Altera, and our floating-point library, where FPGA resources are measured in terms of look-up tables (LUTs) and registers.

Hardware-software partitioning and solver code adaptation
The second step was to perform hardware-software partitioning and adapt the original solver code to Impulse C. We began by profiling the code to find the computational bottlenecks. This profiling, along with an analysis of the data flow, let us decide which portions of the code to execute in the FPGA.

Most HDL compilers, including Impulse C, are based on the process-stream model of computation, derived from the communicating sequential processes (CSP) paradigm. In this computation model, data flows from one process to others through streams. Synchronization is automatically resolved because processes run only when there is data available in their input streams. In a well-modularized C or C++ program, the rule of thumb is to map each C or C++ function to one process, although it's always desirable to examine the code to find parallel tasks within a function, which can be separated into concurrent processes.

The major difficulties in adapting the original C code for high-level synthesis are in data access and storage. The HDL compiler might not (as Impulse C does not) support object-oriented paradigms or complex data structures, such as arrays inside C structures, pointers, or dynamic-memory allocation. More decisively, efficient data access is critical for an optimum hardware implementation. Therefore, we advocate using an architectural pattern that isolates computation and data storage into two different types of processes. This solution aims for the best code maintainability and easier optimizations.

Loop unrolling and pipelining
At the end of the second step, a hardware-software solution was running, but it served only for validation purposes, because performance in such solutions is typically very poor. Acceleration can be achieved only when loop parallelism is exploited via loop unrolling and pipelining, the third step in our methodology.

In Impulse C, this is achieved through pragmas placed at the beginning of the loops. However, there are currently some restrictions: neither unrolled loops nested inside a pipelined loop nor nested pipelined loops are allowed. Moreover, loops must be fully unrolled, and this is only possible when the number of iterations is known at compilation time. Other tools, such as Mentor Graphics' Catapult C,8 let users unroll loops in sets of k iterations, thus providing a maximum speedup factor of k. Impulse C automatically tries to achieve pipelines with an optimal rate of one output result per clock cycle. If input data cannot be provided at such a rate because of data dependencies or access contention, pipeline stalls are automatically added. In contrast, Catapult C offers less aggressive pipelines with a custom issue rate and automatic component sharing.
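As an illustration of the process-stream and pipelining style described above, the following sketch models one hardware process that consumes per-edge data and produces one flux value per iteration. It is a simplified software model, not our Impulse C source: the array arguments (left_state, right_state, face_normal, flux_out) stand in for streams and on-chip memories, the commented-out pragma follows Impulse C's documented pipelining hint, and the arithmetic is a placeholder rather than the real upwind scheme.

```c
/* Simplified software model of one stream-driven hardware process: it walks
 * over the edges of a mesh fragment and produces one flux value per
 * iteration.  In the real design the inputs and outputs are Impulse C
 * streams and on-chip memories, and the loop is pipelined so that one
 * result per clock cycle is produced when no stalls occur. */

#define FRAGMENT_EDGES 24576   /* capacity of one mesh fragment (see text) */

void edge_flux_process(const float left_state[FRAGMENT_EDGES],
                       const float right_state[FRAGMENT_EDGES],
                       const float face_normal[FRAGMENT_EDGES],
                       float flux_out[FRAGMENT_EDGES])
{
    int e;
    /* #pragma CO PIPELINE -- Impulse C-style pipelining hint (illustrative) */
    for (e = 0; e < FRAGMENT_EDGES; e++) {
        /* Placeholder arithmetic: the real routine evaluates an upwind
         * flux from the reconstructed left/right states of this edge. */
        flux_out[e] = 0.5f * (left_state[e] + right_state[e]) * face_normal[e];
    }
}
```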



Table 2. Number of arithmetic operations for the routines given in Figure 2.

                                        Routines
Operation         2.1 and 2.2   2.2.1   2.3   2.1.1   2.2.1.1     1     3   Total
Add or subtract        27          2      6    106       26      13     4     184
Multiply               34         19     17    122       20      16     4     232
Divide                  0          3      1     11        3       4     1      23
Square root             0          2      3      7        0       1     0      13
Multiply by 2           0          0      0      1        2       1     0       4
Divide by 2             7          0      3      7       10       2     0      29

Architectural optimizations
The last step in our methodology is to apply architectural optimizations to improve area and performance. Impulse C provides the Stage Master Explorer tool to report pipeline throughput and data stalls, greatly simplifying the performance optimizations. In our experience, a significant fraction of data stalls originate because FPGA memory elements have very limited parallelism: just two independent read/write ports. Because each process requires at least one port, global array sharing among processes is discouraged. However, the use of array flattening can increase parallelism in data access; that is, we split multidimensional arrays into several 1D arrays, which can be accessed concurrently to improve data throughput. The corollary is that improving performance requires optimizing data access, which in turn requires separating computation and data storage into different processes.

Simple code tweaks were effective for reducing area. First, we transformed arithmetic expressions by extracting common factors. Then, we stored in auxiliary variables some common calculations that were otherwise redundantly computed. Another trivial transformation was the suppression of as many divisions as possible, because this arithmetic operator is excessively hungry in terms of latency and area (see Table 1). When more than one division with the same denominator is performed, it's usually preferable to compute the inverse and multiply. Similarly, we developed arithmetic cores for multiplying and dividing by 2, which are significantly simpler operations than regular multiplication and division.

In general, this methodology is valid for most scientific computational algorithms. However, it is best suited for algorithms with an irregular data-access pattern and parallelizable loops. Additionally, when using floating-point arithmetic, low-latency operators and arithmetic transformations always help to shrink the circuit by reducing register utilization.

Hardware implementation
We selected the routines to be implemented in hardware (see Figure 2) by profiling the original C++ solver. The changes required from the C++ code were straightforward and related chiefly to the process-stream programming and the methodology described earlier.

However, because the total number of floating-point arithmetic operations was huge (see Table 2), we had to use the area-reduction tweaks described in the previous section. We also carefully analyzed the instruction scheduling, and we detected that the reuse of some variables was limiting the code parallelization, thus increasing pipeline latency and register use. Furthermore, we redefined the array sizes to be a power of 2 when possible, so that effective addresses were computed by shifting and concatenation rather than multiplication and addition. Even with these optimizations, however, we were left with a design that was too big for one FPGA (see Table 3), so we used both devices available in the XD2000i.

To optimize performance, the main issue concerning us was to prevent pipeline stalling. First, data waits were eliminated by array splitting, as explained earlier. Second, dependency between loop iterations appeared in the form of accumulations over a given memory. We eliminated this loop dependency by developing an out-of-order accumulator. This Impulse C process detects data dependencies, saves the values that cannot be computed yet, processes stored values when the input cannot be processed, and inserts bubbles when necessary. Finally, we had to minimize the number of data transfers on the inter-FPGA communication bus to maximize performance. So, we replicated data in both FPGAs, namely the edges and the nodes' conservative values.
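The following is a conceptual software model of the out-of-order accumulator just described; it is ours, not the actual Impulse C process, and the sizes and names (ADD_LATENCY, PENDING_SLOTS, accumulator_issue) are illustrative. Additions whose target address is still in flight in the adder pipeline are parked in a small pending buffer, parked values are retried when their target becomes free, and a bubble is issued when nothing can be accepted.

```c
/* Conceptual model of the out-of-order accumulator (illustrative sizes and
 * names).  The custom adder needs ADD_LATENCY cycles, so back-to-back
 * additions to the same address would stall a naive pipeline. */
#include <stdbool.h>

#define ADD_LATENCY   3   /* latency of the low-latency adder (see Table 1) */
#define PENDING_SLOTS 8   /* small buffer for deferred accumulations */

typedef struct { int addr; float value; bool valid; } Pending;
typedef struct { int addr; bool busy; } Stage;

static Pending pending[PENDING_SLOTS];
static Stage   stage[ADD_LATENCY];     /* model of the adder pipeline */

static bool conflicts(int addr)
{
    for (int s = 0; s < ADD_LATENCY; s++)
        if (stage[s].busy && stage[s].addr == addr)
            return true;
    return false;
}

/* Called once per cycle with the next input pair.  Returns the pair issued
 * to the accumulation pipeline this cycle; valid == false means a bubble. */
static Pending accumulator_issue(int addr, float value)
{
    Pending issue = { 0, 0.0f, false };
    bool input_issued = false;

    /* 1. Prefer draining a parked value whose target address is now free. */
    for (int i = 0; i < PENDING_SLOTS && !issue.valid; i++) {
        if (pending[i].valid && !conflicts(pending[i].addr)) {
            issue = pending[i];
            pending[i].valid = false;
        }
    }
    /* 2. Otherwise issue the fresh input if its target is free. */
    if (!issue.valid && !conflicts(addr)) {
        issue.addr = addr; issue.value = value; issue.valid = true;
        input_issued = true;
    }
    /* 3. Park the fresh input if it was not issued this cycle.  (A full
     *    buffer would apply backpressure to the upstream process.) */
    if (!input_issued) {
        for (int i = 0; i < PENDING_SLOTS; i++) {
            if (!pending[i].valid) {
                pending[i] = (Pending){ addr, value, true };
                break;
            }
        }
    }
    /* 4. Advance the pipeline model by one stage. */
    for (int s = ADD_LATENCY - 1; s > 0; s--)
        stage[s] = stage[s - 1];
    stage[0].busy = issue.valid;
    stage[0].addr = issue.addr;
    return issue;
}
```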
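Returning to the area-reduction tweaks described earlier, the next sketch models two of them in software for illustration only; in the design they are dedicated hardware cores and source-level rewrites, and the helper names (times_two, scale_by_inverse) are ours.

```c
/* Software models of two area-reduction tweaks (illustration only). */
#include <stdint.h>
#include <string.h>

/* Multiply a normal binary32 value by 2 by incrementing its exponent field;
 * no multiplier is needed.  Zeros, denormals, infinities, and NaNs would
 * require extra handling in a complete core. */
static float times_two(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += (uint32_t)1 << 23;        /* +1 on the biased exponent */
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* Division suppression: when several values share a denominator, compute
 * its inverse once and replace the divisions with multiplications. */
static void scale_by_inverse(float *v, int n, float denom)
{
    float inv = 1.0f / denom;         /* one divide instead of n divides */
    for (int i = 0; i < n; i++)
        v[i] *= inv;
}
```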


Table 3. Resource utilization by different FPGAs.

                                      Look-up tables
Resource measure                  Combinational     Memory          Registers         DSP blocks    Memory (Kbits)
Available resources in the
  Stratix III SE260 FPGA             203,520         101,760          203,520            768            15,040
Resource usage in FPGA 1        61,163 (30.1%)     704 (0.7%)    88,033 (43.3%)    436 (56.8%)     7,502 (49.9%)
Resource usage in FPGA 2       142,677 (70.1%)   7,978 (7.8%)   140,048 (68.8%)    764 (99.5%)    11,715 (77.9%)

This data-transfer minimization imposed restrictions on the partitioning, which caused an imbalance in the area utilization of the two FPGAs (see Table 3).

[Figure 3: FPGA 1 runs the gradients, boundary-residuals, and time-integration processes; FPGA 2 runs the inviscid- and viscous-residuals processes. Both connect through the bridge FPGA of the in-socket accelerator to the front-side bus.]
Figure 3. Hardware processes distribution.

Figure 3 shows a simplified view of the hardware architecture. The input stream comes to FPGA 1 from the FSB, passing through the bridge FPGA. Although data is stored in the internal memory of FPGA 1, it is also forwarded to FPGA 2, which stores it in its internal memory as well. Then, routines 1, 2.1, 2.2, 2.2.1, and 2.3 of Figure 2 are computed in parallel in FPGA 1. Specifically, gradients are computed on FPGA 1 but are stored in FPGA 2. Once gradient computation is finished, inviscid and viscous residuals are computed in parallel in FPGA 2. Finally, time integration is performed in FPGA 1, which returns the target nodes' conservative values to the CPU through the bridge FPGA.

Because our hardware accelerator uses only the internal FPGA memory, we can compute mesh fragments containing up to 12,288 nodes and 24,576 edges. Real meshes are actually much bigger, so we used a hardware-software solution.

The mesh is split into fragments so that each part is computed in the hardware accelerator separately, and then each fragment's computed values are merged by the CPU in software. When all fragments are processed, the node values are updated, and a subsequent time iteration can be computed.

A good mesh partition is crucial to obtaining good performance results. At least two neighborhood levels must be added to each partition, and each subdomain must be connected to avoid duplicate data and computations. To get these partitions, we used the Metis tool.10 Another key factor in avoiding a penalty when reassembling the values returned from the accelerator is overlapping the work in the XD2000i and the CPU. We developed a double-buffering stream system to overcome this challenge. With this solution, the CPU updates the streams for the next iteration while data is being processed in the hardware accelerator.

A rough estimation of the effort saved by the HLL approach is evident from the source line count. Our original in-house C++ solver has 4,570 lines of code, whereas the Impulse C code is 8,990 lines, which are translated into 102,175 lines of VHDL code. Considering an effort equivalence of three Impulse C lines for each VHDL line,8 we estimate that the implementation effort has been reduced by a factor of 34.
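The double-buffering scheme described above can be sketched as follows on the host side. The accelerator calls and helpers here (accelerator_start, accelerator_wait, pack_fragment, merge_fragment) are hypothetical stand-ins, since the real host code drives the Impulse C stream interfaces; the point is only to show how packing the next fragment's stream overlaps with the FPGA computation of the current one.

```c
/* Host-side sketch of the double-buffered fragment loop (hypothetical
 * platform hooks; structure only, not the real host code). */
#include <stddef.h>

typedef struct { float *data; size_t len; } Stream;

void accelerator_start(const Stream *in, Stream *out); /* non-blocking start */
void accelerator_wait(void);                           /* block until done   */
void pack_fragment(size_t frag, Stream *in);           /* CPU: build stream  */
void merge_fragment(size_t frag, const Stream *out);   /* CPU: merge results */

void process_time_step(size_t n_fragments, Stream in[2], Stream out[2])
{
    pack_fragment(0, &in[0]);                     /* prime the first buffer */
    for (size_t f = 0; f < n_fragments; f++) {
        size_t cur = f & 1, nxt = cur ^ 1;
        accelerator_start(&in[cur], &out[cur]);   /* FPGA: fragment f       */
        if (f + 1 < n_fragments)
            pack_fragment(f + 1, &in[nxt]);       /* CPU: prepare f + 1     */
        accelerator_wait();                       /* wait for fragment f    */
        merge_fragment(f, &out[cur]);             /* CPU: merge fragment f  */
    }
}
```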



Table 4. Mesh characteristics and mean execution time per iteration.

                                                          CPU              1,000 iterations in     1 iteration in
                                                                           XD2000i accelerator     XD2000i accelerator
Mesh          Nodes      Edges      Triangles   Size      GCC     ICPC     Time      Speedup       Time (ms)*        Speedup
                                               (Mbytes)   (ms)    (ms)     (ms)
NACA_0012_s     5,233     15,449      10,216     0.64     14.5    13.8      1.0       13.80        1.8 (0.8 + 1.0)     7.67
NACA_4412       6,492     19,207      12,715     0.79     22.6    22.5      1.4       16.07        2.4 (1.0 + 1.4)     9.38
NACA_0012_m     7,908     23,296      15,388     0.97     30.5    30.3      1.6       18.94        2.7 (1.1 + 1.6)    11.22
NACA_0012_l   494,128  1,478,960     984,832    70.83    4,115   3,866       -           -         170 (43 + 127)     22.74

* The numbers in parentheses indicate how the execution time is split between data transfer and raw computation time, respectively.

Even if we consider that the VHDL code generated by Impulse C is not optimal, and that some of the engineers involved in the study by El-Araby, Merchant, and El-Ghazawi did not have experience with Impulse C or HDL development,8 the difference is significant enough to justify the use of HLL tools.

Results
We compared the execution of routines 1, 2, and 3 on the CPU against their execution in the XD2000i accelerator, including data transfers and data-structure updates. The CPU was an Intel Xeon L5408 (a 45-nm quad-core processor running at 2.13 GHz with 12 Mbytes of Level-2 cache) running a 64-bit CentOS Linux distribution. The system also featured 32 Gbytes of RAM and an FSB running at 1,066 MHz.

We evaluated four test cases. Three of them (NACA_0012_s, NACA_0012_m, and NACA_0012_l) were different discretizations of a NACA 0012 airfoil. The other test case corresponded to a NACA 4412 wing section. All of them consisted of triangle-shaped control volumes. Table 4 includes the mesh sizes in terms of nodes, edges, and triangles, as well as the size of the stream sent to the XD2000i accelerator. The table also shows the execution times and speedups achieved. We compiled our original Navier-Stokes solver with two different compilers: GCC 4.1.2 and ICPC 12.0.2. Execution times for the software were significantly improved when we enabled compiler optimizations. Specifically, GCC was invoked with the -O3 flag, whereas ICPC was called with -O3 -xSSE4.1 -parallel -static-intel -axSSE4.1 -ip -ipo0.

With regard to execution with the XD2000i accelerator, the NACA_0012_s, NACA_4412, and NACA_0012_m grids fit entirely in the internal FPGA memory. Hence, the algorithm could be executed for any desired number of iterations in the hardware module, eliminating data transfers to the host main memory. For this reason, the speedup was better when we performed 1,000 time integrations in the accelerator. Also, this test let us accurately compute the raw execution time inside the XD2000i, because the time spent in data transmission was negligible. This raw computation time is the second term given in parentheses in the one-iteration time column of Table 4; the data-transfer time is the first term. As Table 4 shows, the execution time with the XD2000i module is one order of magnitude faster than in our original C++ solver.

The NACA_0012_l mesh did not fit in the internal FPGA memory, so we had to partition it. Thus, we expected degradation in performance for this case. To improve performance, we explored how to minimize the number of data transfers. By adding 2N neighborhood layers to each fragment, we can compute N time iterations without interacting with the CPU. There is a trade-off, because the stream size for each fragment grows with N, and redundant computations stem from the greater overlap between fragments. However, CPU performance degrades considerably more with increasing mesh size. This degradation is due to the increased cache misses caused by the irregular access pattern of unstructured meshes and the higher memory utilization. Therefore, the best speedup was for this last case, which was 22.74 times faster. We obtained this value when the mesh was split into 85 fragments and two time integrations were computed in the XD2000i module for each fragment.

The bottlenecks in the current implementation are communication and local-memory capacity. The current communication limits are imposed by the FSB and the inter-FPGA links. We estimate that improving the latter link would allow for a 2x improvement in the raw computation time. Also, a


greater amount of internal memory would result in better performance, because bigger mesh fragments could be processed in the FPGA.

We also measured AC power consumption. The complete system, running without any computational load, consumed 183.9 W without configuring the FPGAs. (Power consumption increases when the FPGAs are configured.) This value represents all the static power consumption: fans, disks, and other circuitry. The extra power added by running the software application was an average of 36.5 W (including the memory accesses and processor computations). By contrast, the hardware-software solution using the CPU and the XD2000i consumed an average of 43.1 W, of which 18 W corresponded to the XD2000i module.

The preceding measurements show that using the in-socket accelerator adds negligible power consumption. Thus, using the XD2000i module lets users reduce both execution time and energy consumption by more than one order of magnitude.

OUR RESULTS SHOW that a methodology based on high-level languages, low-latency floating-point arithmetic, and in-socket acceleration is effective for accelerating complex scientific applications. However, without the help of high-level language compilers, more specifically Impulse C, this project would have required considerably more effort. Impulse C let us reduce the development time from months to weeks, without compromising performance. Moreover, a big part of this success was due to the XD2000i In-Socket Accelerator, which minimized the I/O bottlenecks that are present in other accelerator technologies.

Acknowledgments
This work was supported by the Dovres (Design Optimization by Virtual Reality Simulation) project under the Airbus FuSim-E (Future Simulation Concept) Programme initiative.

References
1. L. Xiao et al., "Auto-FCD: Efficiently Parallelizing CFD Applications on Clusters," Proc. IEEE Int'l Conf. Cluster Computing, IEEE CS Press, 2003, pp. 46-53.
2. T. Brandvik and G. Pullan, "Acceleration of a 3D Euler Solver Using Commodity Graphics Hardware," Proc. 46th AIAA Aerospace Sciences Meeting, Am. Inst. of Aeronautics and Astronautics, 2008, AIAA paper no. 2008-607.
3. J.C. Thibault and I. Senocak, "CUDA Implementation of a Navier-Stokes Solver on Multi-GPU Desktop Platforms for Incompressible Flows," Proc. 47th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, Am. Inst. of Aeronautics and Astronautics, 2009, AIAA paper no. 2009-758.
4. V.G. Asouti et al., "Unsteady CFD Computations Using Vertex-Centered Finite Volumes for Unstructured Grids on Graphics Processing Units," Int'l J. Numerical Methods in Fluids, 19 May 2010, doi:10.1002/fld.2352.
5. H. Morisita et al., "Implementation and Evaluation of an Arithmetic Pipeline on FLOPS-2D: Multi-FPGA System," ACM SIGARCH Computer Architecture News, vol. 38, no. 4, 2010, pp. 8-13.
6. W.D. Smith and A.R. Schnore, "Towards an RCC-Based Accelerator for Computational Fluid Dynamics Applications," J. Supercomputing, vol. 30, no. 3, 2004, pp. 239-261.
7. J. Curreri et al., "Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing," Proc. 16th Int'l Symp. Field-Programmable Custom Computing Machines (FCCM 08), IEEE CS Press, 2008, pp. 23-30.
8. E. El-Araby, S.G. Merchant, and T. El-Ghazawi, "A Framework for Evaluating High-Level Design Methodologies for High-Performance Reconfigurable Computers," IEEE Trans. Parallel and Distributed Systems, vol. 22, no. 1, 2011, pp. 33-45.
9. "XD2000i FPGA In-Socket Accelerator for Intel FSB," XtremeData; http://www.xtremedata.com/products/accelerators/in-socket-accelerator/xd2000i.
10. G. Karypis and V. Kumar, METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0, tech. report, Dept. of Computer Science, Univ. of Minnesota, Minneapolis, 1998.

Diego Sanchez-Roman is pursuing a PhD in Escuela Politecnica Superior at Universidad Autonoma de Madrid. His research interests include computer architecture and high-performance reconfigurable computing. He has an MTech in computer science from Universidad Autonoma de Madrid.



Gustavo Sutter is an associate professor in Escuela Politecnica Superior at Universidad Autonoma de Madrid. His research interests include computer arithmetic and architectures, algorithms in hardware, reconfigurable devices, high-performance computing, and low-power design techniques. He has a PhD in computer engineering from Universidad Autonoma de Madrid. He is a member of IEEE.

Sergio Lopez-Buedo is an associate professor in Escuela Politecnica Superior at Universidad Autonoma de Madrid, and is also the founder of Naudit HPCN. His research interests include high-performance computing and communication applications of reconfigurable devices. He has a PhD in computer engineering from Universidad Autonoma de Madrid. He is a member of IEEE.

Ivan Gonzalez is an associate professor in Escuela Politecnica Superior at Universidad Autonoma de Madrid. His research interests include parallel algorithms and performance tuning for heterogeneous computing. He has a PhD in computer engineering from Universidad Autonoma de Madrid.

Francisco J. Gomez-Arribas is an associate professor in Escuela Politecnica Superior at Universidad Autonoma de Madrid. His research interests include reconfigurable-computing applications, with a special focus on the design of multiprocessor systems and their integration in heterogeneous clusters. He has a PhD in physics from Universidad Autonoma de Madrid.

Javier Aracil is a full professor in Escuela Politecnica Superior at Universidad Autonoma de Madrid. His research interests include computational mathematics, and traffic analysis and characterization of communication networks. He has a PhD in telecommunications engineering from Universidad Politecnica de Madrid. He is a senior member of IEEE.

Francisco Palacios is an engineering research associate at Stanford University. His research interests include mechanical engineering and applied mathematics. He has a PhD in applied mathematics from Universidad Autonoma de Madrid.

Direct comments and questions about this article to Diego Sanchez-Roman, Escuela Politecnica Superior, Universidad Autonoma de Madrid, Cantoblanco, E-28049 Madrid, Spain; [email protected].
