FLASH: Fast, Parallel, and Accurate Simulator for HLS
Young-Kyu Choi, Member, IEEE, Yuze Chi, Jie Wang, and Jason Cong, Fellow, IEEE
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020

Abstract—A large semantic gap between a high-level synthesis (HLS) design and a low-level RTL simulation environment often creates a barrier for those who are not FPGA experts. Moreover, such a low-level simulation takes a long time to complete. Software HLS simulators can help bridge this gap and accelerate the simulation process, but their shortcoming is that they do not provide performance estimation. To make matters worse, we found that the current commercial FPGA HLS software simulators sometimes produce incorrect results. In order to solve these performance estimation and correctness problems while maintaining the high speed of software simulators, this paper proposes a new HLS simulation flow named FLASH. The main idea behind the proposed flow is to extract scheduling information from the HLS tool and automatically construct an equivalent cycle-accurate simulation model while preserving C semantics. Experimental results show that FLASH runs three orders of magnitude faster than the RTL simulation.

Index Terms—Simulation acceleration, high-level synthesis, field-programmable gate array, source-to-source transformation.

Manuscript received June 21, 2019; revised December 6, 2019; accepted January 7, 2020. Date of publication March 1, 2020; date of current version October 1, 2020. This paper was recommended by Associate Editor J. Cortadella. The authors are with Computer Science Department, University of California, Los Angeles, CA, 90095, USA. E-mail: fykchoi,chiyuze,jiewang,[email protected]
This research is in part supported by Intel and NSF Joint Research Center on Computer Assisted Programming for Heterogeneous Architectures (CAPA) (CCF-1723773) and Baidu, NEC, and VMWare under the CDSC industrial partnership.
Digital Object Identifier XX.XXXX/TCAD.2020.XXXXXXX

I. INTRODUCTION

Although the field-programmable gate array (FPGA) has many promising features that include power-efficiency and reconfigurability, the low-level programming environment makes it difficult for programmers to use the platform. In order to solve this problem, many high-level synthesis (HLS) tools such as Xilinx Vivado HLS [1], [2] and Intel OpenCL HLS [3] have been released. These tools allow programmers to design FPGA applications with high-level languages such as C or OpenCL. This trend is reinforced by recent efforts on FPGA programming with languages of higher abstraction, such as Spark or Halide [4]–[6].

Even though such progress has been made on the design automation side, a large semantic gap still exists on the simulation side. Programmers often need to use low-level register-transfer level (RTL) simulators (e.g., ModelSim [7], NCSim [8], or VCS [9]) or on-board emulators (e.g., Zebu [10]) and try to map the result back to HLS. The result is often incomprehensible to those who are not FPGA experts. Moreover, low-level RTL simulation takes a very long time. Some work has been done to automate hardware probe insertion from the HLS source file [11]–[17], but this work requires regeneration of the FPGA bitstream whenever the debugging point changes. The turnaround time is often measured in hours. On-board emulators have a similar problem and require a long time for the bitstream generation.

These problems can be partially solved by the software-based simulators provided by HLS tools. The HLS software simulators compile the C or OpenCL source code for native execution on the host machine. It takes little time to reconfigure the debugging points, and no semantic gap exists between the simulation and the design. However, a well-known shortcoming of these simulators is that most of them do not provide performance estimation. In addition, we found a critical deficiency: they sometimes provide incorrect results.

An example can be found in the molecular dynamics simulation [18] (Fig. 1). Multiple distance processing elements (Dist PEs) filter out faraway molecules whose distance exceeds a threshold and send the remaining molecules to the Force PE. The pruned molecules create a bubble (empty data) in the FIFO, and the Force PE processes only the valid data (after a non-blocking read) in the order they are received from any of the FIFOs. However, if the modules are instantiated in the order of (Dist PE1, PE2, ..., Force PE) in the source file, the Vivado HLS software simulator finishes the simulation of Dist PE1 first, followed by Dist PE2, and so on. As a result, by the time the Force PE is simulated, the bubbles in the FIFOs are completely removed, and the Force PE output ordering can be entirely different from the RTL simulation. If one is trying to quickly trace the source of a problem that was observed in the output of an RTL simulation, the person will not be able to reproduce the problematic state in the software simulation.

[Fig. 1. Molecular dynamics simulation [18]. Four Dist PEs (II=4) apply the filter r<thres and feed a Force PE (II=1) that performs round-robin non-blocking reads. FIFO contents per iteration (Dist PE1, PE2, PE3, PE4): i=1: (bubble), 2, (bubble), (bubble); i=2: 5, (bubble), (bubble), 8; i=3: (bubble), 10, 11, (bubble). The HLS C code instantiates Dist_PE1(), Dist_PE2(), Dist_PE3(), Dist_PE4(), and Force_PE() under #pragma HLS dataflow. RTL sim output: 2 5 8 10 11. SW sim output: 5 2 11 8 10, which does not match because the modules are simulated in instantiation order and the bubbles are missing.]

Another problematic example can be found in the artificial deadlock situation [19], which occurs when the depth of the FIFO is smaller than the latency difference among modules (details in Section III-B). The issue is that the HLS software simulator cannot detect the deadlock situation and proceeds as if there is no problem with the design. We have also found a problem in the simulation of feedback loops where the feedback data is ignored by the HLS tool (Section III-C).

The primary reason for the incorrect simulation result is that HLS software simulators do not guarantee cycle accuracy. A comparison between the software simulators of the two most popular [20] commercial FPGA HLS tools, Xilinx Vivado HLS (VHLS) [2] and Intel OpenCL HLS (AOCL) [3], is presented in Table I. VHLS assumes unlimited FIFO depth, which makes it difficult to accurately model FIFO fullness/emptiness. Also, the sequential simulation execution model prevents correctly simulating designs with feedback loops (Section III-C). AOCL simulates about 5x slower than VHLS, but it correctly simulates the FIFO depth. The tool assigns a thread to each module for concurrent simulation, but the execution order of the threads is not deterministic and may produce different results in different simulation runs for the cases in Section III.

TABLE I
COMPARISON OF THE SOFTWARE SIMULATION OF XILINX VIVADO HLS [2] AND INTEL OPENCL HLS [3]. UNDESIRABLE CHARACTERISTICS ARE IN BOLD.

              Xilinx VHLS C Sim     Intel AOCL Sim
FIFO depth    Unlimited             Exact
Exec model    Sequential            Concurrent
Feedback      Not supported         Supported
Sim speed     ∼5 Mcycle/s           ∼1 Mcycle/s
Sim order     Deterministic         Non-deterministic
Cycle-acc     Not cycle-accurate    Not cycle-accurate

HLS design steps and conventional simulation flows are shown in Fig. 2. A software simulator runs fast but provides no cycle estimation and may have the correctness problem. An RTL simulator is accurate but runs slowly, because it simulates low-level implementation details. We attempt to devise a new simulation flow that solves both problems. The idea is to add the scheduling information of C statements to the HLS software simulation. Without the allocation/binding information and the component libraries, the new simulation flow would be faster than the RTL simulation; it would also solve the correctness problem of the software simulation and provide accurate performance estimation with its cycle accuracy.

[Fig. 2. HLS design steps [21] and comparison of simulation flows. The HLS C code goes through compilation, then allocation, scheduling, and binding (with a component library), then generation of the RTL code. The SW simulator operates on the HLS C code, the RTL simulator on the RTL code, and the proposed simulator additionally uses the scheduling info.]

In this paper, we propose FLASH, an HLS software simulation (HSS) flow that addresses these challenges. We describe transformations that allow cycle-accurate simulation of FIFO communication (defined in Section IV). Also, a method is presented to simulate task-level and pipelined parallelism with C semantics. These steps are described in Section V.

In order to simulate pipelined parallelism, variables need to be duplicated to match the depth of a loop pipeline (explained in Section V-B1). But this results in a redundant data copy, which slows down the simulation. We propose optimization techniques to reduce this overhead in Section VI.

We obtain the scheduling information from the HLS synthesis report and automatically generate a new simulation code based on the information. The new simulation code is made to be compatible with the conventional HLS software simulator for easy integration with the existing tool. The overall flow is described in Section VII.

FLASH also provides correctness and performance debugging support for programmers. In order to ease the process of detecting deadlocks or stalls, a set of source-level directives is included. Also, the debugging time is shortened by allowing variables to be added to the capture list in the middle of simulation. This will be explained in Section VIII.

The contributions of this paper can be summarized as follows:
• We show that simulating based on the scheduling information can help solve the correctness issue of HLS software simulators and rapidly provide accurate performance estimation.
• We develop a framework that allows fast cycle-accurate simulation of an HLS design. Several code transformation