IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 1

FLASH: Fast, ParalleL, and Accurate Simulator for HLS Young-kyu Choi, Member, IEEE, Yuze Chi, Jie Wang, and Jason Cong, Fellow, IEEE

Abstract—A large semantic gap between a high-level synthesis time. Some work has been done to automate hardware probe (HLS) design and a low-level RTL simulation environment often insertion from the HLS source file [11]–[17], but this work creates a barrier for those who are not FPGA experts. Moreover, requires regeneration of the FPGA bitstream if there is a such a low-level simulation takes a long time to complete. Software HLS simulators can help bridge this gap and accelerate change in the debugging point. The turnaround time is often the simulation process; but their shortcoming is that they do in hours. On-board emulators also have a similar problem and not provide performance estimation. To make matters worse, require a long time for the bitstream generation. we found that the current FPGA HLS commercial software These problems can be partially solved by the software- simulators sometimes produce incorrect results. In order to solve based simulators provided by HLS tools. The HLS software these performance estimation and correctness problems while maintaining the high speed of software simulators, this paper simulators compile the C or OpenCL source code for na- proposes a new HLS simulation flow named FLASH. The main tive execution on the host machine. It takes little time to idea behind the proposed flow is to extract scheduling information reconfigure the debugging points, and no semantic gap exists from the HLS tool and automatically construct an equivalent between the simulation and the design. However, a well- cycle-accurate simulation model while preserving C semantics. known shortcoming of these simulators is that most of them do Experimental results show that FLASH runs three orders of magnitude faster than the RTL simulation. not provide performance estimation. In addition, we found a critical deficiency—they sometimes provide incorrect results. Index Terms—Simulation acceleration, high-level synthesis, An example can be found in the molecular dynamics field-programmable gate array, source-to-source transformation. simulation [18] (Fig.1). Multiple distance processing elements (Dist PEs) filter out faraway molecules above threshold and send them to Force PE. The pruned molecules create a bubble I.INTRODUCTION (empty data) in the FIFO, and Force PE processes only the LTHOUGH the field-programmable gate array (FPGA) valid data (after non-blocking read) in the order they are A has many promising features that include power- received from any of the FIFOs. However, if the modules are efficiency and reconfigurability, the low-level programming instantiated in the order of (Dist PE1, PE2, . . . Force PE) environment makes it difficult for programmers to use the in the source file, Vivado HLS software simulator finishes the platform. In order to solve this problem, many high-level simulation of Dist PE1 first, followed by Dist PE2, and so on. synthesis (HLS) tools such as Xilinx Vivado HLS [1], [2] As a result, by the time the Force PE is simulated, the bubbles and Intel OpenCL HLS [3] have been released. These tools in the FIFOs are completely removed, and the Force PE output allow programmers to design FPGA applications with high- ordering can be entirely different from the RTL simulation. If level languages such as C or OpenCL. This trend is reinforced one is trying to quickly trace the source of a problem that was by recent efforts on FPGA programming with languages of observed in the output of an RTL simulation, the person will higher abstraction—such as Spark or Halide [4]–[6]. not be able to reproduce the problematic state in the software Even though such progress has been made on the de- simulation. sign automation side, a large semantic gap still exists on Another problematic example can be found in the artificial the simulation side. Programmers often need to use low- deadlock situation [19], which occurs when the depth of the level register-transfer level (RTL) simulators (e.g., Model- FIFO is smaller than the latency difference among modules Sim [7], NCSim [8], or VCS [9]) or on-board emulators (details in Section III-B). The issue is that the HLS software (e.g., Zebu [10]) and try to map the result back to HLS. The simulator cannot detect the deadlock situation and proceeds as result is often incomprehensible to those who are not FPGA if there is no problem with the design. We also have found experts. Moreover, low-level RTL simulation takes a very long a problem in the simulation of feedback loops where the feedback data is ignored by the HLS tool (Section III-C). Manuscript received June 21, 2019; revised December 6, 2019; ac- cepted January 7, 2020. Date of publication March 1, 2020; date of cur- The primary reason for the incorrect simulation result is rent version October 1, 2020. This paper was recommended by Asso- that HLS software simulators do not guarantee cycle accuracy. ciate Editor J. Cortadella. The authors are with Computer Science Depart- The comparison between the software simulator of the two ment, University of California, Los Angeles, CA, 90095, USA. E-mail: {ykchoi,chiyuze,jiewang,cong}@cs.ucla.edu most popular ([20]) commercial FPGA HLS tools, Xilinx This research is in part supported by Intel and NSF Joint Research Center Vivado HLS (VHLS) [2] and Intel OpenCL HLS (AOCL) on Computer Assisted Programming for Heterogeneous Architectures (CAPA) [3], is presented in TableI. VHLS assumes unlimited FIFO (CCF-1723773) and Baidu, NEC, and VMWare under the CDSC industrial partnership. depth, which makes it difficult to accurately model FIFO Digital Object Identifier XX.XXXX/TCAD.2020.XXXXXXX fullness/emptiness. Also, the sequential simulation execution IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 2

1 6 (II=4) Dist PE1 Dist PE2 Dist PE3 Dist PE4 Allocation 9 5 2 i=1 : (bubble) 2 (bubble) (bubble) Library i=2 : 5 (bubble) (bubble) 8 11 10 I=3 : (bubble) 10 11 (bubble) HLS C code Compilation Binding Generation RTL code 3 8 4 (Round-robin r RTL sim output: 2 5 8 10 11 SW Proposed RTL #pragma HLS dataflow simulator simulator scheduling info simulator SW sim output: 5 211810 Dist_PE1(); 5 Simulated in Dist_PE2(); 2 10 Does not instantiation Dist_PE3(); 11 match Fig. 2. HLS design steps [21] and comparison of simulation flows Dist_PE4(); order 8 → Missing bubbles Force_PE();

Fig. 1. Molecular dynamics simulation [18] In this paper, we propose FLASH—an HLS software simu- lation (HSS) flow that addresses these challenges. We describe TABLE I transformations that allow cycle-accurate simulation of FIFO COMPARISON OF THE SOFTWARE SIMULATION OF XILINX VIVADO HLS communication (defined in SectionIV). Also, a method is [2] AND INTEL OPENCLHLS[3].U NDESIRABLECHARACTERISTICSARE presented to simulate task-level and pipelined parallelism with INBOLD. C semantics. These steps are described in SectionV. Xilinx VHLS C Sim Intel AOCL Sim In order to simulate pipelined parallelism, variables need to FIFO depth Unlimited Exact Exec model Sequential Concurrent be duplicated to match the depth of a loop pipeline (explained Feedback Not supported Supported in Section V-B1). But this results in a redundant data copy, Sim speed ∼5 Mcycle/s ∼1 Mcycle/s which slows down the simulation. We propose optimization Sim order Deterministic Non-deterministic Cycle-acc Not cycle-accurate Not cycle-accurate techniques to reduce this overhead in SectionVI. We obtain the scheduling information from the HLS synthe- sis report and automatically generate a new simulation code model prevents correctly simulating designs with feedback based on the information. The new simulation code is made to loops (Section III-C). AOCL simulates about 5X slower than be compatible with the conventional HLS software simulator VHLS, but it correctly simulates the FIFO depth. The tool for easy integration with the existing tool. The overall flow is assigns a thread to each module for concurrent simulation; described in Section VII. but the execution order of the threads is not deterministic and FLASH also provides correctness and performance debug- may produce different results in different simulation runs for ging support for programmers. In order to ease the process of cases in Section III. detecting deadlocks or stalls, a set of source-level directives is HLS design steps and conventional simulation flows are included. Also, the debugging time is shortened by allowing shown in Fig.2. A software simulator runs fast but provides variables to be added to the capture list in the middle of no cycle estimation and may have the correctness problem. An simulation. This will be explained in Section VIII. RTL simulator is accurate but runs slow, because it simulates The contribution of this paper can be summarized as fol- low-level implementation details. We attempt to devise a new lows: simulation flow that solves both problems. The idea is to • We show that simulating based on the scheduling infor- add the scheduling information of C statements in the HLS mation can help solve the correctness issue of HLS soft- software simulation. The new simulation flow would be faster ware simulators and rapidly provide accurate performance than the RTL simulation without the allocation / binding estimation. information and the component libraries; it would solve the • We develop a framework that allows fast cycle-accurate correctness problem of the software simulation and provide simulation of an HLS design. Several code transformation accurate performance estimation with its cycle accuracy. techniques have been presented to enable this process. Although simulating based on the LLVM intermediate Moreover, optimizations are proposed to accelerate the representation (IR) is a possible option, we have instead simulation speed. decided to simulate in C syntax (augmented with scheduling • We propose unique debugging features for HSS. information). This allows us to raise the simulation abstraction level—accelerating the simulation process and making it easier This paper is an extended version of our preliminary work for programmers to understand what is being simulated. To presented in [22]. Compared to [22], this paper presents new our knowledge, this is the first HLS-based software simulation optimizations to accelerate the simulation speed (SectionVI). flow that takes such an approach. We also propose novel source-level debugging features (Sec- By taking such an approach, however, several challenges tion VIII). Moreover, we add a formal definition of the were encountered (elaborated on in SectionIV). One problem problem and a detailed explanation of the proposed solution is how to guarantee cycle-accuracy of untimed C statements. and the code transformation process (Sections III,IV, andV). Another is correctly simulating the parallelism that is inher- Our current initial version is based on the Vivado HLS tool, ent in hardware (and the corresponding RTL simulation) in but we hope to extend our work to the Intel HLS tool if it sequential C semantics. provides detailed internal scheduling information in the future. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 3

II.RELATED WORK Work in [11]–[17] describe frameworks that allow users to Dist Dist Dist Dist Dist_PE1(); PE1 PE2 PE3 PE4 Simulated in Dist_PE2(); Dist_PE3(); specify debugging points in high-level language and synthesize F1 F2 F3 F4 instantiation order Dist_PE4(); hardware probes into the FPGA for analysis. They can be Force PE Force_PE(); categorized into work that is more focused on verifying func- tional correctness [11]–[15] and work that is more focused on 𝑡_ extracting performance-related parameters [16], [17]. Goeders Dist PE1 F1 write 𝑡_ 5 and Wilton describe how to record and replay the value of Dist PE2 F2 write 2 𝑡_ 𝑡_ variables from an HLS-generated circuit [11]. Their work is Force PE F1 nb rd N/A 5 extended to cover compiler-optimized designs in [12] and to Force PE F2 nb rd 2 𝑡_ allow offline signal restoration in [13]. Monson and Hutchings t introduce event observability ports (EOP) to enable source- level signal trace and explain how to combine multiple signals to reduce trace buffer size [14], [15]. HLScope [16] describes Dist PE1 F1 write 5 ... Dist PE2 F2 write 2 ... an in-FPGA monitoring flow that extracts cycle information Force PE F1 nb rd from FPGA designs written in C. The work of Verma et 5 al. [17] is based on OpenCL and measures stall latency and Force PE F2 nb rd 2 monitors memory access patterns by utilizing trace buffers to store an event’s timestamp. However, these hardware-based Fig. 3. Timing diagram of the molecular dynamics simulation in Fig.1 (FIFO transactions among only Dist PE1, Dist PE2, and Force PE are shown) HLS debuggers typically require hours of initial overhead for bitstream generation. There are other software-based HLS simulators. The LegUp A. Data Ordering Problem HLS [23] simulator provides a speedup prediction based on the profiling result of the source code and the execution cycle The problem of incorrect output ordering in the HSS for from its synthesis result. HLScope+ [24] describes a method molecular dynamics simulation was presented in our intro- to extract cycle information that is hidden by HLS abstraction duction. In this section we discuss the cause of this problem and uses VHLS C simulation to predict the performance for in more detail. Fig.3 shows the timing diagram of the FIFO applications with dynamic behavior. These works, however, do transactions among Dist PE1, Dist PE2, and Force PE in Fig.1. not guarantee cycle-accuracy. Dist PE1 and Dist PE2 communicate with Force PE through FIFO F1 and F2 respectively. Consider a case where data (2) There are several SystemC simulators (e.g., [25]–[27]) that is written to F2 before data read from F1 and F2, and F1 is achieve cycle-accuracy for the source code that has explicit written (data 5) afterwards (illustrated in the RTL simulation scheduling information specified by the programmer. However, part of Fig.3). At the time of the first F1 non-blocking read constructing a cycle-accurate input design file may be too attempt (t ), the first F1 write (t ) has not yet difficult for non-experts. Our flow, on the other hand, achieves F 1 RD1 F 1 WR1 occurred (t < t ), and the successful F2 read cycle-accuracy for an HLS C source code without requiring F 1 RD1 F 1 WR1 precedes the successful F1 read (t < t ). user-defined scheduling information. F 2 RD1 F 1 RD2 There is a class of work that accelerates the simulation of In the VHLS software simulation, however, data (5) is an HLS tool’s output RTL code by converting the RTL code available in the first read attempt to F1 because Dist PE1 into a cycle-accurate C model [28], [29]. Mahapatra et al. [28] is evaluated entirely before Dist PE2 and Force PE. That is, report a speedup of 5X after removing the core computation unlike the RTL simulation, the first F1 write has already oc- t > t and only maintaining the IO timing, but such an approach curred before the first F1 read attempt ( F 1 RD1 F 1 WR1). cannot be used for data-dependent benchmarks. Verilator [29], As a result, the successful F2 read happens after the suc- t > t on the other hand, can be used to provide a functionally correct cessful F1 read ( F 2 RD1 F 1 RD2). The ordering of the and cycle-accurate HLS simulation as our work. Verilator data processed at Force PE is not maintained. If the HSS employs several techniques for acceleration—such as removal has evaluated the FIFO write correctly before each FIFO t < t < t of time delays, randomized unknown value, and creation of read attempts (i.e., F 1 RD1 F 1 WR1 F 1 RD2 and t < t table lookups. But the speedup in Verilator is limited because F 2 WR1 F 2 RD1), this problem would not have occurred. it is very difficult to completely remove allocation and binding In the AOCL simulation, the simulation order of the producer information from the RTL code—whereas in our approach, this modules is undetermined, and a similar data ordering problem information is never added in the first place. A quantitative occurs. comparison is presented in SectionIX. As demonstrated in the example, the HLS software simula- tor should evaluate the FIFO writes before each non-blocking read attempt in the same order as in the RTL simulation. If III.PROBLEM DESCRIPTIONAND MOTIVATING EXAMPLES not, a data ordering problem may occur. The data ordering In this section, we describe four classes of problems (three problem is defined as a case where a consumer module MC is correctness-related and one performance-related) in current reading data in a non-blocking fashion from multiple producer HLS software simulators. The problems are demonstrated with modules MP through FIFOs, and the order of data processed relevant examples in the literature. at MC in the RTL simulation is not maintained in the HSS. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 4

B. Module Latency Problem C. Feedback Problem Consider an example in Fig.4 (named toy_mpath) where As mentioned previously, the VHLS software simulator the module M2 has a latency of 5 and M3 has a latency of 15. evaluates functions in the order they are instantiated in the All FIFOs have a depth of 2. After M2 has produced two output source code. This causes a problem if there is a feedback elements, M4 cannot consume any of them because fifo4 is path that passes data from later instantiated functions to earlier still empty due to the long latency of M3. Because of back ones. At the time earlier functions are simulated, the data is pressure from M2 and fifo3, fifo1 becomes full. Then M1 not be available. As a result, VHLS simulates the program stops producing output to fifo2 because fifo1 and fifo2 as if the FIFOs in the feedback path are always empty. We have to be written in the same cycle. fifo2 eventually will refer to this issue as a feedback problem. The feedback becomes empty, which blocks the pipeline of M3. Even though problem occurs when the content of a FIFO buffer in the HSS M3 has consumed some remaining data in fifo2, fifo4 does not match that in the RTL simulation at the cycle a read is still empty because of M3’s long latency. Then none of operation is performed on a FIFO in a feedback path. This the modules can do any further useful work, and the circuit happens when FIFO writes before each read is not correctly deadlocks. This is called an artificial deadlock. The artificial simulated. An example of the feedback problem in the case deadlock is caused by the mismatching latency of multiple of matrix multiplication can be found in [22]. The AOCL tool datapaths and inadequate FIFO depth to balance the latency can simulate the feedback data from a blocking read correctly difference [19]. We can also observe this in the architecture because a thread simulating each module can wait for others to for stencil computations [30] that contains modules and FIFOs pass the data. However, it is not guaranteed that the feedback with various latencies and depths. data from a non-blocking read will arrive at the right timing. In order to reproduce the deadlock situation, an HLS soft- ware simulator should create the output data after reading input D. Performance Estimation Problem with a delay that reflects the module latency. However, existing Performance estimation problem is defined as providing HLS software simulators evaluate each iteration of a loop as if incorrect estimation of the module execution time. VHLS the data is instantaneously passed from input to output. Thus, synthesis report has a performance estimation problem for the latency among different datapaths is not simulated, and applications with data-dependent loop bounds, conditional the artificial deadlock does not occur. As a result, even after statements, or stalls [32], and almost all benchmarks used for running a HSS, the user is unaware of a potential problem that the experiment have such properties (details in Section IX-C). might occur during actual on-board execution. AOCL synthesis report and VHLS/AOCL software simulators We will refer to this problem as the module latency problem. do not provide any performance estimation at all. Suppose that a module has a sequence of C FIFO read and write statements cstmt1, ... cstmtc, ... cstmtC . If cstmtc is IV. PROBLEM STATEMENT AND CHALLENGES a blocking read/write, multiple read/write attempts may be Before we provide the problem statement, we will define performed before the read/write is completed. If non-blocking, the concept of FIFO communication cycle-accurate (FCCA) read/write is always completed on the first attempt. We assume simulation. The FIFO communication refers to the FIFO- that the HLS tool has scheduled a delay of delayc cycles accessing expressions in the source code (listed in the second between the completion of cstmtc and the first attempt of column of TableII). A FIFO communication statement refers cstmtc+1. This delay reflects the computation latency. The to a statement with a FIFO communication. Let us assume module latency problem is defined as a case where HSS fails that a FIFO communication has been evaluated in HSS at to simulate delayc between the first attempt of cstmtc+1 and cycle t. We declare that the FIFO communication is simulated the completion of cstmtc for some of c = 1 ... C − 1. cycle-accurately if the FIFO input value and the FIFO output Note that we have modified M2 to the code shown in value of the FIFO APIs (FAPIs) in the HSS match the FIFO Fig.5. The purpose is to make a fair comparison of the input ports (din, rd_en, wr_en) and the FIFO output ports simulation time by making the module simulation to finish (dout, empty, full)[33] in RTL simulation at the same at the same point (the HLS RTL simulation of Fig.4 will cycle t. That is, the C variables and the RTL signals have the deadlock, whereas HSS will not). If the input FIFO is empty, same value as described in the third column of TableII at a bubble is inserted into the pipeline (line 4 of Fig.5)—this the same cycle t. The FIFO input value of the FAPIs refers allows the pipeline to keep processing the already-read data to the value of “wdata”, and the FIFO output value of the even if there is no additional input. A similar transformation FAPIs refers to the value of “test” and “rdata” in TableII. 1 is applied on M3. Deadlock situation does not occur, because If all FIFO communication in a C source code is simulated M4 can now receive the output from M3. cycle-accurately in HSS, the simulation will be FCCA. For example, the non-blocking read expression, “test = 1Alternative solutions for deadlock avoidance are presented in [19], but fifo.read_nb(rdata)”, is cycle-accurately simulated if the they require modification to the HLS scheduling, allocation, and binding kernels. Deadlock can also be avoided by increasing the buffer size [31] value of “rdata” matches dout and “test” has the toggled (AOCL has this functionality). However, since the efficiency of the solution value of empty at the same cycle as the RTL simulation. for avoiding deadlock is not the focus of this work, we apply a solution that We assume that we simulate an HLS design that is com- only requires simple source-to-source transformation of the loops without an elaborate analysis of the whole circuit. This method cannot be used to resolve posed of multiple finite-state machine (FSM) modules (in- all deadlocks—such as the one that is caused by circular wait. ferred from C functions). The modules execute concurrently IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 5

TABLE II THE FIFO COMMUNICATION IN THE C SOURCECODE, THECORRESPONDING FIFO IP RTL PORTS AND C VARIABLES, ANDTHECORRESPONDING FLASH SIMULATION CODE (VHLSFIFOAPIS [2] AND FIFO IP RTL PORTS [33] AREIN MONOSPACEFONT)

Description FIFO comm in source code RTL ports & C variables FLASH simulation code Blocking read rdata = fifo.read() 1 == !empty, rd_en = 1, rdata = dout (stall cond: fifo rnum==0), test = (..rnum!=0), Non-blocking read test = fifo.read_nb(rdata) test = !rd_en = !empty, rdata = dout rdata = fifo arr[fifo rptr++]; fifo rnum−−; Blocking write fifo.write(wdata) 1 == !full, wr_en = 1, din = wdata (stall cond: fifo wnum==0), test = (..wnum!=0), Non-blocking write test = fifo.write_nb(wdata) test = !wr_en = !full, din = wdata fifo arr[fifo wptr++] = wdata; fifo wnum−−; Empty test = fifo.empty() test = empty test = (fifo rnum == 0) Full test = fifo.full() test = full test = (fifo wnum == 0)

(with directive #pragma HLS dataflow) and use stream- empty FIFO buffers) in a deadlock situation for debugging pur- ing FIFOs for inter-module communication. poses. Moreover, the simulation code should be semantically Our main goal is to construct an HLS software simulator similar to the source code as much as possible (as opposed to that is FCCA. The input and output of the simulator is defined being a low-level code such as RTL), so that users can easily as follows: understand what is being simulated. Input: (1) An HLS C source code (2) Scheduling informa- With such complicated requirements, several challenges tion (3) Input data of the design arise: Output: Output data of the design • Challenge 1: FCCA simulation The scheduling information is defined as the information on It is difficult to discover the exact cycle when statements the FSM state transition and the assigned FSM state of FIFO are executed since the information given by the HLS communication, conditional statements, and loop statements. tool is very limited. For example, The AOCL tool only FCCA simulator does not have the data ordering, module provides loop initiation intervals (II). The Vivado HLS latency, feedback, and performance estimation problems that tool provides slightly more information—it provides a list were described in Section III. of LLVM IR and the corresponding state of an FSM. But Recall that the data ordering problem occurs when a con- mapping such low-level representation (e.g., lines 27–31 sumer module MC is reading data in a non-blocking fashion of Fig.6) back to the original C code is a difficult task. from multiple producer modules MP through FIFOs and the Moreover, the execution cycle may change due to FIFO order of data processed at MC is not maintained in the HSS. being empty or full. Even if the execution cycle is known, If the FIFO communication is cycle-accurate, the relative an FCCA simulator needs to selectively simulate a code ordering of FIFO reads and writes matches that of the RTL region that corresponds to a particular cycle. Furthermore, simulation. That is, the number of FIFO reads and writes the value of variables at a certain cycle must be correctly and the data before each FIFO read match that of the RTL supplied to the simulation of the next cycle. M simulation. Thus, the order of data read from the FIFO at C • Challenge 2: Simulation of parallelism in FCCA simulation matches that of the RTL simulation. The HLS designs have multiple levels of parallelism including proof that an FCCA simulator does not have the data ordering task-level parallelism and pipelined parallelism. Cycle- problem is provided in [32]. accurately simulating parallelism in a C syntax becomes We have explained that the feedback problem happens with a difficult task because the value of variables and the incorrect simulation of writes before each read operation to a simulation order of the statements become different from FIFO in the feedback path. Since the relative ordering of reads that of the source code. For example, if the statement in and writes is maintained in an FCCA simulator, the feedback line 21 of Fig.4 is executed 14 cycles after the statement problem does not take place. in line 20, we would need to simulate line 21 with a Since all FIFO transactions occur at the same cycle as in “temp” value that corresponds to iteration i and line 20 the RTL simulation, the delay between any consecutive pair with that of iteration i + 14 in a single cycle. of FIFO reads and writes of a module matches that of the • Challenge 3: Loop and function simulation RTL simulation. Thus, the FCCA simulator does not have the We need to construct an equivalent model of high-level module latency problem. C semantic such as loops and functions. Let us assume that a simulator can model modules’ FSM states correctly if stalls caused by empty and full FIFO signals have been simulated correctly. We also assume that a module’s execution time is determined by its FSM state (more V. AUTOMATED CODE GENERATION FOR RAPID explanation on these assumptions in Section V-A2). Since CYCLE-ACCURATE SIMULATION FCCA simulator models the empty and full signals cycle- accurately, it estimates the modules’ execution time accurately In this section we provide a solution to each challenge in and does not have the performance estimation problem. SectionIV and describe our proposed automated simulation In addition to the main goal of achieving cycle-accurate code generation flow. For illustration, we use the toy_mpath FIFO communication, the simulator should provide the content example (Fig.4) after modifying the source code to avoid the of the registers (e.g., the state of a module or the number of deadlock as shown in Fig.5. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 6

fifo1 M2 fifo3 01 static bool p1_en_st3, ... p1_en_st6 =false; //enable signals (IL=5) 02 static int temp_st3, ... temp_st6; //6 //pipelined variables M1 M4 03 ... data_out M3 04 else if(M2_state == 2){ //starting state for the pipelined loop fifo2 (IL=15) fifo4 05 if(p1_en_st6 && fifo3_wnum==0){ //if stalled due to FIF03 full 01 void M1(stream& f_out1,.. f_out2){17 void M3(stream& f_in, ... f_out){ 06 return; //exit without any changes (see Sect S.A.4) 02 for (int i= 0; i< N/16; k++) { 18 for(int i= 0; i< N; i++){ 07 } 03 for (int j = 0; j < 16; k++) { 19 #pragma HLS pipeline II=1 08 04 #pragma HLS pipeline II=1 20 int temp = f_in.read(); • • • 05 f_out1.write(16*i+j); 21 f_out.write((int)((float)temp*3)); 09 if( p1_en_st6 == true ){ //enabled 4 cycles after FIFO read 06 f_out2.write(16*i+j+10); 22 } } 10 p1_en_st6 = false; //disables enable signal after use 07 } } } 23 // M4 is omitted 11 fifo3_arr[fifo3_wptr++] = temp_st6; //7 //FIFO data write 08 24 void toy_mpath(int data_out[N]){ 12 fifo3_wnum--; //(see Sect S.A.3) 09 void M2(stream & f_in, ... f_out){ 25 stream fifo1,fifo2,fifo3,fifo4; 10 for(int i= 0; i< N; i++){ 26 #pragma HLS stream var=fifo1,.. depth=2 13 } 11 #pragma HLS pipeline II=1 27 #pragma HLS dataflow 14 • • • 12 int data = f_in.read(); 28 M1(fifo1, fifo2); 15 if( p1_en_st3 == true ){ //enabled 1 cycle after FIFO read 13 int temp = data*711; 29 M2(fifo1, fifo3); 16 p1_en_st3 = false; //disables enable signal after use 14 f_out.write(temp); 30 M3(fifo2, fifo4); p1_en_st4 = true; //enable signal propagation 15 } } 31 M4(fifo3, fifo4, data_out); 17 16 32 } 18 temp_st4 = temp_st3; //copies variable for next pipe stage 19 } 20 if( i_st2 < N ){ //2 //loop exit condition (see Sect s.c) Fig. 4. Structure and code for motivating example toy_mpath 21 if( fifo1_rnum != 0 ){ //4 //if FIFO not empty 22 data_st2 = fifo1_arr[fifo1_rptr++]; //5 //FIFO data read 23 fifo1_rnum--; //(see Sect 5.A.3) 01 i= 0; 24 temp_st2 = data_st2 * 711; //6 II comp stmt mapped to st2 for(i = 0; i< N; i++){ 02 while (i < N){ 25 i_st2++; //8 // loop iterator update (see Sect 5.C) #pragma HLS pipeline II=1 03 #pragma HLS pipeline II=1 26 p1_en_st3 =true; //enables if path for later pipe stages int data = f_in.read(); 04 if(f_in.empty()==false){ 27 temp_st3 = temp_st2; //copies variable for next pipe stage int temp = data*711; 05 int data = f_in.read(); 28 ••• } } } f_out.write(temp); 06 int temp = data*711; } 07 f_out.write(temp); 08 i++; Fig. 8. Simulation code that models pipelined loop parallelism for M2 of 09 } } Fig.5 (provides details for line 9 of Fig.7)

Fig. 5. Modified code of M2 in Fig.4 to avoid artificial deadlock 01 void (*MList[M])(); //module func ptr list 02 void (*FList[F])(); //FIFO func ptr list 01 ======03 Mlist[0] = M1_SIM; .... Mlist[3] = M4_SIM; //init 02 + Verbose Summary: Schedule 04 Flist[0] = F1_SIM; .... Flist[3] = F4_SIM; 03 ======05 04 * Number of FSM states : 7 06 while(1){ //scheduler loop 05 * Pipeline : 1 07 ... // loop until until deadlock or all modules finish 06 Pipeline-a : II = 1J D - 5J States - { 2 3 4 5 6 } 08 for(x = 0; x < M; x++) //simulate all modules 07 * Dataflow Pipeline: 0 09 Mlist[x](); 08 10 for(p = 0; p < F; p++) //simulate all FIFOs 09 * FSM state transitions: 11 Flist[p](); 10 1 --> 12 ... 11 2 I true 13 cycle++; 12 2 --> 14 } 13 7 I (!tmp) 14 3 I (tmp) 15 3 --> Fig. 9. Module/FIFO simulation scheduler to model task-level parallelism 16 4 I true 17 4 --> 18 5 I true 19 5 --> 20 6 I true A. FIFO Communication Cycle-Accurate Simulation 21 6 --> 22 2 I true We will describe the properties of FLASH and the corre- 23 7 --> 24 sponding code transformation. Based on these properties, we 25 * FSM state operations: will explain how FLASH achieves FIFO communication cycle- 26 27 State 6 accurate (FCCA) simulation. 28 ... Operation 27 ... --->"call void( ... )* @_ssdm_op_Spec 29 ... Operation 28 ... ---> "call void @_ssdm_op_Write 1) Matching Simulated State of Statements: Let us assume .ap_fifo.volatile.i32P(i32* %f_out_VJ i32 %temp)" that HLS tool schedules a FIFO communication of a C source 30 ... Operation 29 ... ---> "br label %._crit_edge .. . 31 ... Operation 30 ... ---> "%empty_40 =call i32 .. . code to be executed at a particular FSM state (st) of a module. FLASH simulates the FIFO communication at the same st Fig. 6. VHLS scheduling report for M2 of Fig.5 scheduled by the HLS tool. In order to achieve this, we first need to obtain the HLS scheduling information of the 01 void M2_SIM(){ //simulation function for M2 FIFO communication. This is found from parsing FIFO-related 02 static int M2_state = 1;//use “static” var for the next cycle keywords in the scheduling report. For example, the state when 03 ... 04 if(M2_state == 1){ //state conditional block for state 1 FIFO “f out” performs the write operation (line 7 of Fig.5) 05 ... //computation stmt & communication for state 1 op_Write.ap_fifo 06 M2_state = 2; //state transition for state 1 is found to be 6, because and “f out” 07 } keywords are detected in line 29 of Fig.6. Similarly, the FIFO 08 else if(M2_state == 2){ //state conditional block for state 2 09 ... //computation stmt & communication for state 2 read statement (line 5 of Fig.5) is assigned to state 2 from 10 M2_state = 7; //state transition for state 2 the scheduling report (not shown in the figure). 11 } 12 } //exit sim function after simulating one cycle Next, we need to ensure that only the FIFO communica- tion statements assigned to each FSM state are selectively Fig. 7. Simulation function structure for selective simulation of an FSM state simulated at every cycle. We declare an FSM state variable (M2 SIM is simulated at line 9 of Fig.9) (line 2 of Fig.7) for each module and copy statements to IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 7 the conditional block that correspond to its simulated state fifo rnum and fifo wnum variables to denote the number (state conditional block). An example can be found for the M2 of data and buffer spaces available in the FIFO. FAPIs in module in lines 4–7 (st = 1) and lines 8–11 (st = 2) of Fig.7. the source code are transformed based on the fourth column After the simulation function of a module has been called, only of TableII. An example is shown in Fig.8, which is the the statements for a single FSM state are simulated, and then transformed simulation code from M2 in Fig.5. Line 7 of the function exits. That is, a single clock event is simulated Fig.5 is transformed to: “fifo3 arr[fifo3 wptr++] = temp st6; by a function entrance and exit. fifo3 wnum−−;” (lines 11-12 of Fig.8). The code difference Since FLASH aims for cycle-accuracy of FIFO communica- between the blocking and the non-blocking FIFO communi- tion, the computation statements do not need to be evaluated cation is that the code for blocking communication is not cycle-accurately. For computation statements, we can assign simulated if the stall condition is satisfied. This is further an arbitrary state as long as it does not violate the timing explained in Section V-A4. causality with the cycle-known FIFO communication that has In addition to decreasing the number of buffer spaces dependency with the computation statement. We group the (fifo3 wnum−−) for FIFO write, we would need to increase computation statements to a few FSM states as much as the number of available data (fifo3 rnum++). But this process possible; if the statements are spread among multiple FSM is delayed until all other statements in the current cycle have states, the variables shared across the states may need to be been simulated. The reason is to match the Xilinx FIFO IP stored in cache or DRAM and loaded back after function exit behavior [33] of allowing data in FIFO to be available for and entrance. This is inefficient if the variable has a short life read, one cycle after it has been written. The details of this and could have been optimized to a CPU register. delayed processing is further provided in Section V-B2. For example, the computation statement in line 6 of Fig.5 4) FSM Stall Modeling: FLASH cycle accurately models has a dependency with both the FIFO read and the FIFO write the FSM stalls due to FIFO being full or empty. If a stall statements. It may be assigned to any state between 2 and 6 condition is met, none of the statements of current FSM without violating the time causality, but to reduce the number state should be simulated, and the simulation function should of FSM states with statements, it should be assigned to either exit. To achieve this, the stall condition is placed at the 2 or 6. We choose to assign it to state 2, following the as- beginning of a state conditional block. The simulation code soon-as-possible scheduling policy, as it tends to reduce the for the stall condition is “fifo rnum==0” for FIFO empty and number of variables being passed between the states. “fifo wnum==0” for FIFO full (TableII). These codes are 2) Cycle-Accurate FSM State: FLASH cycle-accurately also used for the stall conditions of the blocking reads and simulates the FSM state of a module at t (stt). By induction, writes. For example, the stall condition that corresponds to the stt is cycle-accurate if the initial state at t = 1 is known FIFO blocking write in line 7 of Fig.5 is “if(p1 en st6 && (stt=1 = 1) and the state transition ∆t matches the RTL fifo3 wnum==0)”. This condition has been added to line 5 of simulation at 1, 2, ..., t-1. ∆t matches the RTL simulation Fig.8. Also, the function return statement has been added to if the state transition information can be obtained from the line 6 of Fig.8. Note that we add an enable signal “p1 en st6” HLS tool report and a state transition statement that reflects to the stall condition of a pipelined loop because the FIFO this information is evaluated at t. Also, ∆t should be stalled if write occurs at FSM state 6 (more details in Section V-B1). empty or full signals have been asserted when the blocking FLASH can detect a deadlock by checking if a state reads or writes have been evaluated. transition did not occur (stalled) in all modules. It is enabled VHLS provides the state transition information in its by the source-level trigger directive that is explained in Sec- scheduling report. For example, the loop in module M2 in tion VIII-B. Fig.4 is evaluated in states 2 to 6, as shown in line 6 of It is worth noting that applying the classic event-driven Fig.6. The state transition of the loop is composed of intra- simulation approach (e.g., [35]) makes little difference in loop state transition (e.g., state 2 to 3, as shown in line 14 of the simulation speed of FLASH. The reason is that the stall Fig.6) and loop exit (e.g., state 2 to 7, as shown in line 13). condition is placed at the beginning of a state conditional block FLASH obtains this information and inserts the state transition and prevents most of the statements from being evaluated statement into the simulation code. For example, line 10 of when a module is stalled. That is, there is little overhead in Fig.7 reflects the loop exit state transition from state 2 to 7. processing a module without an event that requires simulation, The method used by FLASH to correctly simulate the state and this diminishes the benefit of applying the event-driven transition stalls will be discussed in Section V-A4. approach. VHLS schedules a module to start and finish its execution 5) Correctness of the Variable Reference: As explained in at a particular FSM state. Since FLASH cycle-accurately Section V-A1, all statements that have dependency with cstmt simulates the FSM state of a module, the estimation of a have been evaluated before the simulation of cstmt. The value module’s execution time is cycle-accurate. of variables written by statements with the same st as cstmt is 3) FIFO Behavior Modeling: In the FLASH simulation correctly supplied to cstmt because they are simulated in the code, the FIFO is implemented as a circular buffer with same state conditional block. A problem occurs when reading read/write pointers (fifo rptr and fifo wptr) and an array variables written by statements with FSM states other than st (fifo arr). The array length is set to FIFO buffer size because the simulation function exits after each cycle. This (FIFO SIZE) plus one because one buffer space is kept problem is solved using the static keyword in variable empty in circular buffer implementation [34]. Also, we declare declaration (e.g., line 2 of Fig.7 and line 2 of Fig.8). By IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 8 using this technique, the contents of the variables are restored It is important to note that the order of each pipeline stage and saved regardless of the simulation function entrance or conditional block has been reversed (st6, ... st3, st2). This exit. limits the value of enable signals to be copied only to the 6) Proof of FCCA Simulation: We can prove that FLASH is immediate next pipeline stage in simulation of a single cycle. an FCCA simulator by demonstrating that the FIFO input and Even if a same variable is used in different statements of output values of the FAPIs in the FLASH simulation match the the original source code, we cannot assume that they have the values of FIFO input and output ports in the RTL simulation same value if they have been assigned to different pipeline at all clock cycles. Due to the space limitation, only a brief stage conditional blocks. For example, suppose that line 6 of explanation of the proof will be provided—the details can be Fig.5 is performed at FSM state 2, and line 7 is performed at found in [32]. state 6. In a single cycle of the pipelined loop simulation, In FLASH, cstmt is simulated at the same cycle t as the “temp” of line 7 corresponds to loop iteration i, whereas RTL simulation because the simulated st of cstmt matches “temp” of line 6 corresponds to loop iteration i+4. Thus, they the HLS tool (Section V-A1) and st is simulated cycle- would have different values. accurately (Section V-A2 and Section V-A4). cstmt produces For correct simulation, we keep multiple copies of the same the correct value for the FIFO input value of the FAPIs if variable for each pipelined stage of a loop. The variables are FLASH supplies the variables referenced by cstmt (achieved copied through the pipeline like shift registers. For example, by Section V-A5) and the FIFO output value of the FAPIs that the “temp” variable is copied from loop pipeline stage 3 to match the RTL simulation. Correct FIFO output values of the stage 4 at line 18 of Fig.8. Variables “data” and “i” are FAPIs can be supplied at cycle t with accurate FIFO behavior not copied to the next pipelined stage after performing cycle- modeling (Section V-A3) and the FIFO input value of FAPIs based variable liveness analysis (explained in Section VI-A). that matches the RTL simulation at cycle 1, 2, ... t − 1. Then Similar to the enable signals, the content of pipelined variables we can prove that FLASH is an FCCA simulator by induction. is only copied to the immediate next state in a single cycle since the order of the pipeline stage conditional block has been B. Simulation of Parallelism reversed. Optimization of the pipelined variables is discussed 1) Pipelined Parallelism: At each cycle, all statements in a in SectionVI. pipelined loop should be simulated in a pipelined parallelism Because of the duplicated pipelined variables, the read- fashion. The number of FSM states to be simulated corre- ability of the simulation code can be reduced. In order to sponds to the loop iteration latency (IL, also called pipeline diminish this side effect, FLASH places the line number depth). If we simulate only a particular FSM state conditional of the original variable declaration in the source code as a block of a pipelined loop, it would not be possible simulate comment of the duplicated pipelined variable declaration in the this parallelism. simulation code. For example, the line number 6 of the variable To solve this problem, we would need to simulate all FSM declaration of “temp” in Fig.5 is written as a comment states of a pipelined loop. It is possible to make an exception of the duplicated pipelined variable declaration in line 2 of to the simulation structure by traversing through multiple state Fig.8. The original line numbers of the computation and conditional blocks in a single cycle for pipelined loops; but this communication statements are also placed at the comments would over-complicate the simulation structure. For a simpler of the simulation statements (e.g., lines 20-25 of Fig.8). solution, we choose to move all of the pipelined loop’s state 2) Task-Level Parallelism: As discussed in Section V-A1, conditional blocks into the conditional block of a single state. the statements in an FSM state are simulated by calling The reallocated conditional blocks are referred to as pipeline the simulation function of a module. Thus, the task-level stage conditional blocks. As shown in Fig.8, the contents of parallelism can be simulated by calling all simulation functions FSM states 2, 3, and 6 have been moved to pipeline stage in a round-robin fashion. This is processed in the module conditional blocks in lines 20-28, lines 15-19, and lines 9-13. simulation loop shown in lines 8-9 of Fig.9. If a pipelined loop L’s II (IIL) is larger than 1, FLASH As mentioned in Section V-A3, the update of the buffer makes IIL state conditional blocks for this loop. In this case, spaces and the number of available data is delayed until all the pipeline stage conditional blocks for state st are placed at modules in the current cycle have been simulated. The update state (stL + ((st − stL)%IIL)) conditional block, where stL (corresponding code is presented in [32]) is performed in the is L’s first FSM state. FIFO simulation loop (lines 10-11 of Fig.9). We introduce enable signal to decide if the statements inside The module simulation loop and the FIFO simulation loop each pipeline stage conditional block will be evaluated. For form the scheduler loop as shown in lines 6-14 of Fig.9. example, the FIFO write at lines 11-12 of Fig.8 is evaluated if the enable signal “p1 en st6” at line 9 is one. Using an enable signal allow us to selectively simulate statements in a C. Loop and Function Simulation pipelined loop’s prologue/epilogue and invalidate statements The loop initialization statement of loop L is simulated upon in a pipeline bubble (from the artificial deadlock avoidance initial entrance to L’s first FSM state (stL). If L is a pipelined transformation in Section III-B). The enable signal is also loop, the enable signal is set to 0 at stL (Section V-B1) when used to selectively simulate statements of a conditional block. L’s loop iterator of a new iteration does not satisfy the loop The value of enable signals is propagated through the pipeline condition. Since the loop condition for an iteration should be stages as shown in line 17. checked after the loop update statement has been evaluated, IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 9

the loop update is evaluated just before transitioning into stL. 01 static bool p1_en[5]; //enable signal array 02 static int temp[5]; //6 //pipelined variable array Recall that L has a number of state conditional blocks that 03 static int ptr_st2 = 4, ptr_st6 - 0; //pipe variable pointers matches II (IIL) of the loop (Section V-B1)—thus, the loop 04 ... 05 else if(M2_state == 2){ update (e.g., line 8 of Fig.5) is evaluated at the end of ( stL + 06 ... IIL − 1). 07 if( p1_en[ptr_st6] ==true ){ 08 p1_en[ptr_st6] =false; //disables enable signal after use Since the loop update is evaluated before the final state 09 fifo3_arr[fifo3_wptr++] - temp[ptr_st6]; //7 //read from of a loop, a dependency problem may occur. For example, 10 //pipelined variable array 11 fifo3_wnum--; suppose that we add a statement between line 7 and line 8 12 } of Fig.5 that is dependent on line 7 and references i. The 13 //conditional blocks for pipeline stages 3, 4, 5 are removed 14 loop index update statement in line 8 is scheduled to state 2 15 if( i_st2 < N ){ //2 ( stL = 2,IIL = 1). Assuming line 7 is scheduled to state 6 16 if( fifo1_rnum != 0 ){ //4 ∵ 17 p1_en[ptr_st2] =true; //enables later pipeline stages from the scheduling report, the new statement between lines 7 18 data_st2 = fifo1_arr[fifo1_rptr++]; //5 and 8 incorrectly references i that has already been updated to 19 fifo1_rnum--; 20 temp[ptr_st2] = data_st2*711; //6 //written to the next iteration. This is solved by copying i to a temporary 21 //pipelined variable array variable before evaluating the loop index update statement and 22 i_st2++; //8 23 } } renaming any reference of i that has the dependency problem 24 ptr_st2 - (ptr_st2 + 1) % 5; II pipelined variable to this temporary variable. 25 ptr_st6 - (ptr_st6 + 1) % 5; II pointers update 26 } } The state transition for pipelined loop exit occurs when the loop condition is not satisfied and all enable signals in the Fig. 10. The code after applying pointer-based variable access optimization pipeline have been invalidated. Simulation of statements inside to the initial code provided in Fig.8 a pipelined loop has been discussed in Section V-B1. The code transformation method of a flattened loop can be found in [22]. A function call is simulated by sending a module enable “temp” from cycles 2 to 6. As a result, only variable “temp” signal to the scheduler loop (Fig.9). Next, the function is copied through the pipeline stages. argument values are copied into the newly called module. B. Pointer-Based Variable Access VI.OPTIMIZATION OF PIPELINED LOOPS SIMULATION One of the problems of declaring a pipelined variable for Pipelined loops typically account for most of the execution each pipeline stage (as in Fig.8) is that the same value is I time of many FPGA designs. To simulate pipelined paral- copied repeatedly. Assuming a pipelined loop has iterations, V IL lelism, we need copies of variables that correspond to the variables, and iteration latency, the complexity of copy- O(I V IL) loop’s IL (Section V-B1). However, a naive implementation ing pipelined variables is × × . could lead to making redundant copies of the variables. This We propose an alternative method of copying the value of section discusses how to optimize this routine. The effect of a pipelined variable only once and changing the pointer to the optimization will be presented in Section IX-B. the pipelined variable. The modification to the initial code (Fig.8) is shown in Fig. 10. We first exploit the fact that the value of the pipelined variable is used in the immediate next A. Cycle-Based Variable Liveness Analysis pipeline stage—thus, the pipeline variable pointer for stage st The pipelined variables are only needed in the pipeline (ptrst) update can be simplified into (ptrst+1)%IL (lines 24– stages where the variables are being accessed. To ensure 25). Next, the pipeline variable pointer is shared among all this, we first perform variable liveness analysis [36] to find variables and enable signals in the same pipeline stage since the range of statements where each pipelined variable is all variables and enable signals are copied together to the next alive. Next, the FSM state of communication statements and pipeline stage if the loop pipeline has not been stalled. An the computation statements are obtained from the scheduling example is shown for the “temp” variable (line 20) and the report and the dependency analysis. From the FSM state “p1 en” enable signal (line 17). Note that this optimization information of statements, the statement liveness range of each has not been applied to variables “i” and “data”, because “i” variable is translated into a cycle liveness range. Based on and “data” are only used in pipeline stage 2 (Section VI-A). this cycle information, we place a limit on the pipeline stages Finally, we remove the pipeline stage conditional blocks that where each pipelined variable is copied. do not evaluate any statement (line 13—pipeline stages 3, 4, For the example in M2 of toy_mpath (Fig.5), we perform and 5 are removed), because variables and enable signals no liveness analysis on each variable and find that variable “data” longer need to be copied. is live in lines 5–6, variable “i” in line 8, and variable “temp” The variable access pattern has a similarity with the register in lines 6–7. Then we assign the states for communication rotation technique in IA-64 architecture [37]. Whereas IA- and computation statements in M2 of toy_mpath as was 64 used this technique to simplify the register allocation in shown in Section V-A1. That is, statements in lines 5, 6, software pipelining, we use the access pattern to reduce the and 8 of Fig.5 are assigned state 2, and the statement in number of variable copies among the simulated statements. line 7 is assigned state 6. Based on this information, the Since the data is copied only once, the complexity of the statement liveness range is converted into a cycle liveness pipelined variable copy is O(I×V). The pipelined variable range—variables “data” and “i” are live at cycle 2 and variable pointers are shared among all variables in the same pipeline IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 10

TABLE III DEBUGDIRECTIVESFOR FLASH

Syntax Description Target DEADLOCK Triggered at deadlock Module with dataflow pragma STALL Triggered when module or loop has been stalled Any module or loop MODULE DONE Triggered when a module completes its execution Any module Trigger-related FIFO FULL FIFO= Triggered at FIFO full condition Any FIFO FIFO EMPTY FIFO= Triggered at FIFO empty condition Any FIFO EQUAL VAR= VAL= Triggered when variable equals to value provided Any stmt with a variable reference GREATER VAR= VAL=<..> Triggered when variable is greater than value provided Any stmt with a variable reference DUMP VAR= FILE= Dumps variable data into the specified file Any stmt with a variable reference Data-related COMP VAR= FILE= Triggered when variable differs from golden data in file Any stmt with a variable reference TRIP COUNT Measures the loop trip count (e.g. for data-dependent loop) Any loop EXEC CYCLE Measures the number of execution cycles for module or loop Any module or loop STALL CYCLE Measures the number of stalled cycles for module or loop Any module or loop Perf-related FULL CYCLE Measures the number of cycles when FIFO was full Any FIFO EMPTY CYCLE Measures the number of cycles when FIFO was empty Any FIFO

Vivado HLS Prepro‐ Automated FCCA sim using VIII.SOURCE-LEVEL CORRECTNESS DEBUGGINGAND C design cessing sim file Transformed Vivado HLS SW sim PERFORMANCE DEBUGGING source code (w/ ROSE) generation code for simulation Scheduling info (w/ ROSE) Output FLASH provides an option of enabling various source-level User’s Perf + Input data debugging Vivado HLS debugging correctness and performance debugging features that will be directives synthesis result Output explained in this section.

Fig. 11. Overall simulation framework of FLASH A. Live Capture FPGA tools such as Xilinx’s ChipScope [40] or Intel’s SignalTap [41] capture the data in the FPGA and display stage—thus, the complexity of the pointer update appears to it to users for debugging. One of the problems with these be O(I×IL). However, as mentioned in Section V-A1, we configurable logic analyzers is that additional signals often group statements to only a few FSM states. The pointer update need to be inserted into the capture list to continue tracing is not performed on pipeline stages that do not evaluate any the source of a bug after the initial analysis. This requires statement (e.g., lines 24-25). Assuming there are C stages iterative adjustment of the signal capture list until the bug has with communication statements (CIL), the pointer update been isolated—but the bitstream generation for each analysis complexity is O(I×C). Thus, the overall pipeline variable often takes hours to finish. Many of the hardware-based complexity of the proposed method is O(I×(V+C)), which HLS debuggers described in SectionII also require a long is a large improvement over O(I×V×IL). turnaround time due to similar reasons. Software debuggers, on the other hand, do not require signals to be listed in advance. But due to the lack of cycle accuracy, putting a trigger (breakpoint) on a C source code VII.OVERALL FLOW does not allow the users to observe the signals at a particular cycle of interest. Another problem is that users have limited The overall simulation framework of FLASH is shown visibility. For example, local variables in a function different in Fig. 11. Given an input Vivado HLS (VHLS) C design from the trigger cannot be observed unless the user progresses source code, users specify optional debugging directives such to that function—by which time the content of many variables as module execution cycle measurement or deadlock triggering would have been changed. (to be explained in Section VIII-B). Then FLASH performs To solve these problems, we exploit the fact that FLASH a preprocessing step of adding labels to the source code stores the value of all variables (Section V-A5) and the fact that so that loops and functions can be easily identified. The FLASH runs on top of an established commercial tool, VHLS, transformation step uses the APIs in the ROSE [38] and the that provides software simulation debugging features. Upon Merlin [39] compilers. The transformed code is fed into the detection of a trigger condition (details in Section VIII-B), VHLS for synthesis. Based on the scheduling report given FLASH sets a debug stall flag. All modules are stalled upon by the HLS tool, the input code is automatically transformed the detection of this flag (implementation is similar to the for rapid FCCA simulation (SectionV and SectionVI). The pipelined loop stall modeling in Section V-A4—see step 2 simulation code has been made compatible with the VHLS line 1 of Fig. 12). When the simulation has been stalled software simulator for easy integration with the existing tool. for debugging, users can step into any function and observe It also allows us to utilize the VHLS’s debugging functionality any local variables of interest by adding a new variable into for FLASH’s debugging features (details in Section VIII-A). the VHLS debugger’s expression window (this is similar to As a final output, FLASH provides the total execution cycles the watch window of Microsoft Visual Studio—see step 4 of and other user-specified debugging results in addition to the Fig. 12). The variables to be captured no longer need to be output data that the design is expected to produce. predetermined. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 11

Step 4: User can step into any function and observe the value of any variable with the Vivado HLS debugger void toy_mpath( .. . ){ (but no progress made on simulation because all modules are stalled) Step 1: User places #pragma HLS dataflow ( )- Variables 00 Breakpoint ',', Registers IGtf Expressio~ ~ -' Modules source-level directive -t1#pragma FLASH-DEADLOCKI ., 1: tt- r x~ rrj · for trigger condition ------Ml ( ) ; M2 ( ) ; M3 ( ) ; ~-~------~---Expresston Type Step 5: User may resume simulation } --- (x)= test_HLSSIM_SFOOO_I OOO_fifol_RNUM I int I 0 by resetting debug_stall_flag to 0 in (x): test_HLSSIM_SFOOO_IOOO_fifol_RPTR I int j l I the HLS debugger ' (x)= HLSSIM_P002_EN004 Ichar 1 1 '\001' 1 • 1 1 ' (x)= HLSSIM_V003_temp_OOS int 711 I I 1 (..c)= Variables 00 Breakpoint 1 , Registers Grf Expression '£:3 Modules (x)= HLSSIM_V002_data_002 I int j 3 I ·t -ot: - r ~ f [ (x): state_HLSSIM ~ int j 2 I ' L Expression ------~---~--~Add new e xpreSSJ·on ~ ~ (x)= cycle

(x)= debug_stall_mod_cnt

Add new expression

Fig. 12. An example debugging session for deadlock detection using FLASH

The users can expect variables in the FIFO communication Table III. The functionality includes module execution and and the FSM state variables to match the RTL simulation at all stall cycle measurement, as well as FIFO full and empty cycle cycles (Section V-A). However, the timing when the variables measurement. in the computation statements match the RTL simulation may not be accurate. C. Large Data Debugging In order to pause at the trigger point, FLASH guides the Hardware-based HLS debuggers, such as [11]–[15], [17], users to place a breakpoint on the simulation code where the optimize the storage and transfer of variable data in an FPGA debug stall flag is set (step 3 of Fig. 12). The breakpoint is to be analyzed for correctness. However, the amount of traced detected by the VHLS debugger. To resume the simulation data is limited by the BRAM size and the DRAM bandwidth. after observation, the user can modify the value of the debug Being a software-based debugger, FLASH is not limited by stall flag to 0 using the expression window (step 5). the FPGA hardware resource restriction when performing such data-driven debugging—even for multiple variables. Ex- B. Source-Level Event Trigger and Performance Measurement amples include large data dump and large golden reference comparison—the user directives for these functions are shown In the Xilinx Chipscope [40], users are required to specify in the data-related row of Table III. signal names and their value for the tool to start capturing the data (trigger condition). Since the HLS tool applies several IX.EXPERIMENTAL RESULTS transformations in generating RTL file from a C source code, A. Experimental Setup manually identifying the correct trigger condition from an RTL file may be error-prone for novice users. To ease this process, For HLS synthesis, we use the Vivado HLS 2018.2 [2]. For many hardware-based HLS debuggers [11]–[15], [17] allow FPGA, we Xilinx’s Ultrascale KU060 [42]. The target users to specify variables to be traced or put breakpoints on clock frequency is 250MHz. The simulation is conducted with the source code; however, none of these abstract the trigger a server node that has an Intel Xeon Processor E5-2680v4 [43] condition of events such as deadlock or module/FIFO stall. and 64GB of DRAM. The simulation files have been optimized FLASH provides a set of source-level directives which can by the Vivado HLS software simulator. be specified by users to halt computation upon an event of The experiment is performed on toy_mpath (Fig.4) and interest. The list is given in the trigger-related row of Table III. several dataflow benchmarks: stencil [30], molecular dynamics The directive is always preceded by: #pragma FLASH simulation [18] (Fig.1), matrix multiplication [44], Cholesky . For example, the deadlock detection directive is decomposition [45], Needle-Wunsch [46], LU decomposition #pragma FLASH DEADLOCK (step 1 of Fig. 12). FLASH [47], and sparse matrix-vector multiplication [48]. The bench- automatically converts a directive into a stall condition that marks ([46], [47]) that were not originally designed to execute increments a debug variable that counts the number of stalled modules in parallel with FIFO communication have been modules (step 2). For the case of M2 in Fig.4, it is stalled if modified to incorporate this dataflow optimization. the FIFO is full (line 4 of step 2). If the directive for deadlock The FLASH simulation result is compared to that of Vivado detection is found in the source code, FLASH inserts a code HLS C and RTL simulation (), and Verilator 4.012 that increments the debug variable “debug stall mod cnt” simulation [29]. Since Vivado HLS RTL files contain core upon stall (line 5 of step 2). After simulating each cycle, library calls that cannot be processed by Verilator, we have FLASH checks to see if “debug stall mod cnt” matches the manually replaced them with a behavioral Verilog model. number of all modules in the design. If so, FLASH sets the debug stall flag that pauses the simulation (Section VIII-A). B. Simulation Time FLASH also supports directive-based performance measure- As mentioned in Section VII, preprocessing, HLS synthesis, ment. The list is given in the performance-related row of and simulation file generation steps are needed to prepare the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 12

TABLE IV TABLE VI SIMULATION PREPARATION TIME BREAKDOWN SIMULATION TIME COMPARISON AMONG VHLSC SIMULATION,VHLS RTL SIMULATION,VERILATOR, AND FLASH SIMULATION Benchmark Preproc HLS Synth SimFile Gen Total Toy_mpath 7.6s 25s 7.9s 40s VHLS VHLS Benchmark Verilator FLASH Stencil 19s 68s 30s 117s C Sim RTL Sim MD_sim 9.2s 38s 8.8s 56s Toy_mpath 0.765s 519s 120s 0.441s Mat_mul 8.7s 36s 12s 57s (1.00X) (678X) (157X) (0.576X) Cholesky 15s 98s 37s 150s Stencil 1.92s 101s 119s 1.12s NW 18s 99s 26s 143s (1.00X) (52.6X) (62.0X) (0.583X) LUD 7.1s 21s 8.9s 37s MD_sim 0.0652s 89s 7.3s 0.0565s SpMV 13s 78s 25s 116s (1.00X) (1,370X) (112X) (0.867X) Mat_mul 0.0680s 180s 29.2s 0.0716s (1.00X) (2,650X) (429X) (1.05X) TABLE V Cholesky 0.0124s 90s 27.7s 0.0530s SPEEDUP AFTER APPLYING OPTIMIZATIONS IN SECTION VI( CUMULATIVE (1.00X) (7,260X) (2,230X) (4.27X) SPEEDUPSHOWN) NW 0.136s 68s 27.1s 0.224s (1.00X) (500X) (199X) (1.65X) Avg pipe Var liveness Ptr var acc Benchmark Baseline LUD 0.0319s 129s 16.4s 0.0242s var depth (Section VI-A) (Section VI-B) (1.00X) (4,040X) (514X) (0.759X) 0.522s 0.483s 0.441s Toy_mpath 5.4 SpMV 0.0200s 117s 55.5s 0.0803s (1.00X) (1.08X) (1.18X) (1.00X) (5,850X) (2,780X) (4.0X) 2.43s 1.16s 1.12s Stencil 13 AVG (1.00X) (2,800X) (810X) (1.72X) (1.00X) (2.09X) (2.17X) 0.184s 0.0728s 0.0565s MD_sim 34 (1.00X) (2.53X) (3.26X) 0.0802s 0.0722s 0.0716s Mat_mul 7.8 (1.00X) (1.11X) (1.12X) the copy of pipeline variables and enable signals (this overhead 0.0617s 0.0600s 0.0530s Cholesky 8.7 was reduced by the optimizations in SectionVI as was shown (1.00X) (1.03X) (1.16X) 0.236s 0.224s 0.224s in TableV). However, it is interesting to note in that for some NW 3.2 (1.00X) (1.05X) (1.05X) benchmarks such as Toy_mpath and Stencil, FLASH is 0.0345s 0.0266s 0.0242s LUD 19 even faster than the VHLS C simulation (TableVI). This (1.00X) (1.30X) (1.43X) 0.0853s 0.0808s 0.0803s suggests that there is an unexpected factor which has negated SpMV 10 (1.00X) (1.06X) (1.06X) the simulation speed overhead of the proposed flow. We found AVG - (1.00X) (1.41X) (1.55X) that this is largely attributed to the fact that the VHLS C simulator can allocate an unlimited FIFO buffer (TableI). To model FIFO, the VHLS C simulator uses the C++ Standard files for the proposed simulation. The time breakdown of the Template Library (queue.h), which incurs the overhead of steps is presented in TableIV. dynamically allocating buffer and copying its content. For The effect of optimizations in SectionVI is shown in example, the C simulation time of Toy_mpath reduces from TableV. The baseline version uses the techniques introduced 0.765s to 0.128s if we replace FIFO library calls with fixed- in our earlier publication [22] and does not have cycle-based size arrays (array size is set to the number of total FIFO variable liveness analysis (Section VI-A) and pointer-based elements written). The FLASH simulation flow does not have variable access (Section VI-B) optimizations. The table shows this problem because the FIFO library calls have been replaced that the proposed optimizations result in 1.55X speedup on with array-based communication (Section V-A3). The average average. The speedup is greater for benchmarks that have a slowdown of FLASH compared to the VHLS C simulation is large (>12) averaged pipeline depth among all variables. The 1.72X. average speedup for Stencil, MD_sim, LUD is 2.28X, and the averaged speedup for the rest of the benchmarks is 1.12X. Compared to the RTL simulation, Verilator increases the This is because the proposed optimizations reduce copies of simulation speed by 3.45X (=2,800X/810X). However, as the variables in loop pipelines. mentioned in SectionII, the speedup is limited because it is As explained in Section V-A, FLASH uses the FSM state difficult to completely remove resource allocation and binding assignment information and the FSM state transition infor- information from the RTL file after they have been added. mation. The resource allocation / binding information and FLASH does not have this overhead, and as a result, FLASH the component library that exist in RTL code have been outperforms Verilator by two orders of magnitude while also abstracted in FLASH, and the computation statements are achieving the cycle accuracy. instead simulated natively on the host machine. The result Please note that in our initial research stage, we also of this abstraction can be checked in TableVI. FLASH is evaluated a similar code transformation flow that produces about 1,630X (=2,800/1.72) faster than the RTL simulation. a SystemC simulation file. However, the overhead in the This confirms our initial speculation that simulating based on SystemC simulation environment caused a 2-3X slowdown the scheduling information greatly accelerates the simulation compared to the proposed C-based flow, which motivated us to speed while solving the correctness problems. follow the current approach. Despite the slowdown, SystemC- Since our flow reflects the scheduling information, we can based approach may be more useful to some tool developers expect some slowdown compared to the VHLS C simulation. if compatibility with existing SystemC simulation frameworks The source of overhead includes the frequent FIFO stalls and has a higher priority. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 13

TABLE VII possible deadlock situations. Note that in order to increase the TOTAL EXECUTION CYCLES ESTIMATED BY VHLS SYNTHESIS REPORT readability of the code, we chose to generate the simulation file AND FLASH, AND THEIR ERROR RATE COMPARED TO THE RTL-SIMULATED RESULT by transforming it from the source code; but tool vendors may also choose to generate the simulation file from the LLVM IR Benchmark RTL sim Viv HLS syn rpt FLASH to exploit the LLVM optimizations. Toy_mpath 4,500,010 4,000,019 4,500,010 - (-11%) (0%) One limitation of FLASH is that it does not model the Stencil 524,309 524,299 524,309 stalls from the external memory access. We plan to incorporate - (˜0%) (0%) this functionality in the future to provide accurate perfor- MD_sim 12,089 10,524 12,089 - (-13%) (0%) mance estimation for wider range of benchmarks. Another Mat_mul 330,006 131,075 330,006 limitation is that FLASH serially simulates the pipelined/task- - (-60%) (0%) level parallelism. We plan to parallelize the implementation Cholesky 40,741 34,996 40,741 - (-14%) (0%) using Pthread/OpenMP so that large-scale simulation can be NW 245,725 131,112 245,725 performed by exploiting multicore architecture. - (-47%) (0%) Other future work includes allowing FLASH to provide a LUD 201,260 561,153 201,260 - (180%) (0%) quick performance estimation for design space exploration of SpMV 163,859 395M 163,859 input-dependent benchmarks. Moreover, we hope to incorpo- - (240K%) (0%) rate the Intel HLS flow if their tool’s synthesis report provides detailed scheduling information in the future. C. Accuracy As explained in SectionIV, the correctness problem is ACKNOWLEDGMENT solved by simulating FIFO communication in a cycle-accurate We are grateful to Xilinx for the generous software and manner. The data value and the data ordering has been verified hardware donation. We thank Seonmyeong Bak (Georgia by comparing the output of the FLASH simulator with that of Tech.), Professor Miryung Kim (UCLA), Chaosheng Shi (Xil- the VHLS RTL simulator. inx), and Professor Zhiru Zhang (Cornell Univ.) for many In Table VII we compare the cycle estimation accuracy helpful discussions and suggestions. We would also like to with the VHLS synthesis report after we manuallly specify express our gratitude to the anonymous reviewers for their the maximum loop bound in the source code. The estimation detailed comments and Marci Baun and Janice Wheeler for error rate is small for Stencil, because [30] has a built-in proofreading this paper. mechanism to allocate adequate buffers to avoid FIFO stalls. For the rest of the benchmarks, we have applied a small (1– REFERENCES 2) FIFO depth (e.g., Fig.4). This causes the FIFO buffer to be frequently full and empty and increases execution cycles. [1] J. Cong et al., “High-level synthesis for FPGAs: From prototyping to deployment,” IEEE Trans. Computer-Aided Design of Integrated Circuits Thus, the HLS synthesis report’s estimate is smaller than and Systems, vol. 30, no. 4, pp. 473–491, Apr. 2011. the RTL simulation result. For LUD and SpMV, on the other [2] Xilinx. (2018) Vivado High-level Synthesis (UG902). [Online]. hand, the VHLS tool provides a very large overestimate of Available: https://www.xilinx.com/ [3] Intel. (2019) Intel FPGA SDK for OpenCL Pro Edition. [Online]. the execution cycles. The reason is that these applications Available: https://www.intel.com/ have variable loop bounds, and VHLS generates the cycle [4] O. Segal et al., “Sparkcl: A unified programming framework for estimate based on the maximum possible loop bounds [24]. accelerators on heterogeneous clusters,” ArXiv Preprint, 2015. [Online]. Available: https://arxiv.org/abs/1505.01120 FLASH simulates FIFO stalls and loops with variable bounds [5] E. Sozzo et al., “A common backend for hardware acceleration on in a cycle-accurate fashion, and the estimated execution time FPGA,” in IEEE Int. Conf. Comput. Design, 2017, pp. 427–430. accurately matches that of RTL simulation. [6] C. Yu et al., “S2FA: an accelerator automation framework for hetero- geneous computing in datacenters,” in Proc. Ann. Design Automation Conf., 2018. X.CONCLUDING REMARKS [7] . (2019) ModelSim PE. [Online]. Available: https: //www.mentor.com/ With a new HLS software simulation flow based on the [8] Cadence. (2019) Incisive Enterprise Simulator. [Online]. Available: scheduling information, we were able to solve the correctness http://www.cadence.com issue and also provide accurate performance estimation. A [9] Synopsys. (2019) VCS Functional Verification Solution. [Online]. Available: https://www.synopsys.com/verification/simulation/vcs.html cycle-accurate simulation result was obtained three orders of [10] ——. (2019) ZeBu Fast Emulation. [Online]. Available: https: magnitude faster than from RTL simulation, because the new //www.synopsys.com/verification/emulation.html simulation flow is not slowed by allocation / binding informa- [11] J. Goeders and S. J. Wilton, “Effective FPGA debug for high-level synthesis generated circuits,” in IEEE Int. Conf. Field Programmable tion and component library. We have described an automated Logic and Appl., 2014. code generation flow that enables this new simulation flow. [12] ——, “Using dynamic signal-tracing to debug compiler-optimized HLS We hope that the promising results presented in this work circuits on FPGAs,” in IEEE Ann. Int. Symp. Field-Programmable Custom Computing Machines, 2015, pp. 127–134. will motivate the HLS commercial tool industry to provide [13] ——, “Signal-tracing techniques for in-system FPGA debugging of additional routines that simulate based on the scheduling in- high-level synthesis circuits,” IEEE Trans. Computer-Aided Design of formation only. This will substantially decrease the validation Integrated Circuits and Systems, vol. 36, no. 1, pp. 83–96, Jan. 2017. [14] J. S. Monson and B. L. Hutchings, “New approaches for in-system debug time of the customers who wish to rapidly estimate cycle- of behaviorally-synthesized FPGA circuits,” in IEEE Int. Conf. Field accurate performance, obtain correct output data, or detect Programmable Logic and Appl., 2014. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 1, NO. 1, JANUARY 2020 14

[15] ——, “Using source-level transformations to improve high-level synthe- [44] J. Cong and J. Wang, “PolySA: polyhedral-based systolic array auto sis debug and validation on FPGAs,” in Proc. ACM/SIGDA Int. Symp. compilation,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, Field-Programmable Gate Arrays, 2015, pp. 5–8. 2018. [16] Y. Choi and J. Cong, “HLScope: High-level performance debugging for [45] J. Liu and J. Cong, “Dataflow systolic array implementations of matrix FPGA designs,” in IEEE Ann. Int. Symp. Field-Programmable Custom decomposition using high level synthesis,” in Proc. ACM/SIGDA Int. Computing Machines, 2017, pp. 125–128. Symp. Field-Programmable Gate Arrays, 2019, pp. 187–187. [17] A. Verma et al., “Developing dynamic profiling and debugging support [46] B. Reagon et al., “Machsuite: Benchmarks for accelerator design and in OpenCL for FPGAs,” in Proc. Ann. Design Automation Conf., 2017, customized architectures,” in Proc. IEEE Int. Symp. Workload Charac- pp. 56–61. terization, 2014, pp. 110–119. [18] J. Cong, Z. Fang, H. Kianinejad, and P. Wei, “Revisiting FPGA [47] L. Pouchet. (2015) PolyBench/C. [Online]. Available: http://web.cse. acceleration of molecular dynamics simulation with dynamic data ohio-state.edu/∼pouchet.2/software/polybench/ flow behavior in high-level synthesis,” ArXiv Preprint, 2016. [Online]. [48] Y. Zhang et al., “FPGA vs. GPU for sparse matrix vector multiply,” in Available: https://arxiv.org/abs/1611.04474 IEEE Int. Conf. Field-Programmable Technology, 2009, pp. 255–262. [19] S. Dai, M. Tan, K. Hao, and Z. Zhang, “Flushing-enabled loop pipelining for high-level synthesis,” in Proc. Ann. Design Automation Conf., 2014. [20] S. Lahti, P. Sjovall,¨ and J. Vanne, “Are we there yet? A study on the Young-kyu Choi (M’10) is a postdoctoral scholar state of high-level synthesis,” IEEE Trans. Computer-Aided Design of in computer science at the University of California, Integrated Circuits and Systems, vol. 38, no. 5, pp. 898–911, May 2019. Los Angeles, where he received his Ph.D. degree [21] P. Coussy et al., “An introduction to high-level synthesis,” IEEE Design in 2019. He received his B.S. and M.S. degrees in & Test of Comput., vol. 26, no. 4, pp. 8–17, Jul. 2009. electrical engineering from Seoul National Univer- [22] Y. Chi, Y. Choi, J. Cong, and J. Wang, “Rapid cycle-accurate simu- sity. He developed TV receivers at LG Electronics lator for high-level synthesis,” in Proc. ACM/SIGDA Int. Symp. Field- from 2008 to 2011 and FPGA performance estima- Programmable Gate Arrays, 2019, pp. 178–183. tion tools at Falcon Computing Solutions in 2017. [23] A. Canis et al., “From software to accelerators with LegUp high-level His current research interests include performance synthesis,” in Proc. Int. Conf. Compilers, Architectures and Synthesis debugging, simulation, and optimization with FPGA for Embedded Systems, 2013, pp. 18–26. high-level synthesis tools. [24] Y. Choi, P. Zhang, P. Li, and J. Cong, “HLScope+: Fast and accurate performance estimation for FPGA HLS,” in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2017, pp. 691–698. [25] M. Nanjundappa et al., “SCGPSim: A fast SystemC simulator on GPUs,” Yuze Chi received his B.S. degree in electronic in Proc. Asia and South Pacific Design Automation Conf., 2010, pp. engineering from Tsinghua University and started 149–154. pursuing a Ph.D. degree in computer science in [26] M. Chung, J. Kim, and S. Ryu, “SimParallel: A high performance 2016. Yuze’s current research interests include soft- parallel SystemC simulator using hierarchical multi-threading,” in IEEE ware/hardware co-optimization and high-level pro- Int. Symp. Circuits and Systems, 2014, pp. 1472–1475. gramming infrastructure for heterogeneous systems. [27] T. Schmidt, G. Liu, and R. Domer,¨ “Exploiting thread and data level parallelism for ultimate parallel SystemC simulation,” in Proc. Ann. Design Automation Conf., 2017. [28] A. Mahapatra, Y. Liu, and B. C. Schafer, “Accelerating cycle-accurate system-level simulations through behavioral templates,” Integration, vol. 62, pp. 282–291, Jun. 2018. [29] W. Snyder. (2017) Verilator: Speedy Reference Models, Direct from Jie Wang received his B.S. degree from Tsinghua RTL. [Online]. Available: https://www.veripool.org/ University, Beijing, China, in 2015. He is currently [30] Y. Chi, J. Cong, P. Wei, and P. Zhou, “SODA : stencil with optimized pursuing the Ph.D. degree at the University of dataflow architecture,” in Proc. IEEE/ACM Int. Conf. Computer-Aided California, Los Angeles. His current research inter- Design, 2018. ests include domain-specific architecture design and [31] S. Kundu and I. F. Akyildiz, “Deadlock free buffer allocation in closed polyhedral compilation. queueing networks,” Queueing Systems, vol. 4, no. 1, pp. 47–56, Mar. 1989. [32] Y. Choi, “Performance debugging frameworks for high-level synthesis,” Ph.D. dissertation, University of California, Los Angeles, Oct. 2019. [33] Xilinx. (2017) FIFO Generator v13.2 (PG057). [Online]. Available: https://www.xilinx.com/ [34] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. Cambridge, MA: The MIT Press, 2005. Jason Cong (F’00) received his B.S. degree in computer science from Peking University in 1985, [35] L. Soule and T. Blank, “Parallel logic simulation on general purpose his M.S. and Ph. D. degrees in computer science machines,” in Proc. ACM/IEEE Design Automation Conf., 1988, pp. from the University of Illinois at Urbana-Champaign 166–171. in 1987 and 1990, respectively. Currently, he is a Optimizing compilers for modern architec- [36] R. Allen and K. Kennedy, Distinguished Chancellor’s Professor at the Com- tures: a dependence-based approach . San Francisco, CA: Morgan puter Science Department, also with joint appoint- Kaufmann, 2002. ment from the Electrical Engineering Department, of [37] Intel. (2000) Intel IA-64 Architecture Software Developer’s Manual . University of California, Los Angeles, the director [Online]. Available: https://www.intel.com/ of Center for Domain-Specific Computing (CDSC), [38] ROSE. (2019) ROSE compiler infrastructure. [Online]. Available: and the director of VLSI Architecture, Synthesis, http://rosecompiler.org/ and Technology (VAST) Laboratory. He served as the chair the UCLA [39] Falcon Computing Solutions. (2019) Merlin Compiler. [Online]. Computer Science Department from 2005 to 2008. Dr. Cong’s research inter- Available: https://www.falconcomputing.com/merlin-fpga-compiler/ ests include novel architectures and compilation for customizable computing, [40] Xilinx. (2012) ChipScope Pro Software and Cores (UG029). [Online]. synthesis of VLSI circuits and systems, and highly scalable algorithms. He Available: https://www.xilinx.com/ has close to 500 publications in these areas, including 15 best paper awards, 3 [41] Intel. (2019) Quartus Prime Pro Edition Handbook. [Online]. Available: Ten-Year Most Influential Paper Awards, and one inducted to the FPGA and https://www.intel.com/ Reconfigurable Computing Hall of Fame. He was elected to an IEEE Fellow [42] Xilinx. (2019) UltraScale architecture and product data sheet: overview in 2000, an ACM Fellow in 2008, and a member of the National Academy (DS890). [Online]. Available: https://www.xilinx.com/ of Engineering in 2017. [43] Intel. (2016) Intel Xeon Processor E5-2680 v4. [Online]. Available: www.intel.com/