Performed the Most Often. in FPGA Design Flow, Functional and Gate
Total Page:16
File Type:pdf, Size:1020Kb
performed the most often. In FPGA design flow, functional and gate-level timing simulation is typically performed when designers suspect that there might be a mismatch between RTL and functional or gate-level timing simulation results, which can lead to an incorrect design. The mismatch can be caused for several reasons discussed in more detail in Tip #59. Note that the nomenclature of simulation types is not consistent. The same name, for instance “gate-level simulation”, can have slightly different meaning in simulation flows of different FPGA vendors. The situation is even more confusing in ASIC simulation flows, which have many more different simulation types, such as transistor-level, and dynamic simulation. The following figure shows simulation types designers can perform during Xilinx FPGA synthesis and physical implementation process. Figure 1: Simulation types Xilinx FPGA designers can perform simulation after each level of design transformation from the original RTL to the bitstream. The following example is a 12-bit OR gate implemented in Verilog. module sim_types(input [11:0] user_in, output user_out); assign user_out = |user_in; endmodule XST post-synthesis simulation model is implemented using LUT6 and LUT2 primitives, which are parts of Xilinx UNISIMS RTL simulation library. wire out, out1_14; LUT6 #( .INIT ( 64'hFFFFFFFFFFFFFFFE )) out1 ( .I0(user_in[3]), .I1(user_in[2]), .I2(user_in[5]), .I3(user_in[4]), .I4(user_in[7]), .I5(user_in[6]), .O(out)); LUT6 #( .INIT ( 64'hFFFFFFFFFFFFFFFE )) out2 ( .I0(user_in[9]), .I1(user_in[8]), .I2(user_in[11]), .I3(user_in[10]), .I4(user_in[1]), .I5(user_in[0]), .O(out1_14)); LUT2 #( .INIT ( 4'hE )) out3 ( .I0(out), .I1(out1_14), .O(user_out) ); Post-synthesis simulation model can be generated using the following command: $ netgen -w -ofmt verilog -sim sim.ngc post_synthesis.v Post-translate simulation model is implemented using X_LUT6 and X_LUT2 primitives, which are parts of Xilinx SIMPRIMS simulation library. SIMPRIMS library is used for gate-level timing simulation. wire out; wire out1_14; X_LUT6 #( .INIT ( 64'hFFFFFFFFFFFFFFFE )) out1 ( .ADR0(user_in[3]), .ADR1(user_in[2]), .ADR2(user_in[5]), .ADR3(user_in[4]), .ADR4(user_in[7]), .ADR5(user_in[6]), .O(out) ); X_LUT6 #( .INIT ( 64'hFFFFFFFFFFFFFFFE )) out2 ( .ADR0(user_in[9]), .ADR1(user_in[8]), .ADR2(user_in[11]), .ADR3(user_in[10]), .ADR4(user_in[1]), .ADR5(user_in[0]), .O(out1_14)); X_LUT2 #( .INIT ( 4'hE )) out3 ( .ADR0(out), .ADR1(out1_14), .O(user_out) ); Post-translate simulation model can be generated using the following command: $ netgen -w -ofmt verilog -dir netgen/synth -sim sim.ngd post_translate.v Post-map simulation model is also implemented using X_LUT6 and X_LUT2 primitives. The difference from the post-translate model is that post-map model also includes placement information, and additional timing information in the Standard Delay Format (SDF). initial $sdf_annotate("netgen/map/sim_types_map.sdf"); wire out; wire out1_82; X_LUT6 #( .LOC ( "SLICE_X4Y2" ), .INIT ( 64'hFFFFFFFFFFFFFFFE )) out2 ( .ADR0(user_in_9_IBUF_0), .ADR1(user_in_8_IBUF_0), .ADR2(user_in_11_IBUF_0), .ADR3(user_in_10_IBUF_0), .ADR4(user_in_1_IBUF_0), .ADR5(user_in_0_IBUF_0), .O(out1_82)); X_LUT6 #( .LOC ( "SLICE_X4Y2" ), .INIT ( 64'hFFFFFFFFFFFF0000 )) out3 ( .ADR0(1'b1), .ADR1(1'b1), .ADR2(1'b1), .ADR3(1'b1), .ADR4(out), .ADR5(out1_82), .O(user_out_OBUF_78)); X_LUT6 #( .LOC ( "SLICE_X7Y2" ), .INIT ( 64'hFFFFFFFFFFFFFFFE )) out1 ( .ADR0(user_in_3_IBUF_0), .ADR1(user_in_2_IBUF_0), .ADR2(user_in_5_IBUF_0), .ADR3(user_in_4_IBUF_0), .ADR4(user_in_7_IBUF_0), .ADR5(user_in_6_IBUF_0), .O(out)); Post-map simulation model can be generated using the following command: $ netgen -s 3 -pcf sim_types.pcf -sdf_anno true -sdf_path "netgen/map" -w -dir netgen/map -ofmt verilog -sim sim.ncd post_map.v Post-par simulation model differs from the post-map in that post-par also includes all the routing information. Post-par simulation model can be generated using the following command: $ netgen -s 3 -pcf sim_types.pcf -sdf_anno true -sdf_path "netgen/par" -insert_pp_buffers true -w -dir netgen/par -ofmt verilog -sim sim.ncd post_par.v Post-par simulation model includes all the information to perform an accurate timing simulation. 58. Improving Simulation Performance FPGA design and verification engineers spend a significant part of their time performing functional simulations. For a large FPGA design, running a system level simulation can easily take several hours. If an engineer performs, on average, ten simulations a day that take thirty minutes each, then a 30% speed improvement will allow the performing of fourteen additional simulations of the same design. More simulation turns allow for the achieving of verification goals faster. In general, improving simulation performance involves the following types of changes: - choosing the right simulation type - performing profiling to identify simulation bottlenecks - optimizing device under test, testbench, and simulation models - reducing simulation database size - reducing the number of simulation events - enabling optimization options in simulation tools - reducing verbose and visibility level of a simulated design - following simulation tool best practices - selecting the right simulator, simulation flow, and environment Researching different simulation performance options and implementing them is a long-term effort. It is important to thoroughly document all the findings, establish a set of recommendations, and disseminate and actively enforce them across the design teams. Simulation type Choosing the right simulation type can have a dramatic effect on performance. The performance difference between gate level, RTL, and transaction level simulations is in one, or even two, orders of magnitude. Transaction level simulation is the fastest, because it is compiled into the smallest database and contains the least amount of simulation events. On the other side of the spectrum is gate level simulation, which contains timing, placement, and routing information of the final FPGA design. This point is discussed in more detail in Tip #57. Profiling Simulation profiling is the process of getting detailed information on how the simulation engine is utilizing the CPU. For the duration of the simulation run, profiling utility collects the information on how much time and how many CPU cycles were spent in different design units, or any HDL statement in the profiled code. Profiling results are used to identify design units or code sections that use the most CPU resources and subsequently optimize those sections to improve simulation performance. The profiling process is usually non-intrusive and doesn’t require modification of the source code. The profiling is controlled with compile and simulation options. Device under test code optimizations Designers can use the following methods to optimize device under test (DUT) code. Compiler directives, such as `ifdef SIMULATION or `ifndef SIMULATION, can be used to conditionally enable portions of the code for simulation or synthesis. For example, if the design has several interfaces, such as PCI Express, USB, and Ethernet, and the test case only exercises USB, there is no reason for PCI Express interface and Ethernet MAC code to participate in the simulation. Other candidates for including under conditional compilation are debug modules such as Xilinx Integrated Logic Analyzer (ILA) and Virtual IO (VIO) cores. The effort of implementing conditional compilation directives and tying off large modules is negligible compared with the time gained by decreased simulation runtime. When using conditional compile, it’s important to make sure to exclude files that are not part of the simulation from the filelist. That will also save some compile time. A simple change that significantly reduces simulation time is to shorten or completely bypass device or IP core initialization cycle during the simulation. For example, it might take several milliseconds of simulation time, which takes a very long time for a large design, to perform PCI Express link training, SerDes calibration, or memory controller initialization. Most of the IP cores provide a parameter to bypass or shorten the initialization. Alternatively, designers can use Verilog force statement to force internal signals to certain state. Another optimization is to reduce the period of timers in the design. Timers are used for different purposes; for example, measuring delays or sending periodic events. The circuit might not require simulating a timer with a long period if it produces equivalent results when the timer is using a much lower period. Making more complex code changes to a working design in order to improve simulation performance is not desirable. It might decrease code readability, cause the design not to meet timing constraints, or even introduce new bugs. Testbench optimizations One approach for testbench optimization is raising the level of abstraction. Transaction-level simulations that focus on cycle-based signals and transactions perform faster than event-based RTL that use transitions in time. SystemVerilog defines the clocking block, which is a key element in a cycle-based methodology. A testbench written in Verilog can take advantage of Programming Language Interface (PLI) and the Verilog Procedural Interface (VPI) routines. The PLI/VPI is an extension mechanism that allows functions written in other programming languages to access low- level design information. Carefully designed PLI/VPI extensions can significantly