IEICE TRANS. INF. & SYST., VOL.E96–D, NO.8 AUGUST 2013 1602

PAPER Special Section on Reconfigurable Systems

FPGA Design Framework Combined with Commercial VLSI CAD

Qian ZHAO†a), Nonmember, Kazuki INOUE†, Student Member, Motoki AMAGASAKI†, Masahiro IIDA†, Morihiro KUGA†, and Toshinori SUEYOSHI†, Members

SUMMARY The most widely used open-source field programmable gate array (FPGA) placement and routing tool is the Versatile Packing, Placement and Routing (VPR) software developed at the University of Toronto, Canada. VPR calculates area and timing using target FPGA architecture and physical information. However, it cannot be used efficiently in FPGA IP design for two reasons. First, VPR cannot directly support most newly developed FPGA architectures, and modifying the C-coded VPR so that it can be used to evaluate a number of new architectures is time consuming. Second, the accuracy of the VPR performance results is inadequate for the evaluation of a complete FPGA IP in a design that targets the production of LSI. We propose an FPGA design framework that is focused on improving FPGA IP design efficiency. A novel FPGA routing tool, EasyRouter, is developed in this framework using the C# language. Because an object-oriented programming method is used, there is less source code and it is easier to manage than VPR, which shortens the development time. By using simple HDL code templates, EasyRouter can automatically generate the entire HDL code for a chip and the configuration bitstream. With these files, the FPGA IP can be evaluated with commercial VLSI CAD systems with high accuracy and reliability.
key words: FPGA, CAD, routing

Manuscript received November 9, 2012.
Manuscript revised March 8, 2013.
† The authors are with the Graduate School of Science and Technology, Kumamoto University, Kumamoto-shi, 860–8555 Japan.
a) E-mail: [email protected]
DOI: 10.1587/transinf.E96.D.1602

1. Introduction

Embedded systems play an increasingly important part in electronic products. In particular, system-on-a-chip (SoC) technology has developed rapidly. A variety of functions can be implemented by embedding various hard IP cores in a single silicon die. However, a new product must be fabricated with an entirely new mask. Even if only small changes are made to a product to improve functionality, a huge cost is incurred. Embedded field-programmable gate array (FPGA) IPs can be used to solve this problem because of their programmability after manufacture. Figure 1 shows an image of an SoC with embedded FPGA IP. The logic function of an FPGA IP can be changed rapidly by downloading a configuration bitstream of soft IPs. One SoC product can easily be used for different applications by implementing functions that need to be renewed frequently or customized on an embedded FPGA IP core. There is no need to fabricate a new chip for minor function changes, thereby saving the cost of the mask. We will focus on such FPGA IP design in this paper.

Fig. 1 SoC with embedded FPGA IP.

FPGA vendors have released their programmable SoC products [1], [2]. A powerful ARM-based processor and FPGA fabrics are integrated into one chip to reduce power, cost, and board size. However, the FPGA IP cores from these companies are not provided to other SoC designers. Menta is providing domain-specific synthesizable and hard macro eFPGA core IPs [3]. However, Menta's CAD tools are only designed for their commercial eFPGA IPs. Therefore, CAD tools and a design flow for FPGA IP research and design are necessary.

The Verilog-to-Routing (VTR) project is a well-known FPGA design flow [4] in academic research. The placement and routing (P&R) tool in this flow, the Versatile Packing, Placement and Routing (VPR) software developed at the University of Toronto, Canada [5], is directly related to the physical architecture of the FPGA. However, VPR cannot be used directly in FPGA IP design for two reasons. First, VPR cannot directly support most newly developed FPGA architectures such as the three-dimensional (3D) FPGA and hierarchical routing topology. Second, most FPGA IP designs need to be implemented with standard cells using the VLSI back-end design flow. The simple evaluation model built into VPR cannot provide acceptable accuracy for delays in FPGA IP design. Further, VPR does not provide any function that links the FPGA design flow with commercial VLSI CADs.

The contribution of this paper is to propose an FPGA design framework that specifically improves the design efficiency of FPGA IP for SoC. The FPGA IP produced by the proposed framework can be directly adopted in the SoC design flow as an IP core. In order to solve the two identified problems of VPR, we need to develop a simple and automatic FPGA IP design framework that combines FPGA design tools with commercial VLSI CADs.

The remainder of this paper is organized as follows. Section 2 introduces related FPGA design tools and design flows. The novel router tool EasyRouter is introduced in

Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
ZHAO et al.: FPGA DESIGN FRAMEWORK COMBINED WITH COMMERCIAL VLSI CAD 1603

Sect. 3. Section 4 describes the proposed FPGA IP design flow. In Sect. 5, we first compare the performance of EasyRouter with the conventional VPR and then discuss the evaluation method and evaluation results for the proposed flow. Finally we show the simplicity and expandability of EasyRouter with a 3D-FPGA case study. Conclusions are given in Sect. 6.

2. Related Work

2.1 FPGA Design CAD Tools

Xilinx ISE [6] and Altera Quartus [7] are commercial CAD tools used to implement circuits on their FPGAs. On the other hand, open source design flows like the VTR project [4] are used for academic FPGA research. The VTR project (Fig. 2) consists of the placement and routing tool VPR [5], the synthesis tool ODIN II [8], and the technology mapping tool ABC [9]. VPR [5] is the CAD tool that is directly related to the FPGA physical architecture.

Fig. 2 Flow in the VTR project [4].

Because VPR cannot be used for unsupported architectures, many other FPGA design frameworks have been developed for various devices. Grant et al. [10] employed a typical FPGA design flow together with a new placing, routing, and scheduling tool for their coarse-grained architecture. Ababei et al. [11] and Miyamoto et al. [12] proposed design flows for a 3D-FPGA. The authors of [11] developed their TPR on the basis of VPR 4.0, while those of [12] used a modified VPR for 3D-FPGA.

2.2 Issues of Traditional Design Flows

We now discuss the issues of VPR since it is directly related to the physical architecture of the FPGA. VPR 6.0 offers the combination of timing-driven packing, new architecture description file support, and other new features. However, it has two issues for FPGA IP design:

First, the description-file-based architecture definition method provides flexibility for various structures. However, the flexibility of the routing structure is still limited to the supported island style architectures. For much of our research, such as on a 3D-FPGA, we have to modify VPR to implement various routing architectures. It consumes considerable development time to master, modify, and debug the C-coded VPR.

Second, VPR is integrated with a simple delay model to facilitate timing-driven routing and post-routing timing analysis. The final timing report consists of the logic and routing delays, which are calculated in different ways. Therefore, although the relative values of VPR delay results can fairly evaluate FPGA architectures, the absolute values have low accuracy. For FPGA IP design, an accurate static timing analysis (STA) of the entire chip with a standard cell library is necessary.

3. EasyRouter

We propose a novel router, EasyRouter, to solve the issues discussed in Sect. 2.2. EasyRouter provides routing and reporting functions similar to those of VPR, with some improved features. First, we developed EasyRouter in the C# language on the .NET Framework with a fully object-oriented coding style. This reduced the amount of code and its complexity, making it easier to understand and modify. Owing to the open-source Mono runtime environment [13], EasyRouter can be executed in most operating systems. We then developed a script-based architecture definition mechanism by considering the code file itself to be the architecture definition file. The mechanism offers users maximum flexibility in implementing new architectures. Finally, we developed HDL code and bitstream generation methods to facilitate the evaluation of the designed FPGA using commercial VLSI CADs. In this section, we introduce the structure of the proposed EasyRouter. Post-routing device implementation and performance analysis are introduced in Sect. 4.

The block diagram of EasyRouter is shown in Fig. 3. EasyRouter consists of a routing-resource graph (RRGraph) building block, routing block, bitstream generation block, HDL codes generation block, and report generation block. The core blocks of a router are the RRGraph building block and the routing block. The routing block implements the conventional breadth-first path finder algorithm, which routes all nets according to a given RRGraph. The routing block is architecturally independent. Therefore, to make the router suitable for new architectures, we need to improve the RRGraph builder. We now describe each of the blocks in detail.

Fig. 3 EasyRouter block diagram.

3.1 RRGraph Building Block

The RRGraph describes the target FPGA architecture with routing resources (nodes) and their connection relationships [14]. We describe the RRGraph with a graph data structure. Each routing resource in the RRGraph is called an RRNode. The RRGraph is a collection of all necessary RRNodes. Figure 4 shows the mapping relationship between a simple FPGA tile structure and its RRGraph. The RRGraph is designed as an architecturally independent data structure. In1 and in2 are logic block input pins, out1 and out2 are logic block output pins, CHANX0 and CHANX1 are routing wires on the X direction channel, and CHANY0 and CHANY1 are routing wires on the Y direction channel. Each wire and logic block pin in the left physical tile structure becomes an RRNode in the right RRGraph. Often, the input pins and the output pins of a logic block each form a group of logically equivalent pins. This means that a router can complete a connection from the routing track to the interior of the logic block through any input pin. Similarly, any output pin can be used to complete a connection from the interior of the logic block to the routing track. In Fig. 4, in1 and in2 are logically equivalent, and out1 and out2 are logically equivalent. This logical equivalence in the RRGraph is often modeled by adding a virtual source (VSource) at which all nets begin, and virtual sink (VSink) nodes at which all net terminals end. This simple one-tile example is used to explain how a physical structure is mapped in the RRGraph. For a realistic FPGA, the RRGraph consists of the RRNodes of all the tiles.

Next, we compare the architecture definition methods of VPR and EasyRouter. VPR reads an architecture file to generate an RRGraph. The architecture-description-file based architecture definition method has its limits. When implementing a highly flexible file-based architecture definition, the RRGraph generation code becomes highly complex. Therefore, we abandoned the XML based architecture description file of VPR and used the C# script file itself as the architecture description file, which was designed to be more generic to implement various FPGA architectures. As Fig. 3 shows, the RRGraph building block reads the FPGA architecture script file to generate an RRGraph. The actual architecture-dependent code, such as architecture and physical parameters setup, netlist and placement files import, and RRGraph building, is implemented in the RRGraph generation script files. The architecture and physical parameters setup block sets the parameters of one FPGA architecture like the VPR architecture file does. A new FPGA architecture can be implemented by modifying the RRGraph building block. The architecture script only returns an architecturally independent RRGraph to the routing block. The dynamic script support is implemented with the Dynamic Language Runtime (DLR) of the .NET Framework. With this feature, the FPGA architecture to be evaluated by EasyRouter can be changed by switching the RRGraph generation script input file. Therefore, a new FPGA architecture can be implemented easily using EasyRouter. When evaluating many architectures, it is easy to switch between them without recompiling the main EasyRouter program.

3.2 Routing Block

EasyRouter implements a conventional breadth-first pathfinder routing algorithm [14], [15]. The timing-driven algorithm is not considered at this stage: it can improve the delay of the routing result when implementing customer circuits, but it is not necessary for FPGA scale exploration and implementation. We will provide a timing-driven router to improve the delay of the customer circuit implementation in the future. The routing algorithm and parameters are the same as those implemented by VPR. For example, we have implemented bounding box based routing area limitation and set the default bounding box factor to three [15]. We note that the most frequently executed code pertains to operations of the routing-resource heap (RRHeap). RRHeap is an implementation of a heap sort. In fact, for a given algorithm, the performance of a router is mainly decided by the efficiency of the heap sort operations. The performance comparison of RRHeap in EasyRouter and VPR will be given in Sect. 5.1.
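As an illustration of Sects. 3.1 and 3.2, the following Python sketch builds a one-tile RRGraph with the node names used for Fig. 4 and finds a path from VSource to VSink by always expanding the cheapest node taken from a min-heap. It is a minimal sketch under stated assumptions and not the EasyRouter implementation: the edge list is a hypothetical connectivity (the actual edges of Fig. 4 are not reproduced here), the names RR_EDGES and route are introduced only for this example, and the negotiated-congestion cost update and bounding-box limit of the full pathfinder algorithm are omitted.

```python
import heapq

# Hypothetical one-tile RRGraph: the node names follow Fig. 4, but the
# edges are an assumed connectivity chosen only for illustration.
RR_EDGES = {
    "VSource": ["out1", "out2"],
    "out1": ["CHANX0", "CHANY0"],
    "out2": ["CHANX1", "CHANY1"],
    "CHANX0": ["CHANY0", "in1"],
    "CHANX1": ["CHANY1", "in2"],
    "CHANY0": ["CHANX0", "in1"],
    "CHANY1": ["CHANX1", "in2"],
    "in1": ["VSink"],
    "in2": ["VSink"],
    "VSink": [],
}


def route(edges, source, sink, node_cost=None):
    """Return the cheapest node path from source to sink, or None.

    node_cost maps a node name to the cost of entering it (default 1.0).
    With uniform costs the expansion proceeds as a breadth-first wave.
    """
    node_cost = node_cost or {}
    heap = [(0.0, 0, source)]  # (path cost, insertion order, node)
    best = {source: 0.0}       # cheapest known cost to reach each node
    prev = {source: None}      # back-pointers for path reconstruction
    order = 1
    while heap:
        cost, _, node = heapq.heappop(heap)
        if node == sink:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        if cost > best[node]:
            continue  # stale heap entry, a cheaper route was found later
        for nxt in edges[node]:
            new_cost = cost + node_cost.get(nxt, 1.0)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, order, nxt))
                order += 1
    return None


if __name__ == "__main__":
    print(route(RR_EDGES, "VSource", "VSink"))
    # Raising the cost of one wire steers the route away from it.
    print(route(RR_EDGES, "VSource", "VSink", node_cost={"CHANX0": 10.0}))
```

Because VSource fans out to both output pins and both input pins lead to VSink, the search is free to use either logically equivalent pin. The heap push and pop operations in the loop correspond to the RRHeap operations that the text identifies as dominating the router's execution time.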

3.3 HDL Codes Generation Block and Bitstream Generation Block

As previously discussed, the FPGA IP design requires the developed device to be evaluated with great accuracy using commercial VLSI CAD tools. The key problem is how to efficiently link the academic FPGA design flow with the commercial VLSI CAD tools. In the conventional implementation method, all HDL codes are written manually after the architecture is determined. However, to our knowledge, if the architecture is to be changed, especially the routing part, a large number of HDL codes need to be modified and tested. On the other hand, an application configuration bitstream is also necessary to evaluate the target device. Researchers commonly write their own scripts to generate the logic and routing part bitstreams according to the netlist file and routing results. For FPGA IP designs, these steps are very time consuming and may need to be repeated frequently. Moreover, it is necessary to evaluate various architectures simultaneously. Therefore, the architecture exploration and evaluation flow should be efficient, fast, and automatic.

Fig. 4 Example of a routing-resource graph.

To solve this problem, we equipped EasyRouter with functions that generate all the FPGA HDL codes and the user circuit configuration bitstream, since the routing algorithm stores a large amount of architecture information that can be used to generate HDL codes and bitstreams. When EasyRouter operates in the evaluation mode, the channel width (CW) and array size, which are input parameters, are fixed. Using the netlist file, placement result file, HDL code templates, and architecture parameters, EasyRouter can generate all the FPGA HDL codes and an application bitstream. The HDL codes and bitstream consist of a logic part, routing part, and hard blocks. The HDL codes and bitstream of hard blocks (if required) need to be prepared manually since we discuss only the common parts of FPGA design in this study.

First we introduce HDL code generation. The logic part contains three levels of codes: the logic cell, basic logic element (BLE), and logic cluster (with a local connection block). For most FPGA architectures, these structures are homogeneous for all reconfigurable tiles. Therefore, the logic components of HDL codes can easily be prepared manually. The routing components of HDL codes are generated automatically with simple templates. The template consists of the structure of the switch box (SB), connection block (CB), and I/O block (IOB). The final routing HDL codes are generated according to the channel width and other routing parameters such as Fc_in, Fc_out, and Fs [14]. Routing resources and their connections can be generated automatically according to the information maintained in the RRGraph of the router.

Next, we discuss bitstream generation. The logic element bitstream consists of the logic cell lookup table (LUT) and the configuration memory bit of the output multiplexer. The output multiplexer selects the output of the BLE directly from the LUT or through a register [14]. The logic element bitstream is generated according to the netlist after technology mapping. The routing bitstream contains configuration memory values of the SB, CB, LCB, and IOB, which are generated according to the actual routing results.

3.4 Report Generation Block

The report generation block exports routed circuit information on the target device as the final execution stage of EasyRouter. The device array size, minimum channel width, the quantity of all routing resources, and the number of used routing resources are included in this exported report. These data are derived directly from a routed RRGraph, and are useful for device performance analysis.

Moreover, area and delay performance can be calculated from the physical information of one FPGA tile, because common FPGAs are composed of tiles of the same structure. We first lay out a tile structure with the VLSI design flow and obtain its area. The device area can be obtained from the product of the tile area and ArraySize × ArraySize. We then perform timing analysis using a simplified tile delay model. In the simplified tile delay model, we extract some representative paths such as SB to SB, Channel to LB, and BLE input to output, and set their delays to values according to tile STA results. The critical path and its delay are obtained from the timing analysis using the routed RRGraph and the delays of these representative paths. The area and delay performance analysis at this stage is less accurate. However, it is fast and has sufficient precision for architecture exploration. We will show this in Sect. 5.2. When evaluating large devices or special VLSI technology (such as 3D-VLSI) that cannot be implemented easily, this fast performance analysis method can be used.

4. Proposed FPGA IP Design Flow

Conventional FPGA architecture exploration and implementation processes involve two separate flows. The FPGA architecture is determined by academic FPGA design flows. However, in the implementation phase, commercial VLSI design flows are used, which gives rise to two problems. One is that the academic design flow cannot provide high accuracy area, delay and power estimates. The other is that if design defects are found in the VLSI design phase, then it is necessary to restart from the FPGA design flow and a large number of HDL codes need to be revised.

We propose an FPGA IP design flow that combines the FPGA and VLSI design flows to solve the above problems. The proposed FPGA IP design flow consists of three parts: the conventional FPGA design flow, the VLSI back-end design and analysis flow, and a novel HDL code and configuration bitstream generation tool, EasyRouter, which bridges the two flows. By employing the proposed IP design flow, architecture exploration can be performed with high accuracy and within a reasonable execution time.

4.1 FPGA Scale Exploration

Since the FPGA IP core has limited on-chip area, FPGA scale exploration is necessary. The objective of FPGA scale exploration is to find a rational FPGA tile array size and routing channel width by implementing application circuits.

FPGA scale exploration is performed with conventional academic FPGA design flows. Figure 2 shows a typical FPGA CAD flow for FPGA scale exploration with VTR [4]. The synthesis tool ODIN II reads and optimizes an HDL-described application circuit. The output of ODIN II is a Blif netlist, as it is the standard format used to pass circuit information between academic FPGA tools. Blif format circuits (e.g., MCNC benchmarks) can be directly inputted into ABC. The technology mapping tool ABC maps the netlist logic circuits into FPGA logic elements, which are typically k-input LUTs. In the case of VPR 6.0, the logic elements are first packed into clusters. The clustered logic blocks are then placed in an n × n tile array. Finally, we use EasyRouter to make the connections for the I/O pins of all logic circuits and I/O ports of the FPGA IP. Figure 5 shows how we link EasyRouter with VTR to perform FPGA scale exploration. Placement and routing are performed ten times for each circuit since different seeds (from 0 to 9) of the simulated annealing placement algorithm generate different placement solutions. The routing result for each circuit is the average of the results of ten placement seeds.

Fig. 5 Proposed framework: FPGA scale exploration.

During the FPGA scale exploration step, EasyRouter can find the minimum channel width for each circuit with the channel width (CW) exploring mode. The minimum channel width represents the least routing resources for a successful routing of target circuits. We then select the largest minimum channel width and array size of all circuits as the max channel width and max array size, which represent the minimum necessary FPGA scale to implement all circuits. Finally, in order to provide sufficient resources for customer applications, the target FPGA channel width and array size are set to 1.2 times the max channel width and max array size.

4.2 FPGA IP Design and Performance Analysis with Commercial VLSI CADs

After the architecture is determined, we run EasyRouter in the evaluation mode to generate the device HDL codes and each circuit's bitstream, as shown in Fig. 6. When all the FPGA HDL codes and an application bitstream are generated, we can start the back-end design with commercial VLSI design CAD tools. Back-end design flows differ according to the technique used and the researcher's design experience. However, in general, the steps shown in Fig. 6 are necessary. Details of our implementation method will be provided in Sect. 5.2.

Fig. 6 Proposed framework: Device implementation.

Some features of FPGA back-end design should be mentioned here. First, the common FPGA architecture consists of several repeated tiles and IOBs. The hierarchical design method is recommended. Second, a large number of combinational loops exist in the FPGA design. The timing loops will adversely affect the timing analysis for all CADs and should be broken. One method is to use the set_disable_timing command to break any potential loops; another is to import a bitstream with a set_case_analysis command.

4.3 Fast Performance Analysis with EasyRouter

The full back-end design of a large scale FPGA device is an intensely time consuming process. On the other hand, special VLSI process devices such as the 3D-FPGA cannot presently be implemented easily because of the lack of available CAD support and process technology. For these reasons, the evaluation flow presented in Fig. 6 is not applicable. Therefore, we developed a fast performance analysis function for EasyRouter to evaluate these devices.

Figure 7 shows the flow when using EasyRouter for fast performance analysis. When the target device architecture is determined with the method described in Sect. 4.1, we can make HDL code for one tile of the target device. We then implement the one tile HDL code with the VLSI design flow and obtain the physical information such as area and delays of representative paths, as shown in Fig. 7 (a). Finally, as shown in Fig. 7 (b), in the fast performance analysis mode with this physical information, EasyRouter generates the area reports and timing results.

Fig. 7 Proposed framework: Fast performance analysis.
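The sizing rule of Sect. 4.1 and the fast analysis of Sects. 3.4 and 4.3 reduce to simple arithmetic, which the following Python sketch makes concrete. It is an illustration under stated assumptions rather than EasyRouter code: the function names, the representative-path delay values, and the per-circuit channel widths are hypothetical, and rounding the 1.2 times margin up to the next integer is an assumption because the text does not state a rounding rule. The tile area and array size in the usage example are the values reported in Tables 1 and 2.

```python
# Hypothetical representative-path delays in ns. In the flow described
# above these values would come from the STA of a single laid-out tile.
REP_DELAY_NS = {
    "SB_to_SB": 0.30,
    "Channel_to_LB": 0.25,
    "BLE_in_to_out": 0.50,
}


def target_scale(minimum_values):
    """Apply the 1.2x margin to the largest per-circuit minimum.

    Integer arithmetic avoids floating-point surprises; rounding up is
    an assumption of this sketch.
    """
    largest = max(minimum_values)
    return (largest * 12 + 9) // 10


def device_area_um2(tile_area_um2, array_size):
    """Tile-array area: tile area times ArraySize x ArraySize."""
    return tile_area_um2 * array_size * array_size


def path_delay_ns(segments, rep_delay=REP_DELAY_NS):
    """Sum the representative delays of the segments along one path."""
    return sum(rep_delay[segment] for segment in segments)


def critical_path(routed_paths, rep_delay=REP_DELAY_NS):
    """Return (name, delay) of the slowest routed path."""
    delays = {
        name: path_delay_ns(segments, rep_delay)
        for name, segments in routed_paths.items()
    }
    name = max(delays, key=delays.get)
    return name, delays[name]


if __name__ == "__main__":
    # Hypothetical minimum channel widths of three benchmark circuits.
    print(target_scale([38, 42, 50]))   # 60
    # Tile area and array size as reported in Tables 1 and 2.
    print(device_area_um2(20736, 15))   # 4665600
    paths = {
        "net_a": ["BLE_in_to_out", "Channel_to_LB"],
        "net_b": ["BLE_in_to_out", "SB_to_SB", "SB_to_SB", "Channel_to_LB"],
    }
    print(critical_path(paths))
```

The area function covers the tile array only, so its result is not expected to equal a full-chip figure that also includes other blocks such as the IOBs.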

5. Evaluation

In this section, we first report on the performance of EasyRouter, which includes the execution time and minimum channel width for each benchmark. We then evaluate the proposed post-routing performance evaluation flow with a homogeneous FPGA case study. Finally, we show the expandability of EasyRouter with a 3D-FPGA case study.

Fig. 8 Normalized execution time of router.

5.1 EasyRouter Performance Evaluation

5.1.1 Evaluation Conditions

In this evaluation, we compared the execution time and minimum channel width of EasyRouter and VPR. The target FPGA architecture used was the conventional island style FPGA that is supported by VPR. The LUT size was four, and there were four logic elements in one logic block. The number of input pins in one logic block was 10 [14]. The SB structure was the Wilton type. We used unidirectional one segment wires for the routing tracks. The parameter Fc_in was set to 0.5, while Fc_out was set to 1.0.

The evaluation was performed on a machine with a Xeon X5470 CPU and 32 GB memory. The operating system was CentOS 5.8. VPR was compiled with gcc 4.1.2, and EasyRouter was run on Mono-2.10. To make fair comparisons, we used an RRGraph compare program to ensure that the RRGraph structures generated by EasyRouter and VPR were the same.

5.1.2 Evaluation Results

The most time-consuming function of a router is the heap sort. We tested the same heap sort algorithm in C and C#. The basic test operation involves adding the numbers from 0 to 999,999 to a min-heap and then deleting them from the top until the heap is empty. The basic test operation was repeated 30 times. Then we compared the execution time for the two implementations. The results showed that the C# implementation was around 5.0 times slower than the C implementation, because of the performance difference between the C# and C languages. This implies that when implementing a given routing algorithm, the C# program will be at least 5.0 times slower than the C program.

We evaluated the execution time of 17 benchmarks, as shown in Fig. 8. The results were normalized by the execution time of VPR. According to the results, EasyRouter was 8.4 times slower than VPR on average. However, for large circuits like frisk, pdc, and clma, EasyRouter was about 5 times slower. This is because for large circuits, the heap sort operations dominate the execution time to a greater extent. We examined the s298, alu4, and pdc circuits, and the CPU instruction sampling results showed that the heap function accounted for 65.8%, 76.1%, and 83.2% of the execution time, respectively. Therefore, for large circuits, the execution time overhead of EasyRouter was close to the performance difference between the C and C# implementations.

Figure 9 shows the minimum channel widths of EasyRouter and VPR. We can see that the routing performance of both tools was similar. One reason the channel widths differ in some circuits is that during the RRGraph searching step, the expansion order of RRNodes with the same cost value influences the routing results. However, the resulting difference in minimum channel width was only about two tracks (the minimum change step for a unidirectional routing architecture). Therefore, EasyRouter has a capability that is almost identical to that of VPR.

Fig. 9 Island style FPGA channel widths.

5.2 Post-Routing Performance Evaluation

In this section, we first introduce the target FPGA architecture we used for the post-routing performance evaluation and then describe the evaluation conditions and provide details of the implementation method. Finally, the evaluation results and the results of a comparison with the conventional VTR system are given.

5.2.1 Target FPGA Architecture

We employed a homogeneous FPGA architecture [16] for the evaluation, as shown in Fig. 10. In this device, all tiles have the same structure, unlike the island-style FPGA architecture [14], which is composed of several types of different tiles. Therefore, the homogeneous FPGA architecture can be easily produced and tested. We developed the P&R tool for the homogeneous FPGA using VPR 5.0. The details and performance of this architecture have been described in a previous paper [16].

Fig. 10 Homogeneous FPGA architecture.

As shown in Fig. 10 (a), the device used for the evaluation contains two types of block, the tile and the IOB. The tile consists of an SB, CB X, CB Y, and an LB. The IOB connects the IO pins to routing channels with programmable switches. The LB contains an LCB and a logic cluster which consists of four clustered BLEs with a 4-LUT cell. Other parameters of the target device are listed in Table 1.

Table 1 Target device for evaluation.

Item                                 Value
array size                           15 × 15
logic cluster                        4-LUT × 4
# of LB inputs                       10
SB                                   Wilton (Fs = 3)
CB                                   normal (Fc = 0.5)
# of routing tracks (single line)    50/channel
# of I/O pins                        120
# of configuration bits              110,250

5.2.2 Evaluation Conditions

The device was designed using e-Shuttle 65 nm CMOS technology. The functional simulation tool was ModelSim 6.5b. The design was synthesized with Design Compiler F-2011.09-SP2. The layout was performed using Cadence EDI system 10.13. We checked the gate level netlists outputted from the Design Compiler and EDI with Formality A-2008.03-SP3. Finally, the STA was performed with PrimeTime F-2011.12-SP1.

For the comparison, the area and delay physical parameters of VPR were derived in the same flow and technology process. A tile of the target FPGA was synthesized and laid out with the same back-end design flow. The tile area was derived from the GDS after layout. Delays within the LB were extracted with the STA. The wire RC model was analyzed with HSpice. All physical parameters were written into the architecture file in the VPR format.

5.2.3 Implementation Method

The common design flow is described in Sect. 4. Because FPGA IP designs have limited die size, we used a device array size of 15 × 15 to introduce the generation of HDL codes and bitstreams, and the post-routing evaluation methods. We selected six circuits from the 20 largest MCNC benchmarks to evaluate the target device, because they can be implemented with a target device of array size 15 × 15.

We implemented the target device in Verilog HDL. Figure 11 shows the hierarchy of the Verilog files. The source code management followed the device hierarchy shown in Fig. 10. The Device def file defined device parameters that were used by other files and defined the data width corresponding to a channel width, the configuration bit width of modules, etc. The CONF FFs provided a configuration module for all reconfigurable modules. The MUXes file defined all the multiplexers used in the design. The darker blocks were generated by EasyRouter with a simple template that enables the final device HDL codes to be generated with given channel width and array size parameters.

Fig. 11 Verilog file hierarchy.

We used a simple 2-bit full adder circuit to test the validity of the previous flow. A Verilog-coded 2-bit full adder was synthesized with ODIN-II, mapped with ABC, clustered and placed with VTR, and finally routed with EasyRouter. EasyRouter generates the device Verilog files and the bitstreams of the benchmarks. We then used ModelSim to perform functional simulation with the device Verilog files and the bitstream of the adder circuit. After we successfully obtained the expected simulation results, we passed the Verilog files and bitstream to the back-end design flow.

The first step was the synthesis. We employed the bottom-up hierarchical design method with the Synopsys Design Compiler. The tile and the IOB blocks were first compiled using a clock of 100 MHz, with the uniquify command. The gate level Verilog and the timing constraint sdc file were then created. The module FPGA was the top level design with instances of tiles and IOBs, without logic circuits. Therefore, we read tile.gate.v and IOB.gate.v and then exported FPGA.gate.v without the compile and uniquify commands.

The second step was the layout with the Cadence EDI system. We used a top-down hierarchical design for this. First, we read FPGA.gate.v and considered the tile and IOB as black boxes. We then created a partitioned floor plan. Subsequently, we implemented the layout of the tile, IOB and FPGA. Finally, we assembled all the hierarchies and derived the flattened FPGA level layout. Figure 12 shows the image of the device after layout. The entire FPGA Verilog and the parasitic RC .spef file could be exported at this time.

Fig. 12 FPGA IP layout.

Before the final timing analysis, we used formal verification to perform an equivalence check of the processed HDL codes against the original HDL codes. Because the top level module only assembled the tile and IOB modules, we used the Formality tool to verify only the tile and IOB modules. The verification was performed with the original RTL Verilog design, the synthesized design, and the design after layout. If the formal verification failed, the back-end implementation of the VLSI design flow should be reviewed as in the traditional application-specific integrated circuit (ASIC) flow.

After verification was successfully performed, we used PrimeTime to perform STA. Note that the bitstream generated from EasyRouter supported the format of the PrimeTime script.

[...] target device, which is presented in Table 2.

Table 2 Target device area information.

Item         Area (µm²)
Tile size    20,736
IOB size     2,116
FPGA size    9,000,000

Fig. 13 Max-Average-Min delay values of ten placement seeds.

Fig. 14 Delay results.

(2) Critical path delay

Figure 13 shows the max-average-min delay values of ten placement seeds. We can see that the delay results were stable when the placement seed changed. Figure 14 shows the critical path delay calculated by the flow of EasyRouter
All the values of registers and control signals with full FPGA VLSI back-end design and STA (Full FPGA were set with the set case analysis command. We reported STA), EasyRouter fast performance analysis (EasyRouter), the maximum delay timing between inputs, outputs, and reg- and VPR. We believe the critical path delay of the full isters. The maximum delay path and value were the critical FPGA STA was an accurate delay value because the eval- path and critical path delay, respectively. uation of commercial VLSI design flow with a standard cell library has the highest simulation accuracy in industry. 5.2.4 Evaluation Results The delay value accuracy calculated by VPR was 8.9 times lower than that obtained from the full FPGA STA on average. This was because the delay model of VPR was (1) Area pessimistic and had low accuracy. For example, all rout- The area calculation model of VPR multiplies the area of ing segment delays were calculated with the same wire RC one tile by the number of tiles in the array. With an accurate model. In an actual final layout, the placement was opti- tile area after layout, this module is reliable. Therefore, we mized and the physical delays were different. However, we only provided the physical area information of the designed can see that VPR correctly reflected the performance rela- IEICE TRANS. INF. & SYST., VOL.E96–D, NO.8 AUGUST 2013 1610

Fig. 15 Target 3D-FPGA architecture. tionship between the circuits. This shows the reliability of VPR as a fast architecture exploration tool. The result accuracy calculated by EasyRouter fast per- Fig. 16 Area result for 3D-FPGA. formance analysis was 1.7 times lower than that obtained from the full FPGA STA on average. This result showed that EasyRouter improved delay accuracy 5.1 times than VPR as the routing area. The tile area of the logic layer was the on average. This was because, although EasyRouter used sum of the logic and routing areas. The channel width of the a similar pessimistic model as VPR, all representative path routing layer was calculated by allocating approximately the delays were calculated with the high accuracy STA process. same routing area as the logic layer tile area. Therefore, the On the other hand, the routing delay and logic delay of VPR device structure was only determined by the channel width was calculated with different models. Therefore, we con- of the logic layer, whose minimum value was calculated by clude that the EasyRouter fast performance analysis method the EasyRouter CW exploring mode 4.1. is reliable for fast high accuracy device evaluation. By dividing routing resources into two layers, we achieved a smaller tile. A smaller tile means a higher logic 5.3 3D-FPGA Case Study density, shorter routing wire, and faster signal transporta- tion. Therefore, the routing performance could be improved. EasyRouter is designed to implement new FPGA architec- Moreover, the proposed 3D-FPGA was realistic, because the tures easily. In this section, we show the expandability of number of inter-layer connections within one tile was equal EasyRouter by evaluating a novel 3D-FPGA architecture to the number of input and output pins of the LB. Compared that was developed in previous work [17]. 
The new 3D- to conventional the 3D-FPGA based on the 3D-SB, which FPGA architecture script file was modified from conven- required four times the number of channel width inter-layer tional 2D-FPGA architecture script file by adding only few connections, the proposed architecture significantly reduced codes for vertical connections of 3D-VLSI technology. the requirement for inter-layer connections. We compared the area and critical path delay perfor- mance of the homogeneous 2D-FPGA [16] and the novel 5.3.2 Evaluation Conditions and Results 3D-FPGA. Because the proposed 3D-FPGA had only two layers, in this evaluation we employed the face-down The process technology and CAD tools used in this evalu- 3D stacking technique to avoid using through-silicon vias ation were the same as Sect. 5.2. We simply define the de- (TSVs). lay of one vertical connection between logic layer and rout- Figure 15 (a) and (b) shows the tile image and the de- ing layer as the same delay of one segment wire. We suc- tail of the proposed 3D routing architectures. The two lay- cessfully implemented the 3D-FPGA architecture on Easy- ers in the proposed 3D-FPGA were the logic and routing Router in a relatively short development time. The FPGA layers. The tiles on the logic layer had a LB and a small scale exploration was performed with the flow that we intro- part of the routing resources, while the tiles on the routing duced in Sect. 4.1. The performance analysis was performed layer had only routing resources. The tiles for the two lay- using the method that we described in Sect. 4.3. ers were designed within approximately the same area. Dif- Figure 16 shows the evaluation results for the area. We ferent from conventional 3D routing architectures with 3D- can see that the proposed 3D-FPGA used half the package SBs, we made the 3D connections on the input and output area of 2D-FPGA by allocating nets on two layers. This pins of the LB. 
The router chose one net to be routed on means the logic density had improved by about a factor of either the logic layer or the routing layer. two. The critical path delay also improved about 4% on average, as shown in Fig. 17. This is because the increased 5.3.1 Target 3D-FPGA Architecture channel width has better routability, and the smaller tile has shorter routing wire length. Next, we discuss the method to determine the channel width With this 3D-FPGA case study, we can say various ar- of two layers. First, the channel width of the logic layer was chitectures can be implemented on the EasyRouter frame- set to an initial value by the EasyRouter. Then the area of the work within a relatively short development time. High ac- CB and SB of the logic layer was calculated and determined curacy area and delay performance analysis can also be per- ZHAO et al.: FPGA DESIGN FRAMEWORK COMBINED WITH COMMERCIAL VLSI CAD 1611

[5] J. Luu, I. Kuon, P. Jamiseson, T. Campbell, A. Ye, M. Fang, and J. Rose, “VPR 5.0: FPGA CAD and architecture exploration tools with single-drive routing, heterogeneity and process scaling,” Proc. 2009 ACM/SIGDA International Symposium on Field Pro- grammable Gate Arrays, pp.133–142, Feb. 2009. [6] “Xilinx ISE design suite,” http://www.xilinx.com/products/ design-tools/ise-design-suite/ [7] “Altera quartus II software,” http://www.altera.com/products/ software/quartus-ii/about/qts-performance-productivity.html [8] P. Jamieson, K. Kent, F. Gharibian, and L. Shannon, “Odin II-An open-source verilog HDL synthesis tool for CAD research,” IEEE Annual International Symposium on Field programmable Custom Computing Machines, pp.149–156, May 2010. [9] Berkeley and Verification Group, “ABC: A system for sequential synthesis and verification,” http://www.eecs.berkeley. Fig. 17 Critical path delay result for 3D-FPGA. edu/˜alanmi/abc/, 2009. [10] D. Grant, C. Wang, and G.G.F. Lemieux, “A CAD framework for MALIBU: An FPGA with time-multiplexed coarse-grained ele- formed with the proposed framework. ments,” Proc. 2011 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp.77–86, Feb. 2011. 6. Conclusions [11] C. Ababei, H. Mogal, and K. Bazargan, “Three-dimensional place and route for FPGAs,” IEEE Trans. Comput. Aided Des. Integr. Cir- cuits Syst., vol.25, no.6, pp.1132–1140, June 2006. In this paper, we proposed a novel FPGA routing tool, Easy- [12] N. Miyamoto, Y. Matsumoto, H. Koike, T. Matsumura, K. Osada, Router, and an FPGA IP design flow that combines conven- Y. Nakagawa, and T. Ohmi, “Development of a CAD tool for 3D- tional FPGA design tools with VLSI CADs. EasyRouter fa- FPGAs,” Proc. 2010 3D Systems Integration Conference, pp.1–6, cilitates easy modeling of new FPGA architectures without Nov. 2010. any limitations, which can significantly shorten the devel- [13] “The mono project: Cross platform, opensource .NET Development // opment cycle. 
EasyRouter can also automatically generate Framework,” http: www.mono-project.com, 2012. [14] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep- device HDL codes and configuration bitstream files of the Submicron FPGAs, Kluwer Academic Publishers, 1999. implemented circuits that can be processed by VLSI CADs. [15] J.S. Swartz, V. Bets, and J. Rose, “A fast routability-driven router for With this design flow, accurate physical information as well FPGAs,” Proc. 1998 ACM/SIGDA Sixth Iternational Symposium on as STA can be reported on when a new FPGA IP architec- Field Programmable Gate Arrays, pp.140–149, Feb. 1998. ture is evaluated with reliable commercial VLSI CADs. For [16] K. Inoue, M. Koga, M. Iida, M. Amagasaki, Y. Ichida, M. Saji, J. Iida, and T. Sueyoshi, “An easily testable routing architecture and FPGA architectures that cannot be easily implemented with prototype chip,” IEICE Trans. Inf. & Syst., vol.E95-D, no.2, pp.303– present VLSI process, EasyRouter provides a fast perfor- 313, Feb. 2012. mance analysis flow, which improved delay accuracy 5.1 [17] Q. Zhao, Y. Iwai, M. Amagasaki, Y. Ichida, M. Saji, J. Iida, and times than VPR on average. We have also evaluated the T. Sueyoshi, “A novel reconfigurable logic device base on 3D stack proposed FPGA design flow with three different devices to technology,” Proc. 3D Systems Integration Conference, P-2-14, Feb. show its performance and expandability. 2012.

Acknowledgments

This work was supported by the VLSI Design and Educa- tion Center (VDEC) of the University of Tokyo in collab- oration with Synopsys, Inc., Cadence Design System, Inc., Qian Zhao received the B.E. degree in and , Inc. Control Engineering and Science from Qingdao University of Science and Technology, China, References in 2007. Further, he received the M.E. degree in Computer Science and Electrical Engineering [1] “Zynq All Programmable SoC Architecture,” 2012. from Kumamoto University in 2011. He is now http://www.xilinx.com/products/silicon-devices/soc/index.htm. a doctoral student at Kumamoto University. His [2] “SoC FPGAs: Integration to Reduce Power, Cost, and Board Size,” research interest is reconfigurable device archi- 2012. http://www.altera.com/devices/processor/soc-fpga/ tecture. proc-soc-fpga.html [3] “eFPGA Core IP: The embedded Field Programmable Gate Array IP,” 2012. http://www.menta.fr/down/ ProductBrief eFPGA Core.pdf [4] J. Rose, J. Luu, C.W. Yu, O. Densmore, J. Goeders, A. Somerville, K.B. Kent, P. Jamieson, and J. Anderson, “The VTR Project: Archi- tecture and CAD for FPGAs from verilog to routing,” Proc. 2012 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp.77–86, Feb. 2012. IEICE TRANS. INF. & SYST., VOL.E96–D, NO.8 AUGUST 2013 1612

Kazuki Inoue received the B.E. degree in Computer Science from Kumamoto University in 2008. Further, he received the M.E. degree in Computer Science and Electrical Engineering from Kumamoto University in 2010. He is now a doctoral student at Kumamoto University. His research interest is reconfigurable device architecture. He is a student member of IEEE.

Motoki Amagasaki received the B.E. and M.E. degrees in Control Engineering and Science from Kyushu Institute of Technology, Japan, in 2000 and 2002, respectively. He was a design engineer at NEC Micro Systems Co., Ltd. from 2002 to 2005. He received the D.E. degree from Kumamoto University, Japan, in 2007. He has been an assistant professor in the Department of Computer Science at Kumamoto University since October 2007. His research interests include reconfigurable systems and VLSI design. He is a member of IPSJ and IEEE Computer Society.

Masahiro Iida received his B.E. degree in Electronic Engineering from Tokyo Denki University in 1988. He was a research engineer at Mitsubishi Electric Engineering Co., Ltd. from 1988 to 2003. He received his M.E. degree in Computer Science from Kyushu Institute of Technology in 1997. Further, he received his D.E. degree from Kumamoto University, Japan, in 2002. He has been an associate professor in the Department of Computer Science at Kumamoto University since February 2003. During 2002-2005, he held an additional post as a researcher at PRESTO, Japan Science and Technology Corporation (JST). His current research interests include high-performance low-power computer architectures, reconfigurable computing systems, VLSI devices and design methodology. He is a member of the IPSJ.

Morihiro Kuga received his B.E. degree in electronics from Fukuoka University in 1987 and M.E. and D.Eng. degrees in information systems from Kyushu University in 1989 and 1992. From 1992 to 1998, he was a lecturer at the Center for Microelectronic Systems, Kyushu Institute of Technology. He has been an associate professor of computer science at Kumamoto University since 1998. His research interests include parallel processing, computer architecture, reconfigurable systems, and VLSI system design. He is a member of IPSJ.

Toshinori Sueyoshi received the B.E. and M.E. degrees in Computer Science and Communication Engineering from Kyushu University in 1976 and 1978, respectively. From 1978 to 1987, he was a research associate at Kyushu University, where he received the D.E. degree in 1986. From 1987 to 1989, he was an associate professor of Information System at Kyushu University. From 1989 to 1997, he was an associate professor of Artificial Intelligence at Kyushu Institute of Technology. Since 1997 he has been a professor of Computer Science at Kumamoto University. He is also a guest professor at Kyushu University. His primary research interests include computer architecture, reconfigurable systems, and parallel processing. He served as Chair of the Technical Committee on Reconfigurable Systems of the IEICE, Chair of the Technical Committee on Computer Systems of the IEICE, Chair of the IEEE Computer Society Fukuoka Chapter, and Director of the IPSJ Kyushu Chapter. Currently he also serves as Chair of the IEEE CAS Society Fukuoka Chapter and Chief Director of the FPGA Consortium Japan.
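Supplementary sketch. Section 5.2.3 states that the configuration values were fixed for STA with the PrimeTime set_case_analysis command and that the EasyRouter bitstream supported the PrimeTime script format. The paper does not give that format, so the following Python sketch only illustrates the general idea under stated assumptions: the bitstream is taken to be a flat string of '0' and '1' characters, and the pin path produced by pin_name is a hypothetical naming scheme invented for this example, not the naming used by EasyRouter.

```python
from typing import List


def pin_name(index: int) -> str:
    # Hypothetical hierarchical path of configuration flip-flop `index`.
    # The real device hierarchy and pin names are not given in the paper.
    return f"FPGA/conf_ff_{index}/Q"


def bitstream_to_case_analysis(bits: str) -> List[str]:
    """Return one set_case_analysis line per configuration bit.

    Each line pins a configuration output to a constant so that the timing
    analysis only considers paths enabled by the loaded configuration.
    """
    lines = []
    for index, bit in enumerate(bits):
        if bit not in ("0", "1"):
            raise ValueError(f"bit {index} is {bit!r}, expected '0' or '1'")
        lines.append(f"set_case_analysis {bit} [get_pins {pin_name(index)}]")
    return lines


if __name__ == "__main__":
    # Print constraints for a made-up 4-bit configuration.
    for line in bitstream_to_case_analysis("1011"):
        print(line)
```

For a device such as the one in Table 1, a generator of this kind would emit one line per configuration bit (110,250 lines), which could then be sourced before the timing report is requested.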