Mind The (Synthesis) Gap: Examining Where Academic FPGA Tools Lag Behind Industry

Eddie Hung Department of Computing Imperial College London, England [email protected]

Abstract—Firstly, we present VTR-to-Bitstream v2.0, the latest and Torc [4], which not only provide abstraction layers for version of our open-source toolchain that takes input and manipulating netlists but also contain limited, non produces a packed, placed — and now routed — solution that can timing-driven CAD functionality (e.g. a basic router with be programmed onto the Xilinx commercial FPGA architecture. Secondly, we apply this updated tool to measure the gap between no conflict resolution). GoAhead [5] also exists to provide academic and industrial FPGA tools by examining the quality of additional support for dynamic partial reconfiguration tasks results at each of the three main compilation stages: synthesis, that are not available through the vendor toolchain. packing & placement, routing. Our findings indicate that the Bridging both of our contributions is Titan [6], a similar delay gap (according to Xilinx static timing analysis) for academic flow but for devices. By leveraging the vendor’s (closed- tools breaks down into a 31% degradation with synthesis, 10% with packing & placement, and 15% with routing. This leads source) Quartus II tool for front-end synthesis but continuing us to believe that opportunities for improvement exist not only to use VPR for back-end compilation, their hybrid flow enables within VPR, but also in the front-end tools that lie upstream. experimentation with large, complex, and mixed-language netlists that are not supported by the open-source Odin II [7]. I.INTRODUCTION Their flow is also used to make comparisons between academic FPGA research into architecture design, and on the computer- and commercial tools, finding that the VPR back-end alone aided design (CAD) algorithms to map circuit descriptions produces circuits that are 53% slower than Quartus II. Crucially, onto these architectures, has typically been evaluated using since Titan continues to route for an approximate model of the VPR academic tool that is now packaged with the Verilog- the commercial architecture, this also means their result is To-Routing (VTR) toolflow [1]. The open-source VTR flow not backed by an industrial static timing analysis tool, nor consists of three main tools: Odin II for synthesising circuits realisable on a physical FPGA device. designed in Verilog into generic lookup table (LUT) resources, The original version 1.0 of VTR-to-Bitstream [2] presented ABC for technology-mapping those resources into architecture- a Verilog to packed-and-placed netlist flow for Xilinx devices, specific LUTs, and VPR for packing, placement, and routing. used vendor tools to route the circuit, and found that the total For architecture design, a key strength of VPR is that it allows critical-path delay gap across the entire flow was 110%. As an researchers to easily experiment with new theoretical FPGA initial step towards closing this gap, in our second contribution architectures — for example, in deciding the optimal number we seek to identify how this total gap breaks down across the of inputs for a lookup table, or the best number of LUTs to individual CAD stages. group into a logic cluster. For CAD research, however, any new innovations can only be evaluated on those same theoretical II.XILINX ARCHITECTURAL MODEL architectures as opposed to commercially available devices due The Xilinx architectural model used in this paper is an to the closed-source nature of vendor tools. extension of that in prior work [2], with a number of changes First contribution: is to present VTR-to-Bitstream (VTB) to support hardened adder functionality and carry-chains. The v2.0, an open-source extension to the VTR toolchain that augmented cluster model is shown in Fig. 1. provides a path not just from Verilog to a packed, placed, and In contrast to the default VTR architecture which possess a routed netlist on a realistic FPGA architecture, but also onto a dedicated full-adder, devices from the Xilinx Virtex-6 family valid bitstream that can be programmed onto a physical Xilinx have only a 2:1 carry multiplexer (referred to as MUXCY in device. In contrast to the original version of VTB that only Xilinx literature) and a 2-input XOR sum gate (XORCY) in supported packing and placement [2], timing-driven routing is hard-logic, requiring the remaining functionality for a full adder now supported, along with carry-chains and the integration of (another 2-input XOR gate for deciding whether the carry-in a new Verilog synthesis tool. This extension is provided as a should be propagated, and an AND gate for generating a new patch available from http://eddiehung.github.io. carry) to be implemented in soft-logic [8]. Second contribution: is to use the extended toolflow to The cluster (CLB) model adopted in this work is made up make a fair and robust comparison between the quality of of two logic slices, shown in Figure 1. Within the Virtex-6 results (QoR) gained by academic and industrial offerings. architecture, a logic slice holds four basic logic elements (BLEs) By applying a varying combination of both flows to compile each of which contains a 6-input LUT that can be fractured bitstreams for the same FPGA device, we are able to infer into two 5-input LUTs (with all inputs shared), followed by the gap between the individual tools responsible for synthesis, two flip-flops. A bypass input (AX) can be used to reach either packing & placement, and routing. flip-flop, or to feed the XADDER carry-in directly without Related work: Relating to our first contribution of providing passing through the LUT. Three outputs exist for each BLE: open-source CAD for commercial FPGAs is RapidSmith [3] a combinational output (A) from the primary LUT output O6, COUT L=16 (bi-dir) N AQ L=4 {O6,O5,AX} FF A6:A1 Fracturable 6LUT x4 W E A6 XADDER MUXCY A O6 O6 5LUT XORCY S {O6,O5, AMUX 5LUT CIN3 COUT} O5 L=2 AX FF L=4 BLE x4 x4 CIN3 L=4 x4 BLE L=1 L=2 BLEBLE x3 L=2 SLICE x2 x7 x4 x4 x2 CIN x6 L=1 Fig. 1: Virtex-6 logic cluster model adopted in this work. x4 x4 x4 L=2 (XADDER represents our new hardened adder black-box). x4 FPGA Tile L=4 x4 L=16 (bi-dir) TABLE I: VTR-to-Bitstream’s Virtex-6 architectural support. x2 x5 x5 Supported: L=2 x4 x7 • Basic logic elements with 6-input LUTs, fracturable into two 5-input x4 x4 LUTs followed by two flip-flops and a carry-chain x4 L=4 • Block RAMs fracturable between a 36K or 18K memory, simple or true dual-port, and full aspect ratio support: from 1b×32K – 72b×512 Fig. 2: Inferred routing connectivity of a Virtex-6 tile. L is • DSPs only as a 25x18 combinational multiplier the manhattan distance that the wire traverses. Diagonal wires • Multiple global clock buffers (BUFG) for low skew clock distribution Not yet supported: with L = 2 represents a span from (x,y) to (x±1,y±1). • Logic slice clock enable, set/reset, nor wide multiplexers employing the MUXF7/F8 hardened resources Architecture .xml Verilog-to-Routing (VTR v7) • .rrg.gz Block RAMs as hardened FIFOs, nor distributed (LUT) memory Description VPR available in SLICEM resources • All other DSP48E1 functionality (such as 48-bit accumulation) .shim .blif .net .place

xdlrc2vpr VTB v1 [2] (one-off)

RapidSmith e a sequential output (AQ) and a multi-purpose output (AMUX) Torc t u

.xdl o shared between the secondary flip-flop, both LUT outputs O6 r route2xdl . & O5, and the carry out or sum from the adder. Each slice .inode2tw also contains one carry input and one output (CIN and COUT) VTR-to-Bitstream Torc v2.0 connected as a chain through all four BLEs. .xdl To commercial Table I details the main Virtex-6 features that our flow xdl2ncd DRC supports through leveraging VTR. Given that we are constrained flow by that which is supported by VTR, in future should any Fig. 3: Components of the VTR-to-Bitstream v2 flow. unsupported features be inferred into black-boxes, it would be straightforward to add support for those too. Although we target a Virtex-6 device in this work, with the correct specifications source code shows that the only requirement is a directed graph our approach can be extended to any other device from any where each node represents a resource containing its starting vendor — currently, we are constrained to those architectures and ending coordinates, its type (e.g. a source, output pin, supported by the XDL functionality inside the ISE toolflow; horizontal wire, etc.) as well as its timing cost. unfortunately, Xilinx has since deprecated XDL in their latest We generate this new routing graph by stitching the graph Vivado toolflow (targeting the UltraScale family and beyond). generated by VPR’s model to one extracted from the Xilinx device database (inclusive of LUT route-throughs) acquired III.EXTENDING VPRTO ROUTE FOR XILINX DEVICES from Torc. Our tool, xdlrc2vpr performs this task only The challenge with extending VPR to route for an exact once per architecture, where specifically it stitches each Xilinx Xilinx architecture is that VPR’s model can only generate output pin to be the only fanout from its corresponding VPR relatively simple routing networks: supporting only horizontal output pin. The equivalent task is also performed on all input and vertical (but not diagonal) wires, regular and symmetric pins, and all other resources are transformed into the VPR across both channels, throughout the device. This limitation coordinate space and assigned the appropriate timing cost. has been an obstacle in prior comparisons with industrial By stitching, we preserve the interface of the routing graph tools [6]. In the context of the commercial architecture that through which VPR interacts, whilst modifying its internals we are targeting, we find that the actual Xilinx Virtex-6 to represent the true graph that exists on silicon. Later, when routing network contains a number of features — in particular VPR has finished routing the circuit, we invoke our second diagonal and L-shaped bent wires — that VPR cannot currently tool route2xdl to translate the .route output from VPR model. Through querying Torc [4], we have inferred that the back into the programmable routing switches (PIPs) used by connectivity of each tile appears to be that shown in Figure 2. the Xilinx netlist, described in the Xilinx Design Language We add routing support by preprocessing the Xilinx routing (XDL) format. graph and importing it directly into VPR. This approach is Figure 3 illustrates how these two tools integrate into the made possible because VPR’s router is not tied to operating entire VTR-to-Bitstream v2.0 flow. This flow is an extension only with routing graphs generated internally: analysis of the of prior work: VTR-to-Bitstream v1.0 [2] (which itself uses Vanilla ISE Vanilla VTR Verilog HDL Architecture Xilinx ISE Yosys – Synthesis (new) Description VTR xst – Synthesis ABC – T.map .blif (e.g. cluster model, Odin II – Synthesis placement sites, .edif wire delays, etc.) .blif ngdbuild – Merge ngdbuild – Merge ABC – Tech. map .blif map – Pack & Place map – Pack & Place VPR – Pack & Place VPR – Pack & Place

VTR-to-Bitstream v2 VTB v1

xdl2ncd DRC xdl2ncd .ncd par – Route par – Route par – Route VPR – Route (new) VPR – Route (new) par .ncd VTR-to-Bitstream v2 VTR-to-Bitstream v2

xdl2ncd DRC xdl2ncd DRC Flow 2 Flow 3 Flow 5 Flow 1 .ncd Flow 4 Prior work [2] bitgen trce – STA Flows proposed by this paper .bit .twr Fig. 4: Industrial comparison experiments (where flows 2–5 contain new synthesis/routing support over prior work).

RapidSmith [3] to interact with the XDL format) that is tile grid. In all experiments using the Xilinx tools, we set responsible for converting the .blif netlist, the .net packing an aggressive 1ns clock constraint to encourage maximum result, and the .place result into an equivalent XDL netlist. CAD effort, matching behaviour in the academic flows. We This netlist would be similiar to the output from ISE’s map conduct experiments across multiple placement seeds using the tool. route2xdl then adds routing PIPs to this placed netlist, Imperial College HPC service [9], on servers with two generating an output similar to ISE’s par. Xeon X5650 processors and 48GB RAM. However, to ensure accurate runtime metrics, we ran each experiment on its own IV. METHODOLOGY exclusive server and measured using /usr/bin/time. Experiments were conducted using the five different flows For the academic flow, we use Verilog-to-Routing v7.0 [1], shown in Figure 4. Flow 1 describes vanilla ISE, in which along with Yosys v0.5 [10]. We evaluate across six benchmarks, a Verilog benchmark is compiled as normal entirely using consisting of the four largest benchmarks provided as part of industrial tools, primarily: xst for front-end synthesis, map VTR that fitted onto our FPGA (revisiting those reported in [2]) for packing and placement, and par for routing. Once a legal as well as des50, a smaller variant of the des90 benchmark circuit has been generated, static timing analysis and bitstream from Titan [6] that does fit on our target device, and our own generation can occur. AES x3 benchmark composed of a chain of three 128-bit AES Flows 2 to 4 describe an alternative flow for generating encoders [11] followed by three decoders. Xilinx circuits, with various tool stages replaced by academic Yosys front-end: Currently, a weakness of the academic VTR equivalents. We apply the same ABC script used by VTR across flow lies with its Verilog elaboration capabilities — the task of flows 2–5:“resyn; resyn2; if;”. For flow 2 only, we converting circuits described in this language into logic gates, also enable the -u switch in map which prevents unused flip-flops, and other resources — that is currently performed by logic from being recursively removed to allow for a fairer Odin II [7], which does not fully support all language constructs comparison with VPR, which has no such ability. such as those found in the des50 and AES x3 benchmarks. In flows 3 through 5, we use our VTR-to-Bitstream tool In flows 2–5, we replace Odin II with Yosys [10], which to translate VTR’s various output files into a fully placed has gained traction as part of Qflow: an open-source digital and/or routed XDL netlist. During conversion of the XDL synthesis flow for ASIC design. Yosys has extensive support format into Xilinx’s proprietary NCD format (necessary for for the synthesisable features of Verilog-2005, and is capable timing analysis and bitstream generation), the netlist must of outputting identical netlists in BLIF (for use in academic pass Xilinx’s design rule check (DRC) — which checks that tools) as well as industry-standard EDIF. Internally, Yosys calls all resources are legally placed, full connectivity from all net upon the same ABC tool as VTR for technology-mapping. sources to all net and component sinks exists, and that there are Timing model: Although Xilinx provide the complete routing no routing violations — thus providing us with confidence that resource graph in their XDLRC, no timing costs are given. By we are creating valid netlists. For these same flows involving analysing static timing analysis reports (which lists only the VPR, we apply the “--allow_unrelated_clustering total net delay), we were able to infer an approximate figure off” option to prevent maximum density packing, which was for each group of resources (e.g. all A6 LUT inputs, or all previously found to lead to unrouteable circuits [2]. length-2 wires, etc.) through linear regression. We target a mid-range Virtex-6 device (xc6vlx240t) found on the ML605 evaluation kit, using Xilinx ISE v14.4. This FPGA V. INDUSTRIAL COMPARISON RESULTS device is built on 40nm technology, and contains approximately Figure 5 compares the area utilisation, total tool runtime, 150K LUTs arranged into 37K logic slices. This logic, along and critical path delay of all six benchmark circuits (and their with RAM and DSP resources, are spread over a 102x240 geometric mean) across flows 1–5, when operating from the .–. ie lwrta o goen35) hsis This 3.5X). (geomean 1 flow than slower times 1.6–7.8 (which 1 flow vendor’s the over area 82%) (geomean 19–540% hnmvn rmflw o2 hr nytesynthesis the only where 2, to 1 flows from moving when a eotdb iixssai iigaayi tool analysis timing static Xilinx’s by reported (as e h olwn eut loso hti otiue othe to contributes it that show also results following the yet oorcam mn h snhss gap’. (synthesis) & the credence packing ‘mind lends for claim: result 15%, our and This to respectively. 10% routing to exists and opposed (31%) placement as gap changed, biggest is tool The 1. flows flow between baseline gap the delay over incremental 2–4 the illustrates 5e Figure graph. routing Xilinx imported attributed the be of a perhaps nature can irregular is and this the industrial there though to router, the addition, 4) between (flow In runtime academic performance. in difference circuit significant in annotated) gap (as largest budget runtime total a the consumes synthesis of that proportion is reducing obvious Immediately runtime. route, stages. routing to CAD total seconds the have 200 of (70%) each to majority up that the requiring latter nets consuming the sinks, fanout where this 40,000 high 5, that around two revealed & gap investigation 4 contained significant Further flows benchmark routing. a and for 3, mcml, VPR & for uses typical 2 recommended, reflect that flows between not not find exists do We is over- that usage. 1ns runtimes that excessive industrial noting to cause worth tools can concern is and Xilinx secondary it the a However, often constraining tools. is academic runtime for since surprising, not as decisions (either packing by netlist different logical respectively) made same formats the BLIF with or provided EDIF are 4, & 3 flows fully extra the an that consume uses seen of that be number circuits can the produces it as 4 circuit, such flow a area, open-source by the occupied merit. at slices of figures closely logic three more all looking in (#1) (#4) In flow flow academic commercial the the outperforms Unsurprisingly, input. Verilog same aty iue5 rvdsave ftema ici delay circuit mean the of view a provides 5d Figure Lastly, main three the over distribution runtime the shows 5c Figure between be can 4 flow that find we metrics, runtime the For c utm distribution. Runtime (c) xst Flow 1 map sisfotedsnhssto) lhuhflw2 and 2, flow Although tool). synthesis front-end its as

(13%) Flow 5 i.5 ult-fRslssmay(ema cos1 ed,nraie oflw1 ail ISE). vanilla 1: flow to normalised seeds, 10 across (geomean summary Quality-of-Results 5: Fig. n P euti ifrn lc utilisations. slice different in result VPR and (7%) proportion Synthesis (6%) a rautilisation. Area (a) (4%)

ODIN II failed

d rtclpt ea fo iixSTA Xilinx (from delay Critical-path (d) Flow 1

Flow 5 trce ODIN II failed Geomean whilst ) elkonVRflwta a nipoe eio front-end, Verilog improved an has that flow VTR well-known h nigta h a slreta h rn-n wihalso (which front-end the at largest is gap the that finding The iixVre- PA hscnrbto spoie sa second as Our provided http://eddiehung.github.io. is at for contribution download bitstreams This for valid patch FPGA. creating Virtex-6 for Xilinx support a back-end as well as C of YssOe Ytei ut, http://www.clifford.at/yosys/. Suite,” SYnthesis Open “Yosys [11] Wolf, C. [10] Xilinx. and EP/I020357/1) & EP/I012036/1 (grants EPSRC UK too. levels, opportunities upper that the but at VPR, on exist efforts as also their such stages) focus tools three researchers back-end the should improving across only runtime not routing. that of indicates amount for least 15% synthesis the and the consumes placement, at & 31% packing (as is for gap tool) delay STA 10% the commercial stage, that a finding main by flow, three backed compilation the the across comparison of tools piecewise stages industrial rigorous, and a academic make between to was contribution [9] [7] [8] [6] [5] [4] [3] [2] [1] nti ok epeetda pnsuc xeso othe to extension open-source an presented we work, this In A ODIN II failed CKNOWLEDGEMENTS FPGA’11 TRETS 12, e 2012. Feb v1.2),” lea Avne ytei okok”http://www.altera.co.uk/ Cookbook,” Service,” Synthesis literature/manual/stx “Advanced Computing Altera, Performance “High highperformancecomputing. College, http://www.imperial.ac.uk/ict/services/teachingandresearchservices/ (UG364 Guide Imperial User Block Logic Configurable FPGA “Virtex-6 in Xilinx, Research,” CAD for tool Jamieson CAD,” P. Commercial and Academic Between Gap the Exploring and Murray E. K. in Beckhoff C. Steiner N. in FPGAs,” Lavin C. in Devices,” Xilinx on Hung E. FPGAs,” for Luu J. Flow 1 EEFCCM’12 IEEE o.8 o ,p.1:–01,Mr2015. Mar 10:1–10:18, pp. 2, no. 8, vol. , tal. et

tal. et Flow 5 e 01 p 41–44. pp. 2011, Feb , trce tal. et tal. et VR70 etGnrto rhtcueadCDSystem CAD and Architecture Generation Next 7.0: “VTR , EEFPL’11 IEEE tal. et tal. et C TRETS ACM Ecpn h cdmcSnbx elzn P Circuits VPR Realizing Sandbox: Academic the “Escaping , tal. et Rpdmt:D-tYusl A ol o Xilinx for Tools CAD Do-It-Yourself “RapidSmith: , Tr:Twrsa pnSuc olFo, in Flow,” Tool Open-Source an Towards “Torc: ,

tool). ODIN II failed Geomean Oi I-A pnsuc eio D Synthesis HDL Verilog Open-source An - II “Odin , GAed ata eofiuainFramework,” Reconfiguration Partial A “GoAhead: , p 02 p 37–44. pp. 2012, Apr , Tmn rvnTtn nbigLreBenchmarks Large Enabling Titan: Driven “Timing , I C VI. okokpf uy2011. July cookbook.pdf, b oa runtime. Total (b) EEFCCM’13 IEEE R e 01 p 349–355. pp. 2011, Sep , o.7 o ,p.6163,Jl 2014. Jul. 6:1–6:30, pp. 2, no. 7, vol. , EFERENCES : EEFCCM’10 IEEE ONCLUSION rtflfrtespotfo the from support the for Grateful

ODIN II failed p 03 p 45–52. pp. 2013, Apr , +31% e ea gap. Delay (e) 00 p 149–156. pp. 2010, , +10% ODIN II failed Geomean +15% ACM ACM