ZUMA: an Open FPGA Overlay Architecture
Total Page:16
File Type:pdf, Size:1020Kb
2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines ZUMA: An Open FPGA Overlay Architecture Alexander Brant Guy G.F. Lemieux Dept. of ECE Dept. of ECE UBC UBC Vancouver, BC Vancouver, BC Email: [email protected] Email: [email protected] Abstract—This paper presents the ZUMA open FPGA the ZUMA architecture is less than one third of the generic overlay architecture. It is an open-source, cross-compatible design, and less than half of previous research attempts [4], embedded FPGA architecture that is intended to overlay on at as little as 40 host LUTs per ZUMA embedded LUT. top of an existing FPGA, in essence an ”FPGA-on-an-FPGA.” This approach has a number of benefits, including bitstream Throughout this paper, the term embedded LUT (eLUT) is compatibility between different vendors and parts, compati- used to differentiate between the LUTs of the host FPGA bility with open FPGA tool flows, and the ability to embed and the LUTs in the ZUMA architecture, and resource usage some programmable logic into systems on FPGAs without the per eLUT is used to compare densities. need for releasing or recompiling the master netlist. These The key contributions of this paper are: options can enhance design possibilities and improve designer productivity. Previous attempts to map an FPGA architecture • adoption of configurable LUTRAMs as the basis for into a commercial FPGA have had an area penalty of 100x implementing both programmable LUTs and routing at best [4]. Through careful architectural and implementation MUXs choices to exploit low-level elements of the host architecture, ZUMA reduces this penalty to as low as 40x. Using the VTR • design of a Clos-style Input Interconnect Block (IIB) (VPR6) academic tool flow, we have been able to compile the network for improved area efficiency of the internal entire MCNC benchmark suite to ZUMA. We invite authors crossbar of the cluster of other tool flows to target ZUMA. • design of a resource-efficient configuration controller Keywords -Field programmable gate arrays; Productivity; • architectural modeling to determine the most efficient Reconfigurable architectures; parameters for mapping to a given host architecture. I. INTRODUCTION A. Motivation FPGAs are used widely in research and industry, but The development and improvement of an FPGA-like FPGA devices themselves are predominantly designed by overlay architecture is motivated by a variety of applications a few large companies, and access to low-level details is and research needs, as well as the opportunity for improved often limited or simply unavailable. There have been calls designer productivity. ZUMA can help address these goals for a completely open and portable tool flow that allows by providing open access to all of the underlying details of true portability of designs across all FPGA devices offered an FPGA architecture, as an instrument for study and for by major vendors. Due to factors including the complex and implementation. proprietary nature of FPGA designs, vendors have been slow ZUMA can act as a compatibility layer, allowing interop- to respond. This paper presents the ZUMA embedded FPGA erability of designs and bitstreams between different vendors architecture. ZUMA is a free, open, and cross-compatible and parts, in an analogous manner to a virtual machine in a embedded FPGA (eFPGA) architecture that compiles onto computing environment. ZUMA can also be used to embed Xilinx and Altera host FPGAs. It is designed as an open small amounts of programmable logic into existing FPGA architecture for open CAD tools and user designs which based systems without relying upon the vendor’s underlying must be independent of the underlying device architecture. partial reconfiguration infrastructure. The embedded logic Through this approach, we hope to enable exploration of new will be customizable for specific tasks. More importantly, it programmable logic implementations in both commercial can be made portable across different FPGA vendors that and research applications. have different mechanisms for partial reconfiguration. This Initially a generic architecture was created in pure Verilog, also allows for sections of the design, such as glue logic, which was used as a baseline for the development of the to be reconfigured without going through an entire CAD ZUMA architecture tailored for implementation on modern iteration with vendor-specific tools. FPGAs. A CAD flow is in place for this architecture utilizing ZUMA is also a tool for FPGA research and education. the VTR (VPR 6) project [3]. The final resource usage of ZUMA’s design is intended as a prototyping vehicle for 978-0-7695-4699-5/12 $26.00 © 2012 IEEE 93 DOI 10.1109/FCCM.2012.25 LUT Mode Input Block k rd addr S-Block data in from k 2k data out configwr addr Config Bits k controller 2 we K- Decoder Two Stage LUT FF Crossbar Figure 2. LUTRAM used as LUT Network the unique resources available on a modern FPGA to mini- Logic Cluster mize the area overhead. The design has been implemented on both Xilinx and Altera FPGAs. Figure 1. ZUMA tile layout and logic cluster design (note: inputs are distributed on all sides in real design) A. LUTRAM Usage future programmable logic architectures. Although incom- The ZUMA architecture takes advantage of the repro- patible with some architectural features such as bidirectional grammability of LUTs in the host architectures to create routing, ZUMA is useful for testing features that go be- space efficient configurable logic and routing. In new gen- yond simple density or speed improvements and offer new erations of Altera and Xilinx FPGAs, logic LUTs can be functionality. Finally, the open nature of the design allows configured as distributed RAM blocks, called LUTRAMs. for an open tool flow from HDL-to-bitstream, amenable to Both vendors allow fully combinational read paths for these current FPGA CAD research. Though a compilation flow RAMs, permitting them to be used as normal k-input LUTs is in place with VTR, a new approach to ZUMA tools, eg. (figure 2). These are useful for directly implementing based on JHDL, Lava, Torc, OpenRC, or RapidSmith can the programmable eLUTs of the ZUMA architecture, but also be created. For example, one key application would be they can also be used to improve the implementation effi- ultra-lightweight place-and-route tools that can run on a soft ciency of the routing network. By limiting the LUTRAM processor implemented in the same FPGA as ZUMA (not configurations to a simple pass-through of one input, a implemented in ZUMA, although that would be possible as k-input LUTRAM becomes equivalent to a k:1 routing well). multiplexer. These MUXs will be the backbone of the global and local interconnection networks of the ZUMA II. GENERIC BASELINE ARCHITECTURE FPGA. These LUTRAM MUXs consume fewer resources than MUXs implemented with generic Verilog constructs, In designing the ZUMA FPGA architecture, we first as they require fewer host LUTs and configuration flip- created a purely generic ’vanilla’ LUT-based FPGA archi- flops due to the lack of configuration bits. As well, to save tecture supported by VPR. The initial generic architecture power by preventing unneeded switching when a routing design was created as a parameterized Verilog description. MUX is inactive, each MUX in the generic version needs to The design is similar to the standard architectures used in be able to be configured to output ground, which can also classical VPR experiments [5]. Since bidirectional routing increase resource usage, while the LUTRAMs can simply is not suitable to implement or emulate in modern FPGAs, be configured by setting all configuration bits to zero. The a unidirectional routing architecture is used instead. As difference in resource usage is illustrated in figure 3. unidirectional routing has only one driver per wire, the routing S block and output C block must be combined [2]. B. Clos Network Local Interconnect The generic logic cluster is comprised of a first stage depopulated input block for cluster inputs, followed by a The design of the internal connection block of the FPGA fully populated internal crossbar, which is connected to the cluster is driven by a need for area efficiency, flexibility, N K-LUTs. A total of I inputs are fed into the cluster by A. MUX B. Synthesized MUX C. Single LUTRAM MUX the input block, while all I inputs and N feedback signals a are available to any basic logic element (BLE) input pin. b 4 y LUT The BLEs are single K-LUTs, followed by a single flip-flop a 4 a b y b which can be bypassed using a MUX. This generic version c LUT c c y d 4 d contains configuration bits stored in shift registers. Resource d LUT usage is detailed in table I. 0 0101010101010101 000 000 III. ZUMA ARCHITECTURE equivalent LUTRAM select bits select bits config bits The implementation of the ZUMA FPGA takes the base- line architecture outlined in the previous section, and utilizes Figure 3. A: 4 to 1 MUX, B: 4-to-1 MUX synthesized from 4-LUTS, C: 4-to-1 MUX created one 4 input LUTRAM 94 11 11 1 1 Reduced Two Stage Network ZUMA I+N N*k eLUTs n m r r m n Inputs Outputs 11 11 1 11 11 1 1 K-LUT k k P N k n m r r m n 11 11 1 K-LUT k k P N k 11 11 1 1 n m r r m n 11 11 1 r n x m m r x r r m x n K-LUT crossbars crossbars crossbars k k P N k Figure 4. Clos network P k x k k P x N N k-input LUTRAMs LUTRAMs LUTs and CAD compatibility.