NAND-NOR: a Compact, Fast, and Delay Balanced FPGA Logic Element

NAND-NOR: A Compact, Fast, and Delay Balanced FPGA Logic Element Zhihong Huangy Xing Weiy Grace Zgheibz Wei Liy Yu Liny Zhenghong Jiangy Kaihui Tuy Paolo Iennez Haigang Yangy yChinese Academy of Sciences Institute of Electronics, Beijing, China {huangzhihong, liw, linyu, kelv, yanghg}@mail.ie.ac.cn zEcole Polytechnique Fédérale de Lausanne (EPFL) School of Computer and Communication Sciences, 1015 Lausanne, Switzerland {grace.zgheib, paolo.ienne}@epfl.ch ABSTRACT Random Access Memory) FPGAs and have an island-style The And-Inverter Cone has been introduced as an alterna- architecture. In both commercial FPGAs as well as aca- tive logic element to the look-up table in FPGAs, since it demic research, the FPGA architecture relies mainly on the improves their performance and resource utilization. How- Look-up Tables (LUTs) as the main logic elements in the ever, further analysis of the AIC design showed that it suf- clusters [2]. A K-input LUT can implement any combina- fers from the delay discrepancy problem. Furthermore, the tional logic function with K inputs. However, the flexibility existing AIC cluster design is not properly optimized and of the LUTs comes at the expense of circuit area and delay. has some unnecessary logic that impedes its performance. Undeniably, the improvement of both the area efficiency and Thus, we propose in this work a more efficient logic element the performance of reconfigurable logic architectures has al- called NAND-NOR and a delay-balanced dual-phased mul- ways been one of the main targets of academic and industrial tiplexers for the input crossbar. Our simulations show that research [1, 6]. the NAND-NOR brings substantial reduction in delay dis- Inspired by modern synthesis tools [3], a new logic ele- crepancy with a 14% to 46% delay improvement when com- ment was recently proposed as an alternative logic element pared to AICs. And, along with the other modifications, for FPGAs [12]. Having a conic structure and composed it reduces the total cluster area by about 27%, when com- of multi-level configurable AND and inverter gates, this el- pared to the reference AIC cluster. Testing the new archi- ement is called And-Inverter Cone (AIC). When compared tecture on a large set of benchmarks shows an improvement with LUTs, AICs have many advantages: (1) the AIC has a of the delay-area product by about 44% and 21% for the high number of inputs and outputs, which allows it to imple- MCNC and VTR benchmarks, respectively, when compared ment larger, multi-output functions; (2) the AIC structure is to LUT-based cluster. This improvement reaches 31% and similar to the regular expression of Boolean algebra, which 19%, respectively, when compared to the AIC-based archi- allows it to satisfy the logic synthesis requirements and abide tecture. by its optimizations for an improved performance; (3) the AIC's area and delay increase linearly and logarithmically with the number of inputs, as opposed to the exponential 1. INTRODUCTION and linear increase of the LUTs, respectively; (4) intermedi- Since their first introduction in 1984, Field-Programmable ate results can be directly reused through the intermediate Gate Arrays (FPGAs) have increasingly become the main outputs (known as side outputs) of the AIC, reducing the computing power in various fields due to their flexibility logic duplication and improving the overall circuit area. and programmability. Having a short time-to-market, re- Taking all these advantages into consideration, the AIC configurable capabilities and fast programmability makes is presented as a promising alternative to LUTs in FPGAs. the FPGA a reliable computing device in modern digital Zgheib et al. [15] presented several AIC cluster architectures, systems [4, 11]. compared them with a state-of-the-art LUT-based FPGA Most current commercial FPGAs are SRAM-based (Static architecture, and concluded that both the AIC and LUT- based architectures have their respective advantages and disadvantages, depending on the application. A new depth- Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed constrained technology-mapping tool was also proposed [7] for profit or commercial advantage and that copies bear this notice and the full cita- to optimize the circuits for AICs and improve their results, tion on the first page. Copyrights for components of this work owned by others than especially in terms of used area. However, a potential issue ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior specific permission in the AIC design has been ignored in the existing research, and/or a fee. Request permissions from [email protected]. so far: the input-to-output delay of the AIC can vary a lot, FPGA ’17, February 22-24, 2017, Monterey, CA, USA depending on the cone configuration. This variation can c 2017 ACM. ISBN 978-1-4503-4354-1/17/02. $15.00 have a major impact and is of high importance, since, given DOI: http://dx.doi.org/10.1145/3020078.3021750 135 Enhanced AIC SRAM Cell Inputs i0 i1 i2 i3 i4 i5 i6 i7 Element (EAE) a b Level 1 B BBB Level 2 A A a b a b O0 O1 y y Level 3 A a O2 Basic AIC y Element (BAE) b Figure 1: A 3-level AIC (AIC3) with its two types of nodes: Enhanced AIC Element (EAE) and Basic AIC Element (BAE). Figure 2: The transistor-level implementation of a BAE. its multi-level conic structure, any delay variation that ex- Table 1: Delay of the AIC, with up to 6 levels, for both the ists in a single AIC node propagates from one level to the best and worst-case scenarios (ps). other and gets accumulated in the process before reaching Delay of IO Delay of IO the final output. The delay glitch caused by this discrepancy AIC Average path with path with ∆T σ in the cone affects the stability of the overall function and level delay may induce additional dynamic power consumption. no inversions all inversions In this paper, we aim at solving the AIC's delay discrep- AIC6 389.6 577.6 483.6 188 94 ancy problem and propose a more efficient logic element with AIC5 341.6 504.8 423.2 163.2 81.6 balanced delays, called NAND-NOR. We also present an im- AIC4 274.4 410.4 342.4 136 68 proved cluster architecture that reduces the area overhead AIC3 215.2 323.2 269.2 108 54 added with the introduction of AICs into the logic clusters. AIC2 148.32 228.8 188.56 80.48 40.24 The paper starts by presenting the original AIC structure AIC1 90.64 143.92 117.28 53.28 26.64 and the delay discrepancy problem, in Section 2. Then, Sec- tion 3 proposes the NAND-NOR logic element and analyzes it characteristics. We then introduce a DDM input crossbar and optimized NAND-NOR-based cluster architecture AIC cluster with a 40nm standard CMOS technology us- in Section 4. We compare the new architecture, with all its ing the Cadence Virtuoso platform in custom design flow. new features, against an LUT-based architecture [15], sim- The delay is measured using Spectre simulator, in typical ilar to Stratix-IV, and AIC-based architecture [15] in Sec- technology corner. To be able to compare our results with tion 5. Finally, we summarize our results and conclude in the latest work on AICs [15], we kept the same design en- Section 6. vironment and simulation parameters. Table 1 shows the simulated delay of up to a 6-level AIC, in the two extreme cases: (1) when no programmable inversion is used in any of 2. LIMITATIONS IN THE AIC DESIGN the levels and (2) when all the programmable inversions are The And-Inverter Cone is a multi-level binary tree of cells used in every level, between the input and output. It also where each cell is composed of an AND gate with pro- shows the difference in delay (∆T ) between the worst and grammable inversions [15], as shown in Figure 1. The AIC best-case scenarios, as well as the standard deviation. has two types of nodes: the Enhanced AIC Element (EAE) The results show that, as the number of levels increases, and the Basic AIC Element (BAE), where the EAE is basi- the delay difference between the two extreme cases increases. cally a BAE with programmable input inversions [13]. This is due to the accumulation effect mentioned earlier, Figure 2 shows the transistor implementation of the BAE which makes the delay discrepancy a critical problem for which can be configured as either an AND or NAND gate AICs. For the 6-level AIC, the delay difference can reach by selecting either the inverted or non-inverted output, re- 188ps, which means that the delay of the AIC increases by spectively. Despite their many advantages, AICs still have about 50% when all the inversions as added, as opposed limitations, some of which are already know [15]. However, to the case when none of the inversions is used. This delay in this section, we highlight the delay discrepancy problem difference is further aggravated in the case of cascaded multi- as well as the inefficiency in the AIC-based cluster design. level AICs. In combinational circuits, having different arrival times 2.1 Delay discrepancy problem of input signals that must transit simultaneously can re- Analysis of the AIC shows that the propagation delay of sult in signal competition and cause signal spikes known as a single node varies with the configuration of the AIC.

Load more