Highly Pipelined Asynchronous Fpgas

Highly Pipelined Asynchronous FPGAs John Teifel and Rajit Manohar Computer Systems Laboratory Cornell University Ithaca, NY 14853, U.S.A. teifel,rajit @csl.cornell.edu f g ABSTRACT While the first FPGA was introduced by Xilinx in 1985 We present the design of a high-performance, highly pipelined and the first asynchronous microprocessor was designed at asynchronous FPGA. We describe a very fine-grain pipelined Caltech in 1989 [16], very little work in the last two decades logic block and routing interconnect architecture, and show has successfully combined asynchronous and programmable how asynchronous logic can efficiently take advantage of this circuit technology. Previously proposed asynchronous FPGA large amount of pipelining. Our FPGA, which does not architectures [5, 9, 12, 18] based on clocked programma- use a clock to sequence computations, automatically \self- ble circuits were not efficient at prototyping pipelined asyn- pipelines" its logic without the designer needing to be ex- chronous logic and did not demonstrate significant advan- plicitly aware of all pipelining details. This property makes tages over clocked FPGAs. However, recent work by the our FPGA ideal for throughput-intensive applications and authors has developed programmable asynchronous circuits we require minimal place and route support to achieve good that are inherently pipelined, efficiently implement asyn- performance. Benchmark circuits taken from both the asynchronous logic, and are competitive with clocked circuits [21]. chronous and clocked design communities yield throughputs In this paper we design a high-performance FPGA archi- in the neighborhood of 300{400 MHz in a TSMC 0.25µm tecture that is optimized for asynchronous logic and features process and 500{700 MHz in a TSMC 0.18µm process. pipelined switch boxes, heterogeneous pipelined logic blocks, and pipelined early-out carry chains. The logic block con- tains a 4-input LUT and additional asynchronous-specific Categories and Subject Descriptors logic to efficiently implement asynchronous computations. B.7.1 [Integrated Circuit]: Types and Design Styles| The asynchronous FPGA interconnect is compatible with Gate Arrays, VLSI ; B.6.1 [Logic Design]: Design Styles| the VPR place and route tool [1], and we show that this Parallel Circuits FPGA architecture achieves high-throughput performance for automatically placed and routed asynchronous designs. General Terms The main benefits of our pipelined asynchronous FPGA include: Design Ease of pipelining: enables high-throughput logic Keywords • cores that are easily composable and reusable, where Asynchronous circuits, concurrency, correctness by construc- asynchronous handshakes between pipeline stages en- tion, pipelining, programmable logic. force correctness (not circuit delays or pipeline depths as in clocked circuits). 1. INTRODUCTION Event-driven energy consumption: automatic shut- We investigate the properties of pipelined FPGA architec- • down of unused circuits (perfect \clock gating") be- tures from the perspective of an asynchronous circuit imple- cause the parts of an asynchronous circuit that do not mentation. Asynchronous circuits, whose temporal behavior contribute to the computation being performed have is not synchronized to a global clock, allow logic computa- no switching activity. tions to proceed as concurrently as possible. Concurrent asynchronous logic, in the context of pipelined FPGAs, en- Robustness: automatically adaptive to delay varia- • ables a different architectural design point than what is pos- tions resulting from temperature fluctuations, supply sible in a more traditional clock-based design. voltage changes, and the imperfect physical manufac- turing of a chip, which are increasingly difficult to control in deep submicron technologies. Permission to make digital or hard copies of all or part of this work for Tool compatibility: able to use existing place and personal or classroom use is granted without fee provided that copies are • route CAD tools developed for clocked FPGAs. not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to This paper is organized as follows. Section 2 reviews asyn- republish, to post on servers or to redistribute to lists, requires prior specific chronous logic and pipelined circuits. In Sections 3, 4, and 5 permission and/or a fee. FPGA’04, February 22-24, 2004, Monterey, California, USA. we describe the architecture, logic block, and interconnect Copyright 2004 ACM 1-58113-829-6/04/0002 ...$5.00. of our pipelined asynchronous FPGA. Section 6 discusses L1 2 2 PipelinePipeline C R1 Ld Rd C Stage Stage WCHB L (sender) (receiver) e Re Pipeline (a) channel Stage C R0 Le Re dual-rail L data (a) 0 2 Cd PipelinePipeline Precharge Evaluation Circuit Stage Stage pc (sender) Ce (receiver) R1 R (b) enable 0 L1 pull-down data rails L0 computation C1 R PipelinePipeline pc V 2 2 C L R 0 d d Stage Stage LV PCHB (sender) Ce (receiver) Handshake Control Circuit Pipeline Re (c) enable pc C Stage RV Le C Le Re (b) LV Figure 1: Dual-rail four-phase asynchronous pipelines: (a) channel abstraction, (b) handshake Figure 2: Asynchronous pipeline stages: (a) weak- abstraction, and (c) signal level. condition half-buffer (WCHB) and (b) precharge half-buffer (PCHB). asynchronous logic synthesis, Section 7 presents experimen- tal results, and Section 8 reviews related work. We discuss future directions for asynchronous FPGA design in Section 9 stage is the time required to complete one four-phase hand- and conclude in Section 10. shake. The throughput, or the inverse of the cycle time, is the rate at which tokens travel through the pipeline. 2. ASYNCHRONOUS LOGIC 2.1 Pipelined Asynchronous Circuits The class of asynchronous circuits we consider in this pa- High-throughput, fine-grain pipelined circuits are critical per are quasi-delay-insensitive (QDI). QDI circuits are de- to efficiently implementing logic in an asynchronous FPGA signed to operate correctly under the assumption that gates architecture. Fine-grain pipelines contain only a small a- and wires have arbitrary finite delay, except for a small num- mount of logic (e.g., a 1-bit adder) and combine computa- ber of special wires known as isochronic forks [15] that can tion with data latching, removing the overhead of explicit safely be ignored for the circuits in this paper. Although we output registers. This pipeline style has been used in several size transistors to adjust circuit delays, this only affects the high-performance asynchronous designs, including a micro- performance of a circuit and not its correctness. processor [17]. We design asynchronous systems as a collection of concur- Figure 2 shows the two types of asynchronous pipelines rent hardware processes that communicate with each other that are used in our asynchronous FPGA.1 A weak-condition through message-passing channels. These messages consist half-buffer (WCHB) pipeline stage is the smaller of the two of atomic data items called tokens. Each process can send circuits and is useful for token buffering and token copying. and receive tokens to and from its environment through com- The half-buffer notation indicates that a handshake on the munication ports. Asynchronous pipelines are constructed receiving channel, L, cannot begin again until the handshake by connecting these ports to each other using channels, where on the transmitting channel, R, is finished [11]. A precharge each channel is allowed only one sender and one receiver. half-buffer (PCHB) pipeline stage has a precharge pull-down Since there is no clock in an asynchronous design, pro- stack that is optimized for performing fast token computa- cesses use handshake protocols to send and receive tokens tions. Since WCHB and PCHB pipeline stages have the on channels. Most of the channels in our FPGA use three same dual-rail channel interfaces, they can be composed to- wires, two data wires and one enable wire, to implement a gether and used in the same pipeline. four-phase handshake protocol (Figure 1). The data wires Asynchronous pipelines can be used in programmable logic encode bits using a dual-rail code, such that setting \wire- applications by adding a switched interconnect between its 0" transmits a \logic-0" and setting \wire-1" transmits a pipeline stages, which configures how their channels con- \logic-1". A dual-rail encoding is a specific example of a nect together. Figure 3 shows one such programmable in- 1ofN asynchronous signaling code that uses N wires to en- terconnect, where connection boxes and switch boxes are code N values, such that setting the nth wire encodes data used to connect channels between logic blocks. However, value n. The four-phase protocol operates as follows: the the throughput of an asynchronous pipeline is severely de- sender sets one of the data wires, the receiver latches the data and lowers the enable wire, the sender lowers all data 1The C-element is an asynchronous state-holding circuit wires, and finally the receiver raises the enable wire when it that goes high when all its inputs are high and goes low is ready to accept new data. The cycle time of a pipeline when all its inputs are low. graded if its channels are routed through a large number of Connection Nin Nout Box non-restoring switches. For example, a pipeline that con- SBSBSB Wout tains four switches between pipeline stages is 59% slower than its full-custom implementation [21]. Using SPICE we Logic LBLB measured only a 4% improvement on an asynchronous inter- Win Block connect when these non-restoring switches were interspersed SB SB SB with restoring switches. As a result, we instead designed a Switch pipelined interconnect architecture that uses pipelined switch LBLB boxes and guarantees at most two non-restoring switches be- Box tween pipeline stages. This ensures a high throughput asyn- SB SBSB chronous FPGA and minimizes the performance lost due to (a) (b) dual-rail channels using a switched interconnect.

Load more