<<

mrFPGA: A Novel FPGA Architecture with Memristor-Based Reconfiguration

Jason Cong Bingjun Xiao Department of Computer Science University of California, Los Angeles {cong, xiao}@cs.ucla.edu

Abstract— In this paper, we introduce a novel FPGA architecture dominant part in FPGA) gains some improvement, the gap between with memristor-based reconfiguration (mrFPGA). The proposed FPGAs and ASICs will be significantly reduced. architecture is based on the existing CMOS-compatible memristor fabrication process. The programmable interconnects of mrFPGA Conventional FPGA suffers a lot from its programmable intercon- use only memristors and metal so that the interconnects can nects partly due to extensive use of SRAM-based programming bits, be fabricated over logic blocks, resulting in significant reduction of overall area and interconnect delay but without using a 3D die- multiplexers (or pass ), and buffers in the interconnects. stacking process. Using memristors to build up the interconnects can One SRAM cell contains as many as six transistors, but can store also provide capacitance shielding from unused routing paths and only one-bit data. The low density of SRAM-based storage increases reduce interconnect delay further. Moreover we propose an improved the area overhead of FPGA programmability and consequently, leads architecture that allows adaptive buffer insertion in interconnects to longer routing paths and larger interconnect delay. In addition to achieve more speedup. Compared to the fixed buffer pattern in conventional FPGAs, the positions of inserted buffers in mrFPGA SRAM is a type of – this means that it contributes are optimized on demand. A complete CAD flow is provided for to excessive power consumption during stand-by. mrFPGA, with an advanced P&R tool named mrVPR that was developed for mrFPGA. The tool can deal with the novel routing , especially emerging non-volatile memory structure of mrFPGA, the memristor shielding effect, and the (NVM) technologies, lead to opportunities of circuit improvement. algorithm for optimal buffer insertion. We evaluate the area, perfor- Emerging NVMs typically have a smaller cell size than that of mance and power consumption of mrFPGA based on the 20 largest SRAMs. They also have the desirable property of non-volatility, MCNC benchmark circuits. Results show that mrFPGA achieves 5.18x area savings, 2.28x speedup and 1.63x power savings. Further which means that they can be turned off during stand-by to save improvement is expected with combination of 3D technologies and power. Among them, memristor, also often recognized as resistive mrFPGA. random-access memory (RRAM), is considered to be the most Keywords-FPGA; memristor; ASIC; reconfiguration; promising one. Since HP Labs presented the first experimental realization of memristor in 2008 [8], rapid progress on the fabrication of high-quality memristor has been achieved in the past few years. I. INTRODUCTION The current leading memristor technology for the time being has The performance/power efficiency of an application implemented in outperformed many other emerging NVMs in some important aspects an application specific (ASIC) can be as much as and performs similar to them in the other aspects. It has been six orders of magnitude higher than its counterpart coded in CPU demonstrated that memristor is scalable below 30nm [9], can be [1]. However the rapid increase of non-recurring engineering cost fabricated with device size as small as F2 (F is the feature size) with and design cycle of an ASIC in nanometer technologies makes it full compatibility of CMOS [10], and can be programmed within impractical to implement most applications in ASIC. This makes 5ns at 180nm technology node [11]. Due to the advantages of its the Field Programmable Gate Array (FPGA) increasingly more device property over SRAM and other emerging NVMs, numerous popular. FPGA is a desirable type of integrated circuit that can be research efforts have been made to adopt memristors for system- reconfigured to realize a large range of arbitrary functions according level integration [12–15]. However almost all of these studies use to customer demands. The programmability makes FPGA a quick and memristors as a new type of memory. reusable hardware implementation platform for specific applications. This paper presents a novel FPGA architecture with memristor- Compared to ASIC, FPGA has sacrificed its performance to some based reconfiguration (mrFPGA). With the novel use of memristors extent for programmability [2] but FPGA can still have orders of in mrFPGA interconnects rather than memory cells, the proposed magnitude improvement in performance/power efficiency compared architecture reduces the gap between FPGAs and ASICs in area, to CPU [1, 3]. The flexibility and performance of FPGA compared to delay and power. ASIC and CPU makes FPGA an important component in the realm of customizable heterogeneous computing platforms [3]. This paper is organized as follows. Section II introduces the It is well known that the programmable routing structure in FPGA conventional FPGA architecture and review recent research work is the principal source of the performance inferiority of FPGA on FPGAs with emerging technologies. Section III presents the when compared to ASIC [4–7]. It is reported that the programmable unique structure of memristor and its electrical characteristics and interconnects in FPGA can account for up to 90% of the total area [4], circuit model. Section IV describes the architecture of mrFPGA up to 80% of the total delay [5, 6] and up to 85% of the total power from high-level overview to detailed design. Section V provides a consumption [7]. The study in [2] measures the gap between FPGA complete CAD flow and evaluation method for mrFPGA. Section VI and ASIC. It shows that the area, delay and power consumption of presents detailed evaluation results of mrFPGA using the largest FPGA are ∼21 times, ∼4 times and ∼12 times as much as those twenty MCNC benchmarks. Finally we draw some conclusions in of ASIC respectively. We see that if FPGA routing structure (the Section VII.

978-1-4577-0994-4/11/$26.00 c 2011 IEEE 1 II. BACKGROUND AND RELATED WORK 3D-FPGA      Memory A. Conventional FPGA      Layer      CMOS Layer   Fig. 1 shows a conventional FPGA architecture. It is made up of            

      SB CB SB LB LB LB       CB CB LB LB LB              SB CB SB LB LB LB  

(a) Monolithic stacking [17]. (b) 3D integration with TSVs [18].

Fig. 1: Conventional FPGA architecture. an array of tiles, and each tile consists of one logic block (LB), two connection blocks (CB) and one switch block (SB). Each logic block contains a cluster of basic logic elements (BLEs), typically look- (c) Replace SRAMs with emerg- (d) Nanowire crossbar [19]. up tables (LUTs), to provide customizable logic functions. LBs are ing NVMs [14]. connected to the routing channels through CBs and the segmented Fig. 2: Recent work on FPGA using emerging technologies. routing channels are connected with each other through SBs. Fig. 1 also depicts two typical circuits designs of CBs and SBs based on multiplexers (MUXs) and buffers described in [16]. The selector pins However one potential problem is that the maximum density of of each multiplexer are connected to a group of 6- SRAM TSVs currently achievable might fail to satisfy the dense connections cells to have their connectivity determined. The circuits in Fig. 1 required between the routing structure die and logic structure die. have copies of up to the number of pins per side of LBs in a single In [19], a 3D CMOS/Nanomaterial hybrid FPGA was proposed. CB, and up to the number of tracks per channel in a single SB. Again, it separates the transistors of interconnects from those of logic We see that CBs and SBs make up of the interconnects of FPGAs blocks and redistributes them into different dies as shown in Fig. 2d. with much larger area and higher complexity compared to the direct The difference is that it uses nanowire crossbars and face-to-face interconnects of ASIC. 3D integration technology to provide connections between the dies. It provides a 2.6x performance gain compared to the conventional B. Recent Work on FPGAs using Emerging Technologies FPGA architecture with certain power overhead brought by the large With the recent development of emerging technologies, a number capacitance of crossbar array. In addition the crossbar structure may of novel FPGA architectures based on these technologies have been consist of a certain percentage of defective components [19]. To sum- proposed in the past few years. A 3D FPGA architecture was pro- marize all, these works try to stack interconnects over logic blocks for posed in [17]. It partitions the transistors in the routing structure and the sake of reduction of area and interconnect delay. However they logic structure of conventional FPGA into multiple active layers and all suffer from a common problem. They still require transistors in stacks the layers via monolithic stacking, a 3D integration technology interconnects. To achieve the stacked architectures, these works have the author proposes to use as shown in Fig. 2a. It provides 1.7x to reorganize the interconnects and logic blocks into separate dies performance gain due to the smaller tile area and shorter interconnect and introduce 3D die-stacking (or less mature monolithic stacking). distance between tiles. However 3D stacking without proper thermal Otherwise the high temperature during fabrication of the transistors solution, will result in the excessive increase of heat density and the in an upper layer would ruin the existing transistors and metal wires corresponding degradation of performance [19]. Also applications below in the same die. Die-stacking introduces undesirable problems of monolithic stacking are currently limited due to the reason that in terms of manufacturability, reliability and limit of the vertical high temperature required by the fabrication of transistors in an integration density. Some researchers replace the pass transistors upper layer is likely to destroy the metals wires and transistors in the routing structure of FPGA with emerging memory elements which have been fabricated in the layers below [18]. In [14] and integrated in metal layers with back-end-of-line (BEOL) compatible [18] two FPGA architectures are proposed using emerging NVMs fabrication [20, 21]. But the improvement is limited due to the small and through-silicon-via (TSV) based 3D integration. Similar to the portion of pass transistors in the routing structure. There are also monolithic stacking in [17], they also move all the transistors in the studies on new reconfigurable circuit structures, such as CMOL [22] routing structure of FPGA to the top die over the logic structure and CNTFET-based fine-grain cell matrices [23], with the aim of die with TSV connections between them as shown in Fig. 2b. In taking the place of FPGA. Lots of efforts are still needed to realize addition they replace SRAM in FPGA with emerging NVMs, such these revolutionary ideas in products. as phase-change RAM (PCRAM) [18] or memristor [14], to save the area usage of storing one programming bit and stand-by power as III. MEMRISTOR TECHNOLOGY shown in Fig. 2c. 1.1x and 2x overall performance gains before and In this work, we propose to use memristors and metal wires but no after 3D integration respectively are reported in these two works. transistors to build up the routing structure in mrFPGA. We first intro-

2 2011 IEEE/ACM International Symposium on Nanoscale Architectures duce the memristor technology in this section. CMOS-compatibility is key property to integrate memristor into the mrFPGA architecture and several works have demonstrated CMOS-compatible memristor fabrication processes [10, 24, 25]. For example, Fig. 3 is a schematic view and cross-sectional transmission electron microscopy (TEM) image of memristor fabricated by a CMOS-compatible process [24] for example. As shown in Fig. 3a memristor is a two-terminal

Fig. 4: mrFPGA architecture: The programmable routing structure is built up with metal wires and memristor which are fabricated on top of logic blocks in the same die according to existing memristor fabrication structures.

to the total area of the logic blocks only, which takes only 10% to 20% of the conventional FPGA area [4]. Fig. 5b is a detailed (a) Schematic view. (b) TEM image. design of the connection blocks and switch blocks in mrFPGA using Fig. 3: Fabrication structure of memristor integrated with CMOS in memristors and metal wires only. In this design we use five metal 5 9 the same die [24]. layers M to M without resource conflict. The circuits of logic blocks in mrFPGA will use the metal layers below them, i.e., M1 to M4, which are sufficient for practical FPGAs, e.g. Virtex-6 [26]. resistance-like device which can be fabricated between the top metal The memristor layer is designed to locate between the M9 and M8 layer and the other metal layers on top of the transistor layer in the layers. We see that the positions of the layers are in the same order as same die. As explained in [24, 25], since no high temperature takes the memristor fabrication structure reported in [24] so as to guarantee place during their memristor fabrication processes, the processes can the feasibility of mrFPGA fabrication. Though memristors are located be embedded after back-end-of-line process and will not ruin the close to the top layer in this design, electrical signals do not always transistors and metal wires which have been fabricated below the have to go through all metal layers to reach memristors, since switch memristors, as shown in Fig. 3b. We also see from Fig. 3b that blocks are in M5 ∼ M9. The signals need to reach M1 or M2 from the structure of memristor is quite simple, resulting in the easy the memristor layer only when they access the pins of logic blocks. achievement of small cell size. In spite of the simple structure of Moreover, as stated in Section I, the size of a memristor cell can memristor, it achieves the storage of one-bit data successfully. Mem- be close to or even smaller than the width of a metal [9, 10]. ristor can be programmed by applying specific programming voltage So memristors can easily fit into the design in Fig. 5b without extra or current at its two terminals to switch its resistance value at normal area overhead. The memristor array can be programmed efficiently operation between low state and high state. The resistance values at using V/2 biasing scheme and two-step write operation as proposed the two states are usually more than two orders of magnitude apart in [15]. [24], or even up to six orders of magnitude apart [25]. In addition memristor can keep the resistance value set before being powered off B. Interconnects with Adaptive Buffer Insertion without supply voltage. With this programmable property, memristor One potential problem of the basic mrFPGA routing structure is is used as a promising type of NVM with two different states: low that there are no buffers in the structure. It leads to a quadratic resistance state (LRS) and high resistance state (HRS). In our work increase of interconnect delay with the increase of the interconnect the important thing is that memristor acts like a programmable switch distance. This problem would undermine the benefit brought by the which can be reconfigured to determine whether its two terminals are significant reduction of the overall area and interconnect delay in connected or not. If the two terminals need to be connected to each the case of large FPGAs. To address this problem, we propose an other, we just program the memristor to LRS. Otherwise, we program improved FPGA architecture which allows adaptive buffer insertion the memristor to HRS. This special mechanism is quite different from in the interconnects as shown in Fig. 6. A limited number of buffers those of conventional memories and provides a unique opportunity are prefabricated in routing channels and can be connected to the to build up a new FPGA routing structure with the novel use of tracks in channels via memristors. Whether to insert the prefabricated memristors. buffers or not and which track uses a buffer depend on the placement IV. mrFPGA STRUCTURE result of the circuit to implement on mrFPGA. For a routing path long enough to exhibit the quadratic increase of RC delay, we will A. Basic Structure let some buffers be connected to the path at proper positions. To get As discussed in the previous section, there is opportunity to build the optimal solution of insertion choice for each buffer in the routing up an FPGA routing structure based on memristors only in addition network, we borrow an idea from the buffer placement algorithm in to metal wires. Also memristor-based routing structure of FPGA ASIC [27]. This algorithm can decide the best positions of buffers in will be placed over the transistor layer in the same die, just like the interconnects of an ASIC using dynamic programming to achieve the routing structure of ASICs. The basic schematic view of this the minimum interconnect delay. It constructs delay/capacitance pairs efficient FPGA architecture with memristor-based reconfiguration which correspond to different buffer options in a routing tree of an (mrFPGA) is shown in Fig. 4. With the organization of logic blocks ASIC via a depth-first search and produces the optimal option in and programmable interconnects in Fig. 4, the overall mrFPGA time complexity of O(B2) where B is the number of legal buffer architecture will become a highly compact array as shown in Fig. 5a. positions. For mrFPGA applications, we let the buffer at certain From the figure we see that the area of mrFPGA will be reduced position be inserted to the routing track if the algorithm decides to

2011 IEEE/ACM International Symposium on Nanoscale Architectures 3 (a) (b) Fig. 5: The proposed mrFPGA architecture. (a) Overview of mrFPGA where the FPGA area is contributed by logic blocks only, and (b) A detailed design of the connection blocks and switch blocks in mrFPGA using the memristors and metal wire resources in existing memristor fabrication structures.

Fig. 6: An improved mrFPGA architecture with adaptive buffer insertion in the programmable interconnects.

Fig. 7: Comparison between fixed buffer patten in conventional FPGA place a buffer at this position. Then whether or not to insert one and adaptive buffer insertion in mrFPGA. buffer at one routing node in mrFPGA is optimized on demand. In contrast in conventional FPGA the connections between buffers and routing tracks are predetermined during fabrication. A case study in in the same channel as shown in Fig. 7. When the output pin of Fig. 7 shows the benefit brought by the adaptive buffer insertion in a logic block happens to be connected to a wire segment that is mrFPGA when compared to the fixed buffer pattern in conventional buffered right after the connection node, just like the block “start” FPGA. To simplify the problem, we assume in this case that the RC in Fig. 7 connected to track3 in the upper channel, the routing result delay of a wire with the length of one block is 0.5RwireCwire =1ns, will deviate from the optimal. The problem can be even worse during and that the buffers in interconnects are considered as ideal buffers the switch from one track to another. A timing-driven design, such with infinite drive, no input/output capacitance, and a fixed intrinsic as [28], has to always buffer the between tracks to avoid the delay of 9ns. The optimal length of a wire between two adjacent potential large RC delay caused by unbuffered connection between buffers would be the two longest unbuffered track segments, in this case as large as r T 36ns if the two track1s in the channels are connected without a buffer k = buffer =3 for example. It drives the routing result farther from the optimal. The 0.5R C wire wire table in Fig. 7 lists all the possible delays from the logic block “start” In conventional FPGA, the pattern of pre-fabricated buffers in inter- to the logic block “end” according to the settings in a conventional connects will follow this result to distribute buffers evenly with a FPGA discussed above. The first column is the track ID in the upper distance of three blocks (as shown in Fig. 7). The problem is that channel, and the first row is the track ID in the right channel. The it does not know the starting point of each net (the offset can be buffers at the output pin of the “start” block and the input pin of 0, 1 or 2). So it just staggers the patterns among different tracks the “end” block are not counted in the delay calculation. As shown

4 2011 IEEE/ACM International Symposium on Nanoscale Architectures in Fig. 7 the worst case can be 22% slower than the best case in a conventional FPGA. On the contrary, the adaptive buffer insertion in mrFPGA will not suffer from the deviation from the optimal buffer placement in interconnects as does the conventional FPGA. We see from Fig. 7 that in mrFPGA buffers are inserted just exactly one per three blocks, resulting in a total delay of only 45ns. It is even better than the best case in a conventional FPGA.

V. CAD FLOW AND EVALUATION OF mrFPGA A. CAD Flow Fig. 9: Extracted equivalent circuit model of the routing structure of To implement a circuit design in FPGA, a complete CAD flow is mrFPGA and comparison with conventional FPGA. needed. The input of the flow is the design to be implemented and the output of the flow includes the placement and routing (P&R) result used for programming the design into FPGA as well as an estimation it is to think about the wire segment with the length of one tile. of area, delay and power consumption for the implementation of the In mrFPGA the wire segment with the length of one tile is just the design in FPGA. We propose to use a timing-driven CAD flow for our short segment between two logic blocks, of which the length is close mrFPGA as shown in Fig. 8. A circuit design first goes through the to zero compared to the tile side. However the length of one tile does not disappear. It is counted in the switch block according to the mrFPGA architecture shown in Fig. 5b.

C. Shielding Effect of Memristor During the development of the circuit models for the routing structure of mrFPGA, we find out that memristor has a shielding effect to help reduce the interconnect delay of mrFPGA further. The shielding effect of memristors at a high resistance state (HRS) can remove the downstream capacitance of unused paths from the total capacitance at one node and then reduce the RC delay to reach this node. Fig. 10 shows an example of the effect in a switch block when compared to a conventional FPGA. In the conventional FPGA, in spite of the

Fig. 8: CAD flow for mrFPGA. design tool ABC [29] to perform logic optimization and technology mapping. A mapped netlist consisting of K-LUTs will be obtained. Then the mapped netlist is fed into the design tool T-VPACK [30] to pack LUTs to logic blocks and then into the design tool mrVPR developed by us to accomplish P&R. At last we use the generated basic-cell (BC) netlist with delay and capacitance information to run the power estimation tool fpgaEVA LP2 [7]. Since we try to reuse most parts of the existing CAD flow for conventional FPGAs, the development of mrFPGA CAD flow is quite easy. Only the P&R tool is specifically developed for mrFPGA. We will introduce the (a) Conventional FPGA. (b) mrFPGA. features of this advanced tool in the last part of this section. Fig. 10: Memristors provide capacitance shielding from unused paths. B. Equivalent Circuit Model for Interconnect To perform the timing and power analysis for mrFPGA, we first fact that the lower two paths are not used as shown in Fig. 10a, the develop an equivalent circuit model for the routing structure of downstream capacitances of these two paths are still added to the total mrFPGA. Fig. 9 shows the equivalent circuit of a representative capacitance at node A. In contrast, in mrFPGA only the downstream routing path from the output pin of a logic block to the input pin capacitance of a path in use will be added to the total capacitance at of another logic block in mrFPGA and makes a comparison with node A. The huge resistance value of memristors at HRS provides that of a conventional FPGA. In the circuit model, the MUXs in capacitance shielding from unused paths. We use HSPICE simulation connection blocks and switch blocks are replaced by the memristors to verify whether the resistance of memristor at HRS is high enough at low resistance state (LRS). The wire segments are still modeled to be qualified as an ideal open switch for the shielding effect. In the as distributed RC lines. Notice that the overall resistance value mrFPGA circuit shown in Fig. 10b, we set R as 1×103Ω, and C and −12 and capacitance value of wire segments have changed due to the Cload as 1×10 F. For the actual memristor resistance value, we set 3 3 reduction of the tile area in mrFPGA. Also, according to the mrFPGA Rlrs as 1×10 Ω according to [24, 25] and sweep Rhrs from 1×10 Ω, 6 architecture in Fig. 5a, the length of a wire segment is one tile shorter the same value as Rlrs,to1 × 10 Ω. At this starting point of sweep, than its counterpart in a conventional FPGA. The easiest way to see the three paths are equivalent and the three downstream capacitances

2011 IEEE/ACM International Symposium on Nanoscale Architectures 5 will be all added to node A in the path from node “start” to node and provide connections for logic blocks nearby in a different “end” just like a conventional FPGA. We want to see how large Rhrs way from that of a conventional FPGA shown in Fig. 12a. needs to be to make the path delay from node “start” to node “end” 2) mrVPR can reflect the memristor shielding effect. In VPR all close enough to that in the case where Rhrs are replaced by ideal the device capacitances with prefabricated connections to a open switches. The result obtained by the HSPICE simulations is node will be added to the capacitance of that node regardless shown in Fig. 11. We see that with the increase of Rhrs, the delay of whether the paths which these devices belong to will be included in the routing result or not. In mrVPR, only when a routing path is used, the capacitances on that path will be 6 added to the corresponding routing nodes. shielded by open switch 5.5 3) mrVPR is integrated with the algorithm to generate the optimal shielded by memristor option for adaptive buffer insertion. We have implemented the 5 algorithm in mrVPR and it shows where buffers are needed for insertion in the final routing result as shown in Fig. 13. We 4.5 see that the positions for buffers to insert are marked with the

delay (ns) Actual Memristor Region black delta shape. We also see that the interconnect view of 4 the routing result in Fig. 13b is quite similar to that of ASIC. We expect to see the good performance of mrFPGA in the next 3.5 section.

3 3 4 5 6 10 10 10 10 R (Ω) hrs Fig. 11: The impact of the memristor shielding effect on the path delay.

approaches to the value with Rhrs replaced by ideal open switches. The resistance value of an actual memristor at HRS is more than two orders of magnitude higher than that at LRS as reported in [24, 25]. We mark the actual memristor region according to this criterion and see that the delay within this region is very close to the ideal situation. This proves that the shielding effect of memristor can help reduce path delay further in mrFPGA. (a) Routing resource view. (b) Routing result view Fig. 13: View of the optimal option for buffer insertion in mrVPR. D. mrVPR: an advanced P&R Tool for mrFPGA To deal with the novel routing structure of mrFPGA in the P&R step of CAD flow, we develop an advanced tool named mrVPR (VPR for VI. EXPERIMENTAL RESULTS memristor-based reconfiguration) on the base of the commonly-used A. Settings FPGA P&R tool VPR [30]. The main contributions of this tool are We choose Xilinx Virtex-6 FPGA platform as our conventional FPGA model baseline in experiments and conduct evaluation of mrFPGA at the same technology node. The architecture logic specification of Virtex-6 needed by the CAD flow in Fig. 8 are collected from Virtex-6 FPGA data sheets (Configurable Logic Block part) [26]. As for the architecture routing specification of Virtex-6, such as the channel width and the distribution of wire segment lengths (Single, Double, Quad, Long and Global), are extracted with the help of FPGA Editor in Xilinx ISE environment. The RC delay values of each type of wire segments are calculated from the unit resistance and capacitance values in the lookup tables of ITRS2010 [31]. The timing parameters of buffers inserted in mrFPGA routing structure, including driving resistance, input capacitance and intrinsic delay, are (a) VPR for conventional FPGA (b) Our mrVPR for mrFPGA obtained via HSPICE simulation using PTM device models [32, 33]. The memristor model is extracted from the measurement results of Fig. 12: Overview of routing resource in the two tools VPR and the memristors fabricated by the process in [24], the same process on mrVPR. which mrFPGA is based. The 20 largest MCNC benchmark circuits are used as the input of the CAD flow for conventional FPGA and as follows: mrFPGA respectively and comparisons are made on the output of the 1) mrVPR can deal with the routing graph of mrFPGA as shown CAD flow from the aspects of area, delay and power. in Fig. 12b. In Fig. 12b the gray blocks are I/O blocks and logic blocks of FPGA and the diagonal wires are the routing B. Evaluation Results paths available in switch blocks. We see that the switch blocks Fig. 14 shows the comparison of the tile area in three FPGA and connection blocks in mrVPR are placed over logic blocks architectures: the baseline Virtex-6 FPGA, its mrFPGA counterpart

6 2011 IEEE/ACM International Symposium on Nanoscale Architectures without buffer insertion (a fictitious case to show the impact of mrFPGA use memristors and metal wires only. The routing structure buffer insertion) and mrFPGA with buffer insertion respectively. As is then able to be placed over logic blocks in the same die according to the existing memristor fabrication structure. An improved architecture with adaptive buffer insertion is proposed to further reduce the (a) Virtex-6 baseline. Total area = 13993μm2. interconnect delay. A complete CAD flow is provided for mrFPGA with an advanced P&R tool named mrVPR developed for mrFPGA. The tool can deal with the novel routing structure of mrFPGA, the memristor shielding effect, and the algorithm of optimal buffer (b) mrFPGA unbuffered. (c) mrFPGA. Total area = 2547μm2. insertion. An evaluation of mrFPGA is done on the 20 largest MCNC Fig. 14: Area comparison of a single tile in three architectures (unit: benchmark circuits. Results show that mrFPGA achieves a 5.18x area μm2). (LB: logic block; CB: connection block; SB: switch block; savings, a 2.28x speedup and a 1.63x power savings. Buf: pre-fabricated buffers to insert on demand in mrFPGA). REFERENCES [1] P. Schaumont and I. Verbauwhede, “Domain-specific codesign for em- discussed in the previous sections, in the case of mrFPGA unbuffered, bedded security,” Computer, vol. 36, no. 4, pp. 68–74, Apr. 2003. the area contribution of the interconnects is reduced to zero since the [2] I. Kuon and J. Rose, “Measuring the Gap Between FPGAs and ASICs,” interconnects are purely made up of memristors and metal layers IEEE Transactions on Computer-Aided Design of Integrated Circuits located on top of logic blocks. For mrFPGA with buffer insertion, and Systems, vol. 26, no. 2, pp. 203–215, Feb. 2007. we have to count the area of pre-fabricated buffers in the channels [3] J. Cong, V. Sarkar, G. Reinman, and A. Bui, “Customizable Domain- Specific Computing,” IEEE Design and Test of Computers, vol. 28, no. 2, according to the architecture in Fig. 6. The area contributed by buffer pp. 6–15, Mar. 2011. insertion in mrFPGA is pessimistically calculated as the case of a [4] V. George, “Low Energy Field-Programmable Gate Array,” Ph.D. dis- uniform distribution of pre-fabricated buffers over channels, with the sertation, UC Berkeley, 2000. number of buffers per channel set as the maximum number of buffers [5] A. DeHon, “Reconfigurable Architectures for General-Purpose Comput- required by mrVPR to insert in one channel of mrFPGA over the 20 ing,” Ph.D. dissertation, MIT, Dec. 1996. [6] E. Ahmed and J. Rose, “The Effect of LUT and ClusterSize on Deep- benchmarks. We see in Fig. 14 that the area overhead brought by Submicron FPGA Performance and Density,” IEEE Transactions on pre-fabricated buffers is low compared to the interconnect area of VLSI Systems, vol. 12, no. 3, pp. 288–298, Mar. 2004. conventional FPGA. That is mainly due to the on-demand property [7] F. Li et al., “Power modeling and characteristics of field programmable of buffer solution in mrFPGA. Results in Fig. 14 shows more than gate arrays,” IEEE Transactions on Computer-Aided Design of Integrated 5.5x saving of total area for mrFPGA. Table I shows the performance Circuits and Systems, vol. 24, no. 11, pp. 1712–1724, Nov. 2005. comparison between Virtex-6 baseline and mrFPGA. It shows a small [8] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing memristor found.” , vol. 453, no. 7191, pp. 80–3, May performance degradation before buffer insertion in mrFPGA and a 2008. 2.28x speedup after buffer insertion. The total speedup primarily [9] Y. S. Chen et al., “Highly scalable hafnium oxide memory with stems from the reduction of interconnect delay. In the routing improvements of resistive distribution and read disturb immunity,” in structure of mrFPGA, the reduction of tile area by ∼5.5x results in IEDM Technical Digest, Dec. 2009, pp. 1–4. 4 2 a shortening of wire segments in the programmable interconnects by [10] C.-h. Wang et al., “Three-Dimensional F ReRAM Cell with CMOS ∼ Logic Compatible Process,” in IEDM Technical Digest, 2010, pp. 664– 2.35 times. Then both the resistance and capacitance values of wire 667. segments decrease by ∼2.35 times, contributing to the total reduction [11] S.-s. Sheu et al., “A 5ns Fast Write Multi-Level Non-Volatile 1 K of RC delays of the wire segments by ∼5.5 times. The shielding bits RRAM Memory with Advance Write Scheme,” in VLSI Circuits, effect of memristor also helps to improve performance. The quadratic Symposium on, 2009, pp. 82–83. increase of the RC delay of a pure RC network cancels out the benefit [12] M. Mahvash and A. Parker, “A memristor SPICE model for designing memristor circuits,” in MWSCAS, no. 2, 2010, pp. 989–992. in mrFPGA without buffer insertion. We see in Table I that for some [13] S. Yu, J. Liang, Y. Wu, and H.-S. P. Wong, “Read/write schemes analysis benchmarks with long interconnect paths between logic blocks, the for novel complementary resistive switches in passive crossbar memory pure RC delay without buffers is kind of large. However in mrFPGA arrays.” Nanotechnology, vol. 21, no. 46, p. 465202, Oct. 2010. with buffer insertion the interconnects perform much better than [14] M. Liu and W. Wang, “rFPGA: CMOS-nano hybrid FPGA using RRAM Virtex-6 baseline and mrFPGA unbuffered. The good performance components,” in NanoArch, Jun. 2008, pp. 93–98. also stems from the adaptive buffer insertion in mrFPGA. Table II [15] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, “Design Implications of Memristor-Based RRAM Cross-Point Structures,” in DATE, 2011. shows the power consumption by Virtex-6 basline and mrFPGA [16] G. Lemieux and D. Lewis, “Circuit design of routing switches,” Inter- respectively. An average of 40% power savings is achieved. The national Symposium on FPGAs, p. 19, 2002. power savings primarily come from the replacement of transistors in [17] M. Lin, A. El Gamal, Y.-C. Lu, and S. Wong, “Performance Benefits of the programmable interconnects, such as MUXs and SRAMs, with Monolithically Stacked 3-D FPGA,” IEEE Transactions on Computer- the simple structure of memristors. Both dynamic power and static Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 681–229, Feb. 2007. power are reduced due to less capacitance on the routing paths and [18] Y. Chen, J. Zhao, and Y. Xie, “3D-nonFAR: Three-Dimensional Non- the non-volatility of memristor. Notice that no die-stacking is needed Volatile FPGA ARchitecture Using Phase Change Memory,” in ISLPED, to achieve this degree of area reduction and speedup. If 3D integration 2010, p. 55. technology is introduced in the future to stack several mrFPGAs [19] C. Dong, D. Chen, S. Haruehanroengra, and W. Wang, “3-D nFPGA: together, at least two more times of improvements in density and A Reconfigurable Architecture for 3-D CMOS/Nanomaterial Hybrid Digital Circuits,” IEEE Transactions on Circuits and Systems I: Regular speedup, i.e. 10x and 4.5x respectively, are both expected, according Papers, vol. 54, no. 11, pp. 2489–2501, Nov. 2007. to the experimental results on 3D architecture in [17]. [20] C. Chen et al., “Efficient FPGAs using nanoelectromechanical relays,” VII. CONCLUSIONS International Symposium on FPGAs, p. 273, 2010. [21] P.-e. Gaillardon et al., “Emerging memory technologies for reconfig- This work presents a novel FPGA architecture with memristor-based urable routing in FPGA architecture,” in ICECS. IEEE, Dec. 2010, pp. reconfiguration named mrFPGA. The programmable interconnects of 62–65.

2011 IEEE/ACM International Symposium on Nanoscale Architectures 7 TABLE I: Critical Path Delay and Comparison (unit: ns) Virtex-6 Baseline mrFPGA unbuffered mrFPGA Interconnect Delay Total Delay Interconnect Delay Total Delay Interconnect Delay Total Delay alu4 6.62 9.54 5.53 (1.19x) 8.76 (1.08x) 1.52 (4.35x) 4.75 (2.00x) apex2 7.84 10.4 6.96 (1.12x) 10.2 (1.01x) 1.52 (5.15x) 5.12 (2.04x) apex4 6.79 9.71 7.76 (0.87x) 10.3 (0.94x) 1.35 (4.99x) 4.58 (2.11x) bigkey 10.9 13.3 37.7 (0.29x) 40.1 (0.33x) 3.98 (2.75x) 4.97 (2.68x) clma 12.0 14.6 22.7 (0.53x) 24.3 (0.60x) 2.80 (4.31x) 5.27 (2.78x) des 15.1 17.9 40.8 (0.37x) 43.6 (0.41x) 5.59 (2.71x) 8.02 (2.23x) diffeq 4.11 5.37 2.72 (1.50x) 5.09 (1.05x) 0.34 (11.7x) 3.27 (1.63x) dsip 9.61 10.9 27.2 (0.35x) 28.5 (0.38x) 2.52 (3.80x) 4.89 (2.22x) elliptic 5.93 8.36 10.9 (0.54x) 13.0 (0.64x) 1.72 (3.43x) 4.09 (2.04x) ex1010 14.2 16.8 22.1 (0.64x) 25.0 (0.67x) 3.63 (3.90x) 6.92 (2.42x) ex5p 7.23 10.1 5.42 (1.33x) 8.34 (1.21x) 0.76 (9.43x) 4.30 (2.35x) frisc 8.97 9.92 11.9 (0.75x) 13.2 (0.74x) 0.71 (12.5x) 4.87 (2.03x) misex3 6.20 9.06 5.73 (1.08x) 8.65 (1.04x) 1.04 (5.91x) 4.58 (1.97x) pdc 11.1 14.4 18.6 (0.59x) 22.3 (0.64x) 2.25 (4.93x) 6.22 (2.32x) s298 8.39 11.1 8.29 (1.01x) 10.1 (1.09x) 0.85 (9.86x) 4.27 (2.59x) s38417 8.39 11.1 8.29 (1.01x) 10.1 (1.09x) 0.85 (9.86x) 4.27 (2.59x) s38584.1 8.39 11.1 8.29 (1.01x) 10.1 (1.09x) 0.85 (9.86x) 4.27 (2.59x) seq 7.45 10.6 7.84 (0.95x) 10.3 (1.03x) 1.40 (5.30x) 4.63 (2.30x) spla 10.8 13.8 11.9 (0.91x) 15.1 (0.91x) 1.88 (5.77x) 5.48 (2.52x) tseng 4.90 7.27 7.62 (0.64x) 9.99 (0.72x) 0.94 (5.16x) 3.31 (2.19x) average - - - (0.83x) - (0.83x) - (6.29x) - (2.28x)

TABLE II: Power Consumption and Comparison (unit: mW) Virtex-6 Baseline mrFPGA unbuffered mrFPGA Interconnect Power Total Power Interconnect Power Total Power Interconnect Power Total Power alu4 34.4 52.0 6.57 (5.24x) 24.1 (2.15x) 10.8 (3.18x) 28.4 (1.83x) apex2 40.1 58.5 7.36 (5.44x) 25.8 (2.26x) 11.4 (3.50x) 29.9 (1.95x) apex4 26.5 42.8 4.37 (6.07x) 20.6 (2.07x) 6.77 (3.92x) 23.0 (1.85x) bigkey 49.6 375. 6.34 (7.83x) 332. (1.13x) 12.9 (3.82x) 338. (1.10x) clma 75.0 139. 9.24 (8.12x) 73.2 (1.89x) 16.1 (4.65x) 80.1 (1.73x) des 112. 558. 19.3 (5.79x) 465. (1.19x) 31.5 (3.56x) 477. (1.16x) diffeq 15.7 32.5 2.16 (7.29x) 18.9 (1.71x) 3.81 (4.13x) 20.6 (1.57x) dsip 79.9 409. 11.4 (6.97x) 341. (1.20x) 20.9 (3.82x) 350. (1.16x) elliptic 52.7 157. 7.28 (7.23x) 111. (1.40x) 12.9 (4.07x) 117. (1.33x) ex1010 61.3 109. 7.92 (7.74x) 56.3 (1.94x) 13.6 (4.48x) 62.1 (1.76x) ex5p 20.6 34.7 3.40 (6.06x) 17.4 (1.98x) 5.47 (3.77x) 19.5 (1.77x) frisc 28.1 60.5 2.87 (9.76x) 35.2 (1.71x) 5.25 (5.34x) 37.6 (1.60x) misex3 33.4 50.4 6.31 (5.29x) 23.2 (2.16x) 10.0 (3.33x) 26.9 (1.86x) pdc 57.3 97.1 8.48 (6.75x) 48.2 (2.01x) 13.8 (4.14x) 53.6 (1.81x) s298 16.0 28.6 2.30 (6.95x) 14.9 (1.92x) 3.81 (4.20x) 16.4 (1.74x) s38417 16.0 28.6 2.30 (6.95x) 14.9 (1.92x) 3.81 (4.20x) 16.4 (1.74x) s38584.1 16.0 28.6 2.30 (6.95x) 14.9 (1.92x) 3.81 (4.20x) 16.4 (1.74x) seq 37.1 55.6 6.75 (5.49x) 25.2 (2.20x) 10.9 (3.39x) 29.4 (1.88x) spla 51.1 87.1 8.28 (6.17x) 44.2 (1.96x) 13.0 (3.90x) 49.0 (1.77x) tseng 20.4 70.8 2.50 (8.19x) 52.8 (1.34x) 4.93 (4.14x) 55.3 (1.28x) average - - - (6.81x) - (1.80x) - (3.99x) - (1.63x)

[22] D. B. Strukov and K. K. Likharev, “CMOL FPGA: a reconfigurable [28] I. Kuon and J. Rose, “Area and delay trade-offs in the circuit and architecture for hybrid digital circuits with two-terminal nanodevices,” architecture design of FPGAs,” in International Symposium on FPGAs, Nanotechnology, vol. 16, no. 6, pp. 888–900, Jun. 2005. 2008, p. 149. [23] P.-E. Gaillardon, F. Clermidy, I. O’Connor, and J. Liu, “Interconnection [29] “Berkeley Logic Synthesis and Verification Group, ABC: A System scheme and associated mapping method of reconfigurable cell matrices for Sequential Synthesis and Verification, Release 61225.” [Online]. based on nanoscale devices,” in NanoArch, Jul. 2009, pp. 69–74. Available: http://www.eecs.berkeley.edu/∼alanmi/abc/ [24] K. Tsunoda et al., “Low Power and High Speed Switching of Ti-doped [30] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep- NiO ReRAM under the Unipolar Voltage Source of less than 3 V,” in Submicron FPGAs. Norwell: MA:Kluwer, 1999. IEDM Technical Digest, Dec. 2007, pp. 767–770. [31] “International Technology Roadmap for Semiconduc- [25] R. Huang et al., “Resistive switching of silicon-rich-oxide featuring high tors,” Tech. Rep., 2010. [Online]. Available: compatibility with CMOS technology for 3D stackable and embedded http://www.itrs.net/Links/2010ITRS/Home2010.htm applications,” Applied Physics A, pp. 927–931, Mar. 2011. [32] W. Zhao and Y. Cao, “Predictive Technology Model (PTM) website.” [26] Xilinx, “Virtex-6 FPGA data sheets.” [Online]. Available: [Online]. Available: http://ptm.asu.edu/ http://www.xilinx.com/support/documentation/virtex-6.htm [33] W. Zhao and Y. Cao, “New Generation of Predictive Technology Model [27] L. van Ginneken, “Buffer placement in distributed RC-tree networks for for Sub-45nm Design Exploration,” in ISQED, 2006, pp. 585–590. minimal Elmore delay,” in ISCAS, 1990, pp. 865–868.

8 2011 IEEE/ACM International Symposium on Nanoscale Architectures