book 4/9/2009 16: 24 page 133

Appendix A FPGA to ASIC Comparison Details

This appendix provides information on the benchmarks used for the FPGA to ASIC comparisons in Chap. 3. As well, some of the absolute data from that comparison is provided; however, area results are not included as that would disclose confidential information.

A.1 Benchmark Information

Information about each of the benchmarks used in the FPGA to ASIC comparisons is listed in Table A.1. For each benchmark, a brief description of what the benchmark does is given along with information about its source. Most of the benchmarks were obtained from OpenCores (http://www.opencores.org/) while the remainder of the benchmarks came from either internal University of Toronto projects [29, 71, 165, 166] or external benchmark projects at http://www.humanistic.org/~hendrik/reed-solomon/index.html or http://www.engr.scu.edu/mourad/benchmark/RTL-Bench.html. As noted in the table, in some cases, the benchmarks were not obtained directly from these sources and, instead, were modified as part of the work performed in [79]. The modifications included the removal of FPGA vendor-specific constructs and the correction of any compilation issues in the designs.

A.2 FPGA to ASIC Comparison Data

The results in Chap. 3 were given only in relative terms. This section provides the raw data underlying these relative comparisons. Tables A.2 and A.3 list the maximum operating frequency and dynamic power, respectively, for each design for both the FPGA and ASIC. Tables A.4 and A.5 report the FPGA and ASIC absolute static power measurements for each benchmark at typical- and worst-case conditions, respectively. The static power measurements for the FPGAs include the adjustments to account for the partial utilization of each device as described in Sect. 3.4.3.2. Finally, Table A.6 summarizes the results when retiming was used with the FPGA CAD flow as described in Sect. 3.5.2. The benchmark size (in ALUTs), the operating frequency increase and the total register increase are listed for each of the benchmarks.

Table A.1 Benchmark descriptions

Benchmark      Description
booth          32-bit serial Booth-encoded multiplier created by the author
rs encoder     (255,239) Reed Solomon encoder from OpenCores
cordic18       18-bit CORDIC algorithm implementation from OpenCores
cordic8        8-bit CORDIC algorithm implementation from OpenCores
des area       DES Encryption/Decryption designed for area from OpenCores with modifications from [79]
des perf       DES Encryption/Decryption designed for performance from OpenCores with modifications from [79]
fir restruct   8-bit 17-tap finite impulse response filter with fixed coefficients from http://www.engr.scu.edu/mourad/benchmark/RTL-Bench.html with modifications from [79]
mac1           Media Access Control (MAC) block from OpenCores with modifications from [79]
aes192         AES Encryption/Decryption with 192-bit keys from OpenCores
fir3           8-bit 3-tap finite impulse response filter from OpenCores with modifications from [79]
diffeq         Differential equation solver from OpenCores with modifications from [79]
diffeq2        Differential equation solver from OpenCores with modifications from [79]
molecular      Molecular dynamics simulator [29]
rs decoder1    (31,19) Reed Solomon decoder from http://www.humanistic.org/~hendrik/reed-solomon/index.html with modifications from [79]
rs decoder2    (511,503) Reed Solomon decoder from http://www.humanistic.org/~hendrik/reed-solomon/index.html with modifications from [79]
atm            High speed 32 × 32 ATM packet switch based on the architecture from [50]
aes            AES Encryption with 128-bit keys from OpenCores
aes inv        AES Decryption with 128-bit keys from OpenCores
ethernet       Ethernet Media Access Control (MAC) block from OpenCores
serialproc     32-bit RISC processor with serial ALU [165, 166]
fir24          16-bit 24-tap finite impulse response filter from OpenCores with modifications from [79]
pipe5proc      32-bit RISC processor with 5 pipeline stages [165, 166]
raytracer      Image rendering engine [71]


Table A.2 FPGA and ASIC operating frequencies

               Maximum operating frequency (MHz)
Benchmark      FPGA      ASIC
booth          188.71    934.58
rs encoder     288.52    1098.90
cordic18       260.08    961.54
cordic8        376.08    699.30
des area       360.49    729.93
des perf       321.34    1000.00
fir restruct   194.55    775.19
mac1           153.21    584.80
aes192         125.75    549.45
fir3           278.40    961.54
diffeq         78.23     318.47
diffeq2        70.58     281.69
molecular      89.01     414.94
rs decoder1    125.27    358.42
rs decoder2    101.24    239.23
atm            319.28    917.43
aes            213.22    800.00
aes inv        152.28    649.35
ethernet       168.58    704.23
serialproc     142.27    393.70
fir24          249.44    645.16
pipe5proc      131.03    378.79
raytracer      120.35    416.67
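As an illustration of how relative figures can be derived from this raw data, the following minimal Python sketch computes the ASIC-to-FPGA frequency ratio for a few rows copied from Table A.2 and averages the ratios with a geometric mean. The direction of the ratio and the use of a geometric mean are assumptions made for this example; the exact procedure used in Chap. 3 is not restated in this appendix.

```python
import math

# (FPGA MHz, ASIC MHz) pairs copied from Table A.2; only a subset is shown.
FMAX_MHZ = {
    "booth": (188.71, 934.58),
    "rs encoder": (288.52, 1098.90),
    "cordic18": (260.08, 961.54),
    "cordic8": (376.08, 699.30),
}


def speed_ratios(fmax):
    """Return the ASIC/FPGA maximum-frequency ratio for each benchmark."""
    return {name: asic / fpga for name, (fpga, asic) in fmax.items()}


def geometric_mean(values):
    """Geometric mean of positive values (assumed averaging method)."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = speed_ratios(FMAX_MHZ)
gmean = geometric_mean(ratios.values())

for name, ratio in ratios.items():
    print(f"{name:12s} ASIC/FPGA frequency ratio = {ratio:.2f}")
print(f"geometric mean over these rows = {gmean:.2f}")
```

The same two helper functions could be applied to the power columns of Tables A.3 to A.5 by swapping in the corresponding values.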


Table A.3 FPGA and ASIC dynamic power consumption

               Dynamic power consumption (W)
Benchmark      FPGA          ASIC
booth          5.10×10^-03   1.71×10^-04
rs encoder     4.63×10^-02   1.88×10^-03
cordic18       6.75×10^-02   1.08×10^-02
cordic8        1.39×10^-02   2.44×10^-03
des area       3.50×10^-02   1.32×10^-03
des perf       1.22×10^-01   1.31×10^-02
fir restruct   2.47×10^-02   2.56×10^-03
mac1           8.94×10^-02   4.63×10^-03
aes192         1.04×10^-01   3.50×10^-03
fir3           7.91×10^-03   1.06×10^-03
diffeq         4.53×10^-02   3.86×10^-03
diffeq2        5.18×10^-02   4.16×10^-03
molecular      4.55×10^-01   2.76×10^-02
rs decoder1    3.48×10^-02   2.20×10^-03
rs decoder2    4.74×10^-02   4.29×10^-03
atm            5.59×10^-01   3.71×10^-02
aes            6.32×10^-02   6.71×10^-03
aes inv        7.65×10^-02   1.13×10^-02
ethernet       9.17×10^-02   5.91×10^-03
serialproc     3.42×10^-02   2.16×10^-03
fir24          1.18×10^-01   2.22×10^-02
pipe5proc      5.11×10^-02   6.23×10^-03
raytracer      8.99×10^-01   1.08×10^-01


Table A.4 FPGA and ASIC static power consumption – typical

               Static power consumption (W)
Benchmark      FPGA          ASIC
rs encoder     1.31×10^-02   2.61×10^-04
cordic18       4.43×10^-02   5.73×10^-04
des area       1.14×10^-02   1.25×10^-04
des perf       5.52×10^-02   1.08×10^-03
fir restruct   1.40×10^-02   2.03×10^-04
mac1           3.52×10^-02   4.08×10^-04
aes192         1.61×10^-02   1.90×10^-04
diffeq2        1.15×10^-02   3.63×10^-04
molecular      1.27×10^-01   1.83×10^-03
rs decoder1    1.74×10^-02   7.47×10^-05
rs decoder2    2.31×10^-02   1.91×10^-04
atm            2.46×10^-01   1.08×10^-03
aes            1.67×10^-02   5.06×10^-04
aes inv        2.06×10^-02   6.68×10^-04
ethernet       5.11×10^-02   2.94×10^-04
fir24          2.18×10^-02   1.66×10^-03
pipe5proc      2.06×10^-02   1.27×10^-04
raytracer      1.69×10^-01   1.74×10^-03

Table A.5 FPGA and ASIC static power consumption – worst case

               Static power consumption (W)
Benchmark      FPGA          ASIC
rs encoder     3.46×10^-02   1.00×10^-02
cordic18       1.17×10^-01   2.27×10^-02
des perf       1.45×10^-01   4.16×10^-02
fir restruct   3.70×10^-02   7.86×10^-03
mac1           9.28×10^-02   1.56×10^-02
aes192         5.00×10^-02   7.51×10^-03
diffeq         2.45×10^-02   1.44×10^-02
diffeq2        3.04×10^-02   1.40×10^-02
molecular      3.95×10^-01   7.19×10^-02
rs decoder1    4.60×10^-02   3.02×10^-03
rs decoder2    6.10×10^-02   7.46×10^-03
atm            7.70×10^-01   4.61×10^-02
aes            5.21×10^-02   1.93×10^-02
aes inv        6.42×10^-02   2.58×10^-02
ethernet       1.35×10^-01   1.07×10^-02
fir24          6.80×10^-02   6.52×10^-02
pipe5proc      5.44×10^-02   9.20×10^-03
raytracer      7.14×10^-01   N/A


Table A.6 Impact of retiming on FPGA performance

Benchmark               Benchmark    ALUTs    Operating frequency   Register count
                        category              increase (%)          increase (%)
des area                Logic        469      1.2                   0.0
booth                   Logic        34       0.0                   0.0
rs encoder              Logic        683      0.0                   0.0
fir scu rtl             Logic        615      14                    89
fir restruct1           Logic        619      11                    64
fir restruct            Logic        621      15                    76
mac1                    Logic        1,852    0.0                   0.0
cordic8                 Logic        251      0.0                   0.0
mac2                    Logic        6,776    0.0                   0.0
md5 1                   Logic        2,227    23                    21
aes no mem              Logic        1,389    0.0                   0.0
raytracer framebuf v1   Logic        301      3.0                   0.0
raytracer bound         Logic        886      0.0                   0.0
raytracer bound v1      Logic        889      0.0                   0.0
cordic                  Logic        907      0.0                   0.0
aes192                  Logic        1,090    9.7                   30
md5 2                   Logic        858      10                    13
cordic                  Logic        1,278    0.0                   0.0
des perf                Logic        1,840    −0.5                  1.0
cordic18                Logic        1,169    0.0                   0.0
aes inv no mem          Logic        1,962    0.0                   0.0
fir3                    DSP          52       −14                   −40
diffeq                  DSP          219      0.0                   0.0
iir                     DSP          284      0.0                   0.0
iir1                    DSP          218      0.0                   0.0
diffeq2                 DSP          222      0.0                   0.0
rs decoder1             DSP          418      5.4                   7.5
rs decoder2             DSP          535      −0.3                  11
raytracer gen v1        DSP          1,625    0.0                   0.0
raytracer gen           DSP          1,706    0.0                   0.0
molecular               DSP          6,289    1.3                   14
molecular2              DSP          6,557    24                    71
stereovision1           DSP          2,934    36                    19
stereovision3           Memory       82       10                    9.3
serialproc              Memory       671      −2.0                  16
raytracer framebuf      Memory       457      12                    0.0
aes                     Memory       675      0.0                   0.0
aes inv                 Memory       813      0.0                   0.0
ethernet                Memory       1,650    −0.6                  4.1
faraday dma             Memory       1,987    0.5                   0.9
faraday risc            Memory       2,596    −1.0                  1.3
faraday dsp             Memory       7,218    −2.9                  −0.1
stereovision0 v1        Memory       2,919    −1.6                  0.2
atm                     Memory       10,514   4.7                   1.1
stereovision0           Memory       19,969   3.7                   0.4
oc54 cpu                DSP & Mem    1,543    0.0                   0.0
pipe5proc               DSP & Mem    746      5.5                   49
fir24                   DSP & Mem    821      −7.4                  −3.3
fft256 nomem            DSP & Mem    966      0.0                   0.0
raytracer top           DSP & Mem    11,438   14                    0.0
raytracer top v1        DSP & Mem    11,424   11                    −0.3
raytracer               DSP & Mem    13,021   3.0                   −0.6
fft256                  DSP & Mem    27,479   0.0                   0.0
stereovision2 v1        DSP & Mem    27,097   117                   131
stereovision2           DSP & Mem    27,691   97                    124

Appendix B Representative Delay Weighting

The programmability of FPGAs means that the eventual critical paths are not known at design time. However, a delay measurement is necessary if the performance of an FPGA is to be optimized. The solution described in Sect. 4.3.2 was to create a path containing all the possible critical path components. The delays of the components were then combined as a weighted sum to reflect the typical usage of each component, and that weighted sum, termed the representative delay, was used as a measure of the FPGA’s performance during optimization. This appendix investigates the selection of the weights used to compute the representative delay. As a starting point, the behaviour of benchmark circuits is analysed. That analysis provides one set of possible weights, which is then tested along with other possible weightings in Sect. B.2. The results from the different weightings are compared and conclusions are drawn.

B.1 Benchmark Statistics

The representative delay is intended to capture the behaviour of typical circuits implemented on the FPGA. Therefore, to determine appropriate values for the delay weightings, it is useful to examine the characteristics of benchmark circuits. The focus in this examination will be on how frequently the various components of the FPGA appear on the critical paths of circuits. In particular, for the architecture we will consider, there are four primary components whose usage effectively determines the usage of all the components of the FPGA. These four components are the routing segments, the CLB¹ inputs, the CLB outputs and the LUTs. The usage of the LUT will be examined in detail later in this section. The usage of these key components was tracked for the critical paths of the 20 MCNC benchmark circuits [116] when implemented on the standard baseline architecture described in Table 6.1. For each benchmark, the number of times each of the components appears on the critical path was recorded. These numbers were normalized to the total number of components on the benchmark’s critical path to allow for comparison across benchmarks with different lengths of critical paths, and the results are summarized in Table B.1. The final three rows of the table indicate the minimum, maximum and average normalized usage of each component.

¹ Recall that a cluster-based logic block (CLB) is the only type of logic block considered in this work.

Table B.1 Normalized usage of FPGA components

Benchmark   LUTs   Routing segments   CLB inputs   CLB outputs
alu4        0.20   0.43               0.14         0.23
apex2       0.17   0.49               0.15         0.20
apex4       0.17   0.46               0.17         0.20
bigkey      0.12   0.53               0.18         0.18
clma        0.19   0.44               0.14         0.22
des         0.17   0.46               0.17         0.20
diffeq      0.34   0.13               0.13         0.39
dsip        0.12   0.53               0.18         0.18
elliptic    0.25   0.31               0.15         0.29
ex1010      0.16   0.55               0.12         0.18
ex5p        0.16   0.47               0.18         0.18
frisc       0.25   0.25               0.21         0.28
misex3      0.18   0.42               0.18         0.21
pdc         0.14   0.59               0.12         0.15
s298        0.22   0.33               0.20         0.25
s38417      0.22   0.33               0.18         0.27
s38584.1    0.20   0.34               0.22         0.24
seq         0.18   0.44               0.18         0.21
spla        0.14   0.54               0.16         0.16
tseng       0.26   0.26               0.17         0.31
Minimum     0.12   0.13               0.12         0.15
Maximum     0.34   0.59               0.22         0.39
Average     0.19   0.42               0.17         0.23

Clearly, there is a great deal of variation between the benchmarks, in particular in the relative demands placed on the LUTs versus the routing segments. The optimization of an FPGA must attempt to balance these different needs and, therefore, it seems appropriate to consider using these average path statistics to determine the representative delay weights. Before examining the use of these weights, the LUT usage will be more thoroughly investigated.
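The normalization step described above can be sketched in a few lines of Python. The component counts used here are hypothetical values chosen for illustration, not data from any benchmark, and the function name is likewise an assumption of this sketch.

```python
def normalize_usage(counts):
    """Normalize per-component critical-path counts so that they sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("critical path contains no components")
    return {component: n / total for component, n in counts.items()}


# Hypothetical counts for one critical path (not taken from Table B.1).
example_counts = {
    "LUT": 5,
    "routing segment": 11,
    "CLB input": 4,
    "CLB output": 5,
}

usage = normalize_usage(example_counts)
for component, fraction in usage.items():
    print(f"{component:16s} {fraction:.2f}")
```

Because each benchmark is normalized to its own path length, a short and a long critical path contribute on the same scale when the rows are compared.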

B.1.1 LUT Usage

In the previous results, the usage of the LUT was assumed to be the same in all cases. However, in reality, the specific input to the LUT that is used has a significant effect on the delay of a signal through the LUT. The reason for these differences is the implementation of the LUT as a fully encoded multiplexer structure, as illustrated in Fig. B.1. These speed differences can be significant and, therefore, it is advantageous to use the faster inputs on performance-critical nets. Commercial CAD tools generally perform such optimization [16] when possible and, as a result, the faster LUT inputs appear more frequently on the critical path.

Fig. B.1 Input-dependent delays through the LUT (figure labels: Slow Input, Fast Input, SRAM bits, LUT Output)

This usage of some LUT inputs more than others has potentially important optimization implications because area can potentially be conserved on the less frequently used paths through the LUT. As the LUT uses a significant portion of the FPGA area, such area savings can impact the overall area and performance of the FPGA. To address this, the usage of the LUT inputs was examined. Unfortunately, the CAD tools used in this work do not recognize the timing differences between the LUT inputs and, therefore, the LUT input usage is certainly not optimized. Instead, to gain a sense of the relative importance of the different LUT inputs, the LUT usage for designs implemented with commercial CAD tools was examined. For the set of benchmark circuits in Table A.6, the critical path of each circuit was examined and the LUT input that was used for each LUT on the critical path was tracked.² The results are summarized in Table B.2 for all the benchmarks implemented on different FPGA families. The specific FPGA family is listed in the first column of the table. The remaining columns indicate the normalized usage of each input on the critical path, from the slowest input to the fastest input. Clearly, the fastest input is used most frequently while the remaining inputs are not used as much. In general, the remaining inputs are all used with approximately equal frequency.

² These commercial devices have additional features in the logic element that may require the usage of particular inputs of the LUT. This may have some impact on the LUT usage results.


Table B.2 Usage of LUT inputs

FPGA family   Logic element   LUT input usage (inputs labelled A (Slowest) to F (Fastest); values listed from slowest to fastest)
(missing)     4-LUT           0.215   0.251   0.197   0.336
Cyclone       4-LUT           0.243   0.251   0.187   0.319
Cyclone II    4-LUT           0.214   0.261   0.153   0.372
Stratix II    ALM (6-LUT)     0.099   0.103   0.202   0.117   0.041   0.439

These results, however, only provide statistics for the two commercially used LUT sizes of 4 and 6. Since more LUT sizes will be examined in this work, it is necessary to make some assumptions about the LUT usage. For simplicity, the fastest input will be assumed to be used 50% of the time and the remaining LUT usage will be divided equally amongst the remaining LUT inputs. These relative LUT usage proportions will be used to create a weighted sum of the individual LUT input delays that reflects the overall behaviour of the LUT. With suitable weights now known for the LUTs and all the FPGA components, the usage of these weights to create a representative delay will be examined in Sect. B.2.
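The LUT input weighting assumed above (50% for the fastest input, the rest shared equally) can be written as a short Python sketch. The function names and the example delays are hypothetical; the delays are in arbitrary units and are not measured values.

```python
def lut_input_weights(k, fastest_share=0.5):
    """Weights for a K-input LUT, ordered from slowest input to fastest.

    The fastest input gets `fastest_share`; the remainder is divided
    equally amongst the other K - 1 inputs.
    """
    if k < 2:
        raise ValueError("need at least two LUT inputs")
    other_share = (1.0 - fastest_share) / (k - 1)
    return [other_share] * (k - 1) + [fastest_share]


def weighted_lut_delay(input_delays):
    """Weighted sum of per-input delays, ordered from slowest to fastest."""
    weights = lut_input_weights(len(input_delays))
    return sum(w * d for w, d in zip(weights, input_delays))


# Hypothetical per-input delays for a 4-LUT (arbitrary units).
example_delays = [4.0, 3.0, 2.0, 1.0]
print(lut_input_weights(4))
print(weighted_lut_delay(example_delays))
```

For a 6-LUT the same rule gives each of the five slower inputs a weight of 0.1.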

B.2 Representative Delay Weights

The representative delay measurement described in Sect. 4.3.2 attempts to capture the performance of an FPGA with a single overall delay measurement. That overall measurement is computed as the weighted combination of the delays of the FPGA components. The results from Sect. B.1 provided a measure of the relative usage of the components within the FPGA and that is one possible weighting that can be applied to the component delays. However, there are other possible weightings and, in this section, a range of weightings will be examined. The full list of weightings that will be tested is given in Table B.3. (Note that weighting number 1 approximately matches the average benchmark characteristics from Table B.1. It does not match precisely because a different approach was used for calculating the average characteristics when this work was performed.) Only a single routing weight is used as there was only a single type of routing track in the test architecture. Similarly, the LUT weight is the weight for all LUT inputs and the weight amongst the different input cases is split as described above.

These different weightings were used to create different representative path delays. The optimization process described in Chap. 4 was then used to produce different FPGA designs. For this optimization, an objective function of Area^0·Delay^1 was used. The area and delay of the design produced for each different weighting were then determined using the standard experimental process with the full CAD flow as described in Sect. 5.1. These area and delay results are plotted in Fig. B.2. The Y-axis is the geometric mean delay for the benchmark circuits and the X-axis refers to the area required to implement all the benchmark designs.
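A minimal Python sketch of the weighted combination and of the objective function is given below. The weights are those of weighting number 1 in Table B.3; the component delays, the area value, the dictionary keys and the division by the total weight are assumptions made for illustration only.

```python
# Weighting number 1 from Table B.3 (LUT, routing segment, CLB input, CLB output).
WEIGHTING_1 = {"lut": 20, "routing": 40, "clb_in": 17, "clb_out": 23}


def representative_delay(delays, weights):
    """Weighted combination of component delays.

    Dividing by the total weight is an assumption of this sketch; it only
    rescales the result and does not change which design is preferred.
    """
    total_weight = sum(weights.values())
    return sum(weights[name] * delays[name] for name in weights) / total_weight


def objective(area, delay, area_exp=0, delay_exp=1):
    """Objective of the form Area^a * Delay^d (a = 0, d = 1 in this appendix)."""
    return (area ** area_exp) * (delay ** delay_exp)


# Hypothetical component delays in picoseconds (not measured values).
example_delays = {"lut": 200.0, "routing": 300.0, "clb_in": 150.0, "clb_out": 100.0}

t_rep = representative_delay(example_delays, WEIGHTING_1)
print(t_rep)
print(objective(area=5.0, delay=t_rep))
```

With an area exponent of 0 the objective reduces to the representative delay alone, so the area argument has no influence on the value being minimized.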


Table B.3 Representative path weighting test weights

Weighting   LUT                Routing segment   CLB input   CLB output
            w_LUT, w_BLE in    w_routing,i       w_CLB in    w_CLB out
1           20                 40                17          23
2           10                 50                17          23
3           30                 30                17          23
4           40                 20                17          23
5           50                 10                17          23
6           20                 47                10          23
7           20                 42                15          23
8           20                 37                20          23
9           20                 32                25          23
10          20                 27                30          23
11          20                 53                17          10
12          20                 48                17          15
13          20                 43                17          20
14          20                 38                17          25
15          20                 33                17          30
16          20                 28                17          35
17          30                 10                25.5        34.5
18          26.7               20                22.7        30.7
19          23.3               30                19.8        26.8
20          16.7               50                14.2        19.2
21          13.3               60                11.3        15.3
22          10                 70                8.5         11.5
23          55                 5                 17          23
24          30                 40                7           23
25          35                 40                2           23
26          25                 40                17          18
27          30                 40                17          13
28          35                 40                17          8
29          40                 40                17          3
30          25                 30                17          28
31          30                 20                17          33
32          35                 10                17          38

The figure suggests that the final area and delay of the design do depend on the weighting function used, but the differences are in fact not that large. The slowest design is only 12% slower than the fastest design and the largest design is only 24% larger than the smallest design. These differences are relatively small despite the massive changes in the weightings. For example, Weightings 22 and 23 yielded the smallest and largest designs, respectively, yet the specific weights were widely different. This effectively demonstrates that the final delay and area are not extremely sensitive to the specific weights used for the representative path. Based on this observation, the weights determined from the analysis of the benchmark circuits were used in this work for simplicity. Slight performance improvements could be obtained with the use of one of the alternate weights, but that new weighting would likely only be useful for this particular architecture. For another architecture, a new set of weights would be required because the usage of the components would have changed. It would not be feasible to revisit this issue of weighting for every single architecture and, instead, the same weights were used in all cases. This does indicate a potential avenue for future work that better incorporates the eventual usage of the FPGA components into the optimization process.

Fig. B.2 Area and delay with varied representative path weightings (X-axis: Effective Area (um2); Y-axis: Effective Delay (s), geometric mean delay as measured by HSPICE)

Appendix C Multiplexer Implementations

Multiplexers make up a large portion of an FPGA and, therefore, their design has a significant effect on the overall performance and area of an FPGA. This appendix explores some of the issues surrounding the design of multiplexers to explain and justify the choices made in this book. This complements the work in Sect. 6.5.2, which examined one attribute of multiplexer design: the number of levels. That previous analysis considered the design of the entire FPGA and found that two-level multiplexers were best. Section C.1 revisits this issue of the number of levels in a multiplexer and, in addition, the implementation choices for the multi-level multiplexers will be further examined. For simplicity, this analysis will only consider the design and sizing of the multiplexer while the design of the remainder of the FPGA will be treated as constant.

C.1 Multiplexer Designs

In the earlier investigation of multiplexers, the only design choice examined was that of the number of levels in a multiplexer. That is certainly an important factor as each level adds another pass transistor through which signals must pass. However, for any given number of levels (except for one-level designs), there are generally a number of different implementations possible. For example, a 16-input multiplexer could be implemented in at least three different ways as shown in Fig. C.1. These different implementations will be described in terms of the number of configuration bits at each level of the pass transistor tree. This makes the design in Fig. C.1b an 8:2 implementation, since the first level has 8 bits to control the eight pass transistors in each branch of the tree at this level. In the second and last stage of this multiplexer, there are 2 bits. Some configurations allow for more inputs than required, such as the 6:3 design shown in Fig. C.1c and, in that case, the additional pass transistors could simply be eliminated. However, this creates a non-symmetric multiplexer as some inputs will then be faster than other inputs. In some cases this is clearly unavoidable, such as for a 13-input multiplexer, but, in general, we will avoid these asymmetries and restrict our analysis to completely balanced multiplexers.
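The bits-per-level description above lends itself to a small Python sketch that reports, for a given implementation, the number of configuration bits, the number of inputs the tree can accept and the number of pass transistors in the fully populated tree. The function name and the returned field names are hypothetical, and the transistors inside the configuration memory cells are deliberately not counted.

```python
from math import prod


def mux_structure(num_inputs, bits_per_level):
    """Describe a pass-transistor multiplexer given its bits at each level.

    `bits_per_level` is ordered from the level closest to the inputs
    (Level 1) to the level closest to the output.
    """
    capacity = prod(bits_per_level)
    if capacity < num_inputs:
        raise ValueError("structure cannot accept that many inputs")
    pass_transistors = 0
    signals = capacity  # signals entering the current level (fully populated)
    for bits in bits_per_level:
        pass_transistors += signals  # one pass transistor per incoming signal
        signals //= bits             # each group of `bits` signals merges into one
    return {
        "levels": len(bits_per_level),
        "config_bits": sum(bits_per_level),
        "capacity": capacity,
        "pass_transistors": pass_transistors,
        "balanced": capacity == num_inputs,
    }


for structure in ([4, 4], [8, 2], [6, 3]):
    print(structure, mux_structure(16, structure))
```

For the 6:3 case the sketch reports a capacity of 18, which reflects the two surplus inputs whose pass transistors could be eliminated at the cost of an unbalanced multiplexer.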

Fig. C.1 Two-level 16-input multiplexer implementations: (a) 4:4 implementation, (b) 8:2 implementation, (c) 6:3 implementation

C.2 Evaluation of Multiplexer Designs

We will examine a range of possible designs for both 16-input and 32-input multiplexers. These sizes are particularly interesting because a 16-input multiplexer is within the range of sizes typically found in the programmable routing and a 32-input multiplexer is a typical size seen for the input multiplexers to the BLEs in large clusters. For both the 16-input and 32-input designs, the possibilities considered ranged from a one-level (one-hot) design to four-level designs (which are fully encoded in the case of the 16-input multiplexer). To simplify this investigation, minimum-width transistors were assumed and the area of the multiplexer was measured simply by counting the number of transistors, including the configuration memory bits, in the design. While this is not the preferred analysis approach, it was the most appropriate method at the time when this work was performed. This analysis still provides an indication of the minimum size of a design and its typical performance.

The different 16-input designs are compared in Fig. C.2. Each design is labelled according to the number of configuration bits used in each stage as follows:

(Number of Inputs)_(Bits in Level 1)_(Bits in Level 2)_(Bits in Level 3)_(Bits in Level 4),

where Level 1 refers to the transistors that are closest to the inputs. For example, the label “16_8_2_0_0” describes the two-level 8:2 multiplexer shown in Fig. C.1b.

Fig. C.2 Area–delay trade-offs with varied 16-input multiplexer implementations: (a) transistor count for different topologies of 16-input multiplexer (Y-axis: Area (Number of Transistors)), (b) delay of different topologies of 16-input multiplexer (Y-axis: Multiplexer Delay (s)), (c) area delay for different topologies of 16-input multiplexer (Y-axis: Area Delay (Number of Transistors · Delay)). X-axis in each panel: Multiplexer Structure (16_2_8_0_0, 16_4_4_0_0, 16_8_2_0_0, 16_2_2_4_0, 16_2_4_2_0, 16_4_2_2_0, 16_2_2_2_2, 16_16_0_0_0)

The area (in number of transistors) of the various configurations is shown in Fig. C.2a. The fully encoded design, “16_2_2_2_2”, requires the least area, as expected, and the one-hot encoding requires the most area. There is also significant variability in the areas of the different two-level and three-level designs.

The delay results are shown in Fig. C.2b. The reported delay is for the multiplexer and the following buffer. These results indicate clearly that the most significant factor is the number of multiplexer levels and, as expected, the performance degrades with an increasing number of levels. The performance of the two-level designs is certainly worse than that of the one-level design. The difference in performance is slightly larger than was observed in Sect. 6.5.2 but this is likely due to the poor sizing used for the results in this section.

In Fig. C.2c the different multiplexer configurations are compared in terms of their area-delay product. By this metric, the two-level “16_4_4_0_0” and the three-level “16_4_2_2_0” designs are very similar. The lower delay of the two-level 4:4 design clearly makes it the preferred choice.

Similar trends can be seen in Fig. C.3, which plots the results for the 32-input multiplexer. Figure C.3a summarizes the area of the different designs. The one-level design requires the most area by far and the remainder of the designs, with a few exceptions, have relatively similar area requirements. The overall trend is unchanged from the 16-input multiplexers as increasing the number of levels typically decreases the area. The delay results are shown in Fig. C.3b. It is notable that the one-level design no longer offers the best performance and, instead, the best performance is obtained with the “32_8_4_0_0” design. As was seen with the 16-input designs, the three-level and four-level designs have longer delays. Finally, in Fig. C.3c, the area and delay measurements for each design are combined as the area-delay product. Again, some of the two-level and three-level designs achieve similar results but, with its lower delay, the two-level design is a more useful choice.

Fig. C.3 Area–delay trade-offs with varied 32-input multiplexer implementations: (a) transistor count for different topologies of 32-input multiplexer (Y-axis: Area (Number of Transistors)), (b) delay of different topologies of 32-input multiplexer (Y-axis: Multiplexer Delay (s)), (c) area delay for different topologies of 32-input multiplexer (Y-axis: Area Delay (Number of Transistors · Delay)). X-axis in each panel: Multiplexer Structure (32_4_8_0_0, 32_8_4_0_0, 32_2_2_8_0, 32_2_4_4_0, 32_2_8_2_0, 32_4_2_4_0, 32_4_4_2_0, 32_8_2_2_0, 32_2_2_2_4, 32_2_2_4_2, 32_2_4_2_2, 32_4_2_2_2, 32_32_0_0_0, 32_2_16_0_0, 32_16_2_0_0)

These results for the 16-input and the 32-input multiplexers confirm the observations made in Sect. 6.5.2 that two-level designs are the most effective choice. It is also clear from these results that the number of levels could be useful for making area and delay trade-offs, as increasing the number of levels offers area savings but that comes at the cost of degraded performance. However, the same potential opportunity for trade-offs does not appear to exist when changing designs for any particular fixed number of levels because one design tended to offer both the best area and performance. Therefore, only the number of levels in a multiplexer was explored in Sect. 6.5.2. (However, the results in Sect. 6.5.2 found that in practice the number of levels did not enable useful trade-offs.)

These results do indicate that the specific design for any number of levels should be selected judiciously. For example, the “32_2_16_0_0” design is both slow and requires a lot of area despite being a two-level design. In this work, two-level designs were selected based on two factors. First, the number of configuration bits was minimized. Second, amongst the designs with the same number of configuration bits, the design that puts the larger number of pass transistors closer to the input of the multiplexer (Level 1) was used. This intuitively makes sense as it puts the larger capacitive load on a lower resistance path to the driver.
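The two selection factors can be expressed as a short Python sketch. It considers only completely balanced two-level designs, that is, pairs of level sizes whose product equals the number of inputs; the function names are hypothetical and the sketch is an illustration of the stated rule rather than the tool actually used in this work.

```python
def two_level_candidates(num_inputs):
    """All balanced two-level designs as (Level 1 bits, Level 2 bits) pairs."""
    return [
        (a, num_inputs // a)
        for a in range(2, num_inputs)
        if num_inputs % a == 0
    ]


def select_two_level(num_inputs):
    """Pick the design with the fewest configuration bits.

    Ties are broken in favour of the larger Level 1, which places the
    larger number of pass transistors closer to the multiplexer inputs.
    """
    candidates = two_level_candidates(num_inputs)
    if not candidates:
        raise ValueError("no balanced two-level design exists")
    return min(candidates, key=lambda ab: (ab[0] + ab[1], -ab[0]))


def structure_label(num_inputs):
    """Label in the notation of this appendix, padded to four levels."""
    level1, level2 = select_two_level(num_inputs)
    return f"{num_inputs}_{level1}_{level2}_0_0"


print(structure_label(16))
print(structure_label(32))
```

For 32 inputs the 4:8 and 8:4 designs both need 12 configuration bits, so the tie-break on Level 1 is what decides between them.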

Appendix D Architectures Used for Area and Delay Range Investigation

This appendix describes the architectures that were used for the design space exploration in Chap. 6. The specific parameters that were varied for this exploration are summarized in Table D.1 and the specific architectures used are listed in Table D.2. The headings in Table D.2 refer to the abbreviations described in Table D.1. In all cases, the intra-cluster routing was fully populated.

Table D.1 Parameters considered for design space exploration

Parameter                                                                                           Symbol
LUT size                                                                                            K
Cluster size                                                                                        N
Routing track length type 1                                                                         L1
Fraction of tracks of length type 1                                                                 F1
Routing track length type 2                                                                         L2
Fraction of tracks of length type 2                                                                 F2
Input connection block flexibility (as a fraction of the channel width)                             Fc,in
Output connection block flexibility (as a fraction of the channel width)                            Fc,out
Channel width                                                                                       W
Number of inputs to logic block                                                                     I
Number of input/output pins per row or column of logic blocks on each side of array                 I/Os per row/col


Table D.2 Architectures used for design space exploration. A dash (-) marks a cell with no entry.

N   K  L1  F1     L2  F2     Fc,in  Fc,out  W    I   I/Os per row/col
2   6  4   1      -   -      0.25   0.5     56   9   3
2   3  4   1      -   -      0.25   0.5     48   5   2
2   4  4   1      -   -      0.25   0.5     56   6   2
2   5  4   1      -   -      0.25   0.5     56   8   2
4   2  4   1      -   -      0.2    0.25    56   5   2
4   3  4   1      -   -      0.2    0.25    64   8   2
4   4  4   1      -   -      0.25   0.25    64   10  4
4   5  4   1      -   -      0.2    0.25    56   13  4
4   6  4   1      -   -      0.2    0.25    56   15  4
4   7  4   1      -   -      0.2    0.25    56   18  5
6   2  4   1      -   -      0.15   0.167   64   7   2
6   3  4   1      -   -      0.2    0.167   80   11  3
6   4  4   1      -   -      0.25   0.16    80   14  4
6   6  4   1      -   -      0.2    0.167   80   21  6
8   2  4   1      -   -      0.15   0.125   80   9   3
8   3  2   1      -   -      0.25   0.125   76   14  8
8   3  4   1      -   -      0.2    0.125   88   14  4
8   4  1   1      -   -      0.25   0.125   82   18  8
8   4  2   1      -   -      0.25   0.1     84   18  4
8   4  4   1      -   -      0.25   0.1     88   18  4
8   4  6   1      -   -      0.25   0.1     96   18  4
8   5  2   1      -   -      0.25   0.125   80   23  8
8   5  4   1      -   -      0.2    0.125   88   23  6
8   6  2   1      -   -      0.25   0.125   76   27  8
8   6  4   1      -   -      0.2    0.125   88   27  7
8   7  2   1      -   -      0.25   0.125   80   32  8
8   7  4   1      -   -      0.2    0.125   96   32  8
10  2  4   1      -   -      0.2    0.1     80   11  3
10  3  1   1      -   -      0.25   0.1     88   17  8
10  3  2   1      -   -      0.25   0.1     92   17  8
10  3  4   1      -   -      0.2    0.1     104  17  6
10  3  4   1      -   -      0.3    0.1     104  17  6
10  4  1   1      -   -      0.2    0.1     96   22  4
10  4  2   1      -   -      0.2    0.1     96   22  4
10  4  4   1      -   -      0.2    0.1     104  22  7
10  4  4   1      -   -      0.3    0.1     104  22  7
10  4  4   0.667  10  0.333  0.2    0.1     120  22  4
10  4  4   0.706  10  0.294  0.2    0.1     136  22  4
10  4  4   0.8    10  0.2    0.2    0.1     100  22  4
10  4  6   1      -   -      0.25   0.1     120  22  4
10  4  8   1      -   -      0.2    0.1     128  22  4
10  5  1   1      -   -      0.25   0.1     84   28  8
10  5  2   1      -   -      0.25   0.1     80   28  8
10  5  4   1      -   -      0.2    0.1     96   28  8
10  5  4   1      -   -      0.3    0.1     96   28  8
10  6  1   1      -   -      0.25   0.1     84   33  8
10  6  2   1      -   -      0.25   0.1     80   33  8
10  6  4   1      -   -      0.15   0.1     96   33  8
10  6  4   1      -   -      0.3    0.1     96   33  8
10  7  1   1      -   -      0.25   0.1     92   39  8
10  7  2   1      -   -      0.25   0.1     92   39  8
10  7  4   1      -   -      -      -       96   39  8
10  4  4   1      -   -      0.2    0.1     104  22  4
10  4  4   0.5    8   0.5    0.2    0.1     128  22  4
12  2  4   1      -   -      0.15   0.0833  96   13  4
12  3  4   1      -   -      0.2    0.0833  104  20  5
12  4  4   1      -   -      0.2    0.0833  104  26  7
12  5  4   1      -   -      0.2    0.0833  104  33  9
12  6  4   1      -   -      0.2    0.0833  104  39  10
12  7  4   1      -   -      0.2    0.0833  104  46  12
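As a rough illustration of how one row of Table D.2 might be held in a program, the sketch below defines a record with one field per parameter of Table D.1. The class name `Architecture`, the field names, and the check that the two track fractions sum to one are assumptions made for this example. The two sample rows are copied from the mixed-length entries of Table D.2.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Architecture:
    """One row of Table D.2 (field names are illustrative only)."""
    N: int                     # cluster size
    K: int                     # LUT size
    L1: int                    # routing track length type 1
    F1: float                  # fraction of tracks of length type 1
    Fc_in: float               # input connection block flexibility
    Fc_out: float              # output connection block flexibility
    W: int                     # channel width
    I: int                     # number of inputs to the logic block
    ios_per_row_col: int       # I/Os per row/col
    L2: Optional[int] = None   # routing track length type 2, if any
    F2: Optional[float] = None  # fraction of tracks of length type 2, if any

    def track_fractions_sum_to_one(self) -> bool:
        # Assumed sanity check: F1 and F2 together should cover all tracks.
        total = self.F1 + (self.F2 if self.F2 is not None else 0.0)
        return math.isclose(total, 1.0, abs_tol=1e-3)


MIXED_LENGTH_ROWS = [
    Architecture(N=10, K=4, L1=4, F1=0.667, L2=10, F2=0.333,
                 Fc_in=0.2, Fc_out=0.1, W=120, I=22, ios_per_row_col=4),
    Architecture(N=10, K=4, L1=4, F1=0.5, L2=8, F2=0.5,
                 Fc_in=0.2, Fc_out=0.1, W=128, I=22, ios_per_row_col=4),
]

if __name__ == "__main__":
    for arch in MIXED_LENGTH_ROWS:
        print(arch.W, arch.track_fractions_sum_to_one())
```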

Appendix E Logical Architecture to Transistor Sizing Process

This appendix reviews the main steps in translating a logical architecture into an optimized transistor-level netlist. This will be done by way of example using the baseline architecture from Chap. 6. The logical architecture parameters for this design are listed in Table E.1.

Starting from the architecture description, the widths (or fan-ins) of the multiplexers in the design must first be determined. For the architectures considered in this work, there are three multiplexers whose width must be determined. The Routing Mux is the multiplexer used within the inter-block routing. Determining this width is rather involved due to rounding issues and the possibility of tracks with multiple different segment lengths. However, for the baseline architecture, the width can be approximately computed from the parameters in Table E.1 as follows

$$\text{Width}_{\text{Routing Mux}} = \frac{\frac{2W}{L} F_s + \left(2W - \frac{2W}{L}\right)(F_s - 1) + F_{c,\text{output}} W N}{\frac{2W}{L}} \approx 12. \tag{E.1}$$

The width of 12 would not be obtained if the numbers from Table E.1 were substituted into the equation, because rounding steps are omitted from the equation for simplicity. The next multiplexer to be considered is the CLB Input Mux, which is used within the input connection block to connect the inter-block routing into the logic block. The width of this multiplexer is

$$\text{Width}_{\text{CLB Input Mux}} = F_{c,\text{input}} W = 0.2 \times 104 \approx 22, \tag{E.2}$$

where again the rounding process has been omitted for simplicity. Finally, the width of the multiplexers that connect the intra-cluster routing to the BLEs is determined. This multiplexer is known as the BLE Input Mux and its width is determined as follows

$$\text{Width}_{\text{BLE Input Mux}} = I + N = 22 + 10 = 32. \tag{E.3}$$

There is an additional multiplexer inside the BLE but, for the architectures considered, this multiplexer, the CLB Output Mux, will always have two inputs, one from the LUT and one from the flip-flop.
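Equations (E.1) to (E.3) can be evaluated directly with the values from Table E.1. The sketch below is only illustrative: the function names are hypothetical, and the form of (E.1) is the reading assumed here, in which 2W/L multiplies Fs and also divides the whole sum. No rounding is applied, so the first two results (11 and 20.8) fall short of the quoted widths of 12 and 22, in line with the remark that the rounding steps are omitted.

```python
def routing_mux_width(W, L, Fs, Fc_output, N):
    """Unrounded Routing Mux width, following the assumed form of Eq. (E.1)."""
    per_length = 2 * W / L  # the 2W/L term of Eq. (E.1)
    numerator = (per_length * Fs
                 + (2 * W - per_length) * (Fs - 1)
                 + Fc_output * W * N)
    return numerator / per_length


def clb_input_mux_width(Fc_input, W):
    """Unrounded CLB Input Mux width, Eq. (E.2)."""
    return Fc_input * W


def ble_input_mux_width(I, N):
    """BLE Input Mux width, Eq. (E.3)."""
    return I + N


if __name__ == "__main__":
    # Baseline architecture parameters from Table E.1.
    W, L, Fs, N, I = 104, 4, 3, 10, 22
    Fc_input, Fc_output = 0.2, 0.1

    print(f"Routing Mux:   {routing_mux_width(W, L, Fs, Fc_output, N):.1f}")
    print(f"CLB Input Mux: {clb_input_mux_width(Fc_input, W):.1f}")
    print(f"BLE Input Mux: {ble_input_mux_width(I, N)}")
```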


Table E.1 Architecture parameters

Parameter                    Value
LUT size, k                  4
Cluster size, N              10
Number of cluster inputs, I  22
Tracks per channel, W        104
Track length, L              4
Interconnect style           Unidirectional
Driver style                 Single driver
Fc,input                     0.2
Fc,output                    0.1
Fs                           3
Pads per row/column          4

With the widths of the multiplexers known, appropriate implementations must be determined. A number of implementation choices were examined in both Chap. 6 and Appendix C. The specific implementation for each multiplexer will be selected based on the input electrical parameters.

The transistor-level implementation of the remaining components of the FPGA is straightforward. Buffers, with level-restorers, are necessary after all the multiplexers. If desired, buffers can also be added prior to the multiplexers; however, for this example, no such buffers will be added. The LUT is implemented as a fully encoded multiplexer. Buffers can be added inside the pass transistor tree as needed. For this particular design, such buffers will not be added.

Once these decisions have been made, the complete structure of the FPGA is known. The transistor sizes within this structure must then be optimized. This is done using the optimizer described in Chap. 4. For this analysis, sizing will be performed with the goal of minimizing the Area–Delay product. The resulting transistor sizes are listed in Table E.2. (All parameters corresponding to transistor lengths and widths in the table are specified in µm.) The meaning of the different transistor size parameters is illustrated by the labels in Fig. E.1. For the buffers in the parameter list, stage0 refers to the inverter stage within the buffer that is closest to the input. Similarly, level0 for the multiplexers refers to the pass-transistor grouping that is closest to the input. The multiplicity of each multiplexer stage refers to the number of pass transistors within each group of transistors at each level of the multiplexer. Equivalently, the multiplicity is also the number of configuration memory bits needed at each level.

Once these sizes have been determined, the transistor-level design of the FPGA is complete. The effective area and delay for this design can then be assessed using the full experimental process described in Chap. 6.

[Figure: schematic of a logic cluster (BLE 1 to BLE N, each with a K-LUT and flip-flop), its input connection block, and a routing track, labelled with MUX_ROUTING, BUFFER_ROUTING_POST, MUX_CLB_INPUT, BUFFER_CLB_INPUT_POST, MUX_LE_INPUT, BUFFER_LE_INPUT_POST, BUFFER_LUT_POST, MUX_CLB_OUTPUT, and BUFFER_CLB_OUTPUT_POST]

Fig. E.1 Terminology for transistor sizes

Table E.2 Transistor sizes for example architecture

Parameter                                   Value
MUX_CLB_INPUT_num_levels                    2.00
MUX_CLB_INPUT_level0_width                  0.24
MUX_CLB_INPUT_level0_multiplicity           6.00
MUX_CLB_INPUT_level1_width                  0.24
MUX_CLB_INPUT_level1_multiplicity           4.00
BUFFER_CLB_INPUT_POST_num_stages            2.00
BUFFER_CLB_INPUT_POST_stage0_nmos_width     0.84
BUFFER_CLB_INPUT_POST_stage0_pmos_width     0.42
BUFFER_CLB_INPUT_POST_stage1_nmos_width     0.84
BUFFER_CLB_INPUT_POST_stage1_pmos_width     1.34
BUFFER_CLB_INPUT_POST_pullup_width          0.24
BUFFER_CLB_INPUT_POST_pullup_length         0.50
MUX_ROUTING_num_levels                      2.00
MUX_ROUTING_level0_width                    1.64
MUX_ROUTING_level0_multiplicity             4.00
MUX_ROUTING_level1_width                    1.84
MUX_ROUTING_level1_multiplicity             3.00
BUFFER_ROUTING_POST_num_stages              2.00
BUFFER_ROUTING_POST_stage0_nmos_width       5.34
BUFFER_ROUTING_POST_stage0_pmos_width       2.67
BUFFER_ROUTING_POST_stage1_nmos_width       5.34
BUFFER_ROUTING_POST_stage1_pmos_width       8.01
BUFFER_ROUTING_POST_pullup_width            0.24
BUFFER_ROUTING_POST_pullup_length           0.50
MUX_LE_INPUT_num_levels                     2.00
MUX_LE_INPUT_level0_width                   0.24
MUX_LE_INPUT_level0_multiplicity            8.00
MUX_LE_INPUT_level1_width                   0.24
MUX_LE_INPUT_level1_multiplicity            4.00
BUFFER_LE_INPUT_POST_num_stages             1.00
BUFFER_LE_INPUT_POST_stage0_nmos_width      0.64
BUFFER_LE_INPUT_POST_stage0_pmos_width      0.32
BUFFER_LE_INPUT_POST_pullup_width           0.24
BUFFER_LE_INPUT_POST_pullup_length          0.50
LUT_LUT0_stage0_width                       0.24
LUT_LUT0_stage1_width                       0.34
LUT_LUT0_stage2_width                       0.34
LUT_LUT0_stage3_width                       0.24
LUT_LUT0_stage0_buffer_nmos_width           0.34
LUT_LUT0_stage0_buffer_pmos_width           0.24
LUT_LUT0_stage_pullup_length                0.40
LUT_LUT0_stage_pullup_width                 0.24
LUT_LUT0_signal_buffer_stage0_nmos_width    0.34
LUT_LUT0_signal_buffer_stage0_pmos_width    0.41
LUT_LUT0_signal_buffer_stage1_nmos_width    0.24
LUT_LUT0_signal_buffer_stage1_pmos_width    0.34
BUFFER_LUT_POST_num_stages                  2.00
BUFFER_LUT_POST_stage0_nmos_width           0.54
BUFFER_LUT_POST_stage0_pmos_width           0.38
BUFFER_LUT_POST_stage1_nmos_width           1.08
BUFFER_LUT_POST_stage1_pmos_width           1.30
BUFFER_LUT_POST_pullup_width                0.24
BUFFER_LUT_POST_pullup_length               0.50
MUX_CLB_OUTPUT_num_levels                   1.00
MUX_CLB_OUTPUT_level0_width                 2.74
MUX_CLB_OUTPUT_level0_multiplicity          2.00
BUFFER_CLB_OUTPUT_POST_stage0_nmos_widths   3.94
BUFFER_CLB_OUTPUT_POST_stage0_pmos_widths   3.15
BUFFER_CLB_OUTPUT_POST_stage1_nmos_widths   3.94
BUFFER_CLB_OUTPUT_POST_stage1_pmos_widths   5.12
BUFFER_CLB_OUTPUT_POST_pullup_widths        0.24
BUFFER_CLB_OUTPUT_POST_pullup_lengths       0.50
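The multiplicities in Table E.2 can be cross-checked against the multiplexer widths derived earlier in this appendix. The sketch below assumes that the number of inputs a multi-level multiplexer can select from is the product of its per-level multiplicities, and it uses the statement that each level needs as many configuration memory bits as its multiplicity. Matching MUX_LE_INPUT to the BLE Input Mux and MUX_CLB_OUTPUT to the two-input CLB Output Mux is also an assumption based on the names. The helper `capacity_and_bits` is hypothetical.

```python
from math import prod

# name: (required width, per-level multiplicities from Table E.2)
MUXES = {
    "MUX_CLB_INPUT": (22, (6, 4)),
    "MUX_ROUTING": (12, (4, 3)),
    "MUX_LE_INPUT": (32, (8, 4)),
    "MUX_CLB_OUTPUT": (2, (2,)),
}


def capacity_and_bits(multiplicities):
    """Return (selectable inputs, configuration bits) for a multiplexer.

    Selectable inputs are taken as the product of the multiplicities
    (an assumption); configuration bits are their sum.
    """
    return prod(multiplicities), sum(multiplicities)


if __name__ == "__main__":
    for name, (width, mults) in MUXES.items():
        capacity, bits = capacity_and_bits(mults)
        status = "ok" if capacity >= width else "too small"
        print(f"{name}: width {width}, capacity {capacity}, "
              f"{bits} configuration bits ({status})")
```

With these assumptions every multiplexer in the table has enough capacity for its width, and the CLB Input Mux, with a capacity of 24 for a width of 22, is the only one that leaves inputs unused.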

References

[1] International Technology Roadmap for Semiconductors, 2007 Edition (2007), http://www.itrs.net/reports.html [2] Corporation: Act 1 series FPGAs (1996), http://www.actel.com/ documents/ACT1 DS.pdf [3] Actel Corporation: Axcelerator family FPGAs (2005), http://www.actel.com/ documents/AX DS.pdf [4] Actel Corporation: SX-A Family FPGAs v5.3 (2007), http://www.actel.com/ documents/SXA DS.pdf [5] Actel Corporation: ProASIC3 flash family FPGAs (2008), http://www.actel. com/documents/PA3 DS.pdf [6] Aggarwal, A., Lewis, D.: Routing architectures for hierarchical field- programmable gate arrays. In: IEEE International Conference on Computer Design, pp. 475–478 (1994) [7] Ahmed, E.: The effect of logic block granularity on deep-submicron FPGA performance and density. Master’s thesis, University of Toronto (2001). http://www.eecg.toronto.edu/∼jayar/pubs/theses/Ahmed/EliasAhmed.pdf [8] Ahmed, E., Rose, J.: The effect of LUT and cluster size on deep-submicron FPGA performance and density. In: FPGA ’00: Proceedings of the 2000 ACM/SIGDA Eighth International Symposium on Field Programmable Gate Arrays, pp. 3–12. ACM, New York, NY (2000), DOI http://doi.acm.org/ 10.1145/329166.329171 [9] Ahmed, E., Rose, J.: The effect of LUT and cluster size on deep-submicron FPGA performance and density. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 12(3), 288–298 (2004) [10] Aken’Ova, V., Lemieux, G., Saleh, R.: An improved “soft” eFPGA design and implementation strategy. In: Custom Integrated Circuits Conference, 2005. Proceedings of the IEEE, pp. 178–181 (2005) [11] Alpert, C., Chan, T., Kahng, A., Markov, I., Mulet, P.: Faster minimization of linear wirelength for global placement. IEEE Transactions on Computer- Aided Design of Integrated Circuits and Systems 17(1), 3–13 (1998), DOI 10.1109/43.673628

165 book 4/9/2009 16: 24 page 166

166 References

[12] Corporation: APEX II programmable logic device family, DS-APEXII- 3.0 (2002). http://www.altera.com/literature/ds/ds ap2.pdf [13] Altera Corporation: APEX 20K programmable logic device family data sheet, DS-APEX20K-5.1 (2004). http://www.altera.com/literature/ds/apex. pdf [14] Altera Corporation: Partnership with TSMC yields first silicon success on Al- tera’s 90-nm, low-k products (2004). http://www.altera.com/corporate/news room/releases/releases archive/2004/products/nr-tsmc partnership.html [15] Altera Corporation: Altera demonstrates 90-nm leadership by shipping world’s highest-density, highest-performance FPGA (2005). http://www. altera.com/corporate/news room/releases/releases archive/2005/products/nr- ep2s180 shipping.html [16] Altera Corporation: Quartus II Development Software Handbook, 5.0 edn. (2005). http://www.altera.com/literature/hb/qts/quartusii handbook. pdf [17] Altera Corporation: Stratix II vs. Virtex-4 power comparison & estimation accuracy. White Paper (2005). http://www.altera.com/literature/wp/wp s2v4 pwr acc.pdf [18] Altera Corporation: The Industry’s Biggest FPGAs (2005). http://www.altera. com/products/devices/stratix2/features/density/st2-density.html [19] Altera Corporation: Stratix device family data sheet, volume 1, S5V1-3.4 (2006). http://www.altera.com/literature/hb/stx/stratix vol 1.pdf [20] Altera Corporation: Cyclone III device handbook (2007). Ver. CIII5V1-1.2 http://www.altera.com/literature/hb/cyc3/cyclone3 handbook.pdf [21] Altera Corporation: Stratix II Device Handbook SII5V1-4.3 (2007). http:// www.altera.com/literature/hb/stx2/stratix2 handbook.pdf [22] Altera Corporation: Stratix III device handbook (2007). SIII5V1-1.4 http:// www.altera.com/literature/hb/stx3/stratix3 handbook.pdf [23] Altera Corporation: Cyclone II device handbook (2008). Ver. CII5V1-3.3 http:// www.altera.com/literature/hb/cyc2/cyc2 cii5v1.pdf [24] Altera Corporation: HardCopyASICs: Technology for business (2008). 
http:// www.altera.com/products/devices/hardcopy/hrd-index.html [25] Altera Corporation: Stratix IV Device Handbook Volumes 1–4 SIV5V1- 1.0 (2008). http://www.altera.com/literature/hb/stratix-iv/stratix4 handbook. pdf [26] Altera Corporation: Stratix IV FPGA power management and advantages WP- 01057-1.0 (2008). http://www.altera.com/literature/wp/wp-01059-stratix-iv- 40nm-power-management.pdf [27] Anderson, J., Najm, F.:A novel low-power FPGA routing switch. In: Proceed- ings of the IEEE 2004 Custom Ingretated Circuits Conference, pp. 719–722 (2004) [28] Anderson, J.H., Najm, F.N.: Low-power programmable routing circuitry for FPGAs. In: IEEE/ACM International Conference on Computer Aided De- sign 2004, pp. 602–609. IEEE Computer Society, Washington, DC (2004), DOI http://dx.doi.org/10.1109/ICCAD.2004.1382647 book 4/9/2009 16: 24 page 167

References 167

[29] Azizi, N., Kuon, I., Egier, A., Darabiha, A., Chow, P.: Reconfigurable molec- ular dynamics simulator. In: FCCM ’04: Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 197–206. IEEE Computer Society, Washington, DC (2004) [30] Bai, X., Visweswariah, C., Strenski, P.N.: Uncertainty-aware circuit optimiza- tion. In: DAC ’02: Proceedings of the 39th Conference on DesignAutomation, pp. 58–63. ACM Press, New York, NY (2002), DOI http://doi.acm.org/ 10.1145/513918.513935 [31] Bauer, T.: . Private Communication [32] Betz, V., Rose, J.: VPR: A new packing, placement and routing tool for FPGA research. In: Seventh International Workshop on Field-Programmable Logic and Applications, pp. 213–222 (1997), DOI 10.1007/3-540-63465-7 [33] Betz, V., Rose, J.: Circuit design, transistor sizing and wire layout of FPGA interconnect. In: Proceedings of the 1999 IEEE Custom Integrated Circuits Conference, pp. 171–174 (1999) [34] Betz, V., Rose, J., Marquardt, A.: Architecture and CAD for Deep-Submicron FPGAs. Kluwer, New York, NY (1999) [35] Boese, K.D., Kahng, A.B., McCoy, B.A., Robins, G.: Fidelity and near- optimality of Elmore-based routing constructions. In: Proceedings of 1993 IEEE International Conference on Computer Design: VLSI in Computers and Processors ICCD’93, pp. 81–84 (1993), DOI 10.1109/ICCD.1993.393400 [36] Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2003) [37] Brayton, R., Hachtel, G., Sangiovanni-Vincentelli, A.: Multilevel logic synthesis. Proceedings of the IEEE 78(2), 264–300 (1990) [38] Brown, S., Rose, J., Vranesic, Z.: A detailed router for field-programmable gate arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 11(5), 620–628 (1992), DOI 10.1109/43.127623 [39] Brown, S.D., Francis, R., Rose, J., Vranesic, Z.: Field-Programmable Gate Arrays. 
Kluwer, New York, NY (1992) [40] Cadence: Encounter Design Flow Guide and Tutorial, Product Version 3.3.1 (2004) [41] Cao, Y., Sato, T., Sylvester, D., Orshansky, M., Hu, C.: New paradigm of predictive MOSFET and interconnect modeling for early circuit design. In: Proceedings of the IEEE 2000 Custom Ingretated Circuits Conference, pp. 201–204 (2000) [42] Capitanio, E., Nobile, M., Renard, D.: Removing aluminum cap in 90 nm copper technology (2006). http://www.imec.be/efug/EFUG2006 Renard.pdf [43] Chan, M., Leventis, P., Lewis, D., Zaveri, K.,Yi, H.M., Lane, C.: Redundancy structures and methods in a programmable logic device (2007). US Patent 7,180,324 [44] Chandra, V., Schmit, H.: Simultaneous optimization of driving buffer and routing switch sizes in an FPGA using an iso-area approach. In: Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI’02), pp. 28–33 (2002), DOI 10.1109/ISVLSI.2002.1016870 book 4/9/2009 16: 24 page 168

168 References

[45] Chang, A., Dally, W.J.: Explaining the gap between ASIC and custom power: a custom perspective. In: DAC ’05: Proceedings of the 42nd annual con- ference on Design automation, pp. 281–284. ACM, New York, NY (2005), DOI http://doi.acm.org/10.1145/1065579.1065652 [46] Chen, C.P., Chu, C.C.N., Wong, D.F.: Fast and exact simultaneous gate and wire sizing by langrangian relaxation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 18(7), 1014–1025 (1999), DOI 10.1109/43.771182 [47] Cheng, L., Li, F., Lin, Y., Wong, P., He, L.: Device and architecture cooptimization for FPGA power reduction. IEEE Transactions on Computer- Aided Design of Integrated Circuits and Systems 26(7), 1211–1221 (2007), DOI 10.1109/TCAD.2006.888289 [48] Chinnery, D., Keutzer, K.: Closing the Gap Between ASIC & Custom Tools and Techniques for High-Performance ASIC Design. Kluwer, New York, NY (2002) [49] Chinnery, D.G., Keutzer, K.: Closing the power gap between ASIC and cus- tom: an ASIC perspective. In: DAC ’05: Proceedings of the 42nd annual conference on Design automation, pp. 275–280. ACM Press, New York, NY (2005), DOI http://doi.acm.org/10.1145/1065579.1065651 [50] Chow, P., Karchmer, D., White, R., Ngai, T., Hodgins, P.,Yeh, D., Ranaweera, J., Widjaja, I., Leon-Garcia, A.: A 50,000 transistor packet-switching chip for the Starburst ATMswitch. In: Custom Integrated Circuits Conference, 1995, Proceedings of the IEEE 1995, pp. 435–438 (1995) [51] Clein, D.: CMOS IC Layout : Concepts, Methodologies and Tools. Elsevier, Amsterdam (2000) [52] Cliff, R.: Altera Corporation. Private Communication [53] Compton, K., Hauck, S.: Automatic design of area-efficient configurable ASIC cores. IEEE Transactions on Computers 56(5), 662–672 (2007), DOI 10.1109/TC.2007.1035 [54] Compton, K., Sharma,A., Phillips, S., Hauck, S.: Flexible routing architecture generation for domain-specific reconfigurable subsystems. 
In: International Conference on Field Programmable Logic andApplications, pp. 59–68 (2002) [55] Cong, J., Ding, Y.: FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 13(1), 1–12 (1994) [56] Cong, J., Ding, Y.: On area/depth trade-off in LUT-based FPGA technology mapping. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2(2), 137–148 (1994), DOI 10.1109/92.285741 [57] Cong, J., He, L.: Optimal wiresizing for interconnects with multiple sources. ACM Transactions on Design Automation of Electronic Systems (TODAES) 1(4), 478–511 (1996), DOI http://doi.acm.org/10.1145/238997.239018 [58] Cong, J., Peck, J., Ding, Y.: RASP: a general logic synthesis system for SRAM-based FPGAs. In: FPGA ’96: Proceedings of the 1996 ACM Fourth International Symposium on Field-Programmable Gate Arrays, pp. 137–143. book 4/9/2009 16: 24 page 169

References 169

ACM, New York, NY (1996), DOI http://doi.acm.org/10.1145/228370. 228390 [59] Conn, A.R., Coulman, P.K., Haring, R.A., Morrill, G.L., Visweswariah, C., Wu, C.W.: JiffyTune: circuit optimization using time-domain sensitivities. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 17(12), 1292–1309 (1998) [60] Conn, A.R., Elfadel, I.M., W. W. Molzen, J., O’Brien, P.R., Strenski, P.N., Visweswariah, C., Whan, C.B.: Gradient-based optimization of custom cir- cuits using a static-timing formulation. In: DAC ’99: Proceedings of the 36th ACM/IEEE Conference on Design Automation, pp. 452–459. ACM, New York, NY (1999), DOI http://doi.acm.org/10.1145/309847.309979 [61] Dally, W.J., Chang, A.: The role of custom design inASIC chips. In: DAC ’00: Proceedings of the 37th Design Automation Conference, pp. 643–647. ACM, New York, NY (2000), DOI http://doi.acm.org/10.1145/337292.337604 [62] Darabiha, A., Rose, J., Maclean, J.: Video-rate stereo depth measurement on programmable hardware. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003 , vol. 1 (2003) [63] De, V., Borkar, S.: Technology and design challenges for low power and high performance. In: Proceedings of the 1999 International Symposium on Low Power Electronics and Design (ISLPED ’99), pp. 163–168. ACM, New York, NY, (1999), DOI http://doi.acm.org/10.1145/313817.313908 [64] Dunga, M.V., Yang, W.M., Xi, X.J., He, J., Liu, W., Kanyu, Cao, M., Jin, X., Ou, J.J., Chan, M., Niknejad, A.M., Hu, C.: BSIM4.6.1 MOS- FET Model (2007), http://www-device.eecs.berkeley.edu/∼bsim3/BSIM4/ BSIM461/doc/BSIM461 Manual.pdf [65] Dunlop, A., Kernighan, B.: A procedure for placement of standard-cell VLSI circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 4(1), 92–98 (1985) [66] Ebeling, C., McMurchie, L., Hauck, S., Burns, S.: Placement and routing tools for the Triptych FPGA. 
IEEE Transactions on Very Large Scale Integration (VLSI) Systems 3(4), 473–482 (1995) [67] Eisele, V.,Hoppe, B., Kiehl, O.: Transmission gate delay models for circuit op- timization. In: Proceedings of the European Design Automation Conference, 1990, EDAC, pp. 558–562 (1990), DOI 10.1109/EDAC.1990.136709 [68] Elmore, W.C.: The transient response of damped linear networks with par- ticular regard to wideband amplifiers. Journal of Applied Physics 19, 55–63 (1948) [69] Fang, W.M.: Modeling routing demand for early-stage FPGA architecture development. Master’s thesis, University of Toronto (2008) [70] Fang, W.M., Rose, J.: Modeling routing demand for early-stage fpga archi- tecture development. In: FPGA ’08: Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, pp. 139– 148. ACM, NewYork, NY (2008), DOI http://doi.acm.org/10.1145/1344671. 1344694 book 4/9/2009 16: 24 page 170

170 References

[71] Fender, J., Rose, J.: A high-speed ray tracing engine built on a field- programmable system. In: Proceedings of IEEE International Conference on Field-Programmable Technology (FPT), 2003, pp. 188–195 (2003) [72] Fishburn, J.P., Dunlop, A.: TILOS: A posynomial programming approach to transistor sizing. In: International Conference on Computer Aided Design, pp. 326–328 (1985) [73] Gayasen, A., Lee, K., Vijaykrishnan, N., Kandemir, M., Irwin, M.J., Tuan, T.: A dual-vdd low power FPGA architecture. In: Proceedings of the International Conference on Field-Programmable Logic and Applications, pp. 145–157 (2004) [74] Goetting, E.: Introducing the newVirtex-4 FPGA family. Xcell Journal (2006), http://www.xilinx.com/publications/xcellonline/xcell 52/xc pdf/xc v4top- view52.pdf [75] Ho, C., Leong, P., Luk, W., Wilton, S., Lopez-Buedo, S.: Virtual embedded blocks: A methodology for evaluating embedded elements in FPGAs. In: Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 35–44 (2006) [76] Ho, R., Mai, K., Horowitz, M.: The future of wires. Proceedings of the IEEE 89(4), 490–504 (2001) [77] Hutton, M., Chan, V., Kazarian, P., Maruri, V., Ngai, T., Park, J., Patel, R., Pedersen, B., Schleicher, J., Shumarayev, S.: Interconnect enhancements for a high-speed PLD architecture. In: Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, pp. 3– 10. ACM, New York, NY (2002), DOI http://doi.acm.org/10.1145/503048. 503050 [78] James, D.: 2004 – The year of 90-nm: A review of 90 nm devices. In: 2005 IEEE/SEMI Advanced Semiconductor Manufacturing Conference and Workshop, pp. 72–76 (2005), DOI 10.1109/ASMC.2005.1438770 [79] Jamieson, P.: Improving the area efficiency of heterogeneous FPGAs with shadow clusters. PhD thesis, University of Toronto (2007) [80] Jamieson, P., Rose, J.: Enhancing the area-efficiency of FPGAs with hard circuits using shadow clusters. 
In: IEEE International Conference on Field- Programmable Technology, pp. 1–8 (2006), DOI 10.1109/FPT.2006.270384 [81] Jiang, W., Tiwari, V., de la Iglesia, E., Sinha, A.: Topological analysis for leakage prediction of digital circuits. In: ASP-DAC ’02: Proceedings of the 2002 Conference on Asia South Pacific Design Automation/VLSI Design, p. 39. IEEE Computer Society, Washington, DC (2002) [82] Jones Jr., H.S., Nagle, P.R., Nguyen, H.T.: A comparison of standard cell and gate array implementions in a common CAD system. In: IEEE 1986 Custom Integrated Circuits Conference, pp. 228–232 (1986) [83] Kasamsetty, K., Ketkar, M., Sapatnekar, S.: A new class of convex functions for delay modeling and its application to the transistor sizing problem. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19(7), 779–788 (2000) book 4/9/2009 16: 24 page 171

References 171

[84] Ketkar, M., Kasamsetty, K., Sapatnekar, S.: Convex delay models for tran- sistor sizing. In: DAC ’00: Proceedings of the 37th Design Automation Conference, pp. 655–660. ACM, NewYork, NY (2000), DOI http://doi.acm. org/10.1145/337292.337607 [85] Kilopass Technology, Inc.: Kilopass XPM embedded non-volatile mem- ory solutions (2007), http://www.kilopass.com/public/Killopass Bro CR1- 01(Web).pdf [86] Kirkpatrick Jr., S., C.D.G., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983) [87] Kleinhans, J., Sigl, G., Johannes, F., Antreich, K.: GORDIAN: VLSI placement by quadratic programming and slicing optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 10(3), 356–365 (1991) [88] Kuon, I.: Automated FPGA design, verification and layout. Master’s thesis, University of Toronto (2004) [89] Kuon, I., Egier, A., Rose, J.: Design, layout and verification of an FPGA using automated tools. In: FPGA ’05: Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, pp. 215– 226. ACM, NewYork, NY (2005), DOI http://doi.acm.org/10.1145/1046192. 1046220 [90] Kuon, I., Rose, J.: Measuring the gap between FPGAs and ASICs. In: FPGA ’06: Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, pp. 21–30. ACM, New York, NY (2006), DOI http://doi.acm.org/10.1145/1117201.1117205 [91] Kuon, I., Rose, J.: Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26(2), 203–215 (2007), DOI 10.1109/TCAD.2006.884574 [92] Kuon, I., Rose, J.: Area and delay trade-offs in the circuit and architecture design of FPGAs. In: FPGA ’08: Proceedings of the 16th International ACM/ SIGDA Symposium on Field Programmable Gate Arrays, pp. 149– 158. ACM, NewYork, NY (2008), DOI http://doi.acm.org/10.1145/1344671. 
1344695 [93] Kuon, I., Rose, J.: Automated transistor sizing for FPGA architecture explo- ration. In: DAC ’08: Proceedings of the 45th Annual Conference on Design Automation, pp. 792–795. ACM, New York, NY (2008) [94] Lamoureux, J.: On the interaction between power-aware computer-aided design algorithms for field-programmable gate arrays. Master’s thesis, Univer- sity of British Columbia (2003), http://www.ece.ubc.ca/∼julienl/papers/pdf/ lamoureux masc.pdf [95] Lamoureux, J., Wilton, S.J.E.: On the interaction between power-aware FPGA CAD algorithms. In: ICCAD ’03: Proceedings of the 2003 IEEE/ACM Inter- national Conference on Computer-Aided Design, p. 701. IEEE Computer So- ciety, Washington, DC (2003), DOI http://dx.doi.org/10.1109/ICCAD.2003. 106 book 4/9/2009 16: 24 page 172

172 References

[96] Corporation: LatticeECP2/M Family Handbook, Ver- sion 01.6 (2007), http://www.latticesemi.com/dynamic/ view document.cfm? document id=19028 [97] Lattice Semiconductor Corporation: LatticeECP2/M Family Handbook, Ver- sion 02.9 (2007), http://www.latticesemi.com/dynamic/ view document.cfm? document id=21733 [98] Lee, E., Lemieux, G., Mirabbasi, S.: Interconnect driver design for long wires in field-programmable gate arrays. In: IEEE International Conference on Field Programmable Technology, 2006. FPT 2006, pp. 89–96 (2006) [99] Lee, E., Lemieux, G., Mirabbasi, S.: Interconnect driver design for long wires in field-programmable gate arrays. Journal of Signal Processing Systems 51(1), 57–76 (2008), DOI 10.1007/s11265-007-0141-y [100] Lemieux, G., Lee, E., Tom, M., Yu, A.: Directional and single-driver wires in FPGA interconnect. In: IEEE International Conference on Field- Programmable Technology, pp. 41–48 (2004) [101] Lemieux, G., Lewis, D.: Using sparse crossbars within LUT clusters. In: FPGA ’01: Proceedings of the 2001 ACM/SIGDA Ninth International Sym- posium on Field Programmable Gate Arrays, pp. 59–68. ACM, New York, NY (2001), DOI http://doi.acm.org/10.1145/360276.360299 [102] Lemieux, G., Lewis, D.: Analytical framework for switch block design. In: International Conference on Field Programmable Logic and Applications, pp. 122–131 (2002) [103] Leventis, P., Chan, M., Chan, M., Lewis, D., Nouban, B., Powell, G., Vest, B., Wong, M., Xia, R., Costello, J.: Cyclone: A low-cost, high- performance FPGA. In: Proceedings of the IEEE 2003 Custom Ingretated Circuits Conference, pp. 49–52 (2003) [104] Lewis, D., Ahmed, E., Baeckler, G., Betz, V., Bourgeault, M., Cashman, D., Galloway, D., Hutton, M., Lane, C., Lee, A., Leventis, P., Marquardt, S., McClintock, C., Padalia, K., Pedersen, B., Powell, G., Ratchev, B., Reddy, S., Schleicher, J., Stevens, K.,Yuan, R., Cliff, R., Rose, J.: The Stratix II logic and routing architecture. 
In: FPGA ’05: Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, pp. 14– 20. ACM, New York, NY (2005), DOI http://doi.acm.org/10.1145/1046192. 1046195 [105] Lewis, D., Ahmed, E., Cashman, D., Vanderhoek, T., Lane, C., Lee, A., Pan, P.: Architectural enhancements in Stratix-IIITM and Stratix-IVTM. In: FPGA ’09: Proceeding of the ACM/SIGDA International Symposium on Field Pro- grammable GateArrays, pp. 33–42. ACM, NewYork, NY (2009), DOI http:// doi.acm.org/10.1145/1508128.1508135 [106] Lewis, D., Betz, V., Jefferson, D., Lee, A., Lane, C., Leventis, P., Marquardt, S., McClintock, C., Pedersen, B., Powell, G., Reddy, S., Wysocki, C., Cliff, R., Rose, J.: The StratixTM routing and logic architecture. In: FPGA ’03: Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays, pp. 12–20. ACM, New York, NY (2003), DOI http://doi.acm.org/10.1145/611817.611821 book 4/9/2009 16: 24 page 173

References 173

[107] Li, F., Lin, Y., He, L.: FPGA power reduction using configurable dual-Vdd. In: DAC ’04: Proceedings of the 41st Annual Conference on Design Automation, pp. 735–740. ACM, New York, NY (2004), DOI http://doi.acm.org/10.1145/996566.996767
[108] Li, F., Lin, Y., He, L.: Vdd programmability to reduce FPGA interconnect power. In: IEEE/ACM International Conference on Computer Aided Design, 2004 (2004)
[109] Li, F., Lin, Y., He, L., Chen, D., Cong, J.: Power modeling and characteristics of field programmable gate arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24(11), 1712–1724 (2005), DOI 10.1109/TCAD.2005.852293
[110] Liu, W., Jin, X., Xi, X., Chen, J., Jeng, M.C., Liu, Z., Cheng, Y., Chen, K., Chan, M., Hui, K., Huang, J., Tu, R., Ko, P.K., Hu, C.: BSIM3V3.3 MOSFET Model (2005), http://www-device.eecs.berkeley.edu/~bsim3/ftpv330/Mod doc/b3v33manu.tar
[111] Luu, J., Kuon, I., Jamieson, P., Campbell, T., Ye, A., Fang, W.M., Rose, J.: VPR 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process scaling. In: FPGA ’09: Proceeding of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 133–142. ACM, New York, NY (2009), DOI http://doi.acm.org/10.1145/1508128.1508150
[112] Marquardt, A., Betz, V., Rose, J.: Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density. In: ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 37–46 (1999)
[113] Marquardt, A., Betz, V., Rose, J.: Speed and area tradeoffs in cluster-based FPGA architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 8(1), 84–93 (2000)
[114] Marquardt, A.R.: Cluster-based architecture, timing-driven packing and timing-driven placement for FPGAs. Master’s thesis, University of Toronto (1999)
[115] McClintock, C., Lee, A.L., Cliff, R.G.: Redundancy circuitry for logic circuits (2000). US Patent 6,034,536
[116] McElvain, K.: LGSynth93 benchmark set: Version 4.0 (1993), Formerly available at http://mcnc.org
[117] Nabaa, G., Azizi, N., Najm, F.: An adaptive FPGA architecture with process variation compensation and reduced leakage. In: Proceedings of the 43rd Annual Conference on Design Automation, pp. 624–629. ACM, New York, NY (2006), DOI 10.1145/1146909.1147069
[118] NEC Electronics: ISSP (Structured ASIC) (2005), http://www.necel.com/issp/english/
[119] Nye, W., Riley, D.C., Sangiovanni-Vincentelli, A., Tits, A.L.: DELIGHT.SPICE: An optimization-based system for the design of integrated circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 7(4), 501–519 (1988)


[120] Okamoto, T., Cong, J.: Buffered Steiner tree construction with wire sizing for interconnect layout optimization. In: ICCAD ’96: Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, pp. 44–49. IEEE Computer Society, Washington, DC (1996)
[121] Ousterhout, J.K.: Switch-level delay models for digital MOS VLSI. In: DAC ’84: Proceedings of the 21st Conference on Design Automation, pp. 542–548. IEEE, Piscataway, NJ, USA (1984)
[122] Padalia, K., Fung, R., Bourgeault, M., Egier, A., Rose, J.: Automatic transistor and physical design of FPGA tiles from an architectural specification. In: FPGA ’03: Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays, pp. 164–172. ACM, New York, NY (2003), DOI http://doi.acm.org/10.1145/611817.611842
[123] Phillips, S., Hauck, S.: Automatic layout of domain-specific reconfigurable subsystems for system-on-a-chip. In: Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, pp. 165–173. ACM, New York, NY (2002), DOI http://doi.acm.org/10.1145/503048.503073
[124] Poon, K.K.W., Wilton, S.J.E., Yan, A.: A detailed power model for field-programmable gate arrays. ACM Transactions on Design Automation of Electronic Systems (TODAES) 10(2), 279–302 (2005), DOI http://doi.acm.org/10.1145/1059876.1059881
[125] Rabaey, J.M.: Digital Integrated Circuits: A Design Perspective. Prentice Hall, Upper Saddle River, NJ (1996)
[126] Rahman, A., Polavarapuv, V.: Evaluation of low-leakage design techniques for field programmable gate arrays. In: FPGA ’04: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, pp. 23–30. ACM, New York, NY (2004), DOI http://doi.acm.org/10.1145/968280.968285
[127] Roche, P., Gasiot, G.: Impacts of front-end and middle-end process modifications on terrestrial soft error rate. IEEE Transactions on Device and Materials Reliability 5(3), 382–396 (2005), DOI 10.1109/TDMR.2005.853451
[128] Rose, J., Francis, R., Lewis, D., Chow, P.: Architecture of field-programmable gate arrays: the effect of logic block functionality on area efficiency. IEEE Journal of Solid-State Circuits 25(5), 1217–1225 (1990)
[129] Rubinstein, J., Penfield, P., Horowitz, M.A.: Signal delay in RC tree networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2(3), 202–211 (1983)
[130] Sangiovanni-Vincentelli, A., El Gamal, A., Rose, J.: Synthesis methods for field programmable gate arrays. Proceedings of the IEEE 81(7), 1057–1083 (1993)
[131] Sapatnekar, S.S., Rao, V.B., Vaidya, P., Sung-Mo, K.: An exact solution to the transistor sizing problem for CMOS circuits using convex optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 12(11), 1621–1634 (1993)


[132] Sechen, C., Sangiovanni-Vincentelli, A.: The TimberWolf placement and routing package. IEEE Journal of Solid-State Circuits 20(2), 510–522 (1985)
[133] Sentovich, E.M., Singh, K.J., Lavagno, L., Moon, C., Murgai, R., Saldanha, A., Savoj, H., Stephan, P.R., Brayton, R.K., Sangiovanni-Vincentelli, A.L.: SIS: A system for sequential circuit synthesis. Technical Report UCB/ERL M92/41, University of California, Berkeley, Electronics Research Lab, University of California, Berkeley, CA, 94720 (1992)
[134] Shang, L., Kaviani, A.S., Bathala, K.: Dynamic power consumption in Virtex™-II FPGA family. In: Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, pp. 157–164. ACM, New York, NY (2002), DOI http://doi.acm.org/10.1145/503048.503072
[135] Shyu, J.M., Sangiovanni-Vincentelli, A.: ECSTASY: A new environment for IC design optimization. In: International Conference on Computer Aided Design, pp. 484–487 (1988)
[136] Sidense Corp: Sidense the future of logic NVM (2008), http://www.sidense.com/index.php?option=com content&task=view&id=130&Itemid=30
[137] Smith, M.J.S.: Application-Specific Integrated Circuits. Addison-Wesley, Germany (1997)
[138] STMicroelectronics: MOTOROLA, PHILIPS and STMicroelectronics Introduces Debut Industry’s First 90-NANOMETER CMOS Design Platform (2002), http://www.st.com/stonline/press/news/year2002/t1222h.htm
[139] STMicroelectronics: 90-nm CMOS090 Design Platform (2005), http://www.st.com/stonline/products/technologies/soc/90plat.htm
[140] Sundararajan, V., Sapatnekar, S.S., Parhi, K.K.: Fast and exact transistor sizing based on iterative relaxation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 21(5), 568–581 (2002)
[141] Sutherland, I., Sproull, R., Harris, D.: Logical Effort: Designing fast CMOS circuits. Morgan Kaufmann, San Francisco, CA (1999)
[142] Swartz, J.S.: A high-speed timing-aware router for FPGAs. Master’s thesis, University of Toronto (1998), http://www.eecg.toronto.edu/~jayar/pubs/theses/Swartz/JordanSwartz.pdf
[143] Swartz, J.S., Betz, V., Rose, J.: A fast routability-driven router for FPGAs. In: FPGA ’98: Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, pp. 140–149. ACM, New York, NY (1998), DOI http://doi.acm.org/10.1145/275107.275134
[144] Synopsys: HSIM. http://www.synopsys.com/products/mixedsignal/hsim/hsim.html
[145] Synopsys: HSPICE. http://www.synopsys.com/products/mixedsignal/hspice/hspice.html
[146] Synopsys: NanoSim. http://www.synopsys.com/products/mixedsignal/nanosim/nanosim.html
[147] Synopsys: Design Compiler Reference Manual: Constraints and Timing, version v-2004.06 edn. (2004)


[148] Synopsys: Design Compiler User Guide, version v-2004.06 edn. (2004)
[149] Taiwan Semiconductor Manufacturing Company Ltd: TSMC 0.18 and 0.15-micron technology platform (2005), http://www.tsmc.com/download/english/a05 literature/0.15-0.18-micron Brochure.pdf
[150] Taiwan Semiconductor Manufacturing Company Ltd: TSMC 0.35-micron technology platform (2005), http://www.tsmc.com/download/english/a05 literature/0.35-micron Brochure.pdf
[151] Taiwan Semiconductor Manufacturing Company Ltd: TSMC 90-nm technology platform (2005), http://www.tsmc.com/download/english/a05 literature/90nm Brochure.pdf
[152] Tennakoon, H., Sechen, C.: Efficient and accurate gate sizing with piecewise convex delay models. In: DAC ’05: Proceedings of the 42nd Annual Conference on Design Automation, pp. 807–812. ACM, New York, NY (2005), DOI http://doi.acm.org/10.1145/1065579.1065793
[153] The MOSIS Service: MOSIS scalable CMOS (SCMOS) revision 8.00 (2004), http://www.mosis.com/Technical/Designrules/scmos/scmos-main.html
[154] Warner, R.: Applying a composite model to the IC yield problem. IEEE Journal of Solid-State Circuits 9(3), 86–95 (1974)
[155] Weber, J.E.: Mathematical Analysis: Business and Economic Applications, 3rd edn. Harper & Row, New York, NY (1976)
[156] Weste, N.H.E., Harris, D.: CMOS VLSI Design: A Circuits and Systems Perspective. Pearson Addison-Wesley, Upper Saddle River, NJ (2005)
[157] Wilton, S.: Architectures and algorithms for field-programmable gate arrays with embedded memories. PhD thesis, Department of Electrical and Computer Engineering, University of Toronto (1997)
[158] Wilton, S.J., Kafafi, N., Wu, J.C.H., Bozman, K.A., Aken’Ova, V., Saleh, R.: Design considerations for soft embedded programmable logic cores. IEEE Journal of Solid-State Circuits 40(2), 485–497 (2005)
[159] Wu, C., Leung, Y., Chang, C., Tsai, M., Huang, H., Lin, D., Sheu, Y., Hsieh, C., Liang, W., Han, L., et al.: A 90-nm CMOS device technology with high-speed, general-purpose, and low-leakage transistors for system on chip applications. In: Electron Devices Meeting, 2002. IEDM’02. Digest. International, pp. 65–68 (2002)
[160] Xilinx: Virtex-4 family overview (2005), http://www.xilinx.com/bvdocs/publications/ds112.pdf
[161] Xilinx: Spartan-3E (2006). Ver. 3.4 http://direct.xilinx.com/bvdocs/publications/ds312.pdf
[162] Xilinx: Virtex-5 user guide (2008), UG190 (v4.0) http://www.xilinx.com/support/documentation/user guides/ug190.pdf
[163] Yang, S.: Logic synthesis and optimization benchmarks user guide version 3.0. Technical Report, Microelectronics Center of North Carolina (1991)
[164] Yang, X., Choi, B.K., Sarrafzadeh, M.: Routability-driven white space allocation for fixed-die standard-cell placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 22(4), 410–419 (2003), DOI 10.1109/TCAD.2003.809660
[165] Yiannacouras, P.: The microarchitecture of FPGA-based soft processors. Master’s thesis, University of Toronto (2005)
[166] Yiannacouras, P., Steffan, J.G., Rose, J.: Application-specific customization of soft processor microarchitecture. In: FPGA ’06: Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, pp. 201–210. ACM, New York, NY (2006), DOI http://doi.acm.org/10.1145/1117201.1117231
[167] Young, S.P.: Six-input multiplexer with two gate levels and three memory cells (1998). US Patent 5,744,995
[168] Young, S.P., Bauer, T.J., Chaudhary, K., Krishnamurthy, S.: FPGA repeatable interconnect structure with bidirectional and unidirectional interconnect lines (1999). US Patent 5,942,913
[169] Zhao, W., Cao, Y.: New generation of predictive technology model for sub-45 nm early design exploration. IEEE Transactions on Electron Devices 53(11), 2816–2823 (2006), DOI 10.1109/TED.2006.884077. Transistor models downloaded from http://www.eas.asu.edu/~ptm/
[170] Zuchowski, P.S., Reynolds, C.B., Grupp, R.J., Davis, S.G., Cremen, B., Troxel, B.: A hybrid ASIC and FPGA architecture. In: Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design ’02, pp. 187–194 (2002)

Index

Symbols
90 nm CMOS, 28–29, 91, 106

A
Adaptive Logic Module (ALM), 7, 30, 47
adaptive lookup table (ALUT), 7
antifuse FPGAs, 131
architecture, 5
  logic block, 5–9
  routing, 9–12
area
  FPGA to ASIC gap, 40–49
area model
  minimum width transistor areas, 18–20
  refinements to original model, 70–72
area range, 107, 113, 116, 117, 125
  definition, 107
ASIC, 1–3
  flow, 32–36

B
Basic Logic Element (BLE), 5, 6
body biasing, 131
BSIM3, 23
BSIM4, 23

C
channel width, 11
CLB, 6
clock
  constraint, 31, 34
  network, 35, 39, 59, 61
cluster inputs, 6
cluster size, 6
connection block, 11–12
convex function, 21–22, 76
crosstalk, 37
Cyclone II FPGA, 126, 127, 146

D
delay
  FPGA to ASIC gap, 49–56
delay range, 107, 113, 116, 117, 125
  definition, 107
Design for Testability (DFT), 34
dynamic power, 69, 126
  FPGA to ASIC gap, 38–39, 56–59
dynamic transistor sizing, 21, 23

E
elasticity, 108–109
elasticity threshold factor, 109, 113–114
Elmore delay model, 76
  definition, 20
  issues, 22

F
fingering
  effect on area model, 71
Flash FPGAs, 131
FlowMap, 17, 94
FPGA to ASIC gap
  past measurements, 24–26

G
gap
  ASIC to Custom, 26
gate boosting, 14, 87, 89

H
heterogeneity
  resource-based, 13
  tile-based, 13
heterogeneous logic blocks, 7, 8, 12–13
  impact on area gap, 40
  impact on power gap, 59
  impact on speed gap, 50

I
input connection block flexibility, 12
interconnect pitch, 28
interesting trade-offs, 108–110
island-style FPGA, 10

L
level restoring buffer, 14
logical equivalence, 13, 66
LogicLock, 31
LUT
  definition, 5
  implementation, 16

M
masks, 1
Moore’s law, 100
MPGA, 2, 25
multi-driver routing, 12
multiplexers, 14–16, 106, 149, 155
  implementation choices, 69, 87, 89, 120–123
  sizing considerations, 67
  transistor sizing, 86
  with level restorer, 15
multipliers, 7, 9, 42, 50

N
non-recurring engineering (NRE) costs, 1

O
objective function, 66, 70, 79, 106, 146
output connection block flexibility, 12

P
parameter sensitivity
  calculation, 79
  during sizing, 79
parasitic extraction, 36
logic, 5
  impact on area gap, 44
placement
  ASIC, 35–36
  FPGA, 18, 31
posynomial function, 22, 76
  definition, 21
predictive technology models, 100
process scaling, 100–103

Q
Quartus II, 31

R
redundancy, 132
representative path, 73
retiming
  effect on FPGA to ASIC gap, 55
routing
  ASIC, 35–36
  FPGA, 18, 31
routing segment, 10
routing track, 10
row utilization, 35

S
scan chains, 34
segment length, 11
single-driver routing, 12
SIS, 17, 94
speed grades, 32
  effect on FPGA to ASIC gap, 54
SRAM-based FPGAs, 13
static power, 14, 39, 69, 126, 131
  FPGA to ASIC gap, 38–39, 59–62
static transistor sizing, 21–23
Stratix II FPGA, 28, 31, 146
  architecture, 7–9
switch block flexibility, 11
switch box pattern, 11
synthesis
  ASIC, 32–35
  FPGA, 31

T
T-VPack, 17, 94
TILOS algorithm, 22, 76
transistor sizing, 20–24

U
unidirectional routing, 12

V
vectorless activity estimation, 39
VPR, 18, 20, 87, 94
  delay model compared to HSPICE, 95