Eindhoven University of Technology

MASTER

From RTL to layout using blast chip

Postma, D.

Award date: 2002



Faculty of Electrical Engineering Section Design Technology For Electronic Systems (ICS/ES) ICS-ES 802

Master's Thesis

FROM RTL TO LAYOUT USING BLAST CHIP.

D. Postma

Supervisor: prof.dr.ir. P.R. Groeneveld Date: August 2002

The Faculty of Electrical Engineering of the Eindhoven University of Technology does not accept any responsibility regarding the contents of Master's Theses

Faculty of Electrical Engineering Section Design Technology (ICS/ES)

Master's Thesis

From RTL to Layout using Blast Chip D. Postma (414407)

Supervisor: prof. dr. ir. P. Groeneveld Date: August 2002

The faculty of Electrical Engineering of the Eindhoven University of Technology does not accept any responsibility regarding the contents of Master's Theses.

Contents

1 Introduction

2 RTL to Layout
  2.1 Traditional Design Flow
  2.2 Fixed Timing
  2.3 Logical Effort Methodology
  2.4 Gain Based Synthesis flow
  2.5 Practical Training Assignments
    2.5.1 Generating a Layout
    2.5.2 Timing versus Area trade-off
    2.5.3 Hierarchical Area Report
    2.5.4 Example: FIR Filter

3 VLSI Power Analysis
  3.1 Hot chips
  3.2 Power Consumption Model
    3.2.1 Terms
    3.2.2 Switching Power
  3.3 Low Power Design Methodologies
  3.4 Power Analysis in Blast Chip 3.1
  3.5 Standard Cell Library Experiments
  3.6 Importing Standard Cell library
  3.7 Power Analysis Experiments
    3.7.1 Switching Activity of an Inverter
    3.7.2 Supply-voltage Scaling of a 32-bit Multiplier
    3.7.3 Power Consumption of an 1D-IDCT

4 Conclusions and recommendations

A From Idea to Chiplayout in 15 minutes
  A.1 Introduction
    A.1.1 Requirements
    A.1.2 Getting Started
  A.2 Blast Chip Reports
  A.3 Blast Chip 'Hello world'

B Magma TCL scripts
  B.1 Main Flow: run.tcl
  B.2 General Functions: procedures.tcl
  B.3 Hierarchical Area Report: hierarea.tcl
  B.4 Import Standard Cell Library

C C++ source of a FIR filter

D Power Tests
  D.1 General Sources
    D.1.1
    D.1.2 power.tcl
    D.1.3 run.tcl
  D.2 Library Without Power Characteristics
    D.2.1 Liberty Source
    D.2.2 Relevant lines of Magma Logfile
    D.2.3 Results of Power Analysis
  D.3 default_cell_leakage_power Attribute
    D.3.1 Liberty Source
    D.3.2 Results of Power Analysis
  D.4 leakage_power_unit Attribute
    D.4.1 Liberty Source
    D.4.2 Results of Power Analysis
  D.5 cell_leakage_power Attribute
    D.5.1 Liberty Source
    D.5.2 Results of Power Analysis
  D.6 leakage_power Group
    D.6.1 Liberty Source
    D.6.2 Results of Power Analysis
  D.7 Derating Factors
    D.7.1 Liberty Source
    D.7.2 Results of Power Analysis
  D.8 1D Lookup Table total_output_net_capacitance
    D.8.1 Liberty Source
    D.8.2 Results of Power Analysis
  D.9 1D Lookup Table input_transition_time
    D.9.1 Liberty Source
    D.9.2 Results of Power Analysis

E Switching Activity of an Inverter
  E.1 Verilog Source: design.v
  E.2 Testbench: testbench.v
  E.3 Icarus Verilog batchfile: icarus.bat
  E.4 Icarus Simulation result
  E.5 Library: simple.lib
  E.6 TCL script to import VCD: power.tcl

Chapter 1

Introduction

At the Eindhoven University of Technology, department of Electrical Engineering, section Information and Communication Systems (ICS), research chair Design Automation (ES), students can perform a practical training in VLSI design. Until recently, the goal of the practical training was to show students the various steps of digital system design, from the algorithm specification level to the Register Transfer Level (RTL). ARIT Designer of Adelante Technologies is the tool that is used to generate an RTL description. The input to ARIT Designer is an algorithm written in C++ and a chip architecture onto which the algorithm should be mapped. Students should be able to explore various design alternatives and evaluate them using criteria such as performance, area and power consumption. To give a better understanding of the consequences of choices made at a high level, and to obtain numerical estimates of the criteria mentioned above (speed, area and power), it is desirable to extend the practical training with the steps from RTL to layout (see Figure 1.1).

Figure 1.1: Extended flow of practical training.

This report discusses this extension of the practical training using the RTL to layout tool 'Blast Chip' of Magma Design Automation. Chapter 2 gives an introduction to the flow from RTL to layout. First, the traditional flow is summarized and afterwards, the flow that Blast Chip uses is explained. Chapter 3 discusses power analysis of VLSI designs and the way this can be done using Blast Chip. At the end of both chapters, some possible related practical assignments for the students are given. The last chapter provides conclusions and recommendations.

Chapter 2

RTL to Layout

2.1 Traditional Design Flow

Figure 2.1 shows a traditional VLSI (Very Large Scale Integration) design flow from RTL (Register Transfer Level) to physical layout. Starting with an RTL description, the design is synthesized into a gate level netlist using a standard cell library. After creating a floorplan by defining the locations of macro cells, memory blocks, bond cells, pads and main power wires, the logic gates are placed and routed.

Figure 2.1: Traditional VLSI design flow from RTL to physical layout.

The final step of the flow is checking the timing constraints; if timing constraints are not met, adjustments have to be made in previous steps, e.g. changing some drive strengths of logic gates, restructuring the logic of the worst timing path or even changing the RTL (e.g. pipelining). After these adjustments, the design needs to be placed, routed and checked for timing constraints again. Unfortunately, the adjustments to optimize the worst timing path often introduce new worst case timing paths, which might even be worse. Therefore, many slow iterations might be needed before all timing constraints are met.

The main problem of this traditional flow is the fact that during synthesis, drive strengths for each logic gate are estimated using statistical (inaccurate) wire delay models. In the early days of VLSI design, these inaccuracies were not a problem because path delay was mainly caused by the delay of individual gates. However, as feature sizes decrease, the timing problem moves more and more from a gate delay problem to a wire delay problem, as shown in Figure 2.2.

[Plot "Gate delay versus wire delay": gate delay, wire delay and sum of delays versus feature size (nm), 0 to 500 nm.]

Figure 2.2: As feature sizes decrease, the timing problem moves from a gate delay problem to a wire delay problem (source: [1]).

The reason for this is the fact that wire delay does not scale down as well as gate delay when dimensions shrink. Reducing sizes leads to decreased wire height and width, which results in a larger wire resistance (slower). Although the increased delay due to increased resistance is cancelled out by the decreased wire capacitance to ground, increased wire density leads to smaller distances between neighboring wires, which results in larger coupling capacitances between wires. Furthermore, for power and electromigration reasons, the aspect ratio of wires is changed such that wires become taller than they are wide. This leads to even higher coupling capacitances. Thus, the RC delay of the wire does not scale down as much as gate delay.

Another problem concerning wire delay is the fact that wire delay is quadratic with wire length. When using a simple RC wire delay model, this is shown by the following equation:

$$t_{wire} \propto R \cdot C \propto \frac{L}{W} \cdot (L \cdot W) = L^2$$

Here, R is the wire resistance, C is the wire capacitance, L is the wire length and W is the wire width.
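The quadratic dependence on wire length can be checked numerically. The sketch below assumes a lumped RC model in which R is proportional to L/W and C (to ground only) is proportional to L·W; the function name and the sheet resistance and area capacitance values are hypothetical and serve only as an illustration.

```python
# Lumped RC wire model. All numeric defaults are hypothetical example values.
def wire_delay(length_um, width_um, r_sheet=0.1, c_area=0.05e-15):
    """Return the R*C product (in seconds) of a single wire.

    r_sheet: assumed sheet resistance in ohm per square.
    c_area:  assumed capacitance to ground in farad per square micron.
    """
    resistance = r_sheet * length_um / width_um   # R grows with L / W
    capacitance = c_area * length_um * width_um   # C grows with L * W
    return resistance * capacitance


if __name__ == "__main__":
    # Doubling the length quadruples the R*C product.
    for length in (100, 200, 400):
        print(length, wire_delay(length, width_um=0.5))
```

In this simplified model the wire width cancels out, so only the length determines the delay. Coupling capacitance between neighboring wires is deliberately left out.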

Before placement and routing, wire lengths are unknown. Therefore, estimating path delay in advance to determine optimal drive strengths of the logic gates becomes a very difficult task when a traditional RTL to layout flow is used.

2.2 Fixed Timing

Traditional tools (try to) manage wire delay after placement and routing. The number of iterations that are needed to meet timing constraints, which are costly in terms of human and computational effort, grows as technology advances. Decomposition of a design into smaller blocks can be applied to solve this problem, but this approach introduces problems in other steps in the design flow, e.g. placement and wire planning become more difficult. It is obvious that such designs are far from optimal. The alternative is to work months to find a solution, or worse, to work months to find out that a solution does not even exist.

A flow that eliminates the need for many iterations is shown in Figure 2.3. Using Ivan Sutherland's theory of 'Logical Effort' and dynamic gate sizing to control delay throughout the flow, timing can be fixed before placement and routing. Then, no iterations are required to meet timing constraints. The Logical Effort methodology and gain based synthesis are discussed in the following two sections.

[Flow diagram: RTL → Logic synthesis → Fix timing → Placement → Routing → Layout]

Figure 2.3: Fixing timing before placement and routing provides a flow from RTL to layout without many slow iterations.

2.3 Logical Effort Methodology

Logical Effort is a methodology that can be used to optimize an arbitrary logical function for speed [5]. Amongst others, it can be used to find out which of several circuits with the same logic function is the fastest and to determine the optimal sizes of logic gates in order to minimize the delay. But the most interesting aspect of the logical effort methodology is that it shows that the delay of a logic gate can be controlled independently of the actual load it has to drive.

The logical effort methodology models the delay of a single logic gate and its load as shown in Equation 2.1:

$$d = g \cdot h + p \tag{2.1}$$

where d is the total delay, g is the logical effort, h is the electrical effort or gain and p is the intrinsic delay of the gate. The unit of time is τ, which is defined as the delay of an inverter driving an identical inverter with no parasitics, in a given process technology.

The logical effort (g) is a measure for the topological complexity of a logic gate: a complex gate has a high logical effort, a low ability to produce output current and is slow. The logical effort of any logical function shows how much worse it is than an inverter at producing output current, given that each of the inputs of the logic gate presents the same amount of input capacitance as an inverter. Increasing the size of a logic gate results in a higher drive strength, but also in a higher $C_{in}$, so this definition can also be formulated as: the logical effort of a logic gate shows how much more input capacitance is needed to be able to produce the same output current as an inverter. Some examples of determining the logical effort of a logic gate are given below.


Figure 2.4: Structures of an inverter, a 2-input NAND and a 2-input NOR. The transistor widths are chosen such that the ability to produce output current (and therefore the speed) is more or less the same.

Figure 2.4 shows the topological structures of a CMOS inverter, a 2-input NAND and a 2-input NOR gate. The numbers drawn at the transistors represent the relative transistor widths. The widths are chosen in such a way that the output currents are more or less the same: in order to produce the same output current, PMOS transistors need to be about twice as wide as NMOS transistors because of the lower mobility of the majority carriers of P material (holes). Furthermore, two transistors in series both need twice the width of a single transistor to be able to produce the same output current. Assuming that all the transistor gate lengths are the same, the transistor gate capacitances are proportional to their widths and the input capacitances at the input pins of the inverter, 2-input NAND and 2-input NOR are 3, 4 and 5 capacitive units respectively. The values of the logical effort can be obtained by dividing the input capacitance of the logic gate by the input capacitance of the inverter. This results in 3/3 = 1 for the inverter, 4/3 = 1.33 for the 2-input NAND and 5/3 = 1.67 for the 2-input NOR.
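The calculation above can be written out as a short sketch. It follows the sizing rules from the text (PMOS twice as wide as NMOS, series transistors widened by the number of transistors in series). The function names are hypothetical, and the extension from 2 inputs to n inputs is an assumption that simply applies the same series rule.

```python
# Input capacitance of the reference inverter: NMOS width 1 + PMOS width 2.
INVERTER_INPUT_CAP = 1 + 2


def nand_widths(n_inputs):
    """Per-input (NMOS, PMOS) widths of an n-input NAND.

    The NMOS transistors are in series, so each is n times wider;
    the PMOS transistors are in parallel and keep the inverter width of 2.
    """
    return n_inputs * 1, 2


def nor_widths(n_inputs):
    """Per-input (NMOS, PMOS) widths of an n-input NOR.

    The PMOS transistors are in series, so each is n times wider;
    the NMOS transistors are in parallel and keep the inverter width of 1.
    """
    return 1, n_inputs * 2


def logical_effort(nmos_width, pmos_width):
    """Input capacitance of one input pin relative to the inverter."""
    return (nmos_width + pmos_width) / INVERTER_INPUT_CAP


if __name__ == "__main__":
    print("inverter:", logical_effort(1, 2))
    print("NAND2:   ", round(logical_effort(*nand_widths(2)), 2))
    print("NOR2:    ", round(logical_effort(*nor_widths(2)), 2))
```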

The electrical effort (h) is the ratio between the output capacitance at the output pin of a logic gate ($C_{out}$) and the input capacitance ($C_{in}$):

$$h = \frac{C_{out}}{C_{in}} \tag{2.2}$$

The larger $C_{out}$, the slower the gate. By increasing the widths of the transistors, at the cost of increased $C_{in}$, the ability to produce output current gets larger.

The intrinsic delay (p) is mainly caused by internal capacitances. Although a larger gate has larger internal capacitance, its ability to drive these capacitances is also larger. Therefore, the intrinsic delay is largely independent of the size of the logic gate. The logical effort (g) is also independent of the size of the transistors. The electrical effort (h), however, depends on the size of the transistors, which means that the ratio $C_{out}/C_{in}$ can be kept constant by resizing the transistors. Thus, without knowing the exact value of $C_{out}$, the delay of a logic gate can be determined. This provides the possibility to fix timing before placement and routing.
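A small sketch of this idea: if the gate is resized so that the gain h stays at a chosen target, the delay d = g·h + p is the same for any load. The function name, the target gain and the intrinsic delay value used below are hypothetical example values, not figures from Blast Chip.

```python
def sized_gate_delay(g, p, c_out, target_gain):
    """Size a gate for a fixed gain and return (c_in, delay).

    g: logical effort, p: intrinsic delay (in units of tau),
    c_out: load capacitance, target_gain: desired h = c_out / c_in.
    """
    c_in = c_out / target_gain        # resize the gate to match the load
    delay = g * (c_out / c_in) + p    # Equation 2.1: d = g * h + p
    return c_in, delay


if __name__ == "__main__":
    # A 2-input NAND (g = 4/3) with an assumed intrinsic delay of 2 tau.
    for load in (10.0, 40.0):
        c_in, delay = sized_gate_delay(4 / 3, 2.0, load, target_gain=4.0)
        print(f"load={load}: c_in={c_in}, delay={delay:.3f} tau")
```

Only the required input capacitance (and therefore the gate size) changes with the load. The delay does not, which is why timing can be budgeted before the wire loads are known.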

More information about logical effort, e.g. about optimizing paths and networks of logic gates, can be found in [5]. Because of its simplicity, the method of logical effort can be used to optimize large designs.

2.4 Gain Based Synthesis flow

Figure 2.5 shows the gain based synthesis RTL to layout flow of Blast Chip of Magma Design Automation. The most important difference with traditional tools is the fact that timing is fixed before placement and routing. To be able to tune cell gains throughout the flow, Blast Chip uses abstract logic cells instead of the actual logic cells of the standard cell library. An abstract cell, called 'SuperCell', has the same logical function as the cell in the library it is derived from, but the difference is that it can be resized dynamically until the latest steps of placement and routing. Because of this, every single cell of the design can be tuned for timing exactly as needed.

Figure 2.5: Gain based synthesis: drive strengths of the standard cells can be tuned throughout the design flow.

Some advantages of the Blast Chip flow compared to the traditional flow are discussed now. First of all, because timing is fixed before placement and routing, no slow iterations are needed to meet timing. Another important advantage is the fact that every single logic gate is tuned to exactly the correct drive strength. Only the gates that need to drive long wires or other capacitances need to be relatively large to meet timing constraints. Non-critical paths only need small logic gates, which saves area and reduces crosstalk. During routing, the SuperCells are mapped back to the original library cells. To control timing during routing, the widths of wires can be changed. The end result is a well-tuned chip layout that is generated in a relatively short time.

An important remark about this approach is that it is important to use a standard cell library that is supplied with many different drive strengths per logic function. The reason for this is that Blast Chip ultimately maps the SuperCells back to cells of the standard library. The more variations in drive strengths are available, the better this mapping can be performed. A good idea may be the use of a 'liquid cell' library, where any drive strength can be chosen.

2.5 Practical Training Assignments

2.5.1 Generating a Layout

The interface of Blast Chip is based on the scripting language TCL (Tool Command Language). Amongst others, TCL commands can be used to enter parameter settings, to control the design flow by entering run commands and to obtain various reports. One of the first practical training assignments could be the generation of several layouts using prepared TCL scripts, with which students can evaluate important criteria of their design (e.g. area, speed and power).

Appendix B shows general scripts that can be used for this purpose. The main parts of the TCL script 'run.tcl' are explained briefly now. The script starts with the definition of the procedure 'dpconfig'. This procedure provides the ability to set global config settings. By calling '[dpconfig config_var_name]', the value of the config variable 'config_var_name' can be obtained anywhere. This provides an easy way for students to set the most important configuration settings, such as the name of the Verilog module they want to generate a layout from and the target clock frequency.

The command 'set l [ImportLib]' imports a volcano (chip database) that contains a standard cell library and sets variable 'l' to the name of the library, e.g. '\slow'. By applying the command 'set m [ImportRtl $l]', a Verilog design is imported into Blast Chip. 'fix netlist' starts logic synthesis and 'fix time' optimizes and fixes the timing for each path using the logical effort methodology as described in section 2.3. In this script, the 'fix time' command is replaced by a procedure 'FixTime' because some extra settings are required, amongst others some timing constraints and the clock period. After the timing is fixed, a floorplan is generated using the procedure 'FixPlan'. Here, the chip aspect ratio and cell row utilization percentage are set. Then 'fix cell' is called, which starts area optimization and placement of cells. After routing the clock signals using the command 'fix clock', the design is routed by applying the command 'fix wire'.

Appendix A contains a brief student manual that should be enough to generate basic layouts such as the one shown in Figure 2.6, the result of layout generation of an 1D-IDCT by Blast Chip.

Figure 2.6: Layout of an 1D-IDCT generated by Blast Chip.

2.5.2 Timing versus Area trade-off

To control timing, Blast Chip performs numerous timing optimization techniques at the cost of area. Amongst others, these techniques are:

• Restructuring logic. E.g. replacing a ripple carry adder (slow, but small) by a carry look ahead adder (fast, but large),

• Logic gate resizing. By increasing the size of a gate, its drive strength grows, so it can charge capacitances faster,

• Parallel gates. If the standard cell library does not contain a certain desired size, two or more gates in parallel can be used to 'create' a gate with more drive strength,

• Cloning. If a gate drives multiple gates, cloning can be applied to divide the load of the $C_{in}$ of the fanout over multiple gates.

An interesting objective of the practical training could be to show students the trade-off between timing and area. An example of the area timing trade-off of a 32-bit multiplier is given now.

Given the following RTL description:

    module design (a, b, y, clk);
        input [31:0] a;
        input [31:0] b;
        input clk;
        output [31:0] y;
        reg [31:0] y;

        always @ (posedge clk)
        begin
            y <= a * b;
        end
    endmodule

and the following configuration settings in the script run.tcl (see Appendix B.1):

    set dpconfig(max_delay) 100n
    set dpconfig(plan_utilization) 0.80

Then, when the configuration parameter dpconfig(clock_period) is varied between 3 ns and 50 ns, the following graph can be produced that shows the relation between timing and area:

[Plot: area (0 to 1400000) versus clock frequency (MHz), 0 to 250 MHz.]

Figure 2.7: Area timing trade-off of a 32 bit multiplier.

Below 100 MHz, timing constraints are easily met, which means that no area has to be traded for speed. At clock frequencies above 100 MHz, Blast Chip applies many timing optimization techniques, which results in a much larger chip.

2.5.3 Hierarchical Area Report

In order to perform placement and routing, Blast Chip flattens the design because efficient algorithms to perform hierarchical placement and routing do not exist yet. However, for a designer it is useful to know which hierarchical block is the largest and therefore a good candidate for area optimization. Although much hierarchical information is lost during flattening, some hierarchical information still exists because Blast Chip assigns a hierarchical name to each logic gate. Using this information, Blast Chip can show a layout where the largest clusters are drawn with different colors, e.g. as depicted in Figure 2.8.

Figure 2.8: Layout of an 1D-IDCT. The largest clusters are shown with different colors.

Although this figure gives a good impression of the largest clusters, Blast Chip does not yet contain a command to obtain a numerical hierarchical area report. The C++ sources of Blast Chip are not available, so a TCL script was written that generates an area report sorted by the largest hierarchical blocks. This subsection discusses two such scripts.

The following TCL code snippet can be used to obtain an associative array with the name of the cell as key and the cell area as value:

    # Loop over all cells of model $m
    data loop c model_cell $m {
        # Get area of cell $c
        set b [data only cell_outline $c]
        set area [unfix [data get $b area]]

        # Store area of cell $c in associative array $cells
        set cells($c) $area
    }

The first script uses the code above to import all cells and their area into an associative array. Then, the hierarchy is recreated by walking over all cells, analyzing each name for hierarchical information and inserting it into a TCL tree data structure [4], as shown in Figure 2.9. After that, all nodes of the tree are recursively visited using a depth first search approach to calculate the total area of all the parent nodes. The area of a parent node is defined as the sum of the area of the child nodes.

Figure 2.9: First the hierarchy is re-created, then a depth first search is per­ formed to determine the area of all the parent nodes.

After implementing this algorithm, it appeared that it was terribly slow, especially for larger designs (>10000 cells). The problem is the fact that TCL only uses strings, lists of strings and associative arrays as primitive data structures. Because of that, it is very difficult to design efficient tree structures using TCL. The TCL tree implementation that is discussed in [4] is based on an internal associative array that stores tree nodes using hierarchical names as keys. The depth first search based calculation of the area of nodes needs many 'who are my children' lookups, which leads to very slow operation because per node, all other nodes need to be examined to determine whether they are children. Therefore, this algorithm has a computational complexity of about $O(N^2)$, where N is the number of cells.

A simpler and faster solution had to be found, so a second, more efficient script was created. Again, the list of cells with their area is imported into an associative array. This array consists of all the leaf cells of a tree that needs to be built. The generation speed of the hierarchical report is optimized using the fact that a child knows the exact key name of its parent. Therefore, a 'who is my parent' lookup is much faster than a 'who are my children' lookup. Using this information, another algorithm is found that is depicted in Figure 2.10.
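The cost of the 'who are my children' lookup in the first script can be illustrated with a small sketch. This is not the TCL tree implementation from [4]; it is a hypothetical Python rendering of the same idea, in which nodes are stored flat under their hierarchical names and every child lookup scans all nodes. The function names and the example cell names are made up.

```python
def all_nodes(leaf_areas, sep="."):
    """Return every hierarchical name implied by the leaf cell names."""
    nodes = set()
    for name in leaf_areas:
        parts = name.split(sep)
        for depth in range(1, len(parts) + 1):
            nodes.add(sep.join(parts[:depth]))
    return nodes


def children_of(node, nodes, sep="."):
    """'Who are my children': a linear scan over all nodes."""
    prefix = node + sep
    return [n for n in nodes
            if n.startswith(prefix) and sep not in n[len(prefix):]]


def area_slow(node, nodes, leaf_areas):
    """Depth first search: own area plus the area of all children."""
    total = leaf_areas.get(node, 0.0)
    for child in children_of(node, nodes):
        total += area_slow(child, nodes, leaf_areas)
    return total


if __name__ == "__main__":
    leaves = {"top.a.x": 5.0, "top.a.y": 7.0, "top.b": 10.0}
    print(area_slow("top", all_nodes(leaves), leaves))
```

Because `children_of` is called once per node and examines every node each time, the total work grows quadratically with the number of nodes.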

Figure 2.10: Fast algorithm of generating a hierarchical area report.

While recreating the hierarchy, the area of a cell is immediately added to the area of the parent and the parent's parents. Dependent on the hierarchical cell distribution, this algorithm has a computational complexity of about $O(N \cdot \text{average hierarchical depth})$, where N is the number of cells. Typically, the average hierarchical depth will be something like log(N). The following code snippet shows the main algorithm:

    # Loop over all cells
    foreach cell [array names source] {
        # Prepare looping over parents
        set cell_split [split $cell "."]
        set depth [llength $cell_split]

        # Loop over the cell itself and all parents of the cell
        for {set i 0} {$i < $depth} {incr i 1} {
            set index [expr {$depth - $i - 1}]
            set parent [join [lrange $cell_split 0 $index] "."]

            # If parent entry does not exist yet, create one
            if {![info exists destination($parent)]} {
                set destination($parent) 0
            }

            # Add area of cell to total area of parent
            set destination($parent) [expr {$destination($parent) + $source($cell)}]
        }
    }
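For readers who want to experiment outside Blast Chip, the same accumulation can be sketched in Python. This is a translation under assumptions: a plain dictionary stands in for the TCL associative array, the function names are hypothetical, and the cell names in the example are made up.

```python
def hierarchical_area(leaf_areas, sep="."):
    """Add each leaf cell's area to itself and to every ancestor.

    leaf_areas maps hierarchical cell names to areas. The work is
    proportional to the number of cells times their hierarchical depth.
    """
    totals = {}
    for cell, area in leaf_areas.items():
        parts = cell.split(sep)
        # Walk from the cell itself up to the top-level block.
        for depth in range(len(parts), 0, -1):
            parent = sep.join(parts[:depth])
            totals[parent] = totals.get(parent, 0.0) + area
    return totals


def top_clusters(totals, count=10):
    """Return the `count` largest entries, ordered by area."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:count]


if __name__ == "__main__":
    leaves = {"top.a.x": 5.0, "top.a.y": 7.0, "top.b": 10.0}
    for name, area in top_clusters(hierarchical_area(leaves)):
        print(f"{area:10.1f} {name}")
```

Each cell only asks 'who is my parent', which is a string operation on its own name, so no scan over the other cells is needed.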

2.5.4 Example: FIR Filter

This subsection discusses an example of the use of the hierarchical area report. Appendix C contains the C++ source of a FIR filter of which a layout is generated. First, the C++ code is mapped on the default ARIT architecture. Then, layouts are generated for a number of clock frequencies. Figure 2.11 shows the results: a graph of the total cell area as a function of the target clock frequency.

[Plot "FIR with standard multiplier": cell area (0 to 800000) versus clock frequency (MHz), 0 to 140 MHz.]

Figure 2.11: Cell area as a function of the clock frequency of the FIR filter with a standard multiplier.

Running the hierarchical area report script on the 125 MHz layout returns the following result:

    ########## BEGIN Top [10] clusters (ordered by area) ##########
    +------
    | Area (u2) | Cluster
    +------
      253068.8    /work/processor/processor/mult_1
      119660.8    /work/processor/processor/mult_1.exp_142
       81145.6    /work/processor/processor/mult_1.exp_47
       24102.4    /work/processor/processor/reg_dx_alu_1
       18841.6    /work/processor/processor/alu_1
       16780.8    /work/processor/processor/alu_1.exp_317
       11609.6    /work/processor/processor/u_urom_ir
       10886.4    /work/processor/processor/reg_dy_alu_1
       10809.6    /work/processor/processor/reg_dy_alu_1.reg_u_1
       10387.2    /work/processor/processor/microrom
    ########## END Top [10] clusters ##########

Most of the area is occupied by the multiplier 'mult_1', so this multiplier is a good candidate to optimize for timing. Therefore, the standard multiplier in the ARIT architecture is replaced by a pipelined multiplier. Again, layouts are generated for various clock frequencies and the area is compared. The results are shown in Figure 2.12.

[Plot "FIR: Standard vs. pipelined multiplier": cell area (0 to 800000) versus clock frequency (MHz) for the standard multiplier and the pipelined multiplier.]

Figure 2.12: Standard multiplier versus pipelined multiplier.

At an area usage of about 275000 µm², the FIR filter with the standard multiplier can run at about 100 MHz, whereas the filter with the pipelined multiplier can run at about 140 MHz, a performance increase of about 40%.

[Plot "FIR: Standard vs. pipelined multiplier, zoom": cell area (190000 to 290000) versus clock frequency for the standard multiplier and the pipelined multiplier.]

Figure 2.13: Standard multiplier versus pipelined multiplier, zoomed around cross point.

Figure 2.13 shows the same results, but zoomed around the cross point of the graphs. At low frequencies (below 70 MHz), the standard multiplier uses less area than the pipelined multiplier. This can be explained by the fact that the standard multiplier has less overhead. For inexperienced chip designers, the decision of whether the standard or the pipelined multiplier is better in terms of area can only be made after a layout has been generated for the target clock frequency.

Chapter 3

VLSI Power Analysis

3.1 Hot chips

Today's chips are hot, for many applications far too hot. Therefore, it becomes increasingly important to put more emphasis on low power design. For handheld devices that need to be small but require high performance, such as mobile phones, PDAs and laptops, the power consumption of the processor is already a major limiting factor. Tradeoffs have to be made between battery lifetime and battery size and weight. Reducing the power consumption of the processor for both the active state (to increase working time) and the inactive state (to increase standby time) has become a major design issue for such devices.

But the power consumption problem is not limited to handheld devices only. High core temperatures cause high failure rates and thus low reliability. So, extensive cooling is needed, which implies high packaging costs and/or noisy fans. With many processors in the same room, air conditioning is needed, which leads to even higher energy bills.

It is obvious that design for low-power methodologies are very important and in some application domains even critical. The first step towards design for low-power is analyzing and estimating the power usage of VLSI circuits, both qualitatively and quantitatively. Blast Chip 3.1 is the first version that is shipped with the ability to analyze the power consumption of large chips. Although Blast Chip implements power analysis mainly for power routing planning purposes, the feature can also be used within a practical training framework: to show students the consequences that high level design choices have on the power consumption.

This chapter gives an introduction to the model that is used by Blast Chip to analyze power consumption. After that, some experiments concerning power analysis are discussed.

3.2 Power Consumption Model

This section discusses the model that can be used to perform power analysis of large CMOS circuits [5, 6, 7]. After a short introduction to various power consumption terms, the main sources of power consumption are summarized. After that, power analysis of large circuits by means of 'node activity' is discussed. At the end of this section, the power analysis functionality of Blast Chip 3.1 is briefly summarized.

3.2.1 Terms

The power consumption of a large CMOS circuit can be decomposed into a number of separate sources. First of all, the total power consumption can be separated into a dynamic and a static part:

$$P_{total} = P_{static} + P_{dynamic} \tag{3.1}$$

Static power is the power that is dissipated when a cell is inactive. Because this component is mainly caused by subthreshold leakage currents, static power is also called leakage power ($P_{leakage}$). Typically, the fraction of static power in relation to the total power consumption is low: usually less than 1%. However, when a chip is in idle mode for most of the time (little switching activity), static power can be a problem when long standby times are required. For this reason it is important to take leakage currents into account in the power model.

Dynamic power ($P_{dynamic}$) is the power that is consumed while a circuit is active. Two types of dynamic power can be distinguished: the so-called switching power and the internal power:

$$P_{dynamic} = P_{switching} + P_{internal} \tag{3.2}$$

The switching power ($P_{switching}$) is the power that is consumed due to the charging and discharging of the output capacitances of cells. Section 3.2.2 discusses switching power in more detail. The internal power ($P_{internal}$) is the dynamic power that is dissipated within the boundaries of a cell. Two types of internal power are the power due to the charging and discharging of internal capacitances ($P_{internalcap}$) and the short circuit power ($P_{shortcircuit}$):

$$P_{internal} = P_{internalcap} + P_{shortcircuit} \tag{3.3}$$

During the switching of a CMOS gate, the NMOS and PMOS circuits will both be in a conducting state for a short period of time. During this period, a short circuit current flows from $V_{dd}$ to GND. The short circuit power is the power that is dissipated due to this current. When the rise and fall times of the input signals of a cell are much longer than the rise and fall times of the output, the short circuit power can be significant because the duration of the short circuit is relatively long. In that case, it can be about 30% of the total power consumption.

3.2.2 Switching Power

As mentioned before, switching power is caused by the charging and discharging of the output capacitances of cells in the circuit. This subsection discusses switching power in more detail.

We start by examining the switching power of a single inverter as shown in Figure 3.1. It is assumed that all output capacitances are included in $C_L$ (wire capacitance and capacitance of the fanout). Let the inverter be in a stable state where the input is high (1) and the output is low (0). The charge of $C_L$ equals 0.

Figure 3.1: Model of CMOS inverter. $C_L$ consists of the wire capacitances and the input capacitances of the fanout.

When the input switches from high to low, it is assumed that the output switches from 0 to $V_{dd}$ in $T$ seconds. During the transition, the current that is drawn from the power supply to charge $C_L$ from 0 to $C_L V_{dd}$ equals:

$$i(t) = C_L \frac{dV_{out}}{dt} \tag{3.4}$$

The power that is drawn from the power supply is given by:

$$P(t) = i(t)\,V_{dd} \tag{3.5}$$

The energy that is drawn from the power supply can be derived as follows:

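$$
\begin{aligned}
E_{0 \rightarrow 1} &= \int_0^T P(t)\,dt \\
&= V_{dd} \int_0^T i(t)\,dt \\
&= V_{dd} \int_0^T C_L \frac{dV_{out}}{dt}\,dt \\
&= C_L V_{dd} \int_0^{V_{dd}} dV_{out} = C_L V_{dd}^2
\end{aligned}
\tag{3.6}
$$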

Part of this energy is stored in $C_L$, and the rest is dissipated in the PMOS transistor. The amount of energy that is stored in $C_L$ is given by:

$$
\begin{aligned}
E_{cap} &= \int_0^T P_{cap}(t)\,dt \\
&= \int_0^T V_{out}(t)\,i_{cap}(t)\,dt \\
&= C_L \int_0^{V_{dd}} V_{out}\,dV_{out} = \frac{1}{2} C_L V_{dd}^2
\end{aligned}
\tag{3.7}
$$

So, half of the energy is stored in the capacitor CL and the other half is dissipated in the PMOS transistor.

When a high to low transition occurs at the output ($E_{1 \rightarrow 0}$), no energy is drawn from $V_{dd}$, and the energy that is stored in $C_L$ ($= \frac{1}{2} C_L V_{dd}^2$) is dissipated in the NMOS transistor. The total amount of energy drawn from the power supply for a high to low transition and a low to high transition equals $C_L V_{dd}^2$.
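The result of equations (3.6) and (3.7) can be checked numerically. The following Python sketch is not part of the Blast Chip flow; it assumes, purely for illustration, that the conducting PMOS transistor behaves like a fixed resistor `r_on`, so that the output follows an exponential RC charging curve. The function name and all parameter values are hypothetical.

```python
import math


def charging_energies(c_load, v_dd, r_on=1.0e3, steps=100_000, n_tau=20.0):
    """Integrate supply energy and stored energy while C_L charges to V_dd.

    Assumption (not from the thesis): the conducting PMOS is modeled as a
    fixed resistor r_on, so V_out(t) = V_dd * (1 - exp(-t / (r_on * c_load))).
    """
    tau = r_on * c_load
    dt = n_tau * tau / steps
    e_supply = 0.0
    e_cap = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt                       # midpoint rule
        v_out = v_dd * (1.0 - math.exp(-t / tau))
        i = (v_dd - v_out) / r_on                # current through the PMOS
        e_supply += v_dd * i * dt                # integrand of (3.6)
        e_cap += v_out * i * dt                  # integrand of (3.7)
    return e_supply, e_cap


if __name__ == "__main__":
    e_supply, e_cap = charging_energies(c_load=10e-15, v_dd=2.5)
    print("drawn from supply:", e_supply, "stored in C_L:", e_cap)
```

Under these assumptions the supply energy approaches $C_L V_{dd}^2$ and the stored energy approaches half of that, independent of the chosen resistance, which agrees with the derivation above.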

The model presented above can be generalized to any other CMOS cell by replacing the NMOS transistor by an NMOS network and the PMOS transistor by the complementary PMOS network, as shown in Figure 3.2. The same derivations for energy and power consumption hold.

Figure 3.2: General switching power model of CMOS cell.

3.3 Low Power Design Methodologies

During the design of a large chip many methodologies can be applied to reduce the power consumption [6]. A technique that has a significant impact on power consumption is scaling down $V_{dd}$, as $P_{switching}$ is proportional to $V_{dd}^2$. In order to maintain the same circuit performance, $V_t$ should be scaled with the same factor as $V_{dd}$ [6, p. 22]. This is a limiting factor because decreasing $V_t$ leads to higher subthreshold leakage currents.

Some other power reduction techniques are:

• avoid the use of higher clock frequencies than needed (e.g. turn off the clock when a module is idle),

• introduce local caches to reduce the number of global memory accesses,

• use power optimized bit codes to reduce the number of bit changes (e.g. Gray code for addresses),

• avoid glitches by restructuring logic,

• don't switch control signals that are 'don't care',

• use copper wires instead of aluminium.
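The effect of a power-optimized bit code can be illustrated with a small Python sketch. It counts the number of bit changes on a 4-bit address bus that steps through 16 consecutive addresses, once in plain binary and once in Gray code. The helper names are hypothetical and the example is an illustration only, not a measurement from the thesis.

```python
def to_gray(n):
    """Convert a non-negative integer to its reflected Gray code."""
    return n ^ (n >> 1)


def count_bit_flips(codes):
    """Total number of bit changes between consecutive code words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))


addresses = list(range(16))                      # sequential 4-bit addresses
binary_flips = count_bit_flips(addresses)
gray_flips = count_bit_flips([to_gray(a) for a in addresses])
print("binary:", binary_flips, "gray:", gray_flips)
```

For sequential accesses the Gray-coded bus changes exactly one bit per step, so fewer capacitances are charged and discharged than with binary coding.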

3.4 Power Analysis in Blast Chip 3.1

Figure 3.3 shows the typical flow for power analysis using Blast Chip. Using switching activity, interconnect capacitances and power figures of the standard cell library, an accurate estimation of the total power consumption can be made. After the step 'fix netlist', power analysis can be performed using any information that is available at the particular stage in the flow.

Figure 3.3: Typical power analysis flow using Blast Chip and an HDL simulator. The test bench and the RTL description are fed to the HDL simulator, which produces a Value Change Dump (VCD). The RTL description and the technology library are the inputs of Blast Chip. Power analysis can be performed from 'fix netlist' up to layout, with increasing accuracy.

As shown in section 3.2.2, one of the most important pieces of information concerning power consumption is the number of charges and discharges of capacitances. In Blast Chip, those can be computed when proper switching activity values are assigned to each circuit node. The switching activity of a net is represented by two variables: the toggle rate (T) and the probability that the signal value is logic '1' (P). One way of setting the switching activity is by applying the command:

force activity annotate

Using this command, default signal probabilities and toggle rates of models and individual nets can be set. Another method to set activities is importing a VCD (Value Change Dump) file that is generated by simulation using representative input vectors. The corresponding Blast Chip command is:

import vcd

When the activities of the sequential elements of a circuit are specified (either by 'force activity annotate' or 'import vcd'), switching activities can be propagated through the combinational logic between sequential elements. The Blast Chip command to perform this is:

force activity propagate

Propagation of activity through a logic cell can be done if the probabilities and toggle rates of the input signals and the logical function are known [6, p. 137]. An example: for an inverter, the probability of logic '1' at the output ($P_o$) equals $1 - P_i$. The toggle rate of the output ($T_o$) is the same as the toggle rate of the input because for each input transition, an output transition occurs. So, let the input of an inverter have $P_i = 0.3$ and $T_i = 75$, then:

$$
\begin{aligned}
P_o &= 1 - P_i = 0.7 \\
T_o &= T_i = 75
\end{aligned}
$$

Another example: for an AND gate with $T_{i1} = 45$, $T_{i2} = 87$, $P_{i1} = 0.2$, $P_{i2} = 0.7$, the activity figures of the output are:

$$
\begin{aligned}
P_o &= \prod_{i \in inputs} P_i = 0.2 \cdot 0.7 = 0.14 \\
T_o &= P_{i1} \cdot T_{i2} + P_{i2} \cdot T_{i1} = 0.2 \cdot 87 + 0.7 \cdot 45 = 48.9
\end{aligned}
$$
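The two propagation rules above can be written down as a short Python sketch. It only reproduces the formulas of the inverter and the two-input AND gate as given in the text; it is not the algorithm that Blast Chip uses internally, and the function names are hypothetical. Like the formulas, it assumes that the input signals are statistically independent.

```python
def propagate_inverter(p_in, t_in):
    """Output probability and toggle rate of an inverter."""
    return 1.0 - p_in, t_in


def propagate_and2(p1, t1, p2, t2):
    """Output probability and toggle rate of a two-input AND gate.

    A toggle on one input only reaches the output while the other
    input is at logic '1', hence the weighting with p2 and p1.
    """
    p_out = p1 * p2
    t_out = p1 * t2 + p2 * t1
    return p_out, t_out


print(propagate_inverter(0.3, 75))
print(propagate_and2(0.2, 45, 0.7, 87))
```

The two calls correspond to the inverter and AND gate examples worked out above.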

But the switching activity alone is not enough. Capacitance of the interconnect is also an important piece of information. Estimations of wire lengths and their capacitance are inaccurate before placement and routing, so power analysis before placement and routing is inaccurate as well.
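How activity and capacitance combine can be sketched in Python as follows. The sketch builds on section 3.2.2, where one rise and one fall together draw $C_L V_{dd}^2$ from the supply, so on average each toggle accounts for half of that. It assumes that the toggle rate counts rises and falls together in toggles per second; the units used by Blast Chip are not stated in the text, so this is an assumption, as are the function names and the example values.

```python
def switching_power(c_load, v_dd, toggle_rate):
    """Average switching power of one net in watts.

    Assumption: toggle_rate counts rises and falls together, per second,
    so every toggle costs on average 0.5 * C_L * V_dd^2.
    """
    return 0.5 * c_load * v_dd ** 2 * toggle_rate


def total_switching_power(nets, v_dd):
    """Sum the switching power over (capacitance, toggle_rate) pairs."""
    return sum(switching_power(c, v_dd, t) for c, t in nets)


# Hypothetical nets: (load capacitance in farad, toggles per second)
nets = [(10e-15, 1e6), (25e-15, 2e6)]
print(total_switching_power(nets, v_dd=2.5))
```

An error in the estimated wire capacitance of a net translates linearly into an error in its switching power, which is why pre-layout results have to be treated with care.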

3.5 Standard Cell Library Experiments

In order to gain more understanding of a power-characterized standard cell library in the Synopsys Liberty (.lib) format [7], a number of small experiments are performed using a very simple design (a single inverter). Several simple handwritten libraries are created to test certain parameters of the library and their influence on the results of power analysis. Because of the very small scale of the design, wire capacitance does not play an important role here; the focus lies on leakage and internal power. Appendix D contains the sources and results of the tests that are discussed in this section.

No power characteristics

The first test is performed to check the results of power analysis when the library contains no power figures at all. The expectation is that the power results are all zero. The results were as expected, see also Appendix D.2.

default_cell_leakage_power

Appendix D.3 shows the results of some tests of the attribute default_cell_leakage_power. All results were correct and as expected.

leakage_power_unit

Appendix D.4 shows the results of some tests of the attribute leakage_power_unit. The parameter default_cell_leakage_power is set to 2. All results were correct and as expected.

cell_leakage_power

Appendix D.5 shows the results of some tests of the attribute cell_leakage_power. All results were correct and as expected.

leakage_power group

The Liberty leakage_power group specifies the leakage power of a logic cell when it depends on the logical state of the pins. Appendix D.6 shows the results of some tests of this parameter. The results appeared to be incorrect: the resulting leakage power is a factor nom_voltage too high. This bug has been reported to Magma Design Automation Inc.

Derating

The values of many Liberty parameters depend on operating conditions such as the manufacturing process (P), supply voltage (V) and temperature (T). One methodology the Liberty format uses to specify the influence of different operating conditions is called derating. Because derating is not well explained in [7], it is briefly discussed here.

Let $x$ be the name of an attribute of the library that is sensitive to variations in the operating conditions. Let the value of $x$ be specified for certain nominal operating conditions $P_{nom}$, $V_{nom}$ and $T_{nom}$. In case the actual process, voltage and temperature differ, the corresponding value of $x$ can be determined using derating factors $k_{P,x}$, $k_{V,x}$ and $k_{T,x}$ as follows:

$$x = x_{nom} \cdot \prod_{c \in \{P,V,T\}} \left(1 + k_{c,x}\,\Delta c\right) \tag{3.8}$$

where $\Delta c = c - c_{nom}$.
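Equation (3.8) can be evaluated with a few lines of Python. The sketch below is an illustration only: the function name, the derating factors and the nominal conditions are hypothetical values, not figures taken from a real library or from Appendix D.

```python
def derate(x_nom, k, actual, nominal):
    """Apply equation (3.8): x = x_nom * prod(1 + k_c * (c - c_nom))."""
    value = x_nom
    for c in ("P", "V", "T"):
        value *= 1.0 + k[c] * (actual[c] - nominal[c])
    return value


# Hypothetical derating factors and operating conditions
k = {"P": 0.0, "V": 0.5, "T": 0.01}
nominal = {"P": 1.0, "V": 2.5, "T": 25.0}
actual = {"P": 1.0, "V": 2.0, "T": 125.0}

print(derate(2.0, k, actual, nominal))
```

With these example values the voltage term contributes a factor 0.75 and the temperature term a factor 2.0, so a nominal value of 2.0 is derated to 3.0.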

Appendix D.7 shows the results of some tests of varying the derating factors and operating conditions. The following TCL command was used to set the operating conditions:

rule condition default $l -process 1.0 -voltage 2.0 -temperature 125.0

All results were correct and as expected.

1D Lookup Table total_output_net_capacitance

Appendix D.8 shows the results of a test of the lookup table parameter total_output_net_capacitance. All results were correct and as expected.

1D Lookup Table input_transition_time

Appendix D.9 shows the results of a test of the lookup table parameter input_transition_time. The input slew on input pin a is varied using the command:

force timing slew $m/mpin:a 5n

All results were correct and as expected.

3.6 Importing Standard Cell library

Because a prepared standard cell library with power characteristics is not available for Blast Chip, a search on the internet is performed to try to find a free one. One is found [9] that is targeted at a TSMC 0.25 µm, 2.5 V CMOS process. Importing this library into Magma appeared to be a skill of its own. Understanding the various steps to perform, the meaning of many different file types, the mapping of GDSII layers onto Blast Chip layers, how to deal with vias, and so on is a very difficult task without proper documentation. Nevertheless, a TCL script is created that seems to work in most cases, see Appendix B.4. Although some designs are analyzed successfully using this script, it is not recommended to use this script for practical training purposes. It is probably incomplete and it is unstable: small changes in the script have caused Blast Chip to crash. It is better to wait until Magma releases a tutorial that ships with a power-characterized standard cell library volcano prepared by an experienced user.

3.7 Power Analysis Experiments

This section presents the power analysis of some example designs. As mentioned in previous sections, the largest amount of power is typically consumed due to switching, and therefore it is important to have good estimates of the switching activity per node. One way to get representative activity figures is to use an HDL simulator combined with a representative test bench (input vectors). Because a commercial HDL simulator was not available, the freely available Verilog simulator 'Icarus Verilog' [8] is used to generate VCD files. VCD files can be imported into Blast Chip to set switching activities. After activity propagation, the desired power consumption report can be generated.

3.7.1 Switching Activity of an Inverter

The first experiment in using VCD files concerns the analysis of a single inverter. Appendix E shows the Verilog source of the design, the Verilog source of the testbench, the result of the simulation by Icarus Verilog, a simple Liberty file that is used, the results of power analysis and some TCL scripts. Simulation of the design resulted in the following output:

time= 1 a=1 out=0
time= 2 a=0 out=1
time= 3 a=1 out=0
time= 4 a=0 out=1
time= 5 a=1 out=0
time= 6 a=0 out=1
time= 7 a=1 out=0
time= 8 a=0 out=1
time= 9 a=1 out=0
time= 10 a=0 out=1

After simulation by Icarus, the VCD file is imported into Blast Chip. Initially, a problem occurred after 'import vcd' was applied because the activity figures were not set:

############### BEGIN Activity Report ########################
/work/design/design/mpin:out no data
/work/design/design/mpin:a no data
############### END Activity Report ##########################

This was caused by the fact that Icarus generated a VCD file in DOS text format with \r\n line breaks. Blast Chip appeared not to be able to interpret VCD files in DOS format correctly. After running 'dos2unix' on the VCD file and importing again, the following activities were set on the nets:
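Where 'dos2unix' is not installed, the same conversion can be done with a few lines of Python. This is a minimal sketch; the function names and the file names in the usage comment are hypothetical.

```python
def dos_to_unix(data: bytes) -> bytes:
    """Replace DOS line breaks (\\r\\n) by Unix line breaks (\\n)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src_path, dst_path):
    """Read a file in binary mode and write a copy with Unix line breaks."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(dos_to_unix(data))


# Example use (hypothetical file names):
# convert_file("inverter.vcd", "inverter_unix.vcd")
```

The files are opened in binary mode on purpose, so that Python does not translate line breaks itself while reading or writing.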

############### BEGIN Activity Report ########################
mpin:out activity_imported_vcd {{0.000e+00 4.000e+12 5.000e+12}}
mpin:a activity_imported_vcd {{0.000e+00 5.000e+12 4.000e+12}}
############### END Activity Report ##########################

The numbers after 'activity_imported_vcd' should be the probability, the number of rises and the number of falls respectively. All three figures are incorrect. The signal probability of pin a should be 0.5, the number of rises 4 and the number of falls 5. After reporting this to Magma, the factor of $10^{12}$ in the counts appeared to be a bug: the results were divided by 1 ps once too often. The reason the probability is not set correctly is still unknown.
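The expected figures can be derived directly from the simulation output listed earlier in this subsection. The Python sketch below does so for the ten printed samples of pin a. It assumes that every printed line represents one time unit of equal length and estimates the probability as the fraction of samples at logic '1'. The function name is hypothetical and the sketch does not parse a VCD file.

```python
def activity_from_samples(samples):
    """Return (probability of '1', number of rises, number of falls)."""
    pairs = list(zip(samples, samples[1:]))
    rises = sum(1 for a, b in pairs if (a, b) == (0, 1))
    falls = sum(1 for a, b in pairs if (a, b) == (1, 0))
    probability = sum(samples) / len(samples)
    return probability, rises, falls


# Values of pin a at time 1 .. 10, taken from the simulation output
a = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(activity_from_samples(a))
```

For pin a this gives a probability of 0.5, 4 rises and 5 falls, the values that the activity report was expected to show.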

Because of these major bugs, no further attempt is made to try to perform power analysis using 'import vcd'. The results would be useless.

3.7.2 Supply-voltage Scaling of a 32-bit Multiplier

As $P_{switching}$ is proportional to $V_{dd}^2$, reducing the supply voltage of a chip typically reduces its power consumption. When $V_t$ is kept constant (e.g. in case the same standard cell library is used), the circuit performance will decrease because capacitors are charged and discharged more slowly. Therefore, meeting timing constraints gets harder and more chip area is needed to compensate for that. This subsection discusses some experiments on scaling the supply voltage of a 32-bit multiplier while keeping $V_t$ and the timing constraints constant. In Blast Chip, the target supply voltage can be set using a TCL command like:

rule condition default $l -voltage 2.3

Because of the bugs concerning 'import vcd', power analysis is performed by using the following commands:

force activity annotate $m -probability 0.5 -toggles 1e6
force activity propagate $m $l
report power analysis $m

Figure 3.4 shows the required area when the target supply voltage is varied.

Figure 3.4: Required area as a function of supply voltage (V), for three values of the clock period: 10 ns, 15 ns and 20 ns.

In general the amount of required area reduces when the supply voltage increases. At low voltages, some random behavior can be noticed. This may be explained by the fact that at low voltages Blast Chip may have applied certain optimization steps that it did not use at higher supply voltages. Figure 3.5 shows the results of power analysis for various values of the supply voltage. In general, the power consumption increases as the supply voltage increases and some wild behavior occurs at lower supply voltages, probably for the same reason mentioned above.

Figure 3.5: Power consumption as a function of supply voltage (V), for three values of the clock period: 10 ns, 15 ns and 20 ns.

In Figure 3.6, the results of area and power are combined into one graph to show that area can be traded for low power.

Figure 3.6: Power consumption versus cell area (µm²) for clock periods of 10 ns, 15 ns and 20 ns. Combining the previous two figures shows that area can be traded for power.

3.7.3 Power Consumption of a 1D-IDCT

Although 'import vcd' showed no good results, it might be useful to examine the possibilities of simulating a design generated by A|RT Designer, in preparation for the time that 'import vcd' works correctly. A|RT Designer can generate a Verilog test bench that can be used to test the generated Verilog design. An attempt is made to simulate a design generated by A|RT Designer using Icarus Verilog. The first problem encountered is that Icarus Verilog does not support several Verilog constructs that are used by the A|RT Designer Verilog generator, e.g. a multi-bit parameter. After manually adapting the generated Verilog source, Icarus no longer complained about unsupported Verilog constructs, but something worse happened: it crashed. No attempt is made to solve the crash as this may be very difficult. When performing power analysis with 'import vcd' based on simulation of an A|RT test bench, it is recommended to use a commercial Verilog simulator to generate VCD files instead of Icarus Verilog, or to give the students an example design with some VCD files that they did not generate themselves.

Chapter 4

Conclusions and recommendations

This report discusses the extension of a practical course VLSI design with the steps from RTL to layout. For a number of reasons, Blast Chip of Magma Design Automation Inc. is a suitable tool to use for the extension. First of all, it is a tool that is already used widely by industry and because of its fixed timing design flow it will probably grow in popularity the coming decade.

Secondly, the interface to Blast Chip is formed by the easy to learn scripting language TCL (Tool Command Language) and an intuitive GUI (graphical user interface). Therefore, students will be able to interact with the tool and explore designs without spending much effort in just understanding the tool interface.

Where meeting timing constraints is today's major VLSI design problem, soon another problem will be a limiting factor in increasing chip performance: power consumption. Therefore, it is a good thing to prepare students for low-power design on all design levels. Recently, Blast Chip has been extended with the ability to analyze power consumption. When a good power-characterized standard cell library and a proper Verilog simulator become available, students can evaluate their designs for power using Blast Chip.

Although "learn by doing" can be a good principle, care has to be taken that students do not only learn how to push a button, but also learn what is actually happening. To achieve this it may be wise to put some effort into teaching them some basic theory, e.g. about the logical effort methodology, wire delay, cross-talk and some timing optimization techniques. To test whether students have actually learned something, their knowledge should be tested at the end of the course, e.g. by an examination, by letting them solve several problems as an exercise, or by introducing a competitive element between students.

By extending the practical course with the steps from RTL to layout, students will gain more understanding of deep sub-micron VLSI problems and will also learn techniques to solve them. That way they will be prepared to design tomorrow's large chips or EDA tools.

Bibliography

[1] EDA Roadmap Taskforce Report, Design of Microprocessors, Silicon Integration Initiative Inc. and Electronic Design Automation Industry Council, 1999.

[2] Blast Chip 3.0 Users Guide, Magma Design Automation Inc., 2001.

[3] Blast Chip 3.1 Users Guide, Magma Design Automation Inc., 2002.

[4] Flynt C, Tcl/Tk for Real Programmers, Academic Press, San Diego, 1999.

[5] Sutherland IE, Sproull RF and Harris D, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1999.

[6] Rabaey JM, Pedram M, Low Power Design Methodologies, Kluwer Academic Publishers, 1996.

[7] Liberty User Guide, v1999, Inc., 1999.

[8] Icarus Verilog website, http://icarus.com/eda/verilog/index.html

[9] Free TSMC 0.25um Standard Cell Library, http://www.ee.vt.edu/~ha/cell_library/distribution.html

Appendix A

From Idea to Chiplayout in 15 Minutes

A.1 Introduction

This document is a short manual for the practical training in VLSI design, focusing on generating a chip layout based on a Verilog RTL description, using Blast Chip of Magma Design Automation.

A.1.1 Requirements

This manual assumes you are using a Win32 platform. The requirements are:

• An account on a host running ART Designer (C to RTL)

• An account on a host running Blast Chip (RTL to layout)

• PuTTY, a free telnet / SSH client

• Exceed, an X server for Win32 platforms

Hosts

Currently, the following hosts are available during the practical training:

• kameleon.ics.ele.tue.nl

• o3.ics.ele.tue.nl

• co.ics.ele.tue.nl

• co2.ics.ele.tue.nl

• co3.ics.ele.tue.nl

Blast Chip can only run on o3, co, co2 and co3. All of these need SSH or rlogin as terminal protocol. The hosts co, co2 and co3 are dual-processor machines.


Terminal

A terminal client is needed to log in to the system Blast Chip is installed on. Windows 9x ships with telnet, but does not have a client for secure login (SSH), which is needed if you want to log in to certain hosts directly. A good and freely downloadable telnet / SSH client for Win32 platforms is PuTTY. At the time of writing of this tutorial, the single executable putty.exe (324 kB) can be downloaded from http://www.chiark.greenend.org.uk/~sgtatham/putty/

X server

In order to be able to use the graphical user interface of Blast Chip, an X server is required. For Windows systems, 'Exceed' is a well-known X server. Some window manager related problems have been reported when Blast Chip is used with Exceed. Two solutions are:

• Start Xconfig. Click the screen 0 tab. From the window manager dialog box, select Native¹. Run 'Exceed'.

• Start the X server by running the tool HWM² that is shipped with Exceed.

Furthermore, in order to run graphical applications you must configure the display to which an X client (an application) needs to send its graphical data. This can be achieved by setting the environment variable 'DISPLAY' to the IP address or hostname of the system you're working on, followed by ':0'. Example:

setenv DISPLAY pc05.ics.ele.tue.nl:0

This command tells the terminal session to send all X data to the X server that is running on host pc05.ics.ele.tue.nl. An easy way to set the DISPLAY variable automatically every time you log in is by putting the following line in the text file .login in your home directory:

setenv DISPLAY {$REMOTEHOST}:0

A.1.2 Getting Started

In order to create a chip, follow the steps that are described in this section.

1. Start Exceed.

2. Open a terminal (using PuTTY) and log in to kameleon.

3. If needed, set the environment variable DISPLAY.

4. On kameleon, start ART Designer:

art designer

¹ Note that 'Default to Native' is not the same as 'Native'.
² Hummingbird Window Manager

5. Create a new ART project, create a C source of the design and generate a Verilog file (make sure you turn on the RT option 'generate Verilog')

6. Open another terminal and login to one of the machines Blast Chip can be run from. Use this session to perform the following steps.

7. If needed, set the environment variable DISPLAY

8. Go to the project directory of the ART design.

9. Copy the template TCL scripts to the ART project directory³:

cp -r /home/dpostma/public/magma .

10. Change current directory:

cd magma

11. Edit the configuration settings in the file run.tcl, using your favorite text editor. Example:

pico run.tcl

You must at least set the config variable 'top_module'. The name you have to enter here can usually be found at the bottom of the Verilog file that is generated by ART. This file can be found at 'artprojectdirectory/artd_v/artd_design.v'. Usually, this name looks like 'artd_FunctionName'.

12. Start Blast Chip by entering the following command:

mantle

Now you should see the mantle command prompt which looks like:

mantle [1] >

13. Start the graphical user interface by entering the following command at the mantle prompt⁴:

ui start

14. Open the animator using the GUI menu: View → Animator. This way the chip generation process from RTL to layout will be visualized.

³ NOT available from kameleon.
⁴ Make sure you are running an X server, such as Exceed.

15. Run the chip generation script by entering the following command in the GUI console:

source run.tcl

In order to make changes to the design, the only steps that are needed are:

1. If needed, change the C source file and push the button in ART Designer to regenerate the Verilog file.

2. If needed, change settings in run.tcl (e.g. timing settings).

3. In mantle, rerun the script run.tcl.

A.2 Blast Chip Reports

This section summarizes some interesting Blast Chip commands that can be run from the mantle command prompt. Make sure $m is set correctly to the name of the model. Example:

set m /work/designTop/designTop

For more information on commands, check the Blast Chip man pages⁵.

report model $m
    Display general information about the model.

report timing path $m
    Display information about the worst timing path.

report area $m
    Display information about area.

A.3 Blast Chip 'Hello world'

Blast Chip uses the scripting language TCL⁶ as the interface to the main engine 'mantle'. This section provides a very short introduction to TCL. Let's start with the TCL version of 'Hello world!'. Just enter the following line at the mantle command prompt:

puts "Hello World!"

If nothing went wrong, the output will look like:

Hello World!

In TCL, variables can be set by using the 'set' command. By prefixing a variable name with the dollar character ($), the value of the variable is substituted. The 'Hello world' example can also be written as follows:

set var "Hello World!"
puts $var

⁵ You can browse the man pages using the GUI, under the tab 'DOCS'.
⁶ Tool Command Language

Some unusual features you need to know about TCL: TCL only uses strings, lists of strings and arrays as data structures. So, numbers such as integers and doubles are also represented by strings. Furthermore, TCL is a position-based language, not keyword-based. TCL commands are separated by newlines: the first word on a new line is a command, and the following words on the same line are options and arguments to the command (separated by whitespace). In case a line gets too long to be readable, you can continue it on the next line by ending it with the backslash character (\). " " and { } can be used to group multiple words into one argument. Now, take a look at the following TCL code:

if {0 != 1} {
    puts "Phew, just what I expected"
} else {
    puts "Huh?! Something strange is happening here ..."
}

Note that the word 'if' is the command, the condition {0 != 1} is the first argument and the rest is the body. Because the body is an argument to the command, the opening brace ({) of the body must reside on the same line the if command starts on.

The command 'expr' can be used to evaluate expressions. Brackets ([ ]) can be used to substitute the output of the command inside these brackets. The following examples will output '3', '7' and '30' successively⁷:

expr 1+2
set result [expr 3+4]
puts $result
puts [expr "5 * 6"]

The last thing you'll learn about TCL is the 'source' command. The source command is used to import a file and execute it immediately afterwards. E.g. the following command imports and executes the text file 'run.tcl' if it exists in the current directory:

source run.tcl

For now, this knowledge about TCL is enough to understand the examples in this tutorial. For more extensive tutorials on TCL, it is recommended to perform a Google search on 'tcl tutorial'. If you want to learn TCL thoroughly, read a book like Tcl/Tk for Real Programmers, written by Clif Flynt, published by Academic Press, San Diego 1999.

⁷ Note that the 'set' command returns the value of the second argument.

Appendix B

Magma TCL Scripts

B.1 Main Flow: run.tcl

#################### Configuration Settings ####################
proc dpconfig {var} {

    # Name of topmodule of Verilog description
    set dpconfig(top_module) "artd_..."

    # Clock period
    set dpconfig(clock_period) 30n

    # Maximum delay between all input and output pins
    set dpconfig(max_delay) 100n

    # Cell row utilization fraction
    set dpconfig(plan_utilization) 0.90

    ########## No Changes need to be made below this line ##########

    # Location of directories and files
    set dpconfig(library_volcano) "/home/magma/common/tutorial/library.volcano"
    set dpconfig(design_dir) "../artd_v"
    set dpconfig(design_files) "artd_design.v"

    # Return the value of the requested variable
    if {[info exists dpconfig($var)]} {
        return $dpconfig($var)
    } else {
        error "Unknown dpconfig variable: $var"
    }
}


#################### Main flow ####################

# Import scripts
source procedures.tcl

# Delete previous environment data
delete object /

# Turn off intermediate volcano snapshots
config snap off

# Import standard cell library
set l [ImportLib]

# Import verilog design
set m [ImportRtl $l]

# Perform area based optimization
fix netlist $m $l

# Perform gain based synthesis
FixTime $m $l

# Floorplanning and power routing
FixPlan $m $l

# Global placement
fix cell $m $l

# Insert clock logic and optimize it
fix clock $m $l

# Perform detailed routing
fix wire $m $l

B.2 General Functions: procedures.tcl

# Import technology library
proc ImportLib {} {
    set libVolcano [dpconfig library_volcano]
    set l /slow

    if ![expr [data exists $l]] {
        if {[file exists $libVolcano]} {
            data delete object /
            import volcano $libVolcano
        }

        if [expr ! [data exists $l]] {
            error "Volcano does not contain valid data."
        }

        # Hide sub-optimum library cells
        set hideCellList {/slow/INV/INVX12 /slow/INV/INVX16 /slow/INV/INVX20}
        foreach cell [split $hideCellList] {
            force hide $cell
        }

        # Prepare Super cells
        run prepare lib $l

        # Protect the pad cell entities so that they are neither mapped
        # (i.e. hide them), nor are unmapped (i.e. keep them).
        data loop e lib_entity /slow {
            data loop mm entity_model $e {
                set modeltype [data get $mm model_type]
                if [regexp -pad $modeltype] {
                    force hide $e
                    force keep $e
                }
            }
        }
    }

    return $l
}

# Import Verilog RTL description
proc ImportRtl {l} {
    set design_dir [dpconfig design_dir]

    # Concatenate design dir to files
    foreach filename [split [dpconfig design_files]] {
        lappend design_files [file join ${design_dir} ${filename}]
    }

    set topModule [dpconfig top_module]
    set m /work/${topModule}/${topModule}

    if [expr ![data exists $m]] {
        eval import rtl -include ${design_dir} $design_files
        fix rtl $m

        # To generate assign-free verilog, keep top-level model pin nets
        data loop pins model_pin $m {
            if { [lsearch -glob $pins *mpin:*] != -1 } {
                set net [ data only pin_net $pins ]
                force keep $net
            }
        }

        run gate sweep -hier $m
        run bind logical -preserve_hier $m $l
    }
    enwrap "" import-rtl $m
    return $m
}

################ Timing ################
proc FixTime {m l} {
    data flatten $m

    # Configure wire load model for the delay calculator
    force wire load $m -name TSMC25_Aggressive -techlib $l

    # Differentiate between input and output pins
    # and set minimal timing constraints for all pins
    data loop p model_pin $m {
        set dir [data get $p direction]
        if {$dir=="out"} {
            force timing required $p [dpconfig max_delay]
            puts "$p"
        } else {
            force timing arrival $p 0
            puts "$p"
        }
    }

    # Specify the maximum delay of the logic between
    # the outputs of latches and the inputs of latches
    # In short: The clock period :)
    force timing clock $m/clk [dpconfig clock_period]

    # Tell Blast Chip that the timing of the reset net is not important
    force timing false -from $m/rst_a

    fix time $m $l -rtl -margin 500
}

################ Floor-planning, Power-planning ################
proc FixPlan {m l} {
    run bind physical $m $l

    force plan net -usage power -port VDD net=VDD model=$m
    force plan net -usage ground -port VSS net=VSS model=$m
    force plan chip $m -ratio 1.0 -util [dpconfig plan_utilization] \
        -left_clearance 30u -right_clearance 30u \
        -top_clearance 30u -bottom_clearance 30u
    run plan create chip $m
    run plan create pin $m

    force plan mesh -layer METAL5 -place over -width 3.0u -spacing 0.8u -offset 0.8u \
        -max_repeat 150u mesh=PGmesh model=$m orientation=horizontal net=VSS net=VDD
    force plan mesh -layer METAL4 -place over -width 3.0u -spacing 0.6u -offset 0.8u \
        -max_repeat 150u mesh=PGmesh model=$m orientation=vertical net=VSS net=VDD
    run plan create mesh $m

    force plan ring external $m vertical -offset 0.6u -layer METAL4 \
        -width 10u -spacing 0.6u VSS VDD
    force plan ring external $m horizontal -offset 0.6u -layer METAL3 \
        -width 10u -spacing 0.6u VSS VDD
    run plan create ring $m
}

B.3 Hierarchical Area Report: hierarea.tcl

#
# Hierarchical Area Report
#
# Description:
#   Generate hierarchical area report of a Blast Chip model
#
# Procedure:
#   1) Import list of cells and their individual areas into an associative array
#   2) For each cell, if parent does not exist yet, create it.
#   3) For each cell, add its area to its parent and its parent's parents.
#
# Notes:
#   In the process "from a Verilog RTL netlist to layout",
#   Blast Chip inserts buffers in order to meet drive strengths
#   or timing constraints. As buffers are inserted after flattening
#   the design, no hierarchical information of those buffers is
#   available.
#
# Author:
#   Dirk Postma
#
# History:
#   03-05-2002 Created

proc importAreaCells {model destination} {
    # Import cell areas of $model into an associative array $destination
    # array keys: cell name
    # array values: cell area
    upvar $destination cells
    upvar $model m

    # Loop over all cells of model $m
    data loop c model_cell $m {
        # Get area of cell $c
        set b [data only cell_outline $c]
        set area [unfix [data get $b area]]

        # Store area of cell $c in associative array $cells
        set cells($c) $area
    }
}

proc generateHierarchicalArea {cells clusters} {
    # Starting with an imported associative array of cells and their area,
    # generate an associative array with hierarchical area information,
    # by adding entries for clusters of cells and for clusters of clusters,
    # with their area, which is the sum of their children.
    upvar $cells source
    upvar $clusters destination

    if {[array exists destination]} {
        unset destination
    }

    # Loop over all cells
    foreach cell [array names source] {
        set cell_split [split $cell "."]
        set depth [llength $cell_split]

        # Loop over all parent names of cell
        for {set i 0} {$i < $depth} {incr i 1} {
            set index [expr $depth - $i - 1]
            set parent [join [lrange $cell_split 0 $index] "."]

            # If parent entry does not exist yet, create one
            if {![info exists destination($parent)]} {
                set destination($parent) 0
            }

            # Add area of cell to total area of parent
            set area [expr {$destination($parent) + $source($cell)}]
            set destination($parent) $area
        }
    }
}

proc arrayValueSort { arrayname b a } {
    # Function used for sorting an associative array of cells and their area
    upvar $arrayname array

    if { $array($a) == $array($b) } {
        return 0
    } elseif { $array($a) < $array($b) } {
        return -1
    } else {
        return 1
    }
}

proc reportAreaTop { whatclusters n } {
    # Show N clusters, ordered by area
    upvar $whatclusters clusters

    puts "########## BEGIN Top \[$n\] clusters (ordered by area) ##########"
    puts "+------"
    puts "| Area (u2)\t| Cluster"
    puts "+------"
    set i 0
    foreach cluster [lsort -command "arrayValueSort clusters" [array names clusters]] {
        if { $i < $n } {
            puts "| [expr 1e12 * $clusters($cluster)]\t| $cluster"
            incr i 1
        }
    }
    puts "########## END Top \[$n\] clusters ##########"
}

importAreaCells m area_cells
generateHierarchicalArea area_cells area_clusters
reportAreaTop area_clusters 50
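The accumulation step of the procedure described in the script header can also be sketched outside Blast Chip. The Python fragment below is only an illustration of the idea, not a replacement for hierarea.tcl: the cell names and areas are hypothetical, the "." hierarchy separator is taken from the script, and the areas are assumed to be in square metres (the script multiplies by 1e12 to report u2).

```python
def generate_hierarchical_area(cell_areas, sep="."):
    """Add each cell's area to every prefix of its hierarchical name."""
    clusters = {}
    for name, area in cell_areas.items():
        parts = name.split(sep)
        # Walk from the full cell name up to its top-level ancestor.
        for depth in range(len(parts), 0, -1):
            parent = sep.join(parts[:depth])
            clusters[parent] = clusters.get(parent, 0.0) + area
    return clusters


def top_clusters(clusters, n):
    """Return the n largest clusters as (name, area) pairs, largest first."""
    return sorted(clusters.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical cells and areas in square metres, for illustration only.
cells = {
    "alu.add.U1": 2e-12,
    "alu.add.U2": 3e-12,
    "alu.mul.U1": 5e-12,
    "ctrl.U1": 1e-12,
}
clusters = generate_hierarchical_area(cells)
for name, area in top_clusters(clusters, 3):
    print(f"{area * 1e12:8.1f} u2  {name}")
```

As in the Tcl version, every leaf cell appears as its own entry, so a report limited to the top N entries is dominated by the higher levels of the hierarchy.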

B.4 Import Standard Cell Library

This section shows a script to import a standard cell library from LEF and GDSII files. It is NOT recommended to use this script!

##################### Import Technology Library ###################
set config_lib_dir "/home/dpostma/power/vtvtlib25"
set l [import lib "$config_lib_dir/Synopsys_Libraries/vtvtlib25.lib"]

import lef rule "$config_lib_dir/vtvtlib_lef/vtvtlib25.lef"
source "magma.rules"
rule gdsii layer metal1 49 * $l -use wire -mode import
rule gdsii layer metal2 51 * $l -use wire -mode import
rule gdsii layer metal3 62 * $l -use wire -mode import
rule gdsii layer metal4 31 * $l -use wire -mode import
rule gdsii layer metal5 33 * $l -use wire -mode import
rule gdsii layer nwell 42 * $l -use wire -mode import
rule gdsii layer active 43 * $l -use wire -mode import
rule gdsii layer nactive 103 * $l -use wire -mode import
rule gdsii layer pactive 113 * $l -use wire -mode import
rule gdsii layer poly 46 * $l -use wire -mode import
rule gdsii layer via 50 * $l -use wire -mode import
rule gdsii layer via2 61 * $l -use wire -mode import
rule gdsii layer via3 30 * $l -use wire -mode import
rule gdsii layer via4 32 * $l -use wire -mode import
rule gdsii layer cc 25 * $l -use wire -mode import
rule gdsii layer cc 47 * $l -use wire -mode import
rule gdsii layer cc 48 * $l -use wire -mode import
rule gdsii layer metal1 126 * $l -use pin_text -mode import
rule gdsii layer metal2 126 * $l -use pin_text -mode import
rule gdsii layer metal3 126 * $l -use pin_text -mode import
rule gdsii layer metal4 126 * $l -use pin_text -mode import
rule gdsii layer metal5 126 * $l -use pin_text -mode import
rule gdsii layer metal1 125 * $l -use pin_text -mode import
rule gdsii layer metal2 125 * $l -use pin_text -mode import
rule gdsii layer metal3 125 * $l -use pin_text -mode import
rule gdsii layer metal4 125 * $l -use pin_text -mode import
rule gdsii layer metal5 125 * $l -use pin_text -mode import
rule grid routing 1.25u $l -direction vertical
import gdsii -keep_text -lib $l "$config_lib_dir/layout/vtvtlib25.gds"
config timing slew threshold 0.2 0.8 0.2 0.8
run prepare lib $l

Appendix C

C++ source of a FIR filter

#include

#define T_IW Fix<24,23>
#define T_OW Fix<24,23>
#define T_CW Fix<16,15>
#define T_AW Fix<32,31>

#define NR_TAPS 16

// number of bits needed to address all taps
#define ADDRESS 4
#define T_AD Uint

/*------*/
/* Definition of the coefficient array */

T_CW COEF[NR_TAPS] = {
     5.08189182603199E-03,  4.44167747034644E-02,
     9.89816634693354E-03, -4.51813281959995E-02,
    -6.88768722861049E-02,  1.89145907229325E-02,
     2.02429623778305E-01,  3.56702628106457E-01,
     3.56702628106457E-01,  2.02429623778305E-01,
     1.89145907229325E-02, -6.88768722861049E-02,
    -4.51813281959995E-02,  9.89816634693354E-03,
     4.44167747034644E-02,  5.08189182603199E-03
};

/*------*/
/* Initialisation of the taps */

T_IW DELAY[NR_TAPS] = {
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
};

void FIR( const T_IW IN, T_OW& OUT) /* Direct form FIR */
{
#pragma OUT OUT

/*------*/
/* Definition of the summation of the tap multipliers */

    static T_AD baseptr = 0;
    DELAY[baseptr] = IN;
    T_AD index = baseptr;
    baseptr--;
    T_AW ACC = 0;

    loopi: for (int i=0; i

    OUT = ACC;
}

Appendix D

Power Tests

D.1 General Sources

This appendix shows general sources that are used with various tests.

D.1.1 Verilog

module design(a, y);
  input a;
  output y;
  INV U1(a,y);
endmodule

D.1.2 power.tcl

force activity annotate $m -probability 0.5 -toggles 1e6
force activity propagate $m $l -verbose
report power analysis $m -cells -detail

D.1.3 run.tcl

data delete object /
set testname "0001"
set l [import lib -add $testname.lib]
set m [import netlist $testname.v]
run bind logical $m $l
fix netlist $m $l

Note that the variable $testname is changed with each test.

D.2 Library Without Power Characteristics

D.2.1 Liberty Source

library(simple) {
  technology (cmos);

  cell(INV) {
    pin(A) {
      direction : input;
    }
    pin(Y) {
      direction : output;
      function : "A'";
    }
  }
}

D.2.2 Relevant lines of Magma Logfile

Some required values are not set in the library. Blast Chip chooses some default values which are shown below:

LC-73 WARNING: missing voltage_unit in library, using 1V
LC-73 WARNING: missing current_unit in library, using 1mA
LC-73 WARNING: missing pulling_resistance_unit in library, using 1KOhm
LC-73 WARNING: missing capacitive_load_unit in library, using 1ff
LC-73 WARNING: missing leakage_power_unit in library, using 1W
LC-73 WARNING: missing nom_process in library, using 1.0
LC-73 WARNING: missing nom_voltage in library, using 5.0
LC-73 WARNING: missing nom_temperature in library, using 25
LC-101 WARNING: Missing area attribute for cell INV
LC-81 WARNING: line 9: Neither an instance value nor a library default specified for capacitance, assuming 0.000000
LC-81 WARNING: line 2: Neither an instance value nor a library default specified for default_cell_leakage_power, assuming 0.000000
LC-81 WARNING: line 2: Neither an instance value nor a library default specified for default_leakage_power_density, assuming 0.000000

D.2.3 Results of Power Analysis

################################################
# Report power of model: design
################################################

cell      model     leakage   internal   swcap     total
--------------------------------------------------------
(H)       design    0.0 W     0.0 W      0.0 W     0.0 W

/work/design/design/
cell      model     leakage   internal   swcap     total
--------------------------------------------------------
U1_C1     INV       0.0 W     0.0 W      0.0 W     0.0 W

D.3 default_cell_leakage_power Attribute

D.3.1 Liberty Source

library(simple) {
  technology (cmos);
  default_cell_leakage_power : 2e-9;

  cell(INV) {
    pin(A) {
      direction : input;
    }
    pin(Y) {
      direction : output;
      function : "A'";
    }
  }
}

D.3.2 Results of Power Analysis

default_cell_leakage_power:   Leakage power:
2E-9                          2.0 nW
2                             2.0 W
3                             3.0 W
4E-3                          4.0 mW

D.4 leakage_power_unit Attribute

D.4.1 Liberty Source

library(simple) {
  technology (cmos);
  leakage_power_unit : 1nW;
  default_cell_leakage_power : 2;

  cell(INV) {
    pin(A) {
      direction : input;
    }
    pin(Y) {
      direction : output;
      function : "A'";
    }
  }
}

D.4.2 Results of Power Analysis

leakage_power_unit:   Leakage power:
unset                 2.0 W
1.0 mW                2.0 mW
2.0 mW                4.0 mW
1.0 nW                2.0 nW
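The results of D.3.2 and D.4.2 are consistent with a simple scaling rule: the reported leakage equals default_cell_leakage_power multiplied by leakage_power_unit, with 1 W used when the unit is not set (as the log in D.2.2 states). The Python sketch below only restates that observation; the function names are hypothetical and the unit parser covers just the prefixes that occur in these tests.

```python
import re

# SI prefixes needed for the unit strings used in these tests.
PREFIX = {"": 1.0, "m": 1e-3, "u": 1e-6, "n": 1e-9, "p": 1e-12}


def unit_to_watts(unit):
    """Convert a unit string such as '1nW' or '2.0 mW' to watts."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([munp]?)W\s*", unit)
    if match is None:
        raise ValueError(f"unsupported unit: {unit!r}")
    return float(match.group(1)) * PREFIX[match.group(2)]


def leakage_watts(default_cell_leakage_power, leakage_power_unit=None):
    """Leakage in watts; an unset unit falls back to 1 W."""
    unit = "1W" if leakage_power_unit is None else leakage_power_unit
    return default_cell_leakage_power * unit_to_watts(unit)


# Rows of the leakage_power_unit test: the attribute value is 2 in every run.
for unit in (None, "1mW", "2mW", "1nW"):
    print(unit, leakage_watts(2, unit))
```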

D.5 cell_leakage_power Attribute

D.5.1 Liberty Source

library(simple) {
  technology (cmos);
  leakage_power_unit : 1nW;
  default_cell_leakage_power : 2;

  cell(INV) {
    /* Next line is under examination: */
    cell_leakage_power : 3;

    pin(A) {
      direction : input;
    }
    pin(Y) {
      direction : output;
      function : "A'";
    }
  }
}

D.5.2 Results of Power Analysis

cell_leakage_power:   Leakage power:
unset                 2.0 nW
1                     1.0 nW
3.5                   3.5 nW
3                     3.0 nW

Table D.1: Results of varying .lib attribute cell_leakage_power.

D.6 leakage_power Group

D.6.1 Liberty Source

library(simple) {
  technology (cmos);
  voltage_unit : "1v";
  current_unit : "1mA";
  pulling_resistance_unit : "1kohm";
  capacitive_load_unit(1, ff);
  default_leakage_power_density : 0.0;
  leakage_power_unit : 1nW;
  default_cell_leakage_power : 2;

  nom_process : 1.0;
  nom_temperature : 125;
  nom_voltage : 2;

  cell(INV) {
    area : 0.5;
    cell_leakage_power : 1;

    /* Next few lines are under examination: */
    leakage_power () {
      when : "A";
      value : 3;
    }

    leakage_power () {
      when : "!A";
      value : 5;
    }

    pin(A) {
      direction : input;
    }

    pin(Y) {
      direction : output;
      function : "A'";
    }
  }
}

D.6.2 Results of Power Analysis

################################################
# Report power of model: design
################################################

cell      model     leakage   internal   swcap     total
--------------------------------------------------------
(H)       design    8.0 nW    0.0 W      0.0 W     8.0 nW

/work/design/design/
cell      model     leakage   internal   swcap     total
--------------------------------------------------------
U1_C1     INV       8.0 nW    0.0 W      0.0 W     8.0 nW

D.7 Derating Factors

D.7.1 Liberty Source

library(simple) {
  technology (cmos);
  voltage_unit : "1v";
  current_unit : "1mA";
  pulling_resistance_unit : "1kohm";
  capacitive_load_unit(1, ff);
  default_leakage_power_density : 0.0;
  leakage_power_unit : 1nW;
  default_cell_leakage_power : 2;

  nom_process : 1.0;
  nom_voltage : 2;
  nom_temperature : 125;

  /* Next few lines are under examination: */
  k_process_cell_leakage_power : 3;
  k_volt_cell_leakage_power : 3;
  k_temp_cell_leakage_power : 3;

  cell(INV) {
    area : 0.5;
    cell_leakage_power : 7;
    pin(A) {
      direction : input;
    }
    pin(Y) {
      direction : output;
      capacitance : 0;
      function : "A'";
    }
  }
}

D.7.2 Results of Power Analysis

Test: vary derating factors.
Expectation: when operating conditions change, the power changes according to the derating formula (see report).

Variable:                       Value:
nom_process                     1.0
nom_voltage                     2.0
nom_temperature                 125
k_volt_cell_leakage_power       0
k_temp_cell_leakage_power       0
k_process_cell_leakage_power    0
cell_leakage_power              7

Table D.2: Initial values of varied variables.

Process:   Voltage (V):   Temperature (°C):   Leakage power (nW):
1.0        2.0            125.0               7.0
2.0        2.0            125.0               7.0
1.0        3.0            125.0               7.0
1.0        2.0            126.0               7.0
2.0        3.0            126.0               7.0

Table D.3: Results when all k's are zero.

Process:   Voltage (V):   Temperature (°C):   Leakage power (nW):
1.0        2.0            125.0               7.0
2.0        2.0            125.0               28.0
1.0        3.0            125.0               7.0
1.0        2.0            126.0               7.0
2.0        3.0            126.0               28.0

Table D.4: Results when k_process_cell_leakage_power = 3.

Process:   Voltage (V):   Temperature (°C):   Leakage power (nW):
1.0        2.0            125.0               7.0
2.0        2.0            125.0               7.0
1.0        3.0            125.0               28.0
1.0        2.0            126.0               7.0
2.0        3.0            126.0               28.0

Table D.5: Results when k_volt_cell_leakage_power = 3.

Process:   Voltage (V):   Temperature (°C):   Leakage power (nW):
1.0        2.0            125.0               7.0
2.0        2.0            125.0               7.0
1.0        3.0            125.0               7.0
1.0        2.0            126.0               28.0
2.0        3.0            126.0               28.0

Table D.6: Results when k_temp_cell_leakage_power = 3.

Process:   Voltage (V):   Temperature (°C):   Leakage power (nW):
1.0        2.0            125.0               7.0
2.0        2.0            125.0               28.0
1.0        3.0            125.0               28.0
1.0        2.0            126.0               28.0
2.0        3.0            126.0               448.0

Table D.7: Results when all k's equal 3.
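The measured values in Tables D.3 to D.7 agree with a multiplicative derating model in which each operating condition scales the nominal leakage by 1 + k times its deviation from the nominal value. The Python sketch below is inferred from these tables only and is not quoted from the tool documentation; the function name and argument order are assumptions made for illustration.

```python
def derated_leakage(nominal, k_process, k_volt, k_temp,
                    process, voltage, temperature,
                    nom_process=1.0, nom_voltage=2.0, nom_temperature=125.0):
    """Scale the nominal leakage by one linear factor per operating condition."""
    return (nominal
            * (1.0 + k_process * (process - nom_process))
            * (1.0 + k_volt * (voltage - nom_voltage))
            * (1.0 + k_temp * (temperature - nom_temperature)))


# Operating points used in the tables: (process, voltage, temperature).
points = [
    (1.0, 2.0, 125.0),
    (2.0, 2.0, 125.0),
    (1.0, 3.0, 125.0),
    (1.0, 2.0, 126.0),
    (2.0, 3.0, 126.0),
]

# All three k factors set to 3 and cell_leakage_power = 7 (Table D.7).
for process, voltage, temperature in points:
    power = derated_leakage(7.0, 3.0, 3.0, 3.0, process, voltage, temperature)
    print(process, voltage, temperature, power)
```

With all three factors set to 3, the last operating point gives 7 x 4 x 4 x 4 = 448 nW, the value reported in Table D.7.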

D.8 1D Lookup Table total_output_net_capacitance

D.8.1 Liberty Source

library(simple) {
  technology (cmos);
  voltage_unit : "1v";
  current_unit : "1mA";
  pulling_resistance_unit : "1kohm";
  capacitive_load_unit(1, ff);
  default_leakage_power_density : 0.0;
  leakage_power_unit : 1nW;
  default_cell_leakage_power : 0;

  nom_process : 1.0;
  nom_voltage : 2;
  nom_temperature : 125;

  power_lut_template(output_by_cap) {
    variable_1 : total_output_net_capacitance;
    index_1 ("0.0, 20.0, 50.0, 70.0");
  }

  cell(INV) {
    area : 0.5;
    pin(a) {
      direction : input;
      capacitance : 0;
    }
    pin(y) {
      direction : output;
      capacitance : 0;
      function : "a'";

      internal_power() {
        power(output_by_cap) {
          values("10.0, 50.0, 70.0, 60.0");
        }
      }
    }
  }
}

D.8.2 Results of Power Analysis

Capacitance:   Internal power (nW):
0              10
10             30
15             40
20             50
45             66.7
80             55

Table D.8: Internal power as function of capacitance at output pin y.
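The measured points are consistent with piecewise-linear interpolation between the table entries, with the last segment extended linearly beyond the final index (80 lies outside the table and yields 55). The Python sketch below reproduces that behaviour; it is an interpretation of the measurements, not a description of Blast Chip internals, and the function name is hypothetical.

```python
def lookup_1d(index, values, x):
    """Piecewise-linear table lookup; the end segments extend past the table."""
    # Select the first segment whose upper bound is not below x,
    # falling back to the last segment for values beyond the table.
    hi = 1
    while hi < len(index) - 1 and x > index[hi]:
        hi += 1
    lo = hi - 1
    slope = (values[hi] - values[lo]) / (index[hi] - index[lo])
    return values[lo] + slope * (x - index[lo])


INDEX = [0.0, 20.0, 50.0, 70.0]    # index_1 of the lookup table template
VALUES = [10.0, 50.0, 70.0, 60.0]  # values of the internal_power table

for capacitance in (0, 10, 15, 20, 45, 80):
    print(capacitance, round(lookup_1d(INDEX, VALUES, capacitance), 1))
```

The same function applied to an input transition time of 25 or 40 gives 53.3 and 63.3 after rounding, so the rule is not specific to the capacitance variable.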

D.9 1D Lookup Table input_transition_time

D.9.1 Liberty Source

library(simple) {
  technology (cmos);
  voltage_unit : "1v";
  current_unit : "1mA";
  pulling_resistance_unit : "1kohm";
  capacitive_load_unit(1, ff);
  default_leakage_power_density : 0.0;
  leakage_power_unit : 1nW;
  default_cell_leakage_power : 0;

  nom_process : 1.0;
  nom_voltage : 2;
  nom_temperature : 125;

  power_lut_template(input_trans) {
    variable_1 : input_transition_time;
    index_1 ("0.0, 20.0, 50.0, 70.0");
  }

  cell(INV) {
    area : 0.5;
    pin(a) {
      direction : input;
      capacitance : 0;
      internal_power() {
        power(input_trans) {
          values("10.0, 50.0, 70.0, 60.0");
        }
      }
    }
    pin(y) {
      direction : output;
      capacitance : 0;
      function : "a'";
    }
  }
}

D.9.2 Results of Power Analysis

Slew:   Internal power (nW):
0       10
5       20
10      30
15      40
20      50
25      53.3
40      63.3
70      60

Table D.9: Internal power as function of the input transition time of input pin a.

Appendix E

Switching Activity of an Inverter

E.1 Verilog Source: design.v

module design (out, a);
  output out;
  input a;
  assign out = ~a;
endmodule

E.2 Testbench: testbench.v

`timescale 1ns/1ns
module testbench;
  reg a;
  wire out;

  design design (out, a);

  always #1 begin
    a = 1 - a;
  end

  initial begin
    $dumpon;
    a = 0;
    #10 $dumpoff;
    #1 $finish;
  end

  initial begin
    $dumpfile ("design.vcd");
    $dumpvars;
    $monitor("time=%3d a=%b out=%b", $time, a, out);
  end
endmodule

E.3 Icarus Verilog batchfile: icarus.bat

del design.vvp
del design.vcd
iverilog -o design.vvp -c sources.txt
vvp design.vvp
..\dos2unix.exe design.vcd

E.4 Icarus Simulation result

time=  1 a=1 out=0
time=  2 a=0 out=1
time=  3 a=1 out=0
time=  4 a=0 out=1
time=  5 a=1 out=0
time=  6 a=0 out=1
time=  7 a=1 out=0
time=  8 a=0 out=1
time=  9 a=1 out=0
time= 10 a=0 out=1
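As a cross-check on the activity figures that a power tool could derive from this simulation, the listed trace can be summarised by hand. The Python sketch below counts the transitions and the fraction of samples at logic 1 for each signal. It reads only the $monitor lines above and makes no claim about how Blast Chip processes the VCD file; the function names are hypothetical.

```python
import re

TRACE = """\
time=  1 a=1 out=0
time=  2 a=0 out=1
time=  3 a=1 out=0
time=  4 a=0 out=1
time=  5 a=1 out=0
time=  6 a=0 out=1
time=  7 a=1 out=0
time=  8 a=0 out=1
time=  9 a=1 out=0
time= 10 a=0 out=1
"""


def parse_trace(trace):
    """Return a list of (time, a, out) integer tuples from $monitor output."""
    samples = []
    for line in trace.splitlines():
        match = re.fullmatch(r"time=\s*(\d+) a=([01]) out=([01])", line.strip())
        if match:
            samples.append(tuple(int(group) for group in match.groups()))
    return samples


def activity(samples, column):
    """Return (number of transitions, fraction of samples at logic 1)."""
    values = [sample[column] for sample in samples]
    toggles = sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)
    return toggles, sum(values) / len(values)


samples = parse_trace(TRACE)
print("a:  ", activity(samples, 1))
print("out:", activity(samples, 2))
```

Both signals toggle at every listed time step and spend half of the samples at logic 1, which is what a 1 ns toggle period in the testbench should produce.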

E.5 Library: simple.lib

library (simple) {
  technology (cmos);
  voltage_unit : "1v";
  current_unit : "1mA";
  pulling_resistance_unit : "1kohm";
  capacitive_load_unit(1, ff);
  default_leakage_power_density : 0.0;

  nom_process : 1.0;
  nom_voltage : 2;
  nom_temperature : 25;

  leakage_power_unit : 1nW;
  default_cell_leakage_power : 4;

  power_lut_template(input_trans) {
    variable_1 : input_transition_time;
    index_1 ("0.0, 20.0, 50.0, 70.0");
  }

  cell (INV) {
    area : 0.5;
    pin(a) {
      direction : input;
      capacitance : 1;
      internal_power() {
        power(input_trans) {
          values("10.0, 50.0, 70.0, 60.0");
        }
      }
    }
    pin(y) {
      direction : output;
      capacitance : 10;
      function : "a'";
    }
  }
}

E.6 TCL script to import VCD: power.tcl

##################### BEGIN Power Analysis ##############################
clear activity all
config activity waveform 0.5n
import vcd $m testbench.design design.vcd

puts "############### BEGIN Activity Report ########################";
data loop x model_pin $m {
    puts "$x \t---\t [query activity $x -stored_at]";
}
puts "############### END Activity Report ##########################";

report power analysis $m -cells -detail
##################### END Power Analysis ################################