UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO

Timing Analysis of Integrated Circuits Under Process Variations

Luís Jorge Brás Monteiro Guerra e Silva (Mestre)

Dissertação para obtenção do Grau de Doutor em Engenharia Informática e de Computadores

Orientador: Doutor Luís Miguel Teixeira d'Ávila Pinto da Silveira

Júri

Presidente: Reitor da Universidade Técnica de Lisboa
Vogais: Doutor João Paulo Marques da Silva
Doutor Luís Miguel Teixeira d'Ávila Pinto da Silveira
Doutor João Manuel Paiva Cardoso
Doutor José Carlos Alves Pereira Monteiro
Doutor Joel Reuben Phillips
Doutor Nuno Filipe Valentim Roma

Maio de 2009

Abstract

As technology feature sizes decrease into the nanometer scale, the impact of process parameter variations on circuit performance becomes extremely relevant.

Traditional nominal-case analysis and verification methodologies are no longer able to ensure silicon success. This dissertation addresses this problem by developing key contributions for a variation-aware timing analysis methodology, capable of accurately modeling and predicting circuit performance for the latest integrated circuit technologies. The proposed approach builds on reliable and established timing analysis paradigms by introducing a variation-aware extension that can easily be implemented in currently used design flows. This dissertation presents several key contributions. One is a methodology for generating parametric delay models, tailored to the specific needs of delay calculation for pre-characterized standard cells. Unlike previous approaches based on numerical approximations, the proposed method is essentially analytical, and therefore capable of producing more accurate and robust results at a fraction of the computational cost. Another contribution is a methodology that enables the automated computation of the critical timing conditions (corners) of a digital integrated circuit, given variation-aware parametric delay models. This constitutes an automated replacement for a task that has mostly been performed manually, relying on the knowledge of designers and process engineers.

Keywords: timing analysis, delay modeling, process parameter variations, corner analysis, critical timing corners

Resumo

À medida que a tecnologia dos circuitos integrados atinge escalas nanométricas, o impacto de variações nos parâmetros de processo no desempenho dos circuitos torna-se extremamente relevante. As metodologias tradicionais de análise e verificação, considerando valores nominais, já não garantem o sucesso na fabricação. Esta dissertação trata este problema, desenvolvendo contribuições chave para uma metodologia de análise temporal considerando variações, capaz de modelar e prever com precisão o desempenho nas tecnologias recentes de circuitos integrados. A abordagem proposta assenta em paradigmas de análise temporal estabelecidos, através da introdução de extensões capazes de tratar variações, que podem facilmente ser implementadas nos fluxos de projecto actualmente utilizados. Esta dissertação apresenta várias contribuições chave. Uma delas é uma metodologia para geração de modelos de atraso paramétricos, adaptados às necessidades específicas de cálculo de atrasos para células pré-caracterizadas. Contrariamente a abordagens baseadas em aproximações numéricas, o método proposto é essencialmente analítico, e portanto capaz de produzir resultados mais precisos numa fracção do custo computacional. Outra contribuição é uma metodologia que permite determinar automaticamente as condições temporais críticas de um circuito integrado digital, dados modelos paramétricos de atraso. Trata-se da automatização de uma tarefa normalmente realizada manualmente, recorrendo à experiência dos projectistas e engenheiros de processo.

Palavras-Chave: análise temporal, modelação de atrasos, variações nos parâmetros de processo, análise de condições limite, condições temporais críticas

Acknowledgments

First, I would like to thank Miguel, my adviser, for his friendship, wisdom and persistence in supervising a nasty student like me, who would often procrastinate more than he should, not to mention other things that should be kept unmentioned. He has taught me many useful things, but particularly how to distinguish between bad and good research, and how to conduct my research career in an honest and productive manner.

Second, I would like to thank Joel, my informal co-adviser, for his teachings, for his support during my stay in Berkeley, and for his patience in explaining to me things that were trivial for him but difficult for me. Our weekly meetings were the source of many of the nice original contributions produced in this work.

I would also like to thank João Marques-Silva, a friend who has always been able to provide the right advice when necessary, and who helped convince me to stay in academia when I had my mind set on leaving for the real world to make some real money. I do not regret that decision and I am glad I followed his advice.

Inês, Sofia, José Carlos and Vasco, my long-time friends and co-workers, also deserve my gratitude for their constant support and encouragement. Additionally, I would also like to thank all the current and past members of the ALGOS Group, particularly Ana de Jesus, Arlindo Oliveira and José Monteiro.

I am also grateful to all the members of the Cadence Research Laboratories in Berkeley, who contributed to the great work environment that I was lucky to find there during my stay. A special thanks goes to Deidre Murphy and Andreas Kuehlmann. I keep the fondest memories of that period.

Finally, I would like to thank my parents, Maria and Jorge. Throughout my life I have always found in them a source of support and encouragement in pursuing my own options. They did that even at times that were particularly difficult for me as well as for them. I love them very much.

This work was carried out at the Optimization and Simulation Algorithms Research Group (ALGOS) of INESC-ID Lisboa, in Lisboa, Portugal, and at the Cadence Research Laboratories, in Berkeley, California. This work was partially supported by Instituto Superior Técnico, by the Portuguese Foundation for Science and Technology under the project PowerPlan (POSC/EEA-ESE/61528/2004), and by Cadence Design Systems, Inc.

Contents

1 Introduction 1

1.1 Motivation ...... 1

1.2 Timing Verification Methodology ...... 4

1.3 Objectives ...... 6

1.4 Original Work ...... 7

1.5 Dissertation Layout ...... 8

2 Background 9

2.1 Timing Simulation ...... 9

2.2 Static Timing Analysis ...... 11

2.3 Variability ...... 13

2.4 Statistical Approaches ...... 14

2.5 Corner-Based Approaches ...... 18

3 Parametric Delay Modeling 21

3.1 Delay and Slew Definitions ...... 22

3.2 Affine Functions ...... 25

3.2.1 Definition ...... 25

3.2.2 Extreme Values ...... 25

3.2.3 Sum ...... 26

3.2.4 Exact Max ...... 27

3.2.5 Simplification of the Max ...... 28

3.2.6 Bounding the Max ...... 29

3.2.7 Bounding Error ...... 30

3.3 Mechanics of Delay Computation ...... 32

3.4 Interconnect and Cell Characterization ...... 34

3.5 Timing Graph ...... 36

4 Parametric Delay Calculation 39

4.1 Nominal Delay Calculation ...... 40

4.1.1 Cell Delay and Cell Loading ...... 40

4.1.2 Interconnect Delay ...... 46

4.2 Variation-Aware Methodology ...... 48

4.2.1 General Perturbation Formulation ...... 48

4.2.2 Specialization to Interconnect ...... 50

4.2.3 Interconnect Sensitivity Calculation ...... 52

4.3 Cell Delay Sensitivity Calculation ...... 52

4.4 Practical Implementation ...... 55

4.4.1 Interconnect Delay ...... 55

4.4.2 Effective Capacitance and Cell Delay ...... 56

4.4.3 Interconnect Delay Sensitivity ...... 58

4.4.4 Cell Delay Sensitivity ...... 59

4.5 Conclusions ...... 59

5 Worst-Timing Corner 61

5.1 Worst-Delay Corner ...... 63

5.2 Exhaustive Methods ...... 65

5.3 Static Pruning ...... 65

5.4 Dynamic Pruning ...... 67

5.4.1 Branch-and-Bound ...... 68

5.4.2 Path Space Search ...... 69

5.4.3 Parameter Space Search ...... 74

5.4.4 Decision Heuristics ...... 79

5.5 Worst-Slack Corner ...... 79

5.5.1 Sequential Timing Constraints ...... 80

5.5.2 Setup Time and Late Mode ...... 81

5.5.3 Hold Time and Early Mode ...... 82

5.5.4 Multi-Cycle Paths ...... 83

5.5.5 Transparent Latches ...... 84

5.6 Conclusions ...... 84

6 Applications and Extensions 87

6.1 Augmented Timing Graph ...... 88

6.2 Worst-Slack Corner of a Single Register ...... 89

6.3 Worst-Slack Corner Over All Registers ...... 89

6.4 Minimum Clock Period ...... 90

6.5 Slack Violations ...... 90

6.6 Clock Tree Analysis ...... 91

6.6.1 Clock Latency ...... 92

6.6.2 Clock Skew ...... 93

6.7 k Worst-Delay Paths and Corners ...... 94

6.8 Conclusions ...... 98

7 Experimental Results 99

7.1 Benchmarks ...... 99

7.2 Parametric Delay Calculation ...... 100

7.3 Worst-Delay Corner ...... 104

7.4 Worst-Slack Corner ...... 105

7.5 k Worst-Delay Paths and Corners ...... 108

8 Conclusions and Future Work 111

8.1 Delay Computation ...... 111

8.2 Timing Analysis ...... 112

8.3 Future Work ...... 112

List of Figures

1-1 Simplified timing verification flow...... 4

2-1 Example timing graph and sum/max operations...... 11

2-2 Upper bound computation example...... 19

3-1 Reference voltages for delay and slew calculation...... 22

3-2 Corners of x, for p =3...... 26

3-3 Maximum of affine functions and of piecewise-affine functions, for p = 1. . . . 27

3-4 Redundant affine functions in the max, for p = 1...... 28

3-5 Tightest single plane upper bound of a convex piecewise-affine function, for p = 1...... 30

3-6 Maximum/minimum error between two convex piecewise-affine functions, for p = 1...... 31

3-7 Typical partition of a digital circuit topology for delay computation...... 33

3-8 Most relevant parasitic effects considered in interconnect extraction...... 34

3-9 Illustration of the original circuit and the corresponding timing graph. . . . . 36

3-10 Illustration of the elements of a timing graph...... 36

4-1 Voltage source based cell models: original circuit and equivalent circuit. . . . 41

4-2 Waveforms of vi, v and vc, with delay, d, slew, s, and shift, k, measurements. 41

4-3 Interconnect delay and slew calculation from voltage waveforms...... 55

5-1 Timing graph of a combinational block...... 63

5-2 Detection of monotonic delay parameters...... 66

5-3 Illustration of branch-and-bound on the corners of a parameter space. . . . . 69

5-4 Illustration of delay estimates...... 70

5-5 Calculation of the din estimate...... 71

5-6 Execution of WDC-Path-BnB for a small timing graph...... 73

5-7 Computation of din for different values of ∆λ...... 75

5-8 Execution of WDC-Parameter-BnB...... 78

5-9 Setup and hold in a sequential circuit...... 80

5-10 Modeling setup constraints in the timing graph...... 82

5-11 Modeling hold constraints in the timing graph...... 83

6-1 Augmented timing graph...... 88

6-2 Clock tree and its timing graph...... 91

6-3 Mirroring of the timing graph...... 92

6-4 Worst clock skew corner computation...... 93

6-5 Worst delay of the 1000 worst-delay paths and of the 1000 worst-delay corners, in c6288. Note that the plots are in different scales...... 95

6-6 Illustration of worst-delay paths and corners...... 97

7-1 Computed delay sensitivities vs. transistor-level simulation...... 102

7-2 Histograms of errors in computed delay sensitivities...... 103

7-3 Relative errors in computed cell delay and output slew sensitivities...... 104

List of Tables

7.1 Information for ISCAS’85 benchmark suite...... 101

7.2 Information for ISCAS’89 benchmark suite...... 101

7.3 Results for worst-delay corner computation...... 106

7.4 Results for worst-delay corner computation, using three max bounding techniques...... 106

7.5 Results for worst-slack corner computation...... 107

7.6 Results for the exact computation of the k worst-delay paths...... 110

7.7 Results for the exact computation of the k worst-delay corners...... 110

1

Introduction

1.1 Motivation

An integrated circuit (IC) is a miniaturized version of an electrical circuit where, through a complex fabrication process, several conducting and insulating layers are deposited on a silicon substrate, typically less than 1 cm wide, to form active devices and wires. It constitutes an extremely compact, reliable and cost-effective solution for implementing complex control and signal processing circuits for mass production. ICs are becoming pervasive in all aspects of our daily life. From the most basic micro-controller to the most complex designs, ICs are at the heart of many electronic systems such as computers, personal digital assistants (PDAs), multimedia players, cellular phones and other, more mundane, electronic appliances such as microwave ovens or washing machines.

The high-end segment of the consumer electronics market, mostly focused on ultra-portable systems, is imposing new challenges on IC design and fabrication technologies. Such systems must be extremely compact, yet incorporate increasing sophistication, thus demanding the integration of more functionality into each IC while maintaining its small footprint. Increasing the functionality of an IC frequently implies that more devices must be packed into the same IC area. This is achieved by resorting to so-called process scaling, which consists of reducing device sizes while also increasing their proximity, such that more devices, and therefore more functionality, can be accommodated in the same IC area. In an effort to address this necessity, foundries have continuously reduced feature sizes well into the nanometer scale.

The latest 22nm technology node, still under development, will enable the integration of several billion devices in a single IC. At such small scales, device dimensions can be on the order of just a few tens of atoms, and therefore even the slightest fluctuation in the parameters of the fabrication process can have a huge impact on circuit performance and reliability [61]. This phenomenon, consisting of an increased sensitivity of IC performance to variations in process parameters, is generically designated variability, and is the subject of great concern both for designers and for process engineers. Another consequence of scaling down device sizes is that, since devices are smaller and closer together, the electromagnetic interactions between them become considerably more relevant and can easily prevent correct circuit operation if not properly accounted for during the design stage.

Most portable systems are battery-operated, which imposes stringent constraints on power consumption. Such constraints demand the continuous development of novel design and fabrication approaches, completely focused on power efficiency. At the architectural level, circuits are now partitioned into several power domains, such that particular regions of the IC may either be powered down when not in use, or may operate at different power supply voltages. Additionally, such voltages have been dramatically reduced, currently reaching 1V and below. As a consequence, the noise margins, created to ensure the resilience of digital circuit operation to noise, are now significantly reduced. This reduction can cause reliability issues and imposes additional constraints on electronic system design, as voltage supplies now need to be more precise and stable, and all the system-wide sources of noise must be accurately predicted and considered during design.

The development of high-bandwidth transmission technologies, as well as the availability of cheap high-capacity storage, requires ICs capable of real-time processing of massive amounts of information, which is only made possible by increasing operating clock frequencies, currently reaching 3GHz. The maximum operating clock frequency of an IC is limited by the performance of its slowest components. Therefore, correct operation at higher clock frequencies requires that such components meet rigorous performance specifications. For modern IC technologies, where process variability can have a significant impact on the performance of the fabricated IC, such specifications can be extremely difficult to ensure at the design stage.

The sheer complexity of modern IC designs, as well as the strict performance requirements they must meet, has placed the task of design verification in a prominent position. Design verification consists of validating, prior to circuit fabrication, that all the design specifications have been met, both in terms of functionality and of performance. Even though proper verification ensures that the IC design is correct according to the available models and specifications, some of the fabricated ICs may still be faulty. On one hand, this may occur because, even though the fabrication process is calibrated to ensure the best yield, such yield is never 100%. On the other hand, the accuracy of the models used in the verification procedure is necessarily limited. Therefore, even though a design may seem correct, in reality it may exhibit flaws that can only be detected using more accurate models than the ones available. Under variability this problem is exacerbated, since predicting circuit behavior becomes an extremely complex task, as many relevant process and operational parameters need to be considered in order to produce accurate models. Since successful verification is strongly dependent on the accuracy of the models being used, the inability to produce such models with adequate accuracy can constitute a serious problem.

A simple way to address variability, and the consequent difficulty in predicting IC behavior prior to fabrication, is to take more conservative design options. In practice, this corresponds to the addition of a design slack that compensates for any adverse parameter variations that may take place during fabrication. Even though this approach can successfully limit the negative impact of parameter variability on correct circuit operation, it also leads to suboptimal and wasteful designs that either do not meet the required performance specifications or are too expensive to be commercially viable.

The more appealing, but simultaneously more complex, approach to deal with variability is to directly incorporate variability models in the verification methodology, thus enabling the accurate prediction of variability effects during the design stage.

Prior to tape-out¹, an IC design undergoes an exhaustive verification procedure, commonly designated design sign-off, which is expected to ensure that the fabricated IC will perform

¹This term is common jargon for the procedure of sending an IC design to fabrication. In the early days, stream tapes were the only medium capable of holding the large amount of information necessary to describe an IC design. Therefore, the designer would save the design information on a tape and mail it to the foundry.


Figure 1-1: Simplified timing verification flow.

according to its required specifications. For analog ICs, typical design sign-off consists of detailed electrical simulation, using an appropriate simulator such as Spice [44]. The electrical behavior of the IC design is exhaustively simulated for a meaningful range of input stimuli and operating conditions (power supply voltage, temperature, etc.), and the waveforms of the output response are subsequently used to compute accurate performance figures (gain, distortion, noise, etc.). Even though detailed electrical simulation techniques are computationally expensive, the fairly small number of elements present in most analog designs enables their practical applicability. However, for typical digital IC designs, containing several million elements, such techniques are overly expensive and do not constitute a viable option.

Therefore, the sign-off of digital IC designs is usually conducted by resorting to a sequence of approximate, simplified modeling and analysis techniques that separately target the validation of particular aspects or metrics of IC performance. Two of the most relevant such metrics are power and timing (speed).

1.2 Timing Verification Methodology

Timing verification is concerned with verifying, prior to fabrication, whether a given IC design, implemented in a target technology, will be able to operate reliably at a specified clock frequency. This procedure is conducted by an appropriate computational engine that, given the design specification and the technology models provided by the foundry, calculates performance estimates and, based on such estimates, reports the existence of critical timing conditions that may limit correct circuit operation. From that information it is possible to predict the range of frequencies at which the fabricated circuit can be clocked, as well as other operational constraints.

The typical timing verification sign-off flow involves three major steps, as illustrated in Figure 1-1. The IC layout, a map that contains the detailed location of each circuit element (device or wire) within the corresponding IC layer, is the main input to this flow. Additionally, each step also consumes specific information about the IC fabrication technology.

The first step is the extraction procedure, which analyzes the IC layout and produces the netlist of an electrical circuit that characterizes, in terms of lumped elements, the electrical behavior of the IC. There are many different extraction techniques, with varying degrees of accuracy and speed, that are customized to target specific applications. Usually, the more accurate techniques produce netlists with a larger number of elements, and/or more complex elements.

The second step is the delay modeling procedure, which partitions and maps the circuit into a network of interconnected delay elements, and subsequently calculates the delay associated with each element. Most often, each delay element corresponds to either an interconnect wire or an active element (cell). In practice it is common to compute a small set of delays for each element, rather than just one delay. Such delays are intended to characterize the circuit behavior under several fabrication and operating conditions. Since such conditions usually represent extreme cases, they are designated corners. The network of delay elements is most often represented by a timing graph, where graph edges correspond to delay elements and graph vertices correspond to the connections between them.

Finally, the third step is the timing analysis procedure, which analyzes the network of delay elements in order to validate the timing specifications. From a timing perspective, we are concerned with analyzing the delay of particular paths within the circuit, and the analysis procedure enables the computation of such delays. The delay of particular paths can limit the operating clock frequency of the circuit, since the next processing activity along such a path, triggered by the clock signal edge, can only start after the previous processing activity has been completed. The timing report produced usually contains a wealth of information that either certifies that the design meets the required timing specifications, or provides enough information to guide the designer in correcting potential timing violations.

5 1.3 Objectives

For older fabrication technologies, the timing verification of an IC design for a small number of process and operational conditions (corners) would be more than enough to ensure the correct operation of the IC, once fabricated. However, such simplistic approaches are no longer valid for the latest IC technologies, where variability issues assume particular relevance.

Therefore, new modeling and analysis methodologies must be developed. Such methodologies must meet a few basic requirements:

• they must be able to model and analyze, with sufficient accuracy, the timing behavior of IC designs targeting the latest IC technologies, known to be extremely sensitive to parameter variations;

• they must enable the production of detailed and meaningful reports that may provide effective guidance for design correction and optimization;

• they must be efficient, from a computational standpoint, such that a typical IC, with many millions of elements, may be processed in a fair amount of time;

• they must produce very compact models, such that the modeling information of all the elements of an IC may fit in a fair amount of memory/storage;

• they must allow, even if at a very coarse level, a tradeoff between processing efficiency and accuracy, such that they may be useful to the designer in a broad range of contexts and design stages.

This dissertation proposes the development of a delay modeling and timing analysis methodology that addresses the requirements stated above. An additional self-imposed requirement is that the proposed methodology must constitute a natural variation-aware extension to the currently established timing verification methodologies. We believe this requirement to be of paramount importance, since a radical shift away from the currently established timing verification paradigms could ultimately entail an overhaul of the verification process, with an undesirable impact on designer productivity and time-to-market.

6 1.4 Original Work

The original contributions reported in this dissertation can be divided into two main topics:

• generation of parametric delay models (delay modeling);

• efficient computation of worst-case timing conditions (timing analysis).

We start by introducing the foundations of a parametric delay modeling framework that, by accounting for parameter variations, addresses the issues raised by the latest IC technologies. This framework closely follows the well-established delay modeling strategies that have been successfully used for many years in the nominal case, but introduces the necessary modifications and extensions to handle parameter variations in an accurate and efficient fashion.

Subsequently, we derive the necessary mathematical formulations for the computation of the proposed parametric delay models. Specifically, we describe how to produce cell and interconnect delays and output slews as affine functions of the parameters. In this work, the general development of linear time-varying (LTV) perturbation theory [73, 54] is adapted for the extraction of variation-aware delay models, tailored to the specific needs of delay calculation for pre-characterized standard cells.
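To make the affine form concrete, the following sketch (our illustration, not the dissertation's implementation; all names are hypothetical) models a delay as d(λ) = d₀ + Σᵢ aᵢλᵢ with each normalized parameter λᵢ ∈ [−1, 1]: path delays add coefficient-wise, and the worst case over the parameter box is d₀ + Σᵢ|aᵢ|, attained at a corner.

```python
# Illustrative affine delay model d(lam) = d0 + sum_i a_i * lam_i,
# with each normalized parameter lam_i in [-1, 1].
# Hypothetical sketch, not the dissertation's actual code.
class AffineDelay:
    def __init__(self, d0, coeffs):
        self.d0 = d0          # nominal delay
        self.coeffs = coeffs  # sensitivities to each normalized parameter

    def __add__(self, other):
        # The sum of two affine delays (e.g. along a path) is again affine, exactly.
        return AffineDelay(self.d0 + other.d0,
                           [a + b for a, b in zip(self.coeffs, other.coeffs)])

    def at(self, lam):
        # Evaluate the delay at a specific parameter setting.
        return self.d0 + sum(a * l for a, l in zip(self.coeffs, lam))

    def upper(self):
        # Worst case over the whole box: attained at a corner, d0 + sum |a_i|.
        return self.d0 + sum(abs(a) for a in self.coeffs)

cell = AffineDelay(2.0, [0.5, -0.25])
wire = AffineDelay(1.5, [0.25, 0.5])
path = cell + wire
print(path.upper())  # prints 4.5, reached at the corner lam = (+1, +1)
```

Note that while the sum stays affine, the max of two affine delays is only piecewise-affine, which is why bounding the max receives dedicated treatment in Chapter 3.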

Afterwards, we propose an efficient and automated methodology for computing the worst-timing conditions (corners) of a digital integrated circuit, when parametric delay models are available. Specifically, we address the computation of worst-delay corners of combinational blocks and of worst-slack corners of sequential circuits, as well as their associated paths.

The proposed methodology casts the computation of the worst-timing corners as a search problem, which provides an intellectual paradigm that is more general and useful than most previous approaches.
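As a toy illustration of this search view (our own, not the pruned algorithm developed in later chapters): with affine delay models, the worst-delay corner of a single path lies at a vertex of the parameter box, so a brute-force search simply scans all 2^p sign combinations — exactly the exponential enumeration that the branch-and-bound techniques of Chapter 5 aim to avoid.

```python
# Brute-force worst-corner search for one affine path delay
# d(lam) = d0 + sum_i a_i * lam_i, with each lam_i in {-1, +1}.
# Toy illustration of the search formulation only.
from itertools import product

def worst_corner(d0, coeffs):
    best = max(product((-1, 1), repeat=len(coeffs)),
               key=lambda lam: d0 + sum(a * l for a, l in zip(coeffs, lam)))
    return best, d0 + sum(a * l for a, l in zip(coeffs, best))

corner, delay = worst_corner(3.0, [0.5, -0.25])
# corner == (1, -1): each parameter is pushed in the sign of its coefficient
```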

Finally, we discuss several problems of practical relevance in timing analysis and optimization, and demonstrate that they can be addressed within the framework of the formulations and techniques that have been proposed. Further, we discuss a few useful and practical extensions to this framework.

7 1.5 Dissertation Layout

This chapter provides a quick overview of previous approaches for performing timing analysis and verification of digital integrated circuits. We start by discussing general nominal-case approaches, and afterwards introduce recently developed variability-aware timing analysis approaches that target the latest IC designs and technologies.

The remainder of this dissertation is organized in the following manner. Chapter 2 reviews the most relevant established approaches for performing timing analysis and verification of digital ICs. Additionally, it discusses the main requirements imposed by the latest IC technologies, and the most relevant variability-aware approaches that have been proposed for addressing them. Chapter 3 introduces the proposed parametric delay modeling strategy. Chapter 4 details the mathematical formulations underlying the computation of the parametric delay models presented in Chapter 3. Chapter 5 proposes an efficient, variation-aware and automated methodology for computing the worst-timing corners of a digital IC, assuming that process parameters are defined by simple value ranges. Chapter 6 discusses relevant practical applications of the methodology proposed in Chapter 5. Chapter 7 reports experimental evidence that validates the contributions proposed in Chapters 3, 4, 5 and 6.

Finally, Chapter 8 draws some concluding remarks and presents future research directions.

2

Background

This chapter provides a brief overview of the most relevant methodologies proposed for performing timing analysis and verification of digital integrated circuits. We start by reviewing general nominal-case approaches, from early timing simulation techniques to static timing analysis. Afterwards, we discuss the new requirements imposed by the variability effects that characterize modern IC technologies. Finally, we present recently developed timing analysis methodologies that effectively address variability. Such methodologies can be divided into two main classes: statistical approaches and corner-based approaches.

2.1 Timing Simulation

Many years ago, with the advent of sequential ICs, the problem of verifying whether a given circuit could operate at a target clock frequency emerged as a critical stage in digital IC design. Clearly, this requirement can be accurately verified by building the circuit and testing it in its target operating environment. However, designers only resort to this option in the final stages of IC design, prior to mass production, since the cost of fabricating small batches of custom ICs is extremely high. Additionally, testing and debugging the fabricated IC is not a trivial task. Therefore, during most of the IC design cycle, designers must resort to less expensive and more convenient methods that can capture circuit behavior and predict its performance as accurately as possible, before fabrication.

An obvious option is to simulate the behavior of the circuit using an accurate transistor-level simulator such as Spice [44], Spectre [11] or HSpice [64]. However, the computational cost of such an approach is prohibitively high, even for small circuits, not only because accurate transistor-level electrical simulation is a computationally expensive task, but also due to the large number of transistors and parasitic elements usually involved. Furthermore, the accuracy of the results thus obtained depends on the set of input vectors chosen, which may not properly exercise all the relevant circuit functionality. This is a well-known drawback of using simulation as a means of verification.

To overcome the computational cost of accurate simulation, researchers have developed less accurate, but significantly more efficient, simulation algorithms. These simulators employ simplified device models and equation formulation techniques. They have been available for many years and are extensively used by the design community. Early examples of simplified transistor-level simulators are Splice1 [60], Adept [48], Cinnamon [69], XPSim [3], Specs [71], Swec [38] and Aces [20]. Pursuing the same objective, but following a slightly different approach, several enhanced switch-level simulators have been developed, such as Motis [14, 15], Crystal [52, 4], Tv [31], Samson [59], Sls [68] and ELogic [33]. The two most significant commercial simplified circuit simulators currently available are UltraSim [12] and NanoSim [66]. These modern simulators employ several acceleration techniques, which most often exploit the hierarchical structure of the IC design or the existence of circuit blocks described at various abstraction levels.

Simulation was the preferred method used by early timing analyzers for estimating and verifying circuit delay. These analyzers, also known as timing simulators, compute circuit delay by simulating the response of the circuit to a given set of input patterns. Even though this approach is quite accurate, it is still very inefficient, as the task of choosing the appropriate input patterns can be extremely complex. Such input patterns must exercise the circuit in a realistic manner that enables accurate delay computation. Additionally, since the number of paths in a circuit can grow exponentially with the number of inputs, it becomes unfeasible to simulate all the possible input combinations. Consequently, other timing analysis methods, which do not require input patterns and their exhaustive simulation, had to be pursued.

Figure 2-1: Example timing graph and sum/max operations.

2.2 Static Timing Analysis

In an attempt to address the drawbacks associated with performing timing analysis through simulation, researchers have developed the so-called static timing analyzers [29, 28, 31, 52, 51]. This designation derives from the fact that circuit delay is statically computed, without requiring the simulation of any specific input patterns. Such analyzers approximate the timing behavior by simple conservative (e.g. worst-case) models, and reduce the problem of computing the delay of a circuit to a graph problem, where edges represent component delays.

Component delays are the delays introduced either by cells (blocks of active devices) or by interconnect (wires). Cell delays are usually pre-characterized in a technology library, where delay information is stored in the form of lookup tables. Therefore, most often cell delay calculation consists of a table lookup followed by appropriate interpolation. Interconnect delay can be computed by resorting to a wide range of techniques, from highly accurate electrical simulation to the application of rudimentary delay models, like the Elmore delay model [27]. One such technique is discussed later, in Chapter 4.
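As an illustration of the table-lookup-plus-interpolation step, the sketch below performs bilinear interpolation in a hypothetical two-dimensional cell delay table indexed by input slew and output load; the axis breakpoints and delay values are invented for the example, not taken from any real library:

```python
import bisect

def cell_delay(slews, loads, table, slew, load):
    """Bilinear interpolation in a pre-characterized delay table.

    `slews` and `loads` are sorted axis breakpoints; `table[i][j]`
    holds the characterized delay for (slews[i], loads[j])."""
    # Locate the bracketing interval on each axis (clamped to the table).
    i = min(max(bisect.bisect_right(slews, slew) - 1, 0), len(slews) - 2)
    j = min(max(bisect.bisect_right(loads, load) - 1, 0), len(loads) - 2)
    tx = (slew - slews[i]) / (slews[i + 1] - slews[i])
    ty = (load - loads[j]) / (loads[j + 1] - loads[j])
    # Interpolate along the slew axis first, then along the load axis.
    d0 = table[i][j] * (1 - tx) + table[i + 1][j] * tx
    d1 = table[i][j + 1] * (1 - tx) + table[i + 1][j + 1] * tx
    return d0 * (1 - ty) + d1 * ty

# Hypothetical 2x2 characterization of a cell (delays in ns).
slews, loads = [0.1, 0.5], [1.0, 4.0]
table = [[0.20, 0.50], [0.30, 0.80]]
print(cell_delay(slews, loads, table, 0.3, 2.5))
```

Points outside the characterized grid are clamped to the outermost interval, a common (if conservative) policy; real tools may instead extrapolate.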

While in timing simulation the computation of component delays and the timing analysis procedure are performed simultaneously, in static timing analysis it becomes a two-step process, where component delays are calculated prior to timing analysis, as shown in Figure 1-1. After the delay calculation step, component delays are usually stored in the form of an annotated graph, designated the timing graph, whose structure mimics the structure of the circuit, such that a path in the graph corresponds to a path in the circuit. In such a graph, component delays are annotated as edge properties. Therefore, the subsequent static timing analysis procedure usually consists of a sequence of graph operations that target the verification of particular timing properties (specifications).

An example timing graph is depicted in Figure 2-1, where edges are annotated with delays, which in this case are given by real numbers with one decimal place. The boxed numbers are designated arrival times, and usually quantify the earliest or the latest time instant at which a given signal transition can reach the corresponding circuit point, when traveling from a circuit input. The meaning of the arrival time values depends on whether we assume the early or the late mode of operation. In early mode, we are concerned with computing the earliest time instant at which a signal transition can reach any given circuit point. Conversely, in late mode we are concerned with computing the latest time instant at which a signal transition can reach any given circuit point. Therefore, arrival times are computed by adding edge delays along a path and taking the min or the max (in early or late mode, respectively) of the accumulated delays when paths converge at a given circuit point. In this example, where late mode is assumed, the arrival time at the single circuit output is 9.7. This means that any signal injected at the inputs of the circuit will take at most 9.7 time units to reach the output.

A common designation for this specific arrival time is circuit delay.
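The late-mode propagation just described can be sketched as a topological traversal of the timing graph, summing delays along edges and taking the max where paths converge. The graph and delay values below are hypothetical, not the ones from Figure 2-1:

```python
from collections import deque

def arrival_times(graph, inputs):
    """Late-mode arrival times on a DAG timing graph.

    `graph` maps each node to a list of (successor, edge_delay) pairs;
    `inputs` maps each primary input to its injection time."""
    # Count predecessors so nodes are processed in topological order.
    preds = {n: 0 for n in graph}
    for n, edges in graph.items():
        for succ, _ in edges:
            preds[succ] = preds.get(succ, 0) + 1
    at = dict(inputs)
    ready = deque(n for n in graph if preds.get(n, 0) == 0)
    while ready:
        n = ready.popleft()
        for succ, delay in graph.get(n, []):
            # Late mode: keep the max over all converging paths.
            at[succ] = max(at.get(succ, float("-inf")), at[n] + delay)
            preds[succ] -= 1
            if preds[succ] == 0:
                ready.append(succ)
    return at

# Hypothetical graph: two paths from input i converge at output o.
g = {"i": [("a", 2.1), ("b", 1.7)], "a": [("o", 4.0)], "b": [("o", 5.2)], "o": []}
print(arrival_times(g, {"i": 0.0})["o"])
```

Swapping `max` for `min` (and the `-inf` seed for `+inf`) yields the early-mode variant.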

Static timing analysis has become an enabling methodology for optimizing performance and ensuring that circuits satisfy certain timing and frequency requirements. To that end, timing analyzers determine approximate but safe estimates of the worst-case delay through a circuit: for every input and output signal, there are many possible paths through the circuit, each path consisting of a set of interconnected cells. Timing analysis deals with the identification and analysis of the critical paths, the longest delay paths in the circuit. In addition to finding critical-path delays, timing analyzers can also be used for miscellaneous static analyses, such as finding high-speed components off the critical path that can be slowed down to save power, among several other relevant tasks.

2.3 Variability

The process of fabricating an IC consists of a sequence of several steps. Even though tremendous effort is put into making them repeatable, each will occur in a slightly different way for each IC [62]. Several reasons contribute to this fact. Even though temperature, humidity, level of contamination (e.g. air particle density) and other environmental conditions are tightly controlled during fabrication, small variations can still occur, and they will definitely impact the fabricated IC. Additionally, the performance of the optical and mechanical apparatus used in the fabrication process may vary, either due to random events or even as a result of normal equipment operation.

Variations due to the manufacturing process are one of the main sources of IC performance variability. When such variations are observed between different ICs, they are designated inter-die variations. An example is the fact that one chip is frequently faster than another. Some of these variations, such as variations in gate delay due to gate length or dopant concentration, are tightly correlated: if one gate is fast, then so are all the others on the same chip. On the other hand, variations in interconnect behave differently. In one fabrication step, one machine lays down all the metal for a layer of a chip. In a subsequent step, another machine, or the same one, lays down the next layer. Such layers are not correlated to each other, but each is correlated across the chip. This means that, for example, if the first layer is thick, it will be thick across the chip, but this implies nothing about the thickness of the second layer.

Simultaneously, we can observe variations among the various elements inside the same chip, designated intra-die variations. These variations have a systematic component and a random component. The systematic component [42] designates variations that consistently occur in certain areas of the chip, either due to design (layout) characteristics or due to some artifact of manufacturing, such as cross-chip gradients. On the other hand, the random component [46] is mostly related to manufacturing imperfections that can occur randomly across the chip. Usually this component varies with the distance between elements on the chip: two nearby elements are more likely to exhibit correlated behavior than two distant ones.

Another source of variation is the environmental operating conditions of the chip. In this case, we can consider global variations, such as changes in global power supply voltage and global temperature changes on the chip. We can also consider local variations, such as local supply noise, local temperature changes, cross-talk on wires, etc.

Finally, more subtle sources of variation are device fatigue phenomena. Examples are electromigration, hot electron effects and negative bias temperature instability, which can occur either locally or globally, in a systematic or random way.

2.4 Statistical Approaches

The impact of process variation on circuit performance is an area of increasing concern, both in the semiconductor industry and in academic research. In the research community, considerable work has been devoted to the development of statistical static timing analysis (SSTA) techniques as a means of addressing this problem. Such techniques model circuit delays as functions of random variables that characterize, in a statistical sense, the behavior of process parameter variations.

The reference method for SSTA, against which all other methods are compared for accuracy (though not speed), is the well-known Monte Carlo simulation method [43], which finds application in many other science and engineering fields. In a Monte Carlo analysis, a series of deterministic simulation runs is performed, replacing in each run the value of every stochastic input variable by a randomly generated number, according to its predefined distribution. The final result is a set of samples for each problem output variable, from which its statistical distribution can be computed. This method is computationally extremely expensive, even though it provides a fairly accurate statistical approximation. Therefore, a correct estimation of the number of runs necessary to obtain statistically significant results is an important matter, which usually depends on the problem under analysis. This method is mostly used when no analytical solution is easy or even possible to obtain or implement.
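A minimal Monte Carlo sketch of this procedure, for a hypothetical circuit whose two converging path delays are affine functions of a single normally distributed process parameter (all numeric values are invented for illustration):

```python
import random

def monte_carlo_delay(num_runs=20000, seed=0):
    """Sample the late-mode delay of a toy circuit with two converging
    paths whose delays are affine in one process parameter x."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_runs):
        x = rng.gauss(0.0, 1.0)        # draw the stochastic parameter
        d1 = 10.0 + 2.0 * x            # hypothetical path delays,
        d2 = 11.0 - 1.0 * x            # affine in x
        samples.append(max(d1, d2))    # late-mode circuit delay
    mean = sum(samples) / num_runs
    var = sum((s - mean) ** 2 for s in samples) / (num_runs - 1)
    return mean, var

mean, var = monte_carlo_delay()
print(mean, var)
```

The standard error of the mean shrinks only as the inverse square root of the number of runs, which is precisely why run-count estimation matters.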

PERT is a well-known statistical project-scheduling method, proposed in [35]. It provides a statistical aid in estimating end dates and critical paths when scheduling large projects, requiring as inputs the dependencies between the various tasks and three estimates for the duration of each task: most likely time, shortest time and longest time. Given this information, PERT is able to estimate the probability of completing a given task at any given time. The first known attempt to perform statistical timing analysis, described in [34], uses PERT, replacing tasks by logic gates, task dependencies by the circuit connectivity, and task delays by gate delays. Given the three parameters that characterize the delay of a gate, a (shortest), m (most likely) and b (longest), the mean and variance are computed according to Eqn. (2.1).

\mu = \frac{a + 4m + b}{6}, \qquad \sigma^2 = \left(\frac{b - a}{6}\right)^2 \qquad (2.1)

For each path through the circuit, the overall mean and variance are computed by summing their individual components. Further, given the required arrival times at the primary outputs, the corresponding slacks are computed. Even though this method is quite primitive by modern standards, it was a major breakthrough for its time.
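The moment computation of Eqn. (2.1) and its summation along a path can be sketched as follows, with invented three-point gate estimates:

```python
def pert_moments(a, m, b):
    """Three-point (PERT) estimates of a task's mean and variance, Eqn. (2.1)."""
    mu = (a + 4 * m + b) / 6.0
    var = ((b - a) / 6.0) ** 2
    return mu, var

def path_moments(tasks):
    """Sum per-gate means and variances along a path, as in [34]."""
    mu = sum(pert_moments(*t)[0] for t in tasks)
    var = sum(pert_moments(*t)[1] for t in tasks)
    return mu, var

# Hypothetical two-gate path: (shortest, most likely, longest) delays.
print(path_moments([(1.0, 2.0, 4.0), (2.0, 3.0, 5.0)]))
```

Summing variances implicitly assumes the per-gate delays are independent, which is exactly the limitation the later methods in this section try to lift.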

Several years later, [6] proposes a simple approximation scheme to perform SSTA in linear time. This scheme is only applicable to normal distributions, and assumes full statistical independence between random variables, therefore not considering correlations. As the author reviews, adding two normally distributed random variables, c = sum(a, b), is simple, and consists of adding the means and the variances, as described in Eqn. (2.2).

\mu_c = \mu_a + \mu_b, \qquad \sigma_c^2 = \sigma_a^2 + \sigma_b^2 \qquad (2.2)

However, performing the max operation is more complex. Considering c = max(a, b), it is observed that

P\{c \le x\} = P\{a \le x \wedge b \le x\} \qquad (2.3)

which makes sense because if a or b is greater than x, then c also must be, since it is the maximum of both. Considering statistical independence between a and b, we obtain

P\{c \le x\} = P\{a \le x\} \cdot P\{b \le x\} \qquad (2.4)

By definition, F_c(x) = P\{c \le x\}, therefore we have

F_c(x) = F_a(x) \cdot F_b(x) \qquad (2.5)

Taking the derivative on both sides, we get

f_c(x) = F_a(x) \cdot f_b(x) + F_b(x) \cdot f_a(x) \qquad (2.6)
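Eqns. (2.5) and (2.6) can be checked numerically for two independent normal random variables; `norm_cdf` and `norm_pdf` below are local helper functions, not library calls:

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal random variable, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def norm_pdf(x, mu=0.0, sigma=1.0):
    """PDF of a normal random variable."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def max_cdf(x, mu_a, s_a, mu_b, s_b):
    """Eqn (2.5): CDF of c = max(a, b) for independent normals a, b."""
    return norm_cdf(x, mu_a, s_a) * norm_cdf(x, mu_b, s_b)

def max_pdf(x, mu_a, s_a, mu_b, s_b):
    """Eqn (2.6): PDF of c, the derivative of Eqn (2.5)."""
    return (norm_cdf(x, mu_a, s_a) * norm_pdf(x, mu_b, s_b)
            + norm_cdf(x, mu_b, s_b) * norm_pdf(x, mu_a, s_a))

# Sanity check: numerically integrating the PDF recovers the CDF.
xs = [i * 0.01 for i in range(-1000, 500)]
area = sum(max_pdf(x, 0, 1, 1, 2) for x in xs) * 0.01
print(round(area, 4))
```

Note that even though a and b are normal, c = max(a, b) is not, which foreshadows the moment-matching error discussed for [70] later in this section.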

A method using discretized probability distributions is proposed in [39]. In this method, PDFs are assumed to be trains of discrete impulses, which are propagated through the circuit (timing graph). The run-time of this method is, in the worst case, exponential in the size of the circuit. Further, correlations, both due to global dependencies on the sources of variation and due to reconvergent fanouts, are ignored, as is the case with [6].

Another approach [1] proposes an SSTA method that propagates PDFs through the circuit graph, with numerical convolution and multiplication being performed at each step. In the presence of reconvergent fanout paths, the random variables are replaced by stochastically larger ones, to obtain upper (and lower) bounds on the PDFs. However, the number of paths in a circuit can increase significantly with circuit size, resulting in high complexity. Further, delays of the gates and arrival times are modeled as independent random variables, thus discarding other types of correlation, not induced by reconvergent fanouts. An interesting result presented in this paper is the proof that SSTA without accounting for reconvergent fanouts produces an upper bound on the actual delay distribution, thus being safe, even though pessimistic.

An interesting contribution is introduced by [21], as it models arrival times as CDFs and delays as PDFs, thus allowing an efficient computation of the sum and max operations. Using this modeling technique, and assuming that all the variables are statistically independent, for each gate we have

F_{at_y} = (F_{at_{x_1}} \otimes f_{d_{x_1},y})(F_{at_{x_2}} \otimes f_{d_{x_2},y}) \cdots (F_{at_{x_n}} \otimes f_{d_{x_n},y}) \qquad (2.7)

This method computes the CDF of the arrival time at the gate output by convolving (the \otimes operator) the CDF of each input arrival time with the PDF of each input-to-output delay. By performing a levelized breadth-first traversal of the circuit graph, and applying this procedure to every gate, it is possible to compute the arrival times for every node in the circuit. In the implementation of the algorithm, the CDFs are modeled as piecewise-linear functions and the PDFs are modeled as piecewise-constant functions. Each piece is convolved individually, and for an n-piece CDF and an n-piece PDF, n^2 convolutions are performed. The resulting CDF waveform is quadratic. This waveform is then converted back to a piecewise-linear function, thus enabling its forward propagation. For the piecewise decomposition, all waveforms are sampled at fixed, predefined probability values. Higher accuracy can be obtained by increasing the number of sampling points. In order to deal with correlation due to reconvergent fanouts, the arrival times are factored into statistically independent variables plus a common mode, which encapsulates the correlation and is eliminated by statistical subtraction. Like [1], this method handles correlations introduced by reconvergent fanouts, but discards any other type of correlation, by assuming that delays are statistically independent variables.
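A simplified discrete analogue of this style of propagation, using impulse trains over unit time steps (in the spirit of [39]) rather than piecewise-linear waveforms; all distributions below are invented:

```python
def convolve(p, q):
    """PDF of the sum of two independent discretized delays (impulse trains)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def max_of(p, q):
    """PDF of the max of two independent discretized arrival times."""
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    P = Q = 0.0
    out = []
    for k in range(n):
        P += p[k]  # running CDF of p
        Q += q[k]  # running CDF of q
        # P{max = k} = P{a = k, b <= k} + P{b = k, a < k}
        out.append(p[k] * Q + q[k] * (P - p[k]))
    return out

# Hypothetical impulse-train PDFs over unit time steps (index = time units).
at_x1 = [0.5, 0.5]    # arrival time at input x1: 0 or 1, equally likely
d_x1y = [0.0, 1.0]    # input-to-output delay: deterministically 1 unit
other = [0.2, 0.8]    # arrival time contributed by a second input
at_y = max_of(convolve(at_x1, d_x1y), other)
print(at_y)
```

The impulse count grows with each convolution, which is one way to see why the worst-case run-time of the exact discretized method is exponential in circuit size.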

A significantly different approach is proposed in [7]. The authors represent the circuit (in fact, its timing graph) as a Bayesian Network, which prescribes an efficient method to factorize the joint distributions (resulting from correlations) into an optimal set. For large circuits, where it is not possible to compute the exact distribution, a series of simplifications is performed, through network transformations, in order to reduce problem size and obtain a tight lower bound on the exact distribution. This method is interesting and able to accurately compute delay distributions for very small circuits, accounting for correlations. However, for larger circuits, or problems with many correlations, simplifications have to be made in order to compute a solution in a fair amount of time. These simplifications lead to a loss of accuracy.

Probably the SSTA method that produced the most impact is the one proposed in [70]. The authors propose a canonical first-order delay model, employed to express all timing quantities. All gate and wire delays, arrival times, required arrival times, slacks and slews (rise/fall times) are expressed in a canonical first-order form. For a given random variable a, its canonical form is

a = a_0 + \sum_{i=1}^{n} a_i \Delta X_i + a_{n+1} \Delta R_a \qquad (2.8)

where a_0 is the mean, or nominal value, of a; \Delta X_i, i = 1, 2, \ldots, n, represents the variation of the n global sources of variation X_i from their nominal values; a_i, i = 1, 2, \ldots, n, are the sensitivities to each of the global sources of variation; \Delta R_a is the variation of an independent random variable R_a from its mean value; and a_{n+1} is the sensitivity of the timing quantity to R_a. The mean and variance of a are given by

\mu_a = a_0, \qquad \sigma_a^2 = \sum_{i=1}^{n+1} a_i^2 \qquad (2.9)

Considering another random variable b, we can compute the probability that a is larger than b. This is called the tightness probability of a, and is represented by T_a. Revisiting the problem of computing c = max(a, b), it can be analytically derived [70] that,

T_a = f_1(a_0, a_1, \ldots, a_{n+1}, b_0, b_1, \ldots, b_{n+1}) \qquad (2.10)

\mu_c = a_0 T_a + b_0 (1 - T_a) + f_2(a_0, a_1, \ldots, a_{n+1}, b_0, b_1, \ldots, b_{n+1}) \qquad (2.11)

\sigma_c^2 = (\sigma_a^2 + a_0^2) T_a + (\sigma_b^2 + b_0^2)(1 - T_a) + f_3(a_0, a_1, \ldots, a_{n+1}, b_0, b_1, \ldots, b_{n+1}) \qquad (2.12)

Thus, the tightness probability, mean and variance of c can be computed analytically and efficiently. Further, the canonical form of c is given by

c = c_0 + \sum_{i=1}^{n} c_i \Delta X_i + c_{n+1} \Delta R_c = \mu_c + \sum_{i=1}^{n} (T_a a_i + (1 - T_a) b_i) \Delta X_i + c_{n+1} \Delta R_c \qquad (2.13)

The only remaining quantity to be computed is the independently random part of the result, c_{n+1}. This is done by matching the variance of the canonical form to the variance computed analytically. Using this framework it is possible to model all the correlations and perform SSTA efficiently. Further, [70] also describes how to perform incremental SSTA over this framework. One drawback of this approach is that it introduces an error by fitting a Normal distribution to the result of the max operation, which is not Normally distributed.
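In the Gaussian setting, the functions f_1, f_2 and f_3 of Eqns. (2.10)-(2.12) reduce to Clark's classical formulas for the moments of the max of two jointly normal variables. The sketch below applies them to two hypothetical canonical forms; it is an illustrative reconstruction under that assumption, not the exact implementation of [70]:

```python
import math

def phi(x):   # standard normal PDF
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):   # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def canonical_max(a, b):
    """max of two canonical forms a = (a0, a1..an, a_{n+1}), using
    Clark's moment formulas for jointly Gaussian variables."""
    var_a = sum(ai * ai for ai in a[1:])
    var_b = sum(bi * bi for bi in b[1:])
    # Covariance comes only from the shared global variations X_i.
    cov = sum(ai * bi for ai, bi in zip(a[1:-1], b[1:-1]))
    theta = math.sqrt(max(var_a + var_b - 2.0 * cov, 1e-30))
    nu = (a[0] - b[0]) / theta
    T = Phi(nu)                                  # tightness probability of a
    mu_c = a[0] * T + b[0] * (1 - T) + theta * phi(nu)
    e2 = ((var_a + a[0] ** 2) * T + (var_b + b[0] ** 2) * (1 - T)
          + (a[0] + b[0]) * theta * phi(nu))
    var_c = e2 - mu_c ** 2
    # Canonical form of c: tightness-weighted sensitivities (Eqn 2.13);
    # the independent term c_{n+1} matches the analytic variance.
    c = [mu_c] + [T * ai + (1 - T) * bi for ai, bi in zip(a[1:-1], b[1:-1])]
    resid = var_c - sum(ci * ci for ci in c[1:])
    c.append(math.sqrt(max(resid, 0.0)))
    return c, T

# Hypothetical canonical forms: one global source plus an independent term.
c, T = canonical_max([10.0, 2.0, 0.5], [11.0, -1.0, 0.8])
print(c, T)
```

The result is again a canonical form, which is what makes the propagation closed under sum and max, and also where the Normal-fitting error enters.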

2.5 Corner-Based Approaches

Most of the research work that targets the improvement of timing analysis techniques for accurately handling variability has been focused on statistical approaches, where delays and constraints are described by statistical distributions. This corresponds to a departure from conventional, nominal-case approaches. Nevertheless, a few timing analysis techniques have been proposed for dealing with variability in a more conventional framework, which can be generically designated corner-based approaches. Rather than dealing with variability in a statistical manner, such approaches target the validation of the timing behavior of the circuit for a series of extreme variability settings, commonly designated corners.

Figure 2-2: Upper bound computation example. Edge delays A = 10 + 2x_1 + 2x_2 and B = 11 + 3x_1 - 2x_2 converge at vertex u, followed by an edge C = -4x_1 - x_2 into vertex v; \overline{max}_{A,B}(x_1, x_2) = 13 + 2x_1 - x_2 and [\overline{max}_{A,B} + C](x_1, x_2) = 13 - 2x_1 - 2x_2.

 x_1  x_2 | max_{A,B} | \overline{max}_{A,B}      x_1  x_2 | max_{A,B} + C | \overline{max}_{A,B} + C
 -1   -1  |    10     |    12                     -1   -1  |      15       |      17
 -1    1  |    10     |    10                     -1    1  |      13       |      13
  1   -1  |    16     |    16                      1   -1  |      13       |      13
  1    1  |    14     |    14                      1    1  |       9       |       9

In this context, the authors of [49] propose a linear-time approach for computing an upper bound on the worst delay of a circuit, covering all process corners. Such an upper bound is a linear function of the process parameters, which we will designate an affine function¹. In this approach, much like in [70], affine arrival time functions, designated hyperplanes, are pushed through the timing graph, and when necessary a conservative approximation to the max of two input hyperplanes is computed. Such an approximation, designated the output hyperplane, is an upper bound on the max of both input hyperplanes, valid for every corner. The computed output hyperplane is guaranteed to be a tight bound at the worst corner among the two input hyperplanes. Given the delay hyperplanes for the primary outputs, it is trivial to compute their worst value (as will be explained in the following chapter), and therefore to obtain an upper bound on the delay of the circuit.

Figure 2-2 illustrates the application of the underlying method to a simple timing graph, where it is assumed that delays depend upon two process parameters: x_1, x_2 \in [-1, 1]. The max of two affine delay functions A and B is denoted max_{A,B}, while its upper bound, computed by the underlying method, is denoted \overline{max}_{A,B}(x_1, x_2). The tables below the graph contain the respective function values for the 4 possible corners. As can be observed in the table corresponding to vertex u, for the worst corner, which corresponds to (1, -1), the computed max upper bound is tight, since it is actually equal to the value of the max, which is 16. However, when C is added, the new worst corner at vertex v is (-1, -1). For this corner, the upper bound is no longer tight, since the real value is 15 and the upper bound value is 17.

¹An accurate definition is provided in the following chapter.

This simple example illustrates one of the drawbacks of this method: even if the output hyperplane of every max operation is guaranteed to be tight at the worst corner, the output hyperplanes at the primary outputs can be quite loose. This can occur if the worst corner varies frequently across the timing graph, causing errors to accumulate throughout the traversal. This can be particularly limiting when modeling intra-die variations, for which different corners will most likely prevail in distinct parts of the circuit.

Another drawback of this method is that it does not incorporate a practical way of identifying the critical path of the circuit, much less a set of its most critical paths, which is of extreme relevance in the context of speedpath debugging, as explained earlier. Therefore, the applicability of this method in the context of timing sign-off seems rather limited.

3

Parametric Delay Modeling

As we have briefly discussed in Section 1.2, the typical timing verification flow consists of three fundamental steps: extraction, delay modeling and calculation, and timing analysis. In the context of process variability, which has been the subject of much research in recent years, by far the greatest emphasis has been devoted to timing analysis and related operations, which manage the calculation of arrival times and the verification of timing constraints at the level of abstraction of a timing graph. An equally important, if more mundane, component is the delay modeling and calculation step, which takes as input the cell and interconnect models and produces a delay expression in a form that can subsequently be consumed by the timing analysis engine.

This chapter introduces the foundations of a parametric delay modeling framework that, by accounting for process parameter variations, addresses the requirements of the latest nanometric IC technologies. This framework closely follows well-established delay modeling strategies [57, 56, 58] that have been successfully used for many years in the nominal case, but introduces the necessary modifications and extensions to handle process parameter variations in an accurate and computationally efficient fashion.

The outline of this chapter is as follows. Section 3.1 introduces precise parametric delay and slew definitions. Section 3.2 defines affine functions and derives related useful results. Section 3.3 reviews the general mechanics of delay computation. Section 3.4 briefly discusses interconnect and cell characterization in a parametric timing context. Finally, Section 3.5 describes the parametric timing graph as a compact representation of delay information.

Figure 3-1: Reference voltages for delay and slew calculation.

3.1 Delay and Slew Definitions

The information processed by modern general-purpose digital circuits is usually encoded in node voltage waveforms, since it is the voltage value at particular circuit nodes that determines their corresponding logic values. Delay and slew definitions are therefore related to such voltage waveforms, observed at particular nodes of the circuit. Voltage values usually lie within two predefined limits, VSS and VDD, which vary according to the fabrication technology. Most often, VSS = 0V. For early CMOS IC technologies, the value of VDD was 5V, while for the latest low-power technologies VDD can be as low as 1V. Every node voltage can either be stable, at approximately one of the limit values, VSS or VDD, or it can be switching between those two values. The shape of the voltage waveforms during that switching transition period depends on the electrical characteristics of the circuit components involved and on the shape of the respective electrical stimuli. It is in general assumed that both the waveforms of the applied stimuli and the resulting switching transition waveforms can be roughly approximated by rising or falling ramps. Switching transitions are considered to be relevant events in a timing context, unlike permanently stable voltage values. Even though both switching transitions and stable values may result from useful processing activity in a digital circuit, stable values cannot correspond to timing-critical situations, and therefore are not an interesting subject of study in a timing verification context.

The primary motivation for measuring or calculating delays between voltage waveforms at specific nodes of a circuit is to evaluate the processing delay introduced by the circuit when reacting to a stimulus, which is essential to evaluate whether its performance meets the target specifications. Such a stimulus is injected at specific nodes of the circuit and the corresponding response is later observed at other nodes. The time elapsed from the injection of the stimulus until the corresponding processed response is available is the delay. More precisely, the delay between two nodes a and b is the time interval elapsed from the instant a switching transition in the voltage waveform at node a crosses the value VT until the instant the switching transition it induces in the voltage waveform at node b also crosses the value VT. VT is another technology-dependent value. Measuring delays between two nodes when there is no cause/effect relation between the switching transitions of their respective voltage waveforms is, in general, meaningless. Delay measurement between the voltage waveforms at two circuit nodes, a and b, is illustrated in Figure 3-1.

As we have mentioned before, switching transitions in voltage waveforms can, in general, be well approximated by either rising or falling ramps. This fact can be observed in Figure 3-1, which depicts typical voltage transition waveforms. The slope of such approximation ramps is a relevant figure, because it directly impacts the processing performance of subsequent circuit devices. Further, it can also help in the detection of design flaws, such as excessive loading of particular nodes. The slew is a measure of that slope. Unlike delay, which relates voltage waveforms at two distinct circuit nodes, slew measurement is performed on the voltage waveform of a single node. Precisely stated, the slew of a given rising voltage waveform is the time interval elapsed from the instant it crosses the value VL until it crosses the value VH. For falling waveforms, the two crossing values are exchanged. Once more, VL and VH depend on the technology. Typical values are VL = 10% VDD and VH = 90% VDD. Slew calculation at a given circuit node b is illustrated in Figure 3-1.

A straightforward interpretation of the delay and slew definitions presented herein can lead to the conclusion that both figures are simply numbers, more precisely real values, as they correspond to differences of time instants. For earlier IC technologies that would, in fact, be the case, since the information conveyed by such simple models would satisfactorily meet the analysis requirements. However, in the latest nanometric IC technologies, where process parameter variations have a relevant impact on circuit performance, it is extremely important to know how they impact delays and slews. It therefore becomes necessary to characterize delays and slews with more complex models, in particular as functions of process parameters, rather than just by their fixed nominal values.

While digital circuits are strongly nonlinear with respect to circuit inputs, delays and slews are often close to linear with respect to process parameters. Therefore, sensitivities (i.e. first-order derivatives) present themselves as a simple and most likely sufficiently accurate measure of the impact of parameter variations on delays and slews. Consequently, rather than being represented by real values, delays and slews will be represented by affine functions [63] of parameter variations, corresponding to a first-order Taylor series expansion (linearization) around a nominal point, \lambda_0, in the parameter space,

d(\lambda) = d(\lambda_0) + \left. \frac{\partial d}{\partial \lambda} \right|_{\lambda_0} (\lambda - \lambda_0) \qquad (3.1)

Considering the parameter space to have size p, and representing d as a function of the incremental parameter variation vector \Delta\lambda = \lambda - \lambda_0 around the nominal value \lambda_0, Eqn. (3.1) can be rewritten more compactly as

d(\Delta\lambda) = d_0 + \sum_{i=1}^{p} d_i \Delta\lambda_i = d_0 + d^T \Delta\lambda \qquad (3.2)

where d_0 = d(\lambda_0) is the nominal value of d and d_i is the sensitivity of d to parameter \lambda_i, i = 1, 2, \ldots, p, computed at the nominal point \lambda_0.

The application of a linear parametric formulation in the context of statistical timing analysis was first proposed in [70]. This representation is mathematically equivalent to the canonical formulation prescribed in [70], but the interpretation and subsequent treatment is, as we shall see, quite different.

3.2 Affine Functions

As discussed in the previous section, delays and slews will be modeled as affine functions of parameter variations. Later in this chapter it will become clear that other quantities (such as capacitances and resistances) will also be modeled in a similar manner. This section presents a more complete and general definition of affine functions and develops the results necessary for their efficient manipulation in the several tasks throughout this work where that will be necessary.

We start by defining affine functions, and explaining how the extreme values of affine functions can be computed. Subsequently, useful results about the max of affine functions are derived. The min of affine functions will not be explicitly addressed, as similar results can be derived by symmetry, following the same procedures presented for the max.

3.2.1 Definition

An affine function [63], A, is a real-valued function of a p-dimensional vector of real-valued parameters (variables), x = (x_1, x_2, \ldots, x_p), with x \in \mathbb{R}^p, such that

A(x) = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_p x_p, \qquad a_i \in \mathbb{R}, \ \forall i = 0, 1, \ldots, p \qquad (3.3)

In the following, the value of each parameter x_i is assumed to be limited to an interval,

x_i \in [x_i^{min}, x_i^{max}], \qquad \forall i = 1, 2, \ldots, p \qquad (3.4)

or, written in a more compact form,

x \in [x^{min}, x^{max}] \qquad (3.5)

where x^{min} and x^{max} are the vectors of lower and upper limits, respectively. Assuming a = (a_1, a_2, \ldots, a_p), a more compact representation of Eqn. (3.3), similar to that of Eqn. (3.2), can be obtained:

A(x) = a_0 + a^T x \qquad (3.6)

3.2.2 Extreme Values

Affine functions are convex [8]. Informally, a function is said to be convex if its graph lies below the straight line segment connecting any two of its points. The typical and intuitive example of a convex function is one whose graph is cup shaped. The convexity implies that the smallest and the largest values of a given affine function are obtained by setting each parameter to one of its extreme values. The largest value of $A$, given by Eqn. (3.6), is

$$\max_x [A(x)] = A(x^*) = a_0 + a^T x^* \qquad (3.7)$$

for a maximizing parameter assignment, $x^* = (x_1^*, x_2^*, \ldots, x_p^*)$, such that

$$x_i^* = \begin{cases} x_i^{min} & \text{if } a_i \le 0 \\ x_i^{max} & \text{if } a_i > 0 \end{cases}, \quad \forall i = 1, 2, \ldots, p \qquad (3.8)$$

The smallest value of $A$ can be trivially computed by symmetry. Each of the $2^p$ possible values of $x$ where all the parameters assume one of their extreme values (either $x_i^{min}$ or $x_i^{max}$) will be designated a corner. This designation becomes clear by observing Figure 3-2, where all 8 corners are depicted by gray dots, for $p = 3$.

Figure 3-2: Corners of $x$, for $p = 3$.
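The corner rule of Eqns. (3.7)-(3.8) is straightforward to implement. A minimal Python sketch (the function names are illustrative, not from the dissertation), with a brute-force check against all $2^p$ corners:

```python
from itertools import product

def affine_eval(a0, a, x):
    """Evaluate A(x) = a0 + a^T x (Eqn. (3.6))."""
    return a0 + sum(ai * xi for ai, xi in zip(a, x))

def maximizing_corner(a, xmin, xmax):
    """Corner x* of Eqn. (3.8): x_i^min if a_i <= 0, x_i^max otherwise."""
    return [lo if ai <= 0 else hi for ai, lo, hi in zip(a, xmin, xmax)]

def affine_max(a0, a, xmin, xmax):
    """Largest value of A over the parameter box (Eqn. (3.7))."""
    return affine_eval(a0, a, maximizing_corner(a, xmin, xmax))

# Brute-force check against all 2^p corners of the box.
a0, a = 1.0, [2.0, -3.0, 0.5]
xmin, xmax = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
corner_values = [affine_eval(a0, a, c) for c in product(*zip(xmin, xmax))]
assert affine_max(a0, a, xmin, xmax) == max(corner_values)  # both give 6.5
```

The sign test visits each coefficient once, so the extreme value costs $O(p)$ rather than the $O(2^p)$ of corner enumeration.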

3.2.3 Sum

Given two affine functions, A and B, their sum, C, is given by

$$C(x) = A(x) + B(x) = (a_0 + b_0) + (a + b)^T x = c_0 + c^T x \qquad (3.9)$$

Each coefficient of $C$ is computed by the coefficient-wise sum of the coefficients of $A$ and $B$,

$$c_i = a_i + b_i, \quad \forall i = 0, 1, 2, \ldots, p \qquad (3.10)$$

Therefore, it can be easily concluded that the sum of affine functions is also an affine function.


Figure 3-3: Maximum of affine functions and of piecewise-affine functions, for p = 1.

3.2.4 Exact Max

The computation of the max of affine functions is of great interest in several areas of operations research, and will also be used extensively throughout this work. As illustrated in the left plot of Figure 3-3, for p = 1, the max of affine functions is a piecewise-affine function. Additionally, the right plot of Figure 3-3 shows that the max of piecewise-affine functions is itself a piecewise-affine function. An important property of the max operator over affine functions or convex piecewise-affine functions is that it always produces convex functions. Therefore, the max of affine functions is always a convex function.

Considering a set of m affine functions in the form

$$A^{(j)}(x) = a_0^{(j)} + a^{(j)T} x, \quad \forall j = 1, 2, \ldots, m \qquad (3.11)$$

the piecewise-affine function, $\mathcal{A}$, that represents their max can be implicitly written as

$$\mathcal{A}(x) = \max_{j=1,2,\ldots,m} [A^{(j)}(x)] = \max_{j=1,2,\ldots,m} [a_0^{(j)} + a^{(j)T} x] \qquad (3.12)$$

Rather than explicitly enumerating the pieces that belong to the piecewise-affine function $\mathcal{A}$, we have chosen to implicitly represent it as the max of the $m$ original affine functions, since the explicit computation of such pieces could potentially be very expensive, and is not relevant at this point. The piecewise-affine functions that will be studied in the context of this work always correspond to the max of affine functions, and are assumed to have a representation in the form of Eqn. (3.12).


Figure 3-4: Redundant affine functions in the max, for p = 1.

3.2.5 Simplification of the Max

The computational cost of storing and manipulating piecewise-affine functions represented in the form of Eqn. (3.12) increases with the number of affine functions, $m$. Therefore, it is of paramount importance to keep their number to a minimum, discarding redundant affine functions whenever possible. Figure 3-4 illustrates the existence of affine functions (dashed lines) that, within the range of the parameters, are irrelevant for the computation of the max. Such affine functions are redundant and can therefore be removed from the max without error. A given affine function, $A^{(k)}$, can be removed from the corresponding max piecewise-affine function, $\mathcal{A}$, if the following linear program (LP) has no feasible solution,

$$\begin{array}{ll} \min_x & \emptyset \\ \text{s.t.} & A^{(k)}(x) > A^{(j)}(x), \quad \forall j = 1, 2, \ldots, m \;\wedge\; j \neq k \\ & x_i^{min} \le x_i \le x_i^{max}, \quad \forall i = 1, 2, \ldots, p \end{array} \qquad (3.13)$$

The solution of the LP of Eqn. (3.13) corresponds to the computation of one point, within the range of $x$, where the value of $A^{(k)}$ is greater than the value of any other of the affine functions $A^{(j)}$, $j \neq k$. If such a point exists, then $A^{(k)}$ is relevant and cannot be removed. If it does not exist, then $A^{(k)}$ can be safely removed without error. Since the objective function of the LP is empty, any feasible solution, or the proof that no such solution exists, is an acceptable result.

This analysis has a non-negligible computational cost (the cost of finding an initial feasible solution for the LP). However, this cost can easily be amortized if the affine function is deemed irrelevant, and therefore eliminated from further computations involving the corresponding max piecewise-affine function. Hence, it is advisable to only apply this technique to affine functions that are extremely likely to be eliminated. An appropriate inexpensive heuristic can be used to predict such a condition.
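The redundancy test of Eqn. (3.13) requires an LP solver. As a rough illustration only, the sketch below replaces the LP by dense sampling over the parameter box (practical for small p), flagging an affine function as likely redundant when it is nowhere strictly above all the others on the sample grid; the function name is illustrative, and a real LP gives an exact answer where sampling can only suggest one:

```python
import itertools

def likely_redundant(k, funcs, xmin, xmax, steps=50):
    """Sampling surrogate for the LP of Eqn. (3.13): returns True when
    funcs[k] is nowhere strictly above all the others on a sample grid.
    Each function is a tuple (a0, [a1, ..., ap])."""
    axes = [[lo + (hi - lo) * t / (steps - 1) for t in range(steps)]
            for lo, hi in zip(xmin, xmax)]
    for x in itertools.product(*axes):
        vk = funcs[k][0] + sum(c * xi for c, xi in zip(funcs[k][1], x))
        others = max(f[0] + sum(c * xi for c, xi in zip(f[1], x))
                     for j, f in enumerate(funcs) if j != k)
        if vk > others:
            return False  # funcs[k] dominates somewhere: it is relevant
    return True

# A(1) = x, A(2) = -x, A(3) = 0: the constant is redundant on [-1, 1].
funcs = [(0.0, [1.0]), (0.0, [-1.0]), (0.0, [0.0])]
assert likely_redundant(2, funcs, [-1.0], [1.0])
assert not likely_redundant(0, funcs, [-1.0], [1.0])
```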

3.2.6 Bounding the Max

For many applications it is not practical, or even possible, to store and manipulate the exact piecewise-affine representation of the max of two or more affine functions. Therefore, it becomes necessary to devise adequate approximations to the max that, at a smaller computational cost, may still provide relevant information for the underlying application. Quite often, an upper/lower bound on the max is a good enough approximation.

An affine function, U, defined as

$$U(x) = u_0 + u^T x \qquad (3.14)$$

is an upper bound on the piecewise-affine function that represents the max of $m$ affine functions, $\mathcal{A}$, if the following condition holds,

$$\mathcal{A}(x) \le U(x), \quad \forall x \in [x^{min}, x^{max}] \qquad (3.15)$$

Bounding an arbitrarily complex convex piecewise-affine function with just one affine function, e.g. using a single p-dimensional plane, may seem overly inaccurate but, most often, this approach represents the best compromise between speed and accuracy.

The simplest and least expensive single-plane upper bound on the max of affine functions can be computed by just picking, for each coefficient of $U$, the maximum of the corresponding coefficients of every affine function maxed in $\mathcal{A}$, such that

$$u_i = \max_{j=1,2,\ldots,m} [a_i^{(j)}], \quad \forall i = 0, 1, \ldots, p \qquad (3.16)$$

This upper bound is inexpensive to compute, but it is usually quite loose. Nevertheless, it may be useful in applications where speed is preferred over accuracy, or when the problem size renders other, more accurate approaches infeasible.
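A minimal sketch of the coefficient-wise construction of Eqn. (3.16), with a corner check (names are illustrative). For simplicity the example assumes a nonnegative parameter box, $x^{min} \ge 0$, in which case $U - A^{(j)}$ has only nonnegative coefficients and the domination is immediate:

```python
from itertools import product

def loose_upper_bound(funcs):
    """Coefficient-wise upper bound of Eqn. (3.16): u_i = max_j a_i^(j).
    Each function is a tuple (a0, [a1, ..., ap])."""
    u0 = max(f[0] for f in funcs)
    u = [max(f[1][i] for f in funcs) for i in range(len(funcs[0][1]))]
    return u0, u

def evaluate(a0, a, x):
    return a0 + sum(c * xi for c, xi in zip(a, x))

funcs = [(0.0, [1.0, -2.0]), (1.0, [-1.0, 0.5])]
u0, u = loose_upper_bound(funcs)          # u0 = 1.0, u = [1.0, 0.5]
xmin, xmax = [0.0, 0.0], [1.0, 1.0]       # nonnegative parameter box
for corner in product(*zip(xmin, xmax)):
    calA = max(evaluate(f[0], f[1], corner) for f in funcs)
    assert evaluate(u0, u, corner) >= calA
```

Since $U(x) - \mathcal{A}(x)$ is concave, its minimum over the box is attained at a corner, so checking the corners suffices to certify the bound.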


Figure 3-5: Tightest single plane upper bound of a convex piecewise-affine function, for p = 1.

The tightest upper bound on the max with only one bounding function (one plane), as illustrated in Figure 3-5, can be computed by solving the following LP,

$$\begin{array}{ll} \min_{u_0, u} & \epsilon \\ \text{s.t.} & \epsilon \ge U(x^{(q)}) - \mathcal{A}(x^{(q)}), \quad \forall q = 1, 2, \ldots, 2^p \\ & U(x^{(q)}) \ge \mathcal{A}(x^{(q)}), \quad \forall q = 1, 2, \ldots, 2^p \end{array} \qquad (3.17)$$

where $x^{(q)}$ is the $q$-th corner of $x$. The variables of this LP are the coefficients of $U$. This upper bound is the tightest (for a single plane), but it is in general quite expensive to compute, as it involves generating constraints for each of the $2^p$ possible corners of $x$ as well as solving the LP, thus exhibiting exponential run-time complexity.

A range of intermediate approximations, lying between the loosest and the tightest upper bounds, can be conceived. One such intermediate approximation, claimed to run in linear time, has been recently proposed in [50].

3.2.7 Bounding Error

Given an upper bound to a piecewise-affine function in the form of Eqn. (3.12), it may be useful to assess the quality of such a bound by computing the maximum error between the bounding function and the bounded function, within the range of values of $x$. Rather than restricting the bounding function to be an affine function, it may be useful to generalize the error computation to the case where the bounding function is also a piecewise-affine function, $\mathcal{U}$, in the form of Eqn. (3.12),

$$\mathcal{U}(x) = \max_{k=1,2,\ldots,r} [U^{(k)}(x)] = \max_{k=1,2,\ldots,r} [u_0^{(k)} + u^{(k)T} x] \qquad (3.18)$$

Figure 3-6: Maximum/minimum error between two convex piecewise-affine functions, for p = 1.

The maximum error, $\epsilon_{max}$, between a given bounding function, $\mathcal{U}$, and the bounded function, $\mathcal{A}$, as illustrated in the left plot of Figure 3-6, is given by

$$\epsilon_{max} = \max_x [\mathcal{U}(x) - \mathcal{A}(x)], \quad x \in [x^{min}, x^{max}] \qquad (3.19)$$

In practice, the value of $\epsilon_{max}$ can be computed by first solving the following LP, for each affine function $U^{(k)}$, where $k = 1, 2, \ldots, r$,

$$\begin{array}{ll} \epsilon_{max}^{(k)} = \max_x & U^{(k)} - \gamma \\ \text{s.t.} & \gamma \ge A^{(j)}, \quad \forall j = 1, 2, \ldots, m \\ & U^{(k)} \ge U^{(l)}, \quad \forall l = 1, 2, \ldots, r \;\wedge\; l \neq k \\ & x_i^{min} \le x_i \le x_i^{max}, \quad \forall i = 1, 2, \ldots, p \end{array} \qquad (3.20)$$

and subsequently computing its value from the partial error values,

$$\epsilon_{max} = \max_{k=1,2,\ldots,r} [\epsilon_{max}^{(k)}] \qquad (3.21)$$

The minimum error, $\epsilon_{min}$, between $\mathcal{U}$ and $\mathcal{A}$ can also be computed, and provides a measure of the tightness of the upper bound, $\mathcal{U}$. Additionally, it can also be used to verify the correctness of such a bound. As illustrated in the right plot of Figure 3-6, if $\epsilon_{min}$ is negative, then $\mathcal{U}$ does not effectively bound $\mathcal{A}$ within the entire range of values of $x$. The minimum error, $\epsilon_{min}$, is defined as

$$\epsilon_{min} = \min_x [\mathcal{U}(x) - \mathcal{A}(x)], \quad x \in [x^{min}, x^{max}] \qquad (3.22)$$

The value of $\epsilon_{min}$ can be computed by first solving the following LP, for each affine function $A^{(j)}$, where $j = 1, 2, \ldots, m$,

$$\begin{array}{ll} \epsilon_{min}^{(j)} = \min_x & \gamma - A^{(j)} \\ \text{s.t.} & \gamma \ge U^{(k)}, \quad \forall k = 1, 2, \ldots, r \\ & A^{(j)} \ge A^{(l)}, \quad \forall l = 1, 2, \ldots, m \;\wedge\; l \neq j \\ & x_i^{min} \le x_i \le x_i^{max}, \quad \forall i = 1, 2, \ldots, p \end{array} \qquad (3.23)$$

and subsequently computing its value from the partial error values,

$$\epsilon_{min} = \min_{j=1,2,\ldots,m} [\epsilon_{min}^{(j)}] \qquad (3.24)$$
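For p = 1, the error measures of Eqns. (3.19) and (3.22) can be estimated by dense sampling instead of the LPs (3.20) and (3.23). A rough numerical sketch, with illustrative names, not the dissertation's implementation:

```python
def pw_max(funcs, x):
    """Evaluate a max-of-affines representation (Eqn. (3.12)) at x, p = 1."""
    return max(a0 + a1 * x for a0, a1 in funcs)

def error_range(U_funcs, A_funcs, xmin, xmax, steps=2001):
    """Sampled estimates of eps_max (Eqn. (3.19)) and eps_min (Eqn. (3.22))."""
    xs = [xmin + (xmax - xmin) * t / (steps - 1) for t in range(steps)]
    diffs = [pw_max(U_funcs, x) - pw_max(A_funcs, x) for x in xs]
    return max(diffs), min(diffs)

# A = max(x, -x) = |x| on [-1, 1]; the constant U = 1 bounds it from above.
A_funcs = [(0.0, 1.0), (0.0, -1.0)]
U_funcs = [(1.0, 0.0)]
eps_max, eps_min = error_range(U_funcs, A_funcs, -1.0, 1.0)
assert abs(eps_max - 1.0) < 1e-9   # worst overestimation, at x = 0
assert abs(eps_min - 0.0) < 1e-9   # the bound is tight at x = -1 and x = 1
```

A negative sampled `eps_min` would signal, as discussed above, that the candidate bound fails to dominate the bounded function somewhere in the box.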

3.3 Mechanics of Delay Computation

Delay computation is concerned with computing delays between voltage waveforms at particular nodes of a digital circuit, having as the ultimate objective the verification that the timing performance of the circuit meets its target specifications. Several approaches to delay computation have been employed in the past, with different levels of success. The most accurate approaches are based on the exhaustive transistor-level simulation of large portions of the circuit under analysis, or of its totality. Such approaches are highly accurate, but their computational cost is so large that they can only be applied to small circuits, in situations where high accuracy is of paramount importance.

Another accurate approach is to compute the delay of each path separately, also resorting to accurate transistor-level simulation. Such computation amounts to simulating the behavior of the path (i.e. of all the circuit elements contained therein), for several propagation conditions if necessary, and extracting the corresponding delay information. For processing an entire circuit, every path needs to be independently analyzed. Since the number of paths in a digital circuit may grow exponentially with the number of nodes, this approach can become overly expensive for full circuit analysis, even for moderately sized circuits. Therefore,

it is only used in very specific and localized situations, when the delay of a small number of selected paths needs to be computed with very high accuracy.

Figure 3-7: Typical partition of a digital circuit topology for delay computation.

The best compromise between accuracy and speed is achieved by partitioning the circuit into small blocks, as illustrated in Figure 3-7, and computing their delays independently. As a result of a typical partition scheme, two types of blocks are obtained: cell blocks and interconnect blocks. Cell blocks correspond to aggregates of active switching elements, and are instances of library cells. Interconnect blocks correspond to electrically connected wires (designated nets), that connect the inputs and outputs of cell blocks. Since cell blocks are just instances of a limited number of library cells, it is wise and common to pre-characterize such cells for several possible conditions, and subsequently use that information to efficiently compute approximate cell instance block delays. As for interconnect, since each block is essentially different from all the other blocks, the only accurate method to compute its delay is to perform electrical simulation. Nevertheless, there are several techniques that allow the completion of such a task in a computationally efficient manner.

For each block, the corresponding input/output delays are independently computed, irrespective of the path taken by the signal before entering the input of the block and after leaving the output of the block. This is an obvious source of inaccuracy, since the voltage waveforms that reach the block inputs will exhibit distinct slew values, depending on the path that was taken. In order to cope with this and ensure that safe results are produced, conservative assumptions are made. These assumptions ensure that each computed input/output delay corresponds to the worst conditions, i.e. that the path taken by the digital signal is such that the worst input/output delay is obtained. This approach is therefore more conservative than the other, intensive simulation-based approaches described earlier. However, delay overestimation is the necessary price to pay for ensuring the computational efficiency required to handle circuits with millions of elements.

Figure 3-8: Most relevant parasitic effects considered in interconnect extraction: layer resistance, via resistance, overlap capacitance, fringe capacitance and lateral capacitance.

As will be more thoroughly discussed in the next section, block delay models are usually parametrized by the slew of their input voltage waveforms. Therefore, once the circuit is properly partitioned and all the cell and interconnect delay models are in place, the task of the delay computation engine is to forward propagate slews and invoke the appropriate delay models for computing delays and output slews given the input slews.

3.4 Interconnect and Cell Characterization

Following the previous section, we proceed with a general overview of the characterization of interconnect and cell blocks, within the parametric delay modeling framework proposed in this work.

As we have mentioned, interconnect blocks are essentially groups of wires that connect cell inputs and outputs. The signal propagation in such wires is influenced not only by the electrical properties of the material but also by electromagnetic interactions between signals in neighboring wires (coupling). The generation of an electrical model, in terms of lumped elements (e.g. a netlist), that models specific characteristics of the circuit elements is generically designated extraction. This task is usually performed by a dedicated engine that analyzes the shape and location of every wire in the circuit layout and generates an appropriate model that accounts for all the relevant electrical interactions. These engines usually rely on pattern matching schemes that compare layout patterns with a library of pre-characterized patterns (e.g. wire locations and shapes), computed by accurate electromagnetic model simulation performed by an appropriate field solver. Several models can be generated from extraction, with different levels of complexity, targeted at specific applications. In the problem at hand, the models will be used to compute interconnect delays and slews. In this context, resistive and capacitive effects are the most relevant, and therefore inductive effects are usually neglected. In order to obtain simplified RC models, the coupling capacitances are often converted to grounded capacitances. The most relevant effects considered in interconnect extraction for delay computation are illustrated in Figure 3-8.

In the context of parametric modeling, not only the nominal values of the lumped RC elements are necessary, but also the corresponding sensitivities, which model the impact of process parameter variations. For the case of interconnect, the relevant process parameters are usually wire thickness, wire width, wire spacing, etc. Sensitivities can be computed by differences, by performing repeated layout extraction under different parameter conditions.

This is the usual approach that enables parametric extraction using traditional nominal-case extractors. However, the new generation of parametric extractors is capable of natively extracting parameter sensitivities using analytic approaches, inherently more accurate and robust than differencing-type approaches. In the following, the parametric interconnect model is assumed to be an RC network, without grounded resistors and with only grounded capacitors, where each element is characterized by a nominal value and by a set of sensitivities to the process parameters, in the form of Eqn. (3.2).
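Computing sensitivities "by differences," as described above, amounts to re-running extraction at perturbed parameter values. A hypothetical sketch, where `extract_capacitance` is a made-up stand-in for a nominal-case extractor (not a real tool API):

```python
def extract_capacitance(width, spacing):
    """Hypothetical stand-in for a nominal-case extractor: returns a wire
    capacitance for a given wire width and spacing (made-up model)."""
    return 10.0 * width + 4.0 / spacing

def sensitivity_by_differences(extract, nominal, delta=1e-4):
    """Central-difference sensitivities around the nominal point, yielding
    the nominal-plus-sensitivities form of Eqn. (3.2)."""
    c0 = extract(*nominal)
    sens = []
    for i in range(len(nominal)):
        hi = list(nominal); hi[i] += delta
        lo = list(nominal); lo[i] -= delta
        sens.append((extract(*hi) - extract(*lo)) / (2 * delta))
    return c0, sens

c0, (dC_dw, dC_ds) = sensitivity_by_differences(extract_capacitance, [0.1, 0.2])
assert abs(dC_dw - 10.0) < 1e-4               # exactly linear in width
assert abs(dC_ds - (-4.0 / 0.2 ** 2)) < 1e-2  # d/ds (4/s) = -4/s^2 = -100
```

The robustness issue mentioned in the text is visible here: the result depends on the step `delta`, whereas an analytic (native) extractor has no such tuning knob.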

Mainly for historical reasons, the most common modeling strategy for cell library characterization is based on delay look-up tables (LUTs), sometimes referred to as dotlib (.lib) or Liberty [65] tables. This is a simplified model where delay and power information is maintained in the form of a few parameters. In this simplified model the timing behavior of a cell is usually characterized by a set of look-up tables that, for each input/output pair, describe the delay and output slew of the cell as a function of the input slew and output load. The output load is assumed to be a lumped capacitance. Such a model is depicted in Figure 4-1-(b). Additionally, in Figure 4-2, the corresponding voltage waveforms are illustrated, where the standard delay and slew definitions apply. In the context of this work, we will assume that only nominal cell delays are characterized by look-up tables, and non-zero delay sensitivities to the process parameters will be a consequence of the fact that cell input slews and output loads may also depend on the process parameters. Nevertheless, specific cell parameters can be treated in a similar manner as interconnect parameters.

Figure 3-9: Illustration of the original circuit and the corresponding timing graph.

Figure 3-10: Illustration of the elements of a timing graph.
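Between characterized grid points, LUT models of this kind are typically evaluated by interpolation. A minimal bilinear sketch over the (input slew, output load) axes, with made-up table values rather than actual library data:

```python
from bisect import bisect_right

def lut_lookup(slews, loads, table, s, c):
    """Bilinear interpolation in a delay LUT indexed by input slew (rows)
    and output load (columns). Assumes s and c lie inside the table range."""
    i = min(max(bisect_right(slews, s) - 1, 0), len(slews) - 2)
    j = min(max(bisect_right(loads, c) - 1, 0), len(loads) - 2)
    ts = (s - slews[i]) / (slews[i + 1] - slews[i])
    tc = (c - loads[j]) / (loads[j + 1] - loads[j])
    return ((1 - ts) * (1 - tc) * table[i][j]
            + ts * (1 - tc) * table[i + 1][j]
            + (1 - ts) * tc * table[i][j + 1]
            + ts * tc * table[i + 1][j + 1])

slews = [10.0, 50.0]           # input slew axis (e.g. ps)
loads = [5.0, 20.0]            # output load axis (e.g. fF)
delay = [[30.0, 60.0],         # made-up characterization data
         [45.0, 90.0]]
assert lut_lookup(slews, loads, delay, 10.0, 5.0) == 30.0    # grid point
assert lut_lookup(slews, loads, delay, 30.0, 12.5) == 56.25  # interpolated
```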

3.5 Timing Graph

After the delay computation stage, rather than keeping detailed circuit information, a more compact representation is usually adopted, containing only the circuit topology and the delay and slew information relevant for the subsequent timing analysis stage. This representation is usually in the form of a timing graph, as illustrated in Figure 3-9. A timing graph, $G = (V, E)$, is a graph that maps the circuit topology, where vertices, $v \in V$, correspond to pins (nodes) in the circuit, and directed edges, $e \in E$, correspond to pin-to-pin delays in cells or interconnect. Each edge is annotated with the corresponding delay. Further, some vertices are annotated with timing constraints, such as required arrival times. The primary inputs are vertices with no incoming edges. All vertices with no outgoing edges are primary outputs, but there may also be primary outputs with outgoing edges. The sets of primary inputs and outputs of $G$ are $PI(G)$ and $PO(G)$, respectively. A complete path is a sequence of edges connecting a primary input to a primary output. A partial path is a sequence of edges connecting any two vertices. A complete path will be referred to simply as a path. These definitions are illustrated in Figure 3-10.
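A timing graph in this form supports forward propagation of arrival times in topological order; the recurrence below is standard static timing analysis practice, not a construct specific to this chapter, and the names are illustrative:

```python
from collections import defaultdict, deque

def arrival_times(edges, primary_inputs):
    """Forward-propagate latest arrival times over a timing graph in
    topological order. edges: list of (u, v, delay); primary inputs
    start at time zero."""
    succ, indeg = defaultdict(list), defaultdict(int)
    nodes = set(primary_inputs)
    for u, v, d in edges:
        succ[u].append((v, d))
        indeg[v] += 1
        nodes.update((u, v))
    at = {v: 0.0 for v in nodes if indeg[v] == 0}  # primary inputs
    queue = deque(at)
    while queue:
        u = queue.popleft()
        for v, d in succ[u]:
            at[v] = max(at.get(v, float("-inf")), at[u] + d)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return at

edges = [("a", "c", 2.0), ("b", "c", 3.0), ("c", "d", 1.0)]
assert arrival_times(edges, ["a", "b"])["d"] == 4.0  # max(2, 3) + 1
```

The max at each vertex is exactly where the affine-function max machinery of Section 3.2 enters once edge delays become parametric.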

4

Parametric Delay Calculation

The previous chapter outlined the basic principles of the parametric delay modeling strategy that is pursued in this work. In this strategy, circuits are partitioned into cell and interconnect blocks and, for each block, delays and output slews are computed, using input slew information as well as specific block information. For cells, that specific information is tabulated in a technology library. For interconnect, that information is an extracted parasitic RC network that translates into an electrical description the relevant signal propagation properties of the wires. This chapter derives the necessary mathematical formulation for the calculation of such parametric delays and output slews. Specifically, it describes how to produce cell and interconnect delays and output slews as affine functions of process parameters [25, 26]. It is assumed that one of several recently proposed approaches for interconnect reduction under process parameter variations is available to generate tractably sized reduced order models [72, 55, 37]. Even though not essential to the application of the techniques proposed in this chapter, this assumption is relevant from a practical standpoint.

The key technology in the proposed approach is a specific type of perturbation analysis.

While digital circuits are strongly nonlinear with respect to the circuit inputs, cell delays are often close to linear with respect to process parameters. In this work the general development of linear time-varying (LTV) perturbation theory [73, 54] is adapted for the extraction of variation-aware delay models tailored to the specific needs of delay calculation for pre-characterized standard cells. LTV perturbation theory has been widely used in RF analysis with great success [67] and is at the heart of many interesting new developments. The advantage of this type of approach over, for example, differencing repeated delay calculation runs, is that it is essentially an analytical method. Differencing-type approaches can suffer from severe robustness problems that make them difficult to use reliably. In addition, this technique can potentially be made very fast, handling parametric models with ten to twenty parameters at minimal penalty relative to a non-variational calculation.

The outline of this chapter is as follows. We start by introducing the basics of delay computation and explaining the general delay computation procedure for the nominal case, when no variations are taken into account. Then, in Section 4.2, we introduce the general perturbation formulation and discuss the specialization of the more general technique to cell-level interconnect-related delay. In Section 4.3, we discuss how perturbation analysis can be performed when only delay table look-up models are available for the standard cells. A key point is that analytic expressions for delay sensitivities can be obtained without requiring closed-form expressions for the cell delay (however, see [47] for such closed-form expressions). Finally, Section 4.4 describes the details of one implementation of the formulations derived in previous sections. Conclusions are drawn in Section 4.5.

4.1 Nominal Delay Calculation

This section discusses the nominal delay computation for pre-characterized library cells and interconnect. For cell delay calculation, a new effective capacitance formulation is proposed. Interconnect delay is calculated using an accurate, yet efficient, simulation-based approach.

4.1.1 Cell Delay and Cell Loading

The cell delay modeling strategy outlined in the previous chapter assumes a voltage source model for the cell characterization, as illustrated in Figure 4-1-(b), since delay and slew values implicitly characterize the output voltage waveforms of the cell, corresponding to vc in this particular case. However, in recent years, current source models [32, 17] are gaining more prominence, since they are more effective in handling complex interconnect loading effects.

Even though throughout this work we assume voltage source delay models, the proposed techniques can also be directly applied when using current source delay models.

Figure 4-1: Voltage source based cell models: (a) original circuit and (b) equivalent circuit.

Figure 4-2: Waveforms of $v_i$, $v$ and $v_c$, with delay, $d$, slew, $s$, and shift, $k$, measurements.

In Figure 4-1-(b), the output load is assumed to be a single lumped capacitance that somehow models the capacitive effects introduced by the interconnect and by the input pins of the cells connected to the same net. In reality, however, the interconnect attached to the driver cell is a complex RC network, which in deep submicron processes is very poorly modeled by a lumped capacitance. The loading effect of interconnect on the cell, i.e. the impact of downstream interconnect on the cell delay itself, cannot be accurately obtained simply by looking at the total capacitance on the net. To try to account for the effects of complex interconnect, while still preserving table-based cell models, the concept of effective capacitance [57, 47] has been widely adopted. For the remainder of this document we will consider that the $C$ shown in Figure 4-1-(b) is such an effective capacitance.

The idea behind the effective capacitance consists of determining the value of $C$ that, in a certain sense, approximates as accurately as possible the behavior of the original interconnect RC network. Several criteria can be used when computing such an approximation, as will be discussed later in this chapter. In Figure 4-1 the output stage of a cell (or, more accurately, of an output pin of a cell) is modeled by a voltage source, producing a voltage ramp $v$, with slew $s$, and a series resistor, with resistance $R$, that models the output resistance of the pin. The figure depicts the output stage of a cell loaded by the effective capacitance $C$ (b), and by the original interconnect RC network, obtained by layout parasitic extraction of the interconnect (a). In the following, without loss of generality and in order to simplify the description, we restrict ourselves to the case of rising output waveforms for non-inverting digital cells. Clearly, any other case can be derived in a similar manner.

The simple RC circuit in Figure 4-1-(b) is an approximated model of the output stage of a cell connected to an effective capacitance, that is itself an approximation of the interconnect load. For a given input slew, $s_i$, and a given effective capacitance, $C$, the estimated cell delay, $d$, and output slew, $s_o$, can be computed by a table look-up in the timing characterization of the cell. Using this information, and with the help of Figure 4-2, we can easily derive the expressions for the three time instants at which the waveform of the output voltage, $v_c$, should cross $V_L$, $V_T$ and $V_H$, respectively,

$$t_L = \frac{s_i}{V_H - V_L} V_T + d - \frac{s_o}{V_H - V_L} (V_T - V_L) \qquad (4.1)$$

$$t_T = \frac{s_i}{V_H - V_L} V_T + d \qquad (4.2)$$

$$t_H = \frac{s_i}{V_H - V_L} V_T + d + \frac{s_o}{V_H - V_L} (V_H - V_T) \qquad (4.3)$$

Assuming the voltage $v$ to be a rising ramp of slew $s$, shifted in time by $k$,

$$v(t, s, k) = \begin{cases} 0 & \text{if } 0 \le t < k \\ \frac{V_H - V_L}{s} (t - k) & \text{if } k \le t < k + \frac{s V_{DD}}{V_H - V_L} \\ V_{DD} & \text{if } t \ge k + \frac{s V_{DD}}{V_H - V_L} \end{cases} \qquad (4.4)$$

the output voltage, $v_c$, produced by the simple RC circuit of Figure 4-1-(b) is given by

$$v_c(t, s, k, R, C) = \begin{cases} 0 & \text{if } 0 \le t < k \\ \frac{V_H - V_L}{s} \left( -RC + t - k + RC\, e^{-\frac{t-k}{RC}} \right) & \text{if } k \le t < k + \frac{s V_{DD}}{V_H - V_L} \\ V_{DD} - \frac{V_H - V_L}{s} \left( e^{\frac{s V_{DD}}{RC (V_H - V_L)}} - 1 \right) RC\, e^{-\frac{t-k}{RC}} & \text{if } t \ge k + \frac{s V_{DD}}{V_H - V_L} \end{cases} \qquad (4.5)$$

In order to simplify our notation, in the following we will assume

s   k φ =   (4.6)    R       C      Using Eqn. (4.5), we can compute a waveform for v (e.g. s and k) and a resistance R, such that the waveform of the response vc crosses (tL,VL), (tT ,VT ) and (tH ,VH ), thus matching the tabulated behavior of the cell and its output response. These constraints can be stated by the following three equations,

vc(tL, φ) = VL (4.7)

vc(tT , φ) = VT (4.8)

vc(tH , φ) = VH (4.9)
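Eqns. (4.4) and (4.5) can be evaluated directly. The sketch below uses arbitrary numeric threshold, supply and circuit values (the `VL, VT, VH, VDD` and `s, k, R, C` numbers are illustrative, not characterization data), and checks that $v_c$ is continuous at the end of the ramp, settles to $V_{DD}$, and lags the ideal ramp:

```python
import math

VL, VT, VH, VDD = 0.1, 0.5, 0.9, 1.0   # arbitrary threshold/supply values

def v_ramp(t, s, k):
    """Shifted rising ramp of Eqn. (4.4); s is the V_L-to-V_H slew."""
    T = s * VDD / (VH - VL)            # time for the ramp to reach VDD
    if t < k:
        return 0.0
    if t < k + T:
        return (VH - VL) / s * (t - k)
    return VDD

def v_c(t, s, k, R, C):
    """RC response of Eqn. (4.5) to the ramp of Eqn. (4.4)."""
    T = s * VDD / (VH - VL)
    if t < k:
        return 0.0
    if t < k + T:
        return (VH - VL) / s * (-R * C + t - k
                                + R * C * math.exp(-(t - k) / (R * C)))
    return VDD - (VH - VL) / s * (math.exp(T / (R * C)) - 1) \
        * R * C * math.exp(-(t - k) / (R * C))

s, k, R, C = 0.8, 0.1, 1.0, 0.5
T = s * VDD / (VH - VL)
# Continuity at the end of the ramp, settling to VDD, and the RC lag.
assert abs(v_c(k + T - 1e-9, s, k, R, C) - v_c(k + T + 1e-9, s, k, R, C)) < 1e-6
assert abs(v_c(k + T + 50.0, s, k, R, C) - VDD) < 1e-9
assert v_c(0.5, s, k, R, C) <= v_ramp(0.5, s, k)
```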

The waveform of $v$ can be seen as the "ideal" output voltage of the cell, under a zero output load. We should not lose track of the fact that our goal is to determine an appropriate value for the effective capacitance $C$. The previous derivations assumed that such a value was somehow known. However, all that is required is that $C$ should approximate the behavior of the original interconnect RC network as accurately as possible. Several criteria [18] can be used when defining what effective capacitance value provides a good approximation of the behavior of the original interconnect RC network. In this work we consider that the effective capacitance that best approximates the behavior of the original interconnect RC network is the one that accumulates the same charge over the transition period (i.e. when the output voltage switches from $V_L$ to $V_H$). This means that in both circuits the total charge accumulated during the respective transition periods must match. Formally,

$$Q_c = Q_m \;\Leftrightarrow\; \int_{t_L}^{t_H} I_c \, dt = \int_{t'_L}^{t'_H} I_m \, dt \qquad (4.10)$$

where $v_m(t'_L) = V_L$ and $v_m(t'_H) = V_H$, corresponding to the points where the output voltage of the cell in the original circuit crosses $V_L$ and $V_H$.

From Eqns. (4.7), (4.8), (4.9) and (4.10) a value of $\phi$ can be computed that both matches the output waveform $v_c$ with the tabulated timing information at $t_L$, $t_T$ and $t_H$, and also matches the charge drawn by the original interconnect RC network and by the effective capacitance, over the transition period. Since Eqns. (4.7), (4.8), (4.9) and (4.10) contain nonlinear terms, an implicit iterative method must be used to solve them. The Newton-Raphson method was therefore chosen to compute the roots of the function $F : \mathbb{R}^4 \to \mathbb{R}^4$, that encodes the error in Eqns. (4.7), (4.8), (4.9) and (4.10), and is defined as

$$F(\phi) = \begin{pmatrix} v_c(t_L, \phi) - V_L \\ v_c(t_T, \phi) - V_T \\ v_c(t_H, \phi) - V_H \\ Q_c - Q_m \end{pmatrix} \qquad (4.11)$$

For this iterative method, the step, $\Delta\phi_{n+1}$, is computed by solving the following equation,

$$J(\phi_n) \, \Delta\phi_{n+1} = -F(\phi_n) \qquad (4.12)$$

where $J$ is the Jacobian of $F$, given by

$$J(s, k, R, C) = \begin{pmatrix} \frac{d v_c}{d s}\big|_{t_L} & \frac{d v_c}{d k}\big|_{t_L} & \frac{d v_c}{d R}\big|_{t_L} & \frac{d v_c}{d C}\big|_{t_L} \\ \frac{d v_c}{d s}\big|_{t_T} & \frac{d v_c}{d k}\big|_{t_T} & \frac{d v_c}{d R}\big|_{t_T} & \frac{d v_c}{d C}\big|_{t_T} \\ \frac{d v_c}{d s}\big|_{t_H} & \frac{d v_c}{d k}\big|_{t_H} & \frac{d v_c}{d R}\big|_{t_H} & \frac{d v_c}{d C}\big|_{t_H} \\ \frac{d Q_c}{d s} - \frac{d Q_m}{d s} & \frac{d Q_c}{d k} & \frac{d Q_c}{d R} - \frac{d Q_m}{d R} & \frac{d Q_c}{d C} \end{pmatrix} \qquad (4.13)$$

The explicit analytical formulas for the sensitivities of $v_c$ and $Q_c$ can be easily computed by taking the derivatives of the known analytical formulas of $v_c$ and $I_c$. However, special care should be taken when computing the derivatives with respect to $C$, which require appropriate chain ruling, as $t_L$, $t_T$ and $t_H$ are also functions of $C$. Therefore,
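The update of Eqn. (4.12) is the standard Newton-Raphson step. As a toy illustration only (not the actual $F$ of Eqn. (4.11), whose residuals involve the waveform and charge expressions above), the sketch below applies the same step to a small nonlinear system, using a finite-difference Jacobian:

```python
def newton(F, phi, tol=1e-10, h=1e-7, iters=50):
    """Newton-Raphson with a forward-difference Jacobian, solving
    J(phi_n) dphi_{n+1} = -F(phi_n) at each step (cf. Eqn. (4.12))."""
    n = len(phi)
    for _ in range(iters):
        f = F(phi)
        if max(abs(v) for v in f) < tol:
            break
        J = [[(F([p + (h if j == i else 0.0)
                  for j, p in enumerate(phi)])[r] - f[r]) / h
              for i in range(n)] for r in range(n)]
        A = [row[:] + [-f[r]] for r, row in enumerate(J)]  # [J | -F]
        for c in range(n):                                 # Gauss-Jordan
            piv = max(range(c, n), key=lambda r: abs(A[r][c]))
            A[c], A[piv] = A[piv], A[c]
            for r in range(n):
                if r != c:
                    m = A[r][c] / A[c][c]
                    A[r] = [a - m * b for a, b in zip(A[r], A[c])]
        phi = [p + A[i][n] / A[i][i] for i, p in enumerate(phi)]
    return phi

# Toy system: x^2 + y^2 = 4 and x = y, whose solution is x = y = sqrt(2).
sol = newton(lambda p: [p[0] ** 2 + p[1] ** 2 - 4.0, p[0] - p[1]], [1.0, 1.0])
assert abs(sol[0] - 2 ** 0.5) < 1e-6 and abs(sol[1] - 2 ** 0.5) < 1e-6
```

In the effective-capacitance problem the Jacobian entries are instead the analytical sensitivities of Eqn. (4.13), which is what makes the approach robust compared to pure differencing.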

$$\frac{d v_c}{d C}\Big|_{t_L} = \frac{\partial v_c}{\partial C}\Big|_{t_L} + \frac{\partial v_c}{\partial t}\Big|_{t_L} \frac{d t_L}{d C} \qquad (4.14)$$

where

$$\frac{d t_L}{d C} = \frac{d t_L}{d s_o} \frac{d s_o}{d C} + \frac{d t_L}{d d} \frac{d d}{d C} = -\frac{V_T - V_L}{V_H - V_L} \frac{d s_o}{d C} + \frac{d d}{d C} \qquad (4.15)$$

For $t_T$ and $t_H$ the same derivation applies. Similarly, the charge derivative is

$$\frac{d Q_c}{d C} = \frac{\partial Q_c}{\partial C} + \frac{d Q_c}{d t_L} \frac{d t_L}{d C} + \frac{d Q_c}{d t_H} \frac{d t_H}{d C} \qquad (4.16)$$

The value of $\frac{d Q_m}{d k}$ is zero, because shifting the input waveform, $v$, by time $k$ does not change the charge, $Q_m$, accumulated by the circuit during the transition period, as the transition period is merely shifted in time. Since the capacitance $C$ is only included in the equivalent circuit, not in the original circuit, it cannot impact $Q_m$, and therefore the value of $\frac{d Q_m}{d C}$ is also zero. The computation of the sensitivities $\frac{d Q_m}{d s}$ and $\frac{d Q_m}{d R}$ requires a more complex procedure, which is detailed below. Given

$$\frac{d Q_m}{d s} = \frac{d}{d s} \int_{t'_L}^{t'_H} I_m \, dt \qquad (4.17)$$

the application of the Leibniz Integral Rule¹ to Eqn. (4.17) results in

$$\frac{d Q_m}{d s} = \int_{t'_L}^{t'_H} \frac{\partial I_m}{\partial s} \, dt + \frac{\partial t'_H}{\partial s} I_m(t'_H) - \frac{\partial t'_L}{\partial s} I_m(t'_L) \qquad (4.18)$$

¹ The Leibniz Integral Rule states that, given a variable $\alpha \in [\alpha_0, \alpha_1]$ and continuous functions $f$ and $\frac{df}{d\alpha}$ over $[\alpha_0, \alpha_1] \times [a(\alpha), b(\alpha)]$, then

$$\frac{d}{d\alpha} \int_{a(\alpha)}^{b(\alpha)} f(x, \alpha) \, dx = \int_{a(\alpha)}^{b(\alpha)} \frac{\partial}{\partial \alpha} f(x, \alpha) \, dx + f(b, \alpha) \frac{db}{d\alpha} - f(a, \alpha) \frac{da}{d\alpha}$$

The derivative of the voltage $v_m$ with respect to the slew, $s$, at time $t'_L$, is given by

$$\frac{d v_m}{d s}\Big|_{t'_L} = \frac{\partial v_m}{\partial s}\Big|_{t'_L} + \frac{\partial v_m}{\partial t}\Big|_{t'_L} \frac{\partial t'_L}{\partial s} \qquad (4.19)$$

By the definition of $t'_L$, $v_m(t'_L) = V_L$, therefore the value of $v_m(t'_L)$ remains constant for whatever value of $s$ is considered, producing the following result,

$$\frac{d v_m}{d s}\Big|_{t'_L} = 0 \;\Leftrightarrow\; \frac{\partial v_m}{\partial s}\Big|_{t'_L} + \frac{\partial v_m}{\partial t}\Big|_{t'_L} \frac{\partial t'_L}{\partial s} = 0 \;\Leftrightarrow\; \frac{\partial t'_L}{\partial s} = -\frac{\frac{\partial v_m}{\partial s}\big|_{t'_L}}{\frac{\partial v_m}{\partial t}\big|_{t'_L}} \qquad (4.20)$$

The same result can be derived for time $t'_H$. Combining Eqns. (4.18) and (4.20) yields

$$\frac{d Q_m}{d s} = \int_{t'_L}^{t'_H} \frac{\partial I_m}{\partial s} \, dt + \frac{\frac{\partial v_m}{\partial s}\big|_{t'_L}}{\frac{\partial v_m}{\partial t}\big|_{t'_L}} I_m(t'_L) - \frac{\frac{\partial v_m}{\partial s}\big|_{t'_H}}{\frac{\partial v_m}{\partial t}\big|_{t'_H}} I_m(t'_H) \qquad (4.21)$$

The exact same method can be used to derive $\frac{d Q_m}{d R}$,

$$\frac{d Q_m}{d R} = \int_{t'_L}^{t'_H} \frac{\partial I_m}{\partial R} \, dt + \frac{\frac{\partial v_m}{\partial R}\big|_{t'_L}}{\frac{\partial v_m}{\partial t}\big|_{t'_L}} I_m(t'_L) - \frac{\frac{\partial v_m}{\partial R}\big|_{t'_H}}{\frac{\partial v_m}{\partial t}\big|_{t'_H}} I_m(t'_H) \qquad (4.22)$$

The computation of the values of the terms of Eqns. (4.21) and (4.22) can be performed in several ways, depending on the representation of the interconnect RC network. A numerical approach for such computation is described later in this chapter.

Upon convergence, the Newton-Raphson method produces a solution,

$\phi^* = \begin{bmatrix} s^* \\ k^* \\ R^* \\ C^* \end{bmatrix}$ (4.23)

where $C^*$ is the effective capacitance of the interconnect RC network. Given $C^*$, the delay, $d$, and the output slew, $s_o$, of the cell can be computed by a simple table look-up in its timing characterization. This completely characterizes the cell output waveform within the constraints of the simple model. Such a waveform constitutes the input to the interconnect delay model.

4.1.2 Interconnect Delay

Assuming that the cell output voltage waveform has been computed, signals are then propagated along the path through an interconnect net. The input of such nets, the port, is tied to the output of a cell, and the net outputs, the taps, connect to the inputs of several other cells. At the timing level, we refer to the difference between the timing of the transition at the cell output (port) and at the next cell inputs (taps) as the intrinsic interconnect delay.

There are various methods of computing the interconnect delay, ranging from closed-form expressions, descendants of the Elmore delay formula, to the numerical solution of the underlying interconnect equations. In this work we assume that the circuit equations of the cell driver plus interconnect network are solved numerically, either via direct integration or an equivalent process like recursive convolution. Likewise, the slew at the output nodes must be computed, to be used in the analysis of the following cell.

The general state-space representation of an RC network (either in its original or reduced form) is

$C\frac{d}{dt}x(t) + G x(t) = u(t)$ (4.24)
$y(t) = N^T x(t)$ (4.25)

where $x \in \mathbb{R}^n$ is the vector of circuit state variables, $u$ is the input excitation, $y$ is the output response, $C$ and $G$ are the matrices describing the reactive (capacitances) and dissipative (conductances) parts of the circuit, and $N$ selects the output response.

Assuming a cell characterization in terms of voltage source models, as illustrated in Figure 4-1-(a), the input excitation is the voltage waveform, $v_m$, and the output response are the voltage waveforms in the taps, $v_{tap}$. Therefore, we have,

$u(t) = B v_m$ (4.26)
$v_{tap}(t) = L^T x(t)$ (4.27)

where $B$ is a matrix describing the node where the input voltage is injected, and $L$ is an incidence-type matrix describing which voltage nodes are monitored (taps). In the particular case of voltage source models, the current drawn by the interconnect RC network, $I_m$, is also relevant, both for computing the effective capacitance and the input voltage waveform. Hence, an additional equation should be added,

$I_m(t) = M^T x(t)$ (4.28)

where $M$ selects the output current out of the state vector $x$.
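To make these state-space objects concrete, the sketch below (our own illustrative helper, not code from this work) assembles the $C$, $G$, $B$ and $L$ matrices of Eqns. (4.24)-(4.27) for a small RC ladder, driven at its port and observed at a single tap.

```python
import numpy as np

def rc_ladder(r, c):
    """Assemble the matrices of Eqns. (4.24)-(4.27) for an RC ladder.

    r[i] is the resistance between node i and node i+1; c[i] is the
    grounded capacitance at node i.  The state vector x holds the node
    voltages; B injects the excitation at the port (node 0) and L
    selects the tap voltage (last node).  Illustrative topology only.
    """
    n = len(c)
    C = np.diag(c)
    G = np.zeros((n, n))
    for i, ri in enumerate(r):
        g = 1.0 / ri
        # conductance stamp between nodes i and i+1
        G[i, i] += g
        G[i + 1, i + 1] += g
        G[i, i + 1] -= g
        G[i + 1, i] -= g
    B = np.zeros((n, 1)); B[0, 0] = 1.0
    L = np.zeros((n, 1)); L[-1, 0] = 1.0
    return C, G, B, L
```

Since the resistive part of a pure RC line has no DC path to ground, $G$ is symmetric with zero row sums; the trapezoidal matrix $\frac{2C}{h} + G$ used later in Eqn. (4.83) is nonetheless nonsingular for $h > 0$.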

4.2 Variation-Aware Methodology

This section discusses the parametric analysis of the intrinsic interconnect delay itself. The impact of the interconnect parameters on the cell delay (i.e., variation in cell loading effects) is taken up in the next section.

4.2.1 General Perturbation Formulation

The starting point of our analysis is the general formulation of time-varying linear perturbation theory (see [54] for details). We assume the existence of a set of nonlinear differential-algebraic equations whose topology is fixed, but whose constitutive relations depend in a continuous way on a set of parameters. Without loss of generality the basic circuit equations can be written as

$\frac{d}{dt}q(x, \lambda) + i(x, \lambda) = u(t)$ (4.29)

where $x$ again represents the circuit state variables, for example, node voltages, $q \in \mathbb{R}^n$ the dynamic quantities such as stored charge, $i \in \mathbb{R}^n$ the static quantities such as device currents, $t$ time, and $u(t) \in \mathbb{R}^n$ the independent inputs such as current and voltage sources. In departure from the usual case, we introduce a $p$-element parameter vector $\lambda \in \mathbb{R}^p$. These parameters represent properties of the circuit, such as wire width or thickness, that induce variation in the circuit behavior through the $q$ and $i$ functions.

The perturbation approach to modeling the parameter variation treats the parameters as fluctuations ∆λ around a nominal value λ0, and assumes the circuit response x can be treated similarly, i.e.

λ = λ0 + ∆λ (4.30)

x(t) = x0(t) + ∆x(t). (4.31)

Expanding $i$ and $q$ as functions of $x$ and $\lambda$, and keeping the first order variations, we get

$q(x, \lambda) = q(x_0, \lambda_0) + \frac{\partial q}{\partial\lambda}\Delta\lambda + \frac{\partial q}{\partial x}\Delta x$ (4.32)
$i(x, \lambda) = i(x_0, \lambda_0) + \frac{\partial i}{\partial\lambda}\Delta\lambda + \frac{\partial i}{\partial x}\Delta x$ (4.33)

Assuming a solution to the nominal case, $x_0(t)$, is obtained, that is

$\frac{d}{dt}q(x_0, \lambda_0) + i(x_0, \lambda_0) = u(t)$ (4.34)

then substituting the perturbation expansions in Eqns. (4.32) and (4.33) into Eqn. (4.29) and using Eqn. (4.34) to eliminate the nominal-case terms, we obtain the equations for the first-order perturbation expansion as

$\frac{d}{dt}\left(\frac{\partial q}{\partial x}\Delta x\right) + \frac{\partial i}{\partial x}\Delta x = -\frac{d}{dt}\left(\frac{\partial q}{\partial\lambda}\Delta\lambda\right) - \frac{\partial i}{\partial\lambda}\Delta\lambda$ (4.35)

The simplest way to compute waveform sensitivities from Eqn. (4.35) is by solving it once for each parameter in turn, as

$\text{for each } k:\quad \frac{d}{dt}\left(\frac{\partial q}{\partial x}\frac{\partial x}{\partial\lambda_k}\right) + \frac{\partial i}{\partial x}\frac{\partial x}{\partial\lambda_k} = -\frac{d}{dt}\left(\frac{\partial q}{\partial\lambda_k}\right) - \frac{\partial i}{\partial\lambda_k}$ (4.36)

This gives the final expression

$x(t, \lambda) = x_0(t) + \sum_{k=1}^{p} \frac{\partial x}{\partial\lambda_k}(t)\,\Delta\lambda_k$ (4.37)

Once the sensitivities in the waveforms are known, the next step is to translate them into sensitivities of delay. As discussed, delay can be computed as $d = t_2 - t_1$, where $t_2$, $t_1$ are the crossing times of the two waveforms of interest. The sensitivity in a crossing time can be related to the sensitivity of the waveform value $x(t)$ at that point via the slew, $\partial x/\partial t$. Suppose there is a small change $\Delta T$ in the crossing time of a given waveform. With a linear model, the corresponding change in the voltage is

$\Delta X = \frac{\partial x}{\partial t}\Delta T$ (4.38)

Conversely, if the perturbation in the waveform, $\Delta X$, can be computed, the change in crossing time is given by

$\Delta T = \frac{\Delta X}{\partial x/\partial t}$ (4.39)

Therefore we can compute the sensitivity of the delay as

$\frac{\partial d}{\partial\lambda_k} = \frac{\partial x/\partial\lambda_k}{\partial x/\partial t}\bigg|_{t_2, \lambda_0} - \frac{\partial x/\partial\lambda_k}{\partial x/\partial t}\bigg|_{t_1, \lambda_0}$ (4.40)

Note that for this computation, the waveform sensitivity is only needed at a few points in time, a fact that can be used to speed up computations.
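As a quick numerical illustration of Eqns. (4.39) and (4.40), the sketch below evaluates the delay sensitivity from sampled nominal waveforms and their sensitivity waveforms, touching only the samples around the two crossing times. The function name and the monotone-rising assumption are ours.

```python
import numpy as np

def delay_sensitivity(t, w1, s1, w2, s2, v_th):
    """Delay sensitivity per Eqn. (4.40): the ratio of the waveform
    sensitivity dx/dlambda_k to the slew dx/dt, evaluated at the
    crossings of the two waveforms of interest (w1 at t1, w2 at t2).
    t is the common time grid; w1, w2 are nominal samples and s1, s2
    the corresponding sensitivity samples.  Assumes rising waveforms
    that start below the threshold v_th.
    """
    def ratio_at_crossing(w, sens):
        i = int(np.argmax(w >= v_th))            # first sample past v_th
        slope = (w[i] - w[i - 1]) / (t[i] - t[i - 1])
        return sens[i] / slope
    return ratio_at_crossing(w2, s2) - ratio_at_crossing(w1, s1)
```

Only the samples adjacent to $t_1$ and $t_2$ are used, which is exactly the speedup opportunity noted in the text.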

This is the formulation for a general first-order perturbation analysis. In the following we restrict ourselves to the problem at hand, namely modeling the linear interconnect subcircuits assuming variations in parameters affecting the interconnect elements.

4.2.2 Specialization to Interconnect

Our concern in this work is with the special case of interconnect parameters, so simplifi- cations of the general theory are possible. On-chip cell-level interconnect models are usually written in terms of capacitances and resistances, or equivalently, capacitances and conduc- tances. Inductance is typically neglected at this level and for the sake of simplicity we will proceed likewise; it is however easy to see that the derivation is quite similar when inductance is involved. Therefore, in this case,

$q(x, \lambda) = C(\lambda)x \qquad i(x, \lambda) = G(\lambda)x$ (4.41)

so that

$\frac{\partial i}{\partial\lambda_k} = \frac{\partial G}{\partial\lambda_k}x \qquad \frac{\partial q}{\partial\lambda_k} = \frac{\partial C}{\partial\lambda_k}x$ (4.42)

Let us then assume, for now, that for every element in the interconnect RC network (resistor or capacitor), a parametric affine model is available. Such a model contains the nominal values for the elements and also the sensitivities to each parameter. Therefore, the conductance and the capacitance matrices have the form:

$G = G_0 + \sum_{k=1}^{p} G_k\Delta\lambda_k \qquad C = C_0 + \sum_{k=1}^{p} C_k\Delta\lambda_k$ (4.43)

where $G_0$ and $C_0$ are the nominal values of the elements in the interconnect network, and the sensitivities $\frac{\partial G}{\partial\lambda_k}$ and $\frac{\partial C}{\partial\lambda_k}$ to each parameter $\lambda_k$ are given by

$\frac{\partial G}{\partial\lambda_k} = G_k, \qquad \frac{\partial C}{\partial\lambda_k} = C_k$ (4.44)

The nominal value corresponds to the solution of the equations with each ∆λk = 0, that is λ = λ0. Assuming the parametric formulation for G presented in Eqn. (4.43), and for x presented in Eqn. (4.31) we obtain, for instance for i(x, λ):

$i(x, \lambda) = \left[G_0 + \sum_{k=1}^{p} G_k\Delta\lambda_k\right](x_0 + \Delta x)$ (4.45)

Simplifying and eliminating the (non-linear) cross-product terms, we obtain:

$i(x, \lambda) \approx G_0 x_0 + G_0\Delta x + \sum_{k=1}^{p} G_k x_0 \Delta\lambda_k$ (4.46)

implying that:

$i_0 \equiv i(x_0, 0) = G_0 x_0, \qquad \frac{\partial i}{\partial x} = G_0, \qquad \frac{\partial i}{\partial\lambda_k} = G_k x_0$ (4.47)
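The dropped cross-product terms are second order in the perturbations, which is easy to confirm numerically: the snippet below compares the exact product of Eqn. (4.45) against the linearization of Eqn. (4.46) for small random $\Delta\lambda$ and $\Delta x$. All data here is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 4
G0 = rng.standard_normal((n, n))
Gk = rng.standard_normal((p, n, n))      # sensitivity matrices G_k
x0 = rng.standard_normal(n)
dx = 1e-4 * rng.standard_normal(n)       # small state perturbation
dlam = 1e-4 * rng.standard_normal(p)     # small parameter variations

# Exact product, Eqn. (4.45): [G0 + sum_k Gk*dlam_k](x0 + dx)
G = G0 + np.einsum('kij,k->ij', Gk, dlam)
i_exact = G @ (x0 + dx)

# First-order expansion, Eqn. (4.46): cross terms Gk*dlam_k*dx dropped
i_lin = G0 @ x0 + G0 @ dx + np.einsum('kij,k,j->i', Gk, dlam, x0)

err = np.linalg.norm(i_exact - i_lin)    # O(|dlam|*|dx|): tiny but non-zero
```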

An identical procedure can be applied to q(x, λ) leading, as expected, to:

$q(x, \lambda) \approx C_0 x_0 + C_0\Delta x + \sum_{k=1}^{p} C_k x_0 \Delta\lambda_k$ (4.48)

and therefore, that:

$q_0 \equiv q(x_0, 0) = C_0 x_0, \qquad \frac{\partial q}{\partial x} = C_0, \qquad \frac{\partial q}{\partial\lambda_k} = C_k x_0$ (4.49)

Eqns. (4.34) and (4.35), which describe the general perturbation analysis framework, can therefore, in the specialization to parameter-varying interconnect, be written as:

$C_0\frac{d}{dt}x_0(t) + G_0 x_0(t) = u(t)$ (4.50)

$C_0\frac{d}{dt}\left[\Delta x\right] + G_0\Delta x = -\sum_{k=1}^{p}\left[\frac{d}{dt}\left(C_k x_0(t)\right) + G_k x_0(t)\right]\Delta\lambda_k$ (4.51)

The delay modeling problem is completed by adding the notion of inputs and outputs to form state-space models. In the case of cell-level interconnect, the inputs are represented by drivers, the output stages of cells. If the cell library is characterized using current source models, then the input is a fixed current source,

$u(t) = B i_{drv}(t)$ (4.52)

where $B$ is simply an incidence matrix indicating to which node each driver is connected.

Similarly, if the cell library is characterized using voltage source models (as in the case under study), we have

$u(t) = B v_{drv}(t)$ (4.53)

as in Eqn. (4.26), where $v_{drv} = v_m$. Other models may be used, like nonlinear current source models [32, 17].

Recalling Eqn. (4.27), the full set of equations is now

$C_0\frac{d}{dt}x_0(t) + G_0 x_0(t) = u(t)$ (4.54)
$v_{0,tap}(t) = L^T x_0(t)$ (4.55)
$C_0\frac{d}{dt}\left[\Delta x\right] + G_0\Delta x = -\sum_{k=1}^{p}\left[\frac{d}{dt}\left(C_k x_0(t)\right) + G_k x_0(t)\right]\Delta\lambda_k$ (4.56)
$\Delta v_{tap} = L^T\Delta x$ (4.57)

These equations can be written more compactly if we define

$s_k(t) = -\left(C_k\frac{d}{dt}x_0(t) + G_k x_0(t)\right)$ (4.58)

where $x_0(t)$ is the nominal solution computed above. $s_k$ can be interpreted as the "equivalent source" that will allow the determination of the sensitivity to the $k$th interconnect parameter.

With this definition, the final, complete set of equations is then rewritten as

$C_0\frac{d}{dt}x_0(t) + G_0 x_0(t) = u(t)$ (4.59)
$v_{0,tap} = L^T x_0(t)$ (4.60)
$C_0\frac{d}{dt}\left[\Delta x\right] + G_0\Delta x = \sum_{k=1}^{p} s_k(t)\Delta\lambda_k$ (4.61)
$\Delta v_{tap} = L^T\Delta x$ (4.62)

4.2.3 Interconnect Sensitivity Calculation

The process of sensitivity calculation can now be concisely stated. First, solve Eqns. (4.59) and (4.60) to get the nominal case responses. Then, for each parameter $k$, solve

$C_0\frac{d}{dt}\left(\frac{\partial x}{\partial\lambda_k}\right) + G_0\left(\frac{\partial x}{\partial\lambda_k}\right) = s_k(t)$ (4.63)
$\frac{\partial v_{tap}}{\partial\lambda_k} = L^T\left(\frac{\partial x}{\partial\lambda_k}\right)$ (4.64)

to get the sensitivity of the response waveforms. From the sensitivity waveforms, the delay sensitivity can be computed using Eqn. (4.40) at the appropriate time points.
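The two-step procedure above can be sketched compactly with trapezoidal time-stepping for both the nominal system (4.59) and the sensitivity system (4.63). One consistent choice under the trapezoidal rule is to evaluate the pair $\dot{x}_0(t_n) + \dot{x}_0(t_{n+1})$, needed for $s_k(t_n) + s_k(t_{n+1})$, as $2(\hat{x}_{n+1} - \hat{x}_n)/h$, which is what the code exploits. Function names are our own.

```python
import numpy as np

def nominal_and_sensitivity(C0, G0, Ck, Gk, B, u, h, nsteps):
    """Solve Eqn. (4.59) for x0 and Eqn. (4.63) for dx/dlambda_k, for
    one parameter k, with the trapezoidal rule.  The sensitivity
    system is driven by the equivalent source of Eqn. (4.58),
    s_k = -(Ck dx0/dt + Gk x0); both recursions share the matrix
    2C0/h + G0, so one factorization can serve both.  u[n] holds the
    samples of the input excitation."""
    n = C0.shape[0]
    A_plus = 2.0 * C0 / h + G0
    A_minus = 2.0 * C0 / h - G0
    x = np.zeros(n)       # nominal state x0
    y = np.zeros(n)       # sensitivity dx/dlambda_k
    for m in range(nsteps):
        x_new = np.linalg.solve(A_plus, A_minus @ x + B @ (u[m] + u[m + 1]))
        # s_k(t_m) + s_k(t_{m+1}) under the trapezoidal rule:
        sk_sum = -(2.0 / h) * (Ck @ (x_new - x)) - Gk @ (x_new + x)
        y = np.linalg.solve(A_plus, A_minus @ y + sk_sum)
        x = x_new
    return x, y
```

In a practical implementation one sparse LU factorization of `A_plus`, reused across all time steps and all $p$ parameters, would replace the repeated dense solves.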

4.3 Cell Delay Sensitivity Calculation

In the preceding section, we have seen how to perform variation-aware delay computation, by computing the sensitivities of the response waveforms in interconnect blocks. However, it is also necessary to show that similar sensitivities can be computed at the output of cells, in particular assuming that cell delay computation is still based on table look-up models.

To show this, we refer back to the derivation in Section 4.1 and in particular to Eqns. (4.7), (4.8), (4.9) and (4.10). If we perform an expansion around a nominal point $\phi_0$, keeping the first order variations, and eliminating the nominal-case terms, we obtain,

$\Delta v_c(t_L, \Delta\phi) = 0$ (4.65)
$\Delta v_c(t_T, \Delta\phi) = 0$ (4.66)
$\Delta v_c(t_H, \Delta\phi) = 0$ (4.67)
$\Delta Q_c(t_L, t_H, \Delta\phi) = \Delta Q_m$ (4.68)

Noticing the dependence of $t_L$ on $d$ and $s_o$, and their dependence on $s_i$ and $C$, we obtain,

$\Delta v_c(t_L, \Delta\phi) = \frac{dv_c}{ds}\bigg|_{t_L}\Delta s + \frac{dv_c}{dk}\bigg|_{t_L}\Delta k + \frac{dv_c}{dR}\bigg|_{t_L}\Delta R + \frac{dv_c}{dC}\bigg|_{t_L}\Delta C + \frac{\partial v_c}{\partial t}\bigg|_{t_L}\frac{dt_L}{ds_i}\Delta s_i = 0$ (4.69)

where

$\frac{dt_L}{ds_i} = \frac{\partial t_L}{\partial s_i} + \frac{\partial t_L}{\partial s_o}\frac{\partial s_o}{\partial s_i} + \frac{\partial t_L}{\partial d}\frac{\partial d}{\partial s_i}$ (4.70)

The derivations for tT and tH follow the same procedure. For Eqn. (4.68) a similar expansion can be performed,

$\left(\frac{dQ_c}{ds} - \frac{dQ_m}{ds}\right)\Delta s + \frac{dQ_c}{dk}\Delta k + \left(\frac{dQ_c}{dR} - \frac{dQ_m}{dR}\right)\Delta R + \frac{dQ_c}{dC}\Delta C + \frac{dQ_c}{ds_i}\Delta s_i = \Delta Q_m$ (4.71)

where

$\frac{dQ_c}{ds_i} = \frac{\partial Q_c}{\partial t}\bigg|_{t_L}\frac{dt_L}{ds_i} + \frac{\partial Q_c}{\partial t}\bigg|_{t_H}\frac{dt_H}{ds_i}$ (4.72)

The following expressions relate ∆si and ∆Qm to the parameter variation vector, ∆λ,

$\Delta s_i = \frac{ds_i}{d\lambda}\Delta\lambda$ (4.73)
$\Delta Q_m = \frac{dQ_m}{d\lambda}\Delta\lambda$ (4.74)

where $\frac{ds_i}{d\lambda}$ and $\frac{dQ_m}{d\lambda}$ are the sensitivity vectors. The value of $\frac{ds_i}{d\lambda}$ is an input of the problem, and results from the parametric output slew computation on the interconnect of the input net, as described in Section 4.1.2. Similarly to the derivation of Eqns. (4.18) and (4.22), $\frac{dQ_m}{d\lambda}$ is given by

$\frac{dQ_m}{d\lambda} = \int_{t'_L}^{t'_H} \frac{\partial I_m}{\partial\lambda}\, dt + \frac{\partial t'_H}{\partial\lambda} I_m(t'_H) - \frac{\partial t'_L}{\partial\lambda} I_m(t'_L)$ (4.75)

Resorting to Eqns. (4.69), (4.71), (4.73), and (4.74), we can now jointly represent

Eqns. (4.65), (4.66), (4.67), and (4.68) in matrix form as

$J\Delta\phi = \left(Q\frac{ds_i}{d\lambda} + W\frac{dQ_m}{d\lambda}\right)\Delta\lambda$ (4.76)

where $J$ is given by Eqn. (4.13), and $Q$ and $W$ are given by

$Q = \begin{bmatrix} -\dfrac{\partial v_c}{\partial t}\Big|_{t_L}\dfrac{dt_L}{ds_i} \\[4pt] -\dfrac{\partial v_c}{\partial t}\Big|_{t_T}\dfrac{dt_T}{ds_i} \\[4pt] -\dfrac{\partial v_c}{\partial t}\Big|_{t_H}\dfrac{dt_H}{ds_i} \\[4pt] -\dfrac{dQ_c}{ds_i} \end{bmatrix}, \qquad W = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$ (4.77)

The computation of $J$ has been detailed in Section 4.1. All the derivatives in $Q$ can either be computed analytically or by accessing the timing characterization of the cell.

$\Delta C$ can be extracted from $\Delta\phi$ by using $W^T$ to "select" the capacitance row,

$\Delta C = W^T\Delta\phi = W^T J^{-1}\left(Q\frac{ds_i}{d\lambda} + W\frac{dQ_m}{d\lambda}\right)\Delta\lambda$ (4.78)

Acknowledging the dependence of the delay d and the output slew so on the input slew si and the capacitance C, the following expressions can be derived,

$\Delta d = \frac{dd}{ds_i}\Delta s_i + \frac{dd}{dC}\Delta C$ (4.79)

$\Delta s_o = \frac{ds_o}{ds_i}\Delta s_i + \frac{ds_o}{dC}\Delta C$ (4.80)

where $\frac{dd}{ds_i}$, $\frac{dd}{dC}$, $\frac{ds_o}{ds_i}$ and $\frac{ds_o}{dC}$ can be computed by direct analysis of the look-up table that contains the timing characterization of the cell. Substituting Eqns. (4.73) and (4.78) in

Eqns. (4.79) and (4.80), we can derive the sensitivities of the delay and output slew to the parameters,

$\frac{dd}{d\lambda} = \frac{dd}{ds_i}\frac{ds_i}{d\lambda} + \frac{dd}{dC}W^T J^{-1}\left(Q\frac{ds_i}{d\lambda} + W\frac{dQ_m}{d\lambda}\right)$ (4.81)
$\frac{ds_o}{d\lambda} = \frac{ds_o}{ds_i}\frac{ds_i}{d\lambda} + \frac{ds_o}{dC}W^T J^{-1}\left(Q\frac{ds_i}{d\lambda} + W\frac{dQ_m}{d\lambda}\right)$ (4.82)

Figure 4-3: Interconnect delay and slew calculation from voltage waveforms.

4.4 Practical Implementation

The previous sections developed the theoretical framework for parametric delay calculation. This section details a possible implementation of such a framework. The described implementation was used to produce the experimental results presented further ahead.

4.4.1 Interconnect Delay

As illustrated in Figure 4-1-(a), given an input ramp $v_m$ we want to compute the transient response of the extracted interconnect RC network. The state-space representation of this RC network is given by Eqns. (4.24) and (4.25). Given the good tradeoff between simplicity and accuracy, we chose to solve this system using the Trapezoidal Method². The resulting discretized system is,

$\left(\frac{2C}{h} + G\right)\hat{x}_{n+1} = \left(\frac{2C}{h} - G\right)\hat{x}_n + B\left[v_m(t_n) + v_m(t_{n+1})\right]$ (4.83)

Given the initial condition $\hat{x}_0 = 0$, the transient response of the system can be trivially computed from Eqn. (4.83), for every required time point. The voltage waveforms for each tap can thus be directly computed from the state vector using Eqn. (4.27). The step $h$ can be adjusted as necessary, at the penalty of an additional LU factorization of $\frac{2C}{h} + G$ for each adjustment.

²The Trapezoidal Method is a linear multistep method for the solution of ordinary differential equations. Given a function $y' = f(t, y)$, an initial value $y(t_0) = y_0$, and an integration step $h$, the subsequent values of $y$ can be computed by

$\hat{y}_{n+1} = \hat{y}_n + \frac{1}{2}h\left[f(t_n, \hat{y}_n) + f(t_{n+1}, \hat{y}_{n+1})\right]$

As illustrated in Figure 4-3, given the voltage waveforms of the port and taps, the corresponding tap delays and slews can be easily computed by saving the time points where each voltage waveform crosses $V_L$, $V_T$ and $V_H$. Rather than storing the voltage waveforms for each time point, only the values at the required time points are stored, and the waveform values are promptly discarded two iterations later. The transient simulation stops immediately after the last required time point is obtained.

When simulating the RC network for a rising input ramp, the simple initial condition $\hat{x}_0 = 0$ can be assumed. However, when simulating a falling input ramp, that is not true, and the computation of an initial DC point is required. Since the delay and output slew produced by an RC network are the same for rising or falling input ramps, all simulations are performed assuming a rising input, even for falling input ramps, and the results are converted accordingly.
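The bookkeeping of Figure 4-3 around the recursion of Eqn. (4.83) can be sketched as follows: one factorization-backed time loop that keeps only the threshold-crossing times of the tap waveform, locating each crossing by linear interpolation between the two samples that bracket it. This is a simplified single-tap illustration with names of our own choosing, not the dissertation's implementation.

```python
import numpy as np

def tap_crossings(C, G, B, L, vm, h, thresholds):
    """Trapezoidal recursion of Eqn. (4.83) on the RC network, saving
    only the times at which the tap waveform L^T x crosses each
    threshold (e.g. VL, VT, VH); the waveform samples themselves are
    discarded as soon as the next step is computed.  vm holds the
    samples of the port excitation."""
    A_plus = 2.0 * C / h + G
    A_minus = 2.0 * C / h - G
    x = np.zeros(C.shape[0])
    v_prev = (L.T @ x).item()
    cross = {th: None for th in thresholds}
    for n in range(len(vm) - 1):
        x = np.linalg.solve(A_plus, A_minus @ x + B[:, 0] * (vm[n] + vm[n + 1]))
        v = (L.T @ x).item()
        for th in thresholds:
            if cross[th] is None and v_prev < th <= v:
                # linear interpolation between t_n and t_{n+1}
                cross[th] = n * h + h * (th - v_prev) / (v - v_prev)
        v_prev = v
    return cross
```

Tap delays and slews then follow directly from the stored crossing times, as described in the text.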

4.4.2 Effective Capacitance and Cell Delay

The computation of the effective capacitance requires the computation of the voltage and current waveforms in the equivalent circuits, as well as in the original circuit. Since an analytical characterization of the equivalent circuit is known, all voltage and current waveforms and their derivatives can be directly computed.

Concerning the original circuit, the simulation of the interconnect RC network can be performed as in Eqn. (4.83). However, there is a slight difference: in this case the input excitation to the RC network is given by v, rather than vm. This issue can be easily solved by combining Eqns. (4.24), (4.26) and (4.28) with

$I_m(t) = \frac{v(t) - v_m(t)}{R}$ (4.84)

resulting in a new state-space equation that includes $v$ rather than $v_m$,

$C\frac{d}{dt}x(t) + \left(G + BRM^T\right)x(t) = Bv(t)$ (4.85)

which, once discretized, becomes,

$\left(\frac{2C}{h} + G + BRM^T\right)\hat{x}_{n+1} = \left(\frac{2C}{h} - G - BRM^T\right)\hat{x}_n + B\left[v(t_n) + v(t_{n+1})\right]$ (4.86)

Considering once more $\hat{x}_0 = 0$, we can easily compute the voltage and current waveforms in the RC network.

The first-order Taylor series expansion of $v$, given by Eqn. (4.4), in $s$ around a nominal point $s_0$, is

$v(t, s - s_0, k) = \begin{cases} 0 & \text{if } 0 \le t < k \\[4pt] \dfrac{V_H - V_L}{s_0}(t - k) - \dfrac{V_H - V_L}{s_0^2}(t - k)(s - s_0) & \text{if } k \le t < k + \dfrac{s_0 V_{DD}}{V_H - V_L} \\[4pt] V_{DD} & \text{if } t \ge k + \dfrac{s_0 V_{DD}}{V_H - V_L} \end{cases}$ (4.87)

The variation of $v$ in the transition region is therefore given by,

$\Delta v(t, \Delta s, k) = -\frac{V_H - V_L}{s_0^2}(t - k)\Delta s$ (4.88)
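The first-order model of Eqn. (4.88) is easy to sanity-check: for the saturated ramp underlying Eqn. (4.87), the voltage change produced by a small slew perturbation matches $-\frac{V_H - V_L}{s_0^2}(t - k)\Delta s$ to first order. The helper below is illustrative only; the threshold and supply values are arbitrary assumptions.

```python
def ramp(t, s, k, VDD=1.0, VL=0.3, VH=0.7):
    """Saturated input ramp behind Eqn. (4.87): zero before time k,
    slope (VH - VL)/s during the transition, VDD afterwards.  The
    slew s is the time spent between the VL and VH crossings, so the
    full 0-to-VDD transition lasts s*VDD/(VH - VL)."""
    T = s * VDD / (VH - VL)
    if t < k:
        return 0.0
    if t < k + T:
        return (VH - VL) / s * (t - k)
    return VDD

s0, ds, k, t = 1.0, 1e-3, 0.1, 0.6
exact = ramp(t, s0 + ds, k) - ramp(t, s0, k)      # true voltage change
approx = -(0.7 - 0.3) / s0**2 * (t - k) * ds      # Eqn. (4.88)
```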

Expanding Eqn. (4.85) in a first-order Taylor series around a nominal slew, $s_0$, and eliminating the nominal terms produces,

$C\frac{d}{dt}\Delta x + \left(G + BRM^T\right)\Delta x = -B\frac{V_H - V_L}{s_0^2}(t - k)\Delta s$ (4.89)

Discretizing again using the Trapezoidal Method yields,

$\left(\frac{2C}{h} + G + BRM^T\right)\frac{\Delta\hat{x}_{n+1}}{\Delta s} = \left(\frac{2C}{h} - G - BRM^T\right)\frac{\Delta\hat{x}_n}{\Delta s} - B\frac{V_H - V_L}{s_0^2}(t_n + t_{n+1} - 2k)$ (4.90)

Using Eqn. (4.90) it is possible to compute, in a semi-analytical way, the sensitivities of all state variables to the slew, $s$. Two sensitivities of interest are,

$\frac{\partial\hat{I}_m}{\partial s}\bigg|_{t_n} = M^T\frac{\Delta\hat{x}_n}{\Delta s}$ (4.91)

$\frac{\partial\hat{v}_m}{\partial s}\bigg|_{t_n} = \frac{\partial v}{\partial s}\bigg|_{t_n} - R\frac{\partial\hat{I}_m}{\partial s}\bigg|_{t_n}$ (4.92)

where $\frac{\partial v}{\partial s}\big|_{t_n}$ can be analytically derived from Eqn. (4.4). This semi-analytical approach to computing derivatives provides much better accuracy than computation by differences. An equation similar to Eqn. (4.89) can be derived for an expansion around a nominal resistance, $R_0$,

$C\frac{d}{dt}\Delta x + \left(G + BR_0M^T\right)\Delta x = -BM^T x\,\Delta R$ (4.93)

Discretizing with the Trapezoidal Method produces,

$\left(\frac{2C}{h} + G + BR_0M^T\right)\frac{\Delta\hat{x}_{n+1}}{\Delta R} = \left(\frac{2C}{h} - G - BR_0M^T\right)\frac{\Delta\hat{x}_n}{\Delta R} - BM^T(\hat{x}_n + \hat{x}_{n+1})$ (4.94)

The sensitivities for the resistance are thus obtained,

$\frac{\partial\hat{I}_m}{\partial R}\bigg|_{t_n} = M^T\frac{\Delta\hat{x}_n}{\Delta R}$ (4.95)

$\frac{\partial\hat{v}_m}{\partial R}\bigg|_{t_n} = \frac{\partial v}{\partial R}\bigg|_{t_n} - R_0\frac{\partial\hat{I}_m}{\partial R}\bigg|_{t_n}$ (4.96)

The numerical approximations of the derivatives necessary for evaluating Eqns. (4.21) and (4.22) are given by Eqns. (4.91), (4.92), (4.95) and (4.96), respectively. Numerical approximations of the current, $I_m$, can be computed from Eqns. (4.83) and (4.28). The cell delay and output slew can be computed by performing a lookup in the corresponding timing library tables, indexed by the effective capacitance, $C^*$, and by the value of the input slew, $s_i$.

4.4.3 Interconnect Delay Sensitivity

The discretization of Eqn. (4.56) leads, for each $k$, to

$\left(\frac{2C_0}{h} + G_0\right)\frac{\partial\hat{x}}{\partial\lambda_k}\bigg|_{t_{n+1}} = \left(\frac{2C_0}{h} - G_0\right)\frac{\partial\hat{x}}{\partial\lambda_k}\bigg|_{t_n} - \left[C_k\left(\frac{d\hat{x}}{dt}\bigg|_{t_n} + \frac{d\hat{x}}{dt}\bigg|_{t_{n+1}}\right) + G_k\left(\hat{x}_n + \hat{x}_{n+1}\right)\right]$ (4.97)

where the time derivatives of $x$, computed by differences, are given by

$\frac{d\hat{x}}{dt}\bigg|_{t_{n+1}} = \frac{\hat{x}_{n+2} - \hat{x}_n}{2h}$ (4.98)

Since prior knowledge of the nominal values, given by Eqn. (4.83), is necessary to compute the parameter sensitivities, given by Eqn. (4.97), both equations must be solved in parallel, in the same iterative cycle, or all the nominal waveform values would have to be stored. Clearly, inside each iteration, the value of $\hat{x}_{n+1}$, given by Eqn. (4.83), must be computed first, so that it can be used in Eqn. (4.97) to compute $\frac{\partial\hat{x}}{\partial\lambda_k}\big|_{t_{n+1}}$. Actually, there is an even stronger constraint, as the computation of $\frac{\partial\hat{x}}{\partial\lambda_k}\big|_{t_{n+1}}$ requires $\frac{d\hat{x}}{dt}\big|_{t_{n+1}}$. Since $\hat{x}_{n+2}$ is necessary to compute $\frac{d\hat{x}}{dt}\big|_{t_{n+1}}$, the computation of $\frac{\partial\hat{x}}{\partial\lambda_k}\big|_{t_{n+1}}$ must be delayed by one iteration, taking place in the same iteration as the computation of $\hat{x}_{n+2}$. Therefore, the nominal values and sensitivities computed in one iteration correspond to two consecutive time points, separated by the step, $h$. The values of the sensitivities of the state variables to the parameters, at the time points of interest, $t_L$, $t_T$ and $t_H$, can be computed by interpolation of the values produced by Eqn. (4.97). Given these sensitivity values, it is trivial to compute the corresponding sensitivities of delays and slews, as described in Section 4.2.

4.4.4 Cell Delay Sensitivity

The computation of delay and output slew sensitivities from Eqns. (4.81) and (4.82) is straightforward. The only possible difficulty arises from the computation of the partial derivative of the charge with respect to the parameters,

$\frac{\partial Q_m}{\partial\lambda} = \int_{t'_L}^{t'_H} \frac{\partial I_m}{\partial\lambda}\, dt$ (4.99)

Recalling Eqn. (4.28), then

$\frac{\partial Q_m}{\partial\lambda} = \int_{t'_L}^{t'_H} M^T\frac{\partial x}{\partial\lambda}\, dt$ (4.100)

The solution of Eqn. (4.97) returns a sequence of values, $\frac{\partial\hat{x}}{\partial\lambda}\big|_{t_n}$, which can be used to compute an approximation to the integral on the RHS, by numerical integration,

$\frac{\partial\hat{Q}_m}{\partial\lambda} = M^T \sum_{t_n \in [t'_L, t'_H]} \frac{\partial\hat{x}}{\partial\lambda}\bigg|_{t_n}$ (4.101)

4.5 Conclusions

This chapter details an analytical parametric delay calculation methodology suitable for use in a statistical static timing methodology, corner analysis, or any other methodology that consumes affine delay models. The proposed approach is based on a specific type of perturbation analysis, allowing for the analytical computation of the quantities needed for parametric delay propagation. It is also shown how perturbation analysis can be performed when only the standard cell delay table look-up models are available. The techniques proposed are robust and, for small variation ranges, should provide adequate approximation to transistor-level calculations, at a fraction of the computational cost. Furthermore, such techniques can be directly applied when cell characterization is based on either voltage or current source models.

5

Worst-Timing Corner

The development of novel techniques for the efficient generation of accurate parametric delay models is the topic discussed in the previous chapter. Several new timing analysis methodologies have recently been developed that make use of these parametric models for early prediction and detection of IC performance issues due to process variability. The most significant such example is statistical static timing analysis (SSTA), where parameters are treated as distributions rather than fixed numerical values. Several promising SSTA modeling techniques have been proposed [39, 7, 13, 70], some of which are already implemented in commercially available tools. Even though SSTA is slowly finding its way into industrial design flows, its adoption has been more limited than initially expected. The usage of SSTA information is not intuitive for most experienced designers, whose mindset is still tied to more traditional deterministic design methodologies. Further, for most companies it is very important to keep their design pipeline running steadily with existing design methodologies, which makes them somewhat reluctant to adopt radically different timing analysis approaches that could ultimately entail an overhaul of the design and verification flows [45]. SSTA requires complex parameter characterization, like multidimensional statistical distributions, and even though some foundries have recently started to provide information of that sort in a more consistent manner, that is not yet generalized. As a result, SSTA is mostly used as an aid in design optimization, while design sign-off is still performed resorting to traditional corner analysis techniques. The advent of new technology nodes can have a relevant impact in SSTA: it was reported [2] that at the 22nm node the Gaussian assumption, underlying most of the commercially available SSTA tools, no longer seems to hold, as measurement data from process parameter variations can no longer be accurately fit to Gaussian distributions.

Even though SSTA techniques have received the most attention in the literature, the parametric delay modeling technologies they advocate have much wider applicability. In particular, they can be used in reducing pessimism and automating well established timing verification methodologies. Conventional IC timing sign-off consists in verifying a design for a set of carefully selected combinations of process and operating parameter extremes, commonly referred to as corners, that are expected to cover the worst-case fabrication and operating scenarios. However, there is no established systematic methodology for picking such worst-case corners in a realistic manner, and this task usually relies on the experience of design and process engineers. Compounding the problem, for feature sizes in the nanometric scale, the number of parameters to be considered increases significantly. In an effort to overcome this clear limitation of established timing sign-off methodologies, this chapter proposes an efficient and automated methodology for computing the worst-timing corners of a digital integrated circuit, when parametric delay models are available [23, 24]. Specifically, we address the computation of worst-delay corners of combinational blocks and of worst-slack corners of sequential circuits. In this approach, parameters only need to be characterized by their respective value ranges, as opposed to SSTA, where they need to be characterized by detailed statistical distributions. The proposed methodology casts the computation of the worst-timing corners as a search problem, which provides an intellectual paradigm that is more general and useful than most previous approaches.

While it has become commonplace in the literature to argue for a shift away from corner-based analysis to a statistical methodology, there are important reasons to improve the efficiency of a corner-like methodology. First, such techniques are easily integrated within currently used design and verification paradigms, as they essentially represent a natural variation-aware extension. Second, they impose less stringent requirements on parameter characterization. It is usually easier to measure, specify, and more importantly, guarantee, a bound or a range than a full distribution. In some cases, for example variation bounded by on-line process control, it may be more realistic to specify a bound for a parameter than, for example, a distribution of some particular kind (e.g. Gaussian). Finally, efficient worst case analysis can be seen as a complementary technique to SSTA, by providing insight into specific unusual or undesired circuit operating conditions. This last setting is a primary motivator for our work.

Figure 5-1: Timing graph of a combinational block.

The organization of this chapter is the following. Section 5.1 precisely formulates the worst-delay corner problem. Section 5.2 discusses exhaustive approaches for its solution. Section 5.3 presents an efficient algorithm for quickly detecting the existence of monotonic parameters, whose effect can be easily handled. Afterwards, Section 5.4 proposes two branch-and-bound based algorithms for the efficient computation of the worst-delay corner. Section 5.5 extends such algorithmic approaches to the more general case of sequential circuits, by casting the computation of the worst-slack corners of a sequential circuit as an instance of the worst-delay corner problem. Finally, Section 5.6 closes the chapter with brief conclusions.

5.1 Worst-Delay Corner

The timing graph of a combinational circuit block with $n$ inputs and $m$ outputs is illustrated in Figure 5-1. Assuming that component delays, annotated on the edges, are affine functions of the process parameter variations, in the form of Eqn. (3.2), then any delay, $d_{i,j}(\Delta\lambda)$, from an input $i$ to an output $j$ can be accurately represented by a piecewise-affine function, in the form of Eqn. (3.12).

The worst-delay corner (WDC) problem consists in computing an assignment, $\Delta\lambda^*$, to the parameter variation vector, $\Delta\lambda$, that produces the worst delay, $d_{i,j}(\Delta\lambda)$, from any input $i = 1, \ldots, n$ to any output $j = 1, \ldots, m$. For the remainder of this work we assume

$\Delta\lambda \in [0, 1]^p$ (5.1)

This assumption simplifies the subsequent manipulation of affine delay functions without loss of generality. Any affine function can be easily normalized such that the variables (the parameter variations, in this case) lie within a given range.

In late mode, which targets the detection of slow paths, the worst delay is the largest delay. Assuming that $d^{late}_{i,j}(\Delta\lambda)$ is the piecewise-affine function of the delay in late mode from input $i$ to output $j$, then the WDC problem is formulated as

$\max_{\Delta\lambda}\left(\max_{j=1,\ldots,m}\left(\max_{i=1,\ldots,n} d^{late}_{i,j}(\Delta\lambda)\right)\right)$ (5.2)

Conversely, in early mode, targeting the detection of fast paths, the worst delay is the smallest one, therefore the WDC problem is formulated as

$\min_{\Delta\lambda}\left(\min_{j=1,\ldots,m}\left(\min_{i=1,\ldots,n} d^{early}_{i,j}(\Delta\lambda)\right)\right)$ (5.3)

As discussed in Section 3.2, since input/output delays are represented by piecewise-affine functions, which are convex, their largest, as well as their smallest value, is obtained by setting each parameter variation to one of its extreme values (corners). Therefore, this problem can be cast as a combinatorial optimization problem where, by searching in a finite but typically large set of elements, we want to optimize a given cost function. In this case the set of elements can be the set of all the $2^p$ possible parameter variation corners, and the cost function can be the delay. The major difficulty with this type of discrete problem, as opposed to continuous linear problems, is that we do not have any optimality conditions to verify if a given feasible solution is optimal or not. Therefore, in order to conclude that a feasible solution is optimal, its cost must be compared with the cost of all other feasible solutions.

This amounts to always exploring the entire solution space, either explicitly or implicitly, by a complete or partial enumeration of all the feasible solutions and their associated costs.
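For a single affine delay function the search collapses: over $\Delta\lambda \in [0, 1]^p$ the maximum is attained by pushing each $\Delta\lambda_k$ to the extreme matching the sign of its sensitivity, which is the closed-form corner selection embodied by Eqns. (3.7) and (3.8). The toy sketch below (our own names, synthetic coefficients) checks this against brute-force corner enumeration.

```python
import itertools

def affine_delay(d0, a, dlam):
    """Evaluate d(dlam) = d0 + sum_k a[k]*dlam[k]."""
    return d0 + sum(ak * dk for ak, dk in zip(a, dlam))

def worst_corner_late(a):
    """Late-mode worst corner of one affine delay over [0,1]^p:
    dlam_k = 1 where the sensitivity is positive, 0 otherwise."""
    return tuple(1 if ak > 0 else 0 for ak in a)

def worst_corner_exhaustive(d0, a):
    """Reference: enumerate all 2^p corners, Section 5.2 style."""
    return max(itertools.product((0, 1), repeat=len(a)),
               key=lambda dlam: affine_delay(d0, a, dlam))
```

The difficulty addressed in this chapter is that the circuit delay is the maximum of many such affine functions, so the per-function corners disagree and the combinatorial structure returns.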

5.2 Exhaustive Methods

The simplest exhaustive algorithm that can be conceived for computing the WDC consists in evaluating the delay of the circuit for each of the $2^p$ possible parameter variation assignments, and verifying which assignment produces the worst circuit delay. Such assignment clearly corresponds to the WDC. If a block-based timing analysis procedure is used, then the arrival times can be computed in linear time of the number of vertices. However, since such procedure must be executed for all the $2^p$ possible parameter variation assignments, the overall run-time complexity of the algorithm will be exponential in the number of parameters.

In the outlined approach the entire parameter variation space was traversed, searching for the assignment that would produce the worst circuit delay. However, another possible approach consists instead in performing such a search in the path space. Essentially, this corresponds to performing an exhaustive path-based timing analysis that, for each path, computes the corresponding affine delay function, by adding the delay functions of the edges along that path. Given the affine delay function of a path, its corresponding WDC can be trivially computed by direct application of Eqns. (3.7) and (3.8), or their min counterparts.

For each path, the procedure for computing the affine delay function and obtaining its WDC is linear in the number of parameters. However, since the number of paths can grow exponentially with the number of vertices, and this procedure must be applied to every single path, the overall procedure can have, in the worst case, an exponential run-time.

As can easily be concluded, both exhaustive methods exhibit exponential run-time complexity, either in the number of parameters or in the number of vertices. For very small circuits, or in situations where only a small number of parameters is of interest, they may constitute viable options. However, even average-size circuits will render both approaches impractical, due to the excessive run-time required for their successful completion.
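As a concrete illustration of the first exhaustive method, the sketch below (hypothetical code; the small graph, its affine edge delays, and the normalization ∆λ_k ∈ {0, 1} are illustrative assumptions, not taken from the dissertation) evaluates the longest-path delay at every one of the 2^p corners with a block-based topological traversal, and keeps the worst:

```python
from itertools import product

# Affine delay: d(dlam) = const + sum(coeffs[k] * dlam[k]); dlam[k] in {0, 1}.
# Edge list of a small hypothetical timing graph: (source, target, const, coeffs).
EDGES = [
    ("a", "e", 3.0, (1.0, 1.0)),
    ("b", "e", 2.0, (-1.0, 1.0)),
    ("e", "g", 2.0, (1.0, 1.0)),
    ("c", "f", 1.0, (1.0, -1.0)),
    ("d", "f", 1.0, (0.0, -1.0)),
    ("f", "g", 1.0, (1.0, 0.0)),
]
TOPO_ORDER = ["a", "b", "c", "d", "e", "f", "g"]  # a valid topological order

def delay_at(const, coeffs, dlam):
    return const + sum(c * x for c, x in zip(coeffs, dlam))

def longest_path_delay(dlam):
    """Block-based arrival-time computation for one fully specified corner."""
    arrival = {v: 0.0 for v in TOPO_ORDER}
    for v in TOPO_ORDER:
        for (s, t, const, coeffs) in EDGES:
            if t == v:
                arrival[v] = max(arrival[v], arrival[s] + delay_at(const, coeffs, dlam))
    return max(arrival.values())

def wdc_exhaustive(p):
    """Evaluate all 2**p corners; exponential in p, linear per corner."""
    return max(((longest_path_delay(c), c) for c in product((0, 1), repeat=p)),
               key=lambda wc: wc[0])

worst, corner = wdc_exhaustive(p=2)
```

For p parameters the loop performs 2^p block-based analyses, which is exactly the exponential cost described above.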

5.3 Static Pruning

The problem of computing the WDC of a circuit can be simplified if the values of a subset of the parameter variations are preset to one of their limits, thus reducing the dimension of

the problem.

Figure 5-2: Detection of monotonic delay parameters (figure omitted: a timing graph whose edges carry affine delays, with common sign patterns Γ propagated forward; entries where all signs agree yield static assignments, conflicting signs yield X).

The situations where such a reduction is possible can be understood by observing that for some circuit instances, or specific fabrication technologies, there may be parameters that monotonically produce the same effect in every circuit component, by consistently increasing or decreasing delay. When that happens, the non-zero sensitivities of the delays to that specific parameter variation exhibit the same sign for every circuit component. Early detection of such cases is beneficial, since those parameter variations can be preset to a fixed value, according to Eqns. (3.7) and (3.8), without requiring further analysis, yielding an immediate reduction in the size of the WDC problem. We call such assignments static assignments, as they can be performed in a pre-processing stage.

We proceed by proposing a simple O(|E|·p) algorithm for the detection of static assignments. This algorithm relies on the propagation of common delay sensitivity sign patterns through the timing graph, in a forward levelized fashion. For that purpose, every vertex v is annotated with a common delay sensitivity sign pattern vector, Γ_v, of size p. Each k-th element of Γ_v, designated by Γ_v^(k), can assume one of four possible values:

• +, meaning that the k-th parameter sensitivities of all the delays annotated in the edges contained in the fanin cone of v are positive;

• −, meaning that the k-th parameter sensitivities of all the delays annotated in the edges contained in the fanin cone of v are negative;

• X, meaning that the k-th parameter sensitivities of the delays annotated in the edges contained in the fanin cone of v have contradictory signs, being positive in some cases and negative in others;

• empty, meaning that the k-th parameter sensitivities of the delays annotated in the edges contained in the fanin cone of v are zero.

The procedure for computing the Γ_v vector in all vertices v of a timing graph is illustrated in Figure 5-2. We start by assigning to each primary input vertex a completely empty Γ vector.

Next, we perform a forward levelized traversal through all the vertices, starting in the primary inputs and ending in the primary outputs. For every vertex, v, we analyze the parameter sensitivity signs of the delays annotated in each of its incoming edges, as well as the Γ vector of their corresponding source vertices. When the same +/− sign is observed both in the delay sensitivities of every incoming edge and in the Γ vectors of their corresponding source vertices, that sign is marked in the corresponding position of Γ_v; otherwise that position is marked with an X. Empty entries are neutral in this analysis.

When, for a given primary output, an entry of the Γ vector is +/−, then, for every delay in the fanin cone of that output, the corresponding sensitivity assumes that same sign (or is 0).

When this condition is verified for every primary output, then for all component delays the corresponding sensitivity assumes the same sign (or is 0). In that case, the associated parameter variation can be permanently set to one of its extreme values, according to Eqns. (3.7) and (3.8).
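The sign-propagation procedure can be sketched as follows (a hypothetical graph and hypothetical sensitivity signs; the four Γ entry values are encoded as '+', '-', 'X', and None for empty):

```python
# Sign-pattern propagation for static-assignment detection (a sketch).
# Each edge carries a tuple of p sensitivity signs: +1, -1 or 0.

EDGES = {  # target -> list of (source, sign_pattern); hypothetical data
    "e": [("a", (1, 0)), ("b", (1, 0))],
    "f": [("c", (1, -1)), ("d", (0, -1))],
    "g": [("e", (1, 1)), ("f", (1, -1))],
}
TOPO_ORDER = ["a", "b", "c", "d", "e", "f", "g"]
PRIMARY_INPUTS = {"a", "b", "c", "d"}
PRIMARY_OUTPUTS = {"g"}
P = 2

def merge(entry, sign):
    """Combine a Gamma entry with an observed sign; empty (0/None) is neutral."""
    if sign == 0 or sign is None:
        return entry
    mark = "+" if sign == 1 else "-" if sign == -1 else sign
    if entry is None:
        return mark
    return entry if entry == mark else "X"   # contradictory signs -> X

def propagate():
    gamma = {v: [None] * P for v in TOPO_ORDER}  # empty vectors at the PIs
    for v in TOPO_ORDER:                         # forward levelized traversal
        if v in PRIMARY_INPUTS:
            continue
        for (s, signs) in EDGES[v]:
            for k in range(P):
                # fold in both the edge sensitivity sign and the source's entry
                gamma[v][k] = merge(merge(gamma[v][k], signs[k]), gamma[s][k])
    return gamma

gamma = propagate()
# A parameter is a static-assignment candidate when every primary output sees
# a consistent non-conflicting sign for it.
static = [k for k in range(P)
          if all(gamma[o][k] in ("+", "-", None) for o in PRIMARY_OUTPUTS)
          and any(gamma[o][k] in ("+", "-") for o in PRIMARY_OUTPUTS)]
```

In this hypothetical example, parameter 0 is monotone (Γ_g^(0) = +) and can be statically assigned, while parameter 1 has conflicting signs (Γ_g^(1) = X) and must remain free.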

5.4 Dynamic Pruning

This section proposes a methodology for computing the WDC using branch-and-bound techniques, which enable dynamic pruning of parts of the search space and therefore avoid an explicit enumeration of all the possible solutions. We start by briefly explaining the basic principles of branch-and-bound techniques and subsequently present path-space and parameter-space search algorithms based on them.

The definition of “worst” was not explicitly stated in the previous sections. In some contexts worst can mean the maximum value, while in other contexts worst can mean the minimum value. For the sake of clarity, and without loss of generality, we will proceed by assuming that the worst value is the maximum value¹. This will be an underlying assumption for all the algorithms and derivations presented in the remainder of this chapter.

5.4.1 Branch-and-Bound

Most combinatorial problems, including the one at hand, can only be solved by explicitly or implicitly evaluating a specific, nonlinear, cost function over the entire discrete solution space, in order to compute the solution that yields the optimal cost. Branch-and-bound [36] techniques focus on pruning useless regions of the solution space, thus avoiding the explicit evaluation of all the possible solutions contained therein. During the execution of the algorithm, the best known value for the cost function is maintained, corresponding to the cost of the best solution already found. If by some simple and quick procedure we are able to determine that the cost of all the solutions contained in a certain subspace is worse than the best known cost, then it is useless to explore that subspace, since no improvement on the cost function will be obtained. Therefore, that portion of the solution space can be pruned, and an explicit evaluation of all the solutions that it may contain is avoided. Even though in the worst case this approach can be as computationally expensive as exhaustive enumeration of the entire solution space, on average, for a wide range of applications, it has proven to perform significantly better.

Figure 5-3 illustrates the application of the branch-and-bound paradigm to the corners of a parameter space of dimension p = 3. Suppose that, given a cost function, we want to compute the corner where that function assumes its maximum value. Additionally, assume that it is cheap to compute an upper bound on the cost for a set of corners (i.e. an upper bound on the cost of every corner included in the set), but expensive to compute the exact cost of a specific corner². Assume also, in this example, that the exact cost of the white corner has been determined to be 5. Further, the upper bounds on the cost for the corners in the black and grey sets have been determined to be 4 and 10, respectively. Since the upper bound computed for the black set is smaller than the exact cost computed for the white corner, no corner in the black set can produce the maximum cost value, as the value of the white corner is already larger. Consequently, it is irrelevant to perform any further analysis on the corners of the black set, which can therefore be pruned. On the other hand, regarding the grey set, no conclusions can be drawn. This means that we must proceed with the detailed analysis of the cost of the corners in the grey set, or eventually proceed by computing more accurate cost upper bounds for subsets of the grey set, until all its corners are either pruned or analyzed.

Figure 5-3: Illustration of branch-and-bound on the corners of a parameter space (figure omitted: a p = 3 parameter cube over ∆λ1, ∆λ2, ∆λ3 with cost_white = 5, cost_black = 4, cost_grey = 10; the black set can be pruned).

¹For the algorithms that will be presented, when the worst value is the maximum value, we compute the maximum between values and upper bounds on sets of values. When the worst value is the minimum value, we compute the minimum between values and lower bounds on sets of values.
²There is an analogy with delays, since computing delay upper/lower bounds on the fanin cone of a node is much less expensive than computing the exact delay of every path contained in that fanin cone.
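For an affine cost function over normalized corners ∆λ_k ∈ {0, 1} (an assumption made for this sketch), the cheap upper bound required by the pruning test is indeed trivial to obtain: fix the assigned parameters and set each free parameter against the sign of its coefficient. The helper below, with hypothetical values, illustrates this:

```python
# Upper bound of an affine cost over a corner subset (a sketch; the cost
# coefficients are hypothetical). Cost: c(dlam) = const + sum(coeffs[k]*dlam[k])
# with dlam[k] in {0, 1}. A subset fixes some parameters; the rest are None.

def upper_bound(const, coeffs, partial):
    """Max of the affine cost over all corners consistent with `partial`."""
    total = const
    for c, x in zip(coeffs, partial):
        if x is None:            # free parameter: pick the maximizing extreme
            total += max(c, 0.0)
        else:                    # fixed parameter: use the assigned value
            total += c * x
    return total

# Cost 4 + 3*dlam1 - 2*dlam2 + dlam3 over the subset where dlam2 = 1:
bound = upper_bound(4.0, (3.0, -2.0, 1.0), (None, 1, None))
```

The same routine with a fully specified `partial` vector returns the exact cost of a single corner, so bounds and exact costs share one code path.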

5.4.2 Path Space Search

Both path-based and parameter-based exhaustive search algorithms described earlier can be improved by employing branch-and-bound techniques. To understand how this can be achieved, we start by introducing a path-based search algorithm that is able to efficiently compute the worst-delay corner, by finding one path where it occurs. Considering one primary output at a time, the algorithm performs an implicit search over all the complete paths that end at that output, which we will designate as the active primary output. The timing graph is traversed in a backward fashion, starting at the active primary output, going through the internal vertices, and eventually ending at the primary inputs (if no pruning is performed).

Figure 5-4: Illustration of delay estimates (figure omitted: the current vertex v, its fanin cone reaching back to the primary inputs, and the trail connecting v to the active primary output, annotated with the estimates d_v^in and d_v^out).

The vertex being explored in a given step is designated the current vertex. The path taken to reach that vertex from the active primary output is designated the trail. When reconvergent fanouts exist, the same vertex can be reached from the same primary output through distinct trails. The worst delay (i.e. maximum delay), w*, among the delays of the complete paths already analyzed, is continuously updated, as well as its originating parameter variation assignment, ∆λ*. For a given step, where the current vertex is v, the algorithm relies on three parametric delay estimates:

• d_v^in is an upper bound on the delay of all the partial paths that start at a primary input and end in v (i.e. in the fanin cone of v);

• d_v^out is the exact delay of the trail path, which starts in the current vertex, v, and ends in the active primary output;

• d_v^path = d_v^in + d_v^out, which represents an upper bound on the delay of all the complete paths going through v that include the trail.

The illustration of these estimates is presented in Figure 5-4. Note that, usually, as v gets closer to the primary inputs the upper bound given by d_v^in gets tighter. When v is a primary input, d_v^in = 0 and d_v^path = d_v^out is the exact delay of the trail, rather than a delay upper bound.

The rationale underlying the proposed algorithm is that if the worst delay among all the complete paths going through v and including the trail, max_∆λ[d_v^path], or an upper bound on such delay, is not larger than the worst delay already computed for some other complete path, w*, then it is useless to further explore the fanin cone of v, as the worst delay, w*, cannot be improved by such action.

1: function WDC-Path-BnB(G)
2:    w* ← 0                        ▷ Worst delay
3:    ∆λ* ← ∅                       ▷ Worst corner
4:    Initialize(G)
5:    for all v ∈ PO(G) do
6:        ⟨w, ∆λ⟩ ← Process-Vertex(G, v, w*, 0)
7:        if w > w* then
8:            ⟨w*, ∆λ*⟩ ← ⟨w, ∆λ⟩
9:        end if
10:   end for
11:   return ⟨w*, ∆λ*⟩
12: end function

Figure 5-5: Calculation of the d^in estimate (figure omitted: d_v^in = max(d_u^in + d_⟨u,v⟩, d_z^in + d_⟨z,v⟩), the maximum, over the incoming edges of v, of the source vertex estimate plus the edge delay).

The pseudocode of the algorithm is presented in function WDC-Path-BnB. It receives the timing graph, G, as the single argument and returns a tuple, ⟨w*, ∆λ*⟩, with the worst delay value and its originating parameter variation assignment, respectively.

The algorithm starts by invoking Initialize on the timing graph, G. This function performs a forward levelized breadth-first traversal of the timing graph, starting at the primary inputs and ending at the primary outputs. For each vertex v, it computes the parametric formula for the delay estimate d_v^in, which is an upper bound on the delay from any primary input to v. This formula, and delay upper bounds in general, is computed by performing a max operation over the sum of the delay of each incoming edge with the d^in estimate of the corresponding source vertex, as illustrated in Figure 5-5. The upper bounds can either be constant values, affine functions or piecewise-affine functions, depending on how the max function is implemented. See Section 3.2 for further details.
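One simple conservative way to implement that max — an illustrative assumption for this sketch, not the dissertation's prescription — is the coefficient-wise maximum of the competing affine functions, which over-approximates the true maximum at every corner because all normalized ∆λ_k ∈ {0, 1} are non-negative:

```python
# Forward computation of affine d_in upper bounds (a sketch; the graph data
# are hypothetical). Affine functions are (const, coeffs) pairs.

def affine_sum(f, g):
    return (f[0] + g[0], tuple(a + b for a, b in zip(f[1], g[1])))

def affine_max(f, g):
    # Coefficient-wise max: an affine UPPER BOUND of max(f, g) on {0,1}^p,
    # since every dlam[k] >= 0 makes each term monotone in its coefficient.
    return (max(f[0], g[0]), tuple(max(a, b) for a, b in zip(f[1], g[1])))

# target -> list of (source, affine edge delay); small hypothetical graph
EDGES = {
    "e": [("a", (3.0, (1.0, 1.0))), ("b", (2.0, (-1.0, 1.0)))],
    "g": [("e", (2.0, (1.0, 1.0)))],
}
TOPO_ORDER = ["a", "b", "e", "g"]
ZERO = (0.0, (0.0, 0.0))

d_in = {v: ZERO for v in TOPO_ORDER}
for v in TOPO_ORDER:                       # forward levelized traversal
    for s, d_edge in EDGES.get(v, []):
        d_in[v] = affine_max(d_in[v], affine_sum(d_in[s], d_edge))
```

Keeping the bound affine sacrifices tightness for a fixed-size representation; a piecewise-affine max would be exact but can grow with the number of reconvergent paths.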

1: function Process-Vertex(G, v, w*, d_v^out)
2:    d_v^in ← In-Delay-Estimate(v)
3:    d_v^path ← d_v^in + d_v^out
4:    ⟨w, ∆λ⟩ ← max_∆λ[d_v^path]
5:    if w ≤ w* then                ▷ Fanin cone gets pruned
6:        return ⟨w*, 0⟩
7:    else if v ∈ PI(G) then
8:        return ⟨w, ∆λ⟩            ▷ Worst delay is updated
9:    else
10:       for all e ∈ Incoming-Edges(v) do
11:           s ← Source-Vertex(e)  ▷ Get edge source vertex
12:           d_e ← Delay(e)
13:           d_s^out ← d_v^out + d_e
14:           ⟨w, ∆λ⟩ ← Process-Vertex(G, s, w*, d_s^out)
15:           if w > w* then
16:               ⟨w*, ∆λ*⟩ ← ⟨w, ∆λ⟩
17:           end if
18:       end for
19:       return ⟨w*, ∆λ*⟩
20:   end if
21: end function

After completing the initializations, the algorithm processes the primary outputs, one at a time. For every primary output it invokes the recursive function Process-Vertex, which performs a backward depth-first traversal of the timing graph towards the primary inputs.

In each step, a given vertex v is visited (i.e. deemed the current vertex), and one of its fanins is scheduled to be visited in the next step. Therefore, the current vertex v is always connected to the active primary output by the incomplete path used to reach v, which we have already designated as the trail. All the vertices along the trail were visited before v.

For a given vertex v, the exact delay of the trail, d_v^out, can be computed by adding the delays of all the edges in the trail. That computation is implicitly performed in Process-Vertex.

d_v^path, computed by adding d_v^in and d_v^out, is an upper bound on the delay of any path that contains v, starts at any primary input, and reaches the active primary output through the trail. d_v^path is an affine function of the parameter variations. The worst value of d_v^path, which we designate by w, and the corresponding vector of parameter variation assignments, which we designate by ∆λ, can be computed by applying Eqns. (3.7) and (3.8).

If the worst delay, w, is smaller than the largest (worst) known delay, w*, computed so far, that means that the worst-delay path cannot contain the trail, and therefore we stop the traversal at this vertex, and backtrack within the trail. If w is larger than w*, and v is a primary input, then there is a complete path with delay larger than the largest known delay computed so far, and therefore the largest known delay is updated, which corresponds to updating the value of w* with w. If we are not at a primary input, the delay estimate is just an upper bound, and therefore it cannot be used to update w*. We proceed until all the paths in the circuit are explicitly or implicitly explored. In the end, the largest known delay, w*, and the corresponding parameter variation assignments, ∆λ*, are the worst delay of the circuit and the WDC, respectively.

Figure 5-6: Execution of WDC-Path-BnB for a small timing graph (figure omitted: a timing graph with affine edge delays, where d_e^in = 3 + ∆λ1 + ∆λ2, d_f^in = 2 + ∆λ1 − ∆λ2 and d_g^in = 5 + 2∆λ1 + 2∆λ2, together with a table tracing, for each visited vertex v, its trail, d_v^path, w, ∆λ, w* and ∆λ*; the fanin cone of f is pruned).

Figure 5-6 illustrates the execution of the algorithm for a small timing graph. It should be noted that w* is only updated when vertex a is analyzed, because only then is the trail a complete path, and therefore d_v^path is the exact delay of that path, and not an upper bound. Further, the fanin cone of f is not analyzed because w ≤ w*. This corresponds to pruning a portion of the path space, namely the paths {⟨c, f⟩, ⟨f, g⟩} and {⟨d, f⟩, ⟨f, g⟩}.
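A compact Python rendering of WDC-Path-BnB and Process-Vertex is sketched below (the graph data are hypothetical, the d^in bounds are assumed precomputed by the Initialize step, and the tuple-returning bookkeeping of the pseudocode is simplified into an in-place accumulator):

```python
# Path-space branch-and-bound for the WDC (a sketch; hypothetical graph).
# Affine functions are (const, coeffs) pairs with dlam[k] in {0, 1}.

P = 2
IN_EDGES = {  # vertex -> list of (source, affine edge delay)
    "e": [("a", (3.0, (1.0, 1.0))), ("b", (2.0, (-1.0, 1.0)))],
    "f": [("c", (1.0, (1.0, -1.0))), ("d", (1.0, (0.0, -1.0)))],
    "g": [("e", (2.0, (1.0, 1.0))), ("f", (1.0, (1.0, 0.0)))],
}
PRIMARY_INPUTS = {"a", "b", "c", "d"}
PRIMARY_OUTPUTS = ["g"]
D_IN = {  # precomputed affine upper bounds (the Initialize step)
    "a": (0.0, (0.0, 0.0)), "b": (0.0, (0.0, 0.0)),
    "c": (0.0, (0.0, 0.0)), "d": (0.0, (0.0, 0.0)),
    "e": (3.0, (1.0, 1.0)), "f": (1.0, (1.0, 0.0)),
    "g": (5.0, (2.0, 2.0)),
}

def worst_of(affine):
    """Maximize an affine function over the 0/1 corners (Eqns. (3.7)-(3.8) style)."""
    const, coeffs = affine
    corner = tuple(1 if c > 0 else 0 for c in coeffs)
    return const + sum(c for c in coeffs if c > 0), corner

def process_vertex(v, w_star, d_out):
    d_path = (D_IN[v][0] + d_out[0],
              tuple(a + b for a, b in zip(D_IN[v][1], d_out[1])))
    w, corner = worst_of(d_path)
    if w <= w_star[0]:
        return                              # prune the fanin cone of v
    if v in PRIMARY_INPUTS:
        w_star[0], w_star[1] = w, corner    # exact complete-path delay
        return
    for s, (c, k) in IN_EDGES[v]:           # extend the trail backwards
        d_out_s = (d_out[0] + c, tuple(a + b for a, b in zip(d_out[1], k)))
        process_vertex(s, w_star, d_out_s)

def wdc_path_bnb():
    w_star = [0.0, None]
    for po in PRIMARY_OUTPUTS:
        process_vertex(po, w_star, (0.0, (0.0,) * P))
    return tuple(w_star)

worst, corner = wdc_path_bnb()
```

As in the text, any fanin cone whose d^path bound cannot beat the incumbent w* is skipped entirely, so whole bundles of paths are discarded without ever being enumerated.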

5.4.3 Parameter Space Search

The previous section proposed a branch-and-bound based algorithm for computing the WDC by exploring the path space and finding one path where it occurs. In this section we take a different approach, and propose another branch-and-bound based algorithm that also performs such computation, but by exploring the parameter space. The worst delay obtained for chosen unspecified, partially specified or completely specified parameter variation vectors is analyzed, and by using that information the algorithm is able to effectively prune regions of the parameter space, whenever possible.

Before proceeding, let us briefly discuss the meaning of unspecified, partially specified and completely specified parameter variation vectors, as well as their respective implications. In the previous section the computation of the delay upper bounds, given by d^in, was performed assuming that the values of ∆λ were not known. As a result, such upper bounds were given by affine functions of the unknown parameter variations. In that case the parameter variation vector, ∆λ, is considered to be unspecified. This situation is illustrated in Figure 5-7-(a), where ∆λ = ⟨X, X⟩, meaning that both ∆λ1 and ∆λ2 are unspecified (i.e. unknown). ∆λ is said to be partially specified when only part of the parameter variation vector has assigned values, while the other part is kept unspecified. Such a situation is illustrated in Figure 5-7-(b), where ∆λ = ⟨X, 0⟩. When values are assigned to all the elements of the parameter variation vector, ∆λ, it is said to be completely specified. This case is illustrated in Figure 5-7-(c), where ∆λ = ⟨0, 1⟩. As a result, the d^in value computed at the primary output is the exact delay of the longest path, for the given ∆λ assignment.

The general procedure for computing d^in for a given ∆λ consists in updating the delay functions with the ∆λ assignment, and subsequently performing a levelized breadth-first traversal to compute the d^in values, using the usual sum and max operations.
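The "updating the delay functions with the ∆λ assignment" step can be sketched as the specialization of an affine function, folding the assigned parameters into the constant term while keeping the unspecified ones symbolic (the helper below is hypothetical; the example coefficients are those of d_g^in = 5 + 2∆λ1 + 2∆λ2 from Figure 5-7):

```python
# Folding a partially specified assignment into an affine delay (a sketch).
# dlam entries: 0 or 1 when specified, None ('X') when unspecified. The
# result is an affine function of the remaining free parameters only.

def specialize(affine, dlam):
    const, coeffs = affine
    new_const = const
    new_coeffs = []
    for c, x in zip(coeffs, dlam):
        if x is None:
            new_coeffs.append(c)      # keep the parameter symbolic
        else:
            new_const += c * x        # fold the assigned value into the constant
            new_coeffs.append(0.0)
    return (new_const, tuple(new_coeffs))

# d_g_in = 5 + 2*dlam1 + 2*dlam2, specialized as in Figures 5-7-(b) and (c):
partial = specialize((5.0, (2.0, 2.0)), (None, 0))   # 5 + 2*dlam1
exact = specialize((5.0, (2.0, 2.0)), (0, 1))        # constant 7
```

A completely specified ∆λ leaves no free coefficients, so the result degenerates to the exact constant delay, matching the ∆λ = ⟨0, 1⟩ case of the figure.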

Going back to the WDC computation algorithm, let us introduce a motivating example.

Suppose that in the example of Figure 5-7 the value of d^in is computed for ∆λ = ⟨0, 1⟩. The result is d^in = 7, as indicated in Figure 5-7-(c). If we are concerned with computing the worst delay, this can be considered our starting point, w* = 7. Clearly, other solutions, corresponding to other values of ∆λ, are only worth exploring if they provide a worst delay that is larger than 7.

Figure 5-7: Computation of d^in for different values of ∆λ (figure omitted: the timing graph of Figure 5-6 annotated with d^in values for (a) ∆λ = ⟨X, X⟩, where d_g^in = 5 + 2∆λ1 + 2∆λ2; (b) ∆λ = ⟨X, 0⟩, where d_g^in = 5 + 2∆λ1; and (c) ∆λ = ⟨0, 1⟩, where d_g^in = 7).

Consider now the computation of d^in for ∆λ = ⟨X, 0⟩, as illustrated in Figure 5-7-(b). The result is d^in = 5 + 2∆λ1. Considering the worst parameter settings, given by Eqns. (3.7) and (3.8), the maximum delay value achievable by ∆λ = ⟨X, 0⟩ is 7. Since 7 is not larger than the value of 7 computed for ∆λ = ⟨0, 1⟩, the parameter subspace defined by ∆λ = ⟨X, 0⟩ can be pruned. This corresponds to the elimination of two corners: ∆λ = ⟨0, 0⟩ and ∆λ = ⟨1, 0⟩. This simple example illustrates the parameter space based branch-and-bound algorithm proposed for the computation of the WDC. The algorithm essentially probes subspaces of the solution space and directs the search accordingly. Since the graph topology is not relevant for the algorithm, we can consider that the timing graph is just used as a black-box³ that enables the computation of a d^in value, given an unspecified, partially specified or completely specified parameter variation vector, ∆λ.

The pseudocode for the proposed algorithm is presented in function WDC-Parameter-BnB, which receives and returns the same information as the previously studied WDC-Path-BnB. As illustrated in the previous example, the algorithm tries to prune regions of the parameter variation space by analyzing the worst delay produced by particular assignments of the parameter variation vector (probing assignments). It is therefore necessary to keep track of all the partially and completely specified assignments already analyzed. For that purpose we will use a decision tree, usually implemented by a binary search tree.

Each node in the decision tree represents one element of the parameter variation vector and can have at most a left and a right child. Each child is a subtree. The left child represents a partially or completely specified assignment of the parameter variation vector, where the corresponding element assumes the value 1. For the right child this value is 0. The leaves of the tree are the delay estimates computed considering the parameter variation vector assignments defined by the path from the root node to the leaf node. Therefore, assuming that the root node is at level 1, a leaf at level (p+1) corresponds to a completely specified assignment, and therefore it produces an exact worst delay. On the other hand, a leaf at level k ≤ p corresponds to a partially specified assignment, and therefore it produces an upper bound on the worst delay.

³This can be particularly useful in the case of IP blocks, for which a detailed description may not be available, but a timing model, able to perform the required delay bound computations, may be.

1: function WDC-Parameter-BnB(G)
2:    w* ← 0                        ▷ Worst delay
3:    ∆λ* ← ⟨⟩                      ▷ Worst corner
4:    T ← DT-Init()
5:    while ∆λ ← Decide(T) do
6:        DT-Register-Decision(T, ∆λ)
7:        w ← Worst-Delay(G, ∆λ)
8:        if w ≤ w* then
9:            DT-Register-Prune(T, ∆λ)
10:       else if Is-Complete(∆λ) then
11:           ⟨w*, ∆λ*⟩ ← ⟨w, ∆λ⟩
12:       end if
13:   end while
14:   return ⟨w*, ∆λ*⟩
15: end function

The algorithm starts by calling DT-Init, which initializes the data structures of the decision tree. Afterwards, it enters a cycle where, in each iteration, the function Decide, based on the current state of the decision tree and the regions of the parameter variation space that remain to be explored, produces a partial or complete assignment for the parameter variation vector, ∆λ. This assignment is then annotated in the decision tree by invoking DT-Register-Decision. Subsequently, the worst delay estimate for this assignment is computed by Worst-Delay, and stored in w. If the worst delay estimate, w, is not larger than the worst known delay estimate achieved so far, w*, then any assignment contained in this partial assignment will not provide an improvement over w* and can therefore simply be ignored. In this case, DT-Register-Prune is invoked to insert a marker in the decision tree that will prevent Decide from further exploring this region of the parameter variation space. No further expansions will be performed beyond this node, effectively pruning the corresponding subtree from consideration. If the worst delay estimate is larger than the worst known delay estimate computed so far, and the assignment to ∆λ is completely specified, this means that such an assignment improves the largest known delay estimate, and therefore w* and ∆λ* are updated accordingly. If the worst delay estimate is larger than the worst known delay estimate computed so far, but the assignment to ∆λ is only partially specified, no conclusion can be drawn, since the worst delay estimate obtained for a partially specified assignment is just an upper bound, whose value will eventually get smaller as new elements of the parameter variation vector are assigned (specified).

Figure 5-8: Execution of WDC-Parameter-BnB (figure omitted: four snapshots of the decision tree over ∆λ2 and ∆λ1, showing the complete probe ∆λ = ⟨1, 1⟩ that sets w* = 7, the complete probe ∆λ = ⟨0, 1⟩ yielding 7, and the partial assignment ∆λ = ⟨X, 0⟩ yielding the upper bound 5 + 2∆λ1, whose subtree is pruned).

The algorithm proceeds until all the regions of the parameter space (all possible parameter variation vector assignments) are either explicitly explored or pruned.

Figure 5-8 illustrates the decision tree produced by the execution of the algorithm for the timing graph of Figure 5-7. In steps (1) and (2) we generate a complete parameter variation assignment, ∆λ = ⟨1, 1⟩, in order to obtain the first estimate for w*, which is 7. In step (3) we analyze the complete assignment ∆λ = ⟨0, 1⟩ and conclude that it produces a delay of 7, which is not larger than the current w* = 7. In step (4) we analyze the partial assignment ∆λ = ⟨X, 0⟩ and conclude that it produces a delay of 5 + 2∆λ1, which in the worst case assumes the value 7. Since this delay upper bound is not larger than the worst known delay found so far, we can discard (i.e. prune) the remaining subtree. This effectively prunes part of the parameter variation space, more precisely the corners ∆λ = ⟨0, 0⟩ and ∆λ = ⟨1, 0⟩. At this point, all the parameter variation space has been explored, and the final solution is w* = 7 and ∆λ* = ⟨1, 1⟩.
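A minimal rendering of WDC-Parameter-BnB is sketched below (the decision tree is replaced by plain recursion, and the black-box worst-delay query is a single hypothetical affine bound rather than the full timing-graph traversal the dissertation assumes):

```python
# Parameter-space branch-and-bound for the WDC (a sketch). The timing graph
# is used as a black box through worst_delay(dlam), which returns the worst
# delay achievable under a partially specified assignment (None = 'X').

# Hypothetical black box: the single affine bound 5 + 2*dlam1 + 2*dlam2.
def worst_delay(dlam, const=5.0, coeffs=(2.0, 2.0)):
    total = const
    for c, x in zip(coeffs, dlam):
        total += max(c, 0.0) if x is None else c * x
    return total

def wdc_parameter_bnb(p):
    best = [float("-inf"), None]

    def explore(dlam):
        w = worst_delay(tuple(dlam))
        if w <= best[0]:
            return                              # prune this whole subspace
        try:
            k = dlam.index(None)                # next undecided parameter
        except ValueError:
            best[0], best[1] = w, tuple(dlam)   # completely specified: exact
            return
        for value in (1, 0):                    # decision heuristic: try 1 first
            dlam[k] = value
            explore(dlam)
            dlam[k] = None                      # undo the decision (backtrack)

    explore([None] * p)
    return tuple(best)

worst, corner = wdc_parameter_bnb(p=2)
```

With this simplified black box the search behaves like the text describes: once a complete probe establishes an incumbent, any partial assignment whose upper bound cannot beat it eliminates all the corners it contains without visiting them.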

5.4.4 Decision Heuristics

Both algorithms, WDC-Path-BnB and WDC-Parameter-BnB, perform the exploration of a specific solution space. The order by which the elements (solutions) of each space are explored, though irrelevant for the completeness or correctness of the algorithms, is extremely relevant for their performance. Exploring first the solutions that have a better cost function value (i.e. worse delay) can potentially lead to early pruning of regions of the solution space that would otherwise be explicitly analyzed. In WDC-Path-BnB the order by which the paths are explored is determined by the ordering of the primary outputs in PO(G), and also by the ordering of the edges returned by Incoming-Edges. In WDC-Parameter-BnB the order by which the parameter variation assignments are explored is determined by Decide. Several decision heuristics can be developed for optimizing the order by which the solutions are explored, specifically targeting particular problem structures.

5.5 Worst-Slack Corner

From a theoretical standpoint, the automated computation of the WDC of a combinational circuit, for which several algorithms were discussed and developed in the previous sections, is an interesting and extremely important new addition to the state-of-the-art of variation-aware timing analysis techniques. However, its direct applicability to real designs is reduced, since most modern digital integrated circuits of relevant size and function are of a sequential nature, not combinational. Furthermore, most often timing analysis is concerned with verifying that a given circuit meets specific timing constraints rather than just computing its worst delay. Nevertheless, as will become clear, the techniques developed in the previous sections constitute the foundations for the solution of more complex timing analysis problems, that will be addressed in this section.

Timing constraints, the blueprint of any timing analysis problem, induce slacks, which provide a quantification of their tightness. In a parameter variability context, slacks are affine functions of process parameters, and therefore their value, and the tightness of their originating timing constraints, varies from corner to corner. This section is concerned with

the automated computation of worst-slack corners, for different types of constraints, present in real timing analysis problems. The outcome of such computation is the corner or set of corners for which a given timing constraint can be critical and thus directly impact the correct operation of the circuit.

Figure 5-9: Setup and hold in a sequential circuit (figure omitted: a combinational block between an injecting register of n flip-flops, with clock latencies l_i^in, and a capturing register of m flip-flops, with clock latencies l_j^out; the capturing clock edge arrives at time T + l_j^out, with the setup and hold windows, t_setup and t_hold, around it).

5.5.1 Sequential Timing Constraints

Sequential circuits consist of combinational blocks interleaved by registers, usually implemented with flip-flops, as illustrated in Figure 5-9. Typically they are composed of several stages, where a register captures data from the primary outputs of a combinational block and injects it into the primary inputs of the combinational block in the next stage. Register operation is synchronized by clock signals generated by one or multiple clock sources.

Clock signals that reach distinct flip-flops (sinks in the clock tree) are delayed from the clock source by a given clock latency. Within a clock period T, we assume that data is injected into a combinational block by a register of n flip-flops with parametric clock latencies l_1^in(∆λ), . . . , l_n^in(∆λ), respectively, and captured by a register of m flip-flops with parametric clock latencies l_1^out(∆λ), . . . , l_m^out(∆λ), respectively. If the clock network is a tree, which is a common situation, then large portions of the net are shared among multiple paths. In this case, it is feasible to use a very accurate method (even perform electrical level simulation) to compute good estimates of the clock latencies.

5.5.2 Setup Time and Late Mode

Proper operation of a flip-flop requires that the input data line be stable for a specific period of time before the capturing clock edge. This period of time is designated the setup time, and we will represent it by t_setup. Setup times are one of the standard performance figures provided in cell specification libraries (e.g. Liberty [65]). Let us consider the setting depicted in Figure 5-9 and assume that a flip-flop, with clock latency l_i^in, connected to the i-th primary input of the combinational block, is injecting data, and another flip-flop, with clock latency l_j^out, connected to the j-th primary output of the combinational block, is capturing the result. Assuming that the clock edge is generated in the clock source at time 0, then it will reach the injecting flip-flop at time l_i^in, making the data available at the primary input of the combinational block. If the propagation delay in the combinational block in late mode (i.e. considering that the output of a cell is changed by the last input that changed), from the i-th primary input to the j-th primary output, is d_{i,j}^late, then the results will be available in the output at most at time l_i^in + d_{i,j}^late. The next clock edge will reach the capturing flip-flop at time T + l_j^out. For a correct operation, the results must be available at the j-th primary output of the combinational block t_setup time before the next clock edge reaches the capturing flip-flop. Therefore, the setup time in the capturing flip-flop is observed only if the following condition holds,

$$l_i^{in} + d_{i,j}^{late} \le T + l_j^{out} - t_{setup} \qquad (5.4)$$

This condition must hold for every ⟨i, j⟩ input/output flip-flop pair. For a given output flip-flop j this set of constraints can be compactly written as

$$\max_{i=1,\dots,n} \left( l_i^{in} + d_{i,j}^{late} \right) \le T + l_j^{out} - t_{setup} \qquad (5.5)$$

This expression induces a slack, s_j^setup, defined as,

$$s_j^{setup} = T + l_j^{out} - t_{setup} - \max_{i=1,\dots,n} \left( l_i^{in} + d_{i,j}^{late} \right) \qquad (5.6)$$

that is non-negative when the conditions are met and negative otherwise. The worst-slack corner for s_j^setup is the corner where its value is minimized, formally

$$\min_{\Delta\lambda}\left(s_j^{setup}\right) = -\max_{\Delta\lambda}\left(-s_j^{setup}\right) \qquad (5.7)$$

Figure 5-10: Modeling setup constraints in the timing graph (figure omitted: the timing graph of a combinational block, with an edge of delay l_i^in added at each input and an edge of delay −T − l_j^out + t_setup added at each output, combined with a max in late mode).

Ignoring the sign and expanding s_j^setup we obtain

$$\max_{\Delta\lambda}\left(-s_j^{setup}\right) = \max_{\Delta\lambda}\left[ \max_{i=1,\dots,n}\left( l_i^{in} + d_{i,j}^{late} \right) - T - l_j^{out} + t_{setup} \right] \qquad (5.8)$$

The corner, ∆λ*, that maximizes the value of −s_j^setup among all outputs j = 1, . . . , m is given by

$$\max_{\Delta\lambda} \max_{j=1,\dots,m}\left[ \max_{i=1,\dots,n}\left( l_i^{in} + d_{i,j}^{late} \right) - T - l_j^{out} + t_{setup} \right] \qquad (5.9)$$

Comparing Eqns. (5.2) and (5.9) we can easily detect that they exhibit a similar structure.

For building Eqn. (5.9), having Eqn. (5.2) as a starting point, we only need to add the clock latency, $l_i^{in}$, inside the max in $i$, corresponding to the inputs, and subtract the required arrival time, $T + l_j^{out} - t_{setup}$, inside the max in $j$, corresponding to the outputs. Therefore, it can be concluded that the worst setup slack corner problem can be cast as an instance of the WDC problem, if the original timing graph of the combinational block is modified by adding edges with the input clock latency and required arrival time, as illustrated in Figure 5-10.
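To make the computation concrete, the slack of Eqn. (5.6) can be evaluated directly once the input latencies, late-mode delays, and setup parameters are known. The following sketch uses plain floating-point delays (rather than the affine parametric delays of Chapter 3) and hypothetical names:

```python
def setup_slack(l_in, d_late_j, T, l_out_j, t_setup):
    """Setup slack of output flip-flop j, per Eqn. (5.6):
    s_j = T + l_out_j - t_setup - max_i(l_in[i] + d_late_j[i])."""
    worst_arrival = max(li + dij for li, dij in zip(l_in, d_late_j))
    return T + l_out_j - t_setup - worst_arrival

# Two inputs feeding output j; a non-negative result means Eqn. (5.4) holds.
s = setup_slack(l_in=[0.1, 0.2], d_late_j=[1.0, 0.8],
                T=2.0, l_out_j=0.15, t_setup=0.05)
```

A negative return value flags a setup violation for that flip-flop at the analyzed corner.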

5.5.3 Hold Time and Early Mode

For a correct operation of a flip-flop, the input data line must be stable for a particular period of time after the capturing clock edge. This period of time is designated by hold time, and we will represent it by $t_{hold}$. Like setup times, hold times are also described in cell specification libraries. Assuming the same connectivity as before and the delay in early mode to be $d_{i,j}^{early}$, the hold time in the capturing flip-flop is observed only if the following condition holds (see Figure 5-9 for an illustration)

$$l_i^{in} + d_{i,j}^{early} \ge l_j^{out} + t_{hold} \quad (5.10)$$


Figure 5-11: Modeling hold constraints in the timing graph.

As before, this condition must hold for every $\langle i, j \rangle$ input/output flip-flop pair. For a given output flip-flop $j$ this set of constraints can be compactly written as

$$\min_{i=1,\dots,n} \left( l_i^{in} + d_{i,j}^{early} \right) \ge l_j^{out} + t_{hold} \quad (5.11)$$

This expression induces a slack, $s_j^{hold}$, defined as

$$s_j^{hold} = \min_{i=1,\dots,n} \left( l_i^{in} + d_{i,j}^{early} \right) - l_j^{out} - t_{hold} \quad (5.12)$$

that is non-negative when the conditions are met and negative otherwise.

Following the same steps as before, we obtain the worst-slack corner for $s_j^{hold}$,

$$\min_{\Delta\lambda} \left( s_j^{hold} \right) = \min_{\Delta\lambda} \left[ \min_{i=1,\dots,n} \left( l_i^{in} + d_{i,j}^{early} \right) - l_j^{out} - t_{hold} \right] \quad (5.13)$$

The corner, $\Delta\lambda^*$, that minimizes the value of $s_j^{hold}$ among all outputs $j = 1, \dots, m$ is given by

$$\min_{\Delta\lambda} \left[ \min_{j=1,\dots,m} \left[ \min_{i=1,\dots,n} \left( l_i^{in} + d_{i,j}^{early} \right) - l_j^{out} - t_{hold} \right] \right] \quad (5.14)$$

As before, by comparing Eqns. (5.14) and (5.3) we can easily conclude that the worst hold slack corner problem can be cast to the WDC problem, formulated in Section 5.1, if the original timing graph of the combinational block is modified as illustrated in Figure 5-11.

5.5.4 Multi-Cycle Paths

Multi-cycle paths are paths between registers where the output register captures the result more than one clock cycle after the input register has injected the data. For a multi-cycle path with cycle $c \in \mathbb{N}$ ($c > 1$), Eqns. (5.4) and (5.10) can be rewritten as

$$l_i^{in} + d_{i,j}^{late} \le c \times T + l_j^{out} - t_{setup} \quad (5.15)$$

$$l_i^{in} + d_{i,j}^{early} \ge (c - 1) \times T + l_j^{out} + t_{hold} \quad (5.16)$$

The cycle $c$ is usually a user-specified timing constraint. Again, quick examination of

Eqns. (5.15) and (5.16) indicates that this problem is again an instance of the WDC problem in a slightly modified setting.
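As a quick sanity check of these conditions, the two inequalities can be tested for a single input/output pair at a fixed corner. This is a minimal sketch with hypothetical names and plain numeric delays:

```python
def multicycle_ok(l_in_i, d_late, d_early, T, l_out_j, t_setup, t_hold, c):
    """Check Eqn. (5.15) (setup) and Eqn. (5.16) (hold) for a c-cycle path."""
    assert c > 1, "multi-cycle paths have c > 1"
    setup_ok = l_in_i + d_late <= c * T + l_out_j - t_setup       # Eqn. (5.15)
    hold_ok = l_in_i + d_early >= (c - 1) * T + l_out_j + t_hold  # Eqn. (5.16)
    return setup_ok and hold_ok

ok = multicycle_ok(l_in_i=0.1, d_late=3.0, d_early=2.5, T=2.0,
                   l_out_j=0.1, t_setup=0.05, t_hold=0.02, c=2)
```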

5.5.5 Transparent Latches

Similarly to flip-flops, transparent latches are sequential storage elements. However, transparent latches have a latch enable input, rather than a clock input. When latch enable is active, the latch output is permanently updated with the value of the input. When latch enable is inactive, the output is kept unchanged. The fundamental difference to a flip-flop is that while in the case of a flip-flop the input is captured in a clock edge, in the case of a transparent latch the input is captured for a given (active) value of the latch enable input.

Usually, the latch enable input is driven by the clock signal.

The same setup and hold constraints, previously studied, are valid for transparent latches, and can be modeled by resorting to the same techniques described earlier in this section. In the case of transparent latches, such constraints must be verified around the time instant where the latch enable input changes from active to inactive. During the period of latch transparency, the transparent latch behaves like a combinational circuit element, and therefore can be analyzed using the techniques discussed in Sections 5.4.2 and 5.4.3. As a result of the latch transparency, some paths may become multi-cycle paths that, as we have seen, can also be analyzed within the proposed framework. We can therefore conclude that the analysis of transparent latches can be entirely performed by resorting to the same techniques that were previously discussed.

5.6 Conclusions

This chapter discusses the exact computation of the worst-delay corner of combinational ICs and the worst-slack corner of sequential ICs, when cell and interconnect delays are characterized by affine functions of process and operational parameters, as described in Chapter 3. Two efficient algorithmic approaches are proposed, that explore the path space and the parameter space, respectively. The efficiency of such approaches results from the use of branch-and-bound techniques that enable an effective pruning of the search space. Even though for extremely hard instances the proposed algorithms may exhibit exponential runtime complexity, for typical ICs they perform significantly better than that. The proposed approaches cast the computation of the worst-delay and worst-slack corners as a search problem, thus providing an intellectual paradigm that is more general and useful than most previous approaches.

6 Applications and Extensions

This chapter discusses several problems of practical relevance in timing analysis and optimization and demonstrates that they can be addressed in the framework of the formulation and techniques proposed in the previous chapter. The list of such problems will likely be enlarged in the future, as more problems are tackled. This fact alone is a testimony of the flexibility and generality of the procedures discussed earlier, and a good indicator of their potential for solving practical problems in timing verification and optimization. We start by describing a generic setting under which most such problems can be formulated, and then proceed to show how their solution can be addressed.

The outline of this chapter is as follows. Section 6.1 introduces the concept of augmented timing graph, which will be used in subsequent applications. Section 6.2 details how to compute the worst-slack corner for a single register. Section 6.3 details how to compute the worst-slack corner over all registers. Section 6.4 describes the computation of the minimum clock period of a sequential circuit, and the corresponding corner. Section 6.5 explains how to efficiently detect the existence of slack violations in a design. Section 6.6 explains how the methods developed in the previous chapter can be applied in the analysis of clock trees. Section 6.7 details the procedure for computing the worst k paths of a circuit, and the corresponding corners. Finally, Section 6.8 presents some concluding remarks.


Figure 6-1: Augmented timing graph.

6.1 Augmented Timing Graph

In the usual timing verification flow, the generation of the timing graph precedes the timing analysis procedure. Pins in the circuit are mapped to vertices in a timing graph, and cell and interconnect paths are mapped to edges between them. Each edge is annotated with the relevant information, including the corresponding set of rise × fall delays. Most often these delays are computed in early and late mode, so as to subsequently allow for different analysis types, targeting distinct objectives. As previously stated, it is assumed that delays

(and latencies) are given by affine functions of process parameters, as in Eqn. (3.2). For the sake of clarity, and without loss of generality, it will be assumed that registers are made up of D flip-flops, with asynchronous set and reset.

The generation of the timing graph for a sequential circuit, targeting the applications at hand, will follow the same usual guidelines described in the previous paragraph, and therefore can be done using existing delay calculation engines. However, by augmenting the timing graph with a small amount of additional information we can enable additional types of analyses to be performed, while at the same time greatly simplifying certain procedures that will be described in the following. As illustrated in Figure 6-1, the timing graph will be augmented by adding two virtual vertices: a source and a sink. Additionally, the clock pin of each flip-flop will be mapped into two vertices in the augmented timing graph, instead of one. For every flip-flop, one of these two vertices will connect to the source and the other to the sink, with outgoing and incoming edges, as illustrated. These edges will be annotated with the corresponding clock latencies, with positive or negative sign. Finally, we introduce an additional constraint edge, between the vertices that map the D pin and the CK pin of a flip-flop. The late and early mode delays of these edges are given by,

$$d_c^{late} = -T + t_{setup} \quad (6.1)$$

$$d_c^{early} = -t_{hold} \quad (6.2)$$

This augmented timing graph incorporates essentially the same information contained in the timing graphs that model setup and hold constraints, illustrated in Figures 5-10 and 5-11.

It should be noted that the additional information present in the augmented timing graph does not prevent its use by traditional timing analysis engines, as it can simply be ignored.

However, this information ensures a more efficient implementation of subsequent corner-based analysis procedures, targeting the identification of setup and hold violations.
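A possible in-memory representation of this augmentation is sketched below. The edge-list layout, vertex names, and the flip-flop record are hypothetical; the constraint-edge delays follow Eqns. (6.1) and (6.2), and the latency edges carry the (signed) clock latencies:

```python
def augment_timing_graph(edges, flops, T):
    """Build an augmented timing graph in the spirit of Figure 6-1.
    edges: list of (u, v, d_late, d_early) for the combinational part.
    flops: list of (ck_in, ck_out, l_in, l_out, d, t_setup, t_hold), where
    ck_in/ck_out are the two vertices the CK pin is mapped to, l_in/l_out
    are the injecting/capturing clock latencies, and d is the D-pin vertex."""
    aug = list(edges)
    for ck_in, ck_out, l_in, l_out, d, t_setup, t_hold in flops:
        aug.append(("source", ck_in, l_in, l_in))       # clock latency edge
        aug.append((ck_out, "sink", -l_out, -l_out))    # negative latency edge
        # Constraint edge between the D vertex and the CK vertex:
        aug.append((d, ck_out, -T + t_setup, -t_hold))  # Eqns. (6.1), (6.2)
    return aug

aug = augment_timing_graph(
    [("a", "b", 1.0, 0.8)],
    [("ck1_in", "ck1_out", 0.1, 0.2, "d1", 0.05, 0.02)],
    T=2.0)
```

Traditional analysis engines can simply skip the virtual source/sink vertices and the constraint edges, as noted above.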

6.2 Worst-Slack Corner of a Single Register

The computation of the worst-slack corner of a given flip-flop can be performed by applying the corner-finding algorithm presented in Section 5.4.2 to the augmented timing graph, starting in the new sink vertex (as if it was a primary output), but only following the incoming edge that connects to the vertex that maps the CK pin of the target flip-flop. The algorithm will essentially perform a branch-and-bound search in the augmented timing graph, through the fanin cone of the flip-flop, until it reaches either a primary input or the source vertex. The result will be the worst-slack corner for that flip-flop. The setup and hold worst slacks can be computed by running the algorithm in late or early mode, respectively.

The worst-slack corner of a register can be computed in a similar way, but only allowing the corner-finding algorithm to follow the incoming edges of the sink vertex that connect to vertices that map CK pins of flip-flops belonging to that particular register.

6.3 Worst-Slack Corner Over All Registers

The computation of the worst-slack corner over all the registers is an important problem in the optimization of the timing behavior of a circuit. It can obviously be done by applying the algorithm described in Section 6.2 for each register in sequence. However, there is a better and simpler way to compute the same information: applying the corner-finding algorithm presented in Section 5.4.2 to the augmented timing graph, as described in Section 6.1, starting at the sink vertex, but not restricting the incoming edges that can be visited. By considering all the incoming edges that are connected to the sink vertex, the algorithm will compute the worst-slack corner over all flip-flops, and therefore over all registers. This procedure is more efficient, since the joint analysis of all paths pertaining to all registers simultaneously will lead to pruning that is effective for all register computations.

6.4 Minimum Clock Period

Assuming the model discussed in Chapter 5, the minimum clock period (maximum frequency) of a sequential circuit is limited by the setup time constraints in every flip-flop. Larger periods allow more time for signals to propagate through the combinational blocks, and therefore produce larger setup slacks. The minimum clock period must produce a non-negative slack in every flip-flop. This means that it should produce, at most, a slack of 0 in the flip-flop where the worst slack is detected. Therefore, the minimum clock period will be given by the value of Eqn. (5.9), when T = 0. Formally,

$$T_{min} = \max_{\Delta\lambda} \left[ \max_{j=1,\dots,m} \left[ \max_{i=1,\dots,n} \left( l_i^{in} + d_{i,j}^{late} \right) - l_j^{out} + t_{setup} \right] \right] \quad (6.3)$$

As can be understood from Eqn. (5.7), the value computed by Eqn. (5.9) is $-\min_{\Delta\lambda}(s_j^{setup})$. Since in any sound design, when T = 0, the slack should be negative, the value computed by Eqn. (6.3) should be positive. $T_{min}$ can be computed by the same procedure that computes the worst-slack corner for setup over all registers, but setting T = 0 in $d_c^{late}$.
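At a single fixed corner (plain numeric delays; the outer max over $\Delta\lambda$ of Eqn. (6.3) is omitted), the minimum period reduces to a double max over outputs and inputs. A small sketch with hypothetical names:

```python
def t_min_at_corner(l_in, l_out, d_late, t_setup):
    """Eqn. (6.3) evaluated at one corner: d_late[j][i] is the late-mode
    delay from input i to output j; l_in/l_out are clock latencies."""
    return max(
        max(l_in[i] + d_late[j][i] for i in range(len(l_in)))
        - l_out[j] + t_setup
        for j in range(len(l_out))
    )

# One output fed by two inputs: Tmin = max(0.1+1.0, 0.2+0.8) - 0.1 + 0.05
T_min = t_min_at_corner(l_in=[0.1, 0.2], l_out=[0.1],
                        d_late=[[1.0, 0.8]], t_setup=0.05)
```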

6.5 Slack Violations

Slack violations occur when the worst-slack corner of a flip-flop produces a negative slack.

Under these conditions, the correct operation of the circuit may be compromised. Most often, designers are only concerned with detecting the existence of such violations, rather than computing worst slack conditions.

The detection of a slack violation in a flip-flop, and the computation of its corresponding corner/value, can be performed by the same procedure that computes the worst-slack corner of a flip-flop. However, the pruning threshold will be the worst violating slack found so far, rather than the worst slack found so far. This is a less strict pruning condition and can therefore, in cases where several violations occur, potentially increase the amount of pruning, resulting in a more efficient procedure. For computing all the flip-flops where slack violations occur, as well as their worst-case values, the same procedure can be executed repeatedly for every flip-flop. This corresponds to executing the algorithm for every incoming edge of the sink vertex, in the augmented timing graph.

Detecting whether a flip-flop has a slack violation or not (even though we may not detect the worst violation), can be performed by using the procedure outlined in Section 6.3, but terminating the search when the first slack violation is detected. Similarly, detecting whether a circuit has at least one slack violation can be done by running the procedure that computes the worst-slack over all registers, with the modified threshold as described in the previous paragraph, but terminating as soon as the first violating slack is found.

6.6 Clock Tree Analysis


Figure 6-2: Clock tree and its timing graph.

As previously explained, the operation of a sequential circuit is synchronized by one or several clock signals. The distribution of a clock signal across the chip is performed by a clock network. Even though in some high-performance designs the clock network is a multidriven mesh, the typical configuration is that of a clock tree, with a single source generating the clock signal and several sinks connected to the synchronization elements (most often flip-flops). As illustrated in Figure 6-2, rather than just containing interconnect wires, usually the clock tree also contains driver cells, such as buffers, introduced to regenerate the clock signal.

Since the quality of the clock tree has an overwhelming impact on the overall performance of a sequential circuit, its accurate analysis has been a topic of intense research. In this section, the techniques developed in previous chapters are applied to the analysis of the two most relevant aspects of a clock tree, latency and skew, in the presence of process parameter variations.

6.6.1 Clock Latency


Figure 6-3: Mirroring of the timing graph.

The delay that clock signals suffer when traveling from the source of the clock tree to each of its sinks is designated by clock latency. Every sink has an associated, and most likely unique, latency. Large clock latencies can be a limiting factor in circuit performance.

The timing graph of a clock tree is, as would be expected, a tree, where the input is the clock source and the outputs are the sinks of the clock tree. The edges are annotated with interconnect and driver cell delays. If the direction of the edges is inverted, as illustrated in Figure 6-3, the resulting timing graph is mirrored. In this new timing graph, the clock sinks become the inputs of the graph and the clock source becomes its single output. If the WDC problem is solved for the single output of this new mirrored timing graph, the resulting corner is the one that produces the worst delay from the clock source to any sink. That is the worst clock latency corner.
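Mirroring is just edge reversal; the worst (nominal) latency is then the longest path from the clock source. The sketch below uses plain numeric delays and hypothetical vertex names; solving the WDC problem on the mirrored graph would additionally return the corresponding corner:

```python
from collections import defaultdict

def mirror(edges):
    """Reverse every edge: clock sinks become inputs of the graph and the
    clock source becomes its single output, as in Figure 6-3."""
    return [(v, u, d) for (u, v, d) in edges]

def worst_latency(edges, source):
    """Longest source-to-sink delay of a clock tree, by DFS."""
    adj = defaultdict(list)
    for u, v, d in edges:
        adj[u].append((v, d))
    def dfs(u):
        return max((d + dfs(v) for v, d in adj[u]), default=0.0)
    return dfs(source)

# Tree with the topology of Figure 6-3 and made-up delays:
tree = [("v1", "v2", 1.0), ("v2", "v3", 2.0), ("v2", "v4", 1.0),
        ("v3", "v5", 3.0), ("v3", "v6", 1.0),
        ("v4", "v7", 2.0), ("v4", "v8", 4.0)]
```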


Figure 6-4: Worst clock skew corner computation.

6.6.2 Clock Skew

The delay between the arrival of the clock signal at two distinct but sequentially-adjacent registers in a sequential circuit is designated by clock skew. Clock skew is potentially a serious problem in sequential circuit design and a major source of design failures. Typically, timing analysis is performed assuming a synchronized clock that arrives at all sink points simultaneously. The clock period is optimized under those circumstances. However, if some sinks in the clock tree were triggered before others, the allotted time for signals to travel between registers in the combinational blocks could be decreased and potentially incorrect signals captured, since the clock trigger may arrive before the signals had time to travel between registers. In general, clock skew is an undesirable effect. However, clock scheduling techniques have been developed that optimize circuit performance by targeting a specific clock skew between particular sinks of the clock tree.

An interesting problem is the one of computing the corner in the parameter space that produces the worst clock skew in a clock network. An exhaustive method for solving that problem is to compute the parametric expression of the clock skew between every two sinks in the clock tree, by subtracting their corresponding latencies, and then using Eqn. (3.8) to determine the corner in the parameter space that produces the worst value in such expression.

However, considering that contemporary designs have clock trees containing millions of sinks, this method is not practical, since it requires $O(k^2)$ skew computations, where $k$ is the number of clock tree sinks. As will be shown, this problem can be cast as a sequence of WDC problems, where pruning may be used to significantly reduce the number of explicit skew computations.

As an illustrative example, let us consider the clock tree whose timing graph is represented in Figure 6-3. The clock skew, $\Delta l$, between $v_5$ and every other sink is given by the following expressions (that are functions of $\Delta\lambda$)

$$\Delta l_{5,6} = l_5 - l_6 = d_{3,5} - d_{3,6} \quad (6.4)$$

$$\Delta l_{5,7} = l_5 - l_7 = d_{2,3} + d_{3,5} - (d_{2,4} + d_{4,7}) \quad (6.5)$$
$$\Delta l_{5,8} = l_5 - l_8 = d_{2,3} + d_{3,5} - (d_{2,4} + d_{4,8}) \quad (6.6)$$

The corner that produces the worst skew between $v_5$ and any other sink of the clock tree is given by

$$\max_{\Delta\lambda} \left[ \max \left( \Delta l_{5,6}, \Delta l_{5,7}, \Delta l_{5,8} \right) \right] \quad (6.7)$$

This procedure is quite simple, but requires the explicit enumeration of all possible sink pairs.

In practice, the previous procedure subtracts the delays (latencies) of the paths between the clock source and two sinks of the clock tree. This problem can be formulated as a WDC problem if the original timing graph is modified such that one of the sinks becomes the output vertex and the remaining ones become input vertices. For computing the worst skew between $v_5$ and any other sink of the clock tree, the WDC problem can be solved in the timing graph presented in Figure 6-4. This graph was obtained by taking the original timing graph, shown in Figure 6-3, and inverting the direction of all the edges except the ones in the path between the clock source, $v_1$, and the sink under analysis, $v_5$, thus making $v_5$ the output of the graph. Additionally, the delays annotated in every edge, except the edges in that same path, are multiplied by $-1$, so that the delay of any path between one input and the output actually corresponds to the subtraction of the latency of some other sink of the clock tree from the latency of $v_5$. Therefore, computing the worst-delay corner in this modified graph actually corresponds to computing the worst clock skew corner relative to $v_5$. Obviously, for computing the worst clock skew corner of the clock tree, the same procedure would have to be repeated for the other sinks of the clock tree, keeping the worst result.
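The graph modification of Figure 6-4 can be sketched as follows: edges on the source-to-target path keep their direction and delay, while every other edge is reversed and its delay negated, so that any input-to-output path delay evaluates to the latency of the target sink minus the latency of some other sink. Names and edge layout are hypothetical:

```python
def skew_graph(edges, path_edges):
    """Build the modified graph for one target sink (Figure 6-4 idea).
    path_edges: set of (u, v) pairs on the source-to-target path."""
    out = []
    for u, v, d in edges:
        if (u, v) in path_edges:
            out.append((u, v, d))   # keep the path to the target sink
        else:
            out.append((v, u, -d))  # reverse and negate all other edges
    return out

tree = [("v1", "v2", 1.0), ("v2", "v3", 2.0), ("v2", "v4", 1.0),
        ("v3", "v5", 3.0), ("v3", "v6", 1.0),
        ("v4", "v7", 2.0), ("v4", "v8", 4.0)]
g = skew_graph(tree, {("v1", "v2"), ("v2", "v3"), ("v3", "v5")})
# Path v8 -> v4 -> v2 -> v3 -> v5 now has delay -4 - 1 + 2 + 3 = l5 - l8.
```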

6.7 k Worst-Delay Paths and Corners

The plots presented in Figure 6-5 show the worst delay value for the 1000 worst-delay paths (left plot) and for the 1000 worst-delay corners (right plot).

Figure 6-5: Worst delay of the 1000 worst-delay paths and of the 1000 worst-delay corners, in c6288. Note that the plots are in different scales.

Clearly, in both cases there is a small, yet relevant, number of paths and corners that exhibit a worst delay close to the worst circuit delay. Given the inaccuracies involved in delay modeling and calculation, and the fact that small random variations that cannot be modeled will most likely occur, any of such paths and corners can in reality be the worst-delay path or corner of the fabricated

IC. Therefore, in order to ensure that all possible worst-delay conditions in the fabricated

IC have been analyzed, it is important to be able to compute not only the worst-delay path or corner, but the k worst-delay paths or corners. In a more general setting, it is important to be able to compute the k worst corners of a timing graph, whether they represent delay, slack, slack over a flip-flop, a register, or all registers simultaneously. This problem also becomes relevant in the context of optimization, when one wants to improve the timing of the circuit not only for the worst-case setting but for a set of worst-case settings.

Surprisingly, the modifications that need to be introduced in the path space search and parameter space search algorithms, described in Chapter 5, to enable the computation of the k worst-delay paths or corners are very few. Instead of storing only the worst-delay path or corner, and respective delay values, it is necessary to store and update the k worst-delay paths or corners. Additionally, the pruning threshold is the smallest of the k worst-delay values, rather than the (largest) worst-delay value.

The first requirement is to have a data structure, possibly a list, of size k, for storing all the k elements (paths or corners), as well as their associated worst-delay (or worst-slack) values. For convenience, in this data structure the elements will be ordered by ascending order of their worst-delay values. This means that, among the elements stored in the data structure, the least critical will be in the first position and the most critical will be in the last position.

The insertion and removal of elements from the data structure must follow two simple rules, to ensure that only the most critical elements are stored:

• while the number of elements in the data structure is smaller than k, all the candidate elements are inserted in the data structure, in ascending order of their worst-delay values;

• when the number of elements in the data structure is k, a candidate element is only inserted in the data structure if its worst-delay value is larger than the worst-delay value of the first element (i.e. the element with the smallest worst-delay value); the first element is subsequently removed in order to keep the number of elements equal to k.

Candidate elements are either complete paths, in path space search, or completely specified corners, in parameter space search, i.e. all the elements that, in the original algorithms, would enable the worst-delay corner and the respective worst-delay value to be updated. The pruning procedure must also be slightly modified:

• while the number of elements in the data structure is smaller than k, no pruning is performed and all the subspaces are explored;

• when the number of elements in the data structure reaches k, pruning is initiated and the corresponding threshold value will be the worst-delay value of the first element.

At every step, the described procedure keeps in the data structure at most k elements that have the largest worst-delay values among all the candidate elements already analyzed. On termination, the k most critical paths or corners are stored in the data structure.
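The rules above amount to a bounded priority store. The text describes an ascending-order list; a min-heap (hypothetical class below) implements the same behavior, keeping the least critical stored element on top, which also serves as the pruning threshold once k elements are present:

```python
import heapq

class KWorstStore:
    """Bounded store for the k most critical candidate elements."""
    def __init__(self, k):
        self.k = k
        self._heap = []  # (worst_delay, element); smallest value on top

    def offer(self, worst_delay, element):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (worst_delay, element))
        elif worst_delay > self._heap[0][0]:
            # Replace the least critical stored element with the candidate.
            heapq.heapreplace(self._heap, (worst_delay, element))

    def threshold(self):
        """None while fewer than k elements are stored (no pruning yet),
        otherwise the smallest stored worst-delay value."""
        return self._heap[0][0] if len(self._heap) == self.k else None

store = KWorstStore(k=2)
store.offer(1.0, "path-a")
store.offer(3.0, "path-b")
store.offer(2.0, "path-c")  # evicts path-a (worst delay 1.0)
```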

Clearly, the amount of pruning is correlated with the value of k. For larger values of k, more elements (paths or corners) are stored in the data structure, and therefore the first element will most likely have a smaller worst-delay value. This means that the pruning


Figure 6-6: Illustration of worst-delay paths and corners.

threshold, $w^*$, will be a smaller number, and consequently fewer elements (paths or corners) get pruned. It is therefore expected that larger values of k may lead to larger execution times.

When applying the former procedure to the path space search algorithm, proposed in

Chapter 5, we obtain the k paths that exhibit the largest worst-delay value. The worst-delay of each of the k paths can occur for different corners, or for the same corner. When applying the former procedure to the parameter space search algorithm, proposed in Chapter 5, we obtain the k corners that produce the worst circuit delay.

Figure 6-6 illustrates the relation between worst-delay paths and worst-delay corners.

The vertical lines correspond to the delay of each path. Since delays are represented by affine functions, they can assume any value between their minimum and maximum values, which occur at the corner specified by Eqn. (3.8) and its symmetrical. Small horizontal lines mark the path delay values at each of four corners. The worst-delay corner always occurs in the worst-delay path. However, the second worst-delay corner can occur in the worst-delay path or in the second worst-delay path. In this case, it also occurs in the worst-delay path.

Consequently, the third worst-delay corner occurs in the second worst-delay path, and so on.
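Since a path delay is an affine function of the parameter variations, its extreme values over a normalized parameter box (assumed here to be $\Delta\lambda \in [-1, 1]^p$, consistent with a sign-based corner rule such as Eqn. (3.8)) are straightforward to evaluate. A minimal sketch under that assumption:

```python
def affine_extremes(d0, sens):
    """Min/max of d(dl) = d0 + sum_k sens[k] * dl[k] over dl in [-1, 1]^p,
    plus the corner attaining the maximum (each dl[k] takes the sign of
    its sensitivity); the minimum occurs at the symmetrical corner."""
    spread = sum(abs(s) for s in sens)
    max_corner = [1.0 if s >= 0 else -1.0 for s in sens]
    return d0 - spread, d0 + spread, max_corner

lo, hi, corner = affine_extremes(10.0, [0.5, -0.2])
```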

6.8 Conclusions

In this chapter we have discussed problems of practical relevance in timing analysis and optimization and demonstrated that they can be addressed in the framework of the formulation and techniques proposed in the previous chapter. We started by introducing the concept of augmented timing graph. This graph is subsequently used for the applications related to setup and hold slack computations. The utilization of the techniques proposed in the previous chapter was illustrated for the following applications: computing the worst-slack corner for a single register, computing the worst-slack corner over all registers, computing the minimum clock period, detecting the existence of slack violations, computing clock skew and clock latency in a clock tree, and computing the worst k paths of a circuit.

7 Experimental Results

This chapter presents experimental evidence that validates the methodologies proposed in the previous chapters. Section 7.1 describes the benchmark circuits as well as the procedure that was followed for their generation. Afterwards, Section 7.2 presents experimental results for the delay modeling and calculation methodologies proposed in Chapters 3 and 4. This corresponds to the delay modeling step illustrated in Figure 1-1. Sections 7.3 and 7.4 present experimental results for the worst-delay and worst-slack corner computation algorithms proposed in Chapter 5. Finally, Section 7.5 presents experimental results for the computation of the k worst-delay paths and corners, as proposed in Section 6.7. The analyses reported in Sections 7.3, 7.4 and 7.5 can be part of a timing analysis step, as illustrated in Figure 1-1.

All the experiments reported in this chapter were conducted on a machine with an AMD Opteron 850, operating at a clock frequency of 2.4 GHz, and with 32 GB of RAM. We note, however, that for all the experiments the memory consumption never exceeded 1 GB.

7.1 Benchmarks

The benchmark circuits used in the following experiments belong to the well-known ISCAS'85 [10] and ISCAS'89 [9] suites. The ISCAS'85 benchmark suite contains 11 combinational designs, while the ISCAS'89 benchmark suite contains 31 sequential designs.

From the initial design descriptions, in behavioral Verilog format, all the designs were synthesized and mapped to an industrial 90 nm technology, using Cadence RTL Compiler v06.20. Subsequently, a layout was generated for each benchmark circuit using Cadence First Encounter v06.20. Routing was performed using a maximum of 8 metal layers.

Information for the ISCAS'85 and ISCAS'89 benchmark circuits after layout generation is presented in Tables 7.1 and 7.2. Column "Function" describes the functionality of the circuit, when available. Columns "#PI" and "#PO" report the number of primary inputs and primary outputs, respectively. Columns "#Logic" and "#Seq" report the number of combinational and sequential library cells, respectively. Column "#Net" reports the number of interconnect nets.

7.2 Parametric Delay Calculation

After layout generation, each circuit was extracted using the extraction engine built into Cadence First Encounter v06.20. As process parameters, we have considered the widths and thicknesses of the 8 metal layers needed to route each design, resulting in a total of 16 parameters. During parasitic extraction, the nominal values and sensitivities of each parasitic element (resistors and grounded capacitors) were computed, relative to each one of the 16 parameters. Afterwards, variational cell and interconnect delay computation was performed using the models and techniques discussed in Chapters 3 and 4. Subsequently, for each circuit, a timing graph was generated and affine formulas for cell and interconnect delays were annotated as edge properties. Tables 7.1 and 7.2 report, in their "#Vertex" and "#Edge" columns, the number of vertices and edges of the corresponding timing graphs.

In order to validate the parametric interconnect delay and slew computations proposed in

Chapters 3 and 4, we have selected a set of nets from a sequential design containing a total of

3671 nets, including nets in the internal logic, nets in the clock tree and nets in the pad wiring.

For each of these nets, we computed the parametric delay and slew expressions for each of its taps (resulting in 13870 taps among all nets), while the port was excited by a rising voltage ramp. To assess the accuracy of the proposed methodology, the delay and slew sensitivities were compared to results obtained via transistor-level simulations performed using the circuit simulator Spectre. In Figure 7-1 we present scatter plots of the sensitivities computed by both methods, for 4 parameters. In Figure 7-2 we present histograms of the relative errors

Circuit | Function | #PI | #PO | #Logic | #Net | #Vertex | #Edge
c17 | tiny example | 5 | 2 | 4 | 9 | 21 | 22
c432 | priority decoder | 37 | 7 | 88 | 124 | 415 | 575
c499 | error correction and translation | 41 | 32 | 133 | 174 | 633 | 886
c880 | ALU and control | 60 | 26 | 147 | 207 | 674 | 908
c1355 | re-mapping of c499 | 41 | 32 | 133 | 174 | 633 | 886
c1908 | error correction and translation | 33 | 25 | 178 | 211 | 756 | 1065
c2670 | ALU and control | 233 | 140 | 282 | 515 | 1321 | 1700
c3540 | ALU and control | 50 | 22 | 443 | 494 | 1882 | 2756
c5315 | ALU and selector | 178 | 123 | 554 | 734 | 2644 | 3701
c6288 | 16-bit multiplier | 32 | 32 | 1584 | 1653 | 5131 | 6998
c7552 | ALU and control | 207 | 108 | 820 | 1031 | 3483 | 4807

Table 7.1: Information for ISCAS'85 benchmark suite.

Circuit Function #PI #PO #Logic #Seq #Net #Vertex #Edge s27 tiny example 6 1 14 4 24 59 72 s208 1 fractional multiplier 12 1 73 9 94 274 372 s349 4-bit multiplier 11 11 95 16 123 385 534 s344 re-synthesis of s349 11 11 95 16 123 385 534 s298 traffic light controller 5 6 104 15 124 387 534 s382 re-synthesis of s400 5 6 130 22 157 498 697 s444 traffic light controller 5 6 138 22 166 513 711 s400 traffic light controller 5 6 137 22 165 515 717 s386 controller 9 7 145 7 161 475 627 s420 1 fractional multiplier 20 1 167 17 204 600 827 s713 based on PLD 36 23 155 19 211 598 771 s641 re-synthesis of s713 36 23 162 19 218 627 815 s526n re-synthesis of s526n 5 6 180 22 208 645 891 s526 traffic light controller 5 6 166 22 193 615 859 s510 controller 21 7 228 7 256 749 985 s832 based on PLD 20 19 279 6 206 918 1212 s820 re-synthesis of s832 20 19 295 6 321 964 1272 s838 1 fractional multiplier 36 1 265 33 334 1002 1409 s1196 re-synthesis of s1238 16 14 458 19 493 1558 2134 s1238 comb. w/ rand. ff’s 16 14 459 19 494 1590 2198 s1423 – 19 5 469 75 563 1829 2609 s1488 re-synthesis of s1494 10 19 489 7 506 1644 2263 s1494 controller 10 19 516 7 534 1721 2363 s15850 real-chip based 13 87 420 136 576 1825 2559 s9234 1 real-chip based 30 39 726 150 920 2999 4337 s13207 real-chip based 11 121 848 340 1228 4121 6122 s5378 – 37 49 799 166 1058 3468 5049 s38584 real-chip based 13 278 5243 1187 6455 22863 34003 s38417 real-chip based 30 106 5517 1597 7153 25750 39948 s35932 – 37 32 6825 1763 8916 29305 43056

Table 7.2: Information for ISCAS’89 benchmark suite.

101 for other 4 parameters. Both figures clearly show that the computed sensitivities accurately match those obtained by simulation.

Figure 7-1: Computed delay sensitivities vs. transistor-level simulation.
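The kind of comparison behind Figures 7-1 and 7-2 can be mimicked in a toy setting, where an analytic delay function stands in for the transistor-level simulator and finite differences play the role of repeated simulation runs (all names and values below are illustrative, not taken from the experiments):

```python
def delay(p):
    # Toy delay function standing in for a transistor-level simulation
    # (the experiments used Spectre; this is only illustrative).
    w, t = p
    return 100.0 + 4.0 * w - 1.5 * t + 0.02 * w * t

def fd_sensitivity(f, p, i, h=1e-4):
    """Central finite-difference sensitivity, as one would extract
    from repeated simulator runs with perturbed parameters."""
    lo, hi = list(p), list(p)
    lo[i] -= h
    hi[i] += h
    return (f(hi) - f(lo)) / (2 * h)

# Analytical sensitivities at the nominal point (0, 0) are 4.0 and -1.5.
analytic = (4.0, -1.5)
sim_like = tuple(fd_sensitivity(delay, (0.0, 0.0), i) for i in range(2))
rel_err = [abs(a - s) / abs(a) for a, s in zip(analytic, sim_like)]
print(max(rel_err))  # small relative error, as in Figures 7-1 and 7-2
```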

In order to validate the cell delay and output slew computations we proceeded as follows.

For a given standard cell of that same 90nm technology, and using Spice-level models, we generated a Dotlib-style lookup table of size 7x7, for delay and output slew, as a function of input slew and load. Using these tables, and applying the proposed methodology, we computed the delay and output slew sensitivities for one of the cell instances in the previously mentioned circuit, considering its loading net obtained from extraction. Using the methodology proposed in Section 4.3 we generated the sensitivities of delay and output slew to 12 parameters. Next, varying the parameter values, a similar set of sensitivities was also computed with Spectre, using accurate Spice-level models for the cell. The delay and output slew sensitivity values obtained using the proposed method were then assessed by computing their relative errors versus the Spectre-generated data. These relative errors are shown in

Figure 7-3 (left plot). As can be observed, the errors are in general small, usually in the low percentage range. The only exception to this rule is the pathological case of the slew sensitivity to parameter #2, whose absolute value is the smallest of all the sensitivities and near machine precision. In order to investigate this behavior, we introduced a variation in the input slew depending on parameter #2, so that the delay and output slew sensitivity values to this parameter would become larger. As a result, we observed that the relative error then dropped to the normal range, as shown in Figure 7-3 (right plot). Considering that the size of the Dotlib-style lookup table used was only 7x7 (a typical value), providing a rough approximation of the behavior of the cell, and that the parasitic network was also approximated by a single lumped capacitance, we believe that the accuracy of the computed values is fairly good. Better accuracy should be obtained by using larger lookup tables, with more data points, or by extending the proposed model to handle tables that depend on additional parameters.

Figure 7-2: Histograms of errors in computed delay sensitivities.
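The table-based sensitivity computation can be sketched as follows: delay is interpolated from a lookup table indexed by (input slew, load), and its sensitivity to a process parameter follows by the chain rule through the parameter dependence of the input slew and the load. The 3x3 table and all numbers below are hypothetical stand-ins for the 7x7 Dotlib tables used in the experiments, and this sketch does not reproduce the exact formulation of Section 4.3:

```python
import bisect

# Tiny 3x3 stand-in for a Dotlib-style delay table: table[i][j] is the delay
# at (slew_axis[i], load_axis[j]); the real tables in this chapter were 7x7.
slew_axis = [10.0, 20.0, 40.0]
load_axis = [1.0, 2.0, 4.0]
table = [[50.0, 60.0, 80.0],
         [55.0, 66.0, 88.0],
         [65.0, 78.0, 104.0]]

def interp(slew, load):
    """Bilinear interpolation of the delay table."""
    i = min(max(bisect.bisect_right(slew_axis, slew) - 1, 0), len(slew_axis) - 2)
    j = min(max(bisect.bisect_right(load_axis, load) - 1, 0), len(load_axis) - 2)
    u = (slew - slew_axis[i]) / (slew_axis[i + 1] - slew_axis[i])
    v = (load - load_axis[j]) / (load_axis[j + 1] - load_axis[j])
    return ((1 - u) * (1 - v) * table[i][j] + u * (1 - v) * table[i + 1][j]
            + (1 - u) * v * table[i][j + 1] + u * v * table[i + 1][j + 1])

def delay_sensitivity(slew, load, dslew_dp, dload_dp, h=1e-5):
    """Chain rule: dD/dp = dD/dslew * dslew/dp + dD/dload * dload/dp,
    with the table partials taken by central finite differences."""
    dD_dslew = (interp(slew + h, load) - interp(slew - h, load)) / (2 * h)
    dD_dload = (interp(slew, load + h) - interp(slew, load - h)) / (2 * h)
    return dD_dslew * dslew_dp + dD_dload * dload_dp

# Sensitivity of the cell delay to a parameter that shifts both the input
# slew (by 0.2 per unit variation) and the load (by 0.05 per unit variation).
s = delay_sensitivity(15.0, 1.5, dslew_dp=0.2, dload_dp=0.05)
print(s)
```

As in the text, the accuracy of such a sensitivity is limited by the granularity of the table itself, which is why larger tables improve the computed values.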

For nets with a number of nodes exceeding a predefined threshold, and in order to improve delay calculation efficiency, a variational model order reduction scheme [53] is employed. Such schemes are usually able to achieve significant reductions in circuit size with a minimal error penalty.

[Figure 7-3: left plot, relative error in the delay and output slew sensitivities versus parameter number (1 to 12); right plot, relative error in the output slew sensitivity to parameter #2 versus the percentage of variation in input slew due to parameter #2 (0 to 30%).]

Figure 7-3: Relative errors in computed cell delay and output slew sensitivities.

7.3 Worst-Delay Corner

The algorithms for the computation of the WDC, proposed in Chapter 5, were implemented in C++. The experimental results obtained for the ISCAS'85 combinational circuits using this implementation are reported in Table 7.3. Both path space search and parameter space search approaches are considered. For each approach, an exhaustive and a branch-and-bound based procedure were evaluated. For each procedure, the "#Search" and "CPU" columns report the amount of search and the run-time in seconds. For path space search methods the amount of search is the number of vertex visits, while for parameter space search methods it is the number of decisions. For the branch-and-bound based algorithms, loose delay upper bounds were computed, as given by Eqn. (3.16).

The correctness of the algorithms proposed in Chapter 5 is confirmed by the fact that the WDC, and the respective worst-delay value, computed by each procedure for a given circuit are the same. Since such information is irrelevant for the purpose of performance comparison between the various algorithms, we have decided not to report it.

The results presented in Table 7.3 clearly show that the branch-and-bound techniques are quite effective in reducing the amount of search, when searching both the path and the parameter spaces. Run-times are also reduced accordingly. In view of these results, we can conclude that the computational overhead incurred by the branch-and-bound techniques is largely compensated by the CPU time saved during the search. The path space search approaches seem to be the most effective, even in the exhaustive case. An exception is the design c6288, for which the exhaustive path space search algorithm does not terminate after 3000 seconds, most likely due to the huge number of paths. It is also in this design that the efficiency of the branch-and-bound techniques is most noticeable, as the path space search procedure with branch-and-bound completes in less than 9 seconds.
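The structure of the branch-and-bound path space search can be conveyed with a minimal sketch (the actual implementation was in C++; here edge delays are plain numbers rather than affine functions, and the graph and bounds are hypothetical):

```python
def worst_delay_bnb(graph, source, sink, upper_bound):
    """Depth-first path enumeration with branch-and-bound pruning: a partial
    path is abandoned whenever its accumulated delay plus an upper bound on
    the delay still to come cannot beat the best complete path found so far."""
    best = [float("-inf")]
    visits = [0]  # number of vertex visits, the "#Search" metric of Table 7.3

    def dfs(node, acc):
        visits[0] += 1
        if node == sink:
            best[0] = max(best[0], acc)
            return
        for succ, d in graph[node]:
            if acc + d + upper_bound[succ] > best[0]:  # otherwise: prune
                dfs(succ, acc + d)

    dfs(source, 0.0)
    return best[0], visits[0]

# Tiny hypothetical timing graph: node -> [(successor, edge delay)].
graph = {"s": [("a", 3.0), ("b", 1.0)], "a": [("t", 2.0)],
         "b": [("t", 1.0)], "t": []}
# upper_bound[v]: bound on the worst delay from v to the sink.
ub = {"s": 5.0, "a": 2.0, "b": 1.0, "t": 0.0}
print(worst_delay_bnb(graph, "s", "t", ub))  # -> (5.0, 3)
```

In this toy graph the branch through "b" is never entered, which is the mechanism behind the large reductions in "#Search" reported in Table 7.3.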

Table 7.4 presents the results for WDC computation using the efficient branch-and-bound path space search procedure, employing three different techniques for computing max upper bounds. The techniques discussed in Section 3.2.6, and given by Eqns. (3.16) and (3.17), are designated by "Loose" and "Tightest", respectively. The technique proposed in [49] is designated by "Linear-Time".

The results presented in Table 7.4 show that, as expected, the amount of search can be reduced by using tighter bounds. However, in the case of the "Tightest" bound, the run-time increases dramatically. This can be easily explained by recalling that the computation of the "Loose" and "Linear-Time" bounds exhibits linear run-time complexity, while the computation of the "Tightest" bound involves solving an LP that, in the worst case, can have exponential run-time complexity. Therefore, we can conclude that, in the case of the "Tightest" bound, the computation of the bound estimates consumes far more time than is subsequently saved during the search procedure.

The simple "Loose" bound seems to successfully limit the amount of search for most circuits. Nevertheless, the "Linear-Time" bound enables additional savings in the amount of search, at a very small computational cost. That is particularly noticeable in the case of circuit c6288. Therefore, even though the algorithm proposed in [49] suffers from the limitations described in Section 2.5, it can be useful in optimizing the performance of the branch-and-bound algorithms proposed in Chapter 5.
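A sketch of the idea behind such bounds: the maximum of a single affine delay over the parameter box has a closed form, and the maximum over a set of affine delays can then be bounded by the largest of these closed-form maxima. This does not reproduce the exact form of Eqns. (3.16) or (3.17), and the numbers are illustrative:

```python
def affine_box_max(d0, sens):
    """Exact maximum of d0 + sum_i s_i * dp_i over the box dp_i in [-1, 1]:
    each term contributes |s_i| when dp_i is pushed to its extreme."""
    return d0 + sum(abs(s) for s in sens)

def max_bound(affines):
    """Upper bound on the maximum, over the box, of a set of affine delays,
    obtained by maximizing each delay independently. Like the "Loose" bound,
    it is computable in linear time in the number of sensitivities."""
    return max(affine_box_max(d0, sens) for d0, sens in affines)

# Two affine delays over two parameters (illustrative numbers).
a = (10.0, (1.0, -2.0))   # maximum over the box: 13.0
b = (11.0, (0.5, 0.5))    # maximum over the box: 12.0
print(max_bound([a, b]))  # -> 13.0
```

A scalar bound of this kind is exactly what the pruning test of the branch-and-bound search needs: it must only never underestimate the worst delay of the paths still reachable from a vertex.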

7.4 Worst-Slack Corner

Table 7.5 presents the results for worst-slack corner computation in sequential circuits, for setup and hold slacks, obtained by applying the path space search WDC computation procedures to the augmented timing graph proposed in Section 6.1. For the branch-and-bound based algorithms, loose max upper bounds were computed, as given by Eqn. (3.16).

         Path Space Search                          Parameter Space Search
Circuit  Exhaustive             Branch-and-Bound    Exhaustive          Branch-and-Bound
         #Search      CPU (s)   #Search   CPU (s)   #Search   CPU (s)   #Search   CPU (s)
c17               72    <0.01        24     <0.01     65536      1.11      5319      0.39
c432         1920392     3.40       561     <0.01     65536     27.45      1701      4.01
c499          604384     1.11       523      0.01     65536     44.67      1363      6.04
c880           44044     0.09       524     <0.01     65536     47.83       779      2.97
c1355         604384     0.97       541     <0.01     65536     46.15      1951      8.45
c1908        3318560     6.22      1026      0.01     65536     67.88      2125     11.65
c2670          93614     0.17       614      0.01     65536    176.64      1171     12.81
c3540       34153708    64.83      1052     <0.01     65536    299.31      2219     44.95
c5315        4218632     8.25       740      0.02     65536    417.06       701     23.94
c6288    >1571711079    >3000   2318098      8.62     65536    803.05      1339    133.18
c7552        3788036     6.67       922      0.03     65536    550.29      1001     52.64

Table 7.3: Results for worst-delay corner computation.

         Path Space Search with Branch-and-Bound
Circuit  Loose               Linear-Time [49]    Tightest
         #Search   CPU (s)   #Search   CPU (s)   #Search    CPU (s)
c17           24     <0.01        20     <0.01        20      83.17
c432         561     <0.01       510     <0.01       510    9378.77
c499         523      0.01       459      0.01       459   21076.86
c880         524     <0.01       523     <0.01       523   10676.10
c1355        541     <0.01       354      0.01       347   34349.00
c1908       1026      0.01       759     <0.01       759   28061.60
c2670        614      0.01       595      0.02       595   27129.24
c3540       1052     <0.01       868      0.03       868   70188.30
c5315        740      0.02       671      0.03       671   70426.66
c6288    2318098      8.62     18159      0.12         –   >100000
c7552        922      0.03       746      0.04       746   90641.14

Table 7.4: Results for worst-delay corner computation, using three max bounding techniques.

          Worst Setup Slack                       Worst Hold Slack
Circuit   Path Exhaustive    Path BnB            Path Exhaustive    Path BnB
          #Search  CPU (s)   #Search  CPU (s)    #Search  CPU (s)   #Search  CPU (s)
s27           174    <0.01        28    <0.01        174    <0.01        54    <0.01
s208.1       1182    <0.01        58    <0.01       1182    <0.01        42     0.01
s349         4606    <0.01        88     0.01       4606    <0.01        98    <0.01
s344         4662     0.02       126    <0.01       4662     0.01        62    <0.01
s298         3646     0.01        68     0.01       3646     0.01        74    <0.01
s382         6554    <0.01        82    <0.01       6554    <0.01        64    <0.01
s444         9530     0.02       166    <0.01       9530     0.02        68     0.01
s400         7202     0.02       104     0.01       7202     0.03        90    <0.01
s386         2406     0.01       118    <0.01       2406     0.01        40     0.01
s420.1       3974    <0.01        86     0.01       3974     0.01       156     0.01
s713         9126     0.02        80     0.01       9126     0.02       166     0.01
s641         9934     0.02       138    <0.01       9934     0.02       108    <0.01
s526n        6582    <0.01        68    <0.01       6582     0.01        74     0.01
s526         6854     0.01       144     0.01       6854     0.01        94     0.01
s510         4742     0.01       150     0.01       4742    <0.01       110     0.01
s832         5182     0.01        68    <0.01       5182     0.01       104     0.01
s820         5278     0.01       108     0.01       5278     0.02       100    <0.01
s838.1      11526     0.01       176    <0.01      11526     0.03        64    <0.01
s1196        6366     0.01       144    <0.01       6366    <0.01        44     0.01
s1238        6098     0.01        48     0.01       6098     0.01        64     0.01
s1423      452286     0.85       334     0.03     452286     0.90       214     0.02
s1488        9298     0.01       140     0.02       9298     0.02       176     0.01
s1494        9094     0.01        98     0.01       9094     0.02        82     0.02
s15850      28746     0.05       186     0.02      28746     0.05        50    <0.01
s9234.1    159966     0.30       402     0.03     159966     0.30       208     0.03
s13207      36522     0.08       112     0.04      36522     0.12        58     0.06
s5378       67706     0.12       156     0.04      67706     0.16        70     0.03
s38584     394194     0.81       138     0.43     394194     0.77        30     0.42
s38417   15284978    29.28       298     0.61   15284978    30.51        34     0.63
s35932     318414     1.01       222     0.70     318414     1.06        86     0.68

Table 7.5: Results for worst-slack corner computation.

As in the case of the WDC, the correctness of the proposed algorithms was demonstrated by the fact that both the exhaustive and the branch-and-bound versions produced the same worst-slack corner and respective worst-slack values. Such results are irrelevant for performance comparison and were therefore omitted.

Once more, the branch-and-bound techniques yield a significant reduction in the amount of search, which impacts run-time accordingly. Such reduction is most noticeable for design s38417. Not surprisingly, this problem seems to be much easier to solve than the WDC problem. This would be expected, since the depth of the combinational blocks in the sequential benchmark circuits is typically much smaller than the depth of the combinational benchmark circuits presented in Tables 7.3 and 7.4. Consequently, the number of potential paths between two registers is also much smaller than the number of potential complete paths between the primary inputs and the primary outputs of a combinational circuit, which makes the problem much easier to solve.

Considering that most practical circuits are of sequential nature, where the depth of the combinational blocks between registers is rather limited, one can envision that the proposed branch-and-bound techniques may be successfully employed in the analysis of even fairly large practical circuit designs, at an acceptable computational cost.
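For a single affine slack expression, the worst-slack corner has a closed form: each parameter is pushed against the sign of its sensitivity. The full algorithm must of course handle the interaction of many register-to-register paths; this sketch, with hypothetical numbers, only shows the single-expression case:

```python
def worst_slack_corner(s0, sens):
    """Minimize slack(dp) = s0 + sum_i sens[i] * dp_i over dp_i in [-1, 1].
    The minimizing corner sets dp_i = -sign(sens[i]), so the worst slack
    is the nominal slack minus the sum of the sensitivity magnitudes."""
    corner = tuple(-1.0 if s > 0 else 1.0 for s in sens)
    worst = s0 - sum(abs(s) for s in sens)
    return corner, worst

# A nominally positive setup slack that turns negative at its worst corner.
corner, worst = worst_slack_corner(0.5, (0.3, -0.1, 0.2))
print(corner, worst)
```

A path that passes its nominal setup check can thus still fail at some corner, which is precisely the situation the worst-slack corner computation is meant to expose.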

7.5 k Worst-Delay Paths and Corners

Table 7.6 presents results for the computation of the k worst-delay paths, using the algorithm proposed in Section 6.7. The "Path Exhaustive" column reports results for the exhaustive search of the path space, which analyzes all the paths but keeps only the single worst-delay path. The "Path BnB" columns report the results for search in the path space using branch-and-bound techniques, for three values of k. For the branch-and-bound based algorithm, the max upper bound proposed in [49] was used, given its good tradeoff between run-time and tightness.

As can be observed in Table 7.6, the top 100 and top 1000 worst-delay paths can still be computed in a fairly small amount of time, even though pruning is obviously reduced and there must be an overhead for keeping the k worst-delay paths. That is true even for c6288, which exhibits a large number of paths.
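The effect of k on pruning can be illustrated with a small sketch of the top-k search: a min-heap keeps the k largest path delays found so far, and its smallest element serves as the pruning threshold, which is why pruning weakens as k grows (hypothetical graph and bounds; the actual implementation was in C++):

```python
import heapq

def k_worst_paths(graph, source, sink, upper_bound, k):
    """Enumerate paths keeping the k largest delays: the pruning threshold
    is the k-th best delay found so far, so less pruning occurs for larger k
    (the effect visible in Tables 7.6 and 7.7)."""
    top = []  # min-heap of the k largest path delays seen so far

    def dfs(node, acc):
        threshold = top[0] if len(top) == k else float("-inf")
        if acc + upper_bound[node] <= threshold:
            return  # cannot enter the current top-k: prune
        if node == sink:
            if len(top) == k:
                heapq.heapreplace(top, acc)
            else:
                heapq.heappush(top, acc)
            return
        for succ, d in graph[node]:
            dfs(succ, acc + d)

    dfs(source, 0.0)
    return sorted(top, reverse=True)

# Tiny hypothetical timing graph: node -> [(successor, edge delay)].
graph = {"s": [("a", 3.0), ("b", 1.0)], "a": [("t", 2.0)],
         "b": [("t", 1.0)], "t": []}
ub = {"s": 5.0, "a": 2.0, "b": 1.0, "t": 0.0}
print(k_worst_paths(graph, "s", "t", ub, k=2))  # -> [5.0, 2.0]
```

With k = 1 the branch through "b" is pruned; with k = 2 it must be explored, mirroring the growth of "#Search" with k in Table 7.6.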

Table 7.7 presents results for the computation of the k worst-delay corners, using the algorithm proposed in Section 6.7. The "Parameter Exhaustive" column reports results for the exhaustive search of the parameter space, which analyzes all the corners but keeps only the single worst-delay corner. The "Parameter BnB" columns report the results for search in the parameter space using branch-and-bound techniques, for three values of k. Once more, the max upper bound proposed in [49] was used.

Analyzing Table 7.7, it is possible to conclude that, as in the case of the path space search, the amount of pruning decreases as k increases, which is expected since the pruning threshold is lower. Nevertheless, for moderate values of k, the computation of the k worst-delay corners still seems feasible.

         Path Exhaustive         Path BnB
Circuit  k = 1                   k = 1               k = 100              k = 1000
         #Search      CPU (s)    #Search   CPU (s)   #Search    CPU (s)   #Search    CPU (s)
c17               72    <0.01         20     <0.01        72      <0.01        72      <0.01
c432         1920392     3.40        510     <0.01      6646       0.02     41664       0.31
c499          604384     1.11        459      0.01      4702       0.03     19821       0.13
c880           44044     0.09        523     <0.01      4695       0.01     16900       0.15
c1355         604384     0.97        354      0.01      4523       0.01     19962       0.18
c1908        3318560     6.22        755     <0.01     10124       0.05     54852       0.47
c2670          93614     0.17        595      0.02      6411       0.03     27629       0.22
c3540       34153708    64.83        868      0.03     24605       0.11    131781       0.86
c5315        4218632     8.25        671      0.03     10789       0.08     55575       0.40
c6288    >1571711079    >3000      18159      0.12   1056387       4.54   7878747      53.00
c7552        3788036     6.67        746      0.04     11103       0.11     91632       0.68

Table 7.6: Results for the exact computation of the k worst-delay paths.

         Parameter Exhaustive    Parameter BnB
Circuit  k = 1                   k = 1              k = 100            k = 1000
         #Search   CPU (s)       #Search  CPU (s)   #Search  CPU (s)   #Search  CPU (s)
c17        65536      1.11          3463     0.33      7761     0.71     12033     1.15
c432       65536     27.45           853     2.46      1707     4.51      6435    17.39
c499       65536     44.67           405     2.32      1271     6.66      6473    32.69
c880       65536     47.83           315     1.60      1337     6.43      6371    30.58
c1355      65536     46.15           647     3.39      1573     7.98      6483    33.75
c1908      65536     67.88           301     2.05       799     4.95      5679    35.35
c2670      65536    176.64           239     3.04      1077    13.30      6941    85.81
c3540      65536    299.31           559    12.76      1735    39.08      6759   146.45
c5315      65536    417.06           205     7.80       821    30.33      5799   213.71
c6288      65536    803.05           241    25.40       803    83.46      5677   580.72
c7552      65536    550.29           411    23.65      1439    83.10      6453   365.96

Table 7.7: Results for the exact computation of the k worst-delay corners.

8 Conclusions and Future Work

This dissertation addresses the complex problem of performing timing analysis of integrated circuit designs, while accounting for process parameter variations, which have a huge impact on circuit performance in the latest IC technologies. Several novel contributions to the development of a variation-aware timing analysis flow have been proposed. This chapter summarizes the major contributions of this dissertation and presents a few relevant topics for future research.

8.1 Delay Computation

The first major contribution of this dissertation is the development of a variation-aware analytical delay computation methodology, suitable for use in a statistical static timing methodology, in corner analysis, as well as in other timing analysis techniques requiring variation-aware delay models. Our approach, based on a specific type of perturbation analysis, allows for the analytical computation of delays as affine functions of parameter variations. We also show how perturbation analysis can be performed when only the standard delay table lookup models are available for the standard cells. The proposed techniques are robust and show good correlation with transistor-level simulations. Further, such techniques can be directly applied when cell characterization is based on either voltage or current source models.

8.2 Timing Analysis

The second major contribution of this dissertation is the development of an efficient, branch-and-bound based, automated methodology for computing the exact worst-delay and worst-slack process corners of combinational and sequential digital ICs, respectively, given an affine parametric characterization of the cell and interconnect delays. Experimental evidence shows that the proposed methodology is particularly effective, leading to reductions in CPU time of up to several orders of magnitude, when compared to exhaustive search. This fact reveals the possibility of handling large sequential circuit designs at a moderate computational cost.

The proposed methodology finds application in a number of practical problems in timing verification and optimization, some of which are briefly described in Chapter 6. Additionally, it can be easily integrated in a traditional design flow as an alternative to SSTA, or complementing it, by providing insightful information into undesirable circuit behavior in the presence of variability.

To the best of our knowledge, this methodology constitutes the only efficient systematic methodology currently available for computing the exact worst-delay/slack corners of a digital

IC, while accounting for process variability.

8.3 Future Work

The scope and depth of the work reported in this dissertation is necessarily constrained by the time and resources available for its execution. Therefore, several new research directions, potentially interesting and worth exploring, could not be pursued within the scope of this work. In this section we briefly enumerate them.

The variation-aware delay computation techniques proposed in Chapter 4 produce delay models as affine functions of process parameter variations. We believe that it would be interesting to research possible extensions or modifications to the proposed techniques for enabling the generation of delay models that include nonlinear contributions from parameter variations.

The amount of pruning performed in both the path-based and parameter-based branch-and-bound search algorithms, proposed in Sections 5.4.2 and 5.4.3, respectively, is highly dependent on the order in which the corresponding search spaces are explored. Clearly, for both algorithms, if the problem solution is explored early in the search process, more pruning will be achieved, thus improving performance. Therefore, one interesting improvement would be the development of a heuristic that, driven by the sensitivity sign patterns of edge delays, could direct the search algorithm to explore, early in the search process, the regions of the search space where the problem solution is more likely to lie.

Each of the two branch-and-bound algorithms mentioned in the previous paragraph conducts its search in a different space: one in the path space and the other in the parameter

(corner) space. Such spaces are related, in the sense that when a region of one space is pruned this may correspond to pruning a region of the other space. For example, pruning a corner in the parameter space, corresponds to pruning all the paths for which that corner is the WDC, in the path space. On the other hand, pruning a path in the path space may not necessarily correspond to pruning some corner in the parameter space. Investigating the connection between the parameter and the path spaces could potentially lead to interesting results. This could ultimately enable the simultaneous application of both algorithms as a means of optimizing the search process, by relating the information produced by each of them.

For some circuits, there may not be any input pattern that activates a given path or a set of paths. Such paths are usually designated as false paths [30]. There is extensive literature [16, 19, 41, 40, 22] discussing false path identification in a nominal timing context

(without considering variability). The worst delay of false paths is not relevant in a timing analysis context, as they are never exercised and, consequently, they do not impact circuit operation. Therefore, if the WDC computed for a given circuit occurs in a false path, it will not correspond to the real WDC of the circuit, thus leading to an overestimation of the worst delay of the circuit. Since high-level synthesis systems [5] are prone to generate circuits with many false paths, it is useful to be able to detect and ignore such paths in order to provide more accurate worst delay estimates. Therefore, it would be interesting to extend our methodology to handle false paths, by incorporating the ability to detect and ignore

them into the algorithms proposed in Sections 5.4.2 and 5.4.3. From a preliminary analysis, we believe that it would be feasible, yet hard, since the false path problem is known to be

NP-complete. Further, the addition of variability makes false path determination much more difficult.

The methodology proposed in Chapter 5, that enables the efficient computation of the

WDC of a circuit, assumes that edge delays are given by affine functions of the process param- eters. Nevertheless, when edge delays are given by other types of functions, the application of the proposed methodology may still be possible. Even though affine functions seem to be sufficiently accurate to represent delays in current digital IC technologies, more complex delay models may be necessary in the future. Therefore, it would be useful to investigate extensions to the proposed methodology, for enabling its application to a broader range of delay functions.

Bibliography

[1] Aseem Agarwal, David Blaauw, Vladimir Zolotov, and Sarma Vrudhula. Statistical Tim-

ing Analysis using Bounds. In Proceedings of Design, Automation and Test in Europe,

Exhibition and Conference, pages 10062–10067, Munich, Germany, March 2003.

[2] R. Aitken, R. Lauwereins, J. Tracy Weed, V. Kiefer, and J. Hartmann. Special Session

– Caution Ahead: The Road to Design and Manufacturing at 32 and 22 nm. In Proceed-

ings of Design, Automation and Test in Europe, Exhibition and Conference, page 510,

Munich, Germany, April 2008.

[3] R. Bauer, J. Fang, A. Ng, and R. Brayton. XPSim: A MOS VLSI Simulator. In

Proceedings of The International Conference on Computer Aided-Design, Santa Clara,

November 1988.

[4] Romy Bauer, A. Ng, A. Raghunathan, and C. Thompson M Saake. Simulating MOS

VLSI Circuits Using Super-Crystal. In VLSI Conference. North-Holland, 1987.

[5] R. Bergamaschi. The Effects of False Paths in High-Level Synthesis. In Proceedings of

The International Conference on Computer Aided-Design, November 1991.

[6] Michel Berkelaar. Statistical delay calculation, a linear time method. In International

Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, De-

cember 1997.

[7] Sarvesh Bhardwaj, Sarma B. K. Vrudhula, and David Blaauw. Tau: Timing analysis

under uncertainty. In Proceedings of The International Conference on Computer Aided-

Design, pages 615–620, San Jose, CA, November 2003.

115 [8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University

Press, 2004.

[9] F. Brglez, D. Bryan, and K. Kozminski. Combinational Profiles of Sequential Bench-

mark Circuits. In Proceedings of the IEEE International Symposium on Circuits and

Systems, pages 1929–1934, May 1989. See also the ISCAS-89 benchmark directory at

http://www.cbl.ncsu.edu/benchmarks.

[10] F. Brglez and H. Fujiwara. A Neutral Netlist of 10 Combinatorial Benchmark Circuits

and a Target Translator in FORTRAN. In Proceedings of the IEEE International Sym-

posium on Circuits and Systems, pages 695–698, June 1985. See also the ISCAS-85

benchmark directory at http://www.cbl.ncsu.edu/benchmarks.

[11] Cadence Design Systems, Inc. Spectre Circuit Simulator User Guide, June 2006.

[12] Cadence Design Systems, Inc. Virtuoso UltraSim Simulator User Guide, June 2006.

[13] H. Chang and S. S. Sapatnekar. Statistical Timing Analysis Considering Spatial Correla-

tions using a Single Pert-like Traversal. In Proceedings of The International Conference

on Computer Aided-Design, pages 621–625, San Jose, CA, November 2003.

[14] B. R. Chawla, H. K. Gummel, and P. Kozah. MOTIS - An MOS Timing Simulator.

IEEE Transactions on Circuits and Systems, CAS-22:901–909, December 1975.

[15] C. F. Chen and P. Subramanyam. The Second Generation MOTIS Timing Simulator - An

Efficient and Accurate Approach for General MOS Circuit. In International Symposium

on Circuits and Systems, Montreal, Canada, May 1984.

[16] Hsi-Chuan Chen and David H. C. Chu. Path Sensitization in Critical Path Problems.

IEEE Transaction on CAD, 12(2):196–207, February 1993.

[17] J.F. Croix and D.F Wong. Blade and Razor: Cell and Interconnect Delay Analysis using

Current-Based Models. In Proceedings of the Design Automation Conference, pages 386

– 389, Anaheim, CA, June 2003.

116 [18] Florentin Dartu, Noel Menezes, and Lawrence T. Pileggi. Performance Computation for

Precharacterized CMOS Gates with RC Loads. IEEE Trans. on CAD, 15(5):544 – 553,

May 1996.

[19] S. Devadas, K. Keutzer, and S. Malik. Computation of Floating-Mode Delay in Combina-

tional Circuits: Practice and Implementation. IEEE Transaction on CAD, 12(12):1924–

1936, December 1993.

[20] A. Devgan and R. A. Rohrer. Adaptively Controlled Explicit Simulation. IEEE Trans-

action on CAD, CAD-13(6):746–762, June 1994.

[21] Aniruhd Devgan and Chandramouli Kashyap. Block-based Static Timing Analysis with

Uncertainty. In Proceedings of The International Conference on Computer Aided-Design,

pages 607–614, San Jose, CA, November 2003.

[22] Luis Guerra e Silva, Jo ao Marques-Silva, L. Miguel Silveira, and Karem A. Sakallah.

Satisfiability Models and Algorithms for Circuit Delay Computation. ACM Transactions

on Design Automation of Electronic Systems, 7(1):137–158, January 2002.

[23] Luis Guerra e Silva, Joel Phillips, and L. Miguel Silveira. Efficient computation of the

exact worst-delay corner. In IEEE/ACM International Workshop on Timing Issues in

the Specification and Synthesis of Digital Systems (TAU), Austin, TX, February 2007.

[24] Luis Guerra e Silva, L. Miguel Silveira, and Joel R. Phillips. Efficient Computation of

the Worst-Delay Corner. In Proceedings of the IEEE/ACM Design, Automation and

Test in Europe, Exhibition and Conference (DATE), Nice, France, April 2007.

[25] Luis Guerra e Silva, Zhenhai Zhu, Joel R. Phillips, and L. Miguel Silveira. Variation-

Aware, Library Compatible Delay Modeling Strategy. In Proceedings of the Fourteenth

International Conference on Very Large Scale Integration (VLSI-SoC), October 2006.

[26] Luis Guerra e Silva, Zhenhai Zhu, Joel R. Phillips, and L. Miguel Silveira. Library

Compatible Variational Delay Computation. In G. De Micheli, S. Mir, and R. Reis,

editors, VLSI-SoC: Research Trends in VLSI and Systems on Chip, volume 249, pages

157–176. Springer, 2008.

117 [27] W. C. Elmore. The transient response of damped linear networks with particular regard

to wideband amplifiers. Journal on Applied Physics, 19(1):55–63, 1948.

[28] R. B. Hitchcock. Timing Verification and the Timing Analysis Program. In Proceedings of

the 19th ACM/IEEE Design Automation Conference, pages 594–604, Las Vegas, Nevada,

June 1982.

[29] R. B. Hitchcock, G. L. Smith, and D. D. Cheng. Timing Analysis of Computer Hardware.

IBM Journal of Research and Development, 26(1):100–105, January 1982.

[30] V. Hrapˇcenko. Depth and Delay in a Network. Soviet Math. Dokl., 19(4):1006–1009,

1978.

[31] Norman P. Jouppi. TV: An nMOS Timing Analyzer. In Randall Bryant, editor, Third

Caltech Conference on VLSI, pages 71–85, Rockville MD, 1983. Computer Science Press.

[32] Igor Keller, Nishath Verghese, and Kenneth Tseng. A Robust Cell-Level Crosstalk Delay

Change Analysis. In Proceedings of the International Conference on Computer Aided-

Design, San Jose, CA, November 2004.

[33] Y. H. Kim, S. H. Hwang, and A. R. Newton. Electrical-Logic Simulation and its Appli-

cations. IEEE Transaction on CAD, CAD-8(1):8–22, January 1989.

[34] T. W. Kirkpatrick and N. Clark. Pert as an Aid to Logic Design. IBM Journal of

Research and Development, 10(2):135–141, March 1966.

[35] J. E. Morgan L. P. Hartung. Pert / pep - A Dynamic Project Control Method. Technical

report, IBM FSD Space Guidance Center, Owego, New York, 1961.

[36] E. L. Lawler and D. E. Wood. Branch-And-Bound Methods: A Survey. Operations

Research, 14(4):699–719, July–August 1966.

[37] X. Li, P. Li, and L. Pileggi. Parameterized interconnect order reduction with Explicit-

and-Implicit multi-Parameter moment matching for Inter/Intra-Die variations. In Pro-

ceedings of the International Conference on Computer Aided-Design, pages 806–812, San

Jose, CA, November 2005.

118 [38] S. Lin, E. S. Kuh, and M. Marek-Sadowska. Stepwise Equivalent Conductance Circuit

Simulation Technique. IEEE Transaction on CAD, CAD-12(5):672–683, May 1993.

[39] J.-J. Liou, K.-T. Cheng, S. Kundu, and A. Krstic. Fast Statistical Timing Analysis by

Probabilistic Event Propagation. In Proceedings of the ACM/IEEE Design Automation

Conference, pages 661–666, Las Vegas, NV, June 2001.

[40] J. Marques-Silva and K. A. Sakallah. Efficient and Robust Test-Generation Based Tim-

ing Analysis. In International Symposium on Circuits and Systems, pages 303–306, 1994.

[41] P. C. McGeer and R. K. Brayton. Integrating Functional and Temporal domains in Logic

Design. Kluwer Academic Publishers, Norwell, Massachusetts, 1991.

[42] V. Mehrotra, S. L. Sam, D. Boning, A. Chandrakasan, R. Vallishayee, and S. Nassif.

A Methodology for Modeling the Effects of Systematic Within-die Interconnect and

Device Variations on Circuit Performance. In Proceedings of the ACM/IEEE Design

Automation Conference, pages 342–348, Anaheim, CA, June 2003.

[43] Nicholas Metropolis and Stanislav Ulam. The Monte Carlo Method. Journal of the

American Statistical Association, 247(44):335–341, September 1949.

[44] L. W. Nagel. SPICE2: A Computer Program to Simulate Semiconductor Circuits.

Technical Report ERL M520, Electronics Research Laboratory Report, University of

California, Berkeley, Berkeley, California, May 1975.

[45] Farid N. Najm. On the Need for Statistical Timing Analysis. In Proceedings of the

ACM/IEEE Design Automation Conference, pages 764–765, Anaheim, CA, June 2005.

[46] Farid N. Najm and Noel Menezes. Statistical Timing Analysis Based on a Timing Yield

Model. In Proceedings of the ACM/IEEE Design Automation Conference, pages 460–465,

San Diego, CA, June 2004.

[47] Sani R. Nassif and Zhuo Li. A More Effective CEFF. In Proceedings of the Sixth International Symposium on Quality of Electronic Design, pages 654–661, San Jose,

CA, March 2005.

[48] Peter Odryna and Sani Nassif. The ADEPT Timing Simulation Algorithm. VLSI Systems Design, March 1986.

[49] Sari Onaissi and Farid N. Najm. A Linear-Time Approach for Static Timing Analysis

Covering All Process Corners. IEEE Transactions on CAD, 27(7):1291–1304, July 2008.

[50] Sari Onaissi and Farid N. Najm. A Linear-Time Approach for Static Timing Analysis

Covering All Process Corners. IEEE Trans. Computer-Aided Design, 27(7):1291–1304,

July 2008.

[51] J. K. Ousterhout. A Switch-Level Timing Verifier for MOS VLSI. IEEE Transactions

on CAD, 4(3):336–348, June 1985.

[52] John K. Ousterhout. Crystal: A Timing Analyzer for nMOS VLSI Circuits. In R. Bryant,

editor, Third Caltech Conference on VLSI, pages 57–69, Rockville MD, 1983. Computer

Science Press.

[53] J. Phillips and L. M. Silveira. Poor Man's TBR: A Simple Model Reduction Scheme. IEEE

Trans. Computer-Aided Design, 24(1):43–55, January 2005.

[54] Joel R. Phillips. Model Computation for Statistical Static Timing Analysis. Submitted

to DAC’2006.

[55] Joel R. Phillips. Variational Interconnect Analysis Via PMTBR. In Proceedings of

the International Conference on Computer-Aided Design, pages 872–879, San Jose, CA,

November 2004.

[56] Lawrence T. Pillage and Ronald A. Rohrer. Asymptotic Waveform Evaluation for Timing

Analysis. IEEE Transactions on Computer-Aided Design, 9(4):352–366, April 1990.

[57] J. Qian, S. Pullela, and L. Pillage. Modeling the Effective Capacitance for the RC Interconnect of CMOS Gates. IEEE Trans. on VLSI, 13:1526–1535, 1994.

[58] Curtis L. Ratzlaff, Satyamurthy Pullela, and Lawrence T. Pillage. Modeling the RC-

Interconnect Effects in a Hierarchical Timing Analyzer. In IEEE Custom Integrated

Circuits Conference, 1992.

[59] K. A. Sakallah and S. W. Director. SAMSON2: An Event-Driven VLSI Circuit Simulator. IEEE Transactions on CAD, 4(4):668–684, October 1985.

[60] R. A. Saleh, J. E. Kleckner, and A. R. Newton. Iterated Timing Analysis and SPLICE1.

In Proceedings of the International Conference on Computer-Aided Design, Santa Clara,

California, September 1983.

[61] Lou Scheffer. Explicit Computation of Performance as a Function of Process Variation.

In International Workshop on Timing Issues in the Specification and Synthesis of Digital

Systems, Monterey, CA, December 2002.

[62] B. Stine, D. Boning, and J. Chung. Analysis and Decomposition of Spatial Variation in

Integrated Circuit Processes and Devices. IEEE Transactions on Semiconductor Manufacturing, 10(1):24–41, February 1997.

[63] Jorge Stolfi and L. H. de Figueiredo. Self-Validated Numerical Methods and Applications. Monograph for the 21st Brazilian Mathematics Colloquium, IMPA, Rio de Janeiro, July 1997.

[64] Synopsys, Inc. HSPICE User’s Manual, August 2002.

[65] Synopsys, Inc. Liberty User Guide, October 2005.

[66] Synopsys, Inc. Nanosim User Guide, April 2006.

[67] R. Telichevesky, J. White, and K. Kundert. Efficient AC and Noise Analysis of Two-Tone

RF Circuits. In Proceedings of the Design Automation Conference, June 1996.

[68] A. J. van Genderen. SLS: An Efficient Switch-Level Timing Simulator Using Min-Max Voltage Waveforms. In Proceedings of the VLSI 89 Conference, pages 79–88, Munich, 1989.

[69] L. Vidigal, S. Nassif, and S. Director. CINNAMON: Coupled INtegration and Nodal

Analysis of MOS Networks. In Proceedings of the 23rd ACM/IEEE Design Automation

Conference, Las Vegas, Nevada, June 1986.

[70] C. Visweswariah, K. Ravindran, K. Kalafala, S. G. Walker, and S. Narayan. First-order

incremental block-based statistical timing analysis. In Proceedings of the ACM/IEEE

Design Automation Conference, pages 331–336, San Diego, CA, June 2004.

[71] C. Visweswariah and R. A. Rohrer. Piecewise Approximate Circuit Simulation. IEEE Transactions on CAD, 10(7):861–870, July 1991.

[72] J. Wang, P. Ghanta, and S. Vrudhula. Stochastic Analysis of Interconnect Performance

in the Presence of Process Variations. In Proceedings of the International Conference on

Computer-Aided Design, pages 880–886, San Jose, CA, November 2004.

[73] Z. Wang, R. Murgai, and J. Roychowdhury. ADAMIN: Automated, accurate macro-modelling of digital aggressors for power and ground supply noise prediction. IEEE Transactions on CAD, 24:56–64, January 2005.
