Compact Variation-Aware Models for Statistical Static Timing Analysis

A Dissertation Presented to The Academic Faculty

By

Seyed-Abdollah Aftabjahani

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the School of Electrical and Computer Engineering

Georgia Institute of Technology

August 2011

Copyright © 2011 by Seyed-Abdollah Aftabjahani

Compact Variation-Aware Standard Cell Models for Statistical Static Timing Analysis

Approved by:

Dr. Linda S. Milor, Advisor
School of Electrical and Computer Engineering
Georgia Institute of Technology

Dr. Yorai Wardi
School of Electrical and Computer Engineering
Georgia Institute of Technology

Dr. Jeffrey A. Davis
School of Electrical and Computer Engineering
Georgia Institute of Technology

Dr. Michael F. Schatz
School of Physics
Georgia Institute of Technology

Dr. Sung-Kyu Lim
School of Electrical and Computer Engineering
Georgia Institute of Technology

Date Approved: June 01, 2011

DEDICATION

To my beloved parents

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the help and support that I have received from many individuals. I would like to express my gratitude to all who assisted me in this endeavor to push my limits of knowledge even further. I will not forget their impact upon my life.

I would like to express my deep appreciation to my thesis advisor, Professor Linda Milor, for her continuous support, encouragement, and supervision of my research. Throughout the years that I have been her research assistant, I have had excellent opportunities to acquire many academic and research skills, which I will use throughout my life. I have learned many intricate details of how to conduct high-quality academic research: from the conception of a research idea, to the conduct of a literature review, to the formulation of a problem, to the use of creative thinking to address a problem, to the construction of models and prototypes for analysis and evaluation of a solution, and to the publication and presentation of results.

I would also like to thank my committee members, Professor Jeffrey A. Davis, Professor Sung-Kyu Lim, Professor Yorai Wardi, and Professor Michael F. Schatz, for spending their precious time on the guidance and review of my research.

I would like to thank the Semiconductor Research Corporation (SRC) for supporting this research under task 1419.001. I am grateful to the SRC for providing me with personal and professional development opportunities by funding my attendance at related conferences, specifically TechCon, to network with experts in the field, to present the research to leaders in academia and industry, and to receive feedback to improve the quality of the research.

I would like to acknowledge my dear colleagues, especially Fahad Ahmed and Muhammad Bashir, for all their constructive discussions on my research, and others who have contributed technical or editorial assistance, including Professor Azad Naimee, Dr. Reza Sarvari, Dr. Alireza Shapoori, and Alex Anderson. I would like to thank Dr. Kevin Martin and all the technical personnel and staff of the Microelectronics Research Center (MIRC) at the Georgia Institute of Technology for providing a superb environment conducive to my research.

Last but not least, I wish to thank my family, especially my parents, for all their love and support. They have provided a solid foundation for me to grow in all aspects of my life, including my education.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ...... IV
LIST OF TABLES ...... X
LIST OF FIGURES ...... XIII
SUMMARY ...... XX
CHAPTER I: INTRODUCTION ...... 1
CHAPTER II: BACKGROUND ...... 5
CHAPTER III: MODELING AND ANALYSIS OF COMPACT VARIATION-AWARE STANDARD CELLS ...... 11
3.1. Experimental Platform and Model of Variation ...... 11
3.2. Construction of the Waveform Model ...... 13
3.3. Comparison of PCA Methods for Waveform Modeling ...... 22
3.4. Construction of the Cell Model and Timing Analysis ...... 25
3.5. Comparison of Experimental Design Methods for Cell Modeling ...... 31
3.6. Complexity Analysis ...... 39
3.7. Conclusions ...... 44
CHAPTER IV: EXTENDING AND ENHANCING THE METHODOLOGY ...... 46
4.1. Constructing a Cell Model for Deep Submicron Technology ...... 47
4.1.1. More Accurate Models with More Parameters ...... 48
4.1.2. Non-binned Transistor Model vs. Binned Transistor Model ...... 53
4.1.3. Support of Symmetric Parameter Variation for All Parameters ...... 53
4.1.4. Variation Parameters Chosen for Cell Models ...... 54
4.2. Constructing a Cell Model for Very Large Parameter Variations ...... 56
4.3. Constructing a Cell Model for Resistive-Capacitive Loads ...... 65
4.3.1. Timing Characterization of Complex Loads ...... 66
4.3.2. Mapping RC-Interconnect Networks to Pi-Models ...... 74
4.3.3. Cell Characterization with a Pi-Model Load ...... 84
4.3.4. RC-Interconnect Network Characterization ...... 89
4.3.5. Test Circuit and Its Pi-Model-Converted RC-Interconnect Networks ...... 92
4.3.6. Timing Analysis Engine for Our Cell Models and RC-Interconnect Models ...... 97
4.3.7. Timing Simulation and Simulation Results ...... 100
4.3.8. Conclusions and Future Work ...... 109
4.4. Investigating Accuracy Improvement Methods ...... 111
4.4.1. Accuracy Analysis of the PCA Waveform Model ...... 113
4.4.1.1. Accuracy Analysis of the PCA Waveform Model – Number and Location of Points ...... 113
4.4.1.2. Accuracy Analysis of the PCA Waveform Model for TSMC180RF – Waveform Dataset Selection, Range of Parameter Variations, and Model Subranging ...... 115
4.4.1.3. Accuracy Analysis of the PCA Waveform Model for FreePDK45 – Waveform Dataset Selection, Design of Experiment, Discretization Level, Range of Parameter Variations, and Model Subranging ...... 122
4.4.1.4. Accuracy Analysis of the PCA Waveform Model for FreePDK45 – The Iterative Method for Finding the Common PCs ...... 135
4.4.2. Accuracy Analysis of the Cell Models ...... 144
4.4.3. Conclusions ...... 145
CHAPTER V: FAST VARIATION-AWARE STATISTICAL DYNAMIC TIMING ANALYSIS ...... 148
5.1. Introduction ...... 148
5.2. VVCCP – A Compiled-Code SDTA Tool ...... 151
5.2.1. Fault Simulation Framework ...... 151
5.2.2. Transformation Process of Models ...... 153
5.2.3. Experiments ...... 155
5.3. Experimental Results ...... 159
5.4. Conclusion and Future Work ...... 161
CHAPTER VI: FUTURE RESEARCH DIRECTIONS ...... 163
6.1. Short-Term Research Plan ...... 163
6.2. Long-Term Research Plan ...... 165
APPENDIX A: PRINCIPAL COMPONENTS ANALYSIS EQUATIONS ...... 169
A.1. Assumptions ...... 169
A.2. Singular Value Decomposition ...... 170
A.3. Properties of U ...... 172
A.4. Principal Component Transformation ...... 172
A.5. Generalized Measures of Variability ...... 173
A.6. Scaling of Characteristic Vectors ...... 173
A.7. Overall Measure of Variability ...... 174
A.8. Residual Analysis ...... 175
APPENDIX B: THE STANDARD CELLS USED IN THE RESEARCH ...... 178
APPENDIX C: COMPARING EXPERIMENTAL DESIGNS FOR GENERATING THE WAVEFORM AND CELL MODELS ...... 179
C.1. Full-Factorial Designs ...... 179
C.2. Fractional-Factorial Designs ...... 180
C.3. Designs Based on Latin Hypercube Sampling ...... 180
C.4. Central Composite Designs ...... 181
APPENDIX D: THE SPREAD OF C1, R, AND C2 OF THE PI-MODEL FOR RC-INTERCONNECT NETWORKS OF THE INVERTER CHAIN ...... 185
APPENDIX E: COMPLEXITY ANALYSIS FOR CELLS THAT USE A SATURATED RAMP WAVEFORM MODEL ...... 187
APPENDIX F: THE 3-DIMENSIONAL PLOTS FOR MODEL ACCURACY COMPARISON ...... 191
APPENDIX G: RESOURCES AND FACILITIES USED IN OUR RESEARCH ...... 202
RESEARCH CONTRIBUTIONS ...... 202
REFERENCES ...... 205
VITA ...... 214

LIST OF TABLES

Table 3.1. Variation model parameters ...... 13
Table 3.2. Residuals of PCA waveform models ...... 25
Table 3.3. Fraction of outliers in PCA waveform models (%) ...... 25
Table 3.4. Designs used for cell model construction ...... 33
Table 3.5. Number of terms (number of operations) in cell models ...... 33
Table 3.6. Adjusted coefficient of multiple determination for cell models (%) ...... 34
Table 3.7. Sum of squares of residuals for cell models ...... 35
Table 3.8. Comparing space complexity of methods for a cell (per delay/transition entry per input) ...... 40
Table 3.9. Comparing simulation time complexity of methods for a cell (per delay/transition entry per input) ...... 41
Table 3.10. Comparing characterization time complexity of methods for a cell (per delay/transition entry per input) ...... 42
Table 4.1. New or enhanced BSIM4.5.0 model parameters used in FreePDK45 ...... 50
Table 4.2. BSIM4.5.0 model selectors/controllers ...... 51
Table 4.2. (cont.) BSIM4.5.0 model selectors/controllers ...... 52
Table 4.3. Example BSIM4.5.0 model process parameters ...... 55
Table 4.4. Variation model parameters for our VTCs (FreePDK45) ...... 57
Table 4.5. Variation model parameters (FreePDK45) ...... 64
Table 4.6. Variation model parameters (FreePDK45 – large variations) ...... 64
Table 4.7. Variation model parameters (FreePDK45 – subrange) ...... 65
Table 4.8. Variation model parameters (FreePDK45 – subrange & large variations) ...... 65
Table 4.9. The 2-level full-factorial design variation parameters for verifying Pi-models and H'(s) models ...... 82
Table 4.10. Comparing output waveform transition time and delay errors (%) of our Pi-model and H'(s) model for all 11 RC-interconnect networks ...... 83
Table 4.11. Comparing the inverter output waveform transition timing point errors (%) at the 10%, 50%, and 90% points and equal resistance errors (%) of our Pi-model and H'(s) model for all 11 RC-interconnect networks ...... 84
Table 4.12. Variation model parameters for (resistive-capacitive) Pi-model loads ...... 85
Table 4.13. Designs used for cell model construction with a Pi-model load ...... 86
Table 4.14. Sum of squares of residuals for cell models ...... 87
Table 4.15. Comparing the model adequacy of the full-factorial models using the coefficient of multiple determination (%) and the adjusted coefficient of multiple determination (%) in parentheses ...... 87
Table 4.16. Comparing the model prediction accuracy of the full-factorial models using the coefficient of multiple determination for prediction ...... 87
Table 4.17. Number of terms (number of operations) in cell models ...... 88
Table 4.18. Comparing STA time, characterization time, total number of operations, memory usage, and accuracy of all the STA methods ...... 108
Table 4.19. Waveform models compared for accuracy – TSMC180RF ...... 117
Table 4.20. Waveform and cell model accuracy improvement methods ...... 124
Table 4.21. Waveform models compared for accuracy – FreePDK45 ...... 129
Table 4.22. Which options improve waveform model accuracy – FreePDK45 ...... 135
Table 4.23. Waveform models compared for adequacy and prediction accuracy – FreePDK45 ...... 140
Table 5.1. Run time for different experiments ...... 160
Table 5.2. Number of inputs, outputs, gates, compile and run times ...... 161
Table E.1. Comparing space complexity of methods for a cell that uses a saturated ramp waveform model (per delay/transition entry per input) ...... 188
Table E.2. Comparing simulation time complexity of methods for a cell that uses a saturated ramp waveform model (per delay/transition entry per input) ...... 188
Table E.3. Comparing characterization time complexity of methods for a cell that uses a saturated ramp waveform model (per delay/transition entry per input) ...... 189
Table E.4. Comparing the complexity of methods for a cell that uses a saturated ramp waveform model (per delay/transition entry per input) ...... 190

LIST OF FIGURES

Figure 3.1. The dataset of time domain rising and falling waveforms generated using a full-factorial experimental design ...... 13
Figure 3.2. The waveforms corresponding to rising and falling transitions transformed to the PCA domain ...... 14
Figure 3.3. (a) The acceptability region in the PCA domain, together with some points, labeled as A, B, C, and D, corresponding to corners of the PCA domain. (b) Time domain waveforms corresponding to the corner points A, B, C, and D in (a) ...... 17
Figure 3.4. Data points corresponding to the waveforms in Figure 3.1 and the acceptability region ...... 19
Figure 3.5. A common set of PC basis functions for all waveforms associated with all cells simplifies timing analysis by avoiding (a) conversions to and from the time domain and (b) storage of a unique set of basis functions for each cell in the library ...... 19
Figure 3.6. The final acceptability region, including the limit imposed by the convergence requirement ...... 21
Figure 3.7. The coefficients of the principal components basis functions, computed after each of the iterations: (a) PC1 and (b) PC2 ...... 23
Figure 3.8. Narrow tree of inverters used to evaluate the accuracy of the PCA method ...... 27
Figure 3.9. Comparison of delay for the three methods for (a) a fast rising input transition and (b) a slow rising input transition ...... 30
Figure 3.10. Average relative error of delay for methods 1 (tabular) and 3 (PC) in comparison with Hspice using data from the outputs of each of the 21 stages ...... 31
Figure 3.11. Average relative error (a) and relative error variance (b) of the delay for models in comparison with Hspice using data from the outputs of each of the 21 stages of 12 samples at nominal values of parameters ...... 37
Figure 3.12. Average relative error (a) and relative error variance (b) of delay for models in comparison with Hspice using data from the outputs of each of the 21 stages of 12 samples at random values of parameters ...... 38
Figure 4.1. VTCs are affected by variations of 20% and 5% in L and Vt, respectively ...... 58
Figure 4.2. Output swings are affected by process parameter variations ...... 61
Figure 4.3. Output transitions are affected by parameter variations (input transitions are the saturated ramps in blue) ...... 61
Figure 4.4. VTCs are affected adversely by increasing the variation in L to 30% while keeping the variation in Vt at 5% ...... 62
Figure 4.5. VTCs are affected adversely by increasing the variation in Vt to 90% while keeping the variation in L at 5% ...... 63
Figure 4.6. VTCs are acceptable when the variations in L and Vt are set to 20% ...... 63
Figure 4.7. Different levels of load modeling: (a) original RC-interconnect network, (b) Pi-model, and (c) Ceff-model ...... 68
Figure 4.8. Two RC-interconnect networks with different topologies are mapped to a simple RC network (i.e., a Pi-model), just with different values for the parameters ...... 71
Figure 4.9. For an RC-interconnect network with fanout branches, the load is modeled by a Pi-model that incorporates all interconnect segments (e.g., INT0, INT5, and INT6); the voltage at the end of each output interconnect segment is determined by passing the input voltage of the interconnect network through a low-pass filter (e.g., R5C5 or R6C6) to accommodate the slope and delay change of the signal through each specific interconnect segment to the output (e.g., INT5 and INT6) ...... 73
Figure 4.10. Our H'(s) is a 2-pole and 1-zero reduced-order model for H(s) ...... 81
Figure 4.11. Our test circuit is a JPEG2 Encoder clock tree ...... 93
Figure 4.12. Choosing one of the slowest critical paths of the JPEG2 Encoder clock tree ...... 94
Figure 4.13. The abstracted inverter chain used for our timing analysis. Each RC-interconnect network is made of one or more RC-interconnect segment(s), and each interconnect segment is made of one or more cascaded RC low-pass filters ...... 95
Figure 4.14. The interconnect network types reduced to Pi-models by our tool (GT_MOR) ...... 96
Figure 4.15. To test our Pi-model in the inverter chain at the Spice level, the RC-interconnect networks were replaced by sets of Pi-model and H'(s) ...... 97
Figure 4.16. Timing simulation algorithm with RC-interconnect network support ...... 99
Figure 4.17. Comparison of delay for the three methods (Hspice, Pi-Model, and FF) at (a) nominal parameter values and (b) random parameter values ...... 103
Figure 4.18. Cell characterization using a saturated ramp could result in a delay error ...... 104
Figure 4.17. Comparison of delay errors, in percentage, using the Pi-Model and the FF-Model and its variations, i.e., the FF3-Model and FF2-Model, at (a) nominal parameter values and (b) random parameter values ...... 106
Figure 4.18. PCA waveform discretization patterns for the voltage scale ...... 114
Figure 4.19. Increase in accuracy of the PCA waveform as a function of the number of discretization levels ...... 115
Figure 4.21. Waveform accuracy – Mahalanobis distance (TSMC180RF) ...... 121
Figure 4.22. Waveform accuracy – 50% point (TSMC180RF) ...... 122
Figure 4.23. Waveform model accuracy compared using Mahalanobis distance ...... 132
Figure 4.24. Waveform model accuracy compared using the absolute value of relative errors ...... 133
Figure 4.25. Waveform model accuracy compared using the absolute value of relative errors of 50% points ...... 133
Figure 4.26. Waveform model accuracy compared using relative errors ...... 134
Figure 4.27. Waveform model accuracy compared using absolute errors ...... 134
Figure 4.28. The coefficients of the principal components basis functions for the inverter based on FreePDK45 technology, computed after each of the iterations: (a) PC1 and (b) PC2 ...... 137
Figure 4.29. Waveforms for iteration 1 of the inverter based on FreePDK45 technology: (a) time domain and (b) PCA domain ...... 138
Figure 4.30. Waveform accuracy for a waveform model based on a common set of PCs – Mahalanobis distance (FreePDK45 – Iterations 0-4) ...... 140
Figure 4.31. Waveform accuracy for a waveform model based on a common set of PCs – Max, Average, and Max. Ave (FreePDK45 – Iterations 0-4) ...... 141
Figure 4.33. Waveform accuracy for a waveform model based on a common set of PCs – 50% point (FreePDK45 – Iterations 0-4) ...... 141
Figure 4.34. Waveform accuracy for a waveform model based on a common set of PCs – Mahalanobis distance (FreePDK45 – Iterations 1-4) ...... 143
Figure 4.35. Waveform accuracy for a waveform model based on a common set of PCs – Max, Average, and Max. Ave (FreePDK45 – Iterations 1-4) ...... 143
Figure 4.36. Waveform accuracy for a waveform model based on a common set of PCs – 50% point (FreePDK45 – Iterations 1-4) ...... 144

Figure 5.1. Fault simulation framework...... 152

Figure 5.2. VVCCP block diagram...... 153

Figure 5.3. Model transformations...... 154

Figure 5.4. Parametric and random experiments...... 156

Figure 5.5. Delay profile generation experiments...... 158

Figure 5.6. Delay profile generation pseudocode...... 159

Figure 5.7. Fault simulation framework...... 160

Figure C.1. Two-level full-factorial design...... 179

Figure C.2. Two-level fractional-factorial design...... 180

Figure C.3. Design based on Latin hypercube sampling...... 181

Figure C.4. Circumscribed central composite design...... 182

Figure C.5. Inscribed central composite design...... 183

Figure C.6. Faced central composite design...... 184

Figure F.1. Waveform model accuracy compared by the maximum of the absolute value of relative errors using 11 points of waveforms ...... 191
Figure F.2. Waveform model accuracy compared by the average of the absolute value of relative errors using 11 points of waveforms ...... 192
Figure F.3. Waveform model accuracy compared by the maximum of the average of absolute value of relative errors for each of 11 points of waveforms ...... 192
Figure F.4. Waveform model accuracy compared by the maximum of the average of the absolute value of relative errors for each of 9 points of waveforms ...... 193
Figure F.5. Waveform model accuracy compared by the maximum of the absolute value of relative errors using 9 points of waveforms ...... 193
Figure F.6. Waveform model accuracy compared using the maximum Mahalanobis distance (11 points) ...... 194
Figure F.7. Waveform model accuracy compared using the maximum Mahalanobis distance (9 points) ...... 194
Figure F.8. Waveform model accuracy compared using the average Mahalanobis distance (11 points) ...... 195
Figure F.9. Waveform model accuracy compared using the standard deviation of Mahalanobis distance (11 points) ...... 195
Figure F.10. Waveform model accuracy compared using the average Mahalanobis distance (9 points) ...... 196
Figure F.11. Waveform model accuracy compared using the standard deviation of Mahalanobis distance (9 points) ...... 196
Figure F.12. Waveform model accuracy compared using the maximum of the absolute value of relative errors of 50% points ...... 197
Figure F.13. Waveform model accuracy compared using the average of the absolute value of relative errors of 50% points ...... 197
Figure F.14. Waveform model accuracy compared using the average of relative errors of all points (11 points) ...... 198
Figure F.15. Waveform model accuracy compared using the standard deviation of relative errors of all points (11 points) ...... 198
Figure F.16. Waveform model accuracy compared using the average of relative errors of 50% points (11 points) ...... 199
Figure F.17. Waveform model accuracy compared using the standard deviation of relative errors of 50% points (11 points) ...... 199
Figure F.18. Waveform model accuracy compared using the average of absolute errors of all points (11 points) ...... 200
Figure F.19. Waveform model accuracy compared using the standard deviation of absolute errors of all points (11 points) ...... 200
Figure F.20. Waveform model accuracy compared using the average of absolute errors of 50% points (11 points) ...... 201
Figure F.21. Waveform model accuracy compared using the standard deviation of absolute errors of 50% points (11 points) ...... 201

LIST OF SYMBOLS AND ABBREVIATIONS

ASM Asymmetric Standardized Model
AWE Asymptotic Waveform Evaluation
BSIM Berkeley Short-channel IGFET Model
BSIM3v3 Version 3 of BSIM3
CAD Computer-Aided Design
Ceff Effective Capacitance
DC Direct Current
DRC Design Rule Check
DSPF Detailed Standard Parasitic Format
Fanout The fanout of a gate (used as an approximation of the load)
FreePDK45 NCSU 45 nm Process Technology
IBM International Business Machines ®
ITRS International Technology Roadmap for Semiconductors
k'n Process Transconductance (for an n-channel transistor)
L Transistor Channel Length
LMIN Minimum Channel Length of a Transistor
LMAX Maximum Channel Length of a Transistor
LVS Layout Versus Schematic
MNA Modified Nodal Analysis
MOR Model Order Reduction
MOS Metal Oxide Semiconductor
MOSFET MOS Field-Effect Transistor
NIMO Nanoscale Integration and Modeling
NCSU North Carolina State University
NMH Noise Margin High
NML Noise Margin Low
NMOS N-Channel MOS
OSU Oklahoma State University
PC Principal Component
PC1 First Principal Component
PC2 Second Principal Component
PCA Principal Components Analysis
PCM PCA Model Transformation Matrix
PCMI The Inverse of PCM
PCS Principal Components Score
PMOS P-Channel MOS
PT Path Tracing
PTM Predictive Technology Model
RF Radio Frequency
ROM Reduced Order Model
RSPF Reduced Standard Parasitic Format
S/D Source/Drain
Slope The slope of a waveform transition (we used the transition time)
SNM Symmetric Non-standardized Model
SOS Sum of Squares of Error
SPEF Standard Parasitic Exchange Format
SSM Symmetric Standardized Model
SSTA Statistical Static Timing Analysis
STA Static Timing Analysis
SVD Singular Value Decomposition
T Temperature
TPHL Waveform Propagation Time for a High-to-Low Output Transition
TPLH Waveform Propagation Time for a Low-to-High Output Transition
TSMC Taiwan Semiconductor Manufacturing Company Limited ®
TSMC180RF TSMC 180 nm Process Technology for RF
Vdd Supply Voltage of a Gate
VIL Maximum Acceptable Low Voltage for an Input of a Gate
VIH Minimum Acceptable High Voltage for an Input of a Gate
VOL Maximum Acceptable Low Voltage at the Output of a Gate
VOH Minimum Acceptable High Voltage at the Output of a Gate
Vt Threshold Voltage of a Transistor
VTC Voltage Transfer Curve
WMIN Minimum Transistor Width
WMAX Maximum Transistor Width

SUMMARY

This dissertation reports on a new methodology to characterize and simulate a standard cell library for use in statistical static timing analysis. A compact variation-aware timing model for a standard cell in a cell library has been developed. The model incorporates variations in the input waveform and loading, process parameters, and the environment into the cell timing model. Principal component analysis (PCA) has been used to form a compact model of a set of waveforms impacted by these sources of variation. Cell characterization involves determining equations describing how waveforms are transformed by a cell as a function of the input waveforms, process parameters, and the environment. Different versions of factorial designs and Latin hypercube sampling have been explored to model cells, and their complexity and accuracy have been compared. The models have been evaluated by calculating the delay of paths. The results demonstrate improved accuracy in comparison with table-based static timing analysis at comparable computational cost. Our methodology has been extended to interconnect-dominated circuits by including a resistive-capacitive load model. The results show the feasibility of using the new load model in our methodology. We have explored comprehensive accuracy improvement methods to tune the methodology for the best possible results.

The following is a summary of the main contributions of this work to statistical static timing analysis:

(a) accurate waveform modeling for standard cells using statistical waveform models based on principal components;

(b) compact performance modeling of standard cells using experimental design statistical techniques;

(c) variation-aware performance modeling of standard cells considering the effect of variation parameters on performance, where variation parameters include loading, waveform shape, process parameters (gate length and threshold voltage of NMOS and PMOS transistors), and environmental parameters (supply voltage and temperature);

(d) extending our methodology to support resistive-capacitive loads so that it is applicable to interconnect-dominated circuits;

(e) classifying the sources of error for our variational waveform model and cell models and introducing the related accuracy improvement methods; and

(f) introducing our fast block-based variation-aware statistical dynamic timing analysis framework and showing that (i) using compiler-compiler techniques, we can generate our timing models, test benches, and data analysis for each circuit, which are compiled to machine code to reduce the overhead of dynamic timing simulation, and (ii) using the simulation engine, we can perform statistical timing analysis to measure the performance distribution of a circuit using a high-level model for gate delay changes, which can be linked to their parameter variations.

CHAPTER I

INTRODUCTION

Dealing with variability of semiconductor process and environmental parameters has become a major challenge in designing modern high-performance complex system-on-chip integrated circuits. Hence, it is essential to the electronic design automation industry that the next generation of computer-aided design (CAD) tools be variation aware to maintain a high manufacturing yield for profitability. Variability is a significant issue because controlling process and environmental variations, such as temperature and supply voltage, has become very difficult or even impossible for modern nanometer technologies. Therefore, innovative tools, methodologies, and algorithms are required to implement the concept of variation awareness in CAD tools for at least the next two decades. The large number of state-of-the-art papers that address various aspects of statistical performance analysis demonstrates the significance of this research area for both industry and academia. More than 200 publications were cross-referenced in two recent literature survey journal publications on statistical static timing analysis authored by Dr. David Blaauw et al. [1] and Dr. Cristiano Forzan et al. [2]. Considerable research effort is still required to advance the concept of statistical static performance analysis to maturity.

Toward enabling variation awareness in the current generation of performance analysis tools, specifically the timing analysis tools that use standard cells, we propose a methodology based on design of experiments, analysis of variance, and principal component analysis to create a realistic variational waveform model, to deal with variation systematically, and to build compact variation-aware timing models for a standard cell library. Moreover, we present the necessary timing analysis framework capable of using variation-aware cell models for statistical static timing analysis.

Traditional timing analysis tools use tabular timing models that can effectively capture the nonlinear interactions of waveform transitions and loading to perform static timing analysis, while waveform transitions are abstracted by slopes. Intuitively, it is possible to build a tabular timing model that accounts for variability by adding a dimension per variable; however, the characterization time and memory requirement of a cell grow exponentially as the dimensionality of the table increases. We use experimental design techniques to characterize cells with a minimum number of simulations that cover the whole multi-dimensional variation space, and we take advantage of analysis of variance to build our compact variation-aware models.
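To make the exponential growth concrete, a small illustrative calculation follows; the point counts and the particular set of variation parameters are hypothetical, chosen only to show the scaling, not taken from this work.

```python
# Illustrative only: size of a tabular cell-delay model when variation
# parameters are naively added as extra table dimensions.

def table_entries(points_per_dim, n_dims):
    """Number of characterization points in a full table."""
    return points_per_dim ** n_dims

# Classic 2-D table: 7 input slopes x 7 output loads.
baseline = table_entries(7, 2)

# Naively adding 6 variation parameters (e.g., L and Vt for NMOS and PMOS,
# Vdd, and T), each sampled at 5 levels, multiplies the table size by 5**6.
variation_aware = baseline * table_entries(5, 6)

print(baseline)         # 49
print(variation_aware)  # 765625
```

Each entry also costs one circuit simulation during characterization, so the simulation count explodes at the same rate as the memory footprint.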

Historically, the accuracy needed from waveform shape models in CAD tools for digital systems caused those models to evolve from a vertical transition to a slope. The continuing trend of downscaling transistors has required downscaling voltages to keep power dissipation tolerable; therefore, the voltage swing of a waveform from one logic level to another has become smaller.

This shorter voltage swing, along with higher transistor switching frequency, has made the shape of waveforms more important; consequently, a transition can no longer be accurately modeled by a slope. Chapter II presents a literature review on waveform modeling and its connection to statistical static timing analysis, static timing analysis, and variation-aware cell modeling. Moreover, a statistical waveform model based on principal component analysis has been proposed in the literature to model a variational waveform. In Chapter III, we propose a waveform model based on such a model; we formalize and generalize the methodology to make the models practical for accurately capturing a realistic waveform shape.

We also propose a methodology to construct compact variation-aware standard cell models for timing analysis using our variational waveform model. The tabular cell models still in use in industry are not variation-aware. A multi-dimensional tabular model that accounts for variability is not practical despite being theoretically more accurate; therefore, we allow for a trade-off between accuracy and practicality by reducing the memory requirement and characterization time of the cell.

To generalize the methodology, we propose some extensions to it in Chapter IV and present our guidelines and techniques on how to handle: (1) finding and covering the largest possible range of parameter variations, (2) adopting a resistive-capacitive load model to extend our methodology to interconnect-dominated circuits, and (3) improving the accuracy of our statistical models by classifying our possible accuracy improvement methods. We also show that our methodology gives reasonable approximations for timing analysis even if the de facto industry-standard saturated ramp waveform, instead of our variational waveform model, is used in constructing our compact variation-aware timing models.

In Chapter V, we present our proposed high-level fast variation-aware dynamic statistical timing engine, in contrast with the static timing analysis engines that we built to verify our methodology and our compact variation-aware models. In general, dynamic timing analysis is more time intensive than static timing analysis; however, we improve the simulation times by using compiled-code techniques, which makes it possible to build our fast variation-aware dynamic statistical timing engine. Our dynamic simulation engine suffers neither from the pessimism of block-based static timing analysis nor from the possibility of missing critical paths inherent in path-oriented static timing analysis. Block-based static timing analysis engines use the maximum input-to-output gate delays without considering the logic values of the gate input and output ports that affect the delay; therefore, they can be pessimistic. Path-oriented static timing analysis engines require a pre-selected set of critical paths, while parameter variations can make a non-critical path critical or vice versa; consequently, they can fail to include all the necessary critical paths in the analysis. These shortcomings are not present in our dynamic statistical timing analysis engine, since the paths are selected automatically during the simulation depending on the logic values of the gate input and output ports. In our dynamic statistical timing analysis engine, the delay variation of a gate is a linear function of a few delay-increase parameters sampled from a pre-defined set of distributions, whereas in our compact variation-aware timing models, the delay variation is an arbitrary function of the parameter variation distributions as well as the gate output loading and the input waveform shape parameters.

Finally, in Chapter VI, we present several paths for continuing this research, or starting related research, in both the short term and the long term.

CHAPTER II

BACKGROUND

Circuit timing analysis is needed to ascertain if a design meets timing requirements before manufacturing. The standard approach to estimate circuit timing is through static timing analysis (STA). STA involves converting a circuit into a timing graph, where each edge represents the delay of a gate between its inputs and outputs. STA then performs a graph traversal to find the longest path, based on a project planning technique, called the Critical Path Method [3].
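The graph traversal described above can be illustrated with a short sketch: the circuit is a timing DAG whose edge weights are gate delays, and one topological-order pass finds the longest (critical) path, in the spirit of the Critical Path Method. The node names and delays below are invented for illustration and are not from this work.

```python
from collections import defaultdict

def longest_path_delay(edges, source, sink):
    """edges: list of (from_node, to_node, delay). Returns the max source->sink delay."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, d in edges:
        succ[u].append((v, d))
        indeg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm yields a topological order of the DAG.
    order, frontier = [], [n for n in nodes if indeg[n] == 0]
    while frontier:
        u = frontier.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    arrival = {n: float("-inf") for n in nodes}
    arrival[source] = 0.0
    for u in order:  # relax edges in topological order: arrival = max over fan-in
        for v, d in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + d)
    return arrival[sink]

# Toy graph: two parallel paths; the slower one is the critical path.
edges = [("in", "g1", 0.2), ("g1", "out", 0.3), ("in", "g2", 0.1), ("g2", "out", 0.5)]
print(longest_path_delay(edges, "in", "out"))  # -> 0.6
```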

The delay through gates is a function of the slope of the input signals. Hence, the traditional approach to accounting for the input slope is to characterize cell delay through tables, which pre-compute delay and output slew as a function of input slew for each gate in a standard cell library. To account for slew, STA requires an additional step, a preliminary backward traversal through the timing graph to determine the relationship between slew and delay to the output for each node in the network [4].

Circuit timing is increasingly impacted by variation due to the manufacturing process and the operating environment. The standard approach to account for variation is through worst-case analysis [5]. Worst-case analysis assumes that parameters are constant within a chip, but vary between chips. Designers ensure that their design satisfies specifications for all process corners by simulating the circuit with a small set of “corner” models that represent process extremes. The corner models consist of tables relating delay and output slew to input slew and loading for these process extremes.

Circuit timing, however, has become increasingly susceptible to within-die variation due to both the manufacturing process and the operating environment. Hence, it has become imperative to take these variations in device and interconnect characteristics into account during design. Worst-case design does not take within-die variation into account.

To account for within-die variation, we need to perform statistical static timing analysis (SSTA) at corners that define die-to-die variation [1],[6]-[13]. SSTA can determine the variation in critical path delays as a function of random and systematic variation within paths. SSTA resembles STA, except that gates are characterized by delay distributions. The gate delay and arrival time distributions result in distributions of output delays and correlations among these delays. Graph traversal involves applying the statistical sum to arrival time distributions and the delay distribution for each gate, and the statistical maximum operation to the resulting gate delay distributions.
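The statistical sum and maximum operations above can be illustrated with a small Monte Carlo sketch. The distributions, means, and sigmas below are invented example values, not data from this work, and a real engine would also model correlations between the terms.

```python
import random

random.seed(0)
N = 100_000

def normal_samples(mu, sigma):
    """Independent Gaussian samples standing in for a delay distribution."""
    return [random.gauss(mu, sigma) for _ in range(N)]

arr_a = normal_samples(1.00, 0.05)   # arrival time at input a (ns)
arr_b = normal_samples(0.90, 0.10)   # arrival time at input b (ns)
gate_delay = normal_samples(0.20, 0.02)

# Statistical max over fan-in arrivals, then statistical sum with the gate delay.
out = [max(a, b) + d for a, b, d in zip(arr_a, arr_b, gate_delay)]

mean = sum(out) / N
std = (sum((x - mean) ** 2 for x in out) / N) ** 0.5
print(f"output arrival: mean={mean:.3f} ns, std={std:.3f} ns")
```

Note that the maximum of two Gaussian arrival times is not itself Gaussian; analytic SSTA engines therefore approximate the max operation, while Monte-Carlo-based SSTA simply propagates samples, as here.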

Clearly, for SSTA we need compact models of standard cells that are accurate over parameter and environmental variations, not just at process extremes, as in worst-case design. Our proposed models can be used to generate the delay distribution functions, which can account for spatial correlations, as needed, using methods as in [1],[8]-[13].

Our models can also be used directly in Monte-Carlo-based SSTA, which involves path enumeration, Monte Carlo analysis of critical paths, and the statistical maximum operation on the resulting path delays, as described in [1],[14]-[18].

We must take into account two delay components for timing analysis: gate delay and interconnect delay [19]-[21]; however, we narrow the scope of our research to the complexity of modeling variation-aware standard cells, while modeling variation-aware interconnects has its own significance, as discussed in [22]-[24]. Hence, the goal of this work is to develop a methodology to construct compact variation-aware timing models for standard cells in a cell library that are accurate over process and environmental variations. The cell models utilize our compact waveform models. We show that these compact waveform and cell models, when used for static timing analysis, are almost as accurate as the models based on the well-known tabular method [25] and comparable in terms of computational cost. The tabular cell models, still in use in industry, are neither variation-aware nor compact, and a multi-dimensional tabular model that accounts for variability is not practical despite being theoretically more accurate; therefore, we allow for the trade-off between accuracy and practicality by reducing the memory requirement and characterization time of a cell, which are of exponential order for the tabular method.

The compact waveform models are constructed via principal component analysis (PCA) [26] of waveforms, where the waveforms are described by principal component scores (PCSs), which can reconstruct the waveforms. Moreover, since the principal component basis functions are shared among all waveforms, cell library characterization requires that we only store the equations that describe the transformations of the principal component scores as the waveform passes through the cell. The equations also describe changes in cell performance as a function of variations in the process and operating environment.

This method differs from traditional static timing analysis (a) by working with waveforms with realistic shapes, (b) by storing the waveform transformation through a cell as an equation rather than a table, and (c) by including equations that describe any changes in cell performance as a function of variations in the process and operating environment.

This is not the first attempt to accurately model waveforms for timing analysis. Recent work has considered accurate modeling of waveform propagation through standard cells. In [27],[28], it is shown that realistic waveforms do not resemble the idealized ramp, and in [29] it is shown that realistic waveform modeling results in more accurate timing analysis. Examples of waveform modeling include [30], where a Weibull shape parameter is added to waveform characterization to account for the differences between real waveforms and their approximation by a ramp. Other work has aimed to model realistic waveforms with a set of basis functions [31]-[34]. The basis functions have been selected in a variety of ways, including an error minimization heuristic involving shifting and scaling of waveforms [31],[32], PCA [33], and singular value decomposition (SVD) [34]. All prior work has shown that a few basis functions can be used to approximate realistic waveforms.

Like [32],[35], the proposed work considers the impact of process and environmental variations on waveforms. In the proposed work, the basis functions are derived by PCA. Hence, the proposed approach extends prior work in [33],[34] by including in the PCA waveform model construction large variations in parameters related to the process and environment. This work formalizes, generalizes, and specifies restrictions for the approach and proposes methods to make the waveform models practical.

The cell models differ from prior work on modeling cells as equations [11],[12],[36]-[39], since the cell models operate on parameters that describe waveforms, not just process parameters, waveform slew, and environmental parameters. The parameters are not required to be independent, and the compact model consists of multivariate polynomials with a minimum number of terms, which are selected based on analysis of variance and accuracy.

Since cells operate on waveforms in the PCA domain, several new problems arise. First, we need to determine the set of PCSs that correspond to realistic waveforms, i.e., PCSs that can be transformed back to the time domain. Second, we need common principal component basis functions for both the inputs and outputs of cells. This is because PCA is a data-driven methodology. Hence, each set of input waveforms and each standard cell can generate a unique set of principal component basis functions describing the output waveforms. Therefore, some additional steps are needed to generate a common set of basis functions for all inputs and cells.

Additionally, for our model involving PCA waveform modeling and cell characterization with equations, we show that, unlike the tabular static timing analysis method, where memory usage increases exponentially as a function of accuracy in the discretization of parameters that characterize the input and output waveforms (slope and fanout), our proposed method is typically quadratic in memory usage as a function of the parameters describing the waveforms, process, and environmental variations. Finally, we apply the PCA model to static timing analysis and examine the accuracy of delay calculations for long chains of gates.

Chapter III explains our proposed waveform model and cell model, and it explores several other options to construct both types of models; the chapter ends with the complexity analysis of the cell model considering characterization time, simulation time, and memory requirement. Chapter IV focuses on enhancing our methodology by constructing a cell model for very large variations, extending the methodology to a cell with a resistive-capacitive load, and investigating accuracy improvement methods; we conclude that chapter with possible paths for continuing this research. Chapter V elaborates our high-level variation-aware statistical dynamic timing engine that uses compiled-code techniques to reduce the overhead of dynamic timing analysis. Chapter VI describes our future research directions.

Although we have included the most important PCA equations that we use in this chapter, Appendix A can be referenced for more information.

CHAPTER III

MODELING AND ANALYSIS OF COMPACT VARIATION-AWARE STANDARD CELLS

This chapter is organized as follows. Section 3.1 describes the experimental platform and the parameters modeling variability for cells and waveforms. Sections 3.2 and 3.3 discuss waveform model construction and accuracy analysis, respectively. Section 3.4 describes cell model construction and evaluates its accuracy for a path delay in comparison with Hspice [40] and tabular static timing analysis. Section 3.5 explores several alternative experimental designs for constructing a cell model, to study the trade-off between accuracy and experimental design sampling strategies, because each strategy affects the memory requirement, characterization time, and simulation time of the cell. Section 3.6 elaborates on memory usage and the computational complexity of all the presented cell models and compares their orders of complexity. Section 3.7 concludes with a summary of the research results.

3.1. Experimental Platform and Model of Variation

Traditionally, input waveforms have been represented by delay-slope pairs. In this work, the slope is replaced by a set of PCSs. The number of PCSs determines the accuracy of the model. In one extreme, if all the scores are used, the model can reconstruct the exact waveform.

An inverter, designed and laid out in TSMC 180 nm technology, was used to develop the methodology. This technology was the most advanced one available for our CAD tools. After design rule check (DRC) and layout-versus-schematic (LVS) verification, parasitics were included in the model through parasitic extraction [41]. Advanced features of Hspice automated the large number of simulation runs, which included generating input waveforms based on a model and capturing the data points of the output waveforms at predetermined relative voltage intervals. The dataset was imported and manipulated using Matlab [42] to construct the 2-level full-factorial model [43] for each output parameter. The significant effects were determined to form the compact models.

Timing characteristics of standard cells are primarily a function of loading capacitance (fanout) [44], the input waveform [44], variations of device parameters, i.e., the channel lengths and the threshold voltages of transistors [45]-[47], and the environment, i.e., the power supply voltage and temperature [47]-[48]. The ranges of parameters in the model are listed in Table 3.1. These parameters include the fanout, parameters that describe the input waveform (either slope or principal components, [PC1, PC2] or [L, Θ], described in Section 3.2), the gate length and threshold voltage of the NMOS and PMOS transistors, temperature, and supply voltage.

The ranges for process parameters were chosen to be small relative to realistic die-to-die process parameter variations, which are on the order of ±30%. This is because die-to-die variation is effectively handled with corner models, and the focus of this work is to supplement these models with variation-aware compact models at each corner that can account for within-die variation, whose range is smaller than die-to-die variation.

A set of models describes the stage delay and output waveform shape, characterized by its principal components ([PC1, PC2] or [L, Θ]), as a function of all parameters in Table 3.1. The models were designed to be valid over a wide range of variations by using a full factorial experimental design covering all extreme corners of the experimental space.

Table 3.1. Variation model parameters.

Variable                 Variation
Lp                       0% to 5%
Ln                       0% to 5%
∆Vtp                     -5% to 5%
∆Vtn                     -5% to 5%
∆T                       0˚C to 70˚C
Vdd                      -10% to 10%
Fanout                   1 to 64
Slope                    0.4 to 8.0 ns (for slope)
[PC1, PC2] or [L, Θ]     Dataset range (otherwise)

3.2. Construction of the Waveform Model

To develop the waveform models, a dataset of 256 falling and 256 rising waveforms was generated by running a 2-level full factorial experiment varying the parameters in Table 3.1, i.e., characterizing fanout, parameters that describe the input waveform (either slope, during the first iteration, or principal components, [PC1, PC2] or [L, Θ], for subsequent iterations), the gate length and threshold voltage of the NMOS and PMOS transistors, temperature, and supply voltage. The datasets for rising and falling waveforms were merged by converting the fall times to rise times, subtracting the falling waveforms' voltages from the maximum voltage. A set of waveforms is shown in Figure 3.1.

Figure 3.1. The dataset of time domain rising and falling waveforms generated using a full factorial experimental design.

The resulting 512 timing waveforms were discretized by partitioning the voltage scale into equal intervals to form 19 voltage and time point pairs. This discretization was chosen to have enough points to accurately capture the waveform shapes and match the minor scale of our supply voltage. A comprehensive analysis of the impact of discretization on accuracy is presented in Chapter IV.

Analysis of the inverter waveforms revealed that two PCSs cover 99.8% of the variation for both rising and falling transitions. Hence, only two PCSs (PC1 and PC2) serve as weights for the two basis waveforms, whose linear combination is used to reconstruct the time domain transition waveform. Moreover, each transition maps to a single point in the two-dimensional PCA domain (shown in Figure 3.2).

The points in the PCA domain that correspond to the waveforms in Figure 3.1 are shown in Figure 3.2. The figures indicate how the use of a full-factorial experimental design to explore a wide range of parametric variations can create clusters of waveforms in the time domain in Figure 3.1 that are mapped into clusters of points in the PCA domain in Figure 3.2.

Figure 3.2. The waveforms corresponding to rising and falling transitions transformed to the PCA domain.

Mapping between the time domain and the PCA domain can be represented by a pair of transformations. If the data are not standardized, the transformation equations are as follows:

PCS = PCM * (T - U)    (3.1)

T = U + PCMI * PCS    (3.2)

where PCS is a 19-element vector of scores, T is a 19-element vector of time points describing the waveform, U is a 19-element vector that is the average of all T's in the dataset, PCM is the PCA model transformation matrix from the time domain to the PCA domain, and PCMI is the inverse of PCM. For a 19-element vector, PCM and PCMI are 19x19-dimensional matrices. PCM is found by computing the eigenvectors of the 19x19 covariance matrix of the dataset. The rows of PCM are the normalized eigenvectors of this covariance matrix.

Based on (3.1), for a 19-element vector, there are 19 mapping functions (3.3); each maps the 19 time points describing a waveform to a point in the 19-dimensional PCA space:

pc1 = pcm(1,1)*(t1-u1) + pcm(1,2)*(t2-u2) + ...
pc2 = pcm(2,1)*(t1-u1) + pcm(2,2)*(t2-u2) + ...    (3.3)
...
pc19 = pcm(19,1)*(t1-u1) + pcm(19,2)*(t2-u2) + ...

The elements of the PCM matrix are the coefficients of the linear equations for the transformation.

If the data are standardized, equations (3.1) and (3.2) are replaced by equations (3.4) and (3.5):

PCS = PCM * (T - U) * D^-1    (3.4)

T = U + PCMI * PCS * D    (3.5)

where D is a diagonal matrix of the standard deviations associated with each of the 19 elements of the dataset.

The significant PCSs are found by determining the eigenvalues of the covariance matrix. Small eigenvalues correspond to insignificant PCSs. Dimensional reduction is achieved by setting the coefficients of PCM that correspond to the eigenvectors associated with insignificant PCSs to zero. It is worth mentioning that the sum of the eigenvalues corresponding to the eigenvectors selected for the model determines the variance coverage.
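The construction in equations (3.1)-(3.2) and the eigenvalue-based dimensional reduction can be sketched as follows. NumPy stands in for the Matlab flow used in this work, and the synthetic two-parameter ramp "waveforms" are illustrative, not the 512-waveform inverter dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows are waveforms; columns are the 19 time points T at equally spaced
# voltage levels (here a ramp with a random slope and time offset).
levels = np.linspace(0.05, 0.95, 19)
slopes = rng.uniform(0.5, 4.0, size=200)
offsets = rng.uniform(0.0, 1.0, size=200)
T = offsets[:, None] + slopes[:, None] * levels[None, :]

U = T.mean(axis=0)                      # 19-element mean vector U
C = np.cov(T, rowvar=False)             # 19x19 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]       # sort eigenvalues descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

PCM = eigvecs.T                         # rows = normalized eigenvectors
k = 2                                   # keep two significant PCSs
scores = (T - U) @ PCM[:k].T            # PCS = PCM * (T - U), truncated
T_hat = U + scores @ PCM[:k]            # T = U + PCMI * PCS, with PCMI = PCM^T

coverage = eigvals[:k].sum() / eigvals.sum()
print(f"variance covered by {k} PCs: {coverage:.4f}")
print(f"max reconstruction error: {np.abs(T - T_hat).max():.2e}")
```

Because these synthetic waveforms vary in only two parameters, two principal components cover essentially all of the variance and reconstruction is nearly exact; for the real dataset the coverage was 99.8%.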

The inverse of the PCM matrix, PCMI, is used to reconstruct waveforms. PCMI is the transpose of PCM. For non-standardized data, the significant PCSs weight the waveforms stored in PCMI to generate time domain transition waveforms as follows:

t1 = u1 + pcmi(1,1)*pc1 + pcmi(1,2)*pc2 + ...
t2 = u2 + pcmi(2,1)*pc1 + pcmi(2,2)*pc2 + ...    (3.6)
...
t19 = u19 + pcmi(19,1)*pc1 + pcmi(19,2)*pc2 + ...

Not all points in the PCA domain map to valid transition waveforms. Valid transitions require that the waveform does not move backwards in time. Accordingly, it is required that

t19 > t18 > ... > t1.    (3.7)

This creates an acceptability region restriction on the PCA space, which is obtained by substituting equation (3.2) or (3.5) into (3.7) to create 18 linear relationships, as follows for the case with non-standardized data:

u1 + pcmi(1,1)*pc1 + pcmi(1,2)*pc2 + ... < u2 + pcmi(2,1)*pc1 + pcmi(2,2)*pc2 + ...
u2 + pcmi(2,1)*pc1 + pcmi(2,2)*pc2 + ... < u3 + pcmi(3,1)*pc1 + pcmi(3,2)*pc2 + ...    (3.8)
...

The acceptability region is also restricted by the maximum and minimum of the PCSs from the dataset. Linear programming is used to find the acceptability region. The resulting acceptability region is shown in Figure 3.3(a).
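A point can also be tested against the acceptability region directly by reconstructing the waveform and checking the monotonicity constraint (3.7). The mean vector and basis columns below are illustrative stand-ins, not the model's actual U and PCMI.

```python
import numpy as np

n = 19
U = np.linspace(0.0, 4.0, n)              # mean waveform (illustrative)
# Two orthonormal columns standing in for the first two columns of PCMI.
b1 = np.ones(n) / np.sqrt(n)              # shifts the waveform in time
b2 = np.linspace(-1.0, 1.0, n)
b2 /= np.linalg.norm(b2)                  # stretches / compresses it
PCMI = np.column_stack([b1, b2])

def is_acceptable(pc1, pc2):
    """True if (pc1, pc2) maps back to a monotone (valid) transition."""
    t = U + PCMI @ np.array([pc1, pc2])   # equation (3.2), truncated to 2 PCs
    return bool(np.all(np.diff(t) > 0))   # t1 < t2 < ... < t19, i.e. (3.7)

print(is_acceptable(0.0, 0.5))    # moderate stretch: still monotone
print(is_acceptable(0.0, -10.0))  # extreme compression: time runs backward
```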

Figure 3.3. (a) The acceptability region in the PCA domain, together with some points, labeled as A, B, C, and D, corresponding to corners of the PCA domain. (b) Time domain waveforms corresponding to the corner points A, B, C, and D in (a).

Figure 3.3(a) also contains some points, marked by A, B, C, and D, in the PCA domain. They correspond to the waveforms in the time domain in Figure 3.3(b). Waveforms A and B in Figure 3.3(b) are not valid waveforms because they contain segments where time moves backward. They correspond to points A and B in Figure 3.3(a), which are outside the acceptability region. Waveforms C and D in Figure 3.3(b) are monotonic and valid. They are inside the acceptability region illustrated in Figure 3.3(a).

Some of the original data points in Figure 3.2 lie outside the acceptability region, as can be seen in Figure 3.4. This is because of dimensional reduction. These waveforms can be reconstructed by augmenting the original dataset with waveforms containing negative time points, obtained by reflecting the transitions across the voltage axis. This process is similar to what is done in Fourier analysis, where negative frequencies are used to help construct a model. The addition of these waveforms with negative time points for model construction widens the acceptability region. It does not invalidate the model because the PCA model uses only the positive time points. Additionally, the PCA model generated from the resulting dataset has the property that U=0. As a result, the line segments bounding the acceptability region determined by equation (3.7) always pass through the origin.

Initial analysis modeled the input waveforms with a slope. However, it is desirable to determine a set of universal PCA basis functions for both input and output transitions to avoid extra mapping steps. If we did not have a common set of basis functions, we would need to store a separate set of PC basis functions for all outputs of all cells in the cell library, and consequently we would increase the memory requirement of the cell models. Also, timing simulation would require numerous transformations between basis functions, as illustrated in Figure 3.5, and hence we would increase the simulation time of the cell models during timing analysis.

Figure 3.4. Data points corresponding to the waveforms in Figure 3.1 and the acceptability region.

To find a common set of principal component basis functions for all waveforms associated with all cells, the corners of the PCA space that define the extreme waveforms must be determined for 2-level factorial analysis. But, as can be seen from Figure 3.3(b), two of the PCSs that correspond to corners of the PCA space lie outside the acceptability region and correspond to invalid waveforms.

Figure 3.5. A common set of PC basis functions for all waveforms associated with all cells simplifies timing analysis by avoiding (a) conversions to and from the time domain and (b) storage of a unique set of basis functions for each cell in the library.

This problem was tackled by using a polar coordinate system, instead of a Cartesian coordinate system, for defining the corners of the PCA space for the full factorial experimental design. Note that the shape of the acceptability region in Figure 3.4 is triangular rather than square. To map PC1 and PC2 to polar coordinates, one finds the magnitude (L) and angle (Θ) of a vector from the origin, as follows:

L = sqrt(PC1^2 + PC2^2)
θ = arctan(PC2 / PC1)    (3.9)

This coordinate conversion assumes only two significant PCSs. If more than two PCSs are significant, pairs of PCSs can be converted to polar coordinates with the same transformation. The acceptability region in the polar domain is then determined to guarantee valid waveforms, and a rectangle of maximum size is fit into the acceptability region to define the corners for the full factorial experimental design, denoted as L(min), L(max), θ(min), and θ(max).

A common set of principal components for the input and output waveforms of a cell is generated by running the following iterations:

(a) find the principal components of the output waveforms;
(b) determine the acceptability region in the PCA space in terms of polar coordinates;
(c) fit a rectangle into the acceptability region to find the corners for the full factorial experimental design;
(d) generate the waveforms corresponding to these corners and apply these waveforms as inputs to the cell;
(e) simulate the cell to determine the corresponding output waveforms;
(f) find the principal components of the output waveforms; and
(g) go to (b) if the coefficients of the significant principal component basis functions in equation (3.3), e.g., pcm(1,*) and pcm(2,*), have not converged yet.

If the coefficients of input waveform principal components basis functions match the coefficients of output waveform principal components basis functions, then the principal components have converged to a set of waveforms appropriate for both the input and output of the cell.
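The iteration above can be sketched structurally as follows. The cell "simulation" is a hypothetical affine stand-in for Hspice, and the polar-coordinate rectangle fit is simplified to the bounding box of the scores; only the control flow mirrors the procedure, not the actual characterization.

```python
import numpy as np

rng = np.random.default_rng(7)
levels = np.linspace(0.05, 0.95, 19)

def simulate_cell(wave):
    """Stand-in for an Hspice run: a delayed, slew-degraded copy of the input."""
    return 0.1 + 1.1 * wave

def pca_basis(W, k=2):
    """Mean vector and the k leading principal directions (sign-fixed)."""
    U = W.mean(axis=0)
    _, _, Vt = np.linalg.svd(W - U, full_matrices=False)
    V = Vt[:k]
    signs = np.sign(V[np.arange(k), np.abs(V).argmax(axis=1)])
    return U, V * signs[:, None]

# 0th iteration: inputs are ramps with varying slope and time offset.
slopes = rng.uniform(0.5, 4.0, 64)
offsets = rng.uniform(0.0, 1.0, 64)
inputs = offsets[:, None] + slopes[:, None] * levels[None, :]

outputs = np.array([simulate_cell(w) for w in inputs])
U, basis = pca_basis(outputs)                        # step (a)

for it in range(1, 11):
    scores = (outputs - U) @ basis.T                 # steps (b)-(c), simplified
    lo, hi = scores.min(axis=0), scores.max(axis=0)  # to the score bounding box
    corners = np.array([[lo[0], lo[1]], [lo[0], hi[1]],
                        [hi[0], lo[1]], [hi[0], hi[1]]])
    inputs = U + corners @ basis                     # step (d): corner waveforms
    outputs = np.array([simulate_cell(w) for w in inputs])   # step (e)
    U_new, basis_new = pca_basis(outputs)            # step (f)
    if np.abs(basis_new - basis).max() < 1e-9:       # step (g): converged?
        print("converged after", it, "iteration(s)")
        break
    U, basis = U_new, basis_new
```

Because the stand-in cell is affine, the corner waveforms stay in the span of the current basis and the toy loop converges immediately; the real flow, with Hspice in the loop, converged in two iterations.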

Convergence is only possible by restricting the time window for valid PCs, because a slowly rising input to the cell will create an output with an even slower transition. In our example, we have restricted the time window to be from 0.4 to 8 ns. This window size impacts the acceptability region: a larger window creates a larger acceptability region but reduces model accuracy and slows convergence. This time restriction imposes an additional limit on the acceptability region, illustrated by the diagonal line in Figure 3.6. With this limit, convergence was achieved in two iterations.

Figure 3.6. The final acceptability region, including the limit imposed by the convergence requirement.

The resulting principal components are shown in Figure 3.7. Principal component basis functions related to the original 512 waveforms are labeled GN (0th iteration), where the input was a ramp. The following iterations are designated IT(i), where i is the iteration number. For these iterations, the input had a realistic shape and was defined by the extremes of the PCA space in Figure 3.6. Principal component basis functions from the two iterations using realistic input waveform shapes were almost indistinguishable, and hence the model has converged.

3.3. Comparison of PCA Methods for Waveform Modeling

PCA waveform models can be constructed in a variety of ways, including (a) the symmetric non-standardized model (SNM), obtained from a dataset formed by augmenting the original dataset with waveforms with negative time points, (b) the symmetric standardized model (SSM), obtained like the SNM, but with a standardized dataset (equations (3.4) and (3.5)), and (c) the asymmetric standardized model (ASM), obtained with the standardized dataset, but without augmenting the dataset with waveforms containing negative time points. Note that the asymmetric non-standardized model was not considered because a large number of the original data points are outside the acceptability region.

Figure 3.7. The coefficients of the principal component basis functions, computed after each of the iterations: (a) PC1 and (b) PC2.

Several criteria have been suggested for selecting the appropriate number of PCs for a model [26], including the Broken Stick, the Average Root, Variability Explained by PCs, the Scree Plot, the Residual Trace, the Velicer Partial Correlation Function, the Index of the Correlation Matrix, Imbedded Error, and the Indicator Function. These criteria recommend very different numbers of principal component basis functions, ranging from one to 17. To keep our models compact, we have selected two principal component basis functions, which cover 99.8% of the variation for both rising and falling transitions for all models.

The accuracy of the standard cell model depends on the accuracy of (a) the mapping of a waveform from the time domain to the PCA domain, (b) the mapping of input PCSs to output PCSs through a cell, and (c) the mapping of output PCSs back to the time domain.

We analyzed the PCA modeling accuracy by determining the residuals at each voltage level for all 512 transition waveforms used to construct the model. Residuals are expressed as time domain errors for a fixed voltage level. Table 3.2 summarizes the results for all of the models. The table shows the maximum, average, and maximum of the averages of the residuals for each voltage point. It also includes the maximum of the average of the residuals for the 15 middle voltage points, which correspond to the 10% to 90% range of the transition, which is more critical for accurate timing analysis [49]. It was found that larger errors are associated with longer transitions and the tails of the waveforms. Specifically, it can be seen that residuals associated with the center of the waveform are close to zero. The symmetric models appear to be the more accurate.

Table 3.2. Residuals of PCA waveform models.

                     SNM        SSM        ASM
Max. (19-pt)         1.10       1.75       1.67
Average (19-pt)      0.08       0.07       0.07
Max. Ave. (19-pt)    0.25       0.38       0.36
Max. Ave. (15-pt)    0.00033    0.00023    0.00181

The Q-statistic [26] and T²-statistic [26] were used to analyze the adequacy of the models by determining the number of outliers in the original dataset. Outliers correspond to waveforms in the original dataset that are not accurately modeled by PCA. Table 3.3 shows the fraction of outliers according to each of the screening statistics. The table indicates that the number of outliers is very similar for all three models.

Table 3.3. Fraction of outliers in PCA waveform models (%).

                      SNM           SSM           ASM
Significance level    0.01   0.05   0.01   0.05   0.01   0.05
T²-statistic          0      0      0      0      0      0
Q-statistic           4      8      4      10     5      8
Both                  4      8      4      10     5      8

3.4. Construction of the Cell Model and Timing Analysis

We have applied the SNM waveform model with two significant principal component basis functions to our dataset to find a relationship between the input parameters (listed in Table 3.1) and the output parameters (L, Θ, and Stage_Delay) for the inverter cell. L and Θ characterize the shape of the output waveform. Stage_Delay is the delay from input to output measured at 50% of the supply voltage [49]. The relationship between the input parameters and output parameters is computed using Yates's algorithm [43] to determine all 511 effects (linear coefficients and interactions) and the average. Because of the lack of experimental error, significant effects were found using normal probability plots [43].
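Yates's algorithm itself is a sequence of pairwise sums and differences over the responses taken in standard order; after k passes it yields the grand total and the 2^k - 1 effect contrasts. A minimal sketch for a 2^2 experiment, with invented responses:

```python
def yates(y):
    """y: responses in standard (Yates) order, len(y) = 2**k.

    Returns [average, effect estimates...], where each contrast is divided
    by 2**(k-1) to give the usual effect estimate."""
    n = len(y)
    col = list(y)
    passes = n.bit_length() - 1          # k passes over the column
    for _ in range(passes):
        sums = [col[i] + col[i + 1] for i in range(0, n, 2)]
        diffs = [col[i + 1] - col[i] for i in range(0, n, 2)]
        col = sums + diffs               # new column: pair sums, then pair diffs
    half = n // 2
    return [col[0] / n] + [c / half for c in col[1:]]

# 2^2 example in standard order: (1), a, b, ab
effects = yates([10.0, 14.0, 12.0, 20.0])
print(effects)  # -> [14.0, 6.0, 4.0, 2.0]  (average, A, B, AB)
```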

The resulting model indicates how the shape of the output waveform and delay vary as a function of the shape of the input waveform, process parameters, and variations in the operating environment (temperature and supply voltage).

The output waveform is characterized by L, Θ, and Stage_Delay. L was found to be a function of the input waveform L and Θ, fanout, supply voltage, temperature, n-channel threshold voltage, and p-channel length. Θ was found to be a function of the input waveform L and Θ and fanout. Stage_Delay was found to be a function of the input waveform L and Θ, fanout, supply voltage, temperature, n- and p-channel threshold voltages, and p-channel length.

In evaluating the accuracy of the model, we do not consider variation in process parameters, supply voltage, and temperature, whose effect on accuracy is a function of the number of terms in the model equations. Instead, we consider only accuracy in the presence of variations in the shape of the input waveform and fanout. This enables a direct comparison with tabular static timing analysis.

The accuracy of the PCA model is evaluated by estimating the delay of a narrow tree

of inverters, with a depth of 20 and fanout ranging from two to five, as shown in Figure

3.8. This provides a way to determine the accuracy of the model for timing analysis of

paths in large circuits, with the only simplification being that the same cell is used for all

stages. The number of fanouts at each stage and the slope of the input to the first gate

were varied. The total delay from the input to the output of each stage was determined

using the following three methods.


Figure 3.8. Narrow tree of inverters used to evaluate the accuracy of the PCA method.

Method 1: Tabular (Slope, Fanout) propagation. The inverter timing is characterized for combinations of (Slope, Fanout) in tables. Delay is estimated through linear and bilinear interpolation from the tables. This method requires the following functions, where i is the

index for the stage.

Slope(i+1) = Slope_Function(Slope(i), Fanout(i))
Stage_Delay(i+1) = Delay_Function(Slope(i), Fanout(i))                (3.10)
Total_Delay(i+1) = Total_Delay(i) + Stage_Delay(i+1)

Our implementation included 595 elements in the table: 35 slopes and 17 fanouts. We need four tables for the inverter (output slope and delay tables for each of the rising and falling output transitions); therefore, we need storage for 2380 elements.
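A table lookup with bilinear interpolation, as used by Method 1, can be sketched as follows. The axis ranges and the placeholder delay surface below are assumptions for illustration only; real tables hold Hspice-characterized data:

```python
import numpy as np

# Hypothetical axes matching the 35 x 17 = 595-element table described
# above; the delay surface here is a placeholder, not characterized data.
slopes = np.linspace(0.1, 3.5, 35)     # input slope breakpoints (assumed)
fanouts = np.linspace(1.0, 5.0, 17)    # fanout breakpoints (assumed)
delay_table = 0.05 + 0.1 * np.outer(slopes, fanouts)

def bilinear_lookup(table, xs, ys, x, y):
    """Bilinearly interpolate table[x, y] on rectilinear axes xs, ys."""
    i = int(np.clip(np.searchsorted(xs, x) - 1, 0, len(xs) - 2))
    j = int(np.clip(np.searchsorted(ys, y) - 1, 0, len(ys) - 2))
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    return ((1 - tx) * (1 - ty) * table[i, j]
            + tx * (1 - ty) * table[i + 1, j]
            + (1 - tx) * ty * table[i, j + 1]
            + tx * ty * table[i + 1, j + 1])
```

Propagating a path then amounts to one such lookup per stage for the output slope and one for the stage delay.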

Method 2: Simulation using Hspice. This method numerically solves the circuit differential equations to find the delay.

Method 3: PCA for delay propagation. Delay is calculated as follows, where “i” is the

index for the stage.

L(i+1) = Length_Function(L(i), Θ(i), Fanout(i))
Θ(i+1) = Angle_Function(L(i), Θ(i), Fanout(i))                        (3.11)
Stage_Delay(i+1) = Delay_Function(L(i), Θ(i), Fanout(i))
Total_Delay(i+1) = Total_Delay(i) + Stage_Delay(i+1)

Our implementation requires the storage of 281 coefficients for all six equations, which are two sets of three equations covering the rising and falling outputs; therefore, we need storage for 281 elements, or even fewer (e.g., 102 elements by keeping just the 2nd-order terms of the full-factorial design), based on Table 3.5 in Section 3.5. The input to the first stage for all methods was a ramp. Therefore, the input to the first gate must be mapped to the PCA domain for Method 3; this introduces some error at the beginning of the chain.
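The recurrence of Eq. (3.11) amounts to the following propagation loop. The stage functions passed in stand for the fitted polynomial cell-model equations; all names here are illustrative:

```python
def propagate_chain(L0, theta0, fanouts, length_fn, angle_fn, delay_fn):
    """Propagate the waveform shape (L, Θ) through a chain of stages
    and accumulate the total delay, following Eq. (3.11)."""
    L, theta, total_delay = L0, theta0, 0.0
    for fo in fanouts:
        # Stage delay and the next stage's waveform shape are both
        # functions of the current (L, Θ) and the stage fanout.
        stage_delay = delay_fn(L, theta, fo)
        L, theta = length_fn(L, theta, fo), angle_fn(L, theta, fo)
        total_delay += stage_delay
    return L, theta, total_delay
```

The same loop serves for Method 1 if (L, Θ) is replaced by the output slope.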

The delays obtained using the three methods are compared in Figure 3.9. For a fast rising input transition with low fanout, as shown in Figure 3.9 (a), we see that Method 3 tracks Hspice and Method 1 at the early stages of the chain, but it diverges from Hspice at the later stages even faster than Method 1. For a slow rising input transition with high fanout, as shown in Figure 3.9 (b), we see that both Method 3 and Method 1 have some error at the early stages, but Method 3 makes up for the error at later stages.

Delays from Hspice are used as the basis of comparison to obtain errors for each

method. The average relative errors are compared in Figure 3.10, using data from the

outputs of each of the stages, i.e., from stage 2 to 21 (20 points), which shows that

Method 3 (PC) is more accurate than Method 1 (tabular) in most cases.

The simulations of the circuit were performed on a 4-CPU UltraSPARC II 400 MHz server running the Sun Solaris operating system to compare the three methods. The simulation time for Methods 1 and 3 was 0.2 s, while the simulation time for Method 2 was 21.8 s.

We constructed our cell model to be both compact and variation-aware, although we compared its accuracy with a variation-unaware cell model. We conclude this section by listing the sources of error and explaining why an alternative variation-aware tabular cell model is not practical even though it may be more accurate.

The errors could originate from (a) modeling error for the input waveform, (b) modeling error for the cell model, and (c) accumulation of errors through the stages. The waveform input to the first gate must be mapped to the PCA domain for Method 3; this introduces some error at the beginning of the chain. Moreover, we construct our compact cell models using just the corner values of the parameters and not the values at the center points, while we construct the tabular models using a range of parameter values including the center points. Therefore, we reduce the cell characterization time dramatically at the cost of some modeling error. We discuss some accuracy improvement methods in Chapter IV.

Constructing a variation-aware tabular cell model, even one that uses slope instead of PCs, requires storage for 2380 × 10^6 elements, assuming 10 levels for each of the six remaining parameters in Table 3.1; hence the resulting model needs one million times more memory than the original tabular method. The cell characterization time increases at the same rate. As a result, the variation-aware tabular model is impractical.

Section 3.6 discusses this matter further and shows that even a tabular model based on a sensitivity analysis is not as scalable as our model.

[Plot: total delay (ns) vs. stage node for the Hspice, Tabular, and PC methods; fast rising input transition, fanout = 2.]

(a)

[Plot: total delay (ns) vs. stage node for the Hspice, Tabular, and PC methods; slow rising input transition, fanout = 5.]

(b)

Figure 3.9. Comparison of delay for the three methods for (a) a fast rising input transition and (b) a slow rising input transition.

[Plot: average relative error of the Tabular and PC methods for (Slope, fanout) combinations; slopes 0.32, 1.76, and 3.2 and fanouts 2 through 5.]

Figure 3.10. Average relative error of delay for methods 1 (tabular) and 3 (PC) in comparison with Hspice using data from the outputs of each of the 21 stages.

3.5. Comparison of Experimental Design Methods for Cell Modeling

A cell model indicates how the shape of the output waveform and delay vary as a

function of the shape of the input waveform, process parameters, and variations in the

operating environment (temperature and supply voltage). We use the waveform model

with two significant principal component basis functions to find a relationship between

the input parameters (listed in Table 3.1) and output parameters ([L(Rise), Θ(Rise)], Tplh, [L(Fall), Θ(Fall)], and Tphl). [L(Rise), Θ(Rise)] or [L(Fall), Θ(Fall)] characterizes the shape of the output waveform for rising and falling output transitions, respectively. Tphl and Tplh are the high-to-low and low-to-high propagation times, respectively. The propagation times are the delays from input to output measured at 50% of the supply voltage [49].

The data to generate the relationship between the input and output parameters was found using several experimental design techniques: 2-level full factorial [43], 2-level fractional factorial [43], and Latin hypercube sampling [50],[51]. The factorial and fractional-factorial sampling plans build models based on data points at the corners of the sampling domain (all combinations of the maximum and minimum values of all variables in Table 3.1). Since we have nine variables in Table 3.1, full-factorial experimental designs require 2^9 samples. The number of samples for fractional-factorial experimental designs is a function of the desired model complexity. We have chosen 2^7 = 128 samples in this experiment. Latin hypercube sampling generates data based on quasi-uniform sampling of the sampling domain. The number of samples used for Latin hypercube designs was set to match the number of samples used for the full-factorial and fractional-factorial experimental designs. Table 3.4 shows all the designs used for cell model construction along with their number of sampling points, term selection criteria, and abbreviations.

The relationship between the input parameters and output parameters for a cell was determined using a variety of methods. For full-factorial experimental designs, the relationship was computed using Yates' algorithm [43] to determine all 511 effects (linear coefficients and interactions) and the average. Because of the lack of experimental error, significant effects were chosen automatically using analysis of variance [52]. Models FF2 and FF3 included only significant effects, up to 2nd- and 3rd-order terms, respectively. Similarly, model FRF included only significant effects, up to 3rd-order terms. The models for Latin hypercube experimental designs were found by polynomial regression [53], with at most quadratic terms.
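A minimal Latin hypercube sampler over the unit hypercube, to be scaled to the parameter ranges of Table 3.1, might look as follows; the function name and interface are ours:

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, rng=None):
    """Latin hypercube sample on [0, 1]^d: each dimension is split
    into n_samples equal strata, and exactly one point falls in
    each stratum of each dimension."""
    rng = np.random.default_rng(rng)
    # One uniform point inside each stratum, with an independent
    # random permutation of the strata per dimension.
    u = rng.random((n_samples, n_dims))
    strata = np.stack([rng.permutation(n_samples) for _ in range(n_dims)],
                      axis=1)
    return (strata + u) / n_samples
```

Each unit-cube point is then mapped affinely to the minimum-maximum range of the corresponding variable before simulation.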

Table 3.5 shows the number of terms and the number of operations for each of the output parameters ([L(Rise), Θ(Rise)], Tplh, [L(Fall), Θ(Fall)], and Tphl), where operations consist of additions and multiplications. FF2 is the most compact model, and FRF is the next most compact, while both are almost as accurate as FF. The more compact the cell model, the lower its memory requirement and simulation time.

Table 3.4. Designs used for cell model construction.
Design                      Points        Model Terms                     Abbr.
Full Factorial              2^9           All factors                     FF
                            2^9           Significant, up to 3rd order    FF3
                            2^9           Significant, up to 2nd order    FF2
Fractional Factorial        (1/4)·2^9     Significant, up to 3rd order    FRF
Latin Hypercube Sampling    2^9           All quadratic factors           LHC
                            (1/4)·2^9     All quadratic factors           LHCQ

Table 3.5. Number of terms (number of operations) in cell models.
Model   L(Rise)     Θ(Rise)     Tplh        L(Fall)     Θ(Fall)     Tphl
FF      512(3328)   512(3328)   512(3328)   512(3328)   512(3328)   512(3328)
FF3     42(131)     34(112)     42(132)     31(95)      24(72)      45(142)
FF2     20(50)      17(44)      17(41)      17(41)      14(35)      17(43)
FRF     20(50)      13(36)      19(50)      21(59)      12(32)      21(59)
LHC     55(154)     55(154)     55(154)     55(154)     55(154)     55(154)
LHCQ    55(154)     55(154)     55(154)     55(154)     55(154)     55(154)

The coefficient of multiple determination, R², provides a measure of the adequacy of models in explaining variation in a dataset. The coefficient of multiple determination approaches one when the model explains most of the variation in the dataset. To penalize overfitting, the adjusted coefficient of multiple determination, R̄², is often used. It is defined as follows:

R̄² = 1 − (1 − R²)(n − 1)/(n − p)                                     (3.12)

where n is the number of observations in the dataset and p is the number of coefficients in the regression equation. Table 3.6 shows R̄² for each of the output parameters ([L(Rise), Θ(Rise)], Tplh, [L(Fall), Θ(Fall)], and Tphl). The results show that most of the models fit the data well. The full-factorial model for Tphl with at most 2nd-order terms and the Latin hypercube models for Θ(Rise), Θ(Fall), and Tphl fit the dataset least well.
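Eq. (3.12) can be computed directly from a fitted model's predictions; the helper below is a sketch (names are ours):

```python
import numpy as np

def adjusted_r2(y, y_hat, p):
    """Adjusted coefficient of multiple determination, Eq. (3.12):
    R̄² = 1 − (1 − R²)(n − 1)/(n − p), where n is the number of
    observations and p the number of regression coefficients."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)
```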

Note that the values of R̄² were computed based on the dataset used to develop the models. Hence, the lower values of R̄² indicate that the responses display nonlinearity not captured by the models. As a result, the models with 3rd-order terms have higher R̄² than the models with just 2nd-order terms for Tphl and Tplh. This is similarly the case for the model for Θ(Rise). However, the model indicates the existence of nonlinearity for the Latin hypercube dataset, but not the fractional-factorial dataset. The Latin hypercube dataset is based on quasi-uniform sampling throughout the input domain, specified in Table 3.1, while the fractional-factorial dataset samples only the corners of the input domain.

Table 3.6. Adjusted coefficient of multiple determination for cell models (%).
Model   L(Rise)   Θ(Rise)   Tplh    L(Fall)   Θ(Fall)   Tphl
FF      100       100       100     100       100       100
FF3     99.8      99.2      99.8    99.5      98.8      97.4
FF2     99.6      98.7      98.6    99.7      99.2      91.6
FRF     99.8      99.1      99.7    99.2      99.2      96.7
LHC     99.6      89.2      99.2    99.5      83.9      95.6
LHCQ    99.2      89.3      98.9    99.1      80.4      93.2

Table 3.7 provides an indication of predictive cell model accuracy by comparing the sum of squares of residuals, which incorporates the effects of both the residual errors and their standard deviation, for each experimental design methodology and each of the output parameters ([L(Rise), Θ(Rise)], Tplh, [L(Fall), Θ(Fall)], and Tphl). The residuals were not computed with the original dataset used to develop the model, but instead with a separate "checking" dataset containing data randomly distributed throughout the parameter space (Table 3.1). The results show that error bounds are a function of sample size and are tighter for Latin hypercube experimental designs.

Table 3.7. Sum of squares of residuals for cell models.
Model   L(Rise)   Θ(Rise)   Tplh    L(Fall)   Θ(Fall)   Tphl
FF      180.7     7.293     12.78   161.85    17.91     12.48
FF3     181.0     7.279     12.81   159.82    17.92     10.44
FF2     180.3     7.271     12.82   160.10    17.92     10.45
FRF     176.0     7.162     12.28   153.42    17.70     10.56
LHC     48.86     0.978     1.18    52.15     1.29      1.71
LHCQ    90.89     1.359     2.21    60.17     2.24      2.08

We compared the number of terms, adequacy, and prediction accuracy of each output parameter for all types of our cell models; we found that the accuracy of a cell model is a function of the number of terms in the model equations and of the experimental design method.

However, we should take into account all sources of error during timing simulation, as mentioned in Section 3.4, when evaluating the overall accuracy of the cell models. The overall accuracy of all the models was evaluated using the narrow tree of inverters (Figure 3.8) in the presence of variations in the shape of the input waveform and fanout for two cases: (1) in the absence of variation in process parameters, supply voltage, and temperature, compared with tabular STA as in Section 3.4, and (2) in the presence of variation in process parameters, supply voltage, and temperature, using random samples in the parameter space. We used equation (3.11) to perform the simulations. In the equation, Stage_Delay is equal to Tplh or Tphl for rising and falling output transitions, respectively, and [L, Θ] is equal to [L(Rise), Θ(Rise)] or [L(Fall), Θ(Fall)] accordingly. The input to the first stage for all methods was a ramp and therefore had to be mapped to the PC domain for our models. This is an additional source of error because the ramp is mapped to a non-ramp waveform.

Now, we explain each case.

Case 1: In the absence of variations in process parameters, supply voltage, and temperature, compared with tabular STA. Delays from Hspice are used as the basis of comparison for each method. Figure 3.11(a) shows that the average relative errors for the chain for the factorial models are very close to one another and 5% to 7% higher than that of the tabular method, while the relative variances are less than 5% for both the factorial and Latin hypercube models. This is the price to pay for the compactness of the models. Latin hypercube models can offer better or worse results because these models depend on the locations of the samples as well as the number of samples. This is in agreement with their lower model adequacy in Table 3.6. We encountered invalid waveforms at the early stages of the interconnected cells and stopped the simulation. Figure 3.11(b) shows that the relative error variance for all models is less than 5%. The simulation times for Method 1 and the worst case of Method 3 (FF) were 0.2 s, while the simulation time for Method 2 was 21.8 s.

[Plot: average relative error for the FF, FF3, FF2, FRF, LHC, LHCQ, and Tabular models.]

(a)

[Plot: relative error variance for the FF, FF3, FF2, FRF, LHC, LHCQ, and Tabular models.]

(b)

Figure 3.11. Average relative error (a) and relative error variance (b) of the delay for the models in comparison with Hspice, using data from the outputs of each of the 21 stages for 12 samples at nominal parameter values.

Case 2: In the presence of variation in process parameters, supply voltage, and temperature, using random samples in the parameter space. Figure 3.12(a) presents the average relative errors for the delay of each stage of the inverter chain. It indicates errors of under 5% for the chains for the factorial models. This is comparable to prior work that does not take into account process variations. For example, in [18], it is shown that the relative errors for one cell using an input ramp are 14% for delay and 19% for transition time. The results in [18] improve these errors to 1.5% for delay and 5% for transition time. Our results in Figure 3.12 are not just for a single cell, but for a chain of cells, where errors can accumulate. Figure 3.12(b) shows that the relative error variance for all models but LHCQ is around 5%. The simulation time for FF was 0.19 s, while the simulation time for FF3, FF2, FRF, LHC, and LHCQ was 0.3 s.

[Plot: average relative error for the FF, FF3, FF2, FRF, LHC, and LHCQ models.]

(a)

[Plot: relative error variance for the FF, FF3, FF2, FRF, LHC, and LHCQ models.]

(b)

Figure 3.12. Average relative error (a) and relative error variance (b) of the delay for the models in comparison with Hspice, using data from the outputs of each of the 21 stages for 12 samples at random parameter values.

A comparison between Figures 3.11 and 3.12 shows that, in general, the factorial models offer similar accuracy relative to each other (10% average relative error and 5% relative error variance). The Latin hypercube models have similar relative error in each case, although at a different level, but the case involving variation has a much higher variance for the smaller sample size.

We want our models to have both good model adequacy and good prediction accuracy. The model adequacy of the factorial models is superior to that of the Latin hypercube models (Table 3.6), but the prediction capability of the factorial models is inferior to that of the Latin hypercube models (Table 3.7). However, our simulation results show better stability in accuracy for the factorial models in both cases because the average and variance of the relative errors are at almost the same level. The overall accuracy of the Latin hypercube models was either better or worse than that of the factorial models; this is in agreement with our finding that Latin hypercube models offer less model adequacy but better prediction capability.

3.6. Complexity Analysis

Table 3.8 compares the estimated space complexity per transition entry per input for each cell for Methods 1 and 3. Let p be the number of parameters characterizing a cell. For Method 1, p = 2, i.e., slope and fanout. For Method 3, p = 3, since this method requires a pair of PCs to characterize the waveform shape plus the fanout. Let us also suppose that we take into account q sources of variation, deriving from the process and environment (temperature and supply voltage). Method 1 requires a p-dimensional table of numbers with k levels in each dimension. If we take into account q sources of variation by computing sensitivities to each of these parameters for each of the table entries (i.e., we postulate a linear model), we then require q+1 tables with k^p entries. Otherwise, a table with k^(p+q) entries is needed for a model with all interactions.

Hence the space complexity of Method 1 is O(k^(p+q)), which is reduced to O(q·k^p) if we assume a linear model.
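To make the comparison concrete, the storage counts implied by these expressions can be evaluated for illustrative values; the specific k, p, q, and w below are assumptions consistent with the chapter's examples, not figures taken from the text:

```python
# Illustrative storage counts under the space-complexity expressions above.
# k: table levels per dimension, p: model parameters, q: variation
# sources, w: waveform discretization points (all values assumed).
k, p, q, w = 35, 3, 6, 19

tabular_general = k ** (p + q)                       # O(k^(p+q))
tabular_linear = q * k ** p                          # O(q*k^p), sensitivity tables
method3_full_factorial = p * w + 2 ** (p + q) * p    # O(p*w + 2^(p+q)*p)
method3_linear = p * (w + p + q)                     # O(p*(w + p + q))

print(tabular_general, tabular_linear, method3_full_factorial, method3_linear)
```

Even for these modest values, the PCA-based model's linear-case storage (tens of entries) is orders of magnitude below the sensitivity-table count, which is in turn dwarfed by the full (p+q)-dimensional table.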

Table 3.8. Comparing space complexity of methods for a cell (per delay/transition entry per input).
Method                             Complexity
Tabular (general case)             O(k^(p+q))
Tabular (linear case)              O(q·k^p)
FF                                 O(p·w + 2^(p+q)·p)
FF2, LHC, LHCQ (2nd-order case)    O(p(w + (p+q)^2))
FF, FRF, LHC (linear case)         O(p(w + p + q))

Method 3 discretizes waveforms into w voltage steps. A model with p parameters has p−1 significant eigenvalues. Consequently, (p−1)·w numbers must be stored. In addition, for a factorial experimental design, the model produces a maximum of 2^(p+q) coefficients for each of the p expressions. The resulting worst-case model space complexity for Method 3 is O(p·w + 2^(p+q)·p). For models with at most 2nd-order terms, the model will have O((p+q)^2) terms, resulting in a space complexity of O(p(w + (p+q)^2)). On the other hand, if only linear terms are significant, each of the p expressions has O(p+q) coefficients, resulting in a space complexity of O(p(w + p + q)).

Table 3.9 compares the estimated simulation time complexity per transition entry per input for each cell for Methods 1 and 3. The simulation time complexity of Method 1 is proportional to the table lookup time and the number of stages (s). A table lookup requires a search in each of the p dimensions among the k entries, which has complexity O(p·k). Once the appropriate entry is selected, the delay is computed, which takes into account the q sensitivities and has complexity O(q) if the model is linear. This process is repeated for each of the s stages, resulting in a simulation time complexity of O(s(p·k + q)). For a nonlinear model, the time complexity is O(s·k·(p+q)).

Method 3 requires the evaluation of p expressions, each with at most 2^(p+q) terms, for each of the s stages. This results in a computational complexity of O(s·p·2^(p+q)). However, typical expressions contain at most (p+q) linear terms, corresponding to a simulation time complexity of O(s·p·(p+q)).

Table 3.9. Comparing simulation time complexity of methods for a cell (per delay/transition entry per input).
Method                             Complexity
Tabular (general case)             O(s·k·(p+q))
Tabular (linear case)              O(s(p·k + q))
FF                                 O(s·p·2^(p+q))
FF2, LHC, LHCQ (2nd-order case)    O(s·p·(p+q)^2)
FF, FRF, LHC (linear case)         O(s·p·(p+q))

Table 3.10 compares the estimated characterization time complexity per transition

entry per input for each cell for Methods 1 and 3. The characterization time complexity

of Method 1 is proportional to the number of simulations needed to obtain each number

in its lookup table and hence is the same as the model’s space complexity.

Table 3.10. Comparing characterization time complexity of methods for a cell (per delay/transition entry per input).
Method                             Complexity
Tabular (general case)             O(k^(p+q))
Tabular (linear case)              O(q·k^p)
FF                                 O(w^3 + w^2·n + p·n·ln n)
FF2, LHC, LHCQ (2nd-order case)    O(w^3 + w^2·n + n·(p+q)^4 + p·(p+q)^6)
FF, FRF, LHC (linear case)         O(w^3 + w^2·n + n·(p+q)^2 + p·(p+q)^3)

Method 3 requires several steps. First, n simulations are performed. For Latin hypercube experimental designs, n is arbitrary. For full-factorial experimental designs, n = 2^(p+q). The number of simulations for a fractional-factorial experimental design is a function of its resolution. For a model with up to 3rd-order terms, we require n = O((p+q)^3) simulations. The simplest fractional-factorial design with only linear terms requires O(p+q) simulations.

The simulations result in wn points to be analyzed by PCA. The generation of the appropriate w-dimensional covariance matrix and its eigendecomposition using SVD have computational costs of O(w^2·n) and O(w^3), respectively. Iterative methods exist that avoid forming the covariance matrix; they reduce the computational cost to O(r·w) per iteration, where r is the number of retained principal components, plus O(w·2^(p+q)) to project the full-factorial dataset onto the principal component basis.

To develop the p expressions, we analyze the resulting PC-domain data. The factorial models require that we determine the model for each of the p expressions at a cost of O(p·n), sort the resulting dataset to find the significant factors at a cost of O(p·n·ln n), and select the significant factors at a cost of O(p·n). The dominant terms are shown in Table 3.10.

Latin hypercube experimental designs use regression to find the models for each of the p expressions. Assembling the regression matrices with at most 2nd-order terms requires O(n·(p+q)^4) operations, and solving for the p models involves O(p·(p+q)^6) operations. On the other hand, if we had limited the models to include only linear terms, the characterization cost to set up and solve the regression matrices would be O(n·(p+q)^2 + p·(p+q)^3). The resulting overall computational cost is shown in Table 3.10.

It can be seen from Table 3.10 that Method 1 (linear case) is linear in q in characterization time complexity, while Method 3 with a full-factorial design is exponential in q. However, characterization is done only once for a cell library. Model users are only impacted by space and simulation time complexity. In addition, if a fractional experimental design [43], rather than a full-factorial experimental design, were performed to generate the dataset, characterization would be polynomial in q.

Method 1 is exponential in model space and characterization time complexity as a function of p, which limits the discretization of the space that describes the input waveforms (slope and fanout), while Method 3 is not. As a result, for Method 3, memory usage does not increase rapidly with increasingly accurate waveform representations.

Moreover, as we add more parameters q, Method 1 requires k^p more entries for each additional parameter, while Method 3 requires only p additional entries in the linear case. Therefore, memory usage does not increase as rapidly for Method 3 as the number of parameters increases.

3.7. Conclusions

We proposed a method to develop compact models of standard cells for static timing analysis, enabling accurate characterization over variations in input waveform characteristics, output loading, process parameters, and the environment (temperature and power supply voltage). Our approach, involving equation-based cell characterization in combination with PCA waveform modeling, offers improved handling of a high-dimensional parameter space by reducing memory requirements. The compact models enable a variety of statistical experiments, including efficient Monte Carlo analysis of the impact of within-die variation on delay and of the impact of various temperature profiles and variations in the power supply voltage on delay.

To illustrate the impact of our variational waveform modeling, in combination with

equation-based cell characterization incorporating parameter variations, the accuracy and

efficiency of the method have been evaluated in comparison with Hspice and the tabular

method. Run-times and accuracy are comparable with the tabular method, while memory

usage is improved.

We explored several possible strategies to construct a variational waveform model and a cell model based on our methodology. For the variational waveform model, three approaches to PCA were compared; the results indicate that the SSM and SNM methods offer better accuracy for modeling a variational waveform. For the cell model, several choices of experimental designs were considered because the accuracy of a cell model depends on the sampling method employed for cell characterization, and we want the cell model to be accurate over parameter variations. Simulation of the inverter tree showed that the accuracy of Latin hypercube designs depends on the sample size and the locations of the samples, while factorial designs were in general more stable and performed better because such models are built to be accurate at the corners of the parameter space. In contrast, Latin hypercube designs explore just a subset of the parameter space, and it is possible that a combination of parameters falls outside the range of the original quasi-random dataset that was used to build the model. This can result in inaccurate extrapolation. Moreover, the results showed that fractional-factorial designs with up to two-factor interactions offer better memory usage, reduce the computational cost, and provide accuracy almost as good as that of full-factorial designs with all interactions.

CHAPTER IV

EXTENDING AND ENHANCING THE METHODOLOGY

We devised a methodology for constructing a compact variational waveform model and for using it to build a compact variation-aware timing model for a basic cell in a cell library, i.e., an inverter. We showed the feasibility of our methodology in [55]-[59]. With minor modifications, the methodology can be extended to other cells.

We want to extend and enhance the methodology to push it to its limits. First, we used the most recent technology accessible to us at the outset, but we want to show that the methodology is applicable to newer technologies as well. Second, we started with a moderate range of variation for the process parameters, but we want to demonstrate that we can build the models to work within the maximum range of variation supported by their circuit-level models. Third, we assumed purely capacitive loads for the models, which makes our models applicable to resistive-capacitive loads through the effective capacitance method; however, since the effective capacitance method loses accuracy for interconnect-dominant circuits, we want to extend the methodology to support resistive-capacitive loads directly. Finally, we want to look at all possible ways to increase the accuracy of our methodology by categorizing the accuracy improvement methods.

This chapter is organized as follows. Section 4.1 explains how to construct a cell model for a deep submicron technology. Section 4.2 demonstrates how to construct a cell model for very large parameter variations. Section 4.3 describes constructing our cell model for resistive-capacitive loads. Section 4.4 lists and briefly discusses our accuracy improvement methods.


4.1. Constructing a Cell Model for Deep Submicron Technology

Our waveform and cell models are statistical models obtained by statistical analysis of the waveform data generated for standard cells. The accuracy of our models therefore depends on the accuracy of the transistor model used to generate the waveform data. We developed our methodology and constructed our waveform and cell models using TSMC180RF technology, which was the submicron technology we had access to and which was supported by our CAD tools. The minimum transistor channel length for TSMC180RF is 180 nm. We want to make sure the methodology is applicable to deep-submicron technologies, where the minimum transistor channel length is 100 nm or lower [60].

The transistor models for deep-submicron technologies must model the physical effects of the sub-100 nm regime more accurately. Since our high-level statistical waveform and cell models are built from the behavior of the circuit-level transistor models, our cell models are affected indirectly, as we use samples of an output-variable response surface to construct them. Moreover, process variation becomes more pronounced as technology scales down because it becomes more difficult for semiconductor process control to keep pace with the tighter limits on process variation.

In this section, we discuss the support for parameter variation in the circuit-level transistor models underlying our high-level statistical cell models and present the parameter selection criteria for our cell models. We explain how to determine the largest possible range of variations over which our cell models maintain their functionality as logic gates in Section 4.2. Then, we construct our cell models for the largest possible range of variations of a deep-submicron technology supporting resistive-capacitive loads in Section 4.3.

4.1.1. More Accurate Transistor Models with More Parameters

We use FreePDK45 (NCSU 45 nm) technology [61], with a 45 nm minimum transistor channel length, as our deep-submicron technology to assess our methodology because we did not have access to any commercial 45 nm technology files. FreePDK45 is available for academic purposes from NanGate Inc. [62]. In the technology files that we use, transistors are modeled by BSIM (Berkeley Short-Channel IGFET Model) [63]. The values of the transistor parameters in the FreePDK45 technology files are based on the Predictive Technology Model (PTM) [64].

BSIM contains more than 200 parameters; however, most of them are related to secondary effects. TSMC180RF uses BSIM3v3 (Hspice Level 49), while FreePDK45 [61] uses BSIM4 (Hspice Level 54). BSIM3v3 is an industry-wide standard for modeling deep-submicron Metal Oxide Semiconductor Field Effect Transistors (MOSFETs) [63]. BSIM4 is an extension of the BSIM3 model and addresses MOSFET physical effects into the sub-100 nm regime [65]. From the reference, we list just the enhancements that can affect timing analysis:

(a) An accurate new model of the intrinsic input resistance (bias-dependent gate resistance model) for RF, high-frequency analog, and high-speed digital applications;

(b) A comprehensive and versatile geometry-dependent parasitics model for various source/drain connections and multi-finger devices;

(c) Asymmetrical and bias-dependent source/drain resistance, either internal or external to the intrinsic MOSFET at the user's discretion;

(d) Acceptance of either the electrical or the physical gate oxide thickness as the model input, in a physically accurate manner;

(e) The quantum-mechanical charge-layer-thickness model for both IV and CV;

(f) A more accurate mobility model for predictive modeling;

(g) Different diode IV and CV characteristics for source and drain junctions;

(h) The dielectric constant of the gate dielectric as a model parameter;

(i) A new scalable stress-effect model for the process-induced stress effect; device performance thus becomes a function of the active-area geometry and the location of the device in the active area;

(j) A unified current-saturation model that includes all mechanisms of current saturation: velocity saturation, velocity overshoot, and source-end velocity limit;

(k) A new temperature model format that allows convenient prediction of temperature effects on saturation velocity, mobility, and S/D resistances.

We investigated which of the above enhancements are used in FreePDK45 for NMOS_VTL and PMOS_VTL, the NMOS and PMOS transistor models, respectively. Enhancements (b), (d), (e), (h), and (k) are used; the rest are not employed. Consequently, with respect to the absent parameters, the model should behave like an older model such as BSIM3v3.2 [65].

To determine which enhancements are used, we checked for the presence of the related parameters in the technology files. Table 4.1 shows the new or enhanced BSIM4.5.0 parameters used in FreePDK45. For parameters that are not used, the value shown in parentheses in the second column indicates the older model to which the setting corresponds. The enhancement columns have three sub-columns: whether the enhancement is utilized in the FreePDK45 models, the letter of the enhancement in the list above, and the category, which indicates how the parameter affects the accuracy of timing analysis. Parasitics, transistor parameters, and temperature all influence the accuracy of timing analysis.

Table 4.1. New or enhanced BSIM4.5.0 model parameters used in FreePDK45.

Parameter Name | Description (Value used [Similar to]) | Used | Letter | Category
GEOMOD | Geometry-dependent parasitics model | Yes | b | Parasitics
TOXE | Electrical oxide thickness | Yes | d, e | Parasitics
TOXP | Physical gate equivalent oxide thickness | Yes | d, e | Parasitics
EPSROX | Gate dielectric constant relative to vacuum | Yes | h | Parasitics
UA | Coefficient of first-order mobility degradation due to vertical field | Yes | k | Mobility
UB | Coefficient of second-order mobility degradation due to vertical field | Yes | k | Mobility
UC | Coefficient of mobility degradation due to body-bias effect | Yes | k | Mobility
UA1 | Temperature coefficient for UA | Yes | k | Mobility
UB1 | Temperature coefficient for UB | Yes | k | Mobility
UC1 | Temperature coefficient for UC | Yes | k | Mobility
RGATEMOD | Gate resistance model selector (1 [BSIM3; constant resistance]) | No | a | Parasitics
RDSMOD | Bias-dependent source/drain resistance model selector (0 [BSIM3; Rds modeled internally through IV equations]) | No | c, k | Parasitics
KU0 | Process-induced stress effect | No | i | Mobility
MOBMOD | Mobility model selector (0 [BSIM3v3.2]) | No | f | Mobility
DIOMOD | Source/drain junction diode selector (1 [BSIM3v3.2]) | No | g | I-V curve
LAMBDA | Velocity overshoot coefficient | No | j | Velocity
VTL | Thermal velocity | No | l | Velocity
TEMPMOD | Temperature mode (0 [BSIM3v3.2]) | No | k | Temperature

We did not use oxide thickness as one of the parameters in our models, for reasons we present later; we note, however, that changes in oxide thickness can affect the process transconductance (k'n), while the effect through mobility is weaker [63]. BSIM4 allows the mobility-related variations in circuit performance to be captured better, as they are affected by temperature, which is one of the parameters that we use to characterize our statistical cell models.

Table 4.2 shows how these parameters are listed in the BSIM4.5.0 user manual. Of the more than 200 parameters listed in the manual, we list in Table 4.1 just the parameters that we used. Please note that some of the parameters are binnable, which is explained in the next section.

Table 4.2. BSIM4.5.0 model selectors/controllers.

Parameter Name | Description | Default Value | Binnable | Note
GEOMOD | Geometry-dependent parasitics model selector, specifying how the end S/D diffusions are connected | 0 (isolated) | NA | -
TOXE | Electrical gate equivalent oxide thickness | 3.0e-9 m | No | Fatal error if not positive
TOXP | Physical gate equivalent oxide thickness | TOXE | No | Fatal error if not positive
EPSROX | Gate dielectric constant relative to vacuum | 3.9 (SiO2) | No | Typically greater than or equal to 3.9
UA | Coefficient of first-order mobility degradation due to vertical field | 1.0e-9 m/V for MOBMOD=0 and 1; 1.0e-15 m/V for MOBMOD=2 | Yes | -
UB | Coefficient of second-order mobility degradation due to vertical field | 1.0e-19 m^2/V^2 | Yes | -
UC | Coefficient of mobility degradation due to body-bias effect | -0.0465 V^-1 for MOBMOD=1; -0.0465e-9 m/V^2 for MOBMOD=0 and 2 | Yes | -
KU0 | Mobility degradation/enhancement coefficient for stress effect | 0.0 m | No | -
UA1 | Temperature coefficient for UA | 1.0e-9 m/V | Yes | -
UB1 | Temperature coefficient for UB | -1.0e-18 (m/V)^2 | Yes | -
UC1 | Temperature coefficient for UC | 0.056 V^-1 for MOBMOD=1; 0.056e-9 m/V^2 for MOBMOD=0 and 2 | Yes | -
KT1 | Temperature coefficient for threshold voltage | -0.11 V | Yes | -
RGATEMOD | Gate resistance model selector | 0 (no gate resistance) | NA | -
RDSMOD | Bias-dependent source/drain resistance model selector | 0 | NA | Rds(V) modeled internally through IV equation
MOBMOD | Mobility model selector | 0 | NA | -
DIOMOD | Source/drain junction diode IV model selector | 1 | NA | -
LAMBDA | Velocity overshoot coefficient | 0 | Yes | If not given or <=0.0, velocity overshoot will be turned off
VTL | Thermal velocity | 2.05e5 m/s | Yes | If not given or <=0.0, source-end thermal velocity limit will be turned off
TEMPMOD | Temperature mode selector | 0 | No | If 0, the original model format is used; if 1, the new format is used

4.1.2. Non-binned Transistor Model vs. Binned Transistor Model

FreePDK45 uses BSIM4, which models the transistors more accurately than the BSIM3v3 used in TSMC180RF; however, TSMC180RF uses a binned transistor model, while FreePDK45 does not. Binning is done by dividing the 2-dimensional surface of acceptable W (transistor width) and L (transistor length) into several regions; the number of regions is the product of the number of divisions chosen for W and L. Transistor parameter values are determined for each region separately, and the proper region is selected automatically by a Spice simulator depending on the values of W and L. This makes the transistor model more accurate, because it is like having a unique transistor model in each region. Each transistor model is valid for a limited region determined by four parameters in the transistor model: LMIN, LMAX, WMIN, and WMAX, which are the minimum and maximum acceptable transistor channel length and width.
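The bin-selection rule can be sketched in a few lines. The following Python sketch is purely illustrative: the `Bin` class, the bin names, and the numeric bounds are hypothetical, and only LMIN/LMAX/WMIN/WMAX correspond to actual model-card parameters.

```python
# Simplified sketch of how a Spice simulator picks a binned transistor
# model: each bin covers a (W, L) region bounded by WMIN/WMAX and
# LMIN/LMAX, and the bin containing the instance's W and L is selected.
# The bin data below is illustrative, not from any real model card.

class Bin:
    def __init__(self, name, lmin, lmax, wmin, wmax):
        self.name = name
        self.lmin, self.lmax = lmin, lmax   # channel-length bounds (m)
        self.wmin, self.wmax = wmin, wmax   # channel-width bounds (m)

    def contains(self, w, l):
        return self.lmin <= l < self.lmax and self.wmin <= w < self.wmax

def select_bin(bins, w, l):
    """Return the first bin whose (W, L) region contains the device."""
    for b in bins:
        if b.contains(w, l):
            return b
    raise ValueError(f"no model bin covers W={w}, L={l}")

# Example: a 2x2 binning of the (W, L) plane for a 180 nm process.
bins = [
    Bin("nch.1", 180e-9, 500e-9, 220e-9, 1e-6),
    Bin("nch.2", 180e-9, 500e-9, 1e-6, 100e-6),
    Bin("nch.3", 500e-9, 20e-6, 220e-9, 1e-6),
    Bin("nch.4", 500e-9, 20e-6, 1e-6, 100e-6),
]

print(select_bin(bins, w=2e-6, l=180e-9).name)   # -> nch.2
```

A device outside every bin's (W, L) window is rejected, mirroring the simulator's behavior when W or L falls outside all model cards.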

4.1.3. Support of Symmetric Parameter Variation for All Parameters

TSMC180RF does not allow the transistor model to be simulated for channel length values less than 180 nm; therefore, we considered only the +5% variation in L and ignored the -5% variation in constructing our models. Consequently, the range of channel length values is from 180 nm to 189 nm, with its center at 184.5 nm. Thus, the range of variation in channel length is not symmetric with respect to the default value. If we could have used a symmetric range, the channel length values would have been from 171 nm to 189 nm, with the center at 180 nm.

FreePDK45 allows reducing the channel length down to 35 nm, while the standard cells use 50 nm for the channel length (instead of 45 nm). We can therefore reduce the channel length by 30%, so the range of variation in channel length runs from -30% (35 nm) to +30% (65 nm), centered at 50 nm; the range of variation in channel length is thus symmetric. To verify the accuracy of simulations for values under 45 nm, we contacted Prof. Yu Cao of the Nanoscale Integration and Modeling (NIMO) Group, because FreePDK45 uses a Predictive Technology Model based on BSIM4. According to him, the transistor model remains accurate to first order as we approach 35 nm, which is one of our process extremes. That is, the model can still be used for channel length values under 45 nm, although the accuracy will not be as good as for channel length values over 45 nm.

4.1.4. Variation Parameters Chosen for Cell Models

The transistor models link the transistor-level parameter variations to our statistical waveform and cell models. Our models take into account variation in both process parameters and environmental parameters. To keep the total number of variation parameters minimal, we have included just the channel length and threshold voltage of the transistors to represent process variation in our models; however, there are many more parameters at the transistor level. Reference [63] categorizes these parameters into two groups.

(a) Variations in the process parameters, which include impurity concentration densities, oxide thickness, and diffusion depths. They originate from nonuniform conditions during deposition and/or diffusion of impurities, and they affect transistor parameters such as the threshold voltage and sheet resistances. Table 4.3 lists some of these parameters for BSIM4.5.0 [65]. The last row shows the zero-body-bias threshold voltage, which we used to vary the threshold voltage; other parameters could be used for the same purpose.

(b) Variations in the dimensions of the devices. They result mainly from the limited resolution of the photolithographic process and affect the W and L of the transistors and the width of the interconnect wires. We have included L in the list of our parameters, while W is a constant for each cell. The variation in W does not affect the switching speed of a transistor in the same way as the variation in L; however, the variation in W can affect the load and the parasitics. It is worth mentioning that the variations in W and L are totally uncorrelated, because the first is related to the field oxide step, while the second is determined by polysilicon definition and the source and drain diffusion processes.

Table 4.3. Example BSIM4.5.0 model process parameters.

Parameter Name | Description | Default Value | Binnable | Note
TOXE | Electrical gate equivalent oxide thickness | 3.0e-9 m | No | Fatal error if not positive
TOXP | Physical gate equivalent oxide thickness | TOXE | No | Fatal error if not positive
NDEP | Channel doping concentration at depletion edge for zero body bias | 1.7e17 cm^-3 | Yes | -
NSUB | Substrate doping concentration | 6.0e16 cm^-3 | Yes | -
NGATE | Poly Si gate doping concentration | 0.0 cm^-3 | Yes | -
NSD | Source/drain doping concentration | 1.0e20 cm^-3 | Yes | Fatal error if not positive
VTH0 or VTHO | Long-channel threshold voltage at Vbs=0 | 0.7 V (NMOS); -0.7 V (PMOS) | Yes | -

We have chosen channel length and threshold voltage to represent the process parameters in our statistical models. Our reasons for this selection are as follows.

(a) From the process parameter perspective, the variations in channel length, threshold voltage, and oxide thickness of the transistors affect the timing of a circuit [63]; however, the variations in channel length and threshold voltage play a more significant role in the performance of a circuit.

(b) The variation in channel length is independent of the variation in threshold voltage, because they are determined by different process steps [63].

(c) We need to keep the number of parameters small in our statistical models to make them computationally efficient.

We close this section by presenting the causes of the variations in channel length and threshold voltage. Channel length variation is caused by optical exposure and process control limitations. Threshold voltage variation originates from process control limitations, since the number of dopants in the channel cannot be controlled well enough.

4.2. Constructing a Cell Model for Very Large Parameter Variations

We assumed a reasonable range of variation for all variables at the beginning of the research, as mentioned in the previous chapter. It is instructive to study the effect of very large parameter variations on our variational waveform model and cell model; therefore, we try to construct a cell model compliant with the range of variations predicted by the International Technology Roadmap for Semiconductors (ITRS) [66] for the channel lengths and threshold voltages of the transistors.

We construct our models using an inverter from the Oklahoma State University standard cell library, which is based on the FreePDK45 technology. We used Calibre [67] for DRC, LVS, and parasitic extraction of the inverter layout, because the technology files of FreePDK45 did not support Assura. (We had used Assura for parasitic extraction for our inverter model based on TSMC180RF.) For more information about the cell models that we used, please refer to Appendix B.

To determine the maximum ranges of variation in Vt (Vtn or Vtp in Table 3.1) and L (Ln or Lp in Table 3.1), we swept the range of L and the range of Vt and observed the changes in the voltage transfer curve (VTC) [63] of the inverter. The VTC is based on DC analysis; therefore, the curves for all four combinations of Fanout and Slope map onto a single curve. We nevertheless included these two variables in our experiment design parameters table so as to use exactly the same combinations of variables that we use for cell characterization with transient analysis simulations; their inclusion also makes this table consistent with our other tables of variation model parameters. Figure 4.1 shows the VTC curves for our inverter based on the FreePDK45 technology when we varied all the parameters using a 2-level full-factorial design based on Table 4.4. The horizontal and vertical axes are the input voltage and output voltage in volts, respectively. We observe that the VTC's are affected by the parameter variations. We will show later that increasing the range of variations in the parameters can make the VTC unacceptable for operation as an inverter.

Table 4.4. Variation model parameters for our VTC's (FreePDK45).

Variable | Variation
Lp | -20% to 20%
Ln | -20% to 20%
∆Vtp | -5% to 5%
∆Vtn | -5% to 5%
∆T | 0˚C to 70˚C
Fanout | 1 to 9
Vdd | -10% to 10%
Slope | 10 ps to 3 ns
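A 2-level full-factorial design over these parameters simply enumerates every combination of the two extreme values of each variable. A minimal sketch (the ranges come from Table 4.4; the variable names and their encodings are ours):

```python
# Enumerate a 2-level full-factorial design: every combination of the
# low/high extremes of each variation parameter (ranges from Table 4.4).
from itertools import product

levels = {
    "Lp":     (-0.20, 0.20),    # relative variation in PMOS channel length
    "Ln":     (-0.20, 0.20),    # relative variation in NMOS channel length
    "dVtp":   (-0.05, 0.05),    # relative variation in PMOS threshold voltage
    "dVtn":   (-0.05, 0.05),    # relative variation in NMOS threshold voltage
    "dT":     (0.0, 70.0),      # temperature (deg C)
    "Fanout": (1, 9),
    "Vdd":    (-0.10, 0.10),    # relative variation in supply voltage
    "Slope":  (10e-12, 3e-9),   # input ramp transition time (s)
}

names = list(levels)
design = [dict(zip(names, combo)) for combo in product(*levels.values())]

print(len(design))        # 2^8 = 256 corner simulations
print(design[0])          # the all-low corner
```

Each dictionary in `design` would parameterize one DC (or transient) simulation corner.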

On a VTC, as we increase Vin, the first point at which the slope of the curve is -1 is VIL (the maximum acceptable low voltage at the input), and the second point at which the slope is -1 is VIH (the minimum acceptable high voltage at the input); between VIL and VIH the signal is undefined. VOL is the minimum value of Vout (the output voltage) on the VTC, and VOH is the maximum value of Vout on the VTC. Reference [63] provides more details on the relationship among VIL, VIH, VOL, and VOH.
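These quantities can be extracted numerically from a sampled VTC. The sketch below illustrates the definitions on a hypothetical tanh-shaped inverter transfer curve (a stand-in, not a simulated curve); the extraction logic is the point.

```python
# Extract VIL, VIH, VOL, VOH, and the noise margins from a sampled VTC.
# The tanh-shaped curve below is a hypothetical stand-in for a simulated
# inverter VTC.
import math

def vtc_metrics(vin, vout):
    voh, vol = max(vout), min(vout)
    vil = vih = None
    for i in range(len(vin) - 1):
        # Finite-difference slope of the VTC between adjacent samples.
        slope = (vout[i + 1] - vout[i]) / (vin[i + 1] - vin[i])
        if slope <= -1.0 and vil is None:
            vil = vin[i]             # first point where the slope reaches -1
        if slope <= -1.0:
            vih = vin[i + 1]         # last point where the slope is <= -1
    nml = vil - vol                  # noise-margin-low:  NML = VIL - VOL
    nmh = voh - vih                  # noise-margin-high: NMH = VOH - VIH
    return {"VIL": vil, "VIH": vih, "VOL": vol, "VOH": voh,
            "NML": nml, "NMH": nmh}

# Hypothetical VTC: Vdd = 1.1 V, switching threshold at Vdd/2.
vdd = 1.1
vin = [i * vdd / 1000 for i in range(1001)]
vout = [vdd / 2 * (1 - math.tanh(8 * (v - vdd / 2) / vdd)) for v in vin]
m = vtc_metrics(vin, vout)
print({k: round(v, 3) for k, v in m.items()})
```

Process variation shifts the curve, moving VIL/VIH inward and VOL/VOH away from the rails, which is exactly the noise-margin degradation discussed below.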

Figure 4.1. VTC’s are affected by variations of 20% and 5% in L and Vt, respectively.

We used a 2-level full-factorial design to plot the set of voltage transfer curves, taking into account the variations in Lp, Ln, Vtp, Vtn, Vdd (supply voltage), and ∆T (temperature). When we increase the variation in Vt and/or L beyond a limit, the inverter does not function properly, because the level of the output cannot reach within an acceptable distance of the supply voltage (Vdd) or ground (zero). We observe that VOL (VOH) is far from 0 (Vdd) for some cases with parameter variations, and that process variation decreases the noise-margin-low (NML = VIL - VOL) and the noise-margin-high (NMH = VOH - VIH). In a good design, the goal is to preserve a full rail-to-rail swing and to keep VOH = Vdd and VOL = 0; however, we choose the acceptable level of the output to be within 10% of the target value to take into account the effect of process and environmental variations. That is, we consider a voltage swing of 80% of the supply voltage acceptable for the output in the presence of process and environmental variations, although the ideal swing would be 100% of the supply voltage, which guarantees the correct operation of the inverter. The inverter does not switch properly for some combinations in the factorial design; therefore, the ranges of the parameters must be adjusted to make all the VTC's acceptable. We increased the variation in L in 5% increments from 0% to 30% while setting the variation in Vt to 5%. The results indicated that the maximum acceptable variation in L for correct operation of the inverter is 20%. Similarly, when we set the variation in L to 20% and increased the variation in Vt from 5% to 20% and then to 30%, the maximum acceptable range of variation in Vt was 20%. Our original target range of variation was 30% for both Vt and L, but the inverter does not function properly when we pass the 20% limit for both Vt and L at the same time; therefore, to be practical, we reduce our target from the ideal 30% predicted in the ITRS to 20%.

We illustrate with an example how process variation can interfere with the operation of an inverter and prevent its correct operation. Figure 4.2 shows the output response of the inverter, based on a transient analysis using Hspice, for all possible combinations of the parameters of Table 4.4, with one exception: we swept the variation in L from 5% to 10%, and then to 20%, 30%, and 40%. The output waveforms for these cases are shown in cyan, yellow, pink, green, and red, respectively. In the plot, the vertical axis is the input or output voltage in volts, and the horizontal axis is time in nanoseconds. The input waveform is a saturated ramp with both of the extreme values for the slope from the table. The supply voltage also exercises the two extreme cases, i.e., 0.99 and 1.21 volts, from the table. The actual input waveform (in blue) is not clearly visible because it is covered by the output waveforms; our purpose is only to show how the output waveforms are affected when we sweep the variation in L. In the plot, we observe that the level of the output is affected for the cases with 30% and 40% variation in L, which makes the output swing unacceptable for the correct operation of the inverter at each of the two supply voltage levels. For all other cases, with the range of variation in L equal to or less than 20%, the levels of the output swing fall within 10-90% of the supply voltage, which is acceptable for our cell models, although a good design targets a 100% rail-to-rail swing. Figure 4.3 shows the input transitions and corresponding output transitions for the first set of transitions, resulting from applying the first pulse in Figure 4.2. In the figure, we observe that the output transitions do not cover the full range of the supply voltage, at each of the two supply voltage levels, in the cases where the range of variation in L is more than 20%. Figure 4.4 depicts a plot similar to that in Figure 4.1, but it shows the VTC's when the variations in L and Vt are 30% and 5%, respectively. Consequently, the inverter does not function correctly.


Figure 4.2. Output swings are affected by process parameter variations.

Figure 4.3. Output transitions are affected by parameter variations (Input transitions are the saturated ramps in blue).

Having shown the case in which increasing the range of variation in L is the major cause of the problem, we show in Figure 4.5 another plot, in which the variations in L and Vt are 5% and 90%, respectively. This is the case in which increasing the range of variation in Vt is the major source of the problem, and again the inverter does not function correctly. Based on a series of simulations changing the ranges of variation in L and Vt, we determined that the safe range of variation in both L and Vt is 20%. Figure 4.6 depicts the VTC curves for this case.

Figure 4.4. VTC’s are affected adversely by increasing the variation in L to 30% while keeping the variation in Vt at 5%.

As with our inverter model based on TSMC180RF, we use 5% as the range of variation in Vt and L to construct our original models for FreePDK45. We choose 20% as the range of variation in Vt and L to build our models with a "large range of variation". We will compare the accuracy of the two PC waveform models in Section 4.4.


Figure 4.5. VTC’s are affected adversely by increasing the variation in Vt to 90% while keeping the variation in L at 5%.

Figure 4.6. VTC’s are acceptable when the variations in L and Vt are set to 20%.

Tables 4.5, 4.6, 4.7, and 4.8 show the ranges of variation in all the parameters for an inverter based on FreePDK45. They will be used in Section 4.4, where we explore the effect of subranging and large variations on the accuracy of our models, but we have included them here to introduce the basic idea. In Table 4.5, the range of variation in all the parameters except Slope is chosen to be the same as what we used for the inverter based on TSMC180RF. This helps us evaluate our waveform and cell construction methodology, which was discussed in Chapter III, for FreePDK45 using the same variation model parameters that we used for TSMC180RF. In Table 4.6, the range of variation in the threshold voltages and channel lengths is increased from (-5% to 5%) to (-20% to 20%), which is denoted by "large variations" in its caption. In Tables 4.7 and 4.8, the ranges of Fanout and Slope have been reduced, i.e., "subranged", to evaluate whether using a subrange of these two parameters can improve the accuracy of our waveform and cell models.

Table 4.5. Variation model parameters (FreePDK45).

Variable | Variation
Lp | -5% to 5%
Ln | -5% to 5%
∆Vtp | -5% to 5%
∆Vtn | -5% to 5%
∆T | 0˚C to 70˚C
Slope / [PC1, PC2] / [L, Θ] | 10 ps to 3 ns (for slope); dataset range (otherwise)
Fanout | 1 to 65
Vdd | -10% to 10%

Table 4.6. Variation model parameters (FreePDK45 - large variations).

Variable | Variation
Lp | -20% to 20%
Ln | -20% to 20%
∆Vtp | -20% to 20%
∆Vtn | -20% to 20%
∆T | 0˚C to 70˚C
Slope / [PC1, PC2] / [L, Θ] | 10 ps to 3 ns (for slope); dataset range (otherwise)
Fanout | 1 to 65
Vdd | -10% to 10%

Table 4.7. Variation model parameters (FreePDK45 - subrange).

Variable | Variation
Lp | -5% to 5%
Ln | -5% to 5%
∆Vtp | -5% to 5%
∆Vtn | -5% to 5%
∆T | 0˚C to 70˚C
Slope / [PC1, PC2] / [L, Θ] | 10 ps to 300 ps (for slope); dataset range (otherwise)
Fanout | 1 to 9
Vdd | -10% to 10%

Table 4.8. Variation model parameters (FreePDK45 - subrange & large variations).

Variable | Variation
Lp | -20% to 20%
Ln | -20% to 20%
∆Vtp | -20% to 20%
∆Vtn | -20% to 20%
∆T | 0˚C to 70˚C
Slope / [PC1, PC2] / [L, Θ] | 10 ps to 3 ns (for slope); dataset range (otherwise)
Fanout | 1 to 9
Vdd | -10% to 10%

4.3. Constructing a Cell Model for Resistive-Capacitive Loads

We want to extend our methodology to support resistive-capacitive loads. The methodology that we developed for our compact variation-aware cell models [55]-[59], in Chapter III, is based on the assumption of purely capacitive loads for the cells, whereas interconnects can make the loads resistive-capacitive. For interconnect-dominant logic circuits, such a solution can reduce the accuracy of timing analysis because (a) it takes into account just the effect of the capacitive component of the cell's load and ignores the effect of its resistive component, or (b) it provides inaccurate delay estimates even when the effective capacitance [68] concept is used to adjust the capacitive component to reflect, to some degree, the existence of the resistive component.

While interconnect networks also have inductive components, for our timing analysis purposes, with a focus on high-frequency signals, we limit our discussion to RC-interconnect networks [69] rather than RLC-interconnect networks [69].

We represent voltage transfer functions and impedances (or admittances) in the Laplace domain throughout this section. Moreover, the terms "cell" and "gate" are used interchangeably.

This section is organized as follows. Section 4.3.1 describes how a resistive-capacitive load can model each resistive-capacitive interconnect network to be included in our cell characterization. Section 4.3.2 shows how a resistive-capacitive interconnect network can be mapped into a Pi-model [69]. Section 4.3.3 covers the process of our cell characterization with a Pi-model load and the RC-interconnect networks. Section 4.3.4 summarizes the RC-interconnect network characterization methods and presents our choice. Section 4.3.5 describes our test circuit and its Pi-model-converted RC-interconnect networks. Our timing analysis engine is discussed in Section 4.3.6. Section 4.3.7 is dedicated to the timing simulation and analysis of the results. Section 4.3.8 concludes Section 4.3.

4.3.1. Timing Characterization of Complex Loads

The signal propagation delay from one gate to another has two components: (a) the delay of the gate driving the RC-interconnect network at its output and (b) the delay of the interconnect network to its own output, which can be the input of the next-stage gate.

The delay of a gate is a function of its input waveform transition and its output load; the resistance of the interconnect at its output causes the gate to see a smaller capacitance, which is called the effective capacitance [68]. To find the propagation delay from one gate to another, the delay of the gate, which is a function of the RC-interconnect itself, must be added to the delay of the interconnect [69].
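For the interconnect component of this delay, a standard first-order estimate is the Elmore delay. The sketch below, for a simple RC chain, illustrates that classical estimate under our own example values; it is not the thesis's timing engine.

```python
# First-order (Elmore) delay estimate for an RC chain driven at one end:
# for a chain whose output is the far node, tau = sum_k C_k * sum_{j<=k} R_j,
# i.e. each node's capacitance is weighted by the resistance between it
# and the driver. Used here only as an illustrative interconnect-delay
# estimate; the gate delay is computed separately by the cell model.

def elmore_delay_chain(rs, cs):
    """Elmore delay of an RC chain: rs[i] feeds node i, which loads cs[i]."""
    assert len(rs) == len(cs)
    delay, r_total = 0.0, 0.0
    for r, c in zip(rs, cs):
        r_total += r                 # resistance from the driver to node i
        delay += r_total * c
    return delay

# Two-segment example: R1 = 100 ohm, C1 = 10 fF, R2 = 200 ohm, C2 = 5 fF.
tau = elmore_delay_chain([100.0, 200.0], [10e-15, 5e-15])
print(tau)   # 100*10fF + (100+200)*5fF ~ 2.5e-12 s
```

The total stage delay is then the gate delay (from the cell model, with the load it sees) plus an interconnect delay of this kind.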

Resistive-capacitive loads are, in fact, complex loads. Since the resistance is the real part of the (complex) impedance, it can be ignored when it is very small in comparison with the imaginary part, which makes the load almost purely capacitive. The concept of the effective capacitance is based on this rule: it treats the load as almost purely capacitive, but the value of the capacitance seen is adjusted to give a "good" approximation of the complex impedance by considering just its "adjusted" imaginary part. However, this approximation loses its accuracy as the value of the resistance, the real part of the impedance, increases.

There are different levels at which the load of a logic gate can be modeled, as shown in Figure 4.7. The load of a gate is actually the complex load of the RC-interconnect network connected to it, including the input capacitance of the next level of logic, as shown in Figure 4.7.a. We can use: (a) the original RC-interconnect network, (b) a Pi-model [69],[70] approximation, as shown in Figure 4.7.b, or (c) an effective-capacitance (Ceff) approximation. We call the input admittances of networks (a), (b), and (c) Y(s), Y'(s), and Y''(s), respectively. The input voltages of the RC-interconnect networks in the figure, which are the output voltages of the gate, are denoted Vin(s), Vin'(s), and Vin''(s), respectively. The output voltages of the RC-interconnect networks, which are the voltages across the input capacitance of the next-stage logic, are denoted Vout(s), Vout'(s), and Vout''(s), respectively. Since Y'(s) and Y''(s) are approximations of Y(s), the input admittance of the RC-interconnect network, Vin'(s) and Vin''(s) are approximations of Vin(s), and Vout'(s) and Vout''(s) are approximations of Vout(s). Depending on the level of accuracy needed for timing analysis, different models can be used, which will affect the simulation time and accuracy. From the highest to the lowest level of accuracy (even at the circuit level), the models rank as follows: (a), (b), and (c).

Figure 4.7. For a logic gate, there are different levels for load modeling: (a) original RC-interconnect network, (b) Pi-model, and (c) Ceff-model.

Level (a) models can be described in the Detailed Standard Parasitic Format (DSPF) [71], in which the detailed parasitics are represented in Spice format. Level (b) models can be described in the Reduced Standard Parasitic Format (RSPF) [71], because the parasitics are represented in a reduced format. Both level (a) and level (b) models can be simulated using Spice-like simulators. On the other hand, the Standard Parasitic Exchange Format (SPEF) [71] allows representation of only the detailed parasitics (and not the gates). Level (c) models can be used in high-level timing analysis tools by adjusting the value of the capacitive load to compensate for the existence of a small resistance in resistive-capacitive loads. Clearly, the Ceff capacitance becomes a poor approximation for interconnect-dominant networks, in which the resistance in the complex impedance is significant.

In the original RC-interconnect network (a), the load of the next-stage logic is actually considered a part of the interconnect itself. Since the input capacitive load of a gate is variable and is a function of the transition time, an average value must be used as an approximation of the input capacitance of the gate when the gate input capacitance is replaced by a capacitor. In circuit-level simulation, we do not necessarily need to do this, but it is required if we want to map the RC-interconnect network to an approximate model and calculate an input admittance and a voltage transfer function.

In the Pi-model (b), C1 is connected directly to the output of the gate, C2 is connected through R to C1, and the second terminals of both C1 and C2 are grounded. Vin'(s), the voltage across C1 (and not C2, which is connected to the terminal node in the Pi-model), is applied, after passing through a 1-level low-pass RC filter, through a voltage-controlled voltage source with unit gain to the original output of the interconnect to produce Vout'(s). The values of R3 and C3 are determined so as to reflect the necessary delay and slope at the output node of the RC-interconnect network. The reason for using this technique will be better understood when we describe the handling of fanout branches in RC-interconnect networks.

In the Ceff-model (c), the gate output voltage, i.e., Vin''(s), is the same as the RC-interconnect network output voltage, i.e., Vout''(s), because the resistance of the complex load is negligible after adjusting the value of the load to Ceff; hence we see only one capacitor in the model and no resistor at all.

It is theoretically possible to include the effect of a complex impedance in our compact models by adding as many parameters as needed to represent an RC-interconnect network for cell characterization, but this method has two major practical problems: (a) it increases the number of dimensions in our models by the total number of capacitors and resistors in the interconnect network, and consequently increases the characterization time complexity exponentially, and (b) it makes the characterization of different RC-interconnect network topologies necessary, which also increases the characterization time complexity.

Theoretically, if we restrict our models to RC-interconnect networks, the interconnect complex impedance can be modeled as a set of topologies, represented by a set of complex functions T = {T1(.), T2(.), ..., Tk(.)}, each with a set of resistances and capacitances as its parameters. For example, as shown in Figure 4.8, T3(C0, R1, C1, R2, C2) represents an RC-interconnect network with a capacitance C0 cascaded with two low-pass RC filters, R1C1 and R2C2, while T4(C0, R1, C1, R2, C2), with the same set of parameters, represents another RC-interconnect network with a capacitance C0 cascaded with a fanout node with two branches, in which the first branch is a low-pass filter R1C1 and the second is a low-pass filter R2C2, and the output is taken from the non-ground terminal of C2. Note that the other output is not important for us, but the existence of its circuit elements, R1 and C1, affects the input admittance function.

T5 is similar to T4, but the output is taken from the non-ground terminal of C1. While it would be possible to categorize a large number of the most commonly used topologies for a finite number of resistors and capacitors, doing so would be very time consuming.

Figure 4.8. Two RC-interconnect networks with different topologies are mapped to the same simple RC network (i.e., the Pi-model), just with different values for the parameters.

In practice, instead of using the set of interconnect complex functions with many parameters mentioned above, we need a function with just a few parameters. If we can find a way to map the parameters of all the complicated complex functions to a simple function with a few parameters, we have solved the problem. Such a function should approximate all the functions for the different topologies, while its parameters should somehow capture the topology information of each original function.

Different topologies of the interconnect can be represented by multi-port networks, but for our timing-analysis purposes we need to know only the impedance (or admittance) seen from each port; therefore, a 1-port model should be constructed whose impedance (or admittance) is ideally equal to that seen from the specific port of the multi-port network. Fortunately, there are standard methods to reduce the order of such RC networks, referred to as model order reduction (MOR) methods [69], but they can only provide a good approximation of the original interconnect complex impedance.

For example, we can map the set T into a set of reduced-order models (ROMs), which we represent as M = {M1(.), M2(.), ..., Mk(.)}. We define the M(.) functions with just three parameters, C1, R, and C2, which correspond to the capacitances and resistance of a Pi-model [70]. Hence, we will have a function of the form M(C1, R, C2).

Here, the actual load is C1 cascaded with a low-pass RC2 filter; this structure, called a Pi-model, is shown on the right side of Figure 4.8. The values of C1, C2, and R of the Pi-model (i.e., C1', C2', and R', or C1'', C2'', and R'') in the figure are determined based on the original network topology and the values of the capacitances and resistances of the network.

The Pi-model can be used in a circuit-level simulator for timing analysis. The reduced standard parasitic format (RSPF), one of the standard formats for representing parasitics for post-layout timing analysis, is based on a Pi-model, in which the parasitics extracted from a layout are mapped into the C1, R, and C2 parameters [71]. For example, Figure 4.9 shows an RC-interconnect network comprised of three RC-interconnect segments, in which each segment is zero or one capacitor cascaded with one or more RC low-pass filters. This is an example of an RC network with two fanout branches. Here, the load is modeled by a Pi-model that incorporates all interconnect segments (e.g., INT0, INT5, and INT6); however, the voltage at the end of each output interconnect segment is determined by passing the input voltage through the interconnect network, and the signal is shaped by a low-pass filter (e.g., R5C5 or R6C6) to accommodate the slope and delay changes of the signal through each specific interconnect segment to the output (e.g., INT5 and INT6). Such a model described in RSPF can be mapped easily to a Spice netlist by including the unit-gain voltage-controlled voltage sources. This means the timing simulation can be performed at a higher level of abstraction of the circuit, but still at circuit level.

Figure 4.9. For an RC-interconnect network with fanout branches, the load is modeled by a Pi-model that incorporates all interconnect segments (e.g., INT0, INT5, and INT6); the voltage at the end of each output interconnect segment is determined by passing the input voltage of the interconnect network through a low-pass filter (e.g., R5C5 or R6C6) to accommodate the slope and delay changes of the signal through each specific interconnect segment to the output (e.g., INT5 and INT6).

We want to perform the timing simulation at a higher level of abstraction than circuit level (i.e., Spice). Therefore, we need to characterize the cells for a complex load. Fortunately, using a Pi-model increases the number of parameters in our models by only two in comparison with the Ceff model that we used in Chapter III. This makes the characterization feasible, since the many parameters (the resistances and capacitances of each interconnect segment) are mapped into just three parameters: C1, R, and C2.

4.3.2. Mapping RC-Interconnect Networks to Pi-Models

There are several standard methods to map each RC network of T with transfer function H(s) to a 2-pole 1-zero approximate transfer function H'(s). We refer here only to the fundamentals of Asymptotic Waveform Evaluation (AWE) [72]. AWE is a general method based on state-space equations and the circuit response. It has the following two steps:

(a) Moment generation: we generate the moments from a circuit.

(b) Moment matching: we match the moments to the simpler model.

Assume H(s) has the following general form:

H(s) = (c0 + c1·s + ... + c(n−1)·s^(n−1)) / (a0 + a1·s + ... + an·s^n)    (4.1)

The numerator and denominator are polynomials in s; c0, ..., c(n−1) and a0, ..., an are the coefficients of the polynomials.

In the first step, we describe H(s) in terms of its moments by a Taylor expansion about s = 0, with a sufficient number of moments, in the following form:


H(s) = m0 + m1·s + m2·s^2 + ...    (4.2)

H(s) is expressed as a power series in s, and m0, m1, m2, ... are called the moments of H(s).

In the second step, we map the above function back to an H'(s) with lower-order polynomials in the numerator and denominator to obtain the MOR. For example, H'(s) can be the MOR of H(s) if we can determine the coefficients of the polynomials in the numerator and denominator so that H'(s) has the same moments as H(s).

H'(s) = (c0' + c1'·s) / (a0' + a1'·s + a2'·s^2)    (4.3)

Since we did not have access to a MOR engine, we built our own MOR engine using Matlab's symbolic math toolbox and implemented the following three methods:

(a) Two-pole approximation with explicit moment matching [73],

(b) Stable 2-pole model based on first three moments [74], and

(c) Stable 2-pole method (S2P) [75].

These methods are not as general as AWE, but they are suitable for our purpose of obtaining H'(s) and Y'(s) with the minimum necessary implementation time, since we need a 2-pole system for a Pi-model.
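To make the moment-matching step concrete, the explicit two-pole matching can be sketched in a few lines. This is our own illustrative Python sketch, not the dissertation's Matlab implementation; the function names are hypothetical, and the denominator is normalized so that a0 = 1:

```python
import numpy as np

def moments_from_tf(c, a, n):
    """First n moments (Taylor coefficients about s = 0) of
    H(s) = (c[0] + c[1]s + ...) / (a[0] + a[1]s + ...)."""
    m = []
    for k in range(n):
        ck = c[k] if k < len(c) else 0.0
        acc = ck - sum(a[j] * m[k - j] for j in range(1, min(k, len(a) - 1) + 1))
        m.append(acc / a[0])
    return m

def match_two_pole(m):
    """Explicit moment matching: fit H'(s) = (c0 + c1 s)/(1 + a1 s + a2 s^2)
    so that its first four moments equal m0..m3."""
    m0, m1, m2, m3 = m[:4]
    # Matching the s^2 and s^3 terms gives two linear equations in a1, a2:
    #   m2 + a1 m1 + a2 m0 = 0   and   m3 + a1 m2 + a2 m1 = 0
    a1, a2 = np.linalg.solve(np.array([[m1, m0], [m2, m1]]),
                             np.array([-m2, -m3]))
    return m0, m1 + a1 * m0, a1, a2   # c0, c1, a1, a2
```

For example, matching the first four moments of H(s) = (1 + 0.5s)/(1 + 3s + 2s^2) recovers the original coefficients exactly, since that H(s) is already a 2-pole 1-zero function.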

We used S2P as the main method because of the guaranteed stability of the resulting model, and we used the two other methods to cross-check the transfer functions. In addition, S2P gives us the ROM of the (input) admittance Y(s) of the RC network as Y'(s), which is a 2-pole 2-zero system and thus different from the voltage transfer function of the RC network, H'(s), which is a 2-pole 1-zero system. The general forms of the approximate Y'(s) and H'(s) are as follows:

Y'(s) = (b0 + b1·s + b2·s^2) / (a0 + a1·s + a2·s^2)    (4.4)

H'(s) = (c0 + c1·s) / (a0 + a1·s + a2·s^2)    (4.5)

Note that Y'(s) and H'(s) are related: they share the same denominator polynomial. We will use this fact to argue that the parameters of the circuit models (the Pi-model and our H'(s) circuit model) that implement these two functions are related and consequently cannot be treated as independent parameters in our 2-level full-factorial designs for generating our target timing models; however, we could not find a straightforward relationship between them. Further research is needed to determine whether such a relationship can be expressed in closed form for all types of RC-interconnect networks.

To generate the moments, as our first attempt, we used Modified Nodal Analysis (MNA) [76] to solve the circuit. MNA is very general, but it suffers from poor scalability because of the large matrices it requires. We then decided to use a method similar to Path Tracing (PT) [69], which is suitable for an interconnect network with a tree topology. A similar method that explicitly builds the impedance (Z) matrix of the circuit was used in RICE [77]; it has a scalability problem because it is expensive to find the inverse of Z, i.e., the admittance matrix, which we need. Because all the interconnect networks in our test circuit were trees, we could use this method, but since we needed the admittance (Y) matrix, we built our own tree-traversal method suited to the specific interconnect tree types in our test circuit.

There are advanced MOR methods with much better scalability; we mention one here just for reference. PRIMA [78]-[80] is a Krylov-based [81] projection MOR method that guarantees stable models. In projection MOR methods, the equations of a circuit (e.g., the MNA equations) are mapped to a set of simpler equations with fewer dimensions, and the new set of equations is solved instead.

In summary, our goal was to find all the T's in symbolic form as the most general solutions. Building such a library could reduce the problem of circuit analysis to a simple function evaluation. We used the first approach (MNA) at the beginning to find MORs using the symbolic calculation engine of Matlab (versions 2010a-2010b); however, the matrix operations were prohibitive given our limited computing resources. Therefore, we arrived at a solution based on the equivalent admittance as a symbolic expression, meaning that the R's and C's in the expression are not substituted with their values. We started by finding the equivalent admittance and performing MOR for a capacitance connected to a low-pass RC filter, which is a Pi-model, to verify feasibility, and we later extended our method to perform MOR for an RC-interconnect segment with many low-pass RC filters in series. This time, however, the complexity of the complete complex admittance functions limited the scalability of the solution because of the symbolic expressions, although we could find a general solution for each interconnect segment as a symbolic expression. Such general solutions enable us to build a library of MOR models for interconnect segments with different lengths.

Using the general solutions, one can find a MOR model simply by plugging the parameters of the interconnect segments (i.e., the R's and C's) into each symbolic expression. The symbolic expressions can grow very large as we increase the number of parameters of the interconnect segments, which makes the computations slow and resource intensive.
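To illustrate what such a symbolic general solution looks like, the driving-point admittance of the T3 ladder of Figure 4.8 can be folded up from the leaf toward the root and expanded into its moments with any symbolic engine. The sketch below uses Python's open-source sympy package rather than Matlab's symbolic toolbox, and covers only this one topology:

```python
import sympy as sp

s, C0, R1, C1, R2, C2 = sp.symbols('s C0 R1 C1 R2 C2', positive=True)

# Driving-point admittance of the T3 ladder: shunt C0 at the root, then a
# series R1 into shunt C1, then a series R2 into shunt C2. We fold the tree
# from the leaf back toward the root.
Y = s * C2                  # leaf capacitor
Y = 1 / (R2 + 1 / Y)        # series resistor R2 in front of it
Y = Y + s * C1              # shunt capacitor C1
Y = 1 / (R1 + 1 / Y)        # series resistor R1
Y = sp.cancel(Y + s * C0)   # shunt capacitor C0 at the driving point

# Admittance moments = Taylor coefficients of Y(s) about s = 0. For an RC
# tree the zeroth moment is 0 and the first moment is the total capacitance.
expansion = sp.series(Y, s, 0, 3).removeO()
m1 = sp.simplify(expansion.coeff(s, 1))
```

Because the R's and C's stay symbolic, the same expression serves every segment of this topology; substituting numeric values (as we eventually did) collapses it to a small rational function that a moment-matching step can reduce to a Pi-model.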

Since substituting values into the complex admittance functions makes their complexity manageable, we substituted the R's and C's with their values in the data structures for each of our RC-interconnect segments (and, later, our RC-interconnect networks) prior to deriving a complex admittance function. Then, we used the moment-matching techniques mentioned above to obtain the equivalent Pi-models. We categorized the interconnect networks that we needed to reduce in order to solve the problem for our specific cases of interconnect tree networks; however, the tool was designed to be expandable to future cases. Although we did not plan to build a tool that characterizes interconnects with respect to their parameter variations, we built the foundation necessary to make variation-aware interconnect models, that is, to expand our compact variation-aware methodology from covering only gates to covering RC-interconnect networks as well.

We built a tool to generate our ROMs for RC-interconnect networks because we did not have access to a MOR engine. To verify the correctness of our MOR models and our MOR engine, we compared the output voltage signal of each RC-interconnect network with that of the corresponding Pi-model. To estimate the output voltage of each RC-interconnect network, we constructed H'(s) as a 2-pole ROM of H(s) as follows:

H'(s) = k1/(s − p1) + k2/(s − p2)    (4.6)

However, the voltage transfer function H'(s) cannot be directly included in a netlist for circuit-level simulation; therefore, we designed a 2-port network with the same transfer function, as shown in Figure 4.10. We describe the circuit-level model details later and show only the transfer function at this point. H'(s) is as follows:

H'(s) = ((x·Rx·Cx + (1 − x)·Ry·Cy)·s + 1) / ((Rx·Cx·Ry·Cy)·s^2 + (Rx·Cx + Ry·Cy)·s + 1)    (4.7)

We used partial fraction expansion to determine the values of Rx, Cx, Ry, Cy, and x from the values of k1, k2, p1, and p2.

We know that the time constant equations of H'(s) are as follows:

T1 = -1/p1 (4.8)

T2 = -1/p2 (4.9)

We used arbitrary values for Cx and Cy and set the values of Rx, Ry, and x so that Equation (4.6) matches Equation (4.7). The corresponding equations are as follows:

Cx = Constant Value (e.g., 1 fF)    (4.10)

Cy = Constant Value (e.g., 1 fF)    (4.11)

Rx = T1/Cx    (4.12)

Ry = T2/Cy    (4.13)

x = ((k1+k2)+p1) /(p1-p2) (4.14)
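A small sketch of this mapping, under the assumption that Cx and Cy are fixed at 1 fF as in Equations (4.10) and (4.11), might look as follows in Python. This is illustrative only; the function names are hypothetical, and the gain x is assigned to the Ry·Cy branch so that Equation (4.14) holds:

```python
def rom_to_circuit(k1, k2, p1, p2, Cx=1e-15, Cy=1e-15):
    """Map a pole-residue ROM H'(s) = k1/(s-p1) + k2/(s-p2) onto the
    two-filter circuit model of Eqs. (4.8)-(4.14). Cx and Cy are free."""
    T1, T2 = -1.0 / p1, -1.0 / p2        # Eqs. (4.8), (4.9)
    Rx, Ry = T1 / Cx, T2 / Cy            # Eqs. (4.12), (4.13)
    x = ((k1 + k2) + p1) / (p1 - p2)     # Eq. (4.14)
    return Rx, Cx, Ry, Cy, x

def h_circuit(s, Rx, Cx, Ry, Cy, x):
    """H'(s) of the resulting circuit model, Eq. (4.7)."""
    num = (x * Rx * Cx + (1 - x) * Ry * Cy) * s + 1
    den = (Rx * Cx * Ry * Cy) * s**2 + (Rx * Cx + Ry * Cy) * s + 1
    return num / den
```

Evaluating `h_circuit` at any complex frequency then reproduces the pole-residue form k1/(s − p1) + k2/(s − p2), which is a quick sanity check on the mapping.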

With five parameters, the general form of H'(s) could be characterized in our compact variation-aware cell methodology to find the output voltage of each RC-interconnect network, while the Pi-model provides the input voltage of the RC-interconnect network. Since these five parameters are related to (and not independent of) the three parameters of the Pi-model, and the relationship can be very complicated and different for different interconnect topologies, we decided to incorporate only the Pi-model in our cell characterization and to include the H'(s) models only in our circuit-level simulations to verify the correctness of our MOR models and our MOR engine. That means our cell model provides the voltage at the input of the RC-interconnect network, and we propagate the signals through the interconnect using timing-characterized models built only for the specific types of RC-interconnect networks in our test circuit.

We emphasize that our research goal is cell characterization, not interconnect characterization. Therefore, we characterized the RC-interconnect networks connecting each pair of gates in our test circuit by simply sweeping the input waveform slope in Hspice, given that we did not have access to an interconnect timing-simulation engine. Such an engine could complement our cell models in a commercial tool implementation. Moreover, the Pi-models could be generated by a MOR engine from a commercial tool, although we built our own engine to show the feasibility of our compact variation-aware model methodology for resistive-capacitive loads.

Figure 4.10 shows the Hspice model that we built to verify the correctness of our H'(s) model. Here, Rx and Cx constitute the lower low-pass RC filter, and Ry and Cy form the upper low-pass RC filter. The lower low-pass RC filter is fed by a voltage-controlled voltage source with gain x, while the upper low-pass RC filter is fed by a voltage-controlled voltage source with gain (1 − x). The outputs of the two filters are added algebraically using a unit-gain voltage-controlled voltage source. H'(s) converts the input voltage of the RC-interconnect network to its output. This conversion can be performed internally by an interconnect simulation engine in a CAD tool, but we built our own engine to generate the H'(s) functions for the RC-interconnect networks in our test circuit. It is important to mention that the voltage-controlled voltage sources prevent the 2-port model from loading Vin'(s) or being loaded by the next-level logic; consequently, the waveform shapes are not distorted by loading.

Figure 4.10. Our H'(s) is a 2-pole and 1-zero reduced order model for H(s).

We verified the correctness of our Pi-model and H'(s) model using a test bench that includes both of the following circuits:

(a) an inverter connected to each RC-interconnect network (e.g., the top picture of Figure 4.9), and

(b) another instance of the inverter with the same set of parameters connected to the equivalent Pi-model and H'(s) model of circuit (a), as shown in Figure 4.10.

We varied the parameters of the circuit using the 2-level full-factorial design of Table 4.9, performed transient analysis on both circuits, and captured the inverter (and interconnect) output waveform transition-time and delay errors (with respect to the original circuit), which are displayed in Table 4.10, and the inverter output waveform transition timing points at 10%, 50%, and 90%, plus the equivalent-resistance errors, as shown in Table 4.11. All errors are relative to the values of the corresponding parameters of circuit (a).

Table 4.9. The 2-level full-factorial design variation parameters for verifying Pi-models and H'(s) models.

Variable | Variation        | Variable | Variation
Lp       | -20% to 20%      | Ln       | -20% to 20%
∆Vtp     | -20% to 20%      | ∆Vtn     | -20% to 20%
∆T       | 0˚C to 70˚C      | C1       | Fixed for each network
Vdd      | -10% to 10%      | R        | Fixed for each network
Slope    | 10 ps to 3 ns    | C2       | Fixed for each network

According to Table 4.10, all the output waveform transition-time errors are less than 1% for both the inverter and the RC-interconnect network. Note that the Tplh and Tphl columns for the RC-interconnect networks actually refer to their input-to-output delay, because the transition directions of the input and output of an RC-interconnect network are the same. For the RC-interconnect networks, the average errors are almost 0%.

In Table 4.11, we compare the inverter output waveform transitions at their 50% and 90% points for a rising output transition and at their 10% and 50% points for a falling output transition. All the average errors are less than 0.07%. All the maximum errors but one are less than 1%. We also compare the equivalent resistance for both rising and falling output waveform transitions of the inverter as a secondary measure of whether the inverter output waveform transition for circuit (a) matches that of circuit (b). This resistance is the average output resistance over the 10-90% range of the inverter output voltage that the inverter sees when connected to the original RC-interconnect network in circuit (a) or to its Pi-model in circuit (b). The average errors are less than 1%; however, the maximum error is about 91%. Since the average is very small, the error(s) must appear in only one or a few of the 2^7 = 128 cases for each of the 11 RC-interconnect networks.

Table 4.10. Comparing output waveform transition-time and delay errors (%) of our Pi-model and H'(s) model for all 11 RC-interconnect networks.

Output of    |         | RiseTime | Tplh (delay) | FallTime | Tphl (delay)
Inverter     | Average | -0.03    | -0.03        | 0.03     | 0.05
Inverter     | Max     | 0.64     | 0.55         | 0.53     | 0.60
Interconnect | Average | 0.00     | 0.00         | 0.00     | 0.00
Interconnect | Max     | 0.29     | 0.31         | 0.02     | 0.02

We also observed that our Hspice simulator, in an effort to minimize simulation time, decreases the resolution. This can put small spikes, as computation noise, on the current through the interconnect or our Pi-model. The spikes were comparable in magnitude to the current and consequently made the instantaneous current very small, causing very large instantaneous resistance values and making the average resistance much larger than it should be. By increasing the simulation resolution, we could avoid most of these incorrect values; however, we allowed Hspice to set its optimum resolution for all 1408 (128 × 11) transient-analysis simulations, and we did not review all 128 inverter output waveform transitions for each of the 11 RC-interconnect networks because the average error was very small.

Table 4.11. Comparing the inverter output waveform transition timing-point errors (%) at the 10%, 50%, and 90% points and the equivalent-resistance errors (%) of our Pi-model and H'(s) model for all 11 RC-interconnect networks.

Output Transition |         | T(10%) | T(50%) | T(90%) | Req(10-90%) (Ω)
Rising            | Average | -      | 0.00   | 0.00   | 0.01
Rising            | Max     | -      | 0.10   | 0.00   | 0.69
Falling           | Average | 0.00   | 0.00   | -      | 0.07
Falling           | Max     | 0.02   | 0.02   | -      | 90.42

We have shown that our Pi-models and H'(s) models are very accurate; therefore, we use the Pi-model to characterize a cell in the next section.

4.3.3. Cell Characterization with a Pi-Model Load

We have solved the problem of mapping the interconnect segments to their corresponding Pi-models. We now show how to perform cell characterization for an inverter with a complex load.

Using a Pi-model for the load enables representing a resistive-capacitive load with only three parameters. In our previous compact variation-aware cell models in Chapter III, we included only one loading parameter among the variation model parameters, namely Fanout, as shown in Table 3.1.

We also decided to use a 1-parameter waveform transition shape, i.e., Slope, instead of our 2-parameter waveform transition shape. This reduces the dimension of the new model by one and shows that our methodology for building a compact variation-aware cell model does not necessarily require a multi-parameter compact waveform model such as our PCA waveform model. Table 4.12 lists the new set of parameters along with their ranges of variation. Lp and Ln are the channel lengths of the PMOS and NMOS transistors. ∆Vtp and ∆Vtn are the variations in the threshold voltages of the PMOS and NMOS transistors. Vdd is the supply voltage. ∆T is the temperature-variation parameter. C1, C2, and R are the capacitances and the resistance of a Pi-model, which approximates the resistive-capacitive load of an RC-interconnect network. Slope is the 0% to 100% transition time of the input saturated-ramp waveform; we measure the 10% to 90% transition time of the output waveform and adjust it to 0% to 100% when we use it.

Table 4.12. Variation model parameters for (resistive-capacitive) Pi-model loads.

Variable | Variation         | Variable | Variation
Lp       | -20% to 20%       | Ln       | -20% to 20%
∆Vtp     | -20% to 20%       | ∆Vtn     | -20% to 20%
∆T       | 0˚C to 70˚C       | C1       | Range of our samples *
Vdd      | -10% to 10%       | R        | Range of our samples *
Slope    | 10 ps to 300 ps   | C2       | Range of our samples *

* C1, R, and C2 range over 0.0896-2.72 fF, 2.87-33.81 Ω, and 3.12-12 fF for our samples. The spread of C1, R, and C2 is available in Appendix D.

We use a method similar to the standard cell characterization method described in Chapter III. Using a 2-level 10-parameter full-factorial design, we define the relationship between the 10 input parameters and each of the output parameters: (a) the cell output-transition time, i.e., RiseTime or FallTime, which is the 10% to 90% output waveform transition time, and (b) the cell delay, i.e., TPLH or TPHL, which is the time difference between the 50% time of the input and the 50% time of the output. The relationship between the input parameters and the output parameters is computed using Yates' algorithm [43], which is very efficient compared to linear regression. We have two sets of the mentioned output parameters: one for the rising output transition, GateRiseTime(.) and GateTPLH(.), and the other for the falling output transition, GateFallTime(.) and GateTPHL(.). We generate the following functions as multi-variable polynomials with 1024 terms (an average and 1023 effects):

GateRiseTime(Vdd, Slope, C1, R, C2, ∆T, Ln, Lp, ∆Vtn, ∆Vtp) = ...    (4.15)

GateTPLH(Vdd, Slope, C1, R, C2, ∆T, Ln, Lp, ∆Vtn, ∆Vtp) = ...    (4.16)

GateFallTime(Vdd, Slope, C1, R, C2, ∆T, Ln, Lp, ∆Vtn, ∆Vtp) = ...    (4.17)

GateTPHL(Vdd, Slope, C1, R, C2, ∆T, Ln, Lp, ∆Vtn, ∆Vtp) = ...    (4.18)

We call these our FF models, but we do not show the actual polynomials here for brevity because they are very long. We simplify the multi-variable polynomials using analysis of variance [43] to include only the most significant terms, up to 2 or 3 factors, to build our FF2 and FF3 models. Table 4.13 shows all of our full-factorial models.

Table 4.13. Designs used for cell model construction with a Pi-model load.

Design         | Points | Model Terms                 | Abbr.
Full Factorial | 2^10   | All factors                 | FF
Full Factorial | 2^10   | Significant up to 3 factors | FF3
Full Factorial | 2^10   | Significant up to 2 factors | FF2
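The core of Yates' algorithm is k passes of pairwise sums and differences over the 2^k responses, which is what makes it so much cheaper than general regression. A small illustrative Python sketch (not our Matlab implementation) is:

```python
import numpy as np

def yates(y):
    """Effects of a 2-level full-factorial experiment via Yates' algorithm.
    y: the 2**k responses in standard (Yates) order, first factor fastest.
    Returns [grand average, effect_1, effect_2, effect_12, effect_3, ...],
    where each effect is the response change from the low to the high level."""
    col = np.asarray(y, dtype=float)
    n = col.size
    k = int(np.log2(n))
    for _ in range(k):
        sums = col[0::2] + col[1::2]    # pairwise sums of adjacent entries
        diffs = col[1::2] - col[0::2]   # pairwise differences
        col = np.concatenate([sums, diffs])
    col[0] /= n          # first entry becomes the grand average
    col[1:] /= n / 2     # remaining entries become effect estimates
    return col
```

For a 2-factor example, `yates([5.5, 8.5, 10.5, 15.5])` returns the average 10 followed by the main effects 4 and 6 and the interaction effect 1, using O(k·2^k) additions instead of a least-squares solve.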

We compared the adequacy of the models using the following criteria:

(a) The sum of squares of the residuals [43], SR^2: these are compared in Table 4.14. The smaller the sum of squares of the residuals, the better the model fits; therefore, the FF models are the best and the FF2 models the worst.

(b) The coefficient of multiple determination [53], R^2, and the adjusted coefficient of multiple determination [53], R^2_adj: these are compared in Table 4.15. The closer the numbers are to 100%, the more adequate the models. While the values for all the models are above 93%, the values for the FF3 models are at least 98%, which is very close to the original FF models.

(c) The coefficient of multiple determination for prediction [53], R^2_predict: these are compared in Table 4.16. The closer the values are to 100%, the better the prediction capability of the models. This criterion should be used when at least one of the terms is not present in the polynomials of the models; therefore, it is not applicable for assessing the original FF models, because all the terms have been used. We indicate this with an * in the table to keep the table consistent with the other tables. The values for all the models based on FF2 and FF3 are above 92%; however, the values for the FF3 models are at least 98%.
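All three criteria can be computed directly from the residuals and leverages of a least-squares fit. The following is our own illustrative Python sketch of the standard definitions (the function name is hypothetical); it also shows why R^2_predict is undefined for a saturated model such as FF, where the leverages reach 1:

```python
import numpy as np

def adequacy(X, y):
    """R^2, adjusted R^2, and prediction R^2 (via the PRESS statistic) for a
    least-squares fit y ~ X @ beta, where X already includes the intercept
    column. For a saturated model (as many terms as runs) the leverages hit 1
    and the PRESS-based prediction R^2 is undefined."""
    n, p = X.shape
    H = X @ np.linalg.pinv(X)                    # hat (projection) matrix
    e = y - H @ y                                # residuals
    ss_res = float(e @ e)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (ss_res / (n - p)) / (ss_tot / (n - 1))
    press = float(np.sum((e / (1.0 - np.diag(H))) ** 2))  # leave-one-out
    r2_pred = 1.0 - press / ss_tot
    return r2, r2_adj, r2_pred
```

Since each leave-one-out residual is at least as large as the ordinary residual, PRESS ≥ SS_res and therefore R^2_predict ≤ R^2, matching the ordering seen in Tables 4.15 and 4.16.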

Table 4.14. Sum of squares of residuals for cell models.

Model | RiseTime | Tplh  | FallTime | Tphl
FF    | 0.000    | 0.000 | 0.000    | 0.000
FF3   | 0.103    | 0.023 | 0.086    | 0.026
FF2   | 0.484    | 0.157 | 0.171    | 0.111

Table 4.15. Comparing the model adequacy of the full-factorial models using the coefficient of multiple determination (%) and the adjusted coefficient of multiple determination (%) in parentheses.

Model | RiseTime | Tplh    | FallTime | Tphl
FF    | 100 (-)  | 100 (-) | 100 (-)  | 100 (-)
FF3   | 99 (99)  | 99 (99) | 98 (98)  | 98 (98)
FF2   | 95 (95)  | 93 (93) | 96 (96)  | 95 (95)

Table 4.16. Comparing the model prediction accuracy of the full-factorial models using the coefficient of multiple determination for prediction.

Model | RiseTime | Tplh | FallTime | Tphl
FF    | *        | *    | *        | *
FF3   | 99       | 99   | 98       | 98
FF2   | 95       | 96   | 97       | 92

We can rank all the full-factorial models by adequacy and prediction accuracy, from best to worst, as FF, FF3, and FF2. Note that the values for FF3 are, in general, much closer to those of FF than the values for FF2.

Accuracy is not the only factor in choosing a model. We also need to consider the size of the models (as a measure of space complexity) as well as their execution times (as a measure of time complexity). Table 4.17 shows the number of terms (i.e., the number of coefficients) in the equation for each function, as well as the number of operations (total additions and multiplications) needed to evaluate it. Studying the table, we observe that the FF model is the poorest in both space complexity and time complexity, while FF2 is the best from both perspectives. FF3 lies between FF2 and FF, but much closer to FF2.

Table 4.17. Number of terms (number of operations) in cell models.

Model | RiseTime    | Tplh        | FallTime    | Tphl        | Total
FF    | 1024 (6144) | 1024 (6144) | 1024 (6144) | 1024 (6144) | 4096 (24576)
FF3   | 87 (296)    | 71 (273)    | 81 (274)    | 59 (196)    | 298 (1003)
FF2   | 37 (101)    | 32 (85)     | 35 (94)     | 24 (63)     | 128 (343)

Therefore, there is a trade-off between accuracy on the one hand and space and time complexity on the other. FF3 seems to be the optimal choice, because it exhibits accuracy very close to that of FF with space and time complexity very close to those of FF2. However, any of the three models can be used, depending on our priorities.

Although we demonstrated the cell characterization for a one-input inverter, the methodology can be generalized to multi-input cells by adjusting the number of input parameters and building the models for each timing arc, which is defined as the path of signal-change propagation from an input to the output.

4.3.4. RC-Interconnect Network Characterization

We showed in Section 4.3.2 how to find H'(s), the MOR model of an interconnect transfer function H(s). In this section, we describe possible approaches to performing RC-interconnect characterization and our selected approach.

In general, given a 2-pole 1-zero H'(s), we can find h'(t) in the time domain as an equation of the following form:

h'(t) = k1·e^(p1·t) + k2·e^(p2·t)    (4.19)

Here, p1 and p2 are the poles of H'(s), and k1 and k2 are the constants of each component. Using this equation, it is possible to obtain the delay of the RC-interconnect network from the analytical step response and ramp response of h'(t); however, the delay equation must be solved using numerical methods, since there is no closed-form formula for it [69]. The step- and ramp-response equations as functions of time are as follows:

vstep(t) = (k1/p1)·e^(p1·t) + (k2/p2)·e^(p2·t)    (4.20)

vramp(t) = (1/tr)·Σ(i=1..2) (ki/pi^2)·[(e^(pi·t) − 1 − pi·t)·u(t) − (e^(pi·(t−tr)) − 1 − pi·(t − tr))·u(t − tr)]    (4.21)

Here, tr is the input-transition time of the saturated-ramp input, and the approximate delay of an RC-interconnect network for an arbitrary waveform transition can be found by mapping the input waveform transition to a saturated ramp.

For example, to find the delay using one of these equations, we solve the equation for the time at which the voltage reaches 50%. We can then find the delay by subtracting the time at which the input voltage reaches 50% from the calculated time, i.e., the time at which the output voltage reaches 50%. Similarly, it is possible to find the slope of the output voltage by subtracting the time for 10% (90%) of the output voltage from the time for 90% (10%) of the output voltage for a falling (rising) waveform transition. Therefore, the following formulas give the 50% input to 50% output delay and the 10% to 90% output slope for falling and rising output waveform transitions:

delay = T(Vout = 50%) − T(Vin = 50%)    (4.22)

slope_falling = T(Vout = 90%) − T(Vout = 10%)    (4.23)

slope_rising = T(Vout = 10%) − T(Vout = 90%)    (4.24)
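A minimal numerical sketch of this procedure for a step input (so that T(Vin = 50%) = 0) is shown below. Note that Equation (4.20) lists the transient exponentials; the search also needs the settled value H'(0) = −k1/p1 − k2/p2, which we add here as the constant term. The function names are ours, for illustration only:

```python
import math

def vstep(t, k1, k2, p1, p2):
    """Unit-step response of H'(s) = k1/(s-p1) + k2/(s-p2): the transient
    exponentials of Eq. (4.20) plus the final value H'(0)."""
    v_inf = -k1 / p1 - k2 / p2
    return v_inf + (k1 / p1) * math.exp(p1 * t) + (k2 / p2) * math.exp(p2 * t)

def t_cross(level, k1, k2, p1, p2, t_hi=None):
    """Bisection for the time at which the (monotonic) step response reaches
    `level`, e.g. 0.5 * v_inf for the 50% delay of Eq. (4.22)."""
    if t_hi is None:
        t_hi = 50.0 / min(abs(p1), abs(p2))   # well past the slowest pole
    t_lo = 0.0
    for _ in range(100):
        t_mid = 0.5 * (t_lo + t_hi)
        if vstep(t_mid, k1, k2, p1, p2) < level:
            t_lo = t_mid
        else:
            t_hi = t_mid
    return 0.5 * (t_lo + t_hi)
```

The output slopes of Equations (4.23) and (4.24) follow the same way, as the difference between the 10% and 90% crossing times found by `t_cross`.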

Because no such simple closed-form solutions exist for an arbitrary input waveform transition, we decided to characterize our interconnect segments using Hspice. If we had found a method to characterize RC-interconnect networks using our compact statistical waveform model, we could have included it in our cell characterization methodology to determine the approximate output waveform transition of an RC-interconnect network from its input waveform transition, which is the output transition waveform of the gate feeding the RC-interconnect network. As explained before, we decided to include this as an item in our future research directions and to use a 1-parameter saturated-ramp waveform transition model to examine whether our methodology would still be applicable and, if so, what modifications would be necessary, if any. Therefore, instead of solving the ramp-response equation numerically to find the output transition times and delays, our RC-interconnect characterization was done by sweeping the input slope to obtain the output slope and delay of each RC-interconnect network in our test circuit, which we describe in the next section, and storing the response surfaces in tables. We use linear interpolation to estimate response points not recorded in the tables, using the following functions:

InterconnectTransitionTime (C1(i), R(i), C2(i), Slope) (4.25)

InterconnectDelay (C1(i), R(i), C2(i), Slope) (4.26)
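A minimal sketch of such a table-driven lookup (hypothetical structure and names, not the dissertation's code; per network, C1, R, and C2 are fixed, so the stored response surfaces reduce to one-dimensional sweeps over Slope):

```python
import bisect

class InterconnectTable:
    """Per-network response table: for fixed (C1, R, C2), store the
    output transition time and delay measured (e.g. with Hspice) at a
    sweep of input slopes, and interpolate linearly between points."""

    def __init__(self, slopes, ttimes, delays):
        # `slopes` must be sorted ascending; one entry per sweep point.
        self.slopes, self.ttimes, self.delays = slopes, ttimes, delays

    def _interp(self, ys, slope):
        i = bisect.bisect_left(self.slopes, slope)
        i = min(max(i, 1), len(self.slopes) - 1)   # clamp to table range
        x0, x1 = self.slopes[i - 1], self.slopes[i]
        frac = (slope - x0) / (x1 - x0)
        return ys[i - 1] + frac * (ys[i] - ys[i - 1])

    def transition_time(self, slope):   # plays the role of Eq. 4.25
        return self._interp(self.ttimes, slope)

    def delay(self, slope):             # plays the role of Eq. 4.26
        return self._interp(self.delays, slope)
```

Requests between recorded sweep points are answered by linear interpolation between the two bracketing table entries; requests outside the swept range are clamped to the outermost table segment.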

By comparing the output transition time and delay functions with that of cells that we developed before, we observe the following differences:

(a) We have only one output transition function that covers both the rising and falling transitions.

(b) We have just one delay function that applies to both types of delay for each type of input transition.

(c) Here, the functions are not dependent on supply voltage.

(d) Here, the functions are not dependent on temperature.

The reason for all of the above is that our RC-interconnect networks are linear systems consisting of ideal resistors and capacitors. Although real resistors and capacitors are temperature dependent, we did not include temperature dependence in the characterization because we did not have temperature-dependent resistor and capacitor models.

Since our goal was just to show the effectiveness of our method for cell characterization, and not interconnect characterization, we characterized our specific interconnects just for Slope. Initially, we were looking for a compact model

applicable to all types of RC-interconnect networks. To form such a model, we need a few independent parameters to represent an arbitrary RC-interconnect network. We can use the parameters p1, p2, and one of k1 or k2 (or Rx, Ry, and x) and add a few parameters for the waveform transition, or just one parameter, Slope, in its simplest form, to build a set of 2-level full-factorial compact models for the output transition time and delay of an arbitrary RC-interconnect network. Please note that k1 and k2 are interdependent because of the initial conditions; therefore, we should use only one of them.

To build reasonably accurate models, we need to determine the optimum range of the

parameters in the model because the accuracy of our statistical models is dependent on

the range of the data that we build the models for. We did not verify if such models

could be built with good accuracy and left this question as an item for our

future research direction, i.e. building compact interconnect timing models.

4.3.5. Test Circuit and its Pi-Model-Converted RC-Interconnect Networks

Our test circuit is the clock tree of a JPEG2 Encoder which was synthesized and

mapped into NCSU 45 nm standard cells at one of the laboratories of the Georgia

Institute of Technology. Figure 4.11 depicts the 10-level tree that we used to verify

our compact variation-aware cell models. Although the original tree is not a binary tree,

we have shown a binary subtree with 1024 leaves for simplification. The tree consists of

minimum sized inverters connected by RC-interconnect networks. We have expanded

just the first level of the tree and have shown several paths of the tree from the starting

inverter to the final load, which is shown as a box. The first RC-interconnect network is

fed by another inverter, which is not part of the tree. The tree has about 500 inverters and

500 RC-interconnect networks with about 6200 resistors and 6500 capacitors. We have

marked one of the paths to show a sample path that can be chosen to perform a path-oriented static timing analysis. It starts from the root of the tree and ends at one of the final leaves.

Figure 4.11. Our test circuit is a JPEG2 Encoder clock tree.

Usually several critical paths should be included in static timing analysis, but we can use just the slowest path without losing generality. The slowest path was determined using Silicon Encounter [82] by applying our slowest transition to the input of the clock tree. The tool sorts the clock tree branches according to their node delays. In Figure 4.12, the slowest path is the topmost one and the fastest path is the bottommost one.

Supposing the highlighted path in Figure 4.11 is the slowest inverter chain, we show its details in Figure 4.13. We have 11 inverters, out of which, only the first one is not part of the tree, but has been included to drive the first interconnect network. Each

inverter is connected to the next one through an RC-interconnect network. Here, we show only the input and the specific output of the RC-interconnect network that connects to the next inverter, while the other interconnect network outputs are terminated with the input capacitance of the cell connected to them.

Figure 4.12. Choosing one of slowest critical paths of the JPEG2 Encoder clock tree.

Each RC-interconnect network consists of one or more RC-interconnect segment(s)

connected to each other. We define an RC-interconnect segment as a series of RC low-

pass filters with one input and one output.
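For such a segment, the classical first-moment (Elmore) delay gives a quick closed-form sanity check against reduced-order models. The sketch below is standard textbook background, not the GT_MOR reduction itself:

```python
def elmore_delay(rs, cs):
    """Elmore delay of an RC ladder (a cascade of RC low-pass filters):
    each capacitor contributes its capacitance times the total
    resistance between it and the driver."""
    delay, r_upstream = 0.0, 0.0
    for r, c in zip(rs, cs):
        r_upstream += r          # resistance accumulated from the driver
        delay += r_upstream * c
    return delay
```

For a single RC filter this reduces to the familiar RC time constant; for a two-stage ladder with unit elements it gives 1*1 + (1+1)*1 = 3.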

We had access to the Spice netlist of our JPEG2 Encoder as well as its SPEF file.

The total number of capacitors and resistors in all the RC-interconnect networks in the selected critical path is 220 and 200, respectively. We partitioned the RC-interconnect networks using the SPEF file and mapped them onto the set of interconnect segments encountered in the selected critical path. We categorized them into 5 types, which are shown in Figure 4.14. While our academic MOR tool, which we built in Matlab and named GT_MOR, supports only these interconnect network types, support for other types of networks can be implemented in future versions.

Figure 4.13. The abstracted inverter chain used for our timing analysis. Each RC-interconnect network is made of one or more RC-interconnect segment(s), and each interconnect segment is made of one or more cascaded RC low-pass filters.

We used the SPEF file and Spice netlist to extract the critical path netlist and its RC-

interconnect networks to create the inverter chain Hspice netlist and our Hspice cell

characterization test bench. Moreover, we stored the interconnect networks as Hspice

subcircuits to be able to compare the accuracy of our Pi-models with the original RC-

interconnect networks. It is important not to include the load capacitance of the next gate in each interconnect subcircuit when we instantiate them in the inverter-chain netlist, to avoid duplicated loads, because the input capacitance of the gate is already present. It is necessary to include this capacitance when finding the equivalent Pi-model as well as Y'(s) and H'(s). We replace each RC-interconnect network with its equivalent Pi-model with admittance Y'(s). Since we cannot obtain the interconnect output waveform at the other end of the Pi-model, we had to include our voltage transfer model, H'(s). That means we replaced each interconnect network with a Pi-model and

connected the gate output voltage to the input of our H'(s) Spice network, which is shown

in Figure 4.10. This transformation, depicted in Figure 4.15, yields our Pi-model-converted interconnect networks.

Figure 4.14. The interconnect network types reduced to Pi-models by our tool (GT_MOR).

In the next section, we will describe the timing analysis engine that we built to assess our methodology.

Figure 4.15. To test our Pi-model in the inverter chain at the Spice level, the RC-interconnect networks were replaced by sets of Pi-models and H'(s).

4.3.6. Timing Analysis Engine for Our Cell Models and RC-Interconnect Models

Having built our cell models and RC-interconnect models, we can use them to perform static timing analysis. We find the total delay for our 11-stage inverter chain, which is shown in Figure 4.13, after replacing each RC-interconnect network by its Pi-model.

Our static timing analysis engine was coded in the C programming language using the algorithm shown in Figure 4.16. The algorithm uses a two-step delay approximation

[69], in which a delay has two components added together: a gate delay and a Pi-model interconnect delay. Please note that the algorithm is shown for a falling input waveform transition; the algorithm for a rising input transition is similar, being the dual of this one. The algorithm is based on the following principles:

(a) The gate delays and output transition times for even (odd) stages are given by GateTPLH (GateTPHL) and GateRisetime (GateFalltime), respectively.

(b) The RC-interconnect network Pi-model delays and output transition times, regardless of the transition direction, are given by InterconnectDelay and InterconnectTransitionTime, respectively. The parameters (C1(i), R(i), and C2(i)) are fixed for each stage.

(c) Propagation of an input waveform transition time through a gate (RC-interconnect network Pi-model) determines its output waveform transition time using GateRisetime and GateFalltime (InterconnectTransitionTime).

(d) Each gate (RC-interconnect network Pi-model) input transition time is equal to its previous RC-interconnect network Pi-model (gate) transition time. This determines all the gate and RC-interconnect network Pi-model input transition times.

(e) Having all the input waveform transition times, all the gate delays and all the RC-interconnect network Pi-model delays can be determined.

(f) Adding each gate delay to its next RC-interconnect network Pi-model delay

determines its stage delay.

(g) The total delay is the sum of all the stage delays.

A waveform transition time, Tr, is the time between the 10% and 90% points of the supply voltage transition, which has been mapped onto a saturated ramp. For a fixed supply voltage, there is a one-to-one correspondence between a 10-90% transition time and its slope parameter (the 10-90% window spans 80% of the swing, so the full 0-100% ramp time is Tr/0.8 = 1.25 Tr):

Slope = 1.25 * Tr(10-90%)    (4.27)

In general, Tr can be substituted with a multi-parameter waveform model, like our

PCA waveform model with 2 parameters, and the waveform parameters propagate

through the chain. Delays must then be functions of the parameters of the waveform model. In this algorithm, Tr has been mapped onto a 1-parameter saturated ramp.

TotalDelay(0) = 0;
Tr(0) = Falltime10to90Percent(0);
for all stages (i = 0 to 10) {
    if (i is even) {
        GateDelay(i) = GateTPLH(C1(i), R(i), C2(i), Tr(i));
        GateOutputTransitionTime(i) = GateRisetime(C1(i), R(i), C2(i), Tr(i));
    } else {
        GateDelay(i) = GateTPHL(C1(i), R(i), C2(i), Tr(i));
        GateOutputTransitionTime(i) = GateFalltime(C1(i), R(i), C2(i), Tr(i));
    }
    InterconnectDelay(i) = InterconnectDelay(C1(i), R(i), C2(i), GateOutputTransitionTime(i));
    InterconnectOutputTransitionTime(i) = InterconnectTransitionTime(C1(i), R(i), C2(i), GateOutputTransitionTime(i));
    StageDelay(i) = GateDelay(i) + InterconnectDelay(i);
    TotalDelay(i+1) = TotalDelay(i) + StageDelay(i);
    Tr(i+1) = InterconnectOutputTransitionTime(i);
}

Figure 4.16. Timing simulation algorithm with RC-interconnect networks support.
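The propagation scheme of Figure 4.16 and principles (a)-(g) can be sketched in Python (the dissertation's engine is written in C; the callables below are placeholders standing in for the characterized lookup tables, with each stage's fixed Pi-model parameters folded into them). Per principle (d), the interconnect is driven by the gate's output transition:

```python
def total_chain_delay(n_stages, tr0, gate_tplh, gate_tphl,
                      gate_rise, gate_fall, ic_delay, ic_ttime):
    """Two-step delay propagation for a falling input transition.
    Each argument after tr0 is a list of per-stage callables mapping
    an input transition time to a delay or output transition time."""
    tr, total = tr0, 0.0
    for i in range(n_stages):
        if i % 2 == 0:          # falling input -> rising gate output
            gdelay, gout_tr = gate_tplh[i](tr), gate_rise[i](tr)
        else:                   # rising input -> falling gate output
            gdelay, gout_tr = gate_tphl[i](tr), gate_fall[i](tr)
        # stage delay = gate delay + Pi-model interconnect delay
        total += gdelay + ic_delay[i](gout_tr)
        tr = ic_ttime[i](gout_tr)   # input transition of the next gate
    return total
```

The rising-input version is the dual, obtained by swapping the even/odd roles of the TPLH/TPHL and rise/fall callables.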

This algorithm is a generalization of Equation 3.10, which we used for our tabular static

timing analysis engine, but our cells are characterized in a way that is very similar to

what we did for our cell models when we used our PCA waveform model. The only

simplification is the replacement of the PCA waveform model with a saturated ramp waveform model. We may lose some accuracy by using a simpler waveform model, but it allows us to concentrate on interconnect modeling for cell characterization within the framework of an academic research project. As a future plan, a multi-parameter waveform model could be used in conjunction with the Pi-model for cell characterization and timing analysis; however, the waveform model should be capable of representing the large range of waveform shape variations of interconnects. The authors of [83] emphasized the need for an effective waveform for cell characterization and introduced a moment-based solution for fitting the best saturated ramp to a sample of input waveforms using an optimization approach.

4.3.7. Timing Simulation and Simulation Results

We performed static timing analysis (STA) on our test circuit, the inverter chain, and

compared the results for the following methods:

Method 1 (Hspice): Circuit-level STA with the original RC-interconnect networks. In this method, the original inverter chain is simulated at the circuit level using Hspice. The results of this method, serving as our golden model results, are used to measure the accuracy of Method 2 and Method 3.

Method 2 (Pi-Model): Circuit-level STA with our Pi-model-based interconnect loads and our H'(s) circuit model. In this method, the RC-interconnect networks in the inverter chain are replaced by their corresponding Pi-models, which were built using our Pi-model generation engine. As mentioned before, Pi-models cannot provide us with the output waveform transition at the end of each RC-interconnect network; however, they can provide us with the output waveform transition of the inverter feeding the RC-

interconnect network. Hence, our H'(s) model, implemented as a circuit, as in Figure

4.10, is used to generate the output waveform of each RC-interconnect network. The

simulations, performed at the circuit level, enable us to evaluate the accuracy of our Pi-

models and our H'(s) by comparing the results with Method 1.

Method 3 (FF-Model and its family, i.e. FF3-Model and FF2-Model): High-level

STA with our compact variation-aware cell models and interconnect models. In this

method, our high-level models are used. The inverters are replaced with our compact

variation-aware models and the RC-interconnect networks are replaced with our

interconnect-characterized timing models. Since our compact variation-aware cell model

is the high-level timing model of a cell with a Pi-model load, comparison of the

simulation results with those of Method 2 could reveal how much inaccuracy could

originate from the abstraction of our circuit-level model using our timing characterization

methodology, which is based on 2-level full-factorial designs and the analysis of

variance. Comparing the results of this method with those of Method 1 determines the

accuracy of our methodology considering both the high-level cell modeling and the Pi-

model RC-interconnect network abstraction used in our high-level cell modeling.

Using all three methods, we performed a set of 24 STA simulations, i.e. 12 for the rising input waveform transition and 12 for the falling input waveform transition, for the following cases.

Case a: Nominal parameter values condition: The process parameters (Lp, Ln, ∆Vtp, ∆Vtn) and environment parameters (∆T and Vdd) are kept at their nominal values, and only the waveform transition shape parameter (Slope) varies linearly in its range in Table

4.12. The nominal parameter values for our 2-level full-factorial designs are the center

values of the parameters. The Pi-model parameters (C1, R, and C2) are fixed for each RC-

interconnect network.

Case b: Random parameter values condition: All process parameters (Lp, Ln, ∆Vtp, ∆Vtn), environment parameters (∆T and Vdd), and the waveform transition shape parameter (Slope) vary randomly in their ranges in Table 4.12. The Pi-model parameters (C1, R, and C2) are fixed for each RC-interconnect network.

Figure 4.17 compares the total delays at each stage through the inverter chain for our three methods for a falling input transition, for the two cases of (a) nominal parameter values conditions and (b) random parameter values conditions. For each case, although we have included only one plot from each set of 24 STA simulations in this document, we studied all the plots and concluded that both the Pi-model-based and FF-model-based simulations track the results of the Hspice simulations reasonably; however, all the Pi-model-based simulations overestimated the delays and all the FF-model-based simulations underestimated the delays.

Since we characterize our cells using a saturated ramp, the cell delay can be

underestimated for an actual input waveform transition abstracted as a saturated ramp.

Figure 4.18 shows a situation in which characterizing a cell using a saturated ramp

waveform model can be troublesome. We use the 10% point to the 90% point of the

waveform transition to find the transition time and slope; we know that the 50% point for

a saturated ramp is always in the middle of the 10% and 90% points while the 50% point

for a real waveform shape is not necessarily in the middle of these points. Consequently,

the delay estimates have some error due to imperfection of the cell characterization used

at the circuit level.


Figure 4.17. Comparison of delay for the three methods (Hspice, Pi-Model, and FF) at (a) nominal parameter values and (b) random parameter values.


Figure 4.18. Cell characterization using a saturated ramp could result in a delay error.

In STA, overestimated delays are more desirable than underestimated delays.

According to a recent conference publication authored by prominent timing analysis researchers from IBM [83], fitting a saturated ramp waveform for cell characterization can result in up to 16% delay variation for one cell. In our work, which uses the standard saturated ramp waveform for cell characterization, the maximum per-stage delay error is 2.2% for the nominal parameter values case and 2.9% for the random parameter values case.

While the authors of [83] provide their own solution for the better fitting of a saturated ramp toward better cell characterization, they reported that using 20% to 80% waveform transition points makes it possible to overestimate the transition time and slope. Moreover, it is possible to overestimate the delay in cell characterization by adding some marginal values to the delays, e.g. by applying a falling input waveform transition to an inverter and measuring the delay as the time between the 60% point

(instead of the 50% point) of the input waveform to the 40% point (instead of the 50%

point) of the output waveform transition. We believe that using a better waveform model for characterization can improve the accuracy of timing models, and we proposed our PCA waveform models for this reason; however, within our academic research framework we had to limit the scope of our research. Although our methodology gives reasonable delay estimates, they can be improved using some of the techniques mentioned above.

To gain a better understanding of the delay error trends, we plotted the relative delay error per stage, in percentage, for Method 2 (Pi-model) and Method 3 (FF-model and its family, i.e. FF3-model and FF2-model) for the inverter chain. Please note that the delays obtained from Method 1 (Hspice) were used as the basis for obtaining the relative delay errors. Figure 4.19 shows the error per stage, in percentage, for the cases of (a) the nominal parameter values condition and (b) the random parameter values condition. We have included one input transition waveform direction for each case, i.e. the low-to-high (high-to-low) output transition of the first gate for the first (second) case. The plots for the other input transition waveforms are similar.

We can draw the following conclusions from studying the plots.

(a) The errors per stage for the Pi-model are positive while the errors per stage for the

FF-models are negative. This means the Pi-Model overestimates the total delay while the

FF-Model and its family underestimate the total delay.

(b) The errors per stage for the Pi-model are less than 0.5% for most cases and less than 0.8% for all the cases; therefore, our Pi-models are very accurate.

(c) The errors per stage for the FF-model and its family are less than 3%. Our FF-model and its family are less accurate than our Pi-model, but the errors are much smaller

than the 16% delay variation per cell that we mentioned before; however, we will show that timing analysis using our FF-models is about 2880 times faster than Hspice simulation.

Figure 4.19. Comparison of delay errors, in percentage, using the Pi-Model and the FF-Model and its variations, i.e. FF3-Model and FF2-Model, at (a) nominal parameter values and (b) random parameter values.

(d) The errors per stage for the FF-model and its family are very close. This means that we can make the models smaller and faster and get almost the same results. FF2 models are our smallest and fastest, but least accurate, models of the family.

All our STA simulations were performed on a Dell PowerEdge 2970 with a dual-core

Intel Xeon E5440 CPU (2.83 GHz) and 16 GB of RAM under a Red Hat Enterprise

Linux Server. The FF-model and its family were generated using Matlab R2010 64-bit on a Dell Optiplex 980 with dual-core Core i5 CPU (3.19 GHz) and 8 GB of RAM.

Table 4.18 compares the simulation time, the characterization time, the total number of operations, the memory usage, and the per-stage accuracy of all the STA methods. The characterization time has two components: the Hspice simulation time and the FF-

Model generation time. The number of operations column does not have meaningful values for comparing the first two circuit-level simulation methods with the rest of the methods. The memory usage for the Hspice and Pi-Model methods is the memory that our Hspice simulator reported to us and cannot be compared with the memory usage of our FF-model family. The table shows the trade-off between time, memory, and accuracy for all the methods. The accuracy column shows the average accuracy over all the

24 simulations.

In Appendix E, we compare the space complexity, the simulation time complexity, and the characterization time complexity of the FF family models and the tabular models.

Table 4.18. Comparison of the STA time, characterization time, total number of operations, memory usage, and accuracy of all the STA methods.

Method     | STA Time (s) | Characterization Time (s) | Number of Operations | Memory Usage (kB)* | Per-stage Accuracy (%)
Hspice     | 2.88         | 0                         | -                    | 2100**             | 100
Pi-Model   | 1.14         | 1481.0 + 0.0              | -                    | 1050**             | 99.46
FF-Model   | 0.02         | 1481.0 + 9.6              | 24576                | 224                | 97.95
FF3-Model  | 0.002        | 1481.0 + 10.6             | 1003                 | 11                 | 97.96
FF2-Model  | 0.001        | 1481.0 + 9.9              | 143                  | 2                  | 97.97

* The estimate was found by adding the total number of operations to the total number of terms and multiplying the result by 8/1024, assuming 8 bytes per double-precision floating-point number and 8 bytes of machine code per operation.
** This number is the total memory usage reported by our Hspice simulator.

By studying Table 4.18, we can draw the following conclusions:

(a) Using our FF2-model can speed up the simulation time by 2880 times while

losing about 2.03% per-stage accuracy.

(b) The major component of the characterization time for the FF-Model and its family is the Hspice simulation of the Pi-Models; therefore, our methodology is not very expensive. The times for generating our FF-Models are about 0.7% of the total characterization time and are listed after the plus sign in the characterization time column.

(c) Our FF2-Model lookup is 171 times faster than that of the FF-Model, based on the number of operations.

(d) The memory usage of our FF2-Model is 112 times smaller than that of the FF-Model.

(e) As we described before, although we did not use an ideal cell characterization

method, the per-stage accuracy of our FF-Model and its family is about 97.95% while the

accuracy of the Pi-Model is about 99.46%.

(f) Using Pi-Models for RC-interconnect networks can speed up the simulation at the

circuit level by 2 times while losing about 0.64% per-stage accuracy. The memory usage

of the Pi-Model method was about half of that of the Hspice method.

(g) Although FF2 offers much better simulation time and memory usage, considering that all the FF-Models are statistical models for which we have checked the model adequacy and prediction accuracy, using the FF3-model is preferred to be on the safe side. The FF3-Model speeds up the simulations by up to 1440 times while the models are 25 times smaller than the FF-model.

4.3.8. Conclusions and Future Work

Our compact variation-aware cell methodology allows generating cell timing models that incorporate variations in semiconductor process parameters and in environment parameters such as supply voltage and temperature. The timing models, i.e. cell models, can be used to perform statistical static timing analysis using a static timing analysis engine that runs Monte-Carlo-based path-oriented static timing simulations. They can also be used to characterize cells to build a statistical timing library for block-based statistical static timing analysis engines. The simulations can be on the order of a few thousand times faster than their circuit-level equivalents, while some accuracy is lost due to model abstraction.

We showed that our compact variation-aware cell methodology can be extended to interconnect-dominant circuits using a Pi-model to represent the RC-interconnect networks, and we verified our Pi-model conversion engine. We also showed that our methodology is applicable even to a single-parameter waveform transition model, although we used a two-parameter waveform transition model, our PCA waveform model, in Chapter III. We used a methodology very similar to that of our previous compact variation-aware cell models, which did not support RC-interconnect networks, and developed the family of FF-models. Using our static timing simulation

engine, we simulated an inverter chain from a clock tree network of a JPEG2 encoder, covering both (a) the nominal parameter values condition and (b) the random parameter values condition. Then, we compared the simulation time, the memory usage, the characterization time, and the accuracy of all our STA simulation methods, including

(a) circuit-level STA for the original circuit with RC-interconnect networks, (b) circuit-level STA for the chain with our Pi-model-converted RC-interconnect network circuits, and (c) high-level STA using our full-factorial cell models.

We found that the Pi-model conversion for RC-interconnect networks at the circuit level has a per-stage accuracy of 99.46% while the memory usage and the simulation times were reduced by half. Moreover, we learned that our full-factorial models can offer a per-stage accuracy of 97.95% while the simulation times were improved by a factor of

2880. We pointed out the imperfection of the saturated-ramp cell characterization that we used and reviewed some suggested accuracy improvement solutions.

We had to limit the scope of this endeavor; however, based on the insight we gained from this experience, we believe this research could be continued in the following directions with a good chance of success.

(a) Extending the methodology to multi-input cells: The process of applying the methodology to multi-input cells is straightforward; however, the number of parameters increases, which increases the characterization time. It is interesting to know how the size of the models and the accuracy will be affected by building variation-aware cell models for several multi-input cells and evaluating the models.

(b) Finding more general compact waveform models: Building alternative compact

waveform models for cell characterization is necessary since the 1-parameter waveform

model using a saturated ramp makes the cell characterization inaccurate, and our PCA waveform models are accurate only in a finite range and for specific shape types.

(c) Including a multi-parameter compact waveform model in our improved cell characterization with RC-interconnect network support: Improving the cell characterization accuracy using multi-parameter waveform models should be possible while keeping the support for Pi-model-converted RC-interconnect networks.

(d) Building compact variation-aware interconnect network models: Building compact interconnect network timing models using our cell models methodology should be possible; however, the support for a good compact waveform model capable of handling various waveform shapes at the output of the interconnect networks is necessary.

(e) Improving the accuracy of our cell models: Although 3-level full-factorial designs are more expensive than 2-level full-factorial designs, they should improve the accuracy of the cell models; however, we may need to use linear regression to build our cell models, and that will be more expensive than the combination of the Yates algorithm and the analysis of variance that we used.
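For reference, the Yates algorithm computes all effects of a 2^k full-factorial design with k passes of pairwise sums and differences over the responses in standard order; a generic sketch (not the dissertation's implementation):

```python
def yates(y):
    """Yates' algorithm for a 2^k full-factorial design.
    `y` holds the responses in standard (Yates) order.  Each pass
    forms pairwise sums in the first half and pairwise differences
    in the second half; after k passes, entry 0 is the grand total
    and the remaining entries are the effect contrasts.  Dividing
    the contrasts by 2^(k-1) gives the usual effects."""
    n = len(y)
    k = n.bit_length() - 1
    assert 1 << k == n, "length must be a power of two"
    col = list(y)
    for _ in range(k):
        sums = [col[2 * i] + col[2 * i + 1] for i in range(n // 2)]
        diffs = [col[2 * i + 1] - col[2 * i] for i in range(n // 2)]
        col = sums + diffs
    # entry 0 becomes the mean; the rest become main/interaction effects
    return [col[0] / n] + [c / (n // 2) for c in col[1:]]
```

For k = 2 with responses ((1), a, b, ab), the returned list is (mean, A, B, AB), matching the hand-worked sums-and-differences tableau found in design-of-experiments texts.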

4.4. Investigating Accuracy Improvement Methods

We want to know how we can improve the accuracy of our variational waveform

model and our cell models. When our cell models use our variational waveform model,

increasing the accuracy of the waveform model can also improve the accuracy of the cell

models. Moreover, the accuracy of our waveform models and cell models depends on the

accuracy of the circuit simulation and the accuracy of our statistical models. We categorized the sources of error in our waveform model and our cell models as follows.

(a) Circuit-level simulation errors: We use transient analysis to obtain the set of waveforms for both the waveform and cell models. Using automatic selection of the simulation time resolution makes the simulations needed for cell characterization much faster but reduces the accuracy by introducing some error. Therefore, an optimum simulation time resolution must be selected so as neither to lose accuracy nor to waste simulation time, because increasing the simulation resolution beyond a certain limit increases the simulation time significantly while the accuracy of our captured parameters barely increases at all for our purpose.

(b) Experimental design errors: We sample a subset of the whole possible waveform space to keep the simulation time for waveform model generation and cell characterization manageable; therefore, some error is introduced into the model because we cannot afford to cover the whole output waveform space. For example, using full-factorial models ignores all the remaining points. In general, including more sampling points at strategic locations in the designs can improve the accuracy of the models; therefore, building models using designs with more sampling points, such as central composite designs [52] and 3-level full-factorial designs [52], should improve the accuracy of the models but increases the characterization time. It is interesting to see whether the increase in accuracy is worth the characterization time overhead.
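To make the comparison concrete, the sketch below generates the design points for a 2-level full factorial and for a face-centered central composite design in coded (-1/+1) units; this is textbook structure [52], not the dissertation's characterization code:

```python
from itertools import product

def two_level_full_factorial(k):
    """All 2^k corner points of a 2-level factorial in coded units."""
    return [list(p) for p in product([-1, 1], repeat=k)]

def central_composite(k, alpha=1.0):
    """Central composite design: the 2^k factorial corners, plus 2k
    axial points at +/-alpha on each axis, plus the center point.
    With alpha = 1.0 this is the face-centered variant."""
    pts = two_level_full_factorial(k)
    for j in range(k):
        for s in (-alpha, alpha):
            axial = [0.0] * k
            axial[j] = s
            pts.append(axial)
    pts.append([0.0] * k)
    return pts
```

For k factors, the factorial design has 2^k runs while the central composite design has 2^k + 2k + 1 runs, which quantifies the characterization-time overhead of the extra sampling points.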

(c) Statistical modeling errors: For our variational waveform model, we used principal component analysis to represent waveforms just with a few principal components with a very high variance coverage (more than 95%) but the residual errors

resulting from ignoring the low-order principal components are still present. Moreover, when we use linear regression models for our cell models, we introduce some residual errors into the model due to lack of fit. With full-factorial designs, we can reduce the lack of fit to 0, but when we use just a fraction of the terms to obtain a compact model, we introduce some lack-of-fit error.

We describe the accuracy improvement analysis for our waveform model in Section

4.4.1. In Section 4.4.2, we briefly explain the accuracy improvement analysis for our cell models.

4.4.1 Accuracy Analysis of the PCA Waveform Model

The accuracy of our waveform model depends on the accuracy of the PCA model. In

Chapter III, we presented a qualitative factor for choosing the most accurate of the PCA model construction methods (SNM, SSM, and ASM). We extend the accuracy analysis discussion by presenting some quantitative factors. Since we discretize voltage waveforms along the voltage axis to find the corresponding time vector, the discretization level (number of samples) and the locations of the samples affect the accuracy of the waveform and its corresponding PCA model. We want to find the minimum number of samples and the best locations for the samples.
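The workflow behind this analysis can be reproduced in outline: discretize each waveform at chosen voltage levels into a time vector, stack the vectors, and run PCA; the fraction of variance captured by the leading components then quantifies model accuracy. A pure-Python sketch on synthetic saturated-ramp data (illustrative only, not the dissertation's SNM/SSM/ASM construction or dataset):

```python
import random

def discretize(tr_time, levels):
    """Time vector of an ideal rising saturated ramp (0 -> Vdd in
    tr_time seconds), sampled at the given normalized voltage levels."""
    return [v * tr_time for v in levels]

def first_pc_variance_share(rows, iters=100):
    """Fraction of total variance captured by the first principal
    component, via power iteration on X^T X (pure Python)."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in rows]
    total = sum(v * v for row in x for v in row)
    w = [1.0] * m                       # power-iteration start vector
    for _ in range(iters):
        # one multiplication by X^T X (proportional to the covariance)
        xw = [sum(row[j] * w[j] for j in range(m)) for row in x]
        w = [sum(xw[i] * x[i][j] for i in range(n)) for j in range(m)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    var1 = sum(sum(row[j] * w[j] for j in range(m)) ** 2 for row in x)
    return var1 / total

# synthetic dataset: saturated ramps with random transition times,
# discretized at nine uniform voltage levels (10%, 20%, ..., 90%)
random.seed(0)
levels = [i / 10 for i in range(1, 10)]
data = [discretize(random.uniform(0.5, 2.0), levels) for _ in range(50)]
share = first_pc_variance_share(data)
```

For this one-parameter family the centered data matrix has rank one, so a single principal component captures essentially all of the variance; real waveform datasets need more components, and the uncaptured remainder is the residual error discussed under (c) above.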

4.4.1.1 Accuracy Analysis of the PCA Waveform Model – Number and Location of Points

The discretization level for waveform modeling was chosen in order to have straightforward voltage levels for transistors in the technology used in Chapter III. However, the accuracy of the PCA waveform model is a function of

(a) the number of discretization levels along the voltage axis, and

(b) the choice of voltage levels at which to discretize the waveform on the voltage axis.

To analyze waveform model accuracy with respect to the number of discretization levels along the voltage axis, seven discretization patterns were compared. These are summarized in Figure 4.18, with between three and 19 levels.

Figure 4.18. PCA waveform discretization patterns for the voltage scale (19, 15, and 10 levels; three 5-level cases; and 3 levels).

Figure 4.19 compares the accuracy of the uniform discretization plans using the Sum of Squares of Error (SOS). It can be seen that at least 10 points are needed to achieve high accuracy, and that increasing beyond 10 points does not increase accuracy much. However, it should be noted that as few as five points can achieve high accuracy if they are appropriately placed. Additionally, fewer discretization levels result in fewer principal component basis functions. Even in the worst case that we studied, with 19 discretization levels, two principal component basis functions covered 99.8% of the variation.

Figure 4.19. Increase in accuracy of the PCA waveform as a function of the number of discretization levels.

4.4.1.2 Accuracy Analysis of the PCA Waveform Model for TSMC180RF – Waveform Dataset Selection, Range of Parameter Variations, and Model Subranging

In this section, we explain how the accuracy of the PCA waveform model is affected by the selection of the waveform dataset, the range of parameter variation, and the model subranging. This discussion is related to our variational waveform model that we developed in Chapter III for our cells based on TSMC180RF technology.

We could simplify cell characterization to reduce the memory requirement for cells by using a single variational waveform model for both rising and falling transitions and for both input and output waveforms for the inverter cell.

This was possible by

(a) generating the rising and falling output waveforms using a 2-level full-factorial design to cover the space of all parameters, i.e., slope, fanout, and process and environment parameters,

(b) combining the dataset of the rising waveforms with that of the falling waveforms to obtain a general waveform model using PCA on the combined datasets, and

(c) modifying the PCA basis functions of the general waveform model iteratively to find a common set of basis functions for both the inputs and outputs of a cell.

However, all of these actions can adversely affect waveform accuracy.

A 2-level full-factorial design only samples the response space at the combinations of the extreme input parameter levels; therefore, it can introduce inaccuracy when it fails to capture the nonlinearity of the response space in regions where it has no samples. For example, delay is a nonlinear function of slope and fanout; consequently, a full-factorial design can introduce more inaccuracy for a larger range of slope and fanout. It seems logical that constructing the models for a subrange of these variables can increase accuracy; we call this technique model subranging. Model subranging should be an effective accuracy improvement method when the response surface in the selected subranges is less nonlinear than in the original ranges. It is similar to the concept of binned transistor models introduced in Section 4.1.2, in that binned models are, in general, more accurate within each bin than their non-binned counterparts.
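The 2-level full-factorial sampling described above can be generated mechanically. The sketch below enumerates the corner combinations for slope, fanout, channel length, and threshold voltage; the parameter names and low/high values are illustrative, not our actual characterization settings:

```python
from itertools import product

# Low/high levels for each characterization parameter (illustrative values).
levels = {
    "slope":  (0.05, 1.0),   # normalized input transition time
    "fanout": (1, 4),        # number of fanout inverters
    "L":      (-0.05, 0.05), # channel-length variation, fraction of nominal
    "Vt":     (-0.05, 0.05), # threshold-voltage variation, fraction of nominal
}

# A 2-level full-factorial design samples every corner: 2^k points.
names = list(levels)
corners = [dict(zip(names, combo)) for combo in product(*levels.values())]

print(len(corners))   # 2^4 = 16 simulation points
print(corners[0])
```

Each corner dictionary would drive one circuit-level simulation; subranging simply repeats this enumeration with narrower low/high values per subrange.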

In Chapter III, we constructed the waveform model for the small range of variations (5%) for the process parameters, i.e., the channel length and threshold voltage of the transistors, while we used the full range of both slope and fanout.

We want to know how the waveform model is affected by

(a) using a subrange for both slope and capacitance, and

(b) using a large range of variation (30%) instead of the small range of variation (5%).

We constructed five waveform models, as shown in Table 4.19, for an inverter based on TSMC 180 nm technology. Each row shows the model number, the model name, whether the model uses rise-time and/or fall-time waveforms, the subrange of slope and fanout used in percent, and the range of process parameter variations in percent.

Table 4.19. Waveform Models Compared for Accuracy – TSMC180RF.

#   Model Name        Trise  Tfall  Slope subrange (%)  Fanout subrange (%)  Process variation (%)
1   tr_gn             √      √      100                 100                  5
2   tr_sub_gn         √      √      100                 25                   5
3   tr_sub_large_gn   √      √      100                 25                   30
4   tr_sub_large_sn   √             100                 25                   30
5   tf_sub_large_sn          √      100                 25                   30

We use several criteria to compare the error introduced in a transition by using a PCA model. The errors are normalized to the range of the original waveform transition; therefore, we can compare the accuracy of all the models even though they are based on different sets of waveforms.

Waveform model 1 is based on both rising and falling waveforms; it uses the whole range of slope and fanout and keeps the process parameter variation at 5%. The settings for waveform model 2 are similar to those of waveform model 1, but we use just a 25% range of fanout. We increase the range of process parameter variation from 5% to 30% to construct waveform model 3. These three models are used to evaluate the effect of subranging fanout and of increasing the range of process parameter variations.

Waveform models 4 and 5 use, respectively, only the rising waveforms or only the falling waveforms that we used to construct waveform model 3. We use waveform models 3-5 to compare the accuracy of the waveform model obtained by combining the rising and falling waveforms with that of the waveform models based on only the rising or only the falling waveforms.

Figure 4.20 compares the accuracy of all five waveform models. It shows the maximum, the average, and the maximum of the average relative errors for (a) a complete transition (19 points) and (b) the 10% to 90% portion of the transition (15 points).

Figure 4.20. Waveform accuracy – Max, Average, and Max. Ave. (TSMC180RF).

In general, subranging increases the maximum and the average relative errors; however, combining subranging with a large range of process parameter variation (30%) does not increase the average relative errors much (4%). Moreover, comparing the last three waveform models shows that combining the rising and falling waveforms does not decrease the accuracy much. The points outside the 10% to 90% portion of a waveform transition should not affect a gate delay much, because one of the transistors is almost off; therefore, if we observe an inconsistency in the error increase (or decrease) between the 15- and 19-point cases, we should accept the error increase (or decrease) of the 15-point case. We have included all points just to show the maximum relative error at all 19 points of a transition.

We use the Mahalanobis distance [84] to measure how similar a PCA waveform model, using just the first two principal components, is to the original waveform. The Mahalanobis distance takes into account the correlation among timing points at different voltage levels, while the maximum relative errors and the maximum average relative errors presented earlier do not. It is the square root of the product of the transposed difference vector (the difference between the actual time points of the waveform transition and their estimates from our PCA waveform model), the inverse of the covariance matrix, and the difference vector. The Mahalanobis distance between two vectors $\vec{x}$ and $\vec{y}$ is defined as

$d(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^{T} S^{-1} (\vec{x} - \vec{y})}$   (4.28)

Here, $S$ is the covariance matrix of $\vec{x}$ and $\vec{y}$. If the covariance matrix is an identity matrix, the Mahalanobis distance reduces to the Euclidean distance [42]; and, if the covariance matrix is diagonal, it reduces to the normalized Euclidean distance, which is defined as follows for N-dimensional vectors:

$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{N} \frac{(x_i - y_i)^2}{\sigma_i^2}}$   (4.29)

In our PCA waveform model, the timing points at different voltage levels are correlated; therefore, we need a measure of distance that takes this correlation into account, which is the Mahalanobis distance. We could use the normalized Euclidean distance only if the timing points at different voltage levels were independent. The Euclidean distance is not a good choice, either.
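Equation (4.28) can be evaluated directly with NumPy. The sketch below compares the Mahalanobis distance with the plain Euclidean distance for correlated timing points; the covariance matrix and the two vectors are synthetic placeholders, not characterization data:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Square root of (x - y)^T S^{-1} (x - y), as in Eq. (4.28)."""
    d = x - y
    # Solve S z = d instead of explicitly inverting S (better conditioned).
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

# Synthetic example: timing points at adjacent voltage levels are correlated.
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
x = np.array([1.0, 1.0])   # actual crossing times
y = np.array([1.2, 1.2])   # PCA-model estimates

print(mahalanobis(x, y, cov))        # distance accounting for correlation
print(float(np.linalg.norm(x - y)))  # Euclidean distance ignores correlation

# With the identity covariance, Eq. (4.28) reduces to the Euclidean distance.
assert np.isclose(mahalanobis(x, y, np.eye(2)), np.linalg.norm(x - y))
```

Because the two coordinates move together here, the correlation-aware distance is smaller than the Euclidean one for this common-mode error.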

Figure 4.21 shows the maximum and the average of the Mahalanobis distance for all 19 points and for the 15 middle points (10% to 90%) of a transition. We see that the 10% to 90% portions of transitions are more similar to the original waveforms than the whole transitions, whether we look at the maximum or the average Mahalanobis distance. Comparing the first three waveform models, we observe that subranging increases the accuracy, while increasing the range of process parameter variation decreases the accuracy. Comparison of the last three waveform models shows that combining the rising and falling waveform models increases the overall accuracy of the waveform model for the rise time and decreases it for the fall time.

Figure 4.22 compares the waveform model accuracy for all the waveform models at just the 50% point, which we use for delay calculation. We observe that the maximum introduced error is less than 5% and the average error is less than 2% for all the models. Moreover, the error is smallest (maximum error < 3% and average error < 0.5%) for the waveform model based on falling waveforms when we use just a subrange of fanout and a 30% range of process parameter variation. The error for the waveform model based on the rise times is the next smallest. The error for the waveform model based on combining the rising and falling waveforms, using only a subrange of fanout with a 30% range of process variation, is better than that of the first two waveform models. It seems waveform model 3 performs best if we want a single waveform model for both rising and falling waveforms with a subrange for fanout and a 30% range of process parameter variation.

Figure 4.21. Waveform accuracy – Mahalanobis distance (TSMC180RF).

Figure 4.22. Waveform accuracy – 50% point (TSMC180RF).

4.4.1.3 Accuracy Analysis of the PCA Waveform Model for FreePDK45 – Waveform Dataset Selection, Design of Experiment, Discretization Level, Range of Parameter Variations, and Model Subranging

In this section, we describe how the accuracy of the PCA waveform model is affected by the selection of the waveform data, the range of parameter variation, and the model subranging. This discussion is for to a variational waveform model for FreePDK45 technology using our methodology, which was described in Chapter III for our cells based on TSMC180RF technology. We want to know how options can impact the accuracy of the resulted waveform and cell models. Table 4.20 shows a list of options for improving the accuracy of waveform and cell models. We describe it in more detail later.

We have listed our options, coded according to our general waveform and cell model improvement methods, in Table 4.20. The options are as follows:

(a) Selection of the variational waveform model basis functions (w.1): Using common basis functions for the variational waveform model reduces the memory requirement for each cell and simplifies timing simulation; however, it affects the accuracy. We want to know how much accuracy improvement is possible using a dedicated basis function for each cell versus having to map from one basis function to another. The basis function for output waveforms can be constructed from rising waveforms, from falling waveforms, or from both rising and falling waveforms. The basis function resulting from the convergence of the basis functions through an iterative approach is another alternative.

(b) Experimental design (wc.1): One of the factors affecting the accuracy of a PCA waveform model, and consequently the accuracy of the cell model, is the choice among alternative experimental designs to construct the mapping equations. We want to compare the accuracy of models obtained based on a full-factorial design, a fractional-factorial design, a central composite design (faced), and Latin hypercube sampling.
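Unlike the corner-based designs, Latin hypercube sampling spreads points through the interior of the parameter space. A minimal NumPy-only sketch (sample counts and dimensions are illustrative; a production flow could instead use a library routine such as SciPy's `scipy.stats.qmc.LatinHypercube`):

```python
import numpy as np

def latin_hypercube(n_samples, n_factors, rng):
    """Basic Latin hypercube sample on the unit cube: each factor's range is
    split into n_samples equal strata, and each stratum is sampled once."""
    # One random point inside each stratum, per factor.
    strata = np.arange(n_samples)[:, None] + rng.random((n_samples, n_factors))
    strata /= n_samples
    # Permute the strata independently per factor to decorrelate columns.
    for j in range(n_factors):
        strata[:, j] = rng.permutation(strata[:, j])
    return strata

rng = np.random.default_rng(1)
pts = latin_hypercube(8, 3, rng)   # 8 samples over 3 parameters
print(pts.shape)                   # (8, 3)
# Each column hits every one of the 8 strata exactly once.
print(np.sort((pts[:, 0] * 8).astype(int)))
```

The points would then be rescaled from the unit cube to the actual slope, load, and process-parameter ranges before simulation.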

(c) Fanout load model vs. effective-capacitance load model (wc.3): We characterized our cells using the number of fanouts as a simplification to help us concentrate on our modeling methodology; however, better modeling of the loading parameters can improve the accuracy of the cell models. Therefore, the fanout parameter can be replaced by a complex load, which takes into account the resistive, capacitive, and inductive components of the interconnect networks, as we did when we incorporated a resistive-capacitive load model in Section 4.3. As a better single-parameter load model, we can use the effective capacitance.

Table 4.20. Waveform and cell model accuracy improvement methods. (Methods are coded c, w, or wc for cell models, waveform models, and both, respectively.)

c.1  Cell model implementation method: 1) equations; 2) tables.
c.2  Waveform model used for the cell model: 1) slope; 2) variational waveform model.
w.1  Type of waveform basis function: 1) based on rising output waveforms (Tr); 2) based on falling output waveforms (Tf); 3) based on both falling and rising output waveforms (Tg); 4) based on the converged waveform model (Tc).
w.2  PCA model construction method: 1) Symmetric Nonstandardized Model (SNM); 2) Symmetric Standardized Model (SSM); 3) Asymmetric Standardized Model (ASM).
wc.1 Design of experiment used (Appendix C): 1) full-factorial design; 2) fractional-factorial design; 3) central composite design (faced); 4) Latin hypercube design.
wc.2 Sampling points: number and location of sampling points.
wc.3 Fanout load model: 1) number of fanouts for minimum-size transistors; 2) effective-capacitance load model; 3) resistive-capacitive load model.
wc.4 Range of variables covered by the model: 1) single model (whole range); 2) SR (subrange (slope, load)); 3) SR2 (subrange (load)).
wc.5 Variation covered by the model: 1) 5%; 2) 20%; 3) 30% (not possible for the circuit-level model).
wc.6 Load assumptions: 1) no variations (used for our models based on FreePDK45); 2) the same variation for supply voltage and temperature; 3) the same variation in threshold voltages, channel lengths, supply voltage, and temperature (used for our models based on TSMC 180 nm).
wc.7 Rail-to-rail input assumptions: 1) the same swing as the supply voltage of the cell; 2) fixed voltage swing.
wc.8 Circuit model used: 1) RC_CC for cell and load; 2) RC_CC for cell and C for load; 3) RC for cell and load; 4) NP (no parasitics for cell and load).

The effective-capacitance load model takes into account, to some extent, (i) the nonlinearity of the internal capacitance of the loading transistor and (ii) the resistive shielding of the capacitance. In our characterization in Chapter III, we included the effect of the nonlinearity of the internal capacitance of the transistors indirectly. We want to know the impact on accuracy of using our simple fanout load model vs. the effective-capacitance load model.

(d) Model subranging (wc.4): Since our cell timing parameters depend nonlinearly on variables such as fanout and waveform shape, we expect a more accurate model if we generate the models for subranges of the variables. Subranging the variables increases characterization time, simulation time, and the memory requirement by a factor of (number_of_subranges + 1)/2 for each dimension; however, we expect to see an accuracy improvement. The resulting models are hybrids, using tables for selected nonlinear variables and equations for the other variables. We can subrange both load and slope (our 'sr' models), or subrange just the load (our 'sr2' models).
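Such a hybrid subranged model can be organized as a table of per-subrange equation coefficients, indexed by the bin that the operating point falls into. A minimal sketch with two subranges per dimension; the bin edges, the linear delay form, and all coefficient values are made-up placeholders, not fitted characterization data:

```python
import numpy as np

# Subrange boundaries for the two binned variables (illustrative units).
slope_edges = [0.0, 0.5, 1.0]   # two slope subranges
load_edges = [0, 2, 4]          # two load (fanout) subranges

# Per-bin linear delay models: delay = c0 + c1*slope + c2*load.
# coeffs[i, j] holds the coefficients for slope bin i and load bin j.
coeffs = np.array([[[0.10, 0.30, 0.05], [0.12, 0.28, 0.07]],
                   [[0.15, 0.50, 0.06], [0.18, 0.45, 0.09]]])

def delay(slope, load):
    # Table lookup: pick the bin, then evaluate that bin's equation.
    i = int(np.searchsorted(slope_edges, slope, side="right")) - 1
    j = int(np.searchsorted(load_edges, load, side="right")) - 1
    i = min(max(i, 0), 1)       # clamp to valid bins
    j = min(max(j, 0), 1)
    c0, c1, c2 = coeffs[i, j]
    return c0 + c1 * slope + c2 * load

print(delay(0.2, 1))   # evaluated with bin (0, 0) coefficients
print(delay(0.8, 3))   # evaluated with bin (1, 1) coefficients
```

The table part (bin selection) absorbs the nonlinearity, while each bin keeps a compact equation, mirroring the hybrid structure described above.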

(e) Variation covered by the model (wc.5): We want to know how the range of variation covered by the model affects the accuracy of the waveform and cell models. We use the following two ranges for the threshold voltages and transistor channel lengths:

(i) -5% to 5%

(ii) -20% to 20%

Please note that the large range of variation, -30% to 30%, will not be compared, since the cell will not function properly.

(f) Variational vs. fixed load (wc.6): The set of waveforms that we used to construct our waveform and cell models is generated by the circuit-level simulation, using Hspice, of a netlist representing an inverter connected to a load. If we use an inverter or multiple inverters with a common input as the load, we need to know how the load can be affected by process and environment variations. For our waveform and cell models based on TSMC 180 nm, we assumed the same variation for the load as for the cell. For our waveform and cell models based on FreePDK45, we assumed no variations in the load. The other option is to use the same range of variation in the supply voltage, Vdd, since the cell and the load are usually very close.

(g) Voltage-dependent vs. fixed waveform swing (wc.7): The rail-to-rail input voltage swing affects the shapes of the set of output waveforms. We can use the same swing for the input as the supply voltage of the cell, or a fixed-voltage-range swing for the input. We used the same swing for the input as the supply voltage, since the input is generated by another cell whose rail-to-rail swing is a function of its supply voltage, and the supply voltage pins of adjacent cells are usually connected to the same branch of the power grid.

(h) Circuit-level model accuracy (wc.8): The accuracy of the circuit-level models representing a cell in a netlist affects the shape of the set of waveforms we use to obtain the basis functions for the variational waveform model. A selection of our cases of interest is as follows:

(i) RC_CC: We use the extracted resistance, capacitance, and coupling capacitance for the cell and the load; the load is an inverter or multiple inverters with a common input.

(ii) RC_CC+C: We use the extracted resistance, capacitance, and coupling capacitance for the cell, while we use a capacitor for the load; this is for characterizing the cell using the effective capacitance.

(iii) RC: We use the extracted resistance and capacitance for the cell and the load; the load is an inverter or multiple inverters with a common input.

(iv) NP: We use just the transistor models, with no parasitics, for the cell and the load.

Table 4.20 generalizes all the accuracy improvement methods for the waveform and cell models. The columns of the table are the type of the category of the possible improvement method, the category name, and our options for each category. The improvement methods are coded as c, w, or wc followed by a number, for cell models, waveform models, and both, respectively.

We evaluated category w.2 in Chapter III, and we have categories c.1 and c.2 as improvement methods for cell models, which will be described in Section 4.4.2. We designed another experiment to explore most of our other options.

We use a discretization level of 11 for cells based on FreePDK45 to have straightforward numbering for the 1.1 V supply voltage, which sets the distance between subsequent levels to 0.11 V. In contrast, we used a discretization level of 19 for TSMC 180 nm technology with its 1.8 V supply voltage, which set the distance between subsequent levels to 0.10 V.

We want to know how the accuracy of the variational PCA waveform model, based on variation model parameters of Table 4.5, is affected for the cases when we make these changes:

(a) Increasing the discretization level from 11 to 19 while the Vt variation is 5% and the L variation is 20% (wc.2),

(b) Using different cell and load models of Table 4.20 (wc.8),

127 (c) Using a central composite design instead of a full-factorial design while the Vt variation is 20% and the L variation is 20% (wc.1),

(d) Using a subrange model by subranging only the load, only the slope, and both while the Vt variation is 20% and the L variation is 20% as shown in Table 4.7 (wc.4),

(e) Increasing the range of variations from 5% to 20% in both threshold voltages and channel lengths as shown in Table 4.6 (wc.5), and

(f) Using different types of waveform basis functions based on: rising output waveforms, falling output waveforms, both rising and falling output waveforms, and the converged waveform model (w.1).

We used option 1 for categories wc.6 and wc.7, which are "no variations for load" and "fixed rail-to-rail voltage swing". Please note that the variation model parameters for the combination of cases (d) and (e) are shown in Table 4.8.

Table 4.21 shows a series of waveform models obtained by varying the options listed. This table is for the case in which we extract the resistance, capacitance, and coupling capacitance (tagged RCCC) for the cell model. It is similar to Table 4.19, with the addition of two columns for the DOE (design of experiment) and the number of points (discretization level). We have three more sets of these tables for (a) extracted parasitics for both resistance and capacitance (tagged RC), (b) extracted parasitics for only capacitance (tagged C; capacitance is used instead of fanout), and (c) no extracted parasitics (tagged NP). We use an experimental design to evaluate how our PCA waveform accuracy is affected by the options in categories wc.1-wc.8 of Table 4.20.

We compare the accuracy of all methods using these criteria:

(a) Mahalanobis distance: We compare the original output waveforms, originating from the selected experimental design, with their 2-PC PCA waveform approximations. We plotted the maximum, the average, and the standard deviation of the Mahalanobis distance for all cases in Table 4.21.

Table 4.21. Waveform models compared for accuracy – FreePDK45.

#   Model Name                        DOE  Points  Trise  Tfall  Slope (%)  Fanout (%)  L (%)  Vt (%)
1   11Pts-L05-Vt05-SR-F-Tr-RCCC       FF   11      √             100        12.5        5      5
2   19Pts-L20-Vt05-SR-F-Tr-RCCC       FF   19      √             100        12.5        20     5
3   11Pts-L20-Vt05-SR-F-Tr-RCCC       FF   11      √             100        12.5        20     5
4   11Pts-L20-Vt20-SR-FT-Tr-RCCC      FF   11      √             10         12.5        20     20
5   11Pts-L20-Vt20-SR-F-Tr-RCCC       FF   11      √             100        12.5        20     20
6   11Pts-L20-Vt20-SR-F-CCF-Tr-RCCC   CCF  11      √             100        12.5        20     20
7   11Pts-L20-Vt20-Tr-RCCC            FF   11      √             100        100         20     20
8   11Pts-L05-Vt05-SR-F-Tf-RCCC       FF   11             √      100        12.5        5      5
9   19Pts-L20-Vt05-SR-F-Tf-RCCC       FF   19             √      100        12.5        20     5
10  11Pts-L20-Vt05-SR-F-Tf-RCCC       FF   11             √      100        12.5        20     5
11  11Pts-L20-Vt20-SR-FT-Tf-RCCC      FF   11             √      10         12.5        20     20
12  11Pts-L20-Vt20-SR-F-Tf-RCCC       FF   11             √      100        12.5        20     20
13  11Pts-L20-Vt20-SR-F-CCF-Tf-RCCC   CCF  11             √      100        12.5        20     20
14  11Pts-L20-Vt20-Tf-RCCC            FF   11             √      100        100         20     20
15  11Pts-L05-Vt05-SR-F-Tg-RCCC       FF   11      √      √      100        12.5        5      5
16  19Pts-L20-Vt05-SR-F-Tg-RCCC       FF   19      √      √      100        12.5        20     5
17  11Pts-L20-Vt05-SR-F-Tg-RCCC       FF   11      √      √      100        12.5        20     5
18  11Pts-L20-Vt20-SR-FT-Tg-RCCC      FF   11      √      √      10         12.5        20     20
19  11Pts-L20-Vt20-SR-F-Tg-RCCC       FF   11      √      √      100        12.5        20     20
20  11Pts-L20-Vt20-SR-F-CCF-Tg-RCCC   CCF  11      √      √      100        12.5        20     20
21  11Pts-L20-Vt20-Tg-RCCC            FF   11      √      √      100        100         20     20

(b) The absolute value of error: This gives an idea of the gap between the original waveform and the estimated waveform. We plotted the maximum of this parameter over each discretization point, the maximum of its average over each discretization point, and its average over all points.

(c) The absolute value of the relative error at the middle point (50% point) of the waveform: We use the 50% point of the waveforms as the basis for delay propagation of a waveform. We plotted the average and the maximum of this parameter.

(d) Relative errors: The relative error is calculated by dividing the absolute error (not the absolute value of the error) by the original value of each timing point. (The absolute value of this parameter was used in the previous criteria.) We plotted the average and the standard deviation of this parameter for all points and for just the 50% points.

(e) Absolute errors: An absolute error is calculated by subtracting the original value of a timing point from its estimate. We plotted the average and the standard deviation of this parameter for all points and for just the 50% points.
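Criteria (b)-(e) are straightforward to compute once the original and estimated timing matrices are available. The sketch below uses synthetic data (rows are waveforms, columns are discretization points, with the 50% point taken as the middle column); the matrices and noise level are placeholders, not our measured errors:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic timing points (strictly positive) and a noisy model estimate.
original = np.abs(rng.normal(1.0, 0.2, size=(50, 11)))
estimate = original + rng.normal(0.0, 0.01, size=original.shape)

abs_err = estimate - original   # signed absolute error, criterion (e)
rel_err = abs_err / original    # signed relative error, criterion (d)
abs_abs = np.abs(abs_err)       # |error|, criterion (b)
mid = original.shape[1] // 2    # the 50% point column

stats = {
    "max |error|": abs_abs.max(),
    "max of per-point averages": abs_abs.mean(axis=0).max(),
    "average |error|": abs_abs.mean(),
    "avg |rel. error| at 50% point": np.abs(rel_err[:, mid]).mean(),
    "std of rel. error (all points)": rel_err.std(),
}
for name, value in stats.items():
    print(f"{name}: {value:.4g}")
```

Note the ordering that always holds: the maximum |error| bounds the maximum per-point average, which in turn bounds the overall average.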

Figures 4.23-4.25 are plots similar to Figures 4.20-4.22. We included Figures 4.26-4.27 to show how the other criteria compare. The plots are for models 1-7 for the case in which the load is an inverter and the parasitics extracted for the inverter include resistance, capacitance, and coupling capacitance. In general, the plots are similar, but there are some exceptions. Studying the plots shows:

(a) The results of comparison using the Mahalanobis distance are not consistent with the results from using relative errors.

(b) Increasing the discretization level should increase accuracy, but the accuracy can decrease in some cases. This can be explained by the data dependency of the PCA and by the fact that we change the datasets of discretized points when we have to extrapolate beyond the data points available from our simulations.

(c) The results of using RCCC are almost the same as those of using RC. The results of using RCCC, RC, NP, and C are consistent for most of the cases; however, using only a capacitance instead of an inverter as the load seems to give the best results, although it is not as realistic as using the inverter with a nonlinear, time-varying capacitance.

(d) Using a faced central composite design improves the accuracy in comparison with a 2-level full-factorial design, but the improvement is probably not worth the characterization overhead.

(e) Subranging load and slope can improve the accuracy in many cases, but the results are very data dependent. It seems that the waveforms of a 2-level full-factorial dataset generated from subranged parameters form a more uniform cluster, and are probably more similar, than such waveforms for the whole range of parameters. The waveform models obtained by subranging both load and slope were, in general, more accurate than the waveform models obtained by subranging just the load. This means that subranging should be done in both dimensions for the best results, which we call symmetric subranging.

(f) It seems that increasing the range of variation reduces the accuracy of the models, but this is not a general rule, because the results are very data dependent. It seems that the waveforms of a dataset generated from a larger range of parameter variation form a less uniform cluster, and are probably less similar because of the nonlinearity of the parameters, than such waveforms for a smaller range of parameter variations. The waveform models obtained by increasing the range of variations in both Vt and L were, in general, more accurate than the waveform models obtained by increasing just the range of variation in L. This means that increasing the range of variation should be done for both L and Vt for the best results, which we call the symmetric increase of the range of variation.

(g) A PCA waveform model obtained from combining the rising and falling datasets is almost as accurate as one from using only a rising dataset. The PCA waveform model obtained using the rising (falling) dataset is the most (least) accurate. The accuracy of PCA waveform models obtained using both the falling and rising datasets falls between the two.

Figure 4.23. Waveform model accuracy compared using Mahalanobis distance.

Figure 4.24. Waveform model accuracy compared using the absolute value of relative errors.

Figure 4.25. Waveform model accuracy compared using absolute value of relative errors of 50% points.

Figure 4.26. Waveform model accuracy compared using relative errors.

Figure 4.27. Waveform model accuracy compared using absolute errors.

We created 3-dimensional plots, similar to Figures 4.23-4.25, to compare models 1-7 for changes in the options of categories w.1-w.2 and wc.1-wc.8 of Table 4.21 for the experimental design. The plots are available in Appendix F.

We learned from studying the plots that the results are similar in general, but there are some exceptions. Table 4.22 summarizes the results of all the plots. They agree with the results we mentioned earlier.

Table 4.22. Which options improve waveform model accuracy – FreePDK45.

                                    Does the option improve accuracy based on the criteria?
Option                              Set 1:        Set 2:     Set 3:     Set 4:       Set 5:   Overall (#)
                                    Mahalanobis   Average    Max of     Max 50%      Std
                                    Max Avg                  Group
Subranging                          Yes           Yes        Yes        Yes          Yes*     Yes (5)
Central Composite Design (Faced)    Yes           Yes        Yes        Yes          -        Yes (4)
Larger Range of Variation           No            Yes        No         -            No*      Yes (1)
Higher Discretization Level         No            Yes        No         No           Yes      Yes (2)
Using C as Load                     Yes           Yes        -          -            Yes      Yes (3)
Tr, Tg, Tf (a greater-than          Tr>Tg>Tf,     Tr>Tg>Tf   Tf>Tg>Tr   Tg>Tf>Tr,    *        *
sign means better)                  Tr>Tf>Tg      (Average),            Tr>(Tf=Tg)
                                                  Tg>Tf>Tr
                                                  (Std),
                                                  Tr>Tf>Tg
                                                  (Max)
# of plots in group                 2             7          6          2            4        21

4.4.1.4. Accuracy Analysis of the PCA Waveform Model for FreePDK45 – The Iterative Method for Finding the Common PCs

We combined the rising and falling waveform datasets to construct a common waveform model for both rise and fall transitions and investigated the effect of this combination. We intended to know how the PCA waveform model accuracy is affected by using a common set of PCs for both the input and output waveforms of a cell. Having a common set of PCs reduces the memory requirement and simulation time for a cell, but it can affect the accuracy, as we described in Section 3.2 of Chapter III. We used the inverter based on FreePDK45 technology to create the PCA waveform model. The basis functions were based on a common set of principal components. We used a similar methodology to construct the PCA waveform model with a unique set of basis functions based on a set of common principal components, as we did for TSMC180RF in Section 3.2 of Chapter III, but we made two changes for improvement. We used a capacitor as the load instead of an inverter, and we used the dataset of the combined set of rising and falling waveforms for PC model construction in the next iteration. For TSMC180RF, we had used the dataset of rising waveforms for the next iteration and assumed the resulting PCA waveform model could be used for both rising and falling transitions.

We followed the iterative method of finding the set of common PCs described in Chapter III and obtained convergence on the second iteration. Figure 4.28 shows the convergence of the coefficients of the principal component basis functions.

Two PCs cover 99.7% of the variance for iteration 0 and 99.4% of the variance for iterations 1-2. We performed two more iterations (3, 4) after observing convergence on iteration 2 to make sure the convergence was stable. We evaluated the model adequacy (fitting accuracy) and prediction accuracy of two PCA waveform models for iterations 1 and 4. Waveform model adequacy is based on the statistics of residual errors. Waveform model prediction accuracy was evaluated by cross-checking a model (e.g., the PCA waveform model of iteration 4) with the waveform dataset of the other model (e.g., the waveform dataset of iteration 1) and evaluating the residual errors. Moreover, we used the dataset of iteration 0, based on Slope instead of [L, Θ], to cross-check the dataset with the PCA waveform model of iteration 4.

[Plots of the coefficients of PC1 (a) and PC2 (b) versus data points for iterations IT(0)-IT(2), rising transitions (Trise).]

Figure 4.28. The coefficients of the principal component basis functions for the inverter based on FreePDK45 technology, computed after each of the iterations: (a) PC1 and (b) PC2.

Figure 4.29 shows the waveforms for iteration 1 of the inverter based on FreePDK45 technology in both the time domain and the PCA domain. The plots are similar to Figures 3.1-3.2. They show the original corners of the experimental design as they would appear if we could have used Cartesian coordinates, similar to Figure 3.3, which helped us to explain the concept of invalid waveforms and the acceptability region in Figure 3.4.

Figure 4.29. Waveforms for iteration 1 of the inverter based on FreePDK45 technology, (a) Time Domain and (b) PCA Domain.

The pink line is the limit line, similar to the one in Figure 3.6, used to impose the convergence requirement. We can also observe the triangular-shaped region that led us to use polar coordinates to map each data point back to the time domain. Some data points are mapped close to the pink limit line because we had to extrapolate some of the end points of the transitions, since some end points of the output transitions could not be captured in our Hspice simulations. We also applied the 8 ns limit on the waveforms by cutting the upper tail of any transition longer than the 8 ns limit. We can do that because the tails of the transitions falling outside the 10% to 90% range do not have a significant effect on the switching timing of the NMOS and PMOS transistors of the inverter, because one of them should be off at each tail.

We used the same criteria to assess both model adequacy and prediction accuracy. The models are listed in Table 4.23. The columns of the table are self-descriptive. All the models are similar to item 21 (i.e., 11Pts-L20-Vt20-Tg-RCCC) of Table 4.11, with the one exception that only a capacitance is used as the load in the simulations. Figures 4.30-4.33 show accuracy comparison plots similar to those of Figures 4.20-4.22.

We observe that cross-checking the waveform dataset of iteration 0 with the waveform model of iteration 4 results in the largest errors in all three plots. This implies that a saturated ramp transition cannot be mapped perfectly to our PCA waveform model. This is a source of error when we compared the timing results for tabular STA and our PCA method in Chapter III. We constructed our PCA waveform model for the extreme cases of the largest range of process variation and the largest range of load and capacitance, which makes the cluster of output waveforms very nonlinear. It is obvious that mapping a line to a nonlinear curve cannot be done perfectly. Based on our simulation results, the waveform shapes for the cell are in reality very nonlinear. This fact suggests that using a waveform model with better support for nonlinearity, such as our PCA waveform models, can improve the accuracy of our cell models.


Table 4.23. Waveform models compared for adequacy and prediction accuracy – FreePDK45.

#   Case Name (PCA waveform model, or the       Model       Checking   Checking     Cross-checked
    combination of waveform dataset plus        iteration   adequacy   prediction   dataset
    waveform model iteration numbers)                                  accuracy     iteration
1   11Pts-L20-Vt20-Tg-IT1                       1           √
2   11Pts-L20-Vt20-Tg-IT4                       4           √
3   11Pts-L20-Vt20-Tg-IT4inIT1                  1                      √            4
4   11Pts-L20-Vt20-Tg-IT1inIT4                  4                      √            1
5   11Pts-L20-Vt20-Tg-IT0inIT4                  4                      √            0


Figure 4.30. Waveform accuracy for a waveform model based on a common set of PCs – Mahalanobis distance (FreePDK45 – Iterations 0-4).


Figure 4.31. Waveform accuracy for a waveform model based on a common set of PCs – Max, Average, and Max. Ave (FreePDK45 – Iterations 0-4).


Figure 4.33. Waveform accuracy for a waveform model based on a common set of PCs – 50% point (FreePDK45 – Iterations 0-4).


We dropped the last model from the plots in Figures 4.30-4.33 to obtain the plots in Figures 4.34-4.36. These plots better compare the model adequacy and prediction accuracy of the models. We observe that the PCA waveform model of iteration 4 is as adequate as the one from iteration 1. Moreover, the prediction accuracy of the PCA waveform model based on iteration 4 is as good as its adequacy (fitting accuracy) when we cross-check the waveform model of iteration 4 with the dataset of iteration 1. This statement remains valid when we exchange 1 and 4.

The Mahalanobis distance in Figure 4.34 shows how similar the shapes of the waveform estimates are to the shapes of the original waveforms for all four models. The maximum errors for all the points are less than 7% according to Figure 4.35. The average error for all the points of our waveform estimates, in comparison with the original waveforms, is about 5%, and the average error of the 50% point of the waveform estimates is about 3% according to Figure 4.36.

There are other plots similar to Figures 4.26-4.31 for the PCA waveform model based on a common set of PCs, but we do not include them in this document.


Figure 4.34. Waveform accuracy for a waveform model based on a common set of PCs – Mahalanobis distance (FreePDK45 – Iterations 1-4).


Figure 4.35. Waveform accuracy for a waveform model based on a common set of PCs – Max, Average, and Max. Ave (FreePDK45 – Iterations 1-4).


Figure 4.36. Waveform accuracy for a waveform model based on a common set of PCs – 50% point (FreePDK45 – Iterations 1-4).

4.4.2 Accuracy Analysis of the Cell Models

The accuracy of cell models is affected by the accuracy of the waveform model that the cells use. In this section, we investigate the accuracy improvement methods specific to our cell models, as opposed to our waveform models.

Although it is possible to construct a variation-aware timing model based purely on tables, the exponential complexity (with the table resolution as the base) of the characterization time and memory with respect to the number of dimensions (variables) makes such models impractical. Such models could be very accurate with enough resolution for each dimension, while being impractical as described. Such models can be compacted using linear regression to find their corresponding multivariable equations; however, they still suffer from exponential characterization time. Our compact variation-aware models use just a small fraction of all the points, based on analysis of variance or on sampling the whole space; therefore, the characterization time will be much less, while still being exponential with a base of 2.

The table-based model will be more accurate because, in our models, the existence of residuals means that the equations do not pass through all the points in the tables; therefore, the residuals will reduce accuracy in favor of compactness of the models.

We used a more accurate waveform model to increase the accuracy of our cell modeling. We want to know how much accuracy improvement can be achieved by building a compact variation-aware model using slope instead of the variational waveform model. In Section 4.4 we built our models using slope instead of the variational waveform model. Although those models could give us reasonable accuracy, it is obvious that a more accurate waveform model can increase the accuracy of the cell models. We did not construct our cell models for the same technology; therefore, we do not have any data on the level of improvement in accuracy.

In general, including more sampling points at strategic locations in designs can improve the accuracy of our cell models; therefore, building models using designs with more sampling points, such as central composite designs [52] and 3-level full-factorial designs [52], should improve the accuracy of the models but increases the characterization time. It would be interesting to see whether the increase in accuracy is worth the characterization time overhead; therefore, we included this topic as an item in our future research list.

4.4.3 Conclusions

The tabular cell models can be very accurate, but their time and space complexity prohibits using them for a large number of parameters; therefore, our compact variation-aware cell models can be useful considering their superior time and space complexity. However, the accuracy is impacted. We listed the sources of the errors in our waveform and cell models as circuit-level simulation errors, experimental design sampling errors, and model fitting errors. We categorized our accuracy improvement methods as (a) waveform model improvement and (b) cell model improvement. Since we use a waveform model in a cell model, the accuracy of a cell is affected by the accuracy of the waveform model used. Consequently, our variational waveform model can improve the accuracy without imposing a large overhead on cell characterization.

We used the TSMC180RF and FreePDK45 technologies in developing our waveform and cell models. We drew several conclusions from a set of experiments to evaluate the accuracy of our waveform and cell models. First, our methodology is not technology dependent. Second, we can use a common set of basis functions for our variational waveform model without much impact on its accuracy. Third, the accuracy of our variational waveform model increases with the number of points used to represent the waveform, but it saturates beyond a limit. Fourth, using the effective-capacitance load model increases the accuracy of cell modeling in comparison with the number-of-fanouts load model that we used before, although using a resistive-capacitive load model is more accurate still. Fifth, symmetric subranging for load and slope can increase the accuracy of the models in general, but not always. Sixth, increasing the range of parameter variation can reduce the accuracy of the models in general, but not always. Seventh, our methodology is applicable to cell characterization with loads that can themselves be affected by variation, but the accuracy is affected depending on whether we characterize our cells for a variational or a fixed load. Eighth, the accuracy of our models is affected by the difference between the supply voltage and the total swing of the input waveforms, which can arise from variation in the supply voltage. Ninth, the experimental design used can affect both our waveform and cell models, and using more samples at better sample locations can improve them. Finally, the inaccuracies of the circuit-level model used, as well as of the simulator, are reflected in the high-level cell models.

While we tried to explore as many options as we could to find ways to improve the accuracy of our waveform and cell models, we had to narrow down our selection of options. We leave exploring more options, such as using the resistive-capacitive load model, as items in the future research directions that we mentioned in Section 4.3.8.

CHAPTER V

FAST VARIATION-AWARE STATISTICAL DYNAMIC TIMING ANALYSIS

A statistical dynamic timing analysis framework is presented to study the impact of catastrophic defects and process variation on the delay behavior of a digital circuit, considering the effect of gate switching on delays. It uses object-oriented programming and levelized code generation techniques to achieve fast runtimes with linear time complexity as the number of gates increases. The generated functional delay model, along with the experiment and statistical modules, is compiled to machine code before execution, and random transition vectors approximate the delay profiles, which are useful for virtual speed grading and yield estimation. The methodology was published in [85].

5.1. Introduction

Yield of digital circuits is reduced when a fraction of circuits do not meet timing constraints. Major sources of yield loss include process variability and random defects. Statistical timing analysis enables estimation of yield loss from failures of timing specification tests.

Timing characteristics of individual circuits are primarily a function of process variations, especially from channel length [86]-[88] and threshold voltage [89]; however, manufacturing defects can also significantly degrade timing characteristics [90]. Hence, it is important to determine timing sensitivity to both process variations and to major sources of defects, such as resistive vias.

Timing analysis can be static or dynamic. Dynamic Timing Analysis (DTA) and Static Timing Analysis (STA) are not alternatives to each other. STA estimates the delay of paths (from inputs to outputs), supposing the other inputs of the gates not in the path have fixed logic values. Dynamic timing analysis verifies the functionality of the design by applying input vectors and checking for correct output vectors. The quality of DTA increases with the number of input test vectors, at the expense of simulation time.

While a static timing analysis (STA) approach is very pessimistic, statistical STA methods are more realistic [6],[91]-[92]. Path-oriented statistical STA tools perform a similar analysis for a set of predetermined critical (longest delay) paths and form the corresponding delay probability density functions considering parameter variations, which are combined to form their joint probability density function. The run time is a linear function of the number of paths, although the total number of paths is an exponential function of the number of gates.

In high-performance designs, such as pipelines, all paths are designed to have very close delays; therefore, all paths must be included in the analysis, which effectively makes the run time an exponential function of the number of gates [1]. Moreover, process variation can make a critical path non-critical or vice versa. To make sure all the paths are covered, all must be selected. In such situations, block-based statistical static timing simulation has much better performance [1]; however, statistical timing simulation with a proper set of vectors is much more accurate when the switching of gates is considered [93].

We have designed a statistical dynamic timing simulation framework to study the impact of catastrophic defects and process variations on the delay behavior of a digital circuit. The tool can be used for virtual speed grading, yield estimation, and delay fault diagnosis. To the best of our knowledge, this is the first statistical DTA tool that considers statistical distributions of parameters, not just corners as in [98],[99], and it has almost linear time complexity as the number of gates increases.

Based on the classification of statistical performance simulation methodologies in [96], statistical timing simulation can be Monte Carlo [93], Quasi-Monte Carlo [97], or analytical block-based using joint probability density functions [6],[98]. While Monte Carlo implementations suffer from intense computation requirements, analytical approaches have exponential worst-case run times. We used a Monte Carlo approach.

Our approach uses gate-level simulation to achieve faster speed in comparison with circuit-level or switch-level simulation, combined with a levelized-compiled code approach to achieve higher performance. While statically-scheduled levelized code evaluates logic gates based on the partial order of causality, dynamically-scheduled code schedules evaluations just as needed [99].

Digital circuit simulators are either interpreted or compiled-code. Circuit compilation increases the efficiency and speed of simulation at the cost of preprocessing time and larger code. Circuit compilation is essentially a pre-processing step that symbolically executes the simulation to "uncover" data structures that can usually be statically allocated. The circuit graph traversal is eliminated by hard coding in the simulator kernel. Moreover, most indirect memory references are replaced with direct ones. Compilation unrolls most loops and embeds most function calls; therefore, it reduces the context-switching overhead and increases instruction-level parallelism in parallel or superscalar processors [100].

Texam [101] makes compiled-code cycle simulation more efficient by separating timing and function. Boolean operations are executed in one cycle. Multi-value logic code generation was developed to eliminate the need for event-driven simulations. We enhanced this approach by including the effects of parameter variations on timing and by expressing delays with equations in our framework.

Fault simulators use fault models to simulate faults. In this work, only defects resulting in delay faults are considered. Such defects may result from resistive vias, which are common in deep submicron technologies, and from random process variations.

Section 5.2 gives more information about the statistical dynamic timing analysis (SDTA) framework and its implementation. Section 5.3 presents the experimental results with their comparison and interpretation. Section 5.4 is dedicated to the conclusion and future work.

5.2. VVCCP – A compiled-code SDTA tool

A framework has been created for statistical dynamic timing analysis, based on Verilog/VHDL Compiler-Compiler Programs (VVCCP), which extends VCCP [101] with statistical dynamic timing simulation code generation modules.

5.2.1 Fault simulation framework

C++ code generation allows building flexible experiments with integrated test benches for statistical analysis and fast simulation runs. Fast simulation is achieved by running machine-level code, and no external simulation engine is required.

Using the framework, a faultable functional delay model is created. Delay faults are injected as variations of gate delays, which include global process shifts and random delay faults, defined in the corresponding generated test benches. The integrated statistical analysis in the test bench determines the simulation results.

Figure 5.1 shows the simulation framework. Verilog [102] and VHDL [103] are used as the front-end for the input. The fault simulation engine and critical path functional delay model are generated for the input model in C++. The generated code is run after compilation.

Figure 5.1. Fault simulation framework.

Here, the experiment of interest is delay profile generation and comparison. A delay profile is the cumulative distribution function of the critical path delays of a circuit. Examples of these delay profiles are shown in Figure 5.6 in Section 5.2.3, which provides the delay as a function of a set of 4096 random patterns. The minimum and maximum delays in Figure 5.6(b) are 0 and 1400 a.u., respectively. It can be seen that some patterns produce significantly larger delay than others.

The framework supports experiments such as path delay as a function of a set of test patterns in the presence of global parametric shifts, random within-die variation, and single and multiple delay faults. From these simulations, we obtain the impact on the critical path delay and the impact on the delay distribution for all test patterns.

Figure 5.2 shows the block diagram of VVCCP. After lexical analysis of the gate level model, intermediate code is generated that is converted to an abstract model used by the code generators. Special code generators were designed and implemented to generate the functional delay model, the statistical module, and the automated experiments. The code generator for the critical path delay functional model is based on the functional delay model. Other code generator modules can be plugged into the suite.

Figure 5.2. VVCCP block diagram.

5.2.2 Transformation process of models

The transformation needs two mappings, as shown in Figure 5.3. The first one maps the input model into the functional delay model. This is symbolically shown by replacing the input/output port labels with delay_-prefixed labels, creating an input delay vector and an output delay vector. Gate delays are shown as vectors. The next mapping creates the critical path functional delay model by converting the output delay vector into a scalar delay value. If logic is not considered, the critical path functional delay model actually gives an upper bound on the logic-dependent critical path delays.

The input/output delay initializations are essential to the correct functionality of the model. The output delay vector ($\vec{D}_{Outputs}$) is a function of the input delay vector ($\vec{D}_{Inputs}$) and the gate delay vector ($\vec{D}_{Gates}$):

$\vec{D}_{Outputs} = DF(\vec{D}_{Inputs}, \vec{D}_{Gates})$   (5.1)

The functional critical path delay model is the sub-graph of the delay graph based on the activity of the logic signals. The three main steps are the initialization of the logic and delay elements, the logic network evaluation, and the delay network evaluation.

Figure 5.3. Model transformations.

In static timing analysis, the critical path delay function is a scalar value that is the maximum delay considering all outputs, input delays, and gate delays:

$d = CPDF(\vec{D}_{Inputs}, \vec{D}_{Gates})$   (5.2)

Dynamic timing analysis accounts for the sensitizability of paths. The critical path delay function is the vectorized form of the former function, where each element of the vector corresponds to one applied logic transition input vector ($\vec{L}_{Inputs}$):

$\vec{d} = CPDF(\vec{L}_{Inputs}, \vec{D}_{Inputs}, \vec{D}_{Gates})$   (5.3)

A six-value system [104] is used for the logic manipulation. Faulty delays are computed through fault insertion:

$\vec{d}_{Good} = CPDF(\vec{L}_{Inputs}, \vec{0}, \vec{D}_{GoodGates})$   (5.4)

$\vec{d}_{Faulty} = CPDF(\vec{L}_{Inputs}, \vec{0}, \vec{D}_{FaultyGates})$   (5.5)

5.2.3 Experiments

Four types of experiments on input models were implemented: a parametric shift experiment, a random within-die variation experiment, fault-free circuit delay profile generation, and faulty circuit delay profile generation. The gate-level delay model of Section 5.2.2 is used in this implementation; however, the experiments can be modified to use actual delays derived from back-annotation of a design extracted from a layout.

Figure 5.4 shows the graphical representation of the first two experiments (a, b). Red dots correspond to faulty gates, and green dots are fault-free gates.

Figure 5.4. Parametric and random experiments.

The parametric shift experiment increases the delay of all of the gates from 0% to 100% as the fault injection method. This corresponds to global process variation, which is analyzed for the worst case or corner analysis. A random within-die variation experiment adds random gate delay variation of 0% to 100% of the nominal delay, with an injection probability of 0% to 100%, to the parametric experiment. This models variation due to random variation in channel doping and channel length. Comparison of the delay distributions is made after normalizing the maximum delay of each circuit instance.

Figure 5.5(a) shows the experimental design for the fault-free and faulty delay profile generation experiments. The light green dot in the middle of the front box corresponds to (b); the dark green dots in the middle of the sides of the front box correspond to (c); the red dot in the middle of the back box corresponds to (d); the blue dots at the corners of the front box correspond to (e); and the pink dots at the corners of the back box correspond to (f).

The experiments that were performed are: (b) generation of the ideal fault-free delay profile (DP) using equation (5.4); (c) generation of the fault-free DP in the presence of process shifts (±30% shift); (d) generation of the faulty DP with a single delay fault with a 100% increase in delay; (e) generation of the good DP with ±5% random noise and a ±30% shift in the process, where "good" here refers to not ideal and possibly faulty; and (f) generation of the faulty DP with a single delay fault with a +100% increase in delay, ±5% noise, and a ±30% shift in the process. We generate the delay profiles using appropriate fault injection for each experiment and apply the same set of input transitions. Experiments (c)-(f) use equation (5.5). Figure 5.5 shows the resulting delay distribution functions.

Details of implementation are different for each experiment; however, the general pseudo-code is shown in Figure 5.6. The path delay function first determines the logic and delay of each output; then it obtains the path delay by evaluating a function in terms of the delay and logic of the output nodes. The delay profile generator initializes all experiment parameters and then applies the set of transition vectors. In the loop, it first resets the delays of all gates and signals of the circuit; then it calls the path delay evaluation function and updates the statistical tables.


Figure 5.5. Delay profile generation experiments.

158 delay path_delay_function() {

… logic_function(); delay_function();

//Delay is expressed as a function of other delay and logic values delay one_bit_alu_max_delay= … /* s_delay ^ co_delay */

return one_bit_alu_max_delay; }

DelayProfileGeneration (…) {…

SetExperimentParameters (Single Delay Increase, Process Shift, Noise);

For all (transition_vectors) {

one_bit_alu_max_path_delay. reset();…

one_bit_alu_max_path_delay= path_delay_function();

UpdateTable(output path delay);}

…}

Figure 5.6. Delay profile generation pseudocode.

5.3. Experimental results

Delay profiles were generated for the experiments outlined in Section 5.2.3 using the ISCAS'85 benchmarks [105]. The circuits were simulated, and run times are shown in Table 5.1. The ideal, good, faulty, and single-fault experiments correspond to experiments (b), (e), (f), and (d) of Figure 5.5, respectively.

Table 5.2 shows the compile time, run time, and the number of gates, inputs, and outputs for the ideal case. The run times and the compile times vs. the number of gates are plotted in Figure 5.7. The run times have an almost linear relation with the number of gates; the nonlinear relation of compile times with the number of gates can be reduced using more efficient compilers. However, compilation is done just once, while simulations run many times. Consequently, the compile time overhead is averaged over many simulations.

Table 5.1. Run time for different experiments.

                      Run Time (s)
Model     Ideal   Good   Faulty   Single Fault
c17       0.3     0.3    0.3      0.3
c432      7       5      5        7
c499      9       9      9        9
c880a     17      17     17       17
c1355     29      20     19       29
c1908     44      43     43       44
c1908a    42      42     42       42
c2670     60      48     48       60
c2670a    58      47     47       55
c3540     88      31     31       88
c3540a    85      37     38       86
c5315     123     47     47       124
c5315a    119     47     48       119
c6288     262     81     82       261
c7552     197     178    152      197

Figure 5.7. Run times and compile times vs. the number of gates.

All the experiments were run on a computer with the following specification: Intel Pentium IV, 2.4 GHz, 500 MHz bus, 512 MB RAM, and Microsoft Windows XP; the C++ compiler of Microsoft Visual Studio 6 was used for compilation.

Table 5.2. Number of inputs, outputs, gates, compile and run times.

                                    Time (s) (Ideal)
Model    Inputs   Outputs   Gates   Compile    Run
c17           5         2       6         0    0.3
c432         36         7     160         0      8
c499         41        32     202         0      9
c880a        60        26     357        60     17
c1355        41        32     546        60     29
c1908        33        25     718        60     45
c1908a       33        25     880        60     42
c2670       233       140     997       120     60
c2670a      157        64   1,193       120     58
c3540        50        22   1,446       180     88
c3540a       50        22   1,669       180     85
c5315       178       123   1,994       540    123
c5315a      178       123   2,307       540    118
c6288        32        32   2,416       780    261
c7552       207       108   2,978     1,320    197

5.4. Conclusion and future work

We have presented a dynamic statistical timing analysis framework with almost linear time complexity as the number of gates increases. Although it is based on Monte Carlo simulation, it provides very fast run times resulting from compilation to machine code. Having proper random transition vectors guarantees quality delay profiles for yield estimation and virtual speed grading.

Future work can benefit from more accurate gate delay models [55]-[59], better process variation models, more flexible experiments, and the selection of a more efficient compiler and development platform. Moreover, the framework may be used for other timing-related experiments, such as the detection and location of delay faults, i.e., gate delay fault diagnosis as in [104].

CHAPTER VI

FUTURE RESEARCH DIRECTIONS

Although several interesting research paths opened up after each phase of my research, the time and resource limitations of my graduate studies forced me to narrow my work to the direction explained in the previous chapters. However, I have documented the most interesting ideas as possible future research directions, which I present in this chapter as my short-term and long-term research plans.

6.1. Short-Term Research Plan

My short-term research plan is to innovate by expanding my current research and introducing a new perspective on the fundamentals. My research ideas on electronic design automation for compact variation-aware modeling would (a) expand the compact variation-aware modeling methodology described earlier, (b) contribute to the fundamentals of statistical performance analysis to improve the speed and accuracy of simulations, or (c) reduce the memory requirements of the current-source models currently adopted by industry using my statistical compact waveform models. My short-term goals are detailed below.

(a) I want to build empirical models for performance functions based on the insight gained from the data analysis of cell performance functions for several standard cell libraries from industry. This will help to determine the optimal polynomial order for each variation parameter and thus the general form of a compact performance function. Linear regression using the general form of the performance functions can improve the accuracy of the models in comparison with 2-level factorial designs, which consider only the linear terms and their combinations.
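The gain from choosing the polynomial order can be sketched on synthetic data; the quadratic "performance function" below is an assumption standing in for measured cell delay data from the industrial libraries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic performance function: quadratic in one normalized variation
# parameter, standing in for measured cell delay data.
x = rng.uniform(-1.0, 1.0, 200)
delay = 50.0 + 8.0 * x + 5.0 * x**2 + rng.normal(0.0, 0.1, 200)

def rms_residual(order):
    """Least-squares polynomial fit of the given order; RMS residual."""
    coeffs = np.polyfit(x, delay, order)
    resid = delay - np.polyval(coeffs, x)
    return float(np.sqrt(np.mean(resid**2)))

rms_linear = rms_residual(1)      # linear terms only, factorial-style model
rms_quadratic = rms_residual(2)   # optimal order inferred from the data
```

When the underlying function has curvature, the linear-terms-only fit leaves a residual on the order of the quadratic coefficient, while the correct-order fit reduces it to the noise floor.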

(b) I aim to extend the cell characterization methodology to statistical power analysis by considering current waveforms instead of voltage waveforms. I used the compact variation-aware modeling methodology to target delay as the performance parameter for statistical static timing analysis.

(c) I intend to formally classify transition waveforms based on the statistical analysis of switching transitions for several standard cell libraries from industry. The expected results can be used in formulating a very fast timing analysis engine considering real waveform shapes based on lookup tables. This research could fundamentally change the way timing analysis is performed by enabling simulation at a higher level of abstraction with the accuracy of circuit-level simulation. The main applications could be statistical performance simulation and variation-aware placement and routing.

(d) I am interested in determining how error propagates from a delay function in static timing analysis to the cumulative distribution function of delay in statistical static timing analysis. The range of error in the static timing analysis estimate affects the cumulative distribution function of delay. This can be used to answer the question of what accuracy is needed in static timing analysis to guarantee a given accuracy for statistical static timing analysis.

(e) I plan to use a statistical model to compact the waveforms of current-source models (CSMs). Current-source models such as the Composite Current Source (CCS) and the Effective Current Source Model (ECSM) use 3-D tables that suffer from large size and poor scalability. The current waveforms are represented as multi-parameter vectors (e.g., 20 points) instead of just one slope parameter. This increases the memory requirement and the size of the library for CSMs, while the variation-aware versions are much more resource intensive. In an effort to make the models smaller, enumeration of waveforms has been used in the compact form of the models, but this still requires a large table to populate the enumerated waveforms. It would be interesting to study the current waveform data to see if a statistical variational waveform model with just a few parameters could be used instead. This would reduce the memory requirement and the size of the libraries.
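The compaction idea can be sketched on synthetic data: 20-point current waveforms generated from two hypothetical latent parameters (peak and width) are compressed by PCA to three scores per waveform with small reconstruction error. Real CCS/ECSM waveform tables would replace the synthetic matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a CSM library: 300 current waveforms, each a
# 20-point vector, driven by two hypothetical latent parameters.
t = np.linspace(0.0, 1.0, 20)
peak = rng.uniform(0.8, 1.2, (300, 1))
width = rng.uniform(0.15, 0.25, (300, 1))
waves = peak * np.exp(-((t - 0.4) ** 2) / (2.0 * width**2))
waves += rng.normal(0.0, 0.005, waves.shape)      # measurement noise

# PCA via SVD of the mean-centered waveform matrix.
mean = waves.mean(axis=0)
U, s, Vt = np.linalg.svd(waves - mean, full_matrices=False)

k = 3                                    # a few statistical parameters
scores = (waves - mean) @ Vt[:k].T       # k numbers per waveform
recon = mean + scores @ Vt[:k]           # reconstructed 20-point vectors

rel_err = float(np.max(np.abs(recon - waves)) / np.max(waves))
```

Storing 3 scores instead of 20 samples per waveform cuts the table size by more than 6x in this sketch, at the cost of a small reconstruction error.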

6.2. Long-Term Research Plan

My long-term research plan would require a multidisciplinary research approach to solve these issues: (a) variation-aware performance analysis of nanometer systems beyond CMOS technology, (b) accurate and efficient frameworks for statistical performance analysis, and (c) probabilistic diagnosis of delay faults. Multidisciplinary research is interesting because it combines the key ideas of different disciplines and various areas of interest to find innovative solutions, which is not possible within any individual discipline. Below, I have detailed my long-term goals.

(a) I want to adapt my compact variation-aware methodology to the characterization of the switching elements of technologies with the potential to succeed CMOS technology, because the methodology is general and not specific to CMOS. To extend Moore's Law further, carbon-nanotube and graphene transistors may be able to overcome the fundamental limits of switches based on CMOS technology. The capability of intramolecular field-effect transistors and intramolecular logic gates based on carbon nanotubes as the building blocks of digital systems has been demonstrated before, and researchers have been working on perfecting their solutions to the problem of the large-scale manufacturability of carbon-nanotube devices. Moreover, significant attention has been paid to graphene transistors, and their evolution has been demonstrated by the development of (i) basic logic gates at the Politecnico di Milano, (ii) the graphene-based frequency multiplier prototype at the Massachusetts Institute of Technology, and (iii) 100-GHz graphene transistors at IBM. However, the performance of these devices will be more sensitive to variation in process and environmental parameters than that of devices based on CMOS technology. Therefore, the statistical performance evaluation of systems built with these devices will be essential for the same reasons mentioned for nanometer CMOS technology. Hence, we need a probabilistic view of performance analysis and compact variation-aware models for performance analysis. In addition, the optimization of the models for the set of new variation parameters would enhance the quality of the models.

(b) I intend to facilitate statistical performance analysis by creating the necessary probabilistic computation framework that can perform basic statistical operations accurately and efficiently, utilizing techniques from other areas of research. These are as follows.

i) I want to perform statistical performance analysis using HDLs [106]. Such analysis could be accomplished through non-native function calls from the simulator to perform primitive statistical operations (MIN and ADD) on cumulative distribution functions, which would enable the use of the same gate-level netlist to estimate performance at the early stages of design. This research uses the key concepts of hardware description languages and statistical timing analysis.

ii) I intend to speed up the primitive statistical operations (MIN and ADD) for statistical timing analysis by designing a special-purpose execution unit for the operations or by taking advantage of the vector-processing capabilities of modern microprocessors to handle them. Since these operations are slow, this method would enable fast simulation for block-based statistical timing analysis. This research makes use of computer-architecture techniques to accelerate statistical static timing analysis.
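The two primitives can be sketched on discretized distributions, assuming independent arrival times and a common integer time grid; a real engine would operate on characterized distributions, and the vectorized numpy operations stand in for the proposed hardware acceleration.

```python
import numpy as np

# Discretized delay distributions on a shared integer time grid (ps).
grid = np.arange(0, 200)

def gaussian_pmf(mean, sd):
    p = np.exp(-((grid - mean) ** 2) / (2.0 * sd**2))
    return p / p.sum()

def add(px, py):
    """Two stages in series: ADD convolves the delay densities."""
    return np.convolve(px, py)[: len(grid)]

def min_cdf(px, py):
    """MIN of independent arrivals: F_min = 1 - (1 - F_X)(1 - F_Y)."""
    Fx, Fy = np.cumsum(px), np.cumsum(py)
    return 1.0 - (1.0 - Fx) * (1.0 - Fy)

px = gaussian_pmf(50.0, 5.0)
py = gaussian_pmf(60.0, 8.0)

p_sum = add(px, py)
mean_sum = float(np.arange(len(p_sum)) @ p_sum)      # ~ 50 + 60

F_min = min_cdf(px, py)
median_min = float(grid[np.searchsorted(F_min, 0.5)])
```

ADD is a convolution (a natural target for vector units), while MIN is a pointwise combination of CDFs, so both map well onto SIMD-style execution.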

iii) I plan to use hardware description languages to model the incorporation of waveform shapes for the verification of timing closure at the early stages of the design of a digital system. The goal of this research is to enable timing analysis with HDLs at a higher level of abstraction with accuracy close to that of circuit-level simulation. The next step would be to construct models with hardware description languages, incorporating process variation and environment parameters, for the estimation of design performance parameters, such as timing and power, at the early stages of the design of a digital system. The goal is to perform static timing analysis at a higher level of abstraction that accounts for process variations, simulated with accuracy close to circuit-level simulation. Using a higher level of abstraction would speed up the simulations. This enables path-oriented statistical static timing analysis using static timing simulation. This research direction is different from those in items i and ii, whose ideas are applicable to block-based statistical timing analysis.

(c) I am interested in determining the location of a delay defect probabilistically using a delay signature, which I define as the probability density function of the critical path delay of a digital circuit for a set of input transition vectors. A delay signature database can be built by simulating a circuit for each possible delay defect for a set of input transitions. Comparison of the delay signature of a circuit under test with the delay signatures in the database would provide probabilistic location information for each delay fault. This method does not impose any overhead on testing, because a delay signature for the circuit under test is obtained by simply storing the output vectors for different clock frequencies during binning, while each output vector corresponding to the sequence of input vectors is captured by the scan chain during normal logic testing. This research combines pattern recognition and timing analysis for fault location.
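The matching step can be sketched as follows, with a hypothetical three-gate circuit, a fixed defect size, and a crude quantile distance standing in for a real pattern-recognition back end.

```python
import random

rng = random.Random(2)

# Hypothetical circuit with two paths sharing gate 1:
# path A = g0 -> g1, path B = g1 -> g2; critical delay is their max.
NOMINAL = [20.0, 30.0, 25.0]
DEFECT_EXTRA = 8.0

def critical_path_delay(d):
    return max(d[0] + d[1], d[1] + d[2])

def signature(defect_gate=None, n=2000, noise=1.5):
    """Delay signature: sorted sample of critical path delays."""
    sig = []
    for _ in range(n):
        d = [x + rng.gauss(0.0, noise) for x in NOMINAL]
        if defect_gate is not None:
            d[defect_gate] += DEFECT_EXTRA
        sig.append(critical_path_delay(d))
    return sorted(sig)

def distance(sig_a, sig_b):
    """Mean absolute difference of matching quantiles."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Signature database: one simulated signature per candidate defect site.
database = {g: signature(defect_gate=g) for g in range(len(NOMINAL))}

# Circuit under test with a defect on gate 0: rank candidate locations.
under_test = signature(defect_gate=0)
ranked = sorted(database, key=lambda g: distance(database[g], under_test))
best_location = ranked[0]
```

Defects on different gates reshape the critical path delay distribution differently, so the nearest database signature points to the most probable defect location.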

APPENDIX A

PRINCIPAL COMPONENTS ANALYSIS EQUATIONS

We list here the important equations related to principal component analysis that are used in the text. The reader can consult [26] for more information.

A.1. Assumptions

$x$ is a column vector of samples with $p$ elements:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} \tag{A.1}$$

If we have $n$ samples, the average vector of the samples is

$$\bar{x} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix} \tag{A.2}$$

where, for each $i$,

$$\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}. \tag{A.3}$$

$S$ is the covariance matrix of $X$:

$$S = \begin{bmatrix} s_1^2 & s_{12} & \dots & s_{1p} \\ s_{21} & s_2^2 & \dots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \dots & s_p^2 \end{bmatrix}, \tag{A.4}$$

where

$$s_{ij} = \frac{n\sum_{k} x_{ik}x_{jk} - \sum_{k} x_{ik}\sum_{k} x_{jk}}{n(n-1)}. \tag{A.5}$$

$S$ is a symmetric matrix.

$R$ is the correlation matrix of $X$:

$$R = \begin{bmatrix} 1 & r_{12} & \dots & r_{1p} \\ r_{21} & 1 & \dots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \dots & 1 \end{bmatrix}, \tag{A.6}$$

where

$$r_{ij} = \frac{s_{ij}}{s_i s_j}. \tag{A.7}$$

$R$ is a symmetric matrix.
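Equations (A.2)-(A.7) can be checked numerically; the synthetic correlated columns below are arbitrary and stand in for real sample data.

```python
import numpy as np

rng = np.random.default_rng(0)

# n samples of a p-variate x (p = 3) with correlated columns.
n, p = 500, 3
base = rng.normal(size=(n, 1))
X = np.hstack([base + 0.5 * rng.normal(size=(n, 1)) for _ in range(p)])

xbar = X.mean(axis=0)                  # (A.2)-(A.3)
S = np.cov(X, rowvar=False)            # (A.4)-(A.5), n - 1 in the denominator
sd = np.sqrt(np.diag(S))
R = S / np.outer(sd, sd)               # (A.7): r_ij = s_ij / (s_i s_j)
```

`np.cov` uses the same $n-1$ denominator as (A.5), and normalizing by the standard deviations reproduces `np.corrcoef`.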

A.2. Singular value decomposition

Singular value decomposition (SVD) is based on finding an orthonormal matrix $U$ such that

$$U'SU = L, \tag{A.8}$$

where $L$ is a diagonal matrix and $S$ is the $p \times p$ covariance matrix defined before. The elements of the diagonal matrix $L$ are the eigenvalues (latent roots) of $S$:

$$L = \begin{bmatrix} l_1 & 0 & \dots & 0 \\ 0 & l_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & l_p \end{bmatrix} \tag{A.9}$$

The columns of $U$ are the eigenvectors (characteristic vectors) of $S$:

$$U = \begin{bmatrix} u_{11} & u_{12} & \dots & u_{1p} \\ u_{21} & u_{22} & \dots & u_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ u_{p1} & u_{p2} & \dots & u_{pp} \end{bmatrix} \tag{A.10}$$

For example, $u_i$ (eigenvector $i$ of $S$) corresponds to $l_i$ (eigenvalue $i$ of $S$):

$$u_i = \begin{bmatrix} u_{i1} \\ u_{i2} \\ \vdots \\ u_{ip} \end{bmatrix} \tag{A.11}$$

The eigenvalues are obtained by solving the characteristic equation

$$|S - l\,I| = 0, \tag{A.12}$$

where $I$ is the identity matrix of the same size as $S$. Solving the following set of simultaneous equations gives the corresponding eigenvector for each eigenvalue $l_k$, $k = 1 \dots p$:

$$[S - l_k I]\, t_k = 0 \tag{A.13}$$

In this homogeneous linear equation,

$$t_k = \begin{bmatrix} t_{k1} \\ t_{k2} \\ \vdots \\ t_{kp} \end{bmatrix}; \tag{A.14}$$

we can set $t_{k1} = 1$ to find the rest of the $t_{ki}$, $i = 2 \dots p$. The normalizing equation gives $u_k$:

$$u_k = \frac{t_k}{\sqrt{t_k' t_k}} \tag{A.15}$$

The characteristic vectors make up $U$:

$$U = [\,u_1 \mid u_2 \mid \dots \mid u_p\,] \tag{A.16}$$

A.3. Properties of U

$U$ is orthonormal:

$$U'U = I, \qquad u_k' u_k = 1, \qquad u_j' u_k = 0 \;\;(j \neq k) \tag{A.17}$$

The inverse of $U$ is its transpose:

$$U^{-1} = U' \tag{A.18}$$

A.4. Principal Component Transformation

The principal component transformation maps the $p$ correlated variables in $x$ into $p$ new uncorrelated variables $z$:

$$z = U'[x - \bar{x}] \tag{A.19}$$

Or, componentwise,

$$z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_p \end{bmatrix}, \tag{A.20}$$

$$z_i = u_i'[x - \bar{x}] \tag{A.21}$$

The correlation of the $i$th PC, $z_i$, with the $j$th original variable, $x_j$, is

$$r_{z_i x_j} = \frac{u_{ji}\sqrt{l_i}}{s_j} \tag{A.22}$$

The inverse of the principal component transformation is

$$x = \bar{x} + Uz \tag{A.23}$$
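Equations (A.8), (A.17)-(A.19), and (A.23) can be verified numerically; the synthetic covariance structure below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic p = 3 data set with an arbitrary correlation structure.
X = rng.normal(size=(400, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Columns of U are the eigenvectors of S; l holds the eigenvalues.
l, U = np.linalg.eigh(S)
L = np.diag(l)

ortho_ok = bool(np.allclose(U.T @ U, np.eye(3)))     # (A.17)-(A.18)
diag_ok = bool(np.allclose(U.T @ S @ U, L))          # (A.8)

# PC transformation of one sample and its inverse:    (A.19), (A.23)
x = X[0]
z = U.T @ (x - xbar)
x_back = xbar + U @ z
round_trip_ok = bool(np.allclose(x_back, x))
```

`np.linalg.eigh` is used because $S$ is symmetric, which guarantees real eigenvalues and an orthonormal eigenvector matrix.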

A.5. Generalized Measures of Variability

The generalized variance is the determinant of the covariance matrix, $|S|$. The square root of this quantity is proportional to the area or volume generated by the set of data.

$$|S| = \prod_{i=1}^{p} l_i \tag{A.24}$$

The trace of $S$ is the sum of the variances of the variables:

$$\operatorname{Trace}(S) = \sum_{i=1}^{p} s_i^2 \tag{A.25}$$

or

$$\operatorname{Trace}(S) = \sum_{i=1}^{p} l_i \tag{A.26}$$

A.6. Scaling of Characteristic Vectors

The orthonormal characteristic vectors, or U-vectors, are scaled to unity; these are the vectors we obtain from the Matlab functions 'princomp' and 'pcacov'. V-vectors and W-vectors are other popular alternative forms of U-vectors, defined below.

$$v_i = u_i\sqrt{l_i} \quad \text{or} \quad V = UL^{1/2} \tag{A.27}$$

$$V'V = L \tag{A.28}$$

$$VV' = S \tag{A.29}$$

$$V'SV = L^2 \tag{A.30}$$

$$z_i\sqrt{l_i} = v_i'[x - \bar{x}] \tag{A.31}$$

$$w_i = u_i/\sqrt{l_i} \quad \text{or} \quad W = UL^{-1/2} \tag{A.32}$$

$$W'W = L^{-1} \tag{A.33}$$

$$WW' = S^{-1} \tag{A.34}$$

$$W'SW = I \tag{A.35}$$

$$y_i = w_i'[x - \bar{x}] \tag{A.36}$$

Several statistical formulas, such as the $T^2$-statistic, are expressed using $y_i$. The pair of relations that convert $y_i$ to $z_i$ and $z_i$ to $y_i$ are

$$y_i = z_i/\sqrt{l_i}, \qquad z_i = y_i\sqrt{l_i}.$$

A.7. Overall Measure of Variability

Hotelling's $T^2$-statistic is the sum of the squared magnitudes of the score vectors, and it follows a $T^2$-distribution. It can be calculated using the following formulas:

$$T^2 = y'y \tag{A.37}$$

$$T^2 = \sum_{i=1}^{p} y_i^2 \tag{A.38}$$

$$T^2 = [x - \bar{x}]'\, S^{-1}\, [x - \bar{x}] \tag{A.39}$$

$$T^2 = [x - \bar{x}]'\, WW'\, [x - \bar{x}] \tag{A.40}$$

$$T^2 = z'\, L^{-1}\, z \tag{A.41}$$

The $T^2$-distribution is related to the $F$-distribution [43]; it can be used to check, at a given significance level, whether each new data point belongs to a dataset for process control:

$$T^2_{p,n,\alpha} = \frac{p(n-1)}{n-p}\, F_{p,n-p,\alpha} \tag{A.42}$$

Here, $n$, $p$, and $\alpha$ are the number of samples, the number of PCs, and the significance level of the test, respectively.
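The equivalence of (A.37)-(A.39) and (A.41) can be checked numerically on synthetic data (the scaling vector below is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.normal(size=(300, 3)) * np.array([3.0, 1.0, 0.5])
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
l, U = np.linalg.eigh(S)

x = X[0]
z = U.T @ (x - xbar)        # U-scores
y = z / np.sqrt(l)          # W-scores, y_i = z_i / sqrt(l_i)

T2_scores = float(y @ y)                                      # (A.37)-(A.38)
T2_quad = float((x - xbar) @ np.linalg.inv(S) @ (x - xbar))   # (A.39)
T2_z = float(z @ (z / l))                                     # (A.41)
```

All three expressions are the same quadratic form written in different coordinates, so they agree to floating-point precision.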

A.8. Residual Analysis

We briefly describe several popular statistics that we used for residual analysis.

The Q-statistic is the squared distance $\|x - \hat{x}\|^2$ of a sample from the $k$-dimensional subspace that the PCA model defines:

$$Q = [x - \hat{x}]'[x - \hat{x}] \tag{A.43}$$

It is also the weighted sum of squares of the unretained PCs:

$$Q = \sum_{i=k+1}^{p} z_i^2 \tag{A.44}$$

$$Q = \sum_{i=k+1}^{p} l_i y_i^2 \tag{A.45}$$

We can obtain an upper limit for $Q$. Given

$$\theta_1 = \sum_{i=k+1}^{p} l_i, \tag{A.46}$$

$$\theta_2 = \sum_{i=k+1}^{p} l_i^2, \tag{A.47}$$

$$\theta_3 = \sum_{i=k+1}^{p} l_i^3, \tag{A.48}$$

$$h_0 = 1 - \frac{2\theta_1\theta_3}{3\theta_2^2}, \tag{A.49}$$

the quantity

$$c = \frac{\theta_1\left[\left(\dfrac{Q}{\theta_1}\right)^{h_0} - \dfrac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} - 1\right]}{\sqrt{2\theta_2 h_0^2}} \tag{A.50}$$

is normally distributed with zero mean and unit variance. The critical value for $Q$ is

$$Q_\alpha = \theta_1\left[\frac{c_\alpha\sqrt{2\theta_2 h_0^2}}{\theta_1} + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} + 1\right]^{1/h_0}, \tag{A.51}$$

where $c_\alpha$ is the normal deviate cutting off an area of $\alpha$ under the upper tail of the distribution if $h_0$ is positive, and under the lower tail if $h_0$ is negative.

If $E$ is the covariance matrix after $k$ characteristic vectors have been extracted, then

$$\theta_1 = \operatorname{Trace}(E), \tag{A.52}$$

$$\theta_2 = \operatorname{Trace}(E^2), \tag{A.53}$$

$$\theta_3 = \operatorname{Trace}(E^3). \tag{A.54}$$

The Hawkins statistic is the unweighted sum of squares of the unretained PCs:

$$T_H^2 = \sum_{i=k+1}^{p} y_i^2 \tag{A.55}$$

It is similar to the $T^2$-statistic, which is the unweighted sum of squares of all the PCs.
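A numerical sketch of (A.43)-(A.51) with $p = 3$, $k = 2$ on a synthetic data set; $c_\alpha = 1.645$ is the upper 5% normal deviate.

```python
import numpy as np

rng = np.random.default_rng(4)

# p = 3 data with two strong PCs; retain k = 2 and test the residual.
X = rng.normal(size=(500, 3)) * np.array([3.0, 2.0, 0.2])
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
l, U = np.linalg.eigh(S)
order = np.argsort(l)[::-1]            # sort PCs by decreasing eigenvalue
l, U = l[order], U[:, order]
k = 2

x = X[0]
z = U.T @ (x - xbar)

# Q as the squared residual distance from the k-PC model ...   (A.43)
x_hat = xbar + U[:, :k] @ z[:k]
Q_resid = float((x - x_hat) @ (x - x_hat))
# ... equals the sum of squares of the unretained PC scores.   (A.44)
Q_scores = float(np.sum(z[k:] ** 2))

# Critical value Q_alpha from the theta moments (A.46)-(A.51).
theta1, theta2, theta3 = (float(np.sum(l[k:] ** i)) for i in (1, 2, 3))
h0 = 1.0 - 2.0 * theta1 * theta3 / (3.0 * theta2**2)
c_alpha = 1.645                        # upper 5% normal deviate
Q_alpha = theta1 * (c_alpha * np.sqrt(2.0 * theta2 * h0**2) / theta1
                    + theta2 * h0 * (h0 - 1.0) / theta1**2 + 1.0) ** (1.0 / h0)
```

With a single unretained eigenvalue, $h_0$ reduces to exactly $1/3$, and $Q_\alpha$ approximates the 95th percentile of the residual distribution.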

APPENDIX B

The Standard Cells Used in the Research

We used TSMC 180 nm and FreePDK45 (Version 1.3) transistor models in our circuit-level simulations. We created our own layouts for cells based on TSMC 180 nm, and we used the OSU (Oklahoma State University) standard cells, which are based on FreePDK45. As an alternative, the standard cells from Nangate could be used, which are also based on FreePDK45 (Version 1.3). The Nangate standard cells include the nonlinear delay model (NLDM), Composite Current Source (CCS), and ECSM.

APPENDIX C

Comparing Experimental Designs for Generating the Waveform and Cell Models

We used a full-factorial design to generate the set of waveforms from which the PC waveform models and the cell models were built. Alternative designs are explored here to determine how the accuracy of the waveform models and cell models can be improved and to determine the best way to incorporate variations in cell characterization. Our Chapter III models are 9-dimensional, based on the parameters of Table 3.1, but we show 3-dimensional models with one parameter for visual comparison. We list the designs as follows.

C.1. Full-factorial Designs

Two-level full-factorial designs are a specific form of factorial designs in which we consider just two levels for each variable. They explore all corners of the design space. Figure C.1 shows a 2-level full-factorial design with three variables. Our FF model selects all terms; our FF2 and FF3 models select significant factors up to the 2nd- and 3rd-order terms, respectively.


Figure C.1. Two-level full-factorial design.
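A two-level full-factorial design is simply the set of all +/-1 corner combinations, and a half fraction (the FRF-style reduction of C.2) keeps only the corners satisfying a defining relation such as I = ABC. A minimal sketch:

```python
from itertools import product

# All corners of the coded (-1, +1) design space for three variables.
factors = ("slope", "fanout", "P1")
design = list(product((-1, +1), repeat=len(factors)))
n_runs = len(design)                     # 2**3 = 8 corner simulations

# Half fraction with defining relation I = ABC: keep corners whose
# level product is +1.
half_fraction = [row for row in design if row[0] * row[1] * row[2] == 1]
```

For p + q coded variables, the full design grows as 2**(p+q) while each halving of the fraction removes one factor of 2 at the cost of aliasing higher-order interactions.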


C.2. Fractional-Factorial Designs

Fractional-factorial designs are subsets of full-factorial designs. The subset is chosen in a specific way, as described in [43]. Two-level fractional-factorial designs are subsets of 2-level full-factorial designs, and they explore just a fraction of the corners. Figure C.2 shows a 2-level fractional-factorial design with three variables. We can use this design to reduce the number of circuit simulations, although it does not improve the accuracy; there is a trade-off between accuracy and characterization speed. The number of samples of our model based on this design, i.e., FRF, is 1/4 of that of FF.


Figure C.2. Two-level fractional-factorial design.

C.3. Designs Based on Latin Hypercube Sampling

In these designs, the corners are not necessarily covered, but the sampling points are uniformly distributed throughout the space. Figure C.3 shows a design based on Latin hypercube sampling with three variables. The methodology for constructing these designs is explained in [50]-[51]. The sample sizes of our designs based on Latin hypercube sampling, i.e., LHC and LHCQ, are the same as those of FF and FRF, respectively.


Figure C.3. Design based on Latin hypercube sampling.
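Latin hypercube sampling in the unit cube can be sketched as follows: each dimension is split into n equal strata, one jittered point is taken per stratum, and the strata are permuted independently per dimension.

```python
import numpy as np

rng = np.random.default_rng(5)

def latin_hypercube(n, p, rng):
    """n samples in [0, 1)^p with one point per stratum in each dimension."""
    u = (np.arange(n)[:, None] + rng.random((n, p))) / n   # jittered strata
    for j in range(p):                                     # decorrelate dims
        u[:, j] = rng.permutation(u[:, j])
    return u

samples = latin_hypercube(8, 3, rng)

# Stratification check: projected on any axis, each of the 8 equal-width
# bins contains exactly one sample.
bins = np.floor(samples * 8).astype(int)
```

Unlike the corner-based designs, the sample budget is decoupled from 2**p, which is why the LHC and LHCQ sample sizes can be set equal to those of FF and FRF.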

C.4. Central Composite Designs

Central composite designs explore some center points in addition to the corners. They need a larger number of samples than full-factorial designs, with the hope of an accuracy improvement from exploring some middle points between the corners [42]. They are categorized as follows.

1) Circumscribed central composite (CCC) designs – Here, the middle points are as far from the center as the corners, so that they are located on a circumscribed hypersphere outside the hypercube. Figure C.4 shows a circumscribed central composite design with three variables. The accuracy of the estimates should be better; however, the mid-points lie outside the hypercube of the extreme values at the corners. This design is not a good choice for us, because we do not want to build our models with parameter values larger than the extremes chosen to keep the cell functioning properly.


Figure C.4. Circumscribed central composite design.

2) Inscribed central composite (ICC) designs – Here, the middle points are placed so that they are located on the hypersphere inscribed inside the hypercube. Moreover, the corners of the design must also be located on this hypersphere, which does not let the extreme values of the parameters be explored. Figure C.5 shows an inscribed central composite design with three variables. The estimates are good for the center of the design; however, the extreme corners are not properly explored. Therefore, it is not a good choice for us.


Figure C.5. Inscribed central composite design.

3) Faced central composite (FCC) designs – Here, the middle points are located on the faces of the hypercube, while the classical full-factorial corners are preserved. Figure C.6 shows a faced central composite design with three variables. The design should be fair over the whole space, but it is poor for pure quadratic coefficients. We used this design in Chapter IV to determine whether it can improve the waveform accuracy in comparison to our full-factorial designs.


Figure C.6. Faced central composite design.

APPENDIX D

The Spread of C1, R, and C2 of the Pi-model for RC-interconnect Networks of the Inverter Chain

The accuracy of our compact variation-aware models, as a specific type of a statistical model, is dependent on the sampling space that we use to make the models. The models have the best accuracy for the parameter values within the range of parameters in the sampling space.

For our compact variation-aware cell models based on a Pi-model, we determined the range of the parameters of the Pi-model for the RC-interconnect networks (our sampling space) of the inverter chain that we used as the test circuit. Figure D.1 shows the spread of C1, R, and C2 for our Pi-models.

Although we did not perform a comprehensive analysis of all the RC-interconnect networks of the JPEG2 clock tree and limited our experiments to a single inverter chain, it is very interesting that, for the RC-interconnect networks within the inverter chain, the following observations hold.

(a) The values of C1 and the values of C2 form their own clusters, although we used different RC-interconnect networks with different topologies, different numbers of parasitics, and different values for each of the parasitics.

(b) The values of C1 are, in general, smaller than those of C2. This means that the capacitance on the near side of the resistor (toward the gate) is smaller than the capacitance on the far side of the resistor in the model.

Figure D.1. The spread of the C1, R, and C2 parameters of our Pi-models (magnitudes of R (ohm), C1 (F), and C2 (F) on a logarithmic scale, vs. stage number).


APPENDIX E

Complexity analysis for cells that use a saturated ramp waveform model

For the cell models that we developed in Section 4.3 using a saturated ramp waveform model instead of our PCA waveform model, it is instructive to compare the complexity of our full-factorial family of models with that of the tabular cell models. Since we do not include the complexity of PCA in these comparisons, we can show the advantage of the full-factorial cell models over the tabular cell models even with the same waveform model.

The compared models are (a) Tabular, with a multi-dimensional table that has one dimension per variation parameter; (b) FF, with all the terms; (c) FF3, with significant factors up to the 3rd-order terms; (d) FF2, with significant factors up to the 2nd-order terms; (e) FF1, with significant linear terms; and (f) Tabular1, with 2-dimensional tables for each response (e.g., delay and output slope) and the response's linear sensitivity to the variation parameters. Although the accuracy of FF1 is the lowest in the family, we included it to allow comparison with the linear case of the tabular model, i.e., Tabular1.

We use the arguments of our complexity analysis in Section 3.6 to derive the space complexity, the simulation time complexity, and the characterization time complexity for our cell models that use a saturated ramp waveform model instead of the PC waveform model used in that section. Again, p is the number of parameters characterizing a cell, with p = 2 for a slope-fanout pair; q is the number of sources of variation; k is the number of levels in each dimension of the tables in tabular cell models, with a typical value of 10; s is the number of stages in the inverter chain; and n is the number of simulations performed, i.e., n = 2^(p+q) for full-factorial designs.

We have created Table E.1, Table E.2, and Table E.3 based on Table 3.8, Table 3.9, and Table 3.10, respectively. We modified the complexity order expressions by removing the terms related to the 2nd dimension of the waveform model as well as the terms for obtaining the compact variational waveform model using PCA.

Table E.1. Comparing space complexity of methods for a cell that uses a saturated ramp waveform model (per delay/transition entry per input).

Method                   Complexity
Tabular (general case)   O(k^(p+q))
FF (general case)        O(p * 2^(p+q))
FF3 (3rd-order case)     O(p(p+q)^3)
FF2 (2nd-order case)     O(p(p+q)^2)
FF1 (linear case)        O(p(p+q))
Tabular1 (linear case)   O(q * k^p)

Table E.2. Comparing simulation time complexity of methods for a cell that uses a saturated ramp waveform model (per delay/transition entry per input).

Method                   Complexity
Tabular (general case)   O(sk(p+q))
FF (general case)        O(sp * 2^(p+q))
FF3 (3rd-order case)     O(sp(p+q)^3)
FF2 (2nd-order case)     O(sp(p+q)^2)
FF1 (linear case)        O(sp(p+q))
Tabular1 (linear case)   O(s(pk+q))

Table E.3. Comparing characterization time complexity of methods for a cell that uses a saturated ramp waveform model (per delay/transition entry per input).

Method                   Complexity
Tabular (general case)   O(k^(p+q))
FF (general case)        O(pn(p+q))
FF3 (3rd-order case)     O(n(p+q)^6 + p(p+q)^9)
FF2 (2nd-order case)     O(n(p+q)^4 + p(p+q)^6)
FF1 (linear case)        O(n(p+q)^2 + p(p+q)^3)
Tabular1 (linear case)   O(q * k^p)

In Table E.4, we have replaced all the terms except q, the number of sources of variation, in the order expressions with their values or typical values and simplified the expressions. Comparing the models within the full-factorial family, from the least accurate (FF1) to the most accurate (FF), shows that the more accurate the model, the higher the order of all its complexity expressions. Moreover, we compare our most (least) accurate full-factorial model, FF (FF1), with Tabular (Tabular1). At one extreme, FF1 has a higher characterization complexity order than Tabular1, while its space complexity and simulation time complexity are the same as those of Tabular1. That means the linear form of the full-factorial family, which is also its least accurate member, has no advantage over the linear case of the tabular model, while it suffers from poor characterization time compared to it. At the other extreme, FF has a lower space complexity, a lower characterization complexity, and a higher simulation time complexity than Tabular. Although the space complexity and the characterization complexity of both the FF and Tabular models are exponential, the base is 5 times smaller for FF. This means the FF models offer better scalability at the cost of longer simulation times. Using FF3 and FF2 can reduce the simulation time as well, at the cost of losing some accuracy; the loss can be tolerable because the analysis of variance is used to select the significant terms in the model equations.

To better show the advantage of our models, suppose that we have 10 variation parameters. The space complexity and the characterization complexity of the Tabular method will then be 5^10 = 9,765,625 times those of the FF method, which makes the Tabular method prohibitive: if it takes 1 second to characterize a cell using FF, it takes about 4 months (113 days) to characterize a cell using the Tabular method. If the simulation of one cell using the Tabular method takes 1 microsecond, the simulation of a cell using the FF3 (FF2) method takes 1 (0.1) millisecond, i.e., 1000 (100) times more than that of the Tabular method. We note that the characterization time decreases by a factor of 9,765,625 at the cost of an increase in simulation time by a factor of 1000 (100) for FF3 (FF2).
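The arithmetic of this example can be reproduced directly from Table E.4 with q = 10.

```python
# Table E.4 with q = 10 variation parameters: the exponential bases are
# 10 (Tabular) and 2 (FF), so the ratio of costs is 5**q.
q = 10
ratio = 10**q // 2**q                 # 5**10 = 9,765,625

# 1 s per FF cell characterization -> about 113 days for Tabular.
tabular_days = ratio * 1.0 / 86400.0

# 1 us per Tabular simulation -> 1 ms (FF3) and 0.1 ms (FF2).
tabular_sim_s = 1e-6
ff3_sim_s = 1000 * tabular_sim_s
ff2_sim_s = 100 * tabular_sim_s
```

The per-run constants in the order expressions are dropped here, so these figures are illustrative ratios rather than measured times.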

Table E.4. Comparing the complexity of methods for a cell that uses a saturated ramp waveform model (per delay/transition entry per input).

Method                   Space      Simulation Time   Characterization Time
Tabular (general case)   O(10^q)    O(sq)             O(10^q)
FF (general case)        O(2^q)     O(s * 2^q)        O(q * 2^q)
FF3 (3rd-order case)     O(q^3)     O(sq^3)           O(q^9)
FF2 (2nd-order case)     O(q^2)     O(sq^2)           O(q^6)
FF1 (linear case)        O(q)       O(sq)             O(q^3)
Tabular1 (linear case)   O(q)       O(sq)             O(q)

APPENDIX F

The 3-dimensional Plots for Model Accuracy Comparison

Figures F.1-F.21 are 3-dimensional plots similar to those of Figures 4.23-4.25; however, these plots compare models 1-21 of Table 4.21 for changes in the options of the categories of Table 4.20. In all the plots, the horizontal axis is labeled with the waveform model name (Tr, Tf, or Tg in w.1) concatenated with the circuit model used (RCCC, RC, NP, or C in wc.8); and the vertical axis is labeled with the concatenation of the number of sampling points for the waveform model (wc.2), the range of variation of L (wc.5), the range of variation of Vt (wc.5), the status of subranging for load (item 2 of wc.4), the load assumptions (wc.6), the status of subranging for slope (item 3 of wc.4), and the type of experimental design (full factorial or central composite in wc.1).


Figure F.1. Waveform model accuracy compared by the maximum of the absolute value of relative errors using 11 points of waveforms.


Figure F.2. Waveform model accuracy compared by the average of the absolute value of relative errors using 11 points of waveforms.

[3-D surface plot]

Figure F.3. Waveform model accuracy compared by the maximum of the average of absolute value of relative errors for each of 11 points of waveforms.

[3-D surface plot]

Figure F.4. Waveform model accuracy compared by the maximum of the average of the absolute value of relative errors for each of 9 points of waveforms.

[3-D surface plot]

Figure F.5. Waveform model accuracy compared by the maximum of the absolute value of relative errors using 9 points of waveforms.

[3-D surface plot]

Figure F.6. Waveform model accuracy compared using the maximum Mahalanobis distance (11 points).

[3-D surface plot]

Figure F.7. Waveform model accuracy compared using the maximum Mahalanobis distance (9 points).

[3-D surface plot]

Figure F.8. Waveform model accuracy compared using the average Mahalanobis distance (11 points).

[3-D surface plot]

Figure F.9. Waveform model accuracy compared using the standard deviation of Mahalanobis distance (11 points).

[3-D surface plot]

Figure F.10. Waveform model accuracy compared using the average Mahalanobis distance (9 points).

[3-D surface plot]

Figure F.11. Waveform model accuracy compared using the standard deviation of Mahalanobis distance (9 points).

[3-D surface plot]

Figure F.12. Waveform model accuracy compared using the maximum of the absolute value of relative errors of 50% points.

[3-D surface plot]

Figure F.13. Waveform model accuracy compared using the average of the absolute value of relative errors of 50% points.

[3-D surface plot]

Figure F.14. Waveform model accuracy compared using the average of relative errors of all points (11 points).

[3-D surface plot]

Figure F.15. Waveform model accuracy compared using the standard deviation of relative errors of all points (11 points).

[3-D surface plot]

Figure F.16. Waveform model accuracy compared using the average of relative errors of 50% points (11 points).

[3-D surface plot]

Figure F.17. Waveform model accuracy compared using the standard deviation of relative errors of 50% points (11 points).

[3-D surface plot]

Figure F.18. Waveform model accuracy compared using the average of absolute errors of all points (11 points).

[3-D surface plot]

Figure F.19. Waveform model accuracy compared using the standard deviation of absolute errors of all points (11 points).

[3-D surface plot]

Figure F.20. Waveform model accuracy compared using the average of absolute errors of 50% points (11 points).

[3-D surface plot]

Figure F.21. Waveform model accuracy compared using the standard deviation of absolute errors of 50% points (11 points).

APPENDIX G

Resources and Facilities Used in Our Research

The tools and facilities used in our research are as follows:

(a) State of the art: Related textbooks and the proceedings of conferences and journals are available through the Georgia Institute of Technology library and the IEEE Xplore digital library. In addition, attending relevant conferences and workshops has been very beneficial for gaining insight into ideas and techniques applicable to the research and for receiving professional feedback from other experts in the field of timing analysis.

(b) Simulation, application development, and office tools: We have used (i) Cadence Virtuoso and Assura to design layouts, perform design rule checks, check layout versus schematic, and extract the parasitics needed in a netlist; (ii) HSPICE to create the datasets needed to construct our statistical models; (iii) MATLAB to build our toolset for generating and testing our equation-based models, because the built-in features were not adequate and the MATLAB Model Calibration Toolbox proved too slow and unsuitable for the large size of our models; (iv) Microsoft Visual Studio and the Microsoft and GNU C++ compilers to build our timing simulation tool, which tests the cell models by simulating a chain of cells, and to convert data between the circuit simulator files and the datasets used to construct our cell models; and (v) Microsoft Office to manipulate datasets, draw plots, document the research, and prepare reports.

RESEARCH CONTRIBUTIONS

Traditionally, input waveforms are represented by delay-slope pairs. In this work, the slope is replaced by a set of principal components (PCs). We formalized the mapping between the time domain and the PC domain and made it practical by defining valid waveforms in the PC domain.

(a) We have found that there are restrictions on the waveforms that can be transformed between the PC domain and the time domain, because a time domain waveform must move forward in time.

(b) To ensure valid time domain waveforms, we have defined an acceptability region in the PC domain. Only points in the acceptability region can be transformed into valid time domain waveforms. The acceptability region is defined by a set of linear relations that can be determined using linear programming.
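The forward-in-time requirement behind the acceptability region can be sketched as follows (the data below are invented for illustration): each adjacent pair of waveform sample points contributes one linear inequality in the PC coefficients, and a PC-domain point is acceptable only if the reconstructed crossing times are nondecreasing.

```python
# Toy validity test for a PC-domain point: the reconstructed time-domain
# crossing times t = mean + sum(c_k * pc_k) must move forward in time.
# Each adjacent pair of samples yields one linear inequality in the c_k.

def is_valid_waveform(mean, pcs, coeffs):
    """True if t = mean + sum(c_k * pc_k) is nondecreasing."""
    n = len(mean)
    t = [mean[i] + sum(c * pc[i] for c, pc in zip(coeffs, pcs)) for i in range(n)]
    return all(t[i + 1] >= t[i] for i in range(n - 1))

mean = [0.0, 1.0, 2.0, 3.0]       # nominal crossing times (assumed)
pcs = [[0.0, 0.2, 0.5, 1.0]]      # a single principal component (assumed)
print(is_valid_waveform(mean, pcs, [1.0]))    # inside the region -> True
print(is_valid_waveform(mean, pcs, [-4.0]))   # outside the region -> False
```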

(c) We have also found that augmenting the dataset with “time mirror” waveforms prior to principal component analysis expands the acceptability region. As a result, the line segments bounding the acceptability region determined by equation (3.7) always pass through the origin, which makes finding the acceptability region very efficient: we simply examine the slopes of all lines imposing relations and choose the maximum positive and minimum negative ones.
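Geometrically, with time-mirrored waveforms included, every bounding line in a two-PC example passes through the origin, so each constraint a1·c1 + a2·c2 ≥ 0 reduces to a line of slope -a1/a2 and the wedge-shaped region is fixed by the extreme slopes. A toy sketch (constraint coefficients are illustrative assumptions):

```python
# Toy two-PC acceptability region after time-mirror augmentation: every
# constraint a1*c1 + a2*c2 >= 0 is a line through the origin with slope
# -a1/a2, so the region is bounded by the maximum positive and minimum
# negative slopes among all constraint lines.

def wedge_slopes(constraints):
    """Each constraint (a1, a2), with a2 > 0, means a1*c1 + a2*c2 >= 0.
    Returns the (max positive, min negative) slopes of the boundary lines
    c2 = (-a1/a2) * c1."""
    slopes = [-a1 / a2 for a1, a2 in constraints]
    pos = [m for m in slopes if m > 0]
    neg = [m for m in slopes if m < 0]
    return max(pos), min(neg)

constraints = [(-2.0, 1.0), (-0.5, 1.0), (1.0, 1.0), (3.0, 1.0)]
print(wedge_slopes(constraints))  # (2.0, -3.0)
```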

(d) We have found a set of common basis functions in [55],[56] by iteratively inputting waveforms based on the principal components of the output basis functions. The resulting output waveforms are analyzed to determine new principal component basis functions, which are compared to the previous set. If they match, convergence is achieved.
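The fixed-point iteration behind this common-basis search can be sketched in a pure-Python, two-dimensional toy (the “cell” below is an invented linear map standing in for circuit simulation, not the actual characterization flow): drive the cell with waveforms built from the current basis, extract the principal direction of the outputs, and repeat until the basis stops changing.

```python
import math

def principal_direction(samples):
    """Leading eigenvector (unit length) of the 2x2 covariance of samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / n
    syy = sum((y - my) ** 2 for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples) / n
    # closed-form leading eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = 0.5 * (sxx + syy + math.hypot(sxx - syy, 2 * sxy))
    v = (lam - syy, sxy) if abs(sxy) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm)

def cell(wave):
    """Stand-in for a circuit simulation: a fixed linear distortion."""
    x, y = wave
    return (0.9 * x + 0.3 * y, 0.3 * x + 0.4 * y)

basis = (1.0, 0.0)  # initial guess for the common basis function
for _ in range(50):
    outputs = [cell((a * basis[0], a * basis[1])) for a in (-1.0, -0.5, 0.5, 1.0)]
    new_basis = principal_direction(outputs)
    # directions match (cross product ~ 0) -> convergence achieved
    if abs(new_basis[0] * basis[1] - new_basis[1] * basis[0]) < 1e-9:
        break
    basis = new_basis
print(basis)
```

At convergence the cell maps the basis onto itself up to scale, which is exactly the "output principal components match the input basis" condition described above.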

(e) We have shown that our waveform model and cell model are not technology dependent.

(f) We have extended our methodology with resistive-capacitive load support; therefore, it is applicable to interconnect-dominated circuits, in which the resistance of the interconnect networks affects the delay of the gates and must be considered for accurate timing analysis.

(g) We have proposed a set of methods for accuracy improvement and have provided a categorization of the sources of error in our high-level variational waveform model and compact variation-aware cell models. Moreover, we have categorized the possible options in building our waveform and cell models and described how they affect the accuracy of the models.

(h) We have proposed a fast variation-aware statistical dynamic timing analysis framework. Our block-based statistical timing analysis framework incorporates logic-dependent gate delays instead of the logic-independent maximum delay used in static timing analysis. Using compiler-compiler techniques, we generate timing models for each circuit. The timing models, along with a suitable test bench and data analysis routines, are compiled to machine code to reduce the overhead of dynamic timing simulation. The simulation engine enables us to perform statistical timing analysis to observe how circuit performance is affected by the delay changes of the gates, based on their delay distributions, which are linked to variations in their parameters. Therefore, we can generate the delay distribution of a circuit as a measure of its performance distribution.
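The core of this statistical dynamic timing analysis can be sketched as a Monte Carlo loop (the circuit topology and delay distributions below are invented for illustration): each gate's delay is drawn from its own distribution, arrival times are propagated through the circuit graph, and the circuit-delay samples estimate the performance distribution.

```python
import random
import statistics

# Hypothetical 4-gate circuit: gate -> (fanin gates, mean delay, sigma).
# Listed in topological order; delays are assumed Gaussian for illustration.
GATES = {
    "g1": ((), 10.0, 1.0),
    "g2": ((), 12.0, 1.5),
    "g3": (("g1", "g2"), 8.0, 0.8),
    "g4": (("g3",), 9.0, 1.2),
}

def sample_circuit_delay(rng):
    """One Monte Carlo sample: draw gate delays, propagate arrival times."""
    arrival = {}
    for gate, (fanin, mu, sigma) in GATES.items():
        ready = max((arrival[f] for f in fanin), default=0.0)
        arrival[gate] = ready + max(0.0, rng.gauss(mu, sigma))
    return max(arrival.values())

rng = random.Random(7)
samples = sorted(sample_circuit_delay(rng) for _ in range(20000))
print(f"mean delay: {statistics.mean(samples):.2f}")
print(f"99th pct  : {samples[int(0.99 * len(samples))]:.2f}")
```

The sorted samples give the circuit's delay distribution directly, so yield at any target clock period can be read off as the fraction of samples below that period.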

REFERENCES

[1] D. Blaauw, K. Chopra, A. Srivastava, and L. Scheffer, "Statistical Timing Analysis: From Basic Principles to State of the Art," IEEE Transactions on Computer-Aided Design, vol. 27, pp. 589-607, Apr. 2008.

[2] C. Forzan and D. Pandini, "Statistical static timing analysis: A survey," Integration, the VLSI Journal, vol. 42, pp. 409-435, 2009.

[3] T. Kirkpatrick and N. Clark, “PERT as an aid to logic design,” IBM Journal of Research and Development, vol. 10, no. 2, pp. 135-141, March 1966.

[4] D. Lee, V. Zolotov, and D. Blaauw, “Static timing analysis using backward signal propagation,” Proceedings of Design Automation Conference, 2004, pp. 664-669.

[5] S.R. Nassif, A. J. Strojwas, and S.W. Director, “A methodology for worst-case analysis of integrated circuits,” IEEE Transactions on Computer-Aided Design, vol. 5, no. 1, pp. 104-113, Jan. 1986.

[6] A. Agarwal, V. Zolotov, and D.T. Blaauw, “Statistical timing analysis using bounds and selective enumeration,” IEEE Transactions on Computer-Aided Design, vol. 22, no. 9, pp. 1243-1260, Sept. 2003.

[7] X. Li, J. Le, M. Celik, and L.T. Pileggi, “Defining statistical timing sensitivity for logic circuits with large-scale process and environmental variations,” IEEE Transactions on Computer-Aided Design, vol. 27, no. 6, pp. 1041-1053, June 2008.

[8] H. Chang and S.S. Sapatnekar, “Statistical timing analysis under spatial correlations,” IEEE Transactions on Computer-Aided Design, vol. 24, no. 9, pp. 1467-1482, Sept. 2005.

[9] B. Cline, K. Chopra, D. Blaauw, and Y. Cao, “Analysis and modeling of CD variation for statistical static timing,” Proceedings of International Conference on Computer-Aided Design, 2006, pp. 60-66.

[10] S. Bhardwaj, S. Vrudhula, and A. Goel, “A unified approach for full chip statistical timing and leakage analysis of nanoscale circuits considering intradie process variations,” IEEE Transactions on Computer-Aided Design, vol. 27, no. 10, pp. 1812-1825, Oct. 2008.

[11] L. Zhang, W. Chen, Y. Hu, J. A. Gubner, and C.C. Chen, “Correlation-preserved non-Gaussian statistical timing analysis with quadratic timing model,” Proceedings of Design Automation Conference, 2005, pp. 83-88.

[12] V. Khandelwal and A. Srivastava, “A general framework for accurate statistical timing analysis considering correlations,” Proceedings of Design Automation Conference, 2005, pp. 89-94.

[13] J. Singh and S.S. Sapatnekar, “A scalable statistical static timing analyzer incorporating correlated non-Gaussian and Gaussian parameter variations,” IEEE Transactions on Computer-Aided Design, vol. 27, no. 1, pp. 160-173, Jan. 2008.

[14] B. Choi and D.M.H. Walker, “Timing analysis of combinational circuits including capacitive coupling and statistical process variation,” Proceedings of VLSI Test Symposium, 2000, pp. 49-54.

[15] A. Gattiker, S. Nassif, R. Dinakar, and C. Long, “Timing yield estimation from static timing analysis,” Proceedings of International Symposium of Quality Electronic Design, 2001, pp. 437-442.

[16] M. Orshansky and K. Keutzer, “A general probabilistic framework for worst case timing analysis,” Proceedings of Design Automation Conference, 2002, pp. 556-561.

[17] A. Agarwal, D. Blaauw, V. Zolotov, S. Sundareswaran, M. Zhao, K. Gala, and R. Panda, “Statistical delay computation considering spatial correlations,” Proceedings of Asia and South Pacific Design Automation Conference, 2003, pp. 271-276.

[18] C.S. Amin, N. Menezes, K. Killpack, F. Dartu, U. Choudhury, N. Hakim, and Y.I. Ismail, “Statistical static timing analysis: How simple can we get?” Proceedings of Design Automation Conference, 2005, pp. 652-657.

[19] S. Abbaspour, H. Fatemi, and M. Pedram, “Non-gaussian statistical interconnect timing analysis,” Proceedings of Design Automation and Test in Europe Conference, 2006, pp. 533-538.

[20] A.K. Murugavel and N. Ranganthan, “Petri net modeling of gate and interconnect delays for power estimation,” Proceedings of Design Automation Conference, 2002, pp. 455-460.

[21] P. Kartschoke and S. Hojat, “Techniques that improved the timing convergence of the Gekko PowerPC microprocessor,” Proceedings of International Symposium of Quality Electric Design, 2001, pp. 65-70.

[22] P. Ghanta and S. Vrudhula, “Variational interconnect delay metrics for statistical timing analysis” Proceedings of International Symposium of Quality Electric Design, 2006, pp. 19-24.

[23] E. Acar, S. Nassif, Y. Liu, and L.T. Pileggi, “Time-domain simulation of variational interconnect models,” Proceedings of International Symposium of Quality Electric Design, 2002, pp. 419-424.

[24] E. Acar, S. Nassif, Y. Liu, and L.T. Pileggi, “Assessment of true worst case circuit performance under interconnect parameter variations,” Proceedings of Design Automation Conference, 2001, pp. 431-436.

[25] “CMOS nonlinear delay model calculation,” Library Compiler User Guide, vol. 2, Synopsys, Inc., 1999.

[26] J.E. Jackson, A User’s Guide to Principal Components, John Wiley and Sons, Inc., 2003.

[27] P. Feldmann and S. Abbaspour, “Towards a more physical approach to gate modeling for timing, noise, and power,” Proceedings of Design Automation Conference, 2008, pp. 453-455.

[28] S.K. Tiwary and J.R. Phillips, “WAVSTAN: Waveform based Variational Static Timing Analysis,” Proceedings of Design Automation and Test in Europe Conference, 2007, pp. 1000-1006.

[29] R. Gandhi, J. Shiffer, D. Gandhi, and D. Velenis, “Delay modeling using ramp and realistic signal waveforms,” Proceedings of International Conference Electro Information Technology, 2005.

[30] C.S. Amin, F. Dartu, and Y.I. Ismail, “Weibull-based analytical waveform model,” IEEE Trans. Computer-Aided Design, vol. 24, no. 8. pp. 1156-1168, Aug. 2005.

[31] A. Jain, D. Blaauw, and V. Zolotov, “Accurate delay computation for noisy waveform shapes,” Proceedings of International Conference on Computer-Aided Design, 2005, pp. 946-952.

[32] V. Zolotov, J. Xiong , S. Abbaspour, D.J. Hathaway, and C. Visweswariah, “Compact modeling of variational waveforms,” Proceedings of International Conference Computer-Aided Design, 2007, pp. 705-712.

[33] S.R. Nassif and E. Acar, “Advanced waveform models for the nano-meter regime,” International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, 2004.

[34] A. Ramalingam, A. K. Singh, S.R. Nassif, M. Orshansky, and D.Z. Pan, “Accurate waveform modeling using singular value decomposition with applications to timing analysis,” Proceedings of Design Automation Conference, 2007, pp. 148-153.

[35] H. Fatemi, S. Nazarian, and M. Pedram, “Statistical logic cell delay analysis using a current-based model,” Proceedings of Design Automation Conference, 2007, pp. 253-256.

[36] S. Basu, P. Thakore, and R. Vemuri, “Process variation tolerant standard cell library development using reduced dimension statistical modeling and optimization techniques,” Proceedings of International Symposium of Quality Electronic Design, 2007, pp. 814-820.

[37] A. Goel, S. Vrudhula, F. Taraporevala, and P. Ghanta, “A methodology for characterization of large macro cells and IP blocks considering process variations,” Proceedings of International Symposium of Quality Electronic Design, 2008, pp. 200-206.

[38] S. Sundareswaran, J.A. Abraham, A. Ardelea, and R. Panda, “Characterization of standard cells for intra-cell mismatch variations,” Proceedings of International Symposium of Quality Electronic Design, 2008, pp. 213-219.

[39] M. Hashimoto, Y. Yamada, and H. Onodera, “Equivalent Waveform Propagation for Static Timing Analysis,” IEEE Transactions on Computer-Aided Design, vol. 23, no. 4, pp. 104-113, April 2004.

[40] Star-Hspice Manual, Avant! Corporation and Avant! Subsidiary, 2001.

[41] Physical Verification User Guide, Cadence Design Systems, Inc., 2005.

[42] Using MATLAB, The MathWorks, Inc., 1999.

[43] G.E.P. Box, W.G. Hunter, and J.S. Hunter, Statistics for Experimenters, John Wiley and Sons, Inc., 1978.

[44] C. Visweswariah, “Death, taxes and failing chips,” Proceedings of Design Automation Conference, 2003, pp. 343-347.

[45] F.N. Najm and N. Menezes, “Statistical timing analysis based on a timing yield model,” Proceedings of Design Automation Conference, 2004, pp. 460-465.

[46] H. Mangassarian and M. Anis, “On statistical timing analysis with inter- and intra-die variations,” Proceedings of Design Automation Conference, 2005, pp. 132-137.

[47] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, “Parameter variations and impact on circuits and microarchitecture,” Proceedings of Design Automation Conference, 2003, pp. 338-342.

[48] B. Lasbouygues, R. Wilson, N. Azemard, and P. Maurine, “Temperature and voltage aware timing analysis: application to voltage drops,” Proceedings of Design Automation and Test in Europe Conference, 2007, pp. 1012-1017.

[49] M. Ketkar, K. Dasamsetty, and S. Sapatnekar, “Convex delay models for transistor sizing,” Proceedings of Design Automation Conference, 2000, pp. 655-660.

[50] G. Wang, “Adaptive response surface method using inherited Latin hypercube design points,” Transactions of the American Society of Mechanical Engineers, Journal of Mechanical Design, vol. 125, pp. 210-220, June 2003.

[51] M.D. McKay, R.J. Beckman, and W.J. Conover, “A comparison of three methods for selecting values of input variables in the analysis of output from a computer code,” Technometrics, vol. 42, no. 1, pp. 55-61, Feb. 2000.

[52] D. Montgomery, Design and Analysis of Experiments, 2nd Edition, John Wiley and Sons, New York, 1976.

[53] D. Montgomery and E.A. Peck, Introduction to Linear Regression Analysis, 2nd Edition, John Wiley & Sons, New York, 1992.

[54] T.D. Sanger, “Optimal unsupervised learning in a single-layer linear feedforward neural network,” Neural Networks, vol. 2, no. 6, pp. 459-473, 1989.

[55] S-A. Aftabjahani and L. Milor, “Timing analysis with compact variation-aware standard cell models,” Proceedings of Design of Circuits and Integrated Systems, 2007, pp. 1-6.

[56] S-A. Aftabjahani and L. Milor, “Compact variation-aware standard cell models for static timing analysis,” Proceedings of Design of Circuits and Integrated Systems, 2008, paper 7B.1.

[57] S-A. Aftabjahani and L. Milor, “Compact variation-aware standard cell models for timing analysis – complexity and accuracy analysis,” Proceedings of International Symposium of Quality Electric Design, 2008, pp. 148-151.

[58] S-A. Aftabjahani and L. Milor, “Timing analysis with compact variation-aware standard cell models,” Proceedings of the 2009 World Congress on Computer Science and Information Engineering, IEEE CS Press, 2009, pp. 475-479.

[59] S-A. Aftabjahani and L. Milor, “Timing analysis with compact variation-aware standard cell models,” Integration, the VLSI Journal, vol. 42, no. 3, pp. 312-320, June 2009.

[60] H. Veendrick, Deep-Submicron CMOS ICS: From Basics to Asics, 2nd Edition, Kluwer Academic Publishers, 2000.


[61] FreePDK, North Carolina State University [Online]. Available: http://www.eda.ncsu.edu/wiki/FreePDK .

[62] Nangate 45nm Open Cell Library, Nangate Inc. [Online]. Available: http://www.nangate.com/openlibrary/ .

[63] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd Edition, Prentice Hall, 2004.

[64] Predictive Technology Model, Nanoscale Integration and Modeling (NIMO) Group, Department of Electrical Engineering, Arizona State University [Online]. Available: http://www.eas.asu.edu/~ptm/ .

[65] M.V. Dunga, X. Xi, J. He, W. Liu, K.M. Cao, X. Jin, J.J. Ou, M. Chan, A.M. Niknejad, and C. Hu, BSIM4.6.0 MOSFET Model - User’s Manual, Device Group, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley [Online]. Available: http://www-device.eecs.berkeley.edu/~bsim3/BSIM4/BSIM460/doc/BSIM460_Manual.pdf .

[66] International Technology Roadmap for Semiconductors [Online]. Available: http://www.itrs.net/Links/2007ITRS/Home2007.htm .

[67] Performing Transistor-Level Parasitic Extraction, Mentor Graphics Corporation, 2004.

[68] J. Qian, S. Pullela, and L. T. Pileggi, "Modeling the effective capacitance for the RC interconnect of CMOS gates," IEEE Transactions on Computer-Aided Design, vol. 13, no. 12, pp. 1526-1535, Dec. 1994

[69] M. Celik, L. Pileggi, and A. Odabasioglu, Interconnect Analysis, Springer Science & Business Media, 2002.

[70] F. Dartu, N. Menezes, and L.T. Pileggi, “Performance Computation for Precharacterized CMOS Gates with RC Loads,” IEEE Transactions on Computer- Aided Design, vol. 15, no. 5, pp. 104-113, April 1996.

[71] J. Bhasker and R. Chadha, Static Timing Analysis for Nanometer Designs: A Practical Approach, Springer Science & Business Media, 2009.

[72] L. Pillage and R. Rohrer, "Asymptotic Waveform Evaluation for timing analysis," IEEE Transactions on Computer-Aided Design , vol. 9, no. 4, pp. 352-366, Apr. 1990.

[73] M. A. Horowitz, "Timing models for MOS circuits," Ph.D. thesis, Stanford University, 1980.


[74] A. Odabasioglu, M. Celik, and L. T. Pileggi, "Efficient and accurate delay metrics for RC interconnect," PATMOS: International Workshop on Power and Timing Modeling, Optimization, and Simulation, Oct. 1999.

[75] B. Tutuianu, F. Dartu, and L. Pileggi, "An explicit RC-circuit delay approximation based on the first three moments of the impulse response," Proceedings of ACM/IEEE Design Automation Conference, 1996.

[76] C. W. Ho, A. E. Ruehli, and P. A. Brennan, "The modified nodal approach to network analysis," IEEE Transactions on Circuits and Systems, vol. CAS-22, pp. 504-509, June 1975.

[77] C. L. Ratzlaff and L. T. Pillage, "RICE: Rapid interconnect circuit evaluation using AWE," IEEE Transactions on Computer-Aided Design, pp.763-776, Jun. 1994.

[78] A. Odabasioglu, M. Celik, and L. Pileggi, "PRIMA: Passive reduced-order interconnect macromodeling algorithm," in Technology Digest 1997 IEEE/ACM International Conference on Computer-Aided Design, Nov. 1997.

[79] A. Odabasioglu, M. Celik, and L. Pileggi, "PRIMA: Passive reduced-order interconnect macromodeling algorithm," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 8, pp. 645-654, Aug. 1998.

[80] A. Odabasioglu, M. Celik, and L. Pileggi, "Practical considerations for passive reduction of RLC circuits," in Proceedings of 36th IEEE/ACM Design Automation Conference, pp. 214-219, Jun. 1999.

[81] G. Strang, Introduction to Linear Algebra, Second Edition. Wellesley-Cambridge Press, June 1998.

[82] Cadence Encounter Menu Reference, Cadence, Feb. 2008.

[83] S. Abbaspour, DD. Ling, C. Visweswariah, and P. Feldmann, "A Moment-Based Effective Characterization Waveform for Static Timing Analysis", Proceedings of Design Automation Conference, 2009, pp. 19-24.

[84] P.C. Mahalanobis, "On the generalised distance in statistics," Proceedings of the National Institute of Sciences of India, vol. 1, pp. 49-55, 1936.

[85] S-A Aftabjahani and L. Milor, “Fast variation-aware statistical dynamic timing analysis,” Proceedings of 2009 World Congress on Computer Science and Information Engineering, IEEE CS Press, 2009, pp. 488-492.

[86] B. Cline, K. Chopra, D. Blaauw, and Y. Cao, “Analysis and modeling of CD variation for statistical static timing,” Proceedings of International Conference on Computer-Aided Design, 2006, pp. 60-66.

[87] M. Orshansky, L. Milor, C. Pinhong, K. Keutzer, and H. Chenming, “Impact of spatial intrachip gate length variability on the performance of high-speed digital circuits,” IEEE Transactions on Computer-Aided Design, vol. 21, no. 5, May 2002, pp. 544-553.

[88] K.A. Bowman, S.G. Duvall, and J.D. Meindl, “Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration,” IEEE Journal of Solid-State Circuits, vol. 37, no. 2, Feb. 2002, pp. 183-190.

[89] D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, and R.W. Brodersen, “Methods for true energy-performance optimization,” IEEE Journal of Solid-State Circuits, vol. 39, no. 8, Aug. 2004, pp. 1282-1293.

[90] K. Baker, G. Gronthoud, M. Lousberg, I. Schanstra, and C. Hawkins, “Defect-based delay testing of resistive vias-contacts: a critical evaluation,” Proceedings of International Test Conference, 1999, pp. 467-476.

[91] C. Visweswariah, “Death, taxes and failing chips,” Proceedings of Design Automation Conference, 2003, pp. 343- 347.

[92] H. Chang and S.S. Sapatnekar, “Statistical timing analysis under spatial correlations,” IEEE Transactions on Computer-Aided Design, vol. 24, no. 9, Sept. 2005, pp. 1467-1482.

[93] L. Jing-Jia, A. Krstic, J. Yi-Ming, and C. Kwang-Ting, “Modeling, Testing, and Analysis for Delay Defects and Noise Effects in Deep Submicron Devices,” IEEE Transactions on Computer-Aided Design, vol. 22, no. 6, 2003, pp. 756-769.

[94] Modelsim SE User’s Manual, Mentor Graphics Corporation, 2005.

[95] VCS/VCSi User Guide, Synopsys, June 2002.

[96] M. Orshansky, J.C. Chen, C. Hu, D. Wang, and P. Bendix, “Approaches to statistical circuit analysis for deep sub-micron technologies,” Proceedings of International Workshop on Statistical Metrology, 1998, pp. 6-9.

[97] A. Singhee and R.A. Rutenbar, “From finance to flip flops: A study of fast quasi- Monte Carlo methods from computational finance applied to statistical circuit analysis,” Proceedings of International Symposium of Quality Electric Design, 2007, pp. 685-692.

[98] A. Agarwal, D. Blaauw, V. Zolotov, and S. Vrudhula, “Statistical timing analysis using bounds,” Proceedings of Design Automation and Testing in Europe, 2003, pp. 62-67.

[99] M. A. Riepe, J.P.M. Silva, K.A. Sakallah, and R.B. Brown, “Ravel-XL: a hardware accelerator for assigned-delay compiled-code logic gate simulation,” IEEE Transactions on VLSI, vol. 4, 1996, pp. 113-129.

[100] J. Darringer, E. Davidson, D.J. Hathaway, B. Koenemann, M. Lavin, J.K. Morrell, K. Rahmat, W. Roesner, E. Schanzenbach, G. Tellez, and L. Trevillyan, “EDA in IBM: past, present, and future,” IEEE Transactions on Computer-Aided Design, vol. 19, 2000, pp. 113-129.

[101] S. A. Aftabjahani and Z. Navabi, “Functional Fault Simulation of VHDL Gate Level Models,” Proceedings of IEEE Computer Society VHDL International Users’ Forum, 1997, pp. 20-23.

[102] IEEE Standard for Verilog® Hardware Description Language, the Institute of Electrical and Electronics Engineers, Inc., 2006.

[103] IEEE Standard VHDL Language Reference Manual, the Institute of Electrical and Electronics Engineers, Inc., 2007.

[104] H.B. Wang, S.Y. Huang, and J.R. Huang, “Gate-delay fault diagnosis using the inject-and-evaluate paradigm,” Proceedings of International Symposium on Defect and Fault Tolerance in VLSI Systems, 2002, pp. 117-125.

[105] F. Brglez and H. Fujiwara, “A Neutral Netlist of 10 Combinational Benchmark Circuits and a Target Translator in Fortran,” Proceedings of International Symposium on Circuits and Systems, Special Session on ATPG and Fault Simulation, IEEE, 1985, pp. 663-698.

[106] M. Dietrich, U. Eichler, and J. Haase, “Digital statistical analysis using VHDL: impact of variations on timing and power using gate-level Monte Carlo simulation,” Proceedings of Design, Automation and Test in Europe, 2009, pp. 1007-1010.

VITA

Seyed-Abdollah Aftabjahani received his B.S. degree in computer engineering from the National University of Iran (Shahid Beheshti University), Tehran, Iran, in 1994, and his M.S. degree in electrical and computer engineering from the University of Tehran, Tehran, Iran, in 1997. He conducted his Ph.D. research in the Semiconductor Testing and Yield Enhancement Group Laboratory of the School of Electrical and Computer Engineering at the Georgia Institute of Technology, where he also participated in educational activities as a teaching assistant and later as an instructor.

His research and development experience in industry complements his academic education. Before returning to school in 2001, he was a software engineer at TRW and Computer Sciences Corporation in Atlanta, Georgia. Earlier, he was a research engineer for seven years at the Iran Research Center, where he designed and developed embedded systems for telecommunications systems, fax machines, and automatic mark readers. He subsequently became the technical lead of the software development team and provided solutions for software localization and standardization problems. Moreover, he served as a consultant for several private companies, gaining experience in automation and control using computers and embedded systems for industrial and commercial products.

His research interests include statistical variation-aware timing analysis and modeling, simulation acceleration using compiler techniques, digital testing and testable design, and modeling for digital testing using hardware description languages.

He has authored nine publications and eight technical reports related to his research. He has also served on the technical program committees of several conferences.
