Advancing nano-CMOS circuits simulation: convergence optimization, GPU parallelization and reliability analysis

By Francesco Lannutti

A Thesis submitted to the Department of Information, Electronic and Telecommunication Engineering (DIET), University of Rome “La Sapienza”, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Ing. Francesco Menichelli (Thesis Supervisor)
Prof. Alberto Sangiovanni-Vincentelli (Second Thesis Supervisor)

Abstract

Advancing nano-CMOS circuits simulation: convergence optimization, GPU parallelization and reliability analysis By Francesco Lannutti

Submitted to the Department of Information, Electronic and Telecommunication Engineering (DIET), University of Rome “La Sapienza”, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

The idea of having an open-source circuit simulator originated in the 1960s at Berkeley, from a team led by Prof. Ron Rohrer and Prof. Donald Pederson, who strongly believed that students could learn electronics in depth only through an instrument that let them participate actively in designing new circuits and analyzing existing ones (from Prof. Andrei Vladimirescu's presentation at MOS-AK 2013). At the beginning it was CANCER (Computer Analysis of Nonlinear Circuits, Excluding Radiation), then SPICE1 and SPICE2 (Simulation Program with Integrated Circuit Emphasis), released multiple times during the subsequent years, all developed by Laurence Wolfgang (Larry) Nagel during his PhD, under the supervision of Prof. Donald Pederson. With the passing of time, SPICE became so crucial that several semiconductor companies supported the program in house (e.g., Intel, TI, ST, and ATT) and added models for devices that depended on internally developed technology. Several small companies (e.g., Meta Software with HSPICE, OrCAD with PSPICE) modified SPICE slightly and sold their products in the market. Later on, Cadence Design Systems developed SPECTRE, based on the research carried out by Kenneth Kundert and Jacob White at Berkeley, which featured robust integration

ADVANCING NANO-CMOS CIRCUITS SIMULATION I

methods. SPICE and its derivatives are now the most widely used software in the world for accurate analog circuit simulation and standard cell characterization for digital flows. Fortunately, there are still open versions of SPICE that can be found on the web, among which the best known and most used one is NGSPICE, an open-source version which integrates SPICE3, CIDER (which combines SPICE3 with an internal C-based device simulator, DSIM) and XSPICE, developed at Georgia Tech. I have chosen NGSPICE for my research because of the strong support by the community, which can provide invaluable user feedback during the research. Despite that, NGSPICE still suffers from both a coding and an algorithmic point of view. The former comes from its ancient inheritance and concerns performance and memory occupancy; the latter is even worse, because the use of old algorithms in certain areas can generate quality-of-results issues, as well as a usability problem which makes NGSPICE difficult to adopt in some research sectors, including the development of new device Compact Models in Verilog-A, which currently must be converted manually into C code. My PhD research activity concerns the simulation of nano-scale CMOS integrated circuits and is focused on the following four topics:
1) Allowing the efficient simulation of very large circuits, including convergence techniques and the use of a GPU to speed up the analysis
2) Speeding up the simulation by leveraging parallelization algorithms
3) Implementing the Reliability Analysis to stress test the emergent technology nodes
4) Introducing algorithmic techniques to improve elaboration capacity and versatility and to allow the derivation of compact models that are needed to model new materials, new devices and new device effects
During the first year of activity, I developed a new algorithm for convergence that attracted interest in the scientific community.
It was presented at MOS-AK 2013 in Bucharest (September 2013) and at PRIME 2014 in Grenoble (July 2014) [1], where it received the Silver Leaf Award from the conference committee. The results achieved during the second year sped up the simulation by up to 3.74 times, by using GPU acceleration and a novel approach that integrates the outputs of several independent models into only one Circuit Matrix and RHS. The obtained speedup implies decreasing the elapsed


time, for example, from 1 hour, using the CPU version of NGSPICE, to 15 minutes, using GPU acceleration. During the third year, the Reliability Analysis has been successfully implemented in NGSPICE, generating a new framework that can leverage state-of-the-art aging models and can consider both the short term and long term behaviors. The fourth line of activity started towards the end of the second year, continued all along the third year and unfortunately is not yet completed, due to its complexity; yet it produced a Verilog-A Model Compiler with a complete chain of Lexer and Parser and a TCL user interface.


Table of contents

CHAPTER 1 - INTRODUCTION ...... 1

1.1 THE NGSPICE PROJECT ...... 2

1.2 AUTOTOOLS TOOLCHAIN ...... 3

1.3 THE AIM OF MY PHD ...... 4

CHAPTER 2 - KLU – IMPROVEMENTS OVER MY MASTER’S THESIS RESULTS ...... 5

2.1 BINDING TABLE SORTING ...... 6

2.2 KLU SUPPORT FOR ALL THE NGSPICE DEVICE MODELS ...... 7

2.3 KLU ENHANCEMENT IN SYMBOLIC FACTORIZATION ...... 7

2.4 AN IDEA FOR THE PARALLELIZATION ...... 8

CHAPTER 3 - PERIODIC STEADY STATE ANALYSIS ...... 10

3.1 INTRODUCTION ...... 10

3.2 STABILIZATION ...... 13

3.3 SHOOTING ...... 14

3.4 PERIODIC STEADY STATE ...... 16

3.5 RESULTS ...... 17

CHAPTER 4 - A NOVEL CONVERGENCE ALGORITHM BASED UPON KCL VERIFICATION ...... 20

4.1 THE NEWTON-RAPHSON METHOD ...... 24

4.2 THE STATE OF THE ART OF CONVERGENCE ALGORITHMS FOR CIRCUIT SIMULATION ...... 25

4.3 THE FALSE CONVERGENCE PHENOMENON ...... 26

4.4 KCL VERIFICATION ...... 26

4.5 F(VK) IN SPICE ...... 27

4.6 NODES CLASSIFICATION ...... 28

4.7 CONTRIBUTIONS IN THE BSIM4 MODEL ...... 28

4.8 A NEW HOMOTOPY METHOD FOR FASTER CONVERGENCE ...... 33

4.9 IMPLEMENTATION RESULTS ...... 35

4.10 LINEAR AND NON-LINEAR SEPARATION ...... 36

CHAPTER 5 - CUSPICE – THE REVOLUTIONARY NGSPICE ON CUDA PLATFORM ...... 43

5.1 INTRODUCTION ...... 43

5.2 THE FERMI ARCHITECTURE ...... 44

5.3 THE FERMI ARCHITECTURE MEMORY ACCESS ...... 47


5.4 THE CUDA PLATFORM ...... 48
5.4.1 The basic CUDA application ...... 48
5.4.2 The kernel in detail ...... 49

5.5 THE NEW KEPLER ARCHITECTURE ...... 51
5.5.1 The new Streaming Multiprocessor and the memory ...... 51
5.5.2 The scheduling and the energy efficiency ...... 52

5.6 THE GPU PARALLELIZATION OF THE DEVICE MODEL EVALUATION ...... 52
5.6.1 From Linked Lists to Structures of Arrays ...... 53
5.6.2 The Topology Matrix Method ...... 60
5.6.3 Other code refinements to achieve awesome performances ...... 65

5.7 RESULTS ...... 66

CHAPTER 6 - RELIABILITY ANALYSIS ...... 72

6.1 THE AGING MODEL ...... 75

6.2 THE SHORT TERM BEHAVIOR ...... 76

6.3 THE LONG TERM BEHAVIOR ...... 77

6.4 THE IMPLEMENTATION IN NGSPICE ...... 78

6.5 RESULTS ...... 97

CHAPTER 7 - VERILOG-A MODELS COMPILER ...... 102

7.1 COMPACT MODELS ...... 102

7.2 THE VERILOG-A STANDARD ...... 102
7.2.1 A simple example ...... 103

7.3 WHAT IS A COMPILER ...... 104

7.4 WHY A COMPILER IS NEEDED ...... 104

7.5 EXISTING VERILOG-A COMPILERS ...... 105

7.6 WHY A NEW VERILOG-A COMPILER IS NEEDED ...... 105

7.7 THE AIM OF MY VERILOG-A MODELS COMPILER (VAMC) ...... 106
7.7.1 The Front-End Implementation ...... 107
7.7.2 The Back-End Implementation ...... 109
7.7.3 Results ...... 111

CHAPTER 8 - CONCLUSION ...... 112

8.1 FUTURE WORKS ...... 114


Chapter 1 - Introduction

The idea of having an open-source circuit simulator originated in the 1960s at Berkeley, from a team led by Prof. Donald Pederson, who strongly believed that students could learn electronics in depth only through an instrument that let them participate actively in designing new circuits and analyzing existing ones. At the beginning it was CANCER (Computer Analysis of Nonlinear Circuits, Excluding Radiation), a name hinting that Berkeley was a symbol of liberalism in the 1960s, when most circuit simulators were developed under the US Department of Defense. CANCER was developed by Laurence Wolfgang (Larry) Nagel and other students under the supervision of Prof. Ronald Rohrer. When Prof. Rohrer left Berkeley, the program supervisor became Prof. Donald Pederson, who wanted the circuit simulator to be available in the public domain. Thus SPICE was about to be born. The first public release of this new circuit simulator was SPICE1, which came out in 1973. It was written in FORTRAN and it supported only the Nodal Analysis, which doesn't support inductors, voltage sources and some controlled sources. In addition, it had only a few components in the library and the Transient Analysis was based upon a fixed time step. For these reasons, the first popular release of SPICE was SPICE2, which came out in 1975 and was coded in FORTRAN as well. Larry replaced the Nodal Analysis with the Modified Nodal Analysis to overcome these limitations and introduced the variable time step, based upon the Trapezoidal or Gear Methods for the numerical integration step. This SPICE version was maintained until the 2G6 release in 1983, when another student, Thomas Quarles, took the place of Larry in developing the tool, under the supervision of Prof. Richard Newton, Prof. Donald Pederson and Prof. Alberto Sangiovanni-Vincentelli.
This work produced a new mainstream code of SPICE, called SPICE3, which added the support of the X11 system interface, called NUTMEG, developed by another student, Wayne Christopher, and moved the code from FORTRAN to C. With this new mainstream code, the license became BSD, and it remained so until the latest version developed at UC Berkeley, 3F5, in 1993. In the meantime, Kenneth Kundert developed a very efficient and fast sparse matrix solver, called SPARSE, to replace the previous solver, called SMP, under the supervision of Prof. Alberto Sangiovanni-Vincentelli.
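The advantage of the Modified Nodal Analysis over plain Nodal Analysis can be seen on a toy circuit: an ideal voltage source has no conductance, so it cannot be stamped in a pure nodal matrix, but MNA adds its branch current as an extra unknown. The sketch below (my own illustration, not thesis or NGSPICE code) builds and solves the MNA system of a 5 V source driving two 1 kΩ resistors in series:

```c
/* Toy MNA illustration: V1 (5 V) from node 1 to ground, R1 (1 kOhm)
 * from node 1 to node 2, R2 (1 kOhm) from node 2 to ground.
 * Unknown vector x = [v1, v2, i(V1)]: the source current i(V1) is the
 * extra unknown that plain Nodal Analysis cannot accommodate. */
#include <assert.h>
#include <math.h>

static void mna_divider(double *v1, double *v2, double *iv1)
{
    double G = 1.0 / 1000.0;            /* conductance of each resistor */
    double A[3][3] = {
        {  G,    -G,      1.0 },        /* KCL at node 1 + source current */
        { -G,   2.0 * G,  0.0 },        /* KCL at node 2                  */
        { 1.0,   0.0,     0.0 }         /* branch equation: v1 = 5        */
    };
    double b[3] = { 0.0, 0.0, 5.0 };
    double x[3];
    int i, k;

    for (k = 0; k < 3; k++)             /* naive forward elimination */
        for (i = k + 1; i < 3; i++) {
            double f = A[i][k] / A[k][k];
            for (int j = k; j < 3; j++)
                A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    for (i = 2; i >= 0; i--) {          /* back substitution */
        double s = b[i];
        for (int j = i + 1; j < 3; j++)
            s -= A[i][j] * x[j];
        x[i] = s / A[i][i];
    }
    *v1 = x[0]; *v2 = x[1]; *iv1 = x[2];
}
```

The divider yields v1 = 5 V, v2 = 2.5 V and a source branch current of 2.5 mA, exactly the quantities a SPICE OP analysis would report.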


With the passing of time, SPICE became so crucial that several semiconductor companies supported the program in house (e.g., Intel, TI, ST, and ATT) and added models for devices that depended on internally developed technology. Several small companies (e.g., Meta Software with HSPICE, OrCAD with PSPICE) modified SPICE slightly and sold their products in the market. Kenneth Kundert developed SPECTRE at the University of California at Berkeley, under the supervision of Prof. Alberto Sangiovanni-Vincentelli, and then joined Cadence Design Systems, where he developed the industrial version of SPECTRE with the support of Jacob White, who had carried out his research at Berkeley in circuit simulation, under the supervision of Prof. Alberto Sangiovanni-Vincentelli, and then joined MIT. SPECTRE featured a different circuit formulation method and deployed novel robust integration methods. SPICE and its derivatives and SPECTRE are now the most widely used software in the world for accurate analog circuit simulation and standard cell characterization for digital flows.

1.1 The NGSPICE Project
Fortunately, there are still open versions of SPICE that can be found on the web, among which the best known and most used one is NGSPICE, an open-source version started from an idea of Paolo Nenzi in 1999, which integrates SPICE3, CIDER (which combines SPICE3 with an internal C-based device simulator, DSIM) and XSPICE, developed at Georgia Tech. Its name means New Generation SPICE, but its core is still based upon SPICE3F5 (the latest version released by Berkeley). By now, this part has changed so much that it is almost impossible to recognize it; in particular, a lot of memory leaks have been fixed, especially in the frontend part, and several new models and input language commands have been added. In addition, some algorithms have been implemented to enhance the already existing NGSPICE functionalities or to add new ones. NGSPICE maintains the BSD license and is freely available as a Git repository on SourceForge, in various compiled versions for Windows, Mac and Linux, together with the source code. In the repository there are a lot of experimental branches, which serve to code new ideas without polluting the master branch, which hosts the code to be released. The release cycle is not fixed, but a new version gets released approximately every year.


NGSPICE is written in C, like SPICE3F5, and is built using the Autotools toolchain automation system, which is very powerful, especially for Open Source software, even though there are alternatives, like CMake. All the work I have done is integrated in NGSPICE, because it is a complete open source platform which respects the real old SPICE criteria, and also because Paolo Nenzi has been part of my university for a long time; he helped me in starting the Master's Thesis and in understanding the basics of NGSPICE.

1.2 Autotools Toolchain
The Autotools toolchain automation system is composed of several programs, which act together to simplify the compilation process for the developer and to easily produce the final binaries (in general more than one). They are:
• Autoconf
• Automake
• Libtool
The first one converts a description contained in ‘configure.ac’, written in a particular macro language called M4, into the ‘configure’ executable. The second one converts a description of the Makefile, written in another particular language, into the real Makefile. It is capable of writing very complex Makefiles starting from quite simple descriptions, without the developer even knowing the Makefile language. In practice, the distinction between their jobs is not sharp, because they are parts of the same global goal. The third one is optional in the toolchain, but it is needed if the Makefiles build libraries, which can be installable or not installable (placeholder). In NGSPICE, the ‘configure.ac’ file is quite long and complex, because it has to configure the circuit simulator for each supported platform (Windows, Mac and Linux) and also configure the various components the user wants to have in the final build, like XSPICE, CIDER, the PSS analysis, KLU, and so on. The Makefile structure is globally quite simple: there is one Makefile for each folder/subfolder, and it builds all the files in that folder into a library and/or acts recursively using the ‘SUBDIRS’ construct. Only the main Makefile, which is in the root folder, and the one under the ‘src’ folder are different. The first one is richer in the list of extra files which have to be deleted, naming them


one by one, when the developer wants to delete all the residuals of the compilation process using the ‘make maintainer-clean’ command, while the second one is the real main Makefile, because it assembles all the libraries back together, in a particular order and in a particular library superset, prior to their integration in the final NGSPICE executable. There are also at least two executables there, because not only NGSPICE has to be compiled, but also NUTMEG, as a stand-alone program. Libraries in NGSPICE are just placeholders, because it was decided a long time ago to produce only static libraries and static linking. This means that all the ‘Makefile.am’ files for Automake and Libtool use the ‘noinst’ keyword prefix for all the libraries, so that they are produced only during the compilation process and not installed in the final folder when the ‘make install’ command is used. So, they are placeholders in the sense that they act as a package for each folder/subfolder, but at the end they are all linked together in the final executable.
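The per-folder ‘Makefile.am’ convention described above can be sketched as follows; the file and folder names are purely illustrative, not copied from the NGSPICE tree:

```
## Hypothetical per-folder Makefile.am sketch: a non-installed
## convenience library plus recursion into subfolders.
SUBDIRS = subfolder1 subfolder2

## 'noinst' tells Automake/Libtool the library is built during
## compilation only and skipped by 'make install'.
noinst_LTLIBRARIES = libexample.la
libexample_la_SOURCES = file1.c file2.c
```

At link time, each of these convenience libraries is absorbed into the final static executable, which is why nothing needs to be installed.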

1.3 The aim of my PhD
I started my PhD adventure after I became passionate about the world of circuit simulators, and in particular about NGSPICE. I then decided to pursue the PhD with the aim of enhancing this fantastic open source circuit simulator and of making available to the public domain, in open source form, a modern circuit simulator capable of overcoming known issues, including the false convergence phenomenon, of speeding up the simulation time using modern NVIDIA GPUs, of supporting new Compact Models written in Verilog-A, and more. It is an unusual research topic with respect to the ones we are used to in the Electronic Engineering world, because it is more related to Computer Science, Linear Algebra and Graph Theory, but it has a very practical impact and usage in the designer community. The research path I have followed during my PhD is not only theoretical: there is working code, committed in some branches of NGSPICE and showing very good results, which every designer can access, try and further enhance. The only exception is the Verilog-A Models Compiler, which is unfortunately not completed yet, because it is a huge project and requires a lot of time and extra knowledge, including what a parser is and how to build a proper one using valid instruments, like Bison and Flex. However, a lot of code is already done and working; in fact, the Front-End part is complete, with just some modifications needed in order to satisfy operator precedence, and part of the Back-End is completed too.


Chapter 2 - KLU – Improvements over my Master’s Thesis results

During my Master’s Thesis I analyzed in detail and ported a new sparse matrix solver, called KLU, to NGSPICE, with the ultimate intent of speeding up the entire simulation. KLU was created by Ekanathan Natarajan Palamadai, under the supervision of Prof. Tim Davis, at the University of Florida during his Master of Science in 2005, and it is specifically designed for circuit simulation. It is based upon a chain of several algorithms which:
1. Reorder the sparse matrix to avoid zeroes along the main diagonal (Maximum Transversal)
2. Partition the sparse matrix into sub-matrices (BTF – Block Triangular Form – optional)
3. Reorder the sub-matrices to reduce fill-in insertion during the LU factorization (AMD or COLAMD)
4. Factorize the sub-matrices using the Gilbert-Peierls LU decomposition, with partial pivoting, and the Eisenstat-Liu symmetric pruning
The first three steps lead to the symbolic factorization, which is then used to perform the numerical factorization, including the partial pivoting calculation. This flow holds for the first factorization, or when the pivoting is not reusable because the sparse matrix is unstable; otherwise the same symbolic factorization and the same pivoting can be reused for multiple subsequent factorizations. In SPICE this case is really frequent, so KLU also makes available another feature, called Re-Factorization. KLU also includes other features which are needed in a circuit simulator:
• Linear system solution, after the LU decomposition
• Complex number support, through the separate storage of real and imaginary parts, in the form of packets of two numbers each
During my PhD, I have extended the work done on this topic [2], by enhancing the data structure, bringing the speedup from 11X to 13X, and by including the support of all the NGSPICE device models.


2.1 Binding Table Sorting
The first enhancement concerns the binding table I have used to mimic the associative array concept, not available in C, to store the pointers to the SPARSE 1.3, KLU Real and KLU Complex representations of the circuit matrix. Each device model has to access the table to retrieve the correct pointer for a certain coordinate location (x, y), repeated for all the coordinates each device instance has to know. This operation is executed only once for the Transient and OP Analyses, because each model only has to change its pointers to the circuit matrix from the SPARSE 1.3 representation to the KLU Real representation, and several times for an AC Analysis, because some conversions from KLU Real to KLU Complex and vice versa are needed in addition to the previous one, which is still performed during the initial OP Analysis. So, it is clear that optimizing this part is crucial to achieve a good overall speedup. Moreover, the PZ and Sensitivity Analyses change the sparse matrix on the fly or by using the devices' setup routine, so a process faster than this conversion also has to be developed to further speed up these analyses. I am evaluating the possibility of getting rid of SPARSE 1.3 when KLU is used, by using the COO format as the starting point for obtaining the CSC representation KLU uses, but this poses some challenges. Then, in order to enhance the speed, I have worked on the data structure by introducing a sorting algorithm, in particular Quick Sort, because the sorted Binding Table can be exploited by a Binary Search Algorithm, which is far more efficient than a Linear Search Algorithm, since the computational cost decreases from O(n) to O(log n). The Binding Table has thus been implemented as an array of structures, so that each packet has three pointers (at the moment SPARSE 1.3, KLU Real and KLU Complex) and can be easily expanded to whatever number of pointers, if necessary. However, this doesn't give any speedup per se.
In order to benefit from this new data structure, I have added a new pointer inside each device model data structure to store the address of each needed packet inside the Binding Table, sacrificing some space for time, but achieving O(1) in all the subsequent pointer swaps, needed for example in the AC Analysis, as mentioned before, instead of repeating the Binary Search every time. Exchanging space for time is a well-accepted paradigm nowadays, since space costs less than time, so the idea put in place here is generally valid.
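The sorted Binding Table idea can be sketched with the C standard library's `qsort()` and `bsearch()`; the struct and function names below are illustrative, not the actual NGSPICE identifiers:

```c
/* Sketch of the sorted Binding Table (names are illustrative). Each
 * entry maps a matrix coordinate (row, col) to its element pointers in
 * the different circuit-matrix representations. */
#include <stdlib.h>
#include <assert.h>

typedef struct {
    int row, col;           /* coordinate inside the circuit matrix */
    double *sparse_ptr;     /* element in the SPARSE 1.3 matrix     */
    double *klu_real_ptr;   /* element in the KLU Real matrix       */
    double *klu_cplx_ptr;   /* element in the KLU Complex matrix    */
} BindElement;

/* Total order on coordinates, so the table can be sorted with qsort()
 * and then searched with bsearch() in O(log n) instead of O(n). */
static int bind_cmp(const void *a, const void *b)
{
    const BindElement *x = (const BindElement *)a;
    const BindElement *y = (const BindElement *)b;
    if (x->row != y->row) return (x->row < y->row) ? -1 : 1;
    if (x->col != y->col) return (x->col < y->col) ? -1 : 1;
    return 0;
}

static void bind_sort(BindElement *table, size_t n)
{
    qsort(table, n, sizeof(BindElement), bind_cmp);
}

/* Binary search; the device instance caches the returned address, so
 * later pointer swaps (e.g. KLU Real <-> KLU Complex) become O(1). */
static BindElement *bind_find(BindElement *table, size_t n, int row, int col)
{
    BindElement key;
    key.row = row;
    key.col = col;
    return (BindElement *)bsearch(&key, table, n,
                                  sizeof(BindElement), bind_cmp);
}
```

Caching the `BindElement *` returned by `bind_find()` inside each device instance is exactly the space-for-time trade described above: the search cost is paid once, and every subsequent representation swap is a constant-time pointer copy.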


2.2 KLU support for all the NGSPICE device models
The second enhancement expands the KLU support to every NGSPICE device model. In fact, my first implementation of KLU, during my Master's Thesis, supported only a reduced number of device models, and in particular only the ones needed to execute the ISCAS85 Benchmarking Suite, essentially BSIM4, the Voltage Source and a few basic elements. Now, with this enhancement, practically all the models are supported, with the only exception of URC, which is still unsupported because its data structure is different from the one needed by KLU.

2.3 KLU enhancement in Symbolic Factorization
There is a third enhancement, an idea that came to mind recently, when I recalled an analysis I made a long time ago during my Master's Thesis. The problem is that the Symbolic Factorization is divided into two subsequent steps, called in series, one right after the other:
• lsolve_symbolic
• construct_column
The first routine extracts all the columns with a loop from the CSC representation, also considering the reordering that happened along the way, and calls the DFS to extract the column indices impacted by the current reordered column in the loop, but only if the current reordered column is inside the current block; otherwise it does nothing and simply continues the loop. This is a waste of computation, because the second routine scatters all the columns from the CSC representation to the dense representation (a vector), also considering here the reordering that happened along the way, but taking into account all the columns in the loop, without skipping the ones outside the current block. After this analysis, it is evident that the first loop is equal to the second one, with the only difference that the first one skips some columns, so it is definitely possible to merge them into only one loop, compressing the two routines into one and assigning the work properly to each category of column. The speedup is unfortunately not so evident, just about 1%, but the modification is so simple that it is worth doing. Anyway, the modification acts only on the off-block columns, so it depends on the Block Triangular Form, which depends on the circuit under analysis. The speedup should be more evident if the number of off-block columns is high, which is generally not the case,


because circuits almost always have one big block and some small ones, coming from the strongly connected components algorithm.
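The loop fusion can be summarized in C-like pseudocode; the helper names are illustrative, not the actual KLU routines:

```c
/* C-like pseudocode of the fusion (helper names are illustrative).
 * Before the change, lsolve_symbolic looped over all columns and skipped
 * the off-block ones, then construct_column looped over all of them
 * again. After the fusion, one pass assigns the right work per column. */
for (k = 0; k < n; k++) {
    int col = apply_reordering(k);        /* same permutation both loops had */
    if (column_in_current_block(col))
        dfs_reachability(col);            /* former lsolve_symbolic work     */
    scatter_to_dense(col);                /* former construct_column work,   */
}                                         /* performed for every column      */
```

Since both original loops traverse the same permuted column sequence, the fused loop preserves the result while saving one full pass over the CSC structure.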

2.4 An idea for the parallelization

Finally, an idea which came to my mind during my Master's Thesis, but was realized only in part, exploits an interesting article by Dr. Maxim Naumov from NVIDIA [3]. Basically, the Breadth First Search (BFS) algorithm can be used in place of the DFS used by KLU to find the elimination tree, which encodes the sequence of independent operations that can be exploited in the sparse matrix. However, just by substituting the DFS with the BFS, the Gilbert-Peierls LU factorization algorithm [4] is changed at its heart, breaking the principle that the LU factorization is performed in a number of operations proportional to the number of multiplications between L and U, which is the theoretical minimum; moreover, the Eisenstat-Liu pruning, which cuts the computation time of the DFS, is not considered anymore. So, it is true that BFS exposes the parallelism, but it is also true that the amount of needed calculations is no longer the minimum, so the available parallelism cannot in general compensate for the amount of computation needed by the serial version of the Gilbert-Peierls algorithm. Moreover, if the GPU is considered as the hardware where the parallelism is exploited, there is also a frequency reduction to be taken into account. In fact, by implementing the modification on a GPU, there is only a slowdown, which can be quite big in certain cases, as shown in Figure 1.

Figure 1 – Slowdown of the GPU version of KLU with respect to the serial one


So, an enhancement could be made to this idea: it is possible to use the standard Gilbert-Peierls algorithm with Eisenstat-Liu pruning to extract, in the symbolic factorization, which columns are needed during the next numerical factorization, and then use the BFS on top of that to extract the available parallelism. Since the available parallelism will be very limited, this solution is not suitable for the GPU, but it could be worth trying on a multi-core CPU, where the available cores are nowadays generally 4-8. The available parallelism depends on the current column, because a DFS, and so a BFS, is launched for each column of L, also considering the fill-ins, so at the beginning the amount will be almost nil, but it can increase along the way, as shown in Figure 2. If there is no available parallelism in a certain column, then the execution will be serial, but, at least, the CPU frequency is the same, while the GPU generally runs at less than a quarter of the CPU frequency.
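The grouping of mutually independent columns can be sketched as a level schedule over the column dependency graph; the dependency matrix below is a toy example of my own, not a real elimination tree from KLU:

```c
/* Sketch of the CPU-side idea: once the column dependencies are known
 * from the (DFS-based, pruned) symbolic factorization, a BFS-style level
 * schedule groups the columns that can be factored in parallel.
 * deps[i][j] != 0 means column j must be processed before column i. */
#include <assert.h>

#define NCOL 6

/* level[i] = length of the longest dependency chain ending at column i;
 * columns sharing the same level are mutually independent. */
static void level_schedule(int deps[NCOL][NCOL], int level[NCOL])
{
    int i, j, changed = 1;
    for (i = 0; i < NCOL; i++)
        level[i] = 0;
    while (changed) {                     /* relax until a fixed point  */
        changed = 0;                      /* (terminates: graph is a DAG) */
        for (i = 0; i < NCOL; i++)
            for (j = 0; j < NCOL; j++)
                if (deps[i][j] && level[i] < level[j] + 1) {
                    level[i] = level[j] + 1;
                    changed = 1;
                }
    }
}
```

On a 4-8 core CPU, each level would be dispatched as one batch of independent column factorizations; a level containing a single column simply degenerates to the serial case described above.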

[Chart: “Number of parallel operations for each factorization on «c432»” – parallelism (y-axis, 0 to 200) versus matrix column (x-axis)]

Figure 2 – Number of Parallel Operations for each column of the ‘c432’ circuit of the ISCAS85


Chapter 3 - Periodic Steady State Analysis

3.1 Introduction
RF analyses are one of the simulation classes generally available in a circuit simulator, but NGSPICE doesn't have any of them; only an experimental Periodic Steady State Analysis was present at the time when I started looking into this topic. That experimental code had been written by a friend of mine, Stefano Perticaroli, based upon his PhD dissertation, but it was full of bugs and memory leaks. So, we decided to start over: he explained to me the main algorithm he intended to code, letting us start a new implementation from scratch, while adding new ideas that came to my mind to reach the goal. The Periodic Steady State Analysis is the basic analysis for RF systems; it serves to investigate the large signal and/or non-linear behavior of both autonomous and non-autonomous periodic systems. Figure 3 illustrates the two categories: the first one comprises systems without inputs, which have at least one output, like oscillators, while the second category regards all the systems in which an RF signal is applied to the input and the response can be acquired at the output, like an LNA or a mixer.

Figure 3 – Autonomous and non-autonomous periodic systems


Autonomous systems differ from non-autonomous ones in that only the initial conditions are known for the former, while for the latter the working frequency is also known; for autonomous systems, the working frequency is in turn one of the goals of the Periodic Steady State analysis. In both cases, the voltage at each node and the current at each branch are to be retrieved from the analysis. There are several ways to reach the Periodic Steady State solution:
- Transient Shooting
- Harmonic Balance
- Mixed Domain Techniques
Since the first one is preferred in case of highly non-linear and/or large circuits, this is the way the Periodic Steady State Analysis has been implemented in NGSPICE. The Transient Shooting method tries to solve the following equation:

f(x(t), t) = dq(x(t))/dt + i(x(t)) + u(t) = 0

Equation 1 – Transient Shooting formulation

And the PSS state is reached when:

x(t + T) − x(t) = 0

Equation 2 – Periodic Steady State condition

Since the period T is unknown for autonomous circuits, a closed form solution doesn't exist in this case and a residual error must be accepted, leading to the formulation in Equation 3:

x(t + T) − x(t) = ε

Equation 3 – Periodic Steady State condition using quantized quantities and a residual error


In order to achieve the goal, the original Transient Analysis has been modified by adding the Shooting Method, which is divided into three steps, as illustrated in Figure 4:
1. Stabilization
2. Shooting
3. Periodic Steady State and automatic frequency correction

Figure 4 – Block diagram of the PSS implementation in case of autonomous circuits


3.2 Stabilization
The first part is basically the free evolution of the system, like a pure Transient Analysis, since some time is required to let the system get up to speed; so the user has an analysis parameter to be tweaked, in order to give the system at least the minimum amount of time to reach its stability. This user parameter is mandatory, because this amount of time depends on the circuit. This concept is depicted in Figure 5:

Figure 5 – Stabilization of a Van Der Pol oscillator during a Periodic Steady State Analysis


3.3 Shooting
After the stabilization, the shooting block tries to retrieve the period by taking a user-specified guessed frequency, which depends on the circuit, and by taking the RHS as a Pseudo State Vector, because it contains the equivalent conductances of the components with memory plus other contributions, including components without memory, independent sources and controlled sources. However, the Shooting Method would require the state matrix, which expresses the charges and the fluxes inside the circuit and is therefore composed only of reactive linear components, like capacitors and inductors; but this matrix is not available in NGSPICE, since every device model adds its contribution to the circuit matrix, and the state matrix cannot be extracted at the end by simply post-processing the circuit matrix. So, in order to have it, all the device models would have to be changed to write two matrices, the circuit matrix and the state matrix, but this raises another issue: the basic formulation is composed of simple circuits, which have a simple state matrix, but what is the state matrix in the general case, for example when there is a node in the circuit which has no capacitors or inductors? Should that node be empty or should it be collapsed? Instead of answering this question, NGSPICE uses the circuit matrix as the state matrix. During the natural evolution of the system in Transient Analysis, the mean quadratic error of the Pseudo State Vector and its derivative are calculated. Then, they are used, together with the Pseudo State Vector error, to update the period, which can be used as the starting point of the next shooting cycle or can be the final one. This concept is very powerful, but it requires a fixed time step, which is unfortunately not available in NGSPICE, because the time step depends on the Local Truncation Error, which is calculated by each device model separately, the global step then being taken as the minimum over all of them.

Figure 6 – Time steps are not constant due to Local Truncation Error


However, this conflicts with what the shooting cycle is trying to achieve, because the cycle needs to update the period right at its end. To reconcile the two, the implementation injects artificial breakpoints, so that the next time step is at least close to the one requested by the shooting cycle, as Figure 6 tries to show.

The choice of the tolerances and of the steady coefficient heavily influences the convergence accuracy and the algorithm speed. In order to guarantee convergence at each shooting cycle, the conditions in Equation 4 and Equation 5 are considered. They have to hold for every node or branch to claim that convergence is reached.

voltage_error ≤ max(|v|) ∙ rel_tol + v_abstol ∙ tr_tol ∙ steady_coeff

Equation 4 – Voltage error calculation for convergence purposes

current_error ≤ max(|i|) ∙ rel_tol + i_abstol ∙ tr_tol ∙ steady_coeff

Equation 5 – Current error calculation for convergence purposes

The steady coefficient is a new parameter available only for the Periodic Steady State Analysis, while the other parameters are already available in the Transient Analysis, even though their values can differ. Their default values are shown in Table 1:

rel_tol     0.001
v_abstol    1 µV
i_abstol    1 nA
tr_tol      7

Table 1 – Default Values of Periodic Steady State Analysis convergence parameters


3.4 Periodic Steady State

The third part is the search for the Periodic Steady State, after the shooting phase is completed. At this point the period is known, so the system is let evolve for one more period in order to collect the information needed for plotting (discussed later on) in the Time and Frequency Domains, through a DFT routine which calculates the harmonics, whose number is specified as a user parameter in the input netlist. The output of the DFT is also used for the automatic correction of the final frequency, because the shooting cycle can converge to a local minimum when the guessed frequency provided as starting point is far off, so a multiple of the real frequency can be proposed as result. To correct this automatically, one property is exploited: if the frequency is the correct one and not a multiple, the harmonic with the highest magnitude is the first one. So, by post-processing the result of the DFT, this condition is checked and, if it is not verified, another Periodic Steady State Analysis is launched on top of the current one, using the frequency of the dominant harmonic as new guess; otherwise the analysis terminates. All the plots in both Time and Frequency Domains are retained, even when the final frequency is not correct. The intent of this algorithm is shown in Figure 7:

Figure 7 – The Periodic Steady State Analysis is trapped in a local minimum and the DFT is used to get out


3.5 Results

Finally, some results of the implemented Periodic Steady State Analysis are shown in the figures below.

Figure 8 – Colpitts Oscillator


Figure 9 – Complementary CMOS Oscillator

Figure 10 – Hartley Oscillator


Figure 11 – Vackar Oscillator

Figure 12 – Van Der Pol Oscillator


Chapter 4 - A novel convergence algorithm based upon KCL verification

One of the issues of SPICE since day one is the so-called False Convergence Phenomenon, which consists in believing that the Newton-Raphson cycle is completed, and so convergence is reached, even though this is not true, because the KCL is not actually satisfied. This behavior is due to the original algorithm put in place by Larry Nagel in SPICE, which just checks whether a current has reached convergence by using a threshold. This is an arbitrary criterion: it is suitable for direct unknowns, but it is wrong for indirect unknowns, which is the case when the current does not come from an independent current source but is calculated inside a device model, starting from the voltages applied to it. That said, let us step back and recall some background information, in order to introduce a formal description of the problem and of its solution. In order to solve a linear circuit, it is possible to leverage either the KCL or the KVL: the first says that the sum of the currents entering each node of the circuit is zero, the second says that the sum of the voltages around each closed loop is zero. SPICE uses the KCL formulation. From the algorithmic point of view, SPICE represents each node of the circuit with a number, which is actually the row and column index of the circuit matrix being assembled to form the linear system to be solved; this approach takes the name of Nodal Analysis. A special case exists: the reference node, or ground, is always indicated as node number 0 and it is not stored inside the matrix, but it is only present in the RHS.

Figure 13 – Ladder Network


Considering a very simple 3-node circuit, called Ladder Network, KCL can be formulated in this way:

Σ_{j=1..M_k} i_{k,j} = 0 ,   k = 1, …, N

Equation 6 – General KCL formulation for a 3-node circuit

where k is the node index, j is the index of a current incident at node k, N is the total number of nodes in the circuit and M_k is the total number of currents incident at node k. Now, let us make a step further and write the sum of the non-zero currents explicitly, introducing the matrix notation at the same time. Considering that the total number of currents is equal to 7 in this example, we have:

-i1 + i2 + i3 = 0
-i3 + i4 + i5 = 0   →
-i5 + i6 + i7 = 0

[ -1  1  1  0  0  0  0 ]
[  0  0 -1  1  1  0  0 ] ∙ [ i1  i2  i3  i4  i5  i6  i7 ]ᵀ = [ 0  0  0 ]ᵀ
[  0  0  0  0 -1  1  1 ]

Equation 7 – Linear System for the Nodal Analysis applied to the Ladder Network

In this linear system all the currents are unknown, which would require 7 equations to solve the system. However, i1 is imposed by the current source, so it is actually known and it is possible to simplify the system:

[ 1  1  0  0  0  0 ]
[ 0 -1  1  1  0  0 ] ∙ [ i2  i3  i4  i5  i6  i7 ]ᵀ = [ i1  0  0 ]ᵀ
[ 0  0  0 -1  1  1 ]

Equation 8 – Final Linear System of the Equation 7

Now, we need something more to formulate additional equations and continue simplifying the system. We can leverage the components which connect the nodes and use their constitutive equations. In this case they are simple, since Ohm's Law is valid, so we can say that v = R ∙ i, or i = v / R = G ∙ v, where R is the resistance and G is the conductance.


In this circuit, we have:

i2 = G1 ∙ v1
i3 = G2 ∙ (v1 - v2)
i4 = G3 ∙ v2
i5 = G4 ∙ (v2 - v3)
i6 = G5 ∙ v3
i7 = G6 ∙ v3

Equation 9 – Circuit currents with respect to the component constitutive equations

Reorganizing the system and denoting i1 as I, we have:

G1 ∙ v1 + G2 ∙ (v1 - v2) = I                        (G1 + G2) ∙ v1 - G2 ∙ v2 = I
-G2 ∙ (v1 - v2) + G3 ∙ v2 + G4 ∙ (v2 - v3) = 0  →   -G2 ∙ v1 + (G2 + G3 + G4) ∙ v2 - G4 ∙ v3 = 0
-G4 ∙ (v2 - v3) + G5 ∙ v3 + G6 ∙ v3 = 0             -G4 ∙ v2 + (G4 + G5 + G6) ∙ v3 = 0

[ G1+G2      -G2          0      ]   [ v1 ]   [ I ]
[  -G2    G2+G3+G4       -G4     ] ∙ [ v2 ] = [ 0 ]
[   0        -G4      G4+G5+G6   ]   [ v3 ]   [ 0 ]

Equation 10 – Final Linear System to be solved by SPICE

This is the final linear system to be solved, since it has 3 equations in 3 unknowns. How to solve this system is not the aim of this paragraph, but the general form to be solved is denoted by A ∙ x = b, where A is the circuit matrix, x is the unknown voltages vector and b is the known currents vector. In particular, the circuit matrix can be called conductances matrix, since it contains only conductances, and can be denoted as G, so: G ∙ v = i. In fact, this notation brings us back to Ohm's Law, i = G ∙ v, where now G is a matrix and the other two elements are vectors. Now, let us replace the current source with its Thevenin model: there is one more node in the circuit, whose voltage will be called v4, the current branch along the generator will be i_E and the source voltage will be E. The resulting circuit is schematized in Figure 14:

Figure 14 – Ladder Network with a Voltage Source


With this modification, the system cannot be composed as before, because i_E is now unknown; however, simply by separating what is known from what is not, it is possible to re-write the system as in Equation 11:

(G0 + G1 + G2) ∙ v1 - G2 ∙ v2 - G0 ∙ v4 = 0
-G2 ∙ v1 + (G2 + G3 + G4) ∙ v2 - G4 ∙ v3 = 0
-G4 ∙ v2 + (G4 + G5 + G6) ∙ v3 = 0
-G0 ∙ v1 + G0 ∙ v4 + i_E = 0
v4 = E

[ G0+G1+G2     -G2         0       -G0   0 ]   [ v1  ]   [ 0 ]
[   -G2     G2+G3+G4      -G4        0   0 ]   [ v2  ]   [ 0 ]
[    0         -G4     G4+G5+G6      0   0 ] ∙ [ v3  ] = [ 0 ]
[   -G0         0          0        G0   1 ]   [ v4  ]   [ 0 ]
[    0          0          0         1   0 ]   [ i_E ]   [ E ]

Equation 11 – New Linear System to be solved when there is at least a Voltage Source in the circuit

where G0 denotes the conductance of the Thevenin series resistor.

Clearly there are two more rows in the system, even though there is only one more node in the circuit. This happens because there is now an additional unknown, the current of the voltage source, while the voltage of the voltage source must be present in the system anyway, since it contributes to the KCL formulation, so it does not modify the system. So, at the end, there is one more equation for the additional node and another one for the voltage source constitutive relation. This method is known as Modified Nodal Analysis (MNA) and every SPICE circuit simulator uses it. It basically extends the Nodal Analysis to include voltage sources, inductors and controlled sources, where current rows are added to take care of the direct unknown current branches of those components; the circuit matrix therefore has a more complicated structure, which is represented in Equation 12:

A = [ G  B ]
    [ C  D ]

Equation 12 – General circuit matrix representation and its contributions

where:
• G is the conductances matrix seen extensively above, equivalent to the one in the Nodal Analysis
• B is the current branches matrix, due to every direct unknown current branch; it contains only ones and zeroes
• C is another current branches matrix, due to the MNA voltage constraints when at least a current branch is present in B; it contains only ones and zeroes as well
• D is the controlled sources matrix and it contains also the inductors entries; its elements have resistance dimension

So, the linear system which SPICE solves is represented in Equation 13:

[ G  B ]   [ v ]   [ i ]
[ C  D ] ∙ [ j ] = [ e ]

Equation 13 – General Linear System solved by SPICE, composed by using the MNA formulation

If the circuit has non-linear devices, as usual in SPICE simulations, G becomes the Jacobian Matrix, represented as J(v), resulting from the linearization through the Taylor series expansion. The solution is found by solving a sequence of linear systems until convergence is reached.

4.1 The Newton-Raphson Method

The Newton-Raphson Method is based upon Equation 14:

F(v, i) = 0   →   F(v, i) = f(v) - i = 0   →   f(v^k) + J(v^k) ∙ (v^(k+1) - v^k) + h.o.t. - i = 0

Equation 14 – Newton-Raphson Method construction

where f(v) is the non-linear function, J(v) is the Jacobian Matrix and h.o.t. stands for the higher order terms, which are dropped by the linearization process. Since it is an iterative method, it needs a termination criterion; valid options are:
• A maximum iteration limit
• The limitation of the difference between two adjacent solutions
SPICE implements both of them, leveraging the second one at every iteration until convergence is satisfied or it becomes necessary to start an homotopy method, and keeping the first one as worst case. Now, let us write the Newton-Raphson iteration differently in Equation 15:

f(v^k) + J(v^k) ∙ (v^(k+1) - v^k) + h.o.t. - i ≅ f(v^k) + J(v^k) ∙ (v^(k+1) - v^k) - i = 0   →

f(v^k) + J(v^k) ∙ (v^(k+1) - v^k) - i = 0

Equation 15 – Newton-Raphson Method – Final Formula

where k is the iteration step. In this case, both the node voltages v^(k+1) and the branch currents j^(k+1) are direct unknowns.


Now, rearranging Equation 15, it is possible to recognize the analogy with the MNA formulation:

J(v^k) ∙ v^(k+1) = J(v^k) ∙ v^k - f(v^k) + i

Equation 16 – Newton-Raphson Method rearranged like the Linear System above

Assembling this formulation back into the matrix notation, it becomes:

[ J(v^k)  B ]   [ v^(k+1) ]   [ J(v^k) ∙ v^k - f(v^k) + i ]
[   C     D ] ∙ [ j^(k+1) ] = [             e              ]

Equation 17 – General Linearized System by using the Newton-Raphson Method

which is exactly what SPICE solves as A ∙ x = b. The Newton-Raphson Method is really powerful, because it converges quadratically to the solution, but only if the initial guess is close enough to it. Moreover, it is known that it cannot guarantee global convergence when applied to a non-monotonic function, which is, unfortunately, sometimes the case in SPICE; in fact, some Compact Models are artificially modified to guarantee monotonicity and to be at least C², as SPICE expects, sometimes leading to non-physical solutions.

4.2 The state of the art of convergence algorithms for circuit simulation

As already mentioned above, the SPICE criterion to check whether a Newton-Raphson iteration terminated correctly comprehends three convergence tests:
• Delta-V is satisfied, for direct unknown voltages
• f(v) is satisfied, for indirect unknown current branches
• Delta-I is satisfied, for direct unknown current branches
In order to verify Delta-V, Equation 18 is applied, which tests the difference between the current value of v and the previous one against a fixed threshold:

|v^(k+1) - v^k| < reltol ∙ max(|v^(k+1)|, |v^k|) + vntol

Equation 18 – Equation to check Delta-V


A similar check is performed to test f(v), using Equation 19, which tests whether every indirect current, known by each device model in the given circuit, is limited:

|f(v^(k+1)) - f(v^k)| < reltol ∙ max(|f(v^(k+1))|, |f(v^k)|) + abstol

Equation 19 – Equation to check f(v)

Since direct currents are also unknowns, a similar test has to be performed for them; this task is accomplished by the Delta-I convergence test in Equation 20:

|j^(k+1) - j^k| < reltol ∙ max(|j^(k+1)|, |j^k|) + abstol

Equation 20 – Equation to check Delta-I

As the last two formulas show, direct currents and indirect currents are treated in the same way, and this is the issue: direct unknowns can be limited by an arbitrary criterion, while indirect unknowns cannot, and must be limited by some law which binds them to the direct unknowns.

4.3 The False Convergence Phenomenon

During the simulation, there are some curves which satisfy Delta-V and Delta-I, but not the f(v) test, and other curves which satisfy the opposite criteria. The False Convergence Phenomenon appears when the iteration is so slow that it passes all the convergence criteria, including the f(v) test, without the KCL actually being satisfied.

In fact, a good circuit simulator should verify the KCL, instead of using the f(v) convergence test. This concept appeared for the first time in a paper by Kenneth Kundert [5].

4.4 KCL Verification

In order to avoid the False Convergence Phenomenon, the f(v) convergence test has to be replaced by the KCL Verification for the entire circuit, performed after Delta-V is satisfied for every voltage node and Delta-I for every current node. The KCL formulation in Equation 21 is derived from the previous matrix notation, or, equally, by extracting f(v^k), j and i from every device model and writing the KCL. In addition, when convergence is satisfied, v^(k+1) = v^k.

|f(v^k)_n + j_n - i_n| < reltol ∙ max(|f(v^k)_n|, |j_n|, |i_n|) + abstol

Equation 21 – General KCL formulation

where n is the voltage node index, f(v^k)_n is the total sum of the indirect unknown current branches incident at voltage node n, j_n is the total sum of the direct unknown current branches incident at node n, and i_n is the analogous sum for the independent current sources.

4.5 F(Vk) in SPICE

The left hand side of Equation 21 can be calculated by changing every device model to directly output every contribution of f(v^k), j and i independently, and then assembling them inside the same device routine which calculates the Jacobian entries of that device. This approach is not straightforward, because it requires changing every model, but it is efficient, because the RHS is in general already calculated inside every device model as J(v^k) ∙ v^k - f(v^k) + i, which exactly includes f(v^k) and, for current sources only, i. So it is enough to compute the same intermediate terms in another way to make f(v^k) directly visible alone, and let the compiler optimize everything.

In devices which have f(v) ≠ 0, j = 0 and i = 0, f(v) represents the device total current between two voltage nodes, either internal (introduced by the model itself) or external ones, and can be composed in three different ways:

1. A Purely Linear and Static Device Model (Resistor) always has RHS = J(v^k) ∙ v^k - f(v^k) = 0, so f(v^k) = J(v^k) ∙ v^k = G ∙ v
2. A Purely Linear and Dynamic Device Model (Capacitor) always has RHS = J(v^k) ∙ v^k - f(v^k) = g_eq ∙ v^k - i_eq, so f(v^k) = i_eq; i_eq depends on the chosen integration method and includes the impressed current, which is the current at the previous time step
3. A Non-Linear Device Model (the diode is the simplest one) always has two parts: Static and Dynamic. The Static part serves to determine the Operating Point, while the Dynamic part is needed during the Transient Analysis (or other analyses). In this kind of devices, RHS = J(v^k) ∙ v^k - f(v^k), where J(v^k) ∙ v^k is the linearized part of the Non-Linear device. Here it is possible to extract f(v^k) by summing every current between two voltage nodes, considering the Non-Linear current, when present, instead of its linear approximation.

A "special case" appears when f(v^k) = J(v^k) ∙ v^k. In this case:

RHS = J(v^k) ∙ v^k - f(v^k) = J(v^k) ∙ v^k - J(v^k) ∙ v^k = 0

so the RHS does not contain f(v^k). This means that, in this case, f(v^k) has to be calculated explicitly in the same manner it is calculated for a linear static device, but remembering that the device is actually non-linear, so f(v^k) changes at every Newton-Raphson iteration.

In devices which have f(v) = 0, j ≠ 0 and i = 0, like independent voltage sources and inductors, only the direct unknown current branches have to be considered for the KCL Verification.

The remaining devices have f(v) = 0, j = 0 and i ≠ 0, and they are the independent current sources. Here the contribution to the KCL Verification is simply their own current.

4.6 Nodes Classification

Since the KCL Verification is really needed only on non-linear nodes, a proper node classification is mandatory to separate non-linear nodes from linear ones, in order to save run-time by avoiding useless operations.
Definition 1: a node is non-linear if it connects to at least one non-linear device model.
Definition 2: a node is linear if it connects only to linear device models.
So, by knowing this classification, it is possible to skip the KCL Calculation and Verification on linear nodes.

4.7 Contributions in the BSIM4 model

As a general example, let us now examine the case of the BSIM4 model. Regarding the node classification, the BSIM4 model has potentially 11 non-linear nodes: the first 4 are the classic ones (Drain, Source, Gate and Body, in this order), while the others are related to specific BSIM4 customizations, which is why their declaration of non-linearity is encapsulated in 'if' statements. By using the above definitions, all the nodes of the circuit are initialized as linear, externally to any specific model, and then the non-linear ones get overwritten by the particular model, which knows their properties, as shown in Code 1.


ckt->CKTnodeIsLinear [here->BSIM4gNodePrime] = 0 ;
ckt->CKTnodeIsLinear [here->BSIM4bNodePrime] = 0 ;

if (here->BSIM4rgateMod == 2) {
    ckt->CKTnodeIsLinear [here->BSIM4gNodeExt] = 0 ;
} else if (here->BSIM4rgateMod == 3) {
    ckt->CKTnodeIsLinear [here->BSIM4gNodeMid] = 0 ;
}

if (here->BSIM4rbodyMod) {
    ckt->CKTnodeIsLinear [here->BSIM4dbNode] = 0 ;
    ckt->CKTnodeIsLinear [here->BSIM4sbNode] = 0 ;
}

if (model->BSIM4rdsMod) {
    ckt->CKTnodeIsLinear [here->BSIM4dNode] = 0 ;
    ckt->CKTnodeIsLinear [here->BSIM4sNode] = 0 ;
} else {
    ckt->CKTnodeIsLinear [here->BSIM4dNodePrime] = 0 ;
    ckt->CKTnodeIsLinear [here->BSIM4sNodePrime] = 0 ;
}

if (here->BSIM4trnqsMod) {
    ckt->CKTnodeIsLinear [here->BSIM4qNode] = 0 ;
}

Code 1 – Set the Non-Linear nodes of the BSIM4 device model

Then, it is necessary to create an instance of the 'KCL current' data structure for each current which contributes to the KCL, like in Code 2:

error = CKTmkCurKCL (ckt, here->BSIM4dNodePrime, &(here->KCLcurrentdNodePrimeRHS_1)) ;

Code 2 – Make an instance of the KCL current data structure


This data structure is a linked list and it is necessary because it stores the current contribution to a specific node, initialized to zero. Since the linked list is per node, the second parameter expresses the node index and it is used to attach the new list element to the specific node. Code 3 shows the routine used by each device model to instantiate the data structure:

int
CKTmkCurKCL (CKTcircuit *ckt, int i, double **node)
{
    CKTmkCurKCLnode *tempNode ;

    tempNode = TMALLOC (CKTmkCurKCLnode, 1) ;
    tempNode->KCLcurrent = 0.0 ;
    tempNode->next = ckt->CKTmkCurKCLarray [i] ;
    ckt->CKTmkCurKCLarray [i] = tempNode ;
    *node = &(tempNode->KCLcurrent) ;

    return (OK) ;
}

Code 3 – The routine used to create an instance of the KCL current data structure

Now, the KCL contributions have to be calculated somewhere, and this happens inside every device model load routine (the main one), and so for the BSIM4 too, where those terms are already computed for the circuit matrix and the RHS:

*(ckt->CKTfvk + here->BSIM4dNodePrime) -= m * (ceqjd_fvk - ceqbd_fvk - ceqdrn_fvk - ceqqd_fvk + Idtoteq_fvk) ;
*(here->KCLcurrentdNodePrimeRHS_1) = -(m * ceqjd_fvk) ;
*(here->KCLcurrentdNodePrimeRHS_2) = m * ceqbd_fvk ;
*(here->KCLcurrentdNodePrimeRHS_3) = m * ceqdrn_fvk ;
*(here->KCLcurrentdNodePrimeRHS_4) = m * ceqqd_fvk ;
*(here->KCLcurrentdNodePrimeRHS_5) = -(m * Idtoteq_fvk) ;

Code 4 – An example of KCL currents calculations; both the sum and the single contributions

In particular, both the sum of the currents at a certain node and the single currents are needed, because the KCL verification formula needs the maximum current at the node, which is a circuit property. Alternatively, it would have been possible to defer the calculation of the sum to the moment when it is actually needed, but that would have involved a loop over the linked list of the node, while here the sum is written directly and the compiler will likely optimize everything, since the same variables are referenced multiple times. However, one optimization has been made here: the calculation of the Purely Static Linear contributions has been moved outside the BSIM4 load routine, so that they are calculated only when they are needed. In particular, the correct place is the 'NIConv' routine, where the convergence checks are called, including the KCL verification:

error = CKTloadKCL (ckt) ;

Code 5 – Explicit load of the KCL contributions

That said, the KCL Verification is finally performed in 'NIConv', as reported in Code 6:

for (i = 1 ; i <= size ; i++) {
    node = node->next ;

    if ((node->type == SP_VOLTAGE) && (!ckt->CKTnodeIsLinear [i])) {
        new = ckt->CKTrhs [i] ;
        old = ckt->CKTrhsOld [i] ;
        tol = ckt->CKTreltol * (MAX (fabs (old), fabs (new))) + ckt->CKTvoltTol ;
        if (fabs (new - old) > tol) {
            ckt->CKTtroubleNode = i ;
            ckt->CKTtroubleElt = NULL ;
            return 1 ;
        }
    } else if ((node->type == SP_CURRENT) && (!ckt->CKTnodeIsLinear [i])) {
        new = ckt->CKTrhs [i] ;
        old = ckt->CKTrhsOld [i] ;
        tol = ckt->CKTreltol * (MAX (fabs (old), fabs (new))) + ckt->CKTabstol ;
        if (fabs (new - old) > tol) {
            ckt->CKTtroubleNode = i ;
            ckt->CKTtroubleElt = NULL ;
            return 1 ;
        }
    }
}

/* KCL Verification */
error = CKTloadKCL (ckt) ;
node = ckt->CKTnodes ;
for (i = 1 ; i <= size ; i++) {
    node = node->next ;

    if ((node->type == SP_VOLTAGE) && (!ckt->CKTnodeIsLinear [i])) {
        maximum = 0 ;
        ptr = ckt->CKTmkCurKCLarray [i] ;

        while (ptr != NULL) {
            if (maximum < fabs (ptr->KCLcurrent))
                maximum = fabs (ptr->KCLcurrent) ;
            ptr = ptr->next ;
        }

        if (ckt->CKTuseDeviceGmin) {
            if (maximum < fabs (ckt->CKTgmin * ckt->CKTrhsOld [i]))
                maximum = fabs (ckt->CKTgmin * ckt->CKTrhsOld [i]) ;
        } else {
            if (maximum < fabs (ckt->CKTdiagGmin * ckt->CKTrhsOld [i]))
                maximum = fabs (ckt->CKTdiagGmin * ckt->CKTrhsOld [i]) ;
        }

        /* Check Convergence */
        if (ckt->CKTuseDeviceGmin) {
            if (fabs (ckt->CKTfvk [i] + ckt->CKTgmin * ckt->CKTrhsOld [i]) > (ckt->CKTreltol * maximum + ckt->CKTabstol)) {
                ckt->CKTtroubleNode = i ;
                ckt->CKTtroubleElt = NULL ;
                return 1 ;
            }
        } else {
            if (fabs (ckt->CKTfvk [i] + ckt->CKTdiagGmin * ckt->CKTrhsOld [i]) > (ckt->CKTreltol * maximum + ckt->CKTabstol)) {
                ckt->CKTtroubleNode = i ;
                ckt->CKTtroubleElt = NULL ;
                return 1 ;
            }
        }
    }
}

Code 6 – The KCL Verification during the convergence test execution


4.8 A new homotopy method for faster convergence

During the implementation of the KCL verification, I ran into a strange issue related to the GMIN stepping algorithm: all the netlists which need the GMIN stepping algorithm to reach convergence never verified the KCL. This behavior was clearly due to the GMIN stepping algorithm itself, because the KCL was verified successfully when it was not used. After a thorough investigation, I discovered that SPICE3 changed the algorithm with respect to SPICE2, but nobody knew that, because the change was never documented; I then asked my second supervisor, Alberto Sangiovanni-Vincentelli, who confirmed this. There are basically two differences between the SPICE3 and SPICE2 versions of the algorithm:
1. The GMIN conductance is applied to each diagonal element of the circuit matrix after its reordering, which removes zeroes from it; but these diagonal elements are not the circuit diagonal elements anymore, so they have no topological meaning
2. GMIN is applied to every diagonal element of the circuit matrix, disregarding whether it corresponds to a voltage or a current node; in the latter case, what is actually summed is a quantity with resistance dimension and a GMIN value, not a conductance
So, basically, the SPICE3 version of the GMIN stepping algorithm is just one of the many possible homotopy methods, and it is not GMIN at all. On the contrary, the SPICE2 version definitely has a circuital meaning and is topological: it adds a conductance between each node of the circuit and ground, in order to start from a circuit which is more resistive, hence more linear. This means adding the contribution to the circuit matrix before it gets reordered, and skipping the current nodes, which come only from the MNA approach and do not exist in the real circuit. So, after coding the lost SPICE2 version back, I was able to verify the KCL even when the GMIN stepping algorithm was invoked.
At this point, I wondered whether there could be a way to enhance this GMIN stepping algorithm further while remaining topological at the same time, so that the KCL could still be verified. I came up with the following idea: is it possible to use the already existing Device GMIN, which is inserted in parallel to an exponential law (i.e., a diode junction) and is different from the previously mentioned GMIN, with the purpose of smoothing the non-linear behavior of the circuit, in order to have a good starting point for the Newton-Raphson method? Yes, we can!


The two pictures below illustrate the idea: Figure 15 shows the Device GMIN, which is already in parallel to a diode inside its model, and Figure 16 shows the diode curve during the stepping of the Device GMIN, which goes from 1e-3 down to 1e-15. Let us call this new algorithm 'Device GMIN Stepping Algorithm', to compare it with the old one, which can be called 'DiagGMIN Stepping Algorithm'.

Figure 15 – Device GMIN to be used as new homotopy method

It is clear from Figure 16 that the diode curve is linear when the Device GMIN is high and non-linear when the Device GMIN is negligible, so the idea is valid in principle: when a good value of the Device GMIN is chosen, the non-linear component behaves linearly, which represents an optimal starting point for the N-R iteration.

Figure 16 – Example of Device GMIN Stepping on a Diode Device Model


4.9 Implementation Results

The presented algorithm has been implemented in NGSPICE and verified on the ISCAS85 Benchmark Suite [6]. The ISCAS85 suite has been implemented with a standard cell library, which uses BSIM4 devices along with resistors and capacitors. Both standard and back-annotated netlists have been tested. In my previous work on this benchmark suite [2], I found that not all the netlists converged to a valid operating point, and I had to raise the GMIN value. By just inserting the KCL Verification and decreasing the GMIN target value to 1e-18, I still faced the False Convergence of the c7552_ann netlist, but I was able to make c6288_ann converge with such a high precision (red line of Figure 17). By also inserting the new Device GMIN Stepping Algorithm, I was able to make all the netlists converge with such a high precision (green line of Figure 17). Regarding the speed, the KCL Verification has introduced a lot of computation due to the "special" nodes mentioned above, but the slowdown has been contained within about 15% overall (Figure 17). This good behavior is due to the efficient implementation previously discussed.

Figure 17 – NO KCL vs KCL w/ or w/o Device GMIN Stepping Algorithm

By introducing the novel Device GMIN Stepping Algorithm, I managed to speed up all the simulations by an average of 1.31X, while still letting all the netlists converge with the same accuracy.


This happens because this new homotopy method does not let the simulation switch to the Source Stepping Algorithm and reduces the number of fill-ins, as shown in Table 2.

Algorithm              Time (s)    Fill-ins   DC Iterations
Old GMIN               1264.530    124242     533
Device GMIN and KCL     860.160    119450     443

Table 2 – GMIN Results Comparison on a representative testcase of the ISCAS85 Benchmark Suite

4.10 Linear and Non-Linear Separation

This topic is quite interesting and actually serves several features:
1. It is a good companion of the KCL Verification, because it is not necessary to verify the KCL on a linear node, since that node is convergent by default
2. It also serves to speed up the Model Evaluation phase, because it is not necessary to recalculate every contribution at each Newton-Raphson iteration and/or at each time step, since some items are constant
3. It can be used to separate the circuit matrix into two separate parts, using the Schur complement
So, now let us see all of them one by one. The first point is part of the KCL Verification and has been discussed extensively in the previous chapter, showing also an example for the BSIM4 device model. The second purpose has already been coded and is available in a dedicated branch of NGSPICE. It separates the linear and static contributions, like resistors, from the linear and dynamic contributions, like capacitors, and from the non-linear contributions, like transistors. In the latter case, a further separation could be performed between the static and the dynamic part, but it does not matter for this purpose, since the inner Newton-Raphson loop needs both the non-linear static and dynamic contributions. All those parts are then combined together when the matrix is passed to the linear solver for the LU calculation. Depending on how many components of these kinds are inside the circuit under simulation, the speedup can be quite high. So, back-annotated circuits, where the parasitics contribution is taken into account through an RC network, are expected to gain the most from this.

ADVANCING NANO-CMOS CIRCUITS SIMULATION 36

In the same direction is the third point, which regards the separation of the circuit matrix in two pieces, because it aims to speed up the linear system solution by reducing its size. The idea is to leverage the classification into linear static, linear dynamic and non-linear nodes to separate the circuit matrix in two (or three) pieces using the Schur complement and calculate the LU of the linear part upfront, since it never changes during the Transient Analysis, or changes only when there is a new time step, depending on whether it's static or dynamic. Also in this case, the speedup can be quite high, depending on the circuit, so this approach too is really suitable for back-annotated circuits. The problem of this methodology is finding a linear part which is full rank, otherwise the inverse matrix doesn't exist and the Schur complement cannot be calculated. Fortunately, theoretically speaking, it's always possible to find a permutation which moves nodes (so rows and columns) from the linear part to the non-linear one until a full rank linear part is found. Unfortunately, this method has a practical and clear drawback: the more nodes we move from the linear part to the non-linear part, the more complex the non-linear part becomes, so we may lose the practical advantage of this methodology. After this brief description, let's revisit it in a more rigorous way.

It's possible to formulate the following definition: a voltage node $v_i$ is non-linear if it connects to at least one non-linear device, while it's linear if it connects only to linear devices.

Thereafter, let's define $L = \{i \mid i \text{ is linear}\}$ and $N = \{i \mid i \text{ is non-linear}\}$ and reorder the index of voltage nodes such that $x = \begin{bmatrix} x_L \\ x_N \end{bmatrix}$, where $x_L$ denotes linear voltage nodes and $x_N$ denotes non-linear voltage nodes. Considering this reordering of voltage nodes, it's possible to partition the matrix $A$ inside the linear system $Ax = b$ into:

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

Equation 22 – First partition of the matrix A into linear and non-linear parts

where $A_{22}$ is the relationship within non-linear nodes and contains part of the Jacobian matrix, and

$A_{11}$, $A_{12}$ and $A_{21}$ are constant matrices, independent of the Newton-Raphson cycle. In particular, $A_{11}$ is the relationship within linear nodes, so it's constant by definition. The other two matrices connect a linear node and a non-linear node, so they have to be constant, since there must be a linear device in between. A way to see this is by supposing that a non-linear device connects


these two nodes; then these two nodes must be non-linear ones, according to the definitions above, and this is a contradiction, so the claim is proved. Now, as previously mentioned, there are two different possible cases to be analyzed.

Case 1: $A_{11}$ is full rank

The linear system $Ax = b$ can be decomposed in the following way:

Step 1: LU decomposition of A using the Schur complement

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} A_{11} & 0 \\ A_{21} & I \end{bmatrix} \begin{bmatrix} I & A_{11}^{-1} A_{12} \\ 0 & A_{22} - A_{21} A_{11}^{-1} A_{12} \end{bmatrix}$$

Equation 23 – LU decomposition of A using the Schur complement

Step 2: solve $Ly = b$

$$\begin{bmatrix} A_{11} & 0 \\ A_{21} & I \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} A_{11}^{-1} b_1 \\ b_2 - A_{21} A_{11}^{-1} b_1 \end{bmatrix}$$

Equation 24 – Lower triangular solve after Schur complement decomposition

Step 3: solve $Ux = y$ by LU decomposition of $A_{22} - A_{21} A_{11}^{-1} A_{12}$

$$\begin{bmatrix} I & A_{11}^{-1} A_{12} \\ 0 & A_{22} - A_{21} A_{11}^{-1} A_{12} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 - A_{11}^{-1} A_{12}\, x_2 \\ \left(A_{22} - A_{21} A_{11}^{-1} A_{12}\right)^{-1} y_2 \end{bmatrix}$$

Equation 25 – Upper triangular solve after Schur complement decomposition
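As a concrete sketch of the three steps above, assuming 1x1 blocks so that every block operation reduces to scalar arithmetic (the function name and types are illustrative, not from NGSPICE):

```c
#include <assert.h>
#include <math.h>

/* Scalar-block sketch of the case-1 solve (Equations 23-25): a11 plays the
 * role of the linear block A11, a22 of the non-linear block A22. */
typedef struct { double x1, x2 ; } Sol ;

Sol schur_solve (double a11, double a12, double a21, double a22,
                 double b1, double b2)
{
    Sol s ;
    double y1 = b1 / a11 ;                /* Step 2: y1 = A11^-1 b1          */
    double y2 = b2 - a21 * y1 ;           /*         y2 = b2 - A21 A11^-1 b1 */
    double m  = a22 - a21 * (a12 / a11) ; /* Schur complement                */
    s.x2 = y2 / m ;                       /* Step 3: x2 = M^-1 y2            */
    s.x1 = y1 - (a12 / a11) * s.x2 ;      /*         x1 = y1 - A11^-1 A12 x2 */
    return s ;
}
```

Replacing the scalar blocks with the sparse blocks of Equation 22 gives exactly the intended flow: one factorization of the linear block upfront, then only the small Schur complement is refactored inside the Newton-Raphson loop.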

In order to have $A_{11}$ full rank, let's consider that resistors, capacitors and all the device models which add four contributions for each node, with the exclusion of independent sources, controlled sources and inductors, make the circuit matrix weakly diagonally dominant, but the condition to have a full rank matrix by construction is that it's strongly diagonally dominant. We can obtain this by considering the GMIN stepping algorithm, but this doesn't happen every time, even though we could think of adding a small GMIN anyway, just to make it true, without changing the behavior of the circuit. So, following this concept, independent sources, controlled sources and inductors have to stay in the non-linear part of the circuit matrix, but there is another thing to say: we can split the device model in pieces. For example, the inductor has an equivalent resistance part (so it's like a capacitor in this respect) and a separate current node, which is the real problem, and only the latter has to be put in the non-linear part.


Last but not least, this condition is only sufficient, which means that, even though it isn't verified, we cannot say that $A_{11}$ is not full rank; it could still be. So, the overall solution will not be optimal, in general.

Case 2: $A_{11}$ isn't full rank

It's still possible to find two permutation matrices P, Q and rewrite the matrix as in Equation 26:

$$PAQ = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$$

Equation 26 – Permuted matrix to find a full rank $B_{11}$

where $B_{11}$ is full rank, by moving some linear nodes to the non-linear part of the system. So, it's possible to rewrite the system in the following way:

Step 1: insert the permutation matrices P and Q into the original linear system

$$P A Q \, Q^{T} x = P b$$

Equation 27 – Permutation matrices P and Q insertion to find a full rank sub-matrix

Then, the new linear system becomes:

$$\left(PAQ\right) z = c, \qquad z = Q^{T} x, \qquad c = P b$$

Equation 28 – Reordered system after Equation 27

Step 2: apply $PAQ = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$ and separate $z_1$, $z_2$, $c_1$ and $c_2$

$$\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}$$

Equation 29 – Change names to a sub-matrix to separate interesting parts

Now, let's simplify the notation of Equation 29 in the form expressed in Equation 30, so that it appears like case 1:

$$\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}$$

Equation 30 – Same system as Equation 29 but simplified

recalling that $B_{11}$, $B_{12}$ and $B_{21}$ are constant matrices, similarly to $A_{11}$, $A_{12}$ and $A_{21}$, which are constant matrices.


Step 3: perform the LU decomposition on $\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$, since now it's like case 1

$$\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} B_{11} & 0 \\ B_{21} & I \end{bmatrix} \begin{bmatrix} I & B_{11}^{-1} B_{12} \\ 0 & B_{22} - B_{21} B_{11}^{-1} B_{12} \end{bmatrix}$$

Equation 31 – LU decomposition of the new system

So, just to simplify the notation again, let it be:

$$E = B_{11}^{-1} B_{12}, \qquad F = B_{21} E, \qquad M = B_{22} - B_{21} B_{11}^{-1} B_{12} = B_{22} - F$$

Equation 32 – Simplification of Equation 31

This means doing exactly steps 2 and 3 of case 1, so it means performing the LU decomposition of the matrix M during the Newton-Raphson cycle. That said, let's concentrate on a possible flow to calculate everything we need. The procedures below are based upon the general case 2 and upon the Topology Matrix methodology, described in the following CUSPICE chapter.

Setup Phase:

- Determine linear and non-linear nodes
- Construct the topology matrix $T$ for $A$
- Construct the topology matrix for the RHS vector (the same as described in CUSPICE)
- Compute $PAQ = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$
- Compute the permuted counterparts of the two topology matrices
- Compute the LU decomposition of $B_{11}$
- Compute $E = B_{11}^{-1} B_{12}$ (this may be time-consuming because of the many triangular solves in $B_{11}^{-1} B_{12}$)
- Compute the sparsity pattern of $F$, where $F = B_{21} E$
- Compute the sparsity pattern of M by $\mathrm{pattern}(M) = \mathrm{pattern}(B_{22}) \cup \mathrm{pattern}(F)$
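A sketch of the last setup item, assuming (as an illustration, not the actual NGSPICE/KLU representation) that each sparsity pattern is stored as a sorted array of non-zero positions; the pattern of $M = B_{22} - F$ is the union of the two input patterns, barring accidental numerical cancellation:

```c
#include <assert.h>
#include <stddef.h>

/* Merge two sorted position arrays into their union; returns its size.
 * 'out' must have room for n1 + n2 entries in the worst case. */
size_t pattern_union (const int *p1, size_t n1,
                      const int *p2, size_t n2, int *out)
{
    size_t i = 0, j = 0, k = 0 ;
    while (i < n1 && j < n2) {
        if (p1 [i] < p2 [j])        out [k++] = p1 [i++] ;
        else if (p2 [j] < p1 [i])   out [k++] = p2 [j++] ;
        else { out [k++] = p1 [i++] ; j++ ; } /* common position: keep once */
    }
    while (i < n1) out [k++] = p1 [i++] ;
    while (j < n2) out [k++] = p2 [j++] ;
    return k ;
}
```

Precomputing this union once in the setup phase lets the Newton-Raphson loop reuse a fixed symbolic factorization for M.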

Newton-Raphson Phase:

Step 1: use the topology matrix to update the non-linear block $A_{22}$, since $\mathrm{vals}(A_{22}) = T\,v$, where $v$ is the global vector where all the device models write their contributions


Step 2: scatter $A_{22}$ into:

$$B_{22} = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & A_{22} \end{bmatrix}$$

Equation 33 – Non-linear part of the matrix after permutation

which can be accomplished by table lookup, where the $C$ blocks collect the constant entries contributed by the linear nodes moved into the non-linear part.

Step 3: assemble the matrix M by calculating $M = B_{22} - F$. Now, let's notice that it's possible to simplify steps 2 and 3 in another way: let's rewrite

$M = B_{22} - F$ by calculating:

$$M = \begin{bmatrix} 0 & 0 \\ 0 & A_{22} \end{bmatrix} + \left( \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & 0 \end{bmatrix} - F \right) = \begin{bmatrix} 0 & 0 \\ 0 & A_{22} \end{bmatrix} + K$$

Equation 34 – Simplified M matrix

where the $C$ blocks denote the constant entries of $B_{22}$, so that $K$ is constant.

Since $K$ is a constant matrix, if $T$ gets enlarged such that $\mathrm{pattern}(T) = \mathrm{pattern}(M)$, then only two steps are needed to update the matrix M:

1. $\mathrm{vals}(M) = \mathrm{vals}(K)$
2. Scatter $A_{22}$ into M

Step 4: perform the LU decomposition of the matrix M by KLU

Step 5: solve $\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}$ by calculating:

1. $\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} B_{11} & 0 \\ B_{21} & I \end{bmatrix} \begin{bmatrix} I & B_{11}^{-1} B_{12} \\ 0 & B_{22} - B_{21} B_{11}^{-1} B_{12} \end{bmatrix}$

Equation 35 – LU decomposition of the new system

2. $\begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = P\,b$

Equation 36 – RHS for the new system

Step 5.1: solve $Lw = c$

$$\begin{bmatrix} B_{11} & 0 \\ B_{21} & I \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} B_{11}^{-1} c_1 \\ c_2 - B_{21} B_{11}^{-1} c_1 \end{bmatrix}$$

Equation 37 – Lower triangular solve of the new system


Step 5.2: solve $Uz = w$

$$\begin{bmatrix} I & B_{11}^{-1} B_{12} \\ 0 & M \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} w_1 - B_{11}^{-1} B_{12}\, z_2 \\ M^{-1} w_2 \end{bmatrix}$$

Equation 38 – Upper triangular solve of the new system

Step 6: solve x by solving:

$$x = Q\,z$$

Equation 39 – Extract 'x' from 'z'

for example by computing $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = Q \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}$.

However, by performing these separations, there are potential issues:

1. M may become denser because of the different reordering. Remember that the initial column reordering is performed with the intent of minimizing fill-in insertion, but only local reordering on $B_{11}$ and M is performed in this case, and this could be a poor choice. However, this could be turned into an advantage, because a dense LU may be faster on the GPU; it depends only on how big it is. For example, if the dimension of M is $n = 1024$ and a K20X GPU card is used, then a good estimation of the dense LU time could be about 1 ms, considering that it performs 1 Tflops on DGEMM.
2. A triangular solver on a sparse vector, or sparse matrix, is needed, for example for $B_{11}^{-1} B_{12}$.
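The 1 ms figure can be checked against the classical dense-LU operation count of about $\frac{2}{3}n^{3}$ floating point operations:

```latex
t_{LU} \approx \frac{\tfrac{2}{3}\,n^{3}}{R_{\mathrm{DGEMM}}}
      = \frac{\tfrac{2}{3}\,(1024)^{3}}{10^{12}\ \mathrm{flop/s}}
      \approx 0.72\ \mathrm{ms}
```

which is indeed of the order of 1 ms at the 1 Tflops DGEMM rate assumed above.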


Chapter 5 - CUSPICE – The Revolutionary NGSPICE on CUDA Platform

5.1 Introduction

Some years ago it became pretty clear that it's no longer possible to achieve significant speedup by simply increasing the pure CPU clock frequency of a computing platform or architecture. So, different techniques have to be used to maximize it, and parallelization is one of them. Basically, in principle, it's possible to exploit several independent processing units and execute a single job cooperatively. The big issue with this concept is that the original single core code is not suitable to work efficiently in a parallel environment, because it isn't able to exploit the available hardware parallelism and there is no efficient automatic compiler which can do this job on behalf of the programmer. This means that all the existing software has to be re-designed to work in parallel. Parallelizing an algorithm is conceptually simple, because it's sufficient to use the ancient Roman paradigm divide et impera, separating the code in independent pieces and assigning them to different cores, so that they can work in parallel. Unfortunately, the real world is much more complicated and heterogeneous. For instance, there are software categories, like operating systems, which are suitable for parallelization by their nature, because each process is independent from the others. There are other categories of problems which are difficult, or even impossible, to parallelize. In general, parallel systems can be divided into 2 classes:

• Multi Core
• Many Core

Multi Core is a homogeneous architecture, where the CPU has more than one core; generally there can be up to 8 or 16 on the same die, beyond which there exist distributed systems, which are not covered in this work. Many Core is a heterogeneous architecture, where there is a CPU, which can in turn be Multi Core, and a dedicated hardware accelerator for pure calculations, including one or more


GPUs. However, having more than one GPU means that an additional data exchange between them is needed, and this leads to additional problems and work to be done, but this case isn't discussed in this work. This work will use a simple Many Core system, which includes a Multi Core CPU (but only a single core will be used) and a single GPU. GPUs for GPGPU (General Purpose GPU) computing came out from NVIDIA in 2006, along with the first CUDA platform. At that time, they were capable of working only with single precision numbers, but the Fermi generation brought double precision in 2010-2011, enabling scientific software to be ported to the GPU, because few scientific calculations can be performed in single precision.

5.2 The Fermi Architecture

The NVIDIA GPU is a SIMT machine (Single Instruction – Multiple Thread), which means it's able to execute one single instruction on several threads simultaneously. It's different from a SIMD machine, because thread and data are two separate entities in the former, since it's possible to choose how the GPU is going to work. For instance, a single thread can work on a single datum, or all threads can work on the same datum, or even different configurations, depending on the application. The schematized layout of the Fermi class GPU is illustrated in Figure 18, where it's possible to recognize 16 macroblocks (the small orange-light blue-green columns in the picture), each one a Streaming Multiprocessor (SM), connected to an L2 cache; in addition, there are 6 banks of RAM, an interface to the system (Host Interface) and a global scheduler (GigaThread). Since NVIDIA donated two TESLA C2070 cards to my university, one of these has been used to code the original version of CUSPICE. So, let's see its specifications:

• It has only 14 Streaming Multiprocessors (2 are disabled through BIOS settings or physically)
• It has 6GB of GDDR5 DRAM, even though only about 5GB are available, since 12% of the maximum capacity is reserved for the ECC capability; the available RAM is divided into 6 chips of 1GB, each with a 64-bit data bus, so the total memory bus width towards the GPU is 64 bit · 6 chips = 384 bit; the card connects to the system through a PCI-Express 2.0 x16 bus


Figure 18 – NVIDIA Fermi Architecture Layout – The green areas are the Streaming Multiprocessors (SMs)

The GPU RAM, which is called Global Memory, is the only memory in the GPU which can be addressed by the system (the CPU), so all the data are forced to transit through it. Every SM has the structure illustrated in Figure 19, where it's possible to notice:

• 2 Warp Schedulers (a warp is a bunch of 32 serially indexed threads), which testify that the Fermi architecture is Dual Warp, meaning that it can execute two warps simultaneously
• A single 32-bit register bank
• 16x2 CUDA Cores
• 16 units for reading/writing
• 4 special units to calculate special functions, like sine and cosine
• The internal interconnecting bus
• The Shared Memory (64KB total, generally divided into 48KB of Shared Memory and 16KB of L1 cache)

The global scheduler distributes the workload on the SMs; in particular, it's able to schedule the thread blocks on several free SMs, since each one is independent from the others. A huge number of total blocks lets the GPU have a high level of parallelism, which gets easily dispatched by the global scheduler; NVIDIA calls this MultiThreading. Given the independence of each SM, the communication of data calculated by a certain SM to another SM must transit through the GPU RAM.


Figure 19 – The Streaming Multiprocessor (SM)

Each CUDA Core is a complex unit capable of operating either on integer or on Single Precision Floating Point numbers, both represented using 32 bits; there are also some units able to perform calculations on Double Precision Floating Point numbers, represented using 64 bits, with reduced performance depending on how many units are available for this purpose on the specific card. For those kinds of calculations, the GPU follows the IEEE754-2008 standard. In the Fermi GPU card, it's possible to exploit the FMA (Fused Multiply-Add) instruction when multiplications with accumulation are performed together in one shot. Since each elaboration lasts two cycles, the instruction fetch happens every two cycles, to maintain the current instruction in memory for a sufficient amount of time so that all the CUDA Cores which execute the same warp can finish their elaboration. So, the peak throughput is:


$$\text{throughput} = \text{CUDA Cores per SM} \cdot \frac{\text{instructions}}{\text{cycle}} \cdot f = 32 \cdot 0.5 \cdot 770\,\text{MHz} = 12320\ \text{Minstructions/s}$$

Each CUDA Core executes a thread, but the elementary unit which really goes into execution is the Warp (a bunch of 32 adjacent threads, starting from thread number 0), as previously mentioned about the working model of the SIMT architecture.

Figure 20 – The CUDA Core

5.3 The Fermi Architecture Memory Access

A key topic to be analyzed is memory access. It's possible to distinguish two kinds of fundamental memory access:

• Global Memory access
• Shared Memory access

The Global Memory is shared among all the SMs, and this makes memory access parallel from a physical standpoint, but it can still happen serially if the code is not well written, and this obviously degrades performance. The Global Memory access happens through the L1 cache, if enabled, otherwise through the L2 cache. In the first case, since a cache line occupies 128 bytes and since each datum is composed of 4 bytes, it's possible to perform 32 accesses simultaneously.

However, this optimal working model happens only if the $i$-th thread accesses the $i$-th datum, meaning that the access is stride-1; in this way, all the threads in a warp (32 threads) access different and adjacent data (32 accesses), which are stored on a single cache line.


If the access is misaligned, then it gets penalized by a factor of 2, because each warp is always going to read two cache lines. If the access is random, meaning that each thread accesses a datum in random order, then the entire warp is going to execute several accesses, equal in number to how many cache lines have to be read to fetch all the needed data, which is at most 32 serialized readings, resulting in a huge performance degradation. On the contrary, the Shared Memory exists per SM, so the total accesses get parallelized among the 14 existing SMs. However, within each SM, there are rules to be respected to guarantee a parallel access from all the interested threads, in a similar fashion to what happens for the Global Memory. In particular, the Shared Memory is divided in banks; each one has a certain number of elements, composed of 32 bits each, so that consecutive data get read/written on consecutive banks to parallelize the access. So, also in this case, in order to maximize the reading throughput, it's mandatory to read data consecutively.
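The penalty rules above can be illustrated with a small helper (hypothetical, not part of CUSPICE) that counts the distinct 128-byte cache lines touched by one warp of 32 threads, where thread t accesses element (base_elem + t * stride) of a 4-byte array:

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE  32
#define LINE_BYTES 128   /* one Fermi cache line  */
#define DATA_BYTES 4     /* one 32-bit element    */

/* Count distinct cache lines touched by a warp for a given access pattern */
int warp_cache_lines (size_t base_elem, size_t stride)
{
    size_t lines [WARP_SIZE] ;  /* a warp touches at most 32 lines */
    int count = 0 ;
    for (int t = 0 ; t < WARP_SIZE ; t++) {
        size_t line = (base_elem + (size_t) t * stride) * DATA_BYTES / LINE_BYTES ;
        int found = 0 ;
        for (int k = 0 ; k < count ; k++)
            if (lines [k] == line) { found = 1 ; break ; }
        if (!found)
            lines [count++] = line ;
    }
    return count ;
}
```

With these definitions, an aligned stride-1 warp reads a single line, a misaligned one reads two (the factor-of-2 penalty), and a stride of 32 elements degenerates to 32 serialized line reads.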

5.4 The CUDA Platform

5.4.1 The basic CUDA application

An application, in order to work properly on the GPU, has to respect some basic criteria, which reflect what has been previously said about parallelism and constitute what is generally referred to as the "parallelism model"; already existing models, like OpenMP, MPI or PThread, are not supported by the NVIDIA GPU, because they have been created for shared memory hardware (OpenMP and PThread) or distributed memory hardware (MPI), which is visible in both cases at any time by all the threads, which are the base elements of the elaboration. In the GPU case, instead, the architecture is not so simple, so the CUDA parallelism model has been created ad hoc. A CUDA application supports three different languages:

• CUDA C, which is a variant of C or, actually, C++
• OpenCL, which is sponsored by a consortium and supports also the AMD GPUs
• DirectCompute

For my purpose, I will use the CUDA C language. A CUDA application is called a "CUDA kernel" (or just "kernel") and the simplest one is constituted by six steps:

1. Instantiate the needed pointers to the GPU Global Memory


2. Allocate the needed space in the Global Memory, whose address will be stored in the previously instantiated pointers, which are part of the system (meaning the CPU side), even though the memory is physically allocated on the GPU
3. Copy the needed data from the system to the Global Memory
4. Launch the kernel
5. Synchronize the GPU execution with the CPU, because a kernel launch is always asynchronous, so the CPU must wait for the end of the GPU calculation
6. Copy back the result data from the Global Memory to the system

These steps are schematized in Figure 21, which shows the relationship between the Host (the CPU) and the Device (the GPU) through the Global Memory and the interface of each block with it. These operations are executed by the Host, while the calculation itself, not illustrated in this schema, is managed by the kernel code. The execution time of an empty kernel plus a synchronization barrier between GPU and CPU is roughly 46 µs on my Fermi GPU card.
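As a minimal sketch of the six steps, assuming a toy kernel that doubles a vector (names and sizes are illustrative, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale (float *d_v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x ;
    if (i < n)                 /* guard threads beyond the data size */
        d_v [i] *= 2.0f ;
}

int main (void)
{
    const int n = 1024 ;
    float h_v [1024], *d_v ;                              /* 1. pointer      */
    for (int i = 0 ; i < n ; i++) h_v [i] = (float) i ;

    cudaMalloc ((void **) &d_v, n * sizeof(float)) ;      /* 2. allocate     */
    cudaMemcpy (d_v, h_v, n * sizeof(float),
                cudaMemcpyHostToDevice) ;                 /* 3. copy in      */
    scale <<< n / 256, 256 >>> (d_v, n) ;                 /* 4. launch       */
    cudaDeviceSynchronize () ;                            /* 5. synchronize  */
    cudaMemcpy (h_v, d_v, n * sizeof(float),
                cudaMemcpyDeviceToHost) ;                 /* 6. copy out     */
    cudaFree (d_v) ;
    printf ("h_v[10] = %f\n", h_v [10]) ;
    return 0 ;
}
```

The kernel configuration between the triple angle brackets (blocks per grid, threads per block) is the subject of the next subsection.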

Figure 21 – The CUDA hierarchy, composed by threads, blocks and a grid; in addition, the various memories are shown

5.4.2 The kernel in detail

CUDA (Compute Unified Device Architecture) is the codename which identifies the NVIDIA platform for general purpose calculation on the GPU (GPGPU), of which the TESLA C2070 represents the topmost product of the Fermi generation. The programmer has to specify the configuration of the kernel at its launch; it's possible to specify the number of threads per block and the number of blocks per grid, so the distribution is uniform and it isn't possible to have blocks with a certain number of threads and other blocks with another number: all the blocks have to be equal. Since the host is the only one which can control


the kernel parameters, the only way to change the kernel configuration is to exit from the GPU execution, go back to the system and relaunch the same kernel, after changing the configuration. The blocks are ordered along three dimensions, and so are the threads; then, at most there are six index dimensions (3 for the grid and 3 for the blocks). The base dimension is along the 'x' axis, so, if some dimension isn't specified (the valid order is x->y->z), it's automatically equal to one. In order to select the blocks and the threads within the kernel, some auto-instantiated variables are used; these are created by the CUDA driver at launch time and are:

• threadIdx.{x,y,z}
• blockIdx.{x,y,z}
• blockDim.{x,y,z}
• gridDim.{x,y,z}

Figure 22 shows a conceptual example of how the kernel division in blocks and threads happens and how the memory (Shared, in this case) access happens as well. In order to benefit from the Shared Memory, it has to be explicitly declared by the programmer inside the kernel, through the keyword "__shared__", and the kernel data structure has to be decomposed in such a way as to fit inside the Shared Memory. Last, but not least, the data has to be accessed frequently in read mode, otherwise the overhead necessary to move the data from the Global Memory to the Shared Memory kills the overall performance. If this happens, it's better not to use the Shared Memory at all. So, basically, it's always beneficial to let the memory access happen in the best conditions possible, in order not to introduce useless additional calculation time.

Figure 22 – The kernel division in blocks and threads at code level, including the memory (Shared) accesses


5.5 The New Kepler Architecture

At the end of March 2012, NVIDIA revealed the new Kepler architecture, capable, on paper, of an impressive performance jump with respect to the Fermi architecture. Even though the architecture shown refers to the high-end GPUs for videogames (core GK104) and not to scientific calculation (built, by the way, using 3 billion transistors), some key novelties will also be implemented on the new professional-end graphic cards. After this announcement, the core GK110 for the professional GPUs was presented during the GTC Conference in San Jose (CA) in May 2012.

5.5.1 The new Streaming Multiprocessor and the memory

The new Streaming Multiprocessor, called SMX, has 192 CUDA Cores (4X the 48 CUDA Cores of the Fermi 2.1 architecture) and it's a Quad Warp architecture, meaning that it can load four warps simultaneously, instead of the Dual Warp architecture of the Fermi generation. This big change has a precise reason: the Shader Clock has been purged and now everything works at the Core Clock speed. Actually, in the Fermi architecture, the warp got divided in two slices, called half-warps, and each one was executed in a very fast chain, one by one, at the speed of the Shader Clock. Instead, now, the whole warp gets executed by 32 effective CUDA Cores and not by 16 CUDA Cores in two cycles (this represents a first 2X). This also means, however, having 2X the available CUDA Cores, considering the two cycles. However, the pure performance of a single SMX should be more than 2X, because of the higher frequency of the Core Clock with respect to the Shader Clock (even though this depends on the particular GPU). Last, but not least, having one clock fewer to distribute simplifies the design of the architecture. In addition to that, all the hardware inside the SMX has been doubled with respect to the SM, with particular attention to the register bank and the Load/Store units, doubling the performance (this represents another 2X). So, in the end, a total 4X improvement is achieved by increasing the number of available CUDA Cores by 4X in a single SMX. The GK104 chip is composed of 4 GPCs and each one contains 2 SMXs, so 8 independent SMXs in total.


From the memory point of view, another big step has been made, because its clock has been pushed to 6GHz, thanks also to a controller revision, considering that the GDDR5 memory can reach 7GHz by itself. The architecture has 4 controllers in total, and this means that each GPC can work independently on the memory, without generating any conflict. Finally, the bus towards the system becomes PCI Express 3.0 x16, doubling the performance with respect to the 2.0 generation.

5.5.2 The scheduling and the energy efficiency

Another big step has been made in terms of scheduling, even though it actually represents a step back. In particular, the complex hardware scheduler of the Fermi generation, put in place to avoid errors due to data dependency when the execution was performed out of order, has been purged and the old static scheduler has been restored. So, basically, the scheduling is performed upfront by the compiler. This scheduler is less complex, but it's sufficient and more efficient than the older one, since the latency of each mathematical instruction is fixed and known a priori, so the compiler can reorder the instructions properly to obtain the maximum achievable performance, maintaining at the same time an in-order execution. Then, by reducing the scheduler complexity, the total power has been cut down. However, the biggest contributor to reducing the dissipated power has been the elimination of the Shader Clock, previously mentioned. In fact, by eliminating its whole clock tree, a lot of Std Cells have been deleted, reducing also the amount of area, eventually replaced with other features.

5.6 The GPU parallelization of the Device Model Evaluation

In this chapter, I'm going to focus on the Device Model Evaluation phase of a circuit simulation, especially in the Transient Analysis flow, because it represents a very good candidate to be efficiently parallelized and also because it represents about 40%-70% of the entire Transient Analysis time, depending on the circuit size and complexity. In fact, parallelizing the Device Model Evaluation per se is conceptually quite trivial, since every device instance of a certain model is independent, so the same code is actually called multiple times on different input data. The main problem of this phase is that every device writes its contribution to the circuit sparse matrix and RHS vector, and there is only one of those for the entire circuit, so there is an evident bottleneck at this point.


Another important aspect of the parallelization is the target machine and its architecture. In this case, GPUs are massively parallel systems, which are capable of working efficiently only if they are managing a contiguous memory space (arrays) and if the software avoids control sections ('if' statements). So they are suited purely for computation. Knowing how the hardware works is key to achieving the best possible performance. Therefore, following these concepts, a lot of changes have been needed to achieve the goal. Let's see them one by one.

5.6.1 From Linked Lists to Structures of Arrays

In NGSPICE, the code portion which controls the Model Evaluation has not been changed since SPICE3F5, so the code uses ancient paradigms, like a sort of object oriented programming in pure C style, implicit structure bindings based upon the members' length, static memory allocation, explicit casts and extern declarations. Device models use the so-called SPICE3 Kit interface, which is composed of several routines:

• DEVload – it contains the model equations and it's called at every Newton-Raphson iteration
• DEVsetup – it sets up the model space, including memory allocation and initialization with a user or a default value
• DEVtemp – it calculates everything which depends on the temperature
• DEVinit – it contains the function pointers to each DEV routine and it's the only thing the simulator knows
• DEVparam – it grabs a parameter for an instance from outside and stores it inside the instance data structure
• DEVmodelParam – it's the same as above but for the model
• DEVask – it outputs the value of an instance parameter
• DEVmodelAsk – it's the same as above but for the model
• DEVgetIc – it sets up the initial conditions which have been specified in the netlist

• DEVconvTest – it contains the convergence tests, as described in the KCL chapter

In order to parallelize the Model Evaluation, not only DEVload has to be changed, but also DEVsetup and DEVtemp. This change is known as the AoS to SoA paradigm, which means transitioning from Arrays of Structures to Structures of Arrays; in this particular case it's


even worse, because the source data structure is a linked list. Therefore, all the instance private parameters have been turned into vector-based structures, stored on the CPU and then copied to the GPU. This is required, because DEVload will read and write through them. Each vector is as long as the number of instances of the model declared in the circuit, so that each location belongs to a specific instance. Regarding DEVsetup, only memory allocation on both CPU and GPU is performed, along with some supporting information. Below is an example of what is needed at this stage:

/* cudaMalloc MACRO to check it for errors --> CUDAMALLOCCHECK(name of pointer, dimension, type, status) */ #define CUDAMALLOCCHECK(a, b, c, d) \ if (d != cudaSuccess) \ { \ fprintf (stderr, "cuCKTsetup routine...\n") ; \ fprintf (stderr, "Error: cudaMalloc failed on %s size of %d bytes\n", #a, (int)(b * sizeof(c))) ; \ fprintf (stderr, "Error: %s = %d, %s\n", #d, d, cudaGetErrorString (d)) ; \ return (E_NOMEM) ; \ } Code 7 – A macro to check if ‘cudaMalloc’ has been executed successfully

Code 7 shows a macro used to check whether the memory allocation on the GPU has been executed successfully. Its usage is shown in other pieces of code below.
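The linked-list-to-array (AoS-to-SoA) conversion described above can be sketched as follows, with hypothetical structure and field names standing in for the real BSIM4 ones:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical device instance as stored in a SPICE3-style linked list */
typedef struct Inst {
    double gbs ;           /* one private parameter, standing in for BSIM4gbs */
    struct Inst *next ;
} Inst ;

/* SoA form: one contiguous array per parameter, one slot per instance */
typedef struct {
    size_t n ;
    double *gbsArray ;
} InstSoA ;

/* Walk the list once to count the instances, then gather the parameter
 * into a contiguous array that can be handed to cudaMemcpy in one shot. */
InstSoA aos_to_soa (const Inst *head)
{
    InstSoA soa = { 0, NULL } ;
    for (const Inst *p = head ; p != NULL ; p = p->next)
        soa.n++ ;
    soa.gbsArray = (double *) malloc (soa.n * sizeof(double)) ;
    size_t i = 0 ;
    for (const Inst *p = head ; p != NULL ; p = p->next)
        soa.gbsArray [i++] = p->gbs ;
    return soa ;
}
```

In CUSPICE this gathering is repeated for every instance parameter, which is why DEVsetup and DEVtemp had to be reworked alongside DEVload.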


/* cudaMemcpy MACRO to check it for errors --> CUDAMEMCPYCHECK(name of pointer, dimension, type, status) */ #define CUDAMEMCPYCHECK(a, b, c, d) \ if (d != cudaSuccess) \ { \ fprintf (stderr, "cuCKTsetup routine...\n") ; \ fprintf (stderr, "Error: cudaMemcpy failed on %s size of %d bytes\n", #a, (int)(b * sizeof(c))) ; \ fprintf (stderr, "Error: %s = %d, %s\n", #d, d, cudaGetErrorString (d)) ; \ return (E_NOMEM) ; \ } Code 8 – A macro to check if ‘cudaMemCpy’ has been executed successfully

Similarly to the above macro, Code 8 shows another macro used to check whether the copy between the CPU memory and the GPU Global Memory has been executed successfully.

model->BSIM4paramCPU.BSIM4gbsRWArray = (double *) malloc (size * sizeof(double)) ;
status = cudaMalloc ((void **)&(model->BSIM4paramGPU.d_BSIM4gbsRWArray), size * sizeof(double)) ;
CUDAMALLOCCHECK (model->BSIM4paramGPU.d_BSIM4gbsRWArray, size, double, status)

Code 9 – An example of CPU and GPU array allocation for a certain BSIM4 instance parameter

Code 9 shows an example of CPU array allocation, needed to host the conversion from the linked-list approach to the array one on the CPU side, together with its counterpart allocation on the GPU side, so that there is one couple of allocations for each instance parameter. A check on the GPU allocation is also performed, using one of the macros shown above.
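As a host-side illustration of this linked-list-to-array conversion, the following minimal sketch (with a hypothetical, simplified instance record and field names, not the actual NGSPICE structures) flattens one private parameter of all the instances into a contiguous vector, ready to be copied to the GPU in a single ‘cudaMemcpy’:

```c
#include <stdlib.h>

/* Hypothetical, simplified record standing in for a BSIM4 instance. */
typedef struct Instance {
    double gbs;                /* an example private parameter */
    struct Instance *next;     /* linked-list chaining, as in NGSPICE */
} Instance;

/* Flatten one parameter of the whole instance list into a contiguous
 * array: location i belongs to instance i, so a later cudaMemcpy can
 * move every instance's value to the GPU in one shot. */
double *flatten_gbs(const Instance *head, size_t n_instances)
{
    double *v = malloc(n_instances * sizeof(double));
    size_t i = 0;

    for (const Instance *here = head; here != NULL && i < n_instances; here = here->next)
        v[i++] = here->gbs;

    return v;
}
```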


Regarding DEVtemp, the copy of all the parameters is performed at this stage, as illustrated in Code 10; it makes no sense to copy them earlier, since some are still modified during this step:

status = cudaMemcpy (model->BSIM4paramGPU.d_BSIM4gbsRWArray, model->BSIM4paramCPU.BSIM4gbsRWArray, size * sizeof(double), cudaMemcpyHostToDevice) ;
CUDAMEMCPYCHECK (model->BSIM4paramGPU.d_BSIM4gbsRWArray, size, double, status)

Code 10 – An example of memory copy between CPU and GPU for a certain BSIM4 instance parameter

model->pParamHost = (struct bsim4SizeDependParam **) malloc (size * sizeof(struct bsim4SizeDependParam *)) ;
status = cudaMalloc ((void **)&(model->d_pParam), size * sizeof(struct bsim4SizeDependParam *)) ;
CUDAMALLOCCHECK (model->d_pParam, size, struct bsim4SizeDependParam *, status)

i = 0 ;
for (here = model->BSIM4instances ; here != NULL ; here = here->BSIM4nextInstance)
{
    if (here->pParam != NULL)
    {
        status = cudaMalloc ((void **)&(model->pParamHost [i]), sizeof(struct bsim4SizeDependParam)) ;
        CUDAMALLOCCHECK (model->pParamHost [i], 1, struct bsim4SizeDependParam, status)

        status = cudaMemcpy (model->pParamHost [i], here->pParam, sizeof(struct bsim4SizeDependParam), cudaMemcpyHostToDevice) ;
        CUDAMEMCPYCHECK (model->pParamHost [i], 1, struct bsim4SizeDependParam, status)
    }
    else
    {
        model->pParamHost [i] = NULL ;
    }

    i++ ;
}


status = cudaMemcpy (model->d_pParam, model->pParamHost, size * sizeof(struct bsim4SizeDependParam *), cudaMemcpyHostToDevice) ;
CUDAMEMCPYCHECK (model->d_pParam, size, struct bsim4SizeDependParam *, status)

Code 11 – Handling of the special ‘pParam’ data structure inside the BSIM4 model

Code 11 shows the treatment needed to handle the special ‘pParam’ data structure, which is part of the BSIM4 model. The code shown is suboptimal, but it is really straightforward to implement. Basically, each instance of the model may own this private data structure, but this is decided inside the DEVtemp routine, so a vector of structure pointers is allocated on both the CPU and GPU sides to handle this eventuality. Then, looping over each instance of the model, if the ‘pParam’ data structure was allocated by DEVtemp, a structure is allocated on the GPU at the i-th location of the vector and the CPU structure is copied into it; otherwise, the i-th location of the vector is set to NULL. Now, since this is a vector of structures, it requires one allocation per structure plus an allocation for the vector which hosts their pointers; however, ‘cudaMalloc’ returns the device pointer to the host system, not to the GPU, so with only the allocations mentioned so far the GPU does not know the addresses of the structures in its global memory. Therefore, in order to finish the job, one more copy has to be performed: the vector hosting the structures’ pointers is copied from the CPU side to the GPU side, so that the GPU can locate those structures. This suboptimal implementation has a benefit, though: it preserves memory coalescing, because the indices of the vector which hosts the structures’ pointers stay aligned to the CUDA thread indices, as happens in the other parts of the code.

In DEVload, some initial preparation is needed prior to launching the CUDA kernel. In particular, the correct numbers of threads and blocks have to be calculated. In this case, the thread space has been statically divided into a 1x256 matrix, as shown in Code 12:

thread_x = 1 ;
thread_y = 256 ;
dim3 thread (thread_x, thread_y) ;

Code 12 – Thread space used to launch a BSIM4 model CUDA kernel – Static


Instead, the block space is calculated on demand, using the following formula:

if (model->n_instances % thread_y != 0)
    block_x = (int)((model->n_instances + thread_y - 1) / thread_y) ;
else
    block_x = model->n_instances / thread_y ;

Code 13 – Block space used to launch a BSIM4 model CUDA kernel – Calculated on demand

Basically, Code 13 shows that the blocks lie along a single dimension and their number equals the number of instances divided by the number of threads in the ‘y’ dimension, rounded up. Given this layout, a slightly more involved thread distribution has to be performed inside the CUDA kernel:

instance_ID = threadIdx.y + blockDim.y * blockIdx.x ;

if (instance_ID < n_instances)
{
    if (threadIdx.x == 0)
    {
        // Device Model Code
    }
}

Code 14 – Thread distribution inside the BSIM4 model CUDA kernel

In Code 14, ‘instance_ID’ can exceed the total number of instances, so an ‘if’ statement is mandatory to ensure that each CUDA core works on proper, existing data. The nested ‘if’ statement is optional, since there is only one thread along the ‘x’ dimension.


In the DEVload of the BSIM4 model, CUDA Streams have also been exploited, to overlap the execution of the NMOS and PMOS sub-models:

cudaStream_t stream [2] ;

for (i = 0 ; i < 2 ; i++)
    cudaStreamCreate (&(stream [i])) ;

i = 0 ;
for ( ; model != NULL ; model = model->BSIM4nextModel)
{
    cuBSIM4load_kernel <<< block_x, thread, 0, stream [i] >>> (…) ;
    i++ ;
}

cudaDeviceSynchronize () ;

for (i = 0 ; i < 2 ; i++)
    cudaStreamDestroy (stream [i]) ;

Code 15 – An example of CUDA Streams usage inside the BSIM4 model to handle NMOS and PMOS in parallel

In Code 15, two CUDA Streams are created to host the two BSIM4 sub-models, as mentioned above; then a loop over those sub-models launches the CUDA kernels on the Streams asynchronously. After that, a synchronization barrier is needed before deleting the Streams at the end, otherwise the execution could be interrupted by the deletion. This part could probably be enhanced by moving the synchronization barrier and the CUDA Streams creation and deletion outside the BSIM4 model, where they are actually needed, because right now these operations are redundantly repeated. The synchronization barrier could become a big bottleneck, because it can expose the computation time spent in simpler models, which would otherwise be hidden by the BSIM4 model execution.

The rest of the kernel code is pretty much the original model code, with some modifications needed to let it run on the GPU. All the scalar data needed only for reading are passed by value, so the corresponding GPU copies are not needed; this yields a considerable space saving. Obviously, the first modification consists in replacing all the input and output data with the GPU ones. In doing this, another smart idea has been used: the location inside each output vector corresponding to each instance is not assigned directly, but is retrieved through a Position Vector,


so that the position can be changed at run-time, for example by a heuristic, if there is some benefit. By default, the original order of the linked list is used. This concept is discussed in more detail in the next paragraph. Since the conversion of DEVload to CUDA has been performed manually, nothing in the core algorithm has been changed, only the data structures. This means that some ‘if’ statements could lead to thread divergence; if so, an automatic conversion of the model could schedule those portions of code better and avoid or minimize the divergence phenomenon. Since DEVload uses additional routines, these also have to be translated properly for CUDA. In particular, there are the Limitation Routines and the Integration Method. The Limitation Routines have been ported as-is to the GPU side, declaring them as:

- extern “C”, for C-style linkage, since NGSPICE is an ANSI C program
- __device__, to mark a routine which is called by another GPU routine
- static, to make sure the routine is visible only inside the current file

The Integration Method has been ported in the same way, but it has also been included directly inside the file that hosts the GPU routine which needs it: a sort of inline.

#include "ngspice/CUSPICE/cuniinteg.cuh"

Code 16 – Example of CUDA code integration from another file, treated as a sort of inline

5.6.2 The Topology Matrix Method

As mentioned before, parallelizing the model itself is conceptually quite easy: it is just a matter of properly changing the data structures. The big issue is the final Circuit Matrix and RHS update. In the original single-core code, each model knows the locations where it has to write its contributions, as pointer offsets into the Circuit Matrix and RHS, so the update is performed in this form:

*(pointer + offset) += value ;

Code 17 – An example of single core code for Circuit Matrix and RHS Update


This is fine for a serial execution, but it is not suitable for a parallel one, because contributions targeting the same location may incur a Read After Write hazard; atomic operations would have to be used, but they are really slow and still serial. The other issue is reproducibility: we must ensure that the operations are performed in the same order every run, which atomics do not guarantee, while at the same time maintaining the same code style. So, an idea is needed to overcome this bottleneck; I have called this new technique the Topology Matrix Method. Basically, the above operation can be seen as:

*(pointer + offset) = *(pointer + offset) + 1 * value ;

Code 18 – Code 17 rewritten in the form of constant-constant multiplication

When multiple values target the same location, a vector-vector multiplication can be used:

$$*(pointer + offset) = *(pointer + offset) + \begin{pmatrix} 1 & 1 & -1 \end{pmatrix} \cdot \begin{pmatrix} value_1 \\ value_2 \\ value_3 \end{pmatrix}$$

Code 19 – Code 18 augmented in the case of multiple values on the same location – Vector-vector multiplication

This kind of representation can be efficiently parallelized. Moreover, the implementation does not have to be coded from scratch, since it is possible to leverage existing libraries; in particular, CUSPARSE from NVIDIA does the job. That said, this covers only a single location, but we need to cover the entire Circuit Matrix and RHS, so Code 19 can be rewritten in this way:

$$\begin{pmatrix} matrix_1 \\ matrix_2 \end{pmatrix} = \begin{pmatrix} matrix_1 \\ matrix_2 \end{pmatrix} + \begin{pmatrix} 1 & 1 & -1 \\ -1 & 1 & -1 \end{pmatrix} \cdot \begin{pmatrix} value_1 \\ value_2 \\ value_3 \end{pmatrix}$$

Equation 40 – Code or equation to cover the almost general case as matrix-vector multiplication, where the pointer notation has been translated into a vector.


To be as general as possible, let us consider the final form:

$$\begin{pmatrix} matrix_1 \\ matrix_2 \end{pmatrix} = \begin{pmatrix} matrix_1 \\ matrix_2 \end{pmatrix} + \begin{pmatrix} 1 & 1 & 0 & 1 \\ -1 & 1 & -1 & 0 \end{pmatrix} \cdot \begin{pmatrix} value_1 \\ value_2 \\ value_3 \\ value_4 \end{pmatrix}$$

Equation 41 – Code or equation to cover the real general case, enhancing Equation 40 so that a row may or may not depend on a value.

The compact form to describe this is the one shown in Equation 42:

A += T · V

Equation 42 – Compact notation to express the final general case of Equation 41, where ‘A’ is the Circuit Matrix, ‘T’ is the Topology Matrix and ‘V’ is the value vector.

Since the same thing has to be replicated for the RHS, two Topology Matrices actually exist. One clear property of this notation is that the rows’ order in the Topology Matrix is determined by A, so it depends on the circuit and we can assume it is fixed, because it depends on external factors; the columns’ order, instead, depends on the value vector order, so it can be scrambled freely to facilitate the matrix-vector multiplication as much as possible, by changing the sparsity pattern. That is why a Position Vector has been used to retrieve the indices inside the DEVload and Topology Matrix codes, as previously introduced.
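As a plain serial reference of the update A += T · V (purely illustrative: in CUSPICE the product is delegated to CUSPARSE and the arrays live in GPU memory), with the Topology Matrix stored in COO format, mirroring the ‘CKTtopologyMatrixCOOi/j/x’ arrays:

```c
/* Illustrative serial sketch of the Topology Matrix update A += T * V.
 * T is stored in COO format: for each non-zero k, cooi[k] is the row
 * (a Circuit Matrix or RHS location), cooj[k] the column (an index into
 * the value vector produced by the model evaluation) and coox[k] the
 * +1/-1 coefficient. The GPU version performs the same sparse
 * matrix-vector product with a CUSPARSE routine, with no atomics. */
static void topology_update(double *A, const int *cooi, const int *cooj,
                            const double *coox, int nnz, const double *V)
{
    for (int k = 0; k < nnz; k++)
        A[cooi[k]] += coox[k] * V[cooj[k]];
}
```

With the 2x3 Topology Matrix of Equation 40 this reproduces exactly the two pointer updates of the serial code, but expressed as one data-parallel product.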

/* Space Allocation to GPU */
status = cudaMalloc ((void **)&(model->d_PositionVector), size * sizeof(int)) ;
CUDAMALLOCCHECK (model->d_PositionVector, size, int, status)

status = cudaMemcpy (model->d_PositionVector, model->PositionVector, size * sizeof(int), cudaMemcpyHostToDevice) ;
CUDAMEMCPYCHECK (model->d_PositionVector, size, int, status)

status = cudaMalloc ((void **)&(model->d_PositionVectorRHS), size * sizeof(int)) ;
CUDAMALLOCCHECK (model->d_PositionVectorRHS, size, int, status)

status = cudaMemcpy (model->d_PositionVectorRHS, model->PositionVectorRHS, size * sizeof(int), cudaMemcpyHostToDevice) ;
CUDAMEMCPYCHECK (model->d_PositionVectorRHS, size, int, status)

Code 20 – Allocation and memory copy for the Position Vector of the Sparse Matrix and RHS


Code 20 illustrates the memory allocation and copy of the Position Vectors of the BSIM4 model, for both the Sparse Matrix and the RHS.

#define TopologyMatrixInsert(Ptr, instance_ID, offset, Value, global_ID) \
    ckt->CKTtopologyMatrixCOOi [global_ID] = (int)(here->Ptr - basePtr) ; \
    ckt->CKTtopologyMatrixCOOj [global_ID] = model->PositionVector [instance_ID] + offset ; \
    ckt->CKTtopologyMatrixCOOx [global_ID] = Value ;

Code 21 – Macro to insert values in the Topology Matrix (for the Circuit Matrix) correctly, using the Position Vector

#define TopologyMatrixInsertRHS(offset, instance_ID, offsetRHS, Value, global_ID) \
    ckt->CKTtopologyMatrixCOOiRHS [global_ID] = here->offset ; \
    ckt->CKTtopologyMatrixCOOjRHS [global_ID] = model->PositionVectorRHS [instance_ID] + offsetRHS ; \
    ckt->CKTtopologyMatrixCOOxRHS [global_ID] = Value ;

Code 22 – Macro to insert values in the Topology Matrix (for the RHS) correctly, using another Position Vector

/* m * geltd */
if ((here->BSIM4gNodeExt != 0) && (here->BSIM4gNodeExt != 0))
{
    TopologyMatrixInsert (BSIM4GEgePtr, k, 0, 1, *i) ;
    (*i)++ ;
}

Code 23 – An example of Topology Matrix element insertion for the Circuit Matrix, leveraging the Position Vector


/* m * ceqgcrg */
if (here->BSIM4gNodeExt != 0)
{
    TopologyMatrixInsertRHS (BSIM4gNodeExt, k, total_offsetRHS + 0, -1, *j) ;
    (*j)++ ;
}

Code 24 – An example of Topology Matrix element insertion for the RHS, leveraging the other Position Vector

Code 21, Code 22, Code 23 and Code 24 are examples of Position Vector usage inside the macros which insert values into the Topology Matrices, for both the Circuit Matrix and the RHS.

pos = d_PositionVector [instance_ID] ;
total_offset = 0 ;

if (BSIM4entry.d_BSIM4rgateModArray [instance_ID] == 1)
{
    d_CKTloadOutput [pos + total_offset + 0] = m * geltd ;
    d_CKTloadOutput [pos + total_offset + 1] = m * (gcggb + geltd - ggtg + gIgtotg) ;
    d_CKTloadOutput [pos + total_offset + 2] = m * (gcgdb - ggtd + gIgtotd) ;
    d_CKTloadOutput [pos + total_offset + 3] = m * (gcgsb - ggts + gIgtots) ;
    d_CKTloadOutput [pos + total_offset + 4] = m * (gcgbb - ggtb + gIgtotb) ;
    total_offset += 5 ;
}
else if (BSIM4entry.d_BSIM4rgateModArray [instance_ID] == 2)
{
    d_CKTloadOutput [pos + total_offset + 0] = m * gcrg ;
    d_CKTloadOutput [pos + total_offset + 1] = m * gcrgg ;
    ...
}

Code 25 – Example of Position Vector usage for the Circuit Matrix inside the BSIM4 model

Code 25 illustrates the usage of the Position Vector while outputting the values for the Circuit Matrix. In particular, each instance has some predefined locations inside the ‘d_CKTloadOutput’ vector, starting from the one dictated by the Position Vector. Since some values exist only inside ‘if’ statements, it is mandatory to account for them properly; for this reason,


‘total_offset’ expresses the offset inside a group of data, which depends on the model configuration specified by external parameters.

5.6.3 Other code refinements to improve performance

The main goal of moving to the GPU was speeding up the Transient Analysis. Since this involves the Newton-Raphson cycle, DEVload has been modified as part of the Model Evaluation, but there are other pieces of code which could be enhanced. First of all, the Linear System Solution could be ported to the GPU, but this is a difficult problem, because there is not enough parallelism to exploit, and the available amount depends on the particular circuit under simulation. Still, some things can be done, like porting to the GPU only the so-called re-factorization (the factorization which reuses the already calculated pivoting), since this task performs only calculations and no decisions, or using a hybrid method, among other options; however, this is not the topic of this work. Some possibilities are available in the Transient Analysis itself and could be explored, including:

• Transferring to the GPU memory everything which could be calculated directly on the GPU, including the State Vectors, various circular buffers and other structures
• Moving the Local Truncation Error (LTE) calculation onto the GPU
• Minimizing memory transfers between GPU and CPU

There is not much more to say about these refinements, since they are just a matter of good coding, except for the Local Truncation Error, which is the error coming from the Integration Method of each device instance. From the computational point of view, the LTE step calculates the next time step as the minimum over the time steps of all device instances, which are independent from each other. So a proper logarithmic reduction speeds up this phase considerably, and the GPU is the right hardware for it. A simple conceptual and abstract example is given in Figure 23, where the reduction can be applied in parallel within each block, plus a final reduction across all the blocks, as illustrated in Figure 24.
Moreover, by doing this, no memory transfer is needed, except for the final calculated time step.
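The logarithmic reduction described above can be sketched, on the host side, as a pairwise minimum over the per-instance candidate time steps (an illustrative sketch, not the CUSPICE kernel: on the GPU each block performs these passes in shared memory and a final kernel reduces the per-block results):

```c
#include <stddef.h>

/* Illustrative sketch: logarithmic min-reduction over the candidate time
 * steps of all device instances. Each pass halves the number of active
 * elements, so log2(n) passes replace n-1 serial comparisons; on the GPU
 * the iterations of the inner loop run in parallel, one per thread. */
static double min_timestep_reduce(double *step, size_t n)
{
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            if (step[i + stride] < step[i])
                step[i] = step[i + stride];

    return step[0];  /* the global minimum ends up in the first slot */
}
```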


Figure 23 – Abstract example of a reduction tree

Figure 24 – Parallel Reduction for blocks plus a final reduction for all the blocks

5.7 Results

Results can be divided into two milestones:

1) Model Evaluation only, prior to introducing the Topology Matrix Method
2) Final results, considering the entire simulation time or excluding the Linear System Solution

Regarding the first milestone, the results cover the Model Evaluation only, without the back-and-forth copy of all the parameters. Figure 25 shows the speedup considering only the Model Evaluation of simple models (Resistors, Capacitors and Inductors), which reaches about 35X, while Figure 26 shows the speedup of the Model Evaluation of the BSIM4 models, up to 6.67X on a Fermi generation card. One thing to note is that the simple models are so fast compared to the BSIM4 model that the latter’s evaluation is dominant. In fact, the execution of the simple models can be completely hidden simply by feeding the GPU a sufficiently large amount of data to work on.


Figure 25 – Model Evaluation of a ladder network composed of simple models: speedup (up to about 35X) vs. number of model instances (8192–131072), for a Resistor Netlist (all resistors), a Capacitor Netlist (half capacitors and half resistors) and an Inductor Netlist (half resistors, a quarter capacitors and a quarter inductors). Hardware: NVIDIA C2070 (ECC on), Intel X5690 (6 cores) @ 3.47 GHz.

Figure 26 – Model Evaluation of the BSIM4v7 model on the ISCAS85 Benchmark Suite circuits: time (ms), CPU (1 core) vs. GPU, with a speedup of up to 6.67X.


Considering also the Topology Matrix Method and all the other further improvements, CUSPICE reaches a speedup of about 9X in pure Transient Analysis, without taking into account the Linear System Solution, on a Kepler generation GPU, which is an about three-year-old card, so the speedup could be even larger nowadays. This result is explained in the next figures; in particular, Figure 27 shows the total simulation time of CUSPICE, excluding only the initial parsing and circuit creation time, because they are serial and cannot be made parallel on a GPU, since they represent purely decisional code, almost without calculations. This metric represents the real benefit of the GPU parallelization on the entire simulation time from the user’s perspective.

Figure 27 – Total Simulation Time – Overall (without parsing and circuit creation time), on the ISCAS85 Benchmark Suite: time (s, logarithmic scale), CPU vs. GPU (C2070 and K20), with speedups of 3.13X and 3.74X. Hardware: NVIDIA C2070 and K20 @ 600 MHz, Intel X5690 (6 cores) @ 3.47 GHz.

On the other hand, Figure 28 and Figure 29 show the total simulation time of CUSPICE without the Linear System Solution and the Truncation Error part. These results are good metrics to demonstrate the quality of the Transient Analysis parallelization because, considering the Fermi card, for which the results of the pure Model Evaluation are also available, the theoretical speedup of the simulation cannot exceed the Model Evaluation speedup. Since the latter is 6.67X and the former is 5.60X, it is possible to say that only 16% of


the theoretical maximum speedup is lost somewhere and is not parallelized, as expressed in Equation 43:

(6.67 − 5.60) / 6.67 ≈ 0.16 = 16%

Equation 43 – Simple formula to calculate the residual amount of available parallelism in CUSPICE

However, since most of this loss is decisional and not computational, it is also possible to state that almost all of this residual contribution is not parallelizable, at least on a GPU. So, a possible future work could be diving into the BSIM4 model to understand how to speed it up further, without touching the mathematical computations.

Figure 28 – Total Simulation Time without Linear System Solution and Truncation Error (OP + Transient), on the ISCAS85 Benchmark Suite: time (s, logarithmic scale), CPU vs. GPU (C2070 and K20), with speedups of 5.29X and 7.83X. Hardware: NVIDIA C2070 and K20 @ 600 MHz, Intel X5690 (6 cores) @ 3.47 GHz.


Figure 29 – Total Simulation Time without Linear System Solution and Truncation Error in pure Transient Analysis, on the ISCAS85 Benchmark Suite: time (s, logarithmic scale), CPU vs. GPU (C2070 and K20), with speedups of 5.60X and 8.89X. Hardware: NVIDIA C2070 and K20 @ 600 MHz, Intel X5690 (6 cores) @ 3.47 GHz.

Last but not least, Figure 30 shows the amount of time spent in the Linear System Solution and Truncation Error versus the Model Evaluation. Since, on average, 80% of the total time is spent in the Model Evaluation when it is executed on a single core, the GPU parallelization of the Model Evaluation phase is really key to reducing the total simulation time. The other relevant fact is that, after the Model Evaluation has been parallelized, the Linear System Solution and the Truncation Error have become the new bottleneck to be worked on.

Figure 30 – Average share of time spent in Model Evaluation vs. Linear System Solution and Truncation Error (DC + Transient), for CPU (1 core), C2070 and K20. Hardware: NVIDIA C2070 and K20 @ 600 MHz, Intel X5690 (6 cores) @ 3.47 GHz.


Finally, a few words about this new bottleneck: the Truncation Error has been ported to the GPU after these results, but it decreased the total simulation time only by about another 5%, so the Linear System Solution is really the way to go at this point. However, this part is extremely difficult to parallelize, because the available parallelism depends on the circuit, as mentioned before. In addition, a bigger circuit does not necessarily mean more available parallelism, because it depends on how many components are connected together in the expanded circuit (after the models’ substitution).


Chapter 6 - Reliability Analysis

The problem of device and circuit aging has been one of the key research topics for some years, ever since the scaling of transistor dimensions let reliability issues emerge, due to more and more defects during the fabrication process. Nowadays, the scaling is so aggressive that transistor dimensions are really tiny, even leading to changes of materials to overcome these increasingly important aging effects. Certainly, circuit lifetime is a function of the single device lifetime, which is in turn a function of process, temperature, voltage and other physical phenomena, which will be considered only at the single device level. The most important ones are:

• NBTI/PBTI (Negative/Positive Bias Temperature Instability), for PMOS and NMOS respectively
• HCI (Hot Carrier Injection)

Negative bias temperature instability has been known since 1966, but only during the last few years has it become a reliability issue in silicon integrated circuits, because the gate fields have increased as a result of scaling, thereby increasing the chip operating temperature. NBTI is an increase in the absolute threshold voltage and a degradation of the mobility, drain current, and transconductance of PMOS transistors. It is almost universally attributed to the creation of interface traps and oxide charge by a negative gate bias at elevated temperature. The oxide electric field is usually, but not always, lower than that leading to hot carrier degradation. The oxide electric field and temperature are similar to those typically encountered during burn-in, and sometimes encountered during high-performance chip operation. However, the details of how NBTI occurs are not entirely clear. For this reason, each researcher creates his own model, considering the physical phenomena he wants to study and the particular constitutive material of the transistor.


In addition to that, there are two different kinds of models for NBTI:

1) Reaction-Diffusion
2) Trapping/De-Trapping

The Reaction-Diffusion based one is considered the old methodology to describe the NBTI effect, so the more recent Trapping/De-Trapping theory is widely used nowadays. The aging model I have used is, in fact, based on the latter methodology. On the other hand, PBTI is considered a new effect, mainly present in the new kinds of transistors built with high-k oxides. Since it is a new effect, it is not yet well known and understood, so not many models are available. The HCI effect appears when a device is biased in strong inversion with a large VDS. Hot carriers with high kinetic energy collide with other atoms and carriers, generating secondary carriers through impact ionization. Some carriers can be injected into the gate dielectric, creating traps at the gate dielectric interface and in the dielectric bulk. The traps capture carriers, hence the device Vth increases. HCI is more prominent in NMOS devices, because electrons face a smaller potential barrier than holes at the gate oxide interface. All these effects are depicted in Figure 31 [7]. At the end of the day, whatever physical phenomenon ages the transistor, the result is always a degradation in performance and reliability, which is measured as a variation of the device’s absolute Threshold Voltage.


Figure 31 – Physical phenomena which affect device reliability


6.1 The Aging Model

In my PhD, I took into consideration only the NBTI phenomenon, because I used a new model for a new kind of devices with very thin oxide (about 2 nm), which has been developed and verified through TCAD simulations by the Integrated Device Characterization group, led by Prof. Irrera at the DIET department of Sapienza University. However, a TCAD model cannot fit into a circuit simulator, so a MATLAB version has been developed to create a Compact Model, which is basically a mathematical abstraction of the real model that expresses the physical phenomena. One interesting thing to note is that this model has no temperature dependency in the Trapping phase; this is possible because high-k dielectrics are not as sensitive to high temperatures as silicon oxide based ones. In fact, high-k dielectrics have more initial defects than the others, so there is no room to generate additional traps in this condition, as illustrated in Figure 32 [8].

Figure 32 – Sensitivity to high temperature of both high-k and silicon oxide based dielectrics during the Trapping phase


Anyway, since a temperature dependency can be found in the De-Trapping phase, I have updated the aging model, taking the base equations from the final work presented in [8]. Then I have worked in NGSPICE, modifying the fitting parameters and making the configurable ones available in the input netlist. Every aging model tries to predict the aging of a device over time, considering particular physical phenomena and the features of the device. The equations can obviously differ depending on the model, but the general behavior is always the same. In addition, it is possible to consider the effect near or far in time; for this reason, two different behaviors exist:

• Short Term
• Long Term

6.2 The Short Term Behavior

Short term models predict the behavior some nanoseconds into the future, although the exact definition is not always the same, since it also depends on the application. In this range of time, the transistor behavior is divided into two phases:

• Stress
• Recovery

The device is considered under stress when it is turned on, in both the saturation and triode regions, while it is considered under recovery when it is turned off. So, basically, these two phases depend on the device input signal when a simple inverter is considered, for instance; but when the considered device is connected in a more complex way, nothing can be easily derived. That is why aging circuit simulation is important: this way, no assumptions about the transistor connections have to be made, and the transistor status can be effectively measured on the circuit, enabling the aging analysis of complex designs. So, each transistor in the circuit is subjected to a certain stress-recovery cycle, depending on the main circuit input. The simulation consists in following this cycle and leveraging the aging model to compute the device aging for that amount of time, for both the stress and recovery phases, with two different laws.


The device aging calculated by the model is the so-called DeltaVth, which is the variation of the absolute Vth mentioned above. This value can then be used to run the real simulation on the aged circuit, forming a two-step flow composed of:

• A fresh analysis to compute the aging of the circuit
• A simulation of the aged circuit to extract the desired information, like Transient, OP, etc.

6.3 The Long Term Behavior

On the contrary, long term models predict the behavior of the transistor in the far future, generally on a scale of years: nowadays 2 years is considered, even though formerly it was 10 years; in any case, it depends on the application. In order to do so, the model itself is different and some approximations have to be used to predict such long-term behavior. In particular, considering the input signal as a clock, an upper-bound equation can be used, even though it is generally pessimistic. In this context, the clock period and the duty cycle are always fundamental parameters. However, since this assumption does not hold in general, the long term model is a very good first-order approximation for a single device, but it is not valid to calculate the aging of an entire circuit. That is why I have used the short term model inside NGSPICE to calculate the aging of the entire circuit. At this point, in order to extract the long term behavior, I have inserted a proper fitting to perform the extrapolation, considering a years-based target time. Another important difference between short term and long term models is that long term ones do not consider the stress-recovery cycle; time is only used as an asymptotic parameter.


6.4 The implementation in NGSPICE

Taking what has been said as a theoretical starting point, I had to implement it in NGSPICE. The modifications needed to put the Reliability Analysis in place, using the short term model, are the following:
• Create the aging model as a companion model for the transistor
• Create the new Reliability Analysis
• Modify the input language to accept the new syntax for the companion model and the Reliability Analysis
• Add hooks in the transistor model code to retrieve the voltages applied to the transistor, in order to determine whether it is turned on or off
• Use the companion model and this information to calculate the stress-recovery cycle and the final DeltaVth over a user-specified, reasonably short, amount of time
• Compute the extrapolation at a years-based time, when requested (generally this is the case)
• Launch another analysis in sequence on the aged circuit, as specified by the user
Creating the aging model as a companion model is crucial for the implementation, because it has to access the private data structure of the transistor model, in this case the BSIM4:

#include "../bsim4/bsim4def.h"

Code 26 – Include Statement in the Reliability Model to make it the companion of the BSIM4 model

The Reliability Analysis requires new input language for the analysis and the model; Code 27 shows an example:

*.model p1_ra relmodel level=1 t0=1e10 k1_2=81e53 e_01=0.335e9 ea=0.13 x1=1 x2=0.5 t_clk=1e-9 alfa=0.5
.model p1_ra relmodel level=1
.appendmodel p1_ra pmos [list of devices]
…
.relan 1n 10n
.tran 1n 10n

Code 27 – New Input Language to be able to specify the Reliability Analysis and Model


The ‘.model’ declaration instantiates the ‘relmodel’ model, which is the aging model, naming it ‘p1_ra’ (in this example) and using level 1 (present only for compatibility right now). The first version differs from the second in how parameters are passed: every model parameter can be overwritten from the netlist as a list of ‘parameter=value’ pairs, so all the parameters are overwritten in the first example, while no parameters are passed in the second one.

#ifdef RELAN
    if (!(strcmp (type_name, "relmodel"))) {
        type = INPtypelook ("RELMODEL") ;
        if (type < 0) {
            err = INPmkTemp ("Model type 'relmodel' not available in this binary\n") ;
        }
        …
#endif

Code 28 – Piece of code to handle the ‘relmodel’ model from the input netlist

Code 28 maps the ‘relmodel’ model from the input netlist to the RELMODEL model, the real model implemented inside NGSPICE. In this phase, only the model name and type name are retrieved from the ‘.model’ line; the type name is then converted into the type number (this code) and the rest of the line is passed to the routine which parses the particular model implementation, because only this routine knows exactly what the parameters mean. In NGSPICE this task is performed automatically: a table inside the model specifies what the parameters mean, and a generic routine parses the rest of the line according to the syntax shown in Code 29:

parameter_name1=value1 parameter_name2=value2 …

Code 29 – Syntax to pass parameters in the ‘.model’ line


The ‘relmodel’ model has the list of parameters shown in Code 30, where the first member is the parameter name, the second an identification number (defined in another file), the third a flag indicating the type, and the last a description:

IFparm RELMODELmPTable [] = { /* model parameters */
    IOP ("k_b",      RELMODEL_MOD_KB,       IF_REAL, "Boltzmann Constant"),
    IOP ("h_cut",    RELMODEL_MOD_HCUT,     IF_REAL, "h_cut"),
    IOP ("nts",      RELMODEL_MOD_NTS,      IF_REAL, "Nts"),
    IOP ("eps_hk",   RELMODEL_MOD_EPSHK,    IF_REAL, "Dielectric Constant High-K"),
    IOP ("eps_SiO2", RELMODEL_MOD_EPSSIO2,  IF_REAL, "Dielectric Constant SiO2"),
    IOP ("m_star",   RELMODEL_MOD_MSTAR,    IF_REAL, "m_star"),
    IOP ("w",        RELMODEL_MOD_W,        IF_REAL, "W"),
    IOP ("tau_0",    RELMODEL_MOD_TAU0,     IF_REAL, "tau_0"),
    IOP ("beta",     RELMODEL_MOD_BETA,     IF_REAL, "beta"),
    IOP ("tau_e",    RELMODEL_MOD_TAUE,     IF_REAL, "tau_e"),
    IOP ("beta1",    RELMODEL_MOD_BETA1,    IF_REAL, "beta1"),
    IP  ("relmodel", RELMODEL_MOD_RELMODEL, IF_FLAG, "Flag to indicate RELMODEL")
} ;

Code 30 – Model structure to host the parameters definition

The ‘.appendmodel’ declaration lets the aging model become the companion model of another one. In this example, the ‘p1_ra’ model becomes the companion model of the ‘pmos’ model, which is the transistor model name (not shown here). The order is not important, since the evaluation follows how the companion model concept works. As a last remark, the declaration name has been taken from the HSPICE manual. This declaration is handled in the Third Pass of the parser, as shown in Code 31, which calls another routine, not reported here because it is too long:

} else if (strcmp (token, ".appendmodel") == 0) {
    dot_appendmodel (ckt, tab, current) ;
    goto quit ;
}
…

Code 31 – Piece of code to handle the ‘.appendmodel’ line


I decided to provide more flexibility in the user interface by allowing the user to select only a subset of transistors to be affected by aging, through a list in the input netlist. Below is the part of the ‘dot_appendmodel’ routine which: 1) initializes the list, 2) loops over the line which holds the list and retrieves each token, 3) looks up the transistor instance in the table of instantiated devices, and 4) if it finds it, stores it in the list of models for the subsequent aging calculation.

// Parse the rest of the line, which is a list of devices
thismodel->INPmodfast->GENrelmodelDeviceList = NULL ;
while (*line) {
    INPgetTok (&line, &device_name, 1) ;
    error = INPretrieve (&device_name, tab) ;
    if (error) {
        printf ("Warning: Cannot find the %s instance of the %s model - Skipping this instance\n\n", device_name, appendmodel_name) ;
    } else {
        GENrelmodelDeviceElem *GENrelmodelDevice = TMALLOC (GENrelmodelDeviceElem, 1) ;
        GENrelmodelDevice->device_name = device_name ;
        GENrelmodelDevice->next = thismodel->INPmodfast->GENrelmodelDeviceList ;
        thismodel->INPmodfast->GENrelmodelDeviceList = GENrelmodelDevice ;
    }
}

Code 32 – Piece of code to parse the list of devices to be affected by aging


The ‘.relan’ and ‘.tran’ declarations are the analyses to be performed. Here too the order is not important, because the assumption is that the Reliability Analysis comes first, to age the devices, and the second analysis, whatever it is, comes afterwards, to be performed on the aged circuit. The ‘.relan’ directive indicates the new Reliability Analysis and needs a command declaration to be parsed, as shown in Code 33:

#ifdef RELAN
    {"relan", com_relan, TRUE, TRUE, { 0, 0, 0, 0 }, E_DEFHMASK, 0, LOTS, NULL,
     "[.relan line args] : Do a Reliability Analysis." },
#endif

Code 33 – The ‘.relan’ command declaration

The command ‘com_relan’ in Code 34 is just a wrapper which can potentially execute specific tasks to launch the analysis properly:

#ifdef RELAN
void
com_relan (wordlist *wl)
{
    dosim ("relan", wl) ;
}
#endif

Code 34 – The ‘com_relan’ routine, which executes the Reliability Analysis

The ‘.relan’ line is parsed by the code in Code 35, which calls another routine, not reported here because it is too long; the syntax of this analysis is worth noting, however, and is reported in Code 36.

#ifdef RELAN
} else if ((strcmp (token, ".relan") == 0)) {
    rtn = dot_relan (line, ckt, tab, current, task, gnode, foo) ;
    goto quit ;
#endif

Code 35 – Piece of code to handle the ‘.relan’ line


/* .relan AgingStep AgingStop */
which = ft_find_analysis ("RELAN") ;
if (which == -1) {
    LITERR ("Reliability Analysis unsupported.\n") ;
    return (0) ;
}
IFC (newAnalysis, (ckt, which, "Reliability Analysis", &foo, task)) ;
parm = INPgetValue (ckt, &line, IF_REAL, tab) ;
GCA (INPapName, (ckt, which, foo, "relan_aging_step", parm)) ;
parm = INPgetValue (ckt, &line, IF_REAL, tab) ;
GCA (INPapName, (ckt, which, foo, "relan_aging_stop", parm)) ;
…

Code 36 – Syntax of the ‘.relan’ line and piece of code to parse it

The forced order that lets the Reliability Analysis be executed first, when specified, is managed in Code 37:

#ifdef RELAN
extern SPICEanalysis RELANinfo ;
#endif

extern SPICEanalysis ACinfo ;
…
extern SPICEanalysis TRANinfo ;

Code 37 – Extract of the available analyses – The order is key

Regarding the core part of the Reliability Analysis, a data structure holds all the code needed to perform it, derived from the Transient Analysis. Some modifications are obvious (e.g., the parameters), but one thing is completely new: the stress/recovery cycle, which computes the aging profile, is managed by the call in Code 38, inside the Reliability Analysis code:

CKTreliability (ckt, 0) ;

Code 38 – Routine to calculate the aging for the entire circuit


The same routine has to be called at the end of the analysis, because the final piece of the aging profile has to be extracted from the latest time point, too. This part is handled by the analysis manager:

#ifdef RELAN
    if (i == 1) {
        /* In case of Reliability Analysis, perform the final CKTreliability */
        printf ("\n\nFinal Aging...\n") ;
        CKTreliability (ckt, 1) ;
        error = CKTtemp (ckt) ;
        if (error) {
            return (error) ;
        }
    }
#endif

Code 39 – Final aging profile calculation

‘CKTreliability’ loops over all the ‘DEVreliability’ routines, which are private to each device model that supports the ‘relmodel’ companion model, as illustrated in Code 40:

int
CKTreliability (CKTcircuit *ckt, unsigned int mode)
{
    int error, i ;

    for (i = 0 ; i < DEVmaxnum ; i++) {
        if (DEVices [i] && DEVices [i]->DEVreliability && ckt->CKThead [i]) {
            error = DEVices [i]->DEVreliability (ckt->CKThead [i], ckt, mode) ;
        }
    }

    return (OK) ;
}

Code 40 – The ‘CKTreliability’ routine to loop over the ‘DEVreliability’ routines and extract the aging profile of the whole circuit


where ‘DEVreliability’ is shown in Code 41:

/* loop through all the BSIM4 device models */
for ( ; model != NULL ; model = model->BSIM4nextModel) {
    if (model->BSIM4type == PMOS) {
        if (model->BSIM4relmodelDeviceList != NULL) {
            for (elem = model->BSIM4relmodelDeviceList ; elem != NULL ; elem = elem->next) {
                here = (BSIM4instance *)(CKTfndDev (ckt, elem->device_name)) ;
                if (!here) {
                    SPfrontEnd->IFerrorf (ERR_WARNING, "Error: Cannot find the %s instance of the %s model", elem->device_name, model->BSIM4modName) ;
                } else {
                    BSIM4reliability_internal (here, ckt, mode) ;
                }
            }
        } else {
            /* loop through all the instances of the model */
            for (here = model->BSIM4instances ; here != NULL ; here = here->BSIM4nextInstance) {
                BSIM4reliability_internal (here, ckt, mode) ;
            }
        }
    }
}

Code 41 – The ‘DEVreliability’ routine, which loops over all the instances of the model or over the specified ones

This routine picks up the PMOS model; then, if the list of instances was populated in the input netlist, it loops only over those, otherwise over every instance of the PMOS model. The loop calls the routine which computes the stress/recovery cycle for each instance.


At the beginning, it is mandatory to determine whether a transistor is turned ON or OFF, so the state vector and the Vth (here called Von) are used for this purpose:

// Determine if the transistor is ON or OFF
vds = ckt->CKTstate0 [here->BSIM4vds] ;
vgs = ckt->CKTstate0 [here->BSIM4vgs] ;
von = here->BSIM4von ;

Code 42 – Piece of code to determine if the transistor is ON or OFF

Then, depending on the transistor state, which depends in turn not only on ‘vds’, ‘vgs’ and ‘von’, but also on the model configuration, a flag is turned ON or OFF, as shown in Code 43:

if (vds >= 0) {
    if (vgs > von) {
        if (here->BSIM4rgateMod == 3) {
            double vges, vgms ;
            vges = ckt->CKTstate0 [here->BSIM4vges] ;
            vgms = ckt->CKTstate0 [here->BSIM4vgms] ;
            if ((vges > von) && (vgms > von)) {
                NowIsON = 1 ;
            } else {
                NowIsON = 0 ;
            }
        } else if ((here->BSIM4rgateMod == 1) || (here->BSIM4rgateMod == 2)) {
            double vges ;
            vges = ckt->CKTstate0 [here->BSIM4vges] ;
            if (vges > von) {
                NowIsON = 1 ;
            } else {
                NowIsON = 0 ;
            }
        } else {
            NowIsON = 1 ;
        }
    } else {
        NowIsON = 0 ;
    }
} else {
    double vgd ;
    vgd = vgs - vds ;
    if (vgd > von) {
        if (here->BSIM4rgateMod == 3) {
            double vges, vged, vgms, vgmd ;
            vges = ckt->CKTstate0 [here->BSIM4vges] ;
            vged = vges - vds ;
            vgms = ckt->CKTstate0 [here->BSIM4vgms] ;
            vgmd = vgms - vds ;
            if ((vged > von) && (vgmd > von)) {
                NowIsON = 1 ;
            } else {
                NowIsON = 0 ;
            }
        } else if ((here->BSIM4rgateMod == 1) || (here->BSIM4rgateMod == 2)) {
            double vges, vged ;
            vges = ckt->CKTstate0 [here->BSIM4vges] ;
            vged = vges - vds ;
            if (vged > von) {
                NowIsON = 1 ;
            } else {
                NowIsON = 0 ;
            }
        } else {
            NowIsON = 1 ;
        }
    } else {
        NowIsON = 0 ;
    }
}

Code 43 – Piece of code to determine if a transistor is ON or OFF

At this point, the rest of the code relies on the ‘NowIsON’ flag to determine whether the transistor is in stress or recovery, which also depends on its previous state; for this reason, the flag is initialized to a valid state the first time, as shown in Code 44:

// If it's the first time, initialize 'here->relStruct->IsON'
if (here->relStruct->IsON == -1) {
    here->relStruct->IsON = NowIsON ;
}

Code 44 – Initialization of the ‘IsON’ flag the first time to a valid state

Now, two states have to be considered, the current one and the previous one, so a total of four combinations are possible and all of them have to be handled, as illustrated in Code 45:

if (mode == 0) {
    if (NowIsON) {
        if (here->relStruct->IsON == 1) {
            // Until now, the device was ON - Do NOTHING
            delta = -1 ;
        } else if (here->relStruct->IsON == 0) {
            // Until now, the device was OFF - Calculate recovery
            delta = ckt->CKTtime - here->relStruct->time ;

            // Update time and flag - Stress begins
            here->relStruct->time = ckt->CKTtime ;
            here->relStruct->IsON = 1 ;

            // Calculate Aging - Giorgio Liatis' Model
            ret = RELMODELcalculateAging ((GENinstance *)here, here->BSIM4modPtr->BSIM4modType, delta, 1e-12, 0) ;
            if (ret == 1) {
                return (E_INTERN) ;
            }

            // Update the semiperiod counter
            here->relStruct->semiPeriods++ ;
        } else {
            fprintf (stderr, "Reliability Analysis Error\n") ;
        }
    } else {
        if (here->relStruct->IsON == 1) {
            // Until now, the device was ON - Calculate stress
            delta = ckt->CKTtime - here->relStruct->time ;

            // Update time and flag - Recovery begins
            here->relStruct->time = ckt->CKTtime ;
            here->relStruct->IsON = 0 ;

            // Calculate Aging - Giorgio Liatis' Model
            ret = RELMODELcalculateAging ((GENinstance *)here, here->BSIM4modPtr->BSIM4modType, delta, 1e-12, 1) ;
            if (ret == 1) {
                return (E_INTERN) ;
            }

            // Update the semiperiod counter
            here->relStruct->semiPeriods++ ;
        } else if (here->relStruct->IsON == 0) {
            // Until now, the device was OFF - Do NOTHING
            delta = -1 ;
        } else {
            fprintf (stderr, "Reliability Analysis Error\n") ;
        }
    }
    …
}

Code 45 – The four possible combinations of current and previous state to be considered to determine if a transistor is ON or OFF

In Code 45, only ‘mode 0’ is considered; this is the mode used during the Reliability Analysis to extract the aging profile, and considering both the current and previous states is mandatory, because:
- if the transistor is now on and was previously on, there is no delta time, since the state did not change; the ‘delta’ variable is only set to a conventional value, to mark a different kind of elapsed time and avoid any misunderstanding
- exactly the same happens when the transistor is off and was previously off
- if the transistor is on and was previously off, the recovery has to be calculated: the time elapsed between the current state and the previous one is computed, the current time and status flag are saved, and the aging is calculated for the instance over that delta time; finally, the semi-period counter, which serves at the end for the final long term prediction, is updated
- similarly, if the transistor is off and was previously on, the stress has to be calculated
This calculation relies on one assumption: every state change is captured by the normal Transient Analysis time step evolution, which uses an accept/reject methodology, so it is impossible to skip a state change, because the Reliability Analysis is reached only after an accepted time point. The accepted time points are so close to each other that this fine granularity should capture the proper instant correctly.


The ‘mode’ flag is passed down from ‘CKTreliability’ and above; it is mandatory because it distinguishes a stress/recovery cycle internal to the simulation from the final one, performed inside the analysis manager, outside the main analysis loop. When its value is 1, the code in Code 46 is executed:

} else if (mode == 1) {
    // In this mode, it doesn't matter if NOW the device is in stress or in recovery, since it's the latest timestep
    if (here->relStruct->IsON == 1) {
        // Calculate stress
        delta = ckt->CKTtime - here->relStruct->time ;

        // Update time and flag - Maybe Optional
        here->relStruct->time = ckt->CKTtime ;
        here->relStruct->IsON = 1 ;

        // Calculate Aging - Giorgio Liatis' Model
        ret = RELMODELcalculateAging ((GENinstance *)here, here->BSIM4modPtr->BSIM4modType, delta, 1e-12, 1) ;
        if (ret == 1) {
            return (E_INTERN) ;
        }

        // Update the semiperiod counter
        here->relStruct->semiPeriods++ ;
    } else if (here->relStruct->IsON == 0) {
        // Calculate recovery
        delta = ckt->CKTtime - here->relStruct->time ;

        // Update time and flag - Maybe Optional
        here->relStruct->time = ckt->CKTtime ;
        here->relStruct->IsON = 0 ;

        // Calculate Aging - Giorgio Liatis' Model
        ret = RELMODELcalculateAging ((GENinstance *)here, here->BSIM4modPtr->BSIM4modType, delta, 1e-12, 0) ;
        if (ret == 1) {
            return (E_INTERN) ;
        }

        // Update the semiperiod counter
        here->relStruct->semiPeriods++ ;
    } else {
        fprintf (stderr, "Reliability Analysis Error\n") ;
    }
    …

Code 46 – Final piece of aging profile extraction

In particular, since this is the latest time step, it is a mandatory point, and the assumption is that the state equals the previous one, because any change would have been captured by the normal time step evolution, as mentioned before. After that, the long term prediction can be calculated; since the circuit simulation cannot be run for a years-based amount of time, an extrapolation through a fitting is mandatory. There are two cases:
• The simple case, when there is only a single stress and no stress/recovery cycle; here the extrapolation is trivial, because it is simply the aging model evaluated at the target extrapolation time
• The general case, which is quite complicated: since the stress/recovery cycle has a periodic behavior, the fitting methodology has to use a Fourier basis
The general case is handled in Code 47:

/* Calculate fitting */
if (here->relStruct->semiPeriods > 1) {
    /* The model behavior is periodic - Use Fourier basis fitting */
    double *deltaVthFit, f, factor_for_2pi, *fitting_matrix, target, *timeFit ;
    RELMODELrelList *current ;
    unsigned int columns, i, j, number_of_modes, number_of_periods, rows, size ;

    /* Count how many deltaVth we have */
    i = 0 ;
    current = here->relStruct->deltaVthList ;
    while (current != NULL) {
        i++ ;
        current = current->next ;
    }

    /* Assign list members to vectors */
    timeFit = TMALLOC (double, i) ;
    deltaVthFit = TMALLOC (double, i) ;

    i = 0 ;
    current = here->relStruct->deltaVthList ;
    while (current != NULL) {
        timeFit [i] = current->time ;
        deltaVthFit [i] = current->deltaVth ;
        i++ ;
        current = current->next ;
    }

    /* Generate the fitting matrix */
    number_of_periods = here->relStruct->semiPeriods / 2 ;
    number_of_modes = 10 * (number_of_periods + 1) ;

    rows = i ;
    columns = 2 * number_of_modes + 1 ;
    size = rows * columns ;
    fitting_matrix = TMALLOC (double, size) ;

    factor_for_2pi = 2 * 3.14159265359 / (timeFit [rows - 1] - timeFit [0]) ;

    for (i = 0 ; i < rows ; i++) {
        /* The first element of every row is equal to 1 */
        fitting_matrix [columns * i] = 1 ;

        /* The odd elements of every row are cos(x) */
        for (j = 0 ; j < number_of_modes ; j++) {
            fitting_matrix [columns * i + 2 * j + 1] = cos ((j + 1) * timeFit [i] * factor_for_2pi) ;
        }

        /* The even elements of every row are sin(x) */
        for (j = 1 ; j <= number_of_modes ; j++) {
            fitting_matrix [columns * i + 2 * j] = sin (j * timeFit [i] * factor_for_2pi) ;
        }
    }

    gsl_matrix_view m = gsl_matrix_view_array (fitting_matrix, rows, columns) ;
    gsl_vector_view b = gsl_vector_view_array (deltaVthFit, rows) ;
    gsl_vector *tau = gsl_vector_alloc (MIN (rows, columns)) ;

    gsl_vector *x = gsl_vector_alloc (columns) ;
    gsl_vector *residual = gsl_vector_alloc (rows) ;

    gsl_linalg_QR_decomp (&m.matrix, tau) ;
    gsl_linalg_QR_lssolve (&m.matrix, tau, &b.vector, x, residual) ;

    target = 315360000.0 ;
    f = gsl_vector_get (x, 0) ;

    /* The odd elements are cos(x) */
    for (j = 0 ; j < number_of_modes ; j++) {
        f += gsl_vector_get (x, 2 * j + 1) * cos ((j + 1) * target * factor_for_2pi) ;
    }

    /* The even elements are sin(x) */
    for (j = 1 ; j <= number_of_modes ; j++) {
        f += gsl_vector_get (x, 2 * j) * sin (j * target * factor_for_2pi) ;
    }

    printf ("\n\nExtrapolation at 10 years:\n\t\t\t\tDeltaVth: %-.9gmV\n\n\n\n", f * 1000) ;

    /* Assign the extrapolated DeltaVth to the model */
    here->relStruct->deltaVth = f ;

    gsl_vector_free (tau) ;
    gsl_vector_free (x) ;
    gsl_vector_free (residual) ;
} else {
    …
}

Code 47 – Extrapolation at the end of the simulation, using a Fourier basis fitting

In this context, the GSL library performs the QR decomposition and the least-squares triangular solve to compute the solution of the fitting matrix; then each element of the solution vector is multiplied by the proper factor, which depends on the target extrapolation time, on the position in the vector, and on a pre-calculated factor which scales the time quantities to [0, 2π], shown in Equation 44:

factor_for_2pi = 2π / (timeFit[rows − 1] − timeFit[0])

Equation 44 – Factor to scale a vector of time steps to the interval [0, 2π]


The fitting problem to be solved is given in Equation 45:

A·x = b,   A ∈ ℝ^(rows × columns),   b_i = ΔVth(t_i)

where each row of A contains the basis functions evaluated at time t_i, and the least-squares solution x minimizes ‖A·x − b‖.

Equation 45 – Fitting matrix to be solved by using the QR decomposition and the Least Square Triangular Solve

The function to be fitted can be arbitrary; in this case, it is approximated by a number of Fourier bases, as in Equation 46:

f(t) = a₀ + Σ_{k=1}^{M} [ a_k·cos(k·ω·t) + b_k·sin(k·ω·t) ]

Equation 46 – Fourier basis approximation for a given function

where ω is the scaling factor of Equation 44 and M represents the number of modes needed to approximate the function; they have to be at least the number of considered periods plus 1. In my implementation, I multiplied this by a factor of 10, so that enough modes are captured to approximate the function correctly, without losing too much precision. In the resulting fitting matrix, even elements (odd, considering the C array indexing) are cosines and odd elements (even, considering the C array indexing) are sines, with the exception of the first element, which is 1. The particular case where the transistor is always ON or always OFF from the beginning of the simulation is also taken into account, because the initialization to a valid value captures the state, which is then maintained until the end. In this case, if the transistor is always ON, the stress is calculated, otherwise the recovery is calculated, even though the result is actually 0, because the maximum value of DeltaVth is initialized to 0 (this is a parameter of the aging model). Since the final result is calculated as a variation of the absolute value of Vth, the proper sign has to be applied; the value is then injected inside the transistor model to start the next analysis on the aged circuit.


Code 48 adds the contribution inside the ‘DEVtemp’ of the BSIM4 model as ‘delvto’, which is also a parameter that can be set directly from the input netlist (useful for injecting aging calculated outside NGSPICE):

#ifdef RELAN
    if (model->BSIM4type == PMOS) {
        if (here->relStruct->IsON != -1) {
            here->BSIM4delvto = model->BSIM4type * here->relStruct->deltaVth ;
        }
    }
#endif

Code 48 – DeltaVth injection inside the BSIM4 model in form of ‘delvto’

Several analyses can be launched at this point, but all of them will use the aged circuit.

6.5 Results

In terms of results, the Reliability Analysis has been run on a simple minimal inverter and on a 32bit Carry Chain Adder, using the BSIM4 as transistor model. The minimal inverter showed that the delay due to the HL input transition, the only one affected by aging, increases; this could lead to setup or hold issues, depending on whether the data path or the clock path is considered, respectively, as shown in Table 3.

Delay HL (ns) @ VDD=0.7V – Single Inverter @ 16nm, using an ultra-thin gate oxide

Fresh Circuit   Aged Circuit (10 years)   Variation Percentage
0.7097          0.7233                    1.92%

Table 3 – Single Inverter at 16nm, aged considering an ultra-thin oxide


Then, the same inverter has been used to calculate the aging considering a hypothetical defective material with halved τc and doubled β, as reported in Table 4.

Delay HL (ns) @ VDD=0.7V – Single Inverter @ 16nm, using a hypothetical material with halved τc and doubled β

Fresh Circuit   Aged Circuit (10 years)   Variation Percentage
0.7097          0.7760                    9.34%

Table 4 – Single Inverter at 16nm, aged considering a hypothetical material with halved τc and doubled β

The cell is so simple that it can be used to validate the delay response to the aging phenomenon, but it is unusable to validate the leakage power response, so a more complex circuit has been used in this case: a 32bit Carry Chain Adder. Due to the cell complexity, a specific configuration had to be chosen; I decided to use the Cin–Cout critical path as metric, considering the worst case of A=0x00000000 and B=0xFFFFFFFF, for both the LH and HL transitions of the Cin input. Both the delay and the leakage power responses are analyzed in the subsequent tables.

Delay (ns) @ VDD=0.7V – 32bit Carry Chain Adder @ 16nm, A=0x00000000 and B=0xFFFFFFFF – Cin LH transition

Fresh Circuit   Aged Circuit (10 years)   Variation Percentage
0.5682          0.5716                    0.61%

Table 5 – Delay of 32bit Carry Chain Adder, aged in the worst case condition and considering an LH Cin transition


Delay (ns) @ VDD=0.7V – 32bit Carry Chain Adder @ 16nm, A=0x00000000 and B=0xFFFFFFFF – Cin HL transition

Fresh Circuit   Aged Circuit (10 years)   Variation Percentage
0.6191          0.6351                    2.59%

Table 6 – Delay of 32bit Carry Chain Adder, aged in the worst case condition and considering an HL Cin transition

The different situations shown in Table 5 and Table 6 indicate that the delay depends on the stage where the pull-up is activated. In particular, by analyzing the circuit composing a single carry chain adder bit in Figure 33, it is possible to see that the first stage is more complex than the second one along the critical path; in fact, the first stage delay is about 50% larger than the second stage delay, meaning that the first stage contributes about 60% of the total delay and the second about 40%. So, a change in the first delay has more impact on the total delay than the same change in the second delay. In addition, this concept has to be replicated 32 times, since the 32bit Carry Chain Adder is simulated here; moreover, the delay of a stage is a function of the input transition coming from the previous stage, so an additional component has to be considered as well.


Figure 33 – Carry Chain Adder @ 1bit

Power (µW) @ VDD=0.7V – 32bit Carry Chain Adder @ 16nm, A=0x00000000 and B=0xFFFFFFFF – Cin LH transition

Fresh Circuit   Aged Circuit (10 years)   Variation Percentage   Mean Delta Vth Variation %
2.3154          2.2208                    -4.09%                 3.11%

Table 7 – Leakage Power of 32bit Carry Chain Adder, aged in the worst case condition and considering an LH Cin transition


Power (µW) @ VDD=0.7V – 32bit Carry Chain Adder @ 16nm, A=0x00000000 and B=0xFFFFFFFF – Cin HL transition

Fresh Circuit   Aged Circuit (10 years)   Variation Percentage   Mean Delta Vth Variation %
2.8863          2.8061                    -2.78%                 3.06%

Table 8 – Leakage Power of 32bit Carry Chain Adder, aged in the worst case condition and considering an HL Cin transition

Table 7 and Table 8 describe the leakage power consumption of the 32bit Carry Chain Adder. It is possible to see that, in this case, the LH transition of Cin makes the circuit consume less leakage power with respect to the HL transition, apparently the opposite of the delay situation. However, the illustrated result is correct: by analyzing the circuit in Figure 33, it is possible to see that Mp12 is under recovery and Mp9 and Mp13 are under stress during the LH transition, while the opposite holds during the HL transition (actually Mp12 is always OFF in the latter case); the rest can be assumed constant. So, in the first case, the total aging effect is greater than in the second case. A good metric to see this effect is the absolute Mean Delta Vth Variation, because it expresses a sort of mean aging effect on the whole circuit. Last but not least, in the first case 290 PMOS transistors are affected by aging, against 322 in the second case, yet the overall Mean Delta Vth Variation is still higher in the former case.


Chapter 7 - Verilog-A Models Compiler

7.1 Compact Models

Compact models for circuit simulation are abstractions of the real transistor behavior and of the physical laws which govern its functionality. TCAD tools are used to develop models which respect physical laws, including the Poisson, potential and Schrödinger equations, and so on. These laws are really complex and most of the time do not even admit a closed-form solution, so they have to be solved iteratively, sometimes using tricks and assumptions. It is obvious, then, that these laws can be used only to understand, predict and simulate the behavior of a single transistor, and are not suitable at all for a complete circuit, even a very small one. This limitation comes from the complexity of these laws, but also from the fact that a circuit simulator needs models with C2 functions (continuous, with continuous derivatives up to the second order) and, for some analyses, C3 functions. So, Compact Models are more mathematical than physical laws, and most of the time they are fitting curves built upon the physical models developed in TCAD, whose parameters are tweaked in a special phase of the flow called Parameter Extraction, usually performed by directly measuring the silicon. That said, developing a Compact Model is not easy at all, and nowadays it is even more difficult because the model has to take into account many second-order effects which are really relevant, or even dominant, in the novel deep sub-micron devices. Writing the C code of these models manually is therefore impractical; in fact, the latest key model directly written in C is the BSIM4, and all the other ones are now written only in Verilog-A.

7.2 The Verilog-A Standard

Verilog-A is the analog extension of the well-known Verilog language used for digital hardware description. The advantage of using Verilog-A in the context of Compact Modeling lies mainly in two aspects:
• It is closer to a mathematical formulation than to code
• The model developer does not need to write the partial derivatives by hand


It is true that modern programming languages offer many mathematical constructs and functions, but some advanced concepts are not directly available, and the developer has to build these routines himself or rely on existing libraries, which unfortunately are not standardized. In this context, Verilog-A offers a standardized way to approach these concepts: the developer uses the language constructs without knowing the actual implementation, assuming only that the implementation respects the standard, so that every Verilog-A compiler behaves in the same way. In addition to the Compact Modeling usage described above, Verilog-A, and more extensively Verilog-AMS, can be used to represent an analog component and connect it to the digital world through a ‘Connect Module’.
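One of those advanced concepts is automatic differentiation, which is what frees the model developer from writing partial derivatives by hand. As a hedged illustration of the idea only (no specific Verilog-A compiler is claimed to work this way, and the diode formula and constants below are just textbook values), dual numbers propagate a value and its first derivative through ordinary arithmetic in a single pass:

```python
class Dual:
    """Dual number: a value plus its first derivative, propagated by arithmetic."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (f*g)' = f'*g + f*g'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def diode_current(v, i_s=1e-14, vt=0.02585, terms=20):
    """I = Is*(exp(V/Vt) - 1), with exp approximated by its Taylor series
    so the example stays in pure dual-number arithmetic."""
    x = v * (1.0 / vt)
    result, term = Dual(0.0), Dual(1.0)
    for n in range(1, terms):
        term = term * x * (1.0 / n)   # builds x^n / n!
        result = result + term
    return i_s * result                # Is*(exp(x) - 1)


# Seeding dV/dV = 1 yields dI/dV (the conductance) alongside I
v = Dual(0.1, 1.0)
i = diode_current(v)
```

For the ideal diode, dI/dV = (I + Is)/Vt, and the dual-number derivative reproduces that relation without any hand-written derivative code, which is exactly the burden Verilog-A lifts from the model developer.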

7.2.1 A simple example

Let’s now look at a simple example of a Compact Model: a resistor. Its Verilog-A implementation can be as in Code 49:

module res (p, n) ;
  inout p, n ;
  electrical p, n ;
  parameter real r = 0 from [0:inf) ;
  analog V (p, n) <+ r * I (p, n) ;
endmodule

Code 49 – Verilog-A code of a simple resistor

This example contains a module, called ‘res’, with two ports: ‘p’ and ‘n’. Their direction is then specified: ‘inout’ for both; then their discipline, which is ‘electrical’ for both. At this point, a parameter of type ‘real’, named ‘r’, with a default value of ‘0’, is declared; its range goes from ‘0’, included, to ‘+inf’. The last statement is ‘analog’, which introduces the body of the model, a single contribution in this case: the voltage across ports ‘p’ and ‘n’ equals the parameter ‘r’ times the current flowing between the same ports. Obviously, the parameter ‘r’ can be changed externally, i.e., from the netlist or from the simulator. Generally, a Compact Model implementation is flat, as in this case, but it could also be hierarchical.
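To give an idea of what a compiler must synthesize from Code 49, the following sketch (a hypothetical illustration, not the actual NGSPICE interface) stamps the resistor’s conductance into a small dense MNA matrix, which is essentially the job of a generated load routine:

```python
def stamp_resistor(matrix, p, n, r):
    """Stamp the conductance g = 1/r of a resistor between nodes p and n
    into a dense MNA matrix; node 0 is ground and its row/column are skipped."""
    g = 1.0 / r
    for row, col, val in ((p, p, +g), (p, n, -g), (n, p, -g), (n, n, +g)):
        if row != 0 and col != 0:        # ground row/column are not stored
            matrix[row - 1][col - 1] += val
    return matrix


# Two resistors in series: node1 -- 1k -- node2 -- 2k -- ground
m = [[0.0, 0.0], [0.0, 0.0]]
stamp_resistor(m, 1, 2, 1e3)
stamp_resistor(m, 2, 0, 2e3)
```

The four-entry pattern is the classic MNA conductance stamp; a real generated model additionally produces the partial derivatives and RHS contributions discussed later in this chapter.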


7.3 What is a compiler

Before talking about the Verilog-A Models Compiler, let’s look at what a compiler is in general. A compiler is a piece of software which reads one or more input text files, written in some formal language, and converts them into code readable by the CPU, generating an executable program. For example, GCC is a compiler for several languages, including C, and converts source code written in those languages into an executable for the computer.

7.4 Why a compiler is needed

First of all, for this purpose a translator is needed more than a compiler, because a compiler translates a high-level, human-readable language into machine language, while the needed tool translates a high-level language into another high-level language. However, in this context the tool is normally referred to as a compiler or synthesizer, because it does not perform a simple 1:1 translation: it has to interpret concepts and constructs which are not present in the target language, which is generally C/C++ but could be any other programming language. Another key feature it has to support is the target simulator’s device model interface, in order to write the proper data structures as its output. That said, a compiler is needed nowadays because Compact Models are so complex that writing their C code directly is practically impossible, long and error-prone, as mentioned before. Moreover, the model developer would have to know the simulator’s model interface, and the resulting Compact Model would be suitable only for the circuit simulators compatible with that interface. So the tendency nowadays is to let the model developer write the Compact Model in Verilog-A and let end-users run an automatic tool to convert it. In this context, Verilog-A represents a simple, standardized and high-level way to write the Compact Model. There is also another reason for preferring a compiler over hand coding: the Compact Model can be general and suitable for every circuit simulator on the market, because no simulator-specific data structure appears in the Verilog-A implementation and the model developer does not have to learn one. So, in the end, using a compiler (or synthesizer) is the right way to go.


7.5 Existing Verilog-A Compilers

All the commercial software houses active in the circuit simulator business have their own compiler, but not all of them let the user leverage it to convert and use a custom Compact Model. One of the companies that does is MunEDA. However, commercial tools serve here only as a global perspective, showing that industry is also moving in this direction; they are of little use for my research, since neither code nor specifications for those compilers are available in the public domain. The only compiler available in the public domain is ADMS (Automatic Device Model Synthesizer) by Laurent Lemaitre. It was developed a long time ago, when Laurent was at Motorola/Freescale, and it is divided into two parts:
• Parsing of the Verilog-A model and generation of its tree representation in XML format
• Reading the XML tree back through XSLT scripts, run by a custom XSLT processor, to generate the final C code for the circuit simulator
ADMS supports several simulators: SPICE3 (and therefore NGSPICE), Spectre and HSPICE.

7.6 Why a new Verilog-A Compiler is needed

As said, ADMS is the only Verilog-A compiler for circuit simulators available in the public domain, so a deep analysis of its features and implementation is needed to understand why a new Verilog-A compiler is required. The first issue is that ADMS cannot correctly compile new Compact Models, including BSIM6, BSIM-CMG, BSIM-IMG and possibly others, because it lacks support for some Verilog-A constructs needed by those models; in general it does not support the whole Verilog-AMS language set, but only a Verilog-A subset. The second issue comes from the implementation chosen by Laurent. Looking at the code, it seems that his work was made available to the public after starting as a proprietary implementation. In fact, the XML representation and the XSLT scripts are hardly understandable and no comments are present. Moreover, there is a custom XSLT processor, with no documentation of the available fields and attributes or their meaning.


Last but not least, ADMS is no longer officially supported by Laurent, although there are some mirrors maintained by people trying to carry on his work. In particular, there are three main developers:
• Ryan Fox, who mirrored ADMS and started his work for a start-up company, but his work was mainly proprietary and only the revised Verilog-A parser has been published in a GitHub repository
• Guillerme Brondani Torri, who mirrored ADMS to include it in QUCS, another circuit simulator, very complete and capable of performing more analyses than NGSPICE, but not based on the SPICE3 core; in this case, the entire work is released in a GitHub mirror
• Luther The Cat, who mirrored ADMS to change the data structure from XML to JSON and to port the XSLT scripts and processor to Python; the first part of this work is already complete and the second part is being developed in his free time
I got in contact with all three of them, and I have to say that Luther is doing a very good job; I like his work and ideas, which are similar to mine.

7.7 The aim of my Verilog-A Models Compiler (VAMC)

Given this situation, I decided to start a new Verilog-A Models Compiler from scratch to fill this gap in the flow. I could have continued the development of ADMS, like Ryan or Guillerme, or started from it to conceive a new compiler, like Luther, but in my opinion that is not real research, so I decided to start over completely. My compiler aims to be a complete platform to analyze a Compact Model before compiling it, also through a GUI developed in Qt, and to support the whole Verilog-AMS language, not only the Verilog-A subset. The internal database can be navigated through a TCL interface, which makes it easy to create custom commands. These commands include the ones I created to serve several purposes, from database queries to the final compilation, but also the ones end-users can write themselves as add-ons to the compiler, since TCL is a very powerful, extensible language.


7.7.1 The Front-End Implementation

To code the front-end, I started by reading the Verilog-AMS Language Reference Manual, freely available on the Accellera website, which contains a very good and clear description of all the constructs, along with the Backus-Naur Form of the grammar, an excellent starting point for writing the parser. The Backus-Naur Form (or Backus Normal Form) is one of the two main notation techniques for context-free grammars, often used to describe computing languages such as programming languages, document formats, instruction sets and communication protocols. The parser has been created using the Flex and Bison utilities, which help the developer generate various kinds of parsers; more precisely, Flex generates the lexer (or scanner), while Bison generates the actual grammar parser. They are the modern versions of the old Lex and Yacc, initially developed in the 1970s. These tools were chosen because they are the historical and best-known utilities for creating parsers and because they automatically generate a C implementation, which can be embedded inside C/C++ applications.
Since Verilog-AMS supports preprocessor directives, it was necessary to split the parser into two parts:
• The preprocessor parser, which generates a single output file, stripped of all the preprocessor directives after executing them properly, and including all the files specified through the ‘include’ directive
• The main parser, which reads back the single file generated by the preprocessor parser and builds the data structure, i.e., the actual database on which the TCL queries act
I implemented the preprocessor parser as a rather complex lexer, without a real parser, because it is difficult to find a proper grammar and regular expressions for all the preprocessor directives, or at least simple ones; so I just coded the regular expressions for the first part of each directive and then wrote a sort of decision tree based on the letters or symbols that follow. To parse comments, I used a particular Flex concept called Start Condition, which consists in changing the state of the lexer, making it possible to use rules valid only in the current state, so that some rules apply only inside comments while the other rules apply everywhere else.
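The start-condition idea can be illustrated outside Flex as well. The sketch below (a simplified stand-in, not the actual VAMC lexer) keeps an explicit state variable and applies different rules depending on whether the scanner is inside a line comment, inside a block comment, or in normal code:

```python
def strip_comments(src):
    """Remove // line comments and /* */ block comments, mimicking lexer
    start conditions with an explicit state variable."""
    NORMAL, LINE, BLOCK = 0, 1, 2       # the three "start conditions"
    state, out, i = NORMAL, [], 0
    while i < len(src):
        two = src[i:i + 2]
        if state == NORMAL:
            if two == "//":
                state, i = LINE, i + 2
            elif two == "/*":
                state, i = BLOCK, i + 2
            else:
                out.append(src[i]); i += 1
        elif state == LINE:              # rules valid only inside // comments
            if src[i] == "\n":
                state = NORMAL
                out.append("\n")         # keep line numbering intact
            i += 1
        else:                            # BLOCK: rules valid only inside /* */
            if two == "*/":
                state, i = NORMAL, i + 2
            else:
                if src[i] == "\n":
                    out.append("\n")
                i += 1
    return "".join(out)


code = "a = 1; // set a\n/* multi\n   line */ b = 2;\n"
clean = strip_comments(code)
```

Flex generates exactly this kind of state machine from declarative rules, which is why a start condition per comment style keeps the preprocessor lexer manageable.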


The main parser is, instead, composed of a normal lexer and a parser. The grammar is described in the parser, which is composed of symbols: non-terminal and terminal ones. The non-terminal symbols are placeholders that redirect rules to other rules inside the same parser, or they can be used to split a complex rule into simpler ones. In the latter case, the grammar becomes longer but more comprehensible to humans, although it could be reduced to a minimal form. The terminal symbols, on the other hand, are not defined inside the parser: they refer to the lexer and identify a minimal, indivisible term, called a token, which is encoded as a particular expression or regular expression inside the lexer. So, the parser calls the lexer, which returns the symbol data structure to the calling parser. At this point, if there are no errors, the input Compact Model is parsed, meaning it respects the grammar from the syntactic point of view; nothing can yet be said about the semantics, i.e., the meaning of each construct inside the model. To handle semantics, Bison has the concept of an action, which is basically a block of C/C++ code associated with each rule. Since it is a block of code, any instruction can be written there, even without specifying any semantic check; in fact, this mechanism is generally used to create the data structure which will then be used by the main application that invoked the parser. There is no predefined data structure in this case (it has to be designed by the programmer), but, since the parser produces an Abstract Syntax Tree, the canonical data structure is a tree or, at most, a graph. My data structure is coded in C++, so that I can leverage the OOP paradigm, which is very helpful in this context, where there are several complex kinds of data to manage, including lists, associative arrays, and so on.
Constructors are also really helpful in this context, because all the class initializations can be done there, without scattering external routines all around the code. Since Bison can generate various kinds of parsers, I decided to make my grammar left-recursive, although it could also be right-recursive; the key point is not to mix left and right recursion, because that leads to ambiguity, which then has to be resolved: otherwise the parser does not know whether to reduce or shift, which basically means choosing between completing the current non-terminal and consuming the next token. Sometimes ambiguity is intentional, in order to be resolved explicitly, as in the operator rules, where multiplication and division generally take priority over addition and subtraction.
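As an illustration of how such operator ambiguity gets resolved (a generic precedence-climbing sketch, not the Bison mechanism used in VAMC, which relies on precedence declarations such as %left), the parser below decides between completing the current operand and consuming the next operator purely by comparing precedence levels:

```python
import re

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}   # higher number binds tighter


def tokenize(expr):
    """Split an integer expression into number and operator tokens."""
    return re.findall(r"\d+|[+\-*/]", expr)


def parse(tokens, min_prec=1):
    """Precedence climbing: consume operators whose precedence is at least
    min_prec; tighter-binding operators are parsed recursively first."""
    left = int(tokens.pop(0))             # operand
    while tokens and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # the right operand absorbs every operator that binds tighter than op
        right = parse(tokens, PREC[op] + 1)
        if op == "+": left += right
        elif op == "-": left -= right
        elif op == "*": left *= right
        else: left //= right
    return left


result = parse(tokenize("2+3*4-6/2"))     # parsed as 2 + (3*4) - (6/2)
```

Recursing with PREC[op] + 1 also makes same-precedence operators left-associative, which is the behavior a left-recursive Bison grammar with %left gives for these rules.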


My parser’s grammar could be created quite easily with some modifications to the existing Backus-Naur Form present in the Verilog-AMS Language Reference Manual, as mentioned before; the difficult parts are the ones where disambiguation is needed, since the Backus-Naur Form does not provide it, so I had to code those separately. Regarding the lexer, the Backus-Naur Form only provides the token keywords, without hints on their implementation, which I had to code as well. The same holds for the preprocessor directives parser.

7.7.2 The Back-End Implementation

By back-end I mean here everything that is part of the post-processing of the Compact Model, right after parsing; therefore, everything is built on the data structure produced by the parser. For this part I have chosen the TCL language, following the model of commercial digital (but also analog) tools. This means that the user has a TCL interface where he can type commands for querying the database, like ‘get_functions’ to retrieve all the functions defined in the Compact Model, or ‘get_ports’ to retrieve all the ports, and so on. Moreover, all the ‘get_*’ commands return a list of objects, and all these objects have attributes which can be queried; for example, the user can query the names of all the ports by simply typing ‘get_attribute [get_ports] name’. This notation shows one of the powerful features of TCL: command concatenation, meaning that commands can be called in cascade, with the inner one passing its result to the outer one. Some reports can also be invoked; right now only the list of attributes for each class exists as a report, but more are in progress. It is in the context of querying the database that my idea of a complete platform for developing Compact Models takes shape, because database querying is the basic operation, but it is not user-friendly, especially for new users and nowadays, when GUIs are pervasive. A GUI is therefore the right addition to enhance this part and create a true platform. The idea is to use Qt, since it is written in C++, has a vast community of users, and plenty of tutorials are available on the Web to start practicing. In principle there is just a small compilation-chain issue to solve in order to adopt this approach: I am using the Autotools toolchain for the text-based version, while Qt wants its own framework, Qt Creator; this would mean transferring everything to Qt Creator and maintaining two compilation methodologies. Fortunately, I found a possible solution on the Internet, which I tested and slightly modified to work in my environment, finally solving the problem at its root. Another important part of the VAMC back-end is the Compact Model conversion to NGSPICE and, in general, to circuit simulators. It basically comprises two parts:
• Model components calculation
• Model output formatting
The first consists in looping over the data structure and retrieving all the information needed to compute what does not exist in the Verilog-A version of the Compact Model but is needed by the circuit simulator, including partial derivatives, integration method calls, and model and instance parameter settings and queries. The second consists in formatting that information properly, so that the circuit simulator finds what it is looking for, in the correct form and in the correct files. For example, NGSPICE expects several routines for each Compact Model, including DEVload, DEVsetup, and so on, commonly divided into several files, one per routine.


7.7.3 Results

Since only some parts of the entire Verilog-A Models Compiler are ready, Figure 34 and Figure 35 report some of the operations that can already be performed at this stage of the development. In particular, the preprocessor parser is complete and the main parser can fully read many Compact Models, including BSIM6, BSIM-CMG, BSIM-IMG and others. Some TCL commands and queries are also available.

Figure 34 – Retrieving the port names of all the modules

Figure 35 – Example of object and attribute querying


Chapter 8 - Conclusion

Detailed results for each topic have already been presented in their own chapters, so in this final chapter I am only going to recap the conclusions achieved so far. NGSPICE has evolved considerably during my work, across both my Master’s and PhD theses: several new algorithms and performance gains have been achieved. First of all, refining the KLU implementation from my previous work in the Master’s Thesis has let NGSPICE reach a new speedup of 13X over SPARSE 1.3, improving on the previous 11X. Then, joint work with Stefano Perticaroli has given NGSPICE a fully working Periodic Steady State Analysis that does not get stuck in local minima, thanks to a DFT-based algorithm to refine the detected period. After completing this preliminary work to obtain a stable NGSPICE framework, the real PhD phase could begin. Nano-CMOS circuit simulation has been substantially advanced at this point, first by implementing a novel convergence algorithm based on KCL verification and then by devising a novel homotopy method that improves both accuracy and convergence while speeding up the entire simulation. This work has let the entire ISCAS85 benchmark suite converge with really high precision (GMIN = 1e-18), while also speeding up the simulation by about 1.31X. It also won the Silver Leaf Award at the PRIME2014 conference in Grenoble. The second phase of my PhD targeted the acceleration of the simulation on NVIDIA GPUs, using the CUDA language. Since a SPICE Transient Analysis is basically a loop solving a non-linear system approximated as a sequence of linear ones, the main actors are the Model Evaluation and Linear System Solution phases. I concentrated my work on the former because it takes from 40% up to 70% of the entire simulation time, depending on the size and complexity of the circuit.
The acceleration of this part focused on the main device models (resistor, capacitor, inductor, voltage and current sources, and the BSIM4 MOSFET model), changing the data structure from Arrays of Structures to Structures of Arrays, which suits the GPU better, and reaching a speedup of about 6.67X on a Fermi generation card. A further improvement concerned the Circuit Matrix and RHS update inside the Model Evaluation of each device model. This update has been totally redesigned from scratch, using the Topology Matrix Method, which generates the Circuit Matrix and RHS each time through a matrix-vector multiplication between a Topology Matrix and a so-called Values Vector, which holds all the values coming from the Model Evaluation. So, both the Circuit Matrix and the RHS are generated directly on the GPU and transferred back to the CPU only to perform the Linear System Solution. This further improvement has let NGSPICE reach about 8.89X speedup in the pure Transient Analysis. The third part of my PhD saw the development of a Reliability Analysis framework, based on state-of-the-art aging models. The framework computes the short-term behavior by extracting the aging profile, using a modified Transient Analysis, and leveraging the chosen aging model for the total DeltaVth calculation. Then, a long-term prediction is computed by assuming that the short-term simulation has captured the typical behavior of the circuit, so that its inputs repeat indefinitely. To extract the prediction, a proper fitting is necessary; two kinds of fitting are considered: 1) Model-based, when only a single continuous stress is applied; 2) Fourier-basis fitting, where a fitting matrix is created by writing the Fourier approximation of the sampled waveform, with a proper number of modes per row; all rows use the same modes, but each has its own time value. The system is solved using the QR decomposition and a least-squares approximation. The extrapolated value is then injected into the device model chosen for aging (I used the BSIM4 model) and a new simulation is performed on the aged circuit. The aged simulation can be of any kind; for my trials, I chose to launch another Transient Analysis.
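The core of the Topology Matrix Method recalled above can be sketched as follows. This is a minimal dense illustration with made-up values (the real implementation works on sparse data directly on the GPU): each row of the Topology Matrix selects, with sign, the Model Evaluation values that sum into one nonzero of the Circuit Matrix or RHS.

```python
def apply_topology(topology, values):
    """Dense matrix-vector product: each row of the topology matrix gathers
    (with sign) the model-evaluation values contributing to one nonzero."""
    return [sum(t * v for t, v in zip(row, values)) for row in topology]


# Hypothetical example: two conductances sharing the middle node of a
# 2-node circuit (node1 -- g1 -- node2 -- g2 -- ground).
values = [1e-3, 5e-4]          # Values Vector from "model evaluation": [g1, g2]
topology = [
    [ 1.0, 0.0],               # A11 =  g1
    [-1.0, 0.0],               # A12 = -g1
    [-1.0, 0.0],               # A21 = -g1
    [ 1.0, 1.0],               # A22 =  g1 + g2
]
nonzeros = apply_topology(topology, values)
```

Because the topology is fixed for a given circuit, only the Values Vector changes at each Newton iteration, which is what makes this formulation attractive on a GPU.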
The results are in line with what the literature reports in terms of delay and leakage power:
- The delay of the aged circuit increases, leading to setup or hold issues, depending on whether the data path or the clock path is considered, and also on the margins the designer used in the project
- The leakage power of the aged circuit decreases
The amount of variation depends on the particular circuit and its inputs, because it depends on how the transistors inside the circuit get stressed. Finally, the need to support new Compact Models written in the Verilog-A language has required starting a novel Verilog-A Models Compiler. The work is not yet finished, but many components are already fully implemented and working. In fact, it is already possible to parse several device models available on the Internet, including BSIM6, BSIM-CMG and BSIM-IMG, which are the most relevant.

8.1 Future works

The dream of a fully working Verilog-A Models Compiler has not come true during my PhD, but a great effort has been spent to start a novel compiler from scratch, based on well-recognized tools and methodologies, including Flex, Bison and TCL. The preprocessor parser is complete and the main parser is still in progress, but good foundations have been laid for a good tool: the DFS used to traverse the nodes is working, the data structure is well defined and everything has been coded in C/C++ for easier understanding and future modification. So, the main future work will certainly be finishing this big task, which more than any other aspect of EDA really sits between two worlds: electronics and computer science.

References

[1] F. Lannutti, F. Menichelli, P. Nenzi and M. Olivieri, "A new algorithm for convergence verification in circuit level simulations," PRIME2014 Conference, 2014.
[2] F. Lannutti, P. Nenzi and M. Olivieri, "KLU sparse direct linear solver implementation into NGSPICE," in Mixed Design of Integrated Circuits and Systems (MIXDES) – Proceedings of the 19th International Conference, May 2012.
[3] M. Naumov, "Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU," [Online]. Available: https://research.nvidia.com/sites/default/files/publications/nvr-2011-001.pdf, 2011.
[4] T. Davis and E. P. Natarajan, "Algorithm 907: KLU, a direct sparse solver for circuit simulation problems," ACM Transactions on Mathematical Software, vol. 37, no. 3, pp. 36:2-36:17, 2010.
[5] K. Kundert and I. Clifford, "Achieving Accurate Results With a Circuit Simulator," Cadence Design Systems (Analog Division), San Jose, CA, USA.
[6] J. Xu, "Perform the SPICE Simulation of ISCAS85 Benchmark Circuits for Research," [Online]. Available: http://www.ece.uic.edu/masud/iscas2spice.htm.
[7] T. T.-H. Kim and Z. H. Kong, "Impact Analysis of NBTI/PBTI on SRAM VMIN and Design Techniques for Improved SRAM VMIN," Journal of Semiconductor Technology and Science, vol. 13, no. 2, April 2013.
[8] G. Liatis, "Modeling Analitico delle Instabilità della tensione di soglia in nodi tecnologici CMOS attuali e simulazioni SPICE dell'impatto a livello di circuito," Rome.
[9] Z. Abbas and M. Olivieri, "Impact of technology scaling on leakage power in nano-scale bulk CMOS digital standard cells," Microelectronics Journal, vol. 45, no. 2, pp. 179-195, February 2014.
[10] Z. Abbas, A. Mastrandrea and M. Olivieri, "A Voltage-Based Leakage Current Calculation Scheme and its Application to Nanoscale MOSFET and FinFET Standard-Cell Designs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2014.
[11] L. W. Nagel and D. O. Pederson, "SPICE (Simulation Program with Integrated Circuit Emphasis)," Berkeley, 1973.
[12] A. Mastrandrea, M. Olivieri and F. Menichelli, "A delay model allowing nano-CMOS standard cells statistical simulation at the logic level," 7th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME), pp. 217-220, 3-7 July 2011.
[13] Z. Abbas, V. Genua and M. Olivieri, "A novel logic level calculation model for leakage currents in digital nano-CMOS circuits," 7th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME), pp. 221-224, 3-7 July 2011.
[14] M. Olivieri and A. Mastrandrea, "Logic Drivers: A Propagation Delay Modeling Paradigm for Statistical Simulation of Standard Cell Designs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 6, pp. 1429-1440, June 2014.
[15] D. Pescovitz, "1972: The release of SPICE, still the industry standard tool for integrated circuit design," Lab Notes: Research from the Berkeley College of Engineering, 2002.
[16] F. Ramundo, P. Nenzi and M. Olivieri, "First integration of MOSFET Band-To-Band Tunneling Current in BSIM4," Microelectronics Journal, vol. 44, no. 1, pp. 26-32, January 2013.
[17] A. Sanyal, A. Rastogi, W. Chen, K. Roy and S. Kundu, "An Efficient Technique for Leakage Current Estimation in Nanoscaled CMOS Circuits Incorporating Self-Loading Effects," IEEE Transactions on Computers, vol. 59, no. 7, pp. 922-932, July 2010.
[18] N. Kapre, "SPICE2 – A Spatial Parallel Architecture for Accelerating the SPICE Circuit Simulator," 2010.
[19] A. Vladimirescu, "LSI Circuit Simulation on Vector Computers," 1982.
[20] R. Poore, "GPU-accelerated time-domain circuit simulation," Custom Integrated Circuits Conference, pp. 629-632, 2009.
[21] M. C. Hansen, H. Yalcin and J. P. Hayes, "Unveiling the ISCAS-85 Benchmarks: A Case Study in Reverse Engineering," IEEE Design & Test, vol. 16, no. 3, 1999.
[22] T. H. Weng, R. K. Perng and B. Chapman, "OpenMP Implementation of SPICE3 Circuit Simulator," International Journal of Parallel Programming, vol. 35, no. 5, pp. 493-505, July 2007.
[23] S. Hutchinson, "The Xyce Parallel Electronic Simulator – An Overview," Proceedings of the International Conference ParCo2001, September 2001.
[24] K. S. Kundert, "Sparse matrix techniques," Circuit Analysis, Simulation and Design, part 1, 1986.
[25] V. B. Kleeberger, M. Barke, C. Werner, D. Schmitt-Landsiedel and U. Schlichtmann, "A compact model for NBTI degradation and recovery under use-profile variations and its application to aging analysis of digital integrated circuits," Microelectronics Reliability, no. 54, pp. 1083-1089, 3 January 2014.
[26] J. B. Velamala, K. B. Sutaria, H. Shimizu, H. Awano, T. Sato, G. Wirth and Y. Cao, "Compact Modeling of Statistical BTI under Trapping/Detrapping," IEEE Transactions on Electron Devices, vol. 60, no. 11, pp. 3645-3654, November 2013.
[27] W. Grabinski, M. Brinson, P. Nenzi, F. Lannutti, N. Makris, A. Antonopoulos and M. Bucher, "Open source circuit simulation tools for RF compact semiconductor device modeling," International Journal of Numerical Modelling, October 2013.
[28] D. Warning, H. Vogt, R. Larice, P. Nenzi and F. Lannutti, "NGSPICE: an Open Platform for Modeling and Simulation," SISPAD 2014, September 2014.