Advancing nano-CMOS circuits simulation: convergence optimization, GPU parallelization and reliability analysis
By Francesco Lannutti
A Thesis submitted to the Department of Information Electronic and Telecommunication Engineering (DIET) University of Rome “La Sapienza” In partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
Ing. Francesco Menichelli (Thesis Supervisor)
Prof. Alberto Sangiovanni-Vincentelli (Second Thesis Supervisor)

Abstract
Advancing nano-CMOS circuits simulation: convergence optimization, GPU parallelization and reliability analysis By Francesco Lannutti
Submitted to the Department of Information Electronic and Telecommunication Engineering (DIET) University of Rome “La Sapienza” In partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
The idea of an open-source circuit simulator originated in the 1960s at Berkeley, from a team led by Prof. Ron Rohrer and Prof. Donald Pederson, who strongly believed that students could learn electronics in depth only through an instrument that let them participate actively in designing new circuits and analyzing existing ones (from Prof. Andrei Vladimirescu's presentation at MOS-AK 2013). At the beginning it was CANCER (Computer Analysis of Nonlinear Circuits, Excluding Radiation), then SPICE1 and SPICE2 (Simulation Program with Integrated Circuit Emphasis), released several times over the subsequent years; both were developed by Laurence Wolfgang (Larry) Nagel during his PhD, under the supervision of Prof. Donald Pederson. Over time, SPICE became so crucial that several semiconductor companies supported the program in house (e.g., Intel, TI, ST, and AT&T) and added models for devices that depended on internally developed technology. Several small companies (e.g., Meta Software with HSPICE, OrCAD with PSPICE) modified SPICE slightly and sold their products on the market. Later on, Cadence Design Systems developed SPECTRE, based on the research carried out by Kenneth Kundert and Jacob White at Berkeley, which featured robust integration
methods. SPICE and its derivatives are now the most widely used software in the world for accurate analog circuit simulation and for standard cell characterization in digital flows. Fortunately, open versions of SPICE can still be found on the web; the best known and most used is NGSPICE, an open-source version that integrates SPICE3, CIDER (which combines SPICE3 with an internal C-based device simulator, DSIM) and XSPICE, developed at Georgia Tech. I have chosen NGSPICE for my research because of the strong support of its community, which provides invaluable user feedback during the research. Despite that, NGSPICE still suffers from both a coding and an algorithmic point of view. The former stems from its old code base and affects performance and memory occupancy; the latter is even worse, because the use of old algorithms in certain areas can cause quality-of-results issues and a usability problem that makes NGSPICE difficult to adopt in some research sectors, including the development of new device Compact Models in Verilog-A, which currently must be converted to C code by hand. My PhD research activity concerns the simulation of nano-scale CMOS integrated circuits and is focused on the following four topics:
1) Allowing the efficient simulation of very large circuits, including convergence techniques and the use of a GPU to speed up the analysis
2) Speeding up the simulation by leveraging parallelization algorithms
3) Implementing the Reliability Analysis to stress test the emergent technology nodes
4) Introducing algorithmic techniques to improve elaboration capacity and versatility and to allow the derivation of compact models needed to describe new materials, new devices and new device effects
During the first year of activity, I developed a new convergence algorithm that attracted interest in the scientific community.
It was presented at MOS-AK 2013 in Bucharest in September 2013 and at PRIME 2014 in Grenoble in July 2014 [1], where it received the Silver Leaf Award from the conference committee. The results achieved during the second year sped up the simulation by up to 3.74 times, by using GPU acceleration and a novel approach that integrates the outputs of several independent models into a single Circuit Matrix and RHS. The obtained speedup means decreasing the elapsed
time, for example, from 1 hour with the CPU version of NGSPICE to 15 minutes with GPU acceleration. During the third year, the Reliability Analysis was successfully implemented in NGSPICE, producing a new framework that can leverage state-of-the-art aging models and can consider both short term and long term behaviors. The fourth line of activity started at the end of the second year and continued throughout the third year; due to its complexity it is unfortunately not yet complete, yet it has already produced a Verilog-A Model Compiler with a complete Lexer and Parser chain and a TCL user interface.
Table of contents
CHAPTER 1 - INTRODUCTION ...... 1
1.1 THE NGSPICE PROJECT ...... 2
1.2 AUTOTOOLS TOOLCHAIN ...... 3
1.3 THE AIM OF MY PHD ...... 4
CHAPTER 2 - KLU – IMPROVEMENTS OVER MY MASTER’S THESIS RESULTS ...... 5
2.1 BINDING TABLE SORTING ...... 6
2.2 KLU SUPPORT FOR ALL THE NGSPICE DEVICE MODELS ...... 7
2.3 KLU ENHANCEMENT IN SYMBOLIC FACTORIZATION ...... 7
2.4 AN IDEA FOR THE PARALLELIZATION ...... 8
CHAPTER 3 - PERIODIC STEADY STATE ANALYSIS ...... 10
3.1 INTRODUCTION ...... 10
3.2 STABILIZATION ...... 13
3.3 SHOOTING ...... 14
3.4 PERIODIC STEADY STATE ...... 16
3.5 RESULTS ...... 17
CHAPTER 4 - A NOVEL CONVERGENCE ALGORITHM BASED UPON KCL VERIFICATION ...... 20
4.1 THE NEWTON-RAPHSON METHOD ...... 24
4.2 THE STATE OF THE ART OF CONVERGENCE ALGORITHMS FOR CIRCUIT SIMULATION ...... 25
4.3 THE FALSE CONVERGENCE PHENOMENON ...... 26
4.4 KCL VERIFICATION ...... 26
4.5 F(VK) IN SPICE ...... 27
4.6 NODES CLASSIFICATION ...... 28
4.7 CONTRIBUTIONS IN THE BSIM4 MODEL ...... 28
4.8 A NEW HOMOTOPY METHOD FOR FASTER CONVERGENCE ...... 33
4.9 IMPLEMENTATION RESULTS ...... 35
4.10 LINEAR AND NON-LINEAR SEPARATION ...... 36
CHAPTER 5 - CUSPICE – THE REVOLUTIONARY NGSPICE ON CUDA PLATFORM ...... 43
5.1 INTRODUCTION ...... 43
5.2 THE FERMI ARCHITECTURE ...... 44
5.3 THE FERMI ARCHITECTURE MEMORY ACCESS ...... 47
5.4 THE CUDA PLATFORM ...... 48
  5.4.1 The basic CUDA application ...... 48
  5.4.2 The kernel in detail ...... 49
5.5 THE NEW KEPLER ARCHITECTURE ...... 51
  5.5.1 The new Streaming Multiprocessor and the memory ...... 51
  5.5.2 The scheduling and the energy efficiency ...... 52
5.6 THE GPU PARALLELIZATION OF THE DEVICE MODEL EVALUATION ...... 52
  5.6.1 From Linked Lists to Structures of Arrays ...... 53
  5.6.2 The Topology Matrix Method ...... 60
  5.6.3 Other code refinements to achieve awesome performances ...... 65
5.7 RESULTS ...... 66
CHAPTER 6 - RELIABILITY ANALYSIS ...... 72
6.1 THE AGING MODEL ...... 75
6.2 THE SHORT TERM BEHAVIOR ...... 76
6.3 THE LONG TERM BEHAVIOR ...... 77
6.4 THE IMPLEMENTATION IN NGSPICE ...... 78
6.5 RESULTS ...... 97
CHAPTER 7 - VERILOG-A MODELS COMPILER ...... 102
7.1 COMPACT MODELS ...... 102
7.2 THE VERILOG-A STANDARD ...... 102
  7.2.1 A simple example ...... 103
7.3 WHAT IS A COMPILER ...... 104
7.4 WHY A COMPILER IS NEEDED ...... 104
7.5 EXISTING VERILOG-A COMPILERS ...... 105
7.6 WHY A NEW VERILOG-A COMPILER IS NEEDED ...... 105
7.7 THE AIM OF MY VERILOG-A MODELS COMPILER (VAMC) ...... 106
  7.7.1 The Front-End Implementation ...... 107
  7.7.2 The Back-End Implementation ...... 109
  7.7.3 Results ...... 111
CHAPTER 8 - CONCLUSION ...... 112
8.1 FUTURE WORKS ...... 114
Chapter 1 - Introduction
The idea of an open-source circuit simulator originated in the 1960s at Berkeley, from a team led by Prof. Donald Pederson, who strongly believed that students could learn electronics in depth only through an instrument that let them participate actively in designing new circuits and analyzing existing ones. At the beginning it was CANCER (Computer Analysis of Nonlinear Circuits, Excluding Radiation), a name reflecting the fact that Berkeley was a symbol of liberalism in the 1960s, when most circuit simulators were developed under the US Department of Defense. CANCER was developed by Laurence Wolfgang (Larry) Nagel and other students under the supervision of Prof. Ronald Rohrer. When Prof. Rohrer left Berkeley, the program supervisor became Prof. Donald Pederson, who wanted the circuit simulator to be available in the public domain. And so SPICE was born. The first public release of this new circuit simulator was SPICE1, which came out in 1973. It was written in FORTRAN and supported only Nodal Analysis, which cannot handle inductors, voltage sources and some controlled sources. In addition, it had only a few components in its library and its Transient Analysis was based on a fixed time step. For these reasons, the first popular release of SPICE was SPICE2, which came out in 1975 and was coded in FORTRAN as well. Larry replaced Nodal Analysis with Modified Nodal Analysis to overcome these limitations and introduced the variable time step, based on the Trapezoidal or Gear methods for the numerical integration step. This SPICE version was maintained until 1983, up to the 2G6 release, when another student, Thomas Quarles, took Larry's place in developing the tool, under the supervision of Prof. Richard Newton, Prof. Donald Pederson and Prof. Alberto Sangiovanni-Vincentelli.
This work produced a new mainstream code base of SPICE, called SPICE3, which added support for an X11 user interface, called NUTMEG, developed by another student, Wayne Christopher, and moved the code from FORTRAN to C. With this new code base, the license became BSD, and it remained so until the latest version developed at UC Berkeley, 3F5, in 1993. In the meantime, Kenneth Kundert developed a very efficient and fast sparse matrix solver, called SPARSE, to replace the previous solver, called SMP, under the supervision of Prof. Alberto Sangiovanni-Vincentelli.
With the passing of time, SPICE became so crucial that several semiconductor companies supported the program in house (e.g., Intel, TI, ST, and AT&T) and added models for devices that depended on internally developed technology. Several small companies (e.g., Meta Software with HSPICE, OrCAD with PSPICE) modified SPICE slightly and sold their products on the market. Kenneth Kundert developed SPECTRE at the University of California at Berkeley, under the supervision of Prof. Alberto Sangiovanni-Vincentelli, and then joined Cadence Design Systems, where he developed the industrial version of SPECTRE with the support of Jacob White, who had carried out his research on circuit simulation at Berkeley, also under Prof. Sangiovanni-Vincentelli, before joining MIT. SPECTRE featured a different circuit formulation method and deployed novel, robust integration methods. SPICE, its derivatives and SPECTRE are now the most widely used software in the world for accurate analog circuit simulation and standard cell characterization for digital flows.
1.1 The NGSPICE Project
Fortunately, open versions of SPICE can still be found on the web; the best known and most used is NGSPICE, an open-source version started in 1999 from an idea of Paolo Nenzi, which integrates SPICE3, CIDER (which combines SPICE3 with an internal C-based device simulator, DSIM) and XSPICE, developed at Georgia Tech. Its name means New Generation SPICE, but its core is still based on SPICE3F5 (the latest version released by Berkeley). By now this part has changed so much that it is almost impossible to recognize it: many memory leaks have been fixed, especially in the frontend, and several new models and input language commands have been added. In addition, some algorithms have been implemented to enhance existing NGSPICE functionality or to add new features. NGSPICE keeps the BSD license and is freely available as a Git repository on SourceForge, together with the source code and compiled versions for Windows, Mac and Linux. The repository contains many experimental branches, used to develop new ideas without polluting the master branch, which hosts the code to be released. The release cycle is not fixed, but a new version is released approximately every year.
NGSPICE is written in C, like SPICE3F5, and is built with the Autotools toolchain, an automation system which is very powerful, especially for open-source software, even though alternatives exist, such as CMake. All the work I have done is integrated in NGSPICE, both because it is a complete open-source platform that respects the original SPICE criteria, and because Paolo Nenzi was part of my university for a long time; he helped me start my Master's Thesis and understand the basics of NGSPICE.
1.2 Autotools Toolchain
The Autotools toolchain is composed of several programs that act together to simplify the compilation process for the developer and to easily produce the final binaries (in general more than one). They are:
• Autoconf
• Automake
• Libtool
The first converts a description contained in 'configure.ac', written in a macro language called M4, into the 'configure' executable. The second converts a description of the Makefile, written in another dedicated language, into the real Makefile. It is capable of writing very complex Makefiles starting from quite simple descriptions, without the developer even knowing the Makefile language. The distinction between their jobs is not clear-cut, because they are parts of the same global goal. The third is optional in the toolchain, but it is needed when the Makefiles build libraries, which can be installable or not installable (placeholders). In NGSPICE, the 'configure.ac' file is quite long and complex, because it has to configure the circuit simulator for each supported platform (Windows, Mac and Linux) and also configure the various components the user wants in the final build, such as XSPICE, CIDER, the PSS analysis, KLU, and so on. The Makefile structure is globally quite simple: there is one Makefile for each folder/subfolder, which builds all the files in that folder into a library and/or acts recursively using the 'SUBDIRS' construct. Only the main Makefile in the root folder and the one under the 'src' folder are different. The first is richer in the extra files that have to be deleted, naming them
one by one, when the developer wants to remove all the residue of the compilation process with the 'make maintainer-clean' command, while the second is the real main Makefile, because it assembles all the libraries back together, in a particular order and into a particular library superset, before they are linked into the final NGSPICE executable. At least two executables are built there, because besides NGSPICE there is also NUTMEG, as a stand-alone program. Libraries in NGSPICE are just placeholders, because it was decided long ago to produce only static libraries and use static linking. This means that all the 'Makefile.am' files for Automake and Libtool use the 'noinst' keyword prefix for all the libraries, so that they are produced only during the compilation process and not installed in the final folder when the 'make install' command is used. They are placeholders in the sense that they act as a package for each folder/subfolder, but in the end they are all linked together into the final executable.
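As an illustration of the 'noinst' placeholder convention described above, a minimal Automake fragment for one source subfolder might look like the following sketch (file, folder and library names are hypothetical, not the actual NGSPICE ones):

```
## Makefile.am for a hypothetical source subfolder
AM_CPPFLAGS = -I$(top_srcdir)/src/include

## Recurse into nested folders first
SUBDIRS = sub1 sub2

## 'noinst_' builds a static convenience library that is linked into the
## final executable but never copied anywhere by 'make install'
noinst_LTLIBRARIES = libexample.la
libexample_la_SOURCES = example1.c example2.c
```

The top-level Makefile.am then lists these convenience libraries among the link dependencies of the final executable, so everything ends up statically linked together.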
1.3 The aim of my PhD
I started my PhD adventure after becoming passionate about the world of circuit simulators, and NGSPICE in particular. I decided to pursue the PhD with the aim of enhancing this open-source circuit simulator and making available to the public domain, in open-source form, a modern simulator capable of overcoming known issues, including the false convergence phenomenon, of speeding up simulation using modern NVIDIA GPUs, of supporting new Compact Models written in Verilog-A, and more. It is an unusual research topic compared to the ones we are used to in the Electronic Engineering world, because it is more related to Computer Science, Linear Algebra and Graph Theory, but it has a very practical impact and usage in the designer community. The research path I followed during my PhD is not only theoretical: there is working code, committed in some branches of NGSPICE, showing very good results, which every designer can access, try and further enhance. The only exception is the Verilog-A Models Compiler, which is unfortunately not completed yet, because it is a huge project that requires a lot of time and extra knowledge, including what a parser is and how to build a proper one using valid instruments such as Bison and Flex. However, a lot of code is already done and working; in fact, the Front-End is complete, apart from some modifications needed to satisfy operator precedence, and part of the Back-End is complete too.
Chapter 2 - KLU – Improvements over my Master’s Thesis results
During my Master's Thesis I analyzed in detail and ported a new sparse matrix solver, called KLU, to NGSPICE, with the ultimate intent of speeding up the entire simulation. KLU was created by Ekanathan Natarajan Palamadai, under the supervision of Prof. Tim Davis, as a Master of Science project at the University of Florida in 2005, and it is specifically designed for circuit simulation. It is based on a chain of several algorithms which:
1. Reorder the sparse matrix to avoid zeroes along the main diagonal (Maximum Transversal)
2. Partition the sparse matrix into sub-matrices (BTF – Block Triangular Form – optional)
3. Reorder the sub-matrices to reduce fill-in during the LU factorization (AMD or COLAMD)
4. Factorize the sub-matrices using the Gilbert-Peierls LU decomposition, with partial pivoting, and the Eisenstat-Liu symmetric pruning
The first three steps form the symbolic factorization, which is then used to perform the numerical factorization, including the partial pivoting calculation. This flow applies to the first factorization, or whenever the pivoting is not reusable because the sparse matrix has become unstable; otherwise the same symbolic factorization and the same pivoting can be reused for multiple subsequent factorizations. In SPICE this case is very frequent, so KLU also makes available another feature, called Re-Factorization. KLU also includes other features needed in a circuit simulator:
• Linear system solution, after the LU decomposition
• Complex number support, storing real and imaginary parts separately as packets of two numbers each
During my PhD, I extended the work I had done on this topic [2], by enhancing the data structure, bringing the speedup from 11X to 13X, and by adding support for all the NGSPICE device models.
2.1 Binding Table Sorting
The first enhancement concerns the binding table I used to mimic the associative array concept, not available in C, to store the pointers to the SPARSE 1.3, KLU Real and KLU Complex representations of the circuit matrix. Each device model has to access the table to retrieve the correct pointer for a given coordinate location (x, y), repeated for all the coordinates each device instance has to know. This operation is executed only once for Transient and OP Analyses, because each model only has to change its pointers to the circuit matrix from the SPARSE 1.3 representation to the KLU Real representation; it is executed several times for an AC Analysis, because conversions from KLU Real to KLU Complex and vice versa are needed in addition to the previous one, which is still performed during the initial OP Analysis. It is therefore clear that optimizing this part is crucial to achieve a good overall speedup. Moreover, the PZ and Sensitivity Analyses change the sparse matrix on the fly, or through the devices' setup routine, so a process faster than this conversion must also be developed to further speed up these analyses. I am evaluating the possibility of getting rid of SPARSE 1.3 when KLU is used, by using the COO format as the starting point for building the CSC representation KLU needs, but this poses some challenges. Then, in order to enhance the speed, I worked on the data structure by introducing a sorting algorithm, in particular Quick Sort, because the sorted Binding Table can then be exploited by Binary Search, which is far more efficient than Linear Search, since the computational cost decreases from O(n) to O(log n). The Binding Table has been implemented as an array of structures, so that each packet holds three pointers (at the moment SPARSE 1.3, KLU Real and KLU Complex) and can easily be extended to any number of pointers, if necessary. However, this does not give any speedup per se.
In order to benefit from this new data structure, I added a new pointer inside each device model data structure to store the address of each needed packet inside the Binding Table, sacrificing some space for time, but achieving O(1) in all the subsequent pointer swaps, needed for example in the AC Analysis, as mentioned before, instead of repeating the Binary Search every time. Exchanging space for time is a well-accepted paradigm nowadays, since space costs less than time, so the idea put in place here is generally valid.
2.2 KLU support for all the NGSPICE device models
The second enhancement extends KLU support to all NGSPICE device models. My first implementation of KLU, during my Master's Thesis, supported only a reduced set of device models, namely the ones needed to execute the ISCAS85 Benchmarking Suite: BSIM4, Capacitor, Resistor and Voltage Source. Now, with this enhancement, practically all the models are supported, with the sole exception of URC, which remains unsupported because its data structure differs from the one needed by KLU.
2.3 KLU enhancement in Symbolic Factorization
The third enhancement is an idea that came to mind recently, when I recalled an analysis I made long ago during my Master's Thesis. The Symbolic Factorization is divided into two steps, called in series, one right after the other:
• lsolve_symbolic
• construct_column
The first routine loops over all the columns of the CSC representation, taking into account the reordering applied along the way, and calls the DFS to extract the column indices impacted by the current reordered column, but only if that column is inside the current block; otherwise it does nothing and simply continues the loop. This is a waste of computation, because the second routine scatters all the columns from the CSC representation to a dense representation (a vector), again taking the reordering into account, but this time processing all the columns in the loop, without skipping the ones outside the current block. After this analysis, it is evident that the first loop is equal to the second one, with the only difference that the first one skips some columns, so it is definitely possible to merge them into a single loop, compressing the two routines into one and assigning the proper work to each category of column. The speedup is unfortunately not very evident, only about 1%, but the modification is so simple that it is worth doing. The modification acts only on the off-block columns, so it depends on the Block Triangular Form, which in turn depends on the circuit under analysis. The speedup would be more evident if the number of off-block columns were high, which is generally not the case,
because circuits almost always have one big block and a few small ones, produced by the strongly connected components algorithm.
2.4 An idea for the parallelization
Finally, an idea that came to mind during my Master's Thesis, but was realized only in part, exploits an interesting article by Dr. Maxim Naumov from NVIDIA [3]. Basically, the Breadth First Search (BFS) algorithm can be used in place of the DFS used by KLU to find the elimination tree, which encodes the sequence of independent operations that can be exploited in the sparse matrix. However, simply substituting the DFS with the BFS changes the heart of the Gilbert-Peierls LU factorization algorithm [4], breaking the principle that the factorization is carried out in a number of operations proportional to the multiplications between L and U, which is the theoretical minimum; moreover, the Eisenstat-Liu pruning, which cuts the computation time of the DFS, is no longer applicable. So, while it is true that BFS exposes parallelism, it is also true that the amount of required computation is no longer minimal, so the available parallelism cannot in general compete with the amount of computation needed by the serial version of the Gilbert-Peierls algorithm. Moreover, if the GPU is the hardware where the parallelism is exploited, a clock frequency reduction must also be taken into account. In fact, implementing this modification on a GPU yields only a slowdown, which can be quite large in certain cases, as shown in Figure 1.
Figure 1 – Slowdown of the GPU version of KLU with respect to the serial one
An enhancement could therefore be made to this idea: it is possible to use the standard Gilbert-Peierls algorithm with Eisenstat-Liu pruning in the symbolic factorization, to extract which columns are needed during the next numerical factorization, and then run the BFS on top of that to extract the available parallelism. Since the available parallelism will be very limited, this solution is not suitable for the GPU, but it could be worth trying on a multi-core CPU, where the available cores are typically 4-8 nowadays. The available parallelism depends on the current column, because a DFS, and hence a BFS, is launched for each column of L, considering also the fill-ins; at the beginning the amount will be almost null, but it can increase along the way, as shown in Figure 2. If there is no available parallelism for a certain column, the execution will be serial, but at least the CPU frequency is unchanged, while the GPU generally runs at less than a quarter of the CPU frequency.
[Chart: number of parallel operations for each factorization column on «c432»; the measured parallelism ranges from 0 to about 200.]
Figure 2 – Number of Parallel Operations for each column of the 'c432' circuit of the ISCAS85
Chapter 3 - Periodic Steady State Analysis
3.1 Introduction
RF analyses are one of the simulation classes generally available in a circuit simulator, but NGSPICE has none of them; only an experimental Periodic Steady State Analysis existed when I started looking into this topic. That experimental code had been written by a friend of mine, Stefano Perticaroli, based on his PhD dissertation, but it was full of bugs and memory leaks. So we decided to start over: he explained to me the main algorithm he had intended to code, and we began a new implementation from scratch, adding at the same time new ideas of mine to reach the goal. The Periodic Steady State Analysis is the basic analysis for RF systems; it serves to investigate the large signal and/or non-linear behavior of both autonomous and non-autonomous periodic systems. Figure 3 illustrates the two categories: the first comprises systems without inputs that have at least one output, like oscillators, while the second comprises all the systems in which an RF signal is applied to the input and the response can be acquired at the output, like an LNA or a mixer.
Figure 3 – Autonomous and non-autonomous periodic systems
Autonomous systems differ from non-autonomous ones in that only the initial conditions are known for the former, while for the latter the working frequency is also known; for autonomous systems, that frequency is in fact one of the goals of the Periodic Steady State Analysis. In both cases, the voltage at each node and the current at each branch are to be retrieved from the analysis. There are several ways to reach the Periodic Steady State solution:
- Transient Shooting
- Harmonic Balance
- Mixed Domain Techniques
Since the first is preferred for highly non-linear and/or large circuits, it is the method chosen to implement the Periodic Steady State Analysis in NGSPICE. The Transient Shooting method tries to solve the following equation:

f(x(t), t) = \frac{d\,q(x(t))}{dt} + i(x(t)) + u(t) = 0

Equation 1 – Transient Shooting formulation
The PSS state is reached when:

x(t + T) - x(t) = 0

Equation 2 – Periodic Steady State condition
Since the period T is unknown for autonomous circuits, a closed form solution does not exist in this case and a residual error must be accepted, leading to the formulation in Equation 3:

x(t + T) - x(t) = \varepsilon

Equation 3 – Periodic Steady State condition using quantized quantities and a residual error
To achieve this goal, the original Transient Analysis has been modified by adding the Shooting Method, which is divided into three steps, as illustrated in Figure 4:
1. Stabilization
2. Shooting
3. Periodic Steady State and automatic frequency correction
Figure 4 – Block diagram of the PSS implementation in case of autonomous circuits
3.2 Stabilization
The first part is basically the free evolution of the system, like a pure Transient Analysis, since some time is required to let the system settle into its regime behavior; the user has an analysis parameter to tweak in order to give the system at least the minimum amount of time needed to reach stability. This user parameter is mandatory, because the required amount of time depends on the circuit. This concept is depicted in Figure 5:
Figure 5 – Stabilization of a Van Der Pol oscillator during a Periodic Steady State Analysis
3.3 Shooting
After the stabilization, the shooting block tries to retrieve the period, starting from a user-specified guessed frequency, which depends on the circuit, and taking the RHS as a Pseudo State Vector, because it contains the equivalent conductances of components with memory plus other contributions, including components without memory, independent sources and controlled sources. The Shooting Method would actually require the state matrix, which expresses the charges and fluxes inside the circuit and is therefore composed only of reactive linear components, like capacitors and inductors; but this matrix is not available in NGSPICE, since every device model adds its contribution to the circuit matrix, and the state matrix cannot be extracted afterwards by simply post-processing the circuit matrix. To have it, all the device models would have to be changed to write two matrices, the circuit matrix and the state matrix, but this raises another issue: the basic formulation assumes simple circuits, which have a simple state matrix, but what is the state matrix in the general case, for example when a node in the circuit has no capacitors or inductors attached? Should that node be empty, or should it be collapsed? Instead of answering this question, NGSPICE uses the circuit matrix as the state matrix. During the natural evolution of the system in Transient Analysis, the mean quadratic errors of the Pseudo State Vector and of its derivative are calculated. They are then used, together with the Pseudo State Vector error, to update the period, which can either serve as the starting point of the next shooting cycle or be the final one. This concept is very powerful, but it requires a fixed time step, which is unfortunately not available in NGSPICE, because the time step depends on the Local Truncation Error, which is calculated by each device model separately; the minimum over all models is then taken.
Figure 6 – Time steps are not constant due to Local Truncation Error
ADVANCING NANO-CMOS CIRCUITS SIMULATION 14
However, this conflicts with what the shooting cycle is trying to achieve, because the period update happens near the end of the period. To reconcile the two, the implementation injects artificial breakpoints so that the next time step lands at least close to the one calculated by the shooting cycle, as Figure 6 shows.
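The breakpoint mechanism can be sketched as a clamp on the LTE-chosen step (a minimal illustration, not NGSPICE code; the function name is hypothetical): time advances by the step the Local Truncation Error allows, but never past the next injected breakpoint.

```python
def next_time_point(t, lte_step, breakpoints):
    """Advance simulated time by the LTE-chosen step, but never step over
    the next artificial breakpoint injected by the shooting cycle."""
    future = [b for b in breakpoints if b > t]
    if future:
        return min(t + lte_step, min(future))
    return t + lte_step
```

With a breakpoint at t = 0.25, a proposed step of 0.3 from t = 0 is truncated so that the simulator lands exactly on the breakpoint, which is how the period boundary computed by the shooting cycle can be hit despite the variable time step.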
The choice of the tolerance parameters heavily influences the convergence accuracy and the speed of the algorithm. In order to guarantee convergence at each shooting cycle, the conditions in Equation 4 and Equation 5 are checked. They must hold for each node or branch before convergence can be claimed.
voltage_error ≤ max(v) ∗ rel_tol + v_abstol ∗ tr_tol ∗ steady_coeff
Equation 4 – Voltage error calculation for convergence purposes
current_error ≤ max(i) ∗ rel_tol + i_abstol ∗ tr_tol ∗ steady_coeff
Equation 5 – Current error calculation for convergence purposes
The steady coefficient is a new parameter available only in the Periodic Steady State Analysis, while the other parameters already exist in the Transient Analysis, even though their values can differ. Their default values are shown in Table 1:

Parameter    Default Value
rel_tol      0.001
v_abstol     1 µV
i_abstol     1 nA
tr_tol       7
Table 1 – Default Values of Periodic Steady State Analysis convergence parameters
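A convergence test following the form of Equation 4 can be sketched as below, using the defaults of Table 1. The function name is illustrative, and the default of 1.0 for the steady coefficient is an assumption, since its value is not stated in the text.

```python
def voltage_converged(v_error, v_max,
                      rel_tol=1e-3, v_abstol=1e-6,
                      tr_tol=7, steady_coeff=1.0):
    """Node-voltage convergence check in the style of Equation 4: the error
    must fall below a relative term plus a scaled absolute term.
    steady_coeff defaults to 1.0 here as an assumption."""
    return v_error <= v_max * rel_tol + v_abstol * tr_tol * steady_coeff
```

The current check of Equation 5 is analogous, with i_abstol in place of v_abstol. With v_max = 1 V, an error of 10 µV passes the test while an error of 10 mV does not.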
3.4 Periodic Steady State
The third part is the search for the Periodic Steady State, after the shooting phase is completed. At this point the period is known, so the system is left to evolve for one more period in order to collect the information needed for plotting (discussed later on) in the Time and Frequency Domains, through a DFT routine which calculates the harmonics, whose number is specified as a user parameter in the input netlist. The output of the DFT is also used for the automatic correction of the final frequency: the shooting cycle can converge to a local minimum when the guessed frequency provided as starting point is far from the real one, so a multiple of the real period can be proposed as the result. To correct this automatically, one property is exploited: if the period is the correct one and not a multiple, the harmonic with the highest magnitude is the first one. By post-processing the result of the DFT this condition is checked and, if it is not verified, another Periodic Steady State Analysis is launched on top of the current one using the frequency of the strongest harmonic as the new guess; otherwise the analysis terminates. All the plots in both the Time and Frequency Domains are retained, even when the final frequency is not correct. The intent of this algorithm is shown in Figure 7:
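The DFT-based check can be sketched as follows (an illustration under assumed names, not the NGSPICE routine): the samples covering one candidate period are transformed, DC is discarded, and the index of the strongest harmonic is inspected. An index of 1 means the candidate period is accepted; an index k > 1 means the analysis converged to k times the true period, and the frequency of that strongest harmonic becomes the new guess.

```python
import numpy as np

def strongest_harmonic(samples):
    """samples covers exactly one candidate period. Return the index of the
    strongest DFT harmonic: 1 means the candidate period is correct, k > 1
    means the candidate period is k times the true one."""
    mags = np.abs(np.fft.rfft(samples))
    mags[0] = 0.0  # discard the DC component before ranking harmonics
    return int(np.argmax(mags))

t = np.linspace(0.0, 1.0, 256, endpoint=False)
k = strongest_harmonic(np.sin(2 * np.pi * 2 * t))  # two cycles in the window
```

Here the window spans two true periods, so the strongest harmonic is the second one and the candidate period would be divided by k = 2 (equivalently, the new guessed frequency is k times the candidate fundamental) before relaunching the analysis.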
Figure 7 – The Periodic Steady State Analysis is trapped in a local minimum and the DFT is used to get out
3.5 Results
Finally, some results of the implemented Periodic Steady State Analysis are shown in the figures below.
Figure 8 – Colpitts Oscillator
Figure 9 – Complementary CMOS Oscillator
Figure 10 – Hartley Oscillator
Figure 11 – Vackar Oscillator
Figure 12 – Van Der Pol Oscillator
Chapter 4 - A novel convergence algorithm based upon KCL verification
One of the issues of SPICE since day zero is the so-called False Convergence Phenomenon: the Newton-Raphson cycle is believed to be completed, and convergence therefore reached, even though this is not true, because KCL is not actually satisfied. This behavior is due to the original algorithm put in place by Larry Nagel in SPICE, which simply checks whether a current has reached convergence against a threshold. This criterion is arbitrary: it is suitable for direct unknowns, but it is wrong for indirect unknowns, which is the case when a current does not come from an independent current source but is calculated inside a device model, starting from the voltages applied to it. That said, let us step back and recall some background information in order to introduce a formal description of the problem and its solution. To solve a linear circuit, it is possible to rely on KCL or KVL: the first states that the sum of the currents entering each node of the circuit is zero, the second that the sum of the voltages around each closed loop is zero. SPICE uses the KCL formulation. From the algorithmic point of view, SPICE represents each node of the circuit with a number, which is actually the row and column index of the circuit matrix being assembled to form the linear system to be solved; this approach takes the name of Nodal Analysis. A special case exists: the reference node, or ground, is always indicated as node number 0 and it is not stored inside the matrix, but only appears in the RHS.
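Nodal Analysis and the special treatment of the ground node can be sketched with element "stamps" (a minimal illustration of the standard technique; the helper names are not from NGSPICE). Node k is mapped to matrix row/column k-1, and node 0 contributes no row or column at all.

```python
import numpy as np

def stamp_resistor(G, n1, n2, R):
    """Stamp a resistor between nodes n1 and n2 into the nodal matrix G.
    Node 0 is ground: it gets no row or column, matching SPICE's convention."""
    g = 1.0 / R
    for a, b in ((n1, n2), (n2, n1)):
        if a != 0:
            G[a - 1, a - 1] += g
            if b != 0:
                G[a - 1, b - 1] -= g

def stamp_current_source(rhs, n_from, n_to, current):
    """An independent current source contributes only to the RHS vector."""
    if n_from != 0:
        rhs[n_from - 1] -= current
    if n_to != 0:
        rhs[n_to - 1] += current

# Small ladder: 1 A pushed into node 1, 1 ohm from node 1 to node 2,
# 2 ohm from node 2 to ground (node 0).
G, rhs = np.zeros((2, 2)), np.zeros(2)
stamp_resistor(G, 1, 2, 1.0)
stamp_resistor(G, 2, 0, 2.0)
stamp_current_source(rhs, 0, 1, 1.0)
v = np.linalg.solve(G, rhs)  # node voltages: v[0] = 3 V, v[1] = 2 V
```

Solving G·v = rhs gives 3 V at node 1 and 2 V at node 2, which is what the series current of 1 A through the two resistors predicts; each row of the system is exactly the KCL balance at one non-ground node.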
Figure 13 – Ladder Network
Considering a very simple three-node circuit, called a Ladder Network, KCL can be formulated in this way: