Fast Algorithms for Sparse Matrix Inverse Computations a Dissertation Submitted to the Institute for Computational and Mathemati

FAST ALGORITHMS FOR SPARSE MATRIX INVERSE COMPUTATIONS A DISSERTATION SUBMITTED TO THE INSTITUTE FOR COMPUTATIONAL AND MATHEMATICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Song Li September 2009 c Copyright by Song Li 2009 All Rights Reserved ii I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Eric Darve) Principal Adviser I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (George Papanicolaou) I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Michael Saunders) Approved for the University Committee on Graduate Studies. iii iv Abstract An accurate and efficient algorithm, called Fast Inverse using Nested Dissection (FIND), has been developed for certain sparse matrix computations. The algorithm reduces the computation cost by an order of magnitude for 2D problems. After discretization on an N N mesh, the previously best-known algorithm Recursive x × y 3 Green's Function (RGF) requires O(Nx Ny) operations, whereas ours requires only 2 O(Nx Ny). The current major application of the algorithm is to simulate the transport of electrons in nano-devices, but it may be applied to other problems as well, e.g., data clustering, imagery pattern clustering, image segmentation. The algorithm computes the diagonal entries of the inverse of a sparse of finite- difference, finite-element, or finite-volume type. In addition, it can be extended to computing certain off-diagonal entries and other inverse-related matrix computations. As an example, we focus on the retarded Green's function, the less-than Green's function, and the current density in the non-equilibrium Green's function approach for transport problems. Special optimizations of the algorithm are also be discussed. These techniques are applicable to 3D regular shaped domains as well. Computing inverse elements for a large matrix requires a lot of memory and is very time-consuming even using our efficient algorithm with optimization. We pro- pose several parallel algorithms for such applications based on ideas from cyclic reduction, dynamic programming, and nested dissection. A similar parallel algorithm is discussed for solving sparse linear systems with repeated right-hand sides with significant speedup. v Acknowledgements The first person I would like to thank is professor Walter Murray. My pleasant study here at Stanford and the newly established Institute for Computational and Mathematical Engineering would not have been possible without him. I would also very like to thank Indira Choudhury, who helped me on numerous administrative issues with lots of patience. I would also very like to professor TW Wiedmann, who has been constantly pro- viding mental support and giving me advice on all aspects of my life here. I am grateful to professors Tze Leung Lai, Parviz Moin, George Papanicolaou, and Michael Saunders for willing to be my committee members. In particular, Michael gave lots of comments on my dissertation draft. I believe that his comments not only helped me improve my dissertation, but also will benefit my future academic writing. I would very much like to thank my advisor Eric Darve for choosing a great topic for me at the beginning and all his help afterwards in the past five years. His constant support and tolerance provided me a stable environment with more than enough freedom to learn what interests me and think in my own pace and style, while at the same time, his advice led my vague and diverse ideas to rigorous and meaningful results. This is almost an ideal environment in my mind for academic study and research that I have never actually had before. In particular, the discussion with him in the past years has been very enjoyable. I am very lucky to meet him and have him as my advisor here at Stanford. Thank you, Eric! Lastly, I thank my mother and my sister for always encouraging me and believing in me. vi Contents Abstract v Acknowledgements vi 1 Introduction 1 1.1 The transport problem of nano-devices . .1 1.1.1 The physical problem . .1 1.1.2 NEGF approach . .4 1.2 Review of existing methods . .8 1.2.1 The need for computing sparse matrix inverses . .8 1.2.2 Direct methods for matrix inverse related computation . 10 1.2.3 Recursive Green's Function method . 13 2 Overview of FIND Algorithms 17 2.1 1-way methods based on LU only . 17 2.2 2-way methods based on both LU and back substitution . 20 3 Computing the Inverse 23 3.1 Brief description of the algorithm . 23 3.1.1 Upward pass . 25 3.1.2 Downward pass . 26 3.1.3 Nested dissection algorithm of A. George et al. 29 3.2 Detailed description of the algorithm . 30 3.2.1 The definition and properties of mesh node sets and trees . 31 vii 3.2.2 Correctness of the algorithm . 34 3.2.3 The algorithm . 37 3.3 Complexity analysis . 39 3.3.1 Running time analysis . 39 3.3.2 Memory cost analysis . 44 3.3.3 Effect of the null boundary set of the whole mesh . 45 3.4 Simulation of device and comparison with RGF . 46 3.5 Concluding remarks . 49 4 Extension of FIND 51 4.1 Computing diagonal entries of another Green's function: G< ..... 51 4.1.1 The algorithm . 52 4.1.2 The pseudocode of the algorithm . 53 4.1.3 Computation and storage cost . 56 4.2 Computing off-diagonal entries of Gr and G< ............. 57 5 Optimization of FIND 59 5.1 Introduction . 59 5.2 Optimization for extra sparsity in A .................. 61 5.2.1 Schematic description of the extra sparsity . 61 5.2.2 Preliminary analysis . 64 5.2.3 Exploiting the extra sparsity in a block structure . 66 5.3 Optimization for symmetry and positive definiteness . 72 5.3.1 The symmetry and positive definiteness of dense matrix A .. 72 5.3.2 Optimization combined with the extra sparsity . 74 5.3.3 Storage cost reduction . 75 5.4 Optimization for computing G< and current density . 76 5.4.1 G< sparsity . 76 5.4.2 G< symmetry . 76 5.4.3 Current density . 77 5.5 Results and comparison . 78 viii 6 FIND-2way and Its Parallelization 83 6.1 Recurrence formulas for the inverse matrix . 84 6.2 Sequential algorithm for 1D problems . 89 6.3 Computational cost analysis . 91 6.4 Parallel algorithm for 1D problems . 95 6.4.1 Schur phases . 97 6.4.2 Block Cyclic Reduction phases . 103 6.5 Numerical results . 104 6.5.1 Stability . 104 6.5.2 Load balancing . 105 6.5.3 Benchmarks . 107 6.6 Conclusion . 108 7 More Advanced Parallel Schemes 113 7.1 Common features of the parallel schemes . 113 7.2 PCR scheme for 1D problems . 114 7.3 PFIND-Complement scheme for 1D problems . 117 7.4 Parallel schemes in 2D . 120 8 Conclusions and Directions for Future Work 123 Appendices 126 A Proofs for the Recursive Method Computing Gr and G< 126 A.1 Forward recurrence . 128 A.2 Backward recurrence . 129 B Theoretical Supplement for FIND-1way Algorithms 132 B.1 Proofs for both computing Gr and computing G< ........... 132 B.2 Proofs for computing G< ......................... 136 C Algorithms 140 C.1 BCR Reduction Functions . 141 ix C.2 BCR Production Functions . 142 C.3 Hybrid Auxiliary Functions . 144 Bibliography 147 x List of Tables I Symbols in the physical problems . xvii II Symbols for matrix operations . xviii III Symbols for mesh description . xviii 3.1 Computation cost estimate for different type of clusters . 46 5.1 Matrix blocks and their corresponding operations . 66 5.2 The cost of operations and their dependence in the first method. The costs are in flops. The size of A, B, C, and D is m m; the size of W , × X, Y , and Z is m n........................... 67 × 5.3 The cost of operations and their dependence in the second method. The costs are in flops. 68 5.4 The cost of operations in flops and their dependence in the third method 69 5.5 The cost of operations in flops and their dependence in the fourth method. The size of A, B, C, and D is m m; the size of W , X, Y , × and Z is m n .............................. 70 × 5.6 Summary of the optimization methods . 70 5.7 The cost of operations and their dependence for (5.10) . 71 5.8 Operation costs in flops. 75 5.9 Operation costs in flops for G< with optimization for symmetry. 77 6.1 Estimate of the computational cost for a 2D square mesh for different cluster sizes. The size is in units of N 1=2. The cost is in units of N 3=2 flops. 94 xi List of Figures 1.1 The model of a widely-studied double-gate SOI MOSFET with ultra- thin intrinsic channel. Typical values of key device parameters are also shown. .3 1.2 Left: example of 2D mesh to which RGF can be applied. Right: 5-point stencil. 13 2.1 Partitions of the entire mesh into clusters . 18 2.2 The partition tree and the complement tree of the mesh .

Fast Algorithms for Sparse Matrix Inverse Computations a Dissertation Submitted to the Institute for Computational and Mathemati

Eigenvalues and Eigenvectors of Tridiagonal Matrices∗

Parametrizations of K-Nonnegative Matrices

On Multigrid Methods for Solving Electromagnetic Scattering Problems

Sparse Matrices and Iterative Methods

Scalable Stochastic Kriging with Markovian Covariances

Chapter 7 Iterative Methods for Large Sparse Linear Systems

A Geometric Theory for Preconditioned Inverse Iteration. III: a Short and Sharp Convergence Estimate for Generalized Eigenvalue Problems

A Parallel Solver for Graph Laplacians

(Hessenberg) Eigenvalue-Eigenmatrix Relations∗

Solving Linear Systems: Iterative Methods and Sparse Systems

Explicit Inverse of a Tridiagonal (P, R)–Toeplitz Matrix

The Godunov–Inverse Iteration: a Fast and Accurate Solution to the Symmetric Tridiagonal Eigenvalue Problem