Trace-Penalty Minimization for Large-Scale Eigenspace Computation


Yin Zhang
Department of Computational and Applied Mathematics, Rice University, Houston, Texas, USA
Co-authors: Zaiwen Wen, Chao Yang and Xin Liu (SJTU, Shanghai; LBL, Berkeley; CAS, Beijing)
CAAM VIGRE Seminar, January 31, 2013

Outline
1. Introduction: Problem Description; Existing Methods
2. Motivation: Large Scale Redefined; Avoiding the Bottleneck
3. Trace-Penalty Minimization: Basic Idea; Model Analysis; Algorithm Framework
4. Numerical Results and Conclusion: Numerical Experiments and Results; Concluding Remarks

Section I. Eigenvalue/Eigenvector Computation: Fundamental, yet Still Challenging

Problem Description and Applications

Given a symmetric n × n real matrix A, the full eigenvalue decomposition is

    AQ = QΛ,    (1.1)

where Q ∈ R^{n×n} is orthogonal and Λ ∈ R^{n×n} is diagonal, with λ_1 ≤ λ_2 ≤ … ≤ λ_n on the diagonal. The k-truncated decomposition (k largest or smallest eigenvalues) is

    AQ_k = Q_k Λ_k,    (1.2)

where Q_k ∈ R^{n×k} has orthonormal columns, k ≪ n, and Λ_k ∈ R^{k×k} is diagonal, carrying the k largest or smallest eigenvalues.

Applications

A basic problem in numerical linear algebra, with various scientific and engineering applications:
- Lowest-energy states (materials science, physics, chemistry): density functional theory for electronic structure; nonlinear eigenvalue problems.
- Singular value decomposition: data analysis (e.g., PCA); ill-posed problems; matrix rank minimization.

Two trends: increasingly large-scale sparse matrices, and increasingly large portions of the spectrum.

Some Existing Methods

Books and surveys:
- Saad, 1992, "Numerical Methods for Large Eigenvalue Problems"
- Sorensen, 2002, "Numerical Methods for Large Eigenvalue Problems"
- Hernández et al., 2009, "A Survey of Software for Sparse Eigenvalue Problems"

Krylov subspace techniques: Arnoldi and Lanczos methods – ARPACK (eigs in Matlab); Sorensen, 1996, "Implicitly Restarted Arnoldi/Lanczos Methods for ……"; Krylov–Schur, ……

Optimization-based methods, e.g., LOBPCG; also subspace iteration, Jacobi–Davidson, polynomial filtering, ……

All of these keep orthogonality, X^T X = I, at each iteration, via the Rayleigh–Ritz (RR) step

    [V, D] = eig(X^T A X),    X = X V.

(A minimal sketch of this step is given below.)
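For concreteness, here is a minimal NumPy sketch of one orthonormalization plus Rayleigh–Ritz step, the kind of kernel whose cost motivates the rest of the talk; this is my illustration, not code from the talk, and the use of a QR factorization for the orthonormalization is an assumption.

```python
import numpy as np

def rayleigh_ritz(A, X):
    """One orth + Rayleigh-Ritz step on the subspace spanned by X.

    A: symmetric (n, n) array; X: (n, k) basis of the current subspace.
    Returns Ritz vectors (n, k) and Ritz values (k,), sorted ascending.
    """
    # Orthonormalize the basis; this QR is one of the low-parallelism
    # kernels discussed in Section II.
    Q, _ = np.linalg.qr(X)
    # Project A onto the subspace and solve the small k-by-k eigenproblem.
    theta, V = np.linalg.eigh(Q.T @ A @ Q)
    return Q @ V, theta
```

Note that both the QR factorization and the dense k × k eigensolve are hard to parallelize compared with the matrix product A @ X; this is the bottleneck the next section examines.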
Section II. Motivation: A Method for Larger Eigenspaces with Richer Parallelism

What is Large Scale?

Ordinarily large scale:
- a large and sparse matrix, say n = 1M;
- a small number of eigenpairs, say k = 100.

Doubly large scale:
- a large and sparse matrix, say n = 1M;
- a large number of eigenpairs, say k = 1% × n;
- possibly a sequence of doubly large-scale problems.

The character of the computation changes as k jumps: for X ∈ R^{n×k}, the cost of RR/orth(X) comes to dominate the cost of AX, and parallelism becomes a critical factor. Low parallelism in RR/orth ⟹ an opportunity for new methods?

Example: DFT, Materials Science

Kohn–Sham total energy minimization:

    min E_total(X)    s.t.    X^T X = I,    (2.1)

where, for ρ(X) := diag(X X^T),

    E_total(X) := ½ tr(X^T (L + V_ion) X) + ½ ρ^T L† ρ + ρ^T ε_xc(ρ) + E_rep.

This is a nonlinear eigenvalue problem requiring up to 10% of the smallest eigenpairs. A main approach is SCF (self-consistent field), which solves a sequence of linear eigenvalue problems.

Avoid the Bottleneck

Two types of computation: AX and RR/orth. As k becomes large, AX is dominated by RR/orth — the bottleneck.
- Parallelism of AX → Ax_1 ∪ Ax_2 ∪ … ∪ Ax_k: higher.
- RR/orth contains sequentiality: lower.
Avoid the bottleneck? Do fewer RR/orth steps. No free lunch? Do more BLAS3 (which offers higher parallelism than AX).

Section III. Trace-Penalty Minimization: Free of Orthogonalization, BLAS3-Dominated Computation

Basic Idea

Trace minimization:

    min_{X ∈ R^{n×k}} tr(X^T A X)    s.t.    X^T X = I.    (3.1)

Trace-penalty minimization:

    min_{X ∈ R^{n×k}} f(X) := ½ tr(X^T A X) + (µ/4) ‖X^T X − I‖_F².    (3.2)

It is well known that as µ → ∞, (3.2) ⟹ (3.1): the classical quadratic penalty function (Courant, 1940s). The idea appears old and unsophisticated. However, ……

"Exact" Penalty

However, µ → ∞ is unnecessary.

Theorem (Equivalence in Eigenspace). Problem (3.2) is equivalent to (3.1) if and only if

    µ > λ_k.    (3.3)

Under (3.3), all minimizers of (3.2) have the SVD form

    X̂ = Q_k (I − Λ_k/µ)^{1/2} V^T,    (3.4)

where the columns of Q_k are k eigenvectors associated with a set of k smallest eigenvalues forming the diagonal matrix Λ_k, and V ∈ R^{k×k} is any orthogonal matrix.

Fewer Saddle Points

Original model: min{tr(X^T A X) : X^T X = I, X ∈ R^{n×k}}. It has one minimum and one maximum subspace (discounting multiplicity), and every k-dimensional eigenspace is a saddle point. For the penalty model, however:

Theorem. Let f(X) be the penalty function with parameter µ > 0.
1. For µ ∈ (λ_k, λ_n), f(X) has a unique minimum and no maximum.
2. For µ ∈ (λ_k, λ_{k+p}), where λ_{k+p} is the smallest eigenvalue greater than λ_k, every rank-k stationary point must be a minimizer of the form (3.4).

In this sense, the penalty model is much stronger.

Error Bounds between Optimality Conditions

First-order conditions:
- Penalty model: 0 = ∇f(X) := AX + µ X(X^T X − I).
- Original model: 0 = R(X) := A Y(X) − Y(X)(Y(X)^T A Y(X)), where Y(X) is an orthonormal basis of span{X}.

Lemma. Let ∇f(X) (with µ > λ_k) and R(X) be defined as above. Then

    ‖R(X)‖_F ≤ σ_min(X)^{−1} ‖∇f(X)‖_F,    (3.5)

where σ_min(X) is the smallest singular value of X. Moreover, for any global minimizer X̂ and any ε > 0, there exists δ > 0 such that whenever ‖X − X̂‖_F ≤ δ,

    ‖R(X)‖_F ≤ (1 + ε) / √(1 − λ_k/µ) · ‖∇f(X)‖_F.    (3.6)

Condition Number

The condition number of the Hessian at a solution,

    κ(∇²f(X̂)) := λ_max(∇²f(X̂)) / λ_min(∇²f(X̂)),

is the determining factor for the asymptotic convergence rate of gradient methods.

Lemma. Let X̂ be a global minimizer of (3.2) with µ > λ_k. The condition number of the Hessian at X̂ satisfies

    κ(∇²f(X̂)) ≥ max{2(µ − λ_1), λ_n − λ_1} / min{2(µ − λ_k), λ_{k+1} − λ_k},    (3.7)

with equality for k = 1. Hence gradient methods may encounter slow convergence toward the end.

Generalizations

- Generalized eigenvalue problems: X^T X = I → X^T B X = I.
- Keeping out an undesired subspace: U^T X = 0 (with U^T U = I).

Trace minimization with a subspace constraint:

    min_{X ∈ R^{n×k}} tr(X^T A X)    s.t.    X^T B X = I,  U^T X = 0.

Trace-penalty formulation:

    min_X ½ tr(X^T Q^T A Q X) + (µ/4) ‖X^T Q^T B Q X − I‖_F²,

where Q = I − U U^T (so QX = X − U(U^T X)). With changes of variables, all the results above still hold.
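Before turning to algorithms, here is a minimal NumPy sketch of the objective f in (3.2) and its gradient ∇f(X) = AX + µX(X^T X − I); this is my illustration built directly from those formulas, not the authors' code.

```python
import numpy as np

def penalty_obj_grad(A, X, mu):
    """Trace-penalty objective (3.2) and its gradient.

    A: symmetric (n, n) array (anything supporting A @ X works);
    X: (n, k) iterate; mu: penalty parameter (should exceed lambda_k).
    """
    AX = A @ X                          # O(k * nnz(A)) for sparse A
    G = X.T @ X - np.eye(X.shape[1])    # small k-by-k matrix X^T X - I
    f = 0.5 * np.trace(X.T @ AX) + 0.25 * mu * np.linalg.norm(G, "fro") ** 2
    grad = AX + mu * (X @ G)            # X times a k-by-k matrix: BLAS3
    return f, grad
```

Notice that the only heavy operations are AX and products of X with small k × k matrices; no orthogonalization appears anywhere.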
Algorithms for Trace-Penalty Minimization

Gradient methods:

    X ← X − α ∇f(X),    ∇f(X) = AX + µ X(X^T X − I).

First-order condition: ∇f(X) = 0 ⟺ AX = µ X(I − X^T X).

Two types of computation in ∇f(X):
1. AX: O(k · nnz(A));
2. X(X^T X): O(k² n) — BLAS3.
Computation (2) dominates (1) whenever k ≫ nnz(A)/n, and gradient methods require NO RR/orth.

Gradient Methods Preserve Full Rank

Lemma. Let X^{j+1} be generated by X^{j+1} = X^j − α_j ∇f(X^j) from a full-rank X^j. Then X^{j+1} is rank deficient only if 1/α_j is one of the k generalized eigenvalues of the problem

    [(X^j)^T ∇f(X^j)] u = λ [(X^j)^T X^j] u.

On the other hand, if α_j < σ_min(X^j)/‖∇f(X^j)‖_2, then X^{j+1} has full rank.

Combined with the previous results, there is a high probability of reaching a global minimizer with gradient-type methods.

Gradient Methods (Cont'd)

    X^{j+1} = X^j − α_j ∇f(X^j)

Step size α:
- Non-monotone line search (Grippo 1986, Zhang–Hager 2004).
- Initial BB step:

    α_j = argmin_α ‖S^j − α Y^j‖_F² = tr((S^j)^T Y^j) / ‖Y^j‖_F²,

  where S^j = X^j − X^{j−1} and Y^j = ∇f(X^j) − ∇f(X^{j−1}).
- Many other choices.

Current Algorithm Framework
1. Pre-processing — scaling, shifting, preconditioning.
2. Penalty parameter µ — dynamically adjusted.
3. Gradient iterations — main operations: X(X^T X) and AX.
4. RR restart — computing Ritz pairs and restarting.
Further steps are possible, but NOT used in the comparison:
5. Deflation — working only on the desired subspaces.
6. Chebyshev filter — improving accuracy.

Enhancement: RR Restarting

An RR step returns Ritz pairs for a given subspace:
1. Orthogonalization: Q = orth(X).
2. Eigenvalue decomposition: Q^T A Q = V Σ V^T.
3. Ritz pairs: QV and diag(Σ).
RR steps ensure accurate termination and can accelerate convergence; very few RR steps are used. (A sketch of the overall iteration follows below.)
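Pulling the pieces together, the following is a minimal sketch of the BB-stepped gradient iteration (framework steps 3–4); it is my illustration rather than the EIGPEN code: the random initialization, fixed µ, simple stopping rule, and the absolute value safeguarding the BB step are all assumptions, and the non-monotone line search is omitted.

```python
import numpy as np

def eigpen_sketch(A, k, mu, tol=1e-3, max_iter=500, seed=0):
    """Trace-penalty minimization by a BB-stepped gradient iteration,
    followed by one Rayleigh-Ritz step to recover Ritz pairs.

    A: symmetric (n, n) array; k: number of desired eigenpairs;
    mu: penalty parameter, assumed to satisfy mu > lambda_k.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((A.shape[0], k))
    grad = lambda Z: A @ Z + mu * (Z @ (Z.T @ Z - np.eye(k)))

    G, alpha = grad(X), 1e-3            # small initial step (assumption)
    for _ in range(max_iter):
        X_new = X - alpha * G
        G_new = grad(X_new)
        # BB step: argmin_a ||S - a*Y||_F^2 = tr(S^T Y) / ||Y||_F^2;
        # abs() is a simple safeguard (assumption).
        S, Y = X_new - X, G_new - G
        alpha = abs(np.trace(S.T @ Y)) / max(np.linalg.norm(Y, "fro") ** 2, 1e-16)
        X, G = X_new, G_new
        if np.linalg.norm(G, "fro") <= tol * np.linalg.norm(X, "fro"):
            break

    # One RR step (framework step 4) orthonormalizes span(X) and
    # sorts the Ritz pairs.
    Q, _ = np.linalg.qr(X)
    theta, V = np.linalg.eigh(Q.T @ A @ Q)
    return Q @ V, theta
```

In the actual framework µ is adjusted dynamically (step 2); fixing it slightly above an estimate of λ_k is the simplest choice and is what the equivalence condition (3.3) requires.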
Section IV. Numerical Results and Conclusion

Pilot Tests in Matlab

Matrix: delsq(numgrid('S',102)); size n = 10000; tol = 1e-3.

[Figure: CPU time in seconds versus the number of eigenvalues (50 to 500) for eigs, lobpcg, and eigpen; panel (a) with "-singleCompThread", panel (b) without.]

Experiment Environment

Running platform:
- A single node of a Cray XE6 supercomputer (NERSC).
- Two 12-core AMD 'MagnyCours' 2.1-GHz processors; 32 GB shared memory.
- System and language: Cray Linux Environment version 3; Fortran + OpenMP.
- All 24 cores are used unless otherwise specified.

Solvers tested: ARPACK, LOBPCG, EIGPEN.

Relative Error Measurements

Let x_1, x_2, …, x_k be the computed Ritz vectors and θ_i the corresponding Ritz values.