Trace-Penalty Minimization for Large-Scale Eigenspace Computation
Total Page:16
File Type:pdf, Size:1020Kb
Recommended publications
-
CUDA 6 and Beyond
MUMPS Users Days, June 1st/2nd 2017. Programming heterogeneous architectures with libraries: A survey of NVIDIA linear algebra libraries. François Courteille, Principal Solutions Architect, NVIDIA, [email protected]
Acknowledgements: Joe Eaton, Ty McKercher, Lung Sheng Chien, Nikolay Markovskiy, Stan Posey, Steve Rennich, Dave Miles, Peng Wang (all NVIDIA). Questions: [email protected]
Agenda: Prolegomena; NVIDIA Solutions for Accelerated Linear Algebra; Libraries performance on Pascal; Rapid software development for heterogeneous architecture.
Prolegomena. NVIDIA: Gaming, VR, AI & HPC, Self-Driving Cars; GPU Computing.
One architecture built for both data science & computational science. [Figure: AlexNet Training Performance, 2013 to 2016, 65x in 3 years; labels: Pascal, 16nm FinFET, CoWoS HBM2, NVLink, cuDNN, Pascal & Volta, NVIDIA DGX-1, NVIDIA DGX SATURNV]
NVLink to CPU. IBM Power Systems Server S822LC (codename “Minsky”): 2x IBM Power8+ CPUs and 4x P100 GPUs; 80 GB/s per GPU bidirectional for peer traffic; 80 GB/s per GPU bidirectional to CPU; 115 GB/s CPU memory bandwidth; direct load/store access to CPU memory; high-speed copy engines for bulk data movement.
Unified Memory on Pascal. Large datasets, simple programming, high performance. CUDA 8 and Pascal: enable large data models; oversubscribe GPU memory; allocate up to system memory size. Higher Demand
-
Accelerating the LOBPCG Method on GPUs Using a Blocked Sparse Matrix Vector Product
Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product. Hartwig Anzt, Stanimire Tomov and Jack Dongarra, Innovative Computing Lab, University of Tennessee, Knoxville, USA. Email: [email protected], [email protected], [email protected]
Abstract: This paper presents a heterogeneous CPU-GPU algorithm design and optimized implementation for an entire sparse iterative eigensolver – the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) – starting from low-level GPU data structures and kernels to the higher-level algorithmic choices and overall heterogeneous design. Most notably, the eigensolver leverages the high performance of a new GPU kernel developed for the simultaneous multiplication of a sparse matrix and a set of vectors (SpMM). This is a building block that serves as a backbone for not only block-Krylov, but also for other methods relying on blocking for acceleration in general. The heterogeneous LOBPCG developed here reveals the potential of this type of eigensolver by highly optimizing all of its components, and can be viewed as a benchmark for other SpMM-dependent applications.
[...] the computing power of today’s supercomputers, often accelerated by coprocessors like graphics processing units (GPUs), becomes challenging. While there exist numerous efforts to adapt iterative linear solvers like Krylov subspace methods to coprocessor technology, sparse eigensolvers have so far remained outside the main focus. A possible explanation is that many of those combine sparse and dense linear algebra routines, which makes porting them to accelerators more difficult. Aside from the power method, algorithms based on the Krylov subspace idea are among the most commonly used general eigensolvers [1]. When targeting symmetric positive definite eigenvalue problems, the recently developed Locally Optimal
-
Solving Symmetric Semi-Definite (Ill-Conditioned) Generalized Eigenvalue Problems
Solving Symmetric Semi-definite (ill-conditioned) Generalized Eigenvalue Problems. Zhaojun Bai, University of California, Davis. Berkeley, August 19, 2016.
Symmetric definite generalized eigenvalue problem
Symmetric definite generalized eigenvalue problem: $Ax = \lambda Bx$, where $A^T = A$ and $B^T = B > 0$.
Eigen-decomposition: $AX = BX\Lambda$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, $X = (x_1, x_2, \ldots, x_n)$ and $X^T B X = I$.
Assume $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$.
LAPACK solvers
LAPACK routines xSYGV, xSYGVD, xSYGVX are based on the following algorithm (Wilkinson'65):
1. compute the Cholesky factorization $B = GG^T$
2. compute $C = G^{-1} A G^{-T}$
3. compute the symmetric eigen-decomposition $Q^T C Q = \Lambda$
4. set $X = G^{-T} Q$
xSYGV[D,X] could be numerically unstable if $B$ is ill-conditioned:
$$|\hat\lambda_i - \lambda_i| \lesssim p(n)\,\bigl(\|B^{-1}\|_2 \|A\|_2 + \mathrm{cond}(B)\,|\hat\lambda_i|\bigr)\cdot\epsilon$$
and
$$\theta(\hat x_i, x_i) \lesssim p(n)\,\frac{\|B^{-1}\|_2 \|A\|_2\,(\mathrm{cond}(B))^{1/2} + \mathrm{cond}(B)\,|\hat\lambda_i|}{\mathrm{specgap}_i}\cdot\epsilon$$
User's choice between the inversion of ill-conditioned Cholesky decomposition and the QZ algorithm that destroys symmetry.
Algorithms to address the ill-conditioning
1. Fix-Heiberger'72 (Parlett'80): explicit reduction
2. Chang-Chung Chang'74: SQZ method (QZ by Moler and Stewart'73)
3. Bunse-Gerstner'84: MDR method
4. Chandrasekaran'00: "proper pivoting scheme"
5. Davies-Higham-Tisseur'01: Cholesky+Jacobi
6. Working notes by Kahan'11 and Moler'14
This talk
Three approaches:
1. A LAPACK-style implementation of Fix-Heiberger algorithm (Status: beta-release)
2. An algebraic reformulation
3. Locally accelerated block preconditioned steepest descent (LABPSD)
-
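The four Cholesky-based steps listed in the slides can be sketched in a few lines of NumPy. This is only an illustrative sketch, not LAPACK's actual xSYGV code: the function name `sygv_via_cholesky` and the random test matrices are assumptions made for the example, and triangular solves are replaced by general `np.linalg.solve` calls for brevity.

```python
import numpy as np

def sygv_via_cholesky(A, B):
    """Solve A x = lambda B x for symmetric A and symmetric positive definite B.

    Hypothetical helper that follows the four steps quoted above.
    """
    G = np.linalg.cholesky(B)        # step 1: B = G G^T with G lower triangular
    Y = np.linalg.solve(G, A)        # Y = G^{-1} A
    C = np.linalg.solve(G, Y.T)      # step 2: C = G^{-1} A G^{-T} (uses A = A^T)
    C = (C + C.T) / 2                # remove rounding-level asymmetry
    lam, Q = np.linalg.eigh(C)       # step 3: Q^T C Q = Lambda, ascending eigenvalues
    X = np.linalg.solve(G.T, Q)      # step 4: X = G^{-T} Q
    return lam, X

# Small well-conditioned test problem (illustrative data only).
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                    # symmetric
N = rng.standard_normal((4, 4))
B = N @ N.T + 4 * np.eye(4)          # symmetric positive definite

lam, X = sygv_via_cholesky(A, B)
print(lam)
```

Because step 2 applies the inverse of the Cholesky factor on both sides, an ill-conditioned B amplifies rounding errors there, which is the instability the error bounds above describe.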
On the Performance and Energy Efficiency of Sparse Linear Algebra on GPUs
Original Article. The International Journal of High Performance Computing Applications, 1–16. © The Author(s) 2016. Reprints and permissions: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/1094342016672081. hpc.sagepub.com
On the performance and energy efficiency of sparse linear algebra on GPUs. Hartwig Anzt, Stanimire Tomov and Jack Dongarra.
Abstract: In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix–vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPU and Intel’s MKL for multicore CPUs, and develop a GPU sparse matrix–matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multi-threaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers.
Keywords: sparse eigensolver, LOBPCG, GPU supercomputer, energy efficiency, blocked sparse matrix–vector product
1 Introduction. GPU-based supercomputers.
-
Large-Scale Computation of Pseudospectra Using ARPACK and eigs
SIAM J. SCI. COMPUT. © 2001 Society for Industrial and Applied Mathematics. Vol. 23, No. 2, pp. 591–605. LARGE-SCALE COMPUTATION OF PSEUDOSPECTRA USING ARPACK AND EIGS. THOMAS G. WRIGHT AND LLOYD N. TREFETHEN
Abstract. ARPACK and its Matlab counterpart, eigs, are software packages that calculate some eigenvalues of a large nonsymmetric matrix by Arnoldi iteration with implicit restarts. We show that at a small additional cost, which diminishes relatively as the matrix dimension increases, good estimates of pseudospectra in addition to eigenvalues can be obtained as a by-product. Thus in large-scale eigenvalue calculations it is feasible to obtain routinely not just eigenvalue approximations, but also information as to whether or not the eigenvalues are likely to be physically significant. Examples are presented for matrices with dimension up to 200,000.
Key words. Arnoldi, ARPACK, eigenvalues, implicit restarting, pseudospectra
AMS subject classifications. 65F15, 65F30, 65F50
PII. S106482750037322X
1. Introduction. The matrices in many eigenvalue problems are too large to allow direct computation of their full spectra, and two of the iterative tools available for computing a part of the spectrum are ARPACK [10, 11] and its Matlab counterpart, eigs. For nonsymmetric matrices, the mathematical basis of these packages is the Arnoldi iteration with implicit restarting [11, 23], which works by compressing the matrix to an “interesting” Hessenberg matrix, one which contains information about the eigenvalues and eigenvectors of interest. For general information on large-scale nonsymmetric matrix eigenvalue iterations, see [2, 21, 29, 31]. For some matrices, nonnormality (nonorthogonality of the eigenvectors) may be physically important [30].
-
High Efficiency Spectral Analysis and BLAS-3
High Efficiency Spectral Analysis and BLAS-3 Randomized QRCP with Low-Rank Approximations, by Jed Alma Duersch. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Applied Mathematics in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Ming Gu, Chair; Assistant Professor Lin Lin; Professor Kameshwar Poolla. Fall 2015. Copyright 2015 by Jed Alma Duersch.
Contents
1 A Robust and Efficient Implementation of LOBPCG
  1.1 Contributions and results
  1.2 Introduction to LOBPCG
  1.3 The basic LOBPCG algorithm
  1.4 Numerical stability and basis selection
  1.5 Stability improvements
  1.6 Efficiency improvements
  1.7 Analysis
  1.8 Numerical examples
2 Spectral Target Residual Descent
  2.1 Contributions and results
  2.2 Introduction to interior eigenvalue targeting
  2.3 Spectral targeting analysis
  2.4 Notes on generalized spectral targeting
  2.5 Numerical experiments
3 True BLAS-3 Performance QRCP using Random Sampling
  3.1 Contributions and results
  3.2 Introduction to QRCP
  3.3 Sample QRCP
  3.4 Sample updates
  3.5 Full randomized QRCP
  3.6 Parallel implementation notes
  3.7 Truncated approximate Singular Value Decomposition
  3.8 Experimental performance
Bibliography
Acknowledgments
I gratefully thank my coauthors, Dr. Meiyue Shao and Dr. Chao Yang, for their time and expertise. I would also like to thank Assistant Professor Lin Lin, Dr.
-
A High Performance Implementation of Spectral Clustering on CPU-GPU Platforms
A High Performance Implementation of Spectral Clustering on CPU-GPU Platforms. Yu Jin and Joseph F. JaJa, Institute for Advanced Computer Studies, Department of Electrical and Computer Engineering, University of Maryland, College Park, USA. Email: [email protected], [email protected]
Abstract: Spectral clustering is one of the most popular graph clustering algorithms, which achieves the best performance for many scientific and engineering applications. However, existing implementations in commonly used software platforms such as Matlab and Python do not scale well for many of the emerging Big Data applications. In this paper, we present a fast implementation of the spectral clustering algorithm on a CPU-GPU heterogeneous platform. Our implementation takes advantage of the computational power of the multi-core CPU and the massive multithreading and SIMD capabilities of GPUs. Given the input as data points in high dimensional space, we propose a parallel scheme to build a sparse similarity graph represented in a standard sparse representation format.
[...] CPUs, further boost the overall performance and are able to achieve very high performance on problems whose sizes grow up to the capacity of CPU memory [6, 7, 8, 9, 10, 11]. In this paper, we present a hybrid implementation of the spectral clustering algorithm which significantly outperforms the known implementations, most of which are purely based on multi-core CPUs.
There have been reported efforts on parallelizing the spectral clustering algorithm. Zheng et al. [12] presented both CUDA and OpenMP implementations of spectral clustering.
-
A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations
A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations. Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydın Buluç, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary.
Abstract: As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. We consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We present techniques to significantly improve the SpMM and the transpose operation SpMM^T by using the compressed sparse blocks (CSB) format. We achieve 3–4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.
-
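To make the SpMM kernel concrete, here is a minimal pure-Python sketch of a sparse matrix times a block of vectors using the CSR baseline format that the abstract compares against. It is not the paper's CSB implementation; the function name `csr_spmm` and the small 3x3 example are assumptions for illustration.

```python
def csr_spmm(indptr, indices, data, X):
    """Compute Y = A @ X for a CSR matrix A and a dense block X (list of rows).

    Hypothetical reference kernel: each stored nonzero of A is read once
    and reused for all k columns of X.
    """
    n_rows = len(indptr) - 1
    k = len(X[0])
    Y = [[0.0] * k for _ in range(n_rows)]
    for i in range(n_rows):
        out = Y[i]
        for p in range(indptr[i], indptr[i + 1]):
            a = data[p]                 # nonzero value A[i, indices[p]]
            x_row = X[indices[p]]       # matching row of the vector block
            for j in range(k):
                out[j] += a * x_row[j]
    return Y

# CSR storage of the 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]].
indptr = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data = [4.0, 1.0, 3.0, 2.0, 5.0]

# Block of two vectors stored as the columns of a 3x2 array.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

Y = csr_spmm(indptr, indices, data, X)
print(Y)  # [[5.0, 1.0], [0.0, 3.0], [7.0, 5.0]]
```

Reading each nonzero once for all columns of the block is what raises the arithmetic intensity of SpMM relative to repeated sparse matrix-vector products.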
SLEPc Users Manual: Scalable Library for Eigenvalue Problem Computations
Departamento de Sistemas Informáticos y Computación. Technical Report DSIC-II/24/02. SLEPc Users Manual: Scalable Library for Eigenvalue Problem Computations. https://slepc.upv.es Jose E. Roman, Carmen Campos, Lisandro Dalcin, Eloy Romero, Andrés Tomás. To be used with slepc 3.15. March, 2021.
Abstract: This document describes slepc, the Scalable Library for Eigenvalue Problem Computations, a software package for the solution of large sparse eigenproblems on parallel computers. It can be used for the solution of various types of eigenvalue problems, including linear and nonlinear, as well as other related problems such as the singular value decomposition (see a summary of supported problem classes on page iii). slepc is a general library in the sense that it covers both Hermitian and non-Hermitian problems, with either real or complex arithmetic. The emphasis of the software is on methods and techniques appropriate for problems in which the associated matrices are large and sparse, for example, those arising after the discretization of partial differential equations. Thus, most of the methods offered by the library are projection methods, including different variants of Krylov and Davidson iterations. In addition to its own solvers, slepc provides transparent access to some external software packages such as arpack. These packages are optional and their installation is not required to use slepc, see §8.7 for details. Apart from the solvers, slepc also provides built-in support for some operations commonly used in the context of eigenvalue computations, such as preconditioning or the shift-and-invert spectral transformation. slepc is built on top of petsc, the Portable, Extensible Toolkit for Scientific Computation [Balay et al., 2021].
-
Comparison of Numerical Methods and Open-Source Libraries for Eigenvalue Analysis of Large-Scale Power Systems
Applied Sciences, Article. Comparison of Numerical Methods and Open-Source Libraries for Eigenvalue Analysis of Large-Scale Power Systems. Georgios Tzounas, Ioannis Dassios *, Muyang Liu and Federico Milano. School of Electrical and Electronic Engineering, University College Dublin, Belfield, Dublin 4, Ireland; [email protected] (G.T.); [email protected] (M.L.); [email protected] (F.M.) * Correspondence: [email protected] Received: 30 September 2020; Accepted: 24 October 2020; Published: 28 October 2020.
Abstract: This paper discusses the numerical solution of the generalized non-Hermitian eigenvalue problem. It provides a comprehensive comparison of existing algorithms, as well as of available free and open-source software tools, which are suitable for the solution of the eigenvalue problems that arise in the stability analysis of electric power systems. The paper focuses, in particular, on methods and software libraries that are able to handle the large-scale, non-symmetric matrices that arise in power system eigenvalue problems. These kinds of eigenvalue problems are particularly difficult for most numerical methods to handle. Thus, a review and fair comparison of existing algorithms and software tools is a valuable contribution for researchers and practitioners who are interested in power system dynamic analysis. The scalability and performance of the algorithms and libraries are discussed through case studies based on real-world electrical power networks: a model of the All-Island Irish Transmission System with 8640 variables, and a model of the European Network of Transmission System Operators for Electricity with 146,164 variables.
Keywords: eigenvalue analysis; large non-Hermitian matrices; numerical methods; open-source libraries
1.
-
Computing Singular Values of Large Matrices with an Inverse-Free Preconditioned Krylov Subspace Method∗
Electronic Transactions on Numerical Analysis (ETNA). Volume 42, pp. 197-221, 2014. Copyright 2014, Kent State University. ISSN 1068-9613. http://etna.math.kent.edu
COMPUTING SINGULAR VALUES OF LARGE MATRICES WITH AN INVERSE-FREE PRECONDITIONED KRYLOV SUBSPACE METHOD. QIAO LIANG AND QIANG YE. Dedicated to Lothar Reichel on the occasion of his 60th birthday.
Abstract. We present an efficient algorithm for computing a few extreme singular values of a large sparse $m \times n$ matrix $C$. Our algorithm is based on reformulating the singular value problem as an eigenvalue problem for $C^T C$. To address the clustering of the singular values, we develop an inverse-free preconditioned Krylov subspace method to accelerate convergence. We consider preconditioning that is based on robust incomplete factorizations, and we discuss various implementation issues. Extensive numerical tests are presented to demonstrate efficiency and robustness of the new algorithm.
Key words. singular values, inverse-free preconditioned Krylov Subspace Method, preconditioning, incomplete QR factorization, robust incomplete factorization
AMS subject classifications. 65F15, 65F08
1. Introduction. Consider the problem of computing a few of the extreme (i.e., largest or smallest) singular values and the corresponding singular vectors of an $m \times n$ real matrix $C$. For notational convenience, we assume that $m \ge n$ as otherwise we can consider $C^T$. In addition, most of the discussions here are valid for the case $m < n$ as well with some notational modifications. Let $\sigma_1 \le \sigma_2 \le \cdots \le \sigma_n$ be the singular values of $C$. Then nearly all existing numerical methods are based on reformulating the singular value problem as one of the following two symmetric eigenvalue problems:
(1.1) $\sigma_1^2 \le \sigma_2^2 \le \cdots \le \sigma_n^2$ are the eigenvalues of $C^T C$
or
$-\sigma_n \le \cdots \le -\sigma_2 \le -\sigma_1 \le \underbrace{0 = \cdots = 0}_{m-n} \le \sigma_1 \le \sigma_2 \le \cdots \le \sigma_n$ are the eigenvalues of the augmented matrix
(1.2) $M := \begin{pmatrix} 0 & C \\ C^T & 0 \end{pmatrix}$.
-
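The two reformulations can be checked numerically on a small dense matrix. The sketch below is only a sanity check with NumPy, not the paper's inverse-free preconditioned Krylov method; the random 5x3 matrix is illustrative data, and the augmented matrix is assumed to be the usual symmetric one with blocks C and C^T off the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
C = rng.standard_normal((m, n))          # illustrative data, m >= n

# Reference singular values, sorted ascending as in the text.
sigma = np.sort(np.linalg.svd(C, compute_uv=False))

# Reformulation (1.1): eigenvalues of C^T C are the squared singular values.
eig_normal = np.linalg.eigvalsh(C.T @ C)

# Reformulation (1.2): eigenvalues of the (m+n) x (m+n) augmented matrix are
# -sigma_n..-sigma_1, then m-n zeros, then sigma_1..sigma_n.
M = np.block([[np.zeros((m, m)), C],
              [C.T, np.zeros((n, n))]])
eig_aug = np.linalg.eigvalsh(M)
expected_aug = np.concatenate([-sigma[::-1], np.zeros(m - n), sigma])

print(np.allclose(eig_normal, sigma**2))
print(np.allclose(eig_aug, expected_aug))
```

Squaring in (1.1) also squares the condition number, which is one reason small singular values are harder to resolve accurately through $C^T C$ than through the augmented matrix.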
Recent Implementations, Applications, and Extensions of the Locally Optimal Block Preconditioned Conjugate Gradient Method (LOBPCG)
Recent implementations, applications, and extensions of the Locally Optimal Block Preconditioned Conjugate Gradient method (LOBPCG). Andrew Knyazev (Mitsubishi Electric Research Laboratories).
Abstract: Since its introduction [A. Knyazev, Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method, SISC (2001) doi:10.1137/S1064827500366124] and efficient parallel implementation [A. Knyazev et al., Block locally optimal preconditioned eigenvalue xolvers (BLOPEX) in HYPRE and PETSc, SISC (2007) doi:10.1137/060661624], LOBPCG has been used in a wide range of applications in mechanics, material sciences, and data sciences. We review its recent implementations and applications, as well as extensions of the local optimality idea beyond standard eigenvalue problems.
1 Background. In 1948, Kantorovich proposed calculating the smallest eigenvalue λ1 of a symmetric matrix A by steepest descent using a direction r = Ax − λ(x)x of a scaled gradient of a Rayleigh quotient λ(x) = (x,Ax)/(x,x) in a scalar product (x,y) = x′y, where the step size is computed by minimizing the Rayleigh quotient in the span of the vectors x and w, i.e. in a locally optimal manner. In 1958, Samokish proposed applying a preconditioner T to the vector r to generate the preconditioned direction w = T r and derived asymptotic convergence rate bounds, valid as x approaches the eigenvector. Block locally optimal multi-step steepest descent is described in Cullum and Willoughby in 1985. Local minimization of the Rayleigh quotient on the subspace spanned by the current approximation, the current residual and the previous approximation, as well as its block version, appear in AK PhD thesis; see 1986.
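The locally optimal steepest descent described in the background can be sketched as a Rayleigh-Ritz step on the two-dimensional subspace spanned by x and w. This is a single-vector illustration, not LOBPCG itself (which also keeps the previous approximation and works on blocks). The function names, the tridiagonal test matrix, and the choice of residual r = Ax − λ(x)x with an optional dense preconditioner T are assumptions made for the example.

```python
import numpy as np

def rayleigh(A, x):
    """Rayleigh quotient lambda(x) = (x, Ax) / (x, x)."""
    return (x @ A @ x) / (x @ x)

def locally_optimal_sd(A, x, T=None, iters=1000, tol=1e-8):
    """Steepest descent for the smallest eigenvalue of symmetric A.

    Hypothetical helper: each step minimizes the Rayleigh quotient over
    span{x, w}, where w = r (no preconditioner) or w = T r.
    """
    x = x / np.linalg.norm(x)
    for _ in range(iters):
        lam = rayleigh(A, x)
        r = A @ x - lam * x                           # residual, parallel to the gradient
        if np.linalg.norm(r) < tol:
            break
        w = r if T is None else T @ r                 # preconditioned direction
        S, _ = np.linalg.qr(np.column_stack([x, w]))  # orthonormal basis of span{x, w}
        theta, V = np.linalg.eigh(S.T @ A @ S)        # 2x2 Rayleigh-Ritz problem
        x = S @ V[:, 0]                               # Ritz vector of the smallest Ritz value
    return rayleigh(A, x), x

# Illustrative test matrix: 1D discrete Laplacian (symmetric tridiagonal).
n = 10
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

lam_min, x_min = locally_optimal_sd(A, np.ones(n))
print(lam_min, np.linalg.eigvalsh(A)[0])
```

Solving the small projected eigenproblem replaces an explicit line search: the smallest Ritz value over span{x, w} is exactly the minimum of the Rayleigh quotient on that subspace.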