A Parallel Divide and Conquer Algorithm for the Symmetric Eigenvalue Problem on Distributed Memory Architectures

F. Tisseur and J. Dongarra

1999

MIMS EPrint: 2007.225

Manchester Institute for Mathematical Sciences, School of Mathematics, The University of Manchester

Reports available from: http://www.manchester.ac.uk/mims/eprints
And by contacting: The MIMS Secretary, School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK

ISSN 1749-9097

SIAM J. SCI. COMPUT., Vol. 20, No. 6, pp. 2223–2236. © 1999 Society for Industrial and Applied Mathematics

A PARALLEL DIVIDE AND CONQUER ALGORITHM FOR THE SYMMETRIC EIGENVALUE PROBLEM ON DISTRIBUTED MEMORY ARCHITECTURES∗

FRANÇOISE TISSEUR† AND JACK DONGARRA‡

Abstract. We present a new parallel implementation of a divide and conquer algorithm for computing the spectral decomposition of a symmetric tridiagonal matrix on distributed memory architectures. The implementation we develop differs from other implementations in that we use a two-dimensional block cyclic distribution of the data, we use the Löwner theorem approach to compute orthogonal eigenvectors, and we introduce permutations before the back transformation of each rank-one update in order to make good use of deflation. This algorithm yields the first scalable, portable, and numerically stable parallel divide and conquer eigensolver. Numerical results confirm the effectiveness of our algorithm. We compare performance of the algorithm with that of the QR algorithm and of bisection followed by inverse iteration on an IBM SP2 and a cluster of Pentium PIIs.

Key words. divide and conquer, symmetric eigenvalue problem, tridiagonal matrix, rank-one modification, parallel algorithm, ScaLAPACK, LAPACK, distributed memory architecture

AMS subject classifications. 65F15, 68C25

PII. S1064827598336951

1. Introduction.
The divide and conquer algorithm for the symmetric tridiagonal eigenvalue problem was first developed by Cuppen [8], based on previous ideas of Golub [16] and Bunch, Nielsen, and Sorensen [5] for the solution of the secular equation. The algorithm was popularized as a practical parallel method by Dongarra and Sorensen [14], who implemented it on a shared memory machine. They concluded that divide and conquer algorithms, when properly implemented, can be many times faster than traditional ones, such as bisection followed by inverse iteration or the QR algorithm, even on serial computers. Later parallel implementations had mixed success. Using an Intel iPSC-1 hypercube, Ipsen and Jessup [22] found that their bisection implementation was more efficient than their divide and conquer implementation because of the excessive amount of data transferred between processors and unbalanced work load after the deflation process. More recently, Gates and Arbenz [15] showed that good speed-up can be achieved from distributed memory parallel implementations. However, they did not use techniques described in [18] that guarantee the orthogonality of the eigenvectors and that make good use of the deflation to speed the computation.

∗Received by the editors April 6, 1998; accepted for publication (in revised form) October 15, 1998; published electronically July 16, 1999. This work was supported in part by Oak Ridge National Laboratory, managed by Lockheed Martin Energy Research Corp. for the U.S. Department of Energy under contract DE-AC05-96OR2246 and by the Defense Advanced Research Projects Agency under contract DAAL03-91-C-0047, administered by the Army Research Office. http://www.siam.org/journals/sisc/20-6/33695.html
†Department of Mathematics, University of Manchester, Manchester M13 9PL, England (ftisseur@ma.man.ac.uk).
‡Department of Computer Science, University of Tennessee, Knoxville, TN 37996-1301, and Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN 37831 (dongarra@cs.utk.edu).

In this paper, we describe an efficient, scalable, and portable parallel implementation for distributed memory machines of a divide and conquer algorithm for the symmetric tridiagonal eigenvalue problem. We chose to implement the rank-one update of Cuppen [8] rather than the rank-two update described in [15], [18]. We see no reason why one update should be more accurate than the other or faster in general, but Cuppen’s method, as reviewed in section 2, appears to be easier to implement.

Until recently, it was thought that extended precision arithmetic was needed in the solution of the secular equation to guarantee that orthogonal eigenvectors are produced when there are close eigenvalues. However, Gu and Eisenstat [18] have found a new approach that does not require extended precision; Kahan [24] showed how to make it portable and we have used it in our implementation.

In section 3 we discuss several important issues to consider for parallel implementation of a divide and conquer algorithm, and then we derive our algorithm. We implemented our algorithm in Fortran 77 as production quality software in the ScaLAPACK model [4] and we used LAPACK divide and conquer routines [25], [27] as building blocks. The code is well suited to compute all the eigenvalues and eigenvectors of large matrices with clusters of eigenvalues. For these problems, bisection followed by inverse iteration as implemented in ScaLAPACK [4], [10] is limited by the size of the largest cluster that fits on one processor. The QR algorithm is less sensitive to the eigenvalue distribution but is more expensive in computation and communication and thus does not perform as well as the divide and conquer method.
Examples that demonstrate the efficiency and numerical performance are presented in section 4.

2. Cuppen’s method. The spectral decomposition of a symmetric matrix is generally computed in three steps: tridiagonalization, diagonalization, and back transformation. Here, we consider the diagonalization $T = W \Lambda W^T$ of a symmetric tridiagonal matrix $T \in \mathbb{R}^{n \times n}$, where $\Lambda$ is diagonal and $W$ is orthogonal. Cuppen [8] introduced the decomposition

$$T = \begin{pmatrix} T_1 & 0 \\ 0 & T_2 \end{pmatrix} + \rho v v^T,$$

where $T_1$ and $T_2$ differ from the corresponding submatrices of $T$ only by their last and first diagonal coefficients, respectively. Let $T_1 = Q_1 D_1 Q_1^T$, $T_2 = Q_2 D_2 Q_2^T$ be spectral decompositions. Then $T$ is orthogonally similar to the rank-one update

$$T = Q (D + \rho z z^T) Q^T, \tag{2.1}$$

where $Q = \mathrm{diag}(Q_1, Q_2)$ and $z = Q^T v$. By solving the secular equation associated with this rank-one update, we compute the spectral decomposition

$$D + \rho z z^T = U \Lambda U^T \tag{2.2}$$

and then $T = W \Lambda W^T$ with $W = QU$. A recursive application of this strategy to $T_1$ and $T_2$ leads to the divide and conquer algorithm for the symmetric tridiagonal eigenvalue problem.

Finding the spectral decomposition of the rank-one update $D + \rho z z^T$ is the heart of the divide and conquer algorithm. The eigenvalues $\{\lambda_i\}_{i=1}^{n}$ are the roots of the secular equation

$$f(\lambda) = 1 + \rho z^T (D - \lambda I)^{-1} z, \tag{2.3}$$

and a corresponding eigenvector $u$ is given by

$$u = (D - \lambda I)^{-1} z. \tag{2.4}$$

Each eigenvalue and corresponding eigenvector can be computed cheaply in $O(n)$ flops. Unfortunately, calculation of eigenvectors using (2.4) can lead to a loss of orthogonality for close eigenvalues. Solutions to this problem are discussed in section 3.3.

Dongarra and Sorensen [14] showed that the spectral decomposition (2.2) can potentially be reduced in size.
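Before deflation is considered, the tearing step and equations (2.1)-(2.4) above can be checked numerically on a small matrix. The following NumPy sketch is illustrative only and is not the paper's Fortran 77 code: the function names (`cuppen_tear`, `secular_roots`, `dc_step`) are hypothetical, the subproblems are solved with `numpy.linalg.eigh` instead of recursion, the secular equation is solved by plain bisection, and it assumes $\rho > 0$, distinct entries in $D$, and nonzero components of $z$ (that is, no deflation is needed).

```python
import numpy as np


def tridiag(d, e):
    """Dense symmetric tridiagonal matrix with diagonal d and off-diagonal e."""
    return np.diag(d) + np.diag(e, 1) + np.diag(e, -1)


def cuppen_tear(d, e, k):
    """Tear T after its k-th row (1 <= k < n): T = blockdiag(T1, T2) + rho v v^T."""
    rho = e[k - 1]
    d1, d2 = d[:k].copy(), d[k:].copy()
    d1[-1] -= rho  # only the last diagonal entry of T1 changes
    d2[0] -= rho   # only the first diagonal entry of T2 changes
    v = np.zeros(len(d))
    v[k - 1] = v[k] = 1.0
    return (d1, e[:k - 1]), (d2, e[k:]), rho, v


def secular(lam, D, z, rho):
    """Secular function (2.3): f(lam) = 1 + rho * z^T (D - lam I)^{-1} z."""
    return 1.0 + rho * np.sum(z * z / (D - lam))


def secular_roots(D, z, rho, iters=200):
    """Roots of (2.3) by bisection. Assumes rho > 0, D strictly increasing, z_i != 0,
    so that exactly one root lies in each interval (D[i], D[i+1]) and the last
    root lies in (D[n-1], D[n-1] + rho * z^T z)."""
    n = len(D)
    upper = np.append(D[1:], D[-1] + rho * (z @ z))
    lam = np.empty(n)
    for i in range(n):
        lo, hi = D[i], upper[i]
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:  # interval exhausted in floating point
                break
            if secular(mid, D, z, rho) < 0.0:  # f is increasing on the interval
                lo = mid
            else:
                hi = mid
        lam[i] = 0.5 * (lo + hi)
    return lam


def dc_step(d, e, k):
    """One (non-recursive) divide and conquer step; returns (eigenvalues, W)."""
    (d1, e1), (d2, e2), rho, v = cuppen_tear(d, e, k)
    D1, Q1 = np.linalg.eigh(tridiag(d1, e1))
    D2, Q2 = np.linalg.eigh(tridiag(d2, e2))
    n = len(d)
    Q = np.zeros((n, n))
    Q[:k, :k], Q[k:, k:] = Q1, Q2          # Q = diag(Q1, Q2)
    D = np.concatenate([D1, D2])
    order = np.argsort(D)                  # sort D, permuting Q consistently
    D, Q = D[order], Q[:, order]
    z = Q.T @ v                            # z = Q^T v as in (2.1)
    lam = secular_roots(D, z, rho)
    U = z[:, None] / (D[:, None] - lam[None, :])  # column j is (2.4) for lam[j]
    U /= np.linalg.norm(U, axis=0)
    return lam, Q @ U                      # W = Q U
```

For example, `dc_step(np.array([1., 2., 3., 4., 5., 6.]), np.array([0.5, 0.4, 0.3, 0.2, 0.1]), 3)` tears a 6-by-6 matrix into two 3-by-3 blocks. Because the eigenvectors come directly from (2.4), this sketch inherits the loss of orthogonality for close eigenvalues mentioned above; it is meant for well-separated spectra only.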
If $z_i = 0$ for some $i$, then $d_i = D(i,i)$ is an eigenvalue with eigenvector the $i$th unit vector $e_i$, and if there are equal $d_i$'s, then the eigenvector basis can be rotated in order to zero out the components of $z$ corresponding to the repeated diagonal entries. In finite precision arithmetic one needs to deflate when a $z_i$ is nearly equal to zero and when there are nearly equal $d_i$'s for some suitable definitions of “nearly” that ensure numerical stability is retained [14]. With suitable deflation criteria, if $G$ is the product of all the rotations used to zero out certain components of $z$ and if $P$ is the accumulation of permutations used to translate the zero components of $z$ to the bottom of $z$, the result is

$$P G (D + \rho z z^T) G^T P^T = \begin{pmatrix} \tilde{D} + \rho \tilde{z} \tilde{z}^T & 0 \\ 0 & \tilde{\Lambda} \end{pmatrix} + E, \tag{2.5}$$

where $\|E\|_2 \le c u$, with $c$ a constant of order unity and $u$ the machine precision. This deflation process is essential for the success of the divide and conquer algorithm. In practice, the dimension of $\tilde{D} + \rho \tilde{z} \tilde{z}^T$ is usually considerably smaller than the dimension of $D + \rho z z^T$, which reduces the number of flops when computing the eigenvector matrix of $T$. Cuppen [8] showed that deflation is more likely to take place when the matrix is diagonally dominant.

When no deflation is assumed, the whole algorithm requires $\frac{4}{3} n^3 + O(n^2)$ flops. In practice, because of deflation, it appears that the algorithm takes only $O(n^{2.3})$ flops on average and the cost can even be as low as $O(n^2)$ for some special cases (see [9]).

3. Parallelization issues and implementation details. Divide and conquer algorithms have been successfully implemented on shared memory multiprocessors [14], [23] but difficulties have been encountered on distributed memory machines [22]. Several issues need to be addressed. A more detailed discussion of the issues discussed below can be found in [30].

3.1. Data distribution. The first issue, and perhaps the most critical step when writing a parallel program, is how to distribute the data.
Previous implementations used a one-dimensional distribution [15], [22]. Gates and Arbenz [15] used a one-dimensional row block distribution for $Q$, the matrix of eigenvectors, and a one-dimensional column block distribution for $U$, the eigenvector matrix of the rank-one updates. This distribution simplifies their parallel matrix-matrix multiplication used for the back transformation $QU$. However, their matrix multiplication routine grows in communication with the number of processes, making it not scalable.
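The difference between a one-dimensional block layout and the two-dimensional block cyclic layout named in the abstract comes down to index arithmetic. The sketch below is a hypothetical illustration in Python, not ScaLAPACK code: the function names are invented for this example, indices are 0-based, and the first block is assumed to live on process 0 (respectively process (0, 0) of the grid).

```python
import math


def column_block_owner(j, n, p):
    """1-D column block layout: process owning global column j of an n-column
    matrix split into p contiguous column blocks of width ceil(n / p)."""
    return j // math.ceil(n / p)


def block_cyclic_owner(i, j, mb, nb, prow, pcol):
    """2-D block cyclic layout: (process row, process column) owning global
    entry (i, j) for mb-by-nb blocks dealt round-robin over a prow-by-pcol grid."""
    return (i // mb) % prow, (j // nb) % pcol


def block_cyclic_local(i, j, mb, nb, prow, pcol):
    """Local (row, column) index of global entry (i, j) on its owning process."""
    li = (i // (mb * prow)) * mb + i % mb
    lj = (j // (nb * pcol)) * nb + j % nb
    return li, lj
```

With 2-by-2 blocks on a 2-by-2 process grid, an 8-by-8 matrix is spread so that every process holds 16 entries drawn from all parts of the matrix, whereas in the one-dimensional column block layout each process holds one contiguous strip of columns.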