Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance

FIELD G. VAN ZEE, The University of Texas at Austin
ROBERT A. VAN DE GEIJN, The University of Texas at Austin
GREGORIO QUINTANA-ORTÍ, Universitat Jaume I

We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple sets of Givens rotations to the eigenvector/singular vector matrix. This algorithm is then implemented with optimizations that (1) leverage vector instruction units to increase floating-point throughput, and (2) fuse multiple rotations to decrease the total number of memory operations. We demonstrate the merits of these new QR algorithms for computing the Hermitian eigenvalue decomposition (EVD) and singular value decomposition (SVD) of dense matrices when all eigenvectors/singular vectors are computed. The approach yields vastly improved performance relative to the traditional QR algorithms for these problems and is competitive with two commonly used alternatives—Cuppen's Divide and Conquer algorithm and the Method of Multiple Relatively Robust Representations—while inheriting the more modest O(n) workspace requirements of the original QR algorithms. Since the computations performed by the restructured algorithms remain essentially identical to those performed by the original methods, robust numerical properties are preserved.

Categories and Subject Descriptors: G.4 [Mathematical Software]: Efficiency

General Terms: Algorithms; Performance

Additional Key Words and Phrases: eigenvalues, singular values, tridiagonal, bidiagonal, EVD, SVD, QR algorithm, Givens rotations, libraries, high-performance

1. INTRODUCTION

The tridiagonal (and/or bidiagonal) QR algorithm is taught in a typical graduate-level course, and despite being among the most accurate¹ methods for performing eigenvalue and singular value decompositions (EVD and SVD, respectively), it is not used much in practice because its performance is not competitive [Watkins 1982; Golub and Van Loan 1996; Stewart 2001; Dhillon and Parlett 2003]. The reason for this is twofold: First, classic QR algorithm implementations, such as those in LAPACK, cast most of their computation (the application of Givens rotations) in terms of a routine that is absent from the BLAS, and thus is typically not available in optimized form. Second, even if such an optimized implementation existed, it would

¹Notable algorithms which exceed the accuracy of the QR algorithm include the dqds algorithm (a variant of the QR algorithm) [Fernando and Parlett 1994; Parlett and Marques 1999] and the Jacobi-SVD algorithm by [Drmač and Veselić 2008a; 2008b].

Authors' addresses: Field G. Van Zee and Robert A. van de Geijn, Institute for Computational Engineering and Sciences, Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, {field,rvdg}@cs.utexas.edu. Gregorio Quintana-Ortí, Departamento de Ingeniería y Ciencia de Computadores, Universitat Jaume I, Campus Riu Sec, 12.071, Castellón, Spain, [email protected].


not matter because the QR algorithm is currently structured to apply Givens rotations via repeated instances of O(n²) computation on O(n²) data, effectively making it rich in level-2 BLAS-like operations, which inherently cannot achieve high performance because there is little opportunity for reuse of cached data. Many in the numerical linear algebra community have long speculated that the QR algorithm's performance could be improved by saving up many sets of Givens rotations before applying them to the matrix in which eigenvectors or singular vectors are being accumulated. In this paper, we show that, when computing all eigenvectors of a dense Hermitian matrix or all singular vectors of a dense general matrix, dramatic improvements in performance can indeed be achieved. This work makes a number of contributions to this subject:

— It describes how the traditional QR algorithm can be restructured so that computation is cast in terms of an operation that applies many sets of Givens rotations to the matrix in which the eigen-/singular vectors are accumulated. This restructuring preserves a key feature of the original QR algorithm: namely, the approach requires only linear (O(n)) workspace. An optional optimization to the restructured algorithm that requires O(n²) workspace is also discussed and tested.
— It proposes an algorithm for applying many sets of Givens rotations that, in theory, exhibits greatly improved reuse of data in the cache. It then shows that an implementation of this algorithm can achieve near-peak performance by (1) utilizing vector instruction units to increase floating-point operation throughput, and (2) fusing multiple rotations so that data can be reused in-register, which decreases costly memory operations.
— It exposes and leverages the fact that the lower computational costs of both the Method of Multiple Relatively Robust Representations (MRRR) [Dhillon and Parlett 2004; Dhillon et al. 2006] and Cuppen's Divide-and-Conquer method (D&C) [Cuppen 1980] are partially offset by an O(n³) difference in cost between the former methods' back-transformations and the corresponding step in the QR algorithm.
— It demonstrates performance of EVD via the QR algorithm that is competitive with that of D&C- and MRRR-based EVD, and QR-based SVD performance that is comparable to D&C-based SVD.
— It makes the resulting implementations available as part of the open source libflame library.

The paper primarily focuses on the complex case; the mathematics then trivially simplifies to the real case. We consider these results to be significant in part because we place a premium on simplicity and reliability. The restructured QR algorithm presented in this paper is simple, gives performance that is almost as good as that of more intricate algorithms (such as D&C and MRRR), and does so using only O(n) workspace, and without the worry of what might happen to numerical accuracy in pathological cases such as tightly clustered eigen-/singular values.

It should be emphasized that the improved performance we report is not so pronounced that the QR algorithm becomes competitive with D&C and MRRR when performing standalone tridiagonal EVD or bidiagonal SVD (that is, when the input matrix is already reduced to condensed form). Rather, we show that in the context of the dense decompositions, which include two other stages of O(n³) computation, the restructured QR algorithm provides enough speedup over the more traditional method to facilitate competitive overall performance.


2. COMPUTING THE SPECTRAL DECOMPOSITION OF A HERMITIAN MATRIX

Given a Hermitian matrix A ∈ C^{n×n}, its eigenvalue decomposition (EVD) is given by A = QDQ^H, where Q ∈ C^{n×n} is unitary (Q^H Q = I) and D ∈ R^{n×n} is diagonal. The eigenvalues of matrix A can then be found on the diagonal of D while the corresponding eigenvectors are the columns of Q. The standard approach to computing the EVD proceeds in three steps [Stewart 2001]: reduce matrix A to real tridiagonal form T via unitary similarity transformations: A = QA T QA^H; compute the EVD of T: T = QT D QT^T; and back-transform the eigenvectors of T: Q = QA QT, so that A = QDQ^H. Let us discuss these in more detail. Note that we will use the general term "workspace" to refer to any significant space needed beyond the n × n space that holds the input matrix A and the n-length space that holds the output eigenvalues (i.e., the diagonal of D).
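As a point of reference, these three steps map directly onto standard LAPACK routines. The following C sketch (driver code ours; error handling omitted) computes the EVD along the QR-based path, with the tridiagonal QR algorithm accumulating its rotations directly into QA:

    #include <stdlib.h>
    #include <lapacke.h>

    /* Hermitian EVD of the n x n column-major matrix `a` (lower triangle).
       On return, `a` holds the eigenvectors Q and `d` the eigenvalues. */
    lapack_int hermitian_evd(lapack_int n, lapack_complex_double *a,
                             lapack_int lda, double *d)
    {
        double *e = malloc((n - 1) * sizeof(double));
        lapack_complex_double *tau = malloc((n - 1) * sizeof(lapack_complex_double));

        /* Step 1: A = QA T QA^H; T is returned in (d, e), QA implicitly in a. */
        LAPACKE_zhetrd(LAPACK_COL_MAJOR, 'L', n, a, lda, d, e, tau);
        /* Form QA explicitly, overwriting a. */
        LAPACKE_zungtr(LAPACK_COL_MAJOR, 'L', n, a, lda, tau);
        /* Steps 2+3: tridiagonal QR algorithm; rotations are applied
           directly to QA, so that a = Q and d = diag(D) on return. */
        lapack_int info = LAPACKE_zsteqr(LAPACK_COL_MAJOR, 'V', n, d, e, a, lda);

        free(e); free(tau);
        return info;
    }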

2.1. Reduction to real tridiagonal form

The reduction to tridiagonal form proceeds as the computation and application of a sequence of Householder transformations. When the transformations are defined as reflectors [Golub and Van Loan 1996; Van Zee et al. 2012; Stewart 2001], the tridiagonal reduction takes the form Hn−2 ··· H1 H0 A H0 H1 ··· Hn−2 = QA^H A QA = T, a real-valued, tridiagonal matrix.² The cost of the reduction to tridiagonal form is (4/3)n³ floating-point operations (flops) if A is real and 4 × (4/3)n³ flops if it is complex-valued. About half of those computations are in symmetric (or Hermitian) matrix-vector multiplications (a level-2 BLAS operation [Dongarra et al. 1988]), which are inherently slow since they perform O(n²) computations on O(n²) data. Most of the remaining flops are in Hermitian rank-k updates (a level-3 BLAS operation), which can attain near-peak performance on modern architectures. If a flop that is executed as part of a level-2 BLAS operation is K2 times slower than one executed as part of a level-3 BLAS operation, we will give the effective cost of this operation as (K2 + 1)(2/3)n³, where K2 ≥ 1 and typically K2 ≫ 1. Reduction to tridiagonal form requires approximately bn workspace elements, where b is the algorithmic blocksize used in the blocked algorithm [Van Zee et al. 2012].

2.2. Spectral decomposition of T

LAPACK [Anderson et al. 1999] provides four algorithms for computing the tridiagonal EVD (eigenvalues and eigenvectors), which we now discuss.

The QR algorithm (QR). The cost of the QR algorithm is approximated by 3fn³, where f equals the average number of Francis steps before deflation when a trailing eigenvalue has been found. The rule of thumb is that 1 ≤ f ≤ 2. For a complex matrix A, the cost becomes 2 × 3fn³ because updating a complex eigenvector matrix QA with real Givens rotations requires twice as many flops. The fundamental problem with traditional QR algorithm implementations, such as the one currently found in LAPACK, is that this O(n³) computation is cast in terms of a level-2 BLAS-like operation [Dongarra et al. 1988], which itself is actually implemented in terms of a level-1 BLAS-like operation [Lawson et al. 1979]. Such an implementation inherently cannot achieve high performance. If we assume that such operations are K1 times slower than those performed as part of a level-3 BLAS operation, the

²Actually, this produces a matrix with complex off-diagonal elements, which are then "fixed" to become real, as described in [Van Zee et al. 2012; Stewart 2001], an unimportant detail for the present discussion. Note that LAPACK does not need to fix the off-diagonal elements because the library defines Householder transformations in such a way that A is reduced directly to real tridiagonal form. However, the cost of this optimization is that the transformations do not retain the property of being reflectors (i.e., HH ≠ I), and thus they must sometimes be conjugate-transposed depending on the direction from which they are applied.


effective cost becomes 3K1fn³. Generally, K1 ≥ K2 ≫ 1. The main contribution of this paper is the restructuring of the computation so that the cost approaches (2×)3fn³. The QR algorithm typically requires only 2(n − 1) real-valued elements of workspace, which constitutes the maximum needed to hold the scalars which define the Givens rotations of any single Francis step. The eigenvectors overwrite the original matrix A.

Cuppen's Divide and Conquer (D&C). An implementation of D&C, similar to the one in LAPACK, costs approximately (4/3)α²n³(1 − 4^{−t}) flops, where t = log₂ n − 1 and α = 1 − δ [Rutter 1994]. Here δ represents a typical value for the fraction of eigenvalues at any given level of recursion that deflate. Notice that (1 − 4^{−t}) quickly approaches 1 as n grows larger, and thus we drop this term in our analysis. It has been shown that the D&C algorithm is, in practice, quite fast when operating on matrices that engender a high number of deflations [Cuppen 1980; Rutter 1994; Gu and Eisenstat 1995b]. This is captured in the α parameter, which attenuates quadratically as the number of deflations increases. A small value for α can significantly lessen the impact of the n³ term. However, for some types of matrices, the number of deflations is relatively low, in which case the algorithm behaves more like a typical O(n³) algorithm. The LAPACK implementation of D&C requires n² complex-valued elements and n²/2 + 4n real-valued elements of workspace to form and compute with QT, in addition to 5n integer elements.

Bisection and Inverse Iteration (BII). Bisection and Inverse Iteration (BII) requires O(n²) operations to compute eigenvalues and, in the best case, another O(n²) operations to compute the corresponding eigenvectors [Parlett 1980; Demmel et al. 1995; Demmel et al. 2008]. For matrices with clustered eigenvalues, the latter cost rises to O(n³) as the algorithm must also perform modified Gram-Schmidt orthogonalization in an attempt to produce orthogonal eigenvectors [Demmel et al. 2008; Dhillon 1998]. This method of reorthogonalization, however, has been shown to cause BII to fail in both theory and practice [Dhillon 1998]. Thus, its use in the community has waned. The implementation of BII in LAPACK requires n², 5n, and 5n complex-, real-, and integer-valued elements of workspace, respectively.

Multiple Relatively Robust Representations (MRRR). MRRR is a more recent algorithm that guarantees O(n²) computation even in the worst case: when the matrix contains a cluster of eigenvalues, where each eigenvalue in the cluster is identical to d digits of accuracy, the algorithm must perform work proportional to dn² [Demmel et al. 2008]. The MRRR algorithm is widely reported to require only linear (O(n)) workspace [Anderson et al. 1999; Dhillon et al. 2006; Demmel et al. 2008]. But in the context of a dense Hermitian eigenproblem, the workspace required is actually n² real-valued elements, for the same reason that the D&C and BII algorithms require O(n²) workspace—namely, the algorithm inherently must explicitly compute and store the eigenvectors of T before applying QA towards the formation of Q. The latest implementation of MRRR in LAPACK as of this writing (version 3.3.1) requires n² + 22n and 10n real- and integer-valued elements of workspace, respectively.

2.3. Back-transformation

D&C, BII, and MRRR all share the property that they compute and store QT explicitly as a dense matrix. The matrix QA is stored implicitly as Householder transformations, which are applied to QT, yielding Q. This step is known as the back-transformation. The application of Householder transformations can be formulated as a blocked algorithm (either with the traditional compact WY-transform [Bischof and van Loan 1987; Schreiber and van Loan 1989] or with the UT-transform used by libflame


                    Effective higher-order floating-point operation cost
    EVD Algorithm      A → QA T QA^H          T → QT D QT^T        Form/Apply QA
    original QR        4 × (K2 + 1)(2/3)n³    2 × K1 · 3fn³        4 × (4/3)n³
    restructured QR    4 × (K2 + 1)(2/3)n³    2 × 3fn³             4 × (4/3)n³
    D&C                4 × (K2 + 1)(2/3)n³    (4/3)α²n³            4 × 2n³
    BII                4 × (K2 + 1)(2/3)n³    O(n²) to O(n³)       4 × 2n³
    MRRR               4 × (K2 + 1)(2/3)n³    ∼O(n²)               4 × 2n³

Fig. 1. A summary of approximate floating-point operation costs for the suboperations of various algorithms for computing a Hermitian eigenvalue decomposition. Integer scalars at the front of the expressions (e.g., "4 × ...") indicate the floating-point multiple due to performing arithmetic with complex values. In the cost estimates for the QR algorithms, f represents the typical number of Francis steps needed for an eigenvalue to converge. For Cuppen's Divide-and-Conquer algorithm, α represents the fraction of non-deflations in each recursive subproblem.

[Joffrain et al. 2006; Van Zee et al. 2012]) at a cost of approximately 2n³ flops (multiplied by four for complex). This casts most computation in terms of level-3 BLAS. When the QR algorithm is employed, QA is formed before the QR algorithm is called. Forming QA is done "in place," overwriting matrix A, at a cost of approximately (4/3)n³ flops (times four for complex), cast mostly in terms of level-3 BLAS. The Givens rotations are subsequently applied directly to QA. Thus, the back-transformations of D&C and MRRR incur an additional (2/3)n³ flops (times four for complex) relative to the corresponding stage of the QR algorithm. Later, we will show that this additional O(n³) term in the cost of EVD via D&C or MRRR will partially offset the cost savings of performing the tridiagonal EVD stage with either of these aforementioned methods.
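For the D&C, BII, and MRRR paths, this back-transformation maps onto a single standard LAPACK routine, which applies the implicitly stored Householder transformations using level-3 BLAS; a minimal sketch (wrapper ours):

    #include <lapacke.h>

    /* Back-transformation: `a` and `tau` hold the reflectors produced by
       zhetrd (see the earlier sketch); `qt` holds the explicit n x n
       eigenvector matrix QT of T on entry and Q = QA * QT on exit. */
    lapack_int back_transform(lapack_int n,
                              const lapack_complex_double *a, lapack_int lda,
                              const lapack_complex_double *tau,
                              lapack_complex_double *qt, lapack_int ldqt)
    {
        return LAPACKE_zunmtr(LAPACK_COL_MAJOR, 'L' /* from the left */,
                              'L' /* reflectors in lower triangle */,
                              'N' /* apply Q, not Q^H */,
                              n, n, a, lda, tau, qt, ldqt);
    }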

2.4. Numerical accuracy

A recent paper evaluated the numerical accuracy of the four tridiagonal eigenvalue algorithms available in LAPACK [Demmel et al. 2008]. The authors observed that, in practice, the QR and D&C algorithms were the most accurate, exhibiting O(√n ε) accuracy, where n is the dimension of the matrix and ε is the machine precision.³ By contrast, the accuracy of BII and MRRR was observed to be approximately O(nε). Another paper compared the numerical performance of the netlib MRRR (as of LAPACK 3.3.2), netlib D&C, and an improved MRRR algorithm. The authors found that netlib MRRR produced unsatisfactory results for "not too few" of the test cases [Willems and Lang 2011]. The authors assert that the "best method for computing all eigenvalues" is a derivative of the QR algorithm, known as the dqds algorithm [Fernando and Parlett 1994; Parlett and Marques 1999], but suggest that it is avoided only because it is, in practice, found to be "very slow."⁴

Recent efforts at producing accurate SVD algorithms have focused on Jacobi iteration [Drmač and Veselić 2008a; 2008b; Drmač 2009]. Admittedly, these algorithms tend to be somewhat more accurate than the QR algorithm [Demmel and Veselić 1992].

³This paper investigated accuracy only for tridiagonal matrices. By contrast, our work concerns the dense problem, which also includes a tridiagonal/bidiagonal reduction pre-process as well as a back-transformation step. Since these steps are implemented in terms of Householder transformations, we believe them to be quite accurate, but examining their precise numerical impact on the multi-stage EVD or SVD is beyond the scope of this paper.

⁴Note that if the matrix generated by the application is already tridiagonal, then D&C and MRRR would be the methods of choice. If, additionally, only a few eigenvectors are desired and the application can accept the possibility of somewhat less accurate results, then MRRR may be preferred.


                    Workspace required for optimal performance
    EVD Algorithm      A → QA T QA^H    T → QT D QT^T               Form/Apply QA
    original QR        C: (b + 1)n      R: 2(n − 1)                 C: b(n − 1), R: 2n − 1
    restructured QR    C: (2b + 1)n     R: 2b(n − 1)                C: b(n − 1), R: 2n − 1
    D&C                C: (b + 1)n      C: n², R: n²/2 + 4n, Z: 5n  C: bn, R: 2n − 1
    BII                C: (b + 1)n      C: n², R: 5n, Z: 5n         C: bn, R: 2n − 1
    MRRR               C: (b + 1)n      C: n², R: 20n, Z: 12n       C: bn, R: 2n − 1

Fig. 2. A summary of workspace required to attain optimal performance for each suboperation of various algorithms employed in computing a Hermitian eigenvalue decomposition. Expressions denoted by R, C, and Z refer to the amount of real, complex, and integer workspace required. Some suboperations, such as forming or applying Q, may be executed with less workspace than indicated, but at a cost of significantly lower performance. Note that we consider workspace for the overall Hermitian EVD, not the individual stages in isolation. Thus, workspace refers to any storage needed beyond the n × n input matrix and the n-length vector needed to hold the eigenvalues.

However, Jacobi-based algorithms suffer from performance hurdles similar to those of conventional QR algorithms due to their common heavy reliance upon Givens rotations.

It goes without saying that some applications are dependent upon very high accuracy, even if the algorithms that achieve it are slower for many classes of matrices. Therefore, realizing a QR algorithm that is restructured for high performance becomes even more consequential.

2.5. Cost comparison

The cost and workspace requirements for each step of the EVD are summarized in Figures 1 and 2. We use these estimates to predict the benefits of a hypothetical restructured QR algorithm that applies Givens rotations with level-3 BLAS performance, shown in Figure 3. For that figure we make the following assumptions: First, we assume that K1 ≈ K2; furthermore, since microprocessor cores are increasing in speed at a faster rate than memory architectures, and since level-1 and level-2 BLAS operations are memory-bound, K1 and K2 for future architectures will likely increase. The average number of Francis steps before deflation is taken as f = 1.5, and α = 0.5 for D&C. And we entirely ignore the O(n²) cost of T → QT D QT^T in the MRRR-based algorithm. We omit BII from our results because (1) in the best case, its performance is similar to that of MRRR, and (2) it has been shown to be numerically unreliable [Dhillon 1998]. Inspecting Figure 3, together with the cost expressions from Figure 1, we make the following observations:
— A restructured QR algorithm allows competitive EVD performance relative to MRRR. For instance, when K1 = K2 = 8, an EVD using a restructured QR algorithm has the


[Figure 3 appears here: a plot of the ratio of EVD cost to MRRR-based EVD cost, as a function of K1 = K2 (the ratio of time to execute a flop in level-1/2 BLAS to level-3 BLAS, ranging from 1 to 12), for the conventional QR algorithm, the restructured QR algorithm, and D&C, in both the real and complex cases.]

Fig. 3. Predicted ratios of floating-point operation cost for various Hermitian EVD algorithms relative to the cost of Hermitian EVD via MRRR. Note that these cost estimates include estimated costs of reduction to tridiagonal form (via Householder transformations) and also the corresponding back-transformation step.

potential to be only 48% and 20% slower than MRRR-based EVD for real and complex matrices, respectively. (In comparison, using the conventional QR algorithm results in EVD that is 442% and 217% slower for real and complex matrices, respectively.)
— As K1 increases, the relative benefit of restructuring over a conventional QR algorithm quickly becomes significant.
— As K2 increases, the reduction to tridiagonal form consumes a higher fraction of the total EVD cost, regardless of which method is used. This allows the restructured QR algorithm to become more competitive with D&C- and MRRR-based EVD.
— While a restructured QR algorithm would still greatly benefit EVD on real matrices, it is the complex case where we would expect the algorithm to be most competitive with D&C- and MRRR-based methods.

Thus, a high-performance QR algorithm is theoretically within reach. The question now becomes how to achieve this high performance.
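To make the predictions in Figure 3 concrete, the model can be evaluated directly from the expressions in Figure 1. The following C sketch (our code; complex case only, with f = 1.5 and α = 0.5 as assumed above, and costs in units of n³ flops) prints the predicted ratios as K1 = K2 = K varies:

    #include <stdio.h>

    int main(void)
    {
        const double f = 1.5, alpha = 0.5;
        for (int K = 1; K <= 12; ++K) {
            double reduce = 4.0 * (K + 1) * (2.0 / 3.0); /* reduction to T     */
            double backtr = 4.0 * 2.0;                   /* back-transform     */
            double formq  = 4.0 * (4.0 / 3.0);           /* form QA (QR path)  */
            double mrrr   = reduce + backtr;             /* O(n^2) EVD ignored */
            double qr_old = reduce + 2.0 * (K * 3.0 * f) + formq;
            double qr_new = reduce + 2.0 * (3.0 * f)     + formq;
            double dc     = reduce + (4.0 / 3.0) * alpha * alpha + backtr;
            printf("K=%2d  convQR=%.2f  restrQR=%.2f  D&C=%.2f  (vs. MRRR)\n",
                   K, qr_old / mrrr, qr_new / mrrr, dc / mrrr);
        }
        return 0;
    }

At K = 8, this reproduces the complex-case ratios quoted above: approximately 3.17 for the conventional QR algorithm and 1.20 for the restructured one.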

3. TRIDIAGONAL QR ALGORITHM BASICS

We quickly review the basic ideas behind the (tridiagonal) QR algorithm for computing the EVD.

3.1. Shifted tridiagonal QR algorithm

Given a tridiagonal matrix T, the shifted QR algorithm is given by


    V := I
    for i = 0 : convergence
        κi := (some shift)
        T − κiI → QR
        T^next := RQ + κiI
        V := V Q
        T := T^next
    endfor

Here, T → QR denotes a QR factorization. Under mild conditions this converges to a matrix of structure

    ( TTL  0 )
    ( 0    λ )

where λ is an eigenvalue of T. The process can then continue by deflating the problem, which means continuing the iteration with the tridiagonal matrix TTL. Eventually, T becomes a diagonal matrix, with the eigenvalues along its diagonal, and V converges to a matrix with the corresponding eigenvectors (of the original matrix) as its columns.

Convergence occurs when one of the off-diagonal elements τi+1,i becomes negligible. Following Stewart [Stewart 2001], τi+1,i is considered to be negligible when |τi+1,i| ≤ ε √(|τi,i τi+1,i+1|). Most often, deflation will occur in element τn−1,n−2.

Two choices for shift κi are commonly used: (1) The Rayleigh Quotient shift — This method consists of choosing κi = τn−1,n−1 (i.e., the bottom-right-most element of the current tridiagonal matrix); it yields cubic convergence [Stewart 2001], but has not been shown to be globally convergent. (2) The Wilkinson shift — This shift chooses κi as the eigenvalue of

    ( τn−2,n−2  τn−1,n−2 )
    ( τn−1,n−2  τn−1,n−1 )

that is closest to τn−1,n−1; it converges at least quadratically, and is globally convergent [Stewart 2001]. We will use the Wilkinson shift, which in practice appears to yield faster results.
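For concreteness, the Wilkinson shift has a well-known closed form in terms of the trailing 2 × 2 submatrix; a minimal C sketch (ours):

    #include <math.h>

    /* Wilkinson shift: the eigenvalue of [ a b; b c ] closest to c, where
       a = τn−2,n−2, b = τn−1,n−2, c = τn−1,n−1.
       Assumes b != 0 (otherwise the matrix has already deflated). */
    static double wilkinson_shift(double a, double b, double c)
    {
        double d = 0.5 * (a - c);
        return c - (b * b) / (d + copysign(hypot(d, b), d));
    }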

3.2. The Francis Implicit Q Step

One of the most remarkable discoveries in numerical linear algebra is the Francis Implicit Q Step, which provides a clever way of implementing a single iteration of the QR algorithm with a tridiagonal matrix [Watkins 1982]. Let T − κiI equal

    ( τ0,0−κi   τ1,0      0         0         ···  0          )
    ( τ1,0      τ1,1−κi   τ2,1      0         ···  0          )
    ( 0         τ2,1      τ2,2−κi   τ3,2           ⋮          )
    ( 0         0         τ3,2      ⋱         ⋱               )
    ( ⋮         ⋮                   ⋱   τn−2,n−2−κi  τn−1,n−2 )
    ( 0         0         0         ···  τn−1,n−2  τn−1,n−1−κi )

As part of the QR factorization of T − κiI, the first step is to annihilate τ1,0. For this purpose, we compute a Givens rotation

    G0^T = ( γ0   σ0 )
           ( −σ0  γ0 )

that, when applied to the first two rows, annihilates τ1,0. (We label the rotations such that Gi^T is the rotation that affects rows i and i + 1.) But rather than applying G0^T to T − κiI, we instead apply it to the original T, affecting the first two rows, after which we apply G0 from the right

to the first two columns. This yields a matrix of the form

    ( τ0,0^next  ⋆     ⋆       0        ···  0         )
    ( ⋆          ⋆     ⋆       0        ···  0         )
    ( ⋆          ⋆     τ2,2    τ3,2          ⋮         )
    ( 0          0     τ3,2    ⋱        ⋱              )
    ( ⋮          ⋮             ⋱   τn−2,n−2   τn−1,n−2 )
    ( 0          0     0       ···  τn−1,n−2  τn−1,n−1 )

where the "⋆"s represent non-zeroes. This step of computing G0^T from T − κiI and applying it to T is said to "introduce the bulge" because it causes a non-zero entry to appear below the first subdiagonal (and, because of symmetry, above the first superdiagonal), namely the ⋆ entries at positions (2, 0) and (0, 2) above. Next, we compute a rotation G1^T that will annihilate τ2,0 and apply it to the second two rows of the updated matrix, and its transpose to the second two columns. This begins the process of "chasing the bulge" and causes the non-zero entry to reappear at element (3, 1):

    ( τ0,0^next  τ1,0^next  0      0        ···  0         )
    ( τ1,0^next  τ1,1^next  ⋆      ⋆        ···  0         )
    ( 0          ⋆          ⋆      ⋆             ⋮         )
    ( 0          ⋆          ⋆      ⋱        ⋱              )
    ( ⋮          ⋮                 ⋱   τn−2,n−2   τn−1,n−2 )
    ( 0          0          0      ···  τn−1,n−2  τn−1,n−1 )

This process continues, computing and applying rotations G0, G1, G2, ..., Gn−2 in this manner (transposed from the left, without transpose from the right) until the matrix is again tridiagonal. Remarkably, the T that results from this process is, in theory, exactly the same as the T^next in the original algorithm. This process is known as a Francis Implicit Q Step (or, more concisely, a Francis step). Neither Q nor R nor T − κiI is ever explicitly formed.
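The rotations themselves are generated as in the classic level-1 BLAS routine drotg: given the two values involved, choose γ and σ so that the second is annihilated. A simplified C sketch (ours; a production version requires more careful scaling to guard against overflow and underflow):

    #include <math.h>

    /* Compute {gamma, sigma} with gamma^2 + sigma^2 = 1 such that
           ( gamma   sigma ) (x)   (r)
           ( -sigma  gamma ) (y) = (0),
       i.e., the form of G0^T used in the text. */
    static void givens_gen(double x, double y, double *gamma, double *sigma)
    {
        if (y == 0.0) {                /* nothing to annihilate */
            *gamma = 1.0;
            *sigma = 0.0;
        } else {
            double r = hypot(x, y);    /* avoids overflow in sqrt(x*x + y*y) */
            *gamma = x / r;
            *sigma = y / r;
        }
    }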

3.3. Accumulating the eigenvectors

If we initially let V = I, then QT can be computed as follows. First, we partition V into its columns: V = ( v0 v1 ··· vn−1 ). For each Francis step, for j = 0, ..., n − 2, apply the rotations Gj to columns vj and vj+1, in the order indicated by

    G0 → G1 → G2 → G3 → G4 → G5 → G6 → G7 → G8
    v0   v1   v2   v3   v4   v5   v6   v7   v8   v9

Here, the arrows indicate the order in which the Givens rotations must be applied, while the two columns situated below each rotation represent the columns vj and vj+1 to which that rotation is applied. We refer to {G0, ..., Gn−2} as a single Francis set of rotations. Following this process, after the QR iteration is complete and T has converged to D (i.e., the eigenvalues of T), V = QT. That is, column vj contains the eigenvector of T that corresponds to the eigenvalue δj,j.


[Figure 4 appears here: a 4 × 9 grid of rotations Gj,i (one row per Francis set i = 0, ..., 3; one column per rotation index j = 0, ..., 8) positioned above columns v0, ..., v9, with arrows indicating dependencies between rotations and dotted diagonal lines marking the "waves" in which they are applied.]

Fig. 4. Applying Givens rotations from four Francis steps to a matrix V consisting of 10 columns. A rotation Gj,i is generated during the ith Francis step and is applied to columns vj and vj+1. For any given rotation G, inbound arrows indicate dependencies that must be satisfied before the rotation may be performed. Temporal and spatial locality may be increased by applying rotations in diagonal "waves" from left to right (where the rotations in each wave are separated by dotted lines), and from bottom-right to top-left within each wave, according to the dependencies shown.

If we initialize V := QA and the rotations computed from each Francis step are applied to V as previously described, then upon completion of the QR iteration V = Q, where column vj contains the eigenvector of A associated with the eigenvalue δj,j. The application of a Givens rotation to a pair of columns of Q requires approximately 6n flops (2 × 6n flops for complex V). Thus, applying n − 1 rotations to n columns costs approximately (2×)6n² flops. Since performing the Francis step without applying to Q takes O(n) computation, it is clearly the accumulation of eigenvectors that dominates the cost of the QR algorithm.

3.4. LAPACK routines {sdcz}steqr

The tridiagonal QR algorithm is implemented in LAPACK in the form of the subroutines ssteqr and dsteqr (for single- and double-precision real matrices, respectively) and csteqr and zsteqr (for single- and double-precision complex matrices, respectively). These routines implement the entire algorithm described in Sections 3.1–3.3 in a single routine.⁵ In these implementations, Givens rotations from a single Francis step are saved in temporary storage until the step is complete. These rotations, a Francis set, are then applied to QA before moving on to the next Francis step. The application of the Givens rotations to the columns of QA is implemented via calls to {sdcz}lasr, which provide flexible interfaces to routines that are very similar to the level-1 BLAS routines {sd}rot. This, inherently, means that LAPACK's {sdcz}steqr routines do not benefit from the fast cache memories that have been part of modern microprocessors since the mid-1980s. Thus, accumulating eigenvectors is costly not only because of the total number of flops, but also because those flops are performed at a very slow rate.

4. APPLYING GIVENS ROTATIONS

We now analyze the application of Givens rotations to a matrix.


4.1. The basic idea

Recall from Section 3.3 that applying one Francis set of (n − 1) rotations to the n × n matrix V in which eigenvectors are being accumulated requires approximately 6n² flops (or 2 × 6n² for complex V). This means that (2×)n² floating-point numbers must be loaded from and stored back to main memory at least once each, for an absolute minimum of (2×)2n² memory operations.⁶ Thus, the ratio of flops to memops is at best 3. In practice, this ratio is even lower. For instance, the LAPACK routines for performing this operation, {sdcz}lasr, perform (2×)6 flops for every (2×)4 memops, leaving them with a flop-to-memop ratio of just 1.5. This is a problem given that a memop typically requires much more time to execute than a flop, and the ratio between them is getting worse as microprocessor speeds outpace memory technology. We conclude that updating the matrix one Francis set at a time will not be efficient.

The first key insight behind this paper is that if one instead applies k Francis sets, the theoretical maximum number of flops per memop rises to 3k, as shown below. Evidence that this opportunity exists is presented in Figure 4. In this example, there are k = 4 Francis sets of rotations. The arrows indicate dependencies between rotations. The semantic meaning expressed by these arrows is simple: a rotation must have all dependencies (inbound arrows) satisfied before it may be performed. Notice that the drawing is a directed acyclic graph (DAG) and hence a partial order. As long as the partial order is obeyed, the rotations are applied to columns in the exact same order as in the traditional QR algorithm.

The second key insight is that the dotted lines in Figure 4 expose "waves" of rotations that may be applied. The idea here is that waves of rotations are applied from left to right, with rotations applied from bottom-right to top-left within each wave. (Notice that this pattern of applying rotations obeys the partial order.) It is easy to see that this wave-based approach increases temporal and spatial data locality. In fact, once the wavefront is fully formed, only one new column of V is brought into cache for each wave; all of the other columns needed for that wave are already residing in the cache.

The dependencies depicted in Figure 4 have been observed by others [Lang 1998; Kågström et al. 2008]. However, these efforts used the dependency graph to identify groups of waves which could be accumulated into an intermediate Givens rotation matrix and then applied to QA via the optimized level-3 BLAS operations dgemm and dtrmm. It is worth pointing out, though, that these two-stage implementations still need to accumulate individual Givens rotations into the intermediate rotation matrix. As a result, these efforts still stand to benefit from the insights in this paper, because we will show how to accelerate the direct application of Givens rotations. In Section 6.4, we discuss an optional optimization to our restructured QR algorithm that, similar to the techniques used in [Lang 1998; Kågström et al. 2008], leverages dgemm. In that section, we compare and contrast the two approaches.

We are not the first to highlight the need for optimized routines for applying Givens rotations. As part of the BLAST forum [BLAST 2002], there were requests for a routine that would apply multiple sets of rotations to a matrix. However, a standard interface never materialized, and so library maintainers were never sufficiently motivated to include the operation in their BLAS libraries.
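To see where the 3k bound comes from: applying k Francis sets performs approximately k · 6n² flops on V, while in the ideal case each of the n² elements of V is loaded from and stored to memory only once, for 2n² memops. Hence

    flops / memops ≤ (k · 6n²) / (2n²) = 3k.

(For complex V, both counts pick up the same factor of two, leaving the ratio unchanged.)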

4.2. Details of cache behavior

Let us perform a more detailed (but still simple) analysis of the cache behavior of applying Givens rotations, as described in the previous section. Let us assume we have

⁵These routines, coded in Fortran-77, are likely direct translations from EISPACK, which dates back to the late 1960s. Evidence of this exists in the number of GO TO statements in the code.

⁶Henceforth, we will call the reading or writing of a floating-point number a memory operation, or memop.


a fully-associative cache that uses the least-recently-used replacement policy, and that the number of elements in our cache, nC, is significantly less than n² but is greater than kn, where k ≪ n. Also, let us ignore the fact that accessing data of unit stride often produces cache hits due to entire cache lines being loaded from memory. We justify ignoring this effect by pointing out that the scalar-level hit and miss pattern would be identical for all column-oriented algorithms and thus would not affect our analysis. The idea is that we wish to abstract away to hits and misses of entire columns of V = QA, given that applying Givens rotations inherently means touching entire columns of data. Thus, from now on, in the context of analyzing the application of Givens rotations, we will count only column-accesses, which refer to the memory operations needed to load or store n consecutive elements.⁷

Let us begin by considering how we apply the first of k Francis sets of Givens rotations using the single-step approach employed by the traditional QR algorithm. Applying the first set of rotations accesses most of the n columns of V twice, for a total of 2(n − 1) column-accesses. Of these accesses, n are cache misses and n − 2 are cache hits. Now, we assume that since nC ≪ n², upon finishing applying the first Francis set and beginning the next, columns v0, v1, v2, ... have long since been evicted from the cache. Thus, there is no data reuse between sets of Givens rotations using the single-step approach found in LAPACK, resulting in kn cache misses and k(n − 2) cache hits, for a hit rate of approximately 50%.

We now highlight a simple observation. If multiple sets of rotations are accumulated, the order in which they are applied can be modified to improve the reuse of data in the caches. Consider the approach exemplified by Figure 4. We identify three distinct "stages" of the wavefront algorithm. For our discussion, let us define these phases as follows:

— The startup phase refers to the initial k − 1 waves of rotation applications, where the "wavelength" (the number of rotations within the wave) is initially 1 and gradually increases.
— The pipeline phase refers to the middle n − k waves of rotation applications, where each wave is of length k.
— The shutdown phase refers to the final k − 1 waves, where the wavelength becomes less than k and gradually decreases to 1.

Next, we analyze the cache behavior of this algorithm. The wavefront algorithm illustrated in Figure 4 produces k × 2(n − 1) = 2k(n − 1) column-accesses. Of these accesses, there are 2 + (k − 2) = k cache misses in the startup phase, (n − 1) − (k − 1) = n − k cache misses in the pipeline phase, and no cache misses in the shutdown phase, for a total of n cache misses. This means the ratio of cache misses to total accesses is approximately n / (2k(n − 1)) ≈ 1/(2k), and the fraction of cache hits is 1 − 1/(2k). For larger values of k, this cache hit ratio is quite close to 1, and as a result the wavefront algorithm should exhibit very favorable data access patterns.

4.3. A wavefront algorithm

We now give several algorithms for applying Givens rotations, leading up to an algorithm for wavefront application of k sets of rotations to an m × n matrix. To our knowledge, this wavefront algorithm is previously unpublished.⁸

⁷If nC < n, that is, if the matrix is so large that not even one column can fit in the cache, then we may employ the blocking technique discussed in Section 4.4, in which case the above analysis still holds.

⁸While we believe the wavefront algorithm for applying Givens rotations to be novel in the context of tridiagonal EVD via the QR algorithm, the idea of structuring computation in terms of successive waves is not new and has been used, for example, in the contexts of tridiagonalization via Householder-based band reduction [Bischof et al. 1994] and Givens-based band reduction within sparse SVD [Rajamanickam 2009].


    Algorithm: [x, y] := APPLYGMX2( m, {γ, σ}, x, y )
        if ( γ = 1 ) return fi
        for ( i := 0; i < m; ++i )
            χtmp := γχi − σψi
            ψtmp := σχi + γψi
            χi := χtmp
            ψi := ψtmp
        endfor

Fig. 5. An algorithm for applying a single Givens rotation, defined by a pair of Givens scalars {γ, σ}, to two length m columns x and y from the right.
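In C, APPLYGMX2 might be rendered as follows for real double-precision data (a sketch; function name and signature ours):

    /* Apply the Givens rotation {gamma, sigma} from the right to two
       length-m columns x and y, as in Figure 5. */
    static void apply_g_mx2(int m, double gamma, double sigma,
                            double *x, double *y)
    {
        if (gamma == 1.0) return;              /* identity rotation: skip */
        for (int i = 0; i < m; ++i) {
            double xi = x[i], yi = y[i];
            x[i] = gamma * xi - sigma * yi;
            y[i] = sigma * xi + gamma * yi;
        }
    }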

    Algorithm: [V] := APPLYGSINGLE( k, m, n, G, V )
        for ( h := 0; h < k; ++h )
            for ( j := 0; j < n − 1; ++j )
                APPLYGMX2( m, Gj,h, vj, vj+1 )
            endfor
        endfor

Fig. 6. An algorithm for applying the Givens rotations contained within an (n − 1) × k matrix G of Givens scalar pairs, in single-step fashion, to an m × n matrix V .

First, let us define a helper algorithm APPLYGMX2, shown in Figure 5, which applies a single rotation

    G = ( γ   σ )
        ( −σ  γ )

from the right to a pair of vectors x and y of length m. The rotation is never explicitly formed. Rather, it is represented and passed in by the pair of Givens scalars {γ, σ}. Also, notice that APPLYGMX2 avoids any computation if the rotation is equal to identity.

Let us now formulate, for reference, an algorithm APPLYGSINGLE that, given an (n − 1) × k matrix G of Givens scalar pairs, applies the sets of Givens rotations in single-step fashion to an m × n matrix V. We show this algorithm in Figure 6.

Figure 7 shows a wavefront algorithm that applies the Givens rotations stored in an (n − 1) × k matrix G of Givens scalars to an m × n matrix V. This algorithm implements the exact sequence of rotation application captured by the drawing in Figure 4. Each outer loop iterates through a series of waves, while the inner loops iterate through the rotations within each wave. Note that we simply call APPLYGSINGLE if there are not enough columns in V to at least fully execute the startup and shutdown phases (i.e., if n < k).

4.4. Blocking through V to control cache footprint

If applied to a matrix V with n rows, the wavefront algorithm requires approximately k columns to simultaneously fit in cache. Clearly, if k such columns do not fit, matrix V can be partitioned into blocks of rows (using a blocksize b) and the k sets of rotations can be applied to one such block at a time, as sketched below. Note that this blocking also naturally supports updating matrix V in parallel by assigning blocks of rows to different processing cores.
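A sketch of the resulting row-blocked driver (code ours; APPLYGWAVEBLK is the name used later in Figure 9, and apply_g_wave stands in for the wavefront algorithm of Figure 7, here given a leading dimension since it operates on a submatrix):

    /* Wavefront application to an mb x n row-block of V (Figure 7). */
    void apply_g_wave(int k, int m, int n, const double *G,
                      double *V, int ldv);

    /* Apply k Francis sets to the m x n matrix V, b rows at a time,
       so that roughly k columns of the current row-block fit in cache.
       Column-major storage with leading dimension ldv is assumed. */
    void apply_g_wave_blk(int b, int k, int m, int n,
                          const double *G, double *V, int ldv)
    {
        for (int i = 0; i < m; i += b) {
            int mb = (m - i < b) ? m - i : b;   /* rows in this block */
            apply_g_wave(k, mb, n, G, V + i, ldv);
        }
    }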

4.5. Why we avoid (for now) a hybrid QR/QL algorithm

There exist certain kinds of graded matrices for which the QR algorithm does not perform well. In particular, if a matrix is graded so that |τn−1,n−1| ≫ |τ0,0|, then the shifting of the matrix (computing τ0,0 − τn−1,n−1) wipes out the accuracy in τ0,0 and causes excessive roundoff error. For this reason, an implementation may choose to


    Algorithm: [V] := APPLYGWAVE( k, m, n, G, V )
        if ( n < k OR k = 1 )
            V := APPLYGSINGLE( k, m, n, G, V )
            return
        endif
        for ( j := 0; j < k − 1; ++j )                  // Startup phase
            for ( i := 0, g := j; i ≤ j; ++i, −−g )
                APPLYGMX2( m, Gg,i, vg, vg+1 )
            endfor
        endfor
        for ( j := k − 1; j < n − 1; ++j )              // Pipeline phase
            for ( i := 0, g := j; i < k; ++i, −−g )
                APPLYGMX2( m, Gg,i, vg, vg+1 )
            endfor
        endfor
        for ( j := 1; j < k; ++j )                      // Shutdown phase
            for ( i := j, g := n − 2; i < k; ++i, −−g )
                APPLYGMX2( m, Gg,i, vg, vg+1 )
            endfor
        endfor
        return

Fig. 7. An algorithm for applying the Givens rotations contained within an (n − 1) × k matrix G of Givens scalar pairs, in waves, to an m × n matrix V .

use a QL algorithm instead, in which T − κiI → QL and T^next := LQ + κiI. Indeed, depending on which of τ0,0 and τn−1,n−1 is greater in magnitude, QR or QL may be used on an iteration-by-iteration basis.

Alternating between QR and QL steps disrupts our ability to accumulate and apply many rotations. Key to applying the wavefront algorithm is that all Francis steps move in the same direction through matrix T: in the case of the QR algorithm, from top-left to bottom-right. Clearly, if we only used the QL algorithm, we could formulate a similarly optimized algorithm with waves moving in the opposite direction. It would also be possible to periodically switch between multiple QR steps and multiple QL steps. However, frequent mixing of directions would limit our ability to reuse data in cache. It is for this reason that we don't alternate between QR and QL steps. As a result, this is the one part of the traditional implementation that we do not yet directly incorporate in our restructured algorithm.

5. A RESTRUCTURED TRIDIAGONAL QR ALGORITHM

In the last section, we discussed how the application of multiple Francis sets of rotations can, in principle, yield a performance boost. The problem is that the QR algorithm, as implemented in LAPACK, is not structured to produce multiple sets of rotations. In this section, we discuss the necessary restructuring.

5.1. The basic idea

The fundamental problem is that applying multiple sets of rotations is most effective if there are a lot of rotations to be applied. Now, take a possible state of T after some Francis steps have been performed:

        ( T0  0   0  )
    T = ( 0   T1  0  ).
        ( 0   0   T2 )


    Algorithm: [T, G] = TRIQRALGKSTEPS( i, k, n, T )
        if ( n ≤ 1 OR i = k − 1 )
            return
        else if ( n = 2 )
            [T, G0,i] := SYMEVD2X2( T )
            return
        else if ( no deflation is possible )
            Perform QR Francis step, computing G0:n−2,i
            [T, G0:n−2,i+1:k−1] := TRIQRALGKSTEPS( i + 1, k, n, T )
        else
            T → ( T0  0  )   where T0 is n0 × n0 and T1 is n1 × n1
                ( 0   T1 )
            [T0, G0:n0−2,i:k−1] := TRIQRALGKSTEPS( i, k, n0, T0 )
            [T1, Gn0:n−2,i:k−1] := TRIQRALGKSTEPS( i, k, n1, T1 )
        endif
        return

Fig. 8. A recursive algorithm for computing k Francis steps, saving the rotations in G. It is assumed that G was initialized with identity rotations upon entry.

    Algorithm: [T, V] := RESTRUCTUREDTRIQR( b, k, n, T, V )
        do
            Partition  T → diag( T0, T1, ..., TN−1, D ),
                       G → ( G0; g0; G1; g1; ...; GN−1; gN−1; GN )
            where Ti is ni × ni and unreducible,
                  D is nN × nN and diagonal,
                  nN is as large as possible,
                  Gi is (ni − 1) × k, and
                  gi is 1 × k.
            if ( nN = n ) break
            for ( i = 0; i < N; ++i )
                [Ti, Gi] := TRIQRALGKSTEPS( 0, k, ni, Ti )
            endfor
            [V] := APPLYGWAVEBLK( b, k, n, n, G, V )
        od
        return

Fig. 9. Our restructured QR algorithm for the tridiagonal EVD.

What we capture here is the fact that an internal off-diagonal element may become negligible, causing a split. What a typical QR algorithm would do is continue separately with each of the diagonal blocks, but that would disrupt the generation of all rotations for all k Francis steps with the original matrix T. We still want to take advantage of splitting. Essentially, if a split occurs and the number of Francis steps completed so far is i < k, then for each block on the diagonal we perform the remaining

k − i Francis steps, while continuing to accumulate rotations. After the k rotation sets are accumulated for all subproblems along the diagonal of T, the rotations are applied to V. A recursive algorithm that achieves this is given in Figure 8. What is left is an outer loop around this routine, as given in Figure 9.

5.2. Restructured QR algorithm details

Now that we have given a brief, high-level description of the restructured tridiagonal QR algorithm, let us walk through the algorithm in further detail.

We start with the RESTRUCTUREDTRIQR algorithm in Figure 9. The first time through the outer loop, it is assumed that T has no negligible (or zero) off-diagonal entries. Thus, N = 1, n0 = n, nN = 0, and as a result T is partitioned such that T0 contains all of T and D is empty. Likewise, G is partitioned into just the (n − 1) × k subpartition G0, which contains all of G. (Note that, once a split occurs, the reason we separate Gi from Gi+1 by a single row vector gi is because this vector corresponds to the rotations that would have been applied to the 2 × 2 submatrix starting at element (ni − 1, ni − 1) along the diagonal of T if the split had not already occurred there.) Now, since N = 1, the inner loop executes just once, calling TRIQRALGKSTEPS with T0 = T and passing in i = 0.

Within TRIQRALGKSTEPS, we return if T is empty or a singleton. (We also return if i = k − 1, which happens when we have accumulated the Givens scalars for k Francis sets.) If n = 2, we compute the EVD of T via SYMEVD2X2, which returns the eigenvalues λ1 and λ2 along the diagonal of T and a pair {γ, σ} (stored to G) such that

    ( γ   σ )^T  ( λ1  0  )  ( γ   σ )
    ( −σ  γ )    ( 0   λ2 )  ( −σ  γ )  =  T.

Otherwise, if n > 2, we search the off-diagonal of T for negligible elements. If we find a negligible element, then we may split T into top-left and bottom-right partitions T0 and T1, which are n0 × n0 and n1 × n1, respectively, where n0 + n1 = n, and then call TRIQRALGKSTEPS recursively on these subpartitions, saving the resulting Givens scalar pairs to the corresponding rows of columns i through k − 1. If no negligible off-diagonal elements are found, then we perform a Francis step on T, saving the Givens scalar pairs to the ith column of matrix G. We then proceed by recursively calling TRIQRALGKSTEPS with an incremented value of i. At some point, we will fill matrix G with Givens scalar pairs, at which time control returns to RESTRUCTUREDTRIQR and APPLYGWAVEBLK applies the rotations encoded in G to matrix V.

After the application, the outer loop repeats and, upon re-entering the inner loop, another "sweep" of T begins, where up to k Francis steps are performed on each submatrix Ti before once again applying the rotations to V. This process repeats until the entire off-diagonal of T becomes negligible, at which time nN = n, and control breaks out of the outer loop, signaling that the QR algorithm's execution is complete.

5.3. Impact of identity rotations

Notice that we initialize G to contain identity rotations (γ = 1 and σ = 0) prior to each entry into the inner loop of RESTRUCTUREDTRIQR. This allows Francis steps executed by TRIQRALGKSTEPS to simply overwrite elements of G with scalar pairs representing Givens rotations as they are computed, without having to worry about which elements are not needed due to deflation (and splitting). Then, APPLYGWAVE simply detects these identity rotations and skips them so as to save computation.

Depending on how much splitting occurs, this method of storing rotations could have an impact on cache performance. We would expect that the wavefront algorithm for applying Givens rotations would execute most favorably when splits occur only (or


[Figure 10 appears here: a diagram of the (n − 1) × k matrix G (shown transposed, width n − 1, height k) above the matrix V, with "×" marks indicating non-identity Givens scalar pairs and the affected (highlighted) columns of V below them.]

Fig. 10. An illustration of the possible contents of the (n − 1) × k matrix G at the end of a sweep, just before its rotations are applied to matrix V. (G is shown transposed to better visualize the portion of V (highlighted) that is affected by the application.) Givens scalar pairs, which encode Givens rotations, are marked by "×" symbols, while unmarked locations represent identity rotations (which are skipped). This picture shows what G might look like after executing a sweep that encountered two instances of splitting: one in the first Francis step, which occurred while iterating over the entire non-deflated portion of T, and one in the sixth Francis step, which occurred within the lower-right recursive subproblem that began as a result of the first split.

mostly) at or near the bottom-right corner of T. This is because each column of V would be accessed and re-used a maximum number of times. However, Figure 10 shows how splitting early on in a set of Francis steps can create "pockets" of identity rotations. When these identity rotations are skipped, the columns associated with these rotations are not touched. This results in a somewhat greater proportion of cache misses, since the reduction in overall column-accesses comes entirely from a reduction in column-accesses that would have been cache hits. The worst case arises when a split, and subsequent deflation, occurs (1) somewhere in the middle of T, corresponding to the columns that would have been updated as part of full waves, and/or (2) very early on in the Francis set, as it gives the pocket of identity rotations the most time to grow due to subsequent tail deflation in the upper-left subproblem.

6. OPTIMIZING THE APPLICATION OF GIVENS ROTATIONS

In Section 4, we developed a wavefront algorithm for applying Givens rotations that is, in theory, capable of accessing data in a very cache-efficient manner. We also introduced a supplementary blocked algorithm to partition the eigenvector matrix V into smaller panels to further control the amount of data from V the algorithm tries to keep in cache without compromising the expected hit rate for column-accesses. This effectively induces a blocked algorithm for the operation, which constitutes a substantial algorithm-level improvement over a naive approach that applies one Francis set of rotations at a time to the entire matrix V. Even with these advances, further optimizations are possible.

6.1. Instruction-level optimizations Instead of implementing the wavefront algorithm implementation in traditional C and relying only upon the compiler to perform instruction-level optimizations, we instead utilize SSE vector intrinsics to facilitate the use of SIMD hardware instructions. These low-level, vector-based APIs are supported by recent versions of both the Intel and GNU C compilers, and allow the software developer to precisely specify when to load and store data. In our case, the vector intrinsic code was applied only to the implemen- tation of APPLYGMX2 (and derivatives mentioned in the next section). The curious

ACM Transactions on Mathematical Software, Vol. 0, No. 0, Article 00, Publication date: January 2012. 00:18 F. G. Van Zee et al.

G0,3 G1,3 G2,3 G3,3 G4,3 G5,3 G6,3 G7,3 G8,3 → → տ տ տ տ տ տ↑ տ ↑ տ G0,2 G1,2 G2,2 G3,2 G4,2 G5,2 G6,2 G7,2 G8,2 → → → → → → → → ↑↑↑↑↑↑↑↑տ տ տ տ տ տ տ տ G0,1 G1,1 G2,1 G3,1 G4,1 G5,1 G6,1 G7,1 G8,1

տ տ տ տ տ տ տ տ G0,0 G1,0 G2,0 G3,0 G4,0 G5,0 G6,0 G7,0 G8,0 → → → → → → → → v0 v1 v2 v3 v4 v5 v6 v7 v8 v9 Fig. 11. By fusing pairs of successive rotations within the same wave, the number of memory operations performed on each element of V , per rotation applied, may be reduced from approximately four to three. Here, fused rotations are highlighted, with arrows redrawn to show dependencies within a fused pair as well as dependencies between pairs. Rotation pairs may be applied in waves, similar to the approach illustrated by Figure 4. Note that the updated dependencies satisfy the original dependency graph.

[Figure 12 appears here: the grid of rotations Gj,i from Figure 4, with pairs of successive rotations from the same Francis step highlighted as fused pairs.]

Fig. 12. Instead of fusing successive rotations within the same wave, as shown in Figure 11, we may also fuse pairs of successive rotations from the same Francis step. This approach similarly reduces the number of memory operations performed on each element of V per rotation from four to three.

reader can find a concrete code example in a related technical report [Van Zee et al. 2011]. Next, we will show how precisely controlling the loading and storing of data to and from main memory opens the door to further algorithmic-level optimizations within the application of Givens rotations.
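As an illustration of the flavor of such code, here is a minimal SSE2 sketch of an inner kernel for real double-precision data (ours, not the actual libflame kernel; it assumes m is even and that x and y are 16-byte aligned):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Apply one Givens rotation {gamma, sigma} to columns x and y,
       two elements per iteration. */
    static void apply_g_mx2_sse(int m, double gamma, double sigma,
                                double *x, double *y)
    {
        __m128d g = _mm_set1_pd(gamma);
        __m128d s = _mm_set1_pd(sigma);
        for (int i = 0; i < m; i += 2) {
            __m128d xv = _mm_load_pd(x + i);
            __m128d yv = _mm_load_pd(y + i);
            /* x' = gamma*x - sigma*y;  y' = sigma*x + gamma*y */
            __m128d xn = _mm_sub_pd(_mm_mul_pd(g, xv), _mm_mul_pd(s, yv));
            __m128d yn = _mm_add_pd(_mm_mul_pd(s, xv), _mm_mul_pd(g, yv));
            _mm_store_pd(x + i, xn);
            _mm_store_pd(y + i, yn);
        }
    }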

6.2. Fusing rotations

In [Van Zee et al. 2012], the authors showed how certain level-2 BLAS operations could be "fused" to reduce the total number of memory accesses, which tend to be rather slow and expensive relative to floating-point computation. They further showed that fusing two operations at the register level—that is, re-using the data while it is still loaded in the register set—is superior to a cache-level approach whereby we simply assume that recently touched data in the cache will be readily available. The bottom line: all else being equal, reducing the number of memory operations, regardless of whether the data is already in cache, will typically allow shorter execution times and thus higher performance.
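To make the register-level reuse concrete before turning to the figures, the following scalar C sketch (ours) fuses one pair of rotations from the same wave, e.g., the pair G1,0 and G0,1 discussed below; element i of the shared column v1 stays in a register between the two updates, so it is loaded and stored only once:

    /* Apply {ga, sa} to (v1, v2), then {gb, sb} to (v0, v1), fused. */
    static void apply_g_fused_pair(int m,
                                   double ga, double sa,  /* first rotation  */
                                   double gb, double sb,  /* second rotation */
                                   double *v0, double *v1, double *v2)
    {
        for (int i = 0; i < m; ++i) {
            double x1 = v1[i], x2 = v2[i];
            double t1 = ga * x1 - sa * x2;   /* updated v1 element ...      */
            v2[i] = sa * x1 + ga * x2;       /* ... is kept in a register   */
            double x0 = v0[i];               /* only new load needed        */
            v0[i] = gb * x0 - sb * t1;
            v1[i] = sb * x0 + gb * t1;
        }
    }

Per element, this performs three loads and three stores for two rotations, i.e., three column-accesses per rotation instead of four, matching the analysis below.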


Fig. 13. Combining the approaches from Figures 11 and 12, we can fuse in both the k and n dimensions. This creates parallelogram-shaped groups of fusings. This two-dimensional fusing reduces the average number of column-access memory operations per rotation from four, in an algorithm that uses no fusing, to just two.

Consider Figure 11. Here, we have made the observation that successive Givens rotations within a single wave may be fused to reduce the number of times the rotations must access their corresponding columns of V. (Dependency arrows have been redrawn to indicate dependencies within pairs of rotations as well as dependencies between pairs.) Let us consider one such fusable pair, G1,0 and G0,1, in detail. When no fusing is employed, G1,0 will load and store each element of v1 and v2, for a total of four column-accesses. Immediately following, G0,1 will load and store v0 and v1, for another four column-accesses. However, any given element of v1 does not change between the time that G1,0 stores it and G0,1 loads it. Thus, we could fuse the application of rotations G1,0 and G0,1. This would result in the ith element of v1 and the ith element of v2 first being loaded (by G1,0). After computing the updated values, only element i of v2 is stored while the ith element of v1 is retained in a register. At this point, G0,1 need only load the corresponding ith element from v0 in order to execute. After its computation, G0,1 stores the ith elements of both v0 and v1. This reduces the total number of column-accesses needed to apply two rotations from eight to six. Put another way, this reduces the number of column-accesses needed per rotation from four to three, a 25 percent reduction in memory operations. Likewise, this increases the ratio of total flops to total memops from 1.5, as established in Section 4.1, to 2.
We can express the number of memory accesses in terms of the level of fusing, as follows. If we fuse p rotations in a single wave, where p ≪ k, then the approximate number of total column-accesses incurred is given by 2kn(1 + 1/p), and thus, with k(n − 1) total rotations to apply, the number of column-accesses needed per rotation is approximately 2(1 + 1/p). Notice that aggressively increasing p results in diminishing improvement in the number of memory accesses.
If we turn our attention back to Figure 4, we can find another method of fusing rotation pairs. This method is illustrated in Figure 12. Instead of fusing successive rotations within the same wave, we can instead fuse successive rotations from the same Francis step. This approach, like the previous one, is generalizable to fuse p rotations, and similarly reduces the number of column memory operations incurred per rotation to 2(1 + 1/p).
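As an illustration of the register-level reuse described above for the pair G1,0 and G0,1, consider the following scalar C sketch (vectorization omitted for clarity); the function name and rotation-scalar names are illustrative assumptions, not the paper's. Each loop iteration performs three loads and three stores for two rotations, matching the three column-accesses per rotation derived above.

/* A scalar sketch of fusing two rotations: G(1,0) = (g10, s10) applied
   to columns (v1, v2), then G(0,1) = (g01, s01) applied to columns
   (v0, v1). The updated element of v1 is held in the register variable
   t1 between the two rotations, so it is loaded and stored only once. */
void apply_fused_pair( int m,
                       double g10, double s10,
                       double g01, double s01,
                       double* v0, double* v1, double* v2 )
{
    for ( int i = 0; i < m; ++i )
    {
        double x1 = v1[ i ];              /* load v1[i]                */
        double x2 = v2[ i ];              /* load v2[i]                */
        double t1 = g10 * x1 + s10 * x2;  /* apply G(1,0); keep t1     */
        v2[ i ]   = g10 * x2 - s10 * x1;  /* store v2[i]               */
        double x0 = v0[ i ];              /* load v0[i]                */
        v0[ i ]   = g01 * x0 + s01 * t1;  /* apply G(0,1); store v0[i] */
        v1[ i ]   = g01 * t1 - s01 * x0;  /* store v1[i]; t1 never
                                             revisited memory          */
    }
}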

Now, let us revisit both Figures 11 and 12. Notice that the dependency arrows allow us to fuse rotations along both the k dimension (across Francis sets) and the n dimension (within a Francis set) of matrix G, as illustrated by Figure 13, to create parallelogram-shaped groups of fused rotations. By inspection, we find that the average number of column-accesses per rotation, nCAPR, incurred by this more generalized method of fusing can be expressed as

nCAPR ≈ 2(pk + pn) / (pk pn),
where pk and pn are the number of rotations fused in the k and n dimensions, respectively. Notice that for pk = pn = 2, the situation captured by Figure 13, this two-dimensional fusing reduces the average number of memory operations required per rotation to just two column-accesses. This places the ratio of total flops to total memops at 3. Also notice that the optimal fusing shape is that of a "square" parallelogram (i.e., pk = pn) because it corresponds to the classic problem of maximizing the area of a rectangle (the number of rotations) relative to its perimeter (the number of column-accesses). In the limit, when pk = k and pn = n, the ratio of total flops to memops rises to our theoretical maximum, 3k. However, values of pk and pn higher than 2 are harder to implement because they imply that either (1) an increasing number of Givens scalars are being kept in registers, or (2) an increasing number of Givens scalars' load times are being overlapped with computation (a technique sometimes called "pre-fetching"). Since the processor's register set is typically quite small, either of these situations may be sustained only to a limited extent.

6.3. Pitfalls to fusing
In our experience, we have encountered one major caveat to fusing: if any of the rotations within a fused group is the identity rotation, then extra conditional logic must be introduced into the algorithm so as not to apply that rotation to V unnecessarily. This extra logic may add some overhead to the algorithm. In addition, the larger the fused group of rotations, the greater the chance that any given fused group will contain an identity rotation, which reduces opportunities for cache reuse during the execution of that group.

6.4. An alternate method of forming Q for complex matrices
Until now, we have only discussed Hermitian EVD whereby Givens rotations computed during the QR iteration are applied directly to V = QA, where QA ∈ C^(n×n). However, there is another option that holds potential as long as the input matrix A remains complex.
Consider an algorithm for EVD in which we initialize VR = I, where VR ∈ R^(n×n). Rotations may then be accumulated into VR as before, which results in VR = QT. Subsequently, Q may be formed by computing Q := QA VR, which may be done with a general matrix multiplication where the left input matrix is complex and the right input matrix is real. While this kind of dgemm is not implemented in the BLAS, it is equivalent to multiplying VR by the real and imaginary components of QA separately. If all matrices are stored in column-major order, this can be done by using dgemm and specifying the m and leading dimensions of QA as twice their actual values, since the floating-point representation of complex numbers consists of two real-valued components. This requires that QA be copied to a temporary matrix Q̂A so that dgemm can overwrite QA := Q̂A VR.
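The following is a minimal sketch of this trick using the CBLAS interface; the function name and argument layout are our own illustrative assumptions. Viewing the column-major complex matrix Q̂A as a real matrix with 2n rows (each column's real and imaginary parts interleaved) lets a single call to dgemm compute QA := Q̂A VR.

#include <complex.h>
#include <cblas.h>

/* A sketch, assuming column-major storage: Qhat is a complex n x n
   copy of QA, VR is the real n x n matrix of accumulated rotations,
   and the product overwrites QA. Viewed as real data, Qhat has 2n
   rows, so m and the leading dimensions are doubled. */
void form_q_via_dgemm( int n,
                       const double complex* Qhat, int ldqhat,
                       const double* VR, int ldvr,
                       double complex* QA, int ldqa )
{
    cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                 2 * n,                 /* m: twice the complex row dim */
                 n, n,
                 1.0, ( const double* ) Qhat, 2 * ldqhat,
                 VR, ldvr,
                 0.0, ( double* ) QA, 2 * ldqa );
}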

ACM Transactions on Mathematical Software, Vol. 0, No. 0, Article 00, Publication date: January 2012. Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance 00:21 and van de Geijn 2008] (whereas applying Givens rotations is inherently limited to 75% of peak on some architectures, as discussed later in Section 9.2). The second advantage is that this method incurs fewer floating-point operations when the number of Francis steps performed per deflation, f, is above a certain thresh- old. As discussed earlier in this paper, applying rotations directly to V = QA leads to approximately 2 3fn3 flops. However, accumulating the rotations to a real matrix × 3 VR results in only 3fn flops. The subsequent invocation of dgemm incurs an additional 2(2n)n2 = 4n3 flops. Thus, this approach incurs fewer flops when 3fn3 +4n3 < 2 3fn3, 4 × or, when f > 3 . However, additional flops may be saved during the accumulation of VR. Notice that after applying the first Francis set of rotations to VR = I, the matrix is filled in down to the first subdiagonal with the remaining entries containing zeros. Applying the second Francis set results in one additional subdiagonal of fill-in. By exploiting this gradual fill-in, we can save approximately n3 flops, reducing the total number of Givens rotation flops from 3fn3 to approximately (3f 1)n3. With these flop savings, the method now incurs fewer flops when f > 1, which− is almost always the case. The most obvious disadvantage of this approach is that it requires (n2) workspace. O Specifically, one would need to allocate a real n n matrix VR to store the accumulated × rotations. In addition, it would require a second complex n n matrix to hold QˆA. × (Actually, one could get by with allocating a smaller b n matrix for QˆA, where b n but still large enough to allow dgemm to operate at high× performance, and use a blocked≪ algorithm to repeatedly perform several smaller panel-matrix multiplications.) The idea of using matrix-matrix multiplication to accelerate rotation-based algo- rithms has been studied before [Lang 1998; Quintana-Ort´ı and Vidal 1992]. Most re- cently, in [Kagstr˚ om¨ et al. 2008], it has been shown that it is possible to accumulate blocks of rotations and then use dgemm and dtrmm to multiply these blocks into V = QA. This method uses only (n) workspace. However, this method incurs roughly 50% more flops than our plainO restructured QR algorithm due to overlap between the accu- mulated blocks. This method would become advantageous when the ratio of observed 2 performance between APPLYGWAVEBLK and dgemm (and dtrmm) is less than 3 . Other- wise, we would expect this method to exhibit longer run times than that of applying rotations directly to V = QA via the blocked wavefront algorithm. (In Section 9.2, we report the performance of an implementation of APPLYGWAVEBLK that does indeed exceed this two-thirds threshold relative to dgemm on the test machine.)

7. NUMERICAL ACCURACY
Ordinarily, this kind of paper would include extensive experiments to measure the accuracy of the new approach and compare it to that of other algorithms. However, because the computations performed by our restructured QR algorithm, and the order in which they are performed, are virtually identical to those of the original implementation in LAPACK (modulo the case of certain highly graded matrices), it follows that the algorithm's numerical properties are preserved, and thus the residuals ‖Avj − λjvj‖ are small. Furthermore, the accuracy and convergence properties of the tridiagonal QR algorithm relative to other algorithms are well-studied subjects [Wilkinson 1968; Stewart 2001; Demmel et al. 2008].
It is worth pointing out that even in the case of graded matrices, there are certain cases where the QR algorithm performs well. Specifically, if the matrix grading is downward along the diagonals, the QR algorithm typically yields eigenvalues and eigenvectors to high relative accuracy [Stewart 2001].
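As a concrete illustration of the residual just mentioned, the following is a minimal sketch (real arithmetic, unblocked loops for clarity) that computes the largest two-norm residual ‖Avj − λjvj‖ over a computed eigendecomposition; all names and the column-major layout are our own assumptions, not code from the paper.

#include <math.h>

/* A sketch: for a symmetric n x n matrix A (column-major, leading
   dimension lda) with computed eigenvalues lambda[] and eigenvectors in
   the columns of V (leading dimension ldv), return
   max over j of || A*v_j - lambda_j*v_j ||_2. */
double max_residual( int n, const double* A, int lda,
                     const double* lambda, const double* V, int ldv )
{
    double maxres = 0.0;
    for ( int j = 0; j < n; ++j )
    {
        const double* vj = &V[ j * ldv ];
        double nrm2 = 0.0;
        for ( int i = 0; i < n; ++i )
        {
            double ri = -lambda[ j ] * vj[ i ];   /* -lambda_j * v_j(i) */
            for ( int k = 0; k < n; ++k )
                ri += A[ i + k * lda ] * vj[ k ]; /* + (A * v_j)(i)     */
            nrm2 += ri * ri;
        }
        if ( sqrt( nrm2 ) > maxres ) maxres = sqrt( nrm2 );
    }
    return maxres;
}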


8. EXTENDING TO SINGULAR VALUE DECOMPOSITION
Virtually all of the insights presented in this paper thus far may be easily applied to the SVD of a dense matrix. Given a general matrix A ∈ C^(m×n), its singular value decomposition is given by A = UΣV^H, where U ∈ C^(m×m), V ∈ C^(n×n), both U and V are unitary, and Σ ∈ R^(m×n) is diagonal with nonnegative entries. The singular values of A can then be found on the diagonal of Σ while the corresponding left and right singular vectors reside in U and V, respectively.
Computing the SVD is similar to the three stages of dense Hermitian EVD [Stewart 2001]: reduce matrix A to real bidiagonal form B via unitary transformations: A = UA B VA^H; compute the SVD of B: B = UB Σ VB^T; and back-transform the left and right singular vectors of B: U = UA UB and V = VA VB, so that A = UΣV^H. If m is larger than n to a significant degree, it is often more efficient to first perform a QR factorization A → QR and then reduce the upper triangular factor R to real bidiagonal form [Stewart 2001], as R = UR B VR^H. The process then unfolds as described above, except that after the bidiagonal SVD (and back-transformation, if applicable), the matrix UR UB must eventually be combined with Q to form the left singular vectors of A. Let us briefly discuss each stage.
Reduction to real bidiagonal form. One can reduce a matrix to bidiagonal form in a manner very similar to that of tridiagonal form, except that here separate Householder transformations are computed for and applied to the left and right sides of A. In the (more common) case of m ≥ n, where the matrix is reduced to upper bidiagonal form, these transformations annihilate the strictly lower triangle and all elements above the first superdiagonal. Families of algorithms for reducing a matrix to bidiagonal form are given in [Van Zee et al. 2012].
Computing the SVD of a real bidiagonal matrix. Computing the SVD of a real bidiagonal matrix can be performed via Francis steps similar to those discussed in Section 3.2, except that: (1) the shift is typically computed from the smallest singular value of the trailing 2 × 2 submatrix along the diagonal [Stewart 2001]; (2) the bulge in the Francis step is chased by computing and applying separate Givens rotations from the left and the right; and (3) the left and right Givens rotations are similarly applied to left and right matrices which, upon completion, contain the left and right singular vectors of the bidiagonal matrix B. Also, the test to determine whether a superdiagonal element is negligible is somewhat different than the test used in the tridiagonal QR algorithm. Details concerning the negligibility test, along with other topics related to computing singular values, may be found in [Demmel and Kahan 1990]. Others have successfully developed Divide and Conquer [Gu and Eisenstat 1995a] and MRRR [Willems et al. 2006] algorithms for computing the singular value decomposition of a bidiagonal matrix. The most recent release of LAPACK (version 3.3.1 as of this writing) includes an implementation of dense matrix SVD based on the D&C algorithm; however, an MRRR implementation has not yet been published in the library. Many algorithmic variants of the SVD exist, including several based on the QR algorithm [Cline and Dhillon 2006], but for practical purposes we limit our consideration to those algorithms implemented in LAPACK.
Back-transformation of U and V.
The back-transformation of U and V takes place in a manner similar to that of Q in the EVD, except that when m > n only the first n columns of UA are updated by UB, and when m < n only the first m columns of VA are updated by VB.
Figures 14 and 15 show floating-point operation counts and workspace requirements, respectively, for the dense SVD problem.


Fig. 14. A summary of approximate floating-point operation cost for suboperations of various algorithms for computing a general singular value decomposition. Integer scalars at the front of the expressions (e.g. "4 × . . .") indicate the floating-point multiple due to performing arithmetic with complex values. In the cost estimates for the QR algorithms, f represents the typical number of Francis steps needed for an eigenvalue to converge. For simplicity, it is assumed that m ≥ n, but also that it is not the case that m ≫ n, and thus an initial QR factorization of A is not employed.

Fig. 15. A summary of workspace required to attain optimal performance for each suboperation of various algorithms employed in computing a general singular value decomposition. For simplicity, it is assumed that m ≥ n, but also that it is not the case that m ≫ n, and thus an initial QR factorization of A is not employed.
When counting workspace for SVD, we do not count either of the m × m or n × n output singular vector matrices as workspace. Here, we stray from our convention of only exempting the input matrices since it is not possible to output all singular vectors by overwriting the input matrix. As with the reduction to tridiagonal form, our reduction to bidiagonal form retains the triangular factors of the block Householder transformations, except here the operation requires an additional 2bn complex storage elements since we must retain the triangular factors for both UA and VA.
Note that the back-transformation of D&C has an O(n³) cost difference relative to the QR algorithm that is very similar to the cost difference found in the D&C and MRRR algorithms for computing the Hermitian EVD. Actually, when m = n, the additional cost is (4/3)n³, which is twice as large as that of the EVD because the SVD must back-transform both U and V. The original and restructured QR algorithms can achieve high performance with approximately O(b(m + n)) workspace while the D&C-based method requires an additional 3n² workspace elements.

9. PERFORMANCE
We have implemented the restructured QR algorithm, along with several algorithms for applying multiple sets of Givens rotations, including several optimized variants of the wavefront algorithm presented in Section 4. In this section, we discuss the observed performance of these implementations.


9.1. Platform and implementation details
All experiments reported in this paper were performed on a single core of a Dell PowerEdge R900 server consisting of four Intel "Dunnington" six-core processors, where each core has a 16 megabyte L2 cache. Each core provides a peak performance of 10.64 GFLOPS. Performance experiments were gathered under the GNU/Linux 2.6.18 operating system. All experiments were performed in double-precision floating-point arithmetic on complex Hermitian (for EVD) and complex general (for SVD) input matrices. Generally speaking, we test three types of implementations:

— netlib LAPACK. Our core set of comparisons is against the Hermitian EVD and SVD implementations found in the netlib distribution of the LAPACK library, version 3.3.1 [Anderson et al. 1999; LAPACK 2011]. When building executables for these netlib codes, we link to OpenBLAS 0.1-alpha-2.4 [OpenBLAS 2011]. The OpenBLAS project provides highly-optimized BLAS based on the GotoBLAS2 library, which was released under an open source license subsequent to version 1.13.
— MKL. We also compare against the Math Kernel Library (MKL), version 10.2.2, which is a highly-optimized commercial product offered by Intel that provides a wide range of scientific computing tools, including an implementation of LAPACK as well as a built-in BLAS library.
— libflame. The restructured QR algorithm discussed in this paper was implemented in the C programming language and integrated into the libflame library for dense matrix computations, which is available to the scientific computing community under an open source license [Van Zee 2011; libflame 2011]. When linking these implementations, we use the same OpenBLAS library used to link the netlib codes.

We compile the netlib and libflame codes with the GNU Fortran and C compilers, respectively, included in the GNU Compiler Collection version 4.1.2.

9.2. Applying Givens rotations
Before presenting dense EVD and SVD results, we first review performance results for various implementations for applying k sets of Givens rotations to an n × n matrix. For these tests, random (non-identity) rotations were applied to a random matrix V. The results for these implementations are shown in Figure 16.
Let us begin our discussion with the results for "ApplyGsingle (no SSE)," which most closely represent the performance of the LAPACK routine zlasr, which applies a single set of Givens rotations to a double-precision complex matrix. Asymptotically, the performance for this implementation is very low, on par with un-optimized level-1 BLAS routines. As with many so-called "unblocked" algorithms, this one performs somewhat better for smaller problem sizes, when the entire matrix V can fit in the L2 cache. Looking at the results for "ApplyGsingle (with SSE)," we see the benefit of using SSE instructions (implemented via C compiler intrinsics). However, performance beyond small problems is still quite low due to the algorithm's relative cache-inefficiency, as discussed in Section 4. By switching to a wavefront algorithm with blocking (while still using SSE vector intrinsics), as reflected by "ApplyGwaveBlk (no fusing)," performance is significantly improved and quite consistent as problem sizes grow larger. We further optimize the blocked wavefront algorithm by employing three kinds of fusing (labeled as pk × pn), corresponding to the illustrations in Figures 11, 12, and 13. Remarkably, fusing with pk = pn = 2 yields performance that is over 60% higher than a similar blocked wavefront implementation without fusing. We include dense dgemm performance (via OpenBLAS) for reference, where all three matrices are square and equal to the problem size. We also mark the maximum attainable peak performance for applying Givens rotations.9



Fig. 16. Performance of various implementations for applying k = 192 sets of Givens rotations. Those implementations based on blocked algorithms use an algorithmic blocksize b = 256. The maximum attainable peak for applying Givens rotations, which is 25% lower than that of dgemm because of a 1:2 ratio of floating-point additions to multiplications, is also shown. Performance results for dgemm (via OpenBLAS), where all matrix dimensions are equal, are included for reference.

9.3. Hermitian eigenvalue decomposition
Figure 17 shows relative performance for eight implementations for Hermitian EVD, with linearly distributed eigenvalues. (We report results for five other distributions of eigenvalues in Appendix A, along with descriptions of how the distributions were created. There, we also report results for computation on real-domain matrices.) The implementations tested are described as follows:
— MKL EVD via QR refers to MKL zheev.
— MKL EVD via D&C refers to MKL zheevd.
— MKL EVD via MRRR refers to MKL zheevr.
— netlib EVD via QR refers to netlib zheev linked to OpenBLAS.
— netlib EVD via D&C refers to netlib zheevd10 linked to OpenBLAS.
— netlib EVD via MRRR refers to netlib zheevr linked to OpenBLAS.

9 Matrix-matrix multiplication typically incurs equal numbers of floating-point additions and multiplications, which is well-suited for modern architectures that can execute one add and one multiply simultaneously. But if an operation does not have a one-to-one ratio of additions to multiplications, some instruction cycles are wasted, resulting in a lower peak performance. The application of Givens rotations is one such operation, generating one addition for every two multiplications. Furthermore, we confirmed that our test system does indeed suffer from this limitation in floating-point execution. Thus, our maximum attainable peak is 7.98 GFLOPS, which is 25% lower than the single core peak of 10.64 GFLOPS.
10 During our tests, we discovered that the netlib implementation of zheevd, when queried for the optimal amount of workspace, returns a value that is approximately b(n − 1) less than is needed for the back-transformation to run as a blocked algorithm, which unnecessarily limits performance for most problem sizes. When running our tests, we always provided zheevd with additional workspace so that the routine could run at full speed.



Fig. 17. Performance of implementations relative to netlib EVD via MRRR. Experiments were run with complex matrices generated with linear distributions of eigenvalues.


Fig. 18. Breakdown of time spent in the three stages of (complex) Hermitian EVD: tridiagonal reduction, tridiagonal EVD, and the formation of QA (for QR-based algorithms) or the back-transformation (netlib D&C and MRRR) for problem sizes of 1000 × 1000 (left) and 3000 × 3000 (right). Here, “oQR” refers to EVD based on the original netlib QR algorithm while “rQR” refers to our restructured algorithm. This shows how the run time advantage of the tridiagonal D&C and MRRR algorithms over restructured QR is partially negated by increased time spent in the former methods’ back-transformations.

— EVD via restructured QR refers to an EVD based on our restructured tridiagonal QR algorithm linked to OpenBLAS.
— EVD via restructured QR II refers to a variant of EVD via restructured QR that uses the alternative method of forming Q described in Section 6.4.



Note that the three EVD implementations found in MKL are widely known to be highly optimized by experts at Intel. Also, the two EVDs based on our restructured QR algorithm use an implementation of reduction to tridiagonal form that exhibits a performance signature very similar to that of netlib LAPACK's zhetrd.
Default block sizes for netlib LAPACK were used for all component stages of netlib EVD. For each of these suboperations, all of these block sizes happen to be equal to 32. This block size of 32 was also used for the corresponding stages of the EVD via restructured QR. (MKL block sizes are, of course, not known and most likely immutable to the end-user.) For the implementation of APPLYGWAVEBLK, the values k = 32 and b = 512 were used. (These parameter values appeared to give somewhat better results than the values that were used when collecting standalone performance results for APPLYGWAVEBLK in Figure 16.)
The results in Figure 17 reveal that using our restructured QR algorithm allows Hermitian EVD performance that is relatively competitive with that of D&C and MRRR. The variant which uses the alternative method of forming Q performs even more closely to that of D&C and MRRR.
While compelling, these data report only overall performance; they do not show how each stage of EVD contributes to the speedup. As expected, we found that the time spent in the reduction to tridiagonal form was approximately equal for all netlib implementations and our restructured QR algorithm.11 Virtually all of the differences in overall performance were found in the tridiagonal EVD and the back-transformation (or formation of QA).
Figure 18 shows, for problem sizes of 1000 × 1000 and 3000 × 3000, the breakdown of time spent in the reduction to tridiagonal form, the tridiagonal EVD, and the formation of QA (for the netlib and restructured QR-based algorithms) or the back-transformation (for netlib D&C and MRRR). It is easy to see why methods based on the original QR algorithm are not commonly used: the tridiagonal EVD is many times more expensive than either of the other stages. This can be attributed to the fact that netlib's zheev applies one set of Givens rotations at a time, and does so via a routine that typically performs on par with un-optimized level-1 BLAS. These graphs also illustrate why the restructured QR algorithm is relatively competitive with D&C and MRRR: the run time advantage of using these faster O(n²) algorithms is partially offset by an increase in time spent in their back-transformation stages. The bar graphs in Figure 18 also show that the tridiagonal EVD stage based on the restructured QR algorithm, in isolation, still significantly underperforms that of D&C and MRRR. Thus, we reaffirm to the reader that eigenproblems that begin in tridiagonal form should most likely be solved via one of these two latter methods.

9.4. General singular value decomposition
We now turn to the performance of the SVD of a dense matrix. Figure 19 shows relative performance for six implementations. (Once again, we present results for five other singular value distributions, as well as results for the real domain, in Appendix A.) The implementations tested are:
— MKL SVD via QR refers to MKL zgesvd.
— MKL SVD via D&C refers to MKL zgesdd.
— netlib SVD via QR refers to netlib zgesvd linked to OpenBLAS.
— netlib SVD via D&C refers to netlib zgesdd linked to OpenBLAS.

11 Note that we could not confirm this for MKL since we have no way to know whether the top-level routines (zheev, zheevd, and zheevr) are actually implemented in terms of the three lower-level operations called by the netlib implementation.



Fig. 19. Performance of implementations relative to netlib SVD via D&C. Experiments were run with complex matrices generated with linear distributions of singular values. Here, the problem sizes listed are m = n. In the case of "SVD via restructured QR" and "SVD via restructured QR II", matrices were reduced to bidiagonal form using a conventional implementation similar to that of zgebrd.


Fig. 20. Breakdown of time spent in the three stages of (complex) SVD: bidiagonal reduction, bidiagonal SVD, and the formation of UA and VA (for QR-based algorithms) or the back-transformation (netlib D&C) for problem sizes of 1000 × 1000 (left) and 3000 × 3000 (right). Here, "oQR" refers to SVD based on the original netlib QR algorithm while "rQR" refers to our restructured algorithm. This shows how the run time advantage of the bidiagonal D&C algorithm over restructured QR is partially negated by increased time spent in the former method's back-transformations.

— SVD via restructured QR refers to an SVD based on our restructured bidiagonal QR algorithm linked to OpenBLAS.
— SVD via restructured QR II refers to a variant of SVD via restructured QR that uses an alternative method of forming U and V very similar to that of forming Q in EVD, as described in Section 6.4.



Fig. 21. Performance of implementations relative to netlib SVD via D&C. Experiments were run with complex matrices generated with linear distributions of singular values. Here, the problem sizes listed are m = 2n. In the case of "SVD via restructured QR" and "SVD via restructured QR II", matrices were reduced to bidiagonal form using a conventional implementation similar to that of zgebrd.

Note that we do not compare against an MRRR implementation of SVD since no such implementation currently exists within LAPACK. Also, we use an implementation of reduction to bidiagonal form that performs very similarly to that of netlib LAPACK's zgebrd. As with the case of EVD, the default block sizes were equal to 32 and were also used within the corresponding stages of SVD via restructured QR. Similarly, for APPLYGWAVEBLK, the values k = 32 and b = 512 were used.
Looking at the performance results, we once again find that the method based on a restructured QR algorithm performs very well against a D&C-based implementation, with the alternate variant performing even better. Figure 20 shows a time breakdown for SVD where m = n. The data tell a story similar to that of Figure 18: we find that the benefit of using the D&C algorithm, when compared to restructured QR, is partially negated by its more costly back-transformation.
Figures 21 and 22 show run time and time breakdowns for SVD when the problem size is equal to n, where m = 2n. These matrices are sufficiently rectangular to trigger the use of a QR factorization, as described in Section 8. (We know the netlib codes, as well as our implementation using restructured QR, use the QR factorization for these problem shapes, though we are unsure of MKL since it is a proprietary product.) Aside from the additional step of performing the QR factorization, this approach also requires a corresponding additional back-transformation, performed via zgemm, to multiply the unitary matrix Q by the intermediate product URUB. However, the time spent in both the QR factorization and the Q back-transformation is relatively small, and very consistent across methods.
The results show that this m = 2n problem size configuration actually allows the restructured QR-based SVD to match D&C's performance. Figure 22 reveals why: the extra time spent forming Q, along with back-transforming UA with UB (and VA with VB), virtually erases the time saved by using D&C for the bidiagonal SVD.



Fig. 22. Breakdown of time spent in the three stages of (complex) SVD: bidiagonal reduction, bidiagonal SVD, and the formation of UA and VA (for QR-based algorithms) or the back-transformation (netlib D&C) for problem sizes of 2000 × 1000 (left) and 6000 × 3000 (right). Here, "oQR" refers to SVD based on the original netlib QR algorithm while "rQR" refers to our restructured algorithm. In contrast to Figure 20, these graphs reflect the additional time spent in the QR factorization of A and the corresponding back-transformation necessary to combine Q with URUB, which is implemented via zgemm.
As with the tridiagonal EVD stage of dense Hermitian EVD, we can intuit from Figures 20 and 22 that D&C-based SVD would continue to offer superior performance over the restructured QR algorithm when the input matrix happens to already be in bidiagonal form.
Now that there exist high-performance implementations of both the QR and D&C algorithms (and their back-transformations or equivalent stages), the last remaining performance bottleneck is the reduction to bidiagonal form. However, there is potential for speedup in this operation, too. The authors of [Van Zee et al. 2012], building on the efforts of [Howell et al. 2008], report an implementation of reduction to bidiagonal form that is 60% faster, asymptotically, than the reference implementation provided by netlib LAPACK. For cases where m = n, we found the bidiagonal reduction to constitute anywhere from 40 to 60% of the total SVD run time when using the restructured QR algorithm. Thus, combining the results of that paper with this one would further accelerate the overall SVD. (The impact of this faster reduction on the overall SVD can be seen in the curves labeled "libflame SVD (fast BiRed + QR)" and "libflame SVD (fast BiRed + QR II)" in Figures 29–34 of Appendix A.) While, admittedly, this bidiagonal reduction could be used to speed up any code based on bidiagonal SVD, the libflame library is the only dense linear algebra library we know of that currently offers such an implementation.

10. CONCLUSIONS
Ever since the inception of the level-3 BLAS, some have questioned whether a routine for applying multiple sets of Givens rotations should have been part of the BLAS. We believe that the present paper provides compelling evidence that the answer is in the affirmative.
One could render quick judgment on this work by pointing out that the restructured QR algorithm facilitates performance that is merely competitive with, not faster than, that of a number of algorithms that were developed in the 1980s and 1990s. But one could argue that there is undervalued virtue in simplicity and reliability (and the QR algorithm exhibits both), especially if it comes at the expense of only minor performance degradation.


And to some, the option of O(n) workspace is important. The D&C and MRRR algorithms took many years to tweak and perfect (mostly for accuracy reasons) and require an expert to maintain and extend. By contrast, the QR algorithm as we present it is essentially identical to the one developed for EISPACK [Smith et al. 1976] in the 1960s, which can be understood and mastered by a typical first-year graduate student in numerical linear algebra. Perhaps most importantly, the QR algorithm's numerical robustness is undisputed, and now that its performance (in the case of dense matrices) is competitive, we would argue that it should be preferred over other alternatives because it can reliably solve a wide range of problems.
One important caveat to this work is that we show the restructured QR algorithm to be competitive with other methods only in the context of Hermitian EVD and SVD on dense matrices (and only when all eigen-/singular vectors are required). While the restructuring and optimizations discussed here would greatly benefit someone who had decided a priori to use the QR algorithm when the input matrix is already tridiagonal or bidiagonal, such individuals are most likely best served by D&C or MRRR, as those standalone methods still offer substantially shorter run times.
The results presented in this paper suggest several avenues for future research.

— We believe that the application of Givens rotations within the restructured QR algorithm may be parallelized for multicore systems with relatively minor changes. Acceleration with GPUs should also be possible. In some sense, limiting our discussion in this paper to only a single core favors other methods: on multiple cores, the reduction to tridiagonal form will, relatively speaking, become even more of an expense since it is largely limited by bandwidth to main memory. The recent re-emergence of successive band reduction tries to overcome this [Luszczek et al. 2011; Bientinesi et al. 2011; Haidar et al. 2011].
— Now that the QR algorithm is competitive with more recent algorithms, it may be worth attempting to improve the shifting mechanism. For example, in [Jiang and Zhang 1985] it is shown that a hybrid method that uses both Rayleigh Quotient shifts and Wilkinson shifts is cubically (and globally) convergent. This may reduce the number of Francis steps for many matrix types, which would decrease the overall run time of the QR algorithm. Alternatively, in [Dhillon 1997] it is suggested that, if eigenvalues are computed first, a "perfect shift" can be incorporated into the QR algorithm that causes deflation after only one iteration, though the same document uncovers certain challenges associated with this approach.
— "Aggressive early deflation" has been developed to accelerate the QR algorithm for upper Hessenberg matrices [Braman et al. 2001b], and the authors of [Nakatsukasa et al. 2012] have extended the approach to the dqds algorithm. Thus, it may be possible to incorporate this technique into our restructured QR algorithm.
— Previous efforts have succeeded in achieving level-3 BLAS performance in the aforementioned Hessenberg QR algorithm by applying multiple shifts and chasing multiple bulges simultaneously [Braman et al. 2001a]. It may be possible to further improve those methods by restructuring the computation so that the wavefront algorithm for applying Givens rotations may be employed.
— Periodically, the Jacobi iteration (for the EVD and SVD) receives renewed attention [Jacobi 1846; Demmel and Veselić 1992; Drmač 2009; Drmač and Veselić 2008a; 2008b]. Since, like the QR algorithm, Jacobi-based approaches rely on Givens rotations, this work may further benefit those methods. Now that the excellent performance of a routine that applies sets of Givens rotations has been demonstrated, an interesting question is how to incorporate this into a blocked Jacobi iteration.

Thus, our work may result in a renaissance for the QR algorithm.


Acknowledgments
We kindly thank G. W. (Pete) Stewart, Paolo Bientinesi, Jack Poulson, and Daniel Kressner for offering valuable advice, expertise, and feedback on a working draft of the paper.
This research was partially sponsored by the UTAustin-Portugal Colab program, a grant from Microsoft, and grants from the National Science Foundation (Awards OCI-0850750, CCF-0917167, and OCI-1148125).
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

REFERENCES
ANDERSON, E., BAI, Z., BISCHOF, C., BLACKFORD, L. S., DEMMEL, J., DONGARRA, J. J., CROZ, J. D., HAMMARLING, S., GREENBAUM, A., MCKENNEY, A., AND SORENSEN, D. 1999. LAPACK Users' Guide (third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
BIENTINESI, P., IGUAL, F. D., KRESSNER, D., PETSCHOW, M., AND QUINTANA-ORTÍ, E. S. 2011. Condensed forms for the symmetric eigenvalue problem on multi-threaded architectures. Concurr. Comput.: Pract. Exper. 23, 7, 694–707.
BISCHOF, C., LANG, B., AND SUN, X. 1994. Parallel tridiagonalization through two-step band reduction. In Proceedings of the Scalable High-Performance Computing Conference. IEEE Computer Society Press, 23–27.
BISCHOF, C. AND VAN LOAN, C. 1987. The WY representation for products of Householder matrices. SIAM J. Sci. Stat. Comput. 8, 1, 2–13.
BLAST 2002. Basic linear algebra subprograms technical forum standard. International Journal of High Performance Applications and Supercomputing 16, 1.
BRAMAN, K., BYERS, R., AND MATHIAS, R. 2001a. The multishift QR algorithm. Part I: Maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl. 23, 929–947.
BRAMAN, K., BYERS, R., AND MATHIAS, R. 2001b. The multishift QR algorithm. Part II: Aggressive early deflation. SIAM J. Matrix Anal. Appl. 23, 948–973.
CLINE, A. K. AND DHILLON, I. S. 2006. Handbook of Linear Algebra. CRC Press, Chapter Computation of the Singular Value Decomposition, 45-1–45-13.
CUPPEN, J. J. M. 1980. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numerische Mathematik 36, 177–195.
DEMMEL, J., DHILLON, I., AND REN, H. 1995. On the correctness of some bisection-like parallel eigenvalue algorithms in floating point arithmetic. Electronic Trans. Num. Anal. 3, 116–149.
DEMMEL, J. AND KAHAN, W. 1990. Accurate singular values of bidiagonal matrices. SIAM J. Sci. Stat. Comput. 11, 873–912.
DEMMEL, J., MARQUES, O., PARLETT, B., AND VÖMEL, C. 2008. Performance and accuracy of LAPACK's symmetric tridiagonal eigensolvers. SIAM J. Sci. Comput. 30, 3, 1508–1526.
DEMMEL, J. AND VESELIĆ, K. 1992. Jacobi's method is more accurate than QR. SIAM Journal on Matrix Analysis and Applications 13, 4, 1204–1245.
DHILLON, I. S. 1997. A new O(n²) algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem. Ph.D. thesis, EECS Department, University of California, Berkeley.
DHILLON, I. S. 1998. Current inverse iteration software can fail. BIT Numer. Math. 38, 685–704.
DHILLON, I. S. AND PARLETT, B. N. 2003. Orthogonal eigenvectors and relative gaps. SIAM J. Matrix Anal. Appl. 25, 3, 858–899.
DHILLON, I. S. AND PARLETT, B. N. 2004. Multiple representations to compute orthogonal eigenvectors of symmetric tridiagonal matrices. Lin. Alg. Appl. 387, 1–28.
DHILLON, I. S., PARLETT, B. N., AND VÖMEL, C. 2006. The design and implementation of the MRRR algorithm. ACM Trans. Math. Soft. 32, 533–560.
DONGARRA, J. J., DU CROZ, J., HAMMARLING, S., AND HANSON, R. J. 1988. An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft. 14, 1, 1–17.
DRMAČ, Z. 2009. A global convergence proof for cyclic Jacobi methods with block rotations. SIAM J. Matrix Anal. Appl. 31, 3, 1329–1350.
DRMAČ, Z. AND VESELIĆ, K. 2008a. New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29, 4, 1322–1342.


DRMAČ, Z. AND VESELIĆ, K. 2008b. New fast and accurate Jacobi SVD algorithm. II. SIAM J. Matrix Anal. Appl. 29, 4, 1343–1362.
FERNANDO, K. V. AND PARLETT, B. N. 1994. Accurate singular values and differential qd algorithms. Numer. Math. 67, 191–229.
GOLUB, G. H. AND LOAN, C. F. V. 1996. Matrix Computations, 3rd Ed. The Johns Hopkins University Press, Baltimore.
GOTO, K. AND VAN DE GEIJN, R. A. 2008. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Soft. 34, 3.
GU, M. AND EISENSTAT, S. C. 1995a. A divide-and-conquer algorithm for the bidiagonal SVD. SIAM J. Matrix Anal. Appl. 16, 1, 79–92.
GU, M. AND EISENSTAT, S. C. 1995b. A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem. SIAM J. Matrix Anal. Appl. 16, 1, 172–191.
HAIDAR, A., LTAIEF, H., AND DONGARRA, J. 2011. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. LAPACK Working Note 254. August.
HOWELL, G. W., DEMMEL, J. W., FULTON, C. T., HAMMARLING, S., AND MARMOL, K. 2008. Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Transactions on Mathematical Software 34, 3, 14:1–14:33.
JACOBI, C. 1846. Über ein leichtes Verfahren, die in der Theorie der Säkularstörungen vorkommenden Gleichungen numerisch aufzulösen. Crelle's Journal 30, 51–94.
JIANG, E. AND ZHANG, Z. 1985. A new shift of the QL algorithm for irreducible symmetric tridiagonal matrices. Linear Algebra and its Applications 65, 261–272.
JOFFRAIN, T., LOW, T. M., QUINTANA-ORTÍ, E. S., VAN DE GEIJN, R., AND VAN ZEE, F. 2006. Accumulating Householder transformations, revisited. ACM Transactions on Mathematical Software 32, 2, 169–179.
KÅGSTRÖM, B., KRESSNER, D., QUINTANA-ORTÍ, E. S., AND QUINTANA-ORTÍ, G. 2008. Blocked algorithms for the reduction to Hessenberg-triangular form revisited. BIT Numer. Math. 48, 3, 563–584.
LANG, B. 1998. Using level 3 BLAS in rotation-based algorithms. SIAM J. Sci. Comput. 19, 2, 626–634.
LAPACK. 2011. http://www.netlib.org/lapack/.
LAWSON, C. L., HANSON, R. J., KINCAID, D. R., AND KROGH, F. T. 1979. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft. 5, 3, 308–323.
LIBFLAME. 2011. http://www.cs.utexas.edu/users/flame/libflame/.
LUSZCZEK, P., LTAIEF, H., AND DONGARRA, J. 2011. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. LAPACK Working Note 244. April.
NAKATSUKASA, Y., AISHIMA, K., AND YAMAZAKI, I. 2012. dqds with aggressive early deflation. SIAM J. Matrix Anal. Appl. 33, 1, 22–51.
OPENBLAS. 2011. https://github.com/xianyi/OpenBLAS/.
PARLETT, B. N. 1980. The Symmetric Eigenvalue Problem. Prentice-Hall.
PARLETT, B. N. AND MARQUES, O. A. 1999. An implementation of the dqds algorithm (positive case). Lin. Alg. Appl. 309, 217–259.
QUINTANA-ORTÍ, G. AND VIDAL, A. M. 1992. Parallel algorithms for QR factorization on shared memory multiprocessors. In European Workshop on Parallel Computing '92, Parallel Computing: From Theory to Sound Practice. IOS Press, 72–75.
RAJAMANICKAM, S. 2009. Efficient algorithms for sparse singular value decomposition. Ph.D. thesis, University of Florida.
RUTTER, J. D. 1994. A serial implementation of Cuppen's divide and conquer algorithm for the symmetric eigenvalue problem. Tech. Rep. CSD-94-799, Computer Science Division, University of California at Berkeley. February.
SCHREIBER, R. AND VAN LOAN, C. 1989. A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. 10, 1, 53–57.
SMITH, B. T. ET AL. 1976. Matrix Eigensystem Routines – EISPACK Guide, Second Ed. Lecture Notes in Computer Science 6. Springer-Verlag, New York.
STEWART, G. W. 2001. Matrix Algorithms Volume II: Eigensystems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
VAN ZEE, F. G. 2011. libflame: The Complete Reference. http://www.lulu.com/.
VAN ZEE, F. G., VAN DE GEIJN, R., AND QUINTANA-ORTÍ, G. 2011. Restructuring the QR algorithm for high-performance application of Givens rotations. Tech. Rep. TR-11-36, The University of Texas at Austin, Department of Computer Science. October. FLAME Working Note #60.


VAN ZEE, F. G., VAN DE GEIJN, R. A., QUINTANA-ORTÍ, G., AND ELIZONDO, G. J. 2012. Families of algorithms for reducing a matrix to condensed form. ACM Trans. Math. Soft. 39, 1, xxx–yyy.
WATKINS, D. S. 1982. Understanding the QR algorithm. SIAM Review 24, 4, 427–440.
WILKINSON, J. H. 1968. Global convergence of tridiagonal QR algorithm with origin shifts. Linear Algebra and its Applications 1, 409–420.
WILLEMS, P. R. AND LANG, B. 2011. A framework for the MR³ algorithm: Theory and implementation. Tech. rep., Bergische Universität Wuppertal. April.
WILLEMS, P. R., LANG, B., AND VÖMEL, C. 2006. Computing the bidiagonal SVD using multiple relatively robust representations. SIAM J. Matrix Anal. Appl. 28, 4, 907–926.

Online Appendix to: Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance

FIELD G. VAN ZEE, The University of Texas at Austin ROBERT A. VAN DE GEIJN, The University of Texas at Austin GREGORIO QUINTANA-ORTÍ, Universitat Jaume I

A. ADDITIONAL PERFORMANCE RESULTS
Authors' note: We envision this appendix being published through ACM TOMS as an electronic appendix.
In this section, we present complete results for six eigen-/singular value distributions for Hermitian EVD as well as SVD where m = n. The distribution types for eigen-/singular values λi, for i ∈ {0, 1, 2, . . . , n − 1}, are as follows:
— Geometric distributions are computed as λi = α(1 − α)^(i+1). For our experiments, we used α = 1/n.
— Inverse distributions are computed as λi = α/(i + 1). For our experiments, we used α = 1.
— Linear distributions are computed as λi = β + α(i + 1). For our experiments, we used α = 1 and β = 0.
— Logarithmic distributions are computed as λi = α^(i+1)/α^n. For our experiments, we used α = 1.2.
— Random distributions are computed randomly over an interval [σ, σ + ω]. For our experiments, we used σ = 0 and ω = n.
— Cluster distributions are generated containing nc clusters, roughly equally sized, where half of the overall spectrum lies in one of the nc clusters while the remaining half resides in the (relatively) sparsely populated region between clusters. More specifically, this spectrum is generated by alternating between linear and random distributions. The first segment of the distribution is linear and roughly n/(2nc) values are generated (with β initially zero, and α = 1, which remains constant), up to some λj. This value λj then becomes σ in a random distribution, where 0 < ω ≪ 1. The last value in the random interval serves as the β for the next linear segment, and so forth until 2nc separate regions have been created. For our experiments, we used nc = 10 and ω = 10^(−9).
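The following is a sketch, in C, of generating the first four of these distributions with the parameter values quoted above; the function names are ours, and the random and cluster distributions are omitted for brevity.

#include <math.h>

/* Sketches only: fill lambda[0..n-1] with each distribution, using the
   parameter values stated in the text. */
void geometric_dist( int n, double* lambda )
{
    double alpha = 1.0 / n;                     /* alpha = 1/n */
    for ( int i = 0; i < n; ++i )
        lambda[ i ] = alpha * pow( 1.0 - alpha, i + 1 );
}
void inverse_dist( int n, double* lambda )
{
    for ( int i = 0; i < n; ++i )               /* alpha = 1 */
        lambda[ i ] = 1.0 / ( i + 1 );
}
void linear_dist( int n, double* lambda )
{
    for ( int i = 0; i < n; ++i )               /* beta = 0, alpha = 1 */
        lambda[ i ] = i + 1;
}
void logarithmic_dist( int n, double* lambda )
{
    double alpha = 1.2;
    for ( int i = 0; i < n; ++i )
        lambda[ i ] = pow( alpha, i + 1 ) / pow( alpha, n );
}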




Fig. 23. Performance of implementations relative to netlib EVD via MRRR for real (top) and complex (bot- tom) matrices generated with clustered distributions of eigenvalues. In the case of EVD via restructured QR, matrices were reduced to tridiagonal form using an implementation that exhibits a performance signature nearly identical to that of dsytrd/zhetrd.



Fig. 24. Performance of implementations relative to netlib EVD via MRRR for real (top) and complex (bot- tom) matrices generated with geometric distributions of eigenvalues. In the case of EVD via restructured QR, matrices were reduced to tridiagonal form using an implementation that exhibits a performance signa- ture nearly identical to that of dsytrd/zhetrd.


[Figure: Real symmetric EVD performance (inverse). y-axis: performance relative to netlib EVD via MRRR; x-axis: problem size. Curves: MKL EVD via MRRR, netlib EVD via MRRR, MKL EVD via DC, netlib EVD via DC, EVD via restructured QR, MKL EVD via QR, netlib EVD via QR.]

[Figure: Complex Hermitian EVD performance (inverse). y-axis: performance relative to netlib EVD via MRRR; x-axis: problem size. Curves: MKL EVD via MRRR, netlib EVD via MRRR, MKL EVD via DC, netlib EVD via DC, EVD via restructured QR II, EVD via restructured QR, MKL EVD via QR, netlib EVD via QR.]

Fig. 25. Performance of implementations relative to netlib EVD via MRRR for real (top) and complex (bottom) matrices generated with inverse distributions of eigenvalues. In the case of EVD via restructured QR, matrices were reduced to tridiagonal form using an implementation that exhibits a performance signature nearly identical to that of dsytrd/zhetrd.


[Figure: Real symmetric EVD performance (linear). y-axis: performance relative to netlib EVD via MRRR; x-axis: problem size. Curves: MKL EVD via MRRR, netlib EVD via MRRR, MKL EVD via DC, netlib EVD via DC, EVD via restructured QR, MKL EVD via QR, netlib EVD via QR.]

[Figure: Complex Hermitian EVD performance (linear). y-axis: performance relative to netlib EVD via MRRR; x-axis: problem size. Curves: MKL EVD via MRRR, netlib EVD via MRRR, MKL EVD via DC, netlib EVD via DC, EVD via restructured QR II, EVD via restructured QR, MKL EVD via QR, netlib EVD via QR.]

Fig. 26. Performance of implementations relative to netlib EVD via MRRR for real (top) and complex (bottom) matrices generated with linear distributions of eigenvalues. In the case of EVD via restructured QR, matrices were reduced to tridiagonal form using an implementation that exhibits a performance signature nearly identical to that of dsytrd/zhetrd.


[Figure: Real symmetric EVD performance (logarithmic). y-axis: performance relative to netlib EVD via MRRR; x-axis: problem size. Curves: MKL EVD via MRRR, netlib EVD via MRRR, MKL EVD via DC, netlib EVD via DC, EVD via restructured QR, MKL EVD via QR, netlib EVD via QR.]

[Figure: Complex Hermitian EVD performance (logarithmic). y-axis: performance relative to netlib EVD via MRRR; x-axis: problem size. Curves: MKL EVD via MRRR, netlib EVD via MRRR, MKL EVD via DC, netlib EVD via DC, EVD via restructured QR II, EVD via restructured QR, MKL EVD via QR, netlib EVD via QR.]

Fig. 27. Performance of implementations relative to netlib EVD via MRRR for real (top) and complex (bottom) matrices generated with logarithmic distributions of eigenvalues. In the case of EVD via restructured QR, matrices were reduced to tridiagonal form using an implementation that exhibits a performance signature nearly identical to that of dsytrd/zhetrd.


[Figure: Real symmetric EVD performance (random). y-axis: performance relative to netlib EVD via MRRR; x-axis: problem size. Curves: MKL EVD via MRRR, netlib EVD via MRRR, MKL EVD via DC, netlib EVD via DC, EVD via restructured QR, MKL EVD via QR, netlib EVD via QR.]

[Figure: Complex Hermitian EVD performance (random). y-axis: performance relative to netlib EVD via MRRR; x-axis: problem size. Curves: MKL EVD via MRRR, netlib EVD via MRRR, MKL EVD via DC, netlib EVD via DC, EVD via restructured QR II, EVD via restructured QR, MKL EVD via QR, netlib EVD via QR.]

Fig. 28. Performance of implementations relative to netlib EVD via MRRR for real (top) and complex (bottom) matrices generated with random distributions of eigenvalues. In the case of EVD via restructured QR, matrices were reduced to tridiagonal form using an implementation that exhibits a performance signature nearly identical to that of dsytrd/zhetrd.


[Figure: Real SVD performance (cluster). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

[Figure: Complex SVD performance (cluster). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR II), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR II, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

Fig. 29. Performance of implementations relative to netlib SVD via D&C for real (top) and complex (bottom) matrices generated with clustered distributions of singular values. Here, the problem sizes listed are m = n. In the case of SVD via restructured QR, matrices were reduced to bidiagonal form using an implementation that exhibits a performance signature nearly identical to that of dgebrd/zgebrd. As explained in Section 9.4, results for “libflame SVD (fast BiRed + QR)” (top) and “libflame SVD (fast BiRed + QR II)” (bottom) combine the higher-performing bidiagonal reduction presented in [Van Zee et al. 2012] with the restructured QR algorithm.


[Figure: Real SVD performance (geometric). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

[Figure: Complex SVD performance (geometric). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR II), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR II, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

Fig. 30. Performance of implementations relative to netlib SVD via D&C for real (top) and complex (bottom) matrices generated with geometric distributions of singular values. Here, the problem sizes listed are m = n. In the case of SVD via restructured QR, matrices were reduced to bidiagonal form using an implementation that exhibits a performance signature nearly identical to that of dgebrd/zgebrd. As explained in Section 9.4, results for “libflame SVD (fast BiRed + QR)” (top) and “libflame SVD (fast BiRed + QR II)” (bottom) combine the higher-performing bidiagonal reduction presented in [Van Zee et al. 2012] with the restructured QR algorithm.


[Figure: Real SVD performance (inverse). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

[Figure: Complex SVD performance (inverse). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR II), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR II, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

Fig. 31. Performance of implementations relative to netlib SVD via D&C for real (top) and complex (bottom) matrices generated with inverse distributions of singular values. Here, the problem sizes listed are m = n. In the case of SVD via restructured QR, matrices were reduced to bidiagonal form using an implementation that exhibits a performance signature nearly identical to that of dgebrd/zgebrd. As explained in Section 9.4, results for “libflame SVD (fast BiRed + QR)” (top) and “libflame SVD (fast BiRed + QR II)” (bottom) combine the higher-performing bidiagonal reduction presented in [Van Zee et al. 2012] with the restructured QR algorithm.


[Figure: Real SVD performance (linear). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

[Figure: Complex SVD performance (linear). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR II), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR II, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

Fig. 32. Performance of implementations relative to netlib SVD via D&C for real (top) and complex (bottom) matrices generated with linear distributions of singular values. Here, the problem sizes listed are m = n. In the case of SVD via restructured QR, matrices were reduced to bidiagonal form using an implementation that exhibits a performance signature nearly identical to that of dgebrd/zgebrd. As explained in Section 9.4, results for “libflame SVD (fast BiRed + QR)” (top) and “libflame SVD (fast BiRed + QR II)” (bottom) combine the higher-performing bidiagonal reduction presented in [Van Zee et al. 2012] with the restructured QR algorithm.


[Figure: Real SVD performance (logarithmic). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

[Figure: Complex SVD performance (logarithmic). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR II), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR II, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

Fig. 33. Performance of implementations relative to netlib SVD via D&C for real (top) and complex (bottom) matrices generated with logarithmic distributions of singular values. Here, the problem sizes listed are m = n. In the case of SVD via restructured QR, matrices were reduced to bidiagonal form using an implementation that exhibits a performance signature nearly identical to that of dgebrd/zgebrd. As explained in Section 9.4, results for “libflame SVD (fast BiRed + QR)” (top) and “libflame SVD (fast BiRed + QR II)” (bottom) combine the higher-performing bidiagonal reduction presented in [Van Zee et al. 2012] with the restructured QR algorithm.


[Figure: Real SVD performance (random). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

[Figure: Complex SVD performance (random). y-axis: performance relative to netlib SVD via D&C; x-axis: problem size m = n. Curves: libflame SVD (fast BiRed + QR II), MKL SVD via DC, netlib SVD via DC, SVD via restructured QR II, SVD via restructured QR, MKL SVD via QR, netlib SVD via QR.]

Fig. 34. Performance of implementations relative to netlib SVD via D&C for real (top) and complex (bottom) matrices generated with random distributions of singular values. Here, the problem sizes listed are m = n. In the case of SVD via restructured QR, matrices were reduced to bidiagonal form using an implementation that exhibits a performance signature nearly identical to that of dgebrd/zgebrd. As explained in Section 9.4, results for “libflame SVD (fast BiRed + QR)” (top) and “libflame SVD (fast BiRed + QR II)” (bottom) combine the higher-performing bidiagonal reduction presented in [Van Zee et al. 2012] with the restructured QR algorithm.
