Communication-Avoiding QR Decomposition for GPUs

Michael Anderson, Grey Ballard, James Demmel, Kurt Keutzer

Electrical Engineering and Computer Sciences, University of California at Berkeley

Technical Report No. UCB/EECS-2010-131 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.html

October 8, 2010

Copyright © 2010, by the author(s). All rights reserved.

Communication-Avoiding QR Decomposition for GPUs

Michael Anderson, Grey Ballard, James Demmel and Kurt Keutzer UC Berkeley: Department of Electrical Engineering and Computer Science Berkeley, CA USA {mjanders,ballard,demmel,keutzer}@cs.berkeley.edu

Abstract—We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 13x for tall-skinny matrices.

We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 34x faster than if we use Intel's Math Kernel Library (MKL) QR factorization.

Keywords—GPGPU; Linear Algebra; QR Decomposition; Robust PCA

I. INTRODUCTION

One of the fundamental problems in linear algebra is the QR decomposition, in which a matrix A is factored into a product of two matrices Q and R, where Q is orthogonal and R is upper triangular. The QR decomposition is best known as a method for solving linear least squares problems. It can additionally be used to find eigenvalues and eigenvectors of matrices, and it is used throughout dense linear algebra.

In terms of getting good performance, a particularly challenging case of the QR decomposition is tall-skinny matrices: matrices where the number of rows is much larger than the number of columns. The ratio of rows to columns can be 3 to 1, 1,000 to 1, or even 100,000 to 1 in some cases. QR decompositions of matrices of this shape, more than any other, require a large amount of communication between processors in a parallel setting. This means that most libraries, when faced with this problem, employ approaches that are bandwidth-bound and cannot saturate the floating-point capabilities of processors.

Matrices with these extreme aspect ratios might seem like a rare case, but they actually occur frequently in applications of the QR decomposition. The most common example is linear least squares, which is ubiquitous in nearly all branches of science and engineering and can be solved using QR. Least squares matrices may have thousands of rows representing observations, and only a few tens or hundreds of columns representing the parameters. Another example is stationary video background subtraction, which can be solved using many QR decompositions of matrices on the order of 100,000 rows by 100 columns [1]. Even more extreme tall-skinny matrices are found in s-step Krylov methods [2]. These are methods for solving a linear equation Ax = b by generating an orthogonal basis for the Krylov sequence {v, Av, A^2 v, ..., A^n v} for a starting vector v. In s-step methods, multiple basis vectors are generated at once and can be orthogonalized using a QR factorization. The dimensions of this QR factorization can be millions of rows by fewer than ten columns.

These applications demand a high-performance QR routine. Extracting the foreground from a 10-second surveillance video, for example, requires over a teraflop of computation [1]. Unfortunately, existing GPU libraries do not provide good performance for these applications. While several parallel implementations of QR factorization for GPUs are currently available [3] [4] [5], they all use generally the same approach, tuned for large square matrices, and thus suffer up to an order of magnitude performance degradation for tall-skinny problems. The loss in performance is largely due to the communication demands of the tall-skinny case. We aim to supplement the existing libraries with a QR solution that performs well over all matrix sizes.

Communication-Avoiding QR (CAQR) is a recent algorithm for the QR decomposition that is optimal with regard to the amount of communication performed [6]. This means that the algorithm minimizes the amount of data that must be sent between processors in the parallel setting, or alternatively the amount of data transferred to and from global memory. As a result, the CAQR algorithm is a natural fit for the tall-skinny case, where communication is usually the bottleneck.

In this work we discuss the implementation and performance of CAQR for a single graphics processor (GPU). It is a distinguishing characteristic of our work that the entire decomposition is performed on the GPU using compute-bound kernels. Despite GPUs' increasing general-purpose capabilities, it is still a very challenging task to map the entirety of this particular algorithm to the GPU. In doing so, however, we are able to leverage the enormous compute capability of GPUs while avoiding potentially costly CPU-GPU transfers in the inner loop of the algorithm. The benefits can most clearly be seen for very skinny matrices, where communication demands are large relative to the amount of floating-point work. As a result, we can outperform existing libraries for a large class of tall-skinny matrices. For the more extreme ratios of rows to columns, such as 110,592 by 100, we saw speedups of up to 13x over CULA, a popular linear algebra library parallelized for GPUs. It is important to note that everything we compare against is parallel; the speedups we show come from using the parallel hardware more efficiently.

We do not directly use available Basic Linear Algebra Subroutine (BLAS) libraries as a building block. The algorithm requires many hundreds or thousands of small QR decompositions and other small BLAS and LAPACK [7] operations to be performed in parallel, which is not currently supported in the vendor's BLAS library [8]. Consequently, we had to do significant low-level tuning of these very small operations to achieve sufficient performance. We discuss and quantify the effect of specific low-level optimizations that were critical to improving performance. We also highlight new tradeoffs the GPU introduces, which differ from previous CAQR work on clusters and multi-core architectures [9] [10].

Finally, we discuss the application of CAQR to stationary video background subtraction, which was a motivating application for this work. Recent work in statistics has focused on sparsity, and specifically on the use of ℓ1 minimization as a means of exploiting sparsity to solve certain problems. One of these problems is stationary video background subtraction, which can be solved using Robust Principal Component Analysis (PCA) [1], which in turn uses ℓ1 minimization to exploit the sparsity present in the foreground of these videos. In the Robust PCA algorithm, the video is transformed into a tall-skinny matrix where each column contains all pixels in a frame, and the number of columns is equal to the number of frames. In an iterative process, we update the matrix by taking its singular value decomposition (SVD) and thresholding its singular values. The SVD can use the QR decomposition as a first step. We will show that our CAQR implementation gives a significant runtime improvement for this problem.

Several implementations of CAQR have been done, as well as of Tall-Skinny QR (TSQR), a building block of CAQR that deals only with extremely tall-skinny matrices. TSQR has been applied on distributed-memory machines [9] [11] and in grid environments [12], where communication is exceptionally expensive. However, previous work in large-scale distributed environments focuses on different scales and types of problems than ours. More recently, CAQR has also been applied to multi-core machines [10], resulting in speedups of up to 10x over Intel's Math Kernel Library (MKL). Unfortunately, this code is not available for us to compare against. Note, however, that implementing CAQR on GPUs is a much different problem than on multi-core, and it is likely that both will be needed in future libraries and applications.

The paper is organized as follows. In Section II we survey previous algorithms and implementations of the QR decomposition. Next, we discuss the high-level problem of mapping CAQR to heterogeneous systems in Section III. Section IV describes the specifics of our GPU implementation and tuning process. Section V compares our performance to currently available software. In Section VI we outline our motivating video processing application and the performance benefits of using CAQR. Section VII draws conclusions.

II. BACKGROUND ON QR APPROACHES

There are several algorithms for computing the QR decomposition of a matrix. For example, one can use Cholesky QR, the Gram-Schmidt process, Givens rotations, or Householder reflectors [13]. It is well known, however, that Cholesky QR and the Gram-Schmidt process are not as numerically stable, so most general-purpose software for QR uses either Givens rotations or Householder reflectors. CAQR is a form of the Householder approach in which the Householder vectors are broken up in such a way that communication is minimized.

A. Householder QR

Most existing implementations of QR for GPUs have used the Householder approach, as does LAPACK. One favorable property of the Householder algorithm is that it can be organized so that it makes use of BLAS3 (matrix-matrix) operations. Specifically, the trailing matrix updates for several Householder vectors can be delayed and done all at once using a matrix multiply. This allows for higher arithmetic intensity on machines with a memory hierarchy, and higher arithmetic intensity in turn leads to better performance. This is called blocked Householder QR, because it allows the updates to the trailing matrix to be blocked in cache. Several implementations of the blocked Householder approach for GPUs are currently available [5] [4] or will soon become available [14]. These are generally all very fast due to their heavy use of well-optimized matrix-multiply routines [15] [3].

Figure 1 shows a sketch of one step of the blocked Householder algorithm. Here, a panel of some width less than the total number of columns is factored using the Householder algorithm with BLAS2 (matrix-vector) operations. Next, a triangular matrix T is formed from the inner products of the columns in the panel. Finally, the trailing submatrix is updated using a matrix-matrix multiply of the Householder vectors, T, and the trailing matrix itself. After the update, the next panel is factored, and so on.

Figure 1: Blocked Householder QR: a BLAS2 panel factorization followed by a BLAS3 trailing matrix update.

From Figure 1 we can intuitively understand a shortcoming of the blocked Householder algorithm. For very wide matrices, a significant portion of the runtime is spent doing matrix-matrix multiply, which is a very efficient use of the hardware given the available high-performance dense matrix-multiply routines for GPUs [16]. However, if the matrix is skinny, a greater portion of the runtime is spent in the BLAS2 panel factorization, which is not an efficient use of the hardware because matrix-vector routines are generally bandwidth-bound. Clever implementations of blocked Householder for heterogeneous CPU+GPU environments can hide this cost by sending the panel factorization to the CPU and overlapping it with the trailing matrix update, which is performed on the GPU [3]. While this greatly improves performance for wide matrices, it does not eliminate the latency problem for the skinny case.

B. Tall-Skinny QR (TSQR)

The TSQR algorithm reorganizes the factorization of a tall-skinny matrix (such as a column panel) to minimize memory accesses [6]. Instead of computing one Householder vector for each column directly, we divide the tall-skinny matrix vertically into small blocks, as shown in Figure 2(a). Next, we factor each block independently using Householder reflectors. This creates a small Householder representation of Q, which we call U, and an upper-triangular R for each block, shown in Figure 2(b). We would like to eliminate all the Rs below the top-most diagonal, so we group sets of Rs together, stack each set in a column (possibly exploiting the sparsity pattern), and apply the Householder algorithm to each stack, which is done in Figure 2(c). We can continue to reduce the Rs with another level, as in Figure 2(d). This leaves us with our final upper triangular matrix R and a series of small Us which, if needed, can be used to generate the explicit orthogonal matrix Q of the factorization.

Figure 2: Stages of Tall-Skinny QR, (a) through (d).

The TSQR algorithm exposes parallelism: each block in the panel can be processed independently by a different processor. TSQR also allows us to divide the problem into chunks of a more manageable size. If we choose block sizes that fit in cache, we can achieve significant bandwidth savings.

In the figure, the Rs were eliminated with a binary tree. However, this can be done using any tree shape, and the optimal tree shape can differ depending on the characteristics of the architecture. For example, on multi-core machines a binary reduction tree was used [10], whereas our GPU approach employs a quad-tree reduction. The motivation behind this choice is expanded on in Section IV.
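The following is a minimal NumPy sketch of the TSQR reduction just described: factor each row block independently, then repeatedly stack groups of R factors and re-factor them until a single R remains. For brevity it uses np.linalg.qr and keeps only the Rs rather than the implicit Householder (U) representation used on the GPU; the group size of 4 mirrors the quad-tree reduction chosen in Section IV. The function name tsqr_r is ours.

```python
import numpy as np

def tsqr_r(A, block_rows=64, group=4):
    """Tall-Skinny QR: return the n x n R factor of A using a tree of
    small QR factorizations (Figure 2). group=4 gives a quad-tree."""
    m, n = A.shape
    # Stages (a)-(b): factor each row block independently; keep only the Rs.
    Rs = [np.linalg.qr(A[i:i + block_rows, :], mode='r') for i in range(0, m, block_rows)]
    # Stages (c)-(d): stack groups of Rs and factor again until one R remains.
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + group]), mode='r')
              for i in range(0, len(Rs), group)]
    return Rs[0]

A = np.random.randn(4096, 16)
R_tree = tsqr_r(A)
R_ref = np.linalg.qr(A, mode='r')
# R is unique up to the signs of its rows, so compare magnitudes.
assert np.allclose(np.abs(R_tree), np.abs(R_ref), atol=1e-6)
```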

C. Communication-Avoiding QR (CAQR)

CAQR is an extension of TSQR for arbitrarily sized matrices [6]. This time we divide the matrix into a grid of small blocks. Like blocked Householder, CAQR involves a panel factorization and a trailing matrix update. The panel factorization is done using TSQR, shown in Figure 3(a). We must then do the trailing matrix update, which means applying the Q^T of the panel to the trailing matrix. Note that because TSQR works on blocks within the column panel, the trailing matrix update can begin before the entire panel is factored. This removes the synchronization of the standard blocked Householder approach and exposes more parallelism. Unfortunately, we cannot just use a large matrix-matrix multiply as we did in blocked Householder, due to the distributed format in which TSQR produces its Q. Instead, we carry out small updates in each small block of the trailing submatrix.

There are two types of trailing matrix updates: horizontal updates and tree updates. Horizontal updates are the easier case. Here we take the Householder vectors generated in the first stage of TSQR and apply them horizontally across the matrix, as shown in Figure 3(b). This operation is very uniform, and the update of each block in the trailing matrix is independent. The more challenging update is the tree update. Here we take the Householder vectors generated during each level of TSQR's tree reduction and apply them across the matrix. This involves mixing small pieces of the trailing matrix, as shown in Figure 3(c) and (d). The rows of the trailing matrix that get updated vary with each level of the reduction tree. This can be challenging on the GPU because the accesses to the matrix are more irregular and somewhat sparse.

After the trailing matrix is updated we can move to the next panel. We must take care to redraw the grid lower by a number of rows equal to the panel width, reflecting the fact that the trailing matrix becomes both shorter and narrower after each step.

Figure 3: Communication-Avoiding QR, stages (a) through (d).

III. MAPPING CAQR TO HETEROGENEOUS (CPU+GPU) SYSTEMS

Here we briefly discuss two different options for mapping CAQR to current heterogeneous systems. We consider a heterogeneous system containing one or more multi-core host CPUs with DRAM, a GPU with DRAM, and a physical link between the two memories. The GPU generally has more compute and bandwidth capability than the CPUs, whereas the CPUs generally have more cache, more ability to exploit instruction-level parallelism, and are better equipped to handle irregular computation and data accesses. The two questions we want to answer with regard to CAQR are: where should we do each step of the computation, and where should we store the data?

A. First option: CPU panel factorization and GPU trailing matrix update

With this approach, the algorithm proceeds as follows. A thin panel is sent to the CPU, if necessary, and the CPU factors the panel using TSQR. The result of the factorization is sent back to the GPU and used for the trailing matrix update. Potentially, the CPU could begin factoring the next panel while the GPU is still busy applying the previous panel to the trailing matrix.

The main advantage of this approach is that offloading work to the CPU makes it possible to overlap GPU and CPU work, which allows us to use the entire system. The TSQR panel factorization can be a good fit for the CPU because of the irregular nature of the reduction tree, while the trailing matrix update is regular and can be done efficiently on the GPU.

One disadvantage of this approach is that in order to offload work to the CPU we must transfer data between CPU and GPU memories. On current systems, this incurs latency that can hurt performance for skinny problems. Unless we successfully overlap CPU and GPU computation, sending the panel factorization to the CPU also means we cannot use the superior compute and bandwidth capabilities of the GPU for that part of the computation.

B. Second option: Entire factorization on the GPU

With this approach, the entire factorization is done on the GPU. This includes the TSQR panel factorization and the trailing matrix updates. Assuming the matrix is entirely in GPU memory, this approach eliminates transfer latency, which means we can get good performance even on skinny problems. We also benefit from the higher compute and bandwidth capability of the GPU for the panel factorizations.

Unfortunately, this approach is much more difficult to program. This is first because we cannot reuse existing tuned CPU libraries. Also, certain parts of the QR algorithm involve more irregular computations and are therefore more challenging and less efficient to carry out in the GPU's programming and execution models. The pseudocode in Figure 4, described in the next section, illustrates some of the irregular operations necessary for a GPU-only approach.

In this work we choose the second option, performing the entire factorization on the GPU, for the following reason. Our motivating application is Robust PCA for stationary video background subtraction. The dimensions of the video matrices we deal with in this application are on the order of 100,000 tall by 100 wide. For this problem size, the latency of transferring data to the CPU has a high adverse impact on performance. During the course of the application we perform many QR decompositions and the video matrix is able to stay on the GPU, so the cost of initially transferring the video matrix to GPU memory is easily amortized.
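To put rough numbers on this data-movement argument (our own arithmetic, using only the matrix dimensions given above and 4-byte single-precision elements): the whole video matrix is about 44 MB, while a single 16-column panel (the block width eventually chosen in Section IV) is about 7 MB. Option A would move a panel of this size across the CPU-GPU link for every panel of every QR decomposition, plus the associated latency, whereas option B pays only the one-time transfer of the full matrix.

```python
# Rough data sizes for the video matrix (single precision, 4 bytes per element).
m, n, panel_width = 110_592, 100, 16
matrix_mb = m * n * 4 / 1e6           # one-time transfer of the whole matrix
panel_mb = m * panel_width * 4 / 1e6  # per-panel traffic if panels went to the CPU
print(f"full matrix: {matrix_mb:.1f} MB, one panel: {panel_mb:.1f} MB")
# full matrix: 44.2 MB, one panel: 7.1 MB
```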

IV. GPU IMPLEMENTATION

In this section we describe the implementation details of our QR decomposition for the GPU. We start with an overview of the hardware, then give a high-level description of how the computation is organized. This is followed by an examination of the key individual kernels and low-level tradeoffs. Everything here is done in single precision, which is adequate for our video application.

A. Hardware Overview

Our target is an NVIDIA C2050 GPU, which is the version of NVIDIA's recent Fermi architecture intended for high-performance computing. This choice of architecture is motivated by GPUs' track record of good performance on high-throughput numerical computing [17] [16]. Though we use the C2050, our code can be run on any CUDA-capable GPU. The 1.15 GHz C2050 has 14 multiprocessors, each of which has 32 floating-point units capable of executing one single-precision fused multiply-add (two FLOPs) per cycle. In total, the chip is capable of 1.03 single-precision TFLOP/s.

Tasks are scheduled on the multiprocessors in units called thread blocks. Each thread block contains up to 512 threads, which can synchronize and share data with one another through a small shared memory. Thread blocks, however, are assumed to be independent and do not generally synchronize or communicate during a single parallel task. As an example, in our application each small block QR decomposition is done by a different thread block. Within the thread block there are 64 threads that work together to compute the QR.

There is one large global memory (DRAM) shared by all multiprocessors. The peak bandwidth to and from global memory for the C2050 with ECC enabled is 144 GB/s. There is a small 768 KB L2 cache shared by all multiprocessors. Each multiprocessor has access to 48 KB of shared memory and 16 KB of L1 cache, which can alternatively be configured as 48 KB of L1 and 16 KB of shared memory. The register file is the largest and fastest local memory available to the processing units. It is 128 KB per multiprocessor, can be accessed in one or two cycles, and generally provides enough bandwidth to saturate the floating-point units. Registers are not shared; any time threads need to communicate, in a parallel reduction for example, either shared memory or DRAM must be used.

B. Top-Level Pseudocode

Figure 4 shows pseudocode for our CAQR. Each function represents a GPU kernel, which is a subroutine performed in parallel over different parts of the matrix. Next to each kernel call there is a graphic of our matrix showing how it is divided among thread blocks and which portions of the matrix are being processed by that kernel. So, for example, the TSQR panel factorization is done by the first two kernel calls, factor and factor_tree. Once that has completed, the trailing matrix is updated using apply_qt_h and apply_qt_tree. This pseudocode is executed on the host CPU. The CPU coordinates the GPU's actions and updates pointers so that GPU thread blocks know where to look in the matrix.

    Foreach panel
        Do small QRs in panel                    (factor)
        Foreach level in tree
            Do small QRs in tree                 (factor_tree)
        Apply QT horizontally
          across trailing matrix                 (apply_qt_h)
        Foreach level in tree
            Apply QT from the tree
              across trailing matrix             (apply_qt_tree)

Figure 4: Host pseudocode for CAQR with a binary reduction tree.
C. Reduction Tree

The shape of our reduction tree is a function of the block sizes. For example, if the block size is 64 × 16, each block produces an R which fits in a 16 × 16 space. When these Rs are stacked for the next level of the reduction tree, we can fit 64/16 = 4 of them in each 64 × 16 block. This means we reduce the height of the panel by a factor of 4 at each level, and the reduction is a quad-tree. The tree reduction ends when the panel height becomes less than 64, meaning it can be processed completely in a single thread block. This leads to an interesting tradeoff between block width and the depth of the reduction tree, which we explore later.

D. Overview of Kernels

The following four kernels are the building blocks of our CAQR. The parts of the matrix affected by each kernel are shown in Figure 4.

1) factor: Perform a QR decomposition of a small block in fast memory using customized BLAS2 routines. Overwrite the Householder vectors and upper triangular R on top of the original small input matrix.

2) factor_tree: Gather a stack of upper triangular Rs generated by factor and store them in fast memory. Then perform a QR decomposition on that small block, as was done in factor. The shape of the resulting Householder vectors and R is also a stack of upper triangular matrices, and thus can exactly overwrite the Rs that were read in.

3) apply_qt_h: Apply Q^T from the Householder vectors generated by factor horizontally to small blocks across the trailing matrix. Write the updated trailing matrix blocks back to the locations from which they were read.

4) apply_qt_tree: Apply Q^T from the Householder vectors generated by factor_tree during the TSQR reduction tree to the correct locations in the trailing matrix. To do so, collect the distributed components of the trailing matrix to be updated as well as the distributed Householder vectors. Perform the application of Q^T, and write the updated trailing matrix blocks back to the same distributed locations from which they were read.

Inserting these kernels into the pseudocode in Figure 4 should complete a high-level understanding of our CAQR.

Figure 5: Matrix-vector multiply (a) and rank-1 update (b). These two operations make up the core computations in each kernel.
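As a concrete reference for what these kernels compute, the sketch below factors one 64 × 16 block and then applies Q^T to a neighboring block using exactly the two core operations of Figure 5: a matrix-vector product followed by a rank-1 update. The functions householder_factor and apply_qt_block are plain NumPy stand-ins of ours for the factor and apply_qt_h kernels; the real kernels operate on blocks held in registers and shared memory.

```python
import numpy as np

def householder_factor(B):
    """Stand-in for the factor kernel: QR of one small block, returning
    the Householder vectors U, their scalars tau, and the small R."""
    m, b = B.shape
    W = B.astype(float)
    U = np.zeros((m, b)); taus = np.zeros(b)
    for j in range(b):
        x = W[j:, j].copy()
        alpha = -np.sign(x[0]) * np.linalg.norm(x) if x[0] != 0 else -np.linalg.norm(x)
        x[0] -= alpha
        tau = 2.0 / (x @ x)
        W[j:, j:] -= tau * np.outer(x, x @ W[j:, j:])
        U[j:, j], taus[j] = x, tau
    return U, taus, np.triu(W[:b, :])

def apply_qt_block(U, taus, A):
    """Stand-in for apply_qt_h on one block: A <- Q^T A, one Householder
    vector at a time, via a matrix-vector product plus a rank-1 update."""
    A = A.astype(float)
    for u, tau in zip(U.T, taus):
        w = A.T @ u                   # matrix-vector product (Figure 5a)
        A -= tau * np.outer(u, w)     # rank-1 update (Figure 5b)
    return A

rng = np.random.default_rng(0)
B = rng.standard_normal((64, 16))     # the block being factored
U, taus, R = householder_factor(B)
# Applying Q^T back to the factored block reproduces [R; 0], as expected.
assert np.allclose(apply_qt_block(U, taus, B),
                   np.vstack([R, np.zeros((48, 16))]), atol=1e-6)
```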
E. Kernel Tuning

Now we examine an individual kernel so as to understand the specifics of the data layout and thread-level execution. Fortunately, all four kernels share the same two core computations: a matrix-vector multiply and a rank-1 update. We will analyze only the simplest of these kernels, apply_qt_h. Suppose we have a set of Householder vectors in shared memory and a small block A to which we would like to apply them. This is equivalent to applying Q^T, where Q is represented by the Householder vectors as Q = ∏_{i=1}^{n} (I − τ_i u_i u_i^T), with τ_i = 2/‖u_i‖₂², to the block A. For each Householder vector u we first compute the matrix-vector product A^T u, shown in Figure 5(a). Then we update each element of A with the scaled outer product of A^T u and u, as in Figure 5(b). So the computation consists of a reduction sum of each column of A during the matrix-vector product, and a data-parallel rank-1 update of A. This is repeated for each Householder vector.

Two important questions must be answered in our design: how should we assign threads to computation, and how should we store the matrix in our fast memories so that it can be accessed most quickly? During the course of tuning we tried several different approaches. The main difference between them lies in how the reductions in the matrix-vector product are carried out. The following is a description of each approach to the matrix-vector product, with its corresponding performance for 128 × 8 blocks.

1) Shared Memory Parallel Reductions (85 GFLOPS): The most obvious approach is to store the matrix in the register file, assigning each thread one row. Then reduce each column one at a time using parallel reductions in shared memory. This approach is inefficient because many of the threads sit idle during the consecutive parallel reductions.

2) Register File Serial Reduction (168 GFLOPS): Store the matrix entirely in the register file. Instead of distributing the data to each thread by row, distribute it cyclically among the threads, as shown in Figure 6. Notice that each thread's data is located in a single column. This means that each thread can do a serial reduction over its part of the matrix and write only the result into shared memory for a final parallel reduction. This minimizes traffic in and out of shared memory, which helps performance.

3) Shared Memory Combined Reduction (244 GFLOPS): We apply a trick to reduce all the columns of the matrix simultaneously and more efficiently. Since our 128 × 8 matrix is stored in column-major order, we put it into shared memory in transposed form. Next, we reduce the matrix as though it were a 128 × 8 = 1024 element vector, stopping when the reduction reaches 8 elements. These remaining 8 elements are the same values we would have gotten by independently summing the columns. The advantage of this approach is that it gives us near-optimal thread utilization for the reduction.

Figure 6: In approaches 2 and 4, the matrix is stored in the register file and distributed among threads in this manner. All data owned by a thread belongs to the same column.

Figure 7: Efficiency of the trailing matrix update: GFLOPS/s of the apply_qt_h and apply_qt_tree kernels. The tree reduction runs slower per FLOP; we would like to minimize the amount of time spent in the tree reduction.

4) Register File Serial Reduction + Transpose (248 GFLOPS): The register file serial reduction (approach 2) can be improved if the block is already stored in transposed form. Instead of doing a small transpose in each thread block, the transpose can be done as a preprocessing step. This is beneficial because these kernels are called many times on the same block of the matrix. Note that this is not a transpose of the entire matrix; we only need to transpose each panel from column-major to row-major form.
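The short NumPy model below illustrates why the combined-reduction trick of approach 3 is valid: with the 128 × 8 block stored in transposed (row-major) form, a single binary tree reduction over the flat 1024-element buffer, stopped at 8 elements, yields exactly the eight column sums. This is only a numerical illustration; on the GPU each halving step is a parallel shared-memory reduction across threads.

```python
import numpy as np

A = np.random.randn(128, 8).astype(np.float32)

# Shared-memory buffer: the block in transposed (row-major) layout,
# viewed as one 1024-element vector; element k corresponds to A[k // 8, k % 8].
buf = A.reshape(-1).copy()

# Binary tree reduction over the whole buffer, stopping at 8 elements.
n = buf.size
while n > 8:
    n //= 2
    buf[:n] += buf[n:2 * n]

# The surviving 8 elements are exactly the column sums of A.
assert np.allclose(buf[:8], A.sum(axis=0), atol=1e-3)
```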
F. Tuning Block Width

In the previous section we showed how changing the data layout and reduction technique could greatly improve performance. Here we show how choosing the right block size gives additional benefit.

Our block size is fundamentally limited by the shared memory and/or register file size. We must, however, decide what the shape of the block should be. The apply_qt_h kernel gets better performance when the block is wider, for several reasons. First, since the number of FLOPs performed in the Householder algorithm is O(mn²), we get far higher arithmetic intensity, i.e. FLOPs per byte, by increasing the width n than by increasing the height m. Second, the wider the block, the more parallelism there is in each matrix-vector product. If the block width were equal to the number of threads, for example, then the matrix-vector product could be done entirely using efficient serial reductions.

Unfortunately, as we make the block wider we must also make it shorter in order to maintain the same cache footprint. As mentioned earlier, the depth of the reduction tree is a function of both block width and height. As the ratio of height to width decreases, more time is spent in the reduction tree. Figure 7 shows the performance of the two Q^T application kernels, horizontal and tree. The tree application gets worse performance in terms of FLOPs than the horizontal application because it is a more complicated kernel: it works on the R matrices, which are upper triangular, and it is difficult to take full advantage of this structure due to the SIMD nature of the GPU. Also, the memory reads and writes in this kernel are less regular because the data is distributed throughout the matrix.

Clearly, we want to minimize the time spent in the reduction tree because the tree application kernel is less efficient than the horizontal application kernel. The tuning process then becomes widening the blocks until the cost of the reduction tree begins to hurt performance. Our best overall performance comes from using 64 × 16 blocks. For the apply_qt_h kernel we are able to reach 337 GFLOPS; with this block size, 39% of the total runtime is spent in the reduction tree.

G. Tuning Summary

Through our tuning process we improved the performance of apply_qt_h, our main kernel, from 85 GFLOPS to 337 GFLOPS using low-level tuning. The main focus was optimizing the matrix-vector product and rank-1 update at the core of the Householder QR algorithm. The most important optimization was to avoid shared memory and L1 cache in favor of the register file. Choosing a wide block size was also critical in providing enough arithmetic intensity for good performance.

V. PERFORMANCE

We compare our performance against common commercially available and open-source software for both GPUs and multicore CPUs. Below we describe the software to which we compare and the details of the hardware platforms. Our code is optimized for tall-skinny matrices, so we focus mostly on this case. All FLOPs are single precision.

A. Software

1) Intel Math Kernel Library (MKL): MKL contains BLAS, LAPACK, and various other common functions tuned specifically for Intel's processors. In our comparison we use version 10.2.2.025 and link to the multithreaded library.

2) CULA: CULA is a library of LAPACK routines for GPUs [5]. It is a commercial library, made by EM Photonics. Since the CULA source is not available, we do not know exactly how they implement their QR. However, the performance of the CULA QR routine across different sizes is very similar to the performance of a previous blocked Householder approach by Volkov et al. [3]. For this reason, we do not separately report the performance of CULA and Volkov.

3) BLAS2 QR: We wrote this QR, which is done entirely on the GPU and is built using BLAS2 routines. The code was specifically designed and tuned for tall-skinny matrices. For example, the matrix-vector multiply routine was rewritten because the SGEMV routine provided by the vendor's BLAS library [8] performed very poorly in the tall-skinny case.

B. Hardware

Our test platform is the Dirac GPU cluster at the National Energy Research Scientific Computing Center (NERSC) [18]. One node consists of dual-socket Intel 5530 (Nehalem) processors running at 2.4 GHz connected over PCI-Express to an NVIDIA C2050 (Fermi) GPU. The GPU has ECC enabled, so its effective bandwidth is reduced to 144 GB/s.

Figure 8: Performance (GFLOPS/s) on matrices with 8192 rows and 64 to 8192 columns, comparing MKL (8 cores) with the BLAS2 QR, CULA 2.1, and CAQR on the C2050. For the tall-skinny case CAQR performs best.

C. Performance vs. Matrix Width

We would like to highlight the tall-skinny case, where our CAQR performs best. Figure 8 shows performance for single-precision QR factorizations of matrices with a fixed height of 8192 and varying width. We consider the combined operations of Householder QR (SGEQRF) and the explicit recovery of Q from the Householder representation (SORGQR). All preprocessing (e.g., the transpose) done for CAQR is included in these runtimes. The initial transfer of the matrix from the CPU to the GPU is not counted for any of the GPU implementations being compared.

Near the left side of the graph we are dealing with very skinny matrices. Examples of these types of matrices are those found in the video processing application, or in linear least squares problems where the number of samples is much larger than the number of variables. As shown in the graph, neither MKL nor CULA is able to factor these matrices efficiently. As we move to the right side of the graph, our CAQR loses some of its competitive advantage, since the other approaches are able to use efficient matrix-multiply routines on both the GPU and CPU. There is a crossover point around 3000 columns where CULA becomes more efficient than our CAQR. This reflects the fact that SGEMM is faster than our CAQR kernels. We envision combining the traditional approach with our CAQR in an autotuning framework, so that a different algorithm may be chosen depending on the matrix size.

It is important to mention that the curve produced by CULA 2.1 is actually not smooth across all matrix sizes. The SGEMM used by CULA runs much more slowly at certain dimensions, likely when the matrix is not a multiple of the underlying block size. The points we have chosen for this graph, however, are all powers of two, and the CULA library seems to handle these sizes relatively well.

D. Very Skinny Matrices

For extremely tall-skinny matrices, such as those found in our video processing application, we see an order of magnitude speedup. Figure 9 gives two examples of extremely tall-skinny matrices where we get significant speedups. Another application where matrices like these appear is communication-avoiding linear solvers, where vectors must be orthogonalized periodically [2].

E. GTX480 vs. C2050

The C2050 is a realization of NVIDIA's recent Fermi architecture. The same design, however, is also sold as a graphics chip. Variations in modern fabrication processes lead to chips with different speeds and different numbers of working cores. The GTX480 has 15 working multiprocessors and is clocked at 1.4 GHz, whereas the C2050 has only 14 multiprocessors running at 1.15 GHz. The GTX480 also disables ECC, resulting in a superior 177 GB/s of memory bandwidth. As we show in Figure 10, this results in roughly 20% better performance for our CAQR code.

Figure 9: Performance for very tall-skinny matrices, comparing MKL (8 cores), CULA (C2050), and CAQR (C2050) on the video matrix (110,592 × 100) and a 1,000,000 × 10 matrix. CAQR can be better by an order of magnitude when the matrices are extremely tall and skinny.

Figure 10: Performance of CAQR on the GTX480 vs. the C2050 across matrix widths. There is a roughly 20% benefit from using the GTX480 instead of the C2050.

VI. APPLICATION: ROBUST PCA FOR SURVEILLANCE VIDEO BACKGROUND SUBTRACTION

The motivating application for this work is stationary video background subtraction using a recent statistical algorithm for Robust Principal Component Analysis (PCA) [1]. This section presents the specifics of the application, how it uses the QR decomposition, and the performance of the application using our QR implementations and others.

A. ℓ1 Minimization and Robust PCA

Recent work in statistics has focused on sparsity, and specifically the use of ℓ1 minimization as a means of exploiting sparsity to solve certain problems [19]. In these problems, a convex minimization is formulated with a penalty term equal to the ℓ1 norm of a vector. This can be useful when the underlying data can be assumed to have a certain sparsity.

Principal Component Analysis is a widely used method for data analysis. The goal is to find the best low-rank approximation of a given matrix, as judged by minimizing the difference between the original matrix and the low-rank approximation. However, the classical method is not robust to large sparse errors. In Robust PCA, a matrix M is decomposed as the sum of a low-rank component L0 and a sparse component S0. S0 is allowed to be large, as long as it is sparse:

    M = L0 + S0

Figure 11: Sample output of Robust PCA for stationary video background subtraction.

An application of Robust PCA is stationary video background subtraction. A surveillance video is transformed into a tall-skinny matrix where each column contains all pixels in a frame, and the number of columns is equal to the number of frames. The low-rank component of this matrix is the background and the sparse component is the people walking in the foreground. To give a better idea of the problem being solved, Figure 11 shows a sample of the output of the Robust PCA code.

B. SVD using QR

The main computation in Robust PCA is a singular value decomposition (SVD) of the tall-skinny video matrix. In the SVD of the video matrix, the top singular values, those that have a strong presence in every frame of the video, are usually associated with the background.

Instead of trying to do a large SVD on the GPU, we use the following well-known technique for tall-skinny matrices to reduce the bulk of the work to a QR decomposition. First, A is decomposed into Q * R. Then we find the SVD of R, which is cheap because R is only an n × n matrix; this step is done on the CPU using MKL. Finally, we multiply the orthogonal matrices Q * U to get the left singular vectors of A:

    A = Q * R
      = Q * (U * Σ * V^T)
      = (Q * U) * Σ * V^T
      = U' * Σ * V^T
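A NumPy sketch of this tall-skinny SVD: one QR of the video-sized matrix, a small SVD of the n × n R, and a single matrix multiply to recover the left singular vectors. The shapes mirror the application (about 100,000 × 100). This sketch runs entirely on the CPU; in the GPU code described above, only the QR runs on the device and the small SVD of R is done with MKL. The function name tall_skinny_svd is ours.

```python
import numpy as np

def tall_skinny_svd(A):
    """SVD of a tall-skinny A via QR: A = Q R, R = Ur S Vt, so A = (Q Ur) S Vt."""
    Q, R = np.linalg.qr(A, mode='reduced')   # Q is m x n, R is n x n
    Ur, S, Vt = np.linalg.svd(R)             # cheap: R is only n x n
    return Q @ Ur, S, Vt                     # left singular vectors, values, right vectors

# Video-matrix-like shape: pixels x frames, single precision.
A = np.random.randn(110_592, 100).astype(np.float32)
U, S, Vt = tall_skinny_svd(A)
assert np.allclose(U * S @ Vt, A, atol=1e-2)
```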

C. Robust PCA Algorithm

The algorithm for Robust PCA tries to minimize the rank of L0 and enforce sparsity on S0. It does so with an iterative alternating-directions method [20]. The pseudocode for the algorithm is shown in Figure 12. The algorithm thresholds (sets to zero) the smallest singular values of L0 in order to make it low rank. Next, a shrinkage operation (pushing the values of the matrix towards zero) is done on S0 to enforce sparsity. The vast majority of the runtime is spent in the singular value threshold, specifically in the SVD of the L0 matrix.

Figure 12: Flowchart of the alternating-directions algorithm for solving Robust PCA. The matrix is factored as Q * U * Σ * V^T and the singular values are thresholded.
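A sketch of the two operations at the heart of each iteration, assuming the thresholds are given: singular-value thresholding of the low-rank term (using the tall-skinny SVD via QR from the previous section) and elementwise shrinkage of the sparse term. The surrounding alternating-directions loop, its multiplier updates, and the choice of thresholds follow [20] and [1] and are not reproduced here; the function names and the threshold values in the usage example are purely illustrative.

```python
import numpy as np

def svd_threshold(L, tau):
    """Keep only the part of L whose singular values exceed tau (low-rank step)."""
    Q, R = np.linalg.qr(L, mode='reduced')
    Ur, s, Vt = np.linalg.svd(R)
    s = np.maximum(s - tau, 0.0)             # soft-threshold the singular values
    return (Q @ Ur) * s @ Vt

def shrink(S, tau):
    """Elementwise soft-thresholding: pushes entries of S toward zero (sparse step)."""
    return np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)

# One illustrative pass on a small synthetic matrix M = low rank + sparse spikes.
rng = np.random.default_rng(1)
M = rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 100))
M[rng.random(M.shape) < 0.02] += 10.0
L_new = svd_threshold(M, tau=50.0)           # low-rank update (SVD via QR, then threshold)
S_new = shrink(M - L_new, tau=1.0)           # sparse update (shrinkage)
```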

D. Performance using CAQR

We have three different versions of Robust PCA for video background subtraction. The first runs entirely on a CPU (Intel Core i7, 2.6 GHz) and relies on multithreaded MKL for the QR decomposition as well as other basic BLAS routines. The second runs entirely on the GPU (GTX480) and uses our BLAS2 QR decomposition that was specifically designed and tuned for tall-skinny matrices. Finally, there is a version which also runs entirely on the GPU and uses our CAQR.

Our benchmark video comes from the ViSOR surveillance video database [21]. We extract 100 frames for processing. Each frame is 288 pixels tall by 384 pixels wide, for a total of 110,592 pixels per frame. This means the matrix dimensions are 110,592 by 100. The problem technically takes over 500 iterations to converge; however, the solution begins to look good earlier than that. The quality of the solution, and therefore the number of iterations required, seems to depend on the application. We therefore report the number of iterations per second that each implementation is able to complete. All computation is done in single precision.

    QR type            Iterations/sec
    MKL (4 cores)              0.8
    BLAS2 (GTX480)             9.3
    CAQR (GTX480)             27.5

Table I: Performance of various Robust PCA implementations.

Table I shows that moving from the CPU-only code to our BLAS2 GPU code results in an 11x speedup. This mostly reflects the fact that the GPU has much higher bandwidth than our CPU. However, we see an additional speedup of about 3x when using CAQR as compared to the BLAS2 QR. Even though the QR itself is sped up by much more than a factor of 3, we only get 3x in the application overall due to Amdahl's law. Overall, our GPU solution gives us a 34x speedup over the original CPU code using MKL, reducing the time to solve the problem completely from over ten minutes to 17 seconds and making this approach feasible for latency-critical applications.

VII. CONCLUSION

In this paper we described a high-performance implementation of Communication-Avoiding QR decomposition entirely on a single GPU using compute-bound kernels. The main advantage of our approach over the traditional blocked Householder algorithm is that it can handle tall-skinny matrices without relying on bandwidth-bound BLAS2 panel factorizations or potentially high-latency GPU-CPU transfers.

We showed the low-level implementation choices that allowed us to achieve good performance on the GPU. The best performance for our kernels came from using the register file as much as possible and arranging the data in transposed form so as to minimize the necessary communication between threads. Our tuning improved the performance of the most heavily used kernel from 85 GFLOPS to 337 GFLOPS.

Our CAQR code outperformed leading parallel CPU and GPU libraries for tall-skinny matrices up to roughly 3000 columns wide and 8192 rows tall. In more extreme ratios of rows to columns, such as 110,592 by 100, we saw speedups of up to 13x over the CULA linear algebra library for GPUs. Note that these extreme cases were motivated by practical applications.

Finally, we applied the CAQR code to Robust PCA for stationary video background subtraction. We showed that using CAQR we could achieve a 3x speedup over our best BLAS2 QR tuned specifically for the tall-skinny case, and a 34x speedup over MKL.

REFERENCES

[1] E. Candes, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis," preprint, 2009.

[2] J. Demmel and M. Hoemmen, "Communication-avoiding Krylov subspace methods," University of California, Berkeley, Department of Electrical Engineering and Computer Science, Tech. Rep., in preparation.

[3] V. Volkov and J. Demmel, "LU, QR and Cholesky factorizations using vector capabilities of GPUs," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-49, May 2008.

[4] A. Kerr, D. Campbell, and M. Richards, "QR decomposition on GPUs," in Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units. ACM, 2009, pp. 71–78.

[5] J. Humphrey, D. Price, K. Spagnoli, A. Paolini, and E. Kelmelis, "CULA: Hybrid GPU accelerated linear algebra routines," 2010.

[6] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, "Communication-optimal parallel and sequential QR and LU factorizations," LAPACK Working Note #204.

[7] E. Anderson, Z. Bai, and C. Bischof, LAPACK Users' Guide. Society for Industrial and Applied Mathematics, 1999.

[8] NVIDIA, "CUBLAS Library," NVIDIA Corporation, Santa Clara, California, 2008.

[9] J. Demmel, L. Grigori, and M. Hoemmen, "Implementing communication-optimal parallel and sequential QR factorizations," arXiv preprint arXiv:0809.2407, 2008. [Online]. Available: http://arxiv.org/pdf/0809.2407

[10] B. Hadri, H. Ltaief, E. Agullo, and J. Dongarra, "Enhancing parallelism of tile QR factorization for multicore architectures." [Online]. Available: http://citeseerx.ist.psu.edu/

[11] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley, "An overview of the Trilinos project," ACM Trans. Math. Softw., vol. 31, no. 3, pp. 397–423, 2005.

[12] E. Agullo, C. Coti, J. Dongarra, and T. Herault, "QR factorization of tall and skinny matrices in a grid computing environment," in 24th IEEE International ..., Jan 2010. [Online]. Available: http://arxiv.org/pdf/0912.2572

[13] J. Demmel, Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia, 1997.

[14] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, "Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects," in Journal of Physics: Conference Series, vol. 180. IOP Publishing, 2009, p. 012037.

[15] G. Quintana-Ortí, F. Igual, E. Quintana-Ortí, and R. van de Geijn, "Solving dense linear systems on platforms with multiple hardware accelerators," ACM SIGPLAN Notices, vol. 44, no. 4, pp. 121–130, 2009.

[16] V. Volkov and J. Demmel, "Benchmarking GPUs to tune dense linear algebra," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE Press, 2008, pp. 1–11.

[17] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures," in High Performance Computing, Networking, Storage and Analysis, 2008 (SC 2008). IEEE, 2008, pp. 1–12.

[18] NERSC, "Experimental GPU cluster: Dirac," http://www.nersc.gov/nusers/systems/dirac/.

[19] D. Donoho and M. Elad, "Maximal sparsity representation via l1 minimization," Proc. Nat. Acad. Sci., vol. 100, no. 5, pp. 2197–2202, 2003.

[20] X. Yuan and J. Yang, "Sparse and low-rank matrix decomposition via alternating direction methods," preprint, 2009.

[21] R. Vezzani and R. Cucchiara, "ViSOR: Video surveillance on-line repository for annotation retrieval," in Multimedia and Expo, 2008 IEEE International Conference on. IEEE, 2008, pp. 1281–1284.