Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
Ltaief, Hatem and Kurzak, Jakub and Dongarra, Jack
2009

MIMS EPrint: 2009.5
Manchester Institute for Mathematical Sciences
School of Mathematics
The University of Manchester

Reports available from: http://eprints.maths.manchester.ac.uk/
And by contacting: The MIMS Secretary, School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK
ISSN 1749-9097

Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
LAPACK Working Note # 209

Hatem Ltaief (1), Jakub Kurzak (1), and Jack Dongarra (1,2,3)

(1) Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville
(2) Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
(3) School of Mathematics & School of Computer Science, University of Manchester
{ltaief, kurzak, [email protected]}

Research reported here was partially supported by the National Science Foundation and Microsoft Research.

Abstract. The objective of this paper is to extend, in the context of multicore architectures, the concepts of algorithms-by-tiles [Buttari et al., 2007] for the Cholesky, LU, and QR factorizations to the family of two-sided factorizations. In particular, the bidiagonal reduction of a general, dense matrix is very often used as a pre-processing step for calculating the singular value decomposition. Furthermore, in the last Top500 list from June 2008, 98% of the fastest parallel systems in the world were based on multicores. The manycore trend exacerbates the problem even further, and it becomes critical to efficiently integrate existing or new numerical linear algebra algorithms suitable for such hardware. By exploiting the concept of algorithms-by-tiles in the multicore environment (i.e., a high level of parallelism with fine granularity and a high performance data representation combined with a dynamic data-driven execution), the band bidiagonal reduction presented here achieves 94 Gflop/s on a 12000 × 12000 matrix with 16 Intel Tigerton 2.4 GHz processors.

1 Introduction

The objective of this paper is to extend, in the context of multicore architectures, the concepts of algorithms-by-tiles by Buttari et al. [7] for the Cholesky, LU, and QR factorizations to the family of two-sided factorizations, i.e., Hessenberg reduction, tridiagonalization, and bidiagonalization. In particular, the Bidiagonal Reduction (BRD) of a general, dense matrix is very often used as a pre-processing step for calculating the Singular Value Decomposition (SVD) [14, 28]:

    A = X Σ Y^T,  with A ∈ ℝ^{m×n}, X ∈ ℝ^{m×m}, Σ ∈ ℝ^{m×n}, Y ∈ ℝ^{n×n}.

The necessity of calculating SVDs emerges from various computational science disciplines, e.g., in statistics where it is related to principal component analysis, in signal processing and pattern recognition, and also in numerical weather prediction [10]. The basic idea is to transform the dense matrix A to an upper bidiagonal form B by applying successive distinct transformations from the left (U) as well as from the right (V) as follows:

    B = U^T × A × V,  with A ∈ ℝ^{n×n}, U ∈ ℝ^{n×n}, V ∈ ℝ^{n×n}, B ∈ ℝ^{n×n}.

The most commonly used algorithm to perform this two-sided reduction is the Golub-Kahan bidiagonalization [15]. Although this algorithm works for any matrix size, it adds extra floating point operations for rectangular matrices; faster methods, such as the Lawson-Hanson-Chan bidiagonalization, are therefore preferred [8].
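Since U and V are orthogonal, the reduction B = U^T A V leaves the singular values of A unchanged, which is what makes the bidiagonal form a valid starting point for the SVD. The following is a minimal NumPy sketch of that invariance; it uses random orthogonal factors as stand-ins for the U and V accumulated by a bidiagonal reduction, and the sizes and names are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))

# Random orthogonal matrices standing in for the accumulated U and V.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Two-sided orthogonal transformation: B = U^T * A * V.
B = U.T @ A @ V

# Singular values are invariant under orthogonal transformations.
sv_A = np.linalg.svd(A, compute_uv=False)
sv_B = np.linalg.svd(B, compute_uv=False)
print(np.allclose(sv_A, sv_B))   # expected: True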
Here, only square matrices are considered; performance comparisons of different bidiagonalization algorithms for rectangular matrices will appear in a companion paper.

Also, we only look at the first stage of BRD, which goes from the original dense matrix A to a band bidiagonal matrix B_b, with b being the number of upper diagonals. The second stage, which annihilates those additional b upper diagonals, has been studied especially by Lang [21] and is not examined in this paper. This two-stage transformation process is also explained by Grosser et al. [16]. Although expensive, orthogonal transformations are accepted techniques and are commonly used for this reduction because they guarantee stability, as opposed to Gaussian elimination [28]. The two common transformations are based on Householder reflectors and Givens rotations. Previous work by the authors [22] demonstrates the effectiveness of Householder reflectors over Givens rotations. Therefore, this two-sided band BRD is done by using Householder reflectors.

Furthermore, in the last Top500 list from June 2008 [1], 98% of the fastest parallel systems in the world were based on multicores. The manycore trend has exacerbated the problem even more, and it becomes critical to efficiently integrate existing or new numerical linear algebra algorithms suitable for such hardware. As discussed in [7], a combination of several parameters is essential to match the architecture associated with the cores: (1) fine granularity to reach a high level of parallelism and to fit the cores' small caches; (2) asynchronicity to prevent any global barriers; (3) Block Data Layout (BDL), a high performance data representation to perform efficient memory accesses; and (4) a dynamic data-driven scheduler to ensure that any enqueued task can be processed as soon as all its data dependencies are satisfied. While (1) and (3) represent important items for both one-sided and two-sided transformations, (2) and (4) are even more critical for two-sided transformations because of the tremendous number of tasks generated by the right transformation. Indeed, as a comparison, the algorithmic complexity of the QR factorization is 4/3 n^3, while it is 8/3 n^3 for the BRD algorithm, with n being the matrix size. On the other hand, previous work by Kurzak et al. [19, 20] shows how the characteristics of tiled algorithms perfectly match the architectural features of modern multicore processors, such as the Cell Broadband Engine processor.

The remainder of this document is organized as follows: Section 2 recalls the standard BRD algorithm. Section 3 describes the implementation of the parallel tiled BRD algorithm. Section 4 outlines the pros and cons of static and dynamic scheduling. Section 5 presents performance results. Comparison tests are run on shared-memory architectures against the state-of-the-art, high performance dense linear algebra software libraries, LAPACK [3] and ScaLAPACK [9]. Section 6 gives a detailed overview of previous projects in this area. Finally, Section 7 summarizes the results of this paper and presents the ongoing work.

2 The Standard Bidiagonal Reduction

In this section, we review the original BRD algorithm of a general, dense matrix.

2.1 The Sequential Algorithm

The standard BRD algorithm of A ∈ ℝ^{n×n} based on Householder reflectors combines two factorization methods, i.e., a QR decomposition (left reduction) and an LQ decomposition (right reduction).
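Both reductions rely on the same elementary operation: a Householder reflector that annihilates all but the first entry of a column (left phase) or of a row (right phase). The following is a small NumPy sketch of that building block, written to mirror lines 2-5 and 7-10 of Algorithm 1 below; the function name and test sizes are ours and not part of any reference implementation, and the sign(x_1) = 0 corner case is not handled.

import numpy as np

def householder(x):
    # Return the normalized Householder vector u such that
    # (I - 2 u u^T) x is a multiple of e_1 (i.e., x[1:] is annihilated).
    u = x.copy()
    u[0] += np.sign(x[0]) * np.linalg.norm(x)
    return u / np.linalg.norm(u)

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))

# Left reduction step for j = 0: zero out A[1:, 0].
u = householder(A[:, 0])
A = A - 2.0 * np.outer(u, u @ A)
print(np.round(A[:, 0], 12))   # only the first entry remains nonzero

# Right reduction step for j = 0: zero out A[0, 2:].
v = householder(A[0, 1:])
A[:, 1:] = A[:, 1:] - 2.0 * np.outer(A[:, 1:] @ v, v)
print(np.round(A[0, :], 12))   # only A[0, 0] and A[0, 1] remain nonzero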
The two phases are written as follows:

Algorithm 1 Bidiagonal Reduction with Householder reflectors
 1: for j = 1 to n do
 2:   x = A_{j:n, j}
 3:   u_j = sign(x_1) ||x||_2 e_1 + x
 4:   u_j = u_j / ||u_j||_2
 5:   A_{j:n, j:n} = A_{j:n, j:n} - 2 u_j (u_j^* A_{j:n, j:n})
 6:   if j < n then
 7:     x = A_{j, j+1:n}
 8:     v_j = sign(x_1) ||x||_2 e_1 + x
 9:     v_j = v_j / ||v_j||_2
10:     A_{j:n, j+1:n} = A_{j:n, j+1:n} - 2 (A_{j:n, j+1:n} v_j) v_j^*
11:   end if
12: end for

Algorithm 1 takes as input a dense matrix A and gives as output the upper bidiagonal decomposition. The reflectors u_j and v_j can be saved in the lower and upper parts of A, respectively, for storage purposes and used later if necessary. The bulk of the computation is located in line 5 and in line 10, in which the reflectors are applied to A from the left and then from the right, respectively. Four flops are needed to annihilate one element of the matrix, which makes the total number of operations for this algorithm 8/3 n^3 (the lower order terms are neglected). It is obvious that Algorithm 1 is not efficient as is, especially because it is based on matrix-vector Level-2 BLAS operations. Also, a single entire column/row is reduced at a time, which engenders large-stride memory accesses. The main contribution described in this paper is to transform this algorithm to work on tiles instead, in order to generate as many matrix-matrix Level-3 BLAS operations as possible. First introduced by Berry et al. in [5] for the reduction of a nonsymmetric matrix to block upper-Hessenberg form and then revisited by Buttari et al. in [7], this idea considerably improves data locality and cache reuse.

3 The Parallel Band Bidiagonal Reduction

In this section, we present the parallel implementation of the band BRD algorithm based on Householder reflectors.

3.1 Fast Kernel Descriptions

There are eight kernels overall implemented for the two phases, four for each phase. For phase 1 (left reduction), the four kernels are identical to the ones used by Buttari et al. [7] for the QR factorization, in which the reflectors are stored in column-major form. DGEQRT is used to perform a blocked QR factorization of a tile, using the WY technique for efficiently accumulating the Householder reflectors [26]. The DLARFB kernel comes from the LAPACK distribution and is used to apply a block of Householder reflectors. DTSQRT performs a block QR factorization of a matrix composed of two tiles, a triangular tile on top of a dense square tile. DSSRFB updates the matrix formed by coupling two square tiles, applying the transformations computed by DTSQRT. Buttari et al. give a detailed description of the different kernels [7].

For phase 2 (right reduction), the reflectors are now stored in rows. DGELQT is used to perform a blocked LQ factorization, using the WY technique as well. DTSLQT performs a block LQ factorization of a matrix composed of two tiles, a triangular tile beside a dense square tile.
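The coupling of DGEQRT and DTSQRT can be pictured with plain dense QR factorizations: the triangular factor of the top tile is stacked on the next tile and re-factorized, which yields the same triangular factor (up to row signs) as factoring the whole block column at once. The sketch below is only an illustration of that principle, using numpy.linalg.qr as a stand-in for the optimized kernels; the tile size and variable names are ours, and the actual kernels additionally store the reflectors compactly in the WY form and exploit the triangular structure of the stacked tile.

import numpy as np

rng = np.random.default_rng(2)
b = 4                              # tile size (illustrative)
A1 = rng.standard_normal((b, b))   # top tile of a block column
A2 = rng.standard_normal((b, b))   # tile below it

# DGEQRT-like step: QR factorization of the top tile alone.
Q1, R1 = np.linalg.qr(A1)

# DTSQRT-like step: QR factorization of the triangular factor R1
# stacked on top of the dense tile A2.
Q2, R = np.linalg.qr(np.vstack([R1, A2]))

# Reference: factor the full block column in one shot.
_, R_ref = np.linalg.qr(np.vstack([A1, A2]))

# The triangular factors agree up to the sign of each row.
print(np.allclose(np.abs(R), np.abs(R_ref)))   # expected: True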