PARALLELIZING THE QR ALGORITHM FOR THE UNSYMMETRIC ALGEBRAIC EIGENVALUE PROBLEM: MYTHS AND REALITY

GREG HENRY* AND ROBERT VAN DE GEIJN†

* Intel Supercomputer Systems Division, 14924 N.W. Greenbrier Pkwy, Beaverton, OR 97006, [email protected]
† Department of Computer Sciences, The University of Texas, Austin, Texas 78712, [email protected]

Abstract. Over the last few years, it has been suggested that the popular QR algorithm for the unsymmetric eigenvalue problem does not parallelize. In this paper, we present both positive and negative results on this subject: In theory, asymptotically perfect speedup can be obtained. In practice, reasonable speedup can be obtained on a MIMD distributed memory computer, for a relatively small number of processors. However, we also show theoretically that it is impossible for the standard QR algorithm to be scalable. Performance of a parallel implementation of the LAPACK DLAHQR routine on the Intel Paragon system is reported.

1. Introduction. Distributed memory parallel algorithms for the unsymmetric eigenvalue problem have been elusive. There are several multiply-based methods currently being studied. Auslander and Tsao [2] and Lederman, Tsao, and Turnbull [24] have a matrix multiply-based parallel algorithm, which uses a polynomial mapping of the eigenvalues. Bai and Demmel [4] have another parallel algorithm based on bisection with the matrix sign function. Matrix tearing methods for finding the eigensystem of an unsymmetric Hessenberg matrix have been proposed by Dongarra and Sidani [9]. These involve making a rank one change to the Hessenberg matrix to create two separate submatrices, finding the eigenpairs of each submatrix, and then applying Newton's method to these values to find the solution to the original problem. None of the above algorithms are particularly competitive in the serial case. They suffer from requiring more floating point operations (flops) and/or yield a loss of accuracy when compared to efficient implementations of the QR algorithm.

Efficient sequential and shared memory parallel implementations of the QR algorithm use a blocked version of the Francis double implicit shifted algorithm [13] or a variant thereof [20]. There have also been attempts at improving data reuse by increasing the number of shifts, either by using a multi-implicit shifted QR algorithm [3] or by pipelining several double shifts simultaneously [31, 32].

A number of attempts at parallelizing the QR algorithm have been made (see Boley and Maier [7], Geist, Ward, Davis, and Funderlic [14], and Stewart [27]). Distributing the work evenly amongst the processors has proven difficult for conventional storage schemes, especially when compared to the parallel solution of dense linear systems [22, 23]. Communication also becomes a more significant bottleneck for the parallel QR algorithm. As noted by van de Geijn [29] and van de Geijn and Hudson [30], the use of a block Hankel-wrapped storage scheme can alleviate some of the problems involved in parallelizing the QR algorithm.

In this paper, we present a number of results of theoretical significance on the subject. We reexamine the results on the Hankel-wrapped storage schemes in the setting of a parallel implementation of a state-of-the-art sequential implementation. Theoretically we can show that under certain conditions the described approach is asymptotically 100% efficient: if the number of processors is fixed and the problem size grows arbitrarily large, perfect speedup can be approached. However, we also show that our approach is not scalable in the following sense: To maintain a given level of efficiency, the dimension of the matrix must grow linearly with the number of processors. As a result, it will be impossible to maintain performance as processors are added, since memory requirements grow with the square of the dimension, and physical memory grows only with the number of processors. While this could be a deficiency attributable to our implementation, we also show that for the standard implementations of the sequential QR algorithm, it is impossible to find an implementation with better scalability properties. Finally, we show that these techniques can indeed be incorporated into a real code by giving details of a prototype distributed memory implementation of the serial algorithm DLAHQR [1], the LAPACK version of the double implicit shifted QR algorithm. Full functionality of the LAPACK code can be supported. That is, the techniques can be extended to allow for the cases of computing the Schur vectors, computing the Schur decomposition of H, or just computing the eigenvalues alone. We have implemented a subset of this functionality, for which the code is described and performance results are given.

Thus this paper makes four contributions: It describes a data decomposition that allows, at least conceptually, straightforward implementation of the QR algorithm; it gives theoretical limitations for parallelizing the standard QR algorithm; it describes a parallel implementation based on the proposed techniques; and it reports performance results of this proof-of-concept implementation obtained on the Intel Paragon system.

2. Sequential QR Algorithm. While we assume the reader of this paper to be fully versed in the intricate details of the QR algorithm, we briefly review the basics in this section.

The Francis double implicit shifted QR algorithm has been a successful serial method for computing the Schur decomposition H = QTQ^T. Here T is upper pseudo-triangular with 1x1 or 2x2 blocks along the diagonal and Q is orthogonal. We assume for simplicity that our initial matrix H is Hessenberg. The parallelization of the reduction to Hessenberg form is a well understood problem, and unlike the eigenvalue problem, the Hessenberg reduction has been shown to parallelize well [5, 10].

One step of the Francis double shift Schur decomposition is shown in Figure 1. Here, the Householder matrices are symmetric orthogonal transforms of the form

\[
P_i = I - 2\,\frac{v v^T}{v^T v},
\]

where v \in R^n and

\[
v_j = \begin{cases} 0 & \text{if } j \le i \text{ or } j > i+3, \\ 1 & \text{if } j = i+1. \end{cases}
\]
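As a concrete illustration of this definition, the following NumPy sketch (ours, not part of the original code) forms the n x n matrix P_i explicitly from a length-three vector placed in positions i+1 through i+3 and checks that the result is symmetric and orthogonal. Production codes never form P_i explicitly; they apply it as a rank-one update.

    import numpy as np

    def embedded_reflector(v3, i, n):
        """Form P_i = I - 2 v v^T / (v^T v), with v zero outside positions
        i+1..i+3 (1-based) and normalized so that v_{i+1} = 1.  Explicitly
        forming P_i is for illustration only."""
        v3 = np.asarray(v3, dtype=float)
        v = np.zeros(n)
        v[i:i + 3] = v3 / v3[0]      # 0-based slots i..i+2 hold the three nonzeros
        return np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)

    # P_i is symmetric and orthogonal:
    P = embedded_reflector([1.0, 0.5, -0.25], i=4, n=10)
    print(np.allclose(P, P.T), np.allclose(P @ P, np.eye(10)))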

Francis HQR Step

    e = eig(H(n-1:n, n-1:n))
    Let x = (H - e(1) I_n)(H - e(2) I_n) e_1
    Let P_0 \in R^{n x n} be a Householder matrix s.t. P_0 x is a multiple of e_1
    H <- P_0 H P_0
    for i = 1, ..., n-2
        Compute P_i so that P_i H has zero (i+2, i) and (i+3, i) entries
        Update H <- P_i H P_i
        Update Q <- Q P_i
    endfor

Fig. 1. Sequential Francis HQR Step.

We assume the Hessenberg matrix is unreduced, and if not, we find the largest unreduced submatrix of H. Suppose this submatrix is H(k:l, k:l). We then apply the Francis HQR Step to the rows and columns of H corresponding to the submatrix; that is,

\[
H(k:l, :) \leftarrow P_i\, H(k:l, :), \qquad H(:, k:l) \leftarrow H(:, k:l)\, P_i.
\]

The double implicit shifts in this case are chosen to be the eigenvalues of

\[
e = \mathrm{eig}\bigl(H(l-1:l,\; l-1:l)\bigr).
\]

In practice, after every couple of iterations, some of the subdiagonals of H will become numerically zero, and at this point the problem deflates into smaller problems.
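To make the sequence of operations in Figure 1 concrete, the sketch below performs one implicit double-shift step on an unreduced upper Hessenberg matrix using NumPy. It is our own minimal illustration (the function names are ours), not the LAPACK DLAHQR code: reflectors are represented by a vector v and a scalar beta instead of being formed explicitly, and no deflation test is performed.

    import numpy as np

    def householder(x):
        """Return (v, beta) such that (I - beta*v*v^T) x is a multiple of e_1."""
        v = np.asarray(x, dtype=float).copy()
        sigma = np.linalg.norm(v)
        if sigma == 0.0:
            return v, 0.0
        v[0] += sigma if v[0] >= 0 else -sigma   # sign choice avoids cancellation
        return v, 2.0 / np.dot(v, v)

    def francis_double_shift_step(H, Q=None):
        """One implicit double-shift (Francis) QR step on an unreduced upper
        Hessenberg matrix H, following the outline of Figure 1.  A minimal
        illustration only -- not the LAPACK DLAHQR code."""
        H = H.copy()
        n = H.shape[0]
        assert n >= 3
        # The shifts are the eigenvalues of the trailing 2x2 block; they enter
        # only through their sum s and product t, so the arithmetic stays real.
        s = H[n-2, n-2] + H[n-1, n-1]
        t = H[n-2, n-2] * H[n-1, n-1] - H[n-2, n-1] * H[n-1, n-2]
        # First column of (H - e(1) I)(H - e(2) I): only three entries are nonzero.
        x = np.array([H[0, 0]**2 + H[0, 1] * H[1, 0] - s * H[0, 0] + t,
                      H[1, 0] * (H[0, 0] + H[1, 1] - s),
                      H[1, 0] * H[2, 1]])
        for k in range(n - 2):
            v, beta = householder(x)
            q, r = max(k - 1, 0), min(k + 4, n)
            # Apply P_k from the left to rows k..k+2 and from the right to columns k..k+2.
            H[k:k+3, q:] -= beta * np.outer(v, v @ H[k:k+3, q:])
            H[:r, k:k+3] -= beta * np.outer(H[:r, k:k+3] @ v, v)
            if Q is not None:
                Q[:, k:k+3] -= beta * np.outer(Q[:, k:k+3] @ v, v)
            # The next reflector is built from the bulge entries in column k.
            x = H[k+1:k+4, k].copy() if k < n - 3 else H[k+1:k+3, k].copy()
        # A final reflector of size two restores Hessenberg form in the corner.
        v, beta = householder(x)
        H[n-2:, n-3:] -= beta * np.outer(v, v @ H[n-2:, n-3:])
        H[:, n-2:] -= beta * np.outer(H[:, n-2:] @ v, v)
        if Q is not None:
            Q[:, n-2:] -= beta * np.outer(Q[:, n-2:] @ v, v)
        return H, Q

A quick sanity check of the sketch is that the returned matrix stays upper Hessenberg (up to rounding) and that its eigenvalues agree with those of the input.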

3. Parallel QR Algorithms.

3.1. Data Decomposition. We briefly describe the storage scheme presented in [30]. Suppose A \in R^{n x n}, where n = mhp for some integers h and m, and p is the number of processors. Suppose A is partitioned as follows:

\[
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,mp-1} & A_{1,mp} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,mp-1} & A_{2,mp} \\
A_{3,1} & A_{3,2} & \cdots & A_{3,mp-1} & A_{3,mp} \\
\vdots  & \vdots  &        & \vdots     & \vdots   \\
A_{mp,1} & A_{mp,2} & \cdots & A_{mp,mp-1} & A_{mp,mp}
\end{bmatrix},
\]

where A_{i,j} \in R^{h x h}. We denote processor k owning block A_{i,j} with a superscript, A_{i,j}^{(k)}. The one dimensional block Hankel-wrapped storage for m = 1 assigns submatrix A_{i,j} to processor

\[
(i + j - 2) \bmod p.
\]

[Figure 2 shows a 20 x 20 upper Hessenberg matrix distributed over p = 4 processors with block size h = 5: each nonzero entry is labeled with the index of the processor that owns it, the B's mark the bulge fill-in at the current stage of the loop, and the dashed boxes enclose the rows and columns affected by the current transformation.]

Fig. 2. Data decomposition and chasing of the bulge.

That is, the distribution amongst the processors is as follows [30]:

\[
\begin{bmatrix}
A_{1,1}^{(0)} & A_{1,2}^{(1)} & A_{1,3}^{(2)} & \cdots & A_{1,mp-1}^{(p-2)} & A_{1,mp}^{(p-1)} \\
A_{2,1}^{(1)} & A_{2,2}^{(2)} & A_{2,3}^{(3)} & \cdots & A_{2,mp-1}^{(p-1)} & A_{2,mp}^{(0)} \\
              & A_{3,2}^{(3)} & A_{3,3}^{(4)} & \cdots & A_{3,mp-1}^{(0)}   & A_{3,mp}^{(1)} \\
              &               & \ddots        &        & \vdots             & \vdots         \\
              &               &               &        & A_{mp,mp-1}^{(p-3)} & A_{mp,mp}^{(p-2)}
\end{bmatrix},
\]

where the superscript indicates the processor assignment.
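The mapping itself is a one-liner. The small sketch below (our illustration; block indices are 1-based, as in the text) reproduces the assignment pattern of the display above for p = 4 and m = 1, printing '.' for the blocks below the first block subdiagonal, which are zero for a Hessenberg matrix.

    def owner(i, j, p):
        """Processor storing block A[i, j] under the one-dimensional block
        Hankel-wrapped mapping (blocks indexed from 1)."""
        return (i + j - 2) % p

    p = 4                 # number of processors; mp = 4 block rows and columns
    for i in range(1, 5):
        print([owner(i, j, p) if j >= i - 1 else '.' for j in range(1, 5)])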

3.2. Basic Parallel Algorithm. We start describing the basic parallel QR algorithm by considering the Francis HQR Step. The basic loop in Figure 1, indexed by i, is generally referred to as "chasing the bulge". Figure 2 gives an arbitrary example with i = 7, in a 20 x 20 matrix, where the integers indicate nonzero elements of the upper Hessenberg matrix, and the B's indicate the fill-in at that stage of the loop. In the next step of the Francis HQR step, a Householder transformation P_i is computed and applied from the left and the right. The affected elements are within the dashed lines. After this application of the Householder transformation, the bulge moves down the diagonal one position, and the next iteration starts.

Figure 2 also demonstrates the possible parallelism in the Francis HQR step: The application of P_i to the appropriate elements of H, referred to as a reflection, is perfectly distributed among the processors, except for the processor that owns the bulge and the fill-in elements. This processor must do slightly more work associated with the intersection of the two boxes. It is the parallelism within the application of each Householder transformation that is the key to the success of the approach.
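The load-balance claim can be checked with a small census. The sketch below (our illustration; entry ownership follows directly from the block mapping above) counts how many entries of H each processor touches at loop index i, assuming the left application affects rows i+1..i+3 of columns i..n and the right application affects columns i+1..i+3 of rows 1..i+4; edge effects are ignored.

    from collections import Counter
    from math import ceil

    def entry_owner(r, c, h, p):
        """Owner of entry (r, c), 1-based: block (ceil(r/h), ceil(c/h)) lives on
        processor (I + J - 2) mod p under the m = 1 Hankel wrapping."""
        return (ceil(r / h) + ceil(c / h) - 2) % p

    def work_census(i, n, h, p):
        """Entries of H touched per processor when P_i is applied from the left
        (rows i+1..i+3, columns i..n) and from the right (rows 1..i+4, columns
        i+1..i+3).  A rough count; boundary cases are not treated specially."""
        counts = Counter()
        for r in range(i + 1, min(i + 3, n) + 1):
            for c in range(i, n + 1):
                counts[entry_owner(r, c, h, p)] += 1
        for r in range(1, min(i + 4, n) + 1):
            for c in range(i + 1, min(i + 3, n) + 1):
                counts[entry_owner(r, c, h, p)] += 1
        return dict(counts)

    # n = 200, h = 50, p = 4, bulge in the middle: the owner of the bulge block
    # does slightly more work; everyone else does about the same.
    print(work_census(70, 200, 50, 4))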

The entire Francis HQR step now parallelizes as follows:

1. The processor that holds H(n-1:n, n-1:n) computes the shifts and sends them to the first processor. Alternatively, all processors could compute the shifts, since it is a lower order operation.
2. The processor that owns the data required to compute P_0 computes it, and then broadcasts it to all other processors.
3. All processors update their portions of the matrix according to P_0.
4. for i = 1, ..., n-2
   (a) The processor that owns the data required to compute P_i computes it and then broadcasts it to all other processors.
   (b) All processors update their portions of the matrix according to P_i.

Naturally, every time the computation hits data that lies on the boundaries between processors, a limited amount of communication is required to bring appropriate rows and columns together.

3.3. Analysis of the Simple Parallel Algorithm. Before discussing the finer details of the parallel code, let us analyze the simple algorithm given above. This will give us an idea of how well the basic approach works. We will restrict our analysis to a single Francis step, and ignore the costs of shift derivation, convergence testing, and decoupling setup. We show later how this extra O(n) work can be done without incurring excessive overhead.

For our analysis, we will use the following model: Performing a floating point computation requires time \gamma. Sending a message of length n between two nodes requires time \alpha + n\beta, where \alpha represents the latency, and \beta the penalty per double precision number transferred. For convenience, we shall assume all communication is nearest neighbor, and hence no network conflicts occur. Processors can send and receive simultaneously.

We start by computing the sequential expense of the algorithm: Computing a Householder transformation from a vector of length three requires C flops, for some small constant C. Applying such a Householder transformation of size three to three rows or columns of length n requires 10n flops. Within the loop indexed by i, each transformation is applied to n - i + 1 triplets of rows and i + 3 triplets of columns. Applying the transformations to the Hessenberg matrix therefore requires 10(n + 4) flops, and applying the transformations to the Schur matrix requires 10n flops. The total cost of executing a single Francis step thus becomes approximately

\[
\sum_{i=1}^{n-1} \bigl( C + 10(n+4) + 10n \bigr) = 20n^2 + O(n).
\]

Turning now to the parallel implementation, contributions to the critical path are given roughly as follows: Computing a transformation still contributes C flops. Broadcasting this transformation requires time \alpha + 3\beta, since the parallel implementation can be arranged to pipeline around the logical ring of processors. After this, the processor that computed the transformation must update its part of the rows and columns. The row and column vectors are of length approximately n/p, with a few extra elements introduced by the bulge. This contributes the equivalent of about 20n/p + 40 flops. Now, it can compute the next transformation, and the process repeats itself n - 1 times. The total contribution becomes

\[
\sum_{i=1}^{n-1} \left[ \left( C + 10\left(\frac{n}{p} + 4\right) + 10\,\frac{n}{p} \right)\gamma + \alpha + 3\beta \right]
  = \left( Cn + 20\,\frac{n^2}{p} + 40n \right)\gamma + n\alpha + 3n\beta + O(1).
\tag{1}
\]

Here we cannot ignore the O(n) terms, since if p is comparable to n, these terms matter.

Equation (1) ignores the communication required when a boundary between processors is reached, which happens every h iterations. Each "border" row contains roughly (n - i)/p elements per processor, and each "border" column contains roughly i/p elements per processor. Using transforms of size three implies that when the boundary between processors is reached, the next transform will also cross the boundary. Hence 2n/p items must be communicated to the right and returned. This means that the total communication volume for "border" or boundary data is

\[
\frac{n}{h}\left( 4\,\frac{n}{p} \right)\beta.
\tag{2}
\]

The communication volume in Equation (2) can be sent in anywhere from one to eight messages, depending on the implementation. Eight messages are required when both rows and both columns for the two iterations are sent separately, and returned again. One message is possible if data is only sent rightward around the ring, and the results accumulated into a buffer and sent back in O(1) communications at the end of the Francis step. For the purpose of our model, we assume two communications, requiring time

\[
\frac{n}{h}\, 2\left( \alpha + 2\,\frac{n}{p}\,\beta \right)
\tag{3}
\]

for a total estimated parallel time of

\[
T_{\mathrm{est}}(n, p, h) = \left( Cn + 20\,\frac{n^2}{p} + 40n \right)\gamma + n\alpha + 3n\beta + 2\,\frac{n}{h}\,\alpha + 4\,\frac{n^2}{hp}\,\beta + O(1).
\tag{4}
\]

On current generation architectures, the above model shows that the bulk of inefficiency comes from the number of messages generated, since \alpha is typically several orders of magnitude greater than either \beta or \gamma. To investigate the scalability of the approach, we start by computing the estimated speedup

\[
\mathrm{Speedup}_{\mathrm{est}}(n, p, h) = \frac{T_{\mathrm{seq}}}{T_{\mathrm{est}}(n, p, h)}
  = \frac{20n^2\gamma}{20\,\frac{n^2}{p}\,\gamma + 4\,\frac{n^2}{hp}\,\beta + O(n)}
  = \frac{p}{1 + \frac{\beta}{5h\gamma} + O\!\left(\frac{p}{n}\right)}.
\tag{5}
\]

If h is chosen big enough, e.g. h = n/p, the second term in the denominator also becomes O(p/n), and we can make the following observations: if p is fixed, and n is allowed to grow, eventually perfect speedup will be approached. Furthermore, the estimated efficiency attained is given by

\[
\mathrm{Efficiency}_{\mathrm{est}}(n, p, h) = \frac{1}{p}\,\mathrm{Speedup}_{\mathrm{est}}(n, p, h) = \frac{1}{1 + O\!\left(\frac{p}{n}\right)}.
\tag{6}
\]

From this last equation, we can say something about the scalability of the implementation: In order to maintain efficiency, we must grow n linearly with p. This poses a serious problem: memory grows linearly with p, but memory requirements grow with n^2. We conclude that this approach to parallelization will not scale, since memory constraints dictate that eventually efficiency cannot be maintained, even if the problem is allowed to grow to fill the combined memories.
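The estimates (4) through (6) are easy to play with numerically. The helper below is a sketch that plugs hypothetical machine parameters (the alpha, beta, gamma defaults are illustrative values, not measured Paragon numbers) into Equation (4); the printed loop shows that with h = n/p the estimated efficiency settles at a constant only when n is grown linearly with p.

    def t_est(n, p, h, C=20.0, alpha=1.0e-4, beta=1.0e-6, gamma=1.0e-8):
        """Estimated time (4) for one Francis step under the model of Section 3.3."""
        return ((C * n + 20 * n * n / p + 40 * n) * gamma
                + n * alpha + 3 * n * beta
                + 2 * (n / h) * alpha + 4 * (n * n / (h * p)) * beta)

    def speedup_est(n, p, h, **kw):
        """Estimated speedup (5): sequential cost 20 n^2 gamma over T_est."""
        return 20 * n * n * kw.get("gamma", 1.0e-8) / t_est(n, p, h, **kw)

    def efficiency_est(n, p, h, **kw):
        """Estimated efficiency (6)."""
        return speedup_est(n, p, h, **kw) / p

    # Growing n linearly with p holds the estimated efficiency roughly constant:
    for p in (4, 16, 64, 256):
        n = 500 * p
        print(p, n, round(efficiency_est(n, p, h=n // p), 3))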

The previous modeling extends to the overall algorithm in a simple manner. On average, finding the eigenvalues of a matrix of size n requires 2n HQR iterations. Every other iteration, there is usually a deflation of size one or two. Since deflation occurs less frequently in the beginning than the end, our overall model satisfies

\[
\mathrm{Overall\ Flops} = 3 \cdot \sum_{k=1}^{n} 20k^2 = 20n^3,
\]

where 3 is a heuristic fudge factor. Compare, for example, [15], which uses a similar heuristic, but a different fudge factor, to suggest overall flops of 25n^3.

Similarly, we can extend any of the formulas in the previous subsection. In Equation (4), the number of communication start-ups is n + 2n/h. For the overall algorithm, before the bundling described in Section 3.5.4, we can estimate

\[
\mathrm{Communication\ Start\text{-}ups} = 3 \cdot \sum_{k=1}^{n} \left( k + \frac{2k}{h} \right) = \left( \frac{3}{2} + \frac{3}{h} \right) n^2.
\]
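Both closed forms are easy to verify numerically; the short sketch below (ours) sums the per-iteration counts directly and compares them with the 20 n^3 and (3/2 + 3/h) n^2 estimates.

    def overall_flops(n):
        """Heuristic flop model of Section 3.3: 3 * sum_{k=1..n} 20 k^2."""
        return 3 * sum(20 * k * k for k in range(1, n + 1))

    def overall_startups(n, h):
        """Communication start-ups before bundling: 3 * sum_{k=1..n} (k + 2k/h)."""
        return 3 * sum(k + 2 * k / h for k in range(1, n + 1))

    n, h = 1000, 50
    print(overall_flops(n) / n**3)          # close to 20
    print(overall_startups(n, h) / n**2)    # close to 3/2 + 3/h = 1.56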

3.4. Theoretical Limitations. For many dense linear algebra algorithms, better scalability can be obtained by using a so-called two dimensional data decomposition. Examples include the standard decomposition algorithms, like LU, QR, and Cholesky factorization [11, 16], as well as the reduction to condensed form required for reducing a general matrix to upper Hessenberg form [5, 10]. It is therefore natural to try to reformulate the parallel QR algorithm in an attempt to extract an algorithm that has better scalability properties. In this section we show theoretically that there is an inherent limitation to the scalability of the QR algorithm, which indicates that a more complicated data and work decomposition alone cannot improve the scalability of the parallel implementation.

We present a simple analysis of the theoretical limitations of the QR algorithm. There is a clear dependence in the chasing of the bulge that requires at least O(n) steps to complete one Francis HQR step. The associated computation can be thought of as running down the diagonal of the matrix. Moreover, the last element of the matrix must be computed before the next Francis step can start, since it is needed to compute the shift. Since there are O(n^2) computations performed in one such step, the speedup is limited to O(n), which indicates that the maximal number of processors that can be usefully employed is O(n). Hence, p <= Cn for some constant C.

Let us now analyze the implications of this: If efficient use of the processors is to be attained, the equation p <= Cn must hold for some constant C. The total memory requirement for a problem of size n is Kn^2, for some constant K. Hence

\[
\mathrm{Memory\ Requirements} = Kn^2 \ge \frac{K}{C^2}\,p^2 = O(p^2).
\]

Notice however that memory only grows linearly with the number of processors. We conclude that due to memory restrictions, it is impossible to solve a problem large enough to maintain efficiency as processors are added. In the conclusion, we indicate how it may be possible to overcome this apparent negative result.

3.5. Refinements of the Parallel Algorithm. The above analysis shows that there is reason to believe that trying to generalize the Hankel-wrapped mapping to yield the equivalent of a two dimensional wrapping will not result in the same kinds of benefits as it would for a factorization algorithm. However, there are a number of refinements that can be made that will improve the performance of the parallel QR algorithm. Moreover, the algorithm as it was described in Section 3.2 is far from a complete implementation.

3.5.1. Accumulating Q. The algorithm in Section 3.2 ignores the accumulation of the orthogonal matrix Q. There is perfect parallelism in this operation, requiring no more communication than must already be performed as part of the updating of H.

Each stage of the Francis HQR step in Figure 1 requires the computation Q <- Q P_i. Since P_i is broadcast during the update of the Hessenberg matrix, all processors already have this information. The computation Q <- Q P_i reflects three columns of Q. Which three get reflected depends on i. Since Q is not accessed elsewhere within the HQR kernel, we can assume Q has the best data storage possible for reflecting three columns. This data storage is a one dimensional row mapping. With a one dimensional row mapping, the data in any row is always contained on a single processor. Each row in a column transform can be computed independently. So there is no extra communication required to complete the column transform, and this extra work is embarrassingly parallel.
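In code, the local update is a single rank-one modification of three columns of the locally owned rows. The sketch below assumes, as in the earlier sketches, that P_i is represented by a pair (v, beta); each processor would call it on its own row slice after the broadcast, with no further communication.

    import numpy as np

    def update_q_rowslice(Q_local, first_col, v, beta):
        """Apply Q <- Q P_i to the rows of Q held by one processor.  P_i reflects
        the three columns first_col..first_col+2; with a one-dimensional row
        mapping each processor owns complete rows, so this is communication-free."""
        cols = slice(first_col, first_col + 3)
        Q_local[:, cols] -= beta * np.outer(Q_local[:, cols] @ v, v)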

3.5.2. Deflation. We previously mentioned that as subdiagonal elements become numerically zero, the process proceeds with unreduced submatrices on the diagonal. This process is known as deflation. One convenient way to determine the points where such a block starts and ends is to broadcast enough diagonal information during the bulge chase so that, at the end of a complete HQR step, all processors hold all information required to determine when breakpoints occur. This requires sending six elements instead of three per transformation. While this creates a certain redundancy, the overhead is O(n), and hence in line with the cost of the total computation.
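For completeness, "numerically zero" is usually decided by comparing a subdiagonal entry against its diagonal neighbors. The test below is one common simplified criterion (DLAHQR's actual test has additional safeguards) and is offered only as a sketch of how the broadcast diagonal data would be used.

    import numpy as np

    def deflation_points(diag, subdiag, ulp=np.finfo(float).eps):
        """Indices k at which H[k+1, k] may be set to zero: the subdiagonal entry
        is negligible relative to its two diagonal neighbours."""
        return [k for k, s in enumerate(subdiag)
                if abs(s) <= ulp * (abs(diag[k]) + abs(diag[k + 1]))]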

3.5.3. Determining shifts. Related to the refinement mentioned above concerning deflation, it is convenient to also broadcast all entries of the tridiagonal part of the newly formed Hessenberg matrix, since this gives all processors access to the data required to compute the next shifts.

3.5.4. Bundling transformations. As argued in Section 3.3, the bulk of the communication overhead is due to communication startup cost. This cost can be amortized over several computations and applications of Householder transformations by bundling several Householder transformations before broadcasting. We direct the reader's attention to Figure 3.

[Figure 3 sketches the block that currently holds the bulge, divided by diagonal lines into regions "1", "2", and "3", with the bulge moving from the position marked by "B"s to the position marked by "b"s.]

Fig. 3. Bundling Householder transformations and order of updates.

This picture represents the data in the block that currently holds the bulge. This block is assumed to be completely assigned to one processor. In chasing the bulge a few positions, to where it moves from the position indicated by "B"s to the position indicated by "b"s, a number of Householder transformations are computed. To do so, only the computation associated with the

region marked by "1" needs to be performed.(1) Thus these transformations can be bundled by performing this minimal computation, followed by the broadcast, after which the other regions, marked by "2" and "3", can be updated. The net effect is that the number of startups is reduced by the number of transformations that are bundled. Naturally, bundling across processor boundaries becomes complicated, and has only marginal benefit.

(1) As is noted in Henry [17, 20], the minimal information needed to determine the upcoming transforms is roughly eighty percent of the total flops required to actually complete the entire update in region "1" of Figure 3. This "partial" step is referred to as a lookahead step and can also be used to determine better shifts (cf. Henry [19]).

Repeating the analysis in Subsection 3.3 above, we get the following estimates: Computing r transformations, updating only region "1" in Figure 3, requires approximately

\[
r\,\bigl( C + 10(r + 4) \bigr)\gamma
\]

time. Broadcasting the transformations requires time

\[
\alpha + 3r\beta.
\]

Next, updating regions "2" and "3" on the processor that computes the transformations requires time

\[
10r\left( \frac{n}{p} - r \right)\gamma
\]

to update matrix A, and

\[
10r\,\frac{n}{p}\,\gamma
\]

to update matrix Q. This is repeated n/r times, and border information must be communicated every h transformations, giving a very rough total estimated time of

\[
\left( 20\,\frac{n^2}{p} + Cn + 40n \right)\gamma + \frac{n}{r}\,\alpha + 3n\beta + 2\,\frac{n}{h}\,\alpha + 4\,\frac{n^2}{hp}\,\beta + O(1).
\]


Clearly the above estimate is only a rough one: It suggests that increasing the bundling factor r arbitrarily will continue to improve the total time. Notice that this estimate is only reasonable when the pipeline is undisturbed. However, every time a boundary is encountered, the pipeline will be disturbed, and therefore larger bundling factors will increase load imbalance. Nonetheless, our model gives us insight.
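Plugging the bundled estimate into the same hypothetical machine parameters used earlier shows the intended effect: the alpha term shrinks with r while everything else is unchanged. The sketch below is self-contained and, like the earlier one, uses illustrative parameter values only.

    def t_est_bundled(n, p, h, r, C=20.0, alpha=1.0e-4, beta=1.0e-6, gamma=1.0e-8):
        """Rough bundled estimate of Section 3.5.4: bundling r transformations
        reduces the number of broadcast start-ups from n to n/r."""
        return ((20 * n * n / p + C * n + 40 * n) * gamma
                + (n / r) * alpha + 3 * n * beta
                + 2 * (n / h) * alpha + 4 * (n * n / (h * p)) * beta)

    n, p = 4000, 16
    for r in (1, 2, 4, 8):
        print(r, round(t_est_bundled(n, p, h=n // p, r=r), 4))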

4. Performance Results. In this section, we report the performance of a prototype working implementation of the described parallel algorithm. Our implementation is a parallelization of the LAPACK [1] routine DLAHQR, and is mathematically exact, except that the sequential routine applies one final rotation per converged eigenpair so that the resulting "Schur" form has some regularity. Hence there is no need to report in detail on the numerical accuracy of the implementation: all numerical properties of the LAPACK code apply to this parallel implementation.

We report performance obtained on an Intel Paragon XP/S Model 140 Supercomputer, running SUNMOS version S1.4.8. Since finding all eigenvalues of a very large matrix takes a considerable amount of time (measured in days), we report the performance of the algorithm during the first five iterations. For problems of size 2750 or less, and for 64 processors or less, we saw that five iterations were always within three percent of predicting overall performance. In addition, it is hard to measure speedup with respect to a single processor, due to insufficient memory to hold a large matrix. As a result, we measured the performance of the best known sequential QR algorithm, and report the parallel performance with respect to that: The single node performance of the fastest QR algorithm (not necessarily a standard double shift algorithm) from LAPACK peaks at 10 Mflops for the five iterations. We report scaled speedup as the speedup obtained with respect to the calculated time required to run a given problem on a single node performing at 10 Mflops. In this case, actual flops within the algorithm were counted and used for the results, not heuristics.

Many parameters are used to describe the parallel implementation, including the number of nodes used, p, the matrix size, n, the blocking factor, h, and the bundling factor, r. All the runs in Table 1 were for problems of size approximately 10 to 14 Mbytes, all with blocking factor h given by n/p, and with the bundling factor held constant at r = 2. Our models predict that varying h and r impacts the communication overhead incurred. However, in practice the major source of inefficiency appears to come from the fact that row and column updates do not have the same memory access patterns, and therefore execute at different rates. Our experiments indicate row transforms are 13 percent faster than column transforms. At each step there is a roughly even amount of work for all the nodes, but not the same amount of row or column work. This means that at each step there is a delay for a few of the nodes to complete the ring broadcast. Hence, although our model predicts the load balance to be quite good, the different rates of execution make it much less well balanced. This shows up in the timings given in Table 1. Notice that at best, about 65 percent efficiency is attained.

     p      n    Scaled Speedup    Efficiency
     4   2000          2.5            .63
     8   2800          5.2            .65
    16   4000         10.0            .63
    32   4800         17.1            .53
    48   6000         23.0            .48
    64   8000         30.1            .47
    96   9600         37.4            .39

                 Table 1. Parallel HQR.

5. Conclusion. The research described in this paper was meant as a response to claims that no parallelism exists in the QR algorithm. The theoretical results in this paper show there are approaches to implementing the QR algorithm that allow parallelism to be extracted in a natural way. The actual implementation of this method on a high performance parallel computer shows that in practice, parallelism can be extracted as well.

We caution the reader against misinterpreting the results in this paper as largely supporting the notion that new methods, like those developed by Bai and Demmel [4] and Dongarra and Sidani [9], must be pursued if nonsymmetric eigenvalue problems are to be solved on massively parallel computers. Let us address some of the arguments that can be made to support such an interpretation and explain why these arguments are somewhat unsatisfactory.

No parallelism exists in the nonsymmetric QR algorithm.
    We believe the major results in this paper are a counterexample to this statement, both in theory and practice.

The performance of even the sequential algorithm leaves something to be desired.
    One argument for matrix multiply-based methods is that matrix multiplication can achieve much higher performance rates than a kernel that applies a Householder transformation. This more than offsets the added floating point operations required for such novel solutions. This argument is true only because matrix-matrix multiplication is recognized as an important kernel that warrants a highly optimized implementation. It is the data reuse available in the matrix-matrix multiplication that allows near-peak performance to be achieved on a large number of platforms. However, a careful analysis of the reuse of data when multiple Householder transformations are applied, which could be exploited with bundling, shows that the performance of the sequential QR algorithm could be greatly improved if such a kernel were assembly coded. Clearly a user who can afford a massively parallel computer would be able to afford to assembly code such a kernel.
    One could argue that assembly coding this kernel would only improve the 40 percent of total time spent in useful computation and hence would actually decrease efficiency. However, since we argue that the bulk of the overhead is due to load imbalance, the total time should benefit, not just the time spent in useful computation.

Parallel QR algorithms cannot be scalable.
    One could draw this conclusion from Section 3.4. However, notice that the analysis holds for the algorithm when only a single double-shift is used during each iteration. If a method is parallelized that uses s shifts, the total computation per iteration is increased to O(sn^2), while the dependence during the bulge chase only increases to O(n + s) if multiple bulges are chased in a pipelined fashion. We conclude that the limit on speedup is now given by O(sn^2)/O(n + s) ≈ O(sn). Thus O(sn) processors can be usefully employed, leading to memory requirements

\[
\mathrm{Memory\ Requirements} = Kn^2 \ge O\!\left( \frac{p^2}{s^2} \right).
\]

    As long as p/s^2 = O(1), i.e. s = O(√p), scalability can be achieved while meeting the fixed memory per processor criterion. Scalable implementations of the QR algorithm may be achievable for parallel implementations that use s shifts simultaneously, provided the number of shifts grows with the square root of the number of processors.
    We must note that our data decomposition inherently requires n to grow with p and therefore will not provide for the required parallel implementation. Therefore, generalizations to two dimensional data decompositions will be necessary.

Notice that some of the above comments indicate that many of the observations made in the earlier sections were merely due to the fact that we restricted ourselves in this paper to discussing the parallel implementation of widely used QR algorithms.

We conclude by saying that the reason parallel QR algorithm implementations may not be competitive is that insufficient resources have been allocated to studying their implementation. Those resources are currently being used to explore novel algorithms instead. If someone has the required resources, our research points clearly to the properties a parallel implementation must have to be successful.

Acknowledgements. Our final runs were made on Sandia's Paragon XP/S Model 140 Supercomputer running SUNMOS S1.4.8, and we are grateful for this resource and operating system. SUNMOS stands for Sandia and University of New Mexico Operating System, and is a joint collaboration between Sandia National Laboratories and UNM. SUNMOS is copyrighted by Sandia National Laboratories. Most of the development was done on various Paragon supercomputers at Intel Supercomputer Systems Division. We thank Ken Stanley and David Watkins for communications regarding this work.

REFERENCES

[1] Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D., LAPACK Users' Guide, SIAM Publications, Philadelphia, PA, 1992.
[2] Auslander, L., Tsao, A., On Parallelizable Eigensolvers, Adv. Appl. Math., Vol. 13, pp. 253-261, 1992.
[3] Bai, Z., Demmel, J., On a Block Implementation of Hessenberg Multishift QR Iteration, International Journal of High Speed Computing, Vol. 1, pp. 97-112, 1989.
[4] Bai, Z., Demmel, J., Design of a Parallel Nonsymmetric Eigenroutine Toolbox, Part I, in Parallel Processing for Scientific Computing, Editors R. Sincovec, D. Keyes, M. Leuze, L. Petzold, and D. Reed, SIAM Publications, Philadelphia, PA, 1993.
[5] Berry, M. W., Dongarra, J. J., and Kim, Y., A Highly Parallel Algorithm for the Reduction of a Nonsymmetric Matrix to Block Upper-Hessenberg Form, LAPACK Working Note 68, University of Tennessee, CS-94-221, Feb. 1994.
[6] Boley, D., Solving the Generalized Eigenvalue Problem on a Synchronous Linear Processor Array, Parallel Computing, Vol. 3, pp. 123-166, 1986.
[7] Boley, D., Maier, R., A Parallel QR Algorithm for the Nonsymmetric Eigenvalue Problem, Univ. of Minnesota, Dept. of Computer Science, Technical Report TR-88-12, 1988.
[8] Demmel, J. W., LAPACK Working Note 47: Open Problems in Numerical Linear Algebra, University of Tennessee, Technical Report CS-92-164, May 1992.
[9] Dongarra, J. J., and Sidani, M., A Parallel Algorithm for the Nonsymmetric Eigenvalue Problem, Technical Report ORNL/TM-12003, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 1991.
[10] Dongarra, J. J., van de Geijn, R. A., Reduction to Condensed Form on Distributed Memory Architectures, Parallel Computing, Vol. 18, pp. 973-982, 1992.
[11] Dongarra, J. J., van de Geijn, R. A., and Walker, D., Scalability Issues Affecting the Design of a Dense Linear Algebra Library, Journal of Parallel and Distributed Computing, Vol. 22, No. 3, Sept. 1994.
[12] Eberlein, P. J., On the Schur Decomposition of a Matrix for Parallel Computation, IEEE Transactions on Computers, Vol. C-36, pp. 167-174, 1987.
[13] Francis, J. G. F., The QR Transformation: A Unitary Analogue to the LR Transformation, Parts I and II, Comp. J., Vol. 4, pp. 332-345, 1961.
[14] Geist, G. A., Ward, R. C., Davis, G. J., Funderlic, R. E., Finding Eigenvalues and Eigenvectors of Unsymmetric Matrices using a Hypercube Multiprocessor, Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, Editor G. Fox, pp. 1577-1582, 1988.
[15] Golub, G., Van Loan, C. F., Matrix Computations, 2nd Ed., The Johns Hopkins University Press, 1989.
[16] Hendrickson, B. A., Womble, D. E., The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers, SIAM J. Sci. Stat. Comput., 1994.
[17] Henry, G., Improving the Unsymmetric Parallel QR Algorithm on Vector Machines, SIAM 6th Parallel Conference Proceedings, March 1993.
[18] Henry, G., Increasing Data Reuse in the Unsymmetric QR Algorithm, Theory Center Technical Report CTC92TR100, July 1992.
[19] Henry, G., A New Approach to the Schur Decomposition, Theory Center Technical Report, in progress.
[20] Henry, G., Improving Data Re-Use in Eigenvalue-Related Computations, Ph.D. Thesis, Cornell University, January 1994.
[21] Ipsen, I. C. F., Saad, Y., The Impact of Parallel Architectures on the Solution of Eigenvalue Problems, in Large Scale Eigenvalue Problems, Editors J. Cullum and R. Willoughby, Elsevier Science Publishers, 1986.
[22] Ipsen, I. C. F., Saad, Y., Schultz, M. H., Complexity of Dense-Linear-System Solution on a Multiprocessor Ring, Linear Algebra and its Applications, Vol. 77, pp. 205-239, 1986.
[23] Juszczak, J. W., van de Geijn, R. A., An Experiment in Coding Portable Parallel Matrix Algorithms, Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, 1989.
[24] Lederman, S., Tsao, A., Turnbull, T., A Parallelizable Eigensolver for Real Diagonalizable Matrices with Real Eigenvalues, Technical Report TR-91-042, Supercomputing Research Center, 1991.
[25] Moler, C. B., MATLAB User's Guide, Technical Report CS81-1, Department of Computer Science, University of New Mexico, Albuquerque, New Mexico, 1980.
[26] Purkayastha, A., A Parallel Algorithm for the Sylvester-Observer Equations, Ph.D. dissertation, Northern Illinois University, DeKalb, Illinois, Jan. 1993.
[27] Stewart, G. W., A Parallel Implementation of the QR Algorithm, Parallel Computing, Vol. 5, pp. 187-196, 1987.
[28] van de Geijn, R. A., Implementing the QR-Algorithm on an Array of Processors, Ph.D. Thesis, Department of Computer Science, Univ. of Maryland, TR-1897, 1987.
[29] van de Geijn, R. A., Storage Schemes for Parallel Eigenvalue Algorithms, in Numerical Linear Algebra, Digital Signal Processing and Parallel Algorithms, Editors G. Golub and P. Van Dooren, Springer Verlag, 1988.
[30] van de Geijn, R. A., Hudson, D. G., An Efficient Parallel Implementation of the Nonsymmetric QR Algorithm, Proceedings of the 4th Conference on Hypercube Concurrent Computers and Applications, 1989.
[31] Watkins, D., Shifting Strategies for the Parallel QR Algorithm, SIAM J. Sci. Comput., Vol. 15, July 1994.
[32] Watkins, D., Transmission of Shifts in the Multishift QR Algorithm, Proceedings of the 5th SIAM Conference on Applied Linear Algebra, Snowbird, Utah, June 1994.
[33] Wilkinson, J. H., The Algebraic Eigenvalue Problem, Oxford University Press, Oxford, 1965.