SIAM J. MATRIX ANAL. APPL. © 2002 Society for Industrial and Applied Mathematics Vol. 23, No. 4, pp. 929–947

THE MULTISHIFT QR ALGORITHM. PART I: MAINTAINING WELL-FOCUSED SHIFTS AND LEVEL 3 PERFORMANCE∗

KAREN BRAMAN† , RALPH BYERS† , AND ROY MATHIAS‡

Abstract. This paper presents a small-bulge multishift variation of the multishift QR algorithm that avoids the phenomenon of shift blurring, which retards convergence and limits the number of simultaneous shifts. It replaces the large diagonal bulge in the multishift QR sweep with a chain of many small bulges. The small-bulge multishift QR sweep admits nearly any number of simultaneous shifts—even hundreds—without adverse effects on the convergence rate. With enough simultaneous shifts, the small-bulge multishift QR algorithm takes advantage of the level 3 BLAS, which is a special advantage for computers with advanced architectures.

Key words. QR algorithm, implicit shifts, level 3 BLAS, eigenvalues, eigenvectors

AMS subject classifications. 65F15, 15A18

PII. S0895479801384573

1. Introduction. This paper presents a small-bulge multishift variation of the multishift QR algorithm [4] that avoids the phenomenon of shift blurring, which retards convergence and limits the number of simultaneous shifts that can be used effectively. The small-bulge multishift QR algorithm replaces the large diagonal bulge in the multishift QR sweep with a chain of many small bulges. The small-bulge multishift QR sweep admits nearly any number of simultaneous shifts—even hundreds—without adverse effects on the convergence rate. It takes advantage of the level 3 BLAS by organizing nearly all the arithmetic work into matrix-matrix multiplies. This is particularly efficient on most modern computers and especially efficient on computers with advanced architectures.

The QR algorithm is the most prominent member of a large and growing family of bulge-chasing algorithms [11, 14, 15, 17, 28, 30, 51, 39, 40, 44, 57, 59]. It is remarkable that after thirty-five years, the original QR algorithm [25, 26, 35] with few modifications is still the method of choice for calculating all eigenvalues and (optionally) eigenvectors of small, nonsymmetric matrices. Despite being a dense matrix method, it is arguably still the method of choice for computing all eigenvalues of moderately large, nonsymmetric matrices, i.e., at this writing, matrices of order greater than 1,000. Its excellent rounding error properties and convergence behavior are both theoretically and empirically satisfactory [9, 16, 19, 32, 41, 42, 52, 62, 63, 64] despite surprising convergence failures [8, 10, 19, 56, 58].

∗Received by the editors February 5, 2001; accepted for publication (in revised form) by D. Boley May 15, 2001; published electronically March 27, 2002. This work was partially supported by the National Computational Science Alliance under award DMS990005N and utilized the NCSA SGI/Cray Origin2000. http://www.siam.org/journals/simax/23-4/38457.html

†Department of Mathematics, University of Kansas, 405 Snow Hall, Lawrence, KS 66045 ([email protected], [email protected]). The research of these authors was partially supported by National Science Foundation awards CCR-9732671, DMS-9628626, CCR-9404425, and MRI-9977352 and by the NSF EPSCoR/K*STAR program through the Center for Advanced Scientific Computing.

‡Department of Mathematics, College of William & Mary, Williamsburg, VA 23187 ([email protected]). The research of this author was partially supported by NSF grants DMS-9504795 and DMS-9704534.

It has proven difficult to implement the QR algorithm in a way that takes full advantage of the potentially high execution rate of computers with advanced architectures—particularly hierarchical memory architectures. This is so much the case that some high-performance algorithms work around the slow but reliable QR algorithm by using faster but less reliable methods to tear or split off smaller submatrices on which to apply the QR algorithm [2, 5, 6, 7, 23, 33]. An exception is the successful high-performance pipelined Householder QZ algorithm in [18]. Although this paper is not directly concerned with distributed memory computation, it is worth noting that there are distributed memory implementations of the QR algorithm [31, 45, 48, 50].

Readers of this paper need to be familiar with the double implicit shift QR algorithm [25, 26, 35]. See, for example, any of the textbooks [20, 27, 47, 53, 62]. Familiarity with the multishift QR algorithm [4] as implemented in LAPACK version 2 [1] is helpful. We will refer to the iterative step from bulge-introduction to bulge-disappearance as a QR sweep.

1.1. Notation and the BLAS.

1.1.1. The BLAS. The basic linear algebra subprograms (BLAS) are a set of frequently required elementary matrix and vector operations introduced first in [38, 37] and later extensively developed [21, 22]. The level 3 BLAS are a small set of matrix-matrix operations like matrix-matrix multiply [22]. The level 2 BLAS are a set of matrix-vector operations like matrix-vector multiplication. The level 1 BLAS are a set of vector-vector operations like dot-products and scaled vector addition (x ← ax + y). Because they are relatively simple, have regular patterns of memory access, and have a high ratio of arithmetic work to data, the level 3 BLAS can be organized to make near optimal use of hierarchical cache memory [20, section 2.6], execution pipelining, and parallelism. It is only a small exaggeration to say that executing level 3 BLAS is what modern computers do best. Many manufacturers supply hand-tuned, extraordinarily efficient implementations of the BLAS. Automatically tuned versions of the BLAS [61] also perform well. It is the ability to exploit matrix-matrix multiplies that makes spectral splitting methods attractive competitors to the QR algorithm [2, 5, 6, 7]. The small-bulge multishift QR algorithm which we propose attains much of its efficiency through the level 3 BLAS.

1.1.2. Notation. Throughout this paper we use the following notation and definitions.

1. We will use the "colon notation" to denote submatrices: H_{i:j,k:l} is the submatrix of matrix H in rows i–j and columns k–l inclusively. The notation H_{:,k:l} indicates the submatrix in columns k–l inclusively (and all rows). The notation H_{i:j,:} indicates the submatrix in rows i–j inclusively (and all columns).

2. A quasi-triangular matrix is a real, block triangular matrix with 1-by-1 and 2-by-2 blocks along the diagonal.

3. A matrix H ∈ R^{n×n} is in Hessenberg form if h_{ij} = 0 whenever i > j + 1. The matrix H is said to be unreduced if, in addition, the subdiagonal entries are nonzero, i.e., h_{ij} ≠ 0 whenever i = j + 1. (A short code sketch of these conventions appears at the end of this subsection.)

4. Following [27, p. 19], we define a "flop" as a single floating point operation, i.e., either a floating point addition or a floating point multiplication together with its associated subscripting. The Fortran statement

C(I, J) = C(I, J) + A(I, K) * B(K, J)

involves two flops. In some of the examples in section 3, we report an automatic hardware count of the number of floating point instructions executed. Note that a trinary multiply-add instruction counts as just one executed instruction in the hardware count even though it executes two flops. Thus, depending on the compiler and optimization level, the above Fortran statement could be executed using either one floating point instruction or two floating point instructions (perhaps along with some integer subscripting calculations) even though it involves two flops.
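The following short sketch (ours, not part of the original paper; Python with NumPy assumed) restates the conventions above in code. The paper's 1-based, inclusive colon notation H_{i:j,k:l} corresponds to the 0-based, half-open NumPy slice H[i-1:j, k-1:l], and the helper name is hypothetical.

import numpy as np

# The paper's H_{i:j,k:l} (1-based, inclusive) is the slice H[i-1:j, k-1:l].
def is_unreduced_hessenberg(H: np.ndarray) -> bool:
    """True when h_ij = 0 for i > j + 1 and every subdiagonal entry
    h_{j+1,j} is nonzero (1-based indices, as in the text)."""
    return not np.tril(H, -2).any() and bool(np.diag(H, -1).all())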

2. A small-bulge multishift QR algorithm. If there are many simultaneous shifts, then most of the work in a QR sweep may be organized into matrix-vector and matrix-matrix multiplies that take advantage of the higher level BLAS [4]. However, to date, the performance of the large-bulge multishift QR algorithm has been disappointing [24, 55, 56, 58]. Accurate shifts are essential to accelerate the convergence of the QR algorithm. In the large-bulge multishift QR algorithm rounding errors "blur" the shifts and retard convergence [56, 58]. The ill effects of shift blurring grow rapidly with the size of the diagonal bulge, effectively limiting the large-bulge multishift QR algorithm to roughly 10 simultaneous shifts—too few to take full advantage of the level 3 BLAS. See Figure 1.

Watkins explains much of the mechanics of shift blurring with the next theorem [56, 58].

Theorem 2.1 (see [58]). Consider a Hessenberg-with-a-bulge matrix that occurs during a multishift QR sweep using m pairs of simultaneous shifts. Let B̃ be the 2(m + 1)-by-2(m + 1) principal submatrix containing the bulge. Obtain B from B̃ by dropping the first row and the last column. Let N be a (2m + 1)-by-(2m + 1) Jordan block with eigenvalue zero. The shifts are the 2m finite eigenvalues of the bulge pencil

(2.1)   B − λN.

The theorem shows that the shift information is transferred through the matrix by the bulge pencils (2.1). Watkins observes empirically that the eigenvalues of B − λN tend to be ill-conditioned and that the ill-conditioning grows rapidly with the size of the bulge. The observation suggests that, in order to avoid shift blurring, the bulges must be confined to very small order. This is borne out by many successful small-bulge versions of the QR algorithm including the one described in this paper. See, for example, [17, 24, 31, 29, 34, 36, 54, 55].
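Theorem 2.1 can be checked numerically in the smallest case m = 1 (one pair of shifts, a 4-by-4 bulge). The following sketch (ours, assuming Python with NumPy and SciPy; it is not from the original paper) introduces a double implicit shift bulge into a random Hessenberg matrix and recovers the shifts as the finite eigenvalues of the bulge pencil.

import numpy as np
from scipy.linalg import eig

rng = np.random.default_rng(0)
n = 8
H = np.triu(rng.standard_normal((n, n)), -1)       # unreduced Hessenberg matrix
s1, s2 = 1.5, -0.5                                 # one pair of (real) shifts

# First column of (H - s1 I)(H - s2 I); only its first three entries are nonzero.
x = (H - s1 * np.eye(n)) @ (H - s2 * np.eye(n))[:, 0]

# A "3-by-3" Householder reflection on rows/columns 1-3 introduces the bulge.
v = x[:3].copy()
v[0] += np.sign(v[0]) * np.linalg.norm(v)
P = np.eye(n)
P[:3, :3] -= 2.0 * np.outer(v, v) / (v @ v)
H = P @ H @ P                                      # Hessenberg-with-a-bulge matrix

# Bulge pencil: B drops the first row and last column of the leading
# 2(m+1)-by-2(m+1) = 4-by-4 submatrix; N is a 3-by-3 nilpotent Jordan block.
B = H[1:4, 0:3]
N = np.diag(np.ones(2), 1)
w = eig(B, N, right=False)                         # generalized eigenvalues
print(np.sort(w[np.isfinite(w)].real))             # approximately [-0.5, 1.5], the shifts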

In order to apply complex conjugate shifts in real arithmetic, one must allow bulges big enough to transmit at least two shifts. The smallest such bulges occupy full 4-by-4 principal submatrices. We sometimes refer to 4-by-4 bulges as "double implicit shift bulges" because they are used in the double implicit shift QR algorithm [25, 26], [27, Algorithm 7.5.1]. This paper uses double implicit shift bulges exclusively.

The QR algorithm needs to use many simultaneous shifts in order to generate substantial level 3 BLAS operations, but, in order to avoid shift blurring, it is restricted to small bulges. Small bulges transmit only a few simultaneous shifts. One way to resolve these superficially contradictory demands is to use a chain of m tightly packed two-shift bulges as illustrated in Figure 2. This configuration allows many simultaneous shifts while keeping each pair "well focused" inside a small bulge.

The process of chasing the chain of bulges along the diagonal is illustrated in Figure 3. Near the diagonal, "3-by-3" Householder reflections are needed to chase the bulges one at a time along the diagonal. The reflections are applied individually in the light shaded region near the diagonal in a sequence of level 1 BLAS operations.

Fig. 1. Effect of varying numbers of simultaneous shifts on the large-bulge multishift QR algorithm as implemented in DHSEQR from LAPACK version 2 and on our experimental implementation of the small-bulge multishift QR, TTQR. The two programs calculated the Schur decompositions (including both the orthogonal and quasi-triangular factors) of 1,000-by-1,000, 2,000-by-2,000, and 3,000-by-3,000 pseudorandom Hessenberg matrices. (The pseudorandom test matrices, computational environment, and other details are described in section 3.) The first row plots number of simultaneous shifts versus cpu-minutes. The second row plots number of simultaneous shifts versus a hardware count of floating point instructions executed. (A trinary multiply-add instruction counts as one instruction but executes two flops.) The first, second, and third columns report data from matrices of order n = 1,000, n = 2,000, and n = 3,000, respectively.

Periodically, a group of reflections is accumulated into a thickly banded orthogonal matrix. The orthogonal matrix is then used to update the dark shaded region in Figure 3 using matrix-matrix multiplication or other level 3 BLAS operations. We refer to this as a "two-tone" QR sweep. The level 1 BLAS operations form one tone; the level 3 operations form the other.

Using the implicit Q theorem [27, Theorem 7.4.3], it is easy to show that a two-tone small-bulge multishift QR sweep with m pairs of simultaneous shifts, (s_j, t_j) ∈ C × C, j = 1, 2, 3, ..., m, is equivalent to m Francis double implicit shift QR sweeps, i.e., a two-tone QR sweep overwrites H by Q^T HQ, where ∏_{j=1}^{m} (H − s_j I)(H − t_j I) = QR is a QR (orthogonal-triangular) factorization. An examination of the detailed description below shows that, in exact arithmetic, the two-tone QR sweep generates the same bulges and the same sequence of similarity transformations by "3-by-3" Householder reflections as m double implicit shift QR sweeps using the same shifts.

Modified QR algorithms that chase more than one bulge at a time have been proposed several times [17, 24, 31, 29, 34, 36, 54, 55]. In particular, Henry, Watkins, and Dongarra [31] recently developed a distributed memory parallel QR algorithm that chases many small bulges independently. Lang [36] recently described symmetric QL and QR algorithms that use rotations to simultaneously chase many small bulges. His algorithms also achieve some parallelism and good cache reuse by accumulating the rotations in groups. The algorithm proposed in the next section uses Householder reflections and applies to nonsymmetric matrices.

Fig. 2. A chain of m = 3 tightly packed double implicit shift bulges in rows r = 5 through s = r + 3m = 14.
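The defining relation of the sweep, ∏_{j=1}^{m} (H − s_j I)(H − t_j I) = QR with H overwritten by Q^T HQ, can also be checked directly for small matrices. A sketch of ours (assuming NumPy; this is not how the sweep is actually implemented, since forming the product explicitly is exactly what bulge chasing avoids):

import numpy as np

rng = np.random.default_rng(1)
n = 12
H = np.triu(rng.standard_normal((n, n)), -1)
shifts = [(0.3 + 1.0j, 0.3 - 1.0j), (2.0, -1.0)]   # m = 2 pairs, conjugate-closed

M = np.eye(n, dtype=complex)
for s, t in shifts:
    M = (H - s * np.eye(n)) @ (H - t * np.eye(n)) @ M
Q, R = np.linalg.qr(M.real)      # M is real because the pairs are conjugate-closed
H_next = Q.T @ H @ Q
print(np.max(np.abs(np.tril(H_next, -2))))   # tiny: Q^T H Q is again Hessenberg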

2.1. Detailed description. This section describes the two-tone small-bulge multishift QR sweep. Let H ∈ R^{n×n} be an unreduced Hessenberg matrix and let (s_j, t_j) ∈ C × C, j = 1, 2, 3, ..., m, be a set of m pairs of shifts. To avoid complex arithmetic, we require each pair to be closed under complex conjugation, i.e., either s̄_j = t_j or (s_j, t_j) ∈ R × R.

There are three phases in a two-tone QR algorithm. In the first phase, a tightly packed chain of 4-by-4, double implicit shift bulges is introduced into the upper left-hand corner of a Hessenberg matrix. In the second phase, the chain of bulges is chased down the diagonal from the upper left-hand corner to the lower right-hand corner by a sequence of reflections. In the final phase, the matrix is returned to Hessenberg form by chasing the chain off the bottom row.

2.1.1. Phase 2: The central computation. The bulk of the work is in the second phase. Suppose that H ∈ R^{n×n} is an upper Hessenberg matrix with a packet of m double implicit shift bulges in rows r through s = r + 3m. See Figure 3. Using a process called "bulge chasing" that has been known at least since the double implicit shift algorithm [35, 25, 26, 27] was introduced in 1961, one may use a sequence of similarity transformations by "3-by-3" Householder reflections to chase the chain of bulges down k rows, one bulge at a time, starting with the lowest bulge.

Fig. 3. Chasing a packet of m 4-by-4 double implicit shift bulges from rows r through s = r + 3m down to rows r + k through s + k.

If Ũ ∈ R^{n×n} is the product of this sequence of Householder reflections, then Ũ has the form

(2.2)   Ũ = \begin{pmatrix} I_r & 0 & 0 \\ 0 & U & 0 \\ 0 & 0 & I_{n−(s+k)+1} \end{pmatrix},

where U = Ũ_{r+1:s+k−1, r+1:s+k−1} is itself an orthogonal matrix of order 3m + k − 1. For notational convenience, define Û ∈ R^{(3m+k+1)×(3m+k+1)} by

Û = \begin{pmatrix} 1 & 0 & 0 \\ 0 & U & 0 \\ 0 & 0 & 1 \end{pmatrix},

i.e., the r = 1, n = s + k case of (2.2). In terms of U and Û, the multibulge chase mentioned above consists of two parts. The first part is a similarity transformation on the lightly shaded region in Figure 3, i.e.,

(2.3)   H_{r:s+k, r:s+k} ← Û^T H_{r:s+k, r:s+k} Û.

The second part is the matrix multiplication of the dark, horizontal slab by Û^T from the left, i.e.,

(2.4)   H_{r:s+k, s+k+1:n} ← Û^T H_{r:s+k, s+k+1:n},

and the multiplication of the dark, vertical slab by Û from the right, i.e.,

(2.5)   H_{1:r−1, r:s+k} ← H_{1:r−1, r:s+k} Û.

The similarity transformation (2.3) has the effect of moving the packet of bulges from rows r through s down to rows r + k through s + k. Because it uses a sequence of "3-by-3" reflections, this small similarity transformation does not take advantage of the level 3 BLAS, but if m ≪ n and k ≪ n, then the work of updating the lightly shaded region in Figure 3 is small compared to the work of updating the two dark slabs. Updating the dark horizontal and vertical slabs in Figure 3 is a matrix multiplication by the orthogonal matrix U in (2.2). In order to make this into a level 3 BLAS matrix-matrix multiplication, it is necessary to form explicitly the orthogonal submatrix U from the "3-by-3" reflections used to chase the bulges. Accumulating U does not take advantage of the level 3 BLAS either, but this is also a small amount of arithmetic work compared to updating the dark slabs. Using a sequence of reflections to update the lightly shaded region in Figure 3 but using matrix-matrix multiplication to update the dark bands results in a high ratio of level 3 BLAS work to level 1 BLAS work [13]. (These updates are sketched in code below, following section 2.1.2.)

2.1.2. Phases 1 and 3: Getting started and finishing up. The two-tone structure of the algorithm is slightly simpler during the initial bulge-introduction phase and during the final chasing-bulges-off-the-bottom phase than during most of the computation. See Figure 4.

Fig. 4. Left: introducing m bulges into row 1 through row s = 1 + 3m. Right: chasing m bulges off the bottom from row r = n − (1 + 3m) through row n.

A modified version of the Francis double implicit shift algorithm may be used to introduce the m bulges into the upper left-hand corner of the Hessenberg matrix. One at a time, bulges are introduced and chased down to their proper position in the chain. This is equivalent to doing 3m(m − 1)/2 similarity transformations by "3-by-3" reflections on H_{1:3m+1, 1:3m+1}, the lightly shaded region on the left of Figure 4. The matrix U has zero structure similar to the (1,1) block of Figure 5. Chasing the packet of m bulges off the bottom at the end of the QR sweep is similar. It is equivalent to doing a similarity transformation by a sequence of Householder reflections followed by an orthogonal matrix multiplication on the dark vertical slab on the right of Figure 4. This operation also requires roughly the same number of flops as does the operation of introducing a packet of m bulges.
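In code, the three phase 2 updates (2.3)–(2.5) take the following shape. This is a sketch of ours in Python/NumPy with 0-based, inclusive-window indexing; the function name is hypothetical, and in the real algorithm the small similarity (2.3) is applied reflection-by-reflection while Û is accumulated, not as one dense product.

import numpy as np

def two_tone_update(H: np.ndarray, Uhat: np.ndarray, r: int, s: int, k: int) -> None:
    """Chase the bulges in rows r..s down k rows; Uhat is the accumulated
    orthogonal factor of order s + k - r + 1 (the hat matrix of (2.2))."""
    e = s + k + 1                                # one past the active window
    H[r:e, r:e] = Uhat.T @ H[r:e, r:e] @ Uhat    # (2.3): level 1 BLAS in practice
    H[r:e, e:] = Uhat.T @ H[r:e, e:]             # (2.4): horizontal slab, GEMM
    H[:r, r:e] = H[:r, r:e] @ Uhat               # (2.5): vertical slab, GEMM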

Fig. 5. The zero structure of the orthogonal submatrix U in (2.2) in the special case m = 8, k = 22.

2.1.3. Level 3 BLAS and exploiting the zero structure of U. The orthogonal matrix U has more structure than a generic, dense orthogonal matrix. The amount of arithmetic work needed to update the dark slabs can, theoretically, be substantially reduced by taking advantage of the zero structure of U. The example in Figure 5 is 38% zeros. However, this adds some complexity to the computation which offsets the reduced flop count. Depending on the computing environment, taking advantage of some or all of the zeros may add more work than it saves. Nevertheless, it is likely to be worth taking advantage of at least some of the zeros, either by treating U as a thick band matrix or as a 2-by-2 block matrix.

As illustrated in Figure 5, U has a thick band structure with lower bandwidth k_l = 2 min(m, k) and upper bandwidth k_u = k + 1. The band is fairly dense, so, theoretically, the thick, banded matrix-matrix multiply is a level 3 operation. However, except for triangular matrix multiplication, banded matrix-matrix multiplication software is usually implemented as a level 2 (matrix-vector) operation. In order to take advantage of both the band structure and the current set of level 3 BLAS, matrix multiplication by U may be broken up into two dense-by-dense matrix multiplications and two triangular-by-dense matrix multiplications. If 3m − 2 ≥ k ≥ m + 2, then one way to do this is to partition U into a 2-by-2 block matrix with two nearly dense, rectangular diagonal blocks and two triangular off-diagonal blocks. See Figure 5. In addition, at least one of the triangular blocks has a thick band of zeros along the diagonal, so the work of multiplying by this triangular matrix can be further reduced.

For a fixed choice of the number of simultaneous shifts 2m, k ≈ 3m approximately minimizes the number of flops in an n-by-n two-tone multibulge QR sweep regardless of the logical zero structure of U. See Table 1. With k ≈ 3m, the algorithm uses between 1.6 and 2.4 times as many flops as m double implicit shift sweeps, depending on how much of the zero structure of U is exploited.
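As a concrete illustration of the 2-by-2 block scheme, consider the following sketch of ours (NumPy assumed). For simplicity it splits U at a single index q, which makes the diagonal blocks square rather than slightly rectangular as in Figure 5, and the function name is hypothetical; it also ignores the additional zero band inside the triangular blocks.

import numpy as np

def slab_times_U_blocked(X: np.ndarray, U: np.ndarray, q: int) -> np.ndarray:
    """Compute X @ U using the 2-by-2 block partition of U at index q.
    U[:q, :q] and U[q:, q:] are nearly dense (DGEMM); U[:q, q:] and
    U[q:, :q] are triangular, so a BLAS code would route those two
    products through DTRMM instead."""
    X1, X2 = X[:, :q], X[:, q:]
    W1 = X1 @ U[:q, :q] + X2 @ U[q:, :q]   # second product: triangular (TRMM)
    W2 = X1 @ U[:q, q:] + X2 @ U[q:, q:]   # first product: triangular (TRMM)
    return np.hstack([W1, W2])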

Table 1. The ratio of the number of flops performed by a two-tone small-bulge multishift QR sweep with m pairs of shifts chased k rows at a time to 10mn², the number of flops needed for m double implicit shift QR sweeps. It is assumed that m is small relative to n, the order of the Hessenberg matrix.

  Logical U structure   Ratio for k = 3m   Approximately optimal k   Ratio for optimal k
  Banded                1.57               k = 2.8m                  1.57
  2-by-2 block          1.63               k ≈ 2.54m                 1.62
  Dense                 2.40               k = 3m                    2.40

Table 2. Flops used by an n-by-n two-tone multibulge QR sweep chasing m bulges, k rows at a time, assuming k ≪ n and m ≪ n. (These estimates do not include the work for accumulating the orthogonal matrix of Schur vectors Q. Accumulating Q approximately doubles the flop count.)

  Logical U structure   Conditions                                   Flops + O(m²n)
  2-by-2 block          k = 3m − 2                                   ((49m² − 39m + 12)/(3m − 2)) n²
  2-by-2 block          k = pm, m > k or (3 + p)m even               (((2p² + 6p + 13)m² − pm + 2)/(pm)) n²
  2-by-2 block          k = pm, 3m − 2 ≥ k ≥ m + 1, (3 + p)m odd     (((2p² + 6p + 13)m² + (p − 4)m + 3)/(pm)) n²
  Dense                 k = 3m                                       (2(6m − 1)²/(3m)) n²
  Dense                 k = pm                                       (2((p + 3)m − 1)²/(pm)) n²
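To connect the two tables: keeping only the leading term of the 2-by-2 block entry with k = pm in Table 2 gives roughly (2p² + 6p + 13)m²n²/(pm) flops per sweep, so the ratio to 10mn² is (2p² + 6p + 13)/(10p). At p = 3 this is 49/30 ≈ 1.63, and setting the derivative to zero gives p = √(13/2) ≈ 2.55, matching the approximately optimal k ≈ 2.54m and ratio 1.62 reported in Table 1. Similarly, the dense entry gives ratio (p + 3)²/(5p), which is minimized at p = 3 with value 2.40.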

The version of the large-bulge multishift QR algorithm which most closely resembles the two-tone small-bulge multishift QR algorithm aggregates groups of p transformations in order to take advantage of the level 3 BLAS. A 2m-shift sweep of the aggregated large-bulge multishift QR algorithm needs roughly 1.5 + p/(8m) times as many flops as m double implicit shift sweeps [4]. (The version of the large-bulge multishift QR algorithm implemented in DHSEQR from LAPACK version 2.0 [1] does not aggregate groups of transformations and takes advantage of only the level 1 and 2 BLAS. It needs roughly the same number of flops to chase a single 2m-shift bulge as m double implicit shift sweeps [4, section 6].) Detailed algorithms and mathematical flop counts are given in [13] and are summarized here in Tables 1 and 2. In section 3, numerical examples demonstrate that the avoidance of shift blurring by using small bulges and the ability to use level 3 BLAS by using many simultaneous shifts per QR sweep more than overcome the greater flop counts per sweep displayed in Tables 1 and 2.

2.2. Deflation and choice of shifts. It is inexpensive to monitor the subdiagonal entries during the course of a two-tone QR sweep and take advantage of a small-subdiagonal deflation should one occur. Note that this includes taking advantage of any small subdiagonal entries that may appear between the bulges during a two-tone QR sweep. In [55], this is called "vigilant deflation."

Ordinarily, the bulges above a vigilant deflation collapse as they encounter the newly created subdiagonal zero. This may block those shifts so that, for the current QR sweep, the benefit of using many simultaneous shifts may be lost. However, the bulges can be reintroduced in the row in which the new zero subdiagonal appears using the same methods as are used to introduce bulges at the upper left-hand corner [27, p. 377], [62, p. 530]. In this way, the shift information passes through a zero subdiagonal and the two-tone QR sweep continues with all its shifts.

It is well known that a good choice of shifts in the QR algorithm leads to rapid convergence. Shifts selected to be the eigenvalues of a trailing principal submatrix give local quadratic convergence [62, 60]. In the symmetric case, convergence is cubic [64]. Deferred shifts retard convergence [49]. Following the choice of LAPACK version 2 subroutine DHSEQR, our experimental computer programs select the shifts to be the eigenvalues of a trailing principal submatrix.
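A sketch of both ingredients follows (ours, assuming NumPy; both function names are hypothetical, and the small-subdiagonal test is the EISPACK/LAPACK criterion restated in section 3 below).

import numpy as np

def deflation_points(H: np.ndarray, eps: float = np.finfo(float).eps):
    """0-based indices j at which h_{j+1,j} may be set to zero
    (the small-subdiagonal criterion quoted in section 3)."""
    sub = np.abs(np.diag(H, -1))
    d = np.abs(np.diag(H))
    return np.nonzero(sub <= eps * (d[:-1] + d[1:]))[0]

def trailing_shifts(H: np.ndarray, m: int):
    """m shift pairs, each closed under complex conjugation, taken as the
    eigenvalues of the trailing 2m-by-2m principal submatrix."""
    w = np.linalg.eigvals(H[-2 * m:, -2 * m:])
    complex_pairs = [(z, z.conjugate()) for z in w[w.imag > 0]]
    reals = np.sort(w[w.imag == 0].real)
    real_pairs = [(reals[2 * j], reals[2 * j + 1]) for j in range(len(reals) // 2)]
    return complex_pairs + real_pairs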

3. Numerical examples. We compared the execution time of the large- and small-bulge QR algorithms on ad hoc and pseudorandom Hessenberg matrices of order n = 1,000 to n = 10,000 and on nonrandom matrices of similar order taken from a variety of applications in science and engineering [3]. (The two algorithms delivered roughly the same accuracy as measured by the relative residual ‖AQ̃ − Q̃T̃‖_F/‖A‖_F and the departure from orthogonality ‖Q̃^T Q̃ − I‖_F/√n, where Q̃T̃Q̃^T is the computed real Schur decomposition of A with computed orthogonal factor Q̃ and computed quasi-triangular factor T̃.)

We call our experimental Fortran implementation of the two-tone small-bulge multishift QR algorithm TTQR. When using m pairs of simultaneous shifts, each TTQR sweep chases each tightly packed chain of m small bulges k = 3m − 2 rows at a time. The implementation exploits the logical 2-by-2 block structure of U, including the thick band of zeros along the diagonal of the (1,2) block in Figure 5. Except where noted otherwise, TTQR uses 150 simultaneous shifts, but no particular significance should be attached to this choice. TTQR uses LAPACK subroutine DHSEQR [1], the conventional large-bulge QR algorithm, to reduce diagonal subblocks of order no greater than 1.5 times the number of simultaneous shifts. Following EISPACK [46] and LAPACK [1], a Hessenberg subdiagonal entry h_{i+1,i} is set to zero when |h_{i+1,i}| ≤ ε(|h_{ii}| + |h_{i+1,i+1}|) with ε equal to the unit round. (A new deflation procedure which greatly reduces the arithmetic work and execution time that TTQR needs is reported in [12, 13].)

For comparison, we used the implementation of the large-bulge multishift QR algorithm in subroutine DHSEQR from LAPACK version 2 [1], which is widely recognized to be an excellent implementation. DHSEQR uses the double implicit shift QR algorithm to reduce subblocks of size MAXB-by-MAXB or smaller. In the examples reported here, MAXB is chosen to be the greater of 50 and the number of simultaneous shifts. Except where noted otherwise, DHSEQR uses only six simultaneous shifts, which Figure 1 shows is approximately optimal.

The examples reported here were run on an Origin2000 computer equipped with 400MHz IP27 R12000 processors and 16 gigabytes of memory. Each processor has 32 kilobytes of level 1 instruction cache, 32 kilobytes of level 1 data cache, and 8 megabytes of level 2 combined instruction and data cache. For serial execution, the experimental Fortran implementation of the small-bulge multishift QR algorithm was compiled with version 7.30 of the MIPSpro Fortran 77 compiler with options -64 -TARG:platform=ip27 -Ofast=ip27 -LNO. For comparison purposes the same options were used to compile DHSEQR from LAPACK version 2. (We observe that this compilation of DHSEQR is usually slightly faster than the compilation distributed in the SGI/Cray Scientific Library version 1.2.0.0.) For parallel execution the -mp and -pfa options were added. The programs called optimized BLAS subroutines from the SGI/Cray Scientific Library version 1.2.0.0. In our computational environment, we observed that the measured serial "cpu time" of any particular program with its particular data might vary by at most a few percent. We were fortunate to get exclusive use of several processors for the purpose of timing parallel benchmark runs. Except where otherwise mentioned, n-by-n matrices were stored in n-by-n arrays.

We ran the experimental program on both pseudorandomly generated and nonrandom non-Hermitian matrices of order n = 1,000 to n = 10,000. We selected the entries on the diagonal and upper triangle of pseudorandomly generated Hessenberg matrices to be normally distributed pseudorandom numbers with mean zero and variance one. We set the subdiagonal entry h_{j+1,j} = χ_{n−j}, where χ²_{n−j} is selected from a chi-squared distribution with n − j degrees of freedom. Hessenberg matrices with the same distribution of random entries can also be generated by applying the reduction to Hessenberg form algorithm [27, Algorithm 7.4.2] to full n-by-n matrices with pseudorandom entries selected from the normal distribution with mean zero and variance one. Our source of nonrandom matrices was the Non-Hermitian Eigenvalue Problems (NEP) collection [3]. This is a collection of eigenvalue problems from a variety of applications in science and engineering. We selected 21 real eigenvalue problems of order 1,000-by-1,000 to 8,192-by-8,192.

In some of the examples reported below, we report an automatic hardware count of the number of floating point instructions executed. Note that a trinary multiply-add instruction counts as just one executed instruction in the hardware count even though it executes two flops. We also report the floating point execution rate in millions of floating point instructions per second, or "mega-flops" for short. For comparison purposes, we measured the floating point execution rate of the level 3 BLAS full matrix-matrix multiply subroutine DGEMM and the triangular matrix-matrix multiply subroutine DTRMM from the SGI/Cray Scientific Library version 1.2.0.0 applied to matrix-matrix products similar to updating the dark bands in Figure 3. In serial execution, in the Origin2000 computational environment described above, DGEMM computes the product of the transpose of a 200-by-200 matrix times a 200-by-10,000 slab embedded in a 10,000-by-10,000 array at roughly 330 mega-flops. It computes the product of a 10,000-by-200 slab times a 200-by-200 matrix at roughly 325 mega-flops. DTRMM computes the product of the transpose of a triangular 200-by-200 matrix times a 200-by-10,000 slab at roughly 305 mega-flops if it is an upper triangular matrix and at roughly 260 mega-flops if it is a lower triangular matrix. DTRMM computes the product of a 10,000-by-200 slab times a 200-by-200 triangular matrix at roughly 275 mega-flops if it is an upper triangular matrix and at roughly 255 mega-flops if it is a lower triangular matrix.

Example 1. Figure 1 displays the serial execution time of DHSEQR and TTQR using various numbers of simultaneous shifts to calculate the Schur decomposition of the pseudorandom Hessenberg matrices described above. Both the quasi-triangular and orthogonal Schur factors were computed. The figure demonstrates that the convergence rate of TTQR does not suffer from shift blurring even with hundreds of simultaneous shifts. The figure also displays the total number of floating point instructions executed by DHSEQR and TTQR to calculate the Schur decompositions of pseudorandom Hessenberg matrices.
It also demonstrates that the number of floating point operations needed by TTQR changes little as the number of simultaneous shifts varies in a range that is small compared to n, the order of the Hessenberg matrix. The number of simultaneous shifts used by TTQR may be chosen to fit the cache size and other machine dependent parameters with little effect on the total amount of arithmetic work. However, in this example, TTQR executes between 1.5 and 2 times as many floating point instructions as DHSEQR with 6 simultaneous shifts. A partial explanation for this is that TTQR executes more floating point instructions per shift than DHSEQR. See Table 1. However, in other examples TTQR executes fewer floating point instructions than DHSEQR. See Figures 7 and 10.

Fig. 6. Floating point execution rate of DHSEQR, the large-bulge multishift QR algorithm, and TTQR, the small-bulge multishift QR algorithm, using varying numbers of simultaneous shifts to calculate the Schur decompositions of the pseudorandom Hessenberg matrices described in section 3.

Figure 6 displays the floating point execution rates achieved on the Origin2000 computer described above. It demonstrates that TTQR attains substantially higher rates of execution of floating point instructions when using many simultaneous shifts.

On the Origin2000 computer described above, we observe that, on pseudorandom Hessenberg matrices of order roughly n > 500, TTQR with forty simultaneous shifts calculates the Schur decomposition of pseudorandom Hessenberg matrices faster than DHSEQR with six simultaneous shifts. (For matrices of order n = 100, TTQR with 40 simultaneous shifts uses 170% of the execution time of DHSEQR with six simultaneous shifts.)

Example 2. Figure 7 shows the execution time, floating point execution rate, and number of floating point instructions executed to calculate the Schur decompositions of ad hoc Hessenberg matrices of the form

(3.1)   S_6 = \begin{pmatrix} 6 & 5 & 4 & 3 & 2 & 1 \\ 0.001 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0.001 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0.001 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0.001 & 4 & 0 \\ 0 & 0 & 0 & 0 & 0.001 & 5 \end{pmatrix}.


Fig. 7. Execution time, floating point execution rate, and hardware count of floating point instructions executed to calculate the Schur decomposition (including both orthogonal and quasi-triangular factors) of Hessenberg matrices of the form of (3.1).

The eigenvalues of trailing principal submatrices of S_n are extraordinarily close approximations to eigenvalues of the whole matrix S_n [13, 12], so, at least initially, the shifts used by both DHSEQR and TTQR are exceptionally good. Both DHSEQR and TTQR calculate the Schur decompositions of S_n faster than the Schur decomposition of a pseudorandom Hessenberg matrix of the same size. See Figure 8. However, TTQR finishes substantially sooner, maintains a higher floating point execution rate, and executes fewer floating point instructions.

Example 3. Figure 8 compares the serial execution time and floating point instruction execution rate of TTQR and DHSEQR applied to pseudorandom Hessenberg matrices of order n = 1,000 to n = 10,000. In this example, TTQR consistently executes more floating point instructions than DHSEQR but overcomes this handicap by maintaining a high floating point execution rate. Note that in other examples TTQR executes fewer floating point instructions than DHSEQR. See Figures 7 and 10. The advantage of TTQR over DHSEQR remains qualitatively similar even if only eigenvalues are computed or if only one of the orthogonal or quasi-triangular Schur factors is computed.

Example 4. Figure 9 displays the serial execution times of DHSEQR and TTQR applied to 20 real non-Hermitian eigenvalue problems from the NEP collection [3].


Fig. 8. Serial execution times, floating point execution rates, and hardware counts of floating point instructions of DHSEQR, the large-bulge multishift QR algorithm, and TTQR, the small-bulge multishift QR algorithm. The two subroutines calculated Schur decompositions (including both the orthogonal and quasi-triangular factors) of pseudorandom Hessenberg matrices as described in section 3.

To avoid cache conflicts, each n-by-n matrix was stored in an (n + 7)-by-(n + 7) array. Each matrix is identified by an acronym from [3]. The alphabetic part indicates the application from which the matrix comes. The numerical part gives the order of the matrix. Summaries of the applications, descriptions of the matrices, and references can be found in [3]. DHSEQR uses six simultaneous shifts, which Figure 1 shows is approximately optimal. TTQR uses 60 simultaneous shifts on 1,000-by-1,000 to 1,999-by-1,999 matrices, 116 simultaneous shifts on 2,000-by-2,000 to 2,499-by-2,499 matrices, 150 simultaneous shifts on 2,500-by-2,500 to 3,999-by-3,999 matrices, and 180 simultaneous shifts on 4,000-by-4,000 or larger matrices. Notice that the superiority of TTQR is usually greater for matrices of larger order. We also mention that DW8192 is not reported in Figure 9 only because the execution times are out of scale with the other reported times. DHSEQR calculated the Schur decomposition of the Hessenberg matrix derived from DW8192 in 544 cpu minutes. TTQR calculated the Schur decomposition in 301 cpu minutes.

Figure 10 displays hardware counts of the number of floating point instructions executed by DHSEQR and TTQR. Note that neither TTQR nor DHSEQR consistently executes more floating point instructions.


Fig. 9. Serial execution times of DHSEQR, the large-bulge multishift QR algorithm, and TTQR, the small-bulge multishift QR algorithm, applied to 20 non-Hermitian eigenvalue problems from [3]. The acronyms indicate the corresponding matrix from [3]. DHSEQR uses six simultaneous shifts, which Figure 1 shows is approximately optimal. TTQR uses 60 simultaneous shifts on 1,000-by-1,000 to 1,999-by-1,999 matrices, 116 simultaneous shifts on 2,000-by-2,000 to 2,499-by-2,499 matrices, 150 simultaneous shifts on 2,500-by-2,500 to 3,999-by-3,999 matrices, and 180 simultaneous shifts on 4,000-by-4,000 or larger matrices. (The execution time of the reduction to Hessenberg form is not included in the above graphs.)

Fig. 10. Hardware count of floating point instructions executed by DHSEQR, the large-bulge multishift QR algorithm, and TTQR, the small-bulge multishift QR algorithm, when applied to 20 non-Hermitian eigenvalue problems from [3]. The acronyms indicate the corresponding matrix from [3]. DHSEQR uses six simultaneous shifts, which Figure 1 shows is approximately optimal. TTQR uses 60 simultaneous shifts on 1,000-by-1,000 to 1,999-by-1,999 matrices, 116 simultaneous shifts on 2,000-by-2,000 to 2,499-by-2,499 matrices, 150 simultaneous shifts on 2,500-by-2,500 to 3,999-by-3,999 matrices, and 180 simultaneous shifts on 4,000-by-4,000 or larger matrices. (Floating point instructions executed during the reduction to Hessenberg form are not included.)

Example 5. Our experimental implementation of the small-bulge QR algorithm is not designed for parallel computation. However, TTQR makes heavy use of the level 3 BLAS—particularly matrix-matrix multiply. Hence, it is not surprising to observe modest but not insignificant speedups when the experimental version of TTQR is compiled for parallel execution and linked with parallel versions of the BLAS. Figure 11 shows wall clock execution time, parallel speedup, and parallel efficiency of TTQR calculating the Schur decomposition (including both the quasi-triangular and orthogonal factors) of pseudorandom Hessenberg matrices on the Origin2000 computer described above. To avoid cache conflicts, some n-by-n matrices were stored in (n + 1)-by-(n + 1) arrays. (Parallel speedup is the ratio T_1/T_p, where T_1 is the 1-processor wall clock execution time and T_p is the p-processor wall clock execution time. Parallel efficiency is T_1/(pT_p).)

Fig. 11. Parallel execution time, speedup, and efficiency of TTQR, the small-bulge multishift QR algorithm, on pseudorandom Hessenberg matrices. The pseudorandom test matrices, computational environment, and program parameters are described in section 3. TTQR used 150 simultaneous shifts. Parallel speedup is the ratio T_1/T_p, where T_1 is the 1-processor wall clock execution time and T_p is the p-processor wall clock execution time. Parallel efficiency is T_1/(pT_p).

A major serial bottleneck can be attributed to computations in the lightly shaded region of Figure 3. There is potential for some small grain parallel execution even in this region, but there is not enough parallel work to overcome the overhead associated with synchronizing the processors. A planned production version of TTQR is expected to exhibit better parallel speedup by overlapping computation in the lightly shaded region with computations in the dark bands.

4. Conclusions. The small-bulge two-tone multishift QR algorithm avoids shift blurring so completely that it admits a nearly unlimited number of simultaneous shifts—even hundreds—without adverse effects on the convergence rate. With enough simultaneous shifts, level 3 BLAS operations dominate, allowing both high serial floating point execution rates and at least modest parallel speedup.

The possibility of using hundreds of simultaneous shifts makes the choice of how many to use more complicated than for the conventional large-bulge multishift QR algorithm. We observe empirically that the amount of arithmetic work is insensitive to this choice as long as the number of shifts is small compared to the order of the matrix. Hence, the number of shifts may be chosen to fit the cache size and other characteristics of a particular computational environment without compromising the amount of arithmetic work.


Acknowledgment. The authors would like to thank David Watkins for helpful discussions and for bringing [55] to their attention.

REFERENCES

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992.
[2] L. Auslander and A. Tsao, On parallelizable eigensolvers, Adv. in Appl. Math., 13 (1992), pp. 253–261.
[3] Z. Bai, D. Day, J. Demmel, and J. Dongarra, A test matrix collection for non-Hermitian eigenvalue problems, Department of Mathematics, University of Kentucky, Lexington, KY. Also available online from http://math.nist.gov/MatrixMarket.
[4] Z. Bai and J. Demmel, On a block implementation of Hessenberg QR iteration, Intl. J. of High Speed Comput., 1 (1989), pp. 97–112. Also available online as LAPACK Working Note 8 from http://www.netlib.org/lapack/lawns/lawn08.ps and http://www.netlib.org/lapack/lawnspdf/lawn08.pdf.
[5] Z. Bai and J. Demmel, Design of a parallel nonsymmetric eigenroutine toolbox, Part I, in Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, Vol. 1, SIAM, Philadelphia, 1993, pp. 391–398.
[6] Z. Bai and J. Demmel, Design of a Parallel Nonsymmetric Eigenroutine Toolbox, Part II, Tech. Report 95-11, Department of Mathematics, University of California, Berkeley, 1995.
[7] Z. Bai, J. Demmel, and M. Gu, An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems, Numer. Math., 76 (1997), pp. 279–308.
[8] S. Batterson, Convergence of the shifted QR algorithm on 3 × 3 normal matrices, Numer. Math., 58 (1990), pp. 341–352.
[9] S. Batterson, Convergence of the Francis shifted QR algorithm on normal matrices, Linear Algebra Appl., 207 (1994), pp. 181–195.
[10] S. Batterson and D. Day, Linear convergence in the shifted QR algorithm, Math. Comp., 59 (1992), pp. 141–151.
[11] A. Bojanczyk, G. H. Golub, and P. Van Dooren, The periodic Schur decomposition; algorithms and applications, Proc. SPIE 1770, Bellingham, WA, 1992, pp. 31–42.
[12] K. Braman, R. Byers, and R. Mathias, The multishift QR algorithm. Part II: Aggressive early deflation, SIAM J. Matrix Anal. Appl., 23 (2002), pp. 948–973.
[13] K. Braman, R. Byers, and R. Mathias, The Multi-Shift QR-Algorithm: Aggressive Deflation, Maintaining Well Focused Shifts, and Level 3 Performance, Tech. Report 99-05-01, Department of Mathematics, University of Kansas, Lawrence, KS, 1999. Also available online from http://www.math.ukans.edu/~reports/1999.html.
[14] A. Bunse-Gerstner, R. Byers, and V. Mehrmann, A quaternion QR algorithm, Numer. Math., 55 (1989), pp. 83–95.
[15] A. Bunse-Gerstner, R. Byers, and V. Mehrmann, A chart of numerical methods for structured eigenvalue problems, SIAM J. Matrix Anal. Appl., 13 (1992), pp. 419–453.
[16] R. R. Burnside and P. B. Guest, A simple proof of the transposed QR algorithm, SIAM Rev., 38 (1996), pp. 306–308.
[17] R. Byers, A Hamiltonian QR algorithm, SIAM J. Sci. Statist. Comput., 7 (1986), pp. 212–229.
[18] K. Dackland and B. Kågström, Blocked algorithms and software for reduction of a regular matrix pair to generalized Schur form, ACM Trans. Math. Software, 25 (1999), pp. 425–454.
[19] D. Day, How the Shifted QR Algorithm Fails to Converge and How to Fix It, Tech. Report 96-0913, Sandia National Laboratories, Albuquerque, NM, 1996.
[20] J. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997.
[21] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of Fortran basic linear algebra subprograms, ACM Trans. Math. Software, 14 (1988), pp. 1–17.
[22] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software, 16 (1990), pp. 1–17.
[23] J. J. Dongarra and M. Sidani, A parallel algorithm for the nonsymmetric eigenvalue problem, SIAM J. Sci. Comput., 14 (1993), pp. 542–569.
[24] A. Dubrulle, The Multi-Shift QR Algorithm: Is It Worth the Trouble?, Palo Alto Scientific Center Report G320-3558x, IBM Corp., Palo Alto, CA, 1991.
[25] J. G. F. Francis, The QR transformation: A unitary analogue to the LR transformation. I, Comput. J., 4 (1961/1962), pp. 265–271.
[26] J. G. F. Francis, The QR transformation. II, Comput. J., 4 (1961/1962), pp. 332–345.
[27] G. Golub and C. Van Loan, Matrix Computations, 2nd ed., The Johns Hopkins University Press, Baltimore, MD, 1989.
[28] G. H. Golub and C. Reinsch, Singular value decomposition and least squares solutions, Numer. Math., 14 (1970), pp. 403–420.
[29] D. E. Heller and I. C. F. Ipsen, Systolic networks for orthogonal decompositions, SIAM J. Sci. Statist. Comput., 4 (1983), pp. 261–269.
[30] J. J. Hench and A. J. Laub, Numerical solution of the discrete-time periodic Riccati equation, IEEE Trans. Automat. Control, 39 (1994), pp. 1197–1210.
[31] G. Henry, D. S. Watkins, and J. Dongarra, A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures, Tech. Report CS-97-352, Department of Computer Science, University of Tennessee, 1997. Also available online as LAPACK Working Note 121 from http://www.netlib.org/lapack/lawns/lawn121.ps and http://www.netlib.org/lapack/lawnspdf/lawn121.pdf.
[32] W. Hoffmann and B. N. Parlett, A new proof of global convergence for the tridiagonal QL algorithm, SIAM J. Numer. Anal., 15 (1978), pp. 929–937.
[33] E. R. Jessup, A case against a divide and conquer approach to the nonsymmetric eigenvalue problem, Appl. Numer. Math., 12 (1993), pp. 403–420.
[34] L. Kaufman, A parallel QR algorithm for the symmetric tridiagonal eigenvalue problem, J. Parallel and Distrib. Comput., 3 (1994), pp. 429–434.
[35] V. N. Kublanovskaya, On some algorithms for the solution of the complete eigenvalue problem, U.S.S.R. Comput. Math. and Math. Phys., 3 (1961), pp. 637–657.

[36] B. Lang, Using level 3 BLAS in rotation-based algorithms, SIAM J. Sci. Comput., 19 (1998), pp. 626–634.
[37] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Algorithm 539: Basic linear algebra subprograms for Fortran usage, ACM Trans. Math. Software, 5 (1979), pp. 324–325.
[38] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for Fortran usage, ACM Trans. Math. Software, 5 (1979), pp. 308–323.
[39] R. Mathias and G. W. Stewart, A block QR algorithm and the singular value decomposition, Linear Algebra Appl., 182 (1993), pp. 91–100.
[40] C. B. Moler and G. W. Stewart, An algorithm for generalized matrix eigenvalue problems, SIAM J. Numer. Anal., 10 (1973), pp. 241–256.
[41] B. N. Parlett, Global convergence of the basic QR algorithm on Hessenberg matrices, Math. Comp., 22 (1968), pp. 803–817.
[42] B. N. Parlett and W. G. Poole, Jr., A geometric theory for the QR, LU, and power iterations, SIAM J. Numer. Anal., 10 (1973), pp. 389–412.
[43] R. V. Patel, A. J. Laub, and P. Van Dooren, eds., Numerical Linear Algebra Techniques for Systems and Control, IEEE Press, New York, 1994.
[44] H. Rutishauser, Solution of eigenvalue problems with the LR transformation, in Nat. Bur. Standards Appl. Math. Ser. 49, 1958, pp. 47–81.
[45] T. Schreiber, P. Otto, and F. Hofmann, A new efficient parallelization strategy for the QR algorithm, Parallel Comput., 20 (1994), pp. 63–75.
[46] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and C. B. Moler, Matrix Eigensystem Routines—EISPACK Guide, Lecture Notes in Comput. Sci. 6, Springer-Verlag, New York, 1976.
[47] G. W. Stewart, Introduction to Matrix Computations, Academic Press, New York, 1973.
[48] R. A. van de Geijn, Storage schemes for parallel eigenvalue algorithms, in Numerical Linear Algebra, Digital Signal Processing and Parallel Algorithms, G. Golub and P. Van Dooren, eds., Springer-Verlag, Berlin, 1988, pp. 639–648.
[49] R. A. van de Geijn, Deferred shifting schemes for parallel QR methods, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 180–194.
[50] R. A. van de Geijn and D. G. Hudson, An efficient parallel implementation of the nonsymmetric QR algorithm, in Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, Monterey, CA, 1989, pp. 697–700.
[51] C. F. Van Loan, A general matrix eigenvalue algorithm, SIAM J. Numer. Anal., 12 (1975), pp. 819–834.
[52] D. S. Watkins, Understanding the QR algorithm, SIAM Rev., 24 (1982), pp. 427–440.
[53] D. S. Watkins, Fundamentals of Matrix Computations, John Wiley, New York, 1991.
[54] D. S. Watkins, Bidirectional chasing algorithms for the eigenvalue problem, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 166–179.
[55] D. S. Watkins, Shifting strategies for the parallel QR algorithm, SIAM J. Sci. Comput., 15 (1994), pp. 953–958.
[56] D. S. Watkins, Forward stability and transmission of shifts in the QR algorithm, SIAM J. Matrix Anal. Appl., 16 (1995), pp. 469–487.
[57] D. S. Watkins, QR-like algorithms—an overview of convergence theory and practice, in The Mathematics of Numerical Analysis, Lectures in Appl. Math. 32, J. Renegar, M. Shub, and S. Smale, eds., AMS, Providence, RI, 1996, pp. 879–893.
[58] D. S. Watkins, The transmission of shifts and shift blurring in the QR algorithm, Linear Algebra Appl., 241/243 (1996), pp. 877–896.
[59] D. S. Watkins and L. Elsner, Chasing algorithms for the eigenvalue problem, SIAM J. Matrix Anal. Appl., 12 (1991), pp. 374–384.
[60] D. S. Watkins and L. Elsner, Convergence of algorithms of decomposition type for the eigenvalue problem, Linear Algebra Appl., 143 (1991), pp. 19–47.
[61] R. C. Whaley and J. J. Dongarra, Automatically Tuned Linear Algebra Software, Tech. report, Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN, 1999. Also available online from www.netlib.org/utk/projects/atlas.
[62] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, UK, 1965.
[63] J. H. Wilkinson, Convergence of the LR, QR, and related algorithms, Comput. J., 8 (1965), pp. 77–84.
[64] J. H. Wilkinson, Global convergence of tridiagonal QR algorithm with origin shifts, Linear Algebra Appl., 1 (1968), pp. 409–420.
[65] J. H. Wilkinson and C. Reinsch, eds., Linear Algebra, Handbook for Automatic Computation, Vol. 2, Springer-Verlag, New York, 1971.