
Chapter 2

DIRECT METHODS—PART II

If we use finite precision arithmetic, the results obtained by the use of direct methods are contaminated with roundoff error, which is not always negligible.

2.1 Finite Precision Computation
2.2 Residual vs. Error
2.3 Pivoting
2.4 Scaling
2.5 Iterative Improvement

2.1 Finite Precision Computation

2.1.1 Floating-point numbers

floating-point numbers

The essence of floating-point numbers is best illustrated by an example, such as that of a 3-digit floating-point calculator which accepts numbers like the following: 123., 50.4, −0.62, −0.02, 7.00. Any such number can be expressed as

±d1.d2d3 × 10^e where e ∈ {0, 1, 2}.     (2.1)

Here

t := precision = 3, [L : U] := exponent range = [0 : 2].

The exponent range is rather limited. If the calculator display accommodates scientific notation, e.g., 3.46 3 (meaning 3.46 × 10^3) and −1.56 −3 (meaning −1.56 × 10^−3), then we might use [L : U] = [−9 : 9]. Some numbers have multiple representations in form (2.1), e.g., 2.00 × 10^1 = 0.20 × 10^2. Hence, there is a normalization:

• choose smallest possible exponent,

• choose + sign for zero,

e.g., 0.52 × 10^2 → 5.20 × 10^1,  0.08 × 10^−8 → 0.80 × 10^−9,  −0.00 × 10^0 → 0.00 × 10^−9. Nonzero numbers of form ±0.d2d3 × 10^−9 are denormalized. But for large-scale scientific computation base 2 is preferred.

binary floating-point numbers

This is an important matter because numbers like

0.2

do not have finite representations in base 2:

0.2 = (0.001100110011···)_2.

For the IEEE Standard for Binary Floating Point Arithmetic, the set of machine representable numbers (single precision) is

(±b1.b2···b24)_2 × 2^e,

where bj ∈ {0, 1}, e ∈ [−126 : 127]

(normalization: b1 = 1 if e ≥ −125),

+0, −0,

also −∞, +∞, NaN.

Normally +0 and −0 are indistinguishable. The important question is not how the numbers are represented but rather

WHICH NUMBERS ARE EXACTLY REPRESENTABLE.

Answer:

integer × 2^power,   |integer| ≤ 16 777 215,   −149 ≤ power ≤ 104.

You do not have to know binary to understand this. For double precision the set of machine representable numbers is

(±b1.b2···b53)_2 × 2^e, where e ∈ [−1022 : 1023].
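As a quick aside (my own sketch in Python, not part of the notes; the helper name is made up), the integer × 2^power characterization of single precision can be checked with exact rational arithmetic:

    from fractions import Fraction
    import numpy as np

    def as_integer_times_power_of_two(x):
        # write the machine number x exactly as integer * 2**power
        f = Fraction(x)                          # exact value; the denominator is a power of two
        return f.numerator, -(f.denominator.bit_length() - 1)

    x = float(np.float32(0.1))                   # the single precision neighbor of 0.1
    num, power = as_integer_times_power_of_two(x)
    print(num, power)                            # 13421773 -27
    print(abs(num) <= 16_777_215 and -149 <= power <= 104)   # True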

general floating-point numbers

A set of floating-point numbers (=: machine numbers) can be specified by four parameters:

β = base, t = precision,

L = lower limit of exponent, U = upper limit of exponent. The set of machine numbers is ±d1.d2···dt × β^e where di ∈ [0 : β − 1], e ∈ [L : U]. (We call e the exponent.) It is a finite set of numbers; e.g., β = 2, t = 3, L = −2, U = 1 are "3-bit numbers":

[Figure: the real line with the machine numbers of this system marked; the labeled ticks are −7/2, −3, −2, −1, −1/2, −1/4, 0, 1/4, 1/2, 1, 2, 3, 7/2, and the numbers closest to zero, of the form ±0.d2d3 × 2^−2, are denormalized.]
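A small Python sketch (mine, not the notes') that enumerates this toy system, denormalized numbers included, and reproduces the values on the line above:

    from fractions import Fraction

    def machine_numbers(beta, t, L, U):
        # all numbers +-d1.d2...dt x beta**e with di in [0, beta-1] and e in [L, U]
        values = set()
        for e in range(L, U + 1):
            for mantissa in range(beta ** t):                  # digits d1 d2 ... dt as one integer
                x = Fraction(mantissa, beta ** (t - 1)) * Fraction(beta) ** e
                values.update({x, -x})
        return sorted(values)

    nums = machine_numbers(2, 3, -2, 1)
    print([str(v) for v in nums if v >= 0])
    # ['0', '1/16', '1/8', '3/16', '1/4', '5/16', '3/8', '7/16', '1/2', '5/8',
    #  '3/4', '7/8', '1', '5/4', '3/2', '7/4', '2', '5/2', '3', '7/2']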

2.1.2 Rounding

rounding: round to nearest; in a tie, round away from zero

round-to-even: round to nearest; in a tie, round to the number whose last digit is even

chopping: round toward zero
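The three rules are easy to prototype with Python's decimal module (a sketch of mine; it can be used to check the 3-significant-digit table that follows):

    from decimal import Context, ROUND_DOWN, ROUND_HALF_UP, ROUND_HALF_EVEN

    def fl(x, t=3, rule=ROUND_HALF_EVEN):
        # round the decimal string x to t significant digits with the given rule
        return Context(prec=t, rounding=rule).create_decimal(x)

    for x in ("99.9499", "99.95", "-99.95", "99.85", "99.851"):
        print(x, fl(x, rule=ROUND_DOWN),        # chopping
                 fl(x, rule=ROUND_HALF_UP),     # rounding (ties away from zero)
                 fl(x, rule=ROUND_HALF_EVEN))   # round-to-even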

E.g., β = 10, t = 3, L = −9, U = 9:

number           chopping        rounding        round-to-even
99.9499          99.9            99.9            99.9
99.95            99.9            100             100
−99.95           −99.9           −100            −100
99.85            99.8            99.9            99.8
9.995 × 10^9     9.99 × 10^9     overflow        overflow
9.995 × 10^−10   0.99 × 10^−9    1.00 × 10^−9    1.00 × 10^−9
99.851           99.8            99.9            99.9

We use fl(x) to denote the result of rounding x. If fl(x) ≠ x, then we have a roundoff error.

definition of error (for floating-point computation)

(The number of exactly correct digits is for most purposes irrelevant.) There is yet another measure of accuracy which is meaningful and useful in a fixed precision context: Suppose α̃ = d1.d2···d_{t−1}d_t × β^e (normalized). Define

1 ulp = 0.0···01 × β^e = β^{e−t+1},   1 unit in the last place,

error in ulps := (α̃ − α)/β^{e−t+1}.

E.g., in a 3-digit environment, π ≈ 3.14 has an error of −.1592 ulps. Clearly |error| ≤ 0.5 ulps represents perfection, and a routine that delivers sin(x) to within 1 ulp is doing very well.

a single roundoff error

Let x be a representable real, meaning that its rounded value has an exponent that is in range. If x is a machine number, then δ = 0. Assume then that x is not a machine number so that, in particular, it is nonzero and can be expressed in normalized form x = β^e × f, 1 ≤ |f| < β. As shown below, x lies between two consecutive machine numbers (if for the moment we ignore out-of-range exponents) which differ by one unit in the t-th place of x.

[Figure: x bracketed by two consecutive machine numbers; their spacing is β^e × (0.00···01)_β = β^{e−t+1}.]

The number x is rounded to the nearer of the two bracketing numbers, so the rounding error is at most half the distance between them:

|error| ≤ (1/2) × β^{e−t+1}.

Equality occurs when x is in the middle and gets rounded in either direction. It is the relative error (|error/x|) that we wish to bound, so we need a lower bound on x, which can be obtained using the normalization assumption:

|x| = β^e |f| > β^e · 1 = β^e.

The strict inequality is justified because x is not a machine number. Combine the two inequalities to get

|relative error| < ((1/2) β^{e−t+1}) / β^e = (1/2) β^{1−t},   e.g., 2^{−24} in single precision.

This can be expressed fl(x) = x(1 + δ) where the relative roundoff error δ satisfies^1

|δ| < β^{1−t} for chopping,   |δ| < (1/2) β^{1−t} for rounding;

this bound on |δ| is denoted u, the unit roundoff.
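For IEEE single precision (β = 2, t = 24) this gives u = 2^{−24} with rounding; a quick NumPy check (my own, with the caveat that finfo's eps is the spacing between 1 and the next float, i.e. twice u):

    import numpy as np

    u = 0.5 * 2.0 ** (1 - 24)                        # (1/2) * beta**(1-t)
    print(u == 2.0 ** -24)                           # True
    print(np.finfo(np.float32).eps == 2.0 ** -23)    # True: eps = beta**(1-t) = 2u

    x = np.float32(1.0) / np.float32(3.0)            # fl(1/3) in single precision
    delta = (float(x) - 1.0 / 3.0) / (1.0 / 3.0)     # fl(x) = x(1 + delta)
    print(abs(delta) <= u)                           # True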

2.1.3 Floating-point arithmetic

Let ◦ denote one of +, −, ∗, /. If a, b are machine numbers, then the floating-point operation ◦̂ is defined by a ◦̂ b = fl(a ◦ b). E.g., 9.99 −̂ .00501 = fl(9.98499) = 9.98. Thus in principle we first compute the result to infinite precision; in practice, computers do it some other way but the end result is the same. (The details of how the computers do it is an entirely different topic.) We have

a ◦̂ b = fl(a ◦ b) = (a ◦ b)(1 + δ),   |δ| ≤ u.

Thus a single operation a ◦̂ b is always performed very accurately. Also input/output, e.g., 0.1F = 0.1(1 + 2^{−26}).

^1 For normalized numbers. This can be modified to include denormalized numbers.

cancellation occurs when the result has fewer digits than the operands and hence the result is a machine number:

CANCELLATION =⇒ NO ROUNDOFF ERROR.

Elementary functions can be evaluated with an error of less than 1 ulp.

2.1.4 Algorithms and numerical instability

algorithms

problem → subproblem; subproblem.  E.g.,

Solve for x1, x2, ..., xn   →   eliminate x1;  if (n > 1) { solve for x2, ..., xn };  solve for x1

polynomial interpolation   →   set up system of linear equations;  solve system of linear equations.

In general

f : R^m → R^n   becomes   f1 : R^m → R^p followed by f2 : R^p → R^n

y = f(x) z = f1(x); y = f2(z)

f = f2 ◦ f1

Thus, an algorithm is the repeated decomposition of problems (mappings) into smaller subproblems until all subproblems can be solved by existing hardware and software. Different decompositions yield different algorithms.

propagated and accumulated roundoff error

In the diagram below FE and F̂E denote, respectively, the use of forward elimination using exact and 3-digit rounded arithmetic. Similarly BS and B̂S denote back substitution.

[ 3.00  4.13  15.4 ]
[ 1.00  1.37  5.15 ]

FE (exact arithmetic) gives                 F̂E (3-digit arithmetic) gives
[ 3.00   4.13        15.4       ]           [ 3.00   4.13    15.4  ]
[ 0     −.00666···   .016666··· ]           [ 0     −.0100   .0200 ]
(the 0.02% difference between these two reduced matrices is the FE error)

BS applied to the exact result gives the exact solution; BS and B̂S applied to the F̂E result give the other two vectors:

[ 8.575 ]        [ 7.88666··· ]        [ 7.90  ]
[ −2.5  ]        [ −2         ]        [ −2.00 ]

(the 9% difference between the first two is the propagated FE error; the 1% difference between the last two is the BS error)

The 9% difference between the exact solution (left) and the fully computed solution (right) is the accumulated error.

For comparison the unit roundoff error u = 0.5%. (Question: why do we consider F̂E; BS rather than FE; B̂S? Answer: because the 3-digit arithmetic used in B̂S is defined only for 3-digit operands.)

catastrophic cancellation

(1000 +̂ 1) −̂ 1000:
    exact:   1001 − 1000, which equals 1;
    approx:  1000 −̂ 1000, which equals 0 (and this subtraction is exact).

The subtraction was performed exactly. The 0.1% addition error was amplified to a 100% error in the result because of cancellation. Another example:

1.07 ×̂ 1.07 −̂ 1.14:
    exact:   1.1449 − 1.14, which equals 0.0049;
    approx:  1.14 −̂ 1.14, which equals 0 (and this subtraction is exact).

The relative error −.0049/1.1449 is amplified to become −.0049/.0049. CANCELLATION =⇒ ERROR AMPLIFICATION.

Often an expression has several arrangements which are mathematically equivalent.

Example Suppose x2 ≈ x1 ≈ x0.

formula 1: x2 − 2 ∗ x1 + x0 in 3-digit precision

.723 − 2 ∗ .711 + .701 = .723 − 1.42 + .701 = −.697 + .701 = .004

formula 2: (x2 − x1) − (x1 − x0) in 3-digit precision

(.723 − .711) − (.711 − .701) = .012 − .010 = .002 exact!

Moral: cancellation is good if the operands have no computational error, e.g., they are inputs to the algorithm. Example: Change sin x − sin y to 2 sin((x − y)/2) cos((x + y)/2). (There is a calculus of finite differences that enables one to perform such transformations.) Therefore if cancellation cannot be avoided, then it should occur at the beginning of the computation.
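The effect is easy to reproduce in single precision (a sketch of mine; the rewriting uses the standard identity quoted above, and the specific numbers are arbitrary):

    import numpy as np

    x, y = np.float32(1.00002), np.float32(1.0)        # machine numbers, x very close to y
    reference = np.sin(np.float64(x)) - np.sin(np.float64(y))   # accurate double precision value

    naive = np.sin(x) - np.sin(y)                      # cancellation of two rounded quantities
    two = np.float32(2)
    rewritten = two * np.sin((x - y) / two) * np.cos((x + y) / two)   # cancellation on exact inputs

    print(abs(naive - reference) / abs(reference))     # typically around 1e-3: thousands of ulps
    print(abs(rewritten - reference) / abs(reference)) # typically around 1e-7: a few ulps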

numerical instability

For a given problem y = x^2 − 1 there may be several algorithms:

1. y = (x ×̂ x) −̂ 1,

2. y = (x −̂ 1) ×̂ (x +̂ 1).

Both of these algorithms introduce a few very tiny roundoff errors, and yet algorithm 1 can sometimes yield a result y with a relative error far greater than the unit roundoff error. This is because the roundoff error in the multiplication can be greatly amplified (in a relative sense) by cancellation. Because there exists a different reasonable algorithm which avoids this error amplification, we say that number 1 is numerically unstable. However, for most ill-conditioned problems, some amplification of computational errors is practically unavoidable.

Example 1. ξ^2 − 6.43ξ + .00947 = 0.

Formula 1:

ξ1 = (6.43 − (6.43^2 − 4 × .00947)^{1/2})/2 = (6.43 − (41.3 − .0379)^{1/2})/2 = (6.43 − 41.3^{1/2})/2 = (6.43 − 6.43)/2 = 0.

Formula 2:

ξ1 = (2 × .00947)/(6.43 + (6.43^2 − 4 × .00947)^{1/2}) = .0189/12.9 = .00147.

Exact answer: .00147312···

We say that the first algorithm is numerically unstable because the roundoff errors are unacceptably amplified in the result. The roundoff errors in the successive stages of evaluating

(6.43 − (6.43^2 − 4 × .00947)^{1/2})/2

are 0%, 0.1%, 0.1%, 0.2%, 0.05%, 0%.
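The two formulas are easy to compare in single precision, which plays the role of the 3-digit arithmetic (my own sketch; the coefficients are those of Example 1):

    import numpy as np

    b, c = np.float32(6.43), np.float32(0.00947)       # roots of xi**2 - 6.43*xi + .00947 = 0
    disc = np.sqrt(b * b - np.float32(4) * c)

    xi1_unstable = (b - disc) / np.float32(2)          # formula 1: subtracts nearly equal numbers
    xi1_stable = np.float32(2) * c / (b + disc)        # formula 2: no cancellation

    exact = (np.float64(b) - np.sqrt(np.float64(b) ** 2 - 4 * np.float64(c))) / 2  # double reference
    print(abs(xi1_unstable - exact) / exact)           # far above single precision roundoff
    print(abs(xi1_stable - exact) / exact)             # close to single precision roundoff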

Example 2 (the example of Gaussian elimination given earlier): relative error (max norm) = 8%. This error amplification is regarded as acceptable because there is no reasonable algorithm which is significantly better (unless partial double precision is used).

Unstable algorithm: problem → subproblem; sensitive (or ill-conditioned) subproblem

Example compute z = 3.01/x − 1.00

ALGORITHM 1   y = 3.01/x; z = y − 1.00

[Figure: graphs of y = 3.01/x and z = y − 1.00 = 3.01/x − 1.00 against x, with the accumulated computational error indicated.]

ALGORITHM 2   y = 3.01 − x; z = y/x

[Figure: graphs of y = 3.01 − x and z = y/x = (3.01 − x)/x against x, with the accumulated computational error indicated.]

Classic example

algebraic eigenvalue problem  →  determining the coefficients of the characteristic polynomial;  finding the roots of a polynomial (often much more ill-conditioned than the original problem).

This was recognized in the 1950's, in the early days of computing. Finding polynomial roots is an uncommon procedure today.

2.1.5 Interval computation

Intervals were introduced as a way to bound data error; they can also be used for computational error.

finite precision interval arithmetic

The number of significant digits that can be carried is limited and it is necessary to round the results of arithmetic operations. In the case of an operation between two intervals, we want to round the resulting interval outward. (Why?) An example:

fl([1.01, 1.02] × [2.01, 2.02]) = [2.03, 2.07]

assuming 3-digit precision. Such arithmetic is part of the IEEE standard for binary floating point arithmetic.
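A minimal sketch of one outward-rounded interval operation (my own illustration, using Python's math.nextafter, available from Python 3.9, rather than the IEEE directed rounding modes):

    import math

    def interval_mul(a, b):
        # product of intervals a = (a_lo, a_hi), b = (b_lo, b_hi), rounded outward
        products = [x * y for x in a for y in b]
        lo, hi = min(products), max(products)
        # push each endpoint one machine number outward so the true product set is enclosed
        return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

    print(interval_mul((1.01, 1.02), (2.01, 2.02)))
    # encloses the exact product interval [2.0301, 2.0604];
    # in 3-digit decimal, outward rounding gives [2.03, 2.07] as above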

techniques of interval analysis

• outward rounding

• iterative refinement/defect correction

• Taylor series + remainder, with the remainder enclosed in an interval

For example, given x ∈ [−1, 1]

sin x = x − x^3/6 + (x^5/120) cos ξ,   ξ ∈ [−1, 1].

Without interval analysis we cannot be certain how much accuracy we have. Indeed, numerical software is deliberately designed to fail miserably occasionally. And it is scandalous that users are not alerted to this weakness.

Review questions

1. What is a machine number?

2. Specify the set of machine numbers given the precision and exponent range.

3. What is the rule used for rounding to a machine number?

4. What is a good upper bound on the relative error of rounding for floating-point com- putation with n-bit mantissas (including the nonstored leading bit)?

5. Give a precise definition for a floating-point arithmetic operation.

6. What can we say about the accuracy of a floating-point arithmetic operation when there is cancellation?

7. What is the definition of numerical instability?

8. What is the cause of numerical instability?

9. What role does cancellation play in numerical instability?

10. What is the definition of an operation ◦ on two intervals I1 and I2?

11. How is rounding performed for an interval operand?

12. How can error be contained for a Taylor expansion?

Exercises

1. Below is a very tiny piece of the real line with consecutive 8-digit precision base 10 floating-point numbers marked above the line and consecutive 24-bit precision base 2 floating-point numbers marked below the line.

[Figure: two sets of tick marks on a short piece of the real line, with a point X marked.]

Given that 1 ≤ X ≤ 10, determine what X must be.

2. For a floating-point number system with base β and precision t define the "equivalent (base 10) digits of precision" s to be such that the worst case relative roundoff error is the same for s base 10 digits as it is for t base β digits. Find a formula for s in terms of t and β without worrying about whether or not s is an integer. Apply this to β = 2, t = 24 and to β = 2, t = 53.

3. Consider the algorithms (A) 1 /̂ (1 +̂ x) +̂ 1 /̂ (1 −̂ x), (B) 2 /̂ ((1 +̂ x) ×̂ (1 −̂ x)), (C) 2 /̂ (1 −̂ x ×̂ x) for computing the expression 2/(1 − x^2) where x is a machine number and correctly rounded arithmetic is used. For each algorithm answer the following question: For what values of x, if any, is poor (relative) accuracy in the value of the expression expected? (The answer will be of the form "for values of x near such and such points" where "such points" are on the extended real line [−∞, +∞].) Explain.

4. Give an example of a number x̃, 1 ≤ x̃ ≤ 2, such that fl3(x) is undetermined even though it is known that |x̃ − x| ≤ 10^{−100}. (fl3 denotes rounding to 3 significant digits.)

5. Explain why the following algorithm terminates in practice:

sum = 0.; term = 1.; n = 0;
while (sum + term != sum) {
    sum = sum + term;
    n = n + 1;
    term = term * x / (float) n;
}

6. Assuming the use of 3 (significant) digit decimal floating-point numbers, give an accurate algorithm for computing x − π. Note the algorithm x − 3.14 is unacceptable because the error could be as great as 100%.

7. Let y = x1 x2 ··· xn, and let ŷ denote the value of y computed with floating-point multiplication ∗̂ where a ∗̂ b = (a ∗ b)(1 + δ) for some δ (depending on a ∗ b) such that |δ| ≤ u. Neglecting terms of order u^2 and higher, perform a forward error analysis by finding the smallest bound on ∆ in the expression ŷ = y(1 + ∆).

8. If ζ is the result of computing x^T y in floating-point arithmetic (from left to right), show that ζ = x^T y + ε where |ε| ≤ u |x^T| diag(n, n, n−1, ..., 3, 2) |y| + O(u^2). The absolute value of a matrix is defined componentwise.

9. If C is the result of computing AB in floating-point arithmetic, show that

C = AB + E

where |E| ≤ n u |A| |B| + O(u^2). Absolute values and inequalities are to be applied component by component.

10. Supposing that it is known that

|x − 1.236|≤0.011,

enumerate the possible values of fl3(x)? (Here fln denotes rounding to the nearest number having n significant digits; in the case of a perfect tie we choose the number whose nth significant digit is even.)

11. What does it mean if 1.F + rcond == 1.F in floating-point arithmetic, where rcond is the reciprocal of the condition number of a matrix A?^2 More precisely, determine a tight upper and lower bound on the condition number.

12. Show in the case of base ten floating-point arithmetic that if 1/2 ≤ x/y ≤ 2, the difference x − y is computed exactly. (Your argument should be convincing to a person, but not a mechanical proof checker.) Show that this is untrue if (say) 0.4 ≤ x/y ≤ 2.5.

^2 The letter F denotes single precision.

2.2 Residual vs. Error

Two ways to assess accuracy for x̂ = A\b computed in finite precision are

1. the residual ‖b − Ax̂‖, e.g., in interpolation, and

2. the error ‖x̂ − x‖, e.g., networks.

The norm may be a weighted norm, e.g., ‖v‖ := ‖diag(w1, w2, ..., wn) · v‖∞ (prove this is a norm) so that more important components are weighted more heavily. We could assume that equations or variables are scaled so that an unweighted norm is adequate. In finite precision both the residual and error are generally nonzero for all machine representable vectors x̂. Moreover,

smallest error ⇎ smallest residual.

In other words, the machine representable vector which

minimizes ‖x̂ − x‖ is generally different from that which

minimizes ‖b − Ax̂‖.

Example In the graph below those vectors which are machine representable are denoted by +. The level curves ‖b − Ax‖_2 = constant are plotted. The ratio of major axis to minor axis is κ_2(A). Both ‖error‖ and ‖residual‖ vanish at the center of the ellipses.

[Figure: concentric ellipses ‖b − Ax‖_2 = constant with machine-representable vectors marked +, two of which are labeled R and E.]

The "R" marks the "machine vector" having the smallest residual. The "E" marks the "machine vector" closest to the true solution.

THEOREM (Dahlquist & Björck, Thm 5.5.2) The solution x̂ computed by Gaussian elimination satisfies

‖b − Ax̂‖∞ ≤ (n^3 + 3n^2) g u ‖A‖∞ ‖x̂‖∞

where the growth factor^3

g = max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij|.

Recall from sec. 1.4 that A^(0) = A, A^(k) = M_k A^(k−1). This error bound is partly a posteriori because g is generally not known until after the computation.

growth factor for s.p.d. matrices

THEOREM For symmetric positive definite matrices the growth factor g = 1.

Proof. Let A be a real symmetric positive definite matrix of dimension greater than 1 and partition it as

A = [ α   a^T ]
    [ a   Â   ].

The multipliers for the first elimination step are a/α and

M1 = [ 1      0 ]
     [ −a/α   I ].

Hence

M1 A = [ α   a^T           ]
       [ 0   Â − a a^T/α   ].

Recall that a symmetric positive definite matrix has its largest component on the diagonal. Hence, the theorem follows by induction on the dimension of the matrix if we prove that

(i) Aˆ − aα−1aT is symmetric positive definite, and

(ii) the maximum diagonal element of Aˆ − aα−1aT is less than or equal to the maximum diagonal element of A.

The first of these is exercise 1. To show (ii), we note that

a_ii − a_i1^2 / a_11 ≤ a_ii.

^3 Trefethen and Bau (1997) and Heath (2002) define g = max_{i,j} |u_ij| / max_{i,j} |a_ij|.

Consider the following example with β = 10, t = 3, fl = round-to-even:

[ .0001  1   1 ]     multiplier = 10000     [ .0001    1        1      ]
[ 1      1   0 ]             →              [ 0       −10000   −10000  ],

x̂2 = (−10000) /̂ (−10000) = 1,   x̂1 = (1 −̂ 1 ×̂ 1) /̂ .0001 = 0,

x2 = 1.00010001···,   x1 = −1.00010001···

In this computation cancellation occurred once in the subtraction 1 −̂ 1, but this was performed exactly. In fact, there was only one tiny rounding error:

[ .0001   1            1      ]       [ .0001    1        1      ]
[ 0       1 −̂ 10000   −10000 ]   →   [ 0       −10000   −10000  ].

It obliterated the 1 in the second row and second column, which is equivalent to changing the original problem to

[ .0001  1   1 ]
[ 1      0   0 ].

This is typical. Adding a huge number to a much smaller number can destroy information in the smaller number, which may or may not be serious. In Gaussian elimination we do not want to create reduced systems with large elements because a large computed result may entail a roundoff error that although tiny compared to the result is nonetheless significant compared to other elements in the matrix. Note the residual b − Ax̂ = [0; −1] is large in this example.

norm of the error

Let ‖b − Ax̂‖∞ =: c u ‖A‖∞ ‖x̂‖∞. Then c ≤ (n^3 + 3n^2) g. In practice c ≈ g. The relative residual

‖b − Ax̂‖∞ / ‖b‖∞ ≤ c u κ∞(A) + O(u^2).

In practice the relative residual is somewhat smaller than g u κ∞(A). The error

x̂ − x = −A^{−1}(b − Ax̂),
‖x̂ − x‖∞ ≤ ‖A^{−1}‖∞ ‖b − Ax̂‖∞ = c u ‖A^{−1}‖∞ ‖A‖∞ ‖x̂‖∞,

so

‖x̂ − x‖∞ / ‖x‖∞ ≤ c u κ∞(A) + O((u κ∞(A))^2).

Note that this bound is very crude. Consider A = diagonal matrix.

The LAPACK routine DGECON ≡ DGETRF factorization + rcond (= estimate of 1/κ(A)), so that one can use u ‖x̂‖ / rcond to estimate the error in x̂, and one can use 1. + rcond == 1. to see if A is too close to being singular.
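A hedged NumPy sketch of the same idea (numpy.linalg stands in for the LAPACK calls, and 1/cond plays the role of rcond):

    import numpy as np

    A = np.array([[7.0, 6.9], [4.0, 4.0]])
    b = np.array([34.7, 20.0])

    x_hat = np.linalg.solve(A, b)                  # LU factorization + solve
    rcond = 1.0 / np.linalg.cond(A, p=np.inf)      # stand-in for the DGECON estimate
    u = np.finfo(A.dtype).eps / 2                  # unit roundoff for double precision

    print(u * np.linalg.norm(x_hat, np.inf) / rcond)   # rough estimate of the error in x_hat
    print(1.0 + rcond == 1.0)                          # True would flag A as numerically singular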

Review questions 1. Define the residual for an approximate solution of a system of linear equations.

2. Give an example of an application problem where it is the residual not the error that is of interest.

3. Does the vector x ∈ R^n which minimizes the norm of the error of Ax = b, det(A) ≠ 0, also minimize the norm of the residual? Is this true if A, b, and x̂ are machine numbers?

4. Define the growth factor for Gaussian elimination.

5. A bound on the norm of the error for Gaussian elimination due to floating-point arith- metic is linearly proportional to which quantities?

Exercises

1. Assuming that

A = [ α   a^T ]
    [ a   Â   ]

is symmetric positive definite, show Â − a α^{−1} a^T is symmetric positive definite.

2. This question concerns the solution of a linear system Ax = b for a certain b ∈ R^2 and A ∈ R^{2×2}. In the graph below, crosses denote points in R^2 which are machine representable and the ellipse is defined by ‖b − Ax‖_2 = constant. What interesting point is illustrated by this diagram?

2.3 Pivoting

mathematical pivoting

Gaussian elimination fails for

[ 0  1 ]
[ 1  0 ].

It also fails for

[ 1  1  1 ]
[ 1  1  0 ]
[ 1  0  0 ].

The elements that we attempt to divide by are called pivots. These are the diagonal elements of U. Whenever we encounter a zero pivot, we could interchange that row with some row below it:

[ 1  1  1 ]      [ 1   1   1 ]      [ 1   1   1 ]
[ 1  1  0 ]  →   [ 0   0  −1 ]  →   [ 0  −1  −1 ].
[ 1  0  0 ]      [ 0  −1  −1 ]      [ 0   0  −1 ]

Interchanging rows 2 and 3 can be accomplished by premultiplying by

[ 1  0  0 ]
[ 0  0  1 ]
[ 0  1  0 ]

(which is the result of interchanging rows 2 and 3 of the identity matrix). More generally interchanging

rows i and j can be accomplished by an elementary permutation matrix P_i: the identity matrix with rows i and j interchanged, i.e., ones on the diagonal except in positions (i, i) and (j, j), which are zero, and ones in positions (i, j) and (j, i). (Is this an elementary matrix?) Any permutation matrix is a product of such matrices. Note P_i^{−1} = P_i.

Mathematical pivoting cannot really fail because even if the kth column of A^(k−1) is zero on and below the diagonal, e.g.,

A^(k−1) = [ ×  ×  ×  ×  ×  ×  × ]
          [    ×  ×  ×  ×  ×  × ]
          [       ×  ×  ×  ×  × ]
          [          0  ×  ×  × ]
          [          0  ×  ×  × ]
          [          0  ×  ×  × ]
          [          0  ×  ×  × ],

the reduction to upper triangular form can continue. However, backsubstitution will fail.

partial pivoting

"If division by zero is disastrous in exact arithmetic, then division by a small number is dangerous in finite precision arithmetic." This argument is dubious but the conclusion is correct for Gaussian elimination. Assume 3-digit arithmetic. (This means that the result of each operation is rounded to 3 significant digits, which is very different from 3 decimal places.) Recall the example

[ .0001  1   1 ]      [ .0001    1        1     ]
[ 1      1   0 ]  →   [ 0       −10000   −10000 ]

back substitution: x̂2 = 1, x̂1 = 0;      exact solution: x2 = 1.00010001···, x1 = −1.00010001···

(How does Cramer's rule fare?) But if we first interchange rows,

[ 1      1   0 ]      [ 1   1   0 ]      back substitution: x̂2 = 1,
[ .0001  1   1 ]  →   [ 0   1   1 ]                         x̂1 = −1

(Is this the solution correctly rounded to 3 digits?)

Example

[ 1  2  3 ]     [ 4  1  2 ]     [ 4   1    2   ]     [ 4   1    2  ]     [ 4   1    2  ]
[ 2  1  2 ]  →  [ 2  1  2 ]  →  [ 0  1/2   1   ]  →  [ 0  7/4  5/2 ]  →  [ 0  7/4  5/2 ]
[ 4  1  2 ]     [ 1  2  3 ]     [ 0  7/4  5/2  ]     [ 0  1/2   1  ]     [ 0   0   2/7 ]

More generally at each stage

Partial pivoting forces multipliers to be ≤ 1 in absolute value.

ALGORITHM partial pivoting

for k = 1, 2, ..., n − 1 do {
    find p in {k, k+1, ..., n} which maximizes |a_pk|;
    r_k = p;
    /* interchange kth row with pth row, including multipliers */
    for j = 1, 2, ..., n do interchange a_kj with a_pj;
    /* subtract multiples of kth row from remaining rows */
    for i = k+1, k+2, ..., n do {
        µ = a_ik / a_kk;
        a_ik = µ;
        /* subtract µ × kth row from ith row */
        for j = k+1, k+2, ..., n do a_ij = a_ij − µ · a_kj;
    }
}
END
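A direct transcription into Python/NumPy (a sketch of mine, with 0-based indices; the array is overwritten with U and the multipliers and r records the interchanges, matching the storage scheme described next):

    import numpy as np

    def partial_pivoting(a):
        # Gaussian elimination with partial pivoting; a is overwritten in place
        n = a.shape[0]
        r = np.zeros(n - 1, dtype=int)
        for k in range(n - 1):
            p = k + np.argmax(np.abs(a[k:, k]))    # row with the largest |a_pk|, p >= k
            r[k] = p
            a[[k, p], :] = a[[p, k], :]            # interchange rows k and p, multipliers included
            for i in range(k + 1, n):
                mu = a[i, k] / a[k, k]
                a[i, k] = mu                        # store the multiplier
                a[i, k + 1:] -= mu * a[k, k + 1:]   # subtract mu times row k from row i
        return a, r

    A = np.array([[1., 2., 3.], [2., 1., 2.], [4., 1., 2.]])
    lu, r = partial_pivoting(A.copy())
    print(lu)   # upper triangle is U = [[4, 1, 2], [0, 7/4, 5/2], [0, 0, 2/7]], as in the example above
    print(r)    # [2 2]: interchange rows 0 and 2, then rows 1 and 2 (0-based)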

The output is the array a_ij overwritten so that its upper triangle contains U and its strict lower triangle contains the columns of multipliers m_1, m_2, ..., m_{n−1}, together with the record of interchanges r_1, r_2, ..., r_{n−1}.

For example,

r = [ 5, 2, 4, 5 ]:   first interchange rows 1 and 5, then no interchange, then interchange rows 3 and 4, finally interchange rows 4 and 5.

operation count

In addition to the (1/3)n^3 multiplications etc. of plain Gaussian elimination, there are also comparison operations:

for the kth stage (n − k) comparisons

Altogether there are

(n − 1) + (n − 2) + ··· + 1 ≈ n^2/2 comparisons.

partial pivoting in matrix notation

This is an algorithm that performs row interchanges on the multipliers as well as the elements of the reduced matrix, which is what DGETRF does. The algorithm produces an evolving factorization:

A = L_0 U_0 → L_1 U_1 → ··· → L_{n−1} U_{n−1} = LU where L_k U_k is the factorization after the kth stage of the algorithm. Initially,

L0 = I, U0 = A.

Assume that L_{k−1} and U_{k−1} have been calculated. The interchange of row k with some other row gives

L′_{k−1} = P_k L_{k−1} P_k,   U′_{k−1} = P_k U_{k−1}.

The transformation on the left interchanges two rows of multipliers without affecting the 1's on the diagonal. Then elimination produces

L_k = L′_{k−1} M_k^{−1},   U_k = M_k U′_{k−1}.

The transformation on the left inserts new multipliers into L_k. It is straightforward to verify the loop invariant L_k U_k = P_k ··· P_2 P_1 A and hence to conclude that

LU = PA where P = P_{n−1} ··· P_2 P_1.

To compute A^{−1}b, one computes

U^{−1}( M_{n−1}( ··· ( M_2( M_1( P_{n−1}( ··· ( P_2( P_1 b )) ··· )))) ··· )).
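In code this order of operations looks as follows (my own sketch; it can consume the output of the factorization sketch given after the algorithm, and here the stored factors of the 3 × 3 example worked earlier are written out directly, with 0-based indices):

    import numpy as np

    def solve_with_factors(lu, r, b):
        # x = U^{-1}( M_{n-1}( ... M_1( P_{n-1}( ... P_1 b ...)) ...)), 0-based indexing
        n = lu.shape[0]
        y = np.array(b, dtype=float)
        for k in range(n - 1):                    # the recorded interchanges P_1, ..., P_{n-1}
            y[[k, r[k]]] = y[[r[k], k]]
        for k in range(n - 1):                    # the multipliers M_1, ..., M_{n-1}
            y[k + 1:] -= lu[k + 1:, k] * y[k]
        x = np.zeros(n)                           # back substitution with U
        for i in range(n - 1, -1, -1):
            x[i] = (y[i] - lu[i, i + 1:] @ x[i + 1:]) / lu[i, i]
        return x

    lu = np.array([[4., 1., 2.], [1/4, 7/4, 5/2], [1/2, 2/7, 2/7]])  # stored factors of [[1,2,3],[2,1,2],[4,1,2]]
    r = [2, 2]
    print(solve_with_factors(lu, r, [6., 5., 7.]))   # [1. 1. 1.], the solution of the original system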

storage

U: (1/2) n(n+1) float values
M_k: n − k float values
P_k: 1 int value

complete pivoting

At each stage the pivot is chosen to be the element of largest absolute value in the entire remaining submatrix.

Interchanging columns: postmultiplying by the permutation matrix

M_{n−1} P_{n−1}( ··· ( M_2 P_2 ( M_1 P_1 A Q_1 ) Q_2 ) ··· ) Q_{n−1} = U.

Factorization: A = ···   Backsolving: A^{−1}b = ···

Disadvantage: n^3/3 comparisons. E.g.,

[ 1  2  3 ]     [ 9  8  7 ]     [ 9   8     7   ]     [ 9   7     8   ]     [ 9   7     8   ]
[ 4  5  6 ]  →  [ 6  5  4 ]  →  [ 0  −1/3  −2/3 ]  →  [ 0  −4/3  −2/3 ]  →  [ 0  −4/3  −2/3 ]
[ 7  8  9 ]     [ 3  2  1 ]     [ 0  −2/3  −4/3 ]     [ 0  −2/3  −1/3 ]     [ 0   0     0   ]

matrix inversion

LU factorization: n^3/3 mults
forward elimination L^{−1}e_j, j = 1, 2, ..., n: n^2/2 + (n−1)^2/2 + ··· + 1/2 ≈ n^3/6 mults
back substitution U^{−1}(L^{−1}e_j), j = 1, 2, ..., n: n^2/2 + n^2/2 + ··· + n^2/2 ≈ n^3/2 mults
total: n^3 mults

growth factor

Recall the bound on the residual

‖b − Ax̂‖∞ ≤ (n^3 + 3n^2) g u ‖A‖∞ ‖x̂‖∞.

CONJECTURE: g ≤ n for complete pivoting. Recently disproved.

THEOREM (Wilkinson): g ≤ 2^{n−1} for partial pivoting.

[  1                 1 ]       [ 1              1       ]
[ −1   1             1 ]       [    1           2       ]
[ −1  −1   1         1 ]   →   [       1        4       ]
[  ⋮            ⋱    ⋮ ]       [          ⋱     ⋮       ]
[ −1  −1  ···    1   1 ]       [             1  2^{n−2} ]
[ −1  −1  ···   −1   1 ]       [                2^{n−1} ]
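A quick check that the bound is attained (my own sketch, using scipy's LU and the g = max|u_ij|/max|a_ij| definition from footnote 3):

    import numpy as np
    from scipy.linalg import lu

    def wilkinson_matrix(n):
        # 1 on the diagonal, -1 below it, 1 in the last column
        A = np.tril(-np.ones((n, n)), -1) + np.eye(n)
        A[:, -1] = 1.0
        return A

    n = 10
    A = wilkinson_matrix(n)
    P, L, U = lu(A)                              # partial pivoting takes the diagonal pivots here
    g = np.max(np.abs(U)) / np.max(np.abs(A))
    print(g, 2.0 ** (n - 1))                     # both 512: the worst case 2**(n-1) is attained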

However, empirical evidence (Trefethen & Bau, 1997, p. 168) shows that it is extremely rare for g_n to exceed √n. IN SUM the residual is practically always (relatively) minuscule but the error can be (relatively) large.

Review questions

1. What is an elementary permutation matrix?

2. What is the inverse of an elementary permutation matrix?

3. Write an algorithm for Gaussian elimination with partial pivoting applied to a system Ax = b of n equations in n unknowns.

62 4. What storage scheme is used by Gaussian elimination with partial pivoting?

5. In what sense is the result of Gaussian elimination with partial pivoting a factorization of the original matrix?

6. What is the idea of complete pivoting?

7. How many multiplications are required for an efficient implementation of matrix inversion by Gaussian elimination?

8. What is the worst case growth factor for Gaussian elimination with partial pivoting?

Exercises

1. (a) Use Gaussian elimination with partial pivoting to reduce the following matrix to upper triangular form and show each step of computation:

[ 1   0  3   0 ]
[ 0   1  3  −1 ]
[ 3  −3  0   6 ]
[ 0   2  4   6 ]

(b) The reduction process in part (a) can be expressed as a factorization of the matrix. Write down this factorization as the product of several 4 × 4 matrices.

2. Suppose the partial pivoting algorithm returns

A = [ 1    2    4 ]        [ 2 ]
    [ 1/2  1    2 ] ,  r = [ 3 ] .
    [ 1/4  1/2  1 ]

This information represents a factorization of the original matrix. What is this factorization?

3. The algorithm given for partial pivoting on page 59 can be modified in the obvious way to include complete pivoting. An additional int array c of dimension n − 1 is needed to record column interchanges. Suppose the modified algorithm returns

r = [ 3 ] ,  c = [ 2 ] ,  A = [ 4     2     −1   ]
    [ 3 ]       [ 3 ]        [ 1/4  −2      3/2 ] .
                             [ 1/2  −1/4   13/8 ]

(a) This information represents a factorization of the original matrix (as a product of seven 3 × 3 matrices). What is this factorization?

(b) Determine the solution of

(original matrix) x = [ −1 ]
                      [  4 ] .
                      [ −3 ]

Do not re-form the original matrix! (If the components of x are not integers, then you made a mistake.)

4. Fully explain the purpose of partial pivoting.

5. The following alleged factorization of a matrix could not possibly have been produced by the partial pivoting algorithm given in these notes:

A = [ 4     0     −3    1.2   0.9  ]        [ 2 ]
    [ 0     1.2    2    1.2   2    ]        [ 5 ]
    [ −0.5  −0.9   0    0.5   3    ] ,  r = [ 5 ] .
    [ 0.8   0.3    0    0.3  −0.7  ]        [ 3 ]
    [ 1.1   0.8    0.1  0.7   0    ]

List all the reasons why these values could not have been produced by the partial pivoting algorithm (assuming the computation was performed with infinite precision). There are three distinct reasons.

6. The following alleged factorization of a matrix could not possibly have been produced by the partial pivoting algorithm:

contents of     [ 3     1.2   −0.8   1.4   0.9 ]     contents of     [ 4 ]
array A      =  [ 0.3   1.2    0     0     2   ] ,   array ipiv   =  [ 2 ] .
                [ −1    0.8   −2     0.9   0.3 ]                     [ 4 ]
                [ 0     1.1    0.5   0    −0.7 ]                     [ 3 ]
                [ 0.9  −0.7    1    −0.5   0   ]                     [ 5 ]

List all the reasons why these values could not have been produced by the partial pivoting algorithm (assuming the computation was performed with infinite precision). There are three distinct reasons.

7. Suppose that a matrix A has a factorization A = M_1^{−1} P_2 M_2^{−1} U where

M_1 = [ 1     0  0 ]   P_2 = [ 0  1  0 ]   M_2 = [ 1  0    0 ]   U = [ 2  −2   0 ]
      [ −1/2  1  0 ] ,       [ 1  0  0 ] ,       [ 0  1    0 ] ,     [ 0   4  −2 ] .
      [ 0     0  1 ]         [ 0  0  1 ]         [ 0  1/2  1 ]       [ 0   0   1 ]

Without re-forming the original matrix A, use this information to solve A^T x = b efficiently where b = [ 5  1  −4 ]^T.

8. A square matrix A is upper Hessenberg if a_ij = 0 for j < i − 1.

9. Modify the partial pivoting algorithm given in this section so that it does complete pivoting. An additional array c of dimension n − 1 is needed to record column interchanges. (There are two correct answers.)

10. Show that if A is a 2 by 2 matrix, then the growth factor g2 for Gaussian elimination with partial pivoting is bounded by 2.

11. Following is a recursive algorithm for inverting a matrix:

invertR([α]) { α = 1./α; }

invertR([ α  b^T ; c  D ]) {
    c = c/(−α);
    D = D + c b^T;
    b^T = b^T/(−α);
    invertR(D);
    c = D c;
    α = 1./α + b^T c;
    b^T = b^T D;
}

where α is a scalar, b and c are column vectors, and D is a square matrix. Give an approximate count for the number of multiplications performed by invertR for an n by n matrix excluding the recursive call; in particular, give the exponent and coefficient of the leading term in the multiplication count.

12. What is the growth factor for Gaussian elimination without pivoting applied to

[ 2  −3 ]
[ 3   4 ] ?

for Gaussian elimination with partial pivoting? for Gaussian elimination with complete pivoting?

13. Suppose we have reduced a 3 by 3 matrix to upper triangular form using partial pivoting implemented so that when two rows are interchanged so are the multipliers. To do a backsolve, in what order would the following operations be performed?

apply the first column of multipliers
apply the second column of multipliers
perform the first interchange
perform the second interchange
use the first row of the reduced matrix
use the second row of the reduced matrix
use the third row of the reduced matrix

2.4 Scaling

row scaling

If we divide the ith equation of Ax = b by d_ii, we get

(D^{−1}A)x = D^{−1}b where D = diag(d_11, d_22, ..., d_nn). In finite precision the solution x̂ computed by elimination is unaffected by such row scaling if

(i) roundoff error is avoided either by using only powers of β in D or by “implicit scaling,”

(ii) the sequence of pivots is unchanged.

With partial pivoting, however, row scaling may change the choice of pivots and thus indirectly affect x̂. E.g., with β = 10, t = 3,

unscaled system                       scaled system
[ 1  10000 | 10000 ]                  [ .1  1000 | 1000 ]
[ 1  1     | 0     ]                  [ 1   1    | 0    ]
1st pivot from row 1                  1st pivot from row 2
x̂ = [ 0 ]                             x̂ = [ −1 ]
    [ 1 ]                                 [  1 ]

In fact (D & B; A p. 181) "for any given sequence of pivots (which does not give a pivot exactly equal to zero) there exists a scaling of the equations such that partial pivoting will select these pivots."

the residual

If a small residual is wanted, the user should scale the equations so that ‖b − Ax̂‖∞ is an appropriate measure of accuracy. Partial pivoting without further scaling will practically always ensure that this is minuscule. In both examples above

‖b − Ax̂‖∞ / ‖b‖∞ ≈ 10^{−4}.

the error

In the examples above

‖x̂ − x‖∞ / ‖x‖∞ ≈ 1 for the unscaled system,  10^{−4} for the scaled system.

(We assume that the user has scaled the variables so that ‖x̂ − x‖∞ is an appropriate measure of accuracy.) Thus partial pivoting does not necessarily select pivots which give the smallest possible error. Rather the problem of choosing a sequence of pivots has been "reduced" to one of choosing scale factors. Unfortunately (Jennings, p. 119) there is "no simple automatic pre-scaling technique" that works in general. Nonetheless, much is known about scaling:

divide the ith equation by d_ii ≈ |a_i1 x_1| + |a_i2 x_2| + ··· + |a_in x_n| (preferably rounded to the nearest power of β),

J. ACM 26, p. 494. Of course, x is not known, but we might know its components within an order of magnitude. (If they are roughly of the same size, this is equivalent to

d_ii = |a_i1| + |a_i2| + ··· + |a_in|, which is called row-equilibration because the rows of D^{−1}A have ∞-norms all equal to 1.) Even if we have no idea about x, we can obtain an a posteriori check: In the first example

|a_11 x̂_1| + |a_12 x̂_2| = 10000,

|a_21 x̂_1| + |a_22 x̂_2| = 1,

which should warn us that a_11 was a poor choice for the 1st pivot. When in doubt, use the "natural" scaling of the problem.

column scaling

by powers of β has no effect on partial pivoting but does affect complete pivoting.

Review question

1. To what extent can row scaling affect the possibilities for partial pivoting?

2.5 Iterative Improvement

Suppose we have already done the following:

factor A;

x^(1) = (factors of A)^{−1} b;

We have

remainder = x − x^(1) = A^{−1}(b − Ax^(1)) = A^{−1} · residual.

Because division by A is now cheap (n^2 mults), let us try

r^(1) := b − Ax^(1),
d^(1) := (factors of A)^{−1} r^(1),
x^(2) := x^(1) + d^(1)            (2n^2 mults altogether).

example

[ 1  10000 | 10000 ] ,   x = [ −1.00010001··· ]
[ 1  1     | 0     ]         [  1.00010001··· ] ,

for which

L\U = [ 1   10000  ]
      [ 1  −10000  ].

3-digit, round-to-even:

k         1              2
x^(k)     [0; 1]         [−1; 1]
r^(k)     [0; −1]        [0; 0]
d^(k)     [−1; 10^{−4}]  [0; 0]

convergence.

example

[ 7  6.9 | 34.7 ] ,   x = [ 2 ]
[ 4  4   | 20   ]         [ 3 ] ,

k         1
x^(k)     [1.67; 3.33]
r^(k)     [0; 0]
d^(k)     [0; 0]

Note how good partial pivoting was in making the residual small. This algorithm was good at reducing the error in the 1st example, which was badly scaled. It was unsuccessful for the 2nd example, which was ill conditioned, because the computed residual was zero. The residual must be computed in higher precision, which costs very little extra on most computers because the entire higher precision calculation can be done in the registers of the CPU and the exact product of two single precision numbers is always available.

ALGORITHM

factor A;
d = x = (factors of A)^{−1} b;
r = fl(b − Ax computed in higher precision);
d′ = (factors of A)^{−1} r;
while (‖d′‖ < ‖d‖) {
    d = d′;
    x = x + d;
    r = fl(b − Ax computed in higher precision);
    d′ = (factors of A)^{−1} r;
}
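A sketch of the same loop in NumPy, with float32 as the working precision and float64 for the residual (my own illustration, assuming scipy.linalg.lu_factor / lu_solve for the factorization; the Hilbert matrix is just a convenient mildly ill-conditioned test case):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    n = 5
    A64 = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])   # Hilbert matrix
    x_true = np.ones(n)
    b64 = A64 @ x_true

    A, b = A64.astype(np.float32), b64.astype(np.float32)    # working precision data
    factors = lu_factor(A)                                    # factor A once

    x = lu_solve(factors, b).astype(np.float32)               # d = x = (factors of A)^{-1} b
    d = x.copy()
    print(np.linalg.norm(x - x_true, np.inf))                 # error of the initial solve
    for _ in range(6):
        r = (b64 - A64 @ x.astype(np.float64)).astype(np.float32)   # residual in higher precision
        d_new = lu_solve(factors, r).astype(np.float32)
        if np.linalg.norm(d_new) >= np.linalg.norm(d):
            break                                             # corrections no longer shrinking
        d = d_new
        x = x + d
        print(np.linalg.norm(x - x_true, np.inf))             # error drops toward single precision roundoff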

The iteration converges if u κ(A) ≪ 1. Termination of the iteration means either it is not converging or it has converged to the exact solution within a unit of roundoff error.

Example

[ 7  6.9 | 34.7 ] ,   x = [ 2 ]
[ 4  4   | 20   ]         [ 3 ] ,

L\U = [ 7     6.9 ]
      [ .571  .06 ] ,

3-digit round-to-even arithmetic with 6-digit arithmetic for the residual:

k         1              2              3
x^(k)     [1.67; 3.33]   [1.98; 3.02]   [2.00; 3.00]
r^(k)     [.033; 0]      [.002; 0]      [0; 0]
d^(k)     [.313; −.313]  [.019; −.019]

Review questions

1. Describe the algorithm for iterative improvement.

2. Does iterative improvement reduce the residual? reduce the error?

Exercises

1. Redo the example on page 69 for k = 1, 2, 3, 4 with 34.7 and 20 replaced by 34.8 and 20.1, respectively. Show your work clearly.

2. Let

A = [ .2      .16667  .14286 ]           [ .50953 ]
    [ .16667  .14286  .125   ]   and b = [ .43453 ] .
    [ .14286  .125    .11111 ]           [ .37897 ]

Using rounded five-decimal-digit floating-point arithmetic, compute the triangular factorization LU of A and the solution x_1. Perform one iteration of iterative refinement. Compute each of the following estimates of the condition number:

(i) ‖d_1‖∞ / (n u ‖d_0‖∞) (cf. D & B; A eqn. (5.5.18)),

(ii) ‖A‖∞ ‖d_1‖∞ / ‖r_1‖∞.

Note that u = (1/2) × 10^{−4}. Also κ∞(A) ≈ 16454.
