Chapter 2 DIRECT METHODS—PART II


If we use finite precision arithmetic, the results obtained by direct methods are contaminated with roundoff error, which is not always negligible.

2.1 Finite Precision Computation
2.2 Residual vs. Error
2.3 Pivoting
2.4 Scaling
2.5 Iterative Improvement

2.1 Finite Precision Computation

2.1.1 Floating-point numbers

decimal floating-point numbers

The essence of floating-point numbers is best illustrated by an example, such as that of a 3-digit floating-point calculator which accepts numbers like the following:

    123.    50.4    −0.62    −0.02    7.00

Any such number can be expressed as

    ±d1.d2d3 × 10^e  where  e ∈ {0, 1, 2}.    (2.1)

Here

    t := precision = 3,    [L : U] := exponent range = [0 : 2].

The exponent range is rather limited. If the calculator display accommodates scientific notation, e.g.,

    3.46  3  (i.e., 3.46 × 10^3),    −1.56 −3  (i.e., −1.56 × 10^−3),

then we might use [L : U] = [−9 : 9].

Some numbers have multiple representations in form (2.1), e.g., 2.00 × 10^1 = 0.20 × 10^2. Hence, there is a normalization:

• choose the smallest possible exponent,
• choose the + sign for zero,

e.g.,

    0.52 × 10^2 → 5.20 × 10^1,    0.08 × 10^−8 → 0.80 × 10^−9,    −0.00 × 10^0 → +0.00 × 10^−9.

Nonzero numbers of the form ±0.d2d3 × 10^−9 are denormalized. But for large-scale scientific computation base 2 is preferred.

binary floating-point numbers

This is an important matter because numbers like 0.2 do not have finite representations in base 2:

    0.2 = (0.001100110011···)_2.

For the IEEE Standard for Binary Floating Point Arithmetic, the set of machine representable numbers (single precision) is

    (±b1.b2···b24)_2 × 2^e,  where bj ∈ {0, 1}, e ∈ [−126 : 127]  (normalization: b1 = 1 if e ≥ −125),

together with +0, −0, and also −∞, +∞, NaN. Normally +0 and −0 are indistinguishable. The important question is not how the numbers are represented but rather WHICH NUMBERS ARE EXACTLY REPRESENTABLE. Answer:

    integer × 2^power,  where |integer| ≤ 16 777 215 and −149 ≤ power ≤ 104.

You do not have to know binary to understand this. For double precision the set of machine representable numbers is

    (±b1.b2···b53)_2 × 2^e,  where e ∈ [−1022 : 1023].

general floating-point numbers

A set of floating-point numbers (=: machine numbers) can be specified by four parameters:

    β = base,  t = precision,  L = lower limit of exponent,  U = upper limit of exponent.

The set of machine numbers is

    ±d1.d2···dt × β^e  where  di ∈ [0 : β − 1],  e ∈ [L : U].

(We call e the exponent.) It is a finite set of numbers; e.g., β = 2, t = 3, L = −2, U = 1 gives the "3-bit numbers":

[Figure: a number line marking the machine numbers −7/2, −3, −2, −1, −1/2, −1/4, 0, 1/4, 1/2, 1, 2, 3, 7/2, with the denormalized numbers clustered near zero.]

2.1.2 Rounding

    rounding         round to nearest; in a tie, round away from zero
    round-to-even    round to nearest; in a tie, round to the number whose last digit is even
    chopping         round toward zero

E.g., for β = 10, t = 3, L = −9, U = 9:

    number           chopping      rounding      round-to-even
    99.9499          99.9          99.9          99.9
    99.95            99.9          100.          100.
    −99.95           −99.9         −100.         −100.
    99.85            99.8          99.9          99.8
    9.995 × 10^9     9.99 × 10^9   overflow      overflow
    9.995 × 10^−10   0.99 × 10^−9  1.00 × 10^−9  1.00 × 10^−9
    99.851           99.8          99.9          99.9

We use fl(x) to denote the result of rounding x. If fl(x) ≠ x, then we have a roundoff error.

definition of error (for floating-point computation)

(The number of exactly correct digits is for most purposes irrelevant.) There is yet another measure of accuracy which is meaningful and useful in a fixed precision context. Suppose α̃ = d1.d2···d(t−1)dt × β^e (normalized) approximates α. Define

    1 ulp = 0.0···01 × β^e = β^(e−t+1),  one unit in the last place.

Then

    error in ulps := (α̃ − α) / β^(e−t+1).

E.g., in a 3-digit environment, π ≈ 3.14 has an error of −0.1592 ulps.
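Python's decimal module implements precisely this kind of base-10, t-digit arithmetic, so the rounding table and the ulp example can be checked mechanically. A minimal sketch, assuming Python 3 and nothing beyond the standard library (decimal's ROUND_HALF_UP is the "rounding" rule above, i.e., ties away from zero):

```python
import math
from decimal import (Context, Decimal, Overflow,
                     ROUND_DOWN, ROUND_HALF_UP, ROUND_HALF_EVEN)

# beta = 10, t = 3, [L:U] = [-9:9], as in the table above.
MODES = [("chopping",      ROUND_DOWN),       # round toward zero
         ("rounding",      ROUND_HALF_UP),    # ties away from zero
         ("round-to-even", ROUND_HALF_EVEN)]  # ties to even last digit

NUMBERS = ["99.9499", "99.95", "-99.95", "99.85",
           "9.995E+9", "9.995E-10", "99.851"]

for name, mode in MODES:
    ctx = Context(prec=3, Emin=-9, Emax=9, rounding=mode)
    row = []
    for s in NUMBERS:
        try:
            row.append(str(ctx.plus(Decimal(s))))  # unary plus applies fl(.) in ctx
        except Overflow:                           # Overflow is trapped by default
            row.append("overflow")
    print(f"{name:>13}: {row}")

# error in ulps: pi ~ 3.14 in the 3-digit environment, 1 ulp = 10**(0-3+1)
print((Decimal("3.14") - Decimal(repr(math.pi))) / Decimal("0.01"))  # -0.1592...
```

Each printed row matches the table, including the overflow of 9.995 × 10^9 under the two round-to-nearest rules and the denormalized result for 9.995 × 10^−10; the last line reproduces the −0.1592 ulps error of π ≈ 3.14.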
Clearly |error| ≤ 0.5 ulp represents perfection, and a routine that delivers sin(x) to within 1 ulp is doing very well.

a single roundoff error

Let x be a representable real, meaning that its rounded value has an exponent that is in range. If x is a machine number, then fl(x) = x and δ = 0 below. Assume then that x is not a machine number, so that, in particular, it is nonzero and can be expressed in normalized form

    x = β^e × f,  1 ≤ |f| < β.

As shown below, x lies between two consecutive machine numbers (if for the moment we ignore out-of-range exponents), which differ by one unit in the t-th place of x:

[Figure: x bracketed by two consecutive machine numbers, which are a distance β^e × (0.00···01)_β = β^(e−t+1) apart.]

The number x is rounded to the nearer of the two bracketing numbers, so the rounding error is at most half the distance between them:

    |error| ≤ (1/2) × β^(e−t+1).

Equality occurs when x is in the middle and gets rounded in either direction. It is the relative error |error/x| that we wish to bound, so we need a lower bound on x, which can be obtained from the normalization assumption:

    |x| = β^e × |f| > β^e × 1 = β^e.

The strict inequality is justified because x is not a machine number. Combine the two inequalities to get

    |relative error| < ((1/2) × β^(e−t+1)) / β^e = (1/2) β^(1−t),  e.g., 2^−24 in single precision.

This can be expressed as

    fl(x) = x(1 + δ),

where the relative roundoff error δ satisfies¹

    |δ| < u := β^(1−t) for chopping,  (1/2) β^(1−t) for rounding and round-to-even.

Often u is called the unit roundoff error.

¹ For normalized numbers. This can be modified to include denormalized numbers.

2.1.3 Floating-point arithmetic

Let ◦ denote one of +, −, *, /. If a, b are machine numbers, then the floating-point operation ◦̂ is defined by

    a ◦̂ b = fl(a ◦ b).

E.g., 9.99 −̂ .00501 = fl(9.98499) = 9.98. Thus in principle we first compute the result to infinite precision; in practice, computers do it some other way, but the end result is the same. (The details of how computers do it are an entirely different topic.) We have

    a ◦̂ b = fl(a ◦ b) = (a ◦ b)(1 + δ),  |δ| ≤ u.

Thus a single operation a ◦̂ b is always performed very accurately. The same holds for input/output, e.g., in single precision 0.1F = 0.1(1 + 2^−26).

cancellation occurs when the result has fewer digits than the operands and hence the result is a machine number:

    CANCELLATION ⟹ NO ROUNDOFF ERROR.

Elementary functions can be evaluated with an error of less than 1 ulp.

2.1.4 Algorithms and numerical instability

algorithms

An algorithm repeatedly decomposes a problem into subproblems, e.g.,

    solve for x1, x2, …, xn  −→  if (n > 1) { eliminate x1; solve for x2, …, xn; } solve for x1;

    polynomial interpolation  −→  set up system of linear equations; solve system of linear equations;

In general, a problem f : R^m → R^n, y = f(x), is decomposed as

    R^m —f1→ R^p —f2→ R^n,  i.e.,  z = f1(x); y = f2(z),  f = f2 ◦ f1.

Thus, an algorithm is the repeated decomposition of problems (mappings) into smaller subproblems until all subproblems can be solved by existing hardware and software. Different decompositions yield different algorithms.

propagated and accumulated roundoff error

Below, FE and FÊ denote, respectively, forward elimination using exact and 3-digit rounded arithmetic; similarly BS and BŜ denote back substitution. Start from the system

    3.00 x1 + 4.13 x2 = 15.4
    1.00 x1 + 1.37 x2 = 5.15

FE (exact) reduces this to

    3.00 x1 + 4.13 x2 = 15.4
        −.0066666… x2 = .0166666…

while FÊ (3-digit) produces

    3.00 x1 + 4.13 x2 = 15.4
            −.0100 x2 = .0200

— the 0.02% difference between the two reduced systems is the FÊ error. Now:

    BS (exact) on the exact system gives the exact solution (8.575, −2.5).
    BS (exact) on the 3-digit system gives (7.8866…, −2); the 9% difference is the propagated FÊ error.
    BŜ (3-digit) on the 3-digit system gives (7.90, −2.00); the 1% difference is the BŜ error.

The 9% difference between the exact solution (8.575, −2.5) and the computed solution (7.90, −2.00) is the accumulated error. For comparison, the unit roundoff error is u = 0.5%.

(Question: why do we consider FÊ; BS rather than FE; BŜ? Answer: because the 3-digit arithmetic used in BŜ is defined only for 3-digit operands.)
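The diagram can be replayed with Python's decimal module standing in for the 3-digit calculator. The sketch below is ours (the notes give no code): the same elimination/substitution routine runs once with exact rational arithmetic and once with 3-digit rounded arithmetic.

```python
from decimal import Context, Decimal, ROUND_HALF_UP
from fractions import Fraction

def fe_bs(num, div, sub, mul):
    """FE then BS on  3.00 x1 + 4.13 x2 = 15.4 ;  1.00 x1 + 1.37 x2 = 5.15,
    with the number type and arithmetic operations supplied by the caller."""
    a11, a12, b1 = num("3.00"), num("4.13"), num("15.4")
    a21, a22, b2 = num("1.00"), num("1.37"), num("5.15")
    m    = div(a21, a11)               # FE: multiplier
    a22p = sub(a22, mul(m, a12))       # FE: eliminate x1 from row 2
    b2p  = sub(b2,  mul(m, b1))
    x2   = div(b2p, a22p)              # BS: last unknown first
    x1   = div(sub(b1, mul(a12, x2)), a11)
    return x1, x2

# FE; BS with exact rational arithmetic: (343/40, -5/2) = (8.575, -2.5)
print(fe_bs(Fraction, lambda a, b: a / b, lambda a, b: a - b, lambda a, b: a * b))

# 3-digit rounded arithmetic, as in the diagram
c3 = Context(prec=3, rounding=ROUND_HALF_UP)
print(fe_bs(Decimal, c3.divide, c3.subtract, c3.multiply))
```

The intermediate quantities agree with the diagram: under c3 the multiplier is .333 and the reduced second row is −.0100 x2 = .0200, and back substitution returns (7.9, −2) — decimal division does not retain trailing zeros — matching (7.90, −2.00) above.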
catastrophic cancellation

Consider (1000 +̂ 1) −̂ 1000 in 3-digit arithmetic:

    (1000 +̂ 1) −̂ 1000 = fl(1001) −̂ 1000 = 1000 −̂ 1000 = 0,

whereas the exact result is 1001 − 1000 = 1. The subtraction was performed exactly. The 0.1% addition error was amplified to a 100% error in the result because of cancellation.

Another example:

    1.07 ×̂ 1.07 −̂ 1.14 = fl(1.1449) −̂ 1.14 = 1.14 −̂ 1.14 = 0,

whereas the exact result is 1.1449 − 1.14 = .0049. The multiplication error is −.0049; the subtraction is exact. The relative error −.0049/1.1449 is amplified to become −.0049/.0049.

    CANCELLATION ⟹ ERROR AMPLIFICATION.

Often an expression has several arrangements which are mathematically equivalent.

Example. Suppose x2 ≈ x1 ≈ x0; take x2 = .723, x1 = .711, x0 = .701.

    formula 1: x2 − 2*x1 + x0 in 3-digit precision:
        .723 − 2*.711 + .701 = .723 − 1.42 + .701 = −.697 + .701 = .004

    formula 2: (x2 − x1) − (x1 − x0) in 3-digit precision:
        (.723 − .711) − (.711 − .701) = .012 − .010 = .002,  exact!

Moral: cancellation is good if the operands have no computational error, e.g., if they are inputs to the algorithm. Example: change sin x − sin y to 2 sin((x − y)/2) cos((x + y)/2). (There is a calculus of finite differences that enables one to perform such transformations.) Therefore, if cancellation cannot be avoided, then it should occur at the beginning of the computation.

numerical instability

For a given problem y = x^2 − 1 there may be several algorithms:

    1. y = (x ×̂ x) −̂ 1,
    2. y = (x −̂ 1) ×̂ (x +̂ 1).

Both of these algorithms introduce a few very tiny roundoff errors, and yet algorithm 1 can sometimes yield a result y with a relative error far greater than the unit roundoff error.
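The instability of algorithm 1 is easy to exhibit in IEEE single precision. A small sketch, assuming NumPy; the test point x = 1 + 2^−12 is our choice, not from the notes:

```python
import numpy as np

u = 2.0 ** -24                                    # unit roundoff, single precision
x = np.float32(1 + 2.0 ** -12)                    # a machine number just above 1

alg1 = x * x - np.float32(1)                      # y = (x *^ x) -^ 1
alg2 = (x - np.float32(1)) * (x + np.float32(1))  # y = (x -^ 1) *^ (x +^ 1)
exact = 2.0 ** -11 + 2.0 ** -24                   # x^2 - 1, exact (fits in a double)

for name, y in (("algorithm 1", alg1), ("algorithm 2", alg2)):
    rel = abs(float(y) - exact) / exact
    print(f"{name}: y = {float(y):.9e}, relative error = {rel / u:.0f} u")
```

Here x ×̂ x rounds 1 + 2^−11 + 2^−24 to 1 + 2^−11, and the final subtraction cancels exactly, exposing that error: algorithm 1 comes out about 2048u off, while algorithm 2, whose cancellation involves only exact inputs, is exact for this x.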