
Error bounds of complex arithmetic

Michael Baudin June 2011

Abstract. In this document, we present error bounds for complex arithmetic. We review the error bounds which are available in the bibliography. We give the proof of some error bounds for the complex addition, multiplication and division. A new proof for the error bound associated with the complex division is presented. This proof leads to an error bound which is smaller than other published bounds.

Contents

1 Introduction

2 Useful results for error bounds
  2.1 Properties of γn
  2.2 Properties of θj
  2.3 Exercises

3 Error bounds for complex arithmetic
  3.1 Standard arithmetic model
  3.2 Error bound for complex addition
  3.3 Error bound for complex multiplication
  3.4 Other error bounds for complex multiplication
  3.5 Error bound for complex division
  3.6 Other error bounds for the naive complex division algorithm
  3.7 Error bounds for Smith's complex division algorithm
  3.8 Exercises

4 Notes and references

5 Acknowledgements

6 Answers
  6.1 Answers for section 2.3
  6.2 Answers for section 3

Copyright 2011 - Michael Baudin. This file must be used under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License:

http://creativecommons.org/licenses/by-sa/3.0

Parameter                    Symbol  Value
Radix                        β       2
Precision                    p       53
Exponent bits                        11
Minimum exponent             emin    −1022
Maximum exponent             emax    1023
Largest positive normal      Ω       $(2 - 2^{-52}) \cdot 2^{1023} \approx 1.79 \times 10^{308}$
Smallest positive normal     µ       $2^{-1022} \approx 2.22 \times 10^{-308}$
Smallest positive subnormal  α       $2^{-1074} \approx 4.94 \times 10^{-324}$
Epsilon                      ε       $2^{-52} \approx 2.220 \times 10^{-16}$
Unit roundoff                u       $2^{-53} \approx 1.110 \times 10^{-16}$

Figure 1: IEEE 754 doubles

1 Introduction

We present in this section the parameters of the floating point system which is used in Scilab, that is, we present binary64 doubles.

In Scilab, we use double precision floating point numbers, which correspond to the binary64 floating point numbers of the IEEE 754-2008 standard [6]. In this floating point system, the significand makes use of p = 53 binary digits of precision and the exponent makes use of 11 binary digits. The parameters of the IEEE 754 double precision floating point system are presented in the figure 1.

The benefit of the IEEE 754-2008 standard [6] is that it increases the portability of programs which make use of floating point numbers, and especially those written in the Scilab language. Indeed, on all machines where Scilab is available, the radix, the precision and the number of bits in the exponent are the same. As a consequence, the machine epsilon ε always has the same value for doubles. The same holds for the largest positive normal Ω and the smallest positive normal µ, which are always equal to the values presented in the figure 1. The situation for the smallest positive subnormal α is more complex, but, in this paper, we assume that gradual underflow is available.

Doubles have intrinsic limitations which can lead to rounding errors, overflow or underflow. Indeed, not all mathematical numbers x can be represented as floating point numbers. We denote by fl(x) the floating point representation of x. In this paper, we are mostly concerned with overflow and underflow. Indeed, if x is a number such that |x| < α, then the floating point representation of x is fl(x) = 0, i.e. x underflows. On the other hand, if x is a number such that |x| > Ω, then the floating point representation of x is fl(x) = Inf, i.e. x overflows.
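The parameters of the figure 1 can be recomputed directly from their definitions. The following lines are only a minimal sketch: the variable names Omega, mu and alpha are chosen here for illustration and are not Scilab built-ins, and the last two lines assume that gradual underflow is available and that overflow produces Inf, as stated above.

```scilab
// Parameters of binary64 doubles, recomputed from their definitions.
// Omega, mu and alpha are illustrative names, not Scilab built-ins.
Omega = (2 - 2^(-52)) * 2^1023;  // largest positive normal
mu    = 2^(-1022);               // smallest positive normal
alpha = 2^(-1074);               // smallest positive subnormal
u     = %eps / 2;                // unit roundoff, half the machine epsilon
mprintf("Omega = %e\n", Omega);
mprintf("mu    = %e\n", mu);
mprintf("alpha = %e\n", alpha);
mprintf("eps   = %e, u = %e\n", %eps, u);
// Overflow and underflow
disp(2 * Omega);   // Inf: the result is larger than Omega
disp(alpha / 4);   // 0: the result is too small to be represented
```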

2 Useful results for error bounds

In this section, we derive propositions which are useful in the context of error analysis. These results are presented by Higham [5].

2.1 Properties of γn

In this section, we assume that u, the unit roundoff, is a positive real number such that 0 < u < 1. The unit roundoff is half the machine epsilon, i.e. $u = \frac{1}{2}\epsilon$.

 n    γn/(nu)                 n        γn/(nu)
 1    1.00000000000000020     10^1     1.00000000000000110
 2    1.00000000000000020     10^2     1.00000000000001110
 3    1.00000000000000020     10^3     1.00000000000011100
 4    1.00000000000000040     10^4     1.00000000000111020
 5    1.00000000000000040     10^5     1.00000000001110220
 6    1.00000000000000070     10^6     1.00000000011102230
 7    1.00000000000000070     10^7     1.00000000111022300
 8    1.00000000000000090     10^8     1.00000001110223050
 9    1.00000000000000090     10^9     1.00000011102231490
10    1.00000000000000110     10^10    1.00000111022425720
11    1.00000000000000130     10^11    1.00001110235350700
12    1.00000000000000130     10^12    1.00011103462978280
13    1.00000000000000160     10^13    1.00111145698976610
14    1.00000000000000160     10^14    1.01122687358170120
15    1.00000000000000160     10^15    1.12488761278269790

Figure 2: Approximations of $\gamma_n/(nu)$ to 17 decimal digits. We consider the unit roundoff u which corresponds to binary64 (double precision) floating point numbers.

Definition 2.1. (Gamma-n) For n = 1, 2, ... such that nu < 1, let us define $\gamma_n$ by
\[ \gamma_n = \frac{nu}{1 - nu}. \tag{1} \]

The figure 2 presents approximations of the ratio $\gamma_n/(nu)$, for increasing values of n. For $n \le 10^{15}$, we have $\gamma_n \le 1.13\,nu$. The following gamman function returns the value of $\gamma_n$ which corresponds to the given n.

    function y = gamman(n)
        // u is the unit roundoff of doubles
        u = %eps / 2
        // Definition (1), applied elementwise to n
        y = (n .* u) ./ (1 - n .* u)
    endfunction

In the following session, we compute the unit roundoff u and the ratio $\gamma_n/u$ for increasing values of n. Here, n is a column vector which contains the powers $2^0, 2^4, 2^8, \ldots, 2^{52}$.

    -->u=%eps/2
     u  =
        1.110D-16
    -->[n gamman(n)/u]
     ans  =
        1.           1.
        16.          16.
        256.         256.
        4096.        4096.
        65536.       65536.
        1048576.     1048576.
        16777216.    16777216.
        2.684D+08    2.684D+08
        4.295D+09    4.295D+09
        6.872D+10    6.872D+10
        1.100D+12    1.100D+12
        1.759D+13    1.763D+13
        2.815D+14    2.906D+14
        4.504D+15    9.007D+15

Notice that the equality (1) can be applied even if n is a real number greater than 1. In some results that we are going to derive, we will consider real numbers n, as this does not change the definition.

In this section, we will regularly use the following result from calculus, which describes the geometric series.

Proposition 2.2. (Geometric series) Assume that $x \in \mathbb{R}$ is such that $|x| < 1$. Then,
\[ \frac{1}{1 - x} = 1 + x + x^2 + x^3 + x^4 + \ldots \tag{2} \]

Before entering into more complex developments, let us consider more elementary properties of $\gamma_n$.

Proposition 2.3. (Basic properties of γn) Assume that $j, k, n \in \mathbb{R}$ are three real numbers such that $j, k, n \ge 1$ and $ju, ku, nu < 1$. If $j \le k$, then
\[ \gamma_j \le \gamma_k. \tag{3} \]
If $jku < 1$, then
\[ j\gamma_k \le \gamma_{jk}. \tag{4} \]
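The two inequalities of the proposition 2.3 can be spot-checked numerically. The sketch below is an illustration only: it assumes a toy unit roundoff u = 2^(-10) instead of the binary64 value, so that the gaps between both sides are far larger than the rounding errors of the check itself, and the values of j and k are arbitrary real numbers chosen for this example.

```scilab
// Toy unit roundoff (an assumption made for this illustration only)
u = 2^(-10);
// Arbitrary real values such that 1 <= j <= k and j*k*u < 1
j = 2.5;
k = 7.25;
gj  = j*u / (1 - j*u);        // gamma_j, from definition (1)
gk  = k*u / (1 - k*u);        // gamma_k
gjk = j*k*u / (1 - j*k*u);    // gamma_{jk}
disp(gj <= gk);               // inequality (3)
disp(j*gk <= gjk);            // inequality (4)
```

Both lines are expected to display T for these values.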

Notice that we have chosen to consider the case where j, k, n are real numbers instead of the more restrictive case where they are integers. As we are going to see, this does not change the results that we are going to prove.

Proof. We first prove the inequality (3). Assume that $j \le k$. Since $u > 0$, we have
\[ ju \le ku. \tag{5} \]
The hypothesis $ju < 1$ implies that $1 - ju$ is positive, hence we can multiply both sides of the previous inequality by $\frac{1}{1 - ju}$ and get
\[ \frac{ju}{1 - ju} \le \frac{ku}{1 - ju}. \tag{6} \]
Moreover, the inequality (5) implies $-ju \ge -ku$, which implies $1 - ju \ge 1 - ku$. The hypothesis $ku < 1$ implies that $1 - ku > 0$, which allows us to invert the inequality and implies
\[ \frac{1}{1 - ju} \le \frac{1}{1 - ku}. \tag{7} \]
We plug the inequality (7) into (6) and get
\[ \frac{ju}{1 - ju} \le \frac{ku}{1 - ku}, \tag{8} \]
which immediately leads to the inequality (3).

We now prove the inequality (4). We have
\[ j\gamma_k = \frac{jku}{1 - ku}. \tag{9} \]
The hypothesis $j \ge 1$ implies $k \le jk$. Then the inequality $u > 0$ implies $ku \le jku$. Hence $1 - ku \ge 1 - jku$. The hypothesis $jku < 1$ implies that $1 - jku > 0$. Hence, we can invert the previous inequality, which implies
\[ \frac{1}{1 - ku} \le \frac{1}{1 - jku}. \tag{10} \]
We plug the inequality (10) into (9), which implies
\[ j\gamma_k \le \frac{jku}{1 - jku}. \tag{11} \]
Then, the definition of $\gamma_{jk}$ concludes the proof.

In most error analyses, we consider terms of the form $1 + \delta$, where δ is a relative error of small magnitude, i.e. such that $|\delta| \le u$. In the following proposition, we prove that products of terms $1 + \delta_i$ can be expressed as $1 + \theta_n$, where $|\theta_n|$ is bounded by a quantity only slightly larger than nu.

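Proposition 2.4. (Bound for a product of (1 + δi)) Assume that n = 1, 2, ... is such that nu < 1. Assume that $\delta_i \in \mathbb{R}$ is such that $|\delta_i| \le u$, for i = 1, 2, ..., n. Then there exists a $\theta_n \in \mathbb{R}$ such that
\[ (1 + \delta_1)(1 + \delta_2) \cdots (1 + \delta_n) = 1 + \theta_n, \tag{12} \]
where
\[ |\theta_n| \le \gamma_n, \tag{13} \]
and where $\gamma_n$ is defined by (1).

Proof. We first check that this proposition is true for n = 1 and n = 2.

It is easy to check this proposition for n = 1. We assume that $\delta_1 \in \mathbb{R}$ is such that $|\delta_1| \le u$. Then $1 + \delta_1 = 1 + \theta_1$, with $\theta_1 = \delta_1$. Moreover, $|\theta_1| = |\delta_1| \le u$. But $u > 0$, by hypothesis. Therefore, $-u < 0$, which implies $1 - u < 1$. Hence, $1 < \frac{1}{1 - u}$. This leads to $|\theta_1| \le u \le \frac{u}{1 - u}$.

We can also check this proposition for n = 2. We assume that $\delta_i \in \mathbb{R}$ is such that $|\delta_i| \le u$, for i = 1, 2. We have
\[ (1 + \delta_1)(1 + \delta_2) = 1 + \delta_1 + \delta_2 + \delta_1\delta_2. \tag{14} \]
Hence,
\[ (1 + \delta_1)(1 + \delta_2) = 1 + \theta_2, \tag{15} \]
where $\theta_2 = \delta_1 + \delta_2 + \delta_1\delta_2$. We have to prove that $|\theta_2| \le \gamma_2$. We consider the absolute value of $\theta_2$, and get
\begin{align}
|\theta_2| &\le |\delta_1| + |\delta_2| + |\delta_1||\delta_2| \tag{16} \\
&\le u + u + u \cdot u \tag{17} \\
&\le 2u + u^2 \tag{18} \\
&\le 2u\left(1 + \frac{u}{2}\right). \tag{19}
\end{align}
We can conclude if we can prove that $1 + \frac{u}{2} \le \frac{1}{1 - 2u}$. But this is true, since
\begin{align}
1 + \frac{u}{2} &\le 1 + 2u \tag{20} \\
&\le 1 + 2u + (2u)^2 + (2u)^3 + \ldots \tag{21} \\
&\le \frac{1}{1 - 2u}, \tag{22}
\end{align}
since, by hypothesis, $2u < 1$. This concludes the proof for the case n = 2.

In order to prove the general case, we proceed by induction on n. Let us assume that (12) is true for n, and let us prove that it is true for n + 1, under the hypothesis $(n + 1)u < 1$. Let us denote the product of n terms $1 + \delta_i$ by $P_n$, i.e. let $P_n$ be defined by
\[ P_n = (1 + \delta_1)(1 + \delta_2) \cdots (1 + \delta_n). \tag{23} \]
Hence
\[ P_{n+1} = P_n(1 + \delta_{n+1}). \tag{24} \]
By hypothesis, $P_n = 1 + \theta_n$, where $|\theta_n| \le \gamma_n$. Hence,
\begin{align}
P_{n+1} &= (1 + \theta_n)(1 + \delta_{n+1}) \tag{25} \\
&= 1 + \delta_{n+1} + \theta_n(1 + \delta_{n+1}). \tag{26}
\end{align}
Let us define $\theta_{n+1}$ by
\[ \theta_{n+1} = \delta_{n+1} + \theta_n(1 + \delta_{n+1}). \tag{27} \]
Then we have $P_{n+1} = 1 + \theta_{n+1}$ and we must prove that $|\theta_{n+1}| \le \gamma_{n+1}$. We have
\begin{align}
|\theta_{n+1}| &\le |\delta_{n+1}| + |\theta_n||1 + \delta_{n+1}| \tag{28} \\
&\le u + \gamma_n(1 + |\delta_{n+1}|) \tag{29} \\
&\le u + \frac{nu}{1 - nu}(1 + u). \tag{30}
\end{align}
It is easy to simplify this expression, which leads to
\begin{align}
u + \frac{nu}{1 - nu}(1 + u) &= \frac{u(1 - nu) + nu(1 + u)}{1 - nu} \tag{31} \\
&= \frac{u - nu^2 + nu + nu^2}{1 - nu} \tag{32} \\
&= \frac{u + nu}{1 - nu} \tag{33} \\
&= \frac{(n + 1)u}{1 - nu}. \tag{34}
\end{align}
Therefore,
\[ |\theta_{n+1}| \le \frac{(n + 1)u}{1 - nu}. \tag{35} \]
By hypothesis, we have $0 < nu < (n + 1)u < 1$, which implies $0 < 1 - (n + 1)u < 1 - nu < 1$ and
\[ 1 < \frac{1}{1 - nu} < \frac{1}{1 - (n + 1)u}. \tag{36} \]
We plug the inequality (36) into (35), and get
\[ |\theta_{n+1}| \le \frac{(n + 1)u}{1 - (n + 1)u}, \tag{37} \]
which concludes the proof.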
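The bound of the proposition 2.4 can be illustrated on the worst case where every $\delta_i$ is equal to u, so that $\theta_n = (1 + u)^n - 1$. The following lines are a minimal sketch under an explicit assumption: the unit roundoff is replaced by the toy value u = 2^(-10), because with the true binary64 value the quantities $1 + \delta_i$ cannot be formed without rounding.

```scilab
// Toy unit roundoff (an assumption made for this illustration only)
u = 2^(-10);
for n = [1 2 5 10 50]
    theta = (1 + u)^n - 1;       // worst case: every delta_i is equal to u
    gam   = n*u / (1 - n*u);     // gamma_n, from definition (1)
    ok    = bool2s(abs(theta) <= gam);
    mprintf("n = %2d  theta = %.10e  gamma_n = %.10e  ok = %d\n", n, theta, gam, ok);
end
```

In each row, theta is expected to stay below gamma_n, with a gap which grows with n.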

As suggested by the proposition 2.4, $\gamma_n$ plays an important role in the derivation of error bounds. Indeed, it is useful to have intermediate propositions on the properties of $\gamma_n$ which simplify these derivations.

Proposition 2.5. (Sums of γn) Assume that j, k = 1, 2, ... are such that $ju < 1$, $ku < 1$ and $(j + k)u < 1$. Then,
\[ \gamma_j + \gamma_k + \gamma_j\gamma_k \le \gamma_{j+k}. \tag{38} \]

Since $\gamma_j\gamma_k > 0$, the previous inequality proves the simpler inequality
\[ \gamma_j + \gamma_k \le \gamma_{j+k}. \tag{39} \]
We emphasize that the previous inequality is relatively sharp, because both $\gamma_j$ and $\gamma_k$ are much smaller than 1 in practice. Hence, in practice, the product $\gamma_j\gamma_k$ can be neglected compared to $\gamma_j$ or $\gamma_k$.

We notice that $1 < \frac{1}{1 - u}$, since $0 < u < 1$. Therefore, $u < \frac{u}{1 - u} = \gamma_1$. The inequality (39) then implies
\[ \gamma_j + u < \gamma_j + \gamma_1 \le \gamma_{j+1}. \tag{40} \]
The previous inequality can be used in some situations.

Proof. To derive this inequality, we expand the expressions and express the result in terms of j + k. From the definition (1), we get:
\begin{align}
\gamma_j + \gamma_k + \gamma_j\gamma_k &= \frac{ju}{1 - ju} + \frac{ku}{1 - ku} + \frac{ju}{1 - ju}\,\frac{ku}{1 - ku} \tag{41} \\
&= \frac{ju(1 - ku) + ku(1 - ju) + (ju)(ku)}{(1 - ju)(1 - ku)} \tag{42} \\
&= \frac{ju - jku^2 + ku - jku^2 + jku^2}{1 - ju - ku + jku^2} \tag{43} \\
&= \frac{ju + ku - jku^2}{1 - (j + k)u + jku^2} \tag{44} \\
&= \frac{(j + k)u - jku^2}{1 - (j + k)u + jku^2}. \tag{45}
\end{align}
On one hand, we have $-jku^2 \le 0$, which implies
\[ (j + k)u - jku^2 \le (j + k)u. \tag{46} \]
On the other hand, we have $jku^2 \ge 0$, which implies
\[ 1 - (j + k)u + jku^2 \ge 1 - (j + k)u. \tag{47} \]
The hypothesis $(j + k)u < 1$ implies that both sides of the previous inequality are greater than zero. Hence, we can invert the inequality and get
\[ \frac{1}{1 - (j + k)u + jku^2} \le \frac{1}{1 - (j + k)u}. \tag{48} \]
We plug the inequalities (46) and (48) into (45), and get
\[ \gamma_j + \gamma_k + \gamma_j\gamma_k \le \frac{(j + k)u}{1 - (j + k)u} = \gamma_{j+k}, \tag{49} \]
which concludes the proof.
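As a sanity check of the inequality (38), the sketch below evaluates both sides on a small grid of integers j and k. It is an illustration under a stated assumption: the toy unit roundoff u = 2^(-10) is used instead of the binary64 value, so that the gap between both sides, of the order of $jku^2$, dominates the rounding errors of the comparison.

```scilab
// Toy unit roundoff (an assumption made for this illustration only)
u = 2^(-10);
n = 1:40;
g = (n*u) ./ (1 - n*u);      // g(n) is gamma_n for n = 1, ..., 40
ok = %t;
for j = 1:20
    for k = 1:20
        lhs = g(j) + g(k) + g(j)*g(k);   // left hand side of (38)
        ok = ok & (lhs <= g(j + k));     // right hand side is gamma_{j+k}
    end
end
disp(ok);    // T if (38) holds on the whole grid
```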

2.2 Properties of θj

When two expressions are multiplied in an arithmetic operation, this leads to an error in the form of $(1 + \theta_j)(1 + \theta_k)$. The proposition 2.5 can be used to express these products, as shown in the following proposition.

Proposition 2.6. (The product of two (1 + θj)s) Assume that j, k = 1, 2, ... are such that $ju < 1$, $ku < 1$ and $(j + k)u < 1$. Assume that $\theta_j, \theta_k \in \mathbb{R}$ are such that $|\theta_j| \le \gamma_j$ and $|\theta_k| \le \gamma_k$, where $\gamma_j$ is defined by (1). Then,
\[ (1 + \theta_j)(1 + \theta_k) = 1 + \theta_{j+k}, \tag{50} \]
for some $\theta_{j+k} \in \mathbb{R}$ which satisfies $|\theta_{j+k}| \le \gamma_{j+k}$.

Proof. Indeed, we have
\[ (1 + \theta_j)(1 + \theta_k) = 1 + \theta_j + \theta_k + \theta_j\theta_k. \tag{51} \]
The previous equation can be written as $(1 + \theta_j)(1 + \theta_k) = 1 + \theta_{j+k}$, where $\theta_{j+k}$ is defined by
\[ \theta_{j+k} = \theta_j + \theta_k + \theta_j\theta_k. \tag{52} \]
In order to conclude the proof, we now have to prove that $|\theta_{j+k}| \le \gamma_{j+k}$. This is easy since, by hypothesis,
\begin{align}
|\theta_{j+k}| &\le |\theta_j| + |\theta_k| + |\theta_j||\theta_k| \tag{53} \\
&\le \gamma_j + \gamma_k + \gamma_j\gamma_k. \tag{54}
\end{align}
We then apply the proposition 2.5, which immediately concludes the proof.

It may happen that an arithmetic operation leads to the sum of two expressions, which may lead to an expression such as $a(1 + \theta_j) + b(1 + \theta'_j)$, for some real numbers a and b. The following proposition can be used in these cases.

Proposition 2.7. (Linear combination of θjs (part 1)) Assume that j = 1, 2, ... is such that $ju < 1$. Assume that $\theta_j, \theta'_j \in \mathbb{R}$ are such that $|\theta_j|, |\theta'_j| \le \gamma_j$, where $\gamma_j$ is defined by (1). Assume that $a, b \in \mathbb{R}$. Then,
\[ a\theta_j + b\theta'_j = (|a| + |b|)\theta''_j, \tag{55} \]
for some $\theta''_j$ which satisfies $|\theta''_j| \le \gamma_j$.

Proof. If $|a| + |b| = 0$, then the proposition is proved trivially, since both a and b are zero. Let us assume that $|a| + |b| \ne 0$ and let us define
\[ \theta''_j = \frac{a\theta_j + b\theta'_j}{|a| + |b|}. \tag{56} \]
We have
\begin{align}
|\theta''_j| &= \frac{|a\theta_j + b\theta'_j|}{|a| + |b|} \tag{57} \\
&\le \frac{|a||\theta_j| + |b||\theta'_j|}{|a| + |b|} \tag{58} \\
&\le \frac{|a|\gamma_j + |b|\gamma_j}{|a| + |b|} \tag{59} \\
&\le \gamma_j, \tag{60}
\end{align}
which concludes the proof.

We can apply the proposition 2.7 in the particular case where both a and b are positive. In this case, we have
\begin{align}
a(1 + \theta_j) + b(1 + \theta'_j) &= a + b + a\theta_j + b\theta'_j \tag{61} \\
&= a + b + (a + b)\theta''_j, \tag{62}
\end{align}
since $|a| = a$ and $|b| = b$. Therefore,
\[ a(1 + \theta_j) + b(1 + \theta'_j) = (a + b)(1 + \theta''_j), \tag{63} \]
which leads to the following proposition.

Proposition 2.8. (Combination of 1 + θjs (part 2)) Assume that j = 1, 2, ... is such that $ju < 1$. Assume that $\theta_j, \theta'_j \in \mathbb{R}$ are such that $|\theta_j|, |\theta'_j| \le \gamma_j$, where $\gamma_j$ is defined by (1). Assume that $a, b \ge 0$. Then,
\[ a(1 + \theta_j) + b(1 + \theta'_j) = (a + b)(1 + \theta''_j), \tag{64} \]
for some $\theta''_j$ which satisfies $|\theta''_j| \le \gamma_j$.

Proof. If $|a| + |b| = 0$, then the proposition is proved trivially, since both a and b are zero. Let us assume that $|a| + |b| \ne 0$ and let us define
\[ \theta''_j = \frac{a\theta_j + b\theta'_j}{|a| + |b|}. \tag{65} \]
We have
\begin{align}
|\theta''_j| &= \frac{|a\theta_j + b\theta'_j|}{|a| + |b|} \tag{66} \\
&\le \frac{|a||\theta_j| + |b||\theta'_j|}{|a| + |b|} \tag{67} \\
&\le \frac{|a|\gamma_j + |b|\gamma_j}{|a| + |b|} \tag{68} \\
&\le \gamma_j, \tag{69}
\end{align}
which concludes the proof.

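Proposition 2.9. (Linear combination of θjs (part 3)) Assume that j, k = 1, 2, ... are such that $ju < 1$ and $ku < 1$. Assume that $\theta_j, \theta_k \in \mathbb{R}$ are such that $|\theta_j| \le \gamma_j$ and $|\theta_k| \le \gamma_k$, where $\gamma_j$ is defined by (1). Assume that $a, b \in \mathbb{R}$. Then,
\[ a\theta_j + b\theta_k = (|a| + |b|)\theta_{j+k}, \tag{70} \]
for some $\theta_{j+k}$ which satisfies $|\theta_{j+k}| \le \gamma_{j+k}$.

Proof. If $|a| + |b| = 0$, then the proposition is proved trivially, since both a and b are zero. Let us assume that $|a| + |b| \ne 0$ and let us define
\[ \theta_{j+k} = \frac{a\theta_j + b\theta_k}{|a| + |b|}. \tag{71} \]
We have
\begin{align}
|\theta_{j+k}| &= \frac{|a\theta_j + b\theta_k|}{|a| + |b|} \tag{72} \\
&\le \frac{|a||\theta_j| + |b||\theta_k|}{|a| + |b|} \tag{73} \\
&\le \frac{|a|\gamma_j + |b|\gamma_k}{|a| + |b|} \tag{74} \\
&\le \frac{|a|(\gamma_j + \gamma_k) + |b|(\gamma_j + \gamma_k)}{|a| + |b|} \tag{75} \\
&\le \frac{|a| + |b|}{|a| + |b|}(\gamma_j + \gamma_k) \tag{76} \\
&\le \gamma_j + \gamma_k \tag{77} \\
&\le \gamma_{j+k}, \tag{78}
\end{align}
which concludes the proof.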
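The proposition 2.9 lends itself to a quick random experiment. The sketch below is only an illustration: it assumes the toy unit roundoff u = 2^(-10), the arbitrary indices j = 3 and k = 4, and coefficients a and b drawn from a standard normal distribution with grand; none of these choices comes from the text above.

```scilab
// Toy unit roundoff and arbitrary indices (assumptions for illustration)
u = 2^(-10);
j = 3;
k = 4;
gj  = j*u / (1 - j*u);               // gamma_j
gk  = k*u / (1 - k*u);               // gamma_k
gjk = (j + k)*u / (1 - (j + k)*u);   // gamma_{j+k}
N = 1000;
a  = grand(N, 1, "nor", 0, 1);       // random real coefficients
b  = grand(N, 1, "nor", 0, 1);
tj = grand(N, 1, "unf", -gj, gj);    // random theta_j with |theta_j| <= gamma_j
tk = grand(N, 1, "unf", -gk, gk);    // random theta_k with |theta_k| <= gamma_k
theta = (a.*tj + b.*tk) ./ (abs(a) + abs(b));   // definition (71)
disp(max(abs(theta)) <= gjk);        // expected to display T
```

The largest sampled value stays well below $\gamma_{j+k}$, since the proof shows that it cannot exceed the larger of $\gamma_j$ and $\gamma_k$.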

In some derivations, we are going to use fractions of expressions of the form 1 + δi. In this case, we can use the following proposition.

Proposition 2.10. (Fraction of (1 + δi)s) Assume that j and k are positive integers such that $ju < 1$, $ku < 1$ and $(j + k)u < 1$. Assume that $\delta_1, \delta_2, \ldots, \delta_j \in \mathbb{R}$ are such that $|\delta_1|, |\delta_2|, \ldots, |\delta_j| \le u$, and that $\delta'_1, \delta'_2, \ldots, \delta'_k \in \mathbb{R}$ are such that $|\delta'_1|, |\delta'_2|, \ldots, |\delta'_k| \le u$. Then,
\[ \frac{(1 + \delta_1)(1 + \delta_2) \cdots (1 + \delta_j)}{(1 + \delta'_1)(1 + \delta'_2) \cdots (1 + \delta'_k)} = 1 + \theta_{j+k}, \tag{79} \]
for some $\theta_{j+k} \in \mathbb{R}$ which satisfies $|\theta_{j+k}| \le \gamma_{j+k}$, where $\gamma_{j+k}$ is defined by (1).

Proof. Let us denote the product of j terms $1 + \delta_i$ by $P_j$, i.e. let $P_j$ be defined by
\[ P_j = (1 + \delta_1)(1 + \delta_2) \cdots (1 + \delta_j). \tag{80} \]
Similarly, let us introduce
\[ P'_k = (1 + \delta'_1)(1 + \delta'_2) \cdots (1 + \delta'_k). \tag{81} \]
We have to prove that
\[ \frac{P_j}{P'_k} = 1 + \theta_{j+k}. \tag{82} \]
It is straightforward to see that the proposition 2.4 proves the case where k = 0, i.e., the case where there are no terms in the denominator of (79). In order to prove the equation (82), we proceed by induction on k. Let us assume that the equation (82) is true for a given k, and let us prove that this is true for k + 1, under the hypotheses $ju < 1$, $(k + 1)u < 1$ and $(j + k + 1)u < 1$. In other words, let us prove that
\[ \frac{P_j}{P'_{k+1}} = 1 + \theta_{j+k+1}, \tag{83} \]
for some $\theta_{j+k+1}$ which satisfies $|\theta_{j+k+1}| \le \gamma_{j+k+1}$.

We obviously have $P'_{k+1} = P'_k(1 + \delta'_{k+1})$, which implies
\[ \frac{P_j}{P'_{k+1}} = \frac{P_j}{P'_k}\,\frac{1}{1 + \delta'_{k+1}}. \tag{84} \]
We recognize the fraction at the induction level k. Therefore, by the induction hypothesis, we have
\[ \frac{P_j}{P'_{k+1}} = \frac{1 + \theta_{j+k}}{1 + \delta'_{k+1}}. \tag{85} \]
By hypothesis, we have $|\delta'_{k+1}| \le u < 1$. Hence, we expand the fraction $\frac{1}{1 + \delta'_{k+1}}$ into its geometric series, and get
\begin{align}
\frac{P_j}{P'_{k+1}} &= (1 + \theta_{j+k})(1 - \delta'_{k+1} + \delta'^2_{k+1} - \delta'^3_{k+1} + \ldots) \tag{86} \\
&= 1 - \delta'_{k+1} + \delta'^2_{k+1} - \delta'^3_{k+1} + \ldots + \theta_{j+k}(1 - \delta'_{k+1} + \delta'^2_{k+1} - \delta'^3_{k+1} + \ldots) \tag{87} \\
&= 1 + \theta_{j+k+1}, \tag{88}
\end{align}
where $\theta_{j+k+1}$ satisfies the equation
\[ \theta_{j+k+1} = -\delta'_{k+1} + \delta'^2_{k+1} - \delta'^3_{k+1} + \ldots + \theta_{j+k}(1 - \delta'_{k+1} + \delta'^2_{k+1} - \delta'^3_{k+1} + \ldots). \tag{89} \]
We are now going to bound $\theta_{j+k+1}$. To do this, we use the induction hypothesis on $\theta_{j+k}$, which states that $|\theta_{j+k}| \le \gamma_{j+k}$, where $\gamma_{j+k}$ is defined by (1). On the other hand, we use the hypothesis which gives a bound on $\delta'_{k+1}$, i.e. we use that $|\delta'_{k+1}| \le u$. We plug these two inequalities into (89) and get
\begin{align}
|\theta_{j+k+1}| &\le u + u^2 + u^3 + \ldots + \frac{(j + k)u}{1 - (j + k)u}(1 + u + u^2 + u^3 + \ldots) \tag{90} \\
&\le u(1 + u + u^2 + \ldots) + \frac{(j + k)u}{1 - (j + k)u}(1 + u + u^2 + u^3 + \ldots) \tag{91} \\
&\le \left(u + \frac{(j + k)u}{1 - (j + k)u}\right)(1 + u + u^2 + u^3 + \ldots) \tag{92} \\
&\le \left(u + \frac{(j + k)u}{1 - (j + k)u}\right)\frac{1}{1 - u}. \tag{93}
\end{align}
The equation (93) is valid because the hypothesis $u < 1$ implies that we can use the geometric series for the fraction $\frac{1}{1 - u}$. We can simplify the left factor and get
\begin{align}
|\theta_{j+k+1}| &\le \frac{u(1 - (j + k)u) + (j + k)u}{1 - (j + k)u}\,\frac{1}{1 - u} \tag{94} \\
&\le \frac{u - (j + k)u^2 + (j + k)u}{(1 - (j + k)u)(1 - u)} \tag{95} \\
&\le \frac{(j + k + 1)u - (j + k)u^2}{1 - (j + k)u - u + (j + k)u^2} \tag{96} \\
&\le \frac{(j + k + 1)u - (j + k)u^2}{1 - (j + k + 1)u + (j + k)u^2}. \tag{97}
\end{align}
We have almost proved the inequality $|\theta_{j+k+1}| \le \gamma_{j+k+1}$, except for the term $(j + k)u^2$, which appears both in the numerator and the denominator of the fraction. To conclude, all we have to do is to notice that
\[ (j + k + 1)u - (j + k)u^2 \le (j + k + 1)u, \tag{98} \]
since j and k are both positive integers. On the other hand, we have $1 - (j + k + 1)u + (j + k)u^2 \ge 1 - (j + k + 1)u$. Finally, by assumption, we have $(j + k + 1)u < 1$, which implies $1 - (j + k + 1)u > 0$. Hence, we can invert and get
\[ \frac{1}{1 - (j + k + 1)u + (j + k)u^2} \le \frac{1}{1 - (j + k + 1)u}. \tag{99} \]
We combine the inequalities (98) and (99) with (97), and conclude the proof.
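The proposition 2.10 can be illustrated on the case which makes the quotient as large as possible, that is, every $\delta_i$ equal to u in the numerator and every $\delta'_i$ equal to $-u$ in the denominator. The lines below are a sketch under a stated assumption: the toy unit roundoff u = 2^(-10) replaces the binary64 value, and the grid of indices is arbitrary.

```scilab
// Toy unit roundoff (an assumption made for this illustration only)
u = 2^(-10);
ok = %t;
for j = 1:10
    for k = 1:10
        // Largest quotient: delta_i = u above, delta'_i = -u below
        theta = (1 + u)^j / (1 - u)^k - 1;
        gam   = (j + k)*u / (1 - (j + k)*u);   // gamma_{j+k}
        ok = ok & (abs(theta) <= gam);
    end
end
disp(ok);    // T if the bound holds on the whole grid
```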

In the following series of proofs, we are interested in fractions such as $\frac{1 + \theta_k}{1 + \theta_j}$.

Proposition 2.11. (Fraction of (1 + θi)s) Assume that j and k are positive integers such that $ju < 1$ and $ku < 1$. Moreover, we assume that $(k + j)u < 1$ and $(k + 2j)u < 1$. Assume that $\theta_j, \theta_k \in \mathbb{R}$ are such that $|\theta_k| \le \gamma_k$ and $|\theta_j| \le \gamma_j$. Assume that $\gamma_j < 1$. If $j \ge k$, then
\[ \frac{1 + \theta_k}{1 + \theta_j} = 1 + \theta_{k+2j}, \tag{100} \]
for some $\theta_{k+2j} \in \mathbb{R}$ which satisfies $|\theta_{k+2j}| \le \gamma_{k+2j}$, where $\gamma_{k+2j}$ is defined by (1). If $j \le k$, then
\[ \frac{1 + \theta_k}{1 + \theta_j} = 1 + \theta_{k+j}, \tag{101} \]
for some $\theta_{k+j} \in \mathbb{R}$ which satisfies $|\theta_{k+j}| \le \gamma_{k+j}$, where $\gamma_{k+j}$ is defined by (1).

Proof. In the first part of the proof, we consider the general case where we do not make any assumption on j or k, which leads to the equality (100). Then we consider the special case where j = k, for which the calculation is very simple and immediately leads to (101). Finally, we consider the case where $j \le k$ and give a general proof of (101).

We have
\[ \frac{1 + \theta_k}{1 + \theta_j} = 1 + e, \tag{102} \]
where e is defined by
\[ e = \frac{1 + \theta_k}{1 + \theta_j} - 1, \tag{103} \]
and we search for a simplified expression for a bound of e. We have
\begin{align}
e &= \frac{1 + \theta_k - (1 + \theta_j)}{1 + \theta_j} \tag{104} \\
&= \frac{\theta_k - \theta_j}{1 + \theta_j}. \tag{105}
\end{align}
On one hand, the hypothesis $|\theta_j| \le \gamma_j$ implies $\theta_j \ge -\gamma_j$. Hence,
\[ 1 + \theta_j \ge 1 - \gamma_j. \tag{106} \]
The hypothesis $\gamma_j < 1$ implies that $1 - \gamma_j > 0$. Hence both sides of the previous inequality are strictly positive and we invert this inequality. We get
\[ \frac{1}{1 + \theta_j} \le \frac{1}{1 - \gamma_j}. \tag{107} \]
On the other hand, we obviously have
\begin{align}
|\theta_k - \theta_j| &\le |\theta_k| + |\theta_j| \tag{108} \\
&\le \gamma_k + \gamma_j. \tag{109}
\end{align}
We plug (107) and (109) into (105) and find
\begin{align}
|e| &\le \frac{|\theta_k - \theta_j|}{1 - \gamma_j} \tag{110} \\
&\le \frac{\gamma_k + \gamma_j}{1 - \gamma_j}. \tag{111}
\end{align}
We are now going to search for simplified expressions for the numerator and the denominator of the previous inequality. We have
\begin{align}
\gamma_k + \gamma_j &= \frac{ku}{1 - ku} + \frac{ju}{1 - ju} \tag{112} \\
&= \frac{ku(1 - ju) + ju(1 - ku)}{(1 - ku)(1 - ju)} \tag{113} \\
&= \frac{ku - jku^2 + ju - jku^2}{(1 - ku)(1 - ju)} \tag{114} \\
&= \frac{(k + j)u - 2jku^2}{(1 - ku)(1 - ju)}. \tag{115}
\end{align}
On the other hand, the expression $1 - \gamma_j$ can be written as
\begin{align}
1 - \gamma_j &= 1 - \frac{ju}{1 - ju} \tag{116} \\
&= \frac{(1 - ju) - ju}{1 - ju} \tag{117} \\
&= \frac{1 - 2ju}{1 - ju}. \tag{118}
\end{align}
We plug the equations (115) and (118) into (111) and get
\begin{align}
|e| &\le \frac{(k + j)u - 2jku^2}{(1 - ku)(1 - ju)}\,\frac{1 - ju}{1 - 2ju} \tag{119} \\
&\le \frac{(k + j)u - 2jku^2}{(1 - ku)(1 - 2ju)}, \tag{120}
\end{align}
since the expression $1 - ju$ can be simplified. We now search for an expanded form of the product $(1 - ku)(1 - 2ju)$. We have
\begin{align}
|e| &\le \frac{(k + j)u - 2jku^2}{1 - ku - 2ju + 2jku^2} \tag{121} \\
&\le \frac{(k + j)u - 2jku^2}{1 - (k + 2j)u + 2jku^2}. \tag{122}
\end{align}
It is now easy to bound e, since $2jku^2 \ge 0$. Hence,
\[ |e| \le \frac{(k + j)u}{1 - (k + 2j)u}. \tag{123} \]
We emphasize that the previous step is highly accurate, since u is the unit roundoff. Indeed, this implies that $u^2$ is extremely small compared to u in the inequality (122). For any integers $j, k \ge 0$, we have $k + j \le k + 2j$, which, combined with the hypothesis $u > 0$, implies
\[ |e| \le \frac{(k + 2j)u}{1 - (k + 2j)u}. \tag{124} \]
The right hand side of the previous inequality is $\gamma_{k+2j}$, which proves the equality (100) in the general case.

In the second part of this proof, we consider the special case where j = k. We are going to check that this immediately leads to (101). In the case where j = k, the inequality (120) can be written
\begin{align}
|e| &\le \frac{2ju - 2j^2u^2}{(1 - ju)(1 - 2ju)} \tag{125} \\
&\le \frac{2ju(1 - ju)}{(1 - ju)(1 - 2ju)} \tag{126} \\
&\le \frac{2ju}{1 - 2ju} \tag{127} \\
&\le \gamma_{2j}. \tag{128}
\end{align}
Finally, we assume that $j \le k$ and give a general proof of (101). We have to prove that
\[ \frac{(k + j)u - 2jku^2}{1 - (k + 2j)u + 2jku^2} \le \frac{(k + j)u}{1 - (k + j)u}. \tag{129} \]
To do this, we prove that
\[ \frac{(k + j)u}{1 - (k + j)u} - \frac{(k + j)u - 2jku^2}{1 - (k + 2j)u + 2jku^2} \ge 0. \tag{130} \]
In other words, we have to prove that
\[ \frac{(k + j)u\,(1 - (k + 2j)u + 2jku^2) - ((k + j)u - 2jku^2)(1 - (k + j)u)}{(1 - (k + j)u)\,(1 - (k + 2j)u + 2jku^2)} \ge 0. \tag{131} \]
By hypothesis, we have $(k + j)u < 1$ and $(k + 2j)u < 1$, so that the denominator, which is equal to $(1 - (k + j)u)(1 - ku)(1 - 2ju)$, is positive. The numerator is equal to
\begin{align}
&(k + j)u\,(1 - (k + 2j)u + 2jku^2) - ((k + j)u - 2jku^2)(1 - (k + j)u) \tag{132} \\
&\quad = (k + j)u - (k + j)(k + 2j)u^2 + 2(k + j)jku^3 \tag{133} \\
&\qquad - (k + j)u + (k + j)^2u^2 + 2jku^2 - 2(k + j)jku^3 \tag{134} \\
&\quad = -(k + j)(k + 2j)u^2 + (k + j)^2u^2 + 2jku^2, \tag{135}
\end{align}
since the terms $(k + j)u$ and $2(k + j)jku^3$ are eliminated. We expand the previous equality and get
\begin{align}
&(-k^2 - 2jk - jk - 2j^2 + k^2 + 2jk + j^2 + 2jk)u^2 \tag{136} \\
&\quad = (jk - j^2)u^2 \tag{137} \\
&\quad = j(k - j)u^2. \tag{138}
\end{align}
By hypothesis, $j \ge 0$, so that the numerator of (131) is nonnegative if and only if $k - j \ge 0$, which concludes the proof.

The figure 3 summarizes the results that we have proved in this section.

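 #   Hypothesis       Property
 1   j ≤ k            $\gamma_j \le \gamma_k$
 2   jku < 1          $j\gamma_k \le \gamma_{jk}$
 3                    $(1 + \delta_1)(1 + \delta_2) \cdots (1 + \delta_n) = 1 + \theta_n$
 4                    $\gamma_j + \gamma_k + \gamma_j\gamma_k \le \gamma_{j+k}$
 5                    $\gamma_j + \gamma_k \le \gamma_{j+k}$
 6                    $\gamma_j + u \le \gamma_{j+1}$
 7   (j + k)u < 1     $(1 + \theta_j)(1 + \theta_k) = 1 + \theta_{j+k}$
 8   a, b ∈ R         $a\theta_j + b\theta'_j = (|a| + |b|)\theta''_j$
 9   a, b ≥ 0         $a(1 + \theta_j) + b(1 + \theta'_j) = (a + b)(1 + \theta''_j)$
10   a, b ∈ R         $a\theta_j + b\theta_k = (|a| + |b|)\theta_{j+k}$
11                    $\frac{(1 + \delta_1)(1 + \delta_2) \cdots (1 + \delta_j)}{(1 + \delta'_1)(1 + \delta'_2) \cdots (1 + \delta'_k)} = 1 + \theta_{j+k}$
12   j > k            $\frac{1 + \theta_k}{1 + \theta_j} = 1 + \theta_{k+2j}$
13   j ≤ k            $\frac{1 + \theta_k}{1 + \theta_j} = 1 + \theta_{k+j}$

Figure 3: Some useful properties for error bounds. In this table we assume that u < 1. We assume that j, k, n are three integers such that j, k, n ≥ 1 and ju, ku, nu < 1. We denote by $\delta_j$ a real number such that $|\delta_j| \le u$. We denote by $\theta_j$ a real number such that $|\theta_j| \le \gamma_j$.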
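The last two rows of the figure 3 can be checked on the extreme case $\theta_k = \gamma_k$ and $\theta_j = -\gamma_j$, which makes the quotient $(1 + \theta_k)/(1 + \theta_j)$ as large as possible. The sketch below assumes the toy unit roundoff u = 2^(-10) and an arbitrary grid of indices. The bound $\gamma_{k+j}$ is only tested for j < k, because for j = k both sides are equal in exact arithmetic and the comparison would then depend on rounding errors.

```scilab
// Toy unit roundoff (an assumption made for this illustration only)
u = 2^(-10);
n = 1:40;
g = (n*u) ./ (1 - n*u);      // g(n) is gamma_n for n = 1, ..., 40
ok1 = %t;                    // bound gamma_{k+2j}, any j and k
ok2 = %t;                    // bound gamma_{k+j}, tested for j < k
for j = 1:10
    for k = 1:10
        e = (1 + g(k)) / (1 - g(j)) - 1;   // largest possible quotient minus 1
        ok1 = ok1 & (e <= g(k + 2*j));
        if j < k then
            ok2 = ok2 & (e <= g(k + j));
        end
    end
end
disp([ok1 ok2]);
```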

Operation                Error bound
Addition, Subtraction    $u$
Multiplication           $\sqrt{2}\gamma_2 \approx 2\sqrt{2}u \approx 2.83u$
Division                 $\gamma_6 \approx 6u$

Figure 4: Error bounds for complex arithmetic.
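The constants of the figure 4 can be evaluated for binary64 doubles with a few lines of Scilab. This is only a sketch which restates definition (1) inline; the variable names g2 and g6 are chosen here for illustration.

```scilab
u  = %eps / 2;              // unit roundoff of binary64 doubles
g2 = 2*u / (1 - 2*u);       // gamma_2, from definition (1)
g6 = 6*u / (1 - 6*u);       // gamma_6, from definition (1)
// Each bound is printed as a multiple of u
mprintf("addition, subtraction : %.7f u\n", 1);
mprintf("multiplication        : %.7f u\n", sqrt(2) * g2 / u);
mprintf("division              : %.7f u\n", g6 / u);
```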

2.3 Exercises

Exercise 2.1 (Products of 1 + δs) The goal of this exercise is to experiment with the proposition 2.4 numerically in Scilab. The grand(m,n,"unf",a,b) calling sequence produces an m-by-n matrix of doubles which are sampled uniformly at random in the interval [a, b]. In order to experiment with the proposition 2.4, we can pick random numbers δi in the interval [−u, u], and compute the ratio (1 + δ1)(1 + δ2) ... (1 + δn)/u. Experiment with this in Scilab with n = 5.

3 Error bounds for complex arithmetic

In this section, we prove error bounds for complex arithmetic, that is, we bound the relative error of the operations x + y, x ∗ y and x/y, when x and y are complex floating point numbers. These results are presented, for example, by Higham [5] for the complex addition and multiplication. For the complex division, however, the error bound that we present here is new, to our knowledge. The figure 4 summarizes the results that we prove in this section.

3.1 Standard arithmetic model

In order to derive our error analysis, we must make some assumptions on the basic real arithmetic. Indeed, complex arithmetic is derived from the real arithmetic. We assume that the real arithmetic satisfies the following definition.

Definition 3.1. (Standard arithmetic model) We denote by $u \in \mathbb{R}$ the unit roundoff. We denote by F the set of floating point numbers in the current floating point system. The floating point system is associated with a standard arithmetic model if, for any $x, y \in F$, we have
\[ \mathrm{fl}(x \ \mathrm{op}\ y) = (x \ \mathrm{op}\ y)(1 + \delta), \tag{139} \]
where $\delta \in \mathbb{R}$ is such that $|\delta| \le u$, for op = +, −, ∗, /.

The IEEE 754 floating point arithmetic satisfies the definition 3.1. By contrast, a floating point arithmetic which does not make use of a guard digit may not satisfy this condition.

We are going to review the error bounds for complex arithmetic operations one after the other. This might seem tedious at first. On the other hand, it lets us approach the error analysis progressively and introduce the required tools smoothly.

We make the assumption that $x, y \in \mathbb{C}$ are defined by
\[ x = a + ib, \qquad y = c + id, \tag{140} \]
where $a, b, c, d \in \mathbb{R}$. This might be trivial, but notice that, in the equation (140), the sum of the real and imaginary parts is not implemented as a floating point sum: these are just the real and imaginary parts, so that there is no floating point error associated with this. Hence, we obviously have $\mathrm{fl}(x) = \mathrm{fl}(a) + i\,\mathrm{fl}(b)$, for any $a, b \in \mathbb{R}$.

3.2 Error bound for complex addition

The first result that we can derive easily is the error bound for the addition.

Proposition 3.2. (Error bound for complex addition) We assume that the real arithmetic is standard, in the sense of the definition 3.1. Let us consider the sum of two complex numbers x, y, computed by
\[ x + y = (a + c) + i(b + d). \tag{141} \]
Then, there exists some $\delta \in \mathbb{C}$ such that
\[ \mathrm{fl}(x + y) = (x + y)(1 + \delta), \tag{142} \]
with
\[ |\delta| \le u. \tag{143} \]

The proof that we present is given by Higham in [5].

Proof. By the definition of the complex sum (141), we have:
\[ \mathrm{fl}(x + y) = \mathrm{fl}(a + c) + i\,\mathrm{fl}(b + d). \tag{144} \]
Then we apply the definition of the standard arithmetic model (139) to the addition and get
\[ \mathrm{fl}(x + y) = (a + c)(1 + \delta_1) + i(b + d)(1 + \delta_2), \tag{145} \]
for some $\delta_1, \delta_2 \in \mathbb{R}$ such that $|\delta_1|, |\delta_2| \le u$. This implies
\begin{align}
\mathrm{fl}(x + y) &= a + c + (a + c)\delta_1 + i(b + d) + i(b + d)\delta_2 \tag{146} \\
&= x + y + (a + c)\delta_1 + i(b + d)\delta_2 \tag{147} \\
&= x + y + e, \tag{148}
\end{align}
where the error $e \in \mathbb{C}$ satisfies the equation
\[ e = (a + c)\delta_1 + i(b + d)\delta_2. \tag{149} \]
We now compute the modulus of the error e and get
\begin{align}
|e|^2 &= (a + c)^2\delta_1^2 + (b + d)^2\delta_2^2 \tag{150} \\
&\le (a + c)^2u^2 + (b + d)^2u^2, \tag{151}
\end{align}
from the bounds on $\delta_1$ and $\delta_2$. Hence,
\begin{align}
|e|^2 &\le \left((a + c)^2 + (b + d)^2\right)u^2 \tag{152} \\
&\le |x + y|^2u^2. \tag{153}
\end{align}
This implies
\[ |e| \le |x + y|u. \tag{154} \]
At this point, we separate two cases, depending on whether x + y is zero or nonzero.

First, let us assume that x + y = 0. Then, the inequality (154) implies that e = 0. Hence, the equation (148) implies that fl(x + y) = x + y. Therefore, the equation (142) is satisfied with δ = 0.

Second, let us assume that $x + y \ne 0$. Let us define $\delta \in \mathbb{C}$ by the equation
\[ \delta = \frac{e}{x + y}. \tag{155} \]
The inequality (154) implies
\begin{align}
|\delta| &= \frac{|e|}{|x + y|} \tag{156} \\
&\le \frac{|x + y|u}{|x + y|} \tag{157} \\
&\le u. \tag{158}
\end{align}
Then the equation (148) implies that
\begin{align}
\mathrm{fl}(x + y) &= x + y + (x + y)\delta \tag{159} \\
&= (x + y)(1 + \delta), \tag{160}
\end{align}
which concludes the proof.

3.3 Error bound for complex multiplication

Before considering the complex multiplication, we consider an easy intermediate property which will be useful in our next error bound.

Proposition 3.3. Assume that $a, b, c, d \in \mathbb{R}$. Then,
\[ (|ac| + |bd|)^2 + (|ad| + |bc|)^2 \le 2(a^2 + b^2)(c^2 + d^2). \tag{161} \]
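Before the proof, the inequality (161) can be spot-checked on random data. The sketch below is an illustration only: the sample size, the normal distribution of the inputs and the small tolerance of a few machine epsilons (which absorbs the rounding errors of the check itself) are choices made here, not part of the proposition.

```scilab
N = 10000;
a = grand(N, 1, "nor", 0, 1);
b = grand(N, 1, "nor", 0, 1);
c = grand(N, 1, "nor", 0, 1);
d = grand(N, 1, "nor", 0, 1);
lhs = (abs(a.*c) + abs(b.*d)).^2 + (abs(a.*d) + abs(b.*c)).^2;
rhs = 2 * (a.^2 + b.^2) .* (c.^2 + d.^2);
// Tolerance of a few machine epsilons for the rounding errors of the check
disp(and(lhs <= rhs .* (1 + 10*%eps)));
```

The factor 2 cannot be improved: with a = b = c = d = 1, both sides of (161) are equal to 8.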

Proof. We have
\begin{align}
(|ac| + |bd|)^2 + (|ad| + |bc|)^2 &= a^2c^2 + 2|abcd| + b^2d^2 + a^2d^2 + 2|abcd| + b^2c^2 \tag{162} \\
&= a^2(c^2 + d^2) + b^2(c^2 + d^2) + 4|abcd| \tag{163} \\
&= (a^2 + b^2)(c^2 + d^2) + 4|abcd|. \tag{164}
\end{align}
All we have to do is to prove that
\[ 4|abcd| \le (a^2 + b^2)(c^2 + d^2). \tag{165} \]
Indeed, if we plug the previous inequality into (164), this immediately implies (161). The remaining work is to prove the inequality (165). This is easy, since
\[ 0 \le (|ac| - |bd|)^2 + (|ad| - |bc|)^2. \tag{166} \]
The previous inequality implies
\[ 0 \le a^2c^2 - 2|abcd| + b^2d^2 + a^2d^2 - 2|abcd| + b^2c^2. \tag{167} \]
Therefore,
\[ 0 \le (a^2 + b^2)(c^2 + d^2) - 4|abcd|, \tag{168} \]
which proves that (165) is true and concludes this proof.

We now consider the case of the complex multiplication.

Proposition 3.4. (Error bound for complex multiplication) We assume that the real arithmetic is standard, in the sense of the definition 3.1. Let us consider the product of two complex numbers x, y, computed by
\[ x * y = (ac - bd) + i(ad + bc). \tag{169} \]
Then, there exists some $\delta \in \mathbb{C}$ such that
\[ \mathrm{fl}(x * y) = (xy)(1 + \delta), \tag{170} \]
with
\[ |\delta| \le \sqrt{2}\gamma_2, \tag{171} \]
where $\gamma_2$ is defined by (1).

The proof that we present is given by Higham in [5].

Proof. We consider the equation (169) and apply the standard model 3.1 for the subtraction and the addition, then for the products. We get:
\begin{align}
\mathrm{fl}(x * y) &= \mathrm{fl}(ac - bd) + i\,\mathrm{fl}(ad + bc) \tag{172} \\
&= (\mathrm{fl}(ac) - \mathrm{fl}(bd))(1 + \delta_1) + i(\mathrm{fl}(ad) + \mathrm{fl}(bc))(1 + \delta_2) \tag{173} \\
&= (ac(1 + \delta_3) - bd(1 + \delta_4))(1 + \delta_1) \tag{174} \\
&\quad + i(ad(1 + \delta_5) + bc(1 + \delta_6))(1 + \delta_2), \tag{175}
\end{align}
for some $\delta_i \in \mathbb{R}$ such that $|\delta_i| \le u$, for i = 1, 2, ..., 6. The previous equation implies
\begin{align}
\mathrm{fl}(x * y) &= (ac(1 + \delta_3)(1 + \delta_1) - bd(1 + \delta_4)(1 + \delta_1)) \tag{177} \\
&\quad + i(ad(1 + \delta_5)(1 + \delta_2) + bc(1 + \delta_6)(1 + \delta_2)). \tag{178}
\end{align}
The previous equation reveals products of terms $(1 + \delta_i)$. This is why we apply the proposition 2.4, which implies that there are four $\theta_2, \theta'_2, \theta''_2, \theta'''_2 \in \mathbb{R}$ such that
\[ \mathrm{fl}(x * y) = (ac(1 + \theta_2) - bd(1 + \theta'_2)) + i(ad(1 + \theta''_2) + bc(1 + \theta'''_2)), \tag{179} \]
where
\[ |\theta_2|, |\theta'_2|, |\theta''_2|, |\theta'''_2| \le \gamma_2. \tag{180} \]
The equation (179) implies
\[ \mathrm{fl}(x * y) = xy + e, \tag{181} \]
where e is defined by the equation
\[ e = ac\theta_2 - bd\theta'_2 + i(ad\theta''_2 + bc\theta'''_2). \tag{182} \]
The previous equation implies
\[ |e|^2 = (ac\theta_2 - bd\theta'_2)^2 + (ad\theta''_2 + bc\theta'''_2)^2. \tag{183} \]
We now bound each of the terms in the right hand side of (183). We have
\begin{align}
|ac\theta_2 - bd\theta'_2| &\le |ac||\theta_2| + |bd||\theta'_2|, \tag{184} \\
|ad\theta''_2 + bc\theta'''_2| &\le |ad||\theta''_2| + |bc||\theta'''_2|. \tag{185}
\end{align}
We can bound the $\theta_2$ terms by the inequality (180), which implies
\begin{align}
|ac\theta_2 - bd\theta'_2| &\le |ac|\gamma_2 + |bd|\gamma_2 \tag{186} \\
&\le (|ac| + |bd|)\gamma_2, \tag{187} \\
|ad\theta''_2 + bc\theta'''_2| &\le |ad|\gamma_2 + |bc|\gamma_2 \tag{188} \\
&\le (|ad| + |bc|)\gamma_2. \tag{189}
\end{align}
We plug the inequalities (187) and (189) into (183) and get
\begin{align}
|e|^2 &\le (|ac| + |bd|)^2\gamma_2^2 + (|ad| + |bc|)^2\gamma_2^2 \tag{190} \\
&\le \left((|ac| + |bd|)^2 + (|ad| + |bc|)^2\right)\gamma_2^2. \tag{191}
\end{align}
We now use the proposition 3.3, which implies
\[ |e|^2 \le 2(a^2 + b^2)(c^2 + d^2)\gamma_2^2. \tag{192} \]
We recognize in $a^2 + b^2$ the square of the modulus of x and in $c^2 + d^2$ the square of the modulus of y. Hence,
\[ |e|^2 \le 2|x|^2|y|^2\gamma_2^2. \tag{193} \]
Therefore,
\[ |e| \le \sqrt{2}|x||y|\gamma_2. \tag{194} \]
We now consider two cases, depending on whether xy is zero or nonzero. If xy = 0, then the inequality (194) implies that e = 0. Hence, the equation (181) shows that fl(x ∗ y) = xy exactly. Therefore, the equation (170) is satisfied with δ = 0, which concludes the first part of the proof. On the other hand, if $xy \ne 0$, we introduce
\[ \delta = \frac{e}{xy}. \tag{195} \]
The inequality (194) implies that $|\delta| \le \sqrt{2}\gamma_2$. The equation (181), combined with (195), implies that fl(x ∗ y) = xy + xyδ = xy(1 + δ), which concludes the proof.

In order to get a rough idea of what this error bound means, we can see that $\sqrt{2}\gamma_2 < 2.84u$.

    -->sqrt(2)*gamman(2)/u
     ans  =
        2.8284271

3.4 Other error bounds for complex multiplication

In [1], Brent, Percival and Zimmermann prove an accurate error bound for the complex multiplication. They consider $u = \frac{1}{2}\epsilon$, the unit roundoff. If $u < 2^{-5}$ and provided that no overflow or underflow occurs, then

\begin{equation}
\mathrm{fl}(x \ast y) = (xy)(1 + \delta), \tag{196}
\end{equation}

with

\begin{equation}
|\delta| \leq \sqrt{5}\,u. \tag{197}
\end{equation}

Numerically, $\sqrt{5}u \approx 2.24u$. The authors emphasize [1] that their error bound $\sqrt{5}u$ is tighter than the error bound given by 171, which is very close to $2\sqrt{2}u \approx 2.83u$. Their proof is relatively more complicated than Higham's and will not be presented here.
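To make the comparison concrete, the two constants can be put side by side. The following worked computation only uses the definition of $\gamma_2$ and assumes the binary64 value $u = 2^{-53}$, for which $1 - 2u$ is indistinguishable from 1 at the displayed precision. It is an illustration, not a result taken from [1].

```latex
\begin{align*}
\sqrt{2}\,\gamma_2 &= \sqrt{2}\,\frac{2u}{1-2u} = \frac{2\sqrt{2}\,u}{1-2u} \approx 2.83\,u, \\
\sqrt{5}\,u &\approx 2.24\,u, \\
\frac{\sqrt{2}\,\gamma_2}{\sqrt{5}\,u} &= \frac{2\sqrt{2}}{\sqrt{5}\,(1-2u)} \approx 1.26 .
\end{align*}
```

In other words, the bound of [1] is smaller than the bound 171 by roughly 21%.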

3.5 Error bound for complex division

We can finally consider the objective of this section, the error bound for the naive algorithm of the complex division. To our knowledge, the proof associated with this error bound is new.

Proposition 3.5. (Error bound for complex division) We assume that the real arithmetic is standard, in the sense of the definition 3.1. Let us consider the division of two complex numbers $x = a + ib$ and $y = c + id$, computed by

\begin{equation}
x/y = \frac{ac + bd}{c^2 + d^2} + i\,\frac{bc - ad}{c^2 + d^2}. \tag{198}
\end{equation}

Then there exists some $\theta_6 \in \mathbb{C}$ such that

\begin{equation}
\mathrm{fl}(x/y) = (x/y)(1 + \theta_6), \tag{199}
\end{equation}

with

\begin{equation}
|\theta_6| \leq \gamma_6, \tag{200}
\end{equation}

where $\gamma_6$ is defined by the equation 1.

Proof. We first compute an error bound for the denominator $c^2 + d^2$. By definition of the standard model 3.1, we have

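\begin{align}
\mathrm{fl}(c^2 + d^2) &= \left(\mathrm{fl}(c^2) + \mathrm{fl}(d^2)\right)(1 + \delta_1) \tag{201} \\
&= \left(c^2(1 + \delta_2) + d^2(1 + \delta_3)\right)(1 + \delta_1) \tag{202} \\
&= c^2(1 + \delta_2)(1 + \delta_1) + d^2(1 + \delta_3)(1 + \delta_1), \tag{203}
\end{align}

for some $\delta_1, \delta_2, \delta_3 \in \mathbb{R}$ such that $|\delta_1|, |\delta_2|, |\delta_3| \leq u$. The proposition 2.4 then implies that there are two $\theta_2, \theta_2' \in \mathbb{R}$ such that

\begin{equation}
\mathrm{fl}(c^2 + d^2) = c^2(1 + \theta_2) + d^2(1 + \theta_2'), \tag{204}
\end{equation}

where $|\theta_2|, |\theta_2'| \leq \gamma_2$. Since $c^2$ and $d^2$ are both positive, we can apply the proposition 2.8, which implies that there exists $\theta_2'' \in \mathbb{R}$ such that

\begin{equation}
\mathrm{fl}(c^2 + d^2) = (c^2 + d^2)(1 + \theta_2''), \tag{205}
\end{equation}

with $|\theta_2''| \leq \gamma_2$. We then apply the error bound for the complex product, that is, we apply the proposition 3.4. By definition, we have

\begin{equation}
\mathrm{fl}(x/y) = \mathrm{fl}\left(\frac{(a + ib)(c - id)}{c^2 + d^2}\right). \tag{206}
\end{equation}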

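Hence,

\begin{equation}
\mathrm{fl}(x/y) = \frac{\mathrm{fl}((a + ib)(c - id))}{\mathrm{fl}(c^2 + d^2)}(1 + \delta_1), \tag{207}
\end{equation}

where $|\delta_1| \leq u$, because we divide separately the real and imaginary parts by the real number $c^2 + d^2$. This implies

\begin{equation}
\mathrm{fl}(x/y) = \frac{\mathrm{fl}((a + ib)(c - id))}{(c^2 + d^2)(1 + \theta_2'')}(1 + \delta_1), \tag{208}
\end{equation}

where $|\theta_2''| \leq \gamma_2$, because of the equality 205. We now apply the proposition 3.4 to the complex multiplication $(a + ib)(c - id)$ and get

\begin{equation}
\mathrm{fl}(x/y) = \frac{(a + ib)(c - id)(1 + \delta)}{(c^2 + d^2)(1 + \theta_2'')}(1 + \delta_1), \tag{209}
\end{equation}

where $|\delta| \leq \sqrt{2}\gamma_2$. Therefore,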

\begin{equation}
\mathrm{fl}(x/y) = (x/y)\,\frac{(1 + \delta)(1 + \delta_1)}{1 + \theta_2''}. \tag{211}
\end{equation}

We now use the proposition 2.3 in order to simplify the error terms. We have

\begin{equation}
|\delta| \leq \sqrt{2}\gamma_2 \leq \gamma_{2\sqrt{2}} = \gamma_{2.82\ldots} < \gamma_3. \tag{212}
\end{equation}

Let us introduce $\theta_3 = \delta$. We obviously have $|\theta_3| = |\delta| \leq \gamma_3$. Let us introduce $\theta_1 = \delta_1$. We obviously have $|\theta_1| = |\delta_1| \leq u < \gamma_1$. Hence, we can write

\begin{equation}
\mathrm{fl}(x/y) = (x/y)\,\frac{(1 + \theta_3)(1 + \theta_1)}{1 + \theta_2''}, \tag{213}
\end{equation}

where $|\theta_3| \leq \gamma_3$, $|\theta_1| \leq \gamma_1$ and $|\theta_2''| \leq \gamma_2$. We first apply the proposition 2.6, which bounds a product of two $(1 + \theta_j)$ terms. Hence,

\begin{equation}
\mathrm{fl}(x/y) = (x/y)\,\frac{1 + \theta_4}{1 + \theta_2''}, \tag{214}
\end{equation}

where $|\theta_4| \leq \gamma_4$. We now apply the proposition 2.11, which bounds a ratio of $(1 + \theta_i)$ terms in the case where $j \leq k$. Hence,

\begin{equation}
\mathrm{fl}(x/y) = (x/y)(1 + \theta_6), \tag{215}
\end{equation}

where $|\theta_6| \leq \gamma_6$, which concludes the proof.

We could easily integrate the results for the error bound of the complex multiplication [1] into the proof of the proposition 3.5. Their result would lead to the equation 209 where $\delta$ satisfies $|\delta| \leq \sqrt{5}u$. However, the inequalities 212 are not tight enough to take advantage of this accurate bound, because our method is based on $|\delta| \leq \sqrt{5}u < \gamma_{2.23\ldots} < \gamma_3$. The same method would then lead to the same result. All in all, it appears that the inequalities that we derive in 212 are much less tight than possible. A possible improvement may be to extend the properties of $\gamma_n$ to non-integer values of $n$.
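The bound of the proposition 3.5 can be observed on random operands. The following Python sketch is an illustration under stated assumptions, not the author's code: it assumes binary64 arithmetic with round-to-nearest, and `naive_div` and `worst_theta6` are hypothetical names. The quotient is computed as in the equation 198, the exact quotient is obtained with rational arithmetic, and the squared relative error is compared with $\gamma_6^2$.

```python
from fractions import Fraction
import random

U = Fraction(1, 2**53)  # unit roundoff of binary64 (assumption: doubles)
GAMMA6 = 6 * U / (1 - 6 * U)


def naive_div(a, b, c, d):
    """Naive floating point quotient (a+ib)/(c+id), following equation 198."""
    den = c * c + d * d
    return (a * c + b * d) / den, (b * c - a * d) / den


def worst_theta6(samples=2000, seed=1):
    """Largest observed |theta_6|^2 / gamma_6^2 over random operands."""
    rng = random.Random(seed)
    worst = Fraction(0)
    for _ in range(samples):
        # Magnitudes in [0.5, 2] keep the test away from overflow and underflow.
        a, b, c, d = (rng.uniform(0.5, 2.0) * rng.choice((-1.0, 1.0))
                      for _ in range(4))
        re, im = naive_div(a, b, c, d)
        A, B, C, D = map(Fraction, (a, b, c, d))
        den = C * C + D * D
        q_re, q_im = (A * C + B * D) / den, (B * C - A * D) / den  # exact x/y
        err2 = (Fraction(re) - q_re) ** 2 + (Fraction(im) - q_im) ** 2
        worst = max(worst, err2 / ((q_re**2 + q_im**2) * GAMMA6**2))
    return worst


# The proposition 3.5 predicts a value not larger than 1.
print(float(worst_theta6()))
```

Such an experiment can only confirm that the bound is not violated on the sampled operands. It says nothing about how tight $\gamma_6$ is in the worst case.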

3.6 Other error bounds for the naive complex division algorithm

What makes the proposition 3.5 interesting is that it is simply based on the error bound for the complex multiplication. If the error bound $\sqrt{5}u$ is integrated in the proof, then, to first order in $u$, the more accurate bound $(\sqrt{5} + 1 + 2)u = (\sqrt{5} + 3)u \approx 5.24u$ can be found.

In [7], Muller et al. present, in the section 4.5.3 "Complex division", an error bound for the naive algorithm of the complex division. They assume that no underflow or overflow occurs and that the default IEEE round-to-nearest rounding mode is used. They assume that the unit roundoff $u$ satisfies $u < 1/8$. Then,

\begin{equation}
|\mathrm{fl}(x/y) - x/y| \leq 5\sqrt{2}\,(1 + 6u)\,u\,|x/y|. \tag{216}
\end{equation}

The proof is given in [7]. This bound is very close to $5\sqrt{2}u \approx 7.07u$. This bound is higher than the bound $\gamma_6 \approx 6u$ presented in the proposition 3.5.

In [5], Higham provides an error bound for the naive complex division algorithm. Assume that $u$ is a positive real number, and that $\delta$ is a complex number such that $|\delta| \leq u$. Then, for $n = 1, 2, \ldots$ such that $nu < 1$, let us define $\gamma_n$ by

\begin{equation}
\gamma_n = \frac{nu}{1 - nu}. \tag{217}
\end{equation}

In the section 3.6 "Complex Arithmetic" of [5], Higham states that

\begin{equation}
\mathrm{fl}(x/y) = (x/y)(1 + \delta), \tag{218}
\end{equation}

where $|\delta| \leq \sqrt{2}\gamma_4 \approx 4\sqrt{2}u \approx 5.66u$. Unfortunately, the proof has a small technical problem which makes the result wrong. A revised proof based on the same technique is presented in the exercise 3.2, where the error bound is $\sqrt{2}\gamma_5 \approx 5\sqrt{2}u \approx 7.07u$. This is essentially the same bound as the one obtained by Muller et al. in [7]. Higham emphasizes that $\delta$ is a complex number, so that we cannot conclude from the error bound that the real and imaginary parts are obtained to high relative accuracy, only that they are obtained to high accuracy relative to $|\mathrm{fl}(x/y)|$.
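The bounds quoted in this section are easier to compare once they are expressed in units of $u$. The short Python sketch below does this for binary64. The function names and the dictionary labels are hypothetical and only serve the illustration. The constants are the ones stated in the text above.

```python
import math

U = 2.0 ** -53  # unit roundoff of binary64 (assumption: doubles)


def gamma(n, u=U):
    """gamma_n = n*u / (1 - n*u), defined for n*u < 1."""
    return n * u / (1 - n * u)


def division_bounds(u=U):
    """Relative error bounds for the naive division, in units of u."""
    return {
        "gamma_6 (proposition 3.5)": gamma(6, u) / u,
        "(sqrt(5)+3)u (first order, using [1])": math.sqrt(5) + 3,
        "sqrt(2)*gamma_4 (stated in [5])": math.sqrt(2) * gamma(4, u) / u,
        "sqrt(2)*gamma_5 (exercise 3.2)": math.sqrt(2) * gamma(5, u) / u,
        "5*sqrt(2)*(1+6u)u (from [7])": 5 * math.sqrt(2) * (1 + 6 * u),
    }


for name, value in division_bounds().items():
    print(f"{value:.4f}  {name}")
```

For doubles the factors $1/(1 - nu)$ and $1 + 6u$ are invisible at four decimals, so the table essentially lists the leading constants $6$, $\sqrt{5} + 3$, $4\sqrt{2}$, $5\sqrt{2}$ and $5\sqrt{2}$.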

3.7 Error bounds for Smith's complex division algorithm

An analysis of Hough, cited by Coonen [2] and Stewart [8], shows that when Smith's algorithm works, it returns a computed value $\mathrm{fl}(x/y)$ satisfying

\begin{equation}
|\mathrm{fl}(x/y) - x/y| \leq \eta|z|, \tag{219}
\end{equation}

where $z = x/y$ is the exact complex division result and $\eta$ is of the same order of magnitude as the unit roundoff $u$ for the arithmetic in question.

In [3], Demmel analyzes Smith's algorithm. Under the hypothesis that graceful underflow is available, Demmel states that, if both $|x|$ and $|y|$ are greater than $\alpha$, then

\begin{equation}
\mathrm{fl}(x/y) = (x/y)(1 + e_r) + e_a, \tag{220}
\end{equation}

where the relative error $e_r$ satisfies

\begin{equation}
|e_r| \leq 7\sqrt{2}\,\epsilon, \tag{221}
\end{equation}

with $\epsilon$ the machine epsilon, and the absolute error is

\begin{equation}
e_a = \sqrt{2}\,\alpha. \tag{222}
\end{equation}

In the section 3.6 "Complex Arithmetic" of [5], Higham states that the error in Smith's algorithm is

\begin{equation}
\mathrm{fl}(x/y) = (x/y)(1 + \delta), \tag{223}
\end{equation}

where $|\delta| \leq \sqrt{2}\gamma_7$. This error bound is very close to $7\sqrt{2}u \approx 9.90u$.
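This section does not restate Smith's algorithm, so the following Python sketch relies on its usual formulation, in which the operands are scaled by the larger of $|c|$ and $|d|$. This is an assumption about the variant that is meant, and `smith_div` and `worst_delta` are hypothetical names. The sketch evaluates one quotient for which the denominator $c^2 + d^2$ of the naive formula 198 overflows in binary64, then compares the observed error with $\sqrt{2}\gamma_7$ on random operands.

```python
from fractions import Fraction
import random

U = Fraction(1, 2**53)  # unit roundoff of binary64 (assumption: doubles)
GAMMA7 = 7 * U / (1 - 7 * U)


def smith_div(a, b, c, d):
    """Quotient (a+ib)/(c+id), scaled by the larger of |c| and |d|."""
    if abs(c) >= abs(d):
        r = d / c
        den = c + d * r
        return (a + b * r) / den, (b - a * r) / den
    r = c / d
    den = c * r + d
    return (a * r + b) / den, (b * r - a) / den


def worst_delta(samples=2000, seed=2):
    """Largest observed |delta|^2 / (2 gamma_7^2) over random operands."""
    rng = random.Random(seed)
    worst = Fraction(0)
    for _ in range(samples):
        a, b, c, d = (rng.uniform(0.5, 2.0) * rng.choice((-1.0, 1.0))
                      for _ in range(4))
        re, im = smith_div(a, b, c, d)
        A, B, C, D = map(Fraction, (a, b, c, d))
        den = C * C + D * D
        q_re, q_im = (A * C + B * D) / den, (B * C - A * D) / den  # exact x/y
        err2 = (Fraction(re) - q_re) ** 2 + (Fraction(im) - q_im) ** 2
        worst = max(worst, err2 / (2 * (q_re**2 + q_im**2) * GAMMA7**2))
    return worst


# Here c*c + d*d would overflow, while the scaled formulas stay finite.
print(smith_div(1e200, 1e200, 1e200, 1e200))
print(float(worst_delta()))
```

The scaling is what allows the quotient to be formed without squaring $c$ or $d$. The price is a longer chain of rounded operations, which is consistent with the larger constant in the bound stated above.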

3.8 Exercises

Exercise 3.1 (Norm of x/y) Assume that $a, b, c, d \in \mathbb{R}$ and consider $x = a + ib$ and $y = c + id$. Prove that

\begin{equation}
|x/y|^2 = \frac{|x|^2}{|y|^2}. \tag{224}
\end{equation}

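Exercise 3.2 (Another error bound for x/y) The goal of this exercise is to prove another error bound for the complex division. We assume that the real arithmetic is standard, in the sense of the definition 3.1. Let us consider the division of two complex numbers $x$, $y$, computed by

\begin{equation}
x/y = \frac{ac + bd}{c^2 + d^2} + i\,\frac{bc - ad}{c^2 + d^2}. \tag{225}
\end{equation}

Then there exists some $\delta \in \mathbb{C}$ such that

\begin{equation}
\mathrm{fl}(x/y) = (x/y)(1 + \delta), \tag{226}
\end{equation}

with

\begin{equation}
|\delta| \leq \sqrt{2}\gamma_5, \tag{227}
\end{equation}

where $\gamma_5$ is defined by the equation 1. To perform this exercise, prove the following intermediate results.

1. Prove that

\begin{equation}
\mathrm{fl}(\mathrm{Re}(x/y)) = \mathrm{Re}(x/y) + e_1, \tag{228}
\end{equation}

where the error $e_1$ is defined by

\begin{equation}
e_1 = \frac{ac\theta_5 + bd\theta_5'}{c^2 + d^2}, \tag{229}
\end{equation}

for some $\theta_5, \theta_5' \in \mathbb{R}$.

2. Then prove that

\begin{equation}
|e_1| \leq \frac{|ac| + |bd|}{c^2 + d^2}\,\gamma_5. \tag{230}
\end{equation}

3. Similarly, compute the error $e_2$ for the imaginary part $\mathrm{Im}(x/y)$.

4. Notice that $|\mathrm{fl}(x/y) - x/y|^2 = |e_1|^2 + |e_2|^2$ and use the proposition 3.3 to prove that

\begin{equation}
|\mathrm{fl}(x/y) - x/y|^2 \leq 2\,\frac{|x|^2}{|y|^2}\,\gamma_5^2. \tag{231}
\end{equation}

5. Use the exercise 3.1 and conclude.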

4 Notes and references

In [4], Dunham shows that the use of double precision in single precision complex multiplication and division can significantly reduce rounding errors in these operations. Accumulating the inner products of the complex multiplication in double precision gives an error bound very close to the best possible one.
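The idea of double precision accumulation can be sketched as follows. This Python example is a sketch of the general technique, not Dunham's code: binary32 arithmetic is emulated by rounding binary64 results through `struct`, and the names `to_single`, `mul_single`, `mul_double_acc` and `worst_errors` are hypothetical. It measures the relative error of the complex product, in units of the single precision unit roundoff $2^{-24}$, when every operation is rounded to single precision and when each component is accumulated in double precision and rounded once.

```python
from fractions import Fraction
import math
import random
import struct

U32 = Fraction(1, 2**24)  # unit roundoff of binary32


def to_single(x):
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mul_single(a, b, c, d):
    """Naive product (a+ib)(c+id), every operation rounded to binary32."""
    re = to_single(to_single(a * c) - to_single(b * d))
    im = to_single(to_single(a * d) + to_single(b * c))
    return re, im


def mul_double_acc(a, b, c, d):
    """Same product, accumulated in binary64 and rounded once per component."""
    return to_single(a * c - b * d), to_single(a * d + b * c)


def worst_errors(samples=2000, seed=3):
    """Largest observed |e| / (u32 |xy|) for both variants, in that order."""
    rng = random.Random(seed)
    worst = [Fraction(0), Fraction(0)]
    for _ in range(samples):
        a, b, c, d = (to_single(rng.uniform(0.5, 2.0)) for _ in range(4))
        A, B, C, D = map(Fraction, (a, b, c, d))
        ex_re, ex_im = A * C - B * D, A * D + B * C  # exact product
        scale = (ex_re**2 + ex_im**2) * U32**2
        for k, mul in enumerate((mul_single, mul_double_acc)):
            re, im = mul(a, b, c, d)
            err2 = (Fraction(re) - ex_re) ** 2 + (Fraction(im) - ex_im) ** 2
            worst[k] = max(worst[k], err2 / scale)
    return tuple(math.sqrt(w) for w in worst)


single, double_acc = worst_errors()
print(f"all single: {single:.3f} u   double accumulation: {double_acc:.3f} u")
```

With double accumulation each component is, up to a negligible term, the correctly rounded single precision value of the exact component, so the relative error stays close to one unit roundoff. The all-single variant is only covered by the bound $\sqrt{2}\gamma_2$ of the proposition 3.4.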

5 Acknowledgements

We express our gratitude to Nicholas Higham for his advice.

6 Answers

6.1 Answers for section 2.3

Answer of Exercise 2.1 (Products of 1 + δs) The following script tests the proposition 2.4 numerically in Scilab. We use n = 5.

stacksize("max")
// Unit roundoff of doubles
u = %eps/2;
for N = floor(logspace(1,6,10))
    // N rows of 5 random perturbations, uniform in [-u,u]
    d = grand(N,5,'unf',-1,1).*u;
    // |prod(1+d_i) - 1| for each row, in units of u
    d = abs(prod(1+d,"c")-1)/u;
    mprintf("%d %f\n",N,max(d))
end

The previous script produces the following output.

10 4.000000
35 3.000000
129 4.000000
464 4.000000
1668 4.000000
5994 5.000000
21544 5.000000
77426 5.000000
278255 5.000000
1000000 5.000000
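The same experiment can be sketched in Python with exact rational arithmetic. This is a variant of the Scilab script above, not a translation: the product of the five factors $(1 + \delta_i)$ is computed exactly, so the observed value is not itself affected by rounding, and it is compared with $\gamma_5$. The names `worst_theta`, `N_FACTORS` and `GAMMA5` are hypothetical.

```python
from fractions import Fraction
import random

U = Fraction(1, 2**53)  # unit roundoff of binary64 (assumption: doubles)
N_FACTORS = 5
GAMMA5 = N_FACTORS * U / (1 - N_FACTORS * U)


def worst_theta(samples, seed=4):
    """Largest |prod(1 + delta_i) - 1| / u over random |delta_i| <= u."""
    rng = random.Random(seed)
    worst = Fraction(0)
    for _ in range(samples):
        prod = Fraction(1)
        for _ in range(N_FACTORS):
            delta = Fraction(rng.uniform(-1.0, 1.0)) * U
            prod *= 1 + delta
        worst = max(worst, abs(prod - 1))
    return worst / U


for n in (10, 100, 1000):
    w = worst_theta(n)
    # Second column: observed value in units of u; third: is it within gamma_5?
    print(n, float(w), w <= GAMMA5 / U)
```

Because the perturbations are drawn uniformly, the observed maximum approaches 5 only slowly as the number of samples grows, in the same way as in the Scilab output.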

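6.2 Answers for section 3

Answer of Exercise 3.1 (Norm of x/y) By definition, we have

\begin{align}
|x/y|^2 &= \left(\frac{ac + bd}{c^2 + d^2}\right)^2 + \left(\frac{bc - ad}{c^2 + d^2}\right)^2 \tag{232} \\
&= \frac{a^2c^2 + 2abcd + b^2d^2 + b^2c^2 - 2abcd + a^2d^2}{(c^2 + d^2)^2} \tag{233} \\
&= \frac{a^2c^2 + b^2d^2 + b^2c^2 + a^2d^2}{(c^2 + d^2)^2} \tag{234} \\
&= \frac{a^2(c^2 + d^2) + b^2(c^2 + d^2)}{(c^2 + d^2)^2} \tag{235} \\
&= \frac{(a^2 + b^2)(c^2 + d^2)}{(c^2 + d^2)^2} \tag{236} \\
&= \frac{a^2 + b^2}{c^2 + d^2} \tag{237} \\
&= \frac{|x|^2}{|y|^2}, \tag{238}
\end{align}

which concludes the proof.

Answer of Exercise 3.2 (Another error bound for x/y) We now use the approach used by Higham in [5] to derive an error bound for the complex division. We have

\begin{align}
\mathrm{fl}(\mathrm{Re}(x/y)) &= \mathrm{fl}\left(\frac{ac + bd}{c^2 + d^2}\right) \tag{239} \\
&= \frac{\mathrm{fl}(ac + bd)}{\mathrm{fl}(c^2 + d^2)}(1 + \delta_1) \tag{240} \\
&= \frac{(\mathrm{fl}(ac) + \mathrm{fl}(bd))(1 + \delta_2)}{(c^2 + d^2)(1 + \theta_2'')}(1 + \delta_1) \tag{241} \\
&= \frac{(ac(1 + \delta_3) + bd(1 + \delta_4))(1 + \delta_2)}{(c^2 + d^2)(1 + \theta_2'')}(1 + \delta_1) \tag{242} \\
&= \frac{ac(1 + \delta_3)(1 + \delta_2)(1 + \delta_1) + bd(1 + \delta_4)(1 + \delta_2)(1 + \delta_1)}{(c^2 + d^2)(1 + \theta_2'')}, \tag{243}
\end{align}

where $\delta_1, \delta_2, \delta_3, \delta_4 \in \mathbb{R}$ are such that $|\delta_1|, |\delta_2|, |\delta_3|, |\delta_4| \leq u$. The proposition 2.4 then implies that there are two $\theta_3''', \theta_3'''' \in \mathbb{R}$ such that

\begin{equation}
\mathrm{fl}(\mathrm{Re}(x/y)) = \frac{ac(1 + \theta_3''') + bd(1 + \theta_3'''')}{(c^2 + d^2)(1 + \theta_2'')}, \tag{244}
\end{equation}

where $|\theta_3'''|, |\theta_3''''| \leq \gamma_3$. The proposition 2.11 implies that $(1 + \theta_3''')/(1 + \theta_2'') = 1 + \theta_5$ and $(1 + \theta_3'''')/(1 + \theta_2'') = 1 + \theta_5'$ for some $\theta_5, \theta_5' \in \mathbb{R}$ such that $|\theta_5|, |\theta_5'| \leq \gamma_5$. Therefore

\begin{align}
\mathrm{fl}(\mathrm{Re}(x/y)) &= \frac{ac(1 + \theta_5) + bd(1 + \theta_5')}{c^2 + d^2} \tag{245} \\
&= \frac{ac + bd}{c^2 + d^2} + \frac{ac\theta_5 + bd\theta_5'}{c^2 + d^2} \tag{246} \\
&= \mathrm{Re}(x/y) + e_1, \tag{247}
\end{align}

where the error $e_1$ is defined by

\begin{equation}
e_1 = \frac{ac\theta_5 + bd\theta_5'}{c^2 + d^2}. \tag{248}
\end{equation}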

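We take the absolute value of $e_1$ and get

\begin{align}
|e_1| &\leq \frac{|ac|\gamma_5 + |bd|\gamma_5}{c^2 + d^2} \tag{249} \\
&\leq \frac{|ac| + |bd|}{c^2 + d^2}\,\gamma_5. \tag{250}
\end{align}

Similarly, the imaginary part $\mathrm{Im}(x/y)$ satisfies

\begin{equation}
\mathrm{fl}(\mathrm{Im}(x/y)) = \mathrm{Im}(x/y) + e_2, \tag{251}
\end{equation}

where the error $e_2$ is such that

\begin{equation}
|e_2| \leq \frac{|bc| + |ad|}{c^2 + d^2}\,\gamma_5. \tag{252}
\end{equation}

Therefore, the error for the floating point representation of $x/y$ is

\begin{align}
|\mathrm{fl}(x/y) - x/y|^2 &= |\mathrm{fl}(\mathrm{Re}(x/y)) + i\,\mathrm{fl}(\mathrm{Im}(x/y)) - x/y|^2 \tag{253} \\
&= |\mathrm{Re}(x/y) + e_1 + i(\mathrm{Im}(x/y) + e_2) - x/y|^2 \tag{254} \\
&= |e_1 + ie_2|^2 \tag{255} \\
&= |e_1|^2 + |e_2|^2 \tag{256} \\
&\leq \frac{(|ac| + |bd|)^2}{(c^2 + d^2)^2}\,\gamma_5^2 + \frac{(|bc| + |ad|)^2}{(c^2 + d^2)^2}\,\gamma_5^2 \tag{257} \\
&\leq \frac{(|ac| + |bd|)^2 + (|bc| + |ad|)^2}{(c^2 + d^2)^2}\,\gamma_5^2. \tag{258}
\end{align}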

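The proposition 3.3 implies

\begin{align}
|\mathrm{fl}(x/y) - x/y|^2 &\leq \frac{2(a^2 + b^2)(c^2 + d^2)}{(c^2 + d^2)^2}\,\gamma_5^2 \tag{259} \\
&\leq \frac{a^2 + b^2}{c^2 + d^2}\,2\gamma_5^2 \tag{260} \\
&\leq \frac{|x|^2}{|y|^2}\,2\gamma_5^2. \tag{261}
\end{align}

Finally, the equation 224 implies

\begin{equation}
|\mathrm{fl}(x/y) - x/y|^2 \leq |x/y|^2\,2\gamma_5^2, \tag{262}
\end{equation}

which implies

\begin{equation}
|\mathrm{fl}(x/y) - x/y| \leq |x/y|\,\sqrt{2}\gamma_5. \tag{263}
\end{equation}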
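The componentwise bounds 250 and 252 can be checked directly on random operands. The following Python sketch is an illustration under stated assumptions, not part of the original answer: it assumes binary64 arithmetic with round-to-nearest, and `naive_div` and `componentwise_ok` are hypothetical names. The errors $e_1$ and $e_2$ are computed exactly with rational arithmetic.

```python
from fractions import Fraction
import random

U = Fraction(1, 2**53)  # unit roundoff of binary64 (assumption: doubles)
GAMMA5 = 5 * U / (1 - 5 * U)


def naive_div(a, b, c, d):
    """Naive floating point quotient (a+ib)/(c+id), following equation 225."""
    den = c * c + d * d
    return (a * c + b * d) / den, (b * c - a * d) / den


def componentwise_ok(samples=2000, seed=5):
    """True if the bounds 250 and 252 hold for every sampled quotient."""
    rng = random.Random(seed)
    for _ in range(samples):
        # Random signs exercise the cancellation in ac + bd and bc - ad.
        a, b, c, d = (rng.uniform(0.5, 2.0) * rng.choice((-1.0, 1.0))
                      for _ in range(4))
        re, im = naive_div(a, b, c, d)
        A, B, C, D = map(Fraction, (a, b, c, d))
        den = C * C + D * D
        e1 = Fraction(re) - (A * C + B * D) / den  # error on Re(x/y)
        e2 = Fraction(im) - (B * C - A * D) / den  # error on Im(x/y)
        if abs(e1) > (abs(A * C) + abs(B * D)) / den * GAMMA5:
            return False
        if abs(e2) > (abs(B * C) + abs(A * D)) / den * GAMMA5:
            return False
    return True


print(componentwise_ok())
```

The bounds involve $|ac| + |bd|$ and $|bc| + |ad|$ rather than $|ac + bd|$ and $|bc - ad|$. They therefore do not control the relative error of each component when cancellation occurs, which is consistent with the remark of Higham quoted in the section 3.6.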

References

[1] Richard Brent, Colin Percival, and Paul Zimmermann. Error Bounds on Complex Floating-Point Multiplication. Mathematics of Computation, 76:1469–1481, 2007.

[2] J.T. Coonen. Underflow and the denormalized numbers. Computer, 14(3):75–87, March 1981.

[3] J. W. Demmel. Underflow and the reliability of numerical software. SIAM Journal on Scientific and Statistical Computing, 5:887–919, 1984.

[4] C. B. Dunham. Improvement of complex arithmetic by use of double elements. SIGNUM Newsl., 24:3–7, November 1989.

[5] Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002.

[6] IEEE Task P754. IEEE 754-2008, Standard for Floating-Point Arithmetic. August 2008.

[7] Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre Jeannerod, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Damien Stehlé, and Serge Torres. Handbook of Floating-Point Arithmetic. Birkhäuser Boston, 2010. ACM G.1.0; G.1.2; G.4; B.2.0; B.2.4; F.2.1., ISBN 978-0-8176-4704-9.

[8] G. W. Stewart. A note on complex division. ACM Transactions on Mathematical Software, 11(3):238–241, 1985.
