
Error bounds of complex arithmetic

Michael Baudin

June 2011

Abstract

In this document, we present error bounds for complex arithmetic. We review the error bounds which are published in the bibliography. We give the proof of some error bounds for the complex addition, multiplication and division. A new proof of the error bound associated with the complex division is presented. This proof leads to an error bound which is smaller than other published bounds.

Contents

1 Introduction
2 Useful results for error bounds
  2.1 Properties of $\gamma_n$
  2.2 Properties of $\theta_j$
  2.3 Exercises
3 Error bounds for complex arithmetic
  3.1 Standard arithmetic model
  3.2 Error bound for complex addition
  3.3 Error bound for complex multiplication
  3.4 Other error bounds for complex multiplication
  3.5 Error bound for complex division
  3.6 Other error bounds for the naive complex division algorithm
  3.7 Error bounds for Smith's complex division algorithm
  3.8 Exercises
4 Notes and references
5 Acknowledgements
6 Answers
  6.1 Answers for section 2.3
  6.2 Answers for section 3

Copyright © 2011 - Michael Baudin

This file must be used under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License:

http://creativecommons.org/licenses/by-sa/3.0

1 Introduction

In this section, we present the parameters of the floating point numbers which are used in Scilab, that is, the binary64 doubles.

Scilab uses double precision floating point numbers, which correspond to the binary64 floating point numbers of the IEEE 754-2008 standard [6]. In this floating point system, the significand has $p = 53$ binary digits of precision and the exponent is stored on 11 binary digits. The parameters of the IEEE 754 double precision floating point system are presented in figure 1.

| Parameter | Symbol | Value |
|---|---|---|
| Radix | $\beta$ | 2 |
| Precision | $p$ | 53 |
| Exponent bits | | 11 |
| Minimum exponent | $e_{\min}$ | -1022 |
| Maximum exponent | $e_{\max}$ | 1023 |
| Largest positive normal | $\Omega$ | $(2 - 2^{-52}) \cdot 2^{1023} \approx 1.79 \times 10^{308}$ |
| Smallest positive normal | $\mu$ | $2^{-1022} \approx 2.22 \times 10^{-308}$ |
| Smallest positive subnormal | $\alpha$ | $2^{-1074} \approx 4.94 \times 10^{-324}$ |
| Epsilon | $\epsilon$ | $2^{-52} \approx 2.220 \times 10^{-16}$ |
| Unit roundoff | $u$ | $2^{-53} \approx 1.110 \times 10^{-16}$ |

Figure 1: Scilab IEEE 754 doubles

The benefit of the IEEE 754-2008 standard [6] is that it increases the portability of programs which use floating point numbers, and especially those written in the Scilab language. Indeed, on all machines where Scilab is available, the radix, the precision and the number of bits in the exponent are the same. As a consequence, the machine epsilon always has the same value for doubles. The same holds for the largest positive normal $\Omega$ and the smallest positive normal $\mu$, which are always equal to the values presented in figure 1. The situation for the smallest positive subnormal $\alpha$ is more complex but, in this paper, we assume that gradual underflow is available.

Doubles have intrinsic limitations which can lead to rounding, overflow or underflow. Indeed, not all mathematical numbers $x$ can be represented as floating point numbers.
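
The rounding mentioned above can be observed directly in Scilab. The following is a minimal sketch, assuming binary64 doubles and the default round-to-nearest mode. It only relies on the built-in constant %eps; the variable names lost, kept and inexact are ours and purely illustrative.

```scilab
// Minimal sketch: assumes IEEE 754 binary64 doubles, round-to-nearest.
u = %eps / 2;   // unit roundoff, i.e. half the machine epsilon

// 1 + u lies exactly halfway between the consecutive doubles 1 and 1 + %eps.
// Round-to-nearest (ties to even) stores it as 1, so u is lost.
lost = ((1 + u) == 1);

// 1 + %eps is the next double after 1, so it is stored without rounding.
kept = ((1 + %eps) > 1);

// 0.1, 0.2 and 0.3 have no finite binary expansion: each one is rounded
// when it is stored, and the computed sum 0.1 + 0.2 differs from the double 0.3.
inexact = ((0.1 + 0.2) <> 0.3);

disp([lost kept inexact])
```

Under the stated assumptions, the three flags are expected to be true.
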
We denote by $\mathrm{fl}(x)$ the floating point representation of $x$. In this paper, we are mostly concerned with overflow and underflow. Indeed, if $x$ is a number such that $|x| < \alpha$, then the floating point representation of $x$ is $\mathrm{fl}(x) = 0$, i.e. $x$ underflows. On the other hand, if $x$ is a number such that $|x| > \Omega$, then the floating point representation of $x$ is $\mathrm{fl}(x) = \mathrm{Inf}$, i.e. $x$ overflows.

2 Useful results for error bounds

In this section, we derive propositions which are useful in the context of error analysis. These results are presented by Higham [5].

2.1 Properties of $\gamma_n$

In this section, we assume that $u$, the unit roundoff, is a positive real number such that $0 < u < 1$. The unit roundoff is half the machine epsilon $\epsilon$, i.e. $u = \epsilon/2$.

Definition 2.1. (Gamma-n) For $n = 1, 2, \ldots$ such that $nu < 1$, let us define $\gamma_n$ by

$$
\gamma_n = \frac{nu}{1 - nu}. \tag{1}
$$

Figure 2 presents approximations of the ratio $\frac{\gamma_n}{nu}$ for increasing values of $n$. For $n \le 10^{15}$, we have $\gamma_n \le 1.13\,nu$.

| $n$ | $\gamma_n/(nu)$ | $n$ | $\gamma_n/(nu)$ |
|---|---|---|---|
| 1 | 1.00000000000000020 | $10^{1}$ | 1.00000000000000110 |
| 2 | 1.00000000000000020 | $10^{2}$ | 1.00000000000001110 |
| 3 | 1.00000000000000020 | $10^{3}$ | 1.00000000000011100 |
| 4 | 1.00000000000000040 | $10^{4}$ | 1.00000000000111020 |
| 5 | 1.00000000000000040 | $10^{5}$ | 1.00000000001110220 |
| 6 | 1.00000000000000070 | $10^{6}$ | 1.00000000011102230 |
| 7 | 1.00000000000000070 | $10^{7}$ | 1.00000000111022300 |
| 8 | 1.00000000000000090 | $10^{8}$ | 1.00000001110223050 |
| 9 | 1.00000000000000090 | $10^{9}$ | 1.00000011102231490 |
| 10 | 1.00000000000000110 | $10^{10}$ | 1.00000111022425720 |
| 11 | 1.00000000000000130 | $10^{11}$ | 1.00001110235350700 |
| 12 | 1.00000000000000130 | $10^{12}$ | 1.00011103462978280 |
| 13 | 1.00000000000000160 | $10^{13}$ | 1.00111145698976610 |
| 14 | 1.00000000000000160 | $10^{14}$ | 1.01122687358170120 |
| 15 | 1.00000000000000160 | $10^{15}$ | 1.12488761278269790 |

Figure 2: Approximations of $\frac{\gamma_n}{nu}$ to 17 decimal digits. We consider the unit roundoff $u$ which corresponds to binary64 (double precision) floating point numbers.

The following gamman function returns the value of $\gamma_n$ which corresponds to the given $n$.

    function y = gamman(n)
        // Unit roundoff: half the machine epsilon %eps
        u = %eps / 2;
        // gamma_n = n*u / (1 - n*u), computed elementwise so that n can be a vector
        y = (n .* u) ./ (1 - n .* u);
    endfunction

In the following session, we compute the unit roundoff $u$ and the ratio $\gamma_n/u$ for increasing values of $n$. The definition of n is not shown in the session; the printed values correspond to $n = 2^0, 2^4, 2^8, \ldots, 2^{52}$.

    -->u=%eps/2
     u  =
        1.110D-16
    -->[n gamman(n)/u]
     ans  =
        1.           1.
        16.          16.
        256.         256.
        4096.        4096.
        65536.       65536.
        1048576.     1048576.
        16777216.    16777216.
        2.684D+08    2.684D+08
        4.295D+09    4.295D+09
        6.872D+10    6.872D+10
        1.100D+12    1.100D+12
        1.759D+13    1.763D+13
        2.815D+14    2.906D+14
        4.504D+15    9.007D+15

Notice that the equation (1) can be applied even if $n$ is a real number greater than 1. In some of the results that we are going to derive, we will consider real numbers $n$, as this does not change the definition.

In this section, we will regularly use the following result from calculus, which describes the geometric series.

Proposition 2.2. (Geometric series) Assume that $x \in \mathbb{R}$ is such that $|x| < 1$. Then

$$
\frac{1}{1 - x} = 1 + x + x^2 + x^3 + x^4 + \ldots \tag{2}
$$

Before entering into more complex developments, let us consider more elementary properties of $\gamma_n$.

Proposition 2.3. (Basic properties of $\gamma_n$) Assume that $j, k, n \in \mathbb{R}$ are three real numbers such that $j, k, n \ge 1$ and $ju, ku, nu < 1$. If $j \le k$, then

$$
\gamma_j \le \gamma_k. \tag{3}
$$

If $jku < 1$, then

$$
j\gamma_k \le \gamma_{jk}. \tag{4}
$$

Notice that we have chosen to consider the case where $j, k, n$ are real numbers instead of the more restrictive case where they are integers. As we are going to see, this does not change the results that we are going to prove.

Proof. We first prove the inequality (3). Assume that $j \le k$. Since $u > 0$, we have

$$
ju \le ku. \tag{5}
$$

The hypothesis $ju < 1$ implies that $1 - ju$ is positive, hence we can multiply both sides of the previous inequality by $\frac{1}{1 - ju}$ and get

$$
\frac{ju}{1 - ju} \le \frac{ku}{1 - ju}. \tag{6}
$$

Moreover, the inequality (5) implies $-ju \ge -ku$, which implies $1 - ju \ge 1 - ku$. The hypothesis $ku < 1$ implies that $1 - ku > 0$, which allows us to invert the inequality and implies

$$
\frac{1}{1 - ju} \le \frac{1}{1 - ku}. \tag{7}
$$

We multiply the inequality (7) by $ku > 0$, plug the result into (6) and get

$$
\frac{ju}{1 - ju} \le \frac{ku}{1 - ku}, \tag{8}
$$

which immediately leads to the inequality (3).
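
Before turning to the second inequality, the inequality (3) can be sanity-checked numerically. The sketch below redefines the gamman function from the text so that it is self-contained, then compares $\gamma_j$ and $\gamma_k$ on a few pairs such that $j \le k$. The sample vectors j and k are arbitrary choices of ours, and such a floating point check only illustrates the statement; it does not replace the proof.

```scilab
// gamma_n as defined in equation (1), with u the unit roundoff.
function y = gamman(n)
    u = %eps / 2;
    y = (n .* u) ./ (1 - n .* u);
endfunction

// Arbitrary sample pairs (our choice) with 1 <= j <= k and k*u < 1.
// Real, non-integer values are allowed by Proposition 2.3.
j = [1 2 10   1.5e3 1e6 1e12];
k = [1 3 10.5 2e3   1e9 1e15];

// Inequality (3): gamma_j <= gamma_k whenever j <= k.
ok = and(gamman(j) <= gamman(k));
disp(ok)
```

The flag ok is expected to be true for these samples, since each rounded operation in gamman preserves the ordering of its arguments.
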

We now prove the inequality (4). We have

$$
j\gamma_k = \frac{jku}{1 - ku}. \tag{9}
$$

The hypothesis $j \ge 1$ implies $k \le kj$. Then the inequality $u > 0$ implies $ku \le jku$. Hence $1 - ku \ge 1 - jku$. The hypothesis $jku < 1$ implies that $1 - jku > 0$. Hence, we can invert the previous inequality, which implies

$$
\frac{1}{1 - ku} \le \frac{1}{1 - jku}. \tag{10}
$$

We plug the inequality (10) into (9), which implies

$$
j\gamma_k \le \frac{jku}{1 - jku}. \tag{11}
$$

Then, the definition of $\gamma_{jk}$ concludes the proof.

In most error analyses, we consider terms of the form $1 + \delta$, where $\delta$ is a relative error of small magnitude, i.e. such that $|\delta| \le u$. In the following proposition, we prove that a product of terms $1 + \delta_i$ can be expressed as $1 + \theta_n$, where the bound on $|\theta_n|$ is only slightly larger than $nu$.

Proposition 2.4. (Bound for a product of $(1 + \delta_i)$) Assume that $n = 1, 2, \ldots$ is such that $nu < 1$. Assume that $\delta_i \in \mathbb{R}$ is such that $|\delta_i| \le u$, for $i = 1, 2, \ldots, n$. Then there exists a $\theta_n \in \mathbb{R}$ such that

$$
(1 + \delta_1)(1 + \delta_2) \cdots (1 + \delta_n) = 1 + \theta_n, \tag{12}
$$

where

$$
|\theta_n| \le \gamma_n, \tag{13}
$$

and where $\gamma_n$ is defined by (1).

Proof. We first check that this proposition is true for $n = 1$ and $n = 2$.

It is easy to check this proposition for $n = 1$. We assume that $\delta_1 \in \mathbb{R}$ is such that $|\delta_1| \le u$. Then $1 + \delta_1 = 1 + \theta_1$, with $\theta_1 = \delta_1$. Moreover, $|\theta_1| = |\delta_1| \le u$. But $u > 0$, by hypothesis. Therefore, $-u < 0$, which implies $1 - u < 1$. Since $u < 1$, we also have $1 - u > 0$. Hence, $1 < \frac{1}{1 - u}$. This leads to $|\theta_1| \le u \le \frac{u}{1 - u} = \gamma_1$.

We can also check this proposition for $n = 2$. We assume that $\delta_i \in \mathbb{R}$ is such that $|\delta_i| \le u$, for $i = 1, 2$. We have

$$
(1 + \delta_1)(1 + \delta_2) = 1 + \delta_1 + \delta_2 + \delta_1\delta_2. \tag{14}
$$

Hence,

$$
(1 + \delta_1)(1 + \delta_2) = 1 + \theta_2, \tag{15}
$$

where $\theta_2 = \delta_1 + \delta_2 + \delta_1\delta_2$. We have to prove that $|\theta_2| \le \gamma_2$. We consider the absolute value of $\theta_2$, and get

\begin{align}
|\theta_2| &\le |\delta_1| + |\delta_2| + |\delta_1||\delta_2| \tag{16} \\
&\le u + u + u \cdot u \tag{17} \\
&\le 2u + u^2 \tag{18} \\
&\le 2u\left(1 + \frac{u}{2}\right). \tag{19}
\end{align}

We can conclude if we can prove that $1 + \frac{u}{2} \le \frac{1}{1 - 2u}$, since this implies $|\theta_2| \le \frac{2u}{1 - 2u} = \gamma_2$.