Introduction & Floating Point (NMMC §1.4.1, Atkinson §1.2)

1 Floating-point representation, IEEE 754.

In binary, $11.01 = 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2}$. The double-precision floating-point format has 64 bits: a sign bit and bits $a_1, \ldots, a_{11}$ and $b_1, \ldots, b_{52}$. The associated number is

  $\# = \pm 1.b_1 \cdots b_{52} \times 2^{a_1 \cdots a_{11} - 1023}.$

If the $a_i$ are all zero, then the number is `subnormal' and the representation is

  $\# = \pm 0.b_1 \cdots b_{52} \times 2^{-1022}.$

If the $a_i$ are all 1 and the $b_i$ are all 0, then it is $\pm\infty$ ($\pm 1/0$). If the $a_i$ are all 1 and the $b_i$ are not all 0, then it is NaN ($\pm 0/0$). Machine epsilon is the difference between 1 and the smallest representable number larger than 1, i.e. $\epsilon = 2^{-52} \approx 2 \times 10^{-16}$. The smallest and largest positive numbers are $\approx 10^{\pm 308}$. Numbers larger than the max are set to $\infty$.

2 Rounding & Arithmetic.

Round-to-nearest rounds a number to the nearest floating-point number. For nonzero numbers $x$ within the max/min range, the relative error due to rounding satisfies

  $\frac{|\mathrm{round}(x) - x|}{|x|} \le 2^{-53} \approx 10^{-16} = u,$

which is half of machine epsilon ($u$ is the unit roundoff). Floating-point arithmetic: add, subtract, multiply, divide. IEEE 754 guarantees that if you start with two exactly represented floating-point numbers $x$ and $y$ and compute any arithmetic operation `op', the magnitude of the relative error in the result is at most half machine epsilon; i.e. there is some $\delta$ with $|\delta| \le u$ such that

  $\mathrm{fl}(x \,\mathrm{op}\, y) = (x \,\mathrm{op}\, y)(1 + \delta).$

The exception is underflow/overflow, where the result is smaller or larger than can be represented.

Example: $1 + 2^{-53} = 1$ in floating point, even though both numbers are exactly representable in double precision. Now consider $1 + 2^{-53} + 2^{-53} = 1 + 2^{-52}$. If you add left to right you get 1; if you add right to left you get the right answer. Order matters.

Catastrophic cancellation: if you subtract two numbers that are very close, the result has very few significant digits. Consider

  $r = p - \sqrt{p^2 + q} = \frac{-q}{p + \sqrt{p^2 + q}}, \qquad p = 12345678, \; q = 1.$

In double precision the left formula produces $r \approx -4.0978 \times 10^{-8}$, while the right formula produces $r \approx -4.0500 \times 10^{-8}$ (the correct value). The left formula suffers from catastrophic cancellation, resulting in very few significant digits. The easiest way to check whether your results are sensitive to roundoff errors is to compute the solution at different levels of precision (single, double, quad); see the sketches below.

3 Types of error analysis.

Let $x$ be the input to some problem (e.g. the coefficient matrix or the right-hand side) and $y$ be the solution.

• Forward error: let $y$ be the true solution and $\tilde{y}$ the computed solution; one attempts to bound $|y - \tilde{y}|$.
• Backward error: one attempts to show that $\tilde{y}$ is the exact solution of a perturbed problem with data $\tilde{x}$, and then to bound $|x - \tilde{x}|$.

Example: solve $3y = x$ where $x = 2$. The computed solution will be $\tilde{y} = (2/3)(1 + \delta)$. The forward error estimate is $|y - \tilde{y}| \le (2/3)u$. The backward error estimate says that this is the exact solution for $\tilde{x} = x(1 + \delta)$, with error $|x - \tilde{x}| \le |x|u$.
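To make the bit layout in §1 concrete, here is a minimal Python sketch (the helper name fields_of is my own, not from the notes; only the standard struct and math modules are used). It unpacks the sign bit, the exponent bits $a_1 \cdots a_{11}$, and the fraction bits $b_1 \cdots b_{52}$ of a double, and computes machine epsilon with math.nextafter (Python 3.9+):

    import math
    import struct

    def fields_of(x: float):
        """Split a double into its sign bit, 11 exponent bits, and 52 fraction bits."""
        (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # raw 64-bit pattern
        sign = bits >> 63
        exponent = (bits >> 52) & 0x7FF      # a_1 ... a_11 as an integer (bias 1023)
        fraction = bits & ((1 << 52) - 1)    # b_1 ... b_52 as an integer
        return sign, exponent, fraction

    print(fields_of(1.0))                 # (0, 1023, 0): 1.0 = +1.0...0 x 2^(1023-1023)
    print(fields_of(float("inf")))        # (0, 2047, 0): exponent bits all 1, fraction 0
    eps = math.nextafter(1.0, 2.0) - 1.0  # gap between 1 and the next double up
    print(eps == 2.0**-52)                # True: machine epsilon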
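The order-of-summation example in §2 can be checked directly (Python floats are IEEE doubles):

    a = 2.0**-53
    print((1.0 + a) + a == 1.0)             # True: left to right, each tiny term rounds away
    print(1.0 + (a + a) == 1.0 + 2.0**-52)  # True: right to left, the small terms combine first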
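The cancellation example in §2 as a sketch; math.sqrt is correctly rounded, so the damage comes entirely from subtracting two nearly equal numbers in the naive formula:

    import math

    p, q = 12345678.0, 1.0
    r_naive = p - math.sqrt(p**2 + q)           # subtracts two nearly equal numbers
    r_stable = -q / (p + math.sqrt(p**2 + q))   # algebraically identical, no cancellation
    print(r_naive)    # about -4.0978e-08: only the leading digit is correct
    print(r_stable)   # about -4.0500e-08: correct to full precision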
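§2 also recommends re-running a computation at different precisions to detect roundoff sensitivity. A sketch of that check, assuming NumPy is available (pure Python only offers doubles): the naive formula gives completely different answers in single and double precision, which flags the problem.

    import numpy as np

    def r_naive(p, q):
        return p - np.sqrt(p * p + q)   # the cancellation-prone formula

    print(r_naive(np.float32(12345678.0), np.float32(1.0)))  # 0.0: swamped by single-precision rounding
    print(r_naive(np.float64(12345678.0), np.float64(1.0)))  # about -4.0978e-08
    # The two precisions disagree wildly, so the result is sensitive to roundoff.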
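The $3y = x$ example in §3 can be verified with exact rational arithmetic: Fraction(float) recovers the exact binary value of a double, so both error bounds can be checked without any further rounding (a sketch; the variable names are mine):

    from fractions import Fraction

    x = 2.0
    y_tilde = x / 3.0         # computed solution: (2/3)(1 + delta) with |delta| <= u
    u = Fraction(1, 2**53)    # unit roundoff

    # Forward error: |y - y_tilde| <= (2/3) u, where y = 2/3 is the true solution.
    print(abs(Fraction(2, 3) - Fraction(y_tilde)) <= Fraction(2, 3) * u)  # True

    # Backward error: y_tilde exactly solves 3 y = x_tilde with x_tilde = 3 * y_tilde,
    # and |x - x_tilde| <= |x| u.
    x_tilde = 3 * Fraction(y_tilde)
    print(abs(Fraction(x) - x_tilde) <= Fraction(x) * u)                  # True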
