Lecture 10: Floating Point

Lecture 10: Floating Point

Lecture 10: Floating Point Spring 2019 Jason Tang "1 Topics • Representing floating point numbers! • Floating point arithmetic! • Floating point accuracy "2 Floating Point Numbers • So far, all arithmetic has involved numbers in the set of ℕ and ℤ! • But what about:! • Very large integers, greater than 2n, given only n bits! • Very small numbers, like 0.000123! • Rational numbers in $ but not in ℤ, like ! • Irrational and transcendental numbers, like p 2 and "3 Floating Point Numbers Decimal Point Exponent% (signed magnitude) Significand or Fraction or Mantissa% Magnitude (signed magnitude) Radix or Base • Issues:! • Arithmetic (addition, subtraction, multiplication, division)! • Representation, normal form! • Range and precision! • Rounding! • Exceptions (divide by zero, overflow, underflow) "4 Normal Form • The value 41110 could be stored as 4110 & 10-1, 411 & 100, 41.1 & 101, 4.11 & 102, etc.! • In scientific notation, values are usually normalized such that one non-zero digit to left of decimal point: 4.11 & 102! • Computers numbers use base-2: 1.01 & 21101! • Because the digit to the left of the decimal point (“binary point”?) will always be a 1, that digit is omitted when storing the number! • In this example, the stored significand will be 01, and exponent is 1101 "5 Floating Point Representation • Size of exponent determines the range of represented numbers! • Accuracy of representation depends on size of significand! • Trade-o' between accuracy and range! • Overflow: required positive exponent too large to fit in given number of bits for exponent! • Underflow: required negative exponent too large to fit in given number of bits for exponent "6 IEEE 754 Standard Representation Type Sign Exponent Significand Total Bits Exponent Bias Half 1 5 10 16 15 Single 1 8 23 32 127 Double 1 11 52 64 1023 Quad 1 15 112 128 16383 • Same representation used in nearly all computers since mid-1980s! • In general, a floating point number = (-1)Sign & 1.Significand & 2Exponent! • Exponent is biased, instead of Two’s Complement! • For single precision, actual magnitude is 2(Exponent - 127) "7 Single-Precision Example • Convert -12.62510 to single precision IEEE-754 format! • Step 1: Convert to target base 2: -12.62510 → -1100.1012! • Step 2: Normalize: -1100.1012 → -1.1001012 & 23! • Step 3: Add bias to exponent: 3 → 130 S Exponent Significand 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Leading 1 of significand is implied "8 Single and Double Precision Example • Convert -0.7510 to single and double precision IEEE-754 format! • -0.7510 → (-3/4)10 → (-3/22)10 → -112 & 2-2 → -1.12 & 2-1! S Exponent Significand (23 bits) • Single Precision:! 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 • Double Precision: S Exponent Significand (upper 20 bits) 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Significand (lower 32 bits) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 "9 Denormalized Numbers • The smallest single precision normalized number is 1.000 0000 0000 0000 0000 0001 & 2-126! • Use denormalized (or subnormal) numbers to store a value between 0 and above number! • Denormal numbers have a leading implicit 0 instead of 1! • Needed when subtracting two values, where the di'erence is not 0 but close to it! • Denormalized are allowed to degrade in significance until it becomes 0 (gradual underflow) "10 Special Values • Some bit patterns are special:! • Negative zero! • ±Infinity, for overflows! • Not a Number (NaN), for invalid operations like 0/0, ( - (, or Single Precision Double Precision Special Value Exponent Significand Exponent Significand 0 0 0 0 0 or -0 0 Nonzero 0 Nonzero ± denormalized 1-254 Anything 1-2046 Anything ± floating point 255 0 2047 0 ± infinity 255 Nonzero 2047 Nonzero NaN "11 Floating Point Arithmetic • Floating point arithmetic di'ers from integer arithmetic in that exponents are handled as well as the significands! • For addition and subtraction, exponents of operands must be equal! • Significands are then added/subtracted, and then result is normalized! • Example: [1].101 & 23 + [1].111 & 24! • Adjust exponents to equal larger exponent: 0.1101 & 24 + [1].1110 & 24! • Sum is thus 10.1011 & 24 → [1].01011 & 25 "12 Floating Point Addition / Subtraction Start • Compute E = Aexp - Bexp! 1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent would match the larger exponent • Right-shift Asig to form Asig & 2E 2. Add the significands • Compute R = Asig + Bsig! 3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent • Left shift R and decrement E, or right shift Overflow or Yes R and increment E, until MSB of R is underflow? No implicit 1 (normalized form)! Exception 4. Round the significand to the appropriate number of bits • If cannot left shift enough, then keep R No as denormalized Still normalized? Yes Done "13 Floating Point Addition Example • Calculate 0.510 - 0.437510, with only 4 bits of precision! • 0.510 → (1/2)10 → 0.12 & 20 → 1.0002 & 2-1! • 0.437510 → (1/4)10 + (1/8)10 + (1/16)10 → 0.01112 & 20 → 1.1102 & 2-2! • Shift operand with lesser exponent: 1.1102 & 2-2 → 0.1112 & 2-1! • Subtract significands: 1.0002 & 2-1 - 0.1112 & 2-1 = 0.0012 & 2-1! • Normalize: 0.0012 & 2-1 → 1.0002 & 2-4 = (2-4)10 → (1/16)10 → 0.062510! • No overflow/underflow, because exponent is between -126 and +127 "14 Floating Addition Hardware Sign Exponent Significand Sign Exponent Significand Compare Small ALU exponents Exponent difference 0 1 0 1 0 1 Shift smaller Control Shift right number right Add Big ALU 0 1 0 1 Increment or decrement Shift left or right Normalize Rounding hardware Round Sign Exponent Significand "15 Floating Point Multiplication and Division • For multiplication and division, the sign, exponent, and significand are computed separately! • Same signs → positive result, di'erent signs → negative result! • Exponents calculated by adding/subtracting exponents! • Significands multiplied/divided, and then normalized "16 Floating Point Multiplication Example • Calculate 0.510 & -0.437510, with 1. 0 0 0 only 4 bits of precision! & 1.0 1 1 0 0 0 0 0 • 1.0002 & 2-1 & 1.1102 & 2-2! 1 0 0 0 1 0 0 0 • Sign bits di'er, so result is negative + 1 0 0 0 (sign bit = 1)! 1.0 1 1 0 0 0 0 • Add exponents, without bias: (-1) + • Keep result to 4 precision bits: (-2) = -3! 1.1102 & 2-3! • Multiply significands:! • Normalize results: -1.1102 & 2-3 "17 Floating Point Multiplication Start • Compute E = Aexp + Bexp - Bias! 1. Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent • Compute S = Asig & Bsig! 2. Multiply the significands 3. Normalize the product if necessary, shifting • Left shift S and decrement E, or right shift S it right and incrementing the exponent and increment E, until MSB of S is implicit 1 Overflow or Yes (normalized form)! underflow? No Exception 4. Round the significand to the appropriate • If cannot left shift enough, then keep S as number of bits denormalized! No Still normalized? • Round S to specified size! Yes 5. Set the sign of the product to positive if the signs of the original operands are the same; if they differ make the sign negative • Calculate sign of product Done "18 Accuracy • Floating-point numbers are approximations of a value in ℝ! • Example: ) in single-precision floating point is 1.100100100001111110110112 & 21! • This floating point value is exactly 3.1415927410125732421875! • Truer value is 3.1415926535897932384626…! • Floating point value is accurate to only 8 decimal digits! • Hardware needs to round accurately after arithmetic operations http://www.exploringbinary.com/pi-and-e-in-binary/ "19 Rounding Errors • Unit in the last place (ulp): number of bits in error in the LSB of significand between actual number and number that can be represented! • Example: store a floating point number in base 10, maximum 3 significand digits! • 3.14 & 101 and 3.15 & 101 are valid, but 3.145 & 101 could not be stored! • ULP is thus 0.01! • If storing ), and then rounding to nearest (i.e., 3.14 & 101), the rounding error is 0.0015926…! • If rounding to nearest, then maximum rounding error 0.005, or 0.5 of a ULP https://matthew-brett.github.io/teaching/floating_error.html "20 Accurate Arithmetic • When adding and subtracting, append extra guard digits to LSB! • When rounding to nearest even, use guard bits to determine to round up or down! • Example: 2.3410 & 100 + 2.5610 & 10-2, with and without guard bits, maximum 3 significand digits 2. 3 4 2. 3 4 0 0 + 0. 0 2 + 0. 0 2 5 6 → rounded to 2.37 & 100 2. 3 6 2. 3 6 5 6 "21 IEEE 754 Rounding Modes • Always round up (towards +()! • Always round down (towards -()! • Truncate! • Round to nearest even (RNE)! • Most commonly used, including on IRS tax forms! • If LSB is 1, then round up (resulting in a LSB of 0); otherwise round down (again resulting in a LSB of 0) "22.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    22 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us