Lecture 10: Floating Point

Spring 2019
Jason Tang

Topics
• Representing floating point numbers
• Floating point arithmetic
• Floating point accuracy

Floating Point Numbers
• So far, all arithmetic has involved numbers in the sets ℕ and ℤ
• But what about:
  • Very large integers, greater than $2^n$, given only $n$ bits
  • Very small numbers, like 0.000123
  • Rational numbers in ℚ but not in ℤ
  • Irrational and transcendental numbers, like $\sqrt{2}$

Floating Point Numbers
[Figure: a number in scientific notation with its parts labeled: decimal point, exponent (signed magnitude), significand or fraction or mantissa (signed magnitude), and radix or base]
• Issues:
  • Arithmetic (addition, subtraction, multiplication, division)
  • Representation, normal form
  • Range and precision
  • Rounding
  • Exceptions (divide by zero, overflow, underflow)

Normal Form
• The value $411_{10}$ could be stored as $4110 \times 10^{-1}$, $411 \times 10^{0}$, $41.1 \times 10^{1}$, $4.11 \times 10^{2}$, etc.
• In scientific notation, values are usually normalized so that exactly one non-zero digit is to the left of the decimal point: $4.11 \times 10^{2}$
• Computers use base 2: $1.01_2 \times 2^{1101_2}$
• Because the digit to the left of the decimal point (the "binary point") will always be a 1, that digit is omitted when storing the number
• In this example, the stored significand will be 01, and the exponent is 1101

Floating Point Representation
• Size of the exponent determines the range of represented numbers
• Accuracy of the representation depends on the size of the significand
• Trade-off between accuracy and range
• Overflow: the required positive exponent is too large to fit in the number of bits given to the exponent
• Underflow: the required negative exponent is too large in magnitude to fit in the number of bits given to the exponent

IEEE 754 Standard Representation

Type     Sign   Exponent   Significand   Total Bits   Exponent Bias
Half     1      5          10            16           15
Single   1      8          23            32           127
Double   1      11         52            64           1023
Quad     1      15         112           128          16383

• Same representation used in nearly all computers since the mid-1980s
• In general, a floating point number = $(-1)^{\text{Sign}} \times 1.\text{Significand} \times 2^{\text{Exponent}}$
• The exponent is stored biased, instead of in two's complement
• For single precision, the actual magnitude is $2^{(\text{Exponent} - 127)}$

Single-Precision Example
• Convert $-12.625_{10}$ to single precision IEEE-754 format
• Step 1: Convert to target base 2: $-12.625_{10}$ → $-1100.101_2$
• Step 2: Normalize: $-1100.101_2$ → $-1.100101_2 \times 2^{3}$
• Step 3: Add bias to exponent: 3 → 130

S   Exponent   Significand
1   10000010   10010100000000000000000

Leading 1 of the significand is implied

Single and Double Precision Example
• Convert $-0.75_{10}$ to single and double precision IEEE-754 format
• $-0.75_{10}$ → $(-3/4)_{10}$ → $(-3/2^{2})_{10}$ → $-11_2 \times 2^{-2}$ → $-1.1_2 \times 2^{-1}$
• Single Precision:

S   Exponent   Significand (23 bits)
1   01111110   10000000000000000000000

• Double Precision:

S   Exponent      Significand (upper 20 bits)
1   01111111110   10000000000000000000

Significand (lower 32 bits)
00000000000000000000000000000000

Denormalized Numbers
• The smallest positive single precision normalized number is $1.000\,0000\,0000\,0000\,0000\,0000_2 \times 2^{-126}$
• Use denormalized (or subnormal) numbers to store a value between 0 and the above number
• Denormal numbers have a leading implicit 0 instead of 1
• Needed when subtracting two values, where the difference is not 0 but close to it
• Denormalized numbers are allowed to degrade in significance until they become 0 (gradual underflow)

Special Values
• Some bit patterns are special:
  • Negative zero
  • ±Infinity, for overflows
  • Not a Number (NaN), for invalid operations like 0/0 or $\infty - \infty$

Single Precision          Double Precision          Special Value
Exponent   Significand    Exponent   Significand
0          0              0          0              0 or -0
0          Nonzero        0          Nonzero        ± denormalized
1-254      Anything       1-2046     Anything       ± floating point
255        0              2047       0              ± infinity
255        Nonzero        2047       Nonzero        NaN

Floating Point Arithmetic
• Floating point arithmetic differs from integer arithmetic in that exponents are handled as well as the significands
• For addition and subtraction, exponents of operands must be equal
• Significands are then added/subtracted, and then the result is normalized
• Example: $[1].101_2 \times 2^{3} + [1].111_2 \times 2^{4}$
• Adjust exponents to equal the larger exponent: $0.1101_2 \times 2^{4} + [1].1110_2 \times 2^{4}$
• Sum is thus $10.1011_2 \times 2^{4}$ → $[1].01011_2 \times 2^{5}$

Floating Point Addition / Subtraction
Start
1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent would match the larger exponent
  • Compute $E = A_{exp} - B_{exp}$
  • Right-shift $A_{sig}$ to form $A_{sig} \times 2^{E}$
2. Add the significands
  • Compute $R = A_{sig} + B_{sig}$
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent
  • Left shift R and decrement E, or right shift R and increment E, until the MSB of R is the implicit 1 (normalized form)
  • If R cannot be left-shifted enough, then keep R as denormalized
  • Overflow or underflow? Yes: exception. No: continue to step 4
4. Round the significand to the appropriate number of bits
  • Still normalized? No: return to step 3. Yes: done
Done

Floating Point Addition Example
• Calculate $0.5_{10} - 0.4375_{10}$, with only 4 bits of precision
• $0.5_{10}$ → $(1/2)_{10}$ → $0.1_2 \times 2^{0}$ → $1.000_2 \times 2^{-1}$
• $0.4375_{10}$ → $(1/4)_{10} + (1/8)_{10} + (1/16)_{10}$ → $0.0111_2 \times 2^{0}$ → $1.110_2 \times 2^{-2}$
• Shift the operand with the lesser exponent: $1.110_2 \times 2^{-2}$ → $0.111_2 \times 2^{-1}$
• Subtract significands: $1.000_2 \times 2^{-1} - 0.111_2 \times 2^{-1} = 0.001_2 \times 2^{-1}$
• Normalize: $0.001_2 \times 2^{-1}$ → $1.000_2 \times 2^{-4}$ = $(2^{-4})_{10}$ → $(1/16)_{10}$ → $0.0625_{10}$
• No overflow/underflow, because the exponent is between -126 and +127

Floating Addition Hardware
[Figure: floating point addition hardware. A small ALU compares the exponents and produces the exponent difference; control logic selects the smaller number and shifts its significand right; a big ALU adds the significands; the result is normalized (shift left or right, increment or decrement the exponent) and passed through rounding hardware to give the final sign, exponent, and significand]

Floating Point Multiplication and Division
• For multiplication and division, the sign, exponent, and significand are computed separately
• Same signs → positive result, different signs → negative result
• Exponents are calculated by adding (multiplication) or subtracting (division) the exponents
• Significands are multiplied/divided, and then normalized

Floating Point Multiplication Example
• Calculate $0.5_{10} \times -0.4375_{10}$, with only 4 bits of precision
• $(1.000_2 \times 2^{-1}) \times (-1.110_2 \times 2^{-2})$
• Sign bits differ, so the result is negative (sign bit = 1)
• Add exponents, without bias: (-1) + (-2) = -3
• Multiply significands:

        1.000
      × 1.110
      -------
         0000
        1000
       1000
    + 1000
    ---------
     1.110000

• Keep result to 4 precision bits: $1.110_2 \times 2^{-3}$
• Normalize result: $-1.110_2 \times 2^{-3}$

Floating Point Multiplication
Start
1. Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent
  • Compute $E = A_{exp} + B_{exp} - \text{Bias}$
2. Multiply the significands
  • Compute $S = A_{sig} \times B_{sig}$
3. Normalize the product if necessary, shifting it right and incrementing the exponent
  • Left shift S and decrement E, or right shift S and increment E, until the MSB of S is the implicit 1 (normalized form)
  • If S cannot be left-shifted enough, then keep S as denormalized
  • Overflow or underflow? Yes: exception. No: continue to step 4
4. Round the significand to the appropriate number of bits
  • Round S to the specified size
  • Still normalized? No: return to step 3. Yes: continue to step 5
5. Set the sign of the product to positive if the signs of the original operands are the same; if they differ, make the sign negative
  • Calculate sign of product
Done

Accuracy
• Floating-point numbers are approximations of a value in ℝ
• Example: $\pi$ in single-precision floating point is $1.10010010000111111011011_2 \times 2^{1}$
• This floating point value is exactly 3.1415927410125732421875
• The true value is 3.1415926535897932384626…
• The floating point value is accurate to only 8 decimal digits
• Hardware needs to round accurately after arithmetic operations
http://www.exploringbinary.com/pi-and-e-in-binary/

Rounding Errors
• Unit in the last place (ulp): number of bits in error in the least significant bits of the significand between the actual number and the number that can be represented
• Example: store a floating point number in base 10, maximum 3 significand digits
• $3.14 \times 10^{0}$ and $3.15 \times 10^{0}$ are valid, but $3.145 \times 10^{0}$ could not be stored
• One ULP is thus 0.01
• If storing $\pi$ and then rounding to nearest (i.e., $3.14 \times 10^{0}$), the rounding error is 0.0015926…
• If rounding to nearest, the maximum rounding error is 0.005, or 0.5 of a ULP
https://matthew-brett.github.io/teaching/floating_error.html

Accurate Arithmetic
• When adding and subtracting, append extra guard digits beyond the LSB
• When rounding to nearest even, use the guard bits to determine whether to round up or down
• Example: $2.34_{10} \times 10^{0} + 2.56_{10} \times 10^{-2}$, with and without guard digits, maximum 3 significand digits

Without guard digits:      With guard digits:
    2.34                       2.3400
  + 0.02                     + 0.0256
  ------                     --------
    2.36                       2.3656  → rounded to $2.37 \times 10^{0}$

IEEE 754 Rounding Modes
• Always round up (towards $+\infty$)
• Always round down (towards $-\infty$)
• Truncate
• Round to nearest even (RNE)
  • Most commonly used, including on IRS tax forms
  • When the value is exactly halfway between two representable numbers: if the LSB is 1, then round up (resulting in an LSB of 0); otherwise round down (leaving an LSB of 0)
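
The single-precision encodings worked out in the examples above can be checked mechanically. The following is a minimal Python sketch, not part of the lecture, that uses the standard `struct` module to split a value into sign, biased exponent, and stored significand, and then rebuilds the value from the formula $(-1)^{\text{Sign}} \times 1.\text{Significand} \times 2^{(\text{Exponent} - 127)}$. The helper names `float32_fields` and `fields_to_value` are hypothetical, chosen only for illustration, and the rebuild helper assumes a normalized number (no denormals, infinities, or NaN).

```python
import struct
from decimal import Decimal


def float32_fields(x):
    """Split x, rounded to IEEE 754 single precision, into its three bit fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    significand = bits & 0x7FFFFF    # 23 bits, leading 1 is not stored
    return sign, exponent, significand


def fields_to_value(sign, exponent, significand):
    """Rebuild a normalized single: (-1)^sign * 1.significand * 2^(exponent - 127)."""
    return (-1) ** sign * (1 + significand / 2**23) * 2.0 ** (exponent - 127)


# The two conversion examples from the lecture.
for x in (-12.625, -0.75):
    s, e, f = float32_fields(x)
    print(x, s, format(e, "08b"), format(f, "023b"), fields_to_value(s, e, f))
# -12.625 1 10000010 10010100000000000000000 -12.625
# -0.75 1 01111110 10000000000000000000000 -0.75

# Exact decimal value of pi after rounding to single precision.
pi_single = struct.unpack(">f", struct.pack(">f", 3.141592653589793))[0]
print(Decimal(pi_single))
# 3.1415927410125732421875
```

The printed bit fields match the S / Exponent / Significand rows in the two conversion examples, and the last line reproduces the exact single-precision value of $\pi$ quoted on the Accuracy slide.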
