Lecture 7: Floating Point Arithmetic, Memory Hierarchy, and Cache

Question:

‹ Suppose we want to compute using four-decimal-digit arithmetic: ¾ S = 1.000 + 1.000 x 10^4 – 1.000 x 10^4 ¾ What’s the answer?

‹ Ariane 5 rocket ¾ June 1996: exploded when a 64-bit floating point number relating to the horizontal velocity of the rocket was converted to a 16-bit signed integer. The number was larger than 32,767, and thus the conversion failed. ¾ $500M rocket and cargo
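The same loss of a small addend can be reproduced in IEEE single precision. A minimal C sketch (my own illustration, not from the slides), using 1.0e8 because it needs more than 24 significand bits and so swallows an added 1.0:

    #include <stdio.h>

    int main(void) {
        /* Four-digit decimal isn't directly available in C, but single precision
           shows the same effect. The volatile temporaries keep the compiler from
           evaluating in higher intermediate precision. */
        volatile float t = 1.0f + 1.0e8f;   /* rounds to 1.0e8f                */
        float s1 = t - 1.0e8f;              /* 0.0: the 1.0 has vanished       */

        volatile float u = 1.0e8f - 1.0e8f; /* 0.0                             */
        float s2 = 1.0f + u;                /* 1.0: reordering saves the term  */

        printf("left-to-right: %g, reassociated: %g\n", s1, s2);
        return 0;
    }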

Defining Floating Point Arithmetic

‹ Representable numbers: +/- d.d…d x r^exp ¾ sign bit +/- ¾ radix r (usually 2 or 10, sometimes 16) ¾ d.d…d (how many base-r digits d?) ¾ exponent exp (range?) ¾ others? ‹ Operations: ¾ arithmetic: +, -, x, /, ... » how to round result to fit in format ¾ comparison (<, =, >) ¾ conversion between different formats » short to long FP numbers, FP to integer ¾ exception handling » what to do for 0/0, 2*largest_number, etc. ¾ binary/decimal conversion » for I/O, when radix not 10

IEEE Floating Point Arithmetic Standard 754 (1985) - Normalized Numbers

‹ Normalized Nonzero Representable Numbers: +- 1.d…d x 2^exp ¾ Macheps = 2^(-#significand bits) = relative error in each operation = smallest number ε such that fl(1 + ε) > 1 (a small experiment below illustrates this)

¾ OV = overflow threshold = largest number ¾ UN = underflow threshold = smallest number

Format            # bits   # significand bits   macheps              # exponent bits   exponent range
Single            32       23+1                 2^-24  (~10^-7)      8                 2^-126 to 2^127    (~10^±38)
Double            64       52+1                 2^-53  (~10^-16)     11                2^-1022 to 2^1023  (~10^±308)
Double Extended   >=80     >=64                 <=2^-64 (~10^-19)    >=15              2^-16382 to 2^16383 (~10^±4932)
(Double Extended is 80 bits on all Intel machines)

‹ +- Zero: sign +-, significand and exponent all zero ¾ Why bother with -0? (later)
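A minimal C sketch (illustrative, not from the slides) that finds the machine epsilon empirically by halving ε until fl(1 + ε) no longer exceeds 1, assuming plain IEEE double arithmetic without extended intermediate precision:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* Shrink eps until adding it to 1.0 no longer changes the result.
           The loop stops at the spacing between 1.0 and the next representable
           number, DBL_EPSILON = 2^-52. */
        double eps = 1.0;
        while (1.0 + eps / 2.0 > 1.0)
            eps /= 2.0;

        printf("empirical eps = %g\n", eps);          /* ~2.22e-16 */
        printf("DBL_EPSILON   = %g\n", DBL_EPSILON);  /* same value */
        /* The "macheps" in the table, 2^-53, is the bound on the relative error
           of a single rounded operation; it is half of DBL_EPSILON. */
        printf("unit roundoff = %g\n", DBL_EPSILON / 2.0);
        return 0;
    }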

IEEE Floating Point Arithmetic Standard 754 - “Denorms”

‹ Denormalized Numbers: +-0.d…d x 2^min_exp ¾ sign bit, nonzero significand, minimum exponent ¾ Fills in gap between UN and 0 ‹ Underflow Exception ¾ occurs when exact nonzero result is less than underflow threshold UN ¾ Ex: UN/3 ¾ return a denorm, or zero

IEEE Floating Point Arithmetic Standard 754 - +- Infinity

‹+- Infinity: Sign bit, zero significand, maximum exponent ‹Overflow Exception ¾occurs when exact finite result too large to represent accurately ¾Ex: 2*OV ¾return +- infinity ‹Divide by zero Exception ¾return +- infinity = 1/+-0 ¾sign of zero important! ‹Also return +- infinity for ¾3+infinity, 2*infinity, infinity*infinity ¾Result is exact, not an exception!
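A small C sketch (my own example) of these rules, assuming IEEE 754 double arithmetic with the default exception responses (a result is returned and execution continues):

    #include <stdio.h>

    int main(void) {
        double big  = 1.0e308;            /* a large finite double, near OV      */
        double zero = 0.0;

        double inf = 1.0 / zero;          /* divide-by-zero: returns +infinity   */
        printf("2 * big   : %g\n", 2.0 * big);    /* overflow: returns inf       */
        printf("1 / +0    : %g\n", 1.0 / zero);   /* inf                         */
        printf("1 / -0    : %g\n", 1.0 / -zero);  /* -inf: sign of zero matters  */
        printf("3 + inf   : %g\n", 3.0 + inf);    /* inf: exact, not an exception */
        printf("inf * inf : %g\n", inf * inf);    /* inf: exact, not an exception */
        return 0;
    }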

IEEE Floating Point Arithmetic Standard 754 - NAN (Not A Number)

‹NAN: Sign bit, nonzero significand, maximum exponent ‹Invalid Exception ¾occurs when exact result not a well-defined real number ¾0/0 ¾sqrt(-1) ¾infinity-infinity, infinity/infinity, 0*infinity ¾NAN + 3 ¾NAN > 3? ¾Return a NAN in all these cases ‹Two kinds of NANs ¾Quiet - propagates without raising an exception ¾Signaling - generate an exception when touched » good for detecting uninitialized data
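A hedged C sketch of the invalid operations and of how NaN behaves (illustrative; the exact text printed for NaN varies by platform):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double zero = 0.0;
        double inf  = 1.0 / zero;      /* +infinity, as on the previous slide */

        double a = zero / zero;        /* invalid: 0/0 -> NaN       */
        double b = sqrt(-1.0);         /* invalid: sqrt(-1) -> NaN  */
        double c = inf - inf;          /* invalid: inf - inf -> NaN */

        printf("0/0 = %g, sqrt(-1) = %g, inf-inf = %g\n", a, b, c);
        printf("NaN + 3 = %g\n", a + 3.0);     /* NaN propagates              */
        printf("NaN > 3 is %d\n", a > 3.0);    /* 0: comparisons with NaN     */
        printf("NaN == NaN is %d\n", a == a);  /* 0: are all false            */
        return 0;
    }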


Error Analysis

‹Basic error formula ¾fl(a op b) = (a op b)*(1 + d) where » op one of +, -, *, / » |d| <= macheps » assuming no overflow, underflow, or divide by zero ‹Example: adding 4 numbers

¾fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3) = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3) + x3*(1+d2)*(1+d3) + x4*(1+d3) = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4) where each |ei| <~ 3*macheps ¾get exact sum of slightly changed summands xi*(1+ei) ¾Backward Error Analysis - an algorithm is called numerically stable if it gives the exact result for slightly changed inputs ¾Numerical Stability is an algorithm design goal
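A small C experiment (my own, not from the slides) that makes the bound visible: summing n copies of 0.1 in single precision drifts noticeably, but stays within the backward-error bound of roughly n * macheps:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Each of the n additions contributes a rounding error bounded by
           macheps, so the relative error can grow up to about n * 2^-24. */
        const int n = 1000000;
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += 0.1f;

        double reference = 0.1 * (double)n;   /* far more accurate, for comparison */
        double rel = fabs((double)s - reference) / reference;
        printf("float sum = %.8g, relative error = %.2e\n", (double)s, rel);
        /* Typically prints a relative error on the order of 1e-2: large, but
           well within the worst-case bound n * macheps ~ 6e-2. */
        return 0;
    }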

Backward error

‹ Approximate solution is exact solution to modified problem. ‹ How large a modification to the original problem is required to give the result actually obtained? ‹ How much data error in initial input would be required to explain all the error in computed results? ‹ Approximate solution is good if it is exact solution to “nearby” problem.

[Diagram: the true map f takes x to f(x); the computed map f’ takes x to f’(x). Backward error = the perturbation from x to x’ such that f(x’) = f’(x); forward error = the difference between f(x) and f’(x).]

Sensitivity and Conditioning ‹ Problem is insensitive or well conditioned if relative change in input causes commensurate relative change in solution. ‹ Problem is sensitive or ill-conditioned, if relative change in solution can be much larger than that in input data.

Cond = |Relative change in solution| / |Relative change in input data| = |[f(x’) – f(x)]/f(x)| / |(x’ – x)/x|

‹ Problem is sensitive, or ill-conditioned, if cond >> 1.

‹ When function f is evaluated for approximate input x’ = x + h instead of true input value of x: ‹ Absolute error = f(x + h) – f(x) ≈ h f’(x) ‹ Relative error = [f(x + h) – f(x)] / f(x) ≈ h f’(x) / f(x)

Sensitivity: 2 Examples: cos(π/2) and 2-d System of Equations

‹ Consider the problem of computing the cosine function for arguments near π/2 (the second example is the 2-d system a*x1 + b*x2 = f, c*x1 + d*x2 = g). ‹ Let x ≈ π/2 and let h be a small perturbation to x. Then: Abs: f(x + h) – f(x) ≈ h f’(x); Rel: [f(x + h) – f(x)] / f(x) ≈ h f’(x) / f(x); absolute error = cos(x+h) – cos(x) ≈ -h sin(x) ≈ -h; relative error ≈ -h tan(x) ≈ ∞

‹ So small change in x near π/2 causes large relative change in cos(x) regardless of method used. ‹ cos(1.57079) = 0.63267949 x 10^-5 ‹ cos(1.57078) = 1.63267949 x 10^-5 ‹ Relative change in output is a quarter million times greater than relative change in input.
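A quick C check of these numbers (illustrative; the exact printed digits depend on the math library):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double x1 = 1.57079, x2 = 1.57078;   /* two nearby inputs close to pi/2 */
        double y1 = cos(x1), y2 = cos(x2);

        double rel_in  = fabs(x2 - x1) / fabs(x1);   /* ~6.4e-6 */
        double rel_out = fabs(y2 - y1) / fabs(y1);   /* ~1.6    */

        printf("cos(%.5f) = %.8e\n", x1, y1);
        printf("cos(%.5f) = %.8e\n", x2, y2);
        printf("amplification = %.0f\n", rel_out / rel_in);  /* ~2.5e5: the
                                                                 problem is ill-conditioned */
        return 0;
    }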


Exception Handling

‹What happens when the “exact value” is not a real number, or too small or too large to represent accurately? ‹5 Exceptions: ¾Overflow - exact result > OV, too large to represent ¾Underflow - exact result nonzero and < UN, too small to represent ¾Divide-by-zero - nonzero/0 ¾Invalid - 0/0, sqrt(-1), … ¾Inexact - you made a rounding error (very common!) ‹Possible responses ¾Stop with error message (unfriendly, not default) ¾Keep computing (default, but how?)

Summary of Values Representable in IEEE FP

Value                          Sign   Exponent           Significand
+- Zero                        +-     0…0                0……………………0
Normalized nonzero numbers     +-     not 0 or all 1s    anything
Denormalized numbers           +-     0…0                nonzero
+- Infinity                    +-     1….1               0……………………0
NANs                           +-     1….1               nonzero
  ¾ Signaling and quiet
  ¾ Many systems have only quiet

Questions?

More on the In-Class Presentations

‹ Start at 1:00 on Monday, 5/5/08 ‹ Turn in reports on Monday, 5/2/08 ‹ Presentations roughly 20 minutes each ‹ Use PowerPoint or PDF ‹ Describe your project, perhaps motivate via application ‹ Describe your method/approach ‹ Provide comparison and results

‹ See me about your topic

Cache and Memory

Cache and Its Importance in Performance

‹ Motivation: ¾ Time to run code = clock cycles running code + clock cycles waiting for memory ¾ For many years, CPUs have sped up an average of 50% per year over memory chip speedups.

‹ Hence, memory access is the bottleneck to computing fast.

[Figure: Latency in a Single System, 1997-2009: CPU clock period (ns), memory system access time (ns), and their ratio. µProc speed improves ~60%/yr (2X/1.5 yr) while DRAM latency improves ~9%/yr (2X/10 yrs), so the processor-memory performance gap grows ~50% per year.]

Here’s your problem

‹ Say 2.26 GHz ¾ 2 ops/cycle DP ¾ 4.52 Gflop/s peak ‹ FSB 533 MHz ¾ 32-bit data path (4 bytes), or 2.132 GB/s ‹ With 8 bytes/word (DP) ¾ 266.5 MW/s from memory

Intel Clovertown

‹ Quad-core processor ‹ Each core does 4 floating point ops/cycle ‹ Say 2.4 GHz ¾ thus 4 cores * 4 flops/cycle * 2.4 GHz = 38.4 Gflop/s peak ‹ FSB 1.066 GHz ¾ 1.066 GHz * 4 B = 4.26 GB/s, / 8 B/word = 533 MW/s

» There’s your problem

Commodity Processor Trends Bandwidth/Latency is the Critical Issue, not FLOPS

Got Bandwidth?

                                 Annual     Typical value      Typical value      Typical value
                                 increase   in 2005            in 2010            in 2020
Single-chip floating-point       59%        4 GFLOP/s          32 GFLOP/s         3300 GFLOP/s
performance
Front-side bus bandwidth         23%        1 GWord/s          3.5 GWord/s        27 GWord/s
                                            = 0.25 word/flop   = 0.11 word/flop   = 0.008 word/flop
DRAM latency                     (5.5%)     70 ns              50 ns              28 ns
                                            = 280 FP ops       = 1600 FP ops      = 94,000 FP ops
                                            = 70 loads         = 170 loads        = 780 loads

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

Solving the Memory Bottleneck

‹ Since we cannot make fast enough memories, we invented the memory hierarchy ¾ L1 Cache (on chip) ¾ L2 Cache ¾ Optional L3 Cache ¾ Main Memory ¾ Hard Drive

Cache Memories ‹ Cache memories are small, fast SRAM-based memories managed automatically in hardware. ¾ Hold frequently accessed blocks of main memory ‹ CPU looks first for data in L1, then in L2, then in main memory. ‹ Typical bus structure:

[Diagram: CPU chip containing the register file, ALU, and L1 cache; a cache bus to the L2 cache; a system bus through the I/O bridge; a memory bus to main memory.]

What is a cache?

‹ Small, fast storage used to improve average access time to slow memory. ‹ Exploits spatial and temporal locality ‹ In computer architecture, almost everything is a cache! ¾ Registers a “cache” on variables – software managed ¾ First-level cache a cache on second-level cache ¾ Second-level cache a cache on memory ¾ Memory a cache on disk (virtual memory) ¾ TLB a cache on page table ¾ Branch-prediction a cache on prediction information?

[Diagram: memory hierarchy pyramid: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape, etc.; levels get bigger going down and faster going up.]

Cache Performance Metrics

‹ Miss Rate ¾ Fraction of memory references not found in cache (misses/references) ¾ Typical numbers: » 3-10% for L1 » can be quite small (e.g., < 1%) for L2, depending on size, etc. ‹ Hit Time ¾ Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache) ¾ Typical numbers: » 1 clock cycle for L1 » 3-8 clock cycles for L2 ‹ Miss Penalty ¾ Additional time required because of a miss » Typically 25-100 cycles for main memory
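These three metrics combine into an average memory access time (AMAT). A small C sketch (my own, with assumed values drawn from the typical ranges above, not measurements):

    #include <stdio.h>

    int main(void) {
        /* Assumed, representative values: L1 hit = 1 cycle, L1 miss rate = 5%,
           L2 hit = 6 cycles, L2 miss rate = 1%, main memory penalty = 60 cycles. */
        double l1_hit = 1.0, l1_miss = 0.05;
        double l2_hit = 6.0, l2_miss = 0.01;
        double mem_penalty = 60.0;

        /* AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory penalty) */
        double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem_penalty);
        printf("average memory access time = %.2f cycles\n", amat);  /* 1.33 */
        return 0;
    }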

Traditional Four Questions for Memory Hierarchy Designers

‹ Q1: Where can a block be placed in the upper level? (Block placement) ¾Fully Associative, Set Associative, Direct Mapped ‹ Q2: How is a block found if it is in the upper level? (Block identification) ¾Tag/Block ‹ Q3: Which block should be replaced on a miss? (Block replacement) ¾Random, LRU ‹ Q4: What happens on a write? (Write strategy) ¾Write Back or Write Through (with Write Buffer)

Cache-Related Terms

‹ ICACHE : Instruction cache ‹ DCACHE (L1) : Data cache closest to registers ‹ SCACHE (L2) : Secondary data cache ‹ TCACHE (L3) : Third level data cache ¾ Data from SCACHE has to go through DCACHE to registers ¾ TCACHE is larger than SCACHE, and SCACHE is larger than DCACHE ¾ Not all processors have TCACHE

Line Replacement Policy

‹ When 2 memory lines are in the cache and a 3rd line comes in, one of the two previous ones must be evicted: which one to choose? ‹ Since one doesn’t know the future, heuristics: ¾ LRU: Least Recently Used » Hard to implement ¾ FIFO » Easy to implement ¾ Random » Even easier to implement ‹ Overall, associative caches... ¾ can alleviate thrashing ¾ require more complex circuitry than direct-mapped ¾ are more expensive than direct-mapped ¾ are slower than direct-mapped

Three Types of Cache Misses

‹ Compulsory (or cold-start) misses ¾ First access to data ¾ Can be reduced via bigger cache lines ¾ Can be reduced via some pre-fetching ‹ Capacity misses ¾ Misses due to the cache not being big enough ¾ Can be reduced via a bigger cache ‹ Conflict misses ¾ Misses due to some other memory line having evicted the needed cache line ¾ Can be reduced via higher associativity

Write Policy: Write-Through

‹ What happens when the processor modifies memory that is in cache? ‹ Option #1: Write-through ¾ Write goes BOTH to cache and to main memory ¾ Memory and cache always consistent

[Diagram: write-through: CPU stores go to both the cache and main memory; loads are served from the cache.]

Write Policy: Write-Back

‹ Option #2 ¾ Write goes only to cache ¾ Cache lines are written back to memory when evicted ¾ Requires a “dirty” bit to indicate whether a cache line was written to or not ¾ Memory not always consistent with the cache

[Diagram: write-back: CPU stores go to the cache; dirty lines are written back to memory when evicted; loads are served from the cache.]

Cache Basics

‹Cache hit: a memory access that is found in the cache -- cheap ‹Cache miss: a memory access that is not in the cache - expensive, because we need to get the data from elsewhere ‹Consider a tiny cache (for illustration only)

[Diagram: a tiny cache with 3-bit addresses X000 … X111; each address splits into tag | line | offset fields.]

‹ Cache line length: number of bytes loaded together in one entry ‹ Direct mapped: only one address (line) in a given range in cache ‹ Associative: 2 or more lines with different addresses exist
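A minimal C sketch (my own illustration) of how an address splits into tag, line index, and offset for a direct-mapped cache; the geometry is an assumption chosen for the example:

    #include <stdio.h>

    int main(void) {
        /* Assumed geometry: 64-byte lines, 256 lines, i.e. a 16 KB direct-mapped cache. */
        unsigned long line_bytes = 64;
        unsigned long num_lines  = 256;

        unsigned long addr = 0x12345678UL;   /* an arbitrary example address */

        unsigned long offset = addr % line_bytes;                /* byte within the line */
        unsigned long index  = (addr / line_bytes) % num_lines;  /* which line slot      */
        unsigned long tag    = addr / (line_bytes * num_lines);  /* identifies the block */

        printf("addr 0x%lx -> tag 0x%lx, index %lu, offset %lu\n",
               addr, tag, index, offset);
        return 0;
    }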

Direct-Mapped Cache

‹ Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.

[Diagram: direct-mapped cache and main memory.]

Fully Associative Cache

‹ Fully Associative Cache: A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.

[Diagram: fully associative cache and main memory.]

Set Associative Cache

‹ Set associative cache: The middle range of designs between direct mapped cache and fully associative cache is called set-associative cache. In an N-way set-associative cache a block from main memory can go into N (N > 1) locations in the cache.

[Diagram: 2-way set-associative cache and main memory.]

Here assume cache has 8 blocks, while memory has 32

Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into cache block 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4).

[Diagram: cache block numbers 0-7 for each organization, and memory block numbers 0-31.]


Tuning for Caches

1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining.

Registers

‹Registers are the source and destination of most CPU data operations.

‹They hold one element each.

‹They are made of static RAM (SRAM), which is very expensive.

‹The access time is usually 1-1.5 CPU clock cycles.

‹Registers are at the top of the memory subsystem.

The Principle of Locality

‹The Principle of Locality: ¾Programs access a relatively small portion of the address space at any instant of time. ‹Two Different Types of Locality: ¾Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse) ¾Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access) ‹Last 15 years, HW relied on locality for speed

Principles of Locality

‹ Temporal: an item referenced now will be again soon.

‹ Spatial: an item referenced now causes neighbors to be referenced soon.

‹ Lines, not words, are moved between memory levels. Both principles are satisfied. There is an optimal line size based on the properties of the data bus and the memory subsystem designs.

‹ Cache lines are typically 32-128 bytes, with 1024 being the longest currently.

Cache Thrashing

‹ Thrashing occurs when frequently used cache lines replace each other. There are three primary causes for thrashing: ¾Instructions and data can conflict, particularly in unified caches. ¾Too many variables or too large of arrays are accessed that do not fit into cache. ¾Indirect addressing, e.g., sparse matrices.

‹ Machine architects can add sets to the associativity. Users can buy another vendor’s machine. However, neither solution is realistic.

Counting cache misses

‹ nxn 2-D array, element size = e bytes, cache line size = b bytes

Traversal in the order the array is stored (consecutive elements share a cache line): ¾ One cache miss for every cache line: n^2 x e/b misses ¾ Total number of memory accesses: n^2 ¾ Miss rate: e/b ¾ Example: Miss rate = 4 bytes / 64 bytes = 6.25% ¾ Unless the array is very small

Traversal across the storage order (each access jumps to a different cache line): ¾ One cache miss for every access ¾ Example: Miss rate = 100% ¾ Unless the array is very small
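A small C sketch (my own illustration) of the two access patterns over a row-major array; timing either function with a large N makes the difference visible:

    #include <stdlib.h>

    #define N 4096

    /* Row-order traversal of a row-major array: consecutive accesses fall in
       the same cache line, so the miss rate is roughly e/b = 4/64 = 6.25%
       (assuming 4-byte elements and 64-byte lines). */
    static double sum_row_order(float (*a)[N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-order traversal of the same array: each access jumps
       N*sizeof(float) bytes, so nearly every access touches a different cache
       line once the array is much larger than the cache (miss rate ~100%). */
    static double sum_col_order(float (*a)[N]) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        float (*a)[N] = malloc(sizeof(float) * N * N);
        if (!a) return 1;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0f;
        double s = sum_row_order(a) + sum_col_order(a);  /* same sums, very
                                                            different memory behavior */
        free(a);
        return s > 0.0 ? 0 : 1;
    }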

Cache Coherence for Multiprocessors

‹ All data must be coherent between memory levels. Multiple processors with separate caches must inform the other processors quickly about data modifications (by the cache line). Only hardware is fast enough to do this.

‹ Standard protocols on multiprocessors: ¾ Snoopy: all processors monitor the memory bus. ¾ Directory based: Cache lines maintain an extra 2 bits per processor to maintain clean/dirty status bits.

‹ False sharing occurs when two different shared variables are located in the same cache block, causing the block to be exchanged between the processors even though the processors are accessing different variables. Size of block (line) important.

Indirect Addressing

      d = 0
      do i = 1,n
         j = ind(i)
         d = d + sqrt( x(j)*x(j) + y(j)*y(j) + z(j)*z(j) )
      end do

‹ Change the loop statement to

      d = d + sqrt( r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j) )

‹ Note that r(1,j)-r(3,j) are in contiguous memory and probably are in the same cache line (d is probably in a register and is irrelevant). The original form uses 3 cache lines at every instance of the loop and can cause cache thrashing.

Cache Thrashing by Memory Allocation

parameter ( m = 1024*1024 ) real a(m), b(m)

‹ For a 4 Mb direct mapped cache, a(i) and b(i) are always mapped to the same cache line. This is trivially avoided using padding.

real a(m), extra(32), b(m)

‹ extra is at least 128 bytes in length, which is longer than a cache line on all but one memory subsystem that is available today.

Cache Blocking

‹ We want blocks to fit into cache. On parallel computers we have p x cache so that data may fit into cache on p processors, but not one. This leads to superlinear speed up! Consider matrix-matrix multiply.

      do k = 1,n
        do j = 1,n
          do i = 1,n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do

‹ An alternate form is ...

Cache Blocking

[Diagram: C (M x N) += A (M x K) * B (K x N), processed in NB x NB blocks.]

      do kk = 1,n,nblk
        do jj = 1,n,nblk
          do ii = 1,n,nblk
            do k = kk,kk+nblk-1
              do j = jj,jj+nblk-1
                do i = ii,ii+nblk-1
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do

Lessons

‹ The actual performance of a simple program can be a complicated function of the architecture ‹ Slight changes in the architecture or program change the performance significantly ‹ Since we want to write fast programs, we must take the architecture into account, even on uniprocessors ‹ Since the actual performance is so complicated, we need simple models to help us design efficient algorithms ‹ We will illustrate with a common technique for improving cache performance, called blocking

Assignment 4

Strassen’s Matrix Multiply

‹ The traditional algorithm (with or without tiling) has O(n^3) flops ‹ Strassen discovered an algorithm with asymptotically lower flops ¾ O(n^2.81) ‹ Consider a 2x2 matrix multiply, normally 8 multiplies ¾ Strassen does it with 7 multiplies and 18 adds

Let M = [ m11 m12 ] = [ a11 a12 ] [ b11 b12 ]
        [ m21 m22 ]   [ a21 a22 ] [ b21 b22 ]

Let p1 = (a12 - a22) * (b21 + b22)     p5 = a11 * (b12 - b22)
    p2 = (a11 + a22) * (b11 + b22)     p6 = a22 * (b21 - b11)
    p3 = (a11 - a21) * (b11 + b12)     p7 = (a21 + a22) * b11
    p4 = (a11 + a12) * b22

Then m11 = p1 + p2 - p4 + p6           m12 = p4 + p5
     m21 = p6 + p7                     m22 = p2 - p3 + p5 - p7

Extends to nxn by divide & conquer

Strassen (continued) T(n) = Cost of multiplying nxn matrices = 7*T(n/2) + 18*(n/2)^2 = O(n^log_2 7) = O(n^2.81)
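For completeness, a brief unrolling of this recurrence (a standard divide-and-conquer argument, sketched here; not from the original slides) shows where the exponent comes from:

    \begin{aligned}
    T(n) &= 7\,T(n/2) + \tfrac{18}{4}\,n^2 \\
         &= 7^{k}\,T(n/2^{k}) + \tfrac{18}{4}\,n^2 \sum_{i=0}^{k-1}\left(\tfrac{7}{4}\right)^{i}.
    \end{aligned}

With k = log_2 n levels of recursion, both terms are of order 7^(log_2 n) = n^(log_2 7), so T(n) = O(n^(log_2 7)) ≈ O(n^2.81).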

° Available in several libraries ° Up to several times faster if n large enough (100s) ° Needs more memory than standard algorithm ° Can be less accurate because of roundoff error ° Current world’s record is O(n^2.376..)

Other Fast Matrix Multiplication Algorithms

• Current world’s record is O(n^2.376...) (Coppersmith & Winograd)

• Possibility of O(n^(2+ε)) algorithm! (Cohn, Umans, Kleinberg, 2003) • http://www.siam.org/pdf/news/174.pdf • http://arxiv.org/PS_cache/math/pdf/0511/0511460.pdf

• Fast methods (besides Strassen) may need unrealistically large n

Amdahl’s Law

Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:

tN = (fp/N + fs)t1 Effect of multiple processors on run time

S = 1/(fs + fp/N) Effect of multiple processors on speedup

Where: fs = serial fraction of code fp = parallel fraction of code = 1 - fs N = number of processors
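A tiny C sketch (my own) that evaluates the second form of the law; the serial fractions and processor counts are arbitrary example values:

    #include <stdio.h>

    /* Amdahl speedup: S = 1 / (fs + fp/N), with fp = 1 - fs. */
    static double amdahl_speedup(double fs, int nprocs) {
        double fp = 1.0 - fs;
        return 1.0 / (fs + fp / nprocs);
    }

    int main(void) {
        /* Example: 1% serial fraction. */
        printf("N=16:   S = %.1f\n", amdahl_speedup(0.01, 16));    /* ~13.9 */
        printf("N=256:  S = %.1f\n", amdahl_speedup(0.01, 256));   /* ~72   */
        printf("N->inf: S -> %.0f\n", 1.0 / 0.01);                 /* 100   */
        return 0;
    }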

Illustration of Amdahl’s Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Figure: speedup vs. number of processors (0-250) for fp = 1.000, 0.999, 0.990, 0.900.]

Amdahl’s Law Vs. Reality Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications ( and I/O) will result in a further degradation of performance.

[Figure: speedup vs. number of processors (0-250) for fp = 0.99: Amdahl's Law prediction vs. reality.]

Optimizing Matrix Addition for Caches

‹ Dimension A(n,n), B(n,n), C(n,n) ‹ A, B, C stored by column (as in Fortran) ‹ Algorithm 1: ¾for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j) ‹ Algorithm 2: ¾for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j) ‹ What is the “memory access pattern” for Algs 1 and 2? ‹ Which is faster? ‹ What if A, B, C stored by row (as in C)?
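For C, where storage is row-major, the roles of the two loop orders reverse. A minimal C sketch (my own illustration) of both orders; the second visits memory with stride 1 and is the cache-friendly one:

    #define N 1024

    /* Analogue of Algorithm 1 for row-major storage: the inner loop walks down
       a column, so consecutive iterations jump N*sizeof(double) bytes and
       almost every access starts a new cache line. */
    void add_col_inner(double A[N][N], double B[N][N], double C[N][N]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                A[i][j] = B[i][j] + C[i][j];
    }

    /* Analogue of Algorithm 2: the inner loop walks along a row, so each cache
       line brought in is fully used before it is evicted. */
    void add_row_inner(double A[N][N], double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = B[i][j] + C[i][j];
    }

    int main(void) {
        static double A[N][N], B[N][N], C[N][N];  /* static: too large for the stack */
        add_col_inner(A, B, C);
        add_row_inner(A, B, C);
        return 0;
    }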


Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

2 misses per access to a & c vs. one miss per access; improve spatial locality

Optimizing Matrix Multiply for Caches

‹Several techniques for making this faster on modern processors ¾heavily studied ‹Some optimizations done automatically by compiler, but can do much better ‹In general, you should use optimized libraries (often supplied by vendor) for this and other very common linear algebra operations ¾BLAS = Basic Linear Algebra Subroutines ‹Other algorithms you may want are not going to be supplied by vendor, so need to know these techniques

Using a Simple Model of Memory to Optimize

‹ Assume just 2 levels in the hierarchy, fast and slow ‹ All data initially in slow memory ¾ m = number of memory elements (words) moved between fast and slow memory ¾ t_m = time per slow memory operation ¾ f = number of arithmetic operations ¾ t_f = time per arithmetic operation << t_m ¾ q = f / m = average number of flops per slow memory access (computational intensity: key to algorithm efficiency)

‹ Minimum possible time = f * t_f when all data in fast memory ‹ Actual time ¾ f * t_f + m * t_m = f * t_f * (1 + t_m/t_f * 1/q) ‹ Larger q means time closer to minimum f * t_f ¾ q ≥ t_m/t_f (the machine balance, key to machine efficiency) is needed to get at least half of peak speed

Warm up: Matrix-vector multiplication y = y + A*x

for i = 1:n
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)

[Diagram: y(i) = y(i) + A(i,:) * x(:).]

Warm up: Matrix-vector multiplication

{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
   {read row i of A into fast memory}
   for j = 1:n
      y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}

• m = number of slow memory refs = 3n + n^2
• f = number of arithmetic operations = 2n^2
• q = f / m ~= 2

• Matrix-vector multiplication limited by slow memory speed • Think of q as reuse of data

Matrix Multiply C = C + A*B

for i = 1 to n
   for j = 1 to n
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)

[Diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j).]

Matrix Multiply C = C + A*B (unblocked, or untiled)

for i = 1 to n
   {read row i of A into fast memory}
   for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}

Matrix Multiply q = ops/slow mem ref (unblocked, or untiled)

Number of slow memory references on unblocked matrix multiply
m  =  n^3       read each column of B n times
   +  n^2       read each row of A once for each i
   +  2*n^2     read and write each element of C once
   =  n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ~= 2 for large n; no improvement over matrix-vector multiply

Naïve Matrix Multiply on RS/6000

[Figure: log cycles/flop vs. log problem size. Measured time grows like T = N^4.7; size 2000 took 5 days, size 12000 would take 1095 years.]

O(N^3) performance would have constant cycles/flop. Performance looks like O(N^4.7). Slide source: Larry Carter, UCSD

Naïve Matrix Multiply on RS/6000

[Figure: log cycles/flop vs. log problem size, annotated with where misses set in: page miss every iteration, TLB miss every iteration, cache miss every 16 iterations, page miss every 512 iterations. Slide source: Larry Carter, UCSD]

Matrix Multiply (blocked, or tiled)

Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize

for i = 1 to N
   for j = 1 to N
      {read block C(i,j) into fast memory}
      for k = 1 to N
         {read block A(i,k) into fast memory}
         {read block B(k,j) into fast memory}
         C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}

[Diagram: block C(i,j) = C(i,j) + A(i,k) * B(k,j).]
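A self-contained C sketch of the same tiled algorithm (my own illustration; the block size BS is an assumption and should be tuned so that three BS x BS blocks fit in cache):

    #include <stdio.h>
    #include <stdlib.h>

    #define BS 64   /* assumed block size: 3 * 64 * 64 * 8 bytes ~ 96 KB working set */

    static int min(int a, int b) { return a < b ? a : b; }

    /* C = C + A*B for n x n row-major matrices, processed block by block so the
       working set of each inner kernel (one block each of A, B, C) stays in cache. */
    void matmul_tiled(int n, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int kk = 0; kk < n; kk += BS)
                    /* multiply the (ii,kk) block of A by the (kk,jj) block of B
                       and accumulate into the (ii,jj) block of C */
                    for (int i = ii; i < min(ii + BS, n); i++)
                        for (int k = kk; k < min(kk + BS, n); k++) {
                            double aik = A[(size_t)i * n + k];
                            for (int j = jj; j < min(jj + BS, n); j++)
                                C[(size_t)i * n + j] += aik * B[(size_t)k * n + j];
                        }
    }

    int main(void) {
        int n = 512;
        double *A = calloc((size_t)n * n, sizeof(double));
        double *B = calloc((size_t)n * n, sizeof(double));
        double *C = calloc((size_t)n * n, sizeof(double));
        if (!A || !B || !C) return 1;
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 1.0; }
        matmul_tiled(n, A, B, C);
        printf("C[0][0] = %g (expect %d)\n", C[0], n);  /* each entry sums n ones */
        free(A); free(B); free(C);
        return 0;
    }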

Matrix Multiply (blocked or tiled) q = ops/slow mem ref

(n = size of matrix, b = blocksize, N = number of blocks. Why is this algorithm correct?)
Number of slow memory references on blocked matrix multiply
m  =  N*n^2     read each block of B N^3 times (N^3 * n/N * n/N)
   +  N*n^2     read each block of A N^3 times
   +  2*n^2     read and write each block of C once
   =  (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ~= n/N = b for large n

So we can improve performance by increasing the blocksize b. Can be much faster than matrix-vector multiply (q=2)

Limit: All three blocks from A, B, C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ~= b <= sqrt(M/3)

Theorem (Hong, Kung, 1981): Any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(M))

More on BLAS (Basic Linear Algebra Subroutines)

‹ Industry standard interface (evolving) ‹ Vendors, others supply optimized implementations ‹ History ¾ BLAS1 (1970s): » vector operations: dot product, saxpy (y=a*x+y), etc » m=2*n, f=2*n, q ~1 or less ¾ BLAS2 (mid 1980s) » matrix-vector operations: matrix vector multiply, etc » m=n^2, f=2*n^2, q~2, less overhead » somewhat faster than BLAS1 ¾ BLAS3 (late 1980s) » matrix-matrix operations: matrix matrix multiply, etc » m >= 4n^2, f=O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2 ‹ Good algorithms use BLAS3 when possible (LAPACK) ‹ www.netlib.org/blas, www.netlib.org/lapack

BLAS for Performance

[Figure: Mflop/s vs. order of vector/matrices (10-500) on an Intel Pentium 4 w/SSE2, 1.7 GHz. Level 3 BLAS is well above Level 2 BLAS, which is above Level 1 BLAS.]

‹ Development of blocked algorithms important for performance

BLAS for Performance

[Figure: Mflop/s vs. order of vector/matrices (10-500) on an Alpha EV 5/6, 500 MHz (1 Gflop/s peak). Level 3 BLAS is well above Level 2 and Level 1 BLAS.]

‹ Development of blocked algorithms important for performance: BLAS 3 (n-by-n matrix matrix multiply) vs BLAS 2 (n-by-n matrix vector multiply) vs BLAS 1 (saxpy of n vectors)

Optimizing in practice

‹ Tiling for registers ¾loop unrolling, use of named “register” variables ‹ Tiling for multiple levels of cache ‹ Exploiting fine-grained parallelism within the processor ¾super scalar ¾pipelining ‹ Complicated compiler interactions ‹ Hard to do by hand (but you’ll try) ‹ Automatic optimization an active research area ¾PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac ¾ www.cs.berkeley.edu/~iyer/asci_slides.ps ¾ATLAS: www.netlib.org/atlas/index.html

Summary

‹ Performance programming on uniprocessors requires ¾ understanding of memory system » levels, costs, sizes ¾ understanding of fine-grained parallelism in processor to produce good instruction mix ‹ Blocking (tiling) is a basic approach that can be applied to many matrix algorithms ‹ Applies to uniprocessors and parallel processors ¾ The technique works for any architecture, but choosing the blocksize b and other details depends on the architecture ‹ Similar techniques are possible on other data structures ‹ You will get to try this in Assignment 2 (see the class homepage)

Summary: Memory Hierarchy

‹ Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs? ¾1000X DRAM growth removed the controversy ‹ Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy ‹ Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?


Performance = Effective Use of Memory Hierarchy

‹ Can only do arithmetic on data at the top of the hierarchy ‹ Higher level BLAS lets us do this ‹ Development of blocked algorithms important for performance

[Figure: Level 1, 2 & 3 BLAS Mflop/s vs. order of vector/matrices on an Intel PII, 450 MHz.]

BLAS                   Memory Refs   Flops    Flops/Memory Refs
Level 1 (y = y+αx)     3n            2n       2/3
Level 2 (y = y+Ax)     n^2           2n^2     2
Level 3 (C = C+AB)     4n^2          2n^3     n/2

Improving Ratio of Floating Point Operations to Memory Accesses

      subroutine mult(n1,nd1,n2,nd2,y,a,x)
      implicit real*8 (a-h,o-z)
      dimension a(nd1,nd2),y(nd2),x(nd1)
      do 10, i=1,n1
        t=0.d0
        do 20, j=1,n2
 20       t=t+a(j,i)*x(j)
 10     y(i)=t
      return
      end

The inner statement performs 2 FLOPS for 2 LOADS (a(j,i) and x(j)), a ratio of 1.

Improving Ratio of Floating Point Operations to Memory Accesses

c     works correctly when n1,n2 are multiples of 4
      dimension a(nd1,nd2), y(nd2), x(nd1)
      do i=1,n1-3,4
        t1=0.d0
        t2=0.d0
        t3=0.d0
        t4=0.d0
        do j=1,n2-3,4
          t1=t1+a(j+0,i+0)*x(j+0)+a(j+1,i+0)*x(j+1)+
     1          a(j+2,i+0)*x(j+2)+a(j+3,i+0)*x(j+3)
          t2=t2+a(j+0,i+1)*x(j+0)+a(j+1,i+1)*x(j+1)+
     1          a(j+2,i+1)*x(j+2)+a(j+3,i+1)*x(j+3)
          t3=t3+a(j+0,i+2)*x(j+0)+a(j+1,i+2)*x(j+1)+
     1          a(j+2,i+2)*x(j+2)+a(j+3,i+2)*x(j+3)
          t4=t4+a(j+0,i+3)*x(j+0)+a(j+1,i+3)*x(j+1)+
     1          a(j+2,i+3)*x(j+2)+a(j+3,i+3)*x(j+3)
        enddo
        y(i+0)=t1
        y(i+1)=t2
        y(i+2)=t3
        y(i+3)=t4
      enddo

The unrolled inner loop performs 32 FLOPS for 20 LOADS (16 elements of a, 4 of x), improving the ratio from 1 to 1.6.

Amdahl’s Law

Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:

tN = (fp/N + fs)t1 Effect of multiple processors on run time

S = 1/(fs + fp/N) Effect of multiple processors on speedup

Where: fs = serial fraction of code fp = parallel fraction of code = 1 - fs N = number of processors


Amdahl’s Law - Theoretical Maximum Speedup of parallel execution

‹ speedup = 1/(P/N + S) » P (parallel code fraction), S (serial code fraction), N (processors) ¾Example: Image processing » 30 minutes of preparation (serial) » One minute to scan a region » 30 minutes of cleanup (serial)

‹ Speedup is restricted by the serial portion. Speedup still increases with a greater number of processors, but only up to the limit 1/S.

Illustration of Amdahl’s Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.

[Figure: speedup vs. number of processors (0-250) for fp = 1.000, 0.999, 0.990, 0.900, annotated "What's going on here?"]

Amdahl’s Law Vs. Reality Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications ( and I/O) will result in a further degradation of performance.

[Figure: speedup vs. number of processors (0-250) for fp = 0.99: Amdahl's Law prediction vs. reality.]

Gustafson’s Law

‹ Thus, Amdahl’s Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large. ‹ There is a way around this: increase the problem size ¾ bigger problems mean bigger grids or more particles: bigger arrays ¾ number of serial operations generally remains constant; number of parallel operations increases: parallel fraction increases
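Gustafson's observation is usually written as a scaled speedup formula. A small C sketch (my own, using the common form S(N) = N - s*(N - 1), where s is the serial fraction of the scaled run; this formula is not stated on the slide itself):

    #include <stdio.h>

    /* Gustafson scaled speedup: the parallel part grows with the machine,
       so S(N) = s + (1 - s) * N = N - s * (N - 1). */
    static double gustafson_speedup(double s, int nprocs) {
        return nprocs - s * (nprocs - 1);
    }

    int main(void) {
        double s = 0.01;   /* example serial fraction of the scaled run */
        printf("N=256:  scaled speedup = %.1f\n", gustafson_speedup(s, 256));   /* ~253  */
        printf("N=1024: scaled speedup = %.1f\n", gustafson_speedup(s, 1024));  /* ~1014 */
        /* Contrast with Amdahl for the same s: fixed-size speedup is capped at 1/s = 100. */
        return 0;
    }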

Parallel Performance Metrics: Speedup

Relative performance: Absolute performance:

[Figures: (left) relative speedup vs. processors (0-48) for T3E and O2K with ideal lines; (right) absolute MFLOPS vs. processors (0-60) for the same runs.]

Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups but one of the machines is faster.

Fixed-Problem Size Scaling

• a.k.a. Fixed-load, Fixed-Problem Size, Strong Scaling, Problem-Constrained, constant-problem size (CPS), variable subgrid

• Amdahl Limit: S_A(n) = T(1) / T(n) = 1 / (f/n + (1 - f))
• This bounds the speedup based only on the fraction of the code that cannot use parallelism (1 - f); it ignores all other factors

• S_A --> 1 / (1 - f) as n --> ∞

Fixed-Problem Size Scaling (Cont’d)

• Efficiency (n) = T(1) / [ T(n) * n]

• Memory requirements decrease with n

• Surface-to-volume ratio increases with n

• Superlinear speedup possible from cache effects • Motivation: what is the largest # of procs I can use effectively and what is the fastest time that I can solve a given problem?

• Problems: - Sequential runs often not possible (large problems) - Speedup (and efficiency) is misleading if processors are slow

Fixed-Problem Size Scaling: Examples

S. Goedecker and Adolfy Hoisie, Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems, International Conference on Computational Physics: PC'97, Santa Cruz, August 25-28, 1997.


Fixed-Problem Size Scaling Examples


Scaled Speedup Experiments

• a.k.a. Fixed Subgrid-Size, Weak Scaling, Gustafson scaling. • Motivation: Want to use a larger machine to solve a larger global problem in the same amount of time. • Memory and surface-to-volume effects remain constant.

Scaled Speedup Experiments

• Be wary of benchmarks that scale problems to unreasonably-large sizes

- scale the problem to fill the machine when a smaller size will do;

- simplify the science in order to add computation -> “World’s largest MD simulation - 10 gazillion particles!”

- run grid sizes for only a few cycles because the full run won’t finish during this lifetime, or because the resolution makes no sense compared with resolution of input data

• Suggested alternate approach (Gustafson): Constant time benchmarks - run code for a fixed time and measure work done


Example of a Scaled Speedup Experiment: TBON on ASCI Red

Processors   NChains   Time    Natoms   Time per Atom   Time per Atom per PE   Efficiency
1            32        38.4    2368     1.62E-02        1.62E-02               1.000
2            64        38.4    4736     8.11E-03        1.62E-02               1.000
4            128       38.5    9472     4.06E-03        1.63E-02               0.997
8            256       38.6    18944    2.04E-03        1.63E-02               0.995
16           512       38.7    37888    1.02E-03        1.63E-02               0.992
32           940       35.7    69560    5.13E-04        1.64E-02               0.987
64           1700      32.7    125800   2.60E-04        1.66E-02               0.975
128          2800      27.4    207200   1.32E-04        1.69E-02               0.958
256          4100      20.75   303400   6.84E-05        1.75E-02               0.926
512          5300      14.49   392200   3.69E-05        1.89E-02               0.857

[Figure: parallel efficiency vs. number of processors (0-600) for the runs in the table, declining from 1.000 toward 0.857.]
